<%BANNER%>

A Knowledge-Based Heuristic Approach to Explore the Genetic Architecture of Complex Diseases

Permanent Link: http://ufdc.ufl.edu/UFE0044966/00001

Material Information

Title: A Knowledge-Based Heuristic Approach to Explore the Genetic Architecture of Complex Diseases
Physical Description: 1 online resource (210 p.)
Language: english
Creator: Nazarian, Alireza
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2012

Subjects

Subjects / Keywords: architecture -- complex -- diseases -- genetic -- gwas -- heuristic -- knowledge -- preexisting
Genetics and Genomics -- Dissertations, Academic -- UF
Genre: Genetics and Genomics thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, terriorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract: Complex diseases constitute a class of disorders whose phenotypic variance is caused by the interplay of multiple genetic and environmental factors. Diabetes,hypertension, coronary artery disease, and Alzheimer's disease are a few examples of such diseases. Because complex disorders are of great public health importance and impose a lot of burden on patients and society, analyzing the complexity underlying their genetic architecture may help develop more efficient diagnostic tests and therapeutic protocols. In recent years, genome-wide association study (GWAS) has become the method of choice in genetic dissection of complex diseases. However,despite the continuous advances in revealing the genetic basis of many of complex diseases using GWAS, a major proportion of their genetic variance has remained unexplained,in part because GWAS is unable to reliably detect small individual risk contributions and to capture the underlying genetic heterogeneity and the interactions existing among genetic factors. In this study, I describe an innovative hypothesis-based method, called the Knowledge-based Association Studies (KBAS), to analyze the association between multiple genetic factors and a complex phenotype. Starting from sets of markers selected based on preexisting biological knowledge, the KBAS method generates multi-marker models relevant to the biological process underlying a complex trait for which genotype data is available. I also present the results obtained by testing the applicability of the KBAS method to large-scale genotyping datasets using the Wellcome Trust Case-Control Consortium (WTCCC) dataset.Analyzing a number of biological pathways, the method was able to identify several immune system related multi-SNP models significantly associated with Rheumatoid Arthritis (RA) and Crohn's disease (CD). RA-associated multi-SNP models were also replicated in an independent case-control dataset. Finally, I introduce a freely available software tool that implements the KBAS method. The method I present provides a framework for performing genome-wide association analysis in situations in which combinations of genetic factors jointly contribute to the genetic variance of a trait of interest, as is often the case in complex diseases. In contrast to hypothesis-free approaches, the results obtained by employing the KBAS method can be given a direct biological interpretation.
General Note: In the series University of Florida Digital Collections.
General Note: Includes vita.
Bibliography: Includes bibliographical references.
Source of Description: Description based on online resource; title from PDF title page.
Source of Description: This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility: by Alireza Nazarian.
Thesis: Thesis (Ph.D.)--University of Florida, 2012.
Local: Adviser: Riva, Alberto.

Record Information

Source Institution: UFRGP
Rights Management: Applicable rights reserved.
Classification: lcc - LD1780 2012
System ID: UFE0044966:00001

Permanent Link: http://ufdc.ufl.edu/UFE0044966/00001

Material Information

Title: A Knowledge-Based Heuristic Approach to Explore the Genetic Architecture of Complex Diseases
Physical Description: 1 online resource (210 p.)
Language: english
Creator: Nazarian, Alireza
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2012

Subjects

Subjects / Keywords: architecture -- complex -- diseases -- genetic -- gwas -- heuristic -- knowledge -- preexisting
Genetics and Genomics -- Dissertations, Academic -- UF
Genre: Genetics and Genomics thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, terriorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract: Complex diseases constitute a class of disorders whose phenotypic variance is caused by the interplay of multiple genetic and environmental factors. Diabetes,hypertension, coronary artery disease, and Alzheimer's disease are a few examples of such diseases. Because complex disorders are of great public health importance and impose a lot of burden on patients and society, analyzing the complexity underlying their genetic architecture may help develop more efficient diagnostic tests and therapeutic protocols. In recent years, genome-wide association study (GWAS) has become the method of choice in genetic dissection of complex diseases. However,despite the continuous advances in revealing the genetic basis of many of complex diseases using GWAS, a major proportion of their genetic variance has remained unexplained,in part because GWAS is unable to reliably detect small individual risk contributions and to capture the underlying genetic heterogeneity and the interactions existing among genetic factors. In this study, I describe an innovative hypothesis-based method, called the Knowledge-based Association Studies (KBAS), to analyze the association between multiple genetic factors and a complex phenotype. Starting from sets of markers selected based on preexisting biological knowledge, the KBAS method generates multi-marker models relevant to the biological process underlying a complex trait for which genotype data is available. I also present the results obtained by testing the applicability of the KBAS method to large-scale genotyping datasets using the Wellcome Trust Case-Control Consortium (WTCCC) dataset.Analyzing a number of biological pathways, the method was able to identify several immune system related multi-SNP models significantly associated with Rheumatoid Arthritis (RA) and Crohn's disease (CD). RA-associated multi-SNP models were also replicated in an independent case-control dataset. Finally, I introduce a freely available software tool that implements the KBAS method. The method I present provides a framework for performing genome-wide association analysis in situations in which combinations of genetic factors jointly contribute to the genetic variance of a trait of interest, as is often the case in complex diseases. In contrast to hypothesis-free approaches, the results obtained by employing the KBAS method can be given a direct biological interpretation.
General Note: In the series University of Florida Digital Collections.
General Note: Includes vita.
Bibliography: Includes bibliographical references.
Source of Description: Description based on online resource; title from PDF title page.
Source of Description: This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility: by Alireza Nazarian.
Thesis: Thesis (Ph.D.)--University of Florida, 2012.
Local: Adviser: Riva, Alberto.

Record Information

Source Institution: UFRGP
Rights Management: Applicable rights reserved.
Classification: lcc - LD1780 2012
System ID: UFE0044966:00001


This item has the following downloads:


Full Text

PAGE 1

1 A KNOWLEDGE BASED HEURISTIC APPROACH TO EXPLORE THE GENETIC ARCHITECTURE OF COMPLEX DISEASES By ALIREZA NAZARIAN A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY UNIVERSITY OF FLORIDA 2012

PAGE 2

2 2012 Alireza Nazarian

PAGE 3

3 To my family for all of their love, encouragement and support

PAGE 4

4 ACKNOWLEDGMENTS First and foremost I would like to express my deepest gratitude to my family, for their love, encouragement and valuable support over these years and throughout my life Without their support, I would not have been the person I am today I would like to express my heartfel t thanks to Dr. Alberto Riva, cha ir of my supervisory committee. This dissertation would not have been accomplished without his guidance, support and dedication I would also like to thank all the professors who served in my supervisory committee ; Dr. Mar gret Wallace, Dr. Nasser Chegini, Dr. Luciano Brocchieri, and Dr. Salvador Gezan ; for their contributions comments and valuable support. I would like to acknowledge Dr. Heike Sichtig for her help in writing part of the code for Genetic Algorithm platform developed in this study as well as all my other colleagues in the laboratory: Shaojun Tang, Eric Hernandez and Brandon Walts for their support. Finally, I would also like to thank Dr. Peter Gregersen for providing me with the NARAC dataset. Also, t his study makes use of data generated by the Wellcome Trust Case Control Consortium. A full list of the investigators who contributed to the generation of the data is available from www.wtccc.org.uk

PAGE 5

5 TABLE OF CONTENTS page ACKNOWLEDGMENTS ................................ ................................ ................................ .. 4 LIST OF TABLES ................................ ................................ ................................ ............ 8 LIST OF FIGURES ................................ ................................ ................................ ........ 10 LIST OF ABBREVIATIONS ................................ ................................ ........................... 11 ABSTRACT ................................ ................................ ................................ ................... 17 CHAPTER 1 PURPOSE AND SIGNIFICANCE OF THE STUDY ................................ ................ 19 2 LITERATURE REVIEW ................................ ................................ .......................... 23 A Historical Review ................................ ................................ ................................ 26 Major Hypotheses Regarding the Genetic Basis of Complex Traits ....................... 31 1 Common Disease Common Variant (CD CV) Notion ................................ ... 31 2 Infinitesimal Notion ................................ ................................ ....................... 33 3 Complex Disease Rare Variants (CD RV) Notion ................................ ........ 38 4 The Broad sense Heritability Notion ................................ ............................. 40 Common vs. Rare Variants: Consequences ................................ ........................... 49 Methods Used for the Study of Complex Diseases ................................ ................. 51 Modeling Non linear Effects of Genetic and Environmental Factors ....................... 66 1 Traditional Statistical Methods ................................ ................................ ..... 66 1 1 Multiple linear regression and multiple logistic regression analysis .... 66 1 2 Case only analysis ................................ ................................ ............. 68 1 3 Haplotype analysis ................................ ................................ ............. 69 2 Multi factor Dimensionality Reduction (MDR) Method ................................ .. 70 3 Combinatorial Partitioning Method (CPM) ................................ .................... 72 4 Recursive Partitioning Methods ................................ ................................ .... 74 5 Pattern recognition Based Methods ................................ ............................. 77 3 KBAS METHOD DEVELOPMENT ................................ ................................ .......... 85 Method Overview ................................ ................................ ................................ .... 85 Step 1: Hypothesis Generation ................................ ................................ ............... 87 Step 2: Hypothesis Testing and Refinement by a Genetic Algorithm Engine .......... 88 1 Problem under Consideration ................................ ................................ ....... 90 2 Search Space ................................ ................................ ............................... 90 3 Population Size ................................ ................................ ............................ 91 4 Methods of Representation ................................ ................................ .......... 91 5 Methods of Change ................................ ................................ ...................... 92

PAGE 6

6 6 Fitness Function ................................ ................................ ........................... 93 6 1 F 1 fitness function ................................ ................................ ............... 94 6 2 F 2 fitness function ................................ ................................ ............... 96 6 3 F 3 fitness function ................................ ................................ ............... 96 6 4 F 4 fitness function ................................ ................................ ............... 98 6 5 F 5 Fitness Function ................................ ................................ ............. 98 6 6 F 6 Fitness Function ................................ ................................ ............. 99 6 7 F 7 Fitness Function ................................ ................................ ........... 100 6 8 F 8 Fitness Function ................................ ................................ ........... 100 6 9 F 9 fitness function ................................ ................................ ............. 101 7 Method of Selection ................................ ................................ ................... 103 Step 3: Permutation Test ................................ ................................ ...................... 104 Step 4: Association Criteria Verification ................................ ................................ 105 Step 5: Replication of the Results ................................ ................................ ......... 105 Step 6: Further Statistical Analysis ................................ ................................ ....... 105 The KBAS Software ................................ ................................ .............................. 106 4 KBAS METHOD EVALUATION AND RESULTS ................................ .................. 109 Method Evaluation ................................ ................................ ................................ 109 Discovery Dataset ................................ ................................ .......................... 109 Hypothesis Generation ................................ ................................ ................... 111 Hypotheses Testing and Refinement ................................ ............................. 112 Permutation Test ................................ ................................ ............................ 113 Comparison of RA and CD Gro ups against the Pooled Population of the Two Control Groups ................................ ................................ .................... 114 Association Criteria ................................ ................................ ........................ 114 Replication Dataset ................................ ................................ ........................ 115 Further Statistical Analysis: Logistic Regression Analysis and Disease Risk Score Class Diagram ................................ ................................ .................. 116 Results ................................ ................................ ................................ .................. 118 Analysis of the Rheumatoid Arthritis (RA) Dataset ................................ ......... 118 Test set pathways ................................ ................................ .................... 118 Negative control pa thways ................................ ................................ ....... 120 RA vs. CTR comparison ................................ ................................ .......... 121 Multiple logistic regression analysis ................................ ......................... 123 Replication of the Results ................................ ................................ ........ 124 Simple Logistic Regression Analysis and Disease Risk Score Class Diagram ................................ ................................ ................................ 125 Analy sis of the Crohn's Disease (CD) Dataset ................................ ............... 126 Test set and Negative Control Pathways ................................ ................. 127 CD vs. CTR Comparison ................................ ................................ .......... 1 28 Multiple Logistic Regression Analysis ................................ ...................... 130 Simple Logistic Regression Analysis and Disease Risk Score Class Diagram ................................ ................................ ................................ 130 Comparative Analysis of the RA and CD Results ................................ ........... 131

PAGE 7

7 5 KBAS SOFTWARE ................................ ................................ ............................... 163 Installing KBAS ................................ ................................ ................................ ..... 163 How Does KBAS Works? ................................ ................................ ...................... 164 SNP Selection Module ................................ ................................ ................... 164 Convert Modu le ................................ ................................ .............................. 165 GA engine Module ................................ ................................ ......................... 167 Help ................................ ................................ ................................ ...................... 171 6 DISCUSSION ................................ ................................ ................................ ....... 184 BIOGRAPHICAL SKETCH ................................ ................................ .......................... 210

PAGE 8

8 LIST OF TABLES Table page 4 1 The list of pathways included in this s tudy and their characteristics. ................ 132 4 2 The number of genotyped and validated SNPs in each functional role class for the test and control pathways analyzed in this study. ................................ .. 134 4 3 The p values associated with the pairwise comparisons of the RA group and the two control groups using the successful models derived from immune system related pathways. ................................ ................................ ................. 136 4 4 The p values associated with the pairwise comparisons of the RA group and the two control groups using successful models derived from negative control pathways. ................................ ................................ ................................ ......... 137 4 5 The p values associated with the comparison of the RA and CTR groups and NARAC A and NARAC C using the successful models derived from pathways under consideration. ................................ ................................ ......... 138 4 6 The li st of SNPs included in the successful models showing strong or moderate association with rheumatoid arthritis. ................................ ................ 140 4 7 Multivariate regression of disease state on the score variables derived f rom the successful models showing strong or moderate association with rheumatoid arthritis (comparing RA vs. CTR). ................................ .................. 144 4 8 The list of SNPs from NARAC study replacing the SNPs from WTCCC study for replication of the results. ................................ ................................ .............. 146 4 9 Simple regression of disease state on the overall score variable derived from the entire set of 44 SNPs present in the replicated RA associated models (comparing RA vs. CTR). ................................ ................................ ................. 147 4 10 Simple regression of disease state on the overall score variable derived from the entire set of 44 SNPs present in the replicated RA associated models (compari ng NARAC A vs NARAC C). ................................ ............................. 148 4 11 The p values associated with the pairwise comparisons of the CD group and the two control groups using the successful models derived from immune system relate d pathways. ................................ ................................ ................. 149 4 12 The p values associated with the pairwise comparisons of the CD group and the two control groups using the successful models derived from negative control pathways. ................................ ................................ .............................. 150

PAGE 9

9 4 13 The p values associated with the comparisons of the CD and CTR groups using the successful models derived from pathways under consideration. ...... 151 4 14 The list of SNPs included in the successful models showing association with Crohn's disease. ................................ ................................ ............................... 152 4 15 Multivariate regression of disease state on the score variables d erived from the successful models showing association with Crohn's disease (comparing CD vs. CTR). ................................ ................................ ................................ .... 156 4 16 Simple regression of disease state on the overall score variable derived from the en tire set of 57 SNPs present in the CD associated models (comparing CD vs. CTR). ................................ ................................ ................................ .... 158 4 17 Comparative summary of pathway associations with rheumatoid arthritis and Crohn's disease. ................................ ................................ ............................... 159 5 1 Description of the various methods can be used to generate the desired list of SNPs by the getsnps command. ................................ ................................ 173 5 2 Description of the arguments of the getsnps command. ................................ 174 5 3 Description of the arguments of the convert command. ................................ .. 175 5 4 Description of the required arguments of the run command. ........................... 177 5 5 Description of the optional arguments of the run command. ........................... 178 5 6 Description o f the various fitness functions (GA classes) can be used by the run command to evaluate the fitness of multi marker models. ........................ 181

PAGE 10

10 LIST OF FIGURES Figure page 3 1 A flow chart illustrating the steps applied by the KBAS method. ....................... 107 3 2 A flow chart illustrating the steps applied by CHC class of genetic algorithm (GA). ................................ ................................ ................................ ................. 108 4 1 Disease risk Score class diagram for RA vs. CTR and NARAC A vs. NARAC C comparisons ................................ ................................ .................. 161 4 2 Disease risk Score class diagram for CD vs. CTR comparison.. ...................... 162

PAGE 11

11 LIST OF ABBREVIATIONS Abbreviation Term 18q21 Region 21 on Long Arm of Chromosome 18 18q23 Region 23 on Long Arm of Ch romosome 18 3' UTR 3' Untranslated Region 58C 1958 Birth Cohort Controls (One o f the Two Control Groups in WTCCC Study) ABO ABO Blood Group System ACE Angiotensin I Converting Enzyme AGTR1 Angiotensin II Receptor, Type 1 AKT3 v akt Murine Thym oma Viral Oncogene Homolog 3 AMD A ge related Macular Degeneration ANK2 Ankyrin 2 AP OB Apolipoprotein B APOE Apolipoprotein E APOE All ATM Ataxia Telangiectasia Mutated BRCA1 Breast Ca ncer Type 1 Susceptibility Gene BRCA2 Breast Ca ncer Type 2 Susceptibility Gene CD Crohn's Disease CD CV Common Di sease Common Variant CD RV Common Disease Rare Variant CFB Complement Factor B CHC Cross Generational Elitist Selection, Heterogeneous Recombi nation and Cataclysmic Mutation CI Confidence Interval

PAGE 12

12 COMT Catechol O Methyl Transferase COV G,E Covarianc e of Genetic and Environmental Factors CPM Co mbinatorial Partitioning Method CTR Pooled Control Population Containing All Subjects in 58C and NBS Groups CYP1A1m1 Cytochrome P450, Famil y 1, Subfamily A, Polypeptide 1 CYP1B1 Cytochrome P450, Famil y 1, S ubfamily B, Polypeptide 1 CYP2C19 Cytochrome P450, Family 2, S ubfamily C, Po lypeptide 19 df Deg ree of Freedom D H Hamming Distance EDNRA Endothelin Receptor Type A EM Expectation Maximization ERCC2 Excision Repair Cr oss complementing Rodent Repair Defi ciency, Complementation Group 2 ERK2 Mi togen activated Protein Kinase 1 F F statistic FTO F at Mass and Obesity Associated GA Genetic Algorithm GABA Gamma AminoButyric Acid GABRA4 Gamma AminoB utyric Acid A Receptor, Alpha 4 GABRB1 Gamma AminoButyric Acid A Receptor, Beta 1 GBR2 Gamma Aminobutyric Acid B Receptor, 2 GO Gene Ontology GP Genetic Programming GPAS Genetic Programming for Association Studies GRCh37 Genome Reference Consortium Human G enome B uild 37

PAGE 13

13 GRIN1 Glutamate Receptor, Ion otropic N methyl D aspartate 1 GRIN2B Glutamate Receptor, Iono tropic, N methyl D aspartate 2B GRR Genotype Relative Risk GSTP1 Glutathione S Transferase Pi 1 GWAS Genome Wide Association Studies H 0 Null Hypothesis H 1 Alternative Hypothesis H 2 Broad Sense Heritability h 2 Narrow Sense Heritability HapMap The International Haplotype Map Project HGNC HUGO Gene Nomenclature Committee HincII HincII Restriction Enzyme HI V Human Immunodeficiency Virus HLA Human Leukocyte Antigen HLA B Huma n Leukocyte Ant igen, Class I, B HLA B27 Huma n Leukocyte Antigen, Allele B27 HLA DQA1 Human Leukocyt e Antigen, Class II, DQ Alpha 1 HLA DRB1 Human Leukocy te Antigen, Class II, DR Beta 1 HLA DRB1*14 Human Leukocyte Antigen, Class II, DR Beta 1, Allele DRB1*14 HLA DRB1 *15 Human Leukocyte Antigen, Class II, DR Beta 1, Allele DRB1*15 hMSH2 MutS Homolog 2 HNF4A Hepatocyte Nuclear Factor 4, Alpha HSD17B1 HydroxySteroid (17 beta) Dehydrogenase 1 HUX Half Uniform Crossover

PAGE 14

14 IFRD1 Interferon Re lated Developmental Regulat or 1 IL 10 Interleukin 10 IL 6 Interleukin 6 IL 8 Interleukin 8 InDel Insertion Deletion IQ Intelligence Quotient KBAS Know ledge Based Association Studies KCNJ11 Potassium Inwardly Rectifying Channel Subfamily J, Member 11 KEGG Kyoto En cyclopedia of Genes and Genomes KIR3DL1 Killer Cell Immunoglobulin Like Receptor, Three Doma ins, Long Cytoplasmic Tail, 1 LD Linkage Disequilibrium LDLR Low Density Lipoprotein Receptor LGR5 Leucine Rich Repeat Containing G Protein Coupled Receptor 5 LPL Lip oprotein Lipase MAF Minor Allele Frequency MDR Multi factor Dimensionality Reduction Met Methionine MHC M ajor Histocompatibility Complex m icroRNA Micro Ribonucleic Acid MODY Maturity Onset Di abetes of the Young m RNA Messenger Ribonucleic Acid MS Mu ltiple Sclerosis MSE Mean Squared Error

PAGE 15

15 NARAC North American Rheumatoid Arthritis Consortium NARAC A Affected Group in NARAC Dataset NARAC C Control Group in NARAC Dataset NBS UK Blood Services Controls (One of the Two Control Groups in WTCCC Study) NCBI National Center for Biotechnology Information OMIM Online Mendel ian Inheritance in Man Database PAH Phenylalanine Hydroxylase PAI 1 Plasminogen Activator Inhibitor 1 PGC Psychiatric Genome wide Association Study Consortium PON1 Paraoxonase 1 PRK CQ Protein Kinase C, Theta PTHR1 Parathyroid Hormone 1 Receptor QTL Quantitative Trait Loci RA Rheumatoid Arthritis RAF Risk Allele Frequency RF Random Forest RNA Ribonucleic Acid RPM Restricted Partitioning Method SNP Single Nucleotide Polymorphi sm ST model Select and Test Model t t statistic T1D Type 1 Diabetes T 2 Hotelling's T Square Statistic T2D Type 2 Diabetes TCF7L2 Transcription Factor 7 Like 2

PAGE 16

16 TGFb1 Tra nsforming Growth Factor, Beta 1 TSPAN8 Tet raspanin 8 VA Additive Genetic Varia nce Val Valine V D Dominance Genetic Variance VDR Vitamin D Receptor V E Environmental Variance V G Genetic Variance V G*E Genetic by Environment Variance V I Interaction Genetic Variance V P Phenotypic Variance VUS Variant of Unknown Significance Wnt Wnt Signaling Pathway WTCCC Wellcome Trust Case Control Consortium XRCC1 X ray Repair Complementing Defective Repair in Chinese Hamster Cells 1 XRCC3 X ray Repair Complementing Defective Repair in Chinese Hamster Cells 3 Trend Test's Over Dispersion Parameter

PAGE 17

17 Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy A KNOWLEDGE BASED HEURISTI C APPROACH TO EXPLORE THE GENETIC ARCHITECTURE OF COMPLEX DISEASES By Alireza Nazarian December 2012 Chair: Alberto Riva Major: Genetics and Genomics Complex diseases constitute a class of disorders whose phenotypic variance is caused by the interplay of multiple genetic and environmental factors. Diabetes, hypertension, coronary artery disease, and Alzheimer's disease are a few examples of such diseases. Because complex disorders are of great public health importance and impose a lot of burden on patie nts and society, analyzing the complexity underlying their genetic architecture may help develop more efficient diagnostic tests and therapeutic protocols. In recent years, genome wide association study (GWAS) has become the method of choice in genetic di ssection of complex diseases. However, d espite the continuous advances in revealing the genetic basis of many of complex diseases using GWAS, a major proportion of their genetic variance has remained un explained, in part because GWAS is unable to reliably detect small individual risk contributions and to capture the underlying genetic heterogeneity and the interactions existing among genetic factors.

PAGE 18

18 In this study, I describe an innovative hypothesis based method, called the Knowledge based Association Stu dies (KBAS), to analyze the association between multiple genetic factors and a complex phenotype. Starting from sets of markers selected based on preexisting biological knowledge, the KBAS method generates multi marker models relevant to the biological pro cess underlying a complex trait for which genotype data is available. I also present the result s obtained by testing th e applicability of the KBAS method to large scale genotyping datasets using the Wellcome Trust Case Control Consortium ( WTCCC ) dataset. A nalyzing a number of biological pathways, the method was able to identify several immune system related multi SNP models significantly associated with Rheumatoid Arthritis (RA) and Crohn's disease (CD). RA associated multi SNP models were also replicated i n an independent case control dataset. Finally, I introduce a freely available software tool that implements the KBAS method. The method I present provides a framework for performing genome wide association analysis in situations in which combinations of g enetic factors jointly contribute to the genetic variance of a trait of interest, as is often the case in complex diseases. In contrast to hypothesis free approaches, the results obtained by employing the KBAS method can be given a direct biological interp retation.

PAGE 19

19 CHAPTER 1 PU RPOSE AND SIGNIFICANCE OF THE STUDY The study of genotype phenotype relationship in complex disorders represents a great challenge in the field of translational genetics, due to the importance from a public health perspective and t o the difficulties involved in the analysis of their genetic architecture [1,2] In contrast to monogenic traits, the phenotypic variance of complex traits is caused by the interplay of multiple genetic and environmental factors [3 6], which limits the app licability of the traditional approaches used for mapping of Mendelian traits [1 3,6 8] Genome Wide Association Study (GWAS), a powerful method for the large scale analysis of genotype phenotype relationships is currently the method of choice for dissecti ng the genetic basis of complex diseases [7,9,10] A large number of disorders such as cardiovascular disorders, Crohn s disease, rheumatologic disorders, diabetes, bipolar disorder, schizophrenia and human malignancies, have thus far been studied with thi s approach leading to the detection of many previously undetected contributing loci (for example [1,11 14]). However, despite the very large number of markers that can be genotyped using currently available array based SNP genotyping platforms (up to one m illion SNP genotypes per run) [15] in most cases GWAS analyses have so far shown limited success in explaining a considerable proportion of genetic variance of complex diseases [7,10,16] This is in part due to the nature of complex diseases, and in part due to limitations which are inherent to the current analytical framework employed to analyze and interpret the obtained data [17 19]. To start, most of the susceptibility loci so far discovered by GWAS carry small predisposing risk [4,7] and it has been h ypothesized that the genetic variance of

PAGE 20

20 complex diseases may be largely due to the joint contribution of multiple susceptibility loci having small individual effects [16,20,21] Detecting such infinitesimal contributions is difficult, especially when the predisposing allele is rare or the sample size is not sufficiently large [2,3] Also, the very large number of markers under investigation raises the issue of multiple testing, making it harder to reliably detect a small association signal [15] This is al so true in the case of rare variants with major phenotypic effects, as put forward by the complex disease rare variants notion [16] Moreover, the genetic architecture of a complex disease may include epistatic effects among interacting loci [22,23] and e ffects related to gene environment interactions [24] Statistical methods that analyze SNPs individually are unable to address such complex effects. Although current regression based methods have enough power to analyze some low order interactions (e.g. quadratic or cubic terms), provided that the corresponding simple terms have significant main effects, they are unable to explore all potential combinations of markers because as the number of factors under analysis increases linearly, the number of their combinations grows exponentially, resulting in a computationally intractable situation [17,18] The issues of sparsity of data, overfitting and multiple testing may also impose additional hurdles in such cases [17] A further weakness of current analytical methods is that they analyze genome scale datasets without necessarily taking the existing biological knowledge about the trait of interest into account. It has been suggested that using prior knowledge can help reduce the dimensionality of large scale da tasets, and may guide the process of extracting biologically meaningful results in a more effective manner [18,25,26] Pathway based association studies represent an example of the integration of biological

PAGE 21

21 knowledge with statistical data analysis, and hav e recently drawn attention as an alternative to hypothesis free methods (for example [27 29]). However, despite the fact that the limitations of classical GWAS in the context of complex diseases are increasingly being recognized and that pathway based asso ciation studies are receiving growing attention, there is still no optimal approach able to overcome the above described challenges and provide a comprehensive interpretation of genome scale data. The work presented here focuses on developing an innovative hypothesis based analysis method that combines a non deterministic computational approach to the analysis of large scale genotype datasets with preexisting biological and biomedical knowledge in order to increase our understanding of the genetic basis of complex disorders. My specific aims are as follows: 1. Develop a new hypothesis based method for the analysis of genome wide association studies (GWAS) which has the ability to capture joint effects and interactions of genetic factors. 2. Apply the developed m ethod on independent datasets in order to demonstrate its general applicability. 3. Implement a user friendly software package which provides the research community with opportunity to use the developed method in their projects. In the following sections I provide a detailed description of this method, called KBAS (Knowledge Based Association Study), and present its application to large scale genotyping datasets for Rheumatoid Arthritis (RA) and Crohn's disease (CD). I will use these examples to show how thi s method can be used to compare alternative hypotheses about the disease of interest which in turn can lead to new insights into the genetic structure of the complex disease under consideration. I also present the results of a replication study, in which t he significant findings obtained on RA were validated

PAGE 22

22 through an independent dataset. Finally, I introduce a freely available software tool that implements the KBAS method. Since complex disorders are widespread and important from the public health perspec tive, dissecting their genetic architecture and reconstructing the integrated networks of factors underlying their pathogenesis may help develop more specific and sensitive screening and diagnostic tests and more efficient therapeutic targets [1,18]. This in turn may lead to reducing the morbidity and mortality of these diseases [1], consequently decreasing the burden imposed on patients and society. I believe that the KBAS method will represent an advance in this direction, and will help increase our under standing of the genetic basis of complex disorders.

PAGE 23

23 CHAPTER 2 LITERATURE REVIEW One hundred forty seven years after the particulate theory of genetic material was introduced by Gregor Mendel [3] and despite of the continuous advances in the theoretical concepts and experimental technology and analytical methods which have dramatically changed the paradigm of research in the field of genetics and genomics, the study of genotype phenotype relationship in complex traits still represents a great challenge du e to the difficulties involved in the analysis of their genetic architecture [2,6,17,18]. Complex diseases constitute a broad class of important human disorders [24] whose phenotypic variance (V p ) is caused by the interplay of multiple genetic and environm ental factors. This can be described by the following : (2 1 ) Where V G refers to the proportion of the phenotypic variance caused by the genetic factors, also called genetic variance, and V E refers to the proportion of the phenotypic variance caused by the environmental factors, also called environmental variance. V G E refers to the interactions between genetic and environmental factors in cases where the environment modulates the phenotypic effects of genetic factors, and COV G,E refers to the covarianc e of genetic and environmental factors when the genetic effect is confounded by environmental factors [30,31]. For example, individuals with higher Intelligence quotient (IQ) match themselves with stimulating environments which in turn

PAGE 24

24 can affect IQ [32]. Since estimating the COV G,E and V G E is difficult, a more simplified formula containing only the first two components is usually considered. However, ignoring these two terms may result in the inflation of estimates of genetic variance and environmental v ariance respectively [30,31]. The genetic variance can itself be broken down into the variances contributed by the additive genetic variance (V A ), and the non additive genetic variance including the dominance genetic variance (V D ) and the epistatic genetic variance (V I ) [3,6,30,31]. (2 2 ) The additive genetic variance (V A ) is defined as the average phenotypic effect contributed by the additive allelic effect at a locus. It measures the average phenotypic effects resulti ng from substitution of one allele by the other if a series of transitions from wild type homozygote (AA) to heterozygote (Aa) and then mutant homozygote (aa) genotypes is considered The additive effect of multiple loci is the sum of their individual addi tive effects The dominance genetic variance (V D ) arises from the interaction existing between the alleles within a locus. It measures whether the average phenotypic value of heterozygote genotype is closer to that of either homozygote genotypes [30]. The epistatic genetic variance (V I ) arises from the interactions existing among different genetic factors, either genotypic or epigenetic factors [33], which can modify the individual phenotypic effects of corresponding factors [34,35] For a particular bi all elic locus (e.g. alleles A and a with respective frequencies of p and q), the additive genetic variance ( V A ) and the dominance genetic variance ( V D )

PAGE 25

25 are a function of allele frequencies, the difference between the mean of phenotypic values of subjects wit h the two homozygote genotypes (t), and the degree of dominance effect existing between the two alleles (d), as described by the two following formulas [3]: ( 2 3) ( 2 4) Complex disea ses constitute an important and extensive class of human diseases and are major cause of morbidity and mortality in the world [24]. Complex diseases represent a stochastic mode of inheritance [4] in that the transmission of proven risk alleles from parent to offspring does not necessarily result in the disease development. For example, although more than 90% of European patients affected with ankylosing spondylitis harbor the allele HLA B27 which accounts for around 40% of the disease risk, only 5% of thos e with HLA B27 are affected by the disease [36]. Note that although the terms complex diseases and polygenic diseases are usually used in the literature to imply the same meaning, the concept of polygenic refers to the infinitesimal effects of a large num ber of variants collectively giving rise to a phenotype [4], as is put forth by the infinitesimal model as one of the potential underlying mechanisms of complex diseases [16]. Because such infinitesimal effects are not the only contributors to the genetic architecture of complex diseases, the term complex disease is in fact an umbrella term encompassing a wide spectrum of potential underlying mechanisms including, but not limited to, the polygenic mechanism [4].

PAGE 26

26 Human malignancies, type 1 and 2 diabetes, rh eumatoid arthritis, Crohn's disease, hypertension, coronary artery disease, multiple sclerosis, Alzheimer's disease, autism, bipolar disorder and schizophrenia are just a few examples of the long list of complex diseases [1,6,37], as opposed to monogenic M endelian diseases such as cystic fibrosis, thalassemia, phenylketonuria, Marfan syndrome, fragile X syndrome, and Duchenne muscular dystrophy [ 6, 37]. There are also forms of complex diseases showing near Mendelian inheritance such as breast cancer due to m utations in BRCA1 and BRCA2 genes [38 42], early onset Alzheimer's disease [43 45], and Maturity onset diabetes of the young (MODY) [46,47]. The Online Mendelian Inheritance in Man (OMIM) database [48] ( http ://www.ncbi.nlm.nih.gov/omim/ ) provides a comprehensive repository of updated information about the various aspects of phenotype genotype relationships in Mendelian disorders and complex diseases. In the following sections, different aspects of complex di seases are discussed in further details. A Historical Review The laws of inheritance of genetic material inferred by Gregor Mendel provided a theoretical framework with practical application in the genetic analysis of many traits, known as monogenic or Men delian traits, at the turn of the 20th century [3,49]. Pioneered by William Bateson, Reginald C Punnett Hugo M. de Vries, and Archibald E. Garrod, geneticists adhering to the Mendelian school of thought focused their endeavors to explore the genetic basi s of traits with discrete phenotypic classes. Garrod was the first who used Mendel's law of inheritance to investigate the genetic basis of human diseases [3 ,49 ]. Through the analysis of pedigrees of multiple families with subjects affected by alkaptonuria he postulated that disrupted metabolism of alkapton

PAGE 27

27 due to an enzyme deficiency with recessive mode of inheritance underlies the disease pathogenesis [50]. Later, he extended his postulate to other inborn errors of metabolism such as cystinuria, pentos uria, and albinism which along with the alkaptonuria are called Garrod's tetrad [51,52]. The efforts in mapping of Mendelian diseases have been very productive since then, and currently hundreds of such human diseases have been mapped by linkage analysis [ 6,49] and the corresponding genes have been cloned [6]. The genetic variance of Mendelian diseases is caused by rare mutant variants with very large phenotypic effect sizes (e.g. odds ratios greater than 100 [49]) and of high penetrance [53]. In other word s considering formula s 2 3 and 2 4, Mendelian disorders are caused by variants with v ery small allele frequencies ( q ) whose mean phenotyp ic values in homozygo us state are considerably different from values result ing from normal homozygo us state ( very large t ) [3]. It is however noteworthy that although Mendelian diseases are classified as single gene disorders, the impact of disease causing mutations are not the only determinants of the impaired phenotype [22]. Instead clinical manifestations can be modulat ed through the effects of modifier genes so that patients with mutation in the same gene may have different symptoms [35,54 59]. For example, the severity of cognitive and metabolic manifestations of phenylketonuria is not solely determined by the presence of mutations in the PAH locus, as it can be influenced by variants affecting protein degradation or protein folding pathways such as mutants affecting chaperones or by the concurrent polymorphisms within or upstream of PAH [58] As another example, cystic fibrosis patients (a Mendelian disease which is caused by mutation in CFTR gene) with concurrent loss of

PAGE 28

2 8 function mutations in the MBL gene or over expression of TGFb1 gene present with more severe respiratory symptoms. Also IFRD1 IL8 and EDNRA genes hav e been implicated in modulating respiratory symptoms of cystic fibrosis [54] At the same time that particulate theory of genetic material spurred Mendelists to identify the genetic basis of hereditary traits, another group of scientists whose views adhere d to the biometrical school of thought were not convinced by the theory [3,60]. Pioneered by Francis Galton and Karl Pearson, biometricists were interested in exploring the genetic basis of continuous traits [3,49]. Analyzing generational data obtained for a number of continuous traits (e.g. height, weight, intellectual caliber), Galton noticed that although such traits seemed to run in families, it is difficult to account for their genetic basis based on the Mendelian view of genetics [3,60]. Instead he b elieved the phenotypic variance of such traits seemed to be better explained through the blending theory of heredity [3,61], which assumes the phenotypic value of offspring as the average of those of the parents. Through conceiving and implementing statist ical concepts such as correlation of variables, regression analysis and principal component analysis, pioneering biometricists seminally contributed to building an analytical infrastructure that later was widely implemented for the study of genetic varianc e of quantitative traits [49]. The heated debate between Mendelists and biometricists was resolved in the late 1910s when Ronald A. Fisher formulated his model to explain the genetic architecture of quantitative traits [3,6,20,49,60,62]. He was interested in determining how the genetic basis of traits with a continuous phenotypic distribution can be

PAGE 29

29 explained by the independent segregation of alleles as considered by Mendelian principles of genetics [63]. Finally, he suggested that the different views of th e two schools of thought arise from the different nature of traits studied by them; in contrast to monogenic traits, the phenotypic variance of quantitative traits results from contribution of multiple genetic factors, the inheritance of each of which foll ows Mendel's laws [62]. Indeed he suggested through assuming the simultaneous contribution of multiple loci to the development of a trait, Mendel's laws of inheritance still would be valid for explaining the genetic basis of quantitative characters [20]. T hrough his works, he also noted that the phenotypic variance of a quantitative trait is not necessarily determined by additive effects of underlying genetic factors. To explain such non additive effects, he coined the term epistacy [60] in analogy with the term epistasis which had already been introduced by Bateson to point out a particular class of genetic interactions observed in double mutants [64]. Fisher also introduced the concept of variance and the analysis of variance for the first time [62,63,65] and established the groundwork for future investigations in the field of quantitative genetics [49]. Later, in the light of theoretical and experimental advances, the concept of quantitative trait loci (QTL) was introduced by John M. Thoday [66], and linka ge mapping [67 71], QTL mapping [72], association studies [73 77] and physical mapping [78] were conceived and employed for the study of genetic basis of quantitative traits [49]. The theoretical and methodological advances along with implementing the conc ept of linkage disequilibrium (LD) for spotting causal variants contributing to the disease pathogenesis promoted the field of association

PAGE 30

30 studies and paved the road towards the future development of genome wide association studies (GWAS) [49]. The polygen ic notion of quantitative traits implies that the phenotypic values of such traits, as the result of cumulative effects of multiple genetic factors, will have a Gaussian (normal) distribution in the population [6,20]. Assuming an additive model on the basi s of co dominant allelic relationships and additive inter genic effects, as the number of genes contributing to a particular trait increases, the number of genotype and phenotype classes increases as well. For example, the numbers of genotype and phenotype classes are three and three for one gene, nine and five for two genes, 27 and seven for three genes, respectively. With the increase in the number of contributing loci, and under the influence of non additive genetic effects and environmental factors effe cts, the distribution of phenotypic values will tend towards a normal distribution [20]. To accommodate t he polygenic notion of quantitative traits with complex diseases as medically important dichotomous traits (i.e. disease vs. healthy states), Douglas S Falconer later conceived the idea of an underlying liability variable, a latent variable which in fact is the main determinant of the phenotypic state [6,30]. Therefore, bility value resulting from cumulative effects of his/her genetic composition and the environmental factors exceeds the disease threshold, the corresponding pathologic phenotype will manifest [16,30]. The concept of liability has a direct implication in explaining the increased familial risk or population incidence. From a statistical point of view the higher incidence of a

PAGE 31

31 particular disease in a population (or family) can be accounted for in the following ways: First, the mean of the liability distribut ion may be higher in a population with higher incidence than that in a population with lower incidence rate. In such cases, because the liability distribution demonstrates a shift to the right, more subjects will have a liability value beyond the disease t hreshold and manifest the symptoms related to the disease of interest. Second, the high incidence population may have a liability distribution with an increased variance compared to the distribution of a low incidence population. In such cases, although th e means of the two groups are not much different, because the liability distribution in the high incidence population demonstrates a heavy tailed distribution, liability values of more subjects exceed the disease threshold [24]. Major Hypotheses Regarding the Genetic Basis of Complex Traits A number of hypotheses about the genetic mechanisms which may underlie the development of complex diseases have been put forth. There may exist arguments both in favor of and against each of these notions and as experime ntal evidence indicates, none of them has been able to provide a comprehensive view of the genetic architecture of complex traits solely. However, it is important to note that they are not mutually exclusive and it is plausible to assume that the genetic v ariance of complex diseases can be explained by a mixture of the propositions suggested by these outlooks. A detailed description of these notions is as follows : 1 Common Disease Common Variant (CD CV) Notion The ommon disease common variant hypothesis assumes that the genetic variance of a complex disease is largely caused by the aggregate contribution of a few common causal variants (~ 20 variants each with minor allele frequency ( MAF) 0.05

PAGE 32

32 ) with small [80] to moderate effect s [16] Each of these disease causing alleles is assumed to explain several percent of the disease susceptibility [16,24] The empirical evidence in favor of the CD CV hypothesis was first obtained through linkage mapping and candidate gene association stu dies [49,63,79] For instance, HLA alleles were among the first markers proven to have disease associations with risk alleles conferring more than five fold increase in the risk of disease through early candidate gene association studies on a number of dis eases such as ankylosing spondylitis, psoriasis, hemochromatosis, celiac disease, and rheumatoid arthritis. The more prominent association was the one detected between ankylosing spondylitis and HLA B27 with an odds ratio of more than 100 [63,81] As anoth er example of the early efforts to identify common variants associated with complex diseases, Corder et al. (1993), through analysis of the genotyping data from 42 families with cases of late onset Alzheimer's disease, noticed that a common allele of apoli poprotein E gene (i.e. APOE allele) with frequency of 35%, remarkably increased the risk of disease by a factor of 2.84 and 8.07 in subjects with heterozygote and homozygote genotypes, respectively [77] Another evidence in favor of the CV CD notion was postulated by QTL mapping studies in inbred populations of plants and animals. Through such studies, it was noticed that identifying QTLs with substantial effect sizes explaining a large proportion of phenotypic variance, even around 10 to 20% of variance is not a rare finding [82]. In fact, such QTL mapping results supported the exponential model for quantitative traits,

PAGE 33

33 which states that the genetic variance of such traits is due to a few loci with relatively large effect sizes and many loci with smal l effects [83] Under the postulates of such model, finding QTLs explaining around 5% or more of the genetic variance of complex diseases in human seemed plausible [82], although it is noteworthy that the QTL mapping approach does not consider any assumpti ons regarding the allele frequency and it is not obvious if QTLs with large effects on the variance of phenotype are related to rare or common alleles [16] An increasing interest in the detection of common variants contributing to common disease later dom inated the field of complex diseases during late 1990s and at the turn of the 21 st century [49] Inspired by the postulates of the CD CV hypothesis, the GWAS approach was conceived [2] and implemented to investigate the genetic basis of many complex traits [49]. Through such studies a number of common variants with moderate effects were detected for example in Crohn's disease [13] and serum lipid level [84] However, none of the GWAS conducted thus far has detected more than one risk locus with an effect s ize as large as the one observed in QTL studies [16] Instead, most of the risk variants so far detected have small effects, and a major proportion of genetic variance is still unexplained for in any of the studied diseases [7,10] All in all, today the CD CV hypothesis is no longer considered valid and common variants of moderate effects are not considered to be the major source of genetic variance of complex diseases [16]. 2 Infinitesimal Notion model regarding the adaptive evolution to the field of common human diseases

PAGE 34

34 [16,82,85]. This hypothesis considers complex diseases as polygenic traits whose genetic variance is mainly contributed by additive effects of hundreds or t housands of loci with tiny effect sizes, each of which accounting for less than one percent of the disease susceptibility. The infinitesimal hypothesis does not consider any assumption as to the frequency of causal alleles; the risk alleles can be either r are or common [16]. basis of a particular complex disease. If the individual's liability determined by the aggregate effects of many risk variants is more extreme than th e disease threshold, the pathologic condition is most likely to be manifested [16,30]. There are multiple lines of empirical evidence supporting the infinitesimal hypothesis of complex traits. First of all, most of the susceptibility loci so far identified by complex trait GWAS are common variants conferring tiny predisposing risk [4,7,16,18] as is reflected in their small genotype relative risk (GRR < 1.1 1.2 [16,18]) or odds ratio (OR mostly between 1.05 and 1.2 [4]). The same is true for QTL mapping stu dies in which most of the detected QTLs are of infinitesimal effects, each explaining a small proportion of the genetic variance of the quantitative traits under investigation [82]. Moreover, it has been demonstrated that a larger number of risk variants can be captured as the power of analysis is increased by employing very large case control datasets (for example [84,86,87]) or performing meta analysis on multiple GWAS datasets (for example [12,88]) [16]. It has been also demonstrated that by applying a less stringent significance threshold (for example [89]) or by defining a score to combine the small association signals of many variants (for example [21,89]), it is possible to

PAGE 35

35 identify more variants whose infinitesimal signals were previously been burie d in noise [4,16]. The idea of summing the effects of multiple variants across the genome has empirically been used for many years by animal breeders to estimate and improve the breeding values [16], a method called genomic selection [90]. Despite the over whelming empirical evidence to support the infinitesimal hypothesis of complex diseases, there exist a number of issues that are not clearly addressed by it. One issue is that most of common variants detected by GWAS are unlikely to have a functional role in the biological processes underlying the phenotype [16,63]. Instead it is widely accepted that the variants detected by GWAS are tag variants in linkage disequilibrium (LD) with the causal ones in their neighborhood [63]. However, there is a conflict bet ween investigators regarding the commonality or rarity of these causal variants. While some have suggested the association signals detected by tag variants arise mostly from single or multiple causal rare variants with large effects, a phenomenon also call ed synthetic association [91 93], others have shown that synthetic associations can just explain a minor fraction of disease risk. Instead these investigators believe that common causal variants with infinitesimal effects in LD with the tag variants contri bute the majority of the genetic variance [94,95]. However, note that even if such common causal variants underlie the genetic basis of complex diseases, it is too difficult to figure out which of multiple variants in LD with tag variants are really functi onally relevant to the disease under consideration because of their tiny contribution to the phenotype [63].

PAGE 36

36 The next controversial issue regarding the infinitesimal hypothesis is based on the fact that a large proportion of susceptibility alleles thus far identified are common ancestral alleles [96], which seems at first sight to be inconsistent with what is expected under the assumptions of purifying selection theory [24]. However, this inconsistency can be accounted for in a few ways. One explanation is that parts of such disease associated ancestral alleles may not influence the fitness because their unfavorable phenotypic effects appeared post reproductively. This can be true especially in case of late onset forms of complex diseases [49,96]. The next p ossible explanation is that a common risk allele may become common because of its pleiotropic effect: the disease causing allele is selectively advantageous due to its favorable impact on another fitness influencing trait [97,98]. Another possible explanat associated alleles is due to the different selective pressures in ancient and modern environments. The environmental conditions and lifestyle of anc estral human populations originally favored the phenotypic effects of the common ancestral risk allele and acted against the effects of derived (mutant) alleles. With the change in environmental conditions and lifestyle of current human populations, such s elective pressures have been reversed. Therefore, the common ancestral allele now confers susceptibility to the disease because of its selectively disadvantageous state, while derived alleles, now selectively neutral or advantageous, act as disease protect hypothesis takes advantage of ancestral susceptibility model to explain the genetic mechanism by which the susceptibility to type 2 diabetes may increase [96]. It suggests

PAGE 37

37 that the thrifty genotype that developed to ada pt ancestral populations to the food scarcity and the highly active lifestyle of human in the ancient times can be disadvantageous in modern societies, due to the food rich environment and sedentary lifestyle of human, and therefore promotes susceptibility to type 2 diabetes [99]. Also, such interactions among genetic and environmental factors in the development of a phenotype. Based on this hypothesis, common ancestral dis ease associated alleles can be those variants whose unfavorable phenotypic effects were buffered in ancestral populations by optimal genetic buffering mechanisms evolved over millions of years, but are manifested in the modern developed societies in which environmental changes may have disturbed the previously highly efficient genetic buffering system. Whatever the underlying mechanism is, the fact that some identified complex disease associated risk alleles are ancestral alleles implies that the common int uition which relates a disease to mutant variants is not always true in case of complex diseases [24]. The next weakness of the infinitesimal hypothesis is that it is inconsistent with the familial clustering of cases [16,24,53]. Under the infinitesimal hy pothesis the median of distribution of risk alleles is not expected to be much different between affected and unaffected groups. In fact if the postulates of this hypothesis are true, and under random mating, it is expected that most individuals will harbo r more or less the same number of predisposing loci, which in turn should reduce the phenotypic variance of the trait in the population. However, familial clustering in a disease is not consistent with such homogeneous phenotypic distribution [16].

PAGE 38

38 Finally the infinitesimal hypothesis is not able to explain why the incidence and prevalence of many of complex diseases have increased over the past decades or why a disease may have a different incidence rate in genetically similar populations. Such demographi c features can be attributed to the interactions among genetic and environmental factors [16]. 3 Complex Disease Rare Variants (CD RV) Notion hypothesis suggests that the genetic variance of complex diseases is largely caused by cumulative effects of multiple rare variants with moderate to major [16]) [53,63]. Such variants are in most cases sub polymorphic variants with allele frequencies between 0.001 and 0.01, although variants with allele frequencies less than 0.001 may also exist rarely [5 3]. Note that the range of allele frequency of such risk variants has been considered differently throughout the literature (for example risk allele frequency (RAF) < 0.01 [24] or RAF < 0.005 0.02 [94] or 0.001 < RAF < 0.03 [63], or RAF < 0.05 [53]). Each disease contributing locus can potentially harbor a large number of rare variants [53,79] with the most extreme situation in which each variant is only found in one individual in the population [79]. In other words, a population of patients affected by a p articular disease can be considered as a conglomeration of many subpopulations, each harboring a distinct set of mutations [16]. Although such rare variants usually influence the phenotype dominantly, for instance through haplo insufficiency or gain of fun ction effects, their penetrance might be incomplete [16,53,63] because of the interaction with other modifier genes [16]. The complex disease causing rare variants can be population specific, because of the founder effects and subsequent

PAGE 39

39 genetic drift. The y are different from mutants causing Mendelian diseases in that Mendelian mutants are high penetrance mutations with large deleterious effects on the phenotype and very low risk allele frequencies (mostly RAF < 0.001) [53]. The fact that the low frequency of such disease causing variants with slightly deleterious effects is consistent with the evolutionary perspective provides strong evidence in favor of RV CD hypothesis. Purifying selection against impaired phenotypes is in fact expected to remove extreme phenotypes, preventing disease causing variants from being common. However, such variants may not be completely eliminated from the population's gene pool because of their partial penetrance or recessive mode of action [4], or their moderate effects on fit ness [100]. There exists empirical evidence supporting the potential role of rare variants with major effects in the pathogenesis of complex diseases. For instance, multiple rare variants of unknown significance (VUSs) with odds ratios over 20 have been de tected in the BRCA1 and BRCA2 genes [101] which may remarkably influence the risk of breast cancer [63]. These variants are different from those mutations in the BRCA1 and BRCA2 genes commonly increasing the risk of familial breast cancer [38 42]. Further examples in favor of the role of rare alleles in complex diseases include a number of rare variants detected in the ATM gene collectively incrementing the risk of breast cancer by a factor of 2.37 [102], in the MSH2 gene significantly associated with m ulti ple colorectal a denomas with odds ratios over 11.7 [103], in the IPF 1 gene collectively increasing the risk of type 2 diabetes by a factor of 3 [104], and in fibulin genes significantly associated with age related macular degeneration (AMD) [105]. Also, r are

PAGE 40

40 copy number variations with high odds ratios conferring susceptibility to complex diseases such as schizophrenia and related psychoses [106,107], and autism [108] have previously been detected [16]. However, although some investigators believe syntheti c associations due to rare alleles in LD with tag variants are major determinants of the genetic variance of complex diseases [92,93,100], others have suggested that although there is no doubt about the contribution of such synthetic associations to the ge netic basis of complex diseases, they are not the major source of genetic variance [94,95,109]. For instance, through the analysis of the distribution of risk alleles associated with eight complex traits, Park et al. (2011) demonstrated that while the MAF of most of the tagging SNPs in general populations were between 0.05 and 0.2, the causal alleles so far identified for these complex traits were mostly alleles with frequencies higher than 0.2 [109]. Finally, similar to the infinitesimal hypothesis, the CD RV hypothesis is not consistent with the increase in the incidence and prevalence of many of complex diseases over a few generations [24] and is not able to account for familial clustering of the complex diseases [53,63]. The familial risk of complex dise ases is in fact much higher than the risk supposed to be conferred by a rare variant [16]. 4 The Broad sense Heritability Notion Broad sense heritability (H 2 ) is defined as the proportion of phenotypic variance determined by the genetic variance, as oppos ed to the narrow sense heritability (h 2 ) which refers to the proportion of phenotypic variance caused only by the additive genetic variance [31].

PAGE 41

41 (2 5) (2 6) The broad sense heritab hypothesis states that interactions among genetic factors, also called epistasis [110 112], and among genetic and environmental factors and the effects arising from epigenetic modifications of genome, together contribute to the phenotypic variance o f the complex traits and may explain an important proportion of missing heritability of complex traits [16] Although the importance and commonality of epistatic effects of interacting loci on the biological processes underlying complex phenotypes has long been recognized [22], it is noteworthy that the term epistasis has been used by biologists and statistical geneticists to implicate different meanings which may result in some confusion about this key genetic concept [113]. In fact the term epistasis orig inated through the study of Mendelian traits and was later used by quantitative geneticists [34]. The phenomenon of biological epistasis was first highlighted by William Bateson Reginald C. Punnett [64] and Wilhelm Weinberg [114] in the early 20 th century through experim ents whose results showed deviation from expected Mendelian segregation ratios. In their study on chicken combs, Bateson and Punnett noticed that the observed ratio of the single type of comb was lower than what was expected on the basis of segregation of a single genetic factor. Taking in to account potential explanations for these unexpected results, they suggested the comb type is determined by two independently segregating genetic factors which act in concert. They postulated the

PAGE 42

42 single t ype of comb is inherited only if the chicken has recessive homozygote genotype in both factors [60] To account for the masking or suppressing effects of alleles in one locus on the phenotypic effects of alleles in another locus, Bateson coined the term ep istasis [60,64,115]. In the meantime, Weinberg komplizierter polyhybridismus phenomenon [60]. Epistasis in such cases in fact referred to the inter genic dominant relatio nships, in analogy to the intra genic dominant effects existing between alleles in the same locus [113]. The suppressor locus is said to have epistatic effect on the suppressed locus which in turn is considered hypostatic to the suppressor locus [60]. The Bombay phenotype represents an example of biological epistasis in humans, which may be misleading regarding paternity testing. In such cases, the phenotype encoded by the ABO blood group system ( I locus ) is masked under the homozygous recessive genotype at the H locus leading to the blood group of type O in individuals with hh genotype regardless of their ABO system genotype [116,117]. On the other hand, statistical epistasis, first noted by Ronald A. Fisher [62], lends itself to the statistical concept of the departure of a multi variable model from linearity due to the nonlinear relationships existing among predictor variables [30,60,113]. Through his efforts to accommodate Mendel's segregation law with the genetic basis of continues traits, Fisher notice d it may not always be possible to predict the phenotypic value of a quantitative trait just by an additive model summing the effects of different contributing loci. He coined the term epistacy to denote such non additive effects, although the term was lat er replaced by the term epistasis, already introduced by

PAGE 43

43 Bateson [60]. In other words, statistical epistasis implies that predicting the variati on in phenotype through a multi variate model allowing for the interactions of predictor variables can improve th e goodness of fit of the model compared to a pure additive model [34]. In the simplest scenario of two interacting loci, the regression model will consist of nine terms: an overall mean, four simple terms corresponding to the additive and dominance effects of alleles in each locus, and four interacti on terms corresponding to the interactions among additive and dominance effects of alleles in the two loci [17,113]. The concept of statistical epistasis is in fact a general term encompassing a wide spectrum of various types of interactions, one type of which is the one considered by early Mendelists [60,112]. In general, epistasis refers to the situations in which multiple genetic loci act dependently and cooperatively to form a final phenotypic endpoint [22,11 0,113]. Interaction among biomolecules such as proteins, DNA sequences, and different types of RNA molecules is in fact a ubiquitous phenomenon that plays a crucial role in different biological processes like gene regulation and transcription, regulation o f translation, functional forms of protein complexes, signal transduction and biological pathways [22]. However, note that detecting epistatic effects is not always an easy task and furthermore not all statistically detected genetic interactions may easily be interpreted biologically [26]. When the combined phenotypic effect of interacting loci is greater than the amount expected from their additive effects, it is said the loci are synergistically interacting [17,118,119]. Synergistic epistasis may arise fr om functional relationships

PAGE 44

44 existing among functionally redundant genetic elements or physically interacting biomolecules, or from the co occurrence of two or more hypomorphic mutations in different steps of a particular pathway [119,120]. Antagonistic epi stasis refers to the situations in which the effects of genetic factors are in opposite directions and one ameliorates the impact of the other one. Therefore, the combined phenotypic effect of interacting loci is smaller than the amount expected based on t heir additive effects [17]. For example while the HLA DRB1*15 allele confers susceptibility to the multiple sclerosis with a relative risk of 2.919, its predisposing effect is antagonized in the presence of the HLA DRB1*14 allele, resulting in a relative r isk for the HLA DRB1*14*15 genotype of 1.06 [121]. Compensatory double mutants which counteract the individual deleterious effects of each mutant and improve the fitness of organisms, represent one of the best examples for antagonistic epistasis [17]. Gene tic interactions have indeed been suggested to play important roles in the process of evolution. The potential contribution of combination of genotypes to fitness and adaptivity was first implicated by Sewall G. Wright [111,122]. From an evolutionary persp ective, a gene by gene interaction improving the fitness by buffering the effects of deleterious mutations is called antagonistic (positive, buffering or diminishing) and the one negatively affecting the fitness is called synergistic (negative, reinforcing or aggravating) [123]. Epistatic fitness interactions in fact represent an important class of biological epistasis in which a compensatory mutation in one locus hinders the impairing phenotypic effects of another locus to reinstate the normal phenotype a nd consequently

PAGE 45

45 the fitness of the organism [124 127]. It has been noticed that such antagonistic epistasis between individually deleterious mutations is favored through the process of evolution [123,128,129]. For example a second mutation in the RNA seque nce can compensate for destabilizing effect of the first mutation on the secondary structure of RNA [127]. The interaction among compensatory mutations is one of the mechanisms that contributed to the process of canalization of physiological processes [130 ]. Epistatic selection may also represent one of the mechanisms through which linkage disequilibrium between neighboring variants is established [131 133]. It is also believed that epistasis is a common phenomenon and plays an important role in conferring disease susceptibility [17,22]. The modulating effect of modifier genes on the phenotypic effect of disease causing variants is an example demonstrating the role of epistasis in the genetic basis of complex diseases [35]. The effect of modifier genes may a lso account for incomplete penetrance of complex diseases [22,34]. Moreover, modifier genes can also influence the different aspects of clinical manifestations of Mendelian diseases such as severity of symptoms, age of onset, prognosis and survival [22,35, 134]. There exists considerable empirical evidence showing the impact of gene gene interactions in various complex diseases. Note that not all the variants detected to influence the disease's risk epistatically have significant association with the corresp onding disease individually [17]. The epistatic contribution of variants of XRCC1 and XRCC3 genes to the risk of bladder cancer [135] and colorectal cancer [136], of AKT3 and PRKCQ genes to the risk of prostate cancer [137], of ACE and AGTR1 genes

PAGE 46

46 to the risk of myocardial infarction [138], of KIR3DL1 and HLA B genes to the immune response against HIV [139], of GABRB1 and GABRA4 genes, two subunits of GABA receptor, to the risk of autism [140], of GRIN1 and GRIN2B genes to the risk of schizophrenia [141], and of HNF4A and KCNJ11 to the risk of type 2 diabetes [142,143] are just a few examples of the role of pairwise gene interactions in complex diseases. There is also evidence of higher order interaction of multiple variants with some diseases. The associat ion between breast cancer and combination of four variants in COMT CYP1A1 and CYP1B1 genes [144] is an example in this regard. Although it may be harmful due to discarding population specific associations [145,146], replication of significant findings ha s been suggested as the gold standard to validate any significant association among genetic variants and diseases [146,147]. Interaction between IL10 and IL6 genes in the context of Alzheimer's disease [148,149] represents an example of replicated epistati c effects with a potential biological relevance to the pathogenesis of the disease [18]. Associations between type 2 diabetes and the four way interaction among variants in TCF7L2 FTO ANK2 genes and the 18q21 18q23 region, or the four way interaction am ong variants in TCF7L2 FTO TSPAN8/LGR5 and CDKAL1 genes are examples of higher order replicated interactions [150]. Gene by environment interactions have also been implicated in the pathogenesis of complex diseases (for example [151 154]). For instance, Caspi et al. (2005) reported adolescent cannabis users with valine158 allele of COMT gene have increased odds of developing schizophreniform disorder later in their life compared to those with

PAGE 47

47 methionine158 allele. The odds ratio for Val/Val individuals an d Val/Met individuals were 10.9 and 2.5, respectively [151] Besides genotype by environment interactions, interactions between epigenetic modifications and environmental factors may also contribute to the pathogenesis of complex diseases [155,156]. For in stance based on a number of epigenetic related phenomena observed in multiple sclerosis (MS) such as discordance between monozygotic twins and maternal parent of origin effect at the HLA DRB1 locus; one of the major MS associated loci; and based on the fac t that epigenetic marks may not be erased and reestablished entirely through gametogenesis [157,158], Burell et al. (2011) suggested that environmental risk factors like Epstein Barr virus, vitamin D deficiency and smoking may influence the risk of MS thro ugh their influence on the epigenetic mechanisms [156] genetic and environmental factors [24] has also been implicated as a mechanism which may be involved in the development of co mplex diseases [24,159,160]. It can also account for the reason why a proportion of variants associated with complex diseases are common ancestral alleles [96] and for the increase in incidence and prevalence of complex diseases observed in modern societie s in which people are leading new lifestyles and experiencing altered environmental conditions such as dietary changes, air pollution, smoking and mo re psychological stresses [24]. Decanalization refers to the situations in which the canalization of physio logic processes is broken down and as a result the additive genetic variance related to the

PAGE 48

48 already buffered polymorphisms is represented. The canalization theory states that through the effects of genetic buffering mechanisms elaborately evolved over mill ions of years, physiological processes such as metabolic or developmental pathways reach a robust optimum state in which complex phenotypic traits display maximum stability [24] and are therefore preserved against stochastic perturbations [161]. Canalizati on, mediated by stabilizing selection, reduces variability of the physiologic processes, which in turn decreases the evolvability of the process [162] and reduces phenotypic variation by decreasing the additive genetic variance [162,163]. Decreased phenoty pic variation ensures phenotypic reproducibility [161]. Functional redundancy inter genic and intra genic buffering interactions (such as interaction among chaperones and other proteins to form functional spatial configuration of proteins [164,165] or amo ng compensatory mutations [127]) and interactions forming regulatory feed back loops (such as the interactions among microRNAs and mRNAs [161,166]) are a number of buffering mechanisms underlying the process of canalization [130]. The highly efficient buff ering mechanisms allow for accumulation of cryptic variants whose phenotypic effects are masked [24,162]. In case canalization mechanisms are broken down and are not able to preserve the stability of a particular physiologic process, for instance due to en vironmental perturbations, admixture [24] or mutations with large effects [16] such hidden common variants may display their phenotypic effects which in turn increases the additive genetic variance [24] and the evolutionary divergence rate of the process under influence [162].

PAGE 49

49 In cases in which the liability variable underlying susceptibility to a particular complex disease is influenced due to impaired buffering mechanisms, the increase in the variance of phenotype can result in the increase in diseases i ncidence and prevalence because more individuals exceed the disease threshold [24]. The decanalization notion implies that dissecting the genetic architecture of a particular complex disease through GWAS is more difficult when the endophenotypes underlying the disease development are more strongly canalized. This may explain why GWAS has shown little success in exploring the genetic basis of a number of diseases like psychological diseases and hypertension compared to its success regarding some other diseas es like Crohn's disease. Decanalization may also account for the reason why the results of GWAS are not always replicable in other populations: if disease influencing endophenotypes are under canalization, the genetic drift may result in different allele f requencies of cryptic variants in different populations [24]. Common vs. Rare Variants: Consequences The controversial debates about the commonality or rarity of variants largely contributing to the genetic variance of complex diseases have a number of pra ctical consequences in terms of research methodology and clinical applications [63,79]. If most of the unexplained genetic variance of the complex diseases is supposed to be contributed by common risk alleles with tiny effect sizes, performing GWAS on very large case control datasets will provide greater statistical power to detect undetected risk alleles [63,86,94]. On the other hand, deep sequencing of candidate genes in large datasets of clinically well diagnosed cases, and subsequent functional analyses of

PAGE 50

50 detected variants underlies finding rare risk alleles contributing to the genetic basis of complex diseases [53,63]. The next issue is that being statistically significant is not equal to being biologically relevant. Although rare variants associated w ith complex diseases are almost always causal variants, meaning that they have a functional role in the biological processes leading to the phenotype, most of the detected common variants are unlikely to be causal ones [16,53,63]. In such cases both rare a nd common variants in linkage disequilibrium with such common tagging variants can potentially be causal variants, although it is more likely that the causal variants be common [16,94]. However, when the odds ratio is small, it is difficult to figure out w hich of multiple variants in LD with a common tagging variant is really functionally relevant to the disease of interest because each variant may have a small effect size [63]. The application of preventive interventions for carriers of a particular varian t directly depends on the degree of its penetrance. Although common variants with small odds ratios may have a meaningful contribution to the population attributable risk (PAR), which measures what proportion of disease incidence is attributable to particu lar genetic factors, they are not appropriate candidates for preventive interventions because of their small effects and very small penetrance. On the other hand, preventive interventions for carriers of a rare variant with moderate to large effect can be feasible b ecause of its higher penetrance. R are variants are therefore more valuble from a personalized medicine perspective [63]

PAGE 51

51 Methods Used for the Study of Complex Diseases Linkage mapping is the method of choice for exploring the genetic basis of Men delian diseases [6]. Through analysis of the pattern of allele segregation in pedigrees of affected families, under a series of different potential recombination causing locus, a LOD (log arithm of odds) score is estimated to measure if the observed pattern of allele segregation is statistically different from the pattern expected under the assumption that the marker and the disease locus are not linked together [6,167,168]. A LOD score of above 3 is indicative of the linkage between the marker and the unknown disease loci and a LOD score below 2 demonstrates that the two loci are independent [6,168]. The candidate disease causing locus is then explored by positional cloning analyses. As a parametric method, the power of linkage mapping depends on the accuracy of the genetic model specified by the investigator. The genetic model determines assumptions about the mode of inheritance of disease, and the frequency and penetrance of the disease c ausing genotype(s) [6] Although parametric linkage analysis has been successfully employed to discover the genetic basis of a number of near Mendelian forms of complex diseases [6], it is in general underpowered to elucidate the genetic architecture of su ch diseases [2,3,6]. The complexity of genotype phenotype mapping in complex diseases arises from interactions among genetic factors and among genetic and environmental factors, locus heterogeneity and phenocopy [26].

PAGE 52

52 Because of the complexity of the genet ic basis of complex diseases, it is too difficult to specify a precise genetic model required for a successful linkage mapping. Note that although the low penetrance of complex disease influencing genotypes can be overcome by performing linkage analysis on only affected members of pedigrees, such analysis still requires a genetic model about the disease under consideration, making it underpowered in case of complex diseases. Non parametric methods have therefore been considered more powerful for analyzing t he genotype phenotype relationships of complex diseases. Such model free methods can be used in both family based and population based studies [6]. Examples of family based non parametric and model free methods include non parametric linkage analysis, whic h measures whether affected sibling pairs represent identity by descent (IBD) state in a marker locus significantly more than what is expected under a null hypothesis of assuming there is no disease locus in linkage with that marker locus [6], and the Tran smission Disequilibrium Test (TDT), which links a phenotype to a particular allele in a locus (either a causal allele or in linkage with causal allele [169]), in cases where the allele shows over transmission from parents to their affected offspring [170]. The TDT method involves performing more work than population based methods because it requires the two parents to also be genotyped in each family. This may also make the TDT method infeasible in the case of late onset complex diseases [6]. It is also not eworthy that although common alleles, individually or collectively, and multiple rare alleles, collectively, can substantially contribute to the population

PAGE 53

53 attributable risk (PAR) of a complex disease, neither common alleles nor rare alleles can lead to fa milial clustering of cases if the penetrance of an allele is below 50%. Therefore, family based studies may not considerably improve the detection of common or rare variants associated with complex diseases [53,63]. Moreover, since the power of detecting p otential genetic interactions depends on the sample size, family based studies may not be powerful enough for such analyses due to their insufficient sizes [19]. Candidate genes association studies which investigate the association between genetic markers related to a limited number of genes selected based on their known or inferred role in biological processes [171] and GWAS [15], are two population based non parametric method s which have widely been employed for the study of complex diseases [6,7,9] In their seminal paper which provided the theoretical groundwork for development of the method of genome wide association study, Risch and Merikangas (1996) compared linkage analysis with family based association analysis in terms of the sample size required to detect a linkage signal or an association signal, respectively, for variants with different risk allele frequencies (RAF) and different genotypic relative risks (GRR). They noticed that the power of both methods in detecting rare risk variants or varian ts with small effects decreases if the sample size is not large enough. However, this is more prominent for linkage based analysis in that the detection of linkage signal arising from a variant with RAF of 0.01 or with GRR of less than 2 is almost infeasib le because it requires having access to a very large sample. In such cases, association

PAGE 54

54 analysis still provides acceptable power to detect risk variants even at stringently adjusted significance levels. Ultimately, they suggested that a genome scale associ ation study can overcome the limitations of linkage based analysis methods and remarkably improve our knowledge about the genetic architecture of complex diseases [2]. Later, the completion of large scale genomic efforts like the Human Genome project [172, 173] and the HapMap project [174] coupled with the advent of high throughput technology [175] provided more theoretical and technological support for fulfilling the idea of genome wide association studies [1,176]. GWAS, a powerful method for the large scal e analysis of genotype phenotype relationships is currently the method of choice for dissecting the genetic basis of complex diseases [7,9,10]. A large number of disorders such as cardiovascular s, bipolar disorder, schizophrenia and human malignancies, have thus far been studied with this approach leading to the detection of many previously undetected contributing loci (for example [1,11 14,88,89] ). GWAS has also been used for the study of quanti tative traits such as height [177,178] body mass index [178,179] and lipid level of plasma [84] The general outline of the steps employed by a genome wide association study is as follows: upon a question of interest, the phenotypic criteria for defining different phenotypic classes are specified and subjects are recruited for the study. In the most common scenario, subjects in two contrasting groups (cases vs. controls) are genotyped for hundreds of thousands of SNPs, with the goal of identifying one or m ore SNPs whose genotypes show a significantly different distribution between cases and

PAGE 55

55 controls [7,15]. The genotyping data is then normalized and undergoes a step of quality control in which the dataset is filtered on the basis of a number of quality crit eria. Excluding SNPs with calling rate less than 97 98% in either case or control groups, or SNPs with minor allele frequencies of less than 1% to 5%, or SNPs showing Hardy Weinberg disequilibrium (HWD) in control group with p values < 10 4 or excluding i ndividuals with at least 3% missing genotypes in all the genotyped SNPs, are some of the quality control criteria usually applied in GWAS [1,15,180]. Also, the potential confounding effect of population stratification must be carefully investigated. Ignori ng population stratification will results in detecting false positive associations arising from the factors unrelated to the disease status, such as geographic or ethnic factors. Population stratification, if it exists, must be controlled either by includi ng the geographic or ethnic factors as covariates in the analysis to take into account their potential confounding effects [15] or by adjusting the inflated test statistics using methods such as genomic control [15,181]. Note that family based association studies such as TDT can also overcome the issue of population stratification [182] Once the quality control task is performed, the dataset is analyzed for any associations between SNPs and the disease under investigation. Note that because of very large number of hypotheses tested in a GWAS, it is necessary that the results be interpreted at an adjusted significance level [15] A number of different methods have so far been applied to deal with the issue of multiple testing. The adjusted significance leve

PAGE 56

56 significance level in order to keep the probability of detecting at least one false positive result, also called family wise error rate, at the desired level [183]), the false disco very rate (FDR) method (which adjusts the significance threshold so that the proportion of false positive results among all positive results is kept at a desired le vel [183] or a Bayes factor based method (which specifies a local significance level as a fu nction of the expected power of analysis and the expected posterior and prior odds in favor of detecting true association between a disease and a locus [1]) [15]. Relying on permutation testing to determine the significance level is another approach to the problem of multiple testing. However, note that there is a trade off between type I and type II errors, which means a stringent control over type I error decreases the power of analysis in detecting association signals [15]. For example, because Bonferron method does not consider the correlation among SNPs in linkage disequilibrium (LD) with each other, some believe it might be too stringent for GWAS [15]. The analytical framework currently used for the analysis of association data primarily analyzes th e association of individual SNPs with the disease of interest, for instance by test of allelic associations or Cochran Armitage trend test. Other in depth analyses such as the analysis of interactions or haplotype analysis may then be conducted secondarily based on the results obtained at the first step [15]. However, despite the very large number of markers that can be genotyped using currently available array based SNP genotyping platforms (up to one million SNP genotypes per run) [15] in most cases GWAS has so far shown limited success in explaining a considerable proportion of genetic variance of complex diseases, a

PAGE 57

57 phenomenon also referred to as missing in the literature [7,10,16] However, note that heritability is the amount of phenoty pic variance attributable to the additive genetic variance of the trait [31] This failure is in part due to the nature of complex diseases and in part due to limitations inherent to the current analytical framework employed to analyze and interpret the ob tained data [17 19]. To start, most of the susceptibility loci so far discovered by GWAS are of small predisposing risk [4,7] and it has been hypothesized that the genetic variance of complex diseases may be mainly due to the joint contribution of multiple susceptibility loci having small individual effects [16,20,21] Detecting such infinitesimal contributions is difficult, especially when the predisposing allele is rare or the sample size is not sufficiently large [2,3] and the very large number of marke rs under investigation raises the issue of multiple testing which makes it even harder to reliably detect a small association signal [15] It has been suggested that the detection of low penetrance common variants will be facilitated with the increase in t he size of sample in a GWAS. This in turn may lead to capturing a much larger proportion of the genetic variance compared to the variance thus far has been explained [16]. For instance Park et al. (2010) demonstrated that while on average a total of 27.4, 26 and 2.8 new susceptibility loci are expected to be discovered for height, Crohn's di sease and a collection of breast, prostate and colorectal cancers, respectively with a sample size of 25,000 t he expected number s will be on average 182.9, 140.1 and 40 .5, respectively if the sample size increases to

PAGE 58

58 125,000. These loci are expected to explain 15.7, 19.8 and 13.5 % of unexplained genetic variance of their corresponding diseases [86] The idea of increasing the power of GWAS through increasing the sample size has led the research community towards a second wave of GWAS in which additional risk loci are sought by performing GWAS on datasets with very large sample sizes or by performing meta analyses on multiple GWAS datasets at hand [86]. For example in a l arge GWAS on over 100,000 individuals of European ancestry to explore the loci influencing blood lipid levels, 59 novel lipid trai t associated loci with p values less than 5x10 8 were detected Among these newly detected loci, 39 were significantly associa ted with total cholesterol, 22 with l ow density lipoprotein ( LDL ) level 31 with high density lipoprotein ( HDL) level, and 16 with triglycerides (TG) level. Most of these newly detected loci were replicated in independent samples of European, Asian and Afr ican American ancestry [84]. In another GWAS on a very large cohort, Allen et al. (2010) analyzed genotyping data obtained from 183,727 individuals. They were interested in studying the genetic determinants influencing normal variation in height, a well kn own polygenic trait. Their effort resulted in revealing hundreds of SNPs in more than 180 loci, mostly novel, accounting for on average 10.5% of the phenotypic variance of height [87] Also, in a meta analysis on multiple genome wide T1D datasets, Barrett et al. (2009), through analyzing genotyping data of 7,514 cases and 9,045 controls, reported 27 new T1D associated regions containing multiple T1D associated SNPs with p values < 10 6 Of these newly detected regions, 18 were replicated ( p < 0.01 in an ind ependent replication dataset and p < 5x10 8 in a collection of primary and replication

PAGE 59

59 datasets) and four presented weak evidence of replication ( p < 0.01 in the replication dataset but p > 5x10 8 in the collective dataset) [12]. In another meta analysis c onducted by the P sychiatric GWAS Consortium (PGC) (2011), five previously undetected regions demonstrated association with schizophrenia through the analysis of a dataset containing genotyping data for 17,836 cases and 33,859 controls [184]. It has been al so suggested that by defining a score to combine the small association signals of many variants across the genome, more undetected susceptibility loci can be detected [4,16]. For example Yang et al. (2010) showed that by considering the tiny effects of tho usands of SNPs which are not individually significant when a stringent significance threshold is applied 45% of genetic vari ance of height can be explained [21] This proportion is much higher than the one captured by common significant SNPs. However, the applicability of their approach to other complex traits such as BMI schizophrenia and intelligence has not been as fruitful [16] In another study, Purcel et al. (2009) introduced a scoring approach in which an additive score was calculated by combining the odds ratio of risk alleles that were individually in association with schizophrenia at a particular significance level. They calculated the score for a set of different significance thresholds rang ing from 0.01 to 0.5 and noticed that at each signific ance level the cases in general had higher scores than the controls, leading to significant association between the score and the disease. The score disease association was more eminent as a more liberal significance threshold was applied, causing more SNP s to prove significant individually and therefore be included in the analysis. For example at a significance threshold of 0.5, a total of 37,655 SNPs

PAGE 60

60 collectively contributed to the score which in turn showed the best association with schizophrenia compare d to the scores produced at more stringent significant thresholds [89] N ote that although parts of the unexplained genetic variance of complex diseases might also be due to the contribution of undetected rare variants with major phenotypic effects, as put forward by the complex disease rare variants notion the same problems making GWAS underpowered in detecting the infinitesimal effects of common variants (explained in previous paragraphs) adversely affect the efficacy of GWAS in capturing the effects of theses rare variants [16] Moreover, the genetic architecture of a complex disease may include epistatic effects among interacting loci [17,22,23] and effects related to gene environment interactions [24] In their study, Zuk et al. (2012) suggested that one reason that can account for the missing heritability of complex diseases is that the additive genetic variance and consequently narrow sense heritability can be overestimated by ignoring the contribution of genetic interactions. They showed that 80% of currently missing heritability of Crohn's disease can be attributed to the genetic interactions. They concluded that if the role of epistasis is ignored, attempts to find more risk loci for complex diseases may not necessarily solve the problem of missing heritability [23] Statistical methods that analyze SNPs individually are unable to address such complex epistatic effects. Although current regression based methods have enough power to analyze some low order interactions (e.g. quadratic or cubic terms ), provided that the corresponding simple terms have significant main effects, they are unable to

PAGE 61

61 explore all potential combinations of markers because as the number of factors under analysis increases linearly, the number of their combinations grows expon entially, resulting in a computationally intractable situation [17,18] It is highly likely that synergistic epistasis occurs while the main effects of interacting loci are significant. Therefore, employing regression based approaches for modeling interact ions given the significance state of main effects of genetic factors may have enough power to detect synergistic interactions. However, relying on the significant individual main effects of loci to guide the process of detection of antagonistic interaction s may not be successful because such interactions may frequently occur while the main effects of genetic factors are non significant [17] Pure epistasis is a class of antagonistic epistasis in which the marginal main effects of interacting loci are small and non significant, while their interaction is statistically significant [17,185]. Disordinal epistasis represents another class of antagonistic epistasis in which epistasis is present despite zero marginal main effects of interacting loci. Although pure epistasis may be of little biological plausibility, there is evidence corroborating the biological importance of disordinal epistasis [17] Epistatic fitness interactions [124,126,127] are examples of disordinal epistasis [17]. In addition to computational burden arising from exhaustive search of epistatic effects (i.e. interaction terms in a regression analysis), t he sparsity of data and issues of overfitting and multiple testing may also impose additional hurdles in such cases [17] A further weakness of current analytical frameworks is that they analyze genome scale datasets without necessarily taking the existing expert knowledge about the trait under investigation into account [18]. It has been proposed that there can be

PAGE 62

62 obvious benefits in integrating preexisting biological and statistical knowledge into the process of genotype phenotype association analysis [25,26] One important advantage is that combining high quality prior biological knowledge such as protein protein interaction or intracellular pa thways with data analysis can provide a biologically plausible way to reduce the dimensionality of large scale datasets [26] Such knowledge based feature selection methods establish an alternative and supplementary way to the statistical (e.g. a two step analysis [17] ) and computational dataset dimensionality reduction methods (e.g. MDR [144] or CPM [186] ). Physiologic processes like metabolic or developmental pathways represent integral sets of interconnected modules of functionally or physically dependen t biomolecules evolved to a robust and optimal state to keep the integrity of cells [24] It has been previously demonstrated that different biological processes are implicated in the development of different clinical endpoints. For example, while defects in the function and regulation of the immune system have been linked to the pathogenesis of autoimmune diseases like rheumatoid arthritis [187,188] metabolic and Wnt signaling pathways have been implicated in type 2 diabetes [1] The nonlinear genotype ph enotype mapping procedures in complex diseases may therefore benefit from reducing the dimensionality of the genome scale datasets in a knowledge based manner which ensures that genetic factors inferred to be more biologically relevant to the phenotype und er investigation are prioritized and given more weight [26] As an example of using biological expert knowledge to filter out genetic factors and guide the analysis of genetic data, Pattin et al. (2010) suggested that a number of metrics,

PAGE 63

63 computed on the b asis of the probability with which the pairs of proteins may functionally interact, can be used to prioritize the list of genes and their corresponding SNPs for the analysis of epistatic interactions using a large scale dataset. By implementing their metho d on an association dataset of bladder cancer, they showed that preprocessing GWAS data on the basis of protein protein interactions helps in performing feature selection required for finding actual epistatic interactions [189] Prioritizing features based on their suggested functional role and structural properties may also potentially assist investigators with exploring higher level features related to the phenotype of interest For example the likely association between a pathway and a phenotype can be g uided by the lower level findings such as association between individual SNPs and the phenotype. In an effort to develop a knowledge based framework for translational research in general, and for GWAS in particular, Riva et al. (2010) proposed a methodolog y based on the Select and Test model (ST model) [190,191] for scientific discoveries. The ST model takes advantage of an iterative reasoning procedure that describes the process of scientific discovery as a cyclical sequence of problem definition ( abstra ction ), hypothesis generation ( abduction ), hypothesis evaluation ( deduction) and hypothesis verification ( induction ) steps. Once a hypothesis is approved, it can then be used to generate new hypotheses to connect features of higher level to the problem und er investigation. In the context of GWAS, the authors have demonstrated the application of the ST model by explaining the steps which can be employed to help an investigator move from prior knowledge regarding the association of a single SNP belonging to t he TCF7L2 gene (i.e. rs4506565 ) and

PAGE 64

64 type 2 diabetes (T2D) towards concluding that the Wnt signaling pathway to which TCF7L2 gene belongs is also associated with T2D To this end they extracted all genotyped and validated SNPs belonging to the Wnt signali ng pathway and removed rs4506565 SNP from this pool. Comparing case and control groups using an additive model, the pathway under investigation proved to be in association with T2D. The authors commented that although this finding is not novel and is jus t a reproduction of a previously revealed pathway phenotype association, it exemplifies the application of the ST model in exploring large scale datasets through a hypothesis based fashion [25] The interpretation of statistically significant results may a lso benefit from integrating preexisting biological knowledge into the data analysis process in that feature selection guided by expert knowledge will maximize the chance that significant findings include factors functionally related with the phenotype of interest. In other words, s uch significant findings are more likely to have an explicit and straightforward biological interpretation compared to the ones obtained from hypothesis free methods. For instance, Ballouz et al. (2011) demonstrated that a long l ist of genes related to many disease associated SNPs can efficiently be narrowed down to a short list of likely causal genes by the help of preexisting biological knowledge about the functional domains of proteins which are produced by previously discovere d diseases causing genes or about the pathways or protein interactions to which previously known diseases causing genes contribute [192].

PAGE 65

65 However, note that the predictions of methods taking advantage of preexisting biological discoveries might be affected if there is any lack of information or any bias in the biological databases, or if the genetic factors are not well annotated [26,192] Due to its unquestionable benefits, the paradigm of knowledge based genomic data analysis has received increasing atten tion over recent years and parts of research endeavors have been devoted to developing and applying innovative analytical frameworks in this direction [18]. Pathway based association studies represent an example of the integration of biological knowledge w ith statistical data analysis, and have recently drawn attention as an alternative to hypothesis free methods. For instance, in an attempt to extend gene set enrichment analysis to relate gene he SNP ratio test, which calculates the ratio between the number of individually significant SNPs and the number of non significant ones in sets of SNPs derived from every pathway of interest, and determines a pathway to be associated with the trait under investigation if its corresponding SNP ratio proves statistically significant [28]. As another example, Peng et al. (2010) [27] and Yu et al. (2009) [29] suggested ways of computing an overall p value for a gene by combining the p values obtained from the association tests on all the individual SNPs belonging to that gene. Gene level p values, in turn, were employed to define a pathway level p value to investigate pathway trait associations [27,29]. However, despite the fact that the limitations of classica l GWAS in the context of complex diseases are increasingly being recognized and that pathway based association studies are receiving growing attention, there is still no optimal approach

PAGE 66

66 able to overcome the above described challenges and provide a compreh ensive inter pretation of genome scale data. Modeling Non linear Effects of Genetic and Environmental Factors To address the drawbacks of traditional statistical methods in mapping complex genotype phenotype relationships, development of novel alternative c omputational approaches has received much attention by bioinformaticians and computational biologists over recent years [17,18]. In contrast to regression based approach, data mining (machine learning) methods do not fit a pre specified model on the data. Instead they search through a solution space containing a vast number of potential solutions (models) to the problem under investigation to find out the one with efficient performance in explaining the data [18,19]. In the remainder of Chapter 2 tradition al statistical methods and a number of popular data mining methods which have shown promise in modeling non linear effects of genetic and environmental factors are explained in detail. 1 Traditional Statistical Methods 1 1 Multiple linear regression an d multiple logistic regression analysis Multiple linear regression and multiple logistic regression analyses are used to investigate the relationship between a set of predictor variables (X 1 X 2 etc) and the mean of a response variable or the logarithm of odds in favor of the response variable (i.e. disease state), respectively. The r esponse variable is continuous in the case of linear regression analysis and is dichotomous in the case of logistic regression analysis [193]. The main drawback of the regress ion based methods is that because of the computational burden arising from the modeling of interactions of many variables, the

PAGE 67

67 analysis is biased towards the modeling of the lower order interactions of predictor variables whose main effects are significant This results in ignoring the interactions that may exist among variables with non significant main effects and also prevents such regression based methods from considering higher order interactions [17,18] Moreover, the sparsity of data may impose addit ional hurdles in such cases and make the parameter estimates imprecise [17] R egression based statistical methods are able to model non linear relationships of variables through two different ways [19] : Testing for interaction per se. A multivariate regre ssion model is fitted on the dataset by including the terms related to the simple effects of predictor variables along with their corresponding interactions. The interaction terms in such cases indicate the classical definition of statistical interaction a s deviation from linearity [19,193]. The alternative hypothesis ( H 1 ) tests if the variables of interest influence the disease status epistatically [194]. Supposing the simplest scenario, the fitted regression model consists of only two predictor variables and their interaction: (2 7 ) In this example, the H 0 states that the parameter corresponding to the interaction term is zero ( 3 =0) and the H 1 states 3 0 A number of model selection procedures have been developed to optimize a regression model by keeping significant terms and dropping non significant ones in cases where the simple and interaction effects of multiple variables have been included in the analysis. Note that such model selection procedures are not applica ble if the number of variables is too large [19,193].

PAGE 68

68 Testing for the joint effects of variables. The rationale for this test is that if the phenotypic effect of a genetic factor is manifested through interaction with another genetic or environmental facto r, the effect would be more powerfully detectable if a regression model containing the simple effects of both factors along with their corresponding interaction is fitted. The inverse is true if the two factors are independent [19]. In fact, the alternativ e hypothesis tests whether a particular genetic factor contributes to the phenotype of interest either individually or epistatically [194]. In other words, supposing the simplest multivariate regression model shown in formula 2 7 the H 0 states 3 = 2 =0 and the H 1 2 and 3 0 [19]. To investigate the joint effects of two variables through this approach, two regression models are fitted: the first one is a complete model containing all simple and interaction terms and the o ther one is a partial model in which all simple and interaction terms containing the variable of interest have been eliminated. The goodness of fit of the two fitted models are then compared (e.g. by comparing R square or mean squared error ( MSE)). If the complete model proves to be better fitted to the data than the partial one, the joint effects of the variable of interest with the other variables is concluded. [19,194,195] 1 2 Case only analysis Exploring the epistatic effects of genetic factors throug h analyzing only the case group is a more powerful alternative for the methods analyzing both case and control groups [19]. It can also efficiently be applied to model the interactions between genetic and environmental factors [196,197]. Case only analysis can be implemented simply by a chi square test assessing the independence of alleles (df=1) or genotypes (df=4) or by

PAGE 69

69 a logistic regression analysis [19]. However, note that the robustness of the case only approach is contingent on the independence of var iables under investigation in the general population [15,19,196]. The results of case only analysis can therefore be greatly distorted, for instance when the genetic variants are in linkage disequilibrium or epistatically contribute to the fitness of the o rganism [19], or when subtle correlations between genetic and environmental factors cause departure from the assumption of independence [15,196]. To overcome this limitation, a number of alternative methods have been suggested which take use of both case c ontrol and case only analyses estimates simultaneously [19, 197]. 1 3 Haplotype analysis The analysis of association between haplotypes and the disease of interest aims at determining whether particular haplotypes in a chromosomal region are over represen ted in cases (or controls) compared to what is expected according to the allele frequencies of respective loci. It has been suggested that this method may be more powerful in detecting variants contributing to a phenotype than testing for individual locus associations [15,198]. The rationale is that affected individuals, who have inherited a recent disease causing mutation due to a founder effect, not only share the causal variant but also share a haplotype containing the causal variant [199]. Therefore, ha plotype analysis can be used as a more efficient way to identify the functional variants contributing to the disease development [15,96]. However, note that the pattern of haplotype diversity is highly affected by the forces (such as mutation and selectio n) promoting genetic changes. For example, in contrast to the rare and recently derived causal alleles, the ancestral causal alleles can

PAGE 70

70 be related to a larger number of haplotypes, none of which is significantly over represented in the case group to cause a detectable association signal [96]. The association of haplotypes with different complex diseases has previously been investigated in a number of studies. For instance, Liu et al. (2005) discovered that two rare haplotypes related to the PTHR1 and VDR g enes were in negative and positive association with the risk of osteoporosis, respectively [200] Imputation of non genotyped SNPs by relying on the available haplotype maps and LD patterns (such as HapMap project information) is another application of hap lotype analysis in genome wide association studi es [201] (for example see [1]). One important issue regarding haplotype analysis is that, because haplotypes are not usually determined directly and instead are inferred by haplotype estimation algorithms bas ed on probabilistic models (such as EM algorithm [202]), the robustness of the haplotype analysis methods is contingent on how well the assumptions of the applied algorithms are satisfied. Also, the issues of sparsity of data and computational burden arisi ng from the large number of haplotypes, which can potentially be tested, are further limitations of haplotype analysis method [15] 2 Multi factor Dimensionality Reduction (MDR) Method The multi factor dimensionality reduction m ethod and software, introdu ced by Ritchie et al. (2001) [144] and Hahn et al. (2003) [203], is a non parametric and model free method which was originally developed to model the interaction effects of multiple factors on a dichotomous phenotype (e.g. diseased vs. healthy) through re ducing the dimensionality of data [17,18,144]. For each potential multi class combination of multiple factors, MDR reduces the multi dimensional space to a one dimensional space

PAGE 71

71 by defining a new variable with two classes: high risk vs. low risk. All combi natorial classes in which the ratio of number of cases to the number of controls exceeds a pre specified threshold collectively give rise to the high risk class and all other classes collectively constitute the low risk class. The original case and control datasets are then broken into training and test datasets. For each multi factor model, subjects in training sets are assigned to either high risk or low risk classes based on the original genotypic classes they belong to, and the prediction error rate of the newly generated variable is then calculated. This procedure is repeated for all potential combinations of variables under investigation (i.e. all potential multi factor model), and the multi factor model with lowest misclassification rate is selected. The classifying performance of the selected model is then validated in the test set. The entire aforementioned procedure is repeated in a multi fold cross validation manner and in each round of cross validation the multi factor model with the smallest misc lassification rate is selected as the best model. The average error rate of each model over multi fold cross validation along with its cross validation consistency, defined as the number of times the model is selected as the best model through the multi fo ld cross validation procedure, are used as criteria to select the final best multi factor model [17,19,144,204]. Therefore, The model with the highest average accuracy in classifying cases and controls, and the highest cross validation consistency is favor ed When there are two models, each of which has better performance with respect to one of the two aforementioned selection criteria, the model containing fewer variables is selected as the best final model [17,204]

PAGE 72

72 The MDR has been widely used for explor ing the phenomenon of epistasis in different complex diseases such as breast cancer [144], bladder cancer [135], atrial fibrillation [205], rheumatoid arthritis [206], type II diabetes [207] and autism [140]. The MDR method overcomes the issue of sparsity of data because it reduces a multi dimensional space to a one dimensional space [19]. However, one important limitation of this method is that the combinatorial search procedure employed by MDR can be computationally infeasible if the number of variants ex ceeds a few hundred or if higher order MDR software will currently support disease models with up to 15 factors at a time from a list of up to 500 total factors and a maximum sample si Defining an appropriate threshold for labeling original multi factor classes as high risk vs. low risk [208] and taking into account the confounding effects of covariates are other challenges of the MDR method [209]. Over rec ent years, a number of modified versions of MDR have been developed to improve the performance of the method [18]. For example, Lou et al. (2007) developed the generalized MDR (GMDR) method which allows adjustment for confounding covariates, and can be app lied to continuous phenotypes as well as dichotomous traits [209]. 3 Combinatorial Partitioning Method (CPM) The combinatorial partitioning method, introduced by Nelson et al. (2001) [186], is another data reduction based method which attempts to find set s of complementary genotypic partitions which are capable of explaining a large proportion of phenotypic variance of complex traits [17]. The procedure applied by the CPM method is as follows: a set of all potential combinations of genotypes at multiple lo ci, e.g. set N, is partitioned

PAGE 73

73 to complementary genotypic partitions, the union of which constitutes the set N. A set containing a group of complementary genotypic partitions is considered as a partitioning set. The partitioning procedure can be performed in many ways because the original set of n combinatorial genotypes (set N) can be broken into two to n genotypic partitions each of which may potentially be resulted from different ways of partitioning. For instance, the set N (with n combinatorial genotyp es) can be broken into two genotypic partitions so that one partition contains one combinatorial genotype and the other contains n 1 combinatorial genotypes, or one partition contains two combinatorial genotypes and the other contains n 2 combinatorial gen otypes, or one partition contains three combinatorial genotypes and the other contains n 3 combinatorial genotypes, and so on. Therefore, for example assuming the set N contains all potential combinations of genotypes at two bi allelic loci A and B, i.e. N ={AABB, AABb, AAbb, AaBB, AaBb, Aabb, aaBB, aaBb, aabb}, there would be 255 and 3,025 partitioning sets with two and three complementary genotypic partitions respectively, and a total of 21,146 partitioning sets with two to nine complementary genotypic par titions [186,210]. The ability of partitioning sets to explain the phenotypic variance of the trait under investigation is then evaluated, and if any partitioning set shows a promising performance, it is selected for further validation through a cross vali dation procedure [17]. Note that if there are L loci whose interactions are investigated, the total number of partitioning sets will be multiplied by the number of all potential combinations of L loci [186]. Therefore, exhaustive analysis of all potential partitioning sets by CPM, especially

PAGE 74

74 when the number of loci of interest goes up or when higher order interactions are sought for, can be computationally intractable [17,210]. In addition, the issue of multiple testing will arise if a very large number of competing hypotheses is tested [17]. The applicability of CPM in detecting gene gene interactions has previously been investigated in a number of studies. For instance, through the analysis of the epistatic effects of 18 variants in six genes ( APOB PON1 LPL APOA1 C3 A4 LDLR APOE ) on the plasma triglyceride level, Nelson et al. (2001) demonstrated that there were multiple sets of genotypic partitions which were able to explain significant proportions of the phenotypic variance of the trait under investi gation. The best set of genotypic partitions (i.e. {InDel ( APOA1 C3 A4 ), HincII ( LDLR )}) accounted for 9% of the phenotypic variance of the trait, while the corresponding loci explained less than 1% of phenotypic variance individually [186] In another stu dy, the epistatic effacts of the variants of ACE and PAI 1 genes on plasma plasminogen activator inhibitor 1 ( PAI 1 ) level in African Americans was reported by Moore et al. (2002) [211] To improve the performance of the CPM method and to deal with the is sue of computational complexity, Culverhouse et al. (2004) developed the restricted partitioning method (RPM) [210] which takes into account the difference between the mean phenotypic value of different combinations of genotypes to guide the process of par titioning more efficiently based on the similarity of these values. 4 Recursive Partitioning Methods D ecision trees, which investigate the epistatic effects of genetic factors on a continuous phenotypic variable (regression trees) or on a discrete phenoty pic variable

PAGE 75

75 (classification trees), are examples of methods employing a r ecursive partitioning approach to the genotype phenotype mapping [19]. T o find combinations of multiple variables, which are capable of distinguishing cases from controls accurately, the d ecision tree method uses a tree like graph which consists of multiple nodes connected to each other through branches, also called edges Each node represents a variable, for example a particular SNP, with a certain classifying performance determined on the basis of its distribution in the two contrasting groups. The first node added to the tree is called root, and each node gives rise to branches along which cases and controls with the same genotypic values are guided to the next node where the next v ariables are tested, and the one with the best classifying performance is added. If classifying performance of a branch is better than a pre determined criterion or if the number of cases and controls moving along a branch is below a pre specified threshol d, that branch, also called a leaf, is terminated and no new node is added. Otherwise t he procedure of adding new nodes will be continued [18,19]. At the end of this r ecursive partitioning procedure, s tarting from a leaf and moving up towards the root we are actually moving across specific SNP's genotypic values (e.g. SNP3=1 AND SNP2=2 AND SNP1=2 ). Each one of such combinations of genotypic values represents an epistatic model, and the combination with the best performance in distinguishing cases fro m controls is selected as the final model [19]. One important advantage of the decision tree method is that models retrieved by this method are easily interpretable [18]. Moreover, Cook et al. (2004) demonstrated that the tree based method outperforms logi stic regression analysis in revealing

PAGE 76

76 epistatic effects in cases where the main effects of genetic factors are not significant individually. Through analyzing a case control dataset of 319 cases of ischemic stroke and 2090 controls for which genotyping dat a were available on 92 variants from 56 genes involved in thrombosis, lipid metabolism and inflammation, they reported that interleukin 4 and P selectin influence the trait under consideration epistatically [212]. Random forest (RF) method developed by Le o Breiman (2001) [213], is a nother method which takes advantage of a r ecursive partitioning algorithm. It has been suggested that RF is a powerful machine learning method for modeling the epistatic effects and the genetic heterogeneity underlying the genet ic architecture of complex diseases [214]. The main difference between RF method and decision tree method is that i nstead of growing only a single decision tree, RF grows a random forest containing a large number of unpruned trees. For growing each tree, t he case and control datasets are divided into training sets and out of bag sets. Each training set is generated by a random bootstrap sampling of the original case control datasets, and the subjects excluded by the bootstrap procedure make the out of bag s ets. Once a tree is grown over the training dataset, the out of bag sets are used for estimating the error rate of retrieved models and measuring the importance of each variable through permutation testing [213,215]. Random forest method has been employed by a number of investigators to explore the genetic basis of complex phenotypes (for example [214 217]). It has been suggested that RF can model the epistatic effects more efficiently than traditional

PAGE 77

77 methods, and can be used for constructing gene networks [18]. Another important advantage of the random forest method is that it is capable of measuring the importance of variables both individually and in the context of their interactions with each other [19] Therefore, the random forest method can serve as a feature selection method for data dimensionality reduction. Such trimmed datasets can later be analyzed by other analytical methods [18,19] An important limitation of recursive partitioning methods is that the selection of the root node variable is depe ndent on the marginal main effect of variables [18,19], although this might be overcome through combining RF method with RelieF method [18]. Also, in the process of growing a tree, each node is added by testing a randomly selected subset of variables [18,1 9,213]. Therefore, although the interaction of every two variables is likely to be tested in at least one tree of the forest, this requires the number of grown trees be extremely high, which may impo se a computational burden [15]. 5 Pattern recognition Ba sed Methods Pattern recognition based methods represent a class of non parametric problem solving methods which are capable of efficiently dealing with complex problems such as epistasis. In contrast to the data dimensionality reduction methods, pattern re cognition based methods navigate the entire set of variables to find patterns in dataset capable of accounting for the problem of interest (e.g. classifying the cases and controls). Therefore, they don't suffer from the possibility of loss of information w hich may affect the data dimensionality reduction methods. However, despite their high power,

PAGE 78

78 interpreting their retrieved models may not be easy especially when they find a complex pattern. They are also more prone to overfitting, which requires their res ults be validated through validation procedures like cross validation which in turn may impose a computation burden [17]. Artificial neural networks [218 220], evolutionary computational strategies [221 224] and cellular automata [225] are examples of such methods [17]. A neural network is a biologically inspired computational model which mimics the function of neurons and neural networks in the brain in order to solve a complex problem. It is represented as multiple layers of interconnected processing node s network, and nodes that are stimulated then transmit a signal to the nodes of the next layer to which they are connected. If the sum of all the inputs entering one of t hese virtual neurons is higher than that neuron's so called activation threshold that neuron itself activates, and passes on its own signal to neurons in the next layer. The pattern of activation therefore spreads forward until it reaches the output layer and is there returned as a solution to the presented input. Just as in the nervous system of biological organisms, neural networks learn and fine tune their performance over time via repeated rounds of adjusting their thresholds until the actual output ma tches the desired in detecting the SNPs disease associations [17]. Gunther et al. (2009), through computational simulation, demonstrated that neural networks outperform log istic

PAGE 79

79 regression and MDR methods in modeling epistatic effects of genetic factors contributing to the genetic basis of complex diseases [226]. However, note that despite the high power of neural networks in solving complex problems, their performance is co ntingent upon defining a neural architecture a priori, which may not always be an easy and feasible task [17,227]. In addition, while attempting to optimize an error function, a neural network may become trapped in a local optimum and fail to identify the best absolute solution to the problem under investigation [227] Evolutionary computational strategies such as Genetic Algorithm (GA) and Genetic Programming (GP) constitute another class of pattern recognition based methods [17] which take advantage of op erators inspired by evolutionary biology (like fitness, selection, cross over, and mutation) to perform a more efficient search over an extremely large search space (solution space). These methods take use of parallel search algorithms to find an optimized solution to the problem of interest [221,222]. A Genetic Algorithm (GA) is a non deterministic search algorithm which optimizes a fitness function over a very large, multidimensional search space. It has proven itself as a highly powerful method in solvin g complex problems in many disciplines ranging from engineering to biology [222]. The outline of the procedure employed by a GA is as follows: upon defining the problem of interest, a random subset of all potential solutions ( organisms ) to the problem is g enerated. Each solution is a unique combination of variables generated according to a pre specified mathematical or logical model, and is represented as an encoded representation ( artificial chromosome )

PAGE 80

80 of the set of parameters describing it. The fitness o f solutions is evaluated according to a fitness function to determine how well they solve the problem of interest. Solutions are then optimized through a simulated process of evolution using operators borrowed from evolutionary biology such as mutation cr ossover and selection Through the process of optimization, the parent solutions in each generation undergo changes by operators like mutation (which causes point alterations in the structure of each artificial chromosome) and crossover (which allows two d ifferent artificial chromosomes to exchange parts of their structure to give rise to offspring solutions) to give birth to offspring solutions. The fitness of offspring solutions is then evaluated, and a subset of parent and offspring solutions is selected to populate the next generation on the basis of a method of selection. For instance, the method of selection may promote the selection of the parent and offspring organisms with the highest fitness. This cycle continues until a solution with a desired fit ness level is generated, or the maximum number of generations determined by the investigator is reached [218,221,222] The GA method enjoys high flexibility and efficiency in searching through an extremely large solution space which may not be exhaustively searched in a timely manner, and through a complex fitness landscape with many local optima [222]. For example in a previous application, GA had a promising performance in finding a solution for a complex engineering problem with a solution space of 10 327 potential solutions in two days using an ordinary engineering computer, a problem which could be solved by traditional methods in five years [222,228]. GAs can also successfully deal with

PAGE 81

81 problems whose solution requires optimizing many parameters simulta neously [222,229,230]. Although (as with any other method relying on heuristic search algorithms) GAs may fail to find the absolute best solution, the employment of biological operators along with performing parallel search over the solution space provides GA based methods with exceptional performance in finding the global optimum or a local optimum close to the global one. While the solution space is searched by a GA engine, artificial chromosomes have the opportunity to explore the surrounding fitness lan dscape through mutation, and to exchange the information, they have obtained, through crossover in order to give rise to new offspring artificial chromosomes which inherit the positive properties of both of their parents. The selection then ensures that mo re fit artificial chromosomes and schemata have greater chance of survival and r eproduction than less fit ones [222]. The high power of genetic algorithms in exploring complex fitness landscapes has largely resulted from exploring the search space in multi ple directions parallel to each other at the same time, through both explicitly evaluating the fitness of a large number of artificial chromosomes and implicitly evaluating the fitness of an extremely large number of schemata, as stated by schema theorem [ 222,228]. The outline of schema theorem is that while a genetic algorithm evaluates a population of artificial chromosomes, besides the evaluation of fitness of each artificial chromosome, it implicitly keeps track of the fitness of all schemata existing i n the artificial chromosomes. For example supposing an artificial chromosome is a string like

PAGE 82

82 star units in the string and its average fitness is defined as the average fitness of all artificial chromosomes containing that schema. By keeping track of the average fitness of different schemata, GAs have a tendency to increase the number of low order schemata w hose fitness is above the average fitness of the entire set of schemata [218,228]. Moreover, through adjusting the mutation and crossover rates, increasing the number of artificial chromosomes in each generation, and applying a selection method which ensur es diversity of the population of artificial chromosomes in upcoming generations, the risk of drift and being trapped in a local optimum can be decreased [218]. However note that in spite of the high power of GA based methods, there is no standard single G A framework that can be applied on var ious kinds of problems. Instead, the way through which solutions are encoded as artificial chromosomes, or the definition of fitness function are problem specific issues which may affect the power of GAs considerably i f not addressed corre ctly by the investigator [222]. Moreover, the parameters used by GAs, such as population size, mutation or crossover rates, and the selection method, need to be properly adjusted. For example using a small population size, or a selecti on method which is highly biased towards the most fit artificial chromosomes may results in an early convergence to a local optimum [222,229]. If the mutation rate is too high, the GA may fail to converge to a very good solution, except for the time elitis t selection method is applied by GA [218], because considerable changes in artificial chromosomes may disrupt the efficient schemata

PAGE 83

83 [222]. On the other hand, if the mutation rate is too small or the crossover rate is high, the risk of drift and early conv ergence to a local optimum increases [218]. Finally, the traditional methods may outperform GAs in solving problems which are analytically solvable in a timely manner [222]. Genetic programing (GP), developed by John Koza (1992) [223], is another evolution ary computational strategy which is similar to genetic algorithm in essence, with the difference that instead of optimizing a population of artificial chromosomes defined by the same mathematical or logical model, GP optimizes a population of computer prog rams (solutions) based on their performance in solving the problem of interest. Each computer program is a unique combination of variables and constants through a different mathematical or logical model. To illustrate this, each solution can be considered as a tree in which the root and internal nodes are mathematical or logical operators and the leaf nodes (terminal nodes) are variables and constants [222,223,227]. As with GAs, solutions are evaluated using a fitness function to determine their goodness of fit, and then undergo mutation and crossover (which act on the internal nodes to replace the node or the node and its corresponding branches) to give rise to offspring computer programs. The cycle of fitness evaluation and reproduction is repeated times a nd times in order to find a highly fit solution for the problem of interest [223,227]. In an example of applying genetic programming to explore higher order genetic interactions, Nunkesser et al. (2007) developed the GPAS (Genetic Programming for Associati on Studies) method. They demonstrated that the method was able to find

PAGE 84

84 higher order interactions in both small and large scale association datasets. For example GPAS came up with a genotypic combination of eight SNPs located in the ERCC2, HSD17B1, GBR2, ER K2, GSTP1, XRCC1 and MDR1 genes which was present in 81 cases of breast cancer and only 12 controls. Also, their method was successfully capable of separating individuals with Japanese and Han Chinese origin based on the higher order combination of multipl e SNPs for which genotyping data were available in the HapMap dataset [231].

PAGE 85

85 CHAPTER 3 KBAS METHOD DEVELOPMENT Method Overview I propose a hypothesis based method, called the Knowledge B ased Association Studies (KBAS), which combines a heuristic computati onal approach to the analysis of genome wide association data with pre existing biological knowledge in order to increase our understanding of the genetic architecture of complex phenotypes. Figure 3 1 illustrates the basic outline of the KBAS method. The w ork flow of KBAS begins with a hypothesis which is a statement that a particular set of genetic factors (for example a set of genes belonging to a pathway) contributes to the trait under investigation. Given a hypothesis formulated by the investigator, a se t of markers (typically SNPs) is extracted and the genotypes of subjects in case and control groups at the marker loci under consideration are converted to numerical values using an encoding scheme At the next step, the optimal combination of markers and their corresponding weights leading to a model with maximized classification ability to separate cases from controls is determined through an iterative adjustment procedure using Genetic Algorithms (GA) [221], which generates tests and refines different m odels relevant to the hypothesis under investigation. At first, a user defined number of potential solutions to the problem under consideration is generated by the GA engine of KBAS. Each solution ( organism ) is a unique combination of markers along with th eir respective weights assigned to them by the GA engine. The weight of each marker quantifies its contribution to the overall genotype phenotype association. A binary encoded representation of parameters (the

PAGE 86

86 weight of markers in this study) describing a certain solution is called an artificial chromosome Different artificial chromosomes are therefore the same in terms of the number of marker loci they contain and are different in terms of the weights of the marker loci. The combination of the set of mark ers having non zero weights in an artificial chromosome and their corresponding weights represents a model The goodness of fit of each model in solving the problem under consideration is evaluated by a fitness function through a two step procedure: at fir st a score is assigned to each subject in the case and control groups by an appropriate mathematical function on the basis of his/her genotype The distributions of scores are then compared between the accurately classify subjects as affected or not affected which in turn serves as the fitness value of that model The idea at the base of the KBAS method is that, if the set of markers included in a model is relevant to the phenotype, their joint signal, obtained by combining their individual association signals into an overall score variable, will be able to accurately classify cases and controls. Models are then optimized through a simulated process of evolution using operators like mutation, crossover a nd selection. The optimization process ensures that more fit models have more chance to emerge, survive and reproduce than less fit models Once the GA engine stops, either because it finds a model with the desired fitness level or reaches the maximum numb er of generations, the best fit model is retrieved as the successful model. If the successful model has a fitness value more extreme than a pre specified desired fitness level (adjusted for multiple testing for

PAGE 87

87 example according to Bonferroni's correction [183,232] to account for the number of artificial chromosomes of simulated generations, of investigated hypotheses and of pair wise case control comparisons performed in the analysis) and its corresponding fitness value proves significant in permutation testing, the model is considered as a trait associated model, and is then subject to replication study and further statistical analyses such as logistic regression analysis. A detailed explanation of the steps employed by KBAS is provided in the following sections Step 1: Hypothesis Generation The KBAS method is aimed at verifying or disproving a hypothesis put forth by the user. In this context, the hypothesis is the statement that a particular set of genetic factors contributes to the trait under invest igation. The set of genes to be included in a hypothesis will in general be determined by the investigator on the basis of existing biological knowledge. Biochemical and regulatory pathways, gene ontology classes, gene expression databases, protein protein interactions databases, and biomedical literature are some examples of sources of information that can be used to generate relevant hypotheses [26] Once a hypothesis is formulated, the KBAS method tests whether a sub set of genetic markers related to the specified set of genetic factors (e.g. genes) is able to precisely separate cases from controls, when their genotype values are converted into a single score variable through an appropriate mathematical formula In the following sections I will assume that the markers under consideration are SNPs, and therefore only exhibit two alleles, but the method can be applied to any kind of polymorphic marker, given a way of consistently encoding its genotype classes into numerical

PAGE 88

88 values. In a typical scenario, an i nvestigator may wish to determine which of two different sets of genes is more likely to be involved in a disease of interest. This question can be answered by creating two competing hypotheses, each producing a set of SNPs belonging to one of the two gene sets, and evaluating them on the basis of their power to discriminate cases from controls. Before proceeding to the hypothesis testing and refinement step, the marker genotypes in the contrasting groups under investigation are converted to numerical value s using an encoding sche me For instance under the assumption that the alleles at each SNP locus of interest contribute to the phenotype co dominantly and by representing the major allele (A) at each locus as 0 and the minor allele (B) as 1 t he three pos sible genotypes can therefore be encoded as follows: AA=0, AB=1, BB=2 (i.e. the value is the number of minor alleles present in the genotype). The current implementation of KBAS provides user s with the opportunity to specify the desired encoding scheme. Fo r example, a lternative encoding schemes to represent dominant [ AA vs. (AB+BB)], over dominant [ AB vs. (AA+BB)] or recessive [( AA+AB) vs. BB] intra genic allelic relationships can easily be adopted. Step 2: Hypothesis Testing and Refinement by a Genetic Alg orithm Engine KBAS uses a Genetic Algorithm (GA) which generates, tests and refines multi SNP models related to the hypothesis under investigation. The weights of SNPs are initialized at random in initial population of artificial chromosomes and are adju sted over many simulated generations, in order to give birth to a multi SNP model with maximized ability to accurately classify subjects into the case and control groups on the basis of the scores they receive. The GA used in this work is a variant of the CHC class of genetic

PAGE 89

89 algo rithms which was completely rewritten to serve the purpose of this study. The CHC Cross generational elitist selection heterogeneous recombinatio n (by incest prevention) and [234,235]. The CHC class of GAs combines a highly disruptive recombination method ( H alf Uniform C rossover ), which causes each pair of mating artificial chromosomes to exchange half of their different bits only if they are structurally distant enough ( incest prevention ), with a conservative elitist selection method, which selects the best fit parent and offspring artificial chromosomes in each generation to populate the next generation [ 233, 235]. It has been therefore suggested that CHC may outperform other GA variants [233,236]. The CHC pseudo code is as follows [218,233]: 1. A subset of N potential solutions ( organisms ) to the problem under investigation is randomly generated to populate the initial populatio n. Each solution is represented as a bit vector ( artificial chromosome ). 2. The fitness of each artificial chromosome in the initial population is determined using a fitness function. 3. Artificial chromosomes are randomly paired. 4. For each pair, if the distance of the two artificial chromosomes is above a p re specified incest threshold, they are mated T hrough the mating process they exchange half of their differing bits by crossover, and are also subject to point changes caused by mutation ( a detail ed explanatio n is provided in Chapter 3, artificial chromosomes is generated. 5. The fitness of each offspring artificial chromosome is determined. 6. Parent and offspring artificial chromosomes are pooled into a temporary population and ranked based on their fitness from highest to lowest. The N best ranking artificial chromosomes are then selected to populate the next generation.

PAGE 90

90 7. Steps 3 to 6 are repeated until a particular termination criterion (f or instance a pre specified number of generations or a pre specified fitness level) is satisfied. 8. If divergence occurs (i.e. the artificial chromosomes populating a generation are the same as those populating the previous generation), the incest threshold will be decreased step by step and the aforementioned process is repeated until the time in a generation the incest threshold becomes less than zero. In such situations, all the artificial chromosomes in that generation are killed except with the one with highest fitness. Then (N 1) new artificial chromosomes are generated based on mutating the survived artificial chromosome Steps 3 to 8 are then repeated Figure 3 2 illustrates the steps applied by CHC algorithm to search the solution space. Essential el ements related to the GA engine employed by KBAS are as follows: 1 Problem under Consideration The problem whose solution is supposed to be found by KBAS is to find a subset of genetic markers related to the hypothesis under investigation which jointly h ave statistically different contribution to the contrasting phenotype states (e.g. diseased vs. healthy phenotypes). 2 Search Space The search space consists of all potential solutions to the problem of interest. In other words, supposing the marke r set s elected by a hypothesis contains N SNPs, the search space consists of all potential single marker models as well as all potential multi SNP models consisting of pair wise, three way combinations of the selected SNPs. Once the GA engine starts running, a random subset of the entire set of potential solutions to the problem of interest is generated by GA to populate the first generation as the initial population of artificial chromosomes

PAGE 91

91 3 Population Size The choice of population size is a pro blem specific issue. However, it is obvious that with a larger population size, the GA engine may have a smaller chance of being trapped in local optima while investigating a complex fitness landscape because it can more comprehensively explore the search space [218]. In the current implementation, KBAS provides user s with the opportunity to select the popul ation size as a parameter of GA. 4 Methods of Representation T he standard practice of encoding each model parameter (in this case, each marker weight) as a binary number consisting of the appropriate number of bits is followed to represent a rtificial chromosomes Artificial chromosomes are therefore represented by bit vectors whose length is proportional to the number of SNPs in the set of marker s selec ted by the hypothesis under investigation In the current implementation, KBAS provides two alternative options to convert the bit patterns into weights: The first option is a direct binary to decimal conversion in which each SNP weight is a value greater than or equal to zero. This value ranges from zero to a maximum amount which is determined based on the number of bits used to represent weights. In the simplest case only one bit is used to represent each weight: the weight therefore indicates whether the corresponding SNP is present in the model or not. With 3 bits, the range of weights is between 0 and 7. The alternative option produces both negative and positive values. This option has been considered to cover cases in which a given SNP may be a protec tive rather

PAGE 92

92 than predisposing factor. For example under this scenario weight values would be in the range 3 to +4 if 3 bits are used to represent weights. 5 Methods of Change The GA engine employed by KBAS takes advantage of two operators borrowed from evolutionary biology i.e. cross over and mutation, to modify parent artificial chromosomes in order to generate offspring artificial chromosomes Cross over. In the step in which artificial chromosomes are randomly paired, the similarity of the two artif icial chromosomes in each pair is determines by measuring their Hamming distance (D H ). Hamming distance is the number of positions in the two artificial chromosomes which have different values. For a given pair, if the Hamming distance is less than twice a n incest threshold, which has been specified as an internal parameter of the GA, the mating of the two artificial chromosomes is prevented. The reason is that such a mating would be useless because it would give rise to offspring that are too similar to th e parents that already exist. If the two artificial chromosomes in a given pair are distant enough, Half Uniform Crossover ( HUX ) occurs by which the two parent artificial chromosomes exchange half of their differing bits randomly to give rise to offsprin g artificial chromosomes [218,233] Mutation. Furthermore, in the mating step the GA engine randomly generates mutations in the structure of artificial chromosomes which help increase diversity of the population by giving rise to offspring artificial chrom osomes which have different structure from their parents Especially, such point mutations are of high importance for soft restarts when divergence (i.e. the artificial chromosomes populating a generation are the same as

PAGE 93

93 those populating the previous gene ration) occurs. In such cases, at first the GA engine attempts to overcome the issue of divergence by reducing the incest threshold step by step. However, when the incest threshold becomes negative in a generation (i.e. no further reduction is possible), t he GA kills all the artificial chromosomes in that generation except the one with the highest fitness. (N 1) new artificial chromosomes are then generated by randomly mutating different bits of the survived artificial chromosomes [218,233] The choice of m utation rate depends on the discretion of the investigator. In the current implementation, KBAS provides user s with the opportunity to select the mutation rate as a parameter of the GA. This parameter consists of three values: overall mutation non z 0 weights to non mutation rate. To reduce the number of non zero bits in the artificial chromosomes t he mutation rate can t herefore non zero weights to 0 weights transitions. This may in turn reduces the number of genetic markers (e.g. SNPs) in the retrieved successful model. 6 Fitness Function The definition of fitness function is a critical aspect in the use of a GA. Fitness function provides a criterion for measuring the fi tness of each organism ( artificial chromosome ) based on which the organisms are ranked, selected and given more chance to reproduce. The GA engine attempts to maximize the fitness function over the search space b y favoring the survival and reproduction of solutions capable of better solving the problem. Therefore, it is expected that the average fitness of population of artificial chromosomes will increase gradually (i.e. the average fitness of population in

PAGE 94

94 every generation is expected to be higher than th at in previous generation ) [218,221,222]. Note that the exact definition of the fitness function is a problem specific issue and there is no general framework that can be applied to all kinds of problems under investigation [218] In the current implementa tion of KBAS, the evaluation of fitness of each model is performed through a two step procedure: at first a score is assigned to each subject in the contrasting groups (e.g. case vs. control groups) by combining his/her genotype values at the marker loci p resent in the model through an appropriate mathematical formula T he distributions of the generated score variable are then compared between the contrasting groups to evaluate the model s classification ability. In other words, the fitness of a model is a based on the set of scores assigned to the subjects in the contrasting gro ups by th at model. To this end, a number of different fitness functions have thus far been designed. Using several a lternative fitness functions will allow the user to test and find the best fitness function capable of identifying a combination of genetic markers jointly contributing to the phenotype of interest. The way the score variable is defined and compared betwee n case and control groups is different from one fitness function to the others. A more detailed description of the alternative fitness functions employed by KBAS will be provided in the following sections. 6 1 F 1 f itness f unction Score variable. For each subject in the case and control groups and for each model, at first the weight of each marker present in the model is multipl ied by the

PAGE 95

95 numerical value of the subject's genotype (e.g. AA=0, AB=1, BB=2) at th at marker locus. A score is then computed as the sum of these products calculated for every marker loci in the model according to the fol l owing formula: (3 1) Where S i is the score assigned to the subject i W j is the weight of marker locus j i n the model under consideration, G ij is the numerical v alue of the genotype of the subject i at marker locus j and n is the number of markers in the model under investigation. Method of fitness evaluation. After the scores were assigned to every subjects in the case and control groups by a model the overall average of scores in the pooled population of the two contrasting groups, and the average of scores in case and control groups are determined. If the average of the case group is greater than the overall average, the fitness of the model is defined as the sum of the number of cases whose scores are smaller than the overall average and the number of controls whose scores are greater than the overall average. In other words, the fitness function measures the s and the controls, under the assumption that the scores for cases are in general greater than the scores for controls. If the average of the case group is smaller than the overall average, the fitness of the model is defined as the sum of the number of ca ses whose scores are greater than the overall average and the number of controls whose scores are smaller than the overall

PAGE 96

96 the scores for the cases and the controls, und er the assumption that the scores for cases are in general smaller than the scores for controls. In the ideal case, the two sets of scores will be perfectly separated, and the fitness value will be 0. Therefore, the F 1 Fitness function favors the lower val ues of the fitness 6 2 F 2 f itness f unction In order to reduce the risk of over fitting which may happen in cases where models contain a large number of markers, The F 2 fitness function has been developed which is similar to the F 1 fitness function, excep t that to generate more parsimonious models, the number of markers included in the model ( i.e the number of SNPs with a non zero weight in an artificial chromosome ) is taken into account when calculating the fitness. Score variable. For each subject in th e case and control groups and for each model, a score is computed exactly as was explained for F 1 fitness function. Method of fitness evaluation. The fitness of a model is determined by adding the fitness value obtained by the method of fitness evaluation described for F 1 fitness function, to the product of the number of markers in the model multiplied by an appropriate factor. The choice of this factor is arbitrary and depends on the discretion of the investigators. The factor has been set to 40 by default Consequently, if two models have the same classification accuracy, the one with fewer SNPs will be favored by F 2 fitness function. 6 3 F 3 f itness f unction Score variable. For each subject in the case and control groups and for each model, a score is def ined as the logarithm of ratio of the conditional probabilities of the

PAGE 97

97 subject's genotype under the two phenotype states of interest (e.g. diseased vs. healthy), according to the following formula inspired by the definition of Bayes factor [237]: (3 2) Where S i is the score assigned to the subject i, G i is the genotype of individual i for the set of SNPs included in the model, and State 1 and State 2 are the two phenotype states being compared to each other, i.e. case vs. control Method of fitness evaluation. Once the scores are computed for all subjects in the case and control groups, their distributions are compared between the contrasting groups using a two sample t test, and t he t statistic or the p value resulting fro m the comparison of the score distributions is then used as the fitness measure for the model under investigation. The t test is a parametric statistical method which determines whether the difference between the population means for two groups (signal) is significantly different from the background variation (noise) [238] The rationale behind the use of this fitness function is that a model with high fitness consists of a set of markers whose joint contribution to the two contrasting phenotype states (e.g diseased vs. healthy phenotypes) is significantly different. Such model is expected to generate highly different score distributions in the two groups leading to a more extreme t statistic and a smaller p value compared to those generated by a model with low fitness. Therefore, models with larger t statistics and smaller fitness p values are preferred by the F 3 fitness function

PAGE 98

98 6 4 F 4 f itness f unction Score variable. For each subject in the case and control groups and for each model, a score is compute d exactly as was explained for F 3 fitness function. Method of Fitness Evaluation. Once the scores are computed for all subjects in the case and control groups, the fitness is defined as the sum of the number of cases whose scores are smaller than zero and the number of controls whose scores are greater than zero. The rationale behind this fitness function is that because for each subject the score is calculated as the ratio of his/her genotype under the assumptions of being a disease associated genotype an d a healthy phenotype associated genotype, it is expected that a model containing a combination of markers which collectively contribute to the phenotype of interest, gives rise to some combinations of genotypes found more often in the case group and some combinations of genotypes found more often in the control group. Therefore, the subjects in the case group are expected to mostly have a ratio over 1, which is converted to a positive score at logarithm scale, and the subjects in the control group are exp ected to mostly have a ratio less than 1, which is converted to a negative score at logarithm scale. In the ideal case, the two sets of scores will be perfectly separated, and the fitness value will be 0. As a result, a model with lower misclassification o f case control subjects is favored by the F 4 Fitness function. 6 5 F 5 Fitness Function Score variable. For each subject in the case and control groups and for each model, a score is computed based on a weighted additive model, exactly as was explained fo r the F 1 fitness function.

PAGE 99

99 Method of fitness evaluation. Once the scores have been computed for all the subjects, the distribution of scores is compared between case and control groups using a two sample t test [238] to determine if the between group diffe rence is statistically different from background noise. The t statistic or the p value r esulting from the comparison of the score distributions is then used as the fitness measure. The larger the t statistic is, and therefore the smaller the p value of a m odel is, the more statistically different the two groups are. Therefore, models with larger t statistics and smaller p values are favored by F 5 fitness function. 6 6 F 6 Fitness Function Score variable. For each subject in the case and control groups and f or each model, a score is computed based on a weighted additive model, exactly as was explained for the F 1 fitness function. Method of fitness evaluation. The fitness of a model is defined as the sum of its corresponding t statistics ; calculated exactly th e same as was explained for F 5 fitness function; and a factor which takes into account the number of markers in the model. This factor is calculated using the following formula: [1 r], where r is the ratio of the number of markers included in the model to the total number of markers an artificial chromosome contains (which is equal to the total number of markers present in the initial set of markers). Therefore, models with fewer markers which produce more statistically different score distributions in the two contrasting groups, are favored by F 6 fitness function.

PAGE 100

100 6 7 F 7 Fitness Function Score variable. For each subject in the case and control groups and for each model, the ratio of the conditional probabilities of the subject's genotype under the two phen otype states of interest (e.g. diseased vs. healthy) is calculated for each marker locus present in the model A score is then computed as the sum of these ratios calculated for every marker loci in the model according to the following formula : (3 3) Where S i is the score assigned to the subject i G ij is the genotype of individual i at locus j State 1 and State 2 are the two phenotype states being compared to each other, i.e. case vs. control, and n is the number of markers in the model under investig ation Method of Fitness Evaluation. Once the scores are computed for all subjects in the case and control groups, their distributions are compared using a two sample t test [238], as was explained for F 3 fitness function. Therefore, models giving rise to larger t statistics and smaller p values are favored by F 7 fitness function. 6 8 F 8 Fitness Function Score variable. For each subject in the case and control groups and for each model, the ratio of the conditional probabilities of the subject's genotype u nder the two phenotype states of interest (e.g. diseased vs. healthy) is calculated for each marker locus present in the model and is multiplied by the numeric al value of the subject's

PAGE 101

101 genotype at that locus. A score is then computed as the sum of these p roducts calculated for every marker loci in the model according to the following formula: (3 4) Where S i is the score assigned to the subject i G ij is the genotype of individual i at locus j State 1 and State 2 are the two phenotype states being comp ared to each other, i.e. case vs. control, and n is the number of markers in the model under investigation. Method of Fitness Evaluation. Once the scores are computed for all subjects in the case and control groups, their distributions are compared using a two sample t test [238], as was explained for F 3 fitness function. Therefore, models giving rise to larger t statistics and smaller p values are favored by F 8 fitness function 6 9 F 9 f itness f unction The idea at the basis of this fitness function is to extend the application of KBAS to situations in which the combinatorial effects of genetic factors with quantitative measurements (such as the data obtained through microarray or RNA Seq experiments) on a complex phenotype is sought. However, note that th is fitness function can still be used for capturing the combinatorial effects of genetic factors with discrete measurements (such as SNP genotypes obtained from a genome wide association study) on a complex phenotype. Score variable. F or each subject in th e case and control groups and for each

PAGE 102

102 corresponding to the genetic factors included in that model. The distributions of vectors in case and control groups are assumed to have a multivariate normal distribution characterized by a mean vector (X) and a matrix of variance covariance (C) in each group [239,240] Method of Fitness Evaluation. Once the scores (vectors) are obtained for all subjects in the case and control groups, the mean vectors (i.e. X case and X control ) and variance covariance matrices (i.e. C case and C control ) are computed for case and control groups. An unbiased pooled variance covariance matrix (C) and a Hotelling's T square statistic ( T 2 statistic ) are then obt ained according to the following formulas [239,240] : (3 5) (3 6) Where, n case and n control are the number of subjects in case and control groups respectively, X case and X control are the vectors of mean in case and control groups respectively, (X case X control )' is the transpose of the vector of mean differences, and C 1 is the inverse of the pooled variance covariance matrix. To test if T 2 statistic is significant, an F statistic is then calculated (3 7) Where, p is the number of genetic factors in the model under investigation.

PAGE 103

103 F statistic has an F distribution with p and (n case + n control p 1) degrees of freedom corresponding to the numerato r and denominator of the fraction, respectively. The F statistic or the p value obtained from a model will then serve as the fitness criterion. An F statistic or a p value more extreme than the significance level indicates that the distributions of scores of the two groups are significantly different. Therefore, models giving rise to larger F statistic and smaller fitness p value are favored by the F 9 Fitness Function. 7 Method of Selection Once in a generation the fitness of each parent and offspring arti ficial chromosome is determined by the fitness function, t he GA applies a method of selection to choose the highly fit artificial chromosomes which will populate the next generation. In general, there are a number of different methods of selection which ca n potentially be employed by a GA such as elitist selection, fitness proportionate selection, roulette wheel selection, tournament selection, scaling selection, rank selection, generational selection, steady state selection, hierarchical selection [222] B ecause the GA employed by KBAS method is a variant of CHC class of genetic algorithms, it applies a c ross generational elitist selection as the method of selection [218,233] Given running KBAS using a population size of N, in each generation the fitness o f parent and offspring artificial chromosomes is determined. A temporary population containing all the parent and offspring artificial chromosomes is then created in which they are ranked based on their corresponding fitness values. The N best artificial c hromosomes are then selected to populate the next generation. Therefore,

PAGE 104

104 each generation may retain highly fit artificial chromosomes from previous generation and harbor newly generated highly fit offspring artificial chromosomes as well. Step 3: Permutat ion Test The successful model retrieved by the GA engine of KBAS is then evaluated by permutation testing to validate its suggested association with the trait of interest. Permutation testing (also called randomization testing) is a non parametric statist ical test which under the null hypothesis of no between group differences, tests if an observed test statistic has been obtained by chance ( i.e. sampling error). Especially in cases where datasets show deviation from the assumption of normality, permutatio n can be used as an alternative to routine parametric tests [241] In the current implementation of KBAS, h olding the values of the generated score variable constant, the case control labels of subjects in the original dataset are permuted. The successful model is then applied to the resulting permuted dataset and its corresponding fitness value is recorded. This process is repeated a large number of times to produce an empirical distribution of the fitness values, which can then be used to assess whether t he fitness value obtained from comparing original case control datasets is significant. If there are no between group differences, test statistics resulting from different permutation rounds are expected to be more or less similar. Therefore, the original fitness value is expected not to be an extreme value. On the other hand, in case a successful model consists of a set of markers whose joint contribution to the two contrasting phenotype states (e.g. diseased vs. healthy phenotypes) is significantly

PAGE 105

105 differ ent, it is expected that its original fitness value would represent an extreme value in the resulting empirical distribution [241]. Step 4: Association Criteria Verification If a successful model derived by the GA engine of KBAS shows a fitness value more extreme than the user specified fitness threshold and the model also proves significant under permutation testing, it is considered to be associated with the trait under investigation, and its fitness level can be used as a measure of the strength of the a ssociation. We can therefore conclude that the markers (e.g. SNPs) included in the model represent causal variants or are in linkage disequilibrium (LD) with causal variants in neighboring regions [6]. Step 5: Replication of the Results Like in any associa tion study, the significant findings should be further validated in an independent case control dataset to prove their replicability [147]. In the replication step, a score variable is generated and tested in the replication dataset using the same successf ul model that was generated and tested in the discovery dataset. If a model successfully discriminates cases from controls in the replication dataset and its permutation test p value obtained by performing permutation testing over the replication dataset i s also significant, it is considered replicated. Step 6: Further Statistical Analysis Once one or more successful models retrieved by the GA engine of KBAS prove significantly associated with the complex phenotype under investigation, further statistical a nalyses can then be performed at the discretion of the investigator. For

PAGE 106

106 instance, a simple logistic regression analysis [242] on the score variable generated by each of the successful models or generated by the entire set of markers present in a number of successful models can be fitted to determine the odds ratio of the successful models under investigation. A multivariate logistic regression analysis [242] can also be performed on a set of score variables created by trait associated multiple trait associ ated models to investigate the performance of the corresponding successful mo dels in a multivariate setting. The KBAS Software The KBAS method described here is implemented in a freely available software package as a 64bit GNU/Linux command line executable The program allows users to specify the input genotype datasets (case and control groups), and the GA parameters such as number of organisms, number of generations, weights encoding scheme, etc. It also provides users with an interface to the Genephony system ( http://genome.ufl.edu/gp/ ) [243] to automatically generate sets of SNPs related to the hypotheses under investigation on the basis of preexisting biological knowledge. The output of the program is a file co ntaining the names of the markers in the successful model with their respective weights and functional roles. The program can also perform permutation testing on the successful model and write the resulting p value on the output file. All the results descr ibed in the Chapter 4 of this dissertation have been generated using the KBAS software. A more detailed description of the KBAS software will be provided in the Chapter 5.

PAGE 107

107 Figure 3 1. A flow chart illustrating the steps app lied by the KBAS method

PAGE 108

108 Figure 3 2. A flow chart illustrating the steps applied by CHC class of genetic algorithm (GA).

PAGE 109

109 CHAPTER 4 KBAS METHOD EVALUATION AND RESULTS Method Evaluation Discovery Dataset I used the KBAS method to test a number of hyp otheses that genes in a set of pathways related to the activity and regulation of the immune system play a role in the development of Rheumatoid Arthritis (RA). In addition, I tested the applicability of the KBAS method to other diseases like Crohn s disea se (CD) and type I diabetes (T1D). In particular, the results related to CD will also be described in this manuscript. The tests were performed using genotype and phenotype data provided by the Wellcome Trust Case Control Consortium (WTCCC) [1] Patients i n the RA and CD groups served as cases, and healthy individuals in the 58C and NBS groups constituted the controls. Case and control groups contain ~2000 and ~1500 individuals respectively. Study subjects are of Caucasian ancestry, and each one was genotyp ed at around 500,000 SNPs using the Affymetrix platform (Affymetrix GeneChip 500K Mapping Array). The small values of the trend test's over dispersion parameter for RA and 1.11 for CD) based on principal component analysis indicates only trivial confounding effects exists related to the population stratification [1] The SNP genotypes were converted to numerical values by representing the major allele (A) at each locus as 0 and the minor allele (B) as 1. The three possible genotypes were therefore encoded as follows: AA=0, AB=1, BB=2. Alternative encoding schemes, for example to consider dominant or recessive effects, can easily be adopted.

PAGE 110

110 Using two contro l groups provides the advantage that the successful model produced by the comparison of the case group against one of the controls (e.g. RA vs. NBS) can be tested in two more comparisons, one between the case group and the other control group (e.g RA vs. 5 8C), and the other one between the two control groups (58C vs. NBS). This increases the robustness of the inferred associations and reduces the risk of overfitting, because the successful model should be able to separate the case group from the second cont rol group, and should not be able to separate the two control groups from each other. This in turn shows that results are reproducible and provides evidence that the GA is not simply learning to classify different groups of subjects or finding models consi sting of chance aggregations of SNPs due to small differences between the groups unrelated to the disease (such as those due to geographic or ancestral factors), but produces results that are directly related to the disease of interest. Rheumatoid arthriti s is an autoimmune disorder primarily affecting joints in a monoarticular or oligoarticular pattern with subsequent progression to a polyarthritis with clinical manifestations related to inflammation of synovial membrane, and articular cartilage erosion an d destruction. The etiology of the disease is unknown, but it seems both genetic susceptibility and environmental triggers contribute to the pathogenesis of the disease [187] A number of previously reported RA predisposing loci (for example see [ 1,11,188, 244 246]) implicate the role of genetic factors in the process of disease development.

PAGE 111

111 Crohn's disease is also a chronic immune mediated disorder characterized by recurrent transmural inflammation of the gastrointestinal tract [247] It has been suggested that an inappropriate immune response to microbial flora of the intestine in genetically susceptible individuals may play an important role in the development of this disease [1,248] A previous meta analysis reported over 71 CD associated loci [88] corrob orating the role of genetic factors in the disease susceptibility. The entire set of these predisposing loci accounts for about 21% of heritability of the disease [23] Hypothesis Generation Due to the suggested autoimmune nature of the rheumatoid arthriti s and Crohn's disease, I focused on 14 pathways related to different aspects of the human immune system as test set pathways. Moreover, I selected 10 other pathways involved in processes unrelated to the immune system, and therefore are likely to be irrele vant to the pathogenesis of RA and CD, to serve as negative controls in this study Table 4 1 provides complete details on both sets of pathways. Pathway definitions were retrieved from the KEGG database (Kyoto Encyclopedia of Genes and Genomes: http://www.genome.jp/kegg/ ) [249] For each pathway, I selected all validated SNPs found within the transcripts encoded by the genes belonging to it, extending up to 3,000 bp upstream and downstream. I then filtered this se t of SNPs to retain only those that were genotyped in the WTCCC study. To further reduce the number of selected SNPs, KBAS provides user s with an option to choose a subset of representative SNPs from each transcript (e.g. only one SNP), on the basis of a prioritizing rule that preferentially selects non synonymous coding SNPs, followed by promoter SNPs, exon intron junction SNPs,

PAGE 112

112 synonymous coding SNPs, other exonic SNPs, and intronic SNPs. This option may decrease the computational burden by reducing the complexity of search space, and facilitate the search procedure performed by GA engine of KBAS especially if the entire set of SNPs generated by a hypothesis is very large and contains a few thousands of SNPs. Table 4 2 summarizes the functional roles of the genotyped and validated SNPs related to the pathways under investigation. Hypotheses Testing and Refinement For the analysis of either RA or CD datasets versus the two control groups (i.e. NBS and 58C) and for each of the 24 initial sets of SNPs derive d from the pathways listed in Table 4 1 KBAS initialized a GA using a population of 200 candidate models (represented by artificial chromosomes, using a single bit for each SNP weight). Each artificial chromosome was initialized to contain non zero weight s for a random subset of approximately 10 SNPs. The population was then evolved over 500 simulated generations, and the artificial chromosome providing the best fitness value in the final generation was used to generate the successful model. T he fitness of a model is a measure of the model s ability to discriminate cases and controls, which in turn is calculated on the basis of the set of scores assigned by the corresponding model to the subjects in the case and in the control groups. In this study, t he sco re corresponding to a particular model is defined, for each subject, as the logarithm of ratio of the conditional probabilities of the subject's genotype under the diseased vs. healthy phenotype states ( see F 3 Once t he scores are computed for all subjects in the case and control groups, their distributions are compared using a two sample t test, and the p value resulting from the

PAGE 113

113 comparison of the score distributions is used as the fitness measure. A model with higher fitness is expected to generate highly different score distributions in the two contrasting groups which in turn result in a smaller fitness p value compared to the fitness p value generated by a model with low fitness. Note that the significance level fo r interpreting the significance of fitness p values was adjusted to 6.944x10 9 according to Bonferroni's correction [ 183, 232] given the number of artificial chromosomes (n=200), simulated generations (n=500), investigated pathways (n=24) and pair wise pop ulation comparisons (n=3) (e.g RA vs. NBS RA vs. 58C and 58C vs. NBS ) A fitness p value between 10 7 and 6.944x10 9 was considered borderline Permutation Test To validate the significance of the fitness p values generated in previous step, the goodness of fit of each model was assessed using permutation testing performed over 100,000 rounds of permutation In each round of permutation, the case control labels of the subjects were randomly permuted, and the model under consideration was applied to the re sulting permuted datasets in order to produce a fitness p value. After repeating this step for 100,000 times, the significance level of the original fitness p values; resulted from comparing original case control datasets; was specified using the empirical distribution resulted from the entire set of permuted fitness p values. For the analysis of either RA or CD datasets versus the two control groups (i.e. NBS and 58C), given the total number of 48 pair wise comparisons (the number of investigated pathways = 24 and the number of pair wise population comparisons = 2), a corrected significance threshold of 0.00104 was used according to Bonferroni's

PAGE 114

114 correction [ 183, 232] to interpret the results of permutation testing. A permutation test p value betwe en 0.00104 and 0.05 was considered borderline. Comparison of RA and CD Groups against the Pooled Population of the Two Control Groups To further investigate the goodness of fit of the retrieved successful models, the case groups under consideration (i.e. RA and CD gr oups) were compared with the pooled population of the two control groups present in the WTCCC dataset, here called CTR. For each subject in case and control groups and for each successful model derived from 24 pathways under consideration, the score variab le was generated using the same procedure as in the previous case control comparisons (i.e. RA vs. NBS and RA vs. 58C) and the distribution of score s was compared between case and control groups. The statistical significance of the fitness p value result ing from each model was validated through permutation testing. The significance threshold for interpreting fitness p values is the same as the one used in previous step. Given the total number of 24 pair wise comparisons (24 investigated pathways); a corre cted significance threshold of 0.00208 was used to interpret the results of permutation test according to Bonferroni's correction [ 183, 232] A permutation test p value between 0.00208 and 0. 0 5 was considered borderline. Association Criteria For each case vs. control comparison (e.g. RA vs. 58C, or CD vs. NBS), a model was considered significant if both its fitness and permutation test p values were

PAGE 115

115 significant, and was considered non significant if resulted in a non significant fitness p value or permutat ion test p value. In all other situations, it was considered borderline. In comparing a case group against two control groups (e.g. RA vs. 58C, and RA vs. NBS), a model was considered to be in significant association with the disease under investigation i f in both case vs. control comparisons proved significant, and was considered non associated with the disease if was non significant in at least one of the two comparisons. In all other situations, the model was considered in moderate (borderline) associat ion with the disease of interest. In comparing a case group against the pooled population of the two control groups (i.e. RA vs. CTR, and CD vs. CTR), a model was considered as a disease associated model if both its fitness and permutation test p values we re significant, and was considered non associated if at least one of its fitness p value or permutation test p value was non significant In all other situations, it was considered in moderate (borderline) association with the disease of interest. Replicat ion Dataset To investigate if the association detected for RA associated pathways can be generalized to other cohorts, the significance of the corresponding successful models was tested in an independent case control dataset of white Americans (NARAC) [246 ] This dataset contains genotype data for 908 patients suffering from RA as the case group (NARAC A) and for 1,260 healthy individuals as controls (NARAC C). All subjects are of Caucasian ancestry and were genotyped at 545,080 SNPs using the Illumina Infi nium HumanHap550 array. The set of SNPs genotyped in this dataset is not the same as that in the WTCCC study, due to the different genotyping platforms used.

PAGE 116

116 Therefore rel ying on linkage disequilibrium (LD) between markers in close proximity SNPs in the s uccessful models which had not been genotyped in NARAC study were replace d with the closest SNPs for which NARAC data was available. T he scores produced by each of the successful models under investigation were calculated for the NARAC dataset exactly the same as were calculated for the WTCCC dataset, and their distributions were compared between case (NARAC A) and control (NARAC C) groups. The fitness p value of each model was further validated by performing 100,000 rounds of permutation testing, and the r esulting p value was used as the replication criteria. Given the total number of pairwise comparisons (24 investigated pathways) ; an adjusted significance level of 0.00208 was used as the significance threshold for interpreting the results of permutation t esting according to Bonferroni's correction [ 183, 232] A permutation test p value between 0.00208 and 0.05 was considered borderline. Further Statistical Analysis: Logistic Regression Analysis and Disease Risk Score Class Diagram To evaluate how the disea se associated successful models produced by KBAS affect the risk of developing the disease under consideration, a multiple logistic regression model [242] was fitted using the scores produced by these models as the predictor variables, and the disease stat e (affected vs. healthy) as the response variable. Moreover, to investigate potential interactions among the successful models under consideration, all their pairwise interaction terms were included as predictor variables as well, and a stepwise selection procedure was applied ( to enter = 0.01 and to remove = 0.05) to only retain the statistically significant variables.

PAGE 117

117 At the next step, for each subject in the case and control groups, an overall score variable was computed on the basis of the entire se t of SNPs selected by the disease associated successful models A simple logistic regression model [242] was then fitted using the overall score variable as the predictor variable and the disease state as the response variable. The purpose of this analysis was to evaluate the applicability of the entire set of SNPs present in the disease associated models to predicting the risk of the disease. To illustrate the relationship between the overall score variable and the disease risk, the overall score variable was then discretized into multiple classes according to the score values corresponding to quantile values ( 2.5, 12, 21.5, 31, 40.5, 50, 59.5, 69, 78.5, 88, 97.5 % ), and a Disease risk Score class diagram was illustrated by computing the posterior probabili ty of disease development for each class using Bayes' formula [237] : (4 1) Where sc is the score class, and P( affected ) and P( healthy ) are the prior probabilities of a subject being in the affected or healthy groups re spectively. P( sc| affected ) and P( sc| healthy ) are conditional probabilities of a subject being in a particular score class given the two phenotype states of interest, i.e. affected vs. healthy respectively. This procedure was used to generate Figures 4 1 an d 4 2 illustrating Disease risk Score class diagrams for Rheumatoid Arthritis and Crohn's disease, respectively. Logistic regression analysis was performed using SAS software (v9.2).

PAGE 118

118 Results Analysis of the Rheumatoid Arthritis (RA) Dataset For each of the 24 initial sets of SNPs derived from the pathways listed in Table 4 1, KBAS initialized a GA using parameters mentioned in Chapter 4, sub section Tables 4 3 and 4 4 summarize the p va lues associated with the fitness of each successful model on the basis of the pairwise comparisons between the case group and the two control groups (i.e. RA vs. 58C and RA vs. NBS). To validate the significance of these fitness values, the goodness of fit of each model was assessed using permutation testing performed over 100,000 rounds. The p values obtained from permutation tests are also shown in Tables 4 3 and 4 4 next to the fitness p values. A fitness p value < 6.944x10 9 and a permutation test p val ue < 0.00104 were considered significant respectively ( see Chapter 4, and sub section s The association of each model with rheumatoid arthritis was inferred on the b asis of the criteria explained in Chapter 4, Test set pathways For each of the tested pathways, with the exception of the Fc epsilon RI signaling, Fc gamma R mediated phagocytosis, regu lation of autophagy and T cell receptor signaling pathways, KBAS was able to identify a model that classifies cases and controls with a highly significant p value, ranging from 2.55x10 9 to 1.19x10 23 for the RA vs. 58C comparisons and from 1.23x10 9 to 1. 64x10 22 for the RA vs. NBS comparisons. By contrast, the successful models were unable to separate the NBS and

PAGE 119

119 58C groups, and the p values of the comparisons were always non significant (see Table 4 3 ). The fact that successful models derived from these 10 pathways were able to separate case control groups with fitness values more extreme than the pre determined significance threshold comparing RA vs. NBS and RA vs. 58C, but were not able to discriminate the two control groups from each other, indicates t hat they represent sets of SNPs which could potentially be associated with rheumatoid arthritis. In permutation tests performed by comparing RA vs. 58C and RA vs. NBS, the fitness p values related to all these 10 models except for those related to the chem okine signaling and natural killer cell mediated cytotoxicity pathways, were validated with p values less than 10 5 suggesting their remarkable performance in distinguishing the RA cohort from unaffected individuals. The eight pathways that give rise to s ignificant p values in permutation test are therefore considered to be strongly associated with rheumatoid arthritis. Despite the high fitness values of the models obtained from the c hemokine signaling and n atural killer cell mediated cytotoxicity pathways permutation testing suggested only moderate association of these models with rheumatoid arthritis. While permutation testing of the model derived from the first pathway resulted in a significant p value for RA vs. 58C comparison ( p < 7.3x10 4 ) and a bord erline p value for RA vs. NBS comparison ( p < 3.36x10 3 ), the p values resulting from permutation testing of the model from the second pathway were borderline for RA vs. 58C comparison ( p < 2.63x10 3 ) and significant for RA vs. NBS comparison ( p < 08.5x10 4 ), respectively.

PAGE 120

120 The successful model from the T cell receptor signaling pathway yielded statistically borderline fitness values when comparing RA against 58C and NBS ( p < 5.83x10 8 and p < 8.03x10 8 respectively). Once again the comparison between the t wo control groups had a non significant p value. In permutation testing, the original case control p values appeared to be of borderline statistical significance for both the RA vs. 58C comparison ( p < 1.11x10 3 ) and the RA vs. NBS comparison ( p < 1.61x10 3 ). Therefore, the model derived from this pathway may also be in moderate association with RA. Finally, the successful models derived from the Fc epsilon RI signaling, Fc gamma R mediated phagocytosis and regulation of autophagy pathways are not considere d to be in association with RA because they did not produce significant fitness p values and the p values resulting from the corresponding permutation tests were non significant as well. Negative control pathways On all pathways in the negative control set the method failed to produce a model capable of classifying cases and controls with a statistically significant fitness p value. In addition, the results of all permutation tests were non significant ( p > 0.00104) (see Table 4 4 ). In three cases, namely insulin signaling, oxidative phosphorylation and spliceosome pathways, the fitness p values related to the selected successful models were closer to the significance threshold than those of other control pathways. However, the fact that neither of their co rresponding permutation tests was statistically significant confirms the poor goodness of fit of these models in discriminating cases from controls. These results are consistent with my prior expectation that these negative control

PAGE 121

121 pathways may not be rele vant to the pathogenesis of rheumatoid arthritis, and do not therefore lead to a significant separation of the case and control groups. Overall, these results indicate that the final models generated by KBAS are not just simple classifiers capable of learn ing the differences between two arbitrary groups of individuals. Instead, their classification power is a function of the biological relevance of criterion used to select the initial sets of SNPs: SNP sets derived from pathways that are more biologically r elevant to the trait of interest will lead to more accurate and powerful models. This feature is e specially important when one aims to test and rank alternative hypotheses about the trait under consideration. RA vs. CTR comparison T he reproducibility of th e observed associations was further evaluated by comparing the RA group versus the pooled population of the two control groups present in the WTCCC dataset [1] here called CTR, using the successful models retrieved by GA engine from the 24 pathways under investigation. For each model a fitness p value was obtained exactly through the same procedure as was explained for the previous case control comparisons (i.e. RA vs. NBS and RA vs. 58C) and the statistical significance of the resulting fitness p value wa s validated through permutation testing. A fitness p value less than 6.944x10 9 and a permutation test p value less than 0.00208 were considered significant based on the Bonferroni's correction method [ 183, 232] (see Chapter 4, Comparison of RA and CD Groups against the Pooled Population of the Two Control Groups A model was considered significant if both its fitness and permutation test p values were significant.

PAGE 122

122 As shown in Table 4 5, all 11 models found t o be in strong or moderate association with RA in the previous step (RA vs. NBS and RA vs. 58C) showed statistically significant performance in separating RA from CTR, with fitness values less than 2.90x10 10 which were then validated by permutation test p values less than 2x10 5 In addition, the model from the Fc gamma R mediated phagocytosis pathway also yielded significant fitness and permutation test p values ( p < 9.09x10 10 and p < 3x10 4 respectively). Although the multi SNP model related to the oxi dative phosphorylation pathway had a significant fitness p value ( p < 1.72x10 9 ), its corresponding permutation test p value was not significant ( p = 0.0046). Table 4 6 shows the list of selected SNPs in the successful models derived from pathways displayi ng strong or moderate association with RA, the genes and chromosomes they belong to, their positions on their respective chromosome, and their functional roles. The number of SNPs in the successful models ranged from three to eight, which in all cases is l ess than 1% of the total number of SNPs in the corresponding initial SNP sets (see Table 4 1). This indicates that KBAS was able to efficiently cope with the complexity of a very large multi dimensional search space. Out of 72 total SNPs in these 12 models two of them lie within the MHC region on 6p21 (chr 6: 29,910,021 33,498,585) which is a known hot spot for several autoimmune disorders. rs9272346 is a promoter SNP related to the HLA DQA1 gene which was previously shown to be associated with juvenile chronic arthritis [250] On the other hand rs2072633 located in an intron of the CFB gene, is neither associated with RA based on previous findings nor in LD with other RA associated SNPs according to

PAGE 123

123 the HapMap LD data (International HapMap Project: http://hapmap.n cbi.nlm.nih.gov/ ). The fact that KBAS detected two GWAS implicated SNPs is reassuring, although failure to detect others is not considered a weakne ss at all, given the completely different set of hypotheses used to find significant contributors. In fact it may shed light on why some SNPs are identified in both systems. None of the other SNPs present in the successful models or the genes to which the se SNPs belong is among RA associated loci in previously published genome wide association studies. There is also no evidence of linkage disequilibrium between theses SNPs and other RA associated SNPs according to the HapMap LD data. This indicates that th e results generated by KBAS are not simply a rediscovery of strongly associated individual SNPs. Instead, the KBAS method is able to elucidate the joint contribution of previously unknown sets of SNPs to the genetic architecture of rheumatoid arthritis. As seen in Table 4 6 the final models for seven of the tested pathways contained pairs of SNPs on the same chromosome. SNPs in these pairs are not in linkage disequilibrium with each other according to HapMap LD data. This shows that there is no redundancy in the selected SNPs, and therefore all the SNPs identified by the method were retained in their respective successful models. Multiple logistic regression analysis To evaluate the impact of the 12 successful models previously concluded to be in associatio n with RA on the risk of being affected by the disease, and to investigate potential interactions among them, a multiple logistic regression model was fitted by regressing t he disease state on the score variables produced by comparing RA vs. CTR using each of these 12 models along with their pairwise interaction terms (66 terms).

PAGE 124

124 The multivariate model obtained from the stepwise selection procedure contained the score variables related to all 12 pathways except for those related to the a ntigen processing a nd presentation and p hagosome pathways. Moreover, the interaction terms of the interaction between the c ytokine cytokine receptor interaction and l eukocyte trans endothelial migration pathways and between the c ytokine cytokine receptor interaction and the T cell receptor signaling pathways were also kept in the fitted regression model. The fitted regression model had overall model p values smaller than 10 4 and the covariates of the included terms were statistically significant with p values smaller than 0 .0028. All covariates were positively correlated with the disease susceptibility with odds ratios between 1.652 and 2.859, except for the two aforementioned interaction terms which were negatively correlated with odds ratios of 0.266 and 0.182 respectively The c statistic for the model was 0.687 and the p value of Hosmer Lemeshow test for the fitted model was non significant ( p > 0.05) indicating the model's goodness of fit. Table 4 7 summarizes these results. Replication of the Results To test if the det ected significant association between 12 immune system related pathways and RA can be generalized to other cohorts, the scores related to each of these successful models were calculated in the NARAC dataset [246], and their distributions were compared betw een case (NARAC A) and control (NARAC C) groups. After performing 100,000 rounds of permutation testing over each of the models, the significance of the primary fitness p value was determined and used as the replication

PAGE 125

125 criteria. The significance threshold s for interpreting the results were the same as used for the RA vs. CTR analysis. As seen in Table 4 5, five out of the 12 immune system related pathways showing moderate or strong association with RA in WTCCC dataset were replicated with fitness p values ranging from 5.85x10 15 to 7.17x10 25 and permutation test p values less than 10 5 In addition, three models derived from chemokine signaling, leukocyte trans endothelial migration and n atural killer cell mediated cytotoxicity pathways resulted in permut ation test p values of 0.03463, 0.0031 and 0.00793 respectively and therefore can be considered replicated at the significance level of 0.05. None of the models derived from the remaining 16 pathways gave rise to a significant p value in the permutation te sts. Due to the different genotyping platforms, a number of SNPs present in the successful models were replaced by new ones to conduct the replication study, as explained in Method Evaluation section. Table 4 8 summarizes the list of the original and subst ituted SNPs present in the eight replicated models. The substituted SNPs are in the range of 39 bps to 23,342 bps away from the original ones. Simple Logistic Regression Analysis and Disease Risk Score Class Diagram The disease state was regressed on the o verall score variable computed based on the all 44 SNPs present in the eight models replicated in the NARAC dataset [246] to evaluate the applicability of this single score variable in predicting the disease risk (see Table 4 5 and Table 4 6). The fitted r egression models for RA vs. CTR and NARAC A vs. NARAC C comparisons resulted in overall model p values less than 10 4 and the covariates of the overall score variables were statistically significant ( p < 10 4 ) showing positive association with the risk of rheumatoid arthritis with odds ratios of 2.398 (95%

PAGE 126

126 CI: 2.187 to 2.629) for the first comparison and 2.760 (95% CI: 2.439 to 3.123) for the second comparison. The Hosmer Lemeshow tests corresponding to the fitted models had p values greater than 0.05, ind icating the adequacy of their respective models. The c statistic for the model related to the RA vs. CTR comparison was 0.66, and the one for the NARAC A vs. NARAC C comparison was 0.719. Tables 4 9 and 4 10 summarize the results of simple logistic regres sion analysis based on the overall score variables. To illustrate how an increase in the value of the overall score variable influences the risk of disease development, the range of values corresponding to each overall score variable was discretized into 1 2 bins, and for each bin the posterior probability of being affected was calculated. As shown in Figure 4 1 for both case control comparisons the risk of development of rheumatoid arthritis increases as the score takes larger values. Comparing RA vs. CTR, this risk ranges from around 18% when the score class is lowest to around 75% when the score class is highest. For model related to the NARAC A vs. NARAC C comparison's model, the disease risk rises from around 2% for the lowest score class to around 75% for the highest score class. This indicates that the scores obtained from the total set of these 44 SNPs can be used as a predictor to estimate the probability by which an individual may develop the disease. Analysis of the Crohn's Disease (CD) Dataset The applicability of the KBAS method to other diseases was tested in Crohn s disease (CD) using the same set of 24 pathways used for RA analysis. Subjects in CD group served as cases and those in 58C and NBS groups served as controls (from [1] ). Each of the 2 4 initial sets of SNPs extracted from the pathways under investigation was analyzed by the GA engine of KBAS using the strategy explained in Chapter 4, the

PAGE 127

127 T he successful model derived from each pathway was then validated over 100,000 rounds of permutation testing. A fitness p value less than 6.944x10 9 and a permutation test p value less than 0.00104 were considered significant ( see Chapter 4, Testing and Refine sub section s section) The association of each model with Crohn's disease was inferred on the basis of the criteria explaine d in Chapter 4, the sub section of ction Test set and Negative Control Pathways As is evident in Tables 4 11 and 4 12, the three pathways namely, B cell receptor signaling, cytokine cytokine receptor interaction and T cell receptor signaling pathways gave rise to multi SNP models with stat istically significant performance in separating the disease group from the two control groups. While they had significant fitness p values comparing CD vs. 58C and CD vs. NBS ( p < 1.07x10 10 and p < 3.22x10 10 respectively), they were not able to separate the two control groups from each other ( p > 0.0078). The models related to B cell receptor signaling and T cell receptor signaling pathways also yielded significant p values in permutation tests on both the CD vs. 58C and the CD vs. NBS comparisons, and ar e therefore considered strongly associated with Crohn's disease. As the results of permutation testing indicate, the model derived from cytokine cytokine receptor interaction pathway was of borderline statistical significance in both CD vs. 58C and CD vs. NBS comparisons ( p < 0.00121 and p < 0.00367 respectively). This may be suggestive of moderate association between this multi SNP model and Crohn's disease.

PAGE 128

128 CD vs. CTR Comparison The classifying performance of successful models was also tested by comparin g disease group against pooled population of controls (CTR) (see Table 4 13 ). Results were interpreted based on the corrected significance thresholds as in the RA vs. CTR analysis. In this case, in addition to the three aforementioned ones, six more models proved significant, i.e. those derived from the antigen processing and presentation, c omplement and coagulation cascades, intestinal immune network for IgA production, leukocyte trans endothelial migration, natural killer cell mediated cytotoxicity and ox idative phosphorylation pathways. Their fitness p values ranged from 7.40x10 9 to 1.16x10 15 and were verified by their respective permutation test p values ranging form less than 8.5x10 4 to less than 10 5 These pathways may therefore be considered as be ing associated with Crohn's disease and should be subject to further validation studies in an independent dataset. Eight out these nine pathways were also in association with RA in the WTCCC dataset, and six of them were also among the pathways replicated in the NARAC dataset (see Chapter 4, Analysis of the Results ). Table 4 14 summarizes the list of SNPs included in the successful models showing association with Crohn's disease, the genes and chromosomes they belong to, their posi tions on their respective chromosome, and their functional roles. The number of SNPs in these successful models ranged from 4 to 11, which in all cases is less than 1% of the total number of SNPs in the corresponding initial SNP sets (see Table 4 1).

PAGE 129

129 None of the remaining SNPs in these nine models or the genes to which these SNPs belong is among or in linkage disequilibrium with Crohn's disease susceptibility loci detected by g enome wide association studies. Although SNPs rs2569094 from antigen processing and presentation rs9724615 from B cell receptor signaling rs428060 from c omplement and coagulation cascades rs2192752 and rs6876446 from cytokine cytokine receptor interaction rs1327473 from natural killer cell mediated cytotoxicity and cytokine cytokine receptor interaction, rs7555443 and rs3808917 from T cell receptor signaling and rs1000984 from oxidative phosphorylation pathways are located on chromosom e regions less than 1Mbps away from regions which have shown evidence of moderate or strong association with Crohn's disease in previous studies [1,88] there is no evidence of linkage disequilibrium between these nine SNPs and nearby CD associated SNPs ac cording to HapMap LD data (International HapMap Project: http://hapmap.ncbi.nlm.nih.gov ). As seen in Table 4 14 the successful models from the c omplement and coagulation cascades, cytokine cytokine receptor interaction, intestinal immune network for IgA production, leukocyte trans endothelial migration, T cell receptor signaling, and oxidative phosphorylation pathways contained pairs of SNPs on the same chromosome. The large distance between these pairs of SN Ps strongly decreases the chance of

PAGE 130

130 linkage disequilibrium between them. This is verified by the lack of evidence of linkage disequilibrium between the aforementioned SNP pairs according to HapMap LD data. Multiple Logistic Regression Analysis Table 4 15 s ummarizes the results obtained from multiple logistic regression analysis conducted for CD vs. CTR comparison. The analysis was performed by regressing disease state on the score variables derived from nine successful models demonstrating evidence of assoc iation with Crohn's disease (see Tables 4 13 and 4 14 ). The fitted regression model after step wise model selection procedure contained the score variables from all nine pathways except for those from the intestinal immune network for IgA production and na tural killer cell mediated cytotoxicity pathways. The overall model's p values and the p values of the included covariates were smaller than 10 4 All the included score variables were positively associated with the risk of Crohn's disease and their corres ponding odds ratios were in the range of 2.320 to 2.741. The c statistic for the fitted model was 0.622 and the model's goodness of fit was verified by a non significant Hosmer Lemeshow test ( p > 0.05). Simple Logistic Regression Analysis and Disease Risk Score Class Diagram Table 4 16 summarizes the results obtained from fitting a simple logistic regression model by regressing the disease state (CD vs. CTR) on the overall score variable derived from the entire set of 57 SNPs present i n the nine CD associat ed models (see Tables 4 13 and 4 14). The p values of the fitted regression model and of the covariates of the overall score variables were statistically significant ( p < 10 4 ), and the score variable was in positive association with the risk of disease wi th odds ratio of

PAGE 131

131 2.506 (95% CI: 2.219 to 2.831). The c statistic for the fitted model was 0.623 and t he Hosmer Lemeshow test was non significant ( p > 0.05). The Disease risk Score class diagram for the CD vs. CTR comparison (Figure 4 2 ) indicated that the risk of development of Crohn's disease increases with the increase in the score class values. This risk goes up from around 14% for the lowest score class to around 78% for the highest score class Comparative Analysis of the RA and CD Results Table 4 17 p rovides a comparative summary of the observed disease pathway associations analyzing RA and CD using WTCCC and NARAC datasets. While none of the negative control pathways showed association with both rheumatoid arthritis and Crohn's disease, there were mul tiple immune system related pathways in association with both diseases under investigation. The four models derived from antigen processing and presentation c omplement and coagulation cascades, intestinal immune network for IgA production, and T cell rece ptor signaling pathways were in strong association with rheumatoid arthritis comparing RA vs. CTR and NARAC A vs. NARAC C and with Crohn's disease comparing CD vs. CTR. Although models from leukocyte trans endothelial migration and natural killer cell medi ated cytotoxicity pathways showed strong association with both rheumatoid arthritis and Crohn's disease in the WTCCC dataset, they were in moderate association with RA in the NARAC dataset. Models derived from B cell receptor signaling and cytokine cytokin e receptor interaction pathways were in strong association with both diseases only in the WTCCC dataset.

PAGE 132

132 Table 4 1. The list of pathways included in this study and their characteristics.

PAGE 133

133 Ta ble 4 1. Continued

PAGE 134

13 4 Table 4 2. The number of genotyped and validated SNPs in each functional role class for the test and control pathways analyzed in this study. Pathway 5'UTR Promoter Coding Sequence Exon/Intron Boundary 3'UTR Intron Downstream Test Pathways Antigen Processing and Presentation 1 22 6 18 10 76 15 B cell Receptor Signaling 0 43 12 27 22 913 37 Chemokine Signaling 2 121 26 53 38 2 213 123 Complement and Coagulation Cascades 0 47 27 43 22 500 34 Cytokine Cytokine Receptor Interaction 4 168 37 67 57 1 338 140 Fc Epsilon RI Signaling 1 58 18 29 19 1 083 39 Fc Gamma R mediated Phagocytosis 1 32 22 47 22 1 668 50 Immune Network for IgA Production 0 12 0 8 8 92 12 Leukocyte Trans endothelial Migration 5 74 27 33 23 1 710 58 Natural Killer Cell Mediated Cytotoxicity 2 80 22 37 28 1 038 55 Phagosome 0 54 20 62 19 822 52 Regulation of Autophagy 0 11 2 4 9 123 11 T cell Receptor Sig naling 1 74 15 34 28 1 138 44 Toll like Receptor Signaling 2 52 8 27 17 429 50 Control Pathways Cardiac Muscle Contraction 1 45 12 32 16 1 298 26 Gap Junction 0 44 17 35 17 2 095 49 Glycolysis/Gluconeogenesis 1 35 15 16 19 252 54 Insulin Signaling 0 61 26 42 30 1 081 49

PAGE 135

135 Table 4 2. Continued. Pathway 5'UTR Promoter Coding Sequence Exon/Intron Boundary 3'UTR Intron Downstream Nucleotide Excision Repair 0 22 9 35 8 226 19 Oxidative Phosphorylation 1 41 13 26 9 344 44 Purine Metabolism 2 76 22 79 33 2 567 66 Pyrimidine Metabolism 1 35 11 39 22 585 31 Renin Angiotensin System 0 12 3 7 3 91 9 Spliceosome 2 38 7 40 27 294 66

PAGE 136

136 Table 4 3. The p values associated with the pairwise comparisons of the RA group and the two control groups using the success ful models derived from immune system related pathways. Pathway 58C vs. NBS RA vs. 58C RA vs. NBS Antigen Processing and Presentation 0.00589 1.03x10 9 < 10 5 3.96x10 16 < 10 5 B cell Receptor Signaling 0.04550 4.44x10 14 < 10 5 5.79x10 13 10 5 Chemokine Signaling > 0.05 9.76x10 11 0.00073 1.23x10 9 0.00336 Complement and Coagulation Cascades > 0.05 1.19x10 23 < 10 5 2.31x10 20 < 10 5 Cytokine Cytokine Receptor Interaction 0.04981 1.09x10 20 < 10 5 1.64x10 22 < 10 5 Fc Epsilon RI Signaling 0.03246 6.97x10 5 > 0.05 2.47x10 5 > 0.05 Fc Gamma R mediated Phagocytosis 0.04356 1.91x10 6 > 0.05 1.43x10 7 0.01143 Immune Network for IgA Production 0.00408 6.65x10 11 < 10 5 4.31x10 18 < 10 5 Leukocy te Trans endothelial Migration > 0.05 1.11x10 17 < 10 5 2.71x10 18 < 10 5 Natural Killer Cell Mediated Cytotoxicity 0.00437 2.55x10 9 0.00263 7.97x10 10 0.00085 Phagosome > 0.05 2.48x10 16 < 10 5 4.34x10 17 < 10 5 Regulation of Autophagy > 0.05 0.00941 > 0.05 0.01714 > 0.05 T cell Receptor Signaling 0.04116 5.83x10 8 0.00111 8.03x10 8 0.00161 Toll like Receptor Signaling > 0.05 6.82x10 14 < 10 5 2.68x10 12 < 10 5 The fitness p values measure the fitness of each successful model retrieved by Genetic Al gorithm engine. They are calculated by comparing original case and control datasets using corresponding successful models. Permutation test p values measure the significance of fitness p values of their corresponding successful model by comparing permuted case and control datasets. According to Bonferroni's correction, a fitness p value < 6.944x10 9 and a permutation test p value < 0.00104 were considered significant.

PAGE 137

137 Table 4 4. The p values associated with the pairwise comparisons of the RA group and the two control groups using successful models derived from negative control pathways. Pathway 58C vs. NBS RA vs. 58C RA vs. NBS Cardiac Muscle Contraction > 0.05 0.01392 > 0.05 0.01158 > 0.05 Gap Junction > 0.05 0.00051 > 0.05 0.00202 > 0.05 Glycolysis/Gluconeogenesis > 0.05 0.0007 > 0.05 0.00288 > 0.05 Insulin Signaling > 0.05 4.30x10 5 > 0.05 0.000533 > 0.05 Nucleotide Excision Repair > 0.05 0.00144 > 0.05 0.00084 > 0.05 Oxidative Phosphor ylation 0.02171 9.12 x10 8 0.04476 1.26x10 6 > 0.05 Purine Metabolism > 0.05 > 0.05 > 0.05 > 0.05 > 0.05 Pyrimidine Metabolism > 0.05 0.00782 > 0.05 0.03497 > 0.05 Renin Angiotensin System > 0.05 0.00273 > 0.05 0.02638 > 0.05 Spliceosome 0.02692 4.4 7x10 6 > 0.05 0.00024 > 0.05

PAGE 138

138 Table 4 5. The p values associated with the comparison of the RA and CTR groups and NARAC A and NARAC C using the successful models derived from pathways under consideration. Pathway RA vs. CTR NARAC A vs. NARAC C Test Pathways Antigen Processing and Presentation 8.17x10 17 < 10 5 1.47x10 17 < 10 5 B cell Receptor Signaling 8.31x10 18 < 10 5 4.54x10 5 > 0.05 Chemokine Signaling 1.44x10 13 < 10 5 6.47x10 8 0.03463 Comp lement and Coagulation Cascades 3.11x10 28 < 10 5 7.17x10 25 < 10 5 Cytokine Cytokine Receptor Interaction 5.51x10 30 < 10 5 6.34x10 5 > 0.05 Fc Epsilon RI Signaling 2.73x10 6 > 0.05 7.52x10 7 > 0.05 Fc Gamma R mediated Phagocytosis 9.09x10 10 3x10 4 5.65x10 17 < 10 5 Intestinal Immune Network for IgA Production 5.99x10 19 < 10 5 1.41x10 17 < 10 5 Leukocyte Trans endothelial Migration 4.61x10 25 < 10 5 4.44x10 9 0.00310 Natural Killer Cell Mediated Cytotoxicity 7.37x10 13 <10 5 5.15x10 9 0.00793 Ph agosome 2.34x10 23 < 10 5 0.00532 > 0.05 Regulation of Autophagy 0.00172 > 0.05 9.99x10 5 > 0.05 T cell Receptor Signaling 2.90x10 10 2 x10 5 5.85x10 15 < 10 5 Toll like Receptor Signaling 2.96x10 18 < 10 5 1.96x10 4 > 0.05 Cardiac Mu scle Contraction 0.00405 > 0.05 4.44x10 5 > 0.05 Gap Junction 7.72x10 5 0.04995 0.01804 > 0.05 Glycolysis/Gluconeogenesis 0.00025 > 0.05 3.51x10 5 > 0.05

PAGE 139

139 Table 4 5. Continued. Pathway RA vs. CTR NARAC A vs. NARAC C Insulin Signaling 2.37x10 6 0.00860 0.00186 > 0.05 Nucleotide Excision Repair 7.01x10 5 > 0.05 0.01216 > 0.05 Oxidative Phosphorylation 1.72x10 9 0.00460 2.77x10 6 > 0.05 Purine Metabolism 0.04224 > 0.05 > 0.05 > 0.05 Pyrimidine M etabolism 0.00371 > 0.05 0.00034 > 0.05 Renin Angiotensin System 0.00216 > 0.05 0.00248 > 0.05 Spliceosome 1.09x10 6 0.02772 5.78x10 6 > 0.05 According to Bonferroni's correction, a fitness p value < 6.944x10 9 and a permutation test p value < 0.00208 were considered significant for RA vs. CTR comparison. The p values of the models showing significant association with rheumatoid arthritis comparing RA vs. CTR are in bold. Of these 12 pathways, five were replicated in NARAC dataset at the significance l evel of 0.00208 and three were replicated at the significance level of 0.05.

PAGE 140

140 Table 4 6. The list of SNPs included in the successful models showing strong or moderate association with rheumatoid arthritis. SNP GENE NCBI GENE ID CHROMOSOME POSITION SNP R OLE Antigen Processing and Presentation Pathway rs9272346 HLA DQA1 3117 Chr6 32,604,372 Promoter rs7760860 NFYA 4800 Chr6 41,059,873 Intron rs7869876 HSPA5 3309 Chr9 127,994,564 Downstream rs10163054 PDIA3 2923 Chr15 44,047,638 Intron B cell Receptor Signaling Pathway rs2037547 GSK3B 2932 Chr3 119,544,615 3' UTR rs171649 PIK3R1 5295 Chr5 67,569,746 Exon/Intron boundary rs3735131 CARD11 84433 Chr7 2,962,293 Coding sequence rs2272733 IKBKB 3551 Chr8 42,157,902 Intron rs4645869 FOS 2353 Chr14 75,74 9,875 Downstream rs1013316 PRKCB 5579 Chr16 24,135,112 Exon/Intron boundary rs2855259 BTK 695 ChrX 100,615,478 Exon/Intron boundary Chemokine Signaling Pathway rs17110519 GNG5 2787 Chr1 84,968,749 Intron rs2734871 CXCR4 7852 Chr2 136,869,873 Downstrea m rs5029748 IKBKB 3551 Chr8 42,140,549 Intron rs3808391 LYN 4067 Chr8 56,790,649 Promoter rs2839686 CXCL12 6387 Chr10 44,881,455 Promoter rs1950501 ADCY4 196883 Chr14 24,806,800 Promoter rs2075019 SHC2 25759 Chr19 441,195 Intron

PAGE 141

141 Table 4 6. Continued. SNP GENE NCBI GENE ID CHROMOSOME POSITION SNP ROLE rs12394147 PRKX 5613 ChrX 3,539,248 Exon/Intron boundary Complement and Coagulation Pathway rs664142 C8A 731 Chr1 57,325,855 Intron rs698094 MASP1 5648 Chr3 186,970,808 Intron rs2072633 CFB 629 Ch r6 31,919,578 Intron Cytokine Cytokine Receptor Interaction Pathway rs3748669 IL24 11009 Chr1 207,077,023 3' UTR rs11758366 TNFRSF21 27242 Chr6 47,253,631 Exon/Intron boundary rs334349 TGFBR1 7046 Chr9 101,914,387 3' UTR rs2870946 IL26 55801 Chr12 68, 596,661 Intron rs8107861 IL29 282618 Chr19 39,784,152 Promoter rs2834213 IFNGR2 3460 Chr21 34,792,910 Intron rs5980890 EDA 1896 ChrX 69,232,272 Intron Fc Gamma R mediated Phagocytosis Pathway rs4971024 PIP5K1A 8394 Chr1 151,224,320 Downstream rs1020 4214 ASAP2 8853 Chr2 9,540,849 Exon/Intron boundary rs352084 MARCKS 4082 Chr6 114,183,385 3' UTR rs7803810 SCIN 85477 Chr7 12,607,971 Promoter rs12280627 PAK1 5058 Chr11 77,047,679 Intron rs901104 GAB2 9846 Chr11 77,930,499 Exon/Intron boundary rs3079 43 PRKCG 5582 Chr19 54,404,476 Intron

PAGE 142

142 Table 4 6. Continued SNP GENE NCBI GENE ID CHROMOSOME POSITION SNP ROLE Intestinal Immune Network for IgA Production Pathway rs291092 PIGR 5284 Chr1 207,113,230 Intron rs2734871 CXCR4 7852 Chr2 136,869,873 Down stream rs2236938 CCR9 10803 Chr3 45,938,939 Intron rs9272346 HLA DQA1 3117 Chr6 32,604,372 Promoter rs1800795 IL6 3569 Chr7 22,766,645 Promoter rs11574530 ITGB7 3695 Chr12 53,599,340 Intron Leukocyte Trans endothelial Migration Pathway rs6999346 TRI M35 23087 Chr8 27,174,861 Promoter rs2839686 CXCL12 6387 Chr10 44,881,455 Promoter rs16931177 VCL 7414 Chr10 75,870,890 Intron rs11223714 JAM3 83700 Chr11 134,016,011 Exon/Intron boundary rs4802260 VASP 7408 Chr19 46,027,752 Exon/Intron boundary rs21 9761 CLDN14 23562 Chr21 37,839,410 Promoter/Intron rs5921682 NOX1 27035 ChrX 100,130,437 Promoter Natural Killer Cell Mediated Cytotoxicity Pathway rs164123 SH2D1B 117157 Chr1 162,367,654 Intron rs1327473 IFNGR1 3459 Chr6 137,541,230 Promoter rs102344 38 RAC1 5879 Chr7 6,422,800 Intron rs10111172 TNFRSF10C 8794 Chr8 22,969,091 Intron rs16927724 PRF1 5551 Chr10 72,363,568 Promoter

PAGE 143

143 Table 4 6. Continued. SNP GENE NCBI GENE ID CHROMOSOME POSITION SNP ROLE rs12597573 NFATC3 4775 Chr16 68,167,521 Intro n rs11641233 NFAT5 10725 Chr16 69,734,563 3' UTR rs307943 PRKCG 5582 Chr19 54,404,476 Intron Phagosome Pathway rs17035245 ATP6V1E2 90423 Chr2 46,739,683 Coding sequence rs7622403 ATP6V1A 523 Chr3 113,472,434 Intron rs1017813 ITGB5 3693 Chr3 124,591, 985 Intron rs1032070 ATP6V0A1 535 Chr17 40,618,251 Intron rs5921682 NOX1 27035 ChrX 100,130,437 Promoter T cell Receptor Signaling Pathway rs27341 ITK 3702 Chr5 156,679,576 Exon/Intron boundary rs3735131 CARD11 84433 Chr7 2,962,293 Coding sequence rs 7358099 CHUK 1147 Chr10 101,955,002 Intron rs152041 CHP2 63928 Chr16 23,765,355 Promoter rs3092923 CD40L 959 ChrX 135,741,185 Exon/Intron boundary Toll like Receptor Signaling Pathway rs4256246 CXCL10 3627 Chr4 76,945,522 Promoter rs5743303 TLR3 709 8 Chr4 186,988,853 Promoter rs3740720 FADD 8772 Chr11 70,053,547 Downstream rs2403102 TRAF3 7187 Chr14 103,306,215 Intron rs3027898 IRAK1 3654 ChrX 153,275,890 Downstream SNP positions refer to version GRCh37 of the human reference sequence

PAGE 144

144 Table 4 7 Multivariate regression of disease state on the score variables derived from the successful models showing strong or moderate association with rheumatoid arthritis (comparing RA vs. CTR). Test of Overall Model Test Chi square df p value Likelihood Ratio Test 529.516 12 <0.0001 Score Test 497.930 12 <0.0001 Wald Test 451.457 12 <0.0001 Test of Parameters Parameter Parameter Estimate Standard Error Wald's Chi square df p value Odds Ratio Estimates Point Estimate 95% Confidence Interval Intercept 0.3790 0.0312 147.7942 1 <0.0001 Pathway 1 0.5020 0.1304 14.8224 1 0.0001 1.652 1.279 2.133 Pathway 2 0.6809 0.1516 20.1816 1 <0.0001 1.976 1.468 2.659 Pathway 3 0.8802 0.0990 79.0971 1 <0.0001 2.411 1.986 2.928 Pathway 4 0.7601 0.09 84 59.6341 1 <0.0001 2.139 1.763 2.594 Pathway 5 1.0191 0.1973 26.6887 1 <0.0001 2.771 1.882 4.078 Pathway 6 0.7184 0.1224 34.4668 1 <0.0001 2.051 1.614 2.607 Pathway 7 0.6218 0.1108 31.4872 1 <0.0001 1.862 1.499 2.314 Pathway8 1.0506 0.1575 44.5086 1 <0.0001 2.859 2.100 3.893 Pathway 9 0.5614 0.1840 9.3069 1 0.0023 1.753 1.222 2.514 Pathway 10 0.5519 0.1274 18.7555 1 <0.0001 1.737 1.353 2.229 Pathway 4 Pathway 7 1.3250 0.3206 17.0822 1 <0.0001 0.266 0.142 0.498

PAGE 145

145 Table 4 7. Continued. Test of Par ameters Parameter Parameter Estimate Standard Error Wald's Chi square df p value Odds Ratio Estimates Point Estimate 95% Confidence Interval Pathway 4 Pathway 9 1.7018 0.5694 8.9334 1 0.0028 0.182 0.060 0.557 Goodness of fit Test Test Chi square df p value Hosmer Lemeshow Test 6.9178 8 0.5455 Pathway 1: B cell Receptor Signaling Pathway Pathway 2: Chemokine Signaling Pathway Pathway 3: Complement and Coagulation Cascades Pathway Pathway 4: Cytokine Cytokine Receptor Interaction Pat hway Pathway 5: Fc Gamma R mediated Phagocytosis Pathway Pathway 6: Intestinal Immune Network for IgA Production Pathway Pathway 7: Leukocyte Trans endothelial Migration Pathway Pathway 8: Natural Killer Cell Mediated Cytotoxicity Pathway Pathway 9: T cell Receptor Signaling Pathway Pathway 10: Toll like Receptor Signaling Pathway

PAGE 146

146 Table 4 8. The list of SNPs from NARAC study replacing the SNPs from WTCCC study for replication of the results. SNP in WTCCC Study SNP in NARAC Study Distance (bp) SNP in WTCCC Study SNP in NARAC Study Distance (bp) rs10163054 rs8040336 5,979 rs1800795 rs2069837 1,382 rs9272346 rs1063355 23,342 rs291092 rs291102 6,752 rs7760860 rs9296354 905 rs11574530 rs1554753 4,398 rs7869876 rs16927988 1,033 rs6999346 rs6986075 39 rs17110519 rs17365723 270 rs16931177 rs2279648 3,697 rs2734871 rs16832740 1,997 rs11223714 rs12803146 1,435 rs2839686 rs3780891 2,742 rs4802260 rs10995 2,402 rs1950501 rs3212247 2,995 rs307943 rs3745405 9,378 rs2075019 rs4919871 9,320 rs1327473 rs132 7474 155 rs664142 rs669444 6,753 rs10234438 rs2347338 4,522 rs7803810 rs3735222 1,708 rs11641233 rs10517 9,197 rs307943 rs3745405 9,378 rs12597573 rs2418736 12,659 rs352084 rs7747961 7,049 rs16927724 rs3758562 1,488 rs4971024 rs7532935 13,531 rs373513 1 rs1878805 879 rs10204214 rs10186062 1,437 rs7358099 rs12248110 246 rs901104 rs1318241 293 rs152041 rs194821 3,372 rs12280627 rs1377470 3,035 rs3092923 rs3092921 1,815

PAGE 147

147 Table 4 9. Simple regression of disease state on the overall score variable deriv ed from the entire set of 44 SNPs present in the replicated RA associated models (comparing RA vs CTR). Test of Overall Model Test Chi square df p value Likelihood Ratio Test 391.2276 1 <0.0001 Score Test 368.9996 1 <0.0001 Wald Test 346.1146 1 <0.0001 Test of Parameters Parameter Parameter Estimate Standard Error Wald's Chi square df p value Odds Ratio Estimates Point Estimate 95% Confidence Interval Intercept 0.4095 0.0301 185.4796 1 <0.0001 Score 0.8745 0.0470 346.1146 1 < 0.0001 2.398 2.187 2.629 Goodness of fit Test Test Chi square df p value Hosmer Lemeshow Test 9.2494 8 0.3217

PAGE 148

148 Table 4 10. Simple regression of disease state on the overall score variable derived from the entire set of 44 SNPs present in the repli cated RA associated models (comparing NARAC A vs NARAC C). Test of Overall Model Test Chi square df p value Likelihood Ratio Test 325.0076 1 < 0.0001 Score Test 295.5573 1 < 0.0001 Wald Test 259.0395 1 < 0.0001 Test of Parameters Parameter Par ameter Estimate Standard Error Wald's Chi square df p value Odds Ratio Estimates Point Estimate 95% Confidence Interval Intercept 0.2371 0.0561 17.8931 1 <0.0001 Score 1.0151 0.0631 259.0395 1 <0.0001 2.760 2.439 3.123 Goodness of fit Te st Test Chi square df p value Hosmer Lemeshow Test 9.5291 8 0.2996

PAGE 149

149 Table 4 11. The p values associated with the pairwise comparisons of the CD group and the two control groups using the successful models derived from immune system related pathways. Pathway 58C vs. NBS CD vs. 58C CD vs. NBS Antigen Processing and Presentation > 0.05 2.36x10 6 0.00605 7.85x10 6 0.01423 B cell Receptor Signaling > 0.05 1.07x10 10 0.00018 5.01x10 11 0.00002 Chemokine Signaling > 0.05 0.00015 0.04892 0.00085 > 0.05 Complement and Coagulation Cascades > 0.05 1.44x10 6 0.02870 6.81x10 6 > 0.05 Cytokine Cytokine Receptor Interaction 0.04556 7.01x10 11 0.00121 3.22x10 10 0.00367 Fc Epsilon RI Signaling > 0.05 0.00877 > 0.05 0.00529 > 0.05 Fc Gamma R mediated Phagocytosis > 0.05 > 0.05 > 0.05 > 0.05 > 0.05 Immune Network for IgA Production > 0.05 4.51x10 7 0.00304 6.03x10 6 0.02240 Leukocyte Trans endothelial Migration > 0.05 1.21x10 6 > 0.05 4.89x10 7 0.0310 3 Natural Killer Cell Mediated Cytotoxicity > 0.05 2.24x10 5 0.03038 2.58x10 6 0.00620 Phagosome > 0.05 0.00021 > 0.05 4.61x10 5 > 0.05 Regulation of Autophagy > 0.05 9.94x10 5 > 0.05 0.00031 0 > 0.05 T cell Receptor Signaling 0.00783 1.40x10 11 0.00002 8.30x10 11 0.00015 Toll like Receptor Signaling > 0.05 0.00113 0.02412 0.00511 0 > 0.05 According to Bonferroni's correction, a fitness p value < 6.944x10 9 and a permutation test p value < 0.00104 were considered significant.

PAGE 150

150 Table 4 12. The p values associated with the pairwise comparisons of the CD group and the two control groups using the successful models derived from negative control pathways. Pathway 58C vs. NBS CD vs. 58C CD vs. NBS Cardiac Muscle Contraction > 0.05 5.64x10 6 > 0.05 3.43x10 5 > 0.05 Gap Junction > 0.05 6.42x10 6 > 0.05 6.88x10 6 > 0.05 Glycolysis/Gluconeogenesis > 0.05 0.00502 0 > 0.05 0.00691 0 > 0.05 Insulin Signaling 0.02463 7.86x10 6 > 0.05 9.54x10 7 > 0.05 N ucleotide Excision Repair > 0.05 0.00738 0 > 0.05 0.00482 0 > 0.05 Oxidative Phosphorylation > 0.05 9.84x10 7 0.02836 5.25x10 7 0.01937 Purine Metabolism > 0.05 0.00017 0 > 0.05 1.40x10 6 0.03845 Pyrimidine Metabolism > 0.05 0.00069 0 > 0.05 1.41x10 5 > 0 .05 Renin Angiotensin System > 0.05 0.00028 0 > 0.05 0.00105 0 > 0.05 Spliceosome > 0.05 1.13x10 6 > 0.05 5.71x10 6 > 0.05

PAGE 151

151 Table 4 13. The p values associated with the comparisons of the CD and CTR groups using the successful models derived from pathwa ys under consideration. Pathway Fitness Permutation T est Test Pathways Antigen Processing and Presentation 4.50x10 9 0.00003 B cell Receptor Signaling 4.01x10 1 5 < 10 5 Chemokine Signaling 8.59x10 6 0.00523 Complement and Coagulation Cascades 4.32x10 9 0.00053 Cytokine Cytokine Receptor Interaction 1.16x10 1 5 < 10 5 Fc Epsilon RI Signaling 0.00204 0.03899 Fc Gamma R mediated Phagocytosis > 0.05 > 0.05 Intestinal Immune Network for IgA Production 1.76x10 9 0.00005 Leukocyte Trans endothelial Migr ation 1.49x10 9 0.00072 Natural Killer Cell Mediated Cytotoxicity 7.40x10 9 0.00005 Phagosome 2.75x10 6 > 0.05 Regulation of Autophagy 3.51x10 6 0.01377 T cell Receptor Signaling 6.51x0 1 5 < 10 5 Toll like Receptor Signaling 8.61x10 5 0.00240 Control Pathways Cardiac Muscle Contraction 2.36x10 7 0.01646 Gap Junction 1.01x10 7 0.00479 Glycolysis/ Gluconeogenesis 0.00207 > 0.05 Insulin Signaling 1.15x10 8 0.01003 Nucleotide Excision Repair 0.00109 > 0.05 Oxidative Phosphorylation 3.57x10 9 0.00 085 Purine Metabolism 8.92x10 7 0.02956 Pyrimidine Metabolism 9.19x10 6 > 0.05 Renin Angiotensin System 3.19x10 5 0.04161 Spliceosome 1.42x10 8 0.00359 According to Bonferroni's correction, a fitness p value < 6.944x10 9 and a permutation test p value < 0.00 208 were considered significant

PAGE 152

152 Table 4 14. The list of SNPs included in the successful models showing association with Crohn's disease. SNP GENE NCBI GENE ID CHROMOSOME POSITION SNP ROLE Antigen Processing and Presentation Pathway rs2569 094 CD74 972 Chr5 149,795,052 Promoter rs504697 HSP90AB1 3326 Chr6 44,220,076 Exon/Intron boundary rs7869876 HSPA5 3309 Chr9 127,994,564 Downstream rs10163054 PDIA3 2923 Chr15 44,047,638 Intron B cell Receptor Signaling Pathway rs9724615 NRAS 4893 Chr 1 115,259,550 Promoter rs2037547 GSK3B 2932 Chr3 119,544,615 3' UTR rs767652 DAPP1 27071 Chr4 100,791,805 Downstream rs2272733 IKBKB 3551 Chr8 42,157,902 Intron rs290990 SYK 6850 Chr9 93,561,695 Promoter rs6096473 NFATC2 4773 Chr20 50,181,272 Promoter rs2855259 BTK 695 ChrX 100,615,478 Exon/Intron boundary Complement and Coagulation Cascades rs2935542 C1QA 712 Chr1 22,963,471 Intron rs671137 C8A 731 Chr1 57,319,480 Promoter rs428060 CFH 3075 Chr1 196,706,200 Exon/Intron boundary rs11117913 CR2 13 80 Chr1 207,668,235 Downstream rs7096206 MBL2 4153 Chr10 54,531,685 Promoter rs3213721 VWF 7450 Chr12 6,182,753 Exon/Intron boundary rs5960 F10 2159 Chr13 113,801,737 Coding sequence

PAGE 153

153 Table 4 14. Continued. SNP GENE NCBI GENE ID CHROMOSOME POSITION S NP ROLE Cytokine Cytokine Receptor Interaction Pathway rs3748669 IL24 11009 Chr1 207,077,023 3' UTR rs2192752 IL1R1 3554 Chr2 102,769,373 Promoter rs2270418 TNFSF10 8743 Chr3 172,240,999 Exon/Intron boundary rs13143866 IL21 59067 Chr4 123,540,758 Intr on rs6876446 LIFR 3977 Chr5 38,474,149 Downstream rs1327473 IFNGR1 3459 Chr6 137,541,230 Promoter rs1800795 IL6 3569 Chr7 22,766,645 Promoter rs6557634 TNFRSF10A 8797 Chr8 23,060,256 Coding sequence rs11818239 BMPR1A 657 Chr10 88,659,788 Coding sequen ce rs784894 AMHR2 269 Chr12 53,817,956 Intron rs11104905 KITLG 4254 Chr12 88,890,581 3' UTR Intestinal Immune Network for IgA Production Pathway rs1554286 IL10 3586 Chr1 206,944,233 Exon/Intron boundary rs6767853 CD80 941 Chr3 119,252,238 Intron rs27 15260 CD86 942 Chr3 121,793,490 Intron rs6886399 CCL28 56477 Chr5 43,413,268 Promoter rs1800795 IL6 3569 Chr7 22,766,645 Promoter rs11574530 ITGB7 3695 Chr12 53,599,340 Intron Leukocyte Trans endothelial Migration Pathway rs497900 PIK3CB 5291 Chr3 138 ,433,568 Exon/Intron boundary

PAGE 154

154 Table 4 14. Continued SNP GENE NCBI GENE ID CHROMOSOME POSITION SNP ROLE rs11966646 CLDN20 49861 Chr6 155,585,154 5' UTR rs2839686 CXCL12 6387 Chr10 44,881,455 Promoter rs3802903 ESAM 90952 Chr11 124,634,946 Promoter rs568259 VAV1 7409 Chr19 6,782,574 Intron rs4802260 VASP 7408 Chr19 46,027,752 Exon/Intron boundary rs219761 CLDN14 23562 Chr21 37,839,410 Intron Natural killer Cell Mediated Cytotoxicity Pathway rs13112866 PPP3CA Intron rs132747 3 IFNGR1 rs2126053 SYK Exon/Intron boundary SHC4 3' UTR T cell Receptor Signaling Pathway rs7555443 PTPRC 5788 Chr1 198,707,217 Intron rs1554286 IL10 3586 Chr1 206,9 44,233 Exon/Intron boundary rs7582886 NCK2 8440 Chr2 106,359,698 Promoter rs6784820 RHOA 387 Chr3 49,450,864 Promoter rs3808917 CHUK 1147 Chr10 101,990,715 Promoter rs12599391 NFAT5 10725 Chr16 69,605,349 Intron rs8090560 NFATC1 4772 Chr18 77,213,155 Intron Oxidative Phosphorylation Pathway rs2726502 PPA2 27068 Chr4 106,298,896 Intron

PAGE 155

155 Table 4 14. Continued SNP GENE NCBI GENE ID CHROMOSOME POSITION SNP ROLE rs2029623 ATP6V1F 9296 Chr7 128,507,297 Downstream rs2244916 PPA1 5464 Chr10 71,969,133 Exon/Intron boundary rs1000984 COX15 1355 Chr10 101,470,085 3' UTR rs3794186 TCIRG1 10312 Chr11 67,821,036 Downstream rs9630712 COX10 1352 Chr17 13,970,617 Promoter rs1032070 ATP6V0A1 535 Chr17 40,618,251 Intron SNP positions refer to version GRCh37 o f the human reference sequence.

PAGE 156

156 Table 4 15. Multivariate regression of disease state on the score variables derived from the successful models showing association with Crohn's disease ( comparing CD vs. CTR). Test of Overall Model Test Chi square df p v alue Likelihood Ratio Test 259.5042 7 <0.0001 Score Test 194.4606 7 <0.0001 Wald Test 214.6265 7 <0.0001 Test of Parameters Parameter Parameter Estimate Standard Error Wald's Chi square df p value Odds Ratio Estimates Point Estimate 95% C onfidence Interval Intercept 0.3866 0.0296 170.0454 1 <0.0001 Pathway 1 0.9332 0.2290 16.6024 1 0.0001 2.543 1.623 3.983 Pathway 2 0.9589 0.1375 48.6488 1 <0.0001 2.609 1.993 3.416 Pathway 3 0.8805 0.1871 22.1391 1 <0.0001 2.412 1.671 3.481 Pat hway 4 0.9304 0.1454 40.9663 1 <0.0001 2.535 1.907 3.371 Pathway 5 1.0084 0.2125 22.5234 1 <0.0001 2.741 1.808 4.157 Pathway6 0.9256 0.1392 44.2113 1 <0.0001 2.523 1.921 3.315 Pathway 7 0.8416 0.1852 20.6553 1 <0.0001 2.320 1.614 3.335 Goodness of fit Test Test Chi square df p value Hosmer Lemeshow Test 7.7630 8 0.4570 Pathway 1: Antigen Processing and Presentation Pathway

PAGE 157

157 Pathway 2: B cell Receptor Signaling Pathway Pathway 3: Complement and Coagulation Cascades Pathway Pathway 4: Cytokine Cy tokine Receptor Interaction Pathway Pathway 5: Leukocyte Trans endothelial Migration Pathway Pathway 6: T cell Receptor Signaling Pathway Pathway 7: Oxidative Phosphorylation Pathway

PAGE 158

158 Table 4 16. Simple regression of disease state on the overall score v ariable derived from the entire set of 57 SNPs present in the CD associated models (comparing CD vs. CTR). Test of Overall Model Test Chi square df p value Likelihood Ratio Test 270.4568 1 <0.0001 Score Test 174.5467 1 <0.0001 Wald Test 219.3221 1 <0.0001 Test of Parameters Parameter Parameter Estimate Standard Error Wald's Chi square df p value Odds Ratio Estimates Point Estimate 95% Confidence Interval Intercept 0.3807 0.0297 164.6881 1 <0.0001 Score 0.9189 0.0620 219.3221 1 <0.0001 2.506 2.219 2.831 Goodness of fit Test Test Chi square df p value Hosmer Lemeshow Test 8.1880 8 0.4153

PAGE 159

159 Table 4 17. Comparative summary of pathway associations with rheumatoid arthritis and Crohn's disease. Pathway RA vs. CTR NARAC A v s. NARAC C CD vs. CTR Test Pathways Antigen Processing and Presentation + + + B cell Receptor Signaling + + Chemokine Signaling + +/ Complement and Coagulation Cascades + + + Cytokine Cytokine Receptor Interaction + + Fc Epsilon RI Signaling Fc Gamma R mediated Phagocytosis + + Intestinal Immune Network for IgA Production + + + Leukocyte Trans endothelial Migration + +/ + Natural Killer Cell Mediated Cytotoxicity + +/ + Phagosome + Regulation of Autophagy T cell Rec eptor Signaling + + + Toll like Receptor Signaling + Control Pathways Cardiac Muscle Contraction Gap Junction Glycolysis/Gluconeogenesis Insulin Signaling Nucleotide Excision Repair Oxidative Phosphorylation + Purine Metabolism Pyrimidine Metabolism Renin Angiotensin System Spliceosome (+) indicates that the successful model retrieved from the pathway is in strong association with the disease in the corresponding comparison, ( ) indicates that the

PAGE 160

160 successful model retrieved from the pathway is not associated with the disease in the corresponding comparison, and (+/ ) indicates that the successful model retrieved from the pathway is in borderline association with the disease in t he corresponding comparison.

PAGE 161

161 Figure 4 1. Disease risk Score class diagram for RA vs. CTR and NARAC A vs. NARAC C comparisons. For each comparison overall score variable derived from the entire set of 44 SNPs present in the eight replicated RA associat ed models was discretized into 12 bins, and for each bin the posterior probability of being affected by disease was calculated based on Bayes formula.

PAGE 162

162 Figure 4 2. Disease risk Score class diagram for CD vs. CTR comparison. The overall score variable d erived from the entire set of 57 SNPs present in the nine significant CD associated models was discretized into 12 bins, and for each bin the posterior probability of being affected by disease was calculated based on Bayes formula.

PAGE 163

163 CHAPTER 5 KBAS SOFTWARE KBAS is a software package developed to perform genome wide association studies in situations in which combinations of genetic factors jointly contribute to the genetic variance of a trait of interest, as is often the case in complex diseases. The program is highly configurable and easy to use. The KBAS software is distributed as a GNU/Linux command line 64bit executable and is designed to be included in automated pipelines for the analysis of genome wide association data. It implements a hypothesis based analytical method called knowledge based associating studies (KBAS) which integrates the biological discoveries about a phenotype into the procedure of heuristically genomic scanning to confirm or invalidate the association between a user provided set of m arkers (typically SNPs) and the trait under investigation in a case control setting. A detailed description about the KBAS method has been provided in Chapter 3. Installing KBAS KBAS is a GNU/Linux command line 64bit executable. The currently released vers ion of KBAS is version 1.0 (v.1.0). The KBAS package can be downloaded from the following URL: http://genome.ufl.edu/rivalab/kbas/ Once the package is downloaded, copied to an appropriate directory, and uncompressed ( i.e. by executing tar xvf kabs 1.0.tar.gz command in terminal ) a kbas script and the demo script.

PAGE 164

164 How Does KBAS Works? T he KBAS software consists of three structurally separate but functionally related modules: SNP selection module conver t module and GA engine module These modules function together to provide an integrated workflow to extract a set of SNPs related to a hypothesis about the trait of interest, to convert the case con trol genotype data to appropriate format usable by KBAS, and to generate test and optimize multi marker models related to the hypothesis under investigation. SNP Selection Module The SNP selection module provides an interface through which the user can ex tract a set of SNPs on the basis of a specific query. Examples of possible queries include the identifier of one or more KEGG pathways or gene ontology (GO) classes, the identifier/symbol of one or more genes, the mRNA accession number of one or more trans cripts, and one or more genomic regions. Moreover, the user can provide the SNP selection module with a list of SNPs to get a new list containing only validated SNPs. Table 5 1 provides a detailed description of the methods that can be used by the current version of KBAS to generate the set of desired SNPs. T he user has the opportunity to either type the query terms (e.g. the identifier of one or more KEGG pathways) directly in the terminal or use a file containing the query terms. The user should also spec ify the pathname of an output file to which the identifiers of the desired SNPs are written. Moreover, the user can specify a maximum number of SNPs to be returned from each transcript related to the query (e.g. only 5 SNPs from each transcript belonging t o the genes involved in a particular KEGG pathway). Note that in this case the returned

PAGE 165

165 SNPs are not selected randomly, but are selected on the basis of a prioritizing rule that preferentially selects non synonymous coding SNPs, followed by promoter SNPs, exon intron junction SNPs, synonymous coding SNPs, 3'UTR SNPs, and intronic SNPs. The SNPs selection procedure is invoked using the getsnps command followed by the appropriate command line arguments. Table 5 2 provides a detailed description of the argume nts of the getsnps command. The output file of the SNP selection module will later serve as one of the input files required by the GA engine module An e xample of r unning the SNP s election m odule of KBAS. Suppose we are interested in getting a list of at most five SNPs from each transcript belonging to the genes involved in Insulin Signaling (KEGG ID : map04910) and T cell Receptor Signaling (KEGG ID: map04660) pathways, and storing the identifier of the resulting set of SNPs in a file called snps out.txt in the current directory. This can be accomplished running the getsnps command with the following arguments : kbas getsnps snp method 1 snp query map04910,map04660 snp outfile snps out.txt snp count 5 The snp infile argument has not been specified b ecause in this example the query terms (i.e. map04910, map04660) has directly been specified in the terminal through the snp query argument. Note that either snp query argument or snp infile argument must be specified and both of them cannot be used at the same time Convert Module The convert module is responsible for converting the case control genotypes files into the binary format which is the file type required by the GA engine module of the KBAS software. The genotyping data for case and control gr oups can be either in one

PAGE 166

166 single file or in separate files. If the genotyping data of the contrasting groups are in separate files, the conversion of the data related to each group is conducted separately. The input genotypes file can be either in csv, csv 2 or vcf format, and contains the name/identifier of the subjects, the name/identifier of the genotyped SNP s, and the allelic values for each genotyped SNP Data in different fields can be separated by one of the following delimiters: tab, comma, space, co lon, and semicolon. Note that the two alleles for each genotyped SNP can be in separate columns or in one single column in which alleles can be either separated by a delimiter like / or be represented without any delimiter in between (for instance, the a lleles can be represented as T/C, or TC). Also, the name/identifier of the case and control subjects for whom genotyping has been performed can be provided in a file separate from the aforementioned genotypes files. If the genotypes of both cases and contr ols are included in a single input file, the file should contain a field specifying the subject classification labels. The convert command generates three output files: P.bin P names.txt and P snps.txt ser. The three files contain the genotypes in binary format, the name/identifier of the subjects included in the study, and the name/identifier of the SNP s which have been genotyped in the study, respectively. The conversion of provided input file into bi nary format is performed by the convert command followed by the appropriate command line arguments. Table 5 3 provides a detailed description of the arguments of the convert command. An e xample of r unning the c onvert m odule of KBAS. Suppose a genotypes f ile called case genotypes.csv is located in the ~/data/ directory. The file is in csv

PAGE 167

167 format and contains genotypes only for the case group, with the identifier of the genotyped SNPs in column 1, the name/identifier of subjects in column 2, and both allel es for each genotyped SNP in column 3. The columns are separated by tab delimiter. Running the convert command with the following arguments will convert the input file into the appropriate binary format and store the three converted files in the ~ /data/conv/ directory with the prefix name of case group The three converted output files will be: case group.bin case group names.txt and case group -snps.txt k bas convert conv in ~/data/case genotypes.csv conv dir ~/data/conv/ conv prefix case group conv format csv conv delim tab conv marker col 1 conv subj col 2 conv allele col 3 Note that four arguments related to convert command (i.e. conv class col, conv allele delim, conv names from, conv names col arguments) have not been s pecified because the input genotypes file only contains data from case group and therefore all the subjects have the same classification label the two alleles of each genotyped SNP are in the column 3 but have not been separated by a delimiter character, and finally the name/identifier of the genotyped subjects are not in a file separate from the input genotypes file GA engine Module Once the case control genotypes files converted to the appropriate format and the file containing the list of SNP identifie rs related to the hypothesis under investigation are ready, they serve as the required input files for the GA engine module, the analytical core of the KBAS software. This module generates, evaluates, and refines multi SNP models with the aim of investigat ing if there exists any combination of SNPs which is

PAGE 168

168 associated with the trait under investigation. Such association is measured as the classification performance of the corresponding multi SNP model. The GA engine module of KBAS is invoked using the run command and its relevant command line arguments. Tables 5 4 and 5 5 provide a detailed description of the arguments of this command. At first, the genotypes of subjects in the case and control groups need to be converted into numerical values according to an encoding scheme specified by the user. The current implementation of KBAS provides user s with the opportunity to test alternative encoding schemes related to various intra genic relationship between alleles (e.g. dominance, partial dominance or over dom inance relationships). For instance, the three possible genotypes at each SNP locus can be encoded as AA=0, AB=1, and BB=2 under the assumption of a co dominant additive relationship between the major (A) and minor (B) alleles at the locus. As another exam ple, over dominant allelic effects can be investigated by encoding the SNPs genotypes as AA=0, AB=1, and BB=0. By testing alternative encoding schemes, the user will be able to obtain optimized multi SNP models with respect to both the inter genic epistati c effects of SNPs and the intra genic relationship between alleles at each locus. The GA engine is then initialized by a user specified number of randomly generated solutions ( organism ) to the problem under investigation, also called population size Thes e organisms are internally represented as bit vectors ( artificial chromosomes ), whose length is proportional to the number of SNPs existing in the set of SNPs initially generated by the hypothesis of interest.

PAGE 169

169 The weights of the SNPs loci located on artif icial chromosomes are assigned and adjusted by the GA engine module of KBAS The number of bits used to assign weights to SNPs is determined by the user. For instance in the simplest case in which only one bit is used, each SNP's weight can be either 0 or 1: the weight therefore indicates whether the corresponding SNP is present in the model or not. If 3 bits are used to weight the SNPs, the weights would range from 0 to 7. A collection of SNPs with non zero weights present in an artificial chromosome alo ng with their respective weights is called a model Models are evaluated by a fitness function which measures how well they fit the available datasets, and are optimized over a number of simulated generations specified by the user The KBAS software provid es the user with an option to select one of several available fitness functions, each one providing a distinct way to convert the genotype of each subject into a score, and to compare the distributions of resulting scores between the case and control group s. Table 5 6 provides a detailed description of the fitness functions available in the current versio n of KBAS. The list of all available fitness functions can be printed on the terminal by running the kbas listga command The process of optimizing the mu lti SNP models by the GA engine takes use of o perators borrowed from evolutionary biology such as mutation and crossover In the current implementation of KBAS, the user can adjust the mutation rate using three arguments related to the run command. The fi rst one specifies the overall mutation rate at which the weight of individual SNP s may be mutated in each generation. The other ones are zero to non zero mutation rate and non zero to zero mutation rate

PAGE 170

170 specifying the probabilities by which a mutatio n can mutate a zero weight of individual SNP s to non zero values in each generation, and vice versa. If the user is interested in obtaining a more parsimonious multi SNP model, he/she can adjust these mutation rates so that the non zero to zero mutations are favored. The GA stops either when the maximum number of generations is reached, or when the fitness of the best model in a particular generation exceeds a significance threshold. These two stop criteria can be specified by the user in t he current impl ementation of KBAS C onsequently over many generations, the GA engine module refines the initial set of SNP s by removing SNP s of less contribution to the trait under investigation and produces a final model ( successful model ) that best fits the fitness fu nction. The successful model can then be subject to further validation by permutation testing over user specified rounds of permutation The user can also specify the number of cores employed by the GA engine module if KBAS is run on a multi core computer. The output of the GA engine module, displayed on the terminal, c ontains the identifiers of SNPs present in the retrieved successful model along with their respective weights, followed by the fitness of the retrieved model and the permutation test p value if permutation testing has been requested. If an output file has also been requested by the user, the output of the GA engine module is written to the user specified file in a machine readable format. An e xample of r unning the GA engine m odule of KBAS Su ppose the converted genotypes files corresponding to the case and control groups have been

PAGE 171

171 stored in the ~/data/conv/ directory (e.g. the case group.bin and ctrl group.bin files). The desired list of SNPs has also been extracted (e.g. the snps out.t xt file) and stored in the ~/data/marker/ directory. We are interested in running the GA engine module using the following parameters: 1. A numerical encoding scheme on the basis of dominant relationship between alleles at each locus (e.g. AA=0, AB=BB=2) 2. 200 artificial chromosomes ( solutions ) in each generation 3. A maximum number of 1000 generations 4. One bit to weight each SNP in artificial chromosomes 5. An approximate initial number of 10 SNPs with non zero weights in the artificial chromosomes i nitializing th e first generation 6. The GA class 1 (see T able 5 6) which defines the score as the weighted sum of the genotypes of each subject over the SNPs loci present in the model under consideration, and then uses the t statistic resulting from comparing the score dis tributions between the case and control groups as the measure of the fitness for that model 7. A t statistic of at least 6 as the fitness threshold used to label a model as significant We also wish to run the GA engine module on a multi core computer using 4 cores, to perform 10000 rounds of permutation testing over the final successful model, and to corresponding run command will require the following arguments : k bas run case ~/data/conv/case group.bin ctrl ~/data/conv/ctrl group.bin markers ~/data/marker/snps out.txt ga class 1 ga enc 022 ga pop 200 ga gen 1000 ga bits 1 ga num 10 ga thld 6 ga rnd 10000 ga cores 4 ga out ~/data/output/output.txt Help The help getsnps help convert or help run commands prints a general description about the getsnps, convert and run commands, respectively. Also, the

PAGE 172

172 help Getsnps help Convert or help Run commands can be used to get a detailed description of all the c ommand line arguments available for the getsnps convert and run commands, respectively. Moreover, the KBAS package includes an interactive demonstration which explains the basic steps in running the program. Running the demonstration before starting to use KBAS is highly recommend, since it will provide the user with an overview of the KBAS workflow and with a description of its most important command line options. To run the demonstration, start the demo script using demo command and follow the ins tructions step by step.

PAGE 173

173 Table 5 1. Description of the various methods can be used to generate the desired list of SNPs by the getsnps command Method of SNPs Selection Method ID Description SNPs From Pathways 1 Given one or more KEGG pathway identifie rs, retrieves the list of all validated SNPs in genes belonging to these pathways. Example query: map00010,map00020 SNPs From Gene Ontology Classes 2 Given one or more gene ontology class identifiers, retrieves the list of all validated SNPs in genes belo nging to these classes. Example query: GO:0002021,GO:0034384 SNPs From Genes (Using Gene IDs) 3 Given one or more Entrez gene identifiers, retrieves the list of all validated SNPs belonging to these genes. Example query: 348,1557 SNPs From Genes (Using G ene Symbols) 4 Given one or more HGNC gene symbols, retrieves the list of all validated SNPs belonging to these genes. Example query: APOE CYP2C19 SNPs From Genes (Using mRNA Accession Numbers) 5 Given one or more Gene Bank mRNA accession numbers, retriev es the list of all validated SNPs belonging to these transcripts. Example query: NM_000041,NM_000769 SNPs From Regions 6 Given one or more genomic regions retrieves the list of all validated SNPs belonging to these regions. Example query: chr19:4540903 8 45412649,chr10:96522462 96612670 Validated SNPs From a List of SNP Identifiers 7 Given a list of SNPs, retrieves all validated SNPs existing in the list. Example query: rs10009,rs10010

PAGE 174

174 Table 5 2. Description of the arguments of the getsnps command. Argument Description of the Value Assigned to the Argument snp method This argument specifies the method employed to generate the desired list of SNPs. This argument takes as value a number that identifies the desired method. Please use the getsnps com mand with no other arguments to get a list of all available methods. The number preceding each method is the value that should be used for this argument. A more detailed description of the different available methods can also be found in Table 5 1. snp q uery T he query terms which are used to generate the desired list of SNPs are specified through this argument on the command line (terminal) For example, when using method 1, the value of this argument should be one or more pathway identifiers. To specif y more than one query term, separate them with a single comma (with no spaces). All query terms should be of the same type (e.g. pathway identifiers, mRNA accession numbers), and should be appropriate for the method selected with snp method argument. To r ead query terms from a file instead of from the command line, please use the snp infile argument. Note that only one of snp query and snp infile arguments must be specified, but not both at the same time. snp infile This argument specifies the pathnam e of the file containing the query terms which are used to generate the desired list of SNPs. The file should be a plain text file, with one term per line. If the file is tab delimited with multiple columns, query terms will be read from the first column. All query terms should be of the same type (e.g. pathway identifiers, or mRNA accession numbers), and should be appropriate for the method selected with snp method argument. To specify query terms on the command line (terminal) instead of reading them fro m a file, please use the snp query argument. Note that only one of snp query and snp infile arguments must be specified, but not both at the same time. snp outfile This argument specifies the pathname of the output file that the identifiers of the des ired list of SNPs will be written to. Note that this file will be created if it does not exist, and overwritten if it already exists. snp count This argument specifies the maximum number of SNPs from each transcript that should appear in the resulting li st of SNPs (e.g. only 5 SNPs from each transcript related to the genes belonging to a particular KEGG pathway). If this argument is specified, SNPs related to each transcript are selected on the basis of a prioritizing rule that preferentially selects non synonymous coding SNPs, followed by promoter SNPs, exon intron junction SNPs, synonymous coding SNPs, 3'UTR SNPs, and intronic SNPs (i.e. SNPs are not selected randomly ).

PAGE 175

175 Table 5 3. Description of the arguments of the convert command. Argument Descripti on of the Value Assigned to the Argument conv in This argument specifies the name of the input file containing the genotypes of cases and/or controls. conv dir This argument specifies the pathname of an existing directory where the resulting converte d files will be saved. If this argument is not provided, the program will use the current directory by default to store the converted files. conv prefix The conversion process will create three output files named P.bin P names.txt and P snps.txt wher e 'P' is the prefix specified through this argument. If this argument is not specified, the program will use 'CONV DATA' by default as the prefix to name the output files. conv format This argument specifies the format of the input file containing the g enotypes. Recognized formats are 'vcf', 'csv', or 'csv2'. The 'csv' format requires the genotypes file to be a delimited file in which both alleles of the genotyped markers (e.g. SNPs) are stored in the same column (possibly separated by a character, e.g. C/T). The 'csv2' format refers to a delimited genotypes file in which the two alleles of the genotyped markers are stored in separate columns. Please use the conv allele col argument to specify the column number of the column(s) containing the alleles, an d conv allele delim argument to specify the character separating the two alleles if both are in the same column and are separated by a delimiter. conv delim When using the 'csv' or 'csv2' format, this argument is used to specify the delimiter separatin g the fields (columns) in each row. Possible values are: 'tab', 'comma', 'space', 'colon', 'semicolon'. If this argument is not specified, the program will try to automatically detect the delimiter. conv marker col This argument specifies the column num ber of the column containing the identifier of the genotyped markers in the input genotypes file. If this argument is not specified, the program will assume that marker identifiers are in the first column. conv subj col This argument specifies the colum n number of the column containing the name/identifier of the subjects in the input genotypes file. If this argument is not specified, the program will assume that subject names/identifiers are in the second column.

PAGE 176

176 Table 5 3. Continued. Argument Descrip tion of the Value Assigned to the Argument conv class col This argument is used when the input genotypes file contains subjects of different classes (e.g. 0=unaffected, 1=affected). In such cases, this argument specifies the column number of the column cont aining the classification label used to classify the subjects. If th is argument be specified, the conversion process will create multiple sets of output files, one for each class of the contrasting groups. conv allele col This argument specifies the column number of the column (s) containing the alleles of the genotyped markers in the input genotypes file. When the input file is in the 'csv' format (both alleles of each genotyped marker are contained in a single column), please specify a single number When the input file is in the 'csv2' format (the two alleles of each genotyped marker are in separate columns), please specify two column numbers separated by comma (e.g. 3,4). conv allele delim This argument specifies the character separating alleles if both alleles are in the same column in the input genotypes file, and there exist s any delimiter in between (e.g. / for alleles represented as 'A/C'). If this argument is not specified, the program assumes that both alleles are in the same column and ar e not separated by any delimiter character (e.g. 'AC'). conv names from This argument specifies the pathname of the file containing the name/identifiers of the genotyped subjects, if the input genotypes file doesn't contain this information. conv name s col This argument specifies the column number of the column containing the name/identifier of the genotyped subjects in the file specified by conv names from argument.

PAGE 177

177 Table 5 4. Description of the required arguments of the run command Argument De scription of the Value Assigned to the Argument case This argument specifies the pathname of the '.bin' file created using the convert command on the input genotypes file for the case group (e.g. ~/.../case.bin). ctrl This argument specifies th e pathname of the '.bin' file created using the convert command on the input genotypes file for the control group (e.g. ~/.../ctrl.bin). markers This argument specifies the pathname of the file containing the identifiers of markers (e.g. SNPs) use d by KBAS to generate multi marker models and to evaluate their association with the trait of interest (e.g. ~/.../SNPs out.txt ) These markers are selected by the user on the basis of preexisting biological knowledge about the trait under investigation The getsnps command can also be used to generate sets of SNPs belonging to specific pathways or gene ontology classes, or belonging to a number of genes, transcripts, or chromosomal regions.

PAGE 178

178 Table 5 5. Description of the optional arguments of the run command Argument Description of the Value Assigned to the Argument ga enc This argument allows the user to define how different genotype classes of each marker should be converted to numeric values (e.g. 0,1,2). Indicating the major allele at each loc us with A and the minor allele as B, the three possible genotypes (AA, AB, BB) can be numerically encoded under various assumptions related to the intra genic relationship between the alleles (e.g. dominance, partial dominance or over dominance relationshi ps). For instance, an encoding schema as AA =0, AB=1, and BB=2 refers to a co dominant additive relationship between the major and minor alleles at the marker locus. This argument takes as value a three digit number between 000 and 999 (e.g. 012), where the first number (here 0) specifies the numeric value of AA genotype, the second number (here 1) shows the numeric value of AB genotype, and the third one (here 2) refers to the numeric value of BB genotype. If this argument is not specified, the program will use '012' encoding by default. ga class This argument specifies the GA class used for the analysis. Each of the pre defined GA classes provides a diffe rent way to determine the score variable and a different fitness function to measure the fitness of generated models. Please use the kbas listga command to get a list of all available GA classes. The number preceding each GA class is the value that should be used for this argument. If this argument is not specified, the program will use GA class 1' by default ga gen This argument specifies the maximum number of generations the GA engine should run for if a model with the desired level of fitness is not obtained in any generation. It should be an integer number greater than zero. If this argument is not specified, the program will run for 1000 generations by default. ga pop This argument specifies the number of artificial chromosomes generated and evaluated in each generation of GA. Each artificial chromosome represents a different model being eva luated. The value of this argument should be an integer number greater than zero. If this argument is not specified, the program will use a default value of 500. ga bits This argument specifies the number of bits used for giving weight to each marker in each artificial chromosome. It should be an integer number greater than zero. For example in the simplest scenario of using one bit to represent individual marker's weights, each weight can be either 0 or 1: the weight therefore indicates whether the corr esponding marker is present in the model or not. With 3 bits, the weights of the SNPs would range from 0 to 7. If this argument is not specified, the program will use 1 bit for each marker by default.

PAGE 179

179 Table 5 5. Continued. Argument Description of the Va lue Assigned to the Argument ga num This argument specifies the approximate number of markers with non zero weights when artificial chromosomes are initialized in the first generation. This corresponds to the approximate number of markers initially incl uded in each candidate model. The number of markers can then change over the course of a run due to mutations and recombinations. The default value for this argument is 20. ga mut This argument specifies the probability that each individual bit in an ar tificial chromosome will be selected for mutation. If a bit is selected, the decision on whether to actually mutate it will depend on the value of the ga 0 to 1 and ga 1 to 0 arguments. This argument takes a real number between 0 and 1. If this argument is not specified, the program will use a n overall mutation rate of 0.05 by default. ga 0 to 1 This argument specifies the probability that a zero bit selected for mutation in an artificial chromosome will change to a 1 bit. The value of this argument sh ould be a real number between 0 and 1. If this argument is not specified, the program will use a zero to non zero mutation rate of 1 by default. ga 1 to 0 This argument specifies the probability that a 1 bit selected for mutation in an artificial chromo some will change to a zero bit. The value of this argument should be a real number between 0 and 1. If this argument is not specified, the program will use a non zero to zero mutation rate of 1 by default. ga thld This argument specifies the desired fit ness that should be reached by the GA. If during the course of a run the GA generates a model whose fitness is above this level, the run will stop and the model will be returned as the successful model. If instead the maximum number of generations is reach ed before the desired fitness level is attained, the GA will return the model with the highest fitness level in the last generation. The successful model can then undergo further validation by permutation testing. The value of this argument should be selec ted according to the GA class chosen by the user. If it is not specified, the program will use a threshold of 6 by default as the significance threshold for the t statistic obtained from each model using the default GA class (i.e. GA class '1'). ga rnd This argument specifies the number of rounds the GA engine uses to perform permutation testing (randomization testing). If it is specified the program will perform permutation testing over the retrieved successful model and print the resulting p value to the terminal and/or to the output file. This argument is zero by default meaning no permutation testing will be performed.

PAGE 180

180 Table 5 5. Continued. Argument Description of the Value Assigned to the Argument ga cores This argument specifies the number of co res employed to run the GA engine if the program is run on a multi core computer. If this argument is not specified, the program will employ 1 core by default. ga out If this argument is specified, the program will write the final successful model out t o this file, in tab delimited format. The output file will contain two columns containing the identifier of the markers present in the successful model (e.g. SNP identifiers) and their respective weights respectively. Note that this file will be created i f it does not exist, and overwritten if it already exists. If this argument is not specified, the program will not generate any output file and will only show the results on the terminal.

PAGE 181

181 Table 5 6. Description of the various fitness functions (GA classe s) can be used by the run command to evaluate the fitness of multi marker models Fitness Function Class (GA class) Description 1 For each subject in the case and control groups and for each model, the weight of each marker present in the model is multip l ied by the numerical value of the subject's genotype at the marker locus. A score is then computed as the sum of these products calculated for every marker loci in the model The distribution of scores is then compared between case and control groups usin g t test and the resulting t statistic is then used as the fitness of th e model. 2 This class is similar to the GA class '1', except that the number of the markers included in the model is taken into account when calculating the fitness, mildly favoring m odels with fewer markers. Under this scenario, t he fitness of a model is determined by adding the fitness value obtained by the method explained for the GA class '1' to (1 r), where r is the ratio of the number of markers included in the model under invest igation to the total number of markers present in the initial set of markers selected by a hypothesis. 3 For each subject in case and control groups and for each model a score is calculated the same as was explained for GA class '1'. T he overall average of the scores in the pooled population of cases and controls and the average of the scores in each of the two contrasting groups are then determined. If the average of the case group is greater than the overall average, the fitness of the model is defined as the sum of the number of cases whose scores are smaller than the overall average and the number of controls whose scores are greater than the overall average. If the average of the case group is smaller than the overall average, the fitness of the mode l is defined as the sum of the number of cases whose scores are greater than the overall average and the number of controls whose scores are smaller than the overall average.

PAGE 182

182 Table 5 6. Continued. Fitness Function Class (GA class) Description 4 This cl ass is similar to the GA class '3', except that the number of markers included in the model is taken into account when calculating th e fitness, favoring models with fewer markers. Under this scenario, the fitness of a model is determined by adding the fitn ess value obtained by the method explained for the GA class '3' to the number of markers present in the model multiplied by an appropriate factor (default=40). 5 For each subject in case and control groups and for each model a score is calculated as the logarithm of ratio of the two phenotype states of interest (e.g. diseased vs. healthy). The distribution of scores is then compared between the case and control groups using t test and the resul ting t statistic serves as the fitness of the model under consideration. 6 For each subject in case and control groups and for each model a score is determined the same as was explained for the GA class '5'. The fitness of the model is then defined as the sum of the number of case scores smaller than zero and the number of control scores greater than zero." 7 For each subject in the case and control groups and for each model, the ratio of the conditional probabilities of the subject's genotype under the t wo phenotype states of interest (e.g. diseased vs. healthy) is calculated for each marker locus present in the model A score is t hen computed as the sum of these ratios calculated for every marker loci in the model. T he fitness of the model under consider ation is then determined by the method explained for the GA class 5 8 For each subject in the case and control groups and for each model, the ratio of the conditional probabilities of the subject's genotype under the two phenotype states of interest (e .g. diseased vs. healthy) is calculated for each marker locus present in the model and is multiplied by the numeric al value of the subject's genotype at that locus. A score is then computed as the sum of these products calculated for every marker loci in the model. T he fitness of the model under consideration is then determined by the method explained for the GA class 5 '.

PAGE 183

183 Table 5 6. Continued. Fitness Function Class (GA class) Description 9 F or each subject in the case and control groups and for each mo del, a score is defined as the vector containing the included in that model. The distribution of scores is then compared between the case and control groups using Hotelling's T square test and t he resulting F statistic serves as the fitness of the model under consideration.

PAGE 184

184 CHAPTER 6 DISCUSSION The KBAS method I presented in this dissertation provides a framework for extending genome wide association studies to situations in which a phenotype is caused by a number of concurrent genetic factors. Its main purpose is to measure how well a multi SNP model is able to classify case and control subjects for which genotype data are available. Biologically, multi SNP models may implicate cases in which functional relationships exist among genetic factors of interest. These relationships might be for instance in the form of interaction among functionally redundant elements or physically interacting biomolecules, or of the concurrent occurrence of two or more hypomorphic mutations in different steps of a particular pathway [119,120] To capture these effects KBAS employs Genetic Algorithms [221,233], a powerful search method able to navigate through an extensive search space consisting of potential combin ations of markers and to refine them by removing markers that show limited contribution to the trait under consideration. Instead of examining each marker individually, KBAS allows the user to define models consisting of a set of markers, to integrate the markers' genotypes into a score variable and to evaluate the model s ability to correctly classify cases and controls on the basis of the distribution of the score variable values. Multi SNP models which define a score variable over a set of loci are able to take into account the joint effects of loci individually conferring small risks to the phenotype under investigation, and the different combinations of trait influencing loci which is of considerable importance in capturing the genetic heterogeneity und erlying a complex trait. For example, we can imagine a case in which a hypothetical phenotype is

PAGE 185

185 produced by concurrent mutations in any two out of three loci involved in a particular pathway (e.g. loci A, B and C). Therefore, subjects having mutations in loci A and B show the same phenotype as subjects having mutations in loci A and C or loci B and C. This can be formally described using the following logical statement: [A AND B] OR [A AND C] OR [B AND C]. However, s ince the main effect of each individual locus is small, subjects harboring single mutations do not develop the phenotype. Through d efining a score variable small association signals arising from individual risk loci into can be integrated into an overall association signal capturing the joint e ffects of contributing loci. Moreover, since the score variable is computed independently on each subject, different combinations of loci can contribute to the same score variable distribution in parallel to each other leading to a more accurate represent ation of the differences of case and control groups. Since KBAS employs a Genetic Algorithm based search engine, it enjoys high flexibility and efficiency in searching through an extremely large solution space. On the other hand, as with any other method relying on heuristic search algorithms, this does not ensure that the successful models are the absolute best ones. The KBAS method is instead meant to be used in a hypothesis based fashion: its purpose is not to discover the best set of markers that exp lains a phenotype; something that is computationally unfeasible since it would require testing all possible combinations of a given number of markers; but to confirm or invalidate the hypothesis that a user provided set of markers is associated with the ph enotype. The method does not require or stipulate a specific way of selecting markers: if the provided set does not contain a sufficient number of

PAGE 186

186 relevant ones, the algorithm will simply fail to find an appropriate combination of markers, and will repor t that the corresponding hypothesis is not supported by the available data. This is clearly observed using the two sets of pathways I selected for this study, the first one containing pathways known to be relevant to the immune system, and the second one c ontaining pathways unrelated to the immune system, used as negative controls. For instance, while on the first set of pathways KBAS was able to produce multiple replicated RA associated models with remarkable classifying efficacy, none of the pathways in t he second set gave rise to a successful model capable of significantly separating cases from controls ( see Tables 4 3, 4 4, and 4 5 ). In addition, this method provides researcher with a way to prioritize the successful models derived from different pathway s in order to evaluate which pathway is more likely to be related to the pathogenesis of the disease of interest. For example, comparing RA vs. CTR (see Table 4 5), the complement and coagulation cascade pathway seems to be of higher relevance to the proce ss of development of rheumatoid arthritis compared to T cell receptor signaling pathway. Also, a mong the significant models for the CD vs. CTR comparison (see Table 4 13 ), the B cell receptor signaling, cytokine cytokine receptor interaction and T cell rec eptor signaling pathways show high association with the Crohn's disease while for example the association of the antigen processing and presentation pathway seems to be of lower importance compared to the three aforementioned pathways.

PAGE 187

187 Another application of the KBAS method is to compare the contribution of specific pathways to different traits of interest. Although several pathways were labeled as associated with rheumatoid arthritis and Crohn's disease in this study each provides a different contributio n to the diseases under investigation. For example comparing the two successful models derived from the complement and coagulation cascade pathway through the analysis of RA and CD datasets, RA associated model has a smaller p value than the CD associated model, while the reverse is true for the two models derived from T cell receptor signaling pathway (see Tables 4 5 and 4 13 ). There are clear benefits to having SNPs selection be guided by preexisting biological knowledge. This choice maximizes the chance of including SNPs that are functionally related with the phenotype of interest, and makes it more likely that the results, if positive, may have an explicit biological interpretation. If the algorithm identifies a model with high case/control separation pe rformance, it can be directly used to formulate a biological explanation for the observed phenotype. For example, this could involve identifying the genes containing the SNPs within the model or in LD with them, and making them high priority candidates for further experimental analysis. The observed significant associations between immune system related multi SNP models and the diseases under investigation is consistent with previously revealed aspects of their pathogenesis as autoimmune diseases [1,187,188 ,248] The fact that a number of RA associated models were replicated in another population of the same ancestry makes them a potential predictor for estimating the risk of disease development in individuals of Caucasian ancestry. As shown in the Disease r isk Score

PAGE 188

188 class diagrams, although the risk of disease development rises parallel to the increase in the score variable value, the disease risk in the lowest and highest classes, constituting extreme portions of the score class spectrum, are not 0 or 100% respectively. This is consistent with the definition of complex diseases and indicates that there are still further genetic and non genetic risk factors contributing to the RA and CD, which remain to be discovered. Although I tested it using SNPs genotype data, KBAS can be applied to any kind of polymorphic genetic marker, provided that its alleles can be converted into numeric values in a way that consistently assigns values to different types of alleles (e.g. wild type and mutant alleles). KBAS will then use the score variable obtained by combining these values to classify study subjects as affected or unaffected. The KBAS Method presented in this dissertation has been published as an open access article [251]. Further extensions to the method, including d eveloping new fitness functions to capture the epistatic effect s of genetic factors on continues traits (such as weight and height), and a user friendly interface will be introduced in future.

PAGE 189

189 1. WTCCC (2007) Genome wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447: 661 678. doi:10.1038/nature05911. 2. Risch N, Merikangas K (1996) The future of genetic studies of complex human diseases. Science 273: 1516 1517. 3. Risch NJ (2000) Se arching for genetic determinants in the new millennium. Nature 405: 847 856. doi:10.1038/35015718. 4. Mitchell KJ (2012) What is complex about complex disorders? Genome Biology 13: 237. doi:10.1186/gb 2012 13 1 237. 5. Wang Z, Liu T, Lin Z, Hegarty J, Ko ltun WA, et al. (2010) A general model for multilocus epistatic interactions in case control studies. PLoS ONE 5: e11384. doi:10.1371/journal.pone.0011384. 6. Strachan T, Read A (2010) Human Molecular Genetics, Fourth Edition. 4th ed. Garland Science. 807 p. 7. Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, et al. (2009) Finding the missing heritability of complex diseases. Nature 461: 747 753. doi:10.1038/nature08494. 8. Altmller J, Palmer LJ, Fischer G, Scherb H, Wjst M (2001) Genomewide S cans of Complex Human Diseases: True Linkage Is Hard to Find. Am J Hum Genet 69: 936 950. 9. Donnelly P (2008) Progress and challenges in genome wide association studies in humans. Nature 456: 728 731. doi:10.1038/nature07631. 10. Clarke AJ, Cooper DN (2 010) GWAS: heritability missing in action? Eur J Hum Genet 18: 859 861. doi:10.1038/ejhg.2010.35. 11. Stahl EA, Raychaudhuri S, Remmers EF, Xie G, Eyre S, et al. (2010) Genome wide association study meta analysis identifies seven new rheumatoid arthritis risk loci. Nat Genet 42: 508 514. doi:10.1038/ng.582. 12. Barrett JC, Clayton DG, Concannon P, Akolkar B, Cooper JD, et al. (2009) Genome wide association study and meta analysis find that over 40 loci affect risk of type 1 diabetes. Nat Genet 41: 703 707 doi:10.1038/ng.381.

PAGE 190

190 13. Barrett JC, Hansoul S, Nicolae DL, Cho JH, Duerr RH, et al. (2008) Genome disease. Nature Genetics 40: 955 962. doi:10.1038/ng.175. 14. Easton DF, Ee les RA (2008) Genome wide association studies in cancer. Human Molecular Genetics 17: R109 R115. doi:10.1093/hmg/ddn287. 15. Ziegler A, Knig IR, Thompson JR (2008) Biostatistical aspects of genome wide association studies. Biom J 50: 8 28. doi:10.1002/bi mj.200710398. 16. Gibson G (2011) Rare and common variants: twenty arguments. Nat Rev Genet 13: 135 145. doi:10.1038/nrg3118. 17. Musani SK, Shriner D, Liu N, Feng R, Coffey CS, et al. (2007) Detection of gene x gene interactions in genome wide associati on studies of human population data. Hum Hered 63: 67 84. doi:10.1159/000099179. 18. Moore JH, Asselbergs FW, Williams SM (2010) Bioinformatics challenges for genome wide association studies. Bioinformatics 26: 445 455. doi:10.1093/bioinformatics/btp713. 19. Cordell HJ (2009) Detecting gene gene interactions that underlie human diseases. Nat Rev Genet 10: 392 404. doi:10.1038/nrg2579. 20. Plomin R, Haworth CMA, Davis OSP (2009) Common disorders are quantitative traits. Nat Rev Genet 10: 872 878. doi:10.1 038/nrg2670. 21. Yang J, Benyamin B, McEvoy BP, Gordon S, Henders AK, et al. (2010) Common SNPs explain a large proportion of the heritability for human height. Nat Genet 42: 565 569. doi:10.1038/ng.608. 22. Moore JH (2003) The ubiquitous nature of epist asis in determining susceptibility to common human diseases. Hum Hered 56: 73 82. doi:10.1159/000073735. 23. Zuk O, Hechter E, Sunyaev SR, Lander ES (2012) The mystery of missing heritability: Genetic interactions create phantom heritability. Proc Natl Ac ad Sci USA 109: 1193 1198. doi:10.1073/pnas.1119675109. 24. Gibson G (2009) Decanalization and the origin of complex disease. Nat Rev Genet 10: 134 140. doi:10.1038/nrg2502. 25. Riva A, Nuzzo A, Stefanelli M, Bellazzi R (2010) An automated reasoning fram ework for translational research. J Biomed Inform 43: 419 427. doi:10.1016/j.jbi.2009.11.005.

PAGE 191

191 26. Pattin KA, Moore JH (2008) Exploiting the Proteome to Improve the Genome Wide Genetic Analysis of Epistasis in Common Human Diseases. Hum Genet 124: 19 29. d oi:10.1007/s00439 008 0522 8. 27. Peng G, Luo L, Siu H, Zhu Y, Hu P, et al. (2010) Gene and pathway based second wave analysis of genome wide association studies. Eur J Hum Genet 18: 111 117. doi:10.1038/ejhg.2009.115. 28. A, Segurado R, Gill M, et al. (2009) The SNP ratio test: pathway analysis of genome wide association datasets. Bioinformatics 25: 2762 2763. doi:10.1093/bioinformatics/btp448. 29. Yu K, Li Q, Bergen AW, Pfeiffer RM, Rosenberg PS, et al. (2009) Pathway an alysis by adaptive combination of P values. Genet Epidemiol 33: 700 709. doi:10.1002/gepi.20422. 30. Falconer DS, Mackay TFC (1996) Introduction to Quantitative Genetics. 4th ed. Benjamin Cummings. 480 p. 31. Visscher PM, Hill WG, Wray NR (2008) Heritabi lity in the genomics era -concepts and misconceptions. Nat Rev Genet 9: 255 266. doi:10.1038/nrg2322. 32. Dickens WT, Flynn JR (2001) Heritability estimates versus large environmental effects: the IQ paradox resolved. Psychol Rev 108: 346 369. 33. Slatki n M (2009) Epigenetic Inheritance and the Missing Heritability Problem. Genetics 182: 845 850. doi:10.1534/genetics.109.102798. 34. Carlborg O, Haley CS (2004) Epistasis: too often neglected in complex trait studies? Nat Rev Genet 5: 618 625. doi:10.1038/ nrg1407. 35. Nagel RL (2005) Epistasis and the genetics of human diseases. C R Biol 328: 606 615. doi:10.1016/j.crvi.2005.05.003. 36. Reveille JD (2006) Major histocompatibility genes and ankylosing spondylitis. Best Pract Res Clin Rheumatol 20: 601 609. doi:10.1016/j.berh.2006.03.004. 37. Principles and Practice of Medical Genetics e dition: Continually Updated Online Reference, 3 Volume Set, 5e (Principles and Practice of Medical Genet ics. 5th ed. Churchill Livingstone. 3637 p. 38. Risch HA, McLaughlin JR, Cole DEC, Rosen B, Bradley L, et al. (2006) Population BRCA1 and BRCA2 Mutation Frequencies and Cancer Penetrances: A Kin Cohort Study in Ontario, Canada. JNCI J Natl Cancer Inst 98: 1694 1706. doi:10.1093/jnci/djj465.

PAGE 192

192 39. Ford D, Easton DF, Stratton M, Narod S, Goldgar D, et al. (1998) Genetic heterogeneity and penetrance analysis of the BRCA1 and BRCA2 genes in breast cancer families. The Breast Cancer Linkage Consortium. Am J Hum Genet 62: 676 689. 40. Ford D, Easton DF, Peto J (1995) Estimates of the gene frequency of BRCA1 and its contribution to breast and ovarian cancer incidence. Am J Hum Genet 57: 1457 1462. 41. Antoniou AC, Pharoah PDP, McMullan G, Day NE, Stratton MR, et al. (2002) A comprehensive model for familial breast cancer incorporating BRCA1, BRCA2 and other genes. Br J Cancer 86: 76 83. doi:10.1038/sj.bjc.6600008. 42. Struewing JP, Hartge P, Wacholder S, Baker SM, Berlin M, et al. (1997) The risk of cancer associ ated with specific mutations of BRCA1 and BRCA2 among Ashkenazi Jews. N Engl J Med 336: 1401 1408. doi:10.1056/NEJM199705153362001. 43. implications of systematic meta analyses. Nat Rev Neurosci 9: 768 778. doi:10.1038/nrn2494. 44. Williamson J, Goldman J, Marder KS (2009) Genetic aspects of Alzheimer disease. Neurologist 15: 80 86. doi:10.1097/NRL.0b013e318187e76b. 45. Ertekin entennial review. Neurol Clin 25: 611 667, v. doi:10.1016/j.ncl.2007.03.009. 46. Ellard S, Colclough K (2006) Mutations in the genes encoding the transcription factors hepatocyte nuclear factor 1 alpha (HNF1A) and 4 alpha (HNF4A) in maturity onset diabete s of the young. Hum Mutat 27: 854 869. doi:10.1002/humu.20357. 47. Owen K, Hattersley AT (2001) Maturity onset diabetes of the young: from clinical description to molecular genetic characterization. Best Pract Res Clin Endocrinol Metab 15: 309 323. doi:10 .1053/beem.2001.0148. 48. Hamosh A, Scott AF, Amberger JS, Bocchini CA, McKusick VA (2005) Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res 33: D514 D517. doi:10.1093/nar/gki033. 49. Stra nger BE, Stahl EA, Raj T (2011) Progress and promise of genome wide association studies for human complex trait genetics. Genetics 187: 367 383. doi:10.1534/genetics.110.120907.

PAGE 193

193 50. Garrod AE (1996) The incidence of alkaptonuria: a study in chemical indiv iduality. 1902. Mol Med 2: 274 282. 51. Garrod A e (2009) Inborn Errors of Metabolism. BiblioBazaar. 178 p. 52. Medicine 2: 271. 53. Bodmer W, Tomlinson I (2010) Rare genetic varian ts and the risk of cancer. Curr Opin Genet Dev 20: 262 267. doi:10.1016/j.gde.2010.04.016. 54. Cutting GR (2010) Modifier genes in Mendelian disorders: the example of cystic fibrosis. Ann N Y Acad Sci 1214: 57 69. doi:10.1111/j.1749 6632.2010.05879.x. 55. disorders to complex traits. Mol Genet Metab 71: 43 50. doi:10.1006/mgme.2000.3052. 56. Mller HH, Heinimann K, Dobbie Z (2000) Genetics of hereditary colon cancer a basis for preve ntion? European Journal of Cancer 36: 1215 1223. doi:10.1016/S0959 8049(00)00111 8. 57. Sankaran VG, Lettre G, Orkin SH, Hirschhorn JN (2010) Modifier genes in Mendelian disorders: the example of hemoglobin disorders. Ann N Y Acad Sci 1214: 47 56. doi:10. 1111/j.1749 6632.2010.05821.x. 58. Scriver CR, Waters PJ (1999) Monogenic traits are not simple: lessons from phenylketonuria. Trends Genet 15: 267 272. 59. Wu Y, Whitman I, Molmenti E, Moore K, Hippenmeyer P, et al. (1994) A lag in intracellular degrada tion of mutant alpha 1 antitrypsin correlates with the liver disease phenotype in homozygous PiZZ alpha 1 antitrypsin deficiency. Proc Natl Acad Sci U S A 91: 9014 9018. 60. Phillips PC (1998) The language of gene interaction. Genetics 149: 1167 1171. 61. Vorzimmer P (1963) Charles Darwin and Blending Inheritance. Isis 54: 371 390. 62. Fisher R (1918) The correlation between relatives on the supposition of Mendelian inheritance. Transactions of the Royal Society Edinburgh 52: 399 433. 63. Bodmer W, Boni lla C (2008) Common and rare variants in multifactorial susceptibility to common diseases. Nat Genet 40: 695 701. doi:10.1038/ng.f.136.

PAGE 194

194 64. University Press. 466 p. Available:http://arc hive.org/details/mendelsprinciple00bate. 65. Fisher R (1921) On the probable error of a coefficient of correlation deduced from a small sample. Metron 1: 3 32. 66. Thoday JM (1961) Location of Polygenes. Published online: 22 July 1961; | doi:101038/191 368a0 191: 368 370. doi:10.1038/191368a0. 67. Morgan TH (1911) THE ORIGIN OF NINE WING MUTATIONS IN DROSOPHILA. Science 33: 496 499. doi:10.1126/science.33.848.496. 68. Morgan TH (1911) Random Segregation Versus Coupling in Mendelian Inheritance. Science 34: 384 384. doi:10.1126/science.34.873.384. 69. Morgan TH (1922) The mechanism of Mendelian heredity,. Revised. Holt. 357 p. 70. Sturtevant AH (1913) The Linear Arrangement Of Six Sex Linked Factors In Drosophila, As Shown By Their Mode Of Association. Journal of Experimental Zoology 14: 43 59. 71. Sturtevant AH (1915) The behavior of the chromosomes as studied through linkage. Molecular and General Genetics MGG 13: 234 287. doi:10.1007/BF01792906. 72. Sax K (1923) The Association of Size Differences with Seed Coat Pattern and Pigmentation in PHASEOLUS VULGARIS. Genetics 8: 552 560. 73. Clarke CA, Edwards JW, Haddock DRW, Howel Evans AW, McConnell RB, et al. (1956) ABO Blood Groups and Secretor Character in Duodenal Ulcer. Br Med J 2: 725 731. 74. Ai rd I, Bentall HH, Roberts JAF (1953) A relationship between cancer of stomach and the ABO blood groups. Br Med J 1: 799 801. 75. Lilly F, Boyse EA, Old LJ (1964) Genetic Basis of Susceptibility to Viral Leukaemogenesis. Lancet 2: 1207 1209. 76. Curtis D (1997) Use of siblings as controls in case control association studies. Ann Hum Genet 61: 319 333. doi:10.1046/j.1469 1809.1998.6210089.x. 77. Corder EH, Saunders AM, Strittmatter WJ, Schmechel DE, Gaskell PC, et al. (1993) Gene dose of apolipoprotein E t disease in late onset families. Science 261: 921 923. doi:10.1126/science.8346443.

PAGE 195

195 78. Hudson TJ, Stein LD, Gerety SS, Ma J, Castle AB, et al. (1995) An STS based map of the human genome. Science 270: 1945 1954. 79 Lyengar SK, Elston RC (2007) The genetic basis of complex traits: rare variants 84. doi:10.1007/978 1 59745 389 9_6. 80. Pritchard JK, Cox NJ (2002) The allelic architecture of human disease ge nes: common disease 2423. doi:10.1093/hmg/11.20.2417. 81. Bodmer W (1981) HLA The Major Human Tissue Typing System. In: Messel H, editor. The Biological Manipulation of Life. Sydney: Pergamon. pp. 217 244. 8 2. Farrall M (2004) Quantitative genetic variation: a post modern view. Hum Mol Genet 13 Spec No 1: R1 7. doi:10.1093/hmg/ddh084. 83. Robertson A (1967) The nature of quantitative genetic variation. In: Brink RA, editor. Heritage From Mendel. Madison, WI : University of Wisconsin Press. pp. 265 280. 84. Teslovich TM, Musunuru K, Smith AV, Edmondson AC, Stylianou IM, et al. (2010) Biological, clinical and population relevance of 95 loci for blood lipids. Nature 466: 707 713. doi:10.1038/nature09270. 85. F isher RA (2000) The Genetical Theory of Natural Selection. 1st ed. Bennett JH, editor Oxford University Press, USA. 318 p. 86. Park J H, Wacholder S, Gail MH, Peters U, Jacobs KB, et al. (2010) Estimation of effect size distribution from genome wide assoc iation studies and implications for future discoveries. Nature Genetics 42: 570 575. doi:10.1038/ng.610. 87. Allen HL, Estrada K, Lettre G, Berndt SI, Weedon MN, et al. (2010) Hundreds of variants clustered in genomic loci and biological pathways affect h uman height. Nature 467: 832 838. doi:10.1038/nature09410. 88. Franke A, McGovern DPB, Barrett JC, Wang K, Radford Smith GL, et al. (2010) Genome wide meta disease susceptibility loci. Nat Genet 42: 1118 1125. doi:10.1038/ng.717. 89. Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature 460: 748 752. doi:10.1038/nature08185. 90. Goddard ME Hayes BJ (2007) Genomic selection. J Anim Breed Genet 124: 323 330. doi:10.1111/j.1439 0388.2007.00702.x.

PAGE 196

196 91. Goldstein DB (2009) Common genetic variation and human traits. N Engl J Med 360: 1696 1698. doi:10.1056/NEJMp0806284. 92. Dickson SP, Wang K, Krantz I, Hakonarson H, Goldstein DB (2010) Rare Variants Create Synthetic Genome Wide Associations. PLoS Biol 8: e1000294. doi:10.1371/journal.pbio.1000294. 93. Wang K, Dickson SP, Stolle CA, Krantz ID, Goldstein DB, et al. (2010) Interpretation of assoc iation signals and identification of causal variants from genome wide association studies. Am J Hum Genet 86: 730 742. doi:10.1016/j.ajhg.2010.04.003. 94. Wray NR, Purcell SM, Visscher PM (2011) Synthetic Associations Created by Rare Variants Do Not Expla in Most GWAS Results. PLoS Biol 9: e1000579. doi:10.1371/journal.pbio.1000579. 95. Waters KM, Stram DO, Hassanein MT, Le Marchand L, Wilkens LR, et al. (2010) Consistent Association of Type 2 Diabetes Risk Variants Found in Europeans in Diverse Racial and Ethnic Groups. PLoS Genet 6: e1001078. doi:10.1371/journal.pgen.1001078. 96. Rienzo AD, Hudson RR (2005) An evolutionary framework for common diseases: the ancestral susceptibility model. Trends in Genetics 21: 596 601. doi:10.1016/j.tig.2005.08.007. 97. Williams GC (1957) Pleiotropy, Natural Selection, and the Evolution of Senescence. Evolution 11: 398 411. doi:10.2307/2406060. 98. Ding K, Kullo IJ (2009) Evolutionary Genetics of Coronary Heart Disease. Circulation 119: 459 467. doi:10.1161/CIRCULATION AHA.108.809970. 99. 362. 100. Eyre Walker A (2010) Genetic architecture of a complex trait and its implications for fitness and genome wide association studies. Proc Natl Acad Sci U S A 107: 1752 1756. doi:10.1073/pnas.0906182107. 101. Easton DF, Deffenbaugh AM, Pruss D, Frye C, Wenstrup RJ, et al. (2007) A systematic genetic assessment of 1,433 sequence variants of unknown clinical significa nce in the BRCA1 and BRCA2 breast cancer predisposition genes. Am J Hum Genet 81: 873 883. doi:10.1086/521032. 102. Renwick A, Thompson D, Seal S, Kelly P, Chagtai T, et al. (2006) ATM mutations that cause ataxia telangiectasia are breast cancer susceptib ility alleles. Nature Genetics 38: 873 875. doi:10.1038/ng1837.

PAGE 197

197 103. Fearnhead NS, Wilding JL, Winney B, Tonks S, Bartlett S, et al. (2004) Multiple rare variants in different genes account for multifactorial inherited susceptibility to colorectal adenoma s. PNAS 101: 15992 15997. doi:10.1073/pnas.0407187101. 104. Macfarlane WM, Frayling TM, Ellard S, Evans JC, Allen LI, et al. (1999) Missense mutations in the insulin promoter factor 1 gene predispose to type 2 diabetes. J Clin Invest 104: R33 39. doi:10.1 172/JCI7449. 105. Stone EM, Braun TA, Russell SR, Kuehn MH, Lotery AJ, et al. (2004) Missense variations in the fibulin 5 gene and age related macular degeneration. N Engl J Med 351: 346 353. doi:10.1056/NEJMoa040833. 106. Stefansson H, Rujescu D, Cichon S, Pietilinen OPH, Ingason A, et al. (2008) Large recurrent microdeletions associated with schizophrenia. Nature 455: 232 236. doi:10.1038/nature07229. 107. Zhang F, Gu W, Hurles ME, Lupski JR (2009) Copy number variation in human health, disease, and e volution. Annu Rev Genomics Hum Genet 10: 451 481. doi:10.1146/annurev.genom.9.081307.164217. 108. Weiss LA, Shen Y, Korn JM, Arking DE, Miller DT, et al. (2008) Association between microdeletion and microduplication at 16p11.2 and autism. N Engl J Med 35 8: 667 675. doi:10.1056/NEJMoa075974. 109. Park J H, Gail MH, Weinberg CR, Carroll RJ, Chung CC, et al. (2011) Distribution of allele frequencies and effect sizes and their interrelationships for common genetic susceptibility variants. PNAS 108: 18026 180 31. doi:10.1073/pnas.1114759108. 110. 371 373. doi:10.1038/ng1296 371. 111. Cheverud JM, Routman EJ (1995) Epistasis and its contribution to genetic variance components. Genetics 139: 1455 1461. 112. Roth FP, Lipshitz HD, Andrews BJ (2009) Q&A: Epistasis. Journal of Biology 8: 35. doi:10.1186/jbiol144. 113. methods to detect it in humans. Hum Mol G enet 11: 2463 2468. doi:10.1093/hmg/11.20.2463. 114. Hill WG (1984) Quantitative Genetics: Explanation and analysis of continuous variation. Van Nostrand Reinhold. 376 p. 115. Hollander WF (1955) Epistasis and Hypostasis. J Hered 46: 222 225.

PAGE 198

198 116. Bhend e YM, DESHPANDE CK, BHATIA HM, SANGER R, RACE RR, et al. 904. 117. Dipta TF, Hossain AZ (2011) The Bombay blood group: are we out of risk? Mymensingh Med J 20: 536 540. 118. An astassiou D (2007) Computational analysis of the synergy among multiple interacting genes. Mol Syst Biol 3: 83. doi:10.1038/msb4100124. 119. Prez Prez JM, Candela H, Micol JL (2009) Understanding synergy in genetic interactions. Trends Genet 25: 368 376 doi:10.1016/j.tig.2009.06.004. 120. Martienssen R, Irish V (1999) Copying out our ABCs: the role of gene redundancy in interpreting genetic hierarchies. Trends Genet 15: 435 437. 121. Ramagopalan SV, Morris AP, Dyment DA, Herrera BM, DeLuca GC, et al. (2007) The Inheritance of Resistance Alleles in Multiple Sclerosis. PLoS Genet 3: e150. doi:10.1371/journal.pgen.0030150. 122. Wright S (1932) The roles of mutation, inbreeding, crossbreeding and selection in evolution. ProcIntCongGen 1: 356 366. 123. Ja snos L, Korona R (2007) Epistatic buffering of fitness loss in yeast double deletion strains. Nature Genetics 39: 550 554. doi:10.1038/ng1986. 124. Silva J da, Coetzer M, Nedellec R, Pastore C, Mosier DE (2010) Fitness Epistasis and Constraints on Adaptat ion in a Human Immunodeficiency Virus Type 1 Protein Region. Genetics 185: 293 303. doi:10.1534/genetics.109.112458. 125. Albarracn Orio AG, Pias GE, Cortes PR, Cian MB, Echenique J (2011) Compensatory Evolution of pbp Mutations Restores the Fitness Cos t Imposed by Lactam Resistance in Streptococcus pneumoniae. PLoS Pathog 7: e1002000. doi:10.1371/journal.ppat.1002000. 126. Kirby DA, Muse SV, Stephan W (1995) Maintenance of pre mRNA secondary structure by epistatic selection. Proc Natl Acad Sci USA 92 : 9047 9051. 127. Chen Y, Carlini DB, Baines JF, Parsch J, Braverman JM, et al. (1999) RNA secondary structure and compensatory evolution. Genes Genet Syst 74: 271 286. 128. Desai MM, Weissman D, Feldman MW (2007) Evolution Can Favor Antagonistic Epistas is. Genetics 177: 1001 1010. doi:10.1534/genetics.107.075812.

PAGE 199

199 129. Bonhoeffer S, Chappey C, Parkin NT, Whitcomb JM, Petropoulos CJ (2004) Evidence for Positive Epistasis in HIV 1. Science 306: 1547 1550. doi:10.1126/science.1101786. 130. De Visser JAGM, Hermisson J, Wagner GP, Ancel Meyers L, Bagheri Chaichian H, et al. (2003) Perspective: Evolution and detection of genetic robustness. Evolution 57: 1959 1972. 131. Miyashita NT, Aguad M, Langley CH (1993) Linkage disequilibrium in the white locus region of Drosophila melanogaster. Genet Res 62: 101 109. 132. Schaeffer SW, Miller EL (1993) Estimates of linkage disequilibrium and the recombination parameter determined from segregating nucleotide sites in the alcohol dehydrogenase region of Drosophila pseu doobscura. Genetics 135: 541 552. 133. Takahasi KR, Innan H (2008) The direction of linkage disequilibrium: a new measure based on the ancestral derived status of segregating alleles. Genetics 179: 1705 1712. doi:10.1534/genetics.107.085654. 134. Felix R Bodmer W, Fearnhead NS, Van der Merwe L, Goldberg P, et al. (2006) GSTM1 and GSTT1 polymorphisms as modifiers of age at diagnosis of hereditary nonpolyposis colorectal cancer (HNPCC) in a homogeneous cohort of individuals carrying a single predisposing m utation. Mutat Res 602: 175 181. doi:10.1016/j.mrfmmm.2006.09.004. 135. Andrew AS, Karagas MR, Nelson HH, Guarrera S, Polidoro S, et al. (2008) DNA repair polymorphisms modify bladder cancer risk: a multi factor analytic strategy. Hum Hered 65: 105 118. d oi:10.1159/000108942. 136. Krupa R, Blasiak J (2004) An association of polymorphism of DNA repair genes XRCC1 and XRCC3 with colorectal cancer. J Exp Clin Cancer Res 23: 285 294. 137. Lavender NA, Rogers EN, Yeyeodu S, Rudd J, Hu T, et al. (2012) Interac tion among apoptosis associated sequence variants and joint effects on aggressive prostate cancer. BMC Med Genomics 5: 11. doi:10.1186/1755 8794 5 11. 138. Tiret L, Bonnardeaux A, Poirier O, Ricard S, Marques Vidal P, et al. (1994) Synergistic effects of angiotensin converting enzyme and angiotensin II type 1 receptor gene polymorphisms on risk of myocardial infarction. Lancet 344: 910 913. 139. Martin MP, Qi Y, Gao X, Yamada E, Martin JN, et al. (2007) Innate partnership of HLA B and KIR3DL1 subtypes aga inst HIV 1. Nat Genet 39: 733 740. doi:10.1038/ng2035.

PAGE 200

200 140. Ma DQ, Whitehead PL, Menold MM, Martin ER, Ashley Koch AE, et al. (2005) Identification of significant association and gene gene interaction of GABA receptor subunit genes in autism. Am J Hum Gen et 77: 377 388. doi:10.1086/433195. 141. Qin S, Zhao X, Pan Y, Liu J, Feng G, et al. (2005) An association study of the N methyl D aspartate receptor NR1 subunit gene (GRIN1) and NR2B subunit gene (GRIN2B) in schizophrenia with universal DNA microarray. E ur J Hum Genet 13: 807 814. doi:10.1038/sj.ejhg.5201418. 142. Qi L, Van Dam RM, Asselbergs FW, Hu FB (2007) Gene gene interactions between HNF4A and KCNJ11 in predicting Type 2 diabetes in women. Diabet Med 24: 1187 1191. doi:10.1111/j.1464 5491.2007.0225 5.x. 143. Neuman RJ, Wasson J, Atzmon G, Wainstein J, Yerushalmi Y, et al. (2010) Gene Gene Interactions Lead to Higher Risk for Development of Type 2 Diabetes in an Ashkenazi Jewish Population. PLoS ONE 5: e9903. doi:10.1371/journal.pone.0009903. 144. R itchie MD, Hahn LW, Roodi N, Bailey LR, Dupont WD, et al. (2001) Multifactor dimensionality reduction reveals high order interactions among estrogen metabolism genes in sporadic breast cancer. Am J Hum Genet 69: 138 147. doi:10.1086/321276. 145. Greene CS Penrod NM, Williams SM, Moore JH (2009) Failure to Replicate a Genetic Association May Provide Important Clues About Genetic Architecture. PLoS ONE 4: e5639. doi:10.1371/journal.pone.0005639. 146. Shriner D, Vaughan LK, Padilla MA, Tiwari HK (2007) Prob lems with Genome Wide Association Studies. Science 316: 1840 1842. doi:10.1126/science.316.5833.1840c. 147. Chanock SJ, Manolio T, Boehnke M, Boerwinkle E, Hunter DJ, et al. (2007) Replicating genotype phenotype associations. Nature 447: 655 660. doi:10.1 038/447655a. 148. Infante J, Sanz C, Fernndez Luna JL, Llorca J, Berciano J, et al. (2004) Gene gene interaction between interleukin 6 and interleukin 10 reduces AD risk. Neurology 63: 1135 1136. 149. Combarros O, Van Duijn CM, Hammond N, Belbin O, Aria s Vsquez A, et al. (2009) Replication by the Epistasis Project of the interaction between the genes for IL 6 and IL doi:10.1186/1742 2094 6 22.

PAGE 201

201 150. Lu Q, Wei C, Ye C, Li M, Elston RC (201 2) A Likelihood Ratio Based Mann Whitney Approach Finds Novel Replicable Joint Gene Action for Type 2 Diabetes. Genetic epidemiology. Available:http://www.nc bi.nlm.nih.gov/pubmed/22760990. 151. Caspi A, Moffitt TE, Cannon M, McClay J, Murray R, et al. (20 05) Moderation of the effect of adolescent onset cannabis use on adult psychosis by a functional polymorphism in the catechol O methyltransferase gene: longitudinal evidence of a gene X environment interaction. Biol Psychiatry 57: 1117 1127. doi:10.1016/j. biopsych.2005.01.026. 152. Caspi A, Sugden K, Moffitt TE, Taylor A, Craig IW, et al. (2003) Influence of Life Stress on Depression: Moderation by a Polymorphism in the 5 HTT Gene. Science 301: 386 389. doi:10.1126/science.1083968. 153. Smith AM, Bernstei n DI, LeMasters GK, Huey NL, Ericksen M, et al. (2008) Environmental tobacco smoke and interleukin 4 polymorphism (C 589T) gene: environment interaction increases risk of wheezing in African American infants. J Pediatr 152: 709 715, 715.e1. doi:10.1016/j.j peds.2007.10.011. 154. Hamza TH, Chen H, Hill Burns EM, Rhodes SL, Montimurro J, et al. (2011) Genome Wide Gene Environment Study Identifies Glutamate Receptor Gene PLoS Genet 7: e 1002237. doi:10.1371/journal.pgen.1002237. 155. Liu L, Li Y, Tollefsbol TO (2008) Gene Environment Interactions and Epigenetic Basis of Human Diseases. Curr Issues Mol Biol 10: 25 36. 156. Burrell AM, Handel AE, Ramagopalan SV, Ebers GC, Morahan JM (2011 ) Epigenetic Mechanisms in Multiple Sclerosis and the Major Histocompatibility Complex (MHC). Discovery Medicine 11: 187 196. 157. Anway MD, Cupp AS, Uzumcu M, Skinner MK (2005) Epigenetic Transgenerational Actions of Endocrine Disruptors and Male Fertili ty. Science 308: 1466 1469. doi:10.1126/science.1108190. 158. Crews D, Gillette R, Scarpino SV, Manikkam M, Savenkova MI, et al. (2012) Epigenetic transgenerational inheritance of altered stress responses. PNAS. Available:http://www.pnas.org/conte nt/early /2012/05/15/1118514109. 159. Rutherford SL (2000) From genotype to phenotype: buffering mechanisms and the storage of genetic information. Bioessays 22: 1095 1105. doi:10.1002/1521 1878(200012)22:12<1095::AID BIES7>3.0.CO;2 A. 160. Wilkins JF (2011) Geno mic imprinting and conflict induced decanalization. Evolution 65: 537 553. doi:10.1111/j.1558 5646.2010.01147.x.

PAGE 202

202 161. Hornstein E, Shomron N (2006) Canalization of development by microRNAs. Nature Genetics 38: S20 S24. doi:10.1038/ng1803. 162. Gibson G, Wagner G (2000) Canalization in evolutionary genetics: a stabilizing theory? Bioessays 22: 372 380. doi:10.1002/(SICI)1521 1878(200004)22:4<372::AID BIES7>3.0.CO;2 J. 163. Wagner GP, Booth G, Bagheri Chaichian H (1997) A Population Genetic Theory of Canal ization. Evolution 51: 329 347. doi:10.2307/2411105. 164. Hartl FU, Bracher A, Hayer Hartl M (2011) Molecular chaperones in protein folding and proteostasis. Nature 475: 324 332. doi:10.1038/nature10317. 165. Whitesell L, Lindquist SL (2005) HSP90 and th e chaperoning of cancer. Nat Rev Cancer 5: 761 772. doi:10.1038/nrc1716. 166. Wu C I, Shen Y, Tang T (2009) Evolution under canalization and the dual roles of microRNAs A hypothesis. Genome Res 19: 734 743. doi:10.1101/gr.084640.108. 167. Morton NE (1955 ) Sequential tests for the detection of linkage. Am J Hum Genet 7: 277 318. 168. Clerget Darpoux F (2002) Genetic Linkage Analysis. Atlas Genet Cytogenet Oncol Haematol. Available:http://AtlasGeneticsOncology.org/Educ/LinkageShortID30031ES.html. 169. Boe hnke M, Langefeld CD (1998) Genetic association mapping based on discordant sib pairs: the discordant alleles test. Am J Hum Genet 62: 950 961. 170. Spielman RS, McGinnis RE, Ewens WJ (1993) Transmission test for linkage disequilibrium: the insulin gene r egion and insulin dependent diabetes mellitus (IDDM). Am J Hum Genet 52: 506 516. 171. Barroso I, Luan J, Middelberg RPS, Harding A H, Franks PW, et al. (2003) Candidate Gene Association Study in Type 2 Diabetes Indicates a Role for Ce ll Function as Well as Insulin Action. PLoS Biol 1: e20. doi:10.1371/journal.pbio.0000020. 172. Cavalli Sforza LL (2005) The Human Genome Diversity Project: past, present and future. Nature Reviews Genetics 6: 333 340. doi:10.1038/nrg1596. 173. Cann HM, De Toma C, Cazes L, Legrand M F, Morel V, et al. (2002) A human genome diversity cell line panel. Science 296: 261 262.

PAGE 203

203 174. Frazer KA, Ballinger DG, Cox DR, Hinds DA, Stuve LL, et al. (2007) A second generation human haplotype map of over 3.1 million SNP s. Nature 449: 851 861. doi:10.1038/nature06258. 175. Wang DG, Fan JB, Siao CJ, Berno A, Young P, et al. (1998) Large scale identification, mapping, and genotyping of single nucleotide polymorphisms in the human genome. Science 280: 1077 1082. 176. Krugl yak L (2008) The road to genome wide association studies. Nature Reviews Genetics 9: 314 318. doi:10.1038/nrg2316. 177. Weedon MN, Lango H, Lindgren CM, Wallace C, Evans DM, et al. (2008) Genome wide association analysis identifies 20 loci that influence adult height. Nat Genet 40: 575 583. doi:10.1038/ng.121. 178. Liu JZ, Medland SE, Wright MJ, Henders AK, Heath AC, et al. (2010) Genome wide association study of height and body mass index in Australian twin families. Twin Res Hum Genet 13: 179 193. doi:1 0.1375/twin.13.2.179. 179. Speliotes EK, Willer CJ, Berndt SI, Monda KL, Thorleifsson G, et al. (2010) Association analyses of 249,796 individuals reveal 18 new loci associated with body mass index. Nature Genetics 42: 937 948. doi:10.1038/ng.686. 180. S amani NJ, Erdmann J, Hall AS, Hengstenberg C, Mangino M, et al. (2007) Genomewide Association Analysis of Coronary Artery Disease. New England Journal of Medicine 357: 443 453. doi:10.1056/NEJMoa072366. 181. Devlin B, Roeder K (1999) Genomic control for a ssociation studies. Biometrics 55: 997 1004. 182. Tian C, Gregersen PK, Seldin MF (2008) Accounting for ancestry: population substructure and genome wide association studies. Human Molecular Genetics 17: R143 R150. doi:10.1093/hmg/ddn268. 183. Verhoeven KJF, Simonsen KL, McIntyre LM (2005) Implementing false discovery rate control: increasing your power. Oikos 108: 643 647. doi:10.1111/j.0030 1299.2005.13727.x. 184. Schizophrenia Psychiatric Genome Wide Association Study (GWAS) Consortium (2011) Genome w ide association study identifies five new schizophrenia loci. Nat Genet 43: 969 976. doi:10.1038/ng.940. 185. Culverhouse R, Suarez BK, Lin J, Reich T (2002) A Perspective on Epistasis: Limits of Models Displaying No Main Effect. Am J Hum Genet 70: 461 47 1.

PAGE 204

204 186. Nelson MR, Kardia SL, Ferrell RE, Sing CF (2001) A combinatorial partitioning method to identify multilocus genotypic partitions that predict quantitative trait variation. Genome Res 11: 458 470. doi:10.1101/gr.172901. 187. Cooles FAH, Isaacs JD (2011) Pathophysiology of rheumatoid arthritis. Curr Opin Rheumatol 23: 233 240. doi:10.1097/BOR.0b013e32834518a3. 188. De Vries R (2011) Genetics of rheumatoid arthritis: time for a change! Curr Opin Rheumatol 23: 227 232. doi:10.1097/BOR.0b013e328345752 4. 189. Pattin K, Gui J, Moore J (2010) Employing Publically Available Biological Expert Knowledge from Protein Protein Interaction Information. In: Dijkstra T, Tsivtsivadze E, Marchiori E, Heskes T, editors. Pattern Recognition in Bioinformatics. Lecture Notes in Computer Science. Springer Berlin / Heidelberg, Vol. 6282. pp. 395 406. Available:http://www.springerlink.com/content/n62788t32kq1460v/abstract/. 190. Ramoni M, Stefanelli M, Magnani L, Barosi G (1992) Epistemological framework for medical knowl edge based systems. IEEE Transactions on Systems, Man and Cybernetics 22: 1361 1375. 191. Magnani L (2000) Abduction, Reason and Science Processes of Discovery and Explanation. 1st ed. Springer. 224 p. 192. Ballouz S, Liu JY, Oti M, Gaeta B, Fatkin D, et al. (2011) Analysis of genome wide association study data using the protein knowledge base. BMC Genetics 12: 98. doi:10.1186/1471 2156 12 98. 193. Vittinghoff E, Glidden DV, Shiboski SC, McCulloch CE (2005) Regression Methods in Biostatistics: Linear, Logistic, Survival, and Repeated Measures Models. Springer. 360 p. 194. Chapman J, Clayton D (2007) Detecting association using epistatic information. Genet Epidemiol 31: 894 909. doi:10.1002/gepi.20250. 195. Kraft P, Yen Y C, Stram DO, Morrison J, Gaude rman WJ (2007) Exploiting gene environment interaction to detect genetic associations. Hum Hered 63: 111 119. doi:10.1159/000099183. 196. Albert PS, Ratnasinghe D, Tangrea J, Wacholder S (2001) Limitations of the Case only Design for Identifying Gene Envi ronment Interactions. Am J Epidemiol 154: 687 693. doi:10.1093/aje/154.8.687.

PAGE 205

205 197. Mukherjee B, Chatterjee N (2008) Exploiting gene environment independence for analysis of case control studies: an empirical Bayes type shrinkage estimator to trade off be tween bias and efficiency. Biometrics 64: 685 694. doi:10.1111/j.1541 0420.2007.00953.x. 198. Allen AS, Satten GA (2007) Statistical models for haplotype sharing in case parent trio data. Hum Hered 64: 35 44. doi:10.1159/000101421. 199. Van der Meulen MA Te Meerman GJ (1997) Haplotype sharing analysis in affected individuals from nuclear families with at least one affected offspring. Genet Epidemiol 14: 915 920. doi:10.1002/(SICI)1098 2272(1997)14:6<915::AID GEPI59>3.0.CO;2 P. 200. Liu P Y, Zhang Y Y, L u Y, Long J R, Shen H, et al. (2005) A survey of haplotype variants at several disease candidate genes: the importance of rare variants for complex diseases. J Med Genet 42: 221 227. doi:10.1136/jmg.2004.024752. 201. Marchini J, Howie B, Myers S, McVean G Donnelly P (2007) A new multipoint method for genome wide association studies by imputation of genotypes. Nature Genetics 39: 906 913. doi:10.1038/ng2088. 202. Dempster AP, Laird NM, Rubin DB (1977) Maximum Likelihood from Incomplete Data via the EM Alg orithm. Journal of the Royal Statistical Society Series B (Methodological) 39: 1 38. 203. Hahn LW, Ritchie MD, Moore JH (2003) Multifactor dimensionality reduction software for detecting gene gene and gene environment interactions. Bioinformatics 19: 376 382. doi:10.1093/bioinformatics/btf869. 204. Ritchie MD, Hahn LW, Moore JH (2003) Power of multifactor dimensionality reduction for detecting gene gene interactions in the presence of genotyping error, missing data, phenocopy, and genetic heterogeneity. G enet Epidemiol 24: 150 157. doi:10.1002/gepi.10218. 205. Tsai C T, Lai L P, Lin J L, Chiang F T, Hwang J J, et al. (2004) Renin angiotensin system gene polymorphisms and atrial fibrillation. Circulation 109: 1640 1646. doi:10.1161/01.CIR.0000124487.36586. 26. 206. Juli A, Moore J, Miquel L, Alegre C, Barcel P, et al. (2007) Identification of a two loci epistatic interaction associated with susceptibility to rheumatoid arthritis through reverse engineering and multifactor dimensionality reduction. Genomic s 90: 6 13. doi:10.1016/j.ygeno.2007.03.011. 207. Cho YM, Ritchie MD, Moore JH, Park JY, Lee K U, et al. (2004) Multifactor dimensionality reduction shows a two locus interaction associated with Type 2 diabetes mellitus. Diabetologia 47: 549 554. doi:10.1 007/s00125 003 1321 3.

PAGE 206

206 208. Hua X, Zhang H, Zhang H, Yang Y, Kuk AYC (2010) Testing multiple gene interactions by the ordered combinatorial partitioning method in case control studies. Bioinformatics 26: 1871 1878. doi:10.1093/bioinformatics/btq290. 209. Lou X Y, Chen G B, Yan L, Ma JZ, Zhu J, et al. (2007) A generalized combinatorial approach for detecting gene by gene and gene by environment interactions with application to nicotine dependence. Am J Hum Genet 80: 1125 1137. doi:10.1086/518312. 210. Cul verhouse R, Klein T, Shannon W (2004) Detecting epistatic interactions contributing to quantitative traits. Genet Epidemiol 27: 141 152. doi:10.1002/gepi.20006. 211. Moore JH, Lamb JM, Brown NJ, Vaughan DE (2002) A comparison of combinatorial partitioning and linear regression for the detection of epistatic effects of the ACE I/D and PAI 1 4G/5G polymorphisms on plasma PAI 1 levels. Clin Genet 62: 74 79. 212. Cook NR, Zee RYL, Ridker PM (2004) Tree and spline based association analysis of gene gene intera ction models for ischemic stroke. Stat Med 23: 1439 1453. doi:10.1002/sim.1749. 213. Breiman L (2001) Random Forests. Machine Learning. pp. 5 32. 214. Sun YV (2010) Multigenic Modeling of Complex Disease by Random Forests. Computational Methods for Genet ics of Complex Traits. Academic Press, Vol. Volume 72. pp. 73 99. Available:http://www.sciencedirect.com/science/article/pii/B 97801238086220000 47. 215. Meng YA, Yu Y, Cupples LA, Farrer LA, Lunetta KL (2009) Performance of random forest when SNPs are in l inkage disequilibrium. BMC Bioinformatics 10: 78. doi:10.1186/1471 2105 10 78. 216. Lunetta KL, Hayward LB, Segal J, Van Eerdewegh P (2004) Screening large scale association study data: exploiting interactions using random forests. BMC Genet 5: 32. doi:10 .1186/1471 2156 5 32. 217. Jiang R, Tang W, Wu X, Fu W (2009) A random forest approach to the detection of epistatic interactions in case control studies. BMC Bioinformatics 10 Suppl 1: S65. doi:10.1186/1471 2105 10 S1 S65. 218. Sichtig H (2009) The SGE framework| Discovering spatio temporal patterns in biological systems with spiking neural networks(S), a genetic algorithm(G) and expert knowledge(E) STATE UNIVERSITY OF NEW YORK AT BINGHAMTON. Available:http://gradw orks.umi.com/33/59/3359709.html

PAGE 207

207 219. S herriff A, Ott J (2001) Applications of neural networks for gene finding. Adv Genet 42: 287 297. 220. Motsinger AA, Lee SL, Mellick G, Ritchie MD (2006) GPNN: power studies and applications of a neural network method for detecting gene gene interactions i n studies of human disease. BMC Bioinformatics 7: 39. doi:10.1186/1471 2105 7 39. 221. Goldberg DE (1989) Genetic Algorithms in Search, Optimization, and Machine Learning. 1st ed. Addison Wesley Professional. 432 p. 222. Marczyk A (2004) Genetic Algorith ms and Evolutionary Computation. the TalkOrigins Archive. Available:http://talkorigins.org/faqs/genalg/genalg.html. 223. Koza JR (1992) Genetic Programming: On the Programming of Computers by Means of Natural Selection. 1st ed. A Bradford Book. 840 p. 224 Ritchie MD, Motsinger AA, Bush WS, Coffey CS, Moore JH (2007) Genetic Programming Neural Networks: A Powerful Bioinformatics Tool for Human Genetics. Appl Soft Comput 7: 471 479. doi:10.1016/j.asoc.2006.01.013. 225. Moore JH, Hahn LW (2002) A cellular automata approach to detecting interactions among single nucleotide polymorphisms in complex multifactorial diseases. Pacific Symposium On Biocomputing 64: 53 64. 226. Gnther F, Wawro N, Bammann K (2009) Neural networks for modeling gene gene interaction s in association studies. BMC Genetics 10: 87. doi:10.1186/1471 2156 10 87. 227. Ritchie MD, White BC, Parker JS, Hahn LW, Moore JH (2003) Optimization of neural network architecture using genetic programming improves detection and modeling of gene gene i nteractions in studies of human diseases. BMC Bioinformatics 4: 28. doi:10.1186/1471 2105 4 28. 228. Holland JH (1992) Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligenc e. A Bradford Book. 211 p. 229. Forrest S (1993) Genetic algorithms: principles of natural selection applied to computation. Science 261: 872 878. doi:10.1126/science.8346439. 230. Haupt RL, Haupt SE (2004) Practical Genetic Algorithms. 2nd ed. Wiley Int erscience. 272 p.

PAGE 208

208 231. Nunkesser R, Bernholt T, Schwender H, Ickstadt K, Wegener I (2007) Detecting high order interactions of single nucleotide polymorphisms using genetic programming. Bioinformatics 23: 3280 3288. doi:10.1093/bioinformatics/btm522. 232. Belle G van, Heagerty PJ, Fisher LD, Lumley TS (2004) Biostatistics: A Methodology For the Health Sciences. 2nd ed. Wiley Interscience. 896 p. 233. Eshelman LJ (1991) The CHC adaptive search algorithm: How to have safe search when engaging in nontradit ional genetic recombination. Foundations of Genetic Algorithms. Morgan Kaufmann Publishers, Inc. pp. 265 283. 234. Reeves CR, Rowe JE (2002) Genetic Algorithms Principles and Perspectives: A Guide to GA Theory. Springer. 348 p. 235. Olariu S, Zomaya AY (2005) Handbook of Bioinspired Algorithms and Applications. CRC Press. 712 p. 236. Sichtig H, Schaffer JD, Riva A (2010) Evolving Spiking Neural Networks for pr edicting transcription factor binding sites. The 2010 International Joint Conference on Neural Networks (IJCNN). pp. 1 8. doi:10.1109/IJCNN.2010.5596642. 237. Bolstad WM (2007) Introduction to Bayesian Statistics. 2nd ed. Wiley Interscience. 464 p. 238. Ott RL, Longnecker MT (2008) An Introduction to Statistical Methods and Data Analysis. 6th ed. Duxbury Press. 1273 p. 239. Manly BF (1986) Multivariate Statistical Methods : A Primer. 1st ed. Springer. 150 p. 240. Xiong M, Zhao J, Boerwinkle E (2002) Generalized T2 test for genome association studies. Am J Hum Genet 70: 1257 1268. doi:10.1086/340392. 241. Ernst MD (2004) Permutation Methods: A Basis for Exact Inference. S tatistical Science 19: 676 685. doi:10.2307/4144438. 242. Peng CYJ, Lee K, Ingersoll G (2002) An Introduction to Logistic Regression Analysis and Reporting. The Journal of Educational Research 96. Available:htt p://dx.doi.org/10.2307/27542407 243. Nuzzo A, Riva A (2009) Genephony: a knowledge management tool for genome wide research. BMC Bioinformatics 10: 278. doi:10.1186/1471 2105 10 278.

PAGE 209

209 244. Gregersen PK, Amos CI, Lee AT, Lu Y, Remmers EF, et al. (2009) REL, encoding a member of the NF kappaB family of transcription factors, is a newly defined risk locus for rheumatoid arthritis. Nat Genet 41: 820 823. doi:10.1038/ng.395. 245. Juli A, Ballina J, Caete JD, Balsa A, Tornero Molina J, et al. (2008) Genome wide association study of rheumatoid arthriti s in the Spanish population: KLF12 as a risk locus for rheumatoid arthritis susceptibility. Arthritis Rheum 58: 2275 2286. doi:10.1002/art.23623. 246. Plenge RM, Seielstad M, Padyukov L, Lee AT, Remmers EF, et al. (2007) TRAF1 C5 as a risk locus for rheum atoid arthritis -a genomewide study. N Engl J Med 357: 1199 1209. doi:10.1056/NEJMoa073491. 247. Friedman S, Blumberg R (2011) Chapter 295. Inflammatory Bowel Disease. In: Longo DL, Fauci AS, Kasper DL, Hauser SL, Jameson JL, Loscalzo J, eds. rinciples of Internal Medicine: Volumes 1 and 2. 18th ed. McGraw Hill Professional. 4012 p. 248. ulcerative colitis. Nat Clin Pract Gastroenterol Hepatol 3: 390 407. doi:10.1038/n cpgasthep0528. 249. Kanehisa M, Goto S, Furumichi M, Tanabe M, Hirakawa M (2010) KEGG for representation and analysis of molecular networks involving diseases and drugs. Nucleic Acids Res 38: D355 360. doi:10.1093/nar/gkp896. 250. Haas JP, Kimura A, Truc kenbrodt H, Suschke J, Sasazuki T, et al. (1995) Early onset pauciarticular juvenile chronic arthritis is associated with a mutation in the Y box of the HLA DQA1 promoter. Tissue Antigens 45: 317 321. 251. Nazarian A, Sichtig H, Riva A (2012) A Knowledge Based Method for Association Studies on Complex Diseases. PLoS ONE 7(9): e44162. doi:10.1371/journal.pone.0044162

PAGE 210

210 BIOGRAPHICAL SKETCH Alireza Nazarian was born in 1977. He enrolled in Doctor of Medicine program at T ehran University of Medical Sciences ( TUMS) in 1996, and received his MD degree in 2003. After practicing medicine as a physician for a couple of years he undertook a research career to pursue his interests in the study of genotype phenotype relationship in complex disorders. In August 2008, he was offered a graduate research assistant position in Genetics and Genomics program at University of Florida Genetics Institute where he joined the bioinformatics laboratory of Dr. Alberto Riva. He is currently a candidate for the Doctor of Philosophy (PhD) degree in the Genetic and Genomics.