A Knowledge-Based Method for Association Studies on Complex Diseases

Material Information

Title:
A Knowledge-Based Method for Association Studies on Complex Diseases
Physical Description:
Serial
Creator:
Nazarian, Alireza
Sichtig, Heike
Riva, Alberto
Publisher:
PLoS ONE
Publication Date:

Subjects

Genre:
serial   ( sobekcm )

Notes

Abstract:
Complex disorders are a class of diseases whose phenotypic variance is caused by the interplay of multiple genetic and environmental factors. Analyzing the complexity underlying the genetic architecture of such traits may help develop more efficient diagnostic tests and therapeutic protocols. Despite the continuous advances in revealing the genetic basis of many complex diseases using genome-wide association studies (GWAS), a major proportion of their genetic variance has remained unexplained, in part because GWAS are unable to reliably detect small individual risk contributions and to capture the underlying genetic heterogeneity. In this paper we describe a hypothesis-based method to analyze the association between multiple genetic factors and a complex phenotype. Starting from sets of markers selected based on preexisting biomedical knowledge, our method generates multi-marker models relevant to the biological process underlying a complex trait for which genotype data are available. We tested the applicability of our method using the WTCCC case-control dataset. Analyzing a number of biological pathways, the method was able to identify several immune-system-related multi-SNP models significantly associated with Rheumatoid Arthritis (RA) and Crohn’s disease (CD). RA-associated multi-SNP models were also replicated in an independent case-control dataset. The method we present provides a framework for capturing joint contributions of genetic factors to complex traits. In contrast to hypothesis-free approaches, its results can be given a direct biological interpretation. The replicated multi-SNP models generated by our analysis may serve as a predictor to estimate the risk of RA development in individuals of Caucasian ancestry.
Funding:
This work was supported by National Institutes of Health (NIH) grant R01 HL87681-01, ‘‘Genome-Wide Association Studies in Sickle Cell Anemia and in Centenarians’’, and by funds from the University of Florida Genetics Institute. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Publication of this article was funded in part by the University of Florida Open-Access Publishing fund.

Record Information

Source Institution:
University of Florida Institutional Repository
Holding Location:
University of Florida
Rights Management:

This item is licensed with the Creative Commons Attribution License. This license lets others distribute, remix, tweak, and build upon this work, even commercially, as long as they credit the author for the original creation.
System ID:
AA00012790:00001



Full Text

PAGE 1

A Knowledge-Based Method for Association Studies on Complex Diseases

Alireza Nazarian, Heike Sichtig, Alberto Riva*

Department of Molecular Genetics and Microbiology and UF Genetics Institute, University of Florida, Gainesville, Florida, United States of America

Abstract

Complex disorders are a class of diseases whose phenotypic variance is caused by the interplay of multiple genetic and environmental factors. Analyzing the complexity underlying the genetic architecture of such traits may help develop more efficient diagnostic tests and therapeutic protocols. Despite the continuous advances in revealing the genetic basis of many complex diseases using genome-wide association studies (GWAS), a major proportion of their genetic variance has remained unexplained, in part because GWAS are unable to reliably detect small individual risk contributions and to capture the underlying genetic heterogeneity. In this paper we describe a hypothesis-based method to analyze the association between multiple genetic factors and a complex phenotype. Starting from sets of markers selected based on preexisting biomedical knowledge, our method generates multi-marker models relevant to the biological process underlying a complex trait for which genotype data are available. We tested the applicability of our method using the WTCCC case-control dataset. Analyzing a number of biological pathways, the method was able to identify several immune-system-related multi-SNP models significantly associated with Rheumatoid Arthritis (RA) and Crohn’s disease (CD). RA-associated multi-SNP models were also replicated in an independent case-control dataset. The method we present provides a framework for capturing joint contributions of genetic factors to complex traits. In contrast to hypothesis-free approaches, its results can be given a direct biological interpretation. The replicated multi-SNP models generated by our analysis may serve as a predictor to estimate the risk of RA development in individuals of Caucasian ancestry.

Citation: Nazarian A, Sichtig H, Riva A (2012) A Knowledge-Based Method for Association Studies on Complex Diseases. PLoS ONE 7(9): e44162. doi:10.1371/journal.pone.0044162

Editor: Giuseppe Novelli, Tor Vergata University of Rome, Italy

Received May 4, 2012; Accepted July 30, 2012; Published September 6, 2012

Copyright: © 2012 Nazarian et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: This work was supported by National Institutes of Health (NIH) grant R01 HL87681-01, ‘‘Genome-Wide Association Studies in Sickle Cell Anemia and in Centenarians’’, and by funds from the University of Florida Genetics Institute. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Publication of this article was funded in part by the University of Florida Open-Access Publishing fund.

Competing Interests: The authors have declared that no competing interests exist.
* E-mail: ariva@ufl.edu

Introduction

The study of genotype-phenotype relationship in complex disorders represents a great challenge in the field of translational genetics, due to their importance from a public health perspective and to the difficulties involved in their analysis at the genetic level [1,2]. In contrast to monogenic traits, the phenotypic variance of complex traits is caused by the interplay of multiple genetic and environmental factors [3–6], which limits the applicability of the traditional approaches used for mapping of Mendelian traits [1–3,7,8].

Genome-Wide Association Studies (GWAS), a powerful method for the large scale analysis of genotype-phenotype relationships [7,9,10], are currently the method of choice for dissecting the genetic basis of complex diseases. A large number of disorders such as cardiovascular disorders, Crohn’s disease, rheumatologic disorders, diabetes, bipolar disorder, schizophrenia and human malignancies, have thus far been studied with this approach leading to the detection of many previously undetected contributing loci (for example [1,11–15]). However, despite the very large number of markers that can be genotyped using currently available array-based SNP genotyping platforms (up to one million SNP genotypes per run) [16], in most cases GWAS have so far shown limited success in explaining a considerable proportion of genetic variance of complex diseases [7,10,17]. This is in part due to the nature of complex diseases, and in part to limitations which are inherent to the current analytical framework employed to analyze and interpret the obtained data [18–20].
To start, most of the susceptibility loci so far discovered by GWAS are of small predisposing risk [4,7] and it has been hypothesized that the genetic variance of complex diseases may be largely due to the joint contribution of multiple susceptibility loci having small individual effects [21,22]. Detecting such infinitesimal contributions is difficult, especially when the predisposing allele is rare or the sample size is not sufficiently large [2,3], and the very large number of markers under investigation raises the issue of multiple testing, making it even harder to reliably detect a small association signal [16]. This is also true in the case of rare variants with major phenotypic effects, as put forward by the complex disease-rare variants notion [17]. Moreover, the genetic architecture of a complex disease may include epistatic effects among interacting loci [23,24], and effects related to gene-environment interactions [6]. Statistical methods that analyze SNPs individually are unable to address such complex effects. Although current regression-based methods have enough power to analyze some low-order interactions (e.g. quadratic or cubic terms), provided that the non-interactive terms have significant main effects, they are unable to explore all potential combinations of markers because as the number of factors under analysis increases linearly, the number of their combinations grows

PLOS ONE | www.plosone.org 1 September 2012 | Volume 7 | Issue 9 | e44162

PAGE 2

exponentially, resulting in a computationally intractable situation [18,19]. The sparsity of data and issues of overfitting and multiple testing may also impose additional hurdles in such cases [18].

A further weakness of current analytical methods is that they analyze genome-scale datasets without necessarily taking the existing biological knowledge about the trait of interest into account. It has been suggested that using prior knowledge can help reduce the dimensionality of large scale datasets, and may guide the process of extracting biologically meaningful results in a more effective manner [19,25,26]. Pathway-based association studies represent an example of the integration of biological knowledge with statistical data analysis, and have recently drawn attention as an alternative to hypothesis-free methods. For instance, in an attempt to extend gene set enrichment analysis used for gene expression data to GWAS, O’Dushlaine et al. [27] introduced the SNP ratio test, which calculates the ratio between the number of individually significant SNPs and the number of non-significant ones in sets of SNPs derived from every pathway of interest, and determines a pathway to be associated with the trait under investigation if its corresponding SNP ratio proves statistically significant. As another example, Peng et al. [28] and Yu et al. [29] suggested ways of computing an overall p value for a gene by combining the p values obtained from the association tests on all the individual SNPs belonging to that gene. Gene-level p values, in turn, were employed to define a pathway-level p value to investigate pathway-trait associations. However, despite the fact that the limitations of classical GWAS in the context of complex diseases are increasingly being recognized and that pathway-based association studies are receiving growing attention, there is still no optimal approach able to overcome the above described challenges and provide a comprehensive interpretation of genome-scale data.
The work presented here focuses on developing an innovative hypothesis-based analysis method that combines a non-deterministic computational approach to the analysis of large-scale genotype datasets with pre-existing biological and biomedical knowledge in order to increase our understanding of the genetic basis of complex disorders. In the following sections we provide a detailed description of our method, called KBAS (Knowledge-Based Association Study), and we present its application to large-scale genotyping datasets for Rheumatoid Arthritis (RA) and Crohn’s disease (CD). We will use these examples to show how our method can be used to compare alternative hypotheses about the disease of interest, which in turn can lead to new insights into the genetic structure of the complex disease under consideration. We also present the results of a replication study, in which we validated the results obtained on RA through a different independent dataset. Since complex disorders are widespread and important from the public health perspective, dissecting their genetic architecture and reconstructing the integrated networks of factors underlying their pathogenesis may help develop more specific and sensitive screening and diagnostic tests and more efficient therapeutic targets [1,19]. This in turn may lead to reducing the morbidity and mortality of these diseases [1], consequently decreasing the burden imposed on patients and society. We believe that the KBAS method will represent an advance in this direction, and will help increase our understanding of the genetic basis of complex disorders.

Methods

KBAS Method Overview

We propose a hypothesis-based approach that performs a holistic analysis of genome-wide association data. Figure 1 illustrates the basic outline of the KBAS method. Given a hypothesis formulated by the investigator, the method generates a number of models consisting of sets of markers corresponding to the hypothesis under consideration. The models are then tested and refined on the basis of their ability to accurately classify subjects as affected or not affected on the basis of their genotypes.
To start, we define an encoding scheme that converts the genotypes of the markers under consideration to numerical values. Each marker is then given a weight that quantifies its contribution to the overall genotype-phenotype association. The combination of a set of markers and their corresponding weights represents a model. The genotype values corresponding to each marker in the model, along with their respective weights, are combined according to an appropriate mathematical formula, producing a score variable for each subject, and the distributions of scores for the subjects in the case and control groups are then compared to evaluate the model’s classification ability. The idea at the base of our method is that, if the set of markers included in the model is relevant to the phenotype, their joint signal, obtained by combining their individual association signals into an overall variable, will be able to accurately classify cases and controls. The optimal combination of SNPs and their corresponding weights leading to a model with maximized classification ability is determined through an iterative adjustment procedure based on Genetic Algorithms (GA) which generates, tests and refines different models relevant to the hypothesis under investigation. A detailed explanation of the steps employed by KBAS method is provided in the following sections.

Hypothesis Generation

The KBAS method we propose is aimed at verifying or disproving a hypothesis put forth by the user. In this context, the hypothesis is the statement that a particular set of genes contributes to the trait under investigation. The set of genes to be included in a hypothesis will in general be determined by the investigator on the basis of existing biological knowledge. Biochemical and regulatory pathways, gene ontology classes, gene expression databases, protein-protein interactions databases, and biomedical literature are some examples of sources of information that can be used to generate relevant hypotheses [26].
Once our hypothesis is formulated, the method tests whether a subset of genetic markers related to the specified set of genes is able to precisely separate cases from controls, when converted into a single variable through an appropriate mathematical combination of their genotype values. In the following sections we will assume that the markers under consideration are SNPs, and therefore only exhibit two alleles, but the method can be applied to any kind of polymorphic marker, given a way of encoding its genotypes into numerical values. In a typical scenario, an investigator may wish to determine which of two different sets of genes is more likely to be involved in a disease of interest. This question can be answered by creating two competing hypotheses, each producing a set of SNPs belonging to one of the two gene sets, and evaluating them on the basis of their power to discriminate cases from controls.

Genetic Algorithm

KBAS uses a Genetic Algorithm to adjust the weights of the SNPs composing a model, in order to maximize the model’s ability to accurately classify subjects into the case and control groups on the basis of the scores they receive. A Genetic Algorithm is a heuristic search algorithm able to efficiently explore very large and complex parameter spaces, in order to

Association Studies for Complex Diseases

PAGE 3

maximize a fitness function over the potential solutions of the problem under consideration [30,31]. Genetic Algorithms belong to the class of non-deterministic computational methods, which also includes other methods such as neural networks [32,33], genetic programming [34], and cellular automata [35]. Given the limitations of classical analysis methods, these approaches offer new strategies to address the complex issue of genotype-phenotype mapping.

A GA receives a number of randomly generated potential solutions (organisms) to the problem under consideration, each of which contains an encoded representation (artificial chromosome) of the set of parameters describing the solution. Solutions are then evaluated according to the specified fitness function, and are optimized through a simulated process of evolution using operators borrowed from evolutionary biology such as mutation, crossover and selection [30,31]. The GA used in this work is a variation of the CHC algorithm [36], implemented by the authors. We followed the standard practice of encoding each model parameter (in this case, each SNP weight) as a binary number consisting of the appropriate number of bits. Artificial chromosomes are therefore represented by bit vectors whose length is proportional to the number of SNPs in the model. In the simplest case only one bit is used to represent each SNP’s weight: the weight therefore indicates whether the corresponding SNP is present in the model or not.

The definition of the fitness function is a critical aspect in the use of a GA. In our application the fitness of a model is a measure of the model’s ability to discriminate cases and controls, which in turn is calculated on the basis of the set of scores assigned by the corresponding model to the subjects in the case and in the control groups. The score corresponding to a particular model is defined, for each subject, as the logarithm of the ratio of the conditional probabilities of the subject’s genotype under the two phenotype states of interest (e.g. diseased vs. healthy), according to the following formula inspired by the definition of Bayes factor [37]:

$$S_i = \ln \frac{P(G_i \mid \mathrm{State}_1)}{P(G_i \mid \mathrm{State}_2)}$$

where $S_i$ is the score for individual $i$, $G_i$ is the genotype of

Figure 1. A flowchart illustrating the steps applied by the KBAS method.
doi:10.1371/journal.pone.0044162.g001

PAGE 4

individual $i$ for the set of SNPs included in the model, and $\mathrm{State}_1$ and $\mathrm{State}_2$ are the two phenotype states being compared to each other, i.e. case vs. control.

Once the scores are computed for all subjects in the case and control groups, their distributions are compared using a two-sample t-test, and the p value resulting from the comparison of the score distributions is used as the fitness measure. The assumption is that a model with high fitness generates highly different score distributions in the two groups, resulting in a more extreme t-value and a smaller p value compared to those generated by a model with low fitness. Therefore, models with smaller fitness p values are preferred.

The GA will then rank all simulated organisms on the basis of their fitness, remove the lower half of the ranked organisms, and replace them with a new generation produced by applying the aforementioned genetic operators to the surviving organisms. Although the GA is initialized using a user-specified number of SNPs, reflecting the user’s preference for the approximate number of markers included in the model, during this step the number of non-zero weights can increase or decrease due to occurrence of crossover and mutations, causing the GA to discard some markers or to include new ones in the models. This cycle continues until a desired fitness level is attained, or the maximum number of generations is reached. Note that the desired fitness level should be adjusted according to Bonferroni’s correction [38] to account for the number of artificial chromosomes, of simulated generations, of investigated hypotheses and of pair-wise case-control comparisons performed in the analysis. At the last generation of the GA, the set of weights encoded by the organism with the highest fitness is used to generate the successful model, that is, the one that displays the highest ability to separate cases from controls.

Association Criteria

The successful model is then evaluated by permutation testing to validate its suggested association with the trait of interest. Holding the values of the generated score variable constant, the case-control labels of subjects in the original dataset are permuted. The model is then applied to the resulting permuted dataset and its corresponding fitness value is recorded. This process is repeated a large number of times to produce an empirical distribution of the fitness values, which can then be used to assess whether the fitness value derived from the original dataset is significant.

If the successful model shows a fitness p value lower than the user-specified significance threshold and also proves significant under permutation test, it is considered to be associated with the trait under investigation and its fitness level can be used as a measure of the strength of the association. We can therefore conclude that the SNPs included in the model represent causal variants or are in linkage disequilibrium (LD) with causal variants in neighboring regions.

Replication of the Results

As in any association study, the significant findings should be further validated in an independent case-control dataset to prove their replicability. In the replication step, a score variable is generated and tested in the replication dataset using the same model that was generated and tested in the discovery dataset. If a model successfully discriminates cases from controls in the replication dataset and the p value of its permutation test is also significant, it is considered replicated.

Logistic Regression Analysis and Disease Risk-Score Class Diagram

To evaluate how the disease-associated successful models produced by the GA affect the risk of developing the disease under consideration, a multiple logistic regression model is fitted using the scores produced by these models as the independent variables, and the disease state (affected vs. healthy) as the response variable. In addition, to investigate potential interactions among the successful models under consideration, all their pairwise interaction terms are included as predictor variables as well, and a stepwise selection procedure is applied to retain only the statistically significant variables.
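
The rank, cull and regenerate cycle described in the Genetic Algorithm section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it is a plain truncation-selection GA rather than the CHC variant used in the paper, and the function name `evolve`, the uniform crossover, the mutation rate and the default arguments are hypothetical choices. Lower fitness values are treated as better, matching the use of a p value as the fitness measure.

```python
import random


def evolve(n_snps, fitness, n_organisms=200, n_generations=500,
           init_snps=10, mutation_rate=0.01, target=None, seed=0):
    """Return (best_chromosome, best_fitness); lower fitness is better.

    Hypothetical sketch: one bit per SNP (1 = SNP is in the model).
    `fitness` is any callable mapping a bit list to a number, for
    example a t-test p value computed from case and control scores.
    """
    rng = random.Random(seed)

    def random_chromosome():
        # Start with non-zero weights for a random subset of SNPs.
        chosen = set(rng.sample(range(n_snps), min(init_snps, n_snps)))
        return [1 if i in chosen else 0 for i in range(n_snps)]

    def crossover(a, b):
        # Uniform crossover (an assumption): each bit comes from either parent.
        return [x if rng.random() < 0.5 else y for x, y in zip(a, b)]

    def mutate(chromosome):
        # Flip each bit with a small probability, so SNPs can enter or leave.
        return [bit ^ 1 if rng.random() < mutation_rate else bit
                for bit in chromosome]

    population = [random_chromosome() for _ in range(n_organisms)]
    for _ in range(n_generations):
        # Rank organisms and keep the better half unchanged.
        ranked = sorted(population, key=fitness)
        survivors = ranked[: n_organisms // 2]
        # Stop early once the desired fitness level is attained.
        if target is not None and fitness(survivors[0]) <= target:
            break
        # Replace the discarded half with offspring of the survivors.
        children = [mutate(crossover(*rng.sample(survivors, 2)))
                    for _ in range(n_organisms - len(survivors))]
        population = survivors + children

    best = min(population, key=fitness)
    return best, fitness(best)
```

Because the surviving half is carried over unchanged, the best organism never gets worse from one generation to the next. In a real analysis the `target` argument would be the Bonferroni-adjusted fitness threshold discussed in the text.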
A simple regression model is also fitted by regressing disease state on the overall score variable computed for the entire set of SNPs present in the significant disease-associated models. To illustrate the relationship between the overall score variable and the disease risk, we then discretize the overall score variable into multiple classes and compute the posterior probability of being affected by the disease for each class of this newly generated discrete variable using Bayes’ formula [37] (see Methods S1). Logistic regression analysis is performed using SAS (v9.2).

Method Evaluation

We used the KBAS method to test a number of hypotheses indicating that genes in a set of pathways related to the activity and regulation of the immune system play a role in the development of Rheumatoid Arthritis (RA). In addition, we tested the applicability of the KBAS method to other diseases like Crohn’s disease (CD) and type I diabetes (T1D). In particular, the results related to CD will be described in this manuscript. The tests were performed using genotype data provided by the Wellcome Trust Case Control Consortium (WTCCC) [1]. Patients in the RA and CD groups served as cases, and healthy individuals in the 58C and NBS groups constituted the controls. Case and control groups contain approximately 2000 and 1500 individuals respectively. Study subjects are of Caucasian ancestry, and each one was genotyped at around 500,000 SNPs using the Affymetrix platform (Affymetrix GeneChip 500K Mapping Array). The small values of the trend test’s over-dispersion parameter (λ = 1.03 for RA and 1.11 for CD) based on principal component analysis indicate that only trivial confounding effects related to population stratification exist [1]. We converted SNP genotypes to numerical values by representing the major allele (A) at each locus as 0 and the minor allele (B) as 1. The three possible genotypes can therefore be encoded as follows: AA = 0, AB = 1, BB = 2. Alternative encoding schemes, for example to consider dominant or recessive effects, can easily be adopted.

Using two control groups provides the advantage that the successful model produced by the comparison of the case group against one of the controls (e.g. RA vs. NBS) can be tested in two more comparisons, one between the case group and the other control group (e.g. RA vs. 58C), and the other one between the two control groups (58C vs. NBS). This increases the robustness of the inferred associations and reduces the risk of overfitting, because the successful model should be able to separate the case group from the second control group, and should not be able to separate the two control groups from each other. This in turn shows that results are reproducible and provides evidence that the GA is not simply learning to classify different groups of subjects or finding models consisting of chance aggregations of SNPs due to small differences between the groups unrelated to the disease (such as those due to geographic or ancestral factors), but produces results that are directly related to the disease of interest.
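
The genotype encoding and the case-control comparisons described above can be illustrated with a short Python sketch. Several parts are assumptions made only for illustration and are not taken from the paper: the conditional genotype probabilities are estimated from group frequencies with a pseudocount and SNPs are treated as independent, the t-test p value uses a normal approximation (reasonable for groups of this size), and the names `encode`, `scores`, `fitness_p_value` and `compare`, as well as the dominant and recessive tables, are hypothetical.

```python
import math
import statistics
from collections import Counter

# Additive encoding stated in the text; the other two tables are
# illustrative guesses at dominant and recessive schemes (assumption).
ADDITIVE = {"AA": 0, "AB": 1, "BB": 2}
DOMINANT = {"AA": 0, "AB": 1, "BB": 1}
RECESSIVE = {"AA": 0, "AB": 0, "BB": 1}


def encode(genotypes, scheme=ADDITIVE):
    """Convert one subject's genotype labels into numerical values."""
    return [scheme[g] for g in genotypes]


def genotype_freqs(group, snp, pseudocount=0.5):
    """Estimate P(value | group) at one SNP; the pseudocount is an assumption."""
    counts = Counter(subject[snp] for subject in group)
    total = len(group) + 3 * pseudocount
    return {v: (counts[v] + pseudocount) / total for v in (0, 1, 2)}


def scores(group, model, state1, state2):
    """S_i = ln P(G_i | state1) / P(G_i | state2), assuming independent SNPs."""
    tables = {s: (genotype_freqs(state1, s), genotype_freqs(state2, s))
              for s in model}
    return [sum(math.log(tables[s][0][subj[s]] / tables[s][1][subj[s]])
                for s in model)
            for subj in group]


def fitness_p_value(x, y):
    """Two-sided p value of a Welch t statistic (normal approximation)."""
    se = math.sqrt(statistics.variance(x) / len(x)
                   + statistics.variance(y) / len(y))
    diff = statistics.mean(x) - statistics.mean(y)
    if se == 0:
        return 1.0 if diff == 0 else 0.0
    return math.erfc(abs(diff / se) / math.sqrt(2))


def compare(group_a, group_b, model):
    """Fitness p value of a model (list of SNP indices) for two groups."""
    return fitness_p_value(scores(group_a, model, group_a, group_b),
                           scores(group_b, model, group_a, group_b))
```

With encoded groups named, say, `ra`, `nbs` and `c58` (hypothetical variables), the three comparisons in the text would be `compare(ra, nbs, model)`, `compare(ra, c58, model)` and `compare(c58, nbs, model)`. A robust model is expected to give small p values for the first two and a non-significant p value for the third.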

PAGE 5

Rheumatoid arthritis is an autoimmune disorder primarily affecting joints in a monoarticular or oligoarticular pattern with subsequent progression to a polyarthritis with clinical manifestations related to inflammation of synovial membrane, and articular cartilage erosion and destruction. The etiology of the disease is unknown, but it seems both genetic susceptibility and environmental triggers contribute to the pathogenesis of the disease [39]. A number of previously reported RA-predisposing loci (for example [1,11,40–43]) implicate the role of genetic factors in the process of disease development. Crohn’s disease is also a chronic immune-mediated disorder characterized by recurrent transmural inflammation of the gastrointestinal tract [44]. It has been suggested that an inappropriate immune response to microbial flora of the intestine in genetically susceptible individuals may play an important role in the development of this disease [1,45]. A previous meta-analysis reported over 71 CD-associated loci [15] corroborating the role of genetic factors in the disease susceptibility. The entire set of these predisposing loci accounts for about 21% of heritability of the disease [23].

Due to the suggested autoimmune nature of rheumatoid arthritis and Crohn’s disease, we focused on fourteen pathways related to different aspects of the human immune system as test-set pathways. Moreover, we selected ten other pathways that are involved in processes unrelated to the immune system, and are therefore likely to be irrelevant to the pathogenesis of RA and CD, to serve as negative controls in our analysis. Table 1 provides complete details on both sets of pathways. Pathway definitions were retrieved from the KEGG database (Kyoto Encyclopedia of Genes and Genomes: http://www.genome.jp/kegg/) [46].
For each pathway, we selected all validated SNPs found within the transcripts encoded by the genes belonging to it, extending up to 3,000 bp upstream and downstream. We then filtered this set of SNPs to retain only those that were genotyped in the WTCCC study. To further reduce the number of selected SNPs, the investigator has the option to choose a subset of ‘‘representative’’ SNPs from each transcript (e.g. only one SNP), on the basis of a prioritizing rule that preferentially selects non-synonymous coding SNPs, followed by promoter SNPs, exon-intron junction SNPs, synonymous coding SNPs, other exonic SNPs, and intronic SNPs. Table S1 summarizes the functional roles of the genotyped and validated SNPs related to the pathways under investigation. All procedures related to SNP set creation, manipulation and filtering were performed using Genephony, a user-friendly web-based browser for large-scale genomic datasets manipulation and for knowledge-based discovery tasks [47].

Finally, to test if the association detected for RA-associated pathways can be generalized to other cohorts, we tested the significance of the corresponding successful models in an independent case-control dataset of white Americans (NARAC) [41]. This dataset contains genotype data for 908 patients suffering from RA as the case group (NARAC-A) and for 1,260 healthy individuals as controls (NARAC-C). All subjects are of Caucasian ancestry and were genotyped at 545,080 SNPs using the Illumina Infinium HumanHap550 array. The set of SNPs genotyped in this dataset is not the same as that of the WTCCC study, due to the different genotyping platforms used. Therefore, we relied on linkage disequilibrium between markers in close proximity to replace SNPs in the successful models that were not genotyped in NARAC with the closest SNPs for which NARAC data is available.

The KBAS Software

The KBAS method described here was implemented in a freely available software package that can be downloaded at http://genome.ufl.edu/rivalab/KBAS/ as a 64bit GNU/Linux command-line executable. The program allows users to specify the input genotype datasets (case and control groups), and the GA parameters such as number of organisms, number of generations, weights encoding scheme, etc. It also provides users with an interface to the Genephony system [47] to automatically generate sets of SNPs related to the hypotheses under study on the basis of biological knowledge. The output of the program is a file containing the names of the markers in the successful model with their respective weights and functional roles. The program can also perform randomization testing on the successful model and write the resulting p value on the output file. All results described in the remainder of this paper were generated using the KBAS software.

Results

Analysis of the Rheumatoid Arthritis (RA) Dataset

For each of the twenty four initial sets of SNPs derived from the pathways listed in Table 1, KBAS initialized a GA using a population of 200 candidate models (represented by artificial chromosomes, using a single bit for each SNP weight). Each artificial chromosome was initialized to contain non-zero weights for a random subset of approximately 10 SNPs. The population was then evolved over 500 simulated generations, and the artificial chromosome providing the best fitness value in the final generation was used to generate the successful model.

Tables 2 and 3 summarize the p values associated with the fitness of each successful model on the basis of the pairwise comparisons between the case group and the two control groups (i.e. RA vs. 58C and RA vs. NBS). Note that the significance level was adjusted to 6.944 × 10^-9 according to Bonferroni’s correction, given the number of artificial chromosomes (n = 200), simulated generations (n = 500), investigated pathways (n = 24) and pair-wise population comparisons (n = 3). To validate the significance of these fitness values, we assessed the goodness-of-fit of each model using permutation testing performed over 100,000 rounds. Given the total number of pair-wise comparisons (n = 48), we used a corrected significance threshold of 0.00104 according to Bonferroni’s correction. A randomization test p value between 0.05 and 0.00104 was considered borderline. A model was considered significant if it had significant p values in both the RA vs. 58C and RA vs. NBS comparisons, and was considered non-significant if it resulted in non-significant p values in at least one of the two comparisons. In all other situations, it was considered borderline. The p values obtained from permutation tests are also shown in Tables 2 and 3 next to the fitness p values.

Test-set Pathways

For each of the tested pathways, with the exception of the Fc epsilon RI signaling, Fc gamma R-mediated phagocytosis, regulation of autophagy and T-cell receptor signaling pathways, KBAS was able to identify a model that classifies cases and controls with a highly significant p value, ranging from 2.55 × 10^-9 to 1.19 × 10^-23 for the RA vs. 58C comparisons and from 1.23 × 10^-9 to 1.64 × 10^-22 for the RA vs. NBS comparisons. By contrast, the successful models were unable to separate the NBS and 58C groups, and the p values of the comparisons were always non-significant (see Table 2). The fact that successful models derived from these ten pathways were able to separate case-control groups with fitnesses more extreme than the pre-determined significance threshold comparing RA vs. NBS and RA vs. 58C, but were not able to discriminate the two control groups from each other, indicates that they represent sets

PAGE 6

Table 1. The list of pathways included in this study and their characteristics.

Pathway | KEGG ID | Number of Genes | Number of Transcripts | Number of Validated SNPs | Number of Validated SNPs in WTCCC Dataset

Test Pathways
Antigen Processing and Presentation | map04612 | 69 | 111 | 7,950 | 148
B-cell Receptor Signaling | map04662 | 75 | 141 | 30,791 | 1,054
Chemokine Signaling | map04062 | 190 | 314 | 71,371 | 2,576
Complement and Coagulation Cascades | map04610 | 69 | 137 | 15,415 | 673
Cytokine-Cytokine Receptor Interaction | map04060 | 265 | 453 | 47,551 | 1,811
Fc Epsilon RI Signaling | map04664 | 79 | 131 | 34,966 | 1,247
Fc Gamma R-mediated Phagocytosis | map04666 | 95 | 182 | 48,187 | 1,842
Immune Network for IgA Production | map04672 | 46 | 61 | 7,538 | 132
Leukocyte Trans-endothelial Migration | map04670 | 116 | 215 | 53,296 | 1,930
Natural Killer Cell Mediated Cytotoxicity | map04650 | 132 | 236 | 37,095 | 1,262
Phagosome | map04145 | 151 | 261 | 32,421 | 1,029
Regulation of Autophagy | map04140 | 34 | 45 | 5,342 | 160
T-cell Receptor Signaling | map04660 | 108 | 205 | 38,035 | 1,334
Toll-like Receptor Signaling | map04620 | 102 | 172 | 19,755 | 585

Control Pathways
Cardiac Muscle Contraction | map04260 | 73 | 145 | 33,043 | 1,430
Gap Junction | map04540 | 90 | 148 | 57,854 | 2,257
Glycolysis/Gluconeogenesis | map00010 | 64 | 116 | 10,570 | 392
Insulin Signaling | map04910 | 137 | 245 | 44,633 | 1,289
Nucleotide Excision Repair | map03420 | 44 | 63 | 10,096 | 319
Oxidative Phosphorylation | map00190 | 116 | 175 | 15,100 | 478
Purine Metabolism | map00230 | 161 | 296 | 80,663 | 2,845
Pyrimidine Metabolism | map00240 | 99 | 160 | 28,338 | 724
Renin Angiotensin System | map04614 | 17 | 29 | 3,720 | 125
Spliceosome | map03040 | 126 | 180 | 14,922 | 474

Test-set contains immune system related pathways selected based on the preexisting knowledge about the pathogenesis of diseases under investigation and control-set contains pathways which are not likely to be relevant to the pathogenesis of the diseases of interest based on the preexisting knowledge.
doi:10.1371/journal.pone.0044162.t001

PAGE 7

of SNPs which could potentially be associated with rheumatoid arthritis.

In permutation tests performed by comparing RA vs. 58C and RA vs. NBS, the fitness p values related to all these ten models, except for those related to the chemokine signaling and the natural killer cell mediated cytotoxicity pathways, were validated with p values less than 10^-5, suggesting their remarkable performance in distinguishing the RA cohort from unaffected individuals. The eight pathways that give rise to significant p values in permutation test

Table 2. The p values associated with the pairwise comparisons of the RA group and the two control groups using the successful models derived from immune system related pathways.

Pathway | 58C vs. NBS Fitness | RA vs. 58C Fitness | RA vs. 58C Randomization-test | RA vs. NBS Fitness | RA vs. NBS Randomization-test
Antigen Processing and Presentation | 0.00589 | 1.03 × 10^-9 | <10^-5 | 3.96 × 10^-16 | <10^-5
B-cell Receptor Signaling | 0.04550 | 4.44 × 10^-14 | <10^-5 | 5.79 × 10^-13 | <10^-5
Chemokine Signaling | >0.05 | 9.76 × 10^-11 | 0.00073 | 1.23 × 10^-9 | 0.00336
Complement and Coagulation Cascades | >0.05 | 1.19 × 10^-23 | <10^-5 | 2.31 × 10^-20 | <10^-5
Cytokine-Cytokine Receptor Interaction | 0.04981 | 1.09 × 10^-20 | <10^-5 | 1.64 × 10^-22 | <10^-5
Fc Epsilon RI Signaling | 0.03246 | 6.97 × 10^-5 | >0.05 | 2.47 × 10^-5 | >0.05
Fc Gamma R-mediated Phagocytosis | 0.04356 | 1.91 × 10^-6 | >0.05 | 1.43 × 10^-7 | 0.01143
Immune Network for IgA Production | 0.00408 | 6.65 × 10^-11 | <10^-5 | 4.31 × 10^-18 | <10^-5
Leukocyte Trans-endothelial Migration | >0.05 | 1.11 × 10^-17 | <10^-5 | 2.71 × 10^-18 | <10^-5
Natural Killer Cell Mediated Cytotoxicity | 0.00437 | 2.55 × 10^-9 | 0.00263 | 7.97 × 10^-10 | 0.00085
Phagosome | >0.05 | 2.48 × 10^-16 | <10^-5 | 4.34 × 10^-17 | <10^-5
Regulation of Autophagy | >0.05 | 0.00941 | >0.05 | 0.01714 | >0.05
T-cell Receptor Signaling | 0.04116 | 5.83 × 10^-8 | 0.00111 | 8.03 × 10^-8 | 0.00161
Toll-like Receptor Signaling | >0.05 | 6.82 × 10^-14 | <10^-5 | 2.68 × 10^-12 | <10^-5

The fitness p values measure the fitness of each successful model retrieved by Genetic Algorithm engine. They are calculated by comparing original case and control datasets using corresponding successful models. Randomization-test p values measure the significance of fitness p values of their corresponding successful model by comparing permuted case and control datasets. According to Bonferroni’s correction, a fitness p value < 6.944 × 10^-9 and a randomization test p value < 0.00104 were considered significant. The p values of the models showing strong or moderate association with rheumatoid arthritis are in bold.
doi:10.1371/journal.pone.0044162.t002

Table 3. The p values associated with the pairwise comparisons of the RA group and the two control groups using successful models derived from negative control pathways.

Pathway | 58C vs. NBS Fitness | RA vs. 58C Fitness | RA vs. 58C Randomization-test | RA vs. NBS Fitness | RA vs. NBS Randomization-test
Cardiac Muscle Contraction | >0.05 | 0.01392 | >0.05 | 0.01158 | >0.05
Gap Junction | >0.05 | 0.00051 | >0.05 | 0.00202 | >0.05
Glycolysis/Gluconeogenesis | >0.05 | 0.0007 | >0.05 | 0.00288 | >0.05
Insulin Signaling | >0.05 | 4.30 × 10^-5 | >0.05 | 0.000533 | >0.05
Nucleotide Excision Repair | >0.05 | 0.00144 | >0.05 | 0.00084 | >0.05
Oxidative Phosphorylation | 0.02171 | 9.12 × 10^-8 | 0.04476 | 1.26 × 10^-6 | >0.05
Purine Metabolism | >0.05 | >0.05 | >0.05 | >0.05 | >0.05
Pyrimidine Metabolism | >0.05 | 0.00782 | >0.05 | 0.03497 | >0.05
Renin Angiotensin System | >0.05 | 0.00273 | >0.05 | 0.02638 | >0.05
Spliceosome | 0.02692 | 4.47 × 10^-6 | >0.05 | 0.00024 | >0.05

doi:10.1371/journal.pone.0044162.t003

PAGE 8

arethereforeconsideredtobestronglyassociatedwithrheumatoid arthritis. Despitethehighfitnessvaluesofthemodelsobtainedfromthe c hemokinesignaling andn aturalkillercellmediatedcytotoxicity pathways, permutationtestingsuggestedonlymoderateassociationofthese modelswithrheumatoidarthritis.Whilepermutationtestingofthe modelderivedfromthefirstpathwayresultedinasignificant pvalueforRA vs. 58Ccomparison( p 7.3 6 102 4)andaborderline pvalueforRA vs. NBScomparison( p 3.36 6 102 3),the pvalues resultingfrompermutationtestingofthemodelfromthesecond pathwaywereborderlineforRA vs. 58Ccomparison ( p 2.63 6 102 3)andsignificantforRA vs. NBScomparison ( p 08.5 6 102 4)respectively. Thesuccessfulmodelfromthe T-cellreceptorsignaling pathway yieldedstatisticallyborderlinefitnessvalueswhencomparingRA against58CandNBS( p 5.83 6 102 8and p 8.03 6 102 8respectively).Onceagainthecomparisonbetweenthetwocontrol groupshadanon-significant pvalue.Inpermutationtesting,the originalcase-control pvaluesappearedtobeofborderline statisticalsignificanceforboththeRA vs. 58Ccomparison ( p 1.11 6 102 3)andtheRA vs. NBScomparison ( p 1.61 6 102 3).Wesuggestthemodelderivedfromthispathway mayalsobeinmoderateassociationwithRA. 
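The randomization-test cutoffs quoted in the table footnotes (0.00104 for the pairwise comparisons and 0.00208 for the pooled comparisons) are consistent with dividing 0.05 by 48 and by 24. The short sketch below reproduces that arithmetic. The reading of the divisors as 24 pathways, or 24 pathways times two pairwise comparisons, is our inference and is not stated in the text; `bonferroni_threshold` and `implied_number_of_tests` are hypothetical helper names.

```python
ALPHA = 0.05  # family-wise significance level assumed throughout


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test cutoff under Bonferroni's correction: alpha / number of tests."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


def implied_number_of_tests(alpha: float, cutoff: float) -> float:
    """Number of tests that a reported cutoff would imply (alpha / cutoff)."""
    return alpha / cutoff


# Assumption: 24 pathways x 2 pairwise comparisons (RA vs. 58C, RA vs. NBS).
pairwise_cutoff = bonferroni_threshold(ALPHA, 24 * 2)
# Assumption: 24 pathways x 1 pooled comparison (RA vs. CTR).
pooled_cutoff = bonferroni_threshold(ALPHA, 24)
# The fitness cutoff of 6.944e-9 would correspond to roughly 7.2 million tests.
fitness_tests = implied_number_of_tests(ALPHA, 6.944e-9)

print(f"pairwise cutoff: {pairwise_cutoff:.5f}")
print(f"pooled cutoff:   {pooled_cutoff:.5f}")
print(f"tests implied by the fitness cutoff: {fitness_tests:,.0f}")
```

Under these cutoffs, a randomization-test p value of 0.00073 falls below 0.00104, whereas 0.00336 does not, which matches the significant and borderline labels used above for the chemokine signaling model.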
Finally, the successful models derived from the Fc epsilon RI signaling, Fc gamma R-mediated phagocytosis and regulation of autophagy pathways are not considered to be in association with RA because they did not produce significant fitness p values and the p values resulting from the corresponding permutation tests were non-significant as well.

Negative Control Pathways

On all pathways in the negative control set, the method failed to produce a model capable of classifying cases and controls with a statistically significant fitness p value. In addition, the results of all permutation tests were non-significant (p values > 0.00104) (see Table 3). In three cases, namely the insulin signaling, oxidative phosphorylation and spliceosome pathways, the fitness p values related to the selected successful models were closer to the significance threshold than those of other control pathways. However, the fact that none of their corresponding permutation tests was statistically significant confirms the poor goodness-of-fit of these models in discriminating cases from controls. These results are consistent with our prior expectation that these negative control pathways may not be relevant to the pathogenesis of rheumatoid arthritis, and do not therefore lead to a significant separation of the case and control groups.

Overall, these results indicate that the final models generated by KBAS are not just simple classifiers capable of learning the differences between two arbitrary groups of individuals. Instead, their classification power is a function of the biological relevance of the criterion used to select the initial sets of SNPs: SNP sets derived from pathways that are more biologically relevant to the trait of interest will lead to more accurate and powerful models. This feature is especially important when one aims to test and rank alternative hypotheses about the trait under consideration.

RA vs. CTR

We further evaluated the reproducibility of the observed associations by comparing the RA group versus the pooled population of the two control groups, here called CTR, using the successful models retrieved by GA. All eleven models showing strong or moderate association with RA in the previous step, along with the model related to the Fc gamma R-mediated phagocytosis pathway, were statistically associated with rheumatoid arthritis in this analysis (see Table 4). Complete details regarding the RA vs. CTR analysis are provided in Results S1.

Table 4. The p values associated with the comparison of the RA and CTR groups and of NARAC-A and NARAC-C using the successful models derived from pathways under consideration.

                                              RA vs. CTR                      NARAC-A vs. NARAC-C
Pathway                                       Fitness         Randomization   Fitness         Randomization
Test pathways
Antigen Processing and Presentation           8.17 × 10^-17   < 10^-5         1.47 × 10^-17   < 10^-5
B-cell Receptor Signaling                     8.31 × 10^-18   < 10^-5         4.54 × 10^-5    > 0.05
Chemokine Signaling                           1.44 × 10^-13   < 10^-5         6.47 × 10^-8    0.03463
Complement and Coagulation Cascades           3.11 × 10^-28   < 10^-5         7.17 × 10^-25   < 10^-5
Cytokine-Cytokine Receptor Interaction        5.51 × 10^-30   < 10^-5         6.34 × 10^-5    > 0.05
Fc Epsilon RI Signaling                       2.73 × 10^-6    > 0.05          7.52 × 10^-7    > 0.05
Fc Gamma R-mediated Phagocytosis              9.09 × 10^-10   3 × 10^-4       5.65 × 10^-17   < 10^-5
Intestinal Immune Network for IgA Production  5.99 × 10^-19   < 10^-5         1.41 × 10^-17   < 10^-5
Leukocyte Trans-endothelial Migration         4.61 × 10^-25   < 10^-5         4.44 × 10^-9    0.00310
Natural Killer Cell Mediated Cytotoxicity     7.37 × 10^-13   < 10^-5         5.15 × 10^-9    0.00793
Phagosome                                     2.34 × 10^-23   < 10^-5         0.00532         > 0.05
Regulation of Autophagy                       0.00172         > 0.05          9.99 × 10^-5    > 0.05
T-cell Receptor Signaling                     2.90 × 10^-10   2 × 10^-5       5.85 × 10^-15   < 10^-5
Toll-like Receptor Signaling                  2.96 × 10^-18   < 10^-5         1.96 × 10^-4    > 0.05
Control pathways
Cardiac Muscle Contraction                    0.00405         > 0.05          4.44 × 10^-5    > 0.05
Gap Junction                                  7.72 × 10^-5    0.04995         0.01804         > 0.05
Glycolysis/Gluconeogenesis                    0.00025         > 0.05          3.51 × 10^-5    > 0.05
Insulin Signaling                             2.37 × 10^-6    0.00860         0.00186         > 0.05
Nucleotide Excision Repair                    7.01 × 10^-5    > 0.05          0.01216         > 0.05
Oxidative Phosphorylation                     1.72 × 10^-9    0.00460         2.77 × 10^-6    > 0.05
Purine Metabolism                             0.04224         > 0.05          > 0.05          > 0.05
Pyrimidine Metabolism                         0.00371         > 0.05          0.00034         > 0.05
Renin Angiotensin System                      0.00216         > 0.05          0.00248         > 0.05
Spliceosome                                   1.09 × 10^-6    0.02772         5.78 × 10^-6    > 0.05

According to Bonferroni's correction, a fitness p value < 6.944 × 10^-9 and a randomization test p value < 0.00208 were considered significant. The p values of the models showing significant association with rheumatoid arthritis comparing RA vs. CTR are in bold. Of these 12 pathways, five were replicated in the NARAC dataset at the significance level of 0.00208 and three were replicated at the significance level of 0.05. The p values of these replicated models are also in bold.
doi:10.1371/journal.pone.0044162.t004

Table S2 shows the list of selected SNPs in the successful models derived from pathways displaying strong or moderate association with RA, the genes and chromosomes they belong to, their positions on their respective chromosomes, and their functional roles. The number of SNPs in the successful models ranged from three to eight, which in all cases is less than 1% of the total number of SNPs in the corresponding initial SNP-sets (see Table 1). This indicates that KBAS was able to efficiently cope with the complexity of a very large multi-dimensional search space.

Out of 72 total SNPs in these 12 models, two lie within the MHC region on 6p21 (chr6:29,910,021–33,498,585), which is a known hotspot for several autoimmune disorders.
rs9272346 is a promoter SNP related to the HLA-DQA1 gene which was previously shown to be associated with juvenile chronic arthritis [48]. On the other hand rs2072633, located in an intron of the CFB gene, is neither associated with RA based on previous findings nor in LD with other RA-associated SNPs according to the HapMap LD data (International HapMap Project: http://hapmap.ncbi.nlm.nih.gov).

None of the other SNPs present in the successful models or the genes to which these SNPs belong is among the RA-associated loci in previously published genome-wide association studies. There is also no evidence of linkage disequilibrium between these SNPs and other RA-associated SNPs according to the HapMap LD data. This indicates that the results generated by KBAS are not simply a rediscovery of strongly associated individual SNPs. Instead, our method is able to elucidate the joint contribution of previously unknown sets of SNPs to the genetic architecture of rheumatoid arthritis.

As seen in Table S2, the final models for seven of the tested pathways contained pairs of SNPs on the same chromosome. SNPs in these pairs are not in linkage disequilibrium with each other according to HapMap LD data. This shows that there is no redundancy in the selected SNPs, and therefore all the SNPs identified by the method were retained in their respective successful models.

Multiple Logistic Regression Analysis

To evaluate the impact of the twelve successful models previously concluded to be in association with RA on the risk of being affected by the disease, and to investigate potential interactions among them, a multiple logistic regression model was fitted by regressing the disease state on the score variables produced by comparing RA vs. CTR using each of these twelve models, along with their pairwise interaction terms (66 terms). The multivariate model obtained from the stepwise selection procedure contained the score variables related to all twelve pathways except for those related to the antigen processing and presentation and phagosome pathways. Moreover, the interaction terms between the cytokine-cytokine receptor interaction and leukocyte trans-endothelial migration pathways and between the cytokine-cytokine receptor interaction and T-cell receptor signaling pathways were also kept in the fitted regression model. The fitted regression model had overall model p values smaller than 10^-4, and the covariates of the included terms were statistically significant with p values of 0.0028 or smaller. All covariates were positively correlated with disease susceptibility, with odds ratios between 1.652 and 2.859, except for the two aforementioned interaction terms, which were negatively correlated, with odds ratios of 0.266 and 0.182 respectively. The c-statistic for the model was 0.687 and the p value of the Hosmer-Lemeshow test for the fitted model was non-significant (p > 0.05), indicating the model's goodness-of-fit. Table 5 summarizes these results.

Replication of the Results

To test whether the detected significant association between the twelve immune system related pathways and RA can be generalized to other cohorts, the scores related to each of these successful models were calculated in the NARAC dataset, and their distributions were compared between the case (NARAC-A) and control (NARAC-C) groups. After performing 100,000 rounds of permutation testing over each of the models, the significance of the primary fitness p value was determined and used as the replication criterion. The significance thresholds for interpreting the results were the same as those used for the RA vs. CTR analysis.
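The permutation (randomization) testing used as the replication criterion can be illustrated with a minimal sketch. It shuffles case and control labels over the pooled scores and counts how often a permuted dataset separates the groups at least as well as the original one. The test statistic used here, the absolute difference in mean score, is only an assumed stand-in for the fitness p value, whose exact definition is not given in this section; the function names are hypothetical as well.

```python
import random
from statistics import fmean


def mean_difference(case_scores, control_scores):
    """Assumed separation statistic: absolute difference of group means."""
    return abs(fmean(case_scores) - fmean(control_scores))


def permutation_p_value(case_scores, control_scores, n_rounds=1000, seed=0):
    """Fraction of label permutations whose statistic reaches the observed one."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    observed = mean_difference(case_scores, control_scores)
    pooled = list(case_scores) + list(control_scores)
    n_case = len(case_scores)
    hits = 0
    for _ in range(n_rounds):
        rng.shuffle(pooled)  # reassign case/control labels at random
        permuted = mean_difference(pooled[:n_case], pooled[n_case:])
        if permuted >= observed:
            hits += 1
    return hits / n_rounds


# Toy data (not from the study): clearly separated vs. identical score sets.
separated = permutation_p_value([5, 6, 7, 8, 9, 10] * 5, [0, 1, 2, 3, 4, 5] * 5)
identical = permutation_p_value([1, 2, 3, 4] * 5, [1, 2, 3, 4] * 5)
print(separated, identical)
```

Under this counting scheme, 100,000 rounds cannot resolve a non-zero p value below 10^-5, so a reported value of "< 10^-5" would correspond to no permuted dataset reaching the observed statistic.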
As seen in Table 4, five out of the 12 immune-system related pathways showing moderate or strong association with RA in the WTCCC dataset were replicated with fitness p values ranging from 5.85 × 10^-15 to 7.17 × 10^-25 and permutation test p values less than 10^-5. In addition, three models derived from the chemokine signaling, leukocyte trans-endothelial migration and natural killer cell mediated cytotoxicity pathways resulted in permutation test p values of 0.03463, 0.0031 and 0.00793 respectively, and therefore can be considered replicated at the significance level of 0.05. None of the models derived from the remaining 16 pathways gave rise to a significant p value in the permutation tests. Due to the different genotyping platforms, a number of SNPs present in the successful models were replaced by new ones to conduct the replication study, as explained in the Method Evaluation section. Table S3 summarizes the list of the original and substituted SNPs present in the eight replicated models. The substituted SNPs lie between 39 bp and 23,342 bp away from the original ones.

Simple Logistic Regression Analysis and Disease Risk-Score Class Diagram

The disease state was also regressed on the overall score variable computed from all 44 SNPs present in the replicated models, to evaluate the applicability of this single score variable in predicting the disease risk. Further details are provided in Results S1 and Tables S4 and S5.

To illustrate how an increase in the value of the overall score variable influences the risk of disease development, the range of values corresponding to each overall score variable was discretized into 12 bins, and for each bin the posterior probability of being affected was calculated. As shown in Figure 2, for both case-control comparisons the risk of development of rheumatoid arthritis increases as the score takes larger values. Comparing RA vs. CTR, this risk ranges from around 18% when the score class is lowest to around 75% when the score class is highest. For the model related to the NARAC-A vs. NARAC-C comparison, the disease risk rises from around 2% for the lowest score class to around 75% for the highest score class. This indicates that the scores obtained from the total set of these 44 SNPs can be used as a predictor to estimate the probability with which an individual may develop the disease.

Analysis of the Crohn's Disease (CD) Dataset

The applicability of the KBAS method to other diseases was tested in Crohn's disease (CD) using the same set of 24 pathways used for the RA analysis. Detailed results are provided in Results S2. As seen in Tables S6, S7 and S8, successful models derived from nine pathways demonstrated evidence of association with Crohn's disease. Eight out of these nine pathways were also in association with RA in the WTCCC dataset, and six of them were also among the pathways replicated in the NARAC dataset. Two SNPs out of the 57 SNPs included in these models (see Table S9) are in linkage disequilibrium with two previously detected CD-associated SNPs. None of the other SNPs are among or in linkage disequilibrium with the 71 known loci previously linked to CD [15]. The successful multi-SNP models included in the fitted regression model had odds ratios ranging from 2.320 to 2.741 (see Table S10). Table S11 summarizes the results of a simple logistic regression analysis obtained by regressing disease state (CD vs. CTR) on the overall score variable derived from the entire set of 57 SNPs present in the nine CD-associated models. Table S12 provides a comparative summary of the observed disease-pathway associations analyzing RA and CD using the WTCCC and NARAC datasets. Figure 3 illustrates the Disease risk-Score class diagram for the CD vs. CTR comparison.

Table 5. Multivariate regression of disease-state on the score variables derived from the successful models showing strong or moderate association with rheumatoid arthritis (comparing RA vs. CTR).

Test of Overall Model
Test                    Chi-square  df  P-value
Likelihood Ratio Test   529.5161    12  < 0.0001
Score Test              497.9300    12  < 0.0001
Wald Test               451.4576    12  < 0.0001

Test of Parameters
Parameter             Estimate  Std. Error  Wald Chi-square  df  P-value   Odds Ratio  95% Confidence Interval
Intercept             -0.3790   0.0312      147.7942         1   < 0.0001  -           -
Pathway 1             0.5020    0.1304      14.8224          1   0.0001    1.652       1.279–2.133
Pathway 2             0.6809    0.1516      20.1816          1   < 0.0001  1.976       1.468–2.659
Pathway 3             0.8802    0.0990      79.0971          1   < 0.0001  2.411       1.986–2.928
Pathway 4             0.7601    0.0984      59.6341          1   < 0.0001  2.139       1.763–2.594
Pathway 5             1.0191    0.1973      26.6887          1   < 0.0001  2.771       1.882–4.078
Pathway 6             0.7184    0.1224      34.4668          1   < 0.0001  2.051       1.614–2.607
Pathway 7             0.6218    0.1108      31.4872          1   < 0.0001  1.862       1.499–2.314
Pathway 8             1.0506    0.1575      44.5086          1   < 0.0001  2.859       2.100–3.893
Pathway 9             0.5614    0.1840      9.3069           1   0.0023    1.753       1.222–2.514
Pathway 10            0.5519    0.1274      18.7555          1   < 0.0001  1.737       1.353–2.229
Pathway 4*Pathway 7   -1.3250   0.3206      17.0822          1   < 0.0001  0.266       0.142–0.498
Pathway 4*Pathway 9   -1.7018   0.5694      8.9334           1   0.0028    0.182       0.060–0.557

Goodness-of-fit Test
Test                    Chi-square  df  P-value
Hosmer-Lemeshow Test    6.9178      8   0.5455

Pathway 1: B-cell Receptor Signaling Pathway.
Pathway 2: Chemokine Signaling Pathway.
Pathway 3: Complement and Coagulation Cascades Pathway.
Pathway 4: Cytokine-Cytokine Receptor Interaction Pathway.
Pathway 5: Fc Gamma R-mediated Phagocytosis Pathway.
Pathway 6: Intestinal Immune Network for IgA Production Pathway.
Pathway 7: Leukocyte Trans-endothelial Migration Pathway.
Pathway 8: Natural Killer Cell Mediated Cytotoxicity Pathway.
Pathway 9: T-cell Receptor Signaling Pathway.
Pathway 10: Toll-like Receptor Signaling Pathway.
doi:10.1371/journal.pone.0044162.t005

Discussion

The method we presented in this paper provides a framework for extending genome-wide association studies to situations in which a phenotype is caused by a number of concurrent genetic factors. Its main purpose is to measure how well a multi-SNP model is able to classify case and control subjects for which genotype data are available. Biologically, multi-SNP models may implicate cases in which functional relationships exist among genetic factors of interest. These relationships might be, for instance, in the form of interaction among functionally redundant elements or physically interacting biomolecules, or of the concurrent occurrence of two or more hypomorphic mutations in different steps of a particular pathway [49,50]. To capture these effects KBAS uses Genetic Algorithms, a powerful search method able to navigate through an extensive search space consisting of potential combinations of markers and to refine them by removing markers that show limited contribution to the trait under consideration. Instead of examining each marker individually, our method allows the user to define models consisting of a set of markers, to integrate the markers' genotypes into a score variable and to evaluate the model's ability to correctly classify cases and controls on the basis of the distribution of the score variable values.
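As a rough illustration of integrating genotypes into a score variable, the sketch below sums numeric genotype codes over the SNPs of one model and compares the resulting scores between groups. The additive 0/1/2 coding, the plain summation, the SNP identifiers and the toy subjects are all assumptions made for this example; the scoring actually used by KBAS is not specified in this section.

```python
from statistics import fmean


def score_subject(genotypes, model_snps):
    """Sum the numeric genotype codes of the SNPs belonging to one model."""
    return sum(genotypes[snp] for snp in model_snps)


# Hypothetical three-SNP model; identifiers are placeholders, not real SNPs.
model = ["snp_a", "snp_b", "snp_c"]

# Toy genotypes coded as the number of risk alleles (assumed coding).
cases = [
    {"snp_a": 2, "snp_b": 1, "snp_c": 0},
    {"snp_a": 1, "snp_b": 1, "snp_c": 2},
]
controls = [
    {"snp_a": 0, "snp_b": 1, "snp_c": 0},
    {"snp_a": 1, "snp_b": 0, "snp_c": 0},
]

case_scores = [score_subject(g, model) for g in cases]
control_scores = [score_subject(g, model) for g in controls]

# A model separates the groups well when the two score distributions differ.
mean_gap = fmean(case_scores) - fmean(control_scores)
print(case_scores, control_scores, mean_gap)
```

Because each subject is scored independently, two subjects carrying different combinations of risk alleles can reach the same score, which is the property the discussion relies on when it refers to the distribution of score variable values.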
Multi-SNP models that define a score variable over the loci they contain are able to take into account the joint effects of loci individually conferring small risks to the phenotype under investigation, as well as the different combinations of trait-influencing loci, which is of considerable importance in capturing the genetic heterogeneity underlying a complex trait. For example, we can imagine a case in which a hypothetical phenotype is produced by concurrent mutations in any two out of three loci involved in a particular pathway (e.g. loci A, B and C). Therefore, subjects having mutations in loci A and B show the same phenotype as subjects having mutations in loci A and C or loci B and C. This can be formally described using the following logical statement: [A AND B] OR [A AND C] OR [B AND C]. However, since the main effect of each individual locus is small, subjects harboring single mutations do not develop the phenotype. Defining a score variable results in integrating small association signals arising from individual risk loci into an overall association signal capturing the joint effects of contributing loci. Moreover, since the score variable is computed independently on each subject, different combinations of loci can contribute to the same score variable distribution in parallel to each other, leading to a more accurate representation of the differences between the case and control groups.

Since KBAS employs a Genetic Algorithm-based search engine, it enjoys high flexibility and efficiency in searching through an extremely large solution space. On the other hand, as with any other method relying on heuristic search algorithms, this does not ensure that the successful models are the absolute best ones. Our method is instead meant to be used in a hypothesis-based fashion: its purpose is not to discover the "best" set of markers that explains a phenotype, something that is computationally unfeasible since it would require testing all possible combinations of a given number of markers, but to confirm or invalidate the hypothesis that a user-provided set of markers is associated with the phenotype. The method does not require or stipulate a specific way of selecting markers: if the provided set does not contain a sufficient number of "relevant" ones, the algorithm will simply fail to find an appropriate combination of markers, and will report that the corresponding hypothesis is not supported by the available data.

Figure 2. Disease risk-Score class diagram for RA vs. CTR and NARAC-A vs. NARAC-C comparisons. For each comparison the overall score variable derived from the entire set of 44 SNPs present in the eight replicated RA-associated models was discretized into 12 bins, and for each bin the posterior probability of being affected by the disease was calculated based on Bayes' formula.
doi:10.1371/journal.pone.0044162.g002

This is clearly observed using the two sets of pathways we selected for this study, the first one containing pathways known to be relevant to the immune system, and the second one containing pathways unrelated to the immune system, used as negative controls. For instance, while on the first set of pathways KBAS was able to produce multiple replicated RA-associated models with remarkable classifying efficacy, none of the pathways in the second set gave rise to a successful model capable of significantly separating cases from controls (see Tables 2, 3 and 4).

In addition, our method provides researchers with a way to prioritize the successful models derived from different pathways in order to evaluate which pathway is more likely to be related to the pathogenesis of the disease of interest. For example, comparing RA vs. CTR (see Table 4), the complement and coagulation cascades pathway seems to be of higher relevance to the process of development of rheumatoid arthritis than the T-cell receptor signaling pathway. Also, among the significant models for the CD vs. CTR comparison (see Table S8), the B-cell receptor signaling, cytokine-cytokine receptor interaction and T-cell receptor signaling pathways show high association with Crohn's disease, while, for example, the association of the antigen processing and presentation pathway seems to be of lower importance compared to the three aforementioned pathways.

Another application of our method is to compare the contribution of specific pathways to different traits of interest.
Although several pathways were labeled as associated with both rheumatoid arthritis and Crohn's disease in our analysis, each provides a different contribution to the diseases under investigation. For example, comparing the two successful models derived from the complement and coagulation cascades pathway through the analysis of the RA and CD datasets, the RA-associated model has a smaller p value than the CD-associated model, while the reverse is true for the two models derived from the T-cell receptor signaling pathway (see Tables 4 and S8).

There are clear benefits to having SNP selection guided by preexisting biological knowledge. This choice maximizes the chance of including SNPs that are functionally related to the phenotype of interest, and makes it more likely that the results, if positive, may have an explicit biological interpretation. If the algorithm identifies a model with high case/control separation performance, it can be directly used to formulate a biological explanation for the observed phenotype. For example, this could involve identifying the genes containing the SNPs within the model or in LD with them, and making them high-priority candidates for further experimental analysis.

The observed significant associations between immune system related multi-SNP models and the diseases under investigation are consistent with previously revealed aspects of their pathogenesis as autoimmune diseases [1,39,43,45]. The fact that a number of RA-associated models were replicated in another population of the same ancestry makes them a potential predictor which can be used to estimate the risk of disease development in individuals of Caucasian ancestry. As shown in the Disease risk-Score class diagrams, although the risk of disease development rises in parallel with the increase in the score variable value, the disease risk in the lowest and highest classes, constituting the extreme portions of the score-class spectrum, is not 0% or 100% respectively. This is consistent with the definition of complex diseases and indicates that there are still further genetic and non-genetic risk factors contributing to RA and CD, which remain to be discovered.

Figure 3. Disease risk-Score class diagram for CD vs. CTR. The overall score variable derived from the entire set of 57 SNPs present in the nine significant CD-associated models was discretized into 12 bins, and for each bin the posterior probability of being affected by the disease was calculated based on Bayes' formula.
doi:10.1371/journal.pone.0044162.g003

Although we tested it using SNP genotype data, KBAS can be applied to any kind of polymorphic genetic marker, provided that its alleles can be converted into numeric values in a way that consistently assigns values to different types of alleles (e.g. wild-type and mutant alleles). KBAS will then use the score variable obtained by combining these values to classify study subjects as affected or unaffected.

Supporting Information

Table S1. The number of genotyped and validated SNPs in each functional role class for the test and control pathways analyzed in this study. (DOC)

Table S2. The list of SNPs included in the successful models showing strong or moderate association with rheumatoid arthritis. SNP positions refer to version GRCh37 of the human reference sequence. (DOC)

Table S3. The list of SNPs from the NARAC study replacing the SNPs from the WTCCC study for replication of the results. (DOC)

Table S4. Simple regression of disease-state on the overall score variable derived from the entire set of 44 SNPs present in the replicated RA-associated models (comparing RA vs. CTR). (DOC)

Table S5. Simple regression of disease-state on the overall score variable derived from the entire set of 44 SNPs present in the replicated RA-associated models (comparing NARAC-A vs. NARAC-C). (DOC)

Table S6. The p values associated with the pairwise comparisons of the CD group and the two control groups using successful models derived from immune system related pathways. The fitness p values measure the fitness of each successful model retrieved by the Genetic Algorithm engine. They are calculated by comparing the original case and control datasets using the corresponding successful models. Randomization-test p values measure the significance of the fitness p values of their corresponding successful model by comparing permuted case and control datasets. According to Bonferroni's correction, a fitness p value < 6.944 × 10^-9 and a randomization test p value < 0.00104 were considered significant. The p values of the models showing strong or moderate association with Crohn's disease are in bold. (DOC)

Table S7. The p values associated with the pairwise comparisons of the CD group and the two control groups using the successful models derived from negative control pathways. (DOC)

Table S8. The p values associated with the comparisons of the CD and CTR groups using successful models derived from pathways under consideration. According to Bonferroni's correction, a fitness p value < 6.944 × 10^-9 and a randomization test p value < 0.00208 were considered significant. The p values of models showing significant association with Crohn's disease are in bold. (DOC)

Table S9. The list of SNPs included in the successful models showing association with Crohn's disease. SNP positions refer to version GRCh37 of the human reference sequence. (DOC)

Table S10. Multivariate regression of disease-state on the score variables derived from the successful models showing association with Crohn's disease (comparing CD vs. CTR). (DOC)

Table S11. Simple regression of disease-state on the overall score variable derived from the entire set of 57 SNPs present in the CD-associated models (comparing CD vs. CTR). (DOC)

Table S12. Comparative summary of pathway associations with rheumatoid arthritis and Crohn's disease. (+) indicates that the successful model retrieved from the pathway is in strong association with the disease in the corresponding comparison, (-) indicates that the successful model retrieved from the pathway is not associated with the disease in the corresponding comparison, and (+/-) indicates that the successful model retrieved from the pathway is in borderline association with the disease in the corresponding comparison. (DOC)

Methods S1. Description of the method used to perform logistic regression analysis and to illustrate the Disease risk-Score class diagram. (PDF)

Results S1. Description of additional results regarding the analysis of the rheumatoid arthritis (RA) dataset. (PDF)

Results S2. Description of the results regarding the analysis of the Crohn's disease (CD) dataset. (PDF)

Acknowledgments

The authors would like to thank Dr. Paola Sebastiani for useful discussions, and Dr. Peter Gregersen for providing us with the NARAC dataset. This study makes use of data generated by the Wellcome Trust Case Control Consortium. A full list of the investigators who contributed to the generation of the data is available from www.wtccc.org.uk. Funding for the project was provided by the Wellcome Trust under award 076113.

Author Contributions

Conceived and designed the experiments: AR AN. Performed the experiments: AN. Analyzed the data: AN. Contributed reagents/materials/analysis tools: AN AR HS. Wrote the paper: AR AN.

References1.WTCCC(2007)Genome-wideassociationstudyof14,000casesofseven commondiseasesand3,000sharedcontrols.Nature447:661–678.doi:10.1038/ nature05911. 2.RischN,MerikangasK(1996)Thefutureofgeneticstudiesofcomplexhuman diseases.Science273:1516–1517. 3.RischN(2000)Searchingforgeneticdeterminantsinthenewmillennium. Nature405:847–856.doi:10.1038/35015718. 4.MitchellKJ(2012)Whatiscomplexaboutcomplexdisorders?GenomeBiology 13:237.doi:10.1186/gb-2012-13-1-237. 5.WangZ,LiuT,LinZ,HegartyJ,KoltunWA,etal.(2010)Ageneralmodelfor multilocusepistaticinteractionsincase-controlstudies.PLoSONE5:e11384. doi:10.1371/journal.pone.0011384. 6.GibsonG(2009)Decanalizationandtheoriginofcomplexdisease.NatRev Genet10:134–140.doi:10.1038/nrg2502. 7.ManolioTA,CollinsFS,CoxNJ,GoldsteinDB,HindorffLA,etal.(2009) Findingthemissingheritabilityofcomplexdiseases.Nature461:747–753. doi:10.1038/nature08494. 8.Altmu ¨llerJ,PalmerLJ,FischerG,ScherbH,WjstM(2001)GenomewideScans ofComplexHumanDiseases:TrueLinkageIsHardtoFind.AmJHumGenet 69:936–950. 9.DonnellyP(2008)Progressandchallengesingenome-wideassociationstudiesin humans.Nature456:728–731.doi:10.1038/nature07631. 10.ClarkeAJ,CooperDN(2010)GWAS:heritabilitymissinginaction?EurJHum Genet18:859–861.doi:10.1038/ejhg.2010.35. 11.StahlEA,RaychaudhuriS,RemmersEF,XieG,EyreS,etal.(2010)Genomewideassociationstudymeta-analysisidentifiessevennewrheumatoidarthritis riskloci.NatGenet42:508–514.doi:10.1038/ng.582. 12.BarrettJC,ClaytonDG,ConcannonP,AkolkarB,CooperJD,etal.(2009) Genome-wideassociationstudyandmeta-analysisfindthatover40lociaffect riskoftype1diabetes.NatGenet41:703–707.doi:10.1038/ng.381. 13.PurcellSM,WrayNR,StoneJL,VisscherPM,O’DonovanMC,etal.(2009) Commonpolygenicvariationcontributestoriskofschizophreniaandbipolar disorder.Nature460:748–752.doi:10.1038/nature08185. 14.EastonDF,EelesRA(2008)Genome-wideassociationstudiesincancer.Human MolecularGenetics17:R109–R115.doi:10.1093/hmg/ddn287. 15.FrankeA,McGovernDPB,BarrettJC,WangK,Radford-SmithGL,etal. 
(2010)Genome-widemeta-analysisincreasesto71thenumberofconfirmed Crohn’sdiseasesusceptibilityloci.NatGenet42:1118–1125.doi:10.1038/ ng.717. 16.ZieglerA,Ko ¨nigIR,ThompsonJR(2008)Biostatisticalaspectsofgenome-wide associationstudies.BiomJ50:8–28.doi:10.1002/bimj.200710398. 17.GibsonG(2011)Rareandcommonvariants:twentyarguments.NatRevGenet 13:135–145.doi:10.1038/nrg3118. 18.MusaniSK,ShrinerD,LiuN,FengR,CoffeyCS,etal.(2007)Detectionof gene 6 geneinteractionsingenome-wideassociationstudiesofhuman populationdata.HumHered63:67–84.doi:10.1159/000099179. 19.MooreJH,AsselbergsFW,WilliamsSM(2010)Bioinformaticschallengesfor genome-wideassociationstudies.Bioinformatics26:445–455.doi:10.1093/ bioinformatics/btp713. 20.CordellHJ(2009)Detectinggene-geneinteractionsthatunderliehuman diseases.NatRevGenet10:392–404.doi:10.1038/nrg2579. 21.PlominR,HaworthCMA,DavisOSP(2009)Commondisordersare quantitativetraits.NatRevGenet10:872–878.doi:10.1038/nrg2670. 22.YangJ,BenyaminB,McEvoyBP,GordonS,HendersAK,etal.(2010) CommonSNPsexplainalargeproportionoftheheritabilityforhumanheight. NatGenet42:565–569.doi:10.1038/ng.608. 23.ZukO,HechterE,SunyaevSR,LanderES(2012)Themysteryofmissing heritability:Geneticinteractionscreatephantomheritability.ProcNatlAcadSci USA109:1193–1198.doi:10.1073/pnas.1119675109. 24.MooreJH(2003)Theubiquitousnatureofepistasisindeterminingsusceptibility tocommonhumandiseases.HumHered56:73–82.doi:10.1159/000073735. 25.RivaA,NuzzoA,StefanelliM,BellazziR(2010)Anautomatedreasoning frameworkfortranslationalresearch.JBiomedInform43:419–427. doi:10.1016/j.jbi.2009.11.005. 26.PattinKA,MooreJH(2008)ExploitingtheProteometoImprovetheGenomeWideGeneticAnalysisofEpistasisinCommonHumanDiseases.HumGenet 124:19–29.doi:10.1007/s00439-008-0522-8. 27.O’DushlaineC,KennyE,HeronEA,SeguradoR,GillM,etal.(2009)The SNPratiotest:pathwayanalysisofgenome-wideassociationdatasets. Bioinformatics25:2762–2763.doi:10.1093/bioinformatics/btp448. 
28. Peng G, Luo L, Siu H, Zhu Y, Hu P, et al. (2010) Gene and pathway-based second-wave analysis of genome-wide association studies. Eur J Hum Genet 18: 111–117. doi:10.1038/ejhg.2009.115.
29. Yu K, Li Q, Bergen AW, Pfeiffer RM, Rosenberg PS, et al. (2009) Pathway analysis by adaptive combination of P-values. Genet Epidemiol 33: 700–709. doi:10.1002/gepi.20422.
30. Goldberg DE (1989) Genetic Algorithms in Search, Optimization, and Machine Learning. 1st ed. Addison-Wesley Professional. 432 p.
31. Whitley D (1994) A genetic algorithm tutorial. Statistics and Computing 4: 65–85.
32. Sherriff A, Ott J (2001) Applications of neural networks for gene finding. Adv Genet 42: 287–297.
33. Motsinger AA, Lee SL, Mellick G, Ritchie MD (2006) GPNN: power studies and applications of a neural network method for detecting gene-gene interactions in studies of human disease. BMC Bioinformatics 7: 39. doi:10.1186/1471-2105-7-39.
34. Ritchie MD, Motsinger AA, Bush WS, Coffey CS, Moore JH (2007) Genetic Programming Neural Networks: A Powerful Bioinformatics Tool for Human Genetics. Appl Soft Comput 7: 471–479. doi:10.1016/j.asoc.2006.01.013.
35. Moore JH, Hahn LW (2002) A cellular automata approach to detecting interactions among single-nucleotide polymorphisms in complex multifactorial diseases. Pacific Symposium On Biocomputing 64: 53–64.
36. Eshelman LJ (1991) The CHC adaptive search algorithm: How to have safe search when engaging in nontraditional genetic recombination. Foundations of Genetic Algorithms. Morgan Kaufmann Publishers, Inc. 265–283.
37. Bolstad WM (2007) Introduction to Bayesian Statistics. 2nd ed. Wiley-Interscience. 464 p.
38. Belle G van, Fisher LD, Heagerty PJ, Lumley TS (2004) Biostatistics: A Methodology For the Health Sciences. 2nd ed. Wiley-Interscience. 896 p.
39. Cooles FAH, Isaacs JD (2011) Pathophysiology of rheumatoid arthritis. Curr Opin Rheumatol 23: 233–240. doi:10.1097/BOR.0b013e32834518a3.
40. Gregersen PK, Amos CI, Lee AT, Lu Y, Remmers EF, et al. (2009) REL, encoding a member of the NF-kappaB family of transcription factors, is a newly defined risk locus for rheumatoid arthritis. Nat Genet 41: 820–823. doi:10.1038/ng.395.
41. Plenge RM, Seielstad M, Padyukov L, Lee AT, Remmers EF, et al. (2007) TRAF1-C5 as a risk locus for rheumatoid arthritis – a genomewide study. N Engl J Med 357: 1199–1209. doi:10.1056/NEJMoa073491.
42. Julià A, Ballina J, Cañete JD, Balsa A, Tornero-Molina J, et al. (2008) Genome-wide association study of rheumatoid arthritis in the Spanish population: KLF12 as a risk locus for rheumatoid arthritis susceptibility. Arthritis Rheum 58: 2275–2286. doi:10.1002/art.23623.
43. de Vries R (2011) Genetics of rheumatoid arthritis: time for a change! Curr Opin Rheumatol 23: 227–232. doi:10.1097/BOR.0b013e3283457524.
44. Friedman S, Blumberg R (2011) Chapter 295. Inflammatory Bowel Disease. In: Longo DL, Fauci AS, Kasper DL, Hauser SL, Jameson JL, Loscalzo J, eds. Harrison’s Principles of Internal Medicine: Volumes 1 and 2. 18th ed. McGraw-Hill Professional. 4012 p.
45. Sartor RB (2006) Mechanisms of disease: pathogenesis of Crohn’s disease and ulcerative colitis. Nat Clin Pract Gastroenterol Hepatol 3: 390–407. doi:10.1038/ncpgasthep0528.
46. Kanehisa M, Goto S, Furumichi M, Tanabe M, Hirakawa M (2010) KEGG for representation and analysis of molecular networks involving diseases and drugs. Nucleic Acids Res 38: D355–360. doi:10.1093/nar/gkp896.
47. Nuzzo A, Riva A (2009) Genephony: a knowledge management tool for genome-wide research. BMC Bioinformatics 10: 278. doi:10.1186/1471-2105-10-278.
48. Haas JP, Kimura A, Truckenbrodt H, Suschke J, Sasazuki T, et al. (1995) Early-onset pauciarticular juvenile chronic arthritis is associated with a mutation in the Y-box of the HLA-DQA1 promoter. Tissue Antigens 45: 317–321.
49. Pérez-Pérez JM, Candela H, Micol JL (2009) Understanding synergy in genetic interactions. Trends Genet 25: 368–376. doi:10.1016/j.tig.2009.06.004.
50. Martienssen R, Irish V (1999) Copying out our ABCs: the role of gene redundancy in interpreting genetic hierarchies. Trends Genet 15: 435–437.

Association Studies for Complex Diseases
PLOS ONE | www.plosone.org 14 September 2012 | Volume 7 | Issue 9 | e44162