Citation
Modeling Perturbations in Gene Regulatory and Signaling Networks

Material Information

Title:
Modeling Perturbations in Gene Regulatory and Signaling Networks
Creator:
Bandyopadhyay, Nirmalya
Place of Publication:
[Gainesville, Fla.]
Florida
Publisher:
University of Florida
Publication Date:
Language:
English
Physical Description:
1 online resource (136 p.)

Thesis/Dissertation Information

Degree:
Doctorate ( Ph.D.)
Degree Grantor:
University of Florida
Degree Disciplines:
Computer Engineering
Computer and Information Science and Engineering
Committee Chair:
Ranka, Sanjay
Committee Co-Chair:
Kahveci, Tamer
Committee Members:
Dobra, Alin
Banerjee, Arunava
De Crecy-Lagard, Valerie
Graduation Date:
12/17/2011

Subjects

Subjects / Keywords:
Breast cancer ( jstor )
Datasets ( jstor )
Gene expression ( jstor )
Gene interaction ( jstor )
Genes ( jstor )
Mathematical independent variables ( jstor )
Mathematical variables ( jstor )
Objective functions ( jstor )
P values ( jstor )
Regulator genes ( jstor )
Computer and Information Science and Engineering -- Dissertations, Academic -- UF
gene-regulation -- graphical-model -- machine-learning -- markov-random-field -- microarray -- perturbation -- regulatory-network -- signaling-network
Genre:
bibliography ( marcgt )
theses ( marcgt )
government publication (state, provincial, territorial, dependent) ( marcgt )
born-digital ( sobekcm )
Electronic Thesis or Dissertation
Computer Engineering thesis, Ph.D.

Notes

Abstract:
Genes interact with each other through gene networks. A perturbation in the organism tissue can propagate to different parts of the network and affect the expression of the other genes. A perturbation can take place because of a disorder due to cancer, radiation or medication. In this work, we propose to model and analyze the impact of perturbations in gene networks. In particular, we are interested in determining the subset of genes that are directly and indirectly affected by these perturbations, and how this impact may vary across multiple groups with different characteristics. This problem has been touched upon in the existing literature. However, most of those methods employ statistical and machine learning based techniques that take gene expression as the sole input data and do not consider the genetic interaction network. In this work, we adopt a hypothesis for solving this problem. We formally state the hypothesis as follows: integrating gene networks with gene expression enables us to analyze the effect of perturbations in a more effective and comprehensive fashion. We address our problem by breaking it into four connected sub-problems. In all four sub-problems, we leverage the knowledge from gene networks to build machine learning based solutions. Our thesis establishes the justification behind this hypothesis in two ways. First, our methods produce more accurate results than the existing ones, as evident in the results of the experiments. Instead of producing binary decisions, our methods build probabilistic models that enable us to estimate the confidence in their predictions. Second, computational biologists will be able to inspect which pathways have been affected and to what extent. ( en )
General Note:
In the series University of Florida Digital Collections.
General Note:
Includes vita.
Bibliography:
Includes bibliographical references.
Source of Description:
Description based on online resource; title from PDF title page.
Source of Description:
This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Thesis:
Thesis (Ph.D.)--University of Florida, 2011.
Local:
Adviser: Ranka, Sanjay.
Local:
Co-adviser: Kahveci, Tamer.
Statement of Responsibility:
by Nirmalya Bandyopadhyay.

Record Information

Source Institution:
UFRGP
Rights Management:
Copyright Bandyopadhyay, Nirmalya. Permission granted to the University of Florida to digitize, archive and distribute this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.
Resource Identifier:
814396873 ( OCLC )
Classification:
LD1780 2011 ( lcc )

Full Text

PAGE 1

MODELING PERTURBATIONS IN GENE REGULATORY AND SIGNALING NETWORKS
By
NIRMALYA BANDYOPADHYAY
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2011

PAGE 2

© 2011 Nirmalya Bandyopadhyay

PAGE 3

To my beloved wife Shruti and our baby whom we lost during our journey to Ph.D.

PAGE 4

ACKNOWLEDGMENTS

I would like to thank my advisors, Dr. Sanjay Ranka and Dr. Tamer Kahveci, for all their guidance, support and the numerous opportunities they provided throughout my studies and research. I would also like to thank my committee members, Dr. Alin Dobra, Dr. Arunava Banerjee and Dr. Valerie de Crecy-Lagard, for all of their help and valuable suggestions. I would like to thank Dr. Paul Gader for his inspiration and excellent teaching in his three courses of machine learning. I express my gratitude to my beloved wife Shruti, without whose relentless support, inspiration and sacrifice my Ph.D. could not have been possible. Finally, my special thanks go to my family and friends for their continued support and guidance to overcome the obstacles during my journey to Ph.D.

PAGE 5

TABLE OF CONTENTS
page
ACKNOWLEDGMENTS ..... 4
LIST OF TABLES ..... 7
LIST OF FIGURES ..... 8
ABSTRACT ..... 10
CHAPTER
1 INTRODUCTION ..... 12
2 BACKGROUND ..... 19
  2.1 Feature Selection in Gene Expression Data ..... 19
  2.2 Analysis of Perturbation in Gene Networks ..... 20
  2.3 Synthetic Sickness Lethality ..... 22
3 PATHWAY BASED FEATURE SELECTION FOR CANCER MICROARRAY DATA ..... 24
  3.1 Algorithm ..... 27
    3.1.1 Picking the First Feature: Where to Start? ..... 28
    3.1.2 Selecting the Candidate Features ..... 29
    3.1.3 Selecting the Best Candidate Gene ..... 30
    3.1.4 Exploration of Training Space ..... 32
  3.2 Data Set and Experiments ..... 33
    3.2.1 Experimental Setup ..... 34
    3.2.2 Pathways with Unresolved Genes ..... 35
    3.2.3 Biological Validation of Selected Features ..... 38
    3.2.4 Comparison with Van't Veer's Gene Signature ..... 40
    3.2.5 Comparison with I-RELIEF ..... 42
    3.2.6 Cross Validation Experiments ..... 43
    3.2.7 Experiments on Fully Connected Pathways ..... 48
    3.2.8 Contribution of Pathway Information ..... 48
  3.3 Discussion ..... 49
4 MODELING PERTURBATIONS USING GENE NETWORKS ..... 51
  4.1 Methods ..... 53
    4.1.1 Notation and Problem Formulation ..... 53
    4.1.2 Overview of Our Solution ..... 55
    4.1.3 Computation of the Prior Density Function ..... 56
    4.1.4 Approximation of the Objective Function ..... 60
    4.1.5 Calculation of Likelihood Density Function ..... 61

PAGE 6

    4.1.6 Objective Function Optimization ..... 63
  4.2 Experiments ..... 64
    4.2.1 Evaluation of Biological Significance ..... 65
    4.2.2 Evaluation of the Rankings of Neighbor Genes ..... 66
    4.2.3 Comparison to Other Methods ..... 68
    4.2.4 Sensitivity to the Gap between Primary and Secondary Effects ..... 72
  4.3 Discussion ..... 73
5 IDENTIFYING DIFFERENTIALLY REGULATED GENES ..... 74
  5.1 Methods ..... 78
    5.1.1 Notation and Problem Formulation ..... 78
    5.1.2 Overview of the Solution ..... 81
    5.1.3 Computation of the Prior Density Function ..... 82
    5.1.4 Approximation of the Objective Function ..... 89
    5.1.5 Computation of Likelihood Density Function ..... 91
    5.1.6 Objective Function Optimization ..... 94
  5.2 Results and Discussion ..... 95
    5.2.1 Comparison to Other Methods ..... 97
    5.2.2 Statistical Significance Experiment ..... 100
  5.3 Discussion ..... 104
6 SSLPRED: PREDICTING SYNTHETIC SICKNESS LETHALITY ..... 105
  6.1 Methods ..... 109
    6.1.1 Problem Formulation and Notation ..... 109
    6.1.2 Between Pathway Conjectures ..... 110
    6.1.3 Regression Based Solution ..... 112
  6.2 Experiments ..... 116
    6.2.1 Datasets ..... 116
    6.2.2 Comparison with Hescott's Method ..... 117
  6.3 Discussion ..... 122
7 CONCLUSION ..... 123
REFERENCES ..... 124
BIOGRAPHICAL SKETCH ..... 136

PAGE 7

LIST OF TABLES
Table / page
3-1 List of publications supporting the first twenty features obtained from BCR dataset about their responsibility for cancer. ..... 36
3-2 Accuracy of our algorithm obtained from Cross Validation experiments on real pathway. ..... 44
4-1 The table enumerates the truth values for the two binary feature functions. ..... 58
4-2 List of top 25 genes that are mostly affected by external perturbation. ..... 65
5-1 Enumeration of the values of Zi, Zj and Xij for different values of SAi, SBi, SAj and SBj. ..... 79
5-2 Enumeration of five different unary feature functions F1, F2, F3, F6 and F7 ..... 85
5-3 The table enumerates the truth values for the binary feature function left external equality (f4). ..... 87
5-4 The table enumerates the truth values for the binary feature function right external equality (f5). ..... 88
6-1 Summarization of the feature functions of the regression model. ..... 115
6-2 BPMs with p-Values less than 0.1 [%] ..... 119

PAGE 8

LIST OF FIGURES
Figure / page
1-1 A portion of colorectal cancer pathway obtained from KEGG database. ..... 13
1-2 This figure illustrates the impact of a hypothetical external perturbation on a small portion of the pancreatic cancer pathway. ..... 14
3-1 Part of pancreatic cancer pathway adapted from KEGG showing the gene-gene interactions. ..... 25
3-2 We plot the real data points RP10, ..., RP100 along with the constructed function f(s) against s, the fraction of the current pathway. ..... 35
3-3 Comparison of test accuracy of our signature and van't Veer's 70-gene prognostic signature on real pathway. ..... 39
3-4 The standard deviation of the accuracy of our method for different number of features. The x-axis denotes different number of features. ..... 40
3-5 Comparison of test accuracy of our method (BPFS) to that of I-RELIEF on real pathway. ..... 41
3-6 Comparison of test accuracy of our method (BPFS) to that of I-RELIEF on fully connected pathway. ..... 45
3-7 Comparison of test accuracy of our method (BPFS) to when we select the genes only based on marginal classification power. ..... 46
4-1 Illustration of the impact of a hypothetical external perturbation on a small portion of the pancreatic cancer pathway. ..... 52
4-2 Illustration of a small hypothetical gene network with a perturbation and the graph for Markov random field created for that gene network. ..... 56
4-3 Frequency of average distance of rankings over training and testing data. ..... 67
4-4 Comparison of our method to SSEM and t-test. ..... 69
4-5 Comparison of accuracies with SSEM and Student's t test while varying the ratio of gaps of primarily and secondarily affected genes. ..... 71
5-1 Illustration of the impact of a hypothetical external perturbation on a small portion of the pancreatic cancer pathway. ..... 75
5-2 A hypothetical gene network and corresponding Markov random graph. ..... 77
5-3 Illustration of different components of CMRF and connectivity between them. ..... 83

PAGE 9

5-4 Comparison of our method CMRF to SMRF, SSEM and t-test. The number of primarily differentially affected genes is 40. ..... 99
5-5 Illustration of the statistical significance test. ..... 102
6-1 Illustration of the concepts of Synthetic Sickness and Lethality and Between Pathway Models. ..... 106
6-2 This figure depicts the layered neighbor structure around a gene interaction edge (ga, gb). ..... 112
6-3 Comparison of SSLPred with the method from Hescott et al. ..... 118
6-4 Comparison of SSLPred with the method from Hescott et al. on Brady and Ma datasets for p-Values 0.1 ..... 120

PAGE 10

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

MODELING PERTURBATIONS IN GENE REGULATORY AND SIGNALING NETWORKS
By
Nirmalya Bandyopadhyay
December 2011
Chair: Sanjay Ranka
Cochair: Tamer Kahveci
Major: Computer Engineering

Genes interact with each other through gene networks. A perturbation in the organism tissue can propagate to different parts of the network and affect the expression of the other genes. A perturbation can take place because of a disorder due to cancer, radiation or medication. In this work, we propose to model and analyze the impact of perturbations in gene networks. In particular, we are interested in determining the subset of genes that are directly and indirectly affected by these perturbations, and how this impact may vary across multiple groups with different characteristics.

This problem has been touched upon in the existing literature. However, most of those methods employ statistical and machine learning based techniques that take gene expression as the sole input data and do not consider the genetic interaction network. In this work, we adopt a hypothesis for solving this problem. We formally state the hypothesis as follows: integrating gene networks with gene expression enables us to analyze the effect of perturbations in a more effective and comprehensive fashion. We address our problem by breaking it into four connected sub-problems. In all four sub-problems, we leverage the knowledge from gene networks to build machine learning based solutions.

PAGE 11

Our thesis establishes the justification behind this hypothesis in two ways. First, our methods produce more accurate results than the existing ones, as evident in the results of the experiments. Instead of producing binary decisions, our methods build probabilistic models that enable us to estimate the confidence in their predictions. Second, computational biologists will be able to inspect which pathways have been affected and to what extent.

PAGE 12

CHAPTER 1
INTRODUCTION

Genes interact with each other through gene networks. In this work, the term gene networks encompasses both regulatory and signaling networks. We model gene networks as a graph structure, where every node corresponds to a gene. One gene has a directed edge to another one if the first gene regulates the expression of the second gene. Expression is the process by which a gene uses its sequence information to synthesize a functional protein product. We denote two genes as neighbors if they share an edge. Figure 1-1 shows an example of a small gene network: a portion of the colorectal cancer pathway adapted from the KEGG database. → stands for activation, where the first gene at the start of the arrow increases the expression (up-regulation) of the other gene. ⊣ implies inhibition, where the first gene decreases the expression (down-regulation) of the other gene. For example, the gene K-Ras activates Raf, while PKB/Akt inhibits CASP9. Thus, K-Ras and Raf are neighbors.

As genes interact through the connected gene network, if there is a perturbation in the organism tissue, it potentially propagates to different parts of the network and affects the expression of the genes. In this proposal, we define perturbation, in a broader sense, as an external cause that can affect the genes in the gene network. This definition includes different kinds of cancers, radiation, medication and toxic elements, as they all change gene expression. A perturbation initially affects some genes and changes their expressions. We call these genes primarily affected genes. These primarily affected genes, in the course of time, change the expressions of their neighbors through gene networks. We denote this second category of genes as secondarily affected genes. Figure 1-2 illustrates the impact of a hypothetical external perturbation on a small portion of the pancreatic cancer pathway. In this figure, genes K-Ras, Raf and Cob42Roc

PAGE 13

Figure 1-1. A portion of colorectal cancer pathway obtained from KEGG database (© 1995-2011 Kanehisa Laboratories). The figure represents the genes and genetic interactions in that pathway. The green rectangles represent the genes. There are two classes of interactions indicated by two kinds of arrows. The → arrows stand for activation, where the first gene at the start of the arrow increases the expression (up-regulation) of the other gene. ⊣ implies inhibition, where the first gene decreases the expression (down-regulation) of the other gene. For example, the gene K-Ras activates Raf, while PKB/Akt inhibits CASP9.

are primarily affected and MEK, Ral and RalGDS are secondarily affected through interactions.

The gene expression dataset with perturbation typically consists of two parts, control data and non-control data. The control data is a snapshot of the tissue without any effect of perturbation. The non-control dataset is obtained after the administration of perturbation. Sometimes the non-control data can be collected over several time points as time series data after the perturbation takes place. The exact way to prepare the data depends on the type of the experiment. For cancer, control datasets are prepared by collecting tissues from non-cancerous organisms, or non-cancerous tissues from

PAGE 14

Figure 1-2. This figure illustrates the impact of a hypothetical external perturbation on a small portion of the pancreatic cancer pathway. The pathway is taken from the KEGG database. The solid rectangles denote the genes affected directly by the perturbation; the dashed rectangles indicate genes secondarily affected through interactions. The dotted rectangles denote the genes that are not affected by the perturbation. → implies activation and ⊣ implies inhibition. In this figure, genes K-Ras, Raf and Cob42Roc are primarily affected and MEK, Ral and RalGDS are secondarily affected through interactions.

the patients. The non-control dataset is collected from actual diseased tissue. For the deletion mutant experiments, an experiment specific gene is knocked out from the chromosome and gene expressions are taken before and after the deletion.

Our high level problem, in this context, is to analyze the characteristics of perturbation in gene networks. To solve this problem, we propose to solve the following subproblems:

1. Which genes are affected as a result of the perturbation? Can we refine the set of affected genes to classify between the control and non-control groups?
2. Can we predict whether a gene is affected as a result of the perturbation or due to its interaction with the other genes in the gene network?
3. Sometimes we have datasets from two slightly different populations, such as African American vs Caucasian American. Can we identify those genes that behave in a different way between those two groups? What is the reason for this discrepancy?
4. Some pathways (sub-networks) in a gene network have duplicate pathways as backup systems. If one of those pathways is damaged because of the perturbation, the

PAGE 15

backup system activates and compensates for the damage. Can we identify those backup pathways?

Some of these problems have been touched upon in the existing literature, a brief summary of which can be found in Chapter 2. However, most of these methods employ statistical and machine learning based techniques that take gene expression as the sole input data and do not consider the genetic interaction network. In this work, we adopt a hypothesis for solving the set of problems that we just listed. We formally state the hypothesis as follows: integrating gene networks with gene expression enables us to analyze the effect of perturbation in a more effective and comprehensive fashion. In all four subtasks of this work, we leverage the knowledge from gene networks to build machine learning based solutions to these subproblems.

Our work establishes the justification behind this hypothesis in two ways. First, our methods produce more accurate results than the existing methods, as evident in the experimental results. Instead of producing binary decisions, our methods build probabilistic models that enable us to estimate the confidence in their predictions. Second, computational biologists will be able to inspect which pathways have been affected and to what extent. Here we briefly elaborate our approach to the problems described above.

Pathway based feature selection of cancer microarray data. Classification of cancer is an important problem, as class specific medication improves the treatment efficiency and reduces unwanted side effects. Classification algorithms based on gene expressions often have better prediction accuracy than the ones based on clinical markers. However, classification algorithms are prone to overfitting for microarray data due to the large number of features. One way to avoid this problem is to select a small set of relevant features and ignore the remaining ones.
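As a rough illustration of selecting a small set of relevant features, the sketch below ranks genes with a generic signal-to-noise style score and keeps the top k. This is a plain expression-only filter, not the pathway based method developed in this dissertation; the function names, the score and the toy expression values are all assumptions made for the example.

```python
from statistics import mean, pstdev


def rank_genes(control, perturbed):
    """Rank genes by |difference of group means| / (sum of group std devs).

    control, perturbed: dict mapping gene name -> list of expression values.
    Returns gene names sorted from most to least discriminative.
    """
    scores = {}
    for gene in control:
        a, b = control[gene], perturbed[gene]
        spread = pstdev(a) + pstdev(b)
        # Small constant guards against division by zero for constant genes.
        scores[gene] = abs(mean(a) - mean(b)) / (spread + 1e-9)
    return sorted(scores, key=scores.get, reverse=True)


def select_top_k(control, perturbed, k):
    """Keep only the k highest-ranked genes and ignore the rest."""
    return rank_genes(control, perturbed)[:k]


# Toy data (hypothetical values, not taken from the dissertation).
control = {"g1": [1.0, 1.1, 0.9], "g2": [2.0, 2.1, 1.9], "g3": [0.5, 1.5, 1.0]}
perturbed = {"g1": [3.0, 3.1, 2.9], "g2": [2.0, 2.2, 1.8], "g3": [0.6, 1.4, 1.1]}
selected = select_top_k(control, perturbed, k=1)
```

In this toy data only g1 shifts clearly between the two groups, so it is the gene a filter of this kind would keep. A score like this treats every gene independently, which is the limitation that motivates bringing pathway information into the selection.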

PAGE 16

This chapter develops a new feature selection method called Biological Pathway based Feature Selection (BPFS) for microarray data. Note that here we consider the cause of cancer as the perturbation in the context of the main work. Unlike most of the existing methods, our method integrates signaling and gene regulatory pathways with gene expression data so that the selected features have significant classification accuracy and also have the least interaction among each other on the pathway. Thus, BPFS selects a biologically meaningful feature set that is minimally redundant. Our experiments on published breast cancer datasets demonstrate that all of the top 20 genes found by our method are associated with cancer. Furthermore, the classification accuracy of our signature is up to 18% better than that of van't Veer's 70 gene signature, and up to 8% better than that of the best published feature selection method, I-RELIEF. Chapter 3 discusses this work.

Modeling perturbations using gene networks. External factors such as radiation, drugs or chemotherapy can alter the expressions of a subset of genes. We call these genes the primarily affected genes. Primarily affected genes eventually can change the expressions of other genes as they activate/suppress them through interactions. Measuring the gene expressions before and after applying an external factor (i.e., perturbation) in microarray experiments can reveal how the expression of each gene changes. It however cannot identify the cause of the change. We consider the problem of identifying primarily affected genes given the expression measurements of a set of genes before and after the application of an external perturbation. We develop a novel probabilistic method to quantify the cause of differential expression of each gene. Our method considers the possible gene interactions in regulatory and signaling networks for a large number of perturbed genes. It uses a Bayesian model to capture the dependency between the genes. Our experiments on both real and synthetic datasets demonstrate that our method can find primarily affected genes with high accuracy. Our experiments also suggest

PAGE 17

that our method is significantly more accurate than the competing methods, SSEM and Student's t-test. Chapter 4 provides a detailed description of this work.

Identifying differentially regulated genes. Microarray experiments often measure expressions of genes taken from sample tissues in the presence of external perturbations such as medication, radiation, or disease. Typically in such experiments, gene expressions are measured before and after the application of external perturbation. We focus on an important class of such microarray experiments that inherently have two groups of tissue samples. The external perturbation can change the expressions of some genes directly or indirectly through the gene interaction network. When such different groups exist, the expressions of genes after the perturbation can be different between the two groups. It is not only important to identify the genes that respond differently across the two groups, but also to mine the reason behind this differential response. We aim to identify the cause of this differential behavior of genes, whether because of the perturbation or due to interactions with other genes, in two group perturbation experiments. We propose a new probabilistic Bayesian method with Markov Random Field to find such genes. Our method incorporates information about relationships from gene networks as prior information. Experimental results on synthetic and real datasets demonstrate the superiority of our method compared to existing techniques. Chapter 5 elaborates on this work.

Prediction of synthetic sickness lethality relationships. Two genes in an organism have a Synthetic Sickness Lethality (SSL) interaction if their joint deletion leads to a lower than expected fitness. Synthetic Gene Array (SGA) is a technique that helps in identifying SSL values for pairs of genes in a given set of genes. SSL interactions are useful to discover the co-expressed gene groups in the regulatory and signaling networks. Also, they are used to unravel the pairs of pathways (subsets of physically interacting genes) that substitute the functions of each other. Generating an

PAGE 18

SGA entry is costly as it requires producing and monitoring a double mutant (a progeny with two mutated genes). Generating a comprehensive SGA can be very expensive as the number of gene pairs is quadratic in the number of genes of the corresponding organism. In this chapter, we develop a new method, SSLPred, to predict the SSL interactions in an organism. In the present context, single and double deletion of genes have been modeled as the external perturbation. Our method is built on the concept of Between Pathway Models (BPM), where the majority of the SSL pairs span across the two functionally complementing pathways. We develop a regression based approach that learns the mapping between the gene expressions of single deletion mutants and the corresponding SGA entries. We compare our method to the one by Hescott et al. for predicting the GI (Genetic Interaction) score of Saccharomyces cerevisiae (S. cerevisiae) on four benchmark datasets. On different experimental setups, on average SSLPred performs significantly better compared to the other method. Chapter 6 elaborates on this approach.
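The graph model used throughout this chapter (genes as nodes, directed activation or inhibition edges, primarily and secondarily affected genes) can be sketched in a few lines. The sketch below is a deterministic simplification for illustration only: the dissertation's methods are probabilistic, and simple reachability is assumed here in their place. The K-Ras to Raf and PKB/Akt to CASP9 edges come from the text; the Raf to MEK edge, the function names and the data layout are assumptions.

```python
from collections import deque

# Directed, signed edges: (regulator, target, sign).
# +1 means activation, -1 means inhibition.
EDGES = [
    ("K-Ras", "Raf", +1),      # stated in the text
    ("PKB/Akt", "CASP9", -1),  # stated in the text
    ("Raf", "MEK", +1),        # illustrative assumption
]


def build_network(edges):
    """Build an adjacency map: gene -> list of (target, sign)."""
    network = {}
    for src, dst, sign in edges:
        network.setdefault(src, []).append((dst, sign))
        network.setdefault(dst, [])
    return network


def secondarily_affected(network, primary):
    """Genes reachable from the primary set along directed edges.

    Breadth-first search; genes in the primary set are excluded from
    the result because they are affected directly by the perturbation.
    """
    seen = set(primary)
    queue = deque(primary)
    while queue:
        gene = queue.popleft()
        for target, _sign in network.get(gene, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen - set(primary)


network = build_network(EDGES)
secondary = secondarily_affected(network, {"K-Ras", "Raf"})
```

With K-Ras and Raf marked as primarily affected, the only gene reached through interactions in this toy network is MEK. The edge signs are stored but not used by the search; a fuller model would use them to predict the direction of the expression change.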

PAGE 19

CHAPTER 2
BACKGROUND

In this chapter we briefly describe the existing work relevant to my work.

2.1 Feature Selection in Gene Expression Data

Feature selection is an important area in data mining for preprocessing of data. Feature selection techniques select a subset of features to reduce the relatively redundant, noisy and irrelevant part of the data. The reduced set of features increases the speed of the data mining algorithms, and improves accuracy and understandability of the result. Feature selection is often used in areas such as sequence, microarray, and mass-spectra analysis [1]. The popular feature selection methods can be broadly categorized into the following:

FILTER METHODS. [2-4] These are widely studied methods that work independent of the classifier. They rank the features depending on the intrinsic properties of data. One such method is to select sets of features whose pairwise correlations are as low as possible.

WRAPPER METHODS. [5-6] These methods embed the feature selection criteria into the searching of subsets of features. They use a classification algorithm to select the feature set and evaluate its quality using the classifier.

EMBEDDED METHODS. [7] These approaches select features as a part of the classification algorithm. Similar to the wrapper methods, they interact with the classifier, but at a lower cost of computation.

All the above-mentioned traditional feature selection methods ignore the interactions of the genes. Considering each gene as an independent entity can lead to redundancy and low classification accuracy, as many genes can have similar expression patterns.

Several recent works on microarray feature selection have leveraged metabolic and gene interaction pathways in their methods. Vert et al. [8] encoded the graph and the set of expression profiles into kernel functions and performed a generalized form of canonical correlation analysis in the corresponding reproducing kernel Hilbert spaces. Rapaport et al. [9] proposed an approach based on spectral decomposition of gene expression profiles with respect to the eigenfunctions of the graph. Wei et al. [10]

PAGE 20

proposed a spatially correlated mixture model where they extracted gene specific prior probabilities from the gene network. Wei et al. [11] developed a Markov random field based method for identifying genes and subnetworks that are related to a disease. A drawback of the last two model based approaches is that the number of parameters to be estimated is proportional to the number of genes. So optimizing the objective function is costly, as the number of genes in microarrays is more than 20,000. Li et al. [12] introduced a network constraint regularization procedure for linear regression analysis, which formulates the Laplacian of the graph extracted from the genetic pathway as a regularization constraint.

One limitation of all the above mentioned methods that use biological pathways is that all of them consider genetic interactions only between immediate neighbors on the pathway. None of them explicitly considers interactions that are beyond immediate neighbors. Also, most of them performed quantitative analysis of the selected features on simulated datasets. So it is not possible to quantify the accuracy of those selected features on real datasets only from the results in these papers. Additional experiments on real microarray datasets are required to justify those methods and their sets of features. Also, as the reconstruction of genetic pathways is yet to be completed, we cannot always map a microarray entry to the biological pathway. They do not consider the implications of this missing information. In this chapter, we introduce a new microarray feature selection method that addresses these issues.

2.2 Analysis of Perturbation in Gene Networks

Existing methods to identify the primarily affected genes, such as association analysis techniques [13-14], haplo-insufficiency profiling [15-17] and chemical-genetic interaction mapping [18], require additional information such as fitness based assays of drug response or a library of genetic mutants. Bernardo et al. suggested a regression based approach called MNI based on the assumption that the internal genetic interactions are offset by the external perturbation [19]. It estimates gene-gene

PAGE 21

interaction coefficients from the control data. It then uses those coefficients to predict the target genes in the non-control data. Cosgrove et al. proposed a method named SSEM that is similar to MNI [20]. SSEM models the effect of perturbation by an explicit change of gene expression from that of the unperturbed state. Vaske et al. developed a method to infer the affected regulatory networks due to external perturbations using a graphical model called probabilistic factor graph [21]. These methods have several limitations.

LACK OF GENE INTERACTION DATA. The existing methods do not employ regulatory or signaling networks to model gene-gene interactions. Since gene networks are manually curated by domain experts, they are reliable sources of gene interactions. Utilizing them has the potential to more accurately solve the problem of identifying primarily affected genes.

LIMITED PERTURBATIONS. These methods are suitable when only a very small number of genes are perturbed, e.g., the genetic perturbation experiments are often designed for single gene perturbations [13]. However, external effects such as radiation can alter the expression of many genes directly, making the existing methods of limited use.

SIMPLISTIC MODELS. Most of these methods provide only the set of genes that are directly affected by the perturbations and do not specify any error bounds. However, a non-probabilistic inference oversimplifies the problem, especially in cases when a small number of gene expression measurements are available. As a result, these methods can overfit the data, making the solution unreliable.

Several recent studies aim to identify differentially expressed (DE) genes in multiple groups of data points. maSigPro is a two stage regression based method that identifies genes that demonstrate differential gene expression profiles across multiple experimental groups [22]. Hong et al. proposed a functional hierarchical model for detecting temporally differentially expressed genes between two experimental conditions [23]. They modeled gene expressions by basis function expansion and estimated the parameters using a Monte Carlo EM algorithm. Tai et al. ranked DE genes using data from replicated microarray time course experiments, where there are multiple biological conditions [24]. They derived a multisample multivariate empirical Bayes


statistic for ranking genes. Angelini et al. proposed a Bayesian method for detecting temporally DE genes between two experimental conditions [25]. Deun et al. developed a Bayesian method to find the genes that are differentially expressed in a single tissue or condition over multiple tissues or conditions [26]. All these methods identify differentially expressed genes in multiple groups. However, none of them analyzes the primary and secondary effects in a two-group perturbation experiment.

2.3 Synthetic Sickness Lethality

Recent studies on synthetic sickness and lethality analysis have opened up new directions in functional genomics. These works can be classified into two categories, experimental and computational. We provide a brief overview of both groups.

Experimental. In a seminal work by Tong et al., pairwise single-deletion strains are crossed to produce arrays of double-mutant strains called synthetic gene array (SGA) [27]. The final result is an ordered array of double-mutant haploid strains whose growth rates are monitored by visual inspection or image analysis of colony size. A different approach, dSLAM (diploid-based synthetic lethal analysis with microarrays), identifies genetic interactions using pools of barcoded yeast mutants [28]. This method measures differential enrichment of double mutants growing in competitive culture using barcode microarrays and determines genetic interactions. The EMAP (epistatic mini-array profile) strategy enables comprehensive and quantitative measurements of genetic interactions between pairs of mutations within a defined subset of genes that are related to one or more specific biological processes. EMAPs are produced by systematically generating yeast strains carrying each pair of mutations and measuring their growth rates [29].

Computational. Kelley and Ideker introduced Between Pathway Models (BPM) by combining synthetic sickness and lethality information from EMAP data with information on protein-protein, protein-DNA or metabolic networks [30]. In a probabilistic way, their


model uncovered almost two thousand genetic interactions spanning either between or within pathways. Their paper showed that two pathways in a BPM have significantly more SSL edges than physical edges between them. The opposite holds within pathways, where physical edges outnumber SSL edges. This observation is consistent with the functionally complementary nature of the constituent pathways. A smaller number of between-pathway physical edges indicates that the two pathway modules are functionally independent, so destroying one of them is not lethal to the other pathway. A higher number of between-pathway SSL edges implies that the two pathways are complementary, and disrupting both of them can make the organism non-viable. Hescott et al. [31] proposed a new method to validate BPMs using single gene deletion microarray data. They evaluated the quality of the BPMs from four different studies and described how their methods might be extended to refine BPM pathways. Kelley and Kingsford [32] developed a new method called Expected Graph Compression to identify compensatory pathways (BPMs) by clustering genes into modules and establishing relationships between those modules. They used this framework to apply a graph clustering method called graph summarization to an EMAP dataset. Though research in these two avenues has enriched our understanding of gene interactions and gene networks, we did not find any predictive model to predict the EMAP scores.


CHAPTER 3
PATHWAY BASED FEATURE SELECTION FOR CANCER MICROARRAY DATA

An important challenge in cancer treatment is to classify a patient to an appropriate cancer class. This is because class-specific treatment reduces toxicity and increases the efficacy of the therapy [33]. Traditional classification techniques are based on different kinds of clinical markers such as the morphological appearance of tumors, the age of the patients and the number of lymph nodes [34]. These techniques, however, have extremely low (9%) prediction accuracies [35]. Class prediction based on gene expression monitoring is a relatively recent technology with a promise of significantly better accuracy compared to the classical methods [33]. These algorithms often use microarray data [36] as input. Microarrays measure gene expression and are widely used due to their ability to capture the expression of thousands of genes in parallel. A typical microarray database contains gene expression profiles of a few hundred patients. For each patient (also called observation), the microarray records expressions of more than 20,000 genes. We define an entry of a microarray as a feature.

Classification methods often build a classification function from training data. The class labels of all the samples in the training data are known in advance. Given a new sample, the classification function assigns one of the possible classes to that sample. However, as the number of features is large and the number of observations is small, standard classification algorithms do not work well on microarray data. One potential solution to this problem is to select a small set of relevant features from all microarray features and use only them to classify the data.

The research on microarray feature selection can be divided into three main categories: filter, wrapper and embedded [1]. These methods often employ statistical scoring techniques to select a subset of features. Selection of a feature from a large number of potential candidates is, however, difficult as many candidate features have


similar expressions. This potentially leads to inclusion of biologically redundant features. Furthermore, selection of redundant features may cause exclusion of biologically necessary features. Thus, the resultant set of features may have poor classification accuracy. One way to select relevant features from microarray data is to exploit the interactions between these features, which is the problem considered in this chapter. More specifically, we consider the following problem.

Problem statement. Let D be the training microarray dataset where each sample belongs to one of the T possible classes. Let P be the gene regulatory and signaling network. Choose K features using D and P so that these features maximize the classification accuracy for an unobserved microarray sample that has the same distribution of values as those in D.

Figure 3-1. Part of the pancreatic cancer pathway adapted from KEGG showing the gene-gene interactions. → implies activation and ⊣ implies inhibition. The rectangles with solid lines represent valid genes mapped to the pathway. They are referred to by the names of the genes. +p denotes phosphorylation. For example, PKB/Akt activates IKK through phosphorylation. IKK in turn activates NFkB. Thus, PKB/Akt indirectly activates NFkB. The rectangles with dotted lines are genetic sequences that do not have an Entrez Gene ID and are not mappable to the pathway. We cannot yet associate them with any pathway. We denote them as unresolved genes. They are referred to by GenBank Accession numbers.


Contributions. Unlike most of the traditional feature selection methods, we integrate gene regulatory and signaling pathways with microarray data to select biologically relevant features. On the pathway, one gene can interact with another in various ways, such as by activating or inhibiting it. In Figure 3-1, RacGEF activates RAC, BAD inhibits Bcl-xl and PKB/Akt inhibits BAD by phosphorylation. We use the term influence to imply this interaction between two genes. We quantify influence by considering the number of intermediate genes between two genes on the pathway that connects them. The influence is highest when two genes are directly connected. Our hypothesis in this chapter is that selecting two genes that highly influence each other often implies inclusion of biologically redundant genes. The rationale is that manipulating one of these genes will have a significant impact on the other one. Thus, selecting only one of them produces comparable prediction accuracy. So we choose the set of features such that each of them has the lowest influence on the other selected features. We propose a novel algorithm called Biological Pathway based Feature Selection algorithm (BPFS) based on the above hypothesis that has the following characteristics:

1. Let the complete set of features be G and the set of already selected features be S. BPFS ranks all the features in $G - S$ with an SVM based scoring method, MIFS [37]. The score quantifies the capacity of a feature to improve the already attained classification accuracy. BPFS ranks features in decreasing order of their scores.

2. BPFS chooses a small subset C of highly ranked features from $G - S$ and evaluates the influence of every feature in C on the features in S. Finally, it selects the feature in C that has the lowest influence on the features in S and moves it to S from $G - S$. BPFS repeats this step for a fixed number of iterations.

We observe that a significant fraction of the gene entries in the microarray do not have any corresponding gene in the pathway. We use the term unresolved genes to represent these genes. We propose a probabilistic model to estimate the influence of those genes on selected features. We tested the performance of our method on five breast cancer data sets [38-42] to predict whether breast cancer for those patients relapsed within five years or not. Our


experiments show that our method achieves up to 18% and 8% better accuracy than the 70-gene prognostic signature [38] and I-RELIEF [34] respectively.

The organization of the rest of this chapter is as follows. Section 3.1 describes the proposed algorithm. Section 3.2 presents experimental results. Section 3.3 briefly concludes the chapter.

3.1 Algorithm

This section describes our Biological Pathway based Feature Selection algorithm (BPFS) in detail. BPFS takes a labeled two-class microarray dataset as input and selects a set of features. Algorithm 1 portrays a synopsis of BPFS.

Algorithm 1. Biological Pathway based Feature Selection Algorithm (BPFS)
/* G and S denote the set of all features and the set of selected features respectively. The set $G - S$ represents all the remaining features. */
1. Select the first feature from G that has the highest mutual information.
2. Repeat while there are more features to select:
   (a) Calculate the marginal classification power for all the features in $G - S$.
   (b) Select the top t features with the highest marginal classification power as candidate set C.
   (c) Calculate the Total Influence Factor (TIF) for all the features in C.
   (d) Select the feature with the lowest TIF and include it in S.

We discuss an overview of BPFS next. We denote the set of all features by G. Let S be the set of features selected so far. The set $G - S$ represents all the remaining features. BPFS iteratively moves one feature in $G - S$ to S using the following steps, till the required number of features are selected (along with their rank):

1. DETERMINE THE t BEST CANDIDATES. (2(a) to 2(b) in Algorithm 1) This step creates a candidate set of features from $G - S$ by considering their classification accuracy alone. To do this, BPFS first sorts all the available features in decreasing order of their marginal classification power and chooses the top t (typically t = 10 in practice) of them as the candidate set for the next step. We define the marginal classification power of a feature as its ability to improve the classification accuracy


when we include it into S. Let us denote the set that contains these top t features by the variable C.

2. PICK THE BEST GENE USING PATHWAYS. (2(c) to 2(d) in Algorithm 1) In this step, we use signaling and regulatory pathways to distinguish among the features in the set C obtained in Step 1. Given a set of already selected features S, BPFS aims to select the next most biologically relevant feature from C. We define a metric to compare the features in C for this purpose. This metric estimates the total influence between a candidate feature and the set of selected features. We denote this total influence as the Total Influence Factor (TIF). TIF is a measure of the potential interaction (activation, inhibition etc.) between a candidate gene and all the selected genes. A high value of TIF for a gene implies that the gene is highly influenced by some or all of the already selected genes. We choose the gene in C that has the lowest TIF. We then include it in S. We elaborate on this step in Section 3.1.3.

In the following subsections we discuss the above aspects of our algorithm in more detail. Section 3.1.1 defines how we select our first feature. Section 3.1.2 discusses the first round of selection procedures based on classification capability. Section 3.1.3 describes the use of pathways for feature selection. Section 3.1.4 presents a technique to utilize the training space efficiently in order to improve the quality of features.
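The two steps above can be read as a greedy loop. The following is a minimal Python sketch of that loop, not the thesis implementation (which was written in MATLAB and Java). It assumes that a callable `marginal_power` stands in for the SVM based score, that `hops` is a precomputed lookup of shortest-path hop counts on the pathway, and that TIF uses the $1/2^{d-1}$ influence factor that Section 3.1.3 defines. The function names and the alphabetical tie-breaking are hypothetical choices made only for illustration.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

# Hypothetical lookup: shortest-path hop count for a pair of genes.
Hops = Dict[Tuple[str, str], int]


def total_influence(g: str, selected: Sequence[str], hops: Hops) -> float:
    """Sum 1 / 2**(d - 1) over the selected genes that are connected to g.

    Pairs that are missing from the lookup are treated as unconnected
    and contribute nothing to the total.
    """
    total = 0.0
    for s in selected:
        d: Optional[int] = hops.get((g, s), hops.get((s, g)))
        if d is not None and d >= 1:
            total += 1.0 / 2 ** (d - 1)
    return total


def bpfs_sketch(
    features: Sequence[str],
    first: str,
    marginal_power: Callable[[str, List[str]], float],
    hops: Hops,
    k: int,
    t: int = 10,
) -> List[str]:
    """Greedy selection: top-t by marginal power, then lowest TIF."""
    selected = [first]
    remaining = set(features) - {first}
    while len(selected) < k and remaining:
        # Step 1: candidate set C = the t features with the highest score.
        ranked = sorted(remaining, key=lambda m: (-marginal_power(m, selected), m))
        candidates = ranked[:t]
        # Step 2: keep the candidate with the lowest TIF with respect to S.
        best = min(candidates, key=lambda g: (total_influence(g, selected, hops), g))
        selected.append(best)
        remaining.remove(best)
    return selected
```

In this sketch, a candidate with no known path to any selected gene gets a TIF of zero and is therefore preferred, which matches the intent of choosing the gene that is least influenced by S.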
3.1.1 Picking the First Feature: Where to Start?

BPFS incrementally selects one feature at a time based on the features that are already selected. The obvious question then is: how do we select the first feature? There are many alternative ways to do this. One possibility is to get an initial feature using domain knowledge. This, however, is not feasible if no domain knowledge exists on the dataset. We use mutual information to quantify the discriminating power of a feature. Let us represent the kth feature of the microarray using a random variable F and the class label of the data using another random variable L. Assume that there are n observations in the data. F and L can assume different values over those n observations. Let f be an instance of F and l be an instance of L. The mutual information of F and L is

\[ I(F, L) = \sum_{f \in F,\, l \in L} p_{F,L}(f, l) \log \frac{p_{F,L}(f, l)}{p_F(f)\, p_L(l)} \]

where $p_{F,L}$ is the joint probability mass function of F and L; $p_F$ and $p_L$ are the respective marginal probability mass functions of F and L.
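As an illustration of the definition above, the sketch below estimates I(F, L) from empirical frequencies. It assumes that the expression values of a feature have already been discretized (the text does not say how), and it uses the natural logarithm, which is also an assumption. The function names `mutual_information` and `pick_first_feature` are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Hashable, Sequence


def mutual_information(f: Sequence[Hashable], l: Sequence[Hashable]) -> float:
    """Empirical mutual information between a discretized feature and labels."""
    n = len(f)
    if n == 0 or n != len(l):
        raise ValueError("f and l must be non-empty and of equal length")
    joint = Counter(zip(f, l))   # counts of (feature value, label) pairs
    count_f = Counter(f)         # marginal counts of feature values
    count_l = Counter(l)         # marginal counts of labels
    mi = 0.0
    for (fv, lv), c in joint.items():
        p_joint = c / n
        mi += p_joint * math.log(p_joint / ((count_f[fv] / n) * (count_l[lv] / n)))
    return mi


def pick_first_feature(columns: Dict[str, Sequence[Hashable]],
                       labels: Sequence[Hashable]) -> str:
    """Return the name of the feature with maximum mutual information."""
    return max(columns, key=lambda name: mutual_information(columns[name], labels))
```

A feature that perfectly separates two equally sized classes scores log 2 under this estimate, while a feature that is independent of the labels scores zero.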


Thus, we use I(F, L) to quantify the relevance of the kth feature for classification. We choose the feature with maximum mutual information as the starting feature. Another way to select the first feature is to utilize the marginal classification power. Essentially, this amounts to applying the second step of our algorithm with $S = \{\}$ and selecting the top candidate as the first feature. Next we discuss how we select the remaining features.

3.1.2 Selecting the Candidate Features

In this step, BPFS sorts all the available features in $G - S$ in decreasing order of their marginal classification power. We define the marginal classification power of a feature later in this section. BPFS then chooses the t features with the highest marginal classification power as the candidate set that will be explored more carefully in the subsequent steps. We elaborate on this next.

We use an SVM based algorithm, MIFS [37], to calculate the marginal classification power of all available features as follows. BPFS first trains an SVM using the features in S to get the value of the objective function of the SVM. We use a linear kernel for the SVM in our experiments. For very high dimensional data a linear kernel performs better than or comparable to a non-linear kernel [43]. A linear kernel is a simple dot product of the two inputs. So the objective function of the SVM becomes

\[ J = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j \tag{3-1} \]

where $\alpha_i$, $y_i$ and $x_i$ denote the Lagrange multiplier, the class label and the value of the selected set of features of the ith observation respectively. Here, $x_i$ and $x_j$ are vectors and $x_i \cdot x_j$ is their dot product. Then for each feature $m \in G - S$, BPFS calculates the objective function J if m is added to S as

\[ J(S \cup \{m\}) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j \, x_{i(+m)} \cdot x_{j(+m)} \]


Here $x_{i(+m)}$ denotes the value of the selected set of features along with the aforementioned feature m for the ith observation. Using the last two equations, we calculate the marginal classification power of a feature m as the change in the objective function of the SVM when m is included in S. We denote this value with the variable $\Delta J$. To paraphrase, the marginal classification power of a feature is the capability of a new feature to improve the classification accuracy of a set of selected features when the new feature is added to the already selected set. Formally, we compute $\Delta J(m)$ for all $m \in G - S$ as $\Delta J(m) = J(S \cup \{m\}) - J(S)$. BPFS sorts all the features $m \in G - S$ in descending order of $\Delta J(m)$. It considers the top t (t = 10 in our experiments) genes as possible candidates for the next round. Let C denote the set of these t genes. In the next steps, BPFS examines biological networks to find the most biologically meaningful feature in C.

3.1.3 Selecting the Best Candidate Gene

All the features in the candidate set C often have high marginal classification power. In this step, we distinguish the features in C by considering their interactions with the features in S (the set of features that are already selected). We hypothesize that if a feature in C is greatly influenced by the features in S, then that feature is redundant for S even if it has high marginal classification power. We discuss how we measure the influence of one feature on another next.

Consider the entire pathway as a graph, where all the genes are vertices and there is an edge between two vertices if they interact with each other. In this chapter, we do not consider any specific pathway such as the p53 signaling pathway, but rather a consolidation of all the available human signaling and regulatory pathways. If we knew which pathways are affected by the specific biological condition (such as cancer), we could select features only from those pathways. However, the available literature does not provide a comprehensive list of affected pathways most of the time. Thus, we create a consolidation of all the regulatory and signaling pathways.
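As a rough illustration of the graph view described above, the sketch below merges interaction lists into one adjacency map and counts the number of hops between two genes with breadth-first search. Treating every interaction as an undirected edge is an assumption, the function names `build_graph` and `hop_distance` are hypothetical, and the toy edge list in the usage note only reuses interactions named for Figure 3-1.

```python
from collections import deque
from typing import Dict, Iterable, Optional, Set, Tuple


def build_graph(edges: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Merge interaction pairs into an undirected adjacency map."""
    graph: Dict[str, Set[str]] = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def hop_distance(graph: Dict[str, Set[str]], src: str, dst: str) -> Optional[int]:
    """Number of edges on a shortest path, or None if no path exists."""
    if src not in graph or dst not in graph:
        return None
    if src == dst:
        return 0
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in graph[node]:
            if neighbor == dst:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None
```

For example, with the edges PKB/Akt-IKK, IKK-NFkB, PKB/Akt-BAD and BAD-Bcl-xl, `hop_distance` reports two hops from PKB/Akt to NFkB and one hop from PKB/Akt to BAD.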


There are different kinds of interactions such as activation and inhibition. If two genes do not have a common edge, but they are still connected by a path, it means that they interact indirectly through a chain of genes. For example, in Figure 3-1, RacGEF activates RAC and RAC activates NFkB. Thus, RacGEF indirectly activates NFkB through RAC. We, therefore, compute the distance between them as two. A higher number of edges on the path that connects two genes implies feebler influence.

An abnormally expressed gene does not necessarily imply that its neighbor will be abnormally expressed [44]. This is because the interaction between two genes is a probabilistic event. For example, in Figure 3-1, if RacGEF becomes aberrantly expressed, there is a probability that RAC is also expressed aberrantly. Let us denote this probability with the variable h. Similarly, if RAC is abnormally expressed, NFkB is abnormally expressed with a probability h. So, if RacGEF is overexpressed, NFkB can be overexpressed with a probability of $h^2$. This leads to the conclusion that as the number of hops increases the influence decreases exponentially. Thus, we use an exponential function to model this influence. To quantify influence we define a metric, termed Influence Factor (IF), between two genes $g_i$ and $g_j$ as

\[ IF(g_i, g_j) = \frac{1}{2^{d(g_i, g_j) - 1}}, \quad \text{provided } i \neq j. \]

Here $d(g_i, g_j)$ is the length of the shortest path that connects $g_i$ and $g_j$ on the pathway. To calculate the total influence on a candidate gene exerted by a selected set of genes, we calculate IF between the candidate gene and every gene in the selected set and sum the values. We call it the total influence factor (TIF) of a candidate gene g with respect to the set of already selected genes. Formally,

\[ TIF(g, S) = \sum_{s \in S} \frac{1}{2^{d(g, s) - 1}} \tag{3-2} \]

The term for s is taken to be zero if there is no path between g and s. For example, in Figure 3-1 consider PKB/Akt as a candidate gene and assume that S consists of two genes, NFkB and CASP9. CASP9 is one hop away from PKB/Akt.


So IF(PKB/Akt, CASP9) = 1. The shortest path from PKB/Akt to NFkB is of two hops through IKK. So IF(PKB/Akt, NFkB) = 0.5. Thus, TIF(PKB/Akt, {CASP9, NFkB}) = 1.5. BPFS calculates TIF for all the genes in C and selects the one that generates the lowest value of TIF. A low TIF value implies lower aggregate influence on the set of selected features S. To paraphrase, the gene with the lowest TIF interacts least with the genes in S. So, we select the gene that is biologically most independent from S.

The gene databases, like KEGG, are still evolving. Thus, many of the genes cannot be mapped from microarray data to these databases. In Section 3.2 we describe a probabilistic technique to handle this problem.

3.1.4 Exploration of Training Space

We have described the key components of our feature selection algorithm (BPFS) in the previous subsections. As a dataset consists of a comparatively small number of observations and a large feature set, BPFS is prone to overfitting. To avoid this problem, we propose a method that utilizes the training space efficiently. Let $D_T$ be the training data. We create K data subsets (K = 50 in our experiments) $D_{T_1}, D_{T_2}, \ldots, D_{T_K}$, each containing 80% of $D_T$ randomly sampled from it. We then run BPFS on each of them and get K sets of features. We store these K feature sets in a $K \times N$ matrix M, where the ith row contains the first N features obtained from $D_{T_i}$. Thus, $m_{ij}$ is the jth feature obtained from $D_{T_i}$. We use this matrix to rank all the features in the following fashion:

1. We assign a linearly decreasing weight across a row to emphasize the importance of the features that come first. More specifically, we assign a weight of $N - k$ to a feature that appears in the kth column of a row.

2. We sum the weights of the features over all the rows to determine the overall weight of the features in M. For example, assume that a feature appears in three rows of M, at (5, 3), (17, 14) and (29, 10), where the first number in each pair indicates the row and the second number indicates the column. Also, assume that we want to choose a total of 150 features. Then, the total weight of this feature is (150 - 3) + (150 - 14) + (150 - 10) = 423.
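The two weighting steps above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the original code: `rows` stands for the K ranked feature lists (the rows of M), columns are counted from 1 as in the text, and ties are broken by order of first appearance, which the text does not specify. The function names are hypothetical.

```python
from typing import Dict, Hashable, List, Sequence


def feature_weights(rows: Sequence[Sequence[Hashable]], n: int) -> Dict[Hashable, int]:
    """Sum the weight N - k over every row in which a feature appears.

    k is the 1-based column of the feature within the row; only the
    first n columns of each row are considered.
    """
    weights: Dict[Hashable, int] = {}
    for row in rows:
        for k, feature in enumerate(row[:n], start=1):
            weights[feature] = weights.get(feature, 0) + (n - k)
    return weights


def aggregate_rankings(rows: Sequence[Sequence[Hashable]], n: int) -> List[Hashable]:
    """Return the n features with the highest total weight."""
    weights = feature_weights(rows, n)
    # sorted() is stable, so equal weights keep first-appearance order.
    return sorted(weights, key=lambda feature: -weights[feature])[:n]
```

With N = 150, a feature found in columns 3, 14 and 10 of three different rows receives 147 + 136 + 140 = 423, which agrees with the worked example in the text.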


We pick the N features with the highest weight from our feature set. Weighing the features based on their positions helps us to prioritize the features that occur frequently and/or appear with high rank. We discuss the impact of the value of N in Section 3.2.

3.2 Data Set and Experiments

In this section we evaluate BPFS experimentally. We use multiple real microarray datasets instead of synthetically generated data, as synthetic data may not accurately model different aspects of real microarray data [45]. We observe that we can map only a small portion (25%) of the microarray entries to the KEGG regulatory pathway. Some of them do not take part in any single interaction. So the only information we have about them is their measured expression value in the microarray dataset. Due to this missing data problem it is difficult to quantify the implication of the biological pathway in our algorithm. To handle this problem we have conducted our experiments on two different kinds of information. In one case, we use the KEGG pathways as they are and use a randomized technique to handle the interactions with unresolved genes. In the other case, we map all the microarray genes to the KEGG pathway and assume that genes within a single pathway are fully connected and there is no common gene between two pathways. Still, we need to be careful while interpreting the results with the fully connected pathway, as it is only a simplistic view of the actual pathway. We cover the experiments with real pathways in Sections 3.2.2 through 3.2.6. In Section 3.2.7 we discuss the experiments with fully connected pathways.

In Section 3.2.1 we describe the experimental setup. In Section 3.2.2 we describe the randomization technique. We show the biological validity of our feature set by tabulating the supporting publications against every feature (Section 3.2.3). We compare our signature against van't Veer's [38] on four datasets (Section 3.2.4). We compare the testing accuracy of our method to that of I-RELIEF, a leading microarray feature selection method (Section 3.2.5). We conducted cross validation experiments where we extracted features from one dataset and tested their accuracy on another dataset in


Section 3.2.6. Finally, we executed BPFS and I-RELIEF on an idealistic fully connected pathway in Section 3.2.7.

3.2.1 Experimental Setup

Microarray data. In our experiments we used five breast cancer microarray datasets from the literature. We name these datasets BCR [41], JNCI [39], Lancet [40], CCR [42] and Nature [38] respectively, according to the names of the journals in which they were published. BCR, CCR and Lancet use the Affymetrix® GeneChip Human Genome U133 Array Set HG-U133A consisting of 24,481 entries. Nature has its own microarray platform with 24,481 entries. JNCI has the same platform as that of Nature, but it consists of a much smaller feature set of 1,145 entries. We removed the observations whose class labels were not defined. For the rest of the data points we created two classes depending on whether relapse of the disease happened within five years or not, counting from the time of the primary disease. The datasets Nature, JNCI, BCR, CCR and Lancet contain 97, 291, 159, 190 and 276 observations respectively.

Pathway data. We used the gene regulatory and signaling pathways of Homo sapiens in KEGG. We merged all the relevant pathway files to build a consolidated view of the entire pathway. The final pathway consists of 8,270 genes and 7,628 interactions. Clearly, some genes do not take part in any interaction.

Training and testing data. We randomly divided a microarray dataset (e.g. the BCR dataset) in a 4:1 ratio to create training and testing subsets. We kept the distribution of the two classes in the undivided dataset unchanged in the training and testing subsets. We collected features from the training dataset and tested the classification accuracy using those features on the test dataset. Now we elaborate on how we utilized the training space to select features. We created K subsets (typically K = 50 in our experiments) using bootstrapping from the training dataset. Each subset contains 80% of the samples of the training data. We selected features from each of those subspaces using


Section 3.1.1 to Section 3.1.3. Then we combined the K obtained sets of features using the method of Section 3.1.4.

Implementation and system details. We implemented our feature selection algorithm (BPFS) using MATLAB®. For SVM, we used the MATLAB® SVM Toolbox, a fast SVM implementation in C based on the sequential minimal optimization algorithm [46]. For the pathway analysis code we used Java™. We ran our implementation on a cluster of ten Intel Xeon 2.8 GHz nodes on Ubuntu Linux.

Availability of code. The implementation of the proposed method can be downloaded from http://bioinformatics.cise.ufl.edu/microp.html

Figure 3-2. We plot the real data points $R_{P_{10}}, \ldots, R_{P_{100}}$ along with the constructed function f(s) against s, the fraction of the current pathway. We extrapolate f(s) up to s = 500. f(s) converges around 180.

3.2.2 Pathways with Unresolved Genes

To calculate the influence factor (IF) we need to calculate the number of hops between two genes on the pathway. This requires a mapping of those microarray entries to pathway genes. However, as some of the microarray entries are not complete


genes and biological pathway construction is not yet finished [79], we cannot map all microarray entries. We denote all the unmapped genes as unresolved genes. For example, the Affymetrix® microarray HG-U133A contains 24,481 entries. We are able to map only 6,500 entries to KEGG (Kyoto Encyclopedia of Genes and Genomes). For example, in Figure 3-1 we draw four rectangles with dotted lines that correspond to four microarray entries in the Affymetrix® platform. As they do not have an Entrez Gene identification number, we cannot associate them with any pathway. Hence, these are unresolved genes.

Table 3-1. List of publications supporting the relevance to cancer of the first twenty features obtained from the BCR dataset.

Gene      Supporting publications
KCNK2     Acute lymphoblastic leukemia [47]
ZNF222    Breast cancer [48]
P2RY2     Human lung epithelial tumor [49], non-melanoma skin cancer [50], thyroid cancer [51]
SLC2A6    Human leukemia [52]
CD163     Breast cancer [53], human colorectal cancer [54]
HOXC13    Acute myeloid leukemia [55, 56]
PCSK6     Breast cancer [57], ovarian cancer [58]
AQP9      Leukemia [59]
PYY       Peptide YY (PYY) is a naturally occurring gut hormone with mostly inhibitory actions on multiple tissue targets [60]
KLRC4     KLRC4 is a member of the NKG2 group that are expressed primarily in natural killer (NK) cells [61]
CYP2A13   Lung adenocarcinoma [62]
GRM2      Metastatic colorectal carcinoma [63]
PHOX2B    Neuroblastoma [64]
ASCL1     Prostate cancer [65], lung cancer [66]
PKD1      Polycystin-1 induced apoptosis and cell cycle arrest in G0/G1 phase in cancer cells [67]. PKD1 inhibits cancer cell migration and invasion via the Wnt signaling pathway in vitro [68]. Gastric cancer [69]
ANGPT4    Gastrointestinal stromal tumor, leiomyoma and schwannoma [70], renal epithelial and clear cell carcinoma [71]
PSMB1     Breast cancer [72]
RUNX1     Gastric cancer [73, 74], ovarian cancer [75], classical tumor suppressor gene [76]
CD1C      Prostate cancer [77]
ZNF557    Myeloid leukemia [78]


Our preliminary experiments suggest that unresolved genes represent a large fraction of the genes in set C (more than 60% on average). To estimate the TIF of the unresolved genes, we develop a probabilistic model. Let C be the set of candidate genes and S be the set of selected genes in an iteration of BPFS. While calculating TIF for a $g \in C$ we consider two cases:

1. THE CANDIDATE GENE IS RESOLVED. Assume that g is resolved. Let $Q \subseteq S$ be the set that contains all the unresolved features in S. Let $R = S - Q$ be the set of resolved features in S. Let p be the expected influence of a gene $q \in Q$ on gene g if all genes were mapped and the pathway construction was complete. We discuss how we estimate the value of p later in this section. Then the expected number of genes from Q having influence on g is $TIF(g, Q) = p \cdot |Q|$, where $|Q|$ is the number of genes in Q. So the Total Influence Factor becomes:

\[ TIF(g, S) = TIF(g, Q) + TIF(g, R) = p\,|Q| + \sum_{s \in R} 2^{\,1 - d(g, s)} \tag{3-3} \]

2. THE CANDIDATE GENE IS UNRESOLVED. When g is unresolved we consider it as a special case of Case 1. As the connectivity between g and all genes in S is unknown, we estimate TIF as:

\[ TIF(g, S) = p\,|S| \tag{3-4} \]

In summary, to handle the unresolved genes we augment the biological pathway based selection with a probabilistic model and replace the TIF definition of Section 3.1.3, Equation (3-2), by Equation (3-3). Among the genes in candidate set C we select the gene with the smallest TIF computed in this way. Now, we describe how we derive the value of p in Equations (3-3) and (3-4).

To derive the value of p we propose the following approach. It is reasonable to assume that there are many missing interactions in the currently available pathway databases, since missing interactions are continually being discovered. Let us denote the present incomplete pathway graph by $P_I$ and the hypothetical complete pathway by $P_C$. Assume that $P_C$ contains z times more interactions than $P_I$. From $P_I$ we estimate p as a function of z and the expected number of the genes in $P_I$ as follows. We, first, build a graph $P_I$ as described in Section 3.3 from the KEGG database. We,


then, randomly delete edges of $P_I$ and create the subgraphs $P_{10}, P_{20}, \ldots, P_{100}$ of $P_I$ containing 10%, 20%, ..., 100% of the edges of $P_I$. For each of these subgraphs we calculate the average number of vertices reachable from a vertex. Formally, let V denote the set of vertices in $P_s$, where $s \in \{10, 20, \ldots, 100\}$. We denote the number of vertices reachable from g ($g \in V$) by R(g). Then, the average reachability of $P_s$ is:

\[ R_{P_s} = \frac{\sum_{g \in V} R(g)}{|V|} \tag{3-5} \]

where $|V|$ is the number of vertices in V. We calculate $R_{P_{10}}, \ldots, R_{P_{100}}$, the average reachability for all the subgraphs. We then construct the function f(s) that evaluates to $f(s) = R_{P_s}$. To construct f(s), we use a converging power series. We derive the values of its parameters using $R_{P_{10}}, \ldots, R_{P_{100}}$. To calculate the average reachability of the hypothetical pathway $P_C$ we use values of s greater than 100 in f(s). In Figure 3-2 we plot $R_{P_{10}}, \ldots, R_{P_{100}}$ along with the constructed function f(s) against s, the fraction of the current pathway. We observe that f(s) interpolates the values $R_{P_{10}}, \ldots, R_{P_{100}}$ accurately and that it converges at around s = 500 with a value of 180. Thus, we can conclude that the average reachability of $P_C$ is around 180. The probability that an unresolved gene has an interaction with a randomly chosen gene in S in $P_C$ is given by $p = R_{P_C} / |V|$, where $|V|$ is the number of nodes in the pathway. As the total number of human genes is close to 20,500 [80], we get 0.0088 as the value of p.

3.2.3 Biological Validation of Selected Features

We collected the list of publications that support the relevance to cancer of the first twenty features selected from BCR. Table 3-1 lists the publications and cancer types for each of the genes. To get the features, we created the training dataset as described in Section 3.2.1. We then trained BPFS on the training data and obtained a ranked set of features. We repeated this process ten times on BCR. We selected the first twenty features from each


rankings and merged them using the method described in Section 3.1.4 to get the final twenty features.

Figure 3-3. Comparison of test accuracy of our signature and van't Veer's 70-gene prognostic signature on the real pathway. Panels: A) BCR dataset, B) JNCI dataset, C) CCR dataset, D) Lancet dataset. For all the datasets our signature performs significantly better than their signature.

We found relevant publications for all twenty genes. We observed that four of them are directly responsible for breast cancer. Another seven genes are associated with breast cancer from the point of view of histology (two prostate, four gastric and one colorectal). The rest of them are related to other kinds of cancer such as ovarian cancer and lung cancer. In some cases a gene is involved in more than one kind of cancer. For example, ASCL1 is associated with both prostate cancer and lung cancer. We concluded that BPFS chooses the set of genes that are responsible for breast cancer and other kinds


of cancers. Hence, BPFS selects a biologically meaningful feature set that reduces the number of redundant features and improves generalization accuracy by selecting a more relevant feature set. In general, the above approach of combining features obtained from ten different runs may lead to selecting features from the entire dataset. We did it only for this experiment to filter out the biologically significant features from a dataset. For the remaining experiments we kept separate training and testing datasets.

Figure 3-4. The standard deviation of the accuracy of our method for different numbers of features. The x-axis denotes the number of features.

3.2.4 Comparison with Van't Veer's Gene Signature

In this section we compare our gene signatures to the breast cancer prognostic signature found by van't Veer et al. [38]. van't Veer et al. generated the 70-gene signature using a correlation based classifier on 98 primary breast cancer patients. van't Veer's 70-gene experiment [38] demonstrated that a genetic signature can have a much higher accuracy in predicting disease relapse free survival than clinical markers (50% vs 10%).


Figure 3-5. Comparison of test accuracy of our method (BPFS) to that of I-RELIEF on the real pathway. Panels: A) BCR dataset, B) JNCI dataset, C) Nature dataset, D) Lancet dataset, E) CCR dataset. In three datasets, JNCI, BCR and Nature, our method performs better than I-RELIEF. In Lancet and CCR both methods have similar accuracy.


In our experiment, we demonstrate that our method finds a better gene signature than the 70-gene signature for a particular dataset. We created training and testing data from the four datasets JNCI, Lancet, BCR and CCR as described in Section 3.2.1. From those four training datasets we created four sets of features using our method. We calculated the accuracy of the four sets of features using the corresponding four test datasets. Also, we computed the test accuracy using the 70-gene signature on the same four test datasets. Finally, we plotted the testing accuracies obtained from both sets of features in Figure 3-3. Figure 3-3 illustrates the results for up to 150 genes for both signatures on BCR, CCR and Lancet. We observe that for all four datasets our accuracy is better than that of the 70-gene signature. BPFS attains 18% better accuracy for the JNCI dataset. From this we conclude that BPFS finds a better gene signature for all the datasets.

3.2.5 Comparison with I-RELIEF

We compare the accuracy of our method to that of I-RELIEF [34]. I-RELIEF is a non-linear feature selection algorithm. It achieved significantly higher accuracy than van't Veer's 70-gene prognostic signature and standard clinical markers [81]. We first created training and testing datasets from the given data as discussed in Section 3.2.1. We obtained the ordered feature set by training BPFS on the training data. We tested the quality of those features using the SVM classifier. We used an identical setup and identical datasets for I-RELIEF. We used 2.0 as the kernel width for I-RELIEF as recommended [34]. We repeated the experiments ten times on each dataset and present the average accuracy for different numbers of features.

Figure 3-4 plots the standard deviation of the accuracy of our method over these ten runs. For all of the datasets except Nature, the standard deviation is less than 0.1 (i.e., 10% accuracy). So we can conclude that our method is quite stable when we execute it over several subsets of data created from a single dataset.

Figure 3-5 compares our algorithm to I-RELIEF. We observe that for two datasets (BCR and JNCI), BPFS outperforms I-RELIEF for all numbers of features selected. BPFS shows its highest improvement (8%) over I-RELIEF on the JNCI dataset, at around 50 features. JNCI has a higher fraction of resolved genes (45% vs 25%). Thus, BPFS has a higher chance of selecting resolved genes. This implies less dependence on the probabilistic model, so the selection of genes is more accurate. From this observation, we expect that BPFS would produce better results once the missing links of the pathway are discovered. For the Nature dataset, BPFS produces better accuracy than I-RELIEF up to 130 features. BPFS has accuracy similar to that of I-RELIEF for the CCR and Lancet data. Our algorithm is based on a linear kernel, which is in general more appropriate when the number of features is much higher than the number of samples. On the other hand, I-RELIEF employs a non-linear kernel [43]. It is possible that the distribution of the Lancet data works better with the type of kernel that I-RELIEF uses. We could potentially improve the classification accuracy of BPFS by using a non-linear kernel. We observe that for all the datasets our method reaches its highest accuracy at around 50-70 features. We conclude that these 50-70 features constitute the most important set of genes associated with breast cancer.

3.2.6 Cross Validation Experiments

We conduct several cross validation experiments in which we generate a set of features on one dataset and validate its quality by testing it on another dataset. For this cross validation, we limit ourselves to datasets from the same microarray platform as the one used to generate the feature set. For example, we test the BCR dataset's features on Lancet, as the microarray platforms on which they were generated are the same. The main reason for doing so is that the sets of genes used in two different microarrays can be different. Even for the same gene, they may use different parts of the genomic sequence. Thus, inter-platform validation may not be representative of the actual generalization. Table 3-2 displays the results of our cross validation experiments.

Table 3-2. Accuracy (in percent) of our algorithm obtained from cross validation experiments on the real pathway. The feature set obtained from one dataset is tested against another dataset. We always chose target data from the same class of microarray in order to avert cross-platform problems. The cross validation results imply that the feature set generated by BPFS provides satisfactory performance across datasets without significant loss of accuracy. We also compare our method with a trimmed version where we skip step 3.3. The complete version of the algorithm (with pathway) is indicated by the superscript 1, while the trimmed version (without pathway) is denoted by the superscript 2.

Test     Feature          Number of features
data     data               5   10   20   40   60   80  100  120  140
BCR      BCR^1             65   65   72   72   73   75   74   74   73
         CCR^1             63   61   61   63   66   68   66   71   69
         CCR^2             64   62   65   63   64   65   66   65   68
         Lancet^1          57   59   58   63   61   61   65   65   67
         Lancet^2          55   55   63   62   58   64   66   69   70
         70-Gene-Sig^1     62   59   59   66   67   68   70   73   71
CCR      CCR^1             70   70   72   73   77   77   76   77   78
         BCR^1             60   63   67   65   65   66   68   66   66
         BCR^2             53   65   65   66   61   67   69   68   65
         Lancet^1          57   58   62   60   66   69   68   70   71
         Lancet^2          58   58   61   65   68   73   78   75   75
         70-Gene-Sig^1     53   52   51   55   65   66   66   66   64
Lancet   Lancet^1          58   58   61   61   62   62   62   64   63
         BCR^1             54   57   54   56   55   56   59   60   63
         BCR^2             55   59   57   55   56   55   54   57   53
         CCR^1             55   56   54   55   59   56   59   58   59
         CCR^2             61   62   64   56   61   60   60   60   59
         70-Gene-Sig^1     56   55   53   57   58   60   57   61   61
Nature   Nature^1          68   65   65   65   72   70   72   70   66
         JNCI^1            71   71   65   61   59   64   60   57   61
         JNCI^2            67   70   63   60   66   74   73   74   72

Figure 3-6. Comparison of the test accuracy of our method (BPFS) to that of I-RELIEF on the fully connected pathway. Panels: (A) BCR dataset, (B) JNCI dataset, (C) Nature dataset, (D) Lancet dataset, (E) CCR dataset. In three datasets (JNCI, BCR and CCR) our method performs better than I-RELIEF. In Lancet and Nature, I-RELIEF has better accuracy.

Figure 3-7. Comparison of the test accuracy of our method (BPFS) to the accuracy obtained when we select the genes based only on marginal classification power. Panels: (A) BCR dataset, (B) JNCI dataset, (C) Nature dataset, (D) Lancet dataset, (E) CCR dataset.

We use two versions of our algorithm, one with the pathway information and one without, to observe the contribution of including the pathway information in our algorithm. In Table 3-2 we denote the version of our method with pathway by the superscript 1 and the version without pathway by the superscript 2 on the name of the feature dataset. Also, to establish the relevance of our signature on a standard benchmark, we compare our signatures with van't Veer's 70-gene signature in the context of cross validation. For example, in Table 3-2 the BCR dataset is cross validated with three feature sets, obtained from Lancet, CCR and the 70-gene signature. We observe that on average our signatures perform better than the 70-gene one. For the CCR data, both the Lancet and BCR features generate better accuracy. For the Lancet data, the accuracies obtained using the CCR and BCR features are similar to that of the 70-gene signature. For the BCR data, our signatures outperform the 70-gene signature for up to 80 features. Beyond that, the 70-gene signature has better accuracy. To sum up, when testing with features generated from a different dataset, we get better accuracy than with van't Veer's prognostic signature.

Regarding the comparison of the two versions of our algorithm (with and without the pathway information), it is difficult to reach a conclusive decision. For example, when we extract the feature set from the CCR dataset and cross validate it on the BCR dataset, the algorithm without pathway information does better than the other for up to 20 features, but for larger numbers of features the algorithm with pathway information provides better accuracy. When we compare the accuracy obtained with features from a different dataset to that obtained with the dataset's own feature set, we observe that, on average, the drop in accuracy is not more than 6%. In some extreme cases the drop can be higher. Specifically, when we cross validate with features extracted from Lancet, the accuracy is lower.

3.2.7 Experiments on Fully Connected Pathways

In this section we describe the experiments with an idealized, fully connected pathway. We evaluate our method on an idealized pathway in which we map all the genes, including the unresolved genes, into the KEGG pathways. Each unmapped gene becomes a singleton pathway with only a single member gene. For the other pathways, which are already listed in the KEGG database, we assume that all the genes within a pathway are fully connected and that there is no connection between two pathways. If we consider each pathway as the smallest indivisible module of interactions, then all the genes within a pathway coexpress in a similar fashion. We compare the accuracy of BPFS on this pathway with that of I-RELIEF. In Figure 3-6 we see that for the BCR, JNCI and CCR datasets, BPFS has better accuracy. For BCR there is an improvement of around 5% when BPFS uses around 40 features. For JNCI, the improvement is over 10% with around 70 features. For CCR there is a 3% improvement at 30 features. For the Nature dataset, I-RELIEF has slightly (around 1.5%) better accuracy up to 100 features. For the Lancet dataset, I-RELIEF performs better than BPFS. So, we observe that for three datasets BPFS has better accuracy, and for one dataset the accuracy is almost comparable.

3.2.8 Contribution of Pathway Information

In this section we describe the experiments that we conducted to understand the contribution of pathway information to the classification accuracy of our algorithm. In one version we use the complete algorithm as it is; in the other version we select the features based only on the marginal classification power and skip the next step. We observe that the contribution of the biological network is not very decisive. For some datasets and some feature-set sizes the complete version of the method generates better accuracy; for others the trimmed version produces more accurate results. For example, for the BCR dataset the complete method has up to 6% higher accuracy,

whereas for the Nature dataset the method with pathway generates better results for 50-100 features. The reason behind this fluctuation in accuracy might be that the reconstruction of gene regulatory and signaling networks is still in progress. Among almost 22,000 Affymetrix® transcripts, we could map only 3,300 genes that take part in at least one KEGG pathway. Even for those genes, the pathway construction is not complete. We hope that our algorithm can generate better accuracy in the near future, when a more comprehensive pathway structure becomes available.

3.3 Discussion

In this chapter we considered the feature selection problem for a classifier on cancer microarray data. Instead of using the expression level of a gene as the sole feature selection criterion, we also considered its relation with other genes on the biological pathway. Our objectives were to develop an algorithm for finding a set of biologically relevant features and to reduce the number of redundant genes. The key contributions of the chapter are:

- We proposed a new feature selection method that leverages biological pathway information along with classification capabilities to reduce the redundancy in gene selection.
- We proposed a probabilistic solution to handle the problem of unresolved genes that currently cannot be mapped from the microarray to the biological pathway.
- We presented a new framework for utilizing the training subspace that improves the quality of the feature set.

Our algorithm improves the quality of the features in a biologically motivated way: it excludes features on the basis of their total influence factor, and it includes genes that are far apart in the biological network yet still have high marginal classification power. Thus, we believe that instead of selecting a close set of genes as features, our method identifies biologically important key features for a significant number of pathways. We demonstrated the biological significance of our feature set by tabulating the relevant publications. We also established the quality of our feature sets by cross validating them on other datasets and

comparing them against van't Veer's 70-gene prognostic signature. Our experiments showed that our method performs better than I-RELIEF, the best published method available.

CHAPTER 4
MODELING PERTURBATIONS USING GENE NETWORKS

A significant set of microarray experiments measure gene expressions in the presence of external perturbations [82, 83]. In these perturbation experiments, radiation [84], drugs [14] or other biological conditions are administered to tissues and the responses are monitored using microarrays. The expressions of the genes before perturbation correspond to control data, while the expressions of the genes after perturbation correspond to non-control data [85]. A fraction of genes respond to the external perturbation by changing their expression values significantly between the control and non-control groups. Such genes are called differentially expressed (DE) genes [86]. The remaining genes, without noticeable changes in their expressions, are called equally expressed (EE) genes. The DE genes that are directly affected by the external perturbation [19] are denoted as primarily affected genes. The rest of the DE genes change their expressions due to interactions with primarily affected genes and with each other through signaling and regulatory networks [20, 87, 89]. We call them secondarily affected genes. In this chapter, the term gene networks is used to refer to signaling and regulatory networks. Figure 4-1 shows the state of the genes in the pancreatic cancer pathway after a hypothetical external perturbation is applied [90, 91]. In this figure, genes K-Ras, Raf and Cdc42/Rac are primarily affected, and MEK, Ral and RalGDS are secondarily affected through interactions.

We consider the problem of identifying the primarily affected genes in a perturbation experiment where gene expressions are provided before and after the application of the perturbation for a set of samples. We assume that the underlying gene network can be modeled as a directed graph where each node represents a gene, and a directed edge from gene $a$ to gene $b$ represents a genetic interaction (e.g., $a$ activates or inhibits $b$). We define two genes as neighbors of each other if they share an edge. For example,

in Figure 4-1, genes K-Ras and Raf are neighbors, as K-Ras activates Raf. A neighbor can be classified as incoming or outgoing if it is at the start or at the end of the directed edge, respectively. In Figure 4-1, Raf is an incoming neighbor of MEK and MEK is an outgoing neighbor of Raf.

Figure 4-1. Illustration of the impact of a hypothetical external perturbation on a small portion of the pancreatic cancer pathway. The pathway is taken from the KEGG database. The solid rectangles denote the genes affected directly by the perturbation; the dashed rectangles indicate genes secondarily affected through interactions. The dotted rectangles denote the genes that are not affected by the perturbation. → implies activation and ⊣ implies inhibition. In this figure, genes K-Ras, Raf and Cdc42/Rac are primarily affected, and MEK, Ral and RalGDS are secondarily affected through interactions.

Contributions. We propose a new probabilistic method to find the primarily affected genes in a microarray dataset. Our method employs gene networks as a prior belief in a Bayesian framework. When the expression level of a gene alters, it can affect some of its outgoing neighbors. Thus, the expression of a gene can change due to the external perturbation or because of one or more of its affected incoming neighbors. We build our solution on this observation.

Let $G = \{g_1, g_2, \ldots, g_M\}$ denote the set of all genes. We represent the external perturbation by a hypothetical gene (i.e., a metagene) $g_0$ in the gene network. An edge from the metagene to every other gene implies that the external perturbation has the potential to affect all the other genes. So, $g_0$ is an incoming neighbor of all the other genes. We call the resulting network the extended gene network. Our method estimates the probability that a gene $g_j$ is DE due to an alteration in the activity of gene $g_i$ ($g_i \in G \cup \{g_0\}$, $g_j \in G$) if there is an edge from $g_i$ to $g_j$ in the extended network. We use a Bayesian model in our solution with

the help of a Markov Random Field (MRF) to capture the dependency between the genes in the extended gene network. We optimize the pseudo-likelihood of the joint posterior distribution over the random variables in the MRF using Iterated Conditional Modes (ICM) [92]. The optimization provides us with the state of the genes and the pairwise causality among the genes, including the metagene. Our experiments on both real and synthetic datasets demonstrate that our method can find primarily affected genes with high accuracy. We compared our method with SSEM and Student's t-test and obtained significantly higher accuracy in finding the primarily differentially expressed genes.

The rest of the chapter is organized as follows. In Section 4.1 we describe our method in detail. In Section 4.2 we discuss the experiments and results. Finally, in Section 4.3 we describe our key conclusions for this chapter.

4.1 Methods

In this section we describe our method. Section 4.1.1 presents the notation. Section 4.1.2 provides an overview of our solution. Section 4.1.3 discusses the modeling of the MRF-based prior distribution. Section 4.1.4 describes how we formulate a tractable approximate version of the objective function. Section 4.1.5 discusses how we compute the likelihood of the expression of a gene. Section 4.1.6 explains how we optimize the objective function.

4.1.1 Notation and Problem Formulation

We start by describing the notation used in the rest of this chapter and provide a formal definition of the problem. We use two types of variables to model this problem, namely observed and hidden. The values of the observed variables are available in the given dataset. We estimate the hidden variables from the observed data.

Observed variables. There are two sets of observed variables.

MICROARRAY DATASET. We denote the number of microarray samples and the number of genes by $N$ and $M$ respectively. We represent the set of all genes in

the dataset with $G = \{g_1, g_2, \ldots, g_M\}$. For each gene $g_i$, the dataset contains the expressions before and after the perturbation (i.e., control and non-control). We denote the expressions of $g_i$ by $y_{ij}$ and $y'_{ij}$ in the control and non-control groups respectively ($1 \le i \le M$, $1 \le j \le N$). Let $y_i = \{y_{i1}, y_{i2}, \ldots, y_{iN}\}$ and $y'_i = \{y'_{i1}, y'_{i2}, \ldots, y'_{iN}\}$ denote the expression values of $g_i$ in the control and non-control groups respectively. We use $Y_i$ to denote all the data for gene $g_i$ in the control and non-control groups (i.e., $Y_i = y_i \cup y'_i$). $Y = \bigcup_{i=1}^{M} Y_i$ represents the collection of the entire gene expression data.

NEIGHBORHOOD VARIABLES. We use $W = \{W_{ij}\}$ to represent whether any two genes $g_i$ and $g_j$ are neighbors. $W_{ij}$ ($1 \le i, j \le M$) is set to 1 if $g_i$ is an incoming neighbor of $g_j$ (i.e., $g_j$ has an incoming edge from $g_i$ in the extended gene network) and to 0 otherwise.

Hidden variables. There are two sets of hidden variables.

STATE VARIABLES. Each gene $g_i$ can attain one of two states (i.e., DE or EE). We introduce the variables $S = \{S_i\}$ to indicate the states of the genes. Formally, $S_i$ is DE if $g_i$ is differentially expressed and EE if $g_i$ is equally expressed.

INTERACTION VARIABLES. We define the set of random variables $X = \{X_{ij}\}$ to represent the joint state of genes $g_i$ and $g_j$ ($0 \le i \le M$, $1 \le j \le M$). Formally,

$$X_{ij} = \begin{cases} 1 & \text{if } S_i = \mathrm{DE} \text{ and } S_j = \mathrm{DE}; \\ 2 & \text{if } S_i = \mathrm{DE} \text{ and } S_j = \mathrm{EE}; \\ 3 & \text{if } S_i = \mathrm{EE} \text{ and } S_j = \mathrm{DE}; \\ 4 & \text{if } S_i = \mathrm{EE} \text{ and } S_j = \mathrm{EE}. \end{cases}$$

It is evident that the value of $X_{ij}$ depends on the values of the two independent variables $S_i$ and $S_j$. Note that the values of $X_{ij}$ are categorical in nature.

Problem formulation. We have the microarray expression data $Y$ and the gene network $\{G, W\}$ as input to the problem. From now on, the gene network $\{G, W\}$ will be referred to as $V$. We would like to estimate the posterior density $p(X_{ij} \mid X - X_{ij}, Y, V, W_{ij} = 1)$. Specifically, a lower value of $p(X_{0j} = 2 \mid X - X_{0j}, Y, V, W_{0j} = 1)$ indicates a higher chance that the gene $g_j$ is primarily affected, as $X_{0j} = 2$ indicates that the metagene is DE and gene $g_j$ is EE. Based on this probability estimate, we create a list of primarily affected genes.
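The notation above can be made concrete with a small sketch. It is only an illustration under stated assumptions: the function names `extended_network` and `interaction_variables`, the dictionary representation of $W$, and the three-gene toy network are hypothetical and not part of our implementation.

```python
DE, EE = "DE", "EE"

# Joint-state encoding of X_ij, following the case definition above.
JOINT_STATE = {(DE, DE): 1, (DE, EE): 2, (EE, DE): 3, (EE, EE): 4}


def extended_network(edges, num_genes):
    """Return W as a dict {(i, j): 1} for the extended gene network.

    `edges` holds directed pairs (i, j) with 1 <= i, j <= num_genes.
    The metagene g0 (index 0) receives an edge to every gene.
    """
    w = {(i, j): 1 for (i, j) in edges}
    for j in range(1, num_genes + 1):
        w[(0, j)] = 1
    return w


def interaction_variables(w, states):
    """Compute X_ij for every pair with W_ij = 1 from the gene states S."""
    return {(i, j): JOINT_STATE[(states[i], states[j])] for (i, j) in w}


# Hypothetical toy network: g1 -> g2 -> g3, with g1 and g2 in state DE.
w = extended_network([(1, 2), (2, 3)], num_genes=3)
states = {0: DE, 1: DE, 2: DE, 3: EE}  # the metagene is assumed to be DE
x = interaction_variables(w, states)
print(x)
```

In this toy case the pair (g0, g3) receives $X_{03} = 2$, the value that the problem formulation associates with the metagene being DE while $g_3$ is EE.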

4.1.2 Overview of Our Solution

One approach to solving our problem is to maximize a likelihood distribution over the gene expression $Y$, where $X$ are the parameters of the distribution. The objective is then to obtain the maximum likelihood estimate (MLE) of $X$. However, there are two problems with this approach. First, MLE requires a large number of data points to estimate the parameters accurately. Second, MLE depends only on the observed data and cannot utilize domain-specific knowledge; as a result it can lead to overfitting of the data and poor generalization.

We develop a Bayesian framework for estimating $X$ that addresses the above-mentioned limitations. Bayesian approaches can generally estimate the parameters with fewer data points, which makes our approach more suitable for perturbation experiments [93]. We estimate the probability of $X_{ij}$ given the other observed and hidden variables. In this approach, we aim to maximize the joint density of the $X$ variables given the gene expressions $Y$ and the gene network $V$. Thus, the objective to maximize is given by

$$P(X \mid Y, V, \theta_Y, \theta_X) = \frac{P(Y \mid X, V, \theta_Y)\, P(X \mid V, \theta_X)}{\sum_{X} P(Y \mid X, V, \theta_Y)\, P(X \mid V, \theta_X)} \qquad (4)$$

$\theta_Y$ is the set of parameters for the likelihood function $P(Y \mid X, V, \theta_Y)$ and $\theta_X$ is the set of parameters for the prior density function $P(X \mid V, \theta_X)$. $\theta_X$ and $\theta_Y$ will be discussed in Sections 4.1.3 and 4.1.5 respectively. Since a direct optimization of Equation 4 is impractical due to the exponential number of terms in the denominator, we define a more tractable objective function, as discussed in Section 4.1.4. We use iterated conditional modes (ICM) to optimize the objective function and obtain an assignment of $X$, $\theta_X$ and $\theta_Y$. Finally, we estimate the posterior probability $p(X_{ij} \mid X - X_{ij}, Y, \theta_X, \theta_Y)$ for every $X_{ij}$ with $W_{ij} = 1$. Using this posterior probability, we quantify the chance that a gene is DE due to one of its incoming neighbors.
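To illustrate why the denominator of the posterior is the bottleneck, the following sketch normalizes an unnormalized score by brute-force enumeration. It is a generic illustration, not our method: `posterior_by_enumeration` and `toy_score` are hypothetical names, and `toy_score` is a stand-in for the product of likelihood and prior rather than the model of this chapter.

```python
from itertools import product


def posterior_by_enumeration(num_vars, values, unnormalized):
    """Normalize `unnormalized(assignment)` over all joint assignments.

    Returns (posterior, num_terms). num_terms equals len(values) ** num_vars,
    the number of summands in the denominator, which grows exponentially
    with the number of variables.
    """
    assignments = list(product(values, repeat=num_vars))
    scores = [unnormalized(a) for a in assignments]
    total = sum(scores)
    posterior = {a: s / total for a, s in zip(assignments, scores)}
    return posterior, len(assignments)


def toy_score(assignment):
    """Hypothetical score that favors equal values in adjacent positions."""
    agreements = sum(a == b for a, b in zip(assignment, assignment[1:]))
    return 2.0 ** agreements


posterior, num_terms = posterior_by_enumeration(3, (1, 2, 3, 4), toy_score)
print(num_terms, posterior[(1, 1, 1)])
```

With only three four-valued variables the sum already has 64 terms; for the thousands of interaction variables in a real gene network such enumeration is out of reach, which motivates the approximation of Section 4.1.4.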

Figure 4-2. Illustration of a small hypothetical gene network with a perturbation and the Markov random field graph created for that gene network. Panels: (A) perturbation experiment, (B) Markov random field graph. (a) A small hypothetical gene network with a perturbation. The circle represents the abstraction of the external perturbation, i.e., the metagene $g_0$. Rectangles denote genes. → implies activation and ⊣ implies inhibition. The dotted arrows from $g_0$ indicate its potential effect on each gene. The directly impacted DE genes $g_1$ and $g_3$ are denoted by solid rectangles. The dashed rectangles $g_4$ and $g_5$ are secondarily impacted DE genes. The dotted rectangle is the EE gene $g_2$. (b) The Markov random field graph created from the hypothetical gene network in (a). For each neighbor pair we create a circular node. We create two rectangular nodes that do not correspond to any neighbor pair; however, they are part of the MRF graph. Two nodes are connected with an undirected edge if they share a subscript at the same position and the two genes corresponding to the other subscript interact in the gene network. For example, nodes $X_{04}$ and $X_{14}$ are connected, as they share 4 at the second position and $g_0$ is an incoming neighbor of $g_1$.

4.1.3 Computation of the Prior Density Function

In this section, we describe how we build the prior density function $P(X \mid \theta_X)$. We incorporate information from biological networks as prior belief in this density function. The following two assumptions encapsulate our belief about gene interactions.

- Each gene can affect the expressions of its outgoing neighbors. If the activity of a gene is altered, the effect can propagate to its outgoing neighbors.
- The metagene $g_0$ (i.e., the external perturbation) can affect the expression of every other gene. This is easy to visualize, as an external perturbation such as radiation can change the activity of any of the genes.

Clearly, when the data do not follow one or more of the hypotheses, the optimization function can overcome the prior belief given strong support from the data.

In order to compute the prior density function, we define a Markov Random Field (MRF) over the $X$ variables [94]. An MRF is a probabilistic model in which the state of a variable depends only on the states of its neighbors. An MRF is useful for modeling our problem, as the states of genes depend on their neighbors. Here, the MRF is an undirected graph $\Omega = (X, E)$, where the variables $X = \{X_{ij}\}$ represent the vertices of the graph (i.e., each interaction variable $X_{ij}$ corresponds to a vertex). We denote the set of edges by $E = \{(X_{ij}, X_{pj}) \mid W_{pi} = W_{ij} = 1\} \cup \{(X_{ij}, X_{ik}) \mid W_{jk} = W_{ij} = 1\}$. Thus, two variables in $X$ share an edge if they share a common subscript at the same position and the two genes corresponding to the other subscript interact in the gene network (this is a necessary condition). For example, in Figure 4-2B, $X_{35}$ and $X_{25}$ are neighbors, as they share 5 (i.e., gene $g_5$) as the second subscript and $g_2$ and $g_3$ interact in the gene network in Figure 4-2A.

One important point to note is that this graph does not use the state variables $S$ to model the dependencies between the genes. Rather, it establishes those dependencies over the $X$ variables. For example, in Figure 4-2B we draw the MRF graph corresponding to the hypothetical gene network in Figure 4-2A. In the gene network, there is an edge from $g_2$ to $g_3$. So, $g_2$ can potentially change the state of $g_3$. We create an edge between $X_{12}$ and $X_{13}$ that corresponds to the edge from $g_2$ to $g_3$. As $g_1$ is common to $X_{12}$ and $X_{13}$, if they assume the same value (i.e., $X_{12} = X_{13}$), it implies that the genes $g_2$ and $g_3$ are in the same state (i.e., $S_2 = S_3$). We formulate these dependency constraints using a set of unary and binary functions called feature functions. We discuss these feature functions next.

We denote the neighbors of $X_{ij}$ in the MRF graph as $\mathcal{N}(X_{ij}) = \{X_{pj} \mid W_{pi} = 1\} \cup \{X_{ik} \mid W_{jk} = 1\}$. We define a clique over each $X_{ij}$ and its neighbors $\mathcal{N}(X_{ij})$ by $C_{ij}$, provided $W_{ij} = 1$. A feature function $f(C_{ij})$ is a Boolean function defined over the cliques $C_{ij}$ of $\Omega$. This

function evaluates to 1 if it is satisfied and to 0 otherwise. We define a potential function $\psi(C_{ij})$ corresponding to $f(C_{ij})$ as an exponential function given by $\exp(\gamma f(C_{ij}))$. Here $\gamma$ is a coefficient associated with $f(C_{ij})$ that represents the relevance of $f(C_{ij})$ in the MRF. According to the Hammersley-Clifford theorem, we express the joint density function of the MRF over $X$ as a product of the potential functions defined over that MRF, $p(X \mid \theta_X) = \frac{1}{\Delta} \prod_{C_{ij},\, W_{ij} = 1} \psi(C_{ij})$ [95]. In this formulation, $\Delta$ is the normalization function $\Delta = \sum_{X} \prod_{C_{ij},\, W_{ij} = 1} \psi(C_{ij})$. To limit the complexity of our model, we consider only cliques of size one and two. We define four feature functions to capture the dependencies among the variables in $X$ according to the two hypotheses. Based on the number of input variables, they are classified as unary and binary feature functions.

Unary feature functions. A primary component of the prior density function is modeling the frequency of $X_{ij}$ itself. We capture this frequency using two unary feature functions defined over singleton cliques. We define a feature function $F_1(X_{ij})$ which returns 1 when $X_{ij} = 1$ and 0 otherwise. To capture the complementary events, we define another feature function $F_2(X_{ij})$, which returns 1 when $X_{ij} \neq 1$ and 0 otherwise.

Table 4-1. Truth values of the two binary feature functions. Only the permitted entries are annotated with 0/1. Entries marked "-" correspond to combinations that are not possible. (a) $f_3(X_{ij}, X_{pj})$ represents the feature function for left equality. (b) $f_4(X_{ij}, X_{ik})$ represents the feature function for right equality.

(a) f3(X_ij, X_pj)
            X_pj
X_ij     1   2   3   4
1        1   -   0   -
2        -   1   -   0
3        0   -   1   -
4        -   0   -   1

(b) f4(X_ij, X_ik)
            X_ik
X_ij     1   2   3   4
1        1   0   -   -
2        0   1   -   -
3        -   -   1   0
4        -   -   0   1
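The entries of Table 4-1 follow mechanically from the definition of $X_{ij}$: left equality compares $S_i$ with $S_p$ for two variables that share $g_j$, and right equality compares $S_j$ with $S_k$ for two variables that share $g_i$. The sketch below regenerates both truth tables; the names `DECODE`, `unary_f1`, `f3`, `f4` and `truth_table`, and the use of `None` for impossible combinations, are illustrative choices rather than part of our implementation.

```python
DE, EE = "DE", "EE"

# Decoding of X_ij into (S_i, S_j), following the definition of X_ij.
DECODE = {1: (DE, DE), 2: (DE, EE), 3: (EE, DE), 4: (EE, EE)}


def unary_f1(x_ij):
    """Unary feature: 1 when X_ij = 1 (both genes DE), otherwise 0."""
    return 1 if x_ij == 1 else 0


def f3(x_ij, x_pj):
    """Left equality: 1 if S_i == S_p; None if the pair disagrees on S_j."""
    (s_i, s_j), (s_p, s_j_other) = DECODE[x_ij], DECODE[x_pj]
    if s_j != s_j_other:
        return None  # impossible combination ("-" in Table 4-1)
    return 1 if s_i == s_p else 0


def f4(x_ij, x_ik):
    """Right equality: 1 if S_j == S_k; None if the pair disagrees on S_i."""
    (s_i, s_j), (s_i_other, s_k) = DECODE[x_ij], DECODE[x_ik]
    if s_i != s_i_other:
        return None  # impossible combination ("-" in Table 4-1)
    return 1 if s_j == s_k else 0


def truth_table(feature):
    """Rows are X_ij = 1..4; columns are the second argument = 1..4."""
    return [[feature(a, b) for b in range(1, 5)] for a in range(1, 5)]


for name, feature in (("f3", f3), ("f4", f4)):
    print(name)
    for row in truth_table(feature):
        print(" ".join("-" if v is None else str(v) for v in row))
```

Deriving the table from the state decoding, rather than hard-coding it, keeps the feature functions consistent with the encoding of $X_{ij}$ if that encoding is ever changed.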

Binary feature functions. These feature functions are defined to incorporate the two assumptions stated at the beginning of this section. Consider a sequence of four genes $g_1$, $g_2$, $g_3$ and $g_5$ in Figure 4-2A. $X_{23}$ is a variable in the MRF graph that depends on the states of $g_2$ and $g_3$. $X_{13}$ is a neighbor of $X_{23}$ in the MRF graph, as $g_1$ is an incoming neighbor of $g_2$ in the gene network. Similarly, $X_{25}$ is a neighbor of $X_{23}$, as $g_5$ is an outgoing neighbor of $g_3$. If $S_1$ equals $S_2$, then $X_{23} = X_{13}$. Similarly, if $S_3$ equals $S_5$, then $X_{23} = X_{25}$. We capture these events in two feature functions for $X_{ij}$, based on the incoming neighbors of $g_i$ and the outgoing neighbors of $g_j$.

LEFT EQUALITY. Let us denote the incoming neighbors of $g_i$ by $In(g_i)$. We write a feature function $f_3(X_{ij}, X_{pj})$, $\forall p$, $g_p \in In(g_i)$. $f_3(X_{ij}, X_{pj}) = 1$ if $S_i = S_p$ and $W_{pi} = W_{ij} = 1$. Otherwise, $f_3(X_{ij}, X_{pj}) = 0$. We denote the summation of this function over all the incoming neighbors of $g_i$ as $F_3(X_{ij}) = \sum_{p,\, W_{ij} = 1,\, W_{pi} = 1} f_3(X_{ij}, X_{pj})$.

RIGHT EQUALITY. Let us denote the outgoing neighbors of $g_j$ by $Out(g_j)$. We define a feature function $f_4(X_{ij}, X_{ik})$, $\forall k$, $g_k \in Out(g_j)$. $f_4(X_{ij}, X_{ik}) = 1$ if $S_k = S_j$ and $W_{jk} = W_{ij} = 1$. Otherwise, $f_4(X_{ij}, X_{ik}) = 0$. We denote the summation of this function over all the outgoing neighbors of $g_j$ as $F_4(X_{ij}) = \sum_{k,\, W_{ij} = 1,\, W_{jk} = 1} f_4(X_{ij}, X_{ik})$.

Table 4-1 enumerates the truth values of the binary feature functions for different values of their arguments. Only the permitted entries are annotated with zero and one. The other entries would require an illegal combination of argument values. In the binary feature functions, $X_{pj}$ or $X_{ik}$ may not represent any interaction from the extended gene network, namely when $W_{pj} = 0$ or $W_{ik} = 0$, respectively. We represent such variables by rectangles in Figure 4-2B. Based on these feature functions, we define the joint density function of $X$ as

$$p(X \mid \theta_X) = \frac{1}{\Delta} \exp\Big( \sum_{i,\, j,\, W_{ij} = 1,\, k \in \{1, 2, \ldots, 4\}} \gamma_k F_k(X_{ij}) \Big) \qquad (4)$$

In the above equation, $\gamma_k$, $k \in \{1, 2, \ldots, 4\}$, are the coefficients of the four feature functions in the MRF. In the next section, we discuss how we define the objective function with respect to the MRF. We also describe how we formulate the posterior probability density function for $X_{ij}$.

4.1.4 Approximation of the Objective Function

A direct maximization of the objective function given by Equation 4 is impractical, as it requires the evaluation of an exponential number of terms in the denominator. We employ the pseudo-likelihood as an established substitute for Equation 4 [96]. The pseudo-likelihood is the simple product of the conditional probability density functions of the $X_{ij}$ variables. Geman et al. proved the consistency of the maximum pseudo-likelihood estimate [97]. The approximated objective function can be written as

$$F = \arg\max_{X} \Big( \prod_{i,j} F_{ij} \Big) \qquad (4)$$

We derive the posterior density function $F_{ij}$ of $X_{ij}$ when $W_{ij} = 1$ as

$$F_{ij} = p(X_{ij} \mid X - X_{ij}, Y, \theta_X, \theta_Y, W_{ij} = 1) = \frac{p(Y_i, Y_j \mid X_{ij}, \mathcal{N}(X_{ij}), \theta_Y, W_{ij} = 1)\; p(X_{ij} \mid X - X_{ij}, Y, \theta_X, W_{ij} = 1)}{\sum_{X_{ij} \in \{1, \ldots, 4\}} p(Y_i, Y_j \mid X_{ij}, \mathcal{N}(X_{ij}), \theta_Y, W_{ij} = 1)} \qquad (4)$$

There are two different terms in the objective function of Equation 4. $p(X_{ij} \mid X - X_{ij}, Y, \theta_X, W_{ij} = 1)$ stands for the conditional prior density function of $X_{ij}$, which can be derived from

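Equation 4 using Bayes' rule. We discuss the likelihood function, $p(Y_i, Y_j \mid X_{ij}, \mathcal{N}(X_{ij}), \theta_Y, W_{ij} = 1)$, in the next section.

4.1.5 Calculation of Likelihood Density Function

In this section, we describe how we derive the likelihood function. We assume that the gene expressions in a group follow a normal distribution. The derivations can be redone if the gene expressions follow some other distribution. Consider a set of measurements $z_i = \{z_{i1}, z_{i2}, \ldots, z_{iN}\}$ for a gene $g_i$ that follow a single Gaussian distribution. We denote the latent mean of $z_i$ by $\mu$ and the standard deviation by $\sigma$. As different genes can have different average expressions, we assume that $\mu$ follows a genome-wide normal distribution with mean $\mu_0$ and standard deviation $\tau$ [98]. Thus, the likelihood of the data points in that group is given by

$$L(z \mid \mu_0, \sigma^2, \tau^2) = \int \Big[ \prod_{i=1}^{n} N(z_i \mid \mu, \sigma^2) \Big] N(\mu \mid \mu_0, \tau^2)\, d\mu = \frac{\sigma}{(\sqrt{2\pi}\,\sigma)^{n} \sqrt{n\tau^2 + \sigma^2}} \exp\Big( -\frac{\sum_i z_i^2}{2\sigma^2} - \frac{\mu_0^2}{2\tau^2} \Big) \exp\Bigg( \frac{\frac{\tau^2 n^2 \bar{z}^2}{\sigma^2} + \frac{\sigma^2 \mu_0^2}{\tau^2} + 2 n \bar{z} \mu_0}{2 (n\tau^2 + \sigma^2)} \Bigg) \qquad (4)$$

where, within this equation, $z_1, \ldots, z_n$ denote the $n$ measurements in the group and $\bar{z}$ is their mean. The reader can find the derivation of Equation 4 in Demichelis et al. [99]. If a gene is DE, its expression measurements in the control and non-control groups follow separate distributions. On the other hand, for equally expressed genes, all the measurements in both groups share the same mean. The joint data likelihood for a gene is given by

$$L(g_i) = \begin{cases} L(y_i \mid \mu_0, \sigma^2, \tau^2)\, L(y'_i \mid \mu_0, \sigma^2, \tau^2), & \text{if } S_i = \mathrm{DE}, \\ L(y_i \cup y'_i \mid \mu_0, \sigma^2, \tau^2), & \text{if } S_i = \mathrm{EE}. \end{cases} \qquad (4)$$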
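As a sanity check on the closed form above, the following sketch evaluates it in log space and compares it with a direct numerical integration over the latent mean. The function names, the midpoint integration rule, the parameter values and the expression values are all illustrative assumptions; none of them are taken from our MATLAB and Java implementation.

```python
import math


def log_marginal_likelihood(z, mu0, sigma, tau):
    """Closed-form log L(z | mu0, sigma^2, tau^2), latent mean integrated out."""
    n = len(z)
    zbar = sum(z) / n
    s2, t2 = sigma ** 2, tau ** 2
    log_prefactor = (math.log(sigma)
                     - n * math.log(math.sqrt(2 * math.pi) * sigma)
                     - 0.5 * math.log(n * t2 + s2))
    first = -sum(v * v for v in z) / (2 * s2) - mu0 ** 2 / (2 * t2)
    second = ((t2 * n ** 2 * zbar ** 2 / s2 + s2 * mu0 ** 2 / t2
               + 2 * n * zbar * mu0) / (2 * (n * t2 + s2)))
    return log_prefactor + first + second


def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def numeric_marginal(z, mu0, sigma, tau, steps=20000, width=12.0):
    """Midpoint-rule approximation of the same integral over mu."""
    lo, hi = mu0 - width * tau, mu0 + width * tau
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        mu = lo + (k + 0.5) * h
        total += math.prod(normal_pdf(v, mu, sigma) for v in z) * normal_pdf(mu, mu0, tau)
    return total * h


def log_gene_likelihood(y, y_prime, state, mu0, sigma, tau):
    """log L(g_i): separate groups if the gene is DE, pooled data if EE."""
    if state == "DE":
        return (log_marginal_likelihood(y, mu0, sigma, tau)
                + log_marginal_likelihood(y_prime, mu0, sigma, tau))
    return log_marginal_likelihood(list(y) + list(y_prime), mu0, sigma, tau)


# Hypothetical expression values for one gene (illustrative only).
control = [5.1, 4.9, 5.0, 5.2]
treated = [7.0, 7.2, 6.9, 7.1]
params = dict(mu0=6.0, sigma=0.3, tau=2.0)

closed = math.exp(log_marginal_likelihood(control, **params))
numeric = numeric_marginal(control, **params)
log_de = log_gene_likelihood(control, treated, "DE", **params)
log_ee = log_gene_likelihood(control, treated, "EE", **params)
print(closed, numeric, log_de, log_ee)
```

Working in log space avoids numerical underflow when many samples are multiplied together, and in this toy example the clearly shifted non-control group makes the DE likelihood larger than the EE likelihood.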

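Now we are ready to derive the joint likelihood distribution for the different values of $X_{ij}$. Let us denote the set of parameters $\{\mu_0, \sigma, \tau\}$ by $\theta_Y$. The likelihood of $(Y_i, Y_j)$ has four different forms, one for each of the four values that $X_{ij}$ can assume. We derive it only for $X_{ij} = 1$, as the derivations for the other values of $X_{ij}$ are similar.

$$p(Y_i, Y_j \mid X_{ij} = 1, \mathcal{N}(X_{ij}), \theta_Y, W_{ij} = 1) = \sum_{\alpha_i, \alpha_j \in \{\mathrm{DE}, \mathrm{EE}\}} p(Y_i, Y_j \mid S_i = \alpha_i, S_j = \alpha_j, \theta_Y, W_{ij} = 1)\; p(S_i = \alpha_i, S_j = \alpha_j \mid X_{ij} = 1, \mathcal{N}(X_{ij}), \theta_Y, W_{ij} = 1) \qquad (4)$$

From the definition of $X_{ij}$, $p(S_i = \alpha_i, S_j = \alpha_j \mid X_{ij} = 1, \mathcal{N}(X_{ij}), \theta_Y)$ equals 1 when $S_i = \mathrm{DE}$ and $S_j = \mathrm{DE}$. Its value is zero for all other values of $S_i$ and $S_j$. So, continuing from the last step of Equation 4,

$$\begin{aligned} p(Y_i, Y_j \mid X_{ij} = 1, \mathcal{N}(X_{ij}), \theta_Y, W_{ij} = 1) &= p(Y_i, Y_j \mid S_i = \mathrm{DE}, S_j = \mathrm{DE}, \theta_Y, W_{ij} = 1) \\ &= p(Y_i \mid S_i = \mathrm{DE}, S_j = \mathrm{DE}, \theta_Y)\; p(Y_j \mid S_i = \mathrm{DE}, S_j = \mathrm{DE}, \theta_Y) \\ &= p(Y_i \mid S_i = \mathrm{DE}, \theta_Y)\; p(Y_j \mid S_j = \mathrm{DE}, \theta_Y) \\ &= L(g_i)\, L(g_j) \end{aligned}$$

In a similar way, we can derive the likelihood functions for the other three values of the $X_{ij}$ variable. A special case arises when $g_i$ is the metagene, i.e., $g_0$. Specifically, $L(g_0) = 1$ if $S_0 = \mathrm{DE}$ and 0 otherwise, since, according to our assumption, the metagene is always DE.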
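The four cases of the pair likelihood, including the metagene special case, can be sketched as follows. The per-state likelihood values are hypothetical numbers chosen for illustration, and the helper names `pair_likelihood` and `normalized_likelihood` are ours; in practice the per-state values would come from $L(g_i)$ as defined above.

```python
DE, EE = "DE", "EE"

# Decoding of X_ij into (S_i, S_j), following the definition of X_ij.
DECODE = {1: (DE, DE), 2: (DE, EE), 3: (EE, DE), 4: (EE, EE)}

# Likelihood of the metagene g0, which is assumed to be always DE.
METAGENE = {DE: 1.0, EE: 0.0}


def pair_likelihood(x_ij, lik_i, lik_j):
    """p(Y_i, Y_j | X_ij) = L(g_i) L(g_j) for the states implied by X_ij.

    `lik_i` and `lik_j` map each state ("DE" or "EE") to the data
    likelihood L(g) of the corresponding gene under that state.
    """
    s_i, s_j = DECODE[x_ij]
    return lik_i[s_i] * lik_j[s_j]


def normalized_likelihood(lik_i, lik_j):
    """Normalize the four pair likelihoods so that they sum to one."""
    raw = {x: pair_likelihood(x, lik_i, lik_j) for x in DECODE}
    total = sum(raw.values())
    return {x: v / total for x, v in raw.items()}


# Hypothetical per-state likelihood values for two genes (illustrative only).
lik_g1 = {DE: 0.8, EE: 0.2}
lik_g2 = {DE: 0.3, EE: 0.6}

nl_12 = normalized_likelihood(lik_g1, lik_g2)
nl_01 = normalized_likelihood(METAGENE, lik_g1)
print(nl_12)
print(nl_01)
```

Because the metagene has zero likelihood in the EE state, the values $X_{0j} = 3$ and $X_{0j} = 4$ always receive zero weight, so only the DE/EE status of $g_j$ remains to be decided for pairs that involve $g_0$.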

4.1.6 Objective Function Optimization

So far, we have described how we compute the posterior density function. The final challenge is to find the values of the hidden variables that maximize the objective function (Equation 4). We develop an iterative algorithm to address this challenge.

In our model we have three different sets of parameters. The nodes of the MRF, given by $X$, constitute one set. The other two sets are the parameters of the conditional probability density function of $X_{ij}$ and of the likelihood function of the observed data, given by $\theta_X = \{\gamma_1, \ldots, \gamma_4\}$ and $\theta_Y = \{\mu_0, \sigma, \tau\}$, respectively. In each iteration, we first estimate $\theta_X$ and $\theta_Y$ based on the value of $X$ estimated in the previous iteration. Next, based on the estimated parameters, we estimate the $X$ that maximizes the objective function in Equation 4. The likelihood function is non-convex in terms of the parameters $\theta_Y = \{\mu_0, \sigma, \tau\}$. Also, the conditional density is non-convex in terms of $\theta_X = \{\gamma_1, \ldots, \gamma_4\}$. We use a global optimization method called differential evolution to optimize both of them [100]. To optimize the objective function in Equation 4, we employ the ICM algorithm described by Besag [92]. Briefly, our iterative algorithm works as follows.

1. Obtain an initial estimate of the $S$ variables. In our implementation we use Student's t-test, assuming that the data follow a normal distribution. We use a significance level of 5% for this purpose.

2. Estimate the parameters $\theta_Y$ that maximize the data likelihood function, given by $\arg\max_{\theta_Y} \prod_{X_{ij}} p(Y_i, Y_j \mid X_{ij}, \mathcal{N}(X_{ij}), \theta_Y, W_{ij} = 1)$. We implement this step using Differential Evolution, which is similar to a genetic algorithm.

3. Calculate an estimate of the parameters $\theta_X$ that maximize the conditional prior density function, given by $\arg\max_{\theta_X} \prod_{X_{ij}} p(X_{ij} \mid X - \{X_{ij}\}, \theta_X, W_{ij} = 1)$. We also implement this step using Differential Evolution.

4. Carry out a single cycle of ICM using the current estimates of $S$, $\theta_X$ and $\theta_Y$. For every $S_i$, maximize $\prod_{X_{mn}} p(X_{mn} \mid X - X_{mn}, Y, \theta_X, \theta_Y, W_{mn} = 1)$, where $X_{mn} \in \{X_{rt} \mid r = i \text{ or } t = i\}$ and $W_{rt} = 1$.

5. Go to step 2, for a fixed number of cycles or until $X$ converges.

We optimize the objective function in terms of the $S_i$ ($1 \le i \le M$) variables instead of the $X_{ij}$ variables. Specifically, in step 4, we go over all the $S_i$ variables and optimize the $F_{ij}$ function (given by Equation 4) for only those $X_{ij}$ variables that are impacted by a change of $S_i$. The optimization procedure is guaranteed to converge, since the value of the objective function does not decrease from one iteration to the next. We continue the iterative process until the changes in the parameter estimates between two consecutive iterations fall below a certain cutoff level.

4.2 Experiments

In this section we discuss the experiments we conducted to evaluate the quality of our method. We implemented our method in MATLAB® and Java™. We obtained an implementation of Differential Evolution from http://www.icsi.berkeley.edu/~storn/code.html. We compared our method with SSEM [20], as SSEM is one of the most recent methods that can be used to solve the problem considered in this chapter. We obtained SSEM from http://gardnerlab.bu.edu/SSEMLasso. We ran our code on an AMD Opteron 2.4 GHz workstation with 4 GB of memory.

Dataset. We use the dataset collected by Smirnov et al. [101]. It was generated by applying 10 Gy of ionizing radiation to immortalized B cells obtained from 155 members of 15 Centre d'Étude du Polymorphisme Humain (CEPH) Utah pedigrees [102]. Microarray snapshots were obtained at hour 0 (i.e., before the radiation) and at 2 and 6 hours after the radiation. We adapt this time series data to create the control and non-control data for our experiments. We use the data before radiation as control data. For the non-control data, we calculate the expected expression of a gene at each time point after the radiation. We select the one with the higher absolute difference from the expected

expression of the control data for that gene. This dataset is used in the experiments described in Sections 4.2.1 and 4.2.2. For the experiments described in Sections 4.2.3 and 4.2.4, we derive new datasets from this data. The details of this process can be found in the corresponding sections. We also collect 24,663 genetic interactions from the 105 regulatory and signaling pathways of the KEGG database [103]. Overall, 2,335 genes belong to at least one pathway in KEGG. We consider only the genes that take part in the gene networks in our model.

Table 4-2. List of the top 25 genes that are most affected by the external perturbation in the real data adapted from Smirnov et al.

PGF       IL8RB     FOSL1     F2R       PPM1D
MDM2      CDKN1A    TNC       PLXNB2    EPHA2
DDB2      TP53I3    PLK1      TNFSF9    ADRB2
MAP3K12   JUN       SORBS1    LRDD      SDC1
MYC       PRKAB1    EI24      DDIT4     FAS

4.2.1 Evaluation of Biological Significance

In this section, we investigate the support in the existing literature for the susceptibility to radiation-based perturbation of the primarily affected genes found by our method. We train our method on the dataset described above. After the optimization we rank each gene $g_j$ in decreasing order of $L(g_0) L(g_j)$, where $L(g_j)$ is given by Equation 4. We tabulate the top 25 genes in Table 4-2. Nine of the ten highest ranked genes have significant biological evidence that they are impacted by radiation. Imaoka et al. [104] compared gene expression between normal mammary glands and spontaneous and γ-radiation induced cancerous glands of the rat. The PGF (placental growth factor) gene showed differential expression in both spontaneous and irradiated carcinomas. Nagtegaal et al. [105] applied radiation to human rectal adenocarcinoma and compared the gene response to that of normal tissues. The cytokines and the receptor IL8RB showed differential expression between normal and irradiated rectal tissues. Amundson et al. [106] administered γ-radiation to the p53 wild-type ML-1 human myeloid cell line. FOSL1 (known as FRA1 at that time) showed

differential expression as the stress response. Lin et al. [107] applied ionizing radiation to human lymphoblastoid cells. F2R, a coagulation factor II receptor, was upregulated in that experiment. Jen et al. [108] investigated the effect of ionizing radiation on the transcriptional response of lymphoblastoid cells in time series microarray experiments. PPM1D, a gene related to DNA repair, showed a response to both 3 Gy and 10 Gy of radiation. Wu et al. [109] conducted a high-dose UV radiation experiment to observe the relation between the MDM2 gene and the p53 gene. Their experiment revealed that initially both the protein and mRNA levels of MDM2 increase in a p53-independent manner, which clearly substantiates the direct effect of radiation on MDM2. Jakob et al. [110] irradiated human fibroblasts with accelerated lead ions. Confocal microscopy revealed a single, bright focus of CDKN1A protein in the nuclei of human fibroblasts within 2 minutes after radiation. Rieger et al. [111] applied both ultraviolet and ionizing radiation to fifteen human cell lines and observed that PLXNB2 was up-regulated for both kinds of radiation. Zhang et al. [112] reported that EPHA2 worked as an essential mediator of UV-radiation induced apoptosis. This experiment demonstrates that there is sufficient support in the existing literature that the top ranked genes found by our method (i.e., those highly likely to be primarily affected) are affected by radiation.

4.2.2 Evaluation of the Rankings of Neighbor Genes

Recall that our goal is to find the primarily affected genes. We achieve this objective by computing the probability for each DE gene to contribute towards the change in the expression of its outgoing neighbors. In this experiment, we evaluate our success in terms of how accurately we rank the contribution probabilities of the genes, as discussed in the next paragraph.

We divide the dataset of 155 samples into training and testing sets in a 2:1 ratio. We create a ranked list for each DE gene as follows. For each DE gene, we sort its incoming DE neighbors in decreasing order of their data likelihood probability with respect to


the outgoing neighbor.

Figure 4-3. Frequency of average distance of rankings over training and testing data. The figure shows that the difference is very close to zero. This suggests that our method can rank the probabilistic effect of the incoming neighbors of the genes with great precision. The average difference between the ranks obtained in the training and the testing data is less than one position in 92.7% of the cases.

For example, let us assume $g_1$ is DE. It has four incoming DE neighbors $g_2$, $g_3$, $g_4$ and $g_0$, where $g_0$ is the metagene. Let $NL_{ij}$ denote the normalized likelihood function of $X_{ij}$,

$$NL_{ij} = \frac{p(Y_i, Y_j \mid X_{ij} = 1, \mathcal{X}_{ij}, \theta_Y)}{\sum_{X_{ij} \in \{1,2,3,4\}} p(Y_i, Y_j \mid X_{ij}, \mathcal{X}_{ij}, \theta_Y)}.$$

For instance, if $NL_{01} \geq NL_{41} \geq NL_{21} \geq NL_{31}$, then the sorted list is $\{g_0, g_4, g_2, g_3\}$. We refer to the sorted list as a ranking of the incoming DE neighbors. Let us denote the position of a gene $g_i$ in the ranking of $g_j$ for the training data by $\pi_{g_j}(g_i)$. We create another set of rankings from the testing data likelihood probability. Let us denote the position of $g_i$ in the ranking of $g_j$ from the testing data by $\pi'_{g_j}(g_i)$. For a gene $g_j$ we define the average ranking distance between training and testing data as

$$\delta(g_j) = \frac{\sum_{g_i \in IN(g_j)} \mathrm{abs}\left(\pi_{g_j}(g_i) - \pi'_{g_j}(g_i)\right)}{|IN(g_j)|},$$

where $IN(g_j)$ is the set of incoming DE neighbors of $g_j$, $\mathrm{abs}(\cdot)$ denotes the absolute value and $|IN(g_j)|$ stands for the cardinality of $IN(g_j)$. We calculated the average ranking distance for all the genes that have incoming neighbors apart from the metagene. This experiment was repeated three times with


a different set of training and testing data. We create a histogram of the average differences from the three experiments in Figure 4-3. It shows that the difference in average ranking distance is very close to zero. The average difference between the ranks obtained in the training and the testing data is less than one position in 92.7% of the cases. Thus, we have demonstrated that we can accurately rank the contribution probabilities of incoming neighbors for DE genes in the test dataset based on the model parameters learned from the training dataset.

4.2.3 Comparison to Other Methods

In this section, we compare the accuracy of our method to that of SSEM and of a simpler method, Student's t test.

Synthetic data generation. We simulated real perturbation events to prepare synthetic data with known primarily and secondarily affected genes in a controlled setting. We use the gene network derived from KEGG. First, we select a random gene from the network and denote it as a primarily affected DE gene. We traverse the ancestors in a breadth-first manner. We make each ancestor a secondarily affected DE gene with a probability of $1 - (1 - q)^{\eta}$, where $\eta$ is the number of incoming DE neighbors. Here $q$ (0.4 in our experiments) is the probability that a gene is DE due to a DE predecessor. We repeat these steps to create the desired number of primarily affected genes. After the classification of the genes we create control and non-control data for each of them for N patients. We use the control part of the real dataset in Smirnov et al. [101] as the control part of our synthetic dataset. To generate the non-control dataset, we traverse each of the genes that participate in the gene networks. Suppose, for a gene $g_i$, the mean and standard deviation of its expression in the control dataset are given by $\mu_i$ and $\sigma_i$ respectively. If the gene is EE, we generate its non-control data points from the normal distribution given by the parameters $(\mu_i, \sigma_i^2)$. If the gene is DE, we use the same variance as that of the control group. However, we use a different


mean. For the primarily and secondarily affected genes we use $\mu'_i = \mu_i \pm d_p$ and $\mu'_i = \mu_i \pm d_s$ respectively, where $d_p > d_s$.

Figure 4-4. Comparison of our method to SSEM and t-test. (A) Gap = 0.2σ. (B) Gap = 0.6σ. The number of primarily affected genes is 50. The gap between the means of primarily affected and secondarily affected genes ranges from 0.2σ to 0.6σ, where σ is estimated from the real dataset. The figures indicate that our method outperforms SSEM and t-test.

Experimental setup. Given an input dataset, we ranked all the genes using each of the three methods. According to each method, highly ranked genes have a higher chance of being primarily affected. We explain how we do the ranking in the following.

OUR METHOD. We sort the genes in decreasing order of joint likelihood with the metagene. A higher joint likelihood implies a higher chance of being primarily affected.

SSEM. We train SSEM on the control dataset, where it learns the correlation between the genes. We test SSEM on the non-control dataset, where it produces a ranking for every single data point.

STUDENT'S T TEST. We used the function called ttest2 from MATLAB. We apply it on every individual gene, where it takes the control and non-control datasets as input and produces a p-value as output. By default, the null hypothesis is that the differences of the two input datasets are a random sample from a normal distribution with mean 0 and unknown variance, against the alternative that the mean is not 0. Thus, the null hypothesis corresponds to the assumption that the gene is EE. So a


substantially lower p-value implies a higher chance of being primarily affected. We performed the test on all the genes and ranked them in increasing order of p-value.

Let us denote the set of primarily affected genes by $PG$ and the first $k$ elements of the ranking by $RG_k$. We define the sensitivity of the ranking at position $k$ by

$$\rho_k = \frac{|PG \cap RG_k|}{|PG|}.$$

Thus, a higher value of $\rho_k$ denotes a higher sensitivity. We prepare a sensitivity vector $\{\rho_1, \rho_2, \ldots, \rho_{|R|}\}$ by arraying the sensitivity of a ranking at all the positions of the ranks. Here, $|R|$ denotes the cardinality of the ranking. For SSEM we obtain a sensitivity vector for every data point in the non-control dataset. We create a consolidated sensitivity vector by averaging them.

Results. We conducted experiments for $(d_s - d_p)/\sigma \in \{0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.5, 1.75\}$, number of primarily affected genes $\in \{10, 50\}$ and number of data points $\in \{10, 20, 40, 60, 80, 100, 125, 155\}$. Here, $\sigma$ corresponds to the standard deviation of the expressions of genes in the dataset. However, due to space limitations we discuss only two of them in this chapter (see Figure 4-4). The results we discuss correspond to the case when we have 50 primarily affected genes and 155 data points. The results of the other experiments are similar to those in Figure 4-4B. Figures 4-4A and 4-4B show the sensitivity of the three methods when $(d_s - d_p) = 0.2\sigma$ and $0.6\sigma$ respectively. The former corresponds to the computationally harder case, as the difference between the control and non-control datasets is small. As the gap between $d_s$ and $d_p$ increases, identifying primarily affected genes becomes easier. From the figure, we observe that our method is consistently and significantly more sensitive than the other two methods for all datasets. It reaches high sensitivity (more than 90%) using the top 150 ranked genes when the gap is small, and using the top 50 genes as the gap increases to $0.6\sigma$. The results were similar for larger gap values (results not shown). The t test reaches around 40% and 50% sensitivity at ranking position 200


respectively. SSEM's sensitivity is below 0.25 for all experiments, even within the top 200 positions. We believe that there are two major factors behind the improved results of our method. First, our method successfully incorporates the gene interactions, while the other methods ignore this information. Second, our method is capable of dealing with a broad range of numbers of primarily affected genes, while the performance of the other methods deteriorates as this number grows. In real perturbation experiments, often multiple genes are primarily affected. Thus, we conclude that our method is more suitable for real perturbation experiments.

Figure 4-5. Comparison of accuracies with SSEM and Student's t test while varying the ratio of gaps of primarily and secondarily affected genes. For a category of gene, the gap denotes the absolute difference of average expressions in control and non-control groups. The x-axis represents the ratio of gaps of primarily and secondarily affected genes. The y-axis denotes the accuracy of our method as described in Section 4.2.4. The figure demonstrates that our method obtains very high accuracy except when the ratio equals zero, i.e., the gap is equal for both the primarily and secondarily affected genes.
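The sensitivity vector used in this comparison can be illustrated with a short sketch. It is a minimal Python illustration of the definition $\rho_k = |PG \cap RG_k| / |PG|$, not the code used in the experiments; the function names `sensitivity_vector` and `average_sensitivity` and the toy gene labels are hypothetical.

```python
def sensitivity_vector(ranking, primary_genes):
    """Return [rho_1, ..., rho_|R|], where rho_k = |PG & RG_k| / |PG|.

    ranking       -- list of genes, most likely primarily affected first
    primary_genes -- the known set PG of primarily affected genes
    """
    primary = set(primary_genes)
    if not primary:
        raise ValueError("primary_genes must not be empty")
    hits = 0
    vector = []
    for gene in ranking:
        # RG_k grows by one gene per position; count how many are in PG.
        if gene in primary:
            hits += 1
        vector.append(hits / len(primary))
    return vector


def average_sensitivity(vectors):
    """Element-wise mean of equally long sensitivity vectors.

    This mirrors the consolidation step described for SSEM, which yields
    one ranking (and hence one vector) per non-control data point.
    """
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


# Toy usage with hypothetical gene labels.
toy_ranking = ["g3", "g1", "g7", "g2"]
toy_vector = sensitivity_vector(toy_ranking, {"g1", "g2"})
```

Because the hit count never decreases along the ranking, every sensitivity vector is non-decreasing and ends at 1 when the ranking contains all genes in PG.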


4.2.4 Sensitivity to the Gap between Primary and Secondary Effects

The experiments over the real dataset suggest the validity of our model. One question, however, follows from these experiments: how does our method perform when we vary the distinction between primarily and secondarily affected genes in terms of their gap between control and non-control data? To answer this question we conducted experiments on synthetic datasets, where we change the differences between primarily and secondarily affected genes and compare the accuracy of our method with that of SSEM and Student's t test.

Synthetic data generation. We generate the data in the presence of a hypothetical perturbation to simulate the real dataset. The primarily and secondarily affected genes are determined in the way described in Section 4.2.3. To utilize the real dataset to the maximum possible extent, we employ the following approach. Let us denote the mean of gene $g_i$ in the control and non-control groups by $\mu_i$ and $\mu'_i$, respectively. We subtract the difference $(\mu'_i - \mu_i)$ from all the expressions in the non-control group of $g_i$. We repeat this subtraction for all the genes. Once the non-control group is leveled to the control group, we re-modify the non-control expressions of the DE genes. If a gene is primarily DE according to the decided set of genes, we increase or decrease its expression over the data points in the non-control group by $d_p$. Similarly, we modify the expression value by $d_s$ if the gene is secondarily affected. Here, $d_p$ is greater than $d_s$.

Results. We created three different sets of data by varying $d_p$ and $d_s$. For all the datasets the number of primarily affected genes was 40. For the three datasets, we used values of $d_p$ given by $\{0.8, 1.2, 1.6\}$, respectively. Within a dataset, $d_p$ was fixed and the ratio $d_s/d_p$ was varied over $\{0.1, 0.2, \ldots, 1.0\}$. We discuss only the result for the dataset with $d_p = 0.8$, as the results for the others are similar. The accuracy of the methods can fluctuate for different sets of affected genes. Hence, for a particular value of


$d_s$ and $d_p$ we repeated the experiment five times with different sets of affected genes and averaged the result. We run the three methods on all the datasets and extract rankings of genes as described in Section 4.2.3. A higher position in the ranking indicates a higher chance of being primarily differentially expressed. Let the set of true primarily affected genes be $PA$. Let $RG$ be the set of the first $|PA|$ genes from the ranking produced by a method, where $|PA|$ is the cardinality of $PA$. We define the accuracy of that method as $|PA \cap RG| / |RG|$. Figure 4-5 depicts the result of this experiment. It is clear that our method outperforms SSEM in all cases. The accuracy of our method is substantially better than that of Student's t test for all the cases except when the ratio $d_s/d_p$ equals one. From this experiment, we can conclude that our method performs very well over a wide range of differences between the non-control groups for primarily and secondarily affected genes. Specifically, for the case where these groups have the same mean, our method performs almost as well as the other methods.

4.3 Discussion

In this chapter, we considered the problem of identifying primarily affected genes in the presence of an external effect that can perturb the expressions of genes. We assumed that we were given the expression measurements of a set of genes before and after the application of an external perturbation. We developed a new probabilistic method to quantify the cause of differential expression of each gene. Our method considers the possible gene interactions in regulatory and signaling networks, for a large number of perturbations. It uses a Bayesian model with the help of Markov Random Fields to capture the dependency between the genes. It also provides the underlying distribution of the impact with confidence intervals. Our experiments on both real and synthetic datasets demonstrated that our method could find primarily affected genes with high accuracy. It achieved significantly better accuracy than two competing methods, namely SSEM and Student's t test.
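The leveling-and-shifting procedure of Section 4.2.4 and its accuracy measure can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the code used for the experiments: the function names are hypothetical, and the choice of a single random direction (increase or decrease) per gene is an assumption, since the text does not say how the direction is chosen.

```python
import random
import statistics


def level_and_shift(control, non_control, shift, rng):
    """Level one gene's non-control values to the control mean, then shift.

    control, non_control -- expression values of one gene
    shift                -- d_p for a primarily affected gene, d_s for a
                            secondarily affected gene, 0.0 for an EE gene
    rng                  -- random.Random instance (assumption: one random
                            direction per gene)
    """
    difference = statistics.mean(non_control) - statistics.mean(control)
    leveled = [value - difference for value in non_control]
    direction = rng.choice((-1.0, 1.0))  # increase or decrease
    return [value + direction * shift for value in leveled]


def accuracy(ranking, primary_genes):
    """|PA & RG| / |RG|, where RG is the first |PA| genes of the ranking."""
    primary = set(primary_genes)
    top = ranking[:len(primary)]
    return len(primary.intersection(top)) / len(top)


# Toy usage with made-up numbers.
rng = random.Random(0)
shifted = level_and_shift([1.0, 2.0, 3.0], [4.0, 5.0, 9.0], 0.8, rng)
```

After the call, the mean of `shifted` differs from the control mean by exactly the requested shift, which is the property the synthetic datasets rely on.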


CHAPTER 5
IDENTIFYING DIFFERENTIALLY REGULATED GENES

Microarray experiments often measure expressions of genes taken from sample tissues in the presence of external perturbations such as medication, radiation, or disease [82, 83]. Typically in such experiments, gene expressions are measured before and after the application of the external perturbation, and are called control data and non-control data, respectively. In this chapter, we focus on an important class of such microarray experiments that inherently have two groups of tissue samples. Different groups in a microarray measurement can arise in many different ways. For instance, samples can be taken from members of multiple closely related species (e.g., rat versus mouse). Within the same species there can be subgroups with different phenotypes (e.g., African American versus Caucasian American). Another example is when the samples have already been through several alternative external perturbations (e.g., fasting and not fasting). When such different groups exist, it is important to observe not only the overall changes in gene expression, but also how the different groups respond to the external perturbation. For example, Taylor et al. applied medication to 36 Caucasian American and 33 African American patients infected with Hepatitis C [47]. Gene expressions were collected before and after the medication.

In a perturbation experiment, some of the genes respond by noticeably changing their expression values between the control and non-control data. Genes that change their expressions in a statistically significant way are referred to as differentially expressed (DE), while those that do not are referred to as equally expressed (EE) genes. In the context of two groups, we refer to a gene that has the same state in both groups, i.e., either DE or EE for both groups, as an equally regulated (ER) gene. On the contrary, if a gene is DE in one group and EE in the other, we denote it as differentially regulated (DR).


Figure 5-1. Illustration of the impact of a hypothetical external perturbation on a small portion of the pancreatic cancer pathway. The pathway is taken from the KEGG database. The solid rectangles denote the genes affected directly by the perturbation, and the dashed rectangles indicate genes secondarily affected through the networks. The dotted rectangles denote the genes without any change in expression. → implies activation and ⊣ implies inhibition. In this figure, genes K-Ras, Raf and Cob42Roc are primarily affected, and MEK, Ral and RalGDS are secondarily affected through interactions.

Genes of any organism typically interact with each other via regulatory and signaling networks. For simplicity, we will refer to them as gene networks for the rest of this chapter. A small portion of an example gene network can be seen in Figure 5-1. Once an external perturbation is applied, a gene can be DE in one of two ways: as a direct effect of the perturbation, or via interaction with other DE genes through gene networks. We denote a gene as primarily affected DE if it is DE due to the external perturbation. Similarly, a gene is secondarily affected DE if it is DE due to another gene in the gene network. Figure 5-1 shows the state of the genes in the pancreatic cancer pathway after a hypothetical external perturbation is applied. In this figure, genes K-Ras, Raf and Cob42Roc are primarily affected, and MEK, Ral and RalGDS are secondarily affected through interactions.

Recall that for a gene to be DR, it has to be EE in one group and DE in the other group. For such a gene, if it happens to be DE in one group because of the external perturbation, we call it a primarily differentially regulated (PDR) gene. When it is DE in one group because of the interaction with other DE genes in the gene networks, we will


refer to it as a secondarily differentially regulated (SDR) gene. In this chapter, we consider the problem of identifying the PDR genes in a given set of control and non-control gene expressions from two groups of samples.

Our approach. In this chapter, we propose a new probabilistic Bayesian method, CMRF, to find the PDR genes in a two-group perturbation experiment dataset as defined above. We call this method CMRF (Comparative MRF) because it applies MRF to two groups of data for comparison purposes. Our Bayesian method incorporates information about relationships from gene networks as prior beliefs. We consider the gene network as a directed graph where each node represents a gene, and a directed edge from gene $g_i$ to gene $g_j$ represents a genetic interaction (e.g., $g_i$ activates or inhibits $g_j$). We define two genes as neighbors of each other if they share a directed edge. For example, in Figure 5-1, genes K-Ras and Raf are neighbors, as K-Ras activates Raf. We also classify a neighbor as incoming or outgoing if it is at the start or at the end of the directed edge, respectively. In Figure 5-1, Raf is an incoming neighbor of MEK and MEK is an outgoing neighbor of Raf. When the expression level of a gene is altered, it can affect some of its outgoing neighbors. Thus, the expression of a gene can change due to the external perturbation or because of one or more of its affected incoming neighbors. We represent the external perturbation by a hypothetical gene (i.e., metagene) $g_0$ in the gene network. We add an edge from the metagene to all the other genes because the external perturbation has the potential to affect all the other genes. So, $g_0$ is an incoming neighbor of all the other genes. We call the resulting network the extended gene network. CMRF estimates the probability that a gene $g_j$ is DR due to an alteration in the activity of gene $g_i$ ($\forall g_i \in G \cup \{g_0\}$, $g_j \in G$) if there is an edge from $g_i$ to $g_j$ in the extended network. We use a Bayesian model in our solution with the help of Markov Random Field (MRF) [94] to capture the dependency between the genes in the extended gene network. We define feature functions that encapsulate the domain knowledge available


from gene networks and gene expression data. CMRF optimizes the joint posterior distribution over the random variables in the MRF using Iterated Conditional Modes (ICM) [92]. The optimization provides the state of the genes, the regulation of the genes and the probabilistic estimate of pairwise interactions between the genes, including the metagene. Given this, we can rank the genes according to the data likelihood that a gene is DR because of the metagene $g_0$, and obtain a list of possible PDR genes. Figure 5-3 illustrates the different components of CMRF and the connectivity between them. Note that (C) corresponds to the Bayesian prior based on MRF.

Figure 5-2. A hypothetical gene network and the corresponding Markov random field graph. (a) A small hypothetical gene network with perturbation in two datasets $D_A$ and $D_B$. The genes in the two datasets interact through an identical network, although they assume different states. The circle $g_0$ represents the abstraction of the external perturbation. Rectangles denote genes. → implies activation and ⊣ implies inhibition. The potential effect of the metagene on all other genes is indicated by dotted arrows from the metagene to all the other genes. For example, $g_1$ is primarily affected in $D_A$, but not affected in $D_B$. $g_2$ is primarily affected in both datasets. $g_3$ is secondarily affected in both $D_A$ and $D_B$. (b) The Markov Random Field graph constructed from the small hypothetical gene network in (a). The numbers in parentheses are the expected assignments to the variables based on the states of the genes in (a). Nodes with dotted boundaries are required for completeness of the model; however, the corresponding interactions do not exist.


We compare the accuracy of CMRF with that of SSEM and Student's t test on a semi-synthetic dataset generated from the microarray data in Cosgrove et al. [20]. We also compare CMRF with our earlier method SMRF, which we developed to identify the primarily affected DE genes in single-group perturbation data [113]. CMRF obtains high accuracy and outperforms all three other methods. Also, we conduct a statistical significance test using a parametric noise-based experiment to evaluate the accuracy of CMRF. In this experiment our model demonstrates reasonable confidence regions for various values of the parameters.

The rest of the chapter is organized as follows. Section 5.1 describes our methods in detail. Section 5.2 presents the results of our experiments. Section 5.3 concludes our discussion.

5.1 Methods

In this section we describe the different components of CMRF. Section 5.1.1 describes the notation and formulates the problem. Section 5.1.2 provides a high-level overview of the solution. Section 5.1.3 describes the calculation of the prior density function of the MRF. Section 5.1.4 discusses the definition of a tractable objective function. Section 5.1.5 discusses the calculation of the likelihood function. Finally, Section 5.1.6 describes the algorithm used to optimize the objective function.

5.1.1 Notation and Problem Formulation

In this section, we describe our notation and formally define the problem. We define a Bayesian model for gene expression in a two-group perturbation experiment. We classify the random variables of the model into two different groups, namely observed variables and hidden variables. We have the values of the observed variables, while we estimate the values of the hidden variables.

Observed variables. We define two sets of observed variables, one for the microarray gene expression data and another for the neighborhood in the extended gene network.


Table 5-1. Enumeration of the values of $Z_i$, $Z_j$ and $X_{ij}$ for different values of $S^A_i$, $S^B_i$, $S^A_j$ and $S^B_j$. The hidden variables are oriented in a hierarchical structure. For instance, the value of $Z_i$ depends on the values of $S^A_i$ and $S^B_i$. Similarly, the value of $X_{ij}$ depends on the values of $Z_i$ and $Z_j$. Thus, the value of the dependent variable $X_{ij}$ in turn depends on the values of the four independent variables $S^A_i$, $S^B_i$, $S^A_j$ and $S^B_j$.

S_i^A  S_i^B  S_j^A  S_j^B  Z_i  Z_j  X_ij
DE     DE     DE     DE     1    1    1
DE     DE     DE     EE     1    2    2
DE     DE     EE     DE     1    3    3
DE     DE     EE     EE     1    4    4
DE     EE     DE     DE     2    1    5
DE     EE     DE     EE     2    2    6
DE     EE     EE     DE     2    3    7
DE     EE     EE     EE     2    4    8
EE     DE     DE     DE     3    1    9
EE     DE     DE     EE     3    2    10
EE     DE     EE     DE     3    3    11
EE     DE     EE     EE     3    4    12
EE     EE     DE     DE     4    1    13
EE     EE     DE     EE     4    2    14
EE     EE     EE     DE     4    3    15
EE     EE     EE     EE     4    4    16

MICROARRAY DATA. We denote the number of genes by $M$ and the number of data points in the two groups $D_A$ and $D_B$ by $N_A$ and $N_B$ respectively. We represent the set of genes by $G = \{g_1, g_2, \ldots, g_M\}$. For each gene and for each group, the microarray data contains the gene expression values before and after the perturbation, i.e., control and non-control data respectively. We denote the expression value of the $i$th gene from the $j$th sample in the control data of group $D_A$ by $y^A_{ij}$. We represent the same for the non-control data by $y'^A_{ij}$. Thus the expression values of the gene $g_i$ for all the samples in $D_A$ for control and non-control data are $y^A_i = \{y^A_{i1}, y^A_{i2}, \ldots, y^A_{iN_A}\}$ and $y'^A_i = \{y'^A_{i1}, y'^A_{i2}, \ldots, y'^A_{iN_A}\}$ respectively. We denote all the expression values in group $D_A$ for gene $g_i$ by $Y^A_i$ (i.e., $Y^A_i = y^A_i \cup y'^A_i$). We denote the collection of the gene expressions of all the genes in group $D_A$ by $Y^A = \bigcup_{i=1}^{M} Y^A_i$. We define $Y^B$ similarly for all the genes in $D_B$. We refer to the complete gene expression data using the variable $Y = Y^A \cup Y^B$.

NEIGHBORHOOD VARIABLES. We use the term $W = \{W_{ij}\}$ to indicate whether two genes $g_i$ and $g_j$ are neighbors in the extended gene network. If $g_i$ is an incoming neighbor of $g_j$ (i.e., $g_j$ has an incoming edge from $g_i$), then we set the value of $W_{ij}$ ($1 \leq i, j \leq M$) to 1. It is 0 otherwise.
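The neighborhood variables can be illustrated with a small sketch. The Python below is a minimal illustration under stated assumptions: the function names are hypothetical, the toy edge list uses two interactions named in the text (K-Ras activates Raf; Raf is an incoming neighbor of MEK), and storing the metagene $g_0$ as row 0 of $W$ is an assumed convention for representing the extended gene network.

```python
def build_neighborhood(genes, edges):
    """Return (W, index) for the extended gene network.

    genes -- list of gene names; gene k in the list gets index k + 1
    edges -- iterable of (source, target) pairs, meaning source -> target
    W[i][j] == 1 means g_i is an incoming neighbor of g_j.
    Index 0 is reserved for the metagene g_0 (assumed convention).
    """
    index = {gene: position + 1 for position, gene in enumerate(genes)}
    size = len(genes) + 1
    w = [[0] * size for _ in range(size)]
    for source, target in edges:
        w[index[source]][index[target]] = 1
    for j in range(1, size):
        w[0][j] = 1  # the metagene can affect every other gene
    return w, index


def incoming_neighbors(w, j):
    """Indices i with an edge g_i -> g_j (always includes the metagene 0)."""
    return [i for i in range(len(w)) if w[i][j] == 1]


def outgoing_neighbors(w, i):
    """Indices j with an edge g_i -> g_j."""
    return [j for j in range(len(w)) if w[i][j] == 1]


# Toy network built from two interactions mentioned for Figure 5-1.
W, INDEX = build_neighborhood(
    ["K-Ras", "Raf", "MEK"],
    [("K-Ras", "Raf"), ("Raf", "MEK")],
)
```

A dense list-of-lists is used only for readability; for the 2,335 KEGG genes mentioned earlier a sparse adjacency structure would be the more economical choice.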


Hidden variables. We define three sets of hidden variables. These variables govern the states of genes, the regulation of genes and the interactions among genes, respectively.

STATE VARIABLES. We use $S^A = \{S^A_i\}$ and $S^B = \{S^B_i\}$ ($1 \leq i \leq M$) to denote the states of the genes in groups $D_A$ and $D_B$. $S^A_i = 1$ if $g_i$ is DE in $D_A$ and 0 if it is EE in $D_A$. We define $S^B_i$ similarly. We assume that the metagene $g_0$ is DE in both $D_A$ and $D_B$. Thus, $S^A_0 = S^B_0 = 1$.

REGULATION VARIABLES. We denote the regulation condition of gene $g_i$ by $Z_i$. Table 5-1 enumerates the different values of $Z_i$ for the values of $S^A_i$ and $S^B_i$. In this formulation, the cases $Z_i \in \{2, 3\}$ indicate that $g_i$ is DR, whereas $Z_i \in \{1, 4\}$ indicate that $g_i$ is ER. The metagene is guaranteed to be ER, since $S^A_0 = S^B_0 = 1$.

INTERACTION VARIABLES. In order to govern the joint regulation states of genes $g_i$ and $g_j$ we define interaction variables $X = \{X_{ij}\}$ ($1 \leq i, j \leq M$). Mathematically, $X_{ij} = 4(Z_i - 1) + Z_j$. Note that this equation keeps the mapping between the interaction variables and the regulation variables concise by assigning a distinct numeric constant between 1 and 16 to each value of an interaction variable. Table 5-1 enumerates the different values of $X_{ij}$ for the values of $Z_i$ and $Z_j$. Specifically, $X_{0j} \in \{2, 3\}$ and $X_{0j} \in \{1, 4\}$ correspond to the cases where $g_j$ is DR and ER, respectively, because of interaction with the metagene $g_0$. It is easy to see that the hidden variables follow a hierarchical structure. For instance, the value of $Z_i$ depends on the values of $S^A_i$ and $S^B_i$. Similarly, the value of $X_{ij}$ depends on the values of $Z_i$ and $Z_j$. Thus, the value of the dependent variable $X_{ij}$ is based on the values of the four independent variables $S^A_i$, $S^B_i$, $S^A_j$ and $S^B_j$. Table 5-1 enumerates the values of $Z_i$, $Z_j$ and $X_{ij}$ for different values of $S^A_i$, $S^B_i$, $S^A_j$ and $S^B_j$. It is worth noting that the different values that we assign to the hidden variables are categorical in nature.

Problem formulation. Let $G = \{g_1, g_2, \ldots, g_M\}$ denote the set of all genes. Using the definition of the neighborhood variables $W$, we denote the collection $(G, W)$ by $V$, which essentially represents the gene networks. We denote the metagene by $g_0$. Given observed data $\{V, Y\}$, we want to estimate the probabilities $p(X_{ij} = x \mid X - X_{ij}, Y, V)$, $x \in \{1, 2, \ldots, 16\}$.
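The hierarchy of hidden variables can be made concrete with a short sketch. The following Python is a minimal illustration of the mapping in Table 5-1 and of $X_{ij} = 4(Z_i - 1) + Z_j$; the function names (`regulation`, `interaction`, `decode`, `is_dr`) are hypothetical and chosen only for this example.

```python
from itertools import product

DE, EE = 1, 0  # state values: S = 1 for DE, S = 0 for EE


def regulation(s_a, s_b):
    """Map the states (S^A, S^B) of one gene to its regulation value Z."""
    table = {(DE, DE): 1, (DE, EE): 2, (EE, DE): 3, (EE, EE): 4}
    return table[(s_a, s_b)]


def interaction(z_i, z_j):
    """X_ij = 4 (Z_i - 1) + Z_j, a value between 1 and 16."""
    return 4 * (z_i - 1) + z_j


def decode(x_ij):
    """Invert the mapping: recover (Z_i, Z_j) from X_ij."""
    return (x_ij - 1) // 4 + 1, (x_ij - 1) % 4 + 1


def is_dr(z):
    """A gene is differentially regulated when Z is 2 or 3."""
    return z in (2, 3)


# Regenerate the sixteen X_ij values of Table 5-1 in row order.
TABLE_5_1 = [
    interaction(regulation(sa_i, sb_i), regulation(sa_j, sb_j))
    for sa_i, sb_i, sa_j, sb_j in product((DE, EE), repeat=4)
]
```

Since the metagene always has $Z_0 = 1$, `interaction(1, z_j)` simply returns `z_j`, which is why $X_{0j} \in \{2, 3\}$ marks $g_j$ as DR with respect to the metagene.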


A higher value of $p(X_{0j} \in \{2, 3\} \mid \cdot)$ indicates a higher probability of a gene $g_j$ being PDR. Using the estimated values of $p(X_{0j} \mid \cdot)$, $\forall j \in \{1, 2, \ldots, M\}$, we can create an ordered list of candidate PDR genes.

5.1.2 Overview of the Solution

This section gives a high-level overview of our approach to estimating $p(X_{0j} \mid \cdot)$, $\forall j \in \{1, 2, \ldots, M\}$. One simple approach is to use a hypothesis test to find the PDR genes in the given dataset [24]. However, the available hypothesis tests do not consider the interactions among genes in the gene network. Also, deciding on the significance level of the test can be a complex step. Another approach is to use SSEM to create a ranking of the potential primarily affected genes in each group separately [20]. Then we can select the top $k$ genes in each group and perform a set difference to obtain the PDR genes. Though SSEM considers the correlation between the genes, it does not utilize any known information from the gene networks. We build a Bayesian probabilistic method based on Markov Random Field in which we leverage the information from gene networks as the prior belief of the model. Using Bayes' theorem [93] we can write the joint probability density of the interaction variables $X$ as

$$P(X \mid Y, V) = \frac{P(Y \mid X, V, \theta_Y)\, P(X \mid V, \theta_X)}{\sum_{X} P(Y \mid X, V, \theta_Y)\, P(X \mid V, \theta_X)} \qquad (5)$$

The first term in the numerator, $P(Y \mid X, V, \theta_Y)$, is the likelihood of the observed expression data $Y$ given the interaction variables and the gene network. $\theta_Y$ represents the parameters of the likelihood function. A detailed discussion of how we compute this likelihood can be found in Section 5.1.5. The second term in the numerator, $P(X \mid V, \theta_X)$, represents the prior belief. $\theta_X$ represents the parameters of the prior density function. We define a Markov Random Field (MRF) over the interaction variables $X$, and the priors are encoded via feature functions in the MRF. Details of the priors and the associated feature functions are outlined in Section 5.1.3. The denominator of Equation 5 is the normalization constant


that represents the sum of the product of the likelihood and the prior over all possible assignments of the interaction variables $X$. Given the joint probability density function outlined in Equation 5, our original problem reduces to obtaining assignments for the interaction variables $X$ and the parameters $\theta_X$ and $\theta_Y$ that maximize it. A Maximum Likelihood Estimation (MLE) of Equation 5 is practically infeasible even for a small number of genes, since the number of terms in the denominator grows exponentially. Instead, we use a pseudo-likelihood version of the objective function, as shown in Section 5.1.4. We use Iterated Conditional Modes (ICM) [92] and Differential Evolution [100] in an alternating optimization technique to maximize the pseudo-likelihood with respect to $X$, $\theta_X$ and $\theta_Y$. After the optimization, we obtain an assignment for $X$, $\theta_X$ and $\theta_Y$. Using these assignments and the observed data, we estimate the posterior probability of all $X_{ij}$ variables. Using the estimated values of $p(X_{0j} \mid \cdot)$, $\forall j \in \{1, 2, \ldots, M\}$, we create an ordered list of candidate PDR genes. We elaborate on each of these steps next. Figure 5-3 illustrates the different portions of CMRF and the connectivity between them.

5.1.3 Computation of the Prior Density Function

In this section, we describe how we incorporate the gene network as the prior belief into our Bayesian model. From the structure and properties of the gene network, we build three hypotheses and embed them into our model. We present the entire concept in three subsections.

Statement of Hypotheses. Here we briefly state the three hypotheses on the biological networks.

HYPOTHESIS 1. In each group $D_T$ ($T \in \{A, B\}$), the metagene $g_0$ can change the state of all the other genes. Thus, all the genes can be directly affected by the external perturbation.


HYPOTHESIS 2. In each group $D_T$ ($T \in \{A, B\}$), a gene $g_i$ can change the states of its outgoing neighbors $g_j$ in the same data group, i.e., a gene can be indirectly affected by the perturbation through genetic interactions.

HYPOTHESIS 3. Each gene has a high probability of being equally regulated. This follows from the observation that the difference between the expressions of most of the genes in the two groups is often small. We expect the response of the genes in these groups to be very similar.

Clearly, when the data does not follow one or more of the hypotheses, the optimization function can overcome the prior belief given strong support from the data.

Figure 5-3. Illustration of the different components of CMRF and the connectivity between them. (A) obtains initial estimates of the state variables using Student's t test. (B) estimates parameters in a way that maximizes the data likelihood. (C) estimates parameters in order to maximize the prior density. Both (B) and (C) use a global optimization technique called Differential Evolution. (D) employs Iterated Conditional Modes to maximize the pseudo-likelihood. (B), (C) and (D) constitute an alternating optimization technique. These three steps are repeated until the algorithm meets a criterion for completion. Finally, once the optimization is complete, the DR genes are sorted in decreasing order of their likelihood with respect to the metagene $g_0$. The genes at the top of the list are declared PDR.


Markov Random Field construction. In order to compute the prior density function, we define a Markov Random Field (MRF) over the $X$ variables [94]. An MRF is a probabilistic model in which the state of a variable depends only on the states of its neighbors. An MRF is useful for modeling our problem, as the states of genes depend on their neighbors. Here, the MRF is an undirected graph $(X, E)$, where the variables $X = \{X_{ij}\}$ represent the vertices of the graph (i.e., each interaction variable $X_{ij}$ corresponds to a vertex). We denote the set of edges by $E = \{(X_{ij}, X_{pj}) \mid W_{pi} = W_{ij} = 1\} \cup \{(X_{ij}, X_{ik}) \mid W_{jk} = W_{ij} = 1\}$. Thus, two variables in $X$ share an edge if they share a common subscript at the same position and the two genes corresponding to the other subscript interact in the gene network. For example, in Figure 5-2(b), $X_{35}$ and $X_{25}$ are neighbors, as they share 5 (i.e., gene $g_5$) as the second subscript and $g_2$ and $g_3$ interact in the gene network in Figure 5-2(a). One important point to note is that this graph does not use the state variables $S$ or the regulation variables $Z$ to model the dependencies between the genes. Rather, it establishes those dependencies over the $X$ variables. For example, in Figure 5-2(b) we draw the MRF graph corresponding to the hypothetical gene network in Figure 5-2(a). In the gene network, there is an edge from $g_2$ to $g_3$. So, $g_2$ can potentially change the state of $g_3$. We create an edge between $X_{12}$ and $X_{13}$ that corresponds to the edge from $g_2$ to $g_3$. As $g_1$ is common to $X_{12}$ and $X_{13}$, if they assume the same value (i.e., $X_{12} = X_{13}$), it implies that the genes $g_2$ and $g_3$ are in the same state (i.e., $S^T_2 = S^T_3$, $T \in \{A, B\}$). We formulate these dependency constraints using a set of unary and binary functions called feature functions. We discuss these feature functions next.

Development of feature functions. We denote the neighbors of $X_{ij}$ in the MRF graph by $\mathcal{X}_{ij} = \{X_{kj} \mid W_{ki} = 1\} \cup \{X_{ip} \mid W_{jp} = 1\}$. We define a clique $C_{ij}$ over each $X_{ij}$ and its neighbors $\mathcal{X}_{ij}$, provided $W_{ij} = 1$. A feature function $f(C_{ij})$ is a Boolean function defined over the clique $C_{ij}$. This function evaluates to one if it is satisfied and to zero otherwise. We define a potential function $\psi(C_{ij})$ corresponding to $f(C_{ij})$ as an


exponential function given by $\exp(\lambda f(C_{ij}))$. Here $\lambda$ is a coefficient associated with $f(C_{ij})$ that represents the relevance of $f(C_{ij})$ in the MRF. According to the Hammersley-Clifford theorem, we express the joint density function of the MRF over $X$ as a product of the potential functions defined over that MRF [95],

$$p(X \mid \theta_X) = \frac{1}{\Delta} \prod_{C_{ij},\, W_{ij} = 1} \psi(C_{ij}).$$

In this formulation, $\Delta$ is the normalization function $\Delta = \sum_{X} \prod_{C_{ij}} \psi(C_{ij})$. To limit the complexity of our model, we consider only cliques of size one and two.

Table 5-2. Enumeration of five different unary feature functions F1, F2, F3, F6 and F7.

X_ij  F1  F2  F3  F6  F7
1     0   0   1   1   1
2     1   0   0   1   0
3     0   1   0   1   0
4     0   0   1   1   1
5     0   0   1   0   1
6     0   0   1   0   0
7     0   0   1   0   0
8     0   0   1   0   1
9     0   0   1   0   1
10    0   0   1   0   0
11    0   0   1   0   0
12    0   0   1   0   1
13    0   0   1   1   1
14    0   0   1   1   0
15    0   0   1   1   0
16    0   0   1   1   1

We define seven feature functions to capture the dependencies among the variables in $X$ according to the three hypotheses.

Unary feature functions F1, F2, F3. A primary component of the prior density function is modeling the frequency of $X_{ij}$ itself. Here, we focus on two values of $X_{ij}$, namely $X_{ij} \in \{2, 3\}$, since they correspond to the events that a gene $g_j$ is DR due to the metagene $g_0$. When $X_{ij} = 2$, $g_j$ is DE in $D_A$ and EE in $D_B$. To capture this, we define a feature function $F_1(X_{ij})$ which returns one when $X_{ij} = 2$. It returns zero otherwise. Similarly, $X_{ij} = 3$ when $g_j$ is EE in $D_A$ and DE in $D_B$. We define another feature function $F_2(X_{ij})$, which returns one when $X_{ij} = 3$. We capture all the other values of $X_{ij}$ by a


feature function called $F_3(X_{ij})$. It returns zero when $X_{ij} \in \{2, 3\}$ and one otherwise. Table 5-2 enumerates the domains and ranges of $F_1$, $F_2$ and $F_3$.

Binary feature functions F4, F5. Let $\mathcal{H}$ represent the hypothesis that in a group $D_T$, $T \in \{A, B\}$, a gene $g_j$ (including the metagene) can change the state of one of its outgoing neighbors $g_k$. We make a stronger hypothesis $\mathcal{H}_o$ that $\mathcal{H}$ holds simultaneously in $D_A$ and $D_B$ with high probability. Note that this stronger hypothesis is based on the assumption that the genes in both $D_A$ and $D_B$ express in a similar fashion. This assumption is meaningful, as in these two-group perturbation experiments the different groups belong to similar biological conditions [47]. $\mathcal{H}_o$ is encoded in the $X$ domain as follows. Consider four genes $g_p$, $g_i$, $g_j$ and $g_k$ such that $g_p \rightarrow g_i$, $g_i \rightarrow g_j$ and $g_j \rightarrow g_k$. Here → indicates that the gene on the left activates or inhibits the gene on the right. By definition, $(X_{pj}, X_{ij})$ and $(X_{ij}, X_{ik})$ are edges in the MRF. Note that the first edge corresponds to an incoming neighbor $g_p$ of $g_i$, while the second edge corresponds to an outgoing neighbor $g_k$ of $g_j$. We discriminate between these two sets of neighbors of $X_{ij}$, as they are related to the incoming neighbors of $g_i$ and the outgoing neighbors of $g_j$ respectively. It can be shown that, for the first set of edges, $X_{pj}$ equals $X_{ij}$ if and only if (iff) $Z_p = Z_i$, i.e., $\mathcal{H}_o$ holds true. Similarly, for the second set of edges, $X_{ij}$ equals $X_{ik}$ iff $Z_j = Z_k$, which in turn implies that $\mathcal{H}_o$ is satisfied. We define two sets of feature functions to formalize these equalities based on the incoming neighbors of $g_i$ and the outgoing neighbors of $g_j$.

LEFT EXTERNAL EQUALITY. We denote the incoming neighbors of $g_i$ by $In(g_i)$. We write a feature function $f_4(X_{pj}, X_{ij})$, $\forall p$, $g_p \in In(g_i)$. $f_4(X_{pj}, X_{ij}) = 1$ if $Z_i = Z_p$ and $W_{pi} = W_{ij} = 1$. Otherwise, $f_4(X_{pj}, X_{ij}) = 0$. We denote the summation of this function over all the incoming neighbors of $g_i$ as $F_4(X_{ij}) = \sum_{p,\, W_{ij} = 1,\, W_{pi} = 1} f_4(X_{ij}, X_{pj})$.

RIGHT EXTERNAL EQUALITY. We denote the outgoing neighbors of $g_j$ by $Out(g_j)$. We define a feature function $f_5(X_{ik}, X_{ij})$, $\forall k$, $g_k \in Out(g_j)$. $f_5(X_{ik}, X_{ij}) = 1$ if $Z_k = Z_j$ and $W_{jk} = W_{ij} = 1$. Otherwise, $f_5(X_{ik}, X_{ij}) = 0$. We denote the summation of this


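function over all the outgoing neighbors of $g_j$ as $F_5(X_{ij}) = \sum_{k,\, W_{ij} = 1,\, W_{jk} = 1} f_5(X_{ij}, X_{ik})$.

Table 5-3. Truth values of the binary feature function left external equality ($f_4$). Only the possible entries are annotated with zero and one. The other entries (shown as ".") require different values of $Z_j$ in $X_{ij}$ and $X_{pj}$, which is not possible. Note that the feature function can assume the value one only when $X_{ij}$ and $X_{pj}$ are equal, which is in accordance with the definition of that feature function. Rows correspond to $X_{pj}$ and columns to $X_{ij}$.

        1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
   1    1  .  .  .  0  .  .  .  0  .  .  .  0  .  .  .
   2    .  1  .  .  .  0  .  .  .  0  .  .  .  0  .  .
   3    .  .  1  .  .  .  0  .  .  .  0  .  .  .  0  .
   4    .  .  .  1  .  .  .  0  .  .  .  0  .  .  .  0
   5    0  .  .  .  1  .  .  .  0  .  .  .  0  .  .  .
   6    .  0  .  .  .  1  .  .  .  0  .  .  .  0  .  .
   7    .  .  0  .  .  .  1  .  .  .  0  .  .  .  0  .
   8    .  .  .  0  .  .  .  1  .  .  .  0  .  .  .  0
   9    0  .  .  .  0  .  .  .  1  .  .  .  0  .  .  .
  10    .  0  .  .  .  0  .  .  .  1  .  .  .  0  .  .
  11    .  .  0  .  .  .  0  .  .  .  1  .  .  .  0  .
  12    .  .  .  0  .  .  .  0  .  .  .  1  .  .  .  0
  13    0  .  .  .  0  .  .  .  0  .  .  .  1  .  .  .
  14    .  0  .  .  .  0  .  .  .  0  .  .  .  1  .  .
  15    .  .  0  .  .  .  0  .  .  .  0  .  .  .  1  .
  16    .  .  .  0  .  .  .  0  .  .  .  0  .  .  .  1

Tables 5-3 and 5-4 enumerate the values of $f_4$ and $f_5$ for different values of $X_{ij}$. The missing entries in these tables correspond to cases which cannot occur during the optimization. For instance, in Table 5-3, a missing entry corresponds to different values of $Z_j$ in $X_{ij}$ and $X_{pj}$, which is not possible. For the feature functions $f_4$ and $f_5$, $X_{pj}$ or $X_{ik}$ may not represent any interaction from the extended gene network when $W_{pj} = 0$ or $W_{ik} = 0$ respectively. We represent them by dotted rectangles in Figure 5-2(b).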
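The entries of Tables 5-3 and 5-4 follow directly from the encoding $X_{ij} = 4(Z_i - 1) + Z_j$, which the following Python sketch illustrates. It is only an illustration: the helper names are hypothetical, the neighborhood conditions ($W_{pi} = W_{ij} = 1$ for $f_4$, $W_{jk} = W_{ij} = 1$ for $f_5$) are assumed to hold for the pair being evaluated, and `None` is used here to mark the impossible combinations that the tables leave blank.

```python
def first_z(x):
    """Regulation value encoded by the first subscript of an X variable."""
    return (x - 1) // 4 + 1


def second_z(x):
    """Regulation value encoded by the second subscript of an X variable."""
    return (x - 1) % 4 + 1


def f4(x_pj, x_ij):
    """Left external equality: 1 iff Z_p == Z_i.

    X_pj and X_ij share gene g_j, so they must agree on Z_j; a pair that
    disagrees is impossible and yields None (a blank table entry).
    """
    if second_z(x_pj) != second_z(x_ij):
        return None
    return 1 if first_z(x_pj) == first_z(x_ij) else 0


def f5(x_ik, x_ij):
    """Right external equality: 1 iff Z_k == Z_j.

    X_ik and X_ij share gene g_i, so they must agree on Z_i.
    """
    if first_z(x_ik) != first_z(x_ij):
        return None
    return 1 if second_z(x_ik) == second_z(x_ij) else 0


def possible_row(feature, row_value):
    """Non-blank entries of one table row, in increasing column order."""
    values = (feature(row_value, column) for column in range(1, 17))
    return [value for value in values if value is not None]
```

For example, `possible_row(f4, 5)` reproduces the four annotated entries of row 5 in Table 5-3, and `possible_row(f5, 6)` those of row 6 in Table 5-4.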


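Table 5-4. Truth values of the binary feature function right external equality ($f_5$). Only the possible entries are annotated with zero and one; a dot marks an entry that cannot occur, because it would require different values of $Z_i$ in $X_{ij}$ and $X_{ik}$. The feature function equals one only when $X_{ij}$ and $X_{ik}$ are equal, in accordance with the definition of right external equality. Rows are $X_{ik}$, columns are $X_{ij}$.

X_ik   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
  1    1  0  0  0  .  .  .  .  .  .  .  .  .  .  .  .
  2    0  1  0  0  .  .  .  .  .  .  .  .  .  .  .  .
  3    0  0  1  0  .  .  .  .  .  .  .  .  .  .  .  .
  4    0  0  0  1  .  .  .  .  .  .  .  .  .  .  .  .
  5    .  .  .  .  1  0  0  0  .  .  .  .  .  .  .  .
  6    .  .  .  .  0  1  0  0  .  .  .  .  .  .  .  .
  7    .  .  .  .  0  0  1  0  .  .  .  .  .  .  .  .
  8    .  .  .  .  0  0  0  1  .  .  .  .  .  .  .  .
  9    .  .  .  .  .  .  .  .  1  0  0  0  .  .  .  .
 10    .  .  .  .  .  .  .  .  0  1  0  0  .  .  .  .
 11    .  .  .  .  .  .  .  .  0  0  1  0  .  .  .  .
 12    .  .  .  .  .  .  .  .  0  0  0  1  .  .  .  .
 13    .  .  .  .  .  .  .  .  .  .  .  .  1  0  0  0
 14    .  .  .  .  .  .  .  .  .  .  .  .  0  1  0  0
 15    .  .  .  .  .  .  .  .  .  .  .  .  0  0  1  0
 16    .  .  .  .  .  .  .  .  .  .  .  .  0  0  0  1

Unary feature functions $F_6$, $F_7$. We introduce two unary feature functions to incorporate our last hypothesis, that all genes are ER with high probability. We consider two genes $g_i$ and $g_j$ such that $g_i \to g_j$. This hypothesis holds true if $g_i$ is equally regulated or $g_j$ is equally regulated.

LEFT INTERNAL EQUALITY. We define this feature function to capture the events in which $g_i$ is equally regulated. As $g_j$ can assume any state, this feature function holds true for eight different values of $X_{ij}$. We denote the feature function by $f_6(X_{ij}, t)$; it returns one if its two arguments are equal and zero otherwise. We denote the summation of this function over these eight values of $X_{ij}$ as

\[ F_6(X_{ij}) = \sum_{i,j,\; W_{ij}=1,\; t \in \{1,\dots,4,13,\dots,16\}} f_6(X_{ij}, t). \]

RIGHT INTERNAL EQUALITY. We define this feature function to capture the events in which $g_j$ is equally regulated. As $g_i$ can assume any state, this feature function holds true for eight different values of $X_{ij}$. We denote the feature function by $f_7(X_{ij}, t)$; it returns one if its two arguments are equal and zero otherwise. We denote the summation of this function over these eight values of $X_{ij}$ as

\[ F_7(X_{ij}) = \sum_{i,j,\; W_{ij}=1,\; t \in \{1,4,5,8,9,12,13,16\}} f_7(X_{ij}, t). \]

The last two columns of Table 5-2 enumerate these two internal equalities. Based on these feature functions, we define the joint density function of $X$ as

\[ p(X \mid \theta_X) = \frac{1}{\Delta} \exp\Big( \sum_{i,j,\; W_{ij}=1,\; k \in \{1,2,\dots,7\}} \gamma_k F_k(X_{ij}) \Big) \qquad (5) \]

In the above equation, $\Delta$ is the normalization constant and $\gamma_k$, $k \in \{1, 2, \dots, 7\}$, are the coefficients of the seven feature functions in the MRF. In the next section, we discuss how we approximate the objective function of the MRF and the data. We also describe how we formulate the posterior probability density function for $X_{ij}$.

5.1.4 Approximation of the Objective Function

A direct maximization of the objective function given by Equation 5 is intractable, as it requires the evaluation of an exponential number of terms in the denominator. We employ pseudo-likelihood as an established substitute for Equation 5 [96]. Pseudo-likelihood is the simple product of the conditional probability density functions of the $X_{ij}$ variables. Geman et al. proved the consistency of the maximum pseudo-likelihood estimate [97]. The approximated objective function can be written as

\[ F = \arg\max_X \Big( \prod_{i,j} F_{ij} \Big) \qquad (5) \]

The posterior density function $F_{ij}$ of $X_{ij}$ is given by

\[ F_{ij} = p(X_{ij} \mid X - X_{ij}, Y, \theta_X, \theta_Y) = \frac{p(Y_{Ai}, Y_{Bi}, Y_{Aj}, Y_{Bj} \mid X_{ij}, \mathcal{X}_{ij}, \theta_Y)\; p(X_{ij} \mid X - X_{ij}, \theta_X)}{\sum_{X_{ij} \in \{1,\dots,16\}} p(Y_{Ai}, Y_{Bi}, Y_{Aj}, Y_{Bj} \mid X_{ij}, \mathcal{X}_{ij}, \theta_Y)} \qquad (5) \]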
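Pseudo-likelihood replaces the intractable joint normalization by one small normalization per variable, over that variable's 16 states only. The following is a minimal sketch of that idea, not the CMRF implementation: to stay self-contained it keeps only the two unary features $F_6$ and $F_7$ (so the conditional does not depend on neighboring variables), and the coefficient values passed in are hypothetical.

```python
import math
from typing import Dict, Iterable

# State sets listed in the text for the two internal-equality features.
LEFT_INTERNAL = {1, 2, 3, 4, 13, 14, 15, 16}   # t-values for F6 (g_i equally regulated)
RIGHT_INTERNAL = {1, 4, 5, 8, 9, 12, 13, 16}   # t-values for F7 (g_j equally regulated)


def F6(x: int) -> int:
    """Sum of f6(x, t) over the listed t-values: 1 iff x is one of them."""
    return 1 if x in LEFT_INTERNAL else 0


def F7(x: int) -> int:
    """Sum of f7(x, t) over the listed t-values: 1 iff x is one of them."""
    return 1 if x in RIGHT_INTERNAL else 0


def conditional_prior(gamma6: float, gamma7: float) -> Dict[int, float]:
    """Normalized exponential over the 16 states, using only F6 and F7 (simplification)."""
    scores = {t: gamma6 * F6(t) + gamma7 * F7(t) for t in range(1, 17)}
    top = max(scores.values())                      # subtract the max for numerical stability
    weights = {t: math.exp(s - top) for t, s in scores.items()}
    total = sum(weights.values())                   # only 16 terms, never exponentially many
    return {t: w / total for t, w in weights.items()}


def log_pseudo_likelihood(states: Iterable[int], gamma6: float, gamma7: float) -> float:
    """Log of the product of per-variable conditionals for the given state assignment."""
    cond = conditional_prior(gamma6, gamma7)
    return sum(math.log(cond[x]) for x in states)
```

With positive coefficients, state 1 (both genes equally regulated) receives more prior mass than state 6 (neither gene equally regulated), which is the qualitative effect the internal-equality features are meant to have.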


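Derivation of $F_{ij}$.

\begin{align*}
F_{ij} &= p(X_{ij} \mid X - X_{ij}, Y, \theta_X, \theta_Y) \\
&= p(X_{ij} \mid X - X_{ij}, Y_{Ai}, Y_{Bi}, Y_{Aj}, Y_{Bj}, \theta_X, \theta_Y) \\
&= \frac{p(Y_{Ai}, Y_{Bi}, Y_{Aj}, Y_{Bj}, X - X_{ij} - \mathcal{X}_{ij}, X_{ij}, \mathcal{X}_{ij}, \theta_X, \theta_Y)}{p(Y_{Ai}, Y_{Bi}, Y_{Aj}, Y_{Bj}, X - X_{ij} - \mathcal{X}_{ij}, \mathcal{X}_{ij}, \theta_X, \theta_Y)} \\
&= \frac{p(Y_{Ai}, Y_{Bi}, Y_{Aj}, Y_{Bj}, X - X_{ij} - \mathcal{X}_{ij} \mid X_{ij}, \mathcal{X}_{ij}, \theta_X, \theta_Y)\; p(X_{ij}, \mathcal{X}_{ij}, \theta_X, \theta_Y)}{p(Y_{Ai}, Y_{Bi}, Y_{Aj}, Y_{Bj}, X - X_{ij} - \mathcal{X}_{ij} \mid \mathcal{X}_{ij}, \theta_X, \theta_Y)\; p(\mathcal{X}_{ij}, \theta_X, \theta_Y)} \\
&= \frac{p(Y_{Ai}, Y_{Bi}, Y_{Aj}, Y_{Bj} \mid X_{ij}, \mathcal{X}_{ij}, \theta_X, \theta_Y)\; p(X - X_{ij} - \mathcal{X}_{ij} \mid X_{ij}, \mathcal{X}_{ij}, \theta_X, \theta_Y)\; p(X_{ij}, \mathcal{X}_{ij}, \theta_X, \theta_Y)}{p(Y_{Ai}, Y_{Bi}, Y_{Aj}, Y_{Bj} \mid \mathcal{X}_{ij}, \theta_X, \theta_Y)\; p(X - X_{ij} - \mathcal{X}_{ij} \mid \mathcal{X}_{ij}, \theta_X, \theta_Y)\; p(\mathcal{X}_{ij}, \theta_X, \theta_Y)} \\
&= \frac{p(Y_{Ai}, Y_{Bi}, Y_{Aj}, Y_{Bj} \mid X_{ij}, \mathcal{X}_{ij}, \theta_X, \theta_Y)\; p(X - X_{ij} - \mathcal{X}_{ij}, X_{ij}, \mathcal{X}_{ij}, \theta_X, \theta_Y)}{p(Y_{Ai}, Y_{Bi}, Y_{Aj}, Y_{Bj} \mid \mathcal{X}_{ij}, \theta_X, \theta_Y)\; p(X - X_{ij} - \mathcal{X}_{ij}, \mathcal{X}_{ij}, \theta_X, \theta_Y)} \\
&= \frac{p(Y_{Ai}, Y_{Bi}, Y_{Aj}, Y_{Bj} \mid X_{ij}, \mathcal{X}_{ij}, \theta_X, \theta_Y)\; p(X, \theta_X, \theta_Y)}{p(Y_{Ai}, Y_{Bi}, Y_{Aj}, Y_{Bj} \mid \mathcal{X}_{ij}, \theta_X, \theta_Y)\; p(X - X_{ij}, \theta_X, \theta_Y)} \\
&= \frac{p(Y_{Ai}, Y_{Bi}, Y_{Aj}, Y_{Bj} \mid X_{ij}, \mathcal{X}_{ij}, \theta_X, \theta_Y)\; p(X_{ij} \mid X - X_{ij}, \theta_X, \theta_Y)\; p(X - X_{ij}, \theta_X, \theta_Y)}{p(Y_{Ai}, Y_{Bi}, Y_{Aj}, Y_{Bj} \mid \mathcal{X}_{ij}, \theta_X, \theta_Y)\; p(X - X_{ij}, \theta_X, \theta_Y)} \\
&= \frac{p(Y_{Ai}, Y_{Bi}, Y_{Aj}, Y_{Bj} \mid X_{ij}, \mathcal{X}_{ij}, \theta_X, \theta_Y)\; p(X_{ij} \mid X - X_{ij}, \theta_X, \theta_Y)}{p(Y_{Ai}, Y_{Bi}, Y_{Aj}, Y_{Bj} \mid \mathcal{X}_{ij}, \theta_X, \theta_Y)} \\
&= \frac{p(Y_{Ai}, Y_{Bi}, Y_{Aj}, Y_{Bj}, X_{ij}, \mathcal{X}_{ij}, \theta_X, \theta_Y)\; p(\mathcal{X}_{ij}, \theta_X, \theta_Y)\; p(X_{ij} \mid X - X_{ij}, \theta_X, \theta_Y)}{p(X_{ij}, \mathcal{X}_{ij}, \theta_X, \theta_Y)\; p(Y_{Ai}, Y_{Bi}, Y_{Aj}, Y_{Bj}, \mathcal{X}_{ij}, \theta_X, \theta_Y)} \\
&= \frac{p(Y_{Ai}, Y_{Bi}, Y_{Aj}, Y_{Bj}, \theta_X \mid X_{ij}, \mathcal{X}_{ij}, \theta_Y)\; p(X_{ij}, \mathcal{X}_{ij}, \theta_Y)\; p(\mathcal{X}_{ij}, \theta_X, \theta_Y)\; p(X_{ij} \mid X - X_{ij}, \theta_X, \theta_Y)}{p(X_{ij}, \mathcal{X}_{ij}, \theta_X, \theta_Y)\; p(Y_{Ai}, Y_{Bi}, Y_{Aj}, Y_{Bj}, \theta_X \mid \mathcal{X}_{ij}, \theta_Y)\; p(\mathcal{X}_{ij}, \theta_Y)} \\
&= \frac{p(Y_{Ai}, Y_{Bi}, Y_{Aj}, Y_{Bj} \mid X_{ij}, \mathcal{X}_{ij}, \theta_Y)\; p(\theta_X \mid X_{ij}, \mathcal{X}_{ij}, \theta_Y)\; p(X_{ij}, \mathcal{X}_{ij}, \theta_Y)\; p(\mathcal{X}_{ij}, \theta_X, \theta_Y)\; p(X_{ij} \mid X - X_{ij}, \theta_X, \theta_Y)}{p(Y_{Ai}, Y_{Bi}, Y_{Aj}, Y_{Bj} \mid \mathcal{X}_{ij}, \theta_Y)\; p(X_{ij}, \mathcal{X}_{ij}, \theta_X, \theta_Y)\; p(\theta_X \mid \mathcal{X}_{ij}, \theta_Y)\; p(\mathcal{X}_{ij}, \theta_Y)} \\
&= \frac{p(Y_{Ai}, Y_{Bi}, Y_{Aj}, Y_{Bj} \mid X_{ij}, \mathcal{X}_{ij}, \theta_Y)\; p(X_{ij}, \mathcal{X}_{ij}, \theta_X, \theta_Y)\; p(\mathcal{X}_{ij}, \theta_X, \theta_Y)\; p(X_{ij} \mid X - X_{ij}, \theta_X, \theta_Y)}{p(Y_{Ai}, Y_{Bi}, Y_{Aj}, Y_{Bj} \mid \mathcal{X}_{ij}, \theta_Y)\; p(X_{ij}, \mathcal{X}_{ij}, \theta_X, \theta_Y)\; p(\mathcal{X}_{ij}, \theta_X, \theta_Y)} \\
&= \frac{p(Y_{Ai}, Y_{Bi}, Y_{Aj}, Y_{Bj} \mid X_{ij}, \mathcal{X}_{ij}, \theta_Y)\; p(X_{ij} \mid X - X_{ij}, \theta_X, \theta_Y)}{p(Y_{Ai}, Y_{Bi}, Y_{Aj}, Y_{Bj} \mid \mathcal{X}_{ij}, \theta_Y)} \\
&= \frac{p(Y_{Ai}, Y_{Bi}, Y_{Aj}, Y_{Bj} \mid X_{ij}, \mathcal{X}_{ij}, \theta_Y)\; p(X_{ij} \mid X - X_{ij}, \theta_X)}{p(Y_{Ai}, Y_{Bi}, Y_{Aj}, Y_{Bj} \mid \mathcal{X}_{ij}, \theta_Y)} \\
&= \frac{p(Y_{Ai}, Y_{Bi}, Y_{Aj}, Y_{Bj} \mid X_{ij}, \mathcal{X}_{ij}, \theta_Y)\; p(X_{ij} \mid X - X_{ij}, \theta_X)}{\sum_{X_{ij} \in \{1,2,3,\dots,16\}} p(Y_{Ai}, Y_{Bi}, Y_{Aj}, Y_{Bj} \mid X_{ij}, \mathcal{X}_{ij}, \theta_Y)}
\end{align*}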


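In step 2 of the derivation, we substitute $Y$ by $Y_{Ai}$, $Y_{Bi}$, $Y_{Aj}$ and $Y_{Bj}$, as $X_{ij}$ is independent of every $Y_{Ck}$ with $C \in \{A, B\}$ and $k \notin \{i, j\}$. Also, in the 15th step we assume that $X_{ij}$ is independent of $\theta_Y$ given $X - X_{ij}$ and $\theta_X$.

Derivation of $p(X_{ij} \mid X - X_{ij}, \theta_X)$, $W_{ij} = 1$.

\[ p(X_{ij} \mid X - X_{ij}, \theta_X) = \frac{p(X, \theta_X)}{p(X - X_{ij}, \theta_X)} = \frac{p(X, \theta_X)}{\sum_{X_{ij} \in \{1,2,3,\dots,16\}} p(X - X_{ij}, X_{ij}, \theta_X)} = \frac{A(X_{ij}) \cdot B(ij)}{\sum_{t \in \{1,2,3,\dots,16\}} A(t) \cdot B(ij)} \]

Here $A(X_{ij})$ is $\exp\big(\sum_{k \in \{1,2,\dots,7\}} \gamma_k F_k(X_{ij})\big)$ and $B(ij)$ is given by $\exp\big(\sum_{m,n,\; mn \neq ij,\; k \in \{1,2,\dots,7\}} \gamma_k F_k(X_{mn})\big)$. We denote the prior density parameters $\{\gamma_1, \gamma_2, \dots, \gamma_7\}$ by $\theta_X$. Canceling $B(ij)$ from the numerator and the denominator, the density function simplifies to

\[ p(X_{ij} \mid X - X_{ij}, \theta_X) = \frac{\exp\big(\sum_{k \in \{1,2,\dots,7\}} \gamma_k F_k(X_{ij})\big)}{\sum_{t \in \{1,2,3,\dots,16\}} \exp\big(\sum_{k \in \{1,2,\dots,7\}} \gamma_k F_k(X_{ij} = t)\big)} \]

There are two different terms in the objective function of Equation 5. $p(X_{ij} \mid X - X_{ij}, \theta_X)$ stands for the conditional prior density function of $X_{ij}$, which we have just derived using Bayes' rule. In the next section, we discuss the likelihood function $p(Y_{Ai}, Y_{Bi}, Y_{Aj}, Y_{Bj} \mid X_{ij}, \mathcal{X}_{ij}, \theta_Y)$.

5.1.5 Computation of Likelihood Density Function

In this section, we describe how we derive the likelihood function in three steps. Here, we assume that gene expressions in a group follow a normal distribution. The derivations can be rewritten if gene expressions follow some other distribution.

Likelihood for a single gene. Consider a set of measurements $z_i = \{z_{i1}, z_{i2}, \dots, z_{in}\}$ for a gene $g_i$ that follow a single Gaussian distribution. We denote the latent mean of $z_i$ by $\mu$ and the standard deviation by $\sigma$. As different genes can have different average expressions, we assume that $\mu$ follows a genome-wide distribution with mean $\mu_0$ and standard deviation $\tau$ [98]. Thus, for $z_i$, the likelihood of the data points in that group is given by

\[ L(z_i \mid \mu_0, \sigma^2, \tau^2) = \int \Big[ \prod_{l=1}^{n} N(z_{il} \mid \mu, \sigma^2) \Big] N(\mu \mid \mu_0, \tau^2)\, d\mu = \frac{\sigma}{(\sqrt{2\pi}\,\sigma)^n \sqrt{n\tau^2 + \sigma^2}} \exp\Big( -\frac{\sum_{l} z_{il}^2}{2\sigma^2} - \frac{\mu_0^2}{2\tau^2} \Big) \exp\Bigg( \frac{\frac{\tau^2 n^2 \bar{z}_i^2}{\sigma^2} + \frac{\sigma^2 \mu_0^2}{\tau^2} + 2 n \bar{z}_i \mu_0}{2(n\tau^2 + \sigma^2)} \Bigg) \qquad (5) \]

where $\bar{z}_i$ is the sample mean of the $n$ measurements. The derivation of Equation 5 can be obtained from Demichelis et al. [99]. If a gene is DE, its expression measurements in the control and non-control groups follow different distributions [98]. For equally expressed genes, on the other hand, all the measurements in both groups share the same mean. With $y_i$ and $y'_i$ denoting the control and non-control measurements of $g_i$, the likelihood function for a DE gene $g_i$ in group $D_T$, $T \in \{A, B\}$, is given by

\[ L^T_{DE}(g_i) = L(y_i \mid \mu_0, \sigma^2, \tau^2)\; L(y'_i \mid \mu_0, \sigma^2, \tau^2) \qquad (5) \]

Similarly, for EE genes it is given by

\[ L^T_{EE}(g_i) = L(y_i \cup y'_i \mid \mu_0, \sigma^2, \tau^2) \qquad (5) \]

For instance, the likelihood of a gene being DE in group $D_A$ is given by $L^A_{DE}(g_i)$.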
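The closed form above can be evaluated directly in log space, which avoids underflow when a group contains many samples. The sketch below is an illustration under the stated Gaussian assumptions; the function names are hypothetical and the parameter values in the usage note are made up for the example.

```python
import math
from typing import Sequence


def log_marginal_likelihood(z: Sequence[float], mu0: float, sigma: float, tau: float) -> float:
    """Log of L(z | mu0, sigma^2, tau^2): Gaussian data with a Gaussian prior on the mean."""
    n = len(z)
    zbar = sum(z) / n
    s2, t2 = sigma ** 2, tau ** 2
    # log of sigma / ((sqrt(2*pi)*sigma)^n * sqrt(n*tau^2 + sigma^2))
    log_prefactor = (math.log(sigma)
                     - n * math.log(math.sqrt(2 * math.pi) * sigma)
                     - 0.5 * math.log(n * t2 + s2))
    # first exponential: -sum(z^2)/(2 sigma^2) - mu0^2/(2 tau^2)
    first = -sum(v * v for v in z) / (2 * s2) - mu0 ** 2 / (2 * t2)
    # second exponential: completed-square term from integrating out the mean
    second = ((t2 * n * n * zbar * zbar / s2 + s2 * mu0 * mu0 / t2 + 2 * n * zbar * mu0)
              / (2 * (n * t2 + s2)))
    return log_prefactor + first + second


def log_likelihood_de(control: Sequence[float], treated: Sequence[float],
                      mu0: float, sigma: float, tau: float) -> float:
    """DE: control and non-control samples each get their own latent mean."""
    return (log_marginal_likelihood(control, mu0, sigma, tau)
            + log_marginal_likelihood(treated, mu0, sigma, tau))


def log_likelihood_ee(control: Sequence[float], treated: Sequence[float],
                      mu0: float, sigma: float, tau: float) -> float:
    """EE: all samples share one latent mean, so the two sets are pooled."""
    return log_marginal_likelihood(list(control) + list(treated), mu0, sigma, tau)
```

For a single measurement the expression reduces to a normal density with variance $\sigma^2 + \tau^2$, which gives a quick sanity check. When two groups are clearly separated the DE form scores higher, and when they overlap the pooled EE form wins because it integrates over one latent mean instead of two.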


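Likelihood for a regulation variable. For a gene $g_i$, the regulation variable $Z_i$ can assume four different values, from 1 to 4. The likelihood that a gene is DR or ER therefore also takes four different forms, given by

\[ L_Z(g_i) = \begin{cases} L^A_{DE}(g_i)\, L^B_{DE}(g_i), & \text{if } Z_i = 1 \\ L^A_{DE}(g_i)\, L^B_{EE}(g_i), & \text{if } Z_i = 2 \\ L^A_{EE}(g_i)\, L^B_{DE}(g_i), & \text{if } Z_i = 3 \\ L^A_{EE}(g_i)\, L^B_{EE}(g_i), & \text{if } Z_i = 4 \end{cases} \]

Likelihood for an interaction variable. The likelihood of $X_{ij}$ has 16 different forms, one for each of its 16 values. Here, we derive it only for $X_{ij} = 1$, as the derivation for the other values of $X_{ij}$ is similar.

\[ p(Y_{Ai}, Y_{Bi}, Y_{Aj}, Y_{Bj} \mid X_{ij} = 1, \mathcal{X}_{ij}, \theta_Y) = \sum_{a, b \in \{1,\dots,4\}} p(Y_{Ai}, Y_{Bi}, Y_{Aj}, Y_{Bj} \mid Z_i = a, Z_j = b, \theta_Y)\; p(Z_i = a, Z_j = b, \theta_Y \mid X_{ij} = 1, \mathcal{X}_{ij}, \theta_Y) \qquad (5) \]

From the definition of $X_{ij}$, $p(Z_i = a, Z_j = b, \theta_Y \mid X_{ij} = 1, \mathcal{X}_{ij}, \theta_Y)$ equals 1 when $Z_i = 1$ and $Z_j = 1$. Its value is zero for all other values of $Z_i$ and $Z_j$. So, continuing from the last step of Equation 5,

\begin{align*}
p(Y_{Ai}, Y_{Bi}, Y_{Aj}, Y_{Bj} \mid X_{ij} = 1, \mathcal{X}_{ij}, \theta_Y) &= p(Y_{Ai}, Y_{Bi}, Y_{Aj}, Y_{Bj} \mid Z_i = 1, Z_j = 1, \theta_Y) \\
&= p(Y_{Ai}, Y_{Bi} \mid Z_i = 1, Z_j = 1, \theta_Y)\; p(Y_{Aj}, Y_{Bj} \mid Z_i = 1, Z_j = 1, \theta_Y) \\
&= p(Y_{Ai}, Y_{Bi} \mid Z_i = 1, \theta_Y)\; p(Y_{Aj}, Y_{Bj} \mid Z_j = 1, \theta_Y) \\
&= L_Z(g_i)\, L_Z(g_j) \qquad (5)
\end{align*}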
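The case analysis above composes mechanically: each regulation state selects one DE or EE likelihood per group, and an interaction state multiplies the two gene-level terms. The sketch below works in log space and takes the per-group log-likelihoods as given inputs. The mapping from $X_{ij}$ to $(Z_i, Z_j)$ for values other than $X_{ij} = 1$ assumes the numbering $X_{ij} = 4(Z_i - 1) + Z_j$, which is an assumption and not stated in this section; the function names are hypothetical.

```python
from typing import Dict, Tuple

# Regulation state Z -> (state in D_A, state in D_B), following the four cases of L_Z.
Z_STATES: Dict[int, Tuple[str, str]] = {
    1: ("DE", "DE"),
    2: ("DE", "EE"),
    3: ("EE", "DE"),
    4: ("EE", "EE"),
}

LogLik = Dict[Tuple[str, str], float]  # keys: (group, state), e.g. ("A", "DE")


def log_lz(z: int, log_lik: LogLik) -> float:
    """log L_Z(g) for regulation state z, from per-group DE/EE log-likelihoods."""
    state_a, state_b = Z_STATES[z]
    return log_lik[("A", state_a)] + log_lik[("B", state_b)]


def log_interaction_likelihood(x_ij: int, log_lik_i: LogLik, log_lik_j: LogLik) -> float:
    """log of L_Z(g_i) * L_Z(g_j) for interaction state x_ij in 1..16 (assumed numbering)."""
    z_i, z_j = (x_ij - 1) // 4 + 1, (x_ij - 1) % 4 + 1
    return log_lz(z_i, log_lik_i) + log_lz(z_j, log_lik_j)
```

For $X_{ij} = 1$ both genes are in state 1, so the result is the sum of the four DE log-likelihoods, matching the last line of the derivation.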


In a similar way, we can derive the likelihood functions for all 16 values of the $X_{ij}$ variables. A special case arises when $g_i$ is the metagene, i.e. $g_0$. We assume that $L^T_{DE}(g_0) = 1$ and $L^T_{EE}(g_0) = 0$, $T \in \{A, B\}$. Thus, the likelihood of the metagene given $Z_0 = 1$ equals 1. Its value is zero otherwise.

5.1.6 Objective Function Optimization

So far, we have described how we compute the posterior density function. The final challenge is to find the values of the hidden variables that maximize the objective function (Equation 5). We develop an iterative algorithm to address this challenge. Our model has three different sets of parameters. The nodes of the MRF, given by $X$, constitute one set. The other two sets are the parameters of the conditional probability density function of $X_{ij}$ and of the likelihood function of the observed data, given by $\theta_X = \{\gamma_1, \dots, \gamma_7\}$ and $\theta_Y = \{\mu_0, \sigma, \tau\}$, respectively. In each iteration, we first estimate $\theta_X$ and $\theta_Y$ based on the value of $X$ estimated in the previous iteration. Next, based on the estimated parameters, we estimate the $X$ that maximizes the objective function in Equation 5. The likelihood function is non-convex in terms of the parameters $\theta_Y = \{\mu_0, \sigma, \tau\}$. The conditional density is also non-convex in terms of $\theta_X = \{\gamma_1, \dots, \gamma_7\}$. We use a global optimization method called differential evolution to optimize both of them [100]. To optimize the objective function in Equation 5, we employ the ICM algorithm described by Besag [92]. Briefly, our iterative algorithm works as follows.

1. Obtain an initial estimate of the $S$ variables. In our implementation we use Student's t-test, assuming the data follow a normal distribution. We use a 5% significance level for this purpose.

2. Estimate the parameters $\theta_Y$ that maximize the data likelihood function given by $\arg\max_{\theta_Y} \prod_{X_{ij},\; W_{ij}=1} p(Y_{Ai}, Y_{Bi}, Y_{Aj}, Y_{Bj} \mid X_{ij}, \mathcal{X}_{ij}, \theta_Y)$. We implement this step using Differential Evolution, which is similar to the genetic algorithm.

3. Calculate an estimate of the parameters $\theta_X$ that maximize the conditional prior density function given by $\arg\max_{\theta_X} \prod_{X_{ij},\; W_{ij}=1} p(X_{ij} \mid X - \{X_{ij}\}, \theta_X)$. We also implement this step using Differential Evolution.

4. Carry out a single cycle of ICM using the current estimates of $S$, $\theta_X$ and $\theta_Y$. For all $S_i$, maximize $\prod_{X_{mn}} p(X_{mn} \mid X - X_{mn}, Y, \theta_X, \theta_Y)$ where $X_{mn} \in \{X_{rt} \mid r = i \text{ or } t = i,\; W_{rt} = 1\}$.

5. Go to step 2 for a fixed number of cycles or until $X$ converges to a certain predefined value.

We optimize the objective function in terms of the $S_i$ ($1 \le i \le M$) variables instead of the $X_{ij}$ variables. Specifically, in step 4, we go over all the $S_i$ variables and optimize the $F_{ij}$ function (given by Equation 5) for only those $X_{ij}$ variables that are affected by the change of $S_i$. Figure 5-3 illustrates the different components of CMRF and the connectivity between them. The optimization procedure is guaranteed to converge, since the value of the objective function increases in every iteration. We continue the iterative process until the changes in the parameter estimates between two consecutive iterations fall below a certain cutoff level.

5.2 Results and Discussion

In this section we discuss the experiments we conducted to evaluate the quality of CMRF. We implemented CMRF in MATLAB® and Java™. We obtained the code for Differential Evolution from http://www.icsi.berkeley.edu/~storn/code.html . We compared CMRF with SSEM, as SSEM is one of the most recent methods that consider identifying primarily affected genes in perturbation experiments [20]. We obtained SSEM from http://gardnerlab.bu.edu/SSEMLasso . We executed our code on a Quad-Core AMD Opteron 2 GHz workstation with 32 GB of memory.

Dataset. We used four different sets of data to conduct the experiments in this chapter.


DATASET 1. The first dataset was collected by Smirnov et al. [101]. This dataset was generated using 10 Gy ionizing radiation over immortalized B cells obtained from 155 members of 15 Centre d'Étude du Polymorphisme Humain (CEPH) Utah pedigrees [102]. Microarray snapshots were obtained before (at the zeroth hour) and after (at the second and sixth hours) the application of radiation.

DATASET 2. The second dataset corresponds to a drug response experiment conducted by Taylor et al. [47]. Medications were applied to 36 Caucasian American and 33 African American patients infected with Hepatitis C. Gene expressions were collected before the medication was started and at 1, 2, 7, 14 and 28 days after the medication was administered.

Both datasets 1 and 2 are microarray time series data with more than two time points. We adapted these two time series datasets to create control and non-control data suitable for our experiments. We used the data before perturbation as control data. For the non-control data, we calculated the expected expression of a gene at each time point after the perturbation. We selected the time point with the highest absolute difference from the expected expression of the control data for that gene.

DATASET 3. We created dataset 3 using dataset 1. We used the control group of dataset 1 as the control group of dataset 3. Then, we changed the expression values of some randomly selected genes to model the primary effect of an external perturbation. From that perturbed dataset, we simulated the secondary effects using the sigmoid method described in Garg et al. [114]. We denote the parameter for the primary perturbation effect by deviation. Deviation is the ratio of the change of expression value $\Delta x$ of a gene to its original expression value $x$ (i.e. deviation $= \frac{\Delta x}{x}$), which is normalized between zero and one. We tuned the other parameters of the method to create a meaningful dataset as follows: alpha = 1, = 0.01, kac = 1.0, kin = 1, h = 0.1.

DATASET 4. We create this dataset from dataset 1 in two steps as follows.

SELECTION OF GENES. In order to carry out experiments on larger scale data with known PDR genes, we generated data in the presence of a hypothetical perturbation from the real datasets as follows. We first select three sets of genes. Each set consists of some primarily affected genes and a higher number of secondarily affected genes. Here, we describe how we construct each of the three sets of affected genes. We first select a random gene from the network and label it as a primarily affected DE gene. We then traverse its outgoing neighbors in a breadth-first search manner. As we visit a gene during the traversal, we label it as a secondarily affected DE gene with probability $1 - (1 - q)^{\eta}$, where $\eta$ is the number of incoming DE neighbors. Here $q$ is the probability that a gene is DE due to a DE predecessor (0.4 in our experiments). We repeat these steps to create the desired number of primarily affected genes.


After we obtain the three sets of genes, we assign one set to both the $D_A$ and $D_B$ groups. We assign the other two sets of genes to different groups. These two sets of genes are differentially regulated, as they are affected in only one group and not in the other. The three sets can contain different numbers of primarily and secondarily affected genes. We refer to the genes in these three sets as primarily differentially regulated, secondarily differentially regulated and equally regulated genes.

GENERATION OF GENE EXPRESSION. Once we identify these three sets of genes in the two groups, we create control and non-control data for $D_A$ and $D_B$ over $N$ samples. We use the control part of the real dataset in Smirnov et al. as the control part of our synthetic dataset in both $D_A$ and $D_B$ [101]. To generate the non-control dataset, we traverse each of the genes that participate in the gene networks. Consider a gene $g_i$ whose mean and standard deviation of expression in the control dataset are given by $\mu_i$ and $\sigma_i$, respectively. If the gene is EE, we generate its non-control data points from a normal distribution with parameters $(\mu_i, \sigma_i^2)$. If the gene is DE, we use the same variance as that of the control group but a different mean. For the primarily and secondarily affected genes we use $\mu_i \pm d_p$ and $\mu_i \pm d_s$ respectively, where $d_p > d_s$. To summarize, we use the same variance in the non-control group as in the control group. For an affected gene, however, we change the mean in the non-control group from that in the control group, and we apply a larger deviation of the mean to a primarily affected gene than to a secondarily affected gene.

Regulatory networks. We collected 24,663 genetic interactions from the 105 regulatory and signaling pathways of the KEGG database [103]. Overall, 2,335 genes belong to at least one pathway in KEGG. In our model, we considered only the genes that take part in the gene networks.
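The two steps used to build dataset 4 (selecting affected genes by breadth-first traversal, then sampling non-control expression) can be sketched as follows. This is an illustration, not the code used in the experiments: the function names are hypothetical, every visited gene is expanded whether or not it was labeled DE, and the mean of an affected gene is shifted upward by an additive amount, all of which are assumptions.

```python
import random
from collections import deque
from typing import Dict, List, Sequence, Set, Tuple


def select_affected(out_edges: Dict[str, Sequence[str]], seed_gene: str,
                    q: float, rng: random.Random) -> Tuple[Set[str], Set[str]]:
    """Label one primary gene, then label downstream genes DE with prob. 1-(1-q)^eta."""
    in_edges: Dict[str, List[str]] = {}
    for src, targets in out_edges.items():
        for dst in targets:
            in_edges.setdefault(dst, []).append(src)

    primary, secondary = {seed_gene}, set()
    visited = {seed_gene}
    queue = deque([seed_gene])
    while queue:
        gene = queue.popleft()
        for nb in out_edges.get(gene, ()):
            if nb in visited:
                continue
            visited.add(nb)
            # eta: incoming neighbors that are already labeled DE
            eta = sum(1 for p in in_edges.get(nb, ()) if p in primary or p in secondary)
            if rng.random() < 1 - (1 - q) ** eta:
                secondary.add(nb)
            queue.append(nb)
    return primary, secondary


def non_control_samples(mu: float, sigma: float, shift: float, n: int,
                        rng: random.Random) -> List[float]:
    """Draw n non-control values: same variance as control, mean shifted by `shift`.

    Use shift = 0 for EE genes, d_p for primary and d_s for secondary genes (assumed additive).
    """
    return [rng.gauss(mu + shift, sigma) for _ in range(n)]
```

A usage example would call `select_affected` once per desired primary gene with `q = 0.4`, then call `non_control_samples` per gene with the control mean and standard deviation estimated from the real control data.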
5.2.1 Comparison to Other Methods

Our method provides a list of differentially regulated genes. We sort the genes in that list as follows. Consider a DR gene $g_i$ that is DE in $D_A$ and EE in $D_B$. We calculate the likelihood of that gene being EE in $D_A$ and DE in $D_B$. We can interpret this quantity as the probability of being DR, but in the reverse way. We could instead use the probability that the gene is DE in $D_A$ and EE in $D_B$. However, according to our observation, the former metric provides much better accuracy. We sort all the DR genes in increasing order of that likelihood.


To our knowledge, no other method differentiates between the primary and secondary effects in a two-group perturbation experiment. There exist some studies on identifying primarily affected genes in single-group datasets. We compared the accuracy of CMRF to three such methods, namely SMRF, Student's t-test and SSEM.

Experimental setup. Given an input dataset, we ranked all the genes using each of the four methods. Highly ranked genes have a higher chance of being PDR according to each method. However, as the other three methods are not tailored to solve this problem, we created separate rankings on $D_A$ and $D_B$. Then, out of those two rankings, we created a unified ranking of differentially regulated genes. We first explain how we create the rankings on the individual groups $D_A$ and $D_B$ for the other three methods, and then describe the unified ranking.

SMRF. We apply SMRF to each group separately and obtain a set of differentially expressed genes. We sort the genes in decreasing order of joint likelihood with the metagene. A higher joint likelihood implies a higher chance of being primarily affected.

SSEM. We train SSEM on the control dataset, where it learns the correlation between the genes. We test SSEM on the non-control dataset of each group, where it produces a ranking for each single data point.

STUDENT'S T-TEST. We use the function called ttest2 from MATLAB®. We apply it to every individual gene, where it takes the control and non-control datasets as input and produces a p-value as output. The null hypothesis is that the gene is EE, so a substantially lower p-value implies a higher chance of being primarily affected. We perform the test on all the genes and rank them in increasing order of p-value.

Now we describe how we create a unified ranking of differentially regulated genes for these three methods. We denote the rankings from data groups $D_A$ and $D_B$ by $R_A$ and $R_B$, respectively. The unified ranking is denoted by $R_U$. We denote the number of genes in each ranking by $\omega_A$ and $\omega_B$, respectively. We scan both rankings simultaneously from the first position to $\omega = \min(\omega_A, \omega_B)$. While scanning the $k$th position, we denote the equally regulated set obtained up to that position by $\Lambda_k = R_A(1{:}k) \cap R_B(1{:}k)$. We include $R_T(k)$ in the unified ranking $R_U$ if $R_T(k) \notin \Lambda_k$, $T \in \{A, B\}$. For SSEM we obtain a separate $R_U$ for each data point and average the accuracies over all these rankings.

Figure 5-4. Comparison of our method CMRF to SMRF, SSEM and t-test. The number of primarily differentially affected genes is 40. The values for deviation (maximum perturbation to the PDR genes) are 0.6 and 0.8. The figures indicate that CMRF outperforms SMRF, SSEM and t-test.

Results. In this experiment we used dataset 3, which we have just described. To observe the accuracy of CMRF at varying degrees of difficulty, we conducted experiments with four different values of deviation, namely $\{0.5, 0.6, 0.7, 0.8\}$. However, we discuss only two of them in this chapter (see Figure 5-4), since the results for the other two values are similar. The results we discuss correspond to the cases when deviation $= \{0.6, 0.8\}$. Figures 5-4(a) and 5-4(b) show the sensitivity of the four methods with the two deviation settings. The former corresponds to the harder case, as the difference between the non-control groups of primarily and secondarily affected genes is small. As the deviation increases, identifying primarily affected genes becomes easier. From the figure, we observe that CMRF is consistently and significantly more accurate than the other three methods for all datasets. It reaches almost 50% sensitivity (i.e., it can find around 15-18 primarily affected genes out of 30) in the top 50 ranked genes when the deviation is 0.6. It achieves a sensitivity of 0.6 when the deviation is 0.8. We obtained similar results for other deviations, which we do not discuss here. SMRF reaches 30% and 40% accuracy, but at a slower pace. The t-test reaches around 25% and 30% sensitivity at ranking position 50 for these two cases, respectively. SSEM's sensitivity is below 0.1 for all experiments, even within the top 50 positions.

We believe that there are three major factors behind the success of our method over the competing methods. First, the other methods do not handle two groups of datasets simultaneously and are not able to generate a unified ranking of differentially regulated genes. CMRF encompasses both groups in a single model and probabilistically determines the PDR genes. Hence, it is better shielded against the false positives introduced during the unification of rankings. Second, CMRF can successfully incorporate the gene interactions using MRFs, while the others ignore this information. Finally, in real perturbation experiments, multiple genes are often primarily affected. CMRF is capable of dealing with both large and small numbers of primarily affected genes, while the performance of the other methods deteriorates as the number of primarily affected genes grows. Thus, we conclude that our method is more suitable for real perturbation experiments.

5.2.2 Statistical Significance Experiment

The experiments in the last section enable us to compare the accuracy of CMRF with that of the other methods on synthetic datasets. We also wanted to evaluate the accuracy of CMRF on a real dataset. However, no gold standard is available that lists the true set of PDR genes. Hence, we conducted a set of statistical significance experiments to estimate the confidence of our accuracy. Specifically, we obtained the control data from a real dataset and perturbed it in a controlled way for a number of genes. We calculated the likelihood probabilities of those genes and created a distribution. We repeated this process with varying amounts of perturbation. Finally, we executed CMRF on a real dataset and analyzed the result.
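The unified ranking used in Section 5.2.1 for the single-group baselines (scan $R_A$ and $R_B$ in parallel and keep a gene only if it is not yet in the intersection of the two prefixes) can be sketched as follows. This is a literal reading of that description: the function name is hypothetical, and genes added at an earlier position are not retracted if they later turn out to be common to both prefixes, which is an assumption.

```python
from typing import List, Sequence


def unified_rank(rank_a: Sequence[str], rank_b: Sequence[str]) -> List[str]:
    """Merge two per-group rankings, skipping genes ranked highly in both groups."""
    omega = min(len(rank_a), len(rank_b))
    unified: List[str] = []
    prefix_a, prefix_b = set(), set()
    for k in range(omega):
        prefix_a.add(rank_a[k])
        prefix_b.add(rank_b[k])
        common = prefix_a & prefix_b          # equally regulated set up to position k
        for gene in (rank_a[k], rank_b[k]):
            if gene not in common:
                unified.append(gene)
    return unified
```

For example, with `rank_a = ["g1", "g2", "g3"]` and `rank_b = ["g2", "g4", "g1"]`, the gene `g2` is skipped at the second position of `rank_a` because it already appears in both prefixes, while `g4` and `g3` are kept.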


Results. We obtained the real dataset from the drug response experiment conducted by Taylor et al. [47], i.e. dataset 2. Apart from this real dataset, we created different versions of dataset 4 by varying $d_p$ over $\{0.1, 0.2, 0.3, \dots, 3.0\}$. If $d_p > 1.1$, we set $d_s$ to 1; otherwise $d_s = 0.5\, d_p$. Thus, we have 30 synthetic datasets in total. In every dataset, we fix the number of primarily and secondarily differentially regulated genes to 50 and 172, respectively. To decide whether a gene $g_i$ is DR when $g_i$ is DE in $D_A$ and EE in $D_B$, we define a null hypothesis $H_{0i}$: $g_i$ is DR, but in the reverse way, i.e. $g_i$ is EE in $D_A$ and DE in $D_B$. We calculate the likelihood of that gene being EE in $D_A$ and DE in $D_B$, as described. For gene $g_i$, we denote the log-likelihood of accepting $H_{0i}$ by $LL_i$. In every dataset, we create a box plot of the 50 $LL_i$ values, as the number of DR genes in each dataset is 50. A lower value of $LL_i$ indicates that $g_i$ has a higher probability of being differentially regulated. Figure 5-5 illustrates the statistical significance of the experiments over the datasets with $d_p = 1.2$ to $2.0$. The box plot demonstrates a relationship between the P-value and $d_p$. A higher value of $d_p$ indicates a lower P-value and hence a higher chance of being PDR. We also observe that the variance of the P-value increases with $d_p$.

We also executed CMRF on the real datasets without any modification. Interestingly, on the real dataset from Taylor et al. [47] (dataset 2), we did not obtain any differentially regulated genes. A careful examination shows that when both the number of data points and the gap $d_p$ (i.e. the signal-to-noise ratio) are low, the coefficients $\gamma_6$ and $\gamma_7$ in the prior density become strong and all genes are identified as equally regulated. However, when either the number of data points or $d_p$ is sufficiently high, the data can overcome the prior. In the current real dataset, the number of data points is only 33 and the gaps between the control and non-control groups were less than 1.2. As a result, CMRF identifies no differentially regulated genes in the dataset. Thus, we can conclude that either there is not much difference between the two groups in the real data, or the data does not contain enough data points for our model to highlight the difference between the two groups. In Figure 5-5 we present the results for $d_p = 1.2$ to $2.0$. Note that for $d_p < 1.2$ our model did not identify any DR genes. Here too, we attribute the absence of DR genes to both $d_p$ and the number of data points being small. On the other hand, in Section 5.2.1, when we executed CMRF on synthetic datasets with 155 data points, we were able to identify a substantial number of true PDR genes even with $d_p = 0.02$.

Figure 5-5. Illustration of the statistical significance test. The box plot shows the P-value of the null hypothesis for the DR genes over the synthetic datasets. From the plot we conclude that a larger gap between the control and non-control groups of a DR gene leads to a lower P-value. The genes with a lower P-value have a higher chance of being primarily differentially regulated.

To substantiate our conclusion that there is little difference between the two groups in the real dataset, we conducted a set of permutation tests. We shuffled the two original groups to create new sets of data. We repeated this process a number of times (40 in the present experiment) and executed CMRF on each of them. For every derived dataset, CMRF did not find any DR genes. Hence, this experiment bolsters the claim that there are no DR genes in the original real data.

Two questions arise. Are there indeed no DR genes in the real dataset from Taylor et al. [47]? And will our method be able to detect DR and PDR genes in other, similar real datasets? We believe that CMRF requires a bigger dataset for DR and PDR genes to be discovered. For example, CMRF is able to identify the DR and PDR genes in the synthetic dataset, which contains a substantially higher number of data points than the real dataset. Since the difference between the control and non-control groups of a DE gene is small compared to the variance of the data points, it is difficult to detect that subtle effect of perturbation with a small dataset. For a small dataset, the prior due to the third hypothesis becomes strong and the two corresponding parameters $\gamma_6$ and $\gamma_7$ assume extreme values. Thus the support from the data is not sufficient to overcome the prior, and the method is not able to identify the DR and PDR genes.

There are two ways to overcome this problem. The first is to employ a bigger dataset. With the advancement of comparatively inexpensive, high-throughput technologies, bigger datasets are increasingly common. From that perspective, CMRF is expected to perform more accurately in the near future. The second option is to restrict the growth of the two parameters $\gamma_6$ and $\gamma_7$. If we have knowledge about the values of these two parameters, we can supply them as input to the program and refrain from estimating them. This enables us to employ a comparatively non-informative prior, which is easier for the data to overcome. We can also impose specific bounds on those variables while estimating them, to prevent them from becoming too strong.
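The permutation test described above (pool the two groups, reshuffle the samples into two new groups of the original sizes, rerun the detector, and repeat) can be sketched as follows. The `detect` callback stands in for a full CMRF run and is hypothetical; only the count of 40 repetitions comes from the text.

```python
import random
from typing import Callable, List, Sequence, Tuple, TypeVar

Sample = TypeVar("Sample")
Result = TypeVar("Result")


def permuted_groups(group_a: Sequence[Sample], group_b: Sequence[Sample],
                    rng: random.Random) -> Tuple[List[Sample], List[Sample]]:
    """Shuffle the pooled samples and split them back into groups of the original sizes."""
    pooled = list(group_a) + list(group_b)
    rng.shuffle(pooled)
    return pooled[:len(group_a)], pooled[len(group_a):]


def permutation_runs(group_a: Sequence[Sample], group_b: Sequence[Sample],
                     detect: Callable[[List[Sample], List[Sample]], Result],
                     n_perm: int = 40, seed: int = 0) -> List[Result]:
    """Run the detector on n_perm shuffled regroupings and collect its outputs."""
    rng = random.Random(seed)
    return [detect(*permuted_groups(group_a, group_b, rng)) for _ in range(n_perm)]
```

If `detect` returns the number of DR genes found, the experiment above corresponds to every entry of the returned list being zero.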


5.3 Discussion

Microarray experiments often measure the expression of genes taken from sample tissues in the presence of external perturbations such as medication, radiation or disease. Typically, in such experiments, gene expressions are measured before and after the application of the external perturbation. In this chapter, we addressed the problem of finding primarily differentially regulated genes in the presence of external perturbations when the data is sampled from two groups. Our probabilistic Bayesian method, based on a Markov Random Field, incorporates the dependency structure of the gene networks as the prior of the model. Experimental results on synthetic and real datasets demonstrated the superiority of CMRF over other, simpler techniques.


CHAPTER 6
SSLPRED: PREDICTING SYNTHETIC SICKNESS LETHALITY

Analysis of gene essentiality is crucial for understanding the roles of different genes at the molecular and genetic levels. A gene is defined as essential if it is required for the proper growth and sustenance of the organism. Essential genes have been thoroughly investigated using techniques such as single gene deletion screening for some simple organisms such as Escherichia coli (E. coli) [115]. Although the identification of essential genes tells us about the functions of individual genes in an organism, it provides little conclusive information about the nature of their genetic relationships in gene regulatory and signaling networks. Recently, studies on Synthetic Sickness Lethality (SSL) opened up new directions in the area of functional genomics. Two non-essential genes have an SSL interaction if their joint deletion leads to a less than expected fitness for the organism. Here fitness denotes the growth and sustenance rate of an organism. The expected fitness corresponds to that of a double mutant when the two knocked-out genes are not in an SSL interaction. Note that the fitness of an organism due to an SSL interaction can be less than (aggravating) or more than (alleviating) the expected fitness [116]. A genome-wide catalog of SSL interactions enables in-depth molecular analysis by creating a functional map of the cell, predicting functions and relations of the genes, and deciphering complex regulatory relations from the global genetic network [117].

The Synthetic Genetic Array (SGA) [27] and diploid-based synthetic lethality analysis on microarray (dSLAM) [28] are two pioneering approaches that enable systematic identification of SSL interactions. Both methods require generating double mutant strains and monitoring their growth. Each entry of SGA is a triple that consists of two genes and a GI (genetic interaction) score for those two genes. A score close to zero indicates that there is no SSL interaction between the two genes. For a gene pair, a negative GI with a large magnitude indicates an aggravating SSL interaction. A significantly large positive number denotes a higher chance of an alleviating relationship [116]. Figure 6-1A illustrates the concepts of the synthetic sickness and lethality relationship. The EMAP strategy exploits the SGA technique by enabling colony sizes to be measured in an array format, thus quantifying genetic interactions in a high-throughput fashion.

Figure 6-1. Illustration of the concepts of Synthetic Sickness and Lethality and Between Pathway Models. (A) Synthetic Sickness Lethality. (B) Between Pathway Models. Figure 6-1A illustrates the concepts of synthetic sickness and lethality. A double mutant produced from the cross of two single mutants can have a specific fitness in a range of GI scores based on the relationship between the two genes. The single and double circles represent single and double mutants, respectively. The size of a circle corresponds to the observed fitness of the corresponding mutant. Depending on whether the two genes have an epistatic, neutral or SSL interaction, the observed fitness of the double mutant can be more than, equal to or less than the expected fitness. In those cases, the GI score is a significantly large positive number, close to zero, or a significantly large negative number, respectively. Figure 6-1B depicts the concept of Between Pathway Models (BPM). The hypothetical BPM consists of two sub-networks (also called pathways) $G_A$ and $G_B$ that are functionally independent and complementary. The solid lines denote physical interactions, while the dashed directed lines stand for the SSL interactions. The number of SSL edges between $G_A$ and $G_B$ is higher than the number within the two groups.

Both SGA and dSLAM are costly techniques, as for a pair of genes they require creating two single mutant strains and crossing them to produce a double mutant strain. For an organism with $N$ genes, we need to generate and monitor the growth of $\frac{N(N-1)}{2}$ different double mutants. As a result, millions of double mutants need to be produced to tabulate all the genetic interaction scores for an organism that consists of thousands of genes. Creating such double mutants in the wet lab is an expensive and time-consuming process. Therefore, we need an efficient method to predict whether a synthetic lethality relation exists between two genes. Briefly, the problem considered in this chapter is the following: given two genes $g_A$ and $g_B$, what is the GI score between them?

In order to predict the GI score between two genes, we incorporate the genetic profile of single mutant strains. This is a promising strategy, since the number of single mutants cannot exceed the number of genes. First, we elaborate on the term genetic profile. Consider a single gene knockout dataset (also termed single gene mutant data). Here, in each experiment a non-essential gene is knocked out from an organism. For each gene, expressions are obtained before and after the knockout, and the ratio of the after to the before expression is calculated. Finally, the logarithm of that ratio is computed and tabulated. If the magnitude of a logarithm is large, it indicates that the expression of the corresponding gene changed significantly after the knockout of the non-essential gene under consideration. The genetic profile for a single mutant or a single gene knockout experiment consists of the entries for all genes computed in the way described above. In this chapter our objective is to learn the GI scores of gene pairs with the help of the genetic profiles of single mutants. Formally, we solve the following problem.

Problem. Let $V = \{g_1, g_2, \dots, g_M\}$ denote the set of genes in an organism. Assume that we are given the genetic profiles of $K$ single mutant genes. $X$ is a $K \times M$ matrix, where each row corresponds to the genetic profile of a single mutant. Let us represent the GI score of the gene pair $g_a$ and $g_b$ by $t_{a,b}$. Let $T$ denote the set of all the available GI scores for that organism. For any gene pair $(g_i, g_j)$ such that $g_i \in V$, $g_j \in V$, we would like to predict the GI score.
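Two quantities from the discussion above are easy to make concrete: the number of double mutants that exhaustive screening would require, and the genetic profile of one single mutant as a vector of log-ratios. The sketch below is illustrative only; the function names are hypothetical and the natural logarithm is used, so treat the base as an assumption at this point in the text.

```python
import math
from typing import List, Sequence


def num_double_mutants(n_genes: int) -> int:
    """Number of unordered gene pairs, N(N-1)/2."""
    return n_genes * (n_genes - 1) // 2


def genetic_profile(before: Sequence[float], after: Sequence[float]) -> List[float]:
    """Per-gene log-ratio of expression after the knockout to expression before it."""
    return [math.log(a / b) for a, b in zip(after, before)]
```

For an organism with 6,000 genes, `num_double_mutants(6000)` gives 17,997,000 pairs, whereas at most 6,000 single-mutant profiles are needed. A gene whose expression is unchanged by the knockout contributes a zero to the profile.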


Beforediscussingourcontributioninthischapter,wesummarizetheBetweenPathwayModels(BPM),whichisabuildingblockofourmodel[ 30 ].ABPMconsistsoftwogenesubnetworks(alsocalledpathways)GAandGB,suchthattherearefewSSLinteractionswithinGAandwithinGB,butmanyofthosebetweenGAandGB.Theoppositeholdsforthephysicalinteractionedges.Thatis,manyphysicalinteractionsexistwithinGAandGB,butfewofthemexistbetweenGAandGB.Figure 6-1B depictsahypotheticalBPM.AccordingtoKelleyandIdeker,thetwopathwaysinaBPMarefunctionallycompensatingduetotheorientationofgeneticandphysicaledges[ 30 ].Nowthatwehaveintroducedalltherelevantbuildingblocks,wediscussourcontributioninthischapter. Contribution.Inthispaper,wedevelopanewmethodSSLPredtopredicttheGIscores.Toourknowledge,ourmethodistherstonetopredictGIscoresusingamathematicalmachinelearningbasedtechnique.InaccordancewiththeconceptofBPM,weproposethefollowingconjecture.IfthereisanSSLinteractionbetweentwogenesandifthesetwogenesbelongtotwopathwaysofaBPM,thenknockingoutoneofthemwillchangetheexpressionsofmostofthegenesinbothofthepathwaysinthatBPM.Thepathwaycontainingthemutatedgeneisdirectlyaffectedanddysfunctionalasmostoftheconsistinggeneshaveadirectconnectionwiththemutatedgenethroughphysicaledges.Theotherpathwaycompensatesforitsaffectedpair,andduetotheadditionalactivitiesthegenesinitchangetheirexpressionsnoticeably.InourregressionbasedmethodSSLPred,wedevelopamappingbetweenthegeneticprolesofsinglemutantsandthecorrespondingGIscore.Foreverygeneticinteractionentry(ga,gb,ta,b),suchthateitherofgaandgbhasbeenmutatedinasinglemutantgeneexperimentandta,bistheGIscoreforgaandgb,wecreateatrainingsample.Aswehavealreadyconjecturedinthepreviousparagraph,ifthisgeneticinteractionentryrepresentsanSSL,themutatedgeneaffectstheexpressionsofall 108


the genes in the corresponding BPM. Thus, we use the gene expression changes only from the pathways of that BPM to extract the features of the training point, and we correlate them with the corresponding GI score $t_{a,b}$ using a regression model. After we estimate the parameters of SSLPred, we are able to predict the GI score for a new pair of genes. We compare our method to the one by Hescott et al. [31] in their ability to identify BPMs in the gene networks of S. cerevisiae on four benchmark datasets. On average SSLPred performs significantly better than the other method. We summarize our contribution as follows:

1. To our knowledge, SSLPred is the first predictive method for the GI score of a pair of genes. All other relevant computational methods are descriptive.
2. The GI scores predicted by SSLPred are real-valued. This is more useful than a binary prediction, since it enables statistical analysis, such as permutation tests and p-Value generation, in the validation of benchmark BPMs.

The rest of the chapter is organized as follows. Section 6.1 describes our method SSLPred. Section 6.2 presents the experimental results. Finally, Section 6.3 concludes the chapter.

6.1 Methods

In this section, we discuss our method in detail. Section 6.1.1 describes the notation and formulates the problem. Section 6.1.2 explains the conjectures that guide our model and the rationale behind them. Section 6.1.3 discusses the feature extraction and the regression model.

6.1.1 Problem Formulation and Notation

In this section, we mathematically formulate the problem, and for that purpose we describe the relevant notation. We group our notation into three classes based on three related entities: the gene network, the single mutant data and the SGA. Here, gene network stands for gene interaction network, specifically the union of the gene regulatory and signaling networks.


1. GENE NETWORK. The gene network is the union of gene regulatory and signaling networks and can be modeled as a set of genes and the directed edges (i.e., interactions) connecting these genes. Here, an edge between two genes denotes one of several kinds of interaction, such as activation, inhibition and phosphorylation. Let us denote the set of all $M$ genes by $V = \{g_1, g_2, \ldots, g_M\}$. We denote the set of all edges in the gene network by $W = \{(g_i, g_j) \mid g_i \in V, g_j \in V\}$, where $(g_i, g_j)$ implies a directed interaction from $g_i$ to $g_j$. Thus, $G = (V, W)$ defines the gene network.

2. SINGLE MUTANT DATASET. In a single mutant, one gene is mutated in an organism and gene expression is obtained before and after the mutation. A single deletion mutant (also known as a single gene knockout) is an important kind of gene mutant, where one gene is knocked out from an organism. In a single mutant dataset, each entry contains the logarithm of the ratio of the expression of a gene after the knockout to that of the same gene before the knockout [13]. Let $e'_{h,j}$ and $e_{h,j}$ denote the expressions of the gene $g_j$ after and before the mutation of $g_h$, respectively. We define the genetic profile of the organism when gene $g_h$ is mutated by $X_h = \{x_{h,j} \mid x_{h,j} = \ln(e'_{h,j} / e_{h,j}),\ j \in \{1, 2, \ldots, M\}\}$. Let $H \subseteq G$ be the set of all genes that have been mutated. We define the single mutant genetic profile of the $K = |H|$ mutated genes as $X = \{X_h \mid g_h \in H\}$.

3. SYNTHETIC GENE ARRAY. An SGA is a set of triples, $T = \{(g_i, g_j, t_{i,j}) \mid i, j \in \{1, 2, \ldots, M\},\ i \neq j\}$, where $t_{i,j}$ is a real number that corresponds to the ratio of the observed fitness to the expected fitness when the organism has both genes $g_i$ and $g_j$ knocked out. A value with a large magnitude implies a potential SSL edge. Positive and negative values stand for alleviating and aggravating relations, respectively.

Problem Formulation. Given a gene network $G$, the single mutant dataset $X$ and the SGA dataset $T$, find the mapping $(X, G) \mapsto T$ which minimizes a predetermined risk function. The risk function is a measure of the expected misprediction rate. In this chapter, while estimating the mapping function, we minimize the least squares error in order to minimize the expected misprediction rate. Based on the learned mapping, we predict the GI score $t_{i,j}$ for a new double mutant whose two genes $g_i$ and $g_j$ have not been mutated together.
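The notation above maps naturally onto simple containers. The sketch below, under assumed toy data, stores the SGA as a list of triples and keeps only the entries for which at least one gene belongs to the mutated set $H$, since only those can be paired with a genetic profile. The function name `select_training_entries` and all gene names and scores are hypothetical; when both genes were knocked out, the sketch arbitrarily treats the first one as the mutated gene.

```python
def select_training_entries(sga_triples, mutated_genes):
    """Keep SGA triples (g_i, g_j, t_ij) usable as training samples.

    A triple is usable when at least one of its genes was knocked out in
    the single mutant data. The returned triple lists the knocked-out
    gene first, so the matching profile X_h is easy to look up.
    """
    selected = []
    for g_i, g_j, score in sga_triples:
        if g_i in mutated_genes:
            selected.append((g_i, g_j, score))
        elif g_j in mutated_genes:
            selected.append((g_j, g_i, score))
        # Otherwise neither gene has a genetic profile: discard the entry.
    return selected


# Toy SGA dataset T and mutated set H (illustrative values only).
sga = [("g1", "g2", -3.2), ("g3", "g4", 0.1), ("g4", "g1", 2.5)]
mutated = {"g1"}

training = select_training_entries(sga, mutated)
print(training)
```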
6.1.2 Between Pathway Conjectures

In this section, we describe the two conjectures that are central to SSLPred and the rationale for them. These two conjectures are built on the concept of BPMs.
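Because both conjectures rest on the BPM structure, a short sketch of the edge pattern that characterizes a BPM may be useful: SSL edges should fall mostly between the two pathways, physical edges mostly within them. The function `bpm_edge_counts` and the toy pathways and edges are assumptions made for this illustration; the pathways are assumed to be disjoint and the edges are treated as undirected.

```python
def bpm_edge_counts(pathway_a, pathway_b, edges):
    """Count edges lying within either pathway and edges crossing between them."""
    a, b = set(pathway_a), set(pathway_b)
    within = 0
    between = 0
    for u, v in edges:
        if (u in a and v in a) or (u in b and v in b):
            within += 1
        elif (u in a and v in b) or (u in b and v in a):
            between += 1
        # Edges touching genes outside both pathways are ignored.
    return within, between


# Toy candidate BPM with three genes per pathway.
pathway_a = {"a1", "a2", "a3"}
pathway_b = {"b1", "b2", "b3"}
physical_edges = [("a1", "a2"), ("a2", "a3"), ("b1", "b2"), ("b2", "b3"), ("a3", "b1")]
ssl_edges = [("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a1", "b3"), ("a1", "a3")]

physical_counts = bpm_edge_counts(pathway_a, pathway_b, physical_edges)
ssl_counts = bpm_edge_counts(pathway_a, pathway_b, ssl_edges)
print("physical (within, between):", physical_counts)
print("SSL (within, between):", ssl_counts)
```

In this toy case the physical edges are mostly within the pathways and the SSL edges mostly between them, which is the pattern a BPM is expected to show.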


The motivation behind these conjectures was to incorporate the structure and properties of BPMs into our model in order to improve its prediction accuracy.

Conjecture 1. Let $B$ denote a BPM consisting of two pathways $G_A$ and $G_B$. Also, consider an SSL edge $S = \{g_a, g_b\}$ such that $g_a \in G_A$ and $g_b \in G_B$. Then, mutating $g_a$ will significantly alter the expressions of many genes in $G_A$ and $G_B$.

Since $g_a$ is connected to most other genes in $G_A$ through physical interactions, altering the expression level of $g_a$ will affect the expression of all the genes connected to $g_a$, as they regulate each other. This effect will propagate through the gene network and may eventually change the expressions of many genes in $G_A$. Eventually the mutation of $g_a$ will severely affect $G_A$ and prevent it from working properly. Since $G_A$ and $G_B$ constitute a BPM, $G_B$ will compensate for this loss by changing the expression of the genes in $G_B$. Thus, mutating $g_a$ eventually changes the expressions of the genes in both $G_A$ and $G_B$.

From this conjecture, we conclude that there is a mapping between an SGA entry $(g_a, g_b, t_{a,b})$ and the corresponding single mutant dataset $X_h$, $g_h \in \{g_a, g_b\}$. The mapping is non-trivial if the SGA entry corresponds to an SSL or epistatic relationship, in which case we have a higher chance of finding $g_a$ and $g_b$ embedded in the two pathways of a BPM. In that case, most of the genes in that BPM are expected to have their expression changed in the single mutant dataset, and an appropriate regression method can correlate the changes in the single mutant gene expressions with the corresponding GI score.

Before stating the second conjecture, we define a relevant term, neighbor. We say that gene $g_b$ is an $r$th layer incoming neighbor of gene $g_a$ in the directed gene network if the shortest path from $g_b$ to $g_a$ consists of $r$ directed edges. In that case, $g_a$ is an $r$th layer outgoing neighbor of $g_b$. Figure 6-2 depicts the incoming and outgoing neighbors for the two genes in a genetic interaction.

Conjecture 2. Let $B$ denote a BPM consisting of two pathways $G_A$ and $G_B$. Consider an SSL edge $S = \{g_a, g_b\}$ such that $g_a \in G_A$ and $g_b \in G_B$. If the expression


of $g_a$ changes significantly, then in $G_A$ the expression change is most prominent in the first layer of neighbors of $g_a$ and gradually decreases with increasing layers. Similarly, in $G_B$ the effect is most prominent for $g_b$ and gradually decreases with increasing layers of neighbors.

In brief, our conjecture is that the effect of a gene knockout gradually wanes as it travels through the gene network. The rationale is that the neighbors close to $g_a$ and $g_b$ have a higher chance of being connected only to the nodes of $B$, whereas the genes in the distant neighborhood are more likely to take part in other pathways. Hence, the closer nodes are more susceptible to a major effect, while the distant neighbors are expected to be partially screened from that effect by their activity in the other pathways. Based on these two conjectures, we build a regression based model that we describe in the next section.

Figure 6-2. This figure depicts the layered neighbor structure around a gene interaction edge $(g_a, g_b)$. $N_{IN}(g_h, r)$ denotes the set of incoming neighbors of gene $g_h$ at layer $r$. The set of outgoing neighbors $N_{OUT}(g_h, r)$ is defined similarly. The example contains only up to 2 layers for each direction and gene. The dotted rectangles denote the putative BPM $(G_A, G_B)$ around the gene interaction edge.

6.1.3 Regression Based Solution

This section describes the customized regression based approach that we developed to build the mapping $(X, G) \mapsto T$, where $X$, $G$ and $T$ denote the single mutant gene expression, the gene network and the GI scores, respectively. Based on the


two conjectures in Section 6.1.2, we extract a set of features for the training and testing samples. We start from the SGA, and for each entry $(g_a, g_b, t_{a,b})$ we create a sample point provided that either $g_a$ or $g_b$ has been mutated in the single mutant dataset available to us; otherwise we discard that SGA entry. Without loss of generality, assume that $g_a$ has been mutated. Thus, we extract the feature functions from the single mutant data $X_a = \{x_{a,1}, x_{a,2}, \ldots, x_{a,M}\}$.

In designing the set of features, we leverage the information from gene networks by incorporating the two conjectures in our solution. According to the first conjecture, the mutated gene is supposed to perturb only the genes in the host BPM. Thus, while processing the SGA entry $(g_a, g_b, t_{a,b})$, we consider only the genes from $G_A$ and $G_B$ and discard the ones from $G - (G_A \cup G_B)$. We use the GI score $t_{a,b}$ as the label of the training sample.

Note that when we create the features for a training point, all the data we have are the single mutant data, the GI scores and the gene network. For a specific pair of genes $(g_a, g_b)$ we do not know the set of genes that makes up the putative BPM $B = (G_A, G_B)$ around the gene pair. Rather, we are supposed to validate that information using our model. In fact, if the SGA entry does not correspond to an SSL, there may not be a real BPM for the pair $(g_a, g_b)$. To circumvent this problem, we treat the BPM as part of our model rather than as an input to the model. Specifically, we use the concept of $r$th layer neighbors introduced in Section 6.1.2. Let $R$ represent the maximum number of layers used to construct $G_A$ and $G_B$. (Usually, $R$ will be set by the user.) Let us denote the $r$th layer incoming and outgoing neighbors of gene $g_h$ by $N_{IN}(g_h, r)$ and $N_{OUT}(g_h, r)$, respectively. Figure 6-2 demonstrates the layered structure of incoming and outgoing neighbors for a pair of genes. We define the putative BPM pathway for $g_h$ as the union of the sets of incoming and outgoing neighbors of $g_h$


up to layer $R$, given by

$$G_H = \bigcup_{r=1}^{R} \bigl( N_{IN}(g_h, r) \cup N_{OUT}(g_h, r) \bigr) \tag{6-1}$$

In a comprehensive SGA dataset each GI score is a real-valued number that varies within a small range, such as -5 to +5. If the score $t_{a,b}$ has a small magnitude (close to zero), the gene pair $(g_a, g_b)$ may not have an SSL/epistatic interaction and may not be part of a BPM. Since the GI score $t_{a,b}$, which is the label of the regression model, is real-valued, we still use this sample point to train our model. In that case the regression model is not expected to discover any interesting BPM pattern in the gene expression, and it will adjust its parameters accordingly.

To incorporate the second conjecture, we design the features of the regression in a layered fashion that directly depends on the concept of layered neighbors introduced in Section 6.1.2. We denote the feature function associated with the incoming neighbors of layer $r$ of gene $g$ by $\phi_{IN}(N_{IN}(g, r))$ and the corresponding regression parameter by $w_{IN}(g, r)$. Similarly, the feature function and parameter for the $r$th layer outgoing neighbors are given by $\phi_{OUT}(N_{OUT}(g, r))$ and $w_{OUT}(g, r)$, respectively. Thus, for $J \in \{IN, OUT\}$ the feature function $\phi_J(N_J(g_c, r))$ corresponds to the neighbors of gene $g_c$ in direction $J$ at layer $r$. Given that $g_h$ has been knocked out and we are considering the neighborhood of $g_c$, $\phi_J(N_J(g_c, r))$ is defined as follows:

$$\phi_J(N_J(g_c, r)) = \frac{\sum_{g_i \in N_J(g_c, r)} |x_{h,i}|}{|N_J(g_c, r)|} \tag{6-2}$$

We define another feature function for $g_b$, denoted $\phi(g_b)$, with the corresponding parameter $w_\phi$. However, we do not create any feature function to capture the expression of $g_a$, since $g_a$ is mutated and its expression may not be available for inspection. Finally, we create the last parameter $w_0$, which acts as a bias constant in the model. Table 6-1 summarizes the feature functions and the corresponding parameters. By aggregating all


these feature functions, we can fit the SGA entry $(g_a, g_b, t_{a,b})$, where $g_a$ is knocked out in the single mutant data, as

$$y_{a,b} = w_0 + w_\phi\, \phi(g_b) + \sum_{r \in \{1, 2, \ldots, R\},\ J \in \{IN, OUT\},\ c \in \{a, b\}} w_J(g_c, r)\, \phi_J(N_J(g_c, r)) \tag{6-3}$$

Table 6-1. Summary of the feature functions of the regression model and the corresponding parameters. A feature function represents one of the features of the regression. A parameter quantifies the strength of the corresponding feature function.

Feature function                Parameter          Description
$\phi_{IN}(N_{IN}(g, r))$       $w_{IN}(g, r)$     For incoming neighbors of the $r$th layer for gene $g$.
$\phi_{OUT}(N_{OUT}(g, r))$     $w_{OUT}(g, r)$    For outgoing neighbors of the $r$th layer for gene $g$.
$\phi(g_b)$                     $w_\phi$           For gene $g_b$ when considering $(g_a, g_b)$, where $g_a$ is knocked out.
(none)                          $w_0$              A constant representing the bias of the regression.

Parameter Estimation. When the ratio of the number of samples to the number of parameters is small (typically less than 20), the estimated parameter values experience high variance due to overfitting [118]. To alleviate this problem, we add a regularization term to the regression model. Specifically, we aim to minimize the difference between the parameter values at neighbor layers $r$ and $r+1$, so as to smooth the decay of the gene expression change. Formally, the regularization term can be written as

$$Q = \sum_{r \in \{1, 2, \ldots, R-1\},\ J \in \{IN, OUT\},\ c \in \{a, b\}} \bigl| w_J(g_c, r+1) - w_J(g_c, r) \bigr| \tag{6-4}$$

We add this regularization term to the objective function when estimating the parameters. Using the least squares error approach, we estimate the parameters of the regression by minimizing the following,


$E = \sum_{a, b \in \{1, 2, \ldots, M\},\ a\, \ldots}\ \ldots$
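The layered neighbor sets, the feature functions of Equation 6-2, the linear predictor of Equation 6-3 and the regularizer of Equation 6-4 can be sketched together as below. This is a minimal illustration, not the chapter's implementation. Several choices are assumptions: the feature for $g_b$ is taken to be $|x_{a,b}|$, an empty neighbor layer contributes 0, genes without a profile entry (such as the knocked-out gene) are skipped, and the squared error and $Q$ are combined with a hypothetical weight `lam`, since the objective itself is cut off in this copy of the text. All names and numbers are illustrative.

```python
from collections import deque


def layered_neighbors(adj, source, max_layer):
    """Breadth-first search: map r -> nodes whose shortest path from source has r edges."""
    dist = {source: 0}
    layers = {r: set() for r in range(1, max_layer + 1)}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == max_layer:
            continue
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                layers[dist[v]].add(v)
                queue.append(v)
    return layers


def mean_abs_change(profile, genes):
    """Equation 6-2, averaged over the genes that have a profile entry (assumption)."""
    values = [abs(profile[g]) for g in genes if g in profile]
    return sum(values) / len(values) if values else 0.0


def features(edges, profile_a, g_a, g_b, max_layer):
    """Feature vector: bias, g_b term, then (a, b) x (IN, OUT) x layers 1..R."""
    out_adj, in_adj = {}, {}
    for u, v in edges:
        out_adj.setdefault(u, []).append(v)
        in_adj.setdefault(v, []).append(u)  # reversed graph gives incoming layers
    feats = [1.0, abs(profile_a[g_b])]
    for g_c in (g_a, g_b):
        for adj in (in_adj, out_adj):
            layers = layered_neighbors(adj, g_c, max_layer)
            for r in range(1, max_layer + 1):
                feats.append(mean_abs_change(profile_a, layers[r]))
    return feats


def predict(weights, feats):
    """Equation 6-3 as a dot product of parameters and features."""
    return sum(w * f for w, f in zip(weights, feats))


def smoothness_penalty(weights, max_layer):
    """Equation 6-4: absolute differences between consecutive layer weights."""
    total = 0.0
    for block in range(4):  # (a, IN), (a, OUT), (b, IN), (b, OUT)
        start = 2 + block * max_layer
        for r in range(max_layer - 1):
            total += abs(weights[start + r + 1] - weights[start + r])
    return total


def objective(weights, samples, max_layer, lam):
    """Squared error plus lam * Q; the weighting by lam is an assumption."""
    squared = sum((t - predict(weights, f)) ** 2 for f, t in samples)
    return squared + lam * smoothness_penalty(weights, max_layer)


# Toy directed network around the pair (a, b); gene a is knocked out.
edges = [("a", "a1"), ("a1", "a2"), ("p", "a"), ("b", "b1"), ("q", "b")]
profile_a = {"a1": 2.0, "a2": -1.0, "p": 0.5, "b": -1.5, "b1": 1.0, "q": 0.0}
R = 2

feats = features(edges, profile_a, "a", "b", R)
weights = [0.0, -1.0, 0.2, 0.1, -0.5, -0.3, 0.0, 0.0, -0.4, -0.2]
y_ab = predict(weights, feats)
value = objective(weights, [(feats, -3.0)], R, lam=0.1)
print(feats, y_ab, value)
```

With $R$ layers the model has $4R + 2$ parameters, which is why the ratio of samples to parameters stays manageable for small $R$.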

by Hughes et al. [13]. Each experiment contains 6,316 entries. Every entry contains the logarithm of the after-to-before ratio of the expression of a gene, as described in Section 6.1.1.

3. SGA DATA. Costanzo et al. generated a genome-scale SGA profile for S. cerevisiae with nearly 5.4 million genetic interactions covering nearly 75% of the genes [117]. Out of this comprehensive profile, we selected GI scores for 370,913 interactions such that for every edge, at least one of the two genes was knocked out in the gene knockout experiments.

4. BPMS. We obtained four sets of BPMs, all for S. cerevisiae: Kelley-Ideker [30], Ulitsky-Shamir [123], Brady et al. [124], and Ma et al. [125]. We refer to each dataset by the authors' names of the corresponding paper. The numbers of BPMs that contain three or more genes in each pathway in these datasets are 160, 36, 959 and 54, respectively.

6.2.2 Comparison with Hescott's Method

This section describes the comparison between SSLPred and the method proposed by Hescott et al. [31, 121]. Hescott et al. employ microarray expression data of single gene knockout experiments to identify BPMs. Though their method does not predict GI scores, to our knowledge this is the only published method that integrates the concepts of single gene mutants and between pathway motifs.

Before coming to the main discussion, we describe how we create a matrix of predicted GI scores using five-fold cross validation. We divide the 287 knockout experiments into five nearly equal groups, each of them being roughly a 57 × 6316 matrix. For each fold of cross validation, we use four out of the five groups to create training sample points along with the corresponding GI scores, as described in Section 6.1.3. If the GI score for a gene pair is not available, we discard that sample point. After training, we create test points from the left-out group and predict the test scores for them. Repeating this process in a five-fold cross validation fashion, we predict GI scores for all possible pairs of genes from the 287 × 6316 matrix. Now that we have the predicted GI matrix, which we denote by $T_P$, we discuss how we employ it for the comparison between SSLPred and the method proposed by Hescott et al.
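The five-fold split of the 287 knockout experiments can be sketched as follows. The helper `k_fold_groups` is hypothetical, and the split is contiguous over experiment indices purely for simplicity; how the experiments were actually assigned to groups is not stated in the text.

```python
def k_fold_groups(n_items, k):
    """Split indices 0..n_items-1 into k contiguous groups of nearly equal size."""
    base, extra = divmod(n_items, k)
    groups = []
    start = 0
    for i in range(k):
        size = base + (1 if i < extra else 0)  # first `extra` groups get one more
        groups.append(list(range(start, start + size)))
        start += size
    return groups


groups = k_fold_groups(287, 5)

# Each fold trains on four groups and tests on the left-out group.
folds = []
for i, test in enumerate(groups):
    train = [idx for j, group in enumerate(groups) if j != i for idx in group]
    folds.append((train, test))

print([len(group) for group in groups])
```

Since 287 is not divisible by five, two of the groups hold 58 experiments and three hold 57, which is consistent with "nearly equal" groups.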


Figure 6-3. Comparison of SSLPred with the method from Hescott et al. Panels: (A) Kelley dataset, (B) Ulitsky dataset, (C) Brady dataset, (D) Ma dataset. SSLPred and SSLPred(2) denote variants of SSLPred with at most one or two layers of neighbors, respectively. The X axis, which ranges between zero and one, represents the p-Values of the permutation tests. The Y axis represents the frequencies of the BPMs at a particular p-Value of the permutation test. The four sub-figures demonstrate that, apart from the Ma dataset, SSLPred outperforms Hescott's method, as it maintains a higher frequency in the p-Value range between zero and 0.1.

Consider a BPM $B = (G_A, G_B)$ obtained from a known set of BPMs. Now, consider a gene $g_x \in G_A$. Hescott et al. rank all the genes $G$ of the organism with respect to $g_x$. Let us denote that ranking by $G(g_x)$. Then, from that ranking, their method retrieves $G_B$ and calculates the quality of the retrieved $G_B$ with a scoring method called Cluster Rank Score. We now describe Cluster Rank Score, which is adapted from Gene Set Enrichment Analysis [126].


Table 6-2. BPMs with p-Values less than 0.1 [%]

Method            Kelley-Ideker   Ulitsky-Shamir   Brady et al.   Ma et al.
SSLPred           8.74            11.71            9.95           7.02
SSLPred(2)        11.4            8.55             8.64           11.89
Hescott et al.    6.65            5.85             9.75           13.51

Cluster Rank Score accepts an ordered list of genes $L$ and another set $C$ as input. Then, it explores the distribution of $C$ along $L$. Intuitively, if $C$ appears at the head or tail of $L$, it is enriched with the specific properties represented by the ordered list $L$. In the current context, consider a BPM $B = \{G_A, G_B\}$. Let us knock out a gene $g_a$ from $G_A$ and measure the change of expression for all the other genes. Hescott et al. then order the genes according to their own criteria. Here, this ordered list is $L$ and the pathway $G_B$ is $C$. Thus, a correlation between the ordered gene list and the pathway $G_B$ implies that the BPM $B$ is validated by Hescott et al.

Using SSLPred we create a similar ranking as follows. We obtain the predicted GI scores of all the gene pairs $(g_x, g_y)$, $g_y \in G$, from the predicted GI matrix $T_P$. Then we sort $G$ in increasing order of the retrieved GI scores of $(g_x, g_y)$, and we calculate the Cluster Rank Score of $G_B$ based on this sorted list. We thus obtain two Cluster Rank Scores of $G_B$ with respect to $g_x$, written $\mathrm{CRS}(g_x, G_B)$: one from the ranking of Hescott et al. and one from the ranking of SSLPred. To calculate the statistical significance of the two Cluster Rank Scores, we design a separate permutation test for each of them and calculate p-Values with respect to those permutation tests. Here the null hypothesis can be stated as "$B$ is not a BPM". A detailed account of Cluster Rank Score and the permutation test can be found in Hescott et al. [31, 121].

Consider a BPM $B = (G_A, G_B)$. For all the combinations $(g_x, G_B)$, $g_x \in G_A$, and $(g_y, G_A)$, $g_y \in G_B$, we calculate the p-Values using the procedure described above. For every dataset, we plot the histograms of those p-Values. Since all the BPMs have been obtained from published literature, we assume them to be equivalent to a gold standard.
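The ranking and significance steps described above can be sketched as follows. The exact Cluster Rank Score is defined in Hescott et al. and is not reproduced in this chapter, so the statistic below is a stand-in: an unweighted running-sum enrichment score in the style of Gene Set Enrichment Analysis. The function names, the toy scores and the choice of permuting the gene order are all assumptions made for the illustration.

```python
import random


def rank_by_predicted_gi(predicted, g_x):
    """Sort genes in increasing order of their predicted GI score with g_x."""
    scores = predicted[g_x]
    return sorted(scores, key=lambda gene: scores[gene])


def running_sum_score(ordered_genes, gene_set):
    """Stand-in enrichment statistic: largest absolute deviation of a running sum.

    The sum steps up for genes in `gene_set` and down for the others, so a
    set concentrated at the head or the tail of the list scores close to 1.
    """
    hits = sum(1 for gene in ordered_genes if gene in gene_set)
    misses = len(ordered_genes) - hits
    if hits == 0 or misses == 0:
        return 0.0
    running = 0.0
    best = 0.0
    for gene in ordered_genes:
        running += 1.0 / hits if gene in gene_set else -1.0 / misses
        best = max(best, abs(running))
    return best


def permutation_p_value(ordered_genes, gene_set, n_perm=1000, seed=0):
    """Fraction of shuffled orderings scoring at least as high as the observed one."""
    rng = random.Random(seed)
    observed = running_sum_score(ordered_genes, gene_set)
    shuffled = list(ordered_genes)
    at_least = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if running_sum_score(shuffled, gene_set) >= observed:
            at_least += 1
    # Add-one correction keeps the estimate strictly positive.
    return (at_least + 1) / (n_perm + 1)


# Toy predicted GI scores of g_x against 20 genes; the pathway G_B holds
# the three genes with the most negative predicted scores.
genes = [f"g{i}" for i in range(20)]
predicted = {"gx": {gene: float(i) - 10.0 for i, gene in enumerate(genes)}}
pathway_b = {"g0", "g1", "g2"}

ordered = rank_by_predicted_gi(predicted, "gx")
score = running_sum_score(ordered, pathway_b)
p_value = permutation_p_value(ordered, pathway_b)
print(ordered[:3], score, p_value)
```

Repeating this for every $(g_x, G_B)$ and $(g_y, G_A)$ combination of a benchmark BPM yields the p-Values that are collected into the histograms.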


Figure 6-4. Comparison of SSLPred with the method from Hescott et al. on the Brady and Ma datasets for p-Values ≤ 0.1. Panels: (A) Brady dataset, (B) Ma dataset. SSLPred and SSLPred(2) denote variants of SSLPred with at most one or two layers of neighbors, respectively. The X axis, which ranges between zero and 0.1, represents the p-Values of the permutation tests. The Y axis represents the frequencies of the pathways at a particular p-Value of the permutation test. We display histograms for the two datasets for which our method performs similarly or worse in Figure 6-3. The two sub-figures demonstrate that on these two specific datasets, SSLPred maintains a higher frequency in the p-Value range between zero and 0.01.

Hence, in the histogram, an increased frequency of BPMs with lower p-Values corresponds to a better quality of the BPM retrieval method. Figure 6-3 compares SSLPred with the other method. SSLPred and SSLPred(2) denote variants of SSLPred with at most one or two layers of neighbors, respectively. The X axis, which ranges between zero and one, represents the p-Values of the permutation tests. The Y axis represents the frequencies of the BPMs at a particular p-Value of the permutation test. SSLPred outperforms Hescott et al. by 100%, 31% and 2% for the Ulitsky, Kelley and Brady datasets when the p-Value is equal to or smaller than 0.1. For SSLPred(2) the corresponding numbers are 46%, 71% and -12%, respectively. For the Ma dataset, Hescott et al. is better by 48% and 12% than SSLPred and SSLPred(2), respectively. If we relax the p-Value to 0.3 we observe that SSLPred(2) performs better than Hescott et al. by 24%, 12%, 10% and 48% for Ulitsky, Brady, Ma and


Kelley respectively. It can be concluded that, apart from the Ma dataset, SSLPred outperforms Hescott's method, since it maintains a higher frequency in the p-Value range between zero and 0.1. For the Kelley and Ulitsky datasets SSLPred outperforms by a high margin in the 0 to 0.1 p-Value range. For the Ma and Brady datasets, the two methods perform very competitively on average. Also, it is difficult to choose between the two variants of SSLPred. Though these two variants are in close competition, SSLPred(2) has a slight advantage over SSLPred. Table 6-2 summarizes the results in Figure 6-3 by tabulating the percentage of BPMs with p-Values less than or equal to 0.1.

Figure 6-4 highlights a special case of the frequency distribution when the p-Value is restricted to be less than or equal to 0.01. We are specifically interested in the two datasets, Brady and Ma, for which our method performs similarly or worse when we restrict the p-Values to 0.1. Since a lower p-Value implies a lower chance of false positive detection, these results are important in determining the superiority of the competing methods. Here also we observe that SSLPred(2) has better accuracy than Hescott et al. in identifying a larger number of small p-Value BPMs. For the Brady dataset SSLPred(2) outperforms the other method by 177%. For the Ma dataset both of them detect two BPMs. We conclude that our method demonstrates superior accuracy in validating more BPMs with very low p-Values (≤ 0.01).

We also conducted a third set of experiments, SSLPred(3), with the highest layer of neighbors set to $R = 3$. However, SSLPred(3) performed poorly compared to Hescott et al. We believe that most BPMs are small, with a diameter of less than or equal to four edges, as indicated in Kelley et al. [32]. Hence, assuming larger BPMs with $R = 3$ compromises the accuracy of our method.

Code. All the code developed in this chapter is available from http://bioinformatics.cise.ufl.edu/projects/SSLPred.html
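The relative improvements quoted in this section for p-Values up to 0.1 can be approximately reproduced from the percentages in Table 6-2. The sketch below assumes the improvement is computed as the difference divided by the value of Hescott et al.; this formula is an inference from the numbers, not something the text states, and the helper name `relative_gain` is hypothetical.

```python
# Percentage of BPMs with p-Value below 0.1, copied from Table 6-2.
table = {
    "SSLPred": {"Kelley": 8.74, "Ulitsky": 11.71, "Brady": 9.95, "Ma": 7.02},
    "SSLPred(2)": {"Kelley": 11.4, "Ulitsky": 8.55, "Brady": 8.64, "Ma": 11.89},
    "Hescott": {"Kelley": 6.65, "Ulitsky": 5.85, "Brady": 9.75, "Ma": 13.51},
}


def relative_gain(ours, theirs):
    """Improvement of `ours` over `theirs`, in percent of `theirs`."""
    return 100.0 * (ours - theirs) / theirs


gains = {
    method: {
        dataset: relative_gain(table[method][dataset], table["Hescott"][dataset])
        for dataset in ("Ulitsky", "Kelley", "Brady", "Ma")
    }
    for method in ("SSLPred", "SSLPred(2)")
}

for method, per_dataset in gains.items():
    print(method, {d: round(g, 1) for d, g in per_dataset.items()})
```

Under this assumed formula the values agree with the quoted figures to within about one percentage point. On the Ma dataset, where Hescott et al. lead, the quoted 48% and 12% correspond to the same difference divided by the Hescott et al. value, which is what the negative entries for Ma express.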


6.3 Discussion

In this chapter, we developed a new method, SSLPred, to predict SSL interactions in an organism. Our method is built on the concept of Between Pathway Models, where the majority of the SSL pairs span the two functionally complementing pathways. We developed a regression based approach that learns the mapping from the gene expressions of single deletion mutants to the corresponding synthetic gene array. We compared our method to the one by Hescott et al. in validating benchmark BPMs of S. cerevisiae on four datasets. Across the different experimental setups, on average SSLPred performed significantly better than the other method.


CHAPTER 7
CONCLUSION

Analyzing the effect of perturbation on gene networks is a complex problem for several reasons, including but not limited to non-homogeneous network structure, different kinds of genetic interactions, characteristics associated with a specific perturbation and the small number of samples due to cost constraints. Because of the complex and diversified nature of this problem, we selected four subproblems that represent different aspects of the original problem well but are not necessarily comprehensive. Also, while solving the problems, we made several simplifying assumptions. For example, in the first and second chapters we assume that there are only two time points (control and non-control) in the gene expression data. Despite these limitations, this work has been able to substantiate the hypothesis stated at the beginning. In all four of the chapters, our methods demonstrated the merit of the hypothesis that the integration of gene networks enables us to build superior descriptive and predictive models. In the future, I would like to apply the hypothesis proposed in this work to related problems. I believe that, not only in the present context but also in other relevant areas, integrating domain specific knowledge with data in a Bayesian fashion will help us build more accurate and useful models.


REFERENCES

[1] Saeys Y, Inza I, Larrañaga P: A review of feature selection techniques in bioinformatics. Bioinformatics 2007, 23(19):2507.
[2] Dudoit S, Fridlyand J, Speed TP: Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association 2002, 97(457):77.
[3] Jafari P, Azuaje F: An assessment of recently published gene expression data analyses: reporting experimental design and statistical factors. BMC Med Inform Decis Mak 2006, 6:27.
[4] Baldi P, Long A: A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics 2001, 17(6):509.
[5] Inza I, Larrañaga P, Blanco R, Cerrolaza A: Filter versus wrapper gene selection approaches in DNA microarray domains. Artif Intell Med 2004, 31(2):91.
[6] Jirapech-Umpai T, Aitken S: Feature selection and classification for microarray data analysis: evolutionary methods for identifying predictive genes. BMC Bioinformatics 2005, 6:148.
[7] Díaz-Uriarte R, de Andres SA: Gene selection and classification of microarray data using random forest. BMC Bioinformatics 2006, 7:3.
[8] Vert JP, Kanehisa M: Graph-Driven Feature Extraction From Microarray Data Using Diffusion Kernels and Kernel CCA. In Advances in Neural Information Processing Systems 15 [Neural Information Processing Systems, NIPS 2002, December 9-14, 2002, Vancouver, British Columbia, Canada]. Edited by Becker S, Thrun S, Obermayer K, MIT Press 2002:1425.
[9] Rapaport F, Zinovyev A, Dutreix M, Barillot E, Vert J: Classification of microarray data using gene networks. BMC Bioinformatics 2007, 8:35.
[10] Wei P, Pan W: Incorporating gene networks into statistical tests for genomic data via a spatially correlated mixture model. Bioinformatics 2008, 24:404.
[11] Wei Z, Li H: A Markov random field model for network-based analysis of genomic data. Bioinformatics 2007, 23(12):1537.
[12] Li C, Li H: Network-constrained Regularization and Variable Selection for Analysis of Genomic Data. Bioinformatics 2008.
[13] Hughes TR, Marton MJ, Jones AR, Roberts CJ, Stoughton R, Armour CD, Bennett HA, Coffey E, Dai H, He YD, Kidd MJ, King AM, Meyer MR, Slade D, Lum PY,


Stepaniants SB, Shoemaker DD, Gachotte D, Chakraburtty K, Simon J, Bard M, Friend SH: Functional discovery via a compendium of expression profiles. Cell 2000, 102:109.
[14] Marton MJ, Derisi JL, Bennett HA, Iyer VR, Meyer MR, Roberts CJ, Stoughton R, Burchard J, Slade D, Dai H, Bassett DE, Hartwell LH, Brown PO, Friend SH: Drug target validation and identification of secondary drug target effects using DNA microarrays. Nat Med 1998, 4(11):1293.
[15] Giaever G, Shoemaker DD, Jones TW, Liang H, Winzeler EA, Astromoff A, Davis RW: Genomic profiling of drug sensitivities via induced haploinsufficiency. Nature Genetics 1999, 21(3):278.
[16] Giaever G, Flaherty P, Kumm J, Proctor M, Nislow C, Jaramillo DF, Chu AM, Jordan MI, Arkin AP, Davis RW: Chemogenomic profiling: Identifying the functional interactions of small molecules in yeast. Proceedings of the National Academy of Sciences of the United States of America 2004, 101(3):793.
[17] Lum P, Armour C, Stepaniants S, Cavet G, Wolf M, Butler J, Hinshaw J, Garnier P, Prestwich G, Leonardson A, Garrett-Engele P, Rush C, Bard M, Schimmack G, Phillips J, Roberts C, Shoemaker D: Discovering modes of action for therapeutic compounds using a genome-wide screen of yeast heterozygotes. Cell 2004, 116:121.
[18] Parsons AB, Brost RL, Ding H, Li Z, Zhang C, Sheikh B, Brown GW, Kane PM, Hughes TR, Boone C: Integration of chemical-genetic and genetic interaction data links bioactive compounds to cellular target pathways. Nature Biotechnology 2003, 22:62.
[19] di Bernardo D, Thompson MJ, Gardner TS, Chobot SE, Eastwood EL, Wojtovich AP, Elliott SJ, Schaus SE, Collins JJ: Chemogenomic profiling on a genome-wide scale using reverse-engineered gene networks. Nature Biotechnology 2005, 23(3):377.
[20] Cosgrove EJ, Zhou Y, Gardner TS, Kolaczyk ED: Predicting gene targets of perturbations via network-based filtering of mRNA expression compendia. Bioinformatics 2008, 24(21):2482.
[21] Vaske CJ, House C, Luu T, Frank B, Yeang CH, Lee NH, Stuart JM: A Factor Graph Nested Effects Model To Identify Networks from Genetic Perturbations. PLoS Comput Biol 2009, 5:e1000274+.
[22] Conesa A, Nueda MJ, Ferrer A, Talon M: maSigPro: a method to identify significantly differential expression profiles in time-course microarray experiments. Bioinformatics 2006, 22(9):1096.
[23] Hong F, Li H: Functional Hierarchical Models for Identifying Genes with Different Time-Course Expression Profiles. Biometrics 2006, 62(2):534.


[24] Chuan Tai Y, Speed TP: On Gene Ranking Using Replicated Microarray Time Course Data. Biometrics 2009, 65:40.
[25] Angelini C, De Canditiis D, Pensky M: Bayesian models for two-sample time-course microarray experiments. Computational Statistics & Data Analysis 2009, 53(5):1547.
[26] Van Deun K, Hoijtink H, Thorrez L, Van Lommel L, Schuit F, Van Mechelen I: Testing the hypothesis of tissue selectivity: the intersection-union test and a Bayesian approach. Bioinformatics (Oxford, England) 2009, 25(19):2588.
[27] Tong AH, Evangelista M, Parsons AB, Xu H, Bader GD, Page N, Robinson M, Raghibizadeh S, Hogue CWV, Bussey H, Andrews B, Tyers M, Boone C: Systematic Genetic Analysis with Ordered Arrays of Yeast Deletion Mutants. Science 2001, 294(5550):2364.
[28] Pan X, Yuan DS, Xiang D, Wang X, Sookhai-Mahadeo S, Bader JS, Hieter P, Spencer F, Boeke JD: A robust toolkit for functional profiling of the yeast genome. Molecular cell 2004, 16(3):487.
[29] Collins SR, Miller KM, Maas NL, Roguev A, Fillingham J, Chu CS, Schuldiner M, Gebbia M, Recht J, Shales M, Ding H, Xu H, Han J, Ingvarsdottir K, Cheng B, Andrews B, Boone C, Berger SL, Hieter P, Zhang Z, Brown GW, Ingles CJ, Emili A, Allis CD, Toczyski DP, Weissman JS, Greenblatt JF, Krogan NJ: Functional dissection of protein complexes involved in yeast chromosome biology using a genetic interaction map. Nature 2007, 446(7137):806.
[30] Kelley R, Ideker T: Systematic interpretation of genetic interactions using protein networks. Nature Biotechnology 2005, 23(5):561.
[31] Hescott BJ, Leiserson MD, Cowen LJ, Slonim DK: Evaluating between-pathway models with expression data. Journal of computational biology: a journal of computational molecular cell biology 2010, 17(3):477.
[32] Kelley DR, Kingsford C: Extracting Between-Pathway Models from E-MAP Interactions Using Expected Graph Compression. Lecture Notes in Computer Science 2010, 6044:248.
[33] Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science (New York, N.Y.) 1999, 286(5439):531.
[34] Sun Y, Goodison S, Li J, Liu L, Farmerie W: Improved breast cancer prognosis through the combination of clinical and genetic markers. Bioinformatics 2007, 23:30.


[35] Ramaswamy S, Tamayo P, Rifkin R, Mukherjee S, Yeang CH, Angelo M, Ladd C, Reich M, Latulippe E, Mesirov JP, Poggio T, Gerald W, Loda M, Lander ES, Golub TR: Multiclass cancer diagnosis using tumor gene expression signatures. Proceedings of the National Academy of Sciences of the United States of America 2001, 98(26):15149.
[36] Schena M, Shalon D, Davis RW, Brown PO: Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science (New York, N.Y.) 1995, 270(5235):467.
[37] Liu J, Ranka S, Kahveci T: Classification and feature selection algorithms for multi-class CGH data. Bioinformatics 2008, 24(13):i86.
[38] van't Veer L, Dai H, van de Vijver M, He Y, Hart A, Mao M, Peterse H, van der Kooy K, Marton M, Witteveen A, Schreiber G, Kerkhoven R, Roberts C, Linsley P, Bernards R, Friend S: Gene expression profiling predicts clinical outcome of breast cancer. Nature 2002, 415(6871):530.
[39] van de Vijver M, He Y, van't Veer L, Dai H, Hart A, Voskuil D, Schreiber G, Peterse J, Roberts C, Marton M, Parrish M, Atsma D, Witteveen A, Glas A, Delahaye L, van der Velde T, Bartelink H, Rodenhuis S, Rutgers E, Friend S, Bernards R: A gene-expression signature as a predictor of survival in breast cancer. N Engl J Med 2002, 347(25):1999.
[40] Wang Y, Klijn JG, Zhang Y, Sieuwerts AM, Look MP, Yang F, Talantov D, Timmermans M, Meijer-van Gelder ME, Yu J, Jatkoe T, Berns EM, Atkins D, Foekens JA: Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet 2005, 365(9460):671.
[41] Pawitan Y, Bjohle J, Amler L, Borg AL, Egyhazi S, Hall P, Han X, Holmberg L, Huang F, Klaar S, Liu ET, Miller L, Nordgren H, Ploner A, Sandelin K, Shaw PM, Smeds J, Skoog L, Wedren S, Bergh J: Gene expression profiling spares early breast cancer patients from adjuvant therapy: derived and validated in two population-based cohorts. Breast Cancer Res 2005, 7(6).
[42] Desmedt C, Piette F, Loi S, Wang Y, Lallemand F, Haibe-Kains B, Viale G, Delorenzi M, Zhang Y, d'Assignies MSS, Bergh J, Lidereau R, Ellis P, Harris AL, Klijn JG, Foekens JA, Cardoso F, Piccart MJ, Buyse M, Sotiriou C, TRANSBIG Consortium: Strong time dependence of the 76-gene prognostic signature for node-negative breast cancer patients in the TRANSBIG multicenter independent validation series. Clinical cancer research: an official journal of the American Association for Cancer Research 2007, 13(11):3207.
[43] Hsu CW, Chang CC, Lin CJ: A Practical Guide to Support Vector Classification 2010. [http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf].
[44] Lehninger A, Nelson DL, Cox MM: Lehninger Principles of Biochemistry. W. H. Freeman, fifth edition 2008.


[45] Rocke DMM, Ideker T, Troyanskaya O, Quackenbush J, Dopazo J: Papers on normalization, variable selection, classification or clustering of microarray data. Bioinformatics 2009, 25(6):701.
[46] Platt JC: Fast training of support vector machines using sequential minimal optimization. Cambridge, MA, USA: MIT Press 1999.
[47] Taylor K, Pena-Hernandez K, Davis J, Arthur G, Duff D, Shi H, Rahmatpanah F, Sjahputera O, Caldwell C: Large-scale CpG methylation analysis identifies novel candidate genes and reveals methylation hotspots in acute lymphoblastic leukemia. Cancer Res 2007, 67(6):2617.
[48] Wolf I, Bose S, Desmond J, Lin B, Williamson E, Karlan B, Koeffler H: Unmasking of epigenetically silenced genes reveals DNA promoter methylation and reduced expression of PTCH in breast cancer. Breast Cancer Res Treat 2007, 105(2):139.
[49] Schafer R, Sedehizade F, Welte T, Reiser G: ATP- and UTP-activated P2Y receptors differently regulate proliferation of human lung epithelial tumor cells. Am J Physiol Lung Cell Mol Physiol 2003, 285(2):L376.
[50] Greig A, Linge C, Healy V, Lim P, Clayton E, Rustin M, McGrouther D, Burnstock G: Expression of purinergic receptors in non-melanoma skin cancers and their functional roles in A431 cells. J Invest Dermatol 2003, 121(2):315.
[51] Pines A, Bivi N, Vascotto C, Romanello M, D'Ambrosio C, Scaloni A, Damante G, Morisi R, Filetti S, Ferretti E, Quadrifoglio F, Tell G: Nucleotide receptors stimulation by extracellular ATP controls Hsp90 expression through APE1/Ref-1 in thyroid cancer cells: a novel tumorigenic pathway. J Cell Physiol 2006, 209:44.
[52] Ichikawa Y, Hirokawa M, Aiba N, Fujishima N, Komatsuda A, Saitoh H, Kume M, Miura I, Sawada K: Monitoring the expression profiles of doxorubicin-resistant K562 human leukemia cells by serial analysis of gene expression. Int J Hematol 2004, 79(3):276.
[53] Shabo I, Stal O, Olsson H, Dore S, Svanvik J: Breast cancer expression of CD163, a macrophage scavenger receptor, is related to early distant recurrence and reduced patient survival. Int J Cancer 2008, 123(4):780.
[54] Nagorsen D, Voigt S, Berg E, Stein H, Thiel E, Loddenkemper C: Tumor-infiltrating macrophages and dendritic cells in human colorectal cancer: relation to local regulatory T cells, systemic T-cell response against tumor-associated antigens and survival. J Transl Med 2007, 5:62.
[55] Panagopoulos I, Isaksson M, Billstrom R, Strombeck B, Mitelman F, Johansson B: Fusion of the NUP98 gene and the homeobox gene HOXC13 in acute


myeloid leukemia with t(11;12)(p15;q13). Genes Chromosomes Cancer 2003, 36:107.
[56] Starza RL, Trubia M, Crescenzi B, Matteucci C, Negrini M, Martelli M, Pelicci P, Mecucci C: Human homeobox gene HOXC13 is the partner of NUP98 in adult acute myeloid leukemia with t(11;12)(p15;q13). Genes Chromosomes Cancer 2003, 36(4):420.
[57] Lapierre M, Siegfried G, Scamuffa N, Bontemps Y, Calvo F, Seidah N, Khatib A: Opposing function of the proprotein convertases furin and PACE4 on breast cancer cells' malignant phenotypes: role of tissue inhibitors of metalloproteinase-1. Cancer Res 2007, 67(19):9030.
[58] Fu Y, Campbell E, Shepherd T, Nachtigal M: Epigenetic regulation of proprotein convertase PACE4 gene expression in human ovarian cancer cells. Mol Cancer Res 2003, 1(8):569.
[59] Bhattacharjee H, Carbrey J, Rosen B, Mukhopadhyay R: Drug uptake and pharmacological modulation of drug sensitivity in leukemia by AQP9. Biochem Biophys Res Commun 2004, 322(3):836.
[60] Tseng W, Liu C: Peptide YY and cancer: current findings and potential clinical applications. Peptides 2002, 23(2):389.
[61] Brostjan C, Sobanov Y, Glienke J, Hayer S, Lehrach H, Francis F, Hofer E: The NKG2 natural killer cell receptor family: comparative analysis of promoter sequences. Genes Immun 2000, 1(8):504.
[62] Wang H, Tan W, Hao B, Miao X, Zhou G, He F, Lin D: Substantial reduction in risk of lung adenocarcinoma associated with genetic polymorphism in CYP2A13, the most active cytochrome P450 for the metabolic activation of tobacco-specific carcinogen NNK. Cancer Res 2003, 63(22):8057.
[63] Fukuda S, Kuroki T, Kohsaki H, Hayashi S, Ozaki K, Yamori T, Tsuruo T, Nakamori S, Imaoka S, Nakamura Y: Isolation of a novel gene showing reduced expression in metastatic colorectal carcinoma cell lines and carcinomas. Jpn J Cancer Res 1997, 88(8):725.
[64] Trochet D, Bourdeaut F, Janoueix-Lerosey I, Deville A, de Pontual L, Schleiermacher G, Coze C, Philip N, Frébourg T, Munnich A, Lyonnet S, Delattre O, Amiel J: Germline mutations of the paired-like homeobox 2B (PHOX2B) gene in neuroblastoma. Am J Hum Genet 2004, 74(4):761.
[65] Rapa I, Ceppi P, Bollito E, Rosas R, Cappia S, Bacillo E, Porpiglia F, Berruti A, Papotti M, Volante M: Human ASH1 expression in prostate cancer with neuroendocrine differentiation. Mod Pathol 2008, 21(6):700.

[66] Osada H, Tomida S, Yatabe Y, Tatematsu Y, Takeuchi T, Murakami H, Kondo Y, Sekido Y, Takahashi T: Roles of achaete-scute homologue 1 in DKK1 and E-cadherin repression and neuroendocrine differentiation in lung cancer. Cancer Res 2008, 68(6):1647.
[67] Zheng R, Zhang Z, Lv X, Fan J, Chen Y, Wang Y, Tan R, Liu Y, Zhou Q: Polycystin-1 induced apoptosis and cell cycle arrest in G0/G1 phase in cancer cells. Cell Biol Int 2008, 32(4):427.
[68] Zhang K, Ye C, Zhou Q, Zheng R, Lv X, Chen Y, Hu Z, Guo H, Zhang Z, Wang Y, Tan R, Liu Y: PKD1 inhibits cancer cells migration and invasion via Wnt signaling pathway in vitro. Cell Biochem Funct 2007, 25(6):767.
[69] Kim M, Jang H, Kim J, Noh S, Song K, Cho J, Jeong H, Norman J, Caswell P, Kang G, Kim S, Yoo H, Kim Y: Epigenetic inactivation of protein kinase D1 in gastric cancer and its role in gastric cancer cell migration and invasion. Carcinogenesis 2008, 29(3):629.
[70] Nakayama T, Inaba M, Naito S, Mihara Y, Miura S, Taba M, Yoshizaki A, Wen C, Sekine I: Expression of angiopoietin-1, 2 and 4 and Tie-1 and 2 in gastrointestinal stromal tumor, leiomyoma and schwannoma. World J Gastroenterol 2007, 13(33):4473.
[71] Yamakawa M, Liu L, Belanger A, Date T, Kuriyama T, Goldberg M, Cheng S, Gregory R, Jiang C: Expression of angiopoietins in renal epithelial and clear cell carcinoma cells: regulation by hypoxia and participation in angiogenesis. Am J Physiol Renal Physiol 2004, 287(4):F649.
[72] Orsetti B, Nugoli M, Cervera N, Lasorsa L, Chuchana P, Ursule L, Nguyen C, Redon R, du Manoir S, Rodriguez C, Theillet C: Genomic and expression profiling of chromosome 17 in breast cancer reveals complex patterns of alterations and novel candidate genes. Cancer Res 2004, 64(18):6453.
[73] Sakakura C, Hagiwara A, Miyagawa K, Nakashima S, Yoshikawa T, Kin S, Nakase Y, Ito K, Yamagishi H, Yazumi S, Chiba T, Ito Y: Frequent downregulation of the runt domain transcription factors RUNX1, RUNX3 and their cofactor CBFB in gastric cancer. Int J Cancer 2005, 113(2):221.
[74] Usui T, Aoyagi K, Saeki N, Nakanishi Y, Kanai Y, Ohki M, Ogawa K, Yoshida T, Sasaki H: Expression status of RUNX1/AML1 in normal gastric epithelium and its mutational analysis in microdissected gastric cancer cells. Int J Oncol 2006, 29(4):779.
[75] Nanjundan M, Zhang F, Schmandt R, Smith-McCune K, Mills G: Identification of a novel splice variant of AML1b in ovarian cancer patients conferring loss of wild-type tumor suppressive functions. Oncogene 2007, 26(18):2574.

[76] Silva F, Morolli B, Storlazzi C, Anelli L, Wessels H, Bezrookove V, Kluin-Nelemans H, Giphart-Gassler M: Identification of RUNX1/AML1 as a classical tumor suppressor gene. Oncogene 2003, 22(4):538.
[77] Wilkinson R, Kassianos A, Swindle P, Hart D, Radford K: Numerical and functional assessment of blood dendritic cells in prostate cancer patients. Prostate 2006, 66(2):180.
[78] Pellegrini M, Cheng J, Voutila J, Judelson D, Taylor J, Nelson S, Sakamoto K: Expression profile of CREB knockdown in myeloid leukemia cells. BMC Cancer 2008, 8:264.
[79] Overbeek R, Begley T, Butler R, Choudhuri J, Chuang H, Cohoon M, de Crecy-Lagard V, Diaz N, Disz T, Edwards R, Fonstein M, Frank E, Gerdes S, Glass E, Goesmann A, Hanson A, Iwata-Reuyl D, Jensen R, Jamshidi N, Krause L, Kubal M, Larsen N, Linke B, McHardy A, Meyer F, Neuweger H, Olsen G, Olson R, Osterman A, Portnoy V, Pusch G, Rodionov D, Rückert C, Steiner J, Stevens R, Thiele I, Vassieva O, Ye Y, Zagnitko O, Vonstein V: The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes. Nucleic Acids Res 2005, 33(17):5691.
[80] Clamp M, Fry B, Kamal M, Xie X, Cuff J, Lin MF, Kellis M, Lindblad-Toh K, Lander ES: Distinguishing protein-coding and noncoding genes in the human genome. Proceedings of the National Academy of Sciences 2007, 104(49):19428.
[81] Goldhirsch A, Glick J, Gelber R, Coates A, Thurlimann B, Senn H, & Panel Members: Meeting highlights: international expert consensus on the primary therapy of early breast cancer 2005. Ann Oncol 2005, 16(10):1569.
[82] Cheng R, Zhao A, Alvord W, Powell D, Bare R, Masuda A, Takahashi T, Anderson L, Kasprzak K: Gene expression dose-response changes in microarrays after exposure of human peripheral lung epithelial cells to nickel(II). Toxicol Appl Pharmacol 2003, 191:22.
[83] Ideker T, Thorsson V, Ranish J, Christmas R, Buhler J, Eng J, Bumgarner R, Goodlett D, Aebersold R, Hood L: Integrated genomic and proteomic analyses of a systematically perturbed metabolic network. Science 2001, 292(5518):929.
[84] Tsai K, Chuang E, Little J, Yuan Z: Cellular mechanisms for low-dose ionizing radiation-induced perturbation of the breast tissue microenvironment. Cancer Res 2005, 65(15):6734.
[85] Hamelinck D, Zhou H, Li L, Verweij C, Dillon D, Feng Z, Costa J, Haab B: Optimized normalization for antibody microarrays and application to serum-protein profiling. Mol Cell Proteomics 2005, 4(6):773.

[86] Baggerly K, Coombes K, Hess K, Stivers D, Abruzzo L, Zhang W: Identifying differentially expressed genes in cDNA microarray experiments. J Comput Biol 2001, 8(6):639.
[87] Courcelle J, Khodursky A, Peter B, Brown P, Hanawalt P: Comparative gene expression profiles following UV exposure in wild-type and SOS-deficient Escherichia coli. Genetics 2001, 158:41.
[88] Miklos G, Maleszka R: Microarray reality checks in the context of a complex disease. Nat Biotechnol 2004, 22(5):615.
[89] Shimoni Y, Friedlander G, Hetzroni G, Niv G, Altuvia S, Biham O, Margalit H: Regulation of gene expression by small non-coding RNAs: a quantitative view. Mol Syst Biol 2007, 3:138.
[90] Ay F, Xu F, Kahveci T: Scalable steady state analysis of boolean biological regulatory networks. PloS one 2009, 4(12).
[91] Song B, Buyuktahtakin IE, Ranka S, Kahveci T: Manipulating the Steady State of Metabolic Pathways. IEEE/ACM Trans. Comput. Biol. Bioinformatics 2011, 8:732.
[92] Besag J: On the Statistical Analysis of Dirty Pictures. Journal of the Royal Statistical Society 1986, 48(3):259.
[93] Bishop CM: Pattern Recognition and Machine Learning (Information Science and Statistics). Secaucus, NJ, USA: Springer-Verlag New York, Inc. 2006.
[94] Li SZ: Markov Random Field Modeling in Image Analysis. Springer Publishing Company, Incorporated, 3rd edition 2009.
[95] Hammersley JM, Clifford P: Markov fields on finite graphs and lattices 1968. [Unpublished manuscript].
[96] Besag J: Efficiency of pseudolikelihood estimation for simple Gaussian fields. Biometrika 1977, 64(3):616.
[97] Geman S, Graffigne C: Markov random field image models and their applications to computer vision. In Proceedings of the International Congress of Mathematics: Berkeley 1986:1496.
[98] Kendziorski C, Newton M, Lan H, Gould M: On parametric empirical Bayes methods for comparing multiple groups using replicated gene expression profiles. Stat Med 2003, 22(24):3899.
[99] Demichelis F, Magni P, Piergiorgi P, Rubin M, Bellazzi R: A hierarchical Naive Bayes Model for handling sample heterogeneity in classification problems: an application to tissue microarrays. BMC Bioinformatics 2006, 7:514+.

[100] Storn R, Price K: Differential Evolution - A Simple and Efficient Heuristic for global Optimization over Continuous Spaces. Journal of Global Optimization 1997, 11(4):341.
[101] Smirnov D, Morley M, Shin E, Spielman R, Cheung V: Genetic analysis of radiation-induced changes in human gene expression. Nature 2009, 459(7246):587.
[102] Dausset J, Cann H, Cohen D, Lathrop M, Lalouel J, White R: Centre d'etude du polymorphisme humain (CEPH): collaborative genetic mapping of the human genome. Genomics 1990, 6(3):575.
[103] Kanehisa M, Goto S: KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res 2000, 28:27.
[104] Imaoka T, Yamashita S, Nishimura M, Kakinuma S, Ushijima T, Shimada Y: Gene expression profiling distinguishes between spontaneous and radiation-induced rat mammary carcinomas. J Radiat Res (Tokyo) 2008, 49(4):349.
[105] Nagtegaal I, Gaspar C, Peltenburg L, Marijnen C, Kapiteijn E, van de Velde C, Fodde R, van Krieken J: Radiation induces different changes in expression profiles of normal rectal tissue compared with rectal carcinoma. Virchows Arch 2005, 446(2):127.
[106] Amundson S, Bittner M, Chen Y, Trent J, Meltzer P, Fornace A: Fluorescent cDNA microarray hybridization reveals complexity and heterogeneity of cellular genotoxic stress responses. Oncogene 1999, 18(24):3666.
[107] Lin R, Sun Y, Li C, Xie C, Wang S: Identification of differentially expressed genes in human lymphoblastoid cells exposed to irradiation and suppression of radiation-induced apoptosis with antisense oligonucleotides against caspase-4. Oligonucleotides 2007, 17(3):314.
[108] Jen K, Cheung V: Transcriptional response of lymphoblastoid cells to ionizing radiation. Genome Res 2003, 13(9):2092.
[109] Wu L, Levine A: Differential regulation of the p21/WAF-1 and mdm2 genes after high-dose UV irradiation: p53-dependent and p53-independent regulation of the mdm2 gene. Mol Med 1997, 3(7):441.
[110] Jakob B, Scholz M, Taucher-Scholz G: Immediate localized CDKN1A (p21) radiation response after damage produced by heavy-ion tracks. Radiat Res 2000, 154(4):398.
[111] Rieger K, Chu G: Portrait of transcriptional responses to ultraviolet and ionizing radiation in human cells. Nucleic Acids Res 2004, 32(16):4786.
[112] Zhang G, Njauw C, Park J, Naruse C, Asano M, Tsao H: EphA2 is an essential mediator of UV radiation-induced apoptosis. Cancer Res 2008, 68(6):1691.

[113] Bandyopadhyay N, Somaiya M, Kahveci T, Ranka S: Modeling Perturbations Using Gene Networks. In Proc LSS Comput Syst Bioinform Conf. 2010:26.
[114] Garg A, Mendoza L, Xenarios I, DeMicheli G: Modeling of multiple valued gene regulatory networks. Conf Proc IEEE Eng Med Biol Soc 2007, 2007:1398.
[115] Baba T, Ara T, Hasegawa M, Takai Y, Okumura Y, Baba M, Datsenko KA, Tomita M, Wanner BL, Mori H: Construction of Escherichia coli K-12 in-frame, single-gene knockout mutants: the Keio collection. Molecular Systems Biology 2006, 2(April):2006.0008.
[116] Beltrao P, Cagney G, Krogan N: Quantitative genetic interactions reveal biological modularity. Cell 2010, 141(5):739.
[117] Costanzo M, Baryshnikova A, Bellay J, Kim Y, Spear ED, Sevier CS, Ding H, Koh JL, Toufighi K, Mostafavi S, Prinz J, St Onge RP, VanderSluis B, Makhnevych T, Vizeacoumar FJ, Alizadeh S, Bahr S, Brost RL, Chen Y, Cokol M, Deshpande R, Li Z, Lin ZYY, Liang W, Marback M, Paw J, San Luis BJJ, Shuteriqi E, Tong AHYH, van Dyk N, Wallace IM, Whitney JA, Weirauch MT, Zhong G, Zhu H, Houry WA, Brudno M, Ragibizadeh S, Papp B, Pal C, Roth FP, Giaever G, Nislow C, Troyanskaya OG, Bussey H, Bader GD, Gingras ACC, Morris QD, Kim PM, Kaiser CA, Myers CL, Andrews BJ, Boone C: The genetic landscape of a cell. Science (New York, N.Y.) 2010, 327(5964):425.
[118] Cherkassky VS, Mulier F: Learning from Data: Concepts, Theory, and Methods. New York, NY, USA: John Wiley & Sons, Inc., 1st edition 1998.
[119] Tibshirani R, Saunders M, Rosset S, Zhu J, Knight K: Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 2005, 67:91.
[120] Wright MH: The interior-point revolution in optimization: history, recent developments, and lasting consequences. Bull. Amer. Math. Soc. (N.S) 2005, 42:39.
[121] Hescott BJ, Leiserson MDM, Cowen L, Slonim DK: Evaluating Between-Pathway Models with Expression Data. In RECOMB 2009:372.
[122] Breitkreutz B, Stark C, Reguly T, Boucher L, Breitkreutz A, Livstone M, Oughtred R, Lackner D, Bähler J, Wood V, Dolinski K, Tyers M: The BioGRID Interaction Database: 2008 update. Nucleic Acids Res 2008, 36(Database issue):D637.
[123] Ulitsky I, Shamir R: Identification of functional modules using network topology and high-throughput data. BMC Systems Biology 2007, 1:8+.
[124] Brady A, Maxwell K, Daniels N, Cowen LJ: Fault Tolerance in Protein Interaction Networks: Stable Bipartite Subgraphs and Redundant Pathways. PLoS ONE 2009, 4(4):e5364+.

[125] Ma X, Tarone A, Li W: Mapping genetically compensatory pathways from synthetic lethal interactions in yeast. PLoS One 2008, 3(4):e1922.
[126] Subramanian A, Tamayo P, Mootha V, Mukherjee S, Ebert B, Gillette M, Paulovich A, Pomeroy S, Golub T, Lander E, Mesirov J: Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A 2005, 102(43):15545.

BIOGRAPHICAL SKETCH

Nirmalya Bandyopadhyay received his Bachelor of Engineering degree in Computer Science from Bengal Engineering and Science University, India. He spent three years in industry, working on computer networks and EDA tool development. He received his Ph.D. from the School of Engineering at the University of Florida in the fall of 2011. During his Ph.D., he worked in the areas of computational biology and machine learning under the supervision of Dr. Sanjay Ranka and Dr. Tamer Kahveci, and was associated with the bioinformatics lab. Nirmalya authored two book chapters, four journal papers and four conference papers. In 2010, Nirmalya worked as a development intern at Amazon.com.