Querying Large Biological Network Datasets

Material Information

Title:
Querying Large Biological Network Datasets
Physical Description:
1 online resource (126 p.)
Language:
English
Creator:
Gulsoy, Gunhan
Publisher:
University of Florida
Place of Publication:
Gainesville, Fla.
Publication Date:
2013

Thesis/Dissertation Information

Degree:
Doctorate (Ph.D.)
Degree Grantor:
University of Florida
Degree Disciplines:
Computer Engineering, Computer and Information Science and Engineering
Committee Chair:
Kahveci, Tamer
Committee Members:
Dobra, Alin Viorel
Ranka, Sanjay
Banerjee, Arunava
De Crecy, Valerie Anne

Subjects

Subjects / Keywords:
bioinformatics -- biological-network-databases -- biological-networks -- computational-biology -- database-indexing -- databases
Computer and Information Science and Engineering -- Dissertations, Academic -- UF
Genre:
Computer Engineering thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract:
New experimental methods have resulted in an increasing amount of genetic interaction data being generated every day. Biological networks are used to store the genetic interaction data gathered. The increasing amount of available data requires fast, large scale analysis methods. Therefore, we address the problem of querying large biological network datasets. We begin by considering the querying of large biological network datasets as a collection of individual comparisons. First, we define measures of similarity between two networks. In order to do this, we develop an algorithm to align two biological networks. Then, we validate this algorithm using biological evidence. This algorithm incorporates both the sequence similarity of nodes and the topology of the network itself. Then, we formulate an algorithm which uncovers a hierarchical structure in transcriptional regulatory networks. This algorithm works purely on the topology of the network. Using this method, we show relations between the functions of genes and topological properties of networks. Finally, we analyze functional properties of metabolic networks. We first calculate the elementary flux modes of the metabolic networks. Then, for each genetic function, we analyze relations between functions and metabolic flux cones. In this method, we aim to analyze networks functionally. Next, we consider large datasets. In biological networks, pairwise operations are costly. Therefore, exhaustive comparison of all networks with a query is infeasible. In order to tackle this problem, we develop reference based indexing. In this method, we first build a small set of reference networks. Then, instead of aligning a query with all the database networks, we use references to calculate upper and lower bounds for the alignment scores between the query and all the database networks. Using these bounds, we calculate 80% of the database networks quickly. We experimentally show that we can successfully mine statistically and biologically significant relationships in a large database of biological networks. Finally, we propose a new method, which uses a dynamic set of references in reference based indexing.
General Note:
In the series University of Florida Digital Collections.
General Note:
Includes vita.
Bibliography:
Includes bibliographical references.
Source of Description:
Description based on online resource; title from PDF title page.
Source of Description:
This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility:
by Gunhan Gulsoy.
Thesis:
Thesis (Ph.D.)--University of Florida, 2013.
Local:
Adviser: Kahveci, Tamer.

Record Information

Source Institution:
UFRGP
Rights Management:
Applicable rights reserved.
Classification:
lcc - LD1780 2013
System ID:
UFE0045644:00001

Full Text

QUERYING LARGE BIOLOGICAL NETWORK DATASETS

By

GUNHAN GULSOY

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2013

© 2013 Gunhan Gulsoy

ACKNOWLEDGMENTS

I thank the chair and members of my supervisory committee for their mentoring. All the work in this thesis was partially supported by NSF under grants CCF-0829867 and IIS-0845439.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
  1.1 Approximate Network Alignment
  1.2 Topological Properties of Regulatory Networks
  1.3 Functional Properties of Metabolic Flux Cones
  1.4 Similarity Searches in Biological Network Databases
  1.5 Outline

2 TOPAC: ALIGNMENT OF GENE REGULATORY NETWORKS USING TOPOLOGY AWARE COLORING
  2.1 Introduction
  2.2 Background
  2.3 Our Coloring Method
  2.4 Pruning Network Nodes
  2.5 Experiments
    2.5.1 Performance Comparison Against the State of the Art Coloring
    2.5.2 Effects of K on Coloring
    2.5.3 Evaluation of the Node Pruning Method
    2.5.4 Biological Significance of the Results
  2.6 Conclusion

3 HIDEN: HIERARCHICAL DECOMPOSITION OF REGULATORY NETWORKS
  3.1 Introduction
  3.2 Algorithm
    3.2.1 HIDEN
    3.2.2 Example
    3.2.3 Divide and Conquer Method
  3.3 Results and Discussion
    3.3.1 Comparison with Existing Hierarchical Decomposition Methods
    3.3.2 Biological Evaluation of Network Hierarchies
      3.3.2.1 Functions of genes
      3.3.2.2 Gene essentiality
    3.3.3 Effects of Input on HIDEN
      3.3.3.1 Navigation of genes across levels in varying hierarchies
      3.3.3.2 Robustness of HIDEN
      3.3.3.3 Stability of HIDEN to network mutations
      3.3.3.4 Local versus global hierarchy of subnetworks
  3.4 Conclusion

4 INFERRING GENE FUNCTIONS FROM METABOLIC REACTIONS
  4.1 Introduction
  4.2 Methods
  4.3 Results
    4.3.1 Experimental Setup
    4.3.2 Correlation of Impact and Function
    4.3.3 Genetic Function Prediction is Possible Using Metabolic Network Analysis
  4.4 Conclusion

5 RINQ: REFERENCE-BASED INDEXING FOR NETWORK QUERIES
  5.1 Introduction
  5.2 Related Work
  5.3 Algorithm
    5.3.1 An Overview of Our Reference-Based Indexing
    5.3.2 Reference Selection
      5.3.2.1 Phase I: Creating the candidate reference set
      5.3.2.2 Phase II: Creating the final reference set
    5.3.3 Computing the Bounds
      5.3.3.1 Lower bound calculation
      5.3.3.2 Upper bound calculation
    5.3.4 Time Complexity of the Methods
      5.3.4.1 Complexity of lower bound calculation
      5.3.4.2 Complexity of approximate upper bound
      5.3.4.3 Complexity of candidate reference creation
      5.3.4.4 Complexity of reference selection
  5.4 Results and Discussion
    5.4.1 Effects of Reference Selection Strategy
    5.4.2 Effects of Index and Query Parameters
    5.4.3 Comparison of RINQ with Existing Methods
    5.4.4 Significance of Results
  5.5 Conclusion

6 CONCLUSION

REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

Table
2-1 Frequently used symbols in this chapter
2-2 Statistics of k-hop coloring for k = 1, 2 and 3.
2-3 Statistical information regarding the functional enrichment of the query networks we used in our experiments for 7, 8 and 9 node query networks.
3-1 Comparison of HIDEN with existing methods
3-2 Stability of HIDEN using adjacency
3-3 Stability of HIDEN using reachability
3-4 Comparison of the global hierarchy of subnetworks to their local hierarchy in terms of the adjacency penalty and the Z-score.
5-1 The distribution of the number of database networks to different sets for a given query.
5-2 The distribution of the networks in our database to different organisms.
5-3 The distribution of the networks in our database to different classes of networks.
5-4 Query statistics for RINQ

LIST OF FIGURES

Figure
2-1 An alignment of the networks Q and T.
2-2 Two different coloring instances of a subnetwork of three nodes.
2-3 A three node network with colors assigned.
2-4 Coloring instances of two different networks of four nodes using 2-hop coloring.
2-5 Fraction of correct results versus the number of iterations for k-hop coloring and random coloring.
2-6 The performance improvement of k-hop coloring over Alon et al. for query sizes of 4 to 25 ((Left) k = 1, (Right) k = 1, 2, 3).
2-7 The average network size after each iteration during alignments with node pruning.
2-8 The enrichment scores of each gene annotation term in query and target networks in the top five queries with the largest number of enriched terms.
3-1 Hierarchical decomposition of a sample network with seven nodes denoted by n1, n2, ..., n7 to three levels.
3-2 Result of the hierarchical decomposition of the network in Figure 3-1 using HIDEN.
3-3 The enrichment scores of the wound healing process at each level for the H. sapiens TRN.
3-4 Illustration of the distribution of the number of functions that each gene participates in for the TRN of H. sapiens (A), S. cerevisiae (B) and E. coli (C).
3-5 The ratio of essential genes of S. cerevisiae TRN at each level of the hierarchy.
3-6 The TRN of H. sapiens with a subnetwork related to cancer highlighted.
3-7 Illustration of the navigation of genes across levels for the TRN of H. sapiens (A), S. cerevisiae (B) and E. coli (C).
3-8 Evaluation of the robustness of HIDEN
4-1 Flux cone of a simple example network.
4-2 Result of removing an EFM from the flux cone in Figure 4-1
4-3 Illustration of the dimension reduction with knockout of reaction r3.
4-4 Cumulative histogram of the p-values calculated for each reaction and GO term
4-5 Number of total predictions versus ratio of correct predictions.
5-1 Overview of the lower bound calculation.
5-2 Overview of the upper bound calculation method.
5-3 The distributions of the number of genes and the number of interactions for each network in our database.
5-4 The size of the networks in the database.
5-5 Running time versus accuracy of our method for different reference selection strategies.
5-6 Impact of the number of references.
5-7 Running time and accuracy of RINQ for different query selectivity values.
5-8 Running time versus accuracy of SAGA, CTree and RINQ. Experiments are repeated for the same queries with different selectivities.
5-9 Sample queries from our experiments.

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

QUERYING LARGE BIOLOGICAL NETWORK DATASETS

By

Gunhan Gulsoy

August 2013

Chair: Tamer Kahveci
Major: Computer Engineering

New experimental methods have resulted in an increasing amount of genetic interaction data being generated every day. Biological networks are used to store the genetic interaction data gathered. The increasing amount of available data requires fast, large scale analysis methods. Therefore, we address the problem of querying large biological network datasets. We begin by considering the querying of large biological network datasets as a collection of individual comparisons. First, we define measures of similarity between two networks. In order to do this, we develop an algorithm to align two biological networks. Then, we validate this algorithm using biological evidence. This algorithm incorporates both the sequence similarity of nodes and the topology of the network itself. Then, we formulate an algorithm which uncovers a hierarchical structure in transcriptional regulatory networks. This algorithm works purely on the topology of the network. Using this method, we show relations between the functions of genes and topological properties of networks. Finally, we analyze functional properties of metabolic networks. We first calculate the elementary flux modes of the metabolic networks. Then, for each genetic function, we analyze relations between functions and metabolic flux cones. In this method, we aim to analyze networks functionally. Next, we consider large datasets. In biological networks, pairwise operations are costly. Therefore, exhaustive comparison of all networks with a query is infeasible. In order to tackle this problem, we develop reference based indexing. In this method, we first build a small set of reference networks. Then, instead of aligning a query with all the database networks, we use references to calculate upper and lower bounds for the alignment scores between the query and all the database networks. Using these bounds, we calculate 80% of the database networks quickly. We experimentally show that we can successfully mine statistically and biologically significant relationships in a large database of biological networks. Finally, we propose a new method, which uses a dynamic set of references in reference based indexing.

CHAPTER 1
INTRODUCTION

In living organisms, many tasks are carried out by interacting molecules. These interactions are represented using biological networks or pathways. In the literature, the terms pathway and network are used interchangeably. We will use the term biological networks throughout this work. Depending on the type of interactions, biological networks are classified as metabolic (Francke et al. 2005), gene regulatory (Levine and Davidson 2005) and protein interaction networks (Giot and Bader et al. 2003). This work focuses on the general requirements for creating biological network databases.

Mining information from biological networks has been an important goal in biological sciences. Researchers focus on a number of different problems related to biological networks when extracting data from these models. The ultimate goal of these studies is to build large scale databases of biological networks. In this thesis, we aim to model systems that support efficient querying of biological network databases. Next, we elaborate on a number of problems related to biological network databases, and our solutions to these problems.

1.1 Approximate Network Alignment

An important type of analysis of biological networks is comparison based analysis. Pairwise network alignment is an invaluable tool for comparative analysis of organisms. It tries to find a mapping between the nodes of two networks in order to determine the level of similarity between the two networks. It has been successfully used for finding functional annotations (Clemente et al. 2006), identifying drug targets (Sridhar et al. 2007), reconstructing metabolic networks from newly sequenced genomes (Francke et al. 2005), and building phylogenetic trees (Clemente et al. 2005). However, pairwise alignment of two networks is a computationally challenging problem. State of the art methods reduce the global and local alignment of biological networks to the graph and subgraph isomorphism problems. These problems are GI and NP-complete respectively.

A number of existing methods employ approximate (Dost et al. 2008; Pinter et al. 2005; Shlomi et al. 2006) and heuristic (Ay and Kahveci 2010b; Ay et al. 2009; Liao et al. 2009) strategies to speed up the solution at the expense of reduced accuracy. However, they still require a significant amount of time. In Chapter 2, we develop a pairwise network alignment algorithm which tries to find the optimal alignment between two networks with a provable confidence bound. Our method is primarily based on color coding (Alon et al. 1995) and dynamic programming (Dost et al. 2008). We introduce new algorithmic improvements on both color coding and dynamic programming, which reduce the running time by at least 60%. This work was published and presented at the 2nd Annual ACM Conference on Bioinformatics, Computational Biology and Biomedicine (ACM-BCB 2011) in Chicago, Illinois (Gulsoy et al. 2011). This work has also been published in the Journal of Bioinformatics and Computational Biology (JBCB) as an invited paper (Gulsoy et al. 2012b).

1.2 Topological Properties of Regulatory Networks

Topologies of biological networks store invaluable information about the organization of the network. Analysis of the topology of biological networks shows that these networks share common local and global properties. One of the global properties of biological networks is hierarchical organization. Hierarchical organization defines a partial ordering for signal transduction in biological networks. Recent studies have shown that transcriptional regulatory networks have strong hierarchical organizations (Bhardwaj et al. 2010a,b; Hartsperger et al. 2010; Jothi et al. 2009; Yu and Gerstein 2006).

In Chapter 3, we develop an algorithm to reveal the underlying hierarchical organization of transcriptional regulatory networks. We provide a reduction of the problem to a mixed integer problem. Then, we use existing linear problem solvers to find the underlying hierarchy. However, the mixed integer problem is NP-hard (Garfinkel and Nemhauser 1972). Therefore, for larger networks the running time of our algorithm becomes infeasible. We use a divide and conquer strategy to tackle running time issues in larger networks. This work was published in the BioMed Central Bioinformatics journal in 2012 (Gulsoy et al. 2012a).

1.3 Functional Properties of Metabolic Flux Cones

Metabolic networks model the vital reactions and the relations between these reactions (Francke et al. 2005). Analysis of metabolic networks provides insights into how reactions take place in cells. A number of different approaches have been proposed to analyze metabolic networks (Bell and Palsson 2005; Kaleta et al. 2009; Larhlimi and Bockmayr 2009; Schuster et al. 1999; Stelling et al. 2002; Terzer and Stelling 2008). A recent approach is to use the steady states of metabolic networks to analyze their properties. The steady state of a network is defined as a snapshot of the system in which the rates of change of the metabolites do not change over time (Voit 1992). The system reaches this state by balancing the rates of production and consumption of the internal metabolites. Steady states of metabolic networks are invaluable tools, because they identify long term behaviors of networks.

In Chapter 4, we systematically analyze the metabolic flux cones of different organisms. We first perform individual reaction knockouts on metabolic networks. Then, we quantify the change in the flux cone of the networks after knockouts. Finally, we correlate reaction knockouts with different cellular functions based on functional annotations of the genes. This work was presented at the 2012 IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS 2012) (Gulsoy and Kahveci 2012).

1.4 Similarity Searches in Biological Network Databases

For many applications, databases need to provide a number of operations on the data. One of the most widely used database operations is similarity search. In Section 1.1, we describe the computational cost of pairwise network alignment. Thus, in biological network databases, the large number of pairwise comparisons required by similarity queries proves to be highly time consuming. Exhaustive searches for similarities in biological network databases take days if not weeks. Therefore, in order for such databases to be practical, a lot of work needs to be done in this field.

In large databases, indexes are invaluable tools for speeding up data retrieval (Ullman et al. 2001). They quickly eliminate large portions of the database without individually examining the data. There have been a number of attempts to use indexing in biological network and graph databases (Giugno and Shasha 2002; He and Singh 2006; Mongiovi et al. 2010; Yan et al. 2004). These methods use feature (Giugno and Shasha 2002; Mongiovi et al. 2010; Yan et al. 2004) and tree (Giugno and Shasha 2002; Mongiovi et al. 2010; Yan et al. 2004) based indexing with limited success.

In Chapter 5, we explain our work on implementing reference based indexing in biological network databases. To the best of our knowledge, our method is the first attempt at using reference based indexing on network or graph databases. This work was published and presented at the 19th Annual International Conference on Intelligent Systems for Molecular Biology and 10th European Conference on Computational Biology (ISMB/ECCB 2011) in 2011 in Vienna, Austria (Gulsoy and Kahveci 2011).

1.5 Outline

The rest of this work is organized as follows: In Chapter 2, we elaborate on our work on approximate alignment of biological networks. In Chapter 3, we discuss our work on the hierarchical organization of transcriptional regulatory networks. In Chapter 4, we discuss our work on the analysis of functional properties of metabolic networks. In Chapter 5, we explain our work on reference based indexing in biological network databases.

CHAPTER2TOPAC:ALIGNMENTOFGENEREGULATORYNETWORKSUSINGTOPOLOGYAWARECOLORING 2.1IntroductionBiologicalprocessesareoftengovernedthroughinteractionsofmanymolecules.Dependingonthetypesofinteractions,collectionsofsuchmoleculesareexplainedthroughvarioustypesofnetworkssuchasmetabolic( Franckeetal. 2005 ),generegulatory( LevineandDavidson 2005 )orproteininteraction( GiotandBaderetal. 2003 )networks.Wewillusethetermbiologicalnetworktodescribetheminthischapter.Understandingbiologicalnetworkshasbeenoneofthemaingoalsofbiologicalsciencesastheycontainkeyinformationregardinghoworganismswork.Toachievethisgoal,biologicalnetworksareanalyzedinanumberofways.Oneofthem,comparisonbasedanalysis,identiessimilarpartsoftwobiologicalnetworksbyaligningthem.Suchanalysishasbeensuccessfullyusedinmanyapplicationssuchasndingfunctionalannotations( Clementeetal. 2006 ),identifyingdrugtargets( Sridharetal. 2007 ),reconstructingmetabolicnetworksfromnewlysequencedgenome( Franckeetal. 2005 ),andbuildingphylogenetictrees( Clementeetal. 2005 ).Existingliteratureoftenconsidersnetworkalignmentasagraphorsubgraphisomorphismproblem.Inordertodothis,theyconsidereachmolecule(i.e.,geneorprotein)asanodeandeachinteractionasanedgethatconnectsthenodescorrespondingtothatinteraction.Thesemethodsmeasurethequalityofanalignmentintermsofascoringfunction.Weexplaintheconceptofnetworkalignmentandscoringfunctionlaterinthissection.Thegraphandsubgraphisomorphismproblems,however,donothaveknownpolynomialtimesolutions( Cook 1971 ; GareyandJohnson 1990 ).Wecanclassifyexistingsolutionsundertwocategories.Therstonedevelopsheuristicsolutionswithpolynomialtimecomplexities( AyandKahveci 2010b ; Ayetal. 2009 ; Chindelevitchetal. 2010 ; Liaoetal. 2009 ).Althoughthesemethodsoftenworkinpracticaltime,itisimpossibletotellhowtheirresultscomparetothoseoftheoptimal 15

PAGE 16

one(i.e.,thealignmentwiththehighestscorebasedonagivenscoringfunction).Thesecondone(whichisthefocusofthischapter)searchesfortheoptimalalignment( Dostetal. 2008 ; Kelleyetal. 2004 ; Pinteretal. 2005 ; Shlomietal. 2006 ).Inordertosimplifythenetworkalignmentproblem,existingmethodsimposerestrictionsonnetworktopologies( Dostetal. 2008 ; Kelleyetal. 2004 ; Shlomietal. 2006 ).Forinstance,QPathcanonlyalignlinearpathwayswithproteininteractionnetworks( Shlomietal. 2006 ).Also,inordertoreducetherunningtime,theyensureoptimalitywithaprovablecondencebound,whichislessthan100%( Dostetal. 2008 ; Shlomietal. 2006 ).Forexample,someuserandomizedtechniquessuchascolorcodingthatreturnsaresultwhichisoptimalwithapredenedcondence( Dostetal. 2008 ; Shlomietal. 2006 ).Wewillelaborateonexistingmethods,particularlyoncolorcodingmethods,inSection 2.2 .Formaldenitionoftheproblemweconsiderinthischapterisasfollows:ProblemDenition:AssumethatwearegiventwobiologicalnetworksdenotedwithQ=(VQ;EQ)andT=(VT;ET),whereVQandVTarethesetofnodes(i.e.,genesorproteins)andEQ,ETarethesetofedges(interactions)respectively.AnalignmentofQandTisabijectionbetweenthenodesofQandT,whereamappingbetweentwonodesimpliesthattheedgesbetweenthem(iftheyexist)alignwitheachother.Figure 2-1 showsthealignmentbetweentwohypotheticalnetworks.Dashedlinesinthegureshowthenodemappingsbetweenthenodesofthetwonetworks.AnalignmentallowsinsertionsanddeletionsofnodesoredgesbymatchingthemtoNULL.ThescoreofanalignmentofQandTisthesumoffollowingthreeterms: 1. Pairwisesimilarityscoresofthematchingnodes 2. Weightsofthematchingedges, 3. Insertionanddeletionpenalties.Analignmentisoptimalifithasthelargestscoreamongallpossiblealignments.Wesaythatwehaveacondenceof(01)intheresultifthealignmentisoptimalwith 16

PAGE 17

Figure2-1. AnalignmentofthenetworksQandT.fu1;222;:::;u5gandfv1;v2;:::;v5garethesetofnodesinnetworksQandTrespectively.Thesolidlinesrepresenttheinteractionsbetweenpairsofmolecules.Thedashedlinesconnectthealignednodestoeachother.Notethatunalignednodes(e.g.,v5)canexistinanynetwork.Theycorrespondtoinsertionsordeletionsinthealignment. probability.OuraimistodevelopanalgorithmtondtheoptimalalignmentbetweenQandTwithagivencondenceinpracticaltime.Contributions:Inthischapter,wedevelopanovelrandomizedbiologicalnetworkalignmentmethod,TOPAC(AlignmentofgeneregulatorynetworksusingTOPologyAwareColoring),whichcanaligntwonetworksoptimallywithaprovablecondencebound.Morespecically,ourmaincontributionhereisasmartcolorcodingmethod.UnlikethestateoftheartcoloringmethodofAlonetal.( Alonetal. 1995 ),ourmethodconsidersthecolorassignmentsalreadymadeintheneighborhoodofeachnodewhileassigningacolortothatnode.Thiswayitpreemptivelyavoidsmanynonpromisingcolorassignments.Weformulatethecondencecalculationforourcoloringmethod.Wealsodevelopalteringmethodthateliminatesthenodeswhichcannotbealignedwithoutreducingthealignmentscoreaftereachcoloringinstance.Weevaluatetheimpactofourcoloringmethodwhenthetargetnetworkhasanarbitrarytopologyandthequerynetworkisatreeoraboundedtreewidthgraph.WedemonstrateboththeoreticallyandexperimentallythatourcoloringmethodoutperformsthatofAlonetal.( Alonetal. 1995 ) 17

PAGE 18

Table2-1. Frequentlyusedsymbolsinthischapter SymbolConcept Q,TQueryandthetargetnetworkstobealignedVQ,VTThesetofnodesinqueryandtargetnetworksEQ,ETThesetofinteractionsinqueryandtargetnetworksui,vjNodesofnetworksQandTm,nNumberofnodesinqueryandtargetnetworksSim(ui;vj)SimilarityscorebetweenthenodesuiandvjInsertionordeletionpenaltyCondenceintheoptimalityofthealignmentCSetofcolorstocolorthetargetnetworkc(vj)ColorofthenodevjRuiCompletesetofchildrenofnodeuiinQui(k)kthchildofuiinQN(vj)SetofneighborsofnodevjinnetworkTScore(Q;T)TheoptimalalignmentscoreofQandTLB(Q;T)LowerboundforScore(Q;T)UB(Q;T)UpperboundforScore(Q;T) whichisalsousedinQPath( Shlomietal. 2006 )andQNet( Dostetal. 2008 )byafactorofthreewithoutreducingthecondenceintheoptimalityoftheresult.Weuseanumberofsymbolstodescribeourmethod.Table 2-1 liststhemostfrequentlyusedsymbolsthroughoutthischapter.Therestofthischapterisorganizedasfollows:Section 2.2 summarizesexistingbiologicalnetworkalignmenttoolsandelaboratesoncolorcodinganddynamicprogrammingalgorithms.Section 2.3 explainsourcoloringmethod.Section 2.4 discussesourlteringmethodthatreducesthenetworksize.Section 2.5 presentsourexperimentresults.Finally,section 2.6 brieyconcludesthechapter. 2.2BackgroundExistingsolutionsforaligningbiologicalnetworkscanbegroupedundertwomainclasses;heuristicandoptimalalignmentstrategies.Wediscussselectexamplesofthesesolutionsnext.Heuristicstrategies:Mostoftheexistingliteraturefallunderthiscategory.MetaPathwayHunteralignsaquerynetworkofaspecictopologytoacollectionof 18

PAGE 19

metabolic networks (Pinter et al. 2005). Tohsato et al. proposed an algorithm which aligns linear paths within metabolic networks based on the features of individual enzymes in the network (Y et al. 2000). MetNetAligner allows insertion and deletion of enzymes (Q et al. 2009). Torque aligns protein interaction networks by ignoring the query topology imposed by the interactions (Bruckner et al. 2009). IsoRank (Liao et al. 2009; Singh et al. 2007) and SubMAP (Ay and Kahveci 2010b; Ay et al. 2009) are two recent algorithms which combine node similarity and network topology information for biological network alignment. These studies show that taking the network topologies into account increases the accuracy of the resulting alignment. However, all these solutions are heuristic algorithms. They neither calculate a provable confidence value for the resulting alignment nor provide an error percentage for the alignment score. Thus, it is impossible to tell how well they do as compared to the optimal alignment.

Optimal alignment strategies: In order to find the optimal alignment between two biological networks, the algorithms in this class reduce this problem to the subgraph isomorphism problem. However, the subgraph isomorphism problem has no known polynomial time solution. Thus, in order to find the optimal alignment in practical time, these algorithms enforce additional restrictions on query network size and topology. PathBLAST, for instance, can only align linear pathways to large networks (Kelley et al. 2004). This algorithm also limits the number of node insertions and deletions in the alignment. QPath (Shlomi et al. 2006) aligns linear networks with protein interaction networks using a probabilistic method called color coding (Alon et al. 1995). QNet extends QPath to align tree networks and bounded treewidth graphs with protein interaction networks (Dost et al. 2008). Huffner et al. propose improvements on color coding which improve its performance (Huffner et al. 2008). We elaborate on QNet later in this section. These algorithms aim to find the optimal alignment between two networks. Although they do not guarantee optimality,

PAGE 20

they provide provable confidence values for their results. Next, we discuss how the confidence value is computed using a technique called color coding.

Figure 2-2. Two different coloring instances of a subnetwork of three nodes. The letters x, y and z represent different colors which can be assigned to the network nodes. $c(v_i)$ represents the color of a node $v_i$. (Left) An instance which is not colorful ($c(v_0) = c(v_2)$). (Right) A colorful assignment of colors.

Color coding and dynamic programming: Color coding is a randomized method for finding small subnetworks within large networks (Alon et al. 1995). Let us denote the number of nodes in the query network with m. Color coding begins by coloring the nodes of the target network using m colors randomly. We denote the set of m colors with C. We say that a subnetwork of the target network is colorful if all of its nodes have different colors. Figure 2-2 demonstrates two instances of coloring of a three node network, among which only one is colorful. Instead of finding an instance of the query, the problem now becomes finding a colorful instance of the query network. This reformulation of the problem reduces the time and the space complexity of the problem by a factor of $(n/2)^m$. This is a significant improvement as network sizes in the order of hundreds to thousands are common. One of the most recent algorithms that use color coding is QNet (Dost et al. 2008). Next, we explain how coloring is used to measure the confidence in an alignment by focusing on QNet.
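The notions of random coloring and colorfulness can be illustrated with a minimal Python sketch. The function names (`random_coloring`, `is_colorful`, `count_colorful`) are hypothetical and chosen only for this illustration; they are not taken from QNet or TOPAC. The last function enumerates every coloring of a small node set, which is feasible only for tiny m.

```python
import random
from itertools import product


def random_coloring(nodes, num_colors, rng=random):
    """Assign each node one of num_colors colors uniformly at random."""
    return {v: rng.randrange(num_colors) for v in nodes}


def is_colorful(subnetwork_nodes, coloring):
    """True if no two nodes of the subnetwork share a color."""
    colors = [coloring[v] for v in subnetwork_nodes]
    return len(set(colors)) == len(colors)


def count_colorful(m):
    """Enumerate every coloring of m nodes with m colors and count
    the colorful ones (illustrative brute force, small m only)."""
    nodes = list(range(m))
    return sum(
        is_colorful(nodes, dict(zip(nodes, colors)))
        for colors in product(range(m), repeat=m)
    )
```

For the three node subnetwork of Figure 2-2, `is_colorful` rejects the left instance, where $c(v_0) = c(v_2)$, and accepts the right one.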

PAGE 21

QNet develops a dynamic programming algorithm that uses three different functions to handle match, insertion and deletion of the nodes of the given networks. Let M, I and D respectively denote these functions. M has four arguments which are: query tree node $u_i$, target network node $v_j$, the color set $S$ used by the partial alignment, which is a subset of $C$, and the set of first k children of $u_i$ denoted by $R_{u_i}(k)$. D and I have all the arguments M has except $R_{u_i}(k)$. For every node pair $(u_i, v_j)$ where $u_i \in Q$ and $v_j \in T$, the base case is set as follows:

$$M(u_i, v_j, \{c(v_j)\}, \emptyset) = Sim(u_i, v_j), \qquad I(u_i, v_j, \emptyset) = D(u_i, v_j, \emptyset) = 0 \qquad (2)$$

After initializing M, I and D, dynamic programming recursions take place. The following defines these recursive functions:

$$M(u_i, v_j, S, R_{u_i}(k)) = \max_{v_{j'} \in N_j,\; S' \subset S} \begin{cases} M(u_i, v_j, S', R_{u_i}(k-1)) + M(u_i(k), v_{j'}, S - S', R_{u_i(k)}) \\ M(u_i, v_j, S', R_{u_i}(k-1)) + I(u_i(k), v_{j'}, S - S') \\ M(u_i, v_j, S', R_{u_i}(k-1)) + D(u_i(k), v_j, S - S') \end{cases}$$

$$I(u_i, v_j, S) = \max_{v_{j'} \in N_j} \begin{cases} M(u_i, v_{j'}, S, R_{u_i}) + \delta \\ I(u_i, v_{j'}, S) + \delta \end{cases}$$

$$D(u_i, v_j, S) = \max_{v_{j'} \in N_j} \begin{cases} M(u_i(1), v_{j'}, S, R_{u_i(1)}) + \delta \\ D(u_i(1), v_{j'}, S) + \delta \end{cases}$$

After the completion of these recursions, one can find the score of the best alignment as:

$$\max \begin{cases} \max_{v_j} M(u_0, v_j, C, R_{u_0}) \\ \max_{v_j} I(u_0, v_j, C) \\ \max_{v_j} D(u_0, v_j, C) \end{cases}$$

PAGE 22

We can find the alignment by backtracking the matrices corresponding to M, I and D. A more detailed explanation of the dynamic programming formulation is in Dost et al. (Dost et al. 2008).

Dynamic programming can discover the optimal alignment if and only if the subnetwork of T that corresponds to the optimal alignment is colorful. Since all coloring instances have the same likelihood, the chance of finding the optimal alignment is equal to the ratio of the number of colorful instances to the total number of all coloring instances of that subnetwork. For random coloring, the total number of coloring instances is $m^m$ for the m node subnetwork. The number of instances in which the subnetwork is colorful is $m!$. Thus, the chance of finding the optimal alignment is $m!/m^m$. In order to increase the chance of finding the optimal alignment, the color coding method repeats coloring and alignment until a required confidence $\epsilon$ (where $\epsilon \in [0, 1]$) is met. For a given confidence value $\epsilon$, we compute the number of iterations r needed to achieve at least $\epsilon$ confidence from the following equation:

$$\epsilon = 1 - \left(1 - \frac{m!}{m^m}\right)^r.$$

Thus, we have:

$$r = \frac{\log(1 - \epsilon)}{\log\left(1 - \frac{m!}{m^m}\right)}. \qquad (2)$$

2.3 Our Coloring Method

Color coding fails to find the optimal alignment if the true solution is not colorful. In order to address this, color coding performs multiple Bernoulli trials. At each trial, it colors the target network randomly and performs alignment with the expectation that at least one of those trials (i.e., iterations) will be successful. As the number of iterations increases, the confidence in the optimality of the result grows. A large number of iterations is however undesirable as the running time grows with the number of iterations.
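The iteration count required by random coloring follows directly from the success probability $m!/m^m$ of a single trial. The following Python sketch evaluates it; the function names are hypothetical, and rounding the result up to the next integer is an assumption made here so that the requested confidence is reached.

```python
from math import ceil, factorial, log


def colorful_probability(m):
    """Probability m!/m^m that m given nodes receive m distinct colors
    when each node is colored independently with one of m colors."""
    return factorial(m) / m**m


def iterations_random_coloring(m, confidence):
    """Smallest integer r with 1 - (1 - m!/m^m)^r >= confidence."""
    if not 0.0 <= confidence < 1.0:
        raise ValueError("confidence must lie in [0, 1)")
    p = colorful_probability(m)
    if p >= 1.0:  # m == 1: a single trial always succeeds
        return 1
    return ceil(log(1.0 - confidence) / log(1.0 - p))
```

For a nine node query and 80% confidence, `iterations_random_coloring(9, 0.80)` evaluates to 1718 under these assumptions.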

PAGE 23

Figure 2-3. A three node network with colors assigned. x and y are the colors used to color this network. This instance of coloring is 1-hop colorful. However, it is not 2-hop colorful.

The number of iterations in any coloring method depends on two parameters. These are the confidence value $\epsilon$ and the probability of coloring the correct subnetwork colorfully in the target network in a single iteration. Among these, the first one is supplied by the user. As a result it cannot be altered. Our coloring method increases the value of the second parameter to reduce the number of iterations. Before we explain our coloring method, we define the concept of distance between two network nodes and the neighborhood of a node:

Definition 1. (DISTANCE): The distance between two nodes in a network is the number of edges on the shortest path between them.

Definition 2. (k-NEIGHBORHOOD): Given a positive integer k, the k-neighborhood of a node $v_j$ in a network T is the set of all nodes of T with distance at most k to $v_j$.

Our coloring method arises from the following observation. A query is a connected biological network. An alignment maps the nodes and the edges of the query network to those of the target network. As a result, most of the nodes of this subnetwork are connected to each other directly or through a sequence of edges. Our coloring method exploits this by assigning different colors to the nodes whose distance to each other is small. We formalize this in the following definition.

Definition 3. (k-HOP COLORING): We call an assignment of colors to all the nodes of a given network k-hop colorful if the network contains no two nodes of the same color within a distance of k to each other.

Figure 2-3 depicts an assignment of colors for a three node network. 1-hop coloring allows this assignment of colors since every node has a different color from its neighbors.
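Definitions 1 to 3 can be made concrete with a short Python sketch. The adjacency-list representation and the function names are assumptions made for illustration only. The k-neighborhood is computed with a breadth-first search and, as a convention of this sketch, excludes the node itself.

```python
from collections import deque


def k_neighborhood(adj, source, k):
    """Nodes other than source whose distance to source is at most k.

    adj maps each node to an iterable of its neighbors (undirected)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        if dist[v] < k:  # do not expand beyond distance k
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
    return set(dist) - {source}


def is_k_hop_colorful(adj, coloring, k):
    """True if no two distinct nodes within distance k share a color."""
    return all(
        coloring[v] != coloring[w]
        for v in adj
        for w in k_neighborhood(adj, v, k)
    )
```

Applied to a three node path colored x, y, x, as in Figure 2-3, the check succeeds for k = 1 and fails for k = 2.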

PAGE 24

However, this specific assignment of colors is not possible using k-hop coloring, where $k \ge 2$. This is because $v_0$ and $v_2$ have the same color, and the distance between these two nodes is 2.

The first question we need to answer at this point is how fast k-hop coloring reaches a given confidence. The following theorem establishes a lower bound to the probability that a k-hop coloring instance is successful.

Theorem 1. The probability of success (p) of k-hop coloring is bounded as:

$$p \ge \frac{m!}{(m-k)^{m-k} \prod_{i=0}^{k-1} (m-i)}.$$

Proof. Let us denote the set of nodes to be colored with $\{v_0, v_1, \ldots, v_{m-1}\}$. We first prove this theorem on a specific network topology, where the nodes are connected as a path. That is, there is an edge $(v_i, v_{i+1})$ for all $i \in \{0, 1, \ldots, m-2\}$. Figure 2-4A illustrates this on a small network. Later we will generalize this theorem to arbitrary topologies.

We first compute the number of k-hop colorings. The total number of legal k-hop colorings does not depend on the order in which we color the nodes. Therefore, to simplify our proof, we color the nodes in $v_0, v_1, \ldots, v_{m-1}$ order. We can assign m colors to $v_0$. However, once the color of $v_0$ is determined there are $m-1$ choices for $v_1$. This is because the distance between $v_0$ and $v_1$ is at most k, so they are not allowed to have the same color. Similarly, for all $i \le k-1$ we have $m-i$ choices for $v_i$ since i colors are used by other nodes in its k-neighborhood. For each of the remaining $m-k$ nodes $v_i$ ($i \ge k$) there are $m-k$ possible colors as there are k unique colors used among their k-distance neighbors. Thus, the number of k-hop colorings is $(m-k)^{m-k} \prod_{i=0}^{k-1} (m-i)$.

Now, we are ready to extend our computation to arbitrary topologies. Each node in the k-neighborhood of a node $v_i$ asserts a restriction on the number of colors that $v_i$ can be colored with. As a result, the number of all k-hop colorings reduces or remains the same if we increase the size of the k-neighborhood of a node. In any connected

PAGE 25

network, each node has at least k nodes in its k-neighborhood. Linear path minimizes the size of the k-neighborhood as each node is connected to at most two other nodes. Therefore, the total number of k-hop colorings in connected networks with m nodes is at most $(m-k)^{m-k} \prod_{i=0}^{k-1} (m-i)$. The number of colorful instances of the same m nodes is simply the number of permutations of the m colors to these nodes, i.e., $m!$. The probability of success is the ratio of colorful instances to that of all k-hop colorful instances, that is:

$$p \ge \frac{m!}{(m-k)^{m-k} \prod_{i=0}^{k-1} (m-i)}.$$

Figure 2-4. Coloring instances of two different networks of four nodes using 2-hop coloring. Node labels are written inside the nodes. The coloring is performed in $v_0$, $v_1$, $v_2$, $v_3$ order. The colors are represented with x, y, z and t. List of possible colors for a node is written in curly brackets near each node. The chosen color for each node is underlined. The figures show the process of (Left) coloring a simple linear path and (Right) coloring a branching network. Note that in (Right) list of possible colors of $v_3$ is smaller. This is because in (Right) all the remaining nodes are in 2-neighborhood of $v_3$.

Notice that k-hop coloring limits many useless coloring instances by including an additional constraint parameterized using k. Figure 2-4A presents this on a concrete example when the network contains four nodes. For $k = 2$, the number of possible 2-hop colorings is $2^2 \times 4 \times 3 = 48$. On the other hand the total number of colorings using the standard coloring method of Alon et al. (Alon et al. 1995) is much larger

PAGE 26

($4^4 = 256$). The network in Figure 2-4B violates the linearity in Figure 2-4A by connecting $v_3$ with $v_1$ instead of $v_2$. This increases the size of the 2-neighborhood of $v_0$ and $v_3$. As a result, the number of possible 2-hop colorings reduces from 48 to 24.

As the value of k grows, the k-hop coloring method becomes more and more stringent. For instance, when k is equal to $m-1$, using the k-hop coloring method, we can ensure that any connected subnetwork of size m is colorful. This implies that $(m-1)$-hop coloring guarantees to return the optimal connected subnetwork in only one iteration. However, this restriction comes at a price. Not all the networks can be colored using k-hop coloring. For example, if there is a clique of size $m+1$ in the network, it cannot be colored with $(m-1)$-hop coloring using m colors, since all of its nodes would need distinct colors. As the value of k drops, the set of networks that can be colored grows. Our experiments demonstrate that more than 99% of all the networks in KEGG (Ogata et al. 1999) are colorable using the 1-hop coloring method. We elaborate on this issue in Section 2.5.1. The following corollary sets an upper bound to the number of iterations required by our method.

Corollary 1. The number of iterations needed by k-hop coloring to achieve a confidence $\epsilon$ is at most:

$$\frac{\log(1 - \epsilon)}{\log\left(1 - \frac{m!}{(m-k)^{m-k} \prod_{i=0}^{k-1} (m-i)}\right)}.$$

One can prove Corollary 1 by substituting the success probability in Equation 2 with that of Theorem 1. As k grows, the upper bound to the number of iterations drops, and so does the percentage of colorable networks. We observe that small values of k (such as $k = 1$) result in a significant improvement in the number of iterations with a negligible drop in colorable networks (see Section 2.5.2 for details).

Assigning colors for k-hop coloring. So far, we have explained how k-hop coloring reduces the number of iterations needed to achieve a given confidence. We, however, have not explained how we assign colors to the nodes of the target network to ensure that the coloring follows Definition 3. Next, we elaborate on this.
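Before turning to color assignment, the counts and bounds of this section can be checked numerically. The Python sketch below is illustrative only: the function names are hypothetical, the brute-force counter is practical only for very small networks, the iteration count is rounded up to an integer, and allowing $k = 0$ (which reduces the bound to $m!/m^m$, i.e., random coloring) is a convenience assumed here rather than part of Theorem 1.

```python
from collections import deque
from itertools import product
from math import ceil, factorial, log, prod


def nodes_within(adj, source, k):
    """Nodes other than source at distance at most k (breadth-first search)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        if dist[v] < k:
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
    return set(dist) - {source}


def count_k_hop_colorings(adj, k, num_colors):
    """Brute-force count of k-hop colorful assignments (tiny networks only)."""
    nodes = list(adj)
    near = {v: nodes_within(adj, v, k) for v in nodes}
    count = 0
    for colors in product(range(num_colors), repeat=len(nodes)):
        coloring = dict(zip(nodes, colors))
        if all(coloring[v] != coloring[w] for v in nodes for w in near[v]):
            count += 1
    return count


def success_lower_bound(m, k):
    """Lower bound of Theorem 1 on the per-iteration success probability."""
    if not 0 <= k < m:
        raise ValueError("k must satisfy 0 <= k < m")
    return factorial(m) / ((m - k) ** (m - k) * prod(m - i for i in range(k)))


def max_iterations(m, k, confidence):
    """Upper bound of Corollary 1, rounded up to an integer."""
    p = success_lower_bound(m, k)
    if p >= 1.0:  # k == m - 1: one iteration suffices
        return 1
    return ceil(log(1.0 - confidence) / log(1.0 - p))
```

On a four node path and on the branching network obtained by attaching $v_3$ to $v_1$, `count_k_hop_colorings` with `k=2` and four colors reproduces the counts 48 and 24 discussed above.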

PAGE 27

We can perform k-hop coloring in several ways. Here, we describe two strategies. The first one we propose is a pessimistic strategy. This strategy assumes that randomly assigning colors to nodes will most likely violate Definition 3. Thus, it ensures that the network remains k-hop colorful after each newly colored node by asserting restrictions on the set of colors that can be assigned to that node. Briefly, this strategy works as follows:

repeat
    - Pick an uncolored node from the target network.
    - Find the set of colors that can be assigned to this node without violating Definition 3 using the set of already colored nodes.
    if the set of available colors is empty then
        Clear all colors from all nodes and restart coloring.
    else
        Assign a random color from the set of available colors to the current node.
    end if
until all the nodes in the target network are colored

This algorithm first assumes that all colors are equally probable for every node in the network. It then picks every node in the network one by one, in a random order. For every node, it checks the k-neighborhood of the node. If there are any previously considered and colored nodes in the neighborhood, it removes the colors of those nodes from the list of possible colors for the current node. If there are any colors left in the possible color list, it picks a color randomly and assigns it to the current node. If the list of possible colors is empty, assigning any color to the current node will violate the k-hop colorfulness of the network. Therefore we clear all the assigned colors from the network, and start over.

Previously, we explained that not all the networks are colorable using k-hop coloring. This algorithm will run into an infinite loop if the network is uncolorable.

PAGE 28

This is because, when it fails to find a k-hop coloring at one iteration of the repeat-until loop, there are two possibilities. The first one is poor color selection by the randomization process. The second one is that the topology of the target network makes it impossible to follow the k-hop coloring definition (e.g., existence of a clique with more than m nodes). Determining if a network is colorable using k colors is an NP-complete problem (Björklund et al. 2009; Lawler 1976). Therefore, there are no known polynomial time algorithms which quickly filter out uncolorable networks, making it very hard to tell the reason for the failure. In order to avoid running into an infinite loop, we set a limit on the number of restarts (failures) of this strategy. Once the above strategy fails that many times, we declare the network uncolorable using k-hop coloring and try $(k-1)$-hop coloring instead. In practice, we observe that setting this limit to a value as small as 10 to 20 works well in distinguishing k-hop colorable networks from those that are not.

Our second strategy is optimistic. It assigns colors to all the nodes of the target network randomly with the hope that the resulting assignment will follow Definition 3. It then rejects the coloring as soon as it violates this definition. Briefly, it works as follows:

repeat
    - Pick an uncolored node from the target network and assign a random color to that node.
    if this node leads to an illegal coloring then
        Clear all colors from all nodes and restart coloring.
    end if
until all the nodes in the target network are colored

Starting from an uncolored network, this strategy assigns random colors to each node in the network one by one. After each color assignment to a node, it checks the k-neighborhood of that node. If there is a node in this neighborhood with the same color as the current node, the algorithm clears all colors and restarts the color assignments.
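Both strategies can be sketched in Python as follows. This is an illustration under stated assumptions, not the C++ implementation used in the experiments: the function names, the adjacency-list input, and the default restart limit of 20 are hypothetical choices. Applying the restart limit to the optimistic strategy as well, and continuing the fallback from k down to 0 hops (where 0-hop coloring is plain random coloring), are assumptions of this sketch.

```python
import random
from collections import deque


def k_neighbors(adj, source, k):
    """Nodes other than source whose distance to source is at most k."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        if dist[v] < k:
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
    return set(dist) - {source}


def k_hop_coloring(adj, k, num_colors, pessimistic=True, max_restarts=20, rng=None):
    """Return a k-hop colorful coloring, or None after too many restarts."""
    rng = rng or random.Random()
    nodes = list(adj)
    near = {v: k_neighbors(adj, v, k) for v in nodes}
    for _ in range(max_restarts + 1):
        rng.shuffle(nodes)  # visit the nodes in a random order
        coloring = {}
        for v in nodes:
            used = {coloring[w] for w in near[v] if w in coloring}
            if pessimistic:
                # Restrict the choice to colors unused in the k-neighborhood.
                available = [c for c in range(num_colors) if c not in used]
                if not available:
                    break  # dead end: clear all colors and restart
                color = rng.choice(available)
            else:
                # Optimistic: pick any color, reject on the first violation.
                color = rng.randrange(num_colors)
                if color in used:
                    break
            coloring[v] = color
        else:
            return coloring  # every node was colored legally
    return None


def color_with_fallback(adj, k, num_colors, max_restarts=20, rng=None):
    """Try k-hop coloring, then (k-1)-hop, and so on down to 0 hops."""
    for hops in range(k, -1, -1):
        coloring = k_hop_coloring(adj, hops, num_colors, True, max_restarts, rng)
        if coloring is not None:
            return hops, coloring
```

For example, a triangle cannot be 1-hop colored with two colors, so `color_with_fallback` on such an input gives up on `k=1` after the restart limit and returns a 0-hop (random) coloring.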

PAGE 29

Algorithm 1 Given Q, T and LB(Q, T), this algorithm filters out unnecessary nodes from T.
1: For all $u_i$, calculate $UB(Q - \{u_i\}, T)$,
2: for all $v_j \in T$ do
3:     $UB(Q, T \mid v_j) = \max_i \{ Sim(u_i, v_j) + UB(Q - \{u_i\}, T) \}$
4:     if $UB(Q, T \mid v_j)$
PAGE 30

$UB(Q, T)$. Similarly, we denote the lower bound for the alignment score between Q and T with $LB(Q, T)$.

Algorithm 1 describes our filtering method. Our algorithm calculates upper and lower bounds using Q and T, in order to determine the nodes to filter. We defer the discussion on the characteristics and the computation of the lower and upper bound functions $LB(Q, T)$ and $UB(Q, T)$ to the end of this section. Our algorithm begins by calculating $UB(Q - \{u_i\}, T)$ for each query network node $u_i$ (Step 1). Using these bounds, it then calculates a conditional upper bound to the alignment score between Q and T for each target network node $v_j$ with the condition that $v_j$ is aligned to a node in Q (Step 3). We denote this with $UB(Q, T \mid v_j)$. Our algorithm then compares this upper bound with the lower bound to the alignment score between Q and T (Step 4). For any node $v_j$, $UB(Q, T \mid v_j)$
PAGE 31

two cases as follows:

$$Score(Q, T) = \max \begin{cases} Score(Q, T \mid v_j) \\ Score(Q, T - \{v_j\}) \end{cases} \qquad (2)$$

Equation 2 holds for every node $v_j$ in T. However, the nodes that hold the following inequality are of interest to us: $UB(Q, T \mid v_j)$
PAGE 32

1. (TIGHT) Their values are close to the actual alignment score.
2. (FAST) They have low computational complexity.

Usually the two conditions above contradict each other. That is, computing tighter bounds is often harder. Next, we describe how we can compute these functions while satisfying these two properties.

We start by computing $UB(Q - \{u_i\}, T)$ for each $u_i \in V_Q$. We do this by constructing a bipartite graph using the nodes of Q and T. The weight of an edge between two nodes $u_i \in V_Q$ and $v_j \in V_T$ is $Sim(u_i, v_j)$. We first remove $u_i$ and use the maximum weighted bipartite matching algorithm (Kuhn 1955) on the remaining network. This algorithm returns an upper bound to the alignment score of $Q - \{u_i\}$ and T (i.e., it returns $UB(Q - \{u_i\}, T)$). This is because it ignores the topological differences between the aligned networks and matches the node pairs with the highest similarity scores. Next, we calculate $UB(Q, T \mid v_j)$ as:

$$UB(Q, T \mid v_j) = \max_{u_i \in V_Q} \{ Sim(u_i, v_j) + UB(Q - \{u_i\}, T) \}.$$

Calculation of $LB(Q, T)$ requires no additional computation. At each iteration, we calculate an alignment score using dynamic programming. The result of dynamic programming is either the optimal alignment score, or it is a lower bound to the alignment score. Therefore, we use the best alignment score observed till the current iteration as $LB(Q, T)$.

Calculating upper bounds for every $v_j$ in each iteration will incur some additional computational cost. The first step of Algorithm 1 requires maximum weighted bipartite matching for m different networks. The computational cost of this step is therefore $O(m^2 n^2)$. In the second step of Algorithm 1, we check each possible $(u_i, v_j)$ mapping. This step has $O(mn)$ complexity. Thus, the overall computational complexity of the upper bound calculation is $O(m^2 n^2)$. However, this additional cost is negligible since

PAGE 33

the cost of dynamic programming is exponential in m. We will elaborate on this issue in Section 2.5.3.

2.5 Experiments

This section evaluates the performance of TOPAC experimentally.

Dataset. We used networks from the KEGG database (Ogata et al. 1999) in our experiments. We downloaded all the gene regulatory networks that have more than 15 genes (i.e., nodes). The largest of these networks had 150 genes. In total, there are 297 such networks as of September 2010. In this database, we had 21 different types of networks (such as MAPK signaling pathway) from 46 different organisms. We created three sets of query networks from this dataset by performing random walks on our gene regulatory networks. These sets contain seven, eight and nine node queries respectively. Each query set consisted of 50 query networks. All the query and database networks we used in this chapter can be downloaded from our server at http://bioinformatics.cise.ufl.edu/khop.html .

We downloaded gene annotation data for the genes in our dataset from the gene ontology database (Ashburner et al. 2000). We used -02 release of type assocdb for our experiments. There are a number of networks which do not have any annotations in the gene ontology database. We did not use such networks in our experiments requiring gene function analysis.

Implementation Details. We implemented all the methods we used in our experiments in C++. We use our own implementation of color coding and dynamic programming algorithm in our experiments. We used the same dynamic programming formulation used by Dost et al. (Dost et al. 2008). In order to calculate pairwise similarity scores between two nodes, we use BLAST (Altschul et al. 1990). We normalize the minus logarithm of the E-values to get the scores. We use insertion/deletion penalties as high as 0.5 times the largest possible match score.

PAGE 34

Evaluation Criteria. We measure running time and confidence as performance indicators for the coloring methods we compared. We measured running time as the number of iterations. This is because the time it takes to color the networks was negligible as compared to that for aligning colored networks in all experiments. We measured confidence after iteration r as the percentage of the alignments which find the optimal alignment after r iterations. To obtain the reference optimal alignments for our experiments, we aligned all the pairs of networks using random coloring with 99.99% accuracy.

Test Environment. We performed our tests on Linux servers equipped with dual AMD Opteron dual core processors running at 2.2 GHz, and 3 GBs of main memory.

In Section 2.5.1, we compare 1-hop coloring to the random coloring method. In Section 2.5.2, we analyze the effects of k on our coloring method. In Section 2.5.3, we evaluate how node pruning works. Finally, in Section 2.5.4, we discuss the statistical and biological significance of our results.

2.5.1 Performance Comparison Against the State of the Art Coloring

In Section 2.3, we showed theoretically that our coloring method outperforms random coloring of Alon et al. (Alon et al. 1995) in terms of the speed at which they achieve a given confidence value. In order to demonstrate the difference in performance experimentally, we compare 1-hop coloring with the coloring method of Alon et al. on our dataset. In this experiment, we randomly picked 50 networks from our database. We then aligned each of the 50 query networks in each query set with all of these 50 database networks. So, in total we performed $50 \times 50 = 2500$ pairwise alignments for each query network size. We carried out each of these 2500 alignments using the coloring method by Alon et al. (Alon et al. 1995) and using our 1-hop coloring method. For each alignment, we fix the total number of iterations to 500. We record the resulting alignment score after each iteration for every alignment. We report the fraction of the

PAGE 35

alignments among all the 2500 alignments which reach the optimal one in the first t iterations for $1 \le t \le 500$ for each coloring method.

Figure 2-5. Fraction of correct results versus the number of iterations for k-hop coloring and random coloring. Both experimental results and theoretical estimates for k-hop coloring are included in the figure. Fraction of correct results for t iterations shows the fraction of the alignments among all 2500 alignments for which the optimal alignment is found within the first t iterations. For instance, fraction of correct results = 0.9 means that the corresponding method optimally aligned 90% of all the query and target network pairs. We used seven node query networks in this figure.

Figure 2-5 shows our experiment results when the query network contains seven nodes. Our experiments demonstrate that 1-hop coloring has a significant performance improvement over Alon et al. (Alon et al. 1995). In order to achieve the same level of confidence on the alignment result, random coloring requires about twice the number of iterations, and thus twice the amount of time. For example, random coloring reaches a probability of success of 0.9 after more than 350 iterations. On the other hand, 1-hop coloring needs less than 120 iterations for the same fraction of correct results. Furthermore, in only 150 iterations, 1-hop coloring achieves a probability of success which random coloring failed to approach even after 500 iterations. Results of this experiment reveal that alignment

PAGE 36

using 1-hop coloring is more than two times faster than alignment using random coloring for the same success probabilities.

Figure 2-6. The performance improvement of k-hop coloring over Alon et al. for query sizes of 4 to 25 ((Left) k = 1, (Right) k = 1, 2, 3). The speedup is calculated by dividing the number of iterations for Alon et al. by that of k-hop coloring until they both reach the same confidence. We also draw a polynomial fitting the calculated values.

Figure 2-5 shows that k-hop coloring outperforms Alon et al. for seven node query networks. An interesting question at this point is how the size of the query networks changes the difference in performance. In order to figure out the relation between the size of query networks and the performance gap between the two methods, we theoretically calculate the speedup 1-hop coloring creates for different query sizes. We do this by dividing the number of iterations required by Alon et al. to achieve a specific confidence value by that of 1-hop coloring. We calculate the number of iterations using the formulas we provided in Equation 2 and Corollary 1. Notice that the term with the confidence cancels out in the ratio, so $\epsilon$ does not have an effect on the ratio of running times. Figure 2-6 shows the results of this experiment.

Our results show that the gain in performance by using 1-hop coloring is greater than 2.5 for every query size. For query networks which have less than five nodes, the performance difference is higher than expected, because the confidence of 1-hop coloring for such query networks is very

PAGE 37

high. Considering the practical query network size of bounded treewidth queries, which is 6 to 10, our method performs between 2.5 and 2.6 times faster than Alon et al. For example, in order to align a nine node query network to any target network with 80% confidence, Alon et al. requires 1718 iterations. On the other hand, our 1-hop coloring method requires only 669 iterations for the same alignment with the same confidence on the result.

2.5.2 Effects of K on Coloring

In Section 2.3, we discussed that with k-hop coloring, higher confidence can be achieved using larger k. However, as the value of k grows, the number of networks that can be colored with k-hop coloring drops. In this experiment, we aim to show the effects of k on the accuracy and performance of our coloring method. In order to do this, similar to the experiment in Section 2.5.1, we randomly choose 50 database networks and align each of them with all the query networks, resulting in 2500 pairwise comparisons. We repeat this experiment using the k-hop coloring method for k = 1, 2 and 3. In order to observe different aspects of the k-hop coloring method, we measure three different types of statistics for each value of k.

1. Percentage of networks that can be colored using k-hop coloring. The larger this value is the better. This is because our method will find the optimal alignment only if the target network is k-hop colorful.
2. Percentage of nodes that are in the k-neighborhood of a target node on the average. The smaller this value is the better, as each node in the k-neighborhood limits the set of colors that can be assigned to that node.
3. The number of iterations needed to reach a given confidence value $\epsilon$ when the target network can be colored using the k-hop coloring method. We report this for $\epsilon$ = 0.95, 0.98 and 0.99.

Table 2-2 shows our results for this experiment. Our experiment results reveal that 1-hop coloring can successfully color every network in our dataset. Moreover, it can achieve 99% confidence in alignment results after less than 300 iterations. We also observed that, as k increases, the number of iterations to attain a certain confidence

PAGE 38

value decreases significantly. For example at 99% confidence, 2-hop coloring needs only 40% of the number of iterations of 1-hop. 3-hop coloring needs about 15% of the number of iterations that 1-hop coloring requires for the same confidence. For the networks that can be colored, we observe that increasing k by only one yields a speedup of more than two times. However, we see that the improvement in the running time comes at the price of a reduced percentage of colorable networks. This is because as k grows, so does the k-neighborhood of each node in the target network. Therefore, based on our experiments, we suggest using the largest k for which the size of the k-neighborhood does not exceed 10% of all network nodes. In summary our experiments suggest that for a small percentage (about 4 to 20%) of the networks in KEGG, 2-hop or 3-hop coloring is preferable. For the remaining ones, 1-hop coloring is the best choice.

Table 2-2. Statistics of k-hop coloring for k = 1, 2 and 3. The first column shows the percentage of target networks which can be colored using k-hop coloring. Percentage of nodes in k-neighborhood is the size of the k-neighborhood divided by the total number of nodes in the network. The last three columns report the number of iterations required to achieve 95%, 98% and 99% confidence respectively.

          Percentage      Percentage of       Number of iterations
          of colorable    nodes in            to achieve confidence
          networks        k-neighborhood      0.95     0.98     0.99
1-hop     100%            8%                  193      252      297
2-hop     20%             27%                 77       100      118
3-hop     4%              42%                 31       40       47

2.5.3 Evaluation of the Node Pruning Method

In Section 2.4, we proposed a method which reduces the number of nodes in the target network. In this experiment, we evaluate whether the node pruning method helps reduce the running time of the alignment. In order to do this, we randomly pick 50 target networks and align them with all of the query networks using both 1-hop coloring and node pruning at the same time. We fix the number of iterations to 500 for all of the alignments. For every alignment, we record the size of the target network after each

PAGE 39

iteration. We normalize the size of the networks with the initial size of every network. We calculate the average network size after each iteration, first for every network, and then for only the networks in which node pruning takes place in at least one of the 500 iterations. We report both of these in our experiment results.

Figure 2-7. The average network size after each iteration during alignments with node pruning. All networks refers to the average network size of all target networks in the alignment after each iteration for all alignments. Filtered networks refers to that of only the networks in which node pruning takes place in at least one iteration. The network sizes are normalized using their initial sizes.

Figure 2-7 presents our results for this experiment. Our results show that, if node pruning is possible in an alignment, our method removes up to 19% of the nodes in the target network after only about 100 iterations. This reduction in the size of the target network results in at least 19% reduction in dynamic programming execution time. Moreover, our method performs the largest node elimination during the first 50 iterations. Therefore, during the alignment we suggest using the node pruning method only in the first

PAGE 40

50 iterations.

The results of this experiment, combined with the results of the previous experiments, reveal that as the confidence on the alignment score increases, the number of nodes which are filtered decreases. The reason is that more and more alignments converge to their optimal alignment score. As the alignments reach their optimal alignment score, they can no longer calculate a new lower bound for the node pruning method. As a result, node pruning cannot be performed anymore in such networks.

In each iteration, we need to make additional computations for the node pruning method. We calculated the computational complexity of these calculations as $O(m^2 n^2)$ in Section 2.4. Our experiments show that the time node pruning requires is negligible during alignment of two networks. This is because dynamic programming has exponential time complexity in the size of the query network.

2.5.4 Biological Significance of the Results

In this experiment, we demonstrate how much information can be harvested from gene regulatory networks using TOPAC. In order to do this, we measure the information content of our networks. We use the functional enrichment score to evaluate the information content of a network. For a gene annotation term g, we calculate its functional enrichment score in a network Q as follows. Let us denote the ratio of the nodes annotated with g as $p(g \mid Q)$. Let us denote the mean and standard deviation of the probability distribution of a protein in the whole dataset being annotated with g with $\mu(g)$ and $\sigma(g)$. The functional enrichment score (or simply the enrichment score) of a term g in Q is then calculated as $(p(g \mid Q) - \mu(g)) / \sigma(g)$. In the literature, this measure is also called the Z-score. We say a term g is enriched in Q if the enrichment score of g in Q is greater than 2.0. We say a network Q is enriched if there exists at least one term that is enriched in Q. We ignore terms with smaller enrichment scores to account for false discoveries.
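The enrichment score defined above is simple to compute once the background statistics are available. The Python sketch below is illustrative: the function names and the dictionary-based inputs are hypothetical, and the background mean and standard deviation of each term are taken as given inputs, since how they are estimated from the dataset is not restated here.

```python
def annotation_ratio(node_terms, term):
    """p(g|Q): fraction of the nodes of Q annotated with the given term.

    node_terms maps each node of Q to the set of its annotation terms."""
    return sum(term in terms for terms in node_terms.values()) / len(node_terms)


def enrichment_score(p_g_given_q, mean_g, std_g):
    """Z-score (p(g|Q) - mu(g)) / sigma(g) of a term in a network."""
    if std_g <= 0:
        raise ValueError("standard deviation must be positive")
    return (p_g_given_q - mean_g) / std_g


def enriched_terms(node_terms, background, threshold=2.0):
    """Terms whose enrichment score in Q exceeds the threshold.

    background maps each term to its (mean, standard deviation) pair."""
    scores = {
        term: enrichment_score(annotation_ratio(node_terms, term), mu, sigma)
        for term, (mu, sigma) in background.items()
    }
    return {term: z for term, z in scores.items() if z > threshold}
```

Under this sketch, a network is enriched exactly when `enriched_terms` returns a non-empty dictionary for it.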

PAGE 41

Table 2-3. Statistical information regarding the functional enrichment of the query networks we used in our experiments for 7, 8 and 9 node query networks. The first three columns are the mean, standard deviation and the maximum of the enrichment scores of all enriched terms. The last column shows the percentage of queries which contain at least one enriched term.

          The functional enrichment score distribution     % of enriched
          Mean      Std. Deviation      Maximum             networks
7-node    3.5       1.04                5.26                48%
8-node    3.85      0.7                 5.17                30%
9-node    3.89      0.89                5.49                56%

Information content of the query networks. Using the enrichment score, we first evaluate our query networks. We measure how increasing network size affects the information content of a network. In order to do this we calculate the enrichment scores for each annotation in all of our query networks. We record the number of enriched query networks and the enrichment scores of each query network. For query networks of size 7, 8 and 9, we report the mean and the standard deviation of the enrichment score distribution, as well as the average of the maximum enrichment score observed and the percentage of enriched query networks in the respective network sets.

Table 2-3 displays our results for this experiment. Our experimental results demonstrate that there is a high ratio of enriched networks among our queries. This result implies that our query set consists of many networks whose genes share a large number of common functions. Another observation is that query size does not have a significant effect on the enrichment scores of the networks. For example, in our experiments enriched networks among the sets of 7, 8 and 9 node query networks have similar enrichment score distributions. However, there is a large variation in the percentage of the enriched networks. We attribute this to the random walk algorithm, which uses a totally random network generation strategy, without the knowledge of any functional data.


Figure 2-8. The enrichment scores of each gene annotation term in query and target networks in the top five queries with the largest number of enriched terms. The enrichment scores of the terms in each alignment are shown using a different symbol in the figure. The dashed line is the set of hypothetical points where the enrichment of a term in the target network is equal to the enrichment of the term in the query network.

We conclude that there can be many large query networks with high enrichment scores. Thus, TOPAC is essential in making it computationally feasible to align them to target networks in practical time.

Information content of the alignments. In the previous experiment, we evaluated the information content of our queries using the enrichment score. In this experiment, we use the same tool to interpret our alignment results. In order to do this, we use the alignments with the highest information content. Query networks and the subnetworks that align to them in these alignments are annotated with over 35 terms each. For each of these networks, we calculate the enrichment scores of each gene annotation term in the query network and in the subnetwork of the target network it is aligned to. We report the enrichment score of every term in our results. Figure 2-8 shows our experimental results for the enriched terms for the top five queries with the largest number of enriched terms. Notice that there are a large number of


terms which have enrichment scores greater than 2.0 in both query and target networks. An enrichment score of 2.0 or larger means that the probability of observing a specific annotation in our network is at least two standard deviations larger than the mean probability over the whole dataset. In other words, these annotations in query and target networks are statistically significant. Our results demonstrate that the enrichment scores of the terms in the query network and the matching target network almost converge on the same line for each query. This shows that there is a high correlation between the enrichment scores in the query and the target networks. High correlation between the enrichment scores of highly enriched terms shows that TOPAC is successful at discovering functional similarities between networks. Another interesting observation that follows from this figure is that each term is more enriched in the target network than in the query network most of the time. This result suggests that our method can discover highly enriched subnetworks.

2.6 Conclusion

In this chapter, we considered the problem of finding a subnetwork in a given biological network (i.e., target network) that is the most similar to a given small query network. Our aim was to find the optimal solution (i.e., the subnetwork with the largest alignment score) with a provable confidence bound. There is no known polynomial time solution to this problem in the literature. Alon et al. (Alon et al. 1995) have developed a state of the art coloring method that reduces the cost of this problem, but it needs to be executed for many iterations until a user supplied confidence is reached. Here we developed a novel coloring method, named k-hop coloring, that achieves a provable confidence value in a small number of iterations without sacrificing optimality. Our method considers the color assignments already made in the neighborhood of each target network node while assigning a color to a node. This way it preemptively avoids many color assignments that are guaranteed to fail to produce the optimal alignment. We also develop a filtering method that eliminates the nodes which cannot be aligned


without reducing the alignment score after each coloring instance. We demonstrate both theoretically and experimentally that our coloring method outperforms that of Alon et al., which is also used by a number of network alignment methods including QPath and QNet, by a factor of three without reducing the confidence in the optimality of the result.

Many applications so far have benefited from the idea of coloring biological networks. Aligning a query network to a subnetwork of a target network is only one of them. We believe that our novel coloring methodology will improve the utilization of coloring and randomization in large scale biological network analysis in practical applications, and thus it will be of great value to biologists.


CHAPTER 3
HIDEN: HIERARCHICAL DECOMPOSITION OF REGULATORY NETWORKS

3.1 Introduction

Genes are the smallest functional units in an organism. They carry out vital functions in the cell by interacting with each other and with other molecules. Biological networks are used for modeling such interactions between genes. Using biological networks, researchers are able to take a holistic approach to the analysis of cellular functions. Such analysis has shown that biological networks have a number of global properties. One of these properties is hierarchical organization. Hierarchical organization defines a hypothetical partial ordering of the underlying genes. Recent studies have shown that directed interactions between transcription factors in transcriptional regulatory networks (TRNs) impose a hierarchy on the network (Bhardwaj et al. 2010a,b; Hartsperger et al. 2010; Jothi et al. 2009; Yu and Gerstein 2006). Transcription factors (TFs) are special types of proteins that control the expression of other genes by binding to specific regions of the DNA (S. and Latchman 1997). Since each protein is coded by genes, we will use the terms transcription factor and gene interchangeably to refer to TFs throughout this chapter. One way to model hierarchy in TRNs is to assign levels to the interacting TFs. Figure 3-1 shows a sample network with its level assignments.

In this chapter, we consider the problem of finding the hierarchical organization of a TRN. The formal definition of our problem in this chapter is as follows:

Problem definition. Let us denote a TRN with $G = (N, E)$. Here, $N$ denotes the set of TFs and $E$ denotes the set of directed interactions (i.e., edges) between the TFs in $N$. We will refer to each TF in $N$ as a node. We will name the $i$th node in $N$ as $n_i$. We represent an edge from the node $n_i$ to $n_j$ with $(n_i, n_j)$. Also, we will denote the maximum possible number of levels in $G$ with $M$. We denote the hierarchy level assigned to a node $n_i$ with $t_i$, where $t_i$ is an integer in $\{1, 2, 3, \ldots, M\}$. Let $\Delta(n_i, n_j) \to \{0, 1\}$ be a binary function that describes the key topological relationship between $n_i$ and $n_j$. (We


Figure 3-1. Hierarchical decomposition of a sample network with seven nodes, denoted by $n_1, n_2, \ldots, n_7$, into three levels. Directed edges represent the interactions. Dashed lines split the nodes into different levels. Each of the seven nodes is assigned one of the three existing levels.

elaborate on the function $\Delta$ below.) We compute a penalty score $p_{ij}$ for each pair of nodes as follows:

$$p_{ij} = \begin{cases} \Delta(n_i, n_j), & \text{if } t_i \le t_j \\ 0, & \text{else} \end{cases}$$

Our aim is to find an assignment of hierarchy levels to the nodes of $N$ which minimizes $\sum_{(i,j)} p_{ij}$. In this chapter, we use two different $\Delta$ functions describing two key topological properties.

1. Adjacency. We define $\Delta(n_i, n_j) = 1$ if $(n_i, n_j) \in E$ and $\Delta(n_i, n_j) = 0$ otherwise.

2. Reachability. We define $\Delta(n_i, n_j) = 1$ if there exists a path from $n_i$ to $n_j$ in $G$ by traversing the edges in $E$. We set $\Delta(n_i, n_j) = 0$ otherwise.

Depending on the choice between the two $\Delta$ functions, we name the resulting distance function adjacency distance or reachability distance, respectively. In summary, using adjacency distance we aim to assign levels such that every TF is above the others it directly regulates, and below its every direct regulator. On the other hand, using reachability distance, we consider any direct or indirect regulation relation between two TFs when assigning levels.
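To make the two penalty definitions concrete, the sketch below evaluates both on the seven-node example whose edge list and reported level assignment appear in the worked example of Section 3.2.2. The function names are hypothetical, and one point is an assumption rather than something the text states: pairs with $i = j$ are skipped, so a node on a cycle is not penalized for reaching itself.

```python
# Edges of the seven-node example network (as listed in Section 3.2.2).
EDGES = [(1, 2), (1, 3), (1, 5), (2, 4), (3, 5), (3, 6), (6, 7), (7, 3)]
NODES = range(1, 8)


def adjacency_delta(edges):
    """Delta(i, j) = 1 iff there is a direct edge from i to j."""
    edge_set = set(edges)
    return lambda i, j: 1 if (i, j) in edge_set else 0


def reachability_delta(edges):
    """Delta(i, j) = 1 iff j can be reached from i along one or more edges."""
    successors = {}
    for u, v in edges:
        successors.setdefault(u, set()).add(v)
    reach = {}
    for start in {n for edge in edges for n in edge}:
        seen, stack = set(), list(successors.get(start, ()))
        while stack:  # iterative depth-first search
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(successors.get(node, ()))
        reach[start] = seen
    return lambda i, j: 1 if j in reach.get(i, ()) else 0


def total_penalty(levels, delta, nodes):
    """Sum of p_ij over ordered pairs; p_ij = Delta(i, j) when t_i <= t_j.

    Assumption: pairs with i == j are skipped.
    """
    return sum(delta(i, j)
               for i in nodes for j in nodes
               if i != j and levels[i] <= levels[j])


# Level assignment reported for this network with M = 4 in Section 3.2.2.
levels = dict(zip(NODES, (4, 3, 3, 2, 2, 2, 1)))
adj_penalty = total_penalty(levels, adjacency_delta(EDGES), NODES)
reach_penalty = total_penalty(levels, reachability_delta(EDGES), NODES)
print(adj_penalty, reach_penalty)
```

Under the adjacency function only the edge from $n_7$ to $n_3$ conflicts with this assignment. The reachability function also penalizes the indirect relations created by the cycle through $n_3$, $n_6$ and $n_7$, so its total is larger.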


There have been attempts to devise methods to reveal the underlying hierarchies of TRNs. Yu and Gerstein developed the BFS-level method to carry out this task (Yu and Gerstein 2006). This method uses breadth first search to assign hierarchies to TFs in a network. Although their method works for most networks, it fails to assign accurate levels for networks that contain cycles. Jothi et al. developed the vertex sort method (Jothi et al. 2009). This method incorporates the topological sort algorithm for solving the network hierarchy problem. The vertex sort method does not have any restrictions on network motifs or cycles. However, rather than a definite hierarchy, it assigns a range of possible levels to the TFs. Hartsperger et al. devised an algorithm based on the breadth first search method to solve the problem (Hartsperger et al. 2010). Their solution improves the BFS-level method, and outputs a hierarchy for every network regardless of its topological features. However, all these algorithms fail to minimize the number of edges that violate the hierarchy. We name such edges conflicting edges. We will elaborate on the quality of results calculated by existing methods in Section 3.3.1.

Contributions: In this chapter, we take a novel approach to the problem of discovering the underlying network hierarchy. We first consider the topology of the network as a set of constraints. We then define the minimization of conflicting edges as the objective of the problem. Using these definitions, we transform this problem into a mixed integer programming problem (MIPP) (Garfinkel and Nemhauser 1972). Then, we solve the resulting problem using existing solvers. We name our method HIerarchical DEcomposition of regulatory Networks (HIDEN). The main advantage of HIDEN is that it provides a sound mathematical formulation of the network hierarchy problem. However, for larger networks HIDEN encounters scalability issues due to the growing size of the MIPP with increasing number of TFs. In order to tackle the network hierarchy problem in larger networks, we develop a divide and conquer approach.


The rest of this chapter is organized as follows: In Section 3.2, we describe the methods we developed in this chapter. In Section 3.3, we discuss the results of HIDEN in detail. Finally, in Section 3.4, we briefly conclude the chapter.

3.2 Algorithm

In this section, we describe the hierarchical decomposition method we developed. Section 3.2.1 describes our method. Section 3.2.2 demonstrates HIDEN on a simple example. Section 3.2.3 describes the divide and conquer method we employ to scale HIDEN to larger networks.

3.2.1 HIDEN

HIDEN transforms the hierarchical network decomposition problem into a MIPP (Garfinkel and Nemhauser 1972). Given a TRN, HIDEN first constructs a set of linear constraints and a linear optimization function that collectively describe the penalty of the decomposition. Then it uses existing optimizer software to solve the resulting problem. Next, we will explain how we formulate the MIPP.

Let us denote the given network that will be decomposed with $G$. Let us denote the nodes (i.e., TFs) of this network with $n_1, n_2, \ldots, n_m$, where $m$ is the total number of nodes of $G$. HIDEN allows the user to set a limit on the maximum number of allowed levels for hierarchical decomposition. Let us denote this number with $M$. Also, let us represent the level assigned to node $n_i$ with $t_i$ for all $i \in \{1, 2, \ldots, m\}$ (i.e., $\forall i,\ t_i \in \{1, 2, 3, \ldots, M\}$). We aim to find the level assignment $T = \{t_1, t_2, \ldots, t_m\}$ that minimizes the total penalty resulting from this level assignment. Therefore, the objective of our problem is the sum of individual penalty scores for each pair of nodes:

$$\text{minimize} \sum_{1 \le i, j \le m} p_{ij}$$

Next, we set a limit on the number of levels in the hierarchy. We do this by limiting the variables $t_i$ as follows:

$0 \le t_i$ …

We then represent each $p_{ij}$ as a linear constraint. Remember that $p_{ij}$ is a binary function in the following form:

$$p_{ij} = \begin{cases} \Delta(n_i, n_j), & \text{if } t_i \le t_j \\ 0, & \text{else} \end{cases}$$

We can rewrite this function as follows:

$$p_{ij} = \begin{cases} 1, & \text{if } t_i \le t_j \text{ and } \Delta(n_i, n_j) = 1 \\ 0, & \text{else} \end{cases}$$

Let us only consider the cases where $\Delta(n_i, n_j) = 1$. We can represent the rest of this function using two linear inequalities. The following set of constraints represents the function $p_{ij}$:

$$p_{ij} \in \{0, 1\}$$
$$t_j - t_i - M p_{ij} \ge -M$$
$$t_j - t_i - M p_{ij} \le -1$$

In order to prove that these inequalities model the function $p_{ij}$ correctly, we need to inspect all possible scenarios:

1. If $t_i > t_j$ and $p_{ij} = 0$, then $-1 \ge t_j - t_i \ge -(M - 1)$ and $M p_{ij} = 0$. Therefore, both inequalities hold.

2. If $t_i \le t_j$ and $p_{ij} = 0$, then $t_j - t_i \ge 0$ and $M p_{ij} = 0$. Therefore, the first inequality holds; however, the second does not hold.

3. If $t_i > t_j$ and $p_{ij} = 1$, then $-1 \ge t_j - t_i \ge -(M - 1)$ and $M p_{ij} = M$. Therefore, the expression $t_j - t_i - M p_{ij}$ is smaller than or equal to $-M - 1$. This implies that the first inequality does not hold but the second holds.

4. If $t_i \le t_j$ and $p_{ij} = 1$, then $(M - 1) \ge t_j - t_i \ge 0$ and $M p_{ij} = M$. Therefore, both inequalities hold.
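The case analysis above can also be checked mechanically. The short sketch below assumes the two inequalities read $t_j - t_i - M p_{ij} \ge -M$ and $t_j - t_i - M p_{ij} \le -1$, which is the reading consistent with the four cases, and enumerates every combination of levels and $p_{ij}$ for a given $M$. The function names are hypothetical.

```python
def constraints_hold(t_i, t_j, p_ij, M):
    """True when both linear inequalities are satisfied (assumed reading)."""
    expr = t_j - t_i - M * p_ij
    return expr >= -M and expr <= -1


def encoding_is_exact(M):
    """Check that the constraints are feasible exactly when p_ij = [t_i <= t_j]."""
    for t_i in range(1, M + 1):
        for t_j in range(1, M + 1):
            for p_ij in (0, 1):
                expected = 1 if t_i <= t_j else 0
                if constraints_hold(t_i, t_j, p_ij, M) != (p_ij == expected):
                    return False
    return True


print(all(encoding_is_exact(M) for M in range(1, 9)))
```

For every pair of levels exactly one value of `p_ij` is feasible, so a solver has no freedom to under-report a conflicting edge.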


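Therefore, enforcing the three constraints above implies:

$$(p_{ij} = 0 \Leftrightarrow t_i > t_j) \land (p_{ij} = 1 \Leftrightarrow t_i \le t_j)$$

This corresponds to the latter definition of the function $p_{ij}$ except for the condition $\Delta(n_i, n_j) = 1$. Since we choose the $\Delta$ function prior to the construction of the MIPP, we know the value of $\Delta(n_i, n_j)$ for every pair $(n_i, n_j)$. Therefore, we can manually ensure this property by only considering $p_{ij}$ where $\Delta(n_i, n_j) = 1$ and excluding $p_{ij}$ completely from our calculations where $\Delta(n_i, n_j) = 0$.

Based on the constraints above, the MIPP we construct to solve the network hierarchy problem is as follows:

$$\text{minimize} \sum_{i,j \ \text{s.t.}\ \Delta(n_i, n_j) = 1} p_{ij}$$

where $\forall n_i$: $0 \le t_i$ …

… $\Delta(n_i, n_j) = 1$ if an edge from $n_i$ to $n_j$ exists, and $0$ otherwise.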
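For very small networks, the optimization problem can be sanity-checked without a MIPP solver by trying every level assignment. The sketch below is such an exhaustive stand-in, not the solver-based method the chapter uses. It works on the eight edges of the seven-node example from Section 3.2.2 with the adjacency function, and the function names are hypothetical.

```python
from itertools import product

# Edges of the seven-node example network (as listed in Section 3.2.2).
EDGES = [(1, 2), (1, 3), (1, 5), (2, 4), (3, 5), (3, 6), (6, 7), (7, 3)]


def conflicting_edges(levels, edges):
    """Adjacency penalty: an edge (u, v) conflicts when t_u <= t_v."""
    return sum(1 for u, v in edges if levels[u] <= levels[v])


def exhaustive_decomposition(nodes, edges, M):
    """Try all M**len(nodes) assignments; only feasible for tiny networks."""
    nodes = list(nodes)
    best_penalty, best_levels = None, None
    for combo in product(range(1, M + 1), repeat=len(nodes)):
        levels = dict(zip(nodes, combo))
        penalty = conflicting_edges(levels, edges)
        if best_penalty is None or penalty < best_penalty:
            best_penalty, best_levels = penalty, levels
    return best_penalty, best_levels


best_penalty, best_levels = exhaustive_decomposition(range(1, 8), EDGES, M=4)
print(best_penalty, best_levels)
```

The cycle $n_3 \to n_6 \to n_7 \to n_3$ forces at least one conflicting edge in any assignment, so a minimum of one conflict is the best achievable here. The search space grows as $M^m$, which is why this approach does not scale and a MIPP solver is used instead.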


Figure 3-2. Result of the hierarchical decomposition of the network in Figure 3-1 using HIDEN. Note that the decomposition differs from the decomposition in Figure 3-1.

Using this $\Delta$ function, the objective of the MIPP is to minimize the following function:

$$\sum_{i,j \ \text{s.t.}\ \Delta(n_i, n_j) = 1} p_{ij} = p_{12} + p_{13} + p_{15} + p_{24} + p_{35} + p_{36} + p_{67} + p_{73}$$

Now we go over the constraints. The first set of constraints limits $t_i$:

$\forall n_i$, $i$ from 1 to 7: $0 \le t_i$ …

Then, we write the remaining constraints as follows:

$\forall (n_i, n_j) \in \{(n_1, n_2), (n_1, n_3), (n_1, n_5), (n_2, n_4), (n_3, n_5), (n_3, n_6), (n_6, n_7), (n_7, n_3)\}$:

$$p_{ij} \in \{0, 1\}$$
$$t_j - t_i - M p_{ij} \ge -M$$
$$t_j - t_i - M p_{ij} \le -1$$

In the resulting problem, $M$ is left as a user defined parameter. When we run the above problem with $M = 4$, HIDEN returns the following result:

$$p_{12} + p_{13} + p_{15} + p_{24} + p_{35} + p_{36} + p_{67} + p_{73} = 1$$
$$(t_1, t_2, t_3, t_4, t_5, t_6, t_7) = (4, 3, 3, 2, 2, 2, 1)$$

Figure 3-2 shows the result of HIDEN on the given network. Note that HIDEN performs successfully despite the existence of a cycle in the network.

3.2.3 Divide and Conquer Method

HIDEN works well for networks that have up to 100 nodes. For larger networks, however, it becomes difficult to solve the resulting MIPP using current hardware. This is mainly because the number of integer variables of the MIPP that describes the problem for the given network increases. This increases the memory consumption and the running time significantly. In order to solve our problem for networks that have more than 100 nodes, we adopt a divide and conquer approach.

Given a large TRN, we randomly divide this network into fixed size partitions. We do this by first randomly selecting a node from the given network. This node is the seed of the first partition, and thus it is a member of that partition. We then choose the remaining nodes in that partition iteratively by


randomly growing the partition one node at a time. More specifically, at each iteration, we randomly select a node that has not been selected so far and that interacts with at least one node among the ones in the partition. We repeat these iterations until the number of nodes in the partition reaches a predefined threshold or all the nodes in the TRN are assigned to a partition. Then, we use HIDEN to decompose the subnetwork defined by the nodes and the edges in this partition into hierarchical levels. Once we determine the levels of all the nodes in the current partition, we store those values, as they will remain unchanged in the rest of our solution. Next, we randomly pick another node from the given TRN, among those that have not been considered yet, as the seed of the next partition. We grow the next partition similarly and use HIDEN to decompose it into hierarchical levels. We repeat these steps until we exhaust all the nodes in the given TRN.

This method greatly reduces the running time of HIDEN on large networks. Since MIPP is NP-hard, depending on the size and the connectivity of the given TRN, the divide and conquer strategy can be orders of magnitude faster than the unpartitioned HIDEN. However, due to the random selection of the nodes, we may not achieve the optimal result. This is because, at each partition, we finalize the values of a subset of the variables in the MIPP formulation without considering many other variables. In order to reduce the gap between the objective values of our partitioned and unpartitioned HIDEN methods, we repeat the divide and conquer strategy multiple times, each time starting from a randomly selected node. In practice, we repeat this process 1000 times for real TRNs. Since the running time of partitioned HIDEN is orders of magnitude less than that of the unpartitioned HIDEN, 1000 repetitions of the divide and conquer method remain practical. It took less than 10 minutes for the largest dataset (S. cerevisiae). Our experiments showed that, on average, the results of the divide and conquer method reach their optimum in less than 500 iterations.
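The partition-growing step described above can be sketched as follows. This covers only the random partitioning, not the per-partition MIPP solve. The function name `grow_partitions`, the fixed random seed, and the choice to close a partition early when no unassigned neighbor remains are assumptions made for illustration.

```python
import random


def grow_partitions(nodes, edges, max_size, rng):
    """Randomly split a network into connected partitions of at most max_size nodes."""
    # Interaction is treated as undirected when looking for neighbors.
    neighbors = {n: set() for n in nodes}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    unassigned = set(nodes)
    partitions = []
    while unassigned:
        seed = rng.choice(sorted(unassigned))      # random seed node
        partition = [seed]
        unassigned.discard(seed)
        while len(partition) < max_size:
            # Unassigned nodes interacting with at least one partition member.
            frontier = sorted(
                {m for n in partition for m in neighbors[n]} & unassigned)
            if not frontier:                        # assumption: stop early
                break
            chosen = rng.choice(frontier)
            partition.append(chosen)
            unassigned.discard(chosen)
        partitions.append(partition)
    return partitions


# Seven-node example network from Section 3.2.2, split into partitions of size 3.
EDGES = [(1, 2), (1, 3), (1, 5), (2, 4), (3, 5), (3, 6), (6, 7), (7, 3)]
partitions = grow_partitions(range(1, 8), EDGES, max_size=3, rng=random.Random(0))
print(partitions)
```

Each partition would then be handed to the solver in turn, and the whole procedure repeated with different random seeds, keeping the assignment with the lowest penalty.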


3.3 Results and Discussion

This section evaluates HIDEN experimentally.

DATASET

In our experiments, we used the transcription regulation networks of E. coli, H. sapiens and S. cerevisiae. We used the existing experimental data to construct these three networks (Balaji et al. 2006a,b; Gama-Castro et al. 2011; Harbison et al. 2004; Horak et al. 2002; Jiang et al. 2007; Lee et al. 2002; Luscombe et al. 2004; Svetlov and Cooper 1995; Teichmann and Babu 2004). For the experiments that require gene function information, we used the information included in the Gene Ontology Database (Ashburner et al. 2000). We downloaded the list of essential genes for S. cerevisiae from the database of essential genes (Zhang and Lin 2009).

In the rest of this section, we first compare HIDEN with other existing hierarchical decomposition methods in Section 3.3.1. In Section 3.3.2, we evaluate the results of our method using a number of biological properties of TFs. Finally, in Section 3.3.3, we analyze the behavior of our algorithm with respect to different quantitative properties of the data.

3.3.1 Comparison with Existing Hierarchical Decomposition Methods

The objective of hierarchical decomposition is to arrange the TFs of a given network into levels so that, as frequently as possible throughout the network, a gene that alters the activity of another appears at a higher level than that gene. The two $\Delta$ functions described at the beginning of this chapter model this relationship in terms of the adjacency and the reachability of the nodes in the given network. In this experiment, we evaluate how well our method, HIDEN, compares against three state of the art methods, namely vertex sort (Jothi et al. 2009), HiNO (Hartsperger et al. 2010) and BFS-level (Yu and Gerstein 2006), in achieving this objective. To perform this comparison, we compute the penalty values obtained by HIDEN when it is applied on the S. cerevisiae, E. coli and H. sapiens networks. We compute the same penalty values for


the vertex sort, HiNO and BFS-level methods on the same three datasets for which their hierarchical decompositions are available.

The penalty is a quantitative value that can be used to compare different methods on the same dataset. However, since the size (number of genes and interactions) and the topology of these networks deviate significantly, the resulting penalties will differ significantly across datasets. In order to report a statistically sound value that describes the success of a method independent of the network size and topology, we also compute the Z-scores of the resulting penalty values. Let us denote the level assignment obtained by a specific method for an $m$ node network with $T = \{t_1, t_2, \ldots, t_m\}$. Let $\rho$ denote the penalty of $T$ according to a specific $\Delta$ function. In order to compute the Z-score for $T$, we randomly produce many level assignments using the same level distribution as that of $T$. For each such assignment, we compute the resulting penalty value using the same $\Delta$ function. Let $\mu$ and $\sigma$ denote the mean and standard deviation of the resulting penalty values of all these random level assignments, respectively. We calculate the Z-score as follows:

$$z = \frac{\mu - \rho}{\sigma}$$

A higher Z-score implies a better level assignment.

Table 3-1 summarizes the penalties and the corresponding Z-score values. For HIDEN, we reported the results for six values of the maximum number of levels ($M \in \{3, 4, \ldots, 8\}$). For the other methods, the number of levels is not a configurable parameter. Hence, we reported the number of levels that we obtained after execution of that method. We discuss the results for each organism next.

S. cerevisiae. We compared HIDEN with all the three competing methods for this dataset. Our method outperformed all the three methods in terms of both adjacency and reachability penalty values as well as the Z-scores, regardless of the number of levels. As the number of levels allowed increases, the penalty incurred by HIDEN monotonically decreases. This, however, is not true for the Z-score as it depends on the distribution of


Table 3-1. Comparison of HIDEN with three other methods, namely vertex sort, BFS-level and HiNO, on three organisms, namely S. cerevisiae, E. coli and H. sapiens. For HIDEN we vary the maximum allowable level from three to eight. We report the adjacency and reachability penalties as well as the Z-scores for these penalties for each experiment. Num. Level denotes the maximum number of allowed levels.

Organism   Method        Num.     Adjacency             Reachability
                         Level    Penalty   Z-score     Penalty   Z-score
Yeast      HIDEN         3        140       10.8        3600      13.9
                         4        103       10.8        3027      14.6
                         5        88        9.5         2774      14.1
                         6        91        10.2        2573      13.4
                         7        79        9.8         2469      12.8
                         8        79        11.6        2365      13.7
           vertex sort   9        179       7.3         3920      10.2
           BFS-level     4        245       6.0         5734      9.9
           HiNO          3        279       6.8         6205      10.5
E. coli    HIDEN         3        15        6.3         19        7.1
                         4        8         6.5         10        6.8
                         5        5         6.4         7         6.7
                         6        5         6.2         6         6.6
                         7        5         6.2         5         6.6
                         8        5         6.2         5         6.6
           vertex sort   6        10        5.7         11        6.5
           BFS-level     4        44        3.7         65        5.1
           HiNO          4        41        4.2         59        5.3
Human      HIDEN         3        101       7.4         1950      9.4
                         4        84        7.4         1608      10.6
                         5        75        7.9         1435      10.8
                         6        66        7.9         1347      10.2
                         7        72        7.5         1287      9.7
                         8        72        7.3         1248      9.9
           vertex sort   5        207       1.2         2162      5.8
           BFS-level     3        210       0.7         2216      36.0

nodes to levels. For instance, HIDEN attains the highest Z-score for the adjacency penalty at level eight, whereas it attains that using only six levels for the reachability penalty. The biggest drop in penalty takes place when the number of allowed levels increases from three to four. We observe further, yet smaller, improvement in the penalty as the number of allowed levels increases beyond four.
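The Z-scores reported in Table 3-1 can be approximated with a small permutation test. The sketch below reads "random level assignments using the same level distribution" as random shuffles of the observed levels among the nodes, and takes the Z-score as mean random penalty minus observed penalty, divided by the standard deviation. Both readings are assumptions, as are the function names, the sample count and the fixed random seed.

```python
import random
import statistics


def adjacency_penalty(levels, edges):
    """Number of edges (u, v) that conflict with the hierarchy (t_u <= t_v)."""
    return sum(1 for u, v in edges if levels[u] <= levels[v])


def penalty_z_score(levels, edges, samples=500, rng=None):
    """Z-score of the observed penalty against level-preserving shuffles."""
    rng = rng or random.Random(0)
    nodes = list(levels)
    observed = adjacency_penalty(levels, edges)
    pool = [levels[n] for n in nodes]

    random_penalties = []
    for _ in range(samples):
        shuffled = pool[:]
        rng.shuffle(shuffled)  # same multiset of levels, random placement
        random_penalties.append(
            adjacency_penalty(dict(zip(nodes, shuffled)), edges))

    mu = statistics.mean(random_penalties)
    sigma = statistics.pstdev(random_penalties)
    if sigma == 0:
        return 0.0  # all shuffles give the same penalty; score is undefined
    return (mu - observed) / sigma


# Toy three-node chain: a conflict-free assignment versus a fully inverted one.
chain = [(1, 2), (2, 3)]
z_good = penalty_z_score({1: 3, 2: 2, 3: 1}, chain)
z_bad = penalty_z_score({1: 1, 2: 2, 3: 3}, chain)
print(z_good, z_bad)
```

With this sign convention an assignment that beats random placement gets a positive score, which matches the statement that a higher Z-score implies a better level assignment.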


Among the competing methods, the vertex sort method of Jothi et al. incurs the lowest penalty. It, however, uses significantly more levels than the HiNO and BFS-level methods. Furthermore, although it uses more levels than HIDEN as well, it performs worse than HIDEN in terms of both penalty and Z-score measures. Among the remaining two methods, HiNO and BFS-level, there is no clear winner. BFS-level leads to slightly less penalty at the expense of an additional level. As a result, HiNO produces slightly better Z-scores.

E. coli. For this dataset, we compared HIDEN with all three existing methods. The penalty values of all the methods for E. coli are smaller compared to those of S. cerevisiae. This is mainly because the E. coli network is much smaller. HIDEN performs the best among all methods for four or more levels according to both penalty and Z-score values. We did not observe any improvement for HIDEN beyond seven levels. Vertex sort attains statistically better results than the HiNO and BFS-level methods.

H. sapiens. We compared HIDEN with the vertex sort and BFS-level methods for this dataset. We omitted HiNO in this experiment because we could not run it on this dataset. The results follow a similar pattern to those of the two other datasets. HIDEN outperformed vertex sort and BFS-level even when it used fewer levels. The gap between the Z-scores of HIDEN and the other methods was even more significant than in the previous datasets. HIDEN achieved its largest drop in penalty going from three to four levels and continued to improve with an increased number of levels.

We conclude that HIDEN performs significantly better than the competing methods for all the three major datasets.

3.3.2 Biological Evaluation of Network Hierarchies

In this section, we analyze HIDEN using biological evidence. First, we check functional properties of genes across different layers. Then, we evaluate the locations of essential genes in the hierarchy.


Figure 3-3. The enrichment scores of the wound healing process at each level for the H. sapiens TRN. The network is divided into six levels using HIDEN with reachability as the penalty scheme.

3.3.2.1 Functions of genes

TRNs regulate the expression of genes that take part in many processes in an organism (Jiang et al. 2007). Earlier works suggest that the concentration of genes participating in certain functions is closely related to the level in the hierarchy (Yu and Gerstein 2006). However, the majority of cellular functions in the cell take place through multiple reactions happening in succession. Therefore, we expect a uniform distribution of functions among different levels. In order to confirm this theory, we calculated the functional enrichment score of every single level in the hierarchies we discovered. We first decomposed each of the H. sapiens, E. coli and S. cerevisiae TRNs into each of three to eight levels. Then, for the resulting 18 combinations (i.e., 3 organisms and 6 level settings), we calculated a p-value for each gene ontology term and level pair. We obtain the gene ontology terms from the Gene Ontology database (Ashburner et al.


Figure 3-4. Illustration of the distribution of the number of functions that each gene participates in for the TRN of H. sapiens (A), S. cerevisiae (B) and E. coli (C). Each circle represents a TF. The network is divided into six levels using the reachability as the penalty function, and the TFs are placed in their relevant levels. The horizontal lines separate the TFs into different levels. The genes are colored in grayscale according to the number of Gene Ontology terms they are annotated with. The smallest number of functions is assigned the color black, whereas the largest number of functions is assigned the color white.

2000). We calculate these p-values assuming a hypergeometric distribution of the gene ontology term over all the TFs in the network. We observed that the p-values were similar at all levels of the hierarchy (see Figure 3-3). This supports our initial theory that the majority of the functions in which the TFs in our network participate are not enriched at any level. One example among many is the wound healing function in the human network (Galko


and Krasnow 2004). However, in rare instances, we observed some functions being moderately enriched in some levels. For example, the third of the eight levels (the third level when we decompose the network into eight levels) of the human TRN is enriched with the signal transduction function. However, we could not detect any other level in any other network enriched with this function.

Each gene in an organism takes part in at least one metabolic function. A gene participating in a large number of reactions is a common phenomenon in many organisms. In this experiment, we compare the level of each gene with the number of functions it participates in. By doing so, we aim to discover any existing relation between the two. In order to do this, we use the gene ontology database (Ashburner et al. 2000). The participation of a gene in a reaction is represented using gene ontology annotations in the literature. For each gene in our networks, we first count the number of gene ontology terms it is annotated with. We also decomposed each network into six layers using HIDEN. Then, we visualized the networks using hierarchy information as the location and functional information as the color of each node. Figure 3-4 shows our results. Our results suggest that there is no strong correlation between the number of functions of each gene and the level of the gene in the hierarchy. However, in all three organisms, we observed that the genes with the highest number of annotations tend to lie at the middle levels (i.e., 2, 3 or 4). This result indicates that regulatory hubs in the TRNs are not at the top levels. They are rather at the middle levels of the hierarchy.

3.3.2.2 Gene essentiality

The genes which an organism cannot survive without are called essential genes (Zhang and Lin 2009). Such genes take part in vital functions in the cell. Earlier works proposed that the number of essential genes at a level is strongly correlated with the location of that level in the hierarchy (Jothi et al. 2009). In this experiment, we aim to find out if there exists any such relation. In order to do this, we count the number of essential genes at each level of the hierarchy in a five-level decomposition of the S. cerevisiae TRN. We then report the


Figure 3-5. The ratio of essential genes of the S. cerevisiae TRN at each level of the hierarchy. The network is divided into five different levels using the reachability penalty.

ratio of the number of essential genes to the total number of genes in a level in the hierarchy. Figure 3-5 shows our results for this experiment. We observe that there is a higher density of essential genes at the middle levels of the hierarchy. This result, combined with the results of the previous experiment, supports our theory that regulatory hubs of an organism are often at the middle levels of the hierarchy, rather than the top level. Indeed, a strong correlation between hubs and essentiality has been observed in the literature, which supports our results (Jeong et al. 2001).

Figure 3-6 shows a subnetwork of the human TRN. The highlighted transcription factors are shown to have abnormalities in many types of cancers. c-Myc is a transcription factor which has a key role in cell proliferation (Dang et al. 1999). Overexpression of c-Myc may result in the development of different types of cancers. TP53 is an essential


Figure 3-6. The TRN of H. sapiens with a subnetwork related to cancer highlighted. In this subnetwork, external signals (i.e., growth factors, other proteins and molecules) regulate or affect the proteins c-Myc, FLI1 and ERG. Many other regulatory connections and transcription factors are omitted for simplicity.

gene which is regulated by c-Myc. The expression of this gene prevents the formation of tumors by activating DNA repair, inhibiting cell growth and finally inducing apoptosis (H and Vousden 2000). TP53 executes apoptosis by activating caspases (i.e., CASP8, CASP3) (Schuler and Green 2001). FLI1 is another protein regulated indirectly by c-Myc. The fusion of proteins EWSR1/FLI1 and EWSR1/ERG due to a mutation creates a master regulator for the development of Ewing's sarcoma (Kauer et al. 2009; Owen et al. 2008). EWSR1/FLI1 causes tumor formation by upregulating genes that are involved in cell proliferation (i.e., IGF1) and downregulating genes that control apoptosis and growth inhibition (i.e., IGFBP3, TGFBR2) (Riggi. and Stamenkovic 2007). These


small scale observations support our previous justifications that regulatory hubs and essential genes are more likely to be situated in the middle layers of the TRNs.

3.3.3 Effects of Input on HIDEN

In this section, we analyze HIDEN by changing the input of the algorithm. In order to do this, we first change the number of layers we decompose the network into. Then, we assume errors and uncertainties in input networks. Using our results, we explain how reliable our method is under different conditions. Finally, we discuss the quality of our results for different subnetworks.

3.3.3.1 Navigation of genes across levels in varying hierarchies

The location of a gene in the hierarchy depends highly on the total number of levels. This leads to the following important question: How much can we rely on the relative levels of genes? One key feature of our method is that it allows the user to specify the number of levels in the hierarchical decomposition of the given network. By exploiting this feature, we answer this question next. Particularly, we show how changing the number of levels affects the locations of the nodes in the hierarchy. In order to do this, we first calculate the levels of every node for the S. cerevisiae, E. coli and H. sapiens networks in a six level hierarchy. We use these calculations to place every node in its respective position. We then decompose the same networks into five levels. We use the result of the second decomposition to assign colors to each node in the network. Figure 3-7 shows the results of this experiment.

Our results demonstrate that for the majority of the genes, the relative position between different genes is preserved. Finding genes in the same relative positions with respect to other genes across different decompositions suggests that our method is accurate for those genes. However, there exist some genes that violate this relationship. For example, in Figure 3-7, nodes YGL013C, YMR280C and YKL109W are at least two levels away from where they were earlier. Therefore, we conclude that the predicted levels of these genes are not as reliable as the others. This approach can be used for evaluating the reliability of our results.


Figure 3-7. Illustration of the navigation of genes across levels for the TRN of H. sapiens (A), S. cerevisiae (B) and E. coli (C). Each circle represents a gene. The locations represent the levels of the genes in a 6-level decomposition, whereas the colors of the genes represent their locations in a 5-level decomposition. The color red represents the bottom level in the hierarchy, green represents the topmost level, and the gradient of colors in between is used to color the nodes in between.

3.3.3.2 Robustness of HIDEN

One weakness of all hierarchical decomposition methods arises from the nature of the biological network datasets: they are incomplete and imprecise. As a result, the actual network topology observed can be slightly different from what is given in existing network databases (Sun et al. 2006). Furthermore, studies demonstrate that the network topologies can vary due to many other factors such as


external perturbations (Bandyopadhyay et al. 2010) and varying genetic profiles and disorders (Bandyopadhyay et al. 2011), even within the same species. This raises the following question: how much can we rely on a hierarchical decomposition if the topology of the given network is not accurate?

This section evaluates the robustness of HIDEN with respect to inaccuracies in the given network. In order to do this, we carry out the following steps. Given a network, we first find its hierarchical decomposition, denoted by $T$. We then create many mutant networks from this network for a given mutation percentage using the degree preserving edge shuffling model (Mcsweeney et al. 2008). This is the state of the art network alteration method that preserves the degree distribution of the network. We elaborate on this method later in this section. Thus, each mutant network denotes a potential precise network that is different from the original network. Using the topology of each mutated network, we compute the penalty of the hierarchical decomposition $T$ we found at the first step. Thus, this penalty shows how bad our results are if our network is inaccurate. We repeat this for many mutant networks and report the average of their penalties.

Briefly, we mutate a given network $G$ as follows. Let $(u, v)$ and $(s, t)$ denote two randomly selected edges from this network such that (i) $u$ and $v$ are different; $s$ and $t$ are different, and (ii) the edges $(u, t)$ and $(s, v)$ do not exist in $G$. We remove edges $(u, v)$ and $(s, t)$ and add edges $(u, t)$ and $(s, v)$. Let $\alpha$ denote the number of times we repeat this edge shuffling process. Then the mutation percentage of the original network is $\frac{\alpha}{|E|} \times 100\%$, rounded to the nearest integer.

We conducted the experiments on S. cerevisiae, E. coli and H. sapiens and on both penalty metrics, adjacency and reachability, for different numbers of levels of hierarchy. Figure 3-8 summarizes the results using the adjacency and reachability penalties when three, six or eight levels are allowed.

The most important observation that follows from our results is that the Z-score remains high even after we mutate the network by 20%. We observe a slight drop as the


Figure 3-8. (Evaluation of the robustness of HIDEN) The Z-score of HIDEN's hierarchical decomposition of the S. cerevisiae, E. coli and H. sapiens networks using adjacency and reachability penalties. Level assignment is done on the original network. The Z-score is computed on the mutant network, where the network is mutated at increasing mutation percentages. The results are reported for three different highest allowed levels, namely three, six and eight.

mutation rate increases, yet the results remain statistically significant. This observation holds for small (3), medium (6) and large (8) numbers of allowed hierarchical levels. This result has two major implications. First, HIDEN is extremely robust with respect to network mutations. It was able to identify hierarchical structure using the clues that remain in the topology of the given network after all mutations take place. Thus, even if the original network may be imprecise, the decomposition found by HIDEN will be a true decomposition with a high probability. Second, the degree preserving edge shuffling

does not affect the decomposability of the network. The fact that even the original level assignment T yielded statistically significant penalties on the mutant network shows that it is possible to find another decomposition T' of the mutant network that is statistically at least as significant as (possibly more significant than) T.

3.3.3.3 Stability of HIDEN to network mutations

So far, we have observed that HIDEN was able to decompose the networks of the given three organisms successfully. This observation, along with our last conclusion from the previous section, raises the following question: can HIDEN decompose the mutant networks, or was there a bias in the topology of these three networks in favor of HIDEN? In other words, how stable is HIDEN with respect to alterations in the network topology?

In order to evaluate the stability of HIDEN with respect to network alterations, we carry out the following steps. Given a network G, we create many mutant networks G' from G for a given mutation percentage using the degree preserving edge shuffling. We then use HIDEN on each such G' to find a new hierarchical level assignment T' specifically for that topology. The penalty of T' thus shows how much the performance of HIDEN is affected by network alterations. We repeat this for many mutant networks and report the average of their results.

Tables 3-2 and 3-3 summarize the adjacency and reachability penalties and the corresponding Z-scores for varying mutation percentages as well as varying maximum numbers of allowed hierarchical levels. For all three organisms, we observe similar patterns in our experiments. The most important observation is that HIDEN achieves very high Z-scores at all mutation rates. Furthermore, these Z-scores are comparable to those of the original network (i.e., mutation percentage = 0%). The adjacency penalty values are also comparable to those for the original network. This coincides with the observation we made in the robustness test in Section 3.3.3.2 that the degree preserving edge shuffling does not alter the decomposability of the given network. As the mutation percentage increases, the Z-score and the adjacency penalties do not show

a clear trend to increase or decrease. We thus reach the conclusion that HIDEN is stable with respect to network alterations.

Table 3-2. Stability experiment for increasing mutation percentages. The numbers in parentheses are the average adjacency penalties. The numbers above them are the corresponding Z-scores. Level indicates the maximum number of allowed levels.

Organism   Level   Mutation [%]
                   0        5        10       20       40
Yeast      3       9.10     11.63    10.17    12.12    10.10
                   (118)    (137)    (127)    (127)    (130)
           4       9.31     11.20    11.30    11.37    10.91
                   (99)     (117)    (103)    (114)    (104)
           5       9.26     10.96    10.97    11.62    10.67
                   (84)     (108)    (92)     (103)    (88)
E. coli    3       5.76     4.98     5.26     4.98     5.34
                   (17)     (16)     (22)     (16)     (15)
           4       5.43     4.77     5.67     4.77     5.57
                   (11)     (15)     (14)     (15)     (10)
           5       5.46     4.72     5.52     4.72     5.34
                   (9)      (15)     (12)     (15)     (8)
Human      3       7.44     9.24     8.66     7.79     9.14
                   (101)    (105)    (95)     (106)    (107)
           4       7.37     9.09     8.14     8.22     8.98
                   (84)     (90)     (83)     (92)     (92)
           5       7.90     8.83     7.70     8.53     9.33
                   (75)     (93)     (81)     (73)     (86)

3.3.3.4 Local versus global hierarchy of subnetworks

The entire biological network of an organism can be considered as a (possibly overlapping) collection of smaller subnetworks, where each subnetwork corresponds to a coherent functional group. For instance, the cell cycle network describes the interactions that take place during the division and replication of a cell to produce new cells. Similarly, the meiosis network describes a special type of cell division observed only in reproductive cells. These smaller subnetworks may follow a hierarchical structure as well within their local topologies. Clearly, we can use HIDEN on each of these

subnetworks to find their hierarchical structure by isolating them from the rest of the network one by one. We call such a hierarchical decomposition the local hierarchy, since it depends only on the topology of the subnetwork. We call the hierarchical decomposition we obtain for a subnetwork from the entire network's topology its global hierarchy. In this experiment, we evaluate how well the global hierarchy of a subnetwork fits its local hierarchy.

Table 3-3. Stability experiment for increasing mutation percentages. The numbers in parentheses are the average reachability penalties. The numbers above them are the corresponding Z-scores. Level indicates the maximum number of allowed levels.

Organism   Level   Mutation [%]
                   0        5        10       20       40
Yeast      3       12.35    15.21    14.62    15.20    15.31
                   (3674)   (3600)   (3483)   (3599)   (3598)
           4       12.33    14.74    14.47    14.73    14.66
                   (3027)   (3026)   (2923)   (3025)   (3024)
           5       12.27    14.18    14.29    14.18    14.18
                   (2754)   (2773)   (2644)   (2772)   (2771)
E. coli    3       7.73     6.58     6.90     6.58     6.37
                   (21)     (26)     (21)     (26)     (27)
           4       7.29     6.16     6.21     6.16     5.93
                   (15)     (20)     (15)     (20)     (26)
           5       6.95     6.00     6.22     6.00     5.81
                   (14)     (20)     (11)     (20)     (26)
Human      3       8.25     11.17    11.17    8.66     11.17
                   (1950)   (1951)   (1951)   (1944)   (1951)
           4       8.88     12.02    12.02    10.45    12.02
                   (1628)   (1613)   (1613)   (1608)   (16.13)
           5       12.40    11.71    11.71    10.45    11.71
                   (1431)   (1441)   (1441)   (1431)   (1441)

Let us denote the entire network with G and a subnetwork of G with G'. Let us denote the level assignments for the networks G and G' by HIDEN with T and T' respectively. Let $\hat{T} \subseteq T$ be the global hierarchy of G' induced from T. We calculate the adjacency penalty and Z-score of $\hat{T}$ and T' using the topology of G'. Table 3-4

summarizes the results for S. cerevisiae for two major subnetworks, namely cell cycle and meiosis, taken from the KEGG database (Ogata et al. 1999), with different values of the maximum number of allowed levels.

Table 3-4. Comparison of the global hierarchy of subnetworks to their local hierarchy in terms of the adjacency penalty and the Z-score. The experiment is conducted on two subnetworks of S. cerevisiae, namely cell cycle and meiosis. Num. Level denotes the maximum number of allowed levels.

Subnetwork   Num. Level   Global               Local
                          Penalty   Z-score    Penalty   Z-score
Cell Cycle   3            4         3.2        3         4.2
             4            2         3.3        1         4.0
             5            2         3.2        0         3.7
             6            2         3.1        0         3.7
             7            2         3.0        0         3.7
             8            2         2.9        0         3.7
Meiosis      3            8         0.7        2         3.5
             4            6         1.2        1         3.8
             5            6         1.2        1         3.8
             6            5         1.6        1         3.8
             7            5         1.5        1         3.8
             8            5         1.5        1         3.8

The results demonstrate that the local hierarchy is better than the global one. This is not surprising, as the global hierarchy is determined based on the entire network. Thus, the levels are determined with the goal of optimizing all the interactions in the network. On the other hand, the local hierarchy is determined based only on the restrictions imposed by the corresponding subnetwork. We observe that the gap between the local and the global hierarchy is small for the cell cycle network. It is, however, significant for the meiosis network.

In order to understand the factors that contribute to this gap, we performed a detailed analysis of the topology of the entire S. cerevisiae network as well as these two subnetworks. Cell cycle contained 54 genes and 108 interactions. Meiosis was smaller, containing 44 genes and 62 interactions. We define an interaction from a gene that is not in the subnetwork to a gene that is in the subnetwork as an incoming edge. If the interaction points in the opposite direction, we define it as an

outgoing edge. We computed the number of incoming and outgoing edges for each subnetwork. The number of incoming edges per node was 1.9 and 3.6 for cell cycle and meiosis respectively. That for the outgoing edges was 20.6 and 18.8 respectively. This suggests that as the number of incoming edges increases, the chance that the global hierarchy deviates from the local one increases. This is indeed intuitive, as the incoming edges influence the hierarchy much more than the outgoing edges.

For the local hierarchy, we observe that a small number of levels is sufficient to get a good hierarchical decomposition. For instance, HIDEN resolved all conflicts for cell cycle in only five levels. It resolved all but one conflict for meiosis in four levels. These results demonstrate that the local and global hierarchies can deviate significantly depending on the topological relationship between the subnetwork and the rest of the network. Thus, detailed analysis of both decompositions can yield useful information regarding how the functions of a given subnetwork depend on the other genes. HIDEN is capable of revealing such information.

3.4 Conclusion

In this chapter, we took a novel approach to the problem of discovering underlying network hierarchy. We first transformed our problem to a MIPP. Then, we solved this problem using existing optimizers. We named this method HIerarchical DEcomposition of gene regulatory Networks. However, due to the growing size of the MIPP with increasing number of genes, we encountered scalability issues. We proposed a divide and conquer approach to tackle such problems. Later, we experimentally showed that our algorithm outperformed existing solutions in terms of minimizing conflicting edges in the hierarchy. We also evaluated our method using biological and statistical tools. Then, we discussed the relation between the hierarchy of a gene in a TRN and its location in the cell, essentiality and function, based on our experimental results and biological evidence.
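The degree preserving edge shuffling used in the robustness and stability experiments of this chapter can be illustrated with a short Python sketch. This is not the code used in the experiments: the function name `degree_preserving_shuffle`, its parameters, the retry limit and the extra guard against creating new self-loops are all assumptions made for illustration. The network is assumed to be a list of distinct directed edges.

```python
import random


def degree_preserving_shuffle(edges, num_swaps, seed=0, max_tries=100_000):
    """Return a mutated copy of a directed edge list (hypothetical helper).

    Each successful swap replaces (u, v) and (s, t) with (u, t) and (s, v),
    so every node keeps its in-degree and its out-degree.
    """
    rng = random.Random(seed)
    edge_list = list(edges)
    edge_set = set(edge_list)
    swaps_done = 0
    tries = 0
    while swaps_done < num_swaps and tries < max_tries:
        tries += 1
        # Pick two distinct edges uniformly at random.
        i, j = rng.sample(range(len(edge_list)), 2)
        (u, v), (s, t) = edge_list[i], edge_list[j]
        # Condition (i): neither selected edge is a self-loop.
        if u == v or s == t:
            continue
        # Condition (ii): the rewired edges must not already exist.
        if (u, t) in edge_set or (s, v) in edge_set:
            continue
        # Extra guard (an assumption, not stated in the text):
        # do not create new self-loops.
        if u == t or s == v:
            continue
        edge_set.difference_update({(u, v), (s, t)})
        edge_set.update({(u, t), (s, v)})
        edge_list[i], edge_list[j] = (u, t), (s, v)
        swaps_done += 1
    return edge_list


example = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 3), (2, 5)]
mutant = degree_preserving_shuffle(example, num_swaps=3, seed=1)
print(len(mutant) == len(example))
```

Scoring the original level assignment on the returned edge list would correspond to the robustness experiment, while running the decomposition again on it would correspond to the stability experiment.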

CHAPTER 4
INFERRING GENE FUNCTIONS FROM METABOLIC REACTIONS

4.1 Introduction

Advances in biological sciences and improvements in experimental techniques have resulted in an increase in the amount of biological interaction data being generated. Depending on the types of interactions, such data is classified as gene regulatory (Levine and Davidson 2005), protein interaction (Giot and Bader et al. 2003) and metabolic networks (Francke et al. 2005). Among these, metabolic networks model the transformation of compounds to each other during metabolic processes. Analysis of metabolic networks has been of utmost importance in understanding how reactions take place in cells.

A number of approaches exist for studying metabolic networks (Bell and Palsson 2005; Kaleta et al. 2009; Larhlimi and Bockmayr 2009; Schuster et al. 1999; Stelling et al. 2002; Terzer and Stelling 2008). One of these is to use their steady states to analyze properties of metabolic networks. A metabolic network is said to be in steady state if the concentrations of internal metabolites (i.e., the metabolites which are not provided to the organism by the environment) do not change over time (Voit 1992). The system achieves this by balancing the rates of production and consumption of the internal metabolites. Steady states of the networks are important because they summarize possible long term outcomes of different networks.

A number of different models have been introduced for metabolic network analysis using its steady states. These models are elementary flux modes (EFM) (Schuster et al. 1999), extreme pathways (EP) (Bell and Palsson 2005), minimal metabolic behaviors (MMB) (Larhlimi and Bockmayr 2009) and flux balance analysis (FBA) (Stelling et al. 2002). More recently, two new models for analyzing larger networks, elementary flux patterns (Kaleta et al. 2009) and projected cone elementary modes (Marashi et al. 2012), were introduced. These models study the steady states of the metabolic networks

by creating a map of all the steady states of a network in a high dimensional space. The fluxes of the reactions in the metabolic network characterize the space in which we represent all the steady states.

Figure 4-1. Flux cone of a simple example network with three reactions ($r_1$, $r_2$ and $r_3$) and five elementary flux modes ($e_1$, $e_2$, $e_3$, $e_4$ and $e_5$). The shaded area is the metabolic flux cone and the dashed lines are elementary flux modes.

Figure 4-1 shows the flux cone and EFMs of a hypothetical network with three reactions (the network is not shown). In Figure 4-1, $r_i$ corresponds to the flux of the ith reaction of the network, and the space defined by these axes is referred to as the flux space of the metabolic network. Each point in the flux space represents a state of the network, where the value of a point in each dimension is the rate at which the reaction corresponding to that dimension takes place. Each $e_j$ is an EFM of the network, and the shaded region defined by the set of possible nonnegative linear combinations of the five EFMs is the set of all possible steady states of the network.
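The relation between EFMs, the flux cone and steady states can be made concrete with a small Python sketch. The toy network below (one internal metabolite M, three reactions, two EFMs) is hypothetical and is not the network of Figure 4-1; the helper names `cone_point` and `net_production` are likewise assumptions. The sketch only illustrates that a nonnegative combination of EFMs is a point of the flux cone and leaves the internal metabolite balanced.

```python
def cone_point(efms, weights):
    """Combine EFMs with nonnegative weights into one flux vector."""
    if len(efms) != len(weights):
        raise ValueError("one weight per EFM is required")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be nonnegative")
    num_reactions = len(efms[0])
    return [sum(w * e[k] for w, e in zip(weights, efms))
            for k in range(num_reactions)]


def net_production(stoichiometry, fluxes):
    """Net production rate of each internal metabolite (S times v)."""
    return [sum(c * v for c, v in zip(row, fluxes)) for row in stoichiometry]


# Hypothetical toy network with one internal metabolite M:
#   r1: -> M,   r2: M ->,   r3: M ->
S = [[1, -1, -1]]
EFMS = [[1, 1, 0],   # e1: uptake through r1, removal through r2
        [1, 0, 1]]   # e2: uptake through r1, removal through r3

v = cone_point(EFMS, [2.0, 0.5])
print(v)                     # [2.5, 2.0, 0.5]
print(net_production(S, v))  # [0.0]
```

Each coordinate of `v` is the rate of one reaction, so `v` is a single point of the flux space, and the zero net production shows that it is a steady state.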

Each reaction in a metabolic network contributes to the set of possible steady states that can be reached by that network. Earlier studies use different approaches to measure the impact of a reaction in metabolic pathways. These studies analyze the centrality of reactions (Ma and Zeng 2003), the metabolites they produce and consume (Samal et al. 2006) and their frequencies in EPs (Conradi et al. 2007). A recent study by Ay et al. showed that functionally similar reactions can be predicted by studying the steady states of metabolic networks (Ay and Kahveci 2010a). The accuracy of their method, however, drops as the number of dimensions of the flux space (i.e., the number of reactions) grows.

In this chapter, we develop a new method that predicts the set of possible functions of a given gene (or a set of genes) from the contribution of that gene (or those genes) to the steady states of the metabolic network that contains it. In particular, we focus on the functions defined in the Gene Ontology database (Ashburner et al. 2000). We show that different steady states can indicate different sets of functions. We do this by first characterizing the impact of each reaction on the network in terms of the steady states of the network. This is a nontrivial problem, as the flux space grows exponentially with an increasing number of reactions. We use Monte Carlo sampling to uniformly select many points in the flux cone. Knocking out a gene shrinks the flux cone. As a result, the sampled points lie at various distances from the smaller cone. We characterize the impact of knocking out a gene in terms of these distances. We then cluster all the reactions using their impacts. We check each cluster in terms of the GO annotations of each reaction. We observe that reactions with the same GO terms are more likely to exist in the same clusters. We also set up a model which can be used to predict missing genetic function information for newly discovered reactions. We show that for a subset of reactions, our method can be used for functional annotation prediction.

The rest of this chapter is organized as follows: Section 4.2 explains our methodology to establish the connection between the steady states and the functions of reactions in

metabolic networks. Section 4.3 presents experimental results. Section 4.4 briefly concludes this chapter.

4.2 Methods

In this section, we describe the data analysis method we developed in this chapter. Overall, our algorithm works as follows. First, we calculate the EFMs of the whole network. Then, we characterize the impact of the deletion of each reaction one by one. We then cluster the reactions in terms of their impacts in the flux space. Finally, we analyze each cluster using the gene ontology (GO) terms of each gene encoding an enzyme that catalyzes a reaction inside the cluster. Below, we explain our algorithm in detail.

We denote a metabolic network using N. We start our analysis by first calculating the EFMs of N. We need this in order to compare how the reaction deletions impact the steady states. We will use E(N) to denote the set of EFMs of the network. We represent a single EFM of N with $e_j \in E(N)$. We denote the metabolic flux cone of the network N as

$$C(N) = \Big\{ \sum_{e_i \in E(N)} x_i e_i \;\Big|\; x_i \ge 0 \Big\}.$$

To model the impact of a reaction $r_i$ in the network mathematically, we first knock out $r_i$ from N. In a biological context, we achieve this by removing one or more genes that encode the enzymes catalyzing $r_i$. Removal of a reaction from N may change the set of possible steady states which can be achieved by N. More specifically, the steady states in which $r_i$ has a net flux greater than zero will not be achievable anymore. Figure 4-2 shows the result of a reaction knockout. Notice that after the removal of a reaction, together with some steady states, one or more EFMs may also become infeasible. After removing a reaction, the new flux cone becomes the zone bounded by all the remaining EFMs, as shown in Figure 4-2. We denote the flux cone resulting from deletion of the ith reaction $r_i$ with $C(N \setminus r_i)$.

In order to correctly characterize the change in the flux cone, we need to compare two high dimensional shapes. However, this is a difficult problem, particularly when the number of dimensions (i.e., fluxes) is large. Earlier studies calculate minimum

enclosing balls of flux cones to compare the different volumes (Ay and Kahveci 2010a). However, as the number of dimensions (i.e., fluxes) increases, such approximations quickly diverge from the actual volumes of the flux cones (Bellman 2003). This problem is referred to as the curse of dimensionality in the literature.

Figure 4-2. Result of removing an EFM from the flux cone in Figure 4-1. In this figure, $e_4$ is removed, resulting in a shrinkage of the flux cone. Notice that the flux cone is still the linear combination of all the remaining EFMs.

Here, we develop a method that avoids the curse of dimensionality. Our method characterizes the change in the flux cone using Monte Carlo sampling. We do this by first creating a large number of sample steady states of N. Notice that each steady state is a point in C(N). We create the samples by randomly generating points in the high dimensional space that contains C(N). We then reject the points that are not in C(N) from the sample set. In all of our experiments in this chapter, we create 1000 such samples. We denote the resulting set of samples with

S(N). Next, we describe how we use the samples in S(N) to characterize the impact of each reaction.

When we knock out a reaction $r_i$, the flux of reaction $r_i$ has to be fixed to zero. This implies elimination of the dimension corresponding to $r_i$ in the flux space. Therefore, after each knockout, we first project each sample to the resulting reduced flux space. We then calculate the distance of each sample to the cone $C(N \setminus \{r_i\})$. We will denote the distance of each sample $s_j \in S(N)$ to $C(N \setminus r_i)$ with $d(i, j)$. Figure 4-3 explains this process on a simple example. In the example in Figure 4-3, we remove $r_3$. After removing $r_3$, only $e_1$ and $e_4$ remain feasible among the four original EFMs of N. Of the two original samples $s_x$ and $s_y$, the projection of $s_x$ (denoted with $s'_x$) lies in $C(N \setminus \{r_3\})$. Therefore, $d(3, x) = 0$ and $d(3, y) > 0$.

Let us denote the number of sample points with m. We characterize the impact of reaction $r_i$ with a vector $D_i$ of size m. We set the jth entry of $D_i$ ($1 \le j \le m$) to $d(i, j)$. Notice that after this step, the impact of each reaction is still represented with an $m = |S(N)|$ dimensional vector. In order to further reduce the number of dimensions and speed up the remaining steps of our method, we calculate the first few principal components of all $D_i$ for all reactions $r_i$. In practice, we observed that the first three principal components carried most of the variance and thus were sufficient to represent the entire vectors. We summarize the impact of each reaction as the first three principal components of the corresponding $D_i$.

After calculating the impact of each reaction, we group reactions based on their impacts. We do this by clustering the impact vectors of the reactions. We use hierarchical clustering to group the reactions using their impacts (Sibson 1973). We assume that reactions that fall into the same cluster have similar dynamic properties.

After creating clusters of reactions, we replace each reaction with the genes that encode the enzymes catalyzing it. We then calculate functional enrichment scores (Shlomi et al. 2006) for each GO term in each cluster. We then use these

enrichment scores as indicators of the importance of each function in each group. Next, we explain our experimental setup and discuss our results in Section 4.3.

Figure 4-3. Illustration of the dimension reduction with the knockout of reaction $r_3$. $e'_2$ and $e'_3$ are projections of original EFMs which are no longer feasible after the deletion of $r_3$. $s'_x$ and $s'_y$ are projections of two of the samples in S(N) to the new flux space. In this example, $d(3, x) = 0$ and $d(3, y) > 0$, as shown in the figure.

4.3 Results

In this section, we evaluate our method experimentally. First, we explain our experimental setup in Section 4.3.1. Section 4.3.2 explores the correlation between our enrichment scores and the GO annotations. Section 4.3.3 shows simple experiments demonstrating the performance of our method in gene function prediction.

4.3.1 Experimental Setup

We ran our experiments on metabolic networks of E. coli, H. pylori, M. tuberculosis, S. aureus and S. cerevisiae. For our experiments, we downloaded the metabolic networks of the above listed organisms from the BiGG database (Schellenberger

et al. 2010). In order to calculate the EFMs of all the original and perturbed networks, we used EFMtool (Terzer and Stelling 2008). In order to divide the networks into smaller subnetworks, we used the subsystem information included in the networks we downloaded from BiGG. We considered each subsystem as a separate subnetwork.

We used the gene ontology (GO) database to download the GO term annotations of all the genes involved in our metabolic networks (Ashburner et al. 2000). The functional annotation terms in the GO database have a hierarchical organization. In this hierarchy, the terms close to the top of the hierarchy have broader functional definitions (e.g., cellular physiological process, signal transduction), while the terms closer to the bottom have more specific functions (e.g., pyrimidine metabolic process, alpha-glucoside transport). In order to remove any bias introduced by very broad or very specific functions, we only use GO terms that are at the fifth layer from the top of the hierarchy.

Calculation of metabolic flux cones is a computationally challenging task. For organisms with larger metabolic networks, and thus more EFMs (i.e., more than two million EFMs), it becomes impossible to calculate all the EFMs of the network (Terzer and Stelling 2008). We split such networks into smaller subnetworks. We define the subnetworks using the subsystem information for each reaction in the BiGG database (Schellenberger et al. 2010).

In our experiments, we implemented the code which reads and divides metabolic networks into subnetworks and the code that downloads and filters GO terms using Java. We implemented the remaining steps in MATLAB.

4.3.2 Correlation of Impact and Function

In this experiment, we explore whether the enrichment scores we calculate correlate with actual GO annotations. In order to do this, we start by calculating the enrichments of all GO terms in each group. Then, for each reaction and each GO term, we check if any of the encoding genes of the enzymes of this reaction is annotated with this GO

term. We group all reaction and GO term pairs into two groups, namely the pairs in which the reaction is annotated with the given GO term and the pairs in which it is not. We will refer to these sets as the positive and negative sets respectively. After that, we assign p-values to each pair using the p-values we calculated for the group of each reaction. Finally, we plot the cumulative histograms for the p-values in each group. Figure 4-4 shows the results of this experiment.

Figure 4-4. Cumulative histogram of the p-values calculated for each reaction and GO term, for pairs where the reaction is actually annotated with the GO term, or not. The figure is drawn in log-log scale.

Our results suggest that around 3% of the reaction-GO term pairs from the positive set have p-values less than $10^{-3}$. When we focus on these GO terms, we observe that among the members of the negative set, less than 0.1% of the pairs have p-values as low. On the other hand, around 60% of the pairs in the negative set have p-values very close to 1. Statistically, this shows an event with very little statistical significance. These results suggest that our hypothesis that enrichment scores correlate with actual

annotations is true. Therefore, we conclude that reactions related to the same GO terms tend to cluster together.

4.3.3 Genetic Function Prediction is Possible Using Metabolic Network Analysis

This section evaluates whether our method can predict the GO terms correctly. In order to do this without creating any unfair advantage for our method, we need two sets: the first one to build our model, and the second one to test the resulting model. To ensure that the number of samples that contribute to our model is as large as possible, we follow a leave-one-out strategy as follows. We pick one reaction and remove all the GO term information we transferred from the genes encoding the enzymes that catalyze this reaction. After the removal, we recalculate all the enrichment scores we calculated in the last step of our algorithm. Finally, using the new p-values, we try to recover the information we just deleted by predicting the GO terms with the smallest p-values. We repeat these steps for all the reactions in our network one by one.

Figure 4-5 shows the results of this experiment. Our experiments show that for a subset of the GO terms, our method can be used to predict genetic functions. However, before the number of false positives gets very high, only 20% of the GO terms are retrieved using our method. On the other hand, because of the difficulty of calculating the EFMs of the complete network and the curse of dimensionality, we had to make numerous assumptions. Therefore, we conclude that genetic function prediction is possible using metabolic network analysis. However, further evaluation is required to judge how good the predictions made using metabolic network analysis can be.

4.4 Conclusion

In this chapter, we explored steady states of metabolic networks and their functional properties. We did this by first characterizing the impact of each reaction in a metabolic network in terms of the steady states of the network. Then, using their impacts, we grouped every reaction in the network. We checked each gene that encodes an enzyme catalyzing each reaction in the network to transfer the gene ontology (GO) annotations

to the reactions.

Figure 4-5. Number of total predictions versus ratio of correct predictions. Each prediction consists of a reaction-GO term pair, which implies that the reaction is predicted to be annotated with the given GO term. Only 20% of such pairs can be retrieved using our method.

For each group we formed, we calculated the enrichment of each GO term in that group. We showed that for each reaction in a group, enrichment values correlate highly with the GO terms of the reaction. We also showed that our method can be used in GO term prediction for less known reactions.
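The enrichment p-values used throughout this chapter can be illustrated with a short Python sketch. The chapter cites Shlomi et al. (2006) for the enrichment score but does not spell out the statistic, so the one-sided hypergeometric test below is an assumption, chosen because it is a common way to score GO term enrichment in a cluster. The function name and its parameters are hypothetical.

```python
from math import comb


def enrichment_p_value(population, annotated, cluster_size, hits):
    """One-sided hypergeometric p-value (assumed enrichment statistic).

    population   -- total number of reactions considered
    annotated    -- reactions in the population annotated with the GO term
    cluster_size -- number of reactions in the cluster
    hits         -- reactions in the cluster annotated with the GO term

    Returns the probability of seeing at least `hits` annotated reactions
    when `cluster_size` reactions are drawn without replacement.
    """
    total = comb(population, cluster_size)
    upper = min(annotated, cluster_size)
    tail = sum(comb(annotated, k) * comb(population - annotated, cluster_size - k)
               for k in range(hits, upper + 1))
    return tail / total


# Toy numbers: 10 reactions, 3 annotated with the term, and a cluster
# of 3 reactions that contains all 3 annotated ones.
print(enrichment_p_value(10, 3, 3, 3))  # 1/120, about 0.0083
```

A small p-value for a (cluster, GO term) pair indicates that the term is over-represented in that cluster, which is how a reaction-GO term pair would be ranked in the prediction experiment under this assumed statistic.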

CHAPTER 5
RINQ: REFERENCE-BASED INDEXING FOR NETWORK QUERIES

5.1 Introduction

With the advances in biotechnology and high throughput computing, the amount of data describing the interactions among molecules has increased greatly in recent years. Depending on the types of interactions, this data is expressed through various models such as metabolic (Francke et al. 2005), gene regulatory (Levine and Davidson 2005) or protein interaction (Giot and Bader et al. 2003) networks. The term pathway is also used in the literature to describe a part of them. In this chapter, we will use the term biological network to describe a collection of these interactions.

Understanding biological networks has been one of the main goals of biological sciences, as they contain key information regarding how organisms work. To achieve this goal, biological networks are analyzed in a number of ways. One of them, comparison based analysis, identifies similar parts of two biological networks by aligning them. Such analysis has been successfully used for finding functional annotations (Clemente et al. 2006), identifying drug targets (Sridhar et al. 2007), reconstructing metabolic networks from newly sequenced genomes (Francke et al. 2005), and building phylogenetic trees (Clemente et al. 2005).

Aligning two biological networks is a computationally challenging problem. Existing methods often map this problem to the subgraph isomorphism problem. For this problem, there are no known polynomial time algorithms which guarantee to produce the optimal alignment. A number of methods aim to find approximate (Dost et al. 2008; Pinter et al. 2005; Shlomi et al. 2006) or heuristic (Ay and Kahveci 2010b; Ay et al. 2009; Liao et al. 2009) results in practical time. However, they still require a significant amount of time.

The computational difficulty of biological network alignment becomes apparent when we need to align a query network with a database of biological networks. To test this,

we downloaded 297 networks from KEGG (Ogata et al. 1999). These networks had 16 to 120 genes (i.e., nodes) and 17 to 165 edges. We aligned the networks in the dataset with a set of 50 query networks using the QNet algorithm (Dost et al. 2008). Each of these query networks had seven nodes. The average processing time for a single query network was over two days on a single processor. The same experiment for eight node queries took more than a week for a single query network.

In this chapter, we aim to align a query network with a database of networks efficiently. The formal definition of our problem is as follows.

Problem definition: Assume that we have a database of n biological networks denoted by $D = \{d_1, d_2, \ldots, d_n\}$, where $d_i$ is the ith network in our database. Also, we are given a query network denoted by q. Let us denote the alignment score between q and $d_i$ ($d_i \in D$) with $sim(q, d_i)$. Given a similarity cutoff $\epsilon$, similarity search returns all the database networks $d_i$ that satisfy $sim(q, d_i) \ge \epsilon$. We aim to reduce the processing time of similarity searches in biological network databases to a practical level.

A number of variants of the problem of aligning two networks have been considered in the literature (Ay and Kahveci 2010b; Dost et al. 2008; Kelley et al. 2004; Liao et al. 2009; Shlomi et al. 2006). However, there has been a limited number of studies on similarity searches in network databases (Giugno and Shasha 2002; He and Singh 2006; Mongiovi et al. 2010; Yan et al. 2004). Although these methods work well for specific types of network databases, they are not well suited for biological network databases, which is the problem considered in this chapter. This is mainly because they often use an overly simplified similarity measure for network alignment. When used with costly biological network alignment tools, the time complexity of these indexing methods becomes impractical. We elaborate on previous works in Section 5.2.

Contributions: In this chapter, we develop a reference-based indexing method to answer similarity queries in biological network databases efficiently. We call this method Reference-based Indexing for Biological Network Queries (RINQ). The main advantage

of this method is its independence of the pairwise network alignment algorithm the database employs.

Briefly, RINQ works as follows. First, we choose a small set of networks to use as reference networks. We align all of the reference networks with the networks in our database, and store all the alignment mappings and scores. These steps are performed offline. After these two steps, our database is ready to accept queries. Given a query q and a similarity cutoff $\epsilon$, instead of aligning q with all the database entries, we align it with the reference networks. Using these alignments and the precomputed alignments between reference and database networks, we compute a lower bound and an approximate upper bound for $sim(q, d_i)$ for each database network $d_i$. We will refer to these upper and lower bounds as $UB(q, d_i)$ and $LB(q, d_i)$ respectively. Computing the upper and lower bounds takes much less time than aligning the query network with the database networks. Depending on the values of the upper and lower bounds, we may encounter one of the following three cases.

Case 1 (Filter Set): $UB(q, d_i) < \epsilon$
Case 2 (Result Set): $LB(q, d_i) \ge \epsilon$
Case 3 (Twilight Zone): $LB(q, d_i) < \epsilon \le UB(q, d_i)$

In the first case, the similarity between q and $d_i$ is below the cutoff, so we can filter out such $d_i$. In the second case, $sim(q, d_i) \ge \epsilon$, which is what we are looking for. We can directly add such $d_i$ to our result set. For the database networks $d_i$ that fall into the third case, the value of $sim(q, d_i)$ is unclear. We align q with such $d_i$ to find their actual similarity value. In other words, we perform the costly network alignment algorithm only for the networks that fall into the third case. Our experiments suggest that

1. RINQ filters up to 80 percent of the database entries depending on the given similarity cutoff and the query.
2. RINQ is significantly faster and more accurate than the existing methods including SAGA (Tian et al. 2007) and CTree (He and Singh 2006).

3. The alignments returned by RINQ are statistically significant and biologically relevant.

The rest of this chapter is organized as follows. Section 5.2 discusses the previous works related to our problem. Section 5.3 elaborates on RINQ. Section 5.4 presents our experimental results. Section 5.5 briefly concludes the chapter.

5.2 Related Work

Most of the existing work on comparing biological networks focuses on the alignment of a pair of networks. This problem is the fundamental building block of the similarity search problem considered in this chapter. However, aligning even a single pair of networks is a computationally difficult task. To alleviate this difficulty, a number of methods in the literature restrict alignments to specific topologies or network sizes. For example, PathBLAST (Kelley et al. 2004) and QPath (Shlomi et al. 2006) align linear pathways with protein interaction networks. Pinter et al. devised an algorithm which can align queries of tree networks with a multisource tree network (Pinter et al. 2005). QNet uses a color-coding algorithm to align trees or bounded treewidth networks with any network with provable confidence values (Dost et al. 2008).

To evaluate the performance of QNet, we aligned a set of seven node query networks with a database of gene regulatory networks using our implementation of the QNet algorithm. We downloaded all the gene regulatory networks from KEGG (Ogata et al. 1999) that had more than 15 nodes into our local database. There were 297 such networks (see Tables 5-2 and 5-3). When we set the confidence value to 99%, the average running time to align a single query network exhaustively with all database networks was over two days. Even after we reduced the confidence to 90%, the running time for exhaustive search was still one day.

A number of other approaches exist to speed up the pairwise network alignment problem. NetMatch (Ferro et al. 2007) and NetGrep (Banks et al. 2008) allow the user to define queries with specific node features. Then they use fast approximate algorithms to match the user defined queries. Heuristic methods aim to solve the

alignment problem faster without providing a confidence value (Ay and Kahveci 2010b; Bruckner et al. 2009; Liao et al. 2009). However, despite all these efforts, the running time still remains a bottleneck.

Several indexing methods for similarity searches in graph databases exist. The majority of these methods can be classified as feature based indexing methods. These methods start by picking specific features of the networks for filtering purposes. GraphGrep chooses paths as the index feature (Giugno and Shasha 2002). gIndex uses frequent subgraphs for the same purpose (Yan et al. 2004). SAGA uses fragments of database networks and tries to combine them together to find larger matches (Tian et al. 2007). These methods focus on finding exact matches of a given query in the database. SIGMA is another feature based indexing method, which concentrates on the problem of inexact matching in graph databases (Mongiovi et al. 2010).

Another method for graph database indexing is Closure-Tree (CTree) (He and Singh 2006). Closure-Tree organizes the networks in the database using a binary tree structure. It places each database network at a different leaf node in this tree. Each internal node in this tree is a hypothetical network that is obtained by aligning the two networks corresponding to its children nodes. An interesting property of the CTree is the following. The score of the alignment of any query network with an internal node is at least as much as that with any leaf node in the subtree rooted at that internal node. Following from this, given a query network, the closure tree algorithm starts by aligning the query to the root node. It then proceeds to the children nodes. It prunes an entire subtree rooted at an internal node if the alignment to that internal node has a score less than the given cutoff.

Existing graph database indexing methods work well when each database network is small or the underlying similarity measure is not expensive. However, these two properties do not hold for biological network databases such as metabolic and gene regulatory networks. As a result, when they are applied to these networks, either the index construction time becomes impractical or the amount of performance gained

becomes too small. For instance, when a costly dynamic programming method, such as QNet (Dost et al. 2008), is used to summarize sets of networks accurately in the CTree method, computing the summary for even a single internal node can take more than one month for networks with more than 12 nodes. On the other hand, the method we develop in this chapter allows querying databases with networks of an arbitrarily large number of nodes. We provide an experimental comparison of our method with SAGA and CTree in Section 5.4.3.

5.3 Algorithm

In this section, we explain the indexing method we developed for querying biological network databases. We first describe the notation we use in the rest of this chapter.

We denote the database of n biological networks with $D = \{d_1, d_2, \ldots, d_n\}$. Here, $d_i$ represents the ith network in the database. As we will explain later, our algorithm uses a set of reference networks. We denote the set of m candidate references with $C = \{c_1, c_2, \ldots, c_m\}$. In this representation, each $c_i$ corresponds to a candidate reference network. We denote the actual set of references with $R = \{r_1, r_2, \ldots, r_k\}$, where $R \subseteq C$. To simplify our notation, we will drop the subscript and use d, r or c to represent $d_i$, $r_i$ or $c_i$ respectively whenever possible. We will represent a query network with q. We use q[j], c[j], r[j] and d[j] to refer to the jth node in the respective networks.

An alignment is a mapping between the nodes of two networks. Let us represent the mapping between q and d with $\psi$. Then, $\psi(q[i]) = d[j]$ means that q[i] is aligned to d[j]. Note that all these mappings are one to one and symmetric. In other words, if $\psi(q[i]) = d[j]$, then $\psi(d[j]) = q[i]$. When a node q[i] is not aligned to any node in d (i.e., insertions or deletions), we will use the notation $\psi(q[i]) = \emptyset$.

The similarity score between two nodes q[i] and d[j] is represented as $sim(q[i], d[j])$. Let $E_d$ and $E_q$ be the sets of edges (i.e., interactions) of d and q respectively. Let us denote the function that returns the weight of an edge in $E_d$ with $EW_d$. Given a mapping between q and d, the

PAGE 89

individual node similarities between $q$ and $d$, we calculate the similarity score $sim(q, d)$ as:

$$sim(q, d) = \sum_{\psi(q[i]) \neq \emptyset} sim(q[i], \psi(q[i])) + \sum_{\psi(q[i]) = \emptyset} \mathrm{InDelPenalty} + \sum_{(\psi(q[i]), \psi(q[j])) \in E_d \ \mathrm{AND}\ (q[i], q[j]) \in E_q} EW_d(\psi(q[i]), \psi(q[j]))$$

The organization of the rest of this section is as follows. In Section 5.3.1, we provide an overview of reference-based indexing. We elaborate on our reference selection method in Section 5.3.2. Finally, in Section 5.3.3, we explain the calculation of the lower and upper bounds we use for RINQ.

5.3.1 An Overview of Our Reference-Based Indexing

RINQ has two major steps, namely index creation and query processing. We create the index once, as a preprocessing step, in two phases. In the first phase, we create a large set of candidate references from the database networks. In the second phase, we pick a subset of these candidates as the actual reference set by considering their performance over a set of training queries. In our index, we store the alignments between all the reference and database network pairs.

Once the index is created, we are ready to answer similarity queries on our database. Given a query network $q$, we align $q$ with the reference networks in $R$ rather than the database networks. Usually, the cost of these alignments is negligible since $R$ contains far fewer and smaller networks than $D$. For each $d_j \in D$, we compute a lower bound and an approximate upper bound to the similarity score between $q$ and $d_j$. We do this by using the alignments of the references to $q$ and $d_j$. We denote these lower and upper bounds by $LB(q, d_j)$ and $UB(q, d_j)$ respectively.

Using the lower and upper bound values, we classify each $d_j$ into one of the following three sets: FilterSet, ResultSet, or TwilightZone. If $UB(q, d_j)$ is less than the similarity cutoff, the alignment score between $q$ and $d_j$ cannot be larger than the cutoff. We put such $d_j$ in FilterSet. We eliminate
the networks in the filter set without further calculation. If $LB(q, d_j)$ reaches the cutoff, it means $d_j$ has an alignment score that definitely reaches the cutoff as well. So $d_j$ should be in the set we return as the result of our query. We place such $d_j$ in the ResultSet. Finally, if $LB(q, d_j)$

The first property above ensures that the query aligns with each reference quickly. The second property ensures that it is possible to find tight lower and upper bounds for any query that aligns well with a reference network. If two references are highly similar, then the lower and upper bounds computed through them will be similar regardless of the query network. Thus one of them is sufficient to compute the bounds. The third property avoids such redundancy.

In order to satisfy the first rule, we set the size (i.e., the number of nodes) of each reference network to that of the largest query allowed. Let us denote this number with $T$. We use $T = 6$, 7 or 8 in our experiments. Using a small $T$ ensures Property 1 described above.

Following from the rules above, we create a potentially large set of candidate references from the database networks iteratively. Each iteration consists of two steps.

Step 1. We pick a database network randomly. We choose a random node from that network as the seed. We then grow that seed by adding one of its neighboring nodes at random, until the seed contains $T$ nodes.

Step 2. We align the current seed with the rest of the candidate reference networks we have created so far. If any of the alignment scores is greater than a predefined cutoff, it means that the current seed is redundant. Therefore, we discard the current seed. Otherwise, we include the current seed in the candidate reference set.

Steps 1 and 2 eliminate redundancy from the candidate reference network set, therefore enforcing Property 3. The iterations continue until we are unsuccessful at creating a new candidate reference network, hence fulfilling Property 2.

5.3.2.2 Phase II: Creating the final reference set.

Recall from Section 5.3.1 that we first align a given query network with all the reference networks in $R$. In order to perform this comparison quickly, it is desirable to have a small number of references in $R$. The amount of time we would like to spend for filtering the database determines the size of $R$. We will use at most 100 references in our experiments. We choose our actual reference set as a subset of the candidate

Table 5-1. The distribution of the number of database networks to different sets for a given query. The networks in the partition True are the ones that need to be returned for the query. The networks in the partition False are the ones that are not similar to the query. The networks are also divided into three sets: ResultSet, TwilightZone and FilterSet. Each $N_i$ inside the partitions represents the number of networks in the corresponding partition.

         Result Set   Twilight Zone   Filter Set
True     N1           N2              N3
False    0            N4              N5

references. A naive solution would be to pick the reference networks randomly from the candidate reference set. This simple strategy may produce a good reference set since the candidate set already satisfies the three properties listed at the beginning of this section. Here, we develop an intelligent strategy that optimizes the reference selection further. This strategy learns the accuracy of the references by training over a set of queries.

At this point, we take a small detour to explain how we compute the accuracy of our references for a query. As we explained in Section 5.3.1, our references place each database network into one of three sets, namely ResultSet, FilterSet and TwilightZone. Among these, the last two may contain true or false results (i.e., a database network is a true result if its alignment to the query has a score greater than the given cutoff). Table 5-1 shows the decomposition and the number of networks in each set. The result set contains only true results, as our lower bound is guaranteed to be at most the actual score. The filter set, on the other hand, may contain true results as well as false results. Our method fails to find the true results in this set, if there are any. We compute the accuracy of our method as the ratio of true results that our method can return. Formally, this number is $\frac{N_1 + N_2}{N_1 + N_2 + N_3}$. This metric is also known as recall in the literature.

Algorithm 2 iteratively tries to find the reference set which yields the highest accuracy. We do this with the help of a set of training query networks. We avoid a purely combinatorial approach because the number of possible reference sets is very

Algorithm 2 Create reference set $R$
  Align all the networks in $Q$ and $C$ with the networks in $D$
  Remove $k$ networks from $C$ randomly and add them to $R$
  repeat
    Set CurrentAccuracy to the accuracy of $R$
    Remove $r_i$ from $R$ where the accuracy of $R - \{r_i\}$ is maximum
    for all $c_j$ in $C$ do
      Calculate accuracy using $R \cup \{c_j\}$
    end for
    if the resulting accuracy is better than the current one then
      Add $c_j$ to $R$ for which the accuracy of $R \cup \{c_j\}$ is largest
      Remove $c_j$ from $C$
    else
      Insert $r_i$ back in $R$ and break the loop
    end if
  until accuracy cannot be improved anymore or $C$ is empty

high. Instead, we develop a hill climbing approach. Algorithm 2 describes this method in detail. We start by randomly choosing a reference set. Then, at each iteration, we replace one reference network in $R$ with another in $C - R$ to improve the accuracy of our index for the training queries. To do this, we first eliminate the reference from the current reference set whose removal drops the accuracy the least. Next, among all the remaining candidate references, we find the one whose inclusion in the current reference set increases the accuracy the most. If these two changes improve the overall accuracy, we accept them. We repeat these iterations until there is no candidate reference which can improve the overall accuracy.

5.3.3 Computing the Bounds

For a query network $q$, we calculate $UB(q, d)$ and $LB(q, d)$ for each $d \in D$ with the help of references. When we create the reference set as described in Section 5.3.2, we also record the alignment of each $(r_i, d)$ pair ($r_i \in R$). For each $(q, d)$ pair, we compute $UB(q, d)$ and $LB(q, d)$ using the precomputed alignments of each $r_i$ with $d$, and the alignments of each $r_i$ with $q$. In Section 5.3.3.1, we present an exact lower bound
calculation technique. Then, in Section 5.3.3.2, we discuss an approximate upper bound calculation method. In order to simplify the explanation of the bound calculations, we denote the alignment of $q$ and $r_i$ with a relation $\phi_i$. We will use the relation $\theta_i$ to represent the alignment between $r_i$ and $d$.

5.3.3.1 Lower bound calculation

In order to provide a tight lower bound for a given query and database network pair, we use each reference independently. Using each reference network $r_i$, we calculate a different lower bound value $LB_i(q, d)$. After calculating $k$ different lower bounds, we choose the largest one (i.e., $LB(q, d) = \max_i \{LB_i(q, d)\}$) as the lower bound.

Given $q$, $r_i$ and $d$, we compute $LB_i(q, d)$ in two phases. In the first phase, we map a subset of the nodes of $q$ to those of $d$ from their existing alignments with $r_i$. In the second phase, we map the remaining nodes if possible and calculate the final alignment score.

Phase I. Using the representations for alignments between the $(q, r_i)$ and $(r_i, d)$ pairs, our indirect alignment $\psi_i$ between $q$ and $d$ through $r_i$ is simply the composition of the relations $\theta_i$ and $\phi_i$. In other words, it is the relation $\psi_i(q[j]) = \theta_i(\phi_i(q[j]))$. For example, in Figure 5-1A, $q[1]$ and $r_i[1]$ are aligned (i.e., $\phi_i(q[1]) = r_i[1]$). Also, $r_i[1]$ and $d[1]$ are aligned (i.e., $\theta_i(r_i[1]) = d[1]$). From these two, in Figure 5-1B we align $q[1]$ with $d[1]$ (i.e., $\psi_i(q[1]) = d[1]$). Notice that $\psi_i$ maps each node of $q$ (or $d$) to at most one node in $d$ (or $q$). This is because both $\theta_i$ and $\phi_i$ map one node of $q$ and $d$ to at most one node in $r_i$. Also notice that there may be nodes in $q$ or $d$ that are not aligned to any node by $\psi_i$.

Phase II. If all the nodes of $q$ are aligned at the end of Phase I, we simply use this alignment to compute $LB_i(q, d)$. Otherwise, there will be unaligned nodes from at least one network. One way to deal with the unaligned nodes is to consider them as insertions or deletions (indels) in the alignment of $q$ and $d$. This, however, is not desirable as each indel will lower the value of $LB_i(q, d)$. We observe that it is possible to align

Figure 5-1. Overview of the lower bound calculation. A) Mappings $\phi_i$ and $\theta_i$. B) Indirect alignment. Figure 5-1A shows the alignments of the $(q, r)$ and $(r, d)$ pairs. Figure 5-1B shows the resulting indirect alignment. Solid lines represent the edges within a network. Dashed lines represent the mapping between the nodes of two networks.

such nodes if there is a matching node in $d$ that is not aligned in Phase I. For example, in Figure 5-1A, both $q[2]$ and $d[2]$ are not aligned with any reference node. Thus, they both are unaligned. In Phase II, we align such nodes with each other. We do this by performing a breadth-first traversal on $q$ and $d$ simultaneously. We start the traversal from the root node of $q$ and the node aligned to it in $d$. Then, we visit every child of this node in $q$, also visiting the aligned node in $d$. When visiting a node, we check the shortest path between this node and its parent. If there are unaligned nodes in both $q$ and $d$, we align such nodes to each other in the order they are traversed. We mark the nodes we cannot align as insertions and deletions. We start the traversal from a pair of nodes that are aligned to each other in Phase I and that are both neighbors of unaligned nodes. Figure 5-1B shows the alignment found after Phase II using the reference alignment in Figure 5-1A. Finally, we compute $LB_i(q, d)$ as the similarity score $sim(q, d)$ resulting from the alignment $\psi_i$.

5.3.3.2 Upper bound calculation

As we discussed in Section 5.3.3.1, each reference provides an indirect alignment between $q$ and $d$. Consider the extreme scenario when one of the references is identical to $q$. In that case, the indirect alignment of $q$ and $d$ through that reference will be the
optimal alignment since $d$ is optimally aligned with each reference in our index. As the similarity between $q$ and a reference increases, we expect that the indirect alignment obtained from that reference will approach the optimal one. Our upper bound computation strategy follows from this observation. Even when none of the references is identical to $q$, each one may be similar to a subnetwork of $q$. Thus, by considering all the references at once, we may be able to reconstruct a subnetwork that is very similar to $q$. Note that this method is not guaranteed to produce an upper bound. Its success depends on how well the reference set represents $q$. As we described in Section 5.3.2, Algorithm 2 chooses the reference set that maximizes the accuracy of our upper bound function. We discuss the observed accuracy of our method later in Section 5.4.

We calculate $UB(q, d)$ in two phases. In the first phase, we find the set of node pairs from $q$ and $d$ that can be aligned using all the references. In the second phase, we find the highest scoring alignment under this limitation by relaxing the topologies of $q$ and $d$.

Phase I. For each reference $r_i \in R$, we find an indirect mapping $\psi_i$ between $q$ and $d$ as we explained in Phase I of Section 5.3.3.1. We then create a joint mapping $\Psi$ from all indirect mappings as $\Psi(d[j]) = \bigcup_{1 \le i \le k} \{\psi_i(d[j])\}$. That is, $\Psi$ maps each $d[j]$ to the set of all nodes of $q$ to which $d[j]$ is indirectly aligned by at least one reference. Notice that it is possible to have some nodes in $d$ or $q$ that are not mapped to any node by $\Psi$. This happens when no reference in our reference set aligns with them. This, however, does not mean that those nodes of $q$ and $d$ are dissimilar. In order to deal with such nodes, we expand $\Psi$ by mapping each such $d[i]$ ($q[j]$) to all the nodes in $q$ ($d$).

Figure 5-2 depicts this on an example that has two references. Figures 5-2A and 5-2B show the alignments of these references to both $q$ and $d$. Figure 5-2C shows the relation $\Psi$ based on these two references. Notice that in Figure 5-2C, $d[4]$ is mapped to two nodes from $q$, $q[4]$ and $q[2]$. This is because $d[4]$ is indirectly mapped to $q[4]$ and $q[2]$ through reference 1 and

Figure 5-2. Overview of the upper bound calculation method. Figures 5-2A and 5-2B show the alignments of $q$ and $d$ with references 1 and 2 respectively. In Figures 5-2A and 5-2B, solid lines represent the edges between the nodes of a single network, and dashed lines represent the mappings of two nodes in two different networks. Figure 5-2C shows the bipartite graph resulting from the indirect alignments using two references. In Figure 5-2C, all the lines, solid and dashed, represent an element of the relation $\Psi$. Solid lines originating from $d[6]$ show that these elements are added to $\Psi$, since no information about the node $d[6]$ was gathered through references. All the edge weights are omitted in all figures.

reference 2 respectively. Also, $d[6]$ is mapped to all the nodes in $q$ as it is not aligned to any node in any of the references.

Phase II. In this phase we align $q$ and $d$ using the mapping $\Psi$ obtained in Phase I. Since $\Psi$ is computed from many references jointly, it can map some of the nodes in $q$ to topologically distant nodes of $d$ and vice versa. We ignore the topologies of $q$ and $d$
at this phase. Instead, we consider the mapping $\Psi$ as a weighted bipartite graph. The nodes of $q$ and $d$ are the nodes of this graph. The mappings in $\Psi$ are the edges. The similarities between the mapped nodes are the weights of the edges. We compute the upper bound $UB(q, d)$ as the total weight of the maximum weighted bipartite matching (MWBM) between $q$ and $d$.

5.3.4 Time Complexity of the Methods

In this section, we present the time complexity of the methods we developed throughout Section 5.3. We will assume that all the necessary network alignments are precomputed and stored such that we can retrieve them in linear time. For simplicity, we represent the number of nodes of networks $q$ and $d$ with $|q|$ and $|d|$ respectively.

5.3.4.1 Complexity of lower bound calculation

We calculate the complexity of the lower bound calculation for each phase separately. In Phase I, we go over all the nodes in $q$ to find the composite relation. So, the time complexity of Phase I is $O(|q|)$. In Phase II, we traverse all nodes in $q$ to complete the alignment. The complexity of this phase is $O(|q|)$. Finally, we calculate the score by adding all the individual scores accumulated by each node in $q$, which is $O(|q|)$. So, the overall time complexity of the lower bound calculation is $O(|q|)$.

5.3.4.2 Complexity of approximate upper bound

The complexity of the upper bound computation can be calculated in two phases. In the first phase, the algorithm finds the indirect mappings through each of the $k$ references. The complexity of this phase is $O(k|q|)$. In the second phase, we perform MWBM between the nodes of $q$ and $d$. Using the Hungarian algorithm, we can execute this matching in $O(|d|^2 |q|)$ time. In total, the time complexity of calculating $UB(q, d)$ is $O(k|q| + |d|^2 |q|)$.

5.3.4.3 Complexity of candidate reference creation

To create the candidate reference set, we iterate the following two steps:

Step I   Create a seed using a random walk
Step II  Align the current seed with the rest of the candidate reference networks
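
As an illustration of the two steps above, the following Python sketch grows a seed by random expansion and rejects seeds that are redundant with earlier candidates. It is a minimal sketch under stated assumptions, not the implementation used in the experiments reported here: networks are plain adjacency dictionaries, the names grow_seed, create_candidates and toy_score are hypothetical, and toy_score, which simply counts shared node labels, stands in for a real network aligner such as QNet.

```python
import random


def grow_seed(adj, size, rng):
    """Step I: pick a random start node and expand it to `size` connected nodes."""
    seed = [rng.choice(sorted(adj))]
    while len(seed) < size:
        # Possible choices: neighbors of the current seed that are not in it yet.
        choices = sorted({v for u in seed for v in adj[u]} - set(seed))
        if not choices:  # the connected component has fewer than `size` nodes
            return None
        seed.append(rng.choice(choices))
    return seed


def toy_score(a, b):
    """Stand-in for a real aligner (assumption): number of shared node labels."""
    return len(set(a) & set(b))


def create_candidates(networks, size, cutoff, max_failures,
                      score=toy_score, rng_seed=0):
    """Repeat Steps I and II until `max_failures` consecutive seeds are rejected."""
    rng = random.Random(rng_seed)
    candidates, failures = [], 0
    while failures < max_failures:
        nodes = grow_seed(rng.choice(networks), size, rng)
        # Step II: a seed scoring above the cutoff against any candidate is redundant.
        if nodes is None or any(score(nodes, c) > cutoff for c in candidates):
            failures += 1
        else:
            candidates.append(nodes)
            failures = 0
    return candidates


# Toy database (made up): a single path network 0 - 1 - ... - 9.
path = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 9] for i in range(10)}
candidates = create_candidates([path], size=3, cutoff=1, max_failures=50)
```

The failure counter mirrors the stopping rule analyzed in this section: it resets whenever a seed is accepted, so the loop ends only after max_failures consecutive rejections, which plays the role of $f$.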


We analyze each of the two steps next.

Step I   We perform the random walk simply by listing our possible choices of nodes and then choosing a new node randomly from this list. We choose $T$ nodes in order to create our seed. Updating the set of possible choices after each node selection takes $O(|d|)$ time. Thus, the total time to generate a seed with $T$ nodes is $O(T|d|)$.

Step II  After creating a seed, we align this seed with all the existing $m$ candidate references in $C$. Recall that the number of nodes in each network in $C$ is $T$. In other words, the number of nodes in the seed and in the candidate networks is the same. This property of the two networks enables us to align them optimally using dynamic programming in a single iteration. Thus, the complexity of a single seed to candidate reference alignment is $O(2^{2T} T)$. There are at most $m$ networks in $C$. Hence, $m$ alignments take $O(m 2^{2T} T)$.

From the complexities derived above, we conclude that a single iteration takes $O(m 2^{2T} T + T|d|)$ time. We repeat this until the iterations fail a predefined number of times consecutively. Let us denote this number with $f$. In the worst case, we have to try $f - 1$ times to create the next reference. To create $m$ reference networks, we need to try $1 + (m - 2)(f - 1) + f$ times, which is $O(mf)$. So, the overall complexity of candidate reference creation can be expressed as $O(mf(m 2^{2T} T + T|d|))$. In our experiments, we were able to create $C$ in approximately 10 minutes, where $m = 100$, $f = 100$, and $T = 7$.

5.3.4.4 Complexity of reference selection

We consider the complexity of Algorithm 2 in two separate parts.

Part 1 (PREPROCESSING)  First, we align all $(q, d)$, $(q, c)$ and $(c, d)$ pairs. Let $t$ denote the number of training queries. The complexity of the alignment method we use defines the time complexity of these $tn + tm + mn$ alignments. We are using the QNet algorithm to align networks. So, the complexity of the preprocessing phase is $O(tn (2^{|q|} |d| |d|) \lambda_1 + tm (2^{|q|} |c| |c|) \lambda_2 + mn (2^{|c|} |d| |d|) \lambda_3)$, where $\lambda_i$ is the number of iterations per alignment. The number of iterations depends on the number of nodes in the smaller network and the predefined confidence value for the result.
Part 2 (POSTPROCESSING)  Once we compute all the pairwise alignments described above, we are ready to select references. Our algorithm iterates until it finds the reference set with the largest accuracy value. During these iterations, we first calculate the accuracy of the index using each $R - \{r_i\}$ as the reference set. To do this, we compute the upper bound for each $(q, d)$ pair. In total, there are $tnk$ such upper bound calculations, resulting in $O(tnk(k|q| + |d|^2 |q|))$
time. We also compute the accuracy using each $R \cup \{c_i\}$. This step takes $O(tmk(k|q| + |d|^2 |q|))$ time. In the worst scenario, this iteration repeats until all the networks in $C$ are used. So, the overall complexity of the postprocessing is $O(((tnk(k|q| + |d|^2 |q|)) + (tmk(k|q| + |d|^2 |q|))) m)$.

In our experiments, we used the QNet algorithm with 99% confidence to align two biological networks. Using this alignment method, we aligned 50 training queries, 100 candidate references and 297 database networks with each other. These alignments took around 25 days for seven-node query and reference networks on a single CPU. However, the iterations of the algorithm completed within a few hours after completing the alignments.

5.4 Results and Discussion

This section evaluates the performance of RINQ experimentally.

Database. We downloaded our sample database from KEGG (Ogata et al. 1999). We downloaded all the gene regulatory networks that are larger than 15 nodes. In total, our database consists of 297 gene regulatory networks. In this database, we had 21 different types of networks from 46 different organisms. Table 5-2 shows the different organisms and the number of networks which belong to each organism. Homo sapiens, Mus musculus and Rattus norvegicus have the largest number of networks among all the organisms in our database. Table 5-3 displays the different types of networks in our database. The most frequent pathway is the Phosphatidylinositol signaling system, which appears in 45 different organisms. Note that the same network of different organisms can differ in terms of the genes as well as the interactions that govern them. Figure 5-3 presents the distribution of the number of genes and the number of interactions of each network in our database. Half of the networks in our database have 30 or more genes. Also, half of the networks in our database have more than 40 interactions. About 6% of these have more than 100 nodes and interactions.


Table 5-2. The distribution of the networks in our database to different organisms. There are 46 organisms in total in our database. The organisms are listed in alphabetical order.

Organism                                             Num of networks
Aedes aegypti (yellow fever mosquito)                2
Anopheles gambiae (mosquito)                         2
Apis mellifera (honey bee)                           2
Acyrthosiphon pisum (pea aphid)                      2
Arabidopsis thaliana (thale cress)                   1
Branchiostoma floridae (Florida lancelet)            1
Brucella melitensis bv. 1 16M                        4
Bos taurus (cow)                                     19
Caenorhabditis briggsae                              2
Caenorhabditis elegans (nematode)                    3
Canis familiaris (dog)                               18
Ciona intestinalis (sea squirt)                      2
Culex quinquefasciatus (southern house mosquito)     2
Drosophila ananassae                                 2
Dictyostelium discoideum (cellular slime mold)       1
Drosophila erecta                                    2
Drosophila grimshawi                                 2
Drosophila melanogaster (fruit fly)                  3
Drosophila mojavensis                                2
Drosophila persimilis                                2
Drosophila pseudoobscura pseudoobscura               2
Danio rerio (zebrafish)                              13
Drosophila sechellia                                 2
Drosophila simulans                                  1
Drosophila yakuba                                    2
Equus caballus (horse)                               19
Gallus gallus (chicken)                              13
Hydra magnipapillata                                 2
Homo sapiens (human)                                 20
Monosiga brevicollis                                 1
Macaca mulatta (rhesus monkey)                       17
Monodelphis domestica (opossum)                      17
Mus musculus (mouse)                                 20
Nematostella vectensis (sea anemone)                 2
Nasonia vitripennis (jewel wasp)                     2
Ornithorhynchus anatinus (platypus)                  13
Pan troglodytes (chimpanzee)                         16
Rattus norvegicus (rat)                              20
Saccharomyces cerevisiae (budding yeast)             1
Strongylocentrotus purpuratus (purple sea urchin)    2
Sus scrofa (pig)                                     8
Trichoplax adhaerens                                 1
Tribolium castaneum (red flour beetle)               2
Taeniopygia guttata (zebra finch)                    13
Xenopus laevis (African clawed frog)                 8
Xenopus tropicalis (western clawed frog)             6

Figure 5-4
plots the density of the networks in the database in terms of the number of interactions as compared to the number of genes. The number of interactions is usually slightly more than the number of genes, indicating that the networks are often sparse.

Queries. We created three sets of query networks from this dataset by performing random walks on the database networks. These sets contain six-, seven- and eight-node queries respectively. Each query set consisted of 100 query networks. We then randomly split each query set into two disjoint sets, each containing 50 networks. We used one of these sets for training our index, and the other one for testing. All the query networks can be downloaded from our server at http://bioinformatics.cise.ufl.edu/palRINQ.html.

Implementation Details. We implemented RINQ in C++. We align two networks using our own implementation of the QNet method (Dost et al. 2008). In order to calculate node similarity scores, we use BLAST (Altschul et al. 1990). We use

Table 5-3. The distribution of the networks in our database to different classes of networks. Each network class corresponds to a different biological function. There are 21 classes of networks in total in our database. The networks are listed in alphabetical order.

Network name                              Num of networks
Adipocytokine signaling pathway           16
B cell receptor signaling pathway         9
Calcium signaling pathway                 11
Chemokine signaling pathway               11
ErbB signaling pathway                    18
Fc epsilon RI signaling pathway           10
Fc gamma R-mediated phagocytosis          9
GnRH signaling pathway                    3
Insulin signaling pathway                 14
Jak-STAT signaling pathway                15
MAPK signaling pathway                    18
MAPK signaling pathway - yeast            1
mTOR signaling pathway                    10
NOD-like receptor signaling pathway       12
Phosphatidylinositol signaling system     45
PPAR signaling pathway                    16
RIG-I-like receptor signaling pathway     5
T cell receptor signaling pathway         10
Toll-like receptor signaling pathway      14
VEGF signaling pathway                    14
Wnt signaling pathway                     36

the minus logarithm of the E-values as the node similarity scores. We carried out experiments with both high and low indel penalties. For high and low indel penalties, we used 1.5 and 0.5 times the largest possible node similarity score respectively.

Methods Compared Against. We compared RINQ against two of the existing filtering methods, SAGA (Tian et al. 2007) and CTree (He and Singh 2006). These methods did not have their source code publicly available. Thus, for experiments with SAGA, we used the online SAGA server. The CTree method describes two variations for filtering the internal nodes of the tree, namely maximum weighted bipartite matching and neighbor biased matching. We implemented and tested both of them. We report

Figure 5-3. The distributions of the number of genes and the number of interactions for each network in our database. The graphs do not show a relation between the number of genes and the number of interactions.

the results for CTree for only the neighbor biased matching as it performs better than the other alternative.

Evaluation Criteria. We measured the running time and accuracy as the performance criteria. We measured the performance of our method by changing the values of two variables. These variables are the number of references and the query selectivity. In our experiments, the number of references varies from 4 to 100. Query selectivity indicates the percentage of true results for each query. For instance, 4% selectivity means that 4% of the networks in the database are similar to the query according to the given similarity cutoff. We performed experiments with up to 10% selectivity. For each parameter combination, we performed experiments using 50 testing queries. We present the average of 50 queries for each experiment. We used all three sets of queries in our tests. However, due to space restrictions, we only include the tests for seven-node queries here. Our experiments on the other query and reference sets yield similar results.
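
To make the two evaluation quantities concrete, the following Python sketch computes the query selectivity and the accuracy for a single query from exact alignment scores and (lower, upper) bound pairs, using the counts $N_1$, $N_2$ and $N_3$ of Table 5-1. It is a minimal sketch under stated assumptions: the function names classify and evaluate are hypothetical, a network counts as a true result when its exact score is at least the cutoff, the TwilightZone is taken to be every network that falls in neither of the other two sets, and the scores and bounds in the example are made up.

```python
def classify(lower, upper, cutoff):
    """Place one database network into a set using only its bounds."""
    if upper < cutoff:
        return "filter"    # eliminated without further calculation
    if lower >= cutoff:
        return "result"    # returned without further calculation
    return "twilight"      # undecided: needs an actual alignment


def evaluate(true_scores, bounds, cutoff):
    """Return (selectivity, accuracy) for one query.

    true_scores[i] is the exact alignment score of database network i,
    bounds[i] is its (lower, upper) pair.
    """
    is_true = [score >= cutoff for score in true_scores]
    sets = [classify(lb, ub, cutoff) for lb, ub in bounds]
    n1 = sum(t and s == "result" for t, s in zip(is_true, sets))
    n2 = sum(t and s == "twilight" for t, s in zip(is_true, sets))
    n3 = sum(t and s == "filter" for t, s in zip(is_true, sets))  # false dismissals
    total_true = n1 + n2 + n3
    selectivity = total_true / len(true_scores)
    accuracy = (n1 + n2) / total_true if total_true else 1.0
    return selectivity, accuracy


# Made-up example with five database networks and cutoff 7.
scores = [10, 9, 8, 3, 2]
bounds = [(8, 12), (5, 11), (4, 6), (1, 5), (0, 9)]
selectivity, accuracy = evaluate(scores, bounds, cutoff=7)
```

In this toy example three of the five networks are true results, and one of them is falsely dismissed because its approximate upper bound falls below the cutoff, so the selectivity is 0.6 and the accuracy is 2/3.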


Figure 5-4. The number of genes and the number of interactions for each network in our database.

Test Environment. We performed our tests on Linux servers equipped with dual AMD Opteron dual-core processors running at 2.2 GHz, and 3 GB of main memory.

In Section 5.4.1, we evaluate the contribution of our reference selection strategy to the performance of RINQ. In Section 5.4.2, we discuss how our index structure performs for varying index and query parameters. In Section 5.4.3, we compare the performance of RINQ to SAGA and CTree. Finally, in Section 5.4.4, we present the statistical and biological significance of our results.

5.4.1 Effects of Reference Selection Strategy

Recall that we proposed to carefully select a reference set from the candidate set we generated (Section 5.3.2.1). The first question we need to answer is how each of these steps affects the performance of reference-based indexing. In order to do this, we compared our method to two alternative variants. The first one picks references randomly from the database using a random walk. The second one chooses references randomly from the candidate reference set. We conducted two sets of experiments

Figure 5-5. Running time versus accuracy of our method for different reference selection strategies. Experiments are repeated for different numbers of references. Average query processing time is presented as running time. Accuracy is calculated over all test queries. Lower running time and higher accuracy indicate better performance. The running time of exhaustive search is over two days (not shown in figure).

using 4% and 8% query selectivity. In each experiment, we varied the number of references from four to 100 and measured the average accuracy and the running time. Figure 5-5 shows the results for three reference selection strategies. We also measured the running time of exhaustively searching the database without using our index. The running time of exhaustive search was over two days. The experiments demonstrate that reference-based indexing is significantly faster than exhaustive search. It improves the running time by a factor of five over exhaustive search. Creating candidate references by following the three rules (see Section 5.3.2.1) alone improves the running time by up to 5%. Finally, carefully selecting references using Algorithm 2 results in up to 10% additional improvement. In summary, the results suggest that the
reference selection strategy of RINQ indeed helps in improving the performance of our method. They also demonstrate that RINQ makes similarity searches in biological network databases practical.

5.4.2 Effects of Index and Query Parameters

Earlier in Section 5.4, we explained that two parameters can affect the performance of RINQ, namely the number of references and the query selectivity. In our next set of experiments, we evaluate the effects of these parameters on the performance of RINQ. In order to do this, we designed two sets of experiments. In our first set of experiments, we fixed query selectivity to 4% and 8%, and performed database queries with varying numbers of references for each selectivity value. We carried out the same experiment with RINQ using alignments with both high and low indel penalties. For the second set of experiments, we fixed the number of references and changed the query selectivity from 3% to 10%. We present the running time and accuracy with respect to the varying parameters.

In addition to RINQ, we also report the running time of a hypothetical method named Oracle. We assume that Oracle already knows the networks in the result set for any query and aligns the query with only those networks. Thus, the running time of Oracle is a lower bound for any database search method. We include this hypothetical method to observe how much room for improvement RINQ leaves in practice.

Effects of the number of references: Figure 5-6 shows the results for varying numbers of references. The results suggest that RINQ achieves up to 80% improvement in running time over exhaustive search. With higher indel penalties, the importance of matching topologies increases. Our experiments with RINQ show that as the alignment favors topology, the accuracy of our method increases. This suggests that the references can preserve the alternative topologies in the database accurately. Another observation from this experiment is that, as the number of references increases, the running
time of our method increases slowly while its accuracy increases rapidly to almost 100%. This is a desirable property as it suggests that RINQ needs to maintain only a small number of references. Thus, it uses a small amount of storage and performs a small number of query-reference network alignments. From our results, we suggest using around 30 references for our database.

Effects of query selectivity: Following from our previous experiments, we fix the number of references to 32 and vary the query selectivity. Figure 5-7 plots the results. As the selectivity increases, the size of the result set, and thus the number of networks we need to align with the query, increases. As a result, the running time of RINQ grows. We observe that the running time of our method does not increase linearly. This is expected: as the size of the actual result set grows, it gets harder to place database networks into the filter set. The small gap between Oracle and RINQ indicates that our method is very successful in filtering unpromising database networks. It also shows that our method leaves little room for improvement, particularly for small selectivity values. For instance, when the selectivity is 3%, our method's running time is almost that of the smallest number of alignments needed for that selectivity.

For all the selectivity values, RINQ has very high accuracy. We observe that as we increase the query selectivity, RINQ's accuracy increases as well. This is mainly because, for small selectivity values, even a few false dismissals can reduce the accuracy. The results demonstrate that our method reaches very high accuracy values quickly.

5.4.3 Comparison of RINQ with Existing Methods

An obvious question that needs to be answered is how well RINQ performs as compared to existing state-of-the-art methods. In order to evaluate this, we compared RINQ with two existing methods, namely SAGA (Tian et al. 2007) and CTree (He and Singh 2006). In order to have a fair comparison, similar to RINQ, we used SAGA and
CTree to create a candidate set of results. We used our QNet implementation to find the actual results from the candidate set for each method. SAGA requires the user to provide the list of genes that can be aligned with each gene in the query network. The user can also specify the minimum percentage of query nodes that need to be aligned. We varied these two parameters to get the smallest candidate set for SAGA for different accuracies. CTree uses a neighbor biased matching algorithm to filter database networks. We varied the constant that defines the amount of neighbor bias to get the best running time of CTree for different accuracy values.

We ran RINQ, SAGA and CTree using the same query and database networks. We queried each method with the same set of query selectivity values. We recorded the running time and accuracy of each method. For all three methods, the amount of time it takes to filter the database was much smaller than the post-processing time, where we compare the query network with the candidate networks. In our experiments, RINQ requires only eight seconds for the filtering phase using 32 references, whereas the other methods require from a few seconds up to minutes. We ignore the amount of time the index structure takes to filter database networks, for this is a negligible fraction (0.03% to 0.1%) of the time it takes to process the candidate set.

We report the average running time and accuracy values for SAGA, CTree and RINQ over all queries. Figure 5-8 plots our results. In our experiments, RINQ performed much better in terms of both accuracy and speed than both CTree and SAGA. RINQ is up to 3 times faster than CTree for the same accuracy values. We could not achieve comparable accuracy with SAGA. We believe that this is probably because SAGA is well suited for exact matches and its accuracy drops quickly as the difference between the genes in the query and the database network grows. In order to improve the accuracy of SAGA further, we grew the set of genes that each gene can align to (i.e., this corresponds to reducing the similarity cutoff for gene pairs in the SAGA queries). However, SAGA fails to return any results in that case. Given that (1) its running time (when its accuracy is
75%) is already much larger than that of both RINQ and CTree (when their accuracies are 85 and 89% respectively), and (2) its running time will only increase with increased accuracy, we concluded that SAGA could not compete with the two for high accuracy values. In conclusion, we observed that RINQ is a significant improvement over existing methods.

5.4.4 Significance of Results

In order to validate our method, we discuss the significance of our results in this section. We first evaluate the statistical significance of our results. We then discuss some of the biologically interesting observations from the statistically significant results.

We compute the statistical significance of each query as its p-value. We do this as follows for a given query network and a similarity cutoff. We first run RINQ to find the set of database networks that have an alignment with the query with score greater than the cutoff. We then compute the probability that a random query network has at least as many matching database networks in the same dataset with alignment score at least the same similarity cutoff. We report this probability as the p-value.

Table 5-4 presents the p-values and the result set sizes of the top five queries when the similarity cutoff is set to 90, 80 and 70% of the maximum possible similarity score. The maximum score is the score that is achieved when the query network is aligned with itself. We observe that for all three similarity cutoff values the top queries are statistically significant. This implies that the database contains motifs that have at least as many genes as the query networks. The top queries are a part of these motifs. The results are also encouraging as they demonstrate that RINQ is capable of finding such frequent patterns quickly.

When the similarity cutoff decreases, the number of matching database networks grows. This shows that motifs are not exact duplicates of the same networks, but approximate alignments. This is an important result as it shows the necessity of

Table5-4. Thep-valuesandtheresultsetsizesoftopvequerieswhensimilaritycutoffissetto90,80and70%ofthemaximumsimilarityscorethatcanbeachievedwithanexactmatch.Thevaluesintheparenthesesaretheresultsetsize(i.e.,thenumberofmatches)reportedbyRINQforthatspecicquery. Similarity[%]Rank908070 11.4E-6(21)1.0E-13(34)<1E-15(48)25.7E-4(17)1.3E-7(27)<1E-15(44)33.4E-2(13)5.4E-4(21)1.9E-7(32)43.4E-2(13)4.4E-2(16)3.0E-6(30)57.3E-2(12)4.4E-2(16)3.2E-4(26) approximatealignmentalgorithms.Decreasingp-valuesalsoindicatethatourresultsaremoresignicant.ThissupportsourresultsinFigure 5-7B thatforlowersimilaritycutoffs(i.e.,higherqueryselectivity),RINQhashigheraccuracy.Higheraccuracyandlowerp-valuesindicatethatRINQperformsbetterwithlowersimilaritycutoffs.OnecanarguethatthelargenumberofmatchesfoundbyRINQcanbeattributedtothefactthattherearemultiplepathwaysthatbelongtothesamepathwayclassfromdifferentorganisms(SeeTable 5-3 ).Nextwefocusedontheindividualqueriesthathavethebestp-valueandexploredthecharacteristicsofthematchingdatabasesubnetworksaswellasthebiologicalsignicanceofthesematches.Figure 5-9 depictstwoofourtopqueries.ThequeryinFigure 5-9A isasubnetworkoftheErbBsignalingnetworkofRattusnorvegicus.Thisqueryisrankedastherstfor90and80%similarityandthesecondforthe70%similarity.ErbBsignalingnetworkiscloselyrelatedtocancer( HynesandLane 2005 ).MutationsanddysregulationsofErbBnetworkproteinsarefrequentlyobservedincancercells.ErbBsignalingnetworkconsistsofErbBfamilyofreceptorproteinsandanumberofintracellularsignalingpathways( YardenandSliwkowski 2001 ).ErbBfamilyofreceptorsarecellmembraneproteins.Extracellulargrowthfactorsbindtotheseproteins.Inresponsetothisbinding,ErbBproteinscantriggermanydifferentcellsignalingnetworks.OneofthemostimportantintracellulartargetsofErbBreceptorsisMAPKsignalingnetwork.Ourrst 110

PAGE 111

queryisapartofErbBsignalingpathwaythatisresponsiblefortriggeringtheMAPKsignalingpathway.RINQsuccessfullyidentiestherelationshipbetweenourqueryandthesimilarpatternsintheErbBandMAPKsignalingnetworksofotherorganisms.Whenwequeryourdatabasefor90%similarity,RINQsuccessfullyreturnsallthe21trueresults,withonlythreefalsepositives.Amongthese21results,ninearefromtheErbBsignalingnetworkofvariousorganismandtherestfromtheMAPKsignalingnetwork.Forquerieswith80%and70%similarity,RINQalsoidentiesthedatabasenetworkswith100%accuracy.For80%similarity,RINQndsmatchingpatternsintheInsulinsignalingnetworksofnineorganisms.Indeed,theinsulinsignalingnetworkcontainstheRas-Raf-MEK-ERK-Elk1pathwaythatisresponsiblefromgrowthanddifferentiationandtheRas-Raf-MEK-ERK-MNKpathwaythatcontributestoproteinsynthesis.Aswereducethesimilarityto70%,RINQreturnsadditionalalignmentsfromthechemokinesignalingpathway.Therearetotallyveorganisms,namelyMusmusculus,Homosapiens,Canisfamiliaris,BostaurusandPantroglodytes,whosechemokinesignalingpathwaysalignwiththisquery.Theseareallphylogeneticallyclosetothequeryorganism.FurtherexplorationofthesepathwaysshowthattheyindeedcontaintheSos-Ras-Raf-MEK1-ERK1/2pathwaythatregulatesasetofcriticalfunctionssuchascytokineproduction,cellulargrowthanddifferentiation,cellsurvival,apoptosisandmigration.Someorganisms,suchasOrnithorhynchusanatinus,Susscrofa,GallusgallusandXenopuslaevishaveoneormoreproteindeletionsintheirErbBsignalingnetworks.Amongthesethersttwobelongtothesameclassasthequeryorganism(i.e.,mammalian)whilethelasttwobelongtodifferentclasses.Nevertheless,RINQsuccessfullyidentiessuchnetworksandprovidestightestimatesfortheiractualnetworksimilarity.ThequeryinFigure 5-9B istakenfromtheVEGFsignalingpathway.AlthoughthereareVEGFsignalingnetworksofonly14organismsinourdatabase,thisquery 111

PAGE 112

returnstotally48alignmentsat70%similarity.Thesealignmentsarefoundineighttypesofnetworksof14differentorganisms.ThisdemonstratesthattheVEGFsignalingnetworkinteractswithmanyothernetworks.Italsosuggeststhatthisqueryisapartofanimportantmotifasitappearsinnumerousorganisms.RINQcouldidentifythismotifaswellamongothers. 5.5ConclusionInthischapter,weconsideredsimilarityqueriesinbiologicalnetworkdatabases.Sincepairwisecomparisonofbiologicalnetworksisatimeconsumingoperation,exhaustivedatabasesearchesareimpracticalforsuchdatabases.Weintroducedreferencebasedindexingtospeedupsuchqueries.Wedevelopedmethodsforcalculatingupperandlowerboundstothealignmentscorequickly.Usingthesebounds,weclassieddatabasenetworksassimilarornotwithoutaligningthemwiththequerynetwork.Wealsodevelopedanintelligentmethodtochoosereferenceswhichmaximizesthechanceofreferencestoprunedatabasenetworks.Ourexperimentsshowedthat(1)ourmethodreducedtherunningtimeofasinglequeryonadatabaseofaround300networksfromovertwodaystoonlyeighthours;(2)ourmethodoutperformedthestateoftheartmethodClosureTreeandSAGAbyafactorofthreeormore;(3)ourmethodsuccessfullyidentiedstatisticallyandbiologicallysignicantrelationshipsacrossnetworksandorganisms.Tothebestofourknowledgethisistherstattemptatdevelopingreferencebasedindexingofbiologicalnetworks.Therefore,applicationofthismethodtoothertypesofbiologicalnetworksisanimportantfutureresearchdirectionforus.Thisapproachenablesindexinglargebiologicalnetworks.Therefore,webelievethatitisanimportantsteptowardslargescaleanalysisofbiologicalnetworkdatabasesandhasgreatpotentialtoassistbiologicalscienceswhichneedsuchlargescalecomparison. 112
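The classification step summarized above can be illustrated with a minimal sketch. It assumes that a lower and an upper bound on the alignment score are already available for every database network; how RINQ derives these bounds from the references is not reproduced here. The function name, the network names, the example scores and the choice of "at least the cutoff" as the acceptance rule are all hypothetical and serve only as illustration.

```python
from typing import Dict, List, Tuple


def classify_by_bounds(
    bounds: Dict[str, Tuple[float, float]],
    cutoff: float,
) -> Tuple[List[str], List[str], List[str]]:
    """Split database networks into (similar, dissimilar, undecided).

    bounds maps a network name to (lower, upper) bounds on its alignment
    score with the query. These bounds are assumed inputs, not computed here.
    """
    similar: List[str] = []
    dissimilar: List[str] = []
    undecided: List[str] = []
    for name, (lower, upper) in bounds.items():
        if lower > upper:
            raise ValueError(f"inconsistent bounds for {name}")
        if lower >= cutoff:
            # Even the most pessimistic score reaches the cutoff:
            # accept without running the costly alignment.
            similar.append(name)
        elif upper < cutoff:
            # Even the most optimistic score misses the cutoff: prune.
            dissimilar.append(name)
        else:
            # The bounds straddle the cutoff: a full alignment is still needed.
            undecided.append(name)
    return similar, dissimilar, undecided


# Hypothetical bounds, expressed as fractions of the maximum score.
example_bounds = {
    "net_a": (0.92, 0.97),
    "net_b": (0.10, 0.40),
    "net_c": (0.70, 0.95),
}
similar, dissimilar, undecided = classify_by_bounds(example_bounds, cutoff=0.90)
print(similar, dissimilar, undecided)
```

Only the networks in the undecided group need to be aligned with the query, so tighter bounds translate directly into fewer costly alignments.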

Figure 5-6. Impact of the number of references. Top: The average running time of RINQ and Oracle. Bottom: The accuracy of RINQ. The query selectivity is 8%. The graphs include two different tests with RINQ. RINQ:H refers to RINQ experiments which use alignments with high indel penalties. RINQ:L refers to experiments using alignments with low indel penalties. The running time of exhaustive search is over two days (not shown in the figure).

Figure 5-7. Running time and accuracy of RINQ for different query selectivity values. These results are calculated using 32 references.

Figure 5-8. Running time versus accuracy of SAGA, CTree and RINQ. Experiments are repeated for the same queries with different selectivities. Average query processing time is presented as running time. Accuracy is calculated over all test queries. Lower running time and higher accuracy indicate better performance.

Figure 5-9. Sample queries from our experiments. (A) A subnetwork of the ErbB signaling pathway of Rattus norvegicus (rat). (B) A subnetwork of the VEGF signaling pathway of Mus musculus (mouse).

CHAPTER 6
CONCLUSION

In this work, we first considered the problem of finding a subnetwork in a given biological network (i.e., target network) that is the most similar to a given small query network in Chapter 2. Our aim was to find the optimal solution (i.e., the subnetwork with the largest alignment score) with a provable confidence bound. There is no known polynomial time solution to this problem in the literature. Alon et al. (Alon et al. 1995) have developed a state-of-the-art coloring method that reduces the cost of this problem, but it needs to be executed for many iterations until a user-supplied confidence is reached. Here we developed a novel coloring method, named k-hop coloring, that achieves a provable confidence value in a small number of iterations without sacrificing optimality. Our method considers the color assignments already made in the neighborhood of each target network node while assigning a color to a node. This way it preemptively avoids many color assignments that are guaranteed to fail to produce the optimal alignment. We also developed a filtering method that eliminates the nodes which cannot be aligned without reducing the alignment score after each coloring instance. We demonstrated both theoretically and experimentally that our coloring method outperforms that of Alon et al., which is also used by a number of network alignment methods including QPath and QNet, by a factor of three without reducing the confidence in the optimality of the result.

Many applications so far have benefited from the idea of coloring biological networks. Aligning a query network to a subnetwork of a target network is only one of them. We believe that our novel coloring methodology will improve the utilization of coloring and randomization in large-scale biological network analysis in practical applications, and thus it will be of great value to biologists.

In Chapter 3, we took a novel approach to the problem of discovering underlying network hierarchy. We first transformed our problem to a mixed integer programming problem. Then, we solved this problem using existing optimizers. We named this method HIerarchical DEcomposition of gene regulatory Networks (HIDEN). However, due to the growing size of the mixed integer programming problem with increasing number of genes, we encountered scalability issues. We proposed a divide and conquer approach to tackle such problems. Later, we experimentally showed that our algorithm outperformed existing solutions in terms of minimizing conflicting edges in the hierarchy. We also evaluated our method using biological and statistical tools. Then, we discussed the relation between the hierarchy of a gene in a TRN and its location in the cell, essentiality and function, based on our experimental results and biological evidence.

In Chapter 4, we explored steady states of metabolic networks and their functional properties. We did this by first characterizing the impact of each reaction in a metabolic network in terms of the steady states of the network. Then, using their impacts, we grouped every reaction in the network. We checked each gene that encodes an enzyme catalyzing each reaction in the network to transfer the gene ontology (GO) annotations to the reactions. For each group we formed, we calculated the enrichment of each GO term. We showed that for each reaction in a group, enrichment values correlate highly with the GO terms of each reaction. We also showed that our method can be used in GO term prediction for less known reactions.

In Chapter 5, we considered similarity queries in biological network databases. Since pairwise comparison of biological networks is a time-consuming operation, exhaustive database searches are impractical for such databases. We introduced reference-based indexing to speed up such queries. We developed methods for calculating upper and lower bounds to the alignment score quickly. Using these bounds, we classified database networks as similar or not without aligning them with the query network. We also developed an intelligent method to choose references which maximizes the chance of references to prune database networks. Our experiments showed that (1) our method reduced the running time of a single query on a database of around 300 networks from over two days to only eight hours; (2) our method outperformed the state-of-the-art methods Closure-Tree and SAGA by a factor of three or more; (3) our method successfully identified statistically and biologically significant relationships across networks and organisms.

To the best of our knowledge this is the first attempt at developing reference-based indexing of biological networks. Therefore, application of this method to other types of biological networks is an important future research direction for us. This approach enables indexing large biological networks. Therefore, we believe that it is an important step towards large-scale analysis of biological network databases and has great potential to assist biological sciences which need such large-scale comparison.
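To give a rough sense of why the color-coding approach recalled in the Chapter 2 summary above needs many iterations, the sketch below works through the standard analysis of plain color coding. It assumes that each node independently receives one of k colors uniformly at random, so a fixed set of k nodes is colored with k distinct colors with probability k!/k^k. This covers only the baseline random coloring, not the k-hop coloring developed in this work, and the function names are hypothetical.

```python
import math


def colorful_probability(k: int) -> float:
    """Probability that k fixed nodes receive k distinct colors
    when every node is colored uniformly at random with k colors."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    return math.factorial(k) / k**k


def iterations_for_confidence(k: int, confidence: float) -> int:
    """Smallest number of independent random colorings after which the
    optimal k-node subnetwork has been colorful at least once with
    probability >= confidence (assumption: plain uniform coloring)."""
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must lie strictly between 0 and 1")
    p = colorful_probability(k)
    if p >= 1.0:
        return 1  # k == 1: a single node is always colorful
    # Solve (1 - p)^n <= 1 - confidence for the smallest integer n.
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))


for k in (2, 5, 8):
    print(k, iterations_for_confidence(k, 0.99))
```

Because k!/k^k shrinks roughly like e^(-k), the required number of iterations grows quickly with the query size, which is the cost that a more informed coloring aims to reduce.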

REFERENCES

Alon, Noga, Yuster, Raphael, and Zwick, Uri. Color-coding. J. ACM 42 (1995): 844.

Altschul, Stephen F., Gish, Warren, Miller, Webb, Myers, Eugene W., and Lipman, David J. Basic local alignment search tool. Journal of Molecular Biology 215 (1990).3: 403–410.

Ashburner, Michael, Ball, Catherine A., Blake, Judith A., Botstein, David, Butler, Heather, Cherry, J. Michael, Davis, Allan P., Dolinski, Kara, Dwight, Selina S., Eppig, Janan T., Harris, Midori A., Hill, David P., Issel-Tarver, Laurie, Kasarskis, Andrew, Lewis, Suzanna, Matese, John C., Richardson, Joel E., Ringwald, Martin, and Rubin, Gerald M. Gene Ontology: tool for the unification of biology. Nature Genetics 25 (2000).1: 25.

Ay, Ferhat and Kahveci, Tamer. Functional similarities of reaction sets in metabolic pathways. Proceedings of the First ACM International Conference on Bioinformatics and Computational Biology. BCB '10. New York, NY, USA: ACM, 2010a, 102.

Ay, Ferhat and Kahveci, Tamer. SubMAP: Aligning Metabolic Pathways with Subnetwork Mappings. Proceedings of Annual International Conference on Research in Computational Molecular Biology (2010b): 15.

Ay, Ferhat, Kahveci, Tamer, and de Crecy-Lagard, Valerie. A Fast and Accurate Algorithm for Comparative Analysis of metabolic Pathways. Journal Bioinformatics and Computational Biology 7 (2009).3: 389.

Balaji, S., Babu, M Madan, Iyer, Lakshminarayan M., Luscombe, Nicholas M., and Aravind, L. Comprehensive analysis of combinatorial regulation using the transcriptional regulatory network of yeast. J Mol Biol 360 (2006a).1: 213.

Balaji, S., Iyer, Lakshminarayan M., Aravind, L., and Babu, M Madan. Uncovering a hidden distributed architecture behind scale-free transcriptional regulatory networks. J Mol Biol 360 (2006b).1: 204.

Bandyopadhyay, N., Somaiya, M., Kahveci, T., and Ranka, S. Modeling Perturbations Using Gene Networks. Proceedings of Life Sciences Society Computational Systems Bioinformatics (CSB). vol. 9. 2010, 26.

Bandyopadhyay, N., Somaiya, M., Ranka, S., and Kahveci, T. Identifying differentially regulated genes. Computational Advances in Bio and Medical Sciences (ICCABS), 2011 IEEE 1st International Conference on. 2011, 19.

Banks, Eric, Nabieva, Elena, Peterson, Ryan, and Singh, Mona. NetGrep: fast network schema searches in interactomes. Genome Biology 9 (2008).9: R138.

Bell, Steven L. and Palsson, Bernhard. Expa: a program for calculating extreme pathways in biochemical reaction networks. Bioinformatics 21 (2005).8: 1739.

Bellman, Richard Ernest. Dynamic Programming. Dover Publications, Incorporated, 2003.

Bhardwaj, Nitin, Kim, Philip M., and Gerstein, Mark B. Rewiring of transcriptional regulatory networks: hierarchy, rather than connectivity, better reflects the importance of regulators. Sci Signal 3 (2010a).146: ra79.

Bhardwaj, Nitin, Yan, Koon-Kiu, and Gerstein, Mark B. Analysis of diverse regulatory networks in a hierarchical context shows consistent tendencies for collaboration in the middle levels. Proc Natl Acad Sci USA 107 (2010b).15: 6841.

Björklund, Andreas, Husfeldt, Thore, and Koivisto, Mikko. Set Partitioning via Inclusion-Exclusion. SIAM Journal on Computing 39 (2009).2: 546.

Bruckner, Sharon, Huffner, Falk, Karp, Richard M., Shamir, Ron, and Sharan, Roded. Torque: topology-free querying of protein interaction networks. Nucleic Acids Research 37 (2009).suppl 2: W106–W108.

Chindelevitch, Leonid, Liao, Chung-Shou, and Berger, Bonnie. Local optimization for global alignment of protein interaction networks. Pac Symp Biocomput (2010): 123.

Clemente, Jose Carlos, Satou, Kenji, and Valiente, Gabriel. Reconstruction of phylogenetic relationships from metabolic pathways based on the enzyme hierarchy and the gene ontology. Genome Informatics 16 (2005).2: 45.

Clemente, Jose Carlos, Satou, Kenji, and Valiente, Gabriel. Finding conserved and non-conserved reactions using a metabolic pathway alignment algorithm. Genome Informatics 17 (2006).2: 46.

Conradi, Carsten, Flockerzi, Dietrich, Raisch, Jorg, and Stelling, Jorg. Subnetwork analysis reveals dynamic features of complex (bio)chemical networks. Proc Natl Acad Sci USA 104 (2007).49: 19175.

Cook, Stephen A. The complexity of theorem-proving procedures. Proceedings of the third annual ACM symposium on Theory of computing. STOC '71. 1971, 151.

Dang, Chi V., Resar, Linda M. S., Emison, Eileen, Kim, Sunkyu, Li, Qing, Prescott, Julia E., Wonsey, Diane, and Zeller, Karen. Function of the c-Myc Oncogenic Transcription Factor. Experimental Cell Research 253 (1999).1: 63–77. URL http://www.sciencedirect.com/science/article/pii/S0014482799946864

Dost, Banu, Shlomi, Tomer, Gupta, Nitin, Ruppin, Eytan, Bafna, Vineet, and Sharan, Roded. QNet: a tool for querying protein interaction networks. Journal of Computational Biology 15 (2008).7: 913.

Ferro, Alfredo, Giugno, Rosalba, Pigola, Giuseppe, Pulvirenti, Alfredo, Skripin, Dmitry, Bader, Gary D., and Shasha, Dennis. NetMatch: a Cytoscape plugin for searching biological networks. Bioinformatics 23 (2007).7: 910.

Francke, Christof, Siezen, Roland J, and Teusink, Bas. Reconstructing the metabolic network of a bacterium from its genome. Trends in Microbiology 13 (2005).11: 550.

Galko, Michael J. and Krasnow, Mark A. Cellular and genetic analysis of wound healing in Drosophila larvae. PLoS Biol 2 (2004).8: E239.

Gama-Castro, Socorro, Salgado, Heladia, et al. RegulonDB version 7.0: transcriptional regulation of Escherichia coli K-12 integrated within genetic sensory response units (Gensor Units). Nucleic Acids Res 39 (2011).Database issue: D98.

Garey, Michael R. and Johnson, David S. Computers and Intractability; A Guide to the Theory of NP-Completeness. New York, NY, USA: W. H. Freeman & Co., 1990.

Garfinkel, Robert S and Nemhauser, George L. Integer programming. New York: John Wiley & Sons, 1972.

Giot, Loic, Bader, Joel S., et al. A Protein Interaction Map of Drosophila melanogaster. Science 302 (2003).5651: 1727.

Giugno, Rosalba and Shasha, Dennis. GraphGrep: A fast and universal method for querying graphs. Proceedings of 16th International Pattern Recognition Conference. vol. 2. 2002, 112.

Gulsoy, G. and Kahveci, T. Inferring gene functions from metabolic reactions. Genomic Signal Processing and Statistics (GENSIPS), 2012 IEEE International Workshop on. 2012.

Gulsoy, Gunhan, Bandyopadhyay, Nirmalya, and Kahveci, Tamer. HIDEN: Hierarchical decomposition of regulatory networks. BMC Bioinformatics 13 (2012a).1: 250.

Gulsoy, Gunhan, Gandhi, Bhavik, and Kahveci, Tamer. Topology aware coloring of gene regulatory networks. Proceedings of the 2nd ACM Conference on Bioinformatics, Computational Biology and Biomedicine. BCB '11. 2011, 435.

Gulsoy, Gunhan, Gandhi, Bhavik, and Kahveci, Tamer. Topac: Alignment of gene Regulatory Networks Using Topology-Aware Coloring. J. Bioinformatics and Computational Biology 10 (2012b).1.

Gulsoy, Gunhan and Kahveci, Tamer. RINQ: Reference-based Indexing for Network Queries. Bioinformatics 27 (2011).13: i149–i158.

Vousden, Karen H. p53: Death Star. Cell 103 (2000).5: 691–694.

Harbison, Christopher T., Gordon, D Benjamin, et al. Transcriptional regulatory code of a eukaryotic genome. Nature 431 (2004).7004: 99.

Hartsperger, Mara L., Strache, Robert, and Stumpflen, Volker. HiNO: an approach for inferring hierarchical organization from regulatory networks. PLoS One 5 (2010).11: e13698.

He, Huahai and Singh, Ambuj K. Closure-Tree: An index structure for graph queries. International Conference on Data Engineering 0 (2006): 38.

Horak, Christine E., Luscombe, Nicholas M., Qian, Jiang, Bertone, Paul, Piccirrillo, Stacy, Gerstein, Mark, and Snyder, Michael. Complex transcriptional circuitry at the G1/S transition in Saccharomyces cerevisiae. Genes Dev 16 (2002).23: 3017.

Huffner, Falk, Wernicke, Sebastian, and Zichner, Thomas. Algorithm Engineering for Color-Coding with Applications to Signaling Pathway Detection. Algorithmica 52 (2008).2: 114.

Hynes, Nancy E and Lane, Heidi A. ERBB receptors and cancer: the complexity of targeted inhibitors. Nat Rev Cancer 5 (2005).5: 341.

Jeong, H., Mason, S. P., Barabasi, A. L., and Oltvai, Z. N. Lethality and centrality in protein networks. Nature 411 (2001).6833: 41.

Jiang, C., Xuan, Z., Zhao, F., and Zhang, M. Q. TRED: a transcriptional regulatory element database, new entries and other development. Nucleic Acids Res 35 (2007).Database issue: D137–D140.

Jothi, Raja, Balaji, S., Wuster, Arthur, Grochow, Joshua A., Gsponer, Jorg, Przytycka, Teresa M., Aravind, L., and Babu, M Madan. Genomic analysis reveals a tight link between transcription factor dynamics and regulatory network architecture. Mol Syst Biol 5 (2009): 294.

Kaleta, Christoph, de Figueiredo, Luís Filipe, and Schuster, Stefan. Can the whole be less than the sum of its parts? Pathway analysis in genome-scale metabolic networks using elementary flux patterns. Genome Res 19 (2009).10: 1872.

Kauer, M., Ban, J., Kofler, R., Walker, B., Davis, S., Meltzer, P., and Kovar, H. A Molecular Function Map of Ewing's Sarcoma. PLoS One 4 (2009).4.

Kelley, Brian P, Yuan, Bingbing, Lewitter, Fran, Sharan, Roded, Stockwell, Brent R, and Ideker, Trey. PathBLAST: a tool for alignment of protein interaction networks. Nucleic Acids Research 32 (2004).Web Server issue: W83–W88.

Kuhn, H. W. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly 2 (1955).1-2: 83.

Larhlimi, Abdelhalim and Bockmayr, Alexander. A new constraint-based description of the steady-state flux cone of metabolic networks. Discrete Applied Mathematics 157 (2009).10: 2257.

Lawler, Eugene L. A note on the complexity of the chromatic number problem. Information Processing Letters 5 (1976).3: 66–67.

Lee, Tong Ihn, Rinaldi, Nicola J., et al. Transcriptional regulatory networks in Saccharomyces cerevisiae. Science 298 (2002).5594: 799.

Levine, Michael and Davidson, Eric H. Gene regulatory networks for development. Proceedings of the National Academy of Sciences of the United States of America 102 (2005).14: 4936.

Liao, C. S., Lu, K., Baym, M., Singh, R., and Berger, B. IsoRankN: spectral methods for global alignment of multiple protein networks. Bioinformatics 25 (2009).12: 253.

Luscombe, Nicholas M., Babu, M Madan, Yu, Haiyuan, Snyder, Michael, Teichmann, Sarah A., and Gerstein, Mark. Genomic analysis of regulatory network dynamics reveals large topological changes. Nature 431 (2004).7006: 308.

Ma, Hong-Wu and Zeng, An-Ping. The connectivity structure, giant strong component and centrality of metabolic networks. Bioinformatics 19 (2003).11: 1423.

Marashi, Sayed-Amir, David, Laszlo, and Bockmayr, Alexander. Analysis of Metabolic Subnetworks by Flux Cone Projection. Algorithms Mol Biol 7 (2012).1: 17.

Mcsweeney, Patrick J., Ashkenazi, Maital, and States, David. Random Network Plugin. 2008. URL https://sites.google.com/site/randomnetworkplugin/Home

Mongiovi, Misael, Natale, Raffaele Di, Giugno, Rosalba, Pulvirenti, Alfredo, Ferro, Alfredo, and Sharan, Roded. SIGMA: a set-cover-based inexact graph matching algorithm. Journal of Bioinformatics and Computational Biology 8 (2010).2: 199.

Ogata, H., Goto, S., Sato, K., Fujibuchi, W., Bono, H., and Kanehisa, M. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res 27 (1999).1: 29.

Owen, Leah A., Kowalewski, Ashley A., and Lessnick, Stephen L. EWS/FLI Mediates Transcriptional Repression via NKX2.2 during Oncogenic Transformation in Ewing's Sarcoma. PLoS ONE 3 (2008).4: e1965.

Pinter, Ron Y, Rokhlenko, Oleg, Yeger-Lotem, Esti, and Ziv-Ukelson, Michal. Alignment of metabolic pathways. Bioinformatics 21 (2005).16: 3401.

Cheng, Q., Harrison, R., and Zelikovsky, A. MetNetAligner: a web service tool for metabolic network alignments. Bioinformatics 25 (2009).15: 1989.

Riggi, Nicolo and Stamenkovic, Ivan. The Biology of Ewing sarcoma. Cancer Letters 254 (2007).1: 110.

Latchman, David S. Transcription factors: An overview. The International Journal of Biochemistry and Cell Biology 29 (1997).12: 1305–1312.

Samal, Areejit, Singh, Shalini, Giri, Varun, Krishna, Sandeep, Raghuram, Nandula, and Jain, Sanjay. Low degree metabolites explain essential reactions and enhance modularity in biological networks. BMC Bioinformatics 7 (2006): 118.

Schellenberger, Jan, Park, Junyoung O., Conrad, Tom M., and Palsson, Bernhard. BiGG: a Biochemical Genetic and Genomic knowledgebase of large scale metabolic reconstructions. BMC Bioinformatics 11 (2010): 213.

Schuler, M. and Green, D. R. Mechanisms of p53-dependent apoptosis. Biochem Soc Trans 29 (2001).Pt 6: 684.

Schuster, S., Dandekar, T., and Fell, D. A. Detection of elementary flux modes in biochemical networks: a promising tool for pathway analysis and metabolic engineering. Trends Biotechnol 17 (1999).2: 53.

Shlomi, Tomer, Segal, Daniel, Ruppin, Eytan, and Sharan, Roded. QPath: a method for querying pathways in a protein-protein interaction network. BMC Bioinformatics 7 (2006): 199.

Sibson, R. SLINK: An optimally efficient algorithm for the single-link cluster method. 16 (1973).1: 30.

Singh, Rohit, Xu, Jinbo, and Berger, Bonnie. Pairwise Global Alignment of Protein Interaction Networks by Matching Neighborhood Topology. RECOMB. 2007, 16.

Sridhar, Padmavati, Kahveci, Tamer, and Ranka, Sanjay. An iterative algorithm for metabolic network-based drug target identification. Pacific Symposium on Biocomputing (2007): 88.

Stelling, Jorg, Klamt, Steffen, Bettenbrock, Katja, Schuster, Stefan, and Gilles, Ernst Dieter. Metabolic network structure determines key aspects of functionality and regulation. Nature 420 (2002).6912: 190.

Sun, Ning, Carroll, Raymond J., and Zhao, Hongyu. Bayesian error analysis model for reconstructing transcriptional regulatory networks. Proc Natl Acad Sci USA 103 (2006).21: 7988.

Svetlov, V. V. and Cooper, T. G. Review: compilation and characteristics of dedicated transcription factors in Saccharomyces cerevisiae. Yeast 11 (1995).15: 1439.

Teichmann, Sarah A. and Babu, M Madan. Gene regulatory network growth by duplication. Nat Genet 36 (2004).5: 492.

Terzer, Marco and Stelling, Jorg. Large-scale computation of elementary flux modes with bit pattern trees. Bioinformatics 24 (2008).19: 2229.

Tian, Yuanyuan, McEachin, Richard C., Santos, Carlos, States, David J., and Patel, Jignesh M. SAGA: a subgraph matching tool for biological graphs. Bioinformatics 23 (2007).2: 232.

Ullman, Jeffrey D., Garcia-Molina, Hector, and Widom, Jennifer. Database Systems: The Complete Book. 2001, 1st ed.

Voit, Eberhard O. Optimization in integrated biochemical systems. Biotechnology and Bioengineering 40 (1992).5: 572.

Tohsato, Y., Matsuda, H., and Hashimoto, A. A Multiple Alignment Algorithm for Metabolic Pathway Analysis Using Enzyme Hierarchy. ISMB. 2000, 376.

Yan, Xifeng, Yu, Philip S., and Han, Jiawei. Graph indexing: a frequent structure-based approach. SIGMOD. 2004, 335.

Yarden, Yosef and Sliwkowski, Mark X. Untangling the ErbB signalling network. Nat Rev Mol Cell Biol 2 (2001).2: 127.

Yu, Haiyuan and Gerstein, Mark. Genomic analysis of the hierarchical structure of regulatory networks. Proc Natl Acad Sci USA 103 (2006).40: 14724.

Zhang, Ren and Lin, Yan. DEG 5.0, a database of essential genes in both prokaryotes and eukaryotes. Nucleic Acids Res 37 (2009).Database issue: D455–D458.

BIOGRAPHICAL SKETCH

Gunhan Gulsoy received his Bachelor of Science degree in Computer Engineering from Middle East Technical University in Ankara, Turkey in 2010. During his undergraduate studies, he also attended the University of Copenhagen in 2008 as an exchange student. He received his PhD in Computer Engineering from the University of Florida in 2013. During his PhD, he worked with Dr. Tamer Kahveci in his Bioinformatics Lab as a graduate research assistant. His research interests are bioinformatics, systems biology, databases, data mining and machine learning.