
Introduction to Phylogenetics

Permanent Link: http://ufdc.ufl.edu/UFE0041558/00001

Material Information

Title: Introduction to Phylogenetics: A Study of Maximum Parsimony and Maximum Likelihood Methods
Physical Description: 1 online resource (65 p.)
Language: english
Creator: Iuhasz, Naomi
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2010

Subjects

Subjects / Keywords: likelihood, parsimony, phylogenetics
Mathematics -- Dissertations, Academic -- UF
Genre: Mathematics thesis, M.S.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract: In this thesis we investigate the conceptual framework of phylogenetics together with two of the most popular methods of inferring phylogenetic relationships. Some notation and background concepts of graph theory are introduced. The general class of X-trees is presented together with its properties, shapes of trees, and X-splits. The functions of characters applied to trees make the connection between mathematics and biology. The notions of character convexity and compatibility are formalized. We also introduce and analyze the maximum parsimony and maximum likelihood methods of creating phylogenies, with comparison and contrast between the methods.
General Note: In the series University of Florida Digital Collections.
General Note: Includes vita.
Bibliography: Includes bibliographical references.
Source of Description: Description based on online resource; title from PDF title page.
Source of Description: This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility: by Naomi Iuhasz.
Thesis: Thesis (M.S.)--University of Florida, 2010.
Local: Adviser: Pilyugin, Sergei.
Electronic Access: RESTRICTED TO UF STUDENTS, STAFF, FACULTY, AND ON-CAMPUS USE UNTIL 2011-04-30

Record Information

Source Institution: UFRGP
Rights Management: Applicable rights reserved.
Classification: lcc - LD1780 2010
System ID: UFE0041558:00001

Full Text

INTRODUCTION TO PHYLOGENETICS: A STUDY OF MAXIMUM PARSIMONY AND MAXIMUM LIKELIHOOD METHODS

By NAOMI R. IUHASZ

A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE

UNIVERSITY OF FLORIDA 2010

© 2010 Naomi R. Iuhasz

To my sister, Andreea Iuhasz

ACKNOWLEDGMENTS

I take this opportunity to thank my supervisor, Dr. Sergei Pilyugin, for his expertise, kindness, and most of all, for his patience. I thank him also for many insightful conversations during the development of the ideas in this thesis, for the questions that challenged my understanding of the material, and for helpful comments on the text. I would like to express sincere gratitude to my thesis co-advisor Dr. Edward Braun, for taking me under his wing long before I knew anything about bio-mathematics and for his constant guidance throughout the journey that led to this thesis. I also thank Dr. Rebecca Kimball and the entire Braun-Kimball lab for allowing an outsider to enter their niche and to learn from their expertise. I thank Dr. Murali Rao for his valuable suggestions. I am greatly indebted to Prof. Billy Gunnells for challenging me to pursue my interests in biology. My thanks and gratitude go to Prof. Claire Kurtgis-Hunter for recognizing my potential and for being the first to encourage me to pursue a graduate degree.

Above all, I thank my family who stood beside me and encouraged me constantly. My thanks to my grandparents for their love and sound advice, to my aunt for taking on the responsibility to see me through college, and especially to my father who is my role model. Special thanks and appreciation to Camilo for his constant support and unending encouragement.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION

2 PRELIMINARIES
2.1 Definition of Terms
2.2 X-trees
2.3 Tree Shapes
2.4 X-splits
2.5 Characters and Convexity
2.6 Character Compatibility

3 MAXIMUM PARSIMONY
3.1 Classical Parsimony
3.2 Optimization on a Fixed Tree
3.3 Tree Rearrangement Operations
3.4 Relevance

4 MAXIMUM LIKELIHOOD
4.1 Basic Principles
4.2 Models of Sequence Evolution
4.3 Calculating Change Probabilities
4.4 Differences in Perspective between Parsimony and Likelihood

REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

Table

2-1 Characters $\chi_1$, $\chi_2$, and $\chi_3$

LIST OF FIGURES

Figure

2-1 (a) An X-tree. (b) A binary phylogenetic X-tree.
2-2 The two tree shapes for $B(6)$.
2-3 A rooted tree shape.
2-4 Edges $e_1$ and $e_2$ induce $\{1, 2, 3, 4\} | \{5, 6, 7, 8, 9\}$ and $\{1, 2, 3, 4, 5, 6, 7\} | \{8, 9\}$ X-splits respectively.
2-5 One iteration of tree popping.
2-6 An X-tree and a mapping $\bar{\chi}_t$.
2-7 Vertex disjoint subtrees.
2-8 Compatibility of characters. (a) A restricted chordal completion of $\mathrm{int}(\mathcal{C})$. (b) A maximal clique tree representation of $G$. (c) An X-tree on which each character in $\mathcal{C}$ is convex.
3-1 (a) The X-tree. (b) A minimum extension of a character. (c) Edges forming a cut-set.
3-2 A maximum-sized set of edge-disjoint proper paths.
3-3 An Erdős-Székely path system.
3-4 Deficiencies of forward pass reconstruction.
3-5 (a) The X-tree. (b) Forward pass. (c) Backward pass.
3-6 A schematic representation of the generic TBR operation.
3-7 A schematic representation of the generic SPR operation.
3-8 A schematic representation of the generic NNI operation.
4-1 Overview of the calculation of the likelihood of a tree. (a) Hypothetical sequence alignment. (b) An unrooted tree for the four taxa in (a). (c) Tree after rooting at an arbitrary interior vertex. (d) Likelihood of a character.

Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Master of Science

INTRODUCTION TO PHYLOGENETICS: A STUDY OF MAXIMUM PARSIMONY AND MAXIMUM LIKELIHOOD METHODS

By Naomi R. Iuhasz

May 2010

Chair: Sergei S. Pilyugin
Major: Mathematics

In this thesis we investigate the conceptual framework of phylogenetics together with two of the most popular methods of inferring phylogenetic relationships. Some notation and background concepts of graph theory are introduced. The general class of X-trees is presented together with its properties, shapes of trees, and X-splits. The functions of characters applied to trees make the connection between mathematics and biology. The notions of character convexity and compatibility are formalized. We also introduce and analyze the maximum parsimony and maximum likelihood methods of creating phylogenies, with comparison and contrast between the methods.

CHAPTER 1
INTRODUCTION

The subject of this thesis was motivated by the undergraduate research conducted by this author in the field of computational biology. The premise of the undergraduate research was to re-evaluate the general assumption that the rate of molecular evolution and the rate of morphological evolution are in effect dissociated from each other. If there is a correlation between the two rates, there would be a universal rate of evolution for organisms and the expected number of changes (either morphological or molecular) during a certain period of evolutionary history would be proportional to that universal rate. Even with a universal rate, some variation in the number of changes is expected. This variance would reflect a Poisson process in the ideal case. The research tested whether the negative binomial method is an improved alternative over the Poisson method to accommodate the variation in the number of changes. Also, it tested whether there is a need to assume more variance to fit morphological change to the universal rate than to fit molecular change. Both negative binomial and Poisson methods were applied to phylogenetic trees that were computed using one of the widely known software packages for inferring phylogenetic trees, PAUP*. This thesis is a continuation of the undergraduate research in that it investigates the mathematical concepts on which phylogenetic trees are based and how they are inferred from the available data.

Phylogenetics is the biological discipline which studies the evolutionary relatedness among different organisms based on molecular and morphological information. This relatively new branch of biology has its roots in Charles Darwin's theory of evolution. The quest to use present-day characteristics of a group of species to infer the historical relationships between them and their evolution from a common ancestor has been the subject of numerous studies. These relationships are consistently represented by an evolutionary (phylogenetic) tree, a structure first proposed by Darwin himself. Initially, the relationships were drawn after studying the morphological characteristics of the species. However, such a criterion of comparison has its limitations due to simplistic assumptions about evolutionary processes and the difficulty of comparison between very distantly related species or morphologically identical yet different species. The field of phylogenetics flourished with the discovery and study of molecular data, which began in the late 1960s. Protein and genetic sequences provide an immensely richer pool of information which can be explored through a steadily growing number of methods and techniques [32].

The field of phylogenetics developed with a strong interdisciplinary foundation as it incorporates elements of mathematics, statistics, and computer science with biology. Here we will look at the mathematical aspect of the field. The reconstruction and analysis of phylogenetic trees involves almost exclusively discrete mathematics, mainly graph theory and probability theory [32].

Inferring a phylogenetic tree is an estimation procedure since the true tree is essentially unknowable. The estimation models employed calculate a best estimate of an evolutionary history based on the incomplete information contained in the data. Phylogenetic inference methods seek to accomplish this goal in one of two ways: by defining a specific algorithm that leads to the determination of a tree or by defining a criterion for comparing alternative phylogenies to one another and deciding which is better. While in a purely algorithmic method the algorithm defines the tree selection criterion, in the criterion-based methods the algorithms are merely tools used to evaluate and compare trees. Purely algorithmic methods tend to be computationally fast because they proceed directly towards the final solution without evaluating large numbers of competing trees. These methods include all forms of pair-group cluster analysis and some other distance methods such as neighbor joining, not discussed in this thesis. The second class of methods first defines an optimality criterion for evaluating a given tree and then uses specific algorithms for computing the value of the objective function and for finding the trees that have the best value according to the criterion. The price of this logical clarity is that the criterion-based methods tend to be much slower than those from the first class. However, since criterion methods can assign a score to every tree examined, phylogenies can be ranked in order of preference. The two main criterion-based models, the maximum parsimony and the maximum likelihood models, will be examined in this thesis [34].

CHAPTER 2
PRELIMINARIES

2.1 Definition of Terms

Various diagrams used to illustrate evolutionary relationships among organisms resemble the structure of a tree; therefore, graphs are an elegant way to portray and study these relationships. A graph $G$ is an ordered pair $(V, E)$ consisting of a non-empty set $V$ of vertices and a multiset $E$ of edges, each of which is an element of $\{\{x, y\} : x, y \in V\}$. A graph $H$ is a subgraph of a graph $G$ if $V(H)$ and $E(H)$ are subsets of $V(G)$ and $E(G)$, respectively. If $V'$ is a non-empty subset of $V(G)$, then the subgraph of $G$ that has vertex set $V'$ and the edge set consisting of those edges of $G$ that have both ends in $V'$ is the subgraph of $G$ induced by $V'$, and is denoted by $G[V']$.

Since graphs, and trees in particular, are used extensively in several disciplines, such as mathematics, biology, and computer science, there are several names attributed to each component. For this reason, we will present the most common definitions but use only one name for each component consistently throughout this thesis. Most graphs discussed in this thesis correspond to unrooted trees, also known as unrooted phylogenies. In these structures, the location of the common ancestor is not identified. Vertices are also called nodes or points. The terminal nodes, also called leaves or external nodes, correspond to the contemporary taxa or species being assessed. The branch points within the interior of the tree, corresponding to past, intermediary species, are called internal nodes or interior vertices. Hereafter, we will employ the terms vertices, leaves, and interior vertices, respectively. The branches connecting pairs of nodes are also called edges, links, or segments. Branches incident to a leaf are called exterior or peripheral branches, and those connecting two interior vertices are interior edges or interior branches. In this thesis we will use edges, interior or exterior. The degree of a vertex $v$, denoted by $d(v)$, is the number of edges that are incident with $v$. Traditionally, a tree is defined as a connected graph with no cycles. Phylogenetic research works extensively with binary trees, that is, trees in which every interior vertex has degree three. Unless mentioned otherwise, all graphs are connected and all trees are binary.

2.2 X-trees

From a biological standpoint, a 'phylogenetic tree' represents the standard graphical depiction of evolutionary relationships. However, we need to define a more general class of objects to thoroughly investigate the mathematical processes involved. For this purpose, we will introduce the concept of an 'X-tree' as described in [32].

Definition 2.2.1. An X-tree $\mathcal{T}$ is an ordered pair $(T; \phi)$, where $T$ is a tree with vertex set $V$ and $\phi : X \to V$ is a map with the property that, for each $v \in V$ of degree at most two, $v \in \phi(X)$. An X-tree is also called a semi-labeled tree on $X$.

Definition 2.2.2. A phylogenetic X-tree $\mathcal{T}$ is an X-tree $(T; \phi)$ with the property that $\phi$ is a bijection from $X$ into the set of leaves of $T$. If, in addition, every interior vertex of $T$ has degree three, $\mathcal{T}$ is a binary phylogenetic X-tree.

Figure 2-1. (a) An X-tree. (b) A binary phylogenetic X-tree.

Figure 2-1(a) is an example of an X-tree, and Figure 2-1(b) is its binary phylogenetic X-tree equivalent. For a phylogenetic X-tree $\mathcal{T} = (T; \phi)$, $X$ can be viewed as the set of leaves of the tree $T$. Next, we present two important properties of binary phylogenetic trees.

Proposition 2.2.3. Let $\mathcal{T}$ be a binary phylogenetic X-tree and let $n = |X|$. Then, for all $n \geq 2$, $\mathcal{T}$ has $2n - 3$ edges, $2n - 2$ vertices, and $n - 2$ interior vertices, and, for all $n \geq 3$, $\mathcal{T}$ has $n - 3$ interior edges.

Proof. Use induction on $n$. For $n = 2$, there are 2 vertices, no interior vertices, and 1 edge, so the vertex and edge counts hold. For $n = 3$, the tree is a star with one interior vertex and no interior edges, so the interior edge count $n - 3 = 0$ holds as well.

Suppose the result holds for some $n$. Then we have $2n - 3$ edges, $2n - 2$ vertices, $n - 2$ interior vertices, and, if $n \geq 3$, $n - 3$ interior edges. Now check for $n + 1$. We add a leaf to the tree. This means we need to add an interior vertex also, placed on an existing edge. So the new tree has $2n$ vertices and $n - 1$ interior vertices. One edge is destroyed and three edges are created to link the leaf and the new vertex to the tree. For $n \geq 3$, if the edge destroyed is not interior, exactly one of the two edges replacing it is interior; if the edge destroyed is interior, it is replaced by two interior edges. In either case the number of interior edges grows by one. Hence, we have $(2n - 3) - 1 + 3 = 2(n + 1) - 3$ edges and $(n - 3) + 1 = (n + 1) - 3$ interior edges. Hence, the claim holds for $n + 1$.

Proposition 2.2.4. Let $B(n)$ denote the collection of all binary phylogenetic trees with label set $\{1, 2, \ldots, n\}$ and let $b(n) = |B(n)|$. If $n \in \{1, 2\}$, then $b(n) = 1$. For all $n \geq 3$,
$$b(n) = 1 \cdot 3 \cdot 5 \cdots (2n - 5) = \frac{(2n - 4)!}{(n - 2)!\, 2^{n-2}}.$$

Proof. Use induction on $n$. For $n = 3$, we get $b(3) = 1$, so the result holds. Let
$$a(n) = \frac{(2n - 4)!}{(n - 2)!\, 2^{n-2}}, \qquad a(3) = \frac{2!}{1! \cdot 2} = 1 = b(3).$$
Suppose the result holds for $n - 1$, $n \geq 4$. Let $\psi : B(n) \to B(n - 1)$ be the map such that $\psi(\mathcal{T})$ is the binary phylogenetic tree in $B(n - 1)$ that is obtained from $\mathcal{T} \in B(n)$ by deleting the leaf labeled $n$ and its incident edge, and then suppressing the resulting degree-two vertex. By construction, $\psi$ is onto. We obtain a binary phylogenetic tree in $B(n)$ by connecting a new leaf to an edge of a tree in $B(n - 1)$. From Proposition 2.2.3, a binary phylogenetic tree in $B(n - 1)$ has $2n - 5$ edges, so the leaf can be added in $2n - 5$ ways. Hence, each binary phylogenetic tree in $B(n - 1)$ in the range of $\psi$ is the image of $2n - 5$ trees in $B(n)$. We know $b(n - 1) = 1 \cdot 3 \cdot 5 \cdots (2n - 7)$. Therefore, $b(n) = b(n - 1)(2n - 5) = 1 \cdot 3 \cdot 5 \cdots (2n - 7)(2n - 5)$. Also,
$$a(n) = a(n - 1)\, \frac{(2n - 5)(2n - 4)}{2(n - 2)} = a(n - 1)(2n - 5) = b(n - 1)(2n - 5) = b(n).$$

The number of all possible phylogenetic trees with a given label set is important since we often need to search for the optimal tree. However, an exhaustive search is practically impossible when $n$ is large. For this reason, new methods must be employed to find the best tree without searching through all possibilities. These methods utilize various optimality criteria to compare and rate alternative trees. We will discuss them in detail in the next chapters.

2.3 Tree Shapes

A tree shape is a phylogenetic tree in which we ignore the labels on the leaves. It is useful to be able to determine the number of possible phylogenetic trees given by n-leaved trees that share the same shape.

Definition 2.3.1. Two phylogenetic trees $\mathcal{T}_1$ and $\mathcal{T}_2$ with equal label sets are shape equivalent if $T(\mathcal{T}_1)$ is isomorphic to $T(\mathcal{T}_2)$, where $T(\mathcal{T}_i)$ denotes the underlying tree of $\mathcal{T}_i$ [32].

Unrooted binary phylogenetic trees with $n \in \{2, 3, 4, 5\}$ have exactly one shape. Starting with $n = 6$, however, we encounter multiple tree shapes. For example, Figure 2-2 shows the two shapes of the collection of binary phylogenetic trees with 6 leaves, $B(6)$. Determining the number of tree shapes for phylogenetic trees with $n$ leaves is not trivial and was not explored since it is not relevant to the topic of this thesis. For details, see [7] for unrooted binary phylogenetic trees and [16] for rooted binary trees. Concerning tree shapes, we have examined how to count the number of phylogenetic trees on a given label set with a specific tree shape $\tau$. For this purpose, first we need to introduce some elements of group actions on sets.
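As a numerical aside on Proposition 2.2.4, the following minimal Python sketch evaluates $b(n)$ both as the product $1 \cdot 3 \cdots (2n - 5)$ and through the closed form, which gives a feel for why exhaustive search becomes impractical. The function names `b_product` and `b_closed_form` are illustrative choices and do not come from the thesis.

```python
from math import factorial


def b_product(n: int) -> int:
    """Number of binary phylogenetic trees on n labels as 1*3*5*...*(2n-5)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    result = 1
    # Odd factors 1, 3, ..., 2n-5; the range is empty for n in {1, 2}.
    for odd in range(1, 2 * n - 4, 2):
        result *= odd
    return result


def b_closed_form(n: int) -> int:
    """Closed form (2n-4)! / ((n-2)! * 2^(n-2)), stated for n >= 3."""
    if n < 3:
        raise ValueError("closed form is stated for n >= 3")
    return factorial(2 * n - 4) // (factorial(n - 2) * 2 ** (n - 2))
```

For instance, `b_product(10)` already exceeds two million, and each additional leaf multiplies the count by the next odd number.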

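Figure 2-2. The two tree shapes for $B(6)$.

Definition 2.3.2. An action of a group $\mathcal{G}$ on a set $M$ is a map $M \times \mathcal{G} \to M$, written $(m, g) \mapsto m \cdot g$, such that, for all $m \in M$ and for all $g_1, g_2 \in \mathcal{G}$,
(a) $m \cdot 1_{\mathcal{G}} = m$, and
(b) $(m \cdot g_1) \cdot g_2 = m \cdot (g_1 g_2)$.

We define the relation $m_1 \sim m_2$ if there is an element $g \in \mathcal{G}$ such that $m_1 \cdot g = m_2$. It is easily seen that $\sim$ is an equivalence relation on $M$. We denote the equivalence class of $m$ under this relation by $[m]$.

Lemma 2.3.3 (Burnside's Lemma). Let $\mathcal{G}$ be a finite group acting on a finite set $M$ and let $m \in M$. Then
$$|[m]| = \frac{|\mathcal{G}|}{|\mathcal{G}_m|},$$
where $\mathcal{G}_m = \{g \in \mathcal{G} : m \cdot g = m\}$ is a subgroup of $\mathcal{G}$ [32].

$[m]$ is also known as the orbit of $m$ and $\mathcal{G}_m$ as the stabilizer of $m$. In our context, $M$ is the collection of all phylogenetic trees with the label set $\{1, 2, \ldots, n\}$ and $\mathcal{G}$ is the symmetric group $S_n$ of all $n!$ permutations of $\{1, 2, \ldots, n\}$. Let $g \in \mathcal{G}$ and $\mathcal{T} \in M$. The action of $g$ on $\mathcal{T}$ maps $\mathcal{T}$ to the phylogenetic tree obtained from $\mathcal{T}$ by permuting the label set according to $g$. So, if $\mathcal{T}$ has tree shape $\tau$, then the number of phylogenetic trees having tree shape $\tau$ is
$$|[\mathcal{T}]| = \frac{n!}{|\mathcal{G}_{\mathcal{T}}|},$$
where $\mathcal{G}_{\mathcal{T}}$ is the collection of permutations of $\{1, 2, \ldots, n\}$ that leave $\mathcal{T}$ unchanged. [19] provides the following formulae for determining $|\mathcal{G}_{\mathcal{T}}|$.

Proposition 2.3.4. Let $\mathcal{T}$ be a rooted phylogenetic tree. For each interior vertex $v$ of $\mathcal{T}$, let $D(v)$ denote the collection of maximal rooted phylogenetic subtrees that lie below $v$. Now, let us impose the rooted shape equivalence relation on $D(v)$ and let $n_1(v), n_2(v), \ldots$ denote the sizes of the resulting equivalence classes. Then
$$|\mathcal{G}_{\mathcal{T}}| = \prod_{v \in \mathring{V}(\mathcal{T})} \prod_{i} n_i(v)!,$$
where $\mathring{V}(\mathcal{T})$ is the set of interior vertices of $\mathcal{T}$.

Figure 2-3. A rooted tree shape.

To illustrate this proposition, consider a rooted phylogenetic tree $\mathcal{T}$ having the shape shown in Figure 2-3. Applying Proposition 2.3.4, we get $|\mathcal{G}_{\mathcal{T}}| = \cdots = 48$. The interior vertices have been numbered to show the order in which they were considered. The formula for $|\mathcal{G}_{\mathcal{T}}|$ is simpler for a rooted binary phylogenetic tree. Let $s(\tau)$ denote the number of interior vertices $v$ of a rooted binary tree with shape $\tau$ for which the two maximal rooted subtrees that lie below $v$ have the same shape.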
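The product in Proposition 2.3.4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a rooted tree is encoded as nested tuples whose leaves are plain labels (strings or integers, never tuples), and the helper names `shape`, `leaves`, `stabilizer_size`, and `trees_with_shape` are hypothetical. The two four-leaf trees used afterwards are made-up examples, not the shape of Figure 2-3.

```python
from collections import Counter
from math import factorial


def shape(tree):
    """Canonical rooted shape: a leaf is (), an interior vertex is the
    sorted tuple of its children's shapes, so leaf labels are ignored."""
    if not isinstance(tree, tuple):
        return ()
    return tuple(sorted(shape(child) for child in tree))


def leaves(tree):
    """Number of leaves at or below this vertex."""
    if not isinstance(tree, tuple):
        return 1
    return sum(leaves(child) for child in tree)


def stabilizer_size(tree):
    """Product over interior vertices v of n_i(v)!, where the n_i(v) are
    the sizes of the shape-equivalence classes of the subtrees below v."""
    if not isinstance(tree, tuple):
        return 1
    size = 1
    for class_size in Counter(shape(child) for child in tree).values():
        size *= factorial(class_size)
    for child in tree:
        size *= stabilizer_size(child)
    return size


def trees_with_shape(tree):
    """Number n! / |G_T| of phylogenetic trees sharing the shape of `tree`."""
    return factorial(leaves(tree)) // stabilizer_size(tree)
```

With `balanced = (("a", "b"), ("c", "d"))` and `caterpillar = ((("a", "b"), "c"), "d")`, the sketch gives stabilizer sizes 8 and 2, hence 3 and 12 labeled trees respectively, which together account for all 15 rooted binary trees on four leaves. In a binary tree every factor is $1!$ or $2!$, so the product is always a power of two.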

Corollary 2.3.5. For a rooted binary phylogenetic tree $\mathcal{T}$ of shape $\tau$,
$$|\mathcal{G}_{\mathcal{T}}| = 2^{s(\tau)}.$$
Thus, the number of rooted binary phylogenetic trees having shape $\tau$ is $n!\, 2^{-s(\tau)}$.

The corresponding formula for $|\mathcal{G}_{\mathcal{T}}|$ for an unrooted tree is slightly more complicated because an unrooted tree can have an additional symmetry when two adjacent vertices are interchanged. An unrooted phylogenetic tree can have an edge for which the two rooted subtrees obtained by deleting this edge have the same shape. This edge is called a central edge, and a phylogenetic tree can have at most one such edge. Let $c(\mathcal{T}) = 1$ if $\mathcal{T}$ has a central edge and $0$ if it does not. Then, for an unrooted phylogenetic tree $\mathcal{T}$,
$$|\mathcal{G}_{\mathcal{T}}| = 2^{c(\mathcal{T})} \prod_{v \in \mathring{V}(\mathcal{T})} \prod_{i} n_i(v)!.$$

2.4 X-splits

The concepts in this section have played an important role in the mathematical development of phylogenetics [32].

Definition 2.4.1. An X-split is a partition of $X$ into two non-empty sets. We denote the X-split whose blocks are $A$ and $B$ by $A|B$.

Since we label the two components $A$ and $B$ arbitrarily, the X-split $B|A$ is equivalent to $A|B$. Next, we define the entire collection of X-splits associated with every X-tree. Let $\mathcal{T} = (T; \phi)$ be an X-tree and let $e$ be an edge of $T$. Then $T \setminus e$, the graph obtained from $T$ by deleting $e$, is composed of two components whose vertex sets we will name $V_1$ and $V_2$. Hence, $\phi^{-1}(V_1) | \phi^{-1}(V_2)$ is the X-split corresponding to $e$ in $\mathcal{T}$. This X-split is unique to edge $e$. We denote by $\Sigma(\mathcal{T})$ the collection of X-splits that correspond to the edges of $T$, and we refer to it as the X-splits of $\mathcal{T}$ or induced by $\mathcal{T}$, as in [32]. To better explain this concept, we employ the following example:

Consider the X-tree shown in Figure 2-4, where $X = \{1, 2, \ldots, 9\}$. The X-splits corresponding to edges $e_1$ and $e_2$ are $\{1, 2, 3, 4\} | \{5, 6, 7, 8, 9\}$ and $\{1, 2, 3, 4, 5, 6, 7\} | \{8, 9\}$, respectively.

Definition 2.4.2. A pair of X-splits $A_1|B_1$ and $A_2|B_2$ are compatible if at least one of the sets $A_1 \cap A_2$, $A_1 \cap B_2$, $B_1 \cap A_2$, and $B_1 \cap B_2$ is the empty set.

This definition, given by [32], is justified by the Splits-Equivalence Theorem, first presented by [3]. It is the most important result regarding X-splits. Its proof, also from [32], makes use of the following three lemmas.

Lemma 2.4.3. Let $\mathcal{T} = (T; \phi)$ be an X-tree, and let $\sigma_1$ and $\sigma_2$ be distinct elements of $\Sigma(\mathcal{T})$. Then $X$ can be partitioned into three sets $X_1$, $X_2$, and $X_3$ such that $\sigma_1 = X_1 | (X_2 \cup X_3)$ and $\sigma_2 = (X_1 \cup X_2) | X_3$. Furthermore, the intersection of the vertex sets of the minimal subtrees of $T$ induced by $\phi(X_1)$ and $\phi(X_2)$ is empty.

Proof. Let $e_1 = \{u_1, v_1\}$ and $e_2 = \{u_2, v_2\}$ be the unique edges corresponding to $\sigma_1$ and $\sigma_2$, respectively. Obviously, there is a path $P$ in $T$ such that $e_1$ and $e_2$ are the first and last edges, respectively, that are traversed by $P$. Without loss of generality, assume $u_1$ and $u_2$ are the initial and terminal vertices of $P$, respectively. Observe that $u_1 \neq u_2$, but $v_1$ and $v_2$ may not be distinct. Let $V_1$, $V_2$, and $V_3$ denote the vertex sets of the components of $T \setminus \{e_1, e_2\}$ containing $u_1$, $v_1$, and $u_2$, respectively. Choose $X_i = \phi^{-1}(V_i)$ for each $i \in \{1, 2, 3\}$. Then $X_1$, $X_2$, and $X_3$ partition $X$, with $\sigma_1 = X_1 | (X_2 \cup X_3)$ and $\sigma_2 = (X_1 \cup X_2) | X_3$.

Figure 2-4. Edges $e_1$ and $e_2$ induce $\{1, 2, 3, 4\} | \{5, 6, 7, 8, 9\}$ and $\{1, 2, 3, 4, 5, 6, 7\} | \{8, 9\}$ X-splits respectively.
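Definition 2.4.2 and the partition of Lemma 2.4.3 translate directly into set operations. The Python sketch below is a hedged illustration: an X-split is assumed to be given as a pair of Python sets, and the function names `is_compatible` and `three_block_partition` are hypothetical.

```python
def is_compatible(split1, split2):
    """Definition 2.4.2: two X-splits are compatible if at least one of
    the four pairwise intersections of their blocks is empty."""
    a1, b1 = split1
    a2, b2 = split2
    return any(not (p & q) for p in (a1, b1) for q in (a2, b2))


def three_block_partition(split1, split2):
    """For compatible splits, return (X1, X2, X3) such that
    split1 = X1 | (X2 union X3) and split2 = (X1 union X2) | X3,
    after possibly swapping the two blocks of either split."""
    for a1, _b1 in (split1, split1[::-1]):
        for a2, b2 in (split2, split2[::-1]):
            if not (a1 & b2):  # a1 misses b2, so a1 is contained in a2
                return a1, a2 - a1, b2
    raise ValueError("the two splits are not compatible")
```

Applied to the two splits of Figure 2-4, `three_block_partition(({1, 2, 3, 4}, {5, 6, 7, 8, 9}), ({1, 2, 3, 4, 5, 6, 7}, {8, 9}))` returns the blocks $\{1, 2, 3, 4\}$, $\{5, 6, 7\}$, and $\{8, 9\}$. By contrast, the splits $\{1, 2\} | \{3, 4\}$ and $\{1, 3\} | \{2, 4\}$ of a four-element set are not compatible, since all four intersections are non-empty.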

To illustrate Lemma 2.4.3, consider the X-tree shown in Figure 2-4. Let $\sigma_1$ and $\sigma_2$ be the X-splits corresponding to the edges $e_1$ and $e_2$, respectively (obviously distinct). Choosing $X_1 = \{1, 2, 3, 4\}$, $X_2 = \{5, 6, 7\}$, and $X_3 = \{8, 9\}$ provides a partition of $X$ into three sets. Moreover, $\sigma_1 = X_1 | (X_2 \cup X_3)$ and $\sigma_2 = (X_1 \cup X_2) | X_3$.

The next lemma is a general property of trees. Let $T$ be a tree and let $f$ be a function from a finite set $Y$ into the vertex set $V$ of $T$. Color the elements of $Y$ either red or green. Next, assign a coloring to the elements of $V$ in $f(Y)$ in the following way. Let $v$ be an element of $f(Y)$. If all elements of $f^{-1}(v)$ are of the same color, assign that color to $v$ itself; otherwise, assign both red and green to $v$. [32] refers to this coloring as the coloring of $V$ induced by $f$. A subgraph of $T$ is monochromatic if all of its colored vertices are of one particular color.

Lemma 2.4.4. Let $T = (V, E)$ be a tree, and let $f$ be a mapping from a finite set $Y$ into $V$. Consider the coloring of $V$ induced by $f$. Suppose that, for each edge $e \in E$, exactly one of the components of $T \setminus e$ is monochromatic. Then there exists a unique vertex $v \in V$ for which each component of $T \setminus v$ is monochromatic.

Proof. First, we show that there exists at least one such vertex. Let $e \in E$. By assumption, exactly one component of $T \setminus e$ is monochromatic. Assign $e$ an orientation from the end of $e$ that is incident with the monochromatic component of $T \setminus e$ to the other end of $e$. Then there exists $v \in V$ with out-degree zero; otherwise, we would have a directed path of infinite length. Deleting $v$ produces monochromatic components. Now, we show that there can be at most one such vertex $v$. Suppose, for the sake of a contradiction, that there is another vertex $v' \in V$ with the claimed property. Select an edge $e$ in the path connecting $v$ and $v'$. Then exactly one of the two components of $T \setminus e$ is not monochromatic. Without loss of generality, this component contains $v$. But this contradicts the assumption that each component of $T \setminus v'$ is monochromatic, as the component containing $v$ is not monochromatic.

Lemma 2.4.5. Let $A|B$ be an X-split. Suppose that $\mathcal{T} = (T; \phi)$ is an X-tree such that $A|B$ is not a split of $\mathcal{T}$, but $A|B$ is compatible with each X-split of $\mathcal{T}$. Then there exists a unique vertex $v$ of $T$ such that, for each component $(V', E')$ of $T \setminus v$, either $\phi^{-1}(V') \subseteq A$ or $\phi^{-1}(V') \subseteq B$.

Proof. Color the elements of $A$ red and the elements of $B$ green, and consider the corresponding coloring of the vertices of $T$ induced by $\phi$. Then, for each edge $e$ of $T$, exactly one of the components of $T \setminus e$ is monochromatic under the coloring of the vertices of $T$ by $\phi$. Applying Lemma 2.4.4 with $f = \phi$ and $Y = X$, there exists a unique vertex $v$ of $T$ for which each component of $T \setminus v$ is monochromatic. Therefore, $v$ satisfies the condition described in the lemma for the split $A|B$.

Theorem 2.4.6 (Splits-Equivalence Theorem). Let $\Sigma$ be a collection of X-splits. Then there is an X-tree $\mathcal{T}$ such that $\Sigma = \Sigma(\mathcal{T})$ if and only if the splits in $\Sigma$ are pairwise compatible. Moreover, if such a tree exists, then $\mathcal{T}$ is unique up to isomorphism.

Proof. First, suppose $\Sigma = \Sigma(\mathcal{T})$. Let $\sigma_1$ and $\sigma_2$ be distinct elements of $\Sigma$. By Lemma 2.4.3, there is a partition of $X$ into three sets $X_1$, $X_2$, and $X_3$ such that $\sigma_1 = X_1 | (X_2 \cup X_3)$ and $\sigma_2 = (X_1 \cup X_2) | X_3$. Since $X_1 \cap X_3 = \emptyset$, one of the four intersections in Definition 2.4.2 is empty, so the X-splits $\sigma_1$ and $\sigma_2$ are compatible; therefore, the X-splits of $\Sigma$ are pairwise compatible.

Conversely, suppose that $\Sigma$ is a pairwise compatible collection of X-splits. We use induction on the cardinality of $\Sigma$ to prove that $\Sigma = \Sigma(\mathcal{T})$ for some X-tree $\mathcal{T}$ and that the choice of $\mathcal{T}$ is unique up to isomorphism. If $|\Sigma| = 0$, then the tree $\mathcal{T}$ with a single vertex labeled $X$ is the unique tree for which $\Sigma = \Sigma(\mathcal{T})$.

Now suppose that $|\Sigma| = k + 1$, where $k \geq 0$, and that the existence and uniqueness properties hold for $|\Sigma| = k$. Let $A|B \in \Sigma$. Since $\Sigma - \{A|B\}$ is pairwise compatible, it follows by our induction assumption that there is, up to isomorphism, a unique X-tree $\mathcal{T}' = (T'; \phi')$ with $\Sigma - \{A|B\} = \Sigma(\mathcal{T}')$. By Lemma 2.4.5, there is a unique vertex $v'$ of $T'$ such that, for each component $(V', E')$ of $T' \setminus v'$, either $\phi'^{-1}(V') \subseteq A$ or $\phi'^{-1}(V') \subseteq B$.

Let $T$ be the tree obtained from $T'$ by replacing $v'$ with two new adjacent vertices $v_A$ and $v_B$, and attaching the subtrees that were incident with $v'$ to the new vertices in such a way that the subtrees containing vertices in $\phi'(A)$ and $\phi'(B)$ are attached to $v_A$ and $v_B$, respectively. Let $\phi : X \to V(T)$ be the map defined as follows:
$$\phi(x) = \begin{cases} \phi'(x) & \text{if } \phi'(x) \neq v', \\ v_A & \text{if } \phi'(x) = v' \text{ and } x \in A, \\ v_B & \text{if } \phi'(x) = v' \text{ and } x \in B. \end{cases}$$
It is easily checked that $(T; \phi)$ is an X-tree, and that if we denote $\mathcal{T} = (T; \phi)$, then we have $\Sigma = \Sigma(\mathcal{T})$. Moreover, since $\mathcal{T}'$ is the unique X-tree for which $\Sigma - \{A|B\} = \Sigma(\mathcal{T}')$, it is easily seen that $\mathcal{T}$ is the only such X-tree satisfying $\Sigma = \Sigma(\mathcal{T})$ up to isomorphism. This completes the proof of the Splits-Equivalence Theorem.

One application of Lemma 2.4.5, first described by [27], is the ability to reconstruct an X-tree $\mathcal{T}$ from $\Sigma(\mathcal{T})$, called tree popping. We order the elements $\sigma_1, \sigma_2, \ldots, \sigma_k$ of $\Sigma(\mathcal{T})$ arbitrarily, where $k = |\Sigma(\mathcal{T})|$, and we construct a sequence $\mathcal{T}_0, \mathcal{T}_1, \ldots, \mathcal{T}_k$ of X-trees such that, for all $i \in \{1, 2, \ldots, k\}$, $\Sigma(\mathcal{T}_i) = \{\sigma_1, \sigma_2, \ldots, \sigma_i\}$. Thus, $\mathcal{T}_k = \mathcal{T}$. In this construction, $\mathcal{T}_0$ is the X-tree consisting of only one vertex labeled $X$, and, for all $i$, $\mathcal{T}_i$ is the X-tree obtained from $\mathcal{T}_{i-1}$ by introducing an edge corresponding to the X-split $\sigma_i$. This introduction is described in the induction step of the Splits-Equivalence Theorem.

To illustrate one such iteration in the tree popping method, let $X = \{1, 2, \ldots, 7\}$ and let $\sigma_1 = \{7\} | X - \{7\}$, $\sigma_2 = \{1, 2\} | X - \{1, 2\}$, $\sigma_3 = \{4\} | X - \{4\}$, and $\sigma_4 = \{6, 7\} | X - \{6, 7\}$. Applying the tree popping method in the chosen order, we get the X-tree $\mathcal{T}_3$ shown in Figure 2-5(a) after three iterations. Now consider $\sigma_4$ and color the elements $\{6, 7\}$ red and the elements of $X - \{6, 7\}$ green. Since the vertex labeled $3, 5, 6$ is not monochromatic anymore, we separate that vertex into two monochromatic vertices as in Figure 2-5(b).
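The tree popping procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the construction of [27] verbatim: the X-tree is stored as a dictionary `labels` from integer vertex ids to label sets plus an adjacency dictionary `adj`, every split is passed as one of its two blocks, and the splits are assumed to be distinct and pairwise compatible. All function names are hypothetical, and the vertex search is a simple scan that favors clarity over speed.

```python
def components_without(adj, v):
    """Vertex sets of the components that remain when vertex v is deleted."""
    comps = []
    for start in adj[v]:
        seen, stack = {start}, [start]
        while stack:
            u = stack.pop()
            for w in adj[u]:
                if w != v and w not in seen:
                    seen.add(w)
                    stack.append(w)
        comps.append(seen)
    return comps


def pop_split(labels, adj, side_a):
    """One iteration: introduce an edge for the split side_a | X - side_a."""
    for v in list(adj):
        comps = components_without(adj, v)
        comp_labels = [set().union(*(labels[u] for u in c)) for c in comps]
        # Lemma 2.4.5: every component of T \ v lies entirely in A or in B.
        if all(cl <= side_a or cl.isdisjoint(side_a) for cl in comp_labels):
            break
    else:
        raise ValueError("split is not compatible with the current tree")
    v_b = max(adj) + 1                  # new vertex v_B; v itself becomes v_A
    labels[v_b] = labels[v] - side_a
    labels[v] = labels[v] & side_a
    adj[v_b] = set()
    for comp, cl in zip(comps, comp_labels):
        if cl.isdisjoint(side_a):       # components on the B side move to v_B
            (nbr,) = comp & adj[v]
            adj[v].remove(nbr)
            adj[nbr].remove(v)
            adj[v_b].add(nbr)
            adj[nbr].add(v_b)
    adj[v].add(v_b)
    adj[v_b].add(v)


def tree_popping(x, blocks):
    """Start from a single vertex labeled X and pop the splits in order."""
    labels, adj = {0: set(x)}, {0: set()}
    for side_a in blocks:
        pop_split(labels, adj, set(side_a))
    return labels, adj


def splits_of_tree(labels, adj):
    """The collection of X-splits induced by the edges of the tree."""
    x = set().union(*labels.values())
    result = set()
    for v in adj:
        for comp in components_without(adj, v):
            block = frozenset(set().union(*(labels[u] for u in comp)))
            result.add(frozenset({block, frozenset(x - block)}))
    return result
```

Calling `tree_popping(range(1, 8), [{7}, {1, 2}, {4}, {6, 7}])` reproduces the example above: after three splits there is a central vertex labeled $\{3, 5, 6\}$, and the fourth split separates it into vertices labeled $\{6\}$ and $\{3, 5\}$. Passing the result to `splits_of_tree` returns exactly the four input splits.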

PAGE 23

Figure 2-5. One iteration of tree popping.

2.5 Characters and Convexity

The concept of 'characters' is essential to any work in the domain of phylogenetics. In biology, characters are the attributes of the species being considered, and they are the data typically used to reconstruct phylogenetic trees. Mathematically, however, characters are functions. In this section, we formalize the notion and examine the mathematical properties of characters needed to construct phylogenetic trees. The material in this section was first presented in [32].

Definition 2.5.1. A character on $X$ is a function $\chi$ from a non-empty subset $X'$ of $X$ into a set $C$ of character states. $C$ is referred to as the state set of $\chi$. The character $\chi$ is said to be trivial if there is at most one element $\alpha \in C$ for which $|\chi^{-1}(\alpha)| \geq 2$; otherwise, $\chi$ is non-trivial. If $X' = X$, we say $\chi$ is a full character. If $|\chi(X')| = r$, we say $\chi$ is an r-state character. A character $\chi$ on $X$ is a binary character if $\chi$ is a two-state full character.

The biological interpretation of characters may vary. They can be morphological (e.g. fur versus feathers), behavioral, physiological, biochemical, embryological, or molecular. The undergraduate research that led to this thesis dealt with molecular and morphological characters. The next definition introduces the concept of convexity, which has a fundamental biological interpretation that we will discuss later in this section.

Definition 2.5.2. Let $\chi$ be a character on $X$ from $X'$ into a set $C$ of character states. We say that $\chi$ is convex on an X-tree $\mathcal{T} = (T; \phi)$ with $T = (V, E)$ if there is a function $\bar{\chi} : V \to C$ satisfying the following properties:

PAGE 24

C1 j X 0 = and C2foreach 2 C ,thesubgraphof T inducedby f v 2 V : v = g isconnected. Itfollowsimmediatelythatabinarycharacter on X isconvexonanX-tree T preciselyifthebipartitionof X inducedby isanX-splitof T Thefollowingisanexampleforcompatibility. Figure2-6.AnX-treeandamapping t Let X betheset f 1,2,...,7 g .Let C = f , g bethesetofcharacterstates. Let : X C bethefullcharacteron X denedby = = = = = and = = .So, isafour-statecharacter.Now,considerthe X-tree T = T ; with T = V E showninFigure2-6a.Foreach t 2f g ,let t bethemapfrom V into C speciedinFigure2-6b.Clearly t satisesbothC1 andC2forall t = t = or t = .Hence, isconvexon T foranycharacterstate t 2f g Thenextpropositionprovidestwoalternativedescriptionsofconvexity. Proposition2.5.3. Let T = T ; beanX-treewith T = V E andlet : X 0 C bea characteron X .Thefollowingstatementsareequivalent: i isconvexon T ; iithemembersof f T : 2 C g arepairwisevertexdisjoint;and iiiforalldistinct 2 X 0 ,thereexistsanX-split A j B of T suchthat )]TJ/F23 7.9701 Tf 6.586 0 Td [(1 A and )]TJ/F23 7.9701 Tf 6.586 0 Td [(1 B 24

PAGE 25

Here $T(\alpha)$ denotes the minimal subtree of $T$ containing $\phi(\chi^{-1}(\alpha))$, where $\chi^{-1}(\alpha)$ is the set of labels in $X$ that were assigned character state $\alpha$.

Proof. It is sufficient to prove (i) $\Rightarrow$ (ii) $\Rightarrow$ (iii) $\Rightarrow$ (i).

(i) $\Rightarrow$ (ii): Suppose $\chi$ is convex on $\mathcal{T}$. Then there exists $\bar{\chi} : V(T) \to C$ satisfying properties C1 and C2. Let $\alpha_1, \alpha_2 \in C$ be distinct. Then, by C2, for any $i \in \{1, 2\}$, $T(\alpha_i)$ is a subtree of the subgraph of $T$ induced by $\{v \in V(T) : \bar{\chi}(v) = \alpha_i\}$. So, $T(\alpha_1)$ and $T(\alpha_2)$ are vertex disjoint.

(ii) $\Rightarrow$ (iii): Suppose the elements of $\{T(\alpha) : \alpha \in C\}$ are pairwise vertex disjoint. Let $\alpha$ and $\beta$ be two distinct elements of $\chi(X') \subseteq C$. Then, from property C1, $T(\alpha) \neq T(\beta)$. Therefore, there exists a path from $T(\alpha)$ to $T(\beta)$ such that $v_\alpha \in T(\alpha)$ and $v_\beta \in T(\beta)$ are the beginning and the ending vertex of the path, and for any edge $e$ of the path, $e \notin E(T(\alpha))$ and $e \notin E(T(\beta))$. Take an edge $e$ in this path and consider its corresponding X-split. Let $A$ be the component of $T \setminus e$ such that $v_\alpha \in A$ and let $B$ be the other component, so that $v_\beta \in B$. Since $T(\alpha)$ and $T(\beta)$ are connected, based on property C2, we have that $T(\alpha) \subseteq A$ and $T(\beta) \subseteq B$. So, $\chi^{-1}(\alpha) \subseteq A$ and $\chi^{-1}(\beta) \subseteq B$.

(iii) $\Rightarrow$ (i): Suppose there is an X-split $A|B$ of $\mathcal{T}$ such that $\chi^{-1}(\alpha) \subseteq A$ and $\chi^{-1}(\beta) \subseteq B$ for any distinct $\alpha$ and $\beta$ in $\chi(X')$. The C1 property is clear; otherwise, an X-split would not be possible. Now, suppose for the sake of a contradiction that there exists an $\alpha \in \chi(X')$ such that the subgraph of $T$ induced by $\{v \in V : \bar{\chi}(v) = \alpha\}$ is not connected. Without loss of generality, suppose that this subgraph has two components, say $M$ and $N$. Then, the path between $M$ and $N$ contains a vertex $v$ such that $\bar{\chi}(v) \neq \alpha$. Let $\beta = \bar{\chi}(v)$. Let $e$ be the edge between $M$ and $v$. Take the X-split corresponding to $e$. Then, $\phi^{-1}(M) \subseteq A$, but $\phi^{-1}(N) \subseteq B$, which is a contradiction to the initial assumption. This completes the proof.

We can illustrate Proposition 2.5.3 using the X-tree in Figure 2-6. The vertex disjoint subtrees are shown in Figure 2-7.
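Criterion (ii) of Proposition 2.5.3 lends itself to a direct computational check. The following Python sketch is one hedged way to do it: the function names `minimal_subtree` and `is_convex`, the adjacency-dictionary representation of the tree, and the small four-leaf example are assumptions introduced here for illustration, not part of [32].

```python
def minimal_subtree(adj, targets):
    """Vertex set of the smallest subtree of the tree `adj` containing `targets`.

    Leaves that are not targets are pruned repeatedly until none remain.
    """
    keep = set(adj)
    degree = {v: len(adj[v]) for v in adj}
    stack = [v for v in adj if degree[v] <= 1 and v not in targets]
    while stack:
        v = stack.pop()
        if v not in keep:
            continue
        keep.remove(v)
        for w in adj[v]:
            if w in keep:
                degree[w] -= 1
                if degree[w] <= 1 and w not in targets:
                    stack.append(w)
    return keep

def is_convex(adj, phi, chi):
    """True if the subtrees spanned by the state classes are pairwise vertex disjoint.

    adj: vertex -> set of neighbours; phi: label -> vertex; chi: label -> state.
    """
    by_state = {}
    for x, state in chi.items():
        by_state.setdefault(state, set()).add(phi[x])
    subtrees = [minimal_subtree(adj, t) for t in by_state.values()]
    # Pairwise disjoint exactly when no vertex is counted twice.
    return sum(len(s) for s in subtrees) == len(set().union(*subtrees))

# Hypothetical four-leaf tree: leaves a, b hang off u; leaves c, d hang off v.
adj = {"a": {"u"}, "b": {"u"}, "c": {"v"}, "d": {"v"},
       "u": {"a", "b", "v"}, "v": {"c", "d", "u"}}
phi = {x: x for x in "abcd"}
grouped = is_convex(adj, phi, {"a": 0, "b": 0, "c": 1, "d": 1})
crossed = is_convex(adj, phi, {"a": 0, "c": 0, "b": 1, "d": 1})
```

In this small example the character grouping $\{a, b\}$ against $\{c, d\}$ is convex, while the one grouping $\{a, c\}$ against $\{b, d\}$ is not, because both spanning subtrees pass through the interior vertices $u$ and $v$.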

PAGE 26

Figure 2-7. Vertex disjoint subtrees.

Convexity is a fundamental concept in phylogenetics because of its biological meaning. Let a rooted phylogenetic X-tree $\mathcal{T} = (T; \phi)$ describe the evolution of the set $X$ of extant species from an ancestral species, which we introduce as the root $\rho$ of $T$. Now, suppose that each species or vertex $v$ has an associated character state in $C$. We can regard the character state as 'evolving' from $\rho$ towards the elements of $X$ on $T$. For all $v \in V(T)$, let $c(v)$ denote the character state assigned to $v$. We can formalize the assumption that, each time a species changes its character state, the new state it acquires appears for the first time in the tree, in the following way:

Definition 2.5.4. We say that a character $c$ is homoplasy-free if neither of the following occur:

(i) Suppose $v_1, v_2, \ldots, v_k$ is a path in $T$ directed away from the root $\rho$. Then $c$ is said to exhibit reverse transition if, for some $i \in \{2, 3, \ldots, k-1\}$, $c(v_1) = c(v_k) \neq c(v_i)$. This corresponds to a new character state arising but then reverting back to an earlier state.

(ii) Suppose that $v_1, v_2, \ldots, v_k$ and $w_1, w_2, \ldots, w_l$ are paths in $T$ directed away from the root $\rho$ and that $v_1 = w_1$. Then $c$ is said to exhibit convergent transition if $c(v_k) = c(w_l) \neq c(v_1)$. This corresponds to the same state arising in different parts of the tree independently of each other.

Reverse and convergent transitions are known to occur in biology in many types of characters. We will now explain the connection between these concepts and

PAGE 27

convexity. Let $\mathcal{T} = (T; \phi)$ be a rooted phylogenetic X-tree with $T = (V, E)$ and root $\rho$. Suppose each vertex $v$ of $T$ has a character state $c(v)$ in $C$. Consider the associated phylogenetic X-tree $\mathcal{T}^{-\rho}$. If we look only at the values of $c$ at the leaves of $T$, we obtain an induced full character $\chi$ on $X$ by setting $\chi(x) = c(\phi(x))$ for all $x \in X$. This character describes the character states of the present-day species. If $c$ is homoplasy-free, then $\chi$ is convex on $\mathcal{T}^{-\rho}$, since $\bar{\chi} : V \to C$ defined as $\bar{\chi}(u) = c(u)$ for all $u \in V$ satisfies conditions C1 and C2. Conversely, if $\chi$ is convex on a phylogenetic X-tree $\mathcal{T}_1$ with $T_1 = (V_1, E_1)$ and a corresponding function $\bar{\chi}_1 : V_1 \to C$ that satisfies conditions C1 and C2, then for all choices of a root $\rho$, we can extend $\bar{\chi}_1$ to a map from $V_1 \cup \{\rho\}$ to $C$ that is homoplasy-free. It is important to note, however, that even if $c$ is not homoplasy-free on a rooted phylogenetic tree $\mathcal{T}$, it is entirely possible that the associated character $\chi$ is convex on $\mathcal{T}^{-\rho}$. The concept of homoplasy is quantified in the next chapter.

2.6 Character Compatibility

Definition 2.6.1. A collection of characters on $X$ is said to be compatible if there exists an X-tree on which all the characters in the collection are convex.

In other words, a collection of characters is compatible if they could all have evolved on some tree without any reverse or convergent transitions. The condition of compatibility is the same for phylogenetic X-trees and binary phylogenetic X-trees. Determining whether a collection of characters is compatible and, if so, constructing a tree on which they are all convex is known as the character compatibility problem or, more recently in computer science circles, as the perfect phylogeny problem.

In the case of binary characters, the Splits-Equivalence Theorem 2.4.6 states that a collection of binary characters is compatible if and only if the characters are pairwise compatible. Hence, there is a unique minimal tree on which binary characters are convex. For non-binary characters, however, this observation does not apply. Semple and Steel offer a framework in [32] to ascertain compatibility using chordal graphs, which

PAGE 28

we will now describe. We first need to introduce a series of definitions for the terms used.

Definition 2.6.2. Let $S = \{S_1, S_2, \ldots, S_k\}$ be a family of sets. The intersection graph of $S$, denoted $\operatorname{int}(S)$, is the graph that has vertex set $S$ and an edge between $S_i$ and $S_j$ precisely if $S_i \cap S_j \neq \emptyset$, for distinct $i, j \in \{1, 2, \ldots, k\}$.

Definition 2.6.3. A graph $G$ is chordal, also called triangulated, if every induced subgraph of $G$ that is a cycle has at most three edges. Equivalently, a graph is chordal if every cycle with at least four vertices has an edge, called a chord, connecting two non-consecutive vertices in the cycle.

Definition 2.6.4. A chordalization (also called triangulation) of a graph $G = (V, E)$ is a graph $G' = (V, E')$ with the properties that $G'$ is chordal and $E \subseteq E'$.

Definition 2.6.5. For a character $\chi : X' \to C$ on $X$, let $\pi(\chi)$ denote the partition of $X'$ corresponding to $\{\chi^{-1}(\alpha) : \alpha \in C\}$. Let $\mathcal{C}$ be a collection of characters on $X$ and let $\mathcal{T} = (T; \phi)$ be an X-tree. Next, define two graphs, each of which has vertex set
$$\bigcup_{\chi \in \mathcal{C}} \{(\chi, A) : A \in \pi(\chi)\}.$$
(i) The partition intersection graph of $\mathcal{C}$ is the graph that has the vertex set mentioned above and an edge joining two vertices precisely if the intersection of the second coordinates is non-empty. We denote this graph by $\operatorname{int}(\mathcal{C})$.
(ii) The subtree intersection graph of $\mathcal{T}$ induced by $\mathcal{C}$ is the graph that has the vertex set mentioned above and an edge $\{(\chi, A), (\chi', B)\}$ if the intersection of the vertex sets of $T(A)$ and $T(B)$ is non-empty. This graph is denoted by $\operatorname{int}(\mathcal{C}, \mathcal{T})$.

Definition 2.6.6. A vertex of a graph $G$ is simplicial if its neighbors together with itself induce a clique (a graph in which each pair of distinct vertices is joined by one edge, also known as a complete graph).

Definition 2.6.7. We say that $G$ has a perfect elimination ordering if the vertices of $G$ can be ordered as $v_1, v_2, \ldots, v_k$ so that, for each $i \in \{1, 2, \ldots, k\}$, $v_i$ is a simplicial vertex of the subgraph of $G$ induced by $\{v_i, \ldots, v_k\}$.
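Definitions 2.6.6 and 2.6.7 suggest a simple greedy procedure: repeatedly remove a simplicial vertex from what remains of the graph. The Python sketch below illustrates this under stated assumptions. The function names and the adjacency-dictionary representation are hypothetical, the vertex names must be sortable, and the procedure relies on the standard facts that every chordal graph has a simplicial vertex and that induced subgraphs of chordal graphs are chordal, so it returns `None` exactly when the graph is not chordal. It is meant as a readable illustration, not an efficient algorithm.

```python
from itertools import combinations

def is_simplicial(adj, v, alive):
    """True if the neighbours of v among `alive` are pairwise adjacent."""
    nbrs = adj[v] & alive
    return all(b in adj[a] for a, b in combinations(nbrs, 2))

def perfect_elimination_ordering(adj):
    """Greedy search for a perfect elimination ordering.

    adj maps each vertex to the set of its neighbours. Returns a list of
    vertices, or None if some remaining subgraph has no simplicial vertex.
    """
    alive = set(adj)
    order = []
    while alive:
        v = next((u for u in sorted(alive) if is_simplicial(adj, u, alive)), None)
        if v is None:
            return None
        order.append(v)
        alive.remove(v)
    return order

# A 4-cycle has no chord, so it is not chordal; adding the chord {1, 3} fixes that.
square = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}
square_with_chord = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2, 4}, 4: {1, 3}}
```

Calling `perfect_elimination_ordering(square)` yields `None`, whereas the graph with the chord admits an ordering that starts with vertex 2, whose two neighbours 1 and 3 are adjacent.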

PAGE 29

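Definition 2.6.8. A graph $G$ is a restricted chordal completion of $\operatorname{int}(\mathcal{C})$ if $G$ is a chordalization of $\operatorname{int}(\mathcal{C})$ and, for all edges $\{(\chi, A), (\chi', B)\}$ of $G$, $\chi \neq \chi'$.

The following theorem was stated by [32], with various parts of the equivalences due to [4], [13], [14], [30], and [36].

Theorem 2.6.9. Let $G$ be a graph. Then the following statements are equivalent:
(i) $G$ is chordal;
(ii) $G$ is a subtree intersection graph;
(iii) $G$ has a perfect elimination ordering;
(iv) there exists a tree $T$ whose vertex set $K$ is the set of maximal cliques of $G$ and, for each vertex $v$ in $G$, the subgraph of $T$ induced by the elements of $K$ containing $v$ is a subtree of $T$.

The tree described in (iv) of Theorem 2.6.9 is referred to as a maximal clique tree representation of $G$.

Theorem 2.6.10, indicated by [4] and [27], and formally proved by [33], is the main result of this section.

Theorem 2.6.10. Let $\mathcal{C}$ be a collection of characters on $X$. Then, $\mathcal{C}$ is compatible if and only if there exists a restricted chordal completion of $\operatorname{int}(\mathcal{C})$.

Proof. Suppose $\mathcal{C}$ is compatible. Then there exists an X-tree $\mathcal{T}$ on which every character in $\mathcal{C}$ is convex. By Theorem 2.6.9, (i) $\Leftrightarrow$ (ii), $\operatorname{int}(\mathcal{C}, \mathcal{T})$ is chordal. The edge set of $\operatorname{int}(\mathcal{C})$ is a subset of the edge set of $\operatorname{int}(\mathcal{C}, \mathcal{T})$, and every character in $\mathcal{C}$ is convex on $\mathcal{T}$, so by Proposition 2.5.3 no edge of $\operatorname{int}(\mathcal{C}, \mathcal{T})$ joins two vertices with the same first coordinate. Therefore, $\operatorname{int}(\mathcal{C}, \mathcal{T})$ is a restricted chordal completion of $\operatorname{int}(\mathcal{C})$.

To prove the converse, suppose that $G$ is a restricted chordal completion of $\operatorname{int}(\mathcal{C})$. From Theorem 2.6.9, (i) $\Leftrightarrow$ (iv), there exists a tree $T'$ whose vertex set $K$ is the set of the maximal cliques of $G$, and for each vertex $(\chi, A)$ the subgraph of $T'$ induced by the elements of $K$ containing $(\chi, A)$ is a subtree of $T'$. To complete the proof, we construct an X-tree via $T'$ on which every character in $\mathcal{C}$ is convex. Define $\phi : X \to K$ such that, for any $x \in X$, $\phi(x)$ contains the vertices of the maximum-sized clique in $G$ in which $x$ is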

PAGE 30

an element of the second coordinate of every vertex. Observe that $\operatorname{int}(\mathcal{C})$ is a subgraph of $G$, so a vertex of $G$ is in this clique precisely if this vertex contains $x$. Note that such a map may not be unique. Define $T$ to be the tree obtained from $T'$ by suppressing all vertices of degree two that are not identified by an element of $X$. It is easily checked that all degree-one vertices of $T'$ are identified by an element of $X$, and so $\mathcal{T} = (T; \phi)$ is an X-tree.

Now, we show that every character in $\mathcal{C}$ is convex on $\mathcal{T}$. Let $A_1$ and $A_2$ be distinct members of $\pi(\chi)$ for some $\chi \in \mathcal{C}$. Then, the subtrees $T'_1$ and $T'_2$ of $T'$ induced by the elements of $K$ containing $(\chi, A_1)$ and $(\chi, A_2)$, respectively, do not intersect. Since the elements of $A_i$ can only be identified with vertices in $T'_i$, for each $i \in \{1, 2\}$, it follows that the intersection of the vertex sets of $T(A_1)$ and $T(A_2)$ is empty. Thus, every element of $\mathcal{C}$ is convex on $\mathcal{T}$, and therefore, $\mathcal{C}$ is compatible by definition.

Corollary 2.6.11. Two characters $\chi$ and $\chi'$ on $X$ are compatible if and only if $\operatorname{int}(\{\chi, \chi'\})$ is acyclic.

Proof. Suppose $\operatorname{int}(\{\chi, \chi'\})$ is acyclic. Then $\operatorname{int}(\{\chi, \chi'\})$ is chordal, and so it is a restricted chordal completion of itself. Hence, by Theorem 2.6.10, $\chi$ and $\chi'$ are compatible.

Conversely, suppose $\operatorname{int}(\{\chi, \chi'\})$ contains a cycle. Let $G$ be a chordalization of $\operatorname{int}(\{\chi, \chi'\})$. Then, $G$ must contain a three-cycle whose vertices have first coordinates $\chi, \chi', \chi$ or $\chi, \chi', \chi'$, and hence an edge joining two vertices with the same first coordinate. This implies $G$ is not a restricted chordal completion of $\operatorname{int}(\{\chi, \chi'\})$. Therefore, by Theorem 2.6.10, $\chi$ and $\chi'$ are not compatible.

Corollary 2.6.12. Let $\mathcal{C}$ be a collection of binary characters on $X$. Then, $\mathcal{C}$ is compatible if and only if $\operatorname{int}(\mathcal{C})$ is chordal.

Proof. If $\operatorname{int}(\mathcal{C})$ is chordal, then $\operatorname{int}(\mathcal{C})$ is a restricted chordal completion of itself. Therefore, by Theorem 2.6.10, $\mathcal{C}$ is compatible.

Now, suppose $\mathcal{C}$ is compatible. Then, by the Splits-Equivalence Theorem 2.4.6, there is a unique X-tree $\mathcal{T}$ such that $\Sigma(\mathcal{T})$ is equal to the set of X-splits induced by the

PAGE 31

elements in $\mathcal{C}$. We next show $\operatorname{int}(\mathcal{C}) = \operatorname{int}(\mathcal{C}, \mathcal{T})$ by verifying the claim that the edge sets of $\operatorname{int}(\mathcal{C})$ and $\operatorname{int}(\mathcal{C}, \mathcal{T})$ are equal.

Let $(\chi, A)$ and $(\chi', B)$ be distinct vertices of $\operatorname{int}(\mathcal{C})$. If $\chi = \chi'$, then $A|B$ is an X-split induced by $\mathcal{T}$ and the claim clearly holds. Next, assume that $\chi \neq \chi'$. If $A \cap B \neq \emptyset$, then the claim trivially holds. Assume $A \cap B$ is empty. Then, $A | X - A$ and $B | X - B$ are distinct X-splits induced by $\mathcal{T}$. Therefore, by Lemma 2.4.3 there is a partition of $X$ into three sets $X_1$, $X_2$, $X_3$ so that $X_1 \in \{A, X - A\}$, $X_3 \in \{B, X - B\}$ and the intersection of the vertex sets of $T(X_1)$ and $T(X_3)$ is empty. The only possible choices for $X_1$ and $X_3$ are $A$ and $B$, respectively. The claim now readily follows from this case.

This corollary, however, does not extend to collections of two-state characters on $X$.

We will now provide the framework of how to construct a maximal clique tree representation of a chordal graph, following [15]. Suppose $G = (V, E)$ is a chordal graph. Let $v_1, v_2, \ldots, v_k$ be a perfect elimination ordering of the vertices of $G$, where $k = |V|$. Since every chordal graph has at least one simplicial vertex [6] and every vertex-induced subgraph of a chordal graph is chordal, obtaining such an ordering is elementary. Let $i \in \{1, 2, \ldots, k-1\}$ and let $K_i$ denote the vertex set of the maximal clique of $G[\{v_i, v_{i+1}, \ldots, v_k\}]$ that contains $v_i$. Define $T_k$ as the tree consisting of the single vertex $\{v_k\}$. So, $T_k$ is a maximal clique tree representation of $G[\{v_k\}]$. In general, for all $i$, define $T_i$ to be the tree obtained from $T_{i+1}$ as follows:
(i) if $K_i - v_i$ is a vertex of $T_{i+1}$, then replace $K_i - v_i$ with $K_i$ to get $T_i$;
(ii) otherwise, join a new vertex $K_i$ to a vertex of $T_{i+1}$ containing $K_i - v_i$ to get $T_i$.
It is easily checked that $T_i$ is a maximal clique tree representation of $G[\{v_i, v_{i+1}, \ldots, v_k\}]$ for any $i$. Thus, $T_1$ is a maximal (not necessarily unique) clique tree representation of $G$.

We now illustrate these concepts with an example. Suppose that $X = \{1, 2, 3, 4, 5, 6\}$ and let $\mathcal{C} = \{\chi_1, \chi_2, \chi_3\}$ be a set of characters on $X$ with $\chi_1$, $\chi_2$, and $\chi_3$ as defined in Table 2-1. Then $\pi(\chi_1) = \{\{2\}, \{1, 4\}, \{3, 5, 6\}\}$, $\pi(\chi_2) = \{\{1, 2\}, \{3, 5\}, \{6\}\}$, and $\pi(\chi_3) = \{\{2, 3\}, \{4, 6\}\}$. Let $G$ denote the graph shown in Figure 2-8a. Since the first

PAGE 32

coordinates of the end vertices of each of the dashed lines are distinct, $G$ is a restricted chordal completion of $\operatorname{int}(\mathcal{C})$, with $\operatorname{int}(\mathcal{C})$ being the graph induced by the solid lines of this graph. Hence, by Theorem 2.6.10, $\mathcal{C}$ is compatible.

Table 2-1. Characters $\chi_1$, $\chi_2$ and $\chi_3$

$x$    $\chi_1(x)$    $\chi_2(x)$    $\chi_3(x)$
1      $\alpha$       $\beta$        -
2      $\alpha''$     $\beta$        $\gamma$
3      $\alpha'$      $\beta'$       $\gamma$
4      $\alpha$       -              $\gamma'$
5      $\alpha'$      $\beta'$       -
6      $\alpha'$      $\beta''$      $\gamma'$

We next construct an X-tree on which all of the characters in $\mathcal{C}$ are convex. Let the following sequence be the perfect elimination ordering of $G$ that we use:

$(\chi_2, \{6\})$, $(\chi_1, \{2\})$, $(\chi_1, \{1, 4\})$, $(\chi_3, \{4, 6\})$, $(\chi_2, \{1, 2\})$, $(\chi_1, \{3, 5, 6\})$, $(\chi_3, \{2, 3\})$, $(\chi_2, \{3, 5\})$.

Following the process described immediately prior to this example, we can construct the maximal clique tree representation $T = (K, E)$ of $G$ shown in Figure 2-8b, where $K$ is the collection of maximal cliques of $G$ and, for each vertex $v$ in $G$, the subgraph induced by the elements of $K$ containing $v$ is a subtree of $T$. In this case, this is the only maximal clique tree representation of $G$. Lastly, to obtain the desired X-tree, we define a map $\phi : X \to K$ so that, for each element $x$ in $X$, $\phi(x)$ contains the vertices of the maximum-sized clique in $G$ in which $x$ is an element of the second coordinate of every vertex. The X-tree in Figure 2-8c is a tree on which all characters of $\mathcal{C}$ are convex.

Compatibility is relatively easy to determine for binary characters, but it becomes a much more difficult problem in other cases. [32] states that determining if an arbitrary collection $\mathcal{C}$ of characters is compatible is NP-complete, even if all characters in $\mathcal{C}$ are two-state. Moreover, pairwise compatibility is not sufficient for compatibility of the entire collection of characters, as also shown by [32]. It is, nonetheless, a useful method for a bounded number of distinct states or full characters on $X$.
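The example can be checked mechanically from the three partitions alone. The Python sketch below builds $\operatorname{int}(\mathcal{C})$, adds two edges, and tests the ordering listed above. Everything named here (`partition_intersection_graph`, `is_peo`, the helper `V`, the tuple encoding of vertices) is an assumption made for illustration. In particular, the two added edges are a completion worked out for this sketch; they are not transcribed from the dashed lines of Figure 2-8a.

```python
from itertools import combinations

def partition_intersection_graph(partitions):
    """int(C): vertices (character index, block), edges between intersecting blocks."""
    vertices = [(i, frozenset(block))
                for i, blocks in partitions.items() for block in blocks]
    adj = {v: set() for v in vertices}
    for u, w in combinations(vertices, 2):
        if u[1] & w[1]:
            adj[u].add(w)
            adj[w].add(u)
    return adj

def is_peo(adj, order):
    """True if `order` is a perfect elimination ordering of the graph `adj`."""
    position = {v: i for i, v in enumerate(order)}
    for v in order:
        later = [w for w in adj[v] if position[w] > position[v]]
        if any(b not in adj[a] for a, b in combinations(later, 2)):
            return False
    return True

def V(i, *elements):
    """Vertex (chi_i, block) of the partition intersection graph."""
    return (i, frozenset(elements))

partitions = {1: [{2}, {1, 4}, {3, 5, 6}],
              2: [{1, 2}, {3, 5}, {6}],
              3: [{2, 3}, {4, 6}]}
int_C = partition_intersection_graph(partitions)

# Assumed completion: join (chi_2, {1,2}) to (chi_1, {3,5,6}) and to (chi_3, {4,6}).
G = {v: set(nbrs) for v, nbrs in int_C.items()}
for u, w in [(V(2, 1, 2), V(1, 3, 5, 6)), (V(2, 1, 2), V(3, 4, 6))]:
    G[u].add(w)
    G[w].add(u)

# The ordering quoted in the example.
order = [V(2, 6), V(1, 2), V(1, 1, 4), V(3, 4, 6),
         V(2, 1, 2), V(1, 3, 5, 6), V(3, 2, 3), V(2, 3, 5)]
```

With these inputs, the ordering fails on $\operatorname{int}(\mathcal{C})$ itself, since the later neighbours $(\chi_2, \{1,2\})$ and $(\chi_3, \{4,6\})$ of $(\chi_1, \{1,4\})$ are not adjacent there, but it is a perfect elimination ordering of the completed graph. Both added edges join vertices with different first coordinates, so the completion is restricted in the sense of Definition 2.6.8.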

PAGE 33

Figure 2-8. Compatibility of characters. (a) A restricted chordal completion of $\operatorname{int}(\mathcal{C})$. (b) A maximal clique tree representation of $G$. (c) An X-tree on which each character in $\mathcal{C}$ is convex.

[1] discovered a number of polynomial-time algorithms, and so did [26].

PAGE 34

CHAPTER 3 MAXIMUM PARSIMONY

The maximum parsimony method is one of the more popular techniques used for reconstructing phylogenetic trees from characters. The concept behind this method is to fit character data to a semi-labeled tree in a way that minimizes convergent and reverse transitions. This way of thinking is based on the 'Ockham's Razor' principle, which says that a simple explanation is more likely and should be chosen over a more complex one. Here, the complexity is measured by the number of reverse and convergent transitions, with a homoplasy-free tree being the ideal case. Moreover, such transitions are generally considered to be relatively rare, and therefore a case with fewer transitions may be a more probable situation. In this chapter, we will explore the fundamental concepts of parsimony, looking mostly at classical parsimony. This method has direct connections to graph theory. We will deal mostly with full characters and phylogenetic X-trees in this chapter. Also, we will mostly deal with sequences of characters instead of sets of characters, to allow a character to appear more than once in the data [32].

3.1 Classical Parsimony

Definition 3.1.1. For a graph $G = (V, E)$ and a function $f$ on $V$, the changing set of $f$ is the subset
$$Ch(f) = \{\{u, v\} \in E : f(u) \neq f(v)\}$$
of edges of $G$. The changing number of $f$, denoted $ch(f)$, is the cardinality of $Ch(f)$.

Definition 3.1.2. Let $\chi : X' \to C$ be a character on $X$ and let $\mathcal{T} = (T; \phi)$ be an X-tree. An extension of $\chi$ to $\mathcal{T}$ is a function $\bar{\chi} : V(T) \to C$ for which $\bar{\chi} \circ \phi|_{X'} = \chi$. The parsimony score of $\chi$ on $\mathcal{T}$ is the minimum value of $ch(\bar{\chi})$ over all extensions $\bar{\chi}$ of $\chi$ to $\mathcal{T}$. We denote this score by $l(\chi, \mathcal{T})$. Furthermore, if $\bar{\chi}$ is an extension of $\chi$ to $\mathcal{T}$ and $ch(\bar{\chi}) = l(\chi, \mathcal{T})$, then $\bar{\chi}$ is called a minimum extension of $\chi$ to $\mathcal{T}$.

Definition 3.1.3. Let $\mathcal{C} = (\chi_1, \chi_2, \ldots, \chi_k)$ be a sequence of characters on $X$. The parsimony score of $\mathcal{C}$ on an X-tree $\mathcal{T}$, denoted by $l(\mathcal{C}, \mathcal{T})$, is the sum of the individual

PAGE 35

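parsimony scores of the characters in $\mathcal{C}$ on $\mathcal{T}$; thus,
$$l(\mathcal{C}, \mathcal{T}) = \sum_{i=1}^{k} l(\chi_i, \mathcal{T}).$$
An X-tree $\mathcal{T}'$ that minimizes $l(\mathcal{C}, \mathcal{T})$ is called a maximum parsimony tree for $\mathcal{C}$, and the corresponding value of $l(\mathcal{C}, \mathcal{T})$ is denoted $l(\mathcal{C})$.

Proposition 3.1.4. Let $\chi$ be an r-state character on $X$ and let $\mathcal{T}$ be an X-tree. Then $l(\chi, \mathcal{T}) \geq r - 1$. Moreover, $l(\chi, \mathcal{T}) = r - 1$ if and only if $\chi$ is convex on $\mathcal{T}$.

Proof. Let $T$ be the underlying tree of $\mathcal{T}$ and let $\bar{\chi}$ be a minimum extension of $\chi$ to $\mathcal{T}$. Let $T^{*}$ denote the tree obtained from $T$ by contracting every edge in $E(T) - Ch(\bar{\chi})$, and consider the mapping on $V(T^{*})$ induced by $\bar{\chi}$. Since the cardinality of the image of this mapping is $r$, we have $|V(T^{*})| \geq r$. Therefore, as $|E(T^{*})| = ch(\bar{\chi})$ and $T^{*}$ is a tree, we have $l(\chi, \mathcal{T}) = ch(\bar{\chi}) \geq r - 1$. Furthermore, if $\chi$ is convex on $\mathcal{T}$, then $|V(T^{*})| = r$. Hence, equality holds if and only if $\chi$ is convex on $\mathcal{T}$.

In Chapter 2 we discussed reverse and convergent transitions and homoplasy-free characters. We can now count the number of such transitions, defined as $h(\chi, \mathcal{T})$, using the following formulas given by [32].

Let $\chi : X' \to C$ be an r-state character on $X$ and let $\mathcal{T}$ be an X-tree. Let
$$h(\chi, \mathcal{T}) = l(\chi, \mathcal{T}) - r + 1.$$
$h(\chi, \mathcal{T})$ is sometimes referred to as the homoplasy of $\chi$ on $\mathcal{T}$. From Proposition 3.1.4 we know that $h(\chi, \mathcal{T})$ is non-negative, and is equal to zero precisely when $\chi$ is convex on $\mathcal{T}$, that is, homoplasy-free. Now let $\mathcal{C}$ be a sequence $(\chi_1, \chi_2, \ldots, \chi_k)$ of characters on $X$ and let $\mathcal{T}$ be a maximum parsimony tree for $\mathcal{C}$. The quantity
$$h(\mathcal{C}) = \sum_{i=1}^{k} h(\chi_i, \mathcal{T})$$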

PAGE 36

isthecalledthetotalhomoplasyof C andmeasuresthenumberofreverseand convergenttransitionsthatneedtobepostulatedifallthecharactersin C evolvedon acommonX-tree.ThefollowingcorollaryisanimediateconsequenceofProposition 3.1.4. Corollary3.1.5. Supposethat C = 1 2 ,..., k isasequenceofcharacterson X Then, h C 0 withequalitypreciselyif C iscompatible. Theparsimonyscoreofacharacteronasemi-labeledtreecanbeviewedinterms ofsetsofedgesseparatingverticesassigneddifferentcharacterstatesorintermsof amaximalsystemofpathsundercertainrestrictions.Wewillconsiderbothwaysof viewingparsimony. Denition3.1.6. Let : X 0 C beacharacteron X andlet T = T ; beanX-tree with T = V E .Asubset E 1 of E isacut-setfor on T if,foreachpair x y 2 X 0 with x 6 = y ,thevertices x and y lieindifferentcomponentsofthedisjointunion oftrees T n E 1 Aminimumcut-setfor on T isacut-setfor ofminimumsizeandthesizeof suchacut-setisdenotedby cut T .Thesetofallminimumcut-setsfor on T is Cut T Thefollowingexampleillustratestheseconcepts.Let T betheX-treeshownin Figure3-1awith X = f 1,2,3,4,5,6,7 g .Let : X !f g bethecharacterdened by = = = = = = ,and = .Onecaneasily checkthat l T =3 .Figure3-1bindicatesaminimumextension of to T as wellasthethreecorrespondingedgesin Ch .Notethatthisminimumextensionisnot uniqueforthistreeandcharacterset .InFigure3-1c,theset f e 1 e 2 e 3 e 4 g ofedges of T isacut-setfor ,butitisnotthechangingsetofanyextensionof .Thisshows thatanarbitrarycut-set E 1 for doesnotnecessarilycorrespondto Ch forsome extension of .However,Lemma3.1.7showsthisisnotthecaseif E 1 isaminimum cut-setfor 36

PAGE 37

Figure 3-1. (a) The X-tree. (b) A minimum extension of a character. (c) Edges forming a cut-set.

Lemma 3.1.7. Let $\chi$ be a character on $X$, let $\mathcal{T} = (T; \phi)$ be an X-tree with $T = (V, E)$, and let $E_1$ be a subset of $E$. If $E_1$ is a minimum cut-set for $\chi$, then $E_1 = Ch(\bar{\chi})$ for a unique extension $\bar{\chi}$ of $\chi$.

Proof. Suppose $E_1$ is a minimum cut-set for $\chi$. If $V'$ is the set of vertices of a component of the disjoint union of trees $T \setminus E_1$, then as $E_1$ is a cut-set for $\chi$ we must have $|\chi(\phi^{-1}(V'))| \leq 1$. Furthermore, $|\chi(\phi^{-1}(V'))| \neq 0$; otherwise, by selecting an edge $e$ in $E_1$ that is incident with a vertex in $V'$, the set $E_1 - \{e\}$ is a cut-set for $\chi$, which contradicts the minimality of $E_1$. Hence, $|\chi(\phi^{-1}(V'))| = 1$. We next define an extension $\bar{\chi}_1$ of $\chi$ using the following criteria. Each vertex $v$ of $V$ lies in exactly one of the components of $T \setminus E_1$. If the vertex set of that component is $V'$, set $\bar{\chi}_1(v)$ equal to the unique element in $\chi(\phi^{-1}(V'))$. Now, each edge $\{u, v\}$ in $E_1$ satisfies $\bar{\chi}_1(u) \neq \bar{\chi}_1(v)$, since otherwise $E_1$ would not be minimal, and so we have $Ch(\bar{\chi}_1) = E_1$. Furthermore, it is easily checked that every

PAGE 38

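extension $\bar{\chi}$ of $\chi$ that satisfies $Ch(\bar{\chi}) = E_1$ must agree with $\bar{\chi}_1$ at all vertices in $V$. So, $\bar{\chi}_1$ is unique.

The following proposition shows that the parsimony score of a character $\chi$ on $X$ on an X-tree $\mathcal{T}$ is equal to the size of a minimum cut-set for $\chi$ on $\mathcal{T}$, and that any two minimum extensions that induce the same changing set are equal.

Proposition 3.1.8. Let $\chi$ be a character on $X$ and let $\mathcal{T}$ be an X-tree. Then, $cut(\chi, \mathcal{T}) = l(\chi, \mathcal{T})$. Furthermore, the map $\Theta$ from the set of minimum extensions of $\chi$ on $\mathcal{T}$ into $Cut(\chi, \mathcal{T})$ defined by $\Theta(\bar{\chi}) = Ch(\bar{\chi})$, for all such extensions, is a bijection.

Proof. Let $\bar{\chi}$ be a minimum extension of $\chi$ on $\mathcal{T}$. Then, $ch(\bar{\chi}) = l(\chi, \mathcal{T})$, and since $Ch(\bar{\chi})$ is a cut-set for $\chi$ on $\mathcal{T}$, $cut(\chi, \mathcal{T}) \leq l(\chi, \mathcal{T})$. Now, suppose that $E_1 \in Cut(\chi, \mathcal{T})$. Then, by Lemma 3.1.7,
$$cut(\chi, \mathcal{T}) = |E_1| \geq \min\{ch(\bar{\chi}_1) : \bar{\chi}_1 \text{ is an extension of } \chi\} = l(\chi, \mathcal{T}),$$
establishing $cut(\chi, \mathcal{T}) \geq l(\chi, \mathcal{T})$, and thereby the first part of the proposition.

We now prove the second part. Since $Ch(\bar{\chi})$ is a cut-set for $\chi$ on $\mathcal{T}$ and since $ch(\bar{\chi}) = l(\chi, \mathcal{T})$, it follows from the first part of the proposition that $Ch(\bar{\chi}) \in Cut(\chi, \mathcal{T})$. Moreover, by Lemma 3.1.7, the map $\Theta$ is a bijection.

Having described the parsimony score in terms of the sets of edges that separate vertices with different character states, we now examine the parsimony score in terms of a maximal system of paths. We first do this for two-state characters using the following classical graph theory result proven by [28].

Lemma 3.1.9 (Menger's Lemma). Let $G = (V, E)$ be a graph, and let $V_1$ and $V_2$ be disjoint subsets of $V$. Then, the maximum number of edge-disjoint paths in $G$ with the property that each path has one endpoint in $V_1$ and the other endpoint in $V_2$ is equal to the minimum number of edges whose removal from $G$ leaves the vertices in $V_1$ in different components from the vertices in $V_2$.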

PAGE 39

Menger'sLemma3.1.9canbeappliedtobinarycharacters.First,however,weneed tointroducethefollowingconcept.ForanX-tree T = T ; ,apath P in T isaproper pathrelativetoacharacter on X if,forsome x y 2 X P connects x and y ,and x 6 = y .CombiningLemma3.1.9withProposition3.1.8weobtainthefollowing corollary. Corollary3.1.10. Let beatwo-statecharacteron X andlet T = T ; beanXtree.Then,themaximumparsimonyscore l T isequaltothemaximumnumberof edge-disjointproperpathsof T relativeto ToillustrateCorollary3.1.10,considerthephylogeneticX-tree T showninFigure 3-2aandthebinarycharacter : X !f g ,where = = = = and = = = .EachdashedpathinFigure3-2bisproperrelativeto .Furthermore,asthesepathsareedgedisjointandas j )]TJ/F23 7.9701 Tf 6.587 0 Td [(1 j =3 ,thesetofthese pathsisamaximum-sizedsetofedge-disjointproperpathsof T relativeto .Hence,by Corollary3.1.10,themaximumparsimonyscoreis l T =3 Figure3-2.Amaximum-sizedsetofedge-disjointproperpaths. Erd osandSz ekely[8]haveextendedtheconceptinCorollary3.1.10toarbitrary charactersbypermittingpathstointersect,providedsomeconditionsaremet. SupposethereisaphylogeneticX-tree T = T ; andacharacter on X .A collection D ofdirectedpathsin T isanErd os-Sz ekelypathsystemfor on T ifit satisesthefollowingtwoconditions: 39

PAGE 40

iIf P 2D ,then P connectstwoleaves x and y of T forwhich x 6 = y iiLet P and P 0 bepathsin D thatsharesomeedge.Then, P and P 0 traversethis edgeinthesamedirectionand,if x and y denotetheterminalverticesof P and P 0 x 6 = y ConsiderthephylogeneticX-tree T showninFigure3-3aandthefullcharacter : X !f g ,where = = = = ,and = = AnErd os-Sz ekelypathsystemfor on T isshowninFigure3-3b.Notethatthepath systemisnotunique. Figure3-3.AnErd os-Sz ekelypathsystem. Theorem3.1.11. Let beacharacteron X andlet T beaphylogeneticX-tree.Then theparsimonyscore l T isequaltothemaximumsizeofanErd os-Sz ekelypath systemfor on T Theorem3.1.11isduetoErd osandSz ekely[8],whoalsoprovidedapolynomial-time algorithmtoconstructanexplicitErd os-Sz ekelypathsystemforagivencharacteron X andphylogeneticX-tree.TheresultofTheorem3.1.11canbeextendedtoX-treesas shownby[32].Let T = T ; beanX-treeandsupposethat,forsome x 2 X x isaninternalvertexon T .Let V 1 V 2 ,..., V k denotethevertexsetsofthecomponents of T n x and,foreach i 2f 1,2,..., k g ,let T i denotethesubgraphof T inducedby V i [f x g .Let X i = )]TJ/F23 7.9701 Tf 6.587 0 Td [(1 V T i ,let i = j X i ,andlet T i = T i ; i forall i .Itisnow 40

PAGE 41

easily seen that
$$l(\chi, \mathcal{T}) = \sum_{i=1}^{k} l(\chi_i, \mathcal{T}_i),$$
where $\chi_i = \chi|_{X_i}$ for all $i$. This process, repeated for all $x$ for which $\phi(x)$ is an interior vertex, provides a collection of phylogenetic trees whose individual parsimony scores are summed to get the parsimony score of $\chi$ on $\mathcal{T}$.

3.2 Optimization on a Fixed Tree

The next step in understanding the maximum parsimony method, as presented in [32], is sometimes called the 'small parsimony problem'. Given a fixed tree, we wish to compute the parsimony score of a sequence of characters.

Suppose we have a character $\chi : X \to C$ and a phylogenetic X-tree $\mathcal{T}$ with $v$ internal vertices. There are several ways of computing $l(\chi, \mathcal{T})$. The first one is the brute-force approach of searching all $|C|^v$ extensions of $\chi$ to find a minimum extension of $\chi$ on $\mathcal{T}$. However, this method is impractical for more than 20 internal vertices, since the number of such extensions grows exponentially. A second, more accessible approach is computing the parsimony score using a general dynamic programming approach, which requires $O(|X||C|^2)$ time. This method does not use any fixed underlying mathematical concepts and is not examined in this thesis. The third method is a classic in the field, being described in its entirety in [11]. The Fitch-Hartigan algorithm requires $O(|X||C|)$ time. The algorithm's framework is as follows.

If $\mathcal{T}$ is unrooted, we introduce a root arbitrarily, since the final output of the method is independent of the root chosen and the location where it is inserted. The method has two main passes of the tree. The initial pass, described previously [11], is called a forward pass and allows us to compute the parsimony score. We assign non-empty subsets of $C$ and corresponding integers to the vertices of $T$ recursively from the leaves to the root. The subsequent pass, the backward pass from the root back to the leaves, constructs a minimum extension of $\chi$ to $\mathcal{T}$.
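For very small trees, the brute-force approach mentioned above is easy to write down and is useful for checking faster methods. The following Python sketch enumerates all $|C|^v$ assignments of states to the internal vertices and counts changing edges. The function name `parsimony_brute_force`, the edge-list representation, and the four-leaf example tree are assumptions made for illustration, and the sketch assumes a full character whose labels sit on the leaves.

```python
from itertools import product

def parsimony_brute_force(edges, leaf_state, states):
    """Minimum changing number over all extensions of a full character.

    edges: list of (u, v) pairs of the tree; leaf_state: leaf -> state;
    states: the state set C. Runs in time exponential in the number of
    internal vertices, so it is only suitable for tiny examples.
    """
    vertices = {v for e in edges for v in e}
    internal = sorted(vertices - set(leaf_state))
    best = None
    for choice in product(states, repeat=len(internal)):
        extension = dict(leaf_state)
        extension.update(zip(internal, choice))
        changes = sum(extension[u] != extension[v] for u, v in edges)
        if best is None or changes < best:
            best = changes
    return best

# Hypothetical tree: leaves a, b attached to u; leaves c, d attached to v.
edges = [("a", "u"), ("b", "u"), ("u", "v"), ("c", "v"), ("d", "v")]
convex_score = parsimony_brute_force(edges, {"a": 0, "b": 0, "c": 1, "d": 1}, [0, 1])
crossed_score = parsimony_brute_force(edges, {"a": 0, "b": 1, "c": 0, "d": 1}, [0, 1])
```

In this example the first character has score 1, which equals $r - 1$ for $r = 2$ states, so its homoplasy $l - r + 1$ is 0. The second has score 2 and homoplasy 1, consistent with the bound $l(\chi, \mathcal{T}) \geq r - 1$ holding with equality only for convex characters.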

PAGE 42

Algorithm 3.2.1. Let $\chi : X \to C$ be a character and let $\mathcal{T}$ be a rooted phylogenetic X-tree with root $\rho$. Define two maps $\hat{\chi} : V(T) \to 2^{C} - \{\emptyset\}$ and $l : V(T) \to \{0, 1, 2, \ldots\}$ recursively as follows. Let $v \in V(T)$. If $v$ is a leaf of $T$ labelled by $x \in X$, set $\hat{\chi}(v) = \{\chi(x)\}$ and $l(v) = 0$. For the first half of the algorithm, we analyze all paths from the leaves to the root $\rho$, considering preceding vertices to be descendants of the following vertices in each path. For each $i \in \{1, 2, \ldots, 2n - 1\}$, let $v_{i1}, v_{i2}, \ldots, v_{ik}$ denote the immediate descendants of $v_i$, $n$ being the number of leaves of $T$. For every interior vertex $v_i$ of $T$, let
$$f(v_i) = \max_{\alpha \in C} |\{j : \alpha \in \hat{\chi}(v_{ij})\}|.$$
Now, for each such $i$, set $\hat{\chi}(v_i)$ to be the set of character states of $C$ that appear in $f(v_i)$ of the sets $\hat{\chi}(v_{i1}), \hat{\chi}(v_{i2}), \ldots, \hat{\chi}(v_{ik})$. In particular, when $k = 2$,
$$\hat{\chi}(v_i) = \begin{cases} \bigcap_{j=1}^{k} \hat{\chi}(v_{ij}) & \text{if } \bigcap_{j=1}^{k} \hat{\chi}(v_{ij}) \neq \emptyset, \\ \bigcup_{j=1}^{k} \hat{\chi}(v_{ij}) & \text{otherwise.} \end{cases}$$
Also, define the quantity
$$l(v_i) = \sum_{j=1}^{k} l(v_{ij}) + k - f(v_i).$$
Next, associate with each vertex $v$ of $V(T)$ the ordered pair $(\hat{\chi}(v), l(v))$.

This assignment is called the forward pass, and the parsimony score is given by Theorem 3.2.2 [17].

Theorem 3.2.2. Let $\chi : X \to C$ be a character and let $\mathcal{T}$ be a rooted phylogenetic X-tree with root $\rho$. Suppose that we have completed the forward pass on the vertices of $T$. Then
(i) $l(\rho) = l(\chi, \mathcal{T})$, and
(ii) $\hat{\chi}(\rho) = \{\alpha \in C : \text{there is a minimum extension } \bar{\chi} \text{ of } \chi \text{ to } \mathcal{T} \text{ with } \bar{\chi}(\rho) = \alpha\}$.
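The forward pass of Algorithm 3.2.1 can be sketched as a post-order traversal. The Python code below is a hedged illustration rather than the formulation of [11] or [17]: the function name `forward_pass`, the `children` dictionary used to encode the rooted tree, and the small example are assumptions introduced here. It follows the recursion stated above, keeping at each interior vertex the states that occur in the largest number of child sets and adding $k - f(v)$ to the accumulated score.

```python
from collections import Counter

def forward_pass(children, leaf_state, root):
    """Forward (leaves-to-root) pass on a rooted tree.

    children: vertex -> list of immediate descendants (leaves may be absent);
    leaf_state: leaf -> character state. Returns two dictionaries giving, for
    every vertex, its preliminary state set and its partial parsimony score.
    """
    sets, score = {}, {}

    def visit(v):
        kids = children.get(v, [])
        if not kids:                          # leaf: singleton set, score 0
            sets[v], score[v] = {leaf_state[v]}, 0
            return
        for c in kids:
            visit(c)
        # counts[s] = number of child sets containing state s
        counts = Counter(s for c in kids for s in sets[c])
        f = max(counts.values())
        sets[v] = {s for s, n in counts.items() if n == f}
        score[v] = sum(score[c] for c in kids) + len(kids) - f

    visit(root)
    return sets, score

# Hypothetical rooted tree: root r with children u and v; u has leaves a, b; v has leaves c, d.
children = {"r": ["u", "v"], "u": ["a", "b"], "v": ["c", "d"]}
sets, score = forward_pass(children, {"a": "A", "b": "C", "c": "A", "d": "C"}, "r")
```

Here both `u` and `v` receive the set {A, C} with score 1, and the root receives {A, C} with score 2, so by Theorem 3.2.2 the parsimony score of this character on this tree is 2 and either state can be placed at the root of some minimum extension. The recursive traversal is adequate for small examples; an iterative traversal would be preferable for deep trees.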

PAGE 43

Thecharacter-statesets obtainedforeachvertex v of T throughforwardpass cannotbeusedfortheminimalextensionbecauseofthedecienciesofthepreliminary phasedescribedin[11].ThesedecienciesareshowninFigure3-4. Figure3-4.Decienciesofforwardpassreconstruction. Figure3-4ashowsapreliminaryphasereconstructionofapositionfromthree leaves.Theset f g atthelowerancestralvertexrepresentstheimpossibilityto decidewhethertheancestralcharacterstatewas or .Theonlycertaintyisthata replacementisrequired.Thethirdvertexrequirestheultimateancestortobean Therefore,bytheassumptionsofparsimony,thelowerancestralvertexhastobean asinFigure3-4b.Theeliminationof fromtherstancestorisdeterminedbywhat Fitchcallstheruleofdiminishedambiguity.[11]Thepreciseformulationofthisruleis encompassedinstepsIandIIofthebackwardpassAlgorithm3.2.3,tobepresented furtheron. Figure3-4chasthreepossiblecharacter-statesfortheultimateancestorandtwo statesforthelowerancestralvertex.Whilethechoicecanbemadeinseveralways tomaximizeparsimony,Figure3-4dpresentsanadequatesolutionwhichwasnot comprisedinthepossiblealternativesinavailableinc.Thiscaseisencompassedby 43

PAGE 44

the rule of expanded ambiguity, which is described in steps III and IV of Algorithm 3.2.3.

Figure 3-4e shows a third reconstruction that has four leaves and two ancestors that need replacement. Figure 3-4f represents a valid solution. In this type of case, two vertices separated by a single vertex, both containing a character state not present in the set of the intermediary vertex, can assign that character state to it. Hence, this is called the rule of encompassing ambiguity and is comprised in step V of Algorithm 3.2.3.

Algorithm 3.2.3. The preliminary character-state set for the root $\rho$ is made the final set for that vertex. Then, go to one of its descendent vertices $v$ and proceed according to the following six steps:
I. If the preliminary set $\hat{\chi}(v)$ contains all character states present in the final set of its immediate ancestor, go to step II; otherwise, go to III.
II. Eliminate all character states from the preliminary set $\hat{\chi}(v)$ that are not present in the final set of its immediate ancestor and go to VI.
III. If the preliminary set $\hat{\chi}(v)$ was formed by a union of its descendent sets, go to IV; otherwise, go to V.
IV. Add to the preliminary set $\hat{\chi}(v)$ any character states in the final set of its immediate ancestor that are not present in $\hat{\chi}(v)$ and go to VI.
V. Add to the preliminary set $\hat{\chi}(v)$ any character states not already present, provided that they are present in both the final set of the immediate ancestor and in at least one of the two immediately descendent preliminary sets, and go to VI.
VI. The preliminary set $\hat{\chi}(v)$ is now final. Descend one vertex as long as any preliminary vertex sets remain and return to I above.

This algorithm applies to all interior vertices. When the first rule does not apply, then and only then does the second rule apply. The third rule can apply if and only if the first two do not. They mutually exclude one another.

Figure 3-5 illustrates an example of applying the forward and backward pass to the phylogenetic X-tree $\mathcal{T}$ presented in Figure 3-5a. It also shows the location where we

PAGE 45

introducetheroot .Figure3-5bshowstherootedX-tree T 0 afterhavingperformed theforwardpass.Eachinteriorvertexwasassignedapreliminarycharacter-stateset andtheparsimonyscoreofthatvertex.BasedonTheorem3.2.2,theparsimonyscore oftheX-treeis l T 0 = l =4 .Figure3-5cpresents T 0 afterthebackwardpass.The characterstatesinparenthesesaretheonesoverlookedbyforwardpassbecauseofits deciencies.Thecharacter-statesetsofeachvertexarethepossiblecharacterstates thatthevertexcanbeassignedinordertogetamaximumextensionoftheoriginal X-tree T Figure3-5.aTheX-tree.bForwardpass.cBackwardpass. Thexed-treeproblemforparsimonycanbegeneralizedinanumberofdirections. Incertainsituations,thecharacterstateassignedbysomeparticularcharacters andspeciesmaybeunspeciedorambiguous.Theproblemcanbealsoextendedto graphs.Generalizedparsimonyassignsdifferentnon-negativeweightstothetransitions 45

PAGE 46

amongcharacterstates.Itspremiseistopenalizerareorunfrequenttransitions.These directionsarediscussedin[32]butwerenotpursuedinthisthesis. 3.3TreeRearrangementOperations Whencalculatingmaximumparsimony,oneneedstoconsidermultipletreessince theancestralspeciesrepresentedbytheinteriorverticesandtherelationshipsamong themareunknown.However,eveninthesettingofclassicalparsimony,theproblemof ndingamaximumparsimonytreeforasequence C ofbinarycharactersisNP-hard. [12]Forthesetasks,branch-and-boundalgorithmshavebeeneffectivewhendealing withmodestnumberofspecies.Forlargernumbers,researchersuseheuristicmethods basedontreerearrangementoperations.[32] Thepremisefortherearrangementoperationsisthat T = T ; isabinary phylogeneticX-treeand e = f u v g isanedgeof T .Weintroducethreetypesoftree rearrangementoperations,beginningwiththeleastrestrictiveoperations.Notthatall operationsarereversible. TreeBisectionandReconnectionTBR:Let T 0 bethebinarytreeobtained from T bydeleting e ,addinganedgebetweenavertexthatsubdividesanedgeofone componentof T n e andavertexthatsubdividesandedgeoftheothercomponentof T n e ,andthensuppressinganyresultingdegree-twovertices.Ifacomponentof T n e consistsofasinglevertex,thentheaddededgeisattachedtothisvertex.Thebinary phylogeneticX-tree T 0 ; 0 ,where 0 x = x forall x 2 X ,isobtainedfrom T bya singletreebisectionandreconnectionoperation.Figure3-6illustratesthegenericform ofthisoperation,where e isthedeletededgeand f istheaddededge. Figure3-6.AschematicrepresentationonthegenericTBRoperation. 46

PAGE 47

SubtreePruneandRegraftSPR:Let T 0 bethebinarytreeobtainedfrom T bydeleting e pruningasubtree,addinganedgebetweenanendvertex u of e and avertexthatsubdividesanedgeinthecomponentof T n e thatdoesnotcontain u regraftingthesubtreeandthensuppressinganyresultingdegree-twovertices.The binaryphylogeneticX-tree T 0 ; 0 ,where 0 x = x forall x 2 X ,issaidtobe obtainedfrom T byasinglesubtreepruneandregraftingoperation.Thegenericformof theSPRoperationisillustratedinFigure3-7,where e isthedeletededgeand f isthe addededge. Figure3-7.AschematicrepresentationonthegenericSPRoperation. NearestNeighborInterchangeNNI:Supposethat e 0 = f v v 0 g isaninterioredge of T adjacentto e .Let T 0 bethebinarytreeobtainedfrom T bydeleting e ,addingan edgebetween u andavertexthatsubdividesanedgewiththeendvertex v 0 in T n e interchangingtwosubtreesaccros e 0 ,andthensuppressinganyresultingdegree-two vertices.ThebinaryphylogeneticX-tree T 0 ; 0 ,where 0 x = x forall x 2 X ,issaid tobeobtainedfrom T byasinglenearestneighborinterchangeoperation.[32]Figure 3-8illustratesthegenericformoftheNNIoperation,where e isthedeletededgeand f istheaddededge. Figure3-8.AschematicrepresentationonthegenericNNIoperation. 47
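The NNI operation defined above can be sketched in code. The following is a minimal illustration, not the thesis's own implementation: it assumes an unrooted binary tree stored as an adjacency dictionary, with leaves labelled by strings and interior vertices by integers. The helper names (`caterpillar`, `side`, `splits`, `nni_neighbors`) are hypothetical. Each interior edge admits two interchanges, and trees are compared through the splits induced by their interior edges.

```python
def caterpillar(n):
    """Build an unrooted caterpillar tree with leaves 'x1'..'xn' (n >= 4).

    Interior vertices are the integers 0..n-3; the tree is an adjacency dict.
    """
    adj = {}

    def link(p, q):
        adj.setdefault(p, set()).add(q)
        adj.setdefault(q, set()).add(p)

    for i in range(n - 3):              # path of interior vertices
        link(i, i + 1)
    link(0, "x1")                       # extra leaf at one end of the path
    for i in range(n - 2):              # one leaf hanging off each interior vertex
        link(i, f"x{i + 2}")
    link(n - 3, f"x{n}")                # extra leaf at the other end
    return adj


def side(adj, node, parent):
    """Leaves reachable from `node` without passing through `parent`."""
    if isinstance(node, str):
        return {node}
    found = set()
    for nb in adj[node]:
        if nb != parent:
            found |= side(adj, nb, node)
    return found


def splits(adj):
    """Canonical form of a tree: the set of splits induced by interior edges."""
    leaves = frozenset(v for v in adj if isinstance(v, str))
    ref = min(leaves)
    out = set()
    for u in adj:
        for v in adj[u]:
            if isinstance(u, int) and isinstance(v, int):
                s = frozenset(side(adj, v, u))
                out.add(leaves - s if ref in s else s)  # side without `ref`
    return frozenset(out)


def nni_neighbors(adj):
    """All trees obtained by one interchange across one interior edge."""
    result = []
    for u in adj:
        for v in adj[u]:
            if not (isinstance(u, int) and isinstance(v, int) and u < v):
                continue                              # each interior edge once
            b = sorted(adj[u] - {v}, key=str)[1]      # a subtree hanging off u
            for x in adj[v] - {u}:                    # either subtree off v
                new = {k: set(nb) for k, nb in adj.items()}
                new[u] = (new[u] - {b}) | {x}         # swap subtrees b and x
                new[v] = (new[v] - {x}) | {b}
                new[b] = (new[b] - {u}) | {v}
                new[x] = (new[x] - {v}) | {u}
                result.append(new)
    return result


tree = caterpillar(6)
distinct = {splits(t) for t in nni_neighbors(tree)}
print(len(distinct))  # two neighbours for each of the n - 3 = 3 interior edges
```

Comparing trees by their split sets avoids having to test adjacency dictionaries for isomorphism; the caterpillar shape is only a convenient test input.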


Clearly, any NNI operation is a restriction of an SPR operation and any SPR operation is a particular type of TBR operation. Also, every TBR operation is either a single SPR operation or the composition of two SPR operations.

Proposition 3.3.1. Let $\mathcal{T}$ and $\mathcal{T}'$ be elements of $B(n)$. Then $\mathcal{T}'$ can be obtained from $\mathcal{T}$ by a sequence of NNI operations and, therefore, by a sequence of SPR or TBR operations [29].

Theorem 3.3.2. Let $\mathcal{T}$ be an element of $B(n)$. Then the number of elements of $B(n)$, apart from $\mathcal{T}$, that can be obtained by
(i) a single NNI operation on $\mathcal{T}$ is $2(n-3)$, and
(ii) a single SPR operation on $\mathcal{T}$ is $2(n-3)(2n-7)$ [2].

Proof. To prove (i), let $e$ be an interior edge of $\mathcal{T}$. Then the number of distinct elements of $B(n)$ that can be obtained by a single NNI operation across $e$ is two. It is easily seen that neither of these elements can be obtained by a single NNI operation across any other interior edge of $\mathcal{T}$. Since $\mathcal{T}$ has $n-3$ interior edges, the number of elements of $B(n)$ that can be obtained by a single NNI on $\mathcal{T}$ is $2(n-3)$.

For the proof of (ii), we note that any SPR operation is precisely one of three types:
(a) the pruned edge is adjacent to the edge on which the subtree is regrafted;
(b) the pruned edge is separated by exactly one edge from the edge on which the subtree is regrafted; and
(c) the pruned edge is separated by at least two edges from the edge on which the subtree is regrafted.

For a type (a) operation, the resulting element is $\mathcal{T}$ itself. For type (b), the single SPR operation corresponds to a single NNI operation on $\mathcal{T}$. Thus, by part (i) of this theorem, there are $2(n-3)$ elements of $B(n)$ resulting from a single type (b) SPR operation on $\mathcal{T}$.

Now consider type (c). Every SPR operation on $\mathcal{T}$ corresponds to an ordered pair of distinct edges of $\mathcal{T}$, where the first component is the pruned edge and the second component is the edge on which the subtree is regrafted. One can easily check that two distinct type (c) SPR operations on $\mathcal{T}$ result in two distinct elements of $B(n)$. Since $\mathcal{T}$ has $2n-3$ edges, there are $(2n-3)(2n-4)$ ordered pairs of distinct edges in total. From this total we subtract the number of ordered pairs corresponding to type (a) and type (b) SPR operations on $\mathcal{T}$.

The number of ordered pairs corresponding to a type (a) SPR operation on $\mathcal{T}$ is $6(n-2)$, since there are six such pairs for each of the $n-2$ interior vertices. A type (b) SPR operation corresponds to an NNI operation across an interior edge. Therefore, the number of ordered pairs corresponding to a type (b) SPR operation on $\mathcal{T}$ is $8(n-3)$, since there are eight such pairs for each of the $n-3$ interior edges. Therefore, the number of distinct elements of $B(n)$ resulting from a type (c) SPR operation on $\mathcal{T}$ is

$$(2n-3)(2n-4) - [6(n-2) + 8(n-3)] = 4(n-3)(n-4).$$

Combining the results of (a) through (c), we deduce that the total number of distinct elements of $B(n)$, apart from $\mathcal{T}$, resulting from a single SPR operation on $\mathcal{T}$ is

$$0 + 2(n-3) + 4(n-3)(n-4) = 2(n-3)(2n-7).$$
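The two identities used at the end of the proof can be checked by direct expansion. The following worked algebra is a supplementary verification rather than part of the original argument.

```latex
\begin{align*}
(2n-3)(2n-4) - [6(n-2) + 8(n-3)]
  &= (4n^2 - 14n + 12) - (14n - 36) \\
  &= 4n^2 - 28n + 48 = 4(n-3)(n-4), \\
0 + 2(n-3) + 4(n-3)(n-4)
  &= 2(n-3)\,[1 + 2(n-4)] = 2(n-3)(2n-7).
\end{align*}
```

For example, $n = 5$ gives $2 \cdot 2 \cdot 3 = 12$ trees reachable by a single SPR operation, of which $2(n-3) = 4$ are also reachable by a single NNI operation.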
Observe from the statement of Theorem 3.3.2 that the number of elements of $B(n)$ that can be obtained by a single NNI or SPR operation on $\mathcal{T}$ is independent of the shape of $\mathcal{T}$. However, in the case of a single TBR operation on $\mathcal{T}$, the number of such elements depends on the shape of $\mathcal{T}$. Nevertheless, this number is $O(n^3)$ regardless of the shape of $\mathcal{T}$ [32].

3.4 Relevance

Even though parsimony methods are based on a specific optimality criterion, they do not require explicit models of evolutionary change. There is considerable disagreement over whether being "model free" is an advantage or a disadvantage. Either way, not having an explicit model does not make parsimony more reliable. The method does make assumptions, but they are implicit and difficult to define. One example is that the acceptance of an optimal tree under parsimony requires the assumption that parsimony methods are unlikely to estimate an incorrect tree. The ability of an estimation method to converge to a true value, in this case the correct tree, is known as consistency. It has been shown that, in some cases, parsimony favors incorrect trees more strongly as the number of characters increases. This situation is known as the Felsenstein zone of inconsistency [34]. Nonetheless, maximum parsimony presents itself to biologists as a reasonable estimator for the number of changes and the structure of the tree, even if it is imperfect under some conditions.

The maximum parsimony method was used in the undergraduate research to determine the branch lengths of the best-fit model because parsimony yields integer values. The best-fit model for the molecular data was established by the Akaike Information Criterion on a preliminary tree according to expert evaluation [22]. Even though the tree was calculated with maximum likelihood, maximum parsimony branch lengths were used so that molecular change would be represented by numbers of events (numbers of mutations) comparable to the morphological change [20].
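To make the parsimony machinery of this chapter concrete, the forward and backward passes of Section 3.2 can be sketched for a single character on a rooted binary tree. This is a minimal sketch under stated assumptions. The `Node` class and function names are hypothetical. The forward pass uses the standard Fitch rule (intersection of the two descendant sets if it is non-empty, otherwise their union at a cost of one change), which is assumed here because that algorithm is stated earlier in the thesis. The backward pass follows steps I through VI of Algorithm 3.2.3.

```python
class Node:
    """A vertex of a rooted binary tree; a leaf carries one observed state."""

    def __init__(self, left=None, right=None, state=None):
        self.left, self.right = left, right
        self.prelim = {state} if state is not None else set()
        self.final = set()
        self.by_union = False   # was the preliminary set formed by a union?


def forward_pass(node):
    """Assign preliminary sets; return the number of changes in the subtree."""
    if node.left is None:                       # leaf: nothing to do
        return 0
    cost = forward_pass(node.left) + forward_pass(node.right)
    common = node.left.prelim & node.right.prelim
    if common:
        node.prelim, node.by_union = set(common), False
    else:
        node.prelim = node.left.prelim | node.right.prelim
        node.by_union = True
        cost += 1                               # a replacement is required
    return cost


def backward_pass(node, parent=None):
    """Turn preliminary sets into final sets, following Algorithm 3.2.3."""
    if node.left is None:                       # leaves keep their observed state
        node.final = set(node.prelim)
        return
    if parent is None:                          # root: preliminary set is final
        node.final = set(node.prelim)
    elif parent.final <= node.prelim:           # steps I and II
        node.final = node.prelim & parent.final
    elif node.by_union:                         # steps III and IV
        node.final = node.prelim | parent.final
    else:                                       # step V
        below = node.left.prelim | node.right.prelim
        node.final = node.prelim | (parent.final & below)
    backward_pass(node.left, node)              # step VI: descend
    backward_pass(node.right, node)


# Hypothetical three-leaf example in the spirit of Figure 3-4a and 3-4b.
lower = Node(Node(state="A"), Node(state="C"))
root = Node(lower, Node(state="A"))
score = forward_pass(root)
backward_pass(root)
print(score, sorted(lower.prelim), sorted(lower.final))
```

In this example the lower ancestral vertex starts with an ambiguous preliminary set and is narrowed to the state of the root by steps I and II, which is the behaviour the rule of diminished ambiguity describes.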


CHAPTER 4
MAXIMUM LIKELIHOOD

Maximum likelihood is another popular method for inferring phylogeny that has been gaining recognition, especially in recent years. This method is a specific implementation of maximum likelihood estimation, the popular statistical method used for fitting a statistical model to data and providing estimates for the model's parameters. For a fixed set of data and an underlying probability model, maximum likelihood picks the values of the model parameters that make the data more likely than any other values of the parameters would make them. In contrast to parsimony methods, which are based mainly on graph theory and combinatorics, maximum likelihood methods of phylogenetic inference evaluate a hypothesis about evolutionary history in terms of probability. The assumption behind it is that a history with a higher probability of reaching the current set of species being observed is preferable to other hypotheses with lower probabilities of giving rise to the observed state. Maximum likelihood methods attempt to account for unobserved as well as observed substitutions. They choose the hypothesis that maximizes the probability of observing the current data [34].

Maximum likelihood frequently yields estimates with lower variance than other methods, which means it is often the method least affected by sampling error. It also takes into account the fact that substitution processes taking place at different sites have much in common and that the major components determining the evolution of sequences can be described by just a few parameters, a fact overlooked by most methods. For this reason, likelihood tends to outperform other estimation methods even with very short sequences of characters. However, the perceived and actual complexities of obtaining a solution to problems with numerous alternative hypotheses have hindered the widespread use of this method [34]. As a rule, the examples provided in this chapter use molecular character states, since most of the time maximum likelihood is applied to molecular sequence data and since the parameters of the models presented are based on genetic data. Also, unless otherwise stated, the set of character states will be $C = \{A, C, T, G\}$.

4.1 Basic Principles

A concrete evolutionary model is needed to perform maximum likelihood. This model may be fully defined or may contain parameters to be estimated from the data. A maximum likelihood approach evaluates the probability that the chosen model, a phylogenetic X-tree, will have generated the observed data. Phylogenetic trees are then inferred by finding the trees that yield the highest likelihoods. The actual process is complex because different tree topologies require different mathematical treatments [31].

Let $\mathcal{T} = (T; \phi)$ be the input X-tree with $T = (V, E)$ on which we will perform the maximum likelihood approach. Let $\chi : X \to C$ be a full character on $X$. Let $\mathcal{C}$ be a collection of characters $\chi$ on $X$. We want to calculate the probability that $\mathcal{T}$ could have generated the character set $\mathcal{C}$ under the chosen model. Most models are time-reversible, which means that the probability of character state $i$ changing into character state $j$ is the same as the probability of $j$ changing into $i$. If an unrooted tree is evaluated using likelihood, it is convenient to root the tree arbitrarily at an interior vertex since, if the model is time-reversible, the likelihood of a tree is generally independent of the location of the root. We define $L_{\chi}$ as the likelihood value of character $\chi$ and $L_{\mathcal{C}}$ as the likelihood of the set of characters $\mathcal{C}$. Under the assumption that character states evolve independently, we calculate the likelihood for each character $\chi \in \mathcal{C}$ separately and combine the likelihoods into $L_{\mathcal{C}}$. Since the likelihood is a probability value, we have

$$L_{\mathcal{C}} = L_{\chi_1} \cdot L_{\chi_2} \cdots L_{\chi_N} = \prod_{j=1}^{N} L_{\chi_j},$$

where $N = |\mathcal{C}|$. Because the probability of any single observation is extremely small, we almost always evaluate the log of the likelihood instead:

$$\ln L_{\mathcal{C}} = \ln L_{\chi_1} + \ln L_{\chi_2} + \cdots + \ln L_{\chi_N} = \sum_{j=1}^{N} \ln L_{\chi_j}. \quad [34]$$

To calculate the likelihood for a character $\chi \in \mathcal{C}$, we must consider all possible scenarios by which we could get the set of leaves of $T$. Some of these scenarios are more plausible than others, but every case has at least some probability of generating any pattern of observed leaves. Hence, there are $|C|^{|\mathring{V}(T)|}$ possibilities to consider, where $\mathring{V}(T)$ is the set of interior vertices of $T$. Define a scenario as the set $\{\bar{\chi}(v_1), \bar{\chi}(v_2), \ldots, \bar{\chi}(v_k)\}$, where $k = |\mathring{V}(T)|$ and $\bar{\chi} : \mathring{V}(T) \to C$. Since any of these scenarios could have led to the configuration at the leaves of the tree, we must calculate the probability of each and sum them to obtain the likelihood $L_{\chi}$ for each character $\chi \in \mathcal{C}$ [34].

To illustrate the process, consider the aligned set of nucleotide sequences introduced in Figure 4-1a. Suppose we want to evaluate the likelihood of the unrooted tree shown in Figure 4-1b, which we root as shown in Figure 4-1c. Each interior vertex of the tree might possess any of the character states A, C, T, or G. Since the tree has two interior vertices, there are $4 \times 4 = 16$ possibilities to consider. The calculation of the likelihood for a character $\chi$ is illustrated schematically in Figure 4-1d.

Maximum likelihood calculates the probabilities based on the branch lengths of the model tree. In likelihood methods, branch lengths represent the expected number of character-state changes along a branch, that is, along an edge of the tree. If a branch is short, there is a relatively low probability of a single change occurring along that particular branch, and an almost negligible probability of more than one change. We assume that changes along different branches are independent. Thus, the probability of any single scenario $\{\bar{\chi}(v_1), \bar{\chi}(v_2), \ldots, \bar{\chi}(v_k)\}$ is equal to the product of the probabilities of the changes required by that scenario [34]. For example, the probability of the scenario represented by the first term of Figure 4-1d is equal to the probability that the character state at the root is an A (typically $1/4$ or the average frequency of A in the original sequence, depending on the type of model), times the probability of retaining an A along one edge, times the probability of an A $\to$ C change along an exterior edge leading to a leaf, and so on.

Figure 4-1. Overview of the calculation of the likelihood of a tree. (a) Hypothetical sequence alignment. (b) An unrooted tree for the four taxa in (a). (c) Tree after rooting at an arbitrary interior vertex. (d) Likelihood of character $\chi$.

4.2 Models of Sequence Evolution

We now examine how the probabilities of the various changes are calculated. These probabilities depend on several assumptions about the process of nucleotide substitution, which define a substitution model. The models explored in this thesis are restricted to Markov models, in which the probability of change from state $i$ to state $j$ at a given site does not depend on the history of the site prior to its having state $i$. We also assume that the substitution probabilities do not change in different parts of the tree; in other words, the evolutionary mechanisms constitute a homogeneous Markov process. These assumptions are not necessarily biologically plausible; they are consequences of modeling substitutions as stochastic Markovian processes [34].

The mathematical expression of a model is a table of rates of substitution per site per unit of evolutionary distance. For DNA sequences, these rates are expressed as a $4 \times 4$ instantaneous rate matrix $Q$. Each element $Q_{ij}$ represents the rate of change from character state $i$ to state $j$ during some time period $dt$. The rows and columns of $Q$ correspond to the bases A, C, G, and T, in this order. The most general form of this matrix is

$$Q = \begin{pmatrix}
\cdot & \mu a \pi_C & \mu b \pi_G & \mu c \pi_T \\
\mu g \pi_A & \cdot & \mu d \pi_G & \mu e \pi_T \\
\mu h \pi_A & \mu j \pi_C & \cdot & \mu f \pi_T \\
\mu i \pi_A & \mu k \pi_C & \mu l \pi_G & \cdot
\end{pmatrix},$$

where the diagonal elements are set to the negative of the sum of the off-diagonal elements in the corresponding row. The factor $\mu$ represents the mean instantaneous substitution rate, and it is modified by the relative rate parameters $a, b, c, \ldots, l$, which correspond to each possible transformation from one base to another. The product of the mean instantaneous substitution rate and a relative rate parameter constitutes a rate parameter. The remaining parameters $\pi_A$, $\pi_C$, $\pi_G$, and $\pi_T$ are called frequency parameters; they correspond to the frequencies of the bases A, C, G, and T in the known set of leaves [37]. We assume these frequencies remain constant over time and that the rate of change to each base is proportional to the equilibrium frequency but independent of the identity of the starting base. The diagonal elements of $Q$ are always chosen so that the elements in the corresponding row sum to zero. Almost all DNA substitution models proposed are special cases of this matrix [34]. Analogous matrices can be defined for protein sequence data, except that they have 20 states instead of 4.

Time-reversible models have the following rate parameter restrictions: $g = a$, $h = b$, $i = c$, $j = d$, $k = e$, and $l = f$. Hence the relative rate parameters are symmetric under this restriction (the matrix $Q$ itself is symmetric only when the base frequencies are also equal). The most general time-reversible model, GTR, is then represented by

$$Q = \begin{pmatrix}
\cdot & \mu a \pi_C & \mu b \pi_G & \mu c \pi_T \\
\mu a \pi_A & \cdot & \mu d \pi_G & \mu e \pi_T \\
\mu b \pi_A & \mu d \pi_C & \cdot & \mu f \pi_T \\
\mu c \pi_A & \mu e \pi_C & \mu f \pi_G & \cdot
\end{pmatrix},$$

with diagonal elements set to the negative of the sum of the off-diagonal elements in the corresponding row [25]. Most of the remaining models used for maximum likelihood tree inference can be obtained by further restricting the parameters of the GTR matrix $Q$. It is often desirable to reduce the number of free parameters, especially when they are unknown and need to be estimated from the data. This can be achieved by introducing constraints based on appropriate symmetries. For example, nucleotide substitutions fall into two major groups. Substitutions in which a purine is exchanged for a pyrimidine or vice versa are called transversions. Pyrimidines are the single-ringed nucleobases C (cytosine) and T (thymine), whereas purines are the double-ringed nucleobases A (adenine) and G (guanine). The possible transversions are $A \leftrightarrow C$, $A \leftrightarrow T$, $C \leftrightarrow G$, and $G \leftrightarrow T$. All other substitutions are transitions. These can also be separated into substitutions between purines, called purine transitions ($A \leftrightarrow G$), and substitutions between pyrimidines, called pyrimidine transitions ($C \leftrightarrow T$).

For instance, the TrN model [35] separates substitutions into transversions, purine transitions, and pyrimidine transitions by requiring that $a = c = d = f$. Similarly, we can obtain Kimura's three-substitution-type (K3ST) model [24] by requiring that all bases occur in equal frequency ($\pi_A = \pi_C = \pi_G = \pi_T = 1/4$) and dividing the substitution types into transitions ($b = e$), $A \leftrightarrow T$ or $C \leftrightarrow G$ transversions ($c = d$), and $A \leftrightarrow C$ or $G \leftrightarrow T$ transversions ($a = f$). Zharkikh [38] described a model (SYM) almost identical to GTR, except that it assumes equal base frequencies.

Further restrictions on the parameters lead to additional popular models. The simplest model is the one proposed by Jukes and Cantor (JC) [21], in which all base frequencies are equal ($\pi_A = \pi_C = \pi_G = \pi_T = 1/4$) and all substitutions occur at the same rate ($a = b = c = d = e = f = 1$):

$$Q = \begin{pmatrix}
-\tfrac{3}{4}\mu & \tfrac{1}{4}\mu & \tfrac{1}{4}\mu & \tfrac{1}{4}\mu \\
\tfrac{1}{4}\mu & -\tfrac{3}{4}\mu & \tfrac{1}{4}\mu & \tfrac{1}{4}\mu \\
\tfrac{1}{4}\mu & \tfrac{1}{4}\mu & -\tfrac{3}{4}\mu & \tfrac{1}{4}\mu \\
\tfrac{1}{4}\mu & \tfrac{1}{4}\mu & \tfrac{1}{4}\mu & -\tfrac{3}{4}\mu
\end{pmatrix}.$$

The base frequency and substitution rate are typically combined into a single parameter $\alpha = \mu/4$ for simplicity, to form

$$Q = \begin{pmatrix}
-3\alpha & \alpha & \alpha & \alpha \\
\alpha & -3\alpha & \alpha & \alpha \\
\alpha & \alpha & -3\alpha & \alpha \\
\alpha & \alpha & \alpha & -3\alpha
\end{pmatrix}.$$

Kimura's two-parameter model (K2P) [23] is so called because it assigns different rates to transitions and transversions but keeps base frequencies equal. Thus, we set $a = c = d = f = 1$ and $b = e = k$. Letting the transition rate be $\alpha = \mu k/4$ and the transversion rate be $\beta = \mu/4$, the matrix $Q$ becomes

$$Q = \begin{pmatrix}
-\alpha - 2\beta & \beta & \alpha & \beta \\
\beta & -\alpha - 2\beta & \beta & \alpha \\
\alpha & \beta & -\alpha - 2\beta & \beta \\
\beta & \alpha & \beta & -\alpha - 2\beta
\end{pmatrix}.$$
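The relationship between the GTR matrix and its special cases can be sketched numerically. The following is an illustrative sketch, assuming the row and column order A, C, G, T used above. The function names `rate_matrix`, `jukes_cantor`, and `kimura_2p` are hypothetical, and the relative rates are passed as a dictionary keyed by the letters $a$ through $f$.

```python
BASES = "ACGT"
# Relative rate parameter attached to each unordered pair of bases (GTR layout).
PAIR_TO_RATE = {"AC": "a", "AG": "b", "AT": "c", "CG": "d", "CT": "e", "GT": "f"}


def rate_matrix(mu, rates, freqs):
    """Build a time-reversible Q: Q[i][j] = mu * rate(i, j) * pi_j for i != j."""
    Q = [[0.0] * 4 for _ in range(4)]
    for i, x in enumerate(BASES):
        for j, y in enumerate(BASES):
            if i != j:
                key = PAIR_TO_RATE["".join(sorted(x + y))]
                Q[i][j] = mu * rates[key] * freqs[y]
        Q[i][i] = -sum(Q[i])          # each row sums to zero
    return Q


def jukes_cantor(mu):
    """JC: equal frequencies, all relative rates equal to one."""
    equal = {b: 0.25 for b in BASES}
    return rate_matrix(mu, dict.fromkeys("abcdef", 1.0), equal)


def kimura_2p(mu, k):
    """K2P: equal frequencies, transitions (b, e) scaled by k."""
    equal = {b: 0.25 for b in BASES}
    rates = {"a": 1.0, "b": k, "c": 1.0, "d": 1.0, "e": k, "f": 1.0}
    return rate_matrix(mu, rates, equal)


for row in kimura_2p(mu=1.0, k=4.0):
    print(["%6.2f" % q for q in row])
```

With $k = 1$ the K2P matrix produced this way coincides with the JC matrix, which mirrors the nesting of the models described above.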


Note that $k$ represents the transition bias, that is, the ratio of the transition rate to the transversion rate. If $k = 1$, there is no preference between the two rates and the model becomes the JC model. Since there are twice as many kinds of transversions as transitions, the expected ratio of transitions to transversions is then $1/2$. Similarly, if $k = 4$ one would expect twice as many transitions as transversions. Other models, HKY85 [18] and Felsenstein's F81 [9], generalize the K2P and JC models, respectively, by allowing for unequal base frequencies. Felsenstein used a different method (F84) [10] to accommodate unequal base frequencies in a two-parameter model. It has a general substitution rate for all types of substitutions and a within-group substitution rate only for transitions. This can be achieved by setting $a = c = d = f = 1$, $b = 1 + K/\pi_R$, and $e = 1 + K/\pi_Y$, where $K$ is the parameter determining the transition to transversion ratio, $\pi_R = \pi_A + \pi_G$, $\pi_Y = \pi_C + \pi_T$, and the diagonal elements are set to the negative of the sum of the off-diagonal elements in the corresponding row. This model has two components for each transition, because transitions can occur due to either the general substitution rate or the within-group rate [34]. The matrix is of the form

$$Q = \begin{pmatrix}
\cdot & \mu \pi_C & \mu \pi_G \left(1 + \tfrac{K}{\pi_R}\right) & \mu \pi_T \\
\mu \pi_A & \cdot & \mu \pi_G & \mu \pi_T \left(1 + \tfrac{K}{\pi_Y}\right) \\
\mu \pi_A \left(1 + \tfrac{K}{\pi_R}\right) & \mu \pi_C & \cdot & \mu \pi_T \\
\mu \pi_A & \mu \pi_C \left(1 + \tfrac{K}{\pi_Y}\right) & \mu \pi_G & \cdot
\end{pmatrix}.$$

4.3 Calculating Change Probabilities

The instantaneous rate matrix $Q$ specifies the rates of change between pairs of nucleotides per instant of time $dt$. Once $Q$ is determined by the chosen model, we need the probabilities of change from any state to any other along a branch of length $t$. The substitution probability matrix is calculated by

$$P(t) = e^{Qt}.$$
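The matrix exponential just defined can be evaluated numerically. The sketch below is one possible approach, not the method used in the thesis: it sums a truncated Taylor series after scaling the matrix down and then squares the result back up. The function names and the default numbers of terms and squarings are assumptions chosen for a $4 \times 4$ rate matrix of moderate size. The Jukes-Cantor matrix of Section 4.2 serves as the test input.

```python
import math


def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(M, terms=30, squarings=10):
    """Approximate exp(M) by a Taylor series on M / 2**squarings, then square."""
    n = len(M)
    scale = 2.0 ** squarings
    S = [[x / scale for x in row] for row in M]
    result = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, S)]   # S**k / k!
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def jc_rate_matrix(mu):
    """Jukes-Cantor Q: off-diagonal mu/4, diagonal -3*mu/4."""
    a = mu / 4.0
    return [[-3 * a if i == j else a for j in range(4)] for i in range(4)]


def transition_matrix(Q, t):
    """Substitution probability matrix P(t) = exp(Q t)."""
    return mat_exp([[q * t for q in row] for row in Q])


P = transition_matrix(jc_rate_matrix(1.0), 0.5)
print(P[0][0], 0.25 + 0.75 * math.exp(-0.5))   # numerical value vs. JC closed form
print([round(sum(row), 12) for row in P])      # each row is a distribution
```

A library routine for the matrix exponential would normally be preferable; the explicit series is used here only to keep the example self-contained.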


[5][37] The components $P_{ij}(t)$ of this matrix satisfy the following conditions:

$$\sum_{j=1}^{n} P_{ij}(t) = 1 \quad \text{and} \quad P_{ij}(t) > 0 \ \text{for } t > 0.$$

Moreover, the matrix also fulfills the requirement that

$$P(t + s) = P(t) P(s),$$

known as the Chapman-Kolmogorov equation, and the initial condition

$$P_{ij}(0) = \begin{cases} 1, & \text{for } i = j \\ 0, & \text{for } i \neq j. \end{cases}$$

From the Chapman-Kolmogorov equation, the forward and backward differential equations are obtained:

$$\frac{d}{dt} P(t) = P(t) Q = Q P(t).$$

The exponential can be evaluated with the use of matrix algebra by decomposing $Q$ into its eigenvalues and eigenvectors [31].

Several models allow for simple expressions for the eigenvalues, providing an analytic calculation of the substitution probability matrix $P(t)$. For example,

JC:
$$P_{ij}(t) = \begin{cases} \tfrac{1}{4} + \tfrac{3}{4} e^{-\mu t}, & i = j \\ \tfrac{1}{4} - \tfrac{1}{4} e^{-\mu t}, & i \neq j \end{cases}$$

K2P:
$$P_{ij}(t) = \begin{cases} \tfrac{1}{4} + \tfrac{1}{4} e^{-\mu t} + \tfrac{1}{2} e^{-\mu t \frac{k+1}{2}}, & i = j \\ \tfrac{1}{4} + \tfrac{1}{4} e^{-\mu t} - \tfrac{1}{2} e^{-\mu t \frac{k+1}{2}}, & i \neq j, \text{ transition} \\ \tfrac{1}{4} - \tfrac{1}{4} e^{-\mu t}, & i \neq j, \text{ transversion} \end{cases} \quad [34]$$

The $P_{ii}(t)$ entry of the substitution probability matrix is the probability of no net substitution over an edge of length $t$, whereas $P_{ij}(t)$ is the probability of substitution from character state $i$ to $j$ along an edge of length $t$. This is the missing element needed to calculate the likelihood of a tree as presented in Section 4.1.

4.4 Differences in Perspective between Parsimony and Likelihood

The parsimony and likelihood approaches to inferring a phylogeny have some common characteristics. They are both criterion-based methods and are able to rank and evaluate the observed trees. Also, the cost of a given change under parsimony is analogous to the likelihood of the given change from the substitution matrix $P(t)$. In parsimony, the cost of placing a given state at an internal node is the sum of the costs of deriving both of the daughter trees from that state, whereas the likelihood of an ancestral state is the product of the likelihoods of the state giving rise to the daughter trees. In parsimony, the total cost of the tree is the sum of the costs at each position. Similarly, the net log-likelihood of a tree is the sum of the log-likelihoods of the evolution at each sequence position [34].

However, there are some essential differences between the two methods. The cost of a change in parsimony is not a function of branch length, unlike in maximum likelihood. For this reason, even though several minimum extensions of a character may be equally ranked based on their parsimony score, some may score higher than others under likelihood because of the branch lengths. Maximum parsimony looks only at the single, lowest-cost solution, whereas maximum likelihood looks at the combined likelihood for all solutions (ancestral states) consistent with the tree and the branch lengths. Another obvious difference is that maximum parsimony does not depend on a precise model as maximum likelihood does. The initial tree model needed for likelihood is based on a set of stated assumptions, while methods like maximum parsimony do not require one, which makes their assumptions implicit. Sometimes, model-free methods may be more likely to violate their implicit assumptions, since these are not as obvious as the assumptions of methods that state them explicitly. Having some idea of the phylogeny is relevant to the development of good models, but ever-improving models can also lead to better phylogenetic inferences. Thus, both types of methods are useful and important [34].

Even though the concept of maximum parsimony is based on Ockham's razor, it is questionable whether the method, as implemented in phylogenetics, actually fulfills Ockham's razor. Since there are simple stochastic models that assume all characters have similar probabilities of change (arguably a more parsimonious situation) under which maximum parsimony is inconsistent, it could be argued that maximum likelihood is actually more parsimonious.

In the undergraduate research, maximum likelihood methods were used to determine the best-fit model of evolution as well as to describe the phylogenetic trees estimated for all loci. Next, a computer program was used to estimate the expected branch lengths for the morphological and molecular data using the estimated branch lengths for all loci from PAUP*. The likelihood of the observed data was calculated using the Poisson model, in which case there is no variance inflation parameter, and the negative binomial model, where a variance inflation parameter is taken into account. The expectation, given a universal rate of evolution for both molecules and morphology, would be that the maximum likelihood estimate of the variance inflation parameter for the morphological data would fall within the range of variance inflation parameter estimates for the molecular data. The results of the undergraduate research suggested that, although the estimate of the variance inflation parameter for morphology is high, it does fall within the range of estimates of the variance inflation parameter for molecular data. In conclusion, it does appear that the rates of molecular and morphological evolution are correlated, at least within the galliforms [20].
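The likelihood calculation of Section 4.1 can be sketched for a four-leaf tree by combining the sum over the sixteen interior scenarios with the Jukes-Cantor probabilities of Section 4.3. This is a minimal sketch under stated assumptions. The tree is rooted at an interior vertex adjacent to leaves 1 and 2, and the other interior vertex is adjacent to leaves 3 and 4. The branch lengths `t1` to `t5`, the rate `mu`, and the function names are hypothetical, and the prior for the root state is taken as $1/4$.

```python
import math
from itertools import product

BASES = "ACGT"


def jc_prob(i, j, t, mu=1.0):
    """Jukes-Cantor probability of observing j after time t, starting from i."""
    decay = math.exp(-mu * t)
    return 0.25 + 0.75 * decay if i == j else 0.25 - 0.25 * decay


def site_likelihood(leaves, t, mu=1.0):
    """Likelihood of one character on the four-leaf tree.

    leaves: the four observed states (s1, s2, s3, s4).
    t: branch lengths (t1, t2, t3, t4, t5); t5 is the interior edge.
    """
    s1, s2, s3, s4 = leaves
    t1, t2, t3, t4, t5 = t
    total = 0.0
    for x, y in product(BASES, repeat=2):       # the 16 interior scenarios
        total += (0.25                          # prior for the root state x
                  * jc_prob(x, s1, t1, mu) * jc_prob(x, s2, t2, mu)
                  * jc_prob(x, y, t5, mu)       # interior edge
                  * jc_prob(y, s3, t3, mu) * jc_prob(y, s4, t4, mu))
    return total


def log_likelihood(alignment, t, mu=1.0):
    """Sum of per-site log-likelihoods over the columns of an alignment."""
    return sum(math.log(site_likelihood(column, t, mu))
               for column in zip(*alignment))


# Hypothetical alignment of four short sequences and arbitrary branch lengths.
alignment = ["ACGT", "ACGA", "GCTT", "GCTA"]
print(log_likelihood(alignment, (0.1, 0.2, 0.3, 0.4, 0.05)))
```

Because the Jukes-Cantor model is time-reversible, rooting the same tree at the other interior vertex leaves the value unchanged. The explicit double loop is practical only for very small trees, since the number of scenarios grows exponentially with the number of interior vertices.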


REFERENCES

[1] Agarwala, R. and Fernández-Baca, D. A polynomial-time algorithm for the phylogeny problem when the number of character states is fixed. SIAM Journal on Computing 23: 1216.
[2] Allen, B. L. and Steel, M. Subtree transfer operations and their induced metrics on evolutionary trees. Annals of Combinatorics 5: 1.
[3] Buneman, P. The recovery of trees from measures of dissimilarity. Mathematics in the Archaeological and Historical Sciences. eds. F. R. Hodson, D. G. Kendall, and P. Tautu. Edinburgh: Edinburgh University Press, 1971. 387.
[4] Buneman, P. A characterization of rigid circuit graphs. Discrete Mathematics 9: 205.
[5] Cox, D. R. and Miller, H. D. The Theory of Stochastic Processes. London: Chapman and Hall, 1977.
[6] Dirac, G. A. On rigid circuit graphs. Abhandlungen aus dem Mathematischen Seminar der Universität Hamburg 25: 71.
[7] Dobson, A. J. Unrooted trees for numerical taxonomy. Journal of Applied Probability 11: 32.
[8] Erdős, P. L. and Székely, L. A. Evolutionary trees: an integer multicommodity max-flow-min-cut theorem. Advances in Applied Mathematics 13: 488.
[9] Felsenstein, J. Evolutionary trees from DNA sequences: A maximum likelihood approach. Journal of Molecular Evolution 17: 368.
[10] Felsenstein, J. Distance methods for inferring phylogenies: A justification. Evolution 38: 16.
[11] Fitch, W. M. Toward defining the course of evolution: minimum change for a specific tree topology. Systematic Zoology 20: 406.
[12] Foulds, L. R. and Graham, R. L. The Steiner problem in phylogeny is NP-complete. Advances in Applied Mathematics 3: 43.
[13] Fulkerson, D. R. and Gross, O. A. Incidence matrices and interval graphs. Pacific Journal of Mathematics 15: 835.
[14] Gavril, F. The intersection graphs of subtrees in trees are exactly the chordal graphs. Journal of Combinatorial Theory 16: 47.
[15] Golumbic, M. C. Algorithmic Graph Theory and Perfect Graphs. New York: Academic Press, 1980.
[16] Harding, E. F. The probabilities of rooted tree shapes generated by random bifurcation. Advances in Applied Probability 3: 44.
[17] Hartigan, J. A. Minimum mutation fits to a given tree. Biometrics 29: 53.
[18] Hasegawa, M., Kishino, H., and Yano, T. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. Journal of Molecular Evolution 21: 160.
[19] Hendy, M. D., Little, C. H. C., and Penny, D. Comparing trees with pendant vertices labeled. SIAM Journal of Applied Mathematics 44: 1054.
[20] Iuhasz, N. R. and Braun, E. The Use of the Negative Binomial Method in Finding a Correlation Between the Rates of Morphological and Molecular Evolution. 2008. To be submitted.
[21] Jukes, T. H. and Cantor, C. R. Evolution of protein molecules. Mammalian Protein Metabolism. ed. H. N. Munro. New York: Academic Press, 1969. 21.
[22] Kimball, R. and Braun, E. Introns outperform exons in analyses of basal avian phylogeny using clathrin heavy chain genes. Journal of Avian Biology 410: 89.
[23] Kimura, M. A simple method for estimating evolutionary rate of base substitutions through comparative studies of nucleotide sequences. Journal of Molecular Evolution 16: 111.
[24] Kimura, M. Estimation of evolutionary distances between homologous nucleotide sequences. Proceedings of the National Academy of Sciences 78: 454.
[25] Lanave, C., Preparata, G., Saccone, C., and Serio, G. A new method for calculating evolutionary substitution rates. Journal of Molecular Evolution 20: 86.
[26] McMorris, F. R., Warnow, T. J., and Wimer, T. Triangulating vertex-colored graphs. SIAM Journal on Discrete Mathematics 7: 296.
[27] Meacham, C. A. Theoretical and computational considerations of the compatibility of qualitative taxonomic characters. Numerical Taxonomy. ed. J. Felsenstein, vol. G1 of NATO ASI. Berlin: Springer-Verlag, 1983. 304.
[28] Menger, K. Zur allgemeinen Kurventheorie. Fundamenta Mathematicae 10: 96.
[29] Robinson, D. F. Comparison of labeled trees with valency three. Journal of Combinatorial Theory 11: 105.
[30] Rose, D. J. Triangulated graphs and the elimination process. Journal of Mathematical Analysis and Applications 32: 597.
[31] Salemi, M. and Vandamme, A.-M. The Phylogenetic Handbook: A Practical Approach to DNA and Protein Phylogeny. Cambridge: Cambridge University Press, 2003.
[32] Semple, C. and Steel, M. Phylogenetics. Oxford: Oxford University Press, 2003.
[33] Steel, M. The complexity of reconstructing trees from qualitative characters and subtrees. Journal of Classification 9: 91.
[34] Swofford, D. L., Olsen, G. J., Waddell, P. J., and Hillis, D. M. Phylogenetic inference. Molecular Systematics. ed. D. M. Hillis. Sunderland, MA: Sinauer Associates Inc., 1996, second ed. 407.
[35] Tamura, K. and Nei, M. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Molecular Biology and Evolution 10: 512.
[36] Walter, J. R. Representations of chordal graphs as subtrees of a tree. Journal of Graph Theory 2: 265.
[37] Yang, Z. Estimating the pattern of nucleotide substitution. Journal of Molecular Evolution 39: 105.
[38] Zharkikh, A. Estimation of evolutionary distances between nucleotide sequences. Journal of Molecular Evolution 39: 315.


BIOGRAPHICAL SKETCH

Naomi Iuhasz was born and grew up in Resita, Romania. She graduated from Traian Lalescu Theoretical High School in 2002 with a major in Computer Science. She then moved to Florida, where she earned her Bachelor of Arts degree in mathematics from the University of Florida (UF) in 2008. Upon graduating in August 2008, she continued her graduate studies in the Department of Mathematics of the University of Florida, earning a Master of Science degree in applied mathematics in May 2010.