
Semi-Analytical Method for Analyzing Models and Model Selection Measures

Permanent Link: http://ufdc.ufl.edu/UFE0024733/00001

Material Information

Title: Semi-Analytical Method for Analyzing Models and Model Selection Measures
Physical Description: 1 online resource (174 p.)
Language: english
Creator: Dhurandhar, Amit
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2009

Subjects

Subjects / Keywords: Computer and Information Science and Engineering -- Dissertations, Academic -- UF
Genre: Computer Engineering thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract: Considering the large amounts of data that are collected every day in various domains such as health care, financial services, astrophysics and many others, there is a pressing need to convert this information into knowledge. Machine learning and data mining are both concerned with achieving this goal in a scalable fashion. The main theme of my work has been to analyze and better understand prevalent classification techniques and paradigms, which are an integral part of machine learning and data mining research, with an aim to reduce the gap between theory and practice. Machine learning and data mining researchers have developed a plethora of classification algorithms to tackle classification problems. Unfortunately, no one algorithm is superior to the others in all scenarios, nor is it entirely clear which algorithm should be preferred over others under specific circumstances. Hence, an important question is: what is the best choice of a classification algorithm for a particular application? This problem is termed classification model selection and is a very important problem in machine learning and data mining. The primary focus of my research has been to propose a novel methodology to study these classification algorithms accurately and efficiently in the non-asymptotic regime. In particular, we propose a moment-based method where, by focusing on the probabilistic space of classifiers induced by the classification algorithm and datasets of size $N$ drawn independently and identically (i.i.d.) from a joint distribution, we obtain efficient characterizations for computing the moments of the generalization error. Moreover, we can also study model selection techniques such as cross-validation, leave-one-out and hold-out set in our proposed framework.
This is possible since we have also established general relationships between the moments of the generalization error and the moments of the hold-out-set error, cross-validation error and leave-one-out error. Deploying the methodology, we were able to provide interesting explanations for the behavior of cross-validation. The methodology aims at closing the gap between results predicted by theory and the behavior observed in practice.
General Note: In the series University of Florida Digital Collections.
General Note: Includes vita.
Bibliography: Includes bibliographical references.
Source of Description: Description based on online resource; title from PDF title page.
Source of Description: This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility: by Amit Dhurandhar.
Thesis: Thesis (Ph.D.)--University of Florida, 2009.
Local: Adviser: Dobra, Alin.

Record Information

Source Institution: UFRGP
Rights Management: Applicable rights reserved.
Classification: lcc - LD1780 2009
System ID: UFE0024733:00001





First and foremost, I would like to thank the almighty for giving me the strength to overcome both academic and emotional challenges that I have faced in my pursuit of earning a doctorate degree. Without his strength I would not have been in this position today. Second, I would like to thank my family for their continued support and for the fun we have when we all get together. A very special thanks to my advisor, Dr. Alin Dobra, not only for his guidance but also for the great camaraderie that we share. I am grateful for having met such an intelligent, creative, full-of-life yet patient and helpful individual. I have thoroughly enjoyed the intense discussions (which others mistook for fights and actually bet on who would win) we have had in this time. I would like to thank Dr. Paul Gader and Dr. Arunava Banerjee for their insightful suggestions and encouragement during difficult times. I would also like to thank my other committee members, Dr. Sanjay Ranka and Dr. Ravindra Ahuja, for their invaluable inputs. I feel fortunate to have taken courses with Dr. Meera Sitharam and Dr. Anand Rangarajan, who are great teachers and taught me what it means to understand something. Last but definitely not the least, I would like to thank my friends and roommates, for without them life would have been dry. A special thanks to Hale, Kartik (or Kartiks should I say), Bhuppi, Ajit, Gnana, Somnath and many others for their support and encouragement. Thanks a lot guys! This would not have been possible without you all.


page
ACKNOWLEDGMENTS ..... 4
LIST OF TABLES ..... 8
LIST OF FIGURES ..... 9
ABSTRACT ..... 12
CHAPTER
1 INTRODUCTION ..... 14
1.1 Practical Impact ..... 15
1.2 Related Work ..... 16
1.3 Methodology ..... 18
1.3.1 What is the Methodology? ..... 18
1.3.2 Why have such a Methodology? ..... 18
1.3.3 How do I Implement the Methodology? ..... 19
1.4 Applying the Methodology ..... 20
1.4.1 Algorithmic Perspective ..... 20
1.4.2 Dataset Perspective ..... 21
1.5 Research Goals ..... 22
2 GENERAL FRAMEWORK ..... 25
2.1 Generalization Error (GE) ..... 26
2.2 Alternative Methods for Computing the Moments of GE ..... 29
3 ANALYSIS OF MODEL SELECTION MEASURES ..... 32
3.1 Hold-out Set Error ..... 32
3.2 Multifold Cross Validation Error ..... 34
4 NAIVE BAYES CLASSIFIER, SCALABILITY AND EXTENSIONS ..... 38
4.1 Example: Naive Bayes Classifier ..... 38
4.1.1 Naive Bayes Classifier Model (NBC) ..... 38
4.1.2 Computation of the Moments of GE ..... 39
4.2 Full-Fledged NBC ..... 42
4.2.1 Calculation of Basic Probabilities ..... 42
4.2.2 Direct Calculation ..... 43
4.2.3 Approximation Techniques ..... 44
4.2.3.1 Series approximations (SA) ..... 46
4.2.3.2 Optimization ..... 47
4.2.3.3 Random sampling using formulations (RS) ..... 55


................................. 56
4.3 Monte Carlo (MC) vs Random Sampling Using Formulations ..... 57
4.4 Calculation of Cumulative Joint Probabilities ..... 59
4.5 Moment Comparison of Test Metrics ..... 62
4.5.1 Hold-out Set ..... 62
4.5.2 Cross Validation ..... 63
4.5.3 Comparison of GE, HE, and CE ..... 64
4.6 Extension ..... 65
5 ANALYZING DECISION TREES ..... 82
5.1 Computing Moments ..... 82
5.1.1 Technical Framework ..... 83
5.1.2 All Attribute Decision Trees (ATT) ..... 84
5.1.3 Decision Trees with Non-trivial Stopping Criteria ..... 85
5.1.4 Characterizing path exists for Three Stopping Criteria ..... 87
5.1.5 Split Attribute Selection ..... 88
5.1.6 Random Decision Trees ..... 90
5.1.7 Putting things together ..... 92
5.1.7.1 Fixed Height ..... 92
5.1.7.2 Purity and Scarcity ..... 94
5.2 Experiments ..... 96
5.3 Discussion ..... 99
5.3.1 Extension ..... 99
5.3.2 Scalability ..... 101
5.4 Take-aways ..... 101
6 K-NEAREST NEIGHBOR CLASSIFIER ..... 108
6.1 Specific Contributions ..... 108
6.2 Technical Framework ..... 108
6.3 K-Nearest Neighbor Algorithm ..... 109
6.4 Computation of Moments ..... 110
6.4.1 General Characterization ..... 111
6.4.2 Efficient Characterization for Sample Independent Distance Metrics ..... 113
6.5 Scalability Issues ..... 118
6.6 Experiments ..... 119
6.6.1 General Setup ..... 120
6.6.2 Study 1: Performance of the KNN Algorithm for Different Values of k ..... 121
6.6.3 Study 2: Convergence of the KNN Algorithm with Increasing Sample Size ..... 122
6.6.4 Study 3: Relative Performance of 10-fold Cross Validation on Synthetic Data ..... 123


................................. 124
6.7 Discussion ..... 125
6.8 Possible Extensions ..... 127
6.9 Take-aways ..... 127
7 INSIGHTS INTO CROSS-VALIDATION ..... 132
7.1 Preliminaries ..... 134
7.2 Overview of the Customized Expressions ..... 140
7.3 Related Work ..... 142
7.4 Experiments ..... 143
7.4.1 Variance ..... 145
7.4.2 Expected value ..... 148
7.4.3 Expected value square + variance ..... 148
7.5 Take-aways ..... 148
8 CONCLUSION ..... 163
APPENDIX: PROOFS ..... 165
REFERENCES ..... 170
BIOGRAPHICAL SKETCH ..... 174


Table   page
2-1 Notation used throughout the thesis. ..... 31
4-1 Contingency table of input X ..... 67
4-2 Naive Bayes Notation. ..... 68
4-3 Empirical comparison of the cdf computing methods in terms of execution time. RSn denotes the Random Sampling procedure using n samples to estimate the probabilities. ..... 68
4-4 95% confidence bounds for Random Sampling. ..... 68
4-5 Comparison of methods for computing the cdf. ..... 68
6-1 Contingency table with v classes, M input vectors and total sample size N = sum_{i=1..M, j=1..v} N_ij. ..... 128


Figure   page
4-1 I have two attributes each having two values with 2 class labels. ..... 69
4-2 The current iterate yk just satisfies the constraint cl and easily satisfies the other constraints. ..... 69
4-3 Estimates of expected value of GE by MC and RS with increasing training set size N. ..... 70
4-4 Estimates of expected value of GE by MC and RS with increasing training set size N. ..... 70
4-5 Estimates of expected value of GE by MC and RS with increasing training set size N. ..... 71
4-6 Estimates of expected value of GE by MC and RS with increasing training set size N. ..... 71
4-7 Estimates of expected value of GE by MC and RS with increasing training set size N. ..... 72
4-8 The plot is of the polynomial (x+10)^4 x^2 y + (y+10)^4 y^2 x z = 0. ..... 72
4-9 HE expectation in single dimension. ..... 73
4-10 HE variance in single dimension. ..... 73
4-11 HE E[]+Std() in single dimension. ..... 74
4-12 HE expectation in multiple dimensions. ..... 74
4-13 HE variance in multiple dimensions. ..... 75
4-14 HE E[]+Std() in multiple dimensions. ..... 75
4-15 Expectation of CE. ..... 76
4-16 Individual run variance of CE. ..... 76
4-17 Pairwise covariances of CV. ..... 77
4-18 Total variance of cross validation. ..... 77
4-19 E[]+√ ..... 78
4-20 Convergence behavior. ..... 78
4-21 CE expectation. ..... 79
4-22 Individual run variance of CE. ..... 79


..... 80
4-24 Total variance of cross validation. ..... 80
4-25 E[]+√ ..... 81
4-26 Convergence behavior. ..... 81
5-1 The all attribute tree with 3 attributes A1, A2, A3, each having 2 values. ..... 103
5-2 Given 3 attributes A1, A2, A3, the path m11 m21 m31 is formed irrespective of the ordering of the attributes. ..... 103
5-3 Fixed Height trees with d=5, h=3 and attributes with binary splits. ..... 104
5-4 Fixed Height trees with d=5, h=3 and attributes with ternary splits. ..... 104
5-5 Fixed Height trees with d=8, h=3 and attributes with binary splits. ..... 104
5-6 Purity based trees with d=5 and attributes with binary splits. ..... 105
5-7 Purity based trees with d=5 and attributes with ternary splits. ..... 105
5-8 Purity based trees with d=8 and attributes with binary splits. ..... 105
5-9 Scarcity based trees with d=5, pb=N ..... 106
5-10 Scarcity based trees with d=5, pb=N ..... 106
5-11 Scarcity based trees with d=8, pb=N ..... 106
5-12 Comparison between AF and MC on three UCI datasets for trees pruned based on fixed height (h=3), purity and scarcity (pb=N ..... 107
6-1 b, c and d are the 3 nearest neighbours of a. ..... 128
6-2 The Figure shows the extent to which a point xi is near to x1. ..... 129
6-3 Behavior of the GE for different values of k. ..... 129
6-4 Convergence of the GE for different values of k. ..... 130
6-5 Comparison between the GE and 10 fold Cross validation error (CE) estimate for different values of k when the sample size (N) is 1000. ..... 130
6-6 Comparison between the GE and 10 fold Cross validation error (CE) estimate for different values of k when the sample size (N) is 10000. ..... 131
6-7 Comparison between true error (TE) and CE on 2 UCI datasets. ..... 131
7-1 Var(HE) for small sample size and low correlation. ..... 149
7-2 Var(HE) for small sample size and medium correlation. ..... 149


..... 150
7-4 Var(HE) for larger sample size and low correlation. ..... 150
7-5 Var(HE) for larger sample size and medium correlation. ..... 151
7-6 Var(HE) for larger sample size and high correlation. ..... 151
7-7 Cov(HEi, HEj) for small sample size and low correlation. ..... 152
7-8 Cov(HEi, HEj) for small sample size and medium correlation. ..... 152
7-9 Cov(HEi, HEj) for small sample size and high correlation. ..... 153
7-10 Cov(HEi, HEj) for larger sample size and low correlation. ..... 153
7-11 Cov(HEi, HEj) for larger sample size and medium correlation. ..... 154
7-12 Cov(HEi, HEj) for larger sample size and high correlation. ..... 154
7-13 Var(CE) for small sample size and low correlation. ..... 155
7-14 Var(CE) for small sample size and medium correlation. ..... 155
7-15 Var(CE) for small sample size and high correlation. ..... 156
7-16 Var(CE) for larger sample size and low correlation. ..... 156
7-17 Var(CE) for larger sample size and medium correlation. ..... 157
7-18 Var(CE) for larger sample size and high correlation. ..... 157
7-19 E[CE] for small sample size and low correlation. ..... 158
7-20 E[CE] for larger sample size and low correlation. ..... 158
7-21 E[CE] for small sample size at medium and high correlation. ..... 159
7-22 E^2[CE]+Var(CE) for small sample size and low correlation. ..... 159
7-23 E^2[CE]+Var(CE) for small sample size and medium correlation. ..... 160
7-24 E^2[CE]+Var(CE) for small sample size and high correlation. ..... 160
7-25 E^2[CE]+Var(CE) for larger sample size and low correlation. ..... 161
7-26 E^2[CE]+Var(CE) for larger sample size and medium correlation. ..... 161
7-27 E^2[CE]+Var(CE) for larger sample size and high correlation. ..... 162
A-1 Instances of possible arrangements. ..... 169



Vapnik [1998]. Should formulae become large and tedious to manipulate, the theoretical results are hard to obtain and use/interpret. The empirical method is well suited for validating intuitions but is significantly less useful for finding novel, interesting things, since a large number of experiments have to be conducted in order to reduce the error to a reasonable level. This is particularly difficult when small probabilities are involved, making the empirical evaluation impractical in such a case. An ideal scenario, from the point of view of producing interesting results, would be to use theory to make as much progress as possible but potentially obtaining uninterpretable formulae, followed by visualization to understand and find consequences of such formulae. This would avoid the limitation of theory to use only nice formulae and the limitation of empirical studies to perform large experiments. The role of the theory could be to significantly reduce the amount of computation required and the role



Vapnik [1998]. SLT categorizes classification algorithms (actually the more general learning algorithms) into different classes called Concept Classes. The concept class of a classification algorithm is determined by its Vapnik-Chervonenkis (VC) dimension, which is related to the shattering capability of the algorithm. Given a 2-class problem, the shattering capability of a function refers to the maximum number of points that the function can classify without making any errors, for all possible assignments of the class labels to the points in some chosen configuration. The shattering capability of an algorithm is the supremum of the shattering capabilities of all the functions it can represent. Distribution-free bounds on the generalization error (the expected error over the entire input) of a classifier built using a particular classification algorithm belonging to a concept class are derived in SLT. The bounds are functions of the VC dimension and the sample size. The strength of this technique is that by finding the VC dimension of an algorithm I can derive error bounds for the classifiers built using this algorithm without ever referring to the underlying distribution. A fallout of this very general characterization is that the bounds are usually


Boucheron et al. [2005], Williamson [2001], which in turn results in making statements about any particular classifier weak. There is a large body of both experimental and theoretical work that addresses the problem of understanding various model selection measures. The model selection measures that are relevant to our discussion are hold-out-set validation and cross-validation. Shao [1993] showed that asymptotically leave-one-out (LOO) chooses the best but not the simplest model. Devroye et al. [1996] derived distribution-free bounds for cross-validation. The bounds they found were for the nearest neighbour model. Breiman [1996] showed that cross-validation gives an unbiased estimate of the first moment of the generalization error. Though cross-validation has desired characteristics with estimating the first moment, Breiman stated that its variance can be significant. Theoretical bounds on LOO error under certain algorithmic stability assumptions were given by Kearns and Ron [1997]. They showed that the worst case error of the LOO estimate is not much worse than the training error estimate. Elisseeff and Pontil [2003] introduced the notion of training stability. They showed that even with this weaker notion of stability good bounds could be obtained on the generalization error. Blum et al. [1999] showed that v-fold cross-validation is at least as good as N/v hold-out set estimation on expectation. Kohavi [1995] conducted experiments on Naive Bayes and C4.5 using cross-validation. Through his experiments he concluded that 10-fold stratified cross-validation should be used for model selection. Moore and Lee [1994] proposed heuristics to speed up cross-validation. Plutowski's [1996] survey included proposals with theoretical results, heuristics and experiments on cross-validation. His survey was especially geared towards the behavior of cross-validation on neural networks. He inferred from the previously published results that cross-validation is robust. More recently, Bengio and Grandvalet [2003] proved that there is no universally unbiased estimator of the variance of cross-validation. Zhu and Rohwer [1996] proposed a simple setting in which cross-validation performs poorly. Goutte [1997] refuted this


Langford [2005]. Preliminary work of this nature was done in Braga-Neto and Dougherty [2005], where the authors characterized the discrete histogram rule. However, their analysis does not provide any indication of how other more popular algorithms can be characterized in similar fashion, keeping in mind scalability and accuracy. Specific classification schemes such as the W-statistic Anderson [2003] have been characterized in the past, but such analysis is very much limited to that and other similar statistics. The methodology I present here may potentially be applicable to a large variety of learning algorithms.

1.3.1 What is the Methodology?

The methodology for studying classification models consists in studying the behavior of the first two central moments of the GE of the classification algorithm studied. The moments are taken over the space of all possible classifiers produced by the classification algorithm, by training it over all possible datasets sampled independently and identically (i.i.d.) from some distribution. The first two moments give enough information about the statistical behavior of the classification algorithm to allow interesting observations about its behavior/trends. Higher moments may be computed using the same strategy suggested but might prove to be inefficient to compute.
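A minimal sketch of the two moments in question, on a toy distribution of my own choosing (one binary feature, a majority-vote rule standing in for a real learner, none of which is the thesis's actual setup): E[GE] and Var(GE) are estimated by repeatedly sampling training sets of size N and computing each induced classifier's exact generalization error under the known distribution.

```python
import random

# Toy joint distribution: one binary feature X, binary class Y (illustrative).
P_X = {0: 0.5, 1: 0.5}                 # P[X = x]
P_Y1 = {0: 0.8, 1: 0.3}                # P[Y = 1 | X = x]

def sample_dataset(n, rng):
    data = []
    for _ in range(n):
        x = int(rng.random() < P_X[1])
        y = int(rng.random() < P_Y1[x])
        data.append((x, y))
    return data

def majority_classifier(data):
    # Predict the majority class observed for each x (ties/unseen -> class 0).
    pred = {}
    for x in (0, 1):
        ones = sum(y for xi, y in data if xi == x)
        tot = sum(1 for xi, _ in data if xi == x)
        pred[x] = int(tot > 0 and 2 * ones > tot)
    return pred

def generalization_error(pred):
    # Exact GE under the known distribution.
    return sum(P_X[x] * (P_Y1[x] if pred[x] == 0 else 1 - P_Y1[x])
               for x in (0, 1))

rng = random.Random(0)
ges = [generalization_error(majority_classifier(sample_dataset(50, rng)))
       for _ in range(2000)]
mean = sum(ges) / len(ges)
var = sum((g - mean) ** 2 for g in ges) / len(ges)
print(mean, var)   # first two moments of GE over the space of induced classifiers
```

The exact characterizations in later chapters replace the outer sampling loop with closed-form sums, but the quantity being computed is the same.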



Vapnik [1998]. While the results thus obtained are very general, no particularity of the classification algorithm is exploited. The class of classifiers considered in this thesis comprises the classifiers obtained by applying the classification algorithm to a dataset of given size sampled i.i.d. from the underlying distribution. This leads to a different way of characterizing classifiers based on moment analysis. I develop a framework to analyze classification algorithms.


1. Naive Bayesian Classifier (NBC) model: NBC is a model which is extensively used in industry, due to its robustness, outperforming its more sophisticated counterparts in many real world applications (e.g. spam filtering in Mozilla Thunderbird and Microsoft Outlook, bio-informatics, etc.). There has been work on the robustness of NBC Domingos and Pazzani [1997], Rish [2001], but the proposed framework and the inter-relationships between the moments of the various errors help us to extensively study not just the model but also the behavior of the validation methods in conjunction with it.

2. Decision Trees (DT) model: Decision trees are also extensively used in data mining and machine learning applications. Besides performance, they are sometimes preferred over other models (e.g. Support Vector Machines, neural nets) because the process by which the eventual classifier is built from the sample is transparent. The probabilistic formulations will incorporate various pruning conditions such as purity, scarcity and fixed height. The formulations will help better understand the behavior of these trees for classification.

3. K-Nearest-Neighbor (KNN) Classifier model: This model is one of the simpler models, yet it is highly effective. Theoretical results exist Stone [1977] regarding convergence of the Generalization Error (GE) of this algorithm to Bayes error (best possible performance). However, this result is asymptotic, and for finite sample sizes in real scenarios finding the optimal value of K is more of an art than a science. The methodology proposed by us can be used to study the algorithm for different values of K and for different distance metrics accurately in controlled settings.


2-1 unless stated otherwise.


2-1 and Equation 2-2 I have the following theorem.


In both series of equations I made the transition from a summation over the class of classifiers to a summation over the possible outputs, since the focus changed from the classifier to the prediction of the classifier for a specific input (x is fixed inside the first summation). What this effectively does is allow the computation of moments using only local information (behavior on particular inputs), not global information (behavior on all inputs). This results in speeding up the process of computing the moments.


Notation used throughout the thesis.

Symbol   Meaning



Table 2-1 and realizing that all the data points are i.i.d., I derive the above result:

E_{D_t(N_t) D_s(N_s)}[HE] = E_{D_t(N_t)}\left[ E_{D_s(N_s)}\left[ \frac{1}{N_s} \sum_{(x,y) \in D_s} \lambda(\zeta[D_t](x), y) \right] \right]

I observe from the above result that the expected value of HE is dependent only on the size of the training set D_t. This result is intuitive since only N_t data points are used for building the classifier.

Proof. By Equation 3-1 I have: E_{D_t(N_t) D_s(N_s)}[HE^2] = \frac{1}{N_s^2} \cdots


Proof. ...

Unlike the first moment, the variance depends on the sizes of both the training set and the test set.
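Both facts above can be checked numerically on a toy problem of my own devising (a single binary label with a majority-class predictor, not one of the thesis's models): E[HE] is unchanged when only the test-set size N_s grows, while Var(HE) shrinks.

```python
import random

# Illustrative setup: Y ~ Bernoulli(p1), "classifier" = training-set majority class.
p1 = 0.3                      # P[Y = 1]
rng = random.Random(2)

def hold_out_error(nt, ns):
    train = [int(rng.random() < p1) for _ in range(nt)]
    pred = int(sum(train) * 2 > nt)              # majority class, ties -> 0
    test = [int(rng.random() < p1) for _ in range(ns)]
    return sum(y != pred for y in test) / ns

def moments(nt, ns, trials=6000):
    errs = [hold_out_error(nt, ns) for _ in range(trials)]
    m = sum(errs) / trials
    v = sum((e - m) ** 2 for e in errs) / trials
    return m, v

m_small, v_small = moments(nt=30, ns=10)
m_large, v_large = moments(nt=30, ns=100)
print(m_small, m_large)   # close: E[HE] depends only on N_t
print(v_small, v_large)   # v_small noticeably larger than v_large
```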


Substituting Equation 3-1 into the above equation, a direct definition for CV is obtained, if desired. In this case I have a classifier for each chunk, not a single classifier for the entire data. I model the selection of N i.i.d. samples that constitute the dataset D and the partitioning into v chunks. With this I have:

E_{D(N)}[CE] = \frac{1}{v} \sum_{i=1}^{v} E_{D(N)}[HE_i] = E_{D_t(\frac{v-1}{v}N)}[GE]

This result follows the intuition since it states that the expected error is the generalization error of a classifier trained on \frac{v-1}{v}N data points.
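The relation above, that the expected v-fold cross-validation error equals the expected GE of a classifier trained on (v-1)N/v points, can be checked by simulation. A minimal sketch with an illustrative majority-class predictor (not one of the thesis's models):

```python
import random

p1 = 0.3          # P[Y = 1] for a single binary label (illustrative)
N, v = 40, 4
rng = random.Random(1)

def train(labels):
    # Majority-class predictor; ties default to 0.
    return int(sum(labels) * 2 > len(labels))

def ge(pred):
    # Exact generalization error of a constant prediction.
    return p1 if pred == 0 else 1 - p1

def cv_error(labels, v):
    folds = [labels[i::v] for i in range(v)]
    errs = []
    for i in range(v):
        train_part = [y for j in range(v) if j != i for y in folds[j]]
        pred = train(train_part)
        errs.append(sum(y != pred for y in folds[i]) / len(folds[i]))
    return sum(errs) / v

trials = 4000
ce = sum(cv_error([int(rng.random() < p1) for _ in range(N)], v)
         for _ in range(trials)) / trials
ge_small = sum(ge(train([int(rng.random() < p1) for _ in range(N * (v - 1) // v)]))
               for _ in range(trials)) / trials
print(ce, ge_small)   # the two averages should agree up to Monte Carlo noise
```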


4.1 ... v. With this I have the following lemma,

E_{D_t^i(\frac{v-1}{v}N) D_t^j(\frac{v-1}{v}N)}[HE_i HE_j]

Proof.

E_{D_t^i(\frac{v-1}{v}N) D_t^j(\frac{v-1}{v}N)}[HE_i HE_j] = E_{D_t^i(\frac{v-1}{v}N)}\left[ HE_i \, E_{D_j(\frac{N}{v})}\left[ \frac{v}{N} \sum_{(x_j, y_j) \in D_j} \lambda(\zeta[D_t^j](x_j), y_j) \right] \right] = E_{D_t^i(\frac{v-1}{v}N)}\left[ HE_i \cdots \right] = E_{D_t^i(\frac{v-1}{v}\cdots


\cdots E_{D_t^j(\frac{v-1}{v}N)} \cdots CE = \frac{1}{vN} E_{D_t(\frac{v-1}{v}N)} \cdots Substituting into Equation 3-2, I derive the variance of CE.

It is worth mentioning that leave-one-out (LOO) is just a special case of v-fold cross validation (v = N for leave-one-out). The formulae above apply to LOO as well, thus no separate analysis is necessary. With this I have related the first two moments of HE and CE to those of GE. Hence, if I can compute the moments of GE I can also compute the moments of HE and CE, allowing us to study the model as well as the selection measures. In the next couple of chapters I thus focus our attention on computing the moments of GE efficiently for the following classification models: NBC, DT and KNN.



Table 4-1. Using the fact that P[Y = y_k] P[X = x | Y = y_k] = P[X = x ∧ Y = y_k] and the fact that P[X = x_i ∧ Y = y_k] is N_ik/N, each cell count in Table 4-1 can take O(N) values. The formulation of the first moment would be as follows. Given Table 4-1, all I care about in the classification process is the relative counts in each of the rows. Thus, if I had to classify a data point with attribute value x_i, I would classify it into class y_1 if N_i1 > N_i2 and vice-versa. What this means is that irrespective of the actual counts of N_i1 and N_i2, as long as N_i1 > N_i2 the classification algorithm would make the same prediction, i.e. I would have the same classifier. I can hence switch from going over the space of all possible datasets to going
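The switch from datasets to counts can be carried out literally for a small table. The sketch below computes E[GE] exactly by enumerating contingency-table counts and weighting each induced per-row-majority classifier by its multinomial probability (the cell probabilities and tie-breaking rule are illustrative choices of mine, not the thesis's):

```python
from math import comb

N = 12
p = {('x1', 'y1'): 0.30, ('x1', 'y2'): 0.10,
     ('x2', 'y1'): 0.15, ('x2', 'y2'): 0.45}

def pmf(n11, n12, n21, n22):
    # Multinomial probability of the contingency-table counts.
    c = comb(N, n11) * comb(N - n11, n12) * comb(N - n11 - n12, n21)
    return (c * p[('x1', 'y1')] ** n11 * p[('x1', 'y2')] ** n12
              * p[('x2', 'y1')] ** n21 * p[('x2', 'y2')] ** n22)

def ge(pred1, pred2):
    # Exact generalization error of the per-row majority predictions.
    err = 0.0
    for x, pred in (('x1', pred1), ('x2', pred2)):
        err += p[(x, 'y2')] if pred == 'y1' else p[(x, 'y1')]
    return err

e_ge = 0.0
for n11 in range(N + 1):
    for n12 in range(N + 1 - n11):
        for n21 in range(N + 1 - n11 - n12):
            n22 = N - n11 - n12 - n21
            pred1 = 'y1' if n11 > n12 else 'y2'   # row x1 majority (ties -> y2)
            pred2 = 'y1' if n21 > n22 else 'y2'
            e_ge += pmf(n11, n12, n21, n22) * ge(pred1, pred2)
print(e_ge)   # exact E[GE] for training sets of size N
```

The enumeration is over O(N^3) count configurations rather than the exponentially many datasets, which is exactly the saving the text describes.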




4.4. Let us now briefly preview the kind of probabilities I need to decipher.


Table 4-1, considering the cell x1y1 without loss of generality (w.l.o.g.) and by the Naive Bayes classifier independence assumption, I need to find the probability of the following condition being true for the 2-dimensional case, p_{c1} p_{x11} ... Table 4-1 I have:

P[N_2 N_{x11} N_{y11} > N_1 N_{x12} N_{y12}] = \sum_{N_{111}} \sum_{N_{121}} \sum_{N_{211}} \sum_{N_{112}} \sum_{N_{122}} \sum_{N_{212}} \sum_{N_{222}} P[N_{111}, N_{121}, N_{211}, N_{112}, N_{122}, N_{212}, N_{221}, N_{222}] \, I[N_2 N_{x11} N_{y11} > N_1 N_{x12} N_{y12}]

where N_2 = N_{112} + N_{122} + N_{212} + N_{222}, N_{x11} = N_{111} + N_{121}, N_{y11} = N_{111} + N_{211}, N_1 = N - N_2, N_{x12} = N_{112} + N_{122}, and I[condition] = 1 if the condition is true, else I[condition] = 0. Each of the summations takes O(N) values and so the worst case time complexity is O(N^7). I thus observe that for the simple scenario depicted, the time to compute the probabilities is unreasonable even for small size datasets (N = 100, say). The number of summations increases linearly with the dimensionality of the space. Hence, the time complexity is exponential in the dimensionality. I thus need to resort to approximations to speed up the process.
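Since the direct sum is O(N^7), a cheap sanity check is a Monte Carlo estimate of the same cell probability. The sketch below samples the eight multinomial counts directly (the cell probabilities are illustrative values of mine, not from the thesis):

```python
import random

N = 100
# Cell labels encode (value of X1, value of X2, class Y), e.g. '121' = (1, 2, 1).
cells = ['111', '121', '211', '221', '112', '122', '212', '222']
probs = [0.10, 0.15, 0.05, 0.10, 0.12, 0.18, 0.10, 0.20]
rng = random.Random(0)

def estimate(trials=5000):
    hits = 0
    for _ in range(trials):
        counts = dict.fromkeys(cells, 0)
        for c in rng.choices(cells, weights=probs, k=N):
            counts[c] += 1
        n2 = sum(counts[c] for c in cells if c.endswith('2'))   # class-2 total
        n1 = N - n2
        nx11 = counts['111'] + counts['121']   # X1 = 1, Y = 1
        ny11 = counts['111'] + counts['211']   # X2 = 1, Y = 1
        nx12 = counts['112'] + counts['122']   # X1 = 1, Y = 2
        ny12 = counts['112'] + counts['212']   # X2 = 1, Y = 2
        hits += n2 * nx11 * ny11 > n1 * nx12 * ny12
    return hits / trials

print(estimate())   # estimate of P[N2*Nx11*Ny11 > N1*Nx12*Ny12]
```

Each trial costs O(N), so the estimate is linear in N per sample, at the price of sampling error; the thesis's random sampling using formulations (RS) plays a related role more carefully.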



Wolfram-Research. But how does all of what I have just discussed relate to our problem? Consider the 2-dimensional case given in Figure 4-1. I need to find the probability P[Z > 0] where Z = N_2 N_{x11} N_{y11} - N_1 N_{x12} N_{y12}. The individual terms in the product can be expressed as a sum of certain random variables in the multinomial. Thus Z can be written as the sum of the product of some of the multinomial random variables. Consider the first term in Z,

N_2 N_{x11} N_{y11} = (N_{112} + N_{122} + N_{212} + N_{222})(N_{111} + N_{121})(N_{111} + N_{211}) = N_{112} N_{111}^2 + \cdots + N_{222} N_{121} N_{211}

the second term also can be expressed in this form. Thus Z can be written as the sum of the products of the multinomial random variables.

E[Z] = E[N_2 N_{x11} N_{y11} - N_1 N_{x12} N_{y12}] = E[N_2 N_{x11} N_{y11}] - E[N_1 N_{x12} N_{y12}] = E[N_{112} N_{111}^2] + \cdots + E[N_{222} N_{121} N_{211}] - E[N_{111} N_{112}^2] - \cdots - E[N_{221} N_{122} N_{212}]

In the general case Z = N_2^{d-1} N_{x_1 1\cdots1} \cdots N_{x_d 1\cdots1} - N_1^{d-1} N_{x_1 1\cdots12} \cdots N_{x_d 1\cdots12}, where the subscript of N with dots has d+1 numbers. The expected value of Z is then given by the analogous expansion, where m_i denotes the number of attribute values of x_i. These expectations can be computed using the technique in the discussion before. Higher moments can also be found


Hall [1992] are used to approximate distributions of random variables whose moments, or more specifically cumulants, are known. These expansions consist in writing the characteristic function of the unknown distribution, whose probability density is to be approximated, in terms of the characteristic function of another known distribution (usually normal). The density to be found is then recovered by taking the inverse Fourier transform. Let $p_{uc}(t)$, $p_{ud}(x)$ and $\kappa_i$ be the characteristic function, probability density function and the $i$th cumulant of the unknown distribution respectively, and let $p_{kc}(t)$, $p_{kd}(x)$ and $\gamma_i$ be the characteristic function, probability density function and the $i$th cumulant of the known distribution respectively. Hence,

$p_{uc}(t)=e^{\sum_{a=1}^{\infty}(\kappa_a-\gamma_a)\frac{(it)^a}{a!}}\,p_{kc}(t)$, $\qquad p_{ud}(x)=e^{\sum_{a=1}^{\infty}(\kappa_a-\gamma_a)\frac{(-D)^a}{a!}}\,p_{kd}(x)$

where $D$ is the differential operator. If $p_{kd}(x)$ is a normal density then I arrive at the following expansion,

$p_{ud}(x)=\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}\left[1+\frac{\kappa_3}{6\sigma^3}H_3\!\left(\frac{x-\mu}{\sigma}\right)+\dots\right]$

where $H_3(\cdot)$ is the third Hermite polynomial; see Levin [1981] and Butler and Sutton [1998]. The major challenge, though, lies in choosing a distribution that will approximate the unknown distribution "well", as the accuracy of the cdf estimate depends on this. The performance of the method may vary significantly with the choice of this distribution, since choosing the
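The leading correction term of such an expansion is easy to state in code. A minimal sketch; the normal base density with a single third-cumulant (skewness) correction is the standard textbook form, given here for illustration rather than as the thesis's exact series:

```python
import math

def he3(z):
    """Probabilists' third Hermite polynomial H3(z) = z^3 - 3z."""
    return z ** 3 - 3 * z

def gram_charlier_pdf(x, mean, var, k3):
    """Normal density corrected by the third cumulant k3: the leading
    term of a Gram-Charlier/Edgeworth-style expansion."""
    sigma = math.sqrt(var)
    z = (x - mean) / sigma
    phi = math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))
    return phi * (1 + k3 / (6 * sigma ** 3) * he3(z))

# With k3 = 0 the correction vanishes and the plain normal is recovered.
print(gram_charlier_pdf(0.0, 0.0, 1.0, 0.0))  # ~0.3989
```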


Isii [1960, 1963], Karlin and Shapley [1953]. In fact, with up to 3 moments known, there are closed-form solutions for the bounds Prekopa [1989]. In the material that follows, I present the optimization problem in its primal and dual form. I then explore strategies for solving it, given that the most obvious ones can prove to be computationally expensive. Assume that I know $m$ moments of the discrete random variable $X$, denoted by $\mu_1,\dots,\mu_m$, where $\mu_j$ is the $j$th moment. The domain of $X$ is given by $U=\{x_0,x_1,\dots,x_n\}$. $P[X=x_r]=p_r$ where $r\in\{0,1,\dots,n\}$ and $\sum_r p_r=1$. I only discuss the maximization version of the problem (i.e., finding the upper bound), since the minimization version (i.e., finding the lower bound) has an analogous description. Thus, in the primal space I have the following formulation,

max $P[X\le x_r]=\sum_{i=0}^{r}p_i$, $r\le n$
subject to: $\sum_{i=0}^{n}p_i=1$, $\sum_{i=0}^{n}x_ip_i=\mu_1$, ..., $\sum_{i=0}^{n}x_i^mp_i=\mu_m$, $p_i\ge0\ \forall i\le n$
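The primal formulation maps directly onto a standard LP solver. A sketch using scipy.optimize.linprog; the domain and the two example moments (those of a Binomial(10, 0.5)) are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import linprog

def upper_bound_cdf(xs, mus, r):
    """Max P[X <= xs[r]] over all pmfs on xs matching moments mus[j] = E[X^j]
    for j = 1..m; the j = 0 total-mass constraint is added automatically."""
    n, m = len(xs), len(mus)
    # Equality constraints: sum_i xs[i]^j * p_i = mu_j for j = 0..m (mu_0 = 1).
    A_eq = np.vstack([np.power(xs, j) for j in range(m + 1)])
    b_eq = np.concatenate(([1.0], mus))
    c = np.zeros(n)
    c[: r + 1] = -1.0  # linprog minimizes, so negate to maximize leading mass
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * n)
    return -res.fun

xs = np.arange(11.0)          # domain {0, ..., 10}
mus = [5.0, 27.5]             # E[X], E[X^2] for Binomial(10, 0.5)
print(upper_bound_cdf(xs, mus, 5))   # upper bound on P[X <= 5]
```

Since only two moments constrain the pmf here, the bound is necessarily looser than the true Binomial value of about 0.623.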


1. If they do not, then I check to see if the value of the polynomial at any point within this range satisfies the inequalities. On the inequalities being satisfied I jump to the next $y$ using gradient descent, storing the current value of $y$ in place of the previously stored one (if it exists). If the inequalities are not satisfied, I reject the value of $y$ and perform a binary search between this value and the previous legal value of $y$ along the gradient, until I reach the value that minimizes the objective while satisfying the constraints.

2. If I do, then I check the value of the constraints at the two extremities. If satisfied, and if there exists only one root in the range, I store this value of $y$ and go on to the next. If there are multiple roots, then I check to see if consecutive roots have any integral values between them. If not, I again store this value of $y$ and move to the next. Else I verify for any point between the roots that the constraints are satisfied, based on which I either store or reject the value. On rejecting, I perform the same binary-search-type procedure mentioned above.

Checking if consecutive roots of the polynomial have values in the domain of $X$ is where the extension of the domain to include all integers between the extremities helps in enhancing performance. In the absence of this extension I would need to find if a particular set of integers lies in the domain of $X$. This operation is expensive for large domains, but with the extension all the above operations can be performed efficiently. Finding roots of polynomials can be done extremely efficiently even for high-degree polynomials by various methods, such as computing the eigenvalues of the companion matrix Edelman and Murakami [1995], as implemented in Matlab. Since the number of roots is just the degree of the


[1989] gave an algorithm for the discrete moment problem. In his algorithm I maintain an $(m+1)\times(m+1)$ matrix called the basis matrix $B$, which needs to have a particular structure to be dual feasible. I iteratively update the columns of this matrix until it becomes primal feasible, resulting in the optimal solution to optimization problem 1. The issue with this algorithm is that there is no guarantee w.r.t. the time required for the algorithm to find this primal feasible basis structure. In the remaining approaches I further extend the domain of the random variable $X$ to be continuous within the given range. Again, for the same reason described before, the bound is unintrusive. It is also worth noting that the feasibility region of the optimization problem is convex, since the objective and the constraints are convex (actually affine). Standard convex optimization strategies cannot be used, since the equation of the boundary is unknown and the length of the description of the problem is large. Prekopa [1989]


min$_{d_k}$ $\nabla f(y_k)^{T}d_k+\frac{1}{2}d_k^{T}\nabla^{2}L(y_k,\lambda_k)d_k$
subject to: $c_{eq_i}(y_k)+\nabla c_{eq_i}(y_k)^{T}d_k=0,\ i\in E$; $\quad c_{ieq_i}(y_k)+\nabla c_{ieq_i}(y_k)^{T}d_k\ge0,\ i\in I$

where $O_k(d_k)$ is the quadratic approximation of the objective function around $y_k$. The term $f(y_k)$ is generally dropped from the above objective since it is a constant at any particular iteration and has no bearing on the solution. $\nabla^{2}L(\cdot)$ is the Hessian of the Lagrangian w.r.t. $y$, $E$ and $I$ are the sets of indices for the equality and inequality constraints respectively, and $d_k$ is the direction vector, which is the solution of the above optimization problem. The next iterate $y_{k+1}$ is given by $y_{k+1}=y_k+\alpha_kd_k$, where $\alpha_k$ is the step length. For our specific problem the objective function is affine, thus a quadratic approximation of it yields the original objective function. I have no equality constraints. For the


4-2. The constraint $c_l=\sum_{j=0}^{m}y_jx_i^j$ is just satisfied. With this in view I arrive at the following formulation of our optimization problem at the $k$th iteration,

min $\mu^{T}d_k$
subject to: $\sum_{j=0}^{m}y_j^{(k)}x_i^j+\sum_{j=0}^{m}x_i^jd_{kj}\ge0$, $\quad y_k=[y_0^{(k)},\dots,y_m^{(k)}]$

This technique gives a sense of the non-linear boundary traced out by the constraints. The above-mentioned values can be deduced by finding the roots of the derivative of the polynomial w.r.t. $x$, and then finding the minimum of the polynomial evaluated at the real roots of its derivative. The number of roots is bounded by the number of moments; in fact it is equal to $m-1$. Since this approach does not require the enumeration of each of the linear constraints, and the operations described are fast with results being accurate, this turns out to be a good option for solving this optimization problem. I carried out the optimization using the Matlab function fmincon and the procedure just illustrated.

Semi-definite Programming (SDP): A semi-definite programming problem has a linear objective, linear equality constraints and linear matrix inequality (LMI) constraints. Here is an example formulation,

min $c^{T}q$
subject to: $q_1F_1+\dots+q_nF_n+H\preceq0$, $\quad Aq=b$
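As a rough Python stand-in for the fmincon-based procedure described above, the affine dual objective can be minimized with SciPy's SLSQP (an SQP implementation), enforcing the polynomial-nonnegativity constraints on the finite domain. The domain, moments and indicator threshold below are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize

# Dual view of the moment problem: find a polynomial y0 + y1*x + ... + ym*x^m
# dominating the indicator 1[x <= x_r] on the domain, while minimizing its
# expectation under the known moment sequence.
xs = np.arange(11.0)              # domain of X (illustrative)
mus = np.array([1.0, 5.0, 27.5])  # [mu_0, mu_1, mu_2] (illustrative)
x_r = 5.0

def objective(y):
    return y @ mus                # E[poly(X)] under the moment sequence

def poly(y, x):
    return sum(yj * x ** j for j, yj in enumerate(y))

# One inequality constraint per domain point: poly(y, x) >= indicator(x).
cons = [{'type': 'ineq',
         'fun': (lambda y, x=x: poly(y, x) - (1.0 if x <= x_r else 0.0))}
        for x in xs]

res = minimize(objective, x0=np.ones(3), constraints=cons, method='SLSQP')
print(res.fun)                    # upper bound on P[X <= 5]
```

On this finite grid the result coincides with the LP bound by duality; the point of the thesis's approach is to avoid enumerating one constraint per domain point for large domains.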


Bertsimas and Popescu [1998]. I derive the equivalent semidefinite formulation for the second constraint, $c_2(x)=\sum_{i=0}^{m}y_ix^i-1$, to be greater than or equal to zero. To accomplish this, I replace $y_0$ by $y_0-1$ in the above set of equalities, since $c_2(x)=c_1(x)-1$. Thus $\forall x\in[a,b]$ I have the following semidefinite formulation for the second constraint,

$\sum_{i+j=2l-1}S(i,j)=0,\ l=1,\dots,m$
$\sum_{k=1}^{l}\sum_{r=k}^{k+m-l}y_r\,{}^{r}C_{k}\,{}^{(m-r)}C_{l-k}\,a^{r-k}b^{k}+\sum_{r=1}^{m-l}y_r\,{}^{(m-r)}C_{l}\,a^{r}+y_0-1=\sum_{i+j=2l}S(i,j),\ l=1,\dots,m$
$\sum_{r=1}^{m}y_ra^{r}+y_0-1=S(0,0),\quad S\succeq0$

Combining the above 2 results I have the following semidefinite program with $O(m^2)$ constraints,

min $\sum_{k=0}^{m}y_k\mu_k$
subject to:
$\sum_{i+j=2l-1}G(i,j)=0,\ l=1,\dots,m$
$\sum_{k=1}^{l}\sum_{r=k}^{k+m-l}y_r\,{}^{r}C_{k}\,{}^{(m-r)}C_{l-k}\,a^{r-k}b^{k}+\sum_{r=1}^{m-l}y_r\,{}^{(m-r)}C_{l}\,a^{r}+y_0-1=\sum_{i+j=2l}G(i,j),\ l=1,\dots,m$
$\sum_{r=1}^{m}y_ra^{r}+y_0-1=G(0,0)$
$\sum_{i+j=2l-1}Z(i,j)=0,\ l=1,\dots,m$
$\sum_{k=0}^{l}\sum_{r=k}^{k+m-l}y_r\,{}^{r}C_{k}\,{}^{(m-r)}C_{l-k}\,b^{r-k}c^{k}=\sum_{i+j=2l}Z(i,j),\ l=0,\dots,m$


Wu and Boyd [1996] to solve the above semidefinite program. Through the empirical studies that follow, I found this approach to be the best at solving the optimization problem in terms of a balance between speed, reliability and accuracy.

Hall [1992], Bartlett et al. [2001], Chambers and Skinner [1977] have been conducted to analyze different kinds of sampling procedures. The sampling procedure that is relevant to our problem is Random Sampling, and hence I restrict our discussion only to it. Random sampling is a sampling technique in which I select a sample from a larger population, wherein each individual is chosen entirely by chance and each member of the population has a possibly unequal chance of being included in the sample. Random sampling reduces the likelihood of bias. It is known that asymptotically the estimates found using random sampling converge to their true values. For our problem the cdf can be computed using this sampling procedure. I sample data from the multinomial distribution (our data generative model) and count the number of times the condition whose cdf is to be computed is true. This number, divided by the total number of samples, gives an estimate of the cdf. By finding the mean and standard deviation of these estimates, I can derive confidence bounds on the cdf using the Chebyshev inequality. The width of these confidence bounds depends on the standard deviation of the
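The random sampling estimator with Chebyshev bounds can be sketched in a few lines. The uniform cell probabilities, trial count and seed below are illustrative choices:

```python
import numpy as np

def rs_cdf_estimate(n_samples, N, probs, trials=30, conf=0.95, seed=0):
    """Estimate P[Z > 0] by sampling from the multinomial, repeated over
    several trials to obtain a mean and a Chebyshev confidence interval."""
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(trials):
        d = rng.multinomial(N, probs, size=n_samples)
        n111, n121, n211, n221, n112, n122, n212, n222 = d.T
        n2 = n112 + n122 + n212 + n222
        z = (n2 * (n111 + n121) * (n111 + n211)
             - (N - n2) * (n112 + n122) * (n112 + n212))
        estimates.append((z > 0).mean())
    mean, std = np.mean(estimates), np.std(estimates)
    # Chebyshev: P(|X - mu| >= k*sigma) <= 1/k^2, so k = 1/sqrt(1 - conf).
    halfwidth = std / np.sqrt(1 - conf)
    return mean, mean - halfwidth, mean + halfwidth

mean, lo, hi = rs_cdf_estimate(1000, 100, [1 / 8] * 8)
print(f"P[Z>0] ~ {mean:.3f} in [{lo:.3f}, {hi:.3f}]")
```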


4-1. I instantiated all the cell probabilities to be equal. I found the probability $P[N_2N_{x_11}N_{y_11}>N_1N_{x_12}N_{y_12}]$ by the methods suggested, varying the dataset size from 10 to 1000 in multiples of 10, and having knowledge of the first six moments of the random variable $X=N_2N_{x_11}N_{y_11}-N_1N_{x_12}N_{y_12}$. The actual probability in all three cases is around 0.5 (actually just less than 0.5). The execution speeds for the various methods are given in Table 4-3.


4-3 and Table 4-4 that the method does not scale much in time with the size of the dataset, but produces extremely good confidence bounds as the number of samples increases. With 1000 samples I already have pretty tight bounds, with the time required being just over half a second. Also, as previously stated, the cdfs can be calculated together rather than independently. Recommendation: The SDP method is the best, but RS can prove to be more than acceptable.


4.6, 4-3, 4.6, 4-5 and 4.6 depict the estimates of MC and RS for different amounts of correlation (measured using Chi-Square Connor-Linton [2003]) between the attributes and the class labels, with increasing training set size.

Observations: From Figure 4.6 I observe that when the attributes and class labels are uncorrelated, with increasing training set size the estimates of both MC and RS are accurate. Similar qualitative results are seen in Figure 4.6 when the attributes and class labels are totally correlated. Hence, for extremely low and high correlations both methods produce equally good estimates. The problem arises for the MC method when I move away from these extreme correlations. This is seen in Figures 4-3, 4.6 and 4-5. Both the MC and RS methods perform well initially, but at larger training set sizes (around 10000 and greater) the estimates of the MC method become grossly incorrect, while the RS method still performs exceptionally well. In fact, the estimates of RS become increasingly accurate with increasing training set size.

Reasons and Implications: An explanation of the above phenomena is as follows. The term $E_{D(N)}[GE(\zeta)]$ denotes the expected GE of all classifiers that are induced by all possible training sets drawn from some distribution. In the continuous case the number of possible training sets of size $N$ is infinite, while in the discrete case it is $O(N^{m-1})$, where $m$ is the total number of cells in the contingency table. As $N$ increases, the number


2-1). Since the probability is of an event over two distinct random variables, the previous method of computing moments cannot be directly applied. An important question is: can I somehow, through certain transformations, reuse the previous method? Fortunately the answer is affirmative. The intuition behind the technique I propose is as follows. Find another random variable $Z=f(X,Y)$ (a polynomial in $X$ and $Y$) such that $Z>0$ iff $X>0$ and $Y>0$. Since the two events are equivalent, their probabilities are also equal. By taking derivatives of the MGF of the multinomial, I get expressions for the moments of polynomials of the multinomial random variables. Thus $f(X,Y)$ is required to be a polynomial in $X$ and $Y$. I now discuss the challenges in finding such a function and eventually suggest a solution. Geometrically, I can consider the random variables $X$, $Y$ and $Z$ to denote the three coordinate axes. Then the function $f(X,Y)$ should have a positive value in the first quadrant and negative in the remaining three. If the domains of $X$ and $Y$ were infinite and continuous, then this problem is potentially intractable, since the polynomial needs to have a discrete jump along the $X$ and $Y$ axes. Such behavior can be emulated at best approximately by polynomials. In our case, though, the domains of the random variables are finite, discrete and symmetric about the origin. Therefore, what I care about is that the function behaves as desired only at these finitely many discrete points. One simple solution is to have a circle covering the relevant points in the first quadrant; with the appropriate sign, the function would be positive for all the points encompassed by it. This works for small domains of $X$ and $Y$. As the domain size increases, the circle intrudes into the other quadrants, no longer satisfying the conditions. Other simple functions such as $XY$ or $X+Y$, or a product of the two, also do not work. I now give a function that does


4-7 depicts the polynomial for $a=10$ where $r=4$. The polynomial resembles a bird with its neck in the first quadrant, wings in the 2nd and 4th quadrants, and its posterior in the third. The general shape remains the same for higher values of $a$. The first requirement for the polynomial was that it must be symmetric. Secondly, I wanted to penalize negative terms, and so I have $X+a$ (and $Y+a$) raised to some power, which will always be positive but will have lower values for smaller $X$ (and $Y$). The $X^2Y$ (and $Y^2X$) makes the first (second) term zero if either of $X$ and $Y$ is zero. Moreover, it imparts sign to the corresponding term. If the absolute value function ($|\cdot|$) could be used, I would replace the $X^2$ ($Y^2$) by $|X|$ ($|Y|$) and set $r=1$. But since I cannot, in the resultant function $r$ is a reciprocal of a logarithmic function of $a$. For a fixed $r$, with increasing value of $a$ the polynomial starts violating the biconditional by becoming positive in the 2nd and 4th quadrants (i.e. the wings rise). The polynomial is always valid in the 1st and 3rd quadrants. With an increase in the degree ($r$) of the polynomial, its wings begin flattening out, thus satisfying the biconditional for a certain $a$. By recursively applying the above formula, I can approximate the cdf of probabilities with multiple conditions.
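The claimed biconditional can be checked mechanically over the discrete symmetric domain; a small sketch for $a=10$, $r=4$:

```python
def f(x, y, a=10, r=4):
    """The proposed polynomial: intended to be positive iff both arguments
    are positive, on the finite symmetric integer domain [-a, a]."""
    return (x + a) ** r * x ** 2 * y + (y + a) ** r * y ** 2 * x

a = 10
domain = range(-a, a + 1)
# Verify Z > 0 exactly on the open first quadrant of the grid.
ok = all((f(x, y) > 0) == (x > 0 and y > 0) for x in domain for y in domain)
print(ok)  # True
```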


4.6 and 4-11, the variance in Figures 4-9 and 4.6, and the sum of the expectation and standard deviation in Figures 4.6 and 4-13, for single and multiple dimensions respectively. As expected, the expectation of HE grows as the size


4.6, 4.6) since the size of the training data increases; (b) the variance of the classifier for each of the folds increases (Figures 4-15, 4-21) since the size of the test data decreases; (c) the covariance between the estimates of different folds decreases first, then increases again (Figures 4.6, 4.6); I explain this behavior below. The same trend is observed for the total variance of CE (Figures 4-17, 4-23) and the sum of the expectation and the standard deviation of CE (Figures 4.6, 4.6). Observe that the minimum of the sum of the expectation and the standard deviation (which indicates the pessimistic


4-19 and 4-25 I plotted the moments of GE, HE and CE; the size of the hold-out set for HE was set to 40%, and 20 folds were used for CE. As can be observed from the figure, the error of the hold-out set is significantly larger for small datasets. The error of cross-validation is almost on par with the generalization error. This property of cross-validation, to reliably estimate the generalization error, is


Table 4-1. Contingency table of input X (columns: X, y1, y2).


Naive Bayes notation (columns: Symbol, Semantics).

Table 4-3. Empirical comparison of the cdf computing methods in terms of execution time. RSn denotes the Random Sampling procedure using n samples to estimate the probabilities.

Method | Dataset Size 10 | Dataset Size 100 | Dataset Size 1000
Direct | 25 hrs | around 200 centuries | around 200 billion yrs
SA | Instantaneous | Instantaneous | Instantaneous
LP | around 3.5 sec | around 2 min | around 2:30 hrs
GD | around 0.13 sec | around 0.13 sec | around 0.13 sec
PA | around 1 sec | around 25 sec | around 5 min
GDTS | around 3.5 sec | around 3.5 sec | around 3.5 sec
SQP | around 3.5 sec | around 3.5 sec | around 3.5 sec
SDP | around 0.1 sec | around 0.1 sec | around 0.1 sec
RS100 | around 0.08 sec | around 0.08 sec | around 0.1 sec
RS1000 | around 0.65 sec | around 0.66 sec | around 0.98 sec
RS10000 | around 6.3 sec | around 6.5 sec | around 9.6 sec

Table 4-4. 95% confidence bounds for Random Sampling.

Samples | Dataset Size 10 | Dataset Size 100 | Dataset Size 1000
100 | 0.7-0.23 | 0.72-0.26 | 0.69-0.31
1000 | 0.54-0.4 | 0.56-0.42 | 0.57-0.42
10000 | 0.5-0.44 | 0.51-0.47 | 0.52-0.48

Table 4-5. Comparison of methods for computing the cdf.

Method | Accuracy | Speed
Direct | Exact solution | Low
Series Approximation | Variable | High
Standard LP solvers | High | Low
Gradient descent | Low | High
Prekopa Algorithm | High | Moderate
Gradient descent (topology search) | Moderate | Moderate
Sequential Quadratic Programming | High | Moderate
Semi-definite Programming | High | High
Random Sampling | High | Moderate


I have two attributes, each having two values, with 2 class labels.

The current iterate $y_k$ just satisfies the constraint $c_l$ and easily satisfies the other constraints. Suppose $c_l$ is $\sum_{j=0}^{m}y_jx_i^j$ where $x_i$ is a value of $X$; then in the diagram on the left I observe that for the $k$th iteration $y=y_k$, the polynomial $\sum_{j=0}^{m}y_jx^j$ has a minimum at $X=x_i$, with the value of the polynomial being $a$. This is also the value of $c_l$ evaluated at $y=y_k$.


Estimates of $E_{D(N)}[GE(\zeta)]$ by MC and RS with increasing training set size $N$. The attributes are uncorrelated with the class labels. $E_{D(N)}[GE(\zeta)]$ is 0.5.

Figure 4-4. Estimates of $E_{D(N)}[GE(\zeta)]$ by MC and RS with increasing training set size $N$. The correlation between the attributes and the class labels is 0.25. $E_{D(N)}[GE(\zeta)]$ is 0.24.


Estimates of $E_{D(N)}[GE(\zeta)]$ by MC and RS with increasing training set size $N$. The correlation between the attributes and the class labels is 0.5. $E_{D(N)}[GE(\zeta)]$ is 0.14.

Figure 4-6. Estimates of $E_{D(N)}[GE(\zeta)]$ by MC and RS with increasing training set size $N$. The correlation between the attributes and the class labels is 0.75. $E_{D(N)}[GE(\zeta)]$ is 0.068.


Estimates of $E_{D(N)}[GE(\zeta)]$ by MC and RS with increasing training set size $N$. The attributes are totally correlated with the class labels. $E_{D(N)}[GE(\zeta)]$ is 0.

Figure 4-8. The plot is of the polynomial $(x+10)^4x^2y+(y+10)^4y^2x-z=0$. I see that it is positive in the first quadrant and non-positive in the remaining three.


HE expectation in single dimension.

Figure 4-10. HE variance in single dimension.


HE $E[\cdot]+Std(\cdot)$ in single dimension.

Figure 4-12. HE expectation in multiple dimensions.


HE variance in multiple dimensions.

Figure 4-14. HE $E[\cdot]+Std(\cdot)$ in multiple dimensions.


Expectation of CE.

Figure 4-16. Individual run variance of CE.


Pairwise covariances of CV.

Figure 4-18. Total variance of cross-validation.


Figure 4-20. Convergence behavior.


CE expectation.

Figure 4-22. Individual run variance of CE.


Pairwise covariances of CV.

Figure 4-24. Total variance of cross-validation.


Figure 4-26. Convergence behavior.


Dhurandhar and Dobra [2009] to randomized classification algorithms. An extensive empirical comparison between the proposed method and Monte Carlo depicts the advantages of the method in terms of running time and accuracy. It also showcases the use of the method as an exploratory tool to study learning algorithms.


Dhurandhar and Dobra [2009], is to define a class of classifiers induced by a classification algorithm and an i.i.d. sample of a particular size from an underlying distribution. Each classifier in this class, and its GE, act as random variables, since the process of obtaining the sample is randomized. Since $GE(\zeta)$ is a random variable, it has a distribution. Quite often, though, characterizing a finite subset of moments turns out to be a more viable option than characterizing the entire distribution. Based on these facts, I revisit the expressions for the first two moments around zero of the GE of a classifier,


5-1. It can be seen that, irrespective of the split attribute selection method (e.g. information gain, gini gain, randomized selection, etc.), the above stopping criterion yields trees with the same leaf nodes. Thus, although a particular path in one tree has an ordering of attributes that might be different from a corresponding path in other trees, the leaf nodes will represent the same region in space, or the same set of data points. This is seen in Figure 5-2. Moreover, since predictions are made using data in the leaf nodes, any deterministic way of prediction would lead to these trees resulting in the same classifier for a given sample, and thus having the same GE. Usually, prediction in the leaves is performed by choosing the most numerous class as the class label for the corresponding data point. With this I arrive at the expressions for computing the aforementioned


4-1 Dhurandhar and Dobra [2009].


3. Hall and Holmes [2003]. Some of the most popular ones aim to increase the purity of the set of data points that lie in the region formed by that split. The purer the


Quinlan [1986], ii) Gini Gain (GG) Breiman et al. [1984], iii) Gain Ratio (GR) Quinlan [1986], iv) Chi-square test (CS) Shao [2003], etc. aim at realising this intuition. Other measures, using Principal Component Analysis Smith [2002] and correlation-based measures Hall [1998], have also been developed. Another interesting, yet non-intuitive measure in terms of its utility, is the Random attribute selection measure. According to this measure, I randomly choose the split attribute from the available set. The decision tree that this algorithm produces is called a Random decision tree (RDT). Surprisingly enough, a collection of RDTs quite often outperforms their seemingly more powerful counterparts Liu et al. [2005]. In this thesis I study this interesting variant. I do this by first presenting a probabilistic characterization of selecting a particular attribute/set of attributes, followed by simulation studies. Characterizations for the other measures can be developed in a similar vein by focusing on the working of each measure. As an example, for the deterministic purity-based measures mentioned above, the split attribute selection is just a function of the sample, and thus by appropriately conditioning on the sample I can find the relevant probabilities and hence the moments. Before presenting the expression for the probability of selecting a split attribute/attributes in constructing an RDT, I extend the results in Dhurandhar and Dobra [2009], where relationships were drawn between the moments of HE, CE, LE (just a special case of cross-validation) and GE, to be applicable to randomized classification algorithms. The random process is assumed to be independent of the sampling process. This result is required since the results in Dhurandhar and Dobra [2009] are applicable to deterministic classification algorithms, and I would be analyzing RDTs. With this I have the following lemma.


The result is valid even when $D$ and $T$ are continuous, but considering the scope of this thesis, I am mainly interested in the discrete case. This result implies that all the relationships and expressions in Dhurandhar and Dobra [2009] hold, with an extra expectation over the $t$, for randomized classification algorithms where the random process is independent of the sampling process. In equations 6-9 and 6-2 the expectations w.r.t. $Z(N)$ become expectations w.r.t. $Z(N,t)$.


(vi) $!$ denotes permutation, and $prob_i=1/\dots$ (Section 5.1.4). Hence the probability used in finding the first moment is given by,


5.1.4. The probability used in finding the first moment is given by,

$P_{Z(N)}[\zeta(x)=C_i]=\sum_p P_{Z(N)}[ct(path_p,C_i)>ct(path_p,C_j),\ path_p\ exists,\ \forall j\neq i,\ i,j\in[1,\dots,k]]$
$=\sum_p P_{Z(N)}[ct(path_p,C_i)>ct(path_p,C_j),\ s.c.c.i.,\ s.c.c.s.,\ \forall j\neq i,\ i,j\in[1,\dots,k]]$
$=\sum_p P_{Z(N)}[ct(path_p,C_i)>ct(path_p,C_j),\ s.c.c.s.,\ \forall j\neq i,\ i,j\in[1,\dots,k]]\,P_{Z(N)}[s.c.c.i.]$


(5-8) where $h_p$ is the length of the path indexed by $p$. The joint probability of the count comparisons and s.c.c.s. can be computed from the underlying joint distribution. The probability for the second moment, when the trees are different, is given by,


(5-10) where $r$ is the number of attributes that are common to the 2 paths, sparing the attributes chosen as leaves, $b$ is the number of attributes that have the same value, $h_p$ and $h_q$ are the lengths of the 2 paths, and, without loss of generality assuming $h_p\le h_q$, $prob_t=1/\dots$


Connor-Linton [2003]. More precisely, I sum over all $i$ the squares of the difference of each $p_i$ with the product of its corresponding marginals, with each squared difference being divided by this product, i.e.

$correlation=\sum_i \frac{(p_i-p_{im})^2}{p_{im}}$
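The correlation measure just described is straightforward to compute from a joint table; a small sketch, with illustrative example joint distributions:

```python
import numpy as np

def chi_square_correlation(joint):
    """Sum over cells of (p_i - p_im)^2 / p_im, where p_im is the product of
    the corresponding row and column marginals of the joint pmf."""
    joint = np.asarray(joint, dtype=float)
    row = joint.sum(axis=1, keepdims=True)   # marginal over columns
    col = joint.sum(axis=0, keepdims=True)   # marginal over rows
    pim = row * col                          # product of marginals per cell
    return float(((joint - pim) ** 2 / pim).sum())

independent = np.outer([0.5, 0.5], [0.25, 0.75])
print(chi_square_correlation(independent))   # 0.0 for an independent joint
print(chi_square_correlation([[0.5, 0.0], [0.0, 0.5]]))  # fully correlated
```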


5-3, 5-4 and 5-5 depict the error of fixed-height trees, with the number of attributes being 5 for the first two figures and 8 for the third figure. The number of attribute values increases from 2 to 3 in figures 5-3 and 5-4 respectively. I observe in these figures that AF is significantly more accurate than both MC-1 and MC-10. In fact, the performance of the 3 estimators, namely AF, MC-1 and MC-10, remains more or less unaltered even with changes in the number of attributes and in the number of splits per attribute. A similar trend is seen for both purity-based trees, i.e. figures 5-6, 5-7 and 5-8, as well as scarcity-based trees (5-9, 5-10 and 5-11), though in the case of purity-based trees the performance of both MC-1 and MC-10 is much superior compared with their performance on the other two kinds of trees, especially at low correlations. The reason for this is that, at low correlations, the probability in each cell of the multinomial is non-negligible, and with $N=10000$ the event that every cell contains at least a single data point is highly likely. Hence, the trees I obtain with high probability using the purity-based stopping criteria are all ATT. Since in an ATT all the leaves are identical irrespective of the ordering of the attributes in any path, the randomness in the classifiers produced is only due to the randomness in the data generation process, and not because of the random attribute selection method. Thus, the space of classifiers over which the error is computed reduces, and MC performs well even for a relatively small number of iterations. At higher correlations, and for the other two kinds of trees, the probability of smaller trees is reasonable, and hence MC has to account for a larger space of classifiers induced not only by the randomness in the data but also by the randomness in the attribute selection method. In the case of real data too (figure 5-12), the performance of the expressions is significantly superior compared with MC-1 and MC-10. The performance of MC-1 and MC-10 for


5-3 and 5-4, the conditions for path exists for these attribute selection methods depend totally on the sample. This is unlike what I observed for the randomized attribute selection criterion, where the conditions for path exists, depending on this randomized criterion, were sample independent, while the other conditions in purity and scarcity were sample dependent. Characterizing these probabilities enables us to compute the moments of GE for these other attribute selection methods. In the analysis that I presented, I assumed that the split points for continuous attributes were determined a priori to tree construction. If the split point selection algorithm is dynamic, i.e. the split points are selected while building the tree, then in the path exists conditions of the 3 stopping criteria I would have to append an extra condition, namely: the split occurs at "this" particular attribute value. In reality, the value of "this" is determined by the values that the samples attain for the specific attribute in the particular dataset, which is finite.1 Hence, while analyzing, I can choose a set of allowed


Dhurandhar and Dobra [2009] to be applicable to randomized classification algorithms; this is necessary if the theory is to be applied to random decision trees, as I did in this thesis. The


5.2 had two purposes: (a) portray the manner in which the expressions can be utilized as an exploratory tool to gain a better understanding of decision tree classifiers, and (b) show conclusively that the methodology in Dhurandhar and Dobra [2009], together with the developments in this thesis, provides a superior analysis tool when compared with simple Monte Carlo.


The all attribute tree with 3 attributes $A_1,A_2,A_3$, each having 2 values.

Given 3 attributes $A_1,A_2,A_3$, the path $m_{11}m_{21}m_{31}$ is formed irrespective of the ordering of the attributes. Three such permutations are shown in the above figure.


Fixed height trees with d=5, h=3 and attributes with binary splits.

Figure 5-4. Fixed height trees with d=5, h=3 and attributes with ternary splits.

Figure 5-5. Fixed height trees with d=8, h=3 and attributes with binary splits.


Purity based trees with d=5 and attributes with binary splits.

Figure 5-7. Purity based trees with d=5 and attributes with ternary splits.

Figure 5-8. Purity based trees with d=8 and attributes with binary splits.


Scarcity based trees with d=5, pb=N/...

Figure 5-10. Scarcity based trees with d=5, pb=N/...

Figure 5-11. Scarcity based trees with d=8, pb=N/...


Comparison between AF and MC on three UCI datasets for trees pruned based on fixed height (h=3), purity and scarcity (pb=N/...)


Stone [1977], i.e. it asymptotically achieves Bayes error within a constant factor. None of the even more sophisticated classification algorithms, e.g. SVM, Neural Networks etc., are known to outperform it consistently Stanfill and Waltz [1986]. However, the algorithm is susceptible to noise, and choosing an appropriate value of $k$ is more of an art than a science.

Dhurandhar and Dobra [2009]. The moments of the GE of a classifier built over an independent and identically distributed (i.i.d.) random sample drawn from a joint distribution are taken over the space of all possible classifiers that can be built, given the classification algorithm and the joint distribution. Though the classification algorithm may be deterministic, the classifiers act as random variables, since the sample that they are built on is random. The GE of a classifier, being a function of the classifier, also acts as a random variable. Due to this fact, the GE of classifier $\zeta$, denoted by $GE(\zeta)$, has a distribution


6-1 is the expression for the first moment of $GE(\zeta)$. Notice that inside the first sum $\sum_{x\in X}$ the input $x$ is fixed, and inside the second sum the output $y$ is fixed; thus $P_{Z(N)}[\zeta(x)=y]$ is the probability of all possible ways in which an input $x$ is classified into class $y$. This probability depends on the joint distribution and the classification algorithm. The other two probabilities are directly derived from the distribution. Thus, customizing the expression for $E_{Z(N)}[GE(\zeta)]$ effectively means deciphering a way of computing $P_{Z(N)}[\zeta(x)=y]$. Similarly, customizing the expression for $E_{Z(N)Z(N)}[GE(\zeta)GE(\zeta')]$ means finding a way of computing $P_{Z(N)Z(N)}[\zeta(x)=y\wedge\zeta'(x')=y']$ given any joint distribution. In Section 6.4 I derive expressions for these two probabilities, which depend only on the underlying joint probability distribution, thus providing a way of computing them analytically.


6.9 shows points in $R^2$ space. The points b, c and d are the 3-nearest neighbors (k=3) of the point a. When the attributes are categorical, the most popular metric used is the Hamming distance Liu and White [1997]. The Hamming distance between two points/inputs is the number of attributes that have distinct values for the two inputs. This metric is sample independent, i.e. the Hamming distance between two inputs remains unchanged, irrespective of the sample counts produced in the corresponding contingency table. For example, Table 6-1 represents a contingency table. The Hamming distance between $x_1$ and $x_2$ is the same irrespective of the values of $N_{ij}$, where $i\in\{1,2,\dots,M\}$ and $j\in\{1,2,\dots,v\}$. Other metrics, such as the Value Difference Metric (VDM) Stanfill and Waltz [1986], Chi-square Connor-Linton [2003] etc., exist that depend on the sample. I now provide a global characterization for calculating the aforementioned probabilities for both kinds of metrics. This is followed by an efficient characterization for the sample independent metrics, which includes the traditionally used and most popular Hamming distance metric.
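The Hamming distance metric is trivial to implement, which also makes its sample independence evident (no counts appear anywhere); a small sketch with hypothetical attribute values:

```python
def hamming(u, v):
    """Number of attributes on which two inputs disagree; depends only on
    the attribute values, never on sample counts."""
    return sum(a != b for a, b in zip(u, v))

print(hamming(("a1", "b1"), ("a2", "b1")))  # 1
print(hamming(("a1", "b1"), ("a2", "b2")))  # 2
```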


6-1, if $x_1$ and $x_2$ are the kNN of some input, then $q=\{1,2\}$ and $c(q,b)=N_{1b}+N_{2b}$. Notice that, since $x_1$ and $x_2$ are the kNN of some input, $\sum_{i=1,j=1}^{2,v}N_{ij}\ge k$. Moreover, if the kNN comprise the entire input sample, then the resultant classification is equivalent to classification performed using class priors determined by the sample. The $P_{Z(N)Z(N)}[\zeta(x)=y\wedge\zeta'(x')=y']$ used in the computation of the second moment is calculated by going over the kNN of two inputs rather than one. The expression for this probability is given by,


6-1 is greater than or equal to $k$. If this inequality is true, then definitely the class label of input $x_i$ is determined by the copies of $x_1$ or $x_2$ or both $x_1$, $x_2$. No input besides these two is involved in the classification of $x_i$. The second and third inequalities state that the number of copies of $x_1$ and of $x_2$ is less than $k$, respectively. This forces both $x_1$ and $x_2$ to be used in the classification of $x_i$. If the first inequality were untrue, then farther away inputs would also play a part in the classification of $x_i$. Thus, the kNN of an input depend on the sample irrespective of the distance metric used. The above example also illustrates the manner in which the set $q$ (or $r$) can be characterized as a function of the sample, enabling us to compute the two probabilities required for the computation of the moments from any given joint distribution over the data, for sample independent metrics. Without loss of generality (w.l.o.g.) assume


6-3 and 6-4 turns out to be exponential in the input size $M$. Considering these limitations, I provide alternative expressions for computing these probabilities efficiently for sample independent distance metrics, viz. Manhattan distance Krause [1987], Chebyshev distance Abello et al. [2002], Hamming distance. The number of terms in the new characterization I propose is linear in $M$ for $P_{Z(N)}[\zeta(x)=y]$ and quadratic in $M$ for $P_{Z(N)Z(N)}[\zeta(x)=y\wedge\zeta'(x')=y']$. The characterization I just presented computes the probability of classifying an input into a particular class for each possible set of kNN separately. What if I, in some manner, combined disjoint sets of these probabilities into groups, and computed a single probability for each group? This would reduce the number of terms to be computed, thus speeding up the computation of the moments. To accomplish this, I use the fact that the distance between inputs is independent of the sample. A consequence of this independence is that all pairwise distances between the inputs are known prior to the computation of the probabilities. This assists in obtaining a sorted ordering of inputs from the closest to the farthest for any given input. For example, if I have inputs $a_1b_1$, $a_1b_2$, $a_2b_1$ and $a_2b_2$, then given input $a_1b_1$, I know that $a_2b_2$ is the farthest from the given input, followed by $a_1b_2$ and $a_2b_1$, which are equidistant, and $a_1b_1$ is the closest in terms of Hamming distance. Before presenting a full-fledged characterization for computing the two probabilities, I explain the basic grouping scheme that I employ with the help of an example.

6.9. In this case, the number of terms I need to compute $P_{Z(N)}[\zeta(x_1)=C_1]$ is $M$. The first term calculates the probability of classifying $x_1$ into $C_1$ when the kNN are multiple instances of $x_1$ (i.e. $\sum_{j=1}^{v}N_{1j}\ge k$). Thus, the first group contains only the set $\{x_1\}$. The second term calculates the probability of classifying $x_1$ into $C_1$ when the kNN are multiple instances of $x_2$ or $x_1,x_2$. The second group thus contains the sets $\{x_2\}$ and $\{x_1,x_2\}$ as the possible kNN of $x_1$. If I proceed in this manner,
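The sorted ordering that the grouping scheme relies on can be sketched as follows; the four-input example mirrors the one in the text:

```python
def hamming(u, v):
    """Sample-independent distance: count of disagreeing attributes."""
    return sum(a != b for a, b in zip(u, v))

def neighbor_tiers(query, inputs):
    """Bucket inputs into tiers of equal Hamming distance from the query,
    closest tier first: the precomputable ordering the grouping uses."""
    tiers = {}
    for p in inputs:
        tiers.setdefault(hamming(query, p), []).append(p)
    return [sorted(tiers[d]) for d in sorted(tiers)]

inputs = [("a1", "b1"), ("a1", "b2"), ("a2", "b1"), ("a2", "b2")]
print(neighbor_tiers(("a1", "b1"), inputs))
# [[('a1', 'b1')], [('a1', 'b2'), ('a2', 'b1')], [('a2', 'b2')]]
```

Because the tiers depend only on attribute values, they can be computed once, before any probability over sample counts is evaluated.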


6-1 is,


6.9 may contain more than one input. To accommodate this case, I extend the grouping scheme previously outlined. Previously, the group $g_r$ contained all possible sets formed by the $r-1$ distinct closest inputs to a given input, with the $r$th closest input being present in every set. Realize that the $r$th closest input does not necessarily mean it is the $r$th NN, since there may be multiple copies of any of the $r-1$ closest inputs. In our modified definition, the group $g_r$ contains all possible sets formed by the $r-1$ closest inputs, with at least one of the $r$th closest inputs being present in every set. I illustrate this with an example. Say I have inputs $a_1b_1$, $a_1b_2$, $a_2b_1$ and $a_2b_2$; then given input $a_1b_1$, I know that $a_2b_2$ is the farthest from the given input, followed by $a_1b_2$ and $a_2b_1$, which are equidistant, and $a_1b_1$ is the closest in terms of Hamming distance. The group $g_1$ contains only $a_1b_1$, as before. The group $g_2$ in this case contains the sets $\{a_1b_2\}$, $\{a_2b_1\}$, $\{a_1b_2,a_2b_1\}$, $\{a_1b_2,a_1b_1\}$ and $\{a_2b_1,a_1b_1\}$. Observe that each set has at least one of the 2 inputs $a_1b_2$, $a_2b_1$. I now characterize the probabilities in equations 6-5 and 6-6 for this general case. Let $q_r$ denote the set containing inputs from the closest to


In 6-7 the $P_{Z(N)}[\zeta(x_i)=C_j]$ is given by, and in 6-8 the $P_{Z(N)\times Z(N)}[\zeta(x_i)=C_j,\ \zeta'(x_p)=C_w]$ is given by,


1. the number of terms (or smaller probabilities) that sum up to the above probabilities,
2. the time complexity of each term.

Reduction in number of terms: In the previous section I reduced the number of terms to a small polynomial in M for a class of distance metrics. The current enhancement I propose further reduces the number of terms and works even for the general case, at the expense of accuracy, which I can control. The $r$-th term in the characterizations has the condition that the number of the closest $r-1$ distinct inputs is less than $k$. The probability of this condition being true monotonically reduces with increasing $r$. After a point, this probability may become "small enough" that the total contribution of the remaining terms in the sum is not worth finding, given the


computational overhead Dhurandhar and Dobra [2009]. Using techniques such as optimization, I can find tight lower and upper bounds for the terms in essentially constant time.

Parallel computation: Note that each of the terms is self-contained and not dependent on the others. This fact can be used to compute these terms in parallel, eventually merging them to produce the result. This will further reduce the time of computation. With this I have not only proposed analytical expressions for the moments of GE for the kNN classification model applied to categorical attributes, but have also suggested efficient methods of computing them.

The relationships between these moments and the moments of the model selection measures are given in Dhurandhar and Dobra [2009]. I use the expressions provided in this thesis and these relationships to conduct the experiments described below. The main objective of the experiments I report is to provide a flavor of the utility of the expressions as a tool to study this learning method.
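As a concrete illustration of the machinery above, the sample-independent ordering and the tie-aware grouping can be sketched in a few lines. This is a minimal reconstruction of my own, not the thesis code: `hamming`, `distance_buckets` and `groups` are names I introduce, and whether the fully tied set combined with all closer inputs belongs to a group may differ from the thesis's exact definition.

```python
from itertools import combinations

def hamming(u, v):
    """Number of attribute positions where two categorical inputs differ."""
    return sum(a != b for a, b in zip(u, v))

# The four inputs from the running example: a1b1, a1b2, a2b1, a2b2.
inputs = [("a1", "b1"), ("a1", "b2"), ("a2", "b1"), ("a2", "b2")]

def distance_buckets(x, inputs):
    """Bucket inputs by distance from x, closest bucket first.

    Hamming distance depends only on the attribute values, never on the
    sample, so this ordering can be precomputed once per input x."""
    by_dist = {}
    for u in inputs:
        by_dist.setdefault(hamming(x, u), []).append(u)
    return [by_dist[d] for d in sorted(by_dist)]

def groups(x, inputs):
    """Group g_r: subsets of the inputs strictly closer than the r-th
    bucket, each augmented with at least one input tied at r-th place."""
    result, closer = [], []
    for bucket in distance_buckets(x, inputs):
        g = []
        for k in range(1, len(bucket) + 1):
            for tie in combinations(bucket, k):           # >= 1 tied input
                for m in range(len(closer) + 1):
                    for rest in combinations(closer, m):  # any closer subset
                        g.append(frozenset(tie) | frozenset(rest))
        result.append(g)
        closer += bucket
    return result

buckets = distance_buckets(("a1", "b1"), inputs)
gs = groups(("a1", "b1"), inputs)
# buckets: [a1b1] at distance 0, [a1b2, a2b1] tied at 1, [a2b2] at 2.
```

Every set in the second group contains at least one of the two inputs tied at distance 1, mirroring the example with a1b2 and a2b1 above.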


Connor-Linton [2003]) between the attributes and the class labels, to see the effect it has on the performance of the algorithm. In our fourth study, I choose 2 UCI datasets and compare the estimates of cross-validation with the true error estimates. I also explain how a multinomial distribution can be built over these datasets. The same idea can be used to build a multinomial over any discrete dataset to represent it precisely.

Setup for studies 1-3: I set the dimensionality of the space to be 8. The number of classes is fixed to two, with each attribute taking two values. This gives rise to a multinomial with $2^9 = 512$ cells. If I fix the probability of observing a datapoint in cell $i$ to be $p_i$, such that $\sum_{i=1}^{512} p_i = 1$, and the sample size to $N$, I then have a completely specified multinomial distribution with parameters $N$ and the set of cell probabilities $\{p_1, p_2, \dots, p_{512}\}$. The distance metric I use is Hamming distance and the class prior is 0.5.

Setup for study 4: In the case of real data I choose 2 UCI datasets whose attributes are not limited to having binary splits. The datasets can be represented in the form of a contingency table where each cell in the table contains the count of the number of copies of the corresponding input belonging to a particular class. These counts in the individual cells, divided by the dataset size, provide us with empirical estimates of the individual cell probabilities ($p_i$). Thus, with the knowledge of $N$ (the dataset size) and the individual $p_i$, I have a multinomial distribution whose representative sample is the particular dataset. Using this distribution I observe the estimates of the true error (i.e. moments of GE) and


Equations 6-9 and 6-10 are used to produce the plots. In Figure 6-3a the attributes and the class labels are totally correlated (i.e. correlation = 1). I observe that for a large range of values of k (from small to large) the error is zero. This is expected, since any input lies only in a single class, with the probability of lying in the other class being zero. In Figure 6-3b I reduce the correlation between the attributes and class labels from being totally correlated to a correlation of 0.5. I observe that for low values of k the error is high; it then plummets to about 0.14 and increases again for large values of k. The high error for low values of k occurs because the variance of GE is large for these low values. The reason the variance is large is that the number of points used to classify a given input is relatively small. As the value of k increases this effect reduces up to a stage and then remains constant. This produces the middle portion of the graph, where the GE is the smallest. In the right portion of the graph, i.e. at very high values of k, almost the entire sample is used to classify any given input. This procedure is effectively equivalent to classifying inputs based on class priors. In the general setup I mentioned that I set the priors to 0.5, which results in the high errors. In Figure 6-3c I reduce the correlation still further, down to 0, i.e. the attributes and the class labels are uncorrelated. Here I observe that the error is initially high, then reduces and remains unchanged. As before, the initial upsurge is due to the fact that the variance for low values of k is high, which later settles down.
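The synthetic multinomial setup behind these studies is easy to reproduce. The sketch below uses only the standard library; the seed and the random cell weights are my own choices, not the distributions used for the plots:

```python
import random
from collections import Counter

random.seed(0)

# 8 binary attributes plus a binary class label -> 2**9 = 512 cells.
N_CELLS = 2 ** 9
weights = [random.random() for _ in range(N_CELLS)]
total = sum(weights)
p = [w / total for w in weights]      # cell probabilities, sum to 1

def sample_dataset(N):
    """Draw one dataset of size N i.i.d. from the multinomial (N, p).

    The Counter maps cell index -> count, which is exactly the
    contingency-table form used for the UCI datasets in study 4."""
    cells = random.choices(range(N_CELLS), weights=p, k=N)
    return Counter(cells)

counts = sample_dataset(1000)
# Conversely, a real dataset's cell counts divided by its size give
# empirical p_i, defining a multinomial that represents it precisely.
```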


Comparing Figure 6-3a, Figure 6-3b and Figure 6-3c, I observe a gradual increase in GE as the correlation reduces. The values of k that give low error for the three values of correlation and a sample size of 1000 can be deciphered from the corresponding figures. In Figure 6-3a, I notice that small, mid-range and large values of k are all acceptable. In Figure 6-3b I find that mid-range values (200 to 500) of k are desirable. In the third figure, i.e. Figure 6-3c, I discover that mid-range and large values of k produce low error.

In Figure 6-4a the attributes and class labels are completely correlated. The error remains zero for small, medium and large values of k irrespective of the sample size. In this case any value of k is suitable. In Figure 6-4b the correlation between the attributes and the class labels is 0.5. For small sample sizes (less than and close to 1000), large and small values of k result in high error, while moderate values of k have low error throughout. The initial high error for low values of k occurs because the variance of the estimates is high. The reason for the high error at large values of k is that it is equivalent to classifying inputs based on priors, and the prior is 0.5. At moderate values of k both these effects are diminished and hence the error produced is low. From the figure I see that after around 1500 the errors of the low and high k converge to the error of moderate k. Thus here a k within the range 200 to 0.5N would be appropriate. In Figure 6-4c the attributes and the class labels are uncorrelated. The initial high error for low k is again because of the high variance. Since the attributes and class labels are uncorrelated with a given prior, the error is 0.5 for moderate as well as high values


of k. Comparing Figure 6-4a, Figure 6-4b and Figure 6-4c, I observe a gradual increase in GE as the correlation reduces. At sample sizes greater than about 1500, large, medium and small values of k all perform equally well.

The relationships between CE and GE are given in Dhurandhar and Dobra [2009]. In Figure 6-5a the correlation is 1 and the sample size is 1000. Cross-validation exactly estimates the GE, which is zero irrespective of the value of k. When I increase the sample size to 10000, as shown in Figure 6-6a, cross-validation still does a pretty good job of estimating the actual error (i.e. GE) of kNN. In Figure 6-5b the correlation is set to 0.5 and the sample size is 1000. I observe that cross-validation initially, i.e. for low values of k, underestimates the actual error, performs well for moderate values of k, and grossly overestimates the actual error for large values of k. At low values of k the actual error is high because of the high variance, which I have previously discussed. Hence, even though the expected values of GE and CE are close by, the variances are far apart, since the variance of CE is low. This leads to the optimistic estimate made by cross-validation. At moderate values of k the variance of GE is reduced and hence cross-validation produces an accurate estimate. When k takes large values most of the sample is used to classify an input, which is equivalent to classification based on


priors. When computing CE, a smaller portion of the sample (9/10-th of N for 10-fold cross-validation) is used for the classification of an input for a fixed k than when computing GE. Due to this, CE rises more steeply than GE. When I increase the sample size to 10000, as depicted in Figure 6-6b, the poor estimate at low values of k that I saw for the smaller sample size of 1000 vanishes. The reason for this is that the variance of GE reduces with the increase in sample size. Even for moderate values of k the performance of cross-validation improves, though the difference in accuracy of estimation is not as vivid as in the previous case. For large values of k, though the error in estimation is somewhat reduced, it is still noticeable. It is advisable in the scenario presented to use moderate values of k, ranging from about 200 to 0.5N, to achieve a reasonable amount of accuracy in the prediction made by cross-validation. In Figure 6-5c the attributes are uncorrelated with the class labels and the sample size is 1000. For low values of k the variance of GE is high while the variance of CE is low, and hence the estimate of cross-validation is off. For medium and large values of k, cross-validation estimates the GE accurately, for the same reason mentioned above. On increasing the sample size to 10000, shown in Figure 6-6c, the variance of GE for low values of k reduces and cross-validation estimates the GE with high precision. In general, the GE for any value of k will be estimated accurately by cross-validation in this case, but for lower sample sizes (below and around 1000) the estimates are accurate for moderate and large values of k.

In Figure 6-7 I observe that cross-validation estimates the true error accurately for a k value of 2. Increasing the k to 5, the cross-validation estimate becomes pessimistic. This is because of the increase in variance of CE. I also observe that the true error is


lower for the higher k. For the other UCI dataset, also shown in Figure 6-7, cross-validation does a good job for both the small value of k and the larger value of k. The true error in this case is lower for the higher k, since the expectations for both the k are roughly the same but the variance for the smaller k is larger. This is mainly due to the high covariance between the successive runs of cross-validation.




The framework of Dhurandhar and Dobra [2009] and developments such as the ones introduced in this thesis open new avenues in studying learning methods, allowing them to be assessed for their robustness and appropriateness for a specific task, with lucid elucidations being given for their behavior. These studies do not replace, but complement, the purely theoretical and empirical studies usually carried out when evaluating learning methods.

($vc+1$ points that are kNN lie in class $C_i$). A rigorous analysis using ideas from this thesis would have to be performed, and the complexity discussed, for the continuous kNN. I plan to address these issues in the future.


Table 6-1. Contingency table with v classes, M input vectors and total sample size $N = \sum_{i=1,j=1}^{M,v} N_{ij}$.

X | C1 | C2 | ... | Cv

b, c and d are the 3 nearest neighbours of a.


The figure shows the extent to which a point $x_i$ is near to $x_1$. The radius of the smallest encompassing circle for a point $x_i$ is proportional to its distance from $x_1$; $x_1$ is the closest point and $x_M$ is the farthest.

(a) (b) (c)
Figure 6-3. Behavior of the GE for different values of k with sample size N = 1000 and the correlation between the attributes and class labels being 1 in (a), 0.5 in (b) and 0 in (c). Std() denotes standard deviation.


(a) (b) (c)
Figure 6-4. Convergence of the GE for different values of k when the sample size (N) increases from 1000 to 100000 and the correlation between the attributes and class labels is 1 in (a), 0.5 in (b) and 0 in (c). Std() denotes standard deviation. In (b) and (c), after about N = 1500, large, mid-range and small values of k give the same error, depicted by the dashed line.

(a) (b) (c)
Figure 6-5. Comparison between the GE and the 10-fold cross-validation error (CE) estimate for different values of k when the sample size (N) is 1000 and the correlation between the attributes and class labels is 1 in (a), 0.5 in (b) and 0 in (c). Std() denotes standard deviation.


(a) (b) (c)
Figure 6-6. Comparison between the GE and the 10-fold cross-validation error (CE) estimate for different values of k when the sample size (N) is 10000 and the correlation between the attributes and class labels is 1 in (a), 0.5 in (b) and 0 in (c). Std() denotes standard deviation.

Figure 6-7. Comparison between true error (TE) and CE on 2 UCI datasets.


Cross-validation is known to perform well Kohavi [1995], Plutowski [1996] for about 10-20 folds, and is hence commonly used for small sample sizes. Most of the experimental work on cross-validation focusses on reporting observations Kohavi [1995], Plutowski [1996] and not on understanding the reasons for the observed behavior. Moreover, modeling the covariance between individual runs of cross-validation is not a straightforward task and is hence not adequately studied, though it is considered to have a non-trivial impact on the behavior of cross-validation. The work presented in Bengio and Grandvalet [2003], Markatou et al. [2005] addresses issues related to covariance, but it is focussed on building and studying the behavior of estimators for the overall variance of cross-validation. In Markatou et al. [2005] the estimators of the moments of the cross-validation error (CE) are primarily studied for the estimation-of-mean problem and in the regression setting. The goal of this chapter is quite different. I do not wish to build estimators for the moments of CE; rather, I want to experimentally observe the behavior of the moments of cross-validation and provide explanations for the observed


behavior, using the closed-form expressions of Dhurandhar and Dobra [2009, 2008, 2007]. The advantage of using these closed-form expressions is that they are exact formulas (not approximations) for the moments of CE, and hence these moments can be studied accurately with respect to any chosen distribution. In fact, as it turns out, approximating certain probabilities in these expressions also leads to significantly higher accuracy in computing the moments when compared with directly using Monte Carlo. The reason for this is that the parameter space of the individual probabilities that need to be computed in these expressions is much smaller than the space over which the moments have to be computed at large, and hence directly using Monte Carlo to estimate the moments can prove to be highly inaccurate in many cases Dhurandhar and Dobra [2009, 2008]. Another advantage of using the closed-form expressions is that they give us more control over the settings I wish to study. In summary, the goal in this chapter is to empirically study the behavior of the moments of CE (plotted using the expressions in the Appendix) and to provide interesting explanations for the observed behavior. As I will see, when studying the variance of CE, the covariance between the individual runs plays a decisive role, and hence understanding its behavior is critical to understanding the behavior of the total variance and consequently the behavior of CE. I provide insights into the behavior of the covariance apropos increasing sample size, increasing correlation between the data and the class labels, and increasing number of folds. In the next section I review some basic definitions and previous results that are relevant to the computation of the moments of CE. In Section 7.2 I provide an overview


of the approach. In Section 7.3 I conduct a brief literature survey. In Section 7.4, the experimental section, I provide some keen insights into the behavior of cross-validation, which is our primary goal. I discuss the implications of the study conducted and summarize the major developments of the chapter in Section 7.5.

The relevant results were derived in Dhurandhar and Dobra [2009]. In this section I review those results, which are used in the present study of CE. Consider that N points are drawn independently and identically (i.i.d.) from a given distribution and a classification algorithm is trained over these points to produce a classifier. If multiple such sets of N i.i.d. points are sampled and a classification algorithm is trained on each of them, I would obtain multiple classifiers. Each of these classifiers would have its own GE; hence the GE is a random variable defined over the space of classifiers induced by training a classification algorithm on each of the datasets drawn from the given distribution. The moments of GE computed over this space of all possible such datasets of size N depend on three things: 1) the number of samples N, 2) the particular classification algorithm and 3) the given underlying distribution. I denote by D(N) the space of datasets of size N drawn from a given distribution. The moments taken over this new distribution, the distribution over the space of datasets of a


given size, were derived in Dhurandhar and Dobra [2009], which I will shortly review. The characterization reduces the number of terms in the moments from an exponential in the input-output space to linear for the computation of the first moment and quadratic for the computation of the second moment.
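To make the moments concrete, here is a toy computation of the first moment of GE as a sum of per-input error contributions. All probabilities below are hypothetical placeholders; in the thesis the classifier probabilities $P_{D(N)}[\zeta(x)=y]$ come from the exact characterizations rather than being supplied by hand:

```python
# Hypothetical toy numbers: two inputs, two classes. P_zeta[x][y] stands
# in for P_D(N)[zeta(x) = y], the probability over the classifier space.
P_x = {"x1": 0.6, "x2": 0.4}            # input distribution
P_y = {"x1": {0: 0.9, 1: 0.1},          # P[Y(x) = y]
       "x2": {0: 0.2, 1: 0.8}}
P_zeta = {"x1": {0: 0.8, 1: 0.2},
          "x2": {0: 0.3, 1: 0.7}}

def expected_ge():
    """E[GE] = sum_x P(x) * sum_y P_D(N)[zeta(x)=y] * P[Y(x) != y]."""
    total = 0.0
    for x, px in P_x.items():
        for y in (0, 1):
            # classifier predicts y; the truth disagrees w.p. P[Y != y]
            total += px * P_zeta[x][y] * (1.0 - P_y[x][y])
    return total

E_GE = expected_ge()
# 0.6*(0.8*0.1 + 0.2*0.9) + 0.4*(0.3*0.8 + 0.7*0.2) = 0.308
```

The linear number of terms in the first moment corresponds to this single pass over inputs, rather than a sum over all datasets.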






In Equation 6-2, $\zeta'$ is a classifier like $\zeta$ (maybe the same, maybe different) induced by the classification algorithm trained on a sample from the underlying distribution. $P_{D(N)}[\zeta(x)=y]\,P[Y(x)\neq y]$ represents the probability of error. The first probability in the product, $P_{D(N)}[\zeta(x)=y]$, depends on the classification algorithm and the data distribution that determines the training dataset. The second probability, $P[Y(x)\neq y]$, depends only on the underlying distribution. Also note that both these probabilities are actually conditioned on $x$, but I omit writing them explicitly as conditionals, since this is an obvious fact and it makes the formulas more readable. $E_{D(N)}[\cdot]$ denotes the expectation taken over all possible datasets of size N drawn from the data distribution. The terms in Equation 6-2 also have similar semantics, but are applicable to pairs of inputs and outputs. Thus, by being able to compute each of these probabilities I can compute the moments of GE.

Moments of CE: The process of sampling a dataset (i.i.d.) of size N from a probability distribution and then partitioning it randomly into two disjoint parts of size $N_t$ and $N_s$ is statistically equivalent to sampling two different datasets of size $N_t$ and $N_s$ i.i.d. from the same probability distribution. The first moment of CE is just the expected error of the individual runs of cross-validation. In the individual runs the dataset is partitioned into disjoint training and test sets. $D^t(N_t)$ and $D^s(N_s)$ denote the space of training sets of size $N_t$ and test sets of size $N_s$ respectively. Hence, the first moment of CE is taken w.r.t. the $D^t(N_t)\times D^s(N_s)$ space, which is equivalent to the space obtained by sampling datasets of size $N = N_t + N_s$ followed by randomly splitting them into training and test sets. In


(Equation 7-3). In the covariance I have to compute the following cross moment, $E_{D^t_{ij}(\frac{v-2}{v}N)\times D^s_i(\frac{N}{v})\times D^s_j(\frac{N}{v})}[HE_i\,HE_j]$, where $D^t_{ij}(k)$ is the space of overlapped training sets of size k in the i-th and j-th runs of cross-validation ($i, j \le v$ and $i \neq j$), $D^s_f(k)$ is the space of all test sets of size k drawn from the data distribution in the f-th run of cross-validation ($f \le v$), $D^t_f(k)$ is the space of all training sets of size k drawn from the data distribution in the f-th run of cross-validation ($f \le v$), and $HE_f$ is the hold-out error of the classifier in the f-th run of cross-validation. Since the cross moment considers the interaction between two runs of cross-validation, it is taken over a space consisting of the training and test sets involving both runs rather than just one. Hence, the subscript in the cross moments is a cross product between 3 spaces (the overlapped training sets between two runs and the corresponding test sets). The other moments in the variance of CE are taken over the same space as the expected value. The variance of CE is given by,

$$\mathrm{Var}(CE) = \frac{1}{v^2}\sum_{i=1}^{v}\Big(E_{D^t_i(\frac{v-1}{v}N)\times D^s_i(\frac{N}{v})}[HE_i^2] - E^2_{D^t_i(\frac{v-1}{v}N)\times D^s_i(\frac{N}{v})}[HE_i]\Big) + \frac{1}{v^2}\sum_{i,j;\,i\neq j}^{v}\Big(E_{D^t_{ij}(\frac{v-2}{v}N)\times D^s_i(\frac{N}{v})\times D^s_j(\frac{N}{v})}[HE_i\,HE_j] - E_{D^t_i(\frac{v-1}{v}N)\times D^s_i(\frac{N}{v})}[HE_i]\,E_{D^t_j(\frac{v-1}{v}N)\times D^s_j(\frac{N}{v})}[HE_j]\Big)$$

The reason I introduced the moments of GE previously is that in Dhurandhar and Dobra [2009] relationships were drawn between these moments and the moments of CE. Thus, using the expressions for the moments of GE and the relationships, which I will state shortly, I have expressions for the moments of CE. The relationship between the expected values of CE and GE is given by,

$$E_{D^t(\frac{v-1}{v}N)\times D^s(\frac{N}{v})}[CE] = E_{D^t(\frac{v-1}{v}N)}[GE]$$


In Dhurandhar and Dobra [2009] it was shown that $E_{D^t_i(\frac{v-1}{v}N)\times D^s_i(\frac{N}{v})}[HE_i] = E_{D^t(\frac{v-1}{v}N)\times D^s(\frac{N}{v})}[CE]\ \forall i \in \{1,2,\dots,v\}$, and hence the expectation of $HE_i$ can be computed using the above relationship between the expected CE and the expected GE. Notice that the space of training and test datasets over which the moments are computed is the same for each fold (since the space depends only on the size, and all the folds are of the same size), and hence the corresponding moments are also the same. To compute the remaining terms in the variance I use the following relationships. The relationship between the second moment of $HE_i\ \forall i \in \{1,2,\dots,v\}$ and the moments of GE is given by,

$$E_{D^t_i(\frac{v-1}{v}N)\times D^s_i(\frac{N}{v})}[HE_i^2] = \frac{v}{N}\,E_{D^t(\frac{v-1}{v}N)}[GE] + \frac{N-v}{N}\,E_{D^t(\frac{v-1}{v}N)\times D^t(\frac{v-1}{v}N)}[GE\,GE'],$$

and the cross moment $E_{D^t_i(\frac{v-1}{v}N)\times D^s_j(\frac{N}{v})}[HE_i\,HE_j]$ is computed over the overlapped training-set space $D^t_{ij}(\frac{v-2}{v}N)$. In Section 7.1 I provided the generalized expressions for computing the moments of GE and consequently the moments of CE. In particular, the moments I compute are E[CE] and Var(CE). The formula for the variance of CE can be rewritten as a convex combination of the variance of the individual runs and the covariance between any two runs. Formally,

$$\mathrm{Var}(CE) = \frac{1}{v}\,\mathrm{Var}(HE_i) + \frac{v-1}{v}\,\mathrm{Cov}(HE_i, HE_j)$$


1. increasing sample size,
2. increasing correlation between the data and the class labels,
3. increasing number of folds.


In Efron [1986] cross-validation is studied in the linear regression setting with squared loss, and is shown to be biased upwards in estimating the mean of the true error. More recently the same author, in Efron [2004], compared parametric model selection techniques, namely covariance penalties, with non-parametric cross-validation, and showed that under appropriate modeling assumptions the former is more efficient than cross-validation. Breiman [1996] showed that cross-validation gives an unbiased estimate of the first moment of GE. Though cross-validation has desirable characteristics in estimating the first moment, Breiman stated that its variance can be significant. In Moore and Lee [1994] heuristics are proposed to speed up cross-validation, which can be an expensive procedure with an increasing number of folds. In Zhu and Rohwer [1996] a simple setting was constructed in which cross-validation performed poorly. Goutte [1997] refuted this proposed setting and claimed that a realistic scenario in which cross-validation fails is still an open question. The major theoretical work on cross-validation is aimed at finding bounds. The current distribution-free bounds Devroye et al. [1996], Kearns and Ron [1997], Blum et al. [1999], Elisseeff and Pontil [2003], Vapnik [1998] for cross-validation are loose, with some of them being applicable only in restricted settings, such as bounds that rely on algorithmic stability assumptions. Thus, finding tight PAC (Probably Approximately Correct) style


bounds is still an open problem Guyon [2002]. Though bounds are useful in their own right, they do not aid in studying trends of the random variable in question, in this case CE. Asymptotic analysis can assist in studying trends Stone [1977], Shao [1993] with increasing sample size, but it is not clear when the asymptotics come into play. This is where empirical studies are useful. Most empirical studies on cross-validation indicate that the performance (bias + variance) is best around 10-20 folds Kohavi [1995], Breiman [1996], while some others Schneider [1997] indicate that the performance improves with an increasing number of folds. In the experimental study that I conduct using the closed-form expressions, I observe both of these trends, but in addition I provide lucid elucidations for the observed behavior.
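The convex-combination decomposition of Var(CE) reviewed earlier can be checked numerically on simulated fold errors. The generative model below is purely hypothetical, chosen only so that the per-fold errors are correlated; the identity itself is algebraic and holds for any fold errors:

```python
import random

random.seed(1)
v, runs = 5, 2000        # folds per experiment, simulated experiments

def fold_errors():
    """Hypothetical correlated hold-out errors HE_1..HE_v: a shared
    per-dataset effect plus independent per-fold noise."""
    shared = random.gauss(0.3, 0.05)
    return [min(max(shared + random.gauss(0.0, 0.02), 0.0), 1.0)
            for _ in range(v)]

samples = [fold_errors() for _ in range(runs)]
ce = [sum(h) / v for h in samples]     # CE = average over the v runs

def mean(xs):
    return sum(xs) / len(xs)

def cov(xs, ys):
    """Empirical covariance, population convention (divide by n)."""
    mx, my = mean(xs), mean(ys)
    return mean([(a - mx) * (b - my) for a, b in zip(xs, ys)])

folds = [[h[i] for h in samples] for i in range(v)]
avg_var = mean([cov(f, f) for f in folds])
avg_cov = mean([cov(folds[i], folds[j])
                for i in range(v) for j in range(v) if i != j])

lhs = cov(ce, ce)                                  # Var(CE)
rhs = avg_var / v + (v - 1) / v * avg_cov          # convex combination
```

Because CE is the average of the v hold-out errors, the two sides agree up to floating-point rounding, and the positive pairwise covariance is exactly the term whose behavior the experiments below track.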


Connor-Linton [2003]. More precisely, I sum over all i the squares of the difference of each $p_i$ with the product of its corresponding marginals, with each squared difference being divided by this product, that is,

$$\text{correlation} = \sum_i \frac{(p_i - p_{im})^2}{p_{im}},$$

where $p_{im}$ denotes the product of the marginals corresponding to cell i.
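A minimal sketch of this measure (my own illustration; the function name and the two toy joint distributions are not from the thesis, which may normalize the statistic differently):

```python
def correlation(joint):
    """Chi-square-style association between inputs and class labels.

    joint[x][c] is the cell probability p_i for input x and class c; the
    corresponding product of marginals is p(x) * p(c)."""
    px = {x: sum(row.values()) for x, row in joint.items()}
    classes = list(next(iter(joint.values())))
    pc = {c: sum(row[c] for row in joint.values()) for c in classes}
    return sum((joint[x][c] - px[x] * pc[c]) ** 2 / (px[x] * pc[c])
               for x in joint for c in classes)

# Two extremes: the input determines the class, or is independent of it.
perfect = {"x1": {0: 0.5, 1: 0.0}, "x2": {0: 0.0, 1: 0.5}}
independent = {"x1": {0: 0.25, 1: 0.25}, "x2": {0: 0.25, 1: 0.25}}
```

On these toy tables the measure is 0 for the independent joint and 1 for the fully determined one, matching the 0-to-1 correlation scale used in the plots.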


Figures 7-1 to 7-5 are plots of the variances of the individual runs of cross-validation. Figures 7-7 to 7-11 depict the behavior of the covariance between any two runs of cross-validation. Figures 7-13 to 7-17 showcase the behavior of the total variance of cross-validation, which, as I have seen, is in fact a convex combination of the individual variance and the pair-wise covariance.

Linear behavior of Var(HE): In Figures 7-1 to 7-5 I see that the individual variances practically increase linearly with the number of folds. This linear increase occurs since the size of the test set decreases linearly with the number of folds; and I know that CE is the average error over the v runs, where the error of each run is the sum of the zero-one loss function evaluated at each test point, normalized by the size of the test set. Since the test points are i.i.d. (independent and identically distributed), so are the corresponding zero-one loss functions, and from theory I have that the variance of a random variable which is the average of T i.i.d. random variables having variance $\sigma^2 < \infty$ is given by $\frac{\sigma^2}{T}$.

In the covariance plots I observe that the covariance first decreases as the number of folds increases from 2 up until 10-20, and then increases up until v = N (called leave-one-out (LOO) validation). This strange behavior has the following explanation. At low folds, for example at v = 2, the test set for one run is the


training set for the other, and vice versa, so the errors of the two runs are highly dependent. The increase in the covariance at high folds is due to the case where LOO fails. If I have a majority classifier and the dataset contains an equal number of data points from each class, then LOO would estimate 100% error. Since each run would produce this error, the errors of any two runs are highly correlated. This effect reduces as the number of folds reduces. The classification algorithms I have chosen classify based on majority in their final inference step, i.e. locally they classify data points based on majority. At low data correlation, the probability of having an equal number of data points from each class for each input is high, and hence the covariance between the errors of two different runs is high. Thus, at high folds this effect is predominant and it increases the covariance. Consequently, I have the


V-shaped curves seen in the covariance plots, which are a combination of the first effect and this second effect.

L-shaped behavior of Cov(HEi, HEj): As I increase the correlation between the input attributes and the class labels, seen in Figures 7-7 and 7-8, the initial effect which raises the covariance is still dominant, but the latter effect (an equal number of data points from each class for each input) has extremely low probability and is not significant enough to increase the covariance at high folds. As a result, the covariance drops with increasing v. On increasing the dataset size, the covariance does not increase as much (in fact it reduces in some cases) in Figures 7-9, 7-10 and 7-11 at high folds. In Figure 7-9, though the correlation between the input attributes and class labels is low, the probability of having an equal number of data points from each class is low, since the dataset size has increased. For a given set of parameters, the probability of a particular event occurring is essentially reduced (never increased) as the number of events is increased (i.e. N increases), since the original probability mass has now to be distributed over a larger set. Hence, the covariance in Figure 7-9 drops as the number of folds increases. The behavior observed in Figures 7-10 and 7-11 has the same explanation as that for Figures 7-7 and 7-8 described before. Finally, the covariance has a V-shape for low data correlations and low dataset sizes, where the classification algorithms classify based on majority at least at some local level. In the other cases, the covariance is high initially and then reduces with an increasing number of folds.

Behavior of Var(CE) similar to covariance: Figures 7-13 to 7-17 represent the behavior of the total variance of CE. I know that the total variance is given by $\mathrm{Var}(CE) = \frac{1}{v}\,\mathrm{Var}(HE) + \frac{v-1}{v}\,\mathrm{Cov}(HE_i, HE_j)$. I have seen in Figures 7-1 to 7-5 that the Var(HE) vary almost linearly with respect to v. In other words, $\frac{1}{v}\mathrm{Var}(HE)$ is


roughly constant in v, and hence the total variance closely tracks the covariance; this yields the V-shape at low correlation and small sample size. In the other cases, the total variance reduces with an increasing number of folds.

Figures 7-19 to 7-21 depict the behavior of the expected value of CE for different amounts of correlation between the input attributes and the class labels, and for two different sample sizes. The behavior of the expected value at medium and high correlations for small and large sample sizes is the same, and hence I plot these scenarios only for small sample sizes, as shown in Figure 7-21. From the figures I observe that as the correlation increases, the expected CE reduces. This occurs since the input attributes become increasingly indicative of a particular class. As the number of folds increases, the expected value reduces, since the training set sizes increase on expectation, enhancing classifier performance.

Figures 7-21 to 7-26 depict the behavior of CE. In Figure 7-21 I observe that the best performance of cross-validation is around 10-20 folds. In the other cases, the behavior improves as the number of folds increases. In Figure 7-21 the variance at high folds is large, and hence the above sum is large for high folds. As a result I have a V-shaped curve. In the other figures the variance is low at high folds, and so is the expected value, and hence the performance improves as the number of folds increases.
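The leave-one-out failure mode invoked in this discussion is easy to reproduce. The sketch below is my own illustration; `majority_label` breaks ties toward the smaller label, an assumption the text does not specify:

```python
from collections import Counter

def majority_label(labels):
    """Majority vote over the training fold (ties -> smaller label)."""
    counts = Counter(labels)
    top = max(counts.values())
    return min(l for l in counts if counts[l] == top)

def loo_error(labels):
    """Leave-one-out error of the always-predict-majority classifier."""
    wrong = 0
    for i, y in enumerate(labels):
        train = labels[:i] + labels[i + 1:]      # drop the held-out point
        wrong += majority_label(train) != y
    return wrong / len(labels)

balanced = [0] * 10 + [1] * 10   # equal class counts -> LOO estimates 100%
skewed = [0] * 15 + [1] * 5      # clear majority -> LOO behaves sensibly
```

Removing one point from a perfectly balanced dataset always leaves the other class in the majority, so every LOO prediction is wrong; every run then produces the same error, which is exactly the mechanism behind the highly correlated fold errors at high folds.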


Figure 7-1. Var(HE) for small sample size and low correlation.

Figure 7-2. Var(HE) for small sample size and medium correlation.

These effects arise for algorithms that classify based on majority at a global level (e.g. majority classifiers) or at least at some local level (e.g. DT classification at the leaves). The other interesting fact is that all the experiments and the insights were a consequence of the theoretical formulas derived previously. I hope that non-asymptotic studies like the one presented will assist in better understanding popular prevalent techniques, in this case cross-validation.


Figure 7-3. Var(HE) for small sample size and high correlation.

Figure 7-4. Var(HE) for larger sample size and low correlation.


Figure 7-5. Var(HE) for larger sample size and medium correlation.

Figure 7-6. Var(HE) for larger sample size and high correlation.


Figure 7-8.


Figure 7-10.


Figure 7-12.


Figure 7-13. Var(CE) for small sample size and low correlation.

Figure 7-14. Var(CE) for small sample size and medium correlation.


Figure 7-15. Var(CE) for small sample size and high correlation.

Figure 7-16. Var(CE) for larger sample size and low correlation.


Figure 7-17. Var(CE) for larger sample size and medium correlation.

Figure 7-18. Var(CE) for larger sample size and high correlation.


Figure 7-19. E[CE] for small sample size and low correlation.

Figure 7-20. E[CE] for larger sample size and low correlation.


Figure 7-21. E[CE] for small sample size at medium and high correlation.

Figure 7-22.


Figure 7-24.


Figure 7-26.




McAllester [2003] in certain settings. Roughly speaking, PAC-Bayes bounds bound the difference between the expected GE and the expected




Proof. On fixing the value of y, the value of x at which the right-hand side of the above achieves its maximum is 1 (since the lower the value of x, the higher the right-hand side, but x is positive by our assumption and $x \in [-a, \dots, a]$). Thus we have the above inequality true only if,


The probability that two paths of lengths $l_1$ and $l_2$ ($l_2 \ge l_1$) co-exist in a tree based on the randomized attribute selection method is given by,

$$P[l_1\ \text{and}\ l_2\ \text{length paths co-exist}] = \sum_{i=0}^{v} {}^{v}P_i\,(l_1-i-1)!\,(l_2-i-1)!\,(r-v)\,prob_i$$

where r is the number of attributes common to the two paths, v is the number of attributes with the same values in the two paths, ${}^{v}P_i = \frac{v!}{(v-i)!}$ denotes the number of permutations, and $prob_i = 1/\dots$


Figure A-1a depicts this case. In the successive tree structures, that is, Figure A-1b and Figure A-1c, the common attribute with distinct attribute values (A4) rises higher up in the tree (to lower depths), until in Figure A-1d it becomes the root. To find the probability that the two paths co-exist, we sum up the probabilities of such arrangements/tree structures. The probability of the subtree shown in Figure A-1a is $\frac{1}{(d-4)^2}$, since repetitions are allowed in two separate paths. Finally, the first path ends at depth 6 and only one attribute has to be chosen at depth 7 for the second path, which is chosen with a probability of $1/\dots$. In Figure A-1b, where the common attribute with different values is


Figure A-1. Instances of possible arrangements.


J. Abello, P. M. Pardalos, and M. G. C. Resende, editors. Handbook of Massive Data Sets. Kluwer Academic Publishers, Norwell, MA, USA, 2002. ISBN 1-4020-0489-3.

T. Anderson. An Introduction to Multivariate Statistical Analysis. Wiley, 2003.

J. Bartlett, J. Kotrlik, and C. Higgins. Organizational research: Determining appropriate sample size for survey research. Information Technology, Learning, and Performance Journal, 19(1):43-50, 2001.

Y. Bengio and Y. Grandvalet. No unbiased estimator of the variance of k-fold cross validation. Journal of Machine Learning Research, 2003.

D. Bertsimas and I. Popescu. Optimal inequalities in probability theory: a convex optimization approach. Technical report, Dept. Math. O.R., Cambridge, Mass 02139, 1998.

A. Blum, A. Kalai, and J. Langford. Beating the hold-out: Bounds for k-fold and progressive cross-validation. In Computational Learning Theory, pages 203-208, 1999.

S. Boucheron, O. Bousquet, and G. Lugosi. Introduction to statistical learning theory. Date accessed 1/2007, http://www.kyb.mpg.de/publications/pdfs/pdf2819.pdf, 2005.

U. Braga-Neto and E. Dougherty. Exact performance of error estimators for discrete classifiers. Pattern Recognition, 38(11):1799-1814, 2005.

L. Breiman. Heuristics of instability and stabilization in model selection. The Annals of Statistics, 1996.

L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth and Brooks, 1984.

R. Butler and R. Sutton. Saddlepoint approximation for multivariate cumulative distribution functions and probability computations in sampling theory and outlier testing. Journal of the American Statistical Association, 93(442):596-604, 1998.

R. Chambers and C. Skinner. Analysis of Survey Data. Wiley, 1977.

J. Connor-Linton. Chi square tutorial. Date accessed 8/2006, http://www.georgetown.edu/faculty/ballc/webtools/web_chi_tut.html, 2003.

L. Devroye, L. Gyorfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, 1996.

A. Dhurandhar and A. Dobra. Semi-analytical method for analyzing models and model selection measures based on moment analysis. ACM Transactions on Knowledge Discovery and Data Mining, 3, 2009.


A. Dhurandhar and A. Dobra. Probabilistic characterization of nearest neighbor classifier. Technical Report, 2007.

P. Domingos and M. J. Pazzani. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29(2-3):103-130, 1997.

A. Edelman and H. Murakami. Polynomial roots from companion matrix eigenvalues. Mathematics of Computation, 64(210):763-776, 1995.

B. Efron. How biased is the apparent error rate of a prediction rule? Journal of the American Statistical Association, 81:461-470, 1986.

B. Efron. The estimation of prediction error: Covariance penalties and cross-validation. Journal of the American Statistical Association, 99:619-642, 2004.

A. Elisseeff and M. Pontil. Learning Theory and Practice, chapter Leave-one-out error and stability of learning algorithms with applications. IOS Press, 2003.

C. Goutte. Note on free lunches and cross-validation. Neural Computation, 9(6):1245-1249, 1997.

I. Guyon. NIPS. Discussion: Open Problems, 2002.

M. Hall. Correlation-based feature selection for machine learning, 1998.

M. A. Hall and G. Holmes. Benchmarking attribute selection techniques for discrete class data mining. IEEE Transactions on KDE, 2003.

P. Hall. The Bootstrap and Edgeworth Expansion. Springer-Verlag, 1992.

K. Isii. The extrema of probability determined by generalized moments (i) bounded random variables. Ann. Inst. Stat. Math, 12:119-133, 1960.

K. Isii. On the sharpness of Chebyshev-type inequalities. Ann. Inst. Stat. Math, 14:185-197, 1963.

S. Karlin and L. Shapely. Geometry of moment spaces. Memoirs Amer. Math. Soc., 12, 1953.

M. Kearns and D. Ron. Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. In Computational Learning Theory, pages 152-162, 1997.

R. Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pages 1137-1143. San Mateo, CA: Morgan Kaufmann, 1995.


J. Langford. Filed under: Prediction theory, problems. Date accessed 6/2006, http://hunch.net/index.php?p=29, 2005.

B. Levin. A representation for multinomial cumulative distribution functions. The Annals of Statistics, 9(5):1123-1126, 1981.

F. Liu, K. Ting, and W. Fan. Maximizing tree diversity by building complete-random decision trees. In PAKDD, pages 605-610, 2005.

W. Liu and A. White. Metrics for nearest neighbour discrimination with categorical attributes. In Research and Development in Expert Systems XIV: Proceedings of the 17th Annual Technical Conference of the BCES Specialist Group, pages 51-59, 1997.

M. Markatou, H. Tian, S. Biswas, and G. Hripcsak. Analysis of variance of cross-validation estimators of the generalization error. J. Mach. Learn. Res., 6:1127-1168, 2005. ISSN 1533-7928.

D. McAllester. PAC-Bayesian stochastic model selection. Mach. Learn., 51, 2003.

A. Moore and M. Lee. Efficient algorithms for minimizing cross validation error. In International Conference on Machine Learning, pages 190-198, 1994.

M. Plutowski. Survey: Cross-validation in theory and in practice. Date accessed 10/2006, www.emotivate.com/CvSurvey.doc, 1996.

A. Prekopa. The discrete moment problem and linear programming. RUTCOR Research Report, 1989.

J. Quinlan. Induction of decision trees. Machine Learning, 1(1):81-106, 1986.

I. Rish. An empirical study of the naive Bayes classifier. In IJCAI-01 Workshop on "Empirical Methods in AI", 2001.

J. Schneider. Cross validation. Date accessed 5/2008, http://www.cs.cmu.edu/schneide/tut5/node42.html, 1997.

J. Shao. Linear model selection by cross validation. Journal of the American Statistical Association, 88, 1993.

J. Shao. Mathematical Statistics. Springer-Verlag, 2003.

L. Smith. A tutorial on principal components analysis. 2002.

C. Stanfill and D. Waltz. Toward memory-based reasoning. Commun. ACM, 29(12):1213-1228, 1986. ISSN 0001-0782. doi: http://doi.acm.org/10.1145/7902.7906.

C. Stone. Consistent nonparametric regression. The Annals of Statistics, 5(4):595-645, 1977.


R. Williamson. SRM and VC theory (statistical learning theory). http://axiom.anu.edu.au/williams/papers/P151.pdf, 2001.

Wolfram-Research. Mathematica. http://www.wolfram.com/.

S.-P. Wu and S. Boyd. SDPSOL: a parser/solver for SDP and MAXDET problems with matrix structure. Date accessed 7/2006, http://www.stanford.edu/boyd/SDPSOL.html, 1996.

H. Zhu and R. Rohwer. No free lunch for cross validation. Neural Computation, 8(7):1421-1426, 1996.


Amit Dhurandhar is originally from Pune, India. He received his B.E. degree in computer science from the University of Pune in 2004. He then received his master's degree in December 2005 and his Ph.D. in summer 2009 from the University of Florida. His primary research is focused on building theory and scalable frameworks for studying classification algorithms and related techniques.