
Novel Change Detection Techniques in Multidimensional Data Mining

Permanent Link: http://ufdc.ufl.edu/UFE0021943/00001

Material Information

Title: Novel Change Detection Techniques in Multidimensional Data Mining
Physical Description: 1 online resource (130 p.)
Language: english
Creator: Song, Xiuyao
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2008

Subjects

Subjects / Keywords: Computer and Information Science and Engineering -- Dissertations, Academic -- UF
Genre: Computer Engineering thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract: Data mining is the process of automatically discovering useful information in large data repositories. The development of data mining is motivated by the challenges posed by modern data sets, such as large size, high dimensionality and heterogeneity. This thesis proposes several novel data mining methods for change detection. The first problem considered is detecting anomalies in a given data set. Anomalies are those data points that are different from the remainder of the data set. A method is proposed to make use of domain knowledge provided by the user. Often, the data include a set of environmental attributes whose values a user would never consider to be directly indicative of an anomaly. However, such attributes cannot be ignored, because they have a direct effect on the expected distribution of the result attributes whose values can indicate an anomalous observation. The method proposed in this thesis takes such differences among attributes into account. The second problem considered is detecting change of distribution in multi-dimensional data sets. For a given baseline data set and a set of newly observed data points, the proposed method defines a statistical test for deciding if the observed data points are sampled from the underlying distribution that produced the baseline data set. The method defines a test statistic that is strictly distribution-free under the null hypothesis. The experimental results show that the proposed test has substantially more power than existing methods for multi-dimensional change detection. The third problem considered is modeling the temporal change in prominence of data clusters. Existing work is based on developing a mixture model that treats the time information as one of the random variables, which causes the model to be sensitive to the distribution of time. The proposed method defines a Bayesian mixture model with a set of linear regression mixing proportions that are conditioned on the time. A Gibbs Sampler is used to derive the distributions of the random variables in the model.
General Note: In the series University of Florida Digital Collections.
General Note: Includes vita.
Bibliography: Includes bibliographical references.
Source of Description: Description based on online resource; title from PDF title page.
Source of Description: This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility: by Xiuyao Song.
Thesis: Thesis (Ph.D.)--University of Florida, 2008.
Local: Adviser: Ranka, Sanjay.
Local: Co-adviser: Jermaine, Christophe.
Electronic Access: RESTRICTED TO UF STUDENTS, STAFF, FACULTY, AND ON-CAMPUS USE UNTIL 2009-02-28

Record Information

Source Institution: UFRGP
Rights Management: Applicable rights reserved.
Classification: lcc - LD1780 2008
System ID: UFE0021943:00001





Full Text


ACKNOWLEDGMENTS

I express my sincere gratitude and appreciation to my advisors (Dr. Sanjay Ranka and Dr. Chris Jermaine) for their expert guidance and mentorship, and for their encouragement and support at all levels. I would also like to thank Mingxi Wu for his generous help in the experimental phase of my research. I thank Dr. John Gums from CHFM and Rich Andrews from the Smart Web Design company for providing the antibiotic resistance data sets and for valuable discussions on analyzing these data. I thank my committee members (Professors Sartaj Sahni, Shigang Chen, and Tan Wong) for their guidance and support.


TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
    1.1 Detecting Anomalies in a Given Data Set
    1.2 Detecting Change of Distribution Between Two Data Sets
    1.3 Detecting Temporal Changes in Data
    1.4 Our Contributions
        1.4.1 Detecting Anomalies in a Given Data Set
        1.4.2 Detecting Change of Distribution Between Two Data Sets
        1.4.3 Detecting Temporal Changes in Data
    1.5 Study Overview

2 RELATED WORK
    2.1 Type 1: Supervised Learning
    2.2 Type 2: Unsupervised Learning
        2.2.1 Detecting Anomalies in a Given Data Set
            2.2.1.1 Distribution fitting approach
            2.2.1.2 Depth-based approach
            2.2.1.3 Clustering-based approach
            2.2.1.4 Distance-based approach
            2.2.1.5 Density-based approach
            2.2.1.6 Neural network-based approach
        2.2.2 Detecting Change of Distribution between Two Data Sets
            2.2.2.1 Description-based approach
            2.2.2.2 Model-based approach
            2.2.2.3 Distance-based approach
        2.2.3 Detecting Temporal Changes in Data
            2.2.3.1 Incremental modeling
            2.2.3.2 Change-point detection
            2.2.3.3 Description-based change detection
            2.2.3.4 Surprising patterns detection
    2.3 Type 3: Semi-supervised Learning


3 CONDITIONAL ANOMALY DETECTION
    3.1 Introduction
        3.1.1 Conditional Anomaly Detection
        3.1.2 Our Contributions
        3.1.3 Study Overview
    3.2 Statistical Model
        3.2.1 The Gaussian Mixture Model
        3.2.2 Conditional Anomaly Detection
        3.2.3 Detecting Conditional Anomalies
    3.3 Learning the Model
        3.3.1 The Expectation Maximization Methodology
        3.3.2 The Direct-CAD Algorithm
        3.3.3 The GMM-CAD-Full Algorithm
        3.3.4 The GMM-CAD-Split Algorithm
        3.3.5 Choosing the Complexity of the Model
    3.4 Benchmarking
        3.4.1 Experimental Setup
        3.4.2 Creating the Set Perturbed
        3.4.3 Data Sets Tested
        3.4.4 Experimental Results
        3.4.5 Discussion
    3.5 Related Work
    3.6 Conclusions and Future Work

4 STATISTICAL CHANGE DETECTION FOR MULTI-DIMENSIONAL DATA
    4.1 Introduction
        4.1.1 Problem Definition and Prior Work
        4.1.2 Our Contributions
        4.1.3 Study Overview
    4.2 High-Level View of Density Test
    4.3 Density via Gaussian Kernels
    4.4 The Test Statistic
        4.4.1 Normality of the Test Statistic
            4.4.1.1 Applying Central Limit Theorem
            4.4.1.2 Mean of the Test Statistic
            4.4.1.3 Variance of the Test Statistic
        4.4.2 Estimating σ²
        4.4.3 Applying the Full Test Procedure
    4.5 Running in Two Directions
    4.6 Experiments
        4.6.1 Experiment One: False Positive Rate
        4.6.2 Experiment Two: Power
        4.6.3 Experiment Three: Scalability Analysis
    4.7 Related Work


    4.8 Conclusion

5 A BAYESIAN MIXTURE MODEL WITH LINEAR REGRESSION MIXING PROPORTIONS
    5.1 Introduction
        5.1.1 Motivation
        5.1.2 Linear Regression Mixture Model
        5.1.3 Our Contributions
        5.1.4 Study Overview
    5.2 Statistical Generative Model
        5.2.1 Generative Model
        5.2.2 Comparison with Distribution over Time
    5.3 The MCMC Methods
    5.4 Two Instantiations
        5.4.1 Multinomial Model
            5.4.1.1 Centroid parameters
            5.4.1.2 Mixing proportions of multinomial components
        5.4.2 Mixture Model with Gaussian Components
    5.5 Experiments
        5.5.1 Experiment 1: E. Coli
        5.5.2 Experiment 2: S. Aureus
    5.6 Related Work
    5.7 Conclusion

APPENDIX

A MORE CAD MODEL EXPERIMENT AND EM DERIVATIONS OF CAD MODEL
    A.1 Experiment on KDD Cup 1999 Dataset
    A.2 EM Derivations
        A.2.1 MLE Goal
        A.2.2 Expectation Step
        A.2.3 Maximization Step

REFERENCES

BIOGRAPHICAL SKETCH


LIST OF TABLES

3-1 Head-to-head winner when comparing recall/precision for identifying perturbed data as anomalous (each cell has winner, wins out of 13 data sets, bootstrapped p-value)
3-2 Head-to-head winner when comparing recall/precision for identifying non-perturbed outlier data as non-anomalous (each cell has winner, wins out of 13 data sets, bootstrapped p-value)
4-1 The 13 experiment data sources
4-2 False positive % on low-D group data sets
4-3 False positive % on high-D group data sets
4-4 False negative (%) of low-D group
4-5 False negative (%) of high-D group
4-6 Parameter L of every change on all the 13 data sources
4-7 Running time on Streamflow data set (time unit is seconds by default; otherwise, m stands for minutes and h stands for hours)
4-8 Fraction of the time (in %) reported in Table 4-7 that is devoted to one-time computation that could be re-used for multiple tests
5-1 Antibiotic resistance data of E. coli on multiple drugs (R: Resistant; S: Susceptible; U: Undetermined)
5-2 Model parameters


LIST OF FIGURES

1-1 Scatter plot for identifying anomalies
1-2 Box plot for comparing two data sets
1-3 Temporal change in data
2-1 Local density problem in distance-based anomaly detection
3-1 Syndromic surveillance application
3-2 Generative model used in conditional anomaly detection
4-1 Samples to verify bandwidth accuracy
4-2 Running test in two directions
5-1 Generative process for Bayesian mixture model
5-2 The f(y|t) model trained over two data sets
5-3 The f(y|t) model trained over data set with uniform time distribution
5-4 The f(y|t) model trained over data set where time has a beta distribution
5-5 Applying Bayes' rule to remove the dependence of joint model on time
5-6 Susceptible probability of E. coli to multiple antibiotics
5-7 Linear trend of the mixing proportions of the five resistance patterns of E. coli
5-8 Susceptible rate of the three resistance patterns of S. aureus
5-9 Linear trend of the mixing proportions of the three resistance patterns of S. aureus
A-1 Precision rate on KDD Cup 1999 data set by 10-fold cross-validation



An outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism [1]. In a sense, this definition leaves it up to the analyst to decide what will be considered "deviates so much." For example, one graphical technique for identifying anomalies is the scatter plot. In Figure 1-1, most of the data appear to come from a linear model with a given slope and variation, except for the single anomaly, which appears to have been generated from some other model.

Figure 1-1. Scatter plot for identifying anomalies. For most of the data, the plot reveals a basic linear relationship between X and Y. There is a single anomaly at coordinate position (100, 8).

Anomaly detection is important in many domains. For example, it is often crucial for effective statistical modeling. Anomalies should generally be excluded from data before model fitting. If the anomalous data are included in a model, then the fitted model can be poor. If the anomalies are omitted from the fitting process, then the resulting fit will be excellent almost everywhere except at the outlying points.


1.2 Detecting Change of Distribution Between Two Data Sets

Section 1.1 discussed the problem of finding anomalies which seem to be generated by a different mechanism than the remainder of the data set. The second problem category considered in this thesis extends the concept of anomalies to an entire anomalous data set. Specifically, we consider the problem of comparing two data sets to see if the two sets are generated by the same mechanism (or distribution). Assume there are two sets of independent and identically distributed (i.i.d. for short) samples taken from two unknown, multi-dimensional, non-parametric distributions, respectively. In practice, the first data set is a baseline data set. The second data set contains the most recently observed data that the user wishes to test for change. The goal is to decide whether the two data sets come from the same underlying distribution.

For example, one very simple graphical technique for checking location and variation changes between different groups of data is the box plot. An example box plot is given in Figure 1-2, which identifies the middle 50% of the data, the median, and the extreme points. The box plot can answer questions such as: Does the location differ between the two data sets? Does the level of variation differ between the two data sets?

Detecting change of distribution between two data sets has numerous applications in a variety of disciplines. For example, one may monitor the set of sales observed in the most recent day (in a commercial enterprise), or the prescriptions written recently (in a hospital), or the phone calls or emails recently observed (in the security domain), and ask: is the distribution of recently observed data different from what has been observed before?


Figure 1-2. Box plot for comparing two data sets. The box is drawn between the lower quartile (25th percentile) and the upper quartile (75th percentile). The star sign inside the box is the median of the data set. The other two star signs are the minimum point and maximum point, respectively.

1.3 Detecting Temporal Changes in Data

An example of time-evolving data is given in Figure 1-3. From the first snapshot to the last snapshot, both the center and the shape of the data set change over time. The goal of this problem category is to identify the temporal variations in the data and/or describe the variation quantitatively or qualitatively. Detecting temporal changes has many applications in real life. For example, changes in the rainfall frequency data collected over years for a specific region may indicate climate change (such as global warming). In an epidemiological study, temporal changes in the antibiotic resistance patterns of bacteria are of similar interest.


Figure 1-3. Temporal change in data. There are four snapshots of the time-evolving data, given in order of occurrence from left to right. The center of the data is at (1, 1) in the first snapshot. It moves diagonally up and to the right, and ends at (4, 4) in the last snapshot. The shape of the data set also changes over time, with variance increasing along the X-axis and decreasing along the Y-axis.

1.4 Our Contributions

Much prior work has proposed new and intriguing classes of anomalies [2-4], but there are few attempts to maintain high coverage while keeping the number of false positives to a minimum. Rather than trying to find new and intriguing classes of anomalies, it is perhaps more important to ensure that those data points that a method does find to be anomalous are in fact surprising. It is also useful to determine a subset of data attributes that are not indicative of anomalous behavior. Take the example of a syndromic surveillance system, where the daily visits to a hospital emergency room are under surveillance. The goal of the system is to detect a disease outbreak at the earliest possible instant. Assume the variables that are monitored include the daily temperature and the number of patients with fever symptoms. The daily temperature should never, by itself, be taken as direct evidence of an outbreak.


This thesis proposes the Conditional Anomaly Detection (CAD) model, which takes such differences among attributes into account, together with three expectation maximization [5] learning algorithms for the parametric CAD model. The CAD methodology and the three learning algorithms are tested against several conventional anomaly detection methods on a large number of data sets. The results show that the CAD methodology is significantly better at recognizing data points where there is a mismatch among environmental and indicator attributes, and at ignoring unusual indicator attribute values that are due to unobserved values of environmental attributes.

1.4.2 Detecting Change of Distribution Between Two Data Sets

Anomaly detection methods [3, 4] are useful for detecting whether a small subset of points is different from the baseline distribution, but they cannot be directly used for detecting changes in distribution. Data mining techniques for describing changes in multidimensional data [6, 7] do not address the statistical significance of those changes. Statistical tests for significant spatial irregularities, such as Kulldorff's spatial scan statistic [8, 9] and those for detecting disease outbreaks [10], look for hot spots or high-density regions. Rather than detecting a specific type of change in distribution, this thesis aims to detect a generic distributional change. The particular problem considered is as follows. Assume two sets of i.i.d. samples, S and S', are taken from two unknown, multi-dimensional, non-parametric distributions FS and FS', respectively. In practice, S is a baseline data set, and S' contains the most recently observed data that is to be tested for change. The null hypothesis H0 is that FS = FS'.


The thesis proposes a statistical test for generic distributional change, called the density test. An expectation maximization [5] algorithm is used in conjunction with a kernel density estimator to infer the baseline distribution. The density test is experimentally compared with two alternative tests on more than 60 different change detection tasks over 13 real data sets. The data sets range from 3 to 26 dimensions. The density test is found to have substantially more power compared to the two existing methods, is more scalable than one of the existing methods, and has a computational cost that is competitive with hierarchical space-partitioning methods.


1.4.3 Detecting Temporal Changes in Data

The thesis proposes a Bayesian mixture model whose mixing proportions are conditioned on time through a set of linear regressions. A Gibbs Sampler [11] is used to learn the joint distribution of the random variables in the mixture model. The conditional posterior distributions of the variables are derived assuming that the mixing proportions are constrained by a linear regression model. Experiments are designed to evaluate the performance of the Bayesian mixture model over real data sets.

1.5 Study Overview

Chapter 2 gives an overview of the existing work on the general subjects of anomaly detection, distributional change detection and temporal change detection. Chapter 3 proposes the Conditional Anomaly Detection method for anomaly detection. Chapter 4 discusses the techniques of statistical distributional change detection. Chapter 5 describes the Bayesian mixture model with linear regression mixing proportions for temporal change detection.


2 RELATED WORK

2.1 Type 1: Supervised Learning

Supervised learning methods train a model over prelabeled data and have been applied widely to detection problems [12-15]. However, supervised learning has two drawbacks which make it unsuitable for change detection problems. First, the supervised approach requires the input data to be prelabeled, and the labels should include both normal and abnormal examples for the purpose of training. But abnormal data are hard to obtain, since identifying the anomalies is exactly the task that the approach is expected to accomplish. So using a supervised approach for change detection suffers from what is commonly known as the chicken-and-egg problem.


2.2 Type 2: Unsupervised Learning

One example of unsupervised learning is clustering-based anomaly detection [16-18]. In these methods, first the data are grouped into several clusters, such that the data in each cluster share some common characteristics. Then those data points which are far away from all the clusters are identified and flagged as potential anomalies. Another example of unsupervised learning is anomaly detection based on an artificial neural network, such as [19]. Given a data set and a definition of the cost function, the neural network is trained in such a way that the cost function associated with all the data flowing through the network is minimized. After the network is fixed, the data points that have a large cost function are flagged as anomalies.

Unlike supervised learning, the unsupervised approach only learns the normal behavior of the data set. Those data points that the normality model does not fit well are detected as anomalies. Similar to supervised learning, it is important to ensure the input data cover all possible normal behavior so that the model can be trained to adapt to the normality. Otherwise, some unseen normal data points will be missed in the training process and thus will be treated as anomalies. However, as mentioned before, since the normal behavior is usually hard to characterize completely, this requirement is difficult to satisfy in practice.


2.2.1 Detecting Anomalies in a Given Data Set

2.2.1.1 Distribution fitting approach

The distribution fitting approach is rooted in classical statistics [1, 20]. These statistical methods fit a statistical distribution to the observed training data set. The data points in the test data set are flagged as anomalies or not depending upon the relationship of each data point with the learned distribution. Barnett and Lewis [20] provide a good survey of statistical techniques for anomaly detection. They list over 100 statistical tests. The choice of an appropriate test depends on the particular application circumstance, which is specified by factors such as the assumed data distribution, whether or not the distribution parameters are known, the number of expected anomalies, and the type of expected anomalies [21].

Although the statistical tests are well defined and easy to use, there are several shortcomings of this technique. First, all of the tests assume the observed data points follow a standard parametric distribution. But in most data mining applications, the underlying distributions of the observed data are unknown. It is not clear which parametric distribution should be used to fit the observed data. An unsuitable distribution will produce a large fitting error, and the accuracy of the corresponding statistical test will suffer. Second, most of the tests do not scale to high-dimensional data, due to the curse of dimensionality.


One approach that addresses the first limitation is proposed in [22], where an engine is designed to detect anomalies in an online process. Rather than assuming the observed data points are generated by a standard distribution, the approach fits the data to a general probabilistic finite mixture model. A mixture model is a mixture of several independent distributions. It is powerful in fitting observed data that cannot be fitted by a standard distribution. When the online data come in, the anomaly detection engine updates the mixture model in such a way that the effect of the past data points is discounted. Each arriving data point is given a score, measuring how much the mixture model changed after incorporating the new data point. A large score means that the distribution undergoes significant change after seeing the arriving data, which indicates that the new data point is an anomaly.

As for the second limitation, incurred by the curse of dimensionality, the most straightforward way to solve it is to utilize a preprocessing algorithm to reduce the dimensionality of the data space. There are two techniques for dimension reduction: feature selection and feature extraction. Feature selection tries to find a subset of the original dimensions. Feature extraction applies a mapping algorithm to transform the full-dimensional space to a space with fewer dimensions. It has been shown in [23-26] that anomaly detection algorithms become more efficient by employing a dimension reduction technique.


2.2.1.2 Depth-based approach

Depth-based approaches organize the data points into layers in the multidimensional space; several such methods have been proposed [27, 28], each corresponding to a different notion of depth. Compared to the distribution fitting approaches, the depth-based approaches avoid the problem of fitting the data points to a standard distribution. And conceptually, depth-based approaches are able to handle data in multidimensional space. However, in practice, the computation of k-dimensional layers requires the computation of k-dimensional convex hulls, which has an exponential complexity with a lower bound of Ω(N^⌈k/2⌉). It has been shown in [28] that the existing depth-based methods only give acceptable performance in 2-dimensional space.

2.2.1.3 Clustering-based approach

Clustering methods such as [16], DBSCAN [17], BIRCH [18], CURE [29] and other clustering techniques [23-25] detect anomalies as by-products of clustering algorithms. These methods first group observed data points into several clusters, and the anomalies are defined to be the data points which do not lie in any cluster. These clustering-based methods assume that there is only a small fraction of anomalies in the observed data set, and that the normal data points share some common characteristics so that they can be depicted by one or more clusters. There are two drawbacks of the clustering-based approach. First, the definition of anomalies is not objective. Rather, it is subjective and tied to the definition of clusters given in the algorithm. Different definitions of clusters will result in different clustering results, and thus in different sets of reported anomalies.


2.2.1.4 Distance-based approach

The distance-based approach was introduced by Knorr and Ng [21]. They defined the anomalies in a simple and intuitive way, as follows:

A point p in a data set is an outlier with respect to the parameters k and d, if no more than k points in the data set are at a distance d or less from p.

The authors propose two efficient algorithms to detect the anomalies. One algorithm is a straightforward nested-loop algorithm: the outer loop iterates over each point in the data set, and the inner loop computes and counts the number of data points that are within distance d of the current data point of the outer loop. This straightforward algorithm has quadratic complexity. The other algorithm proposed is based on a partitioning scheme, in which the k-dimensional space is partitioned into cells with sides of length d/(2√k).
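To make the (k, d) definition concrete, here is a minimal Python sketch of the straightforward nested-loop algorithm just described. It is an illustration under our own function names and toy data, not the authors' implementation:

```python
import numpy as np

def distance_based_outliers(data, k, d):
    """Flag points with no more than k other points within distance d,
    following the (k, d) definition above; O(N^2) nested loop."""
    n = len(data)
    flags = []
    for i in range(n):                      # outer loop: each candidate point
        dists = np.linalg.norm(data - data[i], axis=1)
        neighbors = np.sum(dists <= d) - 1  # exclude the point itself
        flags.append(neighbors <= k)        # outlier if too few close neighbors
    return np.array(flags)

# toy usage: a tight cluster plus one far-away point
pts = np.vstack([np.random.randn(100, 2), [[10.0, 10.0]]])
print(distance_based_outliers(pts, k=3, d=1.0).nonzero()[0])
```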


A variation of the distance-based definition was proposed in [30]. This method does not require the user to provide the distance threshold d. Instead, the threshold is defined by the distance of the kth nearest neighbor of a specific point. In this way, the distance threshold is not the same for all the data points; every data point has its own associated distance threshold. Compared to the definition of anomaly given in [21], the kth-nearest-neighbor method does not count the number of data points within a threshold, because this number is always equal to k by the definition of the threshold. Rather, the method decides whether a data point is an anomaly by the rank of its distance threshold among the others. Specifically, for a data point p, let Dk(p) denote the distance of the kth nearest neighbor of p. The top n anomalies are defined as the n points with the maximum Dk values.

The definition given in [30] has some advantages over that of [21]. It ranks each point by its distance to its kth nearest neighbor, and it only requires the user to specify the number of anomalies n. Usually, n is a very small number, and it is independent of the underlying data distribution; it is much easier to specify n than d. Moreover, the new definition can be simply extended so that the method can rank all the data points in the order of their outlierness: the larger the value of Dk of a data point, the greater the outlierness of that data point.
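The Dk-based ranking is easy to express directly. The sketch below is our own illustrative code, using brute-force pairwise distances rather than the efficient algorithms of [30]; it returns the indices of the n points with the largest distance to their kth nearest neighbor:

```python
import numpy as np

def top_n_outliers(data, k, n_top):
    """Rank points by D_k, the distance to the k-th nearest neighbor,
    and return the indices of the n_top points with the largest D_k."""
    diffs = data[:, None, :] - data[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)   # full pairwise distance matrix
    dists.sort(axis=1)                      # column 0 is the self-distance 0
    dk = dists[:, k]                        # k-th nearest neighbor distance
    return np.argsort(dk)[::-1][:n_top]     # largest D_k first
```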


2.2.1.5 Density-based approach

Density-based approaches were motivated by the local density problem (Figure 2-1) that a distance-based approach might suffer from.

Figure 2-1. Local density problem in distance-based anomaly detection. If a global threshold on distance is used to judge the anomalies, then p1, p2 and all the points in cluster B will be flagged as anomalies. But in real applications, the user may only want to flag p1 and p2 as anomalies, because they deviate significantly from their local structure.

One of the density-based approaches is proposed by Breunig et al. in [3]. The method can detect anomalies in data whose density is not homogeneous over the whole data space. The key idea of this method is that the definition of the anomaly should be local, in the sense that the outlierness of a data point should be determined by the structure of a bounded neighborhood of that data point. The bounded neighborhood of a data point is defined by the point's k nearest neighbors. In order to decide the anomalies, each data point p is associated with a local outlier factor (LOF), which is the average of the ratios of the local reachability densities of p's k nearest neighbors to that of p itself. In typical use, data points with a high LOF are flagged as anomalies.
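As an illustration, the LOF scheme can be exercised with the implementation available in scikit-learn; the two clusters of different density below echo the situation in Figure 2-1, but the data and parameter choices are invented for the example and do not come from [3]:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# two clusters of very different density, plus two points that
# deviate from their local structure (cf. Figure 2-1)
dense  = rng.normal(0.0, 0.1, size=(200, 2))
sparse = rng.normal(5.0, 1.5, size=(50, 2))
X = np.vstack([dense, sparse, [[0.0, 1.0], [5.0, 9.0]]])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)             # -1 marks flagged anomalies
scores = -lof.negative_outlier_factor_  # LOF itself; high = more outlying
print(np.argsort(scores)[::-1][:5])     # five most outlying points
```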


The LOF method is extended in [31] by Papadimitriou and Kitagawa. They calculate a multi-granularity deviation factor (MDEF) for each data point. A point is flagged as an anomaly if its MDEF value deviates by more than three standard deviations from the local average MDEF values in its neighborhood. However, this extended LOF method has a high computational cost for computing the standard deviation.

2.2.1.6 Neural network-based approach

One neural network-based approach to anomaly detection uses the replicator neural network (RNN) [19]. The RNN is a variation of a neural network; as indicated by the "replicator" in its name, in an RNN the input data are also used as the output results. The RNN acts as a function that reproduces the input data at the output layer. For anomaly detection applications, the data are fed into the RNN for training. Accordingly, the variables of the RNN are adjusted to minimize the mean square error over all the input data. As a result, normal data points are more likely to be reproduced well by the trained RNN. Those data points with a large reconstruction error are flagged as anomalies. The reconstruction error can also be used as a measurement of the outlierness of a data point.


2.2.2 Detecting Change of Distribution between Two Data Sets

2.2.2.1 Description-based approach

In [32], Chawathe and Garcia-Molina proposed an algorithm called MH-DIFF for detecting changes between two hierarchically structured data snapshots, or trees. They describe the changes as an edit script that gives a sequence of operations needed to transform one tree into the other. Similar work has been done by Cobena et al. to detect changes between the old version and the new version of tree-structured XML documents [33]. Another method, proposed by Aggarwal in [7], is designed for monitoring a data stream and determining the trends in data evolution. The author proposed a concept called velocity density estimation, which is used to create both temporal velocity profiles and spatial velocity profiles of the data stream at fixed instances in time. These profiles are then used to predict the types of data change, such as dissolution, coagulation and shift.

2.2.2.2 Model-based approach

The model-based approach generalizes the classical goodness-of-fit tests [34], such as the commonly used chi-square test. A method first learns the underlying data distribution from one data set, then uses a goodness-of-fit test to measure how well the other data set fits this learned distribution. If the distribution fits the second data set well, that means the second data set is generated by the same mechanism that was used to generate the first data set.


A generic model-based framework is proposed in [35, 36]. The method is designed to work for data sets under block evolution, in which a data set is updated periodically by inserting or deleting blocks of data points at a time. The authors describe a generic framework named FOCUS, where a data mining model, such as a decision tree model or a frequent itemsets model, is constructed for each of the two data sets to capture the characteristics in the data. FOCUS then uses the difference between the two data mining models as the measurement of the changes/deviations between the two underlying data sets.

The FOCUS framework also addresses the statistical significance of the deviations between two data sets. To see why it is critical to address the statistical significance, take a simple example. Assume there are three data mining models constructed out of three data sets D1, D2 and D3, respectively. The deviation between D1 and D2 is 0.01, and that between D1 and D3 is 0.1. These deviation values just indicate that the data characteristics of D1 and D2 are more similar than those of D1 and D3; it is still unclear whether D1 and D2 have different distributions. The FOCUS framework uses bootstrapping techniques to compute the distribution of the deviation values under the assumption that the two data mining models are the same. This distribution can then be used to decide whether the observed deviation value indicates a significant change.

From the above discussion of the FOCUS framework, it can be seen that it generalizes the goodness-of-fit statistical tests in two ways. First, FOCUS uses popular data mining models to depict the observed data set, while goodness-of-fit tests try to fit the data by traditional statistical distributions. Second, FOCUS uses the bootstrapping technique to obtain the distribution of the change/deviation values under the hypothesis that the two data sets have the same distribution, while goodness-of-fit tests usually assume the change measurement follows some known distribution.


2.2.2.3 Distance-based approach

A distance-based method is proposed in [37]. The method constructs an optimal non-bipartite matching over all data points from the two data sets. An optimal matching is a pairing of the observed data points such that the total within-pair distance is minimized. The distance between the two data sets is inversely dependent on the number of cross-match pairs, in which one data point is from the first data set and the other is from the second data set. If the two underlying distributions are different, the cross-match pairs will be very few. The method also derives the null distribution of the distance. This distribution is used to judge whether the observed distance indicates a significant change or not.

Another distance-based method was recently proposed by Dasu et al. in [38]. The method first uses a partitioning scheme, such as a quadtree or k-d tree, to map the two data sets to two distributions, respectively. Then it uses relative entropy, which is also called the Kullback-Leibler distance, to measure the difference between the two distributions. Finally, a statistical inference procedure based on the theory of bootstrapping determines the significance of the observed distance.
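The following Python sketch conveys the flavor of this last approach under simplifying assumptions of our own: it uses one-dimensional histograms over a fixed grid in place of the space-partitioning trees of [38], computes the Kullback-Leibler distance, and bootstraps from the pooled sample to judge significance. Function names and parameters are illustrative:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-9):
    """Relative entropy D(p || q) between two discrete distributions."""
    p, q = p + eps, q + eps                 # smoothing to avoid log(0)
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def change_significance(s, s_new, bins=10, n_boot=500, seed=0):
    """Histogram both samples on a shared grid, measure their KL
    distance, and bootstrap from the pooled sample to estimate how
    often a distance at least that large arises under 'no change'."""
    rng = np.random.default_rng(seed)
    edges = np.histogram_bin_edges(np.concatenate([s, s_new]), bins=bins)
    observed = kl_divergence(np.histogram(s, edges)[0].astype(float),
                             np.histogram(s_new, edges)[0].astype(float))
    pooled, count = np.concatenate([s, s_new]), 0
    for _ in range(n_boot):
        a = rng.choice(pooled, size=len(s), replace=True)
        b = rng.choice(pooled, size=len(s_new), replace=True)
        d = kl_divergence(np.histogram(a, edges)[0].astype(float),
                          np.histogram(b, edges)[0].astype(float))
        count += d >= observed
    return observed, count / n_boot         # distance and bootstrap p-value
```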


2.2.3 Detecting Temporal Changes in Data

2.2.3.1 Incremental modeling

A straightforward way to track a time-varying data stream is to repeatedly learn a model over a sliding window of recent data [39]. This kind of incremental learning is also called batch learning, since each window corresponds to a batch of data points. In batch learning, the models are independently learned over each batch. There are two drawbacks of batch learning. First, the computational cost of repeatedly learning the model from each window is very high. Second, the effectiveness of this approach depends on the size of the sliding window, but the window size is hard to decide, because the change points of the generating mechanism of the data stream are unpredictable and variable. On the one hand, if the window is too large compared to the rate of the changes, the model will include too much historical information, and the current generating mechanism will be masked and not reflected in the adaptive model. On the other hand, if the window is too small, the approach will have insufficient data samples to learn a complex model, so this method is prone to overfitting.

To address the problems caused by batch learning, Domingos and Hulten [40] propose an incremental learning system, named VFDT, to construct an incremental classifier. The classifier is incremental rather than "batch" in the sense that the new classifier is constructed based on the old one. The time and space complexity of updating the incremental classifier is constant as each data point comes in. This method assumes that the underlying distribution of the data stream is stationary. This assumption is often violated, because in many cases the underlying generating mechanisms of data streams change over time. Incremental classifiers that work for time-varying data streams are proposed in [41] and [42]. The BOAT system proposed in [41] guarantees that the incrementally trained model is exactly the same as the model that would be trained in batch learning mode.


CVFDT [42] is an extension of VFDT that removes the assumption of a stationary distribution of data streams. In order to improve efficiency, CVFDT also removes the restriction of BOAT that the incremental model is exactly the same as it would be if trained in batch learning mode. The single classifier is generalized to weighted ensemble classifiers in [43]. The set of classifiers is weighted based on their expected classification accuracy on the current data points. Compared to the single-classifier approaches, the ensemble approach avoids the hard problem of deciding which part of the data stream is out of date, because it does not have to throw away any old data. The classifiers which can predict the current data points more accurately are given more weight. That means the definition of "out of date" is no longer based on arrival time; rather, it is based on the similarity of the distribution to the current data, which is more data-centric and closer to the real-life definition.

Besides classifiers, there are other incremental data mining models, such as association rules [44, 45] and frequent itemsets [46]. The work in [35] extends incremental modeling methods to arbitrary data mining models and generalizes the data stream to block evolution. Although all the above techniques can be applied to temporal change detection by measuring the difference between data mining models learned in different time segments [36], the main goal of incremental modeling is to maintain an up-to-date model as the recent time window moves forward. These techniques do not explicitly detect a change or describe it substantively.

2.2.3.2 Change-point detection

Change-point detection has long been studied in statistics [47, 48]. In these statistical methods, the user needs to specify the number of change points in the data stream, and the methods assume a known, standard underlying distribution for each interval between two adjacent change points.


A more flexible method is proposed in [49]. Theoretically, the method can fit arbitrary models over the time series data. It uses model selection techniques to choose an appropriate model for each time interval. Also, the method does not require the user to input the number of change points. It fits a model to a time interval and uses a likelihood criterion to decide whether this time interval includes a change point in the middle; a midway change point causes the time interval to be further partitioned. In this way, the set of change points is determined by a maximum likelihood method.

Kifer et al. use a two-window paradigm to detect the change points in data streams [50]. The current window moves forward whenever a new data point comes in. The reference window is updated whenever a change point is detected, to reflect the latest behavior. The method defines a distance function between the two windows. A statistically large distance indicates a change point at the front end of the current window. Besides detecting the time point of change, the method also detects the combination of dimensions where the change is most dramatic. This is very useful for understanding and investigating the changes when the dimensions do not all exhibit similar amounts of change, or when the changes only occur on a particular subset of the dimensions.
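A minimal sketch of the two-window paradigm follows. The distance function (a maximum gap between empirical CDFs) and the fixed threshold are illustrative choices of our own; Kifer et al. derive statistically justified distance functions and bounds:

```python
import numpy as np

def window_distance(ref, cur):
    """Kolmogorov-Smirnov-style distance: maximum gap between the
    empirical CDFs of the two windows (one simple choice of d)."""
    grid = np.sort(np.concatenate([ref, cur]))
    cdf_r = np.searchsorted(np.sort(ref), grid, side="right") / len(ref)
    cdf_c = np.searchsorted(np.sort(cur), grid, side="right") / len(cur)
    return np.max(np.abs(cdf_r - cdf_c))

def detect_changes(stream, win=200, threshold=0.25):
    """Slide a current window along the stream; whenever its distance
    to the reference window is large, report a change point and reset
    the reference window to reflect the latest behavior."""
    changes, ref_start = [], 0
    for t in range(win, len(stream) - win, win // 4):
        ref = stream[ref_start:ref_start + win]
        cur = stream[t:t + win]
        if window_distance(ref, cur) > threshold:
            changes.append(t)
            ref_start = t                   # reference window is updated
    return changes
```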


2.2.3.3 Description-based change detection

The velocity density estimation method [7] can also be applied here. As already discussed under distributional change detection, this method captures the evolution trends of the data stream and describes the types of change, such as dissolution, coagulation and shift. In [51], Aggarwal et al. proposed a framework for clustering evolving data streams. They use a two-phase clustering framework: an online micro-clustering phase keeps storing approximate summary statistics for the fast incoming data streams; when a user requests a clustering result, an offline macro-clustering phase provides the result based on the summary statistics of the data falling into the queried time range. By comparing the clustering results of two time segments, the evolution can be described as: there are new clusters in time segment two which were not present in segment one; some clusters are lost from segment one to segment two; or some clusters shift in position from segment one to segment two.

There are other techniques for detecting and describing changes for particular data types. The methods proposed in [32, 33] work for hierarchically structured data; the change is described as the sequence of operations needed to transform one time segment into the other. A particular family of temporal change detection methods works on mining document features [52-54]. These methods usually use a hierarchical mixture model for modeling the words in the documents, generally called a topic model. The resulting model can describe the topic evolution as topic shift, or as the change of the prevalence of some specific topics.

2.2.3.4 Surprising patterns detection

The problem of detecting surprising patterns is considered in [55]. The authors declared an itemset as interesting not because its support exceeds some threshold, but because its support varies over time in a surprising way.


A similar notion is used in [56] to identify interesting behaviors which may indicate a fraudulent activity. In [57], Dong and Li proposed a method for detecting emerging itemsets, whose support increases significantly over time.

2.3 Type 3: Semi-supervised Learning

Semi-supervised learning includes semi-supervised classification [58-60] and semi-supervised clustering [61, 62]. Similar to unsupervised learning, a semi-supervised approach also aims to learn a boundary of normal behavior. In semi-supervised classification, the boundary is refined by applying the classifier to probabilistically label the unlabeled input data. The approach then uses some optimization technique to adjust the classifier such that a specified error function is minimized over all the input data. In semi-supervised clustering, the normality boundary is refined by incorporating the labeled data into the objective function and then adjusting the clustering result by minimizing the objective function. Like the previous two types, semi-supervised learning requires that the training data have a complete coverage of all normal data; this requirement guarantees that the normal behavior is fully captured. In the situation where a large amount of data is unlabeled and only a small fraction is prelabeled, semi-supervised learning will outperform unsupervised learning by making full use of the knowledge in the prelabeled data.



3 CONDITIONAL ANOMALY DETECTION

3.1 Introduction

Unsupervised anomaly detection has been studied extensively (for an overview, see Markou and Singh's survey [63]). Applications include medical informatics [64], computer vision [65][66], computer security [67], sensor networks [68], general-purpose data analysis and mining [4][3][30], and many other areas. However, in contrast to problems in supervised learning, where studies of classification accuracy are the norm, little research has systematically addressed the issue of accuracy in general-purpose unsupervised anomaly detection methods. Papers have suggested many alternate problem definitions that are designed to boost the chances of finding anomalies (again, see Markou and Singh's survey [63]), but there have been few systematic attempts to maintain high coverage at the same time that false positives are kept to a minimum.

Accuracy in unsupervised anomaly detection is important because, if used as a data mining or data analysis tool, an unsupervised anomaly detection methodology will be given a budget of a certain number of data points that it may call anomalies. In most realistic scenarios, a human being must investigate candidate anomalies reported by an automatic system, and usually has a limited capacity to do so. This naturally limits the number of candidate anomalies that a detection methodology may usefully produce. Given that this number is likely to be small, it is important that most of those candidate anomalies are interesting to the end user. This is especially true in applications where anomaly detection software is used to monitor incoming data in order to report anomalies in real time. When such events are detected, an alarm is sounded that requires immediate human investigation and response. Unlike the offline case, where the cost and frustration associated with human involvement can usually be amortized over multiple alarms in each batch of data, each false alarm in the online case will likely result in an additional notification of a human expert, and the cost cannot be amortized.


3.1.1 Conditional Anomaly Detection

Most existing methods model the data distribution either explicitly (as in statistical approaches [69][22][70]) or implicitly (as in various distance-based methods [30][4][3]). However, the questionable assumption made by virtually all existing methods is that there is no a priori knowledge indicating that the data attributes should not all be treated in the same fashion. This assumption is likely to cause problems with false positives in many problem domains. In almost every application of anomaly detection, there are likely to be several data attributes that a human being would never consider to be directly indicative of an anomaly. By allowing an anomaly detection methodology to consider such attributes equally, accuracy may suffer.

For example, consider the application of online anomaly detection to syndromic surveillance, where the goal is to detect a disease outbreak at the earliest possible instant. Imagine that we monitor two variables: max_daily_temp and num_fever. max_daily_temp tells us the maximum outside temperature on a given day, and num_fever tells us how many people were admitted to a hospital emergency room complaining of a high fever. Clearly, max_daily_temp should never be taken as direct evidence of an anomaly. Whether it was hot or cold on a given day should never directly indicate whether or not we think we have seen the start of an epidemic. For example, if the high in Gainesville, Florida on a given June day was only 70 degrees Fahrenheit (when the average high temperature is closer to 90 degrees), we simply have to accept that it was an abnormally cool day, but this does not indicate in any way that an outbreak has occurred.


Figure 3-1. Syndromic surveillance application.

While the temperature may not directly indicate an anomaly, it is not acceptable to simply ignore max_daily_temp, because num_fever (which clearly is of interest in detecting an outbreak) may be directly affected by max_daily_temp, or by a hidden variable whose value can easily be deduced from the value of the max_daily_temp attribute. In this example, we know that people are generally more susceptible to illness in the winter, when the weather is cooler. We call attributes such as max_daily_temp environmental attributes. The remainder of the data attributes (which the user would consider to be directly indicative of anomalous data) are called indicator attributes.

The anomaly detection methodology considered in this chapter, called conditional anomaly detection, or CAD, takes into account the difference between the user-specified environmental and indicator attributes during the anomaly detection process, and how this difference affects the idea of an anomaly. For example, consider Figure 3-1. In this figure, Point A and Point B are both anomalies or outliers based on most conventional definitions. However, if we make use of the additional information that max_daily_temp is not directly indicative of an anomaly, then it is likely safe for an anomaly detection system to ignore Point A. Why? If we accept that it is a cold day, then encountering a large number of fever cases makes sense, reducing the interest of this observation. For this reason, the CAD methodology will only label Point B an anomaly.

The methodology we propose works as follows. We assume that we are given a baseline data set and a user-defined partitioning of the data attributes into environmental attributes and indicator attributes, based upon which attributes the user decides should be directly indicative of an anomaly. Our method then analyzes the baseline data and learns which values are usual or typical for the indicator attributes. When a subsequent data point is observed, it is labeled anomalous if its indicator attribute values are atypical given its environmental attribute values.


3.1.2 Our Contributions

This chapter defines the CAD model and proposes three expectation maximization [5] learning algorithms for the parametric CAD model. The work describes a rigorous testing protocol and shows that if the true definition of an anomaly is a data point whose indicator attributes are not in keeping with its environmental attributes, then the CAD methodology does indeed increase accuracy. By comparing the CAD methodology and its three learning algorithms against several conventional anomaly detection methods on thirteen different data sets, the work shows that the CAD methodology is significantly better at recognizing data points where there is a mismatch among environmental and indicator attributes, and at ignoring unusual indicator attribute values that are still in keeping with the observed environmental attributes. Furthermore, this effect is shown to be statistically significant, and not just an artifact of the number of data sets chosen for testing.

3.1.3 Study Overview

In Section 3.2, we describe the statistical model that we use to model environmental and indicator attributes, and the relationships between them. Section 3.2 also details how this model can be used to detect anomalies. Section 3.3 describes how to learn this model from an existing data set.


Experimental results are given in Section 3.4, and related work in Section 3.5; the chapter is concluded in Section 3.6. Appendix A.2 gives a mathematical derivation of our learning algorithms.

3.2 Statistical Model

Like many other statistical methods (see the survey [63]), the detection algorithms described in this work make use of a two-step process:

1. First, existing data are used to build a statistical model that captures the trends and relationships present in the data.

2. Then, during the online anomaly detection process, future data records are checked against the model to see if they match with what is known about how various data attributes behave individually, and how they interact with one another.

Like many other statistical methods, our algorithms rely on the basic principle of maximum likelihood estimation (MLE) [71]. In MLE, a (possibly complex) parametric distribution f is first chosen to represent the data. f is then treated as a generative model for the data set in question. That is, we make the assumption that our data set of size n was in fact generated by drawing n independent samples from f. Given this assumption, we adjust the parameters governing the behavior of f so as to maximize the probability that f would have produced our data set. Formally, to make use of MLE we begin by specifying a distribution f(x | θ1, θ2, ..., θk), where the probability density function f gives the probability that a single experiment or sample from the distribution would produce the outcome x. The parameter Θ = ⟨θ1, θ2, ..., θk⟩ governs the specific characteristics of the distribution (such as the mean and variance). Since we are trying to model an entire data set, we do not have only a single experiment x; rather, we view our data set X = ⟨x1, x2, ..., xn⟩ as the result of performing n experiments over f. Assuming that these experiments are independent, the likelihood that we would have observed X = ⟨x1, x2, ..., xn⟩ is given by:

L(X | Θ) = ∏_{i=1}^{n} f(xi | Θ)
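As a small worked illustration of this likelihood (our own toy data, with a one-dimensional Gaussian as f, whose MLE has a closed form):

```python
import numpy as np
from scipy.stats import norm

# data assumed drawn i.i.d. from f(x | mu, sigma); toy sample
x = np.array([4.9, 5.1, 5.0, 4.8, 5.2])

def log_likelihood(x, mu, sigma):
    """log L(X | Theta) = sum_i log f(x_i | Theta): the log of the
    product-of-densities likelihood defined above."""
    return float(np.sum(norm.logpdf(x, loc=mu, scale=sigma)))

# for a Gaussian, the MLE is the sample mean and the (biased)
# sample standard deviation; these maximize the likelihood
mu_hat, sigma_hat = x.mean(), x.std()
print(log_likelihood(x, mu_hat, sigma_hat))   # best achievable fit
print(log_likelihood(x, 0.0, 1.0))            # a much worse fit
```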


3.2.1 The Gaussian Mixture Model

The classical Gaussian mixture model (GMM) is a common choice for the generative distribution f in statistical anomaly detection (see [63]). The overall process begins by first performing the MLE required to fit the GMM to the existing data, using one of several applicable optimization techniques. Of course, the problem described in the introduction of this work is that if we allow all attributes to be treated in the same fashion (as all existing parametric methods seem to do), then false positives may become a serious problem. Our solution to this problem is to define a generative statistical model, encapsulated by a PDF fCAD, that does not treat all attributes identically. This model is described in detail in the next subsection.


3.2.2 Conditional Anomaly Detection

Figure 3-2. Generative model used in conditional anomaly detection. Given x, a vector of environmental attribute values, we first determine which Gaussian from U generated x (step 1). Next, we perform a random trial using the mapping function to see which Gaussian from V we map to (step 2). In our example, we happen to choose V3, which has a probability 0.6 of being selected given that x is from U2. Finally, we perform a random trial over the selected Gaussian to generate y (step 3).

The CAD model uses one set of Gaussians, U, to model the environmental attributes x, and a second set of Gaussians, V, to model the indicator attributes y. The net result is that when we perform an MLE to fit our model to the data, we will learn how the environmental attributes map to the indicator attributes. Formally, our model uses the following three sets of parameters, which together make up the parameter set Θ:

- A set U of nU Gaussians modeling the distribution of the environmental attributes;
- A set V of nV Gaussians modeling the distribution of the indicator attributes;
- A mapping function p(Vj | Ui), giving the probability that the indicator attributes are produced by the Gaussian Vj, given that the environmental attributes were produced by the Gaussian Ui.


The generative process for a single data point proceeds as follows:

1. The process begins with the assumption that the values for a data point's environmental attributes were produced by a sample from U. Let Ui denote the Gaussian in U which was sampled. Note that the generative process does not actually produce x; it is assumed that x was produced previously, and the Gaussian which produced x is given as input.

2. Next, we use Ui to toss a set of dice to determine which Gaussian from V will be used to produce y. The probability of choosing Vj is given directly by the mapping function, and is p(Vj | Ui).

3. Finally, y is produced by taking a single sample from the Gaussian that was randomly selected in Step (2).

The process of generating a value for y given a value for x is depicted in Figure 3-2. It is important to note that while our generative model assumes that we know which Gaussian was used to generate x, in practice this information is not included in the data set. We can only infer it by computing a probability that each Gaussian in U was the one that produced x. As a result, fCAD(y | Θ, x) is defined as follows:

fCAD(y | Θ, x) = Σ_{i=1}^{nU} p(x ∈ Ui) Σ_{j=1}^{nV} fG(y | Vj) p(Vj | Ui)

where p(x ∈ Ui) is the probability that x was produced by the ith Gaussian in U, and fG(y | Vj) is the density of the jth Gaussian in V evaluated at y.
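The definition of fCAD translates directly into code. The sketch below is our own illustration (not the thesis's implementation), assuming the parameters have already been learned and that p(x ∈ Ui) is computed with Bayes' rule from the priors p(Ui):

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def f_cad(y, x, U, V, p_v_given_u, p_u):
    """Density of indicator vector y given environmental vector x, per
    the definition above.  U and V are lists of (mean, cov) pairs;
    p_v_given_u[i][j] = p(V_j | U_i); p_u[i] = p(U_i).  The Gaussian
    that generated x is not observed, so p(x in U_i) is inferred with
    Bayes' rule over the Gaussians in U."""
    wx = np.array([p_u[i] * mvn.pdf(x, *U[i]) for i in range(len(U))])
    wx /= wx.sum()                              # p(x in U_i) for each i
    fy = np.array([mvn.pdf(y, *V[j]) for j in range(len(V))])
    return float(sum(wx[i] * sum(p_v_given_u[i][j] * fy[j]
                                 for j in range(len(V)))
                     for i in range(len(U))))
```

In use, a test point (x, y) would be scored by fCAD(y | Θ, x); a small density marks the indicator values as atypical for the observed environment, i.e., a conditional anomaly.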



3.3.1 The Expectation Maximization Methodology

This subsection describes the generic expectation maximization (EM) [5] framework that serves as a basis for all three of the algorithms:

1. While the model continues to improve:
(a) Let Θ be the current best guess as to the optimal configuration of the model.
(b) Let Θ̄ be the next best guess as to the optimal configuration of the model.
(c) E-Step: compute the expected log-likelihood of the data under Θ̄, where the expectation over the hidden variables of the model is taken with respect to the current guess Θ.
(d) M-Step: choose Θ̄ to maximize this expectation, and set Θ := Θ̄.
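To make the E-step/M-step alternation concrete, here is a generic EM loop for a plain two-component, one-dimensional Gaussian mixture. This is a deliberately simpler model than the CAD model, and the initialization and iteration count are arbitrary choices of ours:

```python
import numpy as np
from scipy.stats import norm

def em_gmm_1d(x, iters=50):
    """Generic EM loop for a two-component 1-D Gaussian mixture: the
    E-step computes membership probabilities under the current guess
    Theta; the M-step re-estimates Theta from the weighted data."""
    mu = np.array([x.min(), x.max()])        # crude initial guess
    sd = np.array([x.std(), x.std()])
    pi = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        dens = pi * np.stack([norm.pdf(x, mu[k], sd[k]) for k in (0, 1)],
                             axis=1)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted MLE updates become the next best guess
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sd = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
        pi = nk / len(x)
    return mu, sd, pi
```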



The Direct-CAD Algorithm:

1. Choose an initial set of values for ⟨μVj, ΣVj⟩, ⟨μUi, ΣUi⟩, p(Vj | Ui), and p(Ui), for all i, j.

2. While the model continues to improve (measured by the improvement of the log-likelihood):
(a) Compute bkij for all k, i, and j, as described under the E-Step above;
(b) Compute the updated parameters ⟨μ̄Vj, Σ̄Vj⟩, ⟨μ̄Ui, Σ̄Ui⟩, p̄(Vj | Ui), and p̄(Ui), as described under the M-Step above;
(c) Set ⟨μVj, ΣVj⟩ := ⟨μ̄Vj, Σ̄Vj⟩, ⟨μUi, ΣUi⟩ := ⟨μ̄Ui, Σ̄Ui⟩, p(Vj | Ui) := p̄(Vj | Ui), and p(Ui) := p̄(Ui).


3.3.3 The GMM-CAD-Full Algorithm

Although EM is guaranteed to converge to a locally optimal solution [72], the problem remains that the CAD problem is certainly complex as MLE problems go, since the task is to simultaneously learn two sets of Gaussians as well as a mapping function, and the problem is complicated even more by the fact that the function fCAD is only conditionally dependent upon the set of Gaussians corresponding to the distribution of the environmental attributes. As a result, the second learning algorithm that we consider in this work makes use of a two-step process. We begin by assuming that the number of Gaussians in U and V is identical; that is, nU = nV. Given this, a set of Gaussians is learned in the combined environmental and indicator attribute space. These Gaussians are then projected onto the environmental attributes and indicator attributes to obtain U and V. Then, only after U and V have been fixed, a simplified version of the Direct-CAD algorithm is used to compute only the values p(Vi | Uj) that make up the mapping function. The benefit of this approach is that by breaking the problem into two much easier optimization tasks, each solved by a much simpler EM algorithm, we may be less vulnerable to the problem of local optima. The drawback of this approach is that we are no longer trying to maximize fCAD directly; we have made the simplifying assumption that it suffices to first learn the Gaussians, and then learn the mapping function.

The GMM-CAD-Full Algorithm:

1. Learn a set Z of nU (dU + dV)-dimensional Gaussians over the data set {(x1, y1), (x2, y2), ..., (xn, yn)}.

2. Let μZi refer to the centroid of the ith Gaussian in Z, and let ΣZi refer to the covariance matrix of the ith Gaussian in Z. We then determine U as follows. For i = 1 to nU do: For j = 1 to dU do: μUi[j] := μZi[j]; For k = 1 to dU do: ΣUi[j][k] := ΣZi[j][k].


3. Next, determine V as follows. For i = 1 to nV do: For j = 1 to dV do: μVi[j] := μZi[j + dU]; For k = 1 to dV do: ΣVi[j][k] := ΣZi[j + dU][k + dU].

4. Run a second EM algorithm to learn the mapping function. While the model continues to improve (measured by the improvement of the log-likelihood):
(a) Compute bkij for all k, i, and j, as in the previous subsection.
(b) Compute the updated mapping probabilities p̄(Vj | Ui).
(c) Set p(Vj | Ui) := p̄(Vj | Ui).
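Steps 2 and 3 amount to slicing the joint mean vector and covariance matrix; a short numpy sketch (our own notation) makes this explicit:

```python
import numpy as np

def project_gaussians(mu_z, cov_z, d_u):
    """Split a (d_u + d_v)-dimensional Gaussian into its environmental
    and indicator marginals by slicing the mean and covariance, as in
    Steps 2 and 3 above."""
    mu_u, cov_u = mu_z[:d_u], cov_z[:d_u, :d_u]
    mu_v, cov_v = mu_z[d_u:], cov_z[d_u:, d_u:]
    return (mu_u, cov_u), (mu_v, cov_v)

# toy joint Gaussian with d_u = 2 environmental and d_v = 1 indicator dims
mu_z  = np.array([0.0, 1.0, 5.0])
cov_z = np.array([[1.0, 0.2, 0.5],
                  [0.2, 2.0, 0.3],
                  [0.5, 0.3, 4.0]])
U_i, V_i = project_gaussians(mu_z, cov_z, d_u=2)
```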


3.3.4 The GMM-CAD-Split Algorithm

The GMM-CAD-Split Algorithm:

1. Learn U and V by performing two separate EM optimizations.

2. Run Step (4) from the GMM-CAD-Full Algorithm to learn the mapping function.

In the GMM-CAD-Split algorithm, learning the Gaussians in U and V needs O(n·nU·(dU² + dV²)) time. Learning the mapping function needs O(n·nU³) time. The memory requirement for the algorithm is O(n·nU + nU·(dU² + dV²)).

3.4 Benchmarking

The benchmarking experiments are designed to answer three questions:

1. Which of the three learning algorithms should be used to fit the CAD model to a data set?


2. Can use of the CAD model reduce the incidence of false positives due to unusual values for environmental attributes, compared to obvious alternatives for use in an anomaly detection system?

3. If so, does the reduced incidence of false positives come at the expense of a lower detection level for actual anomalies?

The testing protocol is as follows:

1. For a given data set, we begin by randomly designating 80% of the data points as training data and 20% of the data points as test data (this latter set will be referred to as testData).

2. Next, the 20% of testData that are outliers are identified using a standard, parametric test based on Gaussian mixture modeling: the data are modeled using a GMM, and new data points are ranked as potential outliers based on the probability density at the point in question. This basic method is embodied in more than a dozen papers cited in Markou and Singh's survey [63]. However, a key aspect of this protocol is that outliers are chosen based only on the values of their environmental attributes (indicator attributes are ignored). We call this set outliers.


3. The set testData is then itself randomly partitioned into two identically sized subsets, perturbed and nonPerturbed, subject to the constraint that outliers ⊆ nonPerturbed.

4. Next, the members of the set perturbed are perturbed by swapping indicator attribute values among them.

5. Finally, we use the anomaly detection software to check for anomalies among testData = perturbed ∪ nonPerturbed.

Note that after this protocol has been executed, members of the set perturbed are no longer samples from the original data distribution, and members of the set nonPerturbed are still samples from the original data distribution. Thus, it should be relatively straightforward to use the resulting data sets to differentiate between a robust anomaly detection framework and one that is susceptible to high rates of false positives. Specifically, given this experimental setup, we expect that a useful anomaly detection mechanism would flag the points in perturbed as anomalous, while not flagging the points in nonPerturbed (including the unusual-but-unperturbed points in outliers) as anomalous.


Each point z = (x, y) in the set D of points to be perturbed is processed as follows; recall from Section 3.2.2 that x is a vector of environmental attribute values and y is a vector of indicator attribute values. To perturb z, we randomly choose k data points from D; in our tests, we use k = min(50, |D|/4). Let z' = (x', y') be the sampled record such that the Euclidean distance between y and y' is maximized over all k of our sampled data points. To perturb z, we simply create a new data point (x, y'), and add this point to perturbed. This process is iteratively applied to all points in the set D to create perturbed. Note that every set of environmental attribute values present in D is also present in perturbed after perturbation. Furthermore, every set of indicator attribute values present in perturbed after perturbation is also present in D. What has been changed is only the relationship between the environmental and indicator attributes in the records of perturbed.

We note that swapping the indicator values may not always produce the desired anomalies (it is possible that some of the swaps may result in perfectly typical data). However, swapping the values should result in a new data set that has a larger fraction of anomalies as compared to sampling the data set from its original distribution, simply because the new data has not been sampled from the original distribution. Since some of the perturbed points may not be abnormal, even a perfect method should deem a small fraction of the perturbed set to be normal. Effectively, this creates a non-zero false negative rate, even for a perfect classifier. Despite this, the experiment is still a valid one, because each of the methods that we test must face this same handicap.
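The perturbation procedure can be sketched as follows. This is our own rendering under assumed array names: D_env and D_ind hold the environmental and indicator columns of the set to be perturbed:

```python
import numpy as np

def perturb(D_env, D_ind, k=50, seed=0):
    """For each record z = (x, y), sample k records and swap in the
    indicator vector y' whose Euclidean distance from y is largest,
    keeping the environmental attributes x unchanged."""
    rng = np.random.default_rng(seed)
    n = len(D_ind)
    k = max(1, min(k, n // 4))              # k = min(50, |D|/4) as above
    out = D_ind.copy()
    for i in range(n):
        cand = rng.choice(n, size=k, replace=False)
        dists = np.linalg.norm(D_ind[cand] - D_ind[i], axis=1)
        out[i] = D_ind[cand[np.argmax(dists)]]   # most distant y'
    return D_env, out                       # x unchanged, y replaced
```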


1. Simple Gaussian mixture modeling (as described at the beginning of this Section). In this method, for each data set a GMM model is learned directly over the training data. Next, the set testData is processed, and anomalies are determined based on the value of the function f_GMM for each record in testData; those with the smallest f_GMM values are considered anomalous.
2. kth-NN outlier detection [30] (with k = 5). kth-NN outlier detection is one of the most well-known methods for outlier detection. In order to make it applicable to the training/testing framework we consider, the method must be modified slightly.

3. Local outlier factor (LOF) anomaly detection [3]. LOF must be modified in a manner similar to kth-NN outlier detection: the LOF of each test point is computed with respect to the training data set in order to score the point.
4. Conditional anomaly detection, with the model constructed using the Direct-CAD Algorithm.
5. Conditional anomaly detection, with the model constructed using the GMM-CAD-Full Algorithm.
6. Conditional anomaly detection, with the model constructed using the GMM-CAD-Split Algorithm.

For each experiment over each data set with the GMM/CAD methods, a data record was considered an anomaly if the probability density at the data point was less than the median probability density at all of the test data points. For 5th-NN/LOF, the record was considered an anomaly if its distance was greater than the median distance over all test data points. The median was used because exactly one-half of the points in the set testData were perturbed. Thus, if the method was able to order the data points so that all anomalies were before all non-anomalies, choosing the median density should result in a recall


Head-to-head winner when comparing recall/precision for identifying perturbed data as anomalous (each cell has winner, wins out of 13 data sets, bootstrapped p-value):

            CAD-Full            CAD-Split           Direct-CAD          LOF
GMM     .720  Full, 11, 0.01  .793  Full, 10, 0.04  .737  Split, 7, 0.47  .730  CAD, 8, 0.28
5th-NN  .721                  .721

Head-to-head winner when comparing recall/precision for identifying non-perturbed outlier data as non-anomalous (each cell has winner, wins out of 13 data sets, bootstrapped p-value):

            CAD-Full            CAD-Split           Direct-CAD          LOF
GMM     .500  Full, 12, 0.00  .749  Full, 11, 0.01  .687  CAD, 7, 0.49   .681  CAD, 11, 0.00
5th-NN  .500  LOF, 7, 0.37    .549


Statistical approaches to novelty detection are surveyed by Markou and Singh [63]. In this context, statistical refers to methods which try to identify the underlying generative distribution for the data, and then identify points that do not seem to belong to that distribution. Markou and Singh give references to around a dozen papers that, like the CAD methodology, rely in some capacity on Gaussian Mixture Modeling to represent the underlying data distribution. However, unlike the CAD methodology, nearly all of those papers implicitly assume that all attributes should be treated equally during the detection process. The specific application area of intrusion detection is the topic of literally hundreds of papers that are related to anomaly detection. Unfortunately, space precludes a detailed survey of methods for intrusion detection here. One key difference between much of this work and our own proposal is the specificity: most methods for intrusion detection are specifically tailored to that problem, and are not targeted towards the generic problem of anomaly detection, as the CAD methodology is. An interesting 2003 study by Lazarevic et al. [74] considers the application of generic anomaly detection methods to the specific problem of intrusion detection. The authors test several methods and conclude that LOF [3] (tested in this work) does a particularly good job in that domain. While most general-purpose methods assume that all attributes are equally important, there are some exceptions to this. Aggarwal and Yu [75] describe a method for identifying outliers in subspaces of the entire data space. Their method identifies combinations of attributes where outliers are particularly obvious, the idea being that a large number of dimensions can obscure outliers by introducing noise (the so-called curse of dimensionality [76]). By considering subspaces, this effect can be mitigated. However, Aggarwal and Yu make no a priori distinctions among the types of attribute values using domain knowledge, as is done in the CAD methodology.

Also related is work on spatial scan statistics [77][78][79]. In this work, data are aggregated based upon their spatial location, and then the spatial distribution is scanned for sparse or dense areas. Spatial attributes are treated in a fashion analogous to environmental attributes in the CAD methodology, in the sense that they are not taken as direct evidence of an anomaly; rather, the expected sparsity or density is conditioned on the spatial attributes. However, such statistics are limited in that they cannot easily be extended to multiple indicator attributes (the indicator attribute is assumed to be a single count), nor are they meant to handle the sparsity and computational complexity that accompany large numbers of environmental attributes. The existing work that is probably closest in spirit to our own is likely the paper on anomaly detection for finding disease outbreaks by Wong, Moore, Cooper, and Wagner [80]. Their work makes use of a Bayes Net as a baseline to describe possible environmental structures for a large number of input variables. However, because their method relies on a Bayes Net, there are certain limitations inherent to this approach. The most significant limitation of relying on a Bayes Net is that continuous phenomena are hard to describe in this framework. Among other issues, this makes it very difficult to describe temporal patterns such as time lags and periodicity. Furthermore, a Bayes Net implicitly assumes that even complex phenomena can be described in terms of interaction between a few environmental attributes, and that there are no hidden variables unseen in the data, which is not true in real situations where many hidden attributes and missing values are commonplace. A Bayes Net cannot learn two different explanations for observed phenomena and learn to recognize both, for it cannot recognize unobserved variables. Such hidden variables are precisely the type of information captured by the use of multiple Gaussian clusters in our model; each Gaussian corresponds to a different hidden state. Finally, we mention that traditional prediction models, such as linear or nonlinear models of regression [81][82], can also potentially be used for deriving the cause-effect interrelationships that the CAD method uses. For example, for a given set of environmental attribute values, a

We also mention the work of Bradley, Fayyad, and Reina [83] for scaling standard, Gaussian EM to clustering of very large databases.

A correction for simultaneous statistical inference may be applied [84]. This lowers the acceptable false positive rate for each test. Unfortunately, lowering the false positive rate always lowers the power. Thus, a test should have naturally high power to be useful in this scenario.

Related work considers detecting changes in data [85][6][7], but it does not address the statistical significance of those changes. Statistical tests for significant spatial irregularities, such as Kulldorff's spatial scan statistic [8][9] and those for detecting disease outbreaks [10], are closely related but are more specific in that they do not check for a generic change of distribution; they look for hot spots in the data that may signal some sort of disease outbreak. In fact, aside from solutions that rely on mapping two- or three-dimensional data to a single dimension and applying one of the uni-dimensional tests [86], one of the only tests proposed by the statistics community for this problem was published recently, called the cross-match test [37]. However, as we demonstrate experimentally, the cross-match test may be of questionable power, and it is computationally expensive, requiring a maximal matching computation over a graph having O((|S| + |S'|)^2) edges. To function, all of these edges must be stored in main memory; even then, the running time is cubic in the size of the data set. One additional multidimensional change detection test related to Kulldorff's test was proposed by Dasu et al. [38], but this is a test that relies on a discretization of the data space. Space partitioning schemes tend to suffer from the curse of dimensionality. As such, one might expect that the power of the test will suffer in more than a few dimensions, an issue that we consider experimentally later in this work.

The test uses the EM [5] algorithm in conjunction with a kernel density estimator to infer the baseline distribution. In addition to this key technical contribution, the density test is experimentally compared with two alternative tests via a total of more than 60 different change detection tasks over 13 real data sets, having from 3 to 26 dimensions each. The strengths and weaknesses of the various tests are carefully considered. The density test is found to have substantially more power (especially for detecting changes along multiple dimensions at once) compared to the two existing methods, and has a computational cost that is competitive.

In Section 4.2, we give the high-level overview of the density test. We discuss kernel density estimation in Section 4.3. In Section 4.4, we define the test statistic and discuss its null distribution. In Section 4.5 we explain why we need to run the test in two different directions. Section 5.5 experimentally tests the power and accuracy of the density test compared with the two existing alternatives. Related work is covered in Section 4.7, and Section 4.8 concludes the chapter.

A related method is Kulldorff's spatial scan statistic [8], which identifies abnormally dense regions in a spatial data set that may signal distributional change in spatial data. First, the data space is discretized into a

1. First, the baseline S is randomly partitioned into two halves, S = S1 ∪ S2. S1 will be used for modeling and S2 for testing.
2. Next, to compute the statistic, we first compute a kernel density estimator using S1; the resulting density function is denoted by K_S1. The test statistic Λ is defined based on the difference of the log probability density of K_S1 at S' and at S2 (hence the name density test). Intuitively, if S' is very different from S, we expect that Λ will take on a small negative value, since K_S1 has higher log probability density at S2 than at S'. On the other hand, if S' does not differ much from S, we expect that the value of Λ will be reasonably large (if S' is comparable to S2, or S' is even closer to S1 than S2).
3. Due to the Central Limit Theorem, we know that Λ has a normal limiting distribution. After deriving the mean and variance of Λ, we perform a standard null hypothesis test. Depending on whether the observed Λ is significantly far out in the tail of the null distribution, we either reject the null hypothesis and declare change, or we declare noChange, since we do not have strong enough evidence to reject the null hypothesis.

The partitioning of S is discussed further in Section 4.4. Currently we use two equal-sized halves. Determining an optimal partitioning is an avenue for future research.
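As an illustration of steps 1 and 2, here is a minimal Python sketch of the statistic computation (our own, using scipy's fixed-bandwidth Gaussian KDE as a stand-in for the EM-tuned estimator K_S1 described later; function names are assumptions):

import numpy as np
from scipy.stats import gaussian_kde

def density_statistic(S, S_prime, rng=None):
    # Lambda = LLH(K_S1, S') - (|S'|/|S2|) * LLH(K_S1, S2),
    # where K_S1 is a KDE fit on one random half of the baseline S.
    # S and S_prime are (n, d) numpy arrays.
    rng = np.random.default_rng(rng)
    idx = rng.permutation(len(S))
    half = len(S) // 2
    S1, S2 = S[idx[:half]], S[idx[half:2 * half]]
    kde = gaussian_kde(S1.T)             # stand-in for the EM-tuned KDE
    llh_prime = np.sum(np.log(kde(S_prime.T)))
    llh_S2 = np.sum(np.log(kde(S2.T)))
    return llh_prime - (len(S_prime) / len(S2)) * llh_S2

A strongly negative return value is evidence for change, exactly as in step 3 above.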



(the tutorial of Bilmes [87] is one of the best known and most complete). Assume S1 has dimensionality k. Let ω_1, ..., ω_|S1| be the Gaussian kernels centered at x_1, ..., x_|S1| respectively. The update rule for the bandwidth of the kernel associated with data point i is:

Σ_{x_i} := Σ_j P(ω_i | x_j)(x_j − x_i)(x_j − x_i)^T / Σ_j P(ω_i | x_j)    (4)

In Equation (4), P(ω_i | x_j) is the probability that x_j is generated by the kernel associated with x_i. This probability is often referred to as the soft membership of point x_j in Gaussian component ω_i. The soft membership is computed with the formula given in Equation (4) below. The second line of the equation is customized to our particular situation, where we take into account the constraint that the jth data point cannot be generated by the kernel centered at that data point. Note that Σ_{i=1}^{|S1|} P(ω_i | x_j) = 1 still holds for all j = 1, 2, ..., |S1|:

P(ω_i | x_j) = p(x_j | ω_i) / Σ_{t=1}^{|S1|} p(x_j | ω_t),    i, j = 1, 2, ..., |S1|, i ≠ j
P(ω_i | x_i) = 0,    i = 1, 2, ..., |S1|    (4)

as given in Equation (4). Again, the second line in the equation is given because of the special pseudo log-likelihood objective function:

p(x_j | ω_i) = (2π)^{−k/2} |Σ_{x_i}|^{−1/2} exp(−(1/2)(x_j − x_i)^T Σ_{x_i}^{−1} (x_j − x_i)),    i, j = 1, 2, ..., |S1|, i ≠ j
p(x_i | ω_i) = 0,    i = 1, 2, ..., |S1|    (4)

Now we can summarize how we optimize the bandwidth of the Gaussian kernels in Algorithm 1. According to this algorithm, we stop the computation whenever the iteration number exceeds the maximum iteration number allowed, or when the objective function L converges toward its optimal value. We decide that the objective function has converged if the fractional difference of two consecutive L values is less than ε. In our implementation, ε = 0.01 and MaxIteration = 100.

Is This Method Effective?

To check, we use the Streamflow data set (see Section 5.5 for a description of this data set). We first sample 1000 points from the
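A compact Python sketch of this leave-one-out EM loop (our own illustration of the update rules above; for brevity it uses isotropic kernels Σ_i = s_i I rather than full covariance matrices):

import numpy as np

def em_bandwidths(X, eps=0.01, max_iter=100):
    # Per-kernel bandwidth optimization for a Gaussian KDE via EM.
    # X: (n, k) sample; returns one variance s_i per kernel.
    n, k = X.shape
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # ||x_j - x_i||^2
    s = np.full(n, sq[sq > 0].mean() / k)                # initial variances
    prev = -np.inf
    for _ in range(max_iter):
        # E-step: soft membership P(w_i | x_j), with P(w_j | x_j) = 0
        logp = -0.5 * sq / s[:, None] - 0.5 * k * np.log(2 * np.pi * s)[:, None]
        np.fill_diagonal(logp, -np.inf)                  # leave-one-out constraint
        P = np.exp(logp - logp.max(0))
        P /= P.sum(0)
        # M-step: update each kernel's variance from its weighted neighbors
        s = (P * sq).sum(1) / (k * P.sum(1) + 1e-12)
        # pseudo log-likelihood, excluding each point's own kernel
        L = np.sum(logp.max(0) + np.log(np.exp(logp - logp.max(0)).sum(0)))
        if abs((L - prev) / abs(L)) < eps:
            break
        prev = L
    return s

The loop stops exactly as Algorithm 1 prescribes: after MaxIteration passes, or once the fractional change in the objective drops below ε.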

The first estimator is learned using Scott's rule [88], which is a standard, uniform bandwidth selection method. The second is learned using the EM algorithm.

Figure 4-1. Samples from the original distribution (top), a KDE using Scott's bandwidth (center), and a KDE learned using our EM method (bottom).

We then sample three additional 1000-point samples: one from the original distribution, one from a standard bandwidth estimator, and one from our EM estimator. Projections of these three samples onto two dimensions (the 23rd and 24th dimensions of the Streamflow data set) are shown in Figure 4-1. It is very clear from the three plots that a great deal of information regarding the original distribution has been lost through the application of Scott's estimator. Sampling from the resulting KDE seems to have created a large and shapeless blob of points. On the other hand, the EM estimator seems to do an excellent job of preserving the characteristics of the original distribution. The experiments in Section 5.5 will show how sensitive the resulting KDE can be when it is used to detect changes in high-dimensional data.

The test statistic is

Λ = LLH(K_S1, S') − (|S'|/|S2|) LLH(K_S1, S2)    (4)

where LLH(K, A) denotes the log-likelihood of the point set A under the density K. Since both S2 and S1 are from F_S, if S' is quite different from S2, K_S1 will be more likely to generate S2 than to generate S', which will result in an extremely small negative value for Λ. Thus, a hypothesis test using the above test statistic is a lower one-sided test. This particular test statistic is chosen for three reasons:
1. Λ has a normal limiting distribution (Section 4.4.1.1).
2. The expected value of the mean of Λ is always zero, which means that the null distribution is always centered at the origin (Section 4.4.1.2).
3. The variance of Λ can be estimated in a robust fashion (Section 4.4.1.3).

For these reasons, the density test is easily applicable, since the null distribution has a known parametric distribution and only one parameter must be estimated. Expanding the definition:

Λ = log(Π_{y∈S'} K_S1(y)) − (|S'|/|S2|) log(Π_{y∈S2} K_S1(y))
  = Σ_{y∈S'} log( Σ_{x∈S1} (1/|S1|) φ_x(y) ) − (|S'|/|S2|) Σ_{y∈S2} log( Σ_{x∈S1} (1/|S1|) φ_x(y) )

where φ_x denotes the Gaussian kernel centered at x.

Under the null hypothesis, every point in S' and S2 can be viewed as a random variable T_i drawn from F_S. Using the notation in Equation (4), the random variable used to produce Λ can then be denoted as:

Λ = Σ_{i=1}^{|S'|} f(T_i) − (|S'|/|S2|) Σ_{i=|S'|+1}^{|S'|+|S2|} f(T_i)

where

f(T_i) = log K_S1(T_i)  and  g(T_i) = (|S'|/|S2|) f(T_i).

It is well known that if we apply a measurable function to a random variable, the result will still be a random variable. Since both f and g are measurable functions, both f(T_i) and g(T_i) are random variables. Let:

Λ1 = Σ_{i=1}^{|S'|} f(T_i)  and  Λ2 = Σ_{i=|S'|+1}^{|S'|+|S2|} g(T_i)

Given this, we can make the following three observations with respect to Λ1:
1. Since all T_i, i = 1, 2, ..., follow the same distribution F_S, it follows that all f(T_i) have an identical distribution f(F_S).

2. Since all T_i, i = 1, 2, ..., are independent random variables, it follows that all f(T_i) are independent.
3. Because Λ1 is a sum of independent, identically distributed random variables, the Central Limit Theorem implies that Λ1 has a normal limiting distribution.

The same observations hold for Λ2. By Equation (4) and Equation (4), all T_i, i = 1, 2, ..., |S'| + |S2|, are independent, so Λ1 and Λ2 are independent and Λ follows a normal limiting distribution as well. Letting μ_f = E[f(T_i)], the mean of Λ is

E[Λ] = E[Λ1] − E[Λ2] = |S'| μ_f − (|S'|/|S2|) |S2| μ_f = 0

Letting σ² = Var[f(T_i)], the variance of Λ is

Var[Λ] = Var[Λ1] + Var[Λ2] = |S'| σ² + |S2| (|S'|/|S2|)² σ² = (|S'| + |S'|²/|S2|) σ²

An obvious estimator for σ² is the sample variance (|S2| − 1)^{−1} Var[f(S2)]. Unfortunately, this is a bad idea. If this estimator is less than the true σ², the estimated distribution of Λ will improperly shrink towards the mean. This will introduce an additional type I (false positive) error that must be quantified. In order to define an estimator V such that Pr[σ² > V] = γ, we use the bootstrap percentile method [73]. Rather than constructing a two-tailed confidence interval, we need only compute the one-sided (1 − γ) upper confidence limit on σ² and then use this upper limit as the estimator of σ². Algorithm 2 gives the process, which introduces an additional type I error of γ. There are now two sources of type I error. First, Λ could be far out in the tail of the null distribution, even if it is a sample from it. Assume this error is α. Second, we could underestimate
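A minimal sketch of the bootstrap percentile estimator for this one-sided upper confidence limit on σ² (our own illustration; n_boot and the function name are assumptions):

import numpy as np

def upper_variance_limit(f_values, gamma, n_boot=1000, rng=None):
    # One-sided (1 - gamma) upper confidence limit on sigma^2 via the
    # bootstrap percentile method: resample f(S2) with replacement, take
    # the sample variance of each resample, return the (1-gamma) percentile.
    rng = np.random.default_rng(rng)
    n = len(f_values)
    boot_vars = np.empty(n_boot)
    for b in range(n_boot):
        resample = rng.choice(f_values, size=n, replace=True)
        boot_vars[b] = resample.var(ddof=1)   # (|S2|-1)^-1 normalization
    return np.percentile(boot_vars, 100 * (1 - gamma))

Using the returned upper limit in place of the plain sample variance makes the estimated null distribution conservatively wide, at the cost of the extra type I error γ quantified next.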

σ² when using the bootstrapped estimate of Algorithm 2, which computes (|S2| − 1)^{−1} Var[f(R)] over each bootstrap resample R; assume this error is γ. A bound on the type I error is given by p = α + γ. For a user-supplied p, there is a tradeoff between α and γ. Fortunately, it is easily possible to choose α and γ such that α + γ = p and the power of the resulting test is maximized. To do this, we consider each possible γ value (each of which corresponds to a different bootstrapped variance), and compute the resulting cutoff value for declaring a change result. Over all possible γ values, we choose the resulting cutoff that is most aggressive, and use that during our actual test. Algorithm 3 gives the process to find such a critical value; in Algorithm 3, M is the number of (α, γ) pairs considered.

To run the test, we first compute Λ using Equation (4). Second, we obtain the critical value Cmax through Algorithm 3. Third, we compare Λ with this critical value. If Λ is less than Cmax, we reject the null hypothesis and declare change.
Figure 4-2. Two distributions that must be checked for deviation in both directions, because one is covered by the other.

The test described in Section 4.4.3 is complete, and can be used exactly as described to check for a change of distribution. However, the power of the test may be substantially increased if it is run twice, once in each direction. This is because the test is not symmetric, and change from S' to S may be more obvious than change from S to S'. For example, imagine that F_S is a uni-dimensional Gaussian, while F_S' is a uni-dimensional mixture of low-variance Gaussians, where Var(F_S) = Var(F_S') and E(F_S) = E(F_S'). This is illustrated in Figure 4-2, where F_S' is a mixture of three Gaussians. A typical sample from F_S' will have data points located only in relatively high-density regions of F_S, which means that the Λ statistic may be useless for comparing F_S and F_S'. However, a large enough sample from F_S will likely have data points located in the low-density troughs between the peaks in F_S'. This renders a test for distributional change from S' to S far more sensitive than a test of change from S to S'. In order to remove the directionality, we run the test twice. First, the test is run exactly as described in Section 4.4.3, except that the false positive rate is held at p/2. If no change is observed, the test is run a second time in exactly the same way, except that the roles of S and S' are reversed. Running the test twice with a false positive rate of p/2 for each run ensures that the overall false positive rate is less than p.
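The two-directional wrapper is a one-liner; a sketch (ours, assuming a one-directional density_test(S, S_prime, p) that returns True when it declares change):

def two_directional_test(S, S_prime, p, density_test):
    # Run the one-directional density test in both directions, each at
    # false positive rate p/2, so the overall rate stays below p.
    if density_test(S, S_prime, p / 2):
        return True                          # change detected, S -> S'
    return density_test(S_prime, S, p / 2)   # reversed roles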

1. Is the false positive rate of our test correctly governed by the input parameter p?
2. How does the power (false negative rate) of our approach compare with that of the other two alternative methods, the cross-match test and kdq-tree test?
3. How scalable is our approach compared with the cross-match test and kdq-tree test?

Data sets. Our tests make use of 13 experimental data sets, which are generated from the 13 real data sources described in Table 4-1. The dimensionality is indicated after each data source. We require that the size of each data source be reasonably large so that we can sample with replacement in order to obtain a true multivariate probability distribution without too many duplicate values. Among the 13 data sources, five have relatively small size: CAPeak, Bodyfat, Boston, FCATread and FCATmath. For these small-sized data sources, we bump up the data set size to 20,000 points while maintaining the distributional characteristics of the data as follows. We first randomly draw a sample point A from the small-sized data source. We then sample five points, A1 to A5 (with replacement), from A's five nearest neighbors. A and A1 to A5 are averaged to produce a new data point. We repeat the process 20,000 times. For our experiments, we divide the data sources into two groups, the low-D group and the high-D group. The low-D group consists of data sources from 3 to 10 dimensions. The high-D group consists of data sources from 15 to 26 dimensions. All three change detection methods are tested using |S| = |S'| = 850 for the low-D group; 850 is chosen because this is the largest size that is computationally feasible using the cross-match test. However, for the high-D group, more samples are required to accurately detect distributional change, a consequence of the well-known curse of dimensionality. For this group, |S| = |S'| = 3500 is used, and the cross-match test is not evaluated.
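A minimal sketch of the bump-up procedure just described (our own illustration; the function and parameter names are assumptions):

import numpy as np

def bump_up(X, target=20000, n_neighbors=5, rng=None):
    # Grow a small data set to `target` points while roughly preserving
    # its distribution: average a random point A with five points sampled
    # (with replacement) from A's five nearest neighbors.
    rng = np.random.default_rng(rng)
    out = np.empty((target, X.shape[1]))
    for i in range(target):
        a = rng.integers(len(X))
        # indices of A's five nearest neighbors (excluding A itself)
        d = np.linalg.norm(X - X[a], axis=1)
        nn = np.argsort(d)[1:n_neighbors + 1]
        picks = rng.choice(nn, size=n_neighbors, replace=True)
        out[i] = np.vstack([X[a], X[picks]]).mean(axis=0)
    return out

The averaging keeps every synthetic point inside the local neighborhood of a real point, which is why the distributional characteristics survive the enlargement.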

In the density test, the upper confidence limit on the variance is obtained using Algorithm 2. In Algorithm 3, p = 0.04 (due to the two-directional test) and stepSize = 0.002. In the kdq-tree test, we use the same parameter settings as in the original paper [38]. All experiments were run on an Intel Xeon machine with dual CPUs of 2.8 GHz, and 5 GB of RAM. We keep this setup for all the experiments throughout this section. The false positive results are reported in Table 4-2 for the low-D group and in Table 4-3 for the high-D group. In those tables, Di represents the experiment data set created out of the ith data source given in Table 4-1. The average false positive rate over the 13 experiment data sets is given in the average column of the tables.

Discussion. These results show that all three change detection methods have an average false positive rate at or less than the 8% allowed by the supplied p value, and hence a positive result from any of the tests seems to be a safe indicator of distributional change. Both the density test and the cross-match test have a false positive rate very close to the desired p value, and the kdq-tree test tends to be rather conservative.


The false negatives of the low-D group are reported in Table 4-4. The false negatives of the high-D group are reported in Table 4-5. In order to make our experiments reproducible, we give the values of parameter L for each type of change over all the 13 data sources in Table 4-6.

Discussion. Table 4-4 and Table 4-5 show that the density test is the most powerful of the three methods compared. Though it is true that there was some variation from data set to data set, it seems clear that the density test is the most obvious, general-purpose choice. For each of the five types of changes, the density test had the lowest average false negative rate. Depending upon the type of change, the kdq-tree test and the cross-match test were generally comparable on the low-D group, with the former doing better on full-dimensional changes, and the latter doing better on single-dimensional changes.

The running times are reported in Table 4-7.

Discussion. The kdq-tree has significant performance benefits over both the density test and the cross-match test. Given that the kdq-tree relies mostly on a computationally efficient recursive partitioning of the data space, this is not too surprising. However, the density test still scales to reasonable data set sizes (see the discussion in the Conclusion Section of the chapter). The cross-match test is generally unusable past a few thousand data points. We also point out that the density test has a very high, one-time, fixed cost for the construction of the KDE. After this cost has been paid, the model can be saved and the test can be run repeatedly in a very small fraction of the time reported in Table 4-7. To illustrate the fraction of the total time associated with the density test that is a one-time cost, consider Table 4-8. This table shows the fraction of the running time for the density test that is devoted to one-time computation for each of the data set sizes. The average fraction is around 84%. These results show that if the same algorithm has to be repeated multiple times, the computational cost of the density test is only about 4 times more than the tree-based method. For a data set with 7000 points, the overall requirements are on the order of a minute. This makes it practical for a number of applications.

One related technique, Kulldorff's spatial scan statistic [8], can be used to find significant spatial clusters in data; the particular application that it was developed for is detecting disease outbreaks.

The scan statistic is the likelihood ratio λ = sup_{p>q} L(z, p, q) / sup_{p=q} L(z, p, q), where L(z, p, q) denotes the likelihood of observing the cases inside and outside zone z simultaneously. The null hypothesis for the resulting test is H0: p = q, and the alternative hypothesis is H1: p > q. The null distribution of the likelihood ratio test statistic is obtained by the Monte-Carlo method, based on the assumption that the points are generated by either a non-homogeneous Poisson process or a Bernoulli process. Several research groups have looked at speeding up this test [9][89][90]. Though Kulldorff's test does check for a change of distribution, it looks for a very specific distributional change. As such, Kulldorff's test is not really suitable for general-purpose change detection.

Dasu et al. [38] proposed an information-theoretic approach to detecting changes in multi-dimensional data streams that generalizes Kulldorff's test. Their approach depends on a spatial partitioning scheme (called a kdq-tree) to partition the data space into small cells. Based on the data count in each cell, two empirical distributions are built, representing the reference window and the testing window respectively. The Kullback-Leibler (KL) distance is used to measure the distance between the two empirical distributions. In order to measure the significance of the obtained KL-distance, they use a non-parametric bootstrap method.

A third test for multivariate distributional change from the statistics literature is the cross-match test [37]. In the cross-match test, every observation in S ∪ S' is ranked along each dimension. The rank of an observation x_i, denoted by r_i, is a vector with k elements. Then the inter-point distance d_ij between two observations x_i and x_j is defined to be the Mahalanobis distance between their ranks. Formally, the distance is given by d_ij = (r_i − r_j)^T M^{−1} (r_i − r_j), where M is the sample variance-covariance matrix of the ranks r_i. After that, all the

The method of Sheather and Jones [91] seems to be the most widely used. This method assumes a fixed bandwidth for all the kernels and is not suitable when data exhibit local scale variations. The optimal bandwidth selection methods proposed by Silverman [92] and Wand and Jones [93] are of limited practical use for multidimensional data, as the derivation of asymptotics involves multivariate derivatives and higher-order Taylor expansions. We did not find any method in the literature that associates a different bandwidth with each kernel for more than three-dimensional data sets. The method proposed by Comaniciu [94] allows for variable bandwidth, but assumes that the range of data scales is known and chooses the bandwidth that is the most stable across scales for each data point.

Table 4-1. The 13 experiment data sources (columns: data source, description).

Table 4-2. False positive % on low-D group data sets.
Table 4-3. False positive % on high-D group data sets.
Table 4-4. False negative (%) of low-D group.
Table 4-5. False negative (%) of high-D group.

Table 4-6. Parameter L of every change on all the 13 data sources.
Table 4-7. Running time on the Streamflow data set. Time unit is seconds by default; otherwise, m stands for minutes and h stands for hours.
Table 4-8. Fraction of the time (in %) reported in Table 4-7 that is devoted to one-time computation that could be re-used for multiple tests.

Table 5-1. Antibiotic resistance data of E. coli on multiple drugs (R: Resistant; S: Susceptible; U: Undetermined)

1  S S S S S
2  R U S S S
3  R R U S S


The linear interpolation in Equation (5) will guarantee that Σ_z π̃_z(t) = 1 at any time t. Without linear regression, we would have to perform normalization to make sure the mixing proportions sum to 1. In this case, even if each of the numerators is a smooth and intuitive curve, the mixing proportion after normalization could become erratic and hard to interpret.
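A one-line check (ours, using the interpolation form π̃(t) = π̃_b + ((t − t_b)/(t_e − t_b))(π̃_e − π̃_b) that appears in the generative process below) shows why no renormalization is needed: since π̃_b and π̃_e each lie on the simplex,

\sum_z \tilde{\pi}_z(t)
  = \sum_z \Big[\tilde{\pi}_{b,z} + \tfrac{t - t_b}{t_e - t_b}\,(\tilde{\pi}_{e,z} - \tilde{\pi}_{b,z})\Big]
  = 1 + \tfrac{t - t_b}{t_e - t_b}\,(1 - 1) = 1 .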

In Section 5.2, we describe the generative statistical model. In Section 5.3, the Markov chain Monte Carlo (MCMC) method is used to approximate the parameters in the Bayesian mixture model. In Section 5.4, we give two instances of the proposed Bayesian mixture model, one with multinomial components and the other with Gaussian components. In Section 5.5, we experimentally test the mixture model. Section 5.6 discusses the related work and Section 5.7 concludes the chapter.

5.2.1 Generative Model

This section describes the statistical model we use for modeling the linear regression trend of the mixing components. Our discussion assumes that some parametric distribution f_z has been chosen as the generative model for the cluster indexed by the integer z. As described in the introduction, the mixing proportion is a linear function of the time t, so the probability or likelihood that we observe a data point y at time t is given by

f(y | t) = Σ_z π̃_z(t) f_z(y).

Since f(y | t) is conditioned on t, our generative model assumes that to generate a data point, a timestamp t is taken as input, and used along with the model parameter set Θ to generate the data point. The elements of Θ are summarized in Table 5-2. Given the linear regression, π̃_b and π̃_e can be used to determine the mixing proportion at any given time t by simple interpolation.

Figure 5-1. Generative model. Circle denotes a random variable of the model. Rectangle denotes inputs to the model. Arrow denotes the sampling process from the associated distribution. Dotted arrow denotes a calculation operation.

Table 5-2. Model parameters.

The model is depicted in Figure 5-1. The corresponding generating process of a data point is as follows:
1. Obtain hyperparameters of the mixing proportions:
(a) Draw β_b from an inverse-Gamma distribution with parameter κ;
(b) Draw β_e from an inverse-Gamma distribution with parameter κ;
2. Obtain the mixing proportions at the beginning and the end of time:
(a) Draw a multinomial π̃_b from a symmetric Dirichlet with parameter β_b;
(b) Draw a multinomial π̃_e from a symmetric Dirichlet with parameter β_e;
3. Draw a component z from the multinomial π̃(t), where π̃(t) = π̃_b + ((t − t_b)/(t_e − t_b))(π̃_e − π̃_b);
4. Draw a data point y from f_z.

In general, f_z can be any parametric distribution. In this work, we consider two instantiations of this distribution: the Gaussian distribution and the multinomial distribution. The choice of priors follows Rasmussen [95]. For simplicity, we use a constant vector κ = ⟨1, 1⟩ in experiments.
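A runnable sketch of this generative process (our own illustration with uni-dimensional Gaussian components; for brevity the inverse-Gamma hyperprior step is collapsed into fixed Dirichlet parameters):

import numpy as np

def generate(timestamps, centers, scales, beta_b=1.0, beta_e=1.0, rng=None):
    # Sample one data point per input timestamp from the mixture whose
    # mixing proportions interpolate linearly between pi_b (at t_b) and
    # pi_e (at t_e).
    rng = np.random.default_rng(rng)
    K = len(centers)
    pi_b = rng.dirichlet([beta_b] * K)       # proportions at the start
    pi_e = rng.dirichlet([beta_e] * K)       # proportions at the end
    t_b, t_e = min(timestamps), max(timestamps)
    data = []
    for t in timestamps:
        lam = (t - t_b) / (t_e - t_b)
        pi_t = pi_b + lam * (pi_e - pi_b)    # still sums to 1
        z = rng.choice(K, p=pi_t)            # step 3: pick a component
        data.append(rng.normal(centers[z], scales[z]))  # step 4
    return np.array(data), pi_b, pi_e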

An obvious alternative is a joint model in the spirit of Wang and McCallum [52], where the time is treated as a random variable. That is, the generative process is responsible for generating both the data point and the associated time. The PDF of such an alternative model takes the form of f(y ∧ t). Why do we prefer f(y | t) over f(y ∧ t) when assessing the prevalence trend of mixing components over time? The reason is that the model f(y ∧ t) must be used to learn the distribution of the data as well as the timestamp distribution, so it is very sensitive to the distribution of time. If there is a different amount of data for different time periods (generally there is less data for earlier time periods), a very different model will be learned than if the amount of data is constant over time. Since in our application an expert is interested in the relationship between time and the prevalence of each cluster, it is unacceptable to have the learned model depend fundamentally on how the data are distributed over time. A PDF of the form f(y | t) removes the dependence on the time distribution because it is conditioned on the input time.

The differences between these alternative models can be clearly demonstrated by the following simple experiment. Assume a regression mixture model with three uni-dimensional Gaussian clusters, denoted by C1 = N(−1, 0.2), C2 = N(0, 0.2) and C3 = N(1, 0.2). The

We use the EM algorithm [5] to learn a Gaussian mixture model for f(y ∧ t) on these two data sets. Figure 5-2 gives the plots of the trained models.

Figure 5-2. The f(y ∧ t) model trained over two data sets. (a) Time has a uniform distribution. (b) Time has a beta distribution. Both results are 3-component GMMs. The star denotes the centroid. The ellipse illustrates the covariance matrix. The fractional number above each cluster denotes its mixing proportion.

The first model, presented in Figure 5-2(a), does a reasonably poor job of capturing the three Gaussian clusters. If the learned centroids are projected on the data (y) dimension, we obtain −0.6, −0.3 and 0.3, corresponding to the real centroids of −1, 0 and 1 respectively, so there is quite a

large discrepancy. According to Equation (5), the real cluster 1 starts with weight 0 at the beginning of time and shifts to 0.8 at the end of the time; thus, it should be centered later on the timeline. On the other hand, the real cluster 3 starts at weight 0.8 and goes to 0, so it should be centered earlier in the timeline. In Figure 5-2(a), the learned clusters are presented in the expected partial order along the timeline. After shifting the time distribution towards the high side of the timeline using a Beta distribution, the learned model presented in Figure 5-2(b) gets even worse. All three cluster centroids shift toward the high side of the timeline. The model ignores the early stage data because there are fewer data points in the early time. Furthermore, the learned centroids along the data dimension shift to −0.8, −0.6 and −0.3 respectively. All of them are dragged downward due to the fact that more samples are taken at the end of the timeline; the real cluster 1, which has the lowest value, is the most prevalent cluster at the end of the timeline. Therefore, compared to the data set in Figure 5-2(a), there are more data points generated by the real cluster 1.

Figure 5-3. The f(y | t) model trained over the data set with uniform time distribution. (a) denotes the PDF curves of the learned Gaussian clusters. (b) denotes the learned weight trend over time.

The above experiment illustrates clearly how sensitive a model for f(y ∧ t) can be to the distribution of the data's timestamps. For the purpose of comparison, we also train a model based upon f(y | t) over the two data sets. The model was learned using the methods that will be described in detail later in this work. Figure 5-3 gives the results when time is uniformly distributed. Figure 5-4 gives the result when time has a Beta distribution. Comparing these

Figure 5-4. The f(y | t) model trained over the data set where time has a beta distribution. (a) denotes the PDF curves of the learned Gaussian clusters. (b) denotes the learned weight trend over time.

Finally, we mention that theoretically, we can transform the f(y ∧ t) model to our model by applying Bayes' rule: f(y | t) = f(y ∧ t) / f(t). However, as shown in Figure 5-5, it is obvious that the transformed conditional model does a poor job of depicting the PDF of the real data set. Due to these two drawbacks of applying Bayes' rule to remove the time dependence, our conditional model is a better choice.

Given the generative model of Section 5.2.1, a popular estimation technique is Maximum Likelihood Estimation (MLE). MLE provides the point estimate for the parameters that maximizes the probability of the generated data. From a statistical point of view, the MLE method is considered to be robust (with some exceptions) and yields estimators with good statistical properties. Unfortunately, the MLE corresponding to our model is computationally

Figure 5-5. Applying Bayes' rule to remove the dependence of the joint model on time. (a) is the PDF of the transformed conditional model at t = 50. Time has a Beta distribution in the corresponding joint model. (b) is the PDF of the real GMM model at t = 50.

intractable using classical methods such as EM. This is caused by the fact that the mixing proportion is not a single random variable. Rather, it is a non-linear function of two independent random variables (π̃_b and π̃_e), which makes derivation of a suitable EM algorithm almost impossible, because the M-step requires a difficult, non-linear optimization. As an alternative, we can use Bayesian methods to compute the posterior distributions of the random variables. First, we specify the prior distribution for each parameter in the mixture model. We then apply Bayes' rule to convert the expression of the likelihood of the parameters into the posterior probability distribution of the parameters. The conversion is given in Eqn. (5):

p(Θ | y) = p(y | Θ) p(Θ) / p(y)    (5)

Converting Eq. (5) into the posterior distribution in the form of a probability density function is often computationally difficult due to the multi-dimensional integrals. To work around this difficulty, Markov Chain Monte Carlo (MCMC) methods are widely used. MCMC methods are techniques for sampling from a probability distribution by constructing a Markov chain whose stationary distribution is the distribution of interest. By repeatedly simulating the state of the chain, the method simulates samples drawn from the distribution of interest. Specifically, in our problem, each step of the

The Gibbs sampler [11] is perhaps the most popular MCMC method. The key idea behind the Gibbs sampler is that at each step of the Markov chain, we only need to consider a univariate conditional distribution instead of a joint multivariate distribution. That means that at each step, we simulate the conditional distribution of one parameter assuming that all the other parameters are assigned fixed values. Such conditional distributions are far easier to simulate than complex joint distributions, and usually have simple forms. Assume there are p parameters. These p parameters are simulated from the p univariate conditional distributions sequentially, rather than from a single p-variate joint distribution in a single pass. Suppose the parameter set can be written as Θ = (θ_1, ..., θ_p), where each θ_i could be either unidimensional or multidimensional. Assume the univariate conditional density for θ_i is f_i(θ_i | θ_1, ..., θ_{i−1}, θ_{i+1}, ..., θ_p), i = 1, 2, ..., p. In the Gibbs sampler, given all the parameters at step t, Θ^(t) = (θ_1^(t), ..., θ_p^(t)), the parameters at step t+1 are generated sequentially by p steps: (1) θ_1^(t+1) ∼ f_1(θ_1 | θ_2^(t), ..., θ_p^(t)); (2) θ_2^(t+1) ∼ f_2(θ_2 | θ_1^(t+1), θ_3^(t), ..., θ_p^(t)); ...; (p) θ_p^(t+1) ∼ f_p(θ_p | θ_1^(t+1), ..., θ_{p−1}^(t+1)). Each parameter is updated in turn from its posterior conditional distribution given all other parameters. The entire distribution for all the parameters is explored as the number of steps of Gibbs sampling grows large.
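A generic Gibbs sweep in Python (our own sketch; `conditionals` is a list of hypothetical per-parameter samplers, each drawing theta_i from its full conditional given the current state):

import numpy as np

def gibbs(conditionals, theta0, n_steps, rng=None):
    # Generic Gibbs sampler: at each step, redraw each parameter from its
    # full conditional given the current values of all the others.
    rng = np.random.default_rng(rng)
    theta = list(theta0)
    chain = []
    for _ in range(n_steps):
        for i, sample_i in enumerate(conditionals):
            # each sampler receives the full current state, its own index,
            # and the random generator; it returns a new value for theta_i
            theta[i] = sample_i(theta, i, rng)
        chain.append(list(theta))
    return chain

Discarding an initial burn-in prefix of the returned chain and averaging the rest gives the posterior-mean estimates used in the experiments later in this chapter.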

As discussed in Section 5.1, sometimes it is reasonable to assume that the data point is generated from a mixture model with multinomial components. Assume the dimensions are independent from each other. Let D be the number of dimensions and M be the number of category values on each dimension. Thus, a data point y_i is a D-ary vector, and each element takes a value from 1 to M. The multinomial mixture model with K components on n observations y = {y_1, ..., y_n} can be written as:

f(y_i | t_i) = Σ_{k=1}^{K} π̃_k(t_i) Π_{d=1}^{D} θ_{k,d}(y_i[d])

where θ_{k,d}(v) is the probability that dimension d of a point from component k takes the value v. Methods exist for estimating the number of components K [96], though for brevity, we do not consider the issue.

Following [97], we give Dirichlet priors to the K component weights at the beginning time slice and at the ending time slice. However, since it is simpler to work with independent random variables, we can make use of a reparametrization where the interdependent random variables from a K-variate Dirichlet distribution are viewed as being generated from a set of K independent Gamma random variables. Formally, if Y_i, i ∈ {1, ..., K}, are independent random variables with Y_i ∼ G(shape = α_i, scale = 1), then (X_1, ..., X_K) = (Y_1/V, ..., Y_K/V) ∼ Dir(α_1, ..., α_K), where V = Σ_{i=1}^{K} Y_i. The mixing proportion π_j, j = 1, ..., K, at the time slice t can then be expressed as:

π_j(t) = b_j / Σ_{h} b_h + ((t − t_b)/(t_e − t_b)) ( e_j / Σ_{h} e_h − b_j / Σ_{h} b_h )

where b_1, ..., b_K and e_1, ..., e_K are the independent Gamma variables underlying the beginning and ending proportions.
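A small sketch of this reparametrization (our illustration): drawing K independent Gamma variables and normalizing reproduces a Dirichlet draw, and the time-t proportions follow by interpolation.

import numpy as np

def mixing_proportions(t, b, e, t_b, t_e):
    # pi(t) interpolated between the normalized Gamma draws b and e
    lam = (t - t_b) / (t_e - t_b)
    pi_b, pi_e = b / b.sum(), e / e.sum()
    return pi_b + lam * (pi_e - pi_b)

rng = np.random.default_rng(0)
alpha = np.ones(5)
b = rng.gamma(shape=alpha, scale=1.0)   # K independent Gamma(alpha_i, 1)
e = rng.gamma(shape=alpha, scale=1.0)
print(mixing_proportions(2005.5, b, e, 2004, 2007))  # sums to 1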

We obtain the conditional posterior of each indicator c_i by multiplying the prior from Eq. (5) by the likelihood from Eq. (5), conditioned on the indicator:

p(c_i = k | ·) ∝ π̃_k(t_i) f_k(y_i)    (5)

Eq. (5) plays the role of the likelihood which, together with the prior from Eq. (5), gives the conditional posterior of b̃:

p(b_j | ·) ∝ b_j^{β_b − 1} e^{−b_j} Π_{t=t_b}^{t_e} Π_{k=1}^{K} π̃_k(t)^{n_{t,k}}    (5)

where n_{t,k} is the number of data points with timestamp t assigned to component k. For the hyperparameter β_b in Eq. (5), we assume the inverse-Gamma prior. For simplicity, we use a fixed constant value as the parameter of the prior:

p(β_b) ∝ β_b^{−3/2} exp(−1/(2β_b))    (5)

The conditional posteriors are analogous to Eq. (5), except that the density of a data point is transformed from multinomial to Gaussian. We follow Rasmussen [95] and use the prior distributions in that work for centroids, precisions and their associated hyperparameters. Given the priors, we derive the posterior distribution of the parameters by Bayes' rule in Eq. (5). The derivation process is omitted from this chapter since it is similar to that of Section 5.4.1.2. As for the mixing proportions of Gaussian components, since the modeling of the weights is independent of the type of the mixture component, the conditional distributions of the mixing proportion-related parameters and hyperparameters are the same as in the multinomial mixture model. The only difference in the Gaussian mixture model is the posterior distribution of the indicator variable. The posterior distribution of c becomes:

p(c_i = k | ·) ∝ π̃_k(t_i) s_k^{1/2} exp[−(1/2) s_k (y_i − μ_k)²]

where μ_k and s_k are the centroid and precision of component k.

The extension to multivariate data follows Rasmussen [95]: the centroids μ_j and their hyperparameters become vectors, and their priors become multivariate Gaussian accordingly.

The authors of [98] show that the resistance pattern of E. coli with VTEC-AVF is similar to the resistance pattern of enterohemorrhagic E. coli (EHEC) and Verocytotoxin-producing E. coli (VTEC). It is also reasonable to expect that the prevalence of strains changes over time, which is again what our model is designed to detect. For example, the first case of ESBL E. coli (an E. coli strain that produces Extended-Spectrum Beta-Lactamase, an enzyme which makes the bacteria resistant to several antibiotic drugs) appeared about four years ago and seemed to infect only elderly women. As this strain has spread over time, the age and type of patient who gets infected have also broadened, and hence the population of this strain has expectedly increased.

The data are summarized in Table 5-1. The dates of the isolates range from year 2004 to year 2007. We set the number of mixing components K = 5 and run 2000 loops of our Gibbs sampler in the experiment. To avoid the oscillations of the initial phase, we use the mean value of the last 1500 samples as the approximation of each parameter.

Experimental results. Given these settings, our algorithms compute five resistance patterns (or cluster centroids) for the various groups of E. coli. For a particular pattern, the centroid is a three-by-27 matrix, where each column represents the probability that an isolate from this pattern falls into the categories of R, S and U, respectively, for a particular antibiotic. To visualize our results graphically, in Figure 5-6 we plot the probability of S for each of the five patterns. The sum of the probabilities of resistant and undetermined can be calculated as (1 − probability of S). Also, the linear trends of the mixing proportions learned by our model for the five resistance patterns are plotted in Figure 5-7. For example, the mixing proportion of pattern 1 is around 0.1 in year 2004, and increases to around 0.2 in year 2007.

Discussion. The results we observe are quite informative, and also in keeping with what we might expect to observe in this application domain. For example, consider pattern two. This pattern corresponds to those isolates that are highly susceptible to almost all of the relevant antimicrobials.

Figure 5-6. Susceptible probability of E. coli to multiple antibiotics. The x-axis represents antibiotic names. The y-axis represents the susceptible probability. Each curve represents a resistance pattern. For example, in pattern 5, an E. coli isolate is susceptible to Imipenem with a probability of 1.

Figure 5-7. Linear trend of the mixing proportions of the five resistance patterns of E. coli. Each line corresponds to a resistance pattern. The start point of a line is the mixing proportion of that pattern at the beginning of year 2004. The end point is the mixing proportion of that pattern at the end of year 2007.

It turns out that pattern two is also the most prevalent class of E. coli, which is very good news. In 2004, more than 55% of the isolates belonged to this class. Unfortunately, presumably due to selective pressures, the prevalence of this class decreases over time. The learned model shows that by 2007, the prevalence of the class had decreased to around 45%. This sort of trend, a decrease in prominence of a specific pattern, is exactly what our model is designed to detect.


Our second experiment uses the ARM data [99], which is an ongoing project designed to document trends in antimicrobial susceptibility patterns in in-patient and outpatient isolates, and to identify relationships between antibiotic use and resistance rates. The data give the susceptibility of 19 bacteria to 51 antibiotics collected from 355 participating US hospitals. Each hospital provides a minimum of three years of susceptibility data. In the experiment, we investigate the resistance patterns of S. aureus in the ARM data. S. aureus is known to have various subclasses or strains. Most hospital strains of S. aureus are usually resistant to a variety of different antibiotics; a few strains are resistant to all clinically useful antibiotics except Vancomycin, and Vancomycin-resistant strains are increasingly reported. The prevalence of another strain, named Methicillin-Resistant Staphylococcus aureus (MRSA), is widespread too.

Experiment setup. Unlike in Experiment One, where each data point represents the susceptibility of a single isolate, in Experiment Two the susceptibility data is aggregated based on all the bacteria isolates collected from a given hospital in a particular year. Each data point is then a susceptibility rate vector, associated with a specific hospital and a specific year. The susceptibility rate vector consists of M real numbers, where each number represents the fraction of bacteria isolates collected in a given year that are susceptible to the M antibiotics. In order to make the experiment simple and informative, we model resistance to a set of 17 clinically relevant antibiotics. The time span we tested is from year 1992 to year 2006 and the data set size is 1323. Since the susceptibility rates are vectors of real numbers, we model them

with the Gaussian instantiation of the mixture model.

Experimental results. Figure 5-8 gives three different clusters for S. aureus, each of which is associated with a specific resistance pattern. It can be seen that all the patterns of S. aureus have susceptibility rates larger than 0.5. Pattern 1 has the highest susceptibility rates over a large portion of antibiotics. The susceptibility rates of the other two patterns are generally lower, except for Cefazolin, Linezolid, Meropenum, and Quinupristin/Dalfopristin.

Figure 5-8. Susceptible rate of the three resistance patterns of S. aureus. Each curve represents a resistance pattern. For example, in pattern three, 67% of isolates are susceptible to Clindamycin.

The linear trends of the mixing proportions of the three resistance patterns exhibited by S. aureus are plotted in Figure 5-9. It shows that pattern one occupies a dominant portion of nearly 100% at the beginning of time. At the end of the time, although pattern one is still the most prevalent pattern, its dominance is diminished and patterns two and three gain more prevalence over time.

Discussion. Unlike in the first experiment, in this experiment the learned patterns do not correspond to strains of microbes; they correspond to resistance profiles for hospitals. The results show that the best profile (where the highest susceptibility rates are found) decreases significantly over time; this is expected, and not good news. Of epidemiological interest is the increase in prevalence of pattern three over time.

Figure 5-9. Linear trend of the mixing proportions of the three resistance patterns of S. aureus. The two ends of a line represent the mixing proportion of the pattern in year 1992 and year 2006 respectively.

This pattern shows a high rate of isolates that are not susceptible to Clindamycin, which is indicative of a high rate of MRSA, a worrisome strain of S. aureus. One interesting finding is that both patterns two and three are marked by a very high susceptibility to the Quinupristin/Dalfopristin combination. This is quite interesting, because it means that Quinupristin/Dalfopristin susceptibility is associated with hospitals showing S. aureus with a lack of susceptibility to other drugs, and that such hospitals are becoming more common over time. We were so startled by this observation that we tried re-running the learning algorithm with a greater value for K, since we thought that one explanation could be that several hospital profiles were being mixed together within each pattern. However, increasing K had little effect: all it did was to split up patterns two and three many ways, and each pattern other than one still had a very high rate of susceptibility to Quinupristin/Dalfopristin!

In work on mining time series databases [100][101][102], each data point typically belongs to one of k labeled time series that are archived in the database; for example, a database may contain ECG data for k different people. All time series are generally independent, but within each time series there is an implicit (or explicit; see [100]) assumption that the time series is generated from a data generator with internal state. This is fundamentally different from our setup, where there is a

Also related is work on temporal topic models [52][53][54]. The document classes that are mined can be seen as equivalent to our cluster centroids, were they instantiated using a bag-of-words model (see, for example, Blei, Ng, and Jordan [103]). The key feature that differentiates our work is that our classes are fixed; what changes over time is only the prevalence of each class, and this change is modeled as a simple, linear progression. This results in a simple, easy-to-understand model, where (with enough data) the learned model is invariant to the distribution of the time attribute itself, an issue that we discussed in detail in Section 2. Much work has been done in clustering temporal data. Some clustering methods discretize the data based upon time [42][51]. In these methods, one model is built per time period, or the model is updated to incorporate the new data as the recent time window moves forward. Some other time-based clustering methods are concerned with maintaining temporal smoothness [104][105]. These methods attempt to fit the current data well, while at the same time avoiding too much deviation from the historical clustering. Finally, temporal anomaly detection is also somewhat related to our work [106][107]. In temporal anomaly detection, the goal is to find anomalous, emergent patterns. This could be seen as discovering classes that emerge suddenly, and grow from a prevalence of zero in a short time.

Consider the second experiment (Section 5.5.2). In this case, it is actually known before our algorithms are applied which susceptibility vectors correspond to which hospitals,

The precision results are shown in Figure A-1. The results show that this particular task is rather easy, and so it is difficult to draw any additional conclusions from the results. All of the methods except for LOF perform very well, with almost perfect overlap among the error bars.

A.2.1 MLE

Goal. The goal is to maximize the objective function:

Φ = log Π_{k=1}^{n} f_CAD(y_k | x_k) = Σ_{k=1}^{n} log( Σ_{i=1}^{n_U} p(x_k ∈ U_i) Σ_{j=1}^{n_V} f_G(y_k | V_j) p(V_j | U_i) )

Figure A-1. Precision rate on the KDD Cup 1999 data set by 10-fold cross-validation. The minimum and maximum precision rates over 10 runs are depicted as circles, and the overall precision is depicted as a star sign.




Setting the partial derivative of the Lagrangian with respect to p(U_i) to zero (with a Lagrange multiplier λ enforcing Σ_{h=1}^{n_U} p(U_h) = 1):

∂/∂p(U_i) [ Φ + λ (1 − Σ_{h=1}^{n_U} p(U_h)) ] = Σ_{k=1}^{n} Σ_{j=1}^{n_V} b_kij / p(U_i) − λ = 0

Summing over i and using Σ_{h=1}^{n_U} p(U_h) = 1 gives λ = Σ_{k=1}^{n} Σ_{h=1}^{n_U} Σ_{j=1}^{n_V} b_khj, so

p(U_i) = Σ_{k=1}^{n} Σ_{j=1}^{n_V} b_kij / Σ_{k=1}^{n} Σ_{h=1}^{n_U} Σ_{j=1}^{n_V} b_khj,    i ∈ {1, 2, ..., n_U}

Recall that f_G(y_k | V_j) = (2π)^{−d/2} |Σ_{V_j}|^{−1/2} exp( −(1/2)(y_k − μ_{V_j})^T Σ_{V_j}^{−1} (y_k − μ_{V_j}) ). Setting the derivative of Φ with respect to μ_{V_j} to zero yields

Σ_{k=1}^{n} Σ_{i=1}^{n_U} b_kij Σ_{V_j}^{−1} (y_k − μ_{V_j}) = 0

Setting the partial derivative with respect to p(V_j | U_i) to zero (with a Lagrange multiplier λ_i for the constraint Σ_{h=1}^{n_V} p(V_h | U_i) = 1):

∂/∂p(V_j | U_i): Σ_{k=1}^{n} b_kij / p(V_j | U_i) − λ_i = 0

Using Σ_{h=1}^{n_V} p(V_h | U_i) = 1 gives λ_i = Σ_{h=1}^{n_V} Σ_{k=1}^{n} b_kih, so

p(V_j | U_i) = Σ_{k=1}^{n} b_kij / Σ_{h=1}^{n_V} Σ_{k=1}^{n} b_kih

[1] D. Hawkins, Identification of Outliers. Chapman and Hall, London, 1980.
[2] C. C. Aggarwal and P. S. Yu, "Outlier detection for high dimensional data," in SIGMOD Conference, 2001.
[3] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, "LOF: identifying density-based local outliers," ACM SIGMOD, pp. 93, May 2000.
[4] E. M. Knorr, R. T. Ng, and V. Tucakov, "Distance-based outliers: algorithms and applications," VLDB Journal, vol. 8, no. 3-4, pp. 237, February 2000.
[5] A. Dempster, N. Laird, and D. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, Series B, vol. 39, no. 1, pp. 1, 1977.
[6] S. Bay and M. Pazzani, "Detecting group differences: mining contrast sets," Data Mining and Knowledge Discovery, vol. 5, no. 3, pp. 213, 2001.
[7] C. Aggarwal, "A framework for diagnosing changes in evolving data streams," in SIGMOD, 2003, pp. 575.
[8] M. Kulldorff, "A spatial scan statistic," Communications in Statistics: Theory and Methods, vol. 26, no. 6, pp. 1481, 1997.
[9] D. Neill and A. Moore, "Rapid detection of significant spatial clusters," in SIGKDD, 2004.
[10] W.-K. Wong, A. Moore, G. Cooper, and M. Wagner, "Bayesian network anomaly pattern detection for disease outbreaks," in ICML, 2003, pp. 808.
[11] S. Geman and D. Geman, "Stochastic relaxation, Gibbs distribution and Bayesian restoration of images," IEEE PAMI, vol. 6, pp. 721, 1984.
[12] C. M. Bishop, Neural Networks for Pattern Recognition. Oxford University Press, Inc., 1995.
[13] A. Sperduti and A. Starita, "Supervised neural networks for the classification of structures," IEEE Transactions on Neural Networks, vol. 8, no. 3, pp. 714, May 1997.
[14] J. R. Quinlan, "Simplifying decision trees," International Journal of Man-Machine Studies, vol. 27, no. 3, pp. 221, 1987.
[15] Y. Yuan and M. J. Shaw, "Induction of fuzzy decision trees," Fuzzy Sets Syst., vol. 69, no. 2, pp. 125, 1995.
[16] R. T. Ng and J. Han, "Efficient and effective clustering methods for spatial data mining," in 20th International Conference on Very Large Data Bases, 1994, pp. 144.

[17] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in KDD, 1996, pp. 226.
[18] T. Zhang, R. Ramakrishnan, and M. Livny, "BIRCH: an efficient data clustering method for very large databases," in ACM SIGMOD, 1996, pp. 103.
[19] S. Hawkins, H. He, G. Williams, and R. Baxter, "Outlier detection using replicator neural networks," pp. 170, 2002.
[20] V. Barnett and T. Lewis, Outliers in Statistical Data, 3rd edition, 1994.
[21] E. M. Knorr and R. T. Ng, "Algorithms for mining distance-based outliers in large datasets," in VLDB, 1998, pp. 392.
[22] K. Yamanishi, J. ichi Takeuchi, G. J. Williams, and P. Milne, "On-line unsupervised outlier detection using finite mixtures with discounting learning algorithms," Data Mining and Knowledge Discovery, vol. 8, no. 3, pp. 275-300, May 2004.
[23] C. C. Aggarwal, J. L. Wolf, P. S. Yu, C. Procopiuc, and J. S. Park, "Fast algorithms for projected clustering," in SIGMOD, 1999, pp. 61.
[24] C. C. Aggarwal and P. S. Yu, "Finding generalized projected clusters in high dimensional spaces," 2000, pp. 70.
[25] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan, "Automatic subspace clustering of high dimensional data for data mining applications," 1998, pp. 94.
[26] K. Chakrabarti and S. Mehrotra, "Local dimensionality reduction: a new approach to indexing high dimensional spaces," in The VLDB Journal, 2000, pp. 89.
[27] F. P. Preparata and M. I. Shamos, Computational Geometry: An Introduction (Monographs in Computer Science). Springer, August 1985.
[28] I. Ruts and P. J. Rousseeuw, "Computing depth contours of bivariate point clouds," Computational Statistics and Data Analysis, vol. 23, no. 1, pp. 153, 1996.
[29] S. Guha, R. Rastogi, and K. Shim, "CURE: an efficient clustering algorithm for large databases," 1998, pp. 73.
[30] S. Ramaswamy, R. Rastogi, and K. Shim, "Efficient algorithms for mining outliers from large data sets," ACM SIGMOD, pp. 427, May 2000.
[31] S. Papadimitriou, H. Kitagawa, P. Gibbons, and C. Faloutsos, "LOCI: fast outlier detection using the local correlation integral," pp. 315, 2003.
[32] S. S. Chawathe and H. Garcia-Molina, "Meaningful change detection in structured data," 1997, pp. 26.
[33] G. Cobena, S. Abiteboul, and A. Marian, "Detecting changes in XML documents," in ICDE, 2002.

[34] R. B. D'Agostino and M. A. Stephens, Eds., Goodness-of-Fit Techniques. New York, NY, USA: Marcel Dekker, Inc., 1986.
[35] V. Ganti, J. Gehrke, and R. Ramakrishnan, "Mining data streams under block evolution," SIGKDD Explor. Newsl., vol. 3, no. 2, pp. 1, 2002.
[36] V. Ganti, J. Gehrke, and R. Ramakrishnan, "A framework for measuring changes in data characteristics," in PODS, 1999, pp. 126.
[37] P. R. Rosenbaum, "An exact distribution-free test comparing two multivariate distributions based on adjacency," JRSS (Series B), vol. 67, no. 4, pp. 515, 2005.
[38] T. Dasu, S. Krishnan, S. Venkatasubramanian, and K. Yi, "An information-theoretic approach to detecting changes in multi-dimensional data streams," in Interface, 2006.
[39] G. Widmer and M. Kubat, "Learning in the presence of concept drift and hidden contexts," Machine Learning, vol. 23, pp. 69, 1996.
[40] P. Domingos and G. Hulten, "Mining high-speed data streams," in KDD, 2000, pp. 71.
[41] J. Gehrke, V. Ganti, R. Ramakrishnan, and W.-Y. Loh, "BOAT: optimistic decision tree construction," in ACM SIGMOD, 1999, pp. 169.
[42] G. Hulten, L. Spencer, and P. Domingos, "Mining time-changing data streams," in KDD, 2001, pp. 97.
[43] H. Wang, W. Fan, P. S. Yu, and J. Han, "Mining concept-drifting data streams using ensemble classifiers," in KDD, 2003, pp. 226.
[44] D. W.-L. Cheung, J. Han, V. Ng, and C. Y. Wong, "Maintenance of discovered association rules in large databases: an incremental updating technique," in ICDE, 1996, pp. 106.
[45] V. Ganti, J. Gehrke, and R. Ramakrishnan, "DEMON: mining and monitoring evolving data," Knowledge and Data Engineering, vol. 13, no. 1, pp. 50, 2001.
[46] N. F. Ayan, A. U. Tansel, and M. E. Arkun, "An efficient algorithm to update large itemsets with early pruning," in KDD, 1999, pp. 287.
[47] N. Sugiura and R. T. Ogden, "Testing change-points with linear trend," 1994.
[48] A. N. Pettitt, "A non-parametric approach to the change-point problem," Applied Statistics, vol. 28, no. 2, pp. 126, 1979.
[49] V. Guralnik and J. Srivastava, "Event detection from time series data," in KDD, 1999, pp. 33.
[50] D. Kifer, S. Ben-David, and J. Gehrke, "Detecting change in data streams," in VLDB, 2004, pp. 180.

[51] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu, "A framework for clustering evolving data streams," in VLDB, 2003, pp. 81.
[52] X. Wang and A. McCallum, "Topics over time: a non-Markov continuous-time model of topical trends," in KDD, 2006, pp. 424.
[53] D. M. Blei and J. D. Lafferty, "Dynamic topic models," in ICML, 2006, pp. 113.
[54] J. M. Kleinberg, "Bursty and hierarchical structure in streams," Data Mining and Knowledge Discovery, vol. 7, no. 4, pp. 373, 2003.
[55] S. Chakrabarti, S. Sarawagi, and B. Dom, "Mining surprising patterns using temporal description length," in VLDB, 1998, pp. 606.
[56] T. Fawcett and F. Provost, "Activity monitoring: noticing interesting changes in behavior," in KDD, 1999, pp. 53.
[57] G. Dong and J. Li, "Efficient mining of emerging patterns: discovering trends and differences," in KDD, 1999, pp. 43.
[58] K. Nigam, A. K. McCallum, S. Thrun, and T. Mitchell, "Text classification from labeled and unlabeled documents using EM," Machine Learning, vol. 39, no. 2, pp. 103, May 2000.
[59] J. Weston, C. S. Leslie, E. Ie, D. Zhou, A. Elisseeff, and W. S. Noble, "Semi-supervised protein classification using cluster kernels," Bioinformatics, vol. 21, no. 15, pp. 3241-3247, 2005.
[60] G. Druck, C. Pal, A. McCallum, and X. Zhu, "Semi-supervised classification with hybrid generative/discriminative methods," in KDD, 2007, pp. 280.
[61] S. Basu, M. Bilenko, and R. J. Mooney, "A probabilistic framework for semi-supervised clustering," in KDD, 2004, pp. 59.
[62] W. Tang, H. Xiong, S. Zhong, and J. Wu, "Enhancing semi-supervised clustering: a feature projection perspective," in KDD, 2007, pp. 707.
[63] M. Markou and S. Singh, "Novelty detection: a review, part 1: statistical approaches," Signal Processing, vol. 83, pp. 2481-2497, December 2003.
[64] W.-K. Wong, A. Moore, G. Cooper, and M. Wagner, "Rule-based anomaly pattern detection for detecting disease outbreaks," AAAI Conference Proceedings, pp. 217, August 2002.
[65] A. Adam, E. Rivlin, and I. Shimshoni, "ROR: rejection of outliers by rotations in stereo matching," Conference on Computer Vision and Pattern Recognition (CVPR-00), pp. 1002, June 2000.

[66] F. de la Torre and M. J. Black, "Robust principal component analysis for computer vision," in Proceedings of the Eighth International Conference on Computer Vision (ICCV-01), 2001, pp. 362.
[67] A. Lazarevic, L. Ertoz, V. Kumar, A. Ozgur, and J. Srivastava, "A comparative study of anomaly detection schemes in network intrusion detection," in Proceedings of the Third SIAM International Conference on Data Mining, 2003.
[68] Y. Zhang and W. Lee, "Intrusion detection in wireless ad-hoc networks," in MOBICOM, 2000, pp. 275.
[69] E. Eskin, "Anomaly detection over noisy data using learned probability distributions," ICML Conference Proceedings, pp. 255, June 2000.
[70] S. Roberts and L. Tarassenko, "A probabilistic resource allocating network for novelty detection," Neural Computation, vol. 6, pp. 270-284, March 1994.
[71] J. Bilmes, "A gentle tutorial on the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models," Technical Report, University of Berkeley, ICSI-TR-97-021, 1997.
[72] D. M. Chickering and D. Heckerman, "Efficient approximations for the marginal likelihood of Bayesian networks with hidden variables," Machine Learning, vol. 29, no. 2-3, pp. 181, 1997.
[73] B. Efron and R. J. Tibshirani, An Introduction to the Bootstrap. New York: Chapman and Hall, 1993.
[74] A. Lazarevic, L. Ertoz, V. Kumar, A. Ozgur, and J. Srivastava, "A comparative study of anomaly detection schemes in network intrusion detection," in SDM, 2003.
[75] C. C. Aggarwal and P. S. Yu, "Outlier detection for high dimensional data," ACM SIGMOD, pp. 37, May 2001.
[76] K. S. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft, "When is nearest neighbor meaningful?" International Conference on Database Theory Proceedings, pp. 217, January 1999.
[77] M. Kulldorff, "A spatial scan statistic," Communications in Statistics: Theory and Methods, vol. 26, no. 6, pp. 1481, 1997.
[78] D. B. Neill and A. W. Moore, "Rapid detection of significant spatial clusters," in KDD '04: Proceedings of the 2004 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2004, pp. 256.
[79] M. Kulldorff, "Spatial scan statistics: models, calculations, and applications," in J. Glaz and N. Balakrishnan, editors, Scan Statistics and Applications, pp. 303, 1999.

[80] W.-K. Wong, A. Moore, G. Cooper, and M. Wagner, "Bayesian network anomaly pattern detection for disease outbreaks," ICML Conference Proceedings, pp. 808, August 2003.
[81] L. Breiman, R. A. Olshen, J. H. Friedman, and C. J. Stone, Classification and Regression Trees, 1984.
[82] J. R. Quinlan, "Learning with continuous classes," 5th Australian Joint Conference on Artificial Intelligence Proceedings, pp. 343, 1992.
[83] P. S. Bradley, U. M. Fayyad, and C. Reina, "Scaling clustering algorithms to large databases," in Knowledge Discovery and Data Mining, 1998, pp. 9.
[84] R. Miller, Simultaneous Statistical Inference. New York: McGraw-Hill, 1966.
[85] M. Breunig, H.-P. Kriegel, R. Ng, and J. Sander, "LOF: identifying density-based local outliers," in SIGMOD, 2000, pp. 93.
[86] J.-F. Maa, D. Pearl, and R. Bartoszynski, "Reducing multidimensional two-sample data to one-dimensional interpoint comparisons," The Annals of Statistics, vol. 24, no. 3, pp. 1069, 1996.
[87] J. Bilmes, "A gentle tutorial on the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models," 1997.
[88] D. Scott, Multivariate Density Estimation: Theory, Practice and Visualization. New York: Wiley-Interscience, 1992.
[89] D. Agarwal, A. McGregor, J. Phillips, S. Venkatasubramanian, and Z. Zhu, "Spatial scan statistics: approximations and performance study," in SIGKDD, 2006.
[90] D. Agarwal, J. M. Phillips, and S. Venkatasubramanian, "The hunting of the bump: on maximizing statistical discrepancy," in SODA, 2006.
[91] S. Sheather and M. Jones, "A reliable data based bandwidth selection method for kernel density estimation," JRSS Series B, no. 53, pp. 683, 1991.
[92] B. W. Silverman, Density Estimation for Statistics and Data Analysis. London: Chapman and Hall, 1986.
[93] M. Wand and M. Jones, Kernel Smoothing. Chapman and Hall, 1995.
[94] D. Comaniciu, "An algorithm for data-driven bandwidth selection," IEEE Transactions on PAMI, vol. 25, no. 2, pp. 281, 2003.
[95] C. E. Rasmussen, "The infinite Gaussian mixture model," in NIPS, 1999, pp. 554.
[96] P. Schlattmann, "Estimating the number of components in a finite mixture model: the special case of homogeneity," Computational Statistics and Data Analysis, vol. 41, no. 3-4, pp. 441, 2003.

[97] T. S. Ferguson, "A Bayesian analysis of some nonparametric problems," Annals of Statistics, vol. 1, no. 2, pp. 209, 1973.
[98] K. G. and B. M., "Antibiotic susceptibility pattern of Escherichia coli strains with verocytotoxic E. coli-associated virulence factors from food and animal faeces," Food Microbiology, vol. 20, no. 1, pp. 27(7), February 2003.
[99]
[100] A. J. Bagnall and G. J. Janacek, "Clustering time series from ARMA models with clipped data," in KDD, 2004, pp. 49.
[101] B. Chiu, E. Keogh, and S. Lonardi, "Probabilistic discovery of time series motifs," in KDD, 2003, pp. 493.
[102] E. J. Keogh, S. Lonardi, and B. Y. chi Chiu, "Finding surprising patterns in a time series database in linear time and space," in KDD, 2002, pp. 550.
[103] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," Journal of Machine Learning Research, vol. 3, pp. 993, 2003.
[104] D. Chakrabarti, R. Kumar, and A. Tomkins, "Evolutionary clustering," in KDD, 2006, pp. 554.
[105] Y. Chi, X. Song, D. Zhou, K. Hino, and B. L. Tseng, "Evolutionary spectral clustering by incorporating temporal smoothness," in KDD, 2007, pp. 153.
[106] D. B. Neill, A. W. Moore, M. Sabhnani, and K. Daniel, "Detection of emerging space-time clusters," in KDD, 2005, pp. 218.
[107] A. T. Ihler, J. Hutchins, and P. Smyth, "Adaptive event detection with time-varying Poisson processes," in KDD, 2006, pp. 207.

Xiuyao Song was born in 1975 in Zigui, China. She earned her B.S. and M.E. in computer science from Huazhong University of Science and Technology in 1998 and 2001, respectively. She earned her Ph.D. in computer engineering from the University of Florida (UF), USA in 2008.