Citation
Statistical Methods for Fast Anomaly Detection

Material Information

Title:
Statistical Methods for Fast Anomaly Detection
Creator:
Wu, Mingxi
Place of Publication:
[Gainesville, Fla.]
Publisher:
University of Florida
Publication Date:
Language:
english
Physical Description:
1 online resource (136 p.)

Thesis/Dissertation Information

Degree:
Doctorate (Ph.D.)
Degree Grantor:
University of Florida
Degree Disciplines:
Computer Engineering
Computer and Information Science and Engineering
Committee Chair:
Jermaine, Christophe
Committee Members:
Kahveci, Tamer
Dobra, Alin
Ranka, Sanjay
Pardalos, Panagote M.
Graduation Date:
8/9/2008

Subjects

Subjects / Keywords:
Databases ( jstor )
Datasets ( jstor )
Detection ( jstor )
Hospitals ( jstor )
Outliers ( jstor )
Pruning ( jstor )
Rectangles ( jstor )
Statistical models ( jstor )
Statistics ( jstor )
Tessellations ( jstor )
Computer and Information Science and Engineering -- Dissertations, Academic -- UF
anomaly, extreme, outlier, randomized, sampling, spatial
Genre:
Electronic Thesis or Dissertation
bibliography ( marcgt )
theses ( marcgt )
Computer Engineering thesis, Ph.D.

Notes

Abstract:
In general, the task of detecting anomalies is to find the most anomalous points or subsets of points in a given data set according to a user-defined score function. In order to give users the freedom to try different score functions during data exploration, the generic way to find anomalies is to evaluate the user-defined score function on each candidate data point or subset of points, sort the candidates according to their scores, and return the top few. Though simple, this approach is slow when the data set is huge and/or the score function demands expensive computation. This thesis presents different statistical methods to speed up three specific anomaly detection tasks. First, a simple sampling algorithm is proposed to efficiently detect distance-based outliers in domains where each and every distance computation is expensive. Unlike existing algorithms, the sampling algorithm allows the user to control the number of distance computations and can return good results with accuracy guarantees. The most computationally expensive aspect of estimating the accuracy of the result is sorting all of the distances computed by the sampling algorithm. Second, a Bayesian framework is devised to predict the k-th largest (or smallest) value in a real-valued set. The Bayesian framework can combine both the domain knowledge derived from the available database workload and the sampled information obtained at query time, and it returns confidence bounds to the user. Third, a likelihood ratio test (LRT) framework is designed to efficiently detect spatial anomalies according to a user-supplied likelihood function. Given a spatial data set placed on an n x n grid, the goal is to find the rectangular regions within which subsets of the data set exhibit anomalous behavior. With an arbitrary user-supplied likelihood function, the naive algorithm would conduct an LRT over each rectangular region in the grid, rank all of the rectangles based on the computed LRT statistics, and return the top few most interesting rectangles. To speed up this process, the LRT framework uses novel and effective pruning methods to prune a large fraction of the rectangles without computing their associated LRT statistics. For all three research problems, extensive experiments show significant speedups compared to the alternative algorithms on real problems over real data. ( en )
General Note:
In the series University of Florida Digital Collections.
General Note:
Includes vita.
Bibliography:
Includes bibliographical references.
General Note:
Description based on online resource; title from PDF title page.
General Note:
This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Thesis:
Thesis (Ph.D.)--University of Florida, 2008.
General Note:
Adviser: Jermaine, Christophe.
General Note:
RESTRICTED TO UF STUDENTS, STAFF, FACULTY, AND ON-CAMPUS USE UNTIL 2015-05-31
Statement of Responsibility:
by Mingxi Wu.

Record Information

Source Institution:
University of Florida
Holding Location:
University of Florida
Rights Management:
Applicable rights reserved.
Embargo Date:
5/31/2015
Classification:
LD1780 2008 ( lcc )


Full Text

STATISTICAL METHODS FOR FAST ANOMALY DETECTION

By

MINGXI WU

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2008

© 2008 Mingxi Wu

To my parents, Jinsheng Wu and Xiazhen Jin
To my twin brother, Mingqi Wu
To my wife, Wenmei Yu

ACKNOWLEDGMENTS

My gratitude first goes to my supervising committee chair, Professor Chris Jermaine, for his wonderful guidance and consistent support. When I started research, I was a "green hand" without any idea of what good research work is and how to conduct good research. Fortunately, Chris has been very patient in guiding me through all the stages, from reading relevant literature to formulating a manageable research problem, from deriving theoretical results to actually implementing my ideas, and from reviewing peers' work to my own technical writing. In retrospect, I can clearly see that I have gone through all of the key training stages that a graduate student will benefit from in his/her future career as a computer science researcher. More importantly, Chris has taken endless effort to ensure that I really graduate from each of the key stages. Without his raising the bar for my research, I would not have become as confident as I am now. I will always remain indebted to him for the fruitful graduate training, the enjoyable research collaboration, and, most important, the encouragement of my thinking.

I am also thankful to Professor Sanjay Ranka for his insightful discussion and patient guidance in our collaboration on data mining research projects. Especially, while I was working on the last topic of my thesis, when I felt frustrated or went in the wrong direction, he was constantly encouraging me and correcting me toward the right direction to proceed.

Thanks are also due to Professors Alin Dobra and Tamer Kahveci. Having both of them serve on my Ph.D. committee and actually having conducted research with them has been an enjoyable and beneficial experience for my career. Meanwhile, I appreciate Professor Panos Pardalos taking his time to serve on my committee and his helpful suggestions for my work.

Also, I feel obliged to thank the great fellow labmates that have accompanied my graduate study. Thanks to Shantanu Joshi, Subi Arumugam, Laukik Chitnis, and Manas Somaiya for useful research discussion. Thanks to Xiuyao Song, Ravi Jampani, Fei Xu, Luis Perez, and Florin Rusu for collaborating with me on challenging data mining and database system research projects.

Finally, this work would not have been accomplished without my family's support. I still remember that during the first semester of my graduate study, due to frustration with exams and uneasiness with a new environment, I almost quit the program. However, my parents were constantly encouraging me not to give up and teaching me how to deal with failures. Without their selfless love and support, I cannot imagine my achievement today. I sincerely dedicate my accomplishment to them. Meanwhile, I feel lucky to have a twin brother, who has provided numerous supportive phone calls and emails, making sure that I was moving forward on my degree-pursuing path. Lastly, I would like to thank my wife, Wenmei Yu, for her trust in my ability and her patience in waiting for my graduation.

TABLE OF CONTENTS

ACKNOWLEDGMENTS 4
LIST OF TABLES 9
LIST OF FIGURES 11
ABSTRACT 13

CHAPTER

1 INTRODUCTION 15
  1.1 Detecting Distance-based Outliers 17
  1.2 Detecting Extreme Values in a Data Set 18
  1.3 Detecting Anomalous Local Regions in Spatial Data 20

2 RELATED WORK 22
  2.1 Related Work for Outlier Detection 22
  2.2 Related Work for Extreme Values Detection 25
  2.3 Spatial Anomaly Detection Related Work 28

3 OUTLIER DETECTION BY SAMPLING WITH ACCURACY GUARANTEES 31
  3.1 Introduction 31
  3.2 Providing Quality Guarantees 34
  3.3 Statistical Analysis 36
    3.3.1 Quality Measure N 36
    3.3.2 Average Number of Correct Outliers 36
      3.3.2.1 How often is point i declared an outlier? 37
      3.3.2.2 Probability formulas in Lemma 1 39
    3.3.3 Variance in Correct Outliers 39
      3.3.3.1 How often is M_i M_j one? 40
      3.3.3.2 Probability formulas in Lemma 2 41
  3.4 Speeding Up the Computation 43
    3.4.1 Constructing the Distance Database D' 43
    3.4.2 Computing E[N] and Var(N) for D' 44
  3.5 Empirical Evaluation 46
    3.5.1 Testing Expensive Distance Functions 46
    3.5.2 Reliability of the Estimation 49
  3.6 Conclusions 51

4 GUESSING THE EXTREME VALUES IN A DATA SET: A BAYESIAN METHOD 52
  4.1 Introduction 52
  4.2 Problem Definition 55
    4.2.1 Estimating the Extreme Value 55
    4.2.2 A Natural Estimator 56
  4.3 Overview of My Approach 58
    4.3.1 Importance of the Query Shape 58
    4.3.2 Basics of the Bayesian Approach 59
    4.3.3 Proposed Bayesian Inference Framework 61
  4.4 The Learning Phase 62
    4.4.1 Choosing an Appropriate Parametric Model 63
    4.4.2 Deriving the PDF 64
    4.4.3 Learning the Parameters 65
  4.5 The Characterization Phase 68
    4.5.1 Naive Monte Carlo Sampling 69
    4.5.2 Practical Monte Carlo Sampling 69
    4.5.3 Example Use of the TKD Method 71
    4.5.4 Additional Technical Details 73
      4.5.4.1 Sampling f(k) 73
      4.5.4.2 Sampling from a truncated Gamma 74
      4.5.4.3 A note regarding the scale parameter 74
  4.6 The Inference Phase 75
  4.7 Experiments 77
    4.7.1 Applicability of the Bayesian Approach 77
    4.7.2 Approximate MAX (or Top-k) Aggregates 80
    4.7.3 Distance-Based Outlier Detection 84
    4.7.4 Spatial Anomaly Detection 88
      4.7.4.1 LRT basics 89
      4.7.4.2 Applying the Bayesian framework 90
      4.7.4.3 Experimental evaluation: synthetic data 92
      4.7.4.4 Experimental evaluation: real data 95
    4.7.5 Multimedia Distance Join 98
    4.7.6 A Few Final Comments 100
  4.8 Conclusion 102

5 A LRT FRAMEWORK FOR FAST SPATIAL ANOMALY DETECTION 103
  5.1 Introduction 103
  5.2 LRT and Problem Definition 106
    5.2.1 The LRT Statistic 106
    5.2.2 Problem Definition 107
    5.2.3 Example Application 109
    5.2.4 So, What's the Difficulty? 110
  5.3 Basic Approach 112
  5.4 Tight Bound Criteria 115
  5.5 Precomputation and Bounding 117
    5.5.1 Precomputation and Bounding for A 117
    5.5.2 Precomputation and Bounding for A 118
  5.6 Final Search Algorithm 120
  5.7 Experiments 122
    5.7.1 Experiment One: Synthetic Data 122
    5.7.2 Experiment Two: The ARM Database 124
    5.7.3 Experiment Three: Census Data 128
  5.8 Conclusions 131

REFERENCES 132
BIOGRAPHICAL SKETCH 136

LIST OF TABLES

3-1 Efficiency comparisons between Bay's algorithm and the sampling algorithm on the Chr18 and the UCID data sets. The time reported for the sampling algorithm includes the running time of both the sampling algorithm and the EN and VarN algorithms. 46

3-2 Average estimation results and the empirical results of N for 10 trials on the 10 real data sets. The sample size α is set to 10 and 60, respectively. 46

3-3 Description of the 10 real data sets and the average number of distance computations per data point (denoted by BA) by Bay's algorithm. Cont./Feature means the number of continuous features over the number of total features. 49

4-1 Coverage rates for 95% confidence bounds using sample sizes 50 and 150, respectively. 78

4-2 Coverage rates for 95% confidence bounds using sample size 300. 78

4-3 Data description. These data sets consist of both categorical and continuous features. Continuous/Feature denotes the number of continuous features over the number of total features. 82

4-4 Coverage rates for 95% confidence bounds with various k values using a 10% sample. 84

4-5 Result of making use of the Bayesian framework within Bay and Schwabacher's algorithm. For each data set, the table shows the speedup resulting from the application of the framework to the algorithm (Speedup), the size of the result set overlap between the "exact" version of the algorithm and the approximate one (Overlap), and the average relative error of the approximate version (Error). 86

4-6 Null case: total false alarms and average speedup over 50 random trials. The speedup is computed by dividing the total number of rectangles by the actually examined rectangles in both the training and the pruning stage. 94

4-7 Alternative case: average hit rate, containment rate, relative error, and computing rate for binomial data. 95

4-8 Results for the multimedia distance join experiment. 100

5-1 Average pruning rates and accuracy for 50 random trials on a 128 x 128 grid. The hot spot size is 4 x 3. 123

5-2 Average pruning rates and accuracy for 7 random trials on a 256 x 256 grid. The hot spot size is 7 x 9. 124

5-3 LRT framework time (in days) vs. naive algorithm for the "trend" model. 127

5-4 LRT framework time (in days) vs. naive algorithm for the "wage" model. 127

LIST OF FIGURES

1-1 Nested-Loop Algorithm for kth-NN Outlier Detection 16

1-2 Sampling Algorithm for kth-NN Outlier Detection 16

1-3 Outliers in a 2-Dimensional Space 16

3-1 Replicating sample distances to construct D' 38

4-1 The histograms of four different query result sets with different domains (scales). (a) and (c) have long tails to the right. (b) and (d) have no tails to the right. 57

4-2 The PDFs of four Gamma distributions with increasing skew to the right from (a) to (d). 62

4-3 TKD sampling. 70

4-4 The histograms for the six synthetic query distributions considered in the first set of experiments; they are ordered from the easiest case to the hardest case. 77

4-5 Median relative confidence bound width of 500 test queries as a function of sample size. 83

4-6 Comparison of 5th-NN outlier distances for Bay and Schwabacher's original nested loops algorithm, and the modified version of Bay and Schwabacher's algorithm. 87

4-7 Classifying the rectangles according to their lower-left corner. Rectangles (5,10), (5,11), (5,14) and (5,15) belong to the same class indexed by the lower-left corner 5. 91

4-8 Comparison of the 30 smallest returned pair distances for the basic nested loop join and the modified version that makes use of the Bayesian framework. 101

5-1 Naive top-k LRT search (Algorithm 1) 109

5-2 Upper bounding the maximum likelihood of region A by the summation of the maximum likelihoods of subregions A_i. (a) The current testing region A. (b) The testing region A tiled by four subregions. 112

5-3 Second tight bound criterion. (a) Three subregions evenly cover the target region A. (b) Three subregions unevenly cover the target region A, with one big subregion dominating the other two. 115

5-4 Tiling A. (a) Rectangle (m, n, i, j) is recursively split. (b) The splitting position in each level. (c) A is tiled using three precomputed rectangles. (d) The precomputed rectangles that are used to tile A. 117

5-5 Algorithm for tiling A 119

5-6 Tiling methods for A. (a) and (b) depict the Radial methods. (c) and (d) depict the Sandwich methods. 120

5-7 Top-k LRT statistic search with pruning 120

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

STATISTICAL METHODS FOR FAST ANOMALY DETECTION

By

Mingxi Wu

August 2008

Chair: Chris Jermaine
Major: Computer Engineering

In general, the task of detecting anomalies is to find the most anomalous points or subsets of points in a given data set according to a user-defined score function. In order to give users the freedom to try different score functions during data exploration, the generic way to find anomalies is to evaluate the user-defined score function on each candidate data point or subset of points, sort the candidates according to their scores, and return the top few. Though simple, this approach is slow when the data set is huge and/or the score function demands expensive computation. This thesis presents different statistical methods to speed up three specific anomaly detection tasks.

First, a simple sampling algorithm is proposed to efficiently detect distance-based outliers in domains where each and every distance computation is expensive. Unlike existing algorithms, the sampling algorithm allows the user to control the number of distance computations and can return good results with accuracy guarantees. The most computationally expensive aspect of estimating the accuracy of the result is sorting all of the distances computed by the sampling algorithm.

Second, a Bayesian framework is devised to predict the k-th largest (or smallest) value in a real-valued set. The Bayesian framework can combine both the domain knowledge derived from the available database workload and the sampled information obtained at query time, and it returns confidence bounds to the user.

Third, a likelihood ratio test (LRT) framework is designed to efficiently detect spatial anomalies according to a user-supplied likelihood function. Given a spatial data set placed on an n x n grid, the goal is to find the rectangular regions within which subsets of the data set exhibit anomalous behavior. With an arbitrary user-supplied likelihood function, the naive algorithm would conduct an LRT over each rectangular region in the grid, rank all of the rectangles based on the computed LRT statistics, and return the top few most interesting rectangles. To speed up this process, the LRT framework uses novel and effective pruning methods to prune a large fraction of the rectangles without computing their associated LRT statistics.

For all three research problems, extensive experiments show significant speedups compared to the alternative algorithms on real problems over real data.

CHAPTER 1
INTRODUCTION

Anomaly detection in a data set has been an active research area in computer science for a very long time (see the recent survey by Hodge and Austin [24]). The task is to find the top few points or subsets in a data set according to a user-defined score function. A straightforward algorithm exists: evaluate the score function at each candidate data point or subset of the data set, and then sort the candidates according to the resulting scores. The top few candidates are finally returned for user investigation. Though simple, the naive algorithm easily renders unacceptable performance when the data set is huge and/or the score function requires expensive computations.

The need for improving the performance of the naive algorithm is an important issue in data exploration, since users usually need to try multiple score functions before any meaningful anomalies are discovered in the data set. This thesis addresses the performance issues faced by most anomaly detection tasks from a statistical point of view. Specifically, the thesis addresses the question:

Can statistical techniques be applied to speed up the anomaly detection task?

To answer the above question, I have designed statistical methods to solve the following three related but different anomaly detection tasks:

1. Detecting Distance-based Outliers
2. Detecting Extreme Values in a Data Set
3. Detecting Anomalous Local Regions in a Spatial Data Set

Each research problem is described at a high level in the following three subsections.

Input: A data set D; k, specifying which NN distance; r, the number of outliers to return; a distance function for any two points
Output: r outlier points
1. For each point i in D Do
2.   For each point j in D Do
3.     Compute the distance(i, j)
4. Return the top r points whose kth-NN distance is the greatest

Figure 1-1. Nested-Loop Algorithm for kth-NN Outlier Detection

Input: the same input as the NL algorithm; α, the sample size
Output: r outlier points
1. For each point i in D Do
2.   Draw a random sample S of size α from D
3.   For each point j in S Do
4.     Compute the distance(i, j)
5. Return the top r points whose kth-NN distance in its sample is the greatest

Figure 1-2. Sampling Algorithm for kth-NN Outlier Detection

Figure 1-3. Outliers in a 2-Dimensional Space

1.1 Detecting Distance-based Outliers

One effective approach for detecting anomalous data points in a set is distance-based outlier (DB-outlier) detection [27, 5, 8]. In DB-outlier detection, each data object is represented as a point in a multi-dimensional feature space. A distance function is chosen based on domain-specific requirements. Data points that are significantly far away from all of the others are flagged as outliers.

The problem. I am interested in kth nearest neighbor (kth-NN) DB-outlier detection. Given integer parameters k and r, the kth-NN outliers are the top r points whose distance to their kth-NN is greatest. Figure 1-3 illustrates the outlier concept in a two-dimensional space.

A naive solution. The naive way to find the top r kth-NN outliers is the nested-loop algorithm listed in Figure 1-1. Though simple, the nested-loop algorithm requires quadratic time with respect to the data set size, which is impractical in many real-life data exploration tasks. For example, if the distance function used in the algorithm requires one second to compute, a data set of one thousand points will require more than 11 days to compute!

Proposed statistical method. I propose a simple modification of the naive nested-loop algorithm to solve this performance issue. The modification is based on sampling. The algorithm (Figure 1-2) works as follows. For each data point i, α points are randomly sampled from the data set. Using the user-specified distance function, the kth-NN distance of point i in its α samples is computed. After repeating this process for each point, the algorithm returns the top r points whose sampled kth-NN distance is the greatest.

Aside from the algorithm's obvious simplicity, the biggest benefit is that the total number of distance computations is Θ(αn), where n is the data set size. This bounds the time required to run the algorithm to completion.
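To make the pseudocode of Figure 1-2 concrete, the following is a minimal Python sketch of the sampling algorithm; the function names, the generic dist callable, and the list-based data representation are illustrative assumptions, not the thesis implementation.

import random
import heapq

def sampled_knn_distance(data, i, k, alpha, dist):
    """kth-NN distance of point i within a random sample of size alpha."""
    others = [j for j in range(len(data)) if j != i]
    sample = random.sample(others, alpha)            # sample without replacement
    dists = sorted(dist(data[i], data[j]) for j in sample)
    return dists[k - 1]                              # kth smallest sampled distance

def sampling_outliers(data, k, r, alpha, dist):
    """Return the top-r points whose sampled kth-NN distance is greatest."""
    scores = [(sampled_knn_distance(data, i, k, alpha, dist), i)
              for i in range(len(data))]
    return [i for _, i in heapq.nlargest(r, scores)]

With n points and α samples per point, this sketch performs exactly αn distance computations, which is the bound on running time discussed above.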

Technical contribution. The sampling algorithm is trivial to implement. However, a sampling-based algorithm is of little use if it is not possible to provide the user with an idea of the result quality. This leads to my technical contributions: (1) I formally analyze the statistical properties of the proposed algorithm. (2) I devise an effective technique that gives the user a statistical indication of the algorithm's result quality. Specifically, the number of the true top r kth-NN outliers in the sampling algorithm's return set is treated as a random variable. I derive the formulas for the mean and variance of this random variable. Then, I design two efficient estimation algorithms to estimate the mean and variance of the random variable. The estimates are returned to the user as a quality measurement of the answer set.

1.2 Detecting Extreme Values in a Data Set

The second problem deals with a ubiquitous problem in data management: guessing the maximum/minimum value (or some other extreme order statistic) over a data set. Stated simply, the goal is to guess the k-th largest value over a real-valued data set by taking a sample from the data set. This particular problem arises in many applications, for example:

As pointed out by Donjerkovic and Ramakrishnan [17], in top-k query processing, knowing the cutoff value beforehand allows the "top-k" portion of the query to be transformed into a relational selection. The resulting query can then be processed without modification to the database engine.

In outlier detection for data mining, the state-of-the-art algorithm [8] prunes points from the outlier candidate set when it has been determined that there are too many points that are close by the candidate. If it were possible to accurately guess the distance to a point's kth nearest neighbor, this pruning could be done without actually finding those close-by points.

In distance join processing [23, 49], the goal is to find the k closest pairs over two different data sets. The fact that k is supplied to the join (as opposed to a cutoff distance) makes the query more useful, but it greatly complicates the computation. If the cutoff distance were known beforehand, then the problem can be solved using any one of a large number of efficient algorithms [12, 6, 35].

The problem. The goal is to find the k-th largest/smallest extreme value in a data set. Given a real-valued data set D and an integer parameter k, the k-th largest/smallest extreme value is the value ranking at the k-th position of the entire data set.

A naive solution. The naive way to find the k-th extreme value in a real-valued data set is to scan the data set once and find the k-th largest/smallest value. The time complexity is linear with respect to the data set size.

This solution will be too slow if the data set is too large and disk-resident, or if we need to repeatedly apply different functions to the data set and find the k-th extreme value for each function (as in the application to outlier detection that I will consider [8]).

Proposed statistical method. I propose a sub-linear-speed algorithm to guess the k-th extreme value in a data set. The algorithm makes use of a simple random sample (without replacement) from the data set to guess the answer. Denote the sample size by n and the data set size by N. The estimator studied is the k'-th largest value in the sample, where k' = ⌈(n/N)·k⌉. For example, suppose we wish to guess the largest value in a data set of size 15. The data set contains {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}. The largest value is 15. Now, we take a five-item sample from this set, which happens to be: {11, 5, 2, 1, 9}. Then the value of the estimator is 11, where k' = ⌈(5/15)·1⌉.

Technical contribution. The above estimation is useless if one cannot characterize the relationship between the estimator and the true k-th extreme value. This leads to my main technical contributions: (1) I devise a closed-form shape model to represent my prior knowledge about the data domain. (2) I employ the EM learning algorithm [16] to learn the shape model parameters from the historical data. (3) By applying Bayesian statistical theory, the prior model is updated with the sample taken from the current data set. The resulting model contains the posterior knowledge about the current data set. (4) I design a novel and efficient Monte Carlo algorithm to sample a distribution -- useful for providing confidence bounds for the estimator -- from the posterior model.
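As a concrete illustration of the estimator just described (the ⌈(n/N)·k⌉-th largest sample value), here is a small Python sketch that reproduces the 15-item example from the text; the variable and function names are illustrative only.

import math
import random

def guess_kth_largest(sample, n, N, k):
    """Estimate the kth largest value of a population of size N from a
    without-replacement sample of size n, using the k'-th largest sample
    value where k' = ceil((n / N) * k)."""
    k_prime = math.ceil(n / N * k)
    return sorted(sample, reverse=True)[k_prime - 1]

population = list(range(1, 16))                    # {1, ..., 15}; true maximum is 15
sample = random.sample(population, 5)              # e.g. [11, 5, 2, 1, 9]
print(guess_kth_largest(sample, n=5, N=15, k=1))   # estimator value, e.g. 11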

1.3 Detecting Anomalous Local Regions in Spatial Data

The third piece of my work involves identifying local spatial areas having statistical properties that are significantly different from the rest of the data. The motivating application is the following. We have a large number of hospitals, and want to find a set of adjacent hospitals where the trend over some years in antimicrobial resistance rate is significantly different from the trend in the rest of the hospitals, since this set of hospitals may be a candidate for further examination. "Significantly different" is measured by conducting a likelihood ratio hypothesis test that can handle arbitrary hypothetical models.

The problem. The goal is to find the most anomalous rectangular region in a spatial data set placed on an n x n grid. Given a rectangular region on a grid, to measure the degree of difference between the local data inside the rectangle and the data outside the rectangle, the likelihood-ratio test (LRT) [53] is employed. An LRT is a statistical test in which a ratio is computed comparing the maximum probabilities of observing a result under two different hypotheses. A decision between the two hypotheses can be made based on the value of this ratio. In this problem, for a given likelihood model and a set of parameters from the model parameter space, two hypotheses are relevant: the null hypothesis asserting that the data inside the rectangle is not different from the rest of the data in terms of the parameter set, and the alternative hypothesis asserting that the data inside the rectangular region is different from the rest of the data in terms of the parameter set.

A naive solution. Given a spatial data set distributed on an n x n grid, the straightforward solution would be to enumerate all of the possible rectangles in this grid. For each enumerated rectangle, the LRT test statistic is computed, and the current best LRT test statistic is updated when a better one is discovered.

Denote the upper bound for the time needed to obtain one test statistic by c. The above exhaustive enumeration has time complexity O(cn^4), since there are O(n^4) rectangles in an n x n grid. However, the problem is not necessarily the number of rectangles. For example, enumerating O(100^4) rectangles on a 100 x 100 grid takes a second on a modern commodity PC. The problem is that obtaining one likelihood ratio statistic may be expensive. In other words, c can be very large, since the ratio statistic requires two optimization tasks under the alternative hypothesis.

Proposed statistical method. I propose a framework that enumerates all possible rectangles on an n x n grid. However, I avoid computing the expensive test statistic for every test, thus effectively reducing the constant c. The strategy that I will use is a novel pruning technique. Denote the LRT statistic as λ = L_{H_0} / L_{H_a}, where L_{H_0} is the likelihood of the data set under the null hypothesis and L_{H_a} is the likelihood of the data set under the alternative hypothesis. In log space, -2 log λ = 2 log L_{H_a} - 2 log L_{H_0}. The bigger the value of -2 log λ, the more evidence we have that the alternative hypothesis is true. I devise a method that can upper bound -2 log λ, and use the upper bound to prune those tests that cannot possibly exceed the current best value.

Technical contribution. Pruning a large fraction (above 99.999%) of tests is challenging. This leads to my primary technical contributions. (1) I devise an appropriate likelihood model to mine the significant local regions in hospital data. (2) I find a minimum set of rectangles and pre-compute their log-likelihood values. The precomputed set has a size of O(n^3). For each LRT test, I can make use of these precomputations to upper bound the test statistic, thus achieving pruning. (3) The framework is generic. I show experimentally that by using my framework, I can reduce the running time of a spatial anomaly detection task on antibiotic resistance data from 1.5 years to 11 days.
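To make the naive enumeration and the pruning idea concrete, the following is a minimal Python sketch; the lrt_statistic and upper_bound callables stand in for whatever user-supplied likelihood model and bound are available, and are assumptions of this illustration rather than the thesis framework itself.

def naive_top_rectangle(grid_n, lrt_statistic):
    """Exhaustive O(n^4) scan: score every rectangle, keep the best."""
    best_score, best_rect = float("-inf"), None
    for x1 in range(grid_n):
        for x2 in range(x1, grid_n):
            for y1 in range(grid_n):
                for y2 in range(y1, grid_n):
                    rect = (x1, y1, x2, y2)
                    score = lrt_statistic(rect)      # -2 log lambda; possibly expensive
                    if score > best_score:
                        best_score, best_rect = score, rect
    return best_rect, best_score

def pruned_top_rectangle(grid_n, lrt_statistic, upper_bound):
    """Same enumeration, but skip a rectangle whenever a cheap upper bound on
    -2 log lambda cannot beat the current best exact score."""
    best_score, best_rect = float("-inf"), None
    for x1 in range(grid_n):
        for x2 in range(x1, grid_n):
            for y1 in range(grid_n):
                for y2 in range(y1, grid_n):
                    rect = (x1, y1, x2, y2)
                    if upper_bound(rect) <= best_score:
                        continue                     # pruned: exact LRT never computed
                    score = lrt_statistic(rect)
                    if score > best_score:
                        best_score, best_rect = score, rect
    return best_rect, best_score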

CHAPTER 2
RELATED WORK

2.1 Related Work for Outlier Detection

Outlier detection is a long-standing topic in the statistics area [7, 21], where statistical tests have been used for detecting anomalous points. The restriction on statistical methods is that they generally consider uni-dimensional data and are not distribution-free. These restrictions severely limit the scope of the applicable real-life problems, where multidimensional data and non-standard distributions are commonplace.

To detect outliers in real-life data, many computer science researchers have recently investigated distance-based methods. Since the seminal work on distance-based outlier detection by Knorr and Ng [27], numerous papers have been published on how to efficiently mine distance-based outliers [27, 3, 28, 8, 43]. Distance-based outlier detection uses a candidate point's average k nearest neighbor distance (or, alternatively, the kth nearest neighbor distance) as the anomaly score function and returns the top few points in a data set whose score is the highest.

In the seminal work of Knorr and Ng [27], a hash-based algorithm called the "cell-based algorithm" was proposed. The basic idea is that, given a threshold distance, a multi-dimensional grid whose cell diagonal is half that threshold is first constructed. Then, all of the data points are hashed to their corresponding square cells. For each cell, its immediate neighboring cells are named Layer 1 neighbors, and its next two immediate neighboring cells are named Layer 2 neighbors. By counting the number of points residing in the current cell and its Layer 1 and Layer 2 neighbors, one can infer whether the points falling into the current cell are all outliers, or none of them are. For the case where the current cell is flagged as a "non-outlier cell", its Layer 1 neighbors must contain no outliers. For the case where one cannot determine the flag of the current cell, at most one single pass of the data set is required to detect the existence of outliers in that cell. This hash-based algorithm has linear time complexity with respect to the size of the data set, but exhibits exponential performance with respect to the number of dimensions of a data point.

The drawback of the above work is that it requires users to supply a magic threshold distance as a parameter for the outlier definition, which may be impractical in real life. Also, it is not scalable with the data dimensions. In order to remedy the shortcomings of Knorr and Ng's work, Ramaswamy et al. [43] propose a new definition of distance-based outlier, where they use a point's kth nearest neighbor distance as a score function to define an outlier. With the new definition, a partition-based algorithm is developed, which has been experimentally shown to scale well with respect to the data set size and the data point dimensions. The algorithm first clusters data into different partitions; data points belonging to the same cluster are close to each other. Then, for each partition, the lower and upper bounds for the kth nearest neighbor distances of the entire partition's points are obtained. Using these bounds, together with the cardinality of each partition, one can easily prune partitions that cannot contain outliers. For those partitions that are promising, a detection stage is invoked to find outliers.

The state-of-the-art algorithm to mine distance-based outliers is due to Bay et al. [8]. The authors propose a simple modification of a nested-loop algorithm. The algorithm works as follows. For each point in the data set, a scan of the data set is started. During the scan, as more neighbor distances are computed, the kth nearest neighbor distance for the current point is constantly updated, and once the current kth nearest neighbor distance is less than the best cut-off maintained from the outlier candidate set, the current scan of the data set is stopped. The algorithm continues the same process with the next point. In order to have an early stop for most of the linear scans of the data set, the key step is that the outer loop processes the data in a randomized fashion. Theoretical analysis shows that this modified nested-loop algorithm in general has linear performance, but in the worst case it still has quadratic complexity with respect to the data set size.
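The following is a condensed Python sketch of the randomized nested-loop idea just described; the variable names and the simple top-r bookkeeping are my own illustration under stated assumptions, not Bay and Schwabacher's actual implementation.

import random

def bay_style_outliers(data, k, r, dist):
    """Randomized nested loop with cutoff pruning: a point's scan is stopped as
    soon as its kth-NN distance drops below the current top-r cutoff."""
    order = list(range(len(data)))
    random.shuffle(order)                    # randomized processing order is the key step
    top = []                                 # list of (kth-NN distance, point index)
    cutoff = 0.0
    for i in order:
        knn = []                             # k smallest distances seen so far for point i
        pruned = False
        for j in order:
            if j == i:
                continue
            d = dist(data[i], data[j])
            knn = sorted(knn + [d])[:k]
            if len(knn) == k and knn[-1] < cutoff:
                pruned = True                # i cannot be a top-r outlier; stop the scan
                break
        if not pruned:
            top = sorted(top + [(knn[-1], i)], reverse=True)[:r]
            if len(top) == r:
                cutoff = top[-1][0]          # weakest of the current top-r candidates
    return [i for _, i in top]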

There is other follow-up work to further improve Bay's algorithm. For example, with the goal of optimizing disk I/O, Tao et al. [51] devised the SNIF technique to mine distance-based outliers in metric space.

However, none of the existing work is practical in the situation where each and every distance computation is very expensive, such as one second per distance computation. My work in this thesis specifically targets this situation with a simple sampling algorithm that returns an approximate answer set. Significantly, I address the technical challenges related to how to provide accuracy guarantees for the approximate answer set.

The work most related to my own is the sampling approach presented in [28]. In that piece of work, the authors also adopt sampling as an efficient method to do outlier detection. However, their work does not consider a rigorous measurement of the answer set, and therefore cannot provide the certainty that most practitioners require.

Orthogonal to distance-based outlier detection, there is also a density-based outlier detection approach, as proposed in [11], where an outlier is defined based on a density indicator: the inverse of the average of the k nearest neighbor distances. A nested-loop algorithm seems to be the only solution to this problem, as one has to compute all the k nearest neighbor distances for each point in a given data set.

2.2 Related Work for Extreme Values Detection

I also consider the problem of obtaining a sample without replacement from a finite population and estimating the kth largest value in the finite population via the sample. Solving this problem has potential for great impact on the database management and data mining research areas, because there are a large number of research problems that can directly benefit from the ability to estimate the extreme values.

Sampling has long been a popular method in databases and data management [41, 25, 22]. However, to date, using sampling to estimate the extreme values from a finite population has mostly been ignored as a data management research topic. The reason seems to be the difficulty of the problem. Unlike other aggregate functions such as COUNT, AVERAGE, MEDIAN, and so on, no universal limiting theorems such as the central limit theorem apply to extreme values. For simple aggregate functions such as AVERAGE, the accuracy of an estimate depends primarily on the first two moments of the underlying data distribution (the mean and variance), and these are properties that can easily be estimated. For maximums and minimums, this is not the case.

Not surprisingly, statisticians have studied related problems in detail, and many results are widely known. There are theorems that allow mathematical derivation of the Extreme Value Distribution (EVD) given a known parametric distribution for the original data and the sample size. Here the extreme value refers to the maxima or minima for a given sample. However, due to the difficulty of derivation, appropriate asymptotic distributions are generally used. That is, in most EVD theory, users collect the extreme value samples as sufficient statistics for inference; they never use the data from which the extreme values are obtained to perform inference. My own work looks at the problem of using a sample from a database to produce confidence bounds for the extreme value of the database.

In order to understand the contribution of my work, it is useful to review the classical EVD theory. At a high level, all continuous distributions can be classified as one of the following three groups: the Exponential family, the Cauchy family, and the Weibull family. For each distribution family, there exists an associated asymptotic EVD. Knowing the underlying parametric distribution, one can derive the asymptotic EVDs for the kth maxima and minima in a similar manner (see Section 7.4 in [26]). In some real-life applications, the above three EVDs are not applicable. Maritz and Munro [36] presented a Generalized EVD that can be applied to small sample problems as well as large sample problems.

To make use of EVD theory in practice, one needs to collect samples of extreme values, and use the collected sample to estimate the parameters of an asymptotic EVD. For example, imagine that one wants to predict the maximum wind speed over the 25-year span since 1997. He/she has collected the maximum wind speed for the years from 1997 to 2007 (10 years). Using the 10 extreme value samples, the parameters of an EVD are first estimated, and then the first order statistic of 25 samples is obtained from the resulting EVD as the final estimate. Obviously, the requirement of obtaining extreme value samples is necessary in this example to make use of EVD theory.

In contrast, my work considers a different scenario where I ask "with a subsample of a finite population, can I bound the extreme value in this particular finite population?" Note that here I am not allowed to obtain i.i.d. extreme value samples. As a result, the traditional EVD theory cannot be applied directly. Also, since I am interested in arbitrary, real-life data sets that in general do not follow any known parametric distribution, it is doubtful that classic EVD theory (which relies on well-known parametric distributions) will help. Finally, since my approach models the distribution of a ratio estimate, which is scale-agnostic, it can make use of historical queries to predict future queries even though they have quite different scales.

The most relevant work to this topic is the classic survey sampling technique [46], where users design a sampling plan to take samples from a finite population and use the obtained sample to infer the properties of the finite population. My work has been inspired by the superpopulation technique in survey sampling design, where the finite population is treated as a random sample from an infinite superpopulation, and the observed sample is deemed a subsample of the sample taken from the superpopulation. In order to perform an inference, the subsample is used to estimate the superpopulation first. Once a superpopulation is obtained, it can be used to optimize an estimator that performs well on average over any finite population generated from the superpopulation.
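As a side note, the classical EVD workflow in the wind-speed example above can be carried out with standard tools. The sketch below uses SciPy's generalized extreme value distribution on simulated yearly maxima, purely as an illustration of that classical workflow (the simulated data and parameter values are assumptions, and this is not the Bayesian method proposed in this thesis).

import numpy as np
from scipy.stats import genextreme

rng = np.random.default_rng(0)
yearly_maxima = rng.gumbel(loc=60.0, scale=8.0, size=10)   # 10 observed yearly maxima

# Fit a generalized extreme value distribution to the 10 extreme-value samples.
shape, loc, scale = genextreme.fit(yearly_maxima)

# Estimate the 25-year maximum as the expected largest of 25 draws from the fitted EVD.
draws = genextreme.rvs(shape, loc=loc, scale=scale, size=(100_000, 25), random_state=1)
print(draws.max(axis=1).mean())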

2.3 Spatial Anomaly Detection Related Work

The third research topic that I investigate is spatial anomaly detection. Detecting anomalous spatial regions has been a popular research topic. Most of the existing work aims at detecting spatial clusters [52, 4, 14] and "hot spots" [30, 39, 1, 14]. The latter set of papers is most closely related to my work. These papers are based on Kulldorff's spatial scan statistic (SSS) [30-32].

Since my work solves problems that SSS cannot model, it is useful to review SSS first. SSS is used to detect non-random spatial clusters of event (or point) occurrence. In real applications, an event can be a disease incident, a tree of a special species, a particular kind of star, etc. In order to handle the peculiarities of spatial data, SSS uses an inhomogeneous Poisson process to model the spatial differences in intensity (or event occurrence rate), thus capturing the point pattern's spatial dependence in so-called first order effects. SSS builds two models, where the null model asserts that there is no cluster of events and the alternative model asserts the contrary. In both cases, the spatial variations in the expected number of observed events are reflected by a geographical measure μ(A), where A is a spatial area. For example, under the null model, if the measure is uniform on a 1-D space, a classic one-dimensional Poisson distribution is obtained. With the Lebesgue measure on a 2-D plane, a homogeneous spatial Poisson process is used as the null model. In the disease outbreak detection application, the measure μ(A) is an integer, corresponding to the size of the population in area A who are at risk of a disease. More specifically, SSS computes a likelihood ratio statistic using two likelihoods for the spatial data under a null and an alternative model, respectively. The alternative model says that there is a particular zone Z within which the Poisson intensity for an arbitrary area is p, while outside this zone Z the Poisson intensity for an arbitrary area is q. The two intensities satisfy the inequality p > q. In contrast, the null model assumes a constant intensity p for the entire spatial area. Conditioned on the observed event counts, and assuming event independence, the SSS is expressed in the following form:

C_Z \log\frac{C_Z}{P_Z} + (C_G - C_Z)\log\frac{C_G - C_Z}{P_G - P_Z} - C_G\log\frac{C_G}{P_G}

In the above expression, C_Z is the aggregate event count for zone Z, and C_G is the aggregate event count for the entire spatial region. P_Z and P_G are the population sizes for zone Z and the entire spatial region, respectively. Monte Carlo methods are employed in order to get the empirical null distribution of the test statistic. That is, many replicas of the spatial data are generated under the null model. For each of the generated data sets, an SSS is computed. Finally, all of the computed SSS together with the SSS for the real data set are sorted. If the SSS for the real data set is beyond the 95th percentile in the sorted list, the corresponding zone Z is reported as a non-random spatial cluster at a 5% significance level.

In contrast, the LRT statistic considered in my thesis has a known asymptotic null distribution, which avoids replicating the grid a large number of times in order to obtain an empirical null distribution. Note that the Monte Carlo approach to obtaining the null distribution is also applicable to the LRT statistic considered in this thesis. My key contribution is that, given an N x N grid, I show how to efficiently obtain the maximum LRT statistic for this grid.

Since obtaining the null distribution is expensive for Kulldorff's spatial scan statistic, several methods have been proposed for addressing the performance issue [39, 38, 1, 2]. In [39, 38], an overlap-kd tree data structure is designed to partition the spatial area into overlapping regions. The upper bounds of the scores of subregions contained in each overlapping region are computed. Unpromising regions are pruned based on these bounds. The method has O((N log N)^2) performance for sufficiently dense clusters. Agarwal et al. [1, 2] proposed an approximate algorithm that requires O(N^3 log N). The idea is to use a set of linear functions to approximate the test statistic function, and for each linear function, the optimal point that maximizes the linear function is found. By plugging the set of optimal points into the SSS statistic function, the algorithm returns the one that yields the largest value as the approximate answer.

These methods differ from my own in that they aim to actually avoid considering all O(n^4) rectangular areas, whereas my goal is to avoid expensive LRT statistic computations. The existing methods in the literature only apply to those relatively simple density measures that are convex or monotonic with respect to the ratio of zone population over entire population and the ratio of the zone's event count over the entire event count. For such measures, the scan-statistic-based approach would be preferred to my own. Unfortunately, existing methods do not apply to the likelihood functions used in the trend or wage model considered in my thesis; in these cases, my algorithms are the only option.

It is worth mentioning that, in order to avoid expensive null distribution computations in the SSS-based approach, Neill et al. [14] proposed a Bayesian scan statistic, where they still use an inhomogeneous Poisson process to model the event counts. However, a Bayesian hierarchy is built to model the data. That is, the intensity within a testing zone is regarded as a sample from a Gamma prior distribution. Similarly, the intensity outside the testing zone is deemed a sample from another Gamma prior distribution. By computing and sorting the posterior probabilities for every testing region, the anomalous regions can be detected.
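For reference, the scan statistic displayed above is straightforward to evaluate once the zone and global counts are known; a minimal Python sketch follows, with hypothetical example inputs.

from math import log

def kulldorff_poisson_statistic(c_z, p_z, c_g, p_g):
    """Kulldorff's Poisson scan statistic for a zone Z inside a global region G:
    c_z, c_g are event counts; p_z, p_g are the populations at risk."""
    if not (0 < c_z < c_g and 0 < p_z < p_g):
        raise ValueError("zone must be a strict, non-empty subset of the region")
    return (c_z * log(c_z / p_z)
            + (c_g - c_z) * log((c_g - c_z) / (p_g - p_z))
            - c_g * log(c_g / p_g))

# Example: 40 cases among 1,000 people in the zone vs. 200 cases among 20,000 overall.
print(kulldorff_poisson_statistic(c_z=40, p_z=1_000, c_g=200, p_g=20_000))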

CHAPTER 3
OUTLIER DETECTION BY SAMPLING WITH ACCURACY GUARANTEES

3.1 Introduction

One effective approach for detecting anomalous data points in a set is distance-based (DB) outlier detection [27, 43, 5]. In DB-outlier detection, each data point is represented as a point in a multi-dimensional feature space and a distance function is chosen based on domain-specific requirements. Data points that are significantly far away from all of the others are flagged as outliers.

Though it may seem that DB-outlier detection is a solved problem, it turns out that even using advanced algorithms, the method is computationally prohibitive even for small data sets. This is particularly true in domains that require computationally expensive distance functions. For example, the edit distance function for strings [44], the ERP distance function for time series [13], the quadratic-form distance function for color histograms [19], and various scoring matrices for aligning bioinformatics sequences [15] are all computationally expensive. Given two input points that are both of dimension n, the aforementioned distance functions require Θ(cn^2) time, where the constant c is usually above three. In these domains, state-of-the-art algorithms may take days to detect DB-outliers, even in data sets of small sizes.

The goal of this chapter is to define an algorithm that can provide users with interactive-speed performance over the most expensive distance computations, giving the user the ability to "try out" various distance functions and queries during exploratory mining. The question I consider is: How can I reduce the required number of distance computations in DB-outlier mining? For certain distance measures and data sets, indexing and pruning techniques can be used to reduce the number of distance computations. Unfortunately, indexing [11] is not useful in the domains mentioned above due to high data dimensionality, and pruning [8] tends only to reduce the required number of distance computations to a few hundred or a few thousand per data point; as I will show experimentally, this is still too costly if every distance computation requires seconds of CPU time.

In this chapter, I consider a simple sampling algorithm for mining the kth nearest neighbor (kth-NN) DB-outliers [43]. Given integer parameters k and r, kth-NN outliers are the top r points whose distance to their kth-NN is greatest. Sampling has been considered before as an option for detecting outliers [29], but never along with a rigorous treatment of the effect on result quality. If the user is able to tolerate some rigorously-measured inaccuracy, my algorithm can give arbitrarily fast response times.

My algorithm works as follows. For each data point i, I randomly sample α points from the data set. Using the user-specified distance function, I find the kth-NN distance of point i in those α samples. After repeating the above process for each point, the sampling algorithm returns the r points whose sampled kth-NN distance is greatest. Algorithm 1 presents the corresponding pseudo-code:

Algorithm 1 Sampling Algorithm
Input: A data set G of size n; k, specifying which NN distance as the criterion; α (α > k), the sample size; r, the number of outliers to return.
Output: r points as the kth-NN outliers
1: for each point i in G do
2:   Draw a random sample of size α from G (not including point i)
3:   Calculate point i's kth-NN distance in its sample
4: end for
5: Return the top r points whose kth-NN distance in its sample is the greatest

Aside from the algorithm's obvious simplicity, its biggest benefit is that it allows a user to control the total number of distance computations, thus bounding the time required to run the algorithm to completion. For example, with one million points and ten samples for each point, the algorithm will require ten million distance computations. During exploratory mining, such an algorithm may save significant user time by giving fast evidence that the data set in question lacks any meaningful outliers. Or, such fast results may detect an inappropriate distance function choice at an early stage. If the results are accurate or interesting enough, the algorithm may obviate the need to run an exact algorithm altogether.

Of course, sampling-based algorithms are of little use if it is not possible to provide the user with an idea of the result quality. This leads to this chapter's primary technical contribution. I formally analyze the statistical properties of the algorithm, and describe an effective technique that gives the user a statistical indication of the algorithm's result quality. Specifically, I treat the number of the true top r kth-NN outliers in Algorithm 1's return set as a random variable whose characteristics can be used to measure the quality of the outliers returned by the algorithm. Thus, if I tell the user that the expected number of true outliers returned is 20 for a given data set, then this implies that if the sampling algorithm were run many times on that data set, on average 20 out of the 30 returned points will be among the true top 30 kth-NN outliers. In addition to the formal analysis, I demonstrate the effectiveness and efficiency of my proposed sampling algorithm and statistical estimation algorithms by giving extensive experimental results.

The remainder of the chapter is organized as follows. In Section 2, I describe the overall process of providing quality guarantees for the sampling algorithm. In Section 3, a statistical analysis of the sampling algorithm is performed. Section 4 describes how the analysis can be applied to a constructed distance database in a highly scalable and efficient fashion. Section 5 details the experimental study on two expensive domains as well as ten real data sets. The chapter is concluded in Section 6.

3.2 Providing Quality Guarantees

Algorithm 1 (hereafter referred to as the "sampling algorithm") is so simple that there is little benefit in discussing it in greater detail. However, the issue that clearly does deserve more attention is the question of how to give the user an understanding of exactly how accurate this algorithm is likely to be in practice; it is this question that I consider in detail in the remainder of the paper.

The chapter describes a two-step process for analyzing the quality of the sampling algorithm for a given data set, so that the estimated result quality can be returned to the user along with the discovered outliers. In the first step, the distances between each point and its samples are used to (logically) construct a distance database D', where the set of pairwise distances in D' is a reasonable approximation of the actual distance database D. By distance database, I refer to a matrix that stores the pairwise distance from point i to point j for every i and j. During the construction of D', I wish to avoid additional distance computations beyond those required by the sampling algorithm. Thus, for a data point i, the sampled distances {d'_1, d'_2, ..., d'_α} between i and the other points in its sample are replicated as needed to serve as a surrogate for the actual neighbor distances {d_1, d_2, ..., d_{n-1}} of i. This reconstruction is identical to a type of reconstruction advocated when making use of the without-replacement bootstrap from statistics [42].

In the second step, an exact, statistical analysis of the quality of the sampling algorithm for the database D' is performed, and the result is returned to the user as an indication of the accuracy of the algorithm when it is applied to D. While it is true that the bounds that are returned rely on the supposition that D' is in fact a reasonable surrogate for D, such assumptions are generally unavoidable in statistical analysis. An analogous assumption in classical sampling theory is that the sample variance can be used as a reasonable surrogate for the population variance [46]. In order to address this concern, Section 5 of the chapter shows that over twelve real-life data sets, for various sample sizes, the bounds derived by my method do in fact reliably predict the accuracy of the sampling algorithm.

The process of deriving accuracy guarantees for the sampling algorithm is described in the next two Sections. Section 3 provides a rigorous statistical analysis of the sampling algorithm. Section 4 describes how the analysis can be applied to D' in a highly scalable and efficient fashion.

3.3 Statistical Analysis

This Section describes a formal, probabilistic analysis of the quality of the sampling algorithm's return set.

3.3.1 Quality Measure N

The analysis needs to be performed with respect to some measure of the algorithm's quality. Let A be the set of the true top r kth-NN outliers, and let A' be the return set of the sampling algorithm. Then the size of the intersection A ∩ A' is a reasonable measure of the sampling algorithm's quality. Let N be a random variable denoting the size of this intersection set for a single run of the sampling algorithm over a given data set. In order to precisely characterize the sampling algorithm's quality, I characterize the expectation and variance of N. These statistics describe the average number of correct outliers returned by the algorithm, as well as how much the number of correct outliers is expected to vary depending on how lucky (or unlucky) I happen to be.

In the rest of the chapter, I will use E[X] and Var(X) to denote the expectation and variance of a random variable X, respectively. I will also use the notation Pr[B] to denote the probability that event B will occur.

3.3.2 Average Number of Correct Outliers

To compute E[N] for a given data set, I begin with a mathematical expression for N. Let y_i evaluate to one if the i-th point in the data set G is a true kth-NN outlier, and zero otherwise. I define a series of Bernoulli (zero/one) random variables of the form M_i, where M_i is one if the i-th point is declared as an outlier by the sampling algorithm, and zero otherwise. Then:

N = \sum_{i=1}^{n} y_i M_i    (3-1)

Taking the expectation of both sides, I have:

E[N] = \sum_{i=1}^{n} y_i E[M_i]    (3-2)
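Before deriving E[N] and Var(N) analytically, note that both quantities can be approximated empirically by simply rerunning the sampling algorithm and counting the overlap with the true top-r set. The Python sketch below is only a brute-force sanity check under that assumption (with the sampling algorithm passed in as a callable), not the efficient estimation algorithms developed in this chapter.

import statistics

def simulate_N(run_sampling_algorithm, true_top_r, trials=200):
    """Monte Carlo estimate of E[N] and Var(N), where N = |A ∩ A'| is the
    number of true top-r outliers in one run's return set A'."""
    overlaps = [len(set(run_sampling_algorithm()) & set(true_top_r))
                for _ in range(trials)]
    return statistics.mean(overlaps), statistics.variance(overlaps)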

Note that for a given input of the sampling algorithm, y_i is a constant. Thus, deriving an expression for E[N] reduces to the problem of deriving an expression for E[M_i]. Since M_i is a Bernoulli random variable, this is equivalent to computing the probability that M_i evaluates to one.

3.3.2.1 How often is point i declared an outlier?

M_i evaluates to one if point i is reported as an outlier by the sampling algorithm. Obviously, point i is flagged as an outlier only if there are at most r-1 points in G whose kth-NN distance in its sample is larger than point i's. Let T_i be a random variable denoting the total number of such points. Then M_i is one if and only if T_i ≤ r-1. Noting that T_i is asymptotically normally distributed (see footnote 1), I have:

E[M_i] = Pr[M_i = 1] = Pr[T_i \le r-1] = Pr\left[ \frac{T_i - E[T_i]}{\sqrt{Var(T_i)}} \le \frac{r-1-E[T_i]}{\sqrt{Var(T_i)}} \right] = \Phi\left( \frac{r-1-E[T_i]}{\sqrt{Var(T_i)}} \right)    (3-3)

In Equation 3-3, Φ(x) is the standard normal cumulative distribution function. In order to make use of Φ(x), I need to calculate the mean and variance of T_i. Before doing this, for each point i in G, I introduce a random variable N_i, which denotes the sampled kth-NN distance of point i. In addition, I define a function one(expr) that evaluates to one if and only if the boolean-type argument expr is true, and zero otherwise. Given these notations, T_i = \sum_{j, j \ne i} one(N_j > N_i). If D_i = <d_1, d_2, ..., d_{n-1}> refers to the vector of point i's neighbor distances ordered from the smallest to the largest, then the following lemma gives the formulas for computing the mean and variance of T_i. This lemma provides the base formulas that I will use to compute E[N] in Section 4.

Footnote 1: In general, T_i is normally distributed due to the Liapounov Central Limit Theorem. Since T_i is the sum of n-1 independent Bernoulli random variables, this theorem applies; see Lehmann [34], Corollary 2.7.1.

Figure 3-1. Replicating sample distances to construct D'. (The figure shows an original distance matrix D, the sampled distance matrix drawn from it, and the approximate distance matrix D' reconstructed by replicating the sampled distances.)

Lemma 1. The mean and variance of T_i are given by:

E[T_i] = \sum_{d \in D_i} Pr[N_i = d] \sum_{j, j \ne i} Pr[N_j > N_i \mid N_i = d]

Var(T_i) = E[T_i] - E^2[T_i] + sum1[T_i] - sum2[T_i]

In Lemma 1, sum1[T_i] and sum2[T_i] are given by:

sum1[T_i] = \sum_{d \in D_i} Pr[N_i = d] \left[ \sum_{j, j \ne i} Pr[N_j > N_i \mid N_i = d] \right]^2

sum2[T_i] = \sum_{d \in D_i} Pr[N_i = d] \sum_{j, j \ne i} Pr^2[N_j > N_i \mid N_i = d]

Proof. Taking the expectation of both sides of T_i's expression:

E[T_i] = E\left[ \sum_{j, j \ne i} one(N_j > N_i) \right] = \sum_{j, j \ne i} Pr[N_j > N_i] = \sum_{j, j \ne i} \sum_{d \in D_i} Pr[N_i = d] Pr[N_j > N_i \mid N_i = d] = \sum_{d \in D_i} Pr[N_i = d] \sum_{j, j \ne i} Pr[N_j > N_i \mid N_i = d]

Var(T_i) can be proved similarly.

PAGE 39

3.3.2.2ProbabilityformulasinLemma1 InLemma1,Iexpressedthemeanandvarianceof T i intermsoftheprobabilitythat N i = d andtheconditionalprobabilitythat N j >N i given N i = d .Inowgivetheexact formulastocomputethesetwoprobabilities. Computing Pr [ N i = d ].Recallthatsteps(2)-(3)ofthesamplingalgorithmdrawa randomsample(withoutreplacement)ofsize frompoint i 's n 1neighborsand compute i 's kth -NNdistanceinthissample.Thisprocesscanbemodeledbyth e hypergeometricdistribution.Inthehypergeometricdistr ibution,ajarhas N balls, M red, N M green.Ifapersonweretoreachin,blindfolded,andselect K balls atrandom withoutreplacement ,theprobabilitythatexactly x ballsofthe K balls retrievedwereredisgivenby H ( X = x j M;N;K ).Analogously,supposethatpoint i 's qth -NNdistance(denotedby d q )ischosentobe N i .ThenIcanregardthe q 1 NNsofpoint i asredballs,andthoseNNswhoarefurtherawaythan i 's qth -NNas greenballs.Giventhis: Pr [ N i = d q ]= n 1 H ( X = k 1 j q 1 ;n 2 ; 1)(3{4) InEquation 3{4 n 1 istheprobabilitythatthe qth -NNofpoint i isincludedinthe sampleduringarunofthesamplingalgorithm; k and aretheparametersofthe samplingalgorithm; n isthesizeoftheinputdataset. Computing Pr [ N j >N i j N i = d ].Computingthisconditionalprobabilityisnothing butndingoutallthedistancesin N j 'sdomainthataregreaterthan d ,andthen summingtheprobabilitiesofobservingthosedistances.Gi venpoint j 'ssortedvector ofneighbordistances D j = h d 1 ;d 2 ;:::;d n 1 i ,supposethat d min 1 isthesmallest distanceinthisarraythatislargerthan d .ThenbymakinguseofEquation 3{4 : Pr [ N j >N i j N i = d ]= n 1 X q = min 1 Pr [ N j = d q ](3{5) Notethatitistheindex q thatsolelydeterminestheprobabilityof N j = d q ,given thattheotherparametersofthesamplingalgorithmarexed 3.3.3VarianceInCorrectOutliers Knowingonexpectationhowmanycorrectoutliersintheretu rnsetofthesampling algorithmisnotenough.Itiscrucialtohelptheusertounde rstandhowthenumberof correctoutliersisgoingtovaryarounditsmean.Thevarian ceof N givessuchameasure. Thisquantityis Var ( N )= E [ N 2 ] E 2 [ N ].SinceIhavealreadydiscussedhowtocompute 39


3.3.3 Variance in Correct Outliers

Knowing on expectation how many correct outliers are in the return set of the sampling algorithm is not enough. It is crucial to help the user understand how the number of correct outliers is going to vary around its mean. The variance of $N$ gives such a measure. This quantity is $\mathrm{Var}(N) = E[N^2] - E^2[N]$. Since I have already discussed how to compute $E[N]$ in Section 3.3.2, the remaining task is to calculate the second moment of $N$:

$$E[N^2] = E\!\left[\Big(\sum_i y_i M_i\Big)^2\right] = E[N] + \sum_i \sum_{j, j \ne i} y_i y_j\, E[M_i M_j] \qquad (3\text{-}6)$$

Note that in Equation 3-6, $y_i$ and $y_j$ are constants for a given input of the sampling algorithm. Therefore, deriving an expression for $\mathrm{Var}(N)$ reduces to deriving an expression for $E[M_i M_j]$. Since $M_i M_j$ is a Bernoulli random variable, this is equivalent to computing the probability that $M_i M_j$ evaluates to one.

3.3.3.1 How often is $M_i M_j$ one?

$M_i M_j$ evaluates to one if and only if both point $i$ and point $j$ are reported as outliers by the sampling algorithm. Point $i$ and point $j$ are both flagged as outliers only if there are at most $r-2$ points in $G$ whose sampled $k$th-NN distance is larger than the smaller of $i$'s and $j$'s. Otherwise, either point $i$ or $j$ (or both) will not be flagged as an outlier. In a manner analogous to the way that I use $T_i$ to help calculate $E[M_i]$, I will use $U_{ij}$ to help calculate $E[M_i M_j]$. Let $U_{ij}$ be a random variable denoting the total number of points whose sampled $k$th-NN distance is larger than the smaller of $i$'s and $j$'s sampled $k$th-NN distance. Given this, $M_i M_j$ is one if and only if $U_{ij} \le r-2$. Noting that $U_{ij}$ is asymptotically normally distributed (for reasons identical to those behind $T_i$'s normality), I have:

$$E[M_i M_j] = \Pr[M_i M_j = 1] = \Pr[U_{ij} \le r-2]
= \Pr\!\left[\frac{U_{ij} - E[U_{ij}]}{\sqrt{\mathrm{Var}(U_{ij})}} \le \frac{r-2-E[U_{ij}]}{\sqrt{\mathrm{Var}(U_{ij})}}\right]
= \Phi\!\left(\frac{r-2-E[U_{ij}]}{\sqrt{\mathrm{Var}(U_{ij})}}\right) \qquad (3\text{-}7)$$

In order to make use of $\Phi(x)$, I need to calculate the mean and variance of $U_{ij}$. To express $U_{ij}$ in mathematical form, I introduce one more random variable, $S_{ij}$, which denotes the smaller of point $i$'s and point $j$'s $k$th-NN distances in their samples.


Obviously, the domain of $S_{ij}$ is $D_i \cup D_j$. Given these notations, I have $U_{ij} = \sum_{l, l \ne i,j} one(N_l > S_{ij})$. The following lemma gives the formulas for computing the mean and variance of $U_{ij}$; it provides the base formulas that I will use to compute $\mathrm{Var}(N)$ in Section 3.4.

Lemma 2. The mean and variance of $U_{ij}$ are given by:

$$E[U_{ij}] = \sum_{d \in D_i \cup D_j} \Pr[S_{ij} = d] \sum_{l, l \ne i,j} \Pr[N_l > S_{ij} \mid S_{ij} = d]$$
$$\mathrm{Var}(U_{ij}) = E[U_{ij}] - E^2[U_{ij}] + sum3[U_{ij}] - sum4[U_{ij}]$$

In Lemma 2, $sum3[U_{ij}]$ and $sum4[U_{ij}]$ are given by:

$$sum3[U_{ij}] = \sum_{d \in D_i \cup D_j} \Pr[S_{ij} = d] \left[\sum_{l, l \ne i,j} \Pr[N_l > S_{ij} \mid S_{ij} = d]\right]^2$$
$$sum4[U_{ij}] = \sum_{d \in D_i \cup D_j} \Pr[S_{ij} = d] \sum_{l, l \ne i,j} \Pr^2[N_l > S_{ij} \mid S_{ij} = d]$$

The proof is similar to Lemma 1's proof.

3.3.3.2 Probability formulas in Lemma 2

In Lemma 2, I express the mean and variance of $U_{ij}$ in terms of the probability that $S_{ij} = d$ and the conditional probability that $N_l > S_{ij}$ given $S_{ij} = d$. I now give the exact formulas to compute these two probabilities.

Computing $\Pr[S_{ij} = d]$. Since $S_{ij}$'s domain is $D_i \cup D_j$, I may use the definition of conditional probability to compute the probability that $S_{ij} = d$.

When $d \in D_i$:
$$\Pr[S_{ij} = d] = \Pr[N_i = d]\, \Pr[N_j > N_i \mid N_i = d] \qquad (3\text{-}8)$$

When $d \in D_j$:
$$\Pr[S_{ij} = d] = \Pr[N_j = d]\, \Pr[N_i > N_j \mid N_j = d] \qquad (3\text{-}9)$$

For each $d \in D_i \cup D_j$, I can use Equations 3-4 and 3-5 together with Equations 3-8 and 3-9 to compute $\Pr[S_{ij} = d]$.

Computing $\Pr[N_l > S_{ij} \mid S_{ij} = d]$. This conditional probability can be computed in the same way as Equation 3-5.


Given point $l$'s sorted vector of neighbor distances $D_l = \langle d_1, d_2, \ldots, d_{n-1} \rangle$, suppose that $d_{min2}$ is the smallest distance in this array that is larger than $d$. Then, by making use of Equation 3-4:

$$\Pr[N_l > S_{ij} \mid S_{ij} = d] = \sum_{q = min2}^{n-1} \Pr[N_l = d_q] \qquad (3\text{-}10)$$


3.4 Speeding Up the Computation

Equipped with the theoretical foundation of Section 3.3, I now follow the two steps discussed in Section 3.2 to perform the estimation on a constructed distance database.

3.4.1 Constructing the Distance Database $D'$

Since I wish to avoid new distance computations in the construction of $D'$, one strategy is to uniformly replicate the distances computed by the sampling algorithm to create a full-size distance database that can subsequently be analyzed. I describe such a construction with an example, illustrated in Figure 3-1. On the left-hand side of Figure 3-1, a two-dimensional matrix represents the distance database, where each column of the matrix contains the sorted distances from a given point to all of the other data points. Given this data, my sampling algorithm randomly selects two neighbor distances for each point of the input data set. As a result, I obtain the sampled distance matrix shown in the middle of Figure 3-1. In order to build a complete distance matrix $D'$ from the sampled distance matrix, I replicate each row of the sampled distance matrix twice, which gives the approximate distance matrix $D'$ shown on the right-hand side of Figure 3-1. Note that no additional distance computations are performed.

In general, the process of constructing $D'$ uses $(n-1)/\alpha$ replications of each distance computed by the original sampling algorithm. As a result, I regard each column $D'_i$ in $D'$ as an approximation of its corresponding column $D_i$ in $D$.

Algorithm 2: The EN Algorithm
Input: the sample distance array $SD$; Algorithm 1's parameters
Output: the $E[N]$ for $D'$
1: Sort $SD$ in ascending order
2: for $j = |SD|$ down to 1 do  /* backward scan of $SD$ */
3:   if $SD[j]$ is a sample distance from an outlier column $D'_i$ in $D'$ then
4:     Follow Lemma 1 to update $E[T_i]$ and the $sum1$ and $sum2$ of $\mathrm{Var}(T_i)$, making use of the sufficient statistics maintained during the backward scan of $SD$
5:   end if
6:   Update the sufficient statistics
7: end for
8: Apply Lemma 1 and Equation 3-3 to calculate $E[M_i]$ for each outlier $i$, then sum these values and return the sum as the $E[N]$ for $D'$
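The replication step of Section 3.4.1 is mechanical; the following minimal sketch (Python, with hypothetical toy values) builds the approximate matrix $D'$ from a sampled distance matrix by repeating each sampled row, as in Figure 3-1.

```python
import numpy as np

def build_D_prime(sampled, n_minus_1):
    """Replicate each sampled row so every column again has n-1 entries.

    sampled   : (alpha, n) array; column i holds the alpha sampled
                neighbor distances of point i, sorted ascending
    n_minus_1 : number of neighbors each point has in the full matrix D
    """
    alpha = sampled.shape[0]
    reps = n_minus_1 // alpha            # (n-1)/alpha copies of each sampled distance
    # repeat every row 'reps' times; no new distance computations are performed
    return np.repeat(sampled, reps, axis=0)

# toy example in the spirit of Figure 3-1: 7 points, 2 sampled distances each, 6 neighbors
sampled = np.array([[2.0, 4.0, 5.0, 5.0, 6.0, 7.0, 9.0],
                    [11.0, 12.0, 7.0, 10.0, 14.0, 12.0, 23.0]])
D_prime = build_D_prime(sampled, n_minus_1=6)
print(D_prime.shape)   # (6, 7)
```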


3.4.2 Computing $E[N]$ and $\mathrm{Var}(N)$ for $D'$

In this section, I show efficient algorithms that can obtain $E[N]$ and $\mathrm{Var}(N)$ for $D'$ in $\Theta(n \log n)$ time.

To compute $E[N]$, the main task is to compute $E[T_i]$ and $\mathrm{Var}(T_i)$ for each outlier column $i$ in $D'$ (those with the $r$ largest $D'_i[k]$ values in $D'$).

To calculate $E[T_i]$, the following two observations about Equation 3-4 and the $E[T_i]$ formula in Lemma 1 are critical:

1. Each distance $d \in D'$ is associated with a fixed probability $\Pr[N_i = d]$, which is determined by $d$'s row position in $D'$.

2. The main task in computing $\sum_{j, j\ne i} \Pr[N_j > N_i \mid N_i = d]$ is nothing but finding all of those distances in $D'$ that are larger than $d$ and come from a column other than column $i$, and then summing the probabilities associated with those distances.

Furthermore, notice that for a given column $D'_i$ of $D'$, the rows replicating the same sample distance $d$ can be represented by one row. The only change needed is that the probability associated with the representative row becomes the sum of the probabilities associated with the $(n-1)/\alpha$ replications of $d$. With this compression, the remaining step in computing $E[T_i]$ is, for each $d$ in column $i$, to find all of those sample distances that are not from column $i$ and are greater than $d$. A straightforward way to do this is to sort all of the sample distances and scan backward through the sorted distance array until $d$ is reached. During the scan, I maintain some sufficient statistics so that when I encounter a sample distance $d \in D'_i$, the quantity $\Pr[N_i = d] \sum_{j, j\ne i} \Pr[N_j > N_i \mid N_i = d]$ can be computed. I find that the following statistics are sufficient for this purpose: (1) the sum of the probabilities associated with each sample distance that has been checked; (2) how many sample distances from column $i$ have been passed. Then I can update $E[T_i]$ by following Lemma 1 during the backward scan. This requires $O(n)$ time. A similar idea applies to the calculation of $sum1$ and $sum2$ in Lemma 1. Therefore, $\mathrm{Var}(T_i)$ also requires $O(n)$ time, provided the sample distance array is sorted (this setup requires $\Theta(n \log n)$ time).


Algorithm 2 (hereafter referred to as the EN algorithm) presents the pseudocode for computing $E[N]$ on $D'$. Since Lemma 1 and Lemma 2 have similar structures, I can calculate $E[U_{ij}]$ and $\mathrm{Var}(U_{ij})$ according to Lemma 2 in the same fashion, by maintaining some sufficient statistics during the backward scan of the sorted sample distance array. After obtaining $E[U_{ij}]$ and $\mathrm{Var}(U_{ij})$, computing $\mathrm{Var}(N)$ is trivial. I call the algorithm that computes $\mathrm{Var}(N)$ for $D'$ the VarN algorithm.


Table 3-1. Efficiency comparisons between Bay's algorithm and the sampling algorithm on the Chr18 and UCID data sets. The time reported for the sampling algorithm includes the running time of both the sampling algorithm and the EN and VarN algorithms.

    Algorithm            Chr18 DC/point   Chr18 time    UCID DC/point   UCID time
    Bay's algorithm      331              24h:9m:31s    180             3h:48m:26s
    Sampling algorithm   10               47m:24s       10              14m:7s
    Ratio                33.1             30.6          18              16.2

Table 3-2. Average estimation results and the empirical results of $N$ over 10 trials on the 10 real data sets, with the sample size $\alpha$ set to 10 and 60, respectively.

    Dataset              alpha = 10: observed N / E[N] / sigma    alpha = 60: observed N / E[N] / sigma
    Wisconsin cancer     16.60 / 13.29 / 3.55                     24.60 / 22.20 / 0.00
    Balance scale         8.60 /  5.17 / 5.29                      8.60 /  6.46 / 5.75
    Florida farm         27.10 / 27.09 / 0.27                     28.20 / 26.90 / 0.86
    California farm      22.70 / 24.59 / 0.73                     24.40 / 25.15 / 0.00
    FCAT Read            15.90 / 13.86 / 3.75                     19.90 / 18.48 / 1.43
    FCAT Math            12.90 / 12.69 / 4.40                     18.70 / 17.14 / 2.11
    California house     10.40 / 14.30 / 5.09                     11.30 / 17.94 / 3.00
    Baseball pitching    14.70 / 17.04 / 2.32                     14.70 / 19.01 / 1.32
    Baseball batting      6.10 /  4.19 / 5.57                      8.50 /  7.30 / 5.99
    CoverType             0.10 /  3.24 / 6.34                      0.00 /  2.98 / 5.17

3.5 Empirical Evaluation

This section details the results of two sets of experiments designed to test my methods:

1. First, I test the performance of my algorithms in two domains that require expensive distance computations, by running fully-functional implementations of the chapter's algorithms over a bioinformatics sequence data set and an image histogram data set.

2. Second, I test the reliability of my estimations of the return-set quality by repeating the algorithms ten times for various sample sizes over ten real data sets.

3.5.1 Testing Expensive Distance Functions

The first set of experiments has two goals. First, I wish to test whether my algorithms are able to return a high-quality answer set in an acceptably short time in a domain requiring an expensive distance function.


Second, I wish to check whether a representative, state-of-the-art algorithm for this purpose (Bay's nested-loop algorithm [8]) can also provide acceptable performance.

To accomplish this goal, I considered two problems sharing the following characteristics:

1. Detecting DB-outliers from the data sets is a meaningful task in the selected domain.

2. The domains require expensive distance functions.

The first task I consider is detecting outlier sequences in human chromosome 18. Each data point in this data set is a nucleotide sequence. Detecting DB-outliers in such a data set is an interesting task, because it is well established that comparisons of related protein and nucleotide sequences have facilitated many recent advances in understanding the information content and function of genetic sequences. The second data set I consider is a benchmark data set for content-based image retrieval. Each data point in this data set is a color histogram derived from an individual image. In this case, the task is to discover the images that are least like the other images in the data set.

I now describe the characteristics of both data sets.

Human chromosome 18 (Chr18). This data set is created by randomly selecting 4000 non-overlapping subsequences of length 2000 from human chromosome 18, downloaded from the NCBI FTP site. Edit distance is used to measure the distance between two subsequences. The edit distance between two subsequences is computed by the Needleman-Wunsch algorithm [37], which requires roughly $3 \times 2000^2$ arithmetic operations per distance calculation in this case.

UCID. UCID version 2 [47] contains 1338 uncompressed color images. I transform each image into a 576-dimensional color histogram and then normalize each dimension to the range [0, 1]. The HSV color space is used because it provides a breakdown of color into its most natural components: hue, saturation and value (intensity). I uniformly divide hue into 36 buckets, saturation into 4 buckets, and value into 4 buckets. As a result, 576 color buckets are used to represent an image. Given histograms $h_1$ and $h_2$, the distance between them is computed by the quadratic distance function $(h_1 - h_2)^T A (h_1 - h_2)$. The entry $a_{ij}$ in the color similarity matrix $A$ is determined by Equation 3 in Section 3.1 of the VisualSEEk paper [50].

The experiments were performed on a Linux workstation with an Intel Xeon 2.4 GHz processor and 1 GB of RAM. All of the algorithms were implemented in C++ and compiled with gcc version 4.0.2.
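For concreteness, a minimal sketch of the quadratic-form histogram distance used for the UCID data set is shown below; the similarity matrix `A` here is a made-up placeholder rather than the VisualSEEk color-similarity matrix referenced above.

```python
import numpy as np

def quadratic_form_distance(h1, h2, A):
    """Histogram distance (h1 - h2)^T A (h1 - h2).

    h1, h2 : 1-D histograms of equal length (e.g., 576 HSV buckets)
    A      : symmetric bucket-similarity matrix
    """
    diff = h1 - h2
    return float(diff @ A @ diff)

# toy example with 4 buckets and an identity similarity matrix (placeholder only)
rng = np.random.default_rng(0)
h1 = rng.random(4); h1 /= h1.sum()
h2 = rng.random(4); h2 /= h2.sum()
A = np.eye(4)
print(quadratic_form_distance(h1, h2, A))
```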


Since I was interested in testing scenarios in which the distance computations dominate all the other costs, I loaded the entire data set into memory for all of the algorithms that I tested. Wall-clock time is reported here. All experiments were run to return the top 30 5th-NN outliers.

I began the experiments by running Bay's nested-loop algorithm over both data sets. I recorded the time required, as well as the average number of distance computations per data point. The results are reported in Table 3-1. Next, I ran my implementations of the sampling algorithm and the EN and VarN algorithms over the same data, and then intersected the result set of my sampling algorithm with the result set of Bay's algorithm to get the observed $N$. For each data set, I ran 10 trials with the sample size $\alpha$ set to 10. The performance results are also shown in Table 3-1.

In addition, I plot the bound calculated by $E[N] - \sigma$, where $E[N]$ is the estimate of the expected value of $N$, and $\sigma$ (the standard deviation of $N$) is the square root of the estimate of the variance of $N$. Using the expectation and standard deviation to predict the "typical" observation is standard practice in statistics. The analytically-estimated standard error and the observed $N$ for both of these runs are plotted in Figure 3-3.

Discussion. Even though the two data sets have just a few thousand points, Bay's algorithm had relatively poor performance on both, taking nearly one day for the first one. Note that these results are actually fairly optimistic, due to experimental time constraints: both data sets tested were fairly small. Supposing that the subsequences in the Chr18 data set were of length 8000 instead of 2000, the cost of the edit distance function would increase by a factor of 16. Assuming that Bay's algorithm had the same pruning power for the longer sequences, I would expect more than two weeks' running time to complete the computation over the Chr18 data set using the full dimensionality.

In contrast, my algorithms did an excellent job in these two domains. They provided between one and two orders of magnitude of speedup, and still maintained relatively high result quality. The sampling algorithm returned more than half of the total true 5th-NN outliers in both cases (and nearly all of them in the UCID case) using only ten samples per data point.


Table 3-3. Description of the 10 real data sets and the average number of distance computations per data point (denoted by BA) required by Bay's algorithm. Cont./Feature means the number of continuous features over the number of total features.

    Data set            Cont./Feature   Size      BA
    Wisconsin cancer    30/31           569       165
    Balance scale       4/4             625       625
    Florida farm        188/188         811       140
    California farm     188/188         1,373     175
    FCAT Read           27/27           1,404     275
    FCAT Math           28/28           1,405     274
    California house    9/9             20,640    654
    Baseball pitching   22/22           36,245    454
    Baseball batting    17/17           85,979    745
    CoverType           10/55           581,012   4527

Furthermore, in seven of the ten trials over the Chr18 data set and in ten of the ten trials over the UCID data set, the actual number of outliers returned either exceeded or was almost equal to $E[N] - \sigma$. This indicates that (at least for these two data sets) the reported $E[N] - \sigma$ constitutes a safe lower bound on qualitative performance.

3.5.2 Reliability of the Estimation

This subsection describes an additional experimental evaluation of my algorithms, aimed at evaluating the reliability of my estimation algorithms. I wish to obtain experimental evidence that the results given by my estimation algorithms are reliable for various sample sizes and different sizes of data sets.

The test was designed as follows. I selected the 10 real data sets summarized in Table 3-3. They span a range of problems and have very different characteristics; due to space limits, I do not describe each in detail. The experimental setup was identical to what I described in the previous subsection. I processed the data by normalizing all continuous variables to the range [0, 1] and converting all categorical variables to an integer representation.


Hamming distance was used for categorical features and Euclidean distance was used for continuous features. For each data set, before running my algorithms I ran Bay's algorithm on the data set. The average number of distance computations per data point required by Bay's algorithm is reported in the last column of Table 3-3. Note that I happened to encounter the worst-case scenario ($O(n^2)$ time complexity) when using Bay's algorithm on the Balance scale data set from the UCI Machine Learning Repository; that is, the average number of distance computations per data point for this data set was the size of the data set. For each of the 10 real data sets, I systematically tested this chapter's algorithms with the sample size $\alpha$ set to 10, 60 and 110, respectively. For each experiment, I performed 10 trials. The average statistics of the 10 trials are reported in Table 3-2.

Discussion. I observed high reliability for my estimation algorithms, indicating that the constructed distance database $D'$ is a reasonable surrogate for the original distance database $D$. Specifically, for the 300 total runs that were performed in order to construct Table 3-2, the actual number of true 5th-NN outliers returned was larger than $E[N]$ in 199/300 cases, larger than $E[N] - \sigma$ in 236/300 cases, and larger than $E[N] - 2\sigma$ in 253/300 cases. The table depicts in detail the close correlation between the predicted $E[N]$ and the observed average $N$. Again, this shows the general utility of my analysis for predicting the accuracy of the algorithm.


3.6 Conclusions

In this chapter, I have considered the problem of how to efficiently detect DB-outliers when the distance function is expensive. A simple sampling algorithm is proposed, and the statistical characteristics of this sampling algorithm are formally analyzed. Based on the analysis, two estimation algorithms are proposed that require only the time needed to sort the sampled distances. As a result, this chapter provides a practical tool for exploring DB-outliers in expensive domains.


CHAPTER 4
GUESSING THE EXTREME VALUES IN A DATA SET: A BAYESIAN METHOD

4.1 Introduction

This chapter deals with a ubiquitous problem in data management: guessing the maximum/minimum value (or some other extreme statistic) over a data set. Stated simply, I wish to guess the $k$th largest $f(d)$ value over all $d \in D$ for a data set $D$. More formally, given an arbitrary function $f()$, my goal is to accurately guess the $k$th largest $f()$ value $f^{(k)}$ over all $d \in D$, such that $|\{f(d) : f(d) > f^{(k)} \wedge d \in D\}| = k-1$.

This particular problem arises in many applications, for example:

- As pointed out by Donjerkovic and Ramakrishnan [17], in top-k query processing, knowing the cutoff value beforehand allows the "top-k" portion of the query to be transformed into a relational selection. The resulting query can then be processed without modification to the database engine.

- In outlier detection for data mining, the state-of-the-art algorithm [8] prunes points from the outlier candidate set when it has been determined that there are too many points close to the candidate. If it were possible to accurately guess the distance to a point's $k$th nearest neighbor, this pruning could be done without actually finding those close-by points.

- In distance join processing [23, 49], the goal is to find the $k$ closest pairs over two different data sets. The fact that $k$ is supplied to the join (as opposed to a cutoff distance) makes the query more useful, but it greatly complicates the computation. If the cutoff distance were known beforehand, then the problem could be solved using any one of a large number of efficient algorithms [6, 35, 12].

- In spatial anomaly detection, given a spatial data set placed on an $n \times n$ grid, the goal is to find the rectangular regions within which subsets of the data set exhibit anomalous behavior. The naive algorithm will enumerate all of the $O(n^4)$ rectangles and compute an anomaly score for each rectangle. If I can classify the rectangles into groups, and estimate the $k$th largest score for each group, I can potentially prune an entire group without further computing the scores associated with each group member.

Many other applications exist, though space precludes listing them all here.

Under certain circumstances, guessing $f^{(k)}$ is trivial. If $f(d)$ simply returns an attribute of $d$, then the solution can be as easy as pre-computing and storing the largest $k$ attribute values from the data set. However, the problem can become arbitrarily difficult


depending on the nature of $f()$. In the general case, $f()$ may encode an unanticipated, arbitrary multi-attribute relational selection predicate; that is, $f(d)$ returns $-\infty$ if some selection predicate does not evaluate to true. In other cases, $f()$ may perform an arbitrary, non-linear numerical computation over the attributes of $d$ that is impossible to anticipate.

A Bayesian Approach

In this chapter, I propose a novel approach to solving this problem. Since I am trying to guess extreme values for an arbitrary and unanticipated function $f()$, I argue that it is impossible to solve the problem by using a statistical synopsis to model the data. Models for the data are often useful for describing what is "typical", but the extreme value queries I am interested in specifically refer to the outliers in the data. Thus, standard approaches are of limited utility. For example, consider a 1% sample of a database having 100 million records where I am interested in $f^{(10)}$, but I have no prior knowledge of $f()$ and so it is not possible to bias the sample toward larger $f()$ values. Since I have less than a 10% chance of sampling any of the records producing one of the 10 largest $f()$ values, how can I guess $f^{(10)}$ with high accuracy?

Thus, rather than trying to guess $f^{(k)}$ by modeling the data, I instead guess $f^{(k)}$ by watching and modeling the behavior of the queries I have seen. (I use the term query very loosely in this chapter; its exact meaning depends upon the application.)

Some query aggregate functions may result in $f()$ values that are typically very small, with a few tremendous outliers that greatly boost the value of $f^{(k)}$. Other query aggregate functions may result in $f()$ values that are tightly, normally distributed around the mean $f()$ value, meaning that $f^{(k)}$ is not too different from the typical $f()$ value. As queries are asked, my method watches and learns what "typical" queries tend to look like. Then, when a new query is asked, I look at the first few $f()$ values obtained and decide (in a statistically meaningful way) which type of "typical" query I am experiencing.


Of course, I may be wrong when I guess what type of query has been asked, and so there is a degree of uncertainty associated with my belief. This uncertainty is incorporated into the probabilistic guarantee on the accuracy of my guessed $f^{(k)}$. In this way, my method is an example of a so-called Bayesian statistical technique, in that I make use of a "prior" or guessed query distribution in order to associate a belief with the type of query that has been issued.

My Contributions

This chapter makes the following technical contributions:

- I propose a new method for guessing the answer to an extreme value query over an arbitrary function. The method is statistically rigorous and makes use of unique Bayesian statistical techniques. Such techniques have been mostly ignored in the data management literature to date.

- Along with the approximate answer, my method returns the distribution associated with the guess's error, and so it can be used to associate confidence bounds with the guess's accuracy.

- I devise a method to learn the prior query distribution by watching query results as they are produced. The learning algorithm requires that, for each query result, I compute only three aggregate values. In this way, the learning method is inexpensive, and can easily be incorporated into a DBMS or a specific application with little or no cost.

- I argue for the utility of my approach by detailing four separate applications of my method to specific problems that occur when dealing with large databases.


4.2 Problem Definition

In this section, I formalize the problem and describe the solution that I will study in this chapter.

4.2.1 Estimating the Extreme Value

My goal is to provide estimation algorithms that will support processing for extreme value queries of the form:

    SELECT g1(d1)
    FROM D AS d1
    WHERE g2(d1)
      AND k-1 = (SELECT COUNT(*)
                 FROM D AS d2
                 WHERE g2(d2)
                   AND g1(d1) < g1(d2))

Note that if the comparison "<" in the inner query is changed to ">", then the query is easily modified to ask for the $k$th smallest $g_1(d)$ value when the selection predicate defined by $g_2()$ is satisfied. For ease of exposition, in the remainder of the chapter I assume that the search is for a large value.

The fact that I have two separate functions $g_1()$ and $g_2()$ complicates things a bit, so I encode both individual functions within a single function $f()$:

$$f(d) = \begin{cases} g_1(d) & \text{if } g_2(d) \text{ is true} \\ -\infty & \text{otherwise} \end{cases}$$

I use the notation $f^{(k)}$ to denote the answer to the query. (For ease of exposition, I assume that each tuple maps to a distinct real value.)
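The encoding of $g_1()$ and $g_2()$ into a single function $f()$ is straightforward to express in code; the sketch below uses hypothetical predicate and aggregate functions purely for illustration.

```python
import math

def make_f(g1, g2):
    """Wrap an aggregate g1 and a selection predicate g2 into a single f()."""
    def f(d):
        return g1(d) if g2(d) else -math.inf
    return f

# hypothetical example: tuples are dicts; select rows with price > 10,
# and rank them by price * quantity
f = make_f(g1=lambda d: d["price"] * d["qty"],
           g2=lambda d: d["price"] > 10)

rows = [{"price": 12.0, "qty": 3}, {"price": 8.0, "qty": 9}, {"price": 20.0, "qty": 1}]
values = sorted((f(d) for d in rows), reverse=True)
print(values[0])   # the largest f() value, i.e. f^(1) over this toy "database"
```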


4.2.2 A Natural Estimator

Assuming that there are no index structures to help us locate tuples with specific $f()$ values, an obvious solution to the problem is to sequentially scan the database once and evaluate $f()$ over each tuple. Then I can return the $k$th largest $f()$ value encountered. However, this algorithm may be too slow if the database is large, or if $f^{(k)}$ must be evaluated repeatedly for different functions (as in the application to outlier detection that I will consider).

The fundamental idea in this chapter is that, in order to obtain a sub-linear-speed algorithm to compute $f^{(k)}$, one can use a simple random sample (without replacement) from the database. By examining the $f()$ values in this sample, it may be possible to estimate the $k$th largest/smallest value in the query result set. (Although I consider the case where a single sample is used for a single estimate, the technique developed in this chapter can easily be extended to deal with the online case, where the entire randomized database is scanned: at each instant during the scan, the set of tuples retrieved thus far is a sample of the database.)

The estimator that I study works as follows. Suppose that I have access to $n$ samples from a database of size $N$. Then a natural estimator for the $k$th largest value in the query is the $(k')$th largest value in the sample, where $k'/n \approx k/N$. Since $k'$ must be an integer to make use of this estimator, I use $k' = \lceil \frac{n}{N} k \rceil$. In the remainder of this chapter, I use $\hat{f}^{(k)}$ to denote the $(k')$th largest value in the sample.

For example, suppose that I wish to guess the largest value in a database of size 15. Applying $f()$ to each tuple, I obtain the set:

$$\{1, 2, -\infty, 4, 5, -\infty, 7, 8, 9, -\infty, 11, -\infty, -\infty, 14, 15\}$$

Thus, $f^{(1)} = 15$. Now, I take a five-item sample from this set, which happens to be:

$$\{11, -\infty, 4, -\infty, -\infty\}$$

Then my estimator $\hat{f}^{(1)}$ is 11, where $k' = \lceil \frac{5}{15} \cdot 1 \rceil = 1$.
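A minimal sketch of this natural estimator follows; the database here is just a list of hypothetical $f()$ values, and the sampling is uniform without replacement.

```python
import math
import random

def natural_estimator(sample, k, N):
    """Return the (k')-th largest value in the sample, with k' = ceil(n*k/N)."""
    n = len(sample)
    k_prime = math.ceil(n * k / N)
    return sorted(sample, reverse=True)[k_prime - 1]

# toy database of f() values (-inf marks tuples rejected by the predicate)
database = [1, 2, -math.inf, 4, 5, -math.inf, 7, 8, 9,
            -math.inf, 11, -math.inf, -math.inf, 14, 15]
sample = random.sample(database, 5)          # 5 tuples without replacement
print(sample, natural_estimator(sample, k=1, N=len(database)))
```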


Figure 4-1. The histograms of four different query result sets with different domains (scales). Panels (a) and (c) have long tails to the right; panels (b) and (d) have no tails to the right.

Characterizing the accuracy of this estimator is far from trivial. The fundamental question I ask in this chapter is: How does the estimator $\hat{f}^{(k)}$ relate to the true $f^{(k)}$? Given a rigorous characterization of this relationship, it is possible both to correct $\hat{f}^{(k)}$ to obtain an even better estimate, and to characterize the accuracy of the estimator.

Since the difference between $\hat{f}^{(k)}$ and $f^{(k)}$ is affected by the scale of the values under consideration, this chapter studies the behavior of the ratio $f^{(k)} / \hat{f}^{(k)}$ as a way to characterize the accuracy of the estimator. In particular, I am interested in characterizing the distribution of this ratio, because this distribution facilitates the use of $\hat{f}^{(k)}$ to produce confidence bounds on $f^{(k)}$. For example, if I know that there is a 90% chance that the ratio is between $l$ and $h$, then there is a 90% chance that $f^{(k)}$ is between $l \cdot \hat{f}^{(k)}$ and $h \cdot \hat{f}^{(k)}$.


4.3 Overview of My Approach

This section gives an overview of the approach that I use to characterize the ratio $f^{(k)} / \hat{f}^{(k)}$. First, I discuss the statistical property of the database most relevant to the characterization: the shape of the right tail of the distribution of $f()$ values. Then I give an overview of a unique Bayesian approach to dealing with the importance of the tail's shape.

4.3.1 Importance of the Query Shape

Unlike many other estimation problems, characterizing the estimator $\hat{f}^{(k)}$ is extremely challenging because, unlike classical estimation problems where simple statistical properties such as the distribution's variance are important, it is the actual shape of the query result set's distribution that is most closely related to the accuracy of the estimator. This is best illustrated by an example.

Figure 4-1 depicts a set of histograms showing the distributions of four synthetic query result sets. Each query result set contains 10,000 values. Now, imagine that I wish to use a sample of size 100 to compute $\hat{f}^{(1)}$, and I ask: How accurate is $\hat{f}^{(1)}$ as an estimate for $f^{(1)}$? To answer this question, I re-sample (without replacement) 100 times to produce 100 different $\hat{f}^{(1)}$ values, and compute the median of $f^{(1)} / \hat{f}^{(1)}$ over each data set. The median values recorded are 3.07, 1.06, 4.45 and 1.01 for (a), (b), (c), and (d), respectively.

The relative magnitudes of these four values can be explained by examining Figure 4-1. The distributions corresponding to queries (a) and (c) have long tails to the right, so one has only a small chance of sampling any values close to the tip of the right tail where $f^{(1)}$ is located. Therefore, I observed a large ratio between the true answer and the estimator. In contrast, the shapes of queries (b) and (d) have no tail to the right, so one would expect a sample to have an excellent chance of including values close to $f^{(1)}$. As a result, I observed a ratio close to one for these two queries.


4.3.2 Basics of the Bayesian Approach

Since the query shape is so important when evaluating the accuracy of $\hat{f}^{(k)}$, it must be incorporated into the process of characterizing the ratio $f^{(k)} / \hat{f}^{(k)}$. My basic approach is to assume that there exists a large number of possible query shapes: from "easy" shapes with no skew to very "hard" shapes with a heavy right skew. Each shape has a weight or probability associated with it that specifies the extent to which I think this is the current shape I am experiencing; representing such a belief with a probability is the hallmark of the so-called "Bayesian" statistical approach [33].

The initial set of weights that I start out with before any data have been encountered is known, in Bayesian terminology, as the prior distribution. While there are many ways to develop a reasonable prior distribution, I choose to learn the prior from the historical query workload. Then, as data from a particular query are encountered, the weights are updated in a statistically rigorous fashion to take the new data into account. In Bayesian terminology, the updated weights represent the posterior distribution. These updated weights are then used to produce confidence bounds. For example, if a database sample of a reasonable size is obtained that is consistent with a query having a heavy rightward skew, the updated weights will tend to favor query shapes with a corresponding rightward skew, and the confidence bounds for the accuracy of $\hat{f}^{(k)}$ that I report will be suitably wide.

The "Dangers" of the Bayesian Approach

At first glance, assuming the existence of a prior distribution may seem dangerous. Since I will learn my prior from the previously observed queries, I am assuming that a new query will never be totally different from all of the queries in the training workload. In the case where I see a "new" query, the shape corresponding to the new query will necessarily have a zero prior weight, since the query was totally unanticipated. If the new query has a tail that is far nastier than anything else I have ever seen, then I may be too aggressive with my confidence bounds; this is the danger inherent in the Bayesian approach.


In fact, related dangers are inherent in all estimation techniques that do not have access to all of the data, including classic methods that are widely used in the data management literature. For example, consider the classical, sampling-based estimator for a SUM SQL query [20, 22]: first a $1/\gamma$ sample of the database is taken, then the query is applied to the sample and the result is scaled up by a factor of $\gamma$. It is an often-ignored fact that in order to bound the accuracy of such an estimate in the classical fashion, the variance of the estimator is also estimated from the same sample. If the variance estimate is too low (which may be the case if there is one particularly high-valued record that did not appear in the sample), then any resulting confidence bounds are worthless. The implicit assumption underlying the classic method is that the database characteristic in question (the variance) can be estimated accurately from the sample. In comparison, the Bayesian approach makes an explicit assumption regarding the availability of a prior distribution. In either case, the possibility of an error exists. In fact, a statistician from the so-called "Bayesian" school would argue that it is better to make such assumptions explicit using a prior than to "hide behind" arguments such as the unbiasedness of a variance estimate. As I will show experimentally, my application of the Bayesian approach is very robust to errors in the prior, and turns out to be quite successful in practice.


4.3.3 Proposed Bayesian Inference Framework

Given this background, I now describe the three steps of my Bayesian inference framework:

1. The learning phase uses statistical methods to build a prior shape model composed of a number of candidate shape patterns. Each shape pattern represents a class of queries. A weight is assigned to each shape pattern, indicating how likely it is that a future query's shape matches that shape pattern. Both the weights and the shape patterns are learned offline, from the historical query workload, using an EM algorithm [10].

2. The characterization phase derives an error distribution for each learned shape pattern. Before I can use the learned prior shape model to predict the behavior of $\hat{f}^{(k)}$ on a real-life query, as a preparation for the next phase I need to derive the distribution of $f^{(k)} / \hat{f}^{(k)}$ for each learned shape pattern. This is done using Monte Carlo methods [45], after the shape model has been learned but before it is time to actually answer an extreme-value query.

3. The inference phase uses the results of the characterization phase and a sample from the new query's result set to produce the error distribution for the actual query under consideration. The characterization phase applies only to the learned shape patterns, and not to any real-life query. When it is time to actually answer a query online, the prior weights of the shape model are updated based upon the observed samples to produce the posterior weights. Using the posterior weights and the error distribution for each shape pattern, an error distribution for the ratio $f^{(k)} / \hat{f}^{(k)}$ is obtained. Since $\hat{f}^{(k)}$ can be computed from the sample, confidence bounds on $f^{(k)}$ can easily be derived from the resulting distribution.

The next three sections of this chapter describe each of the three phases in more detail. In these sections, I simplify the exposition by assuming that $f()$ never returns $-\infty$; that is, the size of the query result set is the database size. In the Experiments section, where the framework is actually applied to two specific problems, I will discuss how to remove this assumption when necessary.


Figure 4-2. The PDFs of four Gamma distributions with increasing skew to the right from (a) to (d).

4.4 The Learning Phase

Any Bayesian method requires a generative, probabilistic model for the data. The process should be both general (in the sense that it allows the production of any data set that might be observed) and specific (in the sense that it produces all important properties of the underlying data and is tailored for the specific problem at hand).

In my case, each individual "data point" produced by the generative process is a single query result set. Since (as described in Section 4.3.1) the shape of the query result set is so important in determining the quality of the estimator $\hat{f}^{(k)}$, the generative model pays special attention to how the shape is handled.

Informally, I assume the following generative process for producing each query result set:

- First, a biased die is rolled to determine which shape pattern the query result set will be generated by. I assume the existence of some set of $c$ different shape patterns, and the die roll selects one of them.

- Next, an arbitrary scale for the query is randomly selected. This scale defines the magnitude of the items in the query result set. In Figure 4-1, the scale determines how large the labels are on the X-axis.


- Finally, the shape and scale are used as inputs to instantiate a parametric model for the data. For reasons described subsequently, I make use of the Gamma distribution from statistics as my parametric model. This distribution is repeatedly sampled from to produce the query result set.

Given the intuitive process described above, the next step is to formalize it. Mathematically, this is done by defining a probability density function (PDF) for the process. This is a function that takes as input a query result set, and returns how probable it is that this query result set would be produced by the process. After defining the PDF, I will then consider how to "learn" the model; that is, I consider how to tailor the model to a specific query workload.

4.4.1 Choosing an Appropriate Parametric Model

It is first necessary to choose some parametric distribution to model the query shape. Given the discussion of Section 4.3.1, there is one overarching concern: my distribution must be able to model arbitrarily long tails to the right. Modeling all of the dips and bumps of a real-life distribution is not necessary, because I am concerned only with the relationship between the distribution's tail and the rest of its mass.

Given this consideration, the Gamma distribution family becomes a natural choice, since it can produce a shape with arbitrary right-leaning skew. Figure 4-2 shows the PDFs of four instances of the Gamma family, each with increasing skew and longer tails to the right. Figure 4-2(a) depicts a bell (normal) shape, which does not have any skew to the right and only a very short tail relative to the distribution's variance, whereas Figure 4-2(d) is highly skewed to the right with an exceedingly long tail.

I stress that though the Gamma distribution underlies my model, I do not assume that the $f()$ values in the database look anything like a Gamma distribution. The Gamma distribution is used only to model the relationship between the values in the far right tail of the data distribution and the values that are more likely to be sampled, those closer to the main body of the distribution. The Gamma does this well because of its ability to take on shapes having arbitrary skew. As I will show experimentally, using the Gamma distribution my method can handle data sets that could not possibly have been sampled from a Gamma distribution, including those with a left skew and those with multiple modes, including very small modes or "bumps" far out in the right tail.


4.4.2 Deriving the PDF

I now turn my attention to deriving the PDF associated with the resulting three-step generative process. Formally, the PDF of the Gamma distribution can be expressed in terms of the Gamma function (defined as $\Gamma(\alpha) = \int_0^\infty t^{\alpha-1} e^{-t}\, dt$, where $\alpha > 0$):

$$p(x \mid \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)}\, x^{\alpha-1} e^{-\beta x}, \quad x > 0 \qquad (4\text{-}1)$$

In Equation 4-1, the parameter $\alpha > 0$ is known as the shape parameter, since it influences the shape or skew of the distribution, while the parameter $\beta > 0$ is called the inverse scale parameter, since $1/\beta$ influences the domain (scale) of the distribution.

Let $y = \langle d_1, \ldots, d_N \rangle$ denote a query result set which has $N$ matching tuples. Assuming that the tuples are independently drawn from a Gamma distribution, using Equation 4-1 the likelihood of observing a query $y$ given $\alpha$ and $\beta$ is:

$$p(y \mid \alpha, \beta) = \prod_{i=1}^{N} \left\{ \frac{\beta^\alpha}{\Gamma(\alpha)}\, d_i^{\alpha-1} e^{-\beta d_i} \right\} = \frac{\beta^{N\alpha}}{\Gamma(\alpha)^N}\, M^{\alpha-1} e^{-\beta S} \qquad (4\text{-}2)$$

In Equation 4-2, $M = \prod_{i=1}^{N} d_i$ and $S = \sum_{i=1}^{N} d_i$.

Since I am interested in characterizing a ratio, I am uninterested in the scale parameter and do not want to bias my model toward any particular scale. Thus, I treat the inverse scale parameter $\beta$ as a random variable that is uniformly chosen from $(0, \beta_{max})$, where $\beta_{max}$ is a huge number chosen to be large enough that it permits a scale that is arbitrarily small.


Then, the likelihood of observing a query $y$ given the shape $\alpha$ and unknown $\beta$ is given by:

$$p(y \mid \alpha) = \int_0^{\beta_{max}} p(\beta)\, \frac{\beta^{N\alpha}}{\Gamma(\alpha)^N}\, M^{\alpha-1} e^{-\beta S}\, d\beta
\approx \frac{1}{\beta_{max}} \cdot \frac{M^{\alpha-1}}{\Gamma(\alpha)^N} \cdot \frac{\Gamma(N\alpha+1)}{S^{N\alpha+1}} \qquad (4\text{-}3)$$

Equation 4-3 is the result of taking the expectation of Equation 4-2 with respect to $\beta$. It shows that evaluating the likelihood of a given query result set requires exactly three aggregate values: $M$, $S$ and $N$, denoting the product and the sum of all the values in the query's result set and the size of the query, respectively. Consequently, these three numbers are all that I need to collect with respect to each query. Subsequently, $y$ will refer to this triplet $\langle M, S, N \rangle$.

Note that Equation 4-3 is valid for a given shape parameter. Since my model assumes that a shape parameter is chosen at random from a weighted set where the probability of choosing shape $j$ is $w_j$, the likelihood of observing an entire query result set $y$ is:

$$f(y \mid \Theta) = \sum_{j=1}^{c} w_j\, p(y \mid \alpha_j) \qquad (4\text{-}4)$$

In Equation 4-4, the $w_j$'s are non-negative weights satisfying the constraint $\sum_{j=1}^{c} w_j = 1$. The complete set of model parameters is $\Theta = \{\theta_1, \ldots, \theta_c\}$, where $\theta_j = \{w_j, \alpha_j\}$.

4.4.3 Learning the Parameters

$\Theta$ is unknown and must be learned from the historical workload. To learn $\Theta$, I follow the basic principle of Maximum Likelihood Estimation (MLE), whose goal is to find the parameter set most likely to have produced the observed data.

Given a set of independent, historical queries $Y = \{y_1, \ldots, y_r\}$, applying Equation 4-4, the likelihood of observing $Y$ is:

$$L(\Theta \mid Y) = \prod_{i=1}^{r} f(y_i \mid \Theta)$$
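Equation 4-3 is numerically delicate, because $M$ is a product of many values and $\Gamma(\cdot)$ overflows quickly, so an implementation would typically work in log space. The following sketch (my own illustration, not code from this thesis) evaluates $\log p(y \mid \alpha)$ from the aggregate triplet, using $\log M$ in place of $M$ and dropping the constant $1/\beta_{max}$, which is the same for every shape.

```python
import numpy as np
from scipy.special import gammaln

def log_p_y_given_alpha(logM, S, N, alpha):
    """log of Equation 4-3 (up to the constant -log(beta_max)).

    logM  : sum of log d_i over the query result set (log of the product M)
    S     : sum of d_i
    N     : number of values in the query result set
    alpha : Gamma shape parameter of one shape pattern
    """
    return ((alpha - 1.0) * logM
            - N * gammaln(alpha)
            + gammaln(N * alpha + 1.0)
            - (N * alpha + 1.0) * np.log(S))

# toy query result set and two candidate shapes
d = np.random.default_rng(1).gamma(shape=2.0, scale=3.0, size=1000)
logM, S, N = np.log(d).sum(), d.sum(), len(d)
for a in (0.5, 2.0):
    print(a, log_p_y_given_alpha(logM, S, N, a))
```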


Often, it is preferable to work with $\log(L(\Theta \mid Y))$ because the product given above becomes a summation. That is, I wish to find the $\Theta^*$ that maximizes the log-likelihood:

$$\Theta^* = \arg\max_{\Theta} \log(L(\Theta \mid Y))$$

In order to optimize this objective function, I employ the Expectation-Maximization (EM) framework [10] from statistics and machine learning to iteratively maximize the log-likelihood.

The EM Algorithm

EM is used to solve MLE problems made difficult by the fact that there are one or more "hidden" variables that cannot be observed in the data. EM is an iterative method, whose basic outline is described in Algorithm 3.

Algorithm 3: Basic EM algorithm
1: while the model continues to improve do
2:   Let $\Theta$ be the current "best guess" as to the optimal configuration of the model
3:   Let $\Theta^*$ be the next "best guess" as to the optimal configuration of the model
4:   E-Step: Compute $Q$, the expected value of $\Theta^*$ with respect to all possible values of the hidden variables. The probability of observing each possible set of hidden values is computed using $\Theta$.
5:   M-Step: Choose $\Theta^*$ so as to maximize the value of $Q$. $\Theta^*$ then becomes the new "best guess".
6: end while

In my problem, the hidden variables are the identities of the particular shapes that were used to produce each training query. Since the details of some derivations below are similar to the example in [10], most of the resembling derivations are omitted. As a result of the E-Step, I have:

$$Q(\Theta, \Theta^*) = \sum_{j=1}^{c} \sum_{i=1}^{r} \log(w_j^*)\, p(j \mid y_i, \Theta) + \sum_{j=1}^{c} \sum_{i=1}^{r} \log(p(y_i \mid \alpha_j^*))\, p(j \mid y_i, \Theta) \qquad (4\text{-}5)$$


In Equation 4-5, $p(j \mid y_i, \Theta)$ is the posterior probability of query $y_i$ coming from the $j$th shape pattern's distribution, and is given by:

$$p(j \mid y_i, \Theta) = \frac{w_j\, p(y_i \mid \alpha_j)}{\sum_{l=1}^{c} w_l\, p(y_i \mid \alpha_l)}$$

In the M-Step, I need to obtain the update equations for the weights $w_j$ and the shapes $\alpha_j$. To update the weights, I use a Lagrange multiplier to maximize $Q$ with respect to $w_j$. This gives the following update equation for $w_j$:

$$w_j = \frac{1}{r} \sum_{i=1}^{r} p(j \mid y_i, \Theta)$$

Next I maximize $Q$ with respect to each $\alpha_j$ by taking the derivative of $Q$ and setting the result to zero. The part of $Q$ relevant to $\alpha_j$ is:

$$Q_2 = \sum_{j=1}^{c} \sum_{i=1}^{r} \log(p(y_i \mid \alpha_j))\, p(j \mid y_i, \Theta)$$

Unfolding the log operation and taking the derivative with respect to $\alpha_j$, I have:

$$\frac{\partial Q_2}{\partial \alpha_j} = \sum_{i=1}^{r} \Big\{ \log M_i - N_i\, \psi(\alpha_j) - N_i \log S_i + N_i\, \psi(N_i \alpha_j + 1) \Big\}\, p(j \mid y_i, \Theta) \qquad (4\text{-}6)$$

In Equation 4-6, $\psi()$ is the Digamma function, defined as $\psi(z) = \Gamma'(z)/\Gamma(z)$. To update $\alpha_j$, I set Equation 4-6 to zero and solve it by the bisection method [9].

The EM algorithm repeatedly applies these update equations in an iterative fashion until the parameters begin to stabilize (this is typically measured by an iteration-to-iteration fractional change in $\Theta$ of less than 1%).
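The updates above translate directly into a short EM loop. The sketch below is an illustration under the assumptions of this section rather than the thesis code: it evaluates the log of Equation 4-3 for each query triplet, performs the E-step in log space, updates the weights in closed form, and updates each shape by bisection on Equation 4-6 over a bracket that is assumed to contain the root.

```python
import numpy as np
from scipy.special import digamma, gammaln
from scipy.optimize import bisect

def log_p(lM, S, N, a):
    """log of Equation 4-3 for one (logM, S, N) triplet, constant term dropped."""
    return (a - 1) * lM - N * gammaln(a) + gammaln(N * a + 1) - (N * a + 1) * np.log(S)

def em_fit(queries, alphas, n_iter=50):
    """Fit mixture weights and Gamma shape parameters to a workload of triplets."""
    alphas = np.array(alphas, dtype=float)
    c = len(alphas)
    w = np.full(c, 1.0 / c)
    for _ in range(n_iter):
        # E-step: posterior p(j | y_i, Theta) for every query and shape
        ll = np.array([[np.log(w[j]) + log_p(lM, S, N, alphas[j]) for j in range(c)]
                       for (lM, S, N) in queries])
        ll -= ll.max(axis=1, keepdims=True)          # stabilize before exponentiating
        post = np.exp(ll)
        post /= post.sum(axis=1, keepdims=True)
        # M-step: closed-form weight update, then bisection on Equation 4-6 per shape
        w = post.mean(axis=0)
        for j in range(c):
            def dQ(a, j=j):
                return sum(p * (lM - N * digamma(a) - N * np.log(S) + N * digamma(N * a + 1))
                           for (lM, S, N), p in zip(queries, post[:, j]))
            alphas[j] = bisect(dQ, 1e-6, 1e3)        # assumes the root lies in this bracket
    return w, alphas

# toy workload: 40 queries drawn from two underlying shapes
rng = np.random.default_rng(0)
workload = []
for a_true in rng.choice([0.7, 3.0], size=40):
    d = rng.gamma(a_true, 2.0, size=500)
    workload.append((np.log(d).sum(), d.sum(), len(d)))
print(em_fit(workload, alphas=[0.5, 2.0], n_iter=20))
```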


4.5 The Characterization Phase

The learning phase provides me with a set of weighted shape patterns that describe the historical workload. As a preparation for the inference phase to make use of these shapes, I need to derive the error distribution associated with each shape. In this section, I show how to determine the distribution of the ratio $f^{(k)} / \hat{f}^{(k)}$ for a query result set that I know has been generated by a shape parameter $\alpha$ and a scale parameter $1/\beta$. The extension to unknown parameters is considered in the next section.

Since a query is treated as a sample from a parametric distribution, my estimator $\hat{f}^{(k)}$ and the final answer $f^{(k)}$ can be viewed as the result of the following two-stage sampling process:

- Stage One. A query is produced by drawing a sample of size $N$ from the distribution Gamma$(x \mid \alpha, \beta)$. The $k$th largest value in this sample is $f^{(k)}$.

- Stage Two. In order to estimate $f^{(k)}$, a subsample of size $n$ is drawn without replacement from the sample obtained in stage one. The $(k')$th largest value in this subsample is my estimator $\hat{f}^{(k)}$.

Given this process, it is clear that $f^{(k)}$ and $\hat{f}^{(k)}$ are correlated. As a result, it is very hard to analytically obtain the distribution of the ratio between them. Therefore, I resort to Monte Carlo methods to obtain an approximate distribution of their ratio, expressed in the form of a cumulative distribution function (CDF).

Algorithm 4: Naive Monte Carlo Sampling
Input: $N$, $k$, $n$, $\alpha$, $\beta$, $num$
Output: the Monte Carlo sample array $MC$
1: Let $MC = \emptyset$
2: for $i = 1$ to $num$ do
3:   Stage 1: draw an i.i.d. sample of size $N$ from the Gamma$(x \mid \alpha, \beta)$ distribution, then find $f^{(k)}$ from it
4:   Stage 2: draw a subsample of size $n$ from the sample obtained in stage 1, then find $\hat{f}^{(k)}$ from it
5:   $MC = MC \cup \{f^{(k)} / \hat{f}^{(k)}\}$
6: end for
7: Return $MC$
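Algorithm 4 is easy to express directly; the sketch below is a straightforward, illustrative rendering of it, and its $O(num \cdot N)$ cost per run is exactly the problem the TKD method of the following subsections avoids.

```python
import numpy as np

def naive_monte_carlo(N, k, n, alpha, beta, num, seed=0):
    """Algorithm 4: sample 'num' realizations of the ratio f^(k) / fhat^(k)."""
    rng = np.random.default_rng(seed)
    ratios = np.empty(num)
    for i in range(num):
        query = rng.gamma(shape=alpha, scale=1.0 / beta, size=N)   # stage one
        f_k = np.sort(query)[-k]                                   # k-th largest
        sub = rng.choice(query, size=n, replace=False)             # stage two
        k_prime = int(np.ceil(n * k / N))
        fhat_k = np.sort(sub)[-k_prime]
        ratios[i] = f_k / fhat_k
    return np.sort(ratios)            # sorted, so the empirical CDF is immediate

print(naive_monte_carlo(N=10000, k=5, n=100, alpha=1.5, beta=1.0, num=20)[:5])
```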


4.5.1 Naive Monte Carlo Sampling

In order to obtain the distribution of a statistic, the Monte Carlo approach obtains a large number of independent samples of the statistic directly from the underlying generative model. The samples can then be organized into a sorted list so that the approximate CDF for the target distribution can be calculated by simply counting the fraction of samples less than the CDF's input variable.

It is not hard to imagine how a naive Monte Carlo algorithm for obtaining my particular error distribution would work. The algorithm would simply repeat the above two-stage process $num$ times (where $num$ is the number of Monte Carlo samples to produce), and return the array of $num$ ratios produced. The basic Monte Carlo approach is given as Algorithm 4. The input parameters are the database size $N$, the rank of the item to be estimated $k$, the sample size $n$, the two Gamma distribution parameters $\alpha$ and $\beta$, and the number of Monte Carlo samples to obtain $num$, respectively.

Though simple, the naive algorithm is slow. The cost is $O(N)$ per Monte Carlo iteration, where $N$ is the database size. Since I am sampling for an extreme value precisely in order to avoid scanning the entire data set (which is itself an $O(N)$ operation), making use of an $O(num \cdot N)$ Monte Carlo algorithm is unacceptable.

4.5.2 Practical Monte Carlo Sampling

Fortunately, I can do much better by simulating the two stages of the sampling process to produce a result that, statistically speaking, is indistinguishable from an actual execution of the naive method.

To reduce the cost of obtaining one Monte Carlo sample, I need to efficiently sample both $f^{(k)}$ and $\hat{f}^{(k)}$. It turns out to be easily possible to sample the order statistic $f^{(k)}$ directly from its CDF (the details of how to do this are deferred to Section 4.5.4). Thus, the problem of sampling for the ratio $f^{(k)} / \hat{f}^{(k)}$ reduces to the problem of sampling for $\hat{f}^{(k)}$ given a value of $f^{(k)}$.


Figure 4-3. The four cases, panels (a) through (d), of the TKD sampling process.

4.5.3 The TKD Sampling Method

Algorithm 5: TKD Sampling
Input: $N$, $k$, $n$, a single sample from $f^{(k)}$
Output: a sample of $\hat{f}^{(k)}$ corresponding to the input $f^{(k)}$
1: $k' = \lceil \frac{n}{N} k \rceil$
2: Let $b \sim$ Bernoulli$(x \mid \frac{n}{N})$  /* $b$ is the result of a Bernoulli trial */
3: if $b == 1$  /* $f^{(k)}$ is included in the subsample */ then
4:   Let $m \sim$ Hypergeometric$(x \mid N-1, k-1, n-1)$
5:   if $m + 1 < k'$ then ...

(The remaining lines of Algorithm 5, through line 21, handle the four cases that are illustrated in Figure 4-3 and walked through below.)

In the example of Figure 4-3, the query result set contains $N = 10$ values, a subsample of $n = 4$ values is drawn, and the rank of interest is $k = 4$ (so $k' = 2$); the $k$th largest value overall is $f^{(k)} = 7$. The black balls are the top-$(k-1)$ largest values in the query result set, and the shaded ball with a value label represents the $k$th largest value overall.

The TKD method begins by first determining whether or not the $k$th largest value appears in the subsample; this creates cases (1) and (2) in Figure 4-3, and corresponds to line (2) of Algorithm 5. In either case, it becomes necessary to determine the value of the $(k')$th largest value in the subsample. In my example, in the case that the 7 appears in the subsample, I need to determine how many of the black (large-valued) balls also appear in the subsample. This is done via a call to Hypergeometric$(x \mid 9, 3, 3)$ on line (4) of Algorithm 5, which generates a Hypergeometric random variable with the specified distribution.

In Figure 4-3(a), this random trial determines that none of the black balls are retrieved by the subsample. Thus, the TKD method concludes that only one of the top-$k$ largest values is included in the subsample, which is less than $k' = 2$. Since I know that all the other items (the three white balls) on this side must be smaller than $f^{(k)} = 7$, in order to sample the second largest value from the subsample, the TKD method returns the largest of three samples from the truncated Gamma$(x \mid \alpha, \beta)$ distribution, where $x \in (0, f^{(k)} = 7)$. A "truncated" distribution is simply a probability distribution with one of its tails chopped off.

However, this random trial could have determined that enough black balls were contained in the subsample that the $(k')$th largest value in the subsample is from the top $k$ overall (line (8) of Algorithm 5). In Figure 4-3(b), the Hypergeometric$(x \mid 9, 3, 3)$ trial determines that one of the black balls is retrieved by the subsample. Thus, the subsample contains two of the top-$k$ values. The TKD method determines the black ball's value by drawing one additional sample from the truncated Gamma$(x \mid \alpha, \beta)$ distribution, where $x \in (f^{(k)} = 7, \infty)$. From this set of two top-$k$ items, the second largest value is returned to the caller.

Two analogous situations occur if the Bernoulli trial has determined that the $k$th largest value overall does not appear in the subsample (lines (12)-(21) of Algorithm 5).


In Figure 4-3(c), few enough black balls are included in the subsample that the $(k')$th largest will come from the Gamma distribution truncated above at $f^{(k)} = 7$; in Figure 4-3(d), enough black balls are included in the subsample that the $(k')$th largest will come from one of the black balls, which are sampled from a Gamma distribution truncated below at $f^{(k)} = 7$.

4.5.4 Additional Technical Details

In this section, I address a few remaining technical issues regarding the Monte Carlo sampling process.

4.5.4.1 Sampling $f^{(k)}$

The process described above assumes that I have an efficient method to sample a value for $f^{(k)}$ without having to generate the entire data set. To sample $f^{(k)}$ efficiently, I can first obtain its CDF, which is defined by the following lemma:

Lemma 1. Given a query size $N$, a rank $k$, a shape parameter $\alpha$, and a scale parameter $1/\beta$, the CDF $F_{f^{(k)}}$ for $f^{(k)}$ is:

$$F_{f^{(k)}}(x) = \sum_{i=0}^{k-1} \binom{N}{i}\, [1 - F_{Gamma}(x)]^i\, [F_{Gamma}(x)]^{N-i} \qquad (4\text{-}7)$$

where $F_{Gamma}(x)$ denotes the CDF of the Gamma$(x \mid \alpha, \beta)$ distribution.

Proof: Let $Y$ be a random variable counting the number of values in the query result set that are greater than or equal to $x$. Thus, $Y$ counts the size of the set $\{f(d_i) \ge x\}$ for $i \le N$. Since whether or not each $f(d_i) \ge x$ is an independent Bernoulli trial, I see that $Y$ follows the Binomial$(N, 1 - F_{Gamma}(x))$ distribution. Then:

$$F_{f^{(k)}}(x) = \Pr[Y \le k-1] = \sum_{i=0}^{k-1} \binom{N}{i}\, [1 - F_{Gamma}(x)]^i\, [F_{Gamma}(x)]^{N-i}$$

which completes the proof.

Given the CDF of $f^{(k)}$, it is easy to sample $f^{(k)}$ using the following two-step process:

1. Sample $u$ from the Uniform$(0, 1)$ distribution.
2. $f^{(k)} = F_{f^{(k)}}^{-1}(u)$.

Step two can be implemented by solving for $x$ in the equation $u = F_{f^{(k)}}(x)$. Since a CDF must be monotonically increasing, I can use binary search to obtain the solution. Note that the computation of $F_{f^{(k)}}(x)$ requires $O(k)$ time, which is fast since $k \ll N$.

4.5.4.2 Sampling the truncated Gamma distribution

The TKD method also requires samples from the Gamma$(x \mid \alpha, \beta)$ distribution truncated at $f^{(k)}$. Such samples can be obtained with the same inverse-CDF approach, using the CDF of the truncated distribution; for the distribution truncated below at $f^{(k)}$, this CDF is:

$$F_{trunc}(x) = \frac{F_{Gamma}(x) - F_{Gamma}(f^{(k)})}{1 - F_{Gamma}(f^{(k)})}, \quad x > f^{(k)} \qquad (4\text{-}9)$$

The CDF of the distribution truncated above at $f^{(k)}$ is obtained analogously by rescaling $F_{Gamma}(x)$ over $(0, f^{(k)})$.

4.5.4.3 A note regarding the scale parameter

The reader may notice that I have omitted any mention as to how the inverse scale parameter $\beta$ is obtained or dealt with by the Monte Carlo process. This may seem like a significant oversight, since $\beta$ is not supplied. However, the "scale" is simply a multiplicative factor. Thus, since I am interested in the ratio of two values sampled from the same Gamma distribution, the scale is irrelevant. As a result, any of the Monte Carlo methods from this section can be implemented by simply choosing an arbitrary scale parameter larger than zero and using it consistently.
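Both sampling steps above reduce to inverting a CDF. The sketch below is my own illustration (using SciPy and an arbitrary unit scale, as permitted by Section 4.5.4.3): it samples $f^{(k)}$ by binary search on Equation 4-7 and samples a Gamma value truncated below a threshold by rescaling the uniform draw before applying the inverse Gamma CDF.

```python
import numpy as np
from scipy.stats import gamma, binom

def sample_f_k(N, k, alpha, rng, tol=1e-12):
    """Inverse-CDF sample of the k-th largest of N Gamma(alpha, scale=1) draws (Eq. 4-7)."""
    u = rng.random()
    def F(x):      # Pr[f^(k) <= x] = Pr[Binomial(N, 1 - F_Gamma(x)) <= k - 1]
        return binom.cdf(k - 1, N, 1.0 - gamma.cdf(x, alpha))
    lo, hi = 0.0, 1.0
    while F(hi) < u:                     # grow the bracket until it contains the solution
        hi *= 2.0
    while hi - lo > tol * max(hi, 1.0):  # binary search on the monotone CDF
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if F(mid) < u else (lo, mid)
    return 0.5 * (lo + hi)

def sample_gamma_truncated_below(threshold, alpha, rng):
    """One draw from Gamma(alpha, scale=1) conditioned on exceeding 'threshold'."""
    lo = gamma.cdf(threshold, alpha)
    return gamma.ppf(lo + rng.random() * (1.0 - lo), alpha)

rng = np.random.default_rng(2)
print(sample_f_k(N=10000, k=5, alpha=2.0, rng=rng))
print(sample_gamma_truncated_below(7.0, alpha=2.0, rng=rng))
```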


Algorithm 6: Approximating the Error Distribution
Input: $p_j$ for $j \in \{1, \ldots, c\}$; $MC_j$ for $j \in \{1, \ldots, c\}$; $num$; $x$
Output: $F_{ratio}(x)$
1: Sort $MC = \bigcup_{j=1}^{c} MC_j$ in ascending order
2: For each entry in $MC$, if $MC[i]$ came from shape pattern $j$, set $MC[i].prob = p_j / num$
3: $i = 0$; $tot = 0$
4: while the current sample $MC[i]$ is less than $x$ do
5:   $tot = tot + MC[i].prob$
6:   $i = i + 1$
7: end while
8: Return $tot$

4.6 The Inference Phase

At this point, I have most of the tools necessary to complete the framework. Assume that I have completed the learning and characterization phases and have used a set of samples from a database to calculate $\hat{f}^{(k)}$. I wish to characterize the distribution of $f^{(k)} / \hat{f}^{(k)}$.

Recall that the prior shape model consists of $c$ weighted shape patterns. Given the same samples used to compute $\hat{f}^{(k)}$, I can "update" those weights of the shape patterns to incorporate the observed data using Bayes' rule [33]. This is done by computing the posterior probability that I am sampling from the $j$th shape pattern. In classic Bayesian fashion, the updated weight $p_j$ is:

$$p_j = \frac{w_j\, p(q \mid \alpha_j)}{\sum_{l=1}^{c} w_l\, p(q \mid \alpha_l)} \qquad (4\text{-}10)$$

Recall that $q$ denotes the aggregate triplet $\langle M, S, n \rangle$ corresponding to my database sample, $w_j$ is the prior weight for shape pattern $j$, and $p(q \mid \alpha_j)$ is the PDF of shape pattern $j$.

Once the posterior weights are known, the final step in computing the distribution of the ratio $f^{(k)} / \hat{f}^{(k)}$ is to combine the posterior weights with the Monte Carlo error distribution for each individual shape in order to compute a final, posterior distribution for the ratio. This is formalized in Algorithm 6, which details how to use this information to compute the total probability that $f^{(k)} / \hat{f}^{(k)} < x$. The algorithm takes as input the full set

of posterior weights $p_1, p_2, \ldots, p_c$, all of the Monte Carlo samples $MC_1, MC_2, \ldots, MC_c$ (one set of samples for each shape pattern), the number of Monte Carlo samples for each shape pattern $num$, and finally the CDF input value $x$.

In this algorithm, all of the Monte Carlo samples are first arranged in sorted order from smallest ratio value to largest. The samples are then weighted according to the posterior weights. To calculate the probability that the ratio $f^{(k)} / \hat{f}^{(k)}$ is less than an input $x$, I scan from the low end to the high end of the array $MC$ and stop once I find that the current Monte Carlo sample is greater than the input $x$. When I complete the scan, the sum of the probabilities processed closely approximates the probability that $f^{(k)} / \hat{f}^{(k)} < x$.
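Putting the inference phase together, the sketch below (an illustration only, reusing the log form of Equation 4-3) computes the posterior weights of Equation 4-10 from a sample triplet, attaches them to per-shape Monte Carlo ratio samples as in Algorithm 6, and reads confidence bounds for $f^{(k)}$ off the weighted, sorted ratios.

```python
import numpy as np
from scipy.special import gammaln

def log_p(lM, S, N, a):     # log of Equation 4-3 with the constant term dropped
    return (a - 1) * lM - N * gammaln(a) + gammaln(N * a + 1) - (N * a + 1) * np.log(S)

def confidence_bounds(sample, k, N_db, prior_w, alphas, mc_ratios, p=0.95):
    """Level-p bounds on f^(k) from a database sample and per-shape MC ratio arrays."""
    n = len(sample)
    k_prime = int(np.ceil(n * k / N_db))
    fhat = np.sort(sample)[-k_prime]
    # posterior shape weights (Equation 4-10), computed in log space
    lM, S = np.log(sample).sum(), sample.sum()
    logw = np.log(prior_w) + np.array([log_p(lM, S, n, a) for a in alphas])
    post = np.exp(logw - logw.max()); post /= post.sum()
    # Algorithm 6: pool the ratio samples, each weighted by p_j / num_j
    ratios = np.concatenate(mc_ratios)
    weights = np.concatenate([np.full(len(r), pj / len(r))
                              for pj, r in zip(post, mc_ratios)])
    order = np.argsort(ratios)
    ratios, cdf = ratios[order], np.cumsum(weights[order])
    lo_ratio = ratios[np.searchsorted(cdf, (1 - p) / 2)]
    hi_ratio = ratios[np.searchsorted(cdf, (1 + p) / 2)]
    return lo_ratio * fhat, hi_ratio * fhat
```

For simplicity this sketch takes the equal-tail interval of the ratio distribution; the application in Section 4.7.2 instead chooses the bounds that minimize $high - low$ subject to the same coverage constraint.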

Figure 4-4. The histograms of the six synthetic query distributions considered in the first set of experiments; they are ordered from the easiest case to the hardest case.

4.7 Experiments

This section details five sets of experiments. I first perform a set of experiments designed to test the accuracy and applicability of the Bayesian approach to extreme value estimation. Then I test the application of the proposed Bayesian framework to four different, specific problems in the data management domain: approximate MAX (or top-k) aggregates, distance-based outlier detection, spatial anomaly detection, and multimedia distance join.

4.7.1 Applicability of the Bayesian Approach

Goals: As discussed previously in the chapter (particularly in Sections 4.3.2 and 4.4.1), there are some natural concerns regarding the application of the Bayesian approach to the problem of extreme value estimation. The experiments in this subsection are designed to directly test whether or not these concerns are legitimate. Specifically, I wish to answer two questions:


1. First, since the shape model is derived from the Gamma distribution family, does my method actually work for query shapes that look nothing like a Gamma distribution?

2. Second, since the prior shape model is learned from the historical query workload, how will the Bayesian framework function when the future query distribution has changed from the historical query distribution?

Table 4-1. Coverage rates for 95% confidence bounds using sample sizes 50 and 150, respectively.

    Prior weights     Sample size = 50               Sample size = 150
                      k=1    k=5    k=10   k=20      k=1    k=5    k=10   k=20
    Uniform           0.84   0.86   0.81   0.85      0.91   0.92   0.91   0.93
    Geometric(0.1)    0.84   0.86   0.82   0.85      0.90   0.92   0.91   0.92
    Geometric(0.2)    0.78   0.80   0.77   0.80      0.87   0.88   0.88   0.90
    Geometric(0.4)    0.86   0.89   0.85   0.88      0.92   0.93   0.93   0.94
    Geometric(0.8)    0.88   0.89   0.86   0.89      0.93   0.94   0.94   0.95

Table 4-2. Coverage rates for 95% confidence bounds using sample size 300.

    Prior weights     k=1    k=5    k=10   k=20
    Uniform           0.95   0.95   0.95   0.96
    Geometric(0.1)    0.95   0.95   0.94   0.96
    Geometric(0.2)    0.93   0.93   0.93   0.95
    Geometric(0.4)    0.95   0.96   0.96   0.97
    Geometric(0.8)    0.96   0.96   0.96   0.97

Experimental Setup: When evaluating these questions, the relevant metric is confidence bound coverage accuracy. That is, I wish to ensure that no matter what the data look like and what the prior is, if the user specifies $p\%$ confidence bounds, the bounds I return are in fact $p\%$ confidence bounds. To test the method's robustness, I use six strangely-shaped, non-Gamma synthetic distributions to generate various query result sets. The generative distributions are illustrated in Figure 4-4. They are constructed to be multimodal, to exhibit left and right skew of differing degrees, and to be discontinuous. Clearly, they bear little resemblance to the Gamma distribution. Using the discussion in Section 4.3.1, I order the shape patterns using the expected value of $f^{(k)} / \hat{f}^{(k)}$. The shape in Figure 4-4(a) has the smallest expected value for this ratio, while the one in Figure 4-4(f) has the largest.


Given these ordered shapes, I run a series of five tests. For each test, I begin by training my model using 500 randomly-generated queries, each having 1000 tuples sampled from one of the distributions illustrated in Figure 4-4. In the first test, the training queries are sampled uniformly. In the second, they are sampled according to a Geometric distribution with parameter 0.1, so that the first distribution in the ordered shape set is most likely to be sampled and the last one is least likely. In the third, fourth, and fifth, the Geometric parameters are 0.2, 0.4, and 0.8, respectively.

For each learned model, I then run 500 test queries, each of size 50,000. The test queries are always sampled uniformly from the generative model. In this way, I test the case where the query generation is very different from the training distribution. For example, with a Geometric parameter of 0.8, I am very unlikely to see more than a few training queries from the last few distributions in Figure 4-4, though I will test quite often using those distributions (due to the uniform test generation). For each of the 500 test queries, I obtain 95% confidence bounds for the actual query answer using sample sizes of 50, 150, and 300, and also using 4 different $k$ values. For each sample size, I compute the fraction of confidence intervals (the coverage rate) that did, in fact, contain the actual query answer. These accuracies are given in Tables 4-1 and 4-2.

Discussion: In general, the results show the high reliability of the Bayesian framework and clearly illustrate the robustness of the shape model. Using 300 samples, Table 4-2 shows almost perfect (95%) coverage in every case. There seems to be no dependence on the Gamma distribution (since the test data were clearly not Gamma-distributed) and very little (if any) dependence on the accuracy of the prior weights, since with sample size 300 the coverage is nearly perfect no matter what the training distribution is (recall that the test distribution is always uniform).

One other interesting finding is that there is a dependence on sample size, since Table 4-1 shows coverage accuracy that is somewhat less than the expected 95% for extremely small samples. This is analogous to problems that occur when I have too few samples to obtain a good variance estimate using a classical estimation regime (see Section 4.3.2).


However, note that once a sample of size 300 has been obtained, the coverage is nearly perfect. I stress that 300 is a relatively tiny sample from a real-life database that may have billions of data points.

These findings are not surprising. Robustness to errors in the prior given an adequate sample size is a widely recognized merit of the Bayesian approach. As more and more samples are taken, the posterior distribution that I use to generate the bounds becomes less dependent on the prior distribution. It is generally acknowledged that in most circumstances, after a few hundred samples the prior carries little (if any) weight, and the sample is used almost exclusively to compute the result. The results verify this supposition.

4.7.2 Approximate MAX (or Top-k) Aggregates

The most straightforward application of my Bayesian framework is using it to guess the largest value in a set. For example, one could easily use my Bayesian inference framework to facilitate an online answer to a top-$k$ selection query with an arbitrary selection predicate and aggregate function, along with accuracy guarantees. The records in the data set would be scanned in a randomized fashion, and at all times the top $k$ sampled records would be presented to the user. In order to give the user some idea of the quality of the answer set, the samples could be used to obtain confidence bounds on $f^{(k)}$. By comparing the $k$th best record returned to the user with the bounds for $f^{(k)}$, the user may be given an idea of the quality of the answer set that has been obtained thus far.

Application Details

The Bayesian inference framework developed in this chapter could easily be used to provide the bounds on $f^{(k)}$. First, the previously observed query result sets, each represented by an aggregate triplet, would be used to train the prior shape model. During the training, all data points not accepted by a given training query are ignored when computing the query's aggregate triplet (that is, I ignore all $f() = -\infty$ values).


computingthequery'saggregatetriplet.(thatis,Iignore all f ()= 1 values).Also,all valuesnotequalto 1 areshiftedsothattheirminimumvalue(theorigin)iszero. Whenitistimetoevaluateaquery,thesamplesfromthenewqu eryresultsetare used.Again,Iignoreall 1 values.Givenactualsampleanddatabasesizes n 0 and N 0 ,forthepurposeofmyBayesianframeworkIuse n = jf f ( d ): f ( d ) 6 = 1^ d 2 DBsample gj and N = N 0 n 0 n .Thatis,notonlydoIignore 1 values,butI\scale down"thesizeofthedatabasetoaccountfortheexpectednum berof 1 valuesthat cannotcounttowards f ( k ) .Tobeconsistentwiththetraining,Ialsoshiftallofthe sampledvaluessothatthesmallestvalueisalsoattheorigi n. Next, df ( k ) iscomputedfromthesample.Asdescribedinthechapter,the samples arealsousedtocomputetheposteriorshapemodel,whichint urnisusedtocompute theCDF F ratio fortheratio f ( k ) d f ( k ) .Givenacondencelevel p ,bounds low and high are thenchosensoastominimize high low subjecttotheconstraintthat F ratio ( high ) F ratio ( low )= p .Finally, low d f ( k ) and high d f ( k ) aregivenaslevelp condencebounds fortheanswer.ExperimentalEvaluationGoals: WhenevaluatingtheutilityofapplyingtheBayesianframew orktothisproblem, therearetwoimportantquestionstoanswer: 1. First,arethecondenceboundsproducedreliableforreallifedatadistributions, arbitraryselectionpredicatesandaggregatefunctions,a ndvariousvaluesof k ? 2. Second,hownarrowaretheboundswhenareasonablylargesam plehasbeen obtained?Thatis,aretheynarrowenoughtoactuallybeusef ul? ExperimentalSetup: Totestthesequestions,Iselectedsevenrealdatasets,sum marized inTable 4-3 6 .Theyarepubliclyavailableandspanarangeofproblemdoma inswith 6 TherstvedatasetsarefromtheUCIMachineLearningRepos itory.Thelasttwo datasetsarefromhttp://usa.ipums.org. 81


Table 4-3. Data description. These data sets consist of both categorical and continuous features. Continuous/Feature denotes the number of continuous features over the number of total features.

Dataset        Continuous/Feature   Size
Letter         16/17                20,000
CA House       7/9                  20,640
El Nino        7/7                  93,935
Cover Type     10/55                581,012
KDD Cup 99     34/41                4,898,430
Person 90      12/13                5,000,000
Household 90   7/11                 5,523,522

different characteristics. For each data set, a "query" is generated as follows. First, a tuple t is randomly selected. Then a selectivity s is randomly chosen from the range 5% to 20%, and the s × (data set size) nearest (Euclidean) neighbors of t are chosen as the actual query result set. The query's aggregate function is defined as the weighted sum of three arbitrarily-chosen continuous attributes per query, where the weights are uniformly chosen from zero to one. The query answer is defined to be the k-th largest value of the aggregate function, as applied to tuples in the query result set.

For each data set, after training on 500 randomly selected queries using 10 shapes, 500 test queries are generated, and a 10% sample of the data set is used to provide 95% confidence bounds for the final answer to each query (I also experimented with using only 50 training queries; the results were nearly identical and so are omitted for brevity). Table 4-4 reports the observed coverage rates of the reported confidence bounds for various values of k.

In the second set of tests, I use k values of 1 and 10, respectively, and vary the sample size from 5% to 20% of each data set. I then compute the median relative confidence bound width at 95% (the relative confidence bound width is half the width of the bounds divided by the query answer). Figure 4-5 gives the results.

Discussion: In general, the confidence bounds provided show high reliability, which would seem to confirm the correctness of my framework and the appropriateness of a


[Figure 4-5 consists of two panels, (a) k=1 and (b) k=10, each plotting the median of relative errors against sample size (5% to 20% of DB size) for the Cover Type, Letter, El Nino, CA House, Household90, Kddcup99, and Person90 data sets.]

Figure 4-5. Median relative confidence bound width of 500 test queries as a function of sample size.

Gamma prior distribution for this problem, even for arbitrary, real-life data sets. There is some variation in coverage accuracy; for three of the seven data sets, the bounds were too conservative (showing coverage that was significantly higher than 95%), and for the KDDCup99 and Person90 data sets, the coverage accuracy was a bit lower than 95%. However, the confidence bounds overall were remarkably accurate given the difficulty of the problem. I feel that this is very strong evidence that the bounds generated will be safe and accurate given an arbitrary, real-life data set and query distribution.

The results shown in Figure 4-5 also demonstrate that depending on the data set in question, the bound width at 95% accuracy can be quite narrow, even for a 10% sampling fraction. For five of the seven data sets, a 10% sample provides for 95% bounds on the maximum value in the data set whose range is within 15% of the actual query answer. Not surprisingly, this range generally shrinks as k grows, since a larger value of k means that I am trying to guess values that lie further from the extreme right tail of the distribution.
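The bounds reported in this subsection are produced mechanically once the posterior CDF F_ratio of the ratio f(k)/f̂(k) is in hand. The following Python sketch is my own illustration of that last step, not code from the dissertation: the callable F_ratio stands in for the posterior shape-model computation, and the grid search over candidate (low, high) pairs is simply one convenient way to approximate the narrowest interval carrying probability p.

import numpy as np

def confidence_bounds(sample_values, k, p, F_ratio, grid=None):
    # sample_values: aggregate values seen in the sample (with any -inf values removed)
    # F_ratio: assumed callable CDF of f(k)/f_hat(k) produced by the posterior shape model
    if grid is None:
        grid = np.linspace(0.01, 50.0, 20000)
    # f_hat(k): the k-th largest value observed in the sample
    f_hat_k = np.sort(np.asarray(sample_values, dtype=float))[-k]
    cdf = np.array([F_ratio(x) for x in grid])
    low, high = grid[0], grid[-1]
    for i in range(len(grid)):
        # smallest candidate "high" whose CDF mass above grid[i] reaches p
        j = np.searchsorted(cdf, cdf[i] + p)
        if j < len(grid) and grid[j] - grid[i] < high - low:
            low, high = grid[i], grid[j]
    # level-p confidence bounds on the true f(k)
    return low * f_hat_k, high * f_hat_k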


Table 4-4. Coverage rates for 95% confidence bounds with various k values using a 10% sample.

Dataset        k=1     k=5     k=10    k=20
Letter         1.00    0.98    0.97    0.94
CA House       0.97    0.99    0.97    0.96
El Nino        1.00    1.00    0.99    0.99
Cover Type     1.00    1.00    0.99    0.99
KDDCup99       0.92    0.91    0.92    0.93
Person90       0.92    0.94    0.90    0.91
Household90    0.98    0.97    0.97    0.97

4.7.3 Distance-Based Outlier Detection

More generally, my framework is applicable to any problem where the goal is to find a few records in a set that are "close to" or "far away from" all of the other records in the set. In particular, this applies to distance-based outlier detection [27, 43, 8]. Given an arbitrary distance function dist (which may or may not be a metric distance), the goal is to pick the t (t <

Algorithm 7 Bay and Schwabacher's Nested Loop Algorithm
1:  Initialize cutoff
2:  Let Outliers = {}
3:  for each point p in D do
4:    Let countClose = 0
5:    for each point q in D, in a random order do
6:      if dist(p, q) <= cutoff then
        ...
15:       Remove the point from Outliers having the smallest k-th-NN distance
16:       Set cutoff to be the smallest k-th-NN distance for any point in Outliers
17:     end if
18:   end for

neighbors. These queries are used to train a shape model. This model can then be used to speed Bay's algorithm as follows:

1. First, I can carefully choose the order in which line (3) of Algorithm 7 considers the database points. If I can guess which points are outliers and consider them first, I will be sure that the cutoff value will be very large early on. This will increase the effectiveness of the algorithm's pruning.

   To choose such an advantageous ordering, I observe that the set of distances computed for training from each database point p to each of the training points is actually a sample of all of p's neighbor distances. This sample set can then be reused to compute f̂(k) for p, as well as the distribution of the ratio f(k)/f̂(k) for p.[7] By computing the median value x for f(k)/f̂(k) (that is, the point x where F_ratio(x) = 0.5), I can use x·f̂(k) as a guess for p's k-th-NN distance, and order the data set accordingly.

2. Second, I speed the algorithm by altering the loop of lines (5) to (12) by adding a second, probabilistic pruning condition after line (11). This additional pruning step is performed periodically (for example, after iterations 10, 20, 40, 80, 160, and

[7] Note that in this application, because I am looking for nearest neighbors, f(k) is used to denote the k-th smallest value in the set.


Table 4-5. Result of making use of the Bayesian framework within Bay and Schwabacher's algorithm. For each data set, the table shows the speedup resulting from the application of the framework to the algorithm (Speedup), the size of the result set overlap between the "exact" version of the algorithm and the approximate one (Overlap), and the average relative error of the approximate version (Error).

Dataset        Speedup   Overlap   Error
Letter         2.61      25        0.02
CA House       3.22      28        0.00
El Nino        5.33      27        0.02
Cover Type     4.29      21        0.14
KDDCup99       3.92      24        0.02
Person90       5.28      24        0.07
Household90    4.29      28        0.00

so on). Given all of the sampled neighbor distances that have been computed in line (6) for the particular point p, I use the Bayesian framework and the trained model to guess the distance from the point to its k-th-NN. If this guess is less than cutoff, then I can prune the point and still be reasonably sure that I have not pruned an outlier. In order to be "reasonably sure", I should choose an upper bound on the k-th-NN distance that holds with high probability. In my implementation, I choose x so that f(k)/f̂(k) is less than x with 99% probability (that is, I choose x so that F_ratio(x) = 0.99), and my guess as to the upper bound of the k-th-NN distance is x·f̂(k).

Experimental Evaluation

Goals: I wish to experimentally test whether my Bayesian framework can be used to effectively speed Bay's algorithm. There are two primary questions that I wish to answer:

1. First, what sort of speedup compared to Bay's original algorithm can my framework help to provide?

2. Second, what exactly is the accuracy lost due to the probabilistic pruning that my modified algorithm provides?

Experimental Setup: I began my experiments by running Bay's nested loop algorithm over the seven data sets from the previous subsection to obtain the top 30 5th-NN outliers. I record the total number of distance computations required as well as the actual answer set returned by Bay's algorithm. Next, I run Bay's algorithm augmented with my Bayesian framework as described above (making use of 80 training points and 10


[Figure 4-6 plots the 5th-NN distance against outlier rank (1 through 30) for the Cover Type data set, with one curve for the original nested loops algorithm (NL) and one for the modified version (Modified NL).]

Figure 4-6. Comparison of 5th-NN outlier distances for Bay and Schwabacher's original nested loops algorithm, and the modified version of Bay and Schwabacher's algorithm.

shapes) and compute the speedup of the augmented algorithm with respect to the number of distance computations required. I also compute the overlap of the result set with Bay's result set, as well as the average relative "error" of the augmented algorithm. For example, consider Figure 4-6, which shows the 5th-NN distances for the top 30 outliers discovered by both versions of Bay's algorithm for the Cover Type data set (this is the data set where the augmented version of the algorithm returned the fewest "true" outliers). Let b_i be the 5th-NN distance for the i-th outlier returned by Bay's algorithm, and let ab_i be the corresponding value for the augmented version of Bay's algorithm. Then the average relative error is computed as

\[ \sum_{i=1}^{30} |ab_i - b_i| \Big/ \sum_{i=1}^{30} b_i . \]

For each data set, the speedup, result set overlap size, and the average relative error are given in Table 4-5.

Discussion: These experimental results show that by simply plugging my Bayesian framework into Bay's outlier detection algorithm, one can generally obtain a factor of four improvement in running time. Furthermore, this improvement is obtained virtually for "free" and with almost no loss in result quality. In every experiment, the modified version


ofBay'salgorithmreturnedmorethan20outofthe30outlier sthatwerereturnedby Bay'soriginalalgorithm,buteventhatstatistictendstou nder-statethequalityofthe result.Theactualdierenceinqualitybetweenthetworesu ltsetsisalwaysquitesmall; inveofthesevencasestheaveragerelativeerrorislessth an2%.Intheworstcase(the CoverTypedataset),theerroris14%,butcloseexamination ofFigure 4-6 showsthat nearlyallofthiserrorisduetothelossofoutliers10throu gh15,whenitisunclearhow muchofaproblemthelossofthosefewoutliersmightactuall ybe.Thisisduetothe factthatoutliersone,twoandthreeareclearlydominant{t heyaremuchfurtheraway fromtheotherdatabasepointsthananyotherreportedpoint s,andonemightreasonably questionwhetheranyotherreportedpointsareactuallyout liers. 4.7.4SpatialAnomalyDetection Spatialanomalydetectionhasbeenstudiedwithmuchintens ityrecentlyinstatistics [ 30 31 ],machinelearning[ 38 ],anddatamining[ 39 40 ].Mostoftherecentworkon spatialanomalydetectionfollowsasimpleparadigm.Dataa rearrangedintoarectangular spatialgrid,andthegoalistondtheparticularsub-recta ngle R inthegrid(where R is characterizedbyalowerleftandanupperrightcoordinates )overwhichsomestatistic ismaximized.measurestheextenttowhichthedatain R diersfromtherestofthe datainthedatabase.Ingeneral,isatleastlooselybasedu ponsomesortoflikelihood ratiotest(LRT)[ 53 ],whichisastatisticalhypothesistestthatiscommonlyus edto comparedierencesamongdatadistributions.Forexample, aspatialdatabasemay containinformationaboutpeoplelivinginacertainarea,a longwiththeexactlocation whereeachpersonlives.Then,mightcomparethenumberofp eoplewithin R where somebooleanpredicateevaluatesto true tothenumberofpeopleoutsideof R where thesamebooleanpredicateevaluatesto true .Ifthesefractionsdiersignicantly,then willbelargeand R isdeterminedtobeanomalous.Byndingtherectangle(orto p fewrectangles)thatmaximizethevalueofcomparedtoallo therrectangles,themost anomalousregionisdiscovered. 88
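The search paradigm just described reduces to scoring every axis-aligned rectangle in the grid and keeping the highest-scoring ones. The following Python sketch is my own illustration of that brute-force enumeration, not code from the dissertation; the function score(x1, y1, x2, y2) is a stand-in for whatever statistic (for example, an LRT-based one) is computed per rectangle.

import heapq
import itertools

def top_k_rectangles(n, score, k=5):
    # Brute-force search over all O(n^4) rectangles of an n x n grid.
    # A rectangle is identified by its lower-left cell (x1, y1) and
    # upper-right cell (x2, y2).
    best = []  # min-heap of (statistic, rectangle)
    for x1, y1 in itertools.product(range(n), range(n)):
        for x2, y2 in itertools.product(range(x1, n), range(y1, n)):
            s = score(x1, y1, x2, y2)
            if len(best) < k:
                heapq.heappush(best, (s, (x1, y1, x2, y2)))
            elif s > best[0][0]:
                heapq.heapreplace(best, (s, (x1, y1, x2, y2)))
    return sorted(best, reverse=True)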


Thedicultyofsolvingthisproblemexactlyisthatforan n n grid,thereare O ( n 4 ) dierentrectanglestoconsiderinordertodiscoverthetop k mostanomalousones|there are O ( n 2 )possiblelowerleftpointsinthegrid,and O ( n 2 )possibleupperrightpointsin thegrid,leadingto O ( n 4 )possiblerectanglesinall.If n isverylarge,orifcomputing isveryexpensive,thenitcantakealongtimetosearchthrou ghallofthosecandidatesto ndthetop k 4.7.4.1LRTBasics SincespatialanomalydetectionisfrequentlybaseduponaL RTstatistic|indeed, bothoftheexperimentsIdevisetotesttheapplicationofmy Bayesianframeworkto spatialanomalydetectionareconcernedwithLRTs{itisisi nformativetobeginthis particularapplicationwithanoverviewoftheLRTanditsas sociatedstatistic. Themostcommonstatisticaltestforanomalousdataistheso called likelihoodratio test .TheLRTisahypothesistestthatfacilitatesthecompariso noftwomodels:one parametricstatisticalmodelassociatedwiththehypothes isthatthereisananomaly, andanotherparametricstatisticalmodelassociatedwitht hehypothesisthatthereis noanomaly|theso-called\nullmodel".FortheLRTtobevali d,bothparametric modelsshouldtaketheformofidenticallikelihoodfunctio ns L ( j X ),where X isthe databasedataand isasetofparameterscomingfromtheparameterspace.The (restricted)parameterspaceallowedunderthenullmodel( thissetisdenotedas 0 ) isthecomplementoftheparameterspaceallowedinthecaseo fanomalousdata(the \anomalous"subspaceisdenotedas 0 ).TocheckforananomalyusingtheLRT, twohypotheses H 0 : 2 0 and H a : 2 0 arecomparedagainsteachotherby computingthestatistic: ( X )= sup 0 L ( j X ) sup L ( j X ) (4{11) Thisstatisticiscomputedbyrstcomputingamaximumlikel ihoodestimate(MLE) underbothparameterspaces 0 and,andthencomputingtheratioofthelikelihoods 89


obtainedviathetwoMLEs.In1938,Wilks[ 53 ]showedthattheasymptoticdistribution of= 2log (whichIsubsequentlyrefertoasthe LRTstatistic )ischi-squaredwith ( p q )degreesoffreedomunderthenullhypothesisthat 2 0 p isthenumberof dimensions(orfreeparameters)in,and q isthenumberofdimensionsof 0 .Thus, tocheckforananomalyatcondencelevel ,onecheckswhether( X ) c ,where c isanon-negativenumbercomputedbyndinghowfaroutinthe tailofachi-square distributiononehastogotond(1 )%ofthemass. ForaverysimpleexampleofthesortofcasewheretheLRTisap plicable,imagine thatIwishtotestwhetherthediseaseratewithinaspatiala reaisdierentthanthe diseaserateoutsideofthearea.wouldcontaintwoprobabi litiesorratesofinfection: p in and p out .Ontheotherhand, 0 wouldcontainonlyasinglerate p ,where p = p in = p out Thedatadescribingaspatialregionwouldcontainfournumb ers:(1) n in |thenumberof peopleinsidethearea;(2) n out |thenumberofpeopleoutsideofthearea;(3) k in |the numberofdiseasedpeopleinsidethearea;and(4) k out |thenumberofdiseasedpeople outsideofthearea.Thelikelihoodfunction L ()wouldthenbeabinomialfunction: L ( j X ) / p k in in (1 p in ) n in k in p k out out (1 p out ) n out k out Thedegreesoffreedomofthenulldistributionisoneinthis example,sincethereis onemoreparameterinthanin 0 4.7.4.2ApplyingtheBayesianframework GivensomeLRTstatistic,thenaivemethodtoansweratop k queryoveraspatial datasetplacedonagridwouldbetoapplytoeachofthe O ( n 4 )rectanglesinthe gridandthenreturnthesethavingthe k largestvalues.Luckily,itturnsoutthatthe Bayesianframeworkdescribedinthischaptercanbeusedtog reatlyspeedthissearch,asI willdiscussinthissubsection. ThebasicmethodologyforapplyingmyBayesianframeworkis quitesimple.Given an n n spatialgrid,Iclassifythe O ( n 4 )rectanglesinto O ( n 2 )classesaccordingtotheir 90


15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 Figure4-7.Classifyingtherectanglesaccordingtotheirl ower-leftcorner.Rectangles (5,10),(5,11),(5,14)and(5,15)belongtothesameclassin dexedbythe lower-leftcorner5. lowerleftcorners.Forexample,considerFigure 4-7 ,whichillustratesa3 3grid.Then therectangles(5,10),(5,11),(5,14)and(5,15)belongtot hesameclassindexedbythe lower-leftcorner5.IfItreateachclassasa\query",where thequeryresultsetistheLRT statisticscorrespondingtoalltherectanglesintheclass indexedbythelower-leftcorner, thenIcanapplymyBayesianframeworktondtheanomalyasfo llows: 1. First,Irandomlypickasetoflower-leftpointsfromthegri dasmysetoftraining queries.Foreachofthetrainingqueries,Iobtaintheentir esetofLRTstatistics,by computingovereachofthe O ( n 2 )possibleupper-rightpointsassociatedwitheach lower-leftpointinthetrainingset. 2. Second,Ilearnashapemodelfromthequeriesobtainedinste p1. 3. Third,foreverypossiblelower-leftpoint,Irandomlysamp leasmallnumberof upper-rightpoints.isthencomputedforeachofthesample drectanglesassociated witheachlower-leftpoint.Thesesamplesarethenusedalon gwithmyBayesian frameworktoobtainaposteriordistributiononthebestva lueassociatedwith eachlower-leftpoint,one-at-a-time.Thenthevariouslow er-leftpointsaresorted baseduponthemediansoftheirposteriordistributions,fr omlargesttosmallest. 4. Finally,thelower-leftpointsareconsideredinorderbyan ested-loopsalgorithm. Foreachlower-left,Iscanallofthevalidupper-rightsina randomorder.Foreach (lower-left,upper-right)combination,avalueiscalcul ated.Thesampledvalues 91


associated with a given lower-left are used by my Bayesian framework to guess whether or not the current lower-left can ever produce a λ that will exceed the current best. If the probability of this happening is very low, then I move on to the next lower-left point immediately, without considering the remaining associated upper-rights. Thus, in the best case (that is, when the Bayesian framework allows for perfect pruning), it would be possible to reduce the overall complexity from O(n^4) to O(n^2).

Pseudo-code for the actual nested-loops pruning algorithm seeking the top-1 rectangle is given as Algorithm 8.

Algorithm 8 Nested Loops Spatial Anomaly Detection
1:  Let cutoff = -∞
2:  Let bestSoFar = (-1, -1)
3:  for each lower-left point ll, in a descending order sorted on their estimated medians do
4:    Let counter = 0
5:    Let allLambdas = {}
6:    for each potential upper-right point ur (in random order) do
7:      Compute λ_cur = λ(ll, ur)
8:      counter++
9:      allLambdas = allLambdas ∪ {λ_cur}
10:     if λ_cur > cutoff then
11:       Let cutoff = λ_cur
12:       Let bestSoFar = (ll, ur)
13:     end if
14:     if counter is a power of 2 then
15:       Use allLambdas to estimate an upper bound (UB) for the current ll's λ_max such that Pr[λ_max <

2. Second, what is the accuracy lost due to the probability pruning of the candidate rectangles?

Experimental Setup. In this case, λ encodes a simple binomial likelihood model. This model is chosen because computing the LRT statistic is fast, so I can repeatedly run and re-run tests over reasonably large grids.

My setup is as follows. For each cell c in the grid, I randomly generate a population size n_c. Then, the number of "successes" k_c in the cell is generated by sampling from a Bin(n_c, p) random variable for success rate p.

Given a test rectangle R on an n × n grid, denote the "success" rate within R by p and the rate outside of R by q. Then I set up a LRT that makes a decision on the following two competing hypotheses:

H_0: p = q
H_a: p ≠ q

From the prior material in this subsection, recall that in general for a LRT, λ = -2 log Λ = -2 log MLE_0 + 2 log MLE_c, where MLE_0 is the resulting likelihood of running an MLE over the parameter space associated with H_0, and MLE_c is the resulting likelihood of running an MLE over the full parameter space. In this particular case, MLE_0 requires a single division to estimate p:

\[ \hat{p} = \sum_{c \in R \cup \bar{R}} k_c \Big/ \sum_{c \in R \cup \bar{R}} n_c . \]

Then, MLE_0 itself is computed by plugging the resulting p̂ back into the original binomial likelihood model. The MLE_c routine is just two invocations of MLE_0: one over the data within the test region R, the other over the data outside of R; the two estimated rates are plugged into two binomial models, and the resulting likelihoods are multiplied. Since the complete parameter space only has one more dimension than the null parameter space, the null distribution is chi-squared with one degree of freedom.
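Because the per-rectangle computation is so light here, the whole test fits in a few lines. The following Python sketch is my own illustration of the statistic just described (the dissertation gives no code); it assumes the observed rates are strictly between 0 and 1 and drops the binomial-coefficient terms, which cancel when the likelihood ratio is formed.

import numpy as np
from scipy.stats import chi2

def binom_loglik(k, n, p):
    # Binomial log-likelihood without the combinatorial term.
    k = np.asarray(k, dtype=float)
    n = np.asarray(n, dtype=float)
    return float(np.sum(k * np.log(p) + (n - k) * np.log1p(-p)))

def lrt_statistic(k_in, n_in, k_out, n_out):
    # lambda = -2 log MLE_0 + 2 log MLE_c for the binomial grid model above.
    # k_in/n_in: successes and populations of cells inside R; k_out/n_out: outside R.
    k_in, n_in = np.asarray(k_in, float), np.asarray(n_in, float)
    k_out, n_out = np.asarray(k_out, float), np.asarray(n_out, float)
    p0 = (k_in.sum() + k_out.sum()) / (n_in.sum() + n_out.sum())  # single division (MLE_0)
    p_in, p_out = k_in.sum() / n_in.sum(), k_out.sum() / n_out.sum()
    log_mle0 = binom_loglik(k_in, n_in, p0) + binom_loglik(k_out, n_out, p0)
    log_mlec = binom_loglik(k_in, n_in, p_in) + binom_loglik(k_out, n_out, p_out)
    return -2.0 * log_mle0 + 2.0 * log_mlec

# Under H_0 the statistic is asymptotically chi-squared with one degree of freedom,
# so an uncorrected 5% test would compare against:
critical_value = chi2.ppf(0.95, df=1)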


Table 4-6. Null case: total false alarms and average speedup over 50 random trials. The speedup is computed by dividing the total number of rectangles by the actually examined rectangles in both the training and the pruning stage.

n × n           16×16   32×32   64×64   128×128
False alarm #   0       0       0       0
Speedup         2.8     9.5     25.3    92.1

For my tests, I ran 50 trials on a grid of size n × n where n = 16, 32, 64, 128. My initial cutoff value for pruning was chosen to correspond to an overall false positive rate of 5% (computed using the chi-squared distribution along with a Bonferroni correction [18]).

I ran two sets of experiments. The first is designed to test, when the null hypothesis is true, what the false alarm rate and speedup are. For each cell, the population n_c is always sampled from a Normal(μ = 1×10^4, σ = 1×10^3) distribution. The "success" rate for each cell is always set to 0.001.

I also simulated the case where the alternative hypothesis holds. The setup was the same as the standard null case, except in the alternative case, I generated a random "hot spot" of size 3 × 4, within which the "success" rate is 0.003. I consider the hit rate, the containment rate and the average relative error as my accuracy measures. A trial produces a hit if the discovered rectangle is the hot spot. If a trial finally produces a result region that contains the hot spot, I increment the counter of containment cases. In a containment case, the relative error is computed by dividing the area difference between the returned rectangle and the hot spot by the hot spot area. Results are given in Tables 4-6 and 4-7. In these two tables, the "hit rate" describes the fraction of experiments for which exactly the true hot spot was returned by the algorithm. The "containment rate" is the fraction of the time that the hot spot was fully contained in the rectangle returned by the algorithm. The relative error is computed by dividing the difference of the areas between the reported rectangle and the generated hot spot by the hot spot's area. Speedup is computed by dividing the total number of rectangles by the actually examined rectangles in both the training and the pruning stage.
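The synthetic grids used in these tests are straightforward to reproduce. Below is a minimal NumPy sketch of the data generation just described; it is my own reconstruction under stated assumptions (the function name make_grid, the clipping of populations at 1, and the uniform random placement of the hot spot are not specified in the text).

import numpy as np

def make_grid(n, base_rate=0.001, hot_rate=0.003, hot_size=(3, 4),
              with_hot_spot=False, rng=None):
    # Cell populations n_c ~ Normal(1e4, 1e3); successes k_c ~ Bin(n_c, rate).
    rng = rng or np.random.default_rng()
    pop = np.maximum(rng.normal(1e4, 1e3, size=(n, n)), 1).astype(int)
    rate = np.full((n, n), base_rate)
    hot = None
    if with_hot_spot:
        h, w = hot_size
        r, c = rng.integers(0, n - h + 1), rng.integers(0, n - w + 1)
        rate[r:r + h, c:c + w] = hot_rate
        hot = (r, c, r + h - 1, c + w - 1)
    successes = rng.binomial(pop, rate)
    return pop, successes, hot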


Table 4-7. Alternative case: average hit rate, containment rate, relative error, and computing rate for binomial data.

n × n              16×16   32×32   64×64   128×128
Hit rate           1       1       0.88    0.91
Containment rate   1       1       0.98    1
Relative error     0       0       0.026   0.03
Speedup            1.99    6.09    19.38   50

Discussion. In general, the results show excellent scalability compared to the naive nested loops algorithm. In both the null case and the alternative case, the speedup increases faster than linearly with respect to the grid size n.

The algorithm clearly had the ability to detect hot spots in the data. In every case where there was a hot spot, the algorithm was able to at least partially detect it. Even in the few cases where the reported region is not exactly the hot spot, the relative error is small considering the hot spot is only of size 3 × 4.

4.7.4.4 Experimental evaluation: real data

Goals. The goal of the experiments detailed in this subsection is to see if the promising experimental results from the prior subsection carry over to a real-life database and a real-life application of spatial anomaly detection.

Experimental Setup. The problem I consider is spatial anomaly detection on a real data set extracted from the ARM database (http://www.armprogram.com/). The data set and statistical model for anomaly detection that I use are only briefly described here; for more detail, I refer the reader to my paper which is specifically concerned with anomaly detection over the ARM database.

The ARM database describes antibiotic resistance patterns in hospitals as a function of time. It consists of antimicrobial resistance data for nearly 400 hospitals collected over a 15-year period. Over those 15 years, the trend in resistance rates is generally upward, due to selective pressures on the bacteria which cause them to evolve resistance to common antimicrobials. However, the key question that I would like to answer is: Is the trend uniformly upward over time, or are there significant spatial irregularities in


thetrend?Ifresistancetrendsshownogeographicanity,t henitmightindicatethat resistancepatternsunfoldinisolation,andsolocalprogr amsatindividualhospitalsaimed atcarefulantimicrobialstewardshipwillbeuseful.Ifres istancetrendshavestrongspatial correlations,thenitmightindicatethatsuccessfulantim icrobialstewardshipmustbea widereort,sincewide-areatrendsexhibitingstronginru ence. TheARMdataarecollectedasfollows.Foreachyear,foreach (bacteria/antimicrobial drug)combination,eachparticipatinghospitalsubmitsth enumberofisolates(bacteria instances)tested,aswellasthenumberoftimesthatthebac teriawasfoundtobe susceptibletothegivendrug.InordertoapplyLRTtomyARMm iningproblem,Iplace theARMdataonan n n grid.Mygoalistondtherectangularregionswithinwhich subsetsofthedatasetexhibitanomalousantimicrobialres istanttrendsovertime.Todo this,IdeneaLRTstatisticasfollows: 1. Foreachcellonthegrid,theremaybeoneormorehospitals.L et x denotethe numberofsusceptiblecasesand n denotethenumberofisolates.Foraxedve yearspan,let i indexthehospitalsand j indextheyears,thenthedataisreported as( x ij ;n ij )forevery i and j 2. Foreachhospital,Itreateachyear'sdataasasinglebinomi altrial.IfIuse p ij to denotethesusceptibilityrateofhospital i 0 sdatainyear j ,thenIhavethelikelihood function L : L ( j X )= Y i Y j n ij x ij p x ij ij (1 p ij ) n ij x ij Iassumethereisalinearrelationshipforeachhospital'ss usceptibilityrates.Thatis, p ij = kj + c i ,where c i isaninterceptparametertailoredforeachhospital i k isthe slopeofthislinearmodel,and j istheyear. Iaminterestedintestingifanysubsetofhospitalsexhibit sadierenttrendthan therestofthehospitals.Therefore,Iassumeallthehospit alswithinatestregion R havethesametrend k R ,andallthehospitalwithoutregion R havethesametrend k R .Then,Ihavethefollowingcompetinghypotheses: H 0 : k R = k R H a : k R 6 = k R IfIreplace p ij with kj + c i andcomputethelikelihoodinlogspace,Ihavethenull log-likelihood: 96


\[
\log L(\Theta_0 \mid X) = \sum_i \sum_j \left[ \log\binom{n_{ij}}{x_{ij}} + x_{ij}\log(kj + c_i) + (n_{ij} - x_{ij})\log(1 - kj - c_i) \right]
\tag{4-12}
\]

In Equation 4-12, k = k_R = k_R̄ and Θ_0 = {k, c_i's}. Similarly, under the unrestricted parameter space I allow different k_R and k_R̄. Then I have:

\[
\log L(\Theta \mid X) = \sum_{i \in R} \sum_j \left[ \log\binom{n_{ij}}{x_{ij}} + x_{ij}\log(k_R j + c_i) + (n_{ij} - x_{ij})\log(1 - k_R j - c_i) \right]
+ \sum_{i \in \bar{R}} \sum_j \left[ \log\binom{n_{ij}}{x_{ij}} + x_{ij}\log(k_{\bar{R}} j + c_i) + (n_{ij} - x_{ij})\log(1 - k_{\bar{R}} j - c_i) \right]
\tag{4-13}
\]

In Equation 4-13, k_R and k_R̄ do not constrain each other, and Θ = {k_R, k_R̄, c_i's}.

Just as in the synthetic case, I compute λ as λ = -2 log Λ = -2 log MLE_0 + 2 log MLE_c. In this particular case:

3. MLE_0 implements a numerical optimization routine for maximizing the value of L in Equation 4-12 with respect to the null parameter space. Numerical methods are required due to the lack of an analytic solution.

4. MLE_c maximizes the value of L in Equation 4-13 with respect to the complete parameter space by simply invoking MLE_0 on data within R and data outside of R independently.

This is a particularly interesting application of the Bayesian framework, because MLE_0 requires on the order of a second to compute. This makes the naive, brute-force search on even a small grid prohibitively expensive.

To actually run the experiments, all of the hospitals were organized into a 16 × 16 spatial grid. I first sort the hospitals using the longitude of their physical locations. They are then grouped into n equi-depth buckets. The placement of hospitals into the n buckets determines which column of the grid each hospital belongs to. Similarly, I sort all hospitals on latitude, partition them n ways, and use the partitioning to determine the rows. I ran


boththenaive,nestedloopsalgorithmandanalgorithmthat makesuseofmyBayesian approach.ResultsandDiscussion. Thenaivemethodtook42.69hourstocompleteonthis particulardataset,andtheBayesianframeworkwasabletor educethistimeto10.83 hours,foraspeedupof4.0.Thisisquiteremarkablegiventh everysmallsizeofthegrid thatItested.Furthermore,therewasnolossofaccuracyint hisparticularcase:the Bayesianframeworkdiscoveredexactlythemostanomalousr egioninthedatabase. 4.7.5MultimediaDistanceJoin ThenalapplicationthatIconsiderisamultimediadistanc ejoin.Givenasetof \example"multimediaobjects P andaseparatedatabase Q ,thegoalistondthe t closest( p;q )pairs( p 2 P and q 2 Q ).Inareal-lifeapplication,theexamplesetmaybe anumberofphotographsofacrimesuspect,andthedatabasem aycontainaverylarge numberofphotographsextractedfromasetofvideorecordin gs.Byperformingthejoin,I canobtainthe t mostlikelysightingsofthesuspectforsubsequentexperte xamination. ApplicationDetails Ifthefunction dist issimplytheEuclideandistancebetweentwofeaturevector s,then manyalgorithmsareapplicableforthisproblem[ 49 23 ].However,formorecomplicated distancefunctions(especiallynon-metricfunctions),th enested-loopsalgorithmgivenas Algorithm 9 isreallytheonlyalternative. Inotethatevenif dist isapropermetricdistance(sothatvariouspruningmethods suchasareference-basedindexcouldbeusedtospeedtheinn erloop),intheenda signicantfractionofthedatabaserecordsmuststillbech eckedforeachexamplepoint. Thisisbecauseevenstate-of-the-artmethodsarerestrict edintheirpruningpower.Thus, itisnaturaltoask:canImakeuseofmyBayesianframeworkto determinewhenan examplepoint p 2 P isunlikelytoappearintheresultset? Icanaccomplishthisbyselectingasmallrandomsubsetofth eexamplepoints.For thesepoints,Iscanthedatabase Q initsentirety,recordingtheobserved dist valuesfrom 98


Algorithm 9 Nested Loops Distance Join Algorithm
1:  Let cutoff = ∞
2:  Let Matches = {}
3:  for each point p in P do
4:    for each point q in Q do
5:      if dist(p, q) < cutoff then
6:        Add (p, q) to Matches
7:        if |Matches| > t then
8:          Remove the pair from Matches having the largest dist(p, q) value
9:          Let cutoff = max{ dist(p, q) | (p, q) ∈ Matches }
10:       end if
11:     end if
12:   end for
13: end for

the database points to my sampled example points. The set of distances associated with each sampled example point is treated as a separate training query within my Bayesian framework, and a shape model is learned. I then make sure to scan the database Q in a random order, and add a second pruning test after line (11) of Algorithm 9 that attempts to prune the example object p in a probabilistic fashion (similar to the pruning used in Section 7.2). Given the subset of Q seen so far, I use my Bayesian framework to make a suitably optimistic guess as to the distance from p to its closest match in Q. To ensure that the guess is "suitably optimistic", I use all of the observed distances to compute the CDF F_ratio as described in Section 6, and choose x so that F_ratio(x) = 0.01. If x·f̂(k) > cutoff, then there is only a very small chance that p has a match in Q, and I can prune it.

Experimental Evaluation

Experimental Setup: The goal in evaluating the resulting algorithm is similar to the goal of the previous subsection. I wish to know: how much of a speedup does this conservative pruning provide, and what is the effect on result accuracy?

To answer these questions, I use two real image data sets. One image data set consists of 15,117 color images selected randomly from the Internet (I call this the Mix data set). The second image data set is the UCID benchmark color image data set [47], which


Table 4-8. Results for the multimedia distance join experiment.

Dataset   Speedup   Overlap   Error
Mix       3.24      19        0.28
UCID      3.52      30        0.00

has 1338 color images. I transform each image into a 128-dimensional color histogram. In order to create two meaningful distance join queries, I run the k-means clustering algorithm on each data set to segment each data set into two classes. For each data set, my experiment involves computing a top-30 distance join across the two clusters. The distance metric employed is the quadratic distance [48] between the two color histograms. Just as in the last subsection, I compute the speedup, the overlap in result sets, and the relative error, which are reported in Table 4-8.

Discussion: In general, I observed a factor of 3 to 4 improvement on both data sets in terms of the required running time. For the UCID data set, the result quality was no different than for the simple nested loops algorithm. For the Mix data set, 19 out of the 30 "true" matches are returned, with a relative error of 28%. As Figure 4-8 shows (this figure is analogous to Figure 4-6), there is some separation between the two result curves, due mainly to the fact that the modified algorithm missed the top three matches. This seems to be due mainly to the extreme heterogeneity of the images from the Mix database, given that they were sampled from the Internet. In other words, this is a case where the Bayesian assumption (that I can build a reasonable prior distribution by looking at a few samples) may be somewhat problematic. Since the images are arbitrary, the next image in P has little resemblance to any other image in P. Even so, my method still returns 19 of the 30 true answers (out of more than 17 million candidate pairs).

4.7.6 A Few Final Comments

I close by stressing that the section's goal was not to propose the best possible outlier detection algorithm, or the fastest spatial anomaly detection algorithm, or the fastest possible join algorithm, but was instead to show the type of application for which my


1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 0.5 1 1.5 2 2.5 3 3.5 4 x 10 -3 RanksPair DistanceMix data set NL Modified NL Figure4-8.Comparisonofthe30smallestreturnedpairdist ancesforthebasicnestedloop joinandthemodiedversionthatmakesuseoftheBayesianfr amework. Bayesianframeworkmaybeuseful.Infourdierentcases,it wasalmosttrivialtoview theBayesianframeworkasa\blackbox"thatcanguessthelar gestorsmallestvaluein aset,andtousethistospeedastate-of-the-artalgorithmb yafactoroffourorvewith littlehitinaccuracy.Clearly,additionalworkormorecom plexapplicationofmymethod couldresultinapplicationsthatperformevenbetter,butt hisisaproblemforfuture work.Mygoalwastoshowthefeasibilityandutilityofthiss ortofapplication,andthe resultsofthissectionseemtobeconclusive. 101


4.8Conclusion Ihaveconsideredtheproblemofestimatingtheextremevalu esinadatasetby lookingatasmallnumberofsamplesfromit.Becausetherela tionshipbetweenthe samplesandthemaximum(orminimum)valueinadatasetissod ependentupon thedistributionalpropertiesofthedatasetinquestion,I havedevisedaunique, Bayesianframeworkforthisproblemthatusespreviously-o bservedqueriestomakea statisticallyrigorousguessastothetypeofquerythatisc urrentlyunderconsideration. Signicantly,Ihavegiventwoexamplesofhowthisframewor kcanbeappliedtovarious datamanagementproblems.Theapplicationsdetailedareby nomeansexhaustiveand itisfarfromclearthatIhaveappliedmyframeworkinthebes twaypossibleforeach problem.However,Ibelievemyresultsconclusivelyshowth atthistechniquecanbe appliedsuccessfullytomanyotherproblemdomains. 102


CHAPTER5 ALRTFRAMEWORKFORFASTSPATIALANOMALYDETECTION 5.1Introduction Discoveringsubsetsofdatabasedatathatarespatiallyclo setooneanotherand exhibitanomalousbehaviorisofkeyimportanceinmanyappl icationareas.Forexample, considermymotivatingapplicationofminingantimicrobia l(antibiotic)resistance patterns.Antimicrobialresistanceinnosocomial(hospit alacquired)bacterialinfections isakeypublichealthproblem.Antimicrobialdrugsarethe rstandsometimesonly meansofattackingbacterialinfection,butduetouseandmi suseovertime,antimicrobials becomelessusefulasbugsbecomeresistantduetoselective pressures.Theresultis thatcommon,oftenmild,nosocomialinfectionssuchasstap hcanbecomedeadlywith noeectivetreatment.TheAntimicrobialResistanceManag ementdatabase(orARM database,seehttp://www.armprogram.com/)consistsofan timicrobialresistancedata fornearly400hospitalsovera15-yearperiod,andpresents anopportunitytostudythe epidemiologyofantimicrobialresistance.Overthose15ye ars,thetrendinresistance ratesisgenerallyupward.However,akeyquestionthatIwou ldliketoansweris:Isthe trend uniformly upwardovertime,oraretheresignicantspatialirregular itiesinthe trend?Inotherwords,isitpossibletodetectspatialregio nswhereasetofhospitals havesignicantlydierenttrendsfromtherestofthehospi tals?Mightsomeregions ofthecountryactuallyseea decrease inresistanceratesovertime?Answeringthese questionswouldprovidekeyinsightsintothelarge-scalee pidemiologyofantimicrobial resistanceindicating,forexample,how\mobile"thebugsa re,orhowtheevolutionof bugsinahospital'sgeneralregionaectslocalresistance rates.Ifresistancetrendsshow nogeographicanity,thenitmightindicatethatresistanc epatternsunfoldinisolation, andsolocalprogramsatindividualhospitalsaimedatcaref ulantimicrobialstewardship willbeuseful.Ifresistancetrendshavestrongspatialcor relations,thenitmightindicate 103


thatsuccessfulantimicrobialstewardshipmustbeawidere ort,sincewide-areatrends exhibitingstronginruence.ApplyingaSpatialLikelihoodRatioTest. Usingclassicalstatistics,itispossible toimagineawaytominetheARMdatabaseforregionsofthecou ntrywithanomalous oruniqueresistancetrends.First,Iwoulddenesomestati sticalmodelforresistance trends.Forexample,givenaspatialregion,ayear y ,anda(bug,drug)combination,I mighttreatthenumberofresistantcasesinyear y astheresultofasinglebinomialtrial, withanunknownprobabilityofresistance p y andthenumberofexperimentedisolations ofthebugasaknown,xedinput n .SinceIaminterestedintrends,Icouldlinkallof the p y valuesovertimeforagivenregionusingalinearmodelovert heyears,making p y a linearfunctionoftheyear y .Then,usingalikelihoodratiotest[ 53 ],Icouldcheckwhether thetrendforthelocalspatialregionissignicantlydier entfromthetrendthatwould beusedfortheentirecountry.Bybreakingthecountryintoa gridandcheckingeach contiguousregioninthegridforadierenceinthelocaltre nd,Iwilllocateanylocally anomalousresistancetrends.So,what'stheproblem? Aslongascheckingwhethereachcontiguousregionis anomalousiscomputationallyinexpensive,thenabrute-fo rcesearchisfeasible.There are O ( n 4 )rectangularregionsinan n n grid.Forexample,if n =32,thenthereare 278,784regions.Theoverallcosttosearchthegridusingab rute-forcemethodis O ( cn 4 ), where c isthecosttocheckagivenareabycomputingasinglelikelih oodratiotest.Since amodernCPUcanperformseveral billion instructionspersecond,thenaslongas c translatestoafewthousandCPUcycles,agridofreasonable sizecanbesearchedin minutesorhours. Theissueisthatformorecomplicatedanomalydetectionpro blems,running eachtestcanbe extremely costly|requiringthesolutiontoadicultoptimization problem|and c cantranslatetobillionsofCPUcycles.Thelikelihoodrati otestresulting fromtrend-basedsearchdescribedabovemayrequire onesecond torunforeachregion 104


thatissearched.Thisresultsinasearchthattakesweeksto runona32by32grid.The problemisnotenumeratingallofthelocalregionstocheck; theproblemisactuallyhaving torunthelikelihoodratiotestonallofthem MyContributions. Inthischapter,Iconsidertheproblemofrunningasearchfo r anomalouslocalregionsinaspatialgrid.Iproposeaprunin gstrategythatcanbe usedtoradicallycutdownonthenumberoflikelihoodratiot eststhatmustberun whensearchingforanomalousspatialregions.Thishasthee ectofreducingthecost c associatedwiththeoverall O ( cn 4 )computation.Experimentsonrealdatainreala applicationdomainshowsthatthiscanresultinrunningtim esthatarereducedfrom oneyeartoonlyafewdays.Mypruningstrategyisimplemente dwithinaverygeneric frameworkthatcanbeusedinconjunctionwithvirtuallyany likelihoodratiotestand anomalydetectionproblem.Auserofmyframeworkneedonlys upplyimplementationsof afewspecicfunctionstoinstantiatetheframework.ChapterOrganization. Section2ofthechapterdiscussestheclassicallikelihood ratio test,andintroducesmynewframeworkforspatialanomalyde tection.Section3describes howIcanreducethecostofndinganomaliesinmyframeworkv iapruningofindividual likelihoodratiotests.Sections4,5,and6detailhowthepr uningisimplemented. ExperimentsarecoveredinSection7,Section8describesre latedwork,andSection9 concludesthechapter. 105


5.2 LRT and Problem Definition

5.2.1 The LRT Statistic

The most common statistical test for anomalous data is the so-called likelihood ratio test (LRT), which forms the basis for this chapter. The LRT is a hypothesis test that facilitates the comparison of two models: one parametric statistical model associated with the hypothesis that there is an anomaly, and another parametric statistical model associated with the hypothesis that there is no anomaly, the so-called "null model". For the LRT to be valid, both parametric models should take the form of identical likelihood functions L(Θ | X), where X is the database data and Θ is a set of parameters coming from the parameter space. The (restricted) parameter space allowed under the null model (this set is denoted as Θ_0) is the complement of the parameter space allowed in the case of anomalous data (the "anomalous" subspace is denoted as Θ - Θ_0). To check for an anomaly using the LRT, two hypotheses H_0: Θ ∈ Θ_0 and H_a: Θ ∈ Θ - Θ_0 are compared against each other by computing the statistic:

\[
\Lambda(X) = \frac{\sup_{\Theta_0} L(\Theta \mid X)}{\sup_{\Theta} L(\Theta \mid X)}
\tag{5-1}
\]

This statistic is computed by first computing a maximum likelihood estimate (MLE) under both parameter spaces Θ_0 and Θ, and then computing the ratio of the likelihoods obtained via the two MLEs. In 1938, Wilks [53] showed that the asymptotic distribution of λ = -2 log Λ (which I subsequently refer to as the LRT statistic) is chi-squared with (p - q) degrees of freedom under the null hypothesis that Θ ∈ Θ_0; p is the number of dimensions (or free parameters) in Θ, and q is the number of dimensions of Θ_0. Thus, to check for an anomaly at confidence level α, one checks whether λ(X) ≥ c_α, where c_α is a non-negative number computed by finding how far out in the tail of a chi-square distribution one has to go to find (1 - α)% of the mass.

For a very simple example of the sort of case where the LRT is applicable, imagine that I wish to test whether the disease rate within a spatial area is different than the


diseaserateoutsideofthearea.wouldcontaintwoprobabi litiesorratesofinfection: p in and p out .Ontheotherhand, 0 wouldcontainonlyasinglerate p ,where p = p in = p out Thedata X wouldcontainfournumbers:(1) n in |thenumberofpeopleinsidethearea; (2) n out |thenumberofpeopleoutsideofthearea;(3) k in |thenumberofdiseasedpeople insidethearea;and(4) k out |thenumberofdiseasedpeopleoutsideofthearea.The likelihoodfunction L ()wouldthenbeabinomialfunction: L ( j X ) / p k in in (1 p in ) n in k in p k out out (1 p out ) n out k out Thedegreesoffreedomofthenulldistributionisone,since thereisonemore parameterinthanin 0 5.2.2ProblemDenition IstudytheuseofabroadclassofLRTstodetectspatialanoma lies.Iassumethat thespatialareaoverwhichIsearchforanomalieshasbeenpr e-partitionedintoan n n spatialgrid.Foreachspatialarea A inthegrid,Iwishtoanswerthequestion,\Does A diersignicantlyfromtheremainderoftheareainthegrid ?"Thisquestionisanswered bycomputingaLRTstatisticthatcompares A to A .IfthevalueoftheLRTstatisticis large,then A isreturnedtotheuserasananomalousregion. ThealgorithmsthatIdevelopcanhandleaverybroadclassof user-supplied hypothesistestsandassociatedlikelihoodfunctions.Mya lgorithmscanbethoughtof asagenerictemplate,whichrequirethatauseronlyneedsto supplyafewfunctions whichinstantiateaparticularanomalydetectionproblem. Atahighlevel,tousemy frameworkauserrstdenesamodelcharacterizedbyalikel ihoodfunction L (). L acceptsaparameterset c forsomecell c inthespatialgrid,aswellasthesummary statistics X c forthecell,andthenreturnsthelikelihoodoftheparamete rsgiventhedata. Ingeneral, c containssome\sharedparameters"andsome\freeparameter s."\Shared parameters"areparametersthat,underthemodeltobeteste d,cannotbedierentamong cells.Forexample,theslopeinthetrendofantimicrobialr esistancerateinmymotivating 107


exampleisasharedparameter,becauseIaminterestedintes tingwhetherthechangein resistanceratediersacrossregions;ifthetrenddoesnot dier,thenitisshared.\Free parameters"areparametersthatarechosenonacell-by-cel lbasisandareassumedto havenolinksacrosscells.Forexample,thebaselineorstar tingrateofantimicrobial resistanceinahospitalmaybeafreeparameter.Everyparam eteriseithersharedorfree. Foreachcontiguousspatialregion A inthegrid,mymodelthenconsidersthefollowing, competinghypotheses: H 0 :\Eachsharedparameterhasoneandonlyonevaluebothwithi n A and A ,and itisthesameacross A and A ." H a :\Eachsharedparameterhasoneandonlyonevaluewithin A .Eachshared parameteralsohasoneandonlyonevaluewithin A .However,oneparticularsubset ofthesharedparameters(calledthe testset )hasdierentvalueswithin A thanit doeswithin A ." The testset isthesetofsharedparametersthatIamactuallyinterested intesting. Notethatingeneral,thereisnoreasonthateachandeverysh aredparametermustappear inthetestset.Ifcertainsharedparametersarenotinthete stset,thenunderboth H 0 and H a ,thosesharedparametersmustalwaystakethesamevaluefor eachandeverycell inthegrid. Giventhissetup,tousemyalgorithms,ausermustsupplyexa ctlythefollowing specicfunctions: 1. Afunction f thatcancomputeasetofsummarystatistics X = f X c 1 ;X c 2 ;::: g fora givenareainthegrid. c 1 ;c 2 ;::: arethegridcellsinthearea,and X c 1 ;X c 2 ;::: arethe summarystatisticsdescribingthesegridcells.Forexampl e,inasimplecasewhereI amtryingtondregionswithahighdiseaserate, f mightcomputethenumberof diseasecasesineachcell,aswellasthepopulationofeachc ell. 2. Alikelihoodfunction L ( j X )thatacceptsasetofsummarystatistics X = f X c 1 ;X c 2 ::: g aswellasasetofparametervalues = f c 1 ; c 2 ::: g forasetofgrid cells f c 1 ;c 2 ;::: g andreturnsthelikelihoodof given X 3. AnMLEprocedure MLE c ( X A ;X A )correspondingtotheunrestricted(complete) parameterspace.ThisMLEprocedureacceptstwosetsofsumm arystatistics X A and X A computedby f overtworanges A and A ,andcomputesthesetof parameters = A [ A thatmaximizesthevalueof L ( j X A ;X A ).Thismaximization 108


1. For each rectangle A in the grid
2.   Let X_A = f(A)
3.   Let X_Ā = f(Ā)
4.   Let X = f(A ∪ Ā)
5.   Let Θ_c = MLE_c(X_A, X_Ā)
6.   Let Θ_0 = MLE_0(X)
7.   Let λ = -2 log L(Θ_0 | X) + 2 log L(Θ_c | X_A) + 2 log L(Θ_c | X_Ā)
8.   If λ is in the top k found so far, then remember A

Figure 5-1. Naive top-k LRT search (Algorithm 1)

is computed subject to the constraint that all shared parameters within the test set must take the same values within A and also within Ā, but can differ across A and Ā.

4. An MLE procedure MLE_0(X) corresponding to the restricted (null) parameter space. This MLE procedure accepts a set of summary statistics X over some spatial region, and computes the Θ which maximizes the value of L(Θ | X). This computation is performed subject to the constraint that all shared parameters must be uniform within the region corresponding to X.

Note that if the test set is the entire shared parameter set, then MLE_c is simply two invocations of MLE_0, one over X_A, the other over X_Ā.

Logically, my algorithms then perform the computation given in Algorithm 1 and return the top k spatial regions discovered.

5.2.3 Example Application

For an example application of my framework, consider a simple variation on my motivating application, where I have two years' worth of antimicrobial resistance data for each spatial area in a grid, and I wish to check whether there are particular local areas where the change in antimicrobial resistance rates across the two years is different from the change in resistance rates in the database as a whole. That is:

H_0: "The trend in resistance rate change is the same for each and every cell in the grid."

H_a: "There exists an area A where the trend in resistance rate change is different than it is for the remainder of the cells in the grid."


Toapplymyframeworktothisproblem,Iwouldinstantiate f L MLE c ,and MLE 0 asfollows: 1. f ( A )wouldcomputethenumberofmicrobialinfectionsineachce llin A foreachof thetwoyears(foraparticularcellin A ,thesevaluesaredenotedas n 1 and n 2 ),as wellasthenumberofresistantmicrobialinfectionsforeac hcellintheareaforeach ofthetwoyears(denotedas k 1 and k 2 ). 2. Foracell c inaspatialarea A L accepts n 1 ;n 2 ;k 1 and k 2 ,aswellastherateof infectionforthetwoyears p 1 and p 2 .Itthencomputesthebinomiallikelihoodofthe ratesgiventhedata: L ( c j X c ) / p k 1 1 (1 p 1 ) ( n 1 k 1 ) p k 2 2 (1 p 2 ) ( n 2 k 2 ) Oncethevalueof L hasbeencomputedforeachcell, L returnstheproductofthe likelihoodsovereachcellinthearea: L ( j X ) / Q c 2 A L ( c j X c ). 3. MLE 0 performsabinomialMLEfortheentirespatialarea.However ,thisisnot aclassicbinomialMLEwheretherateorprobabilityparamet erisestimatedby dividing k by n .When MLE 0 performsMLE,itforcesthetrend(ordierenceof p 1 and p 2 )foreachcelltobexed.Thatis,thebinomialMLEisconduct edunderthe constraintthat p 2 = p 1 +forasinglexedvalueofacrossallcells;thisis anexampleofa\sharedparameter."Thiscorrespondstothen ullhypothesisthatI assumedwhenIstartedout,whichsaysthatthetrendforeach cellinthegridisthe same. 4. Finally, MLE c performstwosimilarbinomialMLEs,exceptthattherearetw o trends A and A allowedfortheareaswithinandwithout A ,respectively. 5.2.4So,What'stheDiculty? RunningthenaivealgorithminFigure1canbequiteexpensiv e,butthecomputational dicultyis not necessarilyassociatedwithenumeratingallofthelocalsp atialregions. Whilethenumberofrectangularregionsis O ( n 4 )foran n n grid,thefactisthateven foralargegridofsize100 100,itisnotveryexpensivetoenumerateeveryrectangular regionusingmodernhardware;thiswouldtakeonlyafewseco ndsatmost.Rather,Ind thatforsearchproblemswherethenaivealgorithmisnotfas tenough,thecomputational dicultyisassociatedwith actuallycomputing MLE c and MLE 0 foreverycellinthe grid .Thus,Iwillconsiderpruningstrategiesthatstillenumer ateallofthe O ( n 4 )local 110


spatialareasinthegrid,but canofteninexpensivelytellmebeforeanyMLEsareever computedthatitisimpossiblefortheLRTstatistictohavea largeorinterestingvalue 111


A2 A3 A4 A1 A A A (a) (b) Figure5-2.Upperboundingthemaximumlikelihoodofregion A bythesummationofthe maximumlikelihoodsofsubregion A i s.(a)Thecurrenttestingregion A .(b) Thetestingregion A titledbyfoursubregions. 5.3BasicApproach ThebasicideathatIwillpursueisdevisingsomewayofimmed iatelyknowing|without performinganyMLE|whethertheLRTstatisticcanpossiblye xceedagivencutovalue foranarea A .So,howdoIdothis?AteachiterationofAlgorithm1,Iwillc onsidera region A ,andwishtoknowwhetherthevalueofassociatedwith A exceedsthecuto. Normally,tocompute,Iwillinvokeboth MLE c and MLE 0 .Thatis,givenaspatial region A ,Iwouldcompute c = MLE c ( X A ;X A )and 0 = MLE 0 ( X A [ X A ),andthen computethequantity:= 2log L ( 0 j X A [ X A )+2log L ( c j X A )+2log L ( c j X A ) MygoalistosomehowavoidrunningtheexpensiveMLEsandsti llobtainsomeidea aboutwhetherornotexceedsthecuto.Handling 0 iseasy.Sincethevalueof MLE 0 isthesamenomatterwhatthevalueof A ,Icancompute MLE 0 overtheentiredataset once,andthensimplyre-usethisvalueforeveryiterationo ftheloopthatenumeratesall possiblerectangles. However,itismorediculttoavoidrunning MLE c overthecurrenttestingregion A .Fortunately,Icanupperboundthevaluesofboth L ( c j X A )and L ( c j X A )byusingthe likelihoodsassociatedwiththeMLEsthatIhavecomputedpr eviously,andstoredforlater re-use. Inmydiscussion,Iwillconsiderupperbounding L ( c j X A )rst,becauseallof thesameargumentsholdforboth X A and X A .AsillustratedinFigure2,saythatI donotknowtheMLEresultover A inFigure2(a)thatwouldbecomputedby c = 112


MLE c ( X A ;X A ).HowcanIboundthevalueof L ( c j X A )withoutactuallycomputing MLE c ? First,Iknowthat MLE 0 woulddoa\betterjob"on X A than MLE c will,since MLE c isforcedtochooseuniformvaluesacross A and A forallsharedparametersthat arenotinthetestset,while MLE 0 doesnothavethisconstrain.Thatis,if X A = MLE 0 ( X A ),then L ( c j X A ) L ( X A j X A ). Giventhis,assumethatIknowtheresultof A i = MLE 0 ( X A i )foreach i in1 ::: 4in Figure2(b).Then: Let 1 bethesetofallpossibleparametervaluesthatcouldbeassi gnedto X A by MLE 0 ( X A ). Let 2 bethesetofallpossiblevaluesthatcouldbechosenfor A 1 [ A 2 [ A 3 [ A 4 Iknowthatinthiscase 1 2 .Why?In 1 ,allsharedparametersmusthavethe samevalueforallcells|thisisprescribedbythedenition of MLE 0 .However,in 2 sharedparameterscandieracrosseach A i ,sinceeachindividual MLE 0 isperformedin isolation.Since 1 2 ,itmustbethecasethat L ( X A j X A ) Q L ( A i j X A i ):both X A and A 1 [ A 2 [ A 3 [ A 4 describeexactlythesamedata,butthelaterparametersare chosenfromalargerspaceofpossibleparametersets.Putti ngthistogetherwiththefact that L ( c j X A ) L ( X A j X A ),Icanboundthevalueof L ( c j X A )for c = MLE c ( X A ;X A ) by L ( c j X A ) Q L ( A i j X A i ). Thus,mybasicstrategywillbetoprecomputearelativelysm allsetof MLE 0 results oversmallregionsofthespatialareathattogethercanbeus edtotileanysubregionof thespatialarea.UsingtheseprecomputedMLEs,foranytest ingregion A ,Icanusethe aboveobservationtoupperboundtheLRTstatistic.Ifthis upperboundistight,itwill oftenbepossibletoprune A andthusavoidcomputingavastmajorityofMLEsincurred bytheLRTstatistics. Notethatexactlythesamestrategycanbeusedtoboundthequ alityoftheresultof L ( c j X A ).Theonlydierenceisthatratherthantiling A ,Ineedasetof MLE 0 results 113


correspondingtoregionsthattotallytile A .Then,givenaboundforboth L ( c j X A )and L ( c j X A ),aboundforfollowsimmediately. 114
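Putting the pieces of this section together: once MLE_0 has been computed once over the whole grid and once over each precomputed tile, an upper bound on λ for a candidate rectangle A requires only lookups and additions. The Python sketch below is my own illustration of that combination step, under stated assumptions: tile_loglik is a hypothetical lookup table of precomputed log L(Θ_Ai | X_Ai) values, and tiles_of and tiles_of_complement stand in for the tiling routines developed in the following sections.

def lambda_upper_bound(A, global_logL0, tile_loglik, tiles_of, tiles_of_complement):
    # Bound on lambda = -2 log L(theta_0 | X) + 2 log L(theta_c | X_A) + 2 log L(theta_c | X_Abar),
    # using the fact that each true term is at most the sum of the precomputed
    # per-tile MLE_0 log-likelihoods covering A (respectively, the complement of A).
    ub_logL_A = sum(tile_loglik[t] for t in tiles_of(A))
    ub_logL_Abar = sum(tile_loglik[t] for t in tiles_of_complement(A))
    return -2.0 * global_logL0 + 2.0 * ub_logL_A + 2.0 * ub_logL_Abar

def maybe_prune(A, cutoff, **kw):
    # The rectangle can be skipped if even this optimistic bound cannot beat
    # the k-th best lambda found so far.
    return lambda_upper_bound(A, **kw) < cutoff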


A A (a) (b) Figure5-3.Secondtightboundcriterion.(a)Threesubregi onsevenlycoverthetarget region A .(b)Threesubregionsunevenlycoverthetargetregion A ,withone bigsubregiondominatingtheothertwo. 5.4TightBoundCriteria Tobound L ( c j X A )and L ( c j X A )Imustbeabletotileboth A and A withasetof regionsforwhichan MLE 0 hasalreadybeenpre-computed.Therearemanypossible precomputationsandtilings.Itturnsoutthatthemannerin which A (and A )istiledcan haveasignicanteectuponthequalityoftheboundthatisa chieved.Iconsiderthis issueinthissection. Inpractice,therearetwoover-ridingconsiderationswhen tilingaregioninorderto boundthevalueof L ( c j X A ): 1. Theregionshouldbetiledwithasfewtilesaspossible. 2. Foraxednumberoftiles,itisgenerallybettertohaveahig hvarianceintilesizes thanitistohavealowvarianceintilesizes|thatis,onebig tileand n 1small tilesarebetterthan n medium-sizedtiles. Itiseasytoarguewhyaddingmoretilesthanthatarestrictl yneededisgenerally abadidea.Under MLE 0 ,onlyonevalueforeveryparameterinthetestsetisallowed However,if A iscoveredwith n tiles,then n dierentvaluesforeachparameterinthetest setareallowed,withoneparametervaluepertile.Thus,add ingmoreandmoretileshas theeectofaddingmoreandmoreparameters,or,equivalent ly,changingtheunderlying MLEproblemsoitislessandlessconstrained.Thistendstor esultinalooserandlooser upperboundon L ( c j X A ). 115


Thereasonthatlargetilesarepreferred|evenifitresults inoneortwolargeregions andafewverysmallregions|isillustratedinFigure3.Thed atapointsinthisgureare representedbyX'sandO's;thedatapointsrepresentedbyX' saregeneratedbyadierent processthantheO'sandsotheyrequiredierentsharedpara metervaluestomodelthem thantheO'sdo.Ifanumberofsmallerboxesareused(asinthe leftofFigure3),then itismuchmorelikelythatthedatainsideofthoseboxesareh omogeneousandcontain mostlyX'sormostlyO's.Thus,the MLE 0 procedurecandoaverygoodjobofchoosing aparametersetthatmatchesthedataclosely,whichwillpro duceamuchlooserbound thanthecaseintherightofFigure3. 116


1 234 5 678 1 2 3 4 1 2 1 L1 L0 L3 L2 (b) 1 234 5 678 1 2 3 4 1 2 1 L3 L2L1 (d) L0 j i mn (c) (a) j i mn A Figure5-4.Tiling A .(a)Rectangle( m;n;i;j )isrecursivelysplit.(b)Thesplitting positionineachlevel.(c) A istiledusingthreeprecomputedrectangles.(d) Theprecomputedrectanglesthatareusedtotile A 5.5PrecomputationandBounding WereItocompute MLE c ( A; A )forevery A inthegrid,therewouldbe O ( n 4 )MLE computations.Mygoalistoreducethisbyatleastafactorof n byperformingonly O ( n 3 ) MLE 0 pre-computationsandahandfulofactual MLE c ( A; A )computationsforthose A A pairsthatarenotpruned.Also,mygoalistolimitthoseprec omputationtosmaller spatialregions,sotheytendtobelessexpensive.Sincemyr educedproblemistobound both L ( c j X A )and L ( c j X A ),Iconsiderthesetwosubproblemsseparatelyinthefollow ing twosubsections.5.5.1PrecomputationandBoundingfor A Ideviseamethodthatusesatmost2log n 2 precomputedrectanglestotileanygiven A .Theprecomputedrectanglesareobtainedasfollows.Consi derthebiggestrectangle thatisenclosedbyapairoftheverticalgridlines.Figure4 (a)illustratesonesuch rectangle( m;n;j;i ).Thehorizontalgridlinewhichcrossesthecenterpointof thegrid isusedtosplitthisrectangleintotwoidenticalsubrectan gles.IfIrecursivelysplitthe resultingsubrectanglesbyhorizontalgridlinesfromthei rmidpointsuntilIreachthe 117


lowest resolution of the grid, I will have considered one rectangle at the highest level (level 0), two identical rectangles at the second level (level 1), and four identical rectangles at the third level (level 2), and so on. At the last level, I will have n identical rectangles that together tile the original rectangle that I started out with. The split points at each level are illustrated in Figure 5-4(b). Thus, for each pair of vertical grid lines, I consider 1 + 2 + 4 + ... + n rectangles, and precompute MLE_0 for each. There are $\binom{n+1}{2}$ vertical line pairs. Therefore, my precomputation set contains (2n - 1)(n + 1)n/2 rectangles, which is on the order of O(n^3).

To use the precomputed rectangles to tile any given A, I use a divide and conquer method. Figure 5-5 gives the detailed algorithm. The recursive function Tile_A accepts the bottom-left (x1, y1) and top-right (x2, y2) coordinates of a rectangle A, and a partition range low and high (initially, low is set to zero and high is set to n). It splits A in a top-down fashion following the split points used during precomputation. This method will always tile A with the biggest possible rectangle in my precomputed set and guarantees that I use the smallest number of tiles. Figure 5-4(c) and (d) illustrate an example where I use two level 2 rectangles (numbered 2 and 3) and one level 3 rectangle (numbered 7) to tile the given A.

5.5.2 Precomputation and Bounding for Ā

Now I turn my attention to the tiling methods for Ā. There are two different tiling strategies that I employ:

The radial method. As illustrated in Figure 5-6(a), in a clock-wise order, I elongate the edges of a rectangle A until the edges hit the grid borders. Ā is then divided into four rectangles denoted A1 to A4, which are used to tile Ā. I can do the same thing in a counter-clockwise order, which is depicted in Figure 5-6(b). In order to tile any given Ā using the radial method, the precomputed set should contain all the rectangles that share at least one corner with the grid. This set can be obtained by considering all of the


Procedure Tile_A(x1, y1, x2, y2, low, high)
1.  Let mid = (low + high)/2
    // base case
2.  If ((y1 == low && y2 == high) || (y1 == low && y2 == mid) || (y1 == mid && y2 == high))
3.    Return rect(x1, y1, x2, y2)   /* rect() returns the stored rectangle log-likelihood */
    // recursive case
4.  If (y2 <= mid)
5.    Return Tile_A(x1, y1, x2, y2, low, mid)
6.  Else If (y1 < mid && y2 > mid)
7.    Return Tile_A(x1, y1, x2, mid, low, mid) + Tile_A(x1, mid, x2, y2, mid, high)
8.  Else If (y1 >= mid)
9.    Return Tile_A(x1, y1, x2, y2, mid, high)

Figure 5-5. Algorithm for tiling A

intersection points on the grid. I connect each intersection point on the grid with the four corners of the grid. This produces four diagonals, each of which creates one rectangle in my precomputed set. Since there are O(n^2) intersection points, there are O(4n^2) rectangles in my precomputed set.

The sandwich method. As illustrated in Figure 5-6(c), if I elongate the two vertical edges of A in both directions until they reach the borders of the grid, Ā is divided into four rectangles, denoted A1 to A4. I use these four resulting rectangles to tile Ā. In the same fashion, I can do this horizontally as illustrated in Figure 5-6(d). In order to tile any Ā using this method, I need to precompute all the rectangles that share two corners with the grid. In Figure 5-6(c), these rectangles are A2 and A4. Notice that these two rectangles are in the precomputed set for the radial method, so I can reuse them. Also, since I want to avoid any additional precomputations to bound A1, I call the Tile_A procedure from the previous subsection to upper bound A1. I can upper bound A3 in a similar fashion. As a result, with no additional precomputation, I can obtain the bounds using the sandwich method.

I have discussed two strategies and four methods to bound Ā. In practice, I compute the four bounds and choose the tightest one to upper bound Ā.
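The recursion in Figure 5-5 is short enough to restate in executable form. The Python sketch below is my translation of that pseudo-code, not code from the dissertation; it assumes the precomputed log-likelihoods are stored in a dictionary rect_loglik keyed by (x1, y1, x2, y2), whereas the actual implementation indexes them with the binary trees described in the next section.

def tile_A(x1, y1, x2, y2, low, high, rect_loglik):
    # Sum of precomputed MLE_0 log-likelihoods for the tiles covering the
    # rectangle (x1, y1)-(x2, y2), following the splitting of Figure 5-5.
    mid = (low + high) // 2
    # Base case: the rectangle's vertical span is itself a precomputed one.
    if (y1, y2) in ((low, high), (low, mid), (mid, high)):
        return rect_loglik[(x1, y1, x2, y2)]
    # Recursive case: descend into the half (or halves) containing the rectangle.
    if y2 <= mid:
        return tile_A(x1, y1, x2, y2, low, mid, rect_loglik)
    if y1 < mid < y2:
        return (tile_A(x1, y1, x2, mid, low, mid, rect_loglik)
                + tile_A(x1, mid, x2, y2, mid, high, rect_loglik))
    return tile_A(x1, y1, x2, y2, mid, high, rect_loglik)  # y1 >= mid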


Figure 5-6. Tiling methods for $\bar{A}$. (a) and (b) depict the radial methods; (c) and (d) depict the sandwich methods.

Input: a spatial data set $D$, $f$, $MLE_0$, $MLE_c$, $L$, and $k$
Output: the top-$k$ anomalous regions
1.  Precompute the $O(n^3)$ rectangles described in Section 5.5.1
2.  Precompute the $O(4n^2)$ rectangles described in Section 5.5.2
3.  Let $\theta_0 = MLE_0(X)$, where $X = f(D)$
4.  For each rectangle $A$ on the grid:
5.    Let $X_A = f(A)$ and $X_{\bar{A}} = f(\bar{A})$
6.    Follow Section 5.5.1 to get the upper bound for $\log L(\theta_c \mid X_A)$
7.    Follow Section 5.5.2 to get the upper bound for $\log L(\theta_c \mid X_{\bar{A}})$
8.    Combine the results of steps 3, 6, and 7 into an upper bound on $A$'s LRT statistic
9.    If the upper bound is less than the $k$th best found so far, then prune $A$
10.   Else compute the exact LRT statistic for $A$; if it is among the top $k$ found so far, then remember $A$

Figure 5-7. Top-k LRT statistic search with pruning

5.6 Final Search Algorithm

The final search algorithm in Figure 5-7 modifies the naive algorithm. At each iteration, the new algorithm obtains an upper bound on the current region's LRT statistic and compares it with the current cutoff, i.e., the $k$th largest LRT statistic discovered so far. If the upper bound is less than the cutoff, the current region is pruned. Otherwise, the exact value of the LRT statistic is computed.
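As an illustration of the loop in Figure 5-7, here is a minimal Python sketch. The helper names `upper_bound` and `exact_lrt` are placeholders for the bound assembled from Sections 5.5.1 and 5.5.2 and for the expensive $MLE_c$/$MLE_0$ computation; they are not names used in this dissertation.

import heapq

def top_k_lrt(rectangles, upper_bound, exact_lrt, k, initial_cutoff=float("-inf")):
    """Sketch of the pruned top-k search of Figure 5-7. `rectangles` enumerates
    the candidate regions A; `upper_bound(A)` is the cheap bound built from the
    precomputed tiles; `exact_lrt(A)` runs the expensive MLEs."""
    top = []  # min-heap of (statistic, tie_breaker, region)
    for i, A in enumerate(rectangles):
        cutoff = top[0][0] if len(top) == k else initial_cutoff
        if upper_bound(A) <= cutoff:
            continue                          # prune A without running its MLEs
        stat = exact_lrt(A)                   # the expensive step
        if len(top) < k:
            heapq.heappush(top, (stat, i, A))
        elif stat > top[0][0]:
            heapq.heapreplace(top, (stat, i, A))
    return [(stat, A) for stat, _, A in sorted(top, reverse=True)]

The `initial_cutoff` argument corresponds to the initial pruning cutoff used in the experiments below; until $k$ candidates have been scored exactly, it is the only threshold available for pruning.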


One technical detail regarding the implementation is how the precomputed likelihoods are indexed. I use the bottom-left and top-right coordinates of each rectangle as the indexing key. I use two-dimensional binary trees to index the two coordinates of each rectangle. A first binary tree accepts a rectangle's bottom-left coordinate and returns the root node of another binary tree, which indexes the top-right coordinates of all the rectangles sharing the same bottom-left coordinate.
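The dissertation uses two nested binary trees for this index; the Python sketch below uses nested dictionaries instead, purely to illustrate the two-level lookup (bottom-left corner first, then top-right corner).

from collections import defaultdict

class RectIndex:
    """Two-level index over precomputed rectangles: the outer key is the
    bottom-left corner, the inner key is the top-right corner."""

    def __init__(self):
        self._by_bottom_left = defaultdict(dict)

    def put(self, bottom_left, top_right, log_likelihood):
        self._by_bottom_left[bottom_left][top_right] = log_likelihood

    def get(self, bottom_left, top_right):
        # Raises KeyError if the rectangle was never precomputed.
        return self._by_bottom_left[bottom_left][top_right]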


5.7 Experiments

I first perform a set of experiments on synthetic data designed to test the LRT framework's pruning power, effectiveness at discovering anomalies, and correctness. Then I instantiate the framework to solve two different, real-world, spatial anomaly detection problems with expensive MLEs. For both, I study the pruning power and the speed-up in wall-clock running time.

5.7.1 Experiment One: Synthetic Data

Experimental Goal. My goal is to answer two questions:

1. Does my precomputation actually cut down on the number of MLEs that need to be run in a realistic setting?

2. The LRT framework has a known asymptotic null distribution. If I use this fact along with a standard Bonferroni correction to take into account the multiple hypothesis test (MHT) problem induced by running a separate hypothesis test for each area in an $n \times n$ grid, will I (a) still be able to detect any subtle anomalies, and (b) be able to correctly recognize those cases where there is no anomaly in the underlying data?

Experimental Setup. For my experiments over synthetic data, I used a simple binomial likelihood model. This model is chosen because performing the required MLE is fast, so I can repeatedly run and re-run tests over reasonably large grids.

My setup is as follows. For each cell $c$ in the grid, I randomly generate a population size $n_c$. Then the number of "successes" $k_c$ in the cell is generated by sampling from a Bin($n_c$, $p$) random variable with success rate $p$.

Given a testing rectangle $A$, denote the "success" rate within $A$ by $p$ and within $\bar{A}$ by $q$. The null hypothesis asserts that $p = q$, and the alternative asserts that $p \neq q$. The $MLE_0$ routine requires a single division to estimate $p$: $\hat{p} = \sum_{c \in A \cup \bar{A}} k_c \big/ \sum_{c \in A \cup \bar{A}} n_c$. The $MLE_c$ routine is just two invocations of $MLE_0$, one over the data within the testing region, the other over the data outside the testing region. Since the complete parameter space has only one more dimension than the null parameter space, the null distribution is chi-squared with one degree of freedom.
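To make this model concrete, a minimal Python sketch of the per-region computation follows (an illustration, not the experimental implementation). The log-binomial-coefficient term is omitted because it cancels in the likelihood ratio.

import math
from scipy.stats import chi2

def loglik(cells, p):
    """Binomial log-likelihood of (k_c, n_c) pairs at success rate p,
    dropping the binomial-coefficient term, which cancels in the ratio."""
    p = min(max(p, 1e-12), 1 - 1e-12)   # guard against log(0)
    return sum(k * math.log(p) + (n - k) * math.log(1 - p) for k, n in cells)

def lrt(inside, outside):
    """inside/outside: lists of (k_c, n_c) for the cells in A and in A-bar.
    Returns the LRT statistic and its asymptotic chi-squared(1) p-value."""
    both = inside + outside
    p0 = sum(k for k, _ in both) / sum(n for _, n in both)           # MLE_0
    p_in = sum(k for k, _ in inside) / sum(n for _, n in inside)     # MLE_c, inside A
    p_out = sum(k for k, _ in outside) / sum(n for _, n in outside)  # MLE_c, outside A
    stat = 2.0 * (loglik(inside, p_in) + loglik(outside, p_out) - loglik(both, p0))
    return stat, chi2.sf(stat, df=1)

In the experiments, the Bonferroni correction mentioned above is then applied on top of this per-region p-value to account for the number of regions tested.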


Table 5-1. Average pruning rates and accuracy for 50 random trials on a 128 x 128 grid. The hotspot size is 4 x 3.

Test                     Avg Pruning Rate    Accuracy
Null (standard)          99.9994%            no false alarm
Null (city population)   99.9996%            no false alarm
Hotspot p = 0.003        99.9712%            100%
Hotspot p = 0.01         99.9722%            100%

For my tests, I run 50 trials on a grid of size 128 x 128. To test the scalability, I also consider a grid of size 256 x 256. My initial cutoff value for pruning was chosen to correspond to a false positive rate of 5%. In all tests, I seek the top subregion.

I ran two different sets of experiments where the null hypothesis was in fact true. In the first, the underlying $n_c$ values are relatively uniform: $n_c$ was always sampled from a Normal($\mu = 1 \times 10^4$, $\sigma = 1 \times 10^3$) distribution. This simulates the case where I am searching through a relatively uniform spatial population, with no dense regions. The "success" rate for each cell was set to 0.001. In the second case, there was a single densely-populated region, or simulated "city". I randomly selected an area of size 12 x 12, and within this area, $n_c$ was always sampled from a Normal($\mu = 1 \times 10^5$, $\sigma = 5 \times 10^3$) distribution. Using these cell populations, $k_c$ was then generated for each cell as described above. I ran my framework over each generated database instance, recorded the pruning rate (the number of rectangles pruned divided by the total number of rectangles), and checked whether there were any false alarms.

I also simulated a case where the alternative hypothesis holds. The setup was the same as the standard null case, except that I randomly pick a 4 x 3 region as the "hot spot". In one set of tests (the "subtle anomaly" case), the success rate for generating each $k_c$ was set to 0.003. In the other (the "extreme anomaly" case), the success rate was set to 0.01. Results are given in Tables 5-1 and 5-2.
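For concreteness, the generation procedure described above can be sketched as follows (illustrative only; the function name and the way regions are passed as (row, col, height, width) tuples are conveniences of the sketch, not part of the dissertation's code):

import numpy as np

def generate_grid(size=128, base_rate=0.001, hotspot=None, hotspot_rate=0.003,
                  city=None, seed=0):
    """Return per-cell populations n_c and success counts k_c.
    `hotspot` and `city` are optional (row, col, height, width) regions."""
    rng = np.random.default_rng(seed)
    n = rng.normal(1e4, 1e3, size=(size, size)).clip(min=1).astype(int)
    if city is not None:                      # densely populated simulated "city"
        r, c, h, w = city
        n[r:r + h, c:c + w] = rng.normal(1e5, 5e3, size=(h, w)).clip(min=1).astype(int)
    p = np.full((size, size), base_rate)
    if hotspot is not None:                   # region where the alternative holds
        r, c, h, w = hotspot
        p[r:r + h, c:c + w] = hotspot_rate
    k = rng.binomial(n, p)
    return n, k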


Table 5-2. Average pruning rates and accuracy for 7 random trials on a 256 x 256 grid. The hotspot size is 7 x 9.

Test                 Avg Pruning Rate    Accuracy
Null (standard)      99.9999%            no false alarm
Hotspot p = 0.003    99.9996%            100%

Discussion. In general, the results show very high pruning power. Taking into account the precomputed MLEs, the pruning rates realized on the 128 x 128 grid would on average reduce the number of tests required by a factor of 31.1 to 31.4 compared to the naive algorithm. On the 256 x 256 grid, my framework would reduce the number of tests required by a factor of 63.4.

Also, with an overall 0.05-level test, my framework did not find any false alarms in tests where the null hypothesis held. The total lack of false alarms (even at a significance level of 0.05) may result from my application of a conservative Bonferroni correction. On the other hand, in both the "subtle anomaly" and "extreme anomaly" cases, the framework had 100% detection accuracy. Overall, these results show that the framework seems to be both safe and effective.

5.7.2 Experiment Two: The ARM Database

Application Description. 357 U.S. hospitals participate in the ARM program, where each year, for each (bacteria/antimicrobial drug) combination, each hospital submits the number of isolates (bacteria instances) tested, as well as the number of times that the bacteria was found to be susceptible to the given drug. As described in the introduction, my goal is to find local spatial regions where the temporal trend of antimicrobial resistance change within the region is significantly different from the trend outside of the region.

Experimental Goal. I wish to experimentally test whether my LRT framework can be used to effectively speed up the naive algorithm. There are two primary questions I wish to answer:

1. What is the pruning rate of my LRT framework over the real data set, with a realistic likelihood model?

2. What is the speedup in wall-clock running time compared to the naive algorithm?


Instantiating the LRT Framework. To make use of the LRT framework, I provide instances of $f$, $L$, $MLE_c$, and $MLE_0$ as follows:

1. For each cell of the grid, there may be one or more hospitals. Function $f$ reports the number of susceptible cases (denoted by $x$) and the number of isolates (denoted by $n$) for each hospital in the cell over a fixed five-year span. That is, if $i$ indexes the hospitals in a cell and $j$ indexes the years, then $f$ reports for the current cell $(x_{ij}, n_{ij})$ for every $i$ and $j$.

2. For each hospital, I treat each year's data as a single binomial trial. If I use $p_{ij}$ to denote the susceptibility rate of hospital $i$'s data in year $j$, then I have the likelihood function $L$:

$L(\Theta \mid X) = \prod_i \prod_j \binom{n_{ij}}{x_{ij}} p_{ij}^{x_{ij}} (1 - p_{ij})^{n_{ij} - x_{ij}}$

I assume there is a linear relationship for each hospital's susceptibility rates; that is, $p_{ij} = kj + c_i$, where $c_i$ is an intercept parameter tailored to each hospital $i$, $k$ is the slope of this linear model, and $j$ is the year.

I am interested in testing whether any subset of hospitals exhibits a different trend than the rest of the hospitals. Therefore, I assume all of the hospitals within a test region $A$ have the same trend $k_A$, and all of the hospitals outside region $A$ have the same trend $k_{\bar{A}}$. Then I have the following competing hypotheses:

$H_0$: $k_A = k_{\bar{A}}$
$H_a$: $k_A \neq k_{\bar{A}}$

If I replace $p_{ij}$ with $kj + c_i$ and compute the likelihood in log space, I have the null log-likelihood:

$\log L(\Theta_0 \mid X) = \sum_i \sum_j \big[ \log\binom{n_{ij}}{x_{ij}} + x_{ij}\log(kj + c_i) + (n_{ij} - x_{ij})\log(1 - kj - c_i) \big]$   (5-2)

In Equation 5-2, $k = k_A = k_{\bar{A}}$ and $\Theta_0 = \{k, c_i\text{'s}\}$. Similarly, under the unrestricted parameter space I allow different $k_A$ and $k_{\bar{A}}$. Then I have:

$\log L(\Theta_c \mid X) = \sum_{i \in A} \sum_j \big[ \log\binom{n_{ij}}{x_{ij}} + x_{ij}\log(k_A j + c_i) + (n_{ij} - x_{ij})\log(1 - k_A j - c_i) \big] + \sum_{i \in \bar{A}} \sum_j \big[ \log\binom{n_{ij}}{x_{ij}} + x_{ij}\log(k_{\bar{A}} j + c_i) + (n_{ij} - x_{ij})\log(1 - k_{\bar{A}} j - c_i) \big]$   (5-3)

In Equation 5-3, $k_A$ and $k_{\bar{A}}$ do not constrain each other, and $\Theta_c = \{k_A, k_{\bar{A}}, c_i\text{'s}\}$.

3. $MLE_0$ implements a numerical optimization routine for maximizing the value of $L$ in Equation 5-2; numerical methods are required due to the lack of an analytic solution. (A small code sketch of this null log-likelihood follows the list.)

4. $MLE_c$ simply invokes $MLE_0$ on $X_A$ and $X_{\bar{A}}$ independently, since each shared parameter is also in the test set.
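To make Equation 5-2 concrete, here is a minimal Python sketch of the null log-likelihood for a given slope $k$ and per-hospital intercepts (an illustration only; taking the year index $j$ as 0, 1, ... is an assumption of the sketch). $MLE_0$ would maximize this quantity numerically over $k$ and the $c_i$.

import math

def null_loglik(data, k, c):
    """Equation 5-2. `data` maps hospital i to its list of (x_ij, n_ij) pairs,
    one per year j; `c` maps hospital i to its intercept c_i; k is the shared
    slope. Returns -inf when a fitted rate k*j + c_i leaves the interval (0, 1)."""
    total = 0.0
    for i, years in data.items():
        for j, (x, n) in enumerate(years):
            p = k * j + c[i]          # linear trend in the susceptibility rate
            if not 0.0 < p < 1.0:
                return float("-inf")
            total += (math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
                      + x * math.log(p) + (n - x) * math.log(1 - p))
    return total

$MLE_c$ for a region $A$ then amounts to two independent maximizations of this function, one over the hospitals inside $A$ and one over those outside it.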


Experiment Setup. For the bacteria-antimicrobial combination of S. aureus and Nafcillin, I extracted data from the ARM database for 203 hospitals that provided data in the time period from 2000 to 2004, inclusive. All of the hospitals were organized into an $n \times n$ spatial grid. I first sort the hospitals using the longitude of their physical locations. They are then grouped into $n$ equi-depth buckets. The placement of hospitals into the $n$ buckets determines which column of the grid each hospital belongs to. Similarly, I sort all hospitals on latitude, partition them $n$ ways, and use the partitioning to determine the rows.

I tested three different grid sizes, $n = 16$, 32, and 64, and used my framework to find the spatial region that most strongly rejects the null hypothesis. For each grid size, I recorded both the pruning rate and the wall-clock running time. Since $MLE_0$ requires one second to run over the entire data set, there is a strong relationship between the pruning rate and the running time; enumerating all of the cells in the grid is inexpensive compared to running one or more MLEs for each area in the grid. Also, due to the high cost of running the MLE, in order to compare my methods with the naive method of simply running an MLE over each area, I had to resort to sampling. For each iteration of the naive method of Algorithm 1, before I invoke $MLE_0$ over $A$ and $\bar{A}$, I flip a coin having a 1% chance of coming up "heads". If the result is "heads", I compute the LRT statistic by running the MLE; otherwise, I skip the rectangle. While the result of running this sampling-based algorithm is itself useless, its running time is expectedly 1/100 of the time required to run the naive algorithm to completion. All results are presented in Table 5-3.


Table 5-3. LRT framework time (in days) vs. the naive algorithm for the "trend" model.

n x n      LRT time    LRT pruning    Naive time    Speedup
16 x 16    0.15        96.5286%       2.6           17.3
32 x 32    1.13        97.6303%       35.9          31.8
64 x 64    11.9        98.0431%       544           45.7

Table 5-4. LRT framework time (in days) vs. the naive algorithm for the "wage" model.

n x n      LRT time    LRT pruning    Naive time    Speedup
16 x 16    0.29        78.8396%       1.28          4.41
32 x 32    2.01        86.1219%       14.0          6.97

Discussion. I observed a speedup of between one and two orders of magnitude with respect to the naive algorithm, with the speedup increasing with increasing grid size. The results are quite striking. For example, in the 64 x 64 case, my framework took about 12 days to finish running. On the other hand, the naive method is estimated to take 544 days, or about one and a half years. Thus, in this real application, the pruning strategies that the proposed framework utilizes can turn a computation that is totally infeasible into one that may still be expensive, but is possible in a time frame that is likely acceptable in most epidemiological settings.

An additional, interesting observation is that the speedup is actually more significant than what the observed pruning rates would predict. The reason is that a precomputed $MLE_0$ is generally less expensive than an $MLE_c$ that is run for an arbitrary area $A$ in the grid. Each precomputed MLE is obtained by running $MLE_0$ over the data contained in a subregion of the grid, and the subregions considered in the precomputation tend to be relatively small. Since $MLE_0$ requires time that is linear in the number of hospitals in the input set, the average cost is low. However, the naive method must invoke $MLE_0$ twice for each $A$: once over $X_A$, and once over $X_{\bar{A}}$. Since $X_A \cup X_{\bar{A}}$ covers the entire data set, the two $MLE_0$ invocations require about the same time as $MLE_0$ run over the entire data set. As a consequence, each test required by the naive method is relatively expensive, and the naive method suffers.


5.7.3 Experiment Three: Census Data

Application. My data comes from a 5% sample of the 2000 U.S. census data.¹ Each database record contains data describing a single person, where I know (a) the person's annual wage, (b) the person's highest educational attainment level, and (c) the place where the person works. Annual wage is an integral number. There are eighteen classes of educational attainment, such as Bachelor's degree, Master's degree, etc. For privacy reasons, the place-of-work attribute in this data is described in terms of the Public Use Microdata Area (PUMA), a Census Bureau-defined area of 100,000+ residents; I use the center of each PUMA to place each person into a spatial grid placed over the U.S.A. Given this data, I wish to ask the following question:

Are fluctuations in the level of income across the country totally explained by differences in education levels across the country, or are there real differences in income level according to the spatial region where people work?

This question can be re-cast in a slightly different way that is quite amenable to analysis using my proposed framework:

If I condition each person's income level upon the level of education that they obtained, are there spatial subregions of the country where the income level is significantly different from the rest?

Experimental Goal. I wish to experimentally test whether my LRT framework can be used to effectively speed a realistic, large-scale data analysis problem; there are 1.5 million records in the set I consider. Furthermore, the problem is made even more realistic by the fact that the data are not clean: some of the records have a missing label for the level of educational attainment, which is an issue that must be dealt with in a statistically rigorous fashion. I use an EM algorithm to deal with the missing data.

¹ http://usa.ipums.org


Instantiating the LRT Framework. To use the LRT framework, I again need to provide the following components:

1. For each cell of the grid, there may be many co-located PUMAs. Function $f$ reports each respondent's annual wage, together with the person's education level, for each PUMA in the cell. Note that some respondents may choose not to disclose their education level; in that case, $f$ reports the wage with a missing-value flag. For PUMA $i$ in area $A$, denote the set of labeled data returned by $f$ by $X^L_i$, and denote the set of unlabeled data by $X^U_i$. Each element of $X^L_i \cup X^U_i$ takes the form $(x, y)$, where $x$ is the reported income and $y$ is a label indicating educational attainment, with $y \in \{0, 1, \ldots, 18\}$. A zero indicates that the person's education level is missing.

2. The likelihood function I use to model the wage data is based on a Gamma mixture model; the Gamma distribution is chosen to model income because it can model arbitrary levels of skew, and it produces values that are strictly positive. I describe the wage distribution within each PUMA using a mixture with 18 component distributions, each of which represents the wage distribution of one educational attainment class. I use $w_{ij}$ to denote the weight of the $j$th component for PUMA $i$; this is the probability that an arbitrary person in PUMA $i$ achieves an education level of $j$. $(\alpha_j, \beta_j)$ denotes the Gamma distribution's parameters for the $j$th component. Then $L$ is:

$L(\Theta \mid X) = \prod_i \Big[ \prod_{(x,y) \in X^L_i} w_{iy}\, p(x \mid \alpha_y, \beta_y) \prod_{x \in X^U_i} \sum_{j=1}^{18} w_{ij}\, p(x \mid \alpha_j, \beta_j) \Big]$

I am interested in testing whether the component distribution representing each education class is the same across the country. Therefore, I assume all of the PUMAs within a testing region $A$ have, for each $j \in \{1, \ldots, 18\}$, the same education-class distribution Gamma$(x \mid \alpha^A_j, \beta^A_j)$, and all of the PUMAs outside $A$ have the same class distribution Gamma$(x \mid \alpha^{\bar{A}}_j, \beta^{\bar{A}}_j)$. Then I have the following competing hypotheses:

$H_0$: $\alpha^A_j = \alpha^{\bar{A}}_j$ and $\beta^A_j = \beta^{\bar{A}}_j$ for all $j \in \{1, \ldots, 18\}$
$H_a$: $\alpha^A_j \neq \alpha^{\bar{A}}_j$ or $\beta^A_j \neq \beta^{\bar{A}}_j$ for some $j \in \{1, \ldots, 18\}$

Under the null hypothesis, $\alpha_j = \alpha^A_j = \alpha^{\bar{A}}_j$ and $\beta_j = \beta^A_j = \beta^{\bar{A}}_j$, and so the null log-likelihood is:


$\log L(\Theta_0 \mid X) = \sum_i \sum_{(x,y) \in X^L_i} \log(w_{iy}) + \sum_i \sum_{(x,y) \in X^L_i} \log p(x \mid \alpha_y, \beta_y) + \sum_i \sum_{x \in X^U_i} \log \sum_{j=1}^{18} w_{ij}\, p(x \mid \alpha_j, \beta_j)$   (5-4)

In Equation 5-4, $\Theta_0 = \{\alpha_j, \beta_j, w_{ij}\}$ for every $i$ and $j$. Similarly, under the unrestricted parameter space where I allow different $(\alpha^A_j, \beta^A_j)$ and $(\alpha^{\bar{A}}_j, \beta^{\bar{A}}_j)$ values, I have:

$\log L(\Theta_c \mid X) = \sum_{i \in A} \sum_{(x,y) \in X^L_i} \log(w_{iy}) + \sum_{i \in A} \sum_{(x,y) \in X^L_i} \log p(x \mid \alpha^A_y, \beta^A_y) + \sum_{i \in A} \sum_{x \in X^U_i} \log \sum_{j=1}^{18} w_{ij}\, p(x \mid \alpha^A_j, \beta^A_j) + \sum_{i \in \bar{A}} \sum_{(x,y) \in X^L_i} \log(w_{iy}) + \sum_{i \in \bar{A}} \sum_{(x,y) \in X^L_i} \log p(x \mid \alpha^{\bar{A}}_y, \beta^{\bar{A}}_y) + \sum_{i \in \bar{A}} \sum_{x \in X^U_i} \log \sum_{j=1}^{18} w_{ij}\, p(x \mid \alpha^{\bar{A}}_j, \beta^{\bar{A}}_j)$   (5-5)

In Equation 5-5, $\Theta_c = \{\alpha^A_j, \beta^A_j, \alpha^{\bar{A}}_j, \beta^{\bar{A}}_j, w_{ij}\}$ for every $i$ and $j$.

3. $MLE_0$ in this problem is an implementation of the Expectation-Maximization (EM) algorithm [16], which is chosen so that I can handle the MLE over the unlabeled data in a statistically meaningful fashion, without simply throwing it away. Since EM only converges to a local (and not necessarily a global) maximum, my implementation performs several random restarts, though in practice I find that no matter what the starting point, the end quality of $MLE_0$ does not change significantly enough to affect the LRT statistic. For brevity, I omit the details of the derived EM.

4. $MLE_c$ is two invocations of $MLE_0$, with $X_A$ and $X_{\bar{A}}$.
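As an illustration of Equation 5-4 (a sketch only, with the EM fitting itself omitted), the null log-likelihood can be evaluated with SciPy's Gamma density; mapping $(\alpha_j, \beta_j)$ onto SciPy's shape and scale parameters is an assumption of this sketch.

import numpy as np
from scipy.stats import gamma

def null_loglik(pumas, alpha, beta, weights):
    """Equation 5-4. `pumas` maps PUMA i to (labeled, unlabeled), where labeled
    is a list of (wage, class) pairs with class in 1..18 and unlabeled is a list
    of wages. `alpha`, `beta`, and each `weights[i]` are length-19 NumPy arrays
    indexed by class (index 0 unused)."""
    classes = np.arange(1, 19)
    total = 0.0
    for i, (labeled, unlabeled) in pumas.items():
        w = weights[i]
        for x, y in labeled:            # education class observed
            total += np.log(w[y]) + gamma.logpdf(x, a=alpha[y], scale=beta[y])
        for x in unlabeled:             # education class marginalized out
            mix = w[classes] * gamma.pdf(x, a=alpha[classes], scale=beta[classes])
            total += np.log(mix.sum())  # a log-sum-exp would be safer numerically
    return total

$MLE_0$ fits the $\alpha_j$, $\beta_j$, and $w_{ij}$ with EM; $MLE_c$ repeats the fit separately inside and outside $A$, as in item 4 above.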


Experimental Setup. I extracted data for fourteen contiguous U.S. states, forming a rectangle from Arizona to Indiana and covering 330 PUMAs. The 330 PUMAs were organized into an $n \times n$ grid using a method identical to the one used for the ARM data. I used $n = 16$ and 32. I used my framework to look for the top subregion, that is, the one with the largest LRT statistic value. Since $MLE_0$ relies on an iterative EM implementation, it is expensive to run. I used sampling to obtain the estimated wall-clock time for the naive method. The results are presented in Table 5-4.

Discussion. I find that I still achieve good performance compared to the naive method, even at a small grid size. However, the pruning rate and speedup, while significant, are not as dramatic as with the other two data sets. It appears that there is some particular aspect of this application which results in the bound computed by the framework not being quite as tight as in the other two cases. Still, in the case of the 32 x 32 grid, the framework is able to reduce the required time from two weeks down to two days.

5.8 Conclusions

In this chapter, I have considered the problem of detecting spatial anomalies in an $n \times n$ grid using an LRT with a chi-squared asymptotic null distribution. My focus is on using pruning to reduce the magnitude of the constant $c$ in the underlying $O(cn^4)$ computation in the case where $c$ is large due to an expensive LRT statistic. Experiments in real-life applications show that my methods are effective at reducing $c$ by almost two orders of magnitude.


REFERENCES

[1] D. Agarwal, A. McGregor, J. M. Phillips, S. Venkatasubramanian, and Z. Zhu. Spatial scan statistics: approximations and performance study. In SIGKDD International Conference Proceedings, pages 24-33, 2006.

[2] D. Agarwal, J. M. Phillips, and S. Venkatasubramanian. The hunting of the bump: On maximizing statistical discrepancy. In SODA International Conference Proceedings, pages 1137-, 2006.

[3] C. C. Aggarwal and P. S. Yu. Outlier detection for high dimensional data. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 37-46, May 2001.

[4] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In SIGMOD International Conference Proceedings, pages 94-105, 1998.

[5] F. Angiulli and C. Pizzuti. Outlier detection for high dimensional data. In Proceedings of the 6th European Conference on Principles and Practice of Knowledge Discovery in Databases, August 2002.

[6] L. Arge, O. Procopiuc, S. Ramaswamy, T. Suel, and J. S. Vitter. Scalable sweeping-based spatial join. In VLDB International Conference Proceedings, pages 570-581, 1998.

[7] V. Barnett. Outliers in Statistical Data. John Wiley and Sons, 1994.

[8] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In KDD International Conference Proceedings, pages 29-38, 2003.

[9] M. S. Bazaraa, H. D. Sherali, and C. M. Shetty. Nonlinear Programming: Theory and Algorithms. Wiley, 1993.

[10] J. Bilmes. A gentle tutorial on the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. Technical Report ICSI-TR-97-021, University of Berkeley, 1997.

[11] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: identifying density-based local outliers. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 93-104, May 2000.

[12] T. Brinkhoff, H.-P. Kriegel, and B. Seeger. Efficient processing of spatial joins using R-trees. In SIGMOD International Conference Proceedings, pages 237-246, 1993.

[13] L. Chen and R. T. Ng. On the marriage of Lp-norms and edit distance. In VLDB International Conference Proceedings, pages 792-803, August 2004.


[14] D. B. Neill, A. W. Moore, and G. F. Cooper. A Bayesian spatial scan statistic. In NIPS International Conference Proceedings, 2005.

[15] M. Dayhoff, R. M. Schwartz, and B. C. Orcutt. A model of evolutionary change in proteins. Atlas of Protein Sequence and Structure, 5:345-352, 1978.

[16] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1-38, 1977.

[17] D. Donjerkovic and R. Ramakrishnan. Probabilistic optimization of top N queries. In VLDB International Conference Proceedings, pages 411-422, 1999.

[18] S. Dudoit, J. P. Shaffer, and J. C. Boldrick. Multiple hypothesis testing in microarray experiments. Statistical Science, 18:71-103, 2003.

[19] C. Faloutsos, R. Barber, M. Flickner, J. Hafner, W. Niblack, D. Petkovic, and W. Equitz. Efficient and effective querying by image content. Journal of Intelligent Information Systems, 3(3/4):231-262, 1994.

[20] P. J. Haas and J. M. Hellerstein. Ripple joins for online aggregation. In SIGMOD International Conference Proceedings, pages 287-298, 1999.

[21] D. Hawkins. Identification of Outliers. Chapman and Hall, 1980.

[22] J. M. Hellerstein, P. J. Haas, and H. J. Wang. Online aggregation. In SIGMOD International Conference Proceedings, pages 171-182, 1997.

[23] G. R. Hjaltason and H. Samet. Incremental distance join algorithms for spatial databases. In SIGMOD International Conference Proceedings, pages 237-248, 1998.

[24] V. Hodge and J. Austin. A survey of outlier detection methodologies. Artificial Intelligence Review, 22(2):85-126, 2004.

[25] W.-C. Hou and G. Ozsoyoglu. Statistical estimators for aggregate relational algebra queries. ACM Transactions on Database Systems, 16(4):600-654, 1991.

[26] R. R. Kinnison. Applied Extreme Value Statistics. Macmillan Publishing Company, 1985.

[27] E. M. Knorr, R. T. Ng, and V. Tucakov. Distance-based outliers: Algorithms and applications. VLDB Journal, 8(3-4):237-253, February 2000.

[28] G. Kollios, D. Gunopulos, N. Koudas, and S. Berchtold. Efficient biased sampling for approximate clustering and outlier detection in large data sets. IEEE Transactions on Knowledge and Data Engineering, 15(5), 2003.

[29] G. Kollios, D. Gunopulos, N. Koudas, and S. Berchtold. Efficient biased sampling for approximate clustering and outlier detection in large data sets. IEEE Transactions on Knowledge and Data Engineering, 15(5), 2003.


[30] M. Kulldorff. A spatial scan statistic. Communications in Statistics: Theory and Methods, 26(6):1481-1496, 1997.

[31] M. Kulldorff. Spatial scan statistics: models, calculations, and applications. In Scan Statistics and Applications, pages 303-322, 1999.

[32] M. Kulldorff and N. Nagarwalla. Spatial disease clusters: detection and inference. Statistics in Medicine, 14:799-810, 1995.

[33] P. M. Lee. Bayesian Statistics: An Introduction. A Hodder Arnold Publication, 1997.

[34] E. Lehmann. Elements of Large-Sample Theory. Springer, 1998.

[35] M.-L. Lo and C. V. Ravishankar. Spatial hash-joins. In SIGMOD International Conference Proceedings, pages 247-258, 1996.

[36] J. Maritz and A. Munro. On the use of the generalized extreme-value distribution in estimating extreme percentiles. Biometrics, 23:79-103, 1976.

[37] S. Needleman and C. D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48:443-453, 1970.

[38] D. B. Neill and A. W. Moore. A fast multi-resolution method for detection of significant spatial disease clusters. In NIPS International Conference Proceedings, pages 256-265, 2003.

[39] D. B. Neill and A. W. Moore. Rapid detection of significant spatial clusters. In SIGKDD International Conference Proceedings, pages 256-265, 2004.

[40] D. B. Neill, A. W. Moore, M. Sabhnani, and K. Daniel. Detection of emerging space-time clusters. In SIGKDD International Conference Proceedings, pages 218-227, 2005.

[41] F. Olken. Random Sampling from Databases. PhD thesis, 1993.

[42] B. Presnell and J. Booth. Resampling methods for sample surveys. Technical Report 470, Department of Statistics, University of Florida, Gainesville, FL, 1994.

[43] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In SIGMOD International Conference Proceedings, pages 427-438, May 2000.

[44] E. S. Ristad and P. N. Yianilos. Learning string-edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(5), 1998.

[45] C. P. Robert and G. Casella. Monte Carlo Statistical Methods. Springer-Verlag, 2004.


[46] C.-E. Sarndal, B. Swensson, and J. Wretman. Model Assisted Survey Sampling. Springer-Verlag, 1992.

[47] G. Schaefer and M. Stich. UCID: an uncompressed colour image database. In SPIE, Storage and Retrieval Methods and Applications for Multimedia, pages 472-480, 2004.

[48] T. Seidl and H.-P. Kriegel. Efficient user-adaptable similarity search in large multimedia databases. In VLDB International Conference Proceedings, pages 506-515, 1997.

[49] H. Shin, B. Moon, and S. Lee. Adaptive multi-stage distance join processing. In SIGMOD International Conference Proceedings, pages 343-354, 2000.

[50] J. R. Smith and S.-F. Chang. VisualSEEk: a fully automated content-based image query system. In Proceedings of the Fourth ACM International Conference on Multimedia '96, pages 87-98, November 1996.

[51] Y. Tao, X. Xiao, and S. Zhou. Mining distance-based outliers from large databases in any metric space. In KDD International Conference Proceedings, pages 394-403, 2006.

[52] W. Wang, J. Yang, and R. R. Muntz. STING: A statistical information grid approach to spatial data mining. In VLDB International Conference Proceedings, pages 186-195, 1997.

[53] S. S. Wilks. The large sample distribution of the likelihood ratio for testing composite hypotheses. Annals of Mathematical Statistics, 9:60-62, 1938.


BIOGRAPHICAL SKETCH

Mingxi Wu received his B.S. in computer science from Fudan University in 2000, and his M.S. and Ph.D. degrees in computer science from the University of Florida (UF) in 2008. In his graduate study at UF, Mingxi conducted research in the data mining and data management fields. Most of his research addresses performance issues using statistical theory and machine learning techniques. In the summer of 2007, he was a software development intern on the Microsoft SQL Server product team, where he built a fault-tolerant distributed system for the manageability team. Upon graduation, he will join the Query Optimizer group as a Technical Staff Member at Oracle Inc.