Citation
Efficient Algorithms for Spatiotemporal Data Management

Material Information

Title:
Efficient Algorithms for Spatiotemporal Data Management
Creator:
Arumugam, Subramanian
Place of Publication:
[Gainesville, Fla.]
Florida
Publisher:
University of Florida
Publication Date:
Language:
english
Physical Description:
1 online resource (123 p.)

Thesis/Dissertation Information

Degree:
Doctorate ( Ph.D.)
Degree Grantor:
University of Florida
Degree Disciplines:
Computer Engineering
Computer and Information Science and Engineering
Committee Chair:
Jermaine, Christophe
Committee Members:
Yavuz, Tuba
Kahveci, Tamer
Dobra, Alin
Koehler, Gary J.
Graduation Date:
8/9/2008

Subjects

Subjects / Keywords:
Buffer storage ( jstor )
Cost estimates ( jstor )
Databases ( jstor )
Datasets ( jstor )
Granularity ( jstor )
Indexing ( jstor )
Parametric models ( jstor )
Sensors ( jstor )
Ticks ( jstor )
Trajectories ( jstor )
Computer and Information Science and Engineering -- Dissertations, Academic -- UF
algorithms, database, entity, join, probabilistic, spatiotemporal, sprt
Genre:
bibliography ( marcgt )
theses ( marcgt )
government publication (state, provincial, territorial, dependent) ( marcgt )
born-digital ( sobekcm )
Electronic Thesis or Dissertation
Computer Engineering thesis, Ph.D.

Notes

Abstract:
This study focuses on interesting data management problems that arise in the analysis, modeling and querying of spatiotemporal data. Such data naturally arise in the context of many scientific and engineering applications that deal with physical processes that evolve over time. We first focus on the issue of scalable query processing, where the goal is to answer questions over massive spatiotemporal databases efficiently. We propose a novel adaptive algorithm based on the plane-sweep that dynamically adjusts to suit the characteristics of the underlying data. Next, we discuss a novel version of the entity resolution problem that appears in spatiotemporal sensor databases. We consider approaches to modeling spatiotemporal data to aid analysis in the presence of missing or incomplete information and propose a statistical learning-based approach to solving the problem. Finally, we consider statistical issues that arise when a user is allowed to specify location uncertainty of a moving object via arbitrary, pseudo-random functions that generate the data which populate various 'possible worlds.' We develop very general algorithms that can be used to estimate the probability that a relational predicate evaluates to true over a very large database of moving objects. ( en )
General Note:
In the series University of Florida Digital Collections.
General Note:
Includes vita.
Bibliography:
Includes bibliographical references.
Source of Description:
Description based on online resource; title from PDF title page.
Source of Description:
This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Thesis:
Thesis (Ph.D.)--University of Florida, 2008.
Local:
Adviser: Jermaine, Christophe.
Statement of Responsibility:
by Subramanian Arumugam.

Record Information

Source Institution:
UFRGP
Rights Management:
Copyright Arumugam, Subramanian. Permission granted to the University of Florida to digitize, archive and distribute this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.
Classification:
LD1780 2008 ( lcc )

Downloads

This item has the following downloads:


Full Text

PAGE 4

First of all, I would like to thank my advisor Chris Jermaine. This dissertation would not have been made possible had it not been for his excellent mentoring and guidance through the years. Chris is a terrific teacher, a critical thinker and a passionate researcher. He has served as a great role model and has helped me mature as a researcher. I cannot thank him more for that. My thanks also go to Prof. Alin Dobra. Through the years, Alin has been a patient listener and has helped me structure and refine my ideas countless times. His excitement for research is contagious! I would like to take this opportunity to mention my colleagues at the database center: Amit, Florin, Fei, Luis, Mingxi and Ravi. I have had many hours of fun discussing interesting problems with them. Special thanks go to my friends Manas, Srijit, Arun, Shantanu, and Seema, for making my stay in Gainesville all the more enjoyable. Finally, I would like to thank my parents for being a source of constant support and encouragement throughout my studies.

PAGE 5

page
ACKNOWLEDGMENTS .... 4
LIST OF TABLES .... 8
LIST OF FIGURES .... 9
ABSTRACT .... 11
CHAPTER
1 INTRODUCTION .... 13
  1.1 Motivation .... 13
  1.2 Research Landscape .... 14
    1.2.1 Data Modeling and Database Design .... 15
    1.2.2 Access Methods .... 16
    1.2.3 Query Processing .... 16
    1.2.4 Data Analysis .... 17
    1.2.5 Data Warehousing .... 17
  1.3 Main Contributions .... 18
    1.3.1 Scalable Join Processing over Massive Spatiotemporal Histories .... 18
    1.3.2 Entity Resolution in Spatiotemporal Databases .... 19
    1.3.3 Selection Queries over Probabilistic Spatiotemporal Databases .... 19
2 BACKGROUND .... 21
  2.1 Spatiotemporal Join .... 21
  2.2 Entity Resolution .... 22
  2.3 Probabilistic Databases .... 23
3 SCALABLE JOIN PROCESSING OVER SPATIOTEMPORAL HISTORIES .... 25
  3.1 Motivation .... 25
  3.2 Background .... 27
    3.2.1 Moving Object Trajectories .... 27
    3.2.2 Closest Point of Approach (CPA) Problem .... 28
  3.3 Join Using Indexing Structures .... 30
    3.3.1 Trajectory Index Structures .... 31
    3.3.2 R-tree Based CPA Join .... 32
  3.4 Join Using Plane-Sweeping .... 36
    3.4.1 Basic CPA Join using Plane-Sweeping .... 36
    3.4.2 Problem With The Basic Approach .... 37
    3.4.3 Layered Plane-Sweep .... 38
  3.5 Adaptive Plane-Sweeping .... 40
    3.5.1 Motivation .... 40

PAGE 6

    3.5.2 .... 41
    3.5.3 The Basic Adaptive Plane-Sweep .... 41
    3.5.4 Estimating Cost .... 43
    3.5.5 Determining The Best Cost .... 44
    3.5.6 Speeding Up the Estimation .... 46
    3.5.7 Putting It All Together .... 48
  3.6 Benchmarking .... 48
    3.6.1 Test Data Sets .... 48
    3.6.2 Methodology and Results .... 50
    3.6.3 Discussion .... 52
  3.7 Related Work .... 55
  3.8 Summary .... 57
4 ENTITY RESOLUTION IN SPATIOTEMPORAL DATABASES .... 58
  4.1 Problem Definition .... 60
  4.2 Outline of Our Approach .... 61
  4.3 Generative Model .... 63
    4.3.1 PDF for Restricted Motion .... 64
    4.3.2 PDF for Unrestricted Motion .... 65
  4.4 Learning the Restricted Model .... 66
    4.4.1 Expectation Maximization .... 67
    4.4.2 Learning K .... 69
  4.5 Learning Unrestricted Motion .... 71
    4.5.1 Applying a Particle Filter .... 72
    4.5.2 Handling Multiple Objects .... 73
    4.5.3 Update Strategy for a Sample given Multiple Objects .... 75
    4.5.4 Speeding Things Up .... 76
  4.6 Benchmarking .... 77
  4.7 Related Work .... 82
  4.8 Summary .... 83
5 SELECTION OVER PROBABILISTIC SPATIOTEMPORAL RELATIONS .... 84
  5.1 Problem and Background .... 86
    5.1.1 Problem Definition .... 87
    5.1.2 The False Positive Problem .... 87
    5.1.3 The False Negative Problem .... 90
  5.2 The Sequential Probability Ratio Test (SPRT) .... 91
  5.3 The End-Biased Test .... 95
    5.3.1 What's Wrong With the SPRT? .... 95
    5.3.2 Removing the Magic Epsilon .... 96
    5.3.3 The End-Biased Algorithm .... 97
  5.4 Indexing the End-Biased Test .... 103
    5.4.1 Overview .... 103
    5.4.2 Building the Index .... 104

PAGE 7

    5.4.3 .... 106
  5.5 Experiments .... 107
  5.6 Related Work .... 111
  5.7 Summary .... 113
6 CONCLUDING REMARKS .... 114
REFERENCES .... 115
BIOGRAPHICAL SKETCH .... 123

PAGE 8

Table page
4-1 Varying the number of objects and its effect on recall, precision and runtime. .... 80
4-2 Varying the number of time ticks. .... 80
4-3 Varying the number of sensors fired. .... 80
4-4 Varying the standard deviation of the Gaussian cloud. .... 80
4-5 Varying the number of time ticks where EM is applied. .... 81
5-1 Running times over varying database sizes. .... 109
5-2 Running times over varying query sizes. .... 109
5-3 Running times over varying object standard deviations. .... 109
5-4 Running times over varying confidence levels. .... 109

PAGE 9

Figure page
3-1 Trajectory of an object (a) and its polyline approximation (b) .... 28
3-2 Closest Point of Approach Illustration .... 29
3-3 CPA Illustration with trajectories .... 29
3-4 Example of an R-tree .... 33
3-5 Heuristic to speed up distance computation .... 34
3-6 Issues with R-trees - Fast moving object p joins with everyone .... 35
3-7 Progression of plane-sweep .... 36
3-8 Layered Plane-Sweep .... 38
3-9 Problem with using large granularities for bounding box approximation .... 40
3-10 Adaptively varying the granularity .... 42
3-11 Convexity of cost function illustration. .... 45
3-12 Iteratively evaluating k cut points .... 46
3-13 Speeding up the Optimizer .... 48
3-14 Injection data set at time tick 2,650 .... 49
3-15 Collision data set at time tick 1,500 .... 50
3-16 Injection data set experimental results .... 51
3-17 Collision data set experimental results .... 52
3-18 Buffer size choices for Injection data set .... 53
3-19 Buffer size choices for Collision data set .... 53
3-20 Synthetic data set experimental results .... 54
3-21 Buffer size choices for Synthetic data set .... 56
4-1 Mapping of a set of observations for linear motion .... 60
4-2 Object path (a) and quadratic fit for varying time ticks (b-d) .... 62
4-3 Object path in a sensor field (a) and sensor firings triggered by object motion (b) .... 64
4-4 The baseline input set (10,000 observations) .... 79

PAGE 10

4-5 .... 79
5-1 The SPRT in action. The middle line is the LRT statistic .... 92
5-2 Two spatial queries over a database of objects with Gaussian uncertainty .... 97
5-3 The sequence of SPRTs run by the end-biased test .... 98
5-4 Building the MBRs used to index the samples from the end-biased test. .... 104
5-5 Using the index to speed the end-biased test .... 106

PAGE 13

[1-4]. The advent of computational science and the increasing use of wireless technology, sensors, and devices such as GPS have resulted in numerous potential sources of spatiotemporal data. Large volumes of spatiotemporal data are produced by many scientific, engineering and business applications that track and monitor moving objects. "Moving objects" may be people, vehicles, wildlife, products in transit, or weather systems. Such applications often arise in the context of traffic surveillance and monitoring, land use management in GIS, simulation in astrophysics, climate monitoring in earth sciences, fleet management, multimedia animation, etc. The increasing importance of spatiotemporal data can be attributed to the improved reliability of tracking devices and their low cost, which has reduced the acquisition barrier for such data. Tracking devices have been adopted in varying degrees in a number of scientific and enterprise application domains. For instance, vehicles increasingly come equipped with GPS devices which enable location-based services [3]. Sensors play an increasingly important role in surveillance and monitoring of physical spaces [5]. Enterprises such as Walmart and Target and organizations like the Department of Defense (DoD) plan to track products in their supply chain through use of smart Radio Frequency Identification (RFID) labels [6].

PAGE 14

[7]. Developing scalable algorithms to support query processing over tera- and peta-byte-sized spatiotemporal data sets is a significant challenge.

PAGE 15

[1, 3].

[8]. Conventional data types employed in existing databases are often not suitable to represent spatiotemporal data which describe continuous time-varying spatial geometries. Thus, there is a need for a spatiotemporal type system that can model continuously moving data. Depending on whether the underlying spatial object has an extent or not, abstractions have been developed to model a moving point, line, and region in two- and three-dimensional space with time considered as the additional dimension [8-11]. Similarly, early work has also focused on refining existing CASE tools to aid in the design of spatiotemporal databases. Existing conceptual tools such as ER diagrams and UML present a non-temporal view of the world, and extensions to incorporate temporal and spatial awareness have been investigated [12, 13]. Recently there has been interest in designing flexible type systems that can model aspects of uncertainty associated with an object's spatial location [14]. There has also been active effort towards designing SQL language extensions for spatiotemporal data types and operations [15].

PAGE 16

[16] to incorporate the time dimension. Indexing structures designed to support predictive queries typically manage object movement within a small time window and need to handle frequent updates to object locations. A popular choice for such applications is the TPR-tree [17] and its many variants. On the other hand, index structures designed to support historical queries need to manage an object's entire past movement trajectory (for this reason they can be viewed as trajectory indexes). Depending on the time interval indexed, the sheer volume of data that needs to be managed presents significant technical challenges for overlap-allowing indexing schemes such as R-trees [16]. Thus, there has been interest in revisiting grid/cell-based solutions that do not allow overlap, such as SETI [18]. Several tree-based indexing structures have been developed such as STRIPES [19], 3D R-trees [20], TB-trees [21] and linear quadtrees [22]. Further, spatiotemporal extensions of several popular queries such as nearest-neighbor [23], top-k [24], and skyline [25] have been developed.

[26-28], continuous queries, joins [29, 30], and their efficient evaluation [31, 32]. In the same vein, there has also been some preliminary work on optimizing spatiotemporal selection queries [33, 34].

PAGE 17

[35], data classification and generalization [36], trajectory clustering and rule mining [37-39], and supporting interactive visualization for browsing large spatiotemporal collections [40].

PAGE 18

[41, 42] is relatively new and is focused on refining existing multidimensional models to support continuous data and defining semantics for spatiotemporal aggregation [43, 44].

[1]. The three specific problems considered are described briefly in the following subsections.

PAGE 21

[45]. Their approach assumes the existence of a hierarchical spatial index, such as an R-tree [16], on the underlying relations. The join Brinkhoff proposes makes use of a carefully synchronized depth-first traversal of the underlying indices to narrow down the candidate pairs. A breadth-first strategy with several additional optimizations is considered by Huang et al. [46]. Lo and Ravishankar [47] explore a non-index based approach to processing a spatial join. They consider how to extend the traditional hash join algorithm to the spatial join problem and propose a strategy based on a partitioning of the database objects into extent mapping hash buckets. A similar idea, referred to as the partition-based spatial merge (PBSM), is considered by Patel et al. [48]. Instead of partitioning the input data objects, they consider a grid partitioning of the data space onto which objects are mapped. This idea is further extended by Lars et al. [49], where they propose a dynamic partitioning of the input space into vertical strips. Their strategy avoids the data spill problem encountered by previous approaches since the strips can be constructed such that they fit within the available main memory. A common theme among existing approaches is their use of the plane-sweep [50] as a fast pruning technique. In the case of index-based algorithms, plane-sweep is used to filter candidate node pairs enumerated from the traversal. Non-index based algorithms make use of the plane-sweep to construct candidate sets over partitions.

PAGE 22

[51]. However, they only consider spatiotemporal join techniques that are straightforward extensions to traditional spatial join algorithms. Further, they limit their scope to index-based algorithms for objects over limited time windows.

[52-55] and has focused mainly on integrating non-geometric, string-based data from noisy external sources. Closely related to the work in this thesis is the large body of work on target tracking that exists in fields as diverse as signal processing, robotics, and computer vision. The goal in target tracking [56, 57] is to support the real-time monitoring and tracking of a set of moving objects from noisy observations. Various algorithms to classify observations among objects can be found in the target tracking literature. They characterize the problem as one of data association (i.e., associating observations with corresponding targets). A brief summary of the main ideas is given below. The seminal work is due to Reid [58], who proposed a multiple hypothesis technique (MHT) to solve the tracking problem. In the MHT approach, a set of hypotheses is maintained, with each hypothesis reflecting the belief on the location of an individual target. When a new set of observations arrives, the hypotheses are updated. Hypotheses with minimal support are deleted and additional hypotheses are created to reflect new evidence. The main drawback of the approach is that the number of hypotheses can grow exponentially over time. Though heuristic filters [59-61] can be used to bound the search space, it limits the scalability of the algorithm. Target tracking also has been studied using Bayesian approaches [62]. The Bayesian approach views tracking as a state estimation problem. Given some initial state and a set of observations, the goal is to predict the object's next state. An optimal solution to the problem is given by the Bayes filter [63, 64]. Bayes filters produce optimal estimates by

PAGE 23

[57] and sequential Monte Carlo techniques [63] are often used in practice. Recently, Markov Chain Monte Carlo (MCMC) [65, 66] techniques have been proposed. MCMC techniques attempt to approximate the optimal Bayes filter for multiple target tracking. MCMC-based methods employ sequential MC sampling and are shown to perform better than existing sub-optimal approaches such as MHT for tracking objects in highly cluttered environments. A common theme among most of the research in target tracking is its focus on accurate tracking and detection of objects in real time in highly cluttered environments over relatively short time periods. In a data warehouse context, the ability of techniques such as MCMC to make fine-grained distinctions makes them ideal candidates when performing operations such as drill-down that involve analytics over small time windows. Their applicability to entity resolution in a data warehouse is limited, however. In such a context, summarization and visualization of historical trajectories smoothed over long time intervals is often more useful. The model-based approach considered in this work seems a more suitable candidate for such tasks.

[9, 67]. In the context of query processing, one of the earliest papers in this area is the paper by Pfoser et al. [68], where different sources of uncertainty are characterized and a probability density function is used to model errors. Hosbond et al. [69] extended this work by employing a hypersquare uncertainty region, which expands over time, to answer queries using a TPR-tree.

PAGE 24

[70] study the problem from a modeling perspective. They model trajectories by a cylindrical volume in 3D and outline semantics of fuzzy selection queries over trajectories in both space and time. However, the approach does not specify how to choose the dimensions of the cylindrical region, which may have to change over time to account for shrinking or expanding of the underlying uncertainty region. Cheng et al. [71] describe algorithms for time instant queries (probabilistic range and nearest neighbor) using an uncertainty model where a probability density function (PDF) and an uncertain region are associated with each point object. Given a location in the uncertain region, the PDF returns the probability of finding the object at that location. A similar idea is used by Tao et al. [72] to answer queries in spatial databases. To handle time interval queries, Mokhtar et al. [73] represent uncertain trajectories as a stochastic process with a time-parametric uniform distribution.

PAGE 26

[7] and the references contained therein). In this chapter, the spatial-temporal join problem for moving object histories in three-dimensional space, with time considered as the fourth dimension, is investigated. The spatiotemporal join operation considered is the CPA Join (Closest-Point-Of-Approach Join). By Closest Point of Approach, we refer to a position at which two moving objects attain their closest possible distance [74]. Formally, in a CPA Join, we answer queries of the following type: "Find all object pairs $(p \in P, q \in Q)$ from relations $P$ and $Q$ such that CPA-distance$(p, q) \le d$". The goal is to retrieve all object pairs that are within a distance $d$ at their closest point of approach. Surprisingly, this problem has not been studied previously. The spatial join problem has been well studied for stationary objects in two- and three-dimensional space [45, 47-49]; however, very little work related to spatiotemporal joins can be found in the literature. There has been some work related to joins involving moving objects [75, 76], but the work has been restricted to objects in a limited time window and does not consider the problem of joining object histories that may be gigabytes or terabytes in size.

PAGE 28

Figure 3-1. Trajectory of an object (a) and its polyline approximation (b)

time instance $t_i$. The arity of the vector describes the dimensions of the space. For flight simulation data, the arity would be 3, whereas for a moving car, the arity would be 2. The position of the moving objects is normally obtained in one of several ways: by sampling or polling the object at discrete time instances, through use of devices like GPS, etc.
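A polyline approximation of a sampled trajectory can be sketched as follows. This is only an illustration under stated assumptions: the function name `position_at` and the `(time, position)` tuple layout are hypothetical and do not come from the dissertation, and linear interpolation between consecutive samples is the assumed reading of "polyline approximation".

```python
from bisect import bisect_right

def position_at(samples, t):
    """Estimate an object's position at time t from a sampled trajectory.

    samples: list of (time, position) pairs sorted by time, where position
    is a tuple whose length (arity) is the dimension of the space.
    Hypothetical layout, chosen only for this sketch.
    """
    times = [s[0] for s in samples]
    if not times[0] <= t <= times[-1]:
        raise ValueError("t lies outside the sampled time range")
    i = bisect_right(times, t) - 1          # segment that contains t
    if i == len(samples) - 1:               # t is exactly the last sample
        return tuple(samples[-1][1])
    (t0, p0), (t1, p1) = samples[i], samples[i + 1]
    w = (t - t0) / (t1 - t0)                # fraction of the way along the segment
    return tuple(a + w * (b - a) for a, b in zip(p0, p1))

# A moving car (arity 2) sampled at three time instances.
car = [(0, (0.0, 0.0)), (2, (2.0, 4.0)), (4, (2.0, 0.0))]
print(position_at(car, 1))   # halfway along the first segment
```

Each pair of consecutive samples defines one line segment, which is the unit that a segment-level join would compare.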

PAGE 29

Figure 3-2. Closest Point of Approach Illustration

Figure 3-3. CPA Illustration with trajectories

these two objects are $p(t) = p_0 + t\,u$ and $q(t) = q_0 + t\,v$. At any time instance $t$, the distance between the two objects is given by $d(t) = |p(t) - q(t)|$. Using basic calculus, one can find the time instance at which the distance $d(t)$ is minimum (when $D(t) = d(t)^2$ is a minimum). Solving for this time we obtain:

$$t_{cpa} = \frac{-(p_0 - q_0) \cdot (u - v)}{|u - v|^2}$$
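The closest-point-of-approach computation for two objects in linear motion can be sketched as below. This is a minimal illustration, not the dissertation's implementation: the function names are hypothetical, and clamping the CPA time to the interval in which both segments are valid is an assumption that relies on the squared distance being a convex quadratic in t.

```python
import math

def cpa_time(p0, u, q0, v):
    """Unconstrained time at which p(t) = p0 + t*u and q(t) = q0 + t*v are closest."""
    w0 = [a - b for a, b in zip(p0, q0)]    # initial separation p0 - q0
    dv = [a - b for a, b in zip(u, v)]      # relative velocity u - v
    denom = sum(c * c for c in dv)
    if denom == 0.0:                        # same velocity: distance never changes
        return 0.0
    return -sum(a * b for a, b in zip(w0, dv)) / denom

def cpa_distance(p0, u, q0, v, t_lo, t_hi):
    """Smallest distance between the two objects for t in [t_lo, t_hi]."""
    # Clamping is valid because the squared distance is convex in t.
    t = min(max(cpa_time(p0, u, q0, v), t_lo), t_hi)
    diff = [(a + t * c) - (b + t * e) for a, c, b, e in zip(p0, u, q0, v)]
    return math.sqrt(sum(x * x for x in diff))

# Two objects approaching head-on along x, offset by 1 in y.
print(cpa_time((0, 0), (1, 0), (10, 1), (-1, 0)))            # time of closest approach
print(cpa_distance((0, 0), (1, 0), (10, 1), (-1, 0), 0, 10))  # distance at that time
```

A join predicate of the form CPA-distance(p, q) ≤ d would then compare the returned distance against the query threshold.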

PAGE 30

Distance(p[i], q[j])

In the next two Sections, we consider two obvious alternatives for computing the CPA Join, where we wish to discover all pairs of objects (p, q) from two relations P and Q, where CPA(p, q, d) evaluates to true. The first technique we describe makes use of an underlying R-tree index structure to speed up join processing. The second methodology is based on a simple plane-sweep.

[17], REXP-tree [77], TPR*-tree [78] have been developed to support predictive queries, where

PAGE 31

[26], MV3R-tree [27], HR-tree [28], HR+-tree [27] are more relevant since they are geared towards answering time instance queries (in case of MV3R-tree also short time-interval queries), where all objects alive at a certain time instance are retrieved. The general idea behind these index structures is to maintain a separate spatial index for each time instance. However, such indices are meant to store discrete snapshots of an evolving spatial database, and are not ideal for use with CPA Join over continuous trajectories.

[21] and SETI [18]. TB-trees emphasize trajectory preservation since they are primarily designed to handle topological queries where access to the entire trajectory is desired (segments belonging to the same trajectory are stored together). The problem with TB-trees in the context of the CPA Join is that segments from different trajectories that are close in space or time will be scattered across nodes. Thus, retrieving segments in a given time window will require several random I/Os. In the same paper [21], an STR-tree is introduced that attempts to somewhat balance spatial locality with trajectory preservation. However, as the authors point out, STR-trees turn out to be a weak compromise that does not perform better than traditional 3D R-trees [20] or TB-trees. More appropriate to the CPA Join is SETI [18]. SETI partitions two-dimensional space statically into non-overlapping cells and uses a separate spatial index for each cell. SETI might be a good candidate for CPA Join since it preserves spatial and temporal locality. However, there are several reasons why SETI is not the most natural choice for a CPA Join:

PAGE 32

[16]. The R-tree [16] is a hierarchical, multi-dimensional index structure that is commonly used to index spatial objects. The join problem has been studied extensively for R-trees and several spatial join techniques exist [45, 46, 79] that leverage underlying R-tree index structures to speed up join processing. Hence, our first inclination is to consider a spatiotemporal join strategy that is based on R-trees. The basic idea is to index object histories using R-trees and then perform a join over these indices.

The R-Tree Index

PAGE 33

Figure 3-4. Example of an R-tree

[46] and works well in practice.

PAGE 34

Figure 3-5. Heuristic to speed up distance computation

The distance routine is used in evaluating the join predicate to determine the distance between two bounding rectangles associated with a pair of nodes. A node pair qualifies for further expansion if the distance between the pair is less than the limiting distance d supplied by the query.

Heuristics to Improve the Basic Algorithm

[45] to speed up the all-pairs distance computation when pairs of nodes are expanded and their children are checked for possible matches.

[46].
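The node-pair test described above can be illustrated with a minimum-distance routine for two axis-aligned bounding rectangles. This is a sketch under assumptions: the corner-tuple representation and the names `mbr_min_distance` and `qualifies` are hypothetical, and the comparison uses `<=` as a conservative choice so that pairs exactly at distance d are not pruned.

```python
import math

def mbr_min_distance(a_lo, a_hi, b_lo, b_hi):
    """Minimum Euclidean distance between two axis-aligned boxes.

    Each box is given by its low and high corner (tuples of equal length).
    The result is 0.0 when the boxes overlap or touch.
    """
    gaps = [max(bl - ah, al - bh, 0.0)      # per-dimension gap, 0 if intervals overlap
            for al, ah, bl, bh in zip(a_lo, a_hi, b_lo, b_hi)]
    return math.sqrt(sum(g * g for g in gaps))

def qualifies(a_lo, a_hi, b_lo, b_hi, d):
    """True if the node pair may contain objects within distance d of each other."""
    return mbr_min_distance(a_lo, a_hi, b_lo, b_hi) <= d

# Boxes separated by 3 units in x and 4 units in y.
print(mbr_min_distance((0, 0), (1, 1), (4, 5), (6, 6)))
```

Because no two points inside the boxes can be closer than this lower bound, a pair that fails the test can be discarded without examining its children.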

PAGE 35

Figure 3-6. Issues with R-trees - Fast moving object p joins with everyone

In addition, there are some obvious improvements to the algorithm that can be made which are specific to the 4-dimensional CPA Join:

[80] to build the trees. Because the potential pruning power of the time dimension is greatest, we ensure that the trees are well-organized with respect to time by choosing time as the first packing dimension.

Problem With R-tree CPA Join

PAGE 36

Figure 3-7. Progression of plane-sweep

One such scenario is depicted in Figure 3-6, which shows the paths of a set of objects on a 2-D plane for a given time period. A fast moving object such as p will be contained in a very large MBR, while slower objects such as q will be contained in much smaller MBRs. When a spatial join is computed over R-trees storing these MBRs, the MBR associated with p can overlap many smaller MBRs, and each overlap will result in an expensive distance computation (even if the objects do not travel close to one another). Thus, any sort of variance in object velocities can adversely affect the performance of the join.

[49] as a way to efficiently compute the spatial join operation.

PAGE 37

[81]. The main requirement is that the data structure selected should make it easy to check the proximity of objects in space.

PAGE 38

Figure 3-8. Layered Plane-Sweep

that are encountered at the sample point are added into the data structure and segments in D that are no longer active are deleted from it. Consequently, the sweep line pauses more often when objects with high sampling rates are present, and the progress of the sweep line is heavily influenced by the sampling rates of the underlying objects. For example, consider Figure 3-7, which shows the trajectory of four objects in a given time period. In the case illustrated, object p2 controls the progression of the sweep line. Observe that in the time interval [t_start, t_end], only new segments from object p2 get added to D, but expensive join computations are performed each time with the same set of line segments. The net result is that if the sampling rate of a data set is very high relative to the amount of object movement in the data set, then processing a multi-gigabyte object history using a simple plane-sweeping algorithm may take a prohibitively long time.

PAGE 40

Figure 3-9. Problem with using large granularities for bounding box approximation

there is an opportunity to process the entire data set through just three comparisons at the MBR level.

PAGE 42

Figure 3-10. Adaptively varying the granularity

time-varying) characteristics of the data. At every iteration, the algorithm simply chooses to process the fraction of the buffer that appears to minimize the overall cost of the plane-sweep in terms of the expected number of distance computations. The algorithm is given below:

PAGE 43

[82]. At a high level, the idea is as follows. To estimate cost, we begin by constructing bounding rectangles for all of the objects in P considering their trajectories from time $t_{start}$ to $\alpha(t_{end} - t_{start})$. These rectangles are then inserted into an in-memory index, just as if we were going to perform a layered plane-sweep. Next, we randomly choose an object $q_1$ from Q, and construct a bounding box for its trajectory as well. This object is joined with all of the objects in P by using the in-memory index to find all bounding boxes within distance d of $q_1$. Then:

PAGE 45

Figure 3-11. Convexity of cost function illustration.

fact, we identify the feasible region by evaluating $cost_{\alpha_i}$ for a small number, k, of $\alpha_i$ values. Given k, the number of allowed cut points, the fraction $\alpha_1$ can be determined as follows: $\alpha_1$ = r(1

PAGE 46

Figure 3-12. Iteratively evaluating k cut points

For instance, assume we chose $\alpha_i$ after evaluation of k cut points in the time range r. To further tune this $\alpha_i$, we consider the time range defined between the adjacent cut points $\alpha_{i-1}$ and $\alpha_{i+1}$ and recursively apply cost estimation in this interval (i.e., evaluate k points in the time range $(t_{start} + \alpha_{i-1} r,\ t_{start} + \alpha_{i+1} r)$). Figure 3-12 illustrates the idea. This approach is simple and very effective in considering a large number of choices of $\alpha$.
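The recursive refinement between adjacent cut points can be sketched as follows. Several details here are assumptions made only for illustration: the cut points are spaced evenly over the current range (the spacing rule in the text is not recoverable), `cost` is a hypothetical callable that returns an estimated cost for a given fraction of the buffer, and `depth` counts the recursive calls made after the first round of evaluations.

```python
def best_fraction(cost, k, depth=1, lo=0.0, hi=1.0):
    """Return (fraction, estimated_cost) with the lowest cost found in (lo, hi].

    cost  : hypothetical callable mapping a fraction to an estimated cost
    k     : number of cut points evaluated per round
    depth : number of recursive refinements around the best cut point
    """
    step = (hi - lo) / k
    cuts = [lo + step * (i + 1) for i in range(k)]   # evenly spaced (assumption)
    costs = [cost(a) for a in cuts]
    i = min(range(k), key=costs.__getitem__)         # index of the cheapest cut point
    best = (cuts[i], costs[i])
    if depth == 0:
        return best
    # Refine between the neighbours of the chosen cut point.
    new_lo = cuts[i - 1] if i > 0 else lo
    new_hi = cuts[i + 1] if i < k - 1 else hi
    refined = best_fraction(cost, k, depth - 1, new_lo, new_hi)
    return refined if refined[1] < best[1] else best

# A convex stand-in for the cost curve, with its minimum at 0.37.
toy_cost = lambda a: (a - 0.37) ** 2
print(best_fraction(toy_cost, k=5, depth=0)[0])   # coarse choice
print(best_fraction(toy_cost, k=5, depth=1)[0])   # refined choice
```

With k cut points and one refinement, 2k cost evaluations cover a grid that is k/2 times finer around the coarse minimum, which is the trade-off that the number of cut points and the number of recursive calls control.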

PAGE 48

Figure 3-13. Speeding up the Optimizer

PAGE 49

Figure 3-14. Injection data set at time tick 2,650

position and motion of the object). The size of each data set is around 50 gigabytes. The two data sets are as follows:

1. The Injection data set. This data set is the result of a simulation of the injection of two gasses into a chamber through two nozzles on the opposite sides of the chamber via the depression of pistons behind each of the nozzles. Each gas cloud is treated as one of the input relations to the CPA Join. In addition to heat energy transmitted to the gas particles via the depression of the pistons, the gas particles also have an attractive charge. The purpose of the join is to determine the speed of the reaction resulting from the injection of the gasses into the chamber, by determining the number of (near) collisions of the gas particles moving through the chamber. Both data sets consist of 100,000 particles, and the positions of the particles are sampled at 3,500 individual time ticks, resulting in two relations that are around 28 gigabytes in size each. During the first 2,500 time ticks, for the most part both gasses are simply compressed in their respective cylinders. After tick 2,500, the majority of the particles begin to be ejected from the two nozzles. A small sample of the particles in the data set is depicted above in Figure 3-14, at time tick 2,650.

2. The Collision data set. This data set is the result of an N-body gravitational simulation of the collision of two small galaxies. Again, both galaxies contain around 100,000 star systems, and the positions of the systems in each galaxy are polled at 3,000 different time ticks. The size of the relations tracking each galaxy is around 24 gigabytes each. For the first 1,500 or so time ticks, the two galaxies merely approach one another. For the next thousand time ticks, there is an intense interaction as they pass through one another. During the last few hundred time ticks, there is less interaction as the two galaxies have largely gone through one another. The purpose of the CPA Join is to find pairs of galaxies that approached closely enough to have a

PAGE 50

Figure 3-15. Collision data set at time tick 1,500

strong gravitational interaction. A small sample of the galaxies in the simulation is depicted above in Figure 3-15, at time tick 1,500. In addition, we test a third data set created using a simple, 3-dimensional random walk. We call this the Synthetic data set (this data set was again about 50 GB in size). The speed of the various objects varies considerably during the walk. The purpose of including this data is to rigorously test the adaptability of the adaptive plane-sweep, by creating a synthetic data set where there are significant fluctuations in the amount of interaction among objects as a function of time.

[80] to construct an R-tree for each input relation), a simple plane-sweep (implemented as described in Section 3.4), a layered plane-sweep (implemented as described in Section 3.5). We also tested the adaptive plane-sweep algorithm, implemented as described in Section 6. For the adaptive plane-sweep, we also wanted to test the effect of the

PAGE 51

Figure 3-16. Injection data set experimental results

two relevant parameter settings on the efficiency of the algorithm. These settings are the number of cut points k considered at each level of the optimization performed by the algorithm, as well as the number of recursive calls made to the optimizer. In our experiments, we used k values of 5, 10, and 20, and we tested using either a single or no recursive calls to the optimizer. The results of our experiments are plotted above in Figures 3-16 through 3-21. Figures 3-16, 3-17, and 3-20 show the progress of the various algorithms as a function of time, for each of the three data sets (only Figure 3-16 depicts the running time of the adaptive plane-sweep making use of a recursive call to the optimizer). For the various plane-sweep-based joins, the x-axis of the plots shows the percentage of the join that has been completed, while the y-axis shows the wall-clock time required to reach that point in the completion of the join. For the R-tree-based join (which does not progress through virtual time in a linear fashion), the x-axis shows the fraction of the MBR-MBR pairs that have been evaluated at each particular wall-clock time instant. These values are normalized so that they are comparable with the progress of the plane-sweep-based joins.

PAGE 52

[Figure: Collision data set experimental results]

Figures 3-17, 3-18 and 3-20 show the buffer-size choices made by the adaptive plane-sweep algorithm using k = 20 and no recursive calls to the optimizer, as a function of time for all three test data sets.

PAGE 53

[Figure: Buffer size choices for Injection data set]

[Figure: Buffer size choices for Collision data set]

the input relations (when the gases are expelled from the nozzles in the Injection data set and when the two galaxies overlap in the Collision data set). During such periods it makes sense to consider only very small time periods in order to reduce the number of comparisons, leading to good efficiency for the standard plane-sweep. On the other hand, during time periods when there was relatively little interaction between the input relations, the layered plane-sweep performed far better because it was able to process large time periods at once. Even when the objects in the input relations have very long paths during such periods, the input data were isolated enough that there tends to be little cost associated with checking these paths for proximity during the first level of the layered plane-sweep. The adaptive plane-sweep was the best option by far for all three data

PAGE 54

[Figure: Synthetic data set experimental results]

sets, and was able to smoothly transition from periods of low to high activity in the data and back again, effectively picking the best granularity at which to compare the paths of the objects in the two input relations.

From the graphs, we can see that the cost of performing the optimization causes the adaptive approach to be slightly slower than the non-adaptive approach when optimization is ineffective. In both data sets, this happens in the beginning when the objects are moving towards each other but are still far enough apart that no interaction takes place. As expected, in both experiments, adaptivity begins to take effect when the objects in the underlying data set start interacting.

From Figures 3-17, 3-18, and 3-20 it can be seen that the buffer size choices made by the adaptive plane-sweep are very finely tuned to the underlying object interaction dynamics (decreasing with increasing interaction and vice versa). In both the Injection and Collision data sets, the size of the buffer falls dramatically just as the amount of interaction between the input relations increases. In the Synthetic data set, the oscillations in buffer usage depicted in Figure 3-20 mimic almost exactly the energy of the data as they perform their random walk.

The graphs also show the impact of varying the parameters to the adaptive plane-sweep routine, namely, the number of cut points k considered at each level of the optimization, and whether or not a chosen granularity is refined through recursive

PAGE 55

[51]. However, their paper considers the basic problem at a high level. The algorithmic and implementation issues addressed by our own work were not considered.

PAGE 56

[Figure: Buffer size choices for Synthetic data set]

Though little work has been reported on spatiotemporal joins, there has been a wealth of research in the area of spatial joins. The classical paper on spatial joins is due to Brinkhoff, Kriegel and Seeger [45] and is based on the R-tree index structure. An improvement of this work was given by Huang et al. [46]. Hash-based spatial join strategies have been suggested by Lo and Ravishankar [47], and Patel and DeWitt [48]. Lars et al. [49] proposed a plane-sweep approach to address the spatial-join problem in the absence of underlying indexes.

Within the context of moving objects, research has been focused on two main areas: predictive queries, and historical queries. Within this taxonomy, our work falls in the latter category. In predictive queries, the focus is on the future position of the objects and only a limited time window of the object positions needs to be maintained. On the other hand, for historical queries, the interest is in efficient retrieval of past history and usually the index structure maintains the entire timeline of an object's history. Due to these divergent requirements, index structures designed for predictive queries are usually not suitable for historical queries.

A number of index structures have been proposed to support predictive and historical queries efficiently. These structures are generally geared towards efficiently answering

PAGE 57

[20], spatiotemporal R-Trees and TB (Trajectory Bounding)-trees [21], and linear quad-trees [22]. A technique based on space partitioning is reported in [18]. For predictive queries, Saltenis et al. [17] proposed the TPR-tree (time-parametrized R-tree), which indexes the current and predicted future positions of moving point objects. They mention the sensitivity of the bounding boxes in the R-tree to object velocities. An improvement of the TPR-tree can be found in [78]. In [76], a framework to cover time-parametrized versions of spatial queries by reducing them to a nearest-neighbor search problem has been suggested. In [23], an indexing technique is proposed where trajectories in a d-dimensional space are mapped to points in a higher-dimensional space and then indexed. In [75], the authors propose a framework called SINA in which continuous spatiotemporal queries are abstracted as a spatial join involving moving objects and moving queries. An overview of different access structures can be found in [83].

PAGE 58

58

PAGE 59

1. A unique expectation-maximization (EM) algorithm that is suitable for learning associations of spatiotemporal, moving object data is described. This algorithm allows us to recognize quadratic (fixed acceleration) motion in a large set of sensor observations.

2. We apply and extend the method of Bayesian filters for recognizing unrestricted motion to the case when a large number of interacting objects produce data, and it is not clear which observation corresponds to which object.

3. Experimental results show that the proposed method can accurately perform resolution over more than one hundred simultaneously moving objects, even when the number of moving objects is not known beforehand.

The remainder of this chapter is organized as follows: In the next section, we state the problem formally and give an overview of our approach. We then describe the generative model and define the PDFs for the restricted and unrestricted motion. This is followed by a detailed description of the learning algorithms in Section 4.4. An experimental evaluation of the algorithms is given in Section 4.5, followed by the conclusion.

PAGE 60

Figure 4-1. Mapping of a set of observations for linear motion

As an example, consider the set of observations: {(2,8,0), (9,9,0), (4,11,1), (11,7,1), (6,14,2), (13,5,2), (8,17,3), (15,3,3)} as shown in Figure 4-1(a). Given that the underlying motion is linear and K = 2, Figure 4-1(b) shows a mapping of the observations to objects. Observations {(2,8,0), (4,11,1), (6,14,2), (8,17,3)} are associated with object 1, and observations {(9,9,0), (11,7,1), (13,5,2), (15,3,3)} are associated with object 2. Though in this case the classification was easy, the problem in general is hard due to a number of factors, including:

PAGE 61

61

PAGE 62

[Figure: Object path (a) and quadratic fit for varying time ticks (b-d)]

One of the key aspects of our approach is that to make the learning process feasible, we rely on two separate motion models: a restricted motion model that is used for only the first few time ticks in order to recognize the number and initial motion of the various objects, and an unrestricted motion model that takes this initial motion as input and allows for arbitrary object motion. Given this, the following describes our overall process for grouping sensor observations into objects:

1. First, we determine K and learn the set of model parameters governing object motion under the restricted model for the first few time ticks.

2. Next, we use K as well as the object positions at the end of the first few time ticks as input into a learning algorithm for the remainder of the timeline. The goal here is to learn how objects move under the more general, unrestricted motion model.

3. Finally, once all object trajectories have been learned for the complete timeline, each sensor observation is assigned to the object that was most likely to have produced it.

Of course, this basic approach requires that we address two very important technical questions:

1. What exactly is our model for how objects move through the data space, and how do objects probabilistically generate observations around them?

2. How do we "learn" the parameters associated with the model, in order to tailor the general motion model to the specific objects that we are trying to model?
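The final assignment step (step 3 of the process above) can be illustrated with the observations of Figure 4-1. The sketch below is not the learning algorithm itself: it assumes that the two linear trajectories are already known (the start points and velocities were read off the example by inspection) and that every object generates observations from an isotropic Gaussian cloud of the same spread, so that the most likely object is simply the one whose predicted position is nearest. The function names `predict` and `assign` are hypothetical.

```python
def predict(track, t):
    """Position at time t of a linear track given as ((x0, y0), (vx, vy))."""
    (x0, y0), (vx, vy) = track
    return (x0 + vx * t, y0 + vy * t)


def assign(observations, tracks):
    """Label each (x, y, t) observation with the index of the nearest track.

    Assumption: equal, isotropic Gaussian spread for all objects, so the
    most likely generating object is the one with the closest prediction.
    """
    labels = []
    for x, y, t in observations:
        def sq_dist(j):
            px, py = predict(tracks[j], t)
            return (x - px) ** 2 + (y - py) ** 2
        labels.append(min(range(len(tracks)), key=sq_dist))
    return labels


# Observations (x, y, t) from Figure 4-1(a).
observations = [(2, 8, 0), (9, 9, 0), (4, 11, 1), (11, 7, 1),
                (6, 14, 2), (13, 5, 2), (8, 17, 3), (15, 3, 3)]

# Tracks read off the example: object 1 and object 2 (indices 0 and 1).
tracks = [((2, 8), (2, 3)), ((9, 9), (2, -2))]

labels = assign(observations, tracks)
```

With these inputs, the observations alternate between the two tracks, which matches the mapping of Figure 4-1(b).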

PAGE 63

63

PAGE 64

[Figure: Object path in a sensor field (a) and sensor firings triggered by object motion (b)]

parameters are learned, we can make use of the unrestricted model for the remainder of the timeline since there will be fewer unknowns and the computational complexity is greatly reduced.

$\frac{1}{2\pi|\Sigma|^{1/2}} e^{-\frac{1}{2}(x-p)^T \Sigma^{-1} (x-p)}$ is a Gaussian PDF that models the cloud of sensors triggered by the object at time t. Figure 4-3 shows a typical scenario of how observations are generated. The parameter set contains:

PAGE 65

65

PAGE 66

66

PAGE 67

67

PAGE 68

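[84]. EM is an iterative algorithm that works by repeating the "E-Step" and the "M-Step". At all times, EM maintains a current guess as to the parameter set $\Theta$. In the E-Step, we compute the so-called "Q-function", which is nothing more than the expected value of the log-likelihood, taken with respect to all possible values of Y. The probability of generating any given Y is computed using the current guess for $\Theta$. This removes the dependency on Y. The M-Step then updates $\Theta$ so as to maximize the value of the resulting Q function. The process is repeated until there is little step-to-step change in $\Theta$.

In order to derive an EM algorithm for learning the restricted motion model, we must first derive the Q function. In general, the Q function takes the form:

$$Q(\Theta, \Theta^{i}) = E\left[\log L(X, Y \mid \Theta) \mid X, \Theta^{i}\right]$$

In our particular case, this can be expanded to:

$$Q(\Theta, \Theta^{g}) = \sum_{i=1}^{N} \sum_{j=1}^{K} \log\left(\pi_j \, p(x_i, t_i \mid \Theta_j)\right) P_{j,i}$$

where $\Theta^{g} = \{\pi^{g}_j, \Theta^{g}_j \mid 1 \le j \le K\}$ represents our guess for the various parameters of the K objects and $P_{j,i}$ is the posterior probability that the i-th observation came from the j-th object, given by the formula:

$$P_{j,i} = P(j \mid x_i, t_i) = \frac{\pi^{g}_j \, p(x_i, t_i \mid \Theta^{g}_j)}{\sum_{k=1}^{K} \pi^{g}_k \, p(x_i, t_i \mid \Theta^{g}_k)}$$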
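As an illustration of the E-Step and M-Step described above, the following is a minimal sketch of one EM iteration for a one-dimensional quadratic motion model. Several simplifications are assumptions of this sketch rather than part of the text: positions are scalars, each object has a scalar variance instead of a covariance matrix, the motion is taken to be x(t) = mu + v*t + a*t^2, and only the motion parameters are re-estimated in the M-Step (by weighted least squares). All function names are hypothetical.

```python
import math


def gaussian(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def position(params, t):
    """Assumed quadratic motion x(t) = mu + v*t + a*t^2 for params = (mu, v, a)."""
    mu, v, a = params
    return mu + v * t + a * t * t


def e_step(obs, pis, params, variances):
    """Return P[j][i], the posterior that observation i came from object j."""
    K = len(pis)
    P = [[0.0] * len(obs) for _ in range(K)]
    for i, (x, t) in enumerate(obs):
        joint = [pis[j] * gaussian(x, position(params[j], t), variances[j])
                 for j in range(K)]
        total = sum(joint)
        for j in range(K):
            # Fall back to a uniform posterior if every density underflows to zero.
            P[j][i] = joint[j] / total if total > 0.0 else 1.0 / K
    return P


def solve3(A, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, 3):
            f = M[r][col] / M[col][col]
            for c in range(col, 4):
                M[r][c] -= f * M[col][c]
    x = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        x[r] = (M[r][3] - sum(M[r][c] * x[c] for c in range(r + 1, 3))) / M[r][r]
    return x


def m_step(obs, P_j):
    """Weighted least-squares update of (mu, v, a) for one object."""
    S = [sum(w * t ** k for (x, t), w in zip(obs, P_j)) for k in range(5)]
    R = [sum(w * x * t ** k for (x, t), w in zip(obs, P_j)) for k in range(3)]
    A = [[S[0], S[1], S[2]], [S[1], S[2], S[3]], [S[2], S[3], S[4]]]
    return solve3(A, R)


# Two well-separated objects observed at ticks 0..4 (synthetic, noise-free data).
obs = ([(1 + 2 * t + 0.5 * t * t, t) for t in range(5)]
       + [(100 - 3 * t, t) for t in range(5)])
P = e_step(obs, [0.5, 0.5], [(0.0, 2.0, 0.5), (99.0, -3.0, 0.0)], [4.0, 4.0])
mu, v, a = m_step(obs, P[0])
```

In this toy setting the first object's parameters are recovered after a single iteration because the two objects are far apart; `m_step` would fail on an object whose posterior weights are all zero, which a fuller implementation would need to guard against.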

PAGE 69

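[85]. Doing so results in the following update rules for the parameter set $\Theta_j$ for the j-th object:

$$\begin{pmatrix} \sum_{i=1}^{N} P_{j,i} & \sum_{i=1}^{N} t_i P_{j,i} & \sum_{i=1}^{N} t_i^2 P_{j,i} \\ \sum_{i=1}^{N} t_i P_{j,i} & \sum_{i=1}^{N} t_i^2 P_{j,i} & \sum_{i=1}^{N} t_i^3 P_{j,i} \\ \sum_{i=1}^{N} t_i^2 P_{j,i} & \sum_{i=1}^{N} t_i^3 P_{j,i} & \sum_{i=1}^{N} t_i^4 P_{j,i} \end{pmatrix} \begin{pmatrix} \mu_j \\ v_j \\ a_j \end{pmatrix} = \begin{pmatrix} \sum_{i=1}^{N} x_i P_{j,i} \\ \sum_{i=1}^{N} x_i t_i P_{j,i} \\ \sum_{i=1}^{N} x_i t_i^2 P_{j,i} \end{pmatrix},$$

$$\Sigma_j = \frac{\sum_{i=1}^{N} (x_i - \mu_j)(x_i - \mu_j)^T P_{j,i}}{\sum_{i=1}^{N} P_{j,i}} \quad \text{and} \quad \pi_j = \frac{1}{N} \sum_{i=1}^{N} P_{j,i}$$

4.4.2 Learning K

So far we have assumed that the number of objects K is known. However, in practice, we often have very little knowledge about K, thus requiring us to estimate it from the observed data. The problem of choosing K can be viewed as the problem of selecting the number of components of a mixture model that describes some observed data. The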

PAGE 70

[86][87][88][89]. The basic idea behind the various techniques is as follows: Assume we have a model for some observed data in the form of a parameter set $\Theta_K = \{\Theta_1, \ldots, \Theta_K\}$. Further, assume we have a cost function $C(\Theta_K)$ to evaluate the cost of the model. In order to select the model with the optimal number of components, we simply compute $\Theta_K$ for a range of K values and choose the one with the minimum cost:

$$\hat{K} = \arg\min_{K} \{\, C(\Theta_K) \mid K_{low} \le K \le K_{high} \,\}$$

The various techniques proposed in the literature can be distinguished by the cost criterion they use to evaluate a model: AIC (Akaike's Information Criterion), MDL (Minimum Description Length) [88], MML (Minimum Message Length) [90], etc.

For the cost function, we make use of the Minimum Message Length (MML) criterion as it has been shown to be competitive with and even superior to other techniques [89]. MML is an information theoretic criterion where different models are compared on the basis of how well they can encode the observed data. The MML criterion nicely captures the tradeoff between the number of components and model simplicity. The general formula [89] for the MML criterion is given by:

$$C(\Theta_K) = -\log h(\Theta_K) - \log L(X \mid \Theta_K) + \frac{1}{2} \log |I(\Theta_K)| + \frac{c}{2}\left(1 + \log \frac{1}{12}\right)$$

where $h(\cdot)$ describes the prior probabilities of the various parameters, $L(\cdot)$ the likelihood of observing the data, and $|I|$ is the determinant of the Fisher information matrix of the observed data. For our specific case, we need a formulation that is applicable to Gaussian distributions [87]:

$\ldots \; 2 \log L(Y \mid \Theta_K)$

PAGE 71

[91]. In this variation of EM, a model is first learned with a very large K value. Then, in an iterative fashion, poor components are pruned off and the model is re-adjusted to incorporate any data that is no longer well-fit. For each resulting value of K, the MML criterion is checked and the best model is chosen.

[62] to update $f_{mot}$ to take into account the various sensor observations. $f_{mot}$ defines a distribution over $p_t$ for every time-tick t, which can be viewed as describing a belief in the object's position at time-tick t. In a Bayesian fashion, this belief (i.e., distribution) can be updated and made more accurate by making use of

PAGE 72

[64].

[57]. A particle filter simplifies the problem by representing $f^{t}_{mot}$ by a set of discrete "particles", where each particle is a possible current position for the object. We denote the set of particles associated with time-tick t as $S_t$, and the i-th particle in this set is $S_t[i]$. The i-th particle has a non-negative weight $w_i$ attached to it with the constraint $\sum_i w_i = 1$. Highly-weighted particles indicate that the object is more likely to be located at or close to the particle's position. Given $S_t$, $f^{t}_{mot}(p_t)$ simply returns $w_i$ if $p_t = S_t[i]$, and 0 otherwise.

The basic application of a particle filter to our problem is quite simple (though there is a major complication that we will consider in the next subsection). To compute $f^{t}_{mot}$ for any time tick t, we use a recursive algorithm. For the base case t = 0, we have a single particle located at $p_0$, having weight one. Then, given a set of particles $S_{t-1}$ for time-tick t − 1, the set $S_t$ for time tick t is computed as given in Algorithm 2.
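Algorithm 2 is not reproduced in this text, so the sketch below should be read as a generic bootstrap particle-filter step rather than as the algorithm used here. It assumes a Gaussian random-walk motion proposal and an isotropic Gaussian observation likelihood; `motion_sigma`, `obs_sigma`, and the function names are hypothetical parameters of the sketch.

```python
import math
import random


def gaussian_likelihood(obs, center, sigma):
    """Unnormalized isotropic 2-D Gaussian likelihood of obs around center."""
    dx, dy = obs[0] - center[0], obs[1] - center[1]
    return math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))


def particle_filter_step(particles, weights, observations, rng,
                         motion_sigma=1.0, obs_sigma=1.0, n_out=None):
    """Compute (S_t, weights) from (S_{t-1}, weights) and the tick's observations."""
    n_out = n_out or len(particles)
    # 1. Resample particle positions in proportion to their current weights.
    picked = rng.choices(particles, weights=weights, k=n_out)
    # 2. Propagate each particle with the assumed random-walk motion model.
    moved = [(x + rng.gauss(0.0, motion_sigma), y + rng.gauss(0.0, motion_sigma))
             for x, y in picked]
    # 3. Re-weight each particle by how well it explains the observations.
    new_w = []
    for p in moved:
        w = 1.0
        for obs in observations:
            w *= gaussian_likelihood(obs, p, obs_sigma)
        new_w.append(w)
    total = sum(new_w)
    if total == 0.0:
        # Every likelihood underflowed: fall back to uniform weights.
        new_w = [1.0 / n_out] * n_out
    else:
        new_w = [w / total for w in new_w]
    return moved, new_w


# Base case t = 0: a single particle at p_0 with weight one.
rng = random.Random(0)
S, w = particle_filter_step([(0.0, 0.0)], [1.0],
                            [(1.0, 1.0), (1.2, 0.8)], rng, n_out=200)
mean_x = sum(wi * p[0] for p, wi in zip(S, w))
```

After one step the weights again sum to one, and the weighted mean of the particles moves from the starting position toward the observed sensor firings.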

PAGE 73

73

PAGE 74

1;jThen,wecanapplyBayesruletocomputewi:wi=Qj=1(1;j)ft0obs(xjj:)+;jfN(xjjobs;St[i]) 74

PAGE 75

$\frac{1}{2\pi|\Sigma_{obs}|^{1/2}} e^{-\frac{1}{2}(x_i - p_t)^T \Sigma_{obs}^{-1} (x_i - p_t)}$

PAGE 76

76

PAGE 77

77

PAGE 78

1. In the first test, numTicks is fixed at 50, stdDev is fixed at 2% of the width of the field, and numObs is set at 5. emTime is fixed at 5, and numObj is varied from 10 to 110 objects in increments of 30.

2. In the second test, numObj is fixed at 40, stdDev is fixed at 2% of the width of the field, emTime is fixed at 5, and numObs is set at 5. The time interval over which observations were recorded, numTicks, is varied in increments of 25 up to 100 time ticks.

3. In the third test, numObj is fixed at 40, numTicks is fixed at 50, stdDev is fixed at 2% of the width of the field, emTime is fixed at 5, and the average number of sensor firings generated per object at each time tick, numObs, is varied from 5 to 25 in increments of 5.

PAGE 79

[Figure 4-4: The baseline input set (10,000 observations)]

[Figure: The learned trajectories for the data of Figure 4-4]

4. In the fourth test, numObj is fixed at 40, numTicks is fixed at 50, and numObs is set at 5. We then vary the spread of the Gaussian cloud, stdDev, from 2% to 10% of the width of the field.

PAGE 80

Recall      1.0      0.91     0.76      0.69
Precision   1.0      0.92     0.92      0.93
Runtime     9 sec    38 sec   131 sec   378 sec
Table 4-1. Varying the number of objects and its effect on recall, precision and runtime.

Recall      0.93     0.91     0.75     0.64
Precision   0.96     0.93     0.92     0.92
Runtime     21 sec   38 sec   59 sec   72 sec
Table 4-2. Varying the number of time ticks.

Recall      0.91     0.91     0.91      0.92
Precision   0.93     0.92     0.92      0.91
Runtime     38 sec   71 sec   102 sec   134 sec
Table 4-3. Varying the number of sensors fired.

Recall      0.91     0.90     0.88     0.80
Precision   0.93     0.94     0.91     0.83
Runtime     38 sec   37 sec   37 sec   38 sec
Table 4-4. Varying the standard deviation of the Gaussian cloud.

5. In the final test, numObj is fixed at 40, numTicks is fixed at 50, stdDev is fixed at 2% of the width of the field, and numObs is set at 5. emTime is varied from 5 to 20 time ticks in increments of 5.

All tests were carried out on a dual-core Pentium PC with 2 GB RAM. The tests were run in two stages. First, the EM algorithm is run to get an initial estimate of the number of objects and their starting locations. The number of time ticks over which EM is applied is controlled by the emTime parameter. Next, the estimates produced by EM are used to bootstrap the particle filter phase of the algorithm, which tracks the individual objects for the rest of the timeline. In a post-processing step, the recall and precision values are computed. Each test was repeated five times and the results were averaged across the runs.

PAGE 81

Recall      0.91     0.88     0.87     0.83
Precision   0.92     0.95     0.94     0.96
Runtime     38 sec   38 sec   37 sec   37 sec
Table 4-5. Varying the number of time ticks where EM is applied.

PAGE 82

[18, 21, 92-94], queries over tracks [75, 76, 83] and clustering paths [35, 37, 38, 95, 96]. However, little work in databases addresses how to actually obtain the object path.

The only prior work in the database literature closely related to the problem we address is the work of Kubica et al. [36]. Given a set of asteroid observations, they consider the problem of linking observations that correspond to the same underlying asteroid. Their approach consists of building a forest of k-d trees [97], one for each time tick, and performing a synchronized search of all the trees with exchange of information among tree nodes to guide the search towards feasible associations. They assume that each asteroid has at most one observation at every time tick and consider only simple linear or quadratic motion models.

Modeling-based approaches [85, 98, 99] have been previously employed in target tracking to map observations into targets. The focus is primarily on supporting real-time tracking using simple motion models. In contrast to existing research, we focus on aiding

PAGE 83

83

PAGE 84

[100-106]. In these models, the relational model is extended so that a single stored database actually specifies a distribution of possible database states, where these possible states are also called possible worlds. In this sort of model, answering a query is closely related to the problem of statistical inference. Given a query over the database, the task is to infer some characteristic of the underlying distribution of possible worlds. For example, the goal may be to infer whether the probability that a specific tuple appears in the answer set of a query exceeds some user-specified p.

Along these lines, most of the existing work on probabilistic databases has focused on providing exact solutions to various inference problems. For example, imagine that one relation R1 has an attribute lname, where exactly one tuple t from R1 has the value t.lname = 'Smith'. The probability of t appearing in a given world is 0.2. t also has another attribute t.SSN = 123456789, which is a foreign key into a second database table R2. The probability of 123456789 appearing in R2 is 0.6. Then (assuming that there are no other 'Smith's in the database) the probability that 'Smith' will appear in the output of R1 ⋈ R2 can be computed exactly as 0.2 × 0.6 = 0.12.

Unfortunately, probabilistic data models where tuples or attribute values can be described using simple, discrete probability distributions may be of only limited utility in the real world. If the goal is to build databases that can represent the sort of uncertainty present in modern data management applications, it is very useful to handle complex, continuous, multi-attribute distributions. For example, consider an application where moving objects are automatically tracked (perhaps by video, magnetic, or seismic sensors) and the observed tracks are stored in a database. The standard, modern method for automatic tracking via electronic sensory input is the so-called "particle filter" [63], which generates a complex, time-parameterized probabilistic mixture model for each

PAGE 85

[62] is a popular method that is commonly proposed as a way to infer unknown or uncertain characteristics of data; one standard application of Bayesian inference is automatically guessing the topic of a document such as an email. The so-called "posterior" distribution resulting from Bayesian inference often has no closed form, cannot be integrated, and can only be sampled from, using tools such as Markov Chain Monte Carlo (MCMC) methods [107].

Thus, in the most general case, an integrable PDF is unavailable, and the user can only provide an implementation of a pseudo-random variable that can be used to provide samples from the probability distribution that he or she wishes to attach to an attribute or set of correlated attributes. By asking only for a pseudo-random generator, we can handle both difficult cases (such as the Bayesian case) and simpler cases where the underlying distribution is well-known and widely used (such as Gaussian, Poisson, Gamma, Dirichlet, etc.) in a unified fashion. Myriad algorithms exist for generating Monte Carlo samples in a computationally efficient manner [108]. For more details on how a database system might support user-defined functions for generating the required pseudo-random samples, we point to our earlier paper on the subject [109].

Our Contributions. If the user is asked only to supply pseudo-random attribute value generators, it becomes necessary to develop new technologies that allow the database system to integrate the unknown density function underlying a pseudo-random generator over the space of database tuples accepted by a user-supplied query predicate. In this

PAGE 86

86

PAGE 90

90

PAGE 91

[110], or Neyman test for short. For a given database object obj, the Neyman test chooses between H0 and H1 by analyzing a fixed sample of size n drawn using GetInstance(). The test relies on a likelihood ratio test (LRT) that compares the probabilities of observing the sample sequence under H0 and H1. It is named after a theoretical result (the Neyman-Pearson lemma) that states that a test based on the LRT is the most powerful of all possible tests for a fixed sample size n comparing the two simple hypotheses (i.e., it is a uniformly-most-powerful test).

Since the Neyman test for the Bernoulli (yes/no) probability case is given in many textbooks on hypothesis testing, we omit its exact definition here. Given an implementation of a Neyman test that returns ACCEPT if H1 is selected, it is possible to replace lines (9) to (11) of Algorithm 5-1 with:

PAGE 92

[Figure 5-1: The SPRT in action. The middle line is the LRT statistic]

hypothesis relate to specific, infinitely-precise probability values $p + \epsilon$ and $p - \epsilon$, when in reality the true probability $\theta$ is likely to be either greater than $p + \epsilon$ or less than $p - \epsilon$, but not exactly equal to either of them. In this case, the Neyman test will still be correct in the sense that while still respecting $\alpha$ and $\beta$, it will choose H0 if $\theta \le p - \epsilon$. However, the test is somewhat silly in this case, because it still requires just as many samples as it would in the hard case where $\theta$ is precisely equal to one of these values.

To make this concrete, imagine that p = .95, and after 100 samples have been taken from GetInstance(), absolutely none of them have been accepted by pred(), but the Neyman algorithm has determined that in the worst case, we need $10^5$ samples to choose between H0 and H1. Even though there is a probability of at most $(1 - .95)^{100} < 10^{-130}$ of observing 100 consecutive false values if $\theta$ was at least 0.95, the test cannot terminate, meaning that we must still take 99,900 more samples. In this extreme case we would like to be able to realize that there is no chance that we will accept H1 and terminate early with a result of H0. In fact, this extreme case may be quite common in a probabilistic database where p will often be quite large and pred() highly selective.

Not surprisingly, this issue has been considered in detail by the statistics community, and there is an entire subfield of work devoted to so-called "sequential" tests. The basis

PAGE 93

[111], or SPRT for short. The SPRT can be seen as a sequential version of the Neyman test. At each iteration, the SPRT draws another sample from the underlying data distribution, and uses it to update the value of a likelihood ratio statistic. If the statistic exceeds a certain upper threshold, then H1 is accepted. If it ever fails to exceed a certain lower threshold, then H0 is accepted. If neither of these things happens, then at least one more iteration is required; however, the SPRT is guaranteed to end (eventually).

Thus, over time, the likelihood ratio statistic can be seen as performing a random walk between two moving "goalposts". As soon as the value of the statistic falls outside of the goalposts, a decision is reached and the test is ended. The process is illustrated in Figure 5-1. This plot shows the SPRT for a specific case where $\theta = .5$, $\epsilon = .05$, $p = .3$, and $\alpha = \beta = 0.05$. The x-axis of this plot shows the number of samples that have been taken, while the wavy line in the middle is the current value of the LRT statistic. As soon as the statistic exits either boundary, the test is ended.

The key benefit of this approach is that for very low values of $\theta$ that are very far from p, H0 is accepted quickly (H1 is accepted with a similar speed when $\theta$ greatly exceeds p). All of this is done while fully controlling for the multiple-hypothesis-testing (MHT) problem: when the test statistic is checked repeatedly, extreme care must be taken with respect to $\alpha$ and $\beta$ because there are many chances to erroneously accept H0 (or H1), and so the effective or real $\alpha$ (or $\beta$) can be much higher than what would naively be expected. Furthermore, like the Neyman test, the SPRT is also "optimal" in the sense that in expectation, it requires no more samples than any other sequential test to choose between H0 and H1, assuming that one of the two hypotheses is true.

Just like the Neyman test, the SPRT makes use of a likelihood ratio statistic. In the Bernoulli case we study here, after numAcc samples that are accepted by pred() out of num

PAGE 94

b p+(numnumacc)log1p forsimplicity,thiscanbere-workedabit.let:a=log1p+ plog1p band:numacclog b

PAGE 95

95

PAGE 96

96

PAGE 97

[Figure 5-2: Two spatial queries over a database of objects with Gaussian uncertainty]

The unique problem in the database context is that while H0 and H1 are very close to one another (due to a tiny $\epsilon$), in reality, $\theta$ is typically very far from both $p - \epsilon$ and $p + \epsilon$; usually, it will be close to zero or one. For example, consider Figure 5-2, which shows a simple spatial query over a database of objects whose positions are represented as two-dimensional Gaussian density functions (depicted as ovals in the figure). For both the more selective query at the left and the less selective query at the right, only the few objects falling on the boundary of the query region would have $\theta \approx p$ for any user-specified $p \ne 0, 1$.

This creates a unique setup that is quite different from classic applications of the SPRT and its variants. In fact, the SPRT itself is provably optimal for only those $\theta$ values lying at $p - \epsilon$ and $p + \epsilon$; for those $\theta$ far from these boundaries (such as at zero and one), it may do quite poorly. Many other "optimal" tests have been proposed in the statistical literature, but few seem to be applicable to this rather unique application domain; see the experimental section as well as the related work section for more details.
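To make the behavior of a single SPRT concrete, the following is a minimal sketch of Wald's classical test for a Bernoulli probability. It is not the code used in this work: the thresholds are Wald's standard approximations, and the argument names `false_accept` (probability of choosing H1 when H0 holds) and `false_reject` (probability of choosing H0 when H1 holds) are assumptions of the sketch, chosen to avoid committing to a particular mapping onto the error-rate symbols used in the text.

```python
import math


def sprt(samples, p0, p1, false_accept, false_reject):
    """Wald's SPRT for H0: theta = p0 versus H1: theta = p1, with p0 < p1.

    samples is an iterable of booleans (True if pred() accepted the sample).
    Returns (decision, number_of_samples_used).
    """
    upper = math.log((1.0 - false_reject) / false_accept)  # at or above: accept H1
    lower = math.log(false_reject / (1.0 - false_accept))  # at or below: accept H0
    step_true = math.log(p1 / p0)                  # log-likelihood-ratio change on True
    step_false = math.log((1.0 - p1) / (1.0 - p0))  # change on False
    llr = 0.0
    used = 0
    for accepted in samples:
        used += 1
        llr += step_true if accepted else step_false
        if llr >= upper:
            return "H1", used
        if llr <= lower:
            return "H0", used
    return "undecided", used


# An object whose samples are never accepted is rejected almost immediately.
never = sprt([False] * 100, p0=0.85, p1=0.95, false_accept=0.05, false_reject=0.05)
# An object whose samples are always accepted needs more samples, but far
# fewer than a fixed-size test sized for the worst case.
always = sprt([True] * 100, p0=0.85, p1=0.95, false_accept=0.05, false_reject=0.05)
```

With these illustrative settings the all-false sequence is decided after three samples, which is the early-termination behavior that motivates sequential tests.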

PAGE 98

[Figure 5-3: The sequence of SPRTs run by the end-biased test]

To perform the end-biased test, we run a series of pairs of individual hypothesis tests. In the first pair of tests, one SPRT is run right after another:

PAGE 99

2 for both the rejection test and the acceptance test. This means that more samples will probably be required to arrive at a result in either test (due to the fact that H0 and H1 will be closer to one another), but it also means that fewer objects will have $\theta$ values that fall in either test's region of indifference. Specifically, the third SPRT that is run is used to determine possible rejection using H0: $\theta = 3p/4$ versus H1: $\theta = p + \epsilon$. If the SPRT accepts H0, then obj is immediately rejected. However, if the third SPRT accepts

PAGE 100

1. The process terminates with either an accept or a reject in some test, or;

2. The space of possible $\theta$ values for which the process would not have terminated falls strictly in the range from $p - \epsilon$ to $p + \epsilon$. In this case, an arbitrary result can be chosen.

The sequence of SPRT tests that are run is illustrated above in Figure 5-3. At each iteration, the region of indifference shrinks, until it becomes vanishingly small and the test terminates. Since a large initial region of indifference means that the first few tests terminate quickly (but will only accept or reject large or small values of $\theta$), the test is "end-biased"; that is, it is biased towards terminating early in those cases where $\theta$ is either small or large. For those values that are closer to p, more sub-tests and more samples will be required, which is very different from classical tests such as the SPRT or Neyman test, which try to optimize for the case when $\theta$ is close to p.

$\alpha$, $\beta$, and the MHT problem. One thing that we have ignored thus far is how to choose $\alpha'$ and $\beta'$ (that is, the false negative and false positive rate of each individual SPRT subtest) so that the overall, user-supplied $\alpha$ and $\beta$ values are respected by the end-biased test. This is a bit more difficult than it may seem at first glance: one significant result of running a series of SPRTs is that it becomes imperative that we be very careful not to accidentally accept or reject an object due to the fact that we are running multiple hypothesis tests.

We begin our discussion by assuming that the limit on the number of pairs of tests run is n; that is, there are n tests that can accept obj, and there are n tests that can reject obj. We also note that in practice, the 2n tests are not run in sequence, but they are run

PAGE 101

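[112], we have:

$$\Pr\left[\bigvee_{i=1}^{n} \text{reject in test } i\right] \le \sum_{i=1}^{n} \Pr[\text{reject in test } i]$$

As a result, if we run each individual rejection test using a false reject rate of $\alpha'$, we know that:

$$\Pr\left[\bigvee_{i=1}^{n} \text{reject in test } i\right] \le n\alpha'$$

Thus, by choosing $\alpha' = \alpha/n$, we correctly bound the false negative rate of the overall end-biased test. A similar argument holds for the false positive rate: by choosing a rate of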

PAGE 102

2i;p+;=numTests;=numTests) 2i;=numTests;=numTests) 102

PAGE 103

1. First, during an off-line pre-computation phase, we obtain, from each database object, a sequence of samples. Those samples (or at least a summary of the samples) are stored within an index to facilitate fast evaluation of queries at a later time.

2. Then, when a user asks a query with a specific $\alpha$, $\beta$, p, and a range predicate pred(), the first step is to determine how many samples would need to be taken in order to reject any given object by the first rejection SPRT in the end-biased test, if pred() evaluated to false for each and every one of those samples. This quantity can be computed as:

$\mathit{minSam} = \lfloor \log(\ldots) \,/\, \log(1 - p \ldots) \rfloor$

PAGE 104

[Figure 5-4: Building the MBRs used to index the samples from the end-biased test]

3. Once numSam is obtained, the index is used to answer the question: "Which database objects could possibly have one of the first numSam samples in its pre-computed sequence accepted by pred()?" All such database objects are placed within a candidate set C. All those objects not in C are implicitly rejected.

4. Finally, for each object within C, an end-biased test is run to see whether the object is actually accepted by the query.

In the following few subsections, we discuss some of the details associated with each of these steps.

PAGE 105

1. First, we obtain as many samples as are needed until a new sample is obtained that cannot fit into R.

2. Let b be the current number of samples that have been obtained. Create a (d+1)-dimensional MBR using R along with the sequence number pair $(b_0, b - 1)$, and insert this MBR along with the current S and the object identifier into the spatial index.

3. Next, update R by expanding it so that it contains the new sample. Update S to be the current random number seed, and set $b_0 = b$.

4. Repeat from step (1).

This process is illustrated pictorially above in Figure 5-4, for a series of one-dimensional random values, up to b = 16. In this example, we begin by taking two samples during initialization. We then keep sampling until the fifth sample, which is the first one

[108]. To generate a pseudo-random number, a string of bits (called a seed) is first sent as an argument to a function that uses the bits to produce the next random value. As a side-effect of producing the new random value, the seed itself is updated. This updated seed value is then saved and used to produce the next random value at a later time.

PAGE 106

[Figure: Using the index to speed the end-biased test]

that does not fit into the initial MBR. This completes step (1) above. Then, a two-dimensional MBR is created to bound the sample sequence range from 1 to 4, as well as the set of pseudo-random values that have been observed. This MBR is inserted (along with S) into the spatial index as MBR 1 (step (2)). Next, the fifth sample is used to enlarge the MBR (step (3)). More samples are taken until it is found that the eighth sample does not fit into the new MBR (back to step (1)). Then, MBR 2 is created to cover the first seven samples as well as the sequence range from 5 to 7, and inserted into the spatial index. The process is repeated until all m samples have been obtained.

The process can be summed up as follows: every time that a new sample forces the current MBR to grow, a copy of the MBR is first inserted into the index, and then the MBR is expanded to accommodate the sample.
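The construction summarized above can be sketched for one-dimensional sample values as follows. This is an illustrative sketch only: it keeps the entries in a plain list instead of inserting them into a spatial index, it omits the random number seed S stored with each MBR, and it assumes that the last, still-growing range is flushed as a final entry once all samples have been consumed. The function names are hypothetical.

```python
def build_mbrs(samples):
    """Return (lo, hi, first_seq, last_seq) entries for a 1-D sample sequence.

    Each entry bounds every sample seen so far and records the range of
    1-based sequence numbers for which that bound is valid.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to initialize the range")
    lo, hi = min(samples[0], samples[1]), max(samples[0], samples[1])
    first_seq = 1
    entries = []
    for seq in range(3, len(samples) + 1):
        value = samples[seq - 1]
        if lo <= value <= hi:
            continue  # the sample fits into the current range R
        # The sample does not fit: emit the current range, then grow it.
        entries.append((lo, hi, first_seq, seq - 1))
        lo, hi = min(lo, value), max(hi, value)
        first_seq = seq
    entries.append((lo, hi, first_seq, len(samples)))  # assumed final flush
    return entries


def may_accept(entries, q_lo, q_hi, num_sam):
    """Conservative check: could one of the first num_sam samples lie in [q_lo, q_hi]?"""
    return any(first <= num_sam and lo <= q_hi and q_lo <= hi
               for lo, hi, first, last in entries)


# The fifth and eighth samples are the ones that force the range to grow,
# mirroring the narrative of the worked example.
samples = [3, 7, 5, 4, 9, 6, 8, 1, 2]
entries = build_mbrs(samples)
```

An object is kept as a candidate only when `may_accept` returns true; here a query range of [8.5, 10] can be ruled out for the first four samples, but not once the fifth sample is included.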

PAGE 107

1. In a realistic environment where a selection predicate must be run over millions of database objects, how well do standard methods from statistics perform?

2. Can our end-biased test improve upon methods from the statistical literature?

3. Does the proposed indexing framework effectively speed application of the end-biased test?

Experimental Setup. In each of our experiments, we consider a simple, synthetic database, which allows us to easily test the sensitivity of the various algorithms to different data and query characteristics. This database consists of two-dimensional Gaussians spread randomly throughout a field. For a number of different (query, database)

PAGE 108

1. In the first test, stdDev is fixed at 10% of the width of the field, qSize along each axis is fixed at 3% of the width of the field, and p is set at 0.8. Thus, many database objects intersect each query, but likely none are accepted. dbSize is varied from $10^6$ to $3 \times 10^6$ to $5 \times 10^6$ to $10^7$.

2. In the second test, dbSize is fixed at $10^7$. stdDev is again fixed at 1%, and p is 0.95. qSize is varied from 0.3% to 1% to 3% to 10% along each axis. In the first case, most database objects intersecting the query region are accepted; in the latter, none are, since the object's spread is much greater than the query region.

3. In the third test, dbSize is fixed at $10^7$, qSize is 3%, p = 0.8, and stdDev is varied from 1% to 3% to 10%.

4. In the final test, dbSize is $10^7$, qSize is 3%, stdDev is 10%, and p is varied from 0.8 to 0.9 to 0.95. The first case is particularly difficult because while very few objects

PAGE 109

Method       10^6       3×10^6     5×10^6      10^7
SPRT         568 sec    1700 sec   2824 sec    5653 sec
Opt          2656 sec   8517 sec   14091 sec   26544 sec
End-biased   9 sec      24 sec     38 sec      76 sec
Indexed      1 sec      3 sec      7 sec       15 sec
Table 5-1. Running times over varying database sizes.

Method       0.3%       1%         3%         10%
SPRT         1423 sec   1420 sec   1427 sec   3265 sec
End-biased   76 sec     75 sec     75 sec     430 sec
Indexed      11 sec     4 sec      4 sec      962 sec
Table 5-2. Running times over varying query sizes.

Method       1%         3%         10%
SPRT         5734 sec   5608 sec   5690 sec
End-biased   116 sec    75 sec     75 sec
Indexed      107 sec    12 sec     15 sec
Table 5-3. Running times over varying object standard deviations.

Method       0.8        0.9        0.95
SPRT         5672 sec   2869 sec   1436 sec
End-biased   75 sec     75 sec     75 sec
Indexed      14 sec     12 sec     13 sec
Table 5-4. Running times over varying confidence levels.

are accepted, the spread of each object is so great that most are candidates for acceptance. Each test is run several times, and results are averaged across all runs.

Methods Tested. For each of the above tests, we test four methods: the SPRT, an alternative sequential test that is approximately, asymptotically optimal [113], the end-biased test via sequential scan, and the end-biased test via indexing. In practice, we found the optimal test to be so slow that it was only used for the first set of tests.

Results. The results are given in Tables 5-1 through 5-4. All times are wall-clock running times in seconds. The raw data files for a database of size $10^7$ required about 500 MB of storage. The indexed, pre-sampled version of this data file requires around 7 GB to store in its entirety if 500 samples are used.

Discussion. There are several interesting findings. First and foremost is the terrible relative performance of the "optimal" sequential test, which was generally about five times slower

PAGE 110

110

PAGE 111

[111], sequential statistics have been widely studied. Wald's original SPRT is proven to be optimal for $\theta$ values lying exactly at H0 or H1; in other cases, it may perform poorly. Kiefer and Weiss first raised the question of developing tests that are provably optimal at points other than those covered by H0 and H1 [114]. However, in the general case, this problem has not been solved, though there has been some success in solving the asymptotic case where $(\alpha = \beta) \to 0$. Such a solution was first proposed by Lorden in 1976 [115] where

PAGE 112

[116], Huffman [113], and Pavlov [117]. Work in this area continues to this day. However, a reasonable criticism of much of this work is its focus on asymptotic optimality, particularly its focus on applications having vanishingly small (and equal) values of $\alpha$ and $\beta$. It is unclear how well such asymptotically optimal tests perform in practice, and the statistical research literature provides surprisingly little in the way of guidance, which was our motivation for undertaking this line of work. In our particular application, $\alpha$ and $\beta$ are not equal (in fact, they will most often differ by orders of magnitude), and it is unclear to us whether practical and clearly superior alternatives to Wald's original proposal exist in such a case. In contrast to related work in pure and applied statistics, we seek no notion of optimality; our goal is to design a test that is (a) correct, and (b) experimentally proven to work well in the case where $\epsilon$ is vanishingly small, and yet more often than not, the "true" probability of object inclusion is either zero or one.

In the database literature, the paper that is closest to our indexing proposal is due to Tao et al. [118]. They consider the problem of indexing spatial data where the position is defined by a PDF. However, they assume that the PDF is non-infinite and integrable. The assumption of finiteness may be reasonable for many applications (since many distributions, such as the Gaussian, fall off exponentially in density as distance from the mean increases). However, integrability is a strong assumption, precluding, for example, many distributions resulting from Bayesian inference [62] that can only be sampled from using MCMC methods [107].

Most of the work in probabilistic databases is at least tangentially related to our own, in that our goal is to represent uncertainty. We point the reader to Dalvi and Suciu for a general treatment of the topic [119]. The paper most closely related to this work is due to Jampani et al. [109], who propose a data model for uncertain data that is fundamentally based upon Monte Carlo.

[1] R. Guting and M. Schneider, Moving Object Databases, Morgan Kaufmann, 2005.

[2] D. Papadias, D. Zhang, and G. Kollios, Advances in Spatial and SpatioTemporal Data Management, Springer-Verlag, 2007.

[3] J. Schiller and A. Voisard, Location-Based Services, Morgan Kaufmann, 2004.

[4] Y. Zhang and O. Wolfson, Satellite-based information services, Kluwer Academic Publishers, 2002.

[5] W. I. Grosky, A. Kansal, S. Nath, J. Liu, and F. Zhao, "Senseweb: An infrastructure for shared sensing," in IEEE Multimedia, 2007.

[6] Cover, "Mandate for change," RFID Journal, 2004.

[7] G. Abdulla, T. Critchlow, and W. Arrighi, "Simulation data as data streams," SIGMOD Record 33(1):89-94, 2004.

[8] N. Pelekis, B. Theodoulidis, I. Kopanakis, and Y. Theodoridis, "Literature review of spatio-temporal database models," in The Knowledge Engineering Review, 2004.

[9] A. P. Sistla, O. Wolfson, S. Chamberlain, and S. Dao, "Modeling and querying moving objects," in ICDE, 1997.

[10] M. Erwig, R. Guting, M. Schneider, and M. Vazirgianni, "A foundation for representing and querying moving objects," in TODS, 2000.

[11] L. Forlizzi, R. H. Guting, E. Nardelli, and M. Schneider, "A data model and data structures for moving objects databases," in SIGMOD, 2000.

[12] C. Parent, S. Spaccapietra, and E. Zimanyi, "Spatiotemporal conceptual models: Data structures + space + time," in GIS, 1999.

[13] N. Tryfona, R. Price, and C. S. Jensen, "Conceptual models for spatiotemporal applications," in The CHOROCHRONOS Approach, 2002.

[14] E. Tossebro, "Representing uncertainty in spatial and spatiotemporal databases," in PhD Thesis, 2002.

[15] M. Erwig and S. Schneider, "STQL: A spatiotemporal query language," in Mining spatio-temporal information systems, 2002.

[16] R. Guttman, "R-trees: a dynamic index structure for spatial searching," in SIGMOD, 1984.

[17] S. Saltenis, C. Jensen, S. Leutenegger, and M. Lopez, "Indexing the positions of continuously moving objects," in SIGMOD, 2000.

[18] P. Chakka, A. Everspaugh, and J. Patel, "Indexing large trajectory data sets with SETI," in CIDR, 2003.

[19] J. Patel, Y. Chen, and P. Chakka, "Stripes: An efficient index for predicted trajectories," in SIGMOD, 2004.

[20] S. Theodoridis, "Spatio-temporal Indexing for Large Multimedia Applications," in IEEE Int'l Conference on Multimedia Computing and Systems, 1996.

[21] D. Pfoser, C. S. Jensen, and Y. Theodoridis, "Novel approaches to the indexing of moving object trajectories," in VLDB, 2000.

[22] T. Tzouramanis, M. Vassilakopoulos, and Y. Manolopoulos, "Overlapping linear quadtrees: A spatio-temporal access method," in Advances in GIS, 1998.

[23] G. Kollios, D. Gunopulos, and V. J. Tsotras, "Nearest neighbor queries in a mobile environment," in Spatiotemporal database management, 1999.

[24] Z. Song and N. Roussopoulos, "K-nearest neighbor search for moving query point," in Symp. on Spatial and Temporal Databases, 2001.

[25] Z. Huang, H. Lu, B. Ooi, and A. Tung, "Continuous skyline queries for moving objects," in TKDE, 2006.

[26] G. Kollios, D. Gunopulos, and V. J. Tsotras, "An improved R-tree indexing for temporal spatial databases," in SDH, 1990.

[27] Y. Tao and D. Papadias, "MV3R-tree: A spatiotemporal access method for timestamp and interval queries," in VLDB, 2001.

[28] M. A. Nascimento and J. R. O. Silva, "Towards historical R-trees," in ACM SAC, 1998.

[29] G. Iwerks, H. Samet, and K. P. Smith, "Maintenance of spatial semijoin queries on moving points," in VLDB, 2004.

[30] S. Arumugam and C. Jermaine, "Closest-point-of-approach join over moving object histories," in ICDE, 2006.

[31] Y. Choi and C. Chung, "Selectivity estimation for spatio-temporal queries to moving objects," in SIGMOD, 2002.

[32] M. Schneider, "Evaluation of spatio-temporal predicates on moving objects," in ICDE, 2005.

[33] Y. Tao, J. Sun, D. Papadias, and G. Kollios, "Analysis of predictive spatio-temporal queries," in TODS, 2003.

[34] J. Sun, Y. Tao, D. Papadias, and G. Kollios, "Spatiotemporal join selectivity," in Information Systems, 2006.

[35] M. Vlachos, G. Kollios, and D. Gunopulos, "Discovering similar multidimensional trajectories," in ICDE, 2002.

[36] J. Kubica, A. Moore, A. Connolly, and R. Jedicke, "A multiple tree algorithm for the efficient association of asteroid observations," in KDD, 2005.

[37] S. Gaffney and P. Smyth, "Trajectory Clustering with Mixtures of Regression Models," in KDD, 1999.

[38] Y. Li, J. Han, and J. Yang, "Clustering Moving Objects," in KDD, 2004.

[39] J. Lee, J. Han, and K. Whang, "Trajectory clustering: A partition-and-group framework," in SIGMOD, 2007.

[40] D. Guo, J. Chen, A. MacEachren, and K. Liao, "A visualization system for space-time and multivariate patterns," in IEEE Transactions on Visualization and Computer Graphics, 2006.

[41] D. Papadias, Y. Tao, P. Kalnis, and J. Zhang, "Indexing spatio-temporal data warehouses," in ICDE, 2002.

[42] N. Mamoulis, H. Cao, G. Kollios, M. Hadjieleftheriou, Y. Tao, and D. Cheung, "Mining, indexing, and querying historical spatiotemporal data," in KDD, 2004.

[43] Y. Tao, G. Kollios, J. Considine, F. Li, and D. Papadias, "Spatio-temporal aggregation using sketches," in ICDE, 2004.

[44] D. Papadias, Y. Tao, P. Kalnis, and J. Zhang, "Historical spatio-temporal aggregation," in Trans. of Information Systems, 2005.

[45] T. Brinkhoff, H. P. Kriegel, and B. Seeger, "Efficient processing of spatial-joins using R-trees," in SIGMOD, 1993.

[46] Y. W. Huang, N. Jing, and E. A. Rundensteiner, "Spatial joins using R-trees: Breadth-first traversal with global optimizations," in VLDB, 1997.

[47] M. Lo and C. V. Ravishankar, "Spatial hash joins," in SIGMOD, 1996.

[48] J. Patel and D. DeWitt, "Partition based spatial-merge join," in SIGMOD, 1996.

[49] L. Arge, O. Procopiu, and S. T. J. S. Vitter, "Scalable sweeping-based spatial join," in VLDB, 1998.

[50] M. Berg, M. Kreveld, M. Overmars, and O. Schwarzkopf, Computational Geometry: Algorithms and Applications, Springer-Verlag, 2000.

[51] S. H. Jeong, N. W. Paton, A. Fernandes, and T. Griffiths, "An experimental performance evaluation of spatio-temporal join strategies," in Transactions in GIS, 2004.

[52] W. Winkler, "Matching and record linkage," in Business Survey Methods, 1995.

[53] M. Hernandez and S. Stolfo, "The merge/purge problem for large databases," in SIGMOD, 1995.

[54] C. E. A. Monge, "The field matching problem: Algorithms and applications," in KDD, 1996.

[55] W. Cohen and J. Richman, "Learning to match and cluster large high-dimensional data sets for data integration," in KDD, 2002.

[56] Y. Bar-Shalom and T. Fortmann, "Tracking and data association," in Academic Press, 1988.

[57] B. Ristic, S. Arulampalam, and N. Gordon, "Beyond the Kalman filter: Particle filters for tracking applications," in Artech House Publishers, 2004.

[58] D. B. Reid, "An algorithm for tracking multiple targets," in IEEE Trans. Automat. Control, 1979.

[59] X. Li, "The PDF of nearest neighbor measurement and a probabilistic nearest neighbor filter for tracking in clutter," in IEEE Control and Decision Conference, 1993.

[60] I. Cox and S. L. Hingorani, "An efficient implementation of Reid's multiple hypothesis tracking algorithm and its evaluation for the purpose of Visual Tracking," in Intl. Conf. on Pattern Recognition, 1994.

[61] T. Song, D. Lee, and J. Ryu, "A probabilistic nearest neighbor filter algorithm for tracking in a clutter environment," in Signal Processing, Elsevier Science, 2005.

[62] A. O'Hagan and J. J. Forster, Bayesian Inference, Volume 2B of Kendall's Advanced Theory of Statistics. Arnold, second edition, 2004.

[63] A. Doucet, C. Andrieu, and S. Godsill, "On sequential Monte Carlo sampling methods for Bayesian filtering," Statistics and Computing, vol. 10, pp. 197-208, 2000.

[64] D. Fox, J. Hightower, L. Liao, D. Schulz, and G. Borriello, "Bayesian filtering for location estimation," in IEEE Pervasive Computing, 2003.

[65] Z. Khan, T. Balch, and F. Dellaert, "An MCMC-based particle filter for multiple interacting targets," in ECCV, 2004.

[66] S. Oh, S. Russell, and S. Sastry, "Markov Chain Monte Carlo data association for general multiple-target tracking problems," in IEEE Conf. on Decision and Control, 2004.

[67] O. Wolfson, S. Chamberlain, S. Dao, L. Jiang, and G. Mendez, "Cost and imprecision in modeling the position of moving objects," in ICDE, 1998.

[68] D. Pfoser, "Capturing the uncertainty of moving objects," in LNCS, 1999.

[69] J. H. Hosbond, S. Saltenis, and R. Ortfort, "Indexing uncertainty of continuously moving objects," in IDEAS, 2003.

[70] C. Trajcevski, O. Wolfson, K. Hinrichs, and S. Chamberlain, "Managing uncertainty in moving object databases," in TODS, 2004.

[71] R. Cheng, D. Kalashnikov, and S. Prabhakar, "Querying imprecise data in moving object environments," in TKDE, 2004.

[72] Y. Tao, R. Cheng, and X. Xiao, "Indexing multidimensional uncertain data with arbitrary probability density functions," in VLDB, 2005.

[73] H. Mokhtar and J. Su, "Universal trajectory queries on moving object databases," in Mobile Data Management, 2004.

[74] D. Eberly, 3D Game Engine Design: A Practical Approach to Real-time Computer Graphics, Morgan Kaufmann, 2001.

[75] M. Mokbel, X. Xiong, and W. Aref, "SINA: Scalable incremental processing of continuous queries in spatio-temporal databases," in SIGMOD, 2004.

[76] Y. Tao, "Time-parametrized queries in spatio-temporal databases," in SIGMOD, 2004.

[77] S. Saltenis and C. Jensen, "Indexing of moving objects for location-based services," in ICDE, 2002.

[78] Y. Tao, D. Papadias, and J. Sun, "The TPR*-tree: An optimized spatio-temporal access method for predictive queries," in VLDB, 2003.

[79] O. Gunther, "Efficient computation of spatial joins," in ICDE, 1993.

[80] S. Leutenegger and J. Edgington, "STR: A simple and efficient algorithm for R-tree packing," in 13th Intl. Conf. on Data Engineering (ICDE), 1997.

[81] D. Mehta and S. Sahni, Handbook of Data Structures and Its Applications, Chapman and Hall, 2004.

[82] P. J. Haas and J. M. Hellerstein, "Ripple joins for online aggregation," in SIGMOD, 1999.

[83] M. Nascimento and J. Silva, "Evaluation of access structures for discretely moving points," in Int'l Workshop on Spatio-Temporal Database Management, 1999.

[84] A. Dempster, N. Laird, and D. Rubin, "Maximum likelihood estimation from incomplete data via the EM," in Journ. Royal Statistical Society, 1977.

[85] J. Bilmes, "A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models," in Technical Report, Univ. of Berkeley, 1997.

[86] J. Banfield and A. Raftery, "Model-based Gaussian and non-Gaussian clustering," in Biometrics, 1993.

[87] J. Oliver, R. Baxter, and C. Wallace, "Unsupervised learning using MML," in ICML, 1996.

[88] M. Hansen and B. Yu, "Model selection and the principle of minimum description length," in Journal of the American Statistical Association, 1998.

[89] M. Figueiredo and A. Jain, "Unsupervised learning of finite mixture models," in IEEE Trans. on Pattern Analysis and Machine Intelligence, 2002.

[90] R. Baxter, "Minimum message length inference: Theory and applications," in PhD Thesis, 1996.

[91] G. Celeux, S. Chretien, F. Forbes, and A. Mikhadri, "A component-wise EM algorithm for mixtures," in Journ. of Computational and Graphical Statistics, 1999.

[92] D. Pfoser and C. Jensen, "Trajectory indexing using movement constraints," in GeoInformatica, 2005.

[93] Y. Cai and R. Ng, "Indexing spatio-temporal trajectories with Chebyshev polynomials," in SIGMOD, 2004.

[94] S. Rasetic and J. Sander, "A trajectory splitting model for efficient spatio-temporal indexing," in VLDB, 2005.

[95] D. Chudova, S. Gaffney, E. Mjolsness, and P. Smyth, "Translation-invariant Mixture Models for Curve Clustering," in KDD, 2003.

[96] H. Kriegel and M. Pfeifle, "Density-based Clustering of Uncertain Data," in KDD, 2005.

[97] J. L. Bentley, "K-d trees for semidynamic point sets," in Annual Symposium on Computational Geometry, 1990.

[98] L. Frenkel and M. Feder, "Recursive Expectation Maximization algorithms for time-varying parameters with applications to multiple target tracking," in IEEE Trans. Signal Processing, 1999.

[99] P. Chung, J. Bohme, and A. Hero, "Tracking of multiple moving sources using recursive EM algorithm," in EURASIP Journal on Applied Signal Processing, 2005.

[100] P. Agrawal, O. Benjelloun, A. D. Sarma, C. Hayworth, S. U. Nabar, T. Sugihara, and J. Widom, "Trio: A system for data, uncertainty, and lineage," in VLDB, 2006.

[101] P. Andritsos, A. Fuxman, and R. J. Miller, "Clean answers over dirty databases: A probabilistic approach," in ICDE, 2006, p. 30.

[102] L. Antova, C. Koch, and D. Olteanu, "MayBMS: Managing incomplete information with probabilistic world-set decompositions," in ICDE, 2007, pp. 1479-1480.

[103] R. Cheng, S. Singh, and S. Prabhakar, "U-DBMS: A database system for managing constantly-evolving data," in VLDB, 2005, pp. 1271-1274.

[104] N. N. Dalvi and D. Suciu, "Efficient query evaluation on probabilistic databases," VLDB J., vol. 16, no. 4, pp. 523-544, 2007.

[105] N. Fuhr and T. Rolleke, "A probabilistic relational algebra for the integration of information retrieval and database systems," ACM Trans. Inf. Syst., vol. 15, no. 1, pp. 32-66, 1997.

[106] R. Gupta and S. Sarawagi, "Creating probabilistic databases from information extraction models," in VLDB, 2006, pp. 965-976.

[107] C. Robert and G. Casella, Monte Carlo Statistical Methods, Springer, second edition, 2004.

[108] J. E. Gentle, Random Number Generation and Monte Carlo Methods, Springer, second edition, 2003.

[109] R. Jampani, F. Xu, M. Wu, L. P. Ngai, C. Jermaine, and P. Haas, "MCDB: A Monte Carlo approach to handling uncertainty," in SIGMOD, 2008.

[110] J. Neyman and E. Pearson, "On the problem of the most efficient tests of statistical hypotheses," Phil. Tran. of the Royal Soc. of London, Series A, vol. 231, pp. 289-337, 1933.

[111] A. Wald, Sequential Analysis, Wiley, 1947.

[112] J. Galambos and I. Simonelli, Bonferroni-Type Inequalities with Applications, Springer-Verlag, 1996.

[113] M. Huffman, "An efficient approximate solution to the Kiefer-Weiss problem," in The Annals of Statistics, 1983, vol. 11, pp. 306-316.

[114] J. Kiefer and L. Weiss, "Some properties of generalized sequential probability ratio tests," in The Annals of Mathematical Statistics, 1957, vol. 28, pp. 57-74.

[115] G. Lorden, "2-SPRTs and the modified Kiefer-Weiss problem of minimizing an expected sample size," in The Annals of Statistics, 1976, vol. 4, pp. 281-291.

[116] B. Eisenberg, "The asymptotic solution to the Kiefer-Weiss problem," in Comm. Statistics C-Sequential Analysis, 1982, vol. 1, pp. 81-88.

[117] I. Pavlov, "Sequential procedure of testing composite hypotheses with application to the Kiefer-Weiss problem," in Theory of Probability and Its Applications, 1991, vol. 35, pp. 280-292.

[118] Y. Tao, R. Cheng, X. Xiao, W. K. Ngai, B. Kao, and S. Prabhakar, "Indexing multi-dimensional uncertain data with arbitrary probability density functions," in VLDB, 2005, pp. 922-933.

[119] N. Dalvi and D. Suciu, "Management of probabilistic data: foundations and challenges," in PODS, 2007, pp. 1-12.

Subramanian Arumugam is a member of the query processing team at the database startup, Greenplum. He is a recipient of the 2007 ACM SIGMOD Best Paper Award. He received his bachelor's degree from the University of Madras in 2000. He obtained his master's in computer engineering in 2003, and his PhD in computer engineering in 2008, both from the University of Florida.