
Correlation-Aware Statistical Methods for Sampling-Based Group By Estimates

Permanent Link: http://ufdc.ufl.edu/UFE0024750/00001

Material Information

Title: Correlation-Aware Statistical Methods for Sampling-Based Group By Estimates
Physical Description: 1 online resource (139 p.)
Language: English
Creator: Xu, Fei
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2009

Subjects

Subjects / Keywords: correlation, estimation, groupby, sampling, topk
Computer and Information Science and Engineering -- Dissertations, Academic -- UF
Genre: Computer Engineering thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract: Over the last decade, Data Warehousing and Online Analytical Processing (OLAP) have gained much interest from industry because of the need for processing analytical queries for business intelligence and decision support. A typical analytical query may require long evaluation time because analytical queries are complicated, and because the datasets used to evaluate analytical queries are large. One key problem arising from long evaluation time is that no feedback is given until the query is fully evaluated. This is problematic for several reasons. First, this makes query debugging very difficult. Second, the long running time also discourages users from exploring the data interactively. One way to speed up the evaluation time is to use approximate query processing techniques, such as sampling. Researchers have developed scalable approximate query processing techniques for SELECT-PROJECT-JOIN-AGGREGATE queries. However, most work has ignored GROUP BY queries. This is a significant hole in the state-of-the-art, since the GROUP BY query is an important type of OLAP query. For example, more than two thirds of the public TPC-H benchmark queries are GROUP BY queries. Running a GROUP BY query in an approximate query processing system requires the same sample to be used to estimate the result of each group, which induces correlations among the estimates. Thus from a statistical point of view, providing estimation information for a GROUP BY query without considering the correlations is problematic and probably misleading. In this thesis, I formally address this problem and provide correlation-aware statistical methods to answer sampling-based GROUP BY queries. I make three specific contributions to the state-of-the-art in this area. First, I formally characterize the correlations among the group-wise estimates. Second, I develop methods to provide correlation-aware simultaneous confidence bounds for GROUP BY queries. Finally, I develop correlation-aware statistical methods to return all top-k groups with high probability when only database samples are available.
General Note: In the series University of Florida Digital Collections.
General Note: Includes vita.
Bibliography: Includes bibliographical references.
Source of Description: Description based on online resource; title from PDF title page.
Source of Description: This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility: by Fei Xu.
Thesis: Thesis (Ph.D.)--University of Florida, 2009.
Local: Adviser: Jermaine, Christopher.

Record Information

Source Institution: UFRGP
Rights Management: Applicable rights reserved.
Classification: lcc - LD1780 2009
System ID: UFE0024750:00001


Full Text

CORRELATION-AWARE STATISTICAL METHODS FOR SAMPLING-BASED GROUP BY ESTIMATES

By

FEI XU

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2009

© 2009 Fei Xu

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
  1.1 Problem Description
    1.1.1 How Correlations Affect the GROUP BY Estimates
    1.1.2 A Simple but Impractical Solution
  1.2 Major Contributions of This Thesis
  1.3 Computing the Covariance between a Pair of Group-wise Estimates
  1.4 Providing Correlation-Aware Simultaneous Confidence Bounds
  1.5 Returning Top-k Groups with High Probability
  1.6 Organization

2 RELATED WORK
  2.1 Database Sampling
    2.1.1 Obtaining and Maintaining Random Database Samples
    2.1.2 Sampling-Based Estimation
      2.1.2.1 Selectivity and size of join estimation
      2.1.2.2 Approximate query processing
    2.1.3 Comparison to Survey Sampling
  2.2 Top-k Query Processing in Relational Databases
    2.2.1 Top-k Query Processing for SELECT-PROJECT-JOIN Queries
    2.2.2 Top-k Query Processing for GROUP BY Queries
  2.3 Related Work in Statistics Literature
    2.3.1 Multiple Inference
    2.3.2 Error Guarantees for Monte-Carlo Methods

3 PRELIMINARIES
  3.1 Estimators and Their Properties
  3.2 Correlations
  3.3 Confidence Intervals and Confidence Regions

4 COVARIANCE COMPUTATION
  4.1 Covariance Analysis
    4.1.1 Analysis for SUM over Two Database Tables
    4.1.2 Analysis for SUM over Multiple Database Tables
    4.1.3 An Unbiased Covariance Estimator
  4.2 Computational Considerations for Covariances
  4.3 Experimental Evaluation
    4.3.1 Experimental setup
    4.3.2 Results
    4.3.3 Discussion
  4.4 Conclusion

5 PROVIDING CORRELATION-AWARE SIMULTANEOUS CONFIDENCE BOUNDS
  5.1 The SI Problem and GROUP BY Queries
    5.1.1 GROUP BY and Multiple Inference
    5.1.2 The SI Problem
  5.2 Applying the Sum Inference Problem to GROUP BY Queries
  5.3 Solving the SI Problem
    5.3.1 A Solution Using Monte Carlo Re-sampling
    5.3.2 Solving the SI Problem Using Moment Analysis
    5.3.3 Solution Using Approximate Moment Analysis
  5.4 Experimental Evaluation
    5.4.1 Running Time Experiments
      5.4.1.1 Experimental setup
      5.4.1.2 Results
      5.4.1.3 Discussion
    5.4.2 Correctness Experiments
      5.4.2.1 Experimental Setup
      5.4.2.2 Results
      5.4.2.3 Discussion
  5.5 Conclusion

6 RETURNING TOP-k GROUPS WITH HIGH PROBABILITY
  6.1 Introduction
  6.2 Confidence Region and the Region Top-k Problem
    6.2.1 How Confidence Regions may Help Return all Top-k Groups
    6.2.2 The Region Top-k Problem
    6.2.3 Applying the Region Top-k Problem to the GROUP BY Top-k Query
  6.3 Solutions for Returning All Top-k Groups with High Probability
    6.3.1 A Solution Using Interval Sweeping
    6.3.2 A Pruning-Based Solution
    6.3.3 A Two-Stage Algorithm
  6.4 Various Issues for Building Confidence Regions
  6.5 Experimental Evaluation
    6.5.1 Running Time Experiments
      6.5.1.1 Experimental setup
      6.5.1.2 Results
      6.5.1.3 Discussion
    6.5.2 Efficiency Experiments
      6.5.2.1 Results
      6.5.2.2 Discussion
    6.5.3 Monte Carlo Experiments
      6.5.3.1 Discussion
  6.6 Which Algorithm should be Implemented for a Real World System?
  6.7 Conclusion

REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

Table

1-1 Two database tables for the example SQL query: PRODUCT and SALES.
4-1 Example results of query 1
4-2 Example results of query 2
4-3 Running time for Covariance Computation under different parameter values
5-1 Running time under different parameter values
6-1 Running time for two extreme cases for three algorithms. The first case is that the associated g values are almost identical, and the second case is that the associated g values are highly separable. Both low-correlation and high-correlation combinations are provided for 100 groups; for 1000 and 10000 groups, only the first case with low correlation and the second case with high correlation are provided.
6-2 Number of returned groups for two extreme cases for three algorithms. The first case is that the associated g values are almost identical, and the second case is that the associated scores are highly separable.
6-3 Number of returned groups for the middle case for three algorithms. The total number of groups is 100. The parameters are chosen so that the number of groups returned by the interval sweeping algorithm is approximately 50.
6-4 Number of times that all top-5 groups are correctly returned in 100 independent tries.

LIST OF FIGURES

Figure

1-1 Three possible relationships between the region and the line x0 = x1: (a) The region is above the line. (b) The region touches the line. (c) The region is below the line. If the true ranking score vector is in the region, in case (a), the top-1 group is the second group because for any point in the region x1 > x0. Similarly, in case (c), the top-1 group is the first group. In case (b) we cannot say anything.
3-1 Standard Bivariate Normal, coef = 0.5 (left) and -0.5 (right).
4-1 The Algorithm for computing VFF.
5-1 A GUI for GROUP BY query.
5-2 Shapes of Beta distribution (left), the empirical estimate of pdf of F[k] for 100 groups based on 10000 Monte Carlo samples (top right), and the approximation of the pdf of F[k] using Beta(31.123, 66.6907) (bottom right).
5-3 Algorithm to calculate the estimated variance
5-4 Correctness of three methods over dataset one: All parameters use their default values. The error bar is a 95% two-sided confidence interval from the corresponding binomial distribution Binomial(00, p), where each p value is shown in the x axis.
5-5 Correctness of three methods over dataset two: Except for the fact that the sampling ratio is set to 0.5%, three other parameters use their default values.
5-6 Correctness of three methods over dataset three: Except for the fact that the skew is set to 0.3, three other parameters use their default values.
5-7 Correctness of the three methods over dataset four: 20 groups and a skew of 0.3 are used.
5-8 Correctness obtained assuming independence on dataset four. The right plot shows the difference between the SI solution obtained using independence (the solid line) and the Beta approximation (the dash line).
6-1 Three possible cases for the confidence region, classified by using the relationship between the region and the line first-group = second-group: (a) The region is above the line. (b) The region touches the line. (c) The region is below the line.
6-2 A framework for solving the GROUP BY top-k using a solution to the region top-k problem.
6-3 An example of the interval sweeping algorithm: (a) The starting point. (b) The ending point. The return set is {1, 2, 4}.
6-4 Three possible relationships between the region and the line first-group = second-group. The region is a region such that each of its edges is parallel to one of the eigenvectors. (a) The region is above the line. (b) The region touches the line. (c) The region is below the line. The dash rectangle is the bounding box.
6-5 The Pruning-Based Algorithm
6-6 The Two-Stage Algorithm.

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

CORRELATION-AWARE STATISTICAL METHODS FOR SAMPLING-BASED GROUP BY ESTIMATES

By Fei Xu

August 2009

Chair: Christopher Jermaine
Major: Computer Engineering

Over the last decade, Data Warehousing and Online Analytical Processing (OLAP) have gained much interest from industry because of the need for processing analytical queries for business intelligence and decision support. A typical analytical query may require long evaluation time because analytical queries are complicated, and because the datasets used to evaluate analytical queries are large. One key problem arising from long evaluation time is that no feedback is given until the query is fully evaluated. This is problematic for several reasons. First, this makes query debugging very difficult. Second, the long running time also discourages users from exploring the data interactively.

One way to speed up the evaluation time is to use approximate query processing techniques, such as sampling. Researchers have developed scalable approximate query processing techniques for SELECT-PROJECT-JOIN-AGGREGATE queries. However, most work has ignored GROUP BY queries. This is a significant hole in the state-of-the-art, since the GROUP BY query is an important type of OLAP query. For example, more than two thirds of the public TPC-H benchmark queries are GROUP BY queries. Running a GROUP BY query in an approximate query processing system requires the same sample to be used to estimate the result of each group, which induces correlations among the estimates. Thus from a statistical point of view, providing estimation information for a GROUP BY query without considering the correlations is problematic and probably misleading.

In this thesis, I formally address this problem and provide correlation-aware statistical methods to answer sampling-based GROUP BY queries. I make three specific contributions to the state-of-the-art in this area. First, I formally characterize the correlations among the group-wise estimates. Second, I develop methods to provide correlation-aware simultaneous confidence bounds for GROUP BY queries. Finally, I develop correlation-aware statistical methods to return all "top-k" groups with high probability when only database samples are available.

CHAPTER 1
INTRODUCTION

Over the last decade, data warehousing and online analytical processing (OLAP) [46] have gained much interest from industry, since there is a need for processing analytical queries for business intelligence and decision support. Unfortunately, evaluating analytical queries is very time-consuming. Based on the most recent public TPC-H [100] benchmark results, it is clear that evaluating an analytical query can take hours or even days. This long processing time has two causes. First, analytical queries are more complicated than normal database queries. Second, the datasets that are used for evaluating analytical queries are huge. The volume of data for a typical data warehouse is at least several terabytes, and these queries generally require accessing a majority of the dataset.

One key problem arising from such long evaluation time is that no feedback is given until the query is fully evaluated. This is problematic for several reasons. First, this makes query debugging very difficult. Only when a query is fully evaluated does a query designer see the result. If the query is incorrectly formulated, the query designer may need to wait for hours or days to see the result and discover the error. Second, the long running time also discourages users from exploring the data interactively. Users may want to explore the data by iteratively modifying the selection conditions for a query, and watching how the query results vary. This helps in finding interesting aspects of the data. However, the long running time makes interactive data exploration impossible.

1.1 Problem Description

One way to speed up the evaluation time is to use approximate query processing techniques [57, 73]. Unlike traditional query processing techniques [41], most approximate query processing techniques use some statistics, such as a sample [51, 53, 57, 62, 63, 79], to answer a query, and only require a small amount of time to provide an approximate result with a statistical guarantee. The approximate result is called an estimate. Note that if the query is approximately evaluated many times, many different estimates may be obtained. These estimates are typically characterized using a random variable [18], which is called an estimator [50, 51, 53, 62, 63]. The estimators and their properties are further discussed in Section 3.1. Most of the statistical guarantees are provided through confidence intervals [18]. A typical statistical guarantee looks like "with 95% confidence, the result is in $[2.33 \times 10^5, 2.42 \times 10^5]$". Confidence intervals are further discussed in Section 3.3.

For the purpose of query debugging and data exploration, an approximate result may be sufficient because an approximate result is enough for a user to determine the correctness of the input query. Furthermore, during data exploration, an approximate result may help to identify interesting selection conditions. Once the user is reasonably sure the query is correct or an interesting selection condition is found, he or she can then ask the system to fully evaluate the query.

This thesis is focused on approximate query processing for one particularly important type of query: the GROUP BY query [87]. A GROUP BY query may be viewed as a large set of subqueries asked simultaneously. These subqueries partition the data into groups based on one or more user-specified attributes such that each group contains all the records whose grouping attributes have the same values. GROUP BY queries are ubiquitous. More than two thirds of the TPC-H benchmark queries are GROUP BY queries.

The fundamental problem addressed in this thesis is that existing online approximate query processing systems are not designed for GROUP BY queries, in the sense that these systems provide estimates for each group without considering the correlations [18] among the groups while using the same samples to estimate every group [57, 89]. More information about correlations is given in Section 3.2.

1.1.1 How Correlations Affect the GROUP BY Estimates

All existing online approximate query processing systems use sampling [91]. An approximate query processing system is an online system if the system constantly estimates the query answer and updates the estimates and associated confidence intervals, along with the actual query processing [57, 73]. Generally the estimates converge to the true query answer and the widths of the associated bounds shrink to 0 when the actual query processing terminates [73].

The key problem that existing approaches fail to address is that the same samples are shared to estimate the query answer for each and every group. This is problematic because sharing samples introduces correlations among the group-wise estimates. However, no online approximate query processing system has considered the correlations induced by the sample sharing while providing the group-wise estimates.

To illustrate why correlation must be taken into account, we first consider the query: "What were the total sales per region in 2004?" This query may be written in SQL as:

    SELECT SUM(p.COST), s.REGION
    FROM PRODUCT p, SALES s
    WHERE p.PROD = s.PROD AND s.YR = 2004
    GROUP BY s.REGION

Imagine that the answer to this query is guessed by using random samples from the two database tables shown in Table 1-1. The random samples are used as input into a sampling-based estimator for the answer to the query [49]. In order to apply the simplest estimator, the samples from the two tables would be joined, and the result scaled up by the inverse of each sampling fraction.

The problem we face is that the same sample of relation PRODUCT is effectively used to simultaneously answer three queries, one for each group; in this case, there is one group associated with the region Europe, one with the region USA, and one with the region Asia. If the first estimate is inaccurate, then the second and third estimates are likely to be inaccurate as well. Note that each of the three regions described in the database has experienced a sale of the product thingy in 2004, and thingy is the most expensive product available. As a result, if thingy has not been sampled from the PRODUCT table, it is likely that the estimates for Europe, Asia, and USA will simultaneously be too low. Such correlation may be of great concern if an end-user will make decisions based upon the results of several correlated approximations.
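The effect of sharing one PRODUCT sample across all three regions can be checked by brute force on the rows of Table 1-1. The sketch below is a minimal illustration, not the estimator analyzed in later chapters. It assumes a fixed-size sample of two PRODUCT rows drawn without replacement, uses every SALES row (sampling fraction 1), and enumerates all six equally likely samples. The names `estimate`, `covariance`, and `SAMPLE_SIZE` are introduced here for illustration only.

```python
from itertools import combinations
from statistics import mean

# Rows of Table 1-1.
PRODUCT = {"thingy": 10, "gadget": 2, "widget": 3, "dohicky": 4}
SALES = [
    ("thingy", 2004, "Asia"), ("widget", 2003, "Asia"),
    ("thingy", 2004, "USA"), ("gadget", 2004, "USA"), ("widget", 2003, "USA"),
    ("thingy", 2004, "Europe"), ("dohicky", 2004, "Europe"),
]
REGIONS = ("Asia", "USA", "Europe")
SAMPLE_SIZE = 2  # assumed PRODUCT sample size, chosen for illustration


def estimate(sampled_products):
    """Join the PRODUCT sample with SALES and scale by the inverse sampling fraction."""
    scale = len(PRODUCT) / len(sampled_products)
    totals = dict.fromkeys(REGIONS, 0.0)
    for prod, year, region in SALES:
        if year == 2004 and prod in sampled_products:
            totals[region] += scale * PRODUCT[prod]
    return totals


def covariance(xs, ys):
    """Population covariance over the equally likely samples."""
    mx, my = mean(xs), mean(ys)
    return mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])


# Enumerate every possible PRODUCT sample and the estimates it produces.
samples = [frozenset(c) for c in combinations(PRODUCT, SAMPLE_SIZE)]
estimates = [estimate(s) for s in samples]
truth = estimate(frozenset(PRODUCT))  # full table, scale factor 1
per_region = {r: [e[r] for e in estimates] for r in REGIONS}

for r in REGIONS:
    print(r, "true:", truth[r], "mean estimate:", mean(per_region[r]))
for a, b in combinations(REGIONS, 2):
    print("cov(%s, %s) = %.2f" % (a, b, covariance(per_region[a], per_region[b])))
```

Under these assumptions each per-region estimator is unbiased, yet every sample that misses thingy underestimates all three regions at once, and the covariance between every pair of regions is positive.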

PRODUCT                 SALES
PROD      COST          PROD      YR      Region
thingy    $10           thingy    2004    Asia
gadget    $2            widget    2003    Asia
widget    $3            thingy    2004    USA
dohicky   $4            gadget    2004    USA
                        widget    2003    USA
                        thingy    2004    Europe
                        dohicky   2004    Europe

Table 1-1. Two database tables for the example SQL query: PRODUCT and SALES.

A user may ask a query that returns answers for 10 different groups, and receive 10 answers with 90% confidence bounds for each. The user may then assume that since each estimate is correct with 90% probability, most of the approximations will be correct. However, if the various estimates are strongly correlated, the statistical reality may be that there is a 10% chance that each and every one of the 10 approximations is incorrect. This example clearly shows that we need to take the correlations into account when providing group-wise estimates to users. Specifically, if there is a 10% chance that each and every one of the 10 approximations is incorrect, we must explicitly tell an end user this information when providing group-wise estimates.

1.1.2 A Simple but Impractical Solution

One way to avoid this problem is to use independent samples for each group. But it is easy to argue that using independent samples for each group is not a practical option for real queries. Let us consider a 3-table-join-based GROUP BY query with 1000 groups. In order to provide independent samples to each group, assuming the sample size is fixed, we need to prepare and evaluate 1000 samples, instead of the one sample used in the sharing-sample solution. The overall time spent to evaluate these samples is approximately 1000 times as large as the running time for the sharing-sample solution. The purpose of sampling is to save query evaluation time. However, taking independent samples increases the overall running time dramatically, which significantly degrades the benefit of sampling. A real query may return thousands or even millions of groups, leading to even longer query evaluation time. Thus, providing independent samples to each group is not a practical solution. Sharing the samples among groups is the only practical choice, requiring that we take the correlations into account when providing estimates.

1.2 Major Contributions of This Thesis

The main contribution of this thesis is providing correlation-aware methods for answering sampling-based GROUP BY queries in an online approximate query processing system. I make three specific contributions to the state-of-the-art in this area.

First, in order to provide correlation-aware methods for answering sampling-based GROUP BY queries and to solve the problems in the current online approximate query processing system, we need to characterize the correlations among group-wise estimates. The natural way is to compute the covariances [18] among the group-wise estimators. Covariance is used to quantify how a pair of group-wise estimators are correlated. In this thesis, I derive the covariance formula for a pair of group-wise estimators. Unfortunately a straightforward computation of the covariance is too time-consuming. Thus, I develop an efficient algorithm for computing the covariance, and experimentally evaluate the efficiency of the algorithm.

Second, once the covariance computation is complete, the covariance must be used to provide simultaneous confidence intervals [82] that take correlations among group-wise estimates into account. For the purpose of computing such simultaneous confidence intervals in a principled fashion, I introduce the Sum-Inference (SI) problem, which is used to infer the distribution describing how the intervals are simultaneously wrong. The SI problem is formally defined in Section 5.1. I also develop an algorithm that uses the SI problem to provide simultaneous confidence intervals for GROUP BY queries, and propose several efficient approximate methods to solve the SI problem. Finally, I experimentally evaluate the efficiency and correctness of the proposed solutions.

Finally, I develop algorithms that are guaranteed to return all top-k groups with high probability, where "top-k" refers to the best or most interesting groups according to some user-defined criteria. Correctly returning all top-k groups requires taking into account correlations. I propose various algorithms for doing this. All of the proposed algorithms require building a confidence region [33] for the group-wise estimates. A "confidence region" with confidence p is a random region such that with probability p, the query answer lies within it. I develop algorithms that use several types of confidence regions. These algorithms are carefully compared via an experimental evaluation.

In the next several sections, each of the problems is described in more detail.

1.3 Computing the Covariance between a Pair of Group-wise Estimates

The very first problem is computing the covariance between a pair of group-wise estimators. In statistics, the covariance between a pair of group-wise estimators describes the correlation between these two estimators. A positive covariance means there is a positive correlation between them. That is, when the estimate of one group is high, the estimate of the other group tends to be high as well, and vice versa. A negative covariance means there is a negative correlation between the two group-wise estimators, so that when the estimate of one group is high, the estimate of the other group tends to be low, and vice versa.

The derivation of the covariance formula is nontrivial but not too difficult. The full derivation will be presented in Section 4.1. The real difficulty is computing the covariance efficiently. To see why the straightforward way is not efficient, I outline some of the most important properties of the covariance formula without going into detail. Assume we have a GROUP BY query that involves $k$ base relations: $R_1, R_2, \ldots, R_k$. The total number of records in relation $R_i$ is $n_i$. The covariance formula of two estimators for groups $a$ and $b$ has the following two important properties:

1. The formula is a summation of $2^k$ components.

2. Each component takes the form

   $$c \sum_{i_1=1}^{n_1} \sum_{i_1'=1}^{n_1} \cdots \sum_{i_k=1}^{n_k} \sum_{i_k'=1}^{n_k} I(i_1, i_1', \ldots, i_k, i_k') \, f_a(i_1, \ldots, i_k) \, f_b(i_1', \ldots, i_k'),$$

   where $c$ is a constant, $I(i_1, i_1', \ldots, i_k, i_k')$ is an indicator function for some boolean expression, and $f_a$ and $f_b$ are two functions that compute numerical values among the joined results that are used for aggregation.

In general, the number of relations involved in a GROUP BY query is less than 10. Thus, the total number of components in the overall formula is generally not more than 1000. On the other hand, the computation associated with each component is more difficult. The most straightforward way to compute each of the components is to use nested loops to implement the nested summations. This approach is simple but very time-consuming. For example, assume there are 5 relations, each having 1000 tuples. If we assume that computing the body of the loop requires 10 CPU cycles, we need $10^{31}$ cycles to compute a covariance. Since no modern computer can process more than $10^{20}$ cycles each second¹, computing a covariance requires more than $10^{11}$ seconds. Thus, developing an efficient algorithm to compute the covariance is vital.

To do this, a key observation is that for most of the combinations of $i_1, i_2, \ldots, i_k$, $f_a$ evaluates to 0, because only a small number of these combinations satisfy the join condition, and thus produce a result tuple for aggregation. The same is true for $f_b$. A simple way to retrieve these output tuples is to join the samples from each base relation and group the output tuples based on the grouping attributes. A hash-based implementation may then be used to check whether the boolean expression associated with the indicator function $I(i_1, i_1', \ldots, i_k, i_k')$ is satisfied. Thus, the covariance may be efficiently computed. The detail is given in Section 4.2. In order to evaluate how efficient the proposed algorithms are, I experimentally evaluate the algorithms on different datasets, as shown in Section 4.3. The effects of skewness, sampling ratio, and number of groups are all evaluated.

¹ A typical PC CPU from Intel or AMD runs at about 2 GHz. Thus in one second, it can run two billion CPU cycles. Even considering pipelining, and a big cluster with 10000 PCs, there is still no way for the system to process $10^{20}$ cycles in a second.
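The cost arithmetic above, and the idea of touching only index combinations that survive the join, can be sketched as follows. This is a hedged illustration rather than the algorithm of Chapter 4. The indicator is assumed to test whether the two index tuples agree on a chosen subset of relations (`shared`); the text only says it encodes "some boolean expression". The functions `component_naive` and `component_sparse` and the toy join outputs are hypothetical names and data, with record positions taken from Table 1-1.

```python
from collections import defaultdict
from itertools import product

# Cost of the nested-loop approach in the text: 5 relations, 1000 tuples each,
# 2 * 5 nested summations, 10 CPU cycles per loop body.
loop_iterations = 1000 ** (2 * 5)
naive_cycles = 10 * loop_iterations  # equals 10 ** 31


def component_naive(sizes, f_a, f_b, shared, c=1.0):
    """Evaluate one component with the full nested summation."""
    ranges = [range(n) for n in sizes]
    total = 0.0
    for idx_a in product(*ranges):
        for idx_b in product(*ranges):
            # Assumed indicator: both tuples use the same record in every shared relation.
            if all(idx_a[j] == idx_b[j] for j in shared):
                total += f_a.get(idx_a, 0.0) * f_b.get(idx_b, 0.0)
    return c * total


def component_sparse(f_a, f_b, shared, c=1.0):
    """Same value, visiting only index tuples that produced a join result."""
    # Hash the join output of group b on the positions named in `shared`.
    by_key = defaultdict(float)
    for idx_b, value in f_b.items():
        by_key[tuple(idx_b[j] for j in shared)] += value
    total = 0.0
    for idx_a, value in f_a.items():
        total += value * by_key.get(tuple(idx_a[j] for j in shared), 0.0)
    return c * total


# Toy join outputs keyed by (PRODUCT row, SALES row); missing keys mean f = 0.
sizes = (4, 7)
f_asia = {(0, 0): 10.0}               # thingy sold in Asia in 2004
f_usa = {(0, 2): 10.0, (1, 3): 2.0}   # thingy and gadget sold in USA in 2004

for shared in [(), (0,), (0, 1)]:
    print(shared, component_naive(sizes, f_asia, f_usa, shared),
          component_sparse(f_asia, f_usa, shared))
```

The sparse version does work proportional to the number of join result tuples instead of the product of all relation sizes, which is the point of the key observation in the text.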

PAGE 18

1.4ProvidingCorrelation-AwareSimultaneousCondenceBoundsThesecondproblemIconsiderishowtoprovidesimultaneouscondenceboundstoapproximatelyanswerGROUPBYqueries.Currently,onlineapproximatequeryprocessingsystemsprovideonecondenceintervalforeachgroupseparately[ 57 89 ].Thisisproblematic.Ausermayaskaquerythatreturnsanswersfor10dierentgroups,andreceive10answerswith90%condenceboundsforeach.Ifthevariousestimatesareindependent,theprobabilityofseeingall10groupswrongis0:1100.However,ifthevariousestimatesarestronglycorrelated,itispossiblethatthereisa10%chancethateachandeveryoneofthe10approximationsisincorrect.Regardlessofwhichofthesetwoextremecasesisthetruecase,thecondenceintervalsprovidedbyexistingonlineapproximatequeryprocessingsystemsarethesame.Thisshowsthatimportantinformationhasbeenignored.Iwouldliketoextendthetraditionalguaranteeandpresentaguaranteetakingthefollowingform:Withprobabilityp,atleastkofthengroupsarewithintherangesl1toh1,l2toh2,:::,andlntohn,respectively."Theextendedguaranteetakesthecorrelationsamongthegroup-wiseestimatesintoaccount.Thus,itprovidesmoreinformation.Giventhissortoftheextendedguarantee,thekeyproblemishowtocomputthem.Forthepurposeofcomputingsimultaneouscondenceboundsinaprincipledfashion,IintroducetheSum-InferenceSIproblem.Insteadofcomputingthesimultaneouscondenceboundsthatsatisfytheuser'srequirement,theSIproblemmodelshowtheseintervalsaresimultaneouslyincorrect.IntheSIproblem,wearegivenknormallydistributedrandomvariablesN1;N2;:::;Nk,whereeachrandomvariablemodelsanestimatorassociatedwithaparticulargroup.Alongwiththeseestimators,asetofkintervalsaregiven,oneforeachgroup.Theithregionisdenedtobefromlitohi.Wealsoattachkpenaltiess1;s2;:::;sk,oneforeachgroup.Iftheithrangedoesnotcontainthequeryanswer,thenapenaltyofscoresiisaddedtothe 18

total penalty. We then define the following random variable:

$$\Phi = \sum_{i=1}^{k} I\left(N_i \notin [l_i, h_i]\right) s_i .$$

In the above expression, $I$ is an indicator function that returns 1 if the boolean condition is true and 0 otherwise. If we define a distribution function $F$ such that $F[k]$ gives the probability that $\Phi$ evaluates to $k$, then the SI problem is the problem of inferring $F$. The SI problem is the key step in providing simultaneous confidence intervals for GROUP BY queries. Because solving the SI problem exactly is NP-hard, I develop three approximate algorithms:

 1. Monte-Carlo Resampling. In this method, given the covariance matrix, we repeatedly sample from the joint distribution of the groupwise estimators in order to approximate $F$.
 2. Moment Analysis. This method computes the mean, the variance, and the lower and upper bounds of the domain of $F$. Then, an appropriate parametric distribution is chosen to fit $F$.
 3. Approximate Moment Analysis. When the covariance matrix is very large, computing the entire covariance matrix may require too much time. Thus a sampling-based algorithm is developed to approximate the moments of $F$ by computing only the variances and sampling entries from the covariance matrix.

In order to test the utility of each of the approximate solutions to the SI problem, I design experiments to evaluate these algorithms. For efficiency, the running time of each of the three methods is measured and compared to the time of evaluating the query over the entire database. Monte-Carlo experiments are used to test whether the three algorithms work correctly in practice.

1.5 Returning Top-k Groups with High Probability

The final problem I consider is how to return the top-k groups with high probability. Current approximate query processing systems monitor all groups simultaneously [57, 89]. This is problematic for two reasons. First, updating thousands of groups simultaneously for a real-world query is time-consuming. Second, returning just a few groups

Figure 1-1. Three possible relationships between the region and the line $x_0 = x_1$: (a) The region is above the line. (b) The region touches the line. (c) The region is below the line. If the true ranking score vector is in the region, in case (a), the top-1 group is the second group because $x_1 > x_0$ for any point in the region. Similarly, in case (c), the top-1 group is the first group. In case (b) we cannot say anything.

with respect to a user-defined ranking function is often more than sufficient [76]. In this thesis, I develop algorithms that use a database sample to return the top-k groups with high probability.

One way to determine the top-k groups with respect to a user-defined statistic is to use "confidence regions" [33]. A confidence region with confidence $p$ is a random region in the $n$-dimensional space such that, with probability $p$, the query answer lies within it. To see how a confidence region may help us find the top-k groups, let us consider a simple case: a query with two groups, where we are interested in the top-1 group. There are three cases, shown in Figure 1-1, where $x_0$ and $x_1$ are the dimensions for the first and second group, respectively. If the query answer is in the region, it is obvious that in case (a), the top-1 group is the second group because $x_1 > x_0$ for any point in the region; thus the ranking score associated with the second group is larger than the ranking score associated with the first one. Similarly, in case (c), the top-1 group is the first group. Only in case (b) is it impossible to determine which group is preferred with high probability.
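The three cases of Figure 1-1 can be checked mechanically. The following is a minimal sketch, assuming the confidence region is an axis-parallel box given by one interval per group; the function name and the string labels are hypothetical and chosen only for illustration.

```python
def top1_from_box(l0, h0, l1, h1):
    """Classify the box [l0, h0] x [l1, h1] against the line x0 = x1.

    (l0, h0) bounds the first group's score, (l1, h1) the second group's.
    """
    if l1 > h0:
        # Case (a): every point in the box has x1 > x0.
        return "second"
    if l0 > h1:
        # Case (c): every point in the box has x0 > x1.
        return "first"
    # Case (b): the box touches or crosses the line, so no decision.
    return "undetermined"


print(top1_from_box(1.0, 2.0, 3.0, 4.0))
print(top1_from_box(3.0, 4.0, 1.0, 2.0))
print(top1_from_box(1.0, 3.0, 2.0, 4.0))
```

For a box, the test reduces to comparing one group's lower bound with the other group's upper bound, which is why only per-group intervals are needed in this simple setting.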

The confidence region in Figure 1-1 is a cube such that each edge is parallel to one of the axes. A benefit of this sort of region is that only variances are needed to build it. I develop an Interval Sweeping algorithm that works with this type of confidence region.

When all the covariances are available, it is possible to build tighter confidence regions. The interval sweeping algorithm returns too many groups for these regions. To overcome this problem, I develop a pruning-based algorithm that takes correlations into account. When the confidence region is tight, this algorithm returns fewer groups than the interval sweeping one. However, building such a confidence region requires knowing all the covariances in advance, which is known to be expensive [105].

The final algorithm I develop combines the benefits of these two algorithms. It runs in two stages. In the first stage, only the variances are needed to build a confidence region, and the interval sweeping algorithm is used to return a few candidate groups. In the second stage, covariances among the candidates are computed to build a tighter confidence region, and the pruning-based algorithm is used to return the final results.

I design experiments to evaluate the algorithms' accuracy in terms of the number of groups returned and their efficiency in terms of running time, and I prepare several different datasets. In particular, I test the effect of the strength of the correlations among groupwise estimators.

1.6 Organization

The rest of this study is organized as follows. Chapter 2 provides a survey of work related to the problems addressed in this thesis. Chapter 3 reviews a few statistical preliminaries that are used in this thesis. Chapter 4 describes the derivation of the covariance formula between a pair of groupwise estimators, and an efficient algorithm to compute the covariances. Chapter 5 discusses the problem of providing simultaneous confidence intervals for GROUP BY queries. Chapter 6 describes various solutions for returning all top-k groups with high probability.

CHAPTER 2
RELATED WORK

This chapter presents a survey of previous work in the literature related to the problems in this thesis. The survey begins with previous work in database sampling [3, 48, 49, 51-54, 56, 57, 62, 63, 69-73, 79, 85, 86, 89]. In database sampling, I discuss both how to obtain database samples [72, 85, 86] and how to use samples to estimate various database statistics [3, 48, 49, 51-54, 56, 57, 62, 63, 69-71, 73, 79, 89]. A brief comparison between survey sampling [91] and database sampling ends the review of database sampling. This chapter then discusses related work in top-k queries [36-38, 67, 76, 84, 95]. Both finding top-k tuples for SELECT-PROJECT-JOIN queries [36-38, 67, 84] and finding top-k groups for GROUP BY queries [76, 95] are discussed. Finally, this chapter reviews related work in the statistics literature. The review begins with a survey of multiple inference [13, 14, 26, 32, 58-60, 64, 82, 90, 96, 98, 99, 103]. It then briefly surveys the literature related to error guarantees for Monte Carlo methods [12, 28, 81].

2.1 Database Sampling

The study of sampling has a very long history in databases. Researchers mainly focus on two different but related problems:

 1. How to obtain and maintain random samples from database tables [72, 85, 86]?
 2. How to use random samples to estimate various database statistics such as selectivity, size of join, and answers to database queries [3, 48, 49, 51-54, 56, 57, 62, 63, 69-71, 73, 79, 89]?

The work in my thesis falls into the second category. In this section, I first give a brief review of the first category, then move to the second category.

2.1.1 Obtaining and Maintaining Random Database Samples

The most notable early work on obtaining random samples from database files is due to Olken et al. [85, 86]. In [85], Olken et al. study the problem of obtaining simple random samples from B+ trees [41]. Olken et al. propose several acceptance/rejection-based [91] sampling methods to obtain random samples from B+ trees without requiring any

additional data structures or indices. In [86], Olken et al. study the problem of sampling from hash files [41]. Olken et al. propose sampling methods for both static and dynamic hash files [41]. For static hash files, they consider open addressing hash files and hash files with a separate overflow chain. For dynamic hash files, they consider both linear hash files [41] and extendible hash files [41].

Besides the research work done by Olken et al., Antoshenkov [7] proposes a sampling method based on pseudo-ranked B+ trees. The B+ trees are augmented so that they contain information that allows calculation of upper and lower ranks satisfying the nested bound conditions for all parent-child pairs; that is, the bound of any child is nested inside the bound of its parent in the B+ tree. An acceptance/rejection-based sampling method uses this information to speed up the sampling and reduce the rejection rate.

DeWitt et al. [29] propose extent map sampling for handling skewness in parallel joins [94]. The database table is stored in units of one or more pages, called extents. Records are stored in each extent. An in-memory data structure maintains information on how these extents are distributed. The address of a page within an extent can be found by adding an offset to the address of the first page of the extent. In order to obtain a random page, a number $r$ between 1 and the total number of pages is randomly selected, and the corresponding page is returned. To obtain a random record, once a page is selected, a random record is selected from the page. From a statistical point of view, the obtained sample is a random sample of the database table only when the records in the table are pre-randomized. However, this is not discussed in depth in the paper.

These papers all consider how to obtain a small random sample from a database table. The sample itself is small and is not stored and managed on disk. Jermaine et al. [72] propose the geometric file to maintain large on-disk random samples. The algorithm uses reservoir sampling [102]. It collects samples in an in-memory buffer. The items in the buffer are randomly permuted and then divided into segments of

geometrically decreasing size. When a buffer flush is required, each segment in the buffer overwrites an on-disk segment of the same size. Segments smaller than a certain size are stored and updated within main memory. The organization of the geometric file avoids many random accesses to the disk because only one random access per segment is required to find the starting position of the in-place update. Records within a segment are written sequentially.

Besides this pioneering work, Gemulla and Lehner [42] propose several deferred maintenance strategies for disk-based random samples with a bounded size. The approach is based on reservoir sampling. The authors assume that a uniform sample of size $M$ has been computed previously. The incremental strategies comprise two phases: a log phase that captures all the insertions to the database table and a refresh phase that uses the log file [41] to update the sample. The authors consider both full logging and candidate logging, which keeps only a "sample" of the operations on the dataset. They further develop an algorithm for deferred refresh, which performs only fast sequential I/O operations and does not require any main memory. They experimentally compare their implementation to the geometric file and conclude that the geometric file outperforms their method only when the buffer size is sufficiently large.

More recently, Nath and Gibbons [83] propose a method for maintaining a large random sample on a flash disk [16]. They introduce the B-File (Bucket File) to store samples. A B-File contains many buckets, and a sample may be randomly stored in one of the buckets determined by their algorithm. They also introduce the concept of a semi-random write, in which the block may be selected randomly, but pages within a block must be written sequentially from the beginning to the end. The algorithm assigns each item in the database a "deletion level". If the item's deletion level is higher than the minimum sample deletion level, it is selected as a sample. If the size of the sample exceeds the limit, a bucket with minimum deletion level is deleted. The levels are assigned so that the items

kept are a uniform sample of the database. The authors claim that their implementation is flash-disk friendly, and is three orders of magnitude faster than existing techniques.

2.1.2 Sampling-Based Estimation

There is a long history of research in the database literature that considers how to utilize samples [3, 48, 49, 51-54, 56, 57, 62, 63, 69-71, 73, 79, 89]. Many of these papers are closely related to my research. The first work in this area considers how to estimate the selectivity of a select query and the size of a join [24, 39, 51, 53-55, 62, 63, 79]. Later on, researchers consider estimating the answer to a database query [3, 48, 49, 52, 56, 57, 69-71, 73, 89]. This section gives a thorough literature survey on these topics.

2.1.2.1 Selectivity and size of join estimation

Work on how to utilize samples started with estimating the selectivity of a selection query and the size of a join. Initially, in order to use samples to estimate the size of a join, researchers recommended either sampling [62, 63] from the product space of the tables involved in the join, or using some index structure to aid in sampling from joins [78, 79]. Later on, researchers developed methods that first sample from each base table. The samples from the base tables are then joined to estimate the size of the join [52]. The latter approach eventually led to approximate query processing [48, 49, 57, 70, 73].

Pioneering work on using samples to produce estimates for selectivity and size of join was done by Hou et al. [62, 63]. They propose methods to estimate Count(E) using a simple random sample, where E is an arbitrary relational algebra expression [41]. This includes both selectivity and size of join estimation. Selection is handled trivially by scaling up the sample count by the inverse of the sampling ratio. In order to estimate the join size, they propose a method that samples from the product space of the join tables. Intersection is treated as a special case of join. They also propose an algorithm for cluster sampling. Their estimators are somewhat limited because sampling

from the product space of base tables is not efficient. Furthermore, the variance [18] of the proposed estimator is large.

Lipton and Naughton [78, 79] introduce adaptive sampling to estimate the selectivity of a selection condition and the size of a join query. The adaptive sampling method assumes that any query result can be treated as the union of results from a set of disjoint subqueries. In order to estimate the selectivity of a selection condition or the size of a join, these subqueries are sampled randomly. The sizes of the sampled subqueries are computed and used to form the estimate. Adaptive sampling assumes that the maximum size of the subqueries is known in advance. This is generally impossible. Thus, the authors use an upper bound on the maximum size of the subqueries.

Hou et al. [61] propose double sampling to estimate a count query [41] with error guarantees. Their method contains two steps. First, a small pilot sample is taken to determine the actual number of samples needed for the input error bound. Then, the actual samples are taken so as to guarantee adherence to the error bound. They compare their double sampling method to adaptive sampling. Based on their results, double sampling requires fewer samples than adaptive sampling to maintain the same accuracy.

In [54], Haas and Swami point out that there is no theoretical guide for choosing the pilot sample size in double sampling. Haas and Swami also find that if the upper bound used in adaptive sampling is loose, adaptive sampling may take too many samples. They propose two sequential sampling algorithms called S2 and S3. S2 decomposes the query result into partitions. It randomly selects partitions and observes the sizes of the partitions one at a time. The algorithm terminates according to a stopping rule that depends on the random observations obtained so far. The authors find that if the partitions are divided into strata of equal size, and if a partition from each stratum is selected at each sampling step, the performance of the sampling increases. They call this new

algorithm S3. Experimental results show that S3 is the best method for estimating the join size. Both S2 and double sampling perform well, while adaptive sampling is a bit unstable. They also find that their estimator does not perform well when one relation is skewed and the other one is not.

Haas et al. [51] compare various estimators in the literature for estimating the join size. They conclude that for a fixed number of sampling steps, estimators based on index-level sampling are better than estimators based on page-level sampling, and estimators based on page-level sampling are better than estimators based on tuple-level sampling. Furthermore, estimators based on cross-product sampling are better than estimators based on independent sampling. They suggest using cross-product index-level sampling when an index is available and cross-product page-level sampling when an index is not available.

Haas et al. [53] also study the problem of when it is appropriate to estimate the join size instead of computing the join result directly. They find that if an absolute criterion is used to measure the precision of the estimates, the relative cost of sampling decreases when either the size of the relations or the number of relations in the join increases. If a relative criterion is used instead, the relative cost of sampling decreases when the size of the relations increases, and increases when the number of relations increases. They also find that sampling is more expensive when more tuples are not involved in the join. Finally, they find that when the join key is uniformly distributed in all join relations, the sampling cost is low. When the skew in the distribution of the join attribute values is high, the sampling cost is low if the join values in different relations are homogeneously distributed (i.e., highly frequent values are frequent in each relation, and infrequent values are infrequent in each relation). If the join attribute values in different relations are heterogeneously distributed, the sampling cost is high.

In order to solve the problem that the sampling cost is relatively high when the join values are skewed and heterogeneously distributed, Haas and Swami [55] propose

the TCM algorithm. The algorithm uses augmented frequent value statistics to decrease the cost of sampling. They also provide a single-pass algorithm to estimate the augmented frequent value statistics that are needed by the TCM algorithm. They show experimentally on real-world data that the cost of sampling using the TCM algorithm is relatively low even when the join values are skewed and heterogeneously distributed.

Ganguly et al. [39] introduce Bifocal Sampling to overcome the problems of the previously proposed estimators. Based on how many tuples have the same join value, the tuples in each relation are divided into two groups: sparse and dense. The authors then develop different estimators for dense-dense pairs and sparse-any pairs. The dense-dense estimator takes samples only from the dense tuples of both relations. There are two sparse-any estimators; each of them takes samples from the sparse tuples in one relation and from the whole of the other relation. By separating out the dense-dense part, Bifocal Sampling achieves good accuracy. Experimental results show that Bifocal Sampling outperforms previous estimators.

Chaudhuri et al. [24] study the problem of sampling from the output of a join. Their primary purpose is to make sampling an operator within the database. They focus on how to commute the sampling operator with a single join operator [41]. They establish that simply joining random samples from two base relations does not yield a random sample of the output of the join. Thus they focus on the problem of obtaining a sample of the join output without completely evaluating the join. They observe that because of the skew in the data, it is better to sample a tuple in one relation with a probability that depends on how many tuples in the other relation have the same join value. They then classify the problem based on how many relations have indices or statistics for each join value, and propose different sampling algorithms to sample from the join output without fully evaluating the join.
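The observation that a tuple should be sampled according to how many tuples share its join value in the other relation can be illustrated with a small sketch. It is not the algorithm of [24]; it assumes the favorable case in which the second relation is fully indexed on the join value, and the function name and tuple layout are hypothetical.

```python
import random
from collections import defaultdict


def sample_join(r1, r2, n, rng):
    """Draw n tuples (with replacement) uniformly from the equi-join of r1 and r2.

    r1 and r2 are lists of (join_value, payload) pairs. Assumes an in-memory
    index on r2's join value, i.e. exact match counts are available.
    """
    index = defaultdict(list)
    for value, payload in r2:
        index[value].append(payload)

    # Weight every r1 tuple by the number of r2 tuples it joins with.
    weights = [len(index.get(value, ())) for value, _ in r1]
    if sum(weights) == 0:
        return []  # the join is empty

    result = []
    for value, payload in rng.choices(r1, weights=weights, k=n):
        # Given the r1 tuple, pick one of its join partners uniformly.
        result.append((value, payload, rng.choice(index[value])))
    return result


r1 = [(1, "a"), (2, "b"), (3, "c")]
r2 = [(1, "x"), (1, "y"), (3, "z")]
print(sample_join(r1, r2, 5, random.Random(0)))
```

If an r1 tuple has $w$ join partners out of $W$ join results in total, it is chosen with probability $w/W$ and each partner with probability $1/w$, so every join result is drawn with probability $1/W$. Sampling r1 uniformly instead would over-represent results whose join value is rare in r2.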

2.1.2.2 Approximate query processing

In the late 1990s, research on sampling-based estimation slowly moved from estimating various statistics to approximately evaluating database queries [2-4, 44, 49, 56, 57, 69-71, 73, 89]. The two representative projects are the AQUA project [3] from Bell Labs and the CONTROL project [56] from the University of California, Berkeley. These projects (especially the CONTROL project) are closely related to my research.

AQUA is an approximate query answering system. The main technical contributions of AQUA include congressional samples for GROUP BY queries [2], join synopses [4], and concise and counting samples [44].

Congressional samples are proposed to approximately answer GROUP BY queries with high accuracy. Acharya et al. [2] find that the problem with uniform sampling is that some small groups have few records in the sample, leading to much poorer accuracy. A congressional sample is a precomputed, biased sample such that each group has a sufficient set of representative records in the sample. Given a fixed amount of space and a set of database table columns, congressional samples try to maximize estimation accuracy for all possible GROUP BY queries. The authors also provide one-pass algorithms for constructing congressional samples, and efficient methods to evaluate GROUP BY queries on congressional samples. Of course, the database table columns that are used for grouping must be known beforehand, which significantly limits the usage of congressional samples.

Join synopses [4] are precomputed samples from a small set of distinguished joins, used for foreign-key [41] joins in data warehousing. The authors show that they can obtain an estimate for any possible foreign-key join by using the precomputed samples. They also provide optimal strategies to allocate these distinguished joins, assuming the workload information is available. They show experimentally that estimates based on join synopses are better than estimates based on random samples.
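To make the motivation for biased, per-group samples (as in the congressional samples discussed earlier in this subsection) concrete, the sketch below compares a proportional allocation of sample slots with a simple blended allocation. The blending rule (take the larger of the proportional share and an equal share, then rescale to the budget) is an assumption chosen for illustration and should not be read as the allocation rule of [2]; all names are hypothetical.

```python
def allocate(group_sizes, budget):
    """Return (proportional, blended) expected sample slots per group.

    proportional: what a uniform sample of `budget` records yields on average.
    blended: max(proportional share, equal share), rescaled to the budget.
    Illustrative only; it ignores the cap imposed by tiny groups.
    """
    total = sum(group_sizes.values())
    equal_share = budget / len(group_sizes)
    proportional = {g: budget * n / total for g, n in group_sizes.items()}
    raw = {g: max(p, equal_share) for g, p in proportional.items()}
    scale = budget / sum(raw.values())
    blended = {g: r * scale for g, r in raw.items()}
    return proportional, blended


sizes = {"A": 9000, "B": 900, "C": 100}
proportional, blended = allocate(sizes, budget=100)
for g in sizes:
    print(g, round(proportional[g], 2), round(blended[g], 2))
```

With these sizes, a uniform sample of 100 records contains on average a single record from group C, which explains the poor accuracy for small groups; any allocation that guarantees each group a minimum share trades some accuracy on large groups for usable estimates on small ones.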

The concise and counting samples [44] are two sampling-based statistical summaries. Because the summaries require much less storage than random samples, the authors claim that the accuracy of estimates for database queries from the summaries is better than the accuracy of estimates from random samples with the same amount of storage. They also provide fast incremental maintenance algorithms for the two sampling-based statistical summaries.

Unlike the AQUA project, the UC Berkeley CONTROL project provides techniques that work directly on random samples to answer approximate queries, such as online aggregation [57] and ripple join [49]. These techniques are more closely related to the problems I tackle in this thesis.

Hellerstein et al. [57] first propose online aggregation. Online aggregation is a system that executes analytical queries interactively. At any point of the query evaluation, users are shown the current estimate of the query with confidence bounds [18] for an input confidence. The estimates are constantly updated as time progresses. The user may terminate the query evaluation at any time, as soon as he or she is satisfied with the estimate. In the system, random samples from input relations are used to provide estimates. The paper also discusses various techniques that are required for online aggregation.

In the context of online aggregation, Hellerstein et al. [57] consider single-table GROUP BY queries. One problem with sampling-based GROUP BY queries is that small groups may have few or even no records in the sample. Thus, the accuracy for smaller groups is low compared with the accuracy for larger groups. The index striding technique is proposed to remedy this. They assume that the records are indexed by the grouping attributes. Then the index can be exploited to sample uniformly in a round-robin fashion (e.g., a tuple from the first group, a tuple from the second group, a tuple from the third group, and so on), while delivering the records within each group randomly. Sometimes, in order to increase the accuracy for the small groups, it may

be important to sample small groups more heavily than big ones. Therefore, a weighted round-robin fetching scheme may be more desirable.

Haas et al. [52] and Haas [48] further discuss how to provide confidence bounds for analytical queries in online aggregation. The major difference between Haas's work and previous research is that, unlike other methods that sample from the output space of the join, the method Haas proposes only needs to sample from each base relation. The samples are joined to estimate the query result.

Based on this work, Haas et al. [49] propose a family of join algorithms called the ripple join. The ripple join takes some samples from one relation, then takes some samples from the other relation, then takes some samples from the first relation, and so on. The new samples are then joined with the samples from the other relation. The estimate of the query is updated with each new sample set. The paper also discusses how to provide confidence intervals [18] for the estimate.

The ripple join algorithm converges slowly when the selectivity of the query is low. In order to speed up the convergence, Luo et al. [80] propose a distributed hash ripple join algorithm. The algorithm assumes that the two join relations are stored in a distributed environment where each node has some tuples. A split vector that maps the join values to the nodes is used to redistribute both relations. The algorithm augments a traditional parallel hybrid hash join algorithm [94]. It contains two phases. In the first phase, the tuples from both relations are retrieved in random order and redistributed to each node so that each node receives roughly the same number of tuples from both relations. While each node is redistributing its stored tuples, it also joins the incoming tuples. The second phase starts when the tuples are fully redistributed. The tuples in the buckets of the two relations that are stored on disk are re-read into memory alternately. They are then joined with existing samples to provide estimates. The authors also discuss how to extend the previous results to provide confidence

bounds for their new distributed hash ripple join. Experimental results show that the algorithm obtains good speedup.

Jermaine et al. [70, 71] point out that ripple joins do not work correctly when the sample size exceeds the total amount of memory. They propose the Sort-Merge-Shrink (SMS) join to overcome the problem. Essentially, this is a combination of the ripple join and the sort-merge join [41]. There are three phases in the algorithm. The sort phase is an augmented version of the sort phase of the sort-merge join. The two join relations are read and sorted into runs in parallel. Furthermore, a hash ripple join is evaluated on each pair of runs. The estimates provided by each hash ripple join are used to provide an estimate for the query answer. The merge phase and shrink phase run simultaneously. In the merge phase, runs are merged and joined together, which means the sorted runs are losing tuples. This requires a shrink phase to update the estimator in order to continuously provide an estimate for the query answer.

Jermaine et al. [69, 73, 89] further extend the SMS join and propose the DBO system. The DBO system is a scalable online aggregation system that utilizes randomized algorithms to estimate the answer to a multi-table disk-based query. Like online aggregation, DBO constantly updates estimates from the beginning to the end of the query evaluation. Unlike online aggregation, DBO can handle not only memory-based queries but also disk-based queries. Thus DBO is scalable. In DBO, the database tables are prerandomized and stored on disk. In order to constantly provide and update the estimate of the answer to the query, DBO fundamentally redesigns some parts of the query processing engine. Query plans are processed by running all the operations at the same level in the query tree at the same time. This is called a levelwise step. The operations in the same levelwise step communicate with each other to find "lucky tuples" by joining the partial results from each of the operations together. These lucky tuples are then used to estimate the query answer. The authors provide detailed discussions of the DBO-specific operations, and also

give a thorough analysis of the statistical issues involved in using the DBO system to estimate the answer to a running query.

There is some other research on approximately answering a database query, though it is less related to my research. Ganti et al. [40] propose Icicles, a class of self-tuning sampling techniques that tune themselves to a dynamic workload. The basic idea is that the probability of a tuple being selected as a sample in Icicles is proportional to its importance in answering queries in the workload. If a tuple is used to answer many queries in the workload, the probability that it becomes a sample is much higher. The authors adapt the traditional sampling-based estimators to this biased sampling environment and show that the accuracy of approximate answers obtained by using Icicles is better than that obtained from a static uniform random sample.

Similarly, Chaudhuri et al. [20] notice that using a uniform sample to estimate an aggregate is not efficient if the selectivity of the query is low or the distribution of the aggregated attribute is skewed. They propose two methods to solve this problem. One of their methods is to use weighted sampling based on the workload information. A record that is evaluated by many queries is sampled with higher probability than a record that is rarely evaluated by queries in the workload. They further introduce a technique called Outlier Indexing to find tuples with outlier values and save them into a separate table. They demonstrate that the combination of weighted sampling and Outlier Indexing gives better estimates than either uniform sampling or weighted sampling alone.

Babcock et al. [11] propose small group sampling for GROUP BY queries. The basic idea of small group sampling is to partition groups into small groups and large groups. For large groups, the system uses a uniform sample to answer the query. For small groups, the system uses the original data directly to answer the query. Unlike congressional samples, small group sampling maintains only a uniform sample. Furthermore, small group sampling also maintains so-called "small group tables" for each attribute to store rows that contain rare values for that particular attribute. The

incoming query is rewritten so that the big groups are answered through the uniform sample and the small groups are answered directly by these "small group tables".

Chaudhuri et al. [21, 22] further investigate the problem of how to utilize the workload information to precompute a stratified sample [91] in order to minimize the error in answering the workload queries. They demonstrate that when the workload is similar but not identical, the stratified sample taken by their method produces good estimates for workload queries.

2.1.3 Comparison to Survey Sampling

Sampling has been studied for a very long time outside of the database literature. Giving a thorough review of this literature is beyond the scope of this thesis. However, I give a brief comparison between database sampling and survey sampling, and point readers to [91] for more information about survey sampling.

Although database sampling techniques are built on top of survey sampling techniques, the purpose of survey sampling is quite different. The purpose of survey sampling is to obtain some specific information about the underlying distribution [18] that is difficult to obtain directly because of a large population, privacy issues, legal issues, or other reasons. The target of the survey sampling is known beforehand, and the samples are collected only for that particular target. Survey sampling is specially designed by statisticians and domain experts so that only a very limited number of samples is needed to estimate the target accurately. This is because collecting samples is generally very expensive. For example, suppose a study supported by the University of Florida would like to determine how many people think that they have been sexually harassed on campus. For many reasons, it is infeasible to ask everyone on campus to answer this question. Thus, a special survey sampling scheme may be designed. Only a very small sample (tens to hundreds) may be collected to estimate the actual number.

Database sampling is quite different. The main goal in database sampling is to quickly estimate some statistic as the answer to a given query. Samples may be collected

in advance, stored in memory or on hard disk, and possibly shared by many queries. Furthermore, using a special sample design for a particular query that may increase the estimation accuracy is often infeasible. Because the samples may be shared by many queries, simple sampling schemes, such as uniform sampling or Bernoulli sampling, are generally preferred. Furthermore, in database sampling, we generally need to join samples from base tables and estimate query results or statistics from the join result of the samples. For example, in order to estimate the total number of blue zebras, in database sampling, we first take a sample from a table that contains all animals with their colors. Then we take a sample from another table that contains all animals with their types. We then join these two samples together on the unique key of each animal to estimate the total number of blue zebras. This kind of estimation is never required in survey sampling, but is very common in database sampling.

2.2 Top-k Query Processing in Relational Databases

The problem of finding top-k groups with respect to a user-specified ranking function using samples is related to work in top-k query processing. Top-k query processing has been studied in the database and web areas, information retrieval, and many other areas. In the database literature, there are mainly two different types of top-k problems:

 1. Find the top-k tuples or objects from the results of a SELECT-PROJECT-JOIN query [41], where each result tuple in the query is ranked by a ranking function [36-38, 67, 84].
 2. Find the top-k groups for an aggregate GROUP BY query [41] such that the groups are ranked by an aggregate ranking function [76, 95].

The first type of top-k query has a longer research history, and is not very close to the problems I tackle in this thesis. The second type of top-k query is essentially the same problem that I am working on; the only difference is that other researchers are developing techniques to return the exact top-k groups efficiently when all the data

is available for processing. Research on this type of top-k query is rare, and the total number of papers published [76, 95] is very small. In the following, I first briefly review research related to the first type of top-k query and then give a thorough survey of research related to the second type.

2.2.1 Top-k Query Processing for SELECT-PROJECT-JOIN Queries

Research on processing top-k SELECT-PROJECT-JOIN queries has a long history. Fagin et al. [36-38] pioneer the research on processing top-k queries. In their methods, database objects are treated as $n$ lists, one for each attribute. Each attribute is ranked separately with a score, and the final score of a database object is a function that combines these scores. The ranking function is assumed to be monotonically increasing. That is, if we fix the scores of a database object from $n-1$ lists, and only increase the score from one list, the overall score also increases. This assumption is used in most top-k query processing techniques, and many functions such as sum have this property. The lists may be accessed in sorted order of their scores or by random access.

The Threshold Algorithm (TA) is an algorithm that requires both sorted access and random access. In the TA algorithm, the $n$ sorted lists are visited in parallel by sorted access. When a new object is seen in some list, a random access to each of the other lists is performed to compute the actual score of this new object. An upper bound on the overall score of unseen objects is maintained by applying the score function to the last seen scores in the lists. As soon as at least $k$ seen objects have scores no less than the upper bound, the algorithm terminates and these $k$ objects are returned.

The authors also propose the Combined Algorithm (CA) and the No Random Access Algorithm (NRA). CA is an algorithm that is similar to TA but optimized to reduce the number of random accesses. NRA is used when only sorted accesses are available. Unlike TA or CA, NRA is not guaranteed to return exact scores. It relies on lower and upper bounds for partially seen objects. The bounds for partially seen objects are computed by applying the score function to the partially seen scores and the lowest/highest possible

values for unseen attributes. If a partially seen object's score lower bound is larger than the score upper bounds of all other objects, this object can be reported as a top-k object. Some extensions to the above algorithms [47, 66] are proposed by other researchers.

Chaudhuri and Gravano [23] study the problem of mapping a top-k query to a range query [41] in a relational database system. Their work supports three specific score functions: Min, Euclidean, and Sum. All three score functions satisfy the monotonicity property. The basic idea of their mapping algorithm is to choose a score $q$ based on histograms and other statistics in the database, and select all tuples whose scores are larger than $q$. If more than $k$ tuples are selected, the $k$ tuples with the highest scores are returned. Otherwise, a score lower than $q$ is selected for another round of query processing. The process is repeated until $k$ records are returned. In the journal version of the paper, the idea is further extended by Bruno et al. [17] to use multi-dimensional histograms [30] to support the mapping.

Later on, Chang and Hwang [19] study the problem of answering a top-k query when some of the predicates of the query are expensive. A predicate is a boolean expression that contains selection and join conditions, which may include user-defined functions. The expensive predicates defined in the paper contain user-defined functions, which are ad hoc and cannot benefit from sorting and indexing [41]. A cheap predicate is a predicate that does not contain any user-defined function. They assume that there is at least one cheap predicate and that sorted access is available for that predicate. They propose the Minimal Probing algorithm (MPro). The main idea is that evaluating some of the predicates (especially expensive ones) is not necessary to determine whether an object is a top-k one or not. Since evaluating some predicates is expensive, the algorithm seeks to minimize the predicate evaluation cost. The algorithm runs in two phases. In the first phase, a priority queue is initialized based on the score upper bounds of the objects, which are computed by aggregating the scores of the cheap predicates and the maximum scores of the expensive predicates. In the second phase, the object at the top of the priority queue is selected and one of the

PAGE 38

expensive predicates of that object is probed. The score upper bound is then updated, and the object is reinserted into the priority queue. If there are no more predicates to be evaluated for the top object, it is selected as output. The problem of determining which expensive predicate to probe next is NP-hard, and MPro uses heuristics to determine the order. This idea is further extended by Hwang and Chang [65] by taking the cost of sorted access into account.

The first top-k query processing algorithm proposed in the database literature that supports joining multiple relations is J* by Natsev et al. [84]. J* is based on the A* search algorithm. It maps the problem to a search problem in the Cartesian space of the ranked inputs and uses guided search to find the results. A priority queue [35] is used to maintain all partially and fully completed results based on the score upper bounds of the final results. At each step, the top element of the priority queue is processed by searching and selecting the next stream to join with the partial result. The partial result is reinserted into the priority queue if it is not complete. Otherwise, it is an output candidate. The authors also propose several other similar algorithms in the paper.

The Rank-Join algorithm [67] is another top-k query processing algorithm that supports the join operation. It scans the input relations in sorted order based on the scores. When a new record is retrieved, it is immediately joined with the records already seen from the other relations. For each join result, the algorithm computes its score. The algorithm also maintains a threshold for the join results that have not been seen yet. When there are k join results whose actual scores are at least as large as the threshold, the algorithm stops and these k results are returned. The authors develop two hash-based implementations, HRJN and HRJN*. Schnaitter and Neoklis [92] further extend the Rank-Join problem and study the cost of processing a top-k join query. They develop a general model such that the previous algorithms are special cases of it. They then study the performance of HRJN and point out that HRJN's performance can be arbitrarily bad. They introduce a new algorithm that works better than HRJN. They
also study the cost of inferring a tight bound for unseen objects and find that in general it is NP-hard.

Most top-k query processing techniques assume the ranking function adheres to the monotonicity property. A few recent publications have studied top-k query processing using more general ranking functions. Zhang et al. [106] propose the OPT* algorithm for the problem of answering a top-k query with a ranking function that determines the score of a database object and an arbitrary boolean expression that determines whether a database object is qualified. The solution is to utilize indices and search the space of indices or scored attributes, along with function optimization for continuous ranking functions. The authors propose an A* search algorithm to achieve completeness and optimality. Xin et al. [104] study the problem of answering a top-k query with an ad hoc ranking function that can be lower bounded. The authors propose an index-merge framework that searches over the space composed by joining index nodes. These index nodes are existing B+Tree [41] or R-Tree [41] indices. A double-heap [35] algorithm that supports both progressive state search and progressive state generation is developed.

Researchers have also studied how to integrate top-k query processing techniques into the relational DBMS to provide ranking-aware query plans. Ilyas et al. [68] study the problem of integrating the Rank-Join algorithm [67] into the query optimizer [41]. They modify the enumeration phase of the query optimizer [41] to take the Rank-Join operator as part of the enumeration space. They also provide a cost estimation [41] method for the Rank-Join algorithm. Li et al. [77] extend this work to propose the RANKSQL system, which supports ranking as a first-class operation in a DBMS. The authors extend the relational algebra to allow ranking operations. They further develop query rewriting rules [45] that can be used for query optimization. Furthermore, the authors introduce an optimization framework and execution model for top-k query processing as a first-class operation in a DBMS. More recently, Schnaitter et al. [93] study the problem of how many records are
required to be retrieved from each sorted base relation in order to produce top-k results, and develop novel estimation methods for this problem.

2.2.2 Top-k Query Processing for GROUP BY Queries

Finding the top-k groups with aggregate ranking functions is very closely related to the problem of returning top-k groups with high probability using samples. However, the total number of papers published is rather limited. Li et al. [76] were the first to study this problem. In the system they propose, they assume that an upper bound of the ranking score of a group can be obtained by replacing the values of each tuple in the group with their maximum values. Many aggregation functions, such as sum and average, satisfy this property. A naive implementation to answer such a query may execute the GROUP BY query first, sort the groups on the ranking score, and return only the top-k groups. The authors point out that since users are only interested in the top-k groups, such a naive implementation is inefficient and unnecessary. A better idea is to determine the true top-k groups while processing the tuples. Unqualified groups are pruned as soon as they are found. The authors propose a metric that evaluates how good an algorithm is by counting the total number of tuples processed to determine the top-k groups. They find that the optimal solution relies on two factors: the order of processing the groups and the order of processing tuples within each group.

Based on their observations, they propose three principles for processing tuples: Upper-Bound, Group-Ranking and Tuple-Ranking. The Upper-Bound principle determines when a group may be pruned. Intuitively, if the minimum score of all top-k groups is s and the upper bound of a group's score is less than s, we are sure that this group cannot be a top-k group. Thus, a loose upper bound function may not help prune any group at all. Furthermore, the upper bound function may need to be updated as more tuples are processed in order to provide a tighter upper bound. In the implementation, in order to obtain a tight upper bound, the authors assume that the total number of records in each group is known in advance. The Group-Ranking principle determines how to choose
a group to process. The authors point out that a tuple from the group with the current highest upper bound score should be selected for further processing. If k groups finish processing, these are the top-k groups. The authors further propose that the processing order of tuples within a group should rely only on the information within the group. They suggest processing tuples within a group in descending order of the tuples' scores. Finally, the authors discuss various techniques required to integrate these principles into a relational database engine.

Soliman et al. [95] further extend the problem to probabilistic databases [5, 6, 8-10, 25, 27], and propose algorithms to find top-k groups ranked by a group-wise aggregate function when the underlying tuples are not deterministic. They also assume that the ranking function may be bounded and thus is monotonic. Their approach is mainly based on searching over the query space. The search has two levels: intragroup search and intergroup search. In the intragroup search stage, ordered nonoverlapping aggregate ranges are computed to materialize the necessary parts of the aggregate distribution of each group. This is the information needed to incrementally construct the search space. In the intergroup stage, probability-guided search is performed to navigate the search space and find the query answer.

2.3 Related Work in Statistics Literature

In this section, I briefly review some important topics in the statistical literature that are related to the work in this thesis. In the statistics literature, two areas are especially related to my research: multiple inference [13, 14, 26, 32, 58-60, 64, 82, 90, 96, 98, 99, 103], and the study of error guarantees for Monte-Carlo methods [12, 28, 81]. The following subsections discuss each of them in detail.

2.3.1 Multiple Inference

Although researchers in the database literature have not considered providing simultaneous confidence bounds for database queries, the study of multiple inference has a long history in statistics. The classical reference material is Miller's book [82]. Some newer summaries
of the field are the books of Hochberg and Tamhane [59], Westfall and Young [103], and Hsu [64].

Most results regarding multiple inference are related to hypothesis testing [18], which is the process of using observed data to make decisions about unobserved parameters of interest. In a hypothesis test, the tester defines a null hypothesis, a test statistic, and a cutting value. The null hypothesis is an assumption that an effect or result we may be interested in confirming is not true (for example, we may be interested in determining whether an HIV drug is effective; the natural null hypothesis assumes that the drug is not in fact effective). Once the null hypothesis is formulated, a test statistic is calculated using the observed data. The test statistic is used to compute the p-value, which is the probability of obtaining a test statistic at least as extreme as the one observed, assuming the null hypothesis is true. If the p-value is less than the cutting value p, the null hypothesis is rejected. This results in a Type-I (or false positive) rate of p. The so-called Type-II error is the probability of not rejecting a null hypothesis that is in fact false; it determines the power of the test. Further information about hypothesis testing can be found in many statistics books, such as [75].

In multiple hypothesis testing, rather than dealing with a single null hypothesis, one instead has a large number of hypotheses that are to be evaluated using the same data. The problem in this case is that since the same data is used for each test, an error on one test may increase the chance of an error on another. The easiest way to handle such potential correlations is to adjust the p-value associated with each test so that the overall chance of erroneously rejecting a hypothesis is kept manageable. There are a number of methods for doing this. The simplest and least effective way to adjust the p-value is the Bonferroni correction, or so-called classical Bonferroni procedure [82], which simply multiplies each p-value by the number of tested hypotheses. If the resulting p-value is larger than 1, it is treated as 1. There is also Dunn-Sidak's procedure [32]. Intuitively, if we would like to maintain an overall error rate $\alpha$ of rejecting at least one test
out of n tests, Dunn-Sidak's procedure uses the error rate $\alpha' = 1 - (1 - \alpha)^{1/n}$ to reject each individual test. These procedures are called "single step" procedures. In general, if there are many hypotheses to be tested, these procedures are not very powerful.

There are also "step-down" procedures such as Holm's procedure [60]. Assume we want to test n hypotheses simultaneously, and the overall Type-I error of making one or more false discoveries is $\alpha$. In Holm's procedure, the hypotheses are ordered by their p-values. The smallest p-value is compared to $\alpha/n$. If it is less than $\alpha/n$, the hypothesis is rejected. After it is rejected, the smallest p-value among the remaining hypotheses is compared to $\alpha/(n-1)$. It is rejected if the p-value is smaller than $\alpha/(n-1)$. The procedure continues until the hypothesis with the smallest remaining p-value cannot be rejected. At this point, all remaining hypotheses are accepted. This is called a "step-down" procedure because it starts with the smallest p-value and gradually increases the value to which the p-values are compared.

There are also "step-up" procedures, such as Hochberg's procedure [58]. Hochberg's procedure essentially uses the same criteria to reject hypotheses, but the procedure runs backwards. It starts with the hypothesis that has the largest p-value and compares this p-value to $\alpha$. It rejects all the hypotheses if this p-value is smaller than $\alpha$. Otherwise, it moves to the hypothesis with the largest p-value among the remaining hypotheses and compares that p-value to $\alpha/2$. If the p-value is less than $\alpha/2$, all the hypotheses with this p-value or smaller p-values are rejected. This procedure continues until no further comparison can be applied. Hochberg's procedure is a "step-up" procedure because it starts with the biggest p-value among all hypotheses. Hochberg's procedure is at least as powerful as Holm's procedure.

One common characteristic of all of the classic statistical methods for simultaneous inference (as well as all of those discussed thus far) is that the goal is to control the so-called "family-wise error rate" (FWER); that is, they seek to control the probability of falsely rejecting any of the hypotheses. They vary in power (Type-II error) and in how they
decide to associate a cutting value with the p-value of each hypothesis in order to maintain the FWER. In practice, all of these methods tend to be conservative.

There are several problems with controlling FWER. For example, controlling FWER is only meaningful when the overall conclusion from various individual inferences is likely to be erroneous even if only one of these individual inferences is wrong. In many cases, this is not true. For example, suppose we would like to know whether a new treatment is better than an old one. A treatment group and a control group are generally compared by testing various aspects. The conclusion that the new treatment is better is not necessarily wrong even if some of the null hypotheses are falsely rejected.

In recent years, statisticians have studied the limitations of controlling FWER and proposed other criteria. The first paper to address this problem was published by Benjamini and Hochberg [13]. Instead of controlling FWER, they propose the False Discovery Rate (FDR), which is the expected fraction of rejected null hypotheses that are actually true. Benjamini and Hochberg show that if all null hypotheses are true, controlling FDR is the same as controlling FWER. In general, however, FDR is no larger than FWER. Thus controlling FDR increases the power of the test. Benjamini and Hochberg propose a "step-up" procedure. The procedure sorts all the hypotheses based on their associated p-values. Assuming the FDR to be controlled is q and there are n hypotheses, the biggest p-value is compared to q; if it is smaller than q, all the hypotheses whose associated p-values are smaller than or equal to this p-value are rejected. If the hypothesis is accepted, the next largest p-value is compared to $(n-1)q/n$. This procedure continues until some hypotheses are rejected or every hypothesis has been compared. In general, the i-th largest p-value is compared to $(n-i+1)q/n$. This procedure is only valid for independent test statistics. In reality, this limits the usage of the procedure. The limitation of independent test statistics is addressed by Benjamini and Yekutieli [14]. They prove that the same procedure may be used when the test statistics have positive regression dependency on
each of the test statistics corresponding to the true null hypotheses. The dependency issue is further studied by others [90, 96].

Storey [98] points out several limitations of Benjamini-Hochberg's procedure and proposes a different approach to control the FDR. Storey points out that, in expectation, Benjamini-Hochberg's procedure controls the FDR at the chosen level. However, because the procedure involves estimation, its reliability in a specific case is not guaranteed, which is not good for users. Instead of fixing the FDR to be controlled, Storey studies how the expected error rate may be computed when all the associated p-values of the tested hypotheses are given. Unlike traditional multiple hypothesis testing, where we fix the error rate to control and determine the rejection region, this approach fixes the rejection region and then estimates its corresponding FDR. Storey shows experimentally that this approach achieves over eight times the power of Benjamini-Hochberg's procedure. The two approaches seem very different. Benjamini-Hochberg's approach first fixes the FDR to be controlled and then determines the rejection region, while Storey's approach first fixes the rejection region and then estimates the FDR. Storey et al. [99] prove that the two approaches are essentially equivalent, and that the estimate of the FDR can be used to define a Benjamini-Hochberg style procedure.

There is much follow-up research on FDR. Genovese and Wasserman [43] develop a framework that treats FDR control as a stochastic process and propose procedures that control the tail probabilities of FDR, assuming the test statistics are independent. Van der Laan et al. [101] propose methods to control the generalized family-wise error rate (GFWER) instead of controlling FDR. GFWER(k) is the probability of falsely rejecting more than k hypotheses. The authors propose simple p-value augmentation-based procedures to control GFWER and claim that these procedures are more useful because most procedures for controlling FDR require independent test statistics or only allow limited correlation among the test statistics. Many domain-specific methods have also been proposed, especially in the areas of
Genomics and Biology. The new book by Crowder and Van der Laan [26] gives a very thorough overview of recent research in multiple inference in general, and especially of applications in Genomics.

2.3.2 Error Guarantees for Monte-Carlo Methods

This section discusses another related topic in statistics, namely the study of error guarantees for Monte-Carlo methods. In the introduction, I briefly discussed how I solve the SI problem. One of the three methods is a Monte-Carlo resampling method. I also mentioned that I have proved an error guarantee between the empirical distribution constructed from samples and the true distribution. More specifically, for any distribution, in order to guarantee that the Euclidean distance between these two distributions is $\varepsilon$, the number of samples needed is at least $1/\varepsilon^2$. Similar studies have been proposed in the statistics literature. Most of this research concerns the error guarantees obtained when estimating the cumulative distribution function (cdf) [18] using independent and identically-distributed (i.i.d.) samples [18]. The most fundamental result is the Glivenko-Cantelli theorem [12]. We briefly describe it here.

Assume that $X_1, X_2, X_3, \ldots$ are i.i.d. random variables in $\mathbb{R}$ with common cdf $F(x)$. The empirical cdf for $X_1, X_2, X_3, \ldots, X_n$ is defined by

$$F_n(x) = \frac{1}{n}\sum_{i=1}^{n} I_{(-\infty, x]}(X_i),$$

where $I$ is an indicator function. For a fixed x, $F_n(x)$ is a random variable that converges to $F(x)$ with probability one. This is due to the strong law of large numbers [12]. Glivenko and Cantelli further proved that $F_n$ converges to $F$ uniformly, as shown below:

$$\|F_n - F\|_\infty = \sup_{x \in \mathbb{R}} |F_n(x) - F(x)| \to 0 \quad \text{with probability 1.}$$

The Glivenko-Cantelli theorem ensures that the empirical cdf converges to the actual cdf uniformly. Thus, when there is a sufficient number of i.i.d. samples, the distance between
the empirical cdf and the true cdf can be arbitrarily small. However, the Glivenko-Cantelli theorem does not provide any bound that relates the total number of samples to the distance between the empirical cdf and the actual cdf.

Several other results have been proposed that give an error bound related to the total number of samples. One of the most widely used distribution-independent bounds is the Dvoretzky-Kiefer-Wolfowitz inequality [81], which was first proposed by Dvoretzky, Kiefer, and Wolfowitz, and further refined by Massart, as shown below:

$$\Pr\left(\sup_{x \in \mathbb{R}} |F_n(x) - F(x)| > \varepsilon\right) \le 2e^{-2n\varepsilon^2}.$$

The inequality shows that the probability that the distance between the empirical cdf and the actual cdf exceeds $\varepsilon$ decreases exponentially as the sample size increases. This inequality is generally good enough for many distributions, especially discrete distributions [18]. For example, if we do not know anything about $F$ and wish to have at least 90% confidence that, for any point on the real line, the distance between the empirical cdf and the actual cdf is within 0.01, the total number of required samples can be computed from the following inequality: $2e^{-0.0002n} < 0.1$. Solving it, we have $n > 14978$.

Kolmogorov's theorem [12] is another important result in this field, for continuous distributions [18]. We define

$$D_n = \sup_{x \in \mathbb{R}} |F_n(x) - F(x)|.$$

If the underlying distribution is continuous, then $\sqrt{n}\,D_n$ converges to the Kolmogorov distribution regardless of what the underlying distribution is. The cdf of the Kolmogorov distribution is shown below:

$$\Pr(K \le x) = 1 - 2\sum_{k=1}^{\infty} (-1)^{k-1} e^{-2k^2x^2}.$$
This can be used to determine how many samples we need for a given distance $\varepsilon$ and a given confidence p. For example, if we would like to have at least 95% confidence that, for any point on the real line, the distance between the empirical cdf and the actual cdf is at most 0.01, the total number of samples required is around 18496.

When the distribution family is known in advance, various density estimation techniques [28] may be used, and tighter distribution-dependent bounds are available. It is beyond the scope of this section to give a thorough survey of these bounds. We refer readers who are interested in this area to Devroye and Lugosi's book [28], which discusses these topics in detail.
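
The two sample-size calculations in this subsection can be reproduced with a short script. The following is a minimal sketch, not part of the original analysis: the function names are hypothetical, the Kolmogorov cdf is evaluated with the standard alternating series truncated after a fixed number of terms, and the quantile is found by bisection. Note that the Kolmogorov-based figure is asymptotic, and that the value of about 18496 quoted above corresponds to rounding the 95% quantile to 1.36.

```python
import math


def dkw_sample_size(eps, confidence):
    """Smallest n with 2 * exp(-2 * n * eps^2) < 1 - confidence (DKW bound)."""
    alpha = 1.0 - confidence
    bound = math.log(2.0 / alpha) / (2.0 * eps * eps)
    return math.floor(bound) + 1  # strict inequality, so step past the bound


def kolmogorov_cdf(x, terms=100):
    """Pr(K <= x) = 1 - 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 k^2 x^2), truncated."""
    if x <= 0:
        return 0.0
    series = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * x * x)
                 for k in range(1, terms + 1))
    return 1.0 - 2.0 * series


def kolmogorov_quantile(p, lo=0.2, hi=5.0, iters=100):
    """Bisection for the x with Pr(K <= x) = p (assumes the root lies in [lo, hi])."""
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if kolmogorov_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return hi


def kolmogorov_sample_size(eps, confidence):
    """Asymptotic n such that sqrt(n) * eps reaches the Kolmogorov quantile."""
    q = kolmogorov_quantile(confidence)
    return math.ceil((q / eps) ** 2)


if __name__ == "__main__":
    print("DKW, eps=0.01, 90%:", dkw_sample_size(0.01, 0.90))
    print("Kolmogorov, eps=0.01, 95%:", kolmogorov_sample_size(0.01, 0.95))
```

Using the unrounded quantile gives a slightly smaller Kolmogorov-based sample size than the rounded value 1.36 does.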


CHAPTER 3
PRELIMINARIES

This chapter discusses a few statistical preliminaries. The remainder of the chapter is organized as follows. Section 3.1 discusses estimators and their properties. Section 3.2 discusses correlations. Section 3.3 describes confidence intervals and confidence regions.

3.1 Estimators and Their Properties

Let M be a parameter of interest that we are trying to estimate. In this thesis, the "parameter of interest" is typically the final answer to a query. An estimate $\hat{M}$ of M is a single number computed from a sample or other statistical summary, which serves as a guess of the value of M. Note that if the estimation process is repeated many times, many different estimates will be observed. These estimates are typically characterized using a random variable. The random variable whose observed value is used to estimate M is called an estimator and denoted $\tilde{M}$ (see footnote 1).

Footnote 1. When M is a parameter to be estimated, we will use $\tilde{M}$ to denote the associated estimator; we will also make use of the standard convention of using a capital letter such as N to denote a random variable.

For example, consider a SUM query over a single database table. Specifically:

    SELECT SUM(f(R))
    FROM R

The function f can encode any mathematical function over tuples from R, and can encode an arbitrary selection predicate. Also, a query of this form can be extended to handle a clause of the form GROUP BY att1, att2, ... by first identifying all of the n groups induced by the GROUP BY clause, and then re-running the query n times. During the i-th run, f(R) is modified to accept only tuples from the i-th group (f evaluates to zero for any tuple that is not from that group). For example, imagine that we add a clause GROUP BY gender to the above query, and gender has values male and female. We could simply define a function $f_{male}$ where $f_{male}(R) = f(R)$ if R.gender is male and zero otherwise, as well as a function $f_{female}$ where $f_{female}(R) = f(R)$ if R.gender is female and zero otherwise. Then, running the query twice (once with each function) gives an answer to the GROUP BY query.

Given such a query, the final answer can be computed as:

$$M = \sum_{r \in R} f(r) \tag{3-1}$$

The simplest way to estimate M is random sampling without replacement. To estimate M using this method, denote by $R'$ the set of samples from R and let $n_R$ denote the sample's size. If $N_R$ is the number of tuples in relation R, we can then express $\tilde{M}$ by introducing a set of Bernoulli (zero/one) random variables that indicate whether tuples from R are not/are in $R'$. Let $X_k$ be the variable that indicates whether the k-th tuple from R is in $R'$. With this, a natural estimator for the aggregate query is:

$$\tilde{M} = \frac{N_R}{n_R}\sum_{k=1}^{N_R} X_k f(k) \tag{3-2}$$

This estimator simply sums the elements in the sample set and scales the result by the inverse of the sampling fraction.

The estimator $\tilde{M}$ is called an unbiased estimator because $E[\tilde{M}] = M$, where $E[\tilde{M}]$ is the expected value of the estimator. That is, if the estimator were used many times in succession, the average would converge to the true answer. For an unbiased estimator, the variance is usually used as the criterion that indicates how good the estimator $\tilde{M}$ is, where $\sigma^2(\tilde{M}) = E[\tilde{M}^2] - E^2[\tilde{M}]$. Since $E[X_k] = n_R/N_R$ and $E[X_k X_l] = \frac{n_R(n_R-1)}{N_R(N_R-1)}$ for $l \ne k$, the variance of this estimator is:

$$\sigma^2(\tilde{M}) = \frac{N_R}{n_R}\cdot\frac{n_R-1}{N_R-1}\sum_{k=1}^{N_R}\sum_{l=1,\, l \ne k}^{N_R} f(k)f(l) + \frac{N_R}{n_R}\sum_{k=1}^{N_R} f^2(k) - M^2 \tag{3-3}$$

If an unbiased estimator has small variance, then there is only a small chance that its value is far from the true value of the parameter M.

This very simple estimator can be extended to a join over an arbitrarily large number of tables in a straightforward way. We refer to the resulting estimator as the Haas-Hellerstein estimator [49]. This generalization is the estimator used by the UC Berkeley
Control project [56], and if one samples directly from a pre-computed join, the simpler, one-table version of the Haas-Hellerstein estimator is the estimator used by the AQUA project [3] in their join synopsis method [1]. To extend the estimator, we first assume that $T_1, T_2, \ldots, T_k$ are k database tables, and that we wish to guess an answer to a query of the form:

    SELECT SUM(f(T1, T2, ..., Tk))
    FROM T1, T2, ..., Tk
    WHERE pred(T1, T2, ..., Tk)

We let $g(T_1, \ldots, T_k)$ be a function that returns $f(T_1, \ldots, T_k)$ if pred is true, and 0 otherwise. We assume the numbers of tuples in these tables are $N_1, N_2, \ldots, N_k$, and the sample sizes are $n_1, n_2, \ldots, n_k$, respectively. We define Bernoulli random variables $X_{w_1}, \ldots, X_{w_k}$ that govern whether or not the $w_u$-th tuple from $T_u$ is sampled, for $u = 1, \ldots, k$. The following then serves as the Haas-Hellerstein estimator for the sum of f:

$$\tilde{M} = \frac{N_1 \cdots N_k}{n_1 \cdots n_k}\sum_{w_1, \ldots, w_k} X_{w_1} \cdots X_{w_k}\, g(w_1, \ldots, w_k) \tag{3-4}$$

3.2 Correlations

This dissertation considers how correlation-aware methods may be developed to take the correlations among groupwise estimates into account when processing sampling-based GROUP BY queries. In general, the covariance between two variables provides a measure of the correlation between them. The covariance of two random variables X and Y is defined as:

$$\mathrm{Cov}(X, Y) = E[XY] - E[X]E[Y] \tag{3-5}$$

For uncorrelated variables, the covariance between them is 0. If $\mathrm{Cov}(X, Y) > 0$, then Y tends to increase as X increases, and if $\mathrm{Cov}(X, Y) < 0$, then Y tends to decrease as X increases. Figure 3-1 shows the contours of the standard bivariate normal distribution with covariance 0.5 and -0.5, respectively. From the figure, we see that for a
negative covariance the distribution's contours are a set of ellipses whose major axes fall into the second and fourth quadrants. This indicates that Y tends to decrease as X increases. For the positive covariance the contours are a set of ellipses whose major axes fall into the first and third quadrants. This indicates that Y tends to increase as X increases.

Figure 3-1. Standard bivariate normal (coef = 0.5 (left) and -0.5 (right)).

In general, such multivariate distributions are specified using a covariance matrix $\Sigma$ and a mean vector $\mu$. For the joint distribution of the error of a set of estimators $\tilde{M}_i$, $\mu_i = 0$ and $\Sigma_{i,j} = \mathrm{Cov}(\tilde{M}_i, \tilde{M}_j)$.

3.3 Confidence Intervals and Confidence Regions

A confidence interval (CI) for M is a random interval $[l, h]$ with a user-defined probability p associated with it. This is the probability that the interval that is chosen actually contains the parameter of interest. Generally, p is called the confidence and $\alpha = 1 - p$ is called the Type-I error of the estimator $\tilde{M}$. Properties such as an estimator's variance and unbiasedness can be used to compute a CI. Define:

$$err = \tilde{M} - M \tag{3-6}$$
to be the error of some estimator $\tilde{M}$. Since the estimator is unbiased, the mean of err is 0 and its variance is equal to $\sigma^2(\tilde{M})$. According to the Central Limit Theorem, the distribution of err is asymptotically normal for the Haas-Hellerstein estimator considered in this thesis, as long as a large-enough database sample is used (see Haas and Hellerstein, Sections 5.2.1 and 6, for more details [49]). Since (as discussed in Section 3.1) a GROUP BY query can be encoded as a number of Haas-Hellerstein estimators, the estimate for each group in a SUM-based GROUP BY query is also asymptotically normally distributed.

If the error follows a normal distribution, a CI with confidence p is $[-z_{p/2}\,\sigma(\tilde{M}),\ z_{p/2}\,\sigma(\tilde{M})]$, where $z_{p/2}$ is a coefficient that is determined by the cumulative distribution function of the normal distribution and $\sigma(\tilde{M})$ is the standard deviation of $\tilde{M}$, which is the square root of the variance $\sigma^2(\tilde{M})$. Thus, the probability that $\hat{M} - M$ is in this interval is p. By manipulating this expression, we obtain:

$$\Pr\left[M \in [\hat{M} - z_{p/2}\,\sigma(\tilde{M}),\ \hat{M} + z_{p/2}\,\sigma(\tilde{M})]\right] = p \tag{3-7}$$

Therefore:

$$[\hat{M} - z_{p/2}\,\sigma(\tilde{M}),\ \hat{M} + z_{p/2}\,\sigma(\tilde{M})] \tag{3-8}$$

is a CI for M with confidence level p.

The concept of a CI in one-dimensional spaces can be extended to multidimensional spaces. Since the area is no longer an interval, it is called a confidence region. A confidence region with user-defined confidence p is a random region in a multi-dimensional space, such that the probability of seeing the parameter of interest in the region is p. The parameter of interest in this thesis is the query answer for a GROUP BY query. In this thesis, I am mainly interested in providing confidence regions for the joint distribution of the vector $\tilde{M}$ of groupwise estimators, specified using a covariance matrix $\Sigma$ and a mean vector $\mu$, such that $\mu_i = M_i$ and $\Sigma_{i,j} = \mathrm{Cov}(\tilde{M}_i, \tilde{M}_j)$. Chapter 6 will discuss how to build various types of confidence regions in more detail.
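
The single-table estimator and the normal-theory confidence interval described in this chapter can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, the table is a plain Python list of f-values, the coefficient is taken to be the standard normal quantile at (1 + p)/2, and the variance is computed from the full table via $\sigma^2 = E[\tilde{M}^2] - M^2$, which is possible in a simulation but not in a real sampling setting, where it would have to be estimated.

```python
import math
import random
from statistics import NormalDist


def sum_estimate(sample_values, population_size):
    """Scale the sample sum by the inverse sampling fraction N / n."""
    n = len(sample_values)
    return population_size / n * sum(sample_values)


def true_variance(values, n):
    """Variance of the estimator under sampling n tuples without replacement.

    Uses E[X_k] = n/N and E[X_k X_l] = n(n-1) / (N(N-1)) for k != l.
    Assumption: the whole table is available (simulation only).
    """
    N = len(values)
    total = sum(values)
    squares = sum(v * v for v in values)
    cross = total * total - squares  # sum over k != l of f(k) * f(l)
    return (N / n) * ((n - 1) / (N - 1)) * cross + (N / n) * squares - total * total


def confidence_interval(estimate, std_dev, p):
    """Two-sided normal interval with confidence p around the estimate."""
    z = NormalDist().inv_cdf((1.0 + p) / 2.0)
    return estimate - z * std_dev, estimate + z * std_dev


def coverage(values, n, p, trials, seed=0):
    """Fraction of simulated intervals that contain the true SUM."""
    rng = random.Random(seed)
    N = len(values)
    M = sum(values)
    sd = math.sqrt(true_variance(values, n))
    hits = 0
    for _ in range(trials):
        est = sum_estimate(rng.sample(values, n), N)
        lo, hi = confidence_interval(est, sd, p)
        hits += lo <= M <= hi
    return hits / trials


if __name__ == "__main__":
    data_rng = random.Random(42)
    table = [data_rng.uniform(0, 100) for _ in range(1000)]
    print("empirical coverage:", coverage(table, 100, 0.95, 2000))
```

If the normal approximation is adequate for the chosen sample size, the empirical coverage should be close to the requested confidence p.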


CHAPTER 4
COVARIANCE COMPUTATION

The very first step in providing correlation-aware statistical methods for sampling-based GROUP BY queries is to compute the covariances among the groupwise estimators. This requires both mathematically deriving the formula for the covariance between a pair of groupwise estimators, and developing an efficient algorithm to compute it. This chapter describes various aspects of the covariance computation. The remainder of this chapter is organized as follows. Section 4.1 formally analyzes the covariance between a pair of groupwise estimators for a GROUP BY query over one or more database tables. Section 4.2 shows how to develop a hash-based algorithm to compute the covariance efficiently. The efficiency of the proposed algorithm is experimentally evaluated in Section 4.3. Section 4.4 concludes this chapter.

4.1 Covariance Analysis

This section considers the problem of how to formally analyze the covariance of two groupwise estimators using the "Haas-Hellerstein estimator" [49] for SUM queries over one or more database tables. Note that this analysis is based on finite population sampling. This is different from the analysis in [48, 52], which is based on large-sample, infinite-population assumptions.

4.1.1 Analysis for SUM over Two Database Tables

It is natural to begin by considering a SUM query over two database tables. Specifically:

    SELECT SUM(f(R, S))
    FROM R, S
    GROUP BY group(R, S)

Note that the function f can encode any mathematical function over pairs from R and S (including a COUNT query). The above query may be used to perform the
computation required for a single group in a GROUP BY query if f limits the sum to only those tuples belonging to the group in question. Let $f_i(R, S)$ denote the function for estimator $\tilde{M}_i$ that returns the value of $f(R, S)$ if the tuple $(R, S)$ belongs to the i-th group, and returns zero otherwise.

The numbers of tuples in relations R and S are denoted by $N_R$ and $N_S$, respectively. Furthermore, the samples from R and S are denoted by $R'$ and $S'$, respectively, and $n_R$ and $n_S$ are used for their respective sizes. In order to perform the analysis, it is helpful to introduce Bernoulli (zero/one) random variables that indicate whether tuples from the relations are not/are in the sample. Let $X_k$ and $Y_l$ be variables that indicate whether the k-th and l-th tuples from R and S are in $R'$ and $S'$, respectively. With this, the i-th natural (Haas-Hellerstein) estimator of the aggregate is:

$$\tilde{M}_i = \frac{N_R}{n_R}\cdot\frac{N_S}{n_S}\sum_{k=1}^{N_R}\sum_{l=1}^{N_S} X_k Y_l\, f_i(k, l) \tag{4-1}$$

where the double sum is really a sum over the tuples of $R'$ and $S'$ expressed using the Bernoulli variables.

Before the estimate $\tilde{M}_i$ can be characterized, some useful properties of the random variables $X_k$ and $Y_l$ need to be derived. The necessary results and proofs are presented for $X_k$ only; the results for $Y_l$ are obtained simply by substituting $Y_l$ for $X_k$ and S for R in all equations. First, some notation:

$$\alpha_R = \frac{N_R}{n_R}; \qquad \beta_R = \frac{N_R - 1}{n_R - 1} \tag{4-2}$$

For the purpose of deriving intuitive formulas, the following function is introduced:

$$\delta_{k,k'}(t) = \begin{cases} 1 & k = k' \wedge t = 0 \\ 1 & k \ne k' \wedge t = 1 \\ 0 & \text{otherwise} \end{cases} \tag{4-3}$$
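Intuitively, this function allows us to turn on/off terms that have equal/not equal values for the indices $k, k'$ based on the value of the variable t. The following properties of the $\delta_{k,k'}$ function will be used extensively:

Lemma 1. For any function $F(k, k')$, the following holds:

$$\forall k, k': \quad \sum_{t=0}^{1} \delta_{k,k'}(t) = 1 \tag{4-4}$$

$$\sum_{k}\sum_{k'} \delta_{k,k'}(t)\, F(k, k') = \begin{cases} \sum_{k} F(k, k) & t = 0 \\ \sum_{k}\sum_{k' \ne k} F(k, k') & t = 1 \end{cases} \tag{4-5}$$

$$\sum_{t=0}^{1}\sum_{k}\sum_{k'} \delta_{k,k'}(t)\, F(k, k') = \sum_{k}\sum_{k'} F(k, k') \tag{4-6}$$

Proof. To prove the first identity, observe that irrespective of whether $k = k'$, $\delta_{k,k'}(t)$ takes value 1 for one $t \in \{0, 1\}$ and value 0 for the other; thus the sum is always 1. The second identity follows directly from the definition of $\delta$ by observing that some terms vanish since they have a 0 coefficient due to $\delta$. The third identity follows directly from the second, since the two cases in Equation 4-5 are summed up and together they complete the double sum.

With these, the following result, which characterizes the Bernoulli random variables, is available:

Lemma 2. If $R'$ is a sample without replacement of R, then for the k-th tuple of R the following holds:

$$E[X_k] = \frac{1}{\alpha_R} \tag{4-7}$$

$$\delta_{k,k'}(t)\, E[X_k X_{k'}] = \delta_{k,k'}(t)\, \frac{1}{\alpha_R \beta_R^{t}} \tag{4-8}$$

Proof. The expected value of a zero/one random variable is the probability of the variable taking the value 1; in this case $1/\alpha_R$ is the probability that the k-th tuple of R is in $R'$. When $k = k'$, since $X_k^2 = X_k$, $E[X_k X_{k'}]$ is $1/\alpha_R$. When $k \ne k'$, $E[X_k X_{k'}] = P[X_k = 1 \wedge X_{k'} = 1] = P[k \in R']\, P[k' \in R' \mid k \in R']$. Since the conditional probability is $1/\beta_R$, $E[X_k X_{k'}] = \frac{1}{\alpha_R \beta_R}$. Using the definition of $\delta_{k,k'}(t)$, the two cases can be easily encoded as in the statement of the lemma. Note that $\delta$ has to appear on the right side as well as on the left side to ensure the right side is 0 whenever the left side is.

In the formulas derived subsequently, a particular set of terms appears that deserves special attention. These terms are first introduced and characterized. For $t_R, t_S \in \{0, 1\}$ and the i-th and j-th estimators, the following is defined:

$$P_{t_R, t_S} = \sum_{k=1}^{N_R}\sum_{l=1}^{N_S}\sum_{k'=1}^{N_R}\sum_{l'=1}^{N_S} \delta_{k,k'}(t_R)\, \delta_{l,l'}(t_S)\, f_i(k, l)\, f_j(k', l') \tag{4-9}$$

Depending on the values of $t_R$ and $t_S$, these terms can be rewritten using the second identity in Lemma 1 as sums of the form $f_i f_j$. For example,

$$P_{0,1} = \sum_{k=1}^{N_R}\sum_{l=1}^{N_S}\sum_{l'=1,\, l' \ne l}^{N_S} f_i(k, l)\, f_j(k, l') \tag{4-10}$$

An extra property of these terms is given by:

Lemma 3.

$$\sum_{t_R=0}^{1}\sum_{t_S=0}^{1} P_{t_R, t_S} = \sum_{k=1}^{N_R}\sum_{l=1}^{N_S}\sum_{k'=1}^{N_R}\sum_{l'=1}^{N_S} f_i(k, l)\, f_j(k', l') \tag{4-11}$$

Proof. The proof follows directly from the second identity in Lemma 1 and simple reorganization of the sums.

Theorem 1. For any arbitrary estimators $\tilde{M}_i$ and $\tilde{M}_j$ given by Equation 4-1, the following holds:

$$E[\tilde{M}_i] = \sum_{k=1}^{N_R}\sum_{l=1}^{N_S} f_i(k, l) \tag{4-12}$$

$$\mathrm{Cov}(\tilde{M}_i, \tilde{M}_j) = \sum_{t_R=0}^{1}\sum_{t_S=0}^{1} \left(\frac{\alpha_R}{\beta_R^{t_R}}\cdot\frac{\alpha_S}{\beta_S^{t_S}} - 1\right) P_{t_R, t_S} \tag{4-13}$$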
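
The statement of Theorem 1 can be sanity-checked numerically on very small tables. The following is a minimal sketch, not the algorithm developed in this chapter: it enumerates every equally likely pair of samples to obtain the exact covariance and compares it with the closed form of Equation 4-13. All function names are hypothetical, the functions $f_i$ and $f_j$ are given as dense matrices indexed by tuple positions, and sample sizes of at least 2 are assumed so that $\beta_R$ and $\beta_S$ are defined.

```python
from fractions import Fraction
from itertools import combinations, product


def exact_covariance(fi, fj, NR, NS, nR, nS):
    """Covariance of the two estimators, by enumerating all sample pairs."""
    scale = Fraction(NR, nR) * Fraction(NS, nS)
    pairs = []
    for r_sample in combinations(range(NR), nR):
        for s_sample in combinations(range(NS), nS):
            mi = scale * sum(fi[k][l] for k in r_sample for l in s_sample)
            mj = scale * sum(fj[k][l] for k in r_sample for l in s_sample)
            pairs.append((mi, mj))
    count = len(pairs)
    e_i = sum(a for a, _ in pairs) / count
    e_j = sum(b for _, b in pairs) / count
    e_ij = sum(a * b for a, b in pairs) / count
    return e_ij - e_i * e_j


def delta(k, k2, t):
    """Indicator from Equation 4-3: equal indices for t=0, unequal for t=1."""
    return 1 if (k == k2 and t == 0) or (k != k2 and t == 1) else 0


def formula_covariance(fi, fj, NR, NS, nR, nS):
    """Closed form of Equation 4-13 (assumes nR >= 2 and nS >= 2)."""
    a_r, b_r = Fraction(NR, nR), Fraction(NR - 1, nR - 1)
    a_s, b_s = Fraction(NS, nS), Fraction(NS - 1, nS - 1)
    cov = Fraction(0)
    for t_r, t_s in product((0, 1), repeat=2):
        p_term = sum(delta(k, k2, t_r) * delta(l, l2, t_s) * fi[k][l] * fj[k2][l2]
                     for k in range(NR) for k2 in range(NR)
                     for l in range(NS) for l2 in range(NS))
        cov += ((a_r / b_r ** t_r) * (a_s / b_s ** t_s) - 1) * p_term
    return cov


def restrict_to_group(f, group, g):
    """Build f_g: keep f(k, l) where the pair belongs to group g, else 0."""
    return [[f[k][l] if group[k][l] == g else 0 for l in range(len(f[0]))]
            for k in range(len(f))]


if __name__ == "__main__":
    f = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]      # made-up values
    group = [[0, 1, 0], [1, 0, 1], [0, 0, 1], [1, 1, 0]]     # made-up groups
    f0, f1 = restrict_to_group(f, group, 0), restrict_to_group(f, group, 1)
    print(exact_covariance(f0, f1, 4, 3, 2, 2) == formula_covariance(f0, f1, 4, 3, 2, 2))
```

Exact rational arithmetic is used so that the two quantities can be compared for equality rather than up to rounding error. The enumeration grows combinatorially and is meant only for toy inputs.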


Proof. UsinglinearityofexpectationandtherstidentityinLemma 2 ,theunbiasednessoftheexpectationoffMifollowsimmediately.Tocomputethecovariance,therststepistoestimateEhfMifMji.EhfMifMji=2R2SNRXk=1NRXk0=1NSXl=1NSXl0=1E[XkXk0]E[YlYl0]fik;lfjk0;l0{14Torewritethisterm,observethatbymultiplyingeachtermwithinthesummationbyP1tR=0k;k0tRandP1tS=0l;l0tSboththesemultipliersare1bytherstidentityinLemma 1 thustheydonotchangetheidentityandregroupingthesums,thefollowingisobtained:EhfMifMji=2R2S1XtR=01XtS=0NRXk=1NRXk0=1NSXl=1NSXl0=1k;k0tRE[XkXk0]l;l0tSE[YlYl0]fik;lfjk0;l0{15Now,usingthesecondidentityinLemma 2 andthedenitionofPtR;tSthefollowingholds:EhfMifMji=1XtR=01XtS=0R tRRS tSSPtR;tS{16UsingnowthefactthatCovfMi;fMj=EhfMifMji)]TJ/F21 11.955 Tf 12.422 0 Td[(EhfMiiEhfMjiandtheresultinLemma 3 therequiredresultisobtained. Notethattheaboveresultdoesnotrequirethatiandjbedierent;thus,theresultappliestothecasewheni=j,whichgivesthevarianceoffMi.4.1.2AnalysisforSUMoverMultipleDatabaseTablesThepreviousresultisnowextendedtoanarbitrarynumberofdatabasetables.AssumethatT1;T2;:::;Tkarekdatabasetables,andconsiderthequeriesoftheform:SELECTSUMfT1;T2;:::;TkFROMT1;T2;:::;TkGROUPBYgroupT1;T2;:::;Tk 58


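Just as before, let $f_i(T_1, \ldots, T_k)$ be a function that returns $f(T_1, \ldots, T_k)$ if the tuple $(T_1, \ldots, T_k)$ belongs to group $i$, and 0 otherwise. The numbers of tuples in these tables are $N_1, N_2, \ldots, N_k$, respectively. In addition, it is assumed that the sample sizes are $n_1, n_2, \ldots, n_k$, respectively. Furthermore, Bernoulli random variables $X_{w_1}, \ldots, X_{w_k}$ are defined to govern whether or not the $w_u$th tuple of $T_u$ is sampled, for $u = 1, \ldots, k$. The following then serves as the $i$th estimator for the sums of $f$:

\[ \widetilde{M}_i = \frac{N_1 \cdots N_k}{n_1 \cdots n_k} \sum_{w_1, \ldots, w_k} X_{w_1} \cdots X_{w_k}\, f_i(w_1, \ldots, w_k) \tag{4-17} \]

As in the previous subsection, in order to derive the covariance of $\widetilde{M}_i$ and $\widetilde{M}_j$, for $u$ in 1 to $k$, the following are defined:

\[ \alpha_u = \frac{N_u}{n_u}, \qquad \beta_u = \frac{N_u - 1}{n_u - 1} \tag{4-18} \]

\[ P_{t_1,\ldots,t_k} = \sum_{w_1=1}^{N_1} \sum_{w'_1=1}^{N_1} \cdots \sum_{w_k=1}^{N_k} \sum_{w'_k=1}^{N_k} \prod_{u=1}^{k} \delta^{t_u}_{w_u,w'_u}\, f_i(w_1, \ldots, w_k)\, f_j(w'_1, \ldots, w'_k) \tag{4-19} \]

Theorem 2. (The covariance for multiple database tables.) The covariance of $\widetilde{M}_i$ and $\widetilde{M}_j$ is:

\[ \mathrm{Cov}\big(\widetilde{M}_i, \widetilde{M}_j\big) = \sum_{t_1=0}^{1} \cdots \sum_{t_k=0}^{1} \left( \prod_{u=1}^{k} \frac{\alpha_u}{\beta_u^{t_u}} - 1 \right) P_{t_1,\ldots,t_k} \tag{4-20} \]

Proof. The proof mirrors the proof of Theorem 1. The $k$ terms of the form $\sum_{t_u=0}^{1} \delta^{t_u}_{w_u,w'_u}$ are multiplied with the terms in the multiple summation of the expansion of $E\big[\widetilde{M}_i \widetilde{M}_j\big]$; then the results in Lemmas 2 and 1, the definition of $P_{t_1,\ldots,t_k}$, the generalization of Lemma 3, and summation regrouping give the required result.

4.1.3 An Unbiased Covariance Estimator

In order to compute the covariances using the formulas described above, it is necessary to have access to the entire database tables, which is obviously not feasible if sampling is performed. As a result, it is useful to obtain an unbiased estimator for the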


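covariance of two estimators, given only the sample that is used to estimate the answer to the query. Consider the following estimator:

\[ \widetilde{\mathrm{Cov}}\big(\widetilde{M}_i, \widetilde{M}_j\big) = \sum_{t_1=0}^{1} \cdots \sum_{t_k=0}^{1} \left( \prod_{u=1}^{k} \frac{\alpha_u}{\beta_u^{t_u}} - 1 \right) \left( \prod_{v=1}^{k} \alpha_v \beta_v^{t_v} \right) \sum_{w_1=1}^{N_1} \sum_{w'_1=1}^{N_1} \cdots \sum_{w_k=1}^{N_k} \sum_{w'_k=1}^{N_k} \delta^{t_1}_{w_1,w'_1} \cdots \delta^{t_k}_{w_k,w'_k}\, X_{w_1} X_{w'_1} \cdots X_{w_k} X_{w'_k}\, f_i(w_1, \ldots, w_k)\, f_j(w'_1, \ldots, w'_k) \tag{4-21} \]

It is simple to prove that this estimator is unbiased for the covariance of two estimators. To show this, note that the expected value of:

\[ \sum_{w_1=1}^{N_1} \sum_{w'_1=1}^{N_1} \cdots \sum_{w_k=1}^{N_k} \sum_{w'_k=1}^{N_k} \delta^{t_1}_{w_1,w'_1} \cdots \delta^{t_k}_{w_k,w'_k}\, X_{w_1} X_{w'_1} \cdots X_{w_k} X_{w'_k}\, f_i(w_1, \ldots, w_k)\, f_j(w'_1, \ldots, w'_k) \tag{4-22} \]

is:

\[ \frac{1}{\prod_{v=1}^{k} \alpha_v \beta_v^{t_v}}\, P_{t_1,\ldots,t_k} \tag{4-23} \]

Therefore, the expected value of $\widetilde{\mathrm{Cov}}\big(\widetilde{M}_i, \widetilde{M}_j\big)$ is the covariance of $\widetilde{M}_i$ and $\widetilde{M}_j$.

4.2 Computational Considerations for Covariances

The previous section gives a detailed derivation of the covariance formulas and associated unbiased estimators for sampling multi-table joins over GROUP BY queries. Unfortunately, if one were to implement them directly using the obvious nested-loops computations, it would be prohibitively expensive to compute even a single entry in the covariance matrix. This section considers how these formulas can be implemented efficiently. For ease of exposition, this section will consider how to compute the covariance of the following two queries, which are for two groups:

SELECT sum(f1(p1, l1))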


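FROM PART AS p1, LINEITEM l1

SELECT sum(f2(p2, l2))
FROM PART AS p2, LINEITEM l2

In these queries, $f_1$ computes the total aggregate value for one group, while $f_2$ computes the total aggregate value for a second group. $f_1$ and $f_2$ also evaluate any selection and/or join conditions. Subsequently, I assume that both functions correspond to queries that can be evaluated using a hash join or a sort-merge join. Upon examining the formulas given above, the key statistics needed to estimate the covariance between the two estimators are:

\[ V_{TT} = \sum_{p_1 \in P} \sum_{l_1 \in L} f_1(p_1,l_1)\, f_2(p_1,l_1) \]
\[ V_{TF} = \sum_{p_1 \in P} \sum_{l_1 \in L} \sum_{l_2 \in L,\, l_2 \neq l_1} f_1(p_1,l_1)\, f_2(p_1,l_2) \]
\[ V_{FT} = \sum_{p_1 \in P} \sum_{p_2 \in P,\, p_2 \neq p_1} \sum_{l_1 \in L} f_1(p_1,l_1)\, f_2(p_2,l_1) \]
\[ V_{FF} = \sum_{p_1 \in P} \sum_{p_2 \in P,\, p_2 \neq p_1} \sum_{l_1 \in L} \sum_{l_2 \in L,\, l_2 \neq l_1} f_1(p_1,l_1)\, f_2(p_2,l_2) \]

In the above summations, $P$ and $L$ refer to the samples of PART and LINEITEM, respectively. Equivalent formulas for these statistics can be obtained by making use of the indicator function $I$:

\[ V_{TT} = \sum_{p_1 \in P} \sum_{p_2 \in P} \sum_{l_1 \in L} \sum_{l_2 \in L} I(p_1 = p_2)\, I(l_1 = l_2)\, f_1(p_1,l_1)\, f_2(p_2,l_2) \]
\[ V_{TF} = \sum_{p_1 \in P} \sum_{p_2 \in P} \sum_{l_1 \in L} \sum_{l_2 \in L} I(p_1 = p_2)\, I(l_1 \neq l_2)\, f_1(p_1,l_1)\, f_2(p_2,l_2) \]
\[ V_{FT} = \sum_{p_1 \in P} \sum_{p_2 \in P} \sum_{l_1 \in L} \sum_{l_2 \in L} I(p_1 \neq p_2)\, I(l_1 = l_2)\, f_1(p_1,l_1)\, f_2(p_2,l_2) \]
\[ V_{FF} = \sum_{p_1 \in P} \sum_{p_2 \in P} \sum_{l_1 \in L} \sum_{l_2 \in L} I(p_1 \neq p_2)\, I(l_1 \neq l_2)\, f_1(p_1,l_1)\, f_2(p_2,l_2) \]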
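The indicator-function form of the four statistics translates directly into a nested loop over the non-zero results of the two queries. The sketch below is one way to write it, assuming each result set is a list of hypothetical `(rid_p, rid_l, value)` triples holding the source record IDs and the non-zero $f_1$ or $f_2$ value. The sample data are the rows of Tables 4-1 and 4-2.

```python
def pairwise_statistics(s1, s2):
    """Accumulate V_TT, V_TF, V_FT and V_FF with a double loop.

    s1, s2: lists of (rid_p, rid_l, value) triples for the non-zero
    f1 and f2 values. Pairs with a zero value contribute nothing to
    any of the sums, so they can be left out of the inputs.
    """
    stats = {"TT": 0, "TF": 0, "FT": 0, "FF": 0}
    for p1, l1, v1 in s1:
        for p2, l2, v2 in s2:
            # First letter: do the PART RIDs match? Second: LINEITEM RIDs?
            key = ("T" if p1 == p2 else "F") + ("T" if l1 == l2 else "F")
            stats[key] += v1 * v2
    return stats


# Rows of Table 4-1 (query 1) and Table 4-2 (query 2).
S1 = [(1, 1, 2), (1, 3, 5), (3, 2, 4), (2, 1, 3)]
S2 = [(3, 1, 3), (1, 2, 2), (2, 3, 3), (3, 2, 1)]
print(pairwise_statistics(S1, S2))
```

This direct approach takes time proportional to the product of the two result-set sizes, which is the cost the remainder of the section sets out to avoid.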


rid(p1)   rid(l1)   f1(p1, l1)
1         1         2
1         3         5
3         2         4
2         1         3
Table 4-1. Example results of query 1

rid(p2)   rid(l2)   f2(p2, l2)
3         1         3
1         2         2
2         3         3
3         2         1
Table 4-2. Example results of query 2

Note that these two relations have four statistics, where the subscript TT, for example, requires a double summation over the result sets for $f_1$ and $f_2$, considering identical source record pairs. The subscript TF requires a summation over the two result sets for non-zero $f_1$ and $f_2$ values where the source records from $P$ are the same, but those from $L$ differ. FT is similar. The subscript FF requires a summation over the two result sets where the source records from both $P$ and $L$ differ.

In order to calculate each of these statistics, a reasonable method would be to make use of a double summation over the result sets from both of the queries. First, a hash join would be used to compute the set of all non-zero $f_1$ values, since those are the only ones that may contribute to the actual value for the statistic. Then, in a similar fashion, a hash join would be used to compute the set of all non-zero $f_2$ values. In order to determine whether each non-zero value can contribute to one of the statistics, for each result tuple the record IDs (RIDs) of the two tuples that were used to produce the value need to be stored. For example, for $V_{FF}$, only those pairs of record pairs where $I(p_1 \neq p_2)$ and $I(l_1 \neq l_2)$ both evaluate to 1 can contribute to the total; this can be determined if the necessary RIDs are available. Once this data has been computed and stored in an array, then a nested summation can be used to compute each statistic. For example, consider Tables 4-1 and 4-2, which give example non-zero $f_1$ and $f_2$ values along with the source


record IDs for the two queries. Given such a data structure, the algorithm shown in Figure 4-1 then computes $V_{FF}$.

Function CalculateVFF(S1, S2)
1. $V_{FF} = 0$
2. For every record in S1
3.    For every record in S2
4.       $V_{FF} = V_{FF} + I(p_1 \neq p_2)\, I(l_1 \neq l_2)\, f_1(p_1,l_1)\, f_2(p_2,l_2)$
5. Return $V_{FF}$
Figure 4-1. The algorithm for computing $V_{FF}$.

If $S_1$ is the set of non-zero $f_1$ values and $S_2$ is the analogous set for $f_2$, this algorithm would require $O(n |S_1| |S_2|)$ time. For a larger number of relations, the complexity increases quickly. In general, over $n$ relations, the total number of statistics required is $2^n$, and the time complexity for relations $S_1$ to $S_n$ is $O(n \prod |S_i|)$. By careful implementation, the time complexity in the two-relation case can be reduced to $O(|S_1| + |S_2|)$ and, by extension, the complexity over $n$ relations can be reduced to $O(n \sum |S_i|)$. Continuing with the same example, the basic idea is as follows:

1. First, it is easy to calculate $V_{TT}$. A hash join is simply performed on the RIDs of both PART and LINEITEM.

2. Second, once $V_{TT}$ has been calculated, it is easy to calculate $V_{TF}$. Let

\[ V_{T\ast} = \sum_{p_1 \in P} \sum_{p_2 \in P} \sum_{l_1 \in L} \sum_{l_2 \in L} I(p_1 = p_2)\, f_1(p_1,l_1)\, f_2(p_2,l_2) \tag{4-24} \]

Instead of checking the equality of both pairs of RIDs from both relations, $V_{T\ast}$ only checks the equality of the two RIDs from PART. This can be computed efficiently using a hash join on the RID from PART. Then, note that:

\[ V_{T\ast} = \sum_{p_1 \in P} \sum_{p_2 \in P} \sum_{l_1 \in L} \sum_{l_2 \in L} I(p_1 = p_2)\, f_1(p_1,l_1)\, f_2(p_2,l_2) = \sum_{p_1 \in P} \sum_{p_2 \in P} \sum_{l_1 \in L} \sum_{l_2 \in L} I(p_1 = p_2) \big( I(l_1 = l_2) + I(l_1 \neq l_2) \big) f_1(p_1,l_1)\, f_2(p_2,l_2) = V_{TF} + V_{TT} \tag{4-25} \]

As a result, $V_{TF} = V_{T\ast} - V_{TT}$.


3. Third, $V_{FT}$ can be computed in a similar way. First $V_{\ast T}$ is calculated using a hash join on the RID from LINEITEM; then, $V_{FT} = V_{\ast T} - V_{TT}$.

4. Finally, it is easy to calculate $V = V_{TT} + V_{TF} + V_{FT} + V_{FF}$. This is equivalent to calculating:

\[ \left( \sum_{p_1 \in P} \sum_{l_1 \in L} f_1(p_1,l_1) \right) \left( \sum_{p_2 \in P} \sum_{l_2 \in L} f_2(p_2,l_2) \right) \tag{4-26} \]

This calculation only requires a single scan of both $S_1$ and $S_2$. Then,

\[ V_{FF} = V - V_{TT} - V_{TF} - V_{FT} \tag{4-27} \]

The time complexity is $O(|S_1| + |S_2|)$ for each step. Therefore, the overall time complexity is $O(|S_1| + |S_2|)$.

The process is illustrated with an example. $S_1$ and $S_2$ are shown in Tables 4-1 and 4-2, respectively. By executing each step, the following are obtained:

$V_{TT} = 4$, $V_{T\ast} = 39$, $V_{\ast T} = 42$, $V = 126$

Therefore:

$V_{TF} = 39 - 4 = 35$
$V_{FT} = 42 - 4 = 38$
$V_{FF} = 126 - 4 - 35 - 38 = 49$

It is not hard to extend this idea to any number of relations. For example, if the two queries both have three relations, the first step is to calculate $V_{TTT}$ using a hash join on the RIDs from all three relations. The next step is to calculate $V_{TT\ast}$, $V_{T\ast T}$ and $V_{\ast TT}$. $V_{TTF}$, $V_{TFT}$ and $V_{FTT}$ can then be obtained. Similarly, $V_{T\ast\ast}$, $V_{\ast T\ast}$ and $V_{\ast\ast T}$ may be calculated to obtain $V_{TFF}$, $V_{FFT}$ and $V_{FTF}$. Finally, $V$ may be calculated and used to compute $V_{FFF}$. The only difficulty is that the inclusion/exclusion rule needs to be applied


carefully to ensure that no values are double-counted. For example, to calculate $V_{TFF}$, one should notice that $V_{T\ast\ast} = V_{TTT} + V_{TTF} + V_{TFT} + V_{TFF}$. In order to compute $V_{TFF}$, the other three totals need to be subtracted from $V_{T\ast\ast}$.

4.3 Experimental Evaluation

In this section, I experimentally evaluate the efficiency of the covariance computation algorithm. The efficiency is evaluated in terms of running time. It is important to know how large the time required to compute the whole covariance matrix is, compared to the time required to execute the query over the entire database. For this purpose, I am interested in the following three times:

1. The time to complete the query over the entire database.
2. The time to complete the query over a sample set in order to produce an approximate answer.
3. The time to compute the covariance matrix.

If the algorithm is efficient, I would expect the sum of the second and third times to be much smaller than the first time. Ideally, I would also expect that the third time is not much larger than the second one. This would indicate that the cost of computing the covariance matrix is comparable to the cost of estimating the query answer using samples. However, even if the third time is larger than the second one, as long as the sum of the last two times is not as large as the first time, using sampling is still beneficial.

4.3.1 Experimental setup

a. Query and Schema Tested. In the proposed experiments, I use the following SQL statement:

SELECT SUM(o.Profit)
FROM Order o, Employee e, Branch b
WHERE o.EmpID = e.EmpID AND e.OfficeID = b.OfficeID
GROUP BY b.BranchID

The three relations used in the query are:


Order(EmpID, Profit, Other)
Employee(EmpID, OfficeID, Other)
Branch(BranchID, OfficeID, Other)

The Other field in each table refers to other possible attributes that could appear. In the proposed experiments, this field is occupied by a random string so that the total size of each record is 100 bytes. I set up the query so that the join between Order and Employee is a primary-key to foreign-key join, and the join between Employee and Branch is a many-to-many join.

b. Parameters Considered. I design this set of experiments to test how different parameters may affect the three times described above. The parameters I consider are the database size, the skewness of the attributes in the join operation, the number of groups in the query, and the sampling ratio.

The database size determines the size of each relation. For simplicity, I treat the size of the largest relation, Order, as being approximately the same as the database size. The sizes of relations Employee and Branch are 0.1 and 0.01 times as large as the size of relation Order. 10 GB, 1 GB and 0.1 GB databases are tested. The default database size is 10 GB. Given the fact that the size of each record is 100 bytes, the default number of records in Order, Employee, and Branch in a 10 GB database is 100 million, 10 million and 1 million, respectively.

The skewness of the join attributes affects the join in two ways: first, a join operation takes more time if the join attributes are skewed, and second, the correlations between different groups increase as the skewness increases. I use a Zipf coefficient to control the skewness of the three attributes Employee.OfficeID, Branch.OfficeID and Order.EmpID. The default Zipf coefficient is 0.

The number of groups defines the number of distinct values for the attribute Branch.BranchID, which appears in the GROUP BY clause of the query. This parameter does not affect the first two times much because the grouping operation is near the end


of query execution. However, increasing the number of groups increases the number of concurrent estimators, and therefore increases the size of the covariance matrix. Thus, the time to compute the covariance matrix is affected. The default number of groups is 100.

The sample ratio determines the number of samples obtained from the database. Given the fact that all other parameters are fixed, increasing the sample ratio increases the sample size. Thus, both the second and third times described above should be affected. The samples are taken so that if a p% sample from Order is obtained, a 5p% sample from Employee and a 20p% sample from Branch will be obtained, respectively. By default, a 1% sample from Order is obtained. Sampling without replacement is used in each experiment.

c. Data Generation. Given a set of parameter values, I generate the data set relation by relation. First, records in Order are generated as follows:

1. A Profit is generated uniformly at random between 1 and 100.
2. An EmpID is produced by a Zipf distribution with a domain size 0.1 times as large as the number of records in Order, and a Zipf coefficient specified by the skewness of the join attributes. Because the total number of records in Employee is 0.1 times as large as the total number of records in Order, EmpID is generated from this domain to guarantee that the primary-key to foreign-key join between Order and Employee over EmpID is preserved.
3. A random string is generated to make the size of the record 100 bytes.

Second, records of relation Employee are generated. For each record, the following steps are performed:

1. A unique EmpID is assigned, ranging from 1 to 0.1 times the number of records in Order. EmpID is then the primary key of Employee, and this guarantees that the domains of EmpID in Employee and Order are the same, which preserves the correctness of the primary-key to foreign-key join.
2. An OfficeID is produced by a Zipf distribution with a domain size 0.01 times as large as the number of records in Order, and a Zipf coefficient specified by the skewness of the join attributes. OfficeID is generated from this domain for both Employee and Branch to preserve the correctness of the many-to-many join between these two relations over OfficeID.


3. A random string is generated to make the size of the record 100 bytes.

Third, records of relation Branch are generated. For each record, the following steps are performed:

1. An OfficeID is produced by a Zipf distribution with a domain size 0.01 times as large as the number of records in Order, and a Zipf coefficient specified by the skewness of the join attributes. This preserves a many-to-many join between Employee and Branch over OfficeID. Furthermore, this domain is as large as the total number of records in Branch.
2. A BranchID is selected uniformly at random from the values 1 to the total number of groups.
3. A random string is generated to make the size of the record 100 bytes.

d. Hardware and Software Used. All the experiments are run on a DELL desktop with an Intel dual-core 1.8 GHz CPU and 2 GB of memory, running OpenSUSE Linux 10.1 (http://www.opensuse.org). Postgres 8.1 (http://www.postgresql.org) is used as the back-end DBMS to store all the data and perform queries (both queries over the whole database and queries over the samples). The code is written in C++ and compiled using gcc 4.1.0. Libpqxx 2.6.7 (http://pqxx.org) is used to connect to PostgreSQL from C++. GSL 1.9 (http://www.gnu.org/software/gsl) is used for scientific computation and random number generation.

e. Experimental Procedure. I prepare four different sets of databases and samples for the experiments. Each time, I fix three parameters using the default values and vary the fourth. In the first set of experiments, I vary the database size to generate 100 MB, 1 GB, and 10 GB databases. In the second set of experiments, I vary the skew of the data sets using Zipf coefficients 0, 0.3, and 0.6. In the third set of experiments, I generate data sets with 100,


1000 and 10000 groups. In the last set of experiments, I generate a database and obtain 3 different samples using sample ratios 0.5%, 1%, and 2% for Order. The two other relations follow the rule described above. For each database and sample generated, I then run the experiments five times to compute the covariance matrix, and report the average time of the five runs.

In each experiment, the time required to compute the covariance matrix is the execution time for the hash-based algorithm after the join over the sample has been computed and is sitting in main memory. The samples used in each experiment were pre-computed and stored within the database. Since the samples were relatively small with respect to the available main memory, it is reasonable to assume that they were buffered entirely within main memory by the database system and did not need to be read from disk.

4.3.2 Results

The results are shown in Table 4-3. Running times are given in seconds. Numbers are rounded to the nearest second. When the number of groups is 10000, the computation does not finish in a reasonable time, resulting in an N/A in the table. This is because operating on a 10000 by 10000 covariance matrix is not possible using the given hardware.

4.3.3 Discussion

I observe that as long as the number of groups is 100, computing the covariance matrix is very fast compared to evaluating the query over the entire database. When the number of groups is 1000, the covariance computation time is about 18% as long as the running time of executing the query over the entire database, which is still feasible but not as attractive as the 100-group case. When the number of groups is 10000, computing the entire covariance matrix is no longer feasible. Overall, I notice that in many cases, computing the entire covariance matrix using the proposed hash-based algorithm is effective.


Running time under different database sizes
                   100 MB DB   1 GB DB   10 GB DB
Query time         8           85        1053
Estimate time      1           2         12
Covariance time    7           8         18

Running time under different skewness
                   skew 0      skew 0.3  skew 0.6
Query time         1053        1604      29164
Estimate time      12          16        27
Covariance time    18          22        371

Running time under different numbers of groups
                   100 groups  1K groups 10K groups
Query time         1053        1057      1086
Estimate time      12          15        16
Covariance time    18          172       N/A

Running time under different sample ratios
                   0.5%        1%        2%
Query time         1053        1053      1053
Estimate time      8           12        32
Covariance time    13          18        22

Table 4-3. Running time (in seconds) for covariance computation under different parameter values

4.4 Conclusion

This chapter has considered the covariance computation in depth. I have formally derived the covariance formula for a pair of group-wise estimators. I have also proposed an efficient algorithm. Finally, I have experimentally evaluated the efficiency of the algorithm. The work in this chapter serves as the basis for providing simultaneous confidence bounds and returning all top-k groups with high probability for GROUP BY queries when only database samples are available.
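The hash-based computation of Section 4.2 can be sketched in a few lines for the two-relation case. This is a minimal illustration, not the C++ implementation used in the experiments: the function names and the `(rid_p, rid_l, value)` row layout are assumptions, and Python dictionaries stand in for the hash joins on the RIDs.

```python
from collections import defaultdict


def grouped_sums(rows, key):
    """Hash the rows on `key` and sum the values that share a key."""
    sums = defaultdict(int)
    for row in rows:
        sums[key(row)] += row[2]
    return sums


def matched_product(a, b):
    """Sum of a[k] * b[k] over the keys present in both hash tables."""
    return sum(value * b[k] for k, value in a.items() if k in b)


def hashed_statistics(s1, s2):
    """V_TT, V_TF, V_FT, V_FF using one pass per hash table."""
    def by_both(r):
        return (r[0], r[1])

    def by_part(r):
        return r[0]

    def by_lineitem(r):
        return r[1]

    v_tt = matched_product(grouped_sums(s1, by_both), grouped_sums(s2, by_both))
    v_t_any = matched_product(grouped_sums(s1, by_part), grouped_sums(s2, by_part))
    v_any_t = matched_product(grouped_sums(s1, by_lineitem),
                              grouped_sums(s2, by_lineitem))
    v_all = sum(r[2] for r in s1) * sum(r[2] for r in s2)
    v_tf = v_t_any - v_tt                # same PART RID, different LINEITEM RID
    v_ft = v_any_t - v_tt                # different PART RID, same LINEITEM RID
    v_ff = v_all - v_tt - v_tf - v_ft    # both RIDs differ
    return {"TT": v_tt, "TF": v_tf, "FT": v_ft, "FF": v_ff}


# Rows of Table 4-1 (query 1) and Table 4-2 (query 2).
S1 = [(1, 1, 2), (1, 3, 5), (3, 2, 4), (2, 1, 3)]
S2 = [(3, 1, 3), (1, 2, 2), (2, 3, 3), (3, 2, 1)]
print(hashed_statistics(S1, S2))
```

On the rows of Tables 4-1 and 4-2 the intermediate values are 4, 39, 42 and 126, matching the worked example in Section 4.2.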


CHAPTER 5
PROVIDING CORRELATION-AWARE SIMULTANEOUS CONFIDENCE BOUNDS

This chapter discusses how to provide simultaneous confidence intervals for GROUP BY queries. A key issue in this chapter is how these simultaneous confidence bounds can be provided efficiently and in a principled fashion. In order to do this, this chapter defines an abstract problem called the Sum-Inference (SI) problem, which may be used to provide simultaneous confidence bounds in a principled fashion. It turns out that solving the SI problem exactly is NP-hard. Thus, the major difficulty in this chapter is to solve the SI problem efficiently. I propose three different approximate methods to solve the SI problem.

The remainder of this chapter is organized as follows. Section 5.1 formally defines the SI problem and shows how the SI problem may be used to provide simultaneous confidence bounds. Section 5.3 proposes three approximate algorithms for solving the SI problem. The efficiency and correctness of the proposed algorithms are experimentally evaluated in Section 5.4. Section 5.5 concludes this chapter.

5.1 The SI Problem and GROUP BY Queries

In order to provide simultaneous confidence bounds, the first step is to determine what information or statistics should be communicated to a user in the case of an approximate GROUP BY answer so that he or she is fully aware of the accuracy of the approximation. Thus this information is first discussed in this section. Then the abstract problem that must be solved in order to compute this information, which is called the SI problem, is formally defined. The section concludes by describing how a solution to the SI problem can be used to compute "safe" GROUP BY bounds.

5.1.1 GROUP BY and Multiple Inference

A GROUP BY query may be viewed as a large set of individual queries asked simultaneously, one for each group. Unfortunately, a traditional confidence bound such as:

"With probability $p$, the answer to group $i$ is within the range $l_i$ to $h_i$."


has only a very narrow correct interpretation in this situation. As soon as a user looks at the results for a second group $j$, statistically speaking he or she needs to "forget" ever having seen the bounds for group $i$ and consider group $j$ in complete isolation! When two or more groups are considered together or compared with one another, the accuracy parameter $p$ is meaningless, and hence dangerous.

Rather than viewing confidence bounds individually, I propose to extend the traditional guarantee and instead present a guarantee taking the form:

"With probability $p$, at least $k$ of $n$ groups are within the ranges $l_1$ to $h_1$, $l_2$ to $h_2$, ..., and $l_n$ to $h_n$, respectively."

In the case where $n = k$, these new bounds are exactly equivalent to the classical method from statistics for dealing with simultaneous estimates: the bounds are altered so that with high probability, each one of the estimates is within the specified ranges [82]. However, statisticians have begun to acknowledge that this is overly restrictive [13]. For example, the classic method for controlling simultaneous error is to assume that the error is additive (this is known as the Bonferroni correction or the union bound). In order to provide 100 bounds that are all correct with probability 0.99, each one would have to be correct with probability 0.9999. This can result in needlessly wide confidence bounds.

The proposed generalization of the bound to include the parameter $k$ allows the user to control the extent to which group-wise correlations can affect the number of incorrect estimates. No longer is a bound derived by computing the probability that it is incorrect; rather, it is derived by computing the probability that a set of bounds is simultaneously incorrect. In order to make this a bit clearer, Figure 5-1 gives a primitive user interface that could be used to present such bounds to the user for a GROUP BY query. The scroll box at the right of the figure gives a confidence bound for each of the 100 groups that are found in the query result. Using the interface, the user chooses $p$ and $k$, and the bounds are computed accordingly. In addition, the user is supplied with a plot that shows how likely it is that $n - k$ or more of the bounds computed are wrong. For example, Figure 5-1 shows


that there is a very small but non-zero chance that at least 30 of the 100 bounds given are incorrect, and there is only about a 0.003 chance that at least 20 of the 100 bounds are incorrect. Since these are relatively small values, the user can be sure that in this case the chance of a catastrophic error in the query result is small.

Figure 5-1. A GUI for a GROUP BY query.

5.1.2 The SI Problem

The issue at the heart of this thesis is how such simultaneous bounds can be computed in an efficient, principled fashion. This subsection describes the Sum-Inference problem, or SI problem for short, which can be used to quantify the extent to which correlation among statistical estimators for database queries can cause many estimates derived from the same synopsis to be incorrect. This problem will form the basis of the computation of safe GROUP BY bounds. The main reason for formulating the abstract SI problem instead of directly introducing and analyzing estimates for GROUP BY queries


is the fact that there is the potential that the SI problem can be applied to other types of queries. A side benefit is that the analysis of GROUP BY queries can be dissociated from the particular method used to estimate the total for each individual group.

The SI problem models the situation where there are $n$ random variables. The random variable $N_i$ corresponds to a possible answer to the $i$th query (or group). It is assumed that $N_i$ is normally distributed (which is true asymptotically for sample-based SUM GROUP BY queries due to the Central Limit Theorem). Also given as input to the SI problem is a set of $n$ valid ranges, with one range for each random variable; each range corresponds to a confidence bound on the answer to each query. The $i$th range is defined by the bounds $l_i$ and $h_i$. If the $i$th range or bound is wrong (that is, if $N_i$ happens to fall outside of the range from $l_i$ to $h_i$), then a penalty or score $s_i$ is incurred. If the purpose is to simply count the number of incorrect ranges, then $s_i = 1$; in general, $s_i$ may take a different value. The following random variable is then defined:

\[ \Psi = \sum_{i=1}^{n} I\big(N_i \notin [l_i, h_i]\big)\, s_i \tag{5-1} \]

In this expression, $I$ is a function that returns 1 if its boolean argument evaluates to true and 0 if it is false. If a distribution function $F$ is defined such that $F[k]$ gives the probability that the above expression evaluates to $k$, then given $F$ one can check the probability that the total penalty (or total number of incorrect confidence bounds) meets or exceeds $k$, which can be used as a solid indicator of the accuracy of the given bounds. Formally, the SI problem is defined as follows:

The Sum-Inference (SI) Problem. Given:

- a multi-dimensional normal random variable $N = \langle N_1, N_2, \ldots, N_n \rangle$ with mean vector $\mu$ and covariance matrix $\Sigma$,
- a vector $s$ of $n$ scores,
- a vector $l$ of $n$ lower bounds, and


- a vector $h$ of $n$ upper bounds,

the SI problem is the problem of inferring the distribution function $F$, where $F[k] = P\big[\sum_i I(N_i \notin [l_i, h_i])\, s_i = k\big]$; i.e., $F[k]$ is the probability of observing a total penalty of $k$ due to incorrect intervals.

Unfortunately, it is easy to show that computing $F$ exactly is likely to be intractable:

Lemma 4. (Hardness of the SI problem.) If $F[k]$ can always be computed in polynomial time, then P = NP.

Proof. Given a set of $n$ integers $S = \{e_1, e_2, \ldots, e_n\}$ and a number $k$, the subset-sum problem asks whether there is a subset of $S$ such that the sum of its elements is $k$. This is known to be NP-complete. This problem is reduced to the SI problem. To solve an instance of the subset-sum problem using the SI problem, a random variable $N = \langle N_1, N_2, \ldots, N_n \rangle$ is constructed, where $N_1, N_2, \ldots, N_n$ are independent normally distributed random variables with mean 0 and variance 1. A vector of scores $s = [e_1, e_2, \ldots, e_n]^T$, such that each element of $S$ is in this score vector once and only once, is also constructed. Two vectors of lower bounds and upper bounds, $l$ and $h$, are also constructed, whose elements are the endpoints of 95% two-sided confidence intervals. The resulting SI problem is then solved. Since each confidence interval is correct with probability 95%, there is a 5% chance that each random variable is outside its interval; thus there is a non-zero probability that any given subset of incorrect predictions is obtained. This means that all subsets are possible with greater than zero probability, and so determining whether $F[k] \neq 0$ is actually a question of whether a subset with sum $k$ exists. Thus, the subset-sum problem has a solution if and only if $F[k] \neq 0$ for the resulting SI problem.

Subsequent sections of this chapter will consider appropriate methods for computing approximate solutions to the SI problem.
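To make the link with subset-sum concrete, the following sketch computes $F$ exactly for the special case used in the proof, where the components of $N$ are independent and each interval is missed with a known probability. It is an illustration only: the function name and the `miss_prob` argument are hypothetical, and the enumeration visits all $2^n$ subsets of incorrect intervals, which is exactly the exponential cost that motivates the approximate methods.

```python
from itertools import product


def exact_penalty_distribution(scores, miss_prob):
    """Exact F for independent components (illustrative special case).

    scores[i] is the penalty s_i; miss_prob[i] is the probability that
    N_i falls outside [l_i, h_i]. Returns a dict mapping each reachable
    total penalty k to F[k].
    """
    dist = {}
    for wrong in product((0, 1), repeat=len(scores)):
        prob = 1.0
        penalty = 0
        for is_wrong, s, q in zip(wrong, scores, miss_prob):
            prob *= q if is_wrong else 1.0 - q
            penalty += s * is_wrong
        dist[penalty] = dist.get(penalty, 0.0) + prob
    return dist


# Subset-sum instance {3, 5, 9} encoded as scores, with 95% intervals.
F = exact_penalty_distribution([3, 5, 9], [0.05, 0.05, 0.05])
print(8 in F, 7 in F)  # 8 = 3 + 5 is reachable, 7 is not
```

With every miss probability strictly between 0 and 1, the keys of the returned dictionary are exactly the achievable subset sums, which is the observation the reduction relies on.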


5.2 Applying the Sum Inference Problem to GROUP BY Queries

In this section, a simple algorithm is proposed that makes use of the SI problem to compute the information necessary to display the user interface described previously. As described, the user supplies parameters $p$ and $k$ such that with probability $p$, at least $k$ of $n$ groups are within the intervals or bounds that are computed; intervals are then chosen automatically to ensure the desired accuracy. The algorithm makes the assumption that the selected intervals for every group should have the same accuracy. That is, it is assumed that the ratio of the half-width of the $i$th interval to the standard deviation of the $i$th estimate is always $\gamma$. This is reasonable since always using the same ratio gives all groups the same priority; without this restriction the number of acceptable confidence intervals is infinite. A binary search is then performed on the possible $\gamma$ values. At each iteration, an SI problem is constructed that can be used to check if a given $\gamma$ produces an actual $p$ that is too high or too low for the user's chosen $k$ value. The algorithm is given below.

Function GetIntvls($k$, $p$, $n$, $\mu$, $\Sigma$): [$l$, $h$]
1. Build a score vector $s$ where $s_i = 1$ for all $i$
2. Choose an initial upper bound $u$ for $\gamma$
3. Set $l = 0$
4. Set $\gamma = \frac{u + l}{2}$
5. Set $l_i = \mu_i - \gamma\, \Sigma^{1/2}_{i,i}$
6. Set $h_i = \mu_i + \gamma\, \Sigma^{1/2}_{i,i}$
7. Give $\mu$, $\Sigma$, $s$, the vector of $l_i$ values and the vector of $h_i$ values as input to an SI problem
8. If $\big| p - \sum_{j=0}^{n-k} F[j] \big| \le \epsilon$, goto 11
9. If $\sum_{j=0}^{n-k} F[j] - p > \epsilon$ then $u = \gamma$; goto 4
10. If $p - \sum_{j=0}^{n-k} F[j] > \epsilon$ then $l = \gamma$; goto 4
11. Return the $l_i$ and $h_i$ values as the final confidence intervals

In steps 9 and 10, $\sum_{j=0}^{n-k} F[j]$ is the probability that at least $k$ of the $n$ intervals are correct, which grows as $\gamma$ (and hence every interval) grows; $\gamma$ is therefore lowered when this probability exceeds $p$ and raised when it falls short of $p$.

There are a few considerations regarding the algorithm that warrant additional discussion. First, the procedure described above requires that a different instance of the SI problem be solved at each iteration of the algorithm. With a binary search algorithm of this type, 30 or so iterations of the algorithm may be required. This may be a concern, since the SI problem can be expensive to solve. Fortunately, it turns out that in practice,


almost all of the computations required to solve the SI problem can be re-used across iterations, rendering the cost of subsequent SI solutions negligible compared to the cost of the first one. This is due to the fact that the underlying random variables $N_1, N_2, \ldots, N_n$ do not change across iterations; only the bounds $l_i$ and $h_i$ change.

Second, the algorithm as presented leaves open the question of how to choose an initial upper bound $u$. In the actual implementation $u = 0.1$ is chosen and, until the goto in line 9 is used, every time that the goto in line 10 is used $u$ is doubled rather than setting $l = \gamma$. $u = 0.1$ is chosen because it is close to the error that a user might find acceptable in the final answer to the query; this tends to reduce the number of iterations of the binary search required. However, the initial choice of $u$ is not critical, due to the exponentially fast convergence of the binary search and the fact that additional SI solutions past the first one tend to have little cost.

5.3 Solving the SI Problem

Given the intractability of the SI problem, developing computationally feasible methods for solving it (or approximately solving it) in a database environment is mandatory if it is to be a practical abstraction. In this section, three different methods for solving the SI problem are proposed:

1. A Monte Carlo solution that samples directly from $F$ in order to learn the distribution;
2. A second method that makes use of moment analysis and an appropriate parametric model for $F$ in order to approximate it; and,
3. A related method that instead performs approximate moment analysis and is more computationally efficient for very large instances of the SI problem.

Each of the solutions assumes that the mean $\mu$ and covariance matrix $\Sigma$ are provided as input to the SI problem and can be accessed.

5.3.1 A Solution Using Monte Carlo Re-sampling

The first solution approximates $F$ via Monte Carlo sampling. The algorithm repeatedly and directly samples from the multivariate normal (or Gaussian) distribution


defined by $\mu$ and $\Sigma$. For each sample from the multivariate normal, the total penalty obtained due to incorrect intervals is computed. $F[k]$ is then estimated by computing the fraction of the time that the observed penalty is exactly $k$. For example, if 1000 samples are taken and a penalty of 13 was incurred 55 times, $\widetilde{F}[13] = 0.055$, where $\widetilde{F}$ is the estimate for $F$. The observed penalty for each trial or sample is computed by totaling the penalty over each of the $n$ variables $N_1, \ldots, N_n$. For a given trial, if $N_i$ happens to be outside of the range defined by $l_i$ and $h_i$, then a penalty of $s_i$ is included in the total for that trial.

While this method is fairly simple, there are a few technical questions that need to be considered. First, it is required to be able to sample from the Gaussian distribution defined by $\mu$ and $\Sigma$. It turns out that there are well-known methods for this, which involve first sampling a number of standard normal random variables, using those to form a random vector, and then rotating and translating the vector using $\mu$, $\Sigma$ and standard methods from linear algebra ([31], Section 4).

A second key question that must be answered is: how many Monte Carlo samples are required in order to obtain a sufficiently accurate approximation for $F$? In order to obtain a guideline for this number, the multivariate Gaussian distribution is sampled $N$ times, where $N$ is chosen so that the expected Euclidean distance from $\widetilde{F}$ to $F$ is less than $\epsilon$ for some user-supplied $\epsilon$. In other words, enough samples are taken so that:

\[ E\Big[\sum_{v} \big(\widetilde{F}[v] - F[v]\big)^2\Big] < \epsilon^2 \tag{5-2} \]

(Footnote 1: In the remainder of this section $N$ refers to the number of Monte Carlo trials. I choose not to subscript $N$ to clarify this (using $N_{MC}$, for example) in order to simplify the presentation of the formulas. As in the previous section, $N_i$ still refers to the $i$th component of the multivariate normal that makes up the SI problem.)

Computing the number of samples required so that this inequality holds requires a bit of mathematics. It is assumed that $\sum_k s_k$ is $\mathit{max}$ (that is, the total possible penalty


for any given trial is $\mathit{max}$) and, without loss of generality, it is assumed that there are no negative penalties (this simplifies the exposition by allowing summations over all possible penalty values to start at 0). Let $F[v]$ evaluated at $v = 0, 1, \ldots, \mathit{max}$ be $p_0, p_1, \ldots, p_{\mathit{max}}$. That is, the set of values $p_0, p_1, \ldots, p_{\mathit{max}}$ answers the SI problem correctly. Then, random variables $X_{1,0}, \ldots, X_{N,\mathit{max}}$ are defined, where $X_{u,v}$ indicates whether or not the total penalty incurred in the $u$th trial was exactly $v$. That is, $X_{u,v}$ is one if $v = \sum_i s_i\, I(N_i \notin [l_i, h_i])$ for the $u$th of the $N$ multivariate Gaussian samples, and $X_{u,v}$ is zero otherwise. Note that $E[X_{u,v}] = p_v$ for $u = 1, \ldots, N$. That is, in expectation, $X_{u,v}$ is exactly the desired value $p_v$. Thus, the following holds:

\begin{align*}
E\Big[\sum_{v=0}^{\mathit{max}} \big(\widetilde{F}[v] - F[v]\big)^2\Big]
&= \sum_{v=0}^{\mathit{max}} E\Big[\Big(\frac{\sum_{u=1}^{N} X_{u,v}}{N} - p_v\Big)^2\Big] \\
&= \sum_{v=0}^{\mathit{max}} E\Big[p_v^2 - 2 p_v \frac{\sum_{u=1}^{N} X_{u,v}}{N} + \Big(\frac{\sum_{u=1}^{N} X_{u,v}}{N}\Big)^2\Big] \\
&= \sum_{v=0}^{\mathit{max}} \Big(p_v^2 - 2 p_v^2 + \frac{1}{N^2} E\Big[\sum_{u=1}^{N} X_{u,v}^2 + \sum_{u=1}^{N} \sum_{w=1,\, w \neq u}^{N} X_{u,v} X_{w,v}\Big]\Big) \\
&= \sum_{v=0}^{\mathit{max}} \Big(-p_v^2 + \frac{1}{N^2} \big(N p_v + N(N-1) p_v^2\big)\Big) \\
&= \sum_{v=0}^{\mathit{max}} \frac{p_v - p_v^2}{N} = \frac{1}{N} \Big(1 - \sum_{v=0}^{\mathit{max}} p_v^2\Big) \le \epsilon^2 \tag{5-3}
\end{align*}

Therefore if:

\[ N \ge \frac{1 - \sum_{v=0}^{\mathit{max}} p_v^2}{\epsilon^2} \tag{5-4} \]

then the expected Euclidean distance between $F$ and $\widetilde{F}$ is at most $\epsilon$. Note that this is guaranteed to be the case if $N \ge \frac{1}{\epsilon^2}$, because $1 - \sum_{v=0}^{\mathit{max}} p_v^2$ is bounded by 1. As a result, 10000 samples are enough to guarantee an expected Euclidean distance of no greater than 0.01. These methods will be considered experimentally later on in the thesis.
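The Monte Carlo procedure and the sample-size guideline can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the GSL-based implementation used in the thesis: the function names are hypothetical, a Cholesky factor of $\Sigma$ is used as one standard way to turn independent standard normals into a draw from the multivariate normal, and $\Sigma$ is assumed to be positive definite.

```python
import math
import random


def cholesky(cov):
    """Lower-triangular factor low with low * low^T = cov (cov positive definite)."""
    n = len(cov)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = cov[i][j] - sum(low[i][m] * low[j][m] for m in range(j))
            low[i][j] = math.sqrt(acc) if i == j else acc / low[j][j]
    return low


def required_trials(eps):
    """Guideline N >= 1 / eps^2 for an expected Euclidean distance of at most eps."""
    return math.ceil(1.0 / eps ** 2)


def monte_carlo_si(mean, cov, scores, lower, upper, trials, seed=0):
    """Estimate F[k] as the fraction of trials whose total penalty equals k."""
    rng = random.Random(seed)
    low = cholesky(cov)
    n = len(mean)
    counts = {}
    for _ in range(trials):
        z = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Transform standard normals into one draw from N(mean, cov).
        x = [mean[i] + sum(low[i][m] * z[m] for m in range(i + 1))
             for i in range(n)]
        penalty = sum(s for xi, s, lo, hi in zip(x, scores, lower, upper)
                      if not lo <= xi <= hi)
        counts[penalty] = counts.get(penalty, 0) + 1
    return {k: c / trials for k, c in counts.items()}


# Three independent unit-variance groups with 95% intervals and unit scores.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
estimate = monte_carlo_si([0.0] * 3, identity, [1, 1, 1],
                          [-1.959964] * 3, [1.959964] * 3,
                          trials=required_trials(0.01))
```

For this independent example the true value of $F[0]$ is $0.95^3 \approx 0.857$, so the estimate for penalty 0 should land close to that figure. The fixed seed is only there to make runs repeatable.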


Reusing Monte Carlo SI Computations for GROUP BY queries

When the SI problem is applied to computing bounds for a GROUP BY query, the algorithm to solve the SI problem will be invoked repeatedly to perform a binary search. Fortunately, it is possible to re-use almost all of the computations from the first SI solution when computing subsequent solutions, because only the bounds $l_i$ and $h_i$ change across iterations, and all bound widths are controlled by the parameter $\gamma$ (see Section 3.3 for details). This means that subsequent SI solutions are almost "for free", so that the actual multivariate sampling only needs to be performed the very first time that the SI problem is solved.

To reuse computations, all of the $N$ Monte Carlo trials are used to build a list of $(\hat{\gamma}, \mathit{trialno}, i)$ triples. Each triple means that for the specified trial, a $\gamma$ value smaller than $\hat{\gamma}$ will cause $N_i$ to fall outside of the resulting range $l_i$ to $h_i$, since the range widens as $\gamma$ grows. These triples are constructed for every $i$, and then all of them are sorted in decreasing order of $\hat{\gamma}$ during the first SI solution. Subsequently, for any given $\gamma$, it becomes very efficient to compute the corresponding $\widetilde{F}[k]$. It suffices to scan this sorted array from front to back, continuing as long as $\hat{\gamma}$ exceeds $\gamma$. For every value of $\mathit{trialno}$, the number of triples found is counted. Let $\mathit{cnt}$ be the number of $\mathit{trialno}$ values for which the number of triples found is $k$. Then $\mathit{cnt}/N$ gives $\widetilde{F}[k]$ for the given value of $\gamma$. This process can be made even faster by pre-computing summary statistics and storing them in the array, so that the entire leading portion of the array need not be re-scanned for each $\gamma$ and $k$ pair that is queried.

5.3.2 Solving the SI Problem Using Moment Analysis

Because it is distribution-free, Monte Carlo may be preferred, especially if the computational resources required are minimal. However, for an SI problem with a very large problem size (that is, with a very large number of underlying groups or variables), the Monte Carlo method may not be practical because it requires a large number of samples from the underlying Gaussian. Since the generative model may have arbitrary

pairwise correlations, each Gaussian sample requires multiplying a length-$n$ vector by an $n \times n$ matrix, which will be expensive if $n$ is very large.

Thus, the second method for solving the SI problem is quite different. It relies on calculating exactly the mean and variance (the first moment and the second central moment) of the distribution function $F$, and then choosing an appropriate parametric distribution to approximate $F$. Let $\rho$ be a random variable whose distribution is precisely $F$. Then the two moments of $F$ that are needed are the mean $E[\rho]$ and the variance $E[\rho^2] - E^2[\rho]$ of $\rho$.

The justification for resorting to such a parametric method is straightforward. Note that the valid domain of $F$ (or range of $\rho$) is known and easily computed. That is, in linear time we can easily upper-bound and lower-bound the values of $i$ for which $F[i]$ can possibly have non-zero probability, by computing the smallest and largest sums that are possible over subsets of the $s_i$ values input into the SI problem: the smallest possible sum is $\min\{0, \sum_{s_i < 0} s_i\}$, and the largest possible sum is $\max\{0, \sum_{s_i > 0} s_i\}$, where each $s_i$ is the penalty value associated with the $i$th group. Given an upper bound, a lower bound, and a mean and variance, the possible shapes that a reasonably well-behaved distribution can take are severely constrained, and so choosing a parametric distribution that accepts these four parameters as a surrogate for $F$ should not incur much inaccuracy [footnote 2]. As shown experimentally, the accuracy of this method is quite remarkable when it is used in conjunction with the Beta distribution [18], which accepts exactly this set of four parameters. The Beta distribution can take a very wide variety of shapes (see Figure 5-2 at left). To give the reader an idea of how good the Beta distribution approximation is for the application, Figure 5-2 (right) shows the empirical distribution of $F[k]$ in a typical scenario, obtained using 10000 Monte Carlo samples, together with the Beta approximation that the proposed method produces. The two distributions are virtually

Footnote 2: As evidence of this, the well-known and widely-used Chebyshev's inequality states exactly how much just two parameters (mean and variance) constrain a distribution.

indistinguishable. The quality of the Beta distribution for use as a solution to the SI problem will be considered experimentally in depth in Section 5.4.

Figure 5-2. Shapes of the Beta distribution (left), the empirical estimate of the pdf of F[k] for 100 groups based on 10000 Monte Carlo samples (top right), and the approximation of the pdf of F[k] using Beta(.123, 66.6907) (bottom right).

While such parametric methods may be quite common in statistics and statistical machine learning, there may be skepticism towards such methods on the part of data management practitioners. Guidelines on where to use such parametric distributions can be found in the statistics literature [74]. If there is a concern regarding the accuracy, a principled approach to checking the accuracy of such an approximation would be to compute or estimate the third moment and perhaps the fourth moment using techniques similar to those that will be presented shortly. These moments are the so-called skew and kurtosis of $F$, respectively. If both the skew and the kurtosis also match the parametric model chosen, then with six matching parameters the possible distributions are so constrained that no reasonable statistician would ever question the applicability of the

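model! However, as shown experimentally in the subsequent section, excellent results can be obtained using the four-parameter method alone.

Of course, all of this requires that it is possible to perform an appropriate analysis on the variable $\rho$. Computing $E[\rho]$ given the input to the SI problem is a relatively easy matter:

$$E[\rho] = \sum_i \int_{\mathbb{R} \setminus [l_i, h_i]} p_i(x)\, s_i\, dx \tag{5-5}$$

where $p_i(x)$ is the probability density function of the estimate associated with the $i$th estimator, which is normal with mean $\mu[i]$ and variance $\Sigma[i,i]$. For each $i$, this integral simply computes the probability that a trial over $N_i$ results in a value that incurs a penalty, and multiplies that probability by the incurred penalty, in order to arrive at the expected penalty for the $i$th estimate.

The next step is to calculate $E[\rho^2]$. This is the expected squared penalty over all $i$:

$$\begin{aligned}
E[\rho^2] &= E\left[\left(\sum_i I(N_i \notin [l_i, h_i])\, s_i\right)^2\right] \\
&= E\left[\left(\sum_i I(N_i \notin [l_i, h_i])\, s_i\right)\left(\sum_j I(N_j \notin [l_j, h_j])\, s_j\right)\right] \\
&= E\left[\sum_i \left(I(N_i \notin [l_i, h_i])\, s_i\right)^2 + \sum_i \sum_{j \ne i} I(N_i \notin [l_i, h_i])\, s_i\, I(N_j \notin [l_j, h_j])\, s_j\right] \\
&= \sum_i \int_{\mathbb{R} \setminus [l_i, h_i]} p_i(x)\, s_i^2\, dx + \sum_i \sum_{j \ne i} \int_{\mathbb{R} \setminus [l_i, h_i]} \int_{\mathbb{R} \setminus [l_j, h_j]} p_{i,j}(x, y)\, s_i s_j\, dx\, dy
\end{aligned} \tag{5-6}$$

This equation has simply broken the expected squared penalty into two sums. The first sum considers the case where the penalty for a single $N_i$ is squared, and the second sum considers the case where the penalties for two different $N_i$ variables are multiplied. Because the indicator takes only the values 0 and 1, squaring it changes nothing, so each squared term has expectation $s_i^2$ times the probability that $N_i$ falls outside $[l_i, h_i]$. In the equation, $p_i(x)$ is again the probability density function of $N_i$, and $p_{i,j}(x, y)$ is the probability density function of the bivariate normal with:

$$\mu' = \left[\mu[i], \mu[j]\right]^T; \qquad \Sigma' = \begin{bmatrix} \Sigma[i,i] & \Sigma[i,j] \\ \Sigma[i,j] & \Sigma[j,j] \end{bmatrix} \tag{5-7}$$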
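The univariate pieces of this analysis need nothing more than the normal CDF. The Python sketch below is a minimal illustration under stated assumptions: the function names and the list-based inputs (`means`, `variances`, `bounds`, `penalties`) are hypothetical, and the cross terms of Equation 5-6, which require bivariate normal integrals, are deliberately left out.

```python
import math


def normal_cdf(x, mean, variance):
    """CDF of a univariate normal, computed with the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * variance)))


def prob_outside(mean, variance, lo, hi):
    """P(N_i not in [lo, hi]) for N_i ~ Normal(mean, variance)."""
    inside = normal_cdf(hi, mean, variance) - normal_cdf(lo, mean, variance)
    return 1.0 - inside


def expected_penalty(means, variances, bounds, penalties):
    """Equation 5-5: sum over i of s_i * P(N_i outside [l_i, h_i])."""
    return sum(
        s * prob_outside(m, v, lo, hi)
        for m, v, (lo, hi), s in zip(means, variances, bounds, penalties)
    )


def diagonal_second_moment(means, variances, bounds, penalties):
    """First sum of Equation 5-6: sum over i of s_i^2 * P(N_i outside [l_i, h_i])."""
    return sum(
        s * s * prob_outside(m, v, lo, hi)
        for m, v, (lo, hi), s in zip(means, variances, bounds, penalties)
    )


def penalty_range(penalties):
    """Smallest and largest achievable total penalty, i.e. the domain of F."""
    lowest = min(0, sum(s for s in penalties if s < 0))
    highest = max(0, sum(s for s in penalties if s > 0))
    return lowest, highest
```

As a sanity check, 100 groups with unit penalties and intervals of plus or minus 1.96 standard deviations give an expected penalty very close to 5, matching the intuition that about 5% of 95% intervals are wrong.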

In other words, $p_{i,j}(x, y)$ is the joint density function for $N_i$ and $N_j$. Evaluating this equation requires that it is possible to compute the required integral over $p_{i,j}(x, y)$. Fortunately, there is a good deal of existing work that considers how to compute an integral over a bivariate normal. If Monte Carlo integration is used [88], then it is easy to reuse the SI solution for the first iteration of the GROUP BY computation of Section 3.3 in subsequent iterations (see below).

Applying the Beta Distribution

After calculating the first two moments, an appropriate parametric distribution can then be used to approximately calculate any $F[k]$. As mentioned previously, in the actual implementation the four-parameter Beta distribution is used to serve as a surrogate for $F$. The four parameters for the Beta are the two shape parameters $\alpha$ and $\beta$, as well as the lower and upper bounds $min$ and $max$ for the range of the underlying random variable. If $min = 0$ and $max = 1$, straightforward calculations show that given $\mu$ and $\sigma^2$, $\alpha$ and $\beta$ should be chosen so that:

$$\alpha = \frac{\mu^2 - \mu^3 - \mu\sigma^2}{\sigma^2}, \qquad \beta = \frac{\mu - 2\mu^2 + \mu^3 - \sigma^2 + \mu\sigma^2}{\sigma^2}$$

Generally, $min = 0$ and $max = 1$ are not satisfied. However, a simple linear transformation can map the problem to a space in which $min = 0$ and $max = 1$. The mean and variance are transformed accordingly. Assuming the mean and variance in the original space are $\mu'$ and $\sigma'^2$, respectively, the following equations hold:

$$\mu = \frac{\mu' - min}{max - min}, \qquad \sigma^2 = \frac{\sigma'^2}{(max - min)^2}$$
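The moment-matching step above can be sketched in a few lines of Python. This is an illustration under assumptions, not the thesis implementation: the function names are hypothetical, the density is integrated with a plain midpoint rule (the standard library has no incomplete Beta function), and the interval passed to `beta_interval_mass` is whatever window the caller wants, for example a half-unit window around an integer penalty value.

```python
import math


def beta_shapes_from_moments(mean, variance, lo=0.0, hi=1.0):
    """Shape parameters of a Beta on [lo, hi] matching the given mean and variance."""
    width = hi - lo
    mu = (mean - lo) / width            # rescale the mean to [0, 1]
    var = variance / (width * width)    # rescale the variance accordingly
    if not (0.0 < mu < 1.0) or not (0.0 < var < mu * (1.0 - mu)):
        raise ValueError("moments are not attainable by a Beta distribution")
    alpha = (mu**2 - mu**3 - mu * var) / var
    beta = (mu - 2 * mu**2 + mu**3 - var + mu * var) / var
    return alpha, beta


def beta_interval_mass(alpha, beta, lo, hi, a, b, steps=2000):
    """Approximate probability that a Beta(alpha, beta) on [lo, hi] lies in [a, b]."""
    a, b = max(a, lo), min(b, hi)
    if a >= b:
        return 0.0
    width = hi - lo
    # log of the Beta function B(alpha, beta), the normalizing constant
    log_norm = math.lgamma(alpha) + math.lgamma(beta) - math.lgamma(alpha + beta)
    step = (b - a) / steps
    total = 0.0
    for i in range(steps):
        # Midpoints stay strictly inside (lo, hi), so both logarithms are finite.
        t = (a + (i + 0.5) * step - lo) / width
        total += math.exp(
            (alpha - 1.0) * math.log(t) + (beta - 1.0) * math.log(1.0 - t) - log_norm
        )
    return total * step / width
```

For instance, a mean of 5 and a variance of 5 on the range [0, 10] rescale to a mean of 0.5 and a variance of 0.05, which the formulas map to shape parameters of approximately (2, 2).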

This ensures that the mean and variance of the resulting distribution $\tilde{F}$ match $\mu$ and $\sigma^2$, respectively. Given this fit, the solution to the SI problem is provided by $\tilde{F}[k] = \int_{k-0.5}^{k+0.5} Beta(x)\, dx$.

Reusing Moment-Based SI Computations for GROUP BY queries

In the same way that the Monte Carlo solution for the first iteration of the GROUP BY algorithm can be reused, the moment-based solution can be re-used to solve multiple SI problems with the same underlying variables. Assuming that Monte Carlo integration is used for each integral over $p_{i,j}(x, y)$ and $p_i(x)$, a number of samples are taken from each $p_{i,j}(x, y)$ and $p_i(x)$, and the fraction of samples within each corresponding bound is computed to approximate the integral. Using methods very similar to those in the previous section for reusing the pure Monte Carlo solution, these samples can be used to compute an efficient mapping from every possible $\lambda$ to a given variance and mean for $\rho$.

For example, consider the mean of $\rho$. To handle this, a number of records of the form $(\lambda^{-}, i, c)$ are computed for each $i$, one corresponding to each sample from $p_i$. Such a record indicates that $N_i$'s contribution to the mean of $\rho$ for $\lambda > \lambda^{-}$ is at least $c$. The records are then sorted on the $\lambda^{-}$ values. To compute the mean of $\rho$ for a given $\lambda$, the records are scanned from front to back, for each $\lambda^{-}$ that does not exceed $\lambda$. For each $i$, the last observed $c$ is then used to compute the mean of $\rho$.

5.3.3 Solution Using Approximate Moment Analysis

While the previous solution may be faster than the Monte Carlo solution, it is still linear in the size of the covariance matrix $\Sigma$. In the case of a GROUP BY query, the number of entries in $\Sigma$ is quadratic in the number of groups. Since the number of groups may be many thousands in a realistic scenario, an algorithm that is sub-linear in the size of $\Sigma$ is highly desirable. For a very large SI problem, it is possible to increase the speed of the moment analysis method substantially by sampling from the covariance matrix.

Recall that the first two moments of the distribution resulting from the SI problem are given as Equations

5-5 and 5-6. In order to make use of the method of moments, the second uncorrected moment is generally not as important as the variance, which is related to the second moment via the relationship $\sigma^2 = E[\rho^2] - E^2[\rho]$. To write an expression for $\sigma^2$, let us define:

$$C = \frac{\sum_i \int_{\mathbb{R} \setminus [l_i, h_i]} p_i(x)\, s_i^2\, dx - \left(\sum_i \int_{\mathbb{R} \setminus [l_i, h_i]} p_i(x)\, s_i\, dx\right)^2}{\sum_i \sum_{j \ne i} 1} \tag{5-8}$$

Then, the following equation is obtained:

$$\sigma^2 = E[\rho^2] - E^2[\rho] = \sum_i \sum_{j \ne i} \left(\int_{\mathbb{R} \setminus [l_i, h_i]} \int_{\mathbb{R} \setminus [l_j, h_j]} p_{i,j}(x, y)\, s_i s_j\, dx\, dy + C\right) \tag{5-9}$$

The main reason to define $C$ is that calculating it requires only the diagonal entries of the covariance matrix, a cost that scales linearly with the number of estimators. Thus computing $C$ exactly is feasible, and the problem of calculating the variance reduces to the problem of computing the value of the double summation given above, which can be estimated effectively using sampling. Assuming the total number of estimators is $n$, the total number of non-diagonal entries in the double summation is $m = n(n-1)$. Let $S$ be a random sample without replacement from:

$$T = \{(i, j) \mid 1 \le i \le n,\ 1 \le j \le n,\ i \ne j\} \tag{5-10}$$

where $|S|$ is the sample size and $|T| = m$. The following formula then gives an unbiased estimator of the variance of $\rho$:

$$\tilde{\sigma}^2 = \frac{m}{|S|} \sum_{(i,j) \in S} \left(\int_{\mathbb{R} \setminus [l_i, h_i]} \int_{\mathbb{R} \setminus [l_j, h_j]} p_{i,j}(x, y)\, s_i s_j\, dx\, dy + C\right) \tag{5-11}$$

In order to determine how many samples from the covariance matrix are needed to accurately estimate $\sigma^2$, the formula for the variance of $\tilde{\sigma}^2$ can also be derived. Define:

$$a = \frac{m}{|S|}, \qquad b = \frac{m - 1}{|S| - 1}, \qquad h_{i,j} = \left(\int_{\mathbb{R} \setminus [l_i, h_i]} \int_{\mathbb{R} \setminus [l_j, h_j]} p_{i,j}(x, y)\, s_i s_j\, dx\, dy + C\right) \tag{5-12}$$

If sampling without replacement is used, the variance of $\tilde{\sigma}^2$ is:

$$\sigma^2(\tilde{\sigma}^2) = \left(\frac{a}{b} - 1\right) \sum_{(i,j) \in T} \sum_{(k,l) \in T} I((i,j) \ne (k,l))\, h_{i,j} h_{k,l} + (a - 1) \sum_{(i,j) \in T} h_{i,j}^2 \tag{5-13}$$

An unbiased estimator of this variance that is based only on the samples is:

$$\tilde{\sigma}^2(\tilde{\sigma}^2) = ab\left(\frac{a}{b} - 1\right) \sum_{(i,j) \in S} \sum_{(k,l) \in S} I((i,j) \ne (k,l))\, h_{i,j} h_{k,l} + a(a - 1) \sum_{(i,j) \in S} h_{i,j}^2 \tag{5-14}$$

The factors $ab$ and $a$ are the inverse probabilities that a given pair of entries, and a given single entry, respectively, appear in a sample drawn without replacement, which is what makes the estimator unbiased.

Given these formulas, it is easy to design an algorithm that repeatedly samples from the non-diagonal entries in the covariance matrix, and calculates the estimated value $\tilde{\sigma}^2$ based on the current sample set $S$. The algorithm then stops sampling when the standard deviation of this estimator is less than some percentage of the estimated value, computed using an appropriate bound such as the Central Limit Theorem or Chebyshev's inequality.

The resulting algorithm is shown in Figure 5-3. Function GetEstimatedVariance implements the algorithm. The first two lines initialize some variables. Lines 3 to 7 calculate the constant $C$ given in Equation 5-8. The Do...While loop in lines 8 to 17 continues sampling and updating the estimated variance using Equation 5-11 until the

stopping condition described above is satisfied. The function CalculateVariance calculates the estimated variance using Equation 5-14.

Function GetEstimatedVariance(T, n)
 1. Exp = 0; Vpart = 0; C = 0; m = n(n - 1)
 2. SampleSize = 0; Estimate = 0; Total = 0
 3. For i = 1 to n:
 4.     A = $\int_{\mathbb{R} \setminus [l_i, h_i]} p_i(x)\, s_i\, dx$
 5.     Exp = Exp + A
 6.     Vpart = Vpart + $s_i$ A
 7. C = (Vpart - Exp^2) / m
 8. Do:
 9.     Sample entry (i, j) from T
10.     $h_{i,j} = \int_{\mathbb{R} \setminus [l_i, h_i]} \int_{\mathbb{R} \setminus [l_j, h_j]} p_{i,j}(x, y)\, s_i s_j\, dx\, dy + C$
11.     SampleSet.add($h_{i,j}$)
12.     Total = Total + $h_{i,j}$
13.     SampleSize++
14.     Estimate = (m / SampleSize) Total
15.     Var = CalculateVariance(SampleSet, m)
16.     Std = sqrt(Var)
17. While (Std * 10 > Estimate)
18. Return Estimate

Function CalculateVariance(S, m)
 1. NonEqPart = 0; EqPart = 0
 2. a = m / |S|; b = (m - 1) / (|S| - 1)
 3. For any two different elements s, t in S:
 4.     NonEqPart = NonEqPart + s t
 5. For any element s in S:
 6.     EqPart = EqPart + s^2
 7. Return ab(a/b - 1) NonEqPart + a(a - 1) EqPart

Figure 5-3. Algorithm to calculate the estimated variance

5.4 Experimental Evaluation

In this section, the utility of the SI-based simultaneous GROUP BY bounds is experimentally evaluated. There are two questions to be considered. First, the purpose of sampling from a database is to obtain an approximate answer quickly. However, the additional computation required to produce the simultaneous GROUP BY bounds will

require extra time, which may offset some of the benefit of sampling. Thus, the first question to address is: For each of the three methods that have been proposed, how long does it take to obtain the SI-based simultaneous GROUP BY bounds, compared to the time required to execute the query over the entire database? Second, all three methods that have been proposed are approximate methods, and do not solve the SI problem exactly. Therefore, the second question to address is: How well do these three methods bound the correctness of an approximate GROUP BY answer in reality?

5.4.1 Running Time Experiments

To evaluate the utility of the SI-based simultaneous GROUP BY bounds, three different times are computed:

1. The time to complete the query over the entire database.
2. The time to complete the query over a sample set in order to produce an approximate answer.
3. The time to compute simultaneous confidence bounds using the SI problem.

If sampling is useful for reducing the running time of a query, the sum of the last two times is expected to be much smaller than the first time. Otherwise, the query should simply be answered exactly. In the ideal case, the third time is also expected to be much smaller than the second one, so that the simultaneous confidence bounds are produced at essentially no extra cost compared to the time required to estimate the answer. However, even if the third time is larger than the second one, this only indicates that there is room for improvement and does not preclude the use of the proposed methods, because the total sampling time may still be less than the time to evaluate the query exactly.

5.4.1.1 Experimental setup

In the experiments, the same SQL statement as described in Section 4.3 is used, as shown below:

SELECT SUM(o.Profit)

FROM Order o, Employee e, Branch b
WHERE o.EmpID = e.EmpID AND e.OfficeID = b.OfficeID
GROUP BY b.BranchID

The schemas of the three relations used in the query are:

Order(EmpID, Profit, Other)
Employee(EmpID, OfficeID, Other)
Branch(BranchID, OfficeID, Other)

These relations and the samples used for estimation are generated through a generation procedure controlled by four parameters: database size, skewness, number of groups, and sample ratio. The generation procedure and the experimental platform (both software and hardware) are discussed in Section 4.3.

The same data sets as described in Section 4.3 are prepared for the experiments. There are 12 different data sets overall. Each time, three parameters are fixed at their default values and the fourth is varied. The default values for database size, skewness, number of groups, and sample ratio are 10 GB, 0, 100, and 0.5%, respectively. In the first set of experiments, the database size is varied to generate 100 MB, 1 GB, and 10 GB databases. In the second set of experiments, different data sets are generated using zipf coefficients 0, 0.3, and 0.6. In the third set of experiments, data sets are generated with 100, 1000, and 10000 groups. In the last set of experiments, a database is generated and three different samples are obtained using sample ratios 0.5%, 1%, and 2% for Order. The two other relations follow the rule described above.

For each database and sample generated, the experiments are run five times using each of the three proposed solutions to the SI problem, and the average times of the five runs are reported. In each experiment, the time required to solve the SI problem includes exactly the time required to compute $F$ once the join over the sample has been computed and is sitting in main memory.

The samples used in each experiment were pre-computed and stored within the database. Since the samples were relatively small with respect to the available main memory, it is reasonable to assume that they were buffered entirely within main memory by the database system and did not need to be read from disk.

5.4.1.2 Results

The results are shown in Table 5-1. Running times are given in seconds. Numbers are rounded to the nearest second. When the number of groups is 10000, neither the moment analysis method nor the Monte Carlo resampling method finishes in a reasonable time, resulting in an N/A in the table. This is because these two methods require operations over a 10000 by 10000 covariance matrix, which is not possible using the given hardware.

In order to put the sample sizes that have been tested into context, the 95% confidence interval widths for sample ratios 0.5%, 1%, and 2% averaged 43%, 20%, and 8% of the estimated answer, respectively, when all other parameters take their default values. Thus, a 2% sample produces errors that are less than 10% (which should be quite reasonable for many data exploration applications), and the error bounds seem to shrink by 1/2 or more with a doubling of the sample size.

5.4.1.3 Discussion

Several interesting results can be observed; a few significant ones are pointed out here. First, the time to actually compute the sample-based estimate increases approximately linearly as the sample ratio increases (which is expected), while the time required to solve the SI problem is generally independent of the sample size.

Second, the running time for Postgres to complete the query over the entire database ranges from around 16 minutes to more than 8 hours. On the other hand, Postgres never needs longer than a minute to compute an approximate query answer over the samples, even in the worst case. Thus, sampling itself seems to be a valuable method in terms of reducing computation time, even over multi-table joins.

Running time under different database sizes
                               100 MB DB    1 GB DB   10 GB DB
Query time                             8         85       1053
Estimate time                          1          2         12
Moment analysis                       11         12         23
Approximate moment analysis            1          1          1
Monte Carlo resampling                66        111        221

Running time under different skewness
                                  skew 0   skew 0.3   skew 0.6
Query time                          1053       1604      29164
Estimate time                         12         16         27
Moment analysis                       23         26        406
Approximate moment analysis            1          1          5
Monte Carlo resampling               221        245        435

Running time under different numbers of groups
                              100 groups  1K groups 10K groups
Query time                          1053       1057       1086
Estimate time                         12         15         16
Moment analysis                       23        215        N/A
Approximate moment analysis            1          1         57
Monte Carlo resampling               221      18055        N/A

Running time under different sample ratios
                                    0.5%         1%         2%
Query time                          1053       1053       1053
Estimate time                          8         12         32
Moment analysis                       15         23         28
Approximate moment analysis            1          1          3
Monte Carlo resampling               219        221        230

Table 5-1. Running time under different parameter values

Third, as long as the number of groups is 100, even for the slowest method (Monte Carlo resampling), the running time to produce simultaneous GROUP BY confidence intervals is in the worst case only about 20% as long as the running time to execute the query over the entire database, when the data has no skew. If the data is very skewed, this extra time is only about 2% as long as the query time. This indicates that all three methods are valuable for 100 groups or fewer. However, when the number of groups is 1000, Monte Carlo resampling takes around 20 times as long as the running time to execute the query over the entire database, and thus is not practical to use. Both moment analysis and

approximate moment analysis still require less time than the time to execute the query over the entire database. When the number of groups is 10000, the only practical method is approximate moment analysis.

Note that the third time for both the moment analysis and the Monte Carlo resampling method is generally larger than the time to complete the query over the samples. This indicates that these two methods are not ideal, though they are still useful. However, the approximate moment analysis method was nearly ideal in terms of its computational time. Most of the time, the time to obtain simultaneous GROUP BY bounds was much smaller than the time to compute the query result over a sample. The only case where this did not hold was with 10000 groups, where the computation required 57 seconds. However, even in this case, the method takes just 4 times as long as it does to compute the query result over the sample, and it takes just 1/20 as long to obtain the simultaneous bounds as it does to compute the exact result. Therefore, approximate moment analysis can be used virtually for free in most cases in terms of the extra computation that is required.

Finally, I acknowledge that the experiments consider what is really the best possible scenario for the application of sampling: an arbitrary, ad hoc query is issued for which no pre-computed result is available. This favorable scenario is the main reason for the orders-of-magnitude speedup associated with sampling. In other scenarios, such as an incremental scenario where a user drills down into an already-computed answer set, sampling may look much less attractive. Still, the results do convincingly show that when sampling is applicable, computing SI-based simultaneous bounds should not be much more expensive than simply using the underlying sample to compute an estimate.
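Approximate moment analysis owes its speed to the sampling estimator of Section 5.3.3, which touches only a sample of the off-diagonal terms. The Python sketch below illustrates that idea under several assumptions: the caller supplies a hypothetical function `h(i, j)` returning the term for pair (i, j) with the constant C already added; the names `estimate_total`, `variance_estimate`, and `sample_until_stable` are invented here; and the stopping threshold `rel_std=0.1` mirrors the "Std * 10 > Estimate" test of Figure 5-3. The coefficient on the squared terms is written as a(a - 1), which is what unbiasedness under sampling without replacement requires.

```python
import math
import random


def estimate_total(sampled, m):
    """Equation 5-11: scale the sample sum by m / |S|."""
    return m / len(sampled) * sum(sampled)


def variance_estimate(sampled, m):
    """Sample-based estimate of the variance of `estimate_total`.

    Needs at least two sampled values because b = (m - 1) / (|S| - 1).
    """
    k = len(sampled)
    if k < 2:
        raise ValueError("need at least two sampled entries")
    a = m / k
    b = (m - 1) / (k - 1)
    total = sum(sampled)
    eq_part = sum(h * h for h in sampled)
    # Sum of h_s * h_t over ordered pairs of distinct sample positions.
    non_eq_part = total * total - eq_part
    return a * b * (a / b - 1.0) * non_eq_part + a * (a - 1.0) * eq_part


def sample_until_stable(n, h, rel_std=0.1, seed=0):
    """Sample off-diagonal pairs until the estimated std is small enough."""
    rng = random.Random(seed)
    m = n * (n - 1)
    seen = set()
    sampled = []
    estimate = 0.0
    while len(seen) < m:
        i, j = rng.randrange(n), rng.randrange(n)
        if i == j or (i, j) in seen:
            continue  # rejection keeps the sample without replacement
        seen.add((i, j))
        sampled.append(h(i, j))
        estimate = estimate_total(sampled, m)
        if len(sampled) >= 2:
            std = math.sqrt(max(variance_estimate(sampled, m), 0.0))
            if std <= rel_std * abs(estimate):
                break
    return estimate, len(sampled)
```

The rejection loop avoids materializing all n(n - 1) pairs, which is the point of the method; a production version would also want a cap on the number of draws and a more careful stopping rule.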

5.4.2 Correctness Experiments

5.4.2.1 Experimental Setup

In this section, the correctness of the three proposed methods for solving the SI problem is experimentally evaluated, including the applicability of the parametric solutions.

Testing the correctness of any confidence bound is non-trivial, and is generally done using Monte Carlo methods. For example, a traditional p% confidence bound tells us that a parameter of interest is within a specified range p% of the time. A Monte Carlo experiment to check the correctness of such a confidence bound would re-run the computation that produced the bound $N$ independent times, and see whether the bounds contain the true answer approximately $\frac{p}{100} N$ out of the $N$ times. Under this regime, each repetition can be treated as a coin flip where a heads is observed when the bound contains the true answer, and a tails otherwise. If the confidence bound computation is correct, the total number of observed heads follows a binomial distribution, given by $Binomial(N, \frac{p}{100})$. As long as the actual number of observed heads is within a two-sided binomial confidence interval computed using $Binomial(N, \frac{p}{100})$, one can conclude with confidence that this confidence bound is correct.

It is more challenging to design a test for the three solutions to the SI problem. In the case of a GROUP BY query, for a given set of confidence bounds, the solution to the SI problem is a function $F$ such that for any given $k$, $F[k]$ is the probability of observing $k$ incorrect intervals. For example, if $k = 12$ and $F[k] = 0.25$, then 12 incorrect bounds are expected to be observed 25% of the time. Thus, if $p \approx \sum_{i=0}^{k} F[i]$, then the SI solution asserts that the probability of observing $k$ or fewer incorrect bounds is $p$.

For a given $p$, a Monte Carlo experiment can be designed to check the correctness of an SI solution by repeating the following procedure $N$ times independently. First, a sample is obtained from the database tables and the sample is used to estimate the result of a GROUP BY query. For each group, a 95% confidence interval is constructed using the

sample, and the resulting SI problem is solved. Using $F$, the next step is to find $k$ such that the probability of observing $k$ or fewer incorrect bounds is $p$. Then the actual number of incorrect intervals is calculated by computing the actual answer of the query over the entire database, and then counting the total number of intervals that do not contain the answer for the associated group. The "coin flip" associated with this SI solution is a heads if the actual number of incorrect intervals is less than or equal to $k$, and a tails otherwise. Therefore, if the method to solve the SI problem is correct, the total number of heads in $N$ independent repetitions follows a binomial distribution, given by $Binomial(N, p)$. As long as the actual number of observed heads is within a two-sided binomial confidence interval, one can conclude with confidence that the method for solving the SI problem is correct.

To actually implement such an experiment, the same query described in the previous section is used:

SELECT SUM(o.Profit)
FROM Order o, Employee e, Branch b
WHERE o.EmpID = e.EmpID AND e.OfficeID = b.OfficeID
GROUP BY b.BranchID

Four different data sets are built for this experiment. In the first data set, all parameters take their default values. In the second data set, the sample ratio is changed to 0.5%. In the third data set, the skewness is changed to 0.3. In the fourth data set, the skewness is changed to 0.3, and the number of groups is changed to 20.

For each of the three methods (Monte Carlo resampling, moment analysis, and approximate moment analysis), and for each of the four data sets, the procedure described above is performed for $p$ in $\{0.05, 0.10, \ldots, 0.90, 0.95\}$, and $N = 100$.

5.4.2.2 Results

The observed results are given in Figures 5-4, 5-5, 5-6, and 5-7. The figures show the 12 results observed using the three different SI solutions and four different data sets. The x axis in each figure is the $p$ tested. The y axis (shown as a bar) is the total

number of observed heads out of 100 trials. The error bar over each bar is the two-sided 95% confidence interval of $Binomial(100, p)$. In order to compute the interval, the inverse cumulative distribution function values of probabilities 0.025 and 0.975 from $Binomial(100, p)$ are computed and treated as the lower bound and upper bound, respectively.

Figure 5-4. Correctness of the three methods over data set one: All parameters use their default values. The error bar is a 95% two-sided confidence interval from the corresponding binomial distribution Binomial(100, p), where each p value is shown on the x axis.

5.4.2.3 Discussion

If the SI solutions were unreliable in practice, the number of heads would be expected to fall outside of the two-sided 95% confidence interval of $Binomial(100, p)$ more than 5% of the time. In actuality, the observed results show that only 7 out of 240 results are outside of the two-sided 95% confidence interval of $Binomial(100, p)$. These results

strongly support the claim that the three methods for solving the SI problem do in fact produce correct and reliable results. Furthermore, this low rate of error was observed for all three of the methods, including the two parametric solutions that rely on the Beta distribution. In fact, the rate of errors for the parametric, Beta-based solution was actually lower than for the non-parametric Monte Carlo method. This seems to be strong evidence for the assertion that the four parameters used in the parametric solution (lower bound, upper bound, mean, and variance) do in fact constrain the space of possible distributions so much that a parametric assumption is quite reasonable, and it seems to discount the need for a parametric solution that takes into account even higher moments such as the skew and kurtosis.

Figure 5-5. Correctness of the three methods over data set two: Except for the fact that the sampling ratio is set to 0.5%, the three other parameters use their default values.
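The coin-flip check used in these experiments is simple to reproduce. The Python sketch below is one possible way to compute the two-sided binomial interval from the inverse CDF at 0.025 and 0.975; the function names are hypothetical, the quantile is taken as the smallest count whose cumulative probability reaches the target (one common convention, assumed here), and `math.comb` requires Python 3.8 or later.

```python
import math


def binomial_quantile(q, n, p):
    """Smallest k such that P(Binomial(n, p) <= k) is at least q."""
    cumulative = 0.0
    for k in range(n + 1):
        cumulative += math.comb(n, k) * p**k * (1.0 - p) ** (n - k)
        if cumulative >= q:
            return k
    return n


def heads_interval(n, p, level=0.95):
    """Two-sided interval for the number of heads in n flips with P(heads) = p."""
    tail = (1.0 - level) / 2.0
    return binomial_quantile(tail, n, p), binomial_quantile(1.0 - tail, n, p)


def is_consistent(heads, n, p, level=0.95):
    """True if the observed number of heads lies inside the interval."""
    lo, hi = heads_interval(n, p, level)
    return lo <= heads <= hi
```

Each of the 12 plots then amounts to calling `is_consistent(observed_heads, 100, p)` once per tested value of p and counting how often the answer is false.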

Figure 5-6. Correctness of the three methods over data set three: Except for the fact that the skew is set to 0.3, the three other parameters use their default values.

Finally, there is still the concern that while the results obtained in this set of experiments do seem to argue convincingly that the SI-based solution to the GROUP BY problem is accurate in practice, they do not necessarily argue that the SI-based solution is actually necessary in practice. It is still reasonable to ask, "How dangerous would it be to simply ignore the correlations among groups, and assume that all groups are independent?"

The first step toward answering this question is to repeat the procedure described above, while ignoring the correlations among the groups. Recall that in the Monte Carlo experiment design, first a sample is obtained from the database. Then for each group, a 95% confidence interval is constructed using the sample. If it is assumed that the intervals are independent, then the number of incorrect bounds follows a

Figure 5-7. Correctness of the three methods over data set four: 20 groups and a skew of 0.3 are used.

Figure 5-8. Correctness obtained assuming independence on data set four. The right plot shows the difference between the SI solution obtained using independence (the solid line) and the Beta approximation (the dashed line).

binomial distribution $Binomial(m, 0.05)$, where $m$ is the total number of groups, and 0.05 is the probability of seeing a single interval wrong. Thus, if independence is assumed, the binomial distribution can be used to provide a very simple "solution" to the SI problem by using $F[k] = \sum_{i=0}^{k} Prob(Binomial(m, 0.05) = i)$.

Figure 5-8 shows the accuracy obtained using this binomial "solution" to the SI problem, using exactly the same experimental setup as for the other SI problem solutions (with $p$ in $\{0.05, 0.10, \ldots, 0.90, 0.95\}$ and $N = 100$) over the fourth data set (the skewness is 0.3, the number of groups is 20). The results show that for large $p$ values, the observed value tends to be below the 95% error bar. For example, for $p = 0.6$, the 95% error bar covers the range from 50 to 70, but the observed value is only around 45.

Unfortunately, these results do not clearly demonstrate the extent to which a user can be misled when the independence assumption is used to communicate the accuracy of a GROUP BY query to the user. In Figure 5-8, if the independence assumption were valid, the total number of times that the actual count of incorrect intervals is observed to be smaller than or equal to a given $k$ where $F[k] = p$ should be approximately $p \times 100$. The figure shows that for larger $p$ values, the number of times that the count of incorrect intervals does not exceed $k$ tends to be smaller than $p \times 100$. Thus, for a fixed $k$, when using the independence assumption, the estimated probability of seeing $k$ or more incorrect intervals is generally too small. The problem is that Figure 5-8 does not show exactly the extent to which this probability can be underestimated.

In order to investigate exactly how this probability is underestimated, one of the 100 Monte Carlo trials is arbitrarily chosen from Figure 5-8, and the probability of seeing $k$ or more incorrect groups obtained using the Beta approximation, as well as the same probability obtained using the independence assumption, is plotted. The results are reported in the right part of Figure 5-8. Only the tail portion of the cumulative distribution function is plotted because observing a large number of incorrect groups is the most worrisome outcome for the user of a sampling-based estimate. The figure shows that the

probability of seeing 14 or more incorrect groups out of 20 is around 0.002 using the Beta approximation, which means that this can actually happen around once per 500 query executions. This is not insignificant, and not communicating this may be dangerous. However, when using the independence assumption, the computed probability is very close to 0. If a user takes this result at face value, he or she will assume that there is no chance that the observed query result is largely useless.

A final issue to point out is that the data is generated using mild zipf skew in order to induce some correlations among the groups. In reality, correlations among groups can be much more significant than those induced by the zipf distribution. Thus the difference between the two probabilities can be even more significant in practice than what was shown here, making the assumption of independence even more problematic in practice.

5.5 Conclusion

This chapter has considered in depth the statistical issues that must be addressed if a single sample is used to answer a GROUP BY query. The problem is that since the same sample is used for each group, it is unacceptable to use classic, univariate methods to quantify error, since the group-wise estimates are not independent: if one bound is wrong, then it may be likely that all of the bounds are wrong. Statistically speaking, univariate bounds are only valid in isolation, and once a user has seen one of them he or she needs to forget about that bound before looking at the next. Since this is unreasonable in practice, this chapter has considered what information should be given to the user to safely quantify the accuracy of a GROUP BY query result, and has also discussed in depth how to compute such information efficiently.

CHAPTER 6
RETURNING TOP-K GROUPS WITH HIGH PROBABILITY

6.1 Introduction

This chapter discusses various algorithms that use a database sample to return the top-k groups from a GROUP BY query with high probability, where "top-k" refers to the best or most interesting groups according to some user-defined ranking function. A GROUP BY "top-k" query over $n$ relations $R_1, R_2, \ldots, R_n$ can be written in SQL as follows:

SELECT SUM(f(R1, R2, ..., Rn))
FROM R1, R2, ..., Rn
GROUP BY group(R1, R2, ..., Rn)
HAVING SUM(g(R1, R2, ..., Rn)) IN TOP k

The functions f and g may encode any computation, including any join and selection conditions, over each possible combination of tuples from $R_1, R_2, \ldots, R_n$. group(R1, R2, ..., Rn) specifies a grouping operation on any subset of attributes in $R_1, R_2, \ldots, R_n$. SUM(g(R1, R2, ..., Rn)) is a user-specified ranking function that can be used to determine the top-k groups. We say a group is a top-k group if its associated g value is one of the k largest g values. The goal of GROUP BY top-k query evaluation is to return all top-k groups.

When all of the database records can be accessed, the easiest way to answer this query is to first evaluate the underlying GROUP BY query. The results are then sorted based on their associated g values. Only the top-k groups are retained after sorting. However, this process can be overly time-consuming because it devotes computation to groups not in the top-k. Li et al. [76] studied this problem and proposed a framework that can return top-k groups as early as possible to avoid processing all database tuples.

Unfortunately, in an approximate query processing system, it is not possible to access the entire database. Only database samples are available. We may simply evaluate the

query over samples and return the k groups whose estimated g values are in the top-k. But this simple method does not necessarily return all of the true top-k groups.

The reason is best illustrated through an example. Let us consider the query: "What were the total sales for the top 2 regions?" Suppose the query is over two relations, PRODUCT and SALES. This query may be written in SQL as:

SELECT SUM(p.COST), s.REGION
FROM PRODUCT p, SALES s
WHERE p.PROD = s.PROD
GROUP BY s.REGION
HAVING SUM(p.COST) IN TOP 2

Imagine that the answer to the query is guessed at by using random samples from the following two database tables:

PRODUCT               SALES
PROD      COST        PROD      Region
thingy    $10         thingy    Asia
gadget    $6          widget    Asia
widget    $2          thingy    USA
dohicky   $3          widget    USA
                      gadget    Europe
                      dohicky   Europe
                      gadget    Africa
                      dohicky   Africa

The random samples are used as input into a sampling-based estimator for the answer to the query [49], which has been discussed in Section 3.1. In order to apply the simplest estimator, the samples from each of the two tables would be joined, and the result scaled

up by the inverse of each sampling fraction. Assume now that we take a 50% sample from PRODUCT and a 100% sample from SALES. If only gadget and widget were in the sample of table PRODUCT, the two regions that have the two highest g value estimates are Europe and Africa. However, the true top 2 regions based on the g values are Asia and USA. As a result, if we simply return the k groups that have the k highest g value estimates, we cannot guarantee that they are the true top-k groups, that is, the groups with the k largest true g values. Since the actual g value of each group is unknown, in order to return all of the top-k groups with high probability, the size of the return set may have to be larger than k.

In this chapter, I develop sampling-based methods that return all of the top-k groups with high probability. A key observation for developing such methods is that it is very useful to use "confidence regions". As described in Section 3.3, a confidence region with user-defined confidence p is a random region in a multi-dimensional space, such that the probability that the parameter of interest lies in the region is p. In this chapter, the parameter of interest is the g value of each group. I develop three different methods for returning all top-k groups with high probability, where the top-k groups are the groups that have the k largest g values. Each of the three methods uses a specific type of confidence region. Each of the three methods also solves an abstract problem called the "region top-k" problem, which will be discussed subsequently in this chapter. The region top-k problem provides a framework so that different methods for finding all top-k groups can be easily plugged in. The combination of an input confidence region and a solution to the region top-k problem gives a result that contains the top-k groups with high probability.

The remainder of this chapter is organized as follows. Section 6.2 discusses the confidence region and the region top-k problem. It also shows how the confidence region and the region top-k problem may be used for returning all top-k groups with high probability. Section 6.3 proposes three different algorithms for returning all top-k groups with high probability. Each of the algorithms requires a different type of confidence region,

and a different solution to the region top-k problem. Section 6.4 discusses various issues in building confidence regions. Section 6.5 experimentally evaluates the utility of the three proposed algorithms. Section 6.6 discusses which of the three algorithms should be implemented in a real-world system. Section 6.7 concludes this chapter.

6.2 Confidence Region and the Region Top-k Problem

In order to develop sampling-based methods for returning all top-k groups with high probability, it is useful to build a confidence region that contains the true g values for all groups with high probability. Before we discuss how a confidence region may be used to help us return all top-k groups with high probability, a few issues need to be explained.

Given a database sample, estimating the g value for the ith group can be done by first evaluating the query over the samples (see footnote 1) and then scaling up the result by the inverse of the sampling ratio of each table. If we obtain many database samples and repeat this procedure many times, the distribution of the estimates can be modeled by a normally distributed random variable, denoted by $M_i$. Define $\vec{g} = \langle g_1, g_2, \ldots, g_n \rangle$, where $g_i$ is the g value of the ith group. Then the joint distribution of these random variables, denoted by $M = \langle M_1, M_2, \ldots, M_n \rangle$, can be used to construct a confidence region for $\vec{g}$. When I assert that a confidence region contains $\vec{g}$ with probability p, I mean that there is an n-dimensional random region such that, with probability p, it contains $\vec{g}$. Various methods for constructing different types of confidence regions are discussed in Section 6.3.

Since the estimator $M_i$ for the ith group is the ith dimension of $M$, throughout this chapter I use i as the identifier for both the dimension and the group. The terms "ith dimension" and "ith group" are used interchangeably.

Footnote 1: The aggregation function is augmented so that if a resulting tuple is not in the ith group, the aggregation function returns 0.
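The scale-up estimator described above can be illustrated on the PRODUCT and SALES tables from the introductory example. The following is only a minimal sketch, not the estimator of [49]: the function names `estimate_group_sums` and `top_k` are hypothetical, and the sample is fixed by hand to the one assumed in the example (gadget and widget) rather than drawn at random.

```python
from collections import defaultdict

# Tables from the introductory example.
PRODUCT = {"thingy": 10, "gadget": 6, "widget": 2, "dohicky": 3}
SALES = [("thingy", "Asia"), ("widget", "Asia"),
         ("thingy", "USA"), ("widget", "USA"),
         ("gadget", "Europe"), ("dohicky", "Europe"),
         ("gadget", "Africa"), ("dohicky", "Africa")]


def estimate_group_sums(product_sample, sales_sample,
                        product_fraction, sales_fraction):
    """Join the two samples and scale each group's SUM(COST) by the
    inverse of the product of the sampling fractions."""
    scale = 1.0 / (product_fraction * sales_fraction)
    sums = defaultdict(float)
    for prod, region in sales_sample:
        if prod in product_sample:
            sums[region] += product_sample[prod] * scale
    return dict(sums)


def top_k(values, k):
    """Identifiers of the k largest values (ties broken by insertion order)."""
    return set(sorted(values, key=values.get, reverse=True)[:k])


# Using the full tables (fractions of 1.0) gives the true g values.
true_sums = estimate_group_sums(PRODUCT, SALES, 1.0, 1.0)

# A 50% sample of PRODUCT that happens to contain only gadget and widget.
product_sample = {name: PRODUCT[name] for name in ("gadget", "widget")}
estimated_sums = estimate_group_sums(product_sample, SALES, 0.5, 1.0)

print(top_k(true_sums, 2))       # Asia and USA
print(top_k(estimated_sums, 2))  # Europe and Africa
```

With this particular sample the estimated top-2 set and the true top-2 set are disjoint, which is why the methods in this chapter return a set that may be larger than k.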

Figure 6-1. Three possible cases for the confidence region, classified by the relationship between the region and the line first-group = second-group: (a) The region is above the line. (b) The region touches the line. (c) The region is below the line.

6.2.1 How Confidence Regions may Help Return all Top-k Groups

Confidence regions may be used to help find all top-k groups. To illustrate how, let us consider a simple case. Imagine that we have a query with two groups, and we are interested in the top-1 group. Regardless of how the possible instances of the confidence region are placed, it is possible to classify these instances into three different cases, determined by whether the region is totally above, is totally below, or touches the line first-group = second-group. The three cases are shown in Figure 6-1.

It is important to understand why I classify all possible instances of the confidence region into three cases. Without loss of generality, let us first consider case (a) in Figure 6-1. Since the entire confidence region is above the line first-group = second-group, if we pick any point in the region, the value that corresponds to the first group is larger than the value that corresponds to the second group. If $\vec{g}$ is in the confidence region, then the first group must be the top-1 group, since the value of the first group is larger than that of the second for any point in the region. Because we know that $\vec{g}$ is in the confidence region with high probability, if we see the confidence region is above the line

first-group = second-group, returning the first group as the top-1 group guarantees that we return the true top-1 group with high probability. Case (c) is very similar to case (a), but this time we should return the second group instead. The more difficult case is case (b), because even if we know that $\vec{g}$ is in the confidence region, we do not know which point inside the confidence region corresponds to $\vec{g}$. Some points in the region may have a larger value for the first group, while others may have a larger value for the second group. Thus, we have to return both groups in order to guarantee that we correctly return the top-1 group with high probability.

6.2.2 The Region Top-k Problem

One way to view the classification from the previous subsection is that I am trying to find the set of all possible top-k groups over all points in the region. In case (a), since the entire region is above the line first-group = second-group, the set of top-1 groups of all the points in the region contains nothing but the first group. Similarly, in case (c), the set of top-1 groups of all the points in the region contains only the second group. In case (b), for some points the top-1 group is the first group, while for others the top-1 group is the second. Thus, the set of all top-1 groups contains both groups.

If we can correctly return the set of the top-k groups for all the points in the region, we can return the true top-k groups with high probability. The reason is that as long as $\vec{g}$ is in the confidence region, its top-k groups are in the set. Thus, the top-k groups are correctly returned. I abstract the problem of returning the union of the top-k groups of all the points in the region, and call it the "region top-k" problem.

In order to formally define this problem, it is necessary to first define some notation. A point $x = [x_1, x_2, \ldots, x_n]^T$ in an n-dimensional space is a vector of n values, where $x_i$ is the value of the ith dimension and i is the dimension identifier. For example, $t = [t_1 = 1.3, t_2 = 2.1, t_3 = 2.4, t_4 = 1.9]^T$ is a point in a 4-dimensional space. Dimension i is a top-k dimension if $x_i$ is one of the k largest values in $\{x_1, x_2, \ldots, x_n\}$. In the above example, dimension 3 is a top-2 dimension because 2.4 is one of the top-2 largest values in

the vector. I also define top-k(x) to be the set of top-k dimension identifiers of x. In the above example, top-2(t) = {2, 3}. A region $T = \{x\}$ is a set of points in an n-dimensional space.

Define f to be a function that takes a region as input and returns a set of dimension identifiers. For example, considering Figure 6-1, f may return {1} in case (a), {0} in case (c), and {0, 1} in case (b). Alternatively, f can be treated as an algorithm that operates on an input region and returns a set of dimension identifiers. If the input region is denoted by T, we denote the return set of f by f(T). We say the algorithm encapsulated by f is a solution to the region top-k problem if $\forall x \in T$, top-k(x) $\subseteq$ f(T).

For example, the function f described in this section is a solution to the region top-1 problem. To see why this is the case, we need to prove that for any point in any input region T, its top-1 dimension identifier is in f(T). Since these input regions can be classified into three different cases, let us consider each of the three cases, as shown in Figure 6-1:

1. In case (a), for any T, f(T) = {1}. For any point x in T, top-1(x) = {1} $\subseteq$ f(T).
2. In case (b), for any T, f(T) = {0, 1}. For any point x in T, top-1(x) is either {0} or {1}. Each of them is a subset of f(T).
3. In case (c), for any T, f(T) = {0}. For any point x in T, top-1(x) = {0} $\subseteq$ f(T).

The solution to the region top-k problem is not unique. For example, it is also possible to design an algorithm that always returns all dimension identifiers, that is, f(T) = {0, 1} for any input region. This is also a solution to the region top-1 problem, because for any point in the region its top-1 dimension is certainly a subset of {0, 1}. But this solution is not as tight as the previous one. In cases (a) and (c), it returns more groups than needed.

6.2.3 Applying the Region Top-k Problem to the GROUP BY Top-k Query

A solution to the region top-k problem may be used to answer the GROUP BY top-k problem, as described in Figure 6-2. This algorithm takes a confidence region as the input

region of the region top-k problem, and returns the set of group identifiers by solving the region top-k problem. The correctness of the algorithm follows directly. If $\vec{g}$ is in C, then top-k($\vec{g}$) $\subseteq$ f(C). Since the confidence region has confidence p, $\vec{g}$ is in C with probability p. Thus, with probability at least p, f(C) contains the top-k groups of the GROUP BY query Q.

    Function GetTopKGroups(p, S, Q)
    1. Build a confidence region C with confidence p using sample S for query Q;
    2. Apply the solution f to the region top-k problem to C;
    3. Return f(C).

Figure 6-2. A framework for solving the GROUP BY top-k problem using a solution to the region top-k problem.

6.3 Solutions for Returning All Top-k Groups with High Probability

Given the framework described in the previous section, the key to building a solution for returning all top-k groups with high probability is to develop a solution to the region top-k problem, and a method for constructing a specific type of confidence region. This section discusses three different methods for returning all top-k groups with high probability:

1. An interval sweeping algorithm.
2. A pruning-based algorithm.
3. A two-stage algorithm.

Each of the three proposed methods requires one or more confidence regions as input. Each of the three proposed methods also requires an algorithm for solving the corresponding region top-k problem. I describe each of them in the following subsections.

6.3.1 A Solution Using Interval Sweeping

The first solution is an interval sweeping algorithm. This solution works for the type of axis-parallel confidence region shown in Figure 6-1. A benefit of using this sort of region is that only variances, as opposed to all covariances, are needed to build it. Thus, it is computationally inexpensive.
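The framework of Figure 6-2 is a composition of two pluggable pieces, which may be easier to see in code. The sketch below is illustrative only: `get_top_k_groups`, `top1_two_groups`, and `fixed_region` are hypothetical names, the region is assumed to be an axis-parallel box given as one (low, high) interval per group, and the groups are labeled "first" and "second" rather than by the numeric identifiers used in the text.

```python
def get_top_k_groups(p, sample, query, build_region, solve_region_top_k):
    """Figure 6-2 as a higher-order function: build a confidence region
    with confidence p, then hand it to a region top-k solver."""
    region = build_region(p, sample, query)
    return solve_region_top_k(region)


def top1_two_groups(region):
    """Region top-1 solver for two groups and an axis-parallel box.

    region is ((low_first, high_first), (low_second, high_second)).
    """
    (low_first, high_first), (low_second, high_second) = region
    if low_first > high_second:      # case (a): first always larger
        return {"first"}
    if low_second > high_first:      # case (c): second always larger
        return {"second"}
    return {"first", "second"}       # case (b): the box touches the line


def fixed_region(box):
    """Stand-in region builder that ignores p, sample and query
    (a placeholder for a real confidence-region construction)."""
    return lambda p, sample, query: box


case_a = get_top_k_groups(0.95, None, None,
                          fixed_region(((5.0, 7.0), (1.0, 3.0))),
                          top1_two_groups)
case_b = get_top_k_groups(0.95, None, None,
                          fixed_region(((5.0, 7.0), (6.0, 8.0))),
                          top1_two_groups)
print(case_a, case_b)
```

Any region builder and any solver with matching region types can be substituted, which is the sense in which the three methods of this section plug into the same framework.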

To illustrate the algorithm, I first define some notation. Denote:

$$l_i = \min_{x \in T} x_i; \qquad h_i = \max_{x \in T} x_i.$$

Intuitively, $[l_i, h_i]$ is the range of possible values for $x_i$, $\forall x \in T$. The algorithm aligns all intervals $[l_1, h_1], [l_2, h_2], \ldots, [l_n, h_n]$ on a single real line. The algorithm sweeps from the rightmost point $\max_i h_i$ to the leftmost point $\min_i l_i$. The sweeping procedure stops immediately after there are k intervals whose lower bounds are larger than or equal to the current sweeping point, and returns these k dimensions, as well as any dimension whose corresponding interval contains the stopping point. Figure 6-3 shows an example of the sweeping procedure in a 4-dimensional space for finding the top-2 dimensions. Figure 6-3(a) shows the starting point of the sweeping procedure. Figure 6-3(b) shows the stopping point of the sweeping procedure. The return set in this example is {1, 2, 4}.

The correctness of the algorithm. By construction, the stopping point is the lowest possible top-k value for any point in the region: at least k dimensions have values no smaller than it. The upper bound $h_i$ of any dimension i that is not returned by the algorithm is smaller than this lowest possible top-k value, and thus dimension i can be safely removed from the solution. Therefore the set returned contains the top-k dimensions of any point in the input confidence region. Since $\vec{g}$ is in the confidence region with probability at least p, the interval sweeping algorithm returns all top-k groups with probability at least p.

How to construct the confidence region. Let us denote the confidence region by C. C is an n-dimensional random box, such that each of the edges is parallel to one of the axes. The range $C_i$ of the ith group is $est_i \pm z_i \sigma_i$, where $est_i$ and $\sigma_i$ are the observed estimate and the standard deviation of the ith group, respectively. If we denote the accuracy of the ith group by $p_i$, then $z_i$ is the corresponding coefficient for accuracy $p_i$, which can be computed by using the inverse cdf of the standard normal distribution. I further require every group to have the same confidence $p'$, i.e. $\forall i,\ p_i = p'$. By using the Bonferroni

Inequality [82], the following holds:

$$\Pr[\forall i,\ g_i \in C_i] = 1 - \Pr[\exists i,\ g_i \notin C_i] \ge 1 - \sum_{i=1}^{n} \Pr[g_i \notin C_i] = 1 - n(1 - p') \qquad (6\text{-}1)$$

Thus, in order to guarantee that $\Pr[\forall i,\ g_i \in C_i]$ is at least p, one way is to make sure that $1 - n(1 - p') \ge p$. This requires $p' \ge 1 - (1 - p)/n$. Thus, choosing $p' = 1 - (1 - p)/n$ completes the confidence region construction.

Figure 6-3. An example of the interval sweeping algorithm: (a) The starting point. (b) The ending point. The return set is {1, 2, 4}.

The limitation of the interval sweeping algorithm. The interval sweeping algorithm is efficient for the type of axis-parallel confidence region discussed above. However, this is not necessarily the only type of confidence region. When the covariance

matrix is available, standard eigenvector/eigenvalue decomposition allows us to build a confidence region such that each of the region's edges is parallel to one of the eigenvectors. This type of confidence region generally has smaller volume. When using the interval sweeping algorithm for this type of confidence region, the algorithm is performed on the bounding box of the region. This may result in returning unnecessary groups.

Figure 6-4. Three possible relationships between the region and the line first-group = second-group. Each edge of the region is parallel to one of the eigenvectors. (a) The region is above the line. (b) The region touches the line. (c) The region is below the line. The dashed rectangle is the bounding box.

Let us consider the following case again: a region in 2-dimensional space, with k = 1. This time, the confidence region is a rectangle in 2-dimensional space, such that each of the edges is parallel to one of the eigenvectors. Three cases are shown in Figure 6-4. In this example, the bounding box for the region (the dashed box in Figure 6-4) touches the line first-group = second-group in all three cases. Thus the interval sweeping algorithm returns both groups in every case. However, it is clear that only in case (b) does the algorithm actually need to return both groups.
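The interval sweeping algorithm of Section 6.3.1, together with the Bonferroni-based construction of its axis-parallel region, can be sketched as follows. This is a minimal illustration rather than the thesis implementation (which was written in C++): the function names are hypothetical, the intervals are assumed to be two-sided (so the tail mass 1 - p' is split evenly), and the four example intervals are invented so that the result matches the return set {1, 2, 4} quoted for Figure 6-3.

```python
from statistics import NormalDist


def bonferroni_coefficient(p, n):
    """Coefficient z for n groups and overall confidence p.

    Uses p' = 1 - (1 - p) / n from Equation 6-1 and, as an assumption,
    a two-sided interval est +/- z * sigma for each group.
    """
    p_prime = 1.0 - (1.0 - p) / n
    return NormalDist().inv_cdf(1.0 - (1.0 - p_prime) / 2.0)


def axis_parallel_region(estimates, std_devs, p):
    """One (low, high) interval per group; only variances are needed."""
    z = bonferroni_coefficient(p, len(estimates))
    return [(e - z * s, e + z * s) for e, s in zip(estimates, std_devs)]


def interval_sweep(intervals, k):
    """Return 1-based identifiers of every dimension that can be top-k.

    Sweeping from the right stops at the k-th largest lower bound; any
    interval whose upper bound reaches that point is kept.
    """
    lower_bounds = sorted((low for low, _ in intervals), reverse=True)
    stop = lower_bounds[k - 1]
    return {i for i, (_, high) in enumerate(intervals, start=1)
            if high >= stop}


# Hypothetical intervals in the spirit of Figure 6-3 (top-2 of 4 dimensions).
example = [(5.0, 8.0), (4.0, 7.0), (1.0, 3.0), (2.0, 6.0)]
print(interval_sweep(example, 2))

# Three well-separated groups: only the first can be the top-1 group.
region = axis_parallel_region([10.0, 9.0, 2.0], [0.1, 0.1, 0.1], 0.95)
print(interval_sweep(region, 1))
```

Running the same `interval_sweep` on the bounding box of an eigenvector-parallel region is what produces the unnecessary groups discussed above, since the box discards the orientation of the region.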

6.3.2 A Pruning-Based Solution

In order to deal with this limitation of the interval sweeping algorithm, this subsection discusses a pruning-based algorithm. The pruning-based algorithm works best for the type of confidence region shown in Figure 6-4. A key observation from Figure 6-4 is that it is often possible to rank a given pair of dimensions/groups. For example, in case (a) in Figure 6-4, since for any point x in the region the value of the first group is always larger than the value of the second group, the first group has a higher rank. That is, if the second group is a top-k dimension, the first group must be a top-k dimension too. For a pair of dimensions i and j, i has a higher rank than j if and only if $\forall x \in T$, $x_i > x_j$. Notice that we may not be able to determine the rank between every pair of dimensions. For example, in case (b) of Figure 6-4, we are not sure which dimension has a higher rank.

The pruning-based algorithm relies on the relative ranks among the dimensions to remove any dimension that we know cannot be a top-k dimension. If a group is a top-k dimension, any dimension whose pairwise rank is higher than this group must also be a top-k dimension. However, there are at most k slots for the true top-k groups. If, for a particular group, we find k or more groups of higher rank, this group cannot be a top-k group, since these k higher-rank groups already occupy all k slots. The following observation formalizes this:

Observation. Given dimension i, if there exist dimensions $j_1, j_2, \ldots, j_k$, such that $\forall x \in T$, $x_i$

    Function PruningTopK(T, n)
    1. ReturnSet = {};
    2. for i = 1 to n
    3.     Count = 0;
    4.     for j = 1 to n, j ≠ i
    5.         if IsSmaller(i, j, T)
    6.             count++;
    6. if count

    Function TwoStageGroupByTopk(p, S, Q, n)
    1. Compute variances using sample S for query Q;
    2. Build a confidence region C with confidence p;
    3. Candidate = IntervalSweeping(C, n);
    4. Compute the covariances among the groups in Candidate;
    5. Build a confidence region D with confidence p;
    6. ReturnSet = PruningTopK(D, n);
    7. return ReturnSet;

Figure 6-6. The Two-Stage Algorithm.

6.3.3 A Two-Stage Algorithm

Although the pruning-based algorithm may outperform the interval sweeping algorithm, computing the entire covariance matrix is expensive. A real-world query may contain thousands or even millions of groups, which makes it impractical to compute the entire covariance matrix. Thus I propose a two-stage algorithm that combines the benefits of the two algorithms already described. In the first stage, the algorithm only requires variances and builds a confidence region using the variances only. The interval sweeping algorithm is used to evaluate this confidence region, and the result set is treated as a set of candidate groups. In the second stage, the covariances among these groups are computed, and another confidence region is built using these covariances. Then the pruning-based algorithm is run using the new confidence region as input. The returned set is treated as the final answer. The pseudo-code of the two-stage algorithm is shown in Figure 6-6. Notice that the first stage only suggests a set of candidate groups. Thus, the correctness of the algorithm relies only on lines 5, 6, and 7, and the algorithm is correct because the pruning-based algorithm is correct.

How to build the confidence regions. One question still needs to be answered in order for this algorithm to be practical: how should confidence regions C and D be built? Building confidence region C has been dealt with in the discussion of the interval sweeping algorithm. Building confidence region D is more challenging. For simplicity, I assume that for any

dimension that is not returned by the interval sweeping algorithm in the first stage, the accuracy $p'$ remains the same when building D. Given this assumption, if the candidate set returned in the first stage is denoted by Cand, we have:

$$\begin{aligned}
\Pr[\forall i,\ g_i \in D_i] &= 1 - \Pr[\exists i,\ g_i \notin D_i] \\
&= 1 - \Pr[(\exists i \in Cand,\ g_i \notin D_i) \cup (\exists j \notin Cand,\ g_j \notin D_j)] \\
&\ge 1 - \left(\Pr[\exists i \in Cand,\ g_i \notin D_i] + \Pr[\exists j \notin Cand,\ g_j \notin D_j]\right) \\
&\ge 1 - \Pr[\exists i \in Cand,\ g_i \notin D_i] - \sum_{j \notin Cand} \Pr[g_j \notin D_j] \\
&= 1 - \Pr[\exists i \in Cand,\ g_i \notin D_i] - (1 - p')(n - |Cand|) \\
&= \Pr[\forall i \in Cand,\ g_i \in D_i] - (1 - p')(n - |Cand|) \qquad (6\text{-}2)
\end{aligned}$$

In order to guarantee that $\Pr[\forall i,\ g_i \in D_i]$ is at least p, one way is to make sure that

$$\Pr[\forall i \in Cand,\ g_i \in D_i] - (1 - p')(n - |Cand|) \ge p.$$

Thus

$$\Pr[\forall i \in Cand,\ g_i \in D_i] \ge (1 - p')(n - |Cand|) + p.$$

Because the covariances of the groups in Cand have been computed, we may compute the eigenvectors of the sub-covariance matrix formed by the dimensions in Cand, and then build the corresponding part of the confidence region so that each of its edges is parallel to one of the eigenvectors. Since the dimensions along the eigenvectors are independent, if we assume that the same accuracy is used for each of these dimensions, then the confidence of each of these dimensions is

$$\sqrt[|Cand|]{(1 - p')(n - |Cand|) + p}.$$

6.4 Various Issues for Building Confidence Regions

In building various types of confidence regions, I assign equal accuracy to each group. One obvious question is whether we can be "smarter", and make a better assignment so that the total number of groups returned is smaller.
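The equal-accuracy assignments used so far can be made concrete with a short calculation. The sketch below evaluates the per-group accuracy p' of Equation 6-1 and the per-dimension confidence for region D that follows from Equation 6-2. The function name `equal_accuracy_assignment` and the example values (100 groups, 95% confidence, 10 candidates) are illustrative assumptions, not values taken from the experiments.

```python
def equal_accuracy_assignment(p, n, cand_size):
    """Return (p_prime, joint, per_dimension) for the two-stage algorithm.

    p_prime       : per-group accuracy for region C, 1 - (1 - p) / n
    joint         : required Pr[all candidates inside D],
                    (1 - p_prime) * (n - cand_size) + p
    per_dimension : confidence of each eigenvector-parallel dimension,
                    the cand_size-th root of joint
    """
    p_prime = 1.0 - (1.0 - p) / n
    joint = (1.0 - p_prime) * (n - cand_size) + p
    per_dimension = joint ** (1.0 / cand_size)
    return p_prime, joint, per_dimension


p_prime, joint, per_dimension = equal_accuracy_assignment(0.95, 100, 10)
print(p_prime)        # about 0.9995
print(joint)          # about 0.995
print(per_dimension)  # slightly below p_prime
```

In this example the per-dimension confidence for D comes out marginally lower than p', so the second-stage region is not forced to be wider than the first-stage one along each of its own axes.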

Let us consider a query that asks for the top-1 group among two groups. Currently, we assign each group equal accuracy $p'$ before we see the sample and construct the confidence region. If we would like to change the accuracy before we see the input sample and construct the confidence region, there is no problem. For example, we can always assign accuracies such that $p_1 = 2 p_2$. However, this assignment is not always better than the one in which equal accuracy is assigned to each group. It may increase the size of the return set for certain samples but decrease the size of the return set for other samples.

Alternatively, one might think that a better way to assign accuracies to different groups is to do so according to the samples. Let us continue the above example. Assume that the variance of every group is 1. According to the sample, the estimates for the two groups are 2 and 3, respectively. Given that 2 is smaller than 3, it appears that the first group cannot be the top-1 group. Thus, we might let $p_2 = 0$, and assign $p_1$ accordingly. This method does give us the possibility of returning fewer groups. However, this method is problematic. The coefficient used for determining the width of an interval relies on the accuracy assigned to the corresponding group. The method has three steps. First, look at the estimates. Second, determine which groups can potentially be dropped. Finally, assign accuracies to different groups so that we can drop as many groups as possible. The problem is that now the accuracy assigned to each group is no longer a constant, but a function of the estimates obtained from a specific sample. Unfortunately, this adds a new source of randomness to the confidence region that may be impossible to quantify. Thus, the accuracy assignment must be independent of any particular sample. As discussed above, if the accuracy assignment is determined before seeing any input samples, a safe assignment is to give each dimension the same accuracy.

Another question is whether we can use alternative shapes for confidence regions. For example, researchers in statistics have studied how to build ellipsoid confidence regions for some time. Stein [97] studied this problem and pointed out that the ellipsoid type

confidence region is recommended only for one-dimensional or two-dimensional cases, and that ellipsoid confidence regions constructed using simple samples in high-dimensional spaces are problematic. Stein suggested using a Bayesian estimator in order to obtain a reasonable ellipsoid. The same suggestion has also been proposed by other researchers [15, 34]. Due to the complexity of such an estimator, I do not consider constructing an ellipsoid type confidence region for high-dimensional spaces.

One method worth considering is constructing a series of two-dimensional ellipsoids, then combining them to build a confidence region. For example, for a 100-group query, we can construct an ellipsoid for the first two groups, another ellipsoid for the third and fourth groups, and so on. After doing that, we can treat these 50 pairs as 50 "combined" dimensions, and use Bonferroni's inequality to obtain a single region.

Unfortunately, it does not appear that such a confidence region would be very beneficial. The combined ellipsoid confidence region is constructed using Bonferroni's inequality, so it does not fully utilize all the correlations among the groups. Therefore, this confidence region is not expected to work well with the pruning-based algorithm. Thus, the following discussion mainly compares the combined ellipsoid confidence region and the axis-parallel confidence region using the interval sweeping algorithm.

For the comparison, we construct a 95% axis-parallel confidence region and a 95% combined ellipsoid confidence region using the methods discussed before. Then we compute the coefficient associated with each dimension. Since the coefficient determines the width of each interval, a smaller coefficient indicates better results. When using the axis-parallel confidence region, the coefficient associated with each group is 3.4. When using the combined ellipsoid confidence region, if all the groups are independent, the coefficient is 3.7. If all the groups are fully positively correlated, the coefficient is 3.3. This indicates that the combined ellipsoid confidence region is slightly worse than the axis-parallel confidence region if the groups are independent, and slightly better if all the groups are

fully positively correlated. Overall, we do not observe significant benefits from using the combined ellipsoid confidence region.

6.5 Experimental Evaluation

In this section, I experimentally evaluate the utility of the three methods I have proposed to return top-k groups with high probability. I am interested in three questions:

1. For each of the three methods, how much time is required to obtain the result?
2. For each method, how efficient is the return set, in terms of minimizing the number of groups returned?
3. Is each of the proposed algorithms correct?

The answers to the first two questions may be different for different data sets. In general, these answers are determined by the distribution of the ranking scores of the groups, the correlations among the groups, and the total number of groups. Therefore, in order to answer the first two questions, I run each of the three methods on different data sets. To answer the last question, the easiest way is to use Monte Carlo methods.

6.5.1 Running Time Experiments

6.5.1.1 Experimental setup

a. Query and Schema Tested. In the experiments, the following TPC-H [100] style SQL statement is used:

    SELECT c_name, sum(lextprice * (1 - ldiscount))
    FROM supplier s, orders o, lineitem l
    WHERE o_orderkey = l_orderkey AND s_suppkey = l_suppkey
    GROUP BY c_custkey, c_name
    HAVING sum(lextprice * (1 - ldiscount)) IN TOP 5

The SQL query selects the top 5 suppliers by revenue.

b. Parameters Considered. There are three key parameters that affect the behavior of the algorithms: the number of groups, the distribution of the associated g values, and the correlations among group-wise estimates. I discuss each of them below.

The number of groups in the query is an important parameter in the experiment. It determines the size of the covariance matrix, which dominates the running time of the non-sweeping algorithms. In order to control the number of groups, I control the total number of distinct suppliers $N_s$ appearing in the data set.

The distribution of the g values determines the difficulty of accurately returning k groups. When the larger g values are much larger than the smaller ones, the problem is easy, because we expect the sample to reflect the distribution: the estimates of the larger g values are also expected to be much larger than the estimates of the smaller ones. Therefore, it is very easy to find those groups whose associated g values are much larger than the others. On the other hand, when all the associated g values are very close, it is rather difficult to figure out which group is a top-k group, because the estimated g values are close and it is impossible to tell which associated g value is truly larger than the others. One way to control the distribution of the associated g values is to control the number of records for each group in the lineitem relation. In this experimental setup, I use a zipf distribution to determine how many records each unique supplier has in the lineitem relation. The zipf distribution is determined by the number of suppliers $N_s$ and the coefficient $z_s$.

The correlations among group-wise estimates affect the accuracy of the algorithms. Intuitively, when the correlations are available, a confidence region with less volume may be constructed. Thus, the correlation-aware pruning-based algorithm may have a better chance to prune more groups. For example, if every group is perfectly, positively correlated with every other group, and the variances of the groups are the same, the

confidence region built using eigenvalue/eigenvector decomposition is just a line segment in an n-dimensional space. In the experimental setup, I control the correlations among group-wise estimates by controlling how important the orderkeys are in the lineitem relation. Intuitively, if a particular order has a significant number of records in the lineitem relation, and these records are equally assigned to different suppliers, there are strong correlations among group-wise estimates. This is because if this important order is not in the sample, each group-wise estimate is underestimated. Otherwise, each group-wise estimate is overestimated. Thus, it is possible to control the correlations among the group-wise estimates by controlling the frequencies of the orders in the lineitem relation. In my experimental setup, the frequencies of the orders are determined by a zipf distribution with domain size equal to the total number of orders (1500000) and coefficient $z_o$.

c. Data Generation. Given the total number of suppliers $N_s$ and the two zipf coefficients $z_s$ and $z_o$, I generate the experimental data in the following steps:

1. Generate a TPC-H scale 1 data set.
2. Randomly select $N_s$ records from the supplier relation and drop all other records in the supplier relation.
3. For each record in lineitem, replace the original orderkey by taking a sample from a zipf distribution with domain size 1500000 and coefficient $z_o$, and replace the original suppkey by taking a sample from a zipf distribution with domain size $N_s$ and coefficient $z_s$.

I then use the entire supplier relation, along with a 30% sample from the orders relation and a 10% sample from the lineitem relation.

d. Hardware and Software Used. All the experiments are run on a DELL desktop with an Intel dual core 1.8G CPU and 2GB memory using OpenSUSE Linux 11. The code is written in C++ and compiled using gcc 4.3.2.

e. Experimental Procedure. I prepare several databases for this experiment. For $N_s = 100$, I prepare 4 different databases: ($z_s = 0$, $z_o = 0$), ($z_s = 0$, $z_o = 1$), ($z_s = 0.6$, $z_o = 0$),

($z_s = 0.6$, $z_o = 1$). For both $N_s = 1000$ and $N_s = 10000$, I prepare two databases: ($z_s = 0$, $z_o = 0$) and ($z_s = 1$, $z_o = 1$), respectively. For each database, I then run the three algorithms 5 times and average the running time.

It is useful to briefly summarize what these databases look like. When $z_s$ is 0, the g values of the groups are almost identical. When $z_s$ is 0.6, the largest g value is around 14 times as large as the smallest one. The difference between the two largest g values is around 40%. When $z_o$ is 0, the correlation coefficients among groupwise estimates are around 0.01, where the correlation coefficient is defined as $coef(u, v) = cov(u, v) / \sqrt{var(u)\, var(v)}$. When $z_o$ is 1, the correlation coefficients among groupwise estimates exceed 0.97. The standard deviation (the square root of the variance) of each of the groupwise estimates is around 10% of the corresponding estimate (see footnote 2).

Footnote 2: The variance increases when the correlations among the groups increase. When the correlation is very strong, the standard deviation is larger than 10% of the estimate.

6.5.1.2 Results

The results are shown in Table 6-1. Running times are given in seconds. Numbers are rounded to the nearest second. When the number of groups is 10000, some setups cannot finish in a reasonable amount of time, resulting in an N/A in the table. This is because the pruning-based and two-stage algorithms may require operations over a 10000 by 10000 covariance matrix, which is not possible using the experimental hardware.

6.5.1.3 Discussion

The running time of the interval sweeping algorithm is always very small, and the running times of the other algorithms are short when there are 100 groups. But the running times of the other two algorithms increase when the number of groups increases. The overall running time is mainly determined by the number of covariance entries that need to be computed. Processing the matrix in order to do eigenvalue/eigenvector decomposition also requires

some time, approximately 1 minute for a 1000 by 1000 matrix. The interval sweeping algorithm is the most efficient algorithm in terms of running time.

    Running time for data sets that have 100 groups
    zo    zs    Interval Sweeping   Pruning-Based   Two-Stage
    0     0     4                   21              23
    1     0     4                   18              20
    0     0.6   4                   22              5
    1     0.6   4                   21              8

    Running time for data sets that have 1000 groups
    zo    zs    Interval Sweeping   Pruning-Based   Two-Stage
    0     0     5                   221             226
    1     0.6   4                   198             29

    Running time for data sets that have 10000 groups
    zo    zs    Interval Sweeping   Pruning-Based   Two-Stage
    0     0     5                   N/A             N/A
    1     0.6   4                   N/A             209

Table 6-1. Running time (in seconds) for two extreme cases for the three algorithms. The first case is that the associated g values are almost identical, and the second case is that the associated g values are highly separable. For 100 groups, both low-correlation and high-correlation combinations are provided. For 1000 and 10000 groups, only the first case with low correlation and the second case with high correlation are shown.

6.5.2 Efficiency Experiments

I first consider the two extreme cases: when the associated g values are highly separable, and when the associated g values are very close. For this purpose, I first use the 8 databases from the previous set of experiments. For $N_s = 100$, I prepare 4 different databases: ($z_s = 0$, $z_o = 0$), ($z_s = 0$, $z_o = 1$), ($z_s = 0.6$, $z_o = 0$), ($z_s = 0.6$, $z_o = 1$). For both $N_s = 1000$ and $N_s = 10000$, I prepare two databases: ($z_s = 0$, $z_o = 0$) and ($z_s = 1$, $z_o = 1$), respectively.

I next consider the case between the two extreme cases. In order to investigate the middle case more carefully, I choose $N_s = 100$ and prepare a set of $z_o$ and $z_s$ values so that each time the interval sweeping algorithm returns approximately 50 groups, while I vary the average correlation coefficient from 0.1 to more than 0.95. Notice that when increasing $z_o$ to increase the correlations among the groupwise estimates, the variances of the groupwise

estimators also increase. Therefore, in order to guarantee that the interval sweeping algorithm always returns approximately 50 groups, I also need to increase $z_s$ accordingly.

6.5.2.1 Results

The results are shown in Tables 6-2 and 6-3. When the number of groups is 10000, the pruning-based algorithm and some individual cases of the two-stage algorithm cannot finish in a reasonable time, resulting in an N/A in the table.

    Number of returned groups for data sets that have 100 groups
    zo    zs    Interval Sweeping   Pruning-Based   Two-Stage
    0     0     100                 100             100
    1     0     100                 100             100
    0     0.6   5                   5               5
    1     0.6   10                  6               6

    Number of returned groups for data sets that have 1000 groups
    zo    zs    Interval Sweeping   Pruning-Based   Two-Stage
    0     0     1000                1000            1000
    1     0.6   12                  7               7

    Number of returned groups for data sets that have 10000 groups
    zo    zs    Interval Sweeping   Pruning-Based   Two-Stage
    0     0     10000               N/A             N/A
    0.7   0.3   208                 N/A             189
    1     0.6   22                  N/A             13

Table 6-2. Number of returned groups for two extreme cases for the three algorithms. The first case is that the associated g values are almost identical, and the second case is that the associated g values are highly separable.

    zo     zs     Avg Coef   Interval Sweeping   Pruning-Based   Two-Stage
    0.7    0.11   0.11       49                  100             61
    0.8    0.13   0.35       48                  100             55
    0.9    0.15   0.65       51                  97              56
    0.92   0.19   0.75       47                  90              50
    0.94   0.21   0.83       49                  68              49
    0.96   0.23   0.88       50                  44              44
    0.97   0.25   0.93       48                  38              38
    0.98   0.27   0.97       49                  25              25

Table 6-3. Number of returned groups for the middle case for the three algorithms. The total number of groups is 100. The parameters are chosen so that the number of groups returned by the interval sweeping algorithm is approximately 50.

6.5.2.2 Discussion

Several interesting results can be observed. I point out a few significant ones here.

First, in the two extreme cases, all algorithms either return all the groups or return very few groups, regardless of how these groups are correlated. The reason is that when the associated g values are highly separable, the estimates obtained from the samples are also highly separable. The high separation makes it very easy to find the correct top-k groups. On the other hand, if all the scores are very close, in general none of the algorithms can figure out which group truly has a higher associated g value. In an extreme case, the correlation may help. Recall the example in which all the groups are fully positively correlated, and the variances of these groups are all identical. In this case, we only need to return the k groups whose associated g value estimates are the k largest ones. However, in reality, the associated g value estimates may have different variances, and they are not fully positively correlated. Thus, it is very difficult for any of the three algorithms to work well if all the scores are very close.

Second, in the middle case, as shown in Table 6-3, the pruning-based algorithm outperforms the interval sweeping algorithm when the correlations among groupwise estimators are strong. Otherwise, the pruning-based algorithm does not help much. The same observation holds for the two-stage algorithm. The reason is that we are required to use the two-dimensional projected shape to rank a pair of groups. When we build an eigenvector-parallel confidence region, we do not know how these eigenvectors are oriented in the n-dimensional space. In reality, the two-dimensional projected shape can be arbitrary, especially when the correlations are not very strong. When the correlations are very strong, a few important eigenvectors dominate the overall space. These eigenvectors are associated with much larger variances, and the two-dimensional projected shape for any pair of groups becomes more regular. Thus, strong correlations help return fewer groups. In reality, such strong correlations may not be very frequent. Therefore, the interval sweeping algorithm is preferred unless strong correlations are observed.
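The effect of strong correlation on pairwise ranking can be seen in a small two-group sketch of the pruning idea. The code below is an illustration under stated assumptions, not the thesis's PruningTopK or IsSmaller: the region is assumed to be a box centered at `center` whose edges follow orthonormal directions `axes` with half-lengths `half_widths`, the pairwise test maximizes the linear function x_i - x_j over that box, dimensions are 0-based, and all names and numbers are hypothetical.

```python
import math


def is_smaller(i, j, center, axes, half_widths):
    """True if x[i] < x[j] holds for every point x of the region.

    Points have the form center + sum(t_m * w_m * v_m) with t_m in [-1, 1],
    so the maximum of x[i] - x[j] is the center gap plus the sum of
    w_m * |v_m[i] - v_m[j]|.
    """
    gap = center[i] - center[j]
    spread = sum(w * abs(v[i] - v[j]) for v, w in zip(axes, half_widths))
    return gap + spread < 0.0


def pruning_top_k(center, axes, half_widths, k):
    """Keep dimension i unless at least k other dimensions always beat it."""
    n = len(center)
    keep = set()
    for i in range(n):
        higher = sum(1 for j in range(n)
                     if j != i and is_smaller(i, j, center, axes, half_widths))
        if higher < k:
            keep.add(i)
    return keep


def bounding_box(center, axes, half_widths):
    """Axis-parallel bounding box, as used by interval sweeping."""
    box = []
    for d, c in enumerate(center):
        extent = sum(w * abs(v[d]) for v, w in zip(axes, half_widths))
        box.append((c - extent, c + extent))
    return box


r = 1.0 / math.sqrt(2.0)
center = [10.0, 8.0]
axes = [(r, r), (r, -r)]      # long axis lies along first-group = second-group
half_widths = [5.0, 0.5]      # strong positive correlation: thin region

print(pruning_top_k(center, axes, half_widths, 1))   # only dimension 0 remains
print(bounding_box(center, axes, half_widths))       # the two intervals overlap
```

Here the thin region lies entirely on one side of the line first-group = second-group, so the pairwise test prunes the second group, while the overlapping bounding-box intervals would force interval sweeping to return both. With weak correlation the two half-widths are comparable and the pairwise test rarely succeeds, which is consistent with the middle-case results in Table 6-3.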

6.5.3 Monte Carlo Experiments

To verify the correctness of each of the proposed algorithms, I design a Monte Carlo experiment. I set $N_s = 10$ and choose 5 different combinations of $z_o$ and $z_s$: ($z_o = 0$, $z_s = 0$), ($z_o = 1$, $z_s = 0.6$), ($z_o = 0.7$, $z_s = 0.11$), ($z_o = 0.94$, $z_s = 0.21$), and ($z_o = 0.98$, $z_s = 0.27$). The first two data sets cover the two extreme cases. The last three data sets cover the middle case with low, medium, and high correlations. For each data set, I take 100 independent samples and run each of the three proposed algorithms on each of the 100 samples. I then compute the true top-5 groups and count, for each of the three methods, how many times the return set correctly contains all top-5 groups. The results are shown in Table 6-4.

    zo     zs     Interval Sweeping   Pruning-Based   Two-Stage
    0      0      100                 100             100
    1      0.6    100                 100             100
    0.7    0.11   98                  100             98
    0.94   0.21   100                 100             100
    0.98   0.27   100                 100             100

Table 6-4. Number of times that each algorithm correctly returns all top-5 groups in 100 independent tries.

6.5.3.1 Discussion

The results show that most of the time, the proposed algorithms correctly return all top-5 groups. In fact, the number of times that they correctly return all top-k groups is larger than 95. This indicates that the confidence of the proposed methods is conservative for normal database queries.

The experimental results are expected because, when building confidence regions, I require every score associated with each group to be in the confidence region with high probability. This requirement is very restrictive for most normal database queries. To understand why this is true, consider a confidence region for the point $\vec{g} = \langle g_1, g_2, \ldots, g_n \rangle$, where $\vec{g}$ is not in the confidence region. However, some points near $\vec{g}$ are in the confidence region. The points near $\vec{g}$ have similar properties to $\vec{g}$. For example, the top-k groups of such a point may be identical or highly similar to

PAGE 127

the true top-k groups. If we still run any of these algorithms, the set returned contains the top-k groups of each point in the input region. Since the points near $\vec{g}$ have top-k groups that are similar to the true top-k groups, it is likely that the algorithm still correctly returns all top-k groups, even when $\vec{g}$ is not in the confidence region.

Building a less restrictive confidence region is possible, but a less restrictive confidence region may cause problems in unusual cases. Consider an example in which we would like to return the top-5 groups for a 100-group query. Instead of building a confidence region that contains $\vec{g}$ with high probability, we build a confidence region that contains the top-5 associated g values with high probability. This is a less restrictive confidence region because seeing $\vec{g}$ in the confidence region with high probability is not required. Any point that has the same values as $\vec{g}$ in the top-5 dimensions of $\vec{g}$ correctly contains the top-5 associated g values. If we run any of the three algorithms, then most of the time the algorithms are likely to return the top-5 groups correctly, because with high probability the points in the confidence region that correctly contain the true top-5 associated g values are points near $\vec{g}$. However, in rare cases, points that correctly contain the true top-5 associated g values can be very far away from $\vec{g}$. In general, the values in the remaining 95 dimensions of such points can be arbitrarily large. For example, these values can be larger than the top-1 true associated g value. If the only points in a confidence region that correctly contain all top-5 associated g values are such points, we have to return all 100 groups, because the top-1 true associated g value can appear at any position between the largest and the 96th largest dimensional values in these points. In reality, we never know which input confidence region falls into this rare case. Thus, for safety, we must always return 100 groups. This makes any algorithm impractical for this type of confidence region. Requiring all the associated g values to be in the confidence region eliminates this problem, because only $\vec{g}$ contains all associated g values, and its top-5 dimensional
values are the true top-5 associated g values. This allows the algorithm to return only the top-k groups for each point in the confidence region. A side effect is that the confidence of the returned results may be conservative for normal database queries, but there is no easy way to get around this.

6.6 Which Algorithm Should Be Implemented for a Real World System?

Given these results, I suggest using the interval sweeping algorithm in a real-world system, for three primary reasons. First, the interval sweeping algorithm is very easy to implement compared to the two other algorithms, both of which require us to compute the eigenvectors of a matrix. Theoretically speaking, the math for computing the eigenvectors of an input matrix is not difficult. But in practice, implementing a robust algorithm that can handle every special input case is difficult. When I tried to implement this functionality using some open-source libraries, several of them did not work for certain input matrices. A real-world system should be very robust, and implementing a sufficiently robust eigenvector computation can be difficult.

Second, the interval sweeping algorithm requires only a few seconds even when the number of groups is large. The other two algorithms require significantly more time when the number of groups is large. The extra time is mainly due to the covariance computation. Based on my experience, when the matrix is larger than 1000 by 1000, the computation becomes inefficient and time-consuming.

Finally, in many real cases, the interval sweeping algorithm generally performs well. Even when the other algorithms are better than the interval sweeping algorithm, the results obtained by the interval sweeping algorithm remain acceptable. Furthermore, many real-world queries may not have very strong correlations. Therefore, for a practical implementation, the interval sweeping algorithm may be preferred.

6.7 Conclusion

This chapter has considered the problem of returning all top-k groups with high probability when only database samples are available. I developed three novel methods to
solve this problem. Each method requires constructing a confidence region and returning the union of the top-k dimensions over all points in the region. The efficiency, accuracy, and correctness of these methods are experimentally evaluated.
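
As a rough illustration of returning the union of the top dimensions over all points in a region, the sketch below handles the simplest possible region: an axis-aligned box given by one interval per group. The box shape is an assumption made for illustration; the code does not reproduce any of the three methods developed in this chapter, and the function name `topk_union_over_box` and the sample intervals are hypothetical. For a box, a group can rank in the top k at some point of the region exactly when fewer than k other groups have a lower bound above its upper bound (ties are resolved in favor of inclusion).

```python
def topk_union_over_box(intervals, k):
    """Return the groups that can rank in the top k at some point of the box.

    intervals: dict mapping group name -> (lower, upper) bound on its value.
    """
    result = []
    for group, (_, upper) in intervals.items():
        # Count the groups that beat this group even in its best case,
        # i.e. whose lower bound exceeds this group's upper bound.
        certainly_larger = sum(
            1
            for other, (lower, _) in intervals.items()
            if other != group and lower > upper
        )
        # The group can reach the top k only if fewer than k groups
        # are certainly larger than it.
        if certainly_larger < k:
            result.append(group)
    return result


if __name__ == "__main__":
    # Hypothetical interval estimates for five groups.
    intervals = {
        "A": (90.0, 110.0),
        "B": (85.0, 95.0),
        "C": (60.0, 88.0),
        "D": (40.0, 55.0),
        "E": (10.0, 20.0),
    }
    print(topk_union_over_box(intervals, k=2))
```

With these intervals and k = 2, groups A, B, and C are returned: C cannot be excluded because only A is certainly larger than it. Wider intervals enlarge the returned set, which mirrors the trade-off between confidence and the number of returned groups discussed in this chapter.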

BIOGRAPHICAL SKETCH

Fei Xu was a Ph.D. student in the Computer and Information Science and Engineering department at the University of Florida, where he was a member of the Database Center and worked under the supervision of Christopher Jermaine. Fei's research interest is mainly in data management. During his Ph.D. study, Fei worked on approximate query processing and uncertain data management. Fei Xu was born in Lanxi, China. He received his bachelor's degree in 2000 and his master's degree in 2003 from Zhejiang University. He joined the University of Florida in August 2004.