Statistical Approximations of Database Queries with Confidence Intervals

Permanent Link: http://ufdc.ufl.edu/UFE0043509/00001

Material Information

Title: Statistical Approximations of Database Queries with Confidence Intervals
Physical Description: 1 online resource (136 p.)
Language: English
Creator: Chen, Lixia
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2011

Subjects

Subjects / Keywords: aggregation -- approximation -- databases -- probabilistic -- statistical
Computer and Information Science and Engineering -- Dissertations, Academic -- UF
Genre: Computer Engineering thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract: Explosive data growth in databases presents challenges for fast query processing. Computing exact values of complex queries over large databases can take a long time because of the large volume of data they need to access. Approximate query processing provides an essential solution to this problem, and versatile approximation methods are used in databases nowadays. To evaluate the effectiveness of these approximations, confidence intervals of estimators are desired to provide error guarantees for estimated values. The focus of this dissertation is approximating database queries and providing corresponding confidence intervals. In this dissertation, we revisit histograms, an approximation method used in traditional databases, and formulate a statistical assumption for them. Then we estimate aggregations over probabilistic databases and provide efficient algorithms to compute them. The traditional assumption for interpreting histograms and justifying approximate query processing methods based on them is that all elements in a bucket have the same frequency -- this is called the uniform distribution assumption. We show that a significantly less restrictive statistical assumption -- the elements within a bucket are randomly arranged even though they might have different frequencies -- leads to identical formulas for approximating aggregate queries using histograms. Under this assumption, the behavior of both unidimensional and multidimensional histograms is analyzed and tight error guarantees for the quality of approximations are provided. As an example of how the statistical theory of histograms can be extended, we show how XSketches -- an approximation technique for XML queries that uses histograms as building blocks -- can be statistically analyzed. The combination of the random shuffling assumption and the other statistical assumptions associated with XSketch estimators ensures a complete statistical model and error analysis for XSketches.
Characterizing SUM-like aggregates over probabilistic databases is considered a hard problem because the size of the probability space is exponential in the database size. In this dissertation, we aim to compute aggregates in probabilistic databases with confidence intervals. Both methods for computing confidence intervals -- distribution-dependent and distribution-independent -- require the expected value and variance of probabilistic aggregates. We show a general framework to compute moments of aggregations in probabilistic databases. Based on this framework, we derive mathematical formulas for the first two moments of aggregates for graphical models and tuple-independent models. We then present InDB and Middleware algorithms to evaluate the aggregates efficiently. Our prototype implementation with Postgres suggests that our characterization of aggregates incurs little overhead for the tuple-independent model and manageable overhead for the graphical model. We also extend computation of the moments to AVERAGE-like aggregates and GROUP-BYs to show the usefulness of the proposed methods. To extend the above analysis of aggregates over probabilistic databases, we study aggregates with duplicate removal. We analyze the properties of PTIME queries and conclude that the correlations introduced by projections are hierarchical and can be factorized recursively. We derive the first two moments of estimators of probabilistic aggregates with duplicate removal and design a hash-based algorithm to implement them. Our comprehensive experiments show that this algorithm has a small overhead when the correlation rate is small and a reasonable overhead when the correlation rate is large.
General Note: In the series University of Florida Digital Collections.
General Note: Includes vita.
Bibliography: Includes bibliographical references.
Source of Description: Description based on online resource; title from PDF title page.
Source of Description: This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility: by Lixia Chen.
Thesis: Thesis (Ph.D.)--University of Florida, 2011.
Local: Adviser: Dobra, Alin.

Record Information

Source Institution: UFRGP
Rights Management: Applicable rights reserved.
Classification: lcc - LD1780 2011
System ID: UFE0043509:00001

Full Text

STATISTICAL APPROXIMATIONS OF DATABASE QUERIES WITH CONFIDENCE INTERVALS

By
LIXIA CHEN

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA
2011

© 2011 Lixia Chen

To my family

ACKNOWLEDGMENTS

First and foremost, I would like to express my deep and sincere gratitude to my PhD adviser, Dr. Alin Dobra. I thank him for his invaluable support, guidance and inspiration in my research. I have learned a lot from him in all aspects of my research: computer techniques, formulating problems and presenting the research results. This thesis would not have been possible without his help.

I thank Dr. Arunava Banerjee, Dr. Sanjay Ranka, Dr. Tamer Kahveci and Dr. Ronald Randles for serving on my supervisory committee. Dr. Banerjee's solid theory background has enlightened me in exploring new possibilities. Dr. Randles provided helpful suggestions in my work. Dr. Kahveci and Dr. Ranka encouraged me in hard times.

I would like to convey my utmost thanks to my family, who mean everything to me. My husband, Yuchu Tong, has been with me through all the happy and hard times. He always brings fun to my life and encourages me to face challenges in my study. I enjoy every moment with him. I am very grateful to my parents for their love and encouragement. They have always been there, ready to support me. They are not only good parents but also excellent teachers to me. They have inspired my interest in science and opened my mind to it. Many thanks also go to my sisters and my brother. Every moment with them is wonderful. I thank my newborn daughter for making me happier than ever.

I also want to thank all my friends for their support in my study and life. An incomplete list includes Xuelian Xiao, Wei Peng, Yixi Ouyang, Jiangyan Xu and Wangyuan Zhang. We had a memorable time, which made my graduate study enriched and enjoyable.

Finally, I thank the National Science Foundation grant NSF-CAREER-IIS-0448264 for the financial support of this work.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
  1.1 Related Work
  1.2 Confidence Intervals from Moments
  1.3 Contributions

2 HISTOGRAMS AS STATISTICAL ESTIMATORS FOR AGGREGATE QUERIES
  2.1 Background
  2.2 Problem Formulation
    2.2.1 Size of Join Problem
      2.2.1.1 Unidimensional Size of Join Estimation Problem
      2.2.1.2 Multidimensional Size of Join Estimation Problem
    2.2.2 Selectivity Estimation Problem
    2.2.3 Aggregates over Joins Problem
    2.2.4 Comments on Obtaining Error Guarantees from Expected Value and Variance Estimates
  2.3 Histograms as Function Approximators and Statistical Nonparametric Models
  2.4 Random Shuffling Assumption
    2.4.1 Definition of Uniform Random Shuffling Assumption
    2.4.2 Moments under Random Shuffling Assumption
  2.5 Histograms under Random Shuffling Assumption
    2.5.1 One-bucket Histograms
    2.5.2 Histograms with Aligned Buckets
  2.6 Comparison with Sampling and Sketches
    2.6.1 Sampling
    2.6.2 Sketches
    2.6.3 Comments on Comparison
  2.7 Histograms when Random Shuffling Assumption Does Not Hold
    2.7.1 Random Histograms on Arbitrary Problems
    2.7.2 Random Histograms for Self-join Size Computation
    2.7.3 Comments on End-biased Histograms
  2.8 Generalization to Multidimensional Histograms
    2.8.1 General Random Shuffling Properties
    2.8.2 Multidimensional Random Shuffling Distribution
    2.8.3 One-bucket Multidimensional Histogram
    2.8.4 Low Dimensional Histogram
    2.8.5 Estimating Size of Star-Joins using One-bucket Histograms
  2.9 XSketch Estimator under Random Shuffling Assumption
    2.9.1 Introduction
    2.9.2 XSketches
    2.9.3 Shuffled XSketches
  2.10 Comments on Random Shuffling Assumption
  2.11 Summary

3 AGGREGATION OVER PROBABILISTIC DATABASES WITH CONFIDENCE INTERVAL
  3.1 Background
  3.2 Preliminaries
    3.2.1 Queries and Equivalent Algebraic Expressions
    3.2.2 Probabilistic Database as a Description of a Probability Space
    3.2.3 Use of Kronecker Symbol δij
  3.3 Analysis for Multi-relation SUM Aggregates for Tuple-independent Model
    3.3.1 Dependence in Tuples after Aggregations
    3.3.2 M^2 Computation of Moments
    3.3.3 Reducing the Time Complexity
  3.4 General Framework of Computing Moments
    3.4.1 Notation
    3.4.2 General Framework for M log M Computation
  3.5 Computation of Moments
    3.5.1 InDB Techniques
      3.5.1.1 InDB M^2 Technique
      3.5.1.2 InDB M log M Technique
    3.5.2 Middleware M log M Technique
    3.5.3 Algorithms
  3.6 Extensions
    3.6.1 Non-linear Aggregates
    3.6.2 GROUP BY Queries
  3.7 Experiments
    3.7.1 Experiments on Tuple-independent Model
    3.7.2 Experiments on Graphical Model
    3.7.3 The Central Limit Theorem and Probabilistic Aggregates
  3.8 Summary

4 PROBABILISTIC AGGREGATION WITH DUPLICATE ELIMINATION
  4.1 Background
  4.2 Preliminaries
    4.2.1 Queries and Equivalent Algebraic Expressions
    4.2.2 Query Evaluation on Probabilistic Databases
  4.3 Analysis of Multi-relation Sum Aggregation with Duplicate Removal
    4.3.1 Probability of Correlated Groups
    4.3.2 Correlations in PTIME Queries
    4.3.3 Moments of Aggregates with Duplicate Elimination
  4.4 Hash-based Algorithm
  4.5 Experiments
    4.5.1 Implementation
    4.5.2 Results of Experiments
  4.6 Summary

5 CONCLUSIONS AND FUTURE WORK
  5.1 Dissertation Summary
  5.2 Future Directions

APPENDIX

A GENERAL UNIDIMENSIONAL HISTOGRAMS
B PROOF OF PROPOSITION 3
C PROOF OF PROPOSITION 4
D PROOF OF THEOREM 3
E PROOF OF THEOREM 5
F PROOF OF LEMMA 2
G PROOF OF THEOREM 3.3

REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

Table
2-1 Notations used in the chapter
4-1 Relation R1
4-2 Relation R2
4-3 The result of Q_{customer, probability} R1 ⋈ R2
4-4 The result of Q_{customer, product, probability} R1 ⋈ R2
4-5 Group u
4-6 Group v

LIST OF FIGURES

Figure
2-1 Frequency graph of relation F
2-2 Frequency graph of relation G
2-3 Histogram of the size of the join result
3-1 The Layers of Sequences
3-2 Total Time On TPC-H Queries, 1GB dataset
3-3 Aggregate Time On TPC-H Queries, 1GB dataset
3-4 Total Time On TPC-H Queries, 10GB data
3-5 Aggregate Time On TPC-H Queries, 10GB data
3-6 The graphic model used in the experiments
3-7 The experiments results in graphical model
3-8 PDF of sum of independent discrete random variables
3-9 CDF of sum of independent discrete random variables
3-10 PDF of sum of partly independent discrete random variables
3-11 CDF of sum of partly independent discrete random variables
4-1 Nested correlations in u ∧ v
4-2 u ∧ v
4-3 Hierarchical structure of correlations
4-4 =1 Running time on Q1
4-5 =1 Running time on Q2
4-6 =2 Running time on Q1
4-7 =2 Running time on Q2
4-8 =4 Running time on Q1
4-9 =4 Running time on Q2

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

STATISTICAL APPROXIMATIONS OF DATABASE QUERIES WITH CONFIDENCE INTERVALS

By Lixia Chen
December 2011
Chair: Alin Dobra
Major: Computer Engineering

Explosive data growth in databases presents significant challenges for fast query processing. Computation of exact values of complex queries over large databases can take a long time due to the large volume of data they need to access. Approximate query processing provides an essential solution to this problem. Versatile approximation methods are available for use in databases nowadays. In order to evaluate the effectiveness of these approximations, it is desirable to provide confidence intervals of estimators; the confidence intervals effectively provide error guarantees for estimated values. The focus of this dissertation is approximating database queries and providing corresponding confidence intervals. We first revisit histograms, the approximation method used in traditional databases, and interpret them as a statistical method. Then we estimate aggregates over probabilistic databases and provide efficient algorithms to compute them.

The traditional assumption for interpreting histograms and justifying approximate query processing methods based on them is that all elements in a bucket have the same frequency -- this is called the uniform distribution assumption. We show that a significantly less restrictive statistical assumption -- the elements within a bucket are randomly arranged even though they might have different frequencies -- leads to identical formulas for approximating aggregate queries using histograms. We analyze, under this assumption, the behavior of both unidimensional and multidimensional histograms and we provide tight error guarantees for the quality of approximations. As an example of how the statistical theory of histograms can be extended, we show how XSketches -- an approximation technique for XML queries that uses histograms as building blocks -- can be statistically analyzed. The combination of the random shuffling assumption and the other statistical assumptions associated with XSketch estimators ensures a complete statistical model and error analysis for XSketches.

Characterizing SUM-like aggregates over probabilistic databases is considered a hard problem because the size of the probability space is exponential in the database size. In this dissertation, we aim to compute aggregates in probabilistic databases with confidence intervals. Both methods for computing confidence intervals -- distribution-dependent and distribution-independent -- require the expected value and variance of probabilistic aggregates. We develop a general framework to compute moments of aggregates in probabilistic databases. Based on this framework, we derive mathematical formulas for the first two moments of aggregates for graphical models and tuple-independent models. We then present InDB and Middleware algorithms to evaluate the aggregates efficiently. Our prototype implementation using Postgres suggests that our characterization of aggregates incurs little overhead for the tuple-independent model and manageable overhead for the graphical model. We also extend computation of the moments to AVERAGE-like aggregates and GROUP-BYs to show the usefulness of the proposed methods.

To extend the above analysis of aggregates over probabilistic databases, we study aggregates with duplicate removal. We analyze the properties of PTIME queries and conclude that the correlations introduced by projections are hierarchical and can be factorized recursively. We derive the first two moments of estimators of probabilistic aggregates with duplicate removal and design a hash-based algorithm to implement them. Our comprehensive experiments show that this algorithm has a small overhead when the correlation rate is small and a reasonable overhead when the correlation rate is large.

CHAPTER 1
INTRODUCTION

With the advancements in science and technology, more and more information needs to be stored in databases. According to the JASON report [50], the data volume is projected to be in the hundreds of petabytes by 2015. When data grows faster than the hardware growth rate given by Moore's law, processing large data may take even longer than it does today. Approximate query processing provides an essential method for fast query response over large datasets. In addition, exact answers are not necessary in some cases, for which approximate query processing is well suited. Many applications, such as data mining, online analytical processing and decision support systems, belong to this category due to their exploratory nature and the associated imprecision in the query or in the use of the query result.

Probabilistic databases have attracted much attention in recent years because of the increasing demand for processing uncertain data. Since data in probabilistic databases are uncertain by nature, the size of the probability space is exponential in the size of the database. Computing the exact answers of complex queries over such databases is inefficient and challenging. Query approximation becomes an intuitive alternative.

One natural question to ask about approximation methods is how effective the estimated values are and how much they vary. For example, it is very hard for a person to decide whether to buy a stock given only a forecast price of $10,000, because a 95% confidence interval of $10,000 ± 10 and a 95% confidence interval of $10,000 ± 10,000 make a lot of difference. In both cases the estimated values are the same, but the latter has the potential to be very risky whereas the first is safe.

In this dissertation, we explore this question: how effective are the approximate methods? The prerequisite of this question is the statistical characterization of approximation methods. Our research revolves around these two questions and studies approximate query processing for both traditional databases and probabilistic databases from a statistical point of view.
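To make the exponential size of the probability space concrete, the following sketch compares brute-force enumeration of all possible worlds with closed-form moments for a single-relation SUM. It assumes a tuple-independent model in which each tuple is present independently with a given probability; the relation `tuples` and the function names are illustrative and are not taken from the dissertation, whose algorithms address the harder multi-relation case.

```python
from itertools import product

# Hypothetical tuple-independent relation: (value, probability of presence).
tuples = [(10.0, 0.9), (20.0, 0.5), (5.0, 0.2), (40.0, 0.7)]

def moments_by_enumeration(rel):
    """Mean and variance of SUM by visiting all 2^n possible worlds."""
    mean = 0.0
    second = 0.0
    for world in product([0, 1], repeat=len(rel)):
        prob = 1.0
        total = 0.0
        for present, (value, p) in zip(world, rel):
            prob *= p if present else 1.0 - p
            total += value * present
        mean += prob * total
        second += prob * total * total
    return mean, second - mean * mean

def moments_closed_form(rel):
    """Mean and variance of SUM in O(n), using independence of the tuples."""
    mean = sum(v * p for v, p in rel)
    var = sum(v * v * p * (1.0 - p) for v, p in rel)
    return mean, var

if __name__ == "__main__":
    print(moments_by_enumeration(tuples))
    print(moments_closed_form(tuples))
```

Both functions agree on this toy relation, but the enumeration doubles its work with every added tuple, which is why moment formulas that avoid visiting the possible worlds matter.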

1.1 Related Work

Versatile approximation methods have been proposed in the literature. Sampling is one of them and is widely studied and used in databases. Sampling methods use a small part of the data to estimate the distribution of a large dataset. There is a large body of research on sampling; we briefly introduce just some of it. Sample synopses can be computed online. Haas [36 38] computed sample synopses of aggregation at runtime and gradually refined them during processing. Later, the DBO [54] system extended this method and removed the limitation of approximation in main memory. Sample synopses can also be precomputed. Acharya [5] precomputed synopses for foreign key joins to compensate for losing uniformity. The AQUA project [4 5 27 29] of Bell Labs used samples to precompute synopses for approximate query answering. Of all the approximation methods, sampling techniques are flexible, the most studied, and have shown success in the context of databases. However, the uniform sampling requirement limits their applications. In addition, if sampling methods operate on queries with low selectivity or on highly skewed data, large errors will be produced. Although Ganti [26] and Chaudhuri [13] aimed to tackle these problems, both solutions are heuristics.

Histograms perform better than sampling methods on queries with low selectivity. They are simple and easy to construct and operate. Because of these advantages, histograms have proven to be successful approximation methods in commercial databases and extensive research has been done on them. A nice survey of the histogram work can be found in Ioannidis's paper [45]. Types of histograms proposed in the literature include Equi-width histograms [82], Equi-depth histograms [70] and several other types. To capture all the classes of histograms, Poosala [81] provided a taxonomy which can represent different types of histograms such as Equi-sum histograms [70 82], V-optimal histograms [81] and spline histograms [64].

Since approximation errors are inevitable for histograms, many techniques have been proposed to minimize different error measures. Some are focused on constructing buckets offline. Poosala and his collaborators [48 81] proposed V-optimal histograms to minimize the sum of squared errors. Guha [33 34] constructed optimal histograms for range queries and minimized relative errors. Subsequent work [32] provided linear time approximation algorithms to construct histograms. Recently Cormode [15 17] extended histogram research to probabilistic databases and proposed a general framework for different error metrics. Other techniques take advantage of feedback from execution to optimize histograms. Aboulnaga [3] first proposed Self-tuning histograms. Later, Lim [69] used Two-phase methods to construct Self-tuning histograms. Bruno [11] and Srivastava [92] built multidimensional histograms by exploiting workload information and the entropy maximization principle. Kaushik [59] addressed distinct value problems when constructing histograms from query feedback.

Although there is abundant literature on histograms, the amount of theoretical work on characterizing histograms as approximation methods for database queries is surprisingly small. Piatetsky-Shapiro and Connell [76] provided the first theoretical characterization of histograms, by deriving worst case and average case error guarantees for Equi-width and Equi-depth histograms used for selectivity estimation. The other theoretical characterizations of histograms can be found in the work of Ioannidis and his collaborators [41 46 79]. Their work is the only one applicable to estimation of aggregates over joins. Most of this work is concerned with optimality of histograms [41 43 44], for which, interestingly enough, the issue of computing the error of histograms can be cleverly avoided; the technical means to do this is to rely on majorization theory instead of a direct optimization. Most of this work [42] and small parts of the other papers we mentioned are concerned with actually characterizing the error of histograms, but most of the results apply only to One-bucket histograms; the only exception is worst case error estimation, which results in unrealistically large bounds.

The uniform frequency assumption was never formalized when it was used to derive average error bounds for histograms [76]. Only implicit explanations of what the assumption says were made throughout the histogram literature. Even if the uniform frequency assumption were formalized as a completely decorrelated placement of tuples in a bucket, it would ignore the skew within the bucket and thus provide looser bounds on behavior.

Recently, uncertain data have emerged in many areas, such as information extraction [25 65] and sensor networks [84 87]. Research work has combined histograms and wavelets with probabilistic databases. Cormode [17] studied the optimal histogram bucket boundaries and wavelet coefficients within a given error metric in probabilistic databases. Cormode [15] presented a dynamic programming framework to find the optimal probabilistic histograms for different error metrics. Besides the histogram method, other approximation work has been done on probabilistic databases. In order to allow uncertain data in databases, previous work [10 12 22 25 31 66 77] modeled the uncertain data and extended the standard relational algebra to the probabilistic algebra. Suciu [18 19] further studied the complexity of evaluating queries over probabilistic databases and proved that computing the probability of a Boolean query on a disjoint independent database has #P complexity.

Aggregates over probabilistic databases, perceived as a much harder problem by the community, have attracted some attention in recent years. Ross [86] studied aggregation over probabilistic databases. The focus is on probabilistic databases with attribute uncertainty, where the probability of each attribute lies in a bounded interval. [6 95] developed the TRIO system for managing uncertainty and lineage of data. Aggregation over TRIO is based on the possible worlds model, and therefore operations are simple to implement but intractable in most situations. [51 52] and [16] studied aggregates over probabilistic data streams. The problem in [51 52] is to estimate the expected value of various aggregates over a single probabilistic data stream (or probabilistic relation). They derived an efficient method to estimate Average for the One-relation case. [16] studied the same problem together with the estimation of the size of the join of two relations. The analysis provided in these papers is restricted to expectation and variance for the One-relation case and expectation for the Two-relation case. Furthermore, the aggregate is restricted to COUNT (the work is only concerned with frequency moments). It is important to note that the problem solved in all of this work is harder, since the estimation has to be performed with small space (the data streaming problem). It would be interesting to investigate how the formulas we derive could be approximated using small space as well.

[55 57 88 90] used graphical models to represent correlated tuples, but little work has been done on aggregates. [88] only presented a distribution of an average query on 500 tuples. MayBMS [9 30 40 61 63], MCDB [37 49 75] and PIP [60] are probabilistic DBMSs that have implemented expected values of aggregates.

The above research work focused on the expected value of aggregates. Inspired by the same observation that the expected value of aggregates cannot capture the distribution clearly, [85] studied the problem of dealing with HAVING predicates that necessarily use aggregates. The basic problem they consider is: compute the probability that, for a given group, the aggregate is in a given relationship with the constant k. The types of aggregates considered are MIN, MAX, COUNT and SUM, and the relationship is a comparison operator such as >. Only integer constants k are supported since the operations are performed on the semiring S_{k+1}. The probabilities of events
involved. This is especially troublesome for SUM aggregates, since \(k\) can be as large as the product of the size of the domain of the aggregate and the size of the group.

1.2 Confidence Intervals from Moments

Confidence intervals for estimators address the effectiveness of approximated values. By generating a lower and an upper limit, a confidence interval provides an error bound for the estimator, which is essential for users. For example, it is very hard for a person to decide whether to buy a stock given only a forecast price of $10,000, because a 95% confidence interval of $10,000 ± 10 and a 95% confidence interval of $10,000 ± 10,000 make a lot of difference. In both cases the estimate is the same, but the latter has the potential to be very risky whereas the first is safe.

The standard way to obtain confidence intervals for a random variable \(X\) is to compute its first two moments, the expectation \(E[X]\) and the variance \(\mathrm{Var}[X]\), and then to use either a distribution-dependent or a distribution-independent bound. The distribution-dependent bounds assume that the type of distribution is known and is one of the two-parameter distributions. The most common situation is the application of the Central Limit Theorem, or a similar result, which states that the distribution of sums of independent random variables is asymptotically normal. Irrespective of how the normality of the distribution is justified, the confidence interval with confidence \(1-\alpha\) based on moments and normality is:

\[ \left[\, E[X] - z_{\alpha/2}\sqrt{\mathrm{Var}[X]},\;\; E[X] + z_{\alpha/2}\sqrt{\mathrm{Var}[X]} \,\right] \]

with \(z_{\alpha/2}\) the upper \(\alpha/2\) quantile of the \(N(0,1)\) distribution.

An alternative is to use the distribution-independent bounds based on the Chebyshev inequality to provide conservative bounds (the bounds are correct irrespective of the distribution but might be unnecessarily large). This bound requires \(E[X]\) and \(\mathrm{Var}[X]\) as well. Let \(\epsilon\) be any positive real number, \(E[X]=\mu\) and \(\mathrm{Var}[X]=\sigma^2\); the bound is provided by

\[ P\left(|X-\mu| \ge \epsilon\right) \le \frac{\sigma^2}{\epsilon^2} \]
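The two recipes above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `normal_interval` and `chebyshev_interval` and the numeric inputs are hypothetical, and the Chebyshev half-width is obtained by setting the right-hand side of the inequality equal to \(\alpha\).

```python
from math import sqrt
from statistics import NormalDist


def normal_interval(mean, var, alpha=0.05):
    """Two-sided (1 - alpha) interval assuming X is approximately normal."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # upper alpha/2 quantile of N(0, 1)
    half_width = z * sqrt(var)
    return mean - half_width, mean + half_width


def chebyshev_interval(mean, var, alpha=0.05):
    """Distribution-independent interval from Chebyshev's inequality.

    P(|X - mu| >= eps) <= var / eps**2, so choosing eps = sqrt(var / alpha)
    leaves at most probability alpha outside the interval.
    """
    half_width = sqrt(var / alpha)
    return mean - half_width, mean + half_width


if __name__ == "__main__":
    # Hypothetical estimate: mean 10,000 with standard deviation 5.
    print(normal_interval(10_000, 25.0))
    print(chebyshev_interval(10_000, 25.0))
```

For the same mean and variance the Chebyshev interval is always the wider of the two, which is the price paid for not assuming a distribution.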

The two types of bounds discussed above require the computation of \(E[X]\) and \(\mathrm{Var}[X]\). Usually, \(E[X]\) is easy to compute but \(\mathrm{Var}[X]\) poses significant problems. Unfortunately, it is not possible to avoid the computation of \(\mathrm{Var}[X]\) and still obtain reasonable confidence intervals. If only \(E[X]\) is known, only Markov's inequality or Hoeffding bounds can be produced. Both can be reasonably efficient if multiple copies of the random variable are available and averaged, but both are inefficient if this is not the case. As we will see in the thesis, we have only one copy of the random variable that characterizes the aggregates, thus \(\mathrm{Var}[X]\) is strictly required if reasonable confidence bounds are to be produced.

For all the estimates in this dissertation, either distribution-independent bounds can be used to obtain a strict characterization of the results, or the normal-distribution-based bounds can be used, since all estimates can be expressed as weighted sums of independent identically distributed (iid) random variables so the Central Limit Theorem applies. In view of the above discussion, in order to simplify the exposition and the comparison, throughout the dissertation we will just provide results in the form of expected values and variances or squared errors; the variance is equal to the squared error if the random variable is unbiased. Actual error guarantees can be obtained straightforwardly using the above-mentioned techniques.

1.3 Contributions

The common assumption in histograms is the uniform assumption, under which histograms are expected to perform well only when the frequency values in one bucket are uniformly distributed. This uniform assumption turns out not to be strictly necessary in our work; histograms might work well even when the average frequency in a bucket is a very rough approximation of the actual frequency. The first problem we address is a general statistical assumption for histograms and the behavior of histograms under this assumption. The moments of unidimensional and multidimensional histograms are formulated under this new assumption.

Although extensive work has been done on probabilistic databases, most of it only provides the expected value of queries, which is not enough for users to make decisions. [73] approximated confidence computation in probabilistic databases, but it only estimates probabilities of DNFs. MCDB is the only system capable of computing tight confidence intervals but, in the case of rare events, it requires prohibitively expensive evaluation since it is based on sampling. The second and third problems we tackle are estimating aggregates over probabilistic databases, especially aggregates over multiple relations. We provide confidence intervals for our estimators and efficient algorithms to evaluate them. More precisely, we make the following contributions:

• We formulate a new statistical assumption, random shuffling of frequencies within a bucket, that is more general, and thus more likely to hold, than the uniform frequency assumption. As we will show, this new assumption does not change the way histograms are used for approximating the results of queries, so it is consistent with all the previous work on histograms; importantly from a practitioner's point of view, it explains why and when histograms behave well as approximators. Statistically, the random shuffling assumption holds when there is no correlation between the frequencies in the two relations being joined, so it is likely to hold in practice quite often.

• We provide tight minimum error guarantees for both unidimensional and multidimensional histograms when the random shuffling assumption holds. [42] is the only other work that provides tight error guarantees for estimation using histograms, in that case worst-case guarantees. The errors we derive allow us to provide theoretical proof that, when the random shuffling assumption holds, histograms are well suited to estimating aggregation queries and strictly superior to sampling and sketching. At the same time, we provide compelling theoretical evidence that, when the random shuffling assumption does not hold, histograms are, on average, poor approximators when compared to sampling and sketching.
• We apply the random shuffling assumption to XSketch [78]. As is the case for all uses of histograms in the literature, XSketches make the uniform frequency assumption for the histograms they use as an ingredient. This prevents a full statistical model for XSketches from being developed, with the result that the error cannot be analyzed. By combining the random shuffling assumption with the other statistical assumptions made in [78], we complete the statistical model and show that XSketches are unbiased estimators if the random shuffling assumption holds, and we compute the error under the same assumption. This is an example of how

the theory we developed in this dissertation can be extended to methods that use histograms as a building block.

• We derive a general framework for computing the confidence bounds of SUM and AVERAGE-like aggregates over joins of multiple relations for probabilistic databases. The framework only needs a loose assumption on the probabilistic model used. We apply the framework to multiple models: the tuple-independent model and the graphical model. The applications are straightforward, a testament to the power of the general framework. Applying the general framework to variations of the above models and to the majority of other models in the literature is just as easy.

• Based on our general framework for aggregates over probabilistic databases, we present algorithms that remove the need to perform computation over the cross-products of the \(M\) matching tuples. The main algorithm has time complexity \(O(M \log M)\) and is applicable to a large class of probabilistic models.

• We implement the theoretical results for aggregates over probabilistic databases both using query rewriting in pure SQL and using a combination of C++ and SQL. We evaluate the performance of our algorithms on the TPC-H data set and show that they are competitive with the computation of aggregates in non-probabilistic databases for the tuple-independent model and reasonable for the graphical model.

• We study probabilistic aggregates with duplicate removal and analyze the properties of PTIME probabilistic queries. We conclude that the correlations conveyed by projection are hierarchical and can be factorized recursively.

• We derive the first two moments of estimators of probabilistic aggregates with duplicate removal.
• We design an efficient hash-based algorithm to implement estimators of probabilistic aggregates with duplicate removal.

The rest of this dissertation is organized as follows. In Chapter 2, we present a random shuffling assumption for histograms and analyze the behavior of histograms under this assumption. Chapter 3 formulates aggregates over probabilistic databases and provides efficient algorithms to compute the estimates and the corresponding confidence intervals. Chapter 4 discusses aggregates with duplicate removal, derives formulas for the moments and then implements efficient algorithms to evaluate them. In the final chapter, conclusions are drawn and future work is presented.

CHAPTER 2
HISTOGRAMS AS STATISTICAL ESTIMATORS FOR AGGREGATE QUERIES¹

¹ This chapter is submitted to Information Systems (Databases: Their Creation, Management and Utilization) for publication and reprinted with permission from Information Systems (Databases: Their Creation, Management and Utilization).

2.1 Background

Histograms are among the most widely used and extensively studied approximation techniques for aggregate queries [41, 42, 44]. The traditional interpretation of histograms, irrespective of their type, is that the frequencies of the items in a bucket are approximated by the average frequency of the bucket, and that this average is used instead of the original frequencies in any computation. For example, histograms can be used to estimate the size of the join of two relations F and G. While this interpretation is intuitive and provides simple recipes for performing operations with histograms, it suggests that the histogram approximation of the frequency distribution will work well only if the frequency distribution is smooth and can be locally approximated using the uniform distribution assumption of histograms. This, as suggested by the following example and as is apparent from the histogram literature, turns out not to be strictly necessary; histograms might work well even when the average frequency in a bucket is a very rough approximation of the actual frequency.

Example 1. Let F and G be two relations, each with a single attribute A with domain 1...100. We generate both F and G to have Zipf distributions with Zipf coefficient 0.5 and with average frequency 99.54 (as close to 100 as possible while allowing all frequencies to be integers). In relation F the frequencies are decreasing (see Figure 2-1); in relation G the frequencies are randomly shuffled (see Figure 2-2). Observe from these figures that the one-bucket histogram approximation of the frequency, the line at 99.54, is a very poor estimate of the frequencies, thus we would expect poor performance when we use one-bucket histograms to estimate the size of the equi-join of F and G. In the scenario described above, the one-bucket histogram prediction is always 990821 irrespective of the particular shuffling of the domain of G. In the particular case depicted in the figure, the true size of the join is 981565, a mere 1% smaller than the prediction; to make sure this is not happening by chance, we picked 1000 random shufflings of G and plotted the distribution of \(|F \bowtie_A G|\) in Figure 2-3. Notice that the sizes of the joins are compactly distributed (within a 10% relative error) around the prediction using the one-bucket histograms.

Figure 2-1. Frequency graph of relation F.
Figure 2-2. Frequency graph of relation G.
Figure 2-3. Histogram of the size of the join result.

As the previous example suggests, even though the uniform frequency approximation within a bucket is rather crude, the result of approximating the size of the join is surprisingly good and statistically stable. This prompts the question: why is this happening in spite of the uniform approximation not holding? As we will show in this chapter, the answer is that a more general statistical hypothesis holds, namely that the placement of the frequencies in the two relations is uncorrelated. This observation is the starting point for the current work.

In the rest of the chapter, we first formalize the problem we are solving in Section 2.2, followed by the explanation of histograms as function approximators and statistical nonparametric models in Section 2.3. The formalization of the random shuffling assumption is made in Section 2.4. We analyze the behavior of unidimensional histograms under the random shuffling assumption in Section 2.5 and compare with the

behavior of sampling and sketching in Section 2.6. Section 2.7 analyzes the behavior of histograms when the random shuffling assumption does not hold. Section 2.8 generalizes the random shuffling assumption to multidimensional histograms. The random shuffling assumption is extended to XSketches in Section 2.9. Section 2.10 comments on the random shuffling assumption. Discussion is made in Section 2.11.

2.2 Problem Formulation

The general problem we are trying to solve is approximating aggregates over joins. As we will show in this section, both the selectivity estimation problem and the general SUM-like aggregates over join problem can be rephrased as size of join estimation problems. Thus all the results developed for the latter can be extended straightforwardly to the other two problems.

2.2.1 Size of Join Problem

2.2.1.1 Unidimensional Size of Join Estimation Problem

Let F and G be two relations, each with a single attribute A with domain \(I\). Furthermore, let \(f_i\) and \(g_i\) be the frequencies of the value \(i\) in F and G, respectively. With this, the size of join problem is to estimate the quantity:

\[ |F \bowtie_A G| = \sum_{i \in I} f_i g_i \tag{2-1} \]

given synopses of relations F and G (if full information is available, we can simply compute the sum to get the exact answer).

2.2.1.2 Multidimensional Size of Join Estimation Problem

Let \(F_1(A_1), \dots, F_m(A_m)\) and \(G(A_1, \dots, A_m)\) be \(m+1\) relations. Let \(F_i\) and \(G\) have a common attribute \(A_i\) with domain \(I_i\). Let \(f_{i_k}\) and \(g_{(i_1,\dots,i_m)}\) be the frequencies of the values \(i_k\) and \((i_1,\dots,i_m)\) in \(F_k\) and \(G\), respectively. Then the size of the join is:

\[ |F_1 \bowtie_{A_1} G \bowtie_{A_2} \cdots \bowtie_{A_m} F_m| = \sum_{i_1 \in I_1} \cdots \sum_{i_m \in I_m} f_{i_1} \cdots g_{(i_1,\dots,i_m)} \cdots f_{i_m} \tag{2-2} \]
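The unidimensional quantity of Section 2.2.1.1 and its one-bucket histogram estimate are easy to compute directly, which makes the setting of Example 1 simple to replay. The Python sketch below is an illustration under assumptions: the helper names are hypothetical, and `zipf_like_frequencies` uses a plain rounding rule, so it is not an attempt to reproduce the exact frequencies or numbers reported in Example 1.

```python
import random


def join_size(f, g):
    """Exact size of the equi-join: sum over the domain of f_i * g_i."""
    return sum(fi * gi for fi, gi in zip(f, g))


def one_bucket_estimate(f, g):
    """One-bucket histogram estimate N * avg(f) * avg(g)."""
    n = len(f)
    return n * (sum(f) / n) * (sum(g) / n)


def zipf_like_frequencies(n=100, coefficient=0.5, average=100):
    """Integer frequencies proportional to i**(-coefficient), decreasing in i."""
    weights = [i ** -coefficient for i in range(1, n + 1)]
    scale = average * n / sum(weights)
    return [max(1, round(w * scale)) for w in weights]


def shuffled_join_sizes(f, g, trials=1000, seed=0):
    """Join sizes obtained when the frequencies of g are randomly rearranged."""
    rng = random.Random(seed)
    sizes = []
    for _ in range(trials):
        shuffled = g[:]
        rng.shuffle(shuffled)
        sizes.append(join_size(f, shuffled))
    return sizes


if __name__ == "__main__":
    f = zipf_like_frequencies()
    sizes = shuffled_join_sizes(f, f)
    prediction = one_bucket_estimate(f, f)
    worst = max(abs(s - prediction) / prediction for s in sizes)
    print(f"prediction={prediction:.0f}, worst relative error={worst:.3f}")
```

The estimate depends only on the two averages, so it is the same for every shuffling of `g`; only the exact join size moves.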

Table 2-1. Notations used in the chapter

Symbol(s)                 Meaning
F, G                      Relations
A                         Join attribute
\(A_k\)                   Join attribute
\(I\)                     Domain of join attribute A
\(I_k\)                   Domain of join attribute \(A_k\)
\(N\)                     Size of domain \(I\)
\(i, j, i', j'\)          Indices going over domain \(I\)
\(f_i, g_i\)              Frequencies of value \(i\) in relations F, G
\(\bar f, \bar g\)        Average frequencies in relations F, G
SJ(F)                     \(\sum_{i \in I} f_i^2\), the self-join size of F
SqErr(F)                  \(\sum_{i \in I} (f_i - \bar f)^2\), the squared error of F
\(I_l\)                   The \(l\)-th bucket of domain \(I\)
\(I(C)\)                  Indicator function: 1 when C is true, 0 otherwise
\(\sigma\)                Uniform random permutations
\(x\)                     Random variable modeling the frequency
\(n\)                     Number of buckets
\(N'\)                    Size of the sample
\(m\)                     Number of attributes
\(P[p]\)                  Probability that predicate \(p\) holds
\(E[X]\)                  Expected value of random variable \(X\)
\(\mathrm{Var}[X]\)       Variance of random variable \(X\), \(\mathrm{Var}[X] = E[X^2] - E[X]^2\)
\(\mathrm{Cov}(X,Y)\)     Covariance of random variables \(X\) and \(Y\), \(\mathrm{Cov}(X,Y) = E[XY] - E[X]E[Y]\)
\((i_1,\dots,i_m)\)       Indices going over domain \(I\) with \(m\) dimensions
T                         XML document nodes

The problem of multidimensional size of the join is to approximate Equation 2-2 given synopses of the relations \(F_i\) and \(G\).

2.2.2 Selectivity Estimation Problem

For selectivity estimation problems, we show how they can be reduced to size of join estimation problems in the case of bi-dimensional selectivity, which can be easily generalized to multidimensional selectivity. This means that the results we are developing for size of join estimation readily apply to this problem as well. We also show an alternative reduction for arbitrary selectivity predicates that is not as efficient but works in all scenarios. Let G be a relation with two attributes A, B with domains \(I\), \(J\),

respectively. Given \(I' \subseteq I\) and \(J' \subseteq J\), estimate the quantity:

\[ |\sigma_{I' \times J'}(G)| = \sum_{i \in I'} \sum_{j \in J'} g_{ij} \]

With \(I(C)\) the indicator function, which takes value 1 if condition C is true and value 0 otherwise, by simply setting the relations F and H so that \(f_i = I(i \in I')\) and \(h_j = I(j \in J')\), we have:

\[ |F \bowtie_A G \bowtie_B H| = \sum_{i \in I} \sum_{j \in J} f_i\, g_{ij}\, h_j = \sum_{i \in I} \sum_{j \in J} I(i \in I')\, g_{ij}\, I(j \in J') = \sum_{i \in I'} \sum_{j \in J'} g_{ij} = |\sigma_{I' \times J'}(G)| \]

Observe that the joins are with unidimensional relations; this is important since, in general, the smaller the dimensionality the more efficient the estimation.

Essentially the same technique can be applied to the multidimensional case when the selection predicate can be rewritten as a conjunction of predicates, one for each attribute. In that case the selectivity estimation problem is reduced to the problem of estimating the size of a star-join involving a unidimensional virtual relation for each attribute and the original relation. This can be generalized further by considering disjunctions of conjunctions of predicates involving individual predicates (or expressions that can be rewritten in this way). In this case, the expression can be rewritten as a disjunction of conjunctions with the extra property that the conjunctions do not contain common cases. This extra property ensures that the estimate of the selectivity of the initial predicate is simply the sum of the estimates for the individual conjunctions.

Note that the selectivity of the predicate is thus the size of the join of virtual relations with the original relation. This involves joins on multiple attributes between two relations, which are more complicated than joins on a single attribute. Moreover, constructing synopses of such virtual relations can be challenging (the tuples satisfying the selection predicate might actually have to be enumerated one by one).
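The bi-dimensional reduction can be checked on a toy relation. In the Python sketch below the frequency matrix `g`, the selected sets and the function names are hypothetical; the point is only that the star-join with two indicator relations returns the same number as the direct selection count.

```python
def selection_count(g, i_sel, j_sel):
    """Direct count: sum of g[i][j] over i in I' and j in J'."""
    return sum(g[i][j] for i in i_sel for j in j_sel)


def star_join_size(f, g, h):
    """Size of F join_A G join_B H: sum over i, j of f_i * g_ij * h_j."""
    return sum(
        f[i] * g[i][j] * h[j]
        for i in range(len(f))
        for j in range(len(h))
    )


def indicator_relation(domain_size, selected):
    """Virtual relation with frequency 1 on selected values and 0 elsewhere."""
    return [1 if v in selected else 0 for v in range(domain_size)]


if __name__ == "__main__":
    g = [[3, 0, 2], [1, 4, 5]]        # g[i][j]: frequency of the pair (i, j)
    i_sel, j_sel = {0, 1}, {1, 2}     # I' and J'
    f = indicator_relation(2, i_sel)  # f_i = I(i in I')
    h = indicator_relation(3, j_sel)  # h_j = I(j in J')
    print(selection_count(g, i_sel, j_sel), star_join_size(f, g, h))
```

Because `f` and `h` are unidimensional, any synopsis built for size of join estimation can be reused for them unchanged.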

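2.2.3 Aggregates over Joins Problem

In order to gain insight and to ease the understanding, we primarily focus on the problem of computing aggregates over the join of two relations. We make some comments towards the end of the section on how these ideas can be extended to larger joins. Let F and G be two relations that contain a join attribute A and possibly other attributes. Let us first look at aggregates of the form

\[ \mathrm{SUM}_{\mathcal{F}_F \mathcal{F}_G}(F \bowtie_A G) = \sum_{t \in F \bowtie_A G} \mathcal{F}_F(t_{|F})\, \mathcal{F}_G(t_{|G}) \]

with \(t_{|F}\) the part of the tuple in the join that comes from relation F (similarly for G) and \(\mathcal{F}_F\) and \(\mathcal{F}_G\) arbitrary functions. The only requirement for this to work is to be able to rewrite the expression summed up over the join as the product of expressions depending on attributes of the two relations. If such a rewriting is possible, we say that the aggregate is relation factorizable. To evaluate the sum of relation factorizable aggregates over the join of F and G, we observe that:

\[
\begin{aligned}
\mathrm{SUM}_{\mathcal{F}_F \mathcal{F}_G}(F \bowtie_A G)
 &= \sum_{t \in F \bowtie_A G} \mathcal{F}_F(t_{|F})\, \mathcal{F}_G(t_{|G}) \\
 &= \sum_{i \in I} \; \sum_{t \in F \bowtie_A G,\; t.A = i} \mathcal{F}_F(t_{|F})\, \mathcal{F}_G(t_{|G}) \\
 &= \sum_{i \in I} \Big( \sum_{t \in F,\; t.A = i} \mathcal{F}_F(t) \Big) \Big( \sum_{t \in G,\; t.A = i} \mathcal{F}_G(t) \Big) \\
 &= \sum_{i \in I} \tilde f_i\, \tilde g_i
\end{aligned}
\tag{2-3}
\]

where \(\tilde f_i\) and \(\tilde g_i\) are just compact notations for the expressions \(\sum_{t \in F,\, t.A = i} \mathcal{F}_F(t)\) and \(\sum_{t \in G,\, t.A = i} \mathcal{F}_G(t)\), respectively. The important observation is that we can use any method designed for size of join estimation to estimate this aggregate as well by simply replacing \(f_i\) by \(\tilde f_i\) and \(g_i\) by \(\tilde g_i\), since then the expression in Equation 2-1 is identical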

to the last expression in Equation 2-3. Thus, computing such aggregates is as easy as computing sizes of joins; the complexity is in the join, not in the expression being summed up.

With the ability to compute estimates of aggregates of the form \(\mathrm{SUM}_{\mathcal{F}_F \mathcal{F}_G}(F \bowtie_A G)\), we can immediately compute aggregates of the form AVG and STD as well. For example, to estimate \(\mathrm{AVG}_B(F(A,B) \bowtie_A G(A))\) we can estimate \(\mathrm{SUM}_B(F(A,B) \bowtie_A G(A))\) and \(|F \bowtie_A G|\) and simply take their ratio.

The ideas used to reduce aggregate estimation problems of the form \(\mathrm{SUM}_{\mathcal{F}_F \mathcal{F}_G}(F \bowtie_A G)\) to size of join estimation problems can be readily generalized to star joins involving multiple relations; a similar rewriting can be made.

Since both the selectivity estimation problem and the COUNT, SUM, AVG and STD aggregate estimation problems can be reduced to size of join problems, for the rest of the chapter we will focus only on the size of join problem. The problem of estimating MIN and MAX aggregates cannot be reduced to size of join problems, but no approximate methods for estimating such aggregates exist either. The main problem with such developments is the fact that there is no way to predict extreme values using statistical methods unless very strong statistical assumptions are made (particular distributions of the data have to be assumed).

2.2.4 Comments on Obtaining Error Guarantees from Expected Value and Variance Estimates

The standard technique [8, 91] to obtain error guarantees, i.e. confidence intervals, for an estimate is to compute the expected value and the variance and then to use either distribution-independent bounds given by Chernoff's and Chebyshev's inequalities, or distribution-dependent bounds. In the latter case, usually the Central Limit Theorem or one of its generalizations is used to argue that the distribution of the estimate is close to normal, and then error bounds based on normal distributions with the same expected value and variance are produced. For all the estimates in this chapter, either distribution

independent bounds can be used to obtain a strict characterization of the results, or the normal-distribution-based bounds can be used, since all estimates can be expressed as weighted sums of independent identically distributed (iid) random variables so the Central Limit Theorem applies. In view of the above discussion, in order to simplify the exposition and the comparison, throughout the chapter we will just provide results in the form of expected values and variances or squared errors; the variance is equal to the squared error if the random variable is unbiased. Actual error guarantees can be obtained straightforwardly using the above-mentioned techniques.

2.3 Histograms as Function Approximators and Statistical Nonparametric Models

Histograms were first introduced in Statistics as an alternative to parametric models². The main idea is to approximate, in a function approximation sense, the probability density function of an unknown distribution. The histogram can be used instead of the unknown probability density function to characterize the distribution. While results from Statistics clearly point out that no guarantee with respect to the goodness of the approximation of the p.d.f. can be given, good guarantees can be provided for the computation of the c.d.f. This is particularly useful when the cumulative distribution function has to be determined at various points (for example to produce confidence intervals).

² Parametric models are models that depend on a small fixed set of parameters. For example, normal distributions are parametric models since they depend on only two parameters: mean and variance.

It is easier to explain why this is the case by translating this problem into a database problem: selectivity estimation. With our problem definition, if we have a selection predicate of the form G.A <= 10 and we want to estimate its selectivity, then by constructing a histogram on this attribute (or using a previously constructed one), the estimate is simply the sum of the mass (i.e. the total number of tuples) of the buckets completely included in the range \((-\infty, 10]\) plus the proportional part of the bucket that overlaps the point 10. From the point of view of selectivity estimation, by the above discussion the error comes only from the bucket that partially overlaps the interval. For the buckets that are fully included in the interval there is no source of error. This observation suggests that the error is kept under control since, in absolute value, it is less than the mass of the overlapping bucket. In the context of statistics, there is an extra complication coming from the fact that the data provided form a sample, in which case natural statistical fluctuations would result in errors for the counts in each bucket. This is not a problem in databases since the histogram is constructed over the entire data set, not just a sample.

2.4 Random Shuffling Assumption

Under the traditional uniform assumption, histograms should perform well only when the average frequency approximates the frequencies in a bucket well. As we have seen in the introduction (Example 1), the one-bucket histogram behaved exceptionally well even though the average frequency was a poor approximator of the frequencies. The uniform frequency assumption does not explain this good behavior. Instead of proposing a new type of histogram, the goal in this section is to find a better explanation and to explore it with statistical analysis. We want to formalize a statistical model that can be used to characterize histograms.

The starting point for our investigation is the random rearrangement of the elements of the domain for relation G in Example 1. The rearrangement did not change the skew of the distribution; it just decorrelated the matching of the frequencies in F and G. We formalize this random rearrangement in this section as the statistically well-defined notion of the random shuffling assumption. As we will see, this formalization leads to formulas for the error of histograms and allows us to compare them with other approximation methods like sketching and sampling. Later in the chapter we ask the counterpart question: what is the histogram error behavior if the assumption is false? By

analyzing histograms both when the random shuffling assumption holds and when it does not, we will get a complete characterization of histograms as a statistical approximation method. Like any statistical model, it might or might not be true for any given data set. Our investigation revolves around the least constraining assumption rather than arguments w.r.t. the validity of the assumption in practice.

With a statistical interpretation, the uniform distribution assumption is a stronger statistical assumption than the random shuffling assumption. In particular, the uniform frequency assumption requires the complete rearrangement of the entire mass of the bucket. In this way, any information about the skew inside the bucket is lost. By contrast, the random shuffling assumption only needs the rearrangement of the frequencies. Information about the frequency composition of the relations is preserved; only the precise placement of the various frequencies is lost. In this sense, the random shuffling assumption is more likely to hold in practice since it is less restrictive. That does not mean that it always holds for real data sets. In the case of self-joins the assumption provably never holds; it will be interesting to see how histograms behave on this problem later in the chapter.

2.4.1 Definition of Uniform Random Shuffling Assumption

As opposed to samples [71] and sketches [8], histograms are not randomized methods. They perform well when the value of the frequency of an item is independent of the position of the item in the bucket, as shown in Example 1. A statistical model can be used to characterize this behavior. We now introduce a probability space whose sample space is composed of the frequencies in the bucket. Let \(\Omega = \{1, \dots, n\}\) denote the slots within a domain (or a bucket). Let \(\Phi = \{f_1, \dots, f_n\}\) represent the frequencies in this domain. Let \(S\) denote the set of permutations of \(\Phi\). Let \(x_{\sigma(i)}\) be a random variable, which depends on the random permutation \(\sigma\), representing the frequency of slot \(i\). Then the distribution of the frequencies of all the slots within the bucket can be represented by the joint distribution of \(x_{\sigma(1)}, \dots, x_{\sigma(n)}\).

Definition 1. If the probability function of the random variables \(x_{\sigma(1)}, \dots, x_{\sigma(n)}\) has the following properties:

\[ \forall i, j \in \Omega,\ i \ne j,\ f_i \in \Phi: \quad P\left[x_{\sigma(i)} = f_i \wedge x_{\sigma(j)} = f_i\right] = 0 \]

\[ \forall s \in S: \quad P\left[(x_{\sigma(1)}, \dots, x_{\sigma(n)}) = s\right] = \frac{1}{n!} \]

then we say that the frequencies are randomly shuffled.

Essentially, the above definition allows us to select any frequencies we want but forces the placement of these frequencies to be random within the domain. This preserves the skew of the distribution but decorrelates the placement. From the above technical definition of the random shuffling assumption, we can derive the following results:

Proposition 1. The random variable \(x_{\sigma(i)}\) has the following properties:

\[ \forall i, i' \in \Omega,\ f_{i'} \in \Phi: \quad P\left[x_{\sigma(i)} = f_{i'}\right] = \frac{1}{n} \]

\[ \forall i, j, i', j' \in \Omega,\ i \ne j,\ i' \ne j',\ f_{i'}, f_{j'} \in \Phi: \quad P\left[x_{\sigma(i)} = f_{i'} \wedge x_{\sigma(j)} = f_{j'}\right] = \frac{1}{n(n-1)} \]

These two properties are about the marginal probabilities of the random variables. There are \((n-1)!\) and \((n-2)!\) instances in the probability space for \(x_{\sigma(i)} = f_{i'}\) and for \(x_{\sigma(i)} = f_{i'} \wedge x_{\sigma(j)} = f_{j'}\), respectively. The marginal probabilities are obtained by accumulating the probabilities of these instances.

2.4.2 Moments under Random Shuffling Assumption

Moments are important properties of statistical distributions. In this section, we analyze the first two moments of the random shuffling distribution.
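As a sanity check before computing the moments, the marginal probabilities stated in Proposition 1 can be confirmed by brute force for a small bucket. The Python sketch below enumerates all \(n!\) equally likely arrangements; the function name `marginals` and the sample frequencies are hypothetical, and the frequencies are assumed distinct so that each value identifies one element of the frequency set.

```python
from fractions import Fraction
from itertools import permutations


def marginals(freqs):
    """Enumerate all n! equally likely arrangements of distinct frequencies.

    Returns (single, pair) where single[(i, a)] = P[slot i holds a] and
    pair[(i, a, j, b)] = P[slot i holds a and slot j holds b] for i != j.
    """
    arrangements = list(permutations(freqs))
    weight = Fraction(1, len(arrangements))  # each arrangement has mass 1/n!
    single, pair = {}, {}
    n = len(freqs)
    for arr in arrangements:
        for i in range(n):
            single[(i, arr[i])] = single.get((i, arr[i]), 0) + weight
            for j in range(n):
                if i != j:
                    key = (i, arr[i], j, arr[j])
                    pair[key] = pair.get(key, 0) + weight
    return single, pair


if __name__ == "__main__":
    single, pair = marginals([5, 1, 9, 2])
    print(set(single.values()), set(pair.values()))
```

Exact rational arithmetic is used so that the comparison with \(1/n\) and \(1/(n(n-1))\) is not blurred by rounding.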

Lemma 1. Suppose \(x_{\sigma(i)}\) is a random variable with the distribution defined in Definition 1. The moments of \(x_{\sigma(i)}\) are

\[ E\left[x_{\sigma(i)}\right] = \bar f \]

\[ \mathrm{Var}\left(x_{\sigma(i)}\right) = \frac{\sum_{i'=1}^{n} f_{i'}^2}{n} - \bar f^{\,2} \]

\[ \forall i \ne i': \quad \mathrm{Cov}\left(x_{\sigma(i)}, x_{\sigma(i')}\right) = \frac{n \bar f^{\,2} - \sum_{i=1}^{n} f_i^2}{n(n-1)} \]

where \(\bar f\) is the average of the frequencies in the bucket.

Proof. The expected value can be derived easily:

\[ E\left[x_{\sigma(i)}\right] = \sum_{i'=1}^{n} f_{i'}\, P\left[x_{\sigma(i)} = f_{i'}\right] = \frac{1}{n} \sum_{i'=1}^{n} f_{i'} = \bar f \]

\(\mathrm{Var}(x_{\sigma(i)})\) is:

\[ \mathrm{Var}\left(x_{\sigma(i)}\right) = E\left[x_{\sigma(i)}^2\right] - E\left[x_{\sigma(i)}\right]^2 = \sum_{i'=1}^{n} f_{i'}^2\, P\left[x_{\sigma(i)} = f_{i'}\right] - \bar f^{\,2} = \frac{\sum_{i'=1}^{n} f_{i'}^2}{n} - \bar f^{\,2} \]

Then we compute \(\mathrm{Cov}(x_{\sigma(i)}, x_{\sigma(i')})\) when \(i \ne i'\). Since \(x_{\sigma(i)}\) and \(x_{\sigma(i')}\) are dependent, \(E[x_{\sigma(i)} x_{\sigma(i')}] \ne E[x_{\sigma(i)}]\, E[x_{\sigma(i')}]\). We compute \(E[x_{\sigma(i)} x_{\sigma(i')}]\) from its definition:

\[
\begin{aligned}
E\left[x_{\sigma(i)} x_{\sigma(i')}\right]
 &= \sum_{j=1}^{n} \sum_{k=1}^{n} f_j f_k\, P\left[x_{\sigma(i)} = f_j \wedge x_{\sigma(i')} = f_k\right] \\
 &= \frac{1}{n(n-1)} \sum_{j=1}^{n} \sum_{k=1,\, k \ne j}^{n} f_j f_k \\
 &= \frac{1}{n(n-1)} \left[ \sum_{j=1}^{n} \sum_{k=1}^{n} f_j f_k - \sum_{j=1}^{n} f_j^2 \right]
\end{aligned}
\]

With the fact that \(n \bar f = \sum_{i=1}^{n} f_i\), we have:

\[
\begin{aligned}
\mathrm{Cov}\left(x_{\sigma(i)}, x_{\sigma(i')}\right)
 &= E\left[x_{\sigma(i)} x_{\sigma(i')}\right] - E\left[x_{\sigma(i)}\right] E\left[x_{\sigma(i')}\right] \\
 &= \bar f^{\,2}\, \frac{n}{n-1} - \frac{\sum_{i=1}^{n} f_i^2}{n(n-1)} - \bar f^{\,2} \\
 &= \frac{n \bar f^{\,2} - \sum_{i=1}^{n} f_i^2}{n(n-1)}
\end{aligned}
\]
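The closed forms of Lemma 1 can be compared against a direct enumeration of all arrangements of a small bucket. The following Python sketch is an illustration only; `closed_form_moments`, `enumerated_moments` and the example frequencies are hypothetical names and values, and exact fractions are used so that the two results can be compared for equality.

```python
from fractions import Fraction
from itertools import permutations


def closed_form_moments(freqs):
    """Mean, variance and covariance given by Lemma 1."""
    n = len(freqs)
    mean = Fraction(sum(freqs), n)
    sum_sq = sum(f * f for f in freqs)
    var = Fraction(sum_sq, n) - mean ** 2
    cov = (n * mean ** 2 - sum_sq) / (n * (n - 1))
    return mean, var, cov


def enumerated_moments(freqs, i=0, j=1):
    """Same moments for slots i and j, averaged over all n! arrangements."""
    arrangements = list(permutations(freqs))
    total = len(arrangements)
    e_i = Fraction(sum(a[i] for a in arrangements), total)
    e_j = Fraction(sum(a[j] for a in arrangements), total)
    e_ii = Fraction(sum(a[i] ** 2 for a in arrangements), total)
    e_ij = Fraction(sum(a[i] * a[j] for a in arrangements), total)
    return e_i, e_ii - e_i ** 2, e_ij - e_i * e_j


if __name__ == "__main__":
    bucket = [7, 1, 4, 4, 10]
    print(closed_form_moments(bucket))
    print(enumerated_moments(bucket))
```

The covariance is negative whenever the frequencies are not all equal: placing a large frequency in one slot leaves smaller ones for the others.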

The above equations are essential for the rest of the chapter. In the following sections, we assume that the frequencies within the bucket are randomly shuffled. Then we can use the above equations directly to analyze the behavior of histograms.

2.5 Histograms under Random Shuffling Assumption

In this section, we start the analysis of histograms by looking at their performance when the random shuffling assumption holds. We start with the simplest case, histograms with a single bucket, and work our way up to the fully general case. Throughout this section we use the notations in Table 2-1.

2.5.1 One-bucket Histograms

In this section, we provide a theoretical analysis of the unidimensional size of join problem, as stated in Section 2.2.1.1, under the random shuffling assumption. As pointed out in Section 2.2, these results readily extend to selectivity estimation and other aggregates over joins.

Proposition 2. Let \(\sigma\) be a uniformly random permutation of the domain \(I\) of size \(N = |I|\). Then, for any \(i \in I\) and any frequencies of items in relation G, \(\{g_i \mid i \in I\}\), we have:

\[ E\left[g_{\sigma(i)}\right] = \bar g \]

\[ \mathrm{Var}\left(g_{\sigma(i)}\right) = \mathrm{Cov}\left(g_{\sigma(i)}, g_{\sigma(i)}\right) = \frac{\mathrm{SqErr}(G)}{N} \]

\[ \forall i \ne i': \quad \mathrm{Cov}\left(g_{\sigma(i)}, g_{\sigma(i')}\right) = -\frac{\mathrm{SqErr}(G)}{N(N-1)} \]

where \(\bar g\) is the average of the frequencies in relation G and \(\mathrm{SqErr}(G) = \mathrm{SJ}(G) - N \bar g^{\,2}\).

Proof. The proof follows directly from Lemma 1.

This proposition shows that the expected values of all the random variables are equal to \(\bar g\). Thus all of them are unbiased estimators of \(\bar g\), which is exactly the behavior of histograms. In addition, these random variables are negatively correlated. With this result we obtain the following theorem:

Theorem 1. Let F and G be two relations. The estimate \(N \bar f \bar g\) of \(|F \bowtie_A G|\) using unidimensional one-bucket histograms for both F and G is, under the random shuffling assumption, an unbiased estimator and has the squared error

\[ \frac{1}{N-1}\, \mathrm{SqErr}(F)\, \mathrm{SqErr}(G) \]

Proof. Let the random variable \(X = \sum_{i \in I} f_i\, g_{\sigma(i)}\) denote the size of the join of F and \(\sigma(G)\) over the space of permutations under the random shuffling assumption.

\[ E[X] = \sum_{i \in I} f_i\, E\left[g_{\sigma(i)}\right] = \bar g \sum_{i \in I} f_i = N \bar f \bar g \]

where we used the linearity of expectation and the first result in Proposition 2.

\[
\begin{aligned}
\mathrm{Var}(X) &= \mathrm{Var}\Big( \sum_{i \in I} f_i\, g_{\sigma(i)} \Big)
 = \sum_{i \in I} \sum_{i' \in I} f_i f_{i'}\, \mathrm{Cov}\left(g_{\sigma(i)}, g_{\sigma(i')}\right) \\
 &= \sum_{i \in I} f_i^2\, \frac{\mathrm{SqErr}(G)}{N} - \sum_{i \in I} \sum_{i' \in I,\, i' \ne i} f_i f_{i'}\, \frac{\mathrm{SqErr}(G)}{N(N-1)} \\
 &= \frac{\mathrm{SqErr}(G)}{N}\, \mathrm{SJ}(F) - \left( N^2 \bar f^{\,2} - \mathrm{SJ}(F) \right) \frac{\mathrm{SqErr}(G)}{N(N-1)} \\
 &= \frac{\mathrm{SqErr}(G)}{N-1} \left[ \frac{N-1}{N}\, \mathrm{SJ}(F) - N \bar f^{\,2} + \frac{1}{N}\, \mathrm{SJ}(F) \right] \\
 &= \frac{\mathrm{SqErr}(F)\, \mathrm{SqErr}(G)}{N-1}
\end{aligned}
\]

What this theorem tells us is that the way we use one-bucket histograms is consistent with the random shuffling hypothesis and that the estimate is an unbiased estimator if the hypothesis holds.

For the scenario in Example 1, the variance of the prediction using the formula in Theorem 1 is \(2.53 \times 10^9\) and the standard deviation is 50309; remember that the true result is 985392. Given the fact that most of the mass of a normal distribution is within two standard deviations, using the theory we just developed we would expect the error,

in most of the cases, to be at most \(2 \times 50308 \approx 985392 \times 10\%\), which coincides with the empirical observations based on Figure 2-3 that we made in Section 2.1.

2.5.2 Histograms with Aligned Buckets

Usually, we can afford to build histograms with hundreds of buckets, not only one bucket. The natural question to ask is how we can extend the analysis in the previous section to multi-bucket histograms.

Theorem 2. Let F and G be two relations. The histogram estimate \(\sum_{l=1}^{n} |I_l|\, \bar f_l\, \bar g_l\) of \(|F \bowtie_A G|\) using histograms with the same bucketization \(I_1, \dots, I_n\) for both F and G is, under the random shuffling assumption within each bucket, an unbiased estimator and has the squared error

\[ \sum_{l=1}^{n} \frac{1}{|I_l| - 1}\, \mathrm{SqErr}(F_l)\, \mathrm{SqErr}(G_l) \]

Proof. Define a random variable \(X = \sum_{i \in I} f_i\, g_{\sigma(i)}\). We observe that \(X\) can be rewritten as \(X = \sum_{l=1}^{n} X_l\) where \(X_l = \sum_{i \in I_l} f_i\, g_{\sigma(i)}\). Since the random shuffling assumption holds independently in each bucket, the random variables \(X_l\) are independent and the probability space over \(\sigma\) is simply the product of probability spaces, one for each bucket. Since each random variable \(X_l\) now behaves exactly like the setup in Theorem 1 (i.e. as in the one-bucket histogram case), we have:

\[ E[X] = \sum_{l=1}^{n} E[X_l] = \sum_{l=1}^{n} |I_l|\, \bar f_l\, \bar g_l \]

and

\[ \mathrm{Var}(X) = \mathrm{Var}\Big( \sum_{l=1}^{n} X_l \Big) = \sum_{l=1}^{n} \mathrm{Var}(X_l) = \sum_{l=1}^{n} \frac{1}{|I_l| - 1}\, \mathrm{SqErr}(F_l)\, \mathrm{SqErr}(G_l) \]

The same observations that we made for one-bucket histograms hold here as well: (a) the way the histogram is used is consistent with the random shuffling assumption, and (b) histograms are likely to have a small error if the hypothesis holds. Note that we

do not have to make any assumption about the distribution inside a bucket, just the fact that the arrangement for one of the relations is random.

It is possible to generalize Theorem 2 to the case when the buckets are not aligned (see Appendix A) if the random shuffling assumption holds for both relations. This result is unlikely to be useful in practice since, as noted by Ioannidis and Christodoulakis [41], for maximal error reduction the same histogram has to be used for both relations.

2.6 Comparison with Sampling and Sketches

In this section, we compare histograms with two other approximate query processing techniques: sampling and sketches. We provide here only the comparison for the case when the random shuffling assumption holds; the comparison for the case when the random shuffling assumption does not hold is deferred to Section 2.7.

2.6.1 Sampling

Multiple variations of the sampling method have been proposed in the literature. The most important ones, exemplified on the problem of computing \(|F \bowtie_A G|\), are:

• Sampling from the base relation [71]: In this type of sampling, random subsets of relations F and G are selected, the query is evaluated on the samples and the result is scaled up to account for the difference in size.

• Sample counts: Select a subsample \(I'\) of the domain \(I\) of the join attribute A and maintain exact counts \(f_i\) and \(g_i\) for \(i \in I'\). If we do not take samples at exactly the same points in the domain for the two relations, we are wasting samples since we can use only the samples in the intersection. This sampling technique is a small modification of the sampling scheme in Alon's work [8]. The only difference is that it starts the counting of item \(i\) from the beginning, not from a random point, so the counts are accurate.

• Sampling from the join: By producing iid samples from the join, we can simply scale up the number of such samples to get an estimate for the size of the join. The problem is that it is hard to produce iid samples out of join results [14].

In the theoretical developments in this section we use sample counts since they are easy to produce and are better approximators than samples from the base relation. We assume that the choice of the subset \(I'\) is uniformly random; we do not attempt to pick \(I'\) so that the performance is improved. If sample counts are maintained for F and G,


we know the precise values of $f_i$ and $g_i$ for $i \in I'$ and we do not know anything about the other frequencies. With these observations, the estimate based on the sample is the random variable $X = \frac{N}{N'}\sum_{i\in I'} f_i g_i$. This random variable, and thus the estimation using sample counts, is characterized by the following two results:

Proposition 3. If the random shuffling assumption holds for relation $G$, the sample count estimator for $|F \bowtie_A G|$ that maintains $N'$ samples has average squared error bounded by:
$$E[\mathrm{Err}(X)] \geq \frac{(N-N')(N-2)}{N'(N-1)^2}\,\mathrm{SqErr}(F)\,\mathrm{SqErr}(G)$$

Proof. See Appendix B.

Now we present the estimation error of the self-join size of $F$ by sampling [footnote 3].

Proposition 4. The squared error of the sample count estimator for $|F \bowtie_A F|$, the self-join size of $F$, is:
$$\mathrm{Err}(X) = \frac{N-N'}{N'(N-1)}\left[N\sum_{i\in I} f_i^4 - \Bigl(\sum_{i\in I} f_i^2\Bigr)^2\right]$$

Proof. See Appendix C.

Footnote 3: Since the random shuffling assumption never holds in the case of self-joins, the estimation error of self-join size using histograms is postponed to Section 2.7.2.

Proposition 3 shows that the lower bound on the squared error for samples is proportional to the product of the squared errors of the two relations. It is also roughly proportional to the inverse of the size of the synopsis, $N'$ for the samples. We will compare sampling, sketches and histograms at the end of this section.

2.6.2 Sketches

A full introduction to AMS sketches [7, 8] is beyond the scope of the chapter. We provide here only a short introduction and the main property of sketch estimators. For each item $i \in I$, we introduce a $\pm 1$ random variable $\xi_i$. With this, we introduce the following random projections for relations $F$ and $G$: $X_F = \sum_{i\in I}\xi_i f_i$ and $X_G =$


$\sum_{i\in I}\xi_i g_i$. Now, the following random variable is an estimator of the size of the join of $F$ and $G$: $X = X_F X_G$. Given the random variables $\xi_i$, the random variables $X_F$ and $X_G$ can be computed in small space by observing that factors like $\xi_i f_i = \xi_i + \cdots + \xi_i$ with $\xi_i$ repeated $f_i$ times. When an item $a$ has to be incorporated in the incrementally maintained $X_F$, the value of random variable $\xi_a$ is added to $X_F$. A similar process allows the maintenance of $X_G$.

The above scheme relies on the availability of random variables $\xi_i$. The key to avoiding storing them (which would negate the small space characteristic of the sketches) is to generate them when needed from seeds using special random number generators. In particular $\xi_i = h(i, S)$ with $S$ a random seed and $h()$ a generation function. Pseudo-random generation of $\xi_i$ is possible because these random variables do not need to be fully independent; 4-wise independence suffices.

The variance of the estimator $X$ is large with respect to the self-join sizes of the relations involved. The standard method to boost the accuracy of $X$ is to produce multiple independent copies and to average them. As long as the random seeds used to generate the required $\xi_i$ families are independent, the estimators are independent and the variance is reduced by a factor of $\frac{1}{n}$, where $n$ is the number of copies averaged. Formally, we have the following result:

Lemma 2. Let $X_j$ be a sketch estimate as described above obtained from seed $S_j$ and a 4-wise independent random number generator, for $j = 1:n$. Then the average estimator $X = \frac{1}{n}\sum_{j=1}^{n} X_j$ is unbiased and has the variance:
$$\mathrm{Var}[X] = \frac{1}{n}\left[\mathrm{SJ}(F)\,\mathrm{SJ}(G) + \Bigl(\sum_{i\in I} f_i g_i\Bigr)^2 - 2\sum_{i\in I} f_i^2 g_i^2\right]$$
In particular, when $F = G$, the squared error is upper-bounded by $\frac{2}{n}\,\mathrm{SJ}(F)^2$.

Proof. For a proof see [7].
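The sketch estimator and the variance in Lemma 2 can be checked numerically on a tiny domain. The following is a minimal sketch under stated assumptions, not the construction used in [7]: the function names are hypothetical, and the $\pm 1$ variables are drawn fully independently (which in particular satisfies 4-wise independence) instead of being generated from a seed by a 4-wise independent generator.

```python
import itertools
import random


def lemma2_variance(f, g, n_copies=1):
    """Variance of the average of n_copies sketch estimates, as stated in Lemma 2."""
    sj_f = sum(x * x for x in f)                     # SJ(F)
    sj_g = sum(y * y for y in g)                     # SJ(G)
    join = sum(x * y for x, y in zip(f, g))          # true join size
    cross = sum((x * y) ** 2 for x, y in zip(f, g))  # sum of f_i^2 g_i^2
    return (sj_f * sj_g + join ** 2 - 2 * cross) / n_copies


def exact_sketch_moments(f, g):
    """Mean and variance of X = X_F * X_G over all +/-1 sign vectors."""
    values = []
    for signs in itertools.product((-1, 1), repeat=len(f)):
        x_f = sum(s * x for s, x in zip(signs, f))
        x_g = sum(s * y for s, y in zip(signs, g))
        values.append(x_f * x_g)
    mean = sum(values) / len(values)
    variance = sum(v * v for v in values) / len(values) - mean ** 2
    return mean, variance


def averaged_sketch(f, g, n_copies, seed=0):
    """Average of n_copies elementary sketches (assumption: fully random signs)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_copies):
        signs = [rng.choice((-1, 1)) for _ in f]
        x_f = sum(s * x for s, x in zip(signs, f))
        x_g = sum(s * y for s, y in zip(signs, g))
        total += x_f * x_g
    return total / n_copies
```

For frequencies `f = [3, 1, 2]` and `g = [1, 4, 2]`, the enumeration gives a mean equal to the true join size 11 and a variance that agrees with `lemma2_variance(f, g)`; enumeration is only feasible for very small domains and is used here purely as a check.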


Notice that, for an unbiased estimator like $X$ above, the squared error is the same as the variance.

Proposition 5. If the random shuffling assumption holds, then the error of averaging $n$ elementary sketches is lower-bounded by
$$\frac{1}{n}\,\frac{N-2}{N}\,\mathrm{SqErr}(F)\,\mathrm{SqErr}(G)$$

Proof. The result follows directly from Lemma 2, the fact that $\mathrm{SqErr}(F) \leq \mathrm{SJ}(F)$ and the observation that
$$E\left[\sum_{i\in I} f_i^2\,g_{\sigma(i)}^2\right] = \sum_{i\in I} f_i^2\,\frac{1}{N}\sum_{j\in I} g_j^2 = \frac{1}{N}\,\mathrm{SJ}(F)\,\mathrm{SJ}(G)$$

As with sampling and histograms, the squared error in Proposition 5 is proportional to the product of the squared errors of the two relations. It is also roughly proportional to the inverse of the number of averaged sketches, $n$.

2.6.3 Comments on Comparison

We postpone the discussion on how histograms compare with sampling and sketching for the self-join size problem to Section 2.7.2. We look here only at the comparison in the case when the random shuffling assumption holds. From the result in Theorem 1, we notice that the squared error of one-bucket histograms is proportional to the product of the squared errors of relations $F$ and $G$ but is inversely proportional to the size of the domain of the join attribute minus one. The lower bounds on the squared error for samples (Proposition 3) and sketches (Proposition 5) are also proportional to the product of the squared errors of the two relations, but they are roughly proportional to the inverse of the size of the synopsis, $N'$ for the samples and $n$ for the sketches. This means that both samples and sketches need space comparable to the size of the domain of the join attribute in order to compete with one-bucket histograms. Given this, we can conclude that, when the random shuffling


assumption holds, histograms are the best approximation technique, which is consistent with our theorems.

2.7 Histograms when Random Shuffling Assumption Does Not Hold

In order to obtain a characterization of how histograms perform on general problems, not only in the case when the frequencies in the two relations are not correlated, we look at the average behavior of two classes of histograms: histograms with buckets of fixed size and general histograms. In the introduction, we survey other work that provided theoretical characterizations of histograms.

2.7.1 Random Histograms on Arbitrary Problems

Theorem 3. Let $F$ and $G$ be two relations. Let $I_1,\ldots,I_n$ be a random partitioning of $I$, the domain of join attribute $A$, with the property that $|I_i| = N_i$, a fixed and given number. Then,
$$X = \sum_{i=1}^{n} N_i\,\frac{\sum_{j\in I_i} f_j}{N_i}\,\frac{\sum_{j\in I_i} g_j}{N_i}$$
is the histogram estimate for $|F \bowtie G|$ where the buckets are given by $I_1,\ldots,I_n$. Furthermore,
$$E[X] = \frac{N-n}{N(N-1)}\sum_{j\in I} f_j \sum_{j\in I} g_j + \frac{n-1}{N-1}\sum_{j\in I} f_j g_j$$
where the expectation is taken over the probability space in which all partitionings are equiprobable.

Proof. See Appendix D.

An immediate consequence of the above result is:

Corollary 1. Under the same conditions as in Theorem 3 but removing the fixed size for each part in the partition, $E[X]$ remains the same.

Proof. Since the expected value of $X$ does not depend on the particular values of $N_i$, i.e. the average behavior is the same irrespective of the bucket sizes, the expectation remains the same irrespective of the particular probability space over the values $N_i$; thus it is the same when we choose the probability space such that the probability of any of the partitionings is the same.
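The expectation in Theorem 3 can be verified by brute force on a small domain. The following is an illustrative sketch only; the function names are hypothetical, and the random partitioning is modeled by enumerating every ordering of the domain and cutting it into consecutive buckets of the given sizes, which makes all partitionings with those sizes equally likely. Exact rational arithmetic avoids rounding issues.

```python
from fractions import Fraction
from itertools import permutations


def histogram_estimate(f, g, buckets):
    """Histogram estimate of the join size for one concrete bucketization."""
    total = Fraction(0)
    for bucket in buckets:
        sum_f = sum(f[j] for j in bucket)
        sum_g = sum(g[j] for j in bucket)
        # N_i * (sum_f / N_i) * (sum_g / N_i) simplifies to sum_f * sum_g / N_i
        total += Fraction(sum_f * sum_g, len(bucket))
    return total


def expected_estimate_by_enumeration(f, g, sizes):
    """Average the estimate over all partitionings with the given bucket sizes."""
    total, count = Fraction(0), 0
    for perm in permutations(range(len(f))):
        buckets, start = [], 0
        for size in sizes:
            buckets.append(perm[start:start + size])
            start += size
        total += histogram_estimate(f, g, buckets)
        count += 1
    return total / count


def theorem3_expectation(f, g, n_buckets):
    """Closed form of E[X] from Theorem 3 (requires a domain of size >= 2)."""
    big_n = len(f)
    sum_f, sum_g = sum(f), sum(g)
    join = sum(x * y for x, y in zip(f, g))
    return (Fraction(big_n - n_buckets, big_n * (big_n - 1)) * sum_f * sum_g
            + Fraction(n_buckets - 1, big_n - 1) * join)
```

Calling both functions with the same number of buckets but different size vectors, for example `(2, 3)` and `(1, 4)` on a domain of five items, yields the same value, which illustrates the statement of Corollary 1 that the bucket sizes do not affect the expectation.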


Interestingly, on average, the estimate produced using histograms is a linear combination of the estimator of the one-bucket histogram and the true size of the join. Unfortunately, the weight of the correct term is $\frac{n-1}{N-1}$; thus, in general, histograms will provide accurate estimates only when $n \to N$, unless the one-bucket histogram is accurate. This only happens when the random shuffling assumption holds.

2.7.2 Random Histograms for Self-join Size Computation

Using the result in Theorem 3 and its Corollary 1, when the relation $F$ is joined with itself, on average, the error due to the bias of histograms is:
$$\mathrm{bias}^2(X) = \Bigl(\sum_{i\in I} f_i g_i - E[X]\Bigr)^2 = \left(\frac{N-n}{N-1}\right)^2 \bigl(\mathrm{SJ}(F) - N\bar{f}^2\bigr)^2 \approx \mathrm{SqErr}(F)^2$$
Since the squared error is the sum of the squared bias and the variance, when comparing the performance of histograms with sampling (Proposition 4) and sketches (Lemma 2), we notice that the error is not inversely proportional to the size of the synopsis; it has a slow linear decrease. This means that, on average, for the self-join size problem histograms behave fundamentally worse than sampling and sketching.

2.7.3 Comments on End-biased Histograms

As we have seen in the previous sections, histograms have small error when the random shuffling assumption holds and large error, on average, otherwise, for example when self-join sizes are computed. Interestingly, end-biased histograms [44] mostly avoid the poor performance when correlations are present. To see why, we observe that, when the random shuffling hypothesis holds, end-biased histograms behave well like all other histograms. On the other hand, when frequencies are correlated, the result is going to be dominated by the high frequencies, so the estimation is again precise since these frequencies are captured accurately.

To reinforce the above discussion, let us consider the self-join size estimation using end-biased histograms. Assume, without loss of generality, that the end-biased


histogram keeps the frequencies of the items $1:k-1$ precisely and places items $k:|I|$ into a single bucket, $k$. With this, the estimate of the self-join size $\sum_{i\in I} f_i^2$ is:
$$\sum_{i=1}^{k-1} f_i^2 + (|I|-k+1)\left(\frac{\sum_{i=k}^{|I|} f_i}{|I|-k+1}\right)^2$$
so the squared error is:
$$\left(\sum_{i=k}^{|I|} f_i^2 - (|I|-k+1)\left(\frac{\sum_{i=k}^{|I|} f_i}{|I|-k+1}\right)^2\right)^2$$
which is the error of the large bucket. Since the end-biased histogram singles out the high-frequency items, the error made by the large bucket is limited; the contribution of the large frequency items dominates the error of the unbiased estimation. The only situation when this is not happening is when the frequencies are roughly the same; in that case the large bucket prediction is accurate.

2.8 Generalization to Multidimensional Histograms

When a query involves multiple attributes, histograms on one attribute are not enough for approximation. In this case, histograms can be constructed in two ways: multiple unidimensional histograms with independence among the attributes, or multidimensional histograms. The first approach relies on the attribute independence assumption, which puts strong restrictions on the attributes [80] and is unlikely to hold in reality. So we will use multidimensional histograms [35, 47, 67, 93] for approximation and generalize the random shuffling assumption to multidimensional histograms in this section.

2.8.1 General Random Shuffling Properties

As we have seen in Section 2.5, the random shuffling assumption is the key to statistical characterization. In this section, our goal is to extend the statistical characterization to multidimensional histograms under the random shuffling assumption. Let a domain (or a bucket) have $m$ dimensions and the size $n_i$ ($1 \leq i \leq m$) for each dimension. Thus the size of the overall domain $N$ is $n_1 \cdots n_m$. A natural generalization


of the unidimensional random shuffling assumption is to assume that each of the $N$ values in the multidimensional space is randomly shuffled. In this way, the properties in the unidimensional case generalize straightforwardly to the multidimensional case. It is important to point out that this random shuffling assumption cannot be obtained by combining $m$ independent unidimensional random shuffling assumptions, one for each dimension. In the first case, the number of possible shufflings is $(n_1 \cdots n_m)!$ while the number under the independence assumption is $n_1! \cdots n_m!$, a significantly smaller number.

From a theoretical point of view, it is very important to generalize the random shuffling distribution in the manner described above instead of combining independent unidimensional random distributions. In this manner, no constraints are posed on the buckets of the multidimensional histogram. In particular, any bucketization is allowed, not only grid-like bucketizations that are obtained by combining independent unidimensional buckets.

2.8.2 Multidimensional Random Shuffling Distribution

Define the size of the $m$-dimension domain as $N = n_1 \cdots n_m$. Define $\Phi = \{f_{(1,\ldots,1)},\ldots,f_{(n_1,\ldots,n_m)}\}$, the frequencies within the domain. Let $S$ denote the permutation set of $\Phi$. Let $\Omega = \{(1,\ldots,1),\ldots,(n_1,\ldots,n_m)\}$ represent the slots in the domain. Let a random variable $x_{(i_1,\ldots,i_m)}$ represent the frequency of the slot $(i_1,\ldots,i_m)$. Then the distribution of frequencies is the joint distribution of $x_{(1,\ldots,1)},\ldots,x_{(n_1,\ldots,n_m)}$.

Proposition 6. If the $f_{(i_1,\ldots,i_m)}$ are randomly shuffled in the domain, then $x_{(i_1,\ldots,i_m)}$ has the following statistical properties:

P1: $\forall (i_1,\ldots,i_m), (j_1,\ldots,j_m) \in \Omega$ with $(i_1,\ldots,i_m) \neq (j_1,\ldots,j_m)$, $\forall f_{(i_1,\ldots,i_m)} \in \Phi$:
$$P\bigl[x_{(i_1,\ldots,i_m)} = f_{(i_1,\ldots,i_m)} \wedge x_{(j_1,\ldots,j_m)} = f_{(i_1,\ldots,i_m)}\bigr] = 0$$


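P2: $\forall s \in S$:
$$P\bigl[(x_{(1)},\ldots,x_{(n)}) = s\bigr] = \frac{1}{N!} = \frac{1}{(n_1 \cdots n_m)!}$$

P3: $\forall (i_1,\ldots,i_m) \in \Omega$, $\forall f_{(i_1,\ldots,i_m)} \in \Phi$:
$$P\bigl[x_{(i_1,\ldots,i_m)} = f_{(i_1,\ldots,i_m)}\bigr] = \frac{1}{n_1 \cdots n_m}$$

P4: $\forall (i_1,\ldots,i_m), (j_1,\ldots,j_m) \in \Omega$ with $(i_1,\ldots,i_m) \neq (j_1,\ldots,j_m)$, $\forall f_{(i_1,\ldots,i_m)}, f_{(j_1,\ldots,j_m)} \in \Phi$ with $f_{(i_1,\ldots,i_m)} \neq f_{(j_1,\ldots,j_m)}$:
$$P\bigl[x_{(i_1,\ldots,i_m)} = f_{(i_1,\ldots,i_m)} \wedge x_{(j_1,\ldots,j_m)} = f_{(j_1,\ldots,j_m)}\bigr] = \frac{1}{(n_1 \cdots n_m)(n_1 \cdots n_m - 1)}$$

P3 and P4 in Proposition 6 are derived from P1 and P2. The meaning of these equations is self-evident and the same as that of the unidimensional random shuffling distribution. Now we can derive the moments of the multidimensional random shuffling distribution.

Lemma 3. The moments of the multidimensional random shuffling distribution are
$$E\bigl[x_{(i_1,\ldots,i_m)}\bigr] = \bar{f}$$
$$\mathrm{Var}\bigl(x_{(i_1,\ldots,i_m)}\bigr) = \frac{1}{n_1 \cdots n_m}\sum_{f_{(i_1,\ldots,i_m)} \in \Phi} f_{(i_1,\ldots,i_m)}^2 - \bar{f}^2$$
$$\mathrm{Cov}\bigl(x_{(i_1,\ldots,i_m)}, x_{(j_1,\ldots,j_m)}\bigr) = \frac{\bar{f}^2}{n_1 \cdots n_m - 1} - \frac{\sum_{(i_1,\ldots,i_m) \in \Omega} f_{(i_1,\ldots,i_m)}^2}{(n_1 \cdots n_m)(n_1 \cdots n_m - 1)}$$
Lemma 3 is the direct extension of Lemma 1 with $N = n_1 \cdots n_m$.

2.8.3 One-bucket Multidimensional Histogram

In this section, we analyze the one-bucket multidimensional histogram under the random shuffling assumption. Proposition 7 is the direct application of Lemma 3.

Proposition 7. Let $\sigma$ be a uniform random permutation over the domain $I: N_1 \times \cdots \times N_m$. Then for any $(i_1,\ldots,i_m) \in I$ and any frequencies of attributes in relation $G$,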


$\{g_{(i_1,\ldots,i_m)} \mid (i_1,\ldots,i_m) \in I\}$, we have:
$$E_m\bigl[g_{\sigma(i_1,\ldots,i_m)}\bigr] = \bar{g}$$
$$\mathrm{Var}_m\bigl[g_{\sigma(i_1,\ldots,i_m)}\bigr] = \frac{\mathrm{SqErr}(G)}{N_1 \cdots N_m}$$
$$\forall (i_1,\ldots,i_m) \neq (j_1,\ldots,j_m):\quad \mathrm{Cov}_m\bigl[g_{\sigma(i_1,\ldots,i_m)}, g_{\sigma(j_1,\ldots,j_m)}\bigr] = -\frac{\mathrm{SqErr}(G)}{(N_1 \cdots N_m)(N_1 \cdots N_m - 1)}$$

In Proposition 7, the approximated values are equal to the average frequency, which demonstrates that the multidimensional random shuffling assumption does not change the behavior of histograms. The squared error $\mathrm{SqErr}(G)$ within the bucket is averaged among all the slots within the bucket. When $m = 1$, Proposition 7 is exactly Proposition 2. The equations in Proposition 7 can be precomputed and serve as the basis for the following discussions.

2.8.4 Low Dimensional Histogram

The multidimensional histograms are constructed over $m$ attributes. There exist queries involving only $m-k$ ($k > 0$) attributes. In this case, a new multidimensional histogram with $m-k$ attributes can be constructed for the queries. This histogram requires extra time and space. Instead, this $(m-k)$-dimension histogram can be constructed by projecting out $k$ attributes from the original $m$-dimension histogram. Before we carry out further analysis, we assume the first $k$ attributes of the $m$ attributes are not involved in the queries. There are $N_{k+1} \cdots N_m$ slots within the bucket, with each slot having the frequency $\sum_{i_1=1}^{N_1} \cdots \sum_{i_k=1}^{N_k} g_{\sigma(i_1,\ldots,i_m)}$. To simplify the expressions, we introduce the following notations.


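Definition 2. The expected value and variance of $\sum_{i_1=1}^{N_1} \cdots \sum_{i_k=1}^{N_k} g_{\sigma(i_1,\ldots,i_m)}$ and the covariance of $\sum_{i_1=1}^{N_1} \cdots \sum_{i_k=1}^{N_k} g_{\sigma(i_1,\ldots,i_m)}$ and $\sum_{j_1=1}^{N_1} \cdots \sum_{j_k=1}^{N_k} g_{\sigma(j_1,\ldots,j_m)}$ are denoted by
$$\mathrm{EXP}_{m-k} = E_{m-k}\left[\sum_{i_1=1}^{N_1} \cdots \sum_{i_k=1}^{N_k} g_{\sigma(i_1,\ldots,i_m)}\right]$$
$$\mathrm{VAR}_{m-k} = \mathrm{Var}_{m-k}\left[\sum_{i_1=1}^{N_1} \cdots \sum_{i_k=1}^{N_k} g_{\sigma(i_1,\ldots,i_m)}\right]$$
$$\mathrm{COV}_{m-k} = \mathrm{Cov}_{m-k}\left[\sum_{i_1=1}^{N_1} \cdots \sum_{i_k=1}^{N_k} g_{\sigma(i_1,\ldots,i_m)},\ \sum_{j_1=1}^{N_1} \cdots \sum_{j_k=1}^{N_k} g_{\sigma(j_1,\ldots,j_m)}\right]$$
Note that $\mathrm{EXP}_m$, $\mathrm{VAR}_m$ and $\mathrm{COV}_m$ are already known and pre-computed from Proposition 7. Now we derive the formulas of the moments.

Lemma 4. Let $\sigma$ be a uniform random permutation over domain $I$. Let $(\cdot, i_{k+1},\ldots,i_m) \in I$ denote the slot with $m-k$ dimensions. Define $g_{\sigma(\cdot, i_{k+1},\ldots,i_m)} = \sum_{i_1=1}^{N_1} \cdots \sum_{i_k=1}^{N_k} g_{\sigma(i_1,\ldots,i_m)}$, the frequency of the slot $(\cdot, i_{k+1},\ldots,i_m)$. Then:
$$\mathrm{EXP}_{m-k} = N_1 \cdots N_k\,\mathrm{EXP}_m$$
$$\mathrm{VAR}_{m-k} = N_1 \cdots N_k\,\mathrm{VAR}_m + (N_1 \cdots N_k)(N_1 \cdots N_k - 1)\,\mathrm{COV}_m$$
$$\mathrm{COV}_{m-k} = N_1^2 \cdots N_k^2\,\mathrm{COV}_m$$
where $\mathrm{EXP}_m$, $\mathrm{VAR}_m$ and $\mathrm{COV}_m$ are computed by Proposition 7.

Proof. The expected value of $g_{\sigma(\cdot, i_{k+1},\ldots,i_m)}$ is
$$\mathrm{EXP}_{m-k} = \sum_{i_1=1}^{N_1} \cdots \sum_{i_k=1}^{N_k} E\bigl[g_{\sigma(i_1,\ldots,i_m)}\bigr] = N_1 \cdots N_k\,\mathrm{EXP}_m$$
The variance of $g_{\sigma(\cdot, i_{k+1},\ldots,i_m)}$ is $\mathrm{VAR}_{m-k} = \mathrm{Var}\bigl(\sum_{i_1=1}^{N_1} \cdots \sum_{i_k=1}^{N_k} g_{\sigma(i_1,\ldots,i_m)}\bigr) = N_1 \cdots N_k\,\mathrm{Var}\bigl(g_{\sigma(i_1,\ldots,i_m)}\bigr) +$ 2Xi1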

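When $(i_{k+1},\ldots,i_m) \neq (j_{k+1},\ldots,j_m)$, the covariance of $g_{\sigma(\cdot, i_{k+1},\ldots,i_m)}$ and $g_{\sigma(\cdot, j_{k+1},\ldots,j_m)}$ is
$$\mathrm{COV}_{m-k} = \mathrm{Cov}\left(\sum_{i_1=1}^{N_1} \cdots \sum_{i_k=1}^{N_k} g_{\sigma(i_1,\ldots,i_m)},\ \sum_{j_1=1}^{N_1} \cdots \sum_{j_k=1}^{N_k} g_{\sigma(j_1,\ldots,j_m)}\right) = \sum_{i_1=1}^{N_1} \cdots \sum_{i_k=1}^{N_k}\ \sum_{j_1=1}^{N_1} \cdots \sum_{j_k=1}^{N_k} \mathrm{Cov}\bigl(g_{\sigma(i_1,\ldots,i_m)}, g_{\sigma(j_1,\ldots,j_m)}\bigr) = (N_1 \cdots N_k)^2\,\mathrm{COV}_m$$

Given $\mathrm{EXP}_m$, $\mathrm{VAR}_m$ and $\mathrm{COV}_m$, the computations in Lemma 4 are surprisingly simple. All the equations in Lemma 4 are based on the results of the original $m$-dimension histograms. This lemma can then be extended to compute the projections of any combination of $k$ attributes from $m$ attributes because symmetry is preserved by projections. If the projected attributes are not the first $k$ attributes, the moments are derived by just replacing the sizes of dimensions in the formulas.

2.8.5 Estimating Size of Star-Joins using One-bucket Histograms

Once we have multidimensional histograms, we can approximate the size of multiple relation joins. Suppose a star-join involves $k$ relations $F_i$ (0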

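Proof.
$$E\left[\sum_{i_1=1}^{N_1} \cdots \sum_{i_k=1}^{N_k} f_{1,i_1} \cdots f_{k,i_k}\,g_{\sigma(i_1,\ldots,i_k,\ldots,i_m)}\right] = \sum_{i_1=1}^{N_1} \cdots \sum_{i_k=1}^{N_k} E\bigl(f_{1,i_1} \cdots f_{k,i_k}\,g_{\sigma(i_1,\ldots,i_m)}\bigr) = \mathrm{EXP}_{m-k}\,\bar{f}_1 \cdots \bar{f}_k$$
The variance is:
$$\mathrm{Var}\left(\sum_{i_1=1}^{N_1} \cdots \sum_{i_k=1}^{N_k} f_{1,i_1} \cdots f_{k,i_k}\,g_{\sigma(i_1,\ldots,i_k,\ldots,i_m)}\right)$$
$$= \mathrm{Var}_{(i_{k+1},\ldots,i_m) = (i'_{k+1},\ldots,i'_m)}\left[\sum_{i_1=1}^{N_1} \cdots \sum_{i_k=1}^{N_k} f_{1,i_1} \cdots f_{k,i_k}\,g_{\sigma(i_1,\ldots,i_m)}\right] + \mathrm{Cov}_{(i_{k+1},\ldots,i_m) \neq (i'_{k+1},\ldots,i'_m)}\left[\sum_{i_1=1}^{N_1} \cdots \sum_{i_k=1}^{N_k} f_{1,i_1} \cdots f_{k,i_k}\,g_{\sigma(i_1,\ldots,i_m)},\ \sum_{i'_1=1}^{N_1} \cdots \sum_{i'_k=1}^{N_k} f_{1,i'_1} \cdots f_{k,i'_k}\,g_{\sigma(i'_1,\ldots,i'_m)}\right]$$
$$= \mathrm{VAR}_{m-k}\,\mathrm{SJ}(F_1) \cdots \mathrm{SJ}(F_m) + \mathrm{COV}_{m-k}\,\bigl(N_1^2 \bar{f}_1^2 - \mathrm{SJ}(F_1)\bigr) \cdots \bigl(N_m^2 \bar{f}_m^2 - \mathrm{SJ}(F_m)\bigr)$$

Theorem 4 provides the approximations of the sizes of star-joins using histograms and the corresponding errors of the approximations. The equations in Theorem 4 are general and simple to compute. The values of $\mathrm{SJ}(F_i)$ and $\bar{f}_i$ can be computed from their histograms. The moments of low dimensional histograms can be derived from Lemma 4. A special case of Theorem 4 is $k = 1$, for which the variance can be written as
$$\frac{(N_1 \cdots N_{m-1})\,\mathrm{SqErr}(F)\,\mathrm{SqErr}(G)}{N_1 \cdots N_m - 1}$$
The analysis of multidimensional histograms with aligned buckets follows exactly the analysis in Section 2.5.2. We skip the discussion here.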
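The projection formulas of Lemma 4, which feed into Theorem 4, can be sanity-checked by enumeration on a tiny bucket. The following is a minimal sketch with hypothetical function names. It assumes the bucket is flattened into a list of slot frequencies, that the first `k_size` slots collapse into one projected slot and the next `k_size` slots into another (so `k_size` plays the role of $N_1 \cdots N_k$), and that $\mathrm{SqErr}(G) = \sum g^2 - N\bar{g}^2$.

```python
from fractions import Fraction
from itertools import permutations


def full_moments(freqs):
    """EXP_m, VAR_m and COV_m of one shuffled slot, following Proposition 7."""
    n = len(freqs)
    mean = Fraction(sum(freqs), n)
    sq_err = sum(v * v for v in freqs) - n * mean ** 2   # SqErr(G)
    return mean, sq_err / n, -sq_err / (n * (n - 1))


def projected_moments(freqs, k_size):
    """EXP, VAR and COV of projected slots, following Lemma 4."""
    mean, var, cov = full_moments(freqs)
    return (k_size * mean,
            k_size * var + k_size * (k_size - 1) * cov,
            k_size ** 2 * cov)


def enumerate_projection(freqs, k_size):
    """Same three moments, computed over every shuffling of the frequencies."""
    a_vals, b_vals = [], []
    for perm in permutations(freqs):
        a_vals.append(sum(perm[:k_size]))             # first projected slot
        b_vals.append(sum(perm[k_size:2 * k_size]))   # second projected slot
    count = len(a_vals)
    mean_a = Fraction(sum(a_vals), count)
    mean_b = Fraction(sum(b_vals), count)
    var_a = Fraction(sum(a * a for a in a_vals), count) - mean_a ** 2
    cov_ab = Fraction(sum(a * b for a, b in zip(a_vals, b_vals)), count) - mean_a * mean_b
    return mean_a, var_a, cov_ab
```

For a 2 x 2 bucket with frequencies `[4, 1, 0, 3]` and `k_size = 2`, both functions return the same triple, so the closed forms agree with the brute-force moments on this example.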


Note on generalization to arbitrary joins

It is possible to generalize these ideas to arbitrary joins. We do not do so here since the current notation is insufficient and, in the case of general joins, leads to extremely complicated formulas.

2.9 XSketch Estimator under Random Shuffling Assumption

2.9.1 Introduction

XML documents are simple, self-descriptive and ubiquitous as a data exchange and storage format on the web. XML documents are constructed as element-subelement structures. An XML query involves value predicates and structure predicates. Since traditional histograms can only approximate the value selectivity, there is research interest in combining histograms with XML hierarchical structures. Wu [96, 97] proposes a position histogram, which encodes XML nodes in a 2D space by a preorder traversal of XML document trees and thus keeps the hierarchical information. XMLPathLearner [68] uses two-dimensional histograms to preserve structure information of paths with length of two. These methods need large storage space to ensure good estimation. In contrast, XSketches [23, 78] provide an efficient framework to estimate XML queries in limited storage space. We will explain XSketches in Section 2.9.2 in detail. However, all these methods assume that frequencies are uniformly distributed in buckets. As discussed in previous sections, the random shuffling assumption is more likely to hold in practice. In this section, we combine the random shuffling assumption with XSketches to estimate the selectivity of XML queries. We also provide a tight error guarantee for the XSketch estimator under this assumption.

2.9.2 XSketches

XSketches [78] provide an efficient framework to estimate the selectivity of an XML query. XSketches exploit both value and structure information of XML documents. They preserve structure information by grouping the correlated neighbor nodes and maintain value information by using histograms of nodes. XSketches estimate the selectivity in the following way: first they preserve value and structure information of simple paths;


then they apply different independence assumptions to XML queries and decompose them into simple paths; finally XSketches use the preserved value and structure information to estimate and refine the selectivity. In particular, for an arbitrary XML query $T_n[\theta_n]/\cdots/T_1[\theta_1]$, where $T_i$ denotes a node and $[\theta_i]$ denotes a branch predicate or a value predicate, using Bayesian networks and chain rules, XSketches formulate the selectivity as
$$\mathrm{sel} = \mathrm{count}(T_1)\,p(T_1[\theta_1])\,p(T_2[\theta_2]/T_1[\theta_1] \mid T_1[\theta_1]) \cdots p(T_n[\theta_n]/T_{n-1}[\theta_{n-1}]/\cdots/T_1[\theta_1] \mid T_{n-1}[\theta_{n-1}]/\cdots/T_1[\theta_1]) \quad (2)$$
To simplify the above complex equation and compensate for the lack of the detailed synopsis, XSketches make the following assumptions:

Assumption 1 (A1 Twig Independence): The parent twigs of a node are independent of its child twigs, which means: $p(T_3[\theta]/T_2 \mid T_2/T_1[\theta']) \approx p(T_3[\theta]/T_2)$ and $p(T_1[\theta] \mid T_2[\theta']/T_1) \approx p(T_1[\theta])$.

Assumption 2 (A2 Edge-value Independence Across Non-Stable Edges): The non-stable parent/child edges of a node are independent of the values of this node, which means $p(T_2/T_1 \mid T_1[\theta]) \approx p(T_2/T_1)$ and $p(T_2[\theta] \mid T_2[\theta']) \approx p(T_2[\theta])$.

Assumption 3 (A3 Value Independence Outside Correlation Scope): If nodes $T_1$ and $T_2$ do not have direct correlations, then $T_1$ and $T_2$ are independent.

A1, A2 and A3 are independence assumptions and eliminate the conditional probability expressions in Equation 2. As a result, Equation 2 can be simplified as the following:
$$\mathrm{count}(T_1)\prod_{i=1}^{q} f(\theta_i \mid T'_i)\prod_{i=1}^{q} f(T'_{k_i}/T'_{k_i+1})\prod_{i=1}^{l} f(\theta_{q+i} \mid T'_{q+i})\prod_{i=1}^{l} f(T'_{m_i}[\theta'_{m_i+1}]) \quad (2)$$
Detailed information about this equation is presented in [78]. To estimate the value of $f(T'_{k_i}/T'_{k_i+1})$ and $\theta$, XSketches use histograms and make the following assumptions:

Assumption 4 (A4 Frequency Uniformity): XSketches assume frequencies are uniformly distributed within a bucket.


Assumption 5 (A5 Forward-Edge Uniformity): If a node $T_2$ does not have a static correlation with its child node $T_1$, $\mathrm{sel}(T_2/T_1) = \mathrm{count}(T_1)/\max\bigl(\sum_{w \in \mathrm{parent}(T_1)} \mathrm{count}(w),\ \mathrm{count}(T_2)\bigr)$.

Assumption 6 (A6 Backward-Edge Uniformity): If a node $T_1$ does not have a static correlation with its parent node $T_2$, $\mathrm{sel}(T_2/T_1) = \mathrm{count}(T_2)/\sum_{w \in \mathrm{parent}(T_1)} \mathrm{count}(w)$.

XSketches use the traditional histogram assumption (A4) for value predicates. A5 and A6 estimate the distribution of non-stable parent nodes and child nodes. A5 computes the branch predicate of a node which has multiple parents in a top-down mode. A6 computes the branch predicate of a node which has multiple children in a bottom-up mode. Combining the value and structural synopses with these assumptions, XSketches estimate the selectivity with limited storage.

Assumptions A1, A2 and A3 assume independence among nodes which do not have stable connections between them. They can be used in the statistical model to analyze estimations. As we have stated, the uniform assumption (A4) is restrictive and prevents the further statistical analysis of approximations. If A4 is replaced by the random shuffling assumption, the statistical model is complete and thus statistical analysis can be performed. A5 and A6 are structure uniformity assumptions for branch selectivity, which depend on the histograms of the nodes. Under the random shuffling assumption, although the frequencies of node values are randomly shuffled, the total of the frequencies remains constant. The estimations in A5 and A6 only involve the total count of frequencies of twigs, thus they do not introduce extra errors to the final estimation.

2.9.3 Shuffled XSketches

In this section, we combine the random shuffling assumption and the framework of XSketches to estimate the selectivity of an XML query. We adopt the independence assumptions of XSketches to specify correlations among nodes, A5 and A6 of XSketches to specify the branch predicates, and the random shuffling assumption for the value predicates. As discussed, A5 and A6 do not introduce errors to the final estimation. The estimation errors are introduced by the value predicates.


Now, we estimate the selectivity of an arbitrary XML query with equality value predicates and analyze its estimation error. We first study simple queries of length two. Then we extend the statistical analysis to the general case. To simplify the analysis, we assume the query only contains equality value predicates in the following discussion. Let us start with a query of length two such as $T_2[\theta_2]/T_1[\theta_1]$. We have the following theorem:

Theorem 5. Let $T_1$ and $T_2$ be two XML nodes with domains $I_1$ and $I_2$. Let the sizes of $I_1$ and $I_2$ be $N$ and $M$, respectively. Let $T_2[\theta_2]/T_1[\theta_1]$ be an XML query with equality value predicates. The navigation ratio $r$ is equal to $P(T_2/T_1)$. Let $f_i$ denote the frequencies of values of node $T_1$. Then $\frac{r\sum_{i\in I_1} f_i}{MN}$ is an unbiased estimator of $T_2[\theta_2]/T_1[\theta_1]$ and the squared error is
$$\frac{r^2\sum_{i\in I_1} f_i^2}{NM} - \frac{r^2\,\bar{f}^2}{M^2}$$

Proof. See the proof in Appendix E.

This estimated value under the random shuffling assumption is exactly the estimated value by XSketches, where $r$ is $P(T_2/T_1)$, $\sum_{i\in I_1} f_i$ is $\mathrm{COUNT}(T_2)$ and $\frac{1}{MN}$ is the selectivity of the value predicates $f(T_2[\theta_2])\,f(T_1[\theta_1])$. In Theorem 5, only the histogram of $T_1$ is given and the histogram of $T_2$ is unknown. Edge-value independence assumptions are used in the estimation. The navigation ratio $r$ encodes the structure information of the estimation. Since XSketches use graphs to model XML documents, there may be multiple parent nodes for one XML node. A5 or A6 is used to compute $r$. Now we extend this theorem to more general cases: XML queries with arbitrary size.

Theorem 6. Let $T_1,\ldots,T_m$ be $m$ XML nodes with domains $I_1,\ldots,I_m$, respectively. Let the domain size of $I_i$ be $N_i$. Let $T_m[\theta_m]/\cdots/T_1[\theta_1]$ be an XML query with equality value


predicates. Let $f_i$ denote the frequencies of values of node $T_1$. Let $r_i$ be the navigation ratio, which is equal to $P(T_{i+1}/T_i \mid T_i)$. Then $\frac{r_{m-1} \cdots r_1\sum_{i\in I_1} f_i}{N_m \cdots N_1}$ is the unbiased estimator of $T_m[\theta_m]/\cdots/T_1[\theta_1]$. The squared error is
$$\frac{(r_{m-1} \cdots r_1)^2\sum_{i\in I_1} f_i^2}{N_m \cdots N_1} - \frac{(r_{m-1} \cdots r_1)^2\,\bar{f}^2}{(N_m \cdots N_2)^2} \quad (2)$$
The proof of Theorem 6 follows that of Theorem 5 and we skip it here.

The estimation of Theorem 6 is based on the histogram of $T_1$ and the XSketch framework. The estimated value is unbiased and is consistent with Equation 2. As discussed, the structure predicates do not introduce errors to the estimation. They affect the estimation result as coefficients in the expression. The estimation errors are introduced by value predicates. Due to the lack of the synopses of the parent nodes of $T_1$, the independence assumption is used to estimate the value predicates of parent nodes. The errors are introduced mainly by the value predicate of $T_1$. The value predicates of the parent nodes of $T_1$ do not introduce extra errors to the final estimation. However, the estimation errors become relatively large when more value predicates of the parent nodes are involved in the XML query, because $\sum_{i\in I_1} f_i^2$ is decreased in the ratio of $\frac{1}{N_i}$ while $\bar{f}^2$ is decreased in the ratio of $\frac{1}{N_i^2}$ in Expression 2.

2.10 Comments on Random Shuffling Assumption

As we have shown in this chapter, the random shuffling assumption is an alternative to the traditional uniform frequency assumption; using the average frequency in a bucket instead of the true frequencies produces estimators for both. The fundamental difference is that the random shuffling assumption preserves more information about the original distribution, such as skew, and can provably produce estimators that rival sketches and sampling. Should the random shuffling assumption hold, histograms are virtually unbeatable as an approximation method. Interestingly, at the same time, the random shuffling assumption is a weaker assumption than assuming any particular distribution within the bucket, the uniform distribution being only one of the possible such


distributions, and is thus more likely to hold in practice. The only information lost when making the random shuffling assumption is the arrangement of the frequencies, not the magnitude of the frequencies in a bucket. Surprisingly, the average frequency still gives the best estimator.

It is worth pointing out that this chapter is about formalizing and deriving consequences of the random shuffling assumption, not about proving or arguing that the assumption holds. This assumption, in our opinion, is a much stronger explanation of the good behavior histograms tend to have in practice. It provides a much more accurate explanation of the behavior than the rather loose uniform distribution assumption. At the same time, the assumption allows the computation of the error of the histograms should the assumption hold, thus providing a quantitative way to compare histograms to other approximations and opening up the way for providing theoretical guarantees for histograms. On the other hand, as we have seen, if the random shuffling assumption does not hold, in the extreme, histograms can be very poor. This strongly suggests that the behavior of histograms is tied to the validity of the assumption. An interesting line of future work is how to check that the assumption holds in practical situations in order to determine if histograms are an appropriate approximation method.

One thing this chapter did not consider is histogram construction; we only investigated explanations for behavior. The random shuffling assumption does not change any of the construction methods and does not introduce new types of histograms. Should we want to provide statistical confidence intervals for the histogram approximation under the random shuffling assumption, we only need to estimate the squared errors within each bucket. This can be accomplished trivially in a single pass if the frequencies are available. If a stream of actual items is provided, the squared error can be approximated using sketches. Figuring out where to place the histogram buckets is a much harder problem than computing the squared error within a bucket.
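As an illustration of the single-pass computation mentioned above, the following minimal sketch accumulates the count, sum and sum of squares of the frequencies in each bucket, derives $\mathrm{SqErr} = \sum f_i^2 - |I_l|\,\bar{f}^2$, and combines the buckets using the estimate and variance of Theorem 2. The function names are hypothetical, and turning the variance into an interval with Chebyshev's inequality is an assumption made here for illustration, chosen because it needs no distributional assumption.

```python
import math


def bucket_stats(freqs):
    """One pass over a bucket: returns (size, mean frequency, SqErr)."""
    count = total = total_sq = 0
    for value in freqs:
        count += 1
        total += value
        total_sq += value * value
    mean = total / count
    sq_err = total_sq - count * mean * mean   # sum f^2 - |I| * mean^2
    return count, mean, sq_err


def join_size_interval(f_buckets, g_buckets, confidence=0.95):
    """Estimate of |F join G| with a Chebyshev interval (aligned buckets)."""
    estimate = 0.0
    variance = 0.0
    for f_bucket, g_bucket in zip(f_buckets, g_buckets):
        size, mean_f, sq_f = bucket_stats(f_bucket)
        _, mean_g, sq_g = bucket_stats(g_bucket)
        estimate += size * mean_f * mean_g
        if size > 1:   # a bucket with a single slot is exact
            variance += sq_f * sq_g / (size - 1)
    # Chebyshev: P(|X - E[X]| >= c * sigma) <= 1 / c^2
    half_width = math.sqrt(variance / (1.0 - confidence))
    return estimate, estimate - half_width, estimate + half_width
```

For example, with `f_buckets = [[2, 4], [1, 1, 1]]`, `g_buckets = [[3, 1], [5, 0, 1]]` and `confidence=0.75`, the sketch returns an estimate of 18 with the interval [14, 22], which contains the true join size of 16 for this arrangement. Chebyshev intervals are conservative; a tighter bound would require additional assumptions about the distribution of the estimator.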


2.11 Summary

In this chapter we revisited histograms from a statistical point of view. The work is based on assuming that the frequencies are uniformly shuffled within a bucket, which is significantly less restrictive than the uniform assumption. The estimate under the random shuffling assumption is consistent with the way histograms are used for approximate query processing. Using this observation we first analyzed the behavior of unidimensional histograms and then extended the analysis to multidimensional histograms. We provided tight error guarantees when the assumption holds and showed that, in such cases, histograms are significantly better approximators than sampling and sketches. On the flip side, we showed that, when the assumption does not hold, on average, histograms are inferior to sampling and sketches. Finally, we applied the random shuffling assumption to XSketch estimators, which use histograms as a building block, and derived a tight error guarantee for them. This is an example of how the theory we developed for histograms as statistical estimators can be extended to methods that use histograms as ingredients.


CHAPTER 3
AGGREGATION OVER PROBABILISTIC DATABASES WITH CONFIDENCE INTERVAL

3.1 Background

Probabilistic databases have attracted much attention in recent years because of the increasing demand for processing uncertain data in practice. Computing aggregates over probabilistic data is useful in situations where analytical processing is required over uncertain data. Since errors are inevitably introduced in the data, or the nature of certain kinds of data is uncertain, the need to process aggregates over probabilistic databases arises. While the need is there, the speed of processing aggregates over probabilistic data has to be comparable with the speed current systems achieve when computing aggregates over non-probabilistic data.

When it comes to aggregates in probabilistic databases, as discussed in the introduction, providing just the expected value is not sufficient. Providing confidence intervals for real quantities is the standard practice; such confidence intervals can be obtained if both the expectation and the variance of the aggregates are computed. Most of the difficulty in dealing with aggregates in probabilistic databases is related to the computation of the variance; the expected value computation is much easier and often straightforward. In this chapter we show how the variance, and thus confidence intervals, can be computed almost as efficiently as the expectation for a large class of queries and probabilistic models.

In this chapter, we will present methods to compute confidence intervals for probabilistic aggregates based on efficient computation of moments. We believe that the ideas and methods we describe in this chapter can be used to incorporate probabilistic aggregate computation capabilities in existing database systems without the need for major redesign.

3.2 Preliminaries

In this section, we review methods of computing confidence intervals, and describe the queries we support in this work and the formal definition of probabilistic databases.


3.2.1 Queries and Equivalent Algebraic Expressions

The basic type of query we consider in this chapter is:

    SELECT SUM(F(t1, ..., tn)) FROM R1, ..., Rn WHERE SIMPLE_CONDITION;

where $R_1,\ldots,R_n$ are $n$ relations, $t_i$ is a tuple in $R_i$, $F()$ is an aggregate function such as SUM that depends on tuples from each relation and SIMPLE_CONDITION is a condition involving only tuples from relations $R_1,\ldots,R_n$. Arbitrary selections and join conditions are allowed but no subqueries of any kind or DISTINCT are allowed (no queries with set semantics).

In order to translate the above SQL query into algebraic formulas for statistical analysis, we unify the WHERE condition and the aggregate function $F()$ into a single function $f(\{t_i\})$ in the following way: (a) we introduce a filtering function $I_C(t)$ that takes value 1 if the tuple $t$ satisfies the condition $C$ and 0 otherwise, and (b) we define $f(t) = F(t)\,I_C(t)$. With this, the value of the aggregate is simply:
$$A = \sum_{t_1\in R_1} \cdots \sum_{t_n\in R_n} F(t_1,\ldots,t_n)\,I(t_1,\ldots,t_n) = \sum_{\{t_i\in R_i \mid i\in\{1:n\}\}} f(\{t_i\}) \quad (3)$$
that is, the sum of the function $f()$ over the cross-product of the relations involved.

3.2.2 Probabilistic Database as a Description of a Probability Space

In this chapter we consider probabilistic databases with the following property: given a set of base relations $R_1,\ldots,R_n$, the existence (not content) of the tuple $t_i$ in relation $R_i$ is probabilistic. This model is called the tuple-uncertain probabilistic model, as opposed to attribute-uncertain models that allow uncertainty in the values of tuples as well. The instances $R'_1,\ldots,R'_n$ of relations $R_1,\ldots,R_n$ form a possible world $w$ or a database instance. The probability of this possible world [2, 18] is computable given the specifics of the database probabilistic model.


Since a possible world is a database instance, aggregates of the type described above over the database are well defined. If we denote by $A$ such aggregates, then
$$A = \sum_{t_1\in R'_1} \cdots \sum_{t_n\in R'_n} f(t_1,\ldots,t_n)$$
The $k$-th moment of $A$, by definition, is:
$$E\bigl[A^k\bigr] = \sum_{w\in W} A(w)^k\,p(w) \quad (3)$$
where $p(w)$ is the probability of the possible world $w$ and $W$ is the set of all possible worlds. This immediately gives formulas for $E[A]$ and $\mathrm{Var}[A] = E[A^2] - E[A]^2$, the two moments that are needed to compute confidence intervals. Unfortunately, computing the moments of $A$ using this formula is impractical since the number of possible worlds is exponential in the size of the relations. New formulas are needed to efficiently compute the moments.

In order to be able to compute $E[A]$ and $\mathrm{Var}[A]$, for each tuple $t \in R$, we introduce the 0,1 random variable $X_t$ that indicates whether $t$ is selected in $R'$ or not. With this,
$$A = \sum_{t\in R} X_t\,f(t)$$
Writing $A$ in this manner is key for the analysis since the dependence on a random range is replaced by a dependence on random variables, so that, by linearity, the expectation commutes with the sum. This technique is useful in general. For example, for the one-relation case, using the linearity of expectation, we have:
$$E[A] = \sum_{t\in R} E[X_t]\,f(t) = \sum_{t\in R} p_t\,f(t)$$
where we used the fact that $E[X_t] = P[X_t = 1] = p_t$.
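The contrast between the possible-worlds definition of the moments and the indicator-variable formula for $E[A]$ can be made concrete on a small relation. The following is a minimal sketch with hypothetical function names; it assumes a single relation in which tuples are present independently with probability $p_t$, so that the probability of a world is the product of the per-tuple terms.

```python
from itertools import product


def moment_by_possible_worlds(values, probs, k=1):
    """k-th moment of A = SUM(f(t)), enumerating all 2^n possible worlds."""
    moment = 0.0
    for world in product((0, 1), repeat=len(values)):
        p_world = 1.0
        for present, p in zip(world, probs):
            # assumption: tuple-independent model
            p_world *= p if present else (1.0 - p)
        aggregate = sum(v for present, v in zip(world, values) if present)
        moment += (aggregate ** k) * p_world
    return moment


def expectation_by_linearity(values, probs):
    """E[A] = sum over tuples of p_t * f(t); linear in the number of tuples."""
    return sum(p * v for v, p in zip(values, probs))
```

For `values = [10, 20, 5]` and `probs = [0.5, 0.25, 1.0]`, both approaches give an expectation of 15, but the enumeration visits 8 worlds while the linear formula touches each tuple once; the gap grows exponentially with the size of the relation.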


3.2.3 Use of Kronecker Symbol $\delta_{ij}$

Cases appear naturally in the analysis of complex random variables, since the simple random variables that they are constructed from interact differently with themselves and with other random variables (for example when the variance of the complex random variable is computed). An elegant method used in [53, 54] is to make use of the Kronecker symbol to encode cases. The Kronecker symbol $\delta_{ij}$ takes value 1 if $i = j$ and value 0 if $i \neq j$. Assume that we are given a quantity $Q_{ij}$ that takes value $a$ if $i = j$ and $b$ if $i \neq j$. We can express $Q_{ij}$ in terms of $\delta_{ij}$ by observing that we can multiply $a$ by $\delta_{ij}$ and $b$ by $(1 - \delta_{ij})$ and obtain:

$$Q_{ij} = \begin{cases} a & i = j \quad (\delta_{ij}) \\ b & i \neq j \quad (1 - \delta_{ij}) \end{cases} \;=\; b + (a - b)\,\delta_{ij}$$

This is correct since, if $i = j$, then $\delta_{ij} = 1$ and thus $Q = b + (a - b) \cdot 1 = a$ (and the symmetric argument holds for $i \neq j$). The following simplification rule is useful for removing $\delta_{ij}$ from expressions:

$$\sum_i \sum_j \delta_{ij}\, F(i, j) = \sum_i F(i, i)$$

i.e. the double sum collapses into a single sum because of $\delta_{ij}$. Even if $\delta_{ij}$ is not removed, it is easy to evaluate.

3.3 Analysis for Multi-relation SUM Aggregates for Tuple-independent Model

3.3.1 Dependence in Tuples after Aggregations

In the case of a single relation, if we let $t$ be a tuple of $R$ and the random variable $X_t$ indicate whether the tuple is (value 1) or is not (value 0) in the probabilistic instance, then the random variable $A = \sum_{t \in R} t.x\, X_t$ models the aggregate SUM(x) over this relation. Linearity of expectation immediately gives $E[A] = \sum_{t \in R} t.x\, E[X_t] = \sum_{t \in R} t.x\, P[t]$


which means that any probabilistic database system capable of computing the tuple inclusion probabilities can be extended straightforwardly to compute expectations of SUM-like aggregates. Should $R$ be the result of complex database operations, the formula still applies but the computation of $P[t]$ is involved. As we argued, computation of the variance is a bare minimum if reasonable information about the distribution is to be inferred. The above computation can be extended to the computation of the variance using the fact that $\mathrm{Var}[A] = E\left[\left(\sum_{t \in R} t.x\, X_t\right)^2\right] - E\left[\sum_{t \in R} t.x\, X_t\right]^2$. The first term of this equation, after using linearity of expectation, can be written as:

$$E\left[\left(\sum_{t \in R} t.x\, X_t\right)^2\right] = \sum_{t \in R} \sum_{t' \in R} t.x\; t'.x\; E[X_t X_{t'}]$$

where the term $E[X_t X_{t'}]$ is the probability of the tuples $t$ and $t'$ appearing simultaneously in the probabilistic database. The formulas are indeed general and give a straightforward method to compute the variance of a SUM-like aggregate for any probabilistic model. Unfortunately, in general $E[X_t X_{t'}] \neq E[X_t]\, E[X_{t'}]$. To see this, let $t_1$ be a tuple in $R_1$ and let $t_2$ and $t'_2$ be tuples in $R_2$. Tuples $(t_1, t_2)$ and $(t_1, t'_2)$ can both be part of the aggregate computation, but their existence in $R'_1 \times R'_2$ is not independent since the $t_1$ part is common.

3.3.2 $M^2$ Computation of Moments

As we have seen in Section 3.2, the computation of the moments of an aggregate using the possible worlds is computationally infeasible. Some more progress can be made if 0,1 random variables are introduced to indicate whether a tuple is included in a relation instance or not. We reached an impasse with this method since we depended on independence to do the derivation, and independence is lost the moment two relations are involved in the aggregate. The approach in this section to overcome this problem is to not make explicit use of independence between tuples of the same relations by


introducing a proof that carries out the full computation and does not use the independence shortcut. It might seem that, when this technique is used, significantly longer proofs could result. It turns out that this is not the case and, when pairing up this style of proof with extensive use of linearity of expectation, compact and general derivations result. The proof technique we use is similar to [53, 54] and makes use of the Kronecker delta symbol.

The first key ingredient for the analysis is to use the 0,1 indicator variables $X_t$ for each tuple in each of the relations. [1] Each $X_t$ has a Bernoulli distribution parameterized by probability $p_t$. The second key ingredient is to express the interaction between two random variables $X_t$ and $X_{t'}$ in terms of the Kronecker delta. This allows a purely algebraic manipulation without the need to deal with cases that significantly complicate the formulas. When these two ingredients are used together, the derivations become straightforward.

With the above comments in mind, the analysis of the aggregate $A$ will consist of: (a) expressing $A$ in terms of the data and the r.v. $X_t$, (b) using linearity of expectation to compute $E[A]$ and $\mathrm{Var}[A]$ in terms of $X_t$, and (c) interpreting the resulting formulas from a database point of view and expressing them in SQL. With the notation in Section 3.2.2 and using the same idea, the aggregate $A$ can be expressed in terms of $X_t$ as:

$$A = \sum_{t_1 \in R'_1} \cdots \sum_{t_n \in R'_n} f(t_1, \dots, t_n) = \sum_{t_1 \in R_1} \cdots \sum_{t_n \in R_n} \left(\prod_{i=1}^{n} X_{t_i}\right) f(t_1, \dots, t_n) \qquad (3)$$

[1] There is no need to index the random variables by the relation as well, as is done in [54], since it will be clear from the context what relation we are referring to and all random variables interact the same way.


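The following properties of $X_t$, which follow directly from the fact that it has a Bernoulli distribution, are needed in the rest of the paper:

$$E[X_t] = P[t \in R] = p_t \qquad (3)$$

$$E[X_t X_{t'}] = \begin{cases} P[t \in R] & t = t' \\ P[t \in R \wedge t' \in R] & t \neq t' \end{cases} = \begin{cases} p_t & t = t' \quad (\delta_{tt'}) \\ p_t\, p_{t'} & t \neq t' \quad (1 - \delta_{tt'}) \end{cases} = p_t\, p_{t'} + p_t (1 - p_t)\, \delta_{tt'} \qquad (3)$$

where we used the fact that $X_t$ and $X_{t'}$ are independent (if $t \neq t'$) and the technique explained in Section 3.2.3 to express cases using the Kronecker delta. Since the formulas would be too large with the current notation (too many summation signs), as in [54], we introduce a more compact notation for summations:

$$\sum_{t_1 \in R_1} \cdots \sum_{t_n \in R_n} f(t_1, \dots, t_n) = \sum_{\{t_i \in R_i \mid i \in \{1:n\}\}} f(\{t_i \mid i \in \{1:n\}\})$$

Sums that are subscripted by the set $\{t_i \in R_i \mid i \in S\}$ are equivalent to sums over each of the indexes in the set. We use the same compact notation for the arguments of $f(\cdot)$. With this, we have the following result:

Theorem 3.1. The moments of $A$, the aggregate defined by Equation 3, are:

$$E[A] = \sum_{\{t_i \in R_i \mid i \in \{1:n\}\}} \left(\prod_{i=1}^{n} p_{t_i}\right) f(\{t_i \mid i \in \{1:n\}\})$$

$$E[A^2] = \sum_{\{t_i \in R_i \mid i \in \{1:n\}\}} \; \sum_{\{t_{i'} \in R_{i'} \mid i' \in \{1:n\}\}} \left(\prod_{i=1}^{n} \left(p_{t_i}\, p_{t_{i'}} + p_{t_i} (1 - p_{t_i})\, \delta_{t_i t_{i'}}\right)\right) f(\{t_i \mid i \in \{1:n\}\})\, f(\{t_{i'} \mid i' \in \{1:n\}\})$$

$$\mathrm{Var}[A] = E[A^2] - E[A]^2$$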


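Proof. Using Equation 3, linearity of expectation and the fact that the r.v. $X_t$ are independent for different relations (thus the expectation of a product is the product of expectations), we have:

$$E[A] = \sum_{\{t_i \in R_i \mid i \in \{1:n\}\}} \left(\prod_{i=1}^{n} E[X_{t_i}]\right) f(\{t_i \mid i \in \{1:n\}\}) = \sum_{\{t_i \in R_i \mid i \in \{1:n\}\}} \left(\prod_{i=1}^{n} p_{t_i}\right) f(\{t_i \mid i \in \{1:n\}\})$$

Creating two instances of $A$ with the indexes primed for the second one, and using Equation 3 and linearity of expectation, we have:

$$E[A^2] = E\left[\sum_{\{t_i \in R_i \mid i \in \{1:n\}\}} \; \sum_{\{t_{i'} \in R_{i'} \mid i' \in \{1:n\}\}} \left(\prod_{i=1}^{n} X_{t_i} X_{t_{i'}}\right) f(\{t_i \mid i \in \{1:n\}\})\, f(\{t_{i'} \mid i' \in \{1:n\}\})\right]$$

$$= \sum_{\{t_i \in R_i \mid i \in \{1:n\}\}} \; \sum_{\{t_{i'} \in R_{i'} \mid i' \in \{1:n\}\}} \left[\left(\prod_{i=1}^{n} E[X_{t_i} X_{t_{i'}}]\right) f(\{t_i \mid i \in \{1:n\}\})\, f(\{t_{i'} \mid i' \in \{1:n\}\})\right]$$

$$= \sum_{\{t_i \in R_i \mid i \in \{1:n\}\}} \; \sum_{\{t_{i'} \in R_{i'} \mid i' \in \{1:n\}\}} \left[\left(\prod_{i=1}^{n} \left(p_{t_i}\, p_{t_{i'}} + p_{t_i} (1 - p_{t_i})\, \delta_{t_i t_{i'}}\right)\right) f(\{t_i \mid i \in \{1:n\}\})\, f(\{t_{i'} \mid i' \in \{1:n\}\})\right]$$

The formulas in Theorem 3.1 can be efficiently evaluated using a DBMS. Remember that $f(t) = F(t)\, I_C(t)$, where $F(t)$ is the aggregate function and $I_C(t)$ is the indicator function of the condition $C$. To compute $E[A]$ using SQL, we observe that the query only requires that the aggregate function be multiplied by the product of probabilities. In order to compute $E[A^2]$, and thus $\mathrm{Var}[A]$, we have to compute the aggregate over the cross-product of the result tuples. To translate this efficiently into SQL, we can put the unaggregated value and the probabilities into a temporary table and then compute the aggregate over the cross-product of this table with itself.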


3.3.3 Reducing the time complexity

A fundamental question that can be asked about the formula for $E[A^2]$ in Theorem 3.1 is whether we can remove the Kronecker deltas, for example by using the simplification rule in Section 3.2.3. While it is not clear why this might reduce the complexity, it would be a necessary step in that direction since otherwise there is no way to proceed. To see where the opportunity lies, let us consider the two-relation case. In this case, the formula for $E[A^2]$ can be written as:

$$E[A^2] = \sum_{t \in R_1} \sum_{t' \in R_1} \sum_{v \in R_2} \sum_{v' \in R_2} p_t\, p_{t'}\, p_v\, p_{v'}\, f(t, v)\, f(t', v')$$
$$+ \sum_{t \in R_1} \sum_{t' \in R_1} \sum_{v \in R_2} \sum_{v' \in R_2} p_t (1 - p_t)\, \delta_{tt'}\, p_v\, p_{v'}\, f(t, v)\, f(t', v')$$
$$+ \sum_{t \in R_1} \sum_{t' \in R_1} \sum_{v \in R_2} \sum_{v' \in R_2} p_t\, p_{t'}\, p_v (1 - p_v)\, \delta_{vv'}\, f(t, v)\, f(t', v')$$
$$+ \sum_{t \in R_1} \sum_{t' \in R_1} \sum_{v \in R_2} \sum_{v' \in R_2} p_t (1 - p_t)\, \delta_{tt'}\, p_v (1 - p_v)\, \delta_{vv'}\, f(t, v)\, f(t', v')$$

which after simplification gives:

$$E[A^2] = \left(\sum_{t \in R_1} \sum_{v \in R_2} p_t\, p_v\, f(t, v)\right)^2 + \sum_{t \in R_1} p_t (1 - p_t) \left(\sum_{v \in R_2} p_v\, f(t, v)\right)^2$$
$$+ \sum_{v \in R_2} p_v (1 - p_v) \left(\sum_{t \in R_1} p_t\, f(t, v)\right)^2 + \sum_{t \in R_1} \sum_{v \in R_2} p_t (1 - p_t)\, p_v (1 - p_v)\, f(t, v)^2$$

Now we can see why such a derivation leads to better evaluation algorithms for $E[A^2]$. The first and the last terms straightforwardly require just aggregates over the matching tuples. To see how to efficiently evaluate the second term (the third term is symmetric), we observe that the inner square needs to be computed for each tuple $t \in R_1$. Thus, we can compute these squares using a GROUP BY with the correct aggregates. Once


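the squares are computed, one for each $t \in R_1$, we simply compute the rest of the aggregate. The above derivation suggests that we might be able to compute $\mathrm{Var}[A]$ efficiently in the general case. Indeed this is possible, as we show in the rest of this section. In order to provide the result, we need the following technical lemma. The result will be used in Section 3.6 as well:

Lemma 1. Let $a_{t_i}, b_{t_i}$, $i \in \{1:n\}$, $t_i \in R_i$, be arbitrary values. Furthermore, let $f(\{t_i\})$ and $g(\{t_i\})$ be arbitrary functions of the tuples $\{t_i\}$. Then

$$\sum_{\{t_i \in R_i \mid i \in \{1:n\}\}} \; \sum_{\{t_{i'} \in R_{i'} \mid i' \in \{1:n\}\}} \left(\prod_{i, i' \in \{1:n\}} \left(a_{t_i}\, a_{t_{i'}} + b_{t_i}\, \delta_{t_i, t_{i'}}\right)\right) f(\{t_i\})\, g(\{t_{i'}\})$$
$$= \sum_{S \in \mathcal{P}(n)} \; \sum_{\{t_i \in R_i \mid i \in S\}} \left[\left(\prod_{k \in S} b_{t_k}\right) \left(\sum_{\{t_j \in R_j \mid j \in S^C\}} \left(\prod_{l \in S^C} a_{t_l}\right) f(\{t_i, t_j\})\right) \left(\sum_{\{t_{j'} \in R_{j'} \mid j' \in S^C\}} \left(\prod_{l' \in S^C} a_{t_{l'}}\right) g(\{t_i, t_{j'}\})\right)\right]$$

with $\mathcal{P}(n)$ the power set of $\{1:n\}$. $S$ is an element of $\mathcal{P}(n)$, i.e. a subset of $\{1:n\}$, and $S^C$ is the complement of $S$ with respect to $\{1:n\}$.

The lemma above provides a recipe to remove the Kronecker deltas and factorize the result. Using this lemma, we can prove the following result:

Theorem 3.2. Let $A$ be the estimate defined by Equation 3. Then,

$$E[A^2] = \sum_{S \in \mathcal{P}(n)} \; \sum_{\{t_i \in R_i \mid i \in S\}} \left[\left(\prod_{k \in S} p_{t_k} (1 - p_{t_k})\right) \left(\sum_{\{t_j \in R_j \mid j \in S^C\}} \left(\prod_{l \in S^C} p_{t_l}\right) f(\{t_i, t_j\})\right)^2\right]$$

Proof. The result follows directly from the expression of $E[A^2]$ in Theorem 3.1 and Lemma 1 with $a_{t_i} = p_{t_i}$, $b_{t_i} = p_{t_i} (1 - p_{t_i})$ and $g(\cdot) = f(\cdot)$.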
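To make the equivalence between the Kronecker-delta form of Theorem 3.1 and the factorized form of Theorem 3.2 concrete, the following sketch evaluates both on in-memory data. It is an illustration only: the representation (each relation as a list of tuple probabilities, tuples identified by their index, `f` as a Python callable on an index tuple) and the function names are assumptions made for this example, not part of the dissertation's implementation.

```python
from itertools import combinations, product
from math import prod


def second_moment_kronecker(rels, f):
    """E[A^2] as in Theorem 3.1: a double sum over the full cross-product."""
    n = len(rels)
    space = list(product(*[range(len(r)) for r in rels]))
    total = 0.0
    for t in space:
        for u in space:  # u plays the role of the primed tuples
            w = 1.0
            for i in range(n):
                p_t, p_u = rels[i][t[i]], rels[i][u[i]]
                delta = 1.0 if t[i] == u[i] else 0.0
                w *= p_t * p_u + p_t * (1.0 - p_t) * delta
            total += w * f(t) * f(u)
    return total


def second_moment_subsets(rels, f):
    """E[A^2] as in Theorem 3.2: one term per subset S of the relations."""
    n = len(rels)
    total = 0.0
    for size in range(n + 1):
        for S in combinations(range(n), size):
            comp = [j for j in range(n) if j not in S]
            # Outer sum: tuples of the relations in S (the "group by" part).
            for ts in product(*[range(len(rels[i])) for i in S]):
                weight = prod(rels[i][k] * (1.0 - rels[i][k])
                              for i, k in zip(S, ts))
                # Inner sum over the relations in the complement of S.
                inner = 0.0
                for tc in product(*[range(len(rels[j])) for j in comp]):
                    t = [0] * n
                    for i, k in zip(S, ts):
                        t[i] = k
                    for j, k in zip(comp, tc):
                        t[j] = k
                    inner += prod(rels[j][k] for j, k in zip(comp, tc)) * f(tuple(t))
                total += weight * inner ** 2
    return total
```

On small inputs the two functions agree up to floating-point error; the second one never forms the squared cross-product, which is the source of the complexity reduction discussed in this section.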


The above result indicates that, for each set $S \in \mathcal{P}(n)$, $2^n$ in all, we have to perform a computation that generalizes the 2-relation case we saw before. Using the same reasoning, the SQL query that computes the term corresponding to set $S$ is:

    SELECT SUM(SQ * $\prod_{i \in S} P_i (1 - P_i)$)
    FROM (SELECT SUM(F * $\prod_{j \in S^C} P_j$)^2 AS SQ, $\{P_i, i \in S\}$
          FROM unaggregated
          GROUP BY $\{ID_i, P_i, i \in S\}$);    (3)

The analysis of the time complexity can be performed either on the formulas in Theorem 3.2, which ignores the special structure of $f(\cdot)$, or on the SQL query in SQL code (3), which computes the inner part of the formula in Theorem 3.2. For the formula in Theorem 3.2, if we denote by $|R_i|$ the size of relation $R_i$, the complexity is $O\left(2^n \prod_{i=1}^{n} |R_i|\right)$, since each term under $\sum_{S \in \mathcal{P}(n)}$ just needs a traversal over the $\prod_{i=1}^{n} |R_i|$ tuples in a particular order. The traversal does not add any extra complexity since the representation of $f(\cdot)$ does not exploit the structure. With the same considerations, the complexity of the variance computation using the formulas in Theorem 3.1 is $O\left(\prod_{i=1}^{n} |R_i|^2\right)$.

The special structure of $f(\cdot)$ can be exploited if a database system is used to evaluate the formula using the SQL code (3). For both the unoptimized and the optimized computation of the variance, the same temporary unaggregated table is generated by SQL to keep the aggregate items. Essentially, the unaggregated table contains information only about tuples for which $f(\cdot)$ is nonzero, and the complexity of generating it depends on the query and the database engine. The difference between the two versions is how the table unaggregated is used to complete the computation. If we denote by $N_U$ the size of the unaggregated table, the unoptimized algorithm needs to form the cross-product, thus the complexity is $N_U^2$. The optimized algorithm needs to compute $2^n$ terms; for each of them it needs to evaluate the SQL code (3), thus it requires


a grouping and a traversal. In the worst case, executing SQL code (3) requires a sort and a linear scan of relation unaggregated, thus has worst-case complexity $O(N_U \log N_U)$, but possibly $O(N_U)$ complexity if enough memory is available. Thus the overall complexity of the optimized computation is, in the worst case, $O(2^n N_U \log N_U)$ but can be as little as $O(2^n N_U)$.

3.4 General Framework of Computing Moments

The previous section analyzed moment computation based on the Tuple-independent model. We will generalize the derived formulas and provide a general framework for moment computation over probabilistic databases.

3.4.1 Notation

Throughout this section, we use the following extra notations.

    Symbol                     Meaning
    $M$                        number of matching tuples, i.e. $f(t_1, .., t_n) \neq 0$
    $p(\{t_i\})$               probability $p(t_1, .., t_n)$ of tuple $(t_1, .., t_n)$
    $\mathcal{P}(n)$           power set of set $(1:n)$
    $S, S^C$                   element of $\mathcal{P}(n)$ and its complement $\{1:n\} - S$
    $p(\{t'_i \mid t_i\})$     conditional prob. $p(\{t'_i \in R_i\} \mid \{t_i \in R_i\})$
    $C$                        the selection condition

3.4.2 General Framework for $M \log M$ Computation

In order to allow more efficient computation of $E[A^2]$, we require the probabilistic model to satisfy:

Definition 1. A probabilistic model is called conditionally independent if $\forall S \in \mathcal{P}(n)$, $\forall t_1, t'_1 \in R_1, .., t_n, t'_n \in R_n$,

$$(\forall k \in S,\; t_k = t'_k) \wedge (\forall j \in S^C,\; t_j \neq t'_j) \;\Rightarrow\; p(\{t'_i \in R_i\} \mid \{t_i \in R_i\}) = p(\{t'_j\}_{j \in S^C} \mid \{t_k\}_{k \in S})$$


where

$$p(\{t'_j\} \mid \{t_k\}) = \frac{p\left(\{t'_i \in R_i\} \wedge \{t_i \in R_i\} \mid \{t_k = t'_k\}_{k \in S}, \{t_j \neq t'_j\}_{j \in S^C}\right)}{p(\{t_i \in R_i\})}$$

is the conditional probability for tuples $\{t'_i \in R_i\}$ to appear given that tuples $\{t_i \in R_i\}$ appeared and primed tuples are constrained to be equal/non-equal to non-primed tuples according to set $S$/$S^C$. Essentially, this technical definition requires the probabilistic model to ban dependencies between $P(\{t'_i \in R_i\} \mid \{t_i \in R_i\})$ and $t_j$ for which $j \in S^C$. This effectively means that tuples within the same relation cannot interact directly; they are allowed to interact indirectly through common influence on tuples in other relations. This is a generalized form of the Tuple-independent model and is less restrictive. With this definition, we have the following technical result:

Lemma 2. If a probability model is conditionally independent, then:

$$E[A^2] = \sum_{S \in \mathcal{P}(n)} \; \sum_{\{t_i \in R_i \mid i \in S\}} \left[\left(\sum_{\{t'_j \in R_j \mid j \in S^C\}} c_S\, f(\{t_i, t'_j\})\right) \left(\sum_{t_j \in R_j \mid j \in S^C} p(\{t_i, t_j\})\, f(\{t_i, t_j\})\right)\right]$$

where

$$c_S = \sum_{u \in \mathcal{P}(S),\; t'_u \in R_u} \; \sum_{t'_k \in R_k \mid k \in S^C \cup U^C} (-1)^{|U| + |S|}\, p(\{t'_k \mid t'_u\})$$

The proof of the lemma is a generalization of the proof of the simpler result in [54]. Details are provided in Appendix F. Note that the above formula is composed of $|\mathcal{P}(n)| = 2^n$ summations, and each of them is of the form $\sum (B \sum C)$, which has the exact format of Theorem 3.2 and can be implemented by GROUP BY in SQL. The time complexity is at most $M \log M$ for all conditionally independent distributions. Lemma 2 is general; Theorem 3.2 is an application to the Tuple-independent model. Next we will apply this lemma to derive the moments of aggregates over graphical models.


For a graphical model [88, 94], by definition, a node is conditionally independent of its non-descendant nodes given its parent node. In addition, the interactions in a graphical model only exist between different nodes and not within the same node. This implies that conditional independence holds in graphical models. Thus Lemma 2 can be applied to graphical models and the following result is obtained:

Theorem 3.3. In a graphical model with $n$ nodes $R_1, \dots, R_n$ and joint probability $p(t_1, .., t_n)$, we have:

$$E(A^2) = \sum_{S \in \mathcal{P}(n)} \; \sum_{\{t_i \in R_i \mid i \in S\}} \left[\left(\sum_{\{t'_j \in R_j \mid j \in S^C\}} \left(\sum_{u \in \{1:n\},\; t'_u \in R_u} \; \sum_{t'_k \in R_k \mid k \in S^C \cup U^C} (-1)^{|U| + |S|}\, \frac{p(\{t'_k, t'_u\})}{\sum_{t''_u \in R_u,\; t''_u = t'_u} p(\{t'_k, t''_u\})}\right) f(\{t_i, t'_j\})\right) \sum_{t_j \in R_j \mid j \in S^C} p(\{t_i, t_j\})\, f(\{t_i, t_j\})\right]$$

The detailed proof is in Appendix G. Note that the graphical model in Theorem 3.3 is general. The variable elimination method is used to infer the probability of unobserved variables [21] and thus provides recipes for the computation of the terms $p(\{t'_k, t'_u\})$. It is interesting to see that there is no interaction between the structure of the graphical model and the formula for the second moment of $A$. Effectively, the graphical model is used as a black box.

Theorem 3.2 and Theorem 3.3 are straightforward applications of Lemma 2 to different probabilistic models, which is a testament to the power of the lemma. The same type of results can be obtained effortlessly for any conditionally independent distribution (Definition 1). It is important to note that the X-tuple model used in [95] is not conditionally independent, thus a similar result cannot be obtained using Lemma 2. We believe that the lemma can be further generalized to cover distributions like the X-tuple model that create dependencies between tuples in the same relation.


3.5 Computation of Moments

In Section 3.4, we derived a general framework for aggregates over probabilistic databases. In this section we present two different techniques to evaluate the moments of aggregates. The first technique, called InDB, uses exclusively query rewriting and performs all operations in the database engine. The second technique, called Middleware, uses the database to obtain the table unaggregated and then performs all the operations in a custom C++ implementation.

3.5.1 InDB Techniques

3.5.1.1 InDB $M^2$ Technique

In this section, we use Equation 3.1 and show how the required computation can be done in pure SQL. First observe that $f(t) = F(t)\, I_C(t)$, thus $f(t)^2 = F(t)^2\, I_C(t)$. In order to compute the formulas for $E[A]$ and $\mathrm{Var}[A]$ using SQL, we will add an attribute P to relations R that specifies the probability of the tuple for the Tuple-independent model, and a joint probability P for the graphical model. We omit the SQL for the graphical model due to lack of space. The formulas in Equation 3.1 can be efficiently evaluated using a DBMS. $E[A]$ can be evaluated using SQL as:

    SELECT SUM(F(t1, ..., tn) * P({t1, ..., tn}))
    FROM R1 AS t1, ..., Rn AS tn
    WHERE SIMPLE_CONDITION;

In order to compute $E[A^2]$, and thus $\mathrm{Var}[A]$, we have to compute the aggregate over the cross-product of the result tuples. To translate this efficiently into SQL, we put the unaggregated value and the probabilities into a temporary table and then compute the aggregate over the cross-product of this table with itself. In order to be able to apply the Kronecker delta function, we need the ids of the tuples (primary keys are ideal ids; we will assume


that the attribute ID exists). It is possible to write this as a single query but here, for ease of exposition, we prefer two queries. We will assume that a function Delta(i, j) that implements the Kronecker delta is available (functions can be added to all major database systems). With this, the SQL code to compute $E[A^2]$ is:

    SELECT F(t1, ..., tn) AS F, t1.P AS P1, ..., tn.P AS Pn,
           t1.ID AS ID1, ..., tn.ID AS IDn
    INTO unaggregated
    FROM R1 AS t1, ..., Rn AS tn
    WHERE SIMPLE_CONDITION;    (3)

    SELECT SUM(U.F * V.F * (U.P1 * V.P1 + U.P1 * (1 - U.P1) * Delta(U.ID1, V.ID1)) * ...)
    FROM unaggregated AS U, unaggregated AS V;    (3)

Clearly, except for a slightly more complicated aggregate function, the effort to compute $E[A]$ is the same as the effort to compute the original aggregate $A$. Furthermore, the query plan can remain the same and still be efficient. The computation of $E[A^2]$ involves the cross-product of all result tuples, with no possibility of further optimization at the database level. There are no conditions to take advantage of to replace the cross-product by a join. The fact that SQL can be used directly instead of redesigning the database helps, but this solution is not fundamentally better. As we will see in the next section, further optimizations can be applied to this solution in order to significantly reduce the time complexity.
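The two-query scheme above can be tried out on a toy example. The sketch below is an illustration under several assumptions: it uses Python's built-in `sqlite3` module rather than Postgres, replaces `SELECT ... INTO` with SQLite's `CREATE TABLE ... AS SELECT`, registers `Delta` as a user-defined function, and invents two small relations `R1(ID, x, P)` and `R2(ID, r1_id, y, P)` with the aggregate `SUM(x * y)` over a join; none of these names or values come from the dissertation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# User-defined Kronecker delta, as assumed by the rewritten query.
conn.create_function("Delta", 2, lambda i, j: 1 if i == j else 0)
cur = conn.cursor()

cur.executescript("""
CREATE TABLE R1 (ID INTEGER PRIMARY KEY, x REAL, P REAL);
CREATE TABLE R2 (ID INTEGER PRIMARY KEY, r1_id INTEGER, y REAL, P REAL);
INSERT INTO R1 VALUES (1, 2.0, 0.5), (2, 3.0, 0.8);
INSERT INTO R2 VALUES (1, 1, 10.0, 0.9), (2, 1, 20.0, 0.4), (3, 2, 5.0, 0.7);

-- First query: materialize the unaggregated values, probabilities and ids.
CREATE TABLE unaggregated AS
  SELECT t1.x * t2.y AS F, t1.P AS P1, t2.P AS P2,
         t1.ID AS ID1, t2.ID AS ID2
  FROM R1 AS t1, R2 AS t2
  WHERE t1.ID = t2.r1_id;
""")

# E[A]: the aggregate weighted by the product of tuple probabilities.
(e_a,) = cur.execute("SELECT SUM(F * P1 * P2) FROM unaggregated").fetchone()

# E[A^2]: aggregate over the cross-product of unaggregated with itself.
(e_a2,) = cur.execute("""
SELECT SUM(U.F * V.F
  * (U.P1 * V.P1 + U.P1 * (1 - U.P1) * Delta(U.ID1, V.ID1))
  * (U.P2 * V.P2 + U.P2 * (1 - U.P2) * Delta(U.ID2, V.ID2)))
FROM unaggregated AS U, unaggregated AS V
""").fetchone()

var_a = e_a2 - e_a ** 2
```

With only three matching tuples the cross-product has nine rows; the quadratic growth of that cross-product is what makes this variant impractical for large results.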


3.5.1.2 InDB $M \log M$ Technique

In this section we use the result of Lemma 2 to remove the need to compute the cross-product of the unaggregated table. We assume the unaggregated relation is computed as explained in the previous section. The result in Lemma 2 indicates that, for each set $S \in \mathcal{P}(n)$, $2^n$ in all, we have to compute the double-summation term. The SQL query that computes the term corresponding to set $S$ is:

    SELECT SUM(SQ * $C_{j \in S^C}\{t_j\}$)
    FROM (SELECT SUM(F * $P_{j \in S^C}\{t_j\}$) AS SQ, $\{P_i, i \in S\}$
          FROM unaggregated
          GROUP BY $\{ID_i, P_i, i \in S\}$);

Thus, to evaluate Lemma 2, $2^n$ similar SQL statements are generated. The final result is the sum of these $2^n$ values. $C_{j \in S^C}$ has different values for the Tuple-independent model and the graphical model. In the Tuple-independent model it can be computed using the data in the unaggregated table. For the graphical model in particular, it is preferable to compute these probabilities before generating the unaggregated table. The variable elimination algorithm [21] can be used to perform these computations.

3.5.2 Middleware $M \log M$ Technique

As shown in Section 3.5.1.2, the InDB $M \log M$ technique decreases the time complexity from $M^2$ to $M \log M$. However, it requires generating $2^n$ GROUP BY queries. In order to further improve performance, we first observe that one sorting order can be used to compute multiple terms efficiently. For example, if the data is sorted on attributes a, b and c, then group-bys on {a}, {a, b} and {a, b, c} can be computed in parallel using this sorted order, which significantly reduces the number of GROUP BYs. Motivated by the above observation, we hope to find a parallel algorithm which minimizes the number of sorting orders and computes terms simultaneously. For example, the 8 combinations of a, b, c can be computed using 3 sorting sequences {{a, b, c}, {b, c, a}, {c, a, b}}.
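One way to obtain a small set of sort orders whose prefixes cover every subset is sketched below. Note that this is not the linking procedure of this chapter: it swaps in the standard parenthesis-matching construction of a symmetric chain decomposition of the subset lattice, which is known to use $\binom{n}{\lfloor n/2 \rfloor}$ chains. Relations are identified by positions `0..n-1`, and all function names are hypothetical.

```python
from itertools import product
from math import comb


def unmatched_positions(bits):
    """Treat 0 as '(' and 1 as ')' and return the unmatched positions.

    Unmatched 1s always lie to the left of unmatched 0s, so the result
    is sorted by position.
    """
    open_zeros = []      # stack of positions of 0s not yet matched
    unmatched_ones = []
    for pos, b in enumerate(bits):
        if b == 0:
            open_zeros.append(pos)
        elif open_zeros:
            open_zeros.pop()          # this 1 closes the latest open 0
        else:
            unmatched_ones.append(pos)
    return unmatched_ones + open_zeros


def symmetric_chains(n):
    """Partition all 2^n bit vectors into chains (one bit turned on per step)."""
    chains = []
    for bits in product([0, 1], repeat=n):
        free = unmatched_positions(bits)
        if any(bits[p] == 1 for p in free):
            continue                  # not the bottom element of its chain
        chain, cur = [bits], list(bits)
        for p in free:                # turn on the leftmost unmatched 0
            cur[p] = 1
            chain.append(tuple(cur))
        chains.append((free, chain))
    return chains


def sort_orders(n):
    """Return (order, lo, hi): prefixes of length lo..hi of `order` are the
    subsets S this sort order is responsible for."""
    orders = []
    for free, chain in symmetric_chains(n):
        bottom = chain[0]
        ones = [p for p in range(n) if bottom[p] == 1]
        rest = [p for p in range(n) if bottom[p] == 0 and p not in free]
        orders.append((ones + free + rest, len(ones), len(ones) + len(free)))
    return orders


# For 3 relations, 3 sort orders suffice; for 4 relations, 6.
assert len(sort_orders(3)) == comb(3, 1)
assert len(sort_orders(4)) == comb(4, 2)
```

Each subset is assigned to exactly one sort order, so no term is computed twice, and all terms assigned to the same order can be evaluated in a single pass over the data sorted in that order.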


Figure 3-1. The Layers of Sequences

Figure 3-1 represents, in lattice form, the $2^n$ terms corresponding to each element of $\mathcal{P}(n)$ that need to be computed. Every bit represents a relation and the value of the bit represents whether this relation is in $S$ or in $S^C$. To move between layers, exactly one 0 bit is turned into a 1 bit. A sorting order can cover all terms corresponding to top-to-bottom paths that move between successive layers. The order of elements in the sorting order corresponds to the order in which the 0s are turned into 1s along the path. For example, if we first turn on the bit corresponding to element b, then b is the first element in the sorting order. Since the width of the lattice is $\binom{n}{\lfloor n/2 \rfloor}$, the minimum number of sorted orders to cover all the terms is $\binom{n}{\lfloor n/2 \rfloor}$ as well. Notice that, if a term is covered by multiple sort orders, it is enough to compute it based on only one of them. Moreover, all terms covered by the same sort order can be computed in parallel.

We argued above that $\binom{n}{\lfloor n/2 \rfloor}$ sort orders can cover all the terms. We now sketch an algorithm that produces the paths corresponding to the optimal number of sort orders. The main idea is to interpret the binary representation as a number; this gives a natural ordering of terms that can be used for systematic enumeration. To find the next path, we start with the largest (in the numerical sense) uncovered term in the top-most level and continue along the lattice structure by always selecting the highest term (if there is a choice). This process can produce partial paths; to produce a full path, we can pick any sub-path between the top and the start point and any sub-path between the end point and the bottom. This path selection process continues until all terms are covered. In next


section, we give full details on this algorithm and we prove that it produces exactly $\binom{n}{\lfloor n/2 \rfloor}$ paths that correspond to $\binom{n}{\lfloor n/2 \rfloor}$ sort orders that cover all the terms.

3.5.3 Algorithms

In this section, we focus on how to produce optimal sorting sequences through getAllParrelItems, getParItemNextLayer and getSortedSequence. Method getAllParrelItems traverses all the sequences from the bottom layer. It visits the sequences of one layer in decreasing order. The visited sequence recursively links one sequence in the next layer by method getParItemNextLayer; the linking sequence is the largest unlinked one of all the similar sequences in the next layer. For example, the similar sequences of 11111 include 11110, 11101, 11011, 10111 and 01111. 11111 will link 11110, which is the largest unlinked sequence. Finally, method getSortedSequence uses the link lists to obtain the sorted sequence from the top layer.

Algorithm 1 getAllParrelItems()
    Initialize all the elements in array linkList as 0;
    Initialize all the elements in array isComputed as false;
    for i from 1 to numofRel do
        Enumerate all the binary sequences of layer i in decreasing order and put them into combinationList;
        for oneCombValue in combinationList do
            getParItemNextLayer(oneCombValue, linkList[curNode], isComputed);
        end for
    end for
    return linkList;

Algorithm 2 getParItemNextLayer(oneCombValue, nextParrelItem, isComputed)
    for i from numofRels-1 to 0 do
        temp=1<

Algorithm 3 getSortedSequence(linkList)
    getAllParrelItems()
    Initialize all the elements in array isSorted as false;
    Initialize all the elements in array linkPath as 0;
    Map the value of linkList to linkPath[numPath][numLayer] and get numPath
    for i from 0 to numPath do
        initiate onePath = null
        Get the head of the i-th path from the top layer and set it to curHead
        for j from curHead+1 to numLayer do
            if linkPath[i][j] != 0 then
                Set the index of the 1 bit in linkPath[i][j] ^ linkPath[i][j-1] to the end of onePath
            end if
        end for
        if the size of onePath is less than numofRel then
            if linkPath[i][k] = 0, put k at the end of onePath
        end if
        Put onePath to pathList
    end for
    return pathList.

Now we show that this algorithm produces the minimum number of sorting sequences. The above algorithm visits all the sequences of each layer since all of them are enumerated by getAllParrelItems. In order to show that this algorithm has the minimum number of paths, we study the rule in getParItemNextLayer that links one sequence B to another sequence C. The linking rule requires that C is obtained by replacing one 0 bit by a 1 bit in B. Since C is the largest unlinked sequence among B's similar sequences, the leftmost 0s of B should be replaced by 1 first. Let us examine the linking rule for the third layer of 7 layers. 1110000 first links to 1111000. Then 1101000 has to link to 1101100 instead of 1111000. Similarly, 1100100 links to 1110100. We observe that a 0 bit can be replaced by 1 if it is the first 0 bit from the left such that, between it and the rightmost 1 bit, there are more 0s than 1s. For example, there is no 0 bit between a and b in $1\,1\,0_a\,1_b\,0_c\,0\,0$. Therefore, the bit at c is replaced by 1. From layer 1 to layer $\lceil n/2 \rceil - 1$, there are more 0s than 1s and the above condition is always satisfied. From layer $\lceil n/2 \rceil$ to layer $n$, there are more 1s than 0s. The above condition cannot


be always satisfied. If a sequence cannot find a linked sequence, then it is the end of the path. The above mapping shows that the number of paths in the algorithm increases from layer 1 to layer $\lceil n/2 \rceil$ and that layer $i$ always has one mapping to layer $i + 1$. No new paths are created from layer $\lceil n/2 \rceil$ to layer $n$ since these sequences are all linked by the upper layer. Thus the number of paths is equal to the number of items in the widest layer, i.e. the lattice width. As analyzed in Section 3.5.2, the minimum number of paths is the lattice width. Therefore, our algorithm is an optimal algorithm.

3.6 Extensions

The analysis and computation for SUM aggregates can serve as the basis for computing more complicated aggregates. The two extensions we consider are: non-linear aggregates and GROUP BY.

3.6.1 Non-linear Aggregates

The method described in the rest of the chapter can provide confidence intervals only for SUM aggregates. In practice, other aggregates such as AVERAGE and VARIANCE, which are similar to SUM, are useful. Such aggregates can be computed from multiple SUM aggregates by making a non-linear combination of them. [2] For example, AVERAGE is the ratio of SUM and COUNT. [3] Since our analysis extensively used the linearity of the SUM aggregate, it cannot be applied to non-linear aggregates like AVERAGE directly. Although our method of approximating AVERAGE is not as efficient as [52], we provide a method to estimate the average over multiple relations with confidence intervals.

To provide a general treatment, let $A_1, \dots, A_k$ be SUM aggregates (with different aggregation functions). Then, a general non-linear aggregate needs to compute $A = F(A_1, \dots, A_k)$. For the AVERAGE aggregate, $A_1$ is SUM(F), $A_2$ is SUM(1) and $F(A_1, A_2) = A_1 / A_2$.

[2] If the combination is linear, then the $F(\cdot)$ functions can be combined into a single such function that is used for computation.
[3] COUNT is a SUM with the aggregate function $F(\cdot) = 1$.


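Now, let $A_1, \dots, A_k$ be the random variables that give the value of the aggregates for the probabilistic database instance. The value of the aggregate $A$ on the instance would be $A = F(A_1, \dots, A_k)$. The standard method in Statistics to analyze non-linear combinations of random variables is the delta method [91]. The delta method consists in expressing the moments of $A$ in terms of the moments of $A_1, \dots, A_k$. In particular:

$$E[A] \approx F(E[A_1], \dots, E[A_k])$$
$$\mathrm{Var}[A] \approx \nabla F(E[A_1], \dots, E[A_k])^{\top}\; \mathrm{Var}(A_1, \dots, A_k)\; \nabla F(E[A_1], \dots, E[A_k])$$

where $\nabla F(\cdot)$ is the gradient of the function $F$ (i.e. the vector consisting of the partial derivatives w.r.t. each component) evaluated at $E[A_1], \dots, E[A_k]$. $\mathrm{Var}(A_1, \dots, A_k)$ is the variance matrix, which has $\mathrm{Var}[A_u]$ on the diagonal and $\mathrm{Cov}(A_u, A_v)$ off the diagonal. We can use the technique developed in this chapter to compute $E[A_u]$ and $\mathrm{Var}[A_u]$ efficiently for each $A_u$. The remaining task is to compute $\mathrm{Cov}(A_u, A_v) = E[A_u A_v] - E[A_u]\, E[A_v]$, where the only challenge is computing the $E[A_u A_v]$ component. The formula is provided by the following result:

Theorem 3.4. Let $A_u$, $A_v$ be the estimates for two different sum aggregates given by aggregate functions $f_u(\cdot)$ and $f_v(\cdot)$. Then,

$$E[A_u A_v] = \sum_{S \in \mathcal{P}(n)} \; \sum_{\{t_i \in R_i \mid i \in S\}} \left[\left(\sum_{\{t'_j \in R_j \mid j \in S^C\}} c_S\, f_v(\{t_i, t'_j\})\right) \left(\sum_{t_j \in R_j \mid j \in S^C} p(\{t_i, t_j\})\, f_u(\{t_i, t_j\})\right)\right]$$

where

$$c_S = \sum_{u \in \mathcal{P}(S),\; t'_u \in R_u} \; \sum_{t'_k \in R_k \mid k \in S^C \cup U^C} (-1)^{|U| + |S|}\, p(\{t'_k \mid t'_u\})$$

Proof. The proof of Theorem 3.4 follows the proof of Lemma 2.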
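As a small worked instance of the delta method for AVERAGE $= A_1 / A_2$, the sketch below handles the simplest setting only: a single relation with independent tuples, where $\mathrm{Var}[A_1] = \sum_t p_t (1 - p_t) f_t^2$, $\mathrm{Var}[A_2] = \sum_t p_t (1 - p_t)$ and $\mathrm{Cov}(A_1, A_2) = \sum_t p_t (1 - p_t) f_t$. The function names and the `(p_t, f_t)` input format are assumptions made for this example; the multi-relation case would instead obtain the moments from Theorem 3.4.

```python
def delta_method_ratio(e1, e2, var1, var2, cov12):
    """First-order (delta method) moments of A1 / A2.

    The gradient of F(a1, a2) = a1 / a2 at (e1, e2) is (1/e2, -e1/e2^2).
    """
    g1 = 1.0 / e2
    g2 = -e1 / (e2 * e2)
    mean = e1 / e2
    var = g1 * g1 * var1 + 2.0 * g1 * g2 * cov12 + g2 * g2 * var2
    return mean, var


def average_moments(tuples):
    """Approximate E and Var of AVERAGE(f) over one relation.

    tuples: list of (p_t, f_t) pairs, assuming independent tuples.
    """
    e_sum = sum(p * f for p, f in tuples)                     # E[SUM(f)]
    e_cnt = sum(p for p, _ in tuples)                         # E[COUNT]
    var_sum = sum(p * (1.0 - p) * f * f for p, f in tuples)   # Var[SUM(f)]
    var_cnt = sum(p * (1.0 - p) for p, _ in tuples)           # Var[COUNT]
    cov = sum(p * (1.0 - p) * f for p, f in tuples)           # Cov(SUM, COUNT)
    return delta_method_ratio(e_sum, e_cnt, var_sum, var_cnt, cov)
```

A quick sanity check on the formula: if every tuple carries the same value, the average is that value in every non-empty world, and the approximated variance is zero.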


The computation of the terms in Theorem 3.4 corresponding to set $S$ is identical for all $E[A_u^2]$ and $E[A_u A_v]$ terms, except for the actual aggregates computed. Since the GROUP BY is the same, they can all be combined into a single query that computes all required aggregates simultaneously. This means that the overall number of GROUP BYs is $2^n$, irrespective of the number of sum aggregates combined by the non-linear aggregate. The number of aggregates computed by each such query will be $k(k+1)/2$, where $k$ is the number of SUM aggregates being combined. Theorem 3.4 can be evaluated by the same techniques discussed in Section 3.5.

3.6.2 GROUP BY Queries

If the query contains a GROUP BY clause, confidence intervals for the value of the aggregates have to be provided for each of the groups. The computation of the moments of the aggregates for each of these groups would be no different from the computation of the aggregates for the entire relations. With this in mind, one method to obtain the desired confidence intervals per group would be to generate the tuples in each group and then apply the techniques described in this chapter to each such group. Fortunately, this can be accomplished using the techniques in Section 3.5 without the need to change the database engine or resort to external computation. Essentially, for each aggregate that is generated we have to add a GROUP BY clause in SQL, and a sorting item when generating optimal sorting orders, to account for the extra grouping.

3.7 Experiments

As shown in Chapter 1, confidence intervals of $A$ can be easily computed by either distribution-dependent or distribution-independent methods given $E[A]$ and $\mathrm{Var}[A]$. Thus in this section we focus on the computation of moments. The most important question in these experiments is what the overhead of computing the moments of the probabilistic aggregates is when compared to the execution of the non-probabilistic query. As we will see in this section, the overhead is usually small for the Tuple-independent model and reasonable for the graphical model; this is especially true for the optimized version.
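The step from the two moments to a confidence interval, mentioned at the start of this section, can be sketched as follows. Chapter 1 is not reproduced here, so the choice of methods is an assumption: Chebyshev's inequality stands in for a distribution-independent bound and a normal approximation for a distribution-dependent one. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def chebyshev_interval(mean, var, confidence=0.95):
    """Distribution-independent interval: P(|A - E[A]| >= k*sd) <= 1/k^2."""
    k = 1.0 / sqrt(1.0 - confidence)
    half = k * sqrt(var)
    return mean - half, mean + half


def normal_interval(mean, var, confidence=0.95):
    """Interval assuming A is approximately normally distributed."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    half = z * sqrt(var)
    return mean - half, mean + half
```

The Chebyshev interval is always valid but wider; the normal interval is tighter and is only appropriate when approximate normality of the aggregate is justified.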


A second question to ask is what the relative performance of the various methods for moment computation is. As we will see, the computation using the Middleware method is significantly more efficient than the InDB method, which suggests that performing such complex computations in the database engine might not be desirable with current systems.

Methodology. All the experiments were carried out on a machine with a 4-core 2.4GHz CPU and 8GB RAM, running Linux. We used Postgres as the DBMS without any modification. The InDB algorithms work as a client program that generates the SQL queries to implement the probabilistic aggregates. The Middleware algorithm uses Postgres to generate the unaggregated table and implements multi-threaded algorithms in C++. Both the database and the C++ program were executed on the same machine. Note that the query execution time does not depend on the values of the probabilities since the same computations are performed with different numbers. For this reason we do not investigate the dependency on various choices for the probabilities involved in the probabilistic models.

The InDB techniques are implemented using query rewriting. Aggregation over one relation is special-cased. Both algorithms first generate the unaggregated table and then execute rewritten SQL over it. The main difference between the InDB-$M^2$ and the InDB-$M \log M$ techniques is that the InDB-$M^2$ method needs to evaluate a single query over the cross-product unaggregated $\times$ unaggregated, whereas the $M \log M$ method needs to evaluate $2^n$ GROUP BY queries over unaggregated. The Middleware technique is implemented in C++ using multi-threading. A thread is allocated to each term that can be computed in parallel for each sorted order prescribed by the method in Section 3.5.2.

3.7.1 Experiments on Tuple-independent Model

For the Tuple-independent model, we used the TPC-H benchmark to generate two datasets with sizes of 1GB and 10GB. We then simulated probabilistic databases by adding a probability attribute to each tuple, whose value was uniformly distributed in the [0, 1] interval.


The algorithms are evaluated on queries Q1, Q3, Q5 and Q6 (the queries in TPC-H without subqueries). Results of the experiments on computing moments over the 1GB dataset are shown in Figure 3-3 and results over the 10GB dataset are shown in Figure 3-5. To gain insight into how the time is spent, we report the full computation time in Figure 3-2 and Figure 3-4. The full computation time for multiple relations includes the time to generate the unaggregated table and the time to evaluate the moments.

Aggregation over one relation. Queries Q1 and Q6 involve a single relation, on which only the InDB algorithms are tested. For Q6, the time to perform the probabilistic aggregate is virtually the same as the time for the non-probabilistic aggregate for both database sizes. The time to find matching tuples dwarfs the aggregation time. For Q1, most of the time is spent on computing aggregates, not on finding matching tuples. Since the probabilistic aggregate needs 25 aggregations versus only 8 aggregations for the non-probabilistic aggregate, the running time is approximately 25/8 = 3.125 times higher. The time spent on Q1 seems to be CPU bound, not I/O bound.

Aggregation over multiple relations. Computation of these aggregates includes two steps: generating an unaggregated table and evaluating moments over it. Comparing the full computation time with the aggregate time, the time to generate the unaggregated table is almost the same as the execution time of the original non-probabilistic query. This is expected since the aggregation is usually the last operation performed, thus the unaggregated table is generated as an intermediate result. When evaluating moments over the unaggregated table, we removed the GROUP BY clause in Q3 to see what happens when the size of the groups grows. For the 1GB dataset, 30569 tuples are now part of a single group for Q3. For the 10GB dataset, 301618 tuples form the single group. When we tried to run these queries in Postgres using the InDB-$M^2$ algorithm, the system gave up with a message that the maximum number of tuples allowed was exceeded. We estimated the times based on the rate of processing tuples for the 0.1GB dataset size, 62500 tuples/second. This results in an estimate of 4 hours for the 1GB dataset and


15 days for the 10GB dataset. In both situations, all the tuples of relation unaggregated can be stored in memory, thus the large running time is due exclusively to computational inefficiency. In comparison, the $M \log M$ technique can compute the moments in only 10 seconds for the 10GB dataset. For truly large datasets, it is crucial to use the $M \log M$ method to ensure that the moments can be computed in reasonable time.

Total overhead. While the algorithms we proposed, especially the optimized algorithm, might seem to require a lot of effort, the experimental results paint a different picture. Except for Q1, which is CPU bound, for all the other queries at least one of the versions of the algorithm requires less than 10% overhead. The overhead seems to shrink for larger databases, which means that for TPC-H type queries the algorithms are expected to scale to very large databases. Finding matching tuples seems to be much more time consuming than computing moments from the unaggregated table.

Comparison of techniques. The advantage of the two $M \log M$ algorithms is that they do not require the formation of the cross-product within each group of tuples in the unaggregated table. Thus, the performance of $M \log M$ is usually better than that of the InDB-$M^2$ algorithm, as the results of the experiments on Q3 demonstrate. InDB-$M^2$ performs better than InDB-$M \log M$ on Q5 because there are only 23 matching tuples for the 1GB dataset and 195 matching tuples for the 10GB dataset for this query, but 6 relations are involved. A large number of queries are posed to the database engine ($2^6 = 64$ queries) in order to run the $M \log M$ algorithm. The time is dominated by parsing and query plan generation, not by the execution itself. Middleware-$M \log M$ still performs best overall for this query and shows its advantages in all the experiments. The more relations are involved, the better it performs, because more terms can be parallelized and the C++ computation is more efficient than the in-database computation.

3.7.2 Experiments on Graphical Model

For the graphical model, experiments were carried out on the Part-supplier relation shown in Figure 3-6. Node partsupp keeps the correlations between parts and suppliers. The

joint probability in partsupp states the probability that one supplier supplies certain parts. Our experiment considers the following two correlations between parts and suppliers: partial dependence, where one supplier supplies only a small fraction of the parts; and full dependence, where one supplier supplies all the parts and one part is supplied by all the suppliers. 30000 nodes are generated in the partsupp table for partial dependence, with one supplier supplying 3 parts. 1000 x 1000 nodes are generated for full dependence. Since the MlogM algorithms outperformed M2 in the tuple independence experiments, we only performed these experiments using the MlogM algorithms. We compare the performance of InDB-MlogM, Middleware-MlogM and the non-probabilistic query in Figure 3-7.

Figure 3-2. Total Time On TPC-H Queries, 1 GB dataset
Figure 3-3. Aggregate Time On TPC-H Queries, 1 GB dataset
Figure 3-4. Total Time On TPC-H Queries, 10 GB dataset
Figure 3-5. Aggregate Time On TPC-H Queries, 10 GB dataset

Figure 3-6. The graphical model used in the experiments
Figure 3-7. The experimental results for the graphical model

As shown in Theorem 3.3, computation of moments for the graphical model requires probabilistic inference, which incurs additional cost. In our Part-supplier model, only the joint probabilities are given in the partsupp node. The probabilities of part and supplier are required for the computation of moments and they have to be inferred by the variable elimination method, which requires substantial computation. Thus, unlike the Tuple-independent model, whose overhead for computing moments is quite small, the overhead of computing moments in the graphical model is large but reasonable. Middleware-MlogM still outperforms InDB-MlogM in this case.

3.7.3 The Central Limit Theorem and Probabilistic Aggregates

Given a large number of i.i.d. random variables $X_i$ with $E(X_i) = \mu$ and $Var(X_i) = \sigma^2 < \infty$, the central limit theorem (CLT) shows $\sqrt{n}(\bar{X} - \mu)/\sigma \to N(0,1)$, where $\bar{X}$ is the sample mean. Suppose the probabilistic database is composed of $n$ such random variables and let $A = \sum_i X_i$. The equation can be transformed as
\[ \frac{n\bar{X} - n\mu}{\sqrt{n\sigma^2}} = \frac{A - \bar{A}}{\sqrt{Var(A)}} \to N(0,1), \]
with $\bar{A} = E(A) = n\mu$. This means A tends to be normally distributed. The expected value and variance of the normal distribution could then characterize the behavior of aggregation over probabilistic databases. However, the i.i.d. assumption is rigid and impractical for probabilistic databases, thus the traditional CLT cannot be applied directly to them.

In order to examine the distribution of probabilistic aggregates, we carried out two experiments over a 100 MB dataset: aggregation over independent and over dependent random variables, each with non-identical distributions. We generated more than 1000 instances of the probabilistic relations and executed a SUM query on these instances. The values of the aggregates obtained, which are i.i.d. samples from the distribution of the probabilistic aggregate, were used to estimate the PDF and CDF of the distribution.

Figure 3-8. PDF of sum of independent discrete random variables
Figure 3-9. CDF of sum of independent discrete random variables
Figure 3-10. PDF of sum of partly independent discrete random variables
Figure 3-11. CDF of sum of partly independent discrete random variables

Figures 3-8 to 3-11 depict the experimental results together with the approximation of these quantities obtained by approximating the distribution of the probabilistic aggregates with a normal distribution whose mean and variance are computed using our method. What is immediately apparent from these experiments is that the normal approximation is surprisingly good, especially when it comes to the approximation of the CDF.
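The experiment above can be mimicked on a small scale. The following is a minimal sketch, not the setup used in the dissertation: it assumes a tuple-independent relation whose tuples carry hypothetical values and probabilities drawn at random, samples possible worlds, and compares the empirical CDF of the SUM with a normal CDF whose mean and variance come from the closed-form moments of independent tuples. All function and variable names are illustrative.

```python
import random
from statistics import NormalDist


def simulate_sum(values, probs, trials, rng):
    """Draw `trials` possible worlds and return the SUM observed in each."""
    sums = []
    for _ in range(trials):
        # Each tuple is present independently with its own probability.
        sums.append(sum(a for a, p in zip(values, probs) if rng.random() < p))
    return sums


def exact_moments(values, probs):
    """Mean and variance of A = sum(a_i * X_i) for independent 0/1 X_i."""
    mean = sum(a * p for a, p in zip(values, probs))
    var = sum(a * a * p * (1 - p) for a, p in zip(values, probs))
    return mean, var


def max_cdf_gap(samples, mean, var):
    """Largest gap between the empirical CDF and the fitted normal CDF."""
    normal = NormalDist(mean, var ** 0.5)
    ordered = sorted(samples)
    n = len(ordered)
    return max(abs((i + 1) / n - normal.cdf(x)) for i, x in enumerate(ordered))


# Hypothetical data: 200 tuples with non-identical distributions.
rng = random.Random(0)
values = [rng.randint(1, 50) for _ in range(200)]
probs = [rng.uniform(0.05, 0.95) for _ in range(200)]

mean, var = exact_moments(values, probs)
samples = simulate_sum(values, probs, 2000, rng)
gap = max_cdf_gap(samples, mean, var)
```

A small value of `gap` indicates that the normal CDF tracks the sampled CDF closely for this synthetic relation; it says nothing about dependent tuples, which the sketch does not model.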

These results encourage us to further study the relationship between probabilistic aggregates and the normal distribution. We study the following two typical cases in probabilistic databases:

Case 1. Random variables are independent, non-identically distributed. Extending Example 2.7.2 of [24], which used Liapounov's central limit theorem, we have the following theorem:

Theorem 3.5. Let $X_1, \ldots, X_n$ be independent variables with $E(X_i) = p_i$ ($0 \le p_i \le 1$) and $Var(X_i) = p_i(1 - p_i)$. Let $A = \sum a_i X_i$, where $a_i$ is a constant. Then
\[ \frac{A - E(A)}{\sqrt{Var[A]}} \to N(0,1) \]
provided
\[ \Big(\sum p_i(1 - p_i)\big(p_i^2 + (1 - p_i)^2\big) a_i^3\Big)^2 = o\Big(\Big(\sum p_i(1 - p_i) a_i^2\Big)^3\Big). \]
Theorem 3.5 gives the condition under which the central limit theorem holds.

Case 2. Random variables are dependent, non-identically distributed. In this case, analysis becomes difficult since no unified framework is available for dependent random variables. Previous research mainly focused on specific dependence models. Hoeffding and Robbins [39] showed that m-dependent sequences have the central limit property. Peligrad and Utev [74] derived a central limit theorem for a triangular array under mixing conditions. These results suggest that the central limit theorem continues to hold for some weakly dependent random variables.

3.8 Summary

In this chapter, we first described a framework to efficiently compute confidence intervals for distinct-free, SUM-like aggregates over probabilistic databases. Then we applied this general theory to the Tuple-independent and graphical models. The core of the method is a statistical analysis of the expectation and variance of the aggregates as random variables. We presented both unoptimized, but more general, and optimized formulas for the variance of the aggregates and showed how these formulas can be evaluated using InDB and Middleware techniques. Our experimental results indicate that computing moments of probabilistic aggregates using the method we described and existing database technology has a small overhead for the Tuple-independent model and

reasonable overhead for the graphical model, even without any changes or optimizations to the database system.
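To make the step from moments to confidence intervals concrete, here is a minimal sketch. It assumes Chebyshev's inequality as the distribution-independent bound and the normal approximation of Section 3.7.3 as the distribution-dependent bound; the chapter does not spell out which bounds were implemented, so both choices, the function names and the sample tuples are illustrative assumptions.

```python
from statistics import NormalDist


def chebyshev_interval(mean, var, confidence):
    """Distribution-independent interval from Chebyshev's inequality.

    P(|A - E(A)| >= k * sd) <= 1 / k^2, so k = 1 / sqrt(1 - confidence).
    """
    k = (1.0 / (1.0 - confidence)) ** 0.5
    half = k * var ** 0.5
    return mean - half, mean + half


def normal_interval(mean, var, confidence):
    """Interval that assumes the aggregate is approximately normal."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    half = z * var ** 0.5
    return mean - half, mean + half


# Hypothetical tuple-independent relation: (value, probability) pairs.
tuples = [(40.0, 0.9), (30.0, 0.5), (45.0, 0.7), (35.0, 0.2)]
mean = sum(a * p for a, p in tuples)
var = sum(a * a * p * (1 - p) for a, p in tuples)

cheb = chebyshev_interval(mean, var, 0.95)
norm = normal_interval(mean, var, 0.95)
```

The Chebyshev interval is always the wider of the two at the same confidence level, which is the price of making no assumption about the shape of the distribution.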

CHAPTER 4
PROBABILISTIC AGGREGATION WITH DUPLICATE ELIMINATION

4.1 Background

Probabilistic aggregation has been proved to be a hard problem [18]; in particular, questions of the form P[X > x] for X a SUM-like aggregate are #P-complete. In Chapter 3, we presented a framework to estimate probabilistic aggregates based on exact computation of the moments and the use of distribution dependent/independent bounds based on them. This effectively avoids the hard questions and results in efficient algorithms. The method is general and can be implemented using an existing DBMS without major redesign. To simplify the discussion, Chapter 3 does not cover aggregation with duplicate elimination. Since aggregation with duplicate elimination appears in many queries, the need to process such probabilistic aggregates arises. A natural approach to this problem is to apply the solution of Chapter 3 to this case. As it turns out, this does not work in this situation. To see why, let us look at an example:

Example 2. Suppose there are two relations R1 and R2 in a probabilistic database. R1 represents a relationship between products and customers. R2 represents a relationship between manufacturers and products. Tuples in these two relations are independent.

Table 4-1. Relation R1
customer  product  probability
c1        a        p11
c1        b        p12
c2        a        p13
c2        d        p14

Table 4-2. Relation R2
manufacture  product  price  probability
m1           a        40     p21
m2           b        30     p22
m3           a        45     p23
m4           c        35     p24

Suppose one

query returns the number of customers who buy products from manufacturers:

    SELECT COUNT(DISTINCT(customer)) FROM R1, R2 WHERE R1.product = R2.product

In order to compute this aggregate, we need to project customers after the join and compute the probabilities. The probability of projected tuples can be obtained using the method provided by [18]. Table 4-3 shows the projection result.

Table 4-3. The result of $\pi_{customer, probability}(R1 \bowtie R2)$
customer  probability
c1        $1 - (1 - p_{11}(1 - (1 - p_{21})(1 - p_{23})))(1 - p_{12} p_{22})$
c2        $p_{13}(1 - (1 - p_{21})(1 - p_{23}))$

Once we obtain the probabilities of projected tuples, we can compute the aggregate. We use the same notation as in Chapter 3 and denote by A a random variable that represents the value of the aggregate. We take advantage of the techniques of Chapter 3 to work on this problem: SUM-like aggregation over probabilistic databases is expressed algebraically. Each projected tuple is expressed by a 0,1 random variable x with E(x) = p(x). Then, by the linearity of expectation, the expected value of the aggregate is
\[ E(A) = 1 - (1 - p_{11}(1 - (1 - p_{21})(1 - p_{23})))(1 - p_{12} p_{22}) + p_{13}(1 - (1 - p_{21})(1 - p_{23})) \]
Now we want to provide a confidence interval for this estimated value. In Chapter 3, the Kronecker delta symbol is used to express correlation information by encoding whether any two tuples from one relation are the same. The joint probability of two tuples from one relation can be expressed by a linear function of the Kronecker delta symbol. In addition, the Kronecker delta symbol collapses double sums into one sum, which avoids cross-product computation of matching tuples. If the Kronecker delta symbol can be used in this example, then we can follow the same analysis as in Chapter 3
to derive the formulas for the confidence interval. Unfortunately, the Kronecker delta symbol

does not work in this example to capture the correlation of tuples. We need to look at the intermediate result of this query in Table 4-4 to explain why.

Table 4-4. The result of $\pi_{customer, product, probability}(R1 \bowtie R2)$
customer  product  probability
c1        a        $p_{11}(1 - (1 - p_{21})(1 - p_{23}))$
c1        b        $p_{12} p_{22}$
c2        a        $p_{13}(1 - (1 - p_{21})(1 - p_{23}))$

In this intermediate relation, the first tuple and the third tuple become correlated since they contain the same tuple from R2. The second tuple and the third tuple are independent. After projection, the first tuple and the second tuple are projected to the first tuple in Table 4-3. The third tuple in Table 4-4 is projected to the second tuple in Table 4-3. Each projected tuple contains the join information of one group. The relationship between the two tuples in Table 4-3 is complicated: they are neither independent nor dependent in the sense of Chapter 3. No shared parts of correlated groups can be factorized, which means the groups cannot be divided into an intersection of independent groups. Thus the Kronecker delta symbol cannot be used to encode the projected tuples. Another method should be used to derive the probability of pairwise tuples and obtain the confidence intervals. In the following sections, we first introduce the basic techniques used in this chapter. Then we focus on deriving the formula for the probability of pairwise projected tuples. Based on the derived formula, we present the formula for the second moment of the probabilistic aggregate with duplicate removal. We validate and benchmark the technique in the latter part of this chapter.

4.2 Preliminaries

Probabilistic databases describe a probability space in which tuples are possible instances. Aggregation over probabilistic databases aggregates over all these possible instances. In this chapter, we evaluate aggregation with the DISTINCT operation and provide its confidence interval, which extends the discussion of Chapter 3 and requires computation of the moments of the estimate. Before we derive formulas for the

moments, we first formally present the queries discussed in this chapter, then we introduce background on query evaluation. At the end of this section, we discuss the correlations over safe plans.

4.2.1 Queries and Equivalent Algebraic Expressions

The basic type of query we consider in this chapter is:

    SELECT SUM(DISTINCT F(t1 ... tn)) FROM R1, ..., Rn WHERE SIMPLE_CONDITION;

Just as described in Chapter 3, R1, ..., Rn are n relations, t_i is a tuple in R_i, F() is an aggregate function such as SUM that depends on tuples from each relation, and SIMPLE_CONDITION is a condition involving only tuples from relations R1, ..., Rn. Arbitrary selections and join conditions are allowed but no subqueries of any kind. The algebraic formula expressing the value of the aggregate is:
\[ A = \sum_{j \in \{1 \ldots m\},\, t_1 \in R_1} \cdots \sum_{t_n \in R_n} F(u_j(t_1, \ldots, t_n))\, I(t_1, \ldots, t_n) = \sum_{j \in \{1 \ldots m\},\, \{t_i \in R_i \mid i \in \{1:n\}\}} f(u_j(\{t_i\})) \quad (4) \]
Here, $F(t_1, \ldots, t_n)$ expresses an aggregate function and $I(t_1, \ldots, t_n)$ is the filtering function. The new symbol $u_j(t_1, \ldots, t_n)$ represents the jth projected tuple. Using the same trick as in the previous chapter, we introduce a random variable to indicate whether $t_i$ is present in the instances. Thus we obtain the first two moments of the aggregate:
\[ E(A) = \sum_{j \in \{1 \ldots m\},\, \{t_i \in R_i \mid i \in \{1:n\}\}} f(u_j(\{t_i\}))\, p(u_j(\{t_i\})) \]
\[ E(A^2) = \sum_{j, j' \in \{1 \ldots m\},\, \{t_i \in R_i \mid i \in \{1:n\}\}} f(u_j(\{t_i\}))\, f(u_{j'}(\{t_i\}))\, p(u_j(\{t_i\}), u_{j'}(\{t_i\})) \quad (4) \]
From Equation 4, the key to computing the moments is to obtain the probabilities of projected tuples and the joint probability of pairwise projected groups. Evaluation of the probability of projected tuples is a widely studied problem. We will discuss this problem in Section 4.2.2. No previous work has been done on the joint probability of the pairwise

projected groups. This is our focus and we will provide a method to compute it in this chapter.

4.2.2 Query Evaluation on Probabilistic Databases

A lot of research has been done on query evaluation on probabilistic databases. Conjunctive queries are divided into hierarchical queries [20] and non-hierarchical queries. A conjunctive query without self-joins is considered a hierarchical query when the relational predicate sets of any two non-head variables are either inclusive or disjoint. Most research models probabilistic databases as tuple independent or graphical. For tuple independent probabilistic databases, Suciu [20] proved that hierarchical queries can be evaluated in PTIME while evaluation of other queries has #P-hard data complexity. Evaluation of PTIME queries should be guided, since independent tuples become correlated after joining the same tuple from other relations. Suciu [18] presented an efficient algorithm to select an optimization plan, the safe plan, which avoids dependence in each group by projecting tuples before joining tuples. Olteanu [72] took advantage of structural and functional dependencies of queries to infer exact probabilities without modifying existing optimization plans, which resulted in better performance. Sen [88] studied query evaluation using graphical models. They used factor decomposition to extract shared tuples to infer probabilities. For non-PTIME queries, Suciu [18, 83] used Luby and Karp's Monte Carlo approximation algorithm [58] for general and top-k query evaluation. [49] used a Monte Carlo framework to manage uncertain data. In this chapter, we use the tuple independent model to represent probabilistic databases. To provide the exact expected value of the aggregate, we restrict our discussion to hierarchical queries, which can be evaluated in polynomial time.

Since we focus on PTIME queries in this chapter, the probability of the projected tuple in Equation 4 can be computed following [18]. Suppose there are $m_j$ independent tuples in

the projected group $g_j$:
\[ p(u_j(\{t_i\})) = 1 - \prod_{j=1}^{m_j}\Big(1 - \prod_{i=1}^{n} p(t_{ij})\Big) \quad (4) \]
To provide the error guarantee of the expected value in Equation 4, there is still a long way to go since the joint probability of pairwise groups remains unknown. We study this problem in the following section.

4.3 Analysis of Multi-relation Sum Aggregation with Duplicate Removal

We face the challenge of computing the probability of correlated groups. In this section, we start with a simple case to analyze our problem. Then we take advantage of the structural information of safe queries to compute the joint probability of pairwise groups.

4.3.1 Probability of Correlated Groups

Figure 4-1 represents the computation of the joint probability of two groups u, v: the left group u has m tuples which are joined from n relations and the right group v has m' joined tuples. The top three layers represent operations and the bottom layer represents tuples. The top layer shows the intersection relationship of the two groups, which are projected from the joined tuples as shown in the 2nd and 3rd layers. Since the same tuples can be involved in different groups and one group can have many such tuples, no shared factors can be extracted within one group. It seems that we reach an impasse in computing the joint probability of pairwise groups. In order to tackle this problem, we start with a simple case. Consider two groups u, v with two joined tuples from two relations in each group. Suppose only one joined tuple $e \wedge f$ is correlated with one tuple $e \wedge g$ in the other group by sharing e. The remaining joined tuples b and d are independent.

Table 4-5. Group u
$e \wedge f$
b

Table 4-6. Group v
$e \wedge g$
d

Figure 4-1. Nested correlations in $u \wedge v$

The pairwise probability p(u, v) can be expressed as the probability of $((e \wedge f) \vee b) \wedge ((e \wedge g) \vee d)$. Suppose the probability of each tuple is known. One straightforward way to compute the joint probability is to use the distributivity of conjunction and the complement law to transform the expression. After transformation,
\[ u \wedge v = \big(e \wedge (f \vee b) \wedge (g \vee d)\big) \vee \big(\bar{e} \wedge (b \wedge d)\big) \]
The right side of the equation consists of two disjoint events, thus
\[ p(u \wedge v) = p\big(e \wedge (f \vee b) \wedge (g \vee d)\big) + p\big(\bar{e} \wedge (b \wedge d)\big) \]
This method divides the complicated event into two disjoint events, each of which is composed of independent events, thus the probability of the complex event can be derived. This transformation separates one shared event from two conjoined events. Note that both expressions $(f \vee b) \wedge (g \vee d)$ and $(b \wedge d)$ have the same format as the original expression. If b and d have correlated items, this method can be applied to these two expressions recursively. However, both subexpressions include b and d, which requires computing the conjunctive probability of b and d twice. As a result, the

computation will become expensive and exponential in the number of shared events. In order to avoid this expensive computation, we turn to De Morgan's law. We first transform $p(u \wedge v)$:
\[ p(u \wedge v) = 1 - p(\bar{u} \vee \bar{v}) = 1 - p(\bar{u}) - p(\bar{v}) + p(\bar{u} \wedge \bar{v}) \]
Since $p(\bar{u})$ and $p(\bar{v})$ are known, we focus on computing $p(\bar{u} \wedge \bar{v})$:
\[ \bar{u} \wedge \bar{v} = \overline{(e \wedge f) \vee b} \wedge \overline{(e \wedge g) \vee d} = \overline{(e \wedge f) \vee (e \wedge g)} \wedge \bar{b} \wedge \bar{d} \]
After the transformation, the dependent terms and the independent terms are separated. The independent terms are easy to compute. We only need to compute the dependent part, which is discussed later. Now we extend this observation to general cases. Figure 4-2 represents the operation after this transformation. The left part represents dependent tuples and the right part represents independent tuples.

Figure 4-2. $u \wedge v$

We obtain the probability of the pairwise groups as follows:

Lemma 3. Consider two groups u and v. Let u and v have $n_u$ and $n_v$ independent tuples respectively. The first k tuples of u and v are dependent and the rest

of the tuples are independent. Then
\[ p(u \wedge v) = 1 - \prod_{i=1}^{n_u}\big(1 - p(u_{t_i})\big) - \prod_{i=1}^{n_v}\big(1 - p(v_{t_i})\big) + p(\bar{u}_{t_1} \wedge \cdots \wedge \bar{v}_{t_k}) \prod_{i=k+1}^{n_u}\big(1 - p(u_{t_i})\big) \prod_{i=k+1}^{n_v}\big(1 - p(v_{t_i})\big) \]
where $p(\bar{u}_{t_1} \wedge \cdots \wedge \bar{v}_{t_k})$ is the joint probability of the complements of the first k tuples of u and of v.

Proof. Since the tuples in each group are independent,
\[ p(\bar{u}) = p(\overline{u_{t_1} \vee \cdots \vee u_{t_{n_u}}}) = p(\bar{u}_{t_1} \wedge \cdots \wedge \bar{u}_{t_{n_u}}) = \prod_{i=1}^{n_u}\big(1 - p(u_{t_i})\big) \]
Similarly,
\[ p(\bar{v}) = \prod_{i=1}^{n_v}\big(1 - p(v_{t_i})\big) \]
Then the same technique as in the simple case is used to derive $p(\bar{u} \wedge \bar{v})$:
\[
\begin{aligned}
p(\bar{u} \wedge \bar{v}) &= p(\overline{u_{t_1} \vee \cdots \vee u_{t_{n_u}}} \wedge \overline{v_{t_1} \vee \cdots \vee v_{t_{n_v}}}) \\
&= p(\bar{u}_{t_1} \wedge \cdots \wedge \bar{u}_{t_{n_u}} \wedge \bar{v}_{t_1} \wedge \cdots \wedge \bar{v}_{t_{n_v}}) \\
&= p(\bar{u}_{t_1} \wedge \cdots \wedge \bar{v}_{t_k}) \prod_{i=k+1}^{n_u} p(\bar{u}_{t_i}) \prod_{i=k+1}^{n_v} p(\bar{v}_{t_i}) \\
&= p(\bar{u}_{t_1} \wedge \cdots \wedge \bar{v}_{t_k}) \prod_{i=k+1}^{n_u}\big(1 - p(u_{t_i})\big) \prod_{i=k+1}^{n_v}\big(1 - p(v_{t_i})\big)
\end{aligned}
\]
Thus,
\[ p(u \wedge v) = 1 - \prod_{i=1}^{n_u}\big(1 - p(u_{t_i})\big) - \prod_{i=1}^{n_v}\big(1 - p(v_{t_i})\big) + p(\bar{u}_{t_1} \wedge \cdots \wedge \bar{v}_{t_k}) \prod_{i=k+1}^{n_u}\big(1 - p(u_{t_i})\big) \prod_{i=k+1}^{n_v}\big(1 - p(v_{t_i})\big) \]

In the above lemma, the formula is an expression of independent terms and correlated terms. Correlated terms are not necessarily dependent on each other. One tuple may be dependent on only i of the correlated tuples and independent of the
rest $k - i$ tuples. Because the expression $p(\bar{u}_{t_1} \wedge \cdots \wedge \bar{v}_{t_k})$ contains these k correlated tuples, its computation is complicated.

4.3.2 Correlations in PTIME Queries

The joined tuples from different groups become correlated when they join the same tuples. Even for aggregation over two relations, query evaluation is complex. We focus on PTIME queries in this chapter. Now we study how correlations occur in safe queries and use subgoals [20] to analyze this problem. Given a conjunctive query $q = g_1, \ldots, g_n$, $g_i$ is called a subgoal Sg(q) of q and sg(x) is a subgoal whose key is x. We study safe queries in different statistical models. Given a query in the Disjoint-independent model, if Sg(q) = sg(x), then this query is safe to be projected on x and the tuples in the projected group are independent. This means only common key attributes of relations are safe to be projected. Thus key attributes of joined relations are preserved and joined tuples are different from each other. Each joined tuple forms one projection and no duplicate tuples exist in this projection group. In this case, computation of the pairwise probability of groups is reduced to the problem solved in Chapter 3 and we can directly use previous results. Duplicate elimination of aggregation in the Tuple-independent model is a different story. The prerequisite for safe queries in the Tuple-independent model becomes looser: if the subgoals of any non-head variables are hierarchical, then the conjunctive query is safe [20]. For example, consider a conjunctive query and two key attributes x, y. Let $sg(x) = \{R_1\}$, $sg(y) = \{R_1, R_2\}$ and let the projected attribute be y. If this query is safe, then $sg(x) \subseteq sg(y)$ [20]. Since the key attributes of R1 include both x and y, one value of x or y can occur multiple times, which results in one tuple of R2 joining multiple tuples of R1. After the join, tuples with the same values of x become correlated. Since the projected attribute is y, the correlated tuples are distributed in different projected groups and the correlations are now nested. It seems we reach an impasse. However, observing the hierarchical

subgoals, we find that the correlations can be organized in a tree structure instead of a mesh structure. Although one tuple of R1 can be scattered into different groups after the join, the join mapping between R1 and R2 is one to many. It means the correlation is caused by the relations with fewer keys. We organize relations in the order of the number of keys, and the relations with the fewest keys are put in the top position. If we track the tuples in the order of the organized relations, we can find all the correlations. Figure 4-3 shows this situation. The value in the top layer represents a tuple from A, nodes in the 2nd layer show tuples from B and nodes in the 3rd layer show tuples from relation C. The last layer shows the join results. The hierarchical structure of subgoals in safe queries results in data being correlated hierarchically. If correlation exists in the child nodes, then it must also exist in the parent nodes.

Figure 4-3. Hierarchical structure of correlations

Once we know the rule for correlations, we can compute the probability of correlated groups. Although there is one shared value between two groups, we can divide groups into small subgroups which share the same items. Since the correlation is hierarchical, the subgroups are divided recursively from the top to the bottom of the tree. Computation is recursive and we only compute the first level correlation.

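Going back to Lemma 3, we can now compute $p(\bar{u}_{t_1} \wedge \cdots \wedge \bar{v}_{t_k})$:
\[
\begin{aligned}
p(\bar{u}_{t_1} \wedge \cdots \wedge \bar{v}_{t_k})
&= p\Big(\overline{\textstyle\bigwedge_{i=1}^{n} u_{t_1 i}} \wedge \cdots \wedge \overline{\textstyle\bigwedge_{i=1}^{n} v_{t_o i}}\Big)\; p\Big(\overline{\textstyle\bigwedge_{i=1}^{n} u_{t_s i}} \wedge \cdots \wedge \overline{\textstyle\bigwedge_{i=1}^{n} v_{t_k i}}\Big) \\
&= p\Big(\big(\textstyle\bigvee_{i=1}^{n} \bar{u}_{t_1 i}\big) \wedge \cdots \wedge \big(\textstyle\bigvee_{i=1}^{n} \bar{v}_{t_o i}\big)\Big)\; p\Big(\big(\textstyle\bigvee_{i=1}^{n} \bar{u}_{t_s i}\big) \wedge \cdots \wedge \big(\textstyle\bigvee_{i=1}^{n} \bar{v}_{t_k i}\big)\Big) \\
&= p\Big(\big(\textstyle\bigvee_{j=1}^{l} \bar{u}_{t_1 j}\big) \vee \Big(\big(\textstyle\bigvee_{i=l+1}^{n} \bar{u}_{t_1 i}\big) \wedge \cdots \wedge \big(\textstyle\bigvee_{i=l+1}^{n} \bar{v}_{t_o i}\big)\Big)\Big)\; p\Big(\big(\textstyle\bigvee_{j=1}^{q} \bar{u}_{t_s j}\big) \vee \Big(\big(\textstyle\bigvee_{i=q+1}^{n} \bar{u}_{t_s i}\big) \wedge \cdots \wedge \big(\textstyle\bigvee_{i=q+1}^{n} \bar{v}_{t_k i}\big)\Big)\Big) \\
&= \Big(1 - \prod_{j=1}^{l} p(u_{t_1 j}) \Big(1 - p\Big(\overline{\textstyle\bigwedge_{i=l+1}^{n} u_{t_1 i}} \wedge \cdots \wedge \overline{\textstyle\bigwedge_{i=l+1}^{n} v_{t_o i}}\Big)\Big)\Big) \Big(1 - \prod_{j=1}^{q} p(u_{t_s j}) \Big(1 - p\Big(\overline{\textstyle\bigwedge_{i=q+1}^{n} u_{t_s i}} \wedge \cdots \wedge \overline{\textstyle\bigwedge_{i=q+1}^{n} v_{t_k i}}\Big)\Big)\Big) \quad (4)
\end{aligned}
\]
In the above equation, $\bar{u}_{t_1} \wedge \cdots \wedge \bar{v}_{t_k}$ is first divided into independent subparts. According to our analysis, some shared terms can be extracted from each subpart. The final result is a recursive expression: $p\big(\overline{\bigwedge_{i=l+1}^{n} u_{t_1 i}} \wedge \cdots \wedge \overline{\bigwedge_{i=l+1}^{n} v_{t_o i}}\big)$ and $p\big(\overline{\bigwedge_{i=q+1}^{n} u_{t_s i}} \wedge \cdots \wedge \overline{\bigwedge_{i=q+1}^{n} v_{t_k i}}\big)$ have the same format as $\bar{u}_{t_1} \wedge \cdots \wedge \bar{v}_{t_k}$, so they can be computed further using the same process. Combining the previous results, we obtain $p(u \wedge v)$:
\[
\begin{aligned}
p(u \wedge v) = {} & 1 - \prod_{i=1}^{n_u}\big(1 - p(u_{t_i})\big) - \prod_{i=1}^{n_v}\big(1 - p(v_{t_i})\big) + \prod_{i=k+1}^{n_u}\big(1 - p(u_{t_i})\big) \prod_{i=k+1}^{n_v}\big(1 - p(v_{t_i})\big) \\
& \cdot \Big(1 - \prod_{j=1}^{l} p(u_{t_1 j}) \Big(1 - p\Big(\overline{\textstyle\bigwedge_{i=l+1}^{n} u_{t_1 i}} \wedge \cdots \wedge \overline{\textstyle\bigwedge_{i=l+1}^{n} v_{t_o i}}\Big)\Big)\Big) \Big(1 - \prod_{j=1}^{q} p(u_{t_s j}) \Big(1 - p\Big(\overline{\textstyle\bigwedge_{i=q+1}^{n} u_{t_s i}} \wedge \cdots \wedge \overline{\textstyle\bigwedge_{i=q+1}^{n} v_{t_k i}}\Big)\Big)\Big) \quad (4)
\end{aligned}
\]

4.3.3 Moments of Aggregates with Duplicate Elimination

Once the probability of pairwise groups is known, we can compute the moments of aggregation over distinct groups.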
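The De Morgan step of Section 4.3.1 can be checked numerically on the simple case with groups u = (e AND f) OR b and v = (e AND g) OR d. The sketch below is illustrative only: the tuple probabilities are made-up values, and the function names are hypothetical. It compares the closed form obtained through the complements with a brute-force sum over all 32 possible worlds of the five independent tuples.

```python
from itertools import product


def brute_force(p):
    """Sum the weights of all possible worlds in which both u and v hold."""
    names = ["e", "f", "g", "b", "d"]
    total = 0.0
    for world in product([False, True], repeat=len(names)):
        t = dict(zip(names, world))
        weight = 1.0
        for name in names:
            weight *= p[name] if t[name] else 1.0 - p[name]
        u = (t["e"] and t["f"]) or t["b"]
        v = (t["e"] and t["g"]) or t["d"]
        if u and v:
            total += weight
    return total


def de_morgan(p):
    """p(u and v) = 1 - p(not u) - p(not v) + p(not u and not v)."""
    not_u = (1 - p["e"] * p["f"]) * (1 - p["b"])
    not_v = (1 - p["e"] * p["g"]) * (1 - p["d"])
    # Dependent part: neither (e and f) nor (e and g) holds.
    dependent = 1 - p["e"] * (1 - (1 - p["f"]) * (1 - p["g"]))
    # Independent part: b and d are both absent.
    not_u_and_not_v = dependent * (1 - p["b"]) * (1 - p["d"])
    return 1 - not_u - not_v + not_u_and_not_v


# Hypothetical tuple probabilities, chosen only for illustration.
probabilities = {"e": 0.6, "f": 0.5, "g": 0.7, "b": 0.2, "d": 0.3}
exact = brute_force(probabilities)
closed = de_morgan(probabilities)
```

The two values agree up to floating-point error, while the closed form touches each tuple probability a constant number of times instead of enumerating worlds.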

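Theorem 4.1. Let A be defined as in Equation 4. Then
\[ E(A) = \sum_{j \in \{1 \ldots m\},\, \{t_i \in R_i \mid i \in \{1:n\}\}} f(u_j(\{t_i\})) \Big(1 - p\big(\bar{u}_j(\{t_i\}_1) \wedge \cdots \wedge \bar{u}_j(\{t_i\}_{n_{u_j}})\big)\Big) \]
\[
\begin{aligned}
Var(A) = \sum_{j \in \{1 \ldots m\},\, \{t_i \in R_i \mid i \in \{1:n\}\}} \sum_{k \in \{1 \ldots m\}} \Big[ & \Big( p\big(\bar{u}_j(\{t_i\}_1) \wedge \cdots \wedge \bar{u}_k(\{t_i\}_{g(u_j,u_k)})\big) \\
& - p\big(\bar{u}_j(\{t_i\}_1) \wedge \cdots \wedge \bar{u}_j(\{t_i\}_{g(u_j,u_k)})\big)\, p\big(\bar{u}_k(\{t_i\}_1) \wedge \cdots \wedge \bar{u}_k(\{t_i\}_{g(u_j,u_k)})\big) \Big) \\
& \cdot p\big(\bar{u}_j(\{t_i\}_{g(u_j,u_k)+1}) \wedge \cdots \wedge \bar{u}_j(\{t_i\}_{n_{u_j}})\big)\, p\big(\bar{u}_k(\{t_i\}_{g(u_j,u_k)+1}) \wedge \cdots \wedge \bar{u}_k(\{t_i\}_{n_{u_k}})\big) \\
& \cdot f(u_j(\{t_i\}))\, f(u_k(\{t_i\})) \Big]
\end{aligned}
\]
where $g(u_j, u_k)$ defines the correlated tuples of groups $u_j(\{t_i\})$ and $u_k(\{t_i\})$. The joint probability $p\big(\bar{u}_j(\{t_i\}_1) \wedge \cdots \wedge \bar{u}_k(\{t_i\}_{g(u_j,u_k)})\big)$ has a format similar to Equation 4.

Proof. By Lemma 3,
\[
\begin{aligned}
E(A^2) = {} & \Big(\sum_{j \in \{1 \ldots m\},\, \{t_i \in R_i \mid i \in \{1:n\}\}} f(u_j(\{t_i\}))\Big)^2 \\
& - 2 \sum_{j \in \{1 \ldots m\},\, \{t_i \in R_i \mid i \in \{1:n\}\}} f(u_j(\{t_i\})) \sum_{k \in \{1 \ldots m\}} p\big(\bar{u}_k(\{t_i\}_1) \wedge \cdots \wedge \bar{u}_k(\{t_i\}_{n_{u_k}})\big)\, f(u_k(\{t_i\})) \\
& + \sum_{j \in \{1 \ldots m\},\, \{t_i \in R_i \mid i \in \{1:n\}\}} \sum_{k \in \{1 \ldots m\}} \Big[ p\big(\bar{u}_j(\{t_i\}_1) \wedge \cdots \wedge \bar{u}_k(\{t_i\}_{g(u_j,u_k)})\big)\, p\big(\bar{u}_j(\{t_i\}_{g(u_j,u_k)+1}) \wedge \cdots \wedge \bar{u}_j(\{t_i\}_{n_{u_j}})\big) \\
& \qquad \cdot p\big(\bar{u}_k(\{t_i\}_{g(u_j,u_k)+1}) \wedge \cdots \wedge \bar{u}_k(\{t_i\}_{n_{u_k}})\big)\, f(u_j(\{t_i\}))\, f(u_k(\{t_i\})) \Big]
\end{aligned}
\]
Since
\[ E(A) = \sum_{j \in \{1 \ldots m\},\, \{t_i \in R_i \mid i \in \{1:n\}\}} f(u_j(\{t_i\}))\, p(u_j(\{t_i\})) = \sum_{j \in \{1 \ldots m\},\, \{t_i \in R_i \mid i \in \{1:n\}\}} \Big(1 - p\big(\bar{u}_j(\{t_i\}_1) \wedge \cdots \wedge \bar{u}_j(\{t_i\}_{n_{u_j}})\big)\Big) f(u_j(\{t_i\})) \]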

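Then
\[
\begin{aligned}
E^2(A) = {} & \Big[\sum_{j \in \{1 \ldots m\},\, \{t_i \in R_i \mid i \in \{1:n\}\}} f(u_j(\{t_i\})) - \sum_{j \in \{1 \ldots m\},\, \{t_i \in R_i \mid i \in \{1:n\}\}} p\big(\bar{u}_j(\{t_i\}_1) \wedge \cdots \wedge \bar{u}_j(\{t_i\}_{n_{u_j}})\big)\, f(u_j(\{t_i\}))\Big]^2 \\
= {} & \Big(\sum_{j \in \{1 \ldots m\},\, \{t_i \in R_i \mid i \in \{1:n\}\}} f(u_j(\{t_i\}))\Big)^2 \\
& - 2 \sum_{j \in \{1 \ldots m\},\, \{t_i \in R_i \mid i \in \{1:n\}\}} f(u_j(\{t_i\})) \sum_{k \in \{1 \ldots m\}} p\big(\bar{u}_k(\{t_i\}_1) \wedge \cdots \wedge \bar{u}_k(\{t_i\}_{n_{u_k}})\big)\, f(u_k(\{t_i\})) \\
& + \Big(\sum_{j \in \{1 \ldots m\},\, \{t_i \in R_i \mid i \in \{1:n\}\}} p\big(\bar{u}_j(\{t_i\}_1) \wedge \cdots \wedge \bar{u}_j(\{t_i\}_{n_{u_j}})\big)\, f(u_j(\{t_i\}))\Big)^2
\end{aligned}
\]
Thus, the variance is
\[
\begin{aligned}
Var(A) = {} & E(A^2) - E^2(A) \\
= {} & \sum_{j \in \{1 \ldots m\},\, \{t_i \in R_i \mid i \in \{1:n\}\}} \sum_{k \in \{1 \ldots m\}} \Big[ p\big(\bar{u}_j(\{t_i\}_1) \wedge \cdots \wedge \bar{u}_k(\{t_i\}_{g(u_j,u_k)})\big)\, p\big(\bar{u}_j(\{t_i\}_{g(u_j,u_k)+1}) \wedge \cdots \wedge \bar{u}_j(\{t_i\}_{n_{u_j}})\big) \\
& \qquad \cdot p\big(\bar{u}_k(\{t_i\}_{g(u_j,u_k)+1}) \wedge \cdots \wedge \bar{u}_k(\{t_i\}_{n_{u_k}})\big)\, f(u_j(\{t_i\}))\, f(u_k(\{t_i\})) \Big] \\
& - \Big[\sum_{j \in \{1 \ldots m\},\, \{t_i \in R_i \mid i \in \{1:n\}\}} p\big(\bar{u}_j(\{t_i\}_1) \wedge \cdots \wedge \bar{u}_j(\{t_i\}_{n_{u_j}})\big)\, f(u_j(\{t_i\}))\Big]^2 \\
= {} & \sum_{j \in \{1 \ldots m\},\, \{t_i \in R_i \mid i \in \{1:n\}\}} \sum_{k \in \{1 \ldots m\}} \Big[ \Big( p\big(\bar{u}_j(\{t_i\}_1) \wedge \cdots \wedge \bar{u}_k(\{t_i\}_{g(u_j,u_k)})\big) \\
& \qquad - p\big(\bar{u}_j(\{t_i\}_1) \wedge \cdots \wedge \bar{u}_j(\{t_i\}_{g(u_j,u_k)})\big)\, p\big(\bar{u}_k(\{t_i\}_1) \wedge \cdots \wedge \bar{u}_k(\{t_i\}_{g(u_j,u_k)})\big) \Big) \\
& \qquad \cdot p\big(\bar{u}_j(\{t_i\}_{g(u_j,u_k)+1}) \wedge \cdots \wedge \bar{u}_j(\{t_i\}_{n_{u_j}})\big)\, p\big(\bar{u}_k(\{t_i\}_{g(u_j,u_k)+1}) \wedge \cdots \wedge \bar{u}_k(\{t_i\}_{n_{u_k}})\big)\, f(u_j(\{t_i\}))\, f(u_k(\{t_i\})) \Big]
\end{aligned}
\]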
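As a sanity check on small inputs, the moments of the COUNT(DISTINCT customer) aggregate of Example 2 can be computed by enumerating every possible world. The sketch below is only an illustration: the numeric probabilities are invented, the price column of R2 is dropped because COUNT does not use it, and the closed form for the joint probability of the two customers is derived here for this particular example rather than taken from the text.

```python
from itertools import product

# (key, product, probability); probabilities are hypothetical stand-ins
# for p11..p14 and p21..p24 of Tables 4-1 and 4-2.
R1 = [("c1", "a", 0.9), ("c1", "b", 0.5), ("c2", "a", 0.7), ("c2", "d", 0.6)]
R2 = [("m1", "a", 0.8), ("m2", "b", 0.4), ("m3", "a", 0.3), ("m4", "c", 0.5)]


def count_distinct_customers(r1_present, r2_present):
    """COUNT(DISTINCT customer) of the join in one possible world."""
    products = {prod for (_, prod, _), keep in zip(R2, r2_present) if keep}
    return len({cust for (cust, prod, _), keep in zip(R1, r1_present)
                if keep and prod in products})


def moments_by_enumeration():
    """E(A) and Var(A) obtained by summing over all 2^8 possible worlds."""
    tuples = R1 + R2
    first = second = 0.0
    for world in product([False, True], repeat=len(tuples)):
        weight = 1.0
        for (_, _, prob), keep in zip(tuples, world):
            weight *= prob if keep else 1.0 - prob
        a = count_distinct_customers(world[:len(R1)], world[len(R1):])
        first += weight * a
        second += weight * a * a
    return first, second - first ** 2


def closed_form():
    """Moments from the group probabilities of Table 4-3 plus a joint term."""
    p11, p12, p13, _ = (t[2] for t in R1)
    p21, p22, p23, _ = (t[2] for t in R2)
    some_a = 1 - (1 - p21) * (1 - p23)   # some manufacturer of product a
    p_c1 = 1 - (1 - p11 * some_a) * (1 - p12 * p22)
    p_c2 = p13 * some_a
    # Joint probability, derived for this example: c2 forces product a to
    # be available, after which c1 needs its a-tuple or its b-branch.
    p_both = p13 * some_a * (1 - (1 - p11) * (1 - p12 * p22))
    mean = p_c1 + p_c2
    var = p_c1 * (1 - p_c1) + p_c2 * (1 - p_c2) + 2 * (p_both - p_c1 * p_c2)
    return mean, var


enumerated = moments_by_enumeration()
formula = closed_form()
```

Enumeration is exponential in the number of tuples and only serves as a reference here; the covariance term `p_both - p_c1 * p_c2` is nonzero because both customers depend on the same R2 tuples, which is exactly the correlation the theorem has to account for.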

Theorem 4.1 presents a formula to compute the moments of probabilistic aggregates with duplicate removal, given that $g(u_j, u_k)$ is known. It is a general theorem that does not specify the correlations. The expected value is related to the group probability and can be computed by probability evaluation of safe queries [18]. The variance requires the pairwise group relationships, and the correlation relationships may vary among different groups. The groups may be highly correlated or independent, depending on the value of $g(u_j, u_k)$. If the value of $g(u_j, u_k)$ is large, groups are highly correlated. If $g(u_j, u_k) = 0$, groups are independent. We now have the formula for the variance. Our next task is to implement this formula, which requires the relationships of pairwise groups and the values of $g(u_j, u_k)$.

4.4 Hash-based Algorithm

In the previous section, the formula for the variance of aggregation over distinct groups was presented. Before we implement this formula, we analyze it first. The formula requires exact correlation information: which tuples in which groups each tuple is correlated with. Only after obtaining all the correlated tuples between groups can the probability of pairwise groups be computed. More specifically, the tuples in Theorem 4.1 are divided into two parts: group independent tuples and group correlated tuples. Group independent tuples are independent of all the tuples in the group. Group correlated tuples have correlations with some tuples in the groups. Our algorithm should be able to distinguish these two parts. Thus cross-product comparison of tuples within pairwise groups is required. The final probability of groups is the product of the probabilities of the independent parts and the correlated parts. Probability evaluation of independent tuples is straightforward: multiply the complements of the independent tuples. The probability of correlated tuples is computed based on Equation 4, which is a recursive product. The outer part of the recursion comes from tuples in the first order relation, where relations are organized in order for safe queries as in Section 4.3.2. If this recursive formula is expressed by a tree, then the

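top node of the tree represents tuples from the first order relation and the leaf nodes represent tuples of the last order relations. Middle nodes represent tuples from the relations in the middle. The final probability is evaluated from the bottom to the top of the tree. Only after data at the lower levels are computed can the data at the higher levels be computed. A hierarchical structure is needed to express the recursive formula. One straightforward way to compute variances over probabilistic databases is to take advantage of an existing DBMS. The formulas are rewritten as a SQL expression. The comparison of data can be obtained using GROUP BY expressions and the probability evaluation of each subgroup can be implemented using group aggregates. However, representing the hierarchical structure is hard in SQL. In order to simplify the process, a new method is used in our implementation. Before we introduce this method, we need to transform Var(A) first:
\[
\begin{aligned}
Var(A) = {} & \sum_{j \in \{1 \ldots m\},\, \{t_i \in R_i \mid i \in \{1:n\}\}} \sum_{k \in \{1 \ldots m\}} \Big[ \Big( p\big(\bar{u}_j(\{t_i\}_1) \wedge \cdots \wedge \bar{u}_k(\{t_i\}_{g(u_j,u_k)})\big) \\
& \qquad - p\big(\bar{u}_j(\{t_i\}_1) \wedge \cdots \wedge \bar{u}_j(\{t_i\}_{g(u_j,u_k)})\big)\, p\big(\bar{u}_k(\{t_i\}_1) \wedge \cdots \wedge \bar{u}_k(\{t_i\}_{g(u_j,u_k)})\big) \Big) \\
& \qquad \cdot p\big(\bar{u}_j(\{t_i\}_{g(u_j,u_k)+1}) \wedge \cdots \wedge \bar{u}_j(\{t_i\}_{n_{u_j}})\big)\, p\big(\bar{u}_k(\{t_i\}_{g(u_j,u_k)+1}) \wedge \cdots \wedge \bar{u}_k(\{t_i\}_{n_{u_k}})\big)\, f(u_j(\{t_i\}))\, f(u_k(\{t_i\})) \Big] \\
= {} & \sum_{j \in \{1 \ldots m\},\, \{t_i \in R_i \mid i \in \{1:n\}\}} p\big(\bar{u}_j(\{t_i\}_1) \wedge \cdots \wedge \bar{u}_j(\{t_i\}_{n_{u_j}})\big) \sum_{k \in \{1 \ldots m\}} p\big(\bar{u}_k(\{t_i\}_1) \wedge \cdots \wedge \bar{u}_k(\{t_i\}_{n_{u_k}})\big) \\
& \qquad \cdot \frac{p\big(\bar{u}_j(\{t_i\}_1) \wedge \cdots \wedge \bar{u}_k(\{t_i\}_{g(u_j,u_k)})\big)\, f(u_j(\{t_i\}))\, f(u_k(\{t_i\}))}{p\big(\bar{u}_j(\{t_i\}_1) \wedge \cdots \wedge \bar{u}_j(\{t_i\}_{g(u_j,u_k)})\big)\, p\big(\bar{u}_k(\{t_i\}_1) \wedge \cdots \wedge \bar{u}_k(\{t_i\}_{g(u_j,u_k)})\big)} \\
& - \Big[\sum_{j \in \{1 \ldots m\},\, \{t_i \in R_i \mid i \in \{1:n\}\}} p\big(\bar{u}_j(\{t_i\}_1) \wedge \cdots \wedge \bar{u}_j(\{t_i\}_{n_{u_j}})\big)\, f(u_j(\{t_i\}))\Big]^2
\end{aligned}
\]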

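\[
\begin{aligned}
= {} & \sum_{j \in \{1 \ldots m\},\, \{t_i \in R_i \mid i \in \{1:n\}\}} p\big(\bar{u}_j(\{t_i\}_1) \wedge \cdots \wedge \bar{u}_j(\{t_i\}_{n_{u_j}})\big)\, f(u_j(\{t_i\})) \Big[ \sum_{u_k \notin correlated(u_j)} p\big(\bar{u}_k(\{t_i\}_1) \wedge \cdots \wedge \bar{u}_k(\{t_i\}_{n_{u_k}})\big)\, f(u_k(\{t_i\})) \\
& \qquad + \sum_{u_k \in correlated(u_j)} \frac{p\big(\bar{u}_j(\{t_i\}_1) \wedge \cdots \wedge \bar{u}_k(\{t_i\}_{g(u_j,u_k)})\big) \prod_{i=1}^{n_{u_k}} \big(1 - p(u_k(\{t_i\}))\big)\, f(u_k(\{t_i\}))}{p\big(\bar{u}_j(\{t_i\}_1) \wedge \cdots \wedge \bar{u}_j(\{t_i\}_{g(u_j,u_k)})\big)\, p\big(\bar{u}_k(\{t_i\}_1) \wedge \cdots \wedge \bar{u}_k(\{t_i\}_{g(u_j,u_k)})\big)} \Big] \\
& - \Big[\sum_{j \in \{1 \ldots m\},\, \{t_i \in R_i \mid i \in \{1:n\}\}} p\big(\bar{u}_j(\{t_i\}_1) \wedge \cdots \wedge \bar{u}_j(\{t_i\}_{n_{u_j}})\big)\, f(u_j(\{t_i\}))\Big]^2 \\
= {} & \Big[\sum_{j \in \{1 \ldots m\},\, \{t_i \in R_i \mid i \in \{1:n\}\}} p\big(\bar{u}_j(\{t_i\}_1) \wedge \cdots \wedge \bar{u}_j(\{t_i\}_{n_{u_j}})\big)\, f(u_j(\{t_i\}))\Big]^2 \\
& + \sum_{j \in \{1 \ldots m\},\, \{t_i \in R_i \mid i \in \{1:n\}\}} \Big[ p\big(\bar{u}_j(\{t_i\}_1) \wedge \cdots \wedge \bar{u}_j(\{t_i\}_{n_{u_j}})\big)\, f(u_j(\{t_i\})) \\
& \qquad \cdot \sum_{u_k \in correlated(u_j)} \frac{p\big(\bar{u}_j(\{t_i\}_1) \wedge \cdots \wedge \bar{u}_k(\{t_i\}_{g(u_j,u_k)})\big) - \prod_{i=1}^{g(u_j,u_k)} \big(1 - p(u_j(\{t_i\}))\big)\big(1 - p(u_k(\{t_i\}))\big)}{p\big(\bar{u}_j(\{t_i\}_1) \wedge \cdots \wedge \bar{u}_j(\{t_i\}_{g(u_j,u_k)})\big)\, p\big(\bar{u}_k(\{t_i\}_1) \wedge \cdots \wedge \bar{u}_k(\{t_i\}_{g(u_j,u_k)})\big)} \\
& \qquad \cdot p\big(\bar{u}_k(\{t_i\}_1) \wedge \cdots \wedge \bar{u}_k(\{t_i\}_{n_{u_k}})\big)\, f(u_k(\{t_i\})) \Big] \\
& - \Big[\sum_{j \in \{1 \ldots m\},\, \{t_i \in R_i \mid i \in \{1:n\}\}} p\big(\bar{u}_j(\{t_i\}_1) \wedge \cdots \wedge \bar{u}_j(\{t_i\}_{n_{u_j}})\big)\, f(u_j(\{t_i\}))\Big]^2 \\
= {} & \sum_{j \in \{1 \ldots m\},\, \{t_i \in R_i \mid i \in \{1:n\}\}} \Big[ p\big(\bar{u}_j(\{t_i\}_1) \wedge \cdots \wedge \bar{u}_j(\{t_i\}_{n_{u_j}})\big)\, f(u_j(\{t_i\})) \\
& \qquad \cdot \sum_{u_k \in correlated(u_j)} \frac{p\big(\bar{u}_j(\{t_i\}_1) \wedge \cdots \wedge \bar{u}_k(\{t_i\}_{g(u_j,u_k)})\big) - \prod_{i=1}^{g(u_j,u_k)} \big(1 - p(u_j(\{t_i\}))\big)\big(1 - p(u_k(\{t_i\}))\big)}{p\big(\bar{u}_j(\{t_i\}_1) \wedge \cdots \wedge \bar{u}_j(\{t_i\}_{g(u_j,u_k)})\big)\, p\big(\bar{u}_k(\{t_i\}_1) \wedge \cdots \wedge \bar{u}_k(\{t_i\}_{g(u_j,u_k)})\big)} \\
& \qquad \cdot p\big(\bar{u}_k(\{t_i\}_1) \wedge \cdots \wedge \bar{u}_k(\{t_i\}_{n_{u_k}})\big)\, f(u_k(\{t_i\})) \Big] \quad (4)
\end{aligned}
\]
Notice that the formula for the variance is a product of two parts. The outer part is related to distinct groups. The inner part is related to the correlated groups of the outer groups. Thus this formula is restricted to groups and their correlated groups instead of cross-product groups, which decreases the computation, especially when the number of correlated groups is small. We need to separate data into groups, then figure out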

the correlated groups of the corresponding groups. As analyzed, correlation is introduced by one tuple joining multiple tuples scattered in different groups. Correlations have a hierarchical structure in safe queries. If two joined tuples share information from middle ordered relations, then they must share information from top relations. If we group tuples and then track them when they join tuples, we can find the correlated tuples. The key to checking whether two joined tuples are correlated is to check whether they share information from the top relation. When we put this into implementation, hash tables provide a natural way to group data with the same values. After mapping data into hash tables, tuples in the same blocks are correlated. Algorithm 4 presents a sketch of our algorithm using hash tables.

Algorithm 4 ComputeDistVar(aggData)
for tuple t in R do
    Hash t into hash table groupTable by groupID
    Compute the probability of this group
end for
for tuple t in R do
    Hash t into block b of hash table groupTable by rootkeyID
    for tuple t' in block b do
        if groupID(t)

attributes as hash keys and each group is then assigned a groupID. Remember that the key to this algorithm is figuring out whether two groups are correlated. Although two correlated groups may share information from many relations, it is enough to check whether information from the top relation is shared. Thus a hash table keyed on the top hierarchical keys is generated. Data in the same hash blocks are put into the correlated list of the corresponding groups. Each group maintains a correlated list which consists of blocks. Blocks in the correlated list contain two data arrays: data from this group and data from the correlated group. Computation of the probability is based on Equation 4 and implemented by computeCorProb(corrList[gi, gj]). For each group, computeCorProb(corrList[gi, gj]) sorts the group by the attribute keys in hierarchical order and then traverses the data. By comparing one tuple and the next tuple within one group, the shared information can be obtained. If shared, the shared information is extracted when computing the probability. If the probability of a group is unknown, it is computed by one pass over the sorted group and stored in memory. Comparison of tuples in different groups is also required when it comes to computing the probability of pairwise groups.

4.5 Experiments

In this section, we perform extensive experiments on the hash-based algorithm. The main goal of our experiments is to examine the performance of the algorithm and the overhead of computing the variance. The experimental results show that our algorithm performs well and has a small overhead.

4.5.1 Implementation

The configuration of the experimental environment is exactly the same as for the experiments in Chapter 3. All the experiments were carried out on a 4 core, 2.4 GHz CPU, 8 GB RAM machine running Linux. In order to obtain the overhead of computing variances of probabilistic aggregation with duplicate removal, we first executed the queries in a traditional DBMS such as Postgres and recorded the running time. In the

same environment, Algorithm 4 is then executed to compute the expected values and variances of the aggregate queries. The difference in running time is thus the overhead of computing variances. We now discuss our experiments in detail.

TPC-H dataset and queries: Theorem 4 shows that the performance of our algorithm depends not only on the size of the dataset but also on the number of correlated tuples. Experiments on weakly correlated relations are expected to perform better than experiments on fully correlated relations. To examine our algorithm thoroughly, our experiments are carried out on data with varied sizes and varied correlation rates, where the correlation rate is the average number of tuples with which one tuple is correlated. In the original TPC-H data, correlation rates are fixed and the data are certain. Our experimental data are therefore regenerated by modifying the TPC-H code. We first change the correlation rate so that it ranges from 1 to 4. We then draw from a uniform random distribution to assign each tuple a probability. The size of the data in our experiments ranges from 0.1 GB to 0.8 GB.

We run two queries, Q1 and Q2, both based on TPC-H Q16. The first query, Q1, returns the count of distinct parts sold by suppliers who satisfy particular requirements:

  SELECT ps_suppkey, COUNT(DISTINCT(p_partkey)), conf
  FROM partsupp, part
  WHERE p_partkey = ps_partkey
    AND p_brand <> 'Brand#45'
    AND p_type NOT LIKE 'MEDIUM POLISHED'
    AND p_size IN (1, 5, 34, 23, 9, 43, 25, 12)
    AND ps_suppkey NOT IN (358, 2820, 3804, 9504)
  GROUP BY ps_suppkey

Our previous discussion shows that correlation is introduced when one tuple joins multiple tuples at the same time. The more relations are involved in the joins, the more complicated the correlation information becomes. To examine this situation, an aggregation over three relations (Q2)
is performed in our experiments:

  SELECT ps_suppkey, COUNT(DISTINCT(p_partkey)), conf
  FROM partsupp, part, supplier
  WHERE p_partkey = ps_partkey
    AND p_brand <> 'Brand#45'
    AND p_type NOT LIKE 'MEDIUM POLISHED'
    AND p_size IN (1, 5, 34, 23, 9, 43, 25, 12)
    AND ps_suppkey NOT IN (358, 2820, 3804, 9504)
    AND ps_suppkey = s_suppkey
  GROUP BY ps_suppkey

Note that, to isolate the effect of the number of relations, the data aggregated by Q1 and Q2 should remain the same.

4.5.2 Results of Experiments

In our experiments, we run all queries both with and without computing variances. Queries with variances are executed by our system, while queries without variances are executed by Postgres. We measure the running time of each query. In particular, we measure the aggregation time in addition to the total running time in order to obtain the overhead of computing the variance. The experiments are performed on data with different sizes and different correlation rates: the data size varies from 0.1 GB to 0.8 GB, and the correlation rate varies from 1 to 4. Q1 and Q2 are each run on these data. The experimental results are plotted in Figures 4-4 to 4-9. For Q1, Figure 4-4 shows the results when the correlation rate is 1 and the data size varies from 0.1 GB to 0.8 GB, Figure 4-6 shows the results for a correlation rate of 2, and Figure 4-8 shows the results for a correlation rate of 4. To further evaluate the performance of our algorithm, we follow the same experimental steps for Q2; Figures 4-5, 4-7, and 4-9 show the corresponding results.

Figure 4-4. Running time on Q1, correlation rate = 1
Figure 4-5. Running time on Q2, correlation rate = 1
Figure 4-6. Running time on Q1, correlation rate = 2
Figure 4-7. Running time on Q2, correlation rate = 2

In the figures above, blue lines show the running time of queries in a traditional database, purple lines show the running time of probabilistic queries with variances, and green lines show the overhead of computing variances. As the figures show, our algorithm computes variances with a small overhead, especially when the correlation rate is small. The performance of our algorithm varies with the query and the data. The algorithm performs best when the data have a small correlation rate, as in Figure 4-4, and worst when the correlation rate is large and more relations are involved, as in Figure 4-9. We now study the performance in different situations.

Size of data. As shown in Figures 4-4 through 4-9, the running time of computing variances increases linearly with the size of the data for a fixed correlation rate. The total running
time follows the same trend, so the overhead of our algorithm increases linearly as the data size increases linearly. This behavior is expected, since the running time of the algorithm depends on the size of the hash table, which grows linearly with the data.

Figure 4-8. Running time on Q1, correlation rate = 4
Figure 4-9. Running time on Q2, correlation rate = 4

Number of relations. The number of relations plays a part in the performance of the algorithm. Comparing Figure 4-4 with Figure 4-5, Figure 4-6 with Figure 4-7, and Figure 4-8 with Figure 4-9, we observe that performance is better on Q1 than on Q2 even though both aggregate the same number of tuples. Each aggregated tuple is composed of attributes from multiple relations, since the probability attribute and key attribute from each relation are preserved in the aggregated tuples. The preserved attributes are compared one by one to check whether one aggregated tuple is correlated with another. Thus more relations means more comparisons and more time, which explains the behavior observed in the experiments.

Correlation rate. The correlation rate affects the performance of the algorithm. We can observe the varied performance by comparing Figures 4-4, 4-6, and 4-8 for Q1 and Figures 4-5, 4-7, and 4-9 for Q2. The number of correlated tuples increases with the correlation rate, which means more data are involved in the aggregation. Thus, as the correlation rate increases, the running time increases linearly with the size of the hash table. The same trend is observed in the experiments on Q2.

Overhead. The overhead of computing the variance is less than 20 percent when the correlation rate is small (1 or 2). The overhead is largest when the correlation rate is 4, which is expected since more computation is required in this case.

Overall, the performance of the algorithm depends on the size of the data, the number of relations involved, and the correlation rate. In particular, the running time is linear in the size of the aggregated data. The overall overhead is reasonable, especially when the correlation rate
has a small value. We can expect this algorithm to work well in practice, since fully correlated relations are rare in databases.

4.6 Summary

In this chapter, we estimated aggregates over probabilistic databases with duplicate removal. Duplicate removal introduces correlations among tuples and makes the analysis difficult. After analyzing the properties of PTIME queries over probabilistic databases, we concluded that the correlations introduced by projection are hierarchical and can be factorized recursively. We then derived the moments of probabilistic aggregates with duplicate removal and designed a hash-based algorithm to compute them. Our comprehensive experiments show that the algorithm has a small overhead when the correlation rate is small and a reasonable overhead when the correlation rate is large.
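
The grouping idea behind the hash-based algorithm summarized above can be sketched as follows. This is a minimal illustration under stated assumptions, not the dissertation's implementation: the function name `correlated_group_pairs` and the `(group_id, root_key)` tuple layout are hypothetical, and the sketch only reports which groups share a key of the top relation, without computing any probabilities.

```python
from collections import defaultdict
from itertools import combinations


def correlated_group_pairs(tuples):
    """Return the pairs of groups that share at least one root key.

    `tuples` is an iterable of (group_id, root_key) pairs; this layout is
    an assumption made for the example. Two groups are treated as
    correlated when some tuple in each carries the same root key, i.e.
    the same key of the top relation.
    """
    # Hash every tuple into a block keyed by its root key.
    blocks = defaultdict(set)
    for group_id, root_key in tuples:
        blocks[root_key].add(group_id)

    # Groups that meet in the same block are correlated.
    pairs = set()
    for groups in blocks.values():
        for g1, g2 in combinations(sorted(groups), 2):
            pairs.add((g1, g2))
    return pairs
```

Only groups that land in the same block are ever compared, which is why the cost of this step grows with the number of correlated tuples rather than with the number of all group pairs.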

CHAPTER 5
CONCLUSIONS AND FUTURE WORK

In this dissertation, we studied statistical query approximation with confidence intervals: histograms in conventional databases and SUM-like aggregates in probabilistic databases. Our work formulated these query approximation problems in statistical algebra and analyzed them statistically. We inferred query results using well-defined estimators in given statistical models and derived the moments of the estimators, which are used to provide error guarantees for the approximated values by moment-based methods. We made three distinct contributions in this dissertation, summarized below.

5.1 Dissertation Summary

Histograms as statistical estimators for database queries. We statistically analyzed histograms and proposed a statistical assumption: the elements within a bucket are randomly arranged. Under this assumption, the use of histograms does not change; identical formulas can be derived for query approximation. This assumption is less restrictive than the widely accepted uniform frequency assumption and explains why histograms still perform well when the frequencies of the data differ within a bucket. We then provided error guarantees for unidimensional and multidimensional histograms, which demonstrate that histograms perform better than X-sketch and sampling methods when the random shuffling assumption holds and worse otherwise. Finally, we applied the random shuffling assumption to XSketches, which approximate XML queries using histograms, and provided an error analysis for XSketches.

Aggregation over probabilistic databases. Our second contribution in this dissertation is characterizing SUM-like aggregates over probabilistic databases. We proposed a general statistical framework for probabilistic databases. Under this framework, we estimated the value of probabilistic aggregates and provided the
confidence intervals for the estimator by deriving its first two moments. Based on the general formulas, we derived formulas for the moments of aggregate estimators in the tuple-independent model and the graphical model. We then presented InDB and Middleware algorithms to evaluate the aggregates efficiently. Our prototype implementation with Postgres suggests that our characterization of aggregates incurs little overhead for the tuple-independent model and manageable overhead for the graphical model. We also extended the computation of the moments to AVERAGE-like aggregates and GROUP BY queries to show the usefulness of the proposed methods.

Probabilistic aggregation with duplicate removal. Our last contribution in this dissertation is estimating aggregates over probabilistic databases with duplicate removal. We analyzed the properties of PTIME queries and concluded that the correlations introduced by projections and joins are hierarchical and can be factorized recursively. We derived the statistical moments of the estimators and designed a hash-based algorithm to compute them. Our comprehensive experiments show that the algorithm has a small overhead when the correlation rate is small and a reasonable overhead when the correlation rate is large.

5.2 Future Directions

This dissertation estimates database queries and provides confidence intervals by deriving the first two moments of the estimators. Some general directions for extending this work are:

Higher moments. By extending our method from the first two moments to the first four moments, a significantly better approximation of the confidence intervals can be obtained. We believe the theory we developed can be extended to compute the third and fourth moments of aggregates efficiently.

Combining with sampling methods. In this dissertation, the size of the dataset is 10 GB. When the data size is increased to 1 terabyte, the computing time of the aggregation is
expected to be over 16 minutes. Sampling methods have been proven to be efficient for large volumes of data. Combining our method of computing moments with sampling may provide estimators with error guarantees over probabilistic databases very quickly.
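
As a small illustration of how the first two moments of an estimator yield a confidence interval, the sketch below applies Chebyshev's inequality, which needs only a mean and a variance. The choice of Chebyshev's bound and the function name `chebyshev_interval` are assumptions made for this example; the dissertation refers to moment-based methods in general and may use tighter, distribution-specific bounds.

```python
import math


def chebyshev_interval(mean, variance, confidence=0.95):
    """Distribution-free interval from the first two moments.

    Chebyshev's inequality gives P(|X - mean| >= k * sd) <= 1 / k**2,
    so choosing k = 1 / sqrt(1 - confidence) covers the true value with
    probability at least `confidence`.
    """
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must lie strictly between 0 and 1")
    if variance < 0:
        raise ValueError("variance must be non-negative")
    k = 1.0 / math.sqrt(1.0 - confidence)
    half_width = k * math.sqrt(variance)
    return mean - half_width, mean + half_width
```

For example, `chebyshev_interval(100.0, 25.0, confidence=0.75)` uses k = 2 and a standard deviation of 5. The interval is conservative because it makes no assumption about the shape of the distribution, which is one reason higher moments can tighten it.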

APPENDIX A
GENERAL UNIDIMENSIONAL HISTOGRAMS

In this section we generalize the result in Theorem 2 to the general case in which the buckets might not be aligned. As we showed in Section 2.5.2, if the buckets are aligned, the random shuffling assumption has to hold for only one of the relations. If the buckets are not aligned, the assumption must hold for both relations. To obtain the desired result we first prove:

Lemma 5. Let $I_1,\dots,I_n$ be an arbitrary bucketization of domain $I$. Let $J_1,\dots,J_m$ be another such bucketization. Let $\sigma$ and $\tau$ be independent random permutations of domain $I$ and define $X=\sum_{i\in I} f_{\sigma(i)}\,g_{\tau(i)}$. Then the random variable $X$ has two properties:

\[
E_{\sigma,\tau}[X]=\sum_{l=1}^{n}\sum_{k=1}^{m}\bar f_l\,\bar g_k\,|I_l\cap J_k|
\]

\[
\begin{aligned}
\mathrm{Var}_{\sigma,\tau}(X)={}&\sum_{l=1}^{n}\sum_{k=1}^{m}|I_l\cap J_k|^{2}\left[\frac{\mathrm{SqErr}(F_l)\,\mathrm{SqErr}(G_k)}{|I_l|\,|J_k|\,(|I_l|-1)(|J_k|-1)}-\frac{\bar f_l^{2}\,\mathrm{SqErr}(G_k)}{|J_k|(|J_k|-1)}-\frac{\bar g_k^{2}\,\mathrm{SqErr}(F_l)}{|I_l|(|I_l|-1)}\right]\\
&+\sum_{l=1}^{n}\sum_{k=1}^{m}|I_l\cap J_k|\left[\frac{\mathrm{SqErr}(F_l)\,\mathrm{SqErr}(G_k)}{(|I_l|-1)(|J_k|-1)}+\frac{\mathrm{SqErr}(F_l)}{|I_l|-1}+\frac{\mathrm{SqErr}(G_k)}{|J_k|-1}\right]
\end{aligned}
\]

Proof. We first have:

\[
E_{\sigma,\tau}[X]=\sum_{i\in I}E_{\sigma,\tau}\!\left[f_{\sigma(i)}\,g_{\tau(i)}\right]=\sum_{i\in I}E_{\sigma}\!\left[f_{\sigma(i)}\right]E_{\tau}\!\left[g_{\tau(i)}\right]
\]

where we used the linearity of expectation and the independence of $\sigma$ and $\tau$. If $i\in I_l$ for some $l$, then $E_{\sigma}[f_{\sigma(i)}]=\bar f_l$. To incorporate this condition into the equations we use the indicator function $I(\cdot)$ to get $E_{\sigma}[f_{\sigma(i)}]=\sum_{l=1}^{n}I(i\in I_l)\,\bar f_l$. Indeed, the contribution to the sum is nonzero only for the $l$ such that $i\in I_l$. Similarly, we have $E_{\tau}[g_{\tau(i)}]=\sum_{k=1}^{m}I(i\in J_k)\,\bar g_k$. With
this:

\[
E_{\sigma,\tau}[X]=\sum_{i\in I}\sum_{l=1}^{n}I(i\in I_l)\,\bar f_l\sum_{k=1}^{m}I(i\in J_k)\,\bar g_k
=\sum_{l=1}^{n}\sum_{k=1}^{m}\bar f_l\,\bar g_k\sum_{i\in I}I(i\in I_l\wedge i\in J_k)
=\sum_{l=1}^{n}\sum_{k=1}^{m}\bar f_l\,\bar g_k\,|I_l\cap J_k|
\]

where we used the fact that $\sum_{i\in I}I(C)$ is the number of elements of $I$ satisfying the condition $C$, which here is the number of elements in both $I_l$ and $J_k$. For the second moment,

\[
E_{\sigma,\tau}\!\left[X^{2}\right]=E_{\sigma,\tau}\!\left[\Big(\sum_{i\in I}f_{\sigma(i)}\,g_{\tau(i)}\Big)^{2}\right]=\sum_{i\in I}\sum_{i'\in I}E_{\sigma}\!\left[f_{\sigma(i)}f_{\sigma(i')}\right]E_{\tau}\!\left[g_{\tau(i)}g_{\tau(i')}\right]
\]

For $i$ and $i'$ within the bucket $I_l$, we can write

\[
E_{\sigma}\!\left[f_{\sigma(i)}f_{\sigma(i')}\right]=\left(\frac{\bar f_l^{2}\,|I_l|}{|I_l|-1}-\frac{\mathrm{SJ}(F_l)}{|I_l|(|I_l|-1)}\right)(1-\delta_{ii'})+\frac{\mathrm{SJ}(F_l)}{|I_l|}\,\delta_{ii'}
\]

with $\delta_{ii'}$ the Kronecker symbol that takes value 1 if $i=i'$ and 0 otherwise. Now, using the same technique as before to avoid guessing the bucket in which $i$ and $i'$ have to be, we can write

\[
E_{\sigma}\!\left[f_{\sigma(i)}f_{\sigma(i')}\right]=\sum_{l=1}^{n}I(i\in I_l)\,I(i'\in I_l)\left[\left(\frac{\bar f_l^{2}\,|I_l|}{|I_l|-1}-\frac{\mathrm{SJ}(F_l)}{|I_l|(|I_l|-1)}\right)(1-\delta_{ii'})+\frac{\mathrm{SJ}(F_l)}{|I_l|}\,\delta_{ii'}\right]
\]

Similarly we have

\[
E_{\tau}\!\left[g_{\tau(i)}g_{\tau(i')}\right]=\sum_{k=1}^{m}I(i\in J_k)\,I(i'\in J_k)\left[\left(\frac{\bar g_k^{2}\,|J_k|}{|J_k|-1}-\frac{\mathrm{SJ}(G_k)}{|J_k|(|J_k|-1)}\right)(1-\delta_{ii'})+\frac{\mathrm{SJ}(G_k)}{|J_k|}\,\delta_{ii'}\right]
\]

Using the fact that

\[
\mathrm{Var}_{\sigma,\tau}(X)=E_{\sigma,\tau}\!\left[X^{2}\right]-\left(E_{\sigma,\tau}[X]\right)^{2}
\]
and substituting back into the expression for $\mathrm{Var}_{\sigma,\tau}(X)$, we get

\[
\begin{aligned}
\mathrm{Var}_{\sigma,\tau}(X)={}&\sum_{l=1}^{n}\sum_{k=1}^{m}\sum_{i\in I}\sum_{i'\in I}I(i\in I_l)\,I(i'\in I_l)\,I(i\in J_k)\,I(i'\in J_k)\left(\frac{\bar f_l^{2}\,|I_l|}{|I_l|-1}-\frac{\mathrm{SJ}(F_l)}{|I_l|(|I_l|-1)}\right)\left(\frac{\bar g_k^{2}\,|J_k|}{|J_k|-1}-\frac{\mathrm{SJ}(G_k)}{|J_k|(|J_k|-1)}\right)\\
&+\sum_{l=1}^{n}\sum_{k=1}^{m}\sum_{i\in I}\sum_{i'\in I}\Bigg[I(i\in I_l)\,I(i'\in I_l)\,I(i\in J_k)\,I(i'\in J_k)\Bigg(\left(\frac{\mathrm{SJ}(F_l)}{|I_l|}-\frac{\bar f_l^{2}\,|I_l|}{|I_l|-1}+\frac{\mathrm{SJ}(F_l)}{|I_l|(|I_l|-1)}\right)\left(\frac{\mathrm{SJ}(G_k)}{|J_k|}-\frac{\bar g_k^{2}\,|J_k|}{|J_k|-1}+\frac{\mathrm{SJ}(G_k)}{|J_k|(|J_k|-1)}\right)\\
&\qquad+\left(\frac{\mathrm{SJ}(F_l)}{|I_l|}-\frac{\bar f_l^{2}\,|I_l|}{|I_l|-1}+\frac{\mathrm{SJ}(F_l)}{|I_l|(|I_l|-1)}\right)+\left(\frac{\mathrm{SJ}(G_k)}{|J_k|}-\frac{\bar g_k^{2}\,|J_k|}{|J_k|-1}+\frac{\mathrm{SJ}(G_k)}{|J_k|(|J_k|-1)}\right)\Bigg)\Bigg]\delta_{ii'}\\
&-\left(\sum_{l=1}^{n}\sum_{k=1}^{m}\bar f_l\,\bar g_k\,|I_l\cap J_k|\right)^{2}\\
={}&\sum_{l=1}^{n}\sum_{k=1}^{m}|I_l\cap J_k|^{2}\left[\frac{\mathrm{SqErr}(F_l)\,\mathrm{SqErr}(G_k)}{|I_l|\,|J_k|\,(|I_l|-1)(|J_k|-1)}-\frac{\bar f_l^{2}\,\mathrm{SqErr}(G_k)}{|J_k|(|J_k|-1)}-\frac{\bar g_k^{2}\,\mathrm{SqErr}(F_l)}{|I_l|(|I_l|-1)}\right]\\
&+\sum_{l=1}^{n}\sum_{k=1}^{m}|I_l\cap J_k|\left[\frac{\mathrm{SqErr}(F_l)\,\mathrm{SqErr}(G_k)}{(|I_l|-1)(|J_k|-1)}+\frac{\mathrm{SqErr}(F_l)}{|I_l|-1}+\frac{\mathrm{SqErr}(G_k)}{|J_k|-1}\right]
\end{aligned}
\]

where we used the fact that $I(C)^{2}=I(C)$ irrespective of the condition $C$ and that $\sum_{i\in I}I(i\in I_l)\,I(i\in J_k)=|I_l\cap J_k|$.

We now have the main result:

Theorem 7. Let $F$ and $G$ be two relations. The histogram estimate $\sum_{l=1}^{n}\sum_{k=1}^{m}\bar f_l\,\bar g_k\,|I_l\cap J_k|$ of $|F\bowtie_A G|$, using histograms with buckets $I_1,\dots,I_n$ for $F$ and $J_1,\dots,J_m$ for $G$, is an unbiased estimator when the random shuffling assumption holds for both relations within each bucket, and it has squared error:

\[
\begin{aligned}
&\sum_{l=1}^{n}\sum_{k=1}^{m}|I_l\cap J_k|^{2}\left[\frac{\mathrm{SqErr}(F_l)\,\mathrm{SqErr}(G_k)}{|I_l|\,|J_k|\,(|I_l|-1)(|J_k|-1)}-\frac{\bar f_l^{2}\,\mathrm{SqErr}(G_k)}{|J_k|(|J_k|-1)}-\frac{\bar g_k^{2}\,\mathrm{SqErr}(F_l)}{|I_l|(|I_l|-1)}\right]\\
&+\sum_{l=1}^{n}\sum_{k=1}^{m}|I_l\cap J_k|\left[\frac{\mathrm{SqErr}(F_l)\,\mathrm{SqErr}(G_k)}{(|I_l|-1)(|J_k|-1)}+\frac{\mathrm{SqErr}(F_l)}{|I_l|-1}+\frac{\mathrm{SqErr}(G_k)}{|J_k|-1}\right]
\end{aligned}
\]

Proof. Follows directly from the proof of Lemma 5.
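
The point estimate in Theorem 7 can be computed directly from the two bucketizations. The sketch below is illustrative only: the function name `histogram_join_estimate` and the representation of relations as frequency lists indexed by domain value, with buckets given as collections of domain values, are assumptions made for the example.

```python
def histogram_join_estimate(f, g, buckets_f, buckets_g):
    """Estimate the join size sum_i f[i] * g[i] from bucket averages.

    f, g       -- frequencies indexed by domain value (assumed layout)
    buckets_f  -- buckets I_1..I_n for F, each a collection of domain values
    buckets_g  -- buckets J_1..J_m for G, which need not align with buckets_f
    """
    total = 0.0
    for bucket_f in buckets_f:
        f_bar = sum(f[i] for i in bucket_f) / len(bucket_f)
        for bucket_g in buckets_g:
            overlap = len(set(bucket_f) & set(bucket_g))
            if overlap == 0:
                continue  # disjoint buckets contribute nothing
            g_bar = sum(g[i] for i in bucket_g) / len(bucket_g)
            total += f_bar * g_bar * overlap
    return total
```

With `f = [1, 3, 2, 2]`, `g = [4, 0, 3, 3]`, buckets `[[0, 1], [2, 3]]` for F and `[[0], [1, 2, 3]]` for G, the bucket averages are 2 and 2 for F and 4 and 2 for G, so the estimate sums the three non-empty overlaps. The exact join size for this data is 16; the estimate is unbiased only on average over the random shuffling, not for each individual arrangement.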

APPENDIX B
PROOF OF PROPOSITION 3

Lemma 4. Let $F$ and $G$ be arbitrary relations with frequencies $f_i$ and $g_i$ for $i\in I$, where $I$ is the domain of the join attribute $A$. Taking $N=|I|$ and $N'=|I'|$ and setting the random variable $X$ to be $X=\frac{N}{N'}\sum_{i\in I'}f_i\,g_i$, the sample estimate of the size of the join, we have:

\[
E[X]=\sum_{i\in I}f_i\,g_i
\]

\[
\mathrm{Var}[X]=\frac{N-N'}{N'(N-1)}\left[N\sum_{i\in I}f_i^{2}g_i^{2}-\Big(\sum_{i\in I}f_i\,g_i\Big)^{2}\right]
\]

Proof. To simplify the analysis, we introduce the random variables $Y_i$ that take value 1 if $i\in I'$ and 0 otherwise. If $I'$ is a random subset of $I$ and $i,i'$ are arbitrary elements of $I$, then

\[
\begin{aligned}
E[Y_i]&=P[i\in I']=\frac{N'}{N}\\
E\!\left[Y_i^{2}\right]&=P[i\in I']=\frac{N'}{N}\\
\mathrm{Var}[Y_i]&=E\!\left[Y_i^{2}\right]-E[Y_i]^{2}=\frac{N'}{N}\cdot\frac{N-N'}{N}\\
i\neq i':\quad E[Y_iY_{i'}]&=P[i\in I'\wedge i'\in I']=P[i\in I']\,P[i'\in I'\mid i\in I']=\frac{N'}{N}\cdot\frac{N'-1}{N-1}\\
i\neq i':\quad \mathrm{Cov}(Y_i,Y_{i'})&=E[Y_iY_{i'}]-E[Y_i]\,E[Y_{i'}]=-\frac{N'(N-N')}{N^{2}(N-1)}\\
\mathrm{Cov}(Y_i,Y_{i'})&=\frac{N'(N-N')}{N(N-1)}\,\delta_{ii'}-\frac{N'(N-N')}{N^{2}(N-1)}
\end{aligned}
\]

Using the random variables $Y_i$ we can rewrite $X$ as:

\[
X=\frac{N}{N'}\sum_{i\in I}Y_i\,f_i\,g_i
\]

With this, we have

\[
E[X]=\frac{N}{N'}\sum_{i\in I}E[Y_i]\,f_i\,g_i=\sum_{i\in I}f_i\,g_i
\]

\[
\begin{aligned}
\mathrm{Var}[X]&=\frac{N^{2}}{N'^{2}}\sum_{i\in I}\sum_{i'\in I}f_i\,g_i\,f_{i'}\,g_{i'}\,\mathrm{Cov}(Y_i,Y_{i'})\\
&=\frac{N^{2}}{N'^{2}}\left[\frac{N'(N-N')}{N(N-1)}\sum_{i\in I}f_i^{2}g_i^{2}-\frac{N'(N-N')}{N^{2}(N-1)}\sum_{i\in I}\sum_{i'\in I}f_i\,g_i\,f_{i'}\,g_{i'}\right]\\
&=\frac{N-N'}{N'(N-1)}\left[N\sum_{i\in I}f_i^{2}g_i^{2}-\Big(\sum_{i\in I}f_i\,g_i\Big)^{2}\right]
\end{aligned}
\]

With this result we can now estimate the error that sample counting makes when the random shuffling assumption holds. Since, according to Lemma 4, the sample counting method is unbiased irrespective of the problem, we have $\mathrm{Err}(X)=\mathrm{Var}[X]$. Taking the expectation over the random permutation $\sigma$ of $G$:

\[
\begin{aligned}
E[\mathrm{Err}(X)]&=\frac{N-N'}{N'(N-1)}\left[N\sum_{i\in I}f_i^{2}\,E\!\left[g_{\sigma(i)}^{2}\right]-E\!\left[\Big(\sum_{i\in I}f_i\,g_{\sigma(i)}\Big)^{2}\right]\right]\\
&=\frac{N-N'}{N'(N-1)}\left[\mathrm{SJ}(F)\,\mathrm{SJ}(G)-\mathrm{Var}\Big(\sum_{i\in I}f_i\,g_{\sigma(i)}\Big)-E\!\left[\sum_{i\in I}f_i\,g_{\sigma(i)}\right]^{2}\right]\\
&=\frac{N-N'}{N'(N-1)}\left[\mathrm{SJ}(F)\,\mathrm{SJ}(G)-\frac{1}{N-1}\,\mathrm{SqErr}(F)\,\mathrm{SqErr}(G)-N^{2}\bar f^{2}\bar g^{2}\right]
\end{aligned}
\]
where we used the fact that $\mathrm{Var}[X]=E\!\left[X^{2}\right]-E[X]^{2}$. To get the required inequality we observe that:

\[
\begin{aligned}
\mathrm{SJ}(F)\,\mathrm{SJ}(G)-N^{2}\bar f^{2}\bar g^{2}&=\left(\mathrm{SJ}(F)-N\bar f^{2}\right)\left(\mathrm{SJ}(G)-N\bar g^{2}\right)+N\bar f^{2}\left(\mathrm{SJ}(G)-N\bar g^{2}\right)+N\bar g^{2}\left(\mathrm{SJ}(F)-N\bar f^{2}\right)\\
&=\mathrm{SqErr}(F)\,\mathrm{SqErr}(G)+N\bar f^{2}\,\mathrm{SqErr}(G)+N\bar g^{2}\,\mathrm{SqErr}(F)\\
&\geq\mathrm{SqErr}(F)\,\mathrm{SqErr}(G)
\end{aligned}
\]
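
The mean and variance stated in Lemma 4 can be checked by brute force on a small domain. The sketch below enumerates every sample $I'$ of a fixed size and compares the exact moments of the sample estimate with the closed form; the function names and the use of integer frequency lists are assumptions made for this example.

```python
from fractions import Fraction
from itertools import combinations


def sample_join_moments(f, g, n_sample):
    """Exact mean and variance of X = (N / N') * sum_{i in I'} f[i] * g[i],
    taken over all subsets I' of size n_sample (each equally likely)."""
    n = len(f)
    products = [Fraction(a * b) for a, b in zip(f, g)]
    values = [
        Fraction(n, n_sample) * sum(products[i] for i in subset)
        for subset in combinations(range(n), n_sample)
    ]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance


def lemma4_variance(f, g, n_sample):
    """Closed-form variance from Lemma 4."""
    n = len(f)
    sum_fg = sum(a * b for a, b in zip(f, g))
    sum_fg_sq = sum((a * b) ** 2 for a, b in zip(f, g))
    return Fraction(n - n_sample, n_sample * (n - 1)) * (n * sum_fg_sq - sum_fg ** 2)
```

Exact rational arithmetic is used so that the enumeration and the closed form can be compared with `==` rather than with a floating-point tolerance. The enumeration is exponential in the domain size and is meant only as a sanity check.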

APPENDIX C
PROOF OF PROPOSITION 4

The proof follows directly from the unbiasedness of the estimator and from substituting $f_i$ for $g_i$ in Lemma 4.

APPENDIX D
PROOF OF THEOREM 3

Within bucket $I_i$, the average frequencies of relations $F$ and $G$ are $\sum_{j\in I_i}f_j/N_i$ and $\sum_{j\in I_i}g_j/N_i$, respectively. Since the histogram estimate is the sum over all buckets of the product of the bucket size and the two average frequencies, $X$ is indeed the histogram estimate. To analyze $X$, we introduce random variables $Y_{ij}$ that take value 1 if $j\in I_i$ and 0 otherwise. $Y_{ij}$ has the following two properties:

\[
E[Y_{ij}]=E\!\left[Y_{ij}^{2}\right]=P[j\in I_i]=\frac{N_i}{N}
\]

\[
j\neq j':\quad E[Y_{ij}Y_{ij'}]=P[j\in I_i\wedge j'\in I_i]=P[j\in I_i]\,P[j'\in I_i\mid j\in I_i]=\frac{N_i}{N}\cdot\frac{N_i-1}{N-1}
\]

Using the random variables $Y_{ij}$ we can rewrite $X$ as:

\[
X=\sum_{i=1}^{n}N_i\,\frac{\sum_{j\in I}Y_{ij}\,f_j}{N_i}\cdot\frac{\sum_{j\in I}Y_{ij}\,g_j}{N_i}
\]

With this, we have:

\[
\begin{aligned}
E[X]&=\sum_{i=1}^{n}\frac{1}{N_i}\sum_{j\in I}\sum_{j'\in I}f_j\,g_{j'}\,E[Y_{ij}Y_{ij'}]\\
&=\sum_{i=1}^{n}\frac{1}{N_i}\left[\sum_{j\in I}\sum_{j'\in I,\,j'\neq j}f_j\,g_{j'}\,\frac{N_i}{N}\cdot\frac{N_i-1}{N-1}+\sum_{j\in I}f_j\,g_j\,\frac{N_i}{N}\right]\\
&=\sum_{i=1}^{n}\frac{1}{N}\left[\sum_{j\in I}\sum_{j'\in I}f_j\,g_{j'}\,\frac{N_i-1}{N-1}+\sum_{j\in I}f_j\,g_j\left(1-\frac{N_i-1}{N-1}\right)\right]\\
&=\frac{N-n}{N(N-1)}\sum_{j\in I}f_j\sum_{j\in I}g_j+\frac{n-1}{N-1}\sum_{j\in I}f_j\,g_j
\end{aligned}
\]
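
The closed form for $E[X]$ above can be checked numerically on a small domain. The sketch below assumes, as the derivation does, that each bucket is a uniformly random subset of the domain with a fixed size; it averages the histogram estimate over every assignment of domain values to buckets and compares the result with the final expression. The function names and the integer frequency lists are assumptions made for this example.

```python
from fractions import Fraction
from itertools import permutations


def expected_histogram_estimate(f, g, sizes):
    """Average of the histogram estimate over all assignments of domain
    values to buckets with the given sizes (each assignment equally likely)."""
    n_values = len(f)
    if sum(sizes) != n_values:
        raise ValueError("bucket sizes must sum to the domain size")
    total = Fraction(0)
    count = 0
    for perm in permutations(range(n_values)):
        start = 0
        for size in sizes:
            bucket = perm[start:start + size]
            # N_i * (sum f / N_i) * (sum g / N_i) = (sum f) * (sum g) / N_i
            total += Fraction(sum(f[j] for j in bucket) * sum(g[j] for j in bucket), size)
            start += size
        count += 1
    return total / count


def theorem3_expectation(f, g, n_buckets):
    """Closed form: (N-n)/(N(N-1)) * sum(f) * sum(g) + (n-1)/(N-1) * sum(f*g)."""
    n_values = len(f)
    cross = sum(f) * sum(g)
    join = sum(a * b for a, b in zip(f, g))
    return (Fraction(n_values - n_buckets, n_values * (n_values - 1)) * cross
            + Fraction(n_buckets - 1, n_values - 1) * join)
```

Note that the closed form depends only on the number of buckets $n$, not on the individual bucket sizes, so bucket sizes `(2, 2)` and `(1, 3)` give the same expectation on the same data.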

APPENDIXEPROOFOFTHEOREM 5 WedeneanotherrandomvariableoverthespaceofpermutationasY=Pi2I1T1[ j].BasedonProposition 2 ,forasingleXMLnodeY,wehaveE[Y]= tVar(Y)=SqErr(t) NTherefore,E[X]=E[E[XjY]]=E[E[[ 2]=T1[ 1]]jT2=T1[ 1])]=rXj2I2I(j) t1 M=r t1 M=r t MWhereE(T2[ 2]=T1[ 1]jT2=T1[ 1])=rE([ 2]=T1[ 1]jT1[ 1])isused.I(j)isanidentityfunction,thattakes1ifvaluepredictionistrueand0otherwise.Similarly,Var(X)=E[Var(XjY)]+Var(E[XjY])=E[Var(T1[ 1]=T2[ 2]jT1[ 1])]+Var(E[T1[ 1]=T2[ 2]jT1[ 1]])SinceVar(E(T2[ 2]=T1[ 1]jT1[ 1]))=r2Xj2I2([ j]=T1[ 1]jT1[ 1])]TJ /F5 11.955 Tf 11.96 0 Td[(E[[ 2]=T1[ 1]jT1[ 1]])2=r2(1 M(M)]TJ /F7 11.955 Tf 11.96 0 Td[(1)( t M)2+1 M((M)]TJ /F7 11.955 Tf 11.96 0 Td[(1) t M)2)=r2(M)]TJ /F7 11.955 Tf 11.95 0 Td[(1)( t)2 M2 123

And(E[Var(T2[ 2]=T1[ 1]jT1[ 1])]=r2Xj2I2(SqErr(t) N1 MI(j))=r2SqErr(t) NMTherefore,Var(T2[ 2]=T1[ 1])=r2(M)]TJ /F7 11.955 Tf 11.95 0 Td[(1)( t)2 M2+r2SqErr(t) M 124

APPENDIX F
PROOF OF LEMMA 2

If the probability model is conditionally independent then:

\[
E[A^{2}]=\sum_{S\in\mathcal{P}(n)}\ \sum_{\{t_i\in R_i\mid i\in S\}}\left[\left(\sum_{\{t'_j\in R_j\mid j\in S^{C}\}}c_S\,f(\{t_i,t'_j\})\right)\left(\sum_{t_j\in R_j\mid j\in S^{C}}p(\{t_i,t_j\})\,f(\{t_i,t_j\})\right)\right]
\]

where

\[
c_S=\sum_{u\in\mathcal{P}(S),\,t'_u\in R_u}\ \sum_{t'_k\in R_k\mid k\in S^{C}\cup U^{C}}(-1)^{|U|+|S|}\,p(\{t'_k\mid t'_u\})
\]

To prove the above lemma, we first introduce the Kronecker symbol used in the analysis of complex random variables, for example in [53, 54]. The Kronecker symbol $\delta_{ij}$ takes value 1 if $i=j$ and value 0 if $i\neq j$. That is:

\[
\delta_{ij}=\begin{cases}1 & i=j\\ 0 & i\neq j\end{cases}
\]

Assume that we are given a quantity $Q_{ij}$ that takes value $a$ if $i=j$ and $b$ if $i\neq j$. We can express $Q_{ij}$ in terms of $\delta_{ij}$ by observing that we can multiply $a$ by $\delta_{ij}$ and $b$ by $(1-\delta_{ij})$ and obtain:

\[
Q_{ij}=\begin{cases}a & i=j\\ b & i\neq j\end{cases}=a\,\delta_{ij}+b\,(1-\delta_{ij})=b+(a-b)\,\delta_{ij}
\]

This is because if $i=j$, then $\delta_{ij}=1$, thus $Q_{ij}=b+(a-b)\cdot 1=a$ (and the symmetric argument holds for $i\neq j$). The following simplification rule is useful for removing $\delta_{ij}$ from expressions:

\[
\sum_{i}\sum_{j}\delta_{ij}\,F(i,j)=\sum_{i}F(i,i)
\]
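i.e., the double sum collapses into a single sum because of $\delta_{ij}$. Even if $\delta_{ij}$ is not removed, it is easy to evaluate. We then use conditional independence and the Kronecker symbol to prove Lemma 2:

Proof.

\[
\begin{aligned}
E[A^{2}]={}&\sum_{\{t_i\in R_i,\,t_j\in R_j\mid i,j\in(1,\dots,n)\}}\ \sum_{\{t'_i\in R_i,\,t'_j\in R_j\mid i,j\in(1,\dots,n)\}}p(\{t_i,t_j\})\,p(t'_i,t'_j\mid t_i,t_j)\,f(\{t_i,t_j\})\,f(\{t'_i,t'_j\})\\
={}&\sum_{S\in\mathcal{P}(n)}\ \sum_{\{t_i\in R_i,\,t_j\in R_j\mid i\in S,\,j\in S^{C}\}}\ \sum_{\{t'_i\in R_i,\,t'_j\in R_j\}}\left(\prod_{i\in S}\delta_{t_i,t'_i}\right)\left(\sum_{U\in\mathcal{P}(S^{C})}(-1)^{|U|}\prod_{k\in U,\,t_k,t'_k\in R_k}\delta_{t_k,t'_k}\right)p(\{t_i,t_j\})\,p(\{t'_j\mid t_i\})\,f(\{t_i,t_j\})\,f(\{t'_i,t'_j\})\\
={}&\sum_{S\in\mathcal{P}(n)}\ \sum_{\{t_i\in R_i,\,t_j\in R_j\mid i\in S,\,j\in S^{C}\}}\ \sum_{\{t'_i\in R_i,\,t'_j\in R_j\}}\left(\sum_{U\in\mathcal{P}(S^{C})}(-1)^{|U|}\prod_{k\in S\cup U,\,t_k,t'_k\in R_k}\delta_{t_k,t'_k}\right)p(\{t_i,t_j\})\,p(\{t'_j\mid t_i\})\,f(\{t_i,t_j\})\,f(\{t'_i,t'_j\})\\
={}&\sum_{S\in\mathcal{P}(n)}\ \sum_{\{t_i\in R_i\mid i\in S\}}\left[\left(\sum_{\{t'_j\in R_j\mid j\in S^{C}\}}\left(\sum_{u\in\mathcal{P}(S),\,t'_u\in R_u}\ \sum_{t'_k\in R_k\mid k\in S^{C}\cup U^{C}}(-1)^{|U|+|S|}\,p(\{t'_k\mid t'_u\})\right)f(\{t_i,t'_j\})\right)\sum_{t_j\in R_j\mid j\in S^{C}}p(\{t_i,t_j\})\,f(\{t_i,t_j\})\right]\\
={}&\sum_{S\in\mathcal{P}(n)}\ \sum_{\{t_i\in R_i\mid i\in S\}}\left[\left(\sum_{\{t'_j\in R_j\mid j\in S^{C}\}}c_S\,f(\{t_i,t'_j\})\right)\left(\sum_{t_j\in R_j\mid j\in S^{C}}p(\{t_i,t_j\})\,f(\{t_i,t_j\})\right)\right]
\end{aligned}
\]

In the second equation, we used the property of conditional independence: $p(\{t_i,t_j\}\mid\{t_i,t'_j\})=p(\{t_j\}\mid\{t_i\})$, where $i\in S$ and $j\in S^{C}$. This equation is obtained as follows: $p(\{t_i,t_j\}\mid\{t_i,t'_j\})=p(\{t_i,t_j\}\mid\{t_i\})=p(\{t_j\}\mid\{t_i\})$, where the first equality holds because the conditional probability depends only on the equal tuples in our probabilistic model.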

In the fourth equation, we remove $\delta_{t_k,t'_k}$ using the following property of the Kronecker symbol:

\[
\sum_{i}\sum_{j}\delta_{ij}\,F(i,j)=\sum_{i}F(i,i)
\]

The last thing to note is that $S$ in the third equation and $S$ in the fourth equation are different: $S$ in the fourth equation is equal to the union of $S$ and $U$ of the third equation, so the subscripts change. The subscripts of the conditional probability $p(\{t'_j\mid t_i\})$ change because it is defined on the set $S$, while $i,j$ remain the same because they are defined on the whole tuples.
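
The two Kronecker-symbol facts used in this proof, $Q_{ij}=b+(a-b)\,\delta_{ij}$ and the collapse of a double sum into a single sum, can be checked with a few lines of code. The helper names below are hypothetical and exist only for this illustration.

```python
def kronecker(i, j):
    """Kronecker symbol: 1 if i == j, otherwise 0."""
    return 1 if i == j else 0


def q(i, j, a, b):
    """Quantity that equals a when i == j and b otherwise,
    written as b + (a - b) * delta_ij."""
    return b + (a - b) * kronecker(i, j)


def double_sum_with_delta(func, n):
    """sum_i sum_j delta_ij * func(i, j) over 0 <= i, j < n."""
    return sum(kronecker(i, j) * func(i, j) for i in range(n) for j in range(n))


def collapsed_sum(func, n):
    """sum_i func(i, i) over 0 <= i < n."""
    return sum(func(i, i) for i in range(n))
```

For any function `func` and any `n`, `double_sum_with_delta(func, n)` and `collapsed_sum(func, n)` agree, because every off-diagonal term is multiplied by zero.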

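APPENDIX G
PROOF OF THEOREM 3.3

In a graphical model with $n$ nodes $R_1,\dots,R_n$ and joint probability $p(R_1,\dots,R_n)$, we have:

\[
E(A^{2})=\sum_{S\in\mathcal{P}(n)}\ \sum_{\{t_i\in R_i\mid i\in S\}}\left[\left(\sum_{\{t'_j\in R_j\mid j\in S^{C}\}}\left(\sum_{u\in\mathcal{P}(S),\,t'_u\in R_u}\ \sum_{t'_k\in R_k\mid k\in S^{C}\cup U^{C}}(-1)^{|U|+|S|}\,\frac{p(\{t'_k,t'_u\})}{\sum_{t''_u\in R_u,\,t''_u=t'_u}p(\{t'_k,t''_u\})}\right)f(\{t_i,t'_j\})\right)\sum_{t_j\in R_j\mid j\in S^{C}}p(\{t_i,t_j\})\,f(\{t_i,t_j\})\right]
\]

Proof. Using Lemma 2, we have

\[
\begin{aligned}
E(A^{2})={}&\sum_{S\in\mathcal{P}(n)}\ \sum_{\{t_i\in R_i\mid i\in S\}}\left[\left(\sum_{\{t'_j\in R_j\mid j\in S^{C}\}}\left(\sum_{u\in\mathcal{P}(S),\,t'_u\in R_u}\ \sum_{t'_k\in R_k\mid k\in S^{C}\cup U^{C}}(-1)^{|U|+|S|}\,p(\{t'_k\mid t'_u\})\right)f(\{t_i,t'_j\})\right)\sum_{t_j\in R_j\mid j\in S^{C}}p(\{t_i,t_j\})\,f(\{t_i,t_j\})\right]\\
={}&\sum_{S\in\mathcal{P}(n)}\ \sum_{\{t_i\in R_i\mid i\in S\}}\left[\left(\sum_{\{t'_j\in R_j\mid j\in S^{C}\}}\left(\sum_{u\in\mathcal{P}(S),\,t'_u\in R_u}\ \sum_{t'_k\in R_k\mid k\in S^{C}\cup U^{C}}(-1)^{|U|+|S|}\,\frac{p(\{t'_k,t'_u\})}{p(\{t'_u\})}\right)f(\{t_i,t'_j\})\right)\sum_{t_j\in R_j\mid j\in S^{C}}p(\{t_i,t_j\})\,f(\{t_i,t_j\})\right]\\
={}&\sum_{S\in\mathcal{P}(n)}\ \sum_{\{t_i\in R_i\mid i\in S\}}\left[\left(\sum_{\{t'_j\in R_j\mid j\in S^{C}\}}\left(\sum_{u\in\mathcal{P}(S),\,t'_u\in R_u}\ \sum_{t'_k\in R_k\mid k\in S^{C}\cup U^{C}}(-1)^{|U|+|S|}\,\frac{p(\{t'_k,t'_u\})}{\sum_{t''_u\in R_u,\,t''_u=t'_u}p(\{t'_k,t''_u\})}\right)f(\{t_i,t'_j\})\right)\sum_{t_j\in R_j\mid j\in S^{C}}p(\{t_i,t_j\})\,f(\{t_i,t_j\})\right]
\end{aligned}
\]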

REFERENCES [1] Proceedingsofthe23rdInternationalConferenceonDataEngineering,ICDE2007,April15-20,2007,TheMarmaraHotel,Istanbul,Turkey.IEEE,2007. [2] Abiteboul,Serge,Kanellakis,ParisC.,andGrahne,Gosta.OntheRepresentationandQueryingofSetsofPossibleWorlds.SIGMOD.1987. [3] Aboulnaga,AshrafandChaudhuri,Surajit.Self-tuningHistograms:BuildingHistogramsWithoutLookingatData.SIGMODConference.1999,181. [4] Acharya,Swarup,Gibbons,PhillipB.,Poosala,Viswanath,andRamaswamy,Sridhar.TheAquaApproximateQueryAnsweringSystem.SIGMODConference.1999,574. [5] .JoinSynopsesforApproximateQueryAnswering.SIGMODConference.1999,275. [6] Agrawal,Parag,Benjelloun,Omar,Sarma,AnishDas,Hayworth,Chris,Nabar,ShubhaU.,Sugihara,Tomoe,andWidom,Jennifer.Trio:ASystemforData,Uncertainty,andLineage.VLDB.2006. [7] Alon,Noga,Gibbons,PhillipB.,Matias,Yossi,andSzegedy,Mario.TrackingJoinandSelf-JoinSizesinLimitedStorage.ProceedingsoftheEighteenthACMSIGACT-SIGMOD-SIGARTSymposiumonPrinciplesofDatabaseSystems.Philadeplphia,Pennsylvania,1999. [8] Alon,Noga,Matias,Yossi,andSzegedy,Mario.TheSpaceComplexityofApproximatingtheFrequencyMoments.Proceedingsofthe28thAnnualACMSymposiumontheTheoryofComputing.Philadelphia,Pennsylvania,1996,20. [9] Antova,Lyublena,Koch,Christoph,andOlteanu,Dan.Fromcompletetoincompleteinformationandback.SIGMODConference.2007,713. [10] Barbara,Daniel,Garcia-Molina,Hector,andPorter,Daryl.TheManagementofProbabilisticData.IEEETrans.Knowl.DataEng.4(1992).5:487. [11] Bruno,Nicolas,Chaudhuri,Surajit,andGravano,Luis.STHoles:AMultidimensionalWorkload-AwareHistogram.SIGMODConference.2001,211. [12] Cavallo,RogerandPittarelli,Michael.TheTheoryofProbabilisticDatabases.VLDB'87:Proceedingsofthe13thInternationalConferenceonVeryLargeDataBases.SanFrancisco,CA,USA,1987,71. [13] Chaudhuri,Surajit,Das,Gautam,Datar,Mayur,Motwani,Rajeev,andNarasayya,VivekR.OvercomingLimitationsofSamplingforAggregationQueries.ICDE.IEEEComputerSociety,2001,534. 129

[14] Chaudhuri,Surajit,Motwani,Rajeev,andNarasayya,Vivek.Onrandomsamplingoverjoins.Proceedingsofthe1999ACMSIGMODInternationalConferenceonManagementofData.Philadelphia,Pennsylvania,1999. [15] Cormode,Graham,Deligiannakis,Antonios,Garofalakis,MinosN.,andMcGregor,Andrew.ProbabilisticHistogramsforProbabilisticData.PVLDB2(2009).1:526. [16] Cormode,GrahamandGarofalakis,MinosN.Sketchingprobabilisticdatastreams.SIGMOD.2007. [17] .HistogramsandWaveletsonProbabilisticData.ICDE.2009,293. [18] Dalvi,NileshandSuciu,Dan.EfcientQueryEvaluationonProbabilisticDatabases.VLDB.2004. [19] Dalvi,NileshN.andSuciu,Dan.Thedichotomyofconjunctivequeriesonprobabilisticstructures.PODS.2007,293. [20] .Managementofprobabilisticdata:foundationsandchallenges.PODS.2007. [21] Dechter,Rina.Bucketelimination:Aunifyingframeworkforreasoning.ArticialIntelligence113(1999):41. [22] Dey,DebabrataandSarkar,Sumit.AProbabilisticRelationalModelandAlgebra.ACMTrans.DatabaseSyst.21(1996).3:339. [23] Drukh,Natasha,Polyzotis,Neoklis,Garofalakis,MinosN.,andMatias,Yossi.FractionalXSketchSynopsesforXMLDatabases.XSym.2004,189. [24] E.L.Lehmann.ElementsofLarge-SampleTheory.Springer,1998. [25] Fuhr,NorbertandRolleke,Thomas.AProbabilisticRelationalAlgebrafortheIntegrationofInformationRetrievalandDatabaseSystems.ACMTrans.Inf.Syst.15(1997).1:32. [26] Ganti,Venkatesh,Lee,Mong-Li,andRamakrishnan,Raghu.ICICLES:Self-TuningSamplesforApproximateQueryAnswering.VLDB.eds.AmrElAbbadi,MichaelL.Brodie,SharmaChakravarthy,UmeshwarDayal,NabilKamel,GunterSchlageter,andKyu-YoungWhang.MorganKaufmann,2000,176. [27] Garofalakis,MinosN.andGibbons,PhillipB.ApproximateQueryProcessing:TamingtheTeraBytes.VLDB.2001. [28] Gibbons,PhillipB.DistinctSamplingforHighly-AccurateAnswerstoDistinctValuesQueriesandEventReports.VLDB.eds.PeterM.G.Apers,PaoloAtzeni,StefanoCeri,StefanoParaboschi,KotagiriRamamohanarao,andRichardT.Snodgrass.MorganKaufmann,2001,541. 130

[29] Gibbons,PhillipB.andMatias,Yossi.NewSampling-BasedSummaryStatisticsforImprovingApproximateQueryAnswers.SIGMODConference.1998,331. [30] Gotz,MichaelaandKoch,Christoph.Acompositionalframeworkforcomplexqueriesoveruncertaindata.ICDT.2009,149. [31] Green,ToddJ.,Karvounarakis,Gregory,andTannen,Val.Provenancesemirings.PODS.2007. [32] Guha,Sudipto,Koudas,Nick,andShim,Kyuseok.Approximationandstreamingalgorithmsforhistogramconstructionproblems.ACMTrans.DatabaseSyst.31(2006).1:396. [33] Guha,Sudipto,Koudas,Nick,andSrivastava,Divesh.FastAlgorithmsForHierarchicalRangeHistogramConstruction.PODS.2002,180. [34] Guha,Sudipto,Shim,Kyuseok,andWoo,Jungchul.REHIST:RelativeErrorHistogramConstructionAlgorithms.VLDB.2004,300. [35] Gunopulos,Dimitrios,Kollios,George,Tsotras,VassilisJ.,andDomeniconi,Carlotta.ApproximatingMulti-DimensionalAggregateRangeQueriesoverRealAttributes.SIGMODConference.2000,463. [36] Haas,PeterJ.andHellerstein,JosephM.RippleJoinsforOnlineAggregation.SIGMODConference.1999,287. [37] Haas,PeterJ.,Jermaine,ChristopherM.,Arumugam,Subi,Xu,Fei,Perez,LuisLeopoldo,andJampani,Ravi.MCDB-R:RiskAnalysisintheDatabase.PVLDB3(2010).1:782. [38] Hellerstein,JosephM.,Haas,PeterJ.,andWang,HelenJ.OnlineAggregation.SIGMODConference.1997,171. [39] Hoeffding,WassilyandRobbins,Herbert.Thecentrallimittheoremfordependentrandomvariables.DukeMath.J.15(1948):773. [40] Huang,Jiewen,Antova,Lyublena,Koch,Christoph,andOlteanu,Dan.MayBMS:aprobabilisticdatabasemanagementsystem.SIGMOD.2009. [41] Ioannidis,Yannis.UniversalityofSerialHistograms.Proceedingsofthe19thInternationalConferenceonVeryLargeDataBases.Dublin,Ireland:IEEE,1993. [42] Ioannidis,YannisandChristodulaki,Stavros.OnthePropagationofErrorsintheSizeofJoinResult.Proceedingsofthe1991ACMSIGMODInternationalConferenceonManagementofData.Denver,Colorado,1991. [43] Ioannidis,YannisandChristodulakis,Stavros.OptimalHistogramsforLimitingWorst-CaseErrorPropagationintheSizeofJoinResults.ACMTransactionsonDatabaseSystems18(1993).4:709. 131

[44] Ioannidis,YannisandPoosala,Viswanath.BalancingHistogramOptimalityandPracticalityforQueryResultSizeEstimation.Proceedingsofthe1995ACMSIGMODInternationalConferenceonManagementofData.SanJose,CA,1995. [45] Ioannidis,YannisE.TheHistoryofHistograms(abridged).VLDB.2003,19. [46] Ioannidis,YannisE.andPoosala,Viswanath.Histogram-BasedApproximationofSet-ValuedQuery-Answers.VLDB.1999,174. [47] Jagadish,H.V.,Kapitskaia,Olga,Ng,RaymondT.,andSrivastava,Divesh.Multi-DimensionalSubstringSelectivityEstimation.InProceedingsoftheInternationalConferenceonVeryLargeDatabases.1999,387. [48] Jagadish,H.V.,Koudas,Nick,Muthukrishnan,S.,Poosala,Viswanath,Sevcik,KennethC.,andSuel,Torsten.OptimalHistogramswithQualityGuarantees.VLDB.1998,275. [49] Jampani,Ravi,Xu,Fei,Wu,Mingxi,Perez,LuisLeopoldo,Jermaine,ChristopherM.,andHaas,PeterJ.MCDB:amontecarloapproachtomanaginguncertaindata.SIGMOD.2008. [50] JASON.DataAnalysisChallenges.Tech.rep.,TheMITRECorporation,2008. [51] Jayram,T.S.,Kale,Satyen,andVee,Erik.Efcientaggregationalgorithmsforprobabilisticdata.SODA.2007. [52] Jayram,T.S.,McGregor,Andrew,Muthukrishnan,S.,andVee,Erik.Estimatingstatisticalaggregatesonprobabilisticdatastreams.PODS.2007. [53] Jermaine,Chris,Arumugam,Subramanian,Pol,Abhijit,andDobra,Alin.ScalableapproximatequeryprocessingwiththeDBOengine.ACMTrans.DatabaseSyst.33(2008).4. [54] Jermaine,ChristopherM.,Arumugam,Subramanian,Pol,Abhijit,andDobra,Alin.ScalableapproximatequeryprocessingwiththeDBOengine.SIGMOD.2007. [55] Kanagal,BhargavandDeshpande,Amol.EfcientQueryEvaluationoverTemporallyCorrelatedProbabilisticStreams.ICDE.2009,1315. [56] .Indexingcorrelatedprobabilisticdatabases.SIGMODConference.eds.UgurCetintemel,StanleyB.Zdonik,DonaldKossmann,andNesimeTatbul.ACM,2009,455. [57] .Lineageprocessingovercorrelatedprobabilisticdatabases.SIGMODConference.2010,675. [58] Karp,RichardM.,Luby,Michael,andMadras,Neal.Monte-Carloapproximationalgorithmsforenumerationproblems.JournalofAlgorithms10(1989).3:429448. 132

[59] Kaushik,RaghavandSuciu,Dan.Consistenthistogramsinthepresenceofdistinctvaluecounts.Proc.VLDBEndow.2(2009).1:850. [60] Kennedy,OliverandKoch,Christoph.PIP:ADatabaseSystemforGreatandSmallExpectations.ICDE.2010. [61] Koch,Christoph.Approximatingpredicatesandexpressivequeriesonprobabilisticdatabases.PODS.2008,99. [62] .OnQueryAlgebrasforProbabilisticDatabases.SIGMODRecord37(2008).4:78. [63] Koch,ChristophandOlteanu,Dan.ConditioningProbabilisticDatabases.CoRRabs/0803.2212(2008). [64] Krishnan,P.,Vitter,JeffreyScott,andIyer,BalakrishnaR.EstimatingAlphanumericSelectivityinthePresenceofWildcards.SIGMODConference.1996,282. [65] Lafferty,JohnD.,McCallum,Andrew,andPereira,FernandoC.N.ConditionalRandomFields:ProbabilisticModelsforSegmentingandLabelingSequenceData.ICML.2001,282. [66] Lakshmanan,LaksV.S.,Leone,Nicola,Ross,Robert,andSubrahmanian,V.S.ProbView:AFlexibleProbabilisticDatabaseSystem.ACMTransactionsonDatabaseSystems22(1997):419. [67] Lee,Ju-Hong,Kim,Deok-Hwan,andChung,Chin-Wan.Multi-dimensionalSelectivityEstimationUsingCompressedHistogramInformation.INSIGMOD.1999,205. [68] Lim,Lipyeow,Wang,Min,Padmanabhan,Sriram,Vitter,JeffreyScott,andParr,Ronald.XPathLearner:AnOn-lineSelf-TuningMarkovHistogramforXMLPathSelectivityEstimation.VLDB.2002,442. [69] Lim,Lipyeow,Wang,Min,andVitter,JeffreyScott.SASH:ASelf-AdaptiveHistogramSetforDynamicallyChangingWorkloads.VLDB.2003,369. [70] Muralikrishna,M.andDeWitt,DavidJ.Equi-DepthHistogramsForEstimatingSelectivityFactorsForMulti-DimensionalQueries.SIGMODConference.1988,28. [71] Olken,F.andRotem,D.RandomSamplingfromDatabases-ASurvey.1995. [72] Olteanu,Dan,Huang,Jiewen,andKoch,Christoph.SPROUT:Lazyvs.EagerQueryPlansforTuple-IndependentProbabilisticDatabases.ICDE.2009,640. [73] .Approximatecondencecomputationinprobabilisticdatabases.ICDE.2010,145. 133


[74] Peligrad, Magda and Utev, Sergey. Central limit theorem for linear processes. Ann. Probab. 25 (1997): 443.
[75] Perez, Luis Leopoldo, Arumugam, Subi, and Jermaine, Christopher M. Evaluation of probabilistic threshold queries in MCDB. SIGMOD Conference. 2010, 687.
[76] Piatetsky-Shapiro, G. and Connell, C. Accurate estimation of the number of tuples satisfying a condition. Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data. Boston, MA, 1984.
[77] Pittarelli, Michael. An Algebra for Probabilistic Databases. IEEE Trans. Knowl. Data Eng. 6 (1994).2: 293.
[78] Polyzotis, Neoklis and Garofalakis, Minos. XSketch Synopses for XML Data Graphs. ACM Transactions on Database Systems 31 (2006).3: 1014.
[79] Poosala, Viswanath and Ioannidis, Yannis E. Estimation of Query-Result Distribution and its Application in Parallel-Join Load Balancing. VLDB. 1996, 448.
[80] Poosala, Viswanath and Ioannidis, Yannis E. Selectivity Estimation Without the Attribute Value Independence Assumption. Proceedings of the 23rd International Conference on Very Large Data Bases. Athens, Greece: IEEE, 1997.
[81] Poosala, Viswanath, Ioannidis, Yannis E., Haas, Peter J., and Shekita, Eugene J. Improved Histograms for Selectivity Estimation of Range Predicates. SIGMOD Conference. 1996, 294.
[82] Kooi, R. The Optimization of Queries in Relational Databases. Ph.D. thesis, Case Western Reserve University, 1980.
[83] Re, Christopher, Dalvi, Nilesh N., and Suciu, Dan. Efficient Top-k Query Evaluation on Probabilistic Data. [1], 886.
[84] Re, Christopher, Letchner, Julie, Balazinska, Magdalena, and Suciu, Dan. Event queries on correlated probabilistic streams. SIGMOD Conference. 2008, 715.
[85] Re, Christopher and Suciu, Dan. Efficient Evaluation of HAVING Queries on a Probabilistic Database. DBPL. 2007.
[86] Ross, Robert B., Subrahmanian, V. S., and Grant, John. Aggregate operators in probabilistic databases. J. ACM 52 (2005).1: 54.
[87] Sarma, Anish Das, Benjelloun, Omar, Halevy, Alon Y., and Widom, Jennifer. Working Models for Uncertain Data. ICDE. 2006, 7.
[88] Sen, Prithviraj and Deshpande, Amol. Representing and Querying Correlated Tuples in Probabilistic Databases. [1], 596.


[89] Sen, Prithviraj, Deshpande, Amol, and Getoor, Lise. PrDB: managing and exploiting rich correlations in probabilistic databases. VLDB J. 18 (2009).5: 1065.
[90] Sen, Prithviraj, Deshpande, Amol, and Getoor, Lise. Read-Once Functions and Query Evaluation in Probabilistic Databases. PVLDB 3 (2010).1: 1068.
[91] Shao, Jun. Mathematical Statistics. Springer-Verlag, 1999.
[92] Srivastava, Utkarsh, Haas, Peter J., Markl, Volker, Kutsch, Marcel, and Tran, Tam Minh. ISOMER: Consistent Histogram Construction Using Query Feedback. ICDE. 2006, 39.
[93] Thaper, Nitin, Guha, Sudipto, Indyk, Piotr, and Koudas, Nick. Dynamic multidimensional histograms. SIGMOD Conference. 2002, 428.
[94] Wang, Daisy Zhe, Michelakis, Eirinaios, Garofalakis, Minos N., and Hellerstein, Joseph M. BayesStore: managing large, uncertain data repositories with probabilistic graphical models. PVLDB 1 (2008).1: 340.
[95] Widom, Jennifer. Trio: A System for Integrated Management of Data, Accuracy, and Lineage. CIDR. 2005.
[96] Wu, Yuqing, Patel, Jignesh M., and Jagadish, H. V. Estimating Answer Sizes for XML Queries. EDBT. 2002, 590.
[97] Wu, Y., Patel, J., and Jagadish, H. V. Using Histograms to Estimate Answer Sizes for XML Queries. Information Systems 28 (2003).1-2: 33.


BIOGRAPHICAL SKETCH

Lixia Chen is originally from China. She received her bachelor's degree from the Department of Electrical Engineering at Zhejiang University in 2000 and her master's degree from the Department of Computer Science at Zhejiang University in 2003. She then received her PhD from the University of Florida in the fall of 2011. Her primary research focuses on query approximation and aggregation processing in probabilistic databases.