
Sketches for Aggregate Estimations over Data Streams

Permanent Link: http://ufdc.ufl.edu/UFE0024259/00001

Material Information

Title: Sketches for Aggregate Estimations over Data Streams
Physical Description: 1 online resource (173 p.)
Language: english
Creator: Rusu, Florin
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2009

Subjects

Subjects / Keywords: Computer and Information Science and Engineering -- Dissertations, Academic -- UF
Genre: Computer Engineering thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract: In this work, we present methods to speed up the sketch computation. Sketches are randomized algorithms that use a small amount of memory and that can be computed in one pass over the data. Frequency moments represent important distributional characteristics of the data that are required in any statistical modeling method. This work focuses on AGMS-sketches used for the estimation of the second frequency moment. Fast-AGMS sketches use hash functions to speed up the computation by reducing the number of basic estimators that need to be updated. We show that hashing also changes the distribution of the estimator, thus improving the accuracy by orders of magnitude. The second method to speed up the sketch computation is related to the degree of randomization used to build the estimator. We show that by using 3-wise independent random variables instead of the proposed 4-wise, significant improvements are obtained both in computation time and memory usage while the accuracy of the estimator stays the same. The last speed-up method we discuss is combining sketches and sampling. Instead of sketching the entire data, the sketch is built only over a sample of the data. We show that the accuracy of the estimator is not drastically affected even when the sample contains a small amount of the original data. When the three speed-up methods are put together, it is possible to sketch streams having input rates of millions of tuples per second in small memory while providing accuracy similar to that of the original AGMS sketches.
General Note: In the series University of Florida Digital Collections.
General Note: Includes vita.
Bibliography: Includes bibliographical references.
Source of Description: Description based on online resource; title from PDF title page.
Source of Description: This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility: by Florin Rusu.
Thesis: Thesis (Ph.D.)--University of Florida, 2009.
Local: Adviser: Dobra, Alin.

Record Information

Source Institution: UFRGP
Rights Management: Applicable rights reserved.
Classification: lcc - LD1780 2009
System ID: UFE0024259:00001


Full Text

SKETCHES FOR AGGREGATE ESTIMATIONS OVER DATA STREAMS

By

FLORIN I. RUSU

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2009

© 2009 Florin I. Rusu

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
  1.1 Contributions
  1.2 Outline

2 PRELIMINARIES
  2.1 Problem Formulation
  2.2 Sketches
  2.3 Confidence Bounds
    2.3.1 Distribution-Independent Confidence Bounds
    2.3.2 Distribution-Dependent Confidence Bounds
    2.3.3 Mean Estimator
    2.3.4 Median Estimator
    2.3.5 Mean vs Median
    2.3.6 Median of Means Estimator
    2.3.7 Minimum Estimator

3 PSEUDO-RANDOM SCHEMES FOR SKETCHES
  3.1 Generating Schemes
    3.1.1 Problem Definition
    3.1.2 Orthogonal Arrays
    3.1.3 Abstract Algebra
    3.1.4 Bose-Chaudhuri-Hocquenghem Scheme (BCH)
    3.1.5 Extended Hamming 3-Wise Scheme (EH3)
    3.1.6 Reed-Muller Scheme
    3.1.7 Polynomials over Primes Scheme
    3.1.8 Toeplitz Matrices Scheme
    3.1.9 Tabulation Based Schemes
    3.1.10 Performance Evaluation
  3.2 Size of Join using AGMS Sketches
    3.2.1 Variance for BCH5
    3.2.2 Variance for BCH3
    3.2.3 Variance for EH3
    3.2.4 Empirical Evaluation
  3.3 Conclusions

4 SKETCHING SAMPLED DATA STREAMS
  4.1 Sampling
    4.1.1 Generic Sampling
    4.1.2 Bernoulli Sampling
    4.1.3 Sampling with Replacement
  4.2 Sketches
  4.3 Sketches over Samples
    4.3.1 Generic Sampling
    4.3.2 Bernoulli Sampling
    4.3.3 Sampling with Replacement
    4.3.4 Discussion
  4.4 Experimental Evaluation
  4.5 Conclusions

5 STATISTICAL ANALYSIS OF SKETCHES
  5.1 Sketches
    5.1.1 Basic AGMS Sketches
    5.1.2 Fast-AGMS Sketches
    5.1.3 Fast-Count Sketches
    5.1.4 Count-Min Sketches
    5.1.5 Comparison
  5.2 Statistical Analysis of Sketch Estimators
    5.2.1 Basic AGMS Sketches
    5.2.2 Fast-AGMS Sketches
    5.2.3 Count-Min Sketches
    5.2.4 Fast-Count Sketches
  5.3 Empirical Evaluation
    5.3.1 Testbed and Methodology
    5.3.2 Results
    5.3.3 Discussion
  5.4 Conclusions

6 SKETCHES FOR INTERVAL DATA
  6.1 Sketch Applications
    6.1.1 Size of Spatial Joins
    6.1.2 Selectivity Estimation for Building Dynamic Histograms
  6.2 Problem Formulation
  6.3 Dyadic Mapping (DMAP)
    6.3.1 Dyadic Intervals
    6.3.2 Dyadic Mapping Method
    6.3.3 Algorithm DMAP COUNTS
  6.4 Fast Range-Summable Generating Schemes
    6.4.1 Scheme BCH3
    6.4.2 Scheme EH3
    6.4.3 Four-Wise Independent Schemes
    6.4.4 Scheme RM7
    6.4.5 Approximate Four-Wise Independent Schemes
    6.4.6 Empirical Evaluation
  6.5 Fast Range-Summation Method
  6.6 Experimental Results
  6.7 Discussion
  6.8 Conclusions

7 CONCLUSIONS

REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

3-1 Orthogonal array OA;4;2;3.
3-2 Truth table for the function x1x2.
3-3 Generation time and seed size.
5-1 Families of ±1 random variables.
5-2 Families of hash functions.
5-3 Expected theoretical performance.
5-4 Expected statistical/empirical performance.
6-1 Sketching time per interval.
6-2 Sketching time per interval (ns).

LIST OF FIGURES

2-1 AGMS Sketches.
3-1 EH3 error.
3-2 BCH5 error.
3-3 Scheme comparison (full).
3-4 Scheme comparison (detail).
4-1 Size of join variance.
4-2 Self-join size variance.
4-3 Size of join error.
4-4 Self-join size error.
4-5 Size of join sample size.
4-6 Self-join size sample size.
5-1 The distribution of AGMS sketches for self-join size.
5-2 The distribution of F-AGMS sketches for self-join size.
5-3 The distribution of CM sketches for self-join size.
5-4 The distribution of FC sketches for self-join size.
5-5 F-AGMS kurtosis.
5-6 F-AGMS efficiency.
5-7 Confidence bounds for F-AGMS sketches as a function of the skewness of the data.
5-8 Accuracy as a function of the Zipf coefficient for self-join size estimation.
5-9 Accuracy as a function of the correlation coefficient for size of join estimation.
5-10 Relative performance for size of join estimation.
5-11 Accuracy as a function of the skewness of the data for size of join estimation.
5-12 Accuracy as a function of the available space budget.
5-13 Confidence bounds for CM sketches as a function of the skewness of the data.
5-14 Update time as a function of the number of counters in a sketch that has only one row.
6-1 The number of polynomial evaluations.
6-2 The set of dyadic intervals over the domain I = {0, 1, ..., 15}.
6-3 Dyadic mappings.
6-4 Fast range-summation with domain partitioning.
6-5 LANDO ⋈ LANDC.
6-6 LANDO ⋈ SOIL.
6-7 LANDC ⋈ SOIL.
6-8 Selectivity estimation.
6-9 Accuracy of sketches for interval data.
6-10 Update time per interval as a function of the number of partitions for the SOIL data set.

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

SKETCHES FOR AGGREGATE ESTIMATIONS OVER DATA STREAMS

By

Florin I. Rusu

May 2009

Chair: Alin Dobra
Major: Computer Engineering

In this work, we present methods to speed up the sketch computation. Sketches are randomized algorithms that use a small amount of memory and that can be computed in one pass over the data. Frequency moments represent important distributional characteristics of the data that are required in any statistical modeling method. This work focuses on AGMS-sketches used for the estimation of the second frequency moment. Fast-AGMS sketches use hash functions to speed up the computation by reducing the number of basic estimators that need to be updated. We show that hashing also changes the distribution of the estimator, thus improving the accuracy by orders of magnitude. The second method to speed up the sketch computation is related to the degree of randomization used to build the estimator. We show that by using 3-wise independent random variables instead of the proposed 4-wise, significant improvements are obtained both in computation time and memory usage while the accuracy of the estimator stays the same. The last speed-up method we discuss is combining sketches and sampling. Instead of sketching the entire data, the sketch is built only over a sample of the data. We show that the accuracy of the estimator is not drastically affected even when the sample contains a small amount of the original data. When the three speed-up methods are put together, it is possible to sketch streams having input rates of millions of tuples per second in small memory while providing accuracy similar to that of the original AGMS sketches.

CHAPTER 1
INTRODUCTION

Over the last decade, the development of computers and information technology has been tremendous. The clocks of processors reached unimaginable speed rates, the capacity of the storage devices has increased exponentially, and the amount of data transferred over the communication networks has scaled to unthinkable sizes. Although we are able to generate, store, and transport rivers of data, we do not have the computation power to efficiently process them. We are flooded with data we cannot process and analyze exactly. Thus, new models of computation have been proposed. The data stream model [6, 47] changes the perspective from which we look at the data and computation.

In the data stream model [6, 47], the input data are not available for random access from disk or memory, but rather arrive as one or more continuous data streams. Data streams differ from the conventional relational model in several ways. First, the data elements in the stream arrive online in a completely arbitrary order. Second, data streams are potentially unbounded in size. And third, once an element from a data stream has been processed, it is discarded or archived, making it difficult to reference subsequently unless it is explicitly stored in memory, which is typically small relative to the size of the data stream. In other words, data stream processing implies only one pass over the data and using small space. Given these stringent requirements, approximation and randomization are key ingredients in processing rapid data streams, contrary to traditional applications, which focus largely on providing precise answers.

Given the properties of the data stream computational model, there are multiple factors that need to be considered when designing algorithms. Since the size of the stream is potentially unbounded, methods to store or summarize the data are required. It is possible to store the entire data in a distributed fashion or to summarize only the most important characteristics of the stream. If the entire stream is stored, no data is lost and any subsequent processing is nothing else than another streaming algorithm with potentially different sources. A solution to the original problem is only delayed. If a summary of the data is built, it should capture the most important features of the stream relevant for the desired computation. There are multiple alternatives that can be considered depending on the actual problem. In the most general case, a uniform sample with a fixed size can be extracted from the unbounded stream. This allows the computation of a large number of functions of the data stream at the expense of a costly overhead. For specific properties that are defined beforehand, better synopsis structures [31] can be used. In the case when the data is distributed over multiple sources (either originally or because they were entirely stored), it is infeasible to re-assemble the stream in a single location for further processing because of the communication overhead. A more efficient solution is to create a synopsis at each site and then transfer only the synopsis. Of course, the necessary condition for this to work is that the single synopsis created by composing the local synopses is a true summary of the entire data.

The update rate of the stream is an important factor that also needs to be considered in this computational model. In order for any of the synopses to be applicable, it should be able to keep up with the rate of the stream. Unfortunately, this is not always the case in practice, where extremely fast streams are available; consider, for example, the stream of data read from the hard drives in a data warehouse environment. In such scenarios, even the fastest synopses are not always good enough and new solutions might be required.

Sketches are randomized synopses that summarize large amounts of data in a much smaller space. While sampling maintains the individual identity of the elements contained in the samples, the identity of an element is entirely lost once sketched. Due to this summarization, specific sketches need to be built for different problems, which is not always the case for samples. At the same time, the maintenance overhead is usually much smaller for sketches, making them more amenable to the data stream computational model. There exist multiple categories of sketches proposed in the literature, each of them solving a different class of problems. Flajolet-Martin (FM) sketches [26] are used for the estimation of different set cardinalities such as the number of distinct elements and the size of the union, intersection, and difference of two data sets. Alon-Gibbons-Matias-Szegedy (AGMS) sketches [3] were initially proposed for the computation of the second frequency moment of a data set and then extended to the estimation of the dot-product of two vectors. [20] further extended AGMS-sketches to the processing of complex aggregate functions over joins of multiple relations. Sketches based on stable distributions [37] were proposed for the computation of different norms of a data stream. All types of sketches share the same idea of combining randomness with the data in order to reduce the memory space. They differ in the type of randomness they use, the update procedure, and the way they define the estimator.

Aggregate estimation represents a long-standing topic in the database literature. The proposed estimation techniques evolved as the application scenarios changed. Query optimization was the starting point for statistics estimation. In order to select the optimal execution plan for a query, the optimizer utilizes pre-computed statistics. These were initially stored inside the database for each table in the form of the number of distinct elements and distributional histograms. Based on these statistics, a cost was computed for plans containing any possible combination of relational operators. There exists a large body of work on histogram computation and maintenance and on the effectiveness of histograms for cardinality estimation (see [22, 31] for a complete list). Since the theoretical characterization of the estimations using histograms was hard to achieve, sampling methods were also investigated in this context [22, 31]. The main advantage sketches have over the other estimation techniques is that they can be maintained up-to-date while the database is updated without significant overhead. At the same time, sketches were characterized theoretically from the beginning.

Approximate query answering represents the next level in the treatment of aggregate estimation by the database community (see the AQUA project white paper from Bell Labs). In a data warehouse environment containing terabytes of data, queries have unacceptable running times. In order to decrease the response time, it is acceptable to provide approximate answers with clear error guarantees for specific classes of applications (analytical and exploratory). Thus, random summaries of the data warehouse are computed and the queries are evaluated on them much faster. If these summaries are pre-computed and stored inside the database, there is no significant difference between query optimization and approximate query processing. The main difference is the visibility: while the query optimizer is not explicitly visible to the user, the approximate results provided by an approximate query processing system immediately affect a user.

Another alternative to approximate query answering is online aggregation [35]. In this case the estimates are computed at run-time from the actual data. As more data are seen, the estimates get better and better, to the point where they are the true answer to the query when the entire data are processed. The focus is different in such a system. Instead of providing a single estimate based on pre-computed statistics, a continuous estimate is updated throughout the entire execution of a query. The main requirement for such a system to work is that the data are processed in a random fashion. This is necessary because the estimators are computed from small enough samples that need to fit into memory at each particular instance of time.

The focus of this work is AGMS-sketches for aggregate estimations. More specifically, estimating the dot-product of two streams, and the second frequency moment of a stream as a particular case, is the central problem of this study. The goal is to completely understand the strengths and weaknesses of AGMS-sketches from both a theoretical and a practical perspective, thus putting them on the map of the current approximation techniques. The ultimate aim is to make AGMS-sketches usable in practice.

1.1 Contributions

The focus of this research work is on sketch synopses for aggregate queries over data streams. We aim at deeply understanding the theoretical foundations that lie at the basis of the sketching methods and at further improving the existing techniques from both a theoretical and a practical perspective. Our final goal is to make sketch synopses practical for application in a data stream environment. To this end, a detailed theoretical characterization represents a necessary prerequisite. Our contributions are fundamental for sketch synopses, targeting the basic foundations of the method. The directions of our research and the corresponding theoretical and practical findings of our work can be summarized as follows:

• Pseudo-random numbers with specific properties lie at the foundation of many randomization algorithms, including sketching techniques. We study the existing random number generation methods and identify the practical algorithms that apply to sketches. Our main finding is that the best accuracy performance is obtained for a 3-wise independent generating scheme, i.e., EH3, which is completely surprising because the general belief was that 4-wise independent random variables are required for sketch synopses. We provide the theoretical results that support our findings and show that the 4-wise independence requirement is necessary only to simplify the theoretical analysis. From a practical point of view, 3-wise independent random variables require less memory and are faster to generate: they provide a factor of 2 improvement in both space and time compared to 4-wise independent random variables.

• Although using 3-wise independent random numbers represents a considerable improvement in the time performance of the sketch synopsis, this might not be sufficient for the high-rate data streams that have to be processed in current networking equipment. To alleviate this problem, we propose the combination of sampling and sketch synopses to further increase the processing rate. We introduce the first sketch over samples estimator in the literature and provide the complete theoretical characterization. Our main finding with respect to the new estimator is that almost the same accuracy results can be obtained even when the sketch is computed over a 1% sample of the data rather than the entire data set. This implies a factor of 100 improvement in speed or processing time.

• Multiple sketch synopses using hashing have been proposed in order to further improve the processing time while maintaining the original accuracy requirements. Although equivalent under a theoretical analysis based on frequency moments, these methods had surprisingly different results in practice. This motivated us to carry out a more detailed statistical analysis of the sketch estimators based on the probability distribution. Our surprising finding is that the hybrid sketch (Fast-AGMS) synopsis almost always outperforms the other sketches although the original theoretical results showed that they were equivalent. We provide both the statistical support and significant experimental evidence for these results.

• Sketches were proved to be a viable solution for handling the interval data streams that appear in spatial applications. Dyadic mapping (DMAP), the state-of-the-art solution for sketching intervals, uses a set of transformations to avoid the sketching of each point in the input interval at the cost of a significant degradation in accuracy. The natural solution for sketching intervals is to use efficient algorithms for summing up the values in the interval, a solution called fast range-summation. We carry out a detailed study of the random number generating schemes in order to identify the fast range-summable schemes. Our main finding is that while none of the 4-wise independent schemes is fast range-summable, two of the 3-wise independent schemes, i.e., BCH3 and EH3, have efficient fast range-summable algorithms. The gain in accuracy over DMAP when using the EH3 fast range-summable scheme is significant, as much as a factor of 8. We show that fast range-summation cannot be applied in conjunction with hash-based sketches. To alleviate this and the accuracy performance of DMAP, we develop hybrid algorithms with significantly improved results for both solutions.

The contributions and findings are further detailed in the chapters corresponding to each of the general problems introduced in this section. In order to ease the reading, the experimental results are presented separately for each problem.

1.2 Outline

An introduction to sketches and methods to analyze randomization algorithms is given in Preliminaries (Chapter 2). Pseudo-random number generating schemes and the analysis of the original AGMS sketching synopsis are presented in Chapter 3. Chapter 4 contains the details of the combined sketch over samples estimator. The statistical analysis of the sketch synopsis and estimators is provided in Chapter 5. The extension to sketches for interval data streams is detailed in Chapter 6. Chapter 7 contains the conclusions.

CHAPTER 2
PRELIMINARIES

The material in this section serves as a basis for the rest of the work. The material is further detailed according to the demands of each subsequent section. The formal definition of the problem we deal with in this work (estimating the size of join over data streams using sketches) is provided in Section 2.1. Then we take a close look at the existing methods to derive confidence bounds for randomized algorithms (sketches are ultimately a randomized algorithm) in Section 2.3. The basic AGMS sketching method we considerably refine and improve throughout the work is introduced in Section 2.2.

2.1 Problem Formulation

Let $S = (e_1, w_1), (e_2, w_2), \ldots, (e_s, w_s)$ be a data stream, where the keys $e_i$ are members of the set $I = \{0, 1, \ldots, N-1\}$ and $w_i$ represent frequencies. The frequency vector $f = [f_0, f_1, \ldots, f_{N-1}]$ over the stream $S$ consists of the elements $f_i$ defined as $f_i = \sum_{j : e_j = i} w_j$. The key idea behind the existing sketching techniques is to represent the domain-size frequency vector as a much smaller sketch vector $x_f$ [14] that can be easily maintained as the updates are streaming by and that can provide good approximations for a wide spectrum of queries.

Our focus is on sketching techniques that approximate the size of join of two data streams. The size of join is defined as the inner-product of the frequency vectors $f$ and $g$, $f \cdot g = \sum_{i=0}^{N-1} f_i g_i$. As shown in [51], this operator is generic since other classes of queries can be reduced to the size of join computation. For example, a range query over the interval $[\alpha, \beta]$, i.e., $\sum_{i=\alpha}^{\beta} f_i$, can be expressed as the size of join between the data stream $S$ and a virtual stream consisting of a tuple $(i, 1)$ for each $\alpha \le i \le \beta$. Notice that point queries are range queries over size zero intervals, i.e., $\alpha = \beta$. Also, the second frequency moment or the self-join size of $S$ is nothing else than the inner-product $f \cdot f$.
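The definitions of Section 2.1 can be made concrete with a small exact (non-sketched) computation. The following is only an illustrative sketch: the function names `frequency_vector` and `size_of_join` and the toy streams are assumptions introduced here, not part of the original text.

```python
def frequency_vector(stream, domain_size):
    """Build f, where f[i] is the sum of the weights w_j of all tuples with key e_j == i."""
    f = [0] * domain_size
    for key, weight in stream:
        f[key] += weight
    return f


def size_of_join(f, g):
    """Inner product of two frequency vectors defined over the same domain."""
    return sum(fi * gi for fi, gi in zip(f, g))


N = 8                                    # hypothetical domain I = {0, ..., 7}
S = [(1, 2), (3, 1), (1, 1), (5, 4)]     # (key, weight) tuples
T = [(1, 1), (5, 2), (7, 3)]

f = frequency_vector(S, N)               # [0, 3, 0, 1, 0, 4, 0, 0]
g = frequency_vector(T, N)               # [0, 1, 0, 0, 0, 2, 0, 3]

join_size = size_of_join(f, g)           # 3*1 + 4*2
self_join_size = size_of_join(f, f)      # second frequency moment: 9 + 1 + 16

# Range query over [2, 5] as a join with a virtual stream of tuples (i, 1).
virtual = [(i, 1) for i in range(2, 5 + 1)]
range_sum = size_of_join(f, frequency_vector(virtual, N))   # f[2] + ... + f[5]
```

This exact computation needs memory proportional to the domain size, which is precisely the cost that the sketches of Section 2.2 are meant to avoid.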

2.2 Sketches

AGMS sketches are randomized schemes that were initially introduced for approximating the second frequency moment of a relation in small space [3]. Afterwards, they were applied to the general size of join problem [4]. The size of join of two relations $F$ and $G$, each with a single attribute $A$, consists in computing the quantity $|F \bowtie_A G| = \sum_{i \in I} f_i g_i$, where $I$ is the domain of $A$ and $f_i$ and $g_i$, respectively, are the frequencies of the tuples in the two relations.

The exact solution to this problem is to maintain the frequency vectors $f$ and $g$ and then to compute the size of join. Such a solution would not work if the amount of memory or the communication bandwidth is smaller than $\min(|I|, \text{cardinality of relation})$. The approximate solution based on sketches is defined as follows [4]:

1. Start with a family of 4-wise independent $\pm 1$ random variables $\xi_i$ corresponding to the elements in the domain $I$.

2. Define the sketches $X_F = \sum_{i \in I} f_i \xi_i = \sum_{t \in F} \xi_{t.A}$ and $X_G = \sum_{i \in I} g_i \xi_i = \sum_{t \in G} \xi_{t.A}$. The sketch summarizes the frequency vector of the relation as a single value by randomly projecting each tuple over the values $+1$ and $-1$. Notice that the tuples with the same value are always projected either on $+1$ or $-1$.

3. Let the random variable $X$ be $X = X_F \cdot X_G$. $X$ has the properties that it is an unbiased estimator for the size of join $|F \bowtie_A G|$ and that it has small variance. An estimator with relative error at most $\epsilon$ with probability at least $1 - \delta$ can be obtained by taking the median over averages of multiple independent copies of the random variable $X$: the median is computed over $s_2 = 2 \log \frac{1}{\delta}$ such averages, each average containing $s_1 = \frac{8}{\epsilon^2} \frac{Var[X]}{E[X]^2}$ instances of $X$ (Figure 2-1).

Sketches are perfectly suited for both data-streaming and distributed computation since they can be updated in pieces. For example, if the tuples in one relation are streamed one by one, the atomic sketch can be computed by simply adding the value $\xi_i$ corresponding to the value $i$ of the current item. For distributed computation, each party can compute the sketch of the data it owns. Then, by only exchanging the sketch with the other parties and adding them up, the sketch of the entire dataset can be computed.
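The three steps above, together with the streaming update and the distributed merge, can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation studied in this work: the class `AtomicSketch`, the helpers `sketch_stream` and `estimate_join`, and the seeded lookup table of fully independent ±1 values (which is in particular 4-wise independent, but costs memory linear in the domain size) are all hypothetical choices made for readability.

```python
import random
from statistics import mean, median


class AtomicSketch:
    """One AGMS counter X = sum_i f_i * xi_i for a fixed +/-1 family xi."""

    def __init__(self, domain_size, seed):
        # Assumption for illustration: a seeded table of fully independent
        # +/-1 values. It uses O(|I|) memory, unlike compact generating schemes.
        rng = random.Random(seed)
        self.xi = [rng.choice((-1, 1)) for _ in range(domain_size)]
        self.value = 0

    def update(self, key, weight=1):
        # Streaming update: add weight * xi_key for the arriving tuple.
        self.value += weight * self.xi[key]

    def merge(self, other):
        # Distributed computation: sketches built with the same seed add up.
        self.value += other.value


def sketch_stream(stream, domain_size, seed):
    sketch = AtomicSketch(domain_size, seed)
    for key, weight in stream:
        sketch.update(key, weight)
    return sketch


def estimate_join(sketches_f, sketches_g, group_size):
    """Median over means of the products X_F * X_G, group_size products per mean."""
    products = [a.value * b.value for a, b in zip(sketches_f, sketches_g)]
    means = [mean(products[i:i + group_size])
             for i in range(0, len(products), group_size)]
    return median(means)


N = 16
seeds = range(15)                      # 5 averages of 3 copies each
F = [(3, 5)]                           # key 3 appears with frequency 5
G = [(3, 2)]                           # key 3 appears with frequency 2
estimate = estimate_join([sketch_stream(F, N, s) for s in seeds],
                         [sketch_stream(G, N, s) for s in seeds],
                         group_size=3)

# Merging two partial sketches equals sketching the concatenated stream.
part1, part2 = [(1, 2), (4, 1)], [(1, 3), (7, 2)]
merged = sketch_stream(part1, N, seed=42)
merged.merge(sketch_stream(part2, N, seed=42))
whole = sketch_stream(part1 + part2, N, seed=42)
```

In this single-key example every product equals the true join size because the square of a ±1 value is 1; with several keys the cross terms only cancel in expectation, which is why averaging and the median are needed.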

PAGE 18

TheAGMSsketchesdescribedabovehaveattheircore1randomvariables.Thevalues+1and)]TJ/F15 11.9552 Tf 9.2985 0 Td[(1canbeobtainedbysimplygeneratingrandombitsandtheninterpretingthemas)]TJ/F15 11.9552 Tf 9.2985 0 Td[(1and+1values.ThenecessaryandsucientrequirementforXtobeanunbiasedestimatorforthesizeofjoinofFandGisthatthefamilyofrandomvariablestobe2-wiseindependent.Thestronger4-wiseindependencepropertyisrequiredinordertomakethevarianceassmallaspossible,thusreducingthenumberofcopiesofXthatneedtobeaveragedinordertoachievethedesiredaccuracy.But,asweshowinSection 3.1 ,generating4-wiseindependentrandomvariablesismoredemandingbothintheamountofrequiredmemory,aswellasintimeeciency.2.3CondenceBoundsTheabstractproblemwetacklethroughoutthisworkisthefollowing.GivenX1;:::;XnindependentinstancesofagenericrandomvariableX,deneanestimatorfortheexpectedvalueE[X]andprovidecondenceboundsfortheestimate.WhileE[X]istheconvergencevalueoftheestimatorhopefullythetruevalueoftheestimatedquantity,condenceboundsprovideinformationabouttheintervalwheretheexpectedvaluelieswithhighprobabilityor,equivalently,theprobabilitythataparticularinstanceofXdeviatesbyagivenamountfromtheexpectationE[X].Inthissectionweprovideanoverviewofthemethodstoderivecondenceboundsforgenericrandomvariablesingeneral,andsketches,inparticular.Thereexisttwotypesofcondencebounds:distribution-independentanddistribution-dependent.Distribution-independentcondenceboundsarederivedfromgeneralnotionsinmeasuretheoryandaremainlyusedintheoreticalcomputerscience.Distribution-dependentcondenceboundsassumecertaindistributionsfortheestimatoroftheexpectationE[X]andarelargelyusedinstatistics.Whiledistribution-independentboundsarebasedongeneralinequalities,adetailedproblem-specicanalysisisrequiredfordistribution-dependentbounds.Inthecontextofsketchestimatorsweshowthat 18


distribution-independent bounds, although easier to obtain, are unacceptably loose in some situations, thus making it necessary to derive tighter distribution-dependent bounds.

2.3.1 Distribution-Independent Confidence Bounds

As already specified, distribution-independent confidence bounds are derived from general inequalities on tail probabilities in measure theory. No assumption on the probability distribution of the estimator is made. We introduce the inequalities used for characterizing sketching techniques in what follows.

Theorem 1 (Markov Inequality [46]). Let X be any random variable with expected value E[X] and f any positive real function. Then for all $t \in \mathbb{R}^+$:

$$P[f(X) \ge t] \le \frac{E[f(X)]}{t} \tag{2-1}$$

Or, equivalently,

$$P[f(X) \ge t \cdot E[f(X)]] \le \frac{1}{t} \tag{2-2}$$

Markov inequality states that the probability that a random variable deviates by a factor larger than t from its expected value is smaller than $1/t$. This is the tightest bound that can be obtained when only the expectation E[X] is known. Tighter confidence bounds can be derived using Chebyshev inequality and its extension to higher moments. These bounds require more information on the distribution of the estimator for E[X], which is not always available.

Theorem 2 (Chebyshev Inequality [46]). Let X be a random variable with expectation E[X] and variance Var[X]. Then, for any real function f and any $t \in \mathbb{R}^+$:

$$P\left[\,|f(X) - E[f(X)]| \ge t \sqrt{Var[f(X)]}\,\right] \le \frac{1}{t^2} \tag{2-3}$$

Chernoff bounds are exponential tail bounds applicable to sums of independent Poisson trials. They are largely used for the analysis and design of randomized algorithms (including sketches) because of the tight (logarithmic) bounds they provide.


Theorem 3 (Chernoff Bound [46]). Let $X_1, \ldots, X_n$ be independent Poisson trials (random variables taking only the values 0 or 1) such that, for $1 \le i \le n$, $P[X_i = 1] = p_i$, where 00,P[X>+]+]

Notice that although both types of confidence bounds require the computation of the frequency moments of X, the actual bounds are extracted in different ways. The question that immediately arises is which confidence bounds should be used. Typically, distribution-dependent bounds are tighter, but there exist assumptions that need to be satisfied in order for them to hold. Specifically, the distribution of the estimator for E[X] has to be similar in shape to the assumed parametric distribution. In the following we provide a short overview of the results from statistics on distribution-dependent confidence bounds that are used throughout this work.

2.3.3 Mean Estimator

Usually the mean $\bar{X}$ of $X_1, \ldots, X_n$ is considered as the proper estimator for E[X]. It is known from statistics [54] that when the distribution of X is normal, the mean $\bar{X}$ is the uniformly minimum variance unbiased estimator (UMVUE), the minimum risk invariant estimator (MRIE), and the maximum likelihood estimator (MLE) for E[X]. This is strong evidence that $\bar{X}$ should be used as the estimator of E[X] when the distribution of X is normal or almost normal. The Central Limit Theorem (CLT) extends the characterization of the mean estimator to arbitrary distributions of X.

Theorem 4 (Mean CLT [54]). Let $X_1, \ldots, X_n$ be independent samples drawn from the distribution of the random variable X and $\bar{X}$ be the average of the n samples. Then, as long as $Var[X] < \infty$:

$$\bar{X} \xrightarrow{d} N\left(E[X], \frac{Var[X]}{n}\right) \tag{2-7}$$

Essentially, CLT states that the distribution of the mean is asymptotically a normal distribution centered on the expected value and having variance $Var[X]/n$ irrespective of the distribution of X. Confidence bounds for $\bar{X}$ can be immediately derived from the cdf of the normal distribution:


Theorem 5 (Mean Bounds [53]). For the same setup as in Theorem 4, the asymptotic confidence bounds for $\bar{X}$ are:

$$P\left[\,|\bar{X} - E[X]| > z_{1-\alpha/2} \sqrt{\frac{Var[X]}{n}}\,\right] < \alpha \tag{2-8}$$

where $z_\beta$ is the $\beta$ quantile ($\beta \in [0, 1]$) of the normal N(0, 1) distribution, i.e., the point for which the probability of N(0, 1) to be smaller than the point is $\beta$.

Since fast series algorithms for the computation of $z_\beta$ are widely available¹, the computation of confidence bounds for $\bar{X}$ is straightforward. Usually, the CLT approximation of the distribution of the mean and the confidence bounds produced with it are correct starting with hundreds of samples being averaged. If the number of samples is smaller, confidence bounds can be determined based on the Student t-distribution [53]. The only difference is that the quantile $t_{n-1,\beta}$ of the Student t-distribution with $n - 1$ degrees of freedom has to be used instead of the quantile $z_\beta$ of the normal distribution in Theorem 5. Notice that in order to characterize the mean estimator, only the variance of X has to be determined. When Var[X] is not known (this is the case for sketches, since estimating the variance is at least as hard as estimating the expected value), the variance can be estimated from the samples in the form of sample variance. This is the common practice in statistics and also in database literature (approximate query processing with sampling).

2.3.4 Median Estimator

Although the mean is the preferable estimator in most circumstances, there exist distributions for which the mean cannot be used as an estimator of E[X]. For Cauchy distributions (which have infinite variance) the mean can be shown to have the same distribution as a single random sample. In such cases the median $\tilde{X}$ of the samples is

¹ The GNU Scientific Library (GSL) implements pdf, cdf, inverse cdf, and other functions for the most popular distributions, including Normal and Student t.


the only viable estimator of the expected value. The necessary condition for the median to be an estimator of the expected value is that the distribution of the estimator be symmetric, in which case the mean and the median coincide. We start the investigation of the median estimator by introducing its corresponding central limit theorem and then show how to derive confidence bounds.

Theorem 6 (Median CLT [54]). Let $X_1, \ldots, X_n$ be independent samples drawn from the distribution of the random variable X and $\tilde{X}$ be the median of the n samples. Then, as long as the probability density function f of the distribution of X has the property $f(\theta) > 0$:

$$\tilde{X} \xrightarrow{d} N\left(\theta, \frac{1}{4 n f(\theta)^2}\right) \tag{2-9}$$

where $\theta$ is the true median of the distribution.

Median CLT states that the distribution of the median is asymptotically a normal distribution centered on the true median and having the variance equal to $\frac{1}{4 n f(\theta)^2}$. In order to compute the variance of this normal distribution and derive confidence bounds from it, the probability density function (pdf) of X has to be determined or at least estimated at the true median $\theta$. Since $f(\theta)$ cannot be computed exactly in general, multiple estimators for the variance are proposed in the statistics literature [50]. We use the variance estimator proposed in [48] for deriving confidence bounds:

Theorem 7 (Median Bounds [48]). For the same setup as in Theorem 6, the confidence bounds for $\tilde{X}$ are given by:

$$P\left[\,|\tilde{X} - \theta| \le t_{n-1,1-\alpha/2} \, SE(\tilde{X})\,\right] \ge 1 - \alpha \tag{2-10}$$

where $t_{n-1,\beta}$ is the $\beta$ quantile of the Student t-distribution with $n - 1$ degrees of freedom and $SE(\tilde{X})$ is the estimate for the standard deviation of $\tilde{X}$ given by:

$$SE(\tilde{X}) = \frac{X_{(U_n)} - X_{(L_n + 1)}}{2}, \quad \text{where } L_n = \left[\frac{n}{2}\right] - \left\lceil \sqrt{\frac{n}{4}} \, \right\rceil, \quad U_n = n - L_n \tag{2-11}$$
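A minimal sketch of how the bounds in Theorem 7 could be evaluated from a list of samples is given below. It assumes that $[n/2]$ denotes the integer part and that the square root is rounded up, and that $X_{(j)}$ is the j-th smallest sample; the function names are hypothetical. The Student t quantile is passed in by the caller (for instance from GSL or a statistics table) rather than computed here.

```python
import math
import statistics

def median_standard_error(samples):
    """SE of the median from two order statistics, as in Equation (2-11).

    Assumes [n/2] is the integer part and sqrt(n/4) is rounded up.
    """
    n = len(samples)
    lower = n // 2 - math.ceil(math.sqrt(n / 4))   # L_n
    if lower < 0:
        raise ValueError("need at least 2 samples")
    upper = n - lower                              # U_n
    ordered = sorted(samples)
    # X_(j) is 1-indexed in the formula; Python lists are 0-indexed.
    return (ordered[upper - 1] - ordered[lower]) / 2

def median_confidence_interval(samples, t_quantile):
    """Interval median +/- t_quantile * SE, where t_quantile = t_{n-1, 1-alpha/2}."""
    center = statistics.median(samples)
    half_width = t_quantile * median_standard_error(samples)
    return center - half_width, center + half_width
```

For nine samples 1, 2, ..., 9 this gives $L_n = 2$, $U_n = 7$, and a standard error of (7 - 3)/2 = 2.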


Notice that while the distribution corresponding to the mean estimator is centered on the expected value E[X], the distribution of the median is centered on the true median, thus the requirement on the symmetry of the distribution for the median to be an estimator of E[X].

2.3.5 Mean vs Median

For the cases when the distribution is symmetric, thus the expected value and the median coincide, or when the difference between the median and the expected value is insignificant, the decision with respect to which of the mean or the median to use as an estimate for the expected value is reduced to establishing which of the two has smaller variance. Since for both estimators the variance decreases by a factor of n, the question is further reduced to comparing the variance Var[X] and the quantity $\frac{1}{4 f(\theta)^2}$. The relation between these two quantities is established in statistics through the notion of asymptotic relative efficiency:

Definition 1 ([54]). The relative efficiency of the median estimator $\tilde{X}$ with respect to the mean estimator $\bar{X}$, shortly the efficiency of the distribution of X with the probability density function f, is defined as:

$$e(f) = 4 f(\theta)^2 \, Var[X] \tag{2-12}$$

The efficiency of a distribution for which $E[X] = \theta$ indicates which of the mean or the median is a better estimator for E[X]. More precisely, e(f) indicates the reduction in mean squared error if the median is used instead of the mean. When e(f) > 1, the median is a better estimator, while for e(f) < 1 the mean provides better estimates.

An important case to consider is when X has normal distribution. In this situation the efficiency is independent of the parameters of the distribution and it is equal to $2/\pi \approx 0.64$ (derived from the above definition and the pdf of the normal distribution). This immediately implies that when the estimator is defined as the average of the samples, i.e., by Mean CLT the distribution of the estimator is asymptotically normal, the mean


estimator is more efficient than the median estimator. We exploit this result for analyzing sketches in Section 5.2. In terms of mean squared error, the error of the mean estimator is 0.64 times that of the median estimator, while in terms of root mean squared error (or relative error) it is 0.8 times that of the median estimator.

As already pointed out, when the efficiency is supra-unitary, i.e., e(f) > 1, medians should be preferred to means for estimating the expected value, if the distribution is symmetric or almost symmetric. An interesting question is what property of the distribution (hopefully involving only moments, since they are easier to compute for discrete distributions) indicates supra-unitary efficiency. According to the statistics literature [7], kurtosis is the perfect indicator of supra-unitary efficiency.

Definition 2 ([7]). The kurtosis k of the distribution of the random variable X is defined as:

$$k = \frac{E[(X - E[X])^4]}{Var[X]^2} \tag{2-13}$$

For normal distributions, the kurtosis is equal to 3 irrespective of the parameters. Even though there does not exist a distribution-independent relationship between the kurtosis and the efficiency, empirical studies [49] show that whenever $k \le 6$ the mean is a better estimator of E[X], while for $k > 6$ the median is the better estimator.

2.3.6 Median of Means Estimator

Instead of using only the mean or the median as an estimator for the expected value, we can also consider combined estimators. One possible combination that is used in conjunction with sketching techniques (see Section 4.2) is to group the samples into groups of equal size, compute the mean of each group, and then the median of the means, thus obtaining the overall estimator for the expected value. To characterize this estimator using distribution-independent bounds, a combination of the Chebyshev and Chernoff bounds (Theorem 2 and Theorem 3) can be used:


Theorem 8 ([3]). The median Y of $2 \log_2(1/\delta)$ means, each averaging $16/\epsilon^2$ independent samples of the random variable X, has the property:

$$P\left[\,|Y - E[X]| \le \epsilon \sqrt{Var[X]}\,\right] \ge 1 - \delta \tag{2-14}$$

We provide an example that compares the bounds for the median of means estimator. While distribution-independent bounds are computed using the results in Theorem 8, distribution-dependent bounds are computed through a combination of Mean CLT, Median CLT, and efficiency.

Example 1. Suppose that we want to compute 95% confidence bounds for the median of means estimator. Then the number of means for which we compute the median should be $2 \log_2 \frac{1}{0.05} = 2 \log_2 20 \approx 9$ according to Theorem 8. If the number of samples is n, then each mean is the average of $n/9$ samples, thus $\epsilon = \sqrt{144/n} = 12 \sqrt{1/n}$. The width of the confidence bound in terms of $\sqrt{Var[X]/n}$ is thus 12. By applying Mean CLT, the distribution of each mean is asymptotically normal with variance $\frac{Var[X]}{n/9}$. In practice, confidence bounds can be easily derived by applying the results in Theorem 7. We cannot do that in this example because the values of the 9 means are unknown. Instead we assume that the distribution of the means is asymptotically normal and, by Median CLT and the definition of efficiency, the median of the 9 means has variance $\frac{1}{9 \, e(N)} \cdot \frac{Var[X]}{n/9}$, with $e(N) = 2/\pi$ the efficiency of the normal distribution. The variance of Y is thus $\frac{\pi}{2} \cdot \frac{Var[X]}{n} \approx 1.57 \, \frac{Var[X]}{n}$. With this, the width of the CLT-based confidence bound for Y with respect to $\sqrt{Var[X]/n}$ is $\sqrt{1.57} \cdot 1.96 = 2.45$ (1.96 is the normal quantile for a two-sided 95% bound), which is $12/2.45 \approx 4.89$ times smaller than the distribution-independent confidence bound. This result confirms that distribution-dependent confidence bounds are tighter and is consistent with other results that compare the two types of bounds [53]. Confidence bounds of different widths can be computed in a similar manner, the only difference being the


values that are plugged into the formulas. For example, the ratio for the 99% confidence bound is 4.64 and 4.34 for the 99.9% confidence interval.

An important point in the above derivation of the CLT confidence bounds for Y is the fact that the confidence interval is wider by $\sqrt{\pi/2} \approx 1.25$ if medians are used, compared to the situation when the estimator is only the mean (with no medians). This implies that the median of means estimator is always inferior to the mean estimator irrespective of the distribution of X. A simple explanation for this is that the asymptotic regime of Mean CLT starts to take effect (the distribution becomes normal) since means are computed first, and the mean estimator is more efficient than the median estimator. Thus, from a practical and statistical point of view based on the efficiency of the distribution, the estimator should be either the mean (e < 1), or the median (e > 1), but never the median of means.

2.3.7 Minimum Estimator

Although the minimum of the samples $X_1, \ldots, X_n$ is not an estimator for the expectation E[X], a discussion on the behavior of the minimum estimator is included because of its relation to Count-Min sketches. It is known from statistics [12] that the minimum of a set of samples has an asymptotic distribution called the generalized extreme value distribution (GEV), independent of the distribution of X. The parameters of the GEV distribution can be computed from the frequency moments of X, thus confidence bounds for the minimum estimator can be derived from the cdf of GEV. Although this is a straightforward method to characterize the behavior of the minimum estimator, it is not applicable to Count-Min sketches (Section 5.2).
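The comparison made in Example 1, together with the ratios quoted for the 99% and 99.9% levels, can be reproduced with the short sketch below. It assumes that the number of means is $2 \log_2(1/\delta)$ rounded up (the reading that yields the 9 means of Example 1) and that the CLT-based width is $\sqrt{\pi/2}$ times the two-sided normal quantile; the function name `bound_ratio` is hypothetical.

```python
import math
from statistics import NormalDist

def bound_ratio(delta):
    """Return (number of means, ratio of distribution-independent width to CLT width).

    Widths are expressed in units of sqrt(Var[X] / n).
    """
    groups = math.ceil(2 * math.log2(1 / delta))
    # Each mean averages n / groups = 16 / eps^2 samples, so eps = 4 * sqrt(groups / n).
    independent_width = 4 * math.sqrt(groups)
    # Median of normal means: variance (pi / 2) * Var[X] / n, two-sided normal quantile.
    z = NormalDist().inv_cdf(1 - delta / 2)
    clt_width = math.sqrt(math.pi / 2) * z
    return groups, independent_width / clt_width

for delta in (0.05, 0.01, 0.001):
    groups, ratio = bound_ratio(delta)
    print(f"confidence {1 - delta:.1%}: {groups} means, ratio {ratio:.2f}")
```

Under these assumptions the three ratios come out as 4.89, 4.64, and 4.34, in agreement with the values quoted in the text.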


Figure 2-1. AGMS Sketches. (Diagram: the atomic sketches $X_{j,1}, \ldots, X_{j,s_1}$ of each group $j = 1, \ldots, s_2$ are averaged into $X_j$, and the median of $X_1, \ldots, X_{s_2}$ gives the sketch estimate X.)


CHAPTER 3
PSEUDO-RANDOM SCHEMES FOR SKETCHES

The exact computation of aggregate queries, like the size of join of two relations, usually requires large amounts of memory (constrained in data-streaming) or communication (constrained in distributed computation) and large processing times. In this situation, approximation techniques with provable guarantees, like sketches, are one possible solution. Due to their linearity, AGMS sketches [3] have all these properties and they have been successfully used for the estimation of aggregates, like the size of join, over data-streams [4, 20] and in distributed environments, like sensor networks [41]. The performance of sketches depends crucially on the ability to generate particular pseudo-random numbers. In this chapter we investigate both theoretically and empirically the problem of generating k-wise independent pseudo-random numbers and, in particular, that of generating 3- and 4-wise independent pseudo-random numbers. Our specific contributions are:

• We provide a thorough comparison of the various generating schemes with the goal of identifying the efficient ones. To this end, we explain how the schemes can be implemented on modern processors and we use such implementations to empirically evaluate them.

• We provide thorough theoretical analysis for the behavior of the BCH3 and EH3 schemes as replacements for the 4-wise independent schemes in size of join estimations using AGMS sketches. We show that while BCH3 is a poor replacement for non-skewed distributions, EH3 is in general as good as or significantly better than any 4-wise independent scheme. The fact that EH3 gets within a constant factor of the error of the 4-wise independent schemes for the problem of computing the L1-difference of two streaming vectors was theoretically proved in [25]. Here we generalize this result to size of join estimations using AGMS sketches and we show that EH3 can always replace the 4-wise independent schemes without sacrificing accuracy. Our empirical evaluation confirms these theoretical predictions.

In the rest of the chapter, we discuss the known generating schemes for random variables with limited independence in Section 3.1. In Section 3.2 we provide theoretical proof that the Extended Hamming (EH3) generating scheme works as well as the 4-wise


independent schemes. We provide empirical evidence for this fact and a thorough comparison between EH3 and the 4-wise schemes in Section 3.2.4.

3.1 Generating Schemes

Based on the published literature [3], the AGMS sketches described in Section 2.2 need ±1 4-wise independent random variables that can be generated in small space in order to produce reliable estimations. We address the question of whether the 4-wise independence is actually required later in the chapter, but the 2-wise independence is a minimum requirement, since otherwise the estimator is not even unbiased. In this section we review a number of generating schemes for random variables with limited degree of independence and we discuss how they can be implemented on modern processors. In the subsequent sections we refer back to these generating schemes.

3.1.1 Problem Definition

Let I be a domain (without loss of generality we assume that $I = \{0, 1, \ldots, N - 1\}$) and $\xi$ be a family of two-valued random variables with a member for each $i \in I$, denoted by $\xi_i$. We assume that these random variables take the values ±1 with the same probability. For different values, a mapping has to be defined. Usually, the domain I is large, for example $N = 2^{32}$, so a large amount of space is required to store even one instance of the family $\xi$. In order to be space efficient, each $\xi_i$ is generated on demand according to a generating function:

$$\xi_i(S) = (-1)^{f(S, i)}, \quad i \in I \tag{3-1}$$

S is a random seed that can be stored in small space and the function f can be efficiently computed from the index i and S. The space used by S and the function f depend upon the desired degree of independence k. To allow simple generation, the seed S is uniformly and randomly chosen from its domain. With S generated this way, the k-wise independence property required for $\xi$ translates into requiring that function f, for each k different indices $i_1, i_2, \ldots, i_k$, generates any of the $2^k$ possible outcomes the same number


of times, as S covers its domain. The authors of [9] give a formal definition of uniform k-wise independent random variables, also known as k-universal hashing functions:

Definition 3. A family $\xi$ of ±1 random variables defined over the sample space I is k-wise uniform independent if for any k different instances of the family $\xi_{i_1}, \xi_{i_2}, \ldots, \xi_{i_k}$, and any k ±1 values $v_1, v_2, \ldots, v_k$, it holds:

$$\Pr[\xi_{i_1} = v_1 \wedge \xi_{i_2} = v_2 \wedge \cdots \wedge \xi_{i_k} = v_k] = \frac{1}{2^k} \tag{3-2}$$

3.1.2 Orthogonal Arrays

The problem of generating two-valued k-wise independent random variables is equivalent to the construction of orthogonal arrays. Precisely, ±1 k-wise independent random variables over the domain $I = \{0, 1, \ldots, N - 1\}$ form an orthogonal array with |S| runs, N variables, two levels and strength k, denoted OA(|S|, N, 2, k), where |S| is the size of the seed space. Since there exist clear relations defining the size of an orthogonal array, first we overview their main properties. Then, we extend these relations for characterizing the size of the seed space. Our presentation follows the approach in [36].

Definition 4. An $r \times c$ matrix A with entries from a set M is called an orthogonal array OA(r, c, m, t) with m levels, strength t and index $\lambda$ (for some t in the range $0 \le t \le c$) if every $r \times t$ sub-matrix of A contains each t-tuple based on M exactly $\lambda$ times as a row.

M is the set containing the elements $\{0, 1, \ldots, m - 1\}$, where m is the number of symbols or the number of levels. The number of rows r is known as the size of the array or the number of runs. c, the number of columns, represents the number of constraints, factors, or variables. A less formal way of defining an orthogonal array is to say that it is an array with the property that in any t columns you see equally often all possible t-tuples. When each tuple appears exactly once, i.e., $\lambda = 1$, we say that the array has index unity.

Example 2. Table 3-1 represents an orthogonal array OA(8, 4, 2, 3) that has index unity.
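Definition 4 translates directly into a brute-force check, sketched below. The array used here (three free bits followed by their XOR) is one standard way to obtain an OA(8, 4, 2, 3) of index unity; it is given only as an illustration and is not claimed to be the array shown in Table 3-1. The names `is_orthogonal_array` and `ARRAY` are hypothetical.

```python
from collections import Counter
from itertools import combinations, product

def is_orthogonal_array(rows, levels, strength):
    """Check that every choice of `strength` columns shows each tuple equally often."""
    runs = len(rows)
    columns = len(rows[0])
    index, remainder = divmod(runs, levels ** strength)
    if index == 0 or remainder != 0:
        return False  # the number of runs must be a positive multiple of levels**strength
    for cols in combinations(range(columns), strength):
        counts = Counter(tuple(row[j] for j in cols) for row in rows)
        if len(counts) != levels ** strength or set(counts.values()) != {index}:
            return False
    return True

# Illustrative array: 8 runs, 4 two-level columns, last column is the parity of the others.
ARRAY = [(a, b, c, a ^ b ^ c) for a, b, c in product((0, 1), repeat=3)]

print(is_orthogonal_array(ARRAY, levels=2, strength=3))
print(is_orthogonal_array(ARRAY, levels=2, strength=4))
```

The array passes the check for strength 3 and fails it for strength 4, since 8 runs cannot contain all 16 binary 4-tuples.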


Given the values of the parameters, the size of an orthogonal array is expressed in Rao's Inequalities. A particular form of these inequalities, for m = 2, was introduced in the context of k-wise independent random variables by [2]. We present the inequalities here and later we show how they are used for characterizing different construction schemes.

Theorem 9 ([36]). The parameters of an orthogonal array OA(r, c, m, t) satisfy the following inequalities:

$$r \ge \sum_{i=0}^{u} \binom{c}{i} (m - 1)^i, \quad \text{if } t = 2u,$$
$$r \ge \sum_{i=0}^{u} \binom{c}{i} (m - 1)^i + \binom{c - 1}{u} (m - 1)^{u+1}, \quad \text{if } t = 2u + 1, \tag{3-3}$$

for $u \ge 0$, $r \equiv 0 \bmod m^t$.

The problem of generating two-valued random variables that are k-wise independent is identical to constructing an orthogonal array OA(|S|, N, 2, k). In our particular case, we are interested in finding optimal constructions, that is, constructions that minimize the number of runs given fixed values for the number of variables, i.e., $N = 2^n$, and the number of levels, since such constructions reduce the size of the seed, thus the space to generate and store it. Also, we need general constructions that exist for any value of the strength, i.e., the degree of independence. The constructions based on BCH and Reed-Muller error-correcting codes are widely used. As stated in [36], the BCH arrays are the densest known for $N = 2^n$ variables, when n is odd. When n is even, there exist arrays with fewer runs based on Kerdock codes and, respectively, Delsarte-Goethals codes. However, these constructions exist only for even values of n and their advantage over the corresponding BCH arrays is minimal (one or two bits less for representing the number of runs), while overhead is incurred for computations in $\mathbb{Z}_4$, the ring of integers modulo 4. Both BCH and


Reed-Muller constructions touch the bounds given by Rao's Inequalities only for small values of k, e.g., 2, 3.

3.1.3 Abstract Algebra

As mentioned previously, ±1 random variables with limited independence can be obtained by generating {0, 1} random variables and mapping them to ±1. Since 0 and 1 are the only elements of the Galois Field with order 2, denoted by GF(2), abstract algebra [19] is the ideal framework in which to talk about generating families of random variables with limited independence. The field GF(2) has two operations: addition (boolean XOR) and multiplication (boolean AND). Abstract algebra provides two ways to extend the base field GF(2), vector spaces and extension fields, both of them useful for generating limited independence random variables.

$GF(2)^k$ Vector Spaces are spaces obtained by bundling together k dimensions, each with a GF(2) domain. The only operation we are interested in here is the dot-product between two vectors v and u, defined as: $v \cdot u = \bigoplus_{j=0}^{k-1} v_j u_j$. For $GF(2)^k$ vector spaces this corresponds to AND-ing the arguments and then XOR-ing all the bits in the result.

GF(p) Prime Fields are fields over the domain $\{0, 1, \ldots, p - 1\}$ with both the multiplication and the addition defined as the arithmetic multiplication and addition modulo the prime p.

$GF(2^k)$ Extension Fields are fields defined over the domain $\{0, 1, \ldots, 2^k - 1\}$ that have two operations: addition, +, with zero element 0, and multiplication, ·, with unity element 1. Both addition and multiplication have to be associative and commutative. Also, multiplication is distributive over addition. All the elements, except 0, must have an inverse with respect to the multiplication operation. The usual representation of the extension fields $GF(2^k)$ is as polynomials of degree $k - 1$ with the most significant bit as the coefficient for $X^{k-1}$, and the least significant as the constant term. The addition of two elements is simply the addition, term by term, of the corresponding polynomials. The multiplication is the polynomial multiplication modulo an irreducible polynomial of degree


k that defines the extension field. With this representation, the addition is simple (just XOR the bit representations), but the multiplication is more intricate since it requires both polynomial multiplication and division.

As we will see, a large number of schemes use dot-products in vector spaces. Dot-products can be implemented by simply AND-ing the two vectors and XOR-ing the resulting bits. While AND-ing entire words (integers) on modern architectures is extremely fast, XOR-ing the bits of a word (which has to be eventually performed) is problematic since no high-level programming language supports such an operation (this operation is actually the parity-bit computation). To speed up this operation, which is critical, we implemented it in Assembly for Pentium processors to take advantage of the supported 8-bit parity computation.

3.1.4 Bose-Chaudhuri-Hocquenghem Scheme (BCH)

For all the schemes, we assume the domain to be $I = \{0, \ldots, 2^n - 1\}$, for a generic n (the description of the schemes depends on n). We also make the convention that [a, b] is equivalent with the vector obtained by concatenating the vectors a and b. The size of a, b, and [a, b] is clear from the context.

The BCH scheme was first introduced in [2] and it is based on BCH error-correcting codes. This scheme can generate (2k+1)-wise independent random variables by using uniformly random seeds S that are kn + 1 bits in size. S can be represented as a vector $S = [s_0, S_0, \ldots, S_{k-1}]$, where $s_0$ is one bit and each $S_i$, $0 \le i$

densest seed requirement amongst all the known schemes. The proof that this scheme produces (2k+1)-wise independent families can be found in [2]. Implementing $i^{2k-1}$ over finite fields is problematic on modern processors if speed is paramount, but in the special case when only 3-wise independence is required this problem is avoided (see the speed comparison at the end of the section). The 3-wise independent version of this scheme is:

$$f(S, i) = S \cdot [1, i] \tag{3-5}$$

Example 3. We give an example that shows how the BCH schemes work. Consider that n = 16 and the seeds have the values: $s_0 = 1$, $S_0 = 7{,}469 = (0001110100101101)_2$, and $S_1 = 346 = (0000000101011010)_2$. We generate the 3- and 5-wise random variables corresponding to $i = 2{,}500 = (0000100111000100)_2$. Then, $i^3 = 15{,}625{,}000{,}000 = (1001010001000000)_2$, where the exponentiation is computed arithmetically, rather than over the extension field $GF(2^{16})$, and only the least significant 16 bits are kept.

$$S_0 \cdot i = \bigoplus_{j=0}^{15} \begin{pmatrix} 0001110100101101 \\ 0000100111000100 \end{pmatrix} = \bigoplus_{j=0}^{15} (0000100100000100) = 1 \tag{3-6}$$

$$S_1 \cdot i^3 = \bigoplus_{j=0}^{15} \begin{pmatrix} 0000000101011010 \\ 1001010001000000 \end{pmatrix} = \bigoplus_{j=0}^{15} (0000000001000000) = 1 \tag{3-7}$$

To compute $S_0 \cdot i$, first the AND of the two binary values is calculated. Second, the sequential XOR of the resulting bits gives the final result. The same holds for $S_1 \cdot i^3$. When implementing these schemes, the sequential XOR can be computed only at the end, after all the intermediate results are XOR-ed. The result for the 3-wise case is:

$$s_0 \oplus S_0 \cdot i = 1 \oplus 1 = 0 \tag{3-8}$$

while the result for the 5-wise case is:

$$s_0 \oplus S_0 \cdot i \oplus S_1 \cdot i^3 = 1 \oplus 1 \oplus 1 = 1 \tag{3-9}$$
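The computation in Example 3 can be followed step by step with the sketch below. It assumes, as in the example, that the seed is split into the bit $s_0$ and the n-bit words $S_0$ and $S_1$, and that $i^3$ is computed arithmetically and truncated to n bits; the function names are hypothetical and the parity is computed in plain Python rather than with the assembly routine mentioned in Section 3.1.3.

```python
def parity(word):
    """XOR of all the bits of a non-negative integer."""
    return bin(word).count("1") & 1

def dot(u, v):
    """GF(2) dot-product: AND the words, then XOR the resulting bits."""
    return parity(u & v)

def bch3(s0, S0, i):
    """3-wise scheme f(S, i) = S . [1, i] with S = [s0, S0]."""
    return s0 ^ dot(S0, i)

def bch5(s0, S0, S1, i, n):
    """5-wise variant of the example: i**3 is arithmetic, low n bits kept."""
    cube = (i ** 3) & ((1 << n) - 1)
    return s0 ^ dot(S0, i) ^ dot(S1, cube)

def xi(bit):
    """Map a generated bit f to the value (-1)**f."""
    return -1 if bit else 1

print(bch3(1, 7469, 2500))           # 0, as in Equation (3-8)
print(bch5(1, 7469, 346, 2500, 16))  # 1, as in Equation (3-9)
```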


3.1.5 Extended Hamming 3-Wise Scheme (EH3)

The Extended Hamming 3-wise scheme is a modification of BCH3 and it was introduced in [25]. It requires seeds S of size n + 1 and its generating function is defined as:

$$f(S, i) = S \cdot [1, i] \oplus h(i) \tag{3-10}$$

where h(i) is a nonlinear function of the bits of i. A possible form for h is:

$$h(i) = (i_0 \vee i_1) \oplus \cdots \oplus (i_{n-2} \vee i_{n-1}) \tag{3-11}$$

Function h does not change the amount of independence, thus, from the traditional AGMS sketches theory, it is not as good as a 4-wise independent scheme. As we will show in Section 3.2, the use of EH3 actually results in size of join AGMS sketches estimations as precise as a 4-wise independent scheme gives¹. From the point of view of a fast implementation, only a small modification has to be added to the implementation of BCH3: the computation of function h(i). As the experimental results show, there is virtually no running time difference between these schemes if a careful implementation is deployed on modern processors.

3.1.6 Reed-Muller Scheme

The BCH schemes require computations over extension fields for degrees of independence greater than 3. Since the AGMS sketches need 4-wise independent random variables, alternative schemes that require only simple computations might be desirable. The Reed-Muller scheme [36] generalizes the BCH codes in a different way in order to obtain higher degrees of independence. Seeds of size $1 + \binom{n}{1} + \cdots + \binom{n}{t}$ are required to obtain a degree of independence of $k = 2^{t+1} - 1$, t > 0. We introduce here only the 7-wise

¹ The first theoretical result on the benefits of EH3 is provided in [25]. Our results are significantly more general.


independent version of the scheme that requires $1 + n + \frac{n(n-1)}{2}$ seed bits:

$$f(S, i) = S \cdot [1, i, i^{(2)}] \tag{3-12}$$

where

$$i^{(2)} = [i_0 i_1, i_0 i_2, \ldots, i_{n-2} i_{n-1}] \tag{3-13}$$

Example 4. We exemplify the Reed-Muller scheme (RM7), i.e., t = 2, for n = 4. $s_0 = 0$, $S_0 = 11 = (1011)_2$, and $S_1 = 45 = (101101)_2$. $S_1$ contains $\binom{4}{2} = 6$ bits. The index variable is $i = 6 = (0110)_2$. $i^{(2)} = [0 \cdot 1, 0 \cdot 1, 0 \cdot 0, 1 \cdot 1, 1 \cdot 0, 1 \cdot 0] = (000100)_2$.

$$S_0 \cdot i = \bigoplus_{j=0}^{3} \begin{pmatrix} 1011 \\ 0110 \end{pmatrix} = \bigoplus_{j=0}^{3} (0010) = 1 \tag{3-14}$$

$$S_1 \cdot i^{(2)} = \bigoplus_{j=0}^{5} \begin{pmatrix} 101101 \\ 000100 \end{pmatrix} = \bigoplus_{j=0}^{5} (000100) = 1 \tag{3-15}$$

The final result is:

$$s_0 \oplus S_0 \cdot i \oplus S_1 \cdot i^{(2)} = 0 \oplus 1 \oplus 1 = 0 \tag{3-16}$$

An important fact about the Reed-Muller scheme is that the seeds required are significantly large. For example, for a domain of size $2^{32}$, RM7 needs seeds of $1 + 32 + \frac{32 \cdot 31}{2} = 529$ bits. For comparison, the BCH5 seeds are only $1 + 2 \cdot 32 = 65$ bits in size.

3.1.7 Polynomials over Primes Scheme

The generating schemes in the previous sections are derived from error-correcting codes. A different method to generate k-wise independent random variables uses polynomials over a prime number field. The generating function is defined as in [59]:

$$f(S, i) = \sum_{j=0}^{k-1} a_j i^j \bmod p \tag{3-17}$$


for some prime p > i with each $a_j$ picked randomly from $\mathbb{Z}_p$. The seed S consists of the k coefficients $a_j$, $0 \le j$

time consuming. In this case, it could be faster to pre-compute and store (tabulate) the random variables in memory, instead of generating them on the fly. When the random variable corresponding to an index i is needed, it is simply extracted from a look-up table. For large domains multiple tables are maintained, each corresponding to a different sub-domain. The random variable is obtained as a simple function of the tabulated values. The authors of [10] show that the function f mapping $i_0 i_1 \ldots i_{n-1}$ to $f_0[i_0] \oplus f_1[i_1] \oplus \cdots \oplus f_{n-1}[i_{n-1}]$ is 3-universal if each of the independent tabulated functions $f_j$ is 3-universal. This implies that in order to obtain a 3-universal hash function for a large sequence, it is enough to divide the sequence into sub-sequences and to tabulate 3-universal functions for each such sub-sequence. The value of the function is computed by XOR-ing the tabulated values corresponding to each sub-sequence. This result was subsequently extended to 4-universal hashing by [58]. They show that the function $f[ab] = f_0[a] \oplus f_1[b] \oplus f_2[a + b]$ is 4-universal if $f_0$, $f_1$, and $f_2$ are independent 4-universal hash functions. The tabulation method reduces the processing time, but increases the required memory. As long as the pre-computed tables fit in the fast memory, we expect an improvement in processing time. Since AGMS sketches require multiple instances of the estimator in order to provide the desired error guarantees, the space requirement could become a problem.

3.1.10 Performance Evaluation

We implemented BCH3, BCH5, EH3, and RM7, and we used the implementation in [58] for the polynomials over primes and the tabulation schemes. BCH5 was implemented by performing the $i^3$ operation arithmetically (not in an extension field). This does not change the degree of independence of BCH5 for domains not too large, but it significantly improves the performance. The polynomials scheme uses Mersenne primes in order to speed up the implementation. For the tabulation scheme, both 8 and 16 bits sub-sequences were tested, but only the best result is reported.
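The tabulation result quoted above from [10] can be checked exhaustively on a toy domain. The sketch below is an illustration under stated assumptions, not the implementation of [58]: keys have 4 bits split into two 2-bit sub-sequences, each table holds one fully random bit per sub-sequence value, and all names are hypothetical. It enumerates every possible table content and tests whether k distinct keys see all $2^k$ outcomes equally often.

```python
from itertools import combinations, product

CHUNK_BITS = 2                 # bits per sub-sequence (toy size, an assumption)
CHUNKS = 2                     # number of sub-sequences, so keys have 4 bits
TABLE_SIZE = 1 << CHUNK_BITS

def tabulated_bit(tables, key):
    """XOR the tabulated bits of every sub-sequence of `key`."""
    bit = 0
    for j in range(CHUNKS):
        chunk = (key >> (j * CHUNK_BITS)) & (TABLE_SIZE - 1)
        bit ^= tables[j][chunk]
    return bit

def all_table_sets():
    """Enumerate every possible content of the CHUNKS one-bit tables."""
    for bits in product((0, 1), repeat=CHUNKS * TABLE_SIZE):
        yield [bits[j * TABLE_SIZE:(j + 1) * TABLE_SIZE] for j in range(CHUNKS)]

def is_k_wise_independent(k):
    """True if every k distinct keys produce each of the 2**k outcomes equally often."""
    table_sets = list(all_table_sets())
    expected = len(table_sets) // (1 << k)
    for keys in combinations(range(1 << (CHUNK_BITS * CHUNKS)), k):
        counts = {}
        for tables in table_sets:
            outcome = tuple(tabulated_bit(tables, key) for key in keys)
            counts[outcome] = counts.get(outcome, 0) + 1
        if len(counts) != (1 << k) or any(c != expected for c in counts.values()):
            return False
    return True

print(is_k_wise_independent(3), is_k_wise_independent(4))
```

On this toy domain the plain XOR of two tables passes the 3-wise test and fails the 4-wise one, which is consistent with the need for the additional table $f_2[a + b]$ in the 4-universal construction.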


We ran our experiments on a two-processor Xeon 2.80 GHz machine³, each processor with a 512 KB cache. The system has 1 GB available memory and uses a Fedora Core 3 operating system. As a comparison for our results, the time to read a word from a memory location that is not cached takes about 250 ns on this machine. We used the special assembly implementation of the dot-product for all the methods. Experiments consisted in generating 10,000 variables i and 10,000 seeds and then computing all possible combinations of random variables, i.e., 100,000,000. Each experiment was run 100 times and the average of the results is reported. The relative error (relative error = |true value - experimental value| / true value) of a run was in general under 1%, with a maximum of 1.2%. Since the results are so stable, we do not report the actual error for individual experiments.

The generation time, in nanoseconds per random variable, is reported in Table 3-3. When compared to the memory random access time, all the schemes, except RM7, are much faster. Indeed, EH3 is at least as fast as BCH3 (we believe it is actually faster since the extra operations maintain a better flow through the processor pipelines) and both are significantly faster than the polynomials scheme. The tabulation scheme is the fastest both for 2-wise and 4-wise independent random variables since the tables fit in the cache. Notice that EH3 and BCH5 give close results to the tabulation scheme although they generate the value of the random variables from scratch.

Table 3-3 also contains the size of the seed for the given schemes. n represents the number of bits the domain I can be represented on. For the polynomials over primes scheme, n is the smallest power of 2 for which $2^n \ge p$. As noted, the BCH schemes have the densest seeds, while the Reed-Muller scheme needs the largest seed. For the same degree of independence, the polynomials over primes scheme requires a seed double in size compared to BCH. The tabulation scheme does not store seeds. Instead it stores the

³ While the machine has two processors, only one of them was used in our experiments.


values of the random variables for non-overlapping sub-domains. Thus, it has the most stringent memory requirement of all the schemes.

When deciding what scheme to use, the tradeoffs involved should be analyzed. While the tabulation scheme is the fastest, it also requires the largest amount of memory. For AGMS sketches estimations this could become a drawback. The BCH schemes have the least restrictive memory requirement while being quite fast. We will see that for fast range-summation (efficient sketching of intervals) a real generating scheme gives the best results. With a tabulation scheme all the points inside an interval have to be independently sketched, a solution that we try to improve.

3.2 Size of Join using AGMS Sketches

Using the current understanding of the AGMS sketches applied to the size of join problem, the 4-wise independence of the random variables is required in order to keep the variance small (we show why later in the section). What we show in this section is that the Extended Hamming (EH3) scheme [25] not only can be used as a reasonable replacement of the 4-wise independent schemes (this is the theoretical result proved in [25] for the L1-difference of two streaming functions), but it is usually as good, in absolute (big-O notation) terms, and in certain situations significantly surpasses the 4-wise independent schemes for AGMS sketches solutions to the size of join problem. We provide here both the theoretical support for this statement and the empirical evidence.

We proceed by taking a close look at the estimation of the size of join of two relations using AGMS sketches. As already presented in Section 2.2, the size of join $|F \bowtie_A G|$ of two relations F and G with a common attribute A is defined as $|F \bowtie_A G| = \sum_{i \in I} f_i g_i$. To estimate this quantity, we use a ±1 family of random variables $\{\xi_i \mid i \in I\}$ that are at least 2-wise independent but not necessarily more. We define the sketches, one for each relation, as $X_F = \sum_{i \in I} f_i \xi_i$ and $X_G = \sum_{i \in I} g_i \xi_i$. The random variable $X = X_F X_G$ estimates, on expectation, the size of join $|F \bowtie_A G|$. It is easy to show that, due to the 2-wise independence, $E[X] = \sum_{i \in I} f_i g_i$, which is exactly the size of join. Since all the


generating schemes published in the literature (see Section 3.1 for a review of the most important ones) are 2-wise independent, they all produce estimates that are correct on average. The main difference between the schemes is the variance of $X$, $Var[X]$. The smaller the variance, the better the estimation; in particular, the error of the estimate is proportional with $\frac{\sqrt{Var[X]}}{E[X]}$.

We analyze now the variance of $X$. Following the technique in [3], we first compute $E[X^2]$ since $Var[X] = E[X^2] - E[X]^2$. We have:

$$E[X^2] = E\left[\left(\sum_{i \in I} f_i \xi_i \sum_{i \in I} g_i \xi_i\right)^2\right] = \sum_{i \in I} \sum_{j \in I} \sum_{k \in I} \sum_{l \in I} f_i f_j g_k g_l \, E[\xi_i \xi_j \xi_k \xi_l] \tag{3-18}$$

The expression $E[\xi_i \xi_j \xi_k \xi_l]$ is equal to 1 if groups of two variables are the same (i.e., $i = j \wedge k = l$ or $i = k \wedge j = l$ or $i = l \wedge j = k$) since $\xi_i^2 = 1$ irrespective of the actual value of $\xi_i$ ($1^2 = 1$ and $(-1)^2 = 1$). $E[\xi_i \xi_j \xi_k \xi_l]$ is equal to 0 if two variables are identical, say $i = j$, and two are different, since then $E[\xi_i \xi_j \xi_k \xi_l] = E[1 \cdot \xi_k \xi_l] = 0$, using the 2-wise independence property. The same is true when three of the variables are equal, say $i = j = k$, but the fourth one is not, since $E[\xi_i \xi_j \xi_k \xi_l] = E[\xi_i^3 \xi_l] = E[\xi_i \xi_l] = 0$ (we use the fact that $\xi_i^3 = \xi_i$). The above observations are not dependent on what generating scheme is used, as long as it is at least 2-wise independent. The contribution of the 1 values to $E[X^2]$ adds up to:

$$\sum_{i \in I} \sum_{j \in I} f_i^2 g_j^2 + 2\left(\sum_{i \in I} f_i g_i\right)^2 - 2\sum_{i \in I} f_i^2 g_i^2 \tag{3-19}$$

In the expression of the variance $Var[X]$, the middle term has the coefficient 1, instead of 2, since $E[X]^2$ is subtracted. The difference between the various generating schemes consists in what other terms have to be added to the above formula. The additional terms correspond only to the values of $i, j, k, l$ that are all different (otherwise the contribution is 1 or 0 irrespective of the scheme, as explained before). We explore what are the additional terms for three of the schemes, namely BCH5, BCH3, and EH3.
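The pairing argument that leads to the scheme-independent terms can be checked numerically on a tiny domain. The following is only a sketch under stated assumptions: it uses fully independent ±1 signs (a stand-in that is in particular 4-wise independent, so no additional terms appear), enumerates every sign assignment, and compares the exact variance of $X = X_F X_G$ with the paired-index terms after $E[X]^2$ is subtracted. The function names and the frequency vectors are hypothetical and chosen only for illustration.

```python
from itertools import product


def exact_moments(f, g):
    """Exact E[X] and Var[X] of X = X_F * X_G, obtained by enumerating
    all 2^|I| assignments of fully independent +/-1 signs.
    Assumption: full independence stands in for a 4-wise scheme."""
    values = []
    for xi in product((1, -1), repeat=len(f)):
        x_f = sum(fi * s for fi, s in zip(f, xi))
        x_g = sum(gi * s for gi, s in zip(g, xi))
        values.append(x_f * x_g)
    mean = sum(values) / len(values)
    var = sum(v * v for v in values) / len(values) - mean ** 2
    return mean, var


def paired_terms_variance(f, g):
    """Paired-index contribution to E[X^2] minus E[X]^2, i.e. the
    middle term carries coefficient 1 instead of 2."""
    sum_f2 = sum(fi * fi for fi in f)
    sum_g2 = sum(gi * gi for gi in g)
    join = sum(fi * gi for fi, gi in zip(f, g))
    sum_f2g2 = sum(fi * fi * gi * gi for fi, gi in zip(f, g))
    return sum_f2 * sum_g2 + join ** 2 - 2 * sum_f2g2


f = [3, 1, 4, 1, 5]  # hypothetical frequencies of relation F
g = [2, 7, 1, 8, 2]  # hypothetical frequencies of relation G
mean, var = exact_moments(f, g)
true_join = sum(fi * gi for fi, gi in zip(f, g))
```

Under these assumptions `mean` coincides with `true_join` and `var` coincides with `paired_terms_variance(f, g)`; a scheme that is only 3-wise independent would add further terms on top of this value.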


3.2.1 Variance for BCH5

The BCH5 scheme is 4-wise independent, which means that for $i \ne j \ne k \ne l$, we have:

$$E[\xi_i \xi_j \xi_k \xi_l] = E[\xi_i] E[\xi_j] E[\xi_k] E[\xi_l] = 0 \tag{3-20}$$

since all the $\xi$s are independent and $E[\xi_i] = 0$ for all generators. This means that no other terms are added to the above variance. The following formula results:

$$Var[X]_{BCH5} = \sum_{i \in I} \sum_{j \in I} f_i^2 g_j^2 + \left(\sum_{i \in I} f_i g_i\right)^2 - 2\sum_{i \in I} f_i^2 g_i^2 \tag{3-21}$$

3.2.2 Variance for BCH3

BCH3 is only 3-wise independent, thus clearly it is not the case that, for all $i \ne j \ne k \ne l$, $E[\xi_i \xi_j \xi_k \xi_l] = 0$. To characterize the value of the expectation for different variables, the following results are extensively applied.

Proposition 1. Consider the function $XOR(x_1, x_2, \ldots, x_n) = x_1 \oplus x_2 \oplus \cdots \oplus x_n$ defined over $n$ binary variables $x_1, x_2, \ldots, x_n$. Then, the function XOR takes the values 0 and 1 equally often, each value $2^{n-1}$ times.

Proof. We prove this proposition by induction on $n$.

Base case: $n = 2$. The truth table for the XOR function with two variables is given in Table 3-2. Function XOR takes the values 0 and 1 the same number of times, $2^{2-1} = 2$.

Inductive case: We suppose that the proposition is true for $n$ and we have to show that it is also true for $n + 1$, that is, function $x_1 \oplus x_2 \oplus \cdots \oplus x_n \oplus x_{n+1}$ takes both values 0 and 1 $2^n$ times each. For every combination of the first $n$ variables, $x_{n+1}$ takes one time the value 0 and one time the value 1. When $x_{n+1} = 0$, function XOR of $n + 1$ variables is identical with function XOR of $n$ variables. This means that it takes the values 0 and 1 $2^{n-1}$ times each. The effect of $x_{n+1}$ being equal with 1 is to invert the result of the function XOR with $n$ variables, that is, the combinations that gave result 0, give 1 now, and the combinations that gave 1 for $n$ variables, give 0 for $n + 1$ variables. As we already


know that the number of combinations for both cases is equal to $2^{n-1}$ each, by summing the results for $x_{n+1} = 0$ and $x_{n+1} = 1$, we obtain that function XOR with $n + 1$ variables takes the values 0 and 1 $2^n$ times each.

Proposition 2. Let the function $F$ on $n$ binary variables $x_1, x_2, \ldots, x_n$ be defined as:

$$F(x_1, x_2, \ldots, x_n) = C \oplus S_1 x_1 \oplus S_2 x_2 \oplus \cdots \oplus S_n x_n \tag{3-22}$$

where $C \in \{0, 1\}$ is a constant and $S_k \in \{0, 1\}$, for $1 \le k \le n$, are parameters. If there exists at least one $S_k$ that is equal to 1, function $F$ takes the values 0 and 1 equally often, each value $2^{n-1}$ times. Otherwise, function $F$ evaluates to the constant $C$ for all the combinations of $x_1, x_2, \ldots, x_n$.

Proof. Let $S_{k_1}, S_{k_2}, \ldots, S_{k_r}$, $0 \le r$

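With this, we obtain:

$$E[\xi_i \xi_j \xi_k \xi_l] = E\left[(-1)^{s_0 \oplus S_1 \cdot i} (-1)^{s_0 \oplus S_1 \cdot j} (-1)^{s_0 \oplus S_1 \cdot k} (-1)^{s_0 \oplus S_1 \cdot l}\right] = E\left[(-1)^{S_1 \cdot i \oplus S_1 \cdot j \oplus S_1 \cdot k \oplus S_1 \cdot l}\right] = E\left[(-1)^{S_1 \cdot (i \oplus j \oplus k \oplus l)}\right] \tag{3-25}$$

Using the result in Proposition 2 with $C = 0$ and $S_1, \ldots, S_n$ set as the last $n$ bits of the seed $S$, we know that the expression $S_1 \cdot (i \oplus j \oplus k \oplus l)$ takes the values 0 and 1 equally often for random seeds $S_1$ when $i \oplus j \oplus k \oplus l \ne 0$; this immediately implies that $E[\xi_i \xi_j \xi_k \xi_l] = 0$. When $i \oplus j \oplus k \oplus l = 0$, $S_1 \cdot (i \oplus j \oplus k \oplus l) = 0$ irrespective of $S_1$, giving $E[\xi_i \xi_j \xi_k \xi_l] = 1$.

This implies that the variance for BCH3 contains an additional term besides the ones that appear in the variance formula for BCH5. The extra term has the following form:

$$\Delta Var[BCH3] = \sum_{i \in I} \sum_{j \in I, j \ne i} \sum_{k \in I, k \ne i, j} f_i f_j g_k g_{i \oplus j \oplus k} \tag{3-26}$$

since $i \oplus j \oplus k \oplus l = 0$ implies $l = i \oplus j \oplus k$. The additional term in the BCH3 variance can be significantly large, implying a big increase over the variance of BCH5. However, as we will see in the experimental section, the influence of the extra term almost vanishes for high-skew data and the variance of BCH3 becomes comparable with the variance of BCH5.

Example 5. In order to give a clear image of the values $\Delta Var[BCH3]$ can take, we provide two extreme examples. First, consider that both relations $F$ and $G$ have uniform distributions with the value $f$, respectively $g$. This gives:

$$\Delta Var[BCH3]_{uniform} = f^2 g^2 \sum_{i \in I} \sum_{j \in I, j \ne i} \sum_{k \in I, k \ne i, j} 1 \propto f^2 g^2 |I|^3 \tag{3-27}$$

value that is an order of $|I|$ greater than the BCH5 variance, thus dominating it and increasing the BCH3 variance to possibly extreme high values. Second, consider that both $F$ and $G$ are very skewed. Actually, there exists only one positive frequency in $F$ ($f$),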


respectively $G$ ($g$). The extra variance becomes:

$$\Delta Var[BCH3]_{skewed} = fg < f^2 g^2 \tag{3-28}$$

which is smaller than the corresponding BCH5 variance, thus it can be ignored.

3.2.3 Variance for EH3

As BCH3, the Extended Hamming scheme (EH3) is also 3-wise independent, which might suggest that the variance of EH3 is similar to the variance of BCH3 (extra terms not in the variance of BCH5 have to appear, otherwise the scheme would be 4-wise independent). As we show next, even though only positive terms appeared in the variance for BCH3, in the EH3 variance negative terms appear as well. These negative terms, in certain circumstances, can compensate completely for the positive terms and give a variance that becomes zero.

Lemma 2. Assume the $\xi$s are generated using the EH3 scheme. Then, for $i \ne j \ne k \ne l$,

$$E[\xi_i \xi_j \xi_k \xi_l] = \begin{cases} 0, & \text{if } i \oplus j \oplus k \oplus l \ne 0 \\ 1, & \text{if } i \oplus j \oplus k \oplus l = 0 \wedge h(i) \oplus h(j) \oplus h(k) \oplus h(l) = 0 \\ -1, & \text{if } i \oplus j \oplus k \oplus l = 0 \wedge h(i) \oplus h(j) \oplus h(k) \oplus h(l) = 1 \end{cases} \tag{3-29}$$

where $\oplus$ is the bitwise XOR.

Proof. We know that

$$\xi_i = (-1)^{s_0 \oplus S_1 \cdot i \oplus h(i)} \tag{3-30}$$

for random variables generated using the EH3 scheme. Replacing this form into the expression for $E[\xi_i \xi_j \xi_k \xi_l]$ and applying the same manipulations as in the proof of


Proposition 1, we get:

$$E[\xi_i \xi_j \xi_k \xi_l] = E\left[(-1)^{S_1 \cdot (i \oplus j \oplus k \oplus l) \oplus h(i) \oplus h(j) \oplus h(k) \oplus h(l)}\right] \tag{3-31}$$

If we use again Proposition 2 with $C = h(i) \oplus h(j) \oplus h(k) \oplus h(l)$ and $S_1, \ldots, S_n$ set as the last $n$ bits of the seed $S$, we first observe that the expectation is 0 when $i \oplus j \oplus k \oplus l \ne 0$. When $i \oplus j \oplus k \oplus l = 0$, the expected value is always $(-1)^C$. For BCH3 the constant $C$ always took the value 0, thus the expectation in that case was always 1. For EH3, the value of $C$ depends on the function $h$; it is 1 when $h(i) \oplus h(j) \oplus h(k) \oplus h(l) = 1$. This implies the value $-1$ for $E[\xi_i \xi_j \xi_k \xi_l]$.

Using the above result, we observe that $E[\xi_i \xi_j \xi_k \xi_l] = -1$ when $i \oplus j \oplus k \oplus l = 0$ and $h(i) \oplus h(j) \oplus h(k) \oplus h(l) = 1$. This means that, even though EH3 can have all the 1 terms BCH3 has, it can also have $-1$ terms, thus it can potentially compensate for the 1 terms. Indeed, the following results show that this is exactly what happens under certain independence assumptions.

In order to predict the performance of EH3, we perform an average-case analysis of the scheme. Since AGMS sketches are randomized schemes, the average-case analysis is a more appropriate characterization than a worst-case analysis. The authors of [25] provide a worst-case analysis of EH3 for the particular case of computing the $L_1$-difference of two functions. For the size of join problem, the situation is more complicated because the frequencies have to be considered too and the parameter we are interested in is the self-join size of each relation. To obtain an average analysis, consider first the theorem:

Theorem 10. For $i, j, k$ taking all the possible values over the domain $I = \{0, \ldots, 4^n - 1\}$, $n > 0$, the function $g(i, j, k) = h(i) \oplus h(j) \oplus h(k) \oplus h(i \oplus j \oplus k)$ takes the value 0 $z_n$ times


and the value 1 $y_n$ times, where $z_n$ and $y_n$ are given by the following recursive equations:

$$z_1 = 40, \quad y_1 = 24$$
$$z_n = 40 z_{n-1} + 24 y_{n-1}$$
$$y_n = 24 z_{n-1} + 40 y_{n-1}$$

Proof. The base case, $n = 1$, can be easily verified by hand. The recursion is based on the observation that the groups of two bits from different $h$ functions interact independently and give 40 zero values and 24 one values. When the results of two groups are XOR-ed, a zero is obtained if both groups are zero or both are one; a one is obtained if a group equals zero and the other group equals one.

We have to characterize the behavior of function $g(i, j, k)$ for $i \ne j \ne k$ (when at least two of these variables are equal, we obtain the special case of the variance for BCH5 that we have already considered). The number of times at least two out of the three variables are equal is $eq_n = 3 \cdot 4^{2n} - 2 \cdot 4^n$ (that is, $|I|^3 - |I|(|I| - 1)(|I| - 2)$ with $|I| = 4^n$), which allows us to determine the desired quantities, i.e., the number of zeros is $z_n - eq_n$, while the number of ones is $y_n$.

To determine the average behavior of EH3, we assume that neither the frequencies in $F$ are correlated with the frequencies in $G$ nor the frequencies in $G$ are correlated amongst themselves. This allows us to model the quantity $l = i \oplus j \oplus k$ as a uniformly distributed random variable $L$ over the domain $I$. Moreover, due to the same independence assumptions, function $g(i, j, k)$ can be modeled as a random variable $G$, independent of $L$, that has the same macro behavior as $g(i, j, k)$, i.e., it takes the values 0 and 1 the same number of times. With these random variables, we get:

$$E[g_L] = \frac{1}{|I|} \sum_{l \in I} g_l \tag{3-32}$$

and

$$E\left[(-1)^G\right] = 1 \cdot P[G = 0] + (-1) \cdot P[G = 1] = \frac{z_n - eq_n - y_n}{z_n - eq_n + y_n} \tag{3-33}$$
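The recursion of Theorem 10 can be cross-checked by brute force on small domains. The sketch below is an illustration only and rests on one assumption that is not restated in this section: the nonlinear EH3 function is taken to be $h(i) = (i_0 \vee i_1) \oplus (i_2 \vee i_3) \oplus \cdots$, i.e., the XOR over consecutive bit pairs of the OR of the two bits. The names `h`, `count_g`, and `recursion` are hypothetical helpers introduced for the example.

```python
from itertools import product


def h(i, pairs):
    """Assumed EH3 nonlinear function: XOR, over `pairs` consecutive
    two-bit groups of i, of the OR of the two bits in each group."""
    result = 0
    for _ in range(pairs):
        result ^= 1 if (i & 0b11) else 0
        i >>= 2
    return result


def count_g(pairs):
    """Count how often g(i, j, k) = h(i)^h(j)^h(k)^h(i^j^k) equals 0
    and 1 when i, j, k range over I = {0, ..., 4^pairs - 1}."""
    size = 4 ** pairs
    hv = [h(i, pairs) for i in range(size)]
    zeros = ones = 0
    for i, j, k in product(range(size), repeat=3):
        if hv[i] ^ hv[j] ^ hv[k] ^ hv[i ^ j ^ k]:
            ones += 1
        else:
            zeros += 1
    return zeros, ones


def recursion(pairs):
    """z_n and y_n computed from the recursive equations."""
    z, y = 40, 24
    for _ in range(pairs - 1):
        z, y = 40 * z + 24 * y, 24 * z + 40 * y
    return z, y


counts_1 = count_g(1)  # domain of size 4
counts_2 = count_g(2)  # domain of size 16
```

With the assumed form of $h$, the brute-force counts for one and two bit pairs agree with the values produced by `recursion`, which is consistent with the independent interaction of the two-bit groups described in the proof.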


The expected value of the additional terms that appear in the variance of EH3 is given by:

$$E[\Delta Var[EH3]] = E\left[\sum_{i \in I} \sum_{j \in I} \sum_{k \in I} f_i f_j g_k (-1)^G g_L\right] = \sum_{i \in I} \sum_{j \in I} \sum_{k \in I} f_i f_j g_k \, E\left[(-1)^G\right] E[g_L] = \frac{1}{|I|} \left(\sum_{i \in I} f_i\right)^2 \left(\sum_{i \in I} g_i\right)^2 \frac{z_n - eq_n - y_n}{z_n - eq_n + y_n} \tag{3-34}$$

Overall, the variance of the EH3 generating scheme is:

$$Var[X]_{EH3} = \frac{1}{|I|} \left(\sum_{i \in I} f_i\right)^2 \left(\sum_{i \in I} g_i\right)^2 \frac{z_n - eq_n - y_n}{z_n - eq_n + y_n} + Var[X]_{BCH5} \tag{3-35}$$

Notice that the additional term over the variance for BCH5 is inversely proportional with the size of the domain $I$. Also, the last factor in the additional term takes small sub-unitary values. The combined effect of these two is to drastically decrease the influence of the extra term on the EH3 variance, making it close to the BCH5 variance. Actually, there exist situations for which the EH3 variance is significantly smaller than the BCH5 variance. It can even become equal to zero. The following corollary states this surprising result:

Corollary 3. If $f_i = f$ and $g_i = g$, $i \in I$, with $f$ and $g$ constants, and $|I| = 4^n$, then:

$$Var[X]_{EH3} = 0 \tag{3-36}$$

The reason the variance of EH3 is zero when the distribution of both $F$ and $G$ is uniform is the fact that the $-1$ terms cancel out the 1 terms entirely. For less extreme cases, when the distribution of the two relations is close to a uniform distribution, EH3 significantly outperforms BCH5. This intuition is confirmed by the following experimental results.
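The zero-variance behavior for uniform frequencies can be observed directly by enumerating every seed on a small domain. This is a minimal sketch, not the implementation used in the experiments: it assumes the EH3 variable has the form $\xi_i = (-1)^{s_0 \oplus S_1 \cdot i \oplus h(i)}$ with the dot product taken modulo 2, and it assumes $h$ is the XOR over consecutive bit pairs of the OR of the two bits. All function names and the constants `F_CONST` and `G_CONST` are hypothetical.

```python
def h(i, pairs):
    """Assumed EH3 nonlinear function over `pairs` two-bit groups."""
    result = 0
    for _ in range(pairs):
        result ^= 1 if (i & 0b11) else 0
        i >>= 2
    return result


def parity(x):
    """Parity of the set bits of x, i.e. a dot product modulo 2."""
    return bin(x).count("1") & 1


def xi(i, s0, s1, pairs):
    """Assumed EH3 variable: (-1)^(s0 xor S1.i xor h(i))."""
    return -1 if (s0 ^ parity(s1 & i) ^ h(i, pairs)) else 1


def uniform_estimates(f, g, pairs):
    """Size of join estimates X = X_F * X_G for every seed (s0, S1),
    with uniform frequencies f and g over a domain of size 4^pairs."""
    size = 4 ** pairs
    estimates = []
    for s0 in (0, 1):
        for s1 in range(size):
            signs = [xi(i, s0, s1, pairs) for i in range(size)]
            x_f = sum(f * s for s in signs)
            x_g = sum(g * s for s in signs)
            estimates.append(x_f * x_g)
    return estimates


F_CONST, G_CONST = 3, 5  # hypothetical uniform frequencies
estimates_16 = uniform_estimates(F_CONST, G_CONST, pairs=2)
true_join_16 = F_CONST * G_CONST * 16
```

Under these assumptions every seed yields the same estimate, equal to the true size of join, so the variance over the seeds is zero on this domain.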


3.2.4 Empirical Evaluation

The purpose of the empirical evaluation is twofold. First, we want to validate the theoretical models for the variance of different generating schemes, especially the average behavior of EH3. Then, we want to compare the BCH3, EH3, and BCH5 schemes for estimations using AGMS sketches. The main findings of our experimental study can be summarized as follows:

• Validation of the theoretical model for the EH3 generating scheme. Our study shows that the behavior of the EH3 generating scheme is well predicted by the theoretical model in Section 3.2.

• BCH3 vs EH3 vs BCH5. EH3 has approximately the same error, or, in the case of low-skew distributions, a significantly better error than the BCH5 scheme. This justifies our recommendation to use EH3 instead of BCH5. For high-skew data, BCH3 has approximately the same error as BCH5 and EH3, making it the perfect solution for sketching when the data is skewed.

We performed the experiments using the setup in Section 3.1. We give detailed descriptions of the datasets and the comparison methodology for each group of experiments.

Validation of the EH3 Model: To validate the average error formula in Equation 3-35, we generated Zipf distributed data with the Zipf coefficient ranging from 0 to 5 over domains of various sizes in order to estimate the self-join size. The prediction is performed using AGMS sketches with only one median, i.e., only averaging is used to decrease the error of the estimate. In Figure 3-1 we depict the comparison between the average error of the EH3 scheme and the theoretical prediction given by Equation 3-35 for a domain with the size 16,384 and a relation containing 100,000 tuples. Notice that, when the value of the Zipf coefficient is larger than 1, the prediction is accurate. When the Zipf coefficient is between 0 and 1, the error of EH3 is much smaller (it is zero for a uniform distribution). This is explained in Corollary 3, which states that the variance of EH3 is zero when the distribution of the data is uniform and the size of the domain is a power of 4. For completeness, we depict the comparison between the theoretical model and the experimental results for the BCH5 scheme in Figure 3-2.


BCH3 vs EH3 vs BCH5: We performed the same experiments as in the previous section for BCH3, EH3, and BCH5, except that the number of medians was fixed to 10. The results of the experiments are depicted in Figure 3-3 (full-size picture) and Figure 3-4 (detailed picture focusing on EH3 and BCH5). Notice that the errors of BCH3, EH3, and BCH5 are virtually the same for Zipf coefficients greater than 1. While the EH3 error is significantly smaller for Zipf coefficients lower than 1, the BCH3 error can get extremely high values (100% relative error). These results confirm our theoretical characterizations both for BCH3 and EH3 schemes. When compared to the results in Figure 3-1, the errors are smaller by a factor of 3. This is due to the fact that 10 medians were used instead of only 1 and the medians have almost the same effect in reducing the error as the averages; the same observation was made in [18].

Given the theoretical and the experimental results in this section, and the fact that EH3 can be implemented more efficiently than BCH5, we recommend the exclusive use of the EH3 random variables for size of join estimations using AGMS sketches.

3.3 Conclusions

In this chapter we conducted both a theoretical as well as an empirical study of the various schemes used for the generation of the random variables that appear in AGMS sketches estimations. We provide theoretical and empirical evidence that EH3 can replace the 4-wise independent schemes for the estimation of the size of join using AGMS sketches. The main recommendation of this chapter is to use the EH3 random variables for AGMS sketches estimations of the size of join since they can be generated more efficiently and use smaller seeds than any of the 4-wise independent schemes. At the same time, the error of the estimate is as good as or, in the case when the distribution has low skew, better than the error provided by a 4-wise independent scheme.


Table 3-1. Orthogonal array OA(;4;2;3).

0 0 0 0
0 0 1 1
0 1 0 1
0 1 1 0
1 0 0 1
1 0 1 0
1 1 0 0
1 1 1 1

Table 3-2. Truth table for the function $x_1 \oplus x_2$.

$x_1$   $x_2$   $x_1 \oplus x_2$
0       0       0
0       1       1
1       0       1
1       1       0

Table 3-3. Generation time and seed size.

Scheme          Time (ns)   Seed size
BCH3            10.8        $n + 1$
EH3             7.3         $n + 1$
Polynomials2    31.4        $2n$
Tabulation2     5.1         $2^b \frac{n}{b}$
BCH5            12.7        $2n + 1$
Polynomials4    106.4       $4n$
RM7             3,301       $1 + n + \frac{n(n-1)}{2}$
Tabulation4     10.3        $2^b \left(\frac{n}{b}\right)^{\log_2 3}$


Figure 3-1. EH3 error.

Figure 3-2. BCH5 error.


Figure 3-3. Scheme comparison (full).

Figure 3-4. Scheme comparison (detail).


CHAPTER 4
SKETCHING SAMPLED DATA STREAMS

Sampling is used as a universal method to reduce the running time of computations: the computation is performed on a much smaller sample and then the result is only adjusted. Sketches are a popular approximation method for data streams and they proved to be useful for estimating frequency moments and aggregates over joins. A possibility to further improve the time performance of sketches is to compute the sketch over a sample of the stream rather than the entire data stream. In this chapter we analyze the behavior of the sketch estimator when computed over a sample of the stream, not the entire data stream, for the size of join and the self-join size problems. Our analysis is developed for a generic sampling process. We instantiate the results of the analysis for two particular types of sampling (Bernoulli sampling, which is used for load shedding, and sampling with replacement, which is used to generate i.i.d. samples from a distribution) and compare these particular results with the results of the basic sketch estimator. Our experimental results show that the accuracy of the sketch computed over a small sample of the data is in general close to the accuracy of the sketch estimator computed over the entire data even when the sample size is only 1% of the dataset size. This is equivalent to a speed-up factor of 100 when updating the sketch.

Data streaming received a lot of attention from the research community in the last decade. The requirement to process fast data streams motivates the need for approximation methods that make use of both small space and small time. AGMS sketches [3, 4] and their improved variant F-AGMS sketches [14, 52] proved to be a viable solution for estimating aggregates over joins. The main strengths of the sketching techniques are the simple and fast update procedure, the small memory requirement, and provable error guarantees. When the data streams that need to be processed are extremely fast, for example in the case of networking data or large datasets streamed over the Internet, it is desirable to further reduce the update time of sketches in order to


achieve the required processing rates. Sampling is a universal method for data reduction and, in principle, it can be used to reduce the amount of data that needs to be sketched. If samples are sketched instead of the original data, an immediate update time reduction results. This is similar to the existing load shedding techniques employed in data stream processing engines [56]. The main concern when samples rather than the original data are sketched is how to extend the error guarantees sketches provide to this new situation. The formulas resulting from such an analysis could be used to determine how aggressive the load shedding can be without a significant loss in the accuracy of the sketch over samples estimator. In other words, how much can the update time be improved without a significant accuracy degradation.

A seemingly unrelated, but, as shown in the chapter, technically related, problem is analyzing streams of samples from unknown distributions. Samples from unknown distributions (the so-called i.i.d. samples) are the input to most of the online data-mining algorithms [21]. In this case the samples are not used as a data reduction technique, but rather they are the only information available about the unknown distribution. A fundamental problem in this context is how to characterize the unknown distribution using only the samples. This is one of the fundamental problems in statistics [54]. If the samples are streamed, as is the case in online data-mining, the aim is to characterize the unknown distribution by using small space only, thus making sketches a natural candidate for computations that involve aggregates. It is a simple matter to use sketches in order to estimate aggregates of the samples. If predictions about the unknown distribution need to be made, the problem is significantly more difficult. Interestingly, this problem is mathematically similar to the load shedding problem in which sampling is used to reduce the update time of sketching. Both problems require the analysis of sketching samples: Bernoulli sampling in the case of load shedding and sampling with replacement in the case of characterizing unknown distributions from samples.


In this chapter we analyze the sketch over samples estimator for a generic sampling process. Then we instantiate the results for Bernoulli sampling and sampling with replacement. Our technical contributions are:

• We analyze the sampling methods in the frequency domain. This is a building block necessary to combine the analysis of sketches and sampling.

• We provide a generic analysis of the sketch over samples estimator. The analysis consists in expressing the first two frequency moments of the estimator in terms of the moments of the sampling frequency random variables.

• We instantiate the generic analysis to derive formulas for sketching Bernoulli samples. This immediately indicates how random load shedding for sketching data streams behaves.

• We instantiate the generic analysis to derive formulas for sketching samples with replacement from a large population. The analysis readily generalizes to sketching i.i.d. samples from an unknown distribution. The ability to sketch i.i.d. data is important if sketches are to be used for data-mining applications.

• We present empirical evidence that the analysis is necessary since the error of the sketch over samples estimator is not simply the sum of the errors of the two individual estimators. The interaction, which is predicted by the analysis, plays a major role. The experiments also point out that in the majority of the cases a 1% sample results in minimal error degradation; the sketching of streams can thus be sped up by a factor of 100.

There exists a large body of work on approximate query processing methods. The idea of combining two estimators to capitalize on the strengths of both is not new. F-AGMS sketches [14] are essentially a combination of random histograms and AGMS sketches. [30] presents a method to build incremental histograms from samples. To the best of our knowledge, sketching and sampling have not been combined in a principled fashion before. The main difficulty in characterizing sketches over samples is the fact that the sampling analysis [35, 39] is performed in the tuple domain while the sketch analysis [3] is performed in the frequency domain. This is the first obstacle we overcome in this chapter. The work on sketching probabilistic data streams [13, 38] is somewhat similar to our work. The important difference is the fact that sampling is part of the estimate in our work while it represents only a way to interpret the probabilistic data in


the related work. The results in [13] do not characterize the sketch over sample estimator but approximate the probabilistic aggregates using sketches. The only overlap in terms of analysis seems to be the computation of the expected value of sketch over samples for the second frequency moment computation in [38].

In the rest of the chapter, Section 4.1 gives an overview on sampling while Section 4.2 introduces sketches. The formal analysis of the combined sketch over samples estimator is detailed in Section 4.3. The empirical evaluation of the combined estimator is presented in Section 4.4.

4.1 Sampling

Sampling as an approximation technique consists in obtaining samples $F'$ and $G'$ from relations $F$ and $G$, respectively, computing the size of join aggregate over the samples, and applying a correction to ensure that the sampling estimator is unbiased. This method is generic and applies to all types of sampling. In order to simplify the theoretical developments, we keep the treatment of sampling as generic as possible.

In Section 2.1 we expressed the size of join aggregate as a function of $f_i$ and $g_i$, the frequencies of value $i$ of the join attribute in relations $F$ and $G$, respectively. If we define $f'_i$ and $g'_i$ to be the frequencies of $i$ in $F'$ and $G'$, respectively, the size of join of the sample relations is:

$$|F' \bowtie_A G'| = \sum_{i \in I} f'_i g'_i \tag{4-1}$$

$f'_i$ and $g'_i$ are random variables that depend on the type of sampling and the parameters of the sampling process. Interestingly, a large part of the characterization of sampling can be carried out without specifying the type of sampling. This is also true for sketches over samples in Section 4.3.

4.1.1 Generic Sampling

In general, $|F' \bowtie_A G'|$ is not an unbiased estimator for the size of join $|F \bowtie_A G|$. Fortunately, in the majority of the cases a constant correction that scales for the difference in size between the samples and the original relations can be made to obtain an unbiased


estimator. If we define the estimator for the size of join as $X = C \sum_{i \in I} f'_i g'_i$, where $C$ is the scaling factor, we can determine the value of $C$ such that $X$ is unbiased. In order to derive error bounds for the estimator $X$, the moments $E[X]$ and $Var[X]$ have to be computed. It turns out that expressions for $E[X]$ and $Var[X]$ can be written for generic sampling, as we show below.

There are two distinct cases that need separate treatment. The first case is when relations $F$ and $G$ are different and the samples are obtained independently from the two relations. The second case is when $F$ and $G$ are identical, thus only one sample is available. This situation arises in the case of self-join size.

Size of Join: When $F'$ and $G'$ are obtained independently, the random variables $f'_i$ and $g'_i$ are independent. We immediately have:

$$E[X] = C \sum_{i \in I} E[f'_i] E[g'_i]$$
$$Var[X] = C^2 \left[\sum_{i \in I} \sum_{j \in I} E[f'_i f'_j] E[g'_i g'_j] - \left(\sum_{i \in I} E[f'_i] E[g'_i]\right)^2\right] \tag{4-2}$$

Self-Join Size: When $F$ and $G$ are identical and only the sample $F'$ is available, the random variables $f'_i$ and $g'_i$ are also identical. Then, we obtain:

$$E[X] = C \sum_{i \in I} E[f'^2_i]$$
$$Var[X] = C^2 \left[\sum_{i \in I} \sum_{j \in I} E[f'^2_i f'^2_j] - \left(\sum_{i \in I} E[f'^2_i]\right)^2\right] \tag{4-3}$$

A specific type of sampling would determine the value of the expectations $E[f'_i]$, $E[g'_i]$, $E[f'_i f'_j]$, $E[g'_i g'_j]$, $E[f'^2_i]$ and $E[f'^2_i f'^2_j]$. Deriving final formulas for $E[X]$ and $Var[X]$ and determining the constant $C$ becomes just a matter of plugging in these quantities for a specific type of sampling.

4.1.2 Bernoulli Sampling

Consider that the sampling process is Bernoulli sampling. Each tuple in $F$ and $G$ is selected independently in the sample $F'$ and $G'$ with probability $p$ or $q$, $0 \le p \le 1$,


$0 \le q \le 1$, respectively. Then, $f'_i$ and $g'_i$ are independent binomial random variables, $f'_i = Binomial(f_i, p)$ and $g'_i = Binomial(g_i, q)$, respectively, with expected values:

$$E[f'_i] = p f_i \qquad E[g'_i] = q g_i \tag{4-4}$$

Size of Join: The expectation of the size of join estimator $E[X]$ is then:

$$E[X] = pq \sum_{i \in I} f_i g_i \tag{4-5}$$

which is different from the size of join $|F \bowtie_A G|$. Thus, in order for $X$ to be an unbiased estimator, it has to be scaled with the product of the inverses of the two probabilities $p$ and $q$, respectively, i.e., $C = \frac{1}{pq}$.

For deriving the variance $Var[X] = E[X^2] - E^2[X]$, expectations of the form $E[f'_i f'_j]$ have to be computed. $f'_i$ and $f'_j$ are independent random variables whenever $i \ne j$ because the decision to include a tuple in the sample is independent for each tuple, thus $E[f'_i f'_j] = E[f'_i] E[f'_j]$. When $i = j$, $E[f'^2_i]$ is the second frequency moment of a binomial random variable. The variance $Var[X]$ is then:

$$Var[X] = \frac{1}{p^2 q^2} \left[\sum_{i \in I} E[f'^2_i] E[g'^2_i] - \sum_{i \in I} E^2[f'_i] E^2[g'_i]\right] \tag{4-6}$$

where $\sum_{i \in I} \sum_{j \in I} E[f'_i f'_j] E[g'_i g'_j] = \sum_{i \in I} E[f'^2_i] E[g'^2_i] + \left(\sum_{i \in I} E[f'_i] E[g'_i]\right)^2 - \sum_{i \in I} E^2[f'_i] E^2[g'_i]$ is used for simplifying the formula. The frequency moments of standard random variables, as the binomial is, can be found in any statistical resource, for example [54]. After we plug in the second frequency moment of the binomial random variables, i.e., $E[f'^2_i] = p f_i (1 - p + p f_i)$ and $E[g'^2_i] = q g_i (1 - q + q g_i)$, in Equation 4-6, we obtain the final formula for the variance:

$$Var[X] = \frac{1 - p}{p} \sum_{i \in I} f_i g_i^2 + \frac{1 - q}{q} \sum_{i \in I} f_i^2 g_i + \frac{(1 - p)(1 - q)}{pq} \sum_{i \in I} f_i g_i \tag{4-7}$$
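The step from Equation 4-6 to Equation 4-7 can be checked numerically, and the estimator itself is easy to simulate. The following sketch is illustrative only: the frequency vectors, the probabilities, and all function names are assumptions introduced for the example, and the simulation uses Python's standard pseudo-random generator rather than any particular load shedding implementation.

```python
import math
import random


def binomial_second_moment(count, prob):
    """E[B^2] for B ~ Binomial(count, prob)."""
    return count * prob * (1.0 - prob + prob * count)


def variance_from_moments(f, g, p, q):
    """Variance written in terms of binomial moments (Equation 4-6)."""
    total = 0.0
    for fi, gi in zip(f, g):
        second = binomial_second_moment(fi, p) * binomial_second_moment(gi, q)
        first_sq = (p * fi) ** 2 * (q * gi) ** 2
        total += second - first_sq
    return total / (p * p * q * q)


def variance_closed_form(f, g, p, q):
    """Final closed-form variance (Equation 4-7)."""
    s_fg2 = sum(fi * gi * gi for fi, gi in zip(f, g))
    s_f2g = sum(fi * fi * gi for fi, gi in zip(f, g))
    s_fg = sum(fi * gi for fi, gi in zip(f, g))
    return ((1 - p) / p * s_fg2 + (1 - q) / q * s_f2g
            + (1 - p) * (1 - q) / (p * q) * s_fg)


def bernoulli_join_estimate(f, g, p, q, rng):
    """One draw of X = (1 / pq) * sum_i f'_i g'_i, where every tuple is
    kept independently with probability p (in F) or q (in G)."""
    total = 0
    for fi, gi in zip(f, g):
        f_s = sum(1 for _ in range(fi) if rng.random() < p)
        g_s = sum(1 for _ in range(gi) if rng.random() < q)
        total += f_s * g_s
    return total / (p * q)


f = [3, 1, 4, 1, 5]  # hypothetical frequencies of relation F
g = [2, 7, 1, 8, 2]  # hypothetical frequencies of relation G
v_moments = variance_from_moments(f, g, 0.1, 0.5)
v_closed = variance_closed_form(f, g, 0.1, 0.5)
one_estimate = bernoulli_join_estimate(f, g, 0.1, 0.5, random.Random(0))
```

The two variance functions agree up to floating-point rounding, which mirrors the algebraic simplification described above. With $p = q = 1$ nothing is dropped and the estimator returns the exact size of join.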


Self-Join Size: The expectation of the self-join size estimator is given by:

$$E[X] = p^2 \sum_{i \in I} f_i^2 + p(1 - p) \sum_{i \in I} f_i \tag{4-8}$$

Then, the self-join size unbiased estimator $X$ has to be defined as $X = \frac{1}{p^2} \sum_{i \in I} f'^2_i - \frac{1 - p}{p} \sum_{i \in I} f_i$, with the scaling constant $C = \frac{1}{p^2}$. The effect of the extra term is minimal, the only modification being that a bias correction is needed when the estimate is actually computed. The variance is not affected since $Var[aX + b] = a^2 Var[X]$, for $a$, $b$ constants. The expectations $E[f'^2_i f'^2_j]$ follow the same rules as $E[f'_i f'_j]$ in the size of join variance. They are independent for $i \ne j$, while for $i = j$ they generate the fourth frequency moment. After simplifications, we obtain the following formula for the variance:

$$Var[X] = \frac{1}{p^4} \left[\sum_{i \in I} E[f'^4_i] - \sum_{i \in I} E^2[f'^2_i]\right] \tag{4-9}$$

4.1.3 Sampling with Replacement

A sample of fixed size can be generated by repeatedly choosing a random tuple from the base relation for the specified number of times. If the same tuple can appear in the sample multiple times, the process is sampling with replacement. In this case the random variables corresponding to the frequencies in the sample, $f'_i$ and $g'_i$, respectively, are the components of a multinomial random variable with parameters the size of the sample and the probability $\frac{f_i}{|F|}$ and $\frac{g_i}{|G|}$, respectively, where $|F|$ and $|G|$ are the size of $F$ and $G$. Since each component of a multinomial random variable is a binomial random variable, the expectations in Equation 4-4 still hold but with different probabilities:

$$E[f'_i] = \frac{|F'|}{|F|} f_i \qquad E[g'_i] = \frac{|G'|}{|G|} g_i \tag{4-10}$$

Size of Join: The size of join estimator $X$ is defined as follows:

$$X = \frac{|F|}{|F'|} \frac{|G|}{|G'|} \sum_{i \in I} f'_i g'_i \tag{4-11}$$


where $\frac{|F|}{|F'|}$ and $\frac{|G|}{|G'|}$ are scaling factors that make the estimator unbiased, i.e., $C = \frac{|F|}{|F'|} \frac{|G|}{|G'|}$. The expectations of the form $E[f'_i f'_j]$ needed for the derivation of the variance quantify the interaction between the binomial components of the multinomial random variable characterizing the sampling frequencies. They can be derived from the moment generating function corresponding to a multinomial distribution. With $E[f'_i f'_j] = \frac{|F'|(|F'| - 1)}{|F|^2} f_i f_j$ and $E[f'^2_i] = \frac{|F'|(|F'| - 1)}{|F|^2} f_i^2 + \frac{|F'|}{|F|} f_i$, and $E[g'_i g'_j] = \frac{|G'|(|G'| - 1)}{|G|^2} g_i g_j$ and $E[g'^2_i] = \frac{|G'|(|G'| - 1)}{|G|^2} g_i^2 + \frac{|G'|}{|G|} g_i$, respectively, the variance is:

$$Var[X] = \frac{|F|}{|F'|} \frac{|G|}{|G'|} \sum_{i \in I} f_i g_i + \frac{|F|}{|F'|} \frac{|G'| - 1}{|G'|} \sum_{i \in I} f_i g_i^2 + \frac{|F'| - 1}{|F'|} \frac{|G|}{|G'|} \sum_{i \in I} f_i^2 g_i + \frac{(|F'| - 1)(|G'| - 1) - |F'||G'|}{|F'||G'|} \left(\sum_{i \in I} f_i g_i\right)^2 \tag{4-12}$$

The more complicated formula of the variance in the case of sampling with replacement is due to the constraint imposed by the fixed size of the sample, thus the more complicated interaction between the random variables corresponding to the frequencies $f'_i$ and $g'_i$.

Self-Join Size: Following a similar treatment as for Bernoulli sampling, the self-join size unbiased estimator $X$ is defined as follows:

$$X = \frac{|F|^2}{|F'|(|F'| - 1)} \sum_{i \in I} f'^2_i - \frac{|F|^2}{|F'| - 1} \tag{4-13}$$

with the scaling constant $C = \frac{|F|^2}{|F'|(|F'| - 1)}$. Notice that the estimator is defined for sample sizes larger than 1, but in order to compute the variance the size of the sample has to be larger than 3. The bias correction requires knowledge only about the size of the base relation and the size of the sample. The variance can be computed in a similar way as for Bernoulli sampling since the components of a multinomial random variable are binomial random variables, thus they have the same frequency moments. Expectations of the form $E[f'^2_i f'^2_j]$ appearing in the variance can be derived from the moment generating function


of the multinomial random variable, i.e.:

$$E[f'^2_i f'^2_j] = \frac{|F'|(|F'| - 1)}{|F|^2} f_i f_j \left[1 + \frac{|F'| - 2}{|F|} (f_i + f_j) + \frac{(|F'| - 2)(|F'| - 3)}{|F|^2} f_i f_j\right] \tag{4-14}$$

4.2 Sketches

While sampling techniques select a random subset of tuples from the input relation, sketching techniques summarize all the tuples as a small number of random variables. This is accomplished by projecting the domain of the input relation on a significantly smaller domain using random functions. Multiple sketching techniques are proposed in the literature for estimating the size of join and the second frequency moment (see [52] for details). Although using different random functions, i.e., $\{+1, -1\}$ or hashing, the existing sketching techniques have similar analytical properties, i.e., the sketch estimators have the same variance. For this reason we focus on the basic AGMS sketches [3, 4] throughout the chapter.

The basic AGMS sketch of relation $F$ consists of a single random variable $S$ that summarizes all the tuples $t$ from $F$. $S$ is defined as:

$$S = \sum_{t \in F} \xi_{t.A} = \sum_{i \in I} f_i \xi_i \tag{4-15}$$

where $\xi$ is a family of $\{+1, -1\}$ random variables that are 4-wise independent. Essentially, a random value of either $+1$ or $-1$ is associated to each point in the domain of attribute $A$. Then, the corresponding random value is added to the sketch $S$ for each tuple $t$ in the relation. We can define a sketch $T$ for relation $G$ in a similar way and using the same family $\xi$.

Size of Join: The sketch-based estimator $X$ defined as:

$$X = S \cdot T = \sum_{i \in I} f_i \xi_i \sum_{j \in I} g_j \xi_j \tag{4-16}$$


is an unbiased estimator for the size of join $|F \bowtie_A G|$. The variance of the sketch estimator is given by:

$$Var[X] = \sum_{i \in I} f_i^2 \sum_{j \in I} g_j^2 + \left(\sum_{i \in I} f_i g_i\right)^2 - 2 \sum_{i \in I} f_i^2 g_i^2 \tag{4-17}$$

Self-Join Size: The unbiased estimator for the self-join size is defined as:

$$X = S^2 = \sum_{i \in I} \sum_{j \in I} f_i f_j \xi_i \xi_j \tag{4-18}$$

The variance of the sketch estimator is given by:

$$Var[X] = 2 \left[\left(\sum_{i \in I} f_i^2\right)^2 - \sum_{i \in I} f_i^4\right] \tag{4-19}$$

A common technique to reduce the variance of an estimator is to generate multiple independent instances of the basic estimator and then to build a more complex estimator as the average of the basic estimators. While the expected value of the complex estimator is equal with the expectation of one basic estimator, the variance is reduced by a factor of $n$ since $Var\left[\frac{1}{n} \sum_{k=1}^{n} X_k\right] = \frac{1}{n^2} \sum_{k=1}^{n} Var[X_k] = \frac{Var[X_k]}{n}$, where $n$ is the number of basic estimators being averaged. This technique can be applied to reduce the variance of the sketch estimator if different $\xi$ families are used for the basic estimators (see [3, 4] for details).

4.3 Sketches over Samples

Given the ability of sampling to make predictions about an entire dataset from a randomly selected subset and that sketches require the summarizing of the entire dataset in order to determine any of its properties, an interesting question that immediately arises is how to combine these two randomized techniques. Although the intuitive answer to this question seems to be simple (the sketch is computed over a sample of the data instead of the entire dataset), as we show in Section 4.4, the behavior of the combined estimator is not the simple composition of the individual behaviors of the ingredients. A


careful analytical characterization of the estimator needs to be carried out. Furthermore, the sampling process can be either explicit and executed as an individual step before sketching is done or implicit, a situation in which the input dataset is assumed to be a sample from a large population.

In the first case, a significant speed-up in updating the sketch structure can be obtained since only a random subset of the data is actually sketched. This process is essentially a load shedding technique for sketching extremely fast data streams that cannot be otherwise sketched. It can be implemented as an explicit Bernoulli sampling that randomly filters the tuples that update the sketch structure.

In the second case, the data is assumed to be a sample from a large population and the goal is to determine properties of the population based on the sample. The sample itself is assumed to be large enough so it cannot be stored explicitly, thus sketching is required. If the population is infinite, the entire process can be seen as sketching i.i.d. samples from an unknown distribution. For practical reasons we view the data as a sample with replacement from a finite population and we carry out the analysis in this context. The analysis straightforwardly extends to i.i.d. samples if all estimators are normalized by the size of the population and the limit, when the population size goes to infinity, is taken. In such a circumstance, the frequencies in the original unknown population become densities of the unknown population, but everything else remains the same.

In this section, we provide a generic framework for sketching sampled data streams in order to estimate the size of join and the self-join size. Then we compute the first two frequency moments of the combined estimator for two particular types of sampling (Bernoulli sampling and sampling with replacement). This provides sufficient information to allow the derivation of confidence bounds.

4.3.1 Generic Sampling

Consider $F'$ to be a generic sample obtained from relation $F$. Sketching the sample $F'$ is similar to sketching the entire relation $F$ and consists in summarizing the sampled

PAGE 66

tuples $t'$ as follows:

$$S = \sum_{t' \in F'} \xi_{t'.A} = \sum_{i \in I} f'_i \xi_i \tag{4-20}$$

where $\xi$ is a family of $\{+1, -1\}$ random variables that are 4-wise independent. A sample $G'$ from relation $G$ can be sketched in a similar way using the same family $\xi$:

$$T = \sum_{t' \in G'} \xi_{t'.A} = \sum_{i \in I} g'_i \xi_i \tag{4-21}$$

Size of Join: We define the estimator $X$ for the size of join $|F \bowtie_A G|$ based on the sketches computed over the samples as follows:

$$X = C \cdot S \cdot T = C \sum_{i \in I} f'_i \xi_i \sum_{j \in I} g'_j \xi_j \tag{4-22}$$

Notice that the estimator is similar to the sketch estimator computed over the entire data set in Equation 4-16, multiplied by a constant scaling factor that compensates for the difference in size. The constant $C$ is the part that comes from sampling.

Self-Join Size: The self-join size or second frequency moment of a relation is the particular case of size of join between two instances of the same relation. One way of analyzing the sketches over samples estimator for the self-join size problem is to build two independent samples and two independent sketches from the same base relation and then to apply the results corresponding to size of join. Although sound from an analytical point of view, this solution is inefficient in practice. In the following we consider a practical solution that requires the construction of only one sample and one sketch from the base relation. A new estimator for the self-join size has to be defined instead, but the analysis is closely related to the analysis of the size of join estimator. With $S$ defined in Equation 4-20, we define the self-join size estimator $X$ as follows:

$$X = C \cdot S^2 = C\left(\sum_{i \in I} f'_i \xi_i\right)^2 = C \sum_{i \in I} f'_i \xi_i \sum_{j \in I} f'_j \xi_j \tag{4-23}$$

PAGE 67

where $C$ is the same scaling factor compensating for the difference in size. Notice that the difference between the size of join estimator and the self-join size estimator is only at the sampling level since the same family of random variables is used for sketching in both cases. For this reason we carry out the analysis for the two estimators in parallel and make the distinction only when necessary.

In order to derive confidence bounds for the estimator $X$, the first two moments, expected value and variance, have to be computed. Intuitively, the scaling factor $C$ should compensate for the difference in size and make the estimator unbiased. Since the two processes, sampling and sketching, are independent and sequential, the interaction between them is minimal and the sum of the two variances should be a good estimator for the variance of the combined estimator. In the following, we derive the exact formulas for the expectation and the variance in the generic case. The independence of the families of random variables corresponding to sampling and sketching, $\{f'_i; g'_i\}$ and $\xi$, respectively, plays an important role in simplifying the computation. This independence is due to the independence of the two random processes. Then we consider Bernoulli sampling and sampling with replacement and show that the intuition is correct in the case of expectation. For the variance, the intuition proves to be wrong since the interaction between sketching and sampling is more complex and it can be characterized only through a detailed analysis. The empirical results in Section 4.4 confirm our findings.

The expectation $E[X]$ can be derived as follows:

$$E[X] = C \cdot E\left[\sum_{i \in I}\sum_{j \in I} f'_i \xi_i g'_j \xi_j\right] = C \sum_{i \in I}\sum_{j \in I} E\left[f'_i g'_j\right] E\left[\xi_i \xi_j\right] = C \sum_{i \in I} E\left[f'_i g'_i\right] \tag{4-24}$$

since $E[\xi_i \xi_j] = E[\xi_i]E[\xi_j] = 0$ whenever $i \neq j$ due to the 4-wise independence of the family $\xi$, and $E[\xi_i^2] = 1$.
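The collapse of the double sum in Equation 4-24 can be checked numerically for a fixed sample. The sketch below is only an illustration under stated assumptions: it uses fully independent ±1 variables (which are in particular 4-wise independent) over a tiny domain, and the function name and the sampled frequency vectors are hypothetical. It averages $S \cdot T$ over every equally likely sign assignment and compares the result with $\sum_i f'_i g'_i$.

```python
from fractions import Fraction
from itertools import product

def expected_sketch_product(f_sample, g_sample):
    """Exact E[S*T] over all equally likely +/-1 assignments of xi."""
    domain = len(f_sample)
    total = Fraction(0)
    for xi in product((+1, -1), repeat=domain):
        s = sum(f * x for f, x in zip(f_sample, xi))   # S = sum_i f'_i * xi_i
        t = sum(g * x for g, x in zip(g_sample, xi))   # T = sum_i g'_i * xi_i
        total += s * t
    return total / 2 ** domain

f_sample = [3, 0, 2, 1]   # hypothetical sampled frequencies f'_i
g_sample = [1, 4, 0, 2]   # hypothetical sampled frequencies g'_i
inner = sum(f * g for f, g in zip(f_sample, g_sample))
print(expected_sketch_product(f_sample, g_sample), inner)
```

Taking the expectation of this conditional result over the sampling process and multiplying by $C$ gives the right-hand side of Equation 4-24.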

PAGE 68

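Size of Join: For size of join, one more simplification step is possible due to the independence of the sampling processes corresponding to the two relations:

$$E[X] = C \sum_{i \in I} E[f'_i]\,E[g'_i] \tag{4-25}$$

Self-Join Size: For self-join size, the samples are identical, thus:

$$E[X] = C \sum_{i \in I} E\left[f'^2_i\right] \tag{4-26}$$

In essence, the expectation $E[X]$ is completely determined by the properties of the sampling process, i.e., the frequency moments of the random variables corresponding to the sampling frequencies. The interaction between sketching and sampling is minimal.

In order to compute the variance $\mathrm{Var}[X]$, the expectation $E[X^2]$ has to be evaluated first since $\mathrm{Var}[X] = E[X^2] - E^2[X]$. Although $E[X^2]$ contains 8 random variables, the computation is again significantly simplified by the independence of the sampling and sketching processes. $E[X^2]$ can be derived as follows:

$$E\left[X^2\right] = C^2 E\left[\sum_{i \in I}\sum_{j \in I}\sum_{i' \in I}\sum_{j' \in I} f'_i \xi_i\, g'_j \xi_j\, f'_{i'} \xi_{i'}\, g'_{j'} \xi_{j'}\right] = C^2 \sum_{i \in I}\sum_{j \in I}\sum_{i' \in I}\sum_{j' \in I} E\left[f'_i g'_j f'_{i'} g'_{j'}\right] E\left[\xi_i \xi_j \xi_{i'} \xi_{j'}\right] \tag{4-27}$$

Size of Join: For size of join, more simplifications are possible due to the independence of the sampling processes corresponding to the two relations and the 4-wise independence of the $\xi$ family (see [51] for details). With $E[X^2]$ given by:

$$E\left[X^2\right] = C^2\left[\sum_{i \in I}\sum_{j \in I} E\left[f'^2_i\right]E\left[g'^2_j\right] + 2\sum_{i \in I}\sum_{j \in I} E\left[f'_i f'_j\right]E\left[g'_i g'_j\right] - 2\sum_{i \in I} E\left[f'^2_i\right]E\left[g'^2_i\right]\right] \tag{4-28}$$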

PAGE 69

we obtain the formula for the variance:

$$\mathrm{Var}[X] = C^2\left[\sum_{i \in I} E\left[f'^2_i\right]\sum_{j \in I} E\left[g'^2_j\right] + 2\sum_{i \in I}\sum_{j \in I} E\left[f'_i f'_j\right]E\left[g'_i g'_j\right] - 2\sum_{i \in I} E\left[f'^2_i\right]E\left[g'^2_i\right] - \left(\sum_{i \in I} E[f'_i]\,E[g'_i]\right)^2\right] \tag{4-29}$$

Notice that the variance of sketching over generic sampling is an expression depending only on the properties of the sampling process. More precisely, in order to evaluate the variance, only expectations of the form $E[f'_i]$ and $E[f'_i f'_j]$ have to be computed, where $f'_i$ and $f'_j$ are random variables corresponding to the frequencies in the sample.

Self-Join Size: For self-join size, the samples are identical, thus $E[X^2]$ is given by:

$$E\left[X^2\right] = C^2\left[3\sum_{i \in I}\sum_{j \in I} E\left[f'^2_i f'^2_j\right] - 2\sum_{i \in I} E\left[f'^4_i\right]\right] \tag{4-30}$$

The variance of the self-join size estimator is then:

$$\mathrm{Var}[X] = C^2\left[3\sum_{i \in I}\sum_{j \in I} E\left[f'^2_i f'^2_j\right] - 2\sum_{i \in I} E\left[f'^4_i\right] - \left(\sum_{i \in I} E\left[f'^2_i\right]\right)^2\right] \tag{4-31}$$

The averaging technique applied to reduce the variance of basic sketches in Section 4.2 cannot be used straightforwardly in the case of sketches computed over samples. This is the case since, although the basic sketch estimators are built independently using different families of $\xi$ random variables, they are computed over the same sample and this introduces correlations between any two estimators. The variance of the average estimator is in this case:

$$\mathrm{Var}\left[\frac{1}{n}\sum_{k=1}^{n} X_k\right] = \frac{1}{n}\left[\mathrm{Var}[X_k] + (n-1)\,\mathrm{Cov}_{k \neq l}[X_k, X_l]\right] \tag{4-32}$$

where $n$ is the number of basic estimators being averaged and $\mathrm{Cov}[X_k, X_l] = E[X_k X_l] - E[X_k]E[X_l]$ is the covariance between any two basic estimators. Thus, in order to derive the variance of the average estimator, the expectation $E[X_k X_l]$ has to be evaluated first. This derivation is similar to the derivation of $E[X^2]$ with the difference that the

PAGE 70

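estimators $X_k$ and $X_l$ are built over the same sample using different sketch families of random variables:

$$E[X_k X_l] = C^2 E\left[\sum_{i \in I}\sum_{j \in I}\sum_{i' \in I}\sum_{j' \in I} f'_i \xi^k_i\, g'_j \xi^k_j\, f'_{i'} \xi^l_{i'}\, g'_{j'} \xi^l_{j'}\right] = C^2 \sum_{i \in I}\sum_{j \in I}\sum_{i' \in I}\sum_{j' \in I} E\left[f'_i g'_j f'_{i'} g'_{j'}\right] E\left[\xi^k_i \xi^k_j\right] E\left[\xi^l_{i'} \xi^l_{j'}\right] \tag{4-33}$$

Size of Join: With $E[X_k X_l] = C^2 \sum_{i \in I}\sum_{j \in I} E[f'_i f'_j]\,E[g'_i g'_j]$, we can write a formula for the variance of the average estimator as follows:

$$\mathrm{Var}\left[\frac{1}{n}\sum_{k=1}^{n} X_k\right] = C^2\left[\sum_{i \in I}\sum_{j \in I} E\left[f'_i f'_j\right]E\left[g'_i g'_j\right] - \left(\sum_{i \in I} E[f'_i]\,E[g'_i]\right)^2 + \frac{1}{n}\left(\sum_{i \in I} E\left[f'^2_i\right]\sum_{j \in I} E\left[g'^2_j\right] + \sum_{i \in I}\sum_{j \in I} E\left[f'_i f'_j\right]E\left[g'_i g'_j\right] - 2\sum_{i \in I} E\left[f'^2_i\right]E\left[g'^2_i\right]\right)\right] \tag{4-34}$$

which is the sum of the variance of the generic sampling estimator in Equation 4-2 and a term that seems to be similar to the variance of the sketch estimator in Equation 4-17. Notice that the complexity of the formula has not increased when compared to the variance of the basic estimator (Equation 4-29), only expectations of the form $E[f'_i]$ and $E[f'_i f'_j]$ having to be computed. At the same time, the improvement is less significant than the factor of $n$ obtained in the case of independent estimators.

Self-Join Size: In the case of the self-join size estimator, $E[X_k X_l] = C^2 \sum_{i \in I}\sum_{j \in I} E[f'^2_i f'^2_j]$, and then, the variance of the average estimator is given by:

$$\mathrm{Var}\left[\frac{1}{n}\sum_{k=1}^{n} X_k\right] = C^2\left[\sum_{i \in I}\sum_{j \in I} E\left[f'^2_i f'^2_j\right] - \left(\sum_{i \in I} E\left[f'^2_i\right]\right)^2 + \frac{2}{n}\left(\sum_{i \in I}\sum_{j \in I} E\left[f'^2_i f'^2_j\right] - \sum_{i \in I} E\left[f'^4_i\right]\right)\right] \tag{4-35}$$

Again, the variance of the sketching over samples estimator is the sum of the generic sampling estimator in Equation 4-3 and a term that is related to the variance of the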

PAGE 71

sketch estimator. Notice that the formulas for the self-join size estimator depend only on the second or the fourth frequency moment of the random variables associated with the sampling frequencies.

4.3.2 Bernoulli Sampling

We instantiate the formulas derived for generic sampling in Section 4.3.1 with the moments of the binomial random variables corresponding to the sampling frequencies. The fact that two binomial random variables corresponding to different elements of the domain are independent further simplifies the formulas.

Size of Join: Following the analysis for Bernoulli sampling in Section 4.1.2, we identify the unbiased estimator for the size of join to be:

$$X = \frac{1}{pq}\sum_{i \in I} f'_i \xi_i \sum_{j \in I} g'_j \xi_j \tag{4-36}$$

with the constant $C = \frac{1}{pq}$. The generic variance in Equation 4-29 takes the following form in the case of Bernoulli sampling:

$$\mathrm{Var}[X] = \frac{1}{p^2 q^2}\left[\sum_{i \in I} E\left[f'^2_i\right]\sum_{j \in I} E\left[g'^2_j\right] + \left(\sum_{i \in I} E[f'_i]\,E[g'_i]\right)^2 - 2\sum_{i \in I} E^2[f'_i]\,E^2[g'_i]\right] \tag{4-37}$$

since the random variables $f'_i$ and $g'_i$ are independent and, moreover, the pairs $f'_i$, $f'_j$ are independent for $i \neq j$, thus the equality $\sum_{i \in I}\sum_{j \in I} E[f'_i f'_j]\,E[g'_i g'_j] = \sum_{i \in I} E[f'^2_i]\,E[g'^2_i] + \left(\sum_{i \in I} E[f'_i]\,E[g'_i]\right)^2 - \sum_{i \in I} E^2[f'_i]\,E^2[g'_i]$ holds. The generic variance of the average estimator in Equation 4-34 can be instantiated in a similar way:

$$\mathrm{Var}\left[\frac{1}{n}\sum_{k=1}^{n} X_k\right] = \frac{1}{p^2 q^2}\left[\sum_{i \in I} E\left[f'^2_i\right]E\left[g'^2_i\right] - \sum_{i \in I} E^2[f'_i]\,E^2[g'_i] + \frac{1}{n}\left(\sum_{i \in I} E\left[f'^2_i\right]\sum_{j \in I} E\left[g'^2_j\right] + \left(\sum_{i \in I} E[f'_i]\,E[g'_i]\right)^2 - \sum_{i \in I} E\left[f'^2_i\right]E\left[g'^2_i\right] - \sum_{i \in I} E^2[f'_i]\,E^2[g'_i]\right)\right] \tag{4-38}$$
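The averaged estimator analyzed in Equation 4-38 can be sketched in code as follows. This is a minimal illustration, not the implementation used in this work: the function name is hypothetical, each $\xi$ family is represented as a plain dictionary of random signs (an assumption standing in for a 4-wise independent generator), and streams are lists of keys. The point it shows is that a single Bernoulli decision per tuple is shared by all $n$ basic sketches, which is why the basic estimators are correlated.

```python
import random

def averaged_join_estimate(stream_f, stream_g, p, q, xi_families, rng):
    """Average of n basic estimators (Equation 4-36) built over one shared Bernoulli sample."""
    n = len(xi_families)
    s = [0] * n                      # one counter per xi family for relation F
    t = [0] * n                      # one counter per xi family for relation G
    for key in stream_f:
        if rng.random() < p:         # one sampling decision per tuple of F ...
            for k, xi in enumerate(xi_families):
                s[k] += xi[key]      # ... shared by all n basic sketches
    for key in stream_g:
        if rng.random() < q:         # tuples of G are sampled independently of F
            for k, xi in enumerate(xi_families):
                t[k] += xi[key]
    # Scaling factor C = 1/(pq) applied to the mean of the n products S_k * T_k.
    return sum(sk * tk for sk, tk in zip(s, t)) / (n * p * q)

rng = random.Random(7)
domain = [1, 2, 3]
# Assumption: fully random signs instead of a 4-wise independent family.
families = [{key: rng.choice((-1, 1)) for key in domain} for _ in range(100)]
print(averaged_join_estimate([1, 1, 2, 3], [1, 2, 2], 0.5, 0.5, families, rng))
```

With $p = q = 1$ every tuple is kept and the function reduces to the plain averaged sketch estimator over the entire data set.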

PAGE 72

After we plug the second frequency moment of the binomial random variables into Equation 4-38, we obtain:

$$
\begin{aligned}
\mathrm{Var}\left[\frac{1}{n}\sum_{k=1}^{n} X_k\right] ={}& \frac{1}{n}\left[\sum_{i \in I} f_i^2 \sum_{j \in I} g_j^2 + \left(\sum_{i \in I} f_i g_i\right)^2 - 2\sum_{i \in I} f_i^2 g_i^2\right] \\
&+ \frac{n-1}{n}\left[\frac{1-p}{p}\sum_{i \in I} f_i g_i^2 + \frac{1-q}{q}\sum_{i \in I} f_i^2 g_i + \frac{(1-p)(1-q)}{pq}\sum_{i \in I} f_i g_i\right] \\
&+ \frac{1}{n}\left[\frac{1-p}{p}\sum_{i \in I} f_i \sum_{j \in I} g_j^2 + \frac{1-q}{q}\sum_{i \in I} f_i^2 \sum_{j \in I} g_j + \frac{(1-p)(1-q)}{pq}\sum_{i \in I} f_i \sum_{j \in I} g_j\right]
\end{aligned}
\tag{4-39}
$$

which is exactly the sum of the average sketch estimator individual variance (Equation 4-17), the Bernoulli sampling estimator individual variance (Equation 4-7), and an interaction term. The dominant term seems to be the variance of the sketch estimator due to the product of the sums of the squares of the frequencies. The interaction term also contains products of sums which can become quite large and comparable with the sketch variance in some particular scenarios. In the empirical evaluation section (Section 4.4) we provide a study that shows the relative significance of each of the three terms as a function of the data parameters.

Self-Join Size: The unbiased estimator for Bernoulli sampling is defined similarly to Equation 4-8:

$$X = \frac{1}{p^2}\left(\sum_{i \in I} f'_i \xi_i\right)^2 - \frac{1-p}{p}\sum_{i \in I} f_i \tag{4-40}$$

with the scaling factor $C = \frac{1}{p^2}$ and an extra term for bias correction. The extra term does not affect the sketching over samples process, the only modification being that a bias correction is needed when the estimate is actually computed. Precisely, the bias correction only requires knowledge about the size of the base relation. The generic variance in Equation 4-31 is not significantly affected by the modification of the basic estimator since $\mathrm{Var}[aX + b] = a^2\,\mathrm{Var}[X]$ holds for constants $a$ and $b$. Thus, the variance $\mathrm{Var}[X]$ is:

$$\mathrm{Var}[X] = \frac{1}{p^4}\left[\sum_{i \in I} E\left[f'^4_i\right] + 2\left(\sum_{i \in I} E\left[f'^2_i\right]\right)^2 - 3\sum_{i \in I} E^2\left[f'^2_i\right]\right] \tag{4-41}$$
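The role of the bias-correction term in Equation 4-40 can be verified with exact arithmetic. The sketch below is an illustration under stated assumptions: the function names and the frequency vector are hypothetical, and the only facts used are that the sampled frequency $f'_i$ is Binomial($f_i$, $p$) under Bernoulli sampling and that the expectation of the squared sketch is $\sum_i E[f'^2_i]$ (Equation 4-26 without the factor $C$).

```python
from fractions import Fraction

def binomial_second_moment(f, p):
    """E[f'^2] for f' ~ Binomial(f, p): variance plus squared mean."""
    return f * p * (1 - p) + (f * p) ** 2

def expected_self_join_estimate(freqs, p):
    """E[X] for Equation 4-40, computed from the binomial moments."""
    scaled = sum(binomial_second_moment(f, p) for f in freqs) / p ** 2
    bias = (1 - p) / p * sum(freqs)   # needs only the base relation size |F| = sum_i f_i
    return scaled - bias

freqs = [5, 2, 2, 1]                  # hypothetical base frequencies f_i
p = Fraction(1, 10)                   # hypothetical sampling probability
print(expected_self_join_estimate(freqs, p), sum(f * f for f in freqs))
```

Without the subtraction, the scaled term alone would overestimate the self-join size by $\frac{1-p}{p}\sum_i f_i$, which is 90 for these hypothetical values.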

PAGE 73

The simplification leading to Equation 4-41 uses the equality $\sum_{i \in I}\sum_{j \in I} E[f'^2_i f'^2_j] = \sum_{i \in I} E[f'^4_i] + \left(\sum_{i \in I} E[f'^2_i]\right)^2 - \sum_{i \in I} E^2[f'^2_i]$. The variance of the average estimator can be derived in a similar manner:

$$\mathrm{Var}\left[\frac{1}{n}\sum_{k=1}^{n} X_k\right] = \frac{1}{p^4}\left[\sum_{i \in I} E\left[f'^4_i\right] - \sum_{i \in I} E^2\left[f'^2_i\right] + \frac{2}{n}\left(\left(\sum_{i \in I} E\left[f'^2_i\right]\right)^2 - \sum_{i \in I} E^2\left[f'^2_i\right]\right)\right] \tag{4-42}$$

In order to obtain the exact formula, the second and the fourth frequency moments of the binomial random variable $f'_i$ have to be plugged in. After simplification, we obtain the final formula for the average estimator variance:

$$
\begin{aligned}
\mathrm{Var}\left[\frac{1}{n}\sum_{k=1}^{n} X_k\right] ={}& \frac{2}{n}\left[\left(\sum_{i \in I} f_i^2\right)^2 - \sum_{i \in I} f_i^4\right] \\
&+ \frac{4(1-p)}{p}\sum_{i \in I} f_i^3 + \frac{2(1-p)(3-5p)}{p^2}\sum_{i \in I} f_i^2 + \frac{(1-p)(6p^2-6p+1)}{p^3}\sum_{i \in I} f_i \\
&+ \frac{2}{n}\left[\frac{(1-p)^2}{p^2}\sum_{i \in I}\sum_{j \in I, j \neq i} f_i f_j + \frac{2(1-p)}{p}\sum_{i \in I}\sum_{j \in I, j \neq i} f_i^2 f_j\right]
\end{aligned}
\tag{4-43}
$$

which is, as expected, the sum of the sketch average estimator variance, the variance of the sampling estimator, and an interaction term.

4.3.3 Sampling with Replacement

In a similar way to Bernoulli sampling, we instantiate the formulas derived for generic sampling in Section 4.3.1 with the moments of the multinomial random variables corresponding to the sampling frequencies. Although the components of the multinomial random variable are binomial random variables, the formulas for sampling with replacement are more complicated due to the interaction between two different binomial random variables.

PAGE 74

Size of Join: Following the analysis for sampling with replacement in Section 4.1.3, we identify the unbiased estimator for the size of join to be:

$$X = \frac{|F|}{|F'|}\frac{|G|}{|G'|}\sum_{i \in I} f'_i \xi_i \sum_{j \in I} g'_j \xi_j \tag{4-44}$$

with $C = \frac{|F|}{|F'|}\frac{|G|}{|G'|}$. The generic variance in Equation 4-29 takes the following form in the case of sampling with replacement:

$$\mathrm{Var}[X] = \frac{|F|^2}{|F'|^2}\frac{|G|^2}{|G'|^2}\left[\sum_{i \in I} E\left[f'^2_i\right]\sum_{j \in I} E\left[g'^2_j\right] - \left(\sum_{i \in I} E[f'_i]\,E[g'_i]\right)^2 + 2\sum_{i \in I}\sum_{j \in I, j \neq i} E\left[f'_i f'_j\right]E\left[g'_i g'_j\right]\right] \tag{4-45}$$

since the random variables $f'_i$ and $g'_i$ are independent and $\sum_{i \in I}\sum_{j \in I} E[f'_i f'_j]\,E[g'_i g'_j] = \sum_{i \in I} E[f'^2_i]\,E[g'^2_i] + \sum_{i \in I}\sum_{j \in I, j \neq i} E[f'_i f'_j]\,E[g'_i g'_j]$ holds. The generic variance of the average estimator in Equation 4-34 can be derived in a similar way. After simplification, we obtain:

$$
\begin{aligned}
\mathrm{Var}\left[\frac{1}{n}\sum_{k=1}^{n} X_k\right] ={}& \frac{(|F'|-1)(|G'|-1) - |F'||G'|}{|F'||G'|}\left(\sum_{i \in I} f_i g_i\right)^2 + \frac{|F|}{|F'|}\frac{|G|}{|G'|}\sum_{i \in I} f_i g_i \\
&+ \frac{|F|}{|F'|}\frac{|G'|-1}{|G'|}\sum_{i \in I} f_i g_i^2 + \frac{|F'|-1}{|F'|}\frac{|G|}{|G'|}\sum_{i \in I} f_i^2 g_i \\
&+ \frac{1}{n}\frac{(|F'|-1)(|G'|-1)}{|F'||G'|}\left[\sum_{i \in I} f_i^2 \sum_{j \in I} g_j^2 + \left(\sum_{i \in I} f_i g_i\right)^2 - 2\sum_{i \in I} f_i^2 g_i^2\right] \\
&+ \frac{1}{n}\left[\frac{|F|}{|F'|}\frac{|G|}{|G'|}\sum_{i \in I}\sum_{j \in I, j \neq i} f_i g_j + \frac{|F|}{|F'|}\frac{|G'|-1}{|G'|}\sum_{i \in I}\sum_{j \in I, j \neq i} f_i g_j^2 + \frac{|F'|-1}{|F'|}\frac{|G|}{|G'|}\sum_{i \in I}\sum_{j \in I, j \neq i} f_i^2 g_j\right]
\end{aligned}
\tag{4-46}
$$

which is exactly the sum of the sampling with replacement estimator individual variance (Equation 4-12), the average sketch estimator individual variance (Equation 4-17), and an interaction term.
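The unbiasedness of the estimator in Equation 4-44 can be confirmed by brute force on a toy input. The sketch below is a hedged illustration: the function name and the two tiny relations are hypothetical, and the expectation over $\xi$ is replaced by its closed form $\sum_i f'_i g'_i$ from Equation 4-24, so that only the sampling process has to be enumerated.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def expected_join_estimate(rel_f, rel_g, size_f, size_g):
    """Exact E[X] for Equation 4-44 over every with-replacement sample of each relation."""
    c = Fraction(len(rel_f), size_f) * Fraction(len(rel_g), size_g)   # scaling factor C
    total, count = Fraction(0), 0
    for sample_f in product(rel_f, repeat=size_f):     # all equally likely samples F'
        freq_f = Counter(sample_f)                     # sampled frequencies f'_i
        for sample_g in product(rel_g, repeat=size_g): # all equally likely samples G'
            freq_g = Counter(sample_g)                 # sampled frequencies g'_i
            # Expectation over xi of S*T for this pair of samples (Equation 4-24).
            total += sum(freq_f[k] * freq_g[k] for k in freq_f)
            count += 1
    return c * total / count

rel_f = [1, 1, 2, 3]   # hypothetical relation F as a list of join-attribute values
rel_g = [1, 2, 2]      # hypothetical relation G
print(expected_join_estimate(rel_f, rel_g, 2, 2))   # true join size is 2*1 + 1*2 + 1*0 = 4
```

The enumeration grows exponentially with the sample size, so this check is only practical for very small inputs.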

PAGE 75

Self-Join Size: For sampling with replacement, the unbiased estimator for the self-join size is defined as follows:

$$X = \frac{|F|^2}{|F'|(|F'|-1)}\left(\sum_{i \in I} f'_i \xi_i\right)^2 - \frac{|F|^2}{|F'|-1} \tag{4-47}$$

with $C = \frac{|F|^2}{|F'|(|F'|-1)}$. The variance can be computed in a similar way as for Bernoulli sampling since the components of a multinomial random variable are binomial random variables, thus they have the same frequency moments. Expectations of the form $E[f'^2_i f'^2_j]$ have the same form as in Equation 4-14. We do not provide the formulas for the variance, but essentially the variance of the average estimator can still be written as the sum of the sketch average estimator variance, the variance of the sampling estimator, and an interaction term.

4.3.4 Discussion

The estimators for sketches over samples are defined almost as the basic sketching estimators. The only difference is a constant scaling factor that needs to be added to compensate for the difference in size. In the case of self-join size, an extra constant term is required to compensate for the bias.

The variance of the combined estimator can be written as the sum of the variance of the sketch estimator, the variance of the sampling estimator, and an interaction term. This immediately gives an intuitive interpretation to the formulas we derived. Although the sketch variance seems to be the dominant term, the exact significance of each of the terms is dependent on the actual distribution of the data (see the experimental results in Section 4.4).

When multiple sketch estimators are averaged in order to decrease the variance, the covariance also has to be considered since the sketch estimators are computed over the same sample, thus they are not completely independent. Our results show that the variance of the combined estimator no longer decreases by a factor equal to the number of averaged estimators; only the sketch term does.
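The effect described in the last paragraph follows directly from Equation 4-32 and can be illustrated with a few lines of code. The function name and the numeric values below are hypothetical; the variance and covariance of the basic estimators are taken as given inputs rather than computed from data.

```python
def variance_of_average(var_basic, cov_basic, n):
    """Equation 4-32: variance of the mean of n equally correlated basic estimators."""
    return (var_basic + (n - 1) * cov_basic) / n

# Hypothetical values: basic-estimator variance 100, pairwise covariance 20.
for n in (1, 10, 1000):
    print(n, variance_of_average(100, 20, n))
```

As $n$ grows, the result approaches the covariance instead of zero: the part of the variance that comes from the shared sample is not reduced by averaging more sketch estimators.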

PAGE 76

4.4 Experimental Evaluation

We pursue three main goals in the experimental evaluation of the sketching over samples estimators. First, we want to determine the relative contribution of each of the three terms that appear in the variance of the average estimator. More precisely, we want to determine if the interaction term represents a significant share of the variance. Second, we want to determine the behavior of the error of the sketch over samples estimator when compared with the error of the sketch estimator. And third, we want to identify the behavior of the estimation error as a function of the sample size.

In order to accomplish these goals, we designed a series of experiments over synthetic data sets. This allows a better control of the important parameters that affect the results. The data sets used in our experiments contain either 10 or 100 million tuples generated from a Zipfian distribution with the coefficient ranging between 0 (uniform) and 5 (skewed). The domain of the possible values is 1 million. In the case of size of join, the tuples in the two relations are generated completely independently. We used F-AGMS sketches [14] in all of the experiments due to their superior performance both in accuracy and update time (see [52] for details on sketching techniques). The number of buckets is either 5,000 or 10,000. This is equivalent to averaging 5,000 or 10,000 basic estimators. In order to be statistically significant, all the results presented in this section are the average of at least 100 independent experiments.

Figures 4-1 and 4-2 depict the relative contribution of each of the three terms appearing in the variance of the average estimator over Bernoulli samples (Equations 4-39 and 4-43, respectively). The relative contribution is represented as a function of the data skew for different sampling probabilities. A common trend both for size of join and self-join size is that the interaction term is highly significant for low skew data. This completely justifies the analysis we develop throughout the chapter since an analysis assuming that the variance of the composed estimator is the sum of the variance of the basic estimators

PAGE 77

would be incorrect. As expected, the impact of the variance of the sampling estimator becomes more significant as the size of the sample gets smaller. For self-join size (Figure 4-2), the variance is dominated by the term corresponding to the sampling estimator, while for size of join (Figure 4-1) the variance of the sketch estimator accounts for almost the entire variance irrespective of the sampling probability. This is entirely supported by the existing theoretical results which show that sketches are optimal for estimating the second frequency moment while sampling is optimal for the estimation of size of join [4].

The experimental relative error, i.e., $\frac{|\text{estimation} - \text{true result}|}{\text{true result}}$, of the sketch over Bernoulli samples estimator is depicted in Figures 4-3 and 4-4 as a function of the data skew for different sampling probabilities. Probability $p = 1.0$ corresponds to sketching the entire data set. These experimental results show that, with some exceptions, the sampling rate does not significantly affect the accuracy of the sketch estimator. For Zipf coefficients larger than 1, in the case of self-join size, and smaller than 3, in the case of size of join, the error of the sketch estimator is almost the same whether the entire data set is sketched or only one tuple out of a thousand is sketched. The impact of the sampling rate is significant only for low skew data in the case of self-join size. This is to be expected from the theoretical analysis since the interaction term dominates the variance. What cannot be explained from the theoretical analysis is the effect of the sampling rate for skewed data in the case of size of join. As shown in [52], the experimental behavior of F-AGMS sketches is in some cases orders of magnitude better than the theoretical predictions, thus although the theoretical variance is dominated by the variance of the sketch estimator, the empirical absolute value is small when compared to the variance of the sampling estimator. In the light of [52], the empirical results for high sampling rates are much better than the theoretical predictions, increasing the significance of the sampling rate for skewed data.

In Figures 4-5 and 4-6 we depict the experimental relative error as a function of the sample size for sampling with replacement. Since the actual size of the sample is different for different Zipf coefficients, we represent on the x axis the size of the sample as a fraction

PAGE 78

from the population size, with 1 corresponding to a sample with replacement of size equal to the population size. As expected, the error is decreasing as the sample size becomes larger, but it stabilizes after a certain sample size (a 0.1 fraction of the population size for the included figures). This is due entirely to the error of the sketch estimator which exists even when the entire population is available.

In conclusion, the experimental evaluation section supports the need for the detailed analysis of the sketch over samples estimators in Section 4.3. Our experiments show that a significant speed-up (a factor of 10 in general and a factor of up to 1000 in some cases) can be obtained by sketching only a small sample of the data instead of the entire data. At the same time, when the data is a sample with replacement from a large population, properties of the entire population can be accurately inferred by sketching only a small sample (a fraction of 0.1 or less from the population size).

4.5 Conclusions

In this chapter we provide the moment analysis of the sketch over samples estimators for two types of sampling: Bernoulli and sampling with replacement. Sketching Bernoulli samples is essentially a load shedding technique for sketching data streams which results, as our theory and experiments suggest, in significant update time reduction (by as much as a factor of 100) with minimal accuracy degradation. Sketching samples with replacement from an unknown distribution allows efficient characterization of the unknown distribution, which has many applications to online data-mining.

PAGE 79

Figure 4-1. Size of join variance.

Figure 4-2. Self-join size variance.

PAGE 80

Figure 4-3. Size of join error.

Figure 4-4. Self-join size error.

PAGE 81

Figure 4-5. Size of join sample size.

Figure 4-6. Self-join size sample size.

PAGE 82

CHAPTER 5
STATISTICAL ANALYSIS OF SKETCHES

Sketching techniques provide approximate answers to aggregate queries both for data-streaming and distributed computation. Small space summaries that have linearity properties are required for both types of applications. The prevalent method for analyzing sketches uses moment analysis and distribution independent bounds based on moments. This method produces clean, easy to interpret, theoretical bounds that are especially useful for deriving asymptotic results. However, the theoretical bounds obscure fine details of the behavior of various sketches and they are mostly not indicative of which type of sketches should be used in practice. Moreover, no significant empirical comparison between various sketching techniques has been published, which makes the choice even harder. In this chapter, we take a close look at the sketching techniques proposed in the literature from a statistical point of view with the goal of determining properties that indicate the actual behavior and producing tighter confidence bounds. Interestingly, the statistical analysis reveals that two of the techniques, Fast-AGMS and Count-Min, provide results that are in some cases orders of magnitude better than the corresponding theoretical predictions. We conduct an extensive empirical study that compares the different sketching techniques in order to corroborate the statistical analysis with the conclusions we draw from it. The study indicates the expected performance of various sketches, which is crucial if the techniques are to be used by practitioners. The overall conclusion of the study is that Fast-AGMS sketches are, for the full spectrum of problems, either the best, or close to the best, sketching technique.

Through research in the last decade, sketching techniques evolved as the premier approximation technique for aggregate queries over data streams. All sketching techniques share one common feature: they are based on randomized algorithms that combine random seeds with data to produce random variables that have distributions connected to the true value of the aggregate being estimated. By measuring certain characteristics

PAGE 83

of the distribution, correct estimates of the aggregate are obtained. The interesting thing about all sketching techniques that have been proposed is that the combination of randomization and data is a linear operation with the result that, as observed in [14, 41], sketching techniques can be used to perform distributed computation of aggregates without the need to send the actual data values. The tight connection with both data-streaming and distributed computation makes sketching techniques important from both the theoretical and practical point of view.

Sketches can be used either as the actual approximation technique, in which case they require a single pass over the data or, in order to improve the performance, as the basic technique in multi-pass techniques such as skimmed sketches [28] and red-sketches [29]. For either application, it is important to understand as well as possible its approximation behavior depending on the characteristics of the problem and to be able to predict as accurately as possible the estimation error.

As opposed to most approximation techniques (one of the few exceptions are sampling techniques [35]), theoretical approximation guarantees in the form of confidence bounds were provided for all types of sketches from the beginning [3]. All the theoretical guarantees that we know of are expressed as memory and update time requirements in terms of big-O notation, and are parameterized by $\epsilon$, the target relative error, $\delta$, the target confidence (the relative error is at most $\epsilon$ with probability at least $1 - \delta$), and the characteristics of the data, usually the first and the second frequency moments. While these types of theoretical results are useful in theoretical computer science, the fear is that they might hide details that are relevant in practice. In particular, it might be hard to compare methods, or some methods can look equally good according to the theoretical characterization, but differ substantially in practice. An even more significant concern, which we show to be perfectly justified, is that some of the theoretical bounds are too conservative.

PAGE 84

In this chapter, we set out to perform a detailed study of the statistical and empirical behavior of the four basic sketching techniques that have been proposed in the research literature for computing size of join and related problems: AGMS [3, 4], Fast-AGMS [14], Count-Min [15], and Fast-Count [58] sketches. The initial goal of the study was to complement the theoretical results and to make sketching techniques accessible and useful for the practitioners. While accomplishing these tasks, the study also shows that, in general, the theoretical bounds are conservative by at least a constant factor of 3. For Fast-AGMS and Count-Min sketches, the study shows that the theoretical prediction is off by many orders of magnitude if the data is skewed.

As part of our study we provide practical confidence intervals for all sketches except Count-Min. We use statistical techniques to provide confidence bounds at the same time the estimate is produced without any prior knowledge about the distribution (this is the common practice for sampling estimators [35]). Notice that prior knowledge is required in order to use the theoretical confidence bounds provided in the literature and might not actually be available in practice.

As far as we know, there does not exist any detailed statistical study of sketching techniques and only limited empirical studies to assess their accuracy. The insight we get from the statistical analysis and the extensive empirical study we perform allows us to clearly show that, from a practical point of view, Fast-AGMS sketches are the best basic sketching technique. The behavior of these sketches is truly exceptional and much better than previously believed: the exceptional behavior is masked by the result in [14], but revealed by our detailed statistical analysis. The timing results for the three hash-based sketching techniques (Fast-AGMS, Fast-Count, and Count-Min) reveal that sketches are practical, easily able to keep up with streams of millions of tuples per second.

The rest of the chapter is organized as follows. In Section 5.1 we give an overview of the four basic sketching techniques proposed in the literature. Section 5.2 contains our

PAGE 85

statistical analysis of the four sketching techniques with insights on their behavior. Section 5.3 contains the details and results of our extensive empirical study that corroborates the statistical analysis.

5.1 Sketches

Sketches are small-space summaries of data suited for massive, rapid-rate data streams processed either in a centralized or distributed environment. Queries are not answered precisely anymore, but rather approximately, by considering only the synopsis (sketch) of the data. All sketching techniques generate multiple random instances of an elementary sketch estimator, instances that are effectively random samples of the elementary estimator. While the random instances of the elementary sketch estimator are samples of the estimator, these samples should not be confused with the samples from the underlying data used by the sampling techniques [35]. The samples of the elementary sketch estimator are grouped and combined in different ways in order to obtain the overall estimate.

Typically, in order to produce an elementary sketch estimator (the process is duplicated for each sample of the elementary sketch estimator), multiple counters corresponding to random variables with required properties are maintained. The collection of counters for all the samples of the elementary estimator is called the sketch. The existing sketching techniques differ in how the random variables are organized, thus the update procedure, the way the elementary sketch estimator is computed, and how the answer to a given query is computed by combining the elementary sketch estimators.

In this section we provide an overview of the existing sketching techniques used for approximating the size of join of two data streams (see Section 2.1). For each technique we specify the elementary sketch estimator, denoted by $X$ (possibly with a subscript indicating the type of sketch), and the way the elementary sketches are combined to obtain the final estimate $Z$.

PAGE 86

5.1.1 Basic AGMS Sketches

The $i$th entry of the size $n$ AGMS (or, tug-of-war) [3, 4] sketch vector is defined as the random variable $x_f[i] = \sum_{j=0}^{N-1} f_j \cdot \xi^i_j$, where $\{\xi^i_j : j \in I\}$ is a family of uniformly distributed $\pm 1$ 4-wise independent random variables, with different families being independent. The advantage of using $\pm 1$ random variables comes from the fact that they can be efficiently generated in small space [51]. When a new data stream item $(e, w)$ arrives, all the counters in the sketch vector are updated as $x_f[i] = x_f[i] + w \cdot \xi^i_e$, $1 \leq i \leq n$. The time to process an update is thus proportional with the size of the sketch vector.

It can be shown that $X[i] = x_f[i] \cdot x_g[i]$ is an unbiased estimator of the inner-product of the frequency vectors $f$ and $g$, i.e., $E[X[i]] = f \cdot g$. The variance of the estimator is:

$$\mathrm{Var}[X[i]] = \left(\sum_{j \in I} f_j^2\right)\left(\sum_{k \in I} g_k^2\right) + \left(\sum_{j \in I} f_j g_j\right)^2 - 2\sum_{j \in I} f_j^2 g_j^2 \tag{5-1}$$

By averaging $n$ independent estimators, $Y = \frac{1}{n}\sum_{i=1}^{n} X[i]$, the variance can be reduced by a factor of $n$, i.e., $\mathrm{Var}[Y] = \frac{\mathrm{Var}[X[i]]}{n}$, thus improving the estimation error. In order to make the estimation more stable, the original solution [3] returned as the result the median of $m$ $Y$ estimators, i.e., $Z = \mathrm{Median}_{1 \leq k \leq m} Y[k]$.

We provide an example to illustrate how AGMS sketches work.

Example 6. Consider two data streams $F$ and $G$ given as pairs (key, frequency):

$$
\begin{aligned}
F &= \{(\cdot, 5), (\cdot, -2), (\cdot, 2), (\cdot, 3), (\cdot, 1), (\cdot, -3), (\cdot, 2), (\cdot, 2), (\cdot, 3)\} \\
G &= \{(\cdot, 1), (\cdot, 3), (\cdot, 2), (\cdot, 3), (\cdot, -2), (\cdot, 2), (\cdot, -1), (\cdot, 2), (\cdot, -1)\}
\end{aligned}
\tag{5-2}
$$

We want to estimate the size of join $|F \bowtie G|$ of the two streams using sketches consisting of 3 counters. A family of 4-wise independent $\pm 1$ random variables corresponds to each counter. Let the mappings from the key domain ($\{1, 2, 3, 4, 5\}$ in this case) to $\pm 1$ be given as in Table 5-1.

PAGE 87

As the data is streaming by, all the counters in the corresponding sketch vector are updated. For example, the pair $(\cdot, 5)$ in $F$ updates the counters in $x_f$ as follows: $x_f[1] = 5$, $x_f[2] = 5$, and $x_f[3] = -5$ (the counters are initialized to 0), while after the pair $(\cdot, -2)$ is processed the counters have the following values: $x_f[1] = 7$, $x_f[2] = 3$, and $x_f[3] = -7$. After all the elements in the two streams passed by, the two sketch vectors are: $x_f = [7, 1, -5]$ and $x_g = [7, 7, -3]$. The estimator $X$ for the size of join $|F \bowtie G|$ consists of the values $X = [49, 7, 15]$ having the mean $Y = 23.66$. The correct result is 31. Multiple instances of $Y$ can be obtained if other groups of 3 counters and their associated families of $\pm 1$ random variables are added to the sketch. In this case the median of the instances of $Y$ is returned as the final result.

Notice the tradeoffs involved by the AGMS sketch structure. In order to decrease the error of the estimator (proportional with the variance), the size $n$ of the sketch vector has to be increased. Since the space and the update-time are linear functions of $n$, an increase of the sketch size implies a corresponding increase of these two quantities.

The following theorem relates the accuracy of the estimator with the size of the sketch, i.e., $n = O\left(\frac{1}{\epsilon^2}\right)$ and $m = O\left(\log\frac{1}{\delta}\right)$.

Theorem 11 ([4]). Let $x_f$ and $x_g$ denote two parallel sketches comprising $O\left(\frac{1}{\epsilon^2}\log\frac{1}{\delta}\right)$ counters each, where $\epsilon$ and $1 - \delta$ represent the desired bounds on error and probabilistic confidence, respectively. Then, with probability at least $1 - \delta$, $Z \in (f \cdot g \pm \epsilon\,||f||_2\,||g||_2)$. The processing time required to maintain each sketch is $O\left(\frac{1}{\epsilon^2}\log\frac{1}{\delta}\right)$ per update.

$||f||_2 = \sqrt{f \cdot f} = \sqrt{\sum_{i \in I} f_i^2}$ is the $L_2$ norm of $f$ and $||g||_2 = \sqrt{g \cdot g} = \sqrt{\sum_{i \in I} g_i^2}$ is the $L_2$ norm of $g$, respectively.

From the perspective of the abstract problem in Section 2.3, $X[i]$ represent the primitive instances of the generic random variable $X$. Median of means $Z$ is the estimator for the expected value $E[X]$. The distribution-independent confidence bounds in Theorem 11 are derived from Theorem 8.
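The AGMS update and the median-of-means estimate described in this section can be sketched as follows. This is a minimal illustration, not the implementation evaluated in this chapter: the class and function names are hypothetical, and each $\xi$ family is a dictionary of fully random signs, an assumption standing in for a 4-wise independent generator. The last lines recompute the basic estimators of Example 6 from the final sketch vectors reported above.

```python
import random
from statistics import mean, median

class AGMSSketch:
    """n * m counters; every counter owns a +/-1 family over the key domain."""
    def __init__(self, domain, n, m, seed):
        rng = random.Random(seed)    # same seed and domain => same families, so sketches are "parallel"
        self.n, self.m = n, m
        # Assumption: fully random signs instead of 4-wise independent families.
        self.xi = [{key: rng.choice((-1, 1)) for key in domain} for _ in range(n * m)]
        self.counters = [0] * (n * m)

    def update(self, key, weight):
        # Every counter is touched, so one update costs O(n * m).
        for i, xi in enumerate(self.xi):
            self.counters[i] += weight * xi[key]

def estimate_join(sf, sg):
    products = [a * b for a, b in zip(sf.counters, sg.counters)]            # X[i]
    means = [mean(products[k * sf.n:(k + 1) * sf.n]) for k in range(sf.m)]  # Y[k]
    return median(means)                                                    # Z

# Final sketch vectors reported in Example 6.
xf, xg = [7, 1, -5], [7, 7, -3]
X = [a * b for a, b in zip(xf, xg)]
Y = mean(X)          # 71/3, roughly 23.67
print(X, Y)
```

Because the update is linear in the weight, processing an item and later processing the same key with the negated weight leaves the sketch unchanged, which is what allows deletions and distributed merging.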

PAGE 88

5.1.2 Fast-AGMS Sketches

As we have already mentioned, the main drawback of AGMS sketches is that any update on the stream affects all the entries in the sketch vector. Fast-AGMS sketches [14], as a refinement of Count sketches proposed in [11] for detecting the most frequent items in a data stream, combine the power of $\pm 1$ random variables and hashing to create a scheme with a significantly reduced update time while preserving the error bounds of AGMS sketches. The sketch vector $x_f$ consists of $n$ counters, $x_f[i]$. Two independent random processes are associated with the sketch vector: a family of $\pm 1$ 4-wise independent random variables $\xi$ and a 2-universal hash function $h : I \rightarrow \{1, \ldots, n\}$. The role of the hash function is to scatter the keys in the data stream to different counters in the sketch vector, thus reducing the interaction between the keys. Meanwhile, the unique family $\xi$ preserves the dependencies across the counters. When a new data stream item $(e, w)$ arrives, only the counter $x_f[h(e)]$ is updated with the value of the function $\xi$ corresponding to the key $e$, i.e., $x_f[h(e)] = x_f[h(e)] + w \cdot \xi_e$.

Given two parallel sketch vectors $x_f$ and $x_g$ using the same hash function $h$ and $\xi$ family, the inner-product $f \cdot g$ is estimated by $Y = \sum_{i=1}^{n} x_f[i] \cdot x_g[i]$. The final estimator $Z$ is computed as the median of $m$ independent basic estimators $Y$, i.e., $Z = \mathrm{Median}_{1 \leq k \leq m} Y[k]$. In the light of Section 2.3, $Y$ corresponds to the basic instances while the median is the estimator for the expected value.

We provide a simple example to illustrate the Fast-AGMS sketch data structure.

Example 7. Consider the same data streams from Example 6. The sketch vector consists of 9 counters grouped into 3 rows of 3 counters each. The same families of $\pm 1$ random variables (Example 6) are used, but a family corresponds to a row of counters instead of only one counter. An additional family of 2-universal hash functions (corresponding to the rows of the sketch) maps the elements in the key domain to only one counter in each row. The hash functions are specified in Table 5-2.


For each stream element only one counter from each row is updated. For example, after the pair with weight 3 in F is processed, the sketch vector $x_f$ looks like: $x_f[1] = [10, 2, 0]$, $x_f[2] = [5, 0, -3]$, and $x_f[3] = [0, 3, -9]$ (the counters are initialized to 0). After processing all the elements in the two streams, the two sketch vectors are $x_f = [[7, -1, 1], [5, -1, -3], [-2, 3, -6]]$ and $x_g = [[8, -2, 1], [9, -1, -1], [1, 1, -5]]$. The estimator for the size of join $|F \bowtie G|$ is the median of $Y = [59, 49, 31]$, which is 49, while the correct result is still 31.

The following theorem relates the size n of the sketch vectors and their number m to the error bound $\epsilon$ and the probabilistic confidence $\delta$, respectively.

Theorem 12 [14]. Let n be defined as $n = O(1/\epsilon^2)$ and m as $m = O(\log 1/\delta)$. Then, with probability at least $1-\delta$, $Z \in (f \cdot g \pm \epsilon \|f\|_2 \|g\|_2)$. Sketch updates are performed in $O(\log 1/\delta)$ time.

The above theorem states that Fast-AGMS sketches provide the same guarantees as basic AGMS sketches, while requiring only $O(\log 1/\delta)$ time to process the updates and using only one $\xi$ family per sketch vector (and one additional hash function h). Moreover, notice that only the sketch vector size depends on the error $\epsilon$.

5.1.3 Fast-Count Sketches

Fast-Count sketches, introduced in [58], provide the error guarantees and the update time of Fast-AGMS sketches, while requiring only one underlying random process: hashing. The tradeoffs involved are the size of the sketch vector (or, equivalently, the error) and the degree of independence of the hash function. The sketch vector consists of the same n counters as for AGMS sketches. The difference is that there exists only a 4-universal hash function associated with the sketch vector. When a new data item $(e, w)$ arrives, w is directly added to a single counter, i.e., $x_f[h(e)] = x_f[h(e)] + w$, where $h : I \to \{1, \ldots, n\}$ is the 4-universal hash function.
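The two hash-based update rules described so far, Fast-AGMS (hashing plus a $\pm 1$ family) and Fast-Count (hashing only), differ by a single multiplication, which the following Python sketch tries to make explicit. It is a minimal illustration, not the framework used in this work: `HashSketchRow`, `row_products` and `fast_agms_estimate` are hypothetical names, and seeded pseudo-random choices stand in for the 2-universal or 4-universal hash function h and for the 4-wise independent family $\xi$.

```python
import random
import statistics


class HashSketchRow:
    """One row of counters; only one counter is touched per update."""

    def __init__(self, num_buckets, seed, use_signs):
        self.counters = [0] * num_buckets
        self._seed = seed
        self._use_signs = use_signs  # True: Fast-AGMS, False: Fast-Count

    def _bucket(self, key):
        # Assumption: a seeded pseudo-random bucket replaces the hash h.
        rng = random.Random(f"h:{self._seed}:{key}")
        return rng.randrange(len(self.counters))

    def _sign(self, key):
        # Assumption: a seeded pseudo-random sign replaces the xi family.
        rng = random.Random(f"xi:{self._seed}:{key}")
        return 1 if rng.random() < 0.5 else -1

    def update(self, key, weight):
        sign = self._sign(key) if self._use_signs else 1
        self.counters[self._bucket(key)] += sign * weight


def row_products(rows_f, rows_g):
    """Basic estimator Y for each row: sum_i x_f[i] * x_g[i]."""
    return [sum(a * b for a, b in zip(rf, rg)) for rf, rg in zip(rows_f, rows_g)]


def fast_agms_estimate(rows_f, rows_g):
    """Fast-AGMS final estimator Z: median of the row estimators."""
    return statistics.median(row_products(rows_f, rows_g))


# Final sketch vectors quoted in Example 7.
xf = [[7, -1, 1], [5, -1, -3], [-2, 3, -6]]
xg = [[8, -2, 1], [9, -1, -1], [1, 1, -5]]
print(row_products(xf, xg))        # [59, 49, 31]
print(fast_agms_estimate(xf, xg))  # 49
```

Each row of a sketch would be created with its own seed, and the two streams must use the same seeds row by row so that they share h and $\xi$.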


The size of join estimator is defined as follows (this is a generalization of the second frequency moment estimator in [58]):

\[ Y = \frac{1}{n-1}\left[ n \sum_{i=1}^{n} x_f[i] \cdot x_g[i] - \left( \sum_{i=1}^{n} x_f[i] \right) \left( \sum_{i=1}^{n} x_g[i] \right) \right] \tag{5-3} \]

The complicated form of Y is due to the bias of the natural estimator $Y' = \sum_{i=1}^{n} x_f[i] \cdot x_g[i]$. Y is obtained by a simple correction of the bias of Y'. It can be proved that Y is an unbiased estimator of the inner product $f \cdot g$. Its variance is almost identical to the variance of the Y estimator for AGMS (Fast-AGMS) sketches in (5-1). The only difference is the multiplicative factor, $\frac{1}{n-1}$ for Fast-Count sketches compared to $\frac{1}{n}$ for AGMS sketches. Hence, given desirable error guarantees, Fast-Count sketches require one additional entry in the sketch vector. For large values of n, e.g., n > 100, the difference in variance between AGMS (Fast-AGMS) and Fast-Count sketches can be ignored and the results in Theorem 12 apply. Notice that in practice multiple instances of Y are computed and the final estimator for the expected value of the size of join is the mean (average) of these instances. We provide an example that shows how Fast-Count sketches work.

Example 8. Consider the same data streams from Example 6. The sketch vector consists of 9 counters grouped into 3 rows of 3 counters each. The same hash functions as in Example 7 are used (suppose that they are 4-universal). For each stream element only one counter from each row is updated. For example, after the pair with weight 3 in F is processed, the sketch vector $x_f$ looks like: $x_f[1] = [10, -2, 0]$, $x_f[2] = [5, 0, 3]$, and $x_f[3] = [0, 3, 5]$ (the counters are initialized to 0). After processing all the elements in the two streams, the two sketch vectors are $x_f = [[7, 1, 5], [5, 5, 3], [2, 3, 8]]$ and $x_g = [[8, 2, -1], [9, -1, 1], [-1, 1, 9]]$. The estimator for the size of join $|F \bowtie G|$ consists of the vector $Y = [21, 6, 51]$. The average of the elements in Y, 26, is returned as the final estimate.

5.1.4 Count-Min Sketches

Count-Min sketches [15] have almost the same structure as Fast-Count sketches. The only difference is that the hash function is drawn randomly from a family of 2-universal


hash functions instead of 4-universal. The update procedure is identical to that of Fast-Count sketches: only the counter $x_f[h(e)]$ is updated, as $x_f[h(e)] = x_f[h(e)] + w$, when the item $(e, w)$ arrives. The size of join estimator is defined in a natural way as $Y = \sum_{i=1}^{n} x_f[i] \cdot x_g[i]$ (notice that Y is actually equivalent to the above Y' estimator). It can be shown that Y is an overestimate of the inner product $f \cdot g$. In order to minimize the over-estimated quantity, the minimum over m independent Y estimators is computed, i.e., $Z = \mathrm{Min}_{1 \le k \le m}\, Y[k]$. Notice the different methods applied to correct the bias of the size of join estimator Y'. While Fast-Count sketches define an unbiased estimator Y based on Y', Count-Min sketches select the minimum over multiple such overestimates. The following example illustrates the behavior of Count-Min sketches.

Example 9. For the same setup as in Example 8, exactly the same sketch vectors are obtained after updating the two streams. Only the final estimator is different. It is the minimum of $Y = [53, 43, 73]$, that is, 43.

The relationship between the size of the sketch and the accuracy of the estimator Z is expressed by the following theorem:

Theorem 13 [15]. $Z \le f \cdot g + \epsilon \|f\|_1 \|g\|_1$ with probability $1-\delta$, where the size of the sketch vector is $n = O(1/\epsilon)$ and the minimum is taken over $m = O(\log 1/\delta)$ sketch vectors. Updates are performed in time $O(\log 1/\delta)$.

Here $\|f\|_1 = \sum_{i \in I} f_i$ and $\|g\|_1 = \sum_{i \in I} g_i$ represent the $L_1$ norms of the vectors f and g, respectively. Notice the dependence on the $L_1$ norm, compared to the dependence on the $L_2$ norm for AGMS sketches. The $L_2$ norm is never larger than the $L_1$ norm. In the extreme case of uniform frequency distributions, $L_2$ is quadratically smaller than $L_1$. This implies increased errors for Count-Min sketches as compared to AGMS sketches or, equivalently, more space in order to guarantee the same error bounds (even though the sketch vector size is only $O(1/\epsilon)$).
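Fast-Count and Count-Min sketches share the same counters and differ only in how the bias of the natural estimator Y' is handled, which can be checked on the numbers of Examples 8 and 9. The short Python sketch below is an illustration only; the function names are hypothetical and the counters are copied from the examples rather than produced by real hash functions.

```python
import statistics


def natural_estimate(row_f, row_g):
    """Y' = sum_i x_f[i] * x_g[i] for one row of counters."""
    return sum(a * b for a, b in zip(row_f, row_g))


def fast_count_row_estimate(row_f, row_g):
    """Bias-corrected row estimator of Equation (5-3)."""
    n = len(row_f)
    corrected = n * natural_estimate(row_f, row_g) - sum(row_f) * sum(row_g)
    return corrected / (n - 1)


def fast_count_estimate(xf, xg):
    """Fast-Count: average of the bias-corrected row estimators."""
    rows = [fast_count_row_estimate(rf, rg) for rf, rg in zip(xf, xg)]
    return statistics.mean(rows)


def count_min_estimate(xf, xg):
    """Count-Min: minimum of the natural row estimators."""
    return min(natural_estimate(rf, rg) for rf, rg in zip(xf, xg))


# Final sketch vectors of Examples 8 and 9 (three rows of three counters).
xf = [[7, 1, 5], [5, 5, 3], [2, 3, 8]]
xg = [[8, 2, -1], [9, -1, 1], [-1, 1, 9]]

print([fast_count_row_estimate(rf, rg) for rf, rg in zip(xf, xg)])  # [21.0, 6.0, 51.0]
print(fast_count_estimate(xf, xg))                                  # 26.0
print([natural_estimate(rf, rg) for rf, rg in zip(xf, xg)])         # [53, 43, 73]
print(count_min_estimate(xf, xg))                                   # 43
```

Both final estimates are computed from identical counters: the average of the corrected rows gives 26, while the minimum of the uncorrected rows gives 43, against a true join size of 31.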


5.1.5 Comparison

Given the above sketching techniques, we qualitatively compare their expected performance based on the existing theoretical results. The techniques are compared relative to the result obtained by the use of AGMS sketches for the self-join size problem, known to be asymptotically optimal [3]. The size of join results are considered relative to the product of the $L_2$ ($L_1$ for Count-Min) norms of the data streams. Notice that large results correspond to the particular self-join size problem. Low skew corresponds to frequency vectors for which the ratio $L_1/L_2$ is close to $\sqrt{N}$ (uniform distribution), while for high skew the ratio $L_1/L_2$ is close to 1.

Table 5-3 summarizes the distribution-independent confidence bounds predicted by the theory. Since the bounds for AGMS, Fast-AGMS, and Fast-Count sketches are identical, they have the same behavior from a theoretical perspective. For small size of join results, the performance of these three methods worsens. Count-Min sketches have a distinct behavior due to their dependency on the $L_1$ norm. Their performance is highly influenced not only by the size of the result, but also by the skewness of the data. For low skew data, the performance is significantly worse than the performance of AGMS sketches. Since $L_1 \ge L_2$, the theoretical performance of Count-Min sketches is always worse than the performance of AGMS (Fast-AGMS, Fast-Count) sketches.

5.2 Statistical Analysis of Sketch Estimators

The goals pursued in refining the sketching techniques were to leverage the randomness and to decrease the update time while maintaining the same error guarantees as for the original AGMS sketches. As we have previously seen, all kinds of tradeoffs are involved. The main drawback of the existing theoretical results is that they characterize only the asymptotic behavior, but do not provide enough details about the behavior of the sketching techniques in practice (they ignore important details about the estimator because they are derived from distribution-independent confidence bounds). From a purely practical point of view, we are interested in sketching techniques that are reasonably easy


to implement, are fast (i.e., have a small update time for the synopsis data structure), have good accuracy, and can estimate their error as precisely as possible through confidence intervals. Although the same goals are pursued from the theoretical point of view, in theory we insist on deriving simple formulas for the error, expressed in terms of asymptotic big-O notation. This is perfectly reflected by the theoretical results we presented in the previous section. The problem with theoretical results is that, since we always insist on expressible formulas, we might ignore details that matter at least in some cases: the theoretical results are always conservative, but they might sometimes be too conservative. In this section, we explore the sketching techniques from a statistical perspective by asking the following questions, which reflect the difference between the pragmatic and the theoretical points of view:

• All sketching techniques combine multiple independent instances of elementary sketches using the estimators from Section 2.3 (Mean, Median, Minimum) in order to define a more accurate estimator for the expected value. Which of these estimators is the most accurate for each of the four sketching techniques?
• How tight are the theoretical distribution-independent confidence bounds? And is it possible to derive tighter distribution-dependent confidence bounds that work in practice, based on the estimator chosen in the previous question? We are not interested in bounds that are tight only for some situations, but in confidence bounds that are realistic for all situations. The gold standard we are aiming for is confidence bounds similar to the ones for sampling techniques [35].

We use a large-scale statistical analysis based on experiments in order to answer the above questions. The plots in this section have statistical significance and are not highly sensitive to the experimental setup (Section 4.4).

5.2.1 Basic AGMS Sketches

We explore which estimator (mean, median, or minimum) to use for AGMS sketches instead of the median of means estimator proposed in the original paper [3], and whether that would be advisable. In order to accomplish this task, we plotted the distribution of the basic sketch for a large spectrum of problems. Figure 5-1 is a generic example of the form of the distribution. It is clear from this figure that both the minimum and the median


are poor choices. The median is a poor choice because the distribution of the elementary AGMS sketch is not symmetric and there exists a variable gap between the mean of the distribution and the median, a gap that is not easy to compute and, thus, to compensate for. In order to verify that the mean is the optimal estimator (as the theory predicts), we plot its distribution for the same input data (Figure 5-1). As expected, the distribution is normal and its expected value is the true result.

As explained in Section 2.3.6, the mean is always preferable to the median of means as an estimator for the expected value of a random variable given as samples. This is the case because, once averaging over the sample space, the distribution of the estimator starts to become normal (Mean CLT), and it is known that the mean is more efficient than the median for normal distributions (Section 2.3.5). Although the median of means estimator has no statistical significance, it allows the derivation of exponentially decreasing distribution-independent confidence intervals based on the Chernoff bound (Theorem 3).

To derive tighter distribution-dependent confidence bounds based only on the mean estimator, we can use Theorem 5. The value of the variance is either the exact one (if it can be determined) or, more realistically, an estimate computed from the samples. The distribution-independent confidence bounds in Theorem 8 are wider by a factor of approximately 4 than the CLT bounds, as derived in Example 1. This discrepancy between the distribution-independent bounds and the effective error was observed experimentally in [18, 51], but it was not explained.

In conclusion, the mean estimator seems the right choice from a practical perspective, considering its advantages over the median of means estimator, of which it is a part anyway.

5.2.2 Fast-AGMS Sketches

Comparing Theorems 11 and 12, which characterize AGMS and Fast-AGMS (F-AGMS) sketches, respectively, we observe that the predicted accuracy is identical, but Fast-AGMS sketches have a significantly lower update time. This immediately indicates that F-AGMS should


be preferred to AGMS. In the previous section, we saw a discrepancy of a factor of approximately 4 between the distribution-independent bounds and the CLT-based bounds for AGMS sketches, and the possibility of a significant improvement if the median of means estimator is replaced by means only. In this section, we investigate the statistical properties of F-AGMS sketches in order to identify the most adequate estimator and, possibly, to derive tighter distribution-dependent confidence bounds.

We start the investigation of the statistical properties of Fast-AGMS sketches with the following result on the first two moments (expected value and variance) of the basic estimator:

Proposition 3 [14]. Let X be the Fast-AGMS estimator obtained with a family of 4-universal hash functions $h : I \to B$ and a 4-wise independent family $\xi$ of $\pm 1$ random variables. Then,

\[ E_{h,\xi}[X] = E[X_{AGMS}] \]
\[ E_h\left[\mathrm{Var}_{\xi}[X]\right] = \frac{1}{B}\,\mathrm{Var}[X_{AGMS}] \]

The first two moments of the elementary Fast-AGMS sketch coincide with the first two moments of the average of B elementary AGMS sketches (in order to have the same space usage). This is a somewhat unexpected result, since it suggests that hashing plays the same role as averaging when it comes to reducing the variance and that the transformation of the distribution of elementary F-AGMS sketches is the same, i.e., the distribution becomes normal and the variance is reduced by a factor equal to the number of buckets. The following result on the fourth moment of F-AGMS represents the first discrepancy between the distributions of Fast-AGMS and AGMS sketches:


Proposition 4. With the same setup as in Proposition 3, we have:

\[ \mathrm{Var}_h\left[\mathrm{Var}_{\xi}[X]\right] = \frac{B-1}{B^2}\left[ 3\left(\sum_i f_i^2 g_i^2\right)^2 + 4 \sum_i f_i^3 g_i \sum_j f_j g_j^3 + \sum_i f_i^4 \sum_j g_j^4 - 8 \sum_i f_i^4 g_i^4 \right] \tag{5-4} \]

$\mathrm{Var}_h[\mathrm{Var}_{\xi}[X]]$ is a lower bound on the fourth moment of the estimator, for which we cannot derive a simple closed-form formula because the $\xi$ family is only 4-wise independent and 8-wise independence is required to remove the dependency of the formula on the actual generating scheme. Also, if the hash function h is only 2-universal instead of 4-universal, more terms are added to the expression, thus making the fourth moment even larger.

We use kurtosis (the ratio between the fourth moment and the square of the variance, see Section 2.3.5) to characterize the distribution of the Fast-AGMS basic estimator. From Figure 5-5, which depicts the experimental kurtosis and its lower bound in Proposition 4, we observe that when the Zipf coefficient is larger than 1 the kurtosis grows significantly, to the point that it gets to around 1000 for a Zipf coefficient equal to 5. Given these values of the kurtosis, we expect the distribution of the F-AGMS estimator to be close to normal for Zipf coefficients smaller than 1 (kurtosis is equal to 3 for normal distributions, see Section 2.3.5) and then to suffer a drastic change as the Zipf coefficient increases. Large kurtosis is an indicator of distributions that are more concentrated than the normal distribution, but that also have heavier tails [7]. Indeed, Figure 5-2 confirms these observations experimentally for Zipf coefficients equal to 0.2 and 1.5, respectively.

The interaction between hashing and the frequent items is an intuitive explanation for the transformation suffered by the F-AGMS distribution as a function of the Zipf coefficient. For low skew data (uniform distribution) there does not exist a significant difference between the ways the frequencies are spread into the buckets by the hash function. Although there exists some variation due to the randomness of the hash function, the distribution of the estimator is normally centered on the true value. The


situation is completely different for skewed data, which consists of some extremely high frequencies and some other small frequencies. The impact of the hash function is dominant in this case. Whenever the high frequencies are distributed in different buckets (this happens with high probability) the estimation is extremely accurate. When at least two high frequencies are hashed into the same bucket (with small probability) the estimator is orders of magnitude away from the true result. This behavior explains perfectly the shape of the distribution for skewed data: the majority of the mass of the distribution is concentrated on the true result, while some small mass is situated far away in the heavy tails. Notice that although large values of kurtosis capture this behavior, an extremely large number of experiments is required to observe the behavior in practice. For example, in Figure 5-5 the experimental kurtosis lies under the lower bound in some cases because the colliding events did not appear even after 10 million experiments.

Given the different shapes of the distribution, no single estimator (mean or median, since minimum is clearly not a valid estimator for the expected value) is always optimal for Fast-AGMS sketches. While the mean is optimal for low skew data, since the distribution is normal (see Section 2.3.5), the median is clearly preferable for skewed data because of the large values of kurtosis. The symmetry condition required for the median to be an estimator for the expected value is satisfied because of the symmetric $\pm 1$ random variables. In the following, we consider the median as the estimator for Fast-AGMS sketches even though its error is larger by a factor of 1.25 for low skew data compared to the error of the mean.

In order to quantify exactly the gain of the median over the mean, we use the concept of efficiency (see Section 2.3.5). Unfortunately, we cannot derive an analytical formula for efficiency because it depends on the value of the probability density function at the true median, which is what we are actually trying to determine. The alternative is to estimate the efficiency empirically as a function of kurtosis which, as we have already seen, is a good indicator for the distribution of the F-AGMS estimator and can be computed analytically. Figure 5-6 depicts the experimental efficiency as a function of kurtosis for


sketches with various numbers of buckets. As expected, efficiency increases as the kurtosis increases, i.e., as the data becomes more skewed, and gets to some extreme values on the order of $10^{10}$. While efficiency is independent of the number of buckets in the sketch, the value of kurtosis is limited, with larger values corresponding to a sketch with more buckets. This implies that efficiency is not a simple function of the kurtosis, and that other parameters of the sketch and of the data also have to be considered. Consequently, although we could not quantify exactly the gain of using the median instead of the mean, the extremely large values of efficiency clearly indicate that the median is the right estimator for F-AGMS sketches.

The distribution-independent confidence bounds given by Theorem 12 are likely to be far too conservative because they are derived from the first two moments using the Chebyshev and Chernoff inequalities. These bounds are identical to the bounds for AGMS sketches since the two have the same expected value and variance. The significant discrepancy in the fourth moment and in the shape of the distribution between F-AGMS and AGMS (Figures 5-1 and 5-2 depict the distributions for the same data) is not reflected by the distribution-independent confidence bounds. Figure 5-7a confirms the huge gap (as much as 10 orders of magnitude) that exists between the distribution-independent bounds and the experimental error. Practical distribution-dependent confidence bounds can be derived from the median bounds in Theorem 7. A comparison between distribution-independent confidence bounds, distribution-dependent confidence bounds, and the experimental error (%) is depicted in Figure 5-7b. Two important facts can be drawn from these results: first, the distribution-independent bounds are too large for large Zipf coefficients and, second, the median bounds are always accurate. Figure 5-7a also reveals that the ratio between the actual error and the prediction is not strongly dependent on the correlation between the data for the same Zipf coefficient. This implies that, in order to characterize the behavior of F-AGMS sketches for the size of join problem, only the Zipf coefficient of the distribution of the two streams has to be considered.
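Two quantities used throughout this section, kurtosis and the relative error of the median with respect to the mean, are easy to estimate empirically. The Python sketch below is a small illustration under stated assumptions rather than the experimental setup of this work: the function names are hypothetical, kurtosis is computed as the fourth central moment divided by the squared variance, and standard normal samples stand in for the low skew regime in which the basic estimator is close to normal.

```python
import random
import statistics


def kurtosis(samples):
    """Fourth central moment divided by the square of the variance."""
    mu = statistics.fmean(samples)
    second = statistics.fmean((x - mu) ** 2 for x in samples)
    fourth = statistics.fmean((x - mu) ** 4 for x in samples)
    return fourth / (second * second)


def median_to_mean_error_ratio(sample_size, trials, rng):
    """Spread of the sample median relative to the sample mean.

    Assumption: each trial draws `sample_size` standard normal values,
    mimicking basic estimators with a normal distribution.
    """
    means, medians = [], []
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
        means.append(statistics.fmean(sample))
        medians.append(statistics.median(sample))
    return statistics.pstdev(medians) / statistics.pstdev(means)


rng = random.Random(1)
normal_sample = [rng.gauss(0.0, 1.0) for _ in range(20000)]
print(kurtosis(normal_sample))                     # close to 3 for normal data
print(median_to_mean_error_ratio(101, 2000, rng))  # roughly 1.25 for normal data
```

For normal data the ratio approaches $\sqrt{\pi/2} \approx 1.25$, which is the factor by which the median is worse than the mean for low skew; heavy-tailed distributions with large kurtosis reverse the comparison in favor of the median.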


5.2.3 Count-Min Sketches

Based on Theorem 13, we expect Count-Min (CM) sketches to over-estimate the true value by a quantity proportional to the product of the sizes of the two streams and inversely proportional to the number of buckets of the sketch structure. This is the only sketch whose error depends on the first frequency moment, not the second frequency moment, and on the amount of memory (number of hashing buckets), not the square root of the amount of memory. While the dependency on the first frequency moment is worse than the dependency on the square root of the second frequency moment, since the first is always larger than or equal to the second, the dependency on the amount of memory is favorable to Count-Min sketches. According to the theoretical distribution-independent confidence bounds, we expect Count-Min sketches to have weak performance for relations with low skew, but performance comparable to that of AGMS sketches (not much better, though) for skewed relations. In this section, we take a closer look at the distribution of the basic CM estimator and discuss the methods to derive confidence bounds for Count-Min sketches.

We start the study of the distribution of the elementary CM estimator with the following result, which characterizes the moments of the estimator:

Proposition 5. If $X_{CM}$ is the elementary Count-Min estimator, then:

\[ E[X_{CM}] = \sum_{i \in I} f_i g_i + \frac{1}{B}\left( \sum_{i \in I} f_i \sum_{j \in I} g_j - \sum_{i \in I} f_i g_i \right) \tag{5-5} \]

\[ \mathrm{Var}[X_{CM}] \ge \frac{1}{B}\,\mathrm{Var}[X_{AGMS}] \tag{5-6} \]

Equation 5-5 is proved in [15]. The inequality in Equation 5-6 becomes an equality if the hash functions used are 4-universal. For 2-universal hashes, the variance increases depending on the particular generating scheme and no simple formula can be derived. Most of the proof of Equation 5-6 for 4-universal hashes is embedded in the computation of the variance for Fast-Count sketches (see Section 5.2.4), but the exact formula does


not appear in previous work (since proving the formula is a simple matter of rewriting the equations in [58], we do not provide the proof here). The expected value of $X_{CM}$ is always an over-estimate of the true result, which is the reason why the minimum estimator is chosen. Interestingly, the variance of the estimator coincides with the variance of averages of B AGMS sketches and with the variance of Fast-AGMS sketches.

In order to characterize the distribution of CM sketches we conducted an extensive statistical study based on experiments. As for Fast-AGMS sketches, the distribution of the $X_{CM}$ estimator is highly dependent on the skewness of the data and on the randomness of hashing. The fundamental difference is that the distribution is not symmetric anymore because $\pm 1$ random variables are not used. In its generic shape, the distribution has the majority of the mass concentrated at the left extremity, while the right tail is extremely long. The intuition behind this shape lies in the way hashing spreads the data into buckets: with high probability the data is evenly distributed into the buckets (this situation corresponds to the left peak), while with some extremely low probability a large number of items collide into the same bucket (this situation corresponds to the right tail).

Although the shape is generic, the position of the left peak (the minimum of the distribution) depends heavily on the actual data. For low skew data the peak is far away from the true value. As the data becomes more skewed the peak starts to move to the left, to the point where it reaches the true value. The movement towards the true value as the Zipf parameter increases is due to the importance that high frequencies start to gain. For low skew data (uniform distribution) the position of the peak is given by the average number of frequencies that are hashed into the same bucket. For skewed data dominated by some high frequencies, the peak is situated at the point corresponding to the high frequencies being hashed into different buckets. Since high frequencies dominate the result, the estimate is in this case closer to the true


value. Figure 5-3 depicts the distribution of $X_{CM}$ for Zipf coefficients equal to 1.0 and 2.0, respectively.

The distribution-independent confidence bounds for CM sketches in [15] are derived from the Markov inequality. Essentially, the error bounds are expressed in terms of the expected value of the over-estimated quantity $\frac{1}{B}\left( \sum_{i \in I} f_i \sum_{j \in I} g_j - \sum_{i \in I} f_i g_i \right)$ in $E[X_{CM}]$. Neither the variance nor the bias is considered in deriving these bounds. To verify the accuracy of the confidence bounds, we plot in Figure 5-13 the ratio between the experimental error obtained for data sets with different Zipf and correlation coefficients (see Section 4.4) and the corresponding predicted error. The main observation from these results is that the ratio between the actual error and the prediction decreases as the Zipf coefficient increases, to the point where the gap is many orders of magnitude. In what follows we provide an intuitive explanation for this behavior.

For low skew data the error is almost entirely due to the bias, which is correctly estimated by the expected value; hence the perfect correspondence between the actual error and the prediction. This observation is inferred from Figure 5-3a, which plots the distribution of the elementary sketch estimator for a Zipf coefficient equal to 1.0. In this situation, the standard deviation of the elementary estimator is much smaller than the bias. If multiple instances of the elementary sketch are obtained, they will all be relatively close to the expected value (no more than a number of standard deviations to the left), thus their minimum will be close to the expected value. The fact that the standard deviation is small when compared to the bias for low skew data can be predicted using Proposition 5, based on the fact that the $L_2$ norm is much smaller than the $L_1$ norm for low skew data.

For high skew, the standard deviation becomes significantly larger than the bias, as can be seen in Figure 5-3b. In this situation, even though the bias is still significant, with high probability some of the samples of the elementary sketch will be close to the true value, thus the minimum of multiple elementary sketches will have significantly smaller error. Notice how the shape of the distribution changes when the Zipf coefficient


increases: it is normal-like for low skew, but it has no left tail for high skew. The distribution is forced to take this shape when the standard deviation is larger than the bias, since CM sketch estimators cannot take values smaller than the true value. Referring back to the moments of the CM elementary estimator in Proposition 5, for large skew the standard deviation is comparable to the expected value, but the bias is much smaller since most of the result is given by the large frequencies, whose contribution is accurately captured by the estimator.

While the above discussion gives a good intuition of why the theory gives reasonable error predictions for low skew data and makes large errors for high skew data, unfortunately it does not lead to better bounds for skewed data. In order to provide tight confidence bounds, the distribution of the minimum of multiple elementary sketch estimators has to be characterized. While CLT-based results exist for the minimum estimator (see Section 2.3.7), they provide means to characterize the variance, but not the bias, of the minimum estimator. Determining the bias of the minimum is crucial for correct predictions of the error for large skew, but it seems a difficult task since it depends on the precise distribution of the data (not only on some characteristics like the first few moments). It is worth mentioning that tighter bounds for CM sketches can be obtained if the Zipf coefficient of the data is determined by other means [16].

Notice the particular problems of deriving confidence bounds for CM sketches: high errors are correctly predicted, while small errors are incorrectly over-estimated. Consequently, Count-Min sketches are difficult to use in practice because their behavior cannot be predicted accurately.

5.2.4 Fast-Count Sketches

The Fast-Count (FC) elementary estimator is essentially the bias-corrected version of the Count-Min elementary estimator. The bias correction is a translation by the bias and a scaling by the factor $\frac{B}{B-1}$. This can be observed in Figure 5-4, which depicts the distribution of Fast-Count sketches. Everything stated for the CM sketch distribution still holds for the distribution of FC sketches, with the major difference that Fast-Count sketches are


unbiased, while Count-Min sketches are biased. Given the unbiased estimator and the asymmetric shape of the distribution, the mean is the only viable estimator for the expected value, which is also the true value in this case.

The distribution-independent confidence bounds for FC sketches, derived in a similar manner using Chebyshev and Chernoff bounds, are identical to those for AGMS and Fast-AGMS sketches because the first two moments of the distributions are equal. Tighter distribution-dependent confidence bounds are derived using Mean CLT for AGMS and Median CLT for Fast-AGMS sketches, respectively. Although the mean estimator is also used for FC sketches, the asymptotic regime of Mean CLT does not apply in this case because the number of samples averaged is only on the order of tens. The alternative is to use the Student t-distribution for modeling the behavior of the mean (see Section 2.3.3), but the improvement over the distribution-independent bounds is not so remarkable. In conclusion, both distribution-independent and distribution-dependent bounds can be used for FC sketches without a significant advantage for either of them.

5.3 Empirical Evaluation

The main purpose of the experimental evaluation is to validate and complement the statistical results we obtained in Section 5.2 for the four sketching techniques. The specific goals are to establish the relative accuracy of the four sketching techniques for various problems and to determine the actual update performance. Our main tool in establishing the accuracy of sketches is to measure their error on synthetic data sets for which we control both the skew, via the Zipf coefficient, and the correlation. This allows us to efficiently cover a large spectrum of problems and to draw insightful observations about the performance of sketches. We then validate the findings on real-life data sets and other synthetic data generators.

The main findings of the study are:

• AGMS and Fast-Count (FC) sketches have virtually identical accuracy throughout the spectrum of problems if only averages are used for AGMS. FC sketches are preferable since they have significantly smaller update time.


• The performance of Count-Min sketches is strongly dependent on the skew of the data. For small skew, the error is orders of magnitude larger than the error of the other types of sketches. For large skew, CM sketches have the best performance, much better than AGMS and FC.

• Fast-AGMS (F-AGMS) sketches have error at most 25% larger than AGMS sketches for small skew, but their error is orders of magnitude smaller for moderate and large skew (as much as 6 orders of magnitude for large skew). Their error for large skew is slightly larger than the error of CM sketches.

• All sketches, except CM for small skew, are practical in evaluating self-join size queries. This is to be expected since AGMS sketches are asymptotically optimal [3] for this problem. For size of join problems, F-AGMS sketches remain practical well beyond AGMS and FC sketches. CM sketches have good accuracy as long as the data is skewed.

• F-AGMS, FC, and CM sketches (all of them are based on random hashing) have fast and comparable update performance, ranging between 50 and 400 ns depending on the size of the sketch.

5.3.1 Testbed and Methodology

Sketch Implementation: We implemented a generic framework that incorporates the sketching techniques mentioned throughout the chapter. Algorithms for generating random variables with a limited degree of independence [45, 51] are at the core of the framework. Since the sketching techniques have a similar structure, they are designed as a hierarchy parameterized on the type of random variables they employ. Applications only have to instantiate the sketching structures with the corresponding size and random variables, and to call the update and the estimation procedures.

Data Sets: We used two synthetic data generators and one real-life data set in our experiments. The data sets cover an extensive range of possible inputs, thus allowing us to infer general results on the behavior of the compared sketching techniques.

Census data set [17]: This real-life data set was extracted from the Current Population Survey (CPS) data repository, which is a monthly survey of about 50,000 households. Each month's data contains around 135,000 tuples with 361 attributes. We ran experiments for estimating the size of join on the weekly wage (PTERNWA)


numerical attribute with domain size 288,416 for the surveys corresponding to the months of September 2002 (…,563 records) and September 2006 (…4,931 records), counted after eliminating the records with missing values.

Estan et al.'s [24] synthetic data generator: Two tables with approximately 1 million tuples each, with a Zipf distribution for the frequencies of the values, are randomly generated. The values are from a domain with 5 million values, and for each of the values its corresponding frequency is chosen independently at random from the distribution of the frequencies. We used in our experiments the memory-peaked (Zipf = 0.8) and the memory-unpeaked (Zipf = 0.35) data sets.

Synthetic data generator: We implemented our own synthetic data generator for frequency vectors. It takes into account parameters such as the domain size, the number of tuples, the frequency distribution, and the correlation (decor = 1 - correlation) coefficient. Out of the large variety of data sets that we conducted experiments on, we focus in this experimental evaluation on frequency vectors over a domain of size $2^{14} = 16{,}384$ that contain 1 million tuples and have Zipf distributions (the Zipf coefficient ranges between 0 and 5). The degree of correlation between two frequency vectors varies from full correlation to complete independence.

Answer-Quality Metrics: Each experiment is performed 100 times and the average relative error, i.e., $\frac{|\text{actual} - \text{estimate}|}{\text{actual}}$, over the number of experiments is reported. In the case of direct comparison between two methods, the ratio between their average relative errors is reported. Although we performed the experiments for different sketch sizes, the results are reported only for a sketch structure consisting of 21 vectors with 1024 counters each (n = 1024, m = 21), since the same trend was observed for the other sketch sizes.

5.3.2 Results

Self-Join Size Estimation: The behavior of the sketching techniques for estimating the self-join size as a function of the Zipf coefficient of the frequency distribution is


depicted in Figure 5-8 on both a normal (a) and a logarithmic (b) scale. As expected, the errors of AGMS and FC sketches are similar (the difference for close to uniform distributions is due to the EH3 [51] random number generator). While F-AGMS has almost the same behavior as FC (AGMS) for small Zipf coefficients, the F-AGMS error decreases drastically for Zipf coefficients larger than 0.8. This is due to the effect the median estimator has on the distribution of the predicted results: for small Zipf coefficients the distribution is normal, thus the performance of the median estimator is approximately 25% worse, while for large Zipf coefficients the distribution is focused around the true result (Section 5.2). CM sketches have extremely poor performance for distributions close to uniform. This can be explained theoretically by the dependency on the $L_1$ norm, which is much larger than the $L_2$ norm in this regime. Intuitively, uniform distributions have multiple non-zero frequencies that are hashed into the same bucket, thus highly over-estimating the predicted result. The situation changes dramatically at high skew, when it is highly probable that each non-zero frequency is hashed to a different bucket, making the estimation almost perfect. Based on these results, we can conclude that F-AGMS is the best (or close to the best) sketch estimator for computing the second frequency moment, irrespective of the skew.

Join Size Estimation: In order to determine the performance of the sketching techniques for estimating the size of join, we conducted experiments based on the Zipf coefficient and the correlation between the two frequency vectors. A correlation coefficient of 0 corresponds to two identical frequency vectors (self-join size). For a correlation coefficient of 1, the frequencies in the two vectors are completely shuffled. The results for different Zipf coefficients are depicted in Figure 5-9 as a function of the correlation. It can be clearly seen how the relation between the sketch estimators changes as a function of the skew (behavior identical to the self-join size). Moreover, it seems that the degree of correlation affects all the estimators similarly (the error increases as the degree of correlation increases), but it does not affect the relative order given by the

PAGE 107

Zipf coefficient. The same findings are reinforced in Figure 5-10, which depicts the relative performance, i.e., the ratio of the average relative errors, between pairs of estimators for computing the size of join.

Figure 5-11 plots the accuracy for estimating the size of join of two streams with different skew coefficients. While the error of F-AGMS and FC increases with the skewness of the free stream, the error of CM stays almost constant, having a minimum where the two streams have equal Zipf coefficients. At the same time, it seems that the value of the error is determined by the smallest skew parameter. Consequently, we conclude that, as in the case of self-join size, the Zipf coefficient is the only parameter that influences the relative behavior of the sketching techniques for estimating the size of join of two frequency vectors.

Memory Budget: The accuracy of the sketching methods (AGMS is excluded since its behavior is identical to FC, but its update time is much larger) as a function of the space available is represented in Figure 5-12 for one of Estan's synthetic data sets (a) and for the census real-life data set (b). The error of CM sketches is orders of magnitude worse than the error of the other two methods for the entire range of available memory (due to the low skew). The accuracy of F-AGMS is comparable with that of FC for low skew data, while for skewed data F-AGMS is clearly superior. Notice that the relative performance of the techniques is not dependent on the memory budget.

Update Time: The goal of the timing experiment is to clarify whether there exist significant differences in update time between the hash sketches, since the random variables they use are different. As shown in Figure 5-14, all the schemes have comparable update time performance, CM sketches being the fastest, while FC sketches are the slowest. Notice that the relative gap between the schemes shrinks when the number of counters increases, since more references are made to the main memory. As long as the sketch vector fits into the cache, the update rate is extremely high: around 10 million updates
can be executed per second on the test machine (the results in Figure 5-14 are for a Xeon 2.8 GHz processor with 512 KB of cache and 4 GB of main memory), making hash sketches a viable solution for high-speed data stream processing.

5.3.3 Discussion

As we have seen, the statistical and empirical study in this chapter paints a different picture than suggested by the theory (see Table 5-3). Table 5-4 summarizes these results qualitatively and indicates that on skewed data, F-AGMS and CM sketches have much better accuracy than expected. The statistical analysis in Section 5.2 revealed that the theoretical results for Fast-AGMS (F-AGMS) and Count-Min (CM) sketches do not capture the significantly better accuracy with respect to AGMS and Fast-Count (FC) sketches for skewed data. The reason such a large gap exists between the theory and the actual behavior is that the median, for F-AGMS, and the minimum, for CM, have a fundamentally different behavior than the mean on skewed data. This behavior defies statistical intuition since most distributions that are encountered in practice have relatively small kurtosis, usually below 20. The distributions of approximation techniques that use hashing on skewed data can have kurtosis in the 1000 range, as we have seen for F-AGMS sketches. For these distributions, the median, as an estimator for the expected value, can have an error $10^6$ times smaller than the mean.

An interesting property of all sketching techniques is that the relationship between their accuracy does not change significantly when the degree of correlation changes, as indicated by Figure 5-10. The relationship is strongly influenced by the skew though, which suggests that the nature of the individual relations, but not the interaction between them, dictates how well sketching techniques behave.

The relationship between sketches in Figure 5-10 also indicates that F-AGMS sketches essentially work as well as AGMS and FC for small skew and just slightly worse than CM for large skew. It seems that F-AGMS sketches combine in an ideal way the benefits of AGMS sketches and hashes and give good performance throughout the spectrum of problems without the need to determine the skew of the data. While CM sketches have better performance for large skew, their use seems riskier since their performance outside this regime is poor and their accuracy cannot be predicted precisely for large skew. It seems that, unless extremely precise information about the data is available, F-AGMS sketches are the safe choice.

5.4 Conclusions

In this chapter we studied the four basic sketching techniques proposed in the literature, AGMS, Fast-AGMS, Fast-Count, and Count-Min, from both a statistical and empirical point of view. Our study complements and refines the theoretical results known about these sketches. The analysis reveals that Fast-AGMS and Count-Min sketches have much better performance than the theoretical prediction for skewed data, by a factor as much as $10^6$ to $10^8$ for large skew. Overall, the analysis indicates strongly that Fast-AGMS sketches should be the preferred sketching technique since they have consistently good performance throughout the spectrum of problems. The success of the statistical analysis we performed indicates that, especially for estimators that use minimum or median, such analysis gives insights that are easily missed by classical theoretical analysis. Given the good performance, the small update time, and the fact that they have tight error guarantees, Fast-AGMS sketches are appealing as a practical basic approximation technique that is well suited for data stream processing.
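
To make the structure of the recommended technique concrete, the following is a minimal sketch of a Fast-AGMS style estimator, not the implementation evaluated in this chapter. The class name `FastAGMS` and its parameters are hypothetical, and fully random lookup tables over a small domain are assumed in place of the limited-independence hash and ±1 families used in practice.

```python
import random
from statistics import median


class FastAGMS:
    """Hash sketch with `rows` rows of `buckets` counters (Fast-AGMS layout)."""

    def __init__(self, rows, buckets, domain_size, seed=0):
        rng = random.Random(seed)
        self.rows, self.buckets = rows, buckets
        # Assumption: fully random tables stand in for the hash functions
        # and the +/-1 random variables of a real implementation.
        self.bucket_of = [[rng.randrange(buckets) for _ in range(domain_size)]
                          for _ in range(rows)]
        self.sign_of = [[rng.choice((-1, 1)) for _ in range(domain_size)]
                        for _ in range(rows)]
        self.counters = [[0] * buckets for _ in range(rows)]

    def update(self, key, freq=1):
        # Only one counter per row is touched for each stream element.
        for r in range(self.rows):
            bucket = self.bucket_of[r][key]
            self.counters[r][bucket] += self.sign_of[r][key] * freq

    def join_size(self, other):
        # Both sketches must be built with the same seed (same tables).
        per_row = [sum(a * b for a, b in zip(row_a, row_b))
                   for row_a, row_b in zip(self.counters, other.counters)]
        return median(per_row)  # median over the rows

    def self_join_size(self):
        return self.join_size(self)
```

With an odd number of rows the median is one of the row estimates. For a stream that contains a single distinct key, every row estimate is exact, which gives a simple sanity check of the update and estimation logic.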

Table 5-1. Families of ±1 random variables.

Counter    Key domain
           1     2     3     4     5
1          +1    +1    +1    -1    -1
2          +1    -1    -1    +1    +1
3          -1    +1    -1    +1    -1

Table 5-2. Families of hash functions.

Row        Key domain
           1     2     3     4     5
1          1     1     3     2     3
2          1     3     2     1     2
3          3     2     3     3     1

Table 5-3. Expected theoretical performance.

Sketch        Size of Join
              Large    Small, Low Skew    Small, High Skew
AGMS          0        0                  -
Fast-AGMS     0        0                  -
Fast-Count    0        0                  -
Count-Min     -        0                  -

Table 5-4. Expected statistical/empirical performance.

Sketch        Size of Join
              Large    Small, Low Skew    Small, High Skew
AGMS          0        0                  -
Fast-AGMS     0        +                  +
Fast-Count    0        0                  -
Count-Min     -        +                  +
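
As a worked illustration of how the families in Table 5-1 and Table 5-2 are used, the snippet below computes sketch counters for a small frequency vector. The frequency vector is hypothetical, the function names are illustrative, and pairing row r of Table 5-2 with row r of Table 5-1 for the hashed variant is an assumption of this sketch.

```python
# +/-1 values read from Table 5-1: XI[r][k-1] belongs to counter r+1, key k.
XI = [
    [+1, +1, +1, -1, -1],
    [+1, -1, -1, +1, +1],
    [-1, +1, -1, +1, -1],
]
# Bucket indices read from Table 5-2: H[r][k-1] is the bucket of key k in row r+1.
H = [
    [1, 1, 3, 2, 3],
    [1, 3, 2, 1, 2],
    [3, 2, 3, 3, 1],
]


def agms_counters(freq):
    """AGMS style: every key contributes to every counter."""
    return [sum(f * x for f, x in zip(freq, row)) for row in XI]


def hashed_counters(freq, signed=True):
    """Hash-sketch style: every key contributes to one bucket per row.

    signed=True adds frequency * xi (Fast-AGMS style); signed=False adds
    the raw frequency (Count-Min style).
    """
    rows = []
    for h_row, xi_row in zip(H, XI):
        buckets = [0, 0, 0]
        for f, b, x in zip(freq, h_row, xi_row):
            buckets[b - 1] += f * x if signed else f
        rows.append(buckets)
    return rows


freq = [3, 0, 1, 0, 2]        # hypothetical frequencies of keys 1..5
print(agms_counters(freq))    # [2, 4, -6]
print(hashed_counters(freq))  # [[3, 0, -1], [3, 1, 0], [-2, 0, -4]]
```

The true self-join size of this vector is 9 + 1 + 4 = 14. Squaring the AGMS counters gives the basic estimates 4, 16, and 36, while the signed hashed rows give sums of squares 10, 10, and 20, which shows how much individual basic estimators can deviate before averaging or taking the median.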

Figure 5-1. The distribution of AGMS sketches for self-join size. (a) Basic estimator. (b) Average estimator.

Figure 5-2. The distribution of F-AGMS sketches for self-join size. (a) Zipf = 0.2. (b) Zipf = 1.5.

Figure 5-3. The distribution of CM sketches for self-join size. (a) Zipf = 1.0. (b) Zipf = 2.0.

Figure 5-4. The distribution of FC sketches for self-join size. (a) Basic estimator. (b) Average estimator.

Figure 5-5. F-AGMS kurtosis.

Figure 5-6. F-AGMS efficiency.

Figure 5-7. Confidence bounds for F-AGMS sketches as a function of the skewness of the data. (a) Correlation. (b) Confidence bounds.

Figure 5-8. Accuracy as a function of the Zipf coefficient for self-join size estimation. (a) Normal scale. (b) Log scale.

Figure 5-9. Accuracy as a function of the correlation coefficient for size of join estimation. (a) Zipf = 0.8. (b) Zipf = 3.0.

Figure 5-10. Relative performance for size of join estimation. (a) F-AGMS vs AGMS (FC). (b) F-AGMS vs CM.

Figure 5-11. Accuracy as a function of the skewness of the data for size of join estimation. (a) Zipf = 0.5. (b) Zipf = 1.0.

Figure 5-12. Accuracy as a function of the available space budget. (a) Memory peaked. (b) Census.

Figure 5-13. Confidence bounds for CM sketches as a function of the skewness of the data.

Figure 5-14. Update time as a function of the number of counters in a sketch that has only one row.

CHAPTER 6
SKETCHES FOR INTERVAL DATA

The problem we treat in this section is to estimate the size of join between a data stream given as points and a data stream given as intervals using sketches. Notice that this remains a size of join problem defined over the frequencies of individual points but, since one of the streams is specified by intervals rather than individual points, if basic sketches are used the update time is proportional to the size of the interval, which is undesirable. In order to apply sketching techniques for the derived problem, there exist two alternatives: either use random variables that are fast range-summable, or apply a domain transformation for reducing the size of the intervals. The two solutions, DMAP and fast range-summation, have update time sub-linear in the interval size. DMAP [18] consists in mapping both the intervals and the points into the space of dyadic intervals in order to reduce the size of the interval representation. Since both intervals and points map to a logarithmic number of dyadic intervals in the dyadic space, the update time becomes poly-logarithmic with respect to the input. Fast range-summation [51] uses properties of the pseudo-random variables in order to sketch intervals in sub-linear time, while points are sketched as before.

While the update time of these methods is poly-logarithmic with respect to the size of the interval, since they are based on AGMS sketches, the update time is also proportional to the size of the sketch. In the light of the statistical and empirical evaluation of hash-based sketches, the update time due to sketching could be significantly improved without loss in accuracy. The question we ask and thoroughly explore in this section is whether hash-based sketches can be combined successfully with the two methods to sketch interval data. The insights gained from the statistical analysis and the empirical evaluation of the hash-based sketching techniques are applied in this section to provide variants of the two methods for sketching interval data that have significantly smaller update time and comparable accuracy.
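
The difference between the two costs can be illustrated on exact (non-sketched) frequency vectors. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the cover routine is a simple greedy decomposition rather than the bit-inspection algorithm developed later in the chapter, and the domain is assumed to have size $2^n$.

```python
from collections import Counter


def dyadic_cover(lo, hi):
    """Greedy cover of [lo, hi] by dyadic intervals, returned as (start, size)."""
    cover, end = [], hi + 1            # work on the half-open range [lo, end)
    while lo < end:
        # Largest power of two aligned at lo (any power of two if lo == 0).
        size = lo & -lo if lo else 1 << end.bit_length()
        while lo + size > end:         # shrink until the interval fits
            size >>= 1
        cover.append((lo, size))
        lo += size
    return cover


def dyadic_ancestors(point, n):
    """The n + 1 dyadic intervals of a domain of size 2**n that contain point."""
    return [((point >> j) << j, 1 << j) for j in range(n + 1)]


def naive_join(points, intervals):
    """Exact size of join; touches every point of every interval."""
    freq = Counter()
    for lo, hi, v in intervals:
        for e in range(lo, hi + 1):    # cost linear in the interval size
            freq[e] += v
    return sum(w * freq[e] for e, w in points)


def dyadic_join(points, intervals, n):
    """Same quantity computed over the domain of dyadic intervals."""
    p, t = Counter(), Counter()
    for e, w in points:
        for d in dyadic_ancestors(e, n):   # n + 1 updates per point
            p[d] += w
    for lo, hi, v in intervals:
        for d in dyadic_cover(lo, hi):     # logarithmic number of updates
            t[d] += v
    return sum(p[d] * t[d] for d in t)
```

For example, with points `[(5, 2), (100, 1), (150, 3), (201, 4)]` and intervals `[(100, 200, 2), (0, 5, 1)]` over a domain of size $2^8$, both functions return 10, but the dyadic version performs 8 interval updates instead of 107. In a sketch-based solution the two `Counter` objects would be replaced by sketches over the dyadic domain.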

In this chapter we investigate both theoretically and empirically the known methods for generating the random variables used by AGMS sketches with the goal of identifying the generating schemes that are practical, both for the traditional application of AGMS sketches, i.e., aggregate computation over streaming data, and for applications that involve interval input. More specifically, our contributions are:

• We provide a detailed study of the practicality of fast range-summation for the known generating schemes. We show that no 4-wise independent generating scheme is practically fast range-summable, even though the scheme based on Reed-Muller codes can be theoretically range-summed in sub-linear time [9].

• We provide a formal treatment of dyadic intervals that allows us to design efficient fast range-summable algorithms for two schemes: the 3-wise independent BCH scheme (BCH3) [2] and the Extended Hamming scheme (EH3) [25].

• We explain how two problems, size of spatial joins and selectivity estimation, can be reduced to the size of join problem, a reduction that allows us to provide estimates of their results based on AGMS sketches. The resulting estimators significantly outperform the state-of-the-art solution based on dyadic mappings [18], sometimes by as much as a factor of 8 in relative error.

We apply the results obtained from the statistical study to design effective algorithms for sketching interval data. Sketches over interval data can be useful by themselves, but they are also a building block in solutions to more complex problems like the size of spatial joins [5, 18, 55]. DMAP and fast range-summation [51] are the existing solutions for sketching interval data. They are both inefficient when compared to hash-based sketches because of the use of AGMS sketches, which have higher update time. In this chapter we study how DMAP and fast range-summation can use the more efficient hash-based sketches. In particular, we show that only DMAP can be extended to other types of sketches and, thus, a significant improvement in update time can be gained by a simple replacement of the underlying sketching technique. To improve the accuracy of DMAP, significantly inferior to that of the fast range-summation method, we propose a simple modification that keeps exact counts for some of the frequencies. We call this modification DMAP COUNTS. We also introduce a method to improve the update
performance of fast range-summation AGMS sketches based on a simple equi-width partitioning of the domain. The experimental results show that these derived methods keep the advantages of their base methods, while significantly mitigating their drawbacks, to the point where they are efficient both in accuracy and update time. The gain in update time can be as large as two orders of magnitude, thus making the improved methods practical. The empirical study suggests that DMAP is preferable when the update time is the primary concern and fast range-summation is desirable for better accuracy.

In the rest of the chapter, we first identify applications that can be reduced to sketching interval data in Section 6.1. Section 6.3 is a detailed study of the dyadic mapping method, with a particular emphasis on dyadic intervals. Section 6.4 treats fast range-summable generating schemes, while Section 6.5 deals with fast range-summation for sketching interval data. Section 6.6 contains an experimental study of the sketching methods for interval data.

6.1 Sketch Applications

AGMS sketches have also been used as building blocks in many applications. For example, [27] compute aggregates over expressions involving set operators, while [25] approximate the L1-difference of two data streams. A solution based on AGMS sketches for the wavelet decomposition of a data stream is introduced in [32], while dynamic quantiles are approximated using random subset sums in [33]. In networking, sketches can be applied for change detection as in [42].

As already mentioned, sketching interval data is a fundamental problem that is used as a building block in more complex problems such as estimating the size of spatial joins and building dynamic histograms. For example, for the size of spatial joins problem in which two data streams of intervals are given, two sketches are built for each stream, one for the entire interval and one for the end-points. The size of the spatial join between the two interval data streams is subsequently estimated as the average of the product of the
interval sketch from one stream and the sketch for the end-points from the other stream (see [18] for complete details).

Although they were introduced for estimating the size of join of two point relations, AGMS sketches are a versatile approximation method that can accommodate multiple types of input. In this chapter we focus on a variation of the original size of join problem in which one of the relations is specified as intervals (this is equivalent to specifying every point inside the interval). To motivate the importance of the derived problem, we introduce two applications that can be expressed as the size of join of two relations, one consisting of points, and one containing intervals. We also identify a class of applications that can be reduced to size of join problems involving an interval relation and a point relation.

6.1.1 Size of Spatial Joins

Given two relations of line segments in the unidimensional space, the size of spatial join problem is to compute the number of segments from the two relations that overlap. The approach in [18], even though not explicitly stated, reduces the size of spatial join problem to two size of join problems, each involving a point relation and an interval relation: the size of join of the line segments from the first relation and the segment end-points from the second relation and, symmetrically, the size of join of the segment end-points from the first relation and the segments from the second relation. In order to implement a solution based on AGMS sketches for the point-interval size of join problem, we have to deal first with efficiently sketching intervals. The solution proposed in [18] realizes a number of transformations for reducing the size of each interval, thus improving the sketching time. We present a detailed analysis of this generic method for sketching intervals in Section 6.3.

6.1.2 Selectivity Estimation for Building Dynamic Histograms

Any histogram construction algorithm has to evaluate the average frequency of the elements in a bucket, also known as the selectivity of the bucket. By introducing a virtual
relation having value 1 only for the elements in the bucket, and value 0 otherwise, the average frequency can be expressed as the size of join between the original relation and the virtual relation corresponding to the bucket, divided by the size of the bucket. Usually a bucket is an interval in the unidimensional space, thus the problem of sketching intervals has to be solved in order to implement an AGMS sketches solution for the selectivity estimation problem.

[57] introduce a solution based on sketches for building dynamic histograms. The streaming relation is summarized as a sketch and then the histogram is extracted from the sketch. The authors focused more on how to determine the buckets such that the resulting histogram is optimal. They did not consider in detail the problem of sketching a bucket and took this step as a black box in their algorithms.

6.2 Problem Formulation

The common point of the above applications is that they can be reduced to size of join problems between a point relation and an interval relation. In order to implement effective sketch-based solutions for this derived problem, the sketch of an interval has to be computed more efficiently than by sketching each point in the interval.

The problem considered in this section is a derivation of the size of join problem defined in Section 2.1 in which one of the two data streams is given by (interval, frequency) pairs rather than (key, frequency) pairs. The frequency is attached to each element in the interval, not only to a single key. Formally, let $S = \{(e_1, w_1), \ldots, (e_s, w_s)\}$ and $T = \{([l_1, r_1], v_1), \ldots, ([l_t, r_t], v_t)\}$ be two data streams, where the keys $e_i$ and the intervals $[l_i, r_i]$, with $l_i \le r_i$, are members of the set $I = \{0, 1, \ldots, N-1\}$, and $w_i$ and $v_i$, respectively, represent frequencies. The computation using sketches of the size of join of the two data streams, defined as the inner product of their frequency vectors, remains our focus.

Notice that it is straightforward to reduce this problem to the basic size of join problem by observing that a pair $([l_i, r_i], v_i)$ in $T$ can be represented as an equivalent set of pairs $(e_j, v_i)$ in $S$, with $e_j$ taking all the values between $l_i$ and $r_i$, $l_i \le e_j \le r_i$. The
drawback of this solution is the time to process an interval, which is linear in the size of the interval.

Summarizing, the problem we treat in this chapter is to estimate the size of join between a point relation and an interval relation using AGMS sketches. We start by revisiting the state-of-the-art solution based on dyadic mappings (DMAP). Then we introduce solutions that use fast range-summable random variables that are based on the known schemes for generating ±1 random variables (Section 3.1).

6.3 Dyadic Mapping (DMAP)

The dyadic mapping method [18], which we call here DMAP, uses dyadic intervals in order to reduce the size of an interval, thus making possible the efficient sketching of intervals. Since dyadic intervals are at the foundation of the method, we introduce them first. We will see that dyadic intervals are also important for fast range-summation (Section 6.4).

6.3.1 Dyadic Intervals

[33] introduce dyadic ranges for sketching intervals in the context of quantile computation. Although they are employed in different work [18, 25, 32], a formal treatment of dyadic intervals was not previously provided. Such a formal treatment is necessary to design efficient algorithms for methods that need dyadic interval decomposition. In this section we provide such a formal treatment.

For the rest of this section, consider that the domain $I$ has size $|I| = N = 2^n$. If this is not the case, the domain can be extended to the first power of 2 that is greater than $N$. The dyadic intervals of the domain $I$ are intervals that can be organized in a hierarchical structure that has $n+1$ layers (the dyadic intervals over the domain $\{0, \ldots, 15\}$ are depicted in Figure 6-2). There exists only one interval at level 0: the entire domain. On the subsequent levels, each interval is split in two equal-size intervals to produce smaller dyadic intervals. On the last level, all dyadic intervals have size one and they consist of all the individual points in the domain. Intuitively, such a construction produces intervals
that have size a power of 2 (more precisely, all the intervals at level $k$ have size $2^{n-k}$) and that have boundaries aligned at multiples of powers of 2. This description is good enough to argue properties of dyadic intervals, but it does not suffice to formally prove the correctness of algorithms that use dyadic intervals. Insights from the strict theoretical treatment of dyadic intervals actually lead to very efficient implementations. We start our presentation with a formal definition of dyadic intervals and some basic properties.

Definition 5. A dyadic interval over the domain $I = \{0, 1, \ldots, N-1\}$, $|I| = N = 2^n$, is an interval of the form $[q2^j, (q+1)2^j)$, where $0 \le j \le n$ and $0 \le q \le 2^{n-j} - 1$.

Proposition 6. There are exactly $2^j$ dyadic intervals at level $j$, each containing $2^{n-j}$ points from $I$.

Proof. Follows directly from the definition.

Proposition 7. The dyadic intervals at level $j$, $0 \le j \le n$, form a partition of the domain $I$. That is, they are disjoint and their union is equal to the entire domain.

Proof. If $j$ is fixed, using the above definition, the dyadic intervals at level $j$ are: $[0, 2^j), [2^j, 2 \cdot 2^j), \ldots, [(2^{n-j} - 1)2^j, 2^{n-j}2^j)$. Clearly, they form a partition of $[0, 2^n)$.

Proposition 8. Let $\delta_1$ and $\delta_2$ be two arbitrary distinct dyadic intervals. If $\delta_1 \cap \delta_2 \neq \emptyset$, then either $\delta_1 \subset \delta_2$ or $\delta_2 \subset \delta_1$.

Proof. Let $\delta_1 = [q2^j, (q+1)2^j)$ and $\delta_2 = [r2^k, (r+1)2^k)$. We have two distinct cases, $j = k$ and $j \neq k$. We discuss each in turn.

Suppose $j = k$. Then either $q = r$, which is not possible since $\delta_1 \neq \delta_2$, or $q \neq r$. In the latter case, the two dyadic intervals, according to Proposition 7, are different elements of a partition of the domain, so they do not intersect, which is a contradiction.

Suppose, without loss of generality, that $j > k$. We have to show that $\delta_2 \subset \delta_1$. This is the only possibility since $\delta_1$ has more elements than $\delta_2$. We have two distinct cases: $r2^k \in \delta_1$ and $r2^k$ […]
• If $r2^k \in \delta_1$ then either $r2^k \le (q+1)2^j - 2^k$, in which case $\delta_2 \subset \delta_1$ since we can fit all $2^k$ elements of $\delta_2$ from $r2^k$ to the end of $\delta_1$, or $r2^k \in ((q+1)2^j - 2^k, (q+1)2^j)$. In this last case, we have $r2^k > (q+1)2^j - 2^k$ and $r2^k < (q+1)2^j$, that is, $(q+1)2^{j-k} > r > (q+1)2^{j-k} - 1$. Since $j > k$, $2^{j-k}$ is an integer quantity. But both $r$ and $q$ are integers and this inequality is a contradiction since we cannot insert an integer between two consecutive integers.

• If $r2^k$ […] $q2^j$. From these two inequalities, by dividing by $2^k$, we have: $r$ […]
Proof. Let $q2^j$ and $q'2^j$, $q$ […]
any dyadic interval in any dyadic cover of $[\alpha, \beta]$, so it has to be the case that the statement is correct.

3. We give a constructive proof of the statement. The claim is that the following algorithm finds the minimal dyadic cover of $[q2^j, \beta]$. Starting from the point $q2^j$, find the largest dyadic interval that fits into $[q2^j, \beta]$, and then iterate from the point immediately following this dyadic interval until all of $[q2^j, \beta]$ is covered. Since $[q2^j, \beta]$ is finite and each dyadic interval contains at least one point, the algorithm terminates in at most $\beta - q2^j + 1$ steps. Let us show that the built cover is minimal and has size at most $j$. We consider an arbitrary interval $[q'2^{j'}, \beta]$ with $q'2^{j'}$ being its dyadic cut-point. Let $q''2^{j''}$ be the dyadic cut-point of the interval $(q'2^{j'}, \beta]$, i.e., the next dyadic cut-point since the current dyadic cut-point is excluded from the interval. We claim that the interval $[q'2^{j'}, q''2^{j''})$ is the largest dyadic interval that is part of the minimal dyadic cover of $[q'2^{j'}, \beta]$. First, $j''$ […]
going to the right, and starting from $\beta$ and going to the left. Moreover, it is enough to inspect the binary representation of $\alpha$ and $\beta$ to determine these dyadic intervals, as shown in Algorithm MinimalDyadicCover 6.8. For example, to determine the dyadic intervals covering $[\alpha, d[\alpha, \beta])$, we inspect the bits of $\alpha$ starting from the least significant and whenever we detect a 1 bit, say in position $j$, we have to include the dyadic interval of size $2^j$ that covers all the integers that have the same binary representation as $\alpha$ on the most significant $n - j$ bits. For $\beta$ the same pattern has to be followed, but the 0 bits signal the inclusion of a dyadic interval. At every step, a new bit in the description of $\alpha$ and $\beta$ is analyzed and the algorithm stops when $\alpha = \beta + 1 = d[\alpha, \beta]$. Example 10 illustrates how the algorithm runs.

Example 10. Consider the interval $[100, 200]$, with $\alpha = 100 = 1100100_2$ and $\beta = 200 = 11001000_2$. The algorithm inspects the bits in $\alpha$ and $\beta$ starting from the least significant one. If a 1 bit is seen in $\alpha$, useful processing has to be done. The same holds for 0 bits in $\beta$. Bit 0 in $\beta$ is 0, implying the addition to the minimal dyadic cover of the interval $[200, 201)$, and the new $\beta$ becomes $\beta = 199 = 11000111_2$. The first bit in $\alpha$ that is equal to 1 is the third one. It forces the addition of the interval $[100, 104)$ to the minimal dyadic cover and $\alpha = 104 = 1101000_2$. The next steps of the algorithm add the intervals $[104, 112)$, $[192, 200)$, $[112, 128)$ and $[128, 192)$ to the minimal dyadic cover,

$$D[100, 200] = \{[100, 104), [104, 112), [112, 128), [128, 192), [192, 200), [200, 201)\} \tag{6-1}$$

At this point, $\alpha = 128 = 10000000_2$ and $\beta = 127 = 1111111_2$, and the algorithm returns. Notice that $\alpha$ equals the dyadic cut-point of the interval, $d[100, 200] = 2^7 = 128$.

Algorithm MinimalDyadicCover 6.8 has running time $O(\log(\beta - \alpha))$ if the elementary operations on integers take $O(1)$ time (as is the case for implementations on real computers). If the notion of dyadic cut-point is not developed and Lemma 3 is not proved,
the intuitive algorithm to find the minimal dyadic cover of an interval is: find the largest dyadic interval contained in $[\alpha, \beta]$; add this dyadic interval to the minimal dyadic cover and then recurse on the start and end left-over parts of $[\alpha, \beta]$. The largest dyadic interval contained in a given interval can be found in $O(\log |I|)$ steps and this has to be done for each of the $O(\log(\beta - \alpha))$ dyadic intervals in the decomposition of $[\alpha, \beta]$, resulting in an $O(\log |I| \log(\beta - \alpha))$ algorithm. Notice that there is an $O(\log |I|)$ gap between the two algorithms. This is due to the fact that the algorithm we provide cleverly avoids the search for the maximal dyadic interval and is able to determine what dyadic intervals are in the minimal cover by simply inspecting the bit representation of the end-points $\alpha$ and $\beta$.

The main conclusion of the above mathematical formalization is that any interval can be decomposed efficiently in at most a logarithmic number of dyadic intervals. This implies that instead of identifying an interval by all the points it contains, we can represent the interval by its minimal dyadic cover. Thus, a reduction from a linear-size representation to a logarithmic-size representation is obtained. As we will see, this is a significant improvement in the sketching context. Moreover, determining the minimal dyadic cover of an interval can be implemented very efficiently by considering only the binary representation of the interval end-points.

6.3.2 Dyadic Mapping Method

DMAP [18] employs dyadic intervals for efficiently sketching intervals in the context of the size of join between a point relation and an interval relation. As already mentioned, the dyadic representation of an interval is more compact than the point representation. The DMAP method uses dyadic intervals in order to reduce the representation of an interval, thus making possible the efficient sketching of intervals (see [18, 33, 51] for details).

DMAP is based on a set of three transformations (Figure 6-3) to which the size of join operation is invariant. The original domain is mapped into the domain of all possible dyadic intervals that can be defined over it. An interval in the original domain is mapped
into its minimal dyadic cover in the new domain. By doing this, the representation of the interval reduces to at most a logarithmic number of points in the new domain, i.e., the number of sketch updates reduces from linear to logarithmic in the size of the interval. At the same time, a point in the original domain maps to the set of all dyadic intervals that contain the point in the new domain, thus increasing the number of sketch updates from one to logarithmic in the size of the original domain.

DMAP allows the correct approximation of the size of join in the mapped domain with the added benefit that the sketch of each relation can be computed efficiently since both for an interval, as well as for a point, at most $\log |I| = n$ dyadic intervals have to be sketched. [18] prove that the size of join between a point relation and an interval relation is invariant to these transformations, that is, the size of join over the original domain is equal to the size of join over the mapped domain. The intuition behind the proof lies in the fact that for each point in an interval there exists exactly one dyadic interval in the minimal dyadic cover that contains that point.

The application of any of the sketching methods in the dyadic domain is straightforward. For a point, the sketch data structure is updated with all the dyadic intervals that contain the point (exactly $\log |I| = n$). For an interval, the sketch is updated with the dyadic intervals contained in the minimal dyadic cover (at most $2n - 2$, but still logarithmic in the size of the interval). Specifically, in the case of AGMS sketches all the counters are updated for each dyadic interval, while in the case of hash-based sketches (Fast-AGMS, Fast-Count, Count-Min) only one counter in each row is updated for each dyadic interval. Notice that the update procedure is identical to the procedure for point data streams since a dyadic interval is represented as a point in the dyadic domain. Once the sketches for the two data streams are updated, the estimation procedure corresponding to each type of sketch (described in Section 4.2) is immediately applicable.

The experimental results in [51] showed that DMAP has significantly worse performance, by as much as a factor of 8, than fast range-summable methods for AGMS
sketches. We provide an explanation for this behavior based on the statistical analysis in Section 5.2 and the empirical results in Section 4.4. At the same time, we provide evidence that the performance of DMAP for hash-based sketches (Fast-AGMS in particular) cannot be significantly better.

To characterize statistically the performance of DMAP, we first look at the distribution of the two data streams in the dyadic domain. The distribution of the point data stream has a peak corresponding to the domain (the largest dyadic interval) due to the fact that this dyadic interval contains all the points, so its associated counters get updated for each streaming point. The dyadic intervals at the second level, of size half of the domain size, also have high frequencies for the same reason. As the size of dyadic intervals decreases, their frequency decreases too, to the point where it is exactly the true frequency for point dyadic intervals. Unfortunately, the high frequencies in the dyadic domain are outliers because their impact on the size of join result is minimal (for example, the domain dyadic interval appears in the size of join only if the interval data stream contains the entire domain as an interval). Practically, DMAP transforms the distribution of the point data stream into a skewed distribution dominated by outliers corresponding to large dyadic intervals.

The effect of DMAP over the distribution of the interval data stream is far less dramatic, but more difficult to quantify. This is due to the fact that both the size of the interval and the position are important parameters. For example, two intervals of the same size, one which happens to be dyadic and one translated by only a position, can generate extremely different minimal dyadic covers and, thus, distributions in the dyadic domain. Even without any further assumptions on the distribution of the interval data stream, we expect the skewed distribution of the point stream to affect the estimate negatively, due to the outliers corresponding to large dyadic intervals. At the same time, we would expect not to have a significant difference between AGMS (Fast-Count) and Fast-AGMS sketches unless the distribution of the interval stream is also skewed towards large dyadic intervals. The reason for this lies in the fact that since the point stream over the dyadic domain is skewed and the behavior
of the sketch estimators for the size of join of two streams with different Zipf coefficients is governed by the smallest skew factor, the overall behavior is determined by the skew of the interval stream. Figure 5-11 shows a significant difference between FC and F-AGMS only when both streams are skewed. The experimental results in Section 6.6 verify these hypotheses.

An evident drawback of DMAP is that it cannot be extended easily to the case when both input data streams are given as intervals. If the sketches are simply updated with the dyadic intervals in the minimal dyadic cover, the size of join of the points in the dyadic domain is computed, which is different from the size of join in the original domain because a point in the dyadic domain corresponds to a range of points in the original domain. Updating one of the sketches with the product of the size of the dyadic interval and the frequency, instead of only the frequency, seems to be an easy fix that would compensate for the reduction in the representation. This is not the case because a point can be part of different dyadic intervals with different sizes, a situation that is not captured by moving into the dyadic domain.

6.3.3 Algorithm DMAP COUNTS

A possible improvement to the basic DMAP method is to keep exact counts for large dyadic intervals in both streams and to compute sketches only for the rest of the data. By doing this, the distribution of the point stream in the dyadic domain becomes closer to the original distribution since the effect of the outliers is neutralized. The contribution of the large dyadic intervals to the size of join is computed exactly through the counts, while the contribution of the rest of the dyadic intervals is better approximated through the sketches.

Despite the evident resemblance between this technique and other types of complex sketches, e.g., count sketches [11], skimmed sketches [28], and red sketches [29], there is a subtle difference. While for all the other techniques the high frequencies have to be determined and represent an important fraction of the result, in this case they are known beforehand and represent an outlier whose effect has to be minimized. In order to quantify
the error of this method and to determine the optimal number of exact counts, similar solutions to [11, 29] can be applied with the added complexity of dealing with interval distributions over a dyadic domain. The deeper insights such an analysis could reveal are hard to determine since even the exact behavior of DMAP is only loosely quantified in [18, 51]. The empirical results we provide in Section 6.6 show that the improvement is effective.

6.4 Fast Range-Summable Generating Schemes

In Section 6 we have seen that DMAP is the state-of-the-art solution for sketching intervals. DMAP uses dyadic mappings in order to reduce the size of an interval, thus decreasing the number of random variables that have to be generated. The fast range-summation property is the ability to compute the sum of random variables in an interval in time sub-linear in the size of the interval; the alternative is to generate and sum up the values $\xi_i$ for each $i$ in the interval. This property is a characteristic of the generating scheme and it is formally defined in [9]:

Definition 8. A generating scheme for two-valued k-wise independent random variables is called fast range-summable if there exists a polynomial-time function $g$ such that

$$ g([\alpha,\beta]; S) = \sum_{\alpha \le i \le \beta} \xi_i(S) = \sum_{\alpha \le i \le \beta} (-1)^{f(S,i)} \tag{6-2} $$

where $\alpha$, $\beta$, and $i$ are values in the domain $I$, with $\alpha \le \beta$.

Computing the function $g$ over general $[\alpha,\beta]$ intervals is usually not straightforward. The task is simpler for dyadic intervals (see Section 6.3.1) due to their regularity. Fortunately, any scheme that is fast range-summable for dyadic intervals can be extended to general intervals $[\alpha,\beta]$ by simply determining the minimal dyadic cover, computing the function $g$ over each dyadic interval in the cover, and then summing up these results. Since the decomposition of any $[\alpha,\beta]$ interval contains at most a logarithmic number of dyadic intervals, fast range-summable algorithms for dyadic intervals remain fast range-summable for general intervals. Notice how dyadic intervals are used in different
ways for fast range-summation and in DMAP. While DMAP represents an interval by its minimal dyadic cover and random variables are generated for each dyadic interval in the cover, dyadic intervals speed up the summing of the random variables in fast range-summation methods.

In this section, we study the fast range-summation property of the generating schemes presented in Section 3.1. For 2-wise independent random variables, both the BCH3 scheme and its EH3 variant are fast range-summable. [8] show that the Toeplitz family of hash functions is fast range-summable. Since for two-valued random variables the Toeplitz scheme is equivalent with BCH3, the algorithm we introduce for BCH3 also applies to the Toeplitz scheme. For the 4-wise case, the Reed-Muller generating scheme is the only scheme known to be fast range-summable [9, 32]. As suggested in [43], reducing the range-summing problem to determining the number of boolean variable assignments that satisfy an XOR-AND logical expression, and using the results in [23], is a formal method to determine if a generating scheme is fast range-summable. We apply this method to show that neither BCH5 nor the polynomial schemes are fast range-summable.

6.4.1 Scheme BCH3

We prove that BCH3 is fast range-summable and we provide an average-case constant-time, $O(1)$, algorithm for computing the sum of the $\pm 1$ random variables in any $[\alpha,\beta]$ interval. The BCH3 generating function $f$ has the following form:

$$ f(S,i) = [s_0, S_0] \cdot [1, i] = s_0 \oplus \bigoplus_{k=0}^{n-1} S_{0,k} \wedge i_k \tag{6-3} $$

where $\oplus$ denotes the bitwise XOR, and $\wedge$ denotes the bitwise AND. Both $S_0$ and $i$ are vectors over the space $GF(2)^n$. Given two vectors in $GF(2)^n$, $\alpha$ and $\beta$, with $\alpha \le \beta$ when interpreted as binary numbers, the problem of fast range-summing function $f$ over the
interval $[\alpha,\beta]$ is to compute the function $g$ defined below:

$$ g([\alpha,\beta]; S) = \sum_{\alpha \le i \le \beta} (-1)^{f(S,i)} = \sum_{\alpha \le i \le \beta} (-1)^{[s_0, S_0] \cdot [1, i]} = \sum_{\alpha \le i \le \beta} (-1)^{s_0 \oplus \bigoplus_{k=0}^{n-1} S_{0,k} \wedge i_k} \tag{6-4} $$

Notice that the operations that involve the seed and the index variable are over $GF(2)$, but the summation over the interval $[\alpha,\beta]$ is in $\mathbb{Z}$. Initially, we consider that the interval $[\alpha,\beta]$ is dyadic. This means that it has the form $[q2^j, (q+1)2^j)$, with $0 \le j \le n$ and $0 \le q \le 2^{n-j}-1$. Lemma 4 shows how to evaluate the function $g$ for the BCH3 scheme over a dyadic interval. The results in Propositions 1 and 2 are used for the subsequent proofs.

Lemma 4. Let $[q2^j, (q+1)2^j)$ be a dyadic interval with $1 \le j \le n$ and $0 \le q \le 2^{n-j}-1$, and let function $g$ be defined as:

$$ g([q2^j, (q+1)2^j); S) = \sum_{i=q2^j}^{(q+1)2^j-1} (-1)^{f(S,i)} \tag{6-5} $$

where $S = [S_0, s_0]$, with $S_0$ a vector in $GF(2)^n$ and $s_0 \in \{0,1\}$, and function $f$ is defined as $f([S_0,s_0], i) = s_0 \oplus \bigoplus_{k=0}^{n-1} S_{0,k} \wedge i_k$. Then, function $g$ can take the following values:

$$ g([q2^j, (q+1)2^j); [S_0,s_0]) = \begin{cases} 0, & \text{if } 2^j \text{ does not divide } S_0 \\ 2^j \cdot (-1)^{f([S_0,s_0],\, q2^j)}, & \text{if } 2^j \text{ divides } S_0 \end{cases} \tag{6-6} $$

Proof. Function $f$ can be rewritten as follows for the interval $[q2^j, (q+1)2^j)$:

$$ f([S_0,s_0], i) = s_0 \oplus \bigoplus_{k=j}^{n-1} S_{0,k} \wedge i_k \oplus \bigoplus_{k=0}^{j-1} S_{0,k} \wedge i_k = C \oplus \bigoplus_{k=0}^{j-1} S_{0,k} \wedge i_k \tag{6-7} $$

where $C \in \{0,1\}$ is constant for a given $S_0$. This is possible because the most significant $n-j$ bits of a $[q2^j, (q+1)2^j)$ dyadic interval are identical for all the values in the interval.

Function $g$ becomes:

$$ g([q2^j, (q+1)2^j); S) = \sum_{i=q2^j}^{(q+1)2^j-1} (-1)^{C \oplus \bigoplus_{k=0}^{j-1} S_{0,k} \wedge i_k} \tag{6-8} $$

Applying Proposition 2 for the expression $C \oplus \bigoplus_{k=0}^{j-1} S_{0,k} \wedge i_k$, provided that $S_{0,k} \neq 0$, $0 \le k$ [...]

values of the least significant $j$ bits in $S_0$ and, sometimes, to generate one of the variables in the interval. Both these tasks can be carried out efficiently.

The general fast range-summing problem is to compute the function $g$ defined in (6-4) for any interval $[\alpha,\beta]$, $\alpha \le \beta$. As we know from Section 6.3.1, any interval can be represented as a union of dyadic intervals. By decomposing the interval $[\alpha,\beta]$ into its minimal dyadic cover $D([\alpha,\beta]) = \{\delta_1, \delta_2, \ldots, \delta_m\}$, function $g$ can be rewritten as:

$$ g([\alpha,\beta]; S) = \sum_{\alpha \le i \le \beta} (-1)^{f(S,i)} = \sum_{i \in \bigcup_{k=1}^{m} \delta_k} (-1)^{f(S,i)} = \sum_{i \in \delta_1} (-1)^{f(S,i)} + \sum_{i \in \delta_2} (-1)^{f(S,i)} + \cdots + \sum_{i \in \delta_m} (-1)^{f(S,i)} \tag{6-10} $$

where $\delta_k$, $1 \le k \le m$, are dyadic intervals. Algorithm MinimalDyadicCover 6.8 computes the minimal dyadic cover of an interval $[\alpha,\beta]$, while Lemma 4 shows how to efficiently compute each of the sums in (6-10). The straightforward implementation of these two steps results in a fast range-summable algorithm for the BCH3 generating scheme since the number of intervals in the minimal dyadic cover is logarithmic in the size of the interval and the computation of the sum over each of these intervals needs constant time.

We go one step further and introduce an algorithm that overlaps these two steps, decomposition and summation. Besides overlapping these two stages, the algorithm we propose has strong early termination conditions that make it a constant-time algorithm for the average case over the seed space, improving considerably over the existing logarithmic fast range-summable algorithms.

Instead of computing function $g$ directly over the interval $[\alpha,\beta]$, we write it as a difference over the intervals $[0,\beta+1)$ and $[0,\alpha)$, $g([\alpha,\beta]; S) = g([0,\beta+1); S) - g([0,\alpha); S)$. The advantage of this form comes from the fact that the minimal dyadic cover of a $[0,\gamma)$ interval, $\gamma \in GF(2)^n$, can be directly determined from the binary representation of $\gamma$. If $\gamma = \gamma_{n-1}\gamma_{n-2}\ldots\gamma_0$, $\gamma_k \in \{0,1\}$ for $0 \le k$ [...]

Lines 1-2 initialize the variables sum and $\gamma'$. sum stores the result of function $g$. $\gamma'$ is just a substitute of $\gamma$ that is modified inside the algorithm. When $\gamma$ is odd, that is, $D([0,\gamma))$ contains a point interval, function $f$ is computed separately for it (Lines 3-5). The for-loop in Lines 6-15 is the core of the algorithm. It detects the dyadic intervals in $D([0,\gamma))$ and computes the sum of the random variables inside them according to Lemma 4. There exist two exit points from the algorithm inside the for-loop. The first one (Lines 8-9) is straightforward: the algorithm returns when the range-sum is computed over the entire interval. In this case $\gamma'$ equals 0 since it is decremented after the detection of each interval in the minimal dyadic cover. The second exit point in the for-loop (Lines 10-11) is more interesting. We know from Lemma 4 that the range-sum is determined without any computation for dyadic intervals that have non-zero corresponding bits in the seed $S_0$. We iterate through the intervals in the dyadic cover in ascending order of their size and the existence of a single one bit in $S_0$ automatically determines the range-sum for all subsequent intervals in the cover. This enables us to compute the value of function $g$ without generating other random variables (we know that the sum is 0 in this case). Lines 12-15 correspond to the case when the least significant $k$ bits in $S_0$ are all equal to 0. As shown in Lemma 4, a random variable inside the dyadic interval has to be generated in this case. The subtraction in Line 14 ensures that the random variable corresponding to the first point in the dyadic interval is generated. Line 16 contains the last exit point. It can be reached only when the seed $S_0$ is equal to 0 and $\gamma_{n-1}$ equals 1. Notice that the operations that appear in the algorithm imply computations with 2 and its powers and that they can be efficiently implemented using bit operations (AND, XOR, SHIFT, etc.). Algorithm BCH3 6.8 shows how to correctly invoke BCH3Interval 6.8 for $[\alpha,\beta]$ intervals. Notice that the first call to BCH3Interval 6.8 is with $\beta+1$. This is necessary for the correct computation of function $g$.
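The procedure described above can be sketched in Python as follows. This is a minimal illustration and not the listing of Algorithm BCH3Interval itself: the function names `bch3_f`, `bch3_prefix_sum`, `bch3_range_sum`, and `brute_force` are hypothetical, and the seed is assumed to be passed as two integers, `s0` (one bit) and `S0` (an n-bit integer). The early exit relies on Lemma 4: once a 1 bit is found among the low k bits of S0, every remaining (larger) dyadic interval in the cover contributes 0.

```python
def bch3_f(s0: int, S0: int, i: int) -> int:
    """BCH3 generating function: s0 XOR the GF(2) dot product of S0 and i."""
    return s0 ^ (bin(S0 & i).count("1") & 1)


def bch3_prefix_sum(gamma: int, s0: int, S0: int) -> int:
    """Sum of (-1)**f(S, i) over the half-open interval [0, gamma)."""
    total = 0
    rest = gamma  # plays the role of gamma' in the text
    k = 0         # the dyadic interval examined in this step has size 2**k
    while rest:
        # Lemma 4: an interval of size 2**k contributes only if the low k bits
        # of S0 are all 0; a 1 bit at position k-1 zeroes all larger intervals.
        if k > 0 and (S0 >> (k - 1)) & 1:
            break
        if (rest >> k) & 1:
            rest -= 1 << k  # left end-point of the interval [rest, rest + 2**k)
            sign = -1 if bch3_f(s0, S0, rest) else 1
            total += sign * (1 << k)
        k += 1
    return total


def bch3_range_sum(alpha: int, beta: int, s0: int, S0: int) -> int:
    """Sum over the closed interval [alpha, beta] as a difference of prefixes."""
    return bch3_prefix_sum(beta + 1, s0, S0) - bch3_prefix_sum(alpha, s0, S0)


def brute_force(alpha: int, beta: int, s0: int, S0: int) -> int:
    """Point-by-point reference, used only for checking the fast version."""
    return sum(-1 if bch3_f(s0, S0, i) else 1 for i in range(alpha, beta + 1))


print(bch3_range_sum(100, 202, 0, 184))  # prints -1
```

With the seed s0 = 0, S0 = 184 and the interval [100, 202], the two prefix sums are -5 and -4, so the call prints -1, in agreement with the point-by-point reference.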

Example 11. We show how Algorithm BCH3 6.8 works for the interval $[100, 202]$ and the seeds $s_0 = 0$, $S_0 = 184 = (10111000)_2$. sum1 is computed over the interval $[0, 203)$, while sum2 is calculated over the interval $[0, 100)$. Their difference is returned. For $203 = (11001011)_2$, function $f$ is invoked first outside the for-loop, with $\gamma' = 202 = (11001010)_2$, sum1 taking the value $(-1)^0 = 1$. $f$ is then invoked for $\gamma' = 200 = (11001000)_2$, giving the result $2 \cdot (-1)^0 = 2$. sum1 is updated to 3. The last time $f$ is invoked for the interval $[0, 203)$, $\gamma' = 192 = (11000000)_2$ and the returned result is $2^3 \cdot (-1)^1 = -8$, the value of sum1 at this point being $-5$. At the next iteration $k = 4$ and $S_{0,3} = 1$, and the routine BCH3Interval returns the value $-5$ for sum1. For $100 = (01100100)_2$ function $f$ is called only once, with $\gamma' = 96 = (01100000)_2$, sum2 taking the value $2^2 \cdot (-1)^1 = -4$. At the fourth iteration of the for-loop, $-4$ is returned for sum2. Finally, the range-sum of the BCH3 function for the interval $[100, 202]$ is $-5 - (-4) = -1$.

Theorem 14. Let $[\alpha,\beta]$, $\alpha \le \beta$, be an interval, where $\alpha, \beta \in GF(2)^n$, and seed $S = [S_0, s_0]$ be a vector over $GF(2)^{n+1}$, i.e., $S_0 \in GF(2)^n$ and $s_0 \in \{0,1\}$. Then, Algorithm BCH3 6.8 computes the sum of the BCH3 random variables over the interval $[\alpha,\beta]$.

Proof. Instead of range-summing over the interval $[\alpha,\beta]$, we apply Algorithm BCH3Interval 6.8 over the intervals $[0,\beta+1)$ and $[0,\alpha)$ and return the difference of these results. Algorithm BCH3Interval 6.8 has two termination points: exhausting the entire interval $[0,\gamma)$ and detecting the first least significant one bit in $S_0$. For every interval that is part of the minimal dyadic cover $D([0,\gamma))$, $\gamma$ is decreased accordingly. When the entire minimal dyadic cover is determined, $\gamma$ equals 0 and the algorithm returns. This is the ordinary termination, for which the algorithm executes the for-loop a logarithmic number of times in the size of the $[0,\gamma)$ interval. A reduction to a constant number of executions of the for-loop is provided by the second return point, the detection of the first one bit in $S_0$. We know from Lemma 4 that the range-sum over dyadic intervals that depend on non-zero seeds $S_0$ can be computed without generating any random variable
in the interval. Applying this result to the current $[0,\gamma)$ interval, which is a union of dyadic intervals that depend on a non-zero seed $S_0$, the range-sum can be immediately returned.

Algorithm BCH3Interval 6.8 returns the range-sum over $[0,\gamma)$ intervals. The minimal dyadic cover $D([0,\gamma))$ is encoded in the binary representation of $\gamma$ and consists of intervals that have decreasing sizes as we go from 0 toward $\gamma$. That is, for $D([0,\gamma)) = \{\delta_1, \delta_2, \ldots, \delta_m\}$, $|\delta_1| > |\delta_2| > \cdots > |\delta_m|$ holds. The range-sum over each dyadic interval $\delta_k$, $1 \le k \le m$, is computed according to Lemma 4. Since the intervals are detected in increasing order of their sizes, it is possible to determine the value of the range-sum without any other computation when the fast termination condition is met: the partial sum is 0, as stated in Lemma 4.

Theorem 15. Algorithm BCH3Interval 6.8 performs on expectation 2 computations of the function $f$, one outside the loop and one inside the for-loop, given that the seeds $S = [S_0, s_0]$ are 2-wise independent and uniformly distributed over $GF(2)^{n+1}$. For 100 seeds $S$, the average number of function $f$ computations is between 1.723 and 2.277, while for 1,000 seeds it is inside the interval $[1.912, 2.088]$. For 10,000 seeds it lies between 1.972 and 2.028. All these results have a 0.95 confidence probability.

Proof. We consider that $\gamma$ has the worst value for our algorithm, e.g., $\gamma = \overbrace{1 \ldots 11}^{n}$. This way, it always generates a computation of the function $f$. We prove that the fast termination condition makes the algorithm execute the for-loop 1 time, on expectation, giving a total of 2 function $f$ computations. For this, we have to determine the average number of consecutive 0 bits that appear in the least significant positions of $S_0$, knowing that $S_0$ is uniformly distributed over the space $GF(2)^n$. Since the bits in $S_0$ are independent, the probability of having exactly $k$ consecutive bits with value 0, $1 \le k \le n$, preceded by a 1, is $\frac{1}{2^{k+1}}$. We define the random variable $X$ as the number of least significant consecutive 0 bits in $S_0$ over the uniform probability space $GF(2)^n$. The expected value of $X$, $E[X]$, gives us exactly what we are
looking for, that is, the average number of least significant consecutive 0 bits in $S_0$.

$$ E[X] = 1 \cdot \frac{1}{2^2} + 2 \cdot \frac{1}{2^3} + \cdots + n \cdot \frac{1}{2^{n+1}} = \frac{1}{2} \sum_{k=1}^{n} k \frac{1}{2^k} \approx 1 \tag{6-11} $$

$E[X]$ is a derived power series with ratio $\frac{1}{2}$ that converges to 1. For a complete characterization of the random variable $X$, we also compute its variance, $Var[X]$, which indicates the range around the expected value in which $X$ takes values.

$$ Var[X] = E[X^2] - E^2[X] \tag{6-12} $$

$$ E[X^2] = 1^2 \cdot \frac{1}{2^2} + 2^2 \cdot \frac{1}{2^3} + \cdots + n^2 \cdot \frac{1}{2^{n+1}} = \frac{1}{2} \sum_{k=1}^{n} k^2 \frac{1}{2^k} \approx 3 \tag{6-13} $$

$$ Var[X] = E[X^2] - E^2[X] = 3 - 1^2 \approx 2 \tag{6-14} $$

Both the expectation and the variance of $X$ have finite values, 1 and 2, respectively. For a large enough number of seeds $S$, e.g., greater than 100, the central limit theorem [54] can be applied. Let the random variable $Y$ be defined as:

$$ Y = \frac{\sum_{k=1}^{N} X_k}{N} \tag{6-15} $$

where $N$ is the number of seeds $S$, $N \ge 100$. Using the central limit theorem, we know that random variable $Y$ can be approximated by a normal distribution with parameters $E[Y] = E[X]$ and $Var[Y] = \frac{Var[X]}{N}$. We determine the range of variable $Y$ within a 0.95 confidence interval using the cumulative distribution function (cdf) of the normal distribution that approximates $Y$.

$$ Y_l = 1 + \frac{2}{\sqrt{N}} \mathrm{Erf}^{-1}(-0.95) \tag{6-16} $$
$$ Y_r = 1 + \frac{2}{\sqrt{N}} \mathrm{Erf}^{-1}(0.95) \tag{6-17} $$

$Y_l$ and $Y_r$ are the left and right interval end-points, while $\mathrm{Erf}^{-1}$ is the inverse of the error function [54]. Replacing $N$ with 100, 1,000, and 10,000 in the above formulae, and adding 1 for the out-of-loop function $f$ computation, we obtain the results stated in the theorem.

In Theorem 15 we considered that $\gamma$ takes the worst value for our algorithm, e.g., $\gamma = \overbrace{1 \ldots 11}^{n}$. In practice, this does not happen too often and the number of function $f$ computations could be smaller. If we assume that the bits in $\gamma$ take the values 0 and 1 with the same probability, the number of computations of $f$ reduces to half, i.e., 1. Also, there exist cases when no computation of $f$ is done, e.g., $\gamma_0 = 0$ and $S_{0,0} = 1$, i.e., the least significant bits in $\gamma$ and $S_0$ are 0 and 1, respectively.

Corollary 5. Function $f$ is computed on expectation 4 times when BCH3 random variables are range-summed over $[\alpha,\beta]$ intervals.

Proof. Algorithm BCH3Interval 6.8 is called two times by Algorithm BCH3 6.8 and each invocation computes function $f$ 2 times on expectation.

The result in Corollary 5 is for interval end-points $\alpha$ and $\beta$ that are worst for our algorithm, e.g., both $\alpha$ and $\beta+1$ end in 11. In the average case, with the bits taking the values 0 and 1 with equal probability, the number of function $f$ computations reduces to 2. This result is the best we could hope for, that is, computing the range-sum over an interval $[\alpha,\beta]$ by taking into consideration only its end-points, $\alpha$ and $\beta$.

6.4.2 Scheme EH3

Although [25] show that the random variables generated using the Extended Hamming scheme (EH3) are fast range-summable, the algorithm contained in the proof is abstract and not appropriate for implementation purposes. We propose a practical algorithm for the fast range-summation of the EH3 random variables. It is an extension of our constant-time algorithm for fast range-summing BCH3 random variables.
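As a numerical cross-check of the bounds quoted in Theorem 15 for BCH3, the sketch below recomputes the moments of the number of trailing 0 bits and the 0.95 confidence end-points. It is only an illustration under stated assumptions: the helper names are hypothetical, the series are truncated at n = 64 bits, and the inverse error function is obtained through the standard normal quantile of Python's `statistics.NormalDist`, using the identity sqrt(2) * Erf^{-1}(0.95) = Phi^{-1}(0.975).

```python
from math import sqrt
from statistics import NormalDist


def trailing_zero_moments(n: int = 64):
    """E[X] and Var[X] for X with P[X = k] = 2**-(k+1), k = 1..n."""
    e1 = sum(k / 2 ** (k + 1) for k in range(1, n + 1))
    e2 = sum(k * k / 2 ** (k + 1) for k in range(1, n + 1))
    return e1, e2 - e1 ** 2


def f_calls_interval(num_seeds: int, confidence: float = 0.95):
    """Confidence bounds on the average number of f computations per call."""
    mean, var = trailing_zero_moments()
    # Normal quantile; equals sqrt(2) * erfinv(confidence).
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * sqrt(var / num_seeds)
    # The extra 1 accounts for the computation of f outside the for-loop.
    return 1 + mean - half_width, 1 + mean + half_width


for num_seeds in (100, 1_000, 10_000):
    low, high = f_calls_interval(num_seeds)
    print(num_seeds, round(low, 3), round(high, 3))
```

Rounded to three decimals, the three printed intervals are [1.723, 2.277], [1.912, 2.088], and [1.972, 2.028], the values stated in Theorem 15.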

The following theorem provides an analytical formula for computing the range-sum function $g$. Notice that only one computation of the generating function $f$ is required in order to determine the value of $g$ over any dyadic interval.

Theorem 16. Let $[q4^j, (q+1)4^j)$ be a dyadic interval with size at least 4, $j \ge 1$ (although we call them dyadic intervals, these intervals are defined over powers of 4). The range-sum function $g([q4^j, (q+1)4^j); S) = \sum_{i=q4^j}^{(q+1)4^j-1} (-1)^{f(S,i)}$ defined for the EH3 scheme is equal to:

$$ g([q4^j, (q+1)4^j); S) = (-1)^{\#ZERO} \cdot 2^j \cdot (-1)^{f(S,\, q4^j)} \tag{6-18} $$

where #ZERO represents the number of pairs of adjacent bits that OR to 0 (i.e., the number of groups of 00 bits).

Proof. The generating function $f$ can be written for $i \in [q4^j, (q+1)4^j)$ as follows:

$$ f(S,i) = f(S, q4^j) \oplus S_{0,2j-1} \wedge i_{2j-1} \oplus S_{0,2j-2} \wedge i_{2j-2} \oplus (i_{2j-1} \vee i_{2j-2}) \oplus \cdots \oplus S_{0,1} \wedge i_1 \oplus S_{0,0} \wedge i_0 \oplus (i_1 \vee i_0) \tag{6-19} $$

where the first part is fixed, while the last $2j$ bits of $i$ take all the possible values. When the value of the changing expression is 0, $f(S,i) = f(S, q4^j)$, while when its value is 1, $f(S,i)$ is the negation of $f(S, q4^j)$. We know that any expression of the form $S_{0,j} \wedge i_j \oplus S_{0,j-1} \wedge i_{j-1} \oplus (i_j \vee i_{j-1})$ has a 3:1 distribution of the values, depending on the seed bits. Thus, the sum $\sum (-1)^{S_{0,j} \wedge i_j \oplus S_{0,j-1} \wedge i_{j-1} \oplus (i_j \vee i_{j-1})}$ takes either the value 2 or $-2$ when $i_j$ and $i_{j-1}$ take all the possible values. When $j$ blocks of this form are combined together, the resulting sum will be $2^j$ or $-2^j$. For one block, the sum is $-2$ only when $S_{0,j} \vee S_{0,j-1} = 0$, i.e., $S_{0,j} = S_{0,j-1} = 0$, since the expression then reduces to $i_j \vee i_{j-1}$, which takes the value 1 for three of the four assignments. For multiple blocks that are XOR-ed, the result is 1 only when an odd number of them takes the value 1. This implies that in order to obtain the result $-2^j$, $S_{0,j} = S_{0,j-1} = 0$ has to be valid for an odd number of blocks. If we denote by #ZERO the
number of blocks for which the relation $S_{0,j} = S_{0,j-1} = 0$ holds, the range-sum function $g$ can be written as:

$$ g([q4^j, (q+1)4^j); S) = (-1)^{\#ZERO} \cdot 2^j \cdot (-1)^{f(S,\, q4^j)} \tag{6-20} $$

Based on the results in Theorem 16, Algorithm EH3Interval 6.8 computes function $g([\alpha,\beta]; S) = \sum_{\alpha \le i \le \beta} (-1)^{f(S,i)}$ for any interval $[\alpha,\beta]$. First, the minimal dyadic cover of $[\alpha,\beta]$ is determined, then the sum over each dyadic interval is computed using (6-18). Notice that these two steps can be combined, the computation of $g$ being performed while determining the minimal dyadic cover of $[\alpha,\beta]$. The minimal dyadic cover can be efficiently determined from the binary representation of $\alpha$ and $\beta$. Since any interval can be decomposed into a logarithmic number of dyadic intervals, Algorithm EH3Interval 6.8 computes function $g$ in $O(\log(\beta - \alpha))$ steps.

Example 12. We show how Algorithm EH3Interval 6.8 works for the interval $[124, 197]$ and the seed $S = [s_0, S_0] = [0, 184 = (10111000)_2]$. The minimal dyadic cover of $[124, 197]$ is

$$ D([124, 197]) = \{[124, 128), [128, 192), [192, 196), [196, 197), [197, 198)\} \tag{6-21} $$

#ZERO is equal to 1 for the given $S_0$, the only pair OR-ing to 0 being the pair at the end. It affects the dyadic intervals with powers greater than 0.

$$ \begin{aligned} g([124, 197]; S) &= g([124, 128); S) + g([128, 192); S) + g([192, 196); S) + g([196, 197); S) + g([197, 198); S) \\ &= -2^1 (-1)^{f(S,124)} - 2^3 (-1)^{f(S,128)} - 2^1 (-1)^{f(S,192)} + 2^0 (-1)^{f(S,196)} + 2^0 (-1)^{f(S,197)} \\ &= 2 + 8 + 2 + 1 - 1 = 12 \end{aligned} \tag{6-22} $$

While fast range-summing BCH3 random variables is constant-time on average, the EH3 algorithm is logarithmic in the size of the interval. Although both algorithms
compute partial sums over the dyadic intervals in the minimal dyadic cover, the BCH3 algorithm can return the final result after only a small number of function $f$ invocations. This is not the case for EH3, which requires one invocation for each dyadic interval in the minimal dyadic cover.

6.4.3 Four-Wise Independent Schemes

In this section we investigate the fast range-summation property of the 4-wise independent generating schemes presented throughout this work, namely BCH and polynomials over primes. The discussion regarding the Reed-Muller scheme is deferred to the next section.

The main idea in showing that some of the schemes are not fast range-summable is to use the result in [23] on the problem of counting the number of times a polynomial over $GF(2)$, written as XOR of ANDs (sums of products with operations in $GF(2)$), takes each of the two values in $GF(2)$, as suggested by [43]. The result states that the problem is #P-complete if any of the terms of the polynomial written as an XOR of ANDs contains at least three variables. In our particular case, to show that a scheme is not fast range-summable, it is enough to prove that for some seed $S$ the generating function $f(S,i)$, written as an XOR of ANDs polynomial in the bits of $i$, contains at least one term that involves three or more variables. The following results use this fact to show that the BCH5 and the polynomials over primes schemes are not fast range-summable.

Theorem 17. BCH schemes are not fast range-summable for $k \ge 5$ and $n \ge 4$.

Proof. We show that fast range-summing BCH5 random variables is equivalent to determining the number of assignments that satisfy a 3XOR-AND boolean formula, a problem that is known to be #P-complete. Since for $k > 5$ any BCH scheme can be reduced to BCH5 for some particular values of the seed, i.e., $S_2 = S_3 = \cdots = S_{k/2} = 0$, the fact that BCH5 is not fast range-summable implies that BCHk is not fast range-summable for $k > 5$.

BCH5 necessitates the computation of $i^3$ over the extension field $GF(2^n)$. If we consider the bit representation of $i$, it can be shown that $i^3$ is a 3XOR-AND boolean formula for $n > 3$ (the idea is to use the bit equations of a multiplier). This reduction implies that BCH5 is not fast range-summable.

Theorem 18. Let $n = \lceil \log p \rceil$ be the number of bits on which the prime $p > 7$ can be represented and $[q2^l, (q+1)2^l)$ be a dyadic interval with $l \ge 3$. Then, the function $f(S,i) = [(a_0 + a_1 i) \bmod p] \bmod 2$ is not fast range-summable over the interval $[q2^l, (q+1)2^l)$.

Proof. Without loss of generality, we consider that $a_0 = 0$, $a_1 = p-1$, and show that function $g$ is a 3XOR-AND formula, implying that it is not fast range-summable.

The expression $(p-1)i \bmod p$ takes the values $0, p-1, p-2, \ldots, p-7$ for $i = 0, 1, \ldots, 7$. Out of these values, only $p-2$, $p-4$, and $p-6$ are odd, function $g$ taking the value 3. If we express the result of function $g$ as an XOR-AND formula in terms of $i_0$, $i_1$, and $i_2$, we obtain:

$$ g(i_2 i_1 i_0; S) = i_1 \oplus i_2 \oplus i_1 i_0 \oplus i_2 i_0 \oplus i_2 i_1 \oplus i_2 i_1 i_0 \tag{6-23} $$

which is a 3XOR-AND formula. We know that the number of assignments to $i_2, i_1, i_0$ that satisfy this formula cannot be determined in polynomial time. This implies that function $f$ is not fast range-summable over dyadic intervals of size at least 8.

Theorem 18 shows that for the polynomials over primes scheme with $k = 2$ there exist values for the coefficients $a_0$ and $a_1$ that make the scheme not fast range-summable for dyadic intervals with size greater than or equal to $2^3 = 8$. Since the schemes for $k > 2$ can be reduced to $[(a_0 + a_1 i) \bmod p] \bmod 2$ by making $a_2 = \cdots = a_{k-1} = 0$, it follows that the polynomials over primes scheme is not fast range-summable for any $k \ge 2$.

6.4.4 Scheme RM7

Together with the negative result on the hardness of counting for XOR of ANDs polynomials with terms containing three or more variables AND-ed, [23] provided an algorithm for such counting for formulae that contain at most two
variables AND-ed in each term. This algorithm, which we refer to as 2XOR-AND, can be readily used to produce a fast range-summable algorithm for the 7-wise independent Reed-Muller (RM7) scheme. An algorithm for this scheme based on the same ideas was proposed in [9]. We focus our discussion on the 2XOR-AND algorithm, but the same conclusions are applicable to the algorithm in [9].

The observation at the core of the 2XOR-AND algorithm is the fact that polynomials with a special shape are fast range-summable. These are polynomials with at most two variables AND-ed in any term and with each variable participating in at most one such term. The other cases can be reduced to this case by introducing new variables that are linear combinations of the existing ones. To determine these linear combinations in the general case, systems of linear equations have to be constructed and solved, one for each variable. The overall algorithm is $O(n^3)$, with $n$ the number of variables, if the summation is performed over a dyadic interval.

The 2XOR-AND algorithm can be used to fast range-sum random variables produced by the RM7 scheme since in the XOR of ANDs representation of this scheme as a polynomial of the bits of $i$ (which is the representation used in Section 3.1) only terms with ANDs of at most two variables appear. Using the 2XOR-AND algorithm for each dyadic interval in the minimal dyadic cover of a given interval, the overall running time can be shown to be $O(n^4)$, where the size of $I$, the domain, is $2^n$. While this algorithm is clearly fast range-summable according to the definition, in practice it might still be too slow to be useful. Indeed this is the case, as shown in the empirical evaluation section, where we provide a running time comparison of the fast range-summable algorithms.

6.4.5 Approximate Four-Wise Independent Schemes

Since the RM7 fast range-summable algorithm is not practical and fast range-summable algorithms for BCH5 and polynomials over primes schemes do not exist, it is worth investigating approximation algorithms for the 4-wise case. While such approximations are
possible [40, 44], we show that they are not more practical than the exact algorithm for RM7.

Let $\mathit{xaf}$ be a multivariate polynomial in the variables $x_1, x_2, \ldots, x_n$ over $GF(2)$ and having the form:

$$ \mathit{xaf}(x_1, x_2, \ldots, x_n) = t_1(x_1, x_2, \ldots, x_n) \oplus \cdots \oplus t_m(x_1, x_2, \ldots, x_n) \tag{6-24} $$

where for each $j = 1, 2, \ldots, m$, the term $t_j(x_1, x_2, \ldots, x_n)$ is the product of a subset of the variables $x_1, x_2, \ldots, x_n$. The approximate XOR-AND counting algorithm introduced in [40] needs $4 m^2 \ln\frac{2}{\delta} \cdot \frac{1}{\epsilon^2}$ random trials (particular value assignments to all of the variables) to provide a result with relative error at most $\epsilon$ with probability at least $1 - \delta$. Each of these trials necessitates the evaluation of the polynomial $\mathit{xaf}$ at the particular assigned value. In order to have an efficient approximate algorithm, the number of trials it evaluates should be small. We show that for obtaining a good approximation of the range-sum problem, the number of trials is comparable with the size of the interval even for the RM7 scheme, thus making the algorithm impractical.

Figure 6-1 represents the number of evaluations of the polynomial $\mathit{xaf}$ for the RM7 scheme. The ratio between the number of linear (point-by-point) evaluations and the number of evaluations invoked by the approximate algorithm in [40] is plotted for a number of variables that ranges between 1 and 32. The number of terms $m$ in the RM7 polynomial is $\frac{n(n+1)}{2}$. In the average case, half of these terms are simplified by corresponding 0 seeds, giving a total of $\frac{n(n+1)}{4}$ terms. In order to obtain a satisfactory approximation of the XOR-AND counting result, we set the value of the factor $\ln\frac{2}{\delta} \cdot \frac{1}{\epsilon^2}$ to $10^3$. All these give the value $\frac{10^3 n^2 (n+1)^2}{4}$ for the number of trials employed by the approximate XOR-AND counting algorithm. As the number of polynomial evaluations for the linear case is $2^n$, the function plotted in Figure 6-1 is $\frac{2^n}{10^3 n^2 (n+1)^2 / 4}$, where $n$ takes values in the range $[1 \ldots 32]$.
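The plotted ratio is easy to tabulate. The following sketch evaluates it for n = 1, ..., 32 under the same choices as in the text (m = n(n+1)/4 terms and the accuracy factor fixed to 10^3); the function names are illustrative only and do not come from the original work.

```python
def approx_trials(n: int, factor: float = 1e3) -> float:
    """Trials of the approximate counter: 4 * m**2 * factor, with m = n(n+1)/4."""
    m = n * (n + 1) / 4
    return 4 * m * m * factor


def evaluation_ratio(n: int) -> float:
    """Exhaustive evaluations (2**n) divided by approximate-algorithm trials."""
    return 2 ** n / approx_trials(n)


# Smallest number of variables for which the approximate algorithm needs
# fewer polynomial evaluations than the exhaustive point-by-point method.
break_even = next(n for n in range(1, 33) if evaluation_ratio(n) > 1)
print(break_even)
for n in (8, 16, 24, 32):
    print(n, evaluation_ratio(n))
```

Under these assumptions the ratio first exceeds 1 at n = 28, that is, for intervals of roughly 2^28 points.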

The results in Figure 6-1 are not very encouraging. Only when the number of variables is greater than 25 does the approximate algorithm need fewer polynomial evaluations than the direct exhaustive method. But this implies intervals of extremely large size that are unlikely to appear in practice. If we also take into account the fact that the provided result is not exact and its error could propagate exponentially, we think that the approximate XOR-AND counting algorithm is not applicable to the range-summing problem.

6.4.6 Empirical Evaluation

We implemented the fast range-summable algorithms for the BCH3, EH3, and RM7 schemes and we empirically evaluated them with the same experimental setup as in Section 3.1.10. The evaluation procedure is based on 100 experiments that use a number of randomly generated intervals and an equal number of sketches, chosen for each method such that the overall running time is in the order of minutes, in order to obtain stable estimates of the running time per sketch. The results, depicted in Table 6-1, are the average of the 100 runs and have errors of at most 5%. Notice that the execution time of BCH3 for ranges is merely 7 times larger than the execution time for a single sketch (refer to Table 3-3 for the running times of individual sketches). This happens because, as we mentioned earlier, our algorithm for BCH3 is essentially $O(1)$. The Extended Hamming scheme (EH3) has an encouraging running time of approximately 1.8 µs, thus about 550,000 such computations can be performed per second on a modern processor. The RM7 fast range-summable algorithm is completely impractical since only about 40 computations can be performed per second. This is due to the fact that the algorithm is quite involved (a significant number of systems of linear equations have to be formed and solved). Even if special techniques are used to reduce the running time, at most a factor of 32 reduction is possible and the scheme would still be impractical.

The net effect of these experimental results and of the theoretical discussions in this section is that there is no practical fast range-summable algorithm for any of the known 4-wise generating schemes, while the algorithm for BCH3 is extremely efficient.

6.5 Fast Range-Summation Method

While DMAP uses mappings in the dyadic domain in order to sketch intervals efficiently, fast range-summation methods are based on properties of the random variables that allow the sketching of an interval in a number of steps sub-linear in the size of the interval. Specifically, the sum of random variables over dyadic intervals is computed in a constant number of steps and, since there exists a logarithmic number of dyadic intervals in the minimal dyadic cover of any interval, the number of steps to sketch the entire interval is logarithmic in its size. [51] show that fast range-summation is a property of the generation scheme of the random variables and that there exist only two practical schemes applicable to AGMS sketches, EH3 and BCH3. Moreover, the performance of BCH3 is highly sensitive to the input data, so we consider only EH3 in this chapter.

[51] use fast range-summation only in the context of AGMS sketches, where the update of each key (interval) affects each counter in the sketch structure. More exactly, for all the elements in an interval, the same counter has to be updated (and all the counters overall). This is no longer the case for hash-based sketches, where different counters are updated for different keys (unless they are hashed into the same bucket). In this section we show that fast range-summation and random hashing are conflicting operations and, consequently, fast range-summation is not applicable to hash-based sketches (Fast-AGMS, Fast-Count, Count-Min). Fortunately, we show that fast range-summation for AGMS sketches can be applied in conjunction with deterministic partitions of the domain without loss in error, but with a significant improvement in the update time.

As mentioned in Section 4.2, the main drawback of AGMS sketches is the update time. For each stream element, each counter in the sketch structure has to be updated. Essentially, each counter is a randomized synopsis of the entire data. Fast range-summation
exploits exactly this additive property to sketch intervals efficiently since the update corresponding to each element in the interval has to be added to the same counter. Hash-based sketches randomly partition the domain $I$ of the key attribute and associate a single counter in the sketch structure with each of these partitions. For each stream element, only the counter corresponding to its random partition is updated, hence the considerable gain in update time.

In order to determine if fast range-summation can be extended to hash-based sketches, we focus on the interaction between random hashing, whose goal is to partition the keys evenly into buckets, and the efficient sketching of continuous intervals, for which the maximum benefit is obtained when all the elements in the interval are placed into the same bucket. The following proposition relates the number of counters that have to be updated in a hash-based sketch to the size of the input interval:

Proposition 11. Given a hash function $h : I \to B$ and an interval $[\alpha,\beta]$ of size $l$, the number of buckets touched by the function $h$ when applied to the elements in $[\alpha,\beta]$ is on expectation $B\left[1 - \left(1 - \frac{1}{B}\right)^l\right]$.

Proof. Let $X_i$ be a 0/1 random variable corresponding to each of the $B$ buckets of the hash function $h$, $0 \le i \le B-1$:

$$ X_i = \begin{cases} 1, & \text{if } \exists\, j \in [\alpha,\beta] \text{ with } h(j) = i \\ 0, & \text{otherwise} \end{cases} \tag{6-25} $$

The expected value $E[X_i]$ can be computed as:

$$ E[X_i] = P[X_i = 1] = 1 - P[X_i = 0] = 1 - \left(1 - \frac{1}{B}\right)^l \tag{6-26} $$
since the probability of an element to be hashed into the $i$-th bucket is $\frac{1}{B}$. The expected value of the number of buckets touched by $h$ over $[\alpha,\beta]$ is then:

$$ E[X] = E\left[\sum_{i=0}^{B-1} X_i\right] = \sum_{i=0}^{B-1} E[X_i] = B\left[1 - \left(1 - \frac{1}{B}\right)^l\right] \tag{6-27} $$

where $X$ is defined as $X = \sum_{i=0}^{B-1} X_i$.

In order to give some practical interpretation to the above proposition, we consider the size $l$ to be proportional to the number of buckets $B$, i.e., $l = kB$, for $k > 0$. This allows us to rewrite Equation (6-27) as:

$$ B\left[1 - \left(1 - \frac{1}{B}\right)^l\right] = B\left[1 - \left(1 - \frac{1}{B}\right)^{kB}\right] \approx B\left(1 - \frac{1}{e^k}\right) \tag{6-28} $$

where we used the approximation $\frac{1}{e} = \lim_{B \to \infty} \left(1 - \frac{1}{B}\right)^B$.

Corollary 6. For a hash function with $B$ buckets, 63.21% of the buckets are touched on expectation when $h$ is applied to an interval of size $B$. The number of buckets increases to 86.46% when the size of the interval is twice the number of buckets $B$, and to 98.16% for $k = 4$.

The above corollary states that for intervals of size at least four times the size of the hash almost all the buckets are touched on expectation. This completely eliminates the effect of hashing for sketching intervals since AGMS sketches already require the update of each counter in the sketch. The difference is that for AGMS sketches each counter is updated with the entire interval, while for hash-based sketches a counter is updated only with the elements in the interval assigned to the random partition corresponding to that counter. Although the number of updates per counter is smaller for hash-based sketches, determining how many and which elements in the given interval update the same counter without looking at the entire interval is a difficult problem. The only solution we are aware of is for 2-universal hash functions, so applicable only to Fast-AGMS and Count-Min sketches. It consists in applying the sub-linear algorithm proposed in [1] for
counting how many elements in the interval are hashed into a range of buckets, either for each bucket or for ranges of increasing size. Notice that this is actually not even enough for Fast-AGMS sketches, for which the interaction between hashing and EH3 or BCH3 [51] has to be quantified. While fast range-summation takes advantage of properties of the generating scheme for the particular form of dyadic intervals, determining the contribution of the elements in the same random partition without considering each element separately has to be more difficult due to the lack of structure. Consequently, among the hash-based sketching techniques, fast range-summation is directly applicable only to Count-Min sketches, with the requirement that each counter is updated when sketching an interval.

Fast Range-Summation with Domain Partitioning: The intermediate solution between AGMS sketches, which update all the counters, and hash-based sketches, which update only one counter for a given key, is sketch partitioning [20]. The domain $I$ is partitioned in continuous blocks rather than random blocks. A number of counters from the sketch structure proportional to the size of the block is assigned to each block. When the update of a key has to be processed, only the counters in the block corresponding to that key are updated. This method can be easily extended to fast range-summing intervals without the need to update all the counters unless the size of the interval is close to the size of the domain. The intersection between the given interval and each partition is first determined and, for each non-empty intersection, the fast range-summation algorithm is applied only to the set of associated counters. Thus, a number of counters proportional to the number of non-empty intersections (and indirectly proportional to the size of the interval) has to be updated. In what follows we provide an example to illustrate how fast range-summation with sketch partitioning works.

Example 13. Consider the domain $I = \{0, \ldots, 15\}$ to be split into 4 equi-width partitions as depicted in Figure 6-4. For simplicity, assume that the available AGMS sketch consists of 8 counters which are evenly distributed between the domain partitions, 2 for each

partition. Instead of having a single estimator for the entire domain, a sketch estimator combining the counters in the partition is built for each partition. The final estimator is the sum of these individual estimators corresponding to each partition.

Figure 6-4 depicts the update procedure for the interval [2, 7]. The non-empty intersections [2, 3] and [4, 7] correspond to partitions 0 and 1, respectively. The fast range-summation algorithm is applied to each of these intervals only for the counters associated with the corresponding domain partition, not for all the counters in the sketch. In our example, fast range-summation is applied to interval [2, 3] and the two counters associated with partition 0, and to interval [4, 7] and the two counters associated with partition 1, respectively. Overall, only four counters are updated, instead of eight, for sketching the interval [2, 7].

The advantage of domain partitioning is that the update time is smaller when compared to the basic fast range-summation method. This is the case because only a part of the sketch has to be updated if the interval is not too large with respect to the size of a partition. In particular, only the sketches corresponding to the partitions that intersect the interval need to be updated, which means that the speedup is proportional to the ratio between the number of partitions and the average number of partitions an interval intersects. When points are sketched, only the counters corresponding to the partition the point belongs to need to be updated instead of all the counters in the sketch. In the above example only two counters have to be updated for each point, instead of eight.

Notice that, as shown in [20], any partitioning of the domain can be used and the number of counters associated with each partition can also differ from partition to partition. More precisely, any partitioning and any allocation scheme for the counters results in an unbiased estimator for the size of join. An important question, though, is what the variance of the estimator is, since the variance is an indicator of the accuracy. In [20] a sophisticated method to partition and allocate the counters per partition was proposed in order to minimize the variance of the estimator. For gains to be obtained, regions of the


domain where high frequencies in one stream match small frequencies in the other have to be identified. Since in this particular situation we do not expect large frequencies for the interval stream, as explained in Section 6.3, we do not expect the sketch partitioning technique in [20] to be able to reduce the variance significantly. Moreover, using the fact that the variance of the estimator remains the same if the partitioning is random (see Proposition 3), as long as there does not exist significant correlation between the partitioning scheme and the input frequencies, we expect the variance of the estimator to remain the same. The expected distribution of the interval frequencies also suggests that a simple equi-width partitioning should behave reasonably well. Indeed, the experimental results in Section 6.6 show that this partitioning is effective in reducing the update time while the error of the estimate remains roughly the same.

6.6 Experimental Results

In this section we present the results of the empirical study designed to evaluate the performance of five of the algorithms for efficiently sketching intervals introduced previously. The five methods tested are: AGMS DMAP, F-AGMS DMAP (F-AGMS), F-AGMS DMAP with exact counts (F-AGMS COUNTS), fast range-summation AGMS (AGMS), and fast range-summation AGMS with sketch partitioning (AGMS P). Methods based on Count-Min sketches are excluded due to their high sensitivity to the input data, while for Fast-Count sketches the same behavior as for AGMS is expected (see Section 4.4), where applicable. The accuracy and the update time per interval are the two quantities measured in our study for the size of spatial join problem involving intervals (see [18]).

But first, we want to compare EH3 with DMAP [18] (see Section 6.3) on the two applications introduced in Section 6.1, namely, size of spatial joins and selectivity estimation for histograms construction. The comparison with BCH3 is omitted since its error is significantly higher when compared to EH3 or DMAP. We do not perform comparisons with the fast range-summable version of the Reed-Muller scheme (RM7) since


its throughput is not higher than 40 sketch computations per second (EH3 is capable of performing more than 550,000 sketch computations per second, as shown in Section 6.4).

Following the experimental setup in [18, 51], three real data sets are used in our experiments: LANDO, describing land cover ownership for the state of Wyoming and containing 33,860 objects; LANDC, describing land cover information such as vegetation types for the state of Wyoming and containing 14,731 objects; and SOIL, representing the Wyoming state soils at a 1:10^5 scale and containing 29,662 objects. The use of synthetic generators for interval data is questionable because it is not clear what acceptable distributions are for the size of the intervals, as well as for the position of the interval end-points. In a similar manner to Section 4.4, each experiment is performed 100 times and either the average relative error, i.e., |actual - estimate| / actual, or the median update time over the number of experiments is reported.

EH3 vs DMAP for Spatial Joins: We used the same experimental setup as in [18] to compare EH3 and DMAP for approximating the size of spatial joins. The average error for estimating the size of spatial joins for both EH3 and DMAP is depicted in Figures 6-5, 6-6, and 6-7. The sketch size varies between 4 and 40 K words of memory. Notice that in all the experiments EH3 significantly outperforms DMAP, by as much as a factor of 8. Since the error of a sketch estimator decreases with the square root of the memory, this means that DMAP needs as much as 64 times more memory in order to achieve the same error guarantees.

Table 6-2 contains the timing results for sketching intervals using the two practical fast range-summable schemes (BCH3 and EH3) and the DMAP method. We present the average time for sketching an interval from each of the real data sets. As expected, BCH3 is the fastest scheme both for sketching only the interval and for sketching the interval together with its end-points. The time to sketch an interval using DMAP is about half the time used by the EH3 fast range-summable algorithm. When sketching both the interval and its end-points, as is the case for the size of join between an interval relation and a point relation, the ratio reverses: DMAP uses twice as much time as EH3. This


happens because DMAP uses a logarithmic number of points in the size of the domain to represent each interval end-point, while EH3 has to generate only two random variables, one for each end-point. The difference between the results reported in Table 6-2 and the previous results (Tables 3-3 and 6-1) is due to the timing procedure. While for the previous results we measured only the generating/fast range-summing time, the results in Table 6-2 also include the overhead time (routine invocation, etc.).

EH3 vs DMAP for Selectivity Estimation: To compare EH3 and DMAP on the task of selectivity estimation, we used the synthetic data generator from [20]. It generates multi-dimensional data distributions consisting of regions, randomly placed in the two-dimensional space, with the number of points in each region Zipf distributed and the distribution within each region Zipf distributed as well. For the experiments we report here, we generated two-dimensional data sets with the domain for each dimension having the size 1024. A data set consists of 10 regions of points. The distribution of the frequencies within each region has a variable Zipf coefficient, as shown in Figure 6-8. Notice that for small Zipf coefficients EH3 outperforms DMAP by a factor of 14. When the Zipf coefficient becomes larger, the gap between DMAP and EH3 shrinks considerably, but EH3 still outperforms DMAP by a large margin.

Accuracy: We pursue two goals in our accuracy experiments. First, we determine the dependence of the average relative error on the memory budget, i.e., the number of counters in the sketch structure. For this, we run experiments with different sketch configurations having either 4 or 8 rows in the structure and varying the number of counters in a row between 64 and 1024. Second, we want to establish the relation between the accuracy and the number of partitions for AGMS P. For this, we distribute the counters in the sketch into 4 to 64 groups corresponding to an equal number of partitions of the domain. Given the previous results in [51] for AGMS and AGMS DMAP, we expect the results for F-AGMS to be close to AGMS DMAP, with some improvement for


F-AGMS COUNTS, which eliminates the effect of outliers to some extent. At the same time, we expect that partitioning does not significantly deteriorate the performance of AGMS P unless it is applied to the extreme where only one counter corresponds to each partition.

Figure 6-9 depicts the accuracy results for a specific parameter configuration. The same trend was observed for the other settings, with the normal behavior for the confined action of each parameter. As expected, the error decreases as the memory budget increases for all the methods (left plot). The behavior of DMAP methods is more sensitive to the available memory, without a significant difference between AGMS and F-AGMS sketches, but still slightly favorable to F-AGMS. What is significant is the effect of maintaining exact counts for F-AGMS DMAP. The error reduces drastically, to the point where it is almost identical to the error of fast range-summation AGMS, the most accurate of the studied methods. This is due to limiting the effect of outliers that would otherwise significantly deteriorate the accuracy of the sketch. Notice the reduced levels of the error for fast range-summation methods even when only low memory is available.

Our second goal was to detect the effect partitioning has on the accuracy of fast range-summation AGMS. From the right plot in Figure 6-9, we observe that partitioning has almost no influence on the accuracy of AGMS, the errors of the two methods being almost identical even when a significant number of partitions is used. Clearly, we expect the accuracy to drop after a certain level of partitioning, when the number of counters corresponding to a partition is small. The error of the other methods is plotted in (b) only for completeness. The fluctuations are due only to the randomness present in the methods since the experiments were repeated with the same parameters for each different configuration of AGMS P.

Update Time: Our objective is to measure the time to update the sketch structure for the presented sketching methods. For a sketch consisting of only one row of counters, we know that the time is linear for AGMS sketches since all the counters have to be updated. This is reflected in Figure 6-10, which depicts the update time per interval for


two sketch structures, one with 512 counters (left), and one with 4096 counters (right).² Notice that Figure 6-10 actually plots the update time per interval as a function of the number of partitions, thus the constant curves for all the methods except AGMS P. As expected, the update time for AGMS P decreases as the number of partitions increases since the number of counters in a partition decreases. The decrease is substantial for a sketch with 512 counters, to the point where the update time is almost identical to the time for DMAP over F-AGMS sketches, the fastest method. Notice the significant gap of 2 to 4 orders of magnitude between the methods based on AGMS and those based on F-AGMS sketches, with the update time for F-AGMS being in the order of a few microseconds, while the update time for AGMS is in the order of milliseconds.

² We used the same machine as in Section 4.4.

6.7 Discussion

As was already known [51], DMAP is inferior both in accuracy and update time to fast range-summation for AGMS sketches, facts re-proved by our experimental results. While DMAP can be used in conjunction with any type of sketching technique, fast range-summation is immediately applicable only to AGMS sketches. In order to improve the update time of AGMS, we propose AGMS P, a method that reduces the number of counters that need to be updated by partitioning the domain of the key and distributing the counters over the partitions. Even with a simple partitioning that splits the domain into buckets of the same size, as is done for equi-width histograms, the improvement we obtain is remarkable, the update time becoming comparable with that for F-AGMS sketches, while the error remains as good as the error of fast range-summation AGMS.

The only improvement gained by using DMAP over F-AGMS sketches is in update time. With a simple implementation modification that keeps exact counts for large dyadic intervals (F-AGMS COUNTS), the error drops significantly and becomes comparable with


the error of fast range-summation AGMS. The roots of this modified method lie in the statistical analysis presented in Section 5.2.

Overall, to obtain methods for sketching intervals that have both small error and efficient update time, the basic techniques (DMAP and fast range-summation) have to be modified. F-AGMS COUNTS is a modification of DMAP over F-AGMS sketches with extremely efficient update time and with error approaching the standard given by AGMS for large enough memory. AGMS P is a modification of fast range-summation AGMS that has excellent error and an update time close to that of F-AGMS sketches when the number of partitions is large enough. In conclusion, we recommend the use of F-AGMS COUNTS when the update time is the bottleneck and AGMS P when the available space is a problem.

6.8 Conclusions

Our primary focus in this chapter was the identification of the fast range-summable schemes that can sketch intervals in sub-linear time. We explain how the fast range-summable versions of two of the 3-wise independent schemes, BCH3 and EH3, can be implemented efficiently and we provide an empirical comparison with the only known 4-wise independent fast range-summable scheme (RM7) that reveals that only BCH3 and EH3 are practical. The EH3-based solutions significantly outperform the state-of-the-art DMAP algorithms for applications such as the size of spatial joins and the dynamic construction of histograms. BCH3 is the perfect solution for sketching highly-skewed interval data since we provide a constant-time algorithm for range-summing BCH3 random variables.

Fast range-summation remains the most accurate method to sketch interval data. Unfortunately, it is applicable only to AGMS sketches and, thus, it is not practical due to the high update time. The solution we propose in this chapter is based on the partitioning of the domain and of the counters in the sketch structure in order to reduce the number of counters that need to be updated. The improvement in update time is substantial, getting close to DMAP over Fast-AGMS sketches, the fastest method studied.


Moreover, by applying a simple modification inspired by the statistical analysis and the empirical study of the sketching techniques, the error of DMAP over F-AGMS can be reduced significantly, to the point where it is almost equal to the error of fast range-summation over AGMS for large enough space. Considering the overall results for sketching interval data, we recommend the use of the fast range-summation method with domain partitioning whenever the accuracy is critical and the use of the DMAP COUNTS method over F-AGMS sketches in situations where the time to maintain the sketch is critical.
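The expected number of buckets touched when hashing an interval (Equations 6-27 and 6-28, Corollary 6) can be checked numerically. The following is only a minimal sketch: the function names are hypothetical, and the simulation assumes a fully random hash function rather than the 2-universal functions used by the actual sketches.

```python
import math
import random


def expected_buckets_touched(num_buckets: int, interval_size: int) -> float:
    """Exact expectation from Equation 6-27: B * (1 - (1 - 1/B)^l)."""
    return num_buckets * (1.0 - (1.0 - 1.0 / num_buckets) ** interval_size)


def approx_fraction_touched(k: float) -> float:
    """Approximation from Equation 6-28 divided by B: 1 - 1/e^k, for l = k*B."""
    return 1.0 - math.exp(-k)


def simulated_fraction_touched(num_buckets: int, interval_size: int,
                               trials: int = 200, seed: int = 7) -> float:
    """Monte Carlo estimate, assuming each element picks a bucket uniformly at random."""
    rng = random.Random(seed)
    touched_total = 0
    for _ in range(trials):
        touched = {rng.randrange(num_buckets) for _ in range(interval_size)}
        touched_total += len(touched)
    return touched_total / (trials * num_buckets)


if __name__ == "__main__":
    B = 1024
    for k in (1, 2, 4):
        exact = expected_buckets_touched(B, k * B) / B
        print(k, round(exact, 4), round(approx_fraction_touched(k), 4))
```

For k = 1, 2, and 4 the exact and approximate fractions agree closely with the percentages stated in Corollary 6, which is why hashing brings no benefit once the interval is a few times larger than the number of buckets.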


MinimalDyadicCover([α, β])
    j ← 0
    D([α, β]) ← ∅
    while α ≤ β
        do if α_j = 1
               then D([α, β]) ← D([α, β]) ∪ [α, α + 2^j)
                    α ← α + 2^j
           if β_j = 0
               then D([α, β]) ← D([α, β]) ∪ [β - 2^j + 1, β + 1)
                    β ← β - 2^j
           j ← j + 1
    return D([α, β])
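The MinimalDyadicCover procedure above can be sketched in Python as follows. This is an illustrative transcription, not the author's implementation: it assumes non-negative integer end-points, reads α_j and β_j as the j-th bits of α and β, takes the loop condition to be α ≤ β, and returns half-open intervals (lo, hi) in the order they are discovered.

```python
def minimal_dyadic_cover(alpha: int, beta: int) -> list[tuple[int, int]]:
    """Return dyadic half-open intervals [lo, hi) whose union is [alpha, beta]."""
    cover = []
    j = 0
    while alpha <= beta:
        if (alpha >> j) & 1:
            # alpha is an odd multiple of 2^j: peel a block of size 2^j from the left.
            cover.append((alpha, alpha + (1 << j)))
            alpha += 1 << j
        if not (beta >> j) & 1:
            # beta + 1 is an odd multiple of 2^j: peel a block of size 2^j from the right.
            cover.append((beta - (1 << j) + 1, beta + 1))
            beta -= 1 << j
        j += 1
    return cover


if __name__ == "__main__":
    print(minimal_dyadic_cover(2, 7))
    print(minimal_dyadic_cover(3, 12))
```

For the interval [2, 7] used in Example 13, the cover consists of the two dyadic intervals [2, 4) and [4, 8).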


BCH3Interval([0, γ], S = [S0, s0])
    γ' ← γ
    sum ← 0
    if γ'_0 = 1
        then γ' ← γ' - 1
             sum ← (-1)^{f(S, γ')}
    for k ← 1 to n - 1
        do if γ' = 0
               then return sum
           if S_{0,k-1} = 1
               then return sum
               else (S_{0,k-1} = 0)
                    if γ'_k = 1
                        then γ' ← γ' - 2^k
                             sum ← sum + 2^k (-1)^{f(S, γ')}
    return sum

BCH3([α, β], S = [S0, s0])
    sum1 ← BCH3Interval([0, β + 1], S)
    sum2 ← BCH3Interval([0, α], S)
    return sum1 - sum2

EH3Interval([α, β], S = [S0, s0])
    Let D = {δ_1, ..., δ_m} be the minimal dyadic cover of [α, β]
    sum ← 0
    for δ ∈ D
        do sum ← sum + (-1)^{#ZERO} 2^j (-1)^{f(S, q 4^j)}
    return sum
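The BCH3 range-summation above can be illustrated with a short Python sketch that is checked against brute-force summation. Several points are assumptions rather than statements from this section: the BCH3 variable is taken to be (-1)^(s0 XOR parity(S0 AND i)), BCH3Interval is read as summing the indices strictly below its upper end-point (which is consistent with the call on β + 1), and the loop runs up to k = n so that the whole domain of size 2^n can be summed. All function names are hypothetical.

```python
def parity(x: int) -> int:
    """Parity (0 or 1) of the number of set bits in x."""
    return bin(x).count("1") & 1


def xi(i: int, S0: int, s0: int) -> int:
    """Assumed BCH3 variable: +1 or -1 from the seed S = [S0, s0]."""
    return -1 if (s0 ^ parity(S0 & i)) else 1


def bch3_prefix_sum(gamma: int, S0: int, s0: int, n: int) -> int:
    """Sum of xi(i) for 0 <= i < gamma, visiting at most n + 1 dyadic blocks."""
    total = 0
    g = gamma
    if g & 1:
        g -= 1
        total = xi(g, S0, s0)          # block of size 1
    for k in range(1, n + 1):
        if g == 0:
            return total
        if (S0 >> (k - 1)) & 1:
            return total               # every block of size >= 2^k sums to zero
        if (g >> k) & 1:
            g -= 1 << k
            total += (1 << k) * xi(g, S0, s0)   # all 2^k values in the block are equal
    return total


def bch3_range_sum(alpha: int, beta: int, S0: int, s0: int, n: int) -> int:
    """Sum of xi(i) over the closed interval [alpha, beta]."""
    return bch3_prefix_sum(beta + 1, S0, s0, n) - bch3_prefix_sum(alpha, S0, s0, n)


if __name__ == "__main__":
    print(bch3_range_sum(2, 7, S0=0b0100, s0=0, n=4))
```

The early return reflects the reason the scheme is fast: once a seed bit below position k is set, the values inside any aligned block of size 2^k cancel out, so no larger block contributes to the sum.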


Table 6-1. Sketching time per interval.

Scheme    Time (ns)
BCH3      68.9
EH3       1,798
RM7       26.4 × 10^6

Table 6-2. Sketching time per interval (ns).

          Interval                   Interval + End-Points
Scheme    LANDC   LANDO   SOIL       LANDC   LANDO   SOIL
BCH3      50      75      79         106     126     115
EH3       637     651     604        681     701     651
DMAP      409     298     309        1500    1401    1391

Figure 6-1. The number of polynomial evaluations. [Plot: curves "linear" and "approximate"; x-axis: number of variables n.]
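The throughput figures quoted in Section 6.6 follow directly from the per-interval times in Table 6-1. The short conversion below is only a worked check; the dictionary and function names are hypothetical, and the times are copied from the table.

```python
# Sketching time per interval in nanoseconds, as listed in Table 6-1.
TIME_NS = {"BCH3": 68.9, "EH3": 1798.0, "RM7": 26.4e6}


def sketches_per_second(time_ns: float) -> float:
    """Convert a per-interval time in nanoseconds to sketch computations per second."""
    return 1e9 / time_ns


if __name__ == "__main__":
    for scheme, time_ns in TIME_NS.items():
        print(scheme, round(sketches_per_second(time_ns)))
```

With these values EH3 exceeds 550,000 sketch computations per second, while RM7 stays below 40, which matches the reason given for excluding RM7 from the comparison.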


Figure 6-2. The set of dyadic intervals over the domain I = {0, 1, ..., 15}. [Diagram: domain points 0 to 15 with dyadic levels 0 to 4.]

Figure 6-3. Dyadic mappings. (a) Interval mapping. (b) Point mapping.

Figure 6-4. Fast range-summation with domain partitioning. [Diagram: the interval [2, 7] over the domain 0 to 15, split into the pieces [2, 3] and [4, 7].]
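The bookkeeping behind Example 13 and Figure 6-4 can be sketched in a few lines of Python. The code only determines which partitions an interval intersects and which counters would be range-summed; the fast range-summation itself is omitted. The function names, the contiguous numbering of counters, and the assumption that the domain size is a multiple of the number of partitions are all illustrative choices, not part of the original method description.

```python
def partition_intersections(lo: int, hi: int, domain_size: int,
                            num_partitions: int) -> dict[int, tuple[int, int]]:
    """Map each intersected equi-width partition to its non-empty piece of [lo, hi]."""
    width = domain_size // num_partitions  # assumes an exact division
    pieces = {}
    for p in range(lo // width, hi // width + 1):
        part_lo = p * width
        part_hi = part_lo + width - 1
        pieces[p] = (max(lo, part_lo), min(hi, part_hi))
    return pieces


def counters_for_interval(lo: int, hi: int, domain_size: int, num_partitions: int,
                          counters_per_partition: int) -> dict[int, list[int]]:
    """Counters to update per intersected partition (counters numbered contiguously)."""
    pieces = partition_intersections(lo, hi, domain_size, num_partitions)
    return {
        p: list(range(p * counters_per_partition, (p + 1) * counters_per_partition))
        for p in pieces
    }


if __name__ == "__main__":
    print(partition_intersections(2, 7, 16, 4))
    print(counters_for_interval(2, 7, 16, 4, 2))
```

For the interval [2, 7] the sketch returns the pieces [2, 3] and [4, 7] in partitions 0 and 1, so four of the eight counters are updated, as in the example; a single point touches only the two counters of its own partition.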


Figure 6-5. LANDO ⋈ LANDC.

Figure 6-6. LANDO ⋈ SOIL.


Figure 6-7. LANDC ⋈ SOIL.

Figure 6-8. Selectivity estimation.


Figure 6-9. Accuracy of sketches for interval data. (a) LANDO ⋈ LANDC. (b) AGMS P.

Figure 6-10. Update time per interval as a function of the number of partitions for the SOIL data set. (a) SOIL with 512 counters. (b) SOIL with 4096 counters.


CHAPTER 7
CONCLUSIONS

The focus of this research work was on sketch synopsis for aggregate queries over data streams. We believe that our findings are essential for understanding the theoretical foundations that lie at the basis of the sketching methods. At the same time, we believe that the experimental results presented throughout this work support our final goal to make sketch synopsis practical for the application in a data stream environment. The theoretical and practical findings of our work can be summarized as follows:

• We conducted both a theoretical and an empirical study of the various schemes used for the generation of the random variables that appear in AGMS sketches estimations. We provided theoretical and empirical evidence that EH3 can replace the 4-wise independent schemes for the estimation of the size of join using AGMS sketches. Our main recommendation is to use the EH3 random variables for AGMS sketches estimations of the size of join since they can be generated more efficiently and use smaller seeds than any of the 4-wise independent schemes. At the same time, the error of the estimate is as good as or, in the case when the distribution has low skew, better than the error provided by a 4-wise independent scheme.

• We provided the moment analysis of the sketch over samples estimators for two types of sampling: Bernoulli and sampling with replacement. Sketching Bernoulli samples is essentially a load shedding technique for sketching data streams which results, as our theory and experiments suggest, in significant update time reduction (by as much as a factor of 100) with minimal accuracy degradation. Sketching samples with replacement from an unknown distribution allows efficient characterization of the unknown distribution, which has many applications to online data-mining.
• We studied the four basic sketching techniques proposed in the literature, AGMS, Fast-AGMS, Fast-Count, and Count-Min, from both a statistical and an empirical point of view. Our study complements and refines the theoretical results known about these sketches. The analysis reveals that Fast-AGMS and Count-Min sketches have much better performance than the theoretical prediction for skewed data, by a factor as much as 10^6 to 10^8 for large skew. Overall, the analysis indicates strongly that Fast-AGMS sketches should be the preferred sketching technique since it has consistently good performance throughout the spectrum of problems. The success of the statistical analysis we performed indicates that, especially for estimators that use minimum or median, such analysis gives insights that are easily missed by classical theoretical analysis. Given the good performance, the small update time, and the fact that they have tight error guarantees, Fast-AGMS sketches are


appealing as a practical basic approximation technique that is well suited for data stream processing.

• We identified the fast range-summable schemes that can sketch intervals in sub-linear time. We explained how the fast range-summable versions of two of the 3-wise independent schemes, BCH3 and EH3, can be implemented efficiently and we provided an empirical comparison with the only known 4-wise independent fast range-summable scheme (RM7) that reveals that only BCH3 and EH3 are practical. The EH3-based solutions significantly outperform the state-of-the-art DMAP algorithms for applications such as the size of spatial joins and the dynamic construction of histograms. BCH3 is the perfect solution for sketching highly-skewed interval data since we provided a constant-time algorithm for range-summing BCH3 random variables. Fast range-summation remains the most accurate method to sketch interval data. Unfortunately, it is applicable only to AGMS sketches and, thus, it is not practical due to the high update time. The solution we proposed is based on the partitioning of the domain and of the counters in the sketch structure in order to reduce the number of counters that need to be updated. The improvement in update time is substantial, getting close to DMAP over Fast-AGMS sketches, the fastest method studied. Moreover, by applying a simple modification inspired by the statistical analysis and the empirical study of the sketching techniques, the error of DMAP over F-AGMS can be reduced significantly, to the point where it is almost equal to the error of fast range-summation over AGMS for large enough space. Considering the overall results for sketching interval data, we recommend the use of the fast range-summation method with domain partitioning whenever the accuracy is critical and the use of the DMAP COUNTS method over F-AGMS sketches in situations where the time to maintain the sketch is critical.


REFERENCES

[1] P. Aduri and S. Tirthapura, "Range efficient computation of F0 over massive data streams," Proc. IEEE ICDE, 2005.
[2] N. Alon, L. Babai, and A. Itai, "A fast and simple randomized parallel algorithm for the maximal independent set problem," Journal of Algorithms, vol. 7, no. 4, pp. 567-583, 1986.
[3] N. Alon, Y. Matias, and M. Szegedy, "The space complexity of approximating the frequency moments," Proc. ACM STOC, 1996.
[4] N. Alon, P. B. Gibbons, Y. Matias, and M. Szegedy, "Tracking join and self-join sizes in limited storage," Journal of Computer and System Sciences, vol. 64, no. 3, pp. 719-747, 2002.
[5] N. An, Z. Y. Yang, and A. Sivasubramaniam, "Selectivity estimation for spatial joins," Proc. IEEE ICDE, 2001.
[6] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom, "Models and issues in data stream systems," Proc. ACM SIGMOD, 2002.
[7] K. P. Balanda and H. L. MacGillivray, "Kurtosis: A critical review," J. American Statistician, vol. 42, no. 2, pp. 111-119, 1988.
[8] Z. Bar-Yossef, R. Kumar, and D. Sivakumar, "Reductions in streaming algorithms, with an application to counting triangles in graphs," Proc. ACM SODA, 2002.
[9] A. R. Calderbank, A. Gilbert, K. Levchenko, S. Muthukrishnan, and M. Strauss, "Improved range-summable random variable construction algorithms," Proc. ACM SODA, 2005.
[10] L. Carter and M. N. Wegman, "Universal classes of hash functions," Journal of Computer and System Sciences, vol. 18, no. 2, pp. 143-154, 1979.
[11] M. Charikar, K. Chen, and M. Farach-Colton, "Finding frequent items in data streams," Proc. Int'l Conf. ICALP, 2002.
[12] S. Coles, "An Introduction to Statistical Modeling of Extreme Values," Springer-Verlag, 2001.
[13] G. Cormode and M. Garofalakis, "Sketching probabilistic data streams," Proc. ACM SIGMOD, 2007.
[14] G. Cormode and M. Garofalakis, "Sketching streams through the net: distributed approximate query tracking," Proc. 31st Int'l Conf. Very Large Data Bases (VLDB), 2005.


[15] G. Cormode and S. Muthukrishnan, "An improved data stream summary: the count-min sketch and its applications," Journal of Algorithms, vol. 55, no. 1, pp. 58-75, 2005.
[16] G. Cormode and S. Muthukrishnan, "Summarizing and mining skewed data streams," Proc. SIAM Data Mining, 2005.
[17] CPS, http://www.census.gov/cps, accessed Nov. 2006.
[18] A. Das, J. Gehrke, and M. Riedewald, "Approximation techniques for spatial data," Proc. ACM SIGMOD, 2004.
[19] W. E. Deskins, "Abstract Algebra," Dover Publications, 1996.
[20] A. Dobra, M. Garofalakis, J. Gehrke, and R. Rastogi, "Processing complex aggregate queries over data streams," Proc. ACM SIGMOD, 2002.
[21] P. Domingos and G. Hulten, "Mining high-speed data streams," Proc. ACM SIGKDD, 2000.
[22] W. Dumouchel, C. Faloutsos, P. J. Haas, J. M. Hellerstein, Y. Ioannidis, H. V. Jagadish, T. Johnson, R. Ng, V. Poosala, K. A. Ross, and K. C. Sevcik, "The New Jersey data reduction report," IEEE Data Eng. Bulletin, vol. 20, no. 1, pp. 3-45, 1997.
[23] A. Ehrenfeucht and M. Karpinski, "The computational complexity of xor-and counting problems," ICSI Technical Report TR-90-033, 1990.
[24] C. Estan and J. F. Naughton, "End-biased samples for join cardinality estimation," Proc. IEEE ICDE, 2006.
[25] J. Feigenbaum, S. Kannan, M. Strauss, and M. Viswanathan, "An approximate L1-difference algorithm for massive data streams," SIAM Journal on Computing, vol. 32, no. 1, pp. 131-151, 2002.
[26] P. Flajolet and G. N. Martin, "Probabilistic counting algorithms for database applications," J. Comp. Syst. Sci., vol. 31, no. 1, pp. 182-209, 1985.
[27] S. Ganguly, M. Garofalakis, and R. Rastogi, "Processing set expressions over continuous update streams," Proc. ACM SIGMOD, 2003.
[28] S. Ganguly, M. Garofalakis, and R. Rastogi, "Processing data-stream join aggregates using skimmed sketches," Proc. Int'l Conf. EDBT, 2004.
[29] S. Ganguly, D. Kesh, and C. Saha, "Practical algorithms for tracking database join sizes," Proc. Int'l Conf. FSTTCS, 2005.
[30] P. Gibbons, Y. Matias, and V. Poosala, "Fast incremental maintenance of approximate histograms," Proc. 23rd Int'l Conf. Very Large Data Bases (VLDB), 1997.


[31] P. B. Gibbons and Y. Matias, "Synopsis data structures for massive data sets," DIMACS, 1999.
[32] A. C. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. J. Strauss, "One-pass wavelet decompositions of data streams," IEEE TKDE, vol. 15, no. 3, pp. 541-554, 2003.
[33] A. C. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. J. Strauss, "Domain-driven data synopses for dynamic quantiles," IEEE TKDE, vol. 17, no. 7, pp. 927-938, 2005.
[34] O. Goldreich, "A sample of samplers - a computational perspective on sampling," Electronic Colloquium on Computational Complexity, vol. 4, no. 20, 1997.
[35] P. J. Haas and J. M. Hellerstein, "Ripple joins for online aggregation," Proc. ACM SIGMOD, 1999.
[36] A. S. Hedayat, N. J. A. Sloane, and J. Stufken, "Orthogonal Arrays: Theory and Applications," Springer-Verlag, 1999.
[37] P. Indyk, "Stable distributions, pseudorandom generators, embeddings, and data stream computation," J. ACM, vol. 53, no. 3, pp. 307-323, 2006.
[38] T. S. Jayram, A. McGregor, S. Muthukrishnan, and E. Vee, "Estimating statistical aggregates on probabilistic data streams," Proc. ACM SIGMOD, 2007.
[39] C. Jermaine, A. Dobra, S. Arumugam, S. Joshi, and A. Pol, "The sort-merge-shrink join," ACM Transactions on Database Systems, vol. 31, no. 4, pp. 1382-1416, 2006.
[40] M. Karpinski and M. Luby, "Approximating the number of zeroes of a GF[2] polynomial," Journal of Algorithms, vol. 14, no. 2, pp. 280-287, 1993.
[41] D. Kempe, A. Dobra, and J. Gehrke, "Gossip-based computation of aggregate information," Proc. IEEE FOCS, 2003.
[42] B. Krishnamurthy, S. Sen, Y. Zhang, and Y. Chen, "Sketch-based change detection: methods, evaluation, and applications," Proc. ACM SIGCOMM, 2003.
[43] K. Levchenko, http://www.cse.ucsd.edu/klevchen/II-2005.pdf, accessed Aug. 2005.
[44] M. Luby and A. Wigderson, "Pairwise independence and derandomization," EECS UC Berkeley Technical Report UCB/CSD-95-880, 1995.
[45] MassDAL, http://www.cs.rutgers.edu/muthu/massdal.html, accessed Jul. 2006.
[46] R. Motwani and P. Raghavan, "Randomized Algorithms," Cambridge University Press, 1995.
[47] S. Muthukrishnan, "Data streams: algorithms and applications," Found. Trends Theor. Comput. Sci., vol. 1, no. 2, pp. 117-136, 2005.


[48] D. J. Olive, "A simple confidence interval for the median," Manuscript, http://www.math.siu.edu/olive/ppmedci.pdf, accessed Nov. 2006.
[49] F. Pennecchi and L. Callegaro, "Between the mean and the median: the Lp estimator," Metrologia, vol. 43, no. 3, pp. 213-219, 2006.
[50] R. M. Price and D. G. Bonett, "Estimating the variance of the sample median," J. Statistical Computation and Simulation, vol. 68, no. 3, pp. 295-305, 2001.
[51] F. Rusu and A. Dobra, "Pseudo-random number generation for sketch-based estimations," ACM Transactions on Database Systems, vol. 32, no. 2, 2007.
[52] F. Rusu and A. Dobra, "Statistical analysis of sketch estimators," Proc. ACM SIGMOD, 2007.
[53] L. Sachs, "Applied Statistics - A Handbook of Techniques," Springer-Verlag, 1984.
[54] J. Shao, "Mathematical Statistics," Springer-Verlag, 1999.
[55] C. Sun, D. Agrawal, and A. El Abbadi, "Selectivity estimation for spatial joins with geometric selections," Proc. Int'l Conf. EDBT, 2002.
[56] N. Tatbul, U. Cetintemel, S. Zdonik, M. Cherniack, and M. Stonebraker, "Load shedding in a data stream manager," Proc. 29th Int'l Conf. Very Large Data Bases (VLDB), 2003.
[57] N. Thaper, S. Guha, P. Indyk, and N. Koudas, "Dynamic multidimensional histograms," Proc. ACM SIGMOD, 2002.
[58] M. Thorup and Y. Zhang, "Tabulation based 4-universal hashing with applications to second moment estimation," Proc. ACM SODA, 2004.
[59] M. Wegman and J. Carter, "New hash functions and their use in authentication and set equality," Journal of Computer and System Sciences, vol. 22, no. 3, pp. 265-279, 1981.


BIOGRAPHICAL SKETCH

Florin Rusu received his Ph.D. in Computer Science from the University of Florida. He worked under the supervision of Dr. Alin Dobra as a member of the Database Center. Florin Rusu is originally from the city of Cluj-Napoca, in the historical region of Transylvania, Romania. Florin received his Bachelor of Engineering degree in 2004 from the Technical University of Cluj-Napoca, Faculty of Automation and Computer Science, under the supervision of Prof. Sergiu Nedevschi. Florin has research interests in the area of approximate query processing for databases and data streaming, data warehouses, and database system design and implementation.