<%BANNER%>

Monte Carlo Processing of Satisfiability Queries in Uncertain Databases

Permanent Link: http://ufdc.ufl.edu/UFE0025141/00001

Material Information

Title: Monte Carlo Processing of Satisfiability Queries in Uncertain Databases
Physical Description: 1 online resource (64 p.)
Language: english
Creator: Perez, Luis
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2009

Subjects

Subjects / Keywords: database, information, probabilistic, systems, uncertainty
Computer and Information Science and Engineering -- Dissertations, Academic -- UF
Genre: Computer Engineering thesis, M.S.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, terriorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract: The management of uncertain information using relational databases has been an active area of research during the last decade. Several data and query processing models have been introduced, including the 'possible worlds', which is used as the basis for most probabilistic data management systems. An example of such system is the Monte Carlo Database (MCDB), an uncertain data management framework developed at the University of Florida that relies on Monte Carlo sampling to perform scalable query processing. While MCDB provides an efficient query processing model, it does not allow for error control over the random samples it generates. We introduce satisfiability queries, a special type of database queries associated with confidence intervals and false positive/false negative error control information. Furthermore, we propose a set of algorithms and query processing techniques that allow for these types of queries to be efficiently evaluated in the context of the MCDB framework.
General Note: In the series University of Florida Digital Collections.
General Note: Includes vita.
Bibliography: Includes bibliographical references.
Source of Description: Description based on online resource; title from PDF title page.
Source of Description: This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility: by Luis Perez.
Thesis: Thesis (M.S.)--University of Florida, 2009.
Local: Adviser: Jermaine, Christopher.
Electronic Access: RESTRICTED TO UF STUDENTS, STAFF, FACULTY, AND ON-CAMPUS USE UNTIL 2010-06-30

Record Information

Source Institution: UFRGP
Rights Management: Applicable rights reserved.
Classification: lcc - LD1780 2009
System ID: UFE0025141:00001

Permanent Link: http://ufdc.ufl.edu/UFE0025141/00001

Material Information

Title: Monte Carlo Processing of Satisfiability Queries in Uncertain Databases
Physical Description: 1 online resource (64 p.)
Language: english
Creator: Perez, Luis
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2009

Subjects

Subjects / Keywords: database, information, probabilistic, systems, uncertainty
Computer and Information Science and Engineering -- Dissertations, Academic -- UF
Genre: Computer Engineering thesis, M.S.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, terriorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract: The management of uncertain information using relational databases has been an active area of research during the last decade. Several data and query processing models have been introduced, including the 'possible worlds', which is used as the basis for most probabilistic data management systems. An example of such system is the Monte Carlo Database (MCDB), an uncertain data management framework developed at the University of Florida that relies on Monte Carlo sampling to perform scalable query processing. While MCDB provides an efficient query processing model, it does not allow for error control over the random samples it generates. We introduce satisfiability queries, a special type of database queries associated with confidence intervals and false positive/false negative error control information. Furthermore, we propose a set of algorithms and query processing techniques that allow for these types of queries to be efficiently evaluated in the context of the MCDB framework.
General Note: In the series University of Florida Digital Collections.
General Note: Includes vita.
Bibliography: Includes bibliographical references.
Source of Description: Description based on online resource; title from PDF title page.
Source of Description: This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility: by Luis Perez.
Thesis: Thesis (M.S.)--University of Florida, 2009.
Local: Adviser: Jermaine, Christopher.
Electronic Access: RESTRICTED TO UF STUDENTS, STAFF, FACULTY, AND ON-CAMPUS USE UNTIL 2010-06-30

Record Information

Source Institution: UFRGP
Rights Management: Applicable rights reserved.
Classification: lcc - LD1780 2009
System ID: UFE0025141:00001


This item has the following downloads:


Full Text

PAGE 1

MONTECARLOPROCESSINGOFSATISFIABILITYQUERIESINUNCERTAIN DATABASES By LUISL.PEREZ ATHESISPRESENTEDTOTHEGRADUATESCHOOL OFTHEUNIVERSITYOFFLORIDAINPARTIALFULFILLMENT OFTHEREQUIREMENTSFORTHEDEGREEOF MASTEROFSCIENCE UNIVERSITYOFFLORIDA 2009 1

PAGE 2

c 2009LuisL.Perez 2

PAGE 3

ACKNOWLEDGMENTS Iwouldliketothankmyadvisor,Dr.ChrisJermaine,forhisconstantsupport, guidanceandpatience.MygratitudealsogoestoDr.AlinDobra,Dr.TamerKahveciand Dr.ManuelBermudez. Myappreciationgoestomyfellowstudentsatthedatabaselab:NiketanPansare, FerhatAy,RaviJampani,MingxiWu,FlorinRusu,FeiXu,LixiaChen,BinSongand SubiArumugam.IamindebtedtoNuriYeralanforhishelpinthewritingofthisThesis. 3

PAGE 4

TABLEOFCONTENTS Page ACKNOWLEDGMENTS.................................3 LISTOFTABLES.....................................6 LISTOFFIGURES....................................7 ABSTRACT........................................8 CHAPTER 1INTRODUCTION..................................9 1.1OurContributions...............................13 1.2ThesisOutline..................................13 2PRELIMINARIES..................................15 2.1RelatedWork..................................15 2.2MonteCarloStatisticalMethods.......................17 2.3StatisticalHypothesisTesting.........................19 2.4TheMonteCarloDatabaseSystemMCDB.................22 2.4.1TupleBundles..............................23 2.4.2VariableGeneratorVGFunctionsandthe Instantiate Operator24 2.4.3The Split Operator..........................26 2.4.4RelationalOperators..........................27 3SATISFIABILITYQUERIES............................28 3.1Denition{SyntaxandSemantics......................28 3.2StatisticalFoundations.............................29 3.3OverviewofQueryProcessing.........................31 4IMPLEMENTATIONDETAILS...........................36 4.1TupleRe-Scanning...............................36 4.2Explicitvs.InferredLineage..........................38 4.3StochasticConsiderations...........................38 5EXPERIMENTALRESULTS............................41 5.1ExperimentalSetupandMethodology.....................41 5.1.1QueryQ1................................41 5.1.2QueryQ2................................42 5.1.3QueryQ3................................43 5.1.4QueryQ4................................44 5.1.5QueryQ5................................45 4

PAGE 5

5.2ResultsandDiscussion.............................46 5.2.1QueryQ1................................47 5.2.2QueryQ2................................47 5.2.3QueryQ3................................48 5.2.4QueryQ4................................48 5.2.5QueryQ5................................50 6CONCLUSIONS...................................53 APPENDIX:SQLCODEFOREXPERIMENTALQUERYPLANS..........55 A.1QueryQ1....................................55 A.2QueryQ2....................................56 A.3QueryQ3....................................57 A.4QueryQ4....................................58 A.5QueryQ5....................................59 REFERENCES.......................................61 BIOGRAPHICALSKETCH................................64 5

PAGE 6

LISTOFTABLES Table Page 5-1SummaryofresultsforqueryQ1..........................51 5-2SummaryofresultsforqueryQ2..........................51 5-3SummaryofresultsforqueryQ3..........................51 5-4SummaryofresultsforqueryQ4..........................52 5-5SummaryofresultsforqueryQ5..........................52 6

PAGE 7

LISTOFFIGURES Figure Page 1-1SampleoutputofanMCDBquery.........................10 1-2Sampleoutputofasatisabilityquery.......................12 2-1EstimatingtheareaofagurewithaMonteCarlomethod...........18 2-2Asequentialhypothesistestinaction.......................21 2-3QueryprocessingintheMCDBsystem.......................23 2-4AtuplebundleinMCDB..............................24 2-5The Instantiate operation.............................25 2-6The Split operationonatuplebundle......................26 3-1SatisabilityqueryinthetraditionalMCDBa,andthersttestblockused fordirectprocessingofasatisabilityqueryb..................31 3-2Runningasatisabilityqueryusingaseriesoftestblocks.............34 4-1Databasetableswithlineageinformationattached................36 5-1PhysicalqueryplanforQueryQ1.........................42 5-2PhysicalqueryplanforQueryQ2.........................43 5-3PhysicalqueryplanforQueryQ3.........................44 5-4PhysicalqueryplanforQueryQ4.........................45 5-5PhysicalqueryplanforQueryQ5.........................46 5-6Wall-clockrunningtimesaandnumberofkeep-goingpartitionsbforquery Q1...........................................47 5-7Wall-clockrunningtimesaandnumberofkeep-goingpartitionsbforquery Q2...........................................48 5-8Wall-clockrunningtimesaandnumberofkeep-goingpartitionsbforquery Q3...........................................49 5-9Wall-clockrunningtimesaandnumberofkeep-goingpartitionsbforquery Q4...........................................49 5-10Wall-clockrunningtimesaandnumberofkeep-goingpartitionsbforquery Q5...........................................50 7

PAGE 8

AbstractofThesisPresentedtotheGraduateSchool oftheUniversityofFloridainPartialFulllmentofthe RequirementsfortheDegreeofMasterofScience MONTECARLOPROCESSINGOFSATISFIABILITYQUERIESINUNCERTAIN DATABASES By LuisL.Perez December2009 Chair:ChristopherJermaine Major:ComputerEngineering Themanagementofuncertaininformationusingrelationaldatabaseshasbeenan activeareaofresearchduringthelastdecade.Severaldataandqueryprocessingmodels havebeenintroduced,includingthe"possibleworlds",whichisusedasthebasisformost probabilisticdatamanagementsystems.AnexampleofsuchsystemistheMonteCarlo DatabaseMCDB,anuncertaindatamanagementframeworkdevelopedattheUniversity ofFloridathatreliesonMonteCarlosamplingtoperformscalablequeryprocessing. WhileMCDBprovidesanecientqueryprocessingmodel,itdoesnotallowforerror controlovertherandomsamplesitgenerates.Weintroducesatisabilityqueries,aspecial typeofdatabasequeriesassociatedwithcondenceintervalsandfalsepositive/false negativeerrorcontrolinformation.Furthermore,weproposeasetofalgorithmsandquery processingtechniquesthatallowforthesetypesofqueriestobeecientlyevaluatedinthe contextoftheMCDBframework. 8

PAGE 9

CHAPTER1 INTRODUCTION Inthelastdecade,therehasbeenaconcreteeortfromthedatabasecommunityto enabledatabasemanagementsystemstodealwithdatauncertainty.Thisdevelopment isdrivenbymodernenterprisesthathavebusinesscasesinvolvinguncertaintyusually expressedwithsomesortofstochasticmodelinordertoassessrisks.However,mostof theirdataisstoredbyrelationaldatabases. Severaldataandqueryprocessingmodelshavebeenproposed,includingthe extendedrelationalmodel",orERM.Inthismodel,tuplesorattributesareaugmented withprobabilityvalues.Ithasbeenusedasabasisforsomepopularsystemprototypes [1,3,11]. AlthoughtheERMhasbeenimplementedsuccessfullyandisbroadlyacceptedasa frameworkforprobabilisticdatabases,itspurposeisnottobecomeageneralstandard foruncertaintymanagement,butrathertobringfuzzyqueries"fromthedomainof theInformationRetrievalcommunityintorelationaldatabasesystems[6,18].While someuncertaintyapplicationscanbeaccommodatedbythisframeworkcorrectly, moreadvanceduncertaintymodels,likeVARTAprocessesandcopulas[10,33],can becomeprohibitivelydiculttoimplement.ThisisduetothefactthattheERMisa representationofaspecictypeofuncertainty. Forexample,consideracompany'srelationaldatabasethatcontainscustomer information.Themarketingdivisionneedstodeterminewhichcustomerswillbethe mostprotableinthenextquarterbasedonaggregateinformationfromprevious customersthatresideinthesamegeographicarea.Supposethatcustomersarestored inatable CUSTOMERCustomerID,ZipCode andsalesinformationisstoredinatable SALES DATAZipCode,IncomeMean,IncomeSD ,where IncomeMean and IncomeSD representthemeanandstandarddeviationoftheincomethecompanyreceivedfrom customerslocatedinagivenzipcode,respectively. 9

PAGE 10

Inordertoestimatethepotentialprotsforeachindividual,astochasticmodelis designed.Assumethatthepotentialprotsarenormallydistributedandindependent foreachindividual.Abasicmodelcantakesamplesfromanormaldistributionwith parameters = IncomeMean and = IncomeSD ,whichwetakefromthe SALES DATA table foreachcustomerwiththesamezipcode.Currently,usersmightimplementthemodel usingahigh-levellanguagesuchasRorMATLABandrunitondatathatwaspreviously extractedfromtherelationaldatabase.Suchanextractionprocesscanbedicult, becausecertaindatatransformationsmustbeperformedtothedatainordertotinto themodelapplication.Essentially,theuserhastowriteanapplicationthatretrieves queryresultsfromarelationaldatabaseandmakethemavailableforthestochasticmodel. Oneofthegoalsbehinduncertaindatamanagementistobringthestochasticmodels closertothedatabasesystemsothatuserscanruntheminplace"toavoidrelyingona extractionprocess,andtotakeadvantageofthecapabilitiesoftherelationaldatabase.In ordertoachieveahigherlevelofgenerality,weproposedanovelframeworkforuncertain datamanagement,calledtheMonteCarloDatabase[24].Oneofthekeydesignprinciples ofMCDBisthelackofhard-wiring"ofagivendatamodel,thedetailsofwhichare suppliedbytheuserthroughuser-denedfunctionsknownasVariableGenerator"VG functionsthatproduceMonteCarlosamplesforsaidmodel.Everysampleisseenasan independentinstanceofthedatabase,apossibleworld",inwhichthequeryisexecuted withthenalresultbeingadistribution. CustomerID ZipCode Income 101 32607 3,120.4 3,283.2 3,098.9 3,167.4 3,049.3 102 32607 4,284.3 4,225.5 4,302.1 4,244.7 2,277.8 103 32611 1,392.2 1,128.7 1,484.3 1,244.4 1,099.9 Figure1-1.SampleoutputofanMCDBquery Moreover,theisolationbetweenthedataandthemodelallowsforthedatabaseto bestoredinitstraditionalrelationalformwithouttheneedforanextractionprocess. ThequerydescribedabovecanbeimplementedeasilywithMCDB.Alltheuserneedsto 10

PAGE 11

providetothesystemisaVGfunctionthatproducesindividualsamplesfromthemodel shedevised. Inthescenariodescribedabove,a Normal VGfunctioncanbecreatedsothatittakes the IncomeMean and IncomeSD attributevaluesasparametersandproducesalltheMonte Carlosamples.MCDBusesittocreatesanewrelation CUSTOMER DATACustomerID, Income where Income isastochasticattribute"containingallthesamplesobtainedfor thatspeciccustomer.UsingMCDB'sownSQLsyntax,thisquerycanbeexpressedas follows: CREATETABLECUSTOMER DATA FOREACHcINCUSTOMER WITHIncomeDistributionASNormal SELECTd.IncomeMean,d.IncomeSD FROMSALES DATAd WHEREd.ZipCode=c.ZipCode SELECTc.CustomerID,c.ZipCode,id.VALUEASIncome FROMIncomeDistributionid Figure1-1showsanexampleoftheoutputofthisstatement,assumingthat,foreach tuplein CUSTOMER ,theVGfunction Normal producesvesamples. EventhoughMCDBprovidesaexibleframeworkforuncertaintyrepresentation,it doesnotperformanyanalysisontheresultingdistribution.Oneofthemostcommonly usedmethodsisstatisticalhypothesistesting{theacceptanceorrejectionofagiven hypothesisbasedonasetofobservedresults.RecallthatMCDBreliesonMonteCarlo samplingtoproduce approximate answerstoqueries.Manyuserscantolerateapproximate answers,butitpromptsthequestion:howmanysamplesarerequiredinordertoaccept orrejectthehypothesis?" Determininganacceptablesamplesizeisextremelyhard,asitnormallydependson thenatureoftheuncertaintymodel,thedata,andthequery.Additionally,theremay 11

PAGE 12

bedierentlevelsinwhichthehypothesisistested.Inourexample,ahypothesiscanbe appliedtoeachindividualcustomer,tothecustomersofacertaingeographiclocation,or eventheentiredatabase.Thesecomplexitiesmakeitchallengingfortheusertocomplete hypothesistestingecientlyonthedata. CityID HypothesisResult 1001 ACCEPT 1002 REJECT Figure1-2.Sampleoutputofasatisabilityquery Inthisworkweaddresstheaboveproblembypushingthehypothesistestanalysis intothedatabasesystembydeninganewtypeofquerywheretheuserprovides additionalinformationonthestatisticalrequirementsfortheresults,calledsatisability queries".Simplyput,asatisabilityqueryisarelationalMCDBqueryincluding informationonthehypothesistestthatistobeapplied,namely 1.Abooleanpredicatethatdetermineswhetherornotaninstanceofthedatabase passesthetest 2.Thelevelofstatisticalsignicance 3.Thefalsepositiveandfalsenegativerates,portype-Iandtype-IIerrors 4.Asetofattributesthatdenesthegroupinglevelinwhichthehypothesisistested, associatedwithaquantieruniversalorexistentialthatdeneshowagroupof tuplespassesthetest. Ourexamplequerycanbeextendedtoasatisabilityquery.Supposethereexists anothertable CITY CODESCityID,ZipCode thatcontainsallthezipcodesforagiven city.Atypicalsatisabilityquerycanbeprovidedtodeterminethecitieswhereallthe citizenswillprovideareturnhigherthan$2,000,with95%condenceandfalsepositive andnegativeratesof1%and0.1%,respectively.ThiscanbeexpressedusingMCDB's ownSQLsyntaxasfollows: USINGSELECTcc.CityID FROMCUSTOMER DATAcd,CITY CODEScc 12

PAGE 13

WHEREcd.CustomerID=c.CustomerIDAND cd.ZipCode=cc.ZipCode TESTWHETHERALLcd.Income>2000 WITHPROBABILITY>0.95 PARTITIONBYcc.CityID Figure1-2providesanexampleresultoftheabovequeryassumingthatcustomers101 and102liveincity1001andcustomer103livesincity1002. ThisnaturalextensionoftheMCDBframeworkisthecentralthemeofthis work.Althoughitisasimpleidea,thereareseveralconsiderationsthatcomplicateits formulationandimplementation.Asmentionedabove,oneofthemostimportantissues isrelatedtothenecessarysamplesizeforatupleorgrouptopassorfailthehypothesis test.Anotherissueisrelatedtothebooleanpredicateusedbythehypothesistestand howitaectstheprocessingofthequeryplan.Thesetwoaspectsareconsideredindetail throughoutthisthesis. 1.1OurContributions Ourresultsfocuseonthetheoreticalandpracticalaspectsofsatisabilityqueriesand canbesummarizedby: Aformaldenitionofasatisabilityqueryisshownintermsofitsstatistical guarantees. Stochasticconsiderationssuchasrelationaloperatorsonrandomattributesare addressed,andeectivesolutionsareproposed. AnecientMCDBqueryprocessingapproachforthistypeofqueryisgiven. Experimentalresultsshowingthattheapproachproposedispractical,ecient,and scalable. 1.2ThesisOutline Chapter2presentsareviewoftherelatedworkonuncertainandprobabilisticdata management,anintroductiontoMonteCarlomethods,statisticalhypothesistestinganda descriptionoftheMCDBframework.Chapter3isdescribessatisabilityqueriesformally 13

PAGE 14

andintroduceshowsuchqueriesareprocessedinthecontextoftheMCDBframework. Chapter4describeshowthequeryprocessingengineisimplemented.Anexperimental evaluationofthisworkispresentedinChapter5.TheconclusionsaredetailedinChapter 6.SQLcodeforthequeryplanspresentedinChapter5aregiveninAppendixA. 14

PAGE 15

CHAPTER2 PRELIMINARIES Thematerialcontainedinthischapterservesasthebackgroundfortherestofthe thesis.Abriefreviewoftherelatedworkisgivenwithanemphasisonuncertaindata management,probabilisticdatabases,dataprovenance,andlineage.Adiscussiononsome ofthemostwidespreaduncertaindatabasemanagementsystemimplementationsisalso presented. TherestofthechaptergivesanoverviewofthefundamentalsofMonteCarlo methodsandstatisticalhypothesistesting.Additionally,theMCDBdatabasesystemand itsdetailsaredescribed. 2.1RelatedWork Uncertaintyinrelationaldatabaseshasbeenanactivesubjectofstudysinceatleast twodecades.Severalprobabilisticframeworkswereinitiallyproposed,butmanyofthem wereverylimitedinapplication[7,14].Possibleworldssemantics,initiallyintendedasa frameworkinarticialintelligence[5]andintegratedintothedatabaseliteraturebyFuhr andRolleke[20],hasbecomeadominantparadigminuncertaindatamanagement.The intuitionbehindthepossibleworldsmodelistoconsideraseriesofindividualindependent instancesofadatabaseaspartofawholeprobabilitydistributionsuchthateachinstance hasaprobabilityassociatedwithitandtheallindividualprobabilitiesadduptoone, satisfyingtheKolmogorovaxioms. DalviandSuciuintroducedalgorithmsforecientqueryevaluationinaprobabilistic frameworkbasedonanextendedrelationalmodelERMwithaprobabilityvalue associatedwitheachtuple,followingthepossibleworldsmodel[18].Theirapproachto queryevaluation,usuallyreferredtoasextensionalsemantics",isdierentfromthat proposedbyFuhrandRolleke,calledintentionalsemantics".Whileintensionalsemantics consistofcarryingprobabilityeventinformationwitheachtuplefromwhichprobability valuesarecomputed,extensionalsemanticsrelysolelyonprobabilityvalues. 15

PAGE 16

ThealgorithmsintroducedbyDalviandSuciucomputetuple-levelprobabilities inpolynomialtimebutonlyforacertainfamilyofqueryplans,whichtheyrefertoas safeplans".However,unsafeplans"cannotbetreatedinpolynomialtimeandmust usegeneralalgorithms[19].Sincetheproblemofcomputingsaidprobabilitiesisproven tobein# P ,theKarp-LubyMonteCarloalgorithmisused[25].Thisalgorithmprovides approximationsofarbitraryprecisionforenumerationandreliabilityproblemse.g., estimatingthecardinalityofaunionofsetsandhasbeenappliedinavarietyofelds.In databases,ithasbeenusedtoestimatequeryreliability[21],lineageapproximation[37], andprobabilisticdatabaseconditioning[28]. WhiletheKarp-Lubyalgorithmcanbeusedtoobtainanapproximationwitha condenceintervalcontrolledbyamechanismsimilartoahypothesistest,itisnota suitablealternativeinthecontextofthiswork.ThisisduetothefactthattheMCDB frameworktreatsthestochasticmodelasablackbox"separatedfromthedatamodel andtheengine.TheKarp-Lubyalgorithmwouldneedaccesstocertaindetailsofthe modelinordertooperatecorrectly,violatingtheabstractionbetweenmodelanddatathat existsinMCDB.Ontheotherhand,theonlyinformationahypothesistestneedsisthe numberofsamplesthatpassthehypothesispredicate.Nevertheless,intentionalsemantics andERM-basedqueryevaluationhaveservedasthefoundationsforthedevelopmentof mostmodernprobabilisticdatabasesystems[1{3,15,19,22]. Additionalresearchhasbeendonetoapplytraditionaltechniquesfromrelational databasesystemsonprobabilisticdatasuchasdataindexing[40],queryingand maintenanceofmaterializedviews[36],aggregation[32],evaluationoftopk queries [38],andXMLdatamanagement[26]. Thisworkisalsorelatedtotheareaofdataprovenance[12],particularlylineage trackingandhowitisexpressedandmanaged.Thelineageofagivenobjectdatabase tuple,webpageisessentiallythesetofallobjectsthatcontributetotheexistenceofsaid object.Explicitlineageannotation,whichconsistsofappendingidenticationinformation 16

PAGE 17

toeveryobject,hasbeenimplementedonrelationaldatabases[9,23]andsemi-structured data[29].Anotherapproach,knownasqueryinversion",originallyproposedbyCuiand Widom[16],iswidelyusedinviewmaintenance[17]anduncertaindatamanagement [8].Inqueryinversion,lineageisnotexplicitlyannotatedbutisinferredfromthequery denitionandtheresultingattributevalues. 2.2MonteCarloStatisticalMethods Ingeneral,aMonteCarlomethodisanalgorithmthatreliesonastreamofrandom numberstoobtainanumericalresult.MonteCarlomethodsarewidelyusedinphysics andnancetosolvenumericalproblemswhosesolutionscannotbeobtainedanalytically [31]. InorderforaMonteCarlomethodtoproduceacorrectresult,auniformly distributedstreamofrandomnumbersisnecessary.However,producinggenuinelyrandom numberswithacomputerisnotpossiblesincetheresultofadeterministicalgorithm cannot,bydenition,berandom[42].Instead,computersrelyon pseudorandomnumber generators which,generally,suceforMonteCarlomethods. A uniform pseudorandomnumbergenerator,orPRNG,isanalgorithmthatstarts fromaninitialvalue u 0 referredtoasthe seed producesasequence u i = D i u 0 of valuesin[0 ; 1].Forall n ,thevalues u 1 ;:::;u n reproducethebehaviorofanindependent andidenticallydistributedi.i.d.sample V 1 ;:::;V n ofuniformvariableswhencompared byasetoftests[39].Somealgorithms,suchasthelinear-congruentialgenerator[27],the Tausworthelinear-recurrencemodulotwogenerator[41],andtheMersennetwister[30]are amongthemostpopularforMonteCarlomethods. WeillustratehowMonteCarlosimulationworkswithasimpleexample.Consider theguredepictedontheleftsideofFigure2-3.Supposethegureintheshadedarea isinscribedinasquareofsideone.Lackingaclosed-formexpressionthatdenessaid gure,wewillestimateitsareabygeneratingtwosequences u 1 ;:::;u n and v 1 ;:::;v n of 17

PAGE 18

Figure2-1.EstimatingtheareaofagurewithaMonteCarlomethod uniformlydistributedrandomnumbersandapplyingthefollowingindicatorfunctionto each u i ;v i pair: I x;y = 8 > < > : 1if x;y isinsidetheshadedarea 0otherwise Then,wecomputetheestimatedareaas ^ a = 1 n n X i =1 I u i ;v i Inourexample,theestimatedareaofthegureis^ a = 1 20 11=0 : 55.Intuitively,itcan beassertedthatthegreaterthevalueof n isthecloser^ a istotheactualarea. MonteCarlomethodsplayanimportantroleintheMCDBdatabasesystemsince VariableGeneratorVGfunctionsareresponsibleforproducingindividualsamplesfor user-denedprobabilitydistributionsthattaketheirparametersfromattributevalues foundindatabaserelations.Eachsampleisseenasapossibleworldinthenalquery result. SomeVGfunctionsimplementmodelsthatdenedistributionswhoseprobability densityfunctionsarenotinclosedformandthereforedensitiescannotbecomputed analytically.MonteCarlomethodshelptosolvethatproblem,buttheyalsoprovidethe generalitynecessarytosamplefromanyarbitrarydistributionforwhichaninversion procedure"canbeobtained. 18

PAGE 19

MCDBreliesheavilyontheGNUScienticLibraryGSL 1 toobtainsamplesfrom populardistributionssuchasGaussian,Poisson,andGamma.Inourimplementation, weplugin"animplementationoftheWellEquidistributedLong-periodLinearWELL generator[35]foruniformsampling,replacingthedefaultMersennetwistergeneratorthat comeswithGSL. 2.3StatisticalHypothesisTesting Ahypothesistestisastatisticalprocedurethatallowsforthevericationof theoryagainstobservation.Ahypothesisisregardedasastatementaboutoneor morepopulationparameters[43].Forexample,considerthehypothesisthathalfof thecustomersincertainareawilldeliveranaverageprotofatleast$1,000.Ifwedonot believethatclaim,analternativehypothesis"isformulatedbasedontheconverse:that theaverageprotislessthan$1,000.Supportforthealternativehypothesisisobtainedby showinglackofsupportfortheinitialhypothesis,whichwecallthenullhypothesis". Supposethatwerandomlychoose n =100customersandrecord X asthenumberof customersthatdeliveredanaverageprotofatleast$1,000.Weestablishthatinorder torejectthenullhypothesis,lessthan45customersmustnotdeliverthedesiredaverage prot. Astatisticalhypothesistesthasthefollowingelements: 1.Anullhypothesis, H 0 ,whichweseektodisprove.Inourexample, H 0 : p =0 : 5 2.Analternativehypothesis, H 1 ,whichweseektoprove.Inourexample, H 1 : p< 0 : 5 3.Ateststatistic.Inourexample,itis X ,thenumberofcustomersthatdeliveredat least$1,000inaverage. 4.Therejectionregion.Inourexample,itcanbedenotedbytheset RR = f x : x< 45 g 1 Seehttp://www.gnu.org/software/gsl/ 19

PAGE 20

Twotypesoferrorscanbemadewhentestingahypothesis.Thehypothesis H 1 can beacceptedwhen H 0 istrue.Thisiscalledafalsepositive"ortype-Ierror.Thesecond type,whichoccurswhen H 0 isacceptedwhen H 1 istrueiscalledafalsenegative"or type-IIerror.Theprobabilitiesofthesetypeoferrorshappeningaredenotedas and respectively. Inourexample,wecancompute and bytreatingthestatistic X asabinomial randomvariablewith n =15and p =0 : 5,having = P Y 45when p =0 : 5and = 45 X y =0 100 y : 5 y : 5 100 )]TJ/F22 7.9701 Tf 6.587 0 Td [(y =0 : 18 Conversely, = P Y> 45when p =0 : 5=1 )]TJ/F21 11.9552 Tf 11.955 0 Td [( =0 : 82. Therelativelyhighvaluesof and areduetotherejectionregionwedenedin ourexample.Thispromptsthequestionofwhatisthebestchoiceforthisparameter. Animportantcriterionforevaluatingtheperformanceofatestiscalledthepower". Essentially,thepoweristheprobabilitythatatestleadstoarejectionofthenull hypothesis.Weseekthemostpowerful"test[13]. Supposewewishtotestthehypothesis H 0 : = 0 withalternativehypothesis H 1 : = 1 basedonarandomsample X 1 ;X 2 ;:::;X n .Let L x 1 ;x 2 ;:::;x n j denotethe likelihoodofthesamplewhenthevalueoftheparameteris .Thetestwiththefollowing rejectionregion L x 1 ;x 2 ;:::;x n j 0 L x 1 ;x 2 ;:::;x n j 1
PAGE 21

Figure2-2.Asequentialhypothesistestinaction TheNeymantestisoptimalinthesensethatforagiven and ,noothertestcan choosebetween H 0 and H 1 withfewersampleswhereeithercanbetrue.Thisscenariois uncommoninpracticaldatabasesettings.InaprobabilisticsettingwhereMonteCarlo samplesareused,aNeymantestcanrequireahighnumberofsamples,leadingtoa potentialperformancechallenge. AnotherissueregardingtheNeymantestinadatabaseisthatthenumberofsamples requiredisxedinadvanceregardlessofthepropertiesoftheobjectsthatarebeing tested.Thus,ifweareapplyingahypothesistestoneachtupleinthedatabase,the numberofsamplesrequiredbytheNeymantestwillsucetoacceptorrejectthe hypothesisoneachandeverytuple.However,theNeymantestcouldfailtotake advantageoffewsamplesacceptingorrejectingthehypothesisforalargenumberof tuples. Afamilyoflikelihood-ratiotests,knownassequentialtests,aimtoalleviatethis issue[44].Sequentialtestsdonotrequirethenumberofsamplestobexedinadvance. Instead,everynewsampleupdatesthelikelihood-ratiostatisticsanddeterminesifthe hypothesisisacceptedorrejected.Ifadecisioncannotbemade,moresamplesaretaken, sincesequentialtestsareguaranteedtoconvergetoaresult. 21

PAGE 22

Figure2-2showshowasequentialhypothesistestbehaves.Thehorizontalaxis representsthenumberofsamplestaken,andtheverticalaxisrepresentsthevalueofthe likelihood-ratiostatisticfor = = =0 : 05.Thetwosolidlinesrepresenttheupperand lowerboundsrequiredtoacceptorrejectthehypothesis,respectively.Thespacebetween themisreferredtoastheregionofindierence".Thedashedline,whichshowshowthe valueofthelikelihood-ratiostatisticchangeswhenmoresamplesaretaken,bounceswithin theregionofindierenceuntilitcrossestheupperboundandthehypothesisisaccepted. The sequentialprobabilityratiotest ,orSPRT,isthemostrepresentativetestamong thisfamily.TheSPRTisasequentialversionoftheNeymantest.Inthiswork,weuse aspecialtypeofSPRTtestknownastheEnd-Biasedtest[4].Thistestisspecially craftedtowardsaprobabilisticdatabasesettingwherethevalueof isxed,the and parametersaresuppliedbytheuser,andthenumberofsamplesiscontrolledbythetest. 2.4TheMonteCarloDatabaseSystemMCDB ThissectiongivesabriefintroductiontotheMCDBsystem.Forthesakeofbrevity, somedetailsonthedesignandimplementationhavebeenomittedandcanbefoundin [24]. MCDBisanuncertaindatamanagementsystemthat,likemanyotherprobabilistic databaseproposals,makesuseofpossibleworldssemantics.However,MCDBdoesnot imposeanychangestotherelationalmodelanditinsteadseekstomaintainaclear separationbetweentheuncertaintymodelandthedata.Thisisdonebyprovidinga generalframeworkallowingtheusertodenethestochasticmodelexternallyandto performtraditionalqueryprocessingusingsaidmodel. TheoutputofanMCDBqueryisafrequencydistributionwhereeachindividual valuerepresentstheoutputofthequeryinagiveninstanceofthedatabase.Theinitial distributionisproducedbyapplyingthemodeltotheunderlyingrelationaldata,then MCDBexecutesrelationaloperatorsfromthequeryoneachdatabaseinstance. 22

PAGE 23

Figure2-3.QueryprocessingintheMCDBsystem. Thestochasticmodelsthatcreatethedistributionsinaqueryarepluggedin"by theuserintheformofuser-denedfunctionsknownasVariableGeneratorfunctions,or VGfunctionsforshort.Thesefunctionstakethemodelparametersfromthebaserelations andproduceindividualrandomsamplesforthedistributionsthataredenedbythe model.TheresultofapplyingtheVGfunctionisarelationwithatleastonestochastic attribute,knownasarandomrelation." AttributesinMCDBcanbeeither deterministic or stochastic .Anattributeis deterministicifitsattributevalueisthesameforallthepossibleinstancesofthe database.Otherwise,itisstochasticifatleastonepossibleinstancehasadierent value.Bothkindsofattributescoexistinwhatareknownastuplebundles." VGfunctionsareencapsulatedinaspecialrelationaloperatorcalled Instantiate whichisresponsibleforinitializingtherandomnumbergeneratorusedtoproducethe samples.Theyprovidethemodelparameterstothefunctionsandcollecttheindividual resultsintheformoftuplebundles.Traditionalrelationaloperatorsaremodiedinorder toprocesstuplebundles. Figure2-3showstheusualowofdataintheMCDBsystem,asrandomrelationsare theproductofVGfunctions. Therestofthissectiondescribeshowtuplebundlesarerepresented,howVG functionswork,andhowaretraditionalrelationaloperatorsareimplementedinMCDB. 2.4.1TupleBundles Ingeneral,atuplebundle t isasequenceofattributes t: att 1 ;t: att 2 ;:::;t: att k ,each ofthembeinganarrayof N values,suchthat t: myAtt [ j ]representstheattributevalue 23

PAGE 24

of t: myAtt inthe j th possibleworld.Asmentionedbefore,adeterministicattributeis anattributeinwhich t: myAtt [ i ]= t: myAtt [ j ]forall i 6 = j .Forsimplicity,deterministic attributesarerepresentedasasinglevaluewhilerandomattributescontainthearrayof N values. CustomerID ZipCode Income 101 32607 3,048.5 2944.9 3114.5 Figure2-4.AtuplebundleinMCDB Forexample,considerourquerywheretherandomrelation CUSTOMER DATA was createdusingaVGfunctionthatimplementsaNormaldistributionparametrizedwith attributevaluesvaluesfromthe CUSTOMER table.Figure2-4showsanexamplebundle where N =3and Income isarandomattribute. Animportantpartoftuplebundlesthatisnotshownintheprevioustableisthe isPresent attribute.Foreachrandomattribute t: myAtt ,thereisan isPresent arrayof N true/false valueswhere t:isPresent [ i ]determinesifthevaluethatappearsin t: myAtt [ i ] actuallyexistsinthe i th possibleworld.Forexample,supposethatweapplytheselection predicate Income > 3000 tothetuplebundleshownbefore.Theresulting isPresent attribute containsthetruthvalues T,F,T sincethevalueof Income inthesecondpossibleworld doesnotpassthepredicate. 2.4.2VariableGeneratorVGFunctionsandthe Instantiate Operator VGFunctionsareresponsibleforproducingMonteCarlosamplesfromthemodel speciedbytheuser.Inorderfortheusertobeabletosuccessfullyplugin"her ownmodel,apredenedinterfacemustbefollowed.AVGfunctionisaClibrarythat implementsthefollowingoperations: Initialize seed .Uses seed toinitializetherandomnumbergeneratorthat producestheindividualsamples.Also,allocatesmemoryforalltheinternaldata structuresusedbythefunction. 24

PAGE 25

TakeParams params .Takestheargumentsneededbythemodeltoproduce samplesforagivengroupe.g.themeanandstandarddeviationoftheincomeof thecustomerinacityintheformofthe params array. OutputVals .ReturnsanindividualMonteCarlosamplebasedontheparameters suppliedthrough TakeParams .Iftherearenomoresamplestoproduceforthe currentgroup,returns NULL PrepareNextTrial .PreparestheVGfunctionforthenextgroupsothenextcall to TakeParams doesnotreturn NULL .Inthecaseofsingle-tuplegroups,asinglecall to TakeParams isalwaysfollowedbyanothercallto PrepareNextTrial Finalize .Deallocatesthememoryobtainedfortheinternaldatastructures during Initialize The Instantiate operatorisresponsibleforattachingRNGseedstotheinputtuples, providingparameters,collectingtheMonteCarlosamples,andproducingoutputrandom relationscomposedoftuplebundles. Figure2-5.The Instantiate operation Instantiate requiresthefollowinginputseeFigure2-5foranexampleofthedata ow: 25

PAGE 26

Anouterinputpipethatcontainsthetuplesthatdeneeachgroupthatwillbe producedbytheVGfunction. Acollectionof k innerinputpipesthatcontainparametersforthegroupsofthe outerinputpipesthattheVGfunctionwilltake. Acollectionofbooleanpredicates B 1 ;:::;B k thatwillbeusedtojoineachinnerpipe withitscorrespondingforkedversionoftheouterpipe. Acollectionofattributes InAtts 1 ;:::;InAtts k fromeachinnerpipeusedto parametrizetheVGfunction. Acollectionofattributes OutAtts fromtheouterpipeusedtoparametrizetheVG function. AcollectionofoutputattributesfromtheVGfunctionthatwillbeprojectedinthe naloutputtuplebundles. 2.4.3The Split Operator AnotherimportantoperatorinMCDBis Split .Givenatuplebundle t withrandom attribute att Split willcreate,foreachoftheatmost N distinctvaluesof att ,anew tuplebundle t 0 inwhich att becomesadeterministicattributeand isPresent issetto true forallthoseiterationsinwhich att hadtheestablishedvalue. Figure2-6.The Split operationonatuplebundle Figure2-6showsanexampleofapplying Split onatuplebundlewhere N =4. Theresultingsetcontainsthreetuplebundlesasoneofthevaluesisduplicatedandthe valuesof isPresent aresetaccordingly. 26

PAGE 27

2.4.4RelationalOperators TraditionalrelationaloperatorsaremodiedinMCDBsotheycanworkontuple bundleswithrandomattributes.Intuitively,theresultingtuplebundlesareequivalentto applyingtheoperatorsonseparate,independentinstancesofthedatabase.Wedescribe themostbasicoperations:selection,aggregation,andrelationaljoin. Selection. Givenatuplebundle t ,abooleanpredicate B ,andarandomattribute att ,theSelectionoperatorsets t:isPresent [ i ]= t:isPresent [ i ] ^ B t: att [ i ].If t:isPresent isnotmaterializedin t ,thenitisassumedtobe TRUE foralltheiterations. Aggregation. Consideranindicatorfunction I x thatreturns1if x is true and0 otherwise.Assumingwewanttocomputea SUM aggregateoverattribute att ,then,given inputtuplebundles t 1 ;:::;t k ,theAggregateoperatorcreatesasingletuplebundle t 0 where t: sum [ i ]= P k j =1 I t j :isPresent [ i ] t j :att [ i ]foreach1 i N CartesianProductandJoin. Giventwomatchingtuplebundles r and s inthe cartesianproduct,theresultingtuplebundle t containsalltheattributesfromboth relations.However, t:isPresent issetto r:isPresent ^ s:isPresent .Arelationaljoin withpredicate B isjusttheresultofapplyingacartesianproductfollowedbyarelational selectionusingjoinpredicate B 27

PAGE 28

CHAPTER3 SATISFIABILITYQUERIES Thischapterintroducessatisabilityqueries.Asthenamesuggests,satisability queriessearchfordatabasetuplesorgroupsthatsatisfyauser-suppliedboolean predicatewithacertainprobability.Wedenethem,explaintheirsyntaxandsemantics, givebriefdescriptionsoftheirstatisticalfoundations,andhowtheyareprocessedinthe contextoftheMCDBframework. 3.1Denition{SyntaxandSemantics Wedescribethesyntaxandsemanticsofsatisabilityqueriesthroughaseriesof examples.Considertherelationaldatabaseschemaweproposedintheintroduction. Imaginethatwewanttodeterminewhichcustomerswillprovideareturnhigherthan $500withaprobabilityof90%.Thesyntaxforsuchqueryis: USINGSELECTIncome FROMCUSTOMER DATA TESTWHETHERANYIncome>500 WITHPROBABILITY>0.90 PARTITIONBYCustomerID MCDBrstevaluatestheSQLqueryinthe USING clause,generatingallthedierent instanceswith Income asarandomattribute.ThenMCDBcreatesapartitionforeach uniquevalueof CustomerID andtestsifthereisahighprobability > 0 : 9that ANY ofthe tuplesineachoftheresultingpartitionswillhaveanincomehigherthan$500.Observe that,inthisquery,thequantierofchoice{ ANY or ALL {isirrelevantsince CustomerID is akeyattributeinourdatabaseschema.Thequantierclausecanbenegated,having NOT ANY asthelogicalequivalentof ALL withthebooleanpredicatenegated,and NOTALL as thelogicalequivalentof ANY withthebooleanpredicatenegated.Aftertesting,theresult 28

PAGE 29

ofthetestforeachcustomerwhether TRUE or FALSE isgiventotheuser.Ingeneral,the booleanpredicatecanbearbitrary. The PARTITIONBY clauseisaveryimportantpartofthequerybecauseitdenes thelevelofgranularityforthehypothesistest.ThisisnecessaryinMCDBsincethere isnonaturalnotionofeachobject"thatisconsistentamongMonteCarloiterations. Moreover,thedeningcharacteristicsofthatobjectaresuppliedbytheuser.Inour examplequery,the PARTITIONBY clausecanbechangedtouse ZipCode insteadof CustomerID ,therebychangingthesemanticsofthequery{wenowwanttodetermine whichzipcodeareashaveatleastonecustomerthatprovidesareturnhigherthan$500 withaprobabilityof90%.Ifa PARTITIONBY clauseisnotsupplied,thequerywillreturn asingletuplewitha TRUE valueifatleastonecustomerintheentiredatabaseprovided therequiredincome. 3.2StatisticalFoundations Wesolvetheproblemofactuallydeterminingwhetheranyofthepartitions inasatisabilityqueryshouldbeacceptedwithastatisticalhypothesistest.Ina satisabilityquery,eachpartitionthatisconsideredsuchasthesetof ZipCode shas anunknownprobability ofsatisfyingthe TESTWHETHER predicate.Giventheclause WITHPROBABILITY> p ,thegoalistoexamineasetof n possibleworldsinorderto accuratelychoosebetweenthetwohypothesesforeachpartition: H 0 : = p )]TJ/F21 11.9552 Tf 11.955 0 Td [( H 1 : = p + If H 0 ischosen,thenthepartitionisrejectedandlabeledwitha FALSE result.Inthis case,thepartitionhasbeenfoundtosatisfythepredicatewithaprobabilitythatistoo smalltomeetthecuto p .If H 1 ischosen,thenthepartitionisacceptedandassociated witha TRUE result. 29

PAGE 30

Theinclusionofamagic inthehypothesisstatementisanecessaryevil. mustbe non-zero,and,as approacheszero,thedierencebetween H 0 and H 1 vanishesandit becomesmoreandmoreexpensivetodierentiatebetweenthetwo. Sincethegenerativestochasticprocessforthedataisembodiedbyasetofblack-box VGfunctionsthatcannotbeexaminedanalytically,then,foragivenpartitionallthatis availableisan IsPresent attribute,representedasabitstringthatdeterminesifagiven MonteCarloiteration,correspondstoaninstanceofthedatabasewhichpassedthequery predicate.Thehypothesistestisperformedbyexaminingthebitstringandmakingan educatedguessasto H 0 or H 1 .Sincethehypothesistestcanonlyexaminethebitstring andnottheVGfunctions,itisimpossibletochoosebetween H 0 or H 1 withcomplete accuracy.Therefore,falsepositivesandfalsenegativesbecomepossible. Giventheparameters ,and n where and arethefalsepositiveandfalse negatives,respectively,and n representsthenumberofMonteCarlosamples,itispossible tocontroltwooftheseparametersinahypothesistest.However,thethirdparameteris outofdirectcontrolandmustbedeterminedintermsoftheothertwowithrespecttothe hypothesistestandstochasticprocess.InMCDB,itispossiblefortheusertocontrol and withthefollowingcommand-lineentries: SETFALSEPOSITIVE=0.001; SETFALSENEGATIVE=0.005; Thenumberofiterations n iscontrolledautomaticallybythesystemsothat and arerespected.Sincethereisadirectrelationshipbetween n andtherunningtimeofthe query,thefocusofourqueryprocessingschemeisonchoosingbetween H 0 and H 1 while usingasfewiterationsaspossible. 30

PAGE 31

3.3OverviewofQueryProcessing Asausefulpointofreference,webeginbydiscussinghowthetraditionalMCDB engine{withouttheabilitytohandlesatisabilityqueriesdirectly{couldbeusedalong withanexternalhypothesistesttoanswerasatisabilityquery. Figure3-1.SatisabilityqueryinthetraditionalMCDBa,andthersttestblockused fordirectprocessingofasatisabilityqueryb Considerthequerywedescribedintheintroduction,whichusesthestochastictable CUSTOMER DATA todeterminethosecitieswhereallthecitizensprovidedareturnhigher than$2,000.InthetraditionalMCDBframework,theuserwouldrstguessatan n value thatislargeenoughtodeterminewhethereachcityappearswithhighenoughprobability, andthenissuethefollowingquery: SELECTDISTINCTcc.CityID FROMCUSTOMER DATAcd,CITY CODEScc WHEREcd.Income>2000 ANDcd.ZipCode=cc.ZipCode 31

PAGE 32

ThisqueryisprocessedasdepictedinFigure3-1a.The CUSTOMER DATA table iscreatedbyapplyingthe Instantiate operatoronthe CUSTOMER and SALES DATA tablesusingthe Normal VGfunction.Theresultingsetoftuplebundlesencodes Income informationinall n MonteCarloiterationsforeachcustomer.Then,itisjoinedwith CITY CODES andthebooleanpredicateisapplied,followedbyaduplicateremoval operation.Theresultingtuplesarepipedintothe Inference operation,whichreturns thefractionofMonteCarloiterationswherethebooleanpredicatereturns TRUE Todetermineifeachincomeispresentwithhighenoughprobability,theuserapplies anappropriatestatisticalhypothesistesttotheresultofthequery,whichwouldeither acceptorrejecteverycity{providedthat n islargeenoughtoguaranteetheuser-specied falsepositiveandfalsenegativerates. Whilethisapproachcouldwork,thereareseveralnoteworthyproblems.First,the usermustchooseandcorrectlyapplyahypothesistest,whichimpliesthatcodehasto bewrittentointerprettheoutputofthequery.Second,theusermustguessavalueof n thatwillrespect and .Third,eachandeverypartitionwillusethesame n ,althougha sizeableamountmightgetacceptedorrejectedwithasmallernumberofsamples. Considerthesamequerywrittenwiththesatisabilityquerysyntax: USINGSELECTcc.CityID,cd.Income FROMCITY CODEScc,CUSTOMER DATAcd WHEREcc.ZipCode=cd.ZipCode TESTWHETHERANYIncome>2000 WITHPROBABILITY>0.95 PARTITIONBYcc.CityID 32

PAGE 33

Inthiscase,MCDBdoesnotuseapredened n .Insteadthequeryisprocessedas aseriesofso-calledtestblocks".Ineachtestblock,asmallsystem-determinednumber ofMonteCarlosamplesistaken.Thersttestblockofourexamplequerycanbeseen inFigure3-1b.AsinthetraditionalMCDB,the Instantiate operatorisappliedon CITY CODES and CUSTOMER DATA ,usingthe Normal VGfunction.However,therestofthe queryisdierentinthatitusestwodierentoperations: PT-All and HypTest ThePartition-and-TestPToperation .The PT operationisresponsiblefor partitioningtheinputtuplesbasedonthesetofattributesgiveninthe PARTITIONBY clauseandapplyingthebooleanpredicatetothosetuples.The ALL and ANY modiers deneifthepredicateisappliedtothetuplesinagroupexistentiallywhere atleast one tupleinthegroupmustsatisfythepredicateoruniversallywhere all thetuplesinthe groupmustsatisfyit. Foreachresultingpartition,the PT operationcreatesasingletuplewiththeunique attributevaluesfromthe PARTITIONBY clausefollowedbyan IsPresent arraythat indicatesifthepredicateissatisedinthe i th MonteCarloiteration,for1 i n TheHypTestoperation .Theresultof PT issuppliedtothe HypTest operation, whichisparametrizedwiththeuser-suppliedtestprobabilitythreshold p HypTest uses the IsPresent arrayfromeachinputbundlealongwithanappropriatestatisticaltest todetermineoneofthreepossibleresultsforthatbundle: accepted rejected or keepgoing .Manypositivevaluesin IsPresent usuallyindicatewithhighprobabilitythatthe testpredicateissatised,andmanynegativevaluesindicateasmallprobability.Ifthe hypothesistestresultiseither accept or reject ,thenthepartitionisappropriatelylabeled andsenttotheuser.A keep-going resultindicatesthatthehypothesistestdoesnot haveenoughMonteCarlosamplestodetermineifthepartitionshouldbeacceptedor rejected.Iftheresultofthetestis keep-going ,thenthe HypTest operationwillsavestate informationforthatgivenpartition,namelytheattributevaluesthatdenethatpartition andthenumberofpositiveandnegativesfromthe IsPresent array. 33

PAGE 34

Itisimportanttonotethatoneormorepartitionsfallintothe keep-going category, andmoreMonteCarloiterationsarerequiredinordertoacceptorrejectthepartition. Whenthishappens,anothertextblockisneeded.Theentireprocessofinvoking Instantiate toproduceadditionalMonteCarlosamples,pushingtheresultinto PT and HypTest ,isperformedagaininthenewtestblock. Figure3-2.Runningasatisabilityqueryusingaseriesoftestblocks. Itispossibletonaivelyre-runthewholequeryplanfromstarttonishinthenew testblocksoifroughlythesameamountofMonteCarloiterationsarerequiredoneach testblock,theruntimeforeachtestblockwouldbeequivalent.Aproblemwiththisis thatthenumberofpartitionsinthe keep-going categoryisalwayssupposedtodecrease whenanewtestblockisrun,so,producingadditionalMonteCarlosamplesforevery possiblepartitionrepresentsawasteofsystemresources.Thisisespeciallytruegiventhat onemightexpectthenumberofactivepartitionstodecreaseexponentiallywithrespect 34

PAGE 35

tothenumberoftestblocksthathavebeenrun,assumingthatweareabletoobtainan accept or reject resultforarelativelyconstantfractionoftheactivepartitionsineachtest block. MCDBusesamoretargetedstrategyasdepictedinFigure3-2.Foreach keep-going partitionthatisproducedbythe HypTest operation,identiersforallthetuplesthat couldhavecontributedtothatpartitionareattached.Forexample,ifthe HypTest operatordoesnothaveenoughdatatoobtainan accept or reject resultforzipcode 32607 inourrunningexample,itwillsendonthe keep-going outputstreamthetupleidentiers foralltuplesin CUSTOMER DATA and CITY CODES thatcouldinuencethesatisability resultforacustomerintheareadenedbyzipcode 32607 .AspecialMCDBoperator called Route residesattheverybottomofthequeryplanforthenextblockandis responsibleforsplittingthestreamofidentiersaccordingtothebaserelationthateach tupleidentierisassociatedwith.The Route operationthenroutestheidentierstoajoin withtheappropriatebaserelation. Thesejoinslterthebaserelationssothatonlythosetuplesfromabaserelation thatcontributetosome keep-going partitionwillenterthequeryplan.Thismeansthat astheamountofpartitionsthatarelabeledas keep-going decreases,theamountof workassociatedwithcomputinganentiretestblockdecreases.Onemightexpectthat formostpartitions,an accept or reject resultisobviousandafewdozenMonteCarlo sampleswillsuce.Thesepartitionsareremovedfromconsiderationafterthersttest block.However,theremaybeafewpartitionswhosetrue valueisverycloseto p .For suchpartitions,thousandsofsamplesmayberequired.Usingthestrategyofremoving computationassociatedwithcompletedpartitionsfromsubsequenttestblocks,weavoid computingmillionsofMonteCarlosamplesforthemajorityofeasy"partitions. 35

PAGE 36

CHAPTER4 IMPLEMENTATIONDETAILS Inthischapter,detailsabouttheprocessingofsatisabilityquerieswithintheMCDB frameworkareprovided.WedescribehowMCDBattemptstokeeptuplesthatdonot contributetopartitionsinthe keep-going categoryfrombeingprocessedintestblockspast therstoneinordertomaximizequeryprocessingspeed. 4.1TupleRe-Scanning EverytupleortuplebundlethatisprocessedbyMCDBduringevaluationofa satisabilityqueryincludesanexhaustivelistofalloftherelationsandtuplesthat contributedtoitsformation.Thepurposeofthislististoguaranteethatifapartition islabeledas keep-going ,allofthetuplesthatcouldhavecontributedtothepartition enterthequeryplanandareprocessedinsubsequenttestblocks.Falsepositivesare acceptable:processingextratuplesthatdonotcontributetoa keep-going groupwillnot causeproblemssincethe HypTest operationatthetopofthenexttestblockwillignore anypartitionthathasnotbeenlabeledas keep-going .However,falsenegativescanbe problematicbecausefailingtoprocessatuplethatcontributesinsomewaytoa keep-going partitionresultsinadierentconclusionthanwouldotherwisehavebeenobserved. Figure4-1.Databasetableswithlineageinformationattached 36

PAGE 37

WhenprocessingsatisabilityqueriesinMCDB,informationaboutwhichbase-relation tuplecontributedtoatupleortuplebundlethatismovingthroughthequeryplanis representedasasetof relID ; recID pairsthatisappendedasanadditionalattribute toeachtuplethatisreadfromdisk.Inthedatabaseliterature,suchalistisusually referredtoasatuple's lineage [16].Whenatupleisrstscannedfromdisk,theonly lineageinformationitcontainsisasinglepairrepresentingitsoriginalsourcerelation andauniquerecordidentier,asshowninFiguress4-1aand4-1b.Asthequery progresses,additionallineageinformationisaccumulateddependingonitsdenining relationaloperators.Forexample,arelationaljoin T U./V whichproducesthetuple t u v willincludein t allofthe relID ; tupID pairsthatarepresentineither u: or v: ignoringduplicates,ascanbeseeninFigure4-1c.An Instantiate operation, whichproducestuplebundle t ,appendsto t allofthepairspresentinanytuplethat contributedto t ,whichcanbeanytuplethattookpartinparametrizingtheVGfunction thatproduced t oranytuplethatcamefromthe Instantiate 'souterinputstream.A grouping/aggregateoperationwhichproducesthetupleortuplebundle t appendsto t all ofthepairspresentinanytupleorbundlethatfellinto t 'sgroup. Forexample,considerthatabillionrecordtableisaggregatedviaa SUM aggregate functiondowntoasinglenumber.Thiscanmakethelineageattribute growarbitrarily large.MCDBimposesamaximumnumberofpairsfromasinglebaserelationthatmay bepresentinanytuple'slineage.Ifthenumberofpairsexceedsthisthreshold,allofthe pairsarereduceddowntoasinglepairwithaspecial 8 valuefor recID .Useof 8 may allowsuperuousrecordstoenterthequeryprocessingsystemduringthenexttestblock. However,recallthatfalsepositivesareonlyproblematicbecausetheyincreasethenumber oftuplesthatmustbeprocessedbysubsequenttestblocks. Asatestblockcompletes,the HypTest operationpipesallofthelineageinformation fromallofthetuplesthatcontributedtoany keep-going partitionintothenexttest block's Route operation.The Route operationseparatesthelineagepairsbasedoneach 37

PAGE 38

pair's relID value,andwritestodiskaseriesoflineageles",oneforeachbaserelation. Anequi-joinbetweeneachlineageleanditscorrespondingbaserelationprevents thetuplesthatdonotcontributetoa keep-going partitionfrombeingre-processedin subsequenttestblocks. 4.2Explicitvs.InferredLineage Inthedatabaseliterature,therearetwotraditionalmethodsthathavebeensuggested formanaginglineage:explicitlyasisdonewhenprocessingsatisabilityqueriesin MCDBorbyusingalgorithmsthatinferthelineageofatuple,tracingitbackward throughaqueryplanusingaqueryinversion"algorithmthat,giventheoriginalquery Q withresultset D ,producesanewquery D 0 thatreturnsthelineageofthetuplesin D asitsresultset.Thesemethodsaremorepervasiveintheliterature,buttherearetwo reasonswhyexplicitannotationispreferred. First,asdiscussedbyCuiandWidom[17],explicitlineageisbepreferredwhen lineagetracinghappensallofthetimeasinMCDBsatisabilityqueries,ratherthan periodically.Second,the Instantiate operationisubiquitiousinMCDB,whichcan beseenasablackbox"operationsinceitisimpossibletotracethepreciselineage ofanytuple t itproduces.ThishappensbecausethelinkbetweenaVGfunction's parametrizationanditsoutputisarbitraryanduser-dened.Thus, Instantiate forces theuseofexplicitlineageannotation. 4.3StochasticConsiderations Aswementionedinourdiscussionontuplere-scanning,whenitcomestoselecting thetuplesfrombaserelationsthatwillbeusedbysubsequenttestblocks,falsepositives areacceptablebutfalsenegativesarenot.Unfortunately,therearesomecaseswhena givenqueryplancandiscardatuplethatmightbenecessaryinsubsequenttestblocks, leadingtoincorrectresults. ConsiderthesettingfromFigure4-1butwithaslightmodication:theattribute CUSTOMER.ZipCode isarandomattributeproducedbyan Instantiate operation.For 38

PAGE 39

simplicity,assumethereisonlyoneMonteCarlosampleforthisattribute.Supposethat inthersttestblockthefollowingtableisproduced: CUSTOMER CityID ZipCode 101 32607 102 32208 103 77251 Aftercomputing CUSTOMER ./ CITY CODES ,theresultingrelationcontainscustomers 101 and 103 ,sincethe ZipCode forcustomer 102 doesnotmatchanycity.Suppose the HypTest operationgroupsbasedon CityID anddecidestoclassifybothofthemas keep-going .Theninthenexttestblockweobtainthefollowingtable: CUSTOMER CityID ZipCode 101 32607 102 32611 103 77251 Logically,customer 102 shouldbeconsideredforthehypothesistestsinceitwould matchcity 1001 whencomputing CUSTOMER ./ CITY CODES .However,thattuplewas discardedduringtheprevioustestblocksinceitneverreached HypTest Thecoreproblemhereisthatwehaveanoperation{arelationaljoin{thatdecided whetheratupleparticipatesintheoutputsetbylookingatthevalueofthestochastic attribute.InaparticularMonteCarloiteration,thetupleandallofitslineagecanbe lteredoutbythejoin.However,inthenextiteration,thetuplemaybeacceptedif thevalueofthestochasticattributechanges.Infact,anyoperationthatlooksatthe valueofastochasticattributeanddecides,basedonthevalueofthatattribute,whether 39

PAGE 40

thetuplebundlethattheattributeisattachedtoparticipatesintheresultset,can exhibitthisproblem.Thisincludesrelationalselection,join,semi-joinandantijoin.It alsoincludesgroupingaggregatesbecausea GROUPBY querythatreturns m groupsis essentiallyrunning m dierentqueries,eachwithitsownrelationalselectionpredicate thatacceptsoneparticularvalueforthegroupingattribute.Similarly,theMCDB-specic PT operationscanhavethisproblemifthesetofpartitioningattributesincludesa stochasticattribute. Thesolutionthatweproposetothisproblemisquitesimple.Whenexecutinga satisabilityquery,beforerunninganyoperationthatperformsanylteringbasedupon thevalueofastochasticattribute,MCDBrunsaspecial ExtractLin operation.This operationsimplyremovesallthelineagefromthetuplesortuplebundlesthatareinput intotheproblematicoperation,andcopiesthelineageintoadisk-baseddatastructure calledtheFreezer".TheFreezerisso-calledbecauseitfreezesthelineageinformationfor anytuplethatisinputintothe ExtractLin operation.Then,inthenexttextblock,both thecontentsofthefreezerandtheoutputofthe keep-going pipefromtheprevioustest blockarepipedintothe Route operation.Thisensuresthatwedonotincorrectlydrop customer 102 from CUSTOMER sinceMCDBaddsthistuple'slineageintotheFreezerbefore thejoinrunsforthersttime. 40

PAGE 41

CHAPTER5 EXPERIMENTALRESULTS Sofarwehavedescribedhowsatisabilityqueriesaredenedandhowtheyare processedbyMCDB.Thissectionisdevotedtoempiricalresults.First,wedescribeour methodology{thesystemssetup,thequerieswerun,andthedatawecollectfromthem. Thenwediscusstheresultsforeachquery. 5.1ExperimentalSetupandMethodology Oursingle-sure,proof-of-conceptimplementationofMCDBiswritteninC++ andconsistsofapproximately40,000linesofC++sourcecodewithabout3,500lines correspondingtoprocessingsatisabilityqueries.Theimplementationismulti-threaded andsupportspipelinedparallelism.Itdoesnotsupportwithin-operatorparallelism. MCDBcurrentlyimplementsalloftheideasdescribedinthisthesis,exceptthe compilationandoptimizationofSQLqueries{currently,thephysicalplanfora querymustbeprogrammedusingaspecialMCDB-specicprogramminglanguage wheretheuserspeciesthephysicaloperationsusedtoimplementthequeryandthe interconnectionsbetweenthem. Weranallvequeriesonascale10versionoftheTPC-Hbenchmarkonaserver machinewith8CPUs.Weset to10 )]TJ/F20 7.9701 Tf 6.587 0 Td [(6 to0.01,and toonedividedbythenumber ofpartitions.Duetomemoryconstraints,thenumberofMonteCarloiterationspertest blockislimitedto60,000foragiventuplebundle.Since,insomecases,thelastremaining partitionsfromaquerymightrequirebillionsofMonteCarloiterationstobeclassied asacceptedorrejected,amaximumof200,000iterationsisenforcedandanarbitrary decisionistakenifthatnumberisexceededbyapartition. 5.1.1QueryQ1 Inthisquery,weuseastochasticmodelthatencodesaGammadistributionto estimateshipmentdurations.Themodelisparameterizedwithaggregateinformation frompreviousorders.Theresultingrandomrelation SHIP DURATIONSsd custkey, 41

PAGE 42

Figure5-1.PhysicalqueryplanforQueryQ1 sd when,sd til ship,sd til arr contains,foreachcustomer,theinitiationdateand theestimateddaysuntilshipmentandarrival.Giventhisrelation,weaskthefollowing questionthroughasatisabilityquery:Whichcustomerswhoorderedtodayhaveatleast a20%chanceofreceivingpartoftheirordermorethan20weeksfromnow?" Figure5-1showshow SHIP DURATIONS iscreatedfromjoining ORDERS and LINEITEM andusingthemasbothinnerandouterinputs.However,theinnerinputcontainsonly aggregateinformationi.e.meanandvariancefortheshippingandarrivaltimes. 5.1.2QueryQ2 ThisqueryusesastochasticmodelthatencodesarandomwalkbasedonaNormal distributionandaimstoaddressabasicproblemwiththeTPC-Hdata:nopricehistory forthevarioussuppliersisrecorded,soitisimpossibletoknowwhatthesuppliercostof anitemwasinthepast.Therandomwalkisperformedbackwardtoguess"thesupplier priceateachmonth. 42

PAGE 43

Figure5-2.PhysicalqueryplanforQueryQ2 Theresultingrandomrelation PRICE HISTph suppkey,ph partkey,ph month, ph year,ph price willcontain,foreachsupplierandpart,atotalof12tuples,each encodingtheestimatedpriceforagivenmonthandyear.Giventhisrelation,weasked thefollowingquestionthroughasatisabilityquery:Forwhich1995ordersistherea 90%chancethatwecouldhavesavedatleast20%ifwehadusedtheleastexpensive supplier?". Figure5-2showshow PRICE HIST iscreatedbyestimatingthelowestpricefromthe PARTSUPP relation.Thisresultisusedtoobtaintheinformationforeachorder. 5.1.3QueryQ3 Inthisquery,weaskthefollowingsatisabilitystatement:Withaprobability of20%,whichsuppliersmayrunoutofsomeproductwithinthenext30days?".The stochasticmodelusedhereencodesaPoissondistributiontoapproximateabootstrapping schemethatestimatesthenumberofunitsforeachpartthatasuppliermayhave. 43

PAGE 44

Figure5-3.PhysicalqueryplanforQueryQ3 Therandomrelation N LINEITEMl suppkey,l partkey,new quant containsthe estimatednumberofavailableunitsforeachpartandsupplier.Figure5-3showshow N LINEITEM iscreatedbyrunningusinginstantiateonboth LINEITEM and PARTSUPP to estimatethenewquantity. 5.1.4QueryQ4 Thepurposeofthisqueryistotesttheperformanceofthefreezerbyhavinga relationaloperatorthatusesarandomattributeinthequeryplan.Thequestionwe askisthefollowing:WhichordersfromJapanesecustomers,groupedbyordermonth, providedacombinedtotalpriceofatleast$1,000,000,000withaprobabilityof70%?". Althoughtheorderpriceinformationcanbefoundinthedeterministicdatawithout theneedofusingaVGfunctiontoestimateit,weassumethatthe c custkey attribute valueshaveachanceofbeingwrongandcouldprovideerroneouscustomerinformation. Anewrelation ERROR CUSTKEYold custkey,custkey,prob iscreatedexternally suchthat,foreachcustomerkeyfromtheoriginal customer relation,wehaveapotential customerkeyandtheprobabilityofthatkeybeingthetrue"one.The ERROR CUSTKEY 44

PAGE 45

Figure5-4.PhysicalqueryplanforQueryQ4 tableiscreatedexternallybytakingeachcustomerkeyandsettingthefollowing n keys inthetableaspotentialswithprobabilityproportionaltothedistancefromtheoriginal key.Forexample,if n =3andcustkey=1 ; 2 ; 3 ; 4 ; 5,thenthenewtablewillcontain ; 1 ;: 5 ; ; 2 ;: 25 ; ; 3 ;: 125 ; ; 2 ; 0 : 5 ;::: .AsseeninFigure5-4,theVGfunctionuses discretechoicetoselectanewsetofcustomers,whicharethenusedtocalculatetheorder prices. 5.1.5QueryQ5 Inthisquery,weassumethereexistsastrongcorrelationbetweenthenumberof daysittakestoshipanorderandwhetheritisreceivedorreturnedbythecustomer. Usingthatinformation,alogisticregressionmodelthatestimatestheprobability thatanorderwillbeacceptedbasedontheshippingtimeistrainedo-line.The 45

PAGE 46

Figure5-5.PhysicalqueryplanforQueryQ5 parametersofthismodelareencodedinsideaVGfunctionandusedtoestimatethe possibleoutcomeofallthoseordersthatareyettobereceivedbytheirrespective customers,asshowninFigure5-5.Weaskthefollowingsatisabilitystatement:With aprobabilityof30%,whichcustomerswillprovideanincreaseof10%inreturnsifthe shipmenttimesarecutbyhalf?".Inordertodothis,theVGfunctiontakeshalfof theshippingtimeandevaluatesthelogisticfunctionobtainingtheprobabilitythatthe customerwillreceiveit,thensamplesfromaBernoullidistributiontoproducearelation LOGISTIC ORDERSis received where is received 2f 0 ; 1 g .Thisattributeismultiplied bytheorder'spriceinordertoobtaintherevenue. 5.2ResultsandDiscussion WetestedvequeriesintheTPC-Hbenchmarkdatabase.Someofthemaremodied versionsofqueriesinitiallydescribedintheoriginalMCDBpaper.SQLcodeforthese queryplanscanbefoundinAppendixA.Eachoftheseplansinvolvetheinstantiation ofatleastonestochasticattributeandahypothesistest.Foreachquery,wecollectthe followingstatistics: 1.Thenumberoftestblocksrequiredtorunthequery. 46

PAGE 47

2.ThenumberofMonteCarloiterationspertestblock. 3.Thenumberoftuplesfromeachbaserelationthatareactuallyprocessedineachtest block. 4.Thenumberofpartitionsputintothe accept reject ,and keepgoing categoriesby eachtestblock. 5.Therunningtimeofeachtestblock. 5.2.1QueryQ1 Figure5-6.Wall-clockrunningtimesaandnumberofkeep-goingpartitionsbforquery Q1 Inthisquery,therunningtimeforeachtestblockisdecreasedbyalargefactor aftertheinitialrun,asillustratedinFigure5-6a.Thenumberof keep-going partitions betweeneachtestblockisusuallyreducedbyafactoroftwo,asexpectedfromthe hypothesistestanddepictedinFigure5-6b.Thenumberoftuplesscannedforeach relationshowsasignicantdecrease,asshowninTable5-1. 5.2.2QueryQ2 TheVGfunctionusedbythisqueryisrelativelymorecomplexthantheoneused inotherqueries,generating12tupleswith N samplespereach PARTSUPP tuplethatis 47

PAGE 48

Figure5-7.Wall-clockrunningtimesaandnumberofkeep-goingpartitionsbforquery Q2 suppliedtoit.ThemaximumnumberofMonteCarloiterationspertestblockisrestricted to5,000tuplesduetomemorylimitations,asshowninthelatesttestblocksinTable5-2. Inthiscase,althoughthenumberofkeep-goingpartitionsdecreasessubstantiallyin aratesimilartootherqueries,therunningtimedoesnot,asillustratedinFigures5-7a and5-7b.Asmentionedabove,thisisduetothecomplexityofourVGfunction. 5.2.3QueryQ3 InascenariosimilartoqueryQ1,weseeasignicantreductioninthenumberof keep-going partitionsbetweeneachtestblock,asillustratedinFigure5-8b.Therunning timeforeachtestblockaftertheinitialrundecreasesbyalargefactor,withtheexception ofthesixthtestblock,duetoasuddenincreaseinthenumberofMonteCarlosamples requiredbythehypothesistest,asshowninTable5-3. 5.2.4QueryQ4 Althoughthisquerycompletesafterashortnumberoftestblocksandhasonlyve dierentpartitions,asseeninTable5-4andFigure5-9b.However,itaccuratelydepicts theperformanceissuesassociatedwiththeuseofthefreezerandthelackoftupleltering 48

PAGE 49

Figure5-8.Wall-clockrunningtimesaandnumberofkeep-goingpartitionsbforquery Q3 Figure5-9.Wall-clockrunningtimesaandnumberofkeep-goingpartitionsbforquery Q4 49

PAGE 50

betweentestblocks,showninTable5-4,asthenumberoftuplesscannedforeachtable remainsconstant.Thisresultsinascenarioinwhichtherunningtimepertestblockstays thesame,asdepictedinFigure5-9a. 5.2.5QueryQ5 Figure5-10.Wall-clockrunningtimesaandnumberofkeep-goingpartitionsbfor queryQ5 TheresultsforthisqueryaresimilartothoseforqueriesQ1andQ3,inthesense thatboth keep-going partitionsandtherunningtimedecreasesubstantially,asdepicted inFigures5-10aand5-10b.AsinQ3,thereisasuddenjump"intherunning timeduringthesixthtestblockduetothehypothesistestrequiringalargeramountof samples,asshowninTable5-5. 50

PAGE 51

Table5-1.SummaryofresultsforqueryQ1 BlockMC j ORDERS jj LINEITEM j acceptrejectkeep-going Times 16415,000,00059,986,05230476,2146,028 2496729,726766,6141,6481,7522,81472 31858347,199360,0038835631,36834 47630160,817165,70442829464619 53096274,26076,30718814531314 66000035,78736,722704419911 76000022,65923,23120171627 86000018,58219,04813121376 96000015,65716,6071291166 106000013,28713,216511104 Table5-2.SummaryofresultsforqueryQ2 BlockMC j PARTSUPP jj ORDERS jj LINEITEM j acceptrejectkeep-going Times 138,000,00015,000,00059,986,0520404,699492,55038,876 2274492,5501,315,1713,797,52454,680297,369140,5011,464 3178218,808140,501823,76956,9276,60676,9681,538 4667132,30376,968509,38039,7138,07629,179952 5153852,31728,307189,55612,0245,81611,3391,132 6390930,14314,51398,5963,8791,4386,022908 7500022,64310,49774,5389786874,357879 8500017,2388,86259,2678195422,996791 9500011,9435,97537,6198012431,952726 1050006,6373,24521,7576431921,117688 1150002,7631,6179,756394237486637 1250002,2191,4268,79448600526 Table5-3.SummaryofresultsforqueryQ3 BlockMC j LINEITEM jj PARTSUPP j acceptrejectkeep-going Times 16459,986,0528,000,000087,19912,8012,270 2731,170,853701,79666,3466,449139 3161589,495353,575144,1232,31267 41078212,332126,599721,40883242 5136876,50945,489642640018 6758036,62721,7831882119133 7340917,45010,3521131778 8380668654133450323 9202128641720170151 1011934134480715002 51

PAGE 52

Table5-4.SummaryofresultsforqueryQ4 BlockMC j CUSTOMER jj ERR CUSTKEY jj ORDERS j acceptrejectkeep-going Times 191,500,0007,499,98015,000,000014184 2151,500,0007,499,98015,000,000031177 3221,500,0007,499,98015,000,000001173 41081,500,0007,499,98015,000,000101180 5231,500,0007,499,98015,000,000100192 Table5-5.SummaryofresultsforqueryQ5 BlockMC j LINEITEM jj ORDERS j acceptrejectkeep-going Times 13559,986,05215,000,000421,11956,8061,785 2467210,93458,85530,3011,75824,74743 367052,98424,9768,8175,05610,87413 455719,20110,9312,5963,8084,4975 523398,5484,5225501,8232,1245 6259204,9331,15097831,14320 7194332,58128286402798 814291620792010782 9715518429500281 10853169820081 11178312317011 1247060311001 52

PAGE 53

CHAPTER6 CONCLUSIONS Weintroducedsatisabilityqueriesanddiscussedhowtheyareevaluatedand processedusingtheMCDBframework.Satisabilityqueriesprovideanexpressiveand powerfulvehicleforerrorcontrolinuncertaindatamanagementbypushinghypothesis testingintothedatabasesystem.Thelineagebased,multipleroundorientedquery processingschemeweproposedallowsforecientevaluationofsatisabilityqueriesby managingthesystemresourcesinawaythattheseriesoftestblocksthatcomposea queryfocussolelyonspecicsubsetsofthedatabase. Ourexperimentsshowthatourapproachisecientandmakesgooduseofsystem resources.Thereductioninprocessingtimeandnumberoftuplesbetweeneachtestblock allowsthesystemtodevoteallitsresourcestospecictuplesthatrequireadditional MonteCarlosamplesavoidingthepotentialoverheadthatwouldappearwhenevaluating theentirequeryagain.Theuseofthefreezer,althoughaconservativesolutionwith marginalperformanceresults,provedtobeafeasiblesolutionforthosespecialcaseswhere certainrelationaloperatorsthatcomposethequeryplanoperateonstochasticattributes. Someissuesweintendtoexploreinourfutureworkinclude: Partialmaterializationofqueryresults .Sometimes,completelydeterministicparts ofaphysicalqueryplanlikeajoinbetweentwolargetablesthatarecomputed beforethe Instantiate operationintroducesrandomattributes,couldbewritten backtodiskandtreatedasatablescansothatsubsequenttestblocksdonothave toevaluatetherelationaljoin{oranyotherexpensiveoperation{butjustscanand applythe Route operation.Weplantofurtherexploretheimpactofthistypeof strategiesinrunningtimes. Re-thinkingtuplefreezingthroughfunctionaldependencies .Havingtofreezetuples whenanyrelationaloperatorthatimpliesaselectiononrandomattributesisan impediment,sinceitrulesoutthepossibilityofdrasticallyreducingthesizeofthe inputforthenexttestblock.Webelievethisruleis,insomecases,unnecessarily stringent,asthelineagecanstillbeusedonportionsofthequeryplanthat havecertainproperties,suchasforeign-keyjoins.Weplantoexplorethisissue furtheranddeneanewsetofrulesfortuplefreezing,intermsofthefunctional dependenciesbetweenthebaserelationsusedinthequeryplan. 53

PAGE 54

Querycompilationandoptimization .Currently,MCDBdoesnothaveaproperSQL compilerandoptimizerandqueriesarewrittenwithaprogramminglanguageused todescribephysicalqueryplans.Weplantodevelopacompilerthatallowstheuser toinputsatisabilityqueriesusingtheSQLsyntaxwehavedescribedthroughout thisthesis. 54

PAGE 55

APPENDIX:SQLCODEFOREXPERIMENTALQUERYPLANS A.1QueryQ1 CREATEVIEWorders todayAS SELECT* FROMorders,lineitem WHEREo orderdate=todayANDo orderkey=l orderkey CREATEVIEWparamsAS SELECT AVGl shipdate-o orderdateASship mu, AVGl receiptdate-l shipdateASarrv mu, STD DEVl shipdate-o orderdateASship sigma, STD DEVl receiptdate-l shipdateASarrv sigma, l partkeyASp partkey FROMorders,lineitem WHEREo orderkey=l orderkey GROUPBYl partkey CREATETABLEship durationsAS FOREACHoinorders today WITHgamma shipASDiscGamma SELECTship mu,ship sigma FROMparams WHEREp partkey=l partkey WITHgamma arrvASDiscGamma SELECTarrv mu,arrv sigma FROMparams WHEREp partkey=l partkey 55

PAGE 56

SELECTgs.valueASship,ga.valueASarrv FROMgamma shipgs,gamma arrvga USINGSELECTsd custkey,sd when+sd til arrASlen FROMship durations WHEREsd when=today TESTWHETHERANYlen>140 WITHPROBABILITY0.2 PARTITIONBYsd custkey A.2QueryQ2 CREATETABLEprc histph month,ph year,ph prc, ph partkeyAS FOREACHpsinpartsupp WITHtime seriesASRandomWalk VALUESps supplycost,12,"Dec",1995, -0.02,0.04 SELECTmonth,year,value,ps partkey FROMtime seriests CREATEVIEWbestAS SELECTph suppkey,ph partkey,ph month,ph year, minph priceASbest prc FROMprice hist GROUPBYph suppkey,ph partkey,ph month,ph year USINGSELECTSUMbest prc-ph prc/SUMactualASdis 56

PAGE 57

FROMbest,price hist,orders,lineitem WHEREyearo orderdate=1995ANDo orderkey= l orderkeyANDmontho orderdate= monthbest.ph orderdateANDmontho orderdate =monthprice hist.ph orderdateAND yearbest.ph orderdate=1995AND yearprice hist.ph orderdate=1995AND best.ph partkey=l partkeyANDbest.ph partkey =price hist.ph partkeyANDl suppkey= price hist.l suppkey GROUPBYo orderkey TESTWHETHERdiscount>0.20 WITHPROBABILITY0.9 PARTITIONBYo orderkey A.3QueryQ3 CREATETABLEn lineitem FOREACHlin SELECT* FROMlineitem WHEREyearl shipdate=1995 WITHmultASPoisson/365.0 SELECTl quantity*mult.valueASnew quant, l partkey,l suppkey FROMmult CREATEVIEWsellsAS SELECTSUMnew quantAStotal,l partkey, 57

PAGE 58

l suppkey FROMnew lineitem GROUPBYl suppkey,l partkey USINGSELECTavailquant-totalASdiff, ps partkey,ps suppkey FROMsells,partsupp WHEREl partkey=ps partkey ANDl suppkey=ps partkey TESTWHETHERANYdiff<0 WITHPROBABILITY0.2 PARTITIONBYps partkey,ps suppkey A.4QueryQ4 CREATEVIEWfrom japanAS FROMcustomer,nation WHEREn nationkey=c nationkeyAND n name='JAPAN'; CREATETABLElikely customersAS FOREACHcINfrom japan WITHnewcustomersASDiscreteChoice SELECTcustkey,probability FROMerror custkey WHEREold custkey=c.c custkey SELECTnew custkey 58

PAGE 59

FROMnewcustomers; USINGSELECTSUMo totalpriceASSumPrice,o ordermonth FROMorders,likely customers WHEREo custkey=new custkey GROUPBYo ordermonth TESTWHETHERANYSumPrice>1000000000 WITHPROBABILITY0.7 PARTITIONBYo ordermonth A.5QueryQ5 CREATEVIEWorder timesAS SELECTl receiptdate-o orderdateASdays, l receiptdate-o orderdate*0.75ASless days, FROMlineitem,orders WHEREl orderkey=o orderkey ANDl returnflag='N'; CREATETABLElogistic ordersAS FOREACHotINorder times WITHpdaysASLogisticBernoulliVALUESdays WITHplessASLogisticBernoulliVALUESless days SELECTpd.valueASprob days,pl.valueASprob less FROMpdayspd,plesspl; USINGSELECTSUMl extendedprice*l discount*prob daysASrevenue, SUMl extendedprice*l discount*prob lessASrevenue half, 59

PAGE 60

o custkey FROMlogistic orders GROUPBYo custkey TESTWHETHERANYrevenue half-revenue/revenue>0.1 WITHPROBABILITY>0.3 PARTITIONBYo custkey 60

PAGE 61

REFERENCES [1]P.Agrawal,O.Benjelloun,A.D.Sarma,S.N.C.Hayworth,T.Sugihara,and J.Widom.Trio:Asystemfordata,uncertainty,andlineage.In Intl.Conferencein VeryLargeDataBases ,pages1151{1154,2006. [2]P.Andritsos,A.Fuxman,andR.J.Miller.Cleananswersoverdirtydatabases:a probabilisticapproach.In IEEEIntl.ConferenceinDataEngineering ,page30,2006. [3]L.Antova,C.Koch,andD.Olteanu.MayBMS:Managingincompleteinformation withprobabilisticworld-setdecompositions.In IEEEIntl.ConferenceinData Engineering ,pages1479{1480,2007. [4]S.Arumugam. Ecientalgorithmsforspatiotemporaldatamanagement .PhDthesis, UniversityofFlorida,2008. [5]F.Bacchus,A.J.Grove,J.Y.Halpern,andD.Koller.Fromstatisticalknowledge basestodegreesofbelief. ArticialIntelligence ,87{2:75{143,1996. [6]R.Baeza-YatesandB.Ribeiro-Neto. ModernInformationRetrieval .Addison-Wesley, 1999. [7]D.Barbara,H.Garcia-Molina,andD.Porter.Themanagementofprobabilisticdata. IEEETrans.KnowledgeandDataEng. ,4:487{502,1992. [8]O.Benjelloun,A.D.Sarma,A.Halevy,andJ.Widom.Uldbs:Databaseswith uncertaintyandlineage.In Intl.Conf.inVeryLargeDataBases ,pages953{964, 2006. [9]D.Bhagwat,L.Chiticariu,W.C.Tan,andG.Vijayvargiya.Anannotation managementsystemforrelationaldatabases.In Intl.ConferenceinVeryLarge DataBases ,pages900{911,2004. [10]B.BillerandB.L.Nelson.Modelingandgeneratingmultivariatetime-seriesinput processesusingavectorautoregressivetechnique. ACMTrans.ModelingComput. Simulation ,13:211{237,2003. [11]J.Boulos,N.Dalvi,B.Mandhani,S.Mathur,C.Re,andD.Suciu.Mystiq:Asystem forndingmoreanswersbyusingprobabilities.In ACMSIGMODConference ,pages 891{893,2005. [12]P.Buneman,S.Khanna,andW.C.Tan.Dataprovenance:Somebasicissues. LectureNotesinComputerScience ,1974:87{94,2000. [13]G.CasellaandR.L.Berger. StatisticalInference .Duxbury,2002. [14]R.CavalloandM.Pittarelli.Thetheoryofprobabilisticdatabases.In Intl.ConferenceinVeryLargeDataBases ,pages71{81,1987. 61

PAGE 62

[15]R.Cheng,S.Singh,andS.Prabhakar.U-dbms:Adatabasesystemformanaging constantly-evolvingdata.In Intl.ConferenceinVeryLargeDataBases ,pages 1271{1274,2005. [16]Y.CuiandJ.Widom.Lineagetracingforgeneraldatawarehousetransformations.In Intl.ConferenceinVeryLargeDataBases ,pages471{480,2001. [17]Y.Cui,J.Widom,andJ.L.Wiener.Tracingthelineageofviewdataina warehousingenvironment. ACMTransactionsonDatabaseSystems ,25:179{227. [18]N.DalviandD.Suciu.Ecientqueryevaluationonprobabilisticdatabases.In Intl. ConferenceinVeryLargeDataBases ,pages864{875,2004. [19]N.DalviandD.Suciu.Managementofprobabilisticdata:Foundationsand challenges.In ACMPrinciplesofDatabaseSystems ,pages1{12,2007. [20]N.FuhrandT.Rolleke.Aprobabilisticrelationalalgebrafortheintegrationof informationretrievalanddatabasesystems. ACMTrans.Inf.Syst. ,15:32{66,1997. [21]E.GradelandY.G.andC.Hirsch.Thecomplexityofqueryreliability.In ACM PrinciplesofDatabaseSystems ,pages227{234,1998. [22]R.GuptaandS.Sarawagi.Creatingprobabilisticdatabasesfrominformation extractionmodels.In Intl.ConferenceinVeryLargeDataBases ,pages965{976, 2006. [23]N.I.Hachem,K.Qiu,M.Gennert,andM.Ward.Managingderiveddatainthegaea scienticdbms.In Intl.ConferenceinVeryLargeDataBases ,pages1{12,1993. [24]R.Jampani,F.Xu,M.Wu,L.L.Perez,C.Jermaine,andP.J.Hass.Mcdb:A monte-carloapproachtomanaginguncertaindata.In ACMSIGMODConference pages687{700,2008. [25]R.KarpandM.Luby.Monte-carloalgorithmsforenumerationandreliability problems.pages56{64,1983. [26]B.KimelfeldandY.Sagiv.Matchingtwigsinprobabilisticxml.In Intl.Conference inVeryLargeDataBases ,pages27{38,2007. [27]D.Knuth. TheArtofComputerProgramming,Volume2:SeminumericalAlgorithms Addison-Wesley,1998. [28]C.KochandD.Olteanu.Conditioningprobabilisticdatabases.In Proceedingsofthe VLDBEndowment ,pages313{325,2008. [29]T.Lee,S.Bressan,andS.E.Madnick.Managingderiveddatainthegaeascientic dbms.In WorkshoponWebInformationandDataManagement ,pages33{39,1998. [30]M.MatsumotoandT.Nishimura.Mersennetwister:a623-dimensionally equidistributeduniformpseudo-randomnumbergenerator.8:3{30,1998. 62

PAGE 63

[31]N.MetropolisandS.Ulam.Themontecarlomethod. JournaloftheAmerican StatisticalAssociation ,44:335{341,1949. [32]R.MurthyandJ.Widom.Makingaggregationworkinuncertainandprobabilistic databases.In Proc.1stInt.VLDBWork.Mgmt.UncertainDataMUD ,pages 76{90,2007. [33]R.B.Nelsen. AnIntroductiontoCopulas .Springer,2006. [34]J.NeymanandE.S.Pearson.Ontheproblemofthemostecienttestsofstatistical hypotheses. Phil.Trans.oftheRoyalSoc.ofLondon.SeriesA ,231:289{337. [35]F.Panneton,P.L'Ecuyer,andM.Matsumoto.Improvedlong-periodgeneratorsbased onlinearrecurrencesmodulo2. ACMTrans.Math.Softw. ,32:1{16,2006. [36]C.ReandD.Suciu.Materializedviewsinprobabilisticdatabasesforinformation exchangeandqueryoptimization.In Intl.ConferenceinVeryLargeDataBases pages51{62,2007. [37]C.ReandD.Suciu.Approximatelineageforprobabilisticdatabases.In Intl. ConferenceinVeryLargeDataBases ,pages797{808,2008. [38]C.Re,N.Dalvi,andD.Suciu.Ecienttop-kqueryevaluationonprobabilisticdata. In IEEEIntl.ConferenceinDataEngineering ,pages886{895,2007. [39]C.P.RobertandG.Casella. MonteCarloStatisticalMethods .Springer,2004. [40]S.Singh,C.Mayeld,S.Prabhakar,R.Shah,andS.Hambrusch.Indexinguncertain categoricaldata.In IEEEIntl.ConferenceinDataEngineering ,pages616{625,2007. [41]R.C.Tausworthe.Randomnumbersgeneratedbylinearrecurrencemodulotwo. MathematicsofComputation ,19:201{209. [42]J.vonNeumann.Varioustechniquesusedinconnectionwithrandomdigits. National BureauofStandards,AppliedMathSeries ,11:36{38,1951. [43]D.D.Wackerly,W.Mendenhall,andR.L.Scheaer. MathematicalStatisticswith Applications .Duxbury,2002. [44]A.Wald.Sequentialtestsofstatisticalhypotheses. TheAnnalsofMathematical Statistics ,16:117{186,1945. 63

PAGE 64

BIOGRAPHICALSKETCH LuisL.PerezgraduatedwithaMasterofSciencedegreeinComputerEngineering fromtheUniversityofFloridainthefallof2009.DuringhisstudiesatFlorida,heworked underthesupervisionofChristopherJermaine,doingresearchinprobabilisticdatabases andqueryprocessing.HewasamemberoftheDatabaseCenterattheComputerand InformationScienceCISEDepartment. AftergraduatingfromUF,LuisjoinedtheRiceUniversityDepartmentofComputer ScienceasaPh.D.student.There,hecontinueshisresearchworkindatamanagement, pursuingnewtopicssuchasqueryoptimization,distributeddatabasesandcloud computing. 64