Citation
Efficient Query Processing in MCDB-R

Material Information

Title:
Efficient Query Processing in MCDB-R
Creator:
Jampani, Ravindranath Chowdary
Place of Publication:
[Gainesville, Fla.]
Florida
Publisher:
University of Florida
Publication Date:
Language:
English
Physical Description:
1 online resource (116 p.)

Thesis/Dissertation Information

Degree:
Doctorate (Ph.D.)
Degree Grantor:
University of Florida
Degree Disciplines:
Computer Engineering
Computer and Information Science and Engineering
Committee Chair:
Dobra, Alin
Committee Members:
Ungor, Alper
Ranka, Sanjay
Kahveci, Tamer
Koehler, Gary J
Graduation Date:
8/11/2012

Subjects

Subjects / Keywords:
Customers ( jstor )
Databases ( jstor )
Datasets ( jstor )
Identifiers ( jstor )
Information attributes ( jstor )
Mathematical independent variables ( jstor )
Monte Carlo methods ( jstor )
Probability distributions ( jstor )
Risk analysis ( jstor )
Simulations ( jstor )
Computer and Information Science and Engineering -- Dissertations, Academic -- UF
databases -- probabilistic -- processing -- query -- statistical
Genre:
bibliography ( marcgt )
theses ( marcgt )
government publication (state, provincial, territorial, dependent) ( marcgt )
born-digital ( sobekcm )
Electronic Thesis or Dissertation
Computer Engineering thesis, Ph.D.

Notes

Abstract:
Quantitative risk analysis has become an integral part of large financial institutions. For banking and insurance enterprises, calculating the risk of future insolvency is mandatory per the regulatory directives such as Solvency and Basel. Performing risk analysis in these enterprises is difficult because the systems are highly complex and have a large number of uncertain variables. One form of risk analysis is in understanding the impact of extreme events that have a low chance of occurrence. The common methods used to estimate this extreme risk come from statistics and rare event simulation fields, and are based on Monte Carlo sampling. However, these sampling-based methods produce huge amounts of data. Therefore, the current implementations of risk analysis methods have scalability problems, particularly as they relate to complex systems with a very large number of variables. Monte Carlo DataBase with Risk analysis (MCDB-R) is a database system designed to facilitate scalable risk analysis on large datasets. MCDB-R combines sampling-based statistical methods and relational database technology for scalability on large uncertain data sets. Uncertainties in data are modeled as probabilistic distributions. A query on this uncertain data with probability distributions results in another probabilistic distribution called query-result distribution. For a given scenario defined by a query, analyzing its extreme events requires examining the upper or lower areas of the query-result distribution, where the events occur with very low probability. The specific methods used in MCDB-R for this analysis are: Gibbs sampling from statistics and cloning from rare-event simulation. These methods are used to efficiently sample from the low probability areas of the query-result distribution for the analysis. This dissertation discusses three improvements to the query processor in MCDB-R. 
The first technique provides a mechanism to move the sample generation (during the query execution) closer to the location where those samples are actually processed. A huge number of samples is generated during the query execution in MCDB-R, and this new mechanism reduces the sample movement in the memory hierarchy. The response time of the queries improves significantly, sometimes by an order of magnitude, as shown in the experiments. MCDB-R employs a rejection algorithm at the core of its sampling-based risk analysis. For each uncertain variable, the rejection algorithm looks at a series of samples and discards each one unless it fits a given constraint. Sometimes the constraints are so stringent that the rejection algorithm needs to process millions of samples before finding a good fit. In MCDB-R, an instance like this will require the same number of samples to be produced for all variables. In the second technique, this problematic instance is isolated from the rest of the query execution and then run separately to find an acceptable fit. The normal execution restarts after finding the required sample. Finally, adding an anti-join operator to the MCDB-R execution engine is explained. This new operator enables the system to execute subset-based queries with not-in and not-exists clauses. Performing an anti-join in this system is not trivial because of the stochastic nature of the data. The system does not know which samples are actually used until the end of the query execution. Two methods to implement the anti-join operator are discussed and then compared through experiments. ( en )
General Note:
In the series University of Florida Digital Collections.
General Note:
Includes vita.
Bibliography:
Includes bibliographical references.
Source of Description:
Description based on online resource; title from PDF title page.
Source of Description:
This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Thesis:
Thesis (Ph.D.)--University of Florida, 2012.
Local:
Adviser: Dobra, Alin.
Statement of Responsibility:
by Ravindranath Chowdary Jampani.

Record Information

Source Institution:
UFRGP
Rights Management:
Copyright Jampani, Ravindranath Chowdary. Permission granted to the University of Florida to digitize, archive and distribute this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.
Resource Identifier:
857819498 ( OCLC )
Classification:
LD1780 2012 ( lcc )


Full Text

PAGE 1

EFFICIENT QUERY PROCESSING IN MCDB-R
By
RAVINDRANATH CHOWDARY JAMPANI
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2012

PAGE 2

© 2012 Ravindranath Chowdary Jampani

PAGE 3

Dedicated to my family, friends and teachers

PAGE 4

ACKNOWLEDGMENTS

I would like to thank my advisors, Chris and Alin. I have been fortunate to have them as my advisors, and this work would not be possible without their guidance and support. Their vast knowledge, quick thinking, great insight into the problems and their enthusiasm for research and teaching have been inspiring. It has been great working with them. I am thankful to Chris for letting me work on my thesis project. I am grateful to the Department of Computer & Information Sciences & Engineering (CISE), University of Florida and my advisors for supporting me throughout my studies. I am thankful to Dr. Ungor for letting me work on the meshing project. This project had been a great learning experience. I am thankful to my colleagues and friends at CISE: Mingxi, Subi, Nirmalya, Manu, Niketan, Lixia, Manas, Luis, Fei, Ferhat, Florin, Manna, Vishak, and Amit. I am grateful to Subi Arumugam and Risi Thonangi for their conversations related to research and implementation of database systems. Special thanks to the students who attended my Microsoft Office course in the Fall of 2010. They were a great motivation for me to put in the necessary effort to prepare for and teach the topics. Their feedback during the course was also very useful. It remains a unique teaching experience for me.

I would also like to mention the International Institute of Information Technology (IIIT), Hyderabad, India, for providing me with a great learning environment during my undergraduate studies. Thanks to Dr. Govindarajulu Regeti and Dr. Rajeev Sangal for inspiring me to pursue a doctorate, and to Dr. Kamal Karlapalem, Dr. Sushma Bendre and Dr. Prosenjit Gupta for helping me in my application process. I am grateful to my friends from my undergraduate days: Sashi, Srikanth, Rishi, Santhosh, Jhansi, and Teja. They have been a source of constant support. I am fortunate to have formed friendships with Tapasvi, Manu, Niketan, Subhajit, Arun, Ganesh, Mriganka, Siva, Aswin, Jayendra and Muthu during my graduate studies.

PAGE 5

Gainesville was a fun place to stay because of them. Finally, I would like to thank my family, whose continued encouragement has made this effort possible.

PAGE 6

TABLE OF CONTENTS
                                                                     page
ACKNOWLEDGMENTS .... 4
LIST OF TABLES .... 8
LIST OF FIGURES .... 9
ABSTRACT .... 11
CHAPTER
1 INTRODUCTION .... 13
  1.1 Main Contributions .... 17
    1.1.1 Problem 1: Serializing Variable Generating Functions .... 17
    1.1.2 Problem 2: Finding an Extreme Sample .... 18
    1.1.3 Problem 3: Incorporating Anti-Join .... 19
  1.2 Organization .... 20
2 MCDB-R OVERVIEW .... 21
  2.1 Background .... 27
    2.1.1 Rare Event Simulation .... 27
    2.1.2 Gibbs Sampling .... 28
  2.2 Schema Specification .... 29
  2.3 The Variable Generating Function .... 32
  2.4 The Seed Operator .... 34
  2.5 The Instantiate operator .... 35
  2.6 Gibbs Tuple Bundles .... 37
    2.6.1 Tail-Sampling-Seed and Seed Queue .... 39
    2.6.2 Split Operator .... 40
  2.7 Gibbs Looper and Tail Sampler .... 42
    2.7.1 Gibbs Queue .... 42
    2.7.2 Rerunning the Query Plan .... 44
    2.7.3 Materializing the Operators .... 45
3 RELATED WORK .... 48
  3.1 Statistical Operations in Relational Databases .... 48
  3.2 Probabilistic Databases .... 48
  3.3 Top-k Ranking in Probabilistic Databases .... 51
  3.4 Monte Carlo Methods in Databases, Data Mining, and Machine Learning .... 53
  3.5 Risk Analysis Software .... 56

PAGE 7

4 SERIALIZING SAMPLE GENERATION PROCESS .... 57
  4.1 Background .... 57
  4.2 Motivation .... 58
  4.3 Solution Outline .... 61
  4.4 Serialization .... 61
    4.4.1 Incorporating Serialization in a Query Plan .... 64
    4.4.2 Stochastic Join Predicates .... 64
  4.5 Efficient De-serialization .... 66
  4.6 Related Work .... 67
  4.7 Experiments .... 68
5 REPLACING AN EXTREME SAMPLE .... 74
  5.1 Motivation .... 74
  5.2 Solution Outline .... 76
  5.3 Filtering .... 77
  5.4 Fragmenting a Large Tuple Bundle .... 79
  5.5 Implementation Details .... 80
    5.5.1 TS Seed and Sample Generation .... 82
    5.5.2 How to Identify an Extreme Sample? .... 83
  5.6 Experiments .... 85
6 ANTI JOIN .... 92
  6.1 Motivation .... 92
  6.2 Composite Attribute Formats .... 95
    6.2.1 Long Format .... 95
    6.2.2 Short Format .... 98
  6.3 Discussion on Extreme Samples .... 100
  6.4 For Queries with Multiple Anti-Joins .... 101
  6.5 Experiments .... 101
7 CONCLUSION .... 106
APPENDIX: QUERIES .... 107
REFERENCES .... 111
BIOGRAPHICAL SKETCH .... 116

PAGE 8

LIST OF TABLES
Table                                                                page
1-1 Example: Flaw of averages .... 14
4-1 Wall clock execution times .... 71
4-2 Disk I/O and sample statistics .... 71
4-3 Scaling with database size .... 71
4-4 Scaling with DB instances (time in seconds) .... 72
4-5 Parallel sample generation .... 72
4-6 Performance for different VG functions (time in minutes) .... 72
5-1 Poisson random relation .... 75
5-2 After a phase of rejection sampler .... 75
5-3 Wall clock running times (in seconds) .... 89
5-4 Number of extreme samples identified .... 89
5-5 Performance with increasing input size (in minutes) .... 89
5-6 Samples generated (in Billions) .... 89
5-7 Query reruns and extreme samples .... 89
5-8 Performance with change in aggregate attribute .... 89
5-9 Performance with changing max. samples (in minutes) .... 89
5-10 Execution parameters with changing max. samples .... 89
5-11 Number of extreme samples with increasing Gibbs iterations .... 90
6-1 Performance with varying seeds per tuple (in minutes) .... 103
6-2 Performance on large datasets (in minutes) .... 104

PAGE 9

LIST OF FIGURES
Figure                                                               page
2-1 CustOrder table: (a) Deterministic, and (b) Uncertain .... 22
2-2 Stochastic CustOrder table .... 22
2-3 Multiple instances of CustOrder after Monte Carlo sampling .... 24
2-4 Query-result histogram .... 24
2-5 Gibbs cloning: Generating samples .... 25
2-6 Gibbs cloning: Deleting lower 50% .... 25
2-7 Gibbs cloning: Clone the elite samples .... 26
2-8 Gibbs cloning: After perturbation with Gibbs sampler .... 27
2-9 An example Seed attribute .... 34
2-10 Instantiate Operator .... 36
2-11 Tail sampler operator .... 38
2-12 CustOrder with Gibbs tuples .... 39
2-13 Example: TS Seed and Seed .... 40
2-14 Stochastic table in a join .... 40
2-15 Stochastic table after a Split .... 41
2-16 Join result tuples .... 41
2-17 Gibbs queue .... 43
2-18 Example: End of run 1 .... 44
2-19 Example: Beginning of run 2 .... 45
2-20 Query plan before materialization .... 46
2-21 Query plan after materialization .... 46
2-22 Query plan with multiple reruns .... 47
4-1 An MCDB-R query plan .... 57
4-2 Modified query plan .... 60
4-3 A VG function process .... 62

PAGE 10

4-4 Serialization of VG function .... 62
4-5 Bayesian VG: Input serialization .... 63
4-6 Bayesian VG: Inner input .... 63
4-7 Bayesian VG: Outer input .... 63
4-8 Serialization of Bayesian VG input process .... 64
4-9 Join on stochastic attribute on one side .... 65
4-10 Query plan after serialization .... 65
4-11 Join on stochastic attributes on both sides .... 66
4-12 Query plan after serialization .... 67
5-1 Query plan .... 78
5-2 Modified plan with filtering .... 78
5-3 Splitting a large tuple bundle .... 79
5-4 Query plan with Sample router: normal mode .... 80
5-5 Query plan with Sample router: active mode .... 81
5-6 A TS Seed and assigned sample values .... 82
5-7 TS Seed and assigned samples after a Gibbs sample iteration .... 82
6-1 Example: Anti-Join in a relational database .... 92
6-2 Example: Stochastic table .... 94
6-3 Example: After Split .... 95
6-4 Composite Attribute: Long format .... 96
6-5 Long Format: Tuples after Anti-join .... 97
6-6 Composite Attribute: Short format .... 98
6-7 Short Format: Tuples after Anti-join .... 99

PAGE 11

Abstract of dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

EFFICIENT QUERY PROCESSING IN MCDB-R

By
Ravindranath Chowdary Jampani
August 2012

Chair: Alin V. Dobra
Major: Computer Engineering

Quantitative risk analysis has become an integral part of large financial institutions. For banking and insurance enterprises, calculating the risk of future insolvency is mandatory per the regulatory directives such as Solvency and Basel. Performing risk analysis in these enterprises is difficult because the systems are highly complex and have a large number of uncertain variables. One form of risk analysis is in understanding the impact of extreme events that have a low chance of occurrence. The common methods used to estimate this extreme risk come from statistics and rare event simulation fields, and are based on Monte Carlo sampling. However, these sampling-based methods produce huge amounts of data. Therefore, the current implementations of risk analysis methods have scalability problems, particularly as they relate to complex systems with a very large number of variables. Monte Carlo DataBase with Risk analysis (MCDB-R) is a database system designed to facilitate scalable risk analysis on large datasets. MCDB-R combines sampling-based statistical methods and relational database technology for scalability on large uncertain data sets. Uncertainties in data are modeled as probabilistic distributions. A query on this uncertain data with probability distributions results in another probabilistic distribution called query-result distribution. For a given scenario defined by a query, analyzing its extreme events requires examining the upper or lower areas of the query-result distribution, where the events occur with very low probability.

PAGE 12

The specific methods used in MCDB-R for this analysis are: Gibbs sampling from statistics and cloning from rare-event simulation. These methods are used to efficiently sample from the low probability areas of the query-result distribution for the analysis. This dissertation discusses three improvements to the query processor in MCDB-R.

The first technique provides a mechanism to move the sample generation (during the query execution) closer to the location where those samples are actually processed. A huge number of samples is generated during the query execution in MCDB-R, and this new mechanism reduces the sample movement in the memory hierarchy. The response time of the queries improves significantly, sometimes by an order of magnitude, as shown in the experiments. MCDB-R employs a rejection algorithm at the core of its sampling-based risk analysis. For each uncertain variable, the rejection algorithm looks at a series of samples and discards each one unless it fits a given constraint. Sometimes the constraints are so stringent that the rejection algorithm needs to process millions of samples before finding a good fit. In MCDB-R, an instance like this will require the same number of samples to be produced for all variables. In the second technique, this problematic instance is isolated from the rest of the query execution and then run separately to find an acceptable fit. The normal execution restarts after finding the required sample. Finally, adding an anti-join operator to the MCDB-R execution engine is explained. This new operator enables the system to execute subset-based queries with not-in and not-exists clauses. Performing an anti-join in this system is not trivial because of the stochastic nature of the data. The system does not know which samples are actually used until the end of the query execution. Two methods to implement the anti-join operator are discussed and then compared through experiments.

PAGE 13

CHAPTER 1
INTRODUCTION

The 2008 financial crisis is considered to be the worst such crisis since the Great Depression [70]. It resulted in the collapse of banks and insurance corporations, which necessitated government bailouts and resulted in plunging stock markets. The trigger for this financial apocalypse was the subprime mortgage crisis, which caught financial institutions off guard until it was too late. Institutions learned that underestimating the risk of extreme events and the liquidity required to counter those events left them in a very precarious situation. The crisis demonstrated that low-probability, high-impact events do happen. The current risk management models do not place sufficient emphasis on extreme events, which is a critical issue [35]. The impact of extreme risk events can be lowered if they are detected early and managed properly. However, due to the lack of preparation, the impact can be greater than necessary.

Another issue with current risk management models is their excessive approximation. The current financial system is highly dynamic and complex due to globalization. Parts of this complex system are either approximated or totally ignored in order to reduce the number of variables in the model. The approximations are obtained by using averages over multiple variables. One of the main reasons for reducing the number of variables is the limitation of the software that is currently used, which is based on spreadsheets and is not scalable for more fine-grained models. The problem that arises due to imprudent approximation with averages can be explained by using the simple example from [55], Chapter 19. In this example, a bank is estimating the average profit it might derive from its loans. Assume that the bank pays 4% interest per year on the money it uses. It lends to the customers with good credit at a 6% rate; therefore, the profit margin is 2%. For customers who are less credit-worthy, the bank loans money at 14%, yielding a 10% profit margin. The cost overhead to manage the loan is $25 per year. Thus, an account of $600, with a margin of 6% will

PAGE 14

yield a profit of $11 ((6% of $600) - $25 = $11). Refer to the data given in Table 1-1. If we consider both groups of customers separately, we will obtain an accurate estimate. A good customer with a $1000 balance on a 2% margin will generate $20 in revenue and -$5 in total profit. The other customer, with a balance of $200 and a margin of 10%, will again generate $20 in revenue and -$5 in total profit. The correct estimate for the average profit is provided in Table 1-1, row 3. However, if we try to estimate the average profit based on the average balance of $600 (($1000+$200)/2) and an average margin of 6% ((2%+10%)/2), the average net revenue is $36 and the average profit is $11, as shown in Table 1-1, row 4.

Table 1-1. Example: Flaw of averages

                            Balance   Margin   Net Revenue   Profit after Overhead
Account with Good Credit    1000      2%       20            -5
Account with Bad Credit     200       10%      20            -5
Correct Average             600       6%       20            -5
Flawed Average              600       6%       36            11

Simulation based on sampling is an excellent way to avoid the problem created by using averages. However, sampling is very data intensive, and scalability becomes an issue as the number of variables increases. The MCDB-R system is built to provide scalable Monte Carlo simulations for extreme risk analysis. Scalability is achieved by using existing database technology and by incorporating the Monte Carlo simulations internally in the query processing. The advantages of using MCDB-R over existing spreadsheet-based modeling include:

PAGE 15

- Scalable simulations: The scalability of MCDB-R facilitates Monte Carlo simulations over a finer and richer model, avoiding the issues with approximation using averages. A fine-grained model is important to obtain more accurate results and avoid the flaws introduced by averages.
- Ease of use: Users of spreadsheet applications such as Microsoft Excel must first extract the data from a database through aggregate queries (such as AVG, COUNT, and so forth) and then load the data into the application. These two steps require that the user buy and learn several software packages, involving both considerable money and time. MCDB-R avoids these issues by providing Monte Carlo simulations within the database.

Next, we give an example 'what-if' query and show how MCDB-R analyzes extreme risk on that query.

Example: Let us examine the following what-if query from [37]: "What would the profits be if we raised all prices by 5% this year?". For this query, relevant information like fluctuation in demand with a price increase is unknown. Therefore, [37] uses a Bayesian model to calculate the fluctuation in demand as a parametric probability density function. The attribute for the new demand is called a stochastic attribute, as its value is defined by a probability distribution. The whole database itself is now stochastic, and is characterized by a very high dimensional joint distribution. A query on this stochastic database will result in another distribution instead of a single value as in a deterministic database. This result distribution is called a query-result distribution. Obtaining this query-result distribution is not simple because the data is high-dimensional, making a direct analytical solution very difficult. For this reason, the answer is calculated through Monte Carlo-based sampling [51] in the following way: we generate multiple new demand values from the probability-density functions. Each set of new demand values will create an instance of the database. The query is executed on each instance of the database. The output of the query is a set of profit values from the query-result distribution. This is the method used in the Monte Carlo Database System (MCDB) [37]. For efficient query processing, MCDB bundles all values of a

PAGE 16

single stochastic tuple (from different database instances) together into a single tuple bundle.

Let us analyze the extreme risk for the 5% price increase 'what-if' query to understand the worst possible scenarios for such a decision. For example, examining the lowest possible total profit value that can occur would be of interest. In the above Bayesian model the worst case can be defined as the value of total profit in the 0.001 quantile of the query-result distribution (this low probability region is generally called the tail of a distribution function). We can better understand the risk of the decision with the knowledge of such worst-case scenarios. A large number of samples in the 0.001 quantile space will be useful in estimating the profit value accurately and also in determining what led to such poor profits. A decision-maker could then decide on the strategies to take to counter such effects.

One approach for obtaining samples in the tail is to generate large numbers of samples on the entire event space using MCDB and then select only those that fall in the tail region. If the samples are not requested from the extremely low probability regions, then MCDB works well. Generating 100 samples from the 0.01 quantile is not very costly because generating 10,000 samples from the whole distribution is not expensive. However, this approach is quite inefficient if we are looking more deeply in the tail, such as the 0.000001 quantile. In many applications [53], an event with a probability of $10^{-6}$ or less is called rare. Therefore, getting 100 samples of interest in such applications requires $10^{8}$ samples in the full-event space, which is not practical in MCDB.

MCDB-R is designed to perform this 'tail-sampling' task efficiently. MCDB-R, in addition to utilizing the tuple bundle and variable-generating (VG) function framework from MCDB, uses Gibbs cloning ideas from the rare-event simulation literature [10]. The Gibbs cloning method is used to directly sample from the tail of the distribution and, therefore, decreases the overall cost of the query. Gibbs cloning is an iterative process to move further into the tail of the distribution function. At each step we delete the uninteresting

PAGE 17

samples and replace them with exact copies of interesting samples (a cloning step). The new set of samples will be highly correlated due to the cloning. Hence, these samples are randomly perturbed so that the new samples after the perturbation are reasonably uncorrelated. This perturbation is performed by the Gibbs sampler because of the high dimensionality of the data.

1.1 Main Contributions

Our group built a prototype of the MCDB-R system, and we tested its performance on various queries (including the example what-if query described above). We found that the system was not as efficient as we expected. After profiling the system, we discovered two issues:

- Moving large sample arrays in and out of memory is taking most of the execution time.
- The system is generating a large number of samples that are not even used.

As part of this dissertation, I am providing solutions to mitigate these issues. Additionally, I am tackling a third problem: to implement an Anti-join operator in MCDB-R. The Anti-join operator is important because it enables the system to run sub-queries with NOT IN and NOT EXISTS clauses. My dissertation also discusses the evaluation and implementation of the solutions that I proposed for all three problems, which are briefly described in the following subsections.

1.1.1 Problem 1: Serializing Variable Generating Functions

The sample sizes in MCDB-R are very large because of the selective nature of the rejection algorithm during the Gibbs sampling process (explained in Chapter 2). Using tuple bundles to group the samples together and store them as an array results in faster processing. However, tuple bundles are extremely large, and moving them in the memory hierarchy is very time consuming. I present a technique that enables the system to serialize the sample generation process and to delay the actual sample generation as long as possible. This technique avoids the movement of large numbers

PAGE 18

of samples (or tuple bundles) through some expensive operations such as sorts and joins. For example, in the current system, we executed one of the benchmark queries (Q2 from Section 4.7) on a 200 megabyte (MB) dataset with 50,000 tuples with uncertain (or stochastic) data and generated result samples in the 0.999 quantile. The benchmark query writes and reads more than a terabyte (TB) to and from the disk during its execution, and generates more than 3 billion samples on average before it completes. The large sample sizes are due to some stochastic tuples that require a few samples with probability in the order of $10^{-3}$ or less. For example, it is possible for the query to look for a sample in the order of $10^{-5}$ probability. In the existing system, this task requires generating $10^{5}$ samples on average for each required tuple, and passing the samples through the query plan. In simple queries with no complex joins on stochastic attributes (like the what-if query mentioned above), the sample generation can be delayed until the top-most operator in the query plan. In the current system, when a query is out of samples, the query is rerun with a larger set of samples. For some queries, the method that I propose (described in Section 4.4) also eliminates the requirement of rerunning the query plan when more samples are needed. Even in queries with joins on stochastic attributes, my method can eliminate the execution of some of the expensive operators before that join, which speeds up the query execution from the second rerun onwards.

1.1.2 Problem 2: Finding an Extreme Sample

The Gibbs sampler processes an enormous number of samples during a single query execution instance. Therefore, the chance that an extreme sample (one that occurs with very low probability) appears in the sample stream is good. Such a sample is easily accepted if it is on the correct side of the tail. However, replacing that sample during the next Gibbs sampler phase may require another sample of nearly equal value. Replacing one such extreme sample could result in numerous query reruns and

PAGE 19

increase the system response time. In Chapter 5, I look at an effective solution for this problem. My solution has two parts. First, the query plan is rerun with only the relevant tuples, and everything else is filtered out. This method reduces the data flow in the system. Second, for that specific tuple, a large number of samples is generated. The larger number of samples will increase the chances of finding a replacement sample and, hopefully, eliminate additional reruns for this tuple. One specific extreme sample might need tens of millions of samples to find a replacement. Since all the samples cannot fit on a single page, I developed a workaround. The large sample array in a tuple is broken into numerous smaller arrays as the samples are being generated. These smaller tuples (each storing one small sample array) are passed through the query plan until they reach the Gibbs sampler, where they can be recombined. Sequence identifiers are stored to remember the order in which the smaller tuples need to be grouped together when the tuples are being used.

1.1.3 Problem 3: Incorporating Anti-Join

The Anti-join of relations R and S ($R \triangleright S$) returns the tuples in R that do not match any tuple in S on the common attributes. In a relational database, this operation is straightforward, as its execution is similar to an equality-join. However, in MCDB-R it is difficult because the actual execution of the predicate is delayed until the end of the query plan. A stochastic attribute participating in the Anti-join from the right relation S introduces complications: advancing the sample in S could result in the tuple in R being dropped. Consequently, the Gibbs sampling (refer to Chapter 2) must have all of the tuples from S that match the tuple in R. A simple but effective solution is to store all of the matching tuples from S in the corresponding tuple in R. During the Anti-join operation, all the information necessary from matching tuples in S is added to the tuple in R. I also describe another solution that does not require storing all of the information from

PAGE 20

matching tuples. This method eliminates the drawback of the first solution: the existence of huge tuple sizes as a result of storing attributes from all matching tuples.

1.2 Organization

This dissertation is organized as follows: Chapter 2 provides an overview of the system, the mathematical background required for MCDB-R, the Gibbs Sampler, and cloning. Chapter 3 describes prior work related to MCDB-R. Chapter 4 explains the serialization and de-serialization of the sample-generation process. Chapter 5 discusses the extreme sample problem. Chapter 6 explains the two methods by which the Anti-join can be added to MCDB-R.
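
To make the Anti-join difficulty from Section 1.1.3 concrete, the following is a minimal Python sketch, not the MCDB-R operator itself. The function names, the dictionary-based rows, and the per-instance list representation of a stochastic attribute are all assumptions made for illustration. The first function is the ordinary deterministic anti-join; the second shows that, once the join attribute of S takes a different value in each DB instance, whether a tuple of R survives has to be decided per instance, which is why every potentially matching tuple of S must stay available.

```python
def anti_join(r_rows, s_rows, key):
    """Deterministic anti-join: rows of R with no match in S on `key`."""
    s_keys = {s[key] for s in s_rows}
    return [r for r in r_rows if r[key] not in s_keys]


def anti_join_per_instance(r_rows, s_bundles, key, n_instances):
    """Anti-join when S's join attribute is stochastic.

    Each S row stores a list with one sampled value per DB instance
    (a hypothetical, simplified stand-in for a tuple bundle). For every
    R row we return one boolean per instance: True when no S row matches
    in that instance, so the R row survives there.
    """
    result = []
    for r in r_rows:
        survives = [
            all(s[key][i] != r[key] for s in s_bundles)
            for i in range(n_instances)
        ]
        result.append((r, survives))
    return result


if __name__ == "__main__":
    R = [{"item": 1}, {"item": 2}, {"item": 3}]
    S = [{"item": 2}]
    print(anti_join(R, S, "item"))

    # Two stochastic S tuples, three DB instances each.
    S_stochastic = [{"item": [2, 3, 2]}, {"item": [1, 1, 4]}]
    for row, survives in anti_join_per_instance(R, S_stochastic, "item", 3):
        print(row, survives)
```

In this toy setting the R tuple with item 2 is dropped in the first and third instances but kept in the second, so replacing a single sample in S can flip the outcome for a tuple in R.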

PAGE 21

CHAPTER 2
MCDB-R OVERVIEW

MCDB-R is a system that facilitates risk analysis on high-dimensional uncertain data [6]. Performing risk analysis requires answering queries about risky rare events, i.e. those that occur with very low probability but could have huge impact if they do occur. Data uncertainty in MCDB-R is modeled as probability distributions. The behavior of uncertain variables is captured by probability density functions. Risk analysis on such data can be done by observing the low probability regions (the upper and lower tails) of a query-result distribution function. For a given query, analytical forms of either the result distribution or any estimates on the result distribution are very unreliable and difficult to derive for high-dimensional distributions. Therefore, MCDB-R uses Monte Carlo sampling to approximately calculate statistics on the result distribution.

While using the Monte Carlo methods, the uncertain database in MCDB-R is represented by multiple possible database (DB) instances (or database samples). These possible DB instances are generally called possible worlds in the probabilistic databases literature [18]. Each instance is created by replacing each uncertain variable by a sample from its density function. Samples of an uncertain variable in different DB instances are independent of each other, even though they are identically distributed. This is made possible by using pseudo-random number generators for sampling. Now, running an aggregate query over the uncertain database returns multiple output values, one for each DB instance. The number of DB instances used is given as a parameter while specifying the query.

MCDB-R is developed for generating samples from the tail of the query-result distribution. It adapts a technique called Gibbs cloning from rare event simulation to perform this task efficiently. The Gibbs cloning method is used to sample from the tail of the distribution, and hence decreases the overall cost of the query. It is an iterative process that moves further into the tail of the distribution function after each iteration. Each iteration has a cloning step and a perturbation step. In a cloning step we delete

PAGE 22

the samples that are away from the tail and replace them with exact copies of samples that are further into the tail. The new set of samples will be highly correlated due to the cloning. Hence, these samples are randomly perturbed such that the new samples after perturbation are reasonably uncorrelated. Due to the high dimensionality of the data, a Gibbs sampler is used for perturbation.

This chapter gives an overview of query processing in MCDB-R. First, we provide a high-level understanding of risk analysis in MCDB-R using an example. We then provide the relevant Monte Carlo and rare event simulation background required to understand the system. Next, we describe the query language in this system. Finally, different operators in MCDB-R are explained. We only explain operators that are fairly different from those in a standard relational database system.

Figure 2-1. CustOrder table: (a) Deterministic, and (b) Uncertain

Figure 2-2. Stochastic CustOrder table

Example: Let us revisit the example query from Chapter 1: "what would the profits be if we raised all the prices by 5% this year?" Assume the following tables in the database: CustOrder(CUSTID, ITEM, QTY) and Price(ITEM, PRICE). In the table CustOrder, CUSTID is the key, and QTY is the number of ITEMs bought by the specific customer. The new demand from each customer for each item (call it NEWQTY) is not known after the price change. The deterministic table CustOrder and the new uncertain version of it with the attribute NEWQTY are shown in Figure 2-1. Since the values for NEWQTY are unknown, they are denoted by ??. A Bayesian model is used to estimate the new demand. The attribute NEWQTY is called a stochastic attribute since it does not have a fixed value, and is defined by a probability density function. In this specific setting, we use two Gamma functions as the representative model for the attribute NEWQTY. Since the NEWQTY attribute takes integer values, the floating-point sample value coming out of the Gamma functions will be rounded to the nearest integer. The new stochastic relation CustOrder is shown in Figure 2-2. Notice that the parameters for the DblGamma function can be different for each tuple.

The result of the query executed on a stochastic database will also be a probability distribution. Analytically solving for the result distribution is difficult due to the high dimensionality of the database. So MCDB [37] uses Monte Carlo sampling to answer the query. Multiple instances of the stochastic relation are generated using sampling. Sample instances are independent of each other. This step of sampling is depicted in Figure 2-3. In this figure the number of instances created is N. The query, when executed on these DB instances, gives one result for each DB instance. These results can be considered independent and identically distributed samples from the query-result distribution. The final result consists of N values, and can be represented as a histogram as shown in Figure 2-4. If the query is run separately on each DB instance, then the system will be very slow. This performance hit is managed in MCDB by the
grouping of all instances of a tuple together, called a tuple bundle, and running the query only once for all the DB instances.

Figure 2-3. Multiple instances of CustOrder after Monte Carlo sampling

Figure 2-4. Query-result histogram
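
The tuple-bundle idea described above can be illustrated with a small sketch. This is not the MCDB implementation: the table contents, the prices, the use of a single Gamma distribution per tuple (the text uses a DblGamma model), and the helper names make_bundle and total_revenue are all assumptions made for illustration. Each tuple carries one NEWQTY sample per DB instance, and the aggregate is computed in a single scan that yields one result per instance.

```python
import random

N = 5  # number of DB instances (possible worlds); an illustrative choice

# Hypothetical per-customer Gamma parameters (shape, scale) for NEWQTY.
cust_order = [
    {"CUSTID": 1, "ITEM": "pen", "shape": 2.0, "scale": 3.0},
    {"CUSTID": 2, "ITEM": "ink", "shape": 4.0, "scale": 1.5},
]
price = {"pen": 1.05 * 2.00, "ink": 1.05 * 8.00}  # prices after a 5% increase


def make_bundle(row, n):
    """Attach n NEWQTY samples to one tuple, one sample per DB instance."""
    samples = []
    for i in range(n):
        # A separate, reproducible PRNG stream per (tuple, instance) pair.
        rng = random.Random(f"{row['CUSTID']}:{i}")
        samples.append(round(rng.gammavariate(row["shape"], row["scale"])))
    return {"CUSTID": row["CUSTID"], "ITEM": row["ITEM"], "NEWQTY": samples}


def total_revenue(bundles, price, n):
    """Scan the bundles once and return one SUM(NEWQTY * PRICE) per instance."""
    totals = [0.0] * n
    for b in bundles:
        for i, qty in enumerate(b["NEWQTY"]):
            totals[i] += qty * price[b["ITEM"]]
    return totals


bundles = [make_bundle(row, N) for row in cust_order]
results = total_revenue(bundles, price, N)  # N samples of the query result
```

The list results plays the role of the N values that would be summarized as the histogram of Figure 2-4; the query logic runs once over the bundles rather than once per DB instance.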


Figure 2-5. Gibbs cloning: Generating samples

Figure 2-6. Gibbs cloning: Deleting lower 50%

Now, let's consider performing extreme risk analysis on this decision query (increasing prices by 5%). For that, we want to know the worst possible consequences of such a price increase. For example, the lowest or highest possible total profit value that can occur with a probability of 0.001 is of interest. We need a decent number of samples (100 or more) in the 0.001 quantile space to estimate the profit/loss value accurately, and also to find out what led to such bad profits by analyzing the samples. One approach for obtaining the samples is to generate a large number of
DB instances using MCDB and then select only those which fall in the region of interest of the distribution. This method is not efficient if the probability we are looking at is too small.

MCDB-R solves this efficiency problem by adapting the Gibbs cloning approach to the MCDB system. Gibbs cloning is an iterative process that moves samples closer to the tail of the distribution function. Each step deletes the uninteresting samples and replaces them with clones of interesting samples. This creates a larger number of samples, all in the new region closer to the target tail. The new set of samples will be highly correlated due to the cloning. These samples are randomly perturbed such that the new samples after the perturbation are reasonably uncorrelated. During perturbation we need to make sure that the new samples are in the new region we obtained after deleting uninteresting samples. This perturbation is done by a Gibbs sampler due to its ability to perform multidimensional sampling while satisfying the constraints of the new region.

Figure 2-7. Gibbs cloning: Clone the elite samples

This Gibbs cloning process is explained through figures. Figure 2-5 shows the query-result distribution with four initial samples, denoted by the small yellow circles. Assume that our target is to gather samples from the right side of the distribution function. The next step is to delete the half of the samples which are farthest from the right tail, as shown in Figure 2-6. In this figure, the vertical line shows the boundary of the new region
we created. This line acts as a boundary for any perturbation we perform later. In the next step we clone the remaining two 'elite' samples, to create four samples in total. We can see the cloned samples as the green circles in Figure 2-7. We use a Gibbs sampler to perform the perturbation while satisfying the constraint. The four new samples after the perturbation are shown in Figure 2-8. Note that all the new samples are beyond the vertical line, which is our constraint. This process is repeated as many times as required to move into the tail as far as the query requires. The process of Gibbs cloning, along with the Gibbs sampler, is explained in the next section.

Figure 2-8. Gibbs cloning: After perturbation with Gibbs sampler

2.1 Background

2.1.1 Rare Event Simulation

Rare events are events that occur with very low probability but could have extreme impact when they do occur. Generating samples in the rare event space for risk analysis is useful in many applications. Some applications which benefit from this are in the financial domain and network reliability [10, 53]. Using basic Monte Carlo for rare event simulation is not efficient. When the probability of the rare event space is $p$, obtaining a single sample in that space requires $1/p$ basic Monte Carlo simulations on average. If $p$ is $10^{-6}$, which is possible in many applications [53], getting 100 samples of interest requires $10^{8}$ samples in the full event space, which is clearly inefficient. Two popular techniques
used for this simulation are importance sampling and cloning [10, 11, 54]. Importance sampling is not good for high-dimensional data because of the problem of numerical instability [10]. Also, in this application the dimensionality is typically equal to the number of tuples in the relation, which renders importance sampling useless.

Cloning: We view a random database $D$ as a vector of random variables over a very high-dimensional joint distribution. $D$ consists of multiple DB instances $D_1, D_2, \ldots$, each representing a sample from the joint distribution. Consider an aggregate query over $D$. The cloning algorithm first generates $n_1$ DB instances. At step $i$ it keeps only the top $100 p_i\%$ of the instances (decided based on the aggregate value of the query over each instance) and clones these top instances to obtain $n_{i+1}$ new instances. All these are in the top $100 p_i\%$ but are not independent. It randomly perturbs these DB instances within the new tail region to make them independent of each other. The random perturbation process is explained in Section 2.1.2 below. A single step is called a cloning iteration, and after each such iteration the algorithm moves further into the tail. The number of DB instances, the number of cloning iterations, and the values of $n_i$, $p_i$ depend on the quantile specification of the query.

2.1.2 Gibbs Sampling

The technique above explains how to walk towards the tail by cloning and randomly perturbing DB instances at each step. In this section, we look at Gibbs sampling, which is used for randomly perturbing a given DB instance. Gibbs sampling is a Markov Chain Monte Carlo algorithm for obtaining a sequence of random samples from a joint probability distribution. The Gibbs sampler requires samples from the conditional distribution of each dimension to perform the perturbation of the multi-dimensional sample.

Let $X = (X_1, X_2, \ldots, X_r)$ be an $r$-dimensional random vector. $X$ follows a distribution $h$ and takes values from the discrete set $\Omega^r$. Here $h(x_1, \ldots, x_r) = P(X_1 = x_1, \ldots, X_r = x_r)$, for $x = (x_1, \ldots, x_r) \in \Omega^r$. As mentioned earlier, the Gibbs sampler needs to be able to generate
samples from conditional distributions, $h_i(u \mid v) = P(X_i = u \mid X_j = v_j \text{ for } j \neq i)$. $h_i$ is the conditional distribution for the $i$th dimension, where $1 \le i \le r$. Given an initial vector $X^{(0)} = (X^{(0)}_1, \ldots, X^{(0)}_r)$, the Gibbs sampler generates the random vectors $X^{(0)}, \ldots, X^{(k)}$. Vector $X^{(j)}$ is obtained from $X^{(j-1)}$ by replacing each of its dimension values $X^{(j-1)}_i$ using $h_i$, where $1 \le j \le k$.

Consider the perturbation of a DB instance $D_1$ from the previous section. Treat $D_1$ as similar to $X$. The perturbation process should keep $D_1$ in the current tail range, seen as a cutoff value on the aggregate, and replace the samples in each of its dimensions. In our setting the dimensionality $r$ is typically the number of tuples in a relation, and the dimensions are independent. If we assume such independence then $h(x) = \prod_{i=1}^{r} h_i(x_i)$, where $h_i(u) = P(X_i = u)$. Let $Q$ be a real-valued function on $\Omega^r$, mimicking an aggregate, and let $c$ be the cutoff value for the tail, on which the distribution $h$ is conditioned. Now for perturbing the $i$th dimension we need a sample from $h_i(x \mid c) = P(X_i = x \mid Q(X) \ge c)$. We can do this by a rejection algorithm: keep sampling from $h_i(x)$ until $Q(X) \ge c$ is satisfied. For more details see [6].

2.2 Schema Specification

The first step in specifying a query in the MCDB-R system is defining a random (or stochastic) relation. A relation is random if it has one or more stochastic attributes. Such attributes are populated through samples from parametric probability distribution functions. These parametric distribution functions are specified through the Variable Generating (VG) functions, which are explained in the next section. During the query execution, each stochastic attribute in a tuple is replaced by samples generated from the corresponding VG function.

Once the random relations are specified, the next step is writing standard Structured Query Language (SQL) for the aggregation. The aggregate operator is specified later, as part of the MCDB-R-specific RESULTDISTRIBUTION block. The query compiler then converts these SQL statements into a query plan, which includes both standard relational operators as well as some new operators to generate
and process stochastic attributes. Finally, the parameters required for the tail sampling are specified. These parameters determine how far along the tail the final samples are generated. An example is given below for a better understanding of the process.

Example: The example query we use here describes a predictive model of customer demand. This query simulates the effect of a 5% increase in price on the final profits. For calculating the new profit we need the new customer demand due to the increased price. The demand is estimated using a Bayesian model on the existing data. A common prior distribution for demand is used for all customers. Then the posterior distribution is calculated based on the parameters specific to each customer. The VG function outputs the possible customer demand values due to changes in price. This query requires the specification of the random relation demands as given below.

 1. CREATE TABLE demands (new_dmnd, old_dmnd,
 2.     old_prc, new_prc, nd_partkey, nd_suppkey) AS
 3. FOR EACH l IN (SELECT * FROM lineitem, orders
 4.     WHERE l_orderkey = o_orderkey AND
 5.     yr(o_orderdate) = 1995)
 6. WITH new_dmnd AS Bayesian(
 7.     (SELECT p0shape, p0scale, d0shape, d0scale
 8.     FROM params
 9.     WHERE l_partkey = p_partkey)
10.     (VALUES (l_quantity, l_extendedprice * (1.0 -
11.     l_discount) / l_quantity, l_extendedprice *
12.     1.05 * (1.0 - l_discount) / l_quantity)))
13. SELECT nd.value, l_quantity, l_extendedprice *
14.     (1.0 - l_discount) / l_quantity, 1.05 *
15.     l_extendedprice * (1.0 - l_discount) / l_quantity,
16.     l_partkey, l_suppkey
17. FROM new_dmnd nd

A random relation is generated on the fly. Therefore, the above statement needs to specify how each of the attributes is derived, either from base tables in the case of deterministic attributes or from the parametric VG function in the case of stochastic attributes. Lines 1 and 2 give the table name along with the list of attributes. The sub-query from lines 3 to 5 creates the base table, and each tuple in it will contribute to a set of tuples in the random relation demands. The stochastic attribute new_dmnd is generated from the distribution function specified from lines 6 to 12. The name of the VG function used for sample generation is specified (as Bayesian) on line 6. The table params has the parameters required for the Bayesian VG function. The params table can have one or more tuples for each matched tuple in the base table. The join condition between these two tables is given at line 9. In the predicate, l_partkey is from the base table and p_partkey is part of the params table. Some parameters to the VG function are based on the base relation. These are specified after the keyword VALUES at the beginning of line 10. The relation new_dmnd(value) is created from this entire VG function block (lines 6 to 12). The attributes in the relation demands are populated according to the SQL statement in lines 13 to 17. In this example, the first attribute new_dmnd.value is stochastic, and the rest are deterministic. All the deterministic attributes are calculated as functions of attributes in the base table.

18. SELECT SUM(new_prf - old_prf) AS totalProfit
19. FROM (
20.     SELECT
21.     new_dmnd * (new_prc - ps_supplycost) AS new_prf,
22.     old_dmnd * (old_prc - ps_supplycost) AS old_prf
23.     FROM partsupp, demands
24.     WHERE ps_partkey = nd_partkey AND
25.     ps_suppkey = nd_suppkey)
26. WITH RESULTDISTRIBUTION MONTECARLO(100)
27. DOMAIN totalProfit <= QUANTILE(0.001)
28. FREQUENCYTABLE totalProfit

Writing the aggregate part of the query is quite straightforward and is given from line 18 to 25. Note that the stochastic attribute new_dmnd is part of the aggregate. Information for the tail sampling operator is given from line 26 to 28. The value totalProfit is the aggregate calculated in this query. The quantile of the final samples in the query-result distribution is specified with the QUANTILE keyword. For this query the samples are in the $10^{-3}$ quantile. The number of samples generated within that quantile is given with the MONTECARLO keyword in line 26, which is 100 in this query. The direction of the tail is given by <=. In this query, the samples are to be generated in the lower tail of the query-result distribution. The FREQUENCYTABLE keyword specifies where the samples generated by the query are stored. A table FTABLE(totalProfit, FRAC) is created, where in each tuple totalProfit is a distinct sample value and the corresponding FRAC is the fraction of the 100 samples in which that value was observed.

2.3 The Variable Generating Function

The VG function is one of the vital components of this system. It takes in parameters and pseudo-random number generator (PRNG) seeds and returns the samples, which are then stored in the stochastic attributes. The framework surrounding the VG functions allows users to provide their own functions, along with the process by which each function receives its parameters and returns its samples. A standard set of VG functions like Normal() or Poisson() is already provided with the system. The framework is very flexible and allows specification of a variety of VG functions [38], including multidimensional distribution functions with correlations.

Framework: As mentioned above, the VG function framework is designed to allow users to create their own functions. The Bayesian VG function mentioned in the previous section is fairly complicated and is not part of the standard library functions.
Any custom VG function like that can be added to the library. Any VG function should implement the following five methods: Initialize(), TakeParams(), OutputVals(), PrepareNextTrial() and Finalize().

Initialize() takes in an array of PRNG seeds. These are the seeds that the VG function uses for its pseudo-random number generator. The size of this array should be the same as the number of samples to be generated. This is because, in MCDB-R, each sample has its own seed. So each of the independent and identically distributed samples generated has its own series of pseudo-random numbers, and can be generated independently of the previous samples. This is vital because MCDB-R can require samples to be generated out of sequence. Refer to Section 2.7 for more on why we need samples out of sequence. This method also initializes the pseudo-random number generator using the first of the seeds, as well as any data structures and variables required by the VG function.

TakeParams() is used to send any parameters required by the probability distribution function used in the VG function. The parameter values generally vary per tuple, and this information is specified by the FOR EACH clause in the CREATE TABLE statement. Each tuple may need multiple calls to TakeParams(), depending on the VG function.

OutputVals() returns the samples from the VG function. These samples can be multi-dimensional; therefore, for each sample a sequence of calls is made to OutputVals(). The method returns a NULL after it returns the value for the final dimension of the sample. Each non-NULL return value results in a separate tuple for the sample. So a single input tuple can result in multiple output tuples.

PrepareNextTrial() is called at the end of each sample, i.e., after each NULL return value from an OutputVals() call. This method will initialize the pseudo-random number generator to the next PRNG seed.

Finalize() is called at the end, after all the samples are generated. This method will de-allocate any data structures created by the VG function and the seed array. After a call to Finalize(), the VG function object is fully reset and is ready for an Initialize() call from the next tuple.

Consider the Bayesian VG function in the query from the previous section. The parameter list sent to TakeParams() has the values (p0shape, p0scale, d0shape, d0scale). A single call to TakeParams() is sufficient because the query has a single parameter table and one tuple each for the parameters. If the VG function were a multivariate probability distribution, sending the covariance matrix would have required multiple
calls to TakeParams(). A call to OutputVals() will again return only a single value, the new_dmnd.value. A second call to OutputVals() will return a NULL, and the subsequent call to PrepareNextTrial() will advance the PRNG seed.

Note that this framework only specifies how the parameters and PRNG seeds are sent. The process of sending the correct parameters and collecting the returned samples to form the final tuple is done outside the VG function. This process works as a wrapper around the VG framework and is described in Section 2.5 as part of the Instantiate operator.

2.4 The Seed Operator

This operator adds Seed attributes to random relations. For each VG function in the CREATE TABLE statement of a random relation, one Seed attribute is added. The Seed attribute consists of a seed identifier and the current set of active samples. The seed identifier must be unique over all the random relations in a query. This identifier is used as part of all the PRNG seeds created for this tuple. So multiple tuples with the same seed identifier will create unwanted correlations in the final result.

Figure 2-9. An example Seed attribute

Along with the seed identifier, the attribute also stores the number of DB instances for this query, and a list of the current iterations assigned (called active iterations) for each of these DB instances. The number of DB instances is the same as the number of samples requested in the query with the MONTECARLO keyword. The need for the list of these iterations is explained later in Section 2.6.1. Figure 2-9 shows a Seed attribute. The attribute has 1 as its seed identifier, and the number of DB instances is 4. The third
value points to the list of active iterations. The list has one value for each of the 4 DB instances. The iteration number assigned for DB instance 1 (DB1) is stored in the "DB1 iter." cell.

2.5 The Instantiate Operator

Instantiate is a fundamental operator in MCDB-R. This operator acts as a wrapper around the VG function. It takes the random relation and a series of parameter relations as input. The parameters from the inner relations are matched with the corresponding tuple in the outer relation through a join predicate. For each tuple from the outer relation, a series of calls to the VG function is made. These calls send the information required by the VG function, including the PRNG seed, parameters and an identifier. Once the VG function returns the samples, they are merged with the corresponding tuple in the random relation to form the final tuple.

Example: Let orders_rnd(orderkey, year, total AS Normal($\mu$, $\sigma$)) be a random relation (see query Q2 in the Appendix). This relation is generated by passing order(orderkey, year, seed) as the main input table, and params(orderkey, $\mu$, $\sigma$) as the parameter table. Instantiate matches the parameters with the corresponding tuple in order based on the join on orderkey. The attribute order.seed contains the base seed used to calculate the PRNG seed array.

The Instantiate operator generates the samples from a parametric VG function. For each random tuple, Instantiate receives the name of the VG function to be used, the parameters to it and the locations to store the generated samples. Parameters can be passed directly through the main outer input, or through multiple parameter inner inputs. See [37] for more details.

The inner working of this operator is shown in Figure 2-10 for one input parameter table. If we consider the example from the last paragraph, the inner input will be the table params. The outer input is the order relation. Once Instantiate gets one tuple from the outer relation, it makes multiple copies of the tuple, one for each inner input pipe.
These copies are used to join with the parameter tables based on the key attribute. Through this join, all the parameters related to the key can be brought together by the sort operation just after the joins. Remember from the previous section that a seed identifier is unique to each key in the outer input relation. Therefore, sorting on the seed identifier will bring together all the parameter tuples and the outer input tuple. Sorting is not done on the key because the key is no longer needed inside Instantiate; it can be dropped, and the seed becomes the new identifier for the tuples.

Figure 2-10. Instantiate operator

After sorting on the seed identifiers is finished, the parameters required for the VG function are brought together. The parameters can now be rearranged as required and sent to the VG function. In the above example, ($\mu$, $\sigma$, seed) are sent to the
VG function. Once the VG function generates the sample array, it is returned to Instantiate and is then merged with a copy of the outer input tuple. The outer input pipe is also sorted on the seed identifier, therefore the merging process is straightforward. The merge function will push the resulting tuples into the output pipe going to the next operator.

Note that many VG functions need more than one parameter table, and more than one tuple from a parameter table can join with a single key from the outer input relation. One extreme example of this is a multivariate distribution function. This function requires the covariance matrix to be passed to the VG function. One way to pass this matrix is through a sparse matrix representation. The Instantiate operator and VG function framework are very flexible and designed to allow all kinds of complex input methods.

Instantiate is one of the expensive operators in this system. One reason for this is the joins and sorts on the inner and outer tables. Since this operator also encapsulates the VG function, expensive sample generation will further slow down the operator. The other major contributor to the expense is the memory allocation for the sample array. The sample arrays are generally very large, hence even the data movement between the VG function and the array is considerable.

2.6 Gibbs Tuple Bundles

For generating numdb (number of DB instances) samples in the tail, we need to perform cloning and perturbation on the initial DB instances. To keep track of the current samples used, we maintain the DB instance information in the tuple, stored as part of the Seed attribute. A stochastic attribute is a stream of random samples, and only numdb of them are in use at any point in time (assigned to the numdb DB instances). The size of the sample stream is denoted by nums. At the perturbation step we do not know which of the samples will be used in the new DB instances. So we carry the whole sample stream without any filtering (due to selection predicates) on stochastic attributes. The filtering is performed once the sample is assigned to a DB instance. The stochastic table
CustOrder from Figure 2-3, with tuple bundles and seed, is shown in Figure 2-12. In this relation NEWQTY is the stochastic attribute. The multiple DB instances on the left side of Figure 2-12 are the logical representation, whereas STOC.CustOrders on the right side is the storage representation in MCDB-R.

TailSampler():
 1. Parameters:
 2.   n: number of cloning iterations
 3.   numdb: number of DB instances
 4.   numcloned: number of DB instances cloned at the end of each iteration
 5.   nums: size of the sample stream
 6. Variables:
 7.   c: cutoff value for the aggregate
 8.   PQ: priority queue on seed identifiers
 9.   R: input relation with Gibbs tuples
10.   r: current Gibbs tuples with same seed identifier
11. InitializeDB(R, numdb)
12. PQ ← (R)
13. for each iteration i from 1 to n
14.   Clone(PQ, numdb, numcloned, c)
15.   PQ.reset()
16.   while (r = PQ.nextSeed())
17.     for each DB instance j from 1 to numdb
18.       while (r[j].Advance())
19.         if Agg(r[j]) ≥ c
20.           found = true
21.         end if
22.       end while
23.       if not found
24.         nums = nums * 2;
25.         QueryRerun()
26.       end if
27.     end for
28.   end while
29. end for

Figure 2-11. Tail sampler operator

We run the query plan multiple times to replenish the sample stream whenever we run out of samples. Due to operations like the stochastic join, some new tuples can
be added and old tuples eliminated in the new run. So there is a need to maintain persistent DB instance information outside a query plan execution. So for each unique Seed in the Gibbs tuples we have a Tail-Sampling-Seed (TSSeed) object on disk. A Seed and its corresponding TSSeed are synched whenever a modification is performed on the Seed. A TSSeed object stores the seed identifier, PRNG seed and the DB instance information. It also stores the current in-use range of the sample stream and the start of the unused samples.

Figure 2-12. CustOrder with Gibbs tuples

2.6.1 Tail-Sampling-Seed and Seed Queue

Figure 2-13 shows an example TSSeed object and the corresponding Seed object in the tuple. Each TSSeed object stores the seed identifier for the tuple. This identifier is used to match the TSSeed objects and the related tuples in the query pipeline. A TSSeed stores the current range of generated samples, and the current assigned range. It also maintains the current assigned iteration number for each of the DB instances. The TSSeed class has the public function GetNumSeeds(). GetNumSeeds()
returns a list of PRNG seeds used later in the VG function for seeding the pseudo-random number generator. Each of the PRNG seeds returned is unique, and is calculated through a hash function on the seed identifier and the iteration number. The TSSeed object provides the method Advance() to move the current iteration number to the next previously unused iteration number. This method is used in the Gibbs sampler to replace the current sample with a new one. Advance() returns an error value if the sample array is exhausted.

Figure 2-13. Example: TSSeed and Seed

All the changes to assigned iteration numbers are done through the TSSeed, and the changes are reflected onto the Seed through a Synch() function. The Seed object also stores the current assigned iteration numbers. But these iteration numbers are relative to the positions in the sample array (the stochastic attributes). Synch() transforms the absolute iteration numbers in a TSSeed object into the relative iteration numbers in the Seed object.

2.6.2 Split Operator

Figure 2-14. Stochastic table in a join

Figure 2-15. Stochastic table after a Split

A stochastic array attribute stores multiple instances of the attribute, hence a fixed order cannot be defined on this attribute. So operators like the join, duplicate elimination, multi-set union, intersection, and difference cannot be applied on stochastic array attributes directly. A Split operator is used internally in each of these operators to break up a stochastic attribute with an array of samples into a set of smaller attributes, each with one unique value in the sample array. A bit string is used to store the locations of that value in the sample array. The new attribute with a single sample is called a stochastic single attribute. A bit string is always stored along with the stochastic single attribute.

Figure 2-16. Join result tuples

Consider the two tables in Figure 2-14. The right table has a stochastic attribute. To simplify the example, the Seed attribute is not shown in the right table. The join predicate is on the STOC.ARRAY and the JOINATTR attributes. Figure 2-15 shows the left table after the split. Now it is straightforward to match the tuples on the join predicate. The result tuples after the join predicate are shown in Figure 2-16. The attribute STOC.ARRAY is broken up and is renamed as STOC.SINGLE. The new attribute BITSTRING keeps track of the occurrences of the stochastic single value in the stochastic array. Note that the appearance of these tuples in the join result depends on whether the iteration
numbers of the active bits in BITSTRING are currently assigned to a DB instance. If only the iteration number 0 is active, then the first tuple in Figure 2-16 with stochastic single value 1 will not appear in the result, whereas the second tuple with value 2 will appear. If an Advance() call is made, then the active iteration number is 2. In this case, tuple 2 will not appear and tuple 1 will appear in the result.

2.7 Gibbs Looper and Tail Sampler

The Gibbs looper takes as input the number of DB instances to be generated, numdb, and the quantile to be estimated. It then calculates the number of cloning iterations and the number of DB instances to be cloned after each iteration. Figure 2-11 gives the Tail sampler method. This method performs the necessary cloning iterations on the given input Gibbs tuples. We need to operate on all Gibbs tuples with the same seed identifier together, so a priority queue PQ is used for that purpose. At the beginning of each iteration we clone (line 14) the top DB instances. The while loop at line 16 goes through the Gibbs tuples and performs the perturb operation. All Gibbs tuples with the same seed identifier are returned by the PQ.nextSeed() call. Now the samples in each DB instance of this set of tuples r are replaced (lines 17-27). The while loop from lines 18-22 performs the rejection algorithm. After an Advance(), the new sample is accepted only if the new aggregate value is greater than (or smaller than) the cutoff.

2.7.1 Gibbs Queue

The Gibbs queue is used by the Gibbs looper to retrieve the records in seed order. The Gibbs queue is a disk-based priority queue which allows online insertions while keeping the sort order. In other words, it allows removal of records and, at the same time, insertion of new records. The Gibbs queue is designed such that interleaved remove() and insert() operations are performed with good efficiency. Figure 2-17 shows the mechanism of the Gibbs queue. The first phase is similar to a disk-based two-phase merge sort [26]. A series of sorted runs is created on the disk. A primary priority queue is set up to merge records from the runs. For disk efficiency we employ page
buffering. The Gibbs queue also has another in-memory (secondary) priority queue, which is used to hold the reinserted tuples. The top tuple can now be in either the primary or the secondary queue. This mechanism keeps the Gibbs queue ordered even after reinserts. The secondary queue is as big as a disk-based run. Whenever this queue is full, a new run is created on the disk. The new run is also attached to the primary queue. The secondary queue will now be empty and is ready to receive new tuples that are reinserted. This functionality of reinserting a tuple is very important for the Gibbs looper, since there is a possibility that a Gibbs tuple can have more than one seed. In such a case the tuple should be processed at the Gibbs looper once per seed. How to process such tuples is explained in more detail below.

Figure 2-17. Gibbs queue

Preparing tuple bundles with multiple seed identifiers: The Gibbs queue holds the tuple bundles in sorted order according to the seed values. But it is possible that a tuple bundle has more than one seed. This can happen for two reasons: (1) the
tuple can have more than one random attribute, generated by different VG functions, and hence has one seed for each VG function; (2) tuples from two relations, each with random attributes, are merged inside a Join operator. The tuple bundles with more than one seed need to be processed by the Gibbs looper once per seed. For this, the Gibbs looper should be informed that the tuple has more seeds and needs to be reinserted into the queue. A function Advance() is provided by the tuple for this. The tuple bundle also provides two other functions to allow this iteration over seeds: BuildSeeds() and ResetSeeds(). BuildSeeds() is the initialization function, which collects all seeds in the tuple, sorts them and stores them in an array in ascending order. It also initializes an index on the array pointing to the least-valued seed. An Advance() moves the index to the next seed. Advance() returns false if the array is exhausted. ResetSeeds() moves the index back to the first seed. The Gibbs looper, after processing a tuple, will Advance() the seed. If the tuple has more seeds, then it will reinsert the tuple into the Gibbs queue.

Figure 2-18. Example: End of run 1

2.7.2 Rerunning the Query Plan

The query plan is rerun if we are out of samples for a given seed identifier before finding a suitable replacement for all the DB instances. For each successive run the number of samples generated for each seed identifier is doubled, provided the record fits inside the page. Once the tail sampler decides to rerun the query, it will delete all the existing tuples, flush all TSSeeds and reset the seed queue. If within page size
limits, the size of the sample stream is doubled. Once this is done, the tail sampler exits to give control to the query planner. Once the planner sees that the required number of Gibbs loops is not finished, it will restart the query plan. At Instantiate the new sample stream is started only from the last used sample of the previous run.

Figure 2-19. Example: Beginning of run 2

Figures 2-18 and 2-19 show the TSSeed and Seed at the end of the first run and at the beginning of the second run, respectively. In Figure 2-18 the range of sample iterations is from low iter. num. 0 to high iter. num. 5. It can be seen that the samples are exhausted in this array since the max. used iter. is 5, the same as the high iter. num. The current assigned iterations are the same for both the TSSeed and Seed objects since the starting iteration number in the TSSeed is 0. For this example, assume that the size of the sample array for the second run remains 6. In Figure 2-19 the low iter. num. in the second run will start at 2, since that is the largest used iteration number in the previous run (for DB instance 3). The max. used iter. will not change and the high iter. num. will be 7 (low iter. num. + 6 - 1). After the high iter. num. is updated, the TSSeed object is synched with the Seed object. The current assigned iteration numbers in the Seed will change because they are relative to the low iter. num. in the TSSeed.

2.7.3 Materializing the Operators

In the first execution of the query plan, the results of any subtree that is totally deterministic are saved to disk. The operator at the root of that subtree is marked as materialized, and in the next query run the results are read directly from the disk.
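
The materialization step just described can be sketched as follows. This is only an illustration under stated assumptions: the classes Scan and Materialized, the JSON file format, and the cache file name are hypothetical and are not taken from MCDB-R. The sketch shows a deterministic subtree whose result is written to disk on the first run and read back on later runs without re-executing the subtree.

```python
import json
import os
import tempfile


class Scan:
    """Deterministic leaf: returns fixed rows and counts how often it runs."""

    def __init__(self, rows):
        self.rows = rows
        self.executions = 0

    def run(self):
        self.executions += 1
        return list(self.rows)


class Materialized:
    """Wraps the root of a deterministic subtree and caches its output on disk."""

    def __init__(self, child, path):
        self.child = child
        self.path = path

    def run(self):
        if os.path.exists(self.path):   # later runs: read the saved result
            with open(self.path) as f:
                return json.load(f)
        rows = self.child.run()         # first run: execute and save
        with open(self.path, "w") as f:
            json.dump(rows, f)
        return rows


cache_dir = tempfile.mkdtemp()
scan = Scan([{"orderkey": 1, "year": 1995}, {"orderkey": 2, "year": 1995}])
plan = Materialized(scan, os.path.join(cache_dir, "subtree_0.json"))

first_run = plan.run()   # executes the subtree and writes the cache file
second_run = plan.run()  # served from disk; the scan is not executed again
```

Operators that consume stochastic input, such as Instantiate, would not be wrapped this way, because their output changes from one run to the next.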


Figure 2-20. Query plan before materialization

Figure 2-21. Query plan after materialization

Figures 2-20 and 2-21 show the plan for query Q1 (see Appendix) before and after the materialization process. In the query plan in Figure 2-20, the subtrees at the Seed and Group-By operators are totally deterministic. Both subtrees can be materialized at those operators. The next operator in the query plan is Instantiate, which is definitely not deterministic. So the materialization cannot get past Instantiate. Figure 2-21 shows the
queryplanaftermaterialization.Figure 2-22 showtherstthreerunsoftheexecution.Therstrunisthewholequeryplan,andsecondrunonwardsthemodiedplanwithmaterializationisused. Figure2-22. Queryplanwithmultiplereruns 47
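The choice of materialization points can be sketched in Python as a search for the roots of maximal deterministic subtrees. This is an illustration under assumptions, not the MCDB-R planner: the Op class, its stochastic flag, and the plan built below are hypothetical, and the plan only imitates the shape described for Figure 2-20 (deterministic subtrees at Seed and Group-By feeding Instantiate).

```python
from dataclasses import dataclass, field


@dataclass
class Op:
    """Hypothetical query plan operator."""
    name: str
    children: list = field(default_factory=list)
    stochastic: bool = False    # True for operators such as Instantiate
    materialized: bool = False


def is_deterministic(op):
    # A subtree is deterministic if no operator inside it is stochastic.
    return not op.stochastic and all(is_deterministic(c) for c in op.children)


def mark_materialized(op):
    """Mark the root of every maximal deterministic subtree."""
    if is_deterministic(op):
        op.materialized = True
        return [op.name]
    marked = []
    for child in op.children:
        marked += mark_materialized(child)
    return marked


# Assumed plan shape: two deterministic inputs below Instantiate.
seed = Op("Seed", [Op("ScanOuter")])
group_by = Op("GroupBy", [Op("ScanParams")])
instantiate = Op("Instantiate", [seed, group_by], stochastic=True)
plan = Op("TailSampler", [instantiate])

marked = mark_materialized(plan)
```

Here the search stops at Seed and Group-By, and nothing at or above Instantiate is marked, which matches the statement that materialization cannot get past Instantiate.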


CHAPTER 3
RELATED WORK

The work presented in this dissertation is very specific to MCDB-R, and is therefore difficult to put in the context of previous research (the related work for the problem discussed in Chapter 4 is given in Section 4.6). In this chapter we cover the research loosely related to MCDB-R. Other works about integrating statistical operations with a relational database are explained in Section 3.1. Monte Carlo methods are used quite heavily in fields such as probabilistic databases (see Section 3.2), and data mining and machine learning (see Section 3.4). In database research, the area closest to MCDB-R functionality is top-k ranking in probabilistic databases. Probabilistic top-k is described in Section 3.3. State-of-the-art risk analysis software is explained in Section 3.5.

3.1 Statistical Operations in Relational Databases

MCDB-R is closely related to the idea of integrating statistical operations with relational databases, as in MauveDB [21], FunctionDB [63], and SciDB [59]. MauveDB integrates statistical models into a database and provides the user with model-based views. In that system a user can specify, create, and query these views. FunctionDB provides support for continuous functions inside the database, and SciDB aims to provide advanced analytical capabilities inside the database. MCDB-R is also related to conditioning in probabilistic database systems [41]. Conditioning is aimed at analytical calculation of the confidence values of tuples in the query result. It transforms a given probabilistic database of priors into a probabilistic database of posteriors. The confidence values of the tuples in a query result can then be calculated for the posterior database. MCDB-R is different, as it is based on Monte Carlo approximations instead of calculating exact confidence values.

3.2 Probabilistic Databases

Probabilistic databases are used to store data uncertainty in the relational model. They have been studied for a long time. Interest in them is due to applications like sensors, data integration, and cleaning. Methods like information extraction generate uncertain data [31] that can be managed by a probabilistic database. Probabilistic databases can also be used for what-if analysis [37], where the future data is modeled as probability distributions. Processing probabilistic data has been studied extensively in recent times; Trio [69], MayBMS [5], MystiQ [50], Orion [13], MCDB [37], PrDB [56], and BayesStore [67] are some of the recently developed probabilistic databases. For a recent survey of this field see [61]. Modeling uncertainty in these databases is done primarily in two ways: tuple level and attribute level uncertainty. In tuple level uncertainty, a tuple has a probability of occurrence in the relation; it can either occur or not. In attribute level uncertainty, the attribute value is a random variable.

Some of the early work on probabilistic databases is done in [34]. They define v-tables, which have variables that represent incomplete information. Many of the relational operators can be run on v-tables in the same way as on a relational table. The v-tables cannot support projection and arbitrary selections. They also define a conditional table (or c-table), a variation of the v-table with a condition for each tuple. This framework does not handle dependencies between attributes. [7] uses an attribute level uncertainty model. Their relation has a deterministic key and both probabilistic and deterministic attribute groups. Their probabilistic attributes represent a discrete probability distribution. They show that projection and selection are lossless. Some new operators specific to the probabilistic model are also given. For example, the x-select operator works by matching probability distributions that are close.

[24] presents a probabilistic relational algebra in which the tuples are assigned probability values. These values give the probability that the tuple belongs to the relation. They use intensional semantics to evaluate the query. In intensional semantics, each tuple has an associated event. After the query evaluation, each output tuple has an expression over the events from which that tuple is derived (the lineage of the output tuple). This expression can be very complex. They then need to solve this (sometimes complex) expression to find the probability of the tuple in the output relation. [42] stores probability intervals along with attributes in their probabilistic database system, ProbView. They define multiple relational operators like projection, selection, and Cartesian product on their data representation, along with a new operator called compaction. Compaction works like duplicate removal by combining the probability intervals of tuples with the same values. They also give an incremental algorithm to maintain views on their database for insertion and deletion operations.

In contrast to the intensional query evaluation in [24], [18, 50] use extensional query evaluation. In extensional evaluation, the operators calculate the actual probability value even at the intermediate stages of the query plan. This avoids having to solve, at the end of the processing, the complex expression that we saw in intensional semantics. [18] shows that some query plans give incorrect results with extensional evaluation. So they give a method to find a safe extensional query plan (by query rewriting) when it is possible to do so. For unsafe plans, they describe a Monte Carlo based simulation algorithm, along with an approximation algorithm. The complexity of executing queries in probabilistic databases is explored in [19]. They show that the complexity of evaluating a conjunctive query over a probabilistic database with the tuple independence assumption is either PTIME (Polynomial TIME) or #P-hard. They also give an algorithm that recognizes whether a query is safe or not.

Trio [2, 8, 69] keeps lineage information along with the uncertain data. In their model, the probability of occurrence is stored in the tuples and is called the confidence value. Some relational operators like projection are not closed on this data, but lineage is useful in calculating the new confidence values for a tuple after the operation. The query model in [8] allows the user to specify the minimum confidence value required for the tuples to be in the output (Top-k queries). They use lineage to calculate confidence values over the query results on uncertain data. This method is related to the intensional semantics, and the lineage expression can sometimes become very complex. In such a case a Monte Carlo approximation method [40] is used to solve the expression. [4, 5, 33] introduce U-relations to capture uncertainty in the data. A U-relation has random variables whose probability values are defined in another table. They achieve attribute level uncertainty through this representation. They too use an intensional evaluation model on their U-relations. The difference from previous approaches like Trio is that they take less space to store the uncertain data. To calculate the confidence values of output tuples (if the expression is too complex) they use Monte Carlo approximations [40].

MCDB [37] uses Monte Carlo sampling to estimate (user-defined) statistical models on uncertain attributes in a relational database. MCDB treats the database with uncertain attributes as a joint probability distribution. Since the dimensionality of that distribution is very high, Monte Carlo sampling is used for estimating query results over that joint distribution. Multiple database instances are created through sampling. Each uncertain attribute is defined by a probability distribution, and the attribute value is represented by a sample from that distribution. Each of the database instances is generated independently through a pseudo-random generator. For efficiency, MCDB bundles all the database instances of a tuple together and executes the query over that tuple bundle only once for all instances, instead of once per instance. MCDB's framework of user-defined statistical models allows flexibility in the types of models and the types of queries. PrDB [20, 56] uses probabilistic graphical models as base models. This gives PrDB the flexibility of representing tuple level uncertainty, attribute level uncertainty, as well as correlations within attributes and within tuples.

3.3 Top-k Ranking in Probabilistic Databases

Top-k ranking in probabilistic databases, although recent, is well studied [9, 15, 28, 32, 44, 49, 58, 65, 72]. There have been multiple definitions of Top-k tuples in a probabilistic database due to the existence of confidence (probability of occurrence) along with the actual value (or score). MCDB-R differs from this work as it generates values after a given quantile. Though top-k ranking can be used to generate values with low probability, it is not suitable for restricting the values within a given quantile.

[58] first studied this problem, giving two definitions of top-k tuples: Uncertain Top-k and Uncertain Rank-k. An Uncertain Top-k query returns a list of k tuples that appears as the top-k answer in most possible worlds. For Uncertain Rank-k, each tuple returned for a position should occur at that position in most possible worlds. That is, for the i-th rank, we return a tuple that occurs at the i-th rank in most possible worlds. The problem is formulated as a state space search. The algorithm scans the probable tuples in top-k and maintains a large number of states. Each state is a set of possible worlds that have common top tuples. [72] gives more efficient algorithms for Uncertain Top-k and Uncertain Rank-k. They use a representation called x-tuples in their algorithms. Each x-tuple has multiple options, and each option has an associated probability. These probabilities define a discrete distribution over all the options for that tuple. The algorithm uses a hash map to store the x-tuples, so that given the id of a tuple its confidence value can be retrieved quickly.

[32] gives a probabilistic threshold top-k query. A tuple is in the top k if it has at least probability p of being in the top-k lists over all possible worlds. [32] gives an exact algorithm to compute this, along with some pruning techniques that use the probability threshold to improve the efficiency. A sampling based approximate method is also given. [15] defines expected ranks in both attribute level and tuple level uncertainty models. The rank of a tuple t in a possible world is the number of tuples whose score/value is higher than that of t. They give exact algorithms and some pruning methods to calculate the top tuples based on expected rank, in both attribute level and tuple level uncertainty models.

Finding nearest neighbor probabilistic top-k objects is studied in [9]. They propose approaches based on input/output (I/O) and Central Processing Unit (CPU) cost. Their query processing algorithms are designed to minimize the total cost. [28] gives a method that puts some emphasis on scores in the top-k selection. They define the c-typical top-k, which returns c typical top-k vectors according to the score distribution (a typical set in information theory is a set of sequences whose probability is close to $2^{-e}$, where $e$ is the entropy of their source distribution). There is a good chance that the actual top-k score is close to one of these c vectors. In [49] top-k is defined only on probability values. The k tuples with the highest probability are given in the output. For efficiency, they run several Monte Carlo simulations together, one for each candidate in the result set, and calculate the probability values to the minimum accuracy required. A recent work [43, 44] provides a unified framework for learning multiple ranking functions over a probabilistic database. A set of key features is identified, and parameterized ranking functions are defined over these features. The parameters are chosen based on the user preferences. These parameterized functions can encompass many of the top-k methods defined above. [44] presents several efficient algorithms to calculate these ranking functions.

3.4 Monte Carlo Methods in Databases, Data Mining, and Machine Learning

The specific Markov Chain Monte Carlo (MCMC) based Gibbs cloning (or splitting) idea [10] adapted in MCDB-R was mainly developed for rare event simulation. [14] presents a novel algorithm for sensor deployment based on the splitting and Gibbs sampler approach from [10]. Their focus is to find the best spatiotemporal placement of the sensors to improve the detection of a target that is intelligent, reactive, and moving with a strategy to go undetected. This is an optimization problem, and genetic algorithms were previously used to solve it.

One of MCMC's popular uses is in machine learning, as part of the Bayesian inference framework. A good introduction to using MCMC for Bayesian inference is given in [25]. MCMC is used as part of the inference framework for modeling dynamic user interests in [3]. For scalability to millions of users, their inference procedure incrementally runs a fast batch Gibbs sampler on the data at a given time instance, given the state of the sampler at previous time instances. Gibbs sampling and latent Dirichlet allocation (LDA) have been popular in learning topic models from text corpora, since Gibbs sampling is, in theory, more accurate than the variational methods. Topic modeling on text documents is a popular area since it has applications in, among others, information retrieval and document clustering. The drawback of Gibbs sampling for this application is its efficiency, and hence its scalability to large datasets. A novel and faster method of collapsed Gibbs sampling for LDA is proposed in [47]. A collapsed Gibbs sampler integrates out some variables, and sampling is done over the simplified distribution with only the specific variables of interest. [64] used a collapsed Gibbs sampler for document clustering. The model used is a mixture of multinomials with Dirichlet priors. Through experiments, they show that using MCMC is better than using the Expectation-Maximization algorithm for their model. Various methods for topic modeling are compared in [71]. Some of those methods are based on Gibbs sampling, and some are variational inference methods like maximum entropy. They show that their new sparse LDA is faster than previous LDA based methods. Their performance improvement is due to a new data structure that results in fast sampling even with low memory usage. Scaling of MCMC methods to larger datasets is studied both in machine learning [12, 23, 57] and databases [66, 73].

The basic Monte Carlo method has also been used previously for query processing in probabilistic databases [27, 37, 39, 68]. Processing queries on array databases requires Monte Carlo simulations [27]. This requirement arises from the complexity of the array and vector operations required by applications in science and engineering. Using exact methods is infeasible due to the dimensionality, and methods like Gaussian mixture models cannot produce closed form results for operators other than + and -. A second reason is that the models used to represent correlated uncertain data in array data, like graphical models, require Monte Carlo approximate calculations. In [27], the authors use Markov Random Fields as the model and give solutions to complex operators like array join. MCDB [37] uses Monte Carlo sampling to estimate (user-defined) statistical models on uncertain attributes in a relational database. It provides samples from the query-result distribution, instead of a single value, as output. This paper is discussed in more detail in Section 3.2. Monte Carlo methods are used in [39] for evaluating lineage processing on correlated probabilistic data. Specifically, a Gibbs sampler is used in this setting for approximate evaluation of conjunctive queries with large complexity. While exact evaluation of lineage processing on the tuple independent model is polynomial time, the complexity for correlated tuples is #P-complete [39].

In [68] factor graphs (a type of probabilistic graphical model) are used to represent uncertainty over relational data. To avoid the #P complexity of query evaluation, [68] uses MCMC based inference. The specific MCMC algorithm they use is Metropolis-Hastings. They propose a new algorithm combining Metropolis-Hastings and materialized view maintenance. The view maintenance technique is used for performance improvement. In [22], MCMC is used for data cleaning. Their dataset consists of facts collected through crowdsourcing, and their main goals are to identify which of the facts are reliable, how to answer questions based on the collected data, and what questions to present next to the users to improve the data quality. The dataset is then treated as a probabilistic database, and some probabilistic and recursive rules are defined for interpreting the data. As we have seen before, evaluating queries exactly on probabilistic databases is infeasible. An MCMC based algorithm for approximate calculation of the query result can be used. But MCMC is guaranteed to produce good approximations only if the data cleaning rule set is strongly connected. A disconnected rule set is not ergodic, a required property for good MCMC approximations. [22] uses MCMC separately on each connected component of the rule set and combines the results.

Monte Carlo methods, apart from being used to evaluate queries over probabilistic databases, are also used to create probabilistic databases from uncertain or missing data. An inference algorithm based on Gibbs sampling is proposed in [60] for predicting the probability distribution over multiple missing values. They use this algorithm as part of their learning framework to create probabilistic databases. A joint distribution over the data is approximated with a set of conditional probability distributions that together are called a dependency network. Each of the conditional probability distributions is learned independently, and hence a consistent joint distribution cannot be guaranteed. Gibbs sampling inference techniques are used to recover an approximation of the joint probability distribution even though the individual conditional distributions are inconsistent.

3.5 Risk Analysis Software

Monte Carlo simulations are implemented in many risk analysis software packages [16, 45, 62]. MCDB [37] too allows scalable risk analysis with its flexible VG function framework. Oracle Crystal Ball [45] supports predictive modeling, forecasting, Monte Carlo simulation, and optimization. It is based on Microsoft Excel and supports what-if risk analysis. @RISK [16], along with an implementation of Monte Carlo simulations, provides RISKOptimizer to speed up the simulations. Software based on spreadsheets, like Crystal Ball and @RISK, will not be as scalable as MCDB because it lacks disk based algorithms; if an application exceeds the random access memory (RAM) size, most of the time is spent swapping data to and from disk. Other stand-alone risk analysis software packages like Analytica [62] are proprietary, and it is difficult to tell if their implementations are disk optimized. Analytica claims to be twenty times faster than risk software implemented on standard spreadsheets, so it might be using some disk based algorithms. Note that none of these risk analysis software packages, to our knowledge, supports tail sampling.


CHAPTER 4
SERIALIZING SAMPLE GENERATION PROCESS

Figure 4-1. An MCDB-R query plan

4.1 Background

The MCDB-R system uses sampling based Monte Carlo simulations to perform risk analysis on uncertain data. In MCDB-R, the data uncertainty is modeled through parametric probability distribution functions. The whole database is seen as a high dimensional joint distribution. Extreme risk analysis on this data can be done by observing the low probability regions, the upper and lower tails, of a query result distribution. For a given query, analytically solving for either the result distribution or any estimate on the tail of the result distribution is very difficult due to the high dimensionality. Therefore, MCDB-R uses Monte Carlo sampling based Gibbs cloning to efficiently generate samples in the tail. Using these samples we can approximately calculate the estimates. The accuracy of these estimates depends on the number of samples. The Gibbs cloning method is used in MCDB-R because it is more efficient at generating a large number of samples from the tail.

Gibbs cloning is an iterative process that moves further into the tail after every iteration. Each iteration has a cloning step and a perturbation step. In the cloning step we delete the samples that are away from the tail and replace them with copies of samples that are further into the tail. The new set of samples is highly correlated due to cloning, but lies further into the tail. These correlated samples are randomly perturbed such that the new samples after perturbation are uncorrelated but still remain close to the tail. The distance to the tail is enforced through a constraint on the aggregate value. Perturbation is done by replacing the existing sample with another random sample that still satisfies the constraint. Finding a good replacement for multiple dimensions of the data together is extremely difficult. Hence, the replacement on high dimensional data is done by a Gibbs sampler. The Gibbs sampler perturbs the samples one dimension at a time. At a given dimension, the replacement for an existing sample is found through the Rejection algorithm (Section 2.1.2). The Rejection algorithm replaces the current sample with a new random sample and checks the constraint on the aggregate to see if the new sample is still far enough into the tail. It rejects new samples until it finds a sample that satisfies the constraint.

4.2 Motivation

In MCDB-R, each database sample is called a DB instance. MCDB-R borrows the tuple bundle idea from MCDB to perform efficient query processing on this large number of DB instances. A tuple bundle stores all the samples (DB instances) for a given dimension of the data. Each tuple bundle is uniquely identified by its seed identifier (Section 2.4). Because of tuple bundles, the query is run only once, instead of once for each DB instance. Thus, bundling improves the performance significantly compared to independent execution of the query on each of the DB instances.

The sample sizes used in MCDB-R are larger than those in MCDB because the Rejection algorithm consumes significantly more samples. The samples in a tuple bundle are generally exhausted when the Rejection algorithm is replacing a sample with low probability. It can happen that the aggregate value on the data is very close to the constraint. Then the Rejection algorithm must find another sample very close to the sample value it is trying to replace; otherwise the constraint will not be satisfied. Since the original sample is of low probability, the Rejection algorithm might not find a replacement until it rejects a large number of samples.

Whenever a tuple bundle is out of samples, the query is rerun to generate more samples. At each rerun the number of samples generated, nums, is doubled (unless the size of a single tuple bundle exceeds the page size, in which case nums remains the same). Since nums is the same for all the tuple bundles in a query run, each tuple bundle now stores double the samples and is approximately double in size. But most of the other tuple bundles may not require the new samples that are generated.

Example: When executing query Q2 (see Section 4.7), in the second rerun the average number of samples used for each tuple bundle (or seed identifier) at the Rejection algorithm is 150, whereas nums is 4000. Most of the seed identifiers need only 150 samples, but due to the outliers that need 4000 samples, all seed identifiers have to store the 4000 samples. The framework of MCDB-R does not allow different sample sizes for different tuples. Even if the framework allowed such an option, it is not easy to predict beforehand which of the seed identifiers require more samples than others. This wastage of samples only increases with the number of reruns. At the fourth rerun, the number of used samples on average is 210 out of 16,000 samples generated. Though the Normal VG function is not very expensive, generating so many unused samples had a significant impact on the overall query execution time.
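The Rejection algorithm of Section 4.1, and the way it consumes the samples of a tuple bundle, can be sketched in Python as follows. This is an illustration under assumptions rather than the MCDB-R code: the aggregate is taken to be a SUM, the tail of interest is the upper tail (aggregate at least cutoff), and the function name rejection_replace and its arguments are hypothetical.

```python
def rejection_replace(values, dim, stream, cutoff):
    """Try to replace values[dim] with a candidate from the sample stream.

    Returns (replacement, samples_consumed). The replacement is None when
    the stream is exhausted first, the situation that forces a query rerun
    in the text above.
    """
    # Contribution of all other dimensions to the assumed SUM aggregate.
    rest = sum(values) - values[dim]
    for used, candidate in enumerate(stream, start=1):
        # Accept the first candidate that keeps the aggregate in the tail.
        if rest + candidate >= cutoff:
            return candidate, used
    return None, len(stream)


values = [1.0, 2.0, 9.0]   # current DB instance, aggregate = 12.0
cutoff = 11.5              # the constraint that defines the tail

# Two candidates are rejected before 8.6 satisfies the constraint.
hit = rejection_replace(values, 2, [0.5, 8.0, 8.6, 9.5], cutoff)

# No candidate is large enough, so the bundle is out of samples.
miss = rejection_replace(values, 2, [0.1, 0.2], cutoff)
```

The closer the aggregate is to the cutoff, the fewer candidates can be accepted, which is why a few seed identifiers exhaust their samples while most need only a small number.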


Figure4-2. Modiedqueryplan Evenifthesamplegenerationprocessisinexpensive,theincreaseindatasizewillhaveanadverseeffectontheresponsetime.Figure 4-1 showsthesimpliedphysicalplanforqueryQ2.ThesamplesaregeneratedattheInstantiateoperator(Section 2.5 ),andarepassedthroughtheplantoTailsampler(Section 2.7 )operator.TheTailsamplergoesthroughtheseedidentiersonebyoneinsortedorder.TheorderismaintainedbytheGibbsqueue(Section 2.7.1 ).InmostcasestotaltuplesinGibbsqueuewouldnottinmainmemory,thereforeitwritesthemtodisk.ThesolidarrowfromTailsamplertoGibbsqueuesigniesanewperturbationstep(orGibbsiteration)andresetoftheGibbsqueue.Thedottedarrowshowsaqueryrerun,andistriggeredwhenoneoftheseedidentiersrunsoutofsamples.Thequery,whenrunona200MBdatasetwith185,000tuplebundlesandwith0.999quantileastargettailarea,writesapproximately25gigabytes(GB)ofdatatodiskinfourthrerun,whennumsis16,000.ForveGibbs 60

PAGE 61

iterations,thequerywritesandreadsmorethan1TBtodisk.SeeSection5formorestatisticsontheexecutionofthisquery. 4.3SolutionOutlineThetechniquewepresentinthischapterallowsthesystemtoserializethesamplegenerationprocessanddelaytheactualsamplegenerationaslongaspossible.Thenewqueryexecutionavoidsthemovementofthelargetuplebundlesthroughsomeexpensiveoperationslikesortsandjoins.ThequeryplaninFigure 4-1 cannowbemodiedtothatinFigure 4-2 .InthenewplanInstantiate(S)denotesInstantiate-serialize.Inthismodiedoperator,insteadofgeneratingthesamplesasbeforeatregularInstantiate,theVGfunction(Section 2.3 )parametersareserializedintoastring,calledSVGFforSerializedVGFunction,andkeptalongwiththeSeedattribute(Section 2.4 ).Instantiate(G)de-serializestheSVGFstringandgeneratestherequestednumberofsamples.ItcanbeseeninFigure 4-2 thatthesamplegenerationisdoneaftertheGibbsqueue.Thus,insteadoflargenumberofsamplesnowonlythestringSVGFiswrittentothedisk.SVGFstringistypicallylessthan100bytesinsize.Forqueriesthatdoesnothaveajoinonstochasticattributes,andhencearesimilarinstructuretoqueryQ2,thisserializationeliminatestherequirementofqueryreruns.Nowwecangeneratevariablenumberofsamplesforeachseedidentierseparately,whicheliminatesthegenerationoflargenumberofunusedsamplesformanyseedidentierswhenonlyveryfewseedidentiersrequirethem. 4.4SerializationInthissection,wediscusstheprocessofserializingtheVGfunctionandtheparametersforagivenseedidentier.WerefertothisrepresentationasSVGFasnotedpreviously.ForserializingtheVGfunctionwerststorethenameofthefunctionandthenalltheparametersthataresuppliedtothefunction.Theschemaoftheparametersneedstobestoredaswell.AVGfunctionframeworkinMCDB-Risverygeneric.AVGfunctioncantakemultipletuplesfromparameter(inner)tablesandonetuplefromouter 61

PAGE 62

Figure4-3. AVGfunctionprocess Figure4-4. SerializationofVGfunction tableforgeneratingasinglesample.Instantiatepickstherequiredattributesfromthesetuplesandcreatesaparameterarray,thatisthenpassedtotheVGfunction.Figure 4-3 showstheVGfunctionprocesswithintheInstantiate.Recreatingthissamplegenerationprocessrequiresexactsimulationofpassingtheparameters,asdoneintheInstantiate.SerializationformatisshowninFigure 4-4 .AfterstoringtheVGfunctionnamewestoretheschemaoftheparameterarray.Thenwestorethenumberoftimestheparameterarrayispassed,andthentheparameterarraysthemselves.TheSVGFiscreatedduringtherstexecutionoftheInstantiate.Duringthisrstexecution,theVGfunctioniscalledasbeforeexceptthatonlyonesample(whichisdiscardedlater)isgeneratedandSVGF 62

PAGE 63

isformed.OnceSVGFiscreated,itisstoredasastringinthetupleaspartoftheSeedattributerelatedtotheVGfunction.AsingleVGfunctioncallcanproduceasequenceofvaluesaspartofthesample,resultinginmultiplestochasticattributes.Thereforetherelationshipbetweenastochasticattributeanditslocationinthesequenceofvaluesisstoredinthestochasticattribute.SinceasingletuplecanhavemultipleSeedattributes,eachstochasticattributealsostorestheparticularSeedfromwhichitisgenerated. Figure4-5. BayesianVG:Inputserialization Figure4-6. BayesianVG:Innerinput Figure4-7. BayesianVG:Outerinput Figures 4-5 through 4-8 showtheserializationprocessforaBayesianVGfunction(seequeryQ1inSection 4.7 ).ThisVGfunctiontakessevenparameters,andofthemfourarepassedthroughtheinnerinputpipeandotherthreearepassedfromouterinput 63

PAGE 64

Figure4-8. SerializationofBayesianVGinputprocess pipe.Figure 4-5 showstheschemaoftheparameterarray.Firstattributeistheseedidentier,nextthreearetheparametersfromouterinput.ThelastfourparametersforthetwoGammafunctionsintheBayesianmodel.TheparametersfrominnerandouterinputsaresenttoVGfunctionseparatelythroughdifferentarrays,anddifferentcallstotheTakeParams().Instantiatemapstheinnerandouterinputtuplesintotheparameterarray.Figures 4-6 and 4-7 showthetransformationofaninnerandanoutertupleintothearrayformat.Intheserializationsteptheparametercountistwo,sincewehaveoneinnerandoneoutertuple.Figure 4-8 showstheSVGFcreatedfortheBayesianVGfunction.FirstwestorethenameofVGfunction(Bayesian),thentheschemaoftheparameterarrayandthentheparametercount.Intheparameterarray,theunusedeldsaresettoNULL.Beforeeachvalue,aagisstoredtoindicateweatherthatvalueexitsorisitNULLinthearray. 4.4.1IncorporatingSerializationinaQueryPlanPredicatesonstochasticattributesaregenerallypusheddownthequeryplan,andareevaluatedjustaftertheInstantiate.Ifthesamplegenerationisserialized,thenthesamplesarenotgeneratedatInstantiate.Thereforeanypredicatesonthestochasticattributesshouldbemovedtotheoperatoratwhichthesamplesaregenerated.Allthenon-joinpredicatesonstochasticattributesarepassedtoTailsampleroperator.ThesepredicatesareappliedoncethesamplesaregeneratedinsideTailsampler. 4.4.2StochasticJoinPredicatesMovingupthejoinpredicateintoTailsamplerinnotrecommended.Withoutajoinpredicate,acrossproductisstillperformed.Thiscoulddegradetheperformancebecauseajoinpredicatereducesthenumberofoutputtuplesfromacrossproduct.ThejoinpredicatecanhaveastochasticattributeinbothsidesofthejoinasinFigure 4-9 64

PAGE 65

Figure4-9. Joinonstochasticattributeononeside Figure4-10. Queryplanafterserialization orjustonesideofthejoinasinFigure 4-11 .Inbothqueryplans,thesampleshavetobegeneratedbeforetheSplitoperator(aMCDBspecicoperationthatsplitsastochastictupleintomultipledeterministictuples,seeSection 2.6.2 ),thereforewecanonlymaterializealltheoperationsjustbeforeit.IftherearemultipleSeedattributesinthetuplebundlebeforetheSplitoperator,onlythesamplesthatareprocessedinSplit 65

PAGE 66

needstobegenerated.AmaterializationjustbeforeSpiltoperatorisstillusefulasitavoidsrunningtheexpensiveInstantiateoperationfromthesecondrunonwards.AsintheleftsubtreeinFigure 4-9 ifastochasticattributeisnotpartofthejoinpredicate,thenthereisnoneedtode-serializethatattributeuntiltheTailsampler.Givenaqueryplan,weusethefollowingrulestoidentifywhereinthequeryplanitissafetomaterializetheoperators.Rule1:AnyancestoroperatortoaSplitcannotbematerialized.Rule2:AfteranInstantiateoperatorcheckifthereareanyjoins.Foranysuchjoins,iftheothersideofthejoinpredicatehasastochasticattribute,thenmaterializetheoperationjustbeforetheSplitassociatedwiththejoin.ThiscasecanbeidentiedbytraversingtheothersideofanyjoinsinthepathleadingfromInstantiatetoTailsamplerinthequeryplan.ApplyingtheserulesresultinmodiedqueryplansasshowninFigures 4-10 and 4-12 .TheyarethemodicationsofqueryplansinFigures 4-9 and 4-11 respectively. Figure4-11. Joinonstochasticattributesonbothsides 4.5EfcientDe-serializationDe-serializingtheSVGFstringisstraightforward.FirstreadtheVGfunctionnameandcreatethefunctionobject.Thenreadtheschemaoftheparameterarray.After 66

PAGE 67

Figure4-12. Queryplanafterserialization thatreadtheparameterarraysonebyoneandpassthemtotheVGfunctionobject.Tomakeuseofmultiplecoresinthemachine,wecanmultithreadthesamplegenerationprocess.Thisiseasysinceeachsamplehasitsownpseudo-randomseed.Thispseudo-randomseediscomputedthroughahashfunctiononthelocationofthesampleinthesamplestreamandthePRNGseedfromtheSeedattribute.EachthreadisgivenarangeinthesamplestreamandthePRNGseedcangeneratethesamplesindependentofotherthreads. 4.6RelatedWorkThisideaofserializationiscloselyrelatedtocompressionindatabases[ 1 29 30 36 52 74 ]whichfocusesonreducingtheexecutiontimesbysavingdiskI/O.Sincesamplesareofsametypeandarestoredtogetherinanarrayoursettingissimilartostoringcolumnsinacolumnstore.Forintegerorstringdatathebit-vectorcompressionscheme[ 1 ]canbeused.Sincethesamplearrayisnotsortedthisseemtobetheonlymethodfeasible.TheserializationofVGfunctionstilloffersmoredisksavingscomparedtobit-vectorcompressionduetothelargebit-vectorsneededinMCDB-R.ForexamplethesizeofSVGFforaNormalVGfunctionislessthan100Bytes,butabit-vectorfor16,000valuesneeds16,000bitsisnearly2KBinsize,afactorof20.Compressing 67

PAGE 68

oatingpointdataisnotveryusefulinMCDB-Rsinceourdataconsistsofrandomsamplesandmaximumcompressionforrandomoatingpointdatagenerallydoesnotexceedafactorof4[ 48 ]inthebestcase. 4.7ExperimentsInthissectionweevaluatetheserializationofVGfunctionsthroughexperiments.WeconductvariousexperimentstotesttheefciencyandscalabilityofthenewMCDB-Rwithserialization.OuraimistoshowthatusingserializationinMCDB-Rsystemallowsittoexecutequeriesmuchfaster.Thespecicquestionswewanttoanswerhereare: 1. Doesserializationimprovetheefciencyofthequeries? 2. Howarethediskusageandnumberofsamplesgeneratedeffected? 3. Isthesystemmorescalablenow? 4. Doesincreasingthenumberofthreadsimprovethesamplegenerationafterserialization? 5. HowdoestheperformanceofthenewsystemvarywithdifferentVGfunctiontypes(shorttailedvslongtailed)?ExperimentalSetup:WeusetheTransactionProcessingPerformanceCouncilBenchmarkH(TPC-H)datagenerator[ 17 ].Theworkstationusedfortheexperimentshas32GBofRAM,4720GBDellharddisksand8processingcoresof3.2GHzeachdistributedover2chips.Thesystemrunsa64-bitUbuntuoperatingsystem,version11.04.WeaddedtheserializationfeaturetotheMCDB-Rprototypeimplementation.ItiswritteninC++andconsistsof30,000linesofcode.Weusethreequeriesforbenchmarking.QueryQ1isbasedonqueryQ4from[ 37 ],andqueryQ2isfrom[ 6 ].SeeAppendix 7 forSQLstatementsforbothQ1andQ2.QueryBasichasthebarebonesstructuretouseaVGfunction.QueryQ1.Thisquerysimulatestheeffectsofa5%increaseinpriceontheprots.ThechangeincustomerdemandduetonewpricesisestimatedbyusingaBayesian 68

PAGE 69

VGfunction.Asamepriordistributionfordemandisusedforallcustomers.Bayesianmethodsareusedtondtheposteriordistributionbasedontheparametersspecictothecustomer.TheVGfunctionoutputsthepossiblecustomerdemandfornewprice.WechosethisqueryforbenchmarkingbecauseithasanexpensiveVGfunction.QueryQ2.Thisismainlyusedfortestingcorrectness.Theordersrelationismadestochasticbyaddingnormallydistributedstochasticattribute.Intheparametertableforthenormaldistribution,allmeanandvariancevaluesaresetto1.Thequeryissuchthattheresultdistributionisalsonormallydistributed,andhasaclosedformsolution.QueryBasic:ThisqueryhasasingleInstantiatebeforeTailsampler.TheaggregateisonthesamplevaluefromtheVGfunction.TheInstantiatetakesoneoutertable,andoneparametertable.Forthisquery,parametersaresameforeachofthetuplesintheoutertable.ThisqueryisusedtocomparedifferentVGfunctions.Theprobabilitydensityfunctionsusedhererangefromthosewithnotail(Uniform),tolongtaileddistributionslikeGammaand2.Giventhissetup,therearesixdifferentvariablesweneedtoknowforeachtest,inordertounderstandtheresults: 1. query:Thequeryusedinthetest. 2. DBsize:TheondisksizeoftheTPC-Hdataset,inbinaryrepresentation. 3. numSeedIds:Thenumberofseedidentiersprocessedduringthequeryexecution. 4. numdb:NumberofDBinstancesforthequery. 5. gibbsIters:NumberofiterationsatGibbslooper. 6. eliteDBPercent:NumberofDBinstancesretainedaftertheendofeachGibbsiteration.ThedeletedDBinstancesareclonedfromtheretainedeliteinstances.Werunvedifferenttests.FirsttestisexecutedonMCDB-R,withandwithoutserialization.RestofthetestsareonMCDB-Rwithserialization.Secondandthirdtestsareusefulinunderstandingthescalabilityofthenewsystem.Testfourhelpsusunderstandifthetimetakenforsamplegenerationafterserializationcanbeimprovedby 69

using more threads. The fifth test is useful in understanding the performance of serialization on long tailed distributions when compared to short tailed ones. Here is a concise description of the parameter values used for each test:

1. Performance with and without serialization: We use queries Q1 and Q2. DBsize is 200 MB. numSeedIds for Q1 is 93,000, and for Q2 is 46,000. numdb is 100. gibbsIters and eliteDBPercent are 5 and 50% respectively.
2. Scalability with increasing DBsize: We use two datasets with DBsize 200 MB and 20 GB. On the 200 MB dataset, the value of numSeedIds is 93,000 for Q1 and 46,000 for Q2. On the 20 GB dataset, the value is 9,100,000 for Q1 and 4,600,000 for Q2. The rest of the parameters are the same as in Test 1.
3. Scalability with increasing numdb: The variable numdb is varied from 10 to 400. The rest of the parameters are the same as in Test 1.
4. Parallel sample generation after serialization: All the parameters are the same as in Test 1. We include a multi-threaded implementation of sample generation at Instantiate(G). The number of threads is varied from 1 to 4.
5. Performance for different VG functions: The queries used here are query Basic with Uniform, Normal, Gamma, and χ² VG functions. The value of DBsize is 20 MB, numSeedIds is 10,000, and numdb is 100. The values of gibbsIters and eliteDBPercent are 10 and 50% respectively.

Results: The results for these five tests are shown in Tables 4-1 to 4-6. We ran each test multiple times and listed the average values. Tables 4-1 and 4-2 show the data for Test 1. The performance improvement when comparing MCDB-R with and without serialization is given in Table 4-1, under the column Improvement Factor. Disk usage and the number of samples generated and used for each execution are given in Table 4-2. The total disk usage values shown here are computed by keeping track of total page reads and writes during the execution of the query. Peak disk usage is calculated from the data input into the Gibbs queue. Wall clock times (in minutes) for Test 2 are given in Table 4-3. In Table 4-4, we have the wall clock times for Test 3. The times given are in seconds. Results for Test 4 are given in Table 4-5. Again, the times are given in seconds. Finally, Table 4-6 gives the execution times (in minutes) for Test 5.

The parameter values used for the different VG functions are given in the first row of Table 4-6. For example, the Normal VG function has mean μ set to 1, and variance σ² set to 1.

Table 4-1. Wall clock execution times

        With            Without         Improvement
        Serialization   Serialization   Factor
  Q1    4.2 mins        24 mins         5.7
  Q2    2 mins          32 mins         16

Table 4-2. Disk I/O and sample statistics

                              Total Disk   Peak Disk   Total Samples   Avg. Gen.    Avg. Used
                              I/O (GB)     Usage (GB)  (Billions)      Per SeedId   Per SeedId
  Q1 with Serialization       9            1           0.12            1,300        500
  Q1 without Serialization    74           4           0.65            7,000        500
  Q2 with Serialization       10           1           0.061           1,326        510
  Q2 without Serialization    1374         55          3.4             74,000       510

Table 4-3. Scaling with database size

  Size   200 MB      20 GB
  Q1     5.5 mins    428 mins
  Q2     2 mins      241 mins

Discussion: Test 1: Table 4-1 shows a significant improvement when serialization is added. For Q2, the performance improvement is more pronounced because the normal VG function generated rare samples more frequently than the Bayesian VG function. In Table 4-2 we can see the evidence for this. When using serialization, query Q1 generates about one fifth of the samples it generates without serialization. For query Q2 the ratio of samples generated is approximately 50. Execution of query Q2 without serialization had to rerun the plan three times to generate extra samples (not given in the table). This is significant because every new run results in two extra passes over

the samples. More reruns also mean more unprocessed samples. Even when only one seed identifier requires extra samples, they are generated for all other seed identifiers as well.

Table 4-4. Scaling with DB instances (time in seconds)

  numdb   10    100   200   400
  Q1      47    252   476   958
  Q2      40    120   206   376

Table 4-5. Parallel sample generation (time in seconds)

  Threads   1     2     4
  Q1        252   332   373
  Q2        120   150   148

Tests 2 and 3: Table 4-3 shows that the system is scalable and can be run on large databases with millions of tuples. For both queries, a factor of 100 increase in dataset size seems to result in approximately a factor of 100 increase in the execution time. The execution time seems to scale linearly. In Table 4-4 we can see that for both queries Q1 and Q2, the response time increases linearly as we vary the number of DB instances.

Test 4: The performance is worse for the multi-threaded implementation. A major reason for this could be the use of the WELL512 (Well Equidistributed Long-period Linear, 512 bits) [46] random generator. WELL512 is initialized once for each thread, and this initialization is expensive as it buffers 512 random bits. The query is run for 100 DB instances, so in most cases we generate only 200 samples at a time. If we use 4 threads, each thread now produces 50 samples and so may not use a significant portion of those 512 bits. Using a different random generator could avoid this problem.

Table 4-6. Performance for different VG functions (time in minutes)

  VG Function   Uniform       Normal         Gamma         χ²
                (a=0, b=0)    (μ=1, σ²=1)    (k=1, θ=6)    (k=6)
  Time          34            40             46            41

Test 5: The execution times in Table 4-6 show that, as expected, the query with the Uniform VG function finishes faster than the others with longer tails. The query with the Gamma VG function, with a low shape

parameter and a long tail, takes the most time. From this test, we can see that MCDB-R with serialization also works well for heavy tailed distributions.

Conclusion: Four of the five experiments we ran gave the expected results. We can conclude from Tests 1, 2, 3, and 5 that serializing the VG function in MCDB-R is extremely useful in improving the query performance and scalability of the system. More work needs to be done on making the sample generation (after de-serialization) parallel and faster. Test 4 showed that our current multi-threaded approach for this operation is not efficient.
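
The headline ratios in this section can be recomputed from the reported wall clock times. The short Python sketch below does so for Tables 4-1 and 4-3; the dictionary layout and function names are illustrative choices, not part of MCDB-R, and the inputs are simply the table values transcribed by hand.

```python
# Wall clock times transcribed from Table 4-1 (minutes).
table_4_1 = {
    "Q1": {"with": 4.2, "without": 24.0},
    "Q2": {"with": 2.0, "without": 32.0},
}

# Wall clock times transcribed from Table 4-3 (minutes), keyed by DBsize.
table_4_3 = {
    "Q1": {"200MB": 5.5, "20GB": 428.0},
    "Q2": {"200MB": 2.0, "20GB": 241.0},
}


def improvement_factor(times):
    """Ratio of run time without serialization to run time with it."""
    return times["without"] / times["with"]


def scaling_factor(times):
    """Ratio of run time on the 20 GB dataset to run time on the 200 MB one."""
    return times["20GB"] / times["200MB"]


improvement = {q: round(improvement_factor(t), 1) for q, t in table_4_1.items()}
scaling = {q: round(scaling_factor(t), 1) for q, t in table_4_3.items()}

print("Improvement factors:", improvement)
print("Time growth for a 100x larger dataset:", scaling)
```

The improvement factors reproduce the last column of Table 4-1. The growth factors for a 100 times larger dataset land on either side of 100 for the two queries, which is the basis for the roughly linear scaling noted in the discussion of Test 2.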

CHAPTER 5
REPLACING AN EXTREME SAMPLE

5.1 Motivation

The MCDB-R system uses sampling based Monte Carlo simulations to perform risk analysis on uncertain data. In this system the data uncertainty is modeled through parametric probability distribution functions. The whole database is seen as a high dimensional joint distribution. Risk analysis on this data can be done by observing the low probability regions, that is, the upper and lower tails of the distribution function. The Gibbs cloning method is used to sample from the tail of this high dimensional distribution. Gibbs cloning has two steps, cloning and perturbation, as explained in Section 2.1. The perturbation step is achieved through the Gibbs sampler, a Markov Chain Monte Carlo method.

The Gibbs sampler processes a huge number of samples during a single query execution. Even for a moderately sized dataset the number of samples generated can be in the billions, as shown in some experiments from the previous chapter. Since we process such a huge number of samples, the chance that an extreme sample (one which occurs with very low probability) appears in the sample stream is high. Such a sample is easily accepted at the Gibbs sampler as a good replacement if it is on the correct side of the tail. Being on the same side as the target tail, the sample will pass the constraint easily. But replacing that sample later, during the next Gibbs sampler iteration, might require another sample of nearly equal value (and hence very low probability of occurrence). For example, if the probability of occurrence of the sample we are replacing is 10^-6, then the rejection algorithm (Section 2.1.2) will probably reject a million samples before finding a feasible replacement for it.

Replacing one such extreme sample could exhaust the sample array from the current query run. We may even require numerous query reruns. Remember that during these reruns sample generation is done for the rest of the tuples as well, not just the tuple that ran out of samples. Therefore, the replacement process will make the query response time extremely high. In this chapter we will look

at an effective solution for this extreme value replacement problem in order to lower that response time.

Table 5-1. Poisson random relation

  Id   Value   Sample stream
  a    7       5, 7, 8, ...
  b    8       12, 7, 8, ...
  c    5       4, 3, 4, ...
  d    9       8, 9, 7, ...
  e    11      9, 10, 8, ...

Table 5-2. After a phase of rejection sampler

  Id   Value   Sample stream
  a    7       8, 5, 6, ...
  b    12      7, 8, 6, ...
  c    4       3, 4, 5, ...
  d    8       9, 7, 7, ...
  e    9       10, 8, 9, ...

Example: Consider the random relation Poisson in Table 5-1. It has the stochastic attribute Value and a Poisson VG function (Section 2.3) associated with it. Only one database instance is listed for the stochastic attribute to simplify the example. Let's say we want to perturb this database instance using the sample stream available. We need to replace the Value attribute in each tuple. The aggregate here is a summation, and let the constraint value be 40. That is, sum(Value) must be greater than or equal to 40. The instance of the relation after perturbation is shown in Table 5-2. The sum(Value) is still 40, and the tuple with id 'b' has the sample value 12. The accepted sample values for the tuples following tuple 'b' are lower than their existing values because the lower values still satisfy the constraint on the aggregate. For tuple 'c' the value 5 is replaced by 4, since the new sum(Value) of 43 is still larger than 40, the constraint. Similarly, for tuple 'd', 9 is replaced by 8, and for tuple 'e', 11 is replaced by 9. Now in the next perturbation phase, the rejection algorithm will require a value of 12 or more for tuple 'b', as anything lower will make the aggregate sum(Value) go lower than 40. Finding such a sample is difficult because of its low probability. If we assume that we need 30,000 samples to

find another value greater than or equal to 12, then in the current MCDB-R we need to generate the same number of samples for the other tuples as well. The performance can be improved drastically if we can avoid producing such unnecessary samples.

Fortunately, in queries without a join on stochastic attributes, the serialized VG function is accessible at the rejection algorithm. Hence, processing a very large number of samples at the rejection algorithm does not require a query rerun, as it can generate samples on the fly whenever necessary. But this replacement process can be quite expensive in other queries, as it might require a large number of query reruns. Note that the page size of the system limits the maximum number of samples in a single tuple bundle and, in turn, in a single query run. A page size of 1 MB can hold at most 130,000 samples. To look at a million samples we need 8 runs even with a page size of 1 MB. It is possible that we might encounter multiple such extreme samples during the execution of a single query, each requiring tens of query reruns. If the number of tuples in the stochastic relation is large, then there is a good chance of encountering multiple extreme value situations in a single perturbation step. Without a proper method to speed up this process, the queries will be very expensive.

5.2 Solution Outline

In this section we give the outline of an approach to solve this extreme value problem efficiently. The first step is to eliminate the sample generation for unused tuples in each query rerun. Since we need extra samples only for a single seed identifier (Section 2.4), unless there is a self-join, we can eliminate tuples with the rest of the seed identifiers from the Instantiate (Section 2.5). For other seed identifiers which are not eliminated, we need to generate only those samples that are currently in use by a DB instance.

For the given seed identifier we might need to produce tens of millions of samples. Since all the samples cannot fit in a single page, we need to devise a workaround. One possibility is to perform a Split operation as soon as the samples are generated. Though this eliminates the problem of overflow, we still need to carry a

very long bit-string. A bit-string for 10 million samples is just above 1 MB, but is still manageable. Even this solution becomes unmanageable if we need a larger number of samples. A cleaner solution is to break the large tuple bundle into numerous smaller ones just after the Instantiate. Once these smaller tuples are passed through the query plan, they can be stitched back together inside the Tail sampler (Section 2.7). We will keep sequence identifiers to remember the order in which the smaller tuples need to be grouped together. This strategy will not affect the correctness of the result except when the operator is an aggregate or a self-join. We can safely assume that there are no aggregates before the Tail sampler, as MCDB-R does not accept such queries. In case of a self-join, two tuples from the same seed identifier should be joined only if their sequence identifiers match. This constraint ensures that the join does not produce spurious tuples.

5.3 Filtering

Filtering out the tuples of unnecessary seed identifiers is very effective in reducing the resources required for finding a replacement for an extreme sample. Filtering will reduce the CPU time since we avoid producing samples for the filtered out tuples. Avoiding the movement of a large number of tuples through the query plan will also reduce the memory and disk usage. This process is illustrated for a simple query plan in Figures 5-1 and 5-2. Figure 5-1 shows the query plan for the normal execution. This query has a single VG function serialized at the right leaf node, the Materialized Instantiate. After Instantiate, the tuples are Split before being sent to the join. Figure 5-2 provides the new query plan with a selection above the Instantiate. This selection will allow only tuples that have the required seed identifier to go to the next operator. The necessary seed identifier is identified at the Tail sampler and is then sent to the Instantiate for the selection process.

Note that the filtering can only be applied at the particular Instantiate that generated the seed identifier corresponding to the extreme sample. If multiple stochastic attributes

exist in the query plan, we have multiple Instantiates, one for each stochastic attribute. All other Instantiates except the one that produced the extreme sample cannot filter out any of their tuples.

Figure 5-1. Query plan

Figure 5-2. Modified plan with filtering

Fortunately, for any such stochastic attributes we only need the active samples, that is, only numdb samples for each tuple. Moreover, only tuples that join with the seed identifier of the extreme sample remain by the time tuples reach the Tail sampler. This is because if the other stochastic attributes are in the same relation as the extreme sample, then the filtering on the seed identifier of the extreme sample will also

eliminate many other seed identifiers from the other stochastic attribute. The second possibility is that the other stochastic attribute exists in a different relation than the one with the extreme sample. In such cases the relation with the extreme sample and the other stochastic relation are joined. Since the relation with the extreme sample will have very few tuples due to filtering, the join result itself will be smaller accordingly. Hence, the existence of other stochastic attributes will not degrade the performance significantly.

Self-joins need to be handled separately, as the filtering cannot happen until after the join in the query plan. If a query plan has a self-join on the stochastic attribute that contains the extreme sample, we cannot filter out the tuples until after the self-join. It can also happen that the tuples which have the extreme sample have multiple seed identifiers from the same relation even without such a self-join. This can happen due to a self-join on attributes other than the stochastic attribute. If these seed identifiers belong to the same relation as the one that generated the extreme sample, then we need to follow the same process: filtering is delayed until after the self-join.

5.4 Fragmenting a Large Tuple Bundle

Figure 5-3. Splitting a large tuple bundle

Since a VG function can generate multiple tuples as part of the same sample, we maintain a VG identifier vg-id to keep track of that order. We need a third identifier to differentiate between multiple tuples with the same seed if they occur before Instantiate.

This scenario can happen when there is a join before the Instantiate. We call this identifier a tuple identifier, t-id. Figure 5-3 shows an example of fragmenting a very large tuple bundle into smaller pieces. In this example we keep the number of samples in the smaller tuple bundles at 10,000. Instead of a single tuple with 10 million samples, we will have 1,000 tuples with 10,000 samples each. The sequence identifier seq-id runs from 1 to 1,000. The t-id is the same for all the smaller tuples, as we only have one original tuple before Instantiate. The vg-id is also the same here, since a single input tuple to Instantiate produces a single output tuple.

Even a million samples might not be enough for some extreme values. Therefore, we repeat the process with ten times more samples than the previous run. If a satisfactory sample is still not found, we will increase the sample array size again by ten times. We repeat this until we find that replacement.

5.5 Implementation Details

Figure 5-4. Query plan with Sample router: normal mode

Incorporating fragmentation in the existing query plan is nontrivial. After a replacement for the extreme sample is found, any changes made to the query plan should be reset. Filtering is fairly easy to add to the existing query plan, since it only requires a select operator on the seed identifiers. The selection predicate is sent by the Tail sampler to the Instantiate.
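
The fragmentation scheme of Section 5.4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Fragment` record and the functions `fragment_bundle`, `stitch_bundle`, `fragment_count`, and `next_sample_count` are hypothetical names, and the real MCDB-R prototype (written in C++) operates on tuple bundles inside the query plan rather than on Python lists.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Fragment:
    """One small tuple bundle cut from a large one (hypothetical layout)."""
    t_id: int            # tuple identifier: same for all pieces of one input tuple
    vg_id: int           # VG identifier: which output tuple of the VG function
    seq_id: int          # sequence identifier: position of this piece, from 1
    samples: List[float]


def fragment_bundle(samples, t_id, vg_id, chunk_size):
    """Cut one large sample array into pieces of at most chunk_size samples."""
    starts = range(0, len(samples), chunk_size)
    return [
        Fragment(t_id, vg_id, seq_id, samples[start:start + chunk_size])
        for seq_id, start in enumerate(starts, start=1)
    ]


def stitch_bundle(fragments):
    """Reassemble the original sample order using seq_id, whatever the arrival order."""
    stitched = []
    for fragment in sorted(fragments, key=lambda f: f.seq_id):
        stitched.extend(fragment.samples)
    return stitched


def fragment_count(num_samples, chunk_size):
    """Number of pieces needed for num_samples samples (ceiling division)."""
    return -(-num_samples // chunk_size)


def next_sample_count(previous_count):
    """Escalation rule from the text: ten times more samples than the previous run."""
    return previous_count * 10


# Small demonstration with 25 samples and pieces of 10.
samples = [float(i) for i in range(25)]
pieces = fragment_bundle(samples, t_id=1, vg_id=1, chunk_size=10)
print([piece.seq_id for piece in pieces])
print(stitch_bundle(list(reversed(pieces))) == samples)
print(fragment_count(10_000_000, 10_000))
```

Sorting on `seq_id` before concatenating is what makes the reassembly independent of the order in which pieces arrive, which is the role the sequence identifier plays in the text.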

Figure 5-5. Query plan with Sample router: active mode

When a replacement for the extreme sample is found after executing the new query plan, the Instantiate is again informed of the finding and the selection predicate is dropped. Unlike filtering, fragmentation requires some major changes in the query plan. In our implementation, we create a new flow for the tuples in the Tail sampler. Instead of sending the tuples to the Gibbs queue, we pass them to another data structure, called a Sample router. This data structure sorts the incoming tuples according to their seq-ids and passes the Gibbs sampler only the tuples with the same seq-id. If necessary, the Gibbs sampler will request the next set of tuples with the next seq-id. In normal execution mode the Sample router just acts as a relay and does nothing else. See Figure 5-4 for an example query plan with a Sample router. The compiler, when it first sees a stochastic join in the query plan, will modify the plan to include the Sample router. This addition is done even before any execution is started. If an extreme sample does not occur, then the Sample router will always be in the normal execution mode. The Tail sampler will notify the Sample router whenever the execution mode needs a change. The example query plan in active mode is shown in Figure 5-5.
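
The two modes of the Sample router described above can be illustrated with a small Python class. This is a hedged sketch and not the prototype's code: the class name `SampleRouter`, its methods, and the use of dictionaries with a `seq_id` key to stand in for tuples are all assumptions made for the example.

```python
class SampleRouter:
    """Relay in normal mode; group tuples by seq_id in active mode (sketch)."""

    def __init__(self):
        self.active = False
        self._groups = {}   # seq_id -> list of buffered tuples (active mode only)

    def set_mode(self, active):
        """Called by the Tail sampler when the execution mode changes."""
        self.active = active
        self._groups.clear()

    def accept(self, tuples):
        """Take tuples from the Tail sampler.

        Normal mode: hand them straight back (the router is only a relay).
        Active mode: buffer them by seq_id and release nothing yet.
        """
        if not self.active:
            return list(tuples)
        for tup in tuples:
            self._groups.setdefault(tup["seq_id"], []).append(tup)
        return []

    def next_group(self):
        """Active mode: release the tuples of the lowest outstanding seq_id.

        Returns (seq_id, tuples), or None when nothing is buffered. The Gibbs
        sampler would call this again only if it still needs a replacement.
        """
        if not self._groups:
            return None
        seq_id = min(self._groups)
        return seq_id, self._groups.pop(seq_id)


router = SampleRouter()
relayed = router.accept([{"seq_id": 1, "value": 7}])

router.set_mode(active=True)
router.accept([
    {"seq_id": 2, "value": 5},
    {"seq_id": 1, "value": 9},
    {"seq_id": 2, "value": 6},
])
first = router.next_group()
second = router.next_group()
print(relayed, first, second)
```

Releasing one seq-id group at a time mirrors the description that the Gibbs sampler sees only tuples with the same seq-id and asks for the next set when needed.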

Figure 5-6. A TSSeed and assigned sample values

Figure 5-7. TSSeed and assigned samples after a Gibbs sampler iteration

5.5.1 TSSeed and Sample Generation

After a replacement for the extreme sample is found, the query is restarted in the normal mode. All iteration numbers for seed identifiers are stored in TSSeed at the end of each run. During the new query run, nummc samples are generated at Instantiate. The starting iteration number for this sample generation is the lowest iteration number assigned to a DB instance. That starting iteration number will be the low iter. number in the next query run. Any unassigned iteration numbers between the low iter. number and max. used iter. are not used. This generally does not create a problem, because their number is usually small compared to nummc. Unfortunately, just after a replacement for an extreme sample is found, this assumption does not hold true. More often than not, nummc is much smaller than the difference between low iter. number and max. used iter.

Figures 5-6 and 5-7 show the TSSeed (Section 2.6.1) object of a seed identifier with an extreme sample. The value at DB instance 2 in Figure 5-6 is the extreme sample, and its replacement is found only after processing more than 23,000 samples. In this example, nummc for the run after finding the replacement will be 64. The low iter. number is 18, max. used iter. is 23755, and the difference is 23737.

One solution to this problem is to assign nummc a value at least double the difference. This solution will only work as long as the difference between low iter. number and max. used iter. is not larger than what a single page can hold. A better solution is to store the assigned samples in a persistent data structure, and generate only samples after max. used iter. We can restore the assigned samples for that seed identifier from the data structure. This approach might get expensive if there are a large number of extreme samples in a single Gibbs sampler loop. Then, the data structure needs to store the samples of all seed identifiers with an extreme sample.

A simpler approach is to change how the samples are generated at Instantiate using TSSeed. The TSSeed creates the PRNG seed array used in Instantiate from iteration numbers and SeedId. The range of iteration numbers used is from low iter. number to nummc + low iter. number. This PRNG seed array determines what sample values come out of the VG function. We change the iteration numbers used. The first numdb iteration numbers are taken as is from the assigned iteration numbers. The remaining nummc − numdb iteration numbers start from max. used iter. Implementation of this method only requires changes to the GetSeeds function of the TSSeed object. No other changes are required in the code.

5.5.2 How to Identify an Extreme Sample?

During execution, we need to identify a sample as extreme in order to invoke our specific method. One way to identify an extreme sample is to do some statistical analysis, and find out the probability of occurrence of the replacement sample. For that we first have to find the minimum (or maximum) sample value we need for a replacement. Once we have that, using the parameters of the VG function we can, in

theory, calculate the probability of finding a sample greater (or smaller) than that value. For a generic aggregate function, finding that minimum required value itself is nontrivial. Instead, we can say that we need a sample with the same value as the one we are replacing. This method of identification might result in many false positives, since the actual value required might be easier to find. The special execution we use to replace an extreme sample is very efficient compared to rerunning the whole query. But too many false positives in this method will eliminate that efficiency advantage.

Instead, we can use a method based on run time statistics to identify an extreme sample. A sample can be designated as extreme if we cannot find a replacement by the end of the sample array. But this again will create a large number of false positives, since it is possible to reach the end of the sample array even for normal samples after multiple Gibbs iterations. To avoid that drawback we can say a sample is extreme when, even after a second rerun, we do not find the replacement. In that case we spend one entire rerun just looking for that replacement.

The other method is to use the number of samples consumed to find the replacement when we reach the end of the sample array. If at least 10% of the samples in the array are consumed without finding a replacement for a sample, then we can designate it as an extreme sample. In general, a decently large number like 10,000 might also suffice. But this can create problems during the first few reruns, when the sample array size is still less than 10,000. A few reruns will be wasted until the number of samples seen to find the replacement exceeds 10,000. We think using a fraction of the sample array size, like 10%, is better than a fixed number like 10,000, since the fixed value might need to be recalculated for a different page size.

We experimentally evaluate two different ways to identify an extreme sample: (1) a sample is called extreme if a replacement is not found at the end of two query runs, and (2) a sample is called extreme if it consumed more than 10% of the samples in the current sample array.
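
The two identification rules evaluated in the experiments can be written down as simple predicates. The Python sketch below is an illustration only: the function names and arguments are hypothetical, the fixed-threshold variant is included purely for comparison with the discussion above, and the sample array size of 8,000 is taken from the 128 KB row of Table 5-9.

```python
def is_extreme_type1(query_runs_without_replacement):
    """Type-1: no replacement found by the end of two query runs."""
    return query_runs_without_replacement >= 2


def is_extreme_type2(samples_consumed, sample_array_size, percent=10):
    """Type-2: the sample consumed more than `percent` of the sample array.

    Integer arithmetic avoids floating point rounding at the boundary.
    """
    return samples_consumed * 100 > percent * sample_array_size


def is_extreme_fixed(samples_consumed, threshold=10_000):
    """Fixed-count alternative mentioned in the text, shown for comparison."""
    return samples_consumed > threshold


array_size = 8_000  # samples per array at a 128 KB page (Table 5-9)

# Type-2 fires as soon as more than 800 of the 8,000 samples are consumed.
print(is_extreme_type2(800, array_size), is_extreme_type2(801, array_size))

# A fixed threshold of 10,000 cannot fire within one array of 8,000 samples,
# which is the wasted-rerun problem described above.
print(is_extreme_fixed(array_size))

# Type-1 needs a full extra query run before it fires.
print(is_extreme_type1(1), is_extreme_type1(2))
```

Tying the threshold to the array size means the rule keeps working unchanged when the page size, and therefore the array size, is varied.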

5.6 Experiments

In this section we evaluate our solution. Our aim is to compare the performance of our solution against the current approach as implemented in the MCDB-R prototype. We also want to show that our solution works well on large data sizes and fast growing aggregate functions. Different aggregate functions will result in different numbers of extreme samples, therefore it is interesting to observe the performance while varying the aggregate function. Choosing the page size well directly affects the query response time: a very small page size might result in too many reruns due to the small sample array size, whereas a very big page size might result in throwing away too many samples from the array (whenever an extreme sample is encountered and the query is rerun). The number of extreme samples is expected to increase as we move further into the tail. This will show that our solution is extremely important. The specific questions we want to answer here are:

1. How much is the performance improvement due to our solution?
2. How scalable is the new system?
3. How does the new system perform on different aggregate functions (linear vs quadratic vs cubic)?
4. Does varying the system page size affect the performance?
5. What is the rate of increase in the number of extreme samples as we move further into the tail?

Experimental Setup: We use the TPC-H benchmark data generator [17]. The workstation used for the experiments has 32 GB of RAM, 4 × 720 GB Dell hard disks, and 8 processing cores of 3.2 GHz each distributed over 2 chips. The system runs a 64-bit Ubuntu operating system, version 11.04. We implemented our solution in the prototype MCDB-R system.

We use two queries for benchmarking. The query Q3 below is based on query Q10 from [38]. See the Appendix for the complete SQL statements for both of the queries used. All queries are based on a modified TPC-H schema.

Query Q3. In this query, the total revenue generated by all customers in Japan is calculated. The schema is modified such that the customer information in the orders relation is not fully correct due to errors in the data integration process. A new relation is created to incorporate this error in the database. This relation has the actual customer key and a list of possible customer keys as a stochastic attribute. Each of the possible customer keys has a probability value. The number of possible customer keys is limited to 10, one of them being the correct key. The probability values are calculated according to the formula 1/2^i, where i varies from 1 to 10. The resulting distribution here has a finite domain. A discrete choice VG function is used to generate samples from the possible keys and their probabilities in this new parameter relation. In [38] the highest probability of 0.5 is always given to the correct key. When run with these values we will never encounter a situation where a sample replacement requires a huge number of samples. Here we modified the table such that in 50% of the tuples the correct key is given the lowest probability of 1/2^10. This modification is done in order to create a situation where two samples with the correct customer key value can occur very far away from each other in the sample stream. Such a situation can result in a sample which is difficult to replace, in other words, an extreme sample.

Query Q4. Here we calculate the penalty due to shipping delays to the customers. We have a shiptime relation which has, for each customer, the number of days within which the item should be delivered. The relation avg_shiptime has the average shipping time for a given item, and this is the parameter for the Poisson(λ) VG function. The relation penalty has the penalty added according to the number of days of delay in shipping. For these experiments, the penalty rate we use for the delay increases with the number of days of delay: penalty(i) = i × j, where i is the number of days of delay, and j is the penalty rate per day with j = ⌊i/5⌋. In this function the rate of penalty increases at 5 day intervals.
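
To make the two query setups concrete, the following Python sketch evaluates the Q3 key probabilities and the Q4 penalty function. It rests on assumptions that should be read as such: the function names are illustrative, the penalty is read as the product of the delay i and the per-day rate j, and the waiting-time estimate assumes independent draws from the discrete choice distribution, so that the gap between two occurrences of a key follows a geometric distribution with mean 1/p.

```python
from fractions import Fraction


def key_probabilities(num_keys=10):
    """Probabilities 1/2^i for i = 1..num_keys, as exact fractions."""
    return [Fraction(1, 2 ** i) for i in range(1, num_keys + 1)]


def expected_draws(probability):
    """Mean number of independent draws until a value with this probability appears."""
    return 1 / probability


def penalty(days_late):
    """Penalty for a delay of days_late days, reading the formula as i * floor(i / 5)."""
    rate_per_day = days_late // 5   # the rate steps up every 5 days
    return days_late * rate_per_day


probs = key_probabilities()
lowest = probs[-1]

print("highest and lowest probability:", probs[0], lowest)
# The ten values do not sum to exactly 1; the source does not say how the
# remaining mass is handled, so no normalization is applied here.
print("sum of the ten probabilities:", sum(probs))
print("expected draws for the lowest-probability key:", expected_draws(lowest))
print("penalties for delays of 4, 5, 9, 10, 12 days:",
      [penalty(d) for d in (4, 5, 9, 10, 12)])
```

With the correct key at the lowest probability, roughly a thousand draws are needed on average before the same key reappears, which is how the modified parameter table produces samples that are hard to replace.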

Given this setup, there are six different variables we need to know for each test to understand the results:

1. query: The query used in the test.
2. numSeedIds: The number of seed identifiers processed during the query execution.
3. PageSize: Size of the system page. It limits the size of the sample array. This value is set to 128 KB for our experiments, unless specified otherwise.
4. numdb: Number of DB instances for the query. This is set to 100 for all tests in this section.
5. gibbsIters: Number of iterations at the Gibbs looper. This value is 5, unless specified otherwise.
6. eliteDBPercent: Percentage of DB instances retained at the end of each Gibbs iteration. The deleted DB instances are cloned from the retained elite instances. We keep this value at 50% for all the tests.

We run the following five tests. In the first test we compare the original MCDB-R (with serialization) with MCDB-R with our solution. We test the new system on a 20 GB dataset in Test 2, and on different aggregate functions in Test 3. In Test 4, we vary the system page size. We calculate the number of extreme samples with changing Gibbs iterations in Test 5. We list how the variables are set for each of the tests:

1. Performance improvement due to our solution: Both queries Q3 and Q4 are used here. The value of numSeedIds is 1000, PageSize is 128 KB, numdb is 100, gibbsIters is 10, and eliteDBPercent is 50%.
2. Scalability with increasing numSeedIds: We used query Q4. The value of numSeedIds is varied from 8000 to 40,000, and gibbsIters is set to 5. The rest of the variables are the same as in Test 1.
3. Performance with different aggregate functions: We used query Q4. The value of numSeedIds is 20,000, and gibbsIters is set to 5. The rest of the variables are the same as in Test 1.
4. Effect of PageSize on performance: We used query Q4. The value of numSeedIds is 10,000, and gibbsIters is set to 5. The value of PageSize is varied from 32 KB to 512 KB. The rest of the variables are the same as in Test 1.

5. Number of extreme samples with increasing gibbsIters: We used query Q4. The value of numSeedIds is 8000, and gibbsIters is varied from 2 to 10. The rest of the variables are the same as in Test 1.

Results: The results for these five tests are shown in Tables 5-3 to 5-11. We ran each test multiple times and listed the average values. Results of Test 1 are given in Tables 5-3 and 5-4. Table 5-3 shows the performance of the original MCDB-R and of our solution with two definitions (Section 5.5.2) of extreme samples. Type-1 is when a replacement is not found at the end of two query runs. Type-2 is when a sample consumed more than 10% of the samples in the current sample array. For the same test, the numbers of samples identified as extreme are shown in Table 5-4.

Test 2 results are given in Tables 5-5, 5-6, and 5-7. Table 5-5 compares the performance of the original MCDB-R with the Type-2 variant of our solution. We do not use Type-1 in this and later tests, since we found that Type-2 has better performance. The numbers of total samples generated in this test are given in Table 5-6. Table 5-7 gives the number of query reruns required for each of the query executions listed in Table 5-5. The values give both the number of full query reruns and the number of reruns with filtering (i.e. the number of extreme samples). In each cell, the number of reruns used for replacing an extreme sample is given in brackets (). For example, for the input size with 8000 random tuples we have 12 Type-2 extreme samples and 10 query reruns.

Table 5-8 shows the results for Test 3. The running times for the query executed with different aggregate functions, and the number of extreme samples identified for each execution, are listed. Results of Test 4 are shown in Tables 5-9 and 5-10. Table 5-9 shows the query response time with changing PageSize (and in turn maximum sample array size) in the system. Table 5-10 lists the number of total runs required to finish the query and the number of extreme samples in each run. Table 5-11 shows the results from Test 5.

Discussion: Test 1: As explained above, the query Q3 has a distribution with a finite domain on its stochastic attribute. This limits the chances of occurrence of an

Table 5-3. Wall clock running times (in seconds)

  Query   Original   Type-1   Type-2
  Q3      43         43       41
  Q4      219        135      75

Table 5-4. Number of extreme samples identified

  Query   Type-1   Type-2
  Q3      0        1
  Q4      7        13

Table 5-5. Performance with increasing input size (in minutes)

  Size       8000   20,000   40,000
  Original   15     102      246
  Type-2     7.3    28       110

Table 5-6. Samples generated (in Billions)

  Size       8000   20,000   40,000
  Original   1.5    14.9     21.7
  Type-2     0.63   3.2      10.7

Table 5-7. Query reruns and extreme samples

  Size       8000     20,000   40,000
  Original   25(0)    94(0)    69(0)
  Type-2     10(12)   21(24)   34(47)

Table 5-8. Performance with change in aggregate attribute

  Aggregate attribute    penalty   penalty*5   penalty^2   penalty^3
  Time (minutes)         20.3      26.1        32.8        89.1
  Num. Extreme Samples   15        20          24          64

Table 5-9. Performance with changing max. samples (in minutes)

  Max Samples    2000    4000    8000     16,000   32,000
  Min PageSize   32 KB   64 KB   128 KB   256 KB   512 KB
  Time           32.2    26.3    23.1     31.4     27.6

Table 5-10. Execution parameters with changing max. samples

  Max Samples       2000   4000   8000   16,000   32,000
  Num. Runs         30     25     31     32       32
  Extreme Samples   24     20     16     22       20

extreme value. Table 5-4 shows that for query Q3, we only have 1 extreme sample in Type-2, and none in Type-1. We can clearly see that there is almost no improvement in its execution time in the new system over the old system. Unlike query Q3, query Q4 processes samples from a Poisson distribution, which has a significant tail. The numbers of extreme samples for Q4 are 7 and 13 for Type-1 and Type-2, respectively. We can see the execution time improvements for this query with both types. Type-2 identification clearly outperforms Type-1, since we do not waste an extra query run to identify an extreme sample. In Type-2, we may identify some samples as extreme even if they are not that extreme. This does not affect performance much, since our special execution module to replace extreme samples is quite fast. For query Q4 the execution time difference between our special module and a full query run is approximately two orders of magnitude (1:100 ratio). We can see that the execution time improvements from Table 5-3 are proportional to the number of extreme samples identified, as given in Table 5-4. Therefore, we can infer that Type-2 performs better than Type-1 because of faster identification of extreme samples, and because it avoids the extra query run it takes Type-1 to identify a sample as extreme. From now on, all the experiments use only Type-2 identification of extreme samples.

Test 2: The execution times in Table 5-5 show that the performance of the new system is consistently better than that of the old one over varying input sizes. As expected, the numbers of samples generated, shown in Table 5-6, are proportional to the corresponding execution times shown in Table 5-5. A stronger correlation can be seen between the number of query runs in Table 5-7 and the execution time in Table 5-5. This is not surprising, since our module to process extreme samples takes very little time when compared to a total query run.

Table 5-11. Number of extreme samples with increasing Gibbs iterations

  Iterations   2   4    6    8    10
  Type-2       3   14   24   45   72

Test 3: From Table 5-8 we see that, as expected, the number of extreme samples is high for fast growing functions. The performance of the system is decent even for the cubic function.

Test 4: The execution times show that the performance is best with the maximum sample array size at 8000.

Test 5: For the first two Gibbs iterations, the number of extreme samples is 3, and for the first four iterations it is 14. This value 14 is inclusive of the first two iterations as well, i.e. the number of extreme samples in the third and fourth iterations is 14 − 3 = 11. As expected, the number of extreme samples increases as we move towards the tail.

Conclusion: We conclude that the Type-2 variant of our solution is the best. The performance improvement is significant in queries with long tailed VG functions, and the solution works well on large datasets and fast growing aggregate functions. As we can see from Test 5, the number of extreme samples increases as we move further towards the tail. Therefore, adding our special module to the existing system is essential.
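
The subtraction used for Test 5 generalizes to every column of Table 5-11, and the same kind of arithmetic gives the speedups implied by Table 5-5. The Python sketch below works only from values transcribed from those two tables; the variable and function names are illustrative.

```python
# Cumulative Type-2 extreme sample counts from Table 5-11.
iterations = [2, 4, 6, 8, 10]
cumulative_extreme = [3, 14, 24, 45, 72]


def per_interval(cumulative):
    """Turn cumulative counts into counts per interval of two Gibbs iterations."""
    increments = []
    previous = 0
    for value in cumulative:
        increments.append(value - previous)
        previous = value
    return increments


new_extreme = per_interval(cumulative_extreme)
for last_iteration, count in zip(iterations, new_extreme):
    print(f"iterations {last_iteration - 1}-{last_iteration}: {count} extreme samples")

# Execution times in minutes from Table 5-5, for 8000, 20,000 and 40,000 seed ids.
original_minutes = [15, 102, 246]
type2_minutes = [7.3, 28, 110]
speedups = [round(o / t, 2) for o, t in zip(original_minutes, type2_minutes)]
print("Type-2 speedup over the original system:", speedups)
```

The second entry of the per-interval list is the 11 computed in the text for the third and fourth iterations.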

CHAPTER 6
ANTIJOIN

6.1 Motivation

The anti-join operator is part of standard SQL. It is required to execute the SQL clauses NOT IN and NOT EXISTS in sub-queries. An anti-join of a left relation L and a right relation R (L ▷ R) returns the tuples in L which do not match any tuples in R on the predicate. Consider the tables Orders (ORDID, CUSTID, AMOUNT) and Customers (CUSTID, CUSTNAME) in Figure 6-1. Suppose we execute the anti-join query Orders ▷ Customers, with the predicate on Orders.CUSTID and Customers.CUSTID. The final output of Orders ▷ Customers is given in the table Output. The first tuple in Orders, with CUSTID value 21, appears in Output since there is no tuple in Customers with CUSTID 21. The second tuple in Orders, with CUSTID value 23, does not appear in Output since it matches the second tuple in Customers. This process of finding a match is repeated for each tuple in Orders.

Figure 6-1. Example: Anti-Join in a relational database

In a relational database an anti-join operation is straightforward to implement. The anti-join L ▷ R can be executed similarly to a sort-merge join, with some changes in the merge step. Sort both relations on the attributes in the predicate. During the merge, if there is no match for a tuple in L with any tuple in R, then output the tuple from L. In MCDB-R, executing an anti-join on stochastic attributes is not as simple. An execution similar to a normal join, as explained in Section 2.6.2, is not enough. In the join, after a

Split operation on the stochastic attributes and a sort operation, a merge is performed on the sorted left and right relations. The join predicate is evaluated during the Gibbs sampling at the Tail sampler (Section 2.7) operator. Evaluating a join predicate at the Tail sampler requires only one pair of matching tuples, one each from the right and left relations. An anti-join predicate, on the other hand, requires a tuple from the left relation and all the matches from the right relation.

A stochastic attribute in the left relation of an anti-join does not cause any difficulties because, even when the sample is changed, the information in the tuple will be enough to execute the anti-join predicate. This is so only if the right relation does not have a stochastic attribute appearing in the predicate. If a stochastic attribute appears in the right relation, the execution of the anti-join is not as simple. When the current sample v is replaced in a right tuple, the anti-join predicate needs to know whether the new sample is the first of its kind or whether there are any existing tuples with the same value. If this is the first time that value occurred, then the tuples from the left relation with that value will be removed from the output. Since the old value no longer exists in that tuple, if no other tuples with that value exist in the right relation, then the tuples from the left relation containing that value need to appear in the output (because now there are no matching tuples in the right). Both of the above steps require information not within the current tuple, and obtaining such information is not efficient in the current framework.

Bringing together all matching tuples from the right relation for a single left tuple will require a sort or a hash. Both are extremely expensive, and performing them once per left tuple is infeasible. Remember that all the predicates are calculated after the Gibbs sampler, and the Gibbs queue (Section 2.7.1), which provides tuples to the Gibbs sampler, does so only on one seed identifier (Section 2.4) at a time. Therefore, we need a change in the current framework for the execution of an anti-join predicate. Sorting all tuples in the Gibbs queue is an option. With a sort on the value we can group the required tuples (with different seed identifiers) together. Though sorting solves the problem, it is not efficient since it

PAGE 94

needstobedoneonceforeachtuple.Asinglesortitselfisveryexpensive,thereforeperformingasortforeachtuplewillbeextremelytimeconsuming. Figure6-2. Example:Stochastictable ConsiderthetableCustomersinFigure 6-1 .AssumethattheCUSTIDattributeisstochastic.ThenewtablewiththestochasticarrayattributeisshowninFigure 6-2 .Figure 6-3 showsthestochastictableaftertheSplitoperation,alongwiththelefttableintheanti-join.ThetuplewithCUSTID21inOrdersmatcheswithalltwotuplesinStoc.Customers,tuples1and4withseedsS1andS2respectively.Theactualanti-joinpredicate(likeallotherpredicates)iscomputedaftertheGibbssamplerstep.SoinordertoknowifthetuplewithCUSTID21inOrdersappearsintheoutput,weneedboththetuplesfromStoc.CustomersandwhichofthemareactiveforagivenDBinstance.ButtheGibbsqueueonlyprovidesgroupsoftupleswithsameseedtogether.Sointhecurrentsystemitisimpossibletoexecuteananti-joinpredicate.Weneedsomenontrivialchangesinthesystemforthistowork.Inthenextsectionwediscusstwonewattributeformatsandhowtheywillfacilitatetheanti-joininMCDB-R.Afterthatweexplainthemodicationsrequiredinthesystemsothattheanti-joinwillworkincaseofextremesampleproblem.InSection 6.4 wediscussbrieyhowtoextendbothformatsifmorethanoneanti-joinispresentinaquery.FinallyinSection 6.5 weevaluatebothformatsdescribedbelowforananti-joinexecutioninvarioustypesofdatasets. 94
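The deterministic sort-merge anti-join described in Section 6.1 can be sketched as follows. This is a minimal illustration in Python, not MCDB-R code: the function name `anti_join`, the tuple layouts, and the AMOUNT and CUSTNAME values are assumptions made for the example; only the CUSTID values 21 (no match) and 23 (match) follow the text.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

L = TypeVar("L")
R = TypeVar("R")


def anti_join(
    left: Sequence[L],
    right: Sequence[R],
    left_key: Callable[[L], int],
    right_key: Callable[[R], int],
) -> List[L]:
    """Return the left tuples whose key matches no right tuple (sort-merge style)."""
    # Sort both inputs on the predicate attribute.
    left_sorted = sorted(left, key=left_key)
    right_keys = sorted(right_key(r) for r in right)

    output: List[L] = []
    j = 0  # cursor into the sorted right keys
    for l in left_sorted:
        k = left_key(l)
        # Skip right keys smaller than the current left key.
        while j < len(right_keys) and right_keys[j] < k:
            j += 1
        # Merge step changed for anti-join: emit the left tuple only when no match exists.
        if j == len(right_keys) or right_keys[j] != k:
            output.append(l)
    return output


# Hypothetical data: (ORDID, CUSTID, AMOUNT) and (CUSTID, CUSTNAME).
orders: List[Tuple[int, int, float]] = [(1, 21, 100.0), (2, 23, 50.0), (3, 25, 75.0)]
customers: List[Tuple[int, str]] = [(22, "A"), (23, "B"), (24, "C")]

result = anti_join(orders, customers, lambda o: o[1], lambda c: c[0])
```

With deterministic keys a single sort of each input suffices. The difficulty discussed above arises because a stochastic key on the right side can change with every sample replacement, which would invalidate this one-time merge.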

Figure 6-3. Example: After Split

6.2 Composite Attribute Formats

As discussed previously, to execute an anti-join predicate we need all matching tuples from the right relation. In MCDB-R we apply the predicate only after the Gibbs sampling step, hence all the matching tuples from the right relation must be collected to run the anti-join. This is complicated in the current framework because at the Gibbs sampler we only have tuples containing a particular seed identifier. The current interface of the Gibbs queue does not allow retrieving tuples that are not related through a seed identifier. Adding a method to perform such an operation in the Gibbs queue would be very expensive, since it would require a sort or a hash every time we want to apply an anti-join predicate. In this section we discuss two new attribute formats that get around this problem. We also explain how these new attribute types facilitate the anti-join operation.

6.2.1 Long Format

In this format we store all the necessary information from the right relation in each tuple from the left relation. During the Anti-join operation we add the seed and bitstring attributes from all matching tuples in R to the tuple in L. The seed is used by the Gibbs queue to pull the tuple up when the corresponding seed identifier is being processed. Figure 6-4 shows the structure of this composite attribute. The new attribute type is called AJOBJECT (AJ for Anti-Join). It first keeps the number of seed and bitstring attribute pairs it contains (the same as the number of matches in R), and then the list of seeds followed by the list of bitstrings.

Figure 6-4. Composite Attribute: Long format

The BuildSeeds() method, which constructs the sorted array of seed identifiers for each tuple, needs to be modified to incorporate the new attribute. This sorted array is later used by the Gibbs queue to retrieve the tuples in the order of their seed identifiers. Figure 6-5 shows the new Gibbs tuples for the relations from Figure 6-2 after the anti-join operation. The new composite attribute is located at the end of the output tuples. The tuple with CUSTID 21 from Orders matches two tuples (the first and the fourth) in the table Stoc.Customers. Therefore, the AJOBJECT in the first tuple of the output table Orders_AJ in Figure 6-5 stores two seed attributes and two bitstrings. Similarly, the fifth tuple in Orders does not match any tuple in Stoc.Customers, therefore its AJOBJECT does not have any seed or bitstring attributes.

Processing of these tuples at the Tail sampler is explained below. For simplicity we assume that there are no operators between the Anti-join and the Tail sampler. Also assume that we have only one DB instance, and that initially all the DB instances are pointing to the first sample in the stream. At first, the Gibbs queue returns the fifth tuple, and since it does not have any seed attributes it is sent to the aggregate without any extra processing. In the next step, the tuples with seed S1 are returned, i.e. the first two tuples. The Tail sampler will try to replace the current sample for DB instance 1 (DB1). Let us say that Advance() moves the DB1 sample from iteration number 1 to iteration number 2.

Figure 6-5. Long Format: Tuples after Anti-join

The bit changed from 1 to 0, therefore DB1 of S1 now has the value 23 instead of 21. Now the validity of tuple 1 is checked through the anti-join predicate. Since there is still a valid 21 from the composite attribute due to DB1 of S2, the status of the anti-join predicate on this tuple does not change. So the first tuple does not contribute to the aggregate. For the second tuple, both bitstrings have a 0 at the first iteration. The anti-join predicate returns true since there are no matches. Hence, initially the second tuple appears in the output and contributes to the aggregate. After the Advance(), DB1 of S1 is assigned 23. The bitstring of S1 is 1 at the second iteration. Therefore, the anti-join predicate returns false, due to the existence of a match in the right relation. So, after an Advance() on S1, the aggregate value changes. More samples are processed if the new aggregate does not satisfy the constraint on its value, and the rejection algorithm proceeds further.

Possibilities of overflow: As seen in the above example, using the format is fairly simple. We need to create a new data type and make minor modifications to the BuildSeeds() method and the Tail sampler operator. Though this method works well in the above example, it makes an inherent assumption that the new tuple, storing numerous seed and bitstring attributes, will fit in a single page. The assumption holds if, for a left tuple, the matching tuples from the right relation are few. The tuple can easily exceed the page size if there are too many matching tuples in the right relation. Both the seed and bitstring attributes are big. Bitstrings grow arbitrarily large based on the size of the sample array. A sample array of size 10,000 results in a bitstring of 1250 bytes. With a page size of 1 MB, we can only hold close to 800 such bitstrings. The size of the seed attribute depends on the number of DB instances, numdb. If we have 100 DB instances, the seed is approximately 800 bytes. For a 1 MB page size, we can hold close to 500 of these seed and bitstring attribute pairs.

The attributes on which anti-joins are performed are categorical. In general the domain size of categorical attributes is small. Examples of such attributes are Location (Country, Zip code, etc.), Department, and Component. Small domain sizes and large relation sizes generally result in a large number of matches. So the long format is not always appropriate.

6.2.2 Short Format

Figure 6-6. Composite Attribute: Short format

As explained above, creating huge tuples which are bigger than the system page size should be avoided whenever possible. Here we discuss the short format for AJOBJECT, which reduces the attribute size. As explained before, the contributing factors for large tuple sizes are both the seed and the bitstring attributes. So the new format aims to avoid storing multiple seed and bitstring attributes in a single tuple. Instead, it stores only the seed identifiers of the matching tuples from the right relation. We also store the parts of the bitstrings which are currently assigned to the DB instances. The size of these bitstring parts is the number of DB instances multiplied by the number of seed identifiers stored in the AJOBJECT, and is very small. The right tuples themselves are put in the Gibbs queue, so that their seed and bitstring attributes can be accessed whenever necessary. The maximum number of tuples inserted in the Gibbs queue with this solution is the sum of the sizes of both relations in the anti-join.
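To make the long format of Section 6.2.1 concrete, the following is a minimal Python sketch of how the anti-join predicate could be evaluated from the (seed, bitstring) pairs, together with the page-capacity estimate from the overflow discussion. Everything named here is an assumption for illustration: the class `AJObject`, the functions `anti_join_holds` and `pairs_per_page`, the seed label S3, the bit values beyond those quoted in the walkthrough, a 1 MB page taken as 10^6 bytes, and 8 bytes of seed state per DB instance (inferred from "100 DB instances, approximately 800 bytes"). It is not the MCDB-R implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AJObject:
    """Hypothetical model of a long-format AJOBJECT: one (seed, bitstring) pair per match in R."""
    seeds: List[str] = field(default_factory=list)
    # bitstrings[i][t] == 1 means: at iteration t, the sample of seeds[i] equals the left tuple's value.
    bitstrings: List[List[int]] = field(default_factory=list)


def anti_join_holds(aj: AJObject, current_iter: Dict[str, List[int]], db: int) -> bool:
    """True if the left tuple survives the anti-join in DB instance `db`.

    current_iter[seed][db] is the (0-based) iteration that DB instance `db`
    currently points to in the sample stream of `seed`.
    """
    for seed, bits in zip(aj.seeds, aj.bitstrings):
        if bits[current_iter[seed][db]] == 1:
            return False  # at least one right tuple currently matches
    return True  # no active match, so the left tuple appears in the output


def pairs_per_page(num_samples: int, numdb: int,
                   page_bytes: int = 10**6, bytes_per_db: int = 8) -> int:
    """Estimate how many (seed, bitstring) pairs fit in one page (assumed sizes)."""
    bitstring_bytes = num_samples // 8   # one bit per entry of the sample array
    seed_bytes = numdb * bytes_per_db    # assumed per-instance seed state
    return page_bytes // (bitstring_bytes + seed_bytes)


# Tuples 1, 2 and 5 of the walkthrough; a single DB instance (index 0).
tuple1 = AJObject(["S1", "S2"], [[1, 0], [1, 1]])  # CUSTID 21
tuple2 = AJObject(["S1", "S3"], [[0, 1], [0, 0]])  # CUSTID 23
tuple5 = AJObject()                                # no matches in R
current_iter = {"S1": [0], "S2": [0], "S3": [0]}

before = [anti_join_holds(t, current_iter, 0) for t in (tuple1, tuple2, tuple5)]
current_iter["S1"][0] = 1  # Advance() moves DB1 of S1 to the second iteration
after = [anti_join_holds(t, current_iter, 0) for t in (tuple1, tuple2, tuple5)]
```

Under these assumptions, tuple 1 stays out of the output both before and after the Advance() (S2 still matches), tuple 2 drops out once S1 takes the value 23, and `pairs_per_page(10_000, 100)` lands in the "close to 500" range quoted in the text.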

Figure 6-7. Short Format: Tuples after Anti-join

The resulting new AJOBJECT can be seen in Figure 6-6. The size of this attribute is 33 bytes. In comparison, the size of the long format shown in Figure 6-5 for the sample attribute is close to 6,000 bytes (assuming the number of DB instances, numdb, is 100 and the size of the sample array, nummc, is 1000). For the short format in this example, we also insert three additional tuples in the Gibbs queue. These three tuples have the seeds and bitstrings. In Figure 6-7, the first five tuples are from the left relation. These five tuples have their AJOBJECT in the short format. In tuple 1, the AJOBJECT has 2 seed identifiers, 1 and 2. The next 8 tuples are from the right relation, and they are inserted into the Gibbs queue directly (more on this later).

The difficulty with the short format, compared to the long format, is that it requires more changes in the MCDB-R system. As explained in Chapter 2, the execution flow assumes all tuples have the same schema. So additionally sending the tuples from the right relation into the output would result in tuples with different schemas being sent to the next operator. To get around this issue we could create a super relation with attributes from both. This solution still requires each operator in MCDB-R to differentiate between the two kinds of tuples. Each operator would need modifications, and so this solution is not desirable. A cleaner solution is to send the tuples from the right relation R to the Gibbs queue directly. This method only requires changes to the anti-join operator. Though this method is fairly simple for a single anti-join, it requires more complicated schemes when there are cascading anti-joins in the query plan. A generic method that works for queries with cascading anti-joins is discussed later in Section 6.4.

Since we have tuples from both the left and right relations in the Gibbs queue, the number of tuples it holds is high. A big drawback of the short format is that it requires a very high number of remove/reinsert operations in the Gibbs queue. Each reinsert could potentially involve hundreds of tuples and can slow down the execution.

6.3 Discussion on Extreme Samples

A sample is called extreme if the probability of its occurrence is very low. Replacing an extreme sample during the rejection algorithm can be difficult. A solution to this problem is discussed in Chapter 5. However, that solution only works for queries without an anti-join operation. In this section we briefly discuss how to extend the solution in Chapter 5 to work for queries with an anti-join as well.

The problem of replacing an extreme sample is solved by running the query plan separately just to find that sample. Only after the sample is found does the regular execution restart. The modified query run has two steps. The first step is filtering out unnecessary tuples from the execution in order to improve the efficiency of the run. The filtering process is different when there is a stochastic attribute on the right side of the anti-join. There are multiple seed identifiers from the same relation that produced the extreme sample, but not due to a self-join. Of these multiple seed identifiers, we call the one that generated the extreme sample the primary seed identifier.

To recreate the composite attribute we need all seed identifiers present in the tuple. The new samples generated for the primary seed identifier can have a new value not seen before. These new values can match with other new tuples. We cannot filter out any seed identifiers, but for all identifiers except the primary seed identifier we need to generate only the active samples. The unnecessary tuples can be filtered out after the Anti-join operator. The Anti-join operator must be modified to handle the fragmented tuples for the primary seed identifier. This again is similar to how a self-join is modified (Section 5.3).

6.4 For Queries with Multiple Anti-Joins

Some queries can have cascading anti-joins. Two anti-joins are called cascading if the output of one anti-join is given as the right input to another anti-join. The current composite attribute formats work only for queries with non-cascading anti-joins. For queries with cascading anti-joins, the creation of the composite attribute should be recursive. A cascading anti-join at the second level, one whose right input is from another anti-join, stores the AJOBJECT found in the right tuples instead of just seed and bitstring attributes. The function (in the Tail sampler) that retrieves the current active DB instance from the composite attribute would also need changes, since it also needs to work recursively on the composite attribute. These ideas need more work and are not implemented in the system.

6.5 Experiments

In this section we benchmark both of the methods discussed above, Long format and Short format. Our goal is to compare their efficiency and scalability in different settings. The following are the questions we would like to answer:

1. How do both formats work when a large number of seed attributes need to be stored?
2. Which of them is better on large datasets?

Experimental Setup: We use the TPC-H benchmark data generator [17]. The workstation used for the experiments has 32 GB of RAM, four 720 GB Dell hard disks and 8 processing cores of 3.2 GHz each, distributed over 2 chips. The system runs a 64-bit Ubuntu operating system, version 11.04. We implemented our solution in the prototype MCDB-R system. We use the query Q5 explained below. See the Appendix for the complete SQL statements for Q5. Query Q5 is a modified version of query Q3 discussed before in Section 5.6.

Query Q5 (Modified Q3): Here we calculate the error in the total revenue caused by errors in the data integration process. The schema is modified to simulate a data cleaning process. The customer information in the orders relation is assumed to have errors. A new stochastic relation is created to incorporate these errors. This relation has the actual customer key, a list of possible customer keys, and a probability associated with each possible customer key. The query Q3 joins the customer table with orders, and an order contributes to the total revenue only if it finds a valid key to match in customers. In the current query, we calculate the orders which failed to contribute to the revenue because of the nonexistence of the customer information in the customer table.

To control the number of tuples from customer that join with each order tuple, we regulate how the possible customer keys are generated in the new relation. We divide the customer key values into non-overlapping groups. For each key in a given group, the list of possible customer keys is drawn from that same group. The probability of each possible key is the same, and is 1/m, where m is the size of the group to which the actual customer key belongs. The size of all groups is the same, and this size m is the number of tuples we want from the customer relation to join with a single tuple in the orders table. So, if we want the composite attribute to have 300 seeds, then we need 300 tuples from the customer relation to join with that tuple. This can only happen if the same value is generated in 300 tuples in customers, and hence the group size m should be 300. The equal probability for each possible value ensures that there is a good chance that each value will be generated in the sample array.

Table 6-1. Performance with varying seeds per tuple (in minutes)

Num. of Seeds per Tuple    10     20     50     100    200    300
Long format                2.7    3      3.8    7.3    19.8   35
Short format               2.3    2.5    3.6    7      13     19.3

Given this setup, there are five different variables we need to know to understand the results:

1. numSeedIds: The number of seed identifiers processed during the query execution.
2. numdb: Number of DB instances for the query. This is set to 100 for all tests in this section.
3. SeedsPerTuple: The average number of seed attributes stored in each AJOBJECT.
4. gibbsIters: Number of iterations for the Gibbs Looper. This value is 5, unless specified otherwise.
5. eliteDBPercent: Percentage of DB instances retained at the end of each Gibbs iteration. The deleted DB instances are cloned from the retained elite instances. We have this value at 50% for all the tests.

We use two tests for evaluation. Test 1 compares the performance of both formats with increasing AJOBJECT size. In Test 2, we compare the performance of the two formats for varying sizes of the composite attribute, and also for a changing number of input tuples (numSeedIds). Both tests use query Q5. The variables for each of the tests are as follows:

1. Varying SeedsPerTuple: For this test, numSeedIds is set to 1000. The value of SeedsPerTuple is varied from 10 to 300. numdb is set to 100, gibbsIters to 5, and eliteDBPercent to 50%.
2. Performance on large datasets: For this test, numSeedIds is set to 100,000 and 15,000 for dataset 1 and dataset 2 respectively. The value of SeedsPerTuple is 10 and 300 for datasets 1 and 2 respectively. numdb is set to 100, gibbsIters to 5, and eliteDBPercent to 50%.
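The grouping scheme used to generate the possible customer keys for Q5 can be sketched as follows. This is a hedged illustration rather than the generator actually used in the experiments: the function name `possible_keys`, the assumption that customer keys are consecutive integers starting at 1, and the choice of consecutive keys per group are all assumptions made here.

```python
from fractions import Fraction
from typing import List, Tuple


def possible_keys(actual_key: int, m: int) -> List[Tuple[int, Fraction]]:
    """Return the (candidate key, probability) list for one actual customer key.

    Keys are assumed to be 1, 2, 3, ... and are split into non-overlapping
    groups of m consecutive keys; every key in the group is equally likely.
    """
    if actual_key < 1 or m < 1:
        raise ValueError("actual_key and m must be positive")
    group = (actual_key - 1) // m   # index of the group containing actual_key
    first = group * m + 1           # smallest key in that group
    return [(k, Fraction(1, m)) for k in range(first, first + m)]


# With m = 3, keys 4, 5 and 6 share one group and one candidate list.
candidates = possible_keys(5, 3)
```

Because every key in a group shares the same candidate list, each candidate value can be generated by up to m different tuples, which is how the setup steers the number of seeds stored per AJOBJECT toward m.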

Table 6-2. Performance on large datasets (in minutes)

                    Dataset 1    Dataset 2
Num. of Tuples      100,000      15,000
Seeds per Tuple     10           300
Long format         30.8         552
Short format        54.6         258.3

Results: Results for Test 1 are given in Table 6-1. The table shows the performance of both formats with a varying number of seed attributes stored in the composite attribute, the AJOBJECT. The number of seed attributes per tuple is the group size m described in the query description. All the values listed are the number of minutes taken for the query execution. Table 6-2 shows the performance results for Test 2. The query is run on two datasets. The execution times are shown in the Long format and Short format rows.

Discussion: Test 1: As can be seen in Table 6-1, for smaller numbers of seed attributes per tuple (up to 100) both solutions perform almost the same, with Short format having a slight edge. For the other two cases (200 and 300 seed attributes per tuple) there is a clear difference in performance. Here the experimental results are as expected. The Long format, with its larger tuple sizes, reads and writes more data from and to disk. The execution time increases with an increasing number of seed attributes. The reasons are twofold. First, for Long format the size of the tuples becomes larger, and for Short format the number of tuples in the Gibbs queue grows. When m, the number of seed attributes per tuple, is 300, the size of a tuple in the Gibbs queue for Long format is approximately 260 KB. For Short format, though the tuple size is very small, the number of tuples in the Gibbs queue is around 270,900. The second reason is the number of tuples processed in a single Gibbs iteration. Since each tuple now has more seed attributes, more tuples are pulled out for each seed attribute, and so there are more reinserts into the Gibbs queue. This increases the execution time considerably.

Test 2: The results are shown in Table 6-2. The first dataset has 100,000 tuples and 10 customers in each group. Therefore, the number of seed attributes in the composite attribute is 10. Surprisingly, Long format performs better on dataset 1. This can be attributed to Short format being slowed down by all the extra tuples in the Gibbs queue. For this dataset, the Gibbs queue contains 1,100,000 tuples for Short format, whereas it contains only 100,000 tuples for Long format. The second dataset has 15,000 tuples and 300 customers in each group. This results in 4,515,000 tuples in the Gibbs queue for Short format. Even with so many tuples, it performs better than Long format because of its lower disk usage. For this query, each tuple in the Gibbs queue is removed and reinserted until all seed attributes in that tuple are exhausted. Since the tuples are massive (approximately 260 KB) for Long format, multiple removes and reinserts are more expensive.

Conclusion: After looking at the results from both tests, the Short format seems more useful. Though we cannot say it is always better, in most results it outperforms the Long format. The most important factor in recommending Short format is the inability of Long format to store a large number of seed attributes.
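The Gibbs queue sizes quoted in the Test 2 discussion are consistent with a simple count: one left tuple plus m right tuples for each of the n input tuples under Short format, and only the n left tuples under Long format. The sketch below checks that reading against the reported numbers. The formula is inferred from those numbers rather than stated in the text, and the function names are hypothetical.

```python
def short_format_queue_size(num_tuples: int, seeds_per_tuple: int) -> int:
    """Assumed count: each left tuple plus its seeds_per_tuple matching right tuples."""
    return num_tuples * (1 + seeds_per_tuple)


def long_format_queue_size(num_tuples: int) -> int:
    """Assumed count: only the left tuples, each carrying its own AJOBJECT."""
    return num_tuples


dataset1_short = short_format_queue_size(100_000, 10)   # reported as 1,100,000
dataset2_short = short_format_queue_size(15_000, 300)   # reported as 4,515,000
dataset1_long = long_format_queue_size(100_000)         # reported as 100,000
```

Under this reading, Short format trades a roughly (m + 1)-fold larger queue for much smaller tuples, which matches the explanation given for why it loses on dataset 1 and wins on dataset 2.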

CHAPTER 7
CONCLUSION

We presented two methods to improve the performance of the MCDB-R system. The first method is to serialize and de-serialize the sample generating process. We implemented and tested the idea in the prototype system. As shown in the experiments in Section 4.7, for many types of VG functions the performance benefits are great. Next, we presented an efficient technique to replace extreme samples. We provided a two-step process: filtering out unnecessary tuples and fragmenting the very large tuple bundle. The implementation of this process resulted in significant performance benefits, an improvement factor of 2 or better in many cases, as shown in Section 5.6. Finally, we looked at how to incorporate the anti-join operator in MCDB-R. We gave two solutions which work for queries with a single anti-join operation. The two solutions are orthogonal in how they generate tuple bundles at the anti-join operator. We will look at extending the work to facilitate multiple anti-join operators in a single query plan.

APPENDIX: QUERIES

Query Q1

    CREATE VIEW params AS
    SELECT 2.0 AS p0shape,
           1.333 * AVG(l_extendedprice * (1.0 - l_discount)) AS p0scale,
           2.0 AS d0shape,
           4.0 * AVG(l_quantity) AS d0scale,
           l_partkey AS p_partkey
    FROM lineitem l
    GROUP BY l_partkey

    CREATE TABLE demands (new_dmnd, old_dmnd, old_prc, new_prc, nd_partkey, nd_suppkey) AS
    FOR EACH l IN (
        SELECT * FROM lineitem, orders
        WHERE l_orderkey = o_orderkey AND yr(o_orderdate) = 1995)
    WITH new_dmnd AS Bayesian (
        (SELECT p0shape, p0scale, d0shape, d0scale
         FROM params
         WHERE l_partkey = p_partkey)
        (VALUES (l_quantity,
                 l_extendedprice * (1.0 - l_discount) / l_quantity,
                 l_extendedprice * 1.05 * (1.0 - l_discount) / l_quantity)))
    SELECT nd.value, l_quantity,
           l_extendedprice * (1.0 - l_discount) / l_quantity,
           1.05 * l_extendedprice * (1.0 - l_discount) / l_quantity,
           l_partkey, l_suppkey
    FROM new_dmnd nd

    SELECT SUM(new_prf - old_prf) AS totalProfit
    FROM (
        SELECT new_dmnd * (new_prc - ps_supplycost) AS new_prf,
               old_dmnd * (old_prc - ps_supplycost) AS old_prf
        FROM partsupp, demands
        WHERE ps_partkey = nd_partkey AND ps_suppkey = nd_suppkey)
    WITH RESULTDISTRIBUTION MONTECARLO(100)
    DOMAIN totalProfit <= QUANTILE(0.001)
    FREQUENCYTABLE totalProfit

Query Q2

    CREATE TABLE random_ord (o_orderkey, o_yr, o_tot) AS
    FOR EACH o IN (SELECT * FROM orders)
    WITH VAL AS Normal (VALUES (o_mean, o_var))
    SELECT o.o_orderkey, year(o.o_orderdate), v.value
    FROM VAL v

    SELECT SUM(val) AS totalLoss
    FROM random_ord, lineitem
    WHERE o_orderkey = l_orderkey AND (o_yr = 1994 OR o_yr = 1995)

    SELECT SUM(grpsize * o_mean) AS mean,
           SUM(grpsize * grpsize * o_var) AS var
    FROM (
        SELECT o_mean, o_var, COUNT(*) AS grpsize
        FROM orders, lineitem
        WHERE year(o_orderdate) IN (1994, 1995) AND o_orderkey = l_orderkey
        GROUP BY o_orderkey, o_mean, o_var)

    WITH RESULTDISTRIBUTION MONTECARLO(100)
    DOMAIN totalLoss >= QUANTILE(0.999)

Query Q3

    CREATE VIEW from_japan AS
    SELECT c_custkey
    FROM customer, nation
    WHERE n_nationkey = c_nationkey AND n_name = 'JAPAN'

    CREATE TABLE fixed_cust AS
    FOR EACH o IN orders
    WITH newcustkey AS DiscreteChoice (
        SELECT custkey, probability
        FROM error_custkey
        WHERE old_custkey = o_custkey)
    SELECT value AS o_newcustkey, o_orderkey
    FROM newcustkey

    SELECT SUM(l_extendedprice * (1.0 - l_discount)) AS revJapan
    FROM lineitem, fixed_cust, from_japan
    WHERE l_orderkey = o_orderkey AND o_newcustkey = c_custkey
    WITH RESULTDISTRIBUTION MONTECARLO(100)
    DOMAIN revJapan >= QUANTILE(0.999)

Query Q4

    CREATE TABLE actual_ship AS
    FOR EACH o IN order_ship
    WITH newshipdays AS Poisson (
        SELECT a.AvgShipDays
        FROM avg_ship a
        WHERE a.ShipDays = o.ShipDays)
    SELECT ShipDays, AvgShipDays AS ExpShipDays
    FROM newshipdays

    SELECT SUM(s.Penalty) AS totalPenalty
    FROM actual_ship a, ship_penalty s
    WHERE a.ShipDays - a.ExpShipDays = s.DaysDelayed
    WITH RESULTDISTRIBUTION MONTECARLO(100)
    DOMAIN MaxPenalty >= QUANTILE(0.99)

Query Q5

    CREATE VIEW from_japan AS
    SELECT c_custkey
    FROM customer, nation
    WHERE n_nationkey = c_nationkey AND n_name = 'JAPAN'

    CREATE TABLE fixed_cust AS
    FOR EACH o IN orders
    WITH newcustkey AS DiscreteChoice (
        SELECT custkey, probability
        FROM error_custkey
        WHERE old_custkey = o_custkey)
    SELECT value AS o_newcustkey, o_orderkey
    FROM newcustkey

    SELECT SUM(o_totalprice) AS totalLost
    FROM orders
    WHERE o_newcustkey NOT IN (SELECT o_custkey FROM fixed_cust)
    WITH RESULTDISTRIBUTION MONTECARLO(100)
    DOMAIN totalLost >= QUANTILE(0.999)

REFERENCES

[1] Abadi, D. J., Madden, S., & Ferreira, M. (2006). Integrating compression and execution in column-oriented database systems. In SIGMOD Conference, (pp. 671).
[2] Agrawal, P., Benjelloun, O., Sarma, A. D., Hayworth, C., Nabar, S. U., Sugihara, T., & Widom, J. (2006). Trio: A system for data, uncertainty, and lineage. In VLDB, (pp. 1151).
[3] Ahmed, A., Low, Y., Aly, M., Josifovski, V., & Smola, A. J. (2011). Scalable distributed inference of dynamic user interests for behavioral targeting. In KDD, (pp. 114).
[4] Antova, L., Jansen, T., Koch, C., & Olteanu, D. (2008). Fast and simple relational processing of uncertain data. In ICDE, (pp. 983).
[5] Antova, L., Koch, C., & Olteanu, D. (2007). Maybms: Managing incomplete information with probabilistic world-set decompositions. In ICDE, (pp. 1479).
[6] Arumugam, S., Jampani, R., Xu, F., Perez, L. L., Jermaine, C. M., & Haas, P. J. (2010). Mcdb-r: Risk analysis in the database. PVLDB, 3(1), 782.
[7] Barbara, D., Garcia-Molina, H., & Porter, D. (1992). The management of probabilistic data. vol. 4, (pp. 487).
[8] Benjelloun, O., Sarma, A. D., Halevy, A. Y., Theobald, M., & Widom, J. (2008). Databases with uncertainty and lineage. VLDB J., 17(2), 243.
[9] Beskales, G., Soliman, M. A., & Ilyas, I. F. (2008). Efficient search for the top-k probable nearest neighbors in uncertain databases. vol. 1, (pp. 326).
[10] Botev, Z. I., & Kroese, D. P. (2008). An efficient algorithm for rare-event probability estimation, combinatorial optimization, and counting. In Methodology and Computing in Applied Probability, vol. 10, (pp. 471).
[11] Cerou, F., Moral, P. D., Furon, T., & Guyader, A. (2012). Sequential monte carlo for rare event estimation. vol. 22, (pp. 795).
[12] Chen, W., Zhang, D., & Chang, E. Y. (2008). Combinational collaborative filtering for personalized community recommendation. In KDD, (pp. 115).
[13] Cheng, R., Singh, S., & Prabhakar, S. (2005). U-DBMS: A database system for managing constantly-evolving data. In VLDB, (pp. 1271).
[14] Chouchane, M., Paris, S., Gland, F. L., & Ouladsine, M. (2012). Splitting method for spatio-temporal sensors deployment in underwater systems. In EvoCOP, (pp. 243).

[15] Cormode, G., Li, F., & Yi, K. (2009). Semantics of ranking queries for probabilistic data and expected ranks. In ICDE, (pp. 305).
[16] Corp., O. (2012). Oracle crystal ball. Oracle (http://www.oracle.com/us/products/applications/crystalball/index.html).
[17] Council, T. P. P. (2012). Tpc benchmark h. (http://www.tpc.org/tpch/default.asp).
[18] Dalvi, N. N., & Suciu, D. (2004). Efficient query evaluation on probabilistic databases. In VLDB, (pp. 864).
[19] Dalvi, N. N., & Suciu, D. (2007). Management of probabilistic data: foundations and challenges. In PODS, (pp. 1).
[20] Deshpande, A. (2009). Prdb: Managing large-scale correlated probabilistic databases (abstract). In SUM, (p. 1).
[21] Deshpande, A., & Madden, S. (2006). MauveDB: supporting model-based user views in database systems. In SIGMOD, (pp. 73).
[22] Deutch, D., Greenshpan, O., Kostenko, B., & Milo, T. (2011). Using markov chain monte carlo to play trivia. In ICDE, (pp. 1308).
[23] Doshi-Velez, F., Knowles, D., Mohamed, S., & Ghahramani, Z. (2009). Large scale nonparametric bayesian inference: Data parallelisation in the indian buffet process. In NIPS, (pp. 1294).
[24] Fuhr, N., & Rolleke, T. (1997). A probabilistic relational algebra for the integration of information retrieval and database systems. vol. 15, (pp. 32).
[25] Gamerman, D., & Lopes, H. F. (2006). Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference. Chapman & Hall/CRC Texts in Statistical Science, second ed.
[26] Garcia-Molina, H., Ullman, J. D., & Widom, J. (1999). Database System Implementation. Prentice Hall.
[27] Ge, T., Grabiner, D., & Zdonik, S. B. (2011). Monte carlo query processing of uncertain multidimensional array data. In ICDE, (pp. 936).
[28] Ge, T., Zdonik, S. B., & Madden, S. (2009). Top-k queries on uncertain data: on score distribution and typical answers. In SIGMOD Conference, (pp. 375).
[29] Goldstein, J., Ramakrishnan, R., & Shaft, U. (1998). Compressing relations and indexes. In IEEE International Conference on Data Engineering, (pp. 370).
[30] Graefe, G., & Shapiro, L. D. (1991). Data compression and database performance. In ACM/IEEE-CS Symp. On Applied Computing, (pp. 22).

[31] Gupta, R., & Sarawagi, S. (2006). Creating probabilistic databases from information extraction models. In VLDB, (pp. 965).
[32] Hua, M., Pei, J., Zhang, W., & Lin, X. (2008). Ranking queries on uncertain data: a probabilistic threshold approach. In SIGMOD Conference, (pp. 673).
[33] Huang, J., Antova, L., Koch, C., & Olteanu, D. (2009). Maybms: a probabilistic database management system. In SIGMOD Conference, (pp. 1071).
[34] Imielinski, T., & Jr., W. L. (1984). Incomplete information in relational databases. vol. 31, (pp. 761).
[35] Insight, Z. (2008). Dealing with the unexpected: Lessons for risk managers from the credit crisis. (http://www.zurich.com/corporatebusiness/home/home.htm).
[36] Iyer, B. R., & Wilhite, D. (1994). Data compression support in databases. In VLDB, (pp. 695).
[37] Jampani, R., Xu, F., Wu, M., Perez, L. L., Jermaine, C. M., & Haas, P. J. (2008). Mcdb: a monte carlo approach to managing uncertain data. In SIGMOD Conference, (pp. 687).
[38] Jampani, R., Xu, F., Wu, M., Perez, L. L., Jermaine, C. M., & Haas, P. J. (2011). The monte carlo database system: Stochastic analysis close to the data. In ACM Trans. Database Syst., vol. 36, (p. 18).
[39] Kanagal, B., & Deshpande, A. (2010). Lineage processing over correlated probabilistic databases. In SIGMOD Conference, (pp. 675).
[40] Karp, R. M., & Luby, M. (1983). Monte-carlo algorithms for enumeration and reliability problems. In FOCS, (pp. 56).
[41] Koch, C., & Olteanu, D. (2008). Conditioning probabilistic databases. In VLDB.
[42] Lakshmanan, L. V. S., Leone, N., Ross, R. B., & Subrahmanian, V. S. (1997). Probview: A flexible probabilistic database system. vol. 22, (pp. 419).
[43] Li, J., Saha, B., & Deshpande, A. (2009). A unified approach to ranking in probabilistic databases. PVLDB, 2(1), 502.
[44] Li, J., Saha, B., & Deshpande, A. (2011). A unified approach to ranking in probabilistic databases. VLDB J., 20(2), 249.
[45] Palisade (2012). @risk: a new standard in risk analysis. Palisade (http://www.palisade.com/risk/).
[46] Panneton, F., L'Ecuyer, P., & Matsumoto, M. (2006). Improved long-period generators based on linear recurrences modulo 2. In ACM Trans. Math. Software 32, 1.

[47] Porteous, I., Newman, D., Ihler, A. T., Asuncion, A. U., Smyth, P., & Welling, M. (2008). Fast collapsed gibbs sampling for latent dirichlet allocation. In KDD, (pp. 569).
[48] Ratanaworabhan, P., Ke, J., & Burtscher, M. (2006). Fast lossless compression of scientific floating-point data. In DCC, (pp. 133).
[49] Re, C., Dalvi, N. N., & Suciu, D. (2007). Efficient top-k query evaluation on probabilistic data. In ICDE, (pp. 886).
[50] Re, C., & Suciu, D. (2008). Managing probabilistic data with MystiQ: The can-do, the could-do, and the can't-do. In SUM, (pp. 5).
[51] Robert, C., & Casella, G. (2004). Monte Carlo Statistical Methods. Springer, second ed.
[52] Roth, M. A., & Horn, S. J. V. (1993). Database compression. SIGMOD Record, (pp. 31).
[53] Rubino, G., & Tuffin, B. (1980). Rare Event Simulation Using Monte Carlo. Wiley.
[54] Rubinstein, R. (2009). The gibbs cloner for combinatorial optimization, counting and sampling. Methodology and Computing in Applied Probability, 11, 491.
[55] Savage, S. L. (2009). The Flaw of Averages: Why we underestimate risk in the face of uncertainty. Wiley.
[56] Sen, P., Deshpande, A., & Getoor, L. (2009). Prdb: managing and exploiting rich correlations in probabilistic databases. vol. 18, (pp. 1065).
[57] Singh, S., & McCallum, A. (2011). Towards asynchronous distributed mcmc inference for large graphical models. In Neural Information Processing Systems (NIPS), Big Learning Workshop on Algorithms, Systems, and Tools for Learning at Scale.
[58] Soliman, M. A., Ilyas, I. F., & Chang, K. C.-C. (2007). Top-k query processing in uncertain databases. In ICDE, (pp. 896).
[59] Stonebraker, M., Becla, J., DeWitt, D. J., Lim, K.-T., Maier, D., Ratzesberger, O., & Zdonik, S. B. (2009). Requirements for science databases and scidb. In Proc. CIDR, (p. 26).
[60] Stoyanovich, J., Davidson, S. B., Milo, T., & Tannen, V. (2011). Deriving probabilistic databases with inference ensembles. In ICDE, (pp. 303).
[61] Suciu, D., Olteanu, D., Re, C., & Koch, C. (2011). Probabilistic Databases. Synthesis Lectures on Data Management. Morgan & Claypool Publishers.
[62] Systems, L. D. (2012). Risk and uncertainty analysis. Lumina Decision Systems (http://www.lumina.com/case-studies/risk-analysis/).

[63] Thiagarajan, A., & Madden, S. (2008). Querying continuous functions in a database system. In Proc. SIGMOD, (pp. 791).
[64] Walker, D. D., & Ringger, E. K. (2008). Model-based document clustering with a collapsed gibbs sampler. In KDD, (pp. 704).
[65] Wang, C., Yuan, L.-Y., You, J.-H., Zaïane, O. R., & Pei, J. (2011). On pruning for top-k ranking in uncertain databases. vol. 4, (pp. 598).
[66] Wang, D. Z., Franklin, M. J., Garofalakis, M. N., Hellerstein, J. M., & Wick, M. L. (2011). Hybrid in-database inference for declarative information extraction. In SIGMOD Conference, (pp. 517).
[67] Wang, D. Z., Michelakis, E., Garofalakis, M., & Hellerstein, J. (2008). BayesStore: Managing large, uncertain data repositories with probabilistic graphical models. In VLDB.
[68] Wick, M. L., McCallum, A., & Miklau, G. (2010). Scalable probabilistic databases with factor graphs and mcmc. vol. 3, (pp. 794).
[69] Widom, J. (2005). Trio: A system for integrated management of data, accuracy, and lineage. In CIDR, (pp. 262).
[70] Wikipedia (2012). Late 2000s financial crisis. (http://en.wikipedia.org/wiki/Late-2000s_financial_crisis).
[71] Yao, L., Mimno, D. M., & McCallum, A. (2009). Efficient methods for topic model inference on streaming document collections. In KDD, (pp. 937).
[72] Yi, K., Li, F., Kollios, G., & Srivastava, D. (2008). Efficient processing of top-k queries in uncertain databases. In ICDE, (pp. 1406).
[73] Zhao, B., Rubinstein, B. I. P., Gemmell, J., & Han, J. (2012). A bayesian approach to discovering truth from conflicting sources for data integration. vol. 5, (pp. 550).
[74] Zukowski, M., Heman, S., Nes, N., & Boncz, P. A. (2006). Super-scalar ram-cpu cache compression. In ICDE, (p. 59).

BIOGRAPHICAL SKETCH

Ravi Jampani is a PhD student at the University of Florida, Gainesville. He is advised by Dr. Chris Jermaine and Dr. Alin Dobra in the broad research topic of database management systems. In databases, his specific interests are in query processing and indexing. Apart from databases, he is also interested in the fields of natural language processing, multiagent systems, data structures and algorithms. Prior to joining the University of Florida, Ravi received Bachelor of Technology (2003) and Master of Science (2005) degrees from the International Institute of Information Technology (IIIT), Hyderabad, India.