Efficient Query Processing in MCDB-R

Permanent Link: http://ufdc.ufl.edu/UFE0044090/00001

Material Information

Title: Efficient Query Processing in MCDB-R
Physical Description: 1 online resource (116 p.)
Language: English
Creator: Jampani, Ravindranath Chowdary
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2012

Subjects

Subjects / Keywords: databases -- probabilistic -- processing -- query -- statistical
Computer and Information Science and Engineering -- Dissertations, Academic -- UF
Genre: Computer Engineering thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract: Quantitative risk analysis has become an integral part of large financial institutions. For banking and insurance enterprises, calculating the risk of future insolvency is mandatory under regulatory directives such as Solvency and Basel. Performing risk analysis in these enterprises is difficult because the systems are highly complex and have a large number of uncertain variables. One form of risk analysis is understanding the impact of extreme events that have a low chance of occurrence. The common methods used to estimate this extreme risk come from the fields of statistics and rare-event simulation, and are based on Monte Carlo sampling. However, these sampling-based methods produce huge amounts of data. Therefore, current implementations of risk analysis methods have scalability problems, particularly for complex systems with a very large number of variables. Monte Carlo DataBase with Risk analysis (MCDB-R) is a database system designed to facilitate scalable risk analysis on large data sets. MCDB-R combines sampling-based statistical methods and relational database technology for scalability on large uncertain data sets. Uncertainties in data are modeled as probability distributions. A query on this uncertain data results in another probability distribution, called the query-result distribution. For a given scenario defined by a query, analyzing its extreme events requires examining the upper or lower areas of the query-result distribution, where the events occur with very low probability. The specific methods used in MCDB-R for this analysis are Gibbs sampling from statistics and cloning from rare-event simulation. These methods are used to efficiently sample from the low-probability areas of the query-result distribution. This dissertation discusses three improvements to the query processor in MCDB-R.
The first technique provides a mechanism to move sample generation (during query execution) closer to the location where the samples are actually processed. A huge number of samples is generated during query execution in MCDB-R, and this new mechanism reduces sample movement in the memory hierarchy. Query response time improves significantly, sometimes by an order of magnitude, as shown in the experiments. MCDB-R employs a rejection algorithm at the core of its sampling-based risk analysis. For each uncertain variable, the rejection algorithm looks at a series of samples and discards each one unless it fits a given constraint. Sometimes the constraints are so stringent that the rejection algorithm needs to process millions of samples before finding a good fit. In MCDB-R, an instance like this requires the same number of samples to be produced for all variables. In the second technique, this problematic instance is isolated from the rest of the query execution and run separately to find an acceptable fit. Normal execution resumes after the required sample is found. Finally, the addition of an anti-join operator to the MCDB-R execution engine is explained. This new operator enables the system to execute subset-based queries with NOT IN and NOT EXISTS clauses. Performing an anti-join in this system is not trivial because of the stochastic nature of the data: the system does not know which samples are actually used until the end of the query execution. Two methods to implement the anti-join operator are discussed and then compared through experiments.
General Note: In the series University of Florida Digital Collections.
General Note: Includes vita.
Bibliography: Includes bibliographical references.
Source of Description: Description based on online resource; title from PDF title page.
Source of Description: This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility: by Ravindranath Chowdary Jampani.
Thesis: Thesis (Ph.D.)--University of Florida, 2012.
Local: Adviser: Dobra, Alin.

Record Information

Source Institution: UFRGP
Rights Management: Applicable rights reserved.
Classification: lcc - LD1780 2012
System ID: UFE0044090:00001


Full Text

EFFICIENT QUERY PROCESSING IN MCDB-R

By

RAVINDRANATH CHOWDARY JAMPANI

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2012

© 2012 Ravindranath Chowdary Jampani

Dedicated to my family, friends and teachers

ACKNOWLEDGMENTS

I would like to thank my advisors, Chris and Alin. I have been fortunate to have them as my advisors, and this work would not be possible without their guidance and support. Their vast knowledge, quick thinking, great insight into the problems and their enthusiasm for research and teaching have been inspiring. It has been great working with them. I am thankful to Chris for letting me work on my thesis project. I am grateful to the Department of Computer & Information Sciences & Engineering (CISE), University of Florida and my advisors for supporting me throughout my studies. I am thankful to Dr. Ungor for letting me work on the meshing project. This project was a great learning experience. I am thankful to my colleagues and friends at CISE: Mingxi, Subi, Nirmalya, Manu, Niketan, Lixia, Manas, Luis, Fei, Ferhat, Florin, Manna, Vishak, and Amit. I am grateful to Subi Arumugam and Risi Thonangi for their conversations related to research and implementation of database systems.

Special thanks to the students who attended my Microsoft Office course in the Fall of 2010. They were a great motivation for me to put in the necessary effort to prepare for and teach the topics. Their feedback during the course was also very useful. It remains a unique teaching experience for me.

I would also like to mention the International Institute of Information Technology (IIIT), Hyderabad, India, for providing me with a great learning environment during my undergraduate studies. Thanks to Dr. Govindarajulu Regeti and Dr. Rajeev Sangal for inspiring me to pursue a doctorate, and to Dr. Kamal Karlapalem, Dr. Sushma Bendre and Dr. Prosenjit Gupta for helping me in my application process.

I am grateful to my friends from my undergraduate days: Sashi, Srikanth, Rishi, Santhosh, Jhansi, and Teja. They have been a source of constant support. I am fortunate to have formed friendships with Tapasvi, Manu, Niketan, Subhajit, Arun, Ganesh, Mriganka, Siva, Aswin, Jayendra and Muthu during my graduate studies. Gainesville was a fun place to stay because of them.

Finally, I would like to thank my family, whose continued encouragement has made this effort possible.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
  1.1 Main Contributions
    1.1.1 Problem 1: Serializing Variable Generating Functions
    1.1.2 Problem 2: Finding an Extreme Sample
    1.1.3 Problem 3: Incorporating Anti-Join
  1.2 Organization

2 MCDB-R OVERVIEW
  2.1 Background
    2.1.1 Rare Event Simulation
    2.1.2 Gibbs Sampling
  2.2 Schema Specification
  2.3 The Variable Generating Function
  2.4 The Seed Operator
  2.5 The Instantiate operator
  2.6 Gibbs Tuple Bundles
    2.6.1 Tail-Sampling-Seed and Seed Queue
    2.6.2 Split Operator
  2.7 Gibbs Looper and Tail Sampler
    2.7.1 Gibbs Queue
    2.7.2 Rerunning the Query Plan
    2.7.3 Materializing the Operators

3 RELATED WORK
  3.1 Statistical Operations in Relational Databases
  3.2 Probabilistic Databases
  3.3 Top-k Ranking in Probabilistic Databases
  3.4 Monte Carlo Methods in Databases, Data Mining, and Machine Learning
  3.5 Risk Analysis Software

4 SERIALIZING SAMPLE GENERATION PROCESS
  4.1 Background
  4.2 Motivation
  4.3 Solution Outline
  4.4 Serialization
    4.4.1 Incorporating Serialization in a Query Plan
    4.4.2 Stochastic Join Predicates
  4.5 Efficient De-serialization
  4.6 Related Work
  4.7 Experiments

5 REPLACING AN EXTREME SAMPLE
  5.1 Motivation
  5.2 Solution Outline
  5.3 Filtering
  5.4 Fragmenting a Large Tuple Bundle
  5.5 Implementation Details
    5.5.1 TS Seed and Sample Generation
    5.5.2 How to Identify an Extreme Sample?
  5.6 Experiments

6 ANTI JOIN
  6.1 Motivation
  6.2 Composite Attribute Formats
    6.2.1 Long Format
    6.2.2 Short Format
  6.3 Discussion on Extreme Samples
  6.4 For Queries with Multiple Anti-Joins
  6.5 Experiments

7 CONCLUSION

APPENDIX: QUERIES
REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

1-1 Example: Flaw of averages
4-1 Wall clock execution times
4-2 Disk I/O and sample statistics
4-3 Scaling with database size
4-4 Scaling with DB instances (time in seconds)
4-5 Parallel sample generation
4-6 Performance for different VG functions (time in minutes)
5-1 Poisson random relation
5-2 After a phase of rejection sampler
5-3 Wall clock running times (in seconds)
5-4 Number of extreme samples identified
5-5 Performance with increasing input size (in minutes)
5-6 Samples generated (in Billions)
5-7 Query reruns and extreme samples
5-8 Performance with change in aggregate attribute
5-9 Performance with changing max. samples (in minutes)
5-10 Execution parameters with changing max. samples
5-11 Number of extreme samples with increasing Gibbs iterations
6-1 Performance with varying seeds per tuple (in minutes)
6-2 Performance on large data sets (in minutes)

LIST OF FIGURES

2-1 CustOrder table: (a) Deterministic, and (b) Uncertain
2-2 Stochastic CustOrder table
2-3 Multiple instances of CustOrder after Monte Carlo sampling
2-4 Query-result histogram
2-5 Gibbs cloning: Generating samples
2-6 Gibbs cloning: Deleting lower 50%
2-7 Gibbs cloning: Clone the elite samples
2-8 Gibbs cloning: After perturbation with Gibbs sampler
2-9 An example Seed attribute
2-10 Instantiate Operator
2-11 Tail sampler operator
2-12 CustOrder with Gibbs tuples
2-13 Example: TS Seed and Seed
2-14 Stochastic table in a join
2-15 Stochastic table after a Split
2-16 Join result tuples
2-17 Gibbs queue
2-18 Example: End of run 1
2-19 Example: Beginning of run 2
2-20 Query plan before materialization
2-21 Query plan after materialization
2-22 Query plan with multiple reruns
4-1 An MCDB-R query plan
4-2 Modified query plan
4-3 A VG function process
4-4 Serialization of VG function
4-5 Bayesian VG: Input serialization
4-6 Bayesian VG: Inner input
4-7 Bayesian VG: Outer input
4-8 Serialization of Bayesian VG input process
4-9 Join on stochastic attribute on one side
4-10 Query plan after serialization
4-11 Join on stochastic attributes on both sides
4-12 Query plan after serialization
5-1 Query plan
5-2 Modified plan with filtering
5-3 Splitting a large tuple bundle
5-4 Query plan with Sample router: normal mode
5-5 Query plan with Sample router: active mode
5-6 A TS Seed and assigned sample values
5-7 TS Seed and assigned samples after a Gibbs sample iteration
6-1 Example: Anti-Join in a relational database
6-2 Example: Stochastic table
6-3 Example: After Split
6-4 Composite Attribute: Long format
6-5 Long Format: Tuples after Anti-join
6-6 Composite Attribute: Short format
6-7 Short Format: Tuples after Anti-join

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

EFFICIENT QUERY PROCESSING IN MCDB-R

By

Ravindranath Chowdary Jampani

August 2012

Chair: Alin V. Dobra
Major: Computer Engineering

Quantitative risk analysis has become an integral part of large financial institutions. For banking and insurance enterprises, calculating the risk of future insolvency is mandatory under regulatory directives such as Solvency and Basel. Performing risk analysis in these enterprises is difficult because the systems are highly complex and have a large number of uncertain variables. One form of risk analysis is understanding the impact of extreme events that have a low chance of occurrence. The common methods used to estimate this extreme risk come from the fields of statistics and rare-event simulation, and are based on Monte Carlo sampling. However, these sampling-based methods produce huge amounts of data. Therefore, current implementations of risk analysis methods have scalability problems, particularly for complex systems with a very large number of variables.

Monte Carlo DataBase with Risk analysis (MCDB-R) is a database system designed to facilitate scalable risk analysis on large data sets. MCDB-R combines sampling-based statistical methods and relational database technology for scalability on large uncertain data sets. Uncertainties in data are modeled as probability distributions. A query on this uncertain data results in another probability distribution, called the query-result distribution. For a given scenario defined by a query, analyzing its extreme events requires examining the upper or lower areas of the query-result distribution, where the events occur with very low probability. The specific methods used in MCDB-R for this analysis are Gibbs sampling from statistics and cloning from rare-event simulation. These methods are used to efficiently sample from the low-probability areas of the query-result distribution. This dissertation discusses three improvements to the query processor in MCDB-R.

The first technique provides a mechanism to move sample generation (during query execution) closer to the location where the samples are actually processed. A huge number of samples is generated during query execution in MCDB-R, and this new mechanism reduces sample movement in the memory hierarchy. Query response time improves significantly, sometimes by an order of magnitude, as shown in the experiments.

MCDB-R employs a rejection algorithm at the core of its sampling-based risk analysis. For each uncertain variable, the rejection algorithm looks at a series of samples and discards each one unless it fits a given constraint. Sometimes the constraints are so stringent that the rejection algorithm needs to process millions of samples before finding a good fit. In MCDB-R, an instance like this requires the same number of samples to be produced for all variables. In the second technique, this problematic instance is isolated from the rest of the query execution and run separately to find an acceptable fit. Normal execution resumes after the required sample is found.

Finally, the addition of an anti-join operator to the MCDB-R execution engine is explained. This new operator enables the system to execute subset-based queries with NOT IN and NOT EXISTS clauses. Performing an anti-join in this system is not trivial because of the stochastic nature of the data: the system does not know which samples are actually used until the end of the query execution. Two methods to implement the anti-join operator are discussed and then compared through experiments.

CHAPTER 1
INTRODUCTION

The 2008 financial crisis is considered to be the worst such crisis since the Great Depression [70]. It resulted in the collapse of banks and insurance corporations, which necessitated government bailouts and resulted in plunging stock markets. The trigger for this financial apocalypse was the subprime mortgage crisis, which caught financial institutions off guard until it was too late. Institutions learned that underestimating the risk of extreme events and the liquidity required to counter those events left them in a very precarious situation. The crisis demonstrated that low-probability, high-impact events do happen. The current risk management models do not place sufficient emphasis on extreme events, which is a critical issue [35]. The impact of extreme risk events can be lowered if they are detected early and managed properly. However, due to the lack of preparation, the impact can be greater than necessary.

Another issue with current risk management models is their excessive approximation. The current financial system is highly dynamic and complex due to globalization. Parts of this complex system are either approximated or ignored entirely in order to reduce the number of variables in the model. The approximations are obtained by using averages over multiple variables. One of the main reasons for reducing the number of variables is the limitations of the software currently used, which is based on spreadsheets and is not scalable for more fine-grained models.

The problem that arises due to imprudent approximation with averages can be explained by using the simple example from [55], Chapter 19. In this example, a bank is estimating the average profit it might derive from its loans. Assume that the bank pays 4% interest per year on the money it uses. It lends to the customers with good credit at a 6% rate; therefore, the profit margin is 2%. For customers who are less credit-worthy, the bank loans money at 14%, yielding a 10% profit margin. The cost overhead to manage the loan is $25 per year. Thus, an account of $600, with a margin of 6% will
yield a profit of $11 ((6% of $600) - $25 = $11). Refer to the data given in Table 1-1. If we consider both groups of customers separately, we will obtain an accurate estimate. A good customer with a $1000 balance on a 2% margin will generate $20 in revenue and -$5 in total profit. The other customer, with a balance of $200 and a margin of 10%, will again generate $20 in revenue and -$5 in total profit. The correct estimate for the average profit is provided in Table 1-1, row 3. However, if we try to estimate the average profit based on the average balance of $600 (($1000 + $200)/2) and an average margin of 6% ((2% + 10%)/2), the average net revenue is $36 and the average profit is $11, as shown in Table 1-1, row 4.

Table 1-1. Example: Flaw of averages

                          Balance  Margin  Net Revenue  Profit after Overhead
Account with Good Credit     1000      2%           20                     -5
Account with Bad Credit       200     10%           20                     -5
Correct Average               600      6%           20                     -5
Flawed Average                600      6%           36                     11

Simulation based on sampling is an excellent way to avoid the problem created by using averages. However, sampling is very data intensive, and scalability becomes an issue as the number of variables increases. The MCDB-R system is built to provide scalable Monte Carlo simulations for extreme risk analysis. Scalability is achieved by using existing database technology and incorporating the Monte Carlo simulations internally in the query processing. The advantages of using MCDB-R over existing spreadsheet-based modeling include:
• Scalable simulations: The scalability of MCDB-R facilitates Monte Carlo simulations over a finer and richer model, avoiding the issues with approximation using averages. A fine-grained model is important for obtaining more accurate results and avoiding the flaws introduced by averages.

• Ease of use: Users of spreadsheet applications such as Microsoft Excel must first extract the data from a database through aggregate queries (such as AVG, COUNT, and so forth) and then load the data into the application. These two steps require that the user buy and learn several software packages, involving both considerable money and time. MCDB-R avoids these issues by providing Monte Carlo simulations within the database.

Next, we give an example 'what-if' query and show how MCDB-R analyzes extreme risk on that query.

Example: Let us examine the following what-if query from [37]: What would the profits be if we raised all prices by 5% this year? For this query, relevant information such as the fluctuation in demand with a price increase is unknown. Therefore, [37] uses a Bayesian model to calculate the fluctuation in demand as a parametric probability density function. The attribute for the new demand is called a stochastic attribute, as its value is defined by a probability distribution. The whole database itself is now stochastic, and is characterized by a very high-dimensional joint distribution. A query on this stochastic database will result in another distribution instead of a single value as in a deterministic database. This result distribution is called a query-result distribution. Obtaining this query-result distribution is not simple because the data is high-dimensional, making a direct analytical solution very difficult. For this reason, the answer is calculated through Monte Carlo-based sampling [51] in the following way: we generate multiple new demand values from the probability density functions. Each set of new demand values will create an instance of the database. The query is executed on each instance of the database. The output of the query is a set of profit values from the query-result distribution. This is the method used in the Monte Carlo Database System (MCDB) [37]. For efficient query processing, MCDB bundles all values of a
single stochastic tuple (from different database instances) together into a single tuple bundle.

Let us analyze the extreme risk for the 5% price increase 'what-if' query to understand the worst possible scenarios for such a decision. For example, examining the lowest possible total profit value that can occur would be of interest. In the above Bayesian model, the worst case can be defined as the value of total profit in the 0.001 quantile of the query-result distribution (this low-probability region is generally called the tail of a distribution function). We can better understand the risk of the decision with the knowledge of such worst-case scenarios. A large number of samples in the 0.001 quantile space will be useful in estimating the profit value accurately and also in determining what leads to such poor profits. A decision-maker could then decide on the strategies to take to counter such effects.

One approach for obtaining samples in the tail is to generate large numbers of samples on the entire event space using MCDB and then select only those that fall in the tail region. If the samples are not requested from the extremely low-probability regions, then MCDB works well. Generating 100 samples from the 0.01 quantile is not very costly because generating 10,000 samples from the whole distribution is not expensive. However, this approach is quite inefficient if we are looking more deeply in the tail, such as the 0.000001 quantile. In many applications [53], an event with a probability of 10^-6 or less is called rare. Therefore, getting 100 samples of interest in such applications requires 10^8 samples in the full event space, which is not practical in MCDB.

MCDB-R is designed to perform this 'tail-sampling' task efficiently. MCDB-R, in addition to utilizing the tuple bundle and variable-generating (VG) function framework from MCDB, uses Gibbs cloning ideas from the rare-event simulation literature [10]. The Gibbs cloning method is used to directly sample from the tail of the distribution and, therefore, decreases the overall cost of the query. Gibbs cloning is an iterative process to move further into the tail of the distribution function. At each step we delete the uninteresting
samples and replace them with exact copies of interesting samples (a cloning step). The new set of samples will be highly correlated due to the cloning. Hence, these samples are randomly perturbed so that the new samples after the perturbation are reasonably uncorrelated. This perturbation is performed by the Gibbs sampler because of the high dimensionality of the data.

1.1 Main Contributions

Our group built a prototype of the MCDB-R system, and we tested its performance on various queries (including the example what-if query described above). We found that the system was not as efficient as we expected. After profiling the system, we discovered two issues:

- Moving large sample arrays in and out of memory is taking most of the execution time.
- The system is generating a large number of samples that are not even used.

As part of this dissertation, I am providing solutions to mitigate these issues. Additionally, I am tackling a third problem: to implement an Anti-join operator in MCDB-R. The Anti-join operator is important because it enables the system to run sub-queries with NOT IN and NOT EXISTS clauses. My dissertation also discusses the evaluation and implementation of the solutions that I proposed for all three problems, which are briefly described in the following subsections.

1.1.1 Problem 1: Serializing Variable Generating Functions

The sample sizes in MCDB-R are very large because of the selective nature of the rejection algorithm during the Gibbs sampling process (explained in Chapter 2). Using tuple bundles to group the samples together and store them as an array results in faster processing. However, tuple bundles are extremely large, and moving them in the memory hierarchy is very time consuming. I present a technique that enables the system to serialize the sample generation process and to delay the actual sample generation as long as possible. This technique avoids the movement of large numbers
of samples (or tuple bundles) through some expensive operations such as sorts and joins. For example, in the current system, we executed one of the benchmark queries (Q2 from Section 4.7) on a 200 megabyte (MB) dataset with 50,000 tuples with uncertain (or stochastic) data and generated result samples in the 0.999 quantile. The benchmark query writes and reads more than a terabyte (TB) to and from the disk during its execution, and generates more than 3 billion samples on average before it completes. The large sample sizes are due to some stochastic tuples that require a few samples with probability in the order of $10^{-3}$ or less. For example, it is possible for the query to look for a sample in the order of $10^{-5}$ probability. In the existing system, this task requires generating $10^{5}$ samples on average for each required tuple, and passing the samples through the query plan. In simple queries with no complex joins on stochastic attributes (like the what-if query mentioned above), the sample generation can be delayed until the top-most operator in the query plan. In the current system, when a query is out of samples, the query is rerun with a larger set of samples. For some queries, the method that I propose (described in Section 4.4) also eliminates the requirement of rerunning the query plan when more samples are needed. Even in queries with joins on stochastic attributes, my method can eliminate the execution of some of the expensive operators before that join, which speeds up the query execution from the second rerun onwards.

1.1.2 Problem 2: Finding an Extreme Sample

The Gibbs sampler processes an enormous number of samples during a single query execution instance. Therefore, the chances of an extreme sample (one that occurs with very low probability) occurring in the sample stream are good. Such a sample is easily accepted if it is on the correct side of the tail. However, replacing that sample during the next Gibbs sampler phase may require another sample of nearly equal value. Replacing one such extreme sample could result in numerous query reruns and
increase the system response time. In Chapter 5, I look at an effective solution for this problem. My solution has two parts. First, the query plan is rerun with only the relevant tuples, and everything else is filtered out. This method reduces the data flow in the system. Second, for that specific tuple, a large number of samples are generated. The larger number of samples will increase the chances of finding a replacement sample, and, hopefully, eliminate additional reruns for this tuple. One specific extreme sample might need tens of millions of samples to find a replacement. Since all the samples cannot fit on a single page, I developed a successful workaround. The large sample array in a tuple is broken into numerous smaller arrays as they are being generated. These smaller tuples (each storing one small sample array) are passed through the query plan until they reach the Gibbs sampler, where they can be recombined. The sequence identifiers are stored to remember the order in which the smaller tuples need to be grouped together when the tuples are being used.

1.1.3 Problem 3: Incorporating Anti-Join

Anti-join of relations R and S ($R \triangleright S$) returns tuples in R that do not match tuples in S on the common attributes. In a relational database, this operation is straightforward as its execution is similar to an equality-join. However, in MCDB-R it is difficult due to the delay of the actual execution of the predicate until the end of the query plan. A stochastic attribute participating in the Anti-join from the right relation S introduces complications: advancing the sample in S could result in the tuple in R being dropped. Consequently, the Gibbs sampling (refer to Chapter 2) must have all of the tuples from S that match the tuple in R. A simple but effective solution is to store all of the matching tuples from S in the corresponding tuple in R. During the Anti-join operation all the information necessary from matching tuples in S is added to the tuple in R. I also describe another solution that does not require storing all of the information from
matching tuples. This method eliminates the drawback of the first solution: the existence of huge tuple sizes as a result of storing attributes from all matching tuples.

1.2 Organization

This dissertation is organized as follows: Chapter 2 provides an overview of the system, the mathematical background required for MCDB-R, the Gibbs sampler, and cloning. Chapter 3 describes prior work related to MCDB-R. Chapter 4 explains the serialization and de-serialization of the sample-generation process. Chapter 5 discusses the extreme sample problem. Chapter 6 explains the two methods by which the Anti-join can be added to MCDB-R.

CHAPTER 2
MCDB-R OVERVIEW

MCDB-R is a system that facilitates risk analysis on high-dimensional uncertain data [6]. Performing risk analysis requires answering queries about risky rare events, i.e., those that occur with very low probability but could have huge impact if they do occur. Data uncertainty in MCDB-R is modeled as probability distributions. The behavior of uncertain variables is captured by probability density functions. Risk analysis on such data can be done by observing the low probability regions, that is, the upper and lower tails of a query-result distribution function. For a given query, deriving an analytical form of either the result distribution or any estimate on it is unreliable and difficult for high-dimensional distributions. Therefore, MCDB-R uses Monte Carlo sampling to approximately calculate statistics on the result distribution.

While using the Monte Carlo methods, the uncertain database in MCDB-R is represented by multiple possible database (DB) instances (or database samples). These possible DB instances are generally called possible worlds in the probabilistic databases literature [18]. Each instance is created by replacing each uncertain variable by a sample from its density function. Samples of an uncertain variable in different DB instances are independent of each other, even though they are identically distributed. This is made possible by using pseudo-random number generators for sampling. Now, running an aggregate query over the uncertain database returns multiple output values, one for each DB instance. The number of DB instances used is given as a parameter while specifying the query.

MCDB-R is developed for generating samples from the tail of the query-result distribution. It adapts a technique called Gibbs cloning from rare event simulation to perform this task efficiently. The Gibbs cloning method is used to sample from the tail of the distribution, and hence decreases the overall cost of the query. It is an iterative process that moves further into the tail of the distribution function after each iteration. Each iteration has a cloning step and a perturbation step. In a cloning step we delete
the samples that are away from the tail and replace them with exact copies of samples that are further into the tail. The new set of samples will be highly correlated due to the cloning. Hence, these samples are randomly perturbed such that the new samples after perturbation are reasonably uncorrelated. Due to the high dimensionality of the data, the Gibbs sampler is used for perturbation.

This chapter gives an overview of query processing in MCDB-R. First, we provide a high-level understanding of risk analysis in MCDB-R using an example. We then provide the relevant Monte Carlo and rare event simulation background required to understand the system. Next, we describe the query language in this system. Finally, different operators in MCDB-R are explained. We only explain operators that are fairly different from those in a standard relational database system.

Figure 2-1. CustOrder table: (a) Deterministic, and (b) Uncertain

Figure 2-2. Stochastic CustOrder table
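
The cloning and perturbation steps outlined at the start of this chapter can be illustrated with a small sketch. It is not the MCDB-R implementation: it assumes a toy setting in which every dimension is an independent standard normal variable, the aggregate is a plain sum, the upper tail is targeted, and the number of instances and the kept fraction stay constant across iterations. All function names and parameters are hypothetical.

```python
import random


def aggregate(instance):
    """Stand-in for the aggregate query: the sum over all dimensions."""
    return sum(instance)


def draw(rng):
    """One sample for a single dimension (assumed standard normal)."""
    return rng.gauss(0.0, 1.0)


def perturb(instance, cutoff, rng, max_tries=10_000):
    """One Gibbs sweep: resample each dimension by rejection so that
    the aggregate stays at or above the cutoff."""
    x = list(instance)
    for i in range(len(x)):
        old = x[i]
        for _ in range(max_tries):
            x[i] = draw(rng)
            if aggregate(x) >= cutoff:
                break
        else:
            # Safeguard for this sketch only: keep the old value if no
            # candidate was accepted within max_tries.
            x[i] = old
    return x


def gibbs_cloning(n_instances=200, r=5, iterations=3, keep_fraction=0.5, seed=1):
    rng = random.Random(seed)
    instances = [[draw(rng) for _ in range(r)] for _ in range(n_instances)]
    cutoffs = []
    for _ in range(iterations):
        # Cloning step: keep the instances closest to the upper tail and
        # copy them until the original number of instances is restored.
        instances.sort(key=aggregate, reverse=True)
        elite = instances[: max(1, int(n_instances * keep_fraction))]
        cutoff = aggregate(elite[-1])
        cutoffs.append(cutoff)
        clones = [list(elite[k % len(elite)]) for k in range(n_instances)]
        # Perturbation step: decorrelate the clones inside the new region.
        instances = [perturb(inst, cutoff, rng) for inst in clones]
    return instances, cutoffs
```

With a kept fraction of 0.5 and three iterations, the surviving instances sit roughly in the upper 0.125 quantile of the aggregate, while only 200 instances are held at any time.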

Example: Let us revisit the example query from Chapter 1: "what would the profits be if we raised all the prices by 5% this year?". Assume the following tables in the database: CustOrder(CUSTID, ITEM, QTY), Price(ITEM, PRICE). In the table CustOrder, CUSTID is the key, and QTY is the number of ITEMs bought by the specific customer. The new demand from each customer for each item (call it NEWQTY) is not known after the price change. The deterministic table CustOrder and the new uncertain version of it with the attribute NEWQTY are shown in Figure 2-1. Since the values for NEWQTY are unknown, they are denoted by ??. A Bayesian model is used to estimate the new demand. The attribute NEWQTY is called a stochastic attribute since it does not have a fixed value, and is defined by a probability density function. In this specific setting, we use two Gamma functions as the representative model for the attribute NEWQTY. Since the NEWQTY attribute takes integer values, the floating point sample value coming out of the Gamma functions will be rounded to the nearest integer. The new stochastic relation CustOrder is shown in Figure 2-2. Notice that the parameters for the DblGamma function can be different for each tuple.

The result of the query executed on a stochastic database will also be a probability distribution. Analytically solving for the result distribution is difficult due to the high dimensionality of the database. So MCDB [37] uses Monte Carlo sampling to answer the query. Multiple instances of the stochastic relation are generated using sampling. Sample instances are independent of each other. This step of sampling is depicted in Figure 2-3. In this figure the number of instances created is N. The query, when executed on these DB instances, gives one result for each DB instance. These results can be considered as independent and identically distributed samples from the query-result distribution. The final result consists of N values, and can be represented as a histogram as shown in Figure 2-4. If the query is run separately on each DB instance, then the system will be very slow. This performance hit is managed in MCDB by the
grouping of all instances of a tuple together, called a tuple bundle, and running the query only once for all the DB instances.

Figure 2-3. Multiple instances of CustOrder after Monte Carlo sampling

Figure 2-4. Query-result histogram
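
The tuple-bundle idea described above can be sketched in a few lines. This is an illustration under stated assumptions, not MCDB code: the table contents, the single Gamma distribution per tuple (the text uses a two-Gamma model), and the revenue aggregate are all hypothetical stand-ins. The point is that each tuple carries an array with one NEWQTY sample per DB instance, so a single scan yields N query results.

```python
import random
from collections import Counter

# Hypothetical stochastic CustOrder rows: (CUSTID, ITEM, gamma shape, gamma scale).
CUST_ORDER = [(1, "pen", 2.0, 3.0), (2, "ink", 5.0, 1.0), (3, "pen", 1.5, 4.0)]
# Hypothetical Price(ITEM, PRICE) table.
PRICE = {"pen": 2.0, "ink": 7.5}


def make_bundles(rows, n_instances, seed=0):
    """Attach to each tuple an array with one NEWQTY sample per DB instance."""
    rng = random.Random(seed)
    bundles = []
    for cust_id, item, shape, scale in rows:
        # Gamma samples are rounded because NEWQTY takes integer values.
        new_qty = [round(rng.gammavariate(shape, scale)) for _ in range(n_instances)]
        bundles.append((cust_id, item, new_qty))
    return bundles


def total_revenue(bundles, n_instances, raise_pct=0.05):
    """Scan the bundles once and return one aggregate value per DB instance."""
    totals = [0.0] * n_instances
    for _cust_id, item, new_qty in bundles:
        new_price = PRICE[item] * (1.0 + raise_pct)
        for k in range(n_instances):
            totals[k] += new_qty[k] * new_price
    return totals


N = 1000
results = total_revenue(make_bundles(CUST_ORDER, N), N)
# Bucket the N results to the nearest 10 to form a query-result histogram.
histogram = Counter(round(v, -1) for v in results)
```

The list `results` plays the role of the N query answers, and `histogram` corresponds to the kind of summary shown in Figure 2-4.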

Figure 2-5. Gibbs cloning: Generating samples

Figure 2-6. Gibbs cloning: Deleting lower 50%

Now, let's consider performing extreme risk analysis on this decision query (increasing prices by 5%). For that, we want to know the worst possible consequences of such a price increase. For example, the lowest or highest possible total profit value that can occur with a probability of 0.001 is of interest. We need a decent number of samples (100 or more) in the 0.001 quantile space to estimate the profit/loss value accurately and also to find out, by analyzing the samples, what led to such bad profits. One approach for obtaining the samples is to generate a large number of
DB instances using MCDB and then select only those which fall in the region of interest of the distribution. This method is not efficient if the probability we are looking at is too small. MCDB-R solves this efficiency problem by adapting the Gibbs cloning approach to the MCDB system. Gibbs cloning is an iterative process that moves samples closer to the tail of the distribution function. Each step deletes the uninteresting samples and replaces them with clones of interesting samples. This creates a larger number of samples, all in the new region closer to the target tail. The new set of samples will be highly correlated due to the cloning. These samples are randomly perturbed such that the new samples after the perturbation are reasonably uncorrelated. During perturbation we need to make sure that the new samples are in the new region we obtained after deleting uninteresting samples. This perturbation is done by the Gibbs sampler due to its ability to perform multidimensional sampling while satisfying the constraints of the new region.

Figure 2-7. Gibbs cloning: Clone the elite samples

This Gibbs cloning process is explained through figures. Figure 2-5 shows the query-result distribution with four initial samples, denoted by the small yellow circles. Assume that our target is to gather samples from the right side of the distribution function. The next step is to delete the half of the samples that are farthest from the right tail, as shown in Figure 2-6. In this figure, the vertical line shows the boundary of the new region
we created. This line acts as a boundary for any perturbation we perform later. In the next step we clone the remaining two 'elite' samples, to create four samples in total. We can see the cloned samples as the green circles in Figure 2-7. We use the Gibbs sampler to perform the perturbation while satisfying the constraint. The new four samples after the perturbation are shown in Figure 2-8. Note that all the new samples are after the vertical line, which is our constraint. This process will be repeated as many times as required to move to the tail as required by the query. The process of Gibbs cloning, along with the Gibbs sampler, is explained in the next section.

Figure 2-8. Gibbs cloning: After perturbation with Gibbs sampler

2.1 Background

2.1.1 Rare Event Simulation

Rare events are events that occur with very low probability but could have extreme impact when they do occur. Generating samples in the rare event space for risk analysis is useful in many applications. Some applications which benefit from this are in the financial domain and network reliability [10, 53]. Using basic Monte Carlo for rare event simulation is not efficient. When the probability of the rare event space is $p$, obtaining a single sample in that space requires $1/p$ basic Monte Carlo simulations on average. If $p$ is $10^{-6}$, which is possible in many applications [53], getting 100 samples of interest requires $10^{8}$ samples in the full event space, which is clearly inefficient. Two popular techniques
used for this simulation are importance sampling and cloning [10, 11, 54]. Importance sampling is not good for high-dimensional data because of the problem of numerical instability [10]. Also, in this application the dimensionality is typically equal to the number of tuples in the relation, which renders importance sampling useless.

Cloning: We view a random database $D$ as a vector of random variables over a very high-dimensional joint distribution. $D$ consists of multiple DB instances $D_1, D_2, \ldots$, each representing a sample from the joint distribution. Consider an aggregate query over $D$. The cloning algorithm first generates $n_1$ DB instances. At step $i$ it keeps only the top $100 p_i\%$ of the instances (decided based on the aggregate value of the query over that instance) and clones these top instances to obtain $n_{i+1}$ new instances. All of these are in the top $100 p_i\%$ but are not independent. The algorithm randomly perturbs these DB instances within the new tail region to make them independent of each other. The random perturbation process is explained in Section 2.1.2 below. A single step is called a cloning iteration, and after each such iteration the algorithm moves further into the tail. The number of DB instances, the number of cloning iterations, and the values of $n_i$ and $p_i$ depend on the quantile specification of the query.

2.1.2 Gibbs Sampling

The technique above explains how to walk towards the tail after cloning and randomly perturbing DB instances at each step. In this section, we look at Gibbs sampling, which is used for randomly perturbing a given DB instance. Gibbs sampling is a Markov Chain Monte Carlo algorithm for obtaining a sequence of random samples from a joint probability distribution. The Gibbs sampler requires samples from the conditional distributions of each dimension to perform the perturbation of the multi-dimensional sample.

Let $X = (X_1, X_2, \ldots, X_r)$ be an $r$-dimensional random vector. $X$ follows a distribution $h$ and takes values from the discrete set $\mathcal{X}^r$. Here $h(x_1, \ldots, x_r) = P(X_1 = x_1, \ldots, X_r = x_r)$, for $x = (x_1, \ldots, x_r) \in \mathcal{X}^r$. As mentioned earlier, the Gibbs sampler needs to be able to generate
samples from the conditional distributions $h_i(u \mid v) = P(X_i = u \mid X_j = v_j \text{ for } j \neq i)$. Here $h_i$ is the conditional distribution for the $i$th dimension, where $1 \le i \le r$. Given an initial vector $X^{(0)} = (X^{(0)}_1, \ldots, X^{(0)}_r)$, the Gibbs sampler generates the random vectors $X^{(0)}, \ldots, X^{(k)}$. Vector $X^{(j)}$ is obtained from $X^{(j-1)}$ by replacing each of its dimension values $X^{(j-1)}_i$ using $h_i$, where $1 \le j \le k$.

Consider the perturbation of a DB instance $D_1$ from the previous section. Treat $D_1$ as similar to $X$. The perturbation process should keep $D_1$ in the current tail range, seen as a cutoff value on the aggregate, and replace samples in each of its dimensions. In our setting the dimensionality $r$ is typically the number of tuples in a relation, and the dimensions are independent. If we assume such independence then $h(x) = \prod_{i=1}^{r} h_i(x_i)$, where $h_i(u) = P(X_i = u)$. Let $Q$ be a real-valued function on $\mathcal{X}^r$, mimicking an aggregate, and let $c$ be the cutoff value for the tail; the conditioning is over the distribution $h$. Now for perturbing the $i$th dimension we need a sample from $h_i(x \mid c) = P(X_i = x \mid Q(X) \ge c)$. We can do this with a rejection algorithm: keep sampling from $h_i(x)$ until $Q(X) \ge c$ is satisfied. For more details see [6].

2.2 Schema Specification

The first step in specifying a query in the MCDB-R system is defining a random (or stochastic) relation. A relation is random if it has one or more stochastic attributes. Such attributes are populated through samples from parametric probability distribution functions. These parametric distribution functions are specified through the Variable Generating (VG) functions, which are explained in the next section. During the query execution, each stochastic attribute in a tuple is replaced by samples generated from the corresponding VG function. Once the random relations are specified, the next step is writing standard Structured Query Language (SQL) for the aggregation. The aggregate operator is specified later, as part of the MCDB-R specific RESULTDISTRIBUTION block. The query compiler then converts these SQL statements into a query plan, which includes both standard relational operators as well as some new operators to generate
and process stochastic attributes. Finally, the parameters required for the tail sampling are specified. These parameters determine how far along the tail the final samples are generated. An example is given below for a better understanding of the process.

Example: The example query we use here describes a predictive model of customer demand. This query simulates the effect of a 5% increase in price on the final profits. For calculating the new profit we need the new customer demand due to the increased price. The demand is estimated using a Bayesian model on the existing data. A common prior distribution for demand is used for all customers. Then the posterior distribution is calculated based on the parameters specific to each customer. The VG function outputs the possible customer demand values due to changes in price. This query requires the specification of the random relation demands as given below.

1.  CREATE TABLE demands (new_dmnd, old_dmnd,
2.      old_prc, new_prc, nd_partkey, nd_suppkey) AS
3.  FOR EACH l IN (SELECT * FROM lineitem, orders
4.      WHERE l_orderkey = o_orderkey AND
5.      yr(o_orderdate) = 1995)
6.  WITH new_dmnd AS Bayesian (
7.      (SELECT p0shape, p0scale, d0shape, d0scale
8.      FROM params
9.      WHERE l_partkey = p_partkey)
10.     (VALUES (l_quantity, l_extendedprice * (1.0 -
11.     l_discount) / l_quantity, l_extendedprice *
12.     1.05 * (1.0 - l_discount) / l_quantity)))
13. SELECT nd.value, l_quantity, l_extendedprice *
14.     (1.0 - l_discount) / l_quantity, 1.05 *
15.     l_extendedprice * (1.0 - l_discount) / l_quantity,
16.     l_partkey, l_suppkey
17. FROM new_dmnd nd

A random relation is generated on the fly. Therefore, the above statement needs to specify how each of the attributes is derived, either from base tables in the case of deterministic attributes or from the parametric VG function in the case of stochastic attributes. Lines 1 and 2 give the table name along with the list of attributes. The sub-query from lines 3 to 5 creates the base table, and each tuple in it will contribute to a set of tuples in the random relation demands. The stochastic attribute new_dmnd is generated from the distribution function specified from lines 6 to 12. The name of the VG function used for sample generation is specified (as Bayesian) on line 6. The table params has the parameters required for the Bayesian VG function. The params table can have one or more tuples for each matched tuple in the base table. The join condition between these two tables is given at line 9. In the predicate, l_partkey is from the base table and p_partkey is part of the params table. Some parameters to the VG function are based on the base relation. These are specified after the keyword VALUES at the beginning of line 10. The relation new_dmnd(value) is created from this entire VG function block (lines 6 to 12). The attributes in the relation demands are populated according to the SQL statement in lines 13 to 17. In this example, the first attribute new_dmnd.value is stochastic, and the rest are deterministic. All the deterministic attributes are calculated as functions of attributes in the base table.

18. SELECT SUM(new_prf - old_prf) AS totalProfit
19. FROM (
20.     SELECT
21.     new_dmnd * (new_prc - ps_supplycost) AS new_prf,
22.     old_dmnd * (old_prc - ps_supplycost) AS old_prf
23.     FROM partsupp, demands
24.     WHERE ps_partkey = nd_partkey AND
25.     ps_suppkey = nd_suppkey)
26. WITH RESULTDISTRIBUTION MONTECARLO(100)
27. DOMAIN totalProfit <= QUANTILE(0.001)
28. FREQUENCYTABLE totalProfit

Writing the aggregate part of the query is quite straightforward and is given from line 18 to 25. Note that the stochastic attribute new_dmnd is part of the aggregate. Information for the tail sampling operator is given from line 26 to 28. The value totalProfit is the aggregate calculated in this query. The quantile of the final samples in the query-result distribution is specified with the QUANTILE keyword. For this query the samples are in the $10^{-3}$ quantile. The number of samples generated within that quantile is given with the MONTECARLO keyword in line 26, which is 100 in this query. The direction of the tail is given by <=. In this query, the samples are to be generated in the lower tail of the query-result distribution. The FREQUENCYTABLE keyword specifies where the samples generated by the query are stored. A table FTABLE(totalProfit, FRAC) is created, where in each tuple totalProfit is a distinct sample value and the corresponding FRAC is the fraction of the 100 samples in which that value was observed.

2.3 The Variable Generating Function

The VG function is one of the vital components of this system. It takes in parameters and pseudo-random number generator (PRNG) seeds and returns the samples, which are then stored in the stochastic attributes. The framework surrounding the VG functions allows users to provide their own functions along with the process in which that function receives the parameters and returns the samples. A standard set of VG functions like Normal() or Poisson() is already provided with the system. The framework is very flexible and allows specification of a variety of VG functions [38], including multidimensional distribution functions with correlations.

Framework: As mentioned above, the VG function framework is designed to allow users to create their own functions. The Bayesian VG function mentioned in the previous section is fairly complicated and is not part of the standard library functions.
Any custom VG function like that can be added to the library. Any VG function should implement the following five methods: Initialize(), TakeParams(), OutputVals(), PrepareNextTrial(), and Finalize().

- Initialize() takes in an array of PRNG seeds. These are the seeds that the VG function uses for the pseudo-random number generator. The size of this list should be the same as the number of samples to be generated. This is because, in MCDB-R, each sample has its own seed. So each of the independent and identically distributed samples generated has its own series of pseudo-random numbers, and can be generated independently of its previous samples. This is vital because MCDB-R can require samples to be generated out of sequence. Refer to Section 2.7 for more on why we need samples out of their sequence. This method also initializes the pseudo-random number generator using the first of the seeds, along with any data structures and variables required by the VG function.
- TakeParams() is used to send any parameters required by the probability distribution function used in the VG function. The parameter values generally vary per tuple, and this information is specified by the FOR EACH clause in the CREATE TABLE statement. Each tuple may need multiple calls to TakeParams() depending on the VG function.
- OutputVals() returns the samples from the VG function. These samples can be multi-dimensional, therefore for each sample a sequence of calls is made to OutputVals(). The method returns a NULL after it returns the value for the final dimension of the sample. Each non-NULL return value results in a separate tuple for the sample. So a single input tuple can result in multiple output tuples.
- PrepareNextTrial() is called at the end of each sample, i.e., after each NULL return value from an OutputVals() call. This method will initialize the pseudo-random number generator to the next PRNG seed.
- Finalize() is called at the end, after all the samples are generated. This method will de-allocate any data structures created by the VG function and the seed array. After a call to Finalize(), the VG function object is fully reset and is ready for an Initialize() call from the next tuple.

Consider the Bayesian VG function in the query from the previous section. The parameter list sent to TakeParams() has the values (p0shape, p0scale, d0shape, d0scale). A single call to TakeParams() is sufficient because the query has a single parameter table and one tuple each for the parameters. If the VG function were a multivariate probability distribution, sending the covariance matrix would have required multiple
calls to TakeParams(). A call to OutputVals() will again return only a single value, the new_dmnd.value. A second call to OutputVals() will return a NULL, and the subsequent call to PrepareNextTrial() will advance the PRNG seed. Note that this framework only specifies how the parameters and PRNG seeds are sent. The process of sending the correct parameters and collecting the returned samples to form the final tuple is done outside the VG function. This process works as a wrapper around the VG framework and is described in Section 2.5 as part of the Instantiate operator.

2.4 The Seed Operator

This operator adds Seed attributes to random relations. For each VG function in the CREATE TABLE statement of a random relation, one Seed attribute is added. The Seed attribute consists of a seed identifier and the current set of active samples. The seed identifier must be unique over all the random relations in a query. This identifier is used as part of all the PRNG seeds created for this tuple. So multiple tuples with the same Seed identifier will create unwanted correlations in the final result.

Figure 2-9. An example Seed attribute

Along with the seed identifier the attribute also stores the number of DB instances for this query, and a list of the current iterations assigned (called active iterations) for each of these DB instances. The number of DB instances is the same as the number of samples asked for in the query with the MONTECARLO keyword. The need for the list of these iterations is explained later in Section 2.6.1. Figure 2-9 shows a Seed attribute. The attribute has 1 as the seed identifier, and the number of DB instances is 4. The third
value points to the list of active iterations. The list has one value for each of the 4 DB instances. The iteration number assigned for DB instance 1 (DB1) is stored in the "DB1 iter." cell.

2.5 The Instantiate Operator

Instantiate is a fundamental operator in MCDB-R. This operator acts as a wrapper around the VG function. It takes the random relation and a series of parameter relations as input. The parameters from the inner relations are matched with the corresponding tuple in the outer relation through a join predicate. For each tuple from the outer relation, a series of calls to the VG function is made. These calls send the information required by the VG function, including the PRNG seed, parameters, and an identifier. Once the VG function returns the samples, they are merged with the corresponding tuple in the random relation to form the final tuple.

Example: Let orders_rnd(orderkey, year, total AS Normal($\mu$, $\sigma$)) be a random relation (see query Q2 in the Appendix). This relation is generated by passing order(orderkey, year, seed) as the main input table, and params(orderkey, $\mu$, $\sigma$) as the parameter table. Instantiate matches the parameters with the corresponding tuple in order based on the join on orderkey. The attribute order.seed contains the base seed used to calculate the PRNG seed array.

The Instantiate operator generates the samples from a parametric VG function. For each random tuple, Instantiate receives the name of the VG function to be used, the parameters to it, and the locations to store the generated samples. Parameters can be passed directly through the main outer input, or through multiple parameter inner inputs. See [37] for more details. The inner working of this operator is shown in Figure 2-10 for one input parameter table. If we consider the example from the last paragraph, the inner input will be the table params. The outer input is the order relation. Once Instantiate gets one tuple from the outer relation, it makes multiple copies of the tuple, one for each inner input pipe.
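
The five-method VG protocol of Section 2.3 and the wrapper role of Instantiate can be sketched as follows. This is a hedged illustration, not the MCDB-R code: the class `NormalVG`, the helper `prng_seeds`, the seed formula, and the tuple layout are hypothetical, and `None` stands in for NULL. The joins and sorts that Instantiate performs to collect parameters are omitted; the parameters are simply passed in.

```python
import random


def prng_seeds(seed_id, iterations):
    """Hypothetical stand-in for hashing a seed identifier with an iteration number."""
    return [seed_id * 1_000_003 + k for k in iterations]


class NormalVG:
    """Toy one-dimensional VG function following the five-method protocol."""

    def initialize(self, seeds):
        self.seeds = list(seeds)              # one PRNG seed per sample
        self.trial = 0
        self.rng = random.Random(self.seeds[0])
        self.emitted = False

    def take_params(self, mean, stddev):
        self.mean, self.stddev = mean, stddev

    def output_vals(self):
        # Return the single dimension once, then None to end the sample.
        if self.emitted:
            return None
        self.emitted = True
        return self.rng.gauss(self.mean, self.stddev)

    def prepare_next_trial(self):
        # Re-seed the generator so every sample has its own random stream.
        self.trial += 1
        self.emitted = False
        if self.trial < len(self.seeds):
            self.rng = random.Random(self.seeds[self.trial])

    def finalize(self):
        self.seeds, self.rng = [], None


def instantiate(outer_tuple, params, seeds, vg):
    """Feed one outer tuple to the VG function and merge the samples into it."""
    vg.initialize(seeds)
    vg.take_params(*params)
    samples = []
    for _ in seeds:
        value = vg.output_vals()
        while value is not None:
            samples.append(value)
            value = vg.output_vals()
        vg.prepare_next_trial()
    vg.finalize()
    return outer_tuple + (samples,)


full = instantiate((42, 1995), (100.0, 15.0), prng_seeds(7, range(4)), NormalVG())
only_third = instantiate((42, 1995), (100.0, 15.0), prng_seeds(7, [2]), NormalVG())
```

Because each sample is driven by its own seed, the sample for iteration 2 generated on its own (`only_third`) equals the third entry of the full array, which is the out-of-sequence property the text relies on.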

Figure 2-10. Instantiate Operator

These copies are used to join with the parameter tables based on the key attribute. Through this join, all the parameters related to the key can be brought together by the sort operation just after the joins. Remember from the previous section that a seed identifier is unique to each key in the outer input relation. Therefore, sorting on the seed identifier will bring together all the parameter tuples and the outer input tuple. Sorting is not done on the key because the key is no longer needed in Instantiate and can be dropped; the seed becomes the new identifier for the tuples. After sorting on the seed identifiers is finished, the parameters required for the VG function are brought together. The parameters can now be rearranged as required and sent to the VG function. In the above example, the ($\mu$, $\sigma$, seed) are sent to the
VG function. Once the VG function generates the sample array, it is returned to Instantiate and it is then merged with a copy of the outer input tuple. The outer input pipe is also sorted on the seed identifier, therefore the merging process is straightforward. The merge function will push the resulting tuples into the output pipe going to the next operator.

Note that many VG functions would need more than one parameter table, and more than one tuple from a parameter table can join with a single key from the outer input relation. One extreme example of this is a multivariate distribution function. This function requires the covariance matrix to be passed to the VG function. One way to pass this matrix is through the sparse matrix representation. The Instantiate operator and VG function framework are very flexible and designed to allow all kinds of complex input methods.

Instantiate is one of the expensive operators in this system. One reason for this is the joins and sorts on inner and outer tables. Since this operator also encapsulates the VG function, if the sample generation is expensive then that will again slow down the operator. The other major contributor to the expense here is the memory allocation for the sample array. The sample arrays are generally very large, hence even the data movement between the VG function and the array is considerable.

2.6 Gibbs Tuple Bundles

For generating numdb (number of DB instances) samples in the tail, we need to perform cloning and perturbation on the initial DB instances. To keep track of the current samples used, we maintain the DB instance information in the tuple, stored as part of the Seed attribute. A stochastic attribute is a stream of random samples, and only numdb of them are in use at any point in time (assigned to numdb DB instances). The size of the sample stream is denoted by nums. At the perturbation step we do not know which of the samples will be used in the new DB instances. So we carry the whole sample stream without any filtering (due to selection predicates) on stochastic attributes. The filtering is performed once the sample is assigned to a DB instance. The stochastic table
CustOrder from Figure 2-3 with tuple bundles and seed is shown in Figure 2-12. In this relation NEWQTY is the stochastic attribute. The multiple DB instances on the left side in Figure 2-12 are the logical representation, whereas STOC.CustOrders on the right side is the storage representation in MCDB-R.

TailSampler():
1.  Parameters:
2.    n: number of cloning iterations
3.    numdb: number of DB instances
4.    numcloned: number of DB instances cloned at the end of each iteration
5.    nums: size of the sample stream
6.  Variables:
7.    c: cutoff value for the aggregate
8.    PQ: priority queue on seed identifiers
9.    R: input relation with Gibbs tuples
10.   r: current Gibbs tuples with same seed identifier
11. InitializeDB(R, numdb)
12. PQ(R)
13. for each iteration i from 1 to n
14.   Clone(PQ, numdb, numcloned, c)
15.   PQ.reset()
16.   while (r = PQ.nextSeed())
17.     for each DB instance j from 1 to numdb
18.       while (r[j].Advance())
19.         if Agg(r[j]) >= c
20.           found = true
21.         end if
22.       end while
23.       if not found
24.         nums = nums * 2;
25.         QueryRerun()
26.       end if
27.     end for
28.   end while
29. end for

Figure 2-11. Tail sampler operator

We run the query plan multiple times to replenish the sample stream whenever we run out of samples. Due to operations like stochastic join some new tuples can
be added and old tuples eliminated from the new run. So there is a need to maintain persistent DB instance information outside a query plan execution. So for each unique Seed in the Gibbs tuples we have a Tail-Sampling-seed (TSSeed) object on disk. A Seed and its corresponding TSSeed are synched whenever a modification is performed to the Seed. A TSSeed object stores the seed identifier, PRNG seed, and the DB instance information. It also stores the current in-use range of the sample stream and the start of the unused samples.

Figure 2-12. CustOrder with Gibbs tuples

2.6.1 Tail-Sampling-Seed and Seed Queue

Figure 2-13 shows an example TSSeed object and the corresponding Seed object in the tuple. Each TSSeed object stores the seed identifier for the tuple. This identifier is used to match the TSSeed objects and the related tuples in the query pipeline. A TSSeed stores the current range of generated samples, and the current assigned range. It also maintains the current assigned iteration number for each of the DB instances. The TSSeed class has the public function GetNumSeeds(). GetNumSeeds()
returns a list of PRNG seeds used later in the VG function for seeding the pseudo-random number generator. Each of the PRNG seeds returned is unique, and is calculated through a hash function on the seed identifier and the iteration number. The TSSeed object provides the method Advance() to move the current iteration number to the next previously unused iteration number. This method is used in the Gibbs sampler to replace the current sample with a new one. Advance() returns an error value if the sample array is exhausted. All the changes to assigned iteration numbers are done through the TSSeed, and the changes are reflected onto the Seed through a Synch() function. The Seed object also stores the current assigned iteration numbers. But these iteration numbers are relative to the positions in the sample array (the stochastic attributes). Synch() transforms the absolute iteration numbers in a TSSeed object into the relative iteration numbers in the Seed object.

Figure 2-13. Example: TSSeed and Seed

2.6.2 Split Operator

Figure 2-14. Stochastic table in a join
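
The Split operator of this section turns one tuple holding a stochastic array into one tuple per distinct value, each paired with a bit string that marks where the value occurs in the array. A minimal sketch of that idea follows. The function names and the sample array are hypothetical and are not taken from the figures; the bit string is modeled as a plain list of 0/1 values.

```python
def split(stoc_array):
    """Return one (value, bitstring) pair per distinct value in the sample array.

    bitstring[k] is 1 when iteration k of the array holds that value.
    """
    positions = {}
    for k, value in enumerate(stoc_array):
        positions.setdefault(value, [0] * len(stoc_array))[k] = 1
    return sorted(positions.items())


def appears(bitstring, active_iterations):
    """A split tuple is visible if any active iteration has its bit set."""
    return any(bitstring[k] for k in active_iterations)


# Hypothetical stochastic array with four iterations.
pieces = split([2, 1, 1, 2])
```

After the split, each piece carries a single value, so an ordinary equality join can match it, and `appears` decides whether the joined tuple is visible for the iterations currently assigned to DB instances.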

Figure 2-15. Stochastic table after a Split

A stochastic array attribute stores multiple instances of the attribute, hence a fixed order cannot be defined on this attribute. So operators like the join, duplicate elimination, multi-set union, intersection, and difference cannot be applied on stochastic array attributes directly. A Split operator is used internally in each of these operators to break up a stochastic attribute with an array of samples into a set of smaller attributes, each with one unique value in the sample array. A bit string is used to store the locations of that value in the sample array. The new attribute with a single sample is called a stochastic single attribute. A bit string is always stored along with the stochastic single attribute.

Figure 2-16. Join result tuples

Consider the two tables in Figure 2-14. The right table has a stochastic attribute. To simplify the example, the Seed attribute is not shown in the right table. The join predicate is on the STOC.ARRAY and the JOINATTR attributes. Figure 2-15 shows the left table after the split. Now it is straightforward to match the tuples on the join predicate. The result tuples after the join predicate are shown in Figure 2-16. The attribute STOC.ARRAY is broken up and is renamed as STOC.SINGLE. The new attribute BITSTRING keeps track of the occurrence of the stochastic single value in the stochastic array. Note that the appearance of these tuples in the join result depends on whether the iteration

numbers of the active bits in BITSTRING are currently assigned to a DB instance. If only iteration number 0 is active, then the first tuple in Figure 2-16, with stochastic single value 1, will not appear in the result, whereas the second tuple, with value 2, will appear. If an Advance() call is made, then the active iteration number is 2. In this case, tuple 2 will not appear and tuple 1 will appear in the result.

2.7 Gibbs Looper and Tail Sampler

The Gibbs looper takes as input the number of DB instances to be generated, numdb, and the quantile to be estimated. It then calculates the number of cloning iterations and the number of DB instances to be cloned after each iteration. Figure 2-11 gives the Tail sampler method. This method performs the necessary cloning iterations on the given input Gibbs tuples. We need to operate on all Gibbs tuples with the same seed identifier together, so a priority queue PQ is used for that purpose. At the beginning of each iteration we clone (line 14) the top DB instances. The while loop at line 16 goes through the Gibbs tuples and performs the perturb operation. All Gibbs tuples with the same seed identifier are returned by the PQ.nextSeed() call. Now the samples in each DB instance of this set of tuples r are replaced (lines 17-27). The while loop from lines 18-22 performs the rejection algorithm. After an Advance(), the new sample is accepted only if the new aggregate value is greater than (or smaller than) the cutoff.

2.7.1 Gibbs Queue

The Gibbs queue is used by the Gibbs looper to retrieve the records in seed order. The Gibbs queue is a disk-based priority queue that allows online insertions while keeping the sort order. In other words, it allows removal of records and, at the same time, insertion of new records. The Gibbs queue is designed such that interleaved remove() and insert() operations are performed with good efficiency. Figure 2-17 shows the mechanism of the Gibbs queue. The first phase is similar to a disk-based two-phase merge sort [26]. A series of sorted runs are created on the disk. A primary priority queue is set up to merge records from the runs. For disk efficiency we employ page


buffering. The Gibbs queue also has another in-memory (secondary) priority queue, which is used to hold the reinserted tuples. The top tuple can now be in either the primary or the secondary queue. This mechanism keeps the Gibbs queue ordered even after reinserts. The secondary queue is as big as a disk-based run. Whenever this queue is full, a new run is created on the disk. The new run is also attached to the primary queue. The secondary queue will now be empty and is ready to receive new tuples that are reinserted. This functionality of reinserting a tuple is very important for the Gibbs looper, since there is a possibility that a Gibbs tuple can have more than one seed. In such a case the tuple should be processed at the Gibbs looper once per seed. How to process such tuples is explained in more detail below.

Figure 2-17. Gibbs queue

Preparing tuple bundles with multiple seed identifiers: The Gibbs queue holds the tuple bundles in sorted order according to the seed values. But it is possible that a tuple bundle has more than one seed. This can happen for two reasons: (1) the


tuple can have more than one random attribute, generated by different VG functions, and hence has one seed for each VG function; (2) tuples from two relations, each with random attributes, are merged inside a Join operator. Tuple bundles with more than one seed need to be processed by the Gibbs looper once per seed. For this, the Gibbs looper should be informed that the tuple has more seeds and needs to be reinserted into the queue. A function Advance() is provided by the tuple for this. The tuple bundle also provides two other functions to allow this iteration over seeds: BuildSeeds() and ResetSeeds(). BuildSeeds() is the initialization function, which collects all seeds in the tuple, sorts them, and stores them in an array in ascending order. It also initializes an index on the array pointing to the least-valued seed. An Advance() moves the index to the next seed. Advance() returns false if the array is exhausted. ResetSeeds() moves the index back to the first seed. The Gibbs looper, after processing a tuple, will Advance() the seed. If the tuple has more seeds, it will reinsert the tuple into the Gibbs queue.

Figure 2-18. Example: End of run 1

2.7.2 Rerunning the Query Plan

The query plan is rerun if we are out of samples for a given seed identifier before finding a suitable replacement for all the DB instances. For each successive run the number of samples generated for each seed identifier is doubled, provided the record fits inside the page. Once the tail sampler decides to rerun the query, it deletes all the existing tuples, flushes all TSSeeds, and resets the seed queue. If within page size


limits, the number of samples is doubled. Once this is done, the tail sampler exits to give control to the query planner. Once the planner sees that the required number of Gibbs loops are not finished, it restarts the query plan. At Instantiate, the new sample stream is started only from the last used sample from the previous run.

Figure 2-19. Example: Beginning of run 2

Figures 2-18 and 2-19 show the TSSeed and Seed at the end of the first run and at the beginning of the second run, respectively. In Figure 2-18 the range of sample iterations is from low iter. num. 0 to high iter. num. 5. It can be seen that the samples are exhausted in this array, since the max. used iter. is 5, the same as the high iter. num. The currently assigned iterations are the same for both the TSSeed and Seed objects, since the starting iteration number in TSSeed is 0. For this example, assume that the size of the sample array for the second run remains 6. In Figure 2-19 the low iter. num. in the second run starts at 2, since that is the largest used iteration number in the previous run (for DB instance 3). The max. used iter. does not change, and the high iter. num. becomes 7 (low iter. num. + 6 - 1). After the high iter. num. is updated, the TSSeed object is synched with the Seed object. The currently assigned iteration numbers in the Seed change because they are relative to the low iter. num. in the TSSeed.

2.7.3 Materializing the Operators

In the first execution of the query plan, the results of any subtree that is totally deterministic are saved to disk. The operator at the root of that subtree is marked as materialized, and in the next query run the results are directly read from the disk.
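The marking step just described can be sketched as a bottom-up walk over the plan tree. This is only an illustration under stated assumptions, not MCDB-R code: the PlanNode class, its deterministic flag, and the operator names in the toy plan are hypothetical, and the sketch only decides which operators get the mark (it does not write any results to disk).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PlanNode:
    """Hypothetical query-plan operator; not an MCDB-R class."""
    name: str
    deterministic: bool = True          # False for operators such as Instantiate
    children: List["PlanNode"] = field(default_factory=list)
    materialized: bool = False


def _mark(node: PlanNode) -> bool:
    """Return True if the whole subtree under `node` is deterministic."""
    child_ok = [_mark(child) for child in node.children]
    subtree_ok = node.deterministic and all(child_ok)
    if not subtree_ok:
        # `node` cannot be materialized, so each deterministic child is the
        # root of a maximal deterministic subtree and receives the mark.
        for child, ok in zip(node.children, child_ok):
            child.materialized = ok
    return subtree_ok


def mark_materialized(root: PlanNode) -> None:
    """Mark the root operator of every maximal deterministic subtree."""
    if _mark(root):
        root.materialized = True  # the entire plan is deterministic


# Toy plan (assumed shape): two deterministic inputs feed Instantiate.
seed = PlanNode("Seed", children=[PlanNode("Scan outer")])
group_by = PlanNode("Group-By", children=[PlanNode("Scan params")])
instantiate = PlanNode("Instantiate", deterministic=False,
                       children=[seed, group_by])
plan = PlanNode("Aggregate", children=[instantiate])
mark_materialized(plan)
print([n.name for n in (seed, group_by, instantiate, plan) if n.materialized])
```

Only the topmost operator of each deterministic subtree is marked, so a later run would read one saved result per subtree instead of re-executing the operators beneath it.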


Figure 2-20. Query plan before materialization

Figure 2-21. Query plan after materialization

Figures 2-20 and 2-21 show the plan for query Q1 (see Appendix) before and after the materialization process. In the query plan in Figure 2-20, the subtrees at the Seed and Group-By operators are totally deterministic. Both subtrees can be materialized at those operators. The next operator in the query plan is Instantiate, which is definitely not deterministic, so materialization cannot get past Instantiate. Figure 2-21 shows the


query plan after materialization. Figure 2-22 shows the first three runs of the execution. The first run uses the whole query plan; from the second run onwards the modified plan with materialization is used.

Figure 2-22. Query plan with multiple reruns
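The rerun behaviour of Sections 2.7.2 and 2.7.3 can be summarized in a small schedule: the first run executes the whole plan, later runs use the materialized plan, and the per-seed sample count doubles while the record still fits in a page. The sketch below is a simplification under assumed values: the function name, the initial sample count, the bytes per sample, and the page size are all hypothetical, and the record size is approximated as the sample array alone.

```python
def rerun_schedule(initial_nums, bytes_per_sample, page_size, runs):
    """Return a list of (run number, plan used, samples per seed identifier)."""
    nums = initial_nums
    schedule = []
    for run in range(1, runs + 1):
        # Only the first run executes the full plan; later runs read the
        # materialized deterministic subtrees from disk.
        plan = "full plan" if run == 1 else "materialized plan"
        schedule.append((run, plan, nums))
        # Double the sample count for the next run only if the doubled
        # sample array would still fit inside one page.
        if 2 * nums * bytes_per_sample <= page_size:
            nums *= 2
    return schedule


# Assumed values: 1000 initial samples, 8-byte samples, 64 KB pages.
for run, plan, nums in rerun_schedule(1000, 8, 65536, 5):
    print(run, plan, nums)
```

With these assumed values the sample count stops growing once doubling it again would exceed the page, which mirrors the page-size limit described in Section 2.7.2.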


CHAPTER 3
RELATED WORK

The work presented in this dissertation is very specific to MCDB-R, and is therefore difficult to put in the context of previous research (the related work for the problem discussed in Chapter 4 is given in Section 4.6). In this chapter we cover the research loosely related to MCDB-R. Other works on integrating statistical operations with a relational database are explained in Section 3.1. Monte Carlo methods are used quite heavily in fields such as probabilistic databases (see Section 3.2), and data mining and machine learning (see Section 3.4). In database research, the area closest to MCDB-R functionality is top-k ranking in probabilistic databases. Probabilistic top-k is described in Section 3.3. State-of-the-art risk analysis software is explained in Section 3.5.

3.1 Statistical Operations in Relational Databases

MCDB-R is closely related to the idea of integrating statistical operations with relational databases, as in MauveDB [21], FunctionDB [63], and SciDB [59]. MauveDB integrates statistical models into a database and provides the user with model-based views. In that system a user can specify, create, and query these views. FunctionDB provides support for continuous functions inside the database, and SciDB aims to provide advanced analytical capabilities inside the database. This system is also related to conditioning in probabilistic database systems [41]. Conditioning is aimed at analytical calculation of the confidence values of tuples in the query result. It transforms a given probabilistic database of priors into a probabilistic database of posteriors. The confidence values of the tuples in a query result can then be calculated for the posterior database. MCDB-R is different in that it is based on Monte Carlo approximations instead of calculating exact confidence values.

3.2 Probabilistic Databases

Probabilistic databases are used to store data uncertainty in the relational model. They have been studied for a long time. Interest in them is due to applications like


sensors, data integration, and cleaning. Methods like information extraction generate uncertain data [31] that can be managed by a probabilistic database. Probabilistic databases can also be used for what-if analysis [37], where the future data is modeled as probability distributions. Processing probabilistic data has been studied extensively in recent times; Trio [69], MayBMS [5], MystiQ [50], Orion [13], MCDB [37], PrDB [56], and BayesStore [67] are some of the recently developed probabilistic databases. For a recent survey of this field see [61].

Modeling uncertainty in these databases is done primarily in two ways: tuple-level and attribute-level uncertainty. In tuple-level uncertainty, a tuple has a probability of occurrence in the relation; it can either occur or not. In attribute-level uncertainty, the attribute value is a random variable.

Some of the early work on probabilistic databases was done in [34]. They define v-tables, which have variables that represent incomplete information. Many of the relational operators can be run on v-tables in the same way as on a relational table. The v-tables cannot support projection and arbitrary selections. They also define a conditional table (or c-table), a variation of the v-table with a condition for each tuple. This framework does not handle dependencies between attributes.

[7] uses an attribute-level uncertainty model. Their relation has a deterministic key and both probabilistic and deterministic attribute groups. Their probabilistic attributes represent a discrete probability distribution. They show that projection and selection are lossless. Some new operators specific to the probabilistic model are also given. For example, the x-select operator works by matching probability distributions that are close.

[24] presents a probabilistic relational algebra in which the tuples are assigned probability values. These values give the probability that the tuple belongs to the relation. They use intensional semantics to evaluate the query. In intensional semantics, each tuple has an associated event. After the query evaluation each output tuple has an expression over the events from which that tuple is derived (the lineage of the output tuple). This


expression can be very complex. They then need to solve this (sometimes complex) expression to find the probability of the tuple in the output relation.

[42] stores probability intervals along with attributes in their probabilistic database system, ProbView. They define multiple relational operators like projection, selection, and Cartesian product on their data representation, along with a new operator called compaction. Compaction works like duplicate removal, combining the probability intervals of tuples with the same values. They also give an incremental algorithm to maintain views on their database for insertion and deletion operations.

In contrast to the intensional query evaluation in [24], [18, 50] use extensional query evaluation. In extensional evaluation, the operators calculate the actual probability value even at the intermediate stages of the query plan. This avoids solving, at the end of processing, the complex expression that we saw in intensional semantics. [18] shows that some query plans give incorrect results with extensional evaluation, so they give a method to find a safe extensional query plan (by query rewriting) when it is possible to do so. For unsafe plans, they describe a Monte Carlo based simulation algorithm, along with an approximation algorithm.

The complexity of executing queries in probabilistic databases is explored in [19]. They show that the complexity of evaluating a conjunctive query over a probabilistic database with the tuple independence assumption is either PTIME (Polynomial TIME) or #P-hard. They also give an algorithm that recognizes whether a query is safe or not.

Trio [2, 8, 69] keeps lineage information along with the uncertain data. In their model, the probability of occurrence is stored in the tuples and is called the confidence value. Some relational operators like projection are not closed on this data, but lineage is useful in calculating the new confidence values for a tuple after the operation. The query model in [8] allows the user to specify the minimum confidence value required for the tuples to be in the output (Top-k queries). They use lineage to calculate confidence values over the query results on uncertain data. This method is related to the intensional semantics, and the lineage expression


sometimes can become very complex. In such a case a Monte Carlo approximation method [40] is used to solve the expression.

[4, 5, 33] introduce U-relations to capture uncertainty in the data. A U-relation has random variables whose probability values are defined in another table. They achieve attribute-level uncertainty through this representation. They too use the intensional evaluation model on their U-relations. The difference from previous approaches like Trio is that they take less space to store the uncertain data. To calculate the confidence values of output tuples (if the expression is too complex) they use Monte Carlo approximations [40].

MCDB [37] uses Monte Carlo sampling to estimate (user-defined) statistical models on uncertain attributes in a relational database. MCDB treats the database with uncertain attributes as a joint probability distribution. Since the dimensionality of that distribution is very high, Monte Carlo sampling is used for estimating query results over that joint distribution. Multiple database instances are created through sampling. Each uncertain attribute is defined by a probability distribution, and the attribute value is represented by a sample from that distribution. Each of the database instances is generated independently through a pseudo-random generator. For efficiency, MCDB bundles all the database instances of a tuple together and executes the query over that tuple bundle only once for all instances, instead of once per instance. MCDB's framework of user-defined statistical models allows flexibility in the types of models and the types of queries.

PrDB [20, 56] uses probabilistic graphical models as base models. This gives PrDB the flexibility of representing tuple-level uncertainty, attribute-level uncertainty, as well as correlations within attributes and within tuples.

3.3 Top-k Ranking in Probabilistic Databases

Top-k ranking in probabilistic databases, although recent, is well studied [9, 15, 28, 32, 44, 49, 58, 65, 72]. There have been multiple definitions of top-k tuples in a probabilistic database due to the existence of confidence (probability of occurrence) along with the actual value (or score). MCDB-R differs from this work as it generates


values after a given quantile. Though top-k ranking can be used to generate values with low probability, it is not suitable for restricting the values to within a given quantile.

[58] first studied this problem, giving two definitions of top-k tuples: Uncertain Top-k and Uncertain Rank-k. An Uncertain Top-k query returns a list of k tuples that appears as the top-k answer in most possible worlds. For Uncertain Rank-k, each tuple returned for a position should occur at that position in most possible worlds. That is, for the ith rank, we return a tuple that occurs at the ith rank in most possible worlds. The problem is formulated as a state space search. The algorithm scans the probable tuples in top-k and maintains a large number of states. Each state is a set of possible worlds that have common top tuples.

[72] gives more efficient algorithms for Uncertain Top-k and Uncertain Rank-k. They use a representation called x-tuples in their algorithms. Each x-tuple has multiple options, and each option has an associated probability. These probabilities define a discrete distribution over all the options for that tuple. The algorithm uses a hash map to store the x-tuples, such that if the id of a tuple is given then its confidence value can be retrieved fast.

[32] gives a probabilistic threshold top-k query. A tuple will be in the top k if it has at least probability p of being in the top-k lists in all possible worlds. [32] gives an exact algorithm to compute this, along with some pruning techniques that use the probability threshold to improve the efficiency. A sampling-based approximate method is also given.

[15] defines expected ranks in both attribute-level and tuple-level uncertainty models. The rank of a tuple t in a possible world is the number of tuples whose score/value is higher than t. They give exact algorithms and some pruning methods to calculate expected-rank-based top tuples, in both attribute-level and tuple-level uncertainty models.

Finding nearest-neighbor probabilistic top-k objects is studied in [9]. They proposed approaches based on input/output (I/O) and Central Processing Unit (CPU) cost. Their query processing algorithms are designed to minimize the total cost. [28]


gives a method that gives some emphasis to scores in the top-k selection. They define the c-typical top-k, which returns c typical top-k vectors according to the score distribution (a typical set in information theory is a set of sequences whose probability is close to 2^{-e}, where e is the entropy of their source distribution). There is a good chance that the actual top-k score is close to one of these c vectors.

In [49] top-k is defined only on probability values. The k tuples with the highest probability are given in the output. For efficiency, they run several Monte Carlo simulations together, one for each candidate in the result set, and calculate the probability values to the minimum accuracy required.

A recent work [43, 44] provides a unified framework for learning multiple ranking functions over a probabilistic database. A set of key features are identified, and parameterized ranking functions are defined over these features. The parameters are chosen based on the user preferences. These parameterized functions can encompass many of the top-k methods described above. [44] presents several efficient algorithms to calculate these ranking functions.

3.4 Monte Carlo Methods in Databases, Data Mining, and Machine Learning

The specific Markov Chain Monte Carlo (MCMC) based Gibbs cloning (or splitting) idea [10] adapted in MCDB-R was mainly developed for rare event simulation. [14] presents a novel algorithm for sensor deployment based on the splitting and Gibbs sampler approach from [10]. Their focus is to find the best spatiotemporal placement of the sensors to improve the detection of a target that is intelligent, reactive, and moving with a strategy to go undetected. This is an optimization problem, and genetic algorithms were used previously to solve it.

One of MCMC's popular uses is in machine learning, as part of the Bayesian inference framework. A good introduction to using MCMC for Bayesian inference is given in [25]. MCMC is used as part of the inference framework for modeling dynamic user interests in [3]. For scalability to millions of users, their inference procedure incrementally runs a fast batch Gibbs sampler on the data at a given time instance, given


the state of the sampler at previous time instances.

Gibbs sampling and latent Dirichlet allocation (LDA) have been popular in learning topic models from text corpora, since Gibbs sampling is, in theory, more accurate than the variational methods. Topic modeling on text documents is a popular area since it has applications in, among others, information retrieval and document clustering. The drawback of Gibbs sampling for this application is its efficiency, and hence its scalability to large data sets. A novel and faster method of collapsed Gibbs sampling for LDA is proposed in [47]. A collapsed Gibbs sampler integrates out some variables, and sampling is done over the simplified distribution with only the specific variables of interest. [64] used a collapsed Gibbs sampler for document clustering. The model used is a mixture of multinomials with Dirichlet priors. Through experiments, they show that using MCMC is better than using the Expectation-Maximization algorithm for their model. Various methods for topic modeling are compared in [71]. Some of those methods are based on Gibbs sampling, and some are variational inference methods like maximum entropy. They show that their new sparse LDA is faster than previous LDA-based methods. Their performance improvement is due to a new data structure that results in fast sampling even with low memory usage. Scaling of MCMC methods to larger data sets is studied both in machine learning [12, 23, 57] and in databases [66, 73].

The basic Monte Carlo method has also been used previously for query processing in probabilistic databases [27, 37, 39, 68]. Processing queries on array databases requires Monte Carlo simulations [27]. This requirement is because of the complexity of the array and vector operations required by applications in science and engineering. Using an exact method is definitely infeasible due to the dimensionality, and methods like Gaussian mixture models cannot produce closed-form results for operators other than + and -. A second reason is that the models used to represent correlated uncertain data in array data, like graphical models, require Monte Carlo approximate calculations. In [27], the authors use Markov Random Fields as the model and give solutions


to complex operators like array join.

MCDB [37] uses Monte Carlo sampling to estimate (user-defined) statistical models on uncertain attributes in a relational database. It provides samples from the query-result distribution, instead of a single value, as output. This paper is discussed in more detail in Section 3.2.

Monte Carlo methods are used in [39] for evaluating lineage processing on correlated probabilistic data. Specifically, a Gibbs sampler is used in this setting for approximate evaluation of conjunctive queries with large complexity. While exact evaluation of lineage processing on the tuple-independent model is polynomial time, the complexity for correlated tuples is #P-complete [39].

In [68] factor graphs (a type of probabilistic graphical model) are used to represent uncertainty over relational data. To avoid the #P complexity of query evaluation, [68] uses MCMC-based inference. The specific MCMC algorithm they use is Metropolis-Hastings. They propose a new algorithm combining Metropolis-Hastings and materialized view maintenance. The view maintenance technique is used for performance improvement.

In [22], MCMC is used for data cleaning. Their data set consists of facts collected through crowdsourcing, and their main goals are to identify which of the facts are reliable, how to answer questions based on the collected data, and what questions to present next to the users to improve the data quality. The data set is treated as a probabilistic database, and some probabilistic and recursive rules are defined for interpreting the data. As we have seen before, evaluating queries exactly on probabilistic databases is infeasible. An MCMC-based algorithm for approximate calculation of the query result can be used. But MCMC is guaranteed to produce good approximations only if the data cleaning rule set is strongly connected. A disconnected rule set is not ergodic, a required property for good MCMC approximations. [22] uses MCMC separately on each connected component of the rule set and combines the results.

Monte Carlo methods, apart from being used to evaluate queries over probabilistic databases, are also used to create probabilistic databases from uncertain or missing


data. An inference algorithm based on Gibbs sampling is proposed in [60] for predicting the probability distribution over multiple missing values. They use this algorithm as part of their learning framework to create probabilistic databases. A joint distribution over the data is approximated with a set of conditional probability distributions that together are called a dependency network. Each of the conditional probability distributions is learned independently, and hence a consistent joint distribution cannot be guaranteed. Gibbs sampling inference techniques are used to recover an approximation of the joint probability distribution even though the individual conditional distributions are inconsistent.

3.5 Risk Analysis Software

Monte Carlo simulations are implemented in many risk analysis software packages [16, 45, 62]. MCDB [37] too allows scalable risk analysis with its flexible VG function framework. Oracle Crystal Ball [45] supports predictive modeling, forecasting, Monte Carlo simulation, and optimization. It is based on Microsoft Excel and supports what-if risk analysis. @RISK [16], along with an implementation of Monte Carlo simulations, provides RISKOptimizer to speed up the simulations. Software based on spreadsheets, like Crystal Ball and @RISK, will not be as scalable as MCDB because it lacks disk-based algorithms; if an application exceeds the random access memory (RAM) size, then most of the time is spent swapping data to and from disk. Other stand-alone risk analysis software like Analytica [62] is proprietary, and it is difficult to tell whether the implementations are disk optimized. Analytica claims to be twenty times faster than risk software implemented on standard spreadsheets, so it might be using some disk-based algorithms. Note that none of these risk analysis software packages, to our knowledge, supports tail sampling.


CHAPTER 4
SERIALIZING SAMPLE GENERATION PROCESS

Figure 4-1. An MCDB-R query plan

4.1 Background

The MCDB-R system uses sampling-based Monte Carlo simulations to perform risk analysis on uncertain data. In MCDB-R, the data uncertainty is modeled through parametric probability distribution functions. The whole database is seen as a high-dimensional joint distribution. Extreme risk analysis on this data can be done by observing the low-probability regions, the upper and lower tails, of a query-result distribution. For a given query, analytically solving for either the result distribution or any estimate on the tail of the result distribution is very difficult due to the high


dimensionality. Therefore, MCDB-R uses Monte Carlo sampling based Gibbs cloning to efficiently generate samples in the tail. By using these samples we can approximately calculate the estimates. The accuracy of these estimates depends on the number of samples. The Gibbs cloning method is used in MCDB-R because it is more efficient in generating a large number of samples from the tail.

Gibbs cloning is an iterative process that moves further into the tail after every iteration. Each iteration has a cloning step and a perturbation step. In the cloning step we delete the samples that are away from the tail and replace them with copies of samples that are further into the tail. The new set of samples will be highly correlated due to cloning, but they will be further into the tail. These correlated samples are randomly perturbed such that the new samples after perturbation are uncorrelated but still remain close to the tail. The distance to the tail is enforced through a constraint on the aggregate value. Perturbation is done by replacing the existing sample with another random sample that still satisfies the constraint. Finding a good replacement for multiple dimensions of the data together is extremely difficult. Hence, the replacement on high-dimensional data is done by a Gibbs sampler. The Gibbs sampler perturbs the samples one dimension at a time. At a given dimension the replacement of an existing sample is found through the Rejection algorithm (Section 2.1.2). The Rejection algorithm replaces the current sample with a new random sample and checks the constraint on the aggregate to see if the new sample is still far enough into the tail. It rejects new samples until it finds a sample that satisfies the constraint.

4.2 Motivation

In MCDB-R, each database sample is called a DB instance. MCDB-R borrows the tuple bundle idea from MCDB to perform efficient query processing on this large number of DB instances. A tuple bundle stores all the samples (DB instances) for a given dimension of the data. Each tuple bundle is uniquely identified by its seed identifier (Section 2.4). Because of tuple bundles, the query is run only once, instead of once for


each DB instance. Thus, bundling improves the performance significantly compared to independent execution of the query on each of the DB instances. The sample sizes used in MCDB-R are larger than those in MCDB because the Rejection algorithm consumes significantly more samples. The samples in a tuple bundle are generally exhausted when the Rejection algorithm is replacing a sample with low probability. It can happen that the aggregate value on the data is very close to the constraint. Then the Rejection algorithm must find another sample very close to the sample value it is trying to replace; otherwise the constraint will not be satisfied. Since the original sample is of low probability, the Rejection algorithm might not find a replacement until it has rejected a large number of samples.

Whenever a tuple bundle is out of samples the query is rerun to generate more samples. At each rerun the number of samples generated, nums, is doubled (unless a single tuple bundle's size exceeds the page size, in which case nums remains the same). Since nums is the same for all the tuple bundles in a query run, each tuple bundle now stores double the samples and is approximately double in size. But most of the other tuple bundles may not require the new samples that are generated.

Example: When executing query Q2 (see Section 4.7), in the second rerun the average number of samples used for each tuple bundle (or seed identifier) at the Rejection algorithm is 150, whereas nums is 4000. Most of the seed identifiers only need 150 samples, but due to the outliers that need 4000 samples, all seed identifiers have to store the 4000 samples. The framework of MCDB-R does not allow different sample sizes for different tuples. Even if the framework allowed such an option, it is not easy to predict beforehand which of the seed identifiers require more samples than others. This wastage of samples only increases with the number of reruns. At the fourth rerun the number of used samples on average is 210 out of 16,000 samples generated. Though the Normal VG function is not very expensive, generating so many unused samples had a significant impact on the overall query execution time.
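The way a single hard replacement can exhaust a tuple bundle, and how little of each bundle is used on average, can be illustrated with a short sketch. It is a simplified stand-in for the Rejection algorithm rather than the MCDB-R implementation: the function names are hypothetical, and the constraint is modeled as a predicate on the candidate sample itself instead of on the query aggregate.

```python
def rejection_replace(samples, next_unused, accepts):
    """Scan forward from `next_unused` for the first acceptable sample.

    Returns (index, consumed). The index is None when the bundle is
    exhausted, which is the situation that would trigger a query rerun.
    """
    consumed = 0
    for i in range(next_unused, len(samples)):
        consumed += 1
        if accepts(samples[i]):
            return i, consumed
    return None, consumed


def utilization(used, generated):
    """Fraction of the generated samples that were actually consumed."""
    return used / generated


# A tiny bundle with a constraint that only a far-tail value satisfies.
bundle = [0.1, 2.5, 0.3, 3.2, 1.0]
in_tail = lambda x: x > 3.0
print(rejection_replace(bundle, 0, in_tail))   # three rejections, then a hit
print(rejection_replace(bundle, 4, in_tail))   # bundle exhausted

# Figures quoted in the text for query Q2.
print(f"{utilization(150, 4000):.2%} used in the second rerun")
print(f"{utilization(210, 16000):.2%} used in the fourth rerun")
```

Using the averages quoted above, under 4% of the generated samples are consumed in the second rerun and under 1.5% in the fourth, while every tuple bundle still carries the full sample array.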


Figure 4-2. Modified query plan

Even if the sample generation process is inexpensive, the increase in data size will have an adverse effect on the response time. Figure 4-1 shows the simplified physical plan for query Q2. The samples are generated at the Instantiate operator (Section 2.5) and are passed through the plan to the Tail sampler operator (Section 2.7). The Tail sampler goes through the seed identifiers one by one in sorted order. The order is maintained by the Gibbs queue (Section 2.7.1). In most cases the total tuples in the Gibbs queue would not fit in main memory, therefore it writes them to disk. The solid arrow from Tail sampler to Gibbs queue signifies a new perturbation step (or Gibbs iteration) and a reset of the Gibbs queue. The dotted arrow shows a query rerun, which is triggered when one of the seed identifiers runs out of samples. The query, when run on a 200 MB data set with 185,000 tuple bundles and with the 0.999 quantile as the target tail area, writes approximately 25 gigabytes (GB) of data to disk in the fourth rerun, when nums is 16,000. For five Gibbs


iterations, the query writes and reads more than 1 TB of data on disk. See Section 5 for more statistics on the execution of this query.

4.3 Solution Outline

The technique we present in this chapter allows the system to serialize the sample generation process and delay the actual sample generation as long as possible. The new query execution avoids the movement of the large tuple bundles through some expensive operations like sorts and joins. The query plan in Figure 4-1 can now be modified to that in Figure 4-2. In the new plan, Instantiate(S) denotes Instantiate-serialize. In this modified operator, instead of generating the samples as before at the regular Instantiate, the VG function (Section 2.3) parameters are serialized into a string, called SVGF for Serialized VG Function, and kept along with the Seed attribute (Section 2.4). Instantiate(G) de-serializes the SVGF string and generates the requested number of samples. It can be seen in Figure 4-2 that the sample generation is done after the Gibbs queue. Thus, instead of a large number of samples, now only the SVGF string is written to the disk. The SVGF string is typically less than 100 bytes in size. For queries that do not have a join on stochastic attributes, and hence are similar in structure to query Q2, this serialization eliminates the requirement of query reruns. Now we can generate a variable number of samples for each seed identifier separately, which eliminates the generation of a large number of unused samples for many seed identifiers when only very few seed identifiers require them.

4.4 Serialization

In this section, we discuss the process of serializing the VG function and the parameters for a given seed identifier. We refer to this representation as SVGF, as noted previously. To serialize the VG function we first store the name of the function and then all the parameters that are supplied to the function. The schema of the parameters needs to be stored as well. The VG function framework in MCDB-R is very generic. A VG function can take multiple tuples from parameter (inner) tables and one tuple from the outer


table for generating a single sample. Instantiate picks the required attributes from these tuples and creates a parameter array, which is then passed to the VG function. Figure 4-3 shows the VG function process within Instantiate. Recreating this sample generation process requires exact simulation of the passing of parameters, as done in Instantiate.

Figure 4-3. A VG function process

Figure 4-4. Serialization of VG function

The serialization format is shown in Figure 4-4. After storing the VG function name we store the schema of the parameter array. Then we store the number of times the parameter array is passed, and then the parameter arrays themselves. The SVGF is created during the first execution of Instantiate. During this first execution, the VG function is called as before, except that only one sample (which is discarded later) is generated and the SVGF


is formed. Once the SVGF is created, it is stored as a string in the tuple as part of the Seed attribute related to the VG function. A single VG function call can produce a sequence of values as part of the sample, resulting in multiple stochastic attributes. Therefore the relationship between a stochastic attribute and its location in the sequence of values is stored in the stochastic attribute. Since a single tuple can have multiple Seed attributes, each stochastic attribute also stores the particular Seed from which it is generated.

Figure 4-5. Bayesian VG: Input serialization

Figure 4-6. Bayesian VG: Inner input

Figure 4-7. Bayesian VG: Outer input

Figures 4-5 through 4-8 show the serialization process for a Bayesian VG function (see query Q1 in Section 4.7). This VG function takes seven parameters; four of them are passed through the inner input pipe and the other three are passed from the outer input


pipe. Figure 4-5 shows the schema of the parameter array. The first attribute is the seed identifier, and the next three are the parameters from the outer input. The last four are the parameters for the two Gamma functions in the Bayesian model. The parameters from the inner and outer inputs are sent to the VG function separately, through different arrays and different calls to TakeParams(). Instantiate maps the inner and outer input tuples into the parameter array. Figures 4-6 and 4-7 show the transformation of an inner and an outer tuple into the array format. In the serialization step the parameter count is two, since we have one inner and one outer tuple. Figure 4-8 shows the SVGF created for the Bayesian VG function. First we store the name of the VG function (Bayesian), then the schema of the parameter array, and then the parameter count. In the parameter array, the unused fields are set to NULL. Before each value, a flag is stored to indicate whether that value exists or is NULL in the array.

Figure 4-8. Serialization of Bayesian VG input process

4.4.1 Incorporating Serialization in a Query Plan

Predicates on stochastic attributes are generally pushed down the query plan and are evaluated just after Instantiate. If the sample generation is serialized, then the samples are not generated at Instantiate. Therefore any predicates on the stochastic attributes should be moved to the operator at which the samples are generated. All the non-join predicates on stochastic attributes are passed to the Tail sampler operator. These predicates are applied once the samples are generated inside the Tail sampler.

4.4.2 Stochastic Join Predicates

Moving the join predicate up into the Tail sampler is not recommended. Without a join predicate, a cross product is still performed. This could degrade the performance, because a join predicate reduces the number of output tuples from a cross product. The join predicate can have a stochastic attribute in both sides of the join as in Figure 4-9


or just one side of the join as in Figure 4-11. In both query plans, the samples have to be generated before the Split operator (an MCDB-specific operation that splits a stochastic tuple into multiple deterministic tuples; see Section 2.6.2), therefore we can only materialize the operations just before it.

Figure 4-9. Join on stochastic attribute on one side
Figure 4-10. Query plan after serialization

If there are multiple Seed attributes in the tuple bundle before the Split operator, only the samples that are processed in Split


need to be generated. A materialization just before the Split operator is still useful, as it avoids running the expensive Instantiate operation from the second run onwards. As in the left subtree in Figure 4-9, if a stochastic attribute is not part of the join predicate, then there is no need to de-serialize that attribute until the Tailsampler.

Given a query plan, we use the following rules to identify where in the query plan it is safe to materialize the operators.

Rule 1: Any ancestor operator to a Split cannot be materialized.

Rule 2: After an Instantiate operator, check if there are any joins. For any such join, if the other side of the join predicate has a stochastic attribute, then materialize the operation just before the Split associated with the join. This case can be identified by traversing the other side of any joins in the path leading from Instantiate to Tailsampler in the query plan.

Applying these rules results in modified query plans as shown in Figures 4-10 and 4-12. They are the modifications of the query plans in Figures 4-9 and 4-11, respectively.

Figure 4-11. Join on stochastic attributes on both sides

4.5 Efficient De-serialization

De-serializing the SVGF string is straightforward. First read the VG function name and create the function object. Then read the schema of the parameter array. After


that, read the parameter arrays one by one and pass them to the VG function object. To make use of multiple cores in the machine, we can multithread the sample generation process. This is easy since each sample has its own pseudo-random seed. This pseudo-random seed is computed through a hash function on the location of the sample in the sample stream and the PRNG seed from the Seed attribute. Each thread is given a range in the sample stream and the PRNG seed, and can generate its samples independently of the other threads.

Figure 4-12. Query plan after serialization

4.6 Related Work

This idea of serialization is closely related to compression in databases [1, 29, 30, 36, 52, 74], which focuses on reducing execution times by saving disk I/O. Since samples are of the same type and are stored together in an array, our setting is similar to storing columns in a column store. For integer or string data the bit-vector compression scheme [1] can be used. Since the sample array is not sorted, this seems to be the only feasible method. The serialization of the VG function still offers more disk savings compared to bit-vector compression, due to the large bit-vectors needed in MCDB-R. For example, the size of the SVGF for a Normal VG function is less than 100 bytes, but a bit-vector for 16,000 values needs 16,000 bits, which is nearly 2 KB in size, a factor of 20. Compressing


floating point data is not very useful in MCDB-R, since our data consists of random samples, and the maximum compression for random floating point data generally does not exceed a factor of 4 [48], even in the best case.

4.7 Experiments

In this section we evaluate the serialization of VG functions through experiments. We conduct various experiments to test the efficiency and scalability of the new MCDB-R with serialization. Our aim is to show that using serialization in the MCDB-R system allows it to execute queries much faster. The specific questions we want to answer here are:

1. Does serialization improve the efficiency of the queries?
2. How are the disk usage and the number of samples generated affected?
3. Is the system more scalable now?
4. Does increasing the number of threads improve the sample generation after serialization?
5. How does the performance of the new system vary with different VG function types (short tailed vs. long tailed)?

Experimental Setup: We use the Transaction Processing Performance Council Benchmark H (TPC-H) data generator [17]. The workstation used for the experiments has 32 GB of RAM, four 720 GB Dell hard disks, and 8 processing cores of 3.2 GHz each, distributed over 2 chips. The system runs a 64-bit Ubuntu operating system, version 11.04. We added the serialization feature to the MCDB-R prototype implementation. It is written in C++ and consists of 30,000 lines of code. We use three queries for benchmarking. Query Q1 is based on query Q4 from [37], and query Q2 is from [6]. See Appendix 7 for the SQL statements for both Q1 and Q2. Query Basic has the bare-bones structure needed to use a VG function.

Query Q1. This query simulates the effects of a 5% increase in price on the profits. The change in customer demand due to the new prices is estimated by using a Bayesian


VG function. The same prior distribution for demand is used for all customers. Bayesian methods are used to find the posterior distribution based on the parameters specific to the customer. The VG function outputs the possible customer demand for the new price. We chose this query for benchmarking because it has an expensive VG function.

Query Q2. This query is mainly used for testing correctness. The orders relation is made stochastic by adding a normally distributed stochastic attribute. In the parameter table for the normal distribution, all mean and variance values are set to 1. The query is such that the result distribution is also normally distributed, and has a closed form solution.

Query Basic. This query has a single Instantiate before Tailsampler. The aggregate is on the sample value from the VG function. The Instantiate takes one outer table and one parameter table. For this query, the parameters are the same for each of the tuples in the outer table. This query is used to compare different VG functions. The probability density functions used here range from those with no tail (Uniform) to long tailed distributions like Gamma and χ².

Given this setup, there are six different variables we need to know for each test in order to understand the results:

1. query: The query used in the test.
2. DBsize: The on-disk size of the TPC-H dataset, in binary representation.
3. numSeedIds: The number of seed identifiers processed during the query execution.
4. numdb: Number of DB instances for the query.
5. gibbsIters: Number of iterations at the Gibbs looper.
6. eliteDBPercent: Percentage of DB instances retained at the end of each Gibbs iteration. The deleted DB instances are cloned from the retained elite instances.

We run five different tests. The first test is executed on MCDB-R, with and without serialization. The rest of the tests are on MCDB-R with serialization. The second and third tests are useful in understanding the scalability of the new system. Test four helps us understand whether the time taken for sample generation after serialization can be improved by


using more threads. The fifth test is useful in understanding the performance of serialization on long tailed distributions when compared to short tailed ones. Here is a concise description of the parameter values used for each test:

1. Performance with and without serialization: We use queries Q1 and Q2. DBsize is 200 MB. numSeedIds for Q1 is 93,000, and for Q2 is 46,000. numdb is 100. gibbsIters and eliteDBPercent are 5 and 50% respectively.
2. Scalability with increasing DBsize: We use two datasets with DBsize 200 MB and 20 GB. On the 200 MB dataset, the value of numSeedIds is 93,000 for Q1 and 46,000 for Q2. On the 20 GB dataset, the value is 9,100,000 for Q1 and 4,600,000 for Q2. The rest of the parameters are the same as in Test 1.
3. Scalability with increasing numdb: The variable numdb is varied from 10 to 400. The rest of the parameters are the same as in Test 1.
4. Parallel sample generation after serialization: All the parameters are the same as in Test 1. We include a multi-threaded implementation of sample generation at Instantiate(G). The number of threads is varied from 1 to 4.
5. Performance for different VG functions: The queries used here are query Basic with Uniform, Normal, Gamma, and χ² VG functions. The value of DBsize is 20 MB, numSeedIds is 10,000, and numdb is 100. The values of gibbsIters and eliteDBPercent are 10 and 50% respectively.

Results: The results for these five tests are shown in Tables 4-1 to 4-6. We ran each test multiple times and listed the average values. Tables 4-1 and 4-2 show the data for Test 1. The performance improvement, when comparing MCDB-R with and without serialization, is given in Table 4-1 under the column Improvement Factor. Disk usage and the number of samples generated and used for each execution are given in Table 4-2. The total disk usage values shown here are computed by keeping track of the total page reads and writes during the execution of the query. Peak disk usage is calculated from the data input into the Gibbs queue. Wall clock times (in minutes) for Test 2 are given in Table 4-3. In Table 4-4, we have the wall clock times for Test 3. The times given are in seconds. Results for Test 4 are given in Table 4-5. Again, the times are given in seconds. Finally, Table 4-6 gives the execution times (in minutes) for Test 5.
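The disk-usage comparison in Test 1 rests on the serialized VG function (SVGF) being a short string. As a rough illustration of the layout described in Section 4.4 (function name, parameter-array schema, parameter count, then each array with a presence flag before every slot), the following Python sketch encodes and decodes such a string. The delimiter, the field names, and the choice to store every value as a float are assumptions made for this sketch, not the prototype's actual format.

```python
def serialize_vg(name, schema, param_rows):
    """Encode a VG call as: name | schema | row count | flagged values."""
    parts = [name, ",".join(schema), str(len(param_rows))]
    for row in param_rows:
        for field in schema:
            value = row.get(field)
            # One-character flag per slot: "0" means NULL, "1" means a value follows.
            parts.append("0" if value is None else "1" + repr(float(value)))
    return "|".join(parts)


def deserialize_vg(text):
    """Decode the string produced by serialize_vg back into its parts."""
    tokens = text.split("|")
    name, schema, count = tokens[0], tokens[1].split(","), int(tokens[2])
    values = tokens[3:]
    rows = []
    for r in range(count):
        row = {}
        for c, field in enumerate(schema):
            token = values[r * len(schema) + c]
            row[field] = None if token == "0" else float(token[1:])
        rows.append(row)
    return name, schema, rows


# Hypothetical schema: seed id, three outer parameters, four Gamma parameters.
SCHEMA = ["seed_id", "o1", "o2", "o3", "g1", "g2", "g3", "g4"]
inner = {"g1": 2.0, "g2": 1.5, "g3": 3.0, "g4": 0.25}          # unused slots stay NULL
outer = {"seed_id": 17, "o1": 100.0, "o2": 1.05, "o3": 7.0}    # unused slots stay NULL
svgf = serialize_vg("Bayesian", SCHEMA, [inner, outer])
```

Decoding `svgf` with `deserialize_vg` returns the name, the schema, and the two parameter rows with their unused slots restored as `None`, which is all that Instantiate would need to regenerate samples later.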


Table 4-1. Wall clock execution times

       With            Without         Improvement
       Serialization   Serialization   Factor
Q1     4.2 mins        24 mins         5.7
Q2     2 mins          32 mins         16

Table 4-2. Disk I/O and sample statistics

                            Total Disk     Peak Disk       Total Samples   Avg. Gen.     Avg. Used
                            I/O (in GB)    Usage (in GB)   (in Billions)   Per SeedId    Per SeedId
Q1 with Serialization       9              1               0.12            1,300         500
Q1 without Serialization    74             4               0.65            7,000         500
Q2 with Serialization       10             1               0.061           1,326         510
Q2 without Serialization    1374           55              3.4             74,000        510

Table 4-3. Scaling with database size

Size   200 MB      20 GB
Q1     5.5 mins    428 mins
Q2     2 mins      241 mins

The parameter values used for the different VG functions are given in the first row of Table 4-6. For example, the Normal VG function has mean μ set to 1, and variance σ² set to 1.

Discussion: Test 1: Table 4-1 shows a significant improvement when serialization is added. For Q2, the performance improvement is more pronounced because the Normal VG function generated rare samples more frequently than the Bayesian VG function. In Table 4-2 we can see the evidence for this. When using serialization, query Q1 generates about one fifth of the samples that it generates without serialization. For query Q2 the ratio of samples generated is about 50. Execution of query Q2 without serialization had to rerun the plan three times to generate extra samples (not given in the table). This is significant because every new run results in two extra passes over


the samples. More reruns also mean more unprocessed samples. Even when only one seed identifier requires extra samples, they are generated for all other seed identifiers as well.

Table 4-4. Scaling with DB instances (time in seconds)

numdb   10    100   200   400
Q1      47    252   476   958
Q2      40    120   206   376

Table 4-5. Parallel sample generation

Threads   1     2     4
Q1        252   332   373
Q2        120   150   148

Tests 2 and 3: Table 4-3 shows that the system is scalable and can be run on large databases with millions of tuples. For both queries, a factor of 100 in dataset size seems to result in approximately a factor of 100 in the execution time. The execution time seems to scale linearly. In Table 4-4 we can see that for both queries Q1 and Q2, the response time increases linearly as we vary the number of DB instances.

Test 4: The performance is worse for the multi-threaded implementation. A major reason for this could be the use of the WELL512 (Well Equidistributed Long-period Linear, 512 bits) [46] random generator. WELL512 is initialized once for each thread, and this initialization is expensive as it buffers 512 random bits. The query is run for 100 DB instances, so in most cases we generate only 200 samples at a time. If we use 4 threads, each thread now produces 50 samples and so may not use a significant portion of those 512 bits. Using a different random generator could avoid this problem.

Table 4-6. Performance for different VG functions (time in minutes)

VG Function   Uniform        Normal          Gamma          χ²
              (a=0, b=0)     (μ=1, σ²=1)     (k=1, θ=6)     (k=6)
Time          34             40              46             41

Test 5: The execution time in Table 4-6 shows that, as expected, the query with the Uniform VG function finishes faster than the others with longer tails. The query with the Gamma VG function, with low shape


parameter and a long tail, takes the most time. From this test, we can see that MCDB-R with serialization works well for heavy tailed distributions also.

Conclusion: Four of the five experiments we ran gave the expected results. We can conclude from Tests 1, 2, 3, and 5 that serializing the VG function in MCDB-R is extremely useful in improving the query performance and scalability of the system. More work needs to be done on making the sample generation (after de-serialization) parallel and faster. Test 4 showed that our current multi-threaded approach for this operation is not efficient.
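The per-sample seeding scheme from Section 4.5, on which the parallel sample generation of Test 4 relies, can be sketched in Python as follows. This is only an illustration under stated assumptions: SHA-256 stands in for the unspecified hash function, Python's built-in generator stands in for WELL512, a standard normal draw stands in for the VG function, and all function names are hypothetical.

```python
import hashlib
import random
from concurrent.futures import ThreadPoolExecutor


def sample_seed(prng_seed, position):
    """Hash the Seed attribute's PRNG seed with the sample's stream position."""
    digest = hashlib.sha256(f"{prng_seed}:{position}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def generate_range(prng_seed, start, stop):
    """Generate the samples at positions [start, stop) of the sample stream."""
    return [random.Random(sample_seed(prng_seed, pos)).gauss(0.0, 1.0)
            for pos in range(start, stop)]


def generate_parallel(prng_seed, n_samples, n_threads):
    """Give each thread a contiguous range of the stream, then concatenate."""
    if n_samples <= 0:
        return []
    step = -(-n_samples // n_threads)  # ceiling division
    ranges = [(s, min(s + step, n_samples)) for s in range(0, n_samples, step)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        chunks = list(pool.map(
            lambda r: generate_range(prng_seed, r[0], r[1]), ranges))
    return [x for chunk in chunks for x in chunk]
```

Because every sample's seed depends only on the PRNG seed and its position, `generate_parallel(seed, n, t)` yields the same stream for any thread count `t`. The sketch demonstrates that determinism, not a speedup: Python threads do not run this CPU-bound loop in parallel, and re-seeding a generator for every sample has its own cost, which is in the same spirit as the initialization overhead discussed for Test 4.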


CHAPTER 5
REPLACING AN EXTREME SAMPLE

5.1 Motivation

The MCDB-R system uses sampling-based Monte Carlo simulations to perform risk analysis on uncertain data. In this system the data uncertainty is modeled through parametric probability distribution functions. The whole database is seen as a high dimensional joint distribution. Risk analysis on this data can be done by observing the low probability regions, that is, the upper and lower tails of the distribution function. The Gibbs cloning method is used to sample from the tail of this high dimensional distribution. Gibbs cloning has two steps, cloning and perturbation, as explained in Section 2.1. The perturbation step is achieved through the Gibbs sampler, a Markov Chain Monte Carlo method. The Gibbs sampler processes a huge number of samples during a single query execution. Even for a moderately sized dataset the number of samples generated can be in the billions, as shown in some experiments from the previous chapter.

Since we process such a huge number of samples, the chance that an extreme sample (one which occurs with very low probability) appears in the sample stream is high. Such a sample is easily accepted at the Gibbs sampler as a good replacement if it is on the correct side of the tail. Being on the same side as the target tail, the sample will pass the constraint easily. But replacing that sample later, during the next Gibbs sampler iteration, might require another sample of nearly equal value (and hence a very low probability of occurrence). For example, if the probability of occurrence of the sample we are replacing is 10^-6, then the rejection algorithm (Section 2.1.2) will probably reject a million samples before finding a feasible replacement for it. Replacing one such extreme sample could exhaust the sample array from the current query run. We may even require numerous query reruns. Remember that during these reruns sample generation is done for the rest of the tuples as well, not just the tuple that ran out of samples. Therefore, the replacement process will make the query response time extremely high. In this chapter we will look


at an effective solution for this extreme value replacement problem in order to lower that response time.

Table 5-1. Poisson random relation

Id   Value   Sample stream
a    7       5, 7, 8, ...
b    8       12, 7, 8, ...
c    5       4, 3, 4, ...
d    9       8, 9, 7, ...
e    11      9, 10, 8, ...

Table 5-2. After a phase of rejection sampler

Id   Value   Sample stream
a    7       8, 5, 6, ...
b    12      7, 8, 6, ...
c    4       3, 4, 5, ...
d    8       9, 7, 7, ...
e    9       10, 8, 9, ...

Example: Consider the random relation Poisson in Table 5-1. It has the stochastic attribute Value and a Poisson VG function (Section 2.3) associated with it. Only one database instance is listed for the stochastic attribute to simplify the example. Let us say we want to perturb this database instance using the sample stream available. We need to replace the Value attribute in each tuple. The aggregate here is a summation, and let the constraint value be 23. That is, sum(Value) must be greater than or equal to 23. The instance of the relation after perturbation is shown in Table 5-2. The sum(Value) is still 23, and the tuple with id 'b' has the sample value 12. The accepted sample values for the tuples following tuple 2 are lower than their existing values, because the lower values still satisfy the constraint on the aggregate. For tuple 'c' the value 5 is replaced by 4, since the new sum(Value) 25 is still larger than 23, the constraint. Similarly, for tuple 'd', 9 is replaced by 8, and for tuple 'e', 11 is replaced by 9. Now in the next perturbation phase, the rejection algorithm will require a value of 12 or more for tuple 'b', as anything lower will make the aggregate sum(Value) go lower than 23. Finding such a sample is difficult because of its low probability. If we assume that we need 30,000 samples to


find another value greater than or equal to 12, then in the current MCDB-R we need to generate the same number of samples for the other tuples as well. The performance can be improved drastically if we can avoid producing such unnecessary samples.

Fortunately, in queries without a join on stochastic attributes, the serialized VG function is accessible at the rejection algorithm. Hence, processing a very large number of samples at the rejection algorithm does not require a query rerun, as it can generate samples on the fly whenever necessary. But this replacement process can be quite expensive in other queries, as it might require a large number of query reruns. Note that the page size of the system limits the maximum number of samples in a single tuple bundle and, in turn, in a single query run. A page size of 1 MB can hold at most 130,000 samples. To look at a million samples we need 8 runs even with a page size of 1 MB. It is possible that we might encounter multiple such extreme samples during the execution of a single query, each requiring tens of query reruns. If the number of tuples in the stochastic relation is large, then there is a good chance of encountering multiple extreme value situations in a single perturbation step. Without a proper method to speed up this process the queries will be very expensive.

5.2 Solution Outline

In this section we give the outline of an approach to solve this extreme value problem efficiently. The first step is to eliminate the sample generation for unused tuples in each query rerun. Since we need extra samples only for a single seed identifier (Section 2.4), unless there is a self-join, we can eliminate the tuples with the rest of the seed identifiers from the Instantiate (Section 2.5). For other seed identifiers which are not eliminated, we need to generate only those samples that are currently in use by a DB instance.

For the given seed identifier we might need to produce tens of millions of samples. Since all the samples cannot fit in a single page, we need to devise a workaround. One possibility is to perform a Split operation as soon as the samples are generated. Though this eliminates the problem of overflow, we still need to carry a


very long bit-string. A bit-string for 10 million samples is just above 1 MB, but is still manageable. Even this solution becomes unmanageable if we need a larger number of samples. A cleaner solution is to break the large tuple bundle into numerous smaller ones just after the Instantiate. Once these smaller tuples are passed through the query plan, they can be stitched back together inside the Tailsampler (Section 2.7). We will keep sequence identifiers to remember the order in which the smaller tuples need to be grouped together. This strategy will not affect the correctness of the result except when the operator is an aggregate or a self-join. We can safely assume that there are no aggregates before Tailsampler, as MCDB-R does not accept such queries. In case of a self-join, two tuples from the same seed identifier should be joined only if their sequence identifiers match. This constraint ensures that the join does not produce spurious tuples.

5.3 Filtering

Filtering out the tuples of unnecessary seed identifiers is very effective in reducing the resources required for finding an extreme sample. Filtering will reduce the CPU time, since we avoid producing samples for the filtered out tuples. Avoiding the movement of a large number of tuples through the query plan will also reduce the memory and disk usage. This process is illustrated for a simple query plan in Figures 5-1 and 5-2. Figure 5-1 shows the query plan for the normal execution. This query has a single VG function serialized at the right leaf node, the Materialized Instantiate. After Instantiate, the tuples are Split before being sent to the join. Figure 5-2 provides the new query plan with a selection above the Instantiate. This selection will allow only tuples that have the required seed identifier to go to the next operator. The necessary seed identifier is identified at the Tailsampler and then is sent to the Instantiate for the selection process.

Note that the filtering can only be applied at the particular Instantiate that generated the seed identifier corresponding to the extreme sample. If multiple stochastic attributes


exist in the query plan, we have multiple Instantiates, one for each stochastic attribute. None of the Instantiates other than the one that produced the extreme sample can filter out any of their tuples. Fortunately, for any such stochastic attributes we only need the active samples, that is, only numdb samples for each tuple. Moreover, only tuples that join with the seed identifier of the extreme sample remain by the time tuples reach the Tailsampler. This is because if the other stochastic attributes are in the same relation as the extreme sample, then the filtering on the seed identifier of the extreme sample will also

Figure 5-1. Query plan
Figure 5-2. Modified plan with filtering


eliminate many other seed identifiers from the other stochastic attribute. The second possibility is that the other stochastic attribute exists in a different relation than the one with the extreme sample. In such cases the relation with the extreme sample and the other stochastic relation are joined. Since the relation with the extreme sample will have very few tuples due to filtering, the join result itself will be smaller accordingly. Hence, the existence of other stochastic attributes will not degrade the performance significantly.

Self-joins need to be handled separately, as the filtering cannot happen until after the join in the query plan. If a query plan has a self-join on the stochastic attribute that contains the extreme sample, we cannot filter out the tuples until after the self-join. It can also happen that the tuples which have the extreme sample have multiple seed identifiers from the same relation even without such a self-join. This can happen due to a self-join on attributes other than the stochastic attribute. If these seed identifiers belong to the same relation as the one that generated the extreme sample, then we need to follow the same process: filtering is delayed until after the self-join.

5.4 Fragmenting a Large Tuple Bundle

Figure 5-3. Splitting a large tuple bundle

Since a VG function can generate multiple tuples as part of the same sample, we maintain a VG identifier vg-id to keep track of that order. We need a third identifier to differentiate between multiple tuples with the same seed if they occur before Instantiate.


This scenario can happen when there is a join before the Instantiate. We call this identifier a tuple identifier t-id. Figure 5-3 shows an example of fragmenting a very large tuple bundle into smaller pieces. In this example we keep the number of samples in the smaller tuple bundles at 10,000. Instead of a single tuple with 10 million samples, we will have 1,000 tuples with 10,000 samples each. The sequence identifier seq-id runs from 1 to 1,000. The t-id is the same for all the smaller tuples, as we only have one original tuple before Instantiate. The vg-id is also the same here, since a single input tuple to Instantiate produces a single output tuple. Even a million samples might not be enough for some extreme values. Therefore, we repeat the process with ten times more samples than the previous run. If a satisfactory sample is still not found, we will increase the sample array size again by a factor of ten. We repeat this until we find that replacement.

5.5 Implementation Details

Figure 5-4. Query plan with Sample router: normal mode

Incorporating fragmentation in the existing query plan is nontrivial. Once a replacement for the extreme sample is found, any changes made to the query plan should be reverted. Filtering is fairly easy to add to the existing query plan, since it only requires a select operator on the seed identifiers. The selection predicate is sent by the Tailsampler to the Instantiate.
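The fragmentation of Section 5.4 and the later stitching inside the Tailsampler can be sketched as follows. The `Fragment` record and both function names are hypothetical and chosen for illustration only; the thesis specifies the three identifiers (t-id, vg-id, seq-id) but not a concrete data structure.

```python
from typing import Dict, List, NamedTuple, Sequence, Tuple


class Fragment(NamedTuple):
    t_id: int        # identifies the original tuple before Instantiate
    vg_id: int       # position of the tuple within one VG function output
    seq_id: int      # order of this piece within the original sample array
    samples: Sequence[float]


def fragment(t_id: int, vg_id: int, samples: Sequence[float],
             chunk_size: int) -> List[Fragment]:
    """Break one large sample array into pieces of at most chunk_size samples."""
    return [Fragment(t_id, vg_id, seq, samples[i:i + chunk_size])
            for seq, i in enumerate(range(0, len(samples), chunk_size), start=1)]


def stitch(fragments: Sequence[Fragment]) -> Dict[Tuple[int, int], List[float]]:
    """Reassemble the sample arrays, whatever order the pieces arrive in."""
    arrays: Dict[Tuple[int, int], List[float]] = {}
    for f in sorted(fragments, key=lambda f: (f.t_id, f.vg_id, f.seq_id)):
        arrays.setdefault((f.t_id, f.vg_id), []).extend(f.samples)
    return arrays
```

For instance, `fragment(1, 1, list(range(25)), 10)` yields three pieces with seq-id 1, 2, and 3, and `stitch` recovers the original 25 samples even if the pieces arrive in reverse order. The retry policy described above would simply call the sample generation again with ten times as many samples whenever no replacement is found.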


Figure 5-5. Query plan with Sample router: active mode

When a replacement for the extreme sample is found after executing the new query plan, the Instantiate is again informed of the finding and the selection predicate is dropped. Unlike filtering, fragmentation requires some major changes in the query plan. In our implementation, we create a new flow for the tuples in the Tailsampler. Instead of sending the tuples to the Gibbs queue, we pass them to another data structure, called a Sample router. This data structure sorts the incoming tuples according to their seq-ids and passes the Gibbs sampler only the tuples with the same seq-id. If necessary, the Gibbs sampler will request the next set of tuples, with the next seq-id. In normal execution mode the Sample router just acts as a relay and does nothing. See Figure 5-4 for an example query plan with a Sample router. The compiler, when it first sees a stochastic join in the query plan, will modify the plan to include the Sample router. This addition is done even before any execution is started. If an extreme sample does not occur, then the Sample router will always be in the normal execution mode. The Tailsampler will notify the Sample router whenever the execution mode needs a change. The example query plan in active mode is shown in Figure 5-5.
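A minimal sketch of the two modes of the Sample router is given below. The class and method names are hypothetical, and tuples are modeled as plain dictionaries with a `seq_id` key; the actual operator in the prototype is not described at this level of detail.

```python
from collections import defaultdict


class SampleRouter:
    """Relays tuples in normal mode; groups them by seq_id in active mode."""

    def __init__(self):
        self.active = False
        self._relay = []                    # normal mode: arrival order
        self._by_seq = defaultdict(list)    # active mode: seq_id -> tuples

    def set_active(self, active):
        # In the text, the Tailsampler notifies the router of mode changes.
        self.active = active

    def push(self, tup):
        if self.active:
            self._by_seq[tup["seq_id"]].append(tup)
        else:
            self._relay.append(tup)

    def next_batch(self):
        """Return the next set of tuples for the Gibbs sampler."""
        if not self.active:
            batch, self._relay = self._relay, []
            return batch
        if not self._by_seq:
            return []
        lowest = min(self._by_seq)          # release one seq_id at a time
        return self._by_seq.pop(lowest)
```

In active mode, each call to `next_batch` corresponds to the Gibbs sampler requesting the tuples of the next seq-id, so a replacement found in an early fragment means later fragments never have to be processed.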


Figure 5-6. A TSSeed and assigned sample values
Figure 5-7. TSSeed and assigned samples after a Gibbs sample iteration

5.5.1 TSSeed and Sample Generation

After a replacement for the extreme sample is found, the query is restarted in the normal mode. All iteration numbers for seed identifiers are stored in TSSeed at the end of each run. During the new query run, nummc samples are generated at Instantiate. The starting iteration number for this sample generation is the highest iteration number assigned to a DB instance. That starting iteration number will be the low iter. number in the next query run. Any unassigned iteration numbers between the low iter. number and max. used iter. are not used. This generally does not create a problem, because their number is usually small compared to nummc. Unfortunately, just after an extreme sample is found, this assumption does not hold true. More often than not, nummc is much smaller than the difference between low iter. number and max. used iter.
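To make the gap concrete, the sketch below contrasts a contiguous range of iteration numbers starting at low iter. number with one possible remedy: reuse the iteration numbers already assigned to DB instances and take the remaining ones after max. used iter. The function names and the example values are illustrative assumptions, as is the choice to start the fresh numbers at max. used iter. + 1.

```python
def contiguous_iterations(low_iter, nummc):
    """Iteration numbers low_iter, low_iter + 1, ..., low_iter + nummc - 1."""
    return list(range(low_iter, low_iter + nummc))


def assigned_then_fresh(assigned, max_used_iter, nummc):
    """Keep the assigned iteration numbers, then continue after max_used_iter.

    `assigned` holds the iteration number currently used by each DB instance,
    so its length plays the role of numdb.
    """
    numdb = len(assigned)
    if nummc < numdb:
        raise ValueError("nummc must be at least numdb")
    first_fresh = max_used_iter + 1   # assumption: max_used_iter itself is taken
    fresh = range(first_fresh, first_fresh + (nummc - numdb))
    return list(assigned) + list(fresh)


# Illustrative values: one DB instance holds a replacement found very late.
assigned = [18, 23755, 19, 21]
contiguous = contiguous_iterations(18, 64)
remedied = assigned_then_fresh(assigned, 23755, 64)
```

With these values the contiguous range covers iteration numbers 18 to 81 and therefore never regenerates the sample at iteration 23755, while the second scheme still produces exactly 64 iteration numbers and includes every assigned one.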


Figures 5-6 and 5-7 show the TSSeed (Section 2.6.1) object of a seed identifier with an extreme sample. The value at DB instance 2 in Figure 5-6 is the extreme sample, and its replacement is found only after processing more than 23,000 samples. In this example, nummc for the run after finding the extreme sample will be 64. The low iter. number is 18, max. used iter. is 23755, and the difference is 23737. One solution to this problem is to assign nummc a value at least double the difference. This solution will only work as long as the difference between low iter. number and max. used iter. is no larger than what a single page can hold. A better solution is to store the assigned samples in a persistent data structure, and generate only samples after max. used iter. We can restore the assigned samples for that seed identifier from the data structure. This approach might get expensive if there are a large number of extreme samples in a single Gibbs sampler loop. Then, the data structure needs to store the samples of all seed identifiers with an extreme sample.

A simpler approach is to change how the samples are generated at Instantiate using TSSeed. The TSSeed creates the PRNG seed array used in Instantiate from iteration numbers and SeedId. The range of iteration numbers used is from low iter. number to nummc + low iter. number. This PRNG seed array determines what sample values come out of the VG function. We change the iteration numbers used. The first numdb iteration numbers are taken as is from the assigned iteration numbers. The rest of the nummc − numdb iteration numbers start from max. used iter. Implementation of this method only requires changes to the GetSeeds function of the TSSeed object. No other changes are required in the code.

5.5.2 How to Identify an Extreme Sample?

During execution, we need to identify a sample as extreme in order to invoke our specific method. One way to identify an extreme sample is to do some statistical analysis and find out the probability of occurrence of the replacement sample. For that we first have to find the minimum (or maximum) sample value we need for a replacement. Once we have that, using the parameters of the VG function we can, in


theory, calculate the probability of finding a sample greater (or smaller) than that value. For a generic aggregate function, finding that minimum required value itself is nontrivial. Instead, we can say that we need a sample with the same value as the one we are replacing. This method of identification might result in many false positives, since the actual value required might be easier to find. The special execution we use to find an extreme sample is very efficient compared to rerunning the whole query. But too many false positives in this method will eliminate that efficiency advantage.

Instead, we can use a method based on runtime statistics to identify an extreme sample. A sample can be designated as extreme if we cannot find a replacement by the end of the sample array. But this again will create a large number of false positives, since it is possible to reach the end of the sample array even for normal samples after multiple Gibbs iterations. To avoid that drawback we can say a sample is extreme when, even after a second rerun, we do not find the replacement. The cost is that we spend one entire rerun just looking for that replacement.

The other method is to use the number of samples consumed to find the replacement when we reach the end of the sample array. If at least 10% of the samples in the array are consumed without finding a replacement for a sample, then we can designate it as an extreme sample. In general, a decently large number like 10,000 might also suffice. But this can create problems during the first few reruns, when the sample array size is still less than 10,000. A few reruns will be wasted until the number of samples seen to find the replacement exceeds 10,000. We think using a fraction of the sample array size, like 10%, is better than a fixed number like 10,000, since the fixed value might need to be recalculated for a different page size.

We experimentally evaluate two different ways to identify an extreme sample: (1) a sample is called extreme if a replacement is not found at the end of two query runs, and (2) a sample is called extreme if it consumed more than 10% of the current sample array.
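The two runtime rules just listed can be written down as simple predicates. The function and parameter names below are hypothetical, and the sketch assumes that the caller tracks, per sample being replaced, how many query runs and how many candidate samples have been used so far.

```python
def is_extreme_type1(runs_without_replacement):
    """Rule (1): no replacement found by the end of two query runs."""
    return runs_without_replacement >= 2


def is_extreme_type2(samples_consumed, sample_array_size, percent=10):
    """Rule (2): more than `percent` of the current sample array was consumed.

    Intended to be checked when the end of the sample array is reached.
    Integer arithmetic avoids floating point rounding at the threshold.
    """
    return samples_consumed * 100 > percent * sample_array_size
```

With a sample array of 8,000 entries, `is_extreme_type2(801, 8000)` is true while `is_extreme_type2(800, 8000)` is false. Because the threshold scales with the array size, the same rule applies unchanged when the page size, and hence the array size, is altered.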


5.6 Experiments

In this section we evaluate our solution. Our aim is to compare the performance of our solution against the current approach as implemented in the MCDB-R prototype. We also want to show that our solution works well on large data sizes and on fast growing aggregate functions. Different aggregate functions will result in different numbers of extreme samples, so it is interesting to observe the performance while varying the aggregate function. The choice of page size directly affects the query response time: a very small page size might result in too many reruns due to the small sample array size, whereas a very big page size might result in throwing away too many samples from the array (whenever an extreme sample is encountered and the query is rerun). The number of extreme samples is expected to increase as we move further into the tail. Showing this would underline the importance of our solution. The specific questions we want to answer here are:

1. How much is the performance improvement due to our solution?
2. How scalable is the new system?
3. How does the new system perform on different aggregate functions (linear vs. quadratic vs. cubic)?
4. Does varying the system page size affect the performance?
5. What is the rate of increase in the number of extreme samples as we move further into the tail?

Experimental Setup: We use the TPC-H benchmark data generator [17]. The workstation used for the experiments has 32 GB of RAM, four 720 GB Dell hard disks, and 8 processing cores of 3.2 GHz each, distributed over 2 chips. The system runs a 64-bit Ubuntu operating system, version 11.04. We implemented our solution in the prototype MCDB-R system. We use two queries for benchmarking. The query Q3 below is based on query Q10 from [38]. See the Appendix for the complete SQL statements for both of the queries used. All queries are based on a modified TPC-H schema.


Query Q3. In this query, the total revenue generated by all customers in Japan is calculated. The schema is modified such that the customer information in the orders relation is not fully correct, due to errors in the data integration process. A new relation is created to incorporate this error in the database. This relation has the actual customer key and a list of possible customer keys as a stochastic attribute. Each of the possible customer keys has a probability value. The number of possible customer keys is limited to 10, one of them being the correct key. The probability values are calculated according to the formula 1/2^i, where i varies from 1 to 10. The resulting distribution here has a finite domain. A discrete choice VG function is used to generate samples from the possible keys and their probabilities in this new parameter relation. In [38] the highest probability of 0.5 is always given to the correct key. When run with these values we will never encounter a situation where a sample replacement requires a huge number of samples. Here we modified the table such that in 50% of the tuples the correct key is given the lowest probability of 1/2^10. This modification is done in order to create a situation where two samples with the correct customer key value can occur very far away from each other in the sample stream. Such a situation can result in a sample which is difficult to replace, in other words, an extreme sample.

Query Q4. Here we calculate the penalty due to shipping delays to the customers. We have a shiptime relation which has, for each customer, the number of days within which the item should be delivered. The relation avg_shiptime has the average shipping time for a given item, and this is the parameter for the Poisson(λ) VG function. The relation penalty has the penalty added according to the number of days of delay in shipping. For these experiments, the penalty rate we use for the delay increases with the number of days of delay: penalty(i) = i × j, where i is the number of days of delay, and j is the penalty rate per day, with j = ⌊i/5⌋. In this function the rate of penalty increases at 5 day intervals.
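The two formulas used by Q3 and Q4 can be evaluated directly. The sketch below reads the penalty formula as penalty(i) = i × j with j = floor(i/5); that reading, like the function names, is an assumption made here for illustration.

```python
from fractions import Fraction


def key_probabilities(n_keys=10):
    """Probability 1/2^i for the i-th possible customer key, i = 1..n_keys."""
    return [Fraction(1, 2 ** i) for i in range(1, n_keys + 1)]


def penalty(days_late):
    """Penalty for a delay of `days_late` days, with a rate that steps up every 5 days."""
    rate_per_day = days_late // 5
    return days_late * rate_per_day


probs = key_probabilities()
lowest = probs[-1]                      # 1/1024, the probability of the rarest key
expected_draws = 1 / lowest             # mean number of draws until that key appears
penalties = {d: penalty(d) for d in (4, 5, 9, 10, 15)}
```

Here `lowest` equals 1/1024, so a key with that probability shows up on average once every 1,024 draws, which is what lets two occurrences of the correct key lie far apart in the sample stream. The values in `penalties` are 0, 5, 9, 20, and 45, showing the jump in the penalty each time the delay crosses a multiple of 5 days.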


Given this setup, there are six different variables we need to know for each test to understand the results:

1. query: The query used in the test.
2. numSeedIds: The number of seed identifiers processed during the query execution.
3. PageSize: Size of the system page. It limits the size of the sample array. This value is set to 128 KB for our experiments, unless specified otherwise.
4. numdb: Number of DB instances for the query. This is set to 100 for all tests in this section.
5. gibbsIters: Number of iterations at the Gibbs looper. This value is 5, unless specified otherwise.
6. eliteDBPercent: Percentage of DB instances retained at the end of each Gibbs iteration. The deleted DB instances are cloned from the retained elite instances. We have this value at 50% for all the tests.

We run the following five tests. In the first test we compare the original MCDB-R (with serialization) with MCDB-R with our solution. We test the new system on a 20 GB dataset in Test 2, and on different aggregate functions in Test 3. In Test 4, we vary the system page size. We calculate the number of extreme samples with changing Gibbs iterations in Test 5. We list how the variables are set for each of the tests:

1. Performance improvement due to our solution: Both queries Q3 and Q4 are used here. The value of numSeedIds is 1000, PageSize is 128 KB, numdb is 100, gibbsIters is 10, and eliteDBPercent is 50%.
2. Scalability with increasing numSeedIds: We used query Q4. The value of numSeedIds is varied from 8000 to 40,000, and gibbsIters is set to 5. The rest of the variables are the same as in Test 1.
3. Performance with different aggregate functions: We used query Q4. The value of numSeedIds is 20,000, and gibbsIters is set to 5. The rest of the variables are the same as in Test 1.
4. Effect of PageSize on performance: We used query Q4. The value of numSeedIds is 10,000, and gibbsIters is set to 5. The value of PageSize is varied from 32 KB to 512 KB. The rest of the variables are the same as in Test 1.


5. Number of extreme samples with increasing gibbsIters: We used query Q4. The value of numSeedIds is 8000, and gibbsIters is varied from 2 to 10. The rest of the variables are the same as in Test 1.

Results: The results for these five tests are shown in Tables 5-3 to 5-11. We ran each test multiple times and listed the average values. Results of Test 1 are given in Tables 5-3 and 5-4. Table 5-3 shows the performance of the original MCDB-R and of our solution with two definitions (Section 5.5.2) of extreme samples. Type-1 is when a replacement is not found at the end of two query runs. Type-2 is when a sample consumed more than 10% of the current sample array. For the same test, the number of samples identified as extreme is shown in Table 5-4. Test 2 results are given in Tables 5-5, 5-6, and 5-7. Table 5-5 compares the performance of the original MCDB-R with the Type-2 variant of our solution. We do not use Type-1 in this and later tests, since we found that Type-2 has better performance. The number of total samples generated in this test is given in Table 5-6. Table 5-7 gives the number of query reruns required for each of the query executions listed in Table 5-5. The values give both the number of full query reruns and the number of reruns with filtering (i.e., the number of extreme samples). In each cell, the number of reruns used for replacing an extreme sample is given in brackets (). For example, for the input size with 8000 random tuples we have 12 Type-2 extreme samples and 10 query reruns. Table 5-8 shows the results for Test 3. The running times for the query executed with different aggregate functions, and the number of extreme samples identified for each execution, are listed. Results of Test 4 are shown in Tables 5-9 and 5-10. Table 5-9 shows the query response time with changing PageSize (and in turn the maximum sample array size) in the system. Table 5-10 lists the number of total runs required to finish the query and the number of extreme samples in each run. Table 5-11 shows the results from Test 5.

Discussion: Test 1: As explained above, the query Q3 has a distribution with a finite domain on its stochastic attribute. This limits the chances of occurrence of an


Table 5-3. Wall clock running times (in seconds)

Query    Original    Type-1    Type-2
Q3       43          43        41
Q4       219         135       75

Table 5-4. Number of extreme samples identified

Query    Type-1    Type-2
Q3       0         1
Q4       7         13

Table 5-5. Performance with increasing input size (in minutes)

Size        8000    20,000    40,000
Original    15      102       246
Type-2      7.3     28        110

Table 5-6. Samples generated (in billions)

Size        8000    20,000    40,000
Original    1.5     14.9      21.7
Type-2      0.63    3.2       10.7

Table 5-7. Query reruns and extreme samples

Size        8000       20,000     40,000
Original    25 (0)     94 (0)     69 (0)
Type-2      10 (12)    21 (24)    34 (47)

Table 5-8. Performance with change in aggregate attribute

Aggregate attribute     penalty    penalty*5    penalty^2    penalty^3
Time (minutes)          20.3       26.1         32.8         89.1
Num. Extreme Samples    15         20           24           64

Table 5-9. Performance with changing max. samples (in minutes)

Max Samples     2000     4000     8000      16,000    32,000
Min PageSize    32 KB    64 KB    128 KB    256 KB    512 KB
Time            32.2     26.3     23.1      31.4      27.6

Table 5-10. Execution parameters with changing max. samples

Max Samples        2000    4000    8000    16,000    32,000
Num. Runs          30      25      31      32        32
Extreme Samples    24      20      16      22        20


Table 5-11. Number of extreme samples with increasing Gibbs iterations

Iterations    2    4     6     8     10
Type-2        3    14    24    45    72

extreme value. Table 5-4 shows that for query Q3 we have only 1 extreme sample in Type-2, and none in Type-1. Accordingly, there is almost no improvement in its execution time in the new system over the old system. Unlike query Q3, query Q4 processes samples from a Poisson distribution, which has a significant tail. The number of extreme samples for Q4 is 7 and 13 for Type-1 and Type-2, respectively. We can see the execution time improvements for this query with both types. Type-2 identification clearly outperforms Type-1 since we do not waste an extra query run to identify an extreme sample. In Type-2, we may identify some samples as extreme even if they are not that extreme. This does not affect performance much since our special execution module to replace extreme samples is quite fast. For query Q4 the execution time difference between our special module and a full query run is approximately two orders of magnitude (1:100 ratio). The execution time improvements in Table 5-3 are proportional to the number of extreme samples identified, as given in Table 5-4. Therefore, we can infer that Type-2 performs better than Type-1 because it identifies extreme samples sooner and avoids the extra query run that Type-1 needs to identify a sample as extreme. From now on, all the experiments use only Type-2 identification of extreme samples.

Test 2: The execution times in Table 5-5 show that the new system performs consistently better than the old one over varying input sizes. As expected, the numbers of samples generated shown in Table 5-6 are proportional to the corresponding execution times shown in Table 5-5. A stronger correlation can be seen between the number of query runs in Table 5-7 and the execution time in Table 5-5. This is not surprising since our module to process extreme samples takes very little time compared to a full query run.
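As a worked check on the reported running times, the following minimal sketch computes the speedup of each variant over the original system. The values are transcribed from Tables 5-3 and 5-5; the dictionary layout and the function names are illustrative assumptions and are not part of MCDB-R.

```python
# Running times transcribed from Table 5-3 (seconds) and Table 5-5 (minutes).
TABLE_5_3 = {
    "Q3": {"Original": 43, "Type-1": 43, "Type-2": 41},
    "Q4": {"Original": 219, "Type-1": 135, "Type-2": 75},
}
TABLE_5_5 = {
    8000: {"Original": 15, "Type-2": 7.3},
    20000: {"Original": 102, "Type-2": 28},
    40000: {"Original": 246, "Type-2": 110},
}


def speedup(original, improved):
    """Ratio of the original running time to the improved running time."""
    return original / improved


def speedups(table, variant):
    """Speedup of `variant` over the original system for every row of `table`."""
    return {
        key: round(speedup(row["Original"], row[variant]), 2)
        for key, row in table.items()
    }


type1_vs_original = speedups(TABLE_5_3, "Type-1")
type2_vs_original = speedups(TABLE_5_3, "Type-2")
type2_by_input_size = speedups(TABLE_5_5, "Type-2")
```

Under this reading of the tables, Type-2 is roughly 2 to 3.6 times faster than the original system on Q4, while Q3 is essentially unchanged.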


Test 3: From Table 5-8 we see that, as expected, the number of extreme samples is high for fast growing functions. The performance of the system is decent even for the cubic function.

Test 4: The execution times show that the performance is best with the maximum sample array size at 8000.

Test 5: For the first two Gibbs iterations the number of extreme samples is 3, and for the first four iterations it is 14. This value 14 is inclusive of the first two iterations as well, i.e. the number of extreme samples in the third and fourth iterations is 14 - 3 = 11. As expected, the number of extreme samples increases as we move towards the tail.

Conclusion: We conclude that the Type-2 variant of our solution is the best. The performance improvement is significant in queries with long tailed VG functions, and the solution works well on large data sets and fast growing aggregate functions. As we can see from Test 5, the number of extreme samples increases as we move further towards the tail. Therefore, adding our special module to the existing system is essential.
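The cumulative counts of Test 5 can be turned into per-interval counts in the same way as the 14 - 3 = 11 example above. The sketch below does this for every pair of consecutive columns of Table 5-11; the variable and function names are illustrative assumptions.

```python
# Cumulative Type-2 extreme-sample counts from Table 5-11,
# reported after 2, 4, 6, 8 and 10 Gibbs iterations.
ITERATIONS = [2, 4, 6, 8, 10]
CUMULATIVE_EXTREME = [3, 14, 24, 45, 72]


def per_interval(cumulative):
    """Convert cumulative counts into the counts added in each interval."""
    increments = []
    previous = 0
    for total in cumulative:
        increments.append(total - previous)
        previous = total
    return increments


INCREMENTS = per_interval(CUMULATIVE_EXTREME)
for last_iteration, added in zip(ITERATIONS, INCREMENTS):
    print(f"iterations {last_iteration - 1}-{last_iteration}: {added} extreme samples")
```

The last two intervals contribute 21 and 27 extreme samples, compared with 3 in the first interval, which is the growth towards the tail that the discussion refers to.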


CHAPTER 6
ANTI-JOIN

6.1 Motivation

The anti-join operator is part of standard SQL. It is required to execute the SQL clauses NOT IN and NOT EXISTS in sub-queries. An anti-join of a left relation L and a right relation R (L ▷ R) returns the tuples in L which do not match any tuple in R on the predicate. Consider the tables Orders(ORDID, CUSTID, AMOUNT) and Customers(CUSTID, CUSTNAME) in Figure 6-1. Suppose we execute the anti-join query Orders ▷ Customers, with the predicate on Orders.CUSTID and Customers.CUSTID. The final output of Orders ▷ Customers is given in the table Output. The first tuple in Orders, with CUSTID value 21, appears in Output since there is no tuple in Customers with CUSTID 21. The second tuple in Orders, with CUSTID value 23, does not appear in Output since it matches the second tuple in Customers. This process of finding a match is repeated for each tuple in Orders.

Figure 6-1. Example: Anti-Join in a relational database

In a relational database an anti-join operation is straightforward to implement. The anti-join L ▷ R can be executed similarly to a sort-merge join with some changes in the merge step. Sort both relations on the attributes in the predicate. During the merge, if there is no match for a tuple in L with any tuple in R, then output the tuple from L.

In MCDB-R, executing an anti-join on stochastic attributes is not as simple. An execution similar to a normal join, as explained in Section 2.6.2, is not enough. In the join, after a


Split operation on the stochastic attributes and a sort operation, a merge is performed on the sorted left and right relations. The join predicate is evaluated during the Gibbs sampling at the Tailsampler (Section 2.7) operator. Evaluating a join predicate at the Tailsampler requires only one pair of matching tuples, one each from the right and left relations. An anti-join predicate, on the other hand, requires a tuple from the left relation and all of its matches from the right relation.

A stochastic attribute in the left relation of an anti-join does not cause any difficulties because, even when the sample is changed, the information in the tuple is enough to evaluate the anti-join predicate. This holds only if the right relation does not have a stochastic attribute appearing in the predicate. If a stochastic attribute appears in the right relation, the execution of the anti-join is not as simple. When the current sample v is replaced in a right tuple, the anti-join predicate needs to know whether the new sample is the first of its kind or whether there are existing tuples with the same value. If this is the first time that value occurs, then the tuples from the left relation with that value must be removed from the output. Conversely, since the old value no longer exists in this tuple, if no other tuple in the right relation carries the old value, then the tuples from the left relation containing it need to appear in the output (because there are now no matching tuples on the right). Both of these steps require information that is not within the current tuple, and obtaining such information is not efficient in the current framework.

Bringing together all matching tuples from the right relation for a single left tuple requires a sort or a hash. Both are extremely expensive, and performing them once per left tuple is infeasible. Remember that all the predicates are evaluated after the Gibbs sampler, and the Gibbs queue (Section 2.7.1), which provides tuples to the Gibbs sampler, does so for only one seed identifier (Section 2.4) at a time. Therefore, we need a change in the current framework for the execution of an anti-join predicate. Sorting all tuples in the Gibbs queue is an option. With a sort on the value we can group the required tuples (with different seed identifiers) together. Though sorting solves the problem, it is not efficient since it


needs to be done once for each tuple. A single sort is itself very expensive, so performing a sort for each tuple would be extremely time consuming.

Figure 6-2. Example: Stochastic table

Consider the table Customers in Figure 6-1. Assume that the CUSTID attribute is stochastic. The new table with the stochastic array attribute is shown in Figure 6-2. Figure 6-3 shows the stochastic table after the Split operation, along with the left table in the anti-join. The tuple with CUSTID 21 in Orders matches two tuples in Stoc.Customers, tuples 1 and 4, with seeds S1 and S2 respectively. The actual anti-join predicate (like all other predicates) is computed after the Gibbs sampler step. So in order to know whether the tuple with CUSTID 21 in Orders appears in the output, we need both of these tuples from Stoc.Customers and need to know which of them are active for a given DB instance. But the Gibbs queue only provides groups of tuples with the same seed together. So in the current system it is impossible to execute an anti-join predicate. We need some nontrivial changes in the system for this to work.

In the next section we discuss two new attribute formats and how they facilitate the anti-join in MCDB-R. After that we explain the modifications required in the system so that the anti-join works in the presence of the extreme sample problem. In Section 6.4 we briefly discuss how to extend both formats if more than one anti-join is present in a query. Finally, in Section 6.5 we evaluate both formats for anti-join execution on various types of data sets.
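To make the requirement concrete, the following is a minimal sketch (not MCDB-R code) of the sort-merge style anti-join described at the start of this section for deterministic relations, followed by a naive per-DB-instance evaluation for a stochastic right attribute. The function names, the tuple layout, and the sample data are hypothetical. The per-instance loop simply re-runs the deterministic anti-join on the values that are active in each DB instance, which is exactly the access to all right tuples at once that the Gibbs queue does not provide.

```python
def anti_join_sort_merge(left, right_keys, left_key):
    """Return the tuples of `left` whose key equals no value in `right_keys`.

    Both inputs are sorted on the join key and merged with one cursor per side.
    """
    left_sorted = sorted(left, key=left_key)
    right_sorted = sorted(right_keys)
    output = []
    j = 0
    for tup in left_sorted:
        key = left_key(tup)
        # Skip right keys that are smaller than the current left key.
        while j < len(right_sorted) and right_sorted[j] < key:
            j += 1
        # No right key equals the left key, so the left tuple survives.
        if j == len(right_sorted) or right_sorted[j] != key:
            output.append(tup)
    return output


def anti_join_per_instance(left, stochastic_right, left_key, num_db):
    """Evaluate the anti-join once per DB instance.

    `stochastic_right[i][d]` is the value of right tuple i that is active
    in DB instance d (a hypothetical layout chosen for this sketch).
    """
    results = []
    for d in range(num_db):
        active = [values[d] for values in stochastic_right]
        results.append(anti_join_sort_merge(left, active, left_key))
    return results


# Hypothetical data: Orders(ORDID, CUSTID, AMOUNT) and customer identifiers.
orders = [(1, 21, 100.0), (2, 23, 50.0), (3, 25, 75.0)]
customer_ids = [22, 23]
deterministic = anti_join_sort_merge(orders, customer_ids, lambda o: o[1])

# Two right tuples and two DB instances: tuple 0 is 21 in DB 0 and 23 in DB 1.
stochastic_ids = [[21, 23], [22, 22]]
per_instance = anti_join_per_instance(orders, stochastic_ids, lambda o: o[1], 2)
```

In this toy data the order with CUSTID 21 is excluded in DB instance 0 and included in DB instance 1, so the output of the anti-join changes whenever the active sample of a right tuple changes.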


Figure 6-3. Example: After Split

6.2 Composite Attribute Formats

As discussed previously, to execute an anti-join predicate we need all matching tuples from the right relation. In MCDB-R we apply the predicate only after the Gibbs sampling step, hence all the matching tuples from the right relation must be collected to run the anti-join. This is complicated in the current framework because at the Gibbs sampler we only have tuples containing a particular seed identifier. The current interface of the Gibbs queue does not allow retrieving tuples that are not related through a seed identifier. Adding a method to perform such an operation in the Gibbs queue would be very expensive, since it would require a sort or a hash every time we want to apply an anti-join predicate. In this section we discuss two new attribute formats that get around this problem. We also explain how these new attribute types facilitate the anti-join operation.

6.2.1 Long Format

In this format we store all the necessary information from the right relation in each tuple from the left relation. During the Anti-join operation we add the seed and bitstring


attributes from all matching tuples in R to the tuple in L. The seed is used by the Gibbs queue to pull the tuple up when the corresponding seed identifier is being processed. Figure 6-4 shows the structure of this composite attribute. The new attribute type is called AJOBJECT (AJ for Anti-Join). It first keeps the number of seed and bitstring attribute pairs it contains (the same as the number of matches in R), and then the list of seeds followed by the list of bitstrings.

Figure 6-4. Composite Attribute: Long format

The BuildSeeds() method, which constructs the sorted array of seed identifiers for each tuple, needs to be modified to incorporate the new attribute. This sorted array is later used by the Gibbs queue to retrieve the tuples in the order of their seed identifiers. Figure 6-5 shows the new Gibbs tuples for the relations from Figure 6-2 after the anti-join operation. The new composite attribute is located at the end of the output tuples. The tuple with CUSTID 21 from Orders matches two tuples (the first and fourth) in the table Stoc.Customers. Therefore, the AJOBJECT in the first tuple of the output table Orders_AJ in Figure 6-5 stores two seed attributes and two bitstrings. Similarly, the fifth tuple in Orders does not match any tuple in Stoc.Customers, therefore its AJOBJECT does not have any seed or bitstring attributes.

Processing of these tuples at the Tailsampler is explained below. For simplicity we assume that there are no operators in between the Anti-join and the Tailsampler. Also assume that we have only one DB instance, and that initially all the DB instances are pointing to the first sample in the stream. At first, the Gibbs queue returns the fifth tuple, and since it does not have any seed attributes it is sent to the aggregate without any extra processing. In the next step, the tuples with seed S1 are returned, i.e. the first two tuples. The Tailsampler will try to replace the current sample for DB instance 1 (DB1). Let us say that Advance() moves the DB1 sample from iteration number 1 to iteration number 2.


Figure 6-5. Long Format: Tuples after Anti-join

The bit changed from 1 to 0, therefore DB1 of S1 now has the value 23 instead of 21. Now the validity of tuple 1 is checked through the anti-join predicate. Since there is still a valid 21 in the composite attribute due to DB1 of S2, the status of the anti-join predicate on this tuple does not change. So the first tuple does not contribute to the aggregate. For the second tuple both bitstrings have a 0 at the first iteration. The anti-join predicate returns true since there are no matches. Hence, initially the second tuple appears in the output and contributes to the aggregate. After the Advance(), DB1 of S1 is assigned 23. The bitstring of S1 is 1 at the second iteration. Therefore, the anti-join predicate returns false, due to the existence of a match in the right relation. So, after an Advance() on S1, the aggregate value changes. More samples are processed if the new aggregate does not satisfy the constraint on its value, and the rejection algorithm proceeds further.

Possibilities of overflow: As seen in the above example, using the format is fairly simple. We need to create a new data type and make minor modifications to the BuildSeeds() method and the Tailsampler operator. Though this method works well in the above example, it makes an inherent assumption that the new tuple, storing numerous seed and bitstring attributes, will fit in a single page. The assumption holds if, for a left tuple, there are few matching tuples in the right relation. The tuple can easily exceed the page size if there are too many matching tuples in the right relation. Both the seed and


bitstring attributes are big. Bitstrings grow arbitrarily large based on the size of the sample array. A sample array of size 10,000 results in a bitstring of 1250 bytes. With a page size of 1 MB, we can only hold close to 800 such bitstrings. The size of the seed attribute depends on the number of DB instances, numdb. If we have 100 DB instances, the seed is approximately 800 bytes. For a 1 MB page size, we can hold close to 500 of these seed and bitstring attribute pairs. The attributes on which anti-joins are performed are categorical. In general the domain size of categorical attributes is small. Examples of such attributes are Location (Country, Zip code, etc.), Department, and Component. Small domain sizes and large relation sizes generally result in a large number of matches. So the long format is not always appropriate.

6.2.2 Short Format

Figure 6-6. Composite Attribute: Short format

As explained above, creating huge tuples which are bigger than the system page size should be avoided whenever possible. Here we discuss the short format for AJOBJECT, which reduces the attribute size. As explained before, the contributing factors for large tuple sizes are both the seed and the bitstring attributes. So the new format aims to avoid storing multiple seed and bitstring attributes in a single tuple. Instead, it stores only the seed identifiers of the matching tuples from the right relation. We also store the parts of the bitstrings which are currently assigned to the DB instances. The size of these bitstring parts is the number of DB instances multiplied by the number of seed identifiers stored in the AJOBJECT, and is very small. The right tuples themselves are put in the Gibbs queue, so that their seed and bitstring attributes can be accessed whenever necessary. The maximum number of tuples inserted in the Gibbs queue with this solution is the sum of the sizes of both relations in the anti-join.
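The page-capacity estimates given for the long format in Section 6.2.1 can be reproduced with a few lines of arithmetic. The sketch below assumes one bit per sample in a bitstring, a 1 MB page of 2^20 bytes, and the approximate 800-byte seed quoted in the text for 100 DB instances; the function names are illustrative, and the short-format helper additionally assumes one bit per DB instance per stored seed identifier.

```python
PAGE_BYTES = 1024 * 1024  # assumption: a 1 MB page is 2**20 bytes
SEED_BYTES = 800          # approximate seed size for 100 DB instances (from the text)


def bitstring_bytes(sample_array_size):
    """Bytes needed for one bit per sample, rounded up to whole bytes."""
    return (sample_array_size + 7) // 8


def max_items_per_page(item_bytes, page_bytes=PAGE_BYTES):
    """How many items of `item_bytes` fit in a single page."""
    return page_bytes // item_bytes


def short_format_bits(num_db, num_seed_ids):
    """Bits for the active bitstring parts kept in a short-format AJOBJECT."""
    return num_db * num_seed_ids


bitstring = bitstring_bytes(10_000)
bitstrings_per_page = max_items_per_page(bitstring)
pairs_per_page = max_items_per_page(bitstring + SEED_BYTES)
short_bits = short_format_bits(100, 2)
```

With these assumptions a page holds 838 bitstrings or 511 seed and bitstring pairs, which matches the figures of close to 800 and close to 500 quoted above, whereas the active bitstring parts for two seed identifiers and 100 DB instances take only 200 bits.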


Figure 6-7. Short Format: Tuples after Anti-join

The resulting new AJOBJECT can be seen in Figure 6-6. The size of this attribute is 33 bytes. In comparison, the size of the long format shown in Figure 6-5 for the same attribute is close to 6,000 bytes (assuming the number of DB instances numdb is 100 and the size of the sample array nummc is 1000). For the short format in this example, we also insert three additional tuples in the Gibbs queue. These three tuples have the seeds and bitstrings. In Figure 6-7, the first five tuples are from the left relation. These five tuples have their AJOBJECT in the short format. In tuple 1, the AJOBJECT has 2 seed identifiers, 1 and 2. The next 8 tuples are from the right relation, and they are inserted into the Gibbs queue directly (more on this later).

The difficulty with the short format when compared to the long format is that it requires more changes in the MCDB-R system. As explained in Chapter 2, the execution flow assumes all tuples have the same schema. So sending the tuples from the right relation


additionally into the output will result in tuples with different schemas being sent to the next operator. To get around this issue we could create a super relation with attributes from both. This solution still requires each operator in MCDB-R to differentiate between these two kinds of tuples. Each operator needs modifications, and so this solution is not desirable. A cleaner solution is to send the tuples from the right relation R to the Gibbs queue directly. This method only requires changes to the anti-join operator. Though this method is fairly simple for a single anti-join, it requires more complicated schemes when there are cascading anti-joins in the query plan. A generic method that works for queries with cascading anti-joins is discussed later in Section 6.4. Since we have tuples from both left and right relations in the Gibbs queue, the number of tuples it holds is high. A big drawback of using the short format is the very high number of remove/reinsert operations it requires in the Gibbs queue. Each reinsert could potentially involve hundreds of tuples and can slow down the execution.

6.3 Discussion on Extreme Samples

A sample is called extreme if the probability of its occurrence is very low. Replacing an extreme sample during the rejection algorithm can be difficult. A solution to this problem is discussed in Chapter 5. However, that solution only works for queries without an anti-join operation. In this section we briefly discuss how to extend the solution in Chapter 5 to work for queries with an anti-join.

The problem of replacing an extreme sample is solved by running the query plan separately just for finding that sample. Only after the sample is found does the regular execution restart. The modified query run has two steps. The first step is filtering out unnecessary tuples from the execution in order to improve the efficiency of the run. The filtering process is different when there is a stochastic attribute on the right side of the anti-join. There are multiple seed identifiers from the same relation that produced the extreme sample, but not due to a self-join. From these multiple seed identifiers, we call the one that generated the extreme sample the primary seed identifier.


To recreate the composite attribute we need all seed identifiers present in the tuple. The new samples generated for the primary seed identifier can have a new value not seen before. These new values can match other new tuples. We cannot filter out any seed identifiers, but for all identifiers except the primary seed identifier we need to generate only the active samples. The unnecessary tuples can be filtered out after the Anti-join operator. The Anti-join operator must be modified to handle the fragmented tuples for the primary seed identifier. This again is similar to how a self-join is modified (Section 5.3).

6.4 For Queries with Multiple Anti-Joins

Some queries can have cascading anti-joins. Two anti-joins are called cascading if the output of one anti-join is given as the right input to another anti-join. The current composite attribute formats work only for queries with non-cascading anti-joins. The creation of the composite attribute should be recursive for queries with cascading anti-joins. A cascading anti-join at the second level, one whose right input is from another anti-join, stores the AJOBJECT in the right tuples instead of just the seed and bitstring attributes. The function (in the Tailsampler) that retrieves the current active DB instance from the composite attribute would also need changes, since it also needs to work recursively on the composite attribute. These ideas need more work and are not implemented in the system.

6.5 Experiments

In this section we benchmark both of the methods discussed above, Long format and Short format. Our goal is to compare their efficiency and scalability for different settings. The following are the questions we would like to answer:

1. How do both formats work when a large number of seed attributes need to be stored?
2. Which of them is better on large data sets?


Experimental Setup: We use the TPC-H benchmark data generator [17]. The workstation used for the experiments has 32 GB of RAM, four 720 GB Dell hard disks, and 8 processing cores of 3.2 GHz each distributed over 2 chips. The system runs a 64-bit Ubuntu operating system, version 11.04. We implemented our solution in the prototype MCDB-R system. We use the query Q5 explained below. See the Appendix for the complete SQL statements for Q5. Query Q5 is a modified version of query Q3 discussed before in Section 5.6.

Query Q5 (Modified Q3): Here we estimate the error in the total revenue caused by errors in the data integration process. The schema is modified to simulate a data cleaning process. The customer information in the orders relation is assumed to have errors. A new stochastic relation is created to incorporate these errors. This relation has the actual customer key, a list of possible customer keys, and a probability associated with each possible customer key. The query Q3 joins the customer table with orders, and an order contributes to the total revenue only if it finds a valid key to match in customers. In the current query, we calculate the orders which failed to contribute to the revenue because of the nonexistence of the customer information in the customer table.

To control the number of tuples from customer that join with each order tuple, we regulate how the possible customer keys are generated in the new relation. We divide the customer key values into non-overlapping groups. For each key in a given group, the list of possible customer keys is drawn from that same group. The probability of each possible key is the same, and is 1/m, where m is the size of the group to which the actual customer key belongs. The size of all groups is the same, and this size m is the number of tuples we want from the customer relation to join with a single tuple in the orders table. So, if we want the composite attribute to have 300 seeds, then we need 300 tuples from the customer relation to join with that tuple. This can only happen if the same value is generated in 300 tuples in customers, and hence the group size m should be 300. The


Table 6-1. Performance with varying seeds per tuple (in minutes)

Num. of Seeds per Tuple    10     20     50     100    200     300
Long format                2.7    3      3.8    7.3    19.8    35
Short format               2.3    2.5    3.6    7      13      19.3

equal probability for each possible value will ensure that there is a good chance that each value will be generated in the sample array.

Given this setup, there are five variables we need to know to understand the results:

1. numSeedIds: The number of seed identifiers processed during the query execution.
2. numdb: Number of DB instances for the query. This is set to 100 for all tests in this section.
3. SeedsPerTuple: The average number of seed attributes stored in each AJOBJECT.
4. gibbsIters: Number of iterations for the Gibbs Looper. This value is 5, unless specified otherwise.
5. eliteDBPercent: Percentage of DB instances retained at the end of each Gibbs iteration. The deleted DB instances are cloned from the retained elite instances. We keep this value at 50% for all the tests.

We use two tests for evaluation. Test 1 compares the performance of both formats with increasing AJOBJECT size. In Test 2, we compare the performance of the two formats on varying sizes of the composite attribute formed, and also by changing the number of input tuples (numSeedIds). Both tests use query Q5. The variables for each of the tests are as follows:

1. Varying SeedsPerTuple: For this test, numSeedIds is set to 1000. The value of SeedsPerTuple is varied from 10 to 300. numdb is set to 100, gibbsIters to 5, and eliteDBPercent to 50%.
2. Performance on large data sets: For this test, numSeedIds is set to 100,000 and 15,000 for data set 1 and data set 2 respectively. The value of SeedsPerTuple is 10 and 300 for data sets 1 and 2 respectively. numdb is set to 100, gibbsIters to 5, and eliteDBPercent to 50%.
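The grouping scheme used to control SeedsPerTuple in query Q5 can be sketched as follows. This is an illustrative reconstruction and not the generator used in the experiments: the function names are hypothetical, the row layout (old_custkey, custkey, probability) mirrors the columns of the error_custkey relation referenced in the Appendix, and the customer keys are assumed to divide evenly into groups of size m.

```python
def build_error_custkey(custkeys, m):
    """Return rows (old_custkey, custkey, probability) for groups of size m.

    Keys are split into consecutive, non-overlapping groups. Every key in a
    group lists all keys of that group as possible values, each with
    probability 1/m.
    """
    if m <= 0 or len(custkeys) % m != 0:
        raise ValueError("number of keys must be a positive multiple of m")
    rows = []
    for start in range(0, len(custkeys), m):
        group = custkeys[start:start + m]
        for actual in group:
            for possible in group:
                rows.append((actual, possible, 1.0 / m))
    return rows


def possible_keys(rows, old_custkey):
    """Possible customer keys listed for one actual customer key."""
    return sorted(c for old, c, _ in rows if old == old_custkey)


# Hypothetical toy input: six customer keys in groups of three.
example_rows = build_error_custkey(list(range(1, 7)), 3)
```

With m = 3, each actual key has three equally likely candidates from its own group, so on average three tuples of the stochastic relation can take any given value. The experiments use the same idea with m between 10 and 300.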


Table 6-2. Performance on large data sets (in minutes)

                    Data set 1    Data set 2
Num. of Tuples      100,000       15,000
Seeds per Tuple     10            300
Long format         30.8          552
Short format        54.6          258.3

Results: Results for Test 1 are given in Table 6-1. The table shows the performance of both formats with a varying number of seed attributes stored in the composite attribute, the AJOBJECT. The number of seed attributes per tuple is the group size m as described in the query description. All the values listed are the number of minutes taken for the query execution. Table 6-2 shows the performance results for Test 2. The query is run on two data sets. The execution times are shown in the Long format and Short format rows.

Discussion: Test 1: As can be seen in Table 6-1, for a smaller number of seed attributes per tuple (up to 100) both solutions perform almost the same, with Short format having a slight edge. For the other two cases (200 and 300 seed attributes per tuple) there is a clear difference in performance. Here the experimental results are as expected. The Long format, with larger tuple sizes, reads and writes more data from and to disk. The execution time increases with an increasing number of seed attributes. The reasons here are twofold. First, for Long format the size of the tuples becomes larger, and for Short format the number of tuples in the Gibbs queue grows. When m, the number of seed attributes per tuple, is 300, the size of a tuple in the Gibbs queue for Long format is approximately 260 KB. For Short format, though the tuple size is very small, the number of tuples in the Gibbs queue is around 270,900. The second reason is the number of tuples processed in a single Gibbs iteration. Since each tuple now has more seed attributes, more tuples are pulled out for each seed attribute, and more are reinserted into the Gibbs queue. This increases the execution time considerably.


Test 2: The results are shown in Table 6-2. The first data set has 100,000 tuples and 10 customers in each group. Therefore, the number of seed attributes in the composite attribute is 10. Surprisingly, Long format performs better on data set 1. This can be attributed to Short format being slowed down by all the extra tuples in the Gibbs queue. For this data set, the Gibbs queue contains 1,100,000 tuples for Short format, whereas it contains only 100,000 tuples for Long format. The second data set has 15,000 tuples and 300 customers in each group. This results in 4,515,000 tuples in the Gibbs queue for Short format. Even with so many tuples, it performs better than Long format because of its lower disk usage. For this query, each tuple in the Gibbs queue is removed and reinserted until all seed attributes in that tuple are exhausted. Since the tuples are massive (approximately 260 KB) for Long format, multiple removes and reinserts are more expensive.

Conclusion: After looking at the results from both tests, the Short format seems more useful. Though we cannot say it is always better, in most results it outperforms the Long format. The most important factor in recommending Short format is the inability of Long format to store a large number of seed attributes.
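The Gibbs queue sizes quoted in the Test 2 discussion are consistent with a simple count: Long format enqueues only the left tuples, while Short format additionally enqueues the right tuples, which here number SeedsPerTuple per left tuple. The sketch below encodes that reading; the function name and signature are hypothetical, and the formula is an assumption that happens to reproduce the reported numbers.

```python
def gibbs_queue_tuples(num_left, seeds_per_tuple, short_format):
    """Estimated number of tuples in the Gibbs queue for one anti-join.

    Assumes every left tuple matches `seeds_per_tuple` right tuples and that
    Short format inserts those right tuples into the queue as well.
    """
    if short_format:
        return num_left + num_left * seeds_per_tuple
    return num_left


dataset1_short = gibbs_queue_tuples(100_000, 10, short_format=True)
dataset1_long = gibbs_queue_tuples(100_000, 10, short_format=False)
dataset2_short = gibbs_queue_tuples(15_000, 300, short_format=True)
```

Under this assumption Short format holds 11 times as many tuples as Long format on data set 1 and 301 times as many on data set 2, which illustrates the trade-off between tuple count and tuple size discussed above.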


CHAPTER 7
CONCLUSION

We presented two methods to improve the performance of the MCDB-R system. The first method is to serialize and de-serialize the sample generating process. We implemented and tested the idea in the prototype system. As shown in the experiments in Section 4.7, the performance benefits are large for many types of VG functions. Next, we presented an efficient technique to replace extreme samples. We provided a two-step process: filtering out unnecessary tuples and fragmenting the very large tuple bundle. The implementation of this process resulted in significant performance benefits, an improvement factor of 2 or better in many cases, as shown in Section 5.6. Finally, we looked at how to incorporate the anti-join operator in MCDB-R. We gave two solutions which work for queries with a single anti-join operation. The two solutions differ in how they generate tuple bundles at the anti-join operator. In future work, we will look at extending them to support multiple anti-join operators in a single query plan.


APPENDIX: QUERIES

Query Q1

CREATE VIEW params AS
SELECT 2.0 AS p0shape,
       1.333 * AVG(l_extendedprice * (1.0 - l_discount)) AS p0scale,
       2.0 AS d0shape,
       4.0 * AVG(l_quantity) AS d0scale,
       l_partkey AS p_partkey
FROM lineitem l
GROUP BY l_partkey

CREATE TABLE demands (new_dmnd, old_dmnd, old_prc, new_prc, nd_partkey, nd_suppkey) AS
FOR EACH l IN (
    SELECT * FROM lineitem, orders
    WHERE l_orderkey = o_orderkey AND yr(o_orderdate) = 1995)
WITH new_dmnd AS Bayesian (
    (SELECT p0shape, p0scale, d0shape, d0scale
     FROM params
     WHERE l_partkey = p_partkey)
    (VALUES (l_quantity, l_extendedprice * (1.0 - l_discount)) / l_quantity,
            l_extendedprice * 1.05 * (1.0 - l_discount) / l_quantity))
SELECT nd.value, l_quantity,
       l_extendedprice * (1.0 - l_discount)) / l_quantity,
       1.05 * l_extendedprice * (1.0 - l_discount) / l_quantity,
       l_partkey, l_suppkey
FROM new_dmnd nd

SELECT SUM(new_prf - old_prf) AS totalProfit
FROM (


    SELECT new_dmnd * (new_prc - ps_supplycost) AS new_prf,
           old_dmnd * (old_prc - ps_supplycost) AS old_prf
    FROM partsupp, demands
    WHERE ps_partkey = nd_partkey AND ps_suppkey = nd_suppkey)
WITH RESULTDISTRIBUTION MONTECARLO(100)
DOMAIN totalProfit <= QUANTILE(0.001)
FREQUENCYTABLE totalProfit

Query Q2

CREATE TABLE random_ord (o_orderkey, o_yr, o_tot) AS
FOR EACH o IN (SELECT * FROM orders)
WITH VAL AS Normal (VALUES (o_mean, o_var))
SELECT o.o_orderkey, year(o.o_orderdate), v.value
FROM VAL v

SELECT SUM(val) AS totalLoss
FROM random_ord, lineitem
WHERE o_orderkey = l_orderkey AND (o_yr = 1994 OR o_yr = 1995)

SELECT SUM(grpsize * o_mean) AS mean,
       SUM(grpsize * grpsize * o_var) AS var
FROM (
    SELECT o_mean, o_var, COUNT(*) AS grpsize
    FROM orders, lineitem
    WHERE year(o_orderdate) IN (1994, 1995) AND o_orderkey = l_orderkey
    GROUP BY o_orderkey, o_mean, o_var)
WITH RESULTDISTRIBUTION MONTECARLO(100)


DOMAIN totalLoss >= QUANTILE(0.999)

Query Q3

CREATE VIEW from_japan AS
SELECT c_custkey
FROM customer, nation
WHERE n_nationkey = c_nationkey AND n_name = 'JAPAN'

CREATE TABLE fixed_cust AS
FOR EACH o IN orders
WITH newcustkey AS DiscreteChoice (
    SELECT custkey, probability
    FROM error_custkey
    WHERE old_custkey = o_custkey)
SELECT value AS o_newcustkey, o_orderkey
FROM newcustkey

SELECT SUM(l_extendedprice * (1.0 - l_discount)) AS revJapan
FROM lineitem, fixed_cust, from_japan
WHERE l_orderkey = o_orderkey AND o_newcustkey = c_custkey
WITH RESULTDISTRIBUTION MONTECARLO(100)
DOMAIN revJapan >= QUANTILE(0.999)

Query Q4

CREATE TABLE actual_ship AS
FOR EACH o IN order_ship
WITH newshipdays AS Poisson (
    SELECT a.AvgShipDays
    FROM avg_ship a
    WHERE a.ShipDays = o.ShipDays)
SELECT ShipDays, AvgShipDays AS ExpShipDays


FROM newshipdays

SELECT SUM(s.Penalty) AS totalPenalty
FROM actual_ship a, ship_penalty s
WHERE a.ShipDays - a.ExpShipDays = s.DaysDelayed
WITH RESULTDISTRIBUTION MONTECARLO(100)
DOMAIN MaxPenalty >= QUANTILE(0.99)

Query Q5

CREATE VIEW from_japan AS
SELECT c_custkey
FROM customer, nation
WHERE n_nationkey = c_nationkey AND n_name = 'JAPAN'

CREATE TABLE fixed_cust AS
FOR EACH o IN orders
WITH newcustkey AS DiscreteChoice (
    SELECT custkey, probability
    FROM error_custkey
    WHERE old_custkey = o_custkey)
SELECT value AS o_newcustkey, o_orderkey
FROM newcustkey

SELECT SUM(o_totalprice) AS totalLost
FROM orders
WHERE o_newcustkey NOT IN (SELECT o_custkey FROM fixed_cust)
WITH RESULTDISTRIBUTION MONTECARLO(100)
DOMAIN totalLost >= QUANTILE(0.999)


REFERENCES

[1] Abadi, D. J., Madden, S., & Ferreira, M. (2006). Integrating compression and execution in column-oriented database systems. In SIGMOD Conference, (pp. 671).
[2] Agrawal, P., Benjelloun, O., Sarma, A. D., Hayworth, C., Nabar, S. U., Sugihara, T., & Widom, J. (2006). Trio: A system for data, uncertainty, and lineage. In VLDB, (pp. 1151).
[3] Ahmed, A., Low, Y., Aly, M., Josifovski, V., & Smola, A. J. (2011). Scalable distributed inference of dynamic user interests for behavioral targeting. In KDD, (pp. 114).
[4] Antova, L., Jansen, T., Koch, C., & Olteanu, D. (2008). Fast and simple relational processing of uncertain data. In ICDE, (pp. 983).
[5] Antova, L., Koch, C., & Olteanu, D. (2007). Maybms: Managing incomplete information with probabilistic world-set decompositions. In ICDE, (pp. 1479).
[6] Arumugam, S., Jampani, R., Xu, F., Perez, L. L., Jermaine, C. M., & Haas, P. J. (2010). Mcdb-r: Risk analysis in the database. PVLDB, 3(1), 782.
[7] Barbara, D., Garcia-Molina, H., & Porter, D. (1992). The management of probabilistic data. vol. 4, (pp. 487).
[8] Benjelloun, O., Sarma, A. D., Halevy, A. Y., Theobald, M., & Widom, J. (2008). Databases with uncertainty and lineage. VLDB J., 17(2), 243.
[9] Beskales, G., Soliman, M. A., & Ilyas, I. F. (2008). Efficient search for the top-k probable nearest neighbors in uncertain databases. vol. 1, (pp. 326).
[10] Botev, Z. I., & Kroese, D. P. (2008). An efficient algorithm for rare-event probability estimation, combinatorial optimization, and counting. In Methodology and Computing in Applied Probability, vol. 10, (pp. 471).
[11] Cerou, F., Moral, P. D., Furon, T., & Guyader, A. (2012). Sequential monte carlo for rare event estimation. vol. 22, (pp. 795).
[12] Chen, W., Zhang, D., & Chang, E. Y. (2008). Combinational collaborative filtering for personalized community recommendation. In KDD, (pp. 115).
[13] Cheng, R., Singh, S., & Prabhakar, S. (2005). U-DBMS: A database system for managing constantly-evolving data. In VLDB, (pp. 1271).
[14] Chouchane, M., Paris, S., Gland, F. L., & Ouladsine, M. (2012). Splitting method for spatio-temporal sensors deployment in underwater systems. In EvoCOP, (pp. 243).


[15] Cormode, G., Li, F., & Yi, K. (2009). Semantics of ranking queries for probabilistic data and expected ranks. In ICDE, (pp. 305).
[16] Corp., O. (2012). Oracle crystal ball. Oracle (http://www.oracle.com/us/products/applications/crystalball/index.html).
[17] Council, T. P. P. (2012). Tpc benchmark h. (http://www.tpc.org/tpch/default.asp).
[18] Dalvi, N. N., & Suciu, D. (2004). Efficient query evaluation on probabilistic databases. In VLDB, (pp. 864).
[19] Dalvi, N. N., & Suciu, D. (2007). Management of probabilistic data: foundations and challenges. In PODS, (pp. 1).
[20] Deshpande, A. (2009). Prdb: Managing large-scale correlated probabilistic databases (abstract). In SUM, (p. 1).
[21] Deshpande, A., & Madden, S. (2006). MauveDB: supporting model-based user views in database systems. In SIGMOD, (pp. 73).
[22] Deutch, D., Greenshpan, O., Kostenko, B., & Milo, T. (2011). Using markov chain monte carlo to play trivia. In ICDE, (pp. 1308).
[23] Doshi-Velez, F., Knowles, D., Mohamed, S., & Ghahramani, Z. (2009). Large scale nonparametric bayesian inference: Data parallelisation in the indian buffet process. In NIPS, (pp. 1294).
[24] Fuhr, N., & Rolleke, T. (1997). A probabilistic relational algebra for the integration of information retrieval and database systems. vol. 15, (pp. 32).
[25] Gamerman, D., & Lopes, H. F. (2006). Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference. Chapman & Hall/CRC Texts in Statistical Science, second ed.
[26] Garcia-Molina, H., Ullman, J. D., & Widom, J. (1999). Database System Implementation. Prentice Hall.
[27] Ge, T., Grabiner, D., & Zdonik, S. B. (2011). Monte carlo query processing of uncertain multidimensional array data. In ICDE, (pp. 936).
[28] Ge, T., Zdonik, S. B., & Madden, S. (2009). Top-k queries on uncertain data: on score distribution and typical answers. In SIGMOD Conference, (pp. 375).
[29] Goldstein, J., Ramakrishnan, R., & Shaft, U. (1998). Compressing relations and indexes. In IEEE International Conference on Data Engineering, (pp. 370).
[30] Graefe, G., & Shapiro, L. D. (1991). Data compression and database performance. In ACM/IEEE-CS Symp. On Applied Computing, (pp. 22).


[31] Gupta, R., & Sarawagi, S. (2006). Creating probabilistic databases from information extraction models. In VLDB, (pp. 965).
[32] Hua, M., Pei, J., Zhang, W., & Lin, X. (2008). Ranking queries on uncertain data: a probabilistic threshold approach. In SIGMOD Conference, (pp. 673).
[33] Huang, J., Antova, L., Koch, C., & Olteanu, D. (2009). Maybms: a probabilistic database management system. In SIGMOD Conference, (pp. 1071).
[34] Imielinski, T., & Jr., W. L. (1984). Incomplete information in relational databases. vol. 31, (pp. 761).
[35] Insight, Z. (2008). Dealing with the unexpected: Lessons for risk managers from the credit crisis. (http://www.zurich.com/corporatebusiness/home/home.htm).
[36] Iyer, B. R., & Wilhite, D. (1994). Data compression support in databases. In VLDB, (pp. 695).
[37] Jampani, R., Xu, F., Wu, M., Perez, L. L., Jermaine, C. M., & Haas, P. J. (2008). Mcdb: a monte carlo approach to managing uncertain data. In SIGMOD Conference, (pp. 687).
[38] Jampani, R., Xu, F., Wu, M., Perez, L. L., Jermaine, C. M., & Haas, P. J. (2011). The monte carlo database system: Stochastic analysis close to the data. In ACM Trans. Database Syst., vol. 36, (p. 18).
[39] Kanagal, B., & Deshpande, A. (2010). Lineage processing over correlated probabilistic databases. In SIGMOD Conference, (pp. 675).
[40] Karp, R. M., & Luby, M. (1983). Monte-carlo algorithms for enumeration and reliability problems. In FOCS, (pp. 56).
[41] Koch, C., & Olteanu, D. (2008). Conditioning probabilistic databases. In VLDB.
[42] Lakshmanan, L. V. S., Leone, N., Ross, R. B., & Subrahmanian, V. S. (1997). Probview: A flexible probabilistic database system. vol. 22, (pp. 419).
[43] Li, J., Saha, B., & Deshpande, A. (2009). A unified approach to ranking in probabilistic databases. PVLDB, 2(1), 502.
[44] Li, J., Saha, B., & Deshpande, A. (2011). A unified approach to ranking in probabilistic databases. VLDB J., 20(2), 249.
[45] Palisade (2012). @risk: a new standard in risk analysis. Palisade (http://www.palisade.com/risk/).
[46] Panneton, F., L'Ecuyer, P., & Matsumoto, M. (2006). Improved long-period generators based on linear recurrences modulo 2. In ACM Trans. Math. Software 32, 1.

PAGE 114

[47] Porteous,I.,Newman,D.,Ihler,A.T.,Asuncion,A.U.,Smyth,P.,&Welling,M.(2008).Fastcollapsedgibbssamplingforlatentdirichletallocation.InKDD,(pp.569). [48] Ratanaworabhan,P.,Ke,J.,&Burtscher,M.(2006).Fastlosslesscompressionofscienticoating-pointdata.InDCC,(pp.133). [49] Re,C.,Dalvi,N.N.,&Suciu,D.(2007).Efcienttop-kqueryevaluationonprobabilisticdata.InICDE,(pp.886). [50] Re,C.,&Suciu,D.(2008).ManagingprobabilisticdatawithMystiQ:Thecan-do,thecould-do,andthecan't-do.InSUM,(pp.5). [51] Robert,C.,&Casella,G.(2004).MonteCarloStatisticalMethods.Springer,seconded. [52] Roth,M.A.,&Horn,S.J.V.(1993).Databasecompression.SIGMODRecord,(pp.31). [53] Rubino,G.,&Tufn,B.(1980).RareEventSimulationUsingMonteCarlo.Wiley. [54] Rubinstein,R.(2009).Thegibbsclonerforcombinatorialoptimization,countingandsampling.MethodologyandComputinginAppliedProbability,11,491. [55] Savage,S.L.(2009).TheFlawofAverages:Whyweunderestimateriskinthefaceofuncertainty.Wiley. [56] Sen,P.,Deshpande,A.,&Getoor,L.(2009).Prdb:managingandexploitingrichcorrelationsinprobabilisticdatabases.vol.18,(pp.1065). [57] Singh,S.,&McCallum,A.(2011).Towardsasynchronousdistributedmcmcinferenceforlargegraphicalmodels.InNeuralInformationProcessingSystems(NIPS),BigLearningWorkshoponAlgorithms,Systems,andToolsforLearningatScale. [58] Soliman,M.A.,Ilyas,I.F.,&Chang,K.C.-C.(2007).Top-kqueryprocessinginuncertaindatabases.InICDE,(pp.896). [59] Stonebraker,M.,Becla,J.,DeWitt,D.J.,Lim,K.-T.,Maier,D.,Ratzesberger,O.,&Zdonik,S.B.(2009).Requirementsforsciencedatabasesandscidb.InProc.CIDR,(p.26). [60] Stoyanovich,J.,Davidson,S.B.,Milo,T.,&Tannen,V.(2011).Derivingprobabilisticdatabaseswithinferenceensembles.InICDE,(pp.303). [61] Suciu,D.,Olteanu,D.,Re,C.,&Koch,C.(2011).ProbabilisticDatabases.SynthesisLecturesonDataManagement.Morgan&ClaypoolPublishers. [62] Systems,L.D.(2012).Riskanduncertaintyanalysis.LuminaDecisionSystems(http://www.lumina.com/case-studies/risk-analysis/). 114

PAGE 115

[63] Thiagarajan,A.,&Madden,S.(2008).Queryingcontinuousfunctionsinadatabasesystem.InProc.SIGMOD,(pp.791). [64] Walker,D.D.,&Ringger,E.K.(2008).Model-baseddocumentclusteringwithacollapsedgibbssampler.InKDD,(pp.704). [65] Wang,C.,Yuan,L.-Y.,You,J.-H.,Zaane,O.R.,&Pei,J.(2011).Onpruningfortop-krankinginuncertaindatabases.vol.4,(pp.598). [66] Wang,D.Z.,Franklin,M.J.,Garofalakis,M.N.,Hellerstein,J.M.,&Wick,M.L.(2011).Hybridin-databaseinferencefordeclarativeinformationextraction.InSIGMODConference,(pp.517). [67] Wang,D.Z.,Michelakis,E.,Garofalakis,M.,&Hellerstein,J.(2008).BayesStore:Managinglarge,uncertaindatarepositorieswithprobabilisticgraphicalmodels.InVLDB. [68] Wick,M.L.,McCallum,A.,&Miklau,G.(2010).Scalableprobabilisticdatabaseswithfactorgraphsandmcmc.vol.3,(pp.794). [69] Widom,J.(2005).Trio:Asystemforintegratedmanagementofdata,accuracy,andlineage.InCIDR,(pp.262). [70] Wikipedia(2012).Late2000snanacialcrisis.(http://en.wikipedia.org/wiki/Late-2000s nancial crisis). [71] Yao,L.,Mimno,D.M.,&McCallum,A.(2009).Efcientmethodsfortopicmodelinferenceonstreamingdocumentcollections.InKDD,(pp.937). [72] Yi,K.,Li,F.,Kollios,G.,&Srivastava,D.(2008).Efcientprocessingoftop-kqueriesinuncertaindatabases.InICDE,(pp.1406). [73] Zhao,B.,Rubinstein,B.I.P.,Gemmell,J.,&Han,J.(2012).Abayesianapproachtodiscoveringtruthfromconictingsourcesfordataintegration.vol.5,(pp.550). [74] Zukowski,M.,Heman,S.,Nes,N.,&Boncz,P.A.(2006).Super-scalarram-cpucachecompression.InICDE,(p.59). 115


BIOGRAPHICAL SKETCH

Ravi Jampani is a PhD student at the University of Florida, Gainesville. He is advised by Dr. Chris Jermaine and Dr. Alin Dobra in the broad research topic of database management systems. In databases, his specific interests are in query processing and indexing. Apart from databases, he is also interested in the fields of natural language processing, multiagent systems, data structures, and algorithms. Prior to joining the University of Florida, Ravi received Bachelor of Technology (2003) and Master of Science (2005) degrees from the International Institute of Information Technology (IIIT), Hyderabad, India.