PAGE 1
QUERY OPTIMIZATION USING FREQUENT ITEMSET MINING By BOYUN EOM A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLOR IDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE UNIVERSITY OF FLORIDA 2005
PAGE 2
Copyright 2005 by BoYun Eom
PAGE 3
I devote this work to my parents
PAGE 4
iv ACKNOWLEDGMENTS Lots of thanks should go to many people for this thesis. First of all, I would like to express my sincer e gratitude to my adviser, Dr. Chris Jermaine, for giving me the chance to work with him with th e interesting topic for this research and for his enthusiastic guidance. I was so lucky to have such a wonderful adviser. He has inspired me constantly through this work, and he will be a source of inspiration with whatever I would do in my career. Owing to his invaluable suggestions a nd patience, I have enjoyed this thesis and I have been able to complete it. I would also like to thank Dr. Alin Dobra and Dr.Sanjay Ranka for their willingness to serve on my supervisory committee. Next, I want to show my thanks to my colleagues in our lab, Abhi, Shantanu, and Subi, and the colleagues in DB center, Am it, Gilliean, Jungmin and Seema. I really appreciate for their friendly help and keeping me company during the day and late night. I would like to thank my Soulmates and my friends in Gaineville. I also would like to give my special thanks to all of my friends in Kor ea for their continual love and heartful cheers. Now, I wish to express my deep appreciation and admiration to the Caviedes for being my family here in a strange land. Owing to their wa rm love and thoughtful c oncerns, I have felt more comfortable and have enjoyed my life in Gain esville. I would not forget the amazing friendship and the memory we have shared together. My thanks next go to all of my family member s: My only sister, JuYun, and my brother-inlaw, DongWoo, older brother, SungHyuk, and sist er-in-law, JungA, younger brother, KyungSik,
PAGE 5
v and his new bride. Also, I would like to give my thanks to my beloved nephews, HeeWon and HeeWong. Their great encouragement has been very helpful for my work. I greatly appreciate my grandma, KyungIee Kim, who raised me. Especially, I would like to thank for all of her efforts for me and fo r her being there and patiently waiting for me. I cannot help mentioning my love and thanks toward my grandma, YongCheol Jin, who passed away right after my departure to the USA. I have felt sorry for the fact that I could not keep the last appointment with her and have been missing and will miss her, forever. I also feel that I have to show my thanks to my uncle and aunt who have encouraged me to do what I wished. Most of all, I am extremely thankful to my parents for their acceptance my decision and their prayers for my will. Since I know that it was not easy for my parents to let me go to untraditional route, I really thank them for their trust and support. Last, I appreciate God who has allowed me all of these wonderful people and experiences in my life. Even though it was really exciting an d happy journey for me to spend time for this work, the most valuable thing I have earned through this experience is that I realized His endless love and that I am nothing without Him. Words cannot express my gratitude to all those who have supported me.
PAGE 6
vi TABLE OF CONTENTS page ACKNOWLEDGMENTS.................................................................................................iv LIST OF TABLES...........................................................................................................viii LIST OF FIGURES...........................................................................................................ix ABSTRACT....................................................................................................................... ..x CHAPTER 1 INTRODUCTION........................................................................................................1 2 RELATED WORK.......................................................................................................5 2.1 Histograms..............................................................................................................6 2.1.1 Equi-Width Histograms................................................................................6 2.1.2 Equi-Depth Histograms................................................................................7 2.1.3 End-Biased Histograms................................................................................7 2.2 Selectivity Estimation Using One-Dimensional Histograms.................................9 2.3 Multi-Dimensional Histograms............................................................................14 2.4 BHUNT.................................................................................................................14 3 CARDINALITY ESTIMATION USING FREQUENT ITEMSET MINING...........16 3.1 Background in Data Mining and Frequent Itemsets.............................................16 3.2 Overview of Our Approach..................................................................................19 3.3 Data Structure of the FI Tree................................................................................21 3.4 Selectivity Estimation for Equality Selection.......................................................24 3.4.1 Step One: Get the Cardinality of Resulting Relation Using the FI Tree....24 3.4.2 Step Two: Get the Cardinality of Result Relation Using the Uniform Distribution Assumption..................................................................................25 3.4.3 Step Three: Create a New FI Tr ee for the Result after Selection...............29 3.4.4 Step Four: Update the Table for Attribute Counts for the Result after Selection...........................................................................................................32 3.5 Selectivity Estimation for Join..............................................................................33 3.5.1 Notations Used...........................................................................................35 3.5.2 Step One: Get the Cardinality for the Result after Join..............................36 3.5.3 Step Two: Get the FI tree for the Result af ter the Join...............................39
PAGE 7
vii 3.5.4 Step Three: Update the Table for Attribute Counts for the Result after Join...................................................................................................................44 4 EXPERIMENTS AND RESULTS.............................................................................46 4.1 Goal of Experiments.............................................................................................46 4.2 Methodology.........................................................................................................47 4.3 Results...................................................................................................................5 0 4.3.1 Dataset 1: Slightly Correlated Data............................................................51 4.3.2 Dataset 2: More Correlated Data................................................................51 4.4 Discussion.............................................................................................................52 5 CONCLUSION AND FUTURE WORKS.................................................................55 APPENDIX EXPERIMENTAL RESULTS...........................................................................................56 LIST OF REFERENCES...................................................................................................85 BIOGRAPHICAL SKETCH.............................................................................................88
PAGE 8
viii LIST OF TABLES Table page 1: Distribution of attribute values on Quantity ..............................................................8 2: Transactions in R1 ........................................................................................................17 3: Frequent itemsets from R1 and their supports..............................................................18 4: The updated table for numbers of attribute after selection...........................................33 5: Transactions in R2 ........................................................................................................37 6: Number of attribute values for result relation after join R1 and R2 .............................45
PAGE 9
ix LIST OF FIGURES Figure page 1: Overall flow for query processing in DBMS...................................................................1 2: Histograms (A)The Equi-width, (B)T he Equi-depth and (C)The End-biased histograms..................................................................................................................8 3: A query plan................................................................................................................ ...10 4: More possible query plans.............................................................................................11 5: Query plan1 after performing S1 in Figure 3 ................................................................13 6: A query plan tree for the example of the selection........................................................20 7: An example of an FI tree...............................................................................................22 8: The FI tree for R1.......................................................................................................... 23 9: The numbers of attribute values over R1.......................................................................26 10: Overall process for getting the FI tree.........................................................................30 11: Join modeling.............................................................................................................. .34 12: Profiles for R2 a) the FI tree for R2 b) table for attribute value count for R2 .......37 13: After rebuilding and deleting uncommon items..........................................................40 14: Changing frequencies for nodes..................................................................................41 15: The FI tree for result relation.......................................................................................43 16: Partial schema of TPC-R database...............................................................................49
PAGE 10
x Abstract of Thesis Presen ted to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Master of Science QUERY OPTIMIZATION USING FREQUENT ITEMSET MINING By BoYun Eom August 2005 Chair: Christopher M. Jermaine Major Department: Computer and In formation Science and Engineering Depending on the query plan chosen by a database query optimizer, the overall query execution time can vary dramaticall y. Consequently, the query optimizer in a database management system is critica lly important component for reducing the execution time in query processing. To r ecognize the best plan among many possible query plans using cost-based optimization, th e query optimi zer estimates the cost for every plan based on the underlying data dist ribution. The query optimizer makes use of synopses of the database rela tions to perform this task. The most common synopses in commercial databases have been one-dimensional histograms. However, histograms can provide poor estimation quality, especially when there is correlation among data. Motivated by this, we propose a new met hod to perform more accurate selectivity estimation, even for correlated data. To m odel the correlation that may exist among data, we borrow one of the well-known techniques in data mining a nd extract attribute values
PAGE 11
xi that occur together frequently using fre quent itemset mining. In our method, frequent itemsets are used as synopses of the data base relations. Through experimentation, we found that our approach is effective in mode ling correlations and we confirm that this method approximates result rela tions more accurately than one-dimensional histograms do. Our approach gives precise estimates, partic ularly for the correlat ed data and/or the skewed data.
PAGE 12
1 CHAPTER 1 INTRODUCTION A DataBase Management System (DBMS) has several components used in query execution: the query parser the query optimizer the code generator or the interpreter and the query processor When a user gives a query to a DB MS, the DBMS evaluates the query as follows. First, the query parser checks syntax an d translates the given que ry into an extended form of relational algebra. This is th en passed to the query optimizer, and the plan generator in the query optimizer produces many different query plans. Though all query plans produce the same result, their individual cost of execution vari es widely. So, the cost estimator in the query Figure 1: Overall flow for query processing in DBMS Query Query Parser Query Optimizer Code Generator Query Processor Answer Plan Generator Cost Estimator
PAGE 13
2 optimizer is assigned to choose only one among all plans that the plan generator created. Once the query optimizer has determined the execution plan as a next step, the code generator or the interpreter transforms the plan into the act ual access routines. Finally, the query processor performs the query and the DBMS returns the result [1]. Figure 1 shows overall query processing flow. The performance and efficiency of a DBMS can be significantly affect ed by the quality of the plan chosen by the query optimizer. Hence th e query optimizer is arguably the most important component of a DBMS for query execution time. No t surprisingly, much research has focused on query optimization. Our research is also focused on this particular area. The job of a query optimizer is to seek the b est plan that has the cheapest execution cost, and in particular to avoid picking up a poor plan. In cost-based optimization the main idea is to compare the cost of a plan in terms of the number of disk accesses that would be performed as a query plan is executed. Depending on the or der of relational algebra operators such as select join or project and the size of intermediate relations, the number of disk I/Os required to implement the plan can be different. A query optimizer appr oximately calculates total I/Os for each plan and finally chooses the least costly plan for the give n query. The plan chosen should have the least number of disk accesses among all plans. To estimate the approximate cost of a query plan, a query optimizer needs information on data distribution, and the DBMS typically makes u se of a statistical synopsis for this purpose. As a consequence of the limited time and space, most of current commercial DBMSs use onedimensional histograms as synopses. The Attribute Value Independence (AVI) assumption is needed when one-dimensional hist ograms are used over multiple attributes in a query. Under the AVI assumption, attributes in a relation are assu med to have independent data distribution from one other and the joint data distribution can be obtained from these independent individual distributions [2].
PAGE 14
3 Often, some of data have strong dependenc ies. For example, taller people tend to be heavier than shorter people. Most people who are from Asia have black hair color. People with higher education have a tendency to earn more money. One may imagine many other cases in which data are correlated with each other. The correlation between attributes affects the distribution of data. If we ignore these characteristics of data and simply use the AVI assumption for estimating cost, a query optimizer can make very poor decisions. Motivated by this problem, this thesis considers a method for computing high-quality selectivity estimates, even over correlated data. In our method, instead of using histograms with the AVI assumption, we adopt one of the well-known techniques from data mining to build a model for the data set. We use Frequent Itemsets [3, 4] to capture the correlation among data for estimating the size of the result of selection and join operations. Our method is described at a high level as follows. Using the Apriori algorithm [3], we determine if there exists an attribute value set that satisfies a predetermined frequency threshold. If there is, we regard this set as a frequent itemset and keep the values in this set with the freque ncy. For instance, suppose we treat the set of attributes occurring together in more than 50% of a relationÂ’s records as a frequent itemset. Among total tuples in a relation, if half of tuples have value Â‘ a1 Â’ on attribute A and value Â‘ b2 Â’ on attribute B then these two values, Â‘ a1 Â’ and Â‘ b2 Â’, are frequent items and we store this itemset, { a1, b2 } and its frequency in our profile. Base d on these frequent values, we build an FI (Frequent Itemset) tree In response to a selectivity estimate request by the query optimizer, we traverse the FI trees corresponding to the queried relations and look for frequent itemsets containing values which are on the attributes in the plan. If there are values that we are looking for, then we use their frequency to estimate the cardinality for the operation in question. For the rest of attribute values that do not appear in the FI tree, we use heuristic methods under the uniform distribution assumption to calculate the ca rdinality and the total numbers of attribute
PAGE 15
4 values are used together. Even though we do not keep the information of non-frequent attribute values, by definition they cannot occur frequently, and so it would not have serious effect on final cost because only attribute values with low frequency are discarded. The benefit of our approach is as follows: (1) Our method outperforms one-dimensional hi stograms, especially when data are not independent of each other, becau se it is available to capture co rrelation in the data. We keep data distribution information like the histogram method, but unlike histograms, our FI trees capture correlations between those attribute va lues. This gives us more accurate estimation for non-independent data. (2) Our approach is also good for skewed data. Unlike histograms which assume data is uniformly distributed within a bucket, we keep the real frequency of frequent items, which form the majority of attribute values in th e relation. This means that we can estimate selectivity more precisely for the skewed data. Through extensive experimentation using TPC -R benchmark, we confirm these useful characteristics of FI-based approach.
PAGE 16
5 CHAPTER 2 RELATED WORK Typically, there are two forms for query optimization: rule-based optimization and costbased optimization In rule-based optimization the optimizer ranks available access paths by heuristic rules and chooses an execution plan for the query. So, this optimization is sometimes called heuristic optimization On the other hand, cost-based optimization estimates the costs of executing possible plans and as a result, the overall cost of executing the query is systematically reduced. With this reason, it is also called systematic optimization Many techniques have been proposed to sugg est more accurate way to estimate in costbased optimization. These can be classified as sampling techniques parametric techniques and non-parametric techniques [5 7]. As name implies, sampling techniques collect samples from the database and use those to calculate intermedia te size. Despite of its potentially high accuracy, the drawback of sampling is that it is expensive and questionable in query optimization since it is performed mostly at run time and query optimiz ation should estimate the cost quickly [5]. Parametric techniques on the other hand, require little overhead but they can give bad estimates [5]. In such techniques, a mathematical function is used and by fitting parameters of this function, the data distribution is approximated [5, 6]. The last class of techniques in query optimization is non-parametric techniques. These techniques are sometimes so called histogram-based techniques [5, 6]. Even though histograms also have their drawbacks, histogram-based techniques are simple and inexpensive [5] and are the most widely used method in commercial systems. In the remainder of this chapter, we discuss pr ior research related to our approach in more detail. We discuss histograms in general, sel ectivity estimation using histograms, and multidimensional histograms.
PAGE 17
6 2.1 Histograms In a cost-based optimization, th e query optimizer uses statistics in the data dictionary to estimate the cost of plans [8]. To support such statistic, a database administrator periodically instructs the DBMS to construct and maintain data profiles. A profile in a database is a statistical summary of a relation such as the number of t uples, the number of attribute values and the distribution of values. Since a query optimizer tota lly relies on these profiles to calculate the costs rather than using the real relations, profiles pl ay significant roles in query optimization. The oldest and the most common form of profile is a histogram [9, 10]. To build a histogram in DBMS, the domain of attribute values for a single attribute is partitioned into buckets after being sorted, a nd the histogram keeps the minimum and maximum attribute values of in each bucket. Every bucket h as the sum of frequencies of attribute values in it, as well. Since a histogram does not store all frequencies for values in a bucket, the frequency of every value in a bucket is assumed to be equa l to the average of frequencies of all values in that bucket. This is known as the uniform distribution assumption [9, 11]. The advantages of using histograms for query optimization are that there is little run-time over head, that they are inexpensive to store, maintain and compute and that they may give low-error estimates [2, 12]. Various classes of histograms have been proposed but it is known that only some of fundamental approaches are used in practice: equi-width histograms, equi-depth histograms and end-biased [6]. 2.1.1 Equi-Width Histograms In the equi-width histogram the width of each bucket is equal but the frequencies are different for every bucket. In other words, the number of consecutive attribute values for all buckets is same but the sums of frequencies for these values in every bucket are not. The equiwidth histogram is one of the oldest histograms and it is relatively cheap to build and very easy to apply in selectivity estimation. However, it is not good for handling skewed data and has a much higher worst-case and average error for selection que ries than equi-depth histogram [9, 12, 13].
PAGE 18
7 2.1.2 Equi-Depth Histograms As an alternative to equi-width histograms, equi-depth histograms have been suggested. All buckets have almost the same height in an equi-depth histogram, i.e., when partitioned, the frequencies for each attribute values are counted an d the range of each bucket is decided by these frequencies. Most of current commercial DBMSs use this histogram [2, 9, 13]. 2.1.3 End-Biased Histograms This distinctive different histogram puts th e most frequent attribute values with its frequency in individual buckets. The rest of th e attribute values and the average of these values are stored in a bucket together. Together equi-depth histograms, end-biased histograms are more accurate for approximating distribution [6]. Below is an example for these three types of histograms. Example 1 : On the relation lineitem in the TPC-R benchmark that we used for the experiments described in chapter 4, there is an attribute named Quantity Assume that Table1 gives the distribution of attribute values on that attribute, and we construct an equi-width histogram and an equi-depth histogram with th is distribution using 4 buckets. Figure 2 is the result histograms for Table 1.
PAGE 19
8 Table 1: Distribution of attribute values on Quantity Figure 2: Histograms (A)The Equi-width, (B)The Equi-depth and (C)The End-biased histograms Quantity count Cumulative 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 2 3 5 8 2 0 0 0 30 2 8 5 5 0 10 14 2 1 1 1 1 2 5 10 18 20 20 20 20 50 52 60 65 70 70 80 94 96 97 98 99 100 40 30 20 10 20-24 25-29 30-34 35-40 20 32 28 20 Frequency Quantity 40 30 20 10 20-28 28 29-34 34-40 25 25 25 25 Quantity Frequency (A) an equi-width histogram (B) an equi-depth histogram 40 30 20 10 28 34 35 46 30 10 Quantity Frequency (C) an end-biased histogram 14 20-27
PAGE 20
9 Note that in the equi-width histogram (A) in Figure 2, the frequencies for each buckets are different but the ranges of attribute values for ea ch buckets are almost the same. On the other hand, in the equi-depth histogram (B), all frequencies for every bucket are (almost) same. When we judge a histogram, we need to consid er 1) accuracy, 2) efficiency in maintaining and 3) extensibility to multi-dime nsional data [10]. So, considering these criteria, the major setback for the partition-based histograms like eq ui-width and equi-depth histograms, is the difficulty in extension to comp lex multi-dimensional data [10]. Besides these traditional partition-based histograms, one interesting classes of histograms is wavelet-based histograms [10, 14, 15, 16]. In this approach, a mathematical function is used for decomposition process. Through this decomposition process, original data are transformed and compressed into a set of numbers, wavelet coefficients Decomposition processes are repeated until there is the single coefficient and we call this as wavelet decomposition [10, 14, 15, 16]. The wavelet-based histogram shows more accuracy than traditional histograms and most of all this can be naturally extended to the multi-dimens ional distribution in the course of wavelet decomposition and reconstruction. Yet, the mainte nance for any change of data distribution is much more difficult than partition-based histograms. 2.2 Selectivity Estimation Usin g One-Dimensional Histograms The terminology of selectivity means a ratio of the number of tuples that satisfy the predicates, conditions, in a query to the total numb er of tuples of the relation [13]. As a plan is being evaluated, an optimizer calculates th is selectivity for operations in a tree. We now consider an example of how histogram s can be used to evaluate the quality of a query plan. Example 2: Below is one of queries we used in our experiments: SELECT l_orderkey, o_orderdate, o_shippriority FROM customer, orders, lineitem WHERE c_mktsegment = Â‘AUTOMOBILEÂ’
PAGE 21
10 AND c_custkey = o_custkey AND l_orderkey = o_orderkey; For convenience, we will refer to c_mktsegment = Â‘AUTOMOBILEÂ’ as P1, c_custkey = o_custkey as P2 and l_orderkey = o_orderkey as P3 Then the predicate set P for this query can be expressed such as, P = { P1, P2, P3 }. Again, since we have dealt with selection and join operations in this thesis, we ignore the project operation in our query plan trees. For this given query, many quer y plans would be correct in the sense that they give the correct result, though each will have a different cost. Figure 3 describes one of those possible plans. Figure 3: A query plan A leaf node on the query plan tree denotes a base relation and the profile like a histogram for the base relation is called as a base profile Internal nodes are operations. Performing operations on the base profiles results in intermed iate relations and profiles. Unlike base profiles, intermediate profiles are estimated. Therefore, as an intermediate table is farther from base tables in a tree, the accuracy of the corresponding profile decreases. In Figure 3, there are three base tables on our tree and three base profiles are used for the statistic of these relations. Query plan1 first performs selection operation with predicate P1 over customer relation and then it joins the result with orders relation on custkey Finally, it joins intermediate relation after join J1 customer.mksegment=Â’AUTOMOBILE Â’ < customer > < orders > < lineitem > custkey orderkey S1 J1 J2 Query Plan1
PAGE 22
11 with lineitem relation on orderkey We can see some other possible query plans in Figure 4. As this figure shows, depending on the order of operations or depending on the combination of relations for the operation, we can have different plan trees. Note that all three plans have same relational algebra operations. Figure 4: More possible query plans How then are histograms used when the optimizer estimates the cost for a plan tree? At the time that the optimizer calculates the cost for all these possible plan trees, it starts from the base tables. First of all, it checks if there is a histogram corresponding attribute in the selection predicate over the base table. If there is, the optim izer searches for the bucket(s) that contains the attribute values named in the selection predicate. With that statistic, the optimizer can estimate the selectivity. In the case that the operati on involves multiple attributes, under the AVI assumption in that attributes are independent, we estimates each selectivity and multiply them together [17 22]. For more detailed description of the process, we go back to our first example, query plan1 Assume that all attributes that are used in this query have histograms. Normally, histograms are built on the integer attributes. If an attribute is non-numeric, we may need to take transformation for attribute values on that attribute. We assume th at we have already done this process and have Query Plan2 customer.mktsegment=Â‘AUTOMOBILEÂ’ < customer > < orders > < lineitem > custkey orderkey < customer > < orders > < lineitem > custkey orderkey customer.mktsegment=Â‘AUTOMOBILEÂ’ Query Plan3
PAGE 23
12 a histogram on attribute mksegmenet in the relation customer We now consider how the optimizer makes a use of histograms to estimate the cost for query plan1 In query plan1 there are one selection, S1 and two join operations, J1 and J2 The total cost for this plan tree is the summation of the costs for these operations We will discuss the selectivity for S1 and J1 in query plan1 Over the relations in this tree, we first define some notation: Nc Cardinality of customer relation No Cardinality of orders relation Nl Cardinality of lineitem relation Bk the kth bucket of the histogram on mktsegment which contains the value corresponding to AUTOMOBILE Fk the frequency for kth bucket fi the frequency for ith data after sorting Count(Bk) number of attribute values in Bk Cost(s) cost for the selectivity s in terms of number of tuples For the selectivity for S1 in query plan1 a query optimizer tries to find out the bucket(s), Bk which contain(s) the predicate value, Â‘AUTOMOBILEÂ’, and get(s) the frequency of the bucket(s), Fk. By adapting the uniform di stribution assumption, the frequency of each value in the bucket(s), f would be: ) Count(B F fk k With this frequency, we can get th e cost for this equality selection S1 : Cost(S1) = f Nc If we expand this equality selection to a range query over customer relation, then the cost for range selection, r would be Cost(r) = i i c) f ( N, where i is all values within the range in the query
PAGE 24
13 The selectivity for join operation is more complicated. After performing selection operation S1 our query plan1 would have a form like Figure 5: Figure 5: Query plan1 after performing S1 in Figure 3 To get the selectivity for J1 operation in query plan1, an optimizer compares all buckets on custkey in the intermediate relation R1, which is the result for S1 over the in orders relation. Note that since we use the AVI rule that assumes that all attribute values are independent, we still use the histogram on attribute custkey of relation customer for relation R1 Just like the way to get the cost for equality selection, the number of tuples for every attribute values would be calculated if those are common in both relations. Let I be the set of all common values on joined key in both relations. Then for all common attribute i, number of tuples after join operation would be: Ti = fiin relation R1 f C i R 1 in relation Orders I i C O Finally, the cost for join, J1 is the summation of all these tuples: Cost(J1) = i i T After all, the total cost for the query plan1 is the summation of all these costs after each operation in the tree, in terms of the number of tuples. < intermediate R1 > < orders > < lineitem > custkey orderkey J1 J2
PAGE 25
14 2.3 Multi-Dimensional Histograms In reality, most of queries contain more than two attributes and the results of those queries depend on the joint data distribution [2]. Thus one-dimensional histograms use the AVI assumption for multi-dimensional attributes and will produce good estimates only if the attributes are totally independent. Rather than relying on th e unrealistic AVI assumption, what if we capture data distribution over multiple attributes and use this statistic for estimating selectivity? Multidimensional histograms can be used for this purpo se. Consider the joint frequency distribution of the two attributes in a multi-dimensional space. E ach axis represents one of the attributes and a set of points has the frequency for the joint data corresponding attribute value pairs. We need to sort and partition the tuples and put these partiti ons into buckets [2, 23]. Depending on the way to partition the domain of attribute values, there are several multi-dimensional histograms that have been proposed: phased, mhist, genhist, stholes, vi and hmf [2, 24, 25]. Conceptually, the advantage of multi-dimensional histograms is that we can expand the dimensions as needed and we do not need to depend on the problematic AVI assumption any more. However, the joint frequency distribution is often very complex and e xpensive to construct. In addition to this expensive cost, it is known that in practice multi-dimensional histograms actually do not work well when the dimension is higher than three [13]. Thus especially for the join operation that increases the data dimens ionality, multi-dimensional histograms may be poor in terms of accuracy as well as cost. As a result, multi-dimensional histograms are typically not used in production systems. 2.4 BHUNT We conclude this chapter by introducing one more recent rela ted work that in many ways is close to the method suggested in this thesi s. Brown and Haas [26] suggest a Â“data-driven techniqueÂ” where functional dependencies between attributes are automatically discovered to make an improvement in query optimization. A fuzzy algebraic constraint and other useful
PAGE 26
15 relationships are found by BHUNT methodology if there is a correlation between data. First, BHUNT generates candidates that satisfy an algebr aic constraint and for each candidate, then it constructs algebraic constraints. S tatistical histogramming, segmentation, or clustering techniques are used for this construction. During query processing, the optimizer uses constraints to discover more efficient access paths [26]. In the sense that they try to use a data mining technique to get the good query evaluation, their approach may be closely related to ours.
PAGE 27
16 CHAPTER 3 CARDINALITY ESTIMATION USING FREQUENT ITEMSET MINING Our new method for estimating the size of r esulting relations makes use of an idea from data mining: the frequent itemset We begin the description of our method by describing data mining and the idea of frequent itemset, before moving on to our estimation method. 3.1 Background in Data Mining and Frequent Itemsets For those who might not be familiar with the area of Â“ data mining Â”, we start from a brief discussion. As mine workers dig gold from heap of earth, the purpose of data mining is to find some precious information that is hidden under the stack of data and that we might ignore most of times. Among the data that we store in our reposito ry, some interesting and valuable information may be buried. One of the often-men tioned examples for data mining is market basket data [3, 4]. For example, it may be the case that when men go to the market to buy diaper, they likely purchase beer, as well. By finding this correlati on between the item diapers and beer, we can generate a association rule such that if there is a diaper purchased, then beer is also sold as well. This rule is denoted as Â“diaper beerÂ” [4, 27]. Among the total transactions, if some items are found together frequently, we put these items as a set and mark this as a frequent itemset (FI) If k is the total number of items in a frequent itemset, it is called as k-size itemset that is, the number of items in an FI set is the size of the set. To mine frequent itemsets is one of the interesting challenges in data mining area and many approaches are suggested. For the purpose of selectivity estimation using frequent itemsets, we used the famous apriori [3] algorithm. To judge if an item is eligible as a frequent item or not, we need a pre-specified measure. Support and confidence are those. Suppose we consider th e correlation between some items and
PAGE 28
17 put the items that are in the ant ecedent and in the consequence as all items While support is a fraction of the transactions that contains all items to the total transaction, confidence is the ratio of the number of transactions that have all items to the number of transactions that include items in the antecedent. For example, if a basket data base has 100 transactions, out of which 80 include item diaper and 40 of these contain item beer, the support for these items is 40% (40 out of 100) and the confidence is 50% (40 out of 80). One important property of frequent itemsets is the upward closure property which means if an itemset is a frequent itemset with support with higher than or e qual to the threshold, all of its subsets are also frequent itemsets. For example, if {Â‘ a1 Â’, Â‘ b2 Â’} is a frequent itemset, all of its subsets, {Â‘ a1 Â’} and {Â‘ b2 Â’} in this case, are also frequent itemstes. Since besides the itemset itself, {Â‘ a1 Â’, Â‘ b2 Â’}, the sizes of all other subsets of this itemset, {Â‘ a1 Â’} and {Â‘ b2 Â’}, are smaller than the itemset. Since the itemset {Â‘ a1 Â’} may occur with another attribute values rather than Â‘ b2 Â’, the frequency of {Â‘ a1 Â’} should be equal to or larger than that of {Â‘ a1 Â’, Â‘ b2 Â’}. To make discussion about FI minining concrete, here we provide an example. Example 3 : In the relation R1, there are 4 attributes, A, B, C and D. Table 2 shows the values on those attributes over all tuples in R1. Let support = 30% be the threshold. Table 2: Transactions in R1 By the definition, we can get the frequent itemsets as below: Transaction ID A B C D 1 2 3 4 5 6 7 8 9 10 a1 b2 c3 d4 a1 b3 c4 d5 a2 b3 c5 d5 a1 b2 c3 d4 a2 b5 c3 d4 a2 b3 c5 d4 a1 b2 c3 d4 a1 b2 c5 a3 b2 c3 a2 b5 d4
PAGE 29
18 Table 3: Frequent itemsets from R1 and their supports Frequent itemset Number of tuples Support {a1} {a2} {b2} {b3} {c3} {c5} {d4} {a1, b2} {a1, c3} {a1, d4} {a2, d4} {b2, c3} {b2, d4} {c3, d4} {a1, b2, c3} {a1, b2, d4} {a1, c3, d4} {b2, c3, d4} {a1, b2, c3, d4} 5 4 5 3 5 3 6 4 3 3 3 4 3 4 3 3 3 3 3 5/10 100 = 50% 4/10 100 = 40% 5/10 100 = 50% 3/10 100 = 30% 5/10 100 = 50% 3/10 100 = 30% 6/10 100 = 60% 4/10 100 = 40% 3/10 100 = 30% 3/10 100 = 30% 3/10 100 = 30% 4/10 100 = 40% 3/10 100 = 30% 4/10 100 = 40% 3/10 100 = 30% 3/10 100 = 30% 3/10 100 = 30% 3/10 100 = 30% 3/10 100 = 30% Among 10 transactions, 3 transactions contain the items, Â‘ a1 Â’, Â‘ b2 Â’, Â‘ c3 Â’ and Â‘ d4 Â’ and the support for these transactions is 3 divided by 10 = 30%. Excluding the null set, there are 15 (15 = 24 Â–1) subsets of the frequent itemset {Â‘ a1 Â’ Â‘ b2 Â’, Â‘ c3 Â’, Â‘ d4 Â’} and all those subsets are also frequent itemsets. As we see, for every frequent itemsets in Table 3, the upward closure property holds. Before we finish this section, we clarify a couple of terms we use in this thesis. In data mining area, especially for the transaction data like market basket data where all transactions consist of items of goods, people use the word, item s, to refer to the values in transactions. Although attribute value is the more general-purpose terminology in a database, it will sometimes be more suitable for us to refer to an attribute value as an item in this thesis. The reason is that once we adopt frequent itemset mini ng to the task of selectivity estimation, it is more natural to use original terminology used in data mining. So, we will use the terms item and
PAGE 30
19 attribute value interchangeably and also we will use the word an itemset to refer to an attribute value set We also note that while a cardinality is the number of cardinal members in a set the cardinality in database is the total number of tuples in a relation. 3.2 Overview of Our Approach In this thesis, we describe a new summar ization method based on frequent itemsets and consider two fundamental selectivity operations, equality selection and join. Selectivity for range queries can be obtained by aggregating the selectiv ity of equality selection within the range. We begin with a high-level description of how our a pproach actually works for estimating cardinality. Before the optimizer is invoked, the first thi ng we need to do is to construct profiles of every relation in the database. In our approach, we use a frequent itemset file to model a relation. Using the Apriori algorithm, we begin by mining frequent itemsets among all attribute values. We find out all frequent itemsets over a relation, and build an FI tree with these itemsets. Every time we need the data distribution of a relation to estimate the cardinality of the result of a relational algebra operation over the relation, we use the FI tr ee, in addition to a few other simple statistics on the relation. To estimate the size of the relation from a sel ection operation using our profile, we first consider all predicates present in the selection operation. As long as all of the values which satisfy the predicates are frequent itemsets, we can estimate the exactly correct cardinality for result relation since the FI tree keeps the actual fre quencies of frequent items. For example, if the selection predicates are R.A = Â‘ a1 Â’ and R.D = Â‘ d4 Â’ in the query and we have {Â‘ a1 Â’, Â‘ d4 Â’ } as a frequent itemset whose frequenc y is 60%, we can then get the exact cardinality for the result directly by acquiring the frequency from the FI tree. In addition to the FI tree that contains the cardinality of the relation, we also store the number of attribute values for every attribute over the relation. If we cannot find some of the values that satisfy the predicates in the FI tree, we use those additional counts to estimate the frequenc ies of tuples containing these non-frequent items
PAGE 31
20 using standard heuristic methods. Taking the pr oduct of the frequencies from frequent itemsets and non-frequent itemsets, finally we approxima te the frequency for the predicates in the selection query. As an example, consider these steps for the equality selection operation over the following query plan in Figure 6. Figure 6: A query plan tree for the example of the selection For the selection operation S1 the predicate set P is P = { a1 b1 d4 }. Assume that the cardinality of R is 10 and V ( R,B ) = 5 In the FI tree, {Â‘ b1 Â’} is not shown, but {Â‘ a1 Â’, Â‘ d4 Â’} is an FI with 60% frequency. We can see three attribute values on B and the summation of the frequencies for those frequent items is 80%. With these facts, we know that for the non-FIs, there are 2 attribute values (2 = 5 Â– 3) on B and each frequency is assu med as 10%, (100% 80% = 20% frequency divided by 2 attribute values). Therefore, the final frequency for S1 is 10% 60% and the estimated cardinality for the result is 0.6 10 0.06 tuples. After computing the cardinality of the result rela tion, we then build a new FI tree for the resulting relation. The query optimizer will again us e this new FI tree as a profile of the result relation when it processes the operation that the re sult relation is involved with. In other words, this new FI tree and the estimated cardinality af ter each operation are propagated for the further operations in the query plan. Joins are processed in a similar fashion but for the join operation, we first need to rebuild the FI trees of the two input relations so that the attribute values belonging to the joined attribute are found at level 1 in the tree. Fo r example, if an FI set is {Â‘b2Â’, Â‘d1Â’, Â‘c3Â’, Â‘a1Â’} and we join . R.A=Â’a1Â’ and R.B = Â‘b1Â’ and R.D=Â‘d4Â’ < R > < S > S1
PAGE 32
21 on A then we put the set {Â‘b2Â’,Â‘d1Â’,Â‘c3Â’} under {Â‘a1Â’} at level 1 in the tree. After rebuilding the two trees that will be used to estimate the result of the join, we compare all items at level 1 and get the common join attribute values across both FI trees. If there is an attribute value from the joined attribute that is present in one relationÂ’s FI tree but not the other, we drop that node and its all branches from the trees. At this point, we have only common items for joining attributes on both trees and can easily calculate the frequenc ies for the relation resulting from joining these values together. Of course it is possible that there are common values in the non-frequent items and the values dropped from the trees. For this case, we again use standard heuristic methods [28] under the uniform distribution assumption for each pair of attribute values. Just like for the selection operation, the product of two frequencies from the FI trees and from other attribute values that are not in the FI trees is th e final frequency for the resulting relation. 3.3 Data Structure of the FI Tree During our process, the frequent itemsets th at are extracted from the relation form a tree (FI tree). Unlike one-dimensional histograms that are constructed along single attribute, an FI tree exists over a relation. So, there is only one FI tree for a relation while a histogram file exists for an attribute. The root of an FI tree has the cardinality of the relation. Except the root node, each node keeps item information, such as value, attribute name, etc., a support as a frequency of the itemset and the pointers for its children. The level of this tree implies the size of an itemset, i.e., all nodes at level 1 are the 1-frequient itemsets and their ch ildren stand for the itemsets that contain this item. One thing to be aware with this tree is that the supports that are on the nodes deeper than level 1 do not mean the support of the item itself. Rather, those are the supports for itemsets that contain all items on their ancestor nodes as well as that node. For instance, we use one of the frequent itemsets of Table 3 to show this. The itemset {Â‘a1Â’, Â‘c3Â’} has support 30%. The
PAGE 33
22 subsets of this itemset are {Â‘a1Â’} and {Â‘c3Â’} and the supports of those 1-itemsets are both 50%. For these items, Figure 7 show the part of the FI tree from Table 3. Figure 7: An example of an FI tree In the FI tree shown above, there are two nodes that contain the item Â‘c3Â’. The node that contains value Â‘c3Â’ at level 1, denotes the 1-frequent itemset {Â‘c3Â’}, and we refer this node as n1. The other node that contains value Â‘c3Â’ at level 2 represents the 2-frequent itemset, which means the size of this frequent itemset is 2, {Â‘a1Â’, Â‘c3Â’}, and let this node be n2. The support of n2 is not for an item but for an itemset. Since {Â‘a1Â’} has a child for the itemset {Â‘a1Â’, Â‘c3Â’}, to avoid duplicity, n1 does not need to have a child node for Â‘a1Â’ for that itemset. Now, we provide the full FI tree of the relation R1 in Table 3. Figure 8 depicts the profiles such as an FI tree and the table fo r attribute value counts of relation R1 and we will use these often for further examples. n2 n1 10 {a1} 50% {c3} 30% {c3} 50%
PAGE 34
23 Figure 8: The FI tree for R1 After spending short description about this FI tree, we move to the next section for more comprehensive explanation. From the root node of the FI tree in Figure 8, we get the information that the cardinality of this relation is 10. If we traverse the left most branch,i.e., depth first pattern, we can get some actual data distribution: the frequency of tuples which contain Â‘a1Â’ is 50%, the frequency of tuples which contain Â‘a1Â’ and Â‘b2Â’ at the same time is 40%, the frequency of tuples which contain Â‘a1Â’,Â’b2Â’ and Â‘c3Â’ at the same time is 30%, and the frequency of tuples which contain Â‘a1Â’,Â’b2Â’,Â’c3Â’ and Â‘d4Â’ at the same time is 30%. At the other hand, if we traverse the nodes acco rding to level 1, i.e., using a breadth first search pattern, we can get some information about the attributes. For example, at level 1 of this tree, there are two 1-items whose attribute are Â‘AÂ’ and the summation of their supports is 90%. If we use smaller support threshold, the size of an FI tree increases and we can get more accuracy owing to increased information on more frequent itemsets. A:Â‘a2Â’ 40% B:Â‘b2Â’ 50% B:Â‘b3Â’ 40% C:Â‘c3Â’ 50% C:Â‘c5Â’ 30% D:Â‘d4Â’ 50% C:Â‘c3Â’ 40% D:Â‘d4Â’ 30% D:Â‘d4Â’ 30% D:Â‘d4Â’ 30% A:Â‘a1Â’ 50% C:Â‘c3Â’ 30% D:Â‘d4Â’ 30% D:Â‘d4Â’ 30% D:Â‘d4Â’ 30% { R1 : 10 } B:Â‘b2Â’ 40% C:Â‘c3Â’ 30% D:Â‘d4Â’ 30% D:Â‘d4Â’ 30%
PAGE 35
24 Note that though an FI tree keeps informati on only on frequent itemsets, we can also draw some part of information on non-freque nt itemsets from FI tree, as well. 3.4 Selectivity Estimation for Equality Selection This Section describes our method for equality selection predicates in detail, which takes advantage of the FI tree and standard heuristic s under the uniform distribution assumption for the selection operation. The entire process is outlined below: 1) First, we discuss the way to estimate the cardinality for resulting relation after the selection operation. 2) Next, we show how the profiles ar e propagated as in a query plan. 3) Last, we explain how we recompute the statis tics for attribute values after the selection operation. This is needed for further operations over the resulting relation. 3.4.1 Step One: Get the Cardinality of Resulting Relation Using the FI Tree. When using the FI tree to estimate the selec tivity of a relational selection query, there are two cases. In the first case, all of the predicates present in the query can be handled directly using the FI tree. For example, consider the following query in 4.1. Example 4: Suppose this relational selection que ry is a part of a query plan. SELECT FROM R1 WHERE A = a1 AND C = c3; We refer to the set of predicate in the query as the predicate set, denoted by P. All predicate values on the relation in a query belong to this predicate set. For example, if we have predicate values, p1, p2, Â…, pn, then the predicate set, P for the selection is P = { p1, p2, Â… ,pn }. In this example, P = {Â‘a1Â’, Â‘c3Â’}. To process selection query, first we need to check the predicate set with 1-itemsets which reside at level 1 on the FI tree and partition this into two sets; the predicate set whose members
PAGE 36
25 are all one of 1-freqient itemsets, f-P, or not, Nf-P. We treat these two sets differently. For f-P, we use the FI tree to get the number of tuple that sa tisfy the predicates in this set. The main thing when we handle f-P is to discover the completeFI(s). If there is an itemset I whose members are exactly same with those in f-P, we define that node as a completeFI. By taking the product of frequencies of these completeFIs on the FI tree, we estimate the selectivity of result relation that is comprised of frequent items. Yet, for Nf-P (those predicates that do not include any frequent itemset) we assume that all tuples that contai n one of the values in this set are uniformly distributed. So we just calculate the number of tuples that have non-frequent attribute value under uniform distribution assumption. Suppose the FI tree in Figure 8 is the profile we use for this query. In the case that there is a same itemset with P within the FI tree, we can get the exact frequency of the predicates. After checking all items at level 1, we put thes e predicates into frequent-PredicateSet, f-P. So there is no Nf-P in this query. Now we traverse the FI tree in depth first pattern to find out the complete FI. We find the frequency for the itemset,{Â‘a1Â’,Â‘c3Â’}, take its support (which is 30%), and multiply this frequency by the cardinality of th e relation. By doing this, we can obtain the accurate number of tuples which satisfy th e selection predicates: 3 tuples (3 = 30% 10 tuples). If the FI tree has one completeFI on the node and the set Nf-P is empty, the estimation of the cardinality is as same as the actual one, i.e ., the accuracy is 100%. However, it is still possible that there is no such node where th e frequent itemset is same with f-P. In this case, we try to find out the largest frequent itemset, all of whose me mbers are in the frequent predicate set from the tree, and then remove those items from the f-P. By doing this recursively, we can acquire some frequencies and take the produc t of those frequencies. 3.4.2 Step Two: Get the Cardinality of Result Relation Using the Uniform Distribution Assumption. Sometimes, not all of the predicates in P can be satisfied directly using the FI tree. To handle such cases, we store an additional set of statistics along with the tree. For example, in the
PAGE 37
26 FI tree in Figure 8, there are only two attribute values on A, Â‘a1Â’ and Â‘a2Â’ and the summation of their frequency is 90%. If a predicate value is Â‘a3Â’ which we cannot find in the tree, we need to guess that how many tuples contain this non-fre quent attribute value. For this purpose, we maintain a table that provides the numbers of attribute values over the relation. Figure 9 shows this table for relation R1. Figure 9: The numbers of attribute values over R1 Using the FI tree and this table, we can in fer some information for the non-frequent items that we need to estimate cardinality. Fo r instance, since the attribute counts for A is 3, by subtracting two items in the FI tree in Figure 8, we can know that there is only one attribute value which is not frequent and its suppor t is 10% (10 = 100% 90%). This information on tuples that have non-frequent items is needed for using the heuristics to get the cardinality without data distribution. In the heuristic method for calculati ng cardinalities, we assume uniform distribution over all attribute values. We will intr oduce the formula for this later. To see how these statistics are used in th e case that not all of the predicates in P are present in the FI tree, consider the following example query: Example 5: Here we have a query of R1. For this query, we use both FI tree in Figure 8 and the attribute counts table in Figure 9 as profiles of R1. SELECT < Attribute Values over R1 > Distinct Attribute Â‘AÂ’ values: a1,a2,a3 Distinct Attribute Â‘BÂ’ values: b2,b3,b5 Distinct Attribute Â‘CÂ’ values: null,c3,c4,c5 Distinct Attribute Â‘DÂ’ values: null,d4,d5 3 3 4 3 A B C D Attribute # of Attribute Values
PAGE 38
27 FROM R1 WHERE B = b5 AND C = c5 AND D = d4 AND A = a1; We refer to the predicate Â“B = b5Â” as p1, Â“C = c5Â” as p2 Â“D = d4Â” as p3 and Â“A = a1Â” as p4 Then the predicate set P = {Â‘b5Â’, Â‘c4Â’, Â‘d4Â’, Â‘a1Â’}. By comparing the 1-frequent items on the tree, we divide this set into a frequent predicate set, f-P = {Â‘c5Â’, Â‘d4Â’, Â‘a1Â’} and a non-frequent predicate set, Nf-P = {Â‘b5Â’}. Again for f-P, we use FI tree to get the frequency of the values, but for the non-frequent predicate set, Nf-P, since we do not have any information like attribute values, distribution and etc., we simply adapt the heuristic method to ge t the estimation. Now, we discuss the heuristic method for calculating the number of tuples resulting from the equality selection operation. In the heuristic method, things become much simpler. With only two pieces of information about a relation, we can approximate the selec tivity. Under the uniform distribution assumption, the only things we need are cardinality of a rela tion and the number of attribute values on all attributes in the relation. Consider a relation R with 1000 tuples. If we do not know any of distributions for attribute values, but we know ther e are 10 attribute values for the attribute of the equality predicate, then we assume every attrib ute value has same frequency in that relation and get 100 tuples as the number of tuples for any pred icate value for an equality selection. To give the exact algorithm, we first define some notations below: NR Cardinality of Relation R V(R A) Value count for attribute A on the Relation R Cost(S) the cost of predicate S, in terms of the number of tuples Then, the general expression for the cost of equality selection S is: Cost(S) = A) (RRV N
PAGE 39
28 The FI tree has the cardinality of the relation. Note that the summation of the supports of the nodes that are at level 1 and whose attribute names found in the predicates is the frequency of the total number of tuples which contain freque nt items on that attribute. Thus, using the cardinality and this summation, we can assume the number of tuples that contain non-frequent items on the attribute. In our example, we had previously broken the selection predicates into two sets, f-P = {Â‘c5Â’ Â‘d4Â’, Â‘a1Â’}, and Nf-P = {Â‘b5Â’ }. For processing the set f-P we find completeFI(s) as we discussed in 3.4.1. If there is no such a freque nt itemset that is exactly same with the f-P we try to find a completeFI again as if one of the subsets of the f-P is f-P and take a larger subset as a completeFI since it contains more elements in one set for a completeFI We have more than one completeFI, such as {Â‘a1Â’, Â‘d4Â’} and {Â‘c5Â’}, in our example. By multiplying the supports of these frequent itemsets, we get the frequency for the tuples which contain frequent items: % 9 % 30 % 30 For the set Nf-P, {Â‘b5Â’}, the heuristic method under the uniform distribution assumption is used to obtain the support: % 10 2 3 % 90 % 100 is attribute whose and 1 level at is which nodes of number ) ( is attribute whose node on the supports all of sum 100% B B R1 B V The product of supports of completeFIs and that for Nf-P is the frequency for the original predicates. In our example, 0.9%. 10% 9% Therefore, the cardinality for resulting relation is the estimated frequency multiplied by the cardina lity of the relation in the query. We get 0.09 tuples (0.9% tuples 10 0.09 ) as the cardinality of the result relation for our example. From the actual tuples in R1, we can see that there are no tuples th at satisfy the given predicate in our query, which means that the actual cardinality for this query is 0. So, our estimate is very close to the actual result. Below, we provide the complete algorithm for estimating the cardinality.
PAGE 40
29 Algorithm 1: GetCardinality in selection 3.4.3 Step Three: Create a New FI Tree for the Result after Selection. A result relation after a relational operation can be used as an input relation for the next relational operation in a query plan. Thus, for th e intermediate relations, we also need the profiles. So, the base profiles are propagated while the operations are considered. As a result, for every operation, we produce a new FI tree as well as the cardinality of the result relation. This function should be processed after calcu lating the cardinality of result since GetCardinality() finds out the completeFI(s) on the tree and we need to use completeFI(s) to rebuild an FI tree. To construct the result FI tree from the input FI tree, we check all nodes at level 1, and delete all branches whose attribute is in the predicates and which do not contain the completeFI For every node remaining, if there is any node whos e attribute is same with one of the predicate but the value is different, then we delete that node since after equality se lection only one value on the attribute is possible. After finishing this tr imming, we reconstruct the FI tree to satisfy the upward closure property where all subsets of frequent ite msets are also frequent itemsets. We also add the nodes which are in the non-frequent pr edicate in reconstructing process. Finally, we GetCardinality ( freqPredicate fP, Non-freqPredicate NfP, FI tree T ) { 1. while there is any frequent item in fP do 2. complete_FI = the largest subset of fP which is same with one FI in the T(R ); 3. supportFI = supportFI support(complete_FI ); 4. remove this subset from fP; 5. end while; 6. for all values a in NfP do 7. let attr as a.attr; 8. let C_attr as the number of 1-FIs whose attribute is attr on T; 9. let S_attr as the sum(all supports of 1-FIs whose attribute is attr on T); 10. supportNFI = supportNFI ((100 Â– S_attr)/(V(a)Â–C_attr)); 11. end for; 12. frequency = supportFI x supportNFI; 13. newCardinality = orgCardinality x frequency; 14. return newCardinality; }
PAGE 41
30 calculate the support for all newly ad ded nodes. Here the supports for completeFIs are 100% because this is the equality selection. For the children nodes ( T ) of these completeFIs the support could be calculated as: suppnew (T) = ) ( ) ( T suppold s completeFI suppold We provide depicted processes in Figure 10 followed by a summarized pseudo-code for this process. Figure 10: Overall process for getting the FI tree A:Â‘a2Â’ 40% B:Â‘b2Â’ 50% B:Â‘b3Â’ 40% C:Â‘c3Â’ 50% C:Â‘c5Â’ 30% D:Â‘d4Â’ 50% C:Â‘c3Â’ 40% D:Â‘d4Â’ 30% D:Â‘d4Â’ 30% D:Â‘d4Â’ 30% A:Â‘a1Â’ 50% C:Â‘c3Â’ 30% D:Â‘d4Â’ 30% D:Â‘d4Â’ 30% D:Â‘d4Â’ 30% { R1 : 10 } B:Â‘b2Â’ 40% C:Â‘c3Â’ 30% D:Â‘d4Â’ 30% D:Â‘d4Â’ 30% Complete-FIs < 1. Searching Mode: Finding complete-FI(s) >
PAGE 42
31 Figure 10 .Continued A:Â‘a2Â’ 40% B:Â‘b2Â’ 50% B:Â‘b3Â’ 40% C:Â‘c3Â’ 50% C:Â‘c5Â’ 30% D:Â‘d4Â’ 50% C:Â‘c3Â’ 40% D:Â‘d4Â’ 30% D:Â‘d4Â’ 30% D:Â‘d4Â’ 30% A:Â‘a1Â’ 50% C:Â‘c3Â’ 30% D:Â‘d4Â’ 30% D:Â‘d4Â’ 30% D:Â‘d4Â’ 30% { R1 : 10 } B:Â‘b2Â’ 40% C:Â‘c3Â’ 30% D:Â‘d4Â’ 30% D:Â‘d4Â’ 30% < 2. Trimming Mode > C:Â‘c5Â’ 30% A:Â‘a1Â’ 50% D:Â‘d4Â’ 30% { R1 : 10 } C:Â‘c5Â’ 30% A:Â‘a1Â’ 50% D:Â‘d4Â’ 30% { R1 : 10 } < 3. Rebuilding Mode: adding nodes > B:Â‘b5Â’ 100% D:Â‘d4Â’ 100% new nodes from non-frequent predicate set
PAGE 43
32 Figure 10 .Continued Algorithm 2: GetFITree in selection 3.4.4 Step Four: Update the Table for A ttribute Counts for the Result after Selection. After the selection operation, the numbers of attr ibute values for attributes that involved in the predicates will change. So we need to upda te the attribute value counts for the resulting relation. Since this is the equality selection, only two cas es are possible for the attributes after this operation. First, if the predicate value is one of the possible values in the domain of the attribute, all tuples would have this predicate value for that attribute as a result of this operation. Otherwise, Algorithm GetFITree ( freqPredicate fP,Non-freqPredicate NfP,FItree T ) { 1. for every 1-level node t of T do 2. if t does not have complete-FI 3. drop t from T; 4. end for; 5. for all node t1 of T do 6. let an item in fP as p if t1.attribute = p.attribute; 7. if and t1.value <> pk.value 8. drop t1 from T; 9. end for; 10. add all item in NfP onto T; 11. rebuild T to satisfy upward closure property 12. recalculate supports for all node } C:Â‘c5Â’ 100% A:Â‘a1Â’ 100% D:Â‘d4Â’ 100% { Result relation: 0.09 } < 4. Updating Support Mode > B:Â‘b5Â’ 100% D:Â‘d4Â’ 100%
PAGE 44
33 the cardinality after this operation would be 0. Ther efore, for those attribute which are part of the predicates, we set the number of attribute values to be 1 (value), and for the rest of attributes which are not predicate in the given query, we just keep the original value count. For our example query, all attributes in this re lation are used as predicate. So, the table in Figure 9 would be updated as Table 4: Table 4: The updated table for numbe rs of attribute after selection Below is the summary of this procedure. Algorithm 3: UpdateAttrCounts in selection 3.5 Selectivity Estimation for Join In this section, we demonstrate how we use FI trees for the join operation. The basic idea behind the cardinality estimation for the join operati on using FI trees is very straightforward as follows. For the join operation, our concern is only over common items in both relations. So we find the common items on the joined attributes in both FI trees. We take the product of the two frequencies on each tree for each common item, and the summation of all these frequencies after UpdateAttrCounts { 1. for every attribute Â“aÂ” on V( R ) do 2. if a is same with one in P( Q ) 3. V(R, a) = 1; 4. else 5. keep the value; 6. end if 7. end for } 1 1 1 1 A B C D Attribute # of Attribute Values
PAGE 45
34 production is the frequency for the common items am ong a frequent items. If an attribute value is one of the frequent items but it is not on the other tree, we drop this item and all its branches from the FI tree and treat this item as a non-frequent item. For the join operation with non-frequent items, we again assume that all tuples that c ontain one of these non-frequent items are uniformly distributed. Using the formula to get the cardinality for join operation under the uniform distribution assumption, we estimate the frequenc y for result after the join operation with these tuples. The final frequency for resulting relati on is the summation of these two frequencies. Figure11 illustrates the process. Figure 11: Join modeling The remainder of this section is outlined as follows: 1) First, we define some notati ons that we need for effective description of all processes for the join operation. Whole Relation Frequent Items: All attribute values and supports are kept Â•Attribute coun t Â•cardinality N on-Frequent Items: NonFIRelation Frequent Items: FIRelation Frequent Items: All attribute values and supports are kept Â•Attribute coun t Â•cardinality Whole Relation Find common items CommonFIRelation: All attribute values and supports are kept CommonFIRelation: All attribute values and supports are kept FI but uncommon FI but uncommon Non-FI Non-FI Frequent Items: FIRelation N on-Frequent Items: NonFIRelation +
PAGE 46
35 2) Then, we show how to get the cardinality for resulting relation after the join operation. 3) For future operations in the query plan, we need to build an FI tree profile of the resulting relation. Using an example, we illu strate how we construct a new FI tree from two input FI trees corresponding relations joined. 4) After the join operation, the number of attr ibute values on join attributes may change. Since the table for attributes value counts is al so used as a profile with the FI tree, we describe how to construct a ne w table for resulting relation. 3.5.1 Notations Used. Assume that we have a relation R and we are joining this relation with other relation, S, on attribute A. Attribute A in R has n values, { a1, a2, a3, Â…, an}. Let Tuples(ak) be all tuples which have attribute values ak, then : |R| = f(a1) + f(a2) + f(a3) +Â…+ f(an), Here, f (k) is the frequency for attribute value k such that f(ak) = Count ( Tuples ( ak)). Since every tuple has exactly one value on A, the total number of tuples, |R| can be expressed as the sum of frequencie s for all attribute values on a certain attribute is the cardinality of the relation. Based on this idea, we define some of the notations we will use in this section: NR Cardinality of Relation R V(R A) Value count for attribute A on the Relation R, JoiningFI The value set whose all members are frequent items for the join attribute. CommonFI The value set whose all members belong to JoiningFI and are common in both FI trees. CommonFItuples All tuples that contain one of the attribute value in CommonFI. FIRelation (R) A virtual relation that is extracted from relation R when a tuple contains
PAGE 47
36 one of the members in JoiningFI. All tuples in this relation can be divided into CommonFItuples and unCommonFItuples. NonFIRelation (R) A virtual relation that is comprised of the remaining tuples when FIRelation (R) extracted from R. CommonFIRelation (R) A virtual relation which is comprised of CommonFItuples OtherRelation (R) A virtual relation which is comprised of uncommonFItuples and NonFIRelation (R). 3.5.2 Step One: Get the Cardinality for the Result after Join. We begin by assuming that we have already extracted the common items, CommonFI by comparing 1-frequent itemsets on join attributes in both relations and we dropped all uncommon 1-itemsets and their branches. So, now the FI tr ees have only common items on the join attribute in the tree. From the tree, we get the actual frequency for every item on the node at level 1 and actual tuples that contain those attribut e values. These tuples are defined as CommonFIRelation and we do a join over these two CommonFIRelations. By multiplying the tuples in these virtual relations, we get the exact tuple counts for the join result of these two CommonFIRelations. For the other virtual relations, OtherRelation s produced from both relations (that are made up of uncommonFItuple and NonFIRelation (R)), we know only the total number of tuples and the attribute counts on join attribute over these rela tions. So, we just adap t the simple heuristic method for calculating join result under uniform distribution assumption. The reason we do not ignore uncommon items even after deleting them from trees is that there is a possibility that the same attribute value exists among non-frequent items tuples in the other relation. In this case, these two values can be joined together and would contribute to join result. Since this is a join operation for two relations, we need one more relation, R2, for our running example. We introduce R2 in Table 5 and the corresponding FI tree and the table for attribute value counts are depicted in Figure 12.
PAGE 48
37 Table 5: Transactions in R2 Figure 12: Profiles for R2. a) the FI tree for R2 b) table for attribute value count for R2 Using these two relations, we consider a join example. Select from R1, R2 where R1.A = R2.A; For the relation R1, the frequent itemsets with size 1 are {Â‘a1Â’: 50%}, {Â‘a2Â’: 40%} and R2 has same 1-frequent itemsets but with different frequencies, {Â‘a1Â’: 40%}, {Â‘a2Â’: 40}. So after comparing 1-frequent itemsets of these two relations to get the common item values, CommomFIRelations are same with FIRelations for both relations in this case. The cardinality when we join these two virtual relations is the su mmation of the number of tuples with the tuples after joining attribute values Â‘a1Â’ and those after joining attribute value Â‘a2Â’, which is 18 tuples: a1 (40.0%) a2 (40.0%) e3 (40.0%) e5 (60.0%) f5 (40.0%) a1 e3 (40.0%) a2 e5 (40.0%) e5 f5 (40.0%) R2 : 5 {a1} 40% {e3} 40% {e3} 40% {a2} 40% {e5} 40% {e5} 60% {f5} 40% {f5} 40% b) Table for attribute count over R2 < Attribute Values over R2 > Distinct Attribute Â‘AÂ’ values: a1, a2, a7 Distinct Attribute Â‘EÂ’ values: e3, e5 Distinct Attribute Â‘FÂ’ values: Null,f2,f4,f5 3 2 4 A E F Attribute # of Attribute Values < Frequent Itemsets over R2 > Transaction ID A E F 1 2 3 4 5 a1 e3 f4 a2 e5 f5 a7 e5 f5 a1 e3 f2 a2 e5
PAGE 49
38 Tuples after joining attribute value Â‘a1Â’: tuples 10 40% tuples 5 50% tuples 10 Tuples after joining attribute value Â‘a2Â’: tuples 8 40% tuples 5 40% tuples 10 Cardinality ( CommomFIRelations ) : tuples 18 tuples 8 tuples 10 For NonFIRelation (R1) and NonFIRelation (R2)which are also part of the original relations, there is 1 attribute value on A in both R1 and R2. Since V (R1,A) = 3 and there are 2 attribute values for attribute A in the FI tree, by subtracting the attribute values in FIRelation (R1) from that of the original values in R1, we have 3-2 = 1 value count for virtual relation NonFIRelation (R1). For the R2, we have same result. The cardinality for OtherRelation(R1) is 1 tuple ( 40%)) (50% (100% tuples 10 tuples 10 tuple 1 ) and 1 tuple for OtherRelation(R2). Therefore, by adapting the heuristics under the uniform distribution assumption, we get the cardinality after joining these OtherRelations : Cardinality(OtherRelations) = ) ) ( ), ( ( A R2, A R1 R2 R1 V V MAX N N = ) 1 1 ( 1 1 MAX = 1 The final cardinality of the join operation is computed with the cardinalities, Cardinality ( CommomFIRelations ) and Cardinality ( OtherRelations ) of the two virtual relations for both original relations, R1 and R2 In our example, we get the result cardinality after joining R1 and R2 by adding these two cardinalities: 18 + 1 = 19 tuples. Considering the fact that the actual cardinality after joining relation R1 and R2 on attribute A is 18 tuples, this estimation is reasonable. The summarized algorithm for this procedure follows.
PAGE 50
39 Algorithm 4: GetCardinality in Join 3.5.3 Step Two: Get the FI tree for the Result after the Join The first thing we have to do to build a new FI tree for the result relation is to rebuild the input FI trees: All nodes whose attribute value is in a CommonFI have the first priority, so that we let the attribute values on the join attribute be the base node in an FI tree for each frequent itemset. We define a base node to be the first node we meet when we traverse an itemset in depth search pattern. It resides at level 2. For example, suppose a frequent itemset is {Â‘b2Â’,Â‘d1Â’,Â‘c3Â’,Â‘a1Â’} and Â‘a1Â’ is the value on A When we join on A, we rebuild the FI tree so that all items on A become base nodes. To do this, we put the all items in {Â‘b2Â’,Â’d1Â’,Â’c3Â’} under {Â‘a1Â’}, which means that all of items in th is set would be the descendents of Â‘a1Â’. After the rebuilding process, we drop all 1-iemsets whose attribute is joined one but the value is not in CommonFI. When we drop them we also remove all their branches. In our example, since all values in frequent itemsets are in CommonFI over R1 and R2, there is no node or branch deleted. Figure 13 is the result of this deleting process in our example. GetCardinality ( left-join-relation left, right-join-relation right ) { 1. Let a be joining attributes in both relations 2. Let T(left) be the FI tree for left and T(right) be the FI tree for right 3. Find common-items,commonFI,and delete uncommon-items from both FI trees 4. For all i in commonFI 5. cardinality1= Sum(|left|x f(i) in left x|right|x f(i) in right); 6. end for; 7. V(OtherRelation(left),a) = V(left.a) Â– Count(left.commonFI); 8. |OtherRelation(left)| = |left| Â– |CommonFIRelation(left)|; 9. V(OtherRelation(right),a) = V(right.a) Â– Count(right.commonFI); 10. |OtherRelation(right)| = |right| Â– |CommonFIRelation(right)|; 11. cardinality2 =(|OtherRelation(left)| x |OtherRelation(right)|) / MAX(V(OtherRelation(left),a), V(OtherRelation(right),a)); 12. |result| = cardinality1+ cardinality2; }
PAGE 51
40 Figure 13: After rebuilding and deleting uncommon items For all 1-itemsets whose attribute are not the jo in attribute and all their descendent nodes, we put this part of tree as unJoinedgroup and for all 1-itemsets whose attribute are the join attribute and all of their descendents, we call Joinedgroup Now, we check if there is any same itemset between these two groups. If there is a node n under the joining item in Joinedgroup we subtract the support on n from that of the node in unJoinedgroup In our example, on the FI tree of R1, we see that {Â‘b2Â’}, {Â‘b2Â’, Â‘c3Â’} and {Â‘b2Â’, Â‘c3Â’, Â‘d4Â’} exist under {Â‘a1Â’} and there are same itemsets with th ese three itemsets in unJoinedgroup. So we subtract supports on those three nodes under {Â‘a1Â’} from the nodes in unJoinedgroup The reason we are doing this is that the A:Â‘a2Â’ 40% B:Â‘b2Â’ 50% B:Â‘b3Â’ 40% C:Â‘c3Â’ 50% C:Â‘c5Â’ 30% D:Â‘d4Â’ 50% C:Â‘c3Â’ 40% D:Â‘d4Â’ 30% D:Â‘d4Â’ 30% D:Â‘d4Â’ 30% A:Â‘a1Â’ 50% C:Â‘c3Â’ 30% D:Â‘d4Â’ 30% D:Â‘d4Â’ 30% D:Â‘d4Â’ 30% { R1 : 10 } B:Â‘b2Â’ 40% C:Â‘c3Â’ 30% D:Â‘d4Â’ 30% D:Â‘d4Â’ 30% R2: 5 {a1} 40% {e3} 40% {e3} 40% {a2} 40% {e5} 40% {e5} 60% {f5} 40% { f5 } 4 0% Â‘ a1 Â’ and Â‘ a2 Â’ are common items in both relations R1: 10
PAGE 52
41 frequency of {Â‘b2Â’, Â‘c3Â’, Â‘d4Â’} is the frequency of the summation of the tuples when occurring with attribute value Â‘a1Â’ and the tuples that happens with other attribute values on A. Since we Figure 14: Changing frequencies for nodes use different methods when we handle the two virtual relations (two CommomFIRelations using FI trees and two OtherRelations using the formula under uniform distribution assumption) by subtracting the frequency when the itemset occurs with one of the items in CommonFIRelation we can obtain the frequencies in OtherRelation s while avoiding duplicity. FI trees after this process is in Figure 14. A:Â‘a2Â’ 40% B:Â‘b2Â’ 10% B:Â‘b3Â’ 40% C:Â‘c3Â’ 20% C:Â‘c5Â’ 30% D:Â‘d4Â’ 0% C:Â‘c3Â’ 10% D:Â‘d4Â’ 0% D:Â‘d4Â’ 0% D:Â‘d4Â’ 30% A:Â‘a1Â’ 50% C:Â‘c3Â’ 30% D:Â‘d4Â’ 30% D:Â‘d4Â’ 30% D:Â‘d4Â’ 0% { R1 : 10 } B:Â‘b2Â’ 40% C:Â‘c3Â’ 30% D:Â‘d4Â’ 30% D:Â‘d4Â’ 30% R2 : 5 {a1} 40% {e3} 40% {e3} 0% {a2} 40% {e5} 40% { e5} 20% {f5} 40% { f5} 40% R1: 10 The node which can be found under the joined attribute JoinedGroup
PAGE 53
42 For example, consider Figure 14. In the FI tree for R1, two 1-frequent itemsets {Â‘a1Â’} and {Â‘a2Â’} and all of their descendents make up Joinedgroup The rest of the nodes in the tree are in unJoinedgroup There are two nodes whose values are Â‘d4Â’ in Joinedgroup and there is one node with value Â‘d4Â’, which is for 1-frequent itemset, in unJoinedgroup s. So we subtract the summation, 60% = 30% + 30%, from that of 1-frequent itemset {Â‘d4Â’}, 60%. As a result, the support for {Â‘d4Â’} in unJoinedgroup is 60% 60% = 0. By subtracting the support in Joinedgroup s, we can get the support for this item in OtherRelation (R1). So, our example implies that attribute value Â‘d4Â’ always happens with attribute value Â‘a1Â’ or Â‘a2Â’ and there is no such a tuple that containing attribute value Â‘d4Â’ in OtherRelation (R1). After subtracting the frequency when this item occurs with one of CommonFI we consider the case for joining with two OtherRelations. By adapting the uniform distribution assumption, we calculate this frequency. In Figure 14, {Â‘b2Â’, Â‘c3Â’} in unJoinedgroup has 10% support. Since we already removed the frequency of the case that this item occurs with any item in CommonFI this frequency means that there is 1 tuple ( tuple 1 10% tuples 10 ) containing {Â‘b2Â’, Â‘c3Â’} and this tuple has any unknown attribute values on attribute A. From the table for attribute counts and CommonFI we know that there is 1 attribute value on A and 1 tuple (1 tuple = 40%)) (50% (100% 10) in OtherRelation (R1) In this way, we can compute the information about OtherRelation (R2) These are all information needed wh en we assume uniform distribution assumption. In the next step, we handle all nodes in Joinedgroup We obtain each frequency after joining with the attribute value in CommomFI by multiplying two frequencies in both FI trees. Like in selection queries, we update the frequenc ies for descendents of those joining attribute values. Once we finish getting the all supports for nodes which are related with CommonFI we add the new supports of nodes in JoinedGroup to the same valued nodes in unJoinedGroup. By doing
PAGE 54
43 this, we accomplish to calculate new supports of all nodes for new FI tree. Finally we merge both trees and Figure 15 shows this new FI tree. Figure 15: The FI tree for result relation We provide the algorithm for this procedure, and move to the last function of this operation. A:Â‘a2Â’ 42.1% B:Â‘b2Â’ B:Â‘b3Â’ 35.9% C:Â‘c3Â’ C:Â‘c5Â’ 26.5% D:Â‘d4Â’ 52.7% C:Â‘c3Â’ D:Â‘d4Â’ D:Â‘d4Â’ D:Â‘d4Â’ 31.6% A:Â‘a1Â’ 52.6% C:Â‘c3Â’ 31.6% D:Â‘d4Â’ 21.1% D:Â‘d4Â’ 21.1% D:Â‘d4Â’ { Result Relation : 19 } B:Â‘b2Â’ 42.1% C:Â‘c3Â’ 31.6% D:Â‘d4Â’ 21.1% D:Â‘d4Â’ 31.6% {e3}52.6%{e3} {e5} 42.1% {e5} {f5} 40% {f5}35.9% (5.3+ 42.1)% (10.5+ 31.6)% (0+ 52.6) % (5.3+ 42.1) % (5.3+ 31.6)% (0+ 21.1)% (5.3+ 21.1)% (0+ 31.6)%
PAGE 55
44 Algorithm 5: GetFITree in join 3.5.4 Step Three: Update the Table for Attribute Counts for the Result after Join. When we join, except for the join attribute, we keep all values on the other attributes. So the numbers of attribute values for other attributes after join are same as the original value counts. We merge two input tables for the number of attr ibute values and we update the value count for only the join attributes with smaller one among t hose for the join attributes in both relations. Now, we go back to our example. After merging both tables in Figure 9 and Figure 12, we update the value counts for attribute A. As a result, we have a table like Table 6. Algorithm GetFITree ( left-FItree T(left), right-Fitree T(right) ) { 13. let a joined node for both relation 14. For T(left) and T(right) 15. Rebuild FiTrees so that all value whose attribute is a exist only at level 1 16. For all nodes at level 1 17. drop all nodes whose attribute is a and node.value is not in commonFI 18. end for 19. let T1() is a tree which consists with all items on a and their branches 20. let T2() is a tree such that T Â– T1(); 21. for all nodes n in T2 22. if there is a same itemset with n in T1, subtract the frequency from the support of n 23. end for 24. for all nodes n in T2 25. calculate support of n using the heuristics 26. end for 27. for all nodes n in T1 28. calculate support of node at level1 by multiplying the frequency in the other relation 29. calculate supports for all of its children node 30. end for. 31. for all nodes n in T2 32. if there is a same itemset with n in T1, add its frequency to the frequency of n 33. end for 34. end for 35. newTree = Merge(T(left), T(right)) }
PAGE 56
45 Table 6: Number of attribute values for result relation after join R1 and R2 By including the algorithm for this process below, we conclude this chapter. Algorithm 6: UpdateAttrCount for join UpdateAttrCounts { 1. Let Table(left) as the table for attribute counts in left relation; 2. Let Table(right) as the table for attribute counts in right relation; 3. Table(result) = Merge(V(left), V(right)); 4. For joined attributes in Table(result), update the number of attribute values with MIN(V(left.joinAttribute), V(right.joinAttribute)); } # Of Attribute Values R1.A R1.B R1.C R1.D R2.A R2.E R2.F Attribute M in ( V (R1,A), V (R2,A)) = M in (3, 3 ) = 3 3 4 3 Min ( V (R1,A), V (R2,A)) = Min (3, 3 ) = 3 2 4
PAGE 57
46 CHAPTER 4 EXPERIMENTS AND RESULTS 4.1 Goal of Experiments Our goal through these experiments is to test the ability of our new method to estimate the cardinality for every resulting relations after each operation in a query plan compared to when we use histograms. In our experiments, we tested with two kinds of histograms. First, we used one of the most popular histograms, the equi-depth histogram The next type of histogram we used is histograms that have as many buckets as attri bute values on an attribute. We call this as a Complete histogram in this thesis. Complete histograms have the best performance among all possible one-dimensional histograms. By conductin g experiment with Complete histograms, we wanted to compare our technique with the best possible one-dimensional histograms. We performed our experiments with two realis tic datasets. We gene rated dataset 1 using TPC-R benchmark for real life database. For data set 2, we manipulated some data values in dataset 1 to produce more correlated data. While both histogram methods showed very ba d estimation for costs for correlated data, surprisingly, we found that all methods we used were not bad in terms of choosing the optimal plan. For example, though equi-depth histograms sh owed the worst error rate for every plan of a query, they still selected the best plan with high accuracy. For the correlated data and/or skewed data, our frequent itemsets method significantly outperformed the other methods, in terms of having the lowest absolute error, though all me thods still generally agreed which plan was the best.
PAGE 58
47 4.2 Methodology In our experiments, we tested three different options for selectivity estimation: 1) Equidepth histograms, 2) FI trees and 3) Complete histograms. In histogram method, the number of buckets is a factor that can make a difference in estimation. We used same bucket size over all e qui-depth histograms, where total file size for equi-depth histograms was set to be as close to 10M as possible. The synopsis size for each attribute was chosen to be proportional. For example, assume that one of our relations, R, has two attributes where attribute A had 100 distinct attribute values and attribute B has 200 attribute values. If we use 10 as the number of buckets, ther e would be 10 attribute values in a bucket of histogram for A and 20 for B. For FI trees, we also used same profile size, which is 10MB over all relations. We tried to find the threshold which makes the size of a freque nt itemset become almost same with the sum of sizes of histograms related with the relation. Fo r instance, if there are 5 attributes and the sum of those five histogram files is 1M, then the th reshold for frequent itemsets would be chosen to that the size of FI tree file is almost 1M. If data is not skewed, it is not always easy to keep comparable size because the FI tree size would be increased suddenly in case that most of frequencies of frequent itemsets are similar. In our experiment, almost all of FI files are smaller than the sum of histogram file sizes. This was done to ensure the fairness. The third method we tested is histograms using unlimited size, which we refer as Complete histograms In Complete histograms, every attribute value and its frequency are contained in a different bucket, so the number of buckets and the number of attribute values are exactly same and we can get actual distribution for every attri bute value. If we do not consider the size of profile, and we can store all data distributes in a histogram, this would be the best possible histogram among all one-dimensional histograms. Howe ver, in reality, this kind of histograms is not usually practical since the purpose of such profiles in database is to model the data distribution from compressed information. Nevert heless, the reason we used these histograms in
PAGE 59
48 our experiments is that we wanted to check wh ether even the best possible histogram can give good accuracy when data attributes are correla ted with each other and whether our method is comparable with the best possible histogram. We used the popular TPC-R benchmark for our testing [29]. TPC-R is a decision support benchmark. It evaluates the performance of vari ous decision support systems by executing sets of business-oriented queries. However, the data generated by TPC-R data generation program is uniformly distributed on all attributes, which is not realistic. So, to get skewed data, we downloaded the publicly available program from re searchers at Microsoft [30] and generated more realistic data using their program. This program takes the Zipfian parameter for the degree of skew and uses the Zipfian distribution to provide skew in data. The Zipfian value as a parameter can be from 0 to 4. The parameter z = 0 generates a uniform distribution for each attribute, whereas z = 4 generates a highly skewed data. By using Zipfian value 2, we create a little bit of skew when we generate data for our experiment. We also used 0.1 as a scale factor governing the size of the database produ ced by the data generation program. There are total 8 relations in this database. The largest relation is LINEITEM and its cardinality is 600,000 tuples if a scale factor of 0.1 is used. The relation PARTSUPP has 80,000 tuples, the relation PART contains 20,000 tuples, the cardinalities of CUSTOMER and ORDERS relations are 15,000 tuples and that of the relation SUPPLIER is 10,000 tuples. Both NATION and REGION relations are comprised of 5 tuples. Fi gures 16 depicts partial schema of the database we used.
PAGE 60
49 Figure 16: Partial schema of TPC-R database < PART > Cardinality: 20,000 < PARTSUPP > Cardinality: 80,000 < LINEITEM > Cardinality: 600,000 < ORDERS > Cardinality: 150,000 < SUPPLIER > Cardinality: 10,000 < CUSTOMER > Cardinality: 15,000 < NATION > Cardinality: 25 < REGION > Cardinality: 5 PARTKEY PARTKEY SUPPKEY SUPPKEY SUPPKEY PARTKEY CUSTKEY ORDERKEY NATIONKEY NATIONKEY NATIONKEY REGIONKEY REGIONKEY CUSTKEY ORDERKEY
PAGE 61
50 We generated two different data bases for our experiments. Firs t, we generated a database with a medium amount of skew, using a skew fact or 2. In this set, we produced a very weak correlation between data values. To see how our different three methods work w ith correlated data, we used another dataset. The schema of database and query plans were iden tical, except that we modified some of data in database to introduce additional correlation. For ex ample, we made all of tuples which contained Â‘445Â’ on SUPPKEY in LINEITEM relation have value Â‘653Â’ for attribute PARTKEY. As a result, attribute value Â‘445Â’ on SUPPKEY and attribute value Â‘653Â’ on PARTKEY are correlated each other (the dependency Â‘445Â’ Â‘653Â’ holds in this relation). We updated around 10 data values whose attributes are used in our testing queries to acquire some correlation. We chose 5 queries from the TPC-R benchmark and tested 5 query plans for each query. As our concern was estimating the selectivity of equality selection and join operations, it was necessary to change some parts of queries. For example if there was a range query, like operation, or subqueries, we cha nged the query to be more appr opriate for our implementation. For every query plan we created for each query, we used all three methods to calculate costs. To compare the accuracy for the estimates from thes e three methods, we imported the data from the database generated by TPC benchmark into Oracl e DBMS. By doing this, we could obtain the actual number of tuples for every operation in qu ery plans and compute the error rate for each plan. The formula we used for error rate is, t actual t actual t estimated cos cos cos 4.3 Results The following tables present our results. Th e result compared in our experiment is the number of tuples. We consider the number of tuples as a cost in this thesis. All queries and query plans we used in our experiments and corres ponding test results are provided in the appendix.
PAGE 62
51 We made font bold to denote that plan is th e best one each method chose over all tables below. For the best plan drawn by actual data we marked on the plan title. For example for Query 5 in Dataset 1, the actual best plan is plan5 and FI tree method and Complete histograms method selected the same plan but equi-depth histograms decided plan 4 as a best. 4.3.1 Dataset 1: Slightly Correlated Data plan1 plan2 plan3 plan4 plan5 Avg Error Rate equi-depth H 79.396 20.157 78.819 79.130 46.836 60.86761043 FI 0.530 0.086 0.522 0.518 0.280 0.387463054 Complete H 0.007 0.001 0.007 0.000 0.000 0.003219214 plan1 plan2 plan3 plan4 plan5 Avg Error Rate equi-depth H 597.180 609.966 596.337 68.960 490.425 472.573748 FI 0.518 0.519 0.520 0.535 0.484 0.514999855 Complete H 0.002 0.007 0.002 0.052 0.047 0.021898833 plan1 plan2 plan3 plan4 plan5 Avg Error Rate equi-depth H 357.749 640.466 640.348 530.899 1,372.049 708.3024413 FI 0.032 0.069 0.069 0.734 1.926 0.565942203 Complete H 0.036 0.083 0.083 0.681 1.884 0.553669026 plan1 plan2 plan3 plan4 plan5 Avg Error Rate equi-depth H 8,030.033 4,185.088 62.072 629.358 760.616 2733.433287 FI 27.684 15.175 0.235 2.431 2.705 9.645993616 Complete H 27.970 15.146 0.145 2.351 2.489 9.620216726 4.3.2 Dataset 2: More Correlated Data plan1 plan2 plan3 plan4 plan5 Avg Error Rate equi-depth H 0.663 1.490 0.663 3.728 3.911 2.090932 FI 0.002 0.001 0.532 0.225 0.026 0.1572 Complete H 0.883 0.776 0.882 0.669 0.628 0.767486 plan1 plan2 plan3 plan4 plan5 Avg Error Rate equi-depth H 39.411 7.960 25.142 9.057 10.806 18.47518007 FI 0.072 0.112 0.063 0.300 0.447 0.198344041 Complete H 0.005 0.121 0.085 0.306 0.487 0.200156443
PAGE 63
52 plan1 plan2 plan3 plan4 plan5 Avg Error Rate equi-depth H 19.966 8.213 17.572 8.984 11.816 13.30994855 FI 0.402 0.102 0.339 0.248 0.407 0.299642622 Complete H 0.434 0.111 0.351 0.246 0.447 0.317814133 plan1 plan2 plan3 plan4 plan5 Avg Error Rate equi-depth H 47.990 46.004 47.718 2.261 24.352 33.6647 FI 0.441 0.445 0.043 0.281 0.475 0.337125 Complete H 0.541 0.571 0.541 0.752 0.748 0.630725 plan1 plan2 plan3 plan4 plan5 Avg Error Rate equi-depth H 146.101 169.110 169.055 507.066 1,261.875 450.641351 FI 0.528 0.686 0.686 0.619 1.647 0.833272693 Complete H 0.501 0.646 0.645 0.883 2.253 0.985678458 plan1 plan2 plan3 plan4 plan5 Avg Error Rate equi-depth H 2.917 3.096 57.677 24.944 461.466 110.020 FI 0.983 0.981 0.357 0.837 1.469 0.925325 Complete H 0.981 0.979 0.217 0.812 1.669 0.931379 4.4 Discussion Overall, the FI tree method always outperform ed than equi-depth histogram in terms of accuracy in calculating cardinality. By conducting e xperiments using dataset 1 we could see that when data has little correlation, the Complete hi stograms (which contain all one-dimensional data distribution information) has almost perfect accuracy as we expected. The second data set that has more correlation showed that our new method can work very well for correlated data. In this environment, our FI method even returns more precise results than those using the Complete histograms since th e AVI assumption which all one dimensional histograms (including Complete histograms) have to adopt for multi-dimensional data performed poorly. Thus, even though the Complete histograms keep all information on data distribution for single attribute, the results from this method we re less accurate than those from the FI tree
PAGE 64
53 method. Therefore, we draw a conclusion that our method estimates better than any of onedimensional histograms for correlated data. Among the results we obtained, those for query 5 in dataset 2 are particularly interesting. Due to the relatively high error rate in the Co mplete histogram method, it seems that there is correlation among data over relations which used in the query. Nevertheless, our FI tree method also showed poor estimates for this query. Why did the FI tree (which is designed for use with correlated data) not work well? Although we created some correlation between partkey and size in PART relation, if data is not skewed, the po ssibility we will not keep the correlation in our profile is higher, which means the accuracy go es lower. Despite the correlation we added to the data in PART relation, there is only one itemset that has a value on partkey with a value on size, in our FI tree file. This is due to that fact all of itemsets which contain partkey value and size value have almost same frequencies, and if we store all of these itemsets in a file, the size of the FI tree file balloons and the synops is becomes too large. Consequently, we lost lots of correlation information and the estimation wa s comparably poor when these two attributes were involved like in query 5. However, the es timations from this method are still better than those from equi-depth histograms. Even though our motivation started from the fact that the current histogram method is bad (especially when data have correlations and/or when data is skewed), the results from our experiments show that the FI tree method works be tter than equi-depth histogram way even there is little correlation. As most of profiles do, this technique also has trade back between synopsis size and accuracy. If we use lower threshold to mine frequent itemsets, the estimation can be more precise but the synopsis (which resides in memory when used) becomes bigger. As a result, to decide appropriate threshold would be a challenge when using this method. When looking over the tables above to see th e ability of methods to choose the best plan, we can encounter a bizarre situation, which is the accuracy to decide the best plan is not related with the accuracy to calculate co sts. For query 2 in dataset 2, the average error rate for FI
PAGE 65
54 technique is very low and it returns most accurate estimated costs in that query, but it is the only method that selected the best plan incorrectly. As we discussed above, in dataset 2 the FI tree method shows the most precise results but this me thod is the worst to choose the best plan. We might assume that we do not have enough data to judge the ability in selecting the best plan, but moreover these results remind us that the purpose of query optimizer is to avoid the poor plan to execute in reality. In our implementation, we put values on all attr ibutes into a same pool at the same time to find out frequent itemsets over a relation. However, for the usage of query optimization, we may give options to DBAs so that they can avoid keeping unnecessary values in a profile and have more useful information. Based on the statistic of queries, a DBA can decide attributes that are compared together for constructing FI tree files. For instance, if relation R has 10 attributes but only two or three attributes are used often in que ries, instead of having all frequent itemsets over 10 attributes, we can extract frequent items among data on these two or three attributes.
PAGE 66
55 CHAPTER 5 CONCLUSION AND FUTURE WORKS In this thesis, we proposed a new method to get better selectivity estimation for correlated data. We used frequent itemset mi ning technique for cost-based query optimization. Our approach is effective for skewed data as well. Our expe riments show the potential utility of our method since the FI tree method always outperforms equi-dep th histograms. In terms of choosing the best query plan, no method was clearly the best, but our method shows an excellent ability to calculate frequencies for result relations. In our work, to avoid complexity, we did not use all information we can draw from an FI tree when calculating the costs. If we use as much information as possible from the FI tree, we may get better selectivity estimation. By leav ing this as a future work, we conclude.
PAGE 67
56 APPENDIX EXPERIMENTAL RESULTS Here, we provide all test results we obtained. These are the results performed with two datase ts. For each dataset, we used five different queries and five query plans for each query. So there are 25 different query plans for each dataset. We used three different methods to estimate costs for every operation in a query plan in our experiments. We provide five different queries and the corresponding possible query plans we used and we denote the estimates for every operation in a plan. Total cost is calculated by taking the summation of all costs in a plan for each method. Among total costs, the cost presented in bold font is the best approximation to the actual cost. There are four numbers in the cost of an operation. The first number is the actual execution result for the operation. We got those numbers by executing the operation in ORACLE DBMS. The number that appears along side Â‘e)Â’ is the estimate from equi-depth histograms, and the re sult from using frequent itemsets is known on the third line starting with Â‘f)Â’. The last line that st arts with Â‘c)Â’ is the estimate from Complete histograms.
PAGE 68
57 DataSet 1: Slightly Correlated Data select l_orderkey, o_orderdate, o_shippriority from customer, orders, lineitem where c_mktsegment = Â‘AUTOMOBILE' and c_custkey = o_custkey and l_orderkey = o_orderkey; Query 1 Modified from Query #3 in TPC-R. Query Plan1 customer.mktsegment=Â‘AUTOMOBILEÂ’Â’ custkey orderkey Query Plan2 customer.mktsegment=Â‘AUTOMOBILEÂ’ custkey orderkey <3,013> e)3,000 f)3,013 c)3,013 <29,624> e)250,191 f) 45,195 c) 29,704 <17,865,176> e)1,438,652,325 f) 27,343,310 c) 17,998,830 < 17,897,813> e) 1,438,905,516 f) 27,391,518 c) 18,031,547 <90,905,297> e)862,533,324 f) 90,750,963 c ) 90 892 314 <3,013> e)3,000 f)3,013 c)3,013 <17,865,176> e)1,438,824,355 f) 27,343,310 c) 17,998,830 Total Cost <108,773,486> e) 2,301,360,679 f) 118,097,286 c) 108,894,157 Total Cost e) estimate from equi-depth histograms f) estimate from FI trees c) estimate from Complete histograms
PAGE 69
58 Query Plan3 custke y orderkey customer.mktsegment =Â‘AUTOMOBILEÂ’ custkey orderkey customer.mktsegment =Â‘AUTOMOBILEÂ’ Query Plan5 customer.mktsegment =Â‘AUTOMOBILEÂ’ custke y orderkey Query Plan4 <148,044> e)1,250,958 f) 225,000 c) 147,875 <29,624> e)250,192 f) 45,195 c) 29,704 <17,865,176> e) 1,438,652,325 f) 27,198,641 c) 17,998,830 < 18,042,844> e) 1,440,153,475 f) 27,468,836 c) 18,176,409 Total Cost <148,044> e)1,250,958 f) 225,000 c) 147,875 <89,726,688> e)7,193,261,627 f) 136,126,444 c) 89,604,370 <17,865,176> e)1,438,652,325 f) 27,198,635 c) 17,998,830 <107,739,908> e) 8,633,164,910 f) 163,550,079 c) 107,751,075 Total Cost <17,865,176> e)1,438,824,355 f) 27,343,310 c) 17,998,830 <89,726,688> e)7,194,121,776 f) 136,126,444 c) 89,604,370 <90,905,297> e)862,533,324 f) 90,750,963 c) 90,892,314 <198,497,161> e) 9,495,479,455 f) 254,220,717 c) 198,495,514 Total Cost
PAGE 70
59 select s_acctbal, s_name, n_name, p_partkey, p_mfgr, s_address, s_phone, s_comment from part, supplier, partsupp, nation, region where p_partkey = ps_partkey and s_suppkey = ps_suppkey and p_size = Â‘1Â’ and s_nationkey = n_nationkey and n_regionkey = r_regionkey and r_name = Â’ASIA'; Query 2 Modified from Query #2 in TPC-R. Query Plan1 Part.size = Â‘1Â’ region.name = Â‘ASIAÂ’ partkey suppkey nationkey regionkey <593> e) 400 f) 543 c) 543 <45,940> e)322,365 f) 45,075 c) 45,699 <33,776> e)2,677,943 f) 48,739 c) 45,860 <33,776> e)2,477,151 f) 48,781 c) 45,860 <1> e)1 f)1 c)1 <33,776> e)495,430 f) 15,334 c) 9,172 <147,812> e) 5,973,290 f) 158,473 c) 147,135 Total Cost
PAGE 71
60 Query Plan2 Part.size = Â‘1Â’ region.name=Â‘ASIAÂ’ suppkey nationkey regionkey partkey <80,273> e)664,618 f) 80,157 c) 80,282 <1> e)1 f)1 c)1 <211,846> e) 1,898,224 f) 188,182 c) 186,336 Total Cost <80,273> e)614,785 f) 80,178 c) 80,282 <16,980> e)122,957f ) 16,988 c) 16,056 <33,776> e)495,463 f) 10,315 c) 9,172 <593> e) 400 f) 543 c) 543 Query Plan3 region.name = Â‘ASIAÂ’ partkey suppkey nationkey regionkey Part.size = Â‘1Â’ <593> e) 400 f) 543 c) 543 <1> e)1 f)1 c)1 <5> e)5 f)5 c)5 <80,273> e)664,618 f) 80,157 c) 80,282 <33,776> e) 2,678,119 f) 49,122 c) 45,860 <33,776> e)535,635 f) 9,252 c) 9,172 <148,374> e) 3,878,778 f) 139,080 c) 135,863 Total Cost
PAGE 72
61 Query Plan4 Part.size = Â‘1Â’ region.name = Â‘ASIAÂ’ suppkey nationkey regionkey partkey Part.size = Â‘1Â’ region.name = Â‘ASIAÂ’Â’ suppkey nationkey regionkey partkey Query Plan5 <593> e) 400 f) 543 c) 543 <1,000> e) 925 f) 1,000 c) 1,000 <212> e) 185 f) 212 c) 200 <16,980> e) 122,957 f) 16,991 c) 16,056 <1> e)1 f)1 c)1 <1> e)1 f)1 c)1 <33,776> e) 495,463 f) 10,341 c) 9,172 <593> e) 400 f) 543 c) 543 <1,000> e) 925 f) 1,000 c) 1,000 <212> e) 185 f) 212 c) 200 <33,776> e) 495,463 f) 10,266 c) 9,172 <45,940> e) 322,365 f) 45,075 c) 45,699 <81,472> e) 819,339 f) 57,097 c ) 56 615 Total Cost <52,512> e) 619,931 f) 29,088 c) 26,972 Total Cost
PAGE 73
62 Select n_name from customer, orders, lineitem, supplier, nation, region where c_custkey = o_custkey and l_orderkey = o_orderkey and c_nationkey = s_nationkey and s_nationkey = n_nationkey and n_regionkey = r_regionkey and r_name = Â’ASIAÂ’; Query 3 Modified from Query #5 in TPC-R. Query Plan1 region.name = Â‘ASIAÂ’ custkey nationkey regionkey orderkey nationkey <148,044> e)1,250,958 f) 225,000 c) 147,875 <7,981,717,259> e) 4,774,506,546,885 f) 12,113,002,407 c) 7,963,778,002 Total Cost <89,726,688> e) 7,193,261,627 f) 136,126,444 c) 89,604,370 <3,568,287,410> e) 2,273,652,177,498 f) 5,408,583,824 c) 3,579,044,050 <1> e)1 f) 1 c) 1 <3,568,287,410> e) 2,078,049,880,667 f) 5,408,582,742 c) 3,579,151,422 <755,267,706> e) 415,609,976,134 f) 1,159,484,396 c) 715,830,284
PAGE 74
63 region.name = Â‘ASIAÂ’ <1> e)1 f) 1 c) 1 <755,267,706> e) 415,609,976,134 f) 1,159,484,628 c) 715,830,284 <4,413,429,854> e) 2,696,456,666,223 f) 6,704,419,902 c) 4,384,626,585 Total Cost Query Plan2 custkey nationkey regionkey orderkey nationkey <148,044> e)1,250,958 f) 225,000 c) 147,875 <89,726,688> e) 7,193,261,627 f) 136,126,444 c) 89,604,370 <3,568,287,410> e) 2,273,652,177,498 f) 5,408,583,824 c) 3,579,044,050 <5> e)5 f) 5 c) 5 Query Plan3 region.name = Â‘ASIAÂ’ custkey orderkey nationkey regionkey nationkey <90,905,297> e) 862,533,324 f) 90,750,963 c) 90,892,314 <3,568,287,410> e) 2,273,924,054,132 f) 5,437,352,201 c) 3,579,044,050 <3,568,287,410> e) 2,078,298,367,755 f) 5,437,352,745 c) 3,579,151,422 <1> e)1 f) 1 c) 1 <599,152> e)4,741,213 f) 599,151 c) 599,141 <7,983,346,976> e) 4,768,749,369,976 f) 12,131,707,131 c) 7,965,517,212 Total Cost <755,267,706> e) 415,659,673,551 f) 1,165,652,070 c) 715,830,284
PAGE 75
64 region.name = Â‘ASIAÂ’ custkey orderkey nationkey regionkey nationkey Query Plan5 region.name = Â‘ASIAÂ’ custkey regionkey nationkey nationkey orderkey Query Plan4 <755,267,706> e) 53,289,813,663 f) 1,159,191,380 c) 715,808,810 <1,229,734> e) 9,267,464 f) 1,923,950 c) 1,181,303 <148,044> e) 1,250,958 f) 225,000 c) 147,875 <1,000> e) 1,000 f) 1,000 c) 1,000 <5,887,806> e) 46,337,318 f) 8,987,264 c) 5,906,513 <1> e)1 f) 1 c) 1 <755,267,706> e) 415,611,233,761 f) 1,165,652,070 c) 715,830,284 <90,905,297> e) 862,533,324 f) 90,750,963 c) 90,892,314 <128,446> e) 866,665 f) 128,445 c ) 119 832 <599,152> e) 4,741,213 f) 599,151 c) 599,141 <599,152> e) 4,333,326 f) 599,151 c) 599,159 <762,534,291> e) 53,346,670,404 f) 1,170,328,595 c) 723,045,502 Total Cost <847,499,754> e) 416,482,841,625 f) 1,257,601,336 c) 807,920,899 Total Cost <1> e)1 f) 1 c) 1
PAGE 76
65 select nname as suppnation, p_name as line_part, s_name as supplier_name from supplier, lineitem, orders, part, nation where ssuppkey = lsuppkey and o_orderkey = lorderkey and lpartkey = ppartkey and snationkey = nnationkey and n_name = Â’IRAN' ; Query 4 Modified from Query #7 in TPC-R. Query Plan1 partkey suppkey nationkey orderkey name = Â‘IRANÂ’ <601,672> e) 5,053,818 f) 605,105 c) 601,662 <181,964,334> e) 65,279,590,833 f) 187,869,257 c) 188,555,570 Total Cost <92,037,885> e) 7,265,143,374 f) 91,740,173 c) 91,144,086 <88,430,967> e)55,939,584,270 f) 91,740,173 c) 93,086,366 <893,809> e) 2,069,809,370 f) 3,783,,805 c) 3,723,455 <1> e) 1 f) 1 c) 1
PAGE 77
66 Query Plan2 partkey suppkey nationkey orderkey name = Â‘IRANÂ’ <601,672> e) 5,053,818 f) 605,105 c) 601,662 <574,567> e) 38,912,990 f) 605,105 c) 614,483 <88,430,967> e)55,939,584,270 f) 91,740,172 c) 93,086,366 <893,809> e) 2,069,809,370 f) 3,783,,805 c) 3,723,455 <90,501,016> e) 58,053,360,449 f) 96,734,188 c) 98,025,967 Total Cost Query Plan3 partkey suppkey nationkey orderkey name = Â‘IRANÂ’ <612,048> e) 4,618,961 f) 600,000 c) 612,786 <574,567> e) 38,910,541 f) 605,105 c) 614,483 <88,430,967> e)55,936,063,813 f) 91,740,173 c) 93,086,366 <893,809> e) 2,069,809,110 f) 3,783,805 c) 3,723,455 <90,511,392> e) 58,049,272,426 f) 96,729,084 c) 98,037,091 Total Cost <1> e) 1 f) 1 c) 1 <1> e) 1 f) 1 c) 1
PAGE 78
67 Query Plan4 Query Plan5 partkey suppkey nationkey orderkey name = Â‘IRANÂ’ partkey suppkey nationkey orderkey name = Â‘IRANÂ’ <39> e) 37 f) 41 c) 40 <4,397,101> e) 2,338,812,520 f) 7,622,643 c) 7,393,326 Total Cost <1> e) 1 f) 1 c) 1 <23,346> e) 186,995 f) 25,159 c) 24,067 <3,479,906> e) 268,816,117 f) 3,798,721 c) 3,645,763 <893,809> e) 2,069,809,370 f) 3,798,721 c) 3,723,455 <1> e) 1 f) 1 c) 1 <5,971> e) 1,439,812 f) 25,159 c) 24,579 <39> e) 37 f) 41 c) 40 <612,048> e) 4,619,833 f) 600,000 c) 612,786 <893,809> e) 2,069,809,370 f) 3,798,721 c) 3,723,455 <1,511,868> e) 2,075,869,053 f) 4,423,922 c) 4,360,861 Total Cost
PAGE 79
68 Query 5 Modified from Query #8 in TPC-R. Query Plan1 select count(*) from part, supplier, lineitem, orders, nation, region where p_partkey = l_partkey and s_suppkey = l_suppkey and l_orderkey = o_orderkey and s_nationkey = n_nationkey and n_regionkey = r_regionkey and r_name = Â’ASIAÂ’ and o_orderstatus = Â‘OÂ’ and p_size = Â’1Â’; suppkey orderkey nationkey region.name = Â‘ASIAÂ’ partkey regionkey orders.orderstatus = Â‘OÂ’ part.size= Â‘1Â’ <543> e) 400 f) 543 c) 543 <16,694> e) 92,379 f) 16,290 c) 16,637 <7,093> e) 778,211 f) 16,429 c) 16,683 <0> e) 372,903,363 f) 1,239,261 c) 1,242,721 <0> e) 344,943,069 f) 1,214,949 c) 1,242,721 <0> e) 68,988,614 f) 252,338 c) 248,544 <73,758> e) 50,000 f) 73,758 c) 73,758 <98,089> e) 787,756,037 f) 2,813,569 c) 2,841,608 Total Cost <1> e) 1 f) 1 c) 1
PAGE 80
69 Query Plan2 Query Plan3 suppkey orderkey nationkey region.name = Â‘ASIAÂ’ partkey regionkey orders.orderstatus = Â‘OÂ’ part.size= Â‘1Â’ suppkey orderkey nationkey region.name = Â‘ASIAÂ’ partkey regionkey orders.orderstatus = Â‘OÂ’ part.size= Â‘1Â’ <543> e) 400 f) 543 c) 543 <16,694> e) 92,379 f) 16,290 c) 16,637 <1,000> e) 925 f) 1,000 c) 1,000 <7,093> e) 719,861 f) 16,429 c) 16,683 <0> e) 344,943,069 f) 1,239,261 c) 1,242,721 <0> e) 68,988,614 f) 255,488 c) 248,544 <1> e) 1 f) 1 c) 1 <73,758> e) 50,000 f) 73,758 c) 73,758 <1,000> e) 925 f) 1,000 c) 1,000 <543> e) 400 f) 543 c) 543 <1> e) 1 f) 1 c) 1 <212> e) 185 f) 212 c) 200 <127,658> e) 934,977 f) 128,205 c) 120,332 < 8,009,730> e) 448,022,381 f) 9,672,.965 c) 8,963,474 <0> e) 68,992,956 f) 262,621 c) 248,544 <73,758> e) 50,000 f) 73,758 c) 73,758 <99,089> e) 414,795,249 f) 1,602,770 c) 1,599,887 Total Cost <8,212,902> e) 518,001,825 f) 10,139,305 c) 9,407,852 Total Cost
PAGE 81
70 Query Plan4 Query Plan5 suppkey orderkey nationkey region.name = Â‘ASIAÂ’ partkey regionkey orders.orderstatus = Â‘OÂ’ part.size= Â‘1Â’ suppkey orderkey nationkey region.name = Â‘ASIAÂ’ regionkey orders.orderstatus = Â‘OÂ’ partkey part.size= Â‘1Â’ <543> e) 400 f) 543 c) 543 <16,694> e) 92,379 f) 16,290 c) 16,637 <7,093> e) 778,211 f) 16,429 c) 16,683 <7,093> e) 719,861 f) 16,423 c) 16,683 <7,093> e) 143,972 f) 3,478 c) 3,337 <0> e) 68,988,614 f) 258,325 c) 248,544 <73,758> e) 50,000 f) 73,758 c) 73,758 <1> e) 1 f) 1 c) 1 <1> e) 1 f) 1 c) 1 <73,758> e) 50,000 f) 73,758 c) 73,758 <543> e) 400 f) 543 c) 543 <5> e) 5 f) 5 c) 5 <212> e) 200 f) 212 c) 200 <16,694> e) 92,379 f) 16,290 c) 16,637 <7,093> e) 155,626 f) 3,480 c) 3,337 <0> e) 74,572,773 f) 269,958 c) 248,544 <112,275> e) 70,773,438 f) 385,247 c) 376,186 Total Cost <98,306> e) 74,871,384 f) 364,247 c) 343,025 Total Cost
PAGE 82
71 DataSet 2: More Correlated Data select l_orderkey, o_orderdate, o_shippriority from customer, orders, lineitem where c_mktsegment = Â‘AUTOMOBILE' and c_custkey = o_custkey and l_orderkey = o_orderkey; Query 1 Modified from Query #3 in TPC-R. Query Plan1 customer.mktsegmen t=Â‘AUTOMOBILEÂ’Â’ custkey orderkey Query Plan2 customer.mktsegment=Â‘AUTOMOBILEÂ’ custkey orderkey <3,013> e)3,000 f)3,013 c)3,013 <509,197> e)209,279 f)509,349 c)126,352 <655,864,130> e)1,091,331,609 f) 657,059,559 c) 76,637,793 < 656,376,340> e) 1,091,543,888 f) 657,571,921 c) 76,767,158 <90,910,769> e)782,204,056 f) 90,749,343 c ) 90 980 494 <3,013> e)3,000 f)3,013 c)3,013 <655,864,130> e)1,077,293,021 f) 657,048,020 c) 76,637,793 Total Cost <746,777,912> e) 1,859,500,077 f) 747,800,376 c) 167,621,300 Total Cost
PAGE 83
72 Query Plan3 custke y orderkey customer.mktsegment =Â‘AUTOMOBILEÂ’ custkey orderkey customer.mktsegment =Â‘AUTOMOBILEÂ’ Query Plan5 customer.mktsegment =Â‘AUTOMOBILEÂ’ custke y ordekey Query Plan4 <627,617> e)1,046,393 f) 689,239 c) 628,994 <509,197> e)209,279 f)509,349 c)126,352 <655,864,130> e)1,091,331,604 f) 306,528,686 c ) 76,637,793 < 657,000,944 > e) 1,092,587,276 f) 307,727,274 c) 77,393,139 Total Cost <627,617> e)1,046,393 f) 689,239 c) 628,994 <728,725,642> e)5,456,658,047 f) 765,753,258 c) 381,510,318 <655,864,130> e)1,091,331,609 f) 306,528,655 c) 76,637,793 <1,385,217,389> e) 6,549,036,049 f) 1,072,971,152 c) 458,777,105 Total Cost <655,864,130> e)1,077,293,021 f) 657,048,052 c) 76,637,793 <728,725,642> e)5,386,465,103 f) 765,741,720 c) 381,510,318 <90,910,769> e)782,204,056 f) 90,749,343 c) 90,980,494 <1,475,500,541> e) 7,245,962,180 f) 1,513,539,115 c) 549,128,605 Total Cost
PAGE 84
73 select s_acctbal, s_name, n_name, p_partkey, p_mfgr, s_address, s_phone, s_comment from part, supplier, partsupp, nation, region where p_partkey = ps_partkey and s_suppkey = ps_suppkey and p_size = Â‘1Â’ and s_nationkey = n_nationkey and n_regionkey = r_regionkey and renamed = Â’ASIA'; Query 2 Modified from Query #2 in TPC-R. Query Plan1 Part.size = Â‘1Â’ region.name = Â‘ASIAÂ’ partkey suppkey nationkey regionkey <661> e) 444 f) 661 c) 661 <55,428> e)357,807 f) 54,537 c) 55,627 <113,183> e)2,972,364 f) 58,207 c) 55,823 <113,183> e)2,749,496 f) 58,251 c) 55,823 <1> e)1 f)1 c)1 <33,776> e)549,899 f) 17,322 c) 11,165 <316,232> e) 6,630,011 f) 188,979 c) 179,100 Total Cost
PAGE 85
74 Query Plan2 Part.size = Â‘1Â’ region.name=Â‘ASIAÂ’ suppkey nationkey regionkey partkey <80,273> e)664,618 f) 80,157 c) 80,282 <1> e)1 f)1 c)1 <211,964> e)1,952,790 f) 190,307 c) 188,447 Total Cost <80,273> e)614,785 f) 80,178 c) 80,282 <16,980> e) 122,957 f) 16,988 c) 16,056 <33,776> e)549,985 f) 12,322 c) 11,165 <661> e) 444 f) 661 c) 661 Query Plan3 region.name=Â‘ASIAÂ’ partkey suppkey nationkey regionkey Part.size = Â‘1Â’ <661> e) 444 f) 661 c) 661 <1> e)1 f)1 c)1 <5> e)5 f)5 c)5 <80,273> e)664,618 f) 80,157 c) 80,282 <113,183> e)2,972,828 f) 58,589 c) 55,823 <33,776> e)594,577 f) 11,263 c) 11,165 <227,899> e) 4,232,473 f) 150,676 c) 147,937 Total Cost
PAGE 86
75 Query Plan4 Part.size = Â‘1Â’ re g ion.name = Â‘ASIAÂ’ suppkey nationkey regionkey partkey Part.size = Â‘1Â’ region.name = Â‘ASIAÂ’ suppkey nationkey regionkey partkey Query Plan5 <661> e) 444 f) 661 c) 661 <1,000> e) 925 f) 1,000 c) 1,000 <212> e) 185 f) 212 c) 200 <16,980> e) 122,957 f) 16,991 c) 16,056 <1> e)1 f)1 c)1 <1> e)1 f)1 c)1 <33,776> e) 549,985 f) 12,348 c) 11,165 <661> e) 444 f) 661 c) 661 <1,000> e) 925 f) 1,000 c) 1,000 <212> e) 185 f) 212 c) 200 <33,776> e) 549,935 f) 12,273 c) 11,165 <55,428> e) 357,807 f) 54,357 c) 55,627 <91,078> e) 909,297 f) 68,504 c) 68,654 Total Cost <52,630> e) 674,497 f) 31,213 c) 29,083 Total Cost
PAGE 87
76 select n_name from customer, orders, lineitem, supplier, nation, region where c_custkey = o_custkey and l_orderkey = o_orderkey and c_nationkey = s_nationkey and s_nationkey = n_nationkey and n_regionkey = r_regionkey and r_name = Â’ASIAÂ’; Query 3 Modified from Query #5 in TPC-R. Query Plan1 region.name = Â‘ASIAÂ’ custkey nationkey regionkey orderkey nationkey <627,617> e)1,046,393 f) 689,239 c) 628,994 <73,930,353,260> e) 3,621,841,053,516 f) 41,305,045,793 c) 33,910,566,700 Total Cost <728,725,642> e) 5,456,658,047 f) 765,753,258 c) 381,510,318 <30,413,000,000> e) 1,724,745,059,090 f) 17,150,034,704 c) 15,240,360,525 <1> e)1 f) 1 c) 1 <30,413,000,000> e) 1,576,365,241,654 f) 17,150,038,134 c) 15,240,055,718 <12,375,000,000> e) 315,273,048,331 f) 6,238,530,457 c) 3,048,011,144
PAGE 88
77 region.name = Â‘ASIAÂ’ <1> e)1 f) 1 c) 1 <12,375,000,000> e) 315,273,048,331 f) 6,238,530,924 c) 3,048,011,144 <43,517,353,265> e) 2,045,475,811,867 f) 24,155,008,131 c) 18,670,510,987 Total Cost Query Plan2 custkey nationkey regionkey orderkey nationkey <627,617> e)1,046,393 f) 689,239 c) 628,994 <728,725,642> e) 5,456,658,047 f) 765,753,258 c) 381,510,318 <30,413,000,000> e) 1,724,745,059,090 f) 17,150,034,704 c) 15,240,360,525 <5> e)5 f) 5 c) 5 Query Plan3 region.name = Â‘ASIAÂ’ custkey orderkey nationkey regionkey nationkey <90,910,769> e)782,204,056 f) 90,749,343 c) 90,980,494 <30,413,000,000> e) 1,702,558,414,418 f) 31,851,690,316 c) 15,240,360,525 <30,413,000,000> e) 1,556,087,314,026 f) 31,898,496,375 c) 15,240,055,718 <1> e)1 f) 1 c) 1 <599,152> e)4,740,897 f) 599,150 c) 599,172 <73,292,509,922> e) 3,570,650,136,203 f) 76,455,817,905 c) 33,620,007,054 Total Cost <12,375,000,000> e) 311,217,462,805 f) 12,614,282,720 c) 3,048,011,144
PAGE 89
78 region.name = Â‘ASIAÂ’ custkey orderkey nationkey regionkey nationkey Query Plan5 region.name = Â‘ASIAÂ’ custkey regionkey nationkey nationkey orderkey Query Plan4 <12,375,000,000> e) 40,424,539,745 f) 8,886,229,242 c) 3,048,072,105 <9,960,508> e) 7,751,992 f) 10,366,363 c) 5,025,338 <627,617> e) 1,046,393 f) 689,239 c) 628,994 <1,000> e) 1,000 f) 1,000 c) 1,000 <26,033,882> e)38,759,959 f) 28,497,651 c) 25,126,692 <1> e)1 f) 1 c) 1 <12,375,000,000> e) 315,272,986,664 f) 6,448,456,007 c) 3,048,011,144 <128,446> e) 866,607 f) 128,445 c ) 119 832 <599,152> e) 4,740,897 f) 599,150 c) 599,172 <599,152> e) 4,333,037 f) 599,151 c) 599,160 <12,411,623,008> e) 40,472,099,090 f) 8,925,783,496 c) 3,078,854,130 Total Cost <12,467,237,520> e) 316,065,131,262 f) 6,540,532,097 c) 3,140,309,803 Total Cost <1> e)1 f) 1 c) 1 <90,910,769> e) 782,204,056 f) 90,749,343 c) 90,980,494
PAGE 90
79 select nname as suppnation, p_name as line_part, s_name as supplier_name from supplier, lineitem, orders, part, nation where ssuppkey = lsuppkey and o_orderkey = l_orderkey and lpartkey = ppartkey and snationkey = nnationkey and n_name = Â’IRAN' ; Query 4 Modified from Query #7 in TPC-R. Query Plan1 partkey suppkey nationkey orderkey name = Â‘IRANÂ’ <632,147> e) 5,065,628 f) 635,889 c) 632,465 <423,975,095> e) 62,367,000,567 f) 200,080,612 c) 211,463,085 Total Cost <97,842,390> e) 6,603,935,378 f) 97,678,643 c) 95,903,448 <324,606,748> e)53,768,521,259 f) 98,223,785 c) 110,506,895 <893,809> e) 1,989,478,301 f) 3,542,294 c) 4,420,276 <1> e) 1 f) 1 c) 1
PAGE 91
80 Query Plan2 partkey suppkey nationkey orderke name = Â‘IRANÂ’ <632,147> e) 5,065,628 f) 635,889 c) 632,465 <1,914,433> e) 41,243,789 f) 639,438 c) 728,772 <324,607,748> e) 53,768,521,259 f) 98,223,785 c) 110,506,895 <893,809> e) 1,989,478,301 f) 3,542,294 c) 4,420,276 <328,048,138> e) 55,804,308,978 f) 103,041,407 c) 116,288,409 Total Cost Query Plan3 partkey suppkey nationkey orderkey name = Â‘IRANÂ’ <680,613> e) 4,883,985 f) 603,349 c) 691,362 <1,914,433> e) 41,236,655 f) 639,438 c) 728,772 <324,607,748> e) 53,759,221,173 f) 98,223,785 c) 110,506,895 <893,809> e) 1,989,134,191 f) 3,542,294 c) 4,420,276 <328,096,604> e) 55,794,476,005 f) 103,008,867 c) 116,347,306 Total Cost <1> e) 1 f) 1 c) 1 <1> e) 1 f) 1 c) 1
PAGE 92
81 Query Plan4 Query Plan5 partkey suppkey nationkey orderkey name = Â‘IRANÂ’ partkey suppkey nationkey orderkey name = Â‘IRANÂ’ <39> e) 37 f) 39 c) 40 <4,397,101> e) 2,234,016,663 f) 7,118,228 c) 8,281,754 Total Cost <1> e) 1 f) 1 c) 1 <23,346> e) 187,432 f) 23,419 c) 25,299 <3,479,906> e) 244,350,892 f) 3,537,513 c) 3,836,138 <893,809> e) 1,989,478,301 f) 3,557,256 c) 4,420,276 <1> e) 1 f) 1 c) 1 <5,971> e) 1,526,053 f) 23,550 c) 29,151 <39> e) 37 f) 39 c) 40 <680,613> e) 4,885,216 f) 603,349 c) 691,362 <893,809> e) 1,989,478,301 f) 3,557,256 c) 4,420,276 <1,580,433> e) 1,995,889,608 f) 4,184,195 c) 5,140,830 Total Cost
PAGE 93
82 part.size= Â‘1Â’ <661> e) 444 f) 661 c) 661 <91,199> e) 108,522 f) 19,941 c) 22,850 Query 5 Modified from Query #8 in TPC-R. Query Plan1 select count(*) from part, supplier, lineitem, orders, nation, region where p_partkey = l_partkey and s_suppkey = l_suppkey and l_orderkey = o_orderkey and s_nationkey = n_nationkey and n_regionkey = r_regionkey and r_name = Â’ASIAÂ’ and o_orderstatus = Â‘OÂ’ and p_size = Â’1Â’; suppkey orderkey nationkey region.name = Â‘ASIAÂ’ partkey regionkey orders.orderstatus = Â‘OÂ’ <1,373,617> e) 916,279 f) 21,133 c) 24,086 <106,604,004> e) 398,172,650 f) 1,668,966 c) 1,795,886 <106,604,004> e) 368,317,664 f) 1,562,908 c) 1,795,886 <0> e) 73,663,533 f) 293,750 c) 395,177 <73,758> e) 49,999 f) 73,758 c) 73,758 <214,747,244> e) 841,229,092 f) 3,641,118 c) 4,108,305 Total Cost <1> e) 1 f) 1 c) 1
PAGE 94
83 Query Plan2 Query Plan3 suppkey orderkey nationkey region.name = Â‘ASIAÂ’ partkey regionkey orders.orderstatus = Â‘OÂ’ part.size= Â‘1Â’ |