Record for a UF thesis. Title & abstract won't display until thesis is accessible after 2011-08-31.

Permanent Link: http://ufdc.ufl.edu/UFE0024968/00001

Material Information

Title: Record for a UF thesis. Title & abstract won't display until thesis is accessible after 2011-08-31.
Physical Description: Book
Language: english
Creator: Magnusson, Jeffrey
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2009


Subjects / Keywords: Computer and Information Science and Engineering -- Dissertations, Academic -- UF
Genre: Computer Engineering thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, terriorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation


Statement of Responsibility: by Jeffrey Magnusson.
Thesis: Thesis (Ph.D.)--University of Florida, 2009.
Local: Adviser: Dankel, Douglas D.
Electronic Access: INACCESSIBLE UNTIL 2011-08-31

Record Information

Source Institution: UFRGP
Rights Management: Applicable rights reserved.
Classification: lcc - LD1780 2009
System ID: UFE0024968:00001

Permanent Link: http://ufdc.ufl.edu/UFE0024968/00001

Material Information

Title: Record for a UF thesis. Title & abstract won't display until thesis is accessible after 2011-08-31.
Physical Description: Book
Language: english
Creator: Magnusson, Jeffrey
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2009


Subjects / Keywords: Computer and Information Science and Engineering -- Dissertations, Academic -- UF
Genre: Computer Engineering thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, terriorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation


Statement of Responsibility: by Jeffrey Magnusson.
Thesis: Thesis (Ph.D.)--University of Florida, 2009.
Local: Adviser: Dankel, Douglas D.
Electronic Access: INACCESSIBLE UNTIL 2011-08-31

Record Information

Source Institution: UFRGP
Rights Management: Applicable rights reserved.
Classification: lcc - LD1780 2009
System ID: UFE0024968:00001

This item has the following downloads:

Full Text




2 2009 Jeff Magnusson


3 To my parents


4 ACKNOWLEDGMENTS First, I thank my supervisory committee chair Douglas D. Dankel II. This research would not have been possible without his support, advice, and patience. I also thank Richard Herman, Chris Easely, and all the other s who work on the University of Floridas enterprise resource pl anning system and provided their suppor t during the testing phase of my research. In addition, I thank my friends Lauren and Christian for tolerating my complaining throughout the entire research process and keeping me in good spirits and on task. Lastly, I express my sincere appreciation to my parents who provided me with an incredible amount of support, encouragement, and a safety net throughout my graduate study which allowed me to concentrate on my research without worr y


5 TABLE OF CONTENTS page ACKNOWLEDGMENTS ...............................................................................................................4 LIST OF TABLES ...........................................................................................................................7 LIST OF FIGURES .........................................................................................................................8 ABSTRACT ...................................................................................................................................11 CHAPTER 1 INTRODUCTION ..................................................................................................................13 2 BACKGROUND AND PRIOR ART .....................................................................................21 Cost Based Query Optimization .............................................................................................21 Statistics Gathering .................................................................................................................22 System Rs C ontribution to Cost B ased Optimization ...........................................................24 The Role of Indexes in Query Optimization ...........................................................................26 The Index Selection Problem ..................................................................................................29 Virtual Indexes ........................................................................................................................31 Autonomic Comput ing ...........................................................................................................32 Autonomic Computing in Databases ......................................................................................33 Autonomic Database Workflow Monitoring and Tuning .......................................................35 Prior Research in Autonomic Physical Tuning Architectures ................................................36 3 A MODEL FOR FORECASTING FUTURE QUERY BEHAVIOR ....................................43 Motivation ...............................................................................................................................43 Overview of the Forecasting Architecture ..............................................................................46 The Runstats Listener .............................................................................................................48 The Analysis/Simulation Engine ............................................................................................49 Th e Information Repository ...................................................................................................50 Future Optimal Index Detection .............................................................................................51 Data Threshold Detection .......................................................................................................53 4 FORECASTING METHODS AND ALGORITHMS ............................................................57 Least Squares Linear Regression ............................................................................................57 Autoregressive Integrated Moving Average Models ..............................................................58 Comparing and Selecting Models ...........................................................................................59 Quantile Regression ................................................................................................................60 Forecasting Table Level Statistics ..........................................................................................62 Column Level Statistics ..........................................................................................................63 Index Level Statistics ..............................................................................................................64


6 5 TESTING AND PERFORMANCE EVALUATION ............................................................66 Preliminary Implementation ...................................................................................................66 Testing Methodology ..............................................................................................................67 Results of Testing With Synthetic Data ..................................................................................70 Results of Testing With Enterprise Data ................................................................................78 6 CONCLUSIONS ..................................................................................................................101 Contributions ........................................................................................................................101 Future Work ..........................................................................................................................102 APPENDIX A DDL USED IN PERFORMANCE EVALUATION ............................................................104 Table Creation Script ............................................................................................................104 Index Creation Script ............................................................................................................107 B QUERIES USED IN PERFORMANCE EVALUATION ...................................................111 LIST OF REFERENCES .............................................................................................................125 BIOGRAPHICAL SKETCH .......................................................................................................129


7 LIST OF TABLES Table page 21 Statistics used in costbased calculations. ..........................................................................42


8 LIST OF FIGURES Figure page 11 A data threshold occurs at 4 4/9 years ...............................................................................20 12 Query performance with an index existing on the Student vs Alumni relations. ..............20 21 A typical autonomic index tuning architecture ..................................................................42 31 An overview of the MAEDA architecture. ........................................................................55 32 Multiple databases served by a single MAEDA server. ....................................................55 33 The MAEDA Runstats Listener .........................................................................................56 34 The MAEDA Simulation/A nalysis Engine ........................................................................56 51 TPC H Scale factor 1 results for Query 1. .........................................................................86 52 TPC H Scale factor 1 results for Query 2. .........................................................................86 53 TPC H Scale factor 1 results for Query 3. .........................................................................86 54 TPC H Scale factor 1 results for Query 4. .........................................................................87 55 TPC H Scale factor 1 results for Query 5. .........................................................................87 56 TPC H Scale factor 1 results for Query 6. .........................................................................87 57 TPC H Scale factor 1 results for Query 7. .........................................................................88 58 TPC H Scale factor 1 results for Query 8. .........................................................................88 59 TPC H Scale factor 1 results for Query 9. .........................................................................88 510 TPC H Scale factor 1 results for Query 10. .......................................................................89 511 TPC H Scale factor 1 results for Query 11. .......................................................................89 512 TPC H Scale factor 1 results for Query 12. .......................................................................89 513 TPC H Scale factor 1 results for Query 13. .......................................................................90 514 TPC H Scale factor 1 results for Query 14. .......................................................................90 515 TPC H Scale factor 1 results for Query 15. .......................................................................90 516 TPC H Scale factor 1 results for Query 16. .......................................................................91


9 517 TPC H Scale factor 1 results for Query 17. .......................................................................91 518 TPC H Scale factor 1 results for Query 18. .......................................................................91 519 TPC H Scale factor 1 results for Query 19. .......................................................................92 520 TPC H Scale factor 1 results for Query 20. .......................................................................92 521 Results for TPC H Query 4 modified to produce more realistic growth. ..........................92 522 Results for TPC H Query 7 modified to produce more realistic growth. ..........................93 523 Results for TPC H Query 13 modified to produce more realistic growth. ........................93 524 Results for TPC H Query 16 modified to produce more realistic growth. ........................93 525 ERP data set results for Query 1. .......................................................................................94 526 ERP data set results for Query 2. .......................................................................................94 527 ERP data set results for Query 3. .......................................................................................94 528 ERP data set results for Query 4. .......................................................................................95 529 ERP data set results for Query 5. .......................................................................................95 530 ERP data set results for Query 6. .......................................................................................95 531 ERP data set results for Query 7. .......................................................................................96 532 ERP data set results for Query 8. .......................................................................................96 533 ERP data set results for Query 9. .......................................................................................96 534 ERP data set results for Query 10. .....................................................................................97 535 Constrained ERP data set results for Query 1. ...................................................................97 536 Constrained ERP data set results for Query 2. ...................................................................97 537 Constrained ERP data set results for Query 3. ...................................................................98 538 Constrained ERP data set results for Query 4. ...................................................................98 539 Constrained ERP data set results for Query 5. ...................................................................98 540 Constrained ERP data set results for Query 6. ...................................................................99 541 Constrained ERP data set results for Query 7. ...................................................................99


10 542 Constrained ERP data set results for Query 8. ...................................................................99 543 Constrained ERP data set results for Query 9. .................................................................100 544 Constrained ERP data set results for Query 10. ...............................................................100


11 Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy A DATA THRESHOLD DETECTION APPROACH TO PREDICTING FUTURE QUERY BEHAVIOR By Jeff Magnusson August 2009 Chair: Douglas D. Dankel II Major: Computer Engineering For a database to function efficiently, the query optimizer mu st be presented with an efficient set of indexes for the query workload. Often inefficient index sets are not noticed until they have a substantial negative impact on the database processes. In an active database, the sizes of tables and distributions of data constantly grow at different relative rates. Traditionally, the access plans produced by the cost based query optimizer change when two related data distributions grow relative to one another past some threshold. These data thresholds are significant because they represent the point at which previously optimal indexes become suboptimal. Due to the size and complexity of todays database systems, many DBAs rely on autonomic tuning tools to assist them in determining optimal index sets for the data base workload. We propose an extension to these autonomic tuning tools that will allow DBAs to estimate future performance of queries and which will automatically recommend optimal index sets for the estimated future data distributions. This is accomplis hed by recording and analyzing changes in the database statistics captured by the DBMS over time. Forecasts of future values of these statistics are computed and used to create an alternate system catalog from which to compile queries. This effectively t ricks the query optimizer into optimizing queries assuming the values of the forecasted data distributions. Current methods of deriving optimal index sets for a


12 query workload can then be applied to make optimal index predictions for the future. In addit ion, we show that the extension is useful as a user space tool for estimating query scaling and performance.


13 CHAPTER 1 INTRODUCTION The complexity and power of the business worlds information technology (IT) infrastructures have been allowed to continually increase due to Moores Law and competition decreasing the per unit costs of both hardware and software. However, the cost of IT professionals who have the skills necessary to manage and administrate the growing array of IT systems is not subject to the growth principles of Moores Law. As a result, the total cost of ownership (TCO) involved in maintaining IT systems is becoming increasingly dominated by the cost to employ qualified professionals. Autonomic computing is quickly becoming an essential technology that must be realized to continue to fuel the growth of the IT industry [1] The term autonomic computing derives from the autonomic nervous system of the human body. It is the system that controls vital low level functions such as heart rate and body temperature, leaving the conscious mind free to deal with higher level tasks. A truly autonomic computer system is capable of managing itself utilizing the guidance of high le vel objectives provided by administrators. It is composed of interactive collections of autonomic elements or individual self managing components of systems, which contain resources and deliver services to humans and other autonomic elements [2] One major open area of research is the creation of specific auton omic elements, or research directed towards improving the self managing capability of specific components such as databases, storage systems, servers, etc. [3] ( pp. 16). Databases have becom e a focus of active research in autonomic systems because they are a necessity in many IT infrastructures and labor is such a high part of the TCO involved in maintaining them. Consider the following problem:


14 As data is inserted into a database, the sizes of its relations increase. Associations are formed between relations when queries exist in the databases workload that join attributes from two or more relations. It is common for these associated relations to grow at different relative rates. The acces s plan generated by the query optimizer determines which algorithms, indexes, and tables will be used to compute a query. The optimizer attempts to calculate the most efficient access plan given the current sizes and distributions of data relevant to the query being optimized. Thus the access plan is strongly tied to the sizes and distributions of data. At certain points, the sizes and distributions of data will change such that the query optimizer will compute a different access plan. We call these poin ts data thresholds Data thresholds are significant in that they represent a point at which an optimal set of indexes often become suboptimal. The situation is further complicated as the set of queries contained within the workload grow in complexity. Mo re complex queries will filter data from relations before they are joined and often join the result of one join operation with other relations ( n way joins). The number of possible index sets increases exponentially relative to the number of attributes and associations involved in the workload. As a result, this increases the complexity involved in both identifying data thresholds and reoptimizing the index sets. To illustrate this problem, consider the example shown in Figure 11, which shows a query reaching a data threshold. In this example, the Alumni and Student relations are being joined together on the attribute last_name. The corresponding SQL query might be: SELECT FROM alumni a, student b WHERE a.last_name = b.last_name. Initially, the alumni relation is small (10,000 rows) relative to the student relation (50,000 rows), but grows quickly as students graduate (growth of 10,000 rows per year). The student


15 relation is initially much larger than the alum ni relation but grows at a rate of only 1,000 rows per year (imagine that the school is able to accommodate a slightly larger entering class each year). When compiling the SQL query given above, let us assume that the query optimizer will choose to perfor m a nested loop join (NLJ) on the alumni and student relations. One relation will be designated as the inner relation and the other as the outer relation. This choice can be extremely significant. The NLJ algorithm scans each tuple (row or record) in the inner relation once for each tuple in the outer relation. The formula to derive the total number of rows that will be read for NLJ without indexes is T(outer) + [ T(outer) T(inner) ] where T(x) is the number of tuples in relation x However, the NLJ algorithm can greatly benefit if an index is present on the inner relation for the field being joined upon. An index is a data structure consisting of an ordered subset of fields of a given relation, designed to speed up operations that f etch records from the relation. A typical database index is maintained as a B tree data structure in which the leaf nodes contain pointers to records of the indexed relation. Rather than scan the entire inner relation for each row in the outer relation, NLJ can utilize an appropriate index to find matching records. The formula for NLJ utilizing an appropriate index on the inner relation becomes approximately T(outer) + [ T(outer) 4]. Because the index is being utilized, only four index pages need to be examined rather than scanning the entire inner relation for matching records (four is a good approximation of the number of levels in a typical B Tree index). As a result, the presence of an appropriate index can result in a huge (i.e., orders of magnitude) performance improvement. In addition, the choice of


16 inner and outer relation becomes more significant. It is optimal to choose the largest relation as the inner relation because the cost associated with scanning it is virtually negated by the presence of the index. The initial optimal index set for the query is a single index on the last_name attribute of the student relation. The optimal access plan is to perform a nested loop join with student as the inner relation and alumni as the outer relation. However, as time progresses, the alumni relation grows at a much faster rate relative to the student relation. After approximately four and a half years a data threshold is encountered when the alumni relation becomes larger than the student relation. A fter this point, the previous access plan and index set become suboptimal. The new optimal index set is a single index on the last_name attribute on the alumni relation. The optimal access plan is to perform a nested loop join with alumni as the inner re lation and student as the outer relation. The amount of improvement per execution of this query that could be realized by implementing this change will continue to grow with time. Figure 12 shows the number of rows scanned per query execution over time for both index set options. The point at which the two lines intersect represents the data threshold. Traditionally it has been the role of database administrators (DBAs) to identify these situations and develop new indexes. Maintaining the manpower required to perform this task is expensive and is commonly the largest contributor to TCO for a database system. Efforts to make database systems more aut onomic include elements for automating maintenance activities, monitoring and policing workloads, autonomically tuning bufferpool and other memory sizes, as well as autonomic disk management and optimization. Tools have also been developed to tune SQL sta tements and workloads. Given a statement or workload input from a DBA, these tools will return an optimal, or near optimal set of indexes for running the


17 workload. More recently, research effort has been spent to completely eliminate the need for any hum an interaction using these tools. Autonomic index tuners monitor the SQL as it executes in the database and, when necessary, adjust the index set, attempting to maintain an optimal configuration. The need to make database systems more autonomic has been recognized by the major database vendors. IBMs SMART project and Microsofts AutoAdmin research group are both initiatives by major database management system vendors dedicated to realizing the vision of autonomic database systems. IBMs SMART project is focused on the creation of LEO, a learning optimizer, that refines statistics gathering based on feedback (sizes of actual vs. predicted results) from queries as they are executed [4 8] The group also focuses on improving the DB2 Design Advisor, a tool for automating the database physical design process. The DB2 Design Advisor is capable of providing recommendations for auxiliary data structures, like indexes and materialized query tables (MQTs), which will optimize the execution of a given workload. The SMART group wor ks to improve this tool and strives to make it a fully autonomic solution that can adapt to changes in the DB system rather than a batch oriented or manually invoked tool [9 11] Microsofts AutoAdmin project is a research group focused on automating the physical design of databases and refining/automating statistics collection. For example, work [1217] has been done by this group to automate index selection and workload tuning. In addition, the AutoAdmin project introduced virtual indexes and proposed they be used as workload tuning instruments [18] Virtual indexes are indexes introduced into the system catalog of the database system that are never actually materialized onto disk. T hey can be used by the database optimizer to create the query execution plans that would be executed if the


18 virtual indexes were materialized. Virtual indexes are a very useful tool for estimating the impact of candidate index sets because they allow for the exploration of hypothetical configurations, bypassing the time and load on the database system that would otherwise be required to materialize and evaluate candidate tuning solutions. The ability to create and utilize virtual indexes for workload tuni ng is now a feature in Oracle, DB2, and MS SQL Server. We propose an extension to the virtual index concept leveraged by the majority of autonomic tuning tools found in the prior art. Previously, virtual indexes and query analysis tools have been restrict ed to providing tuning recommendations only for the size and distribution of data currently housed in the database system. Thus, such tools were forced to give reactive tuning recommendations, only providing recommendations for processes that were current ly suboptimal. We outline the creation of a Metadata Analysis and Edge Detection Agent (MAEDA), which will allow queries and virtual indexes to be evaluated under predicted future values of data sizes and distributions. The MAEDA architecture will enable autonomic tools to proactively tune processes and queries. Additionally we provide evidence that the process we have developed for forecasting future query cost and behavior stands on its own as a user space tool for predicting the bounds of future quer y costs and examining the scaling of workloads with respect to time. We will show that MAEDA can reduce TCO of database systems by providing a significant improvement to the current query tuning paradigm. This dissertation is organized as follows. Chapt er 2 provides a brief overview of database concepts essential to the MAEDA architecture, including cost based query optimization, the role database statistics, and a discussion of the various ways indexes may be used in optimizing SQL queries. Next, an overview of the field of autonomic computing is presented, emphasizing


19 developments in database systems. Prior research into autonomic index tuning systems is discussed, and previously proposed autonomic index tuning architectures are examined. Chapter 3 pr esents a blueprint of the MAEDA architecture along with a discussion of what makes it unique and distinct from the systems that have been previously disc ussed. Chapter 4 discusses the various methods and algorithms used in forecasting various database sta tistics. Chapter 5 details the platform and preliminary proof of concept implementation of the architecture used for testing. The testing methodology is provided and results are presented for both synthetic and real world data sets. Finally, Chapter 6 c oncludes with an overview of what the research has accomplished and offers some possible direction and areas for future work.


20 Figure 11. A data threshold occurs at 4 4/9 years when the Alumni relation grows larger than the student relation. Figure 12. Query performance with an index existing on the Student vs Alumni relations. Time Student Relation 50,000 rows (inner table) Growth: 1,000 rows / yr Alumni Relation 10,000 rows (outer table) Growth: 10,000 rows / yr Join on last_name Nested Loop Join Cost: (Assuming index on last_name for student relation) 10,000 + (10,000 4) = 50,000 Alumni Relation 100,000 rows (inner table) Growth: 10,000 rows / yr Student Relation 59,000 rows (outer table) Growth: 1,000 rows / yr Join on last_name Nested Loop Join Cost: (Assuming index on last_name for alumni relation) 60,000 + (60,000 4) = 300,000 Alumni Relation 10,000 rows (inner table) Growth: 10,000 rows / yr Student Relation 50,000 rows (outer table) Growth: 1,000 rows / yr Join on last_name Nested Loop Join Cost: (Assuming index on last_name for alumni relation) 50,000 + (50,000 4 ) = 250,000 Student Relation 59,000 rows (inner table) Growth: 1,000 rows / yr Alumni Relation 100,000 rows (outer table) Growth: 10,000 rows / yr Join on last_name Nested Loop Join Cost: (Assuming index on last_name for student relation) 100,000 + (100,000 4) = 500,000 Data Threshold (Alumni > Student after 4 4/9 years ) Query Performance by Index Type 0 100,000 200,000 300,000 400,000 500,000 600,000 0 1 2 3 4 5 6 7 8 9 Time in Years Number of Rows Scanned per Query Execution Student Index Alumni Index


21 CHAPTER 2 BACKGROUND AND PRIOR ART This chapter pres ents a brief overview of some of the most important databas e system concepts related to this research followed by a brief overview of the field of Autonomic Computing, and a discussion of prior research in Autonomic Database Tuning focusing primarily on prior efforts to automate index tuning Cost Based Query Optimization The query optimizer and query execution engine are essential components in query evaluation for any SQL based relational database management system (RDBMS). The role of the query executi on engine is to implement a set of physical operators, which are components that take as input one or more data streams and produce an output data stream [19] Example s of physical operators include table scan, index scan, sort scan, nested loop join, sort merge join, and hash join. The set and order of physical operators to be executed to satisfy a query is called a physical access plan. Bec ause the choice of physical operators is an essential part of query optimization, a method must exist for estimating the cost to perform each physical operator [20] The accuracy of these estimates depends on the accuracy and precision of table statistics stored in the database. The query optimizer takes as input a parsed representation of a SQL query (a parse tree) and provide s as output an efficient physical query plan to be executed by the query execution engine [20] The parse tree must be translated into an algebraic representation called a logical access plan. The logical access plan is composed primarily of relational algebra operators from the classical relational algebra introduced by Codd in [21] The operators of the extended relational algebra commonly employed are union, intersection, difference, selection, projection, product, join, duplicate elimination, grouping, and sorting [20]


22 Determining a logical access plan for a query is a non trivial task due to the fact that the join operator is both commutative and associative. Conside r that f or a query joining n relations (commonly referred to as an nway join) there are n factorial permutations of join orders [22] When the other algebraic operators required to solve a query are factored in, even simple queries may compile into a large number of logical access plans. The cost based query optimization technique attempts to find the query plan that minimizes the estimated cost of accessing the data. The cost of a plan is primarily influenced by an estimation of the weighted sum of the number of disk I/O s and CPU cycles required to execute the query (the weighting given to disk I/O vs. CPU cycle is often a tunable parameter in the RDBMS) Statistics Gathering Accurate, up to date statistics are essential in determining whether or not a cost based query optimizer generates accurate estimates of the sizes of intermediate and final result sets involved in executing a query and the processing power required to achieve them. Statistics are typically gathered f rom the database during frequent, regularly scheduled maintenance periods. The accuracy and quality of these statistics is tunable on a per relation (and in certain instances, per index) basis, as is the frequency of collection. Table 21 lists some important statistics used in cost based calculations. The most basic statistic that can be collected is the cardinality of the relation, T(R). K nowing the size of a relation is e xtremely important in estimating the sizes of intermediate relations and determining join order. However, much more accurate estimates can be obtained by collecting more detailed statistics. A further level of detail can be obtained by collecting statistics on the attributes of a relation These statistics allow the optimizer to produce better estimates when projecting or


23 joining portions of data from relations based on the values for specific attributes. V(R, a) is a basic column level statistic. B y as suming an even distribution of data in the relation and computing T(R) / V(R, a) the compiler can produce much more accurate estimates than with T(R) alone. Because data is not often evenly distributed across a relation, even m ore accurate estimates can b e obtained by measuring the distribution of data. In particular, this can prove beneficial when collected on the attributes of a relation often used to filter a projection or join to other relations. While the best estimates will be obtained by measurin g the frequency with which each value occurs in an attr ibute, for attributes with high V(R, a) it becomes ineffective to maintain that level of detail (i.e., the data must be collected, stored, and maintained as well as analyzed during query compi lation) Instead, histograms are used to estimate the distribution. Equal height histograms, or percentiles, are commonly computed statistics in database management systems. To compute an equal height histogram for a distribution, pick some fraction, p, which re present s the number of percentile divisions to be computed. For example, if when p = 4, the data would be divided into quartiles. They can be calculated via the following procedure: l ist the lowest value for attribute a, then the value p from the lowest, 2p from the lowest, etc, continuing to the highest value in the distribution. Another commonly used form of histogram is to capture the most frequently occurring values for an attribute. A list of the top n most frequently occurring values is compiled, a long with the number of occurrences of each. In addition, the number of occurrences for all other values not in the list is also noted.


24 Histograms greatly increase the accuracy of estimates for the resulting size of join operations, especially in the case when both tables are joined on most frequent values In the best case, if both relations are joining on a value that is a most frequent occurrence, and the statistics gathered are up to date, the size of the result of the join can be calculated exactly. To save space and reduce the maintenance time requir ed to collect statistics, most DBMSs provide the option to collect distributions only on key and indexed columns [20] The quality (depth) and accuracy of gathered statistics are an essential factor in determining the accuracy of calculations in cost based optimization. Estimations of number of disk I/Os commonly take into account the following factors [20] : The logical operators that are chosen to implement the query. The physical operators chosen to implement the logical operators (i.e., choosing between a ne sted loop join and a merge join). An estimation of the size of the various intermediate relations involved in executing the plan. The ordering of the plans operations, especially the ordering of joins. The ability to pipeline through operations. Intermediate relations must be stored on disk only when a sort is required as the next step in the access plan. Otherwise, tuples can be materialized one by one and sent to the next step in the plan. The existence of useful indexes that can reduce or eliminate th e size of sorts and the amount of full table access to the relation. S ystem Rs Contribution to CostB ased Optimization The System R project [22] greatly enhanced the field of query optimization in relational database systems. One significant contribution resulted from the way System R determines the order of n way joins. A cost model estimates the size and execution cost (CPU + I/O cost) of the final and intermediate output streams (the output stream from each operator in the physical plan). The cost model relies on the set of statistics that has been gathered, formulas to compute estimated selectivity for each logical operator, and formulas which compute estimated execution


25 cost for physical operators (taking into account the statistics available for relations and indexe s as well as the order of the incoming data stream) [19] Join order is determined through a dynamic programming based approach. This is made possible by assuming the optimal plan for processing k joins can be determined by considering only the optimal subplans that process k 1 joins [22] This approach is capable of running in O(n2n 1) vs. O(n!) required by the nave approach [19] One of the join algorithms typically considered by query optimizers is the merge join. The merge join algorithm first sorts the two relations being joined into join pre dicate order. Then, with a synchroniz ed scan, the two relations are merged, thus completing the join. If any of the relations being joined are already in order by the join predicate, the sort step can be avoided. Skipping the sort step greatly enhances the efficiency of merge join. Consider a 3 way join. The intermediate relation produced by the cheapest method of joining relations A and B may not be the optimal relation to join with C because the ordering of the intermediate relation may require a sor t step to complete a merge join (subsequently, the resulting access plan for the query may be suboptimal). This fact gives way to the concept of interesting orders, which was also introduced with System R, which acts as an exception to the dynamic program ming technique discussed above. All join columns specified in a query (and also the orderings specified by the ORDER BY or GROUP BY clause of a query) constitute an interesting order. The dynamic programming technique is modified by not necessarily pruning suboptimal intermediate results from the search space if they satisfy an interesting order. To reduce the search space, equivalence classes are computed for interesting orders. For example, a query joining on A.ID = B.ID and B.ID = C.ID, would res ult in A.ID, B.ID, and C.ID all belonging to the same equivalence class. Optimal


26 intermediate results satisfying each order equivalence class continue in the algorithm unpruned [23] The Role of Indexes in Query Optimization Indexes play a critical role in influencing the query plan chosen by the optimizer. A secondary index is a relation, maintained in some sorted order. It contains a subset of the attributes contained in the relation it is indexing and one additional attribute, a pointer to the indexed record. Primary indexes represent a subset of the set of secondary indexes A primary index has all the properties of a secondary index and, additionally, dictates the physical ordering of the indexed relation on the disk. It is also referred to as the clustering index because the DBMS attempts to maintain the physical o rdering of newly inserted and updated data in this indexs order on disk [24] A relation can have at most one primary index while the number of secondary indexes is theoretically limited only by the number of permutations of logical orderings of combinations of attributes on the indexed relation. DBMSs also commonly include a utility to reorganize the tables back into primary index order once they hav e become fragmented on the disk [24] Proper indexes can reduce the number of sort operations required to materialize a relation. The presence of pro perly sorted indexes also allows the optimizer to consider using a merge join instead of a nested loop join to join two relations. If one or both of the relations already have an index in a desired order, an additional sort step is not required to execute a merge join. Eliminating the sort step required to perform a merge join not only reduces the cost of performing the join, but also allows i t to be pipelined. Pipelining in a database query refers to passing tuples from one physical operator to the nex t as soon as they ar e processed. An operator that blocks the pi peline must materialize all the tuples to memory or disk before they are passed along to the next operation. Pipeline ability is a trait that is inherent to the nested loop join


27 algorithm, but not necessarily the merge join algorithm. It is a desirable trait in processing queries not only because it avoids the need to materialize intermediate relations, but also because if an entire query is pipelined, it can begin returning results to the use r long before the quer y has finished processing. In addition to enabling merge joins to be more cheaply used in query plans by ordering the relation, indexes can greatly decrease the cost of performing a nested loop join. The nave implementation of the n ested loop join algorithm scans each tuple in the inner relation once for each tuple in the outer relation. This makes the cost of the nave implementation [T(inner) T(outer)] + T(outer) Indexing the attributes being joined in the inner relation of a nested loop join reduces the number of rows examined when performing the join to [Cost of index lookup(inner) T(outer)] + T(outer) The cost to look up an indexed tuple is roughly equal to the number of levels of the B tree composing the index. B tree indexes in database systems rarely exceed four levels. For example, an index of five attributes on a twenty million row table can efficiently be stored in a four level btree index. The choice of in ner and outer relation has a large effect on the cost of the algorithm and is determined at compile time. Utilizing indexes in nested loop joins commonly results in performance gains of an order of magnitude or more, giving the nested loop join the potent ial to be the most efficient join algorithm in the optimizers arsenal. For example, consider joining 2 relations, A and B Let T(A) = 100,000 and T(B) = 10,000. Optimally A would be chosen as the inner relation and B as the outer. If no suitable index exists on A, the rows examined when performing a nested loop join using the formula above is [100,000 10,000] + 10,000 = 1,000,010,000. However, if a suitable 4 level index


28 exists on A, the calculation becomes [4 10,000] + 10,000 = 50,000. This repre sents an improvement of 20,000, or 5 orders of magnitude. Imagine now that there is no suitable index on A but one does exists on B Using B as the inner relation, and A as the outer, the calculation would be [4 100,000] + 100,000 = 500,000. While this cost is 10 times greater than the optimal case (where the index exists on A ), it still represents an improvement of 2,000 times over the worst case scenario where no index is present at all. Often, accessing an index provides enough informati on that subsequent accesses to the underlying table are unnecessary. If an index includes all the attributes needed to solve the query, including all attributes being projected from a relation, it is not necessary to follow pointers from indexed records t o the relation to satisfy the query. For example, imagine the query SELECT name, age FROM student WHERE age > 30. A single attribute index on age alone would provide a means for quickly identifying which records matched the criteria in the WHERE clause. However, accessing the student table (chasing pointers from the index to the table itself) is necessary in solving this query because the name attribute is being projected. If name had been included as a second attribute of the index on age optimally the index is all that would be required to fully satisfy the query. This technique can also be used to reduce the total amount of disk access required to perform a nested loop join since the index pointers never need to be followed. When possible, scannin g indexes instead of the actual relations is nearly always more cost efficient since the physical size of the index is often significantly smaller than the relation itself. Because the primary index determines the ordering of records on the disk, updates t o primary index fields carry an increased cost over updates to secondary index fields because


29 whenever the ordering of a record within the primary index changes, the same change in ordering needs to be reflected in the underlying relation. Therefore, all secondary indexes for the relation must also be updated to point to the tuples new physical location on disk. The cost to update the primary index is related to the cost to update all other indexes on the relation. On the other hand, the cost to update a secondary index is independent of all other existing indexes [25] The Index Selection Problem The Index Selection Problem is the problem of computing an optimal set of indexes for a workload that will fit inside a finite amount of space The problem is defined more formally in [26] as follows : We are given m queries say Q1, Qm, along with n candidate indexes say F1, Fn. Each index Fj has an associated maintenance cost cj, and requires dj memory units to be stored. Each query Qi can access the DB data by utilizing at most one index, the corresponding execution cost being ui ij if index Fj is utilized. Let = max 0 be the gain for using index Fj for query Qi. ISP then consists of selecting a subset { 1 } (where j S* means that index Fj has to be constructed and stored) such that j S* dj does not exceed a given bound D on the memory available for the indexes. The objective is the minimization of the overall cost for answering all the queries, computed as min min + = 1, or equivalently the maximization of the net gain, max : = 1. [26] ( pp. 957) Comer defines the optimum index selection problem (OISP) in a simplified manner to prove that the problem is NP complete. He defines OISP as Given: A file F with n records and k attributes, and an integer p. Question: Does there exist an indexing set for F with size no more than p? [27] ( pp. 442) Using this definition, he offers a reduction from SAT 3 (which is shown to be NP complete in [28] ) to OISP. He also proves that when the OISP is modified to include a value function (for example, attaching a weight or probability to each attribute) that any solution to the modified


30 version can be used to solve OISP. Therefore, it follows that more complicated formulations of the ISP will remain NP complete. The problem is, in fact, more complicated than any of the definitions discussed thus far. This is because, in modern databa se management systems, it is possible for each query to access multiple indexes on each relation. In addition, one must consider the fact that each relation can have only one clustering index, but may include many secondary indexes. A primary (clustered) index on a relation could possibly provide a much greater cost reduction for any given query than an identical secondary index. When considering the possibility of primary and secondary indexes on a relation, t he nave solution explores the entire searc h space of possible index sets, which is ( 2 1+ 2) = 1 (2 1) where ni is the number of attributes in the i th relation and r is the number of relations in the workload [25] This search space quickly becomes unmanageable, fo r even moderately sized workloads. Choenni shows in [29] that for approximating the best set of primary and secondar y indexes for a workload, research is roughly split between two approaches: an optimizer based approach and a knowledge based approach. The optimizer based approach relies on information obtained from the DBMSs optimizer to arrive at a set of indexes while the knowledge based approach relies on translating the knowledge of experts into heuristics. The optimizer based approach has the advantage of adapting automatically to changes made to the optimizer but the disadvantage of growing in complexity along with the optimizer. This makes it an excellent choice for simple optimizers, but not as feasible for more complex ones. Complexity can be managed using the heuristic approach (the problem of finding a


31 candidate index set is only as complex as the heurist ics). The disadvantage of the heuristic approach is that it is unknown whether or not a derived solution set of indexes will be fully utilized by the optimizer. Choenni proposes a hybrid of the optimizer and knowledge based approaches. Candidate index se t solutions are derived from a set of heuristics. These solutions are then evaluated against the optimizer to determine their efficiency. This approach circumvents the complexities of the optimizer based approach while still retaining the advantage that solutions are validated by the optimizer. However, candidate solution sets are still only as effective as the heuristics used to generate them. These heuristics will need to be revised over time to reflect changes made to the query optimizer. The reader is referred to [26] for more detailed discussion of algorithms for approximating solutions to ISP. Virtual Indexes Chaudhuri in [18] proposed the idea that DBAs should be able to specify h ypothetical, what if indexes (also commonly referred to as virtual indexes ) and analyze their impact upon a statement or set of statements. Virtual indexes are indexes about which statistics are added to the system catalog, but which a re never materiali zed to disk. The result of doing this is that, while the virtual index cannot be used to solve any queries, they can be used in the same manner as traditional indexes by the query optimizer in generating query plans. By utilizing virtual indexes the que ry optimizer can compile queries as if the indexes have actually been added to the system. The resulting access plans can then be inspected to analyze the effect of adding a particular index. Virtual indexes have allowed for greatly advanced techniques in index selection tools. They provide a simple mechanism with which candidate indexes may be tested against the DBMS query optimizer. This eliminates the need for guess work in determining which


32 candidate indexes will be utilized by the optimizer when c ompiling a query while additionally providing a good estimation of an indexes cost benefit to a particular query (by analyzing the query plan and cost estimates provided by the optimizer). T he creation of virtual indexes is possible within the database sys tems offered by major commercial vendors including Oracle, Microsoft SQL Server, and IBM DB2. Tools have been developed to utilize this feature to analyze index coverage on the databases workload. Tools also make use of virtual indexes to suggest possible indexes that can be created to tune queries or sets of queries and analyze the imp act of creating these indexes. Autonomic Computing The goal of IT in business is to improve the quality of delivered services while reducing the cost to provide such services. The rapid increase of computing power (Moores Law) has been well exploited by software development teams, creating IT applications (and hence IT infrastructures) that are ever more complex and sophisticated. These infrastructures require an ev er growing number of highly skilled IT professionals to administer them. Worse yet, the increasing complexity of the application environments and sheer number of applications available require IT professionals who are trained to administer specific applications and have highly specialized skill sets [2] Unfo rtunately, IT professionals are not subject to Moores Law. Because human intervention remains a critical factor in enhancing performance and capacity of components in an IT system, increasing the complexity of an IT infrastructure requires hiring a large r number of IT administrators with increasingly complex levels of expertise. Increasing the complexity of IT environments increases the chance of system deployment failures, hardware and software issues, and human error hampering productivity.


33 Autonomic c omputing systems are systems that can manage themselves given only high level objectives from administrators. The term comes from the autonomic nervous system the system in our body that governs routine, yet vital, functions like heart rate and body tem perature, leaving the conscious brain free to deal with higher level decision making Autonomic computing is the field of using technology to manage technology, much in the same way the autonomic nervous systems governs bodily functions [30] According to IBMs vision for autonomic systems [2] a fully autonomic IT infrastructure should consist of four main automated functionalities: Self configuring Human interaction in configuring the system should be limited to setting high level policies. The rest of the system should adjust automatically without requiring any further human interaction. Self optimizing The system and its components should continually seek to improve their own performance and efficiency. Self healing The system should automatically detect, diagnose, and, if possible, repair or compensate for hardware and s oftware problems within the system. Self protecting The system should automatically defend against malicious attacks or cascading failures. Monitoring should proactively anticipate and prevent system wide failures. A system component capable of automatically performing each of these functionalities can be fully automated with respect to the common day to day tasks performed by IT administrators [31] IBM offers a blueprint for building such an IT architecture in [30] Autonomic Computing in Databases The goal of autonomic computing in databases is to reduce the Total Cost of Ownership (TC O) of DBMSs by minimizing the skill level and amount of time required by humans to maintain the database [32] A 1998 study by the Aberdeen Group found that a 5year, 25user implementation of a leading industrial DBMS incurred 81% of TCO in human skills for training, maintenance, and implementation. Another study performed by D.H. Brown compares the TCO


34 between two major, commercial database systems, Oracle and DB2. In both cases, the cost to build and maintain the system was higher than the cost for licensing and support [33] Database administrators incur a large variety of responsibilities during the life cycle of a DBMS ranging from determining storage and performance requirements in the planning stage to analyzing growth history and query tuning once the system has reached the production stage. Some form of database monitoring capability is required to implement most forms of autonomic computing in the database. Dat ab ase monitoring is defined in [16] ( pp. 473) as the ability to observe the values of system counters that describe the systems state, e.g. execution time of a query, locks held, CP U/memory usage. Most modern database systems in use today include some mechanism to record and report these values. The next step in the progression towards a completely autonomic system is automated monitoring, which is defined as the ability to evalu ate conditions over these counters and take actions on them [16] ( pp. 473) Examples of autonomic computing in databases that are currently in use today include query monitors li ke the Query Patroller found in DB2 UDB, which uses the databases query explain feature to examine the estimated costs of executing all SQL statements that are sent to the database. If the estimated cost to execute a query exceeds a certain threshold d efined by the DBA for the users profile, the query is cancelled, scheduled to run at a later time, or held for manual intervention by an administrator. Query Patroller takes into account various monitored values like the number of queries currently execu ting, the cost of all queries currently executing, individual user priorities, and the number of queries currently executing for the user or users group. The Design Advisor found in DB2 is an autonomic physical database tuning wizard This system has the ability to recommend indexes for tables based on one or more SQL statements


35 that comprise a workload supplied by the user. It enumerates hypothetical indexes on attributes from various combinations of the following sets: interesting orders of the relat ions columns that appear in equal predicates when joining tables, columns that are referenced by range predicates, all sargable attributes except those involved in nested subqueries and large objects, and any other remaining attributes referenced by the q uery Similar tuning advisors exist for Microsoft SQL Server [17] a nd Oracle [34] The query compiler is then invoked on the hypothetical indexes to determine which ones will actually be used in executing the query. The virtual indexes which are used by the compiler are reported back to us er along with an estimate of the size required to materialize the indexes that will be useful, and an estimate of the cost benefit derived from the new indexes [35] IBM has more recently extended Design Advisor to additionally recommend materialized query tables (MQTs) [10] par titioning strategies, and multidimensional clustering tables (MDCs) [11] Autonomic Database Workflow Monitoring and Tuning Autonomic database workload tuning processes rely on monitoring the stream of queries as they execute in the database and the results of that execution. The i deal system tunes the databases workload and keeps it tuned as new queries/tables are added and the data distribution and size of the relations in the database change during the life of the system. The workflow of a typical autonomic tuning architecture i s represented graphically in Figure 21. Much of the previously proposed work fits into this framework. These architectures can be generalized as follows: As incoming SQL statements are executed in the system, the tuning architec ture intercepts and analyzes them to determine what indexes it is currently utilizing, and what the optimal index set is for that statement. In more modern architectures, often this analysis is performed by vendor supplied database tools or via direct cal ls to the optimizer and query compiler.


36 A cost/benefit analysis is performed, weighing the benefit of creating and using the optimal index set over keeping the currently used set against the cost of doing so. The cost in time and processing power requir ed to drop and build the indexes, cost of extra disk space being used, and/or update and maintenance costs of the new index are among the factors that may be considered. If the analysis determines that indexes should be dropped or built, that action is pe rformed. Finally, a repository of statistical information (per haps tracking the utilization and benefit of indexes created, or performance of SQL executing) is updated and the process is repeated for the next incoming statement. The remainder of this chap ter presents an overview and analysis of specific, previously proposed methods and architectures for autonomic index tuning. Prior Research in Autonomic Physical Tuning Architectures The system described in [36] and [37] proposes using a layered agent, called the Benefits Agent, to monitor the system and create and destroy indexes as needed. The layered agent interacts with the d atabase through four stages: Information Retrieval The cost and access plan, including the index usage information, is retrieved for every SQL statement executed Situation Evaluation The agent updates its beliefs about the statement currently being processed. Possible Alterations Enumeration Heuristics are applied to enumerate hypothetical index designs and evaluate the fitness of new candidate indexes. Alterations Accomplishment Indexes are created or destroyed as necessary. This method is statement centric As statements are executed, their performance is compared against an expected performance assumption. I f the observed performance does not meet expectations, the tuning pr ocess is initiated.


37 The Glue Nail system [38] is a further example of a statement centric real time query optimization system However, rather th an using a middleware approach, Glue Nail achieves run time index management within the optimizer itself. At compile time, the optimizer decides an access method for each relation in a query without utilizing any knowledge of existing indexes in the syste m. If the optimizer determines that index access is beneficial to a table scan and that index does not exist, it is built before the query is run. The Glue Nail system relies on the assumption that, lacking an estimate of future workload, the best way t o consider the cost of creating an index that does not already exist is to ignore the cost because we have no way of knowing what other queries coming later in the workload may benefit. The cost of creating an index that may be used only once is less than the penalty for not creating an index that could have been used multiple times in the future. When the time comes to maintain indexes through operations such as inserts, updates, and deletes, the optimizer must make a decision whether or not to keep the i ndexes. Indexes are dropped instead of maintained when a data flow analysis derived at compile time determines that the index has no more potential uses. Sattler describes an index centric approach in [39] and [40] in which a middleware architecture is used for on the fly index creation. Utilizing DB2 UDBs virtual indexes and index analysis utility, the architecture extends the design time approach offered by the DB2 UDB Design Advisor into a dynamic, continually running solution. As queries are executed, current executio n costs are compared to estimated execution costs resulting from an automated attempt to find an optimal index set for the query (in this case, the output of the Design Advisor). A profit value is computed by subtracting the cost of running


38 the SQL stat ements using the current index set from the estimated cost of running the SQL statements using the index set computed by Design Advisor. Every index involved in the design advisors index set is assigned a portion of the computed profit. A repository, which is updated for every executed SQL statement, records the profit statistics for all virtual and currently materialized indexes. As SQL statements are executed, profit statistics are continually updated. The architecture attempts to optimize the index s et for the currently running workload utilizing the assumption that currently running queries will best represent queries running in the near future. Indexes which are not utilized by the currently running workload are subjected to a decay algorithm. As an underutilized indexs profits decay, they are slowly replaced by new, more profitable indexes maintaining an index pool that contains only the locally most profitable indexes in the limited space available. Results of testing this system show that it does not beat the performance of using the DB2 index advisor utility if the whole workload is known in advance. However, this approach produces better results than the index advisor when used in more dynamic environments where queries are less predictable or when the workload is not known prior to invoking the index advisor. Bruno and Chaudhuri present a technique for evaluating candidate indexes during query compilation in [15] This research is founded on the realization that candidate index structures will only provide a benefit to query performance if they are exploited by the optimizer once they are materialized to disk. Because of this, tuning is best done using the DBMSs cost model as opposed to using an external tool with separately coded heuristics.


39 During compile time, the optimizer is exploited by presenting it with optimum choices of physical structures at each step of execution. Candidate index sets are determine d by which structures are selected during compile time. The fact that a hypothetical configuration is selected during compile time yields a large degree of confidence that that configuration will be exploited during subsequent executions of the same or si milar queries once it has been materialized to disk. This process is repeated until a candidate set of indexes is found that meet any space constraints imposed on the system. During each repetition, the candidate physical structures are relaxed slightl y, meaning that the optimizer is presented with slightly less optimal (yet slightly less space intensive) subsets of indexes to choose from until space constraints are met. The system described above is further improved in [12] and [41] A tight upper bound for performance improvement is determined by continually computing locally optimum execution plans during query compilation. Locally optimum execution plans are optimum subplans of the actual query plan computed by the optimi zer. The advantage to computing locally optimum subplans as opposed to the actual optimum plan for the query is that the former is less resource intensive in that it does not require recompilation of the query. Thus, a tight upper bound is computed, guar anteeing a minimum benefit that can be achieved by activating a tuning process. Schnaitter and Abitebouls autonomic database tuning system, COLT, implements a novel technique for reducing the systems computational requirements [42] COLT is implemented in PostgreSQL and makes heavy use of hypothetical indexes to compute tuning recommendations. In this implementation, hypothetical indexes are evaluated only during the compilation time of actively executing queries. Realizing that analyzing hypothetical index configurations is expensive, the number of what if queries is budgeted in this system. A query profiling strategy is introduced to


40 determine which hypothetical configurations are considered. This strategy is largely based on whether or not the query currently executing has been executed recently and how similar the selection predicates are to those recently seen. Using these factors, a degree of certainty is computed with regards to how well the system can estimate performance gains for various hypothetical indexes Because the self tuning module can more confidently guess the degree of performance improvement that will be gained for a query if it has evaluated similar queries in the past, these queries are given a lower priority for running what if queries. This method gives a greater probability to evaluating indexes for which the amount of gain that can be achieved is most uncertain. The method described in [14] treats the database workload as an ordered sequence rather than just a simple collection of queries. The observation is made that many workloads have detectable patterns of query (i.e., select) vs. insert and update activity. For example, many database tables are queried during the day and updated or loaded at night in a batch process. Given a pattern of activity such as this, it may be more efficient to drop the indexes on a table before it is updated or loaded at night and then rebuild them before the select activity resumes during the day. The problem of tuning the workload as a sequence can be modeled as finding the shortest path over a directed acyclic graph. Among the possible solutions discussed, using a greedy heuristic scales the best for large workloads. A solution is represented as a series of create/drop index statements placed throughout the workload. An additional advantage of dropping and rebuilding indexes is that it presents an opportunity to analyze queries dependent on tables for which indexes are being rebuilt and to possibly change the index being rebuilt to account for growth or changes in data distribution in


41 the underlying relation. However, a disadvantage of the workload as a sequence method is that the query workload running on a database is often composed of multiple independently executed processes that may rely on the same relations. It may be impossible to determine when these processes may be executed relative to each other. Imagine two independent processes, A and B, are running which rely on the same tables. Process A does a significant amount of updates and inserts while process B performs a large number of select statements. Process A runs optimally with no indexes on the underlying tables whereas process B could make full use of the indexes. Situations such as the one described above would not be uncommon in many large scale database systems. Tuning the workload as a sequence under these circumstances is a more difficult problem to model. An optimal s olution may require an underlying business knowledge of the data flow within the system. In this chapter, we have discussed the theory and basic concepts behind query compilation, the index selection problem, and the methods used to automate the process of index selection. The latter half of this chapter has presented an overview of the prior work done in autonomic index recommendation systems. The next chapter presents an overview and blueprint of the autonomic tuning architecture we have developed along with a discussion of its unique and innovative features when compared to the prior art.


42 Table 21. Statistics used in costbased calculations. Value Description T(R) # of tuples in relation R V(R, a) # of different values in relation R for attribute a. B(R) The minimum # of blocks in which R can fit Database Information Repository Copy of Incoming SQL Query Listener Cost/Benefit Analysis Index Analysis Retrieve & Update Cost/Benefit Information Index Set Alteration Virtual Index Analysis Incoming SQL Figure 21. A typical autonomic index tuning architecture


43 CHAPTER 3 A MODEL FOR FORECAST ING FUTURE QUERY BEHAVIOR This chapter presents an overview of a method for forecasting the behavior, estimated cost, and index utilization of a query workload utilizing data threshold detection, followed by a blueprint of the architecture for a preliminary implementation. The chapter concludes with a discussion of the algorithms used for an extension of autonomic query tuning using virtual indexes and our future optimal inde x detection method. Motivation The majority of the prior research produced in autonomic tuning architectures relies heavily on the use of virtual indexes (the concept of predicting the effectiveness of an alternative index configuration without actually ma terializing the alternate configuration to disk). In addition, virtual indexes are utilized extensively in the user space by developers and DBAs in conjunction with various manual analysis and tuning tools. Currently, virtual indexes are limited in the se nse that they can only be utilized to make recommendations for the current size and distribution of data in the database. We propose to extend the concept of virtual index analysis by removing this limitation. The method we have developed allows for the evaluation of query cost and behavior when presented with any combination of existing (currently materialized) or hypothetical (virtual) indexes for future time periods and distributions of data. The capability to view query plans and optimal index recomm endations for future time periods can be leveraged in numerous ways. As an extension to user space tools, it could provide DBAs with an analysis of future optimum index sets. Currently, when tuning a query, the virtual index analysis tool will provide t he DBA sufficient information to predict the execution cost of running the query being tuned given the optimum index configuration (or a user specified configuration). This provides the


44 DBA with enough knowledge to calculate the cost benefit of executing the query with the current index set versus the proposed index set. However, the DBA is limited to calculations only for the present time frame. The optimum index set calculated on the current database configuration may not be optimal a week or a month i n the future. Furthermore, the DBA has no knowledge of how performance of the new index set will scale with time. This knowledge could be invaluable in making tuning decisions. As a related example, consider a new process being developed for a database system. Development processes can often span a period of many months before the developed queries are run on a production database server. As a result, query tuning and query development is often limited in scope to knowledge of the current size and dis tribution of data in the database. Even if developers know that the size of relations and distributions of data will grow significantly before their queries are run on the production server, they are largely limited to guessing how their queries will beha ve on the larger data. If a tool incorporating the proposed virtual index extension were available, this would no longer be a problem. Such a tool could predict query execution costs at the time when the process goes into production and possibly much fu rther in the future. This could potentially save on development and maintenance costs and prevent processes from having to be redeveloped in the future. Consider the following example: A new process is being developed, which contains a query that executes ten thousand times in an inner loop. Such queries are not uncommon in database processes and are often the source of trouble. This query must be carefully optimized. Consider that the query currently executes in one tenth of a second. In aggregate, t hat query would account for slightly more than fifteen minutes of the processs overall running time.


45 Suppose that six months from now, the data had changed such that this query was now taking a quarter of a second to execute. Now, in aggregate, this que ry is responsible for over 40 minutes of the overall running time. Clearly, if the described trend continues, something must be done to tune the query. Three scenarios are possible. In the best case, a simple change of indexes will speed the query to an acceptable level of performance and will scale better into the future. In the second possible case, the query can be improved to an acceptable performance level through a combination of reindexing and query rewriting. This scenario is more difficult, be cause the query itself must be changed which typically requires a testing and review process. Often queries are rewritten to run faster by making various assumptions about the uniqueness or properties of data in the underlying relations. This carries wi th it a certain level of risk. In the worst case, no combination of rewriting or reindexing can be found to improve the query performance. In such cases a significant reimplementation of the process is often necessary. A tool, which incorporates the e xtension to virt ual indexes we describe in this dissertation could have potentially prevented the scenario described above. Had the developers and DBAs been able to predict the future behavior of the queries, the results could have changed the course of tuning or development. Thus, such a tool fits well within the goal of autonomic utilities with a potential to provide a significant reduction of TCO through the benefits of the database self monitoring its own metadata. The ability to tune and recommend i ndexes on queries for future predicted database configurations could also be incorporated into a fully autonomic tuning platform. The process of performing virtual index analysis for future database configuration involves creating a modified version of the databases system catalog. Compiling queries against this modified system


46 catalog is conceptually no different than compiling queries against the actual system catalog. No differences in behavior are realized when compiling in an explain mode where que ries are compiled and optimized, but withheld from execution. This is the common setting used by tuning and analysis utilities. Thus, the body of research in autonomic physical tuners rema ins widely applicable under our extension. Current autonomic tuning architectures are bound to a reactive tuning scheme, where optimal index sets can be discovered only after the currently optimal set has become suboptimal. An autonomic tuning platform incorporating the methods proposed in this dissertation would be abl e to proactively tune index sets, incorporating changes before queries ever run under a suboptimal configuration. Overview of the Forecasting Architecture The architecture fo r the tuning system we have developed ( Fi gure 31 ) which we call the metadata analysis and edge detection agent (MAEDA) is split into three major components: a Query Listener, a Runstats Listener and an Information Repository (the functi on of these are discussed, in depth, later). MAEDAs architecture relies on the observation that query plans produced by the query optimizer will remain essentially unchanged until statistics collection is performed within the database. When the MAEDA architecture detects that new statistics have been collected, the new statistical data is harvested and MAEDA updates its predictions regarding the growth trends of the statistics being tracked in the database via the Runstats Listener. The Analysis/Simul ation Engine represents a unique aspect offered by the MAEDA architecture when compared to the prior art. Previously proposed architectures are reactive in nature. They are designed to detect and fix inefficiencies after they have already appeared in the workload. More specifically, the event that triggers the tuning system to drop/build indexes is


47 SQL running through the database rel ying on data distributions that have already crossed a data threshold. MAEDA, on the other hand, via the Analysis/Simulati on Engine is designed to be proactive in nature. The Analysis/Simulation Engine attempts to identify data thresholds before they are crossed. With the growth predictions from the Runstats Listener, repeated simulations can be run, using custom or vendor supplied virtual index analysis tools (e.g., in the case of DB2, db2advis). Before running each successive simulation, the system catalog is modified to simulate the growth predictions provided by the R unstats Listener, growing the relations involved in t he queries being analyzed such that for each tuning iteration, the system catalog reflects the data sizes and distributions predicted at some poi nt in the future. This causes the DBMS and the h ypothetical index analysis tools to view the relations for wh ich the system catalog has been modified as if they had grown to the future time. Once the optimal index set suggested by running the simulation changes from what is advised for the current data distribution, we learn two things. First, we learn the approximate location of the data threshold and second, we learn the optimal index set to build once the data threshold has been crossed. This is what provides the MAEDA architecture the advantage of being proactive. As long as the growth predictions remain st atic, no further simulation or analysis needs to be performed. Once we have detected that a relation has grown past a data threshold, the optimal index set can be built for all queries that rely on that relation, potentially before the queries have run in a suboptimal state. Index recommendation and tuning systems are known for being computationally intensive when deciding optimal indexes for a workload. The computationally intensive (analysis/simulation engine) component of our architecture requires only the database metadata


48 (statistical data and table/ object definitions). Because this information can be easily copied between databases, it would be a trivial modification to move the analysis/simulation engine component to its own server, thereby offloading the computational requirements of the architecture from the database server. This concept could then be extended to having one analysis/simulation engine serve multiple databases/ database servers as depicted in Figure 3 2. The remainder of this chapter provides a more in depth description of each individual component of the MAEDA. The Runstat s Listener The Runstats Listener ( Figure 33) is the component of the MAEDA architecture that is responsible for creating and maintaining growth trend predictions of relations and data distributions relevant to queries tracked by the Query Listener. In typical commercial databases, time is set aside to perform maintenance tasks on database objects. This includes backups, reorganizing indexes and tables (similar to defragmenting the ha rd disk on a PC), and updating system catalog statistics for tables and indexes. Currently, the Runstats Listener is manually activated by a call at the end of a database maintenance period. Conceivabl y, the Runstats Listener could listen for the mainten ance period using a similar mechanism to the Query Listener. However, the manual approach simplifies testing and may even be preferable in commercial scenarios. Once activated, the Runstats Listener first queries the database to determine which objects ha ve updated statistical data since the previous invocation of the listener. Objects found to have updated data are harvested from the database system catalog. Thes e statistics are stored in dedicated tables in the Information Repository. Once all relevant statistics have been harvested, growth trends are computed for these table and column statistics. These growth rate predictions are stored in the Information Repository.


49 When all processing has completed, the Runstats Listener returns to an idle state, waiting for the next maintenance period. The Analysis/Simulation Engine The component of the MAEDA architecture that locates data thresholds is the Analysis/Simulation Engine ( Figure 34). When the Analysis/Simulation Engine is activated, it is presented with a set of queries to be analyzed, Q and a maximum time, T which represents the maximum time period for which query behavior is to be predicted. These queries are examined to determine the set of relations relevant to the query set being analyzed. MAEDA keeps a shadow copy of all relations being tracked in a separate schema inside the host database. This shadow copy oc cupies no space in the database, as all the tables contained within it are empty. Prior to analysis, the statistics housed in the system catalog, which are relevant to the queries currently being tuned, are copied to the shadow copy database. In DB2, thi s is accomplished via the db2look utility, which is designed to make copies of specific elements of database objects. Because the shadow database uses the same statistics as the actual database it is modeling, queries compile the same and compute the same cost estimates as when they are compiled on the actual database. Queries that are set to be analyzed are compiled against the shadow database, an d run through an index advisor utility like those discussed in Chapter 2. The index advisor utility uses virtual indexes to find the optimal index set for a given query. The optimal index set for the query being analyzed is noted. This is the optimal index set for the current size and distribution of data currently in the database. Using the growth estimates created by the Runstats Listener, the system catalog statistics for the tab les in the shadow database are grown from the current time to time T The simulation is then rerun. During the next simulation, either the optimal index set will change or it will not.


50 If the index set changes, we know that a data threshold is predicted to be crossed at some point between the present configuration and time T If it does not, we can assume that no data threshold will be crossed prior to the point to which we have simulated an increase in data growth. Members of Q for which the optimal index set has changed are noted and the simulation is rerun, using a search strategy to attempt to locate the point where the data threshold occurs. Repeated simulations are run until the point at which the index set returned diverges from the optimal index set can be reliably estimated. As data thresholds are located, indexes are removed from the simulation set. The process terminates when the simulation set is empty. When a data threshold and associated optimal index set are disc overed, that information is stored in the Information Repository to be returned to the caller. Once the Simulation/Analysis Engine has completed analyzing all the queries in Q it returns to an idle state, waiting to be triggered again. The Information Re pository The Information Repository is the database used to house all the information collected by the three components of the architecture. It houses the growth data collected by the Runstats Listener, and the index sets and simulation results collected by the Analysis/Simulation Engine. The amount and time span of retained historical query data and statistical data (i.e., historical table/index cardinalities or queries that hav e not been executed recently) varies depending on the domain A process ass ociated with the Information Repository is responsible for maintaining the data according to some retention policy. This policy may differ depending on the characteristics of the workload and the particular regression/forecasting techniques used to predic t future statistical values. Considering a shorter time span of historical data when computing growth trends will likely lead to a more volatile system that quickly reacts to changes


51 in the nature of the growth patterns in the data. A system that compute s growth trends using a larger time span of data will likely be less volatile, but also less quick to react to sudden fluctuations in usage patterns of the database. One must also consider distance for which forecasts are desired and the volatility of the growth patterns being forecast. Because the Information Repository is efficiently and easily assessable by all pieces of the architecture, it is the ideal mechanism through which to pass information between components. Future Optimal Index Detection Virtu al indexes [18] as discussed in the previous chapter, have become widely used and are implemented in commercial RDBMSes, including Oracle, MS SQL Server, and IBM DB2. By creating simulated system catalo g data representing an index, queries can be compiled as if the index exists, without actually materializing it to disk. Index analysis tools can use virtual indexes to provide the query compiler with multiple index choices that have not actually been mat erialized. The DB2 index recommender and virtual index generation framework [35] creates virtual indexes using information obtained solely from the system catalog statistics. When tasked with optimizing a single query, the DB2 index recommender builds all combinations of virtual indexes that are likely to be exploited by the optimizer when compiling the query being tuned. Vi rtu al indexes are created via the Smart column Enumeration for Index Scans (SAEFIS) enumeration algorithm (described in detail in [35] ). The SAEFIS algorithm enumerates indexes based on knowledge of how the DB2 Optimizer will evaluate the query. Particular preference is given for columns appearing in equality preferences and interesting orders. Statistics are derived for t he virtual indexes using the table and column cardinality and average column size statistics stored in the system catalog. Because the true values are difficult to estimate, statistics for index density and clustering are assigned pessimistic values by th e DB2


52 index recommender. All virtual indexes created will also be marked as non unique. This is due to the fact that uniqueness is not only difficult to identify, but modifies the business rules of the database. The fact that a set of columns comprising a candidate index is currently unique (containing no duplicates) does not imply that it will remain as such. After candidate indexes are enumerated and their statistics are derived, the q uery optimizer is invoked in a noexecute mode in which it considers virtual indexes in addition to those which are currently materialized. Virtual indexes used as part of the query plan that is returned f rom the compiler are marked as recommended. The DB2 index recommender is able to recommend a set of indexes with a si ngle call to the query optimizer. The MAEDA architectures process for predicting optimal index sets at future points in time is an extension of the concept of the DB2 virtual index analysis framework. Before analysis is performed, the system catalog dat a is modified to simulate what we estimate to be the statistics at the point in time for which analysis is desired. Index analysis is then performed. When predictions of statistics are 100% accurate, the result of running the index analysis for a simulated future time will be identical to the results obtained when the same analysis is performed when that future time becomes the present. Future values are forecast for the following database statistics: table and column cardinality, number of pages used by the table, high and low column keys, and column level equal height his tograms These forecasts are all utilized by the query optimizer when compiling the query to be tuned. Table and column level cardinality forecasts are especially vital as they will a dditionally be used to generate statistics for virtual indexes. Two noexecute modes are suppor ted by the DB2 optimizer. The recommend indexes mode evaluates the query and places index recommendations in a table in the users schema


53 called ADVISE_INDEX. The second mode, evaluate indexes optimizes the query with the addition of virtual indexes placed in the ADVISE_INDEX table. This ability to perform virtual index analysis on forecasted statistical data is essential to the data threshold detection proces s. Given a query, or set of queries, the MAEDA architecture is capable of determining what database objects are associated with the queries and populate its system catalog with statistical data for any arbitrary future time. The optimal index set can then be calculated by executing the queries in recommend indexes mode or queries can be compiled by including data for previously defined virtual indexes which can be evaluated in evaluate indexes mode. Queries are evaluated in groups when possible to minimiz e the number of growth operations performed amongst common query elements (tables/columns). Data Threshold Detection Data threshold detection is the process of determining the point in time at which an optimal index set becomes suboptimal and the newly opt imal index set associated with the query group being analyzed. Given a workload, W data threshold detection will be performed on a set of queries, Q is analyzed to determine the common set of relations, R represented in the queries. The system catalog of the shadow database is updated with the statistics pertinent to the relations in R The optimal index set is calculated for the present time ( TC) for each : ( ) Using the pre viously calculated growth estimates, the shadow database system catalog is grown to some future time, TC+N. Next, we calculate ( + ) If ( ) ( + ) for any we can assume that q will cross a data threshold at some point < + Thus, TC TC+N defines the search space for the data thresholds found to exist in Q


54 The problem now becomes locating (or closely approximating) Min(Tj) for each index, i such that ( + ) ( ) TCTC+N is searched through subsequent growth operations and invocations of Optimal to locate each element As indexes in I are located, the set R is reevaluated to determine which members are still neede d for evaluation of the remaining candidate indexes. Once all candidate indexes in I are located that depend on a particular query q, that query can be removed from the simulation set Q Likewise, once all q in Q have been removed that depend on an element r that element can be removed from R Because of this property, each stage of the search through TCTC+N has the potential to become less expensive computationally as the algorithm proceeds.


55 Database MAEDA information repository Analysis/ Simulation Engine Runstats Listener Catalog Statistics Growth Predictions Growth Predictions Query Simulation Figure 31. An overview of the MAEDA architecture. MAEDA Server Database A MAEDA information repository Analysis/ Simulation Engine Database B Database C Database D Query Listener / Runstats Listener Figure 32. Multiple databases served by a single MAEDA server.


56 Database MAEDA information repository Pull Table/Column Statistics Previous Statistical Data Table/Column Stats Maintenance Cycle Alert Compute New Growth Estimates / Flag Changes Update Growth Estimates Figure 33. The MAEDA Runstats Listener Database MAEDA information repository List of Queries Flagged for Analysis Determine Current Optimal Index Sets (time/size = 1) Grow Shadow Database to Time/ Size = N Growth Data Modify Shadow Database Index Analysis Determine Optimal Index Sets at N Index Analysis Determine Queries with Thresholds [Optimal(1) <> Optimal(N)] Search [1N] for Thresholds of Effected Queries Modify Shadow Database/ Index Analysis Revised Threshold Data Figure 34. The MAEDA Simulation/Analysis Engine


57 CHAPTER 4 FORECASTING METHODS AND ALGORITH MS The previous chapter provided an overview of the architectural components of the query behavior forecasting method we have created. The functions of the various architect ural pieces were presented along with the various interactions amongst the architectural components and between the archite cture and the target database. In addition, we defined and presented a method to accomplish data threshold detection. In this chapt er, we provide an overview of the methods and processes u sed to make forecasts for of the various database statistics essential to the method described in the previous chapter The mechanism of forecasting future values of the statistics housed in the DBMS system catalog is central to the operation of our query behavior prediction method. Calculating reasonable forecasts for table and column level database statistics is essential in providing accurate future query cost and access path predictions. We have investigated the use of many methods and models for regression with respect to their effectiveness in forecasting the databases metadata. We must make forecasts on many different types of database statistics (high/low keys, cardinalities of tables, columns, and existing indexes, column level data distributions, etc.) to provide an accurate estimate of query cost. The accuracy of various models may vary significantly based on the type of statistic being used. Therefore multiple regression techniques ar e used to provide our forecasts. Least Squares Linear Regression Least squares linear regression is a common method of regression in which a line is fit to the known data points such that the sum of the squared residuals is minimized. Using the simple met hod of least squares linear regression, the best fitting line is derived as = + where a


58 and b are constants relating x to y This is the common equation representing a straight line, where a is the intercept and b is the slope. a and b can be determined by minimizing: = ( )2 = 1 (4 1) where N is the number of points being fit, and Yi is an observed value of the dependant variable at time i The solution can be found by simultaneously solving the following two normal equations [43] : 2+ = ( ) (4 2) + = (4 3) Any polynomial can be used by this method to fit a curve. The method of finding the coefficients will be similar to the simple case with an nth degree polynomial generating n + 1 normal equations to be solved simultaneously. The normal least squares model can be modified by adding weights to Y When unequal weighting is applied to Y this weighted least squares model allows one to modify the contribution each point makes towards the computed parameters. We linearly increase the weights applied to observa tions such that the latest observations are weighted the heaviest. This alters the curve fit to give preference towards the most recent observations, causing recent trends in the data to be more strongly represented by the fit. Autoregressive Integrated Moving Average Models Box and Jenkins [44] developed a model for forecasting time series that combines autoregressive (AR) and moving average (MA) models with differencing to achieve stationarity (constant variance and m ean over time). These autoregressive integrated moving average models are known as ARIMA( p,d,q) models where p is the order of the AR portion, d is the order of differencing (I), and q is the order of the MA portion.


59 An ARIMA( N ,0,0) model involves only computation for autoregression. The autoregressive property of data is the extent to which previous values determine the current value. The AR( N ) component is an infinite impulse response filter. It can be represented mathematically as: ( ) = ( ) = 1+ (4 4) where ai are the autoregression coefficients and t is the error term, or white noise with variance 2 and zero mean. In contrast, an ARIMA(0,0,N ) model involves only computation of the moving average component. The MA( N ) component is analogous to a finite impulse response filter. The moving average is used to remove serial correlation of the error terms in the data. It can be represented mathematically as: ( ) = ( ) = 1+ (4 5) where i are the coefficients of the trailing error terms. The I( N ) component determines the level of differencing performed on the data. The first order differences are the series obtained by each term in the original sequence from the preceding term. Differencing removes trend from a series, at tempting to create a stationary series from a non stationary one. Comparing and Selecting Models Because it is possible to fit many different models of curves to any set of data, some method for comparing fits and selecting the best fitting candidate must be established. To assess the effectiveness of a given model or curve fit we break the data into two parts: an estimation period and a validation period. A list of candidate regression models is presented for evaluation of a given point forecast. Model s are tested in both unweighted and weighted forms. Each


60 candidate regression is performed to obtain a fit for data within the estimation period and the data in the validation period are used to assess the quality of the predictions. In the current MAEDA implementation, the penalized version of root mean square error (RMSE) of the residuals in the validation period is used to compare various fits and models for the given data. A penalty term is introduced for each candidate model, reflecting it s complexit y. The penalty term acts as a multiplier to the calculated RMSE, yielding a penalized RMSE. The fit with the lowest penalized RMSE is determined to be the best model for performing predictions. The penalty term acts to encourage simpler fits in cases wh ere, according to the RMSE numbers, multiple models seem to explain the data equally well. For a more complex model to be selected, it must show a significant enough improvement over other candidate models to overcome its associated penalty term. This di scourages the system from selecting more complex models unnecessarily, thereby reducing the risk of over fitting. The entire known data set is then used in conjunction with the selected model for making out of sample predictions. Results of testing shows that this method tends to favor models within the class of weighted least squares models for making point predictions regarding future values of table and column statistics such as car dinality and high and low key Quantile Regression Point predictions can be used to forecast the majority of statistics necessary to perform query and index analysis on the workload. However, the database is commonly instructed to collect a more detailed level of statistics on many database columns. This more detailed level of statistics collection builds an equal height histogram for the selected columns. Equal height histograms (the quantiles also commonly referred to as percentiles ) represent a nonparametric probability density function (pdf).


61 Quanti le regression, described by Koenker and Hallock in [45] is a met hod for the estimation of conditional quantile functions. A conditional quantile function returns the quantiles of the conditional distribution of a response variable expressed as a function of its covariates. Performing quantile regression can be thought of in analog to the process of least squares linear regression. The least squares method estimates a conditional mean by minimizing the sum of squared residuals. The median of a distribution can be found through equating the number of positive and negative residuals. Thus, similar to least squares regression, the median can be estimated by minimizing the sum of the absolute residuals. Generalizing to the other percentiles is accomplished by minimizing the sum of the asymmetrically weighted absolute residuals. This is done by tilting the absolute value function, shown graphically in [45] ( pp. 146). More formally, the 100 ( p ) th percentile regression line conditioned on x, qp( u) is determined by calculating the value of that minimizes ( ) = 1 (4 6) where ( ) = [ 0 )( ) ( 1 ) ( 0 )( ) (4 7) the check function (Equation 47) accomplishes the task of asymmetrically weighting the residuals [46] Because the check function is not differentiable at the origin, there is no explicit solution for obtaining the regression coefficients. However, the quantile regression problem can be formulated and efficiently solved as a linear programming problem. Our implementation uses the algorithm provided by Koenker and DOrey [47]


62 Unlike least squares regression, quantile regression does not assume a Gaussian distribution underlying the data being fit. Thus, we can use the quantile regress ion process to provide estimates for a variety of distributi ons, conditioned on data from distributions unknown to us a priori. We have explored fitting linear, quadratic, and logarithmic quantile curves to our data. Additionally, as in our discussion of le ast squares regression, weights may optionally be added to the observations, yielding a weighted quantile regression process. Forecasting Table Level Statistics To simplify the discussion, we make a distinction between tabl e level, column level and index level statistics. Table level statistics are those that must be estimated only once per table. Column level statistics must be estimated for each field of the table being forecasted. Index level statistics must be estimated for each index which already exists on the table being forecasted. We fore cast two table level statistics: cardinality and npages Npages is the number of pages, or blocks, of data the relation takes on disk. Cardinality is the number of tuples in the relation. The npages statistic is forecasted by weighted least squares regression. Both weighted and unweighted versions of linear, quadractic, and logarithmic curves are fit. Weighted fits are weighted linearly with respect to the number of observations meaning the n th observation out of N total observations will have a weight of n/N, with the oldest observation being assigned a weight of 1/ N and the most recent observation being assigned a weight of 1. The fits are then scored using the previously discussed process. The highest scored curve is then refit to the entire data set. As we use this process for a variety of point forecasts, we will subsequently refer to it as the weighted least squares forecasti ng method. Cardinality is derived from the forecasted npages statistic. Using the present time system catalog statistics, the number of tuples per page is calculated by dividing the current cardinality


63 by the current npages. The forecasted cardinality is then calculated by multiplying the forecasted npages by the tuples per page. Column Level Statistics We forecast column cardinality high2key and low2key statistics for all columns. In addition, if quantiles are being calculated for a column, they are a lso forecast. Quantile forecasts are comprised of two values: the column value of the quantile and the number of rows below that value. Column cardinality is defined as the number of distinct values in a column. We calculate column cardinality via the w eighted least squares forecasting method. The weighted least squares forecasting method is also used to calculate high2key and low2ke y statistics. To predict the quantiles we sample from the previously collected quantiles upon which we are conditioning ou r forecast. Samples are drawn for each historic al set of quantiles. W e make the assumption that data between the quantiles (in the bins of the equal height histogr am) are uniformly distributed. Each historical quantile function is sampled n times. Weigh ts for each histogram bin are determined according to the size of the histogram bins defined by pi pi 1 where pi is the percentage of table values below the i th quantile Let value(pi) be the va lue at the quantile specified at pi. Samples are drawn from the uniform distribution between value(pi) and value(pi 1) with probability pi pi 1. Q uantile regression is then performed on the sampled data. The results presented in this dissertation use linear quantile regression fits. The regression is weighted in the same manner as the weighted least squares forecasting method, where a sample from run n will be assigned a weight n/ N The quantile regression process outputs a step function for the interval from [0, 1) at the time period r equested. This is transformed to respect monot onicity as described in [48]


64 When building the forecasted quantiles, we consider two cases. We define low cardinality columns as those having cardinality less than or equal to ten times the number of quantiles collected. In the low cardinality case, we calculate the quantiles based on the last values of the quantiles used in the estimate. For each quantile in the most recently collected set, value ( pi) is taken as the value of i th quantile value to be recorde d in the system. The step function provided by the quantile regression is searched for the larg est input for which the step function is less than or equal to this value. The input is taken as the percentile the forecasted i th quantile value represents i n the system. This is solved by maximizing xi in value(xi) where value(xi) is less than or equal to the last i th historical quantile where value( x ) is the step function generated by the quantile regression process. The number of values below the forecasted quantile value is de rived by taking the product of xi and the forecasted table cardinali ty The high cardinality case follows an opposite procedure. The percentile values are determined by the last collected s et of quantiles in the s ystem. The new quantile value is given by the output of value(xi) where xi is the percentage of value s below the value of the most recently collected i th quantile. This value is adjusted using the same maximization procedure described in the low cardina lity case. The number of values below the forecasted quantile is also determined in the same manner as the low cardinality case. Index Level Statistics Statistics must also be forecast for any preexisting indexes built on a table for which forecasts are m ade. There are several statistics that must be forecast for each index: the first key cardinality the combined cardinality of the first through fourth index keys (i.e., second key cardinality third key cardinality and fourth key cardinality ), the inde x cardinality, the number of row identifiers in the index, the number of leaf pages in the index, th e number of levels in the index, and a measure of the average number of sequential pages in the index.


65 The first key cardinality is equal to the cardinality forecast for the first index column. Second through fourth key cardinalities are forecast using the weighted least squares forecasting method if there exist sufficient index field s to require the statistic. Full key cardinality is also forecast using weighted least squares. If the index is identified as being unique, the full key cardinality is assigned the same value as the cardinality forecast for the table that is indexed. The statistic for number of row identifiers is also equal to the parent tab les cardinality forecast. If the index contains four or less columns, the full key cardinality is assigned to the cardinality statistic for the appropriate combined column measure. The number of index leaves is calculated via the weighted least squares forecasting method. Because the number of levels in a typical B tree index rarely exceeds three or four, the number of index levels must be predicted based on the number of index leaves and the average size of an index row, derived from the sizes of the c olumns which make up the index. The estimate for sequential pages is derived by taking the product of the sequential pages value from the most recently collected batch of statistics and the ratio of newly forecasted number of leaf pages divided by the mos t recently collected statistic for number of leaf pages.


66 CHAPTER 5 TESTING AND PERFORMANCE EVALUATION The previous two chapters have provided a description of the architecture, algorithms, and mathematical methods used to calculate query behavior forecast s. We have tested this method of query behavior forecasting on a prelimi nary implementation of the MAEDA architecture. Testing was performed on both synthe tic and real world data sets. In this chapter, detail s of the preliminary implementation and testing methodology are discussed followed by an overv iew of results obtained from testing along with a corresponding discussion and evaluation of their significance. Preliminary Implementation We have implemented a preliminary version of the MAEDA methods and architecture detailed in this dissertation for the purposes of proof of concept and testing. DB2 UDB v9.5.0.0 running on a 32bit Linux platform was chosen as the DBMS backend for the implementation. DB2 was a desirable choice for rapid development and testing because it offers the ability to easily modify database statistics in the form of updatable views. Furthermore, it can be verified from the literature that DB2 uses only catalog statistics when creating and evaluating virtual indexes. The methods DB2 uses to derive the statistics for virtual indexes and estimate optimum index sets is described in [35] The R language [49] was selected as the p latform for statistical analysis and is linked to the Information Repository and the Analysis/Simulation Engine through a Perl object that we have developed. R is a widely used language for statistical analysis and contains many efficiently coded regressi on models as well as the ability to extend and de fine custom methods. Various autonomic components of DB2 are utilized by the MAEDA architecture. SQL execution is monitored through the event monitor framework. When a SQL event monitor is


67 created, details and statistics regarding the execution of any SQL statements are output. This output is monitored by the architecture via a named pipe and is inserted into the Information Repository. Virtual indexes are created and evaluated through calls made via the previously discussed explain modes available through the SQL interface. Testing Methodology The MAEDA architecture was envisioned to be capable of forecasting query behavior in database systems for which distributions and sizes of data are expanding sizabl y with respect to time. Because the MAEDA platform is designed to forecast optimum and realized query behavior across wide spaces of time, it is a requirement that any testing data set be large enough to be effectively partitioned into sizeable segments w hich respect some idea of temporal order. Conceiving a strategy for partitioning the data set being evaluated is the first step in the testing process. A driver program is developed for the data set that will partition the data and load it into the databa se incrementally. This driver is invoked at the beginning of every iteration of the testing process. Additionally, it is logical that any data store would contain an initial set of data before the MAEDA architecture is deployed. Therefore the dataset an d driver must be capable of performing an initial loading process which seeds the database with some initial data. We have developed our testing methodology to encourage repeatability and speed of testing at the expense of some of the usability described i n the architecture chapter. All tests begin by loading a query workload set into the system. Queries are compiled and run one time on the empty database schema to insure correctness. Next, the initial load is performed, and control of the test is passed to the main testing loop. The first step of the main testing loop invokes the driver program discussed above to load one iteration of data. Data is loaded using DB2s fast loader utility. The fast loader is invoked in nonrecoverable mode, greatly increa sing the speed at which data can be loaded into the DBMS


68 by not performing any logging. Next, a maintenance period is simulated by invoking DB2s reorg utility to reorganize the tables into primary index order on disk, compacting and reordering the set of data on the disk. Then, statistics are collected on the freshly loaded data. This is followed by an invocation of the Runstats Listener which harvests the newly collected statistics and stores them in the Information Repository. For testing purposes, the Information Repository is located in the testing database under the MAEDA schema. Once all statistics for the current iteration of the test have been recorded, the db2look utility is invoked to extract the data definition language ( DDL ) to mimic the structure of the tables and indexes upon which forecasts are to be made. This DDL is then loaded into the testing database under the shadow schema. Next, the Analysis/Simulation agent is invoked. Forecasts are computed for all statistics rel evant to the current test. Each iteration of the test requires the simulation agent to make predictions x number of steps ahead. Therefore, forecasts f or x number of steps into the future are added to the Information Repository to be used in the growth simulation for the currently running iteration. Once all forecasts have been computed, the simulation stage begins. First, all queries in the system are compiled against the actual database schema to obtain the query cost given the currently materialized index set. Then, the DB2 index recommender is invoked to provide an estimate of the optimum query cost and its associated index set. These results are u sed to validate the query cost forecasts. Next, the simulation agent is invoked to modify the system catalog to reflect the forecasts for x iterations of testing into the future. This system catalog data is modified for all tables and indexes relevant to the query set being forecast. These tables and indexes were all added to the


69 shadow schema when the DDL generated by the db2look utility wa s executed. It is the catalog statistics for the shadow schema which the simulation agent modifies. The actual system catalog containing statistics for the testing data remains unmodified throughout the simulation process. Once the simulated growth is co mpleted, the query set is compiled in the shadow schema and the query costs are recorded. These cost forecasts represent the forecasted cost to execute the query under the currently materialized index set. Next, the DB2 index recommender is invoked for e ach member of the query set. The query costs and index set recommendations from the DB2 index recommender are saved in the Information Repository. These represent the forecasted optimum query cost and the index set utilized by the compiler to achieve the access plan which derived that cost. To obtain the results reported in this dissertation, all tests were run for fifty iterations unless otherwise noted. For the first five iterations of testing, the Runstats Listener and the Analysis/Simulation agent ar e not invoked. This provides a burn in period for the data. No statistics or forecasts are computed or saved during these initial iterations. The next ten iterations of the testing process invoke the Runstats Listener, but do not invoke the Simulation /Analysis agent. The Runstats Listener harvests statistics from the system catalog during this period of the test, but no forecasts are made. This provides the Simulation/Analysis agent a set of statistics upon which to make its initial forecasts. The s ixteenth through fiftieth iterations invoke both the Runstats Listener and the Simulation/Analysis agent. During this period of testing statistics are harvested and forecasts are performed. Validation and analysis of the results is performed by comparing the predictions versus the actual costs. Results are best seen graphically, as the intention of the MAEDA architecture is to provide a forecast of trending cost (if a large jump in query cost is predicted slightly early or late,


70 but still well within the prediction window, the process has been successful in accomplishing its task ). The forecasted versus actual results obtained by compiling the query given the currently materialized indexes are compared. Likewise, the forecasted optimum query cost versus the actual optimum query cost obtained by invoking the DB2 index advisor utility is compared. These two sets of results represent a predicted upper and lower bound of query behavior. Depicted graphically, the results are shown in three charts. The first chart shows the forecasted versus actual query cost given the currently materialized index set. The second shows the forecasted optimum query cost versus actual optimum query cost and the third combines the previous two, depicting a bounds of current vers us optimum query costs over time. The x axis of all three charts represents the test iterations, while the y axis represents query cost in timerons as returned by the DB2 query compiler Timerons are an abstract unit of cost that can be used to compare the amount of resources required to execute a query. In general, the resources accounted for in the timeron measure are weighted CPU and I/O costs. The measure is designed to compare different query plans for a single query and is often used to as a roug h approximation for the time a query will take to execute. Results of Testing With Synthetic Data We cho se to test the MAEDA architecture and forecasting process on data generated for the TPC H decision support benchmark. The TPC H benchmark [50] is published by the Transaction Processing Performance Council. The Transaction Processing Performa nce Council is a nonprofit organization which publishes DBMS benchmarks which are widely used throughout the database industry. The TPC H is a decision support benchmark consisting of a set of adhoc queries of a business oriented nature. The queries wer e engineered to have a high degree of complexity and


71 to examine and aggregate large amounts of data. Test results provided are based on a database generat ed with a scale factor of one. The scale factor one TPC H database is approximately one gigabyte in s ize. It consists of eight tables: supplier, part, partsupp, customer, orders, lineitem, nation, and region. The nation and region tables are fixed at 25 and 5 rows, respectively. The supplier, pa rt, partsupp, customer, and orders tables are 10,000 rows 200,000 rows, 800,000 rows, 150,000 rows, and 1,500,000 rows in size, respectfully. The cardinality of these tables scales as a function of the scaling factor. The lineitem table contains approximately 6,001,215 rows in a scale factor one database. Th e number of line items generated per order is randomly distributed with a mean of four The goal of the testing process was to obtain results that were both interesting (in that the behavior of some tested queries changed substantially throughout the testing) and based off of widely used and respect ed data. To obtain a set of our results, growth of the TPC H tables is s imulated by loading the scale factor one data in fifty equal segments. We h ave set the forecast interval to six iterations ahead. The system was tested on a set of twenty queries from the TPC H benchmark query set. Results are depicted in chart form in Figures 51 to 520. The figures show three charts for each result. The first chart shows the forecasted costs using the currently materialized index set and the predicted optimum sets along with the actual values obtained to validate these predictions, for a total of four lines. The second chart in each figure shows the optimum versus forecasted optimum cost and the third chart shows the forecasted versus actual cost with the cur rently materialized index set. Res ults are shown from iteration 24 to iteration 56. Because the test en ded when the data was exhausted at iteration 50, only forecasts are shown for iterations


72 51 through 56. All the points of the forecast line were forecast 6 iterations before the actual cost data points. The lines showing actual costs act to validate the forecast lines. Several indexes are materialized and maintained on the TPC H schema and enforce primary and foreign key constraints. T he DDL used to create the TPC H schema as well as the queries used to generate the results for all of the TPC H tests we discuss are provided in the a ppendix. Our TPC H test ing grows the data in a linear, randomly distributed manner. Values in key fields such as customer and part ids and associated foreign keys in orders and line items increase linear ly leading to a distri bution that is increasing in both mean and variance. Other fields show logarithmic growth if there exists some maximum combination of values for a given field. The cardinality measure increases logarithmically towards that limit. Other values, such as a ddress, price, and tax rate are uniformly distributed a nd show a linear increase in cardinality with negligible change in both mean or variance observable in the distribution of values. As such, it is expected that many TPC H queries will scale in a linear or quadratic manner and not exhibit a change in access plan or optimum index set throughout the test. Queries 1 4, 10, 12 14, and 18 20 all exhibit either linear or exponential growth in both optimum and current index configurations. We make no fur ther comment on these results except to note that our method successfully predicts the query behavior and associated cost in these cases with less than 1% error at most data points. Query 5 ( Figure 5 5) is notable in that the optimal case exhibits a slightly exponential growth until a large jump in query cost at iteration 48. This reflects a change in access path and index set found by the DB2 index advisor for the optimal case. The forecast line shows that this behavior is predicted accurately. We forecast a return to the original access plan on iteration 53.


73 We have attempted to verify this behavior with a scale factor 2 database segmented into 100 parts. However, the scale factor 2 data does not exhibit the access path change at all, instead scaling exponentially at a lo wer predicted cost all the way out to iteration 60, where we ceased testing. This is likely due to a change in the distribution of data comprising the scale factor 2 data set. Query 6 ( Figure 5 6) exhibits no change in access plans for the optimum or currently materialized case. However, it is still notable in that we initially predict a wider and differently ordered index for the optimum case, result ing in a lower and incorrectly forecast index cost. However, our forecasts converge on the correct index and cost at iteration 36. Query 7 ( Figure 5 7) displays a change in optimum index set at iteration 46 which our method predicts successfully. The change in index set is evidenced on the graph by a change in slope at iteration 46 of the optimal versus forecasted optimal cost. This provides evidence for the fact that different index sets are optimal for different sizes and distributions of data. The optimum plan switches from what we can assume to be an initially lower cost plan with a steeper slope to an initially higher cost plan with a lesser slop e Query 8 ( Figure 58) is similar to query 7 in that there is a change to the query plan at test iteration 33 that causes the slope to decrease signi ficantly for the currently materialized case and more slightly in the optimal case. Query 9 ( Figure 5 9) shows a large, sudden increase in predicted qu ery cost for both the currently materialized case and the optimal case. The increase is due to a change in physical access plan and index set utilized by the compiler. The increase occurs at iteration 28 in the currently materialized case and iteration 3 4 in the optimal case. Our forecast predicted the correct index set and access path but forecasted the change to occur at iteration 29 and 35, respectively. In this case the correct plan was forecast only 5 iterations ahead as opposed to 6.


74 We find that the optimal case reverts back to an optimal index on the lineitem table at iteration 42 that is similar to, though wider than, the optimal set from test iteration 24 to 33. Our method fails to find this index set until much later, when it starts iterating back and forth between forecasted index sets between iterations 53 and 56. Query 11 ( Figure 5 11) exhibits linear growth throughout the period for which we have test data. A change in access plan is forecasted at iteration 51 in both the currently materialized and optimal case. We were able to verify that this change does, in fact, occur by observing and verifying the same behavior at test iteration 51 on a scale factor 2 data set segmented into 100 equally sized fragments. The new access plan is observed to be utilized in the scale factor 2 test up to iteratio n 60, where the test was halted. Query 15 ( Figure 5 15) shows a change in the plan utilized by the currently materialized set, which is also predicted in our forecast. The access path changes from a previously lower cost plan which shows a slightly quadratic growth rate to a plan with linear growth. The access paths for Query 16 ( Figure 516) are initially the same in the currently materialized and optimal cases. At iteration 26 both access plans change, represented on the graph by a large and sudden increase in query cost. The optimum case diverg es from the materialized case, favoring a new index set which yields a lower cost and lesser slope than the plan selected in the currently materialized case. Query 1 7 ( Figure 5 17) is notable because, while the access path does not change in either the optimal or currently materialized case, the optimal forecast is notably more erratic than the optimal cost. However, the index set forecast by our method agrees with the optimum index set predicted at all points. Both the forecast and t he actual optimal value show a quadratical ly increasing cost. We argue that the difference in query cost is actually negligible considering the


75 optimal case represents an estimated order of magnitude lower cost over the access path selected with the currently materialized index set. Query 20 ( Figure 5 20) is an accurate forecast of the query cost with the exception of a drop between iterations 35 and 40 due to an incorrect quantile forecast. However, the incorrect portion of the forecast still forecasts the correct query cost and growth rate (slope). The forecast recov ers to the correct value after iteration 40. W e find the results obtained from the TPC H data to be very accurate, yielding little error and few missed predictions. In the majority of cases, the residuals between forecast and verification results in less than 1% of the total predicted cost. Overall, 91% of the data points for the materialized case are forecast within less than 1% residual error and 98% of data points are within 10% error. In the optimal case, 60% of data points are within 1% error and 89% are within 10% error. Cases where the residual error is greater than 10% have been noted and explanations for the error have be en provided in our discussion. An additional test of the TPC H data was executed with TPC H data generated to follow a more re alistic sequence of growth for a business growing from startup to maturity When the business is newly formed, it initially gains a large number of customers who place small orders. As it grows into maturity, some customers begin to leave and new custome r growth begins to slow. However, some long term customers begin to transition into large accounts that frequently place larger than average orders with the business. The data was generated based on the following rules: Customers grow logarithmically over time. Each customer, once created, is randomly assigned a loyalty rating that controls the frequency with which the customer will place an order. Customers are assigned a loyalty rating of i }. The chance of a customer placing an


76 order on any iteration of testing is 1/ i If a customer does not place an order for six consecutive months, they become inactive and will not place a new order for the duration of the test. Realistically, a business will acquire mos t of its suppliers at start up. Therefore, 90% of the supplier table is preloaded. The other 10% of the growth occurs linearly throughout the simulation. Like suppliers, most parts offered by a business will be offered when the business initially opens. To model this, 75% of parts are preloaded. The remaining 25% are added linearly throughout the simulation. Each part has a 1% chance of becoming obsolete or being replaced at each iteration of the test. Once obsolete, the part can no longer be ordered. Additionally, some parts will be more popular than others. Parts are placed into two categories: popular and regular. Parts have a one in ten chance of being popular. Each part is assigned to a mean of four suppliers. When associating parts with suppliers, there is a one in four chance that a new supplier will supply a new part. The order relation is populated by cycling through the list of active customers. Orders are generated based on the previously described customer loyalty rating. This result s in the number of orders growing logarithmic ally but the effect is less pronounced than the logarithmic growth of the customer relation. Newly generated orders favor newer customers (due to customer decay) and customers with higher loyalty ratings. The number of line items associated with each order depends on the number of orders the customer has previously executed. Each order gener ates a mean of 2 + ( number of orders / 3.5) line items, with a variance of 2 + ( number of orders / 3). When generating a new line item, a part and supplier must be selected. There is a one in ten chance that a line item will be associated with a newly added part. Failing that, there is a one in four chance a line item will be associated with a popular part. The result is that approximately 25% of all orders are associated with 10%


77 of the parts. Supplier is chosen randomly from one of the suppliers of the selected part. As a result of this process, the lineitem relation shows slightly quadratic growth despite the fact that the customer relation grows logarithmically The overall results of this test showed 77% of data points within 1% error and 99% of data points within 10% error for the currently materialized case. For the optimal case, 48% of data points we re within 1% error and 96% were within 10% error A subset of the results display ing notable features are presented in Figures 5 21 to 524. Query 4 ( Figure 5 21) exhibits slightly exponential growth in the currently materialized case and slightly less exponential growth in the optimal case. In the currently materialized case, the growth rate switches at iteration 31 from linear to exponential growth along with a sudden increase in query cost. This behavior is correctly forecast, although the increase is forecasted to occur one iteration early at iteration 30. The forecast for the currently materialized case is accurate within 1% error at 85% of data points and within 10% error at 96 % of data points. The optimal case is accurate within 1% error at 85% of data points and 10% of error at all data points. Query 7 ( Figure 5 22) grows exponentially in the currently materialized case and logarithmically in the optimal case. The logarithmic and exponential nature of the growth is forecast correctly in both cases. However, the rate of growth in the cur rently materialized case is slightly incorrect, leading to an increase in error at each successive data point after a sudden jump in query cost is forecast at iteration 27. In the currently materialized case, the forecast is accurate within 1% error at 4% of data points, and within 10% error at all data points. The optimal case forecast is accurate within 1% error at 59% of data points and 10% error at all data points.


78 Query 13 ( Figure 5 23) shows exponential growth in the currently materialized case and linear growth in the optimal case. Both growth rates are forecast correctly. However, while the forecast in the optimal case correctly forecasts the trend of the growth, the forecast line is erratic in nature. The optimal case is correct within 1% error at 19% of data points and 100% error at all data points. In the currently materialized case, the forecast is correct within 1% error at 89% of data points and 10% error at all data points. The optimal and currently materialized cases for query 16 ( Figure 524 ) are initially distinct until they converge at iteration 29 after a sudden increase in query cost. The optimal and currently materialized cases are identical until iteration 42 when another sudden increase in query cost occurs. After the query plans diverge again at it eration 42, the optimal shows a significantly lesser jump in query cost than the currently materialized case and grows with a lesser slope. This behavior is correctly forecast in both the optimal and currently materialized cases The error in both cases was less than 1% at 78% of data points and less than 10% at all data points. Results of Testing With Enterprise Data In addition to tests performed with the TPC H benchmark data, we have tested the MAEDA architecture on data pulled from a large scale (terabyte size) enterprise resource planning system The data tested (which we will refer to as the ERP data set) consists of six table s from the databases financials system: account type_t a bl e (5 rows), source_tabl e (42 rows), account_t a bl e (1947 rows), j ournal _header (appr oximately 580,000 rows), voucher _line (app roximately 34,000,000 rows), and j ourn a l_l i ne (approximately 46,500,000 rows). When extracted from the DBMS, these tables represent approximately 88 gigabytes of combined data. Materialized in the database, with indexes and proper amounts of space allocated to support immediate gr owth and fragmentation needs, they occupy over 200 gigabytes of space on disk.


79 This set of data is an ide al candidate for testing becau se each row of data in the tables contains an associated date stamp. This allows for the data to be partition ed into segments that support our iterative testing pr ocess. The date stamps allow us to fragment the data in such a way as to provide a good approximation of the size and distribution of data that existed in the database at any given time. A set of queries involving these tables was also obtained from the ERP system. Using this set as models, additional queries were created by adding or modifying the condi tions of the provided queries. Of the original queries and their variants that we created, we obtained a set of 10 distinct and interesting queries upon which we report results To provide a starting distribution of data for testing, t he database was initially seeded with all data before year 2005. Each iteration of the testing process loads one month of data into the journal_header, journal_line, and voucher_line tables. The other, smaller tables function as lookup ta bles for the larger ones. The data in these tables remains constant throughout the testing process Therefore, we choose to load all the data for those tables during the initial load process and allow it to remain untouched throughout testing. The first iteration of testing loads data from January 2005 while iteration 50 of the test loads data from February 2009, exhausting our set of usable data. A pproximate average monthy growth rate s of the tables in the test are: journal_header, 7,900 rows per month; voucher_line, 590,000 rows per month; and journal_line, 780,000 rows by month. T o increase the spe ed of the testing process and generate results that differ in a significant manner between the optimal case and the currently materialized index case, all indexes on the data were dropped with the exception of an index on the key of each table. These indexes are defined as unique to enforce the key restraint. They are all defined as clustering indexes.


80 Additionally, to enable testing to be possible on our hardware, fields which were not referenced as a condition of any query in our test set were dropped from the table s, significantly increasing the speed of the load and maintenance phases of the testing process. Forecasts for the test s discussed are for 6 months in the future. Q uery number 1 ( Figure 525) is an example of an optimal query plan that remains constant with respect to t ime The optimal access path selected by the DB2 index advisor utility solves the query with a cost of approximately 200 timerons. This cost remains constant throughout the test with the exception of an increase at iteration 34, whic h is predicted by our forecast. T he query cost using the currently materialized query set increases from approximately 2,500,000 timerons at test iteration 24 to approximately 1,000,000 timerons at test iteration 50. Our forecasts predict the optimum behavior with an average of 0.6% error and a maximum of 3% error at any iteration of the test. Query 2 ( Figure 5 26) displays nearly linear growth in the currently materialize d case. The optimum case, however, exhibits a stair stepping pattern of growth. There is one change in slope, and two sudden jumps in query cost in the optimal case. The change in slope at test iteration 28 is forecast on time. This corresponds to a change in the optimal index set, which was forecasted to occur at iteration 27 one month before it was observed to occur. The first jump occurs at test iteration 33, when it is forecast ed The second jump occurs at iteration 45, t hough it was forecast to occur at 43 (i.e., the forecast was 2 months early). The forecast predicts the optimum query behavior with an average of 1.4% error and a maximum of 4% error. Much of this error is due to the early forecast at test iteration 43. Query 3 ( Figure 5 27) exhibits linear growth in cost for the materialized index set and is predicted accurately by our forecasts. The optimal case ex hibits several changes in slope. The


81 forecast tracks the observed values, predicting each slope change within 3 test iterations of when it actually occurs. The forecasts for the optimal case have a 2% average error and a maximum error of 7.8%. The large maximum error is due to the early or late prediction of slope change in the forecast (although changes are predicted to occur in each case well within the forecast window). Query 4 ( Figure 5 28) is similar to query 1. Our forecast is an accurate portrayal of the observed values. A change in optimal index set was forecasted to occur at iteration 35, and was verified to have occur ed as predicted. Th e average error in the optimal case is 0.6% and the maximum error is 3%. Query 5 ( Figure 5 29) exhibits linearly increasing growth in the currently materialized case. However, the optimal case shows several changes in slope. These changes are well approximated by the forecasts and a re forecasted to occur within two test iterations of when they are observed in the test data (i.e. the forecasts are at most two months early or two months late, but still well within the six month prediction window). Average error for the optimal case w as 2.2%, with a maximum of 5.3%. Query 6 ( Figure 5 30) shows two changes in slope, occurring at like times, in both the currently materialized and the optimal case s The first change occurs at test iteration 29 and is forecast on time. The second change occurs at test iteration 41. Both the materialized and optimal cases forecast a switch to a slightly lower slope at test iteration 36 This results i n a slightly lower forecasted and optimal cost, which is recovered at iteration 47. The error in the optimal case is 2.4% on average with a maximum of 5.4%. Query 7 ( Figure 5 31) grows linearly in the currently materialized case. However, the optimal case displays extremely erratic behavior oscillating from a cost of approximately


82 500,000 to a maximum of approximately 720,000 timerons at iteration 50. Our forecast provides an accurate representation of the query behavior. H owever the changes in the query cost are often forecasted to occur two to three iterations before they are observed. Iterations 39 through 43 are notable because they forecast a larger decrease in query cost than is actually observed. A new optimum index set was forecasted to occur at iteration 27, and occurred as predicted. Average error for this query is 8.3% with a maximum of 16.9%. The large error is due to the erratic natu re of the optimal cost. Query 8 ( Figure 5 32) shows linear growth in the optimal case and a stair stepping pattern in the currently materialized case. In the currently materialized case, sudden increases in cost occur at iterations 32 and iteration 44. Each increase is properly forecast. Average error for the optimal case is 0.9% with a maximum of 3.3% error. Query 9 ( Figure 5 33) shows a dramatic decrease in slope at iteration 29 in both the materialized and optimal case s In both cases, the change is forecasted accurately. Average error in the optimal case for query 9 is 1.4% with a maximum error of 8.5%. The large maximum erro r occurred on the first iteration of the test, before the full amount of data was acquired for use in creating forecasts. The second highest error value that occurred was 3.3% at iteration 28. Query 10 ( Figure 5 34) displays a clearly logarithmic growth rate in the optimal case. Though our forecasted values are not as smoothed as the values which actually occurred, the forecast clearly shows the same l ogarithmic trend. Two changes to the optimum index set were predicted. The first was forecast to occur at iteration 25 and appeared at iteration 26. The second was forecast to occur at 37 and appeared at iteration 40. Average error for the optimal case is 3.5% with a maximum of 7.6%.


83 In a second test on the ERP data, we added constraints to the values of the data found in the voucher_line and journal_line tables. The voucher_line table was constrained to only contain lines with a journal_id of ACCRUAL This reduced the number of rows in the voucher_line table to approximately 15,000,000. The journal_line table was constrained to a maximum of six journal lines per journal, reducing the cardinality of the journal_line table to approximately 2,200,000 and significantly altering its growth rate with respect to time relative to the previous test performed on the full data set These changes effectively changed the distribution and growth rates of the data, as well as the relative growth rates between the voucher_line, journal_line, and journal_header tables in the simulation. The same ten queries discussed above were also tested on this data. R esults are shown in Figures 5 35 to 544. The cases in which the results differed significantly from testing o n the full data are discussed below. Query 2 ( Figure 5 36) exhibits a much stronger stair stepping behavior than it did testing on the full data. This behavior is well approximated by the forecasts, although the first increase is forecasted to occu r in a linear manner between iterations 28 and 33 and the second increase is forecasted three iterations early at iteration 41 as opposed to 44. The average error for query 2 was 7.2% with a maximum error of 21.1%. A significant portion of the error is due to the early forecasted increases. Even though the error is large, it is clear from examination of the graphs that the forecast provides an accurate estimation of the growth in query cost. Query 3 ( Figure 5 37) alternates between slightly increasing and strongly increasing s lopes in the optimal case. The forecasts echo this behavior and provide a well correlated approximation of the query cost at an y given time. However, the periods of strongly increasing slope are forecast to begin three to five iterations before they are observed in the test data. The


84 error for the optimal query cost for query 3 was 3.5% with a maximum of 10% error. Again, this is due to the increases which were forecasted to occur early. The linear nature of query 6 ( Figure 5 40) under the full data set is replaced by three s udden increases in query cost in both the optimal and currently materialized cases when tested with the constrained data. In each case, the increases are forecasted to occur one month later than they are observed (i.e. five months early as opposed to six) Average error for the optimal case was 10.6% with a maximum of 19%. The large error is due to the increases being forecast one month late At 21 of the 27 points, the error was less than 2%. The optimal case for query 7 ( Figure 541) which was erratic in nature under t he full data set, now shows linear growth with sudden increases occurring at iterations 27 and 37. They are for e casted to occur at iterations 26 and 38. The average error for the optimal case is 2.6% with a maximum of 35.3% due to the early forecasted increase. However, 24 of the 27 test points were under 2% error, 14 of which were less than 1% error. While the forecasts we obtained from testing on the E RP data set do not show the same level of accuracy as those obtained testing on the TPC H data, we find them to be very positive. The ERP data set is indicative of data found in an enterprise class database system, representing a large portion (20% 30%) of a terabyte ERP financials database. This is exactly the type of database in which the MAEDA architecture is i ntended to function. In this chapter, w e have shown that our method is capable of forecasting changes in both query cost and behavior for time periods well in the future. Results were presented using both synthetic and real world data, and in both cases showed a high degree of accuracy. Forecasting this behavior would not be possible by applying traditional regression and forecasting methods. In addition to providing accurate estimates of future query cost, our method also provides


85 forecasts for the physical access path the query will take and predictions of the optimal index sets that can be used to solve the queries.


86 Figure 51. TPC H Scale factor 1 results for Query 1. Figure 52. TPC H Scale factor 1 results f or Query 2. Figure 53. TPC H Scale factor 1 results for Query 3.


87 Figure 54. TPC H Scale factor 1 results for Query 4. Figure 55. TPC H Scale factor 1 results for Query 5. Figure 56. TPC H Scale factor 1 results for Query 6.


88 Figure 57. TPC H Scale factor 1 results for Query 7. Figure 58. TPC H Scale factor 1 results for Query 8. Figure 59. TPC H Scale factor 1 results for Query 9.


89 Figure 510. TPC H Scale factor 1 results for Query 10. Figure 511. TPC H Scale factor 1 results for Query 11. Figure 512. TPC H Scale factor 1 results for Query 12.


90 Figure 513. TPC H Scale factor 1 results for Query 13. Figure 514. TPC H Scale factor 1 results for Query 14. Figure 515. TPC H Scale factor 1 results for Query 15.


91 Figure 516. TPC H Scale factor 1 results for Query 16. Figure 517. TPC H Scale factor 1 results for Query 17. Figure 518. TPC H Scale factor 1 results for Query 18.


92 Figure 519. TPC H S cale factor 1 results for Query 19. Figure 520. TPC H Scale factor 1 results for Query 20. Figure 521. Results for TPC H Query 4 modified to produce more realistic growth.


93 Figure 522. Results for TPC H Query 7 modified to produce more realistic growth. Figure 523. Results for TPC H Query 13 modified to produce more realistic growth. Figure 524. Results for TPC H Query 1 6 modified to produce more realistic growth.


94 Figure 525. ERP data set results for Query 1. Figure 526. E RP data set results for Query 2 Figure 527. E RP data set results for Query 3


95 Figure 528. E RP data set results for Query 4 Figure 529. E RP data set results for Query 5 Figure 530. E RP data set results for Query 6


96 Figure 531. E RP data set results for Query 7 Figure 532. E RP data set results fo r Query 8 Figure 533. E RP data set results for Query 9


97 Figure 534. ERP data set results for Query 1 0. Figure 535. Constrained ERP data set results for Query 1. Figure 536. Constrained ERP data set results for Query 2.


98 Figure 537. Constrained ERP data set results for Query 3. Figure 538. Constrained ERP data set results for Query 4. Figure 539. Constrained ERP data set results for Query 5.


99 Figure 540. Constrained ERP data set results for Query 6. Figure 541. Constrained ERP data set results for Query 7. Figure 542. Constrained ERP data set results for Query 8.

PAGE 100

100 Figure 543. Constrained ERP data set results for Query 9. Figure 544. Constrained ERP data set results for Query 1 0.

PAGE 101

101 CHAPTER 6 CONCLUSIONS In this research, we have shown significant progress towards the development of a method for forecasting future query cost and behavior in DBMS syste ms. We have shown our method is capable of providing accurate forecasts of query cost running queries against both synthetic and real world enterprise data. Our method is capable of forecasting changes in query cost and providing information regarding query behavior and index utilization that is not possible using traditional methods. Contributions This dissertation makes two primary contributions to the body of autonomic physical tuning research: Forecasting Query Behavior and Scaling with Respect to Tim e: We have described in a method for forecasting the future cost and access path of queries. By simulating the growth patterns of database objects, we are able to forecast query cost, access path selection, and index utilization of the databases query workload. These predictions can be far more accurate at predictin g significant events (such as changes in plan selection leading to sudden changes in slope or steep increases in the query cost as calculated by the query optimizer) than traditional regression and forecasting methods. Extension of the Virtual Index Concept to the Time Domain : The extension of virtual indexes to recommend and evaluate index sets for queries at future time periods has the potential to be useful even beyond the realm of autonomic tuning. Virtual indexes are widely used by DBAs when evaluat ing the potential effects of indexes on queries and in evaluating queries for new processes being developed to add to the current workload. DBAs have previously been limited to using these tools to evaluate queries given

PAGE 102

102 only the current state of the database statistics. Often database processes currently being developed will not be deployed until much later in the future. Evaluating these queries on the current state of database data may yield deceptive results. Our extension enables DBAs and developer s to analyze their queries under the database configuration that is estimated to occur at the time of a querys deployment. Additionally, when considering adding an index to a currently misbehaving query, the future index advisor could be invoked manuall y to determine if the index being considered will remain lucrative at some future time point. The user is not limited to evaluating only indexes that the system determines are optimal. The user has the freedom to specify any virtual index desired. This could enable administrators and developers to make better informed decisions when deciding on physical design and planning space requirements and strategies for future indexes. We have shown how the method we have developed could be extended into a complet ely autonomic system. Our research differs from the prior art in that it is designed to be proactive rather than reactive in its approach to tuning the system. The method we have developed identifies indexes becoming inefficient due to growth and fluctuat ions of data distributions of the underlying relations. If the underlying relations grow in a predictable manner, index schemes can be generated far in advance. Sudden changes in query behavior due to increases or changes in the dist ribution of data can be predicted, allow ing administrators to redesign the process or insert new indexes before the negative effects of any performance problems are experienced. Future Work The method we have developed lays the groundwork for forecasting query behavior into t he time domain, yet there is still room for significant enhancements.

PAGE 103

103 The field of density and conditional quantile forecasting is an area of open and active research in which significant enhancements and techniques are yet be discovered. New methods for density forecasting and improvements in comparison of models could yield significant increases in accuracy and granularity of our cost forecasts. The method we have developed could also be extended to predict a range of query costs throughout the maintenance cycle. It is well known that performing inserts and updates on tables and indexes causes fragmentation to the database objects, causing measurable degradation in performance. Our current method assumes a maintenance cycle is performed before statistics are measured. Extending the forecasts to predict the effects of loss of clustering throughout maintenance cycles would yield a range of query cost s throughout the cycles, showing the change in these bounds across multiple maintenance cycles with respect to time. Another interesting extension would be to allow the user to modify the density and cardinality forecasts manually (assume for example an organization is considering archiving a large amount of data, or adding additional product types to a low car dinality field). All statistics and forecasts derived from the statistics modified by the user could then automatically be recalculated. The user could then explore the effect over time of these changes (i .e ., answ er the question, What effect will archi ving this data have on the query performance over time?).

PAGE 104

104 APPENDIX A DDL USED IN PERFORMANCE EVALUATION The following DDL was used to build the testing database when performing the synthetic (TPC H) tests. Table Creation Script ------------------------------------------------DDL Statements for table "TPCD"."NATION" -----------------------------------------------DROP TABLE "TPCD"."NATION"; CREATE TABLE "TPCD"."NATION" ( "N_NATIONKEY" INTEGER NOT NULL "N_NAME" CHAR(25) NOT NULL "N_REGIONKEY" INTEGER NOT NULL "N_COMMENT" VARCHAR(152) ) IN "TPCDDATA" ; ------------------------------------------------DDL Statements for table "TPCD"."REGION" -----------------------------------------------DROP TABLE "TPCD"."REGION"; CREATE TABLE "TPCD"."REGION" ( "R_REGIONKEY" INTEGER NOT NULL "R_NAME" CHAR(25) NOT NULL "R_COMMENT" VARCHAR(152) ) IN "TPCDDATA" ; ------------------------------------------------DDL Statements for table "TPCD"."PART" -----------------------------------------------DROP TABLE "TPCD"."PART"; CREATE TABLE "TPCD"."PART" ( "P_PARTKEY" INTEGER NOT NULL "P_NAME" VARCHAR(55) NOT NULL "P_MFGR" CHAR(25) NOT NULL "P_BRAND" CHAR(10) NOT NULL "P_TYPE" VARCHAR(25) NOT NULL "P_SIZE" INTEGER NOT NULL

PAGE 105


PAGE 106


PAGE 107

107 "L_COMMENT" VARCHAR(44) NOT NULL ) IN "TPCDDATA" INDEX IN "TPCDINDEX" ; Index Creation Script ------------------------------------------------DDL Statements for table TPCD "."NATION" ------------------------------------------------DDL Statements for indexes on Table TPCD "."NATION" CREATE UNIQUE INDEX TPCD "."N_NK" ON TPCD "."NATION" ("N_NATIONKEY" ASC) PCTFREE 0 ALLOW REVERSE SCANS; -DDL Statements for indexes on Table T PCD "."NATION" CREATE INDEX TPCD "."N_RK" ON "TPCD "."NATION" ("N_REGIONKEY" ASC) PCTFREE 0 ALLOW REVERSE SCANS; ------------------------------------------------DDL Statements for table TPCD "."REGION" ------------------------------------------------DDL Statements for indexes on Table TPCD "."REGION" CREATE UNIQUE INDEX TPCD "."R_RK" ON "TPCD "."REGION" ("R_REGIONKEY" ASC) PCTFREE 0 ALLOW REVERSE SCANS; ------------------------------------------------DDL Statements for table TPCD "."PART" ------------------------------------------------DDL Statements for indexes on Table TPCD "."PART" CREATE UNIQUE INDEX TPCD "."P_PK" ON "TPCD "."PART" ("P_PARTKEY" ASC) PCTFREE 0 ALLOW REVERSE SCANS; ------------------------------------------------DDL Statements for table TPCD "."SUPPLIER" ------------------------------------------------DDL Statements for indexes on Table TPCD "."SUPPLIER" CREATE INDEX TPCD "."S_NK" ON "TPCD "."SUPPLIER" ("S_NATIONKEY" ASC) PCTFREE 0 ALLOW REVERSE SCANS;

PAGE 108

108 -DDL Statements for indexes on Table TPCD "."SUPPLIER" CREATE UNIQUE INDEX TPCD "."S_SK" ON "TPCD "."SUPPLIER" ("S_SUPPKEY" ASC) PCTFREE 0 ALLOW REVERSE SCANS; ------------------------------------------------DDL Statements for table TPCD "."PARTSUPP" ------------------------------------------------DDL Statements for indexes on Table TPCD "."PARTSUPP" CREATE INDEX TPCD "."PS_PK" ON "TPCD "."PARTSUPP" ("PS_PARTKEY" ASC) PCTFREE 0 ALLOW REVERSE SCANS; -DDL Statements for indexes on Table TPCD "."PARTSUPP" CREATE UNIQUE INDEX TPCD "."PS_PK_SK" ON TPCD "."PARTSUPP" ("PS_PARTKEY" ASC, "PS_SUPPKEY" ASC) PCTFREE 0 ALLOW REVERSE SCANS; -DDL Statements for indexes on Table TPCD "."PARTSUPP" CREATE INDEX TPCD "."PS_SK" ON "TPCD ."PARTSUPP" ("PS_SUPPKEY" ASC) PCTFREE 0 ALLOW REVERSE SCANS; -DDL Statements for indexes on Table TPCD "."PARTSUPP" CREATE UNIQUE INDEX TPCD "."PS_SK_PK" ON TPCD "."PARTSUPP" ("PS_SUPPKEY" ASC, "PS_PARTKEY" ASC) PCTFREE 0 ALLOW REVERSE SCANS; ------------------------------------------------DDL Statements for table TPCD "."CUSTOMER" ------------------------------------------------DDL Statements for indexes on Table TPCD "."CUSTOMER" CREATE UNIQUE INDEX TPCD "."C_CK" ON "TPCD "." CUSTOMER" ("C_CUSTKEY" ASC) PCTFREE 0 ALLOW REVERSE SCANS; -DDL Statements for indexes on Table TPCD "."CUSTOMER"

PAGE 109

109 CREATE INDEX TPCD "."C_NK" ON "TPCD "."CUSTOMER" ("C_NATIONKEY" ASC) PCTFREE 0 ALLOW REVERSE SCANS; -----------------------------------------------DDL Statements for table TPCD "."ORDERS" ------------------------------------------------DDL Statements for indexes on Table TPCD "."ORDERS" CREATE INDEX TPCD "."O_CK" ON "TPCD "."ORDERS" ("O_CUSTKEY" ASC) PCTFREE 3 ALLOW REVERSE SCANS; -DDL Statements for indexes on Table TPCD "."ORDERS" CREATE INDEX TPCD "."O_OD" ON TPCD "."ORDERS" ("O_ORDERDATE" ASC) PCTFREE 3 ALLOW REVERSE SCANS; -DDL Statements for indexes on Table TPCD "."ORDERS" CREATE UNIQUE INDEX TPCD "." O_OK" ON TPCD "."ORDERS" ("O_ORDERKEY" ASC) PCTFREE 3 ALLOW REVERSE SCANS; ------------------------------------------------DDL Statements for table TPCD "."LINEITEM" ------------------------------------------------DDL Statements for indexes on Table TPCD "."LINEITEM" CREATE INDEX TPCD "."L_CD" ON "TPCD "."LINEITEM" ("L_COMMITDATE" ASC) PCTFREE 3 ALLOW REVERSE SCANS; -DDL Statements for indexes on Table TPCD "."LINEITEM" CREATE INDEX TPCD "."L_OK" ON "TPCD "."LINEITEM" ("L_ORDERKEY" ASC ) PCTFREE 3 ALLOW REVERSE SCANS; -DDL Statements for indexes on Table TPCD "."LINEITEM" CREATE INDEX TPCD "."L_PK" ON "TPCD "."LINEITEM" ("L_PARTKEY" ASC) PCTFREE 3 ALLOW REVERSE SCANS;

PAGE 110


PAGE 111

111 APPENDIX B QUERIES USED IN PERF ORMANCE EVALUATION The following are the queries used to report results for the synthetic (TPC H) performance evaluations. Queries are numbered according to query numbers in the figure captions. The seed value for the TPC H query generator program was 1787334968. Query #1 : select l_returnflag, l_linestatus, sum(l_quantity) as sum_qty, sum(l_extendedprice) as sum_base_price, sum(l_extendedprice (1 l_discount)) as sum_disc _price, sum(l_extendedprice (1 l_discount) (1 + l_tax)) as sum_charge, avg(l_quantity) as avg_qty, avg(l_extendedprice) as avg_price, avg(l_discount) as avg_disc, count(*) as count_order from lineitem where l_shipdate <= date(days(date('199812 01')) 82 3) group by l_returnflag, l_linestatus order by l_returnflag, l_linestatus; Query #2: select s_acctbal, s_name, n_name, p_partkey, p_mfgr, s_address, s_phone, s_comment from part, supplier, partsupp, nation,

PAGE 112

112 region where p_partkey = ps_partkey and s_suppkey = ps_suppkey and p_size = 19 and p_type like '%COPPER' and s_nationkey = n_nationkey and n_regionkey = r_regionkey and r_name = 'AMERICA' and ps_supplycost = ( select min(ps_supplycost) from partsupp, supplier, nation, region where p_partkey = ps_partkey and s_suppkey = ps_suppkey and s_nationkey = n_nationkey and n_regionkey = r_regionkey and r_name = 'AMERICA' ) order by s_acctbal desc, n_name, s_name, p_partkey; Query # 3: select l_orderkey, sum(l_extendedprice (1 l_discount)) as revenue, o_orderdate, o_shippriority from customer, orders, lineitem where c_mktsegment = 'MACHINERY' and c_custkey = o_custkey and l_orderkey = o_orderkey and o_orderdate < date('1995 0308') and l_shipdate > date('19950308')

PAGE 113

113 group by l_orderkey, o_orderdate, o_shippriority order by revenue desc, o_orderdate; Query #4: select o_orderpriority, count(*) as order_count from orders where o_orderdate >= date( '19940301') and o_orderdate < date(days(date('19940301')) + 3 31) and exists ( select from lineitem where l_orderkey = o_orderkey and l_commitdate < l_receiptdate ) group by o_orderpriority order by o_orderpriority; Query #5 : select n_name, sum(l_extendedprice (1 l_discount)) as revenue from customer, orders, lineitem, supplier, nation, region where c_custkey = o_custkey and l_orderkey = o_orderkey and l_suppkey = s_suppkey

PAGE 114

114 and c_nationkey = s_nationkey and s_nationkey = n_nationkey and n_regionkey = r_regionkey and r_name = 'AMERICA' and o_orderdate >= date('199601 01') and o_orderdate < date(days(date('19960101')) + 365) group by n_name order by revenue desc; Query #6: select sum(l_extendedprice l_discount) as revenue from lineitem where l_shipdate >= date('1996 0101') and l_shipdate < date(days(date('199601 01')) + 365) and l_discount between 0.02 0.01 and 0.02 + 0.01 and l_quantity < 25; Query #7: select supp_nation, cust_nation, l_year, sum(volume) as revenue from ( select n1.n_name as supp_nation, n2.n_name as cust_nation, year(l_shipdate) as l_year, l_extendedprice (1 l_discount) as volume from supplier, lineitem, orders, customer, nation n1, nation n2 where s_suppkey = l_suppkey and o_orderkey = l_orderkey and c_custkey = o_custkey

PAGE 115

115 and s_nationkey = n1.n_nationkey and c_nationkey = n2.n_nationkey and ( (n1.n_name = 'ALGERIA' and n2.n_name = 'ROMANIA') or (n1.n_name = 'ROMANIA' and n2.n_name = 'ALGERIA') ) and l_shipdate between date('199501 01') and date('19961231') ) as shipping group by supp_nation, cust_nation, l_year order by supp_nation, cust_nation, l_year; Query #8: select o_year, sum(case when nation = 'ROMANIA' then volume else 0 end) / sum(volume) as mkt_share from ( select year(o_orderdate) as o_year, l_extendedprice (1 l_discount) as volume, n2.n_name as nation from part, supplier, lineitem, orders, customer, nation n1, nation n2, region where p_partkey = l_partkey and s_suppkey = l_suppkey and l_orderkey = o_orderkey and o_custkey = c_custkey and c_nationkey = n1.n_nationkey and n1.n_regionkey = r_regionkey

PAGE 116

116 and r_name = 'EUROPE' and s_nationkey = n2.n_nationkey and o_orderdate between date('199501 01') and date('199612 31') and p_type = 'SMALL PLATED STEEL' ) as all_nations group by o_year order by o_year; Query #9: select c_custkey, c_name, sum(l_extendedprice (1 l_discount)) as revenue, c_acctbal, n_name, c_address, c_phone, c_comment from customer, orders, lineitem, nation where c_custkey = o_custkey and l_orderkey = o_orderkey and o_orderdate >= date('199311 01') and o_orderdate < date(days(date('1993 1101')) + 3 31) and l_returnflag = 'R' and c_nationkey = n_nationkey group by c_custkey, c_name, c_acctbal, c_phone, n_name, c_address, c_comment order by revenue desc; Query #10: select

PAGE 117

117 ps_partkey, sum( ps_supplycost ps_availqty) as value from partsupp, supplier, nation where ps_suppkey = s_suppkey and s_nationkey = n_nationkey and n_name = 'CHINA' group by ps_partkey having sum(ps_supplycost ps_availqty) > ( select sum(ps_supplycost ps_availqty) 0.0001000000 from partsupp, supplier, nation where ps_suppkey = s_suppkey and s_nationkey = n_nationkey and n_name = 'CHINA' ) order by value desc; Query #11: select l_shipmode, sum(case when o_orderpriority = '1URGENT' or o_orderpriority = '2HIGH' then 1 else 0 end) as high_line_count, sum(case when o_orderpriority <> '1URGENT' and o_orderpriority <> '2 HIGH' then 1 else 0 end) as low_line_count from orders, lineitem where

PAGE 118

118 o_orderkey = l_orderkey and l_shipmode in ('RAIL', 'REG AIR') and l_commitdate < l_receiptdate and l_shipdate < l_commitdate and l_receiptdate >= date('1996 0101') and l_receiptdate < date(days(date('19960101')) + 365) group by l_shipmode order by l_shipmode; Query #12: select c_count, count(*) as custdist from ( select c_custkey, count(o_orderkey) from customer left outer join orders on c_custkey = o_custkey and o_comment not like '%special%deposits%' group by c_custkey ) as c_orders (c_custkey, c_count) group by c_count order by custdist desc, c_count desc; Query #13: select 100.00 sum(case when p_type like 'PROMO%' then l_extendedprice (1 l_discount) else 0 end) / sum(l_extendedprice (1 l_discount)) as promo_revenue from lineitem, part where l_partkey = p_partkey and l_shipdate >= date('19961101')

PAGE 119

119 and l_shipdate < date(days(date('199611 01')) + 31) ; Query #14: select p_brand, p_type, p_size, count(distinct ps_suppkey ) as supplier_cnt from partsupp, part where p_partkey = ps_partkey and p_brand <> 'Brand#53' and p_type not like 'PROMO BURNISHED%' and p_size in (23, 14, 17, 8, 15, 9, 42, 6) and ps_suppkey not in ( select s_suppkey from supplier where s_comment like '%Customer%Complaints%' ) group by p_brand, p_type, p_size order by supplier_cnt desc, p_brand, p_type, p_size; Query #15: select sum(l_extendedprice) / 7.0 as avg_yearly from lineitem, part where p_partkey = l_partkey and p_brand = 'Brand#35' and p_container = 'MED BAG' and l_quantity < ( select

PAGE 120

120 0.2 avg(l_quantity) from lineitem where l_partkey = p_partkey ); Query #16: select c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice, sum(l_quantit y) from customer, orders, lineitem where o_orderkey in ( select l_orderkey from lineitem group by l_orderkey having sum(l_quantity) > 315 ) and c_custkey = o_custkey and o_orderkey = l_orderkey group by c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice order by o_totalprice desc, o_orderdate; Query #17: select sum(l_extendedprice* (1 l_discount)) as revenue from lineitem,

PAGE 121

121 part where ( p_partkey = l_partkey and p_brand = 'Brand#34' and p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG') and l_quantity >= 9 and l_quantity <= 9 + 10 and p_size between 1 and 5 and l_shipmode in ('AIR', 'AIR REG') and l_shipinstruct = 'DELIVER IN PERSON' ) or ( p_partkey = l_partkey and p_brand = Brand#51' and p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK') and l_quantity >= 10 and l_quantity <= 10 + 10 and p_size between 1 and 10 and l_shipmode in ('AIR', 'AIR REG') and l_shipinstruct = 'DELIVER IN PERSON' ) or ( p_part key = l_partkey and p_brand = 'Brand#31' and p_container in ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG') and l_quantity >= 26 and l_quantity <= 26 + 10 and p_size between 1 and 15 and l_shipmode in ('AIR', 'AIR REG') and l_shipinstruct = 'DELIVER IN PERSON' ) ; Query #18: select s_name, s_address from supplier, nation where s_suppkey in ( select ps_suppkey from partsupp where

PAGE 122

122 ps_partkey in ( select p_partkey from part where p_name like 'saddle%' ) and ps_availqty > ( select 0.5 sum(l_quantity) from lineitem where l_partkey = ps_partkey and l_suppkey = ps_suppkey and l_shipdate >= date('19970101') and l_shipdate < date(days(date('199701 01')) + 365) ) ) and s_nationkey = n_nationkey and n_name = 'BRAZIL' order by s_name; Query #19: select s_name, count(*) as numwait from supplier, lineitem l1, orders, nation where s_suppkey = l1.l_suppkey and o_orderkey = l1.l_orderkey and o_orderstatus = 'F' and l1.l_receiptdate > l1.l_commitdate and exists ( select from lineitem l2 where l2.l_orderkey = l1.l_orderkey

PAGE 123

123 and l2.l_suppkey <> l1.l_suppkey ) and not exists ( select from lineitem l3 where l3.l_orderkey = l1.l_orderkey and l3.l_suppkey <> l1.l_suppkey and l3.l_receiptdate > l3.l_commitdate ) and s_nationkey = n_nationkey and n_name = 'VIETNAM' group by s_name order by numwait desc, s_name; Query #20: select cntrycode, count(*) as numcust, sum(c_acctbal) as totacctbal from ( select substr(c_phone,1,2) as cntrycode, c_acctbal from customer where substr(c_phone,1,2) in ('18', '15', '21', '30', '16', '19', '32') and c_acctbal > ( select avg(c_acctbal) from customer where c_acctbal > 0.00 and substr(c_phone,1,2) in ('18', '15', '21', '30', '16', '19', '32') ) and not exists (

PAGE 124

124 select from orders where o_custkey = c_custkey ) ) as custsale group by cntrycode order by cntrycode;

PAGE 125

125 LIST OF REFERENCES [1] Lohman, G. M., and Lightstone, S. S., 2002, "Smart: Making Db2 (More) Autonomic," Proceedings of the 28th VLDB Conference Hong Kong, China, pp. 877879. [2] Kephart, J. O., and Chess, D. M., 2003, "The Vision of Autonomic Computing," IEEE Computer, 36(1), pp. 4150. [3] Kephart, J. O., 2005, "Research Challenges of Autonomic Computing," Proceedings of the 27th International Conference on Software Engineering, St. Louis, MO, pp. 1522. [4] Aboulnaga, A., Haas, P., Lightstone, S. S., Lohman, G., et al., 2004, "Automated Statistics Collection in Db2 Stinger," Proceedings of the 30th VLDB Conference Toronto, Canada, pp. 11461157. [5] Volker, M., Vijayshankar, R., David, S., Guy, L., et al., 2004, "Robust Query Processing through Progressive Optimization," Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, Paris, France, pp. 659670. [6] Markl, V., L ohman, G., and Raman, V., 2003, "Leo: An Autonomic Query Optimizer for Db2," IBM Systems Journal, 42(1), pp. 98106. [7] Stillger, M., Lohman, G., Markl, V., and Kandil, M., 2001, "Leo Db2's Learning Optimizer," Proceedings of the 27th International Conference on Very Large Data Bases Roma, Italy, pp. 1928. [8] Ilyas, I., Markl, V., Haas, P., Brown, P. G., et al., 2004, "Automatic Relationship Discovery in Self Managing Database Systems," Proceedings of the 2004 International Conference on Autonomic C omputing, New York, NY, pp. 340341. [9] Zilio, D. C., Lightstone, S., and Lohman, G. M., 2003, "Trends in Automating Physical Database Design," Proceedings of the 2003 IEEE International Conference on Industrial Informatics Alberta, Canada, pp. 441445. [10] Zilio, D. C., Zuzarte, C., Lightstone, S. S., Wenbin, M., et al., 2004, "Recommending Materialized Views and Indexes with the Ibm Db2 Design Advisor," Proceedings of the 2004 International Conference on Autonomic Computing, New York, NY, pp. 180187. [11] Zilio, D. C., Rao, J., Lightstone, S. S., Lohman, G., et al., 2004, "Db2 Design Advisor: Integrated Automatic Physical Database Design," Proceedings of the 30th VLDB Conference, Toronto, Canada, pp. 10871097. [12] Bruno, N., and Chaudhuri, S., 2007, "An Online Approach to Physical Design Tuning," Proceedings of the 2007 ICDE Conference Istanbul, Turkey, pp. 826835. [13] Bruno, N., and Chaudhuri, S., 2006, "To Tune or Not to Tune? A Lightweight Physical Design Alerter," Proceedings of the 2006 V LDB Conference, Seoul, Korea, pp. 499510.

PAGE 126

126 [14] Agrawal, S., Chu, E., and Narasayya, V., 2006, "Automatic Physical Design Tuning: Workload as a Sequence," Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, Chicago, IL, pp. 683694. [15] Bruno, N., and Chaudhuri, S., 2005, "Automatic Physical Database Tuning: A RelaxationBased Approach," SIGMOD 2005, Baltimore, MD, pp. 227238. [16] Chaudhuri, S., Konig, A. C., and Vivek, N., 2004, "Sqlcm: A Continuous Monitoring Framework for Relational Database Engines," Proceedings of the 20th International Conference on Data Engineering, Boston, MA, pp. 473484. [17] Agrawal, S., Chaudhuri, S., Kollar, L., Marathe, A., et al., 2004, "Database Tuning Advisor for Microsoft Sql Server 2005," Proceedings of the 30th VLDB Conference Toronto, Canada, pp. 11101121. [18] Chaudhuri, S., and Vivek, N., 1998, "Autoadmin What If Index Analysis Utility," SIGMOD 1998, Seattle, WA, pp. 367378. [19] Chaudhuri, S., 1998, "An Overview of Query Optimization in Relational Systems," Proceedings of the 17th ACM SIGACT SIGMOD SIGART Symposium on Principles of Database Systems Seattle, WA, pp. 3443. [20] GarciaMolina, H., Ullman, J. D., and Widom, J., 2000, Database System Implementation, Prentice Hall, Upper Saddle River, NJ. [21] Codd, E. F., 1970, "A Relational Model of Data for Large Shared Data Banks," Communications of the ACM, 13(6), pp. 377387. [22] Selinger, P. G., Astrahan, M. M., Chamberli n, D. D., Lorie, R. A., et al., 1979, "Access Path Selection in a Relational Database Management System," Proceedings of the 1979 ACM SIGMOD International Conference on Management of Data, Boston, MA, pp. 2334. [23] Ioannidis, Y. E., 1996, "Query Optimiz ation," ACM Computing Surveys, 28(1), pp. 121123. [24] Shasha, D. E., and Bonnet, P., 2003, Database Tuning: Principles, Experiments, and Troubleshooting Techniques Morgan Kaufmann Publishers, San Francisco, CA. [25] Choenni, S., Blanken, H. M., and Chan g, T., 1993, "Index Selection in Relational Databases," Proceedings of the Fifth International Conference on Computing and Information, Ontario, Canada, pp. 491496. [26] Caprara, A., Fischetti, M., and Maio, D., 1995, "Exact and Approximate Algorithms for the Index Selection Problem in Physical Database Design," IEEE Transactions on Knowledge and Data Engineering, 7(6), pp. 955967. [27] Comer, D., 1978, "The Difficulty of Optimum Index Selection," ACM Transactions on Database Systems, 3(4), pp. 440445.

PAGE 127

127 [28] Karp, R. M., 1972, Complexity of Computer Computations Plenum Press, New York. [29] Choenni, S., Blanken, H. M., and Chang, T., 1993, "On the Automation of Physical Database Design," Proceedings of the 1993 ACM/SIGAPP Symposium on Applied Computing, Indianapolis, IN, pp. 358367. [30] Ibm Autonomic Computing, 2003, "An Architectural Blueprint for Autonomic Computing", http://www.ibm.com/autonomic/pdfs/AC_Blueprint_White_Paper_4th.pdf (last accessed June 2009) [31] Barrett, R., Maglio, P. P., Kandogan, E., and Bailey, J., 2004, "Usable Autonomic Computing Systems: The Administrator's Perspective," Proceedings of the Fi rst International Conference on Autonomic Computing, New York, NY, pp. 18 25. [32] Lightstone, S. S., Lohman, G., and Zilio, D., 2002, "Toward Autonomic Computing with Db2 Universal Database," SIGMOD Record, 31(3), pp. 5561. [33] D. H. Brown Associates, I ., 2000, "Db2 Udb Vs. Oracle8i: Total Cost of Ownership," D. H. Brown Associates, Inc, Port Chester, NY. [34] Dageville, B., Das, D., Dias, K., Yagoub, K., et al., 2004, "Automatic Sql Tuning in Oracle 10g," Proceedings of the 13th VLDB Conference Toront o, Canada, pp. 10981109. [35] Valentin, G., Zuliani, M., Zilio, D. C., Lohman, G., et al., 2000, "Db2 Advisor: An Optimizer Smart Enough to Recommend Its Own Indexes," Proceedings of the 16th International Conference on Data Engineering, San Diego, CA, pp. 101 110. [36] Lifschitz, S., and Vaz Salles, M. A., 2005, "Autonomic Index Management," Proceedings of the Second International Conference on Autonomic Computing, Seattle, WA, pp. 304305. [37] Milanes, A., and Lifschitz, S., 2005, "Design and Implementation of a Global Self Tuning Architecture," Proceedings of the 20th Brazilian Symposium on Databases Porto Alegre, Brazil, pp. 7084. [38] Derr, M. A., 1993, "Adaptive Query Optimization in a Deductive Database System," Proceedings of the 2nd International Conference on Information and Knowledge Management Washington, DC, pp. 206215. [39] Sattler, K. U., Schallehn, E., and Geist, I., 2004, "Autonomous Query Driven Index Tuning," Proceedi ngs of the 2004 International Database Engineering and Applications Symposium Coimbra, Portugal, pp. 439 448. [40] Sattler, K. U., Geist, I., and Schallehn, E., 2003, "Quiet: Continuous Query Driven Index Tuning," Proceedings of the 29th VLDB Conference Berlin, Germany, pp. 11291132. [41] Bruno, N., and Chaudhuri, S., 2007, "Online Autoadmin: (Physical Design Tuning)," SIGMOD 2007, Beijing, China, pp. 10671069.

PAGE 128

128 [42] Schnaitter, K., Abiteboul, S., Milo, T., and Polyzotis, N., 2006, "Colt: Continuous onLine Tuning," Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, Chicago, IL, pp. 793795. [43] Gerald, C. F., and Wheatley, P. O., 1984, Applied Numerical Analysis AddisonWesley Pub. Co., Reading, MA. [44] Box, G., and Jenkins, G., 1990, Time Series Analysis, Forecasting and Control HoldenDay, Incorporated. [45] Koenker, R., and Hallock, K., 2001, "Quantile Regression," Journal of Economic Perspectives, 15(4), pp. 143156. [46] Yu, K., Lu, Z., and Stander, J., 2003, "Q uantile Regression: Applications and Current Research Areas," The Statistician, 52(3), pp. 331350. [47] Koenker, R., and D'orey, V., 1987, "Computing Regression Quantiles," Applied Statistics, 36(3), pp. 383393. [48] Chernozhukov, V., Fernandez Val, I., and Galichon, A., 2007, "Quantile and Probability Curves without Crossing." MIT Working Paper [49] Ihaka, R., and Gentleman, R., 1996, "R: A Language for Data Analysis and Graphics," Journal of Compuational and Graphical Statistics, 5(3), pp. 299314. [5 0] Transaction Processing Performance Council, "Tpc H Benchmark Specification", http://www.tcp.org/hspec.html (last accessed June 2009)

PAGE 129

129 BIOGRAPHICAL SKETCH Jeff Magnusson received his Bachelor of Science degree in computer science from Trinity University in the Spring of 2002 where he graduated with honors. He was awarded the Master of Science and Doctor of Philosophy degrees in computer engineering from the University of Florida in the S pring of 2006 and the Summer of 2009. During his graduate studies at the University of Florida, he worked as both a research and teaching assistant in the Department of Computer Information Science and Engineering. His resea rch interests include database systems, machine learning, artificial intelligence, and autonomic computing.