Citation
Selection predicate placement in database discrimination networks

Material Information

Title:
Selection predicate placement in database discrimination networks
Creator:
Kandil, Mokhtar, 1969-
Publication Date:
Language:
English
Physical Description:
xi, 134 leaves : ill. ; 29 cm.

Subjects

Subjects / Keywords:
Cardinality ( jstor )
Cost allocation ( jstor )
Cost analysis ( jstor )
Database management systems ( jstor )
Databases ( jstor )
Distance functions ( jstor )
Engineering ( jstor )
Heuristics ( jstor )
International conferences ( jstor )
Siblings ( jstor )
Computer and Information Science and Engineering thesis, Ph. D ( lcsh )
Database management ( lcsh )
Dissertations, Academic -- Computer and Information Science and Engineering -- UF ( lcsh )
Rule-based programming ( lcsh )
Genre:
bibliography ( marcgt )
non-fiction ( marcgt )

Notes

Thesis:
Thesis (Ph. D.)--University of Florida, 1998.
Bibliography:
Includes bibliographical references (leaves 125-133).
General Note:
Typescript.
General Note:
Vita.
Statement of Responsibility:
by Mokhtar Kandil.

Record Information

Source Institution:
University of Florida
Holding Location:
University of Florida
Rights Management:
The University of Florida George A. Smathers Libraries respect the intellectual property rights of others and do not claim any copyright interest in this item. This item may be protected by copyright but is made available here under a claim of fair use (17 U.S.C. §107) for non-profit research and educational purposes. Users of this work have responsibility for determining copyright status prior to reusing, publishing or reproducing this item for purposes other than what is allowed by fair use or other copyright exemptions. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder. The Smathers Libraries would like to learn more about this item and invite individuals or organizations to contact the RDS coordinator (ufdissertations@uflib.ufl.edu) with any additional information they can provide.
Resource Identifier:
029546805 ( ALEPH )
41456425 ( OCLC )

Downloads

This item has the following downloads:


Full Text











SELECTION PREDICATE PLACEMENT IN
DATABASE DISCRIMINATION NETWORKS













By

MOKIITAR KANDIL


A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA


1998

































@Copyright 1998

By

Mokchtar Kandil

































Dedicated To:
My parents, Abdelaziz and Touria,
And my sister, Mouna.















ACKNOWLEDGMENTS


I would like to express my deepest gratitude to my advisor, Dr. Eric Hanson, for all the guidance, advice and supervision he provided me with throughout the duration of this work. I thank Dr. Stanley Su, Dr. Sharma Chakravarthy, Dr. Randy Chow and Dr. John Alexander for kindly agreeing to serve on my supervisory comm-ittee, and providing me with a great deal of valuable input and advice. I cannot omit extending many thanks to Sharon Grant for both her help with various administrative tasks, and the organization of unforgettable database center social gatherings.

On a more personal note, I would like to express the deepest gratitude to my parents, Abdelaziz Kandil and Touria Bouabid, for their continuous support and encouragement throughout my life and more particularly during this entire experience. I cannot forget to thank my sister, Mouna, whose love, support and encouragement have always been a source of great comfort and inspiration.

A special expression of my gratitude goes to my sponsors, the Fulbright binational commission and the Amideast staff who provided essential financial and administrative support during the complete duration of my studies in the USA.

Finally, I thank all my friends and colleagues who all participated into making this stay in Gainesville a memorable one.


iv















TABLE OF CONTENTS


page


ACKNOWLEDGMENTS ....................................................................... iv

LIST OF TABLES............................................................................... vii

LIST OF FIGURES............................................................................. viii

ABSTRACT ....................................................................................... x

1 INTRODUCTION: THE TRIGGERMAN PROJECT....................................... 1
1. 1 Rules and Active Databases Overview: ..............................................1.
1. 1. 1 Integrity Constraint Checking and Repair:......................................1.
1. 1.2 Time and Temporal Issues:......................................................... 2
1. 1.3 Materialized View Maintenance: .................................................. 3
1. 1.4 Advantages and Shortcomings of Existing Rule Systems: ..................... 4
1.2 Extensibility and Extensible Databases Background:................................. 6
1.3 TriggerMan: ............................................................................... 9
1.3.1 The TriggerMan Environment: ................................................... 11
1.3.2 The TriggerMan Physical Architecture:.......................................... 13

2.PROBLEM STATEMENT................................................................... 18
2.1 A Discrimination Network Discussion and Example:............................... 18
2.2 Need for Selection Predicate Pushdown: ............................................. 25

3.B3ACKGROUND ON GATOR NETWORK OPTIMIZATION ......................... 28
3.1 Brief Review of the Gator Network Optimization Algorithm: .................... 29
3. 1.1 Combine Operators: ............................................................... 32
3.1.2 Local Change Operators: .......................................................... 33
3.2 The Cost Model:......................................................................... 35
3.2.1 Cost Functions for ax Nodes:...................................................... 36
3.2.2 Cost of Generating Temporary Join Results, (TR):............................. 37
3.2.3 Cost of Beta Nodes: ............................................................... 38

RELATED WORK: INADEQUACY OF THE RANK METRIC ....................... 41
4.1 Extensions to The Cost Model and Assumptions: ................................... 42
4. 1.1 The Edge Selectivity Factor:...................................................... 42
4.1.2 Predicates Selectivities and Costs: ............................................... 46


V









4.2 Allee-Gator Example:................................................................... 47
4.3 Generalization of the Cost Considerations:........................................... 53
4.3.1 Influence of Token Distributions on Decision Surfaces for Pushing a
Sigma:............................................................................... 58
4.3.2 Geometric Interpretation of the Analysis off(m):............................... 60
4.3.3 Qualitative Interpretation of the Analysis off(m):.............................. 64
4.4 Conclusion: .............................................................................. 65

5.STRATEGIES: DESIGN ..................................................................... 67
5.1 Adapting the Rank Metric to Gator Networks:....................................... 68
5.2 Two Phase Strategies: .................................................................. 71
5.2.1 Integrating the Simple Strategy With the Gator Optimizer:................... 72
5.2.2 Complexity Analysis:.............................................................. 76
5.2.3 Influence of Indexes on the Heuristic: ........................................... 77
5.3 One Phase Strategies:................................................................... 78
5.3.1 Complexity Analysis:.............................................................. 79
5.3.2 Semantic Issues for Kill Beta: .................................................... 79
Semantic Issues for Merge Sibling: ..................................................... 80
5.3.4 Suggested Fixes for the Semantic Issues: ....................................... 81
5.3.5 Sub-Strategies: ..................................................................... 82

6.STRATEGWES: IMPLEMENTATION...................................................... 84
6.1 Implementation of Local Change Operators:......................................... 85
6.2 Revision of the Cost Formulas Implementation: ..................................... 86
6.2.1 Cost of a Sigma Node: ............................................................ 86
6.2.2 Cost of a Beta Node: .............................................................90 g
6.3 The Optimizer: .......................................................................... 92
6.3.1 Implemented Randomized Algorithms:.......................................... 92
6.3.2 The Optimizer Characteristics: ................................................... 95
6.4 Experimental Results: .................................................................. 96
6.4.1 Two Phase Strategy Running Times: ............................................ 96
6.4.2 Testing Running Times for Both Strategies: .................................... 97
6.4.3 Analysis:........................................................................... 101
6.4.4 Testing Result Quality for Both Strategies: ................................... 106
6.4.5 Analysis:........................................................................... 108
6.4.6 Conclusion: .....................................................................109 lo

7.CONCLUSION AND FUTURE WORK .................................................. 113

GLOSSARY .................................................................................. 116

LIST OF REFERENCES...................................................................... 125

BIOGRAPHICAL SKETCH .................................................................. 134





vi















LIST OF TABLES


Table page

Table 1 Conditions for pushing down a selection predicate ................................. 57

Table 2 Analysis off(m)......................................................................... 60

Table 3 Comparing running times of OPS and TPS ........................................ 109

Table 4 Comparing networks costs of OPS and TPS ....................................... 110

Table 5 Recapitulation of the conditions for pushing down (72............................ 123


vii















LIST OF FIGURES


Figure 1- 1 The architecture of the TriggerMan environment ................................ 12

Figure 1-2 The architecture of TriggerMan.................................................... 14

Figure 1-3 TriggerMan as an Informix Datablade ............................................ 16

Figure 2-1 Rule Graph for Rule HighSalesSalary ............................................ 19

Figure 2-2 Discrimination network for rule HighSalesSalary ............................... 20

Figure 2-3 Selection pushdown................................................................. 25

Figure 3-1 Local change operator "Create Beta"............................................... 33

Figure 3-2 Local change operator "Kill Beta................................................... 34

Figure 3-3 Local change operator 'Merge Sibling"............................................ 34

Figure 3-4 Example of a Gator network ....................................................... 36

Figure 4-1 Analysis of selection predicate pushdown ........................................ 53

Figure 4-2 Representation of a surface Sn.............................................................61

Figure 4-3 Relative positions of Sm for different values of mn............................... 62

Figure 4-4 Addition of surface Si .............................................................. 63

Figure 5-1 Token frequency driven heuristic.................................................. 73

Figure 5-2 Maximum height of a Gator network ............................................. 76

Figure 5-3 Semantic issues for Kill Beta ...................................................... 79

Figure 5-4 Semantic issues for Merge Sibling ................................................ 80

Figure 6-1 Sigma at the top of the network.................................................... 87

Figure 6-2 Sigma with regular children........................................................ 87

Figure 6-3 Sigma list for transfers of PerChildlnsertCost and PerChildDeleteCost....89


Vill









Figure 6-4 Sigma with a Sigmia child .......................................................... 90

Figure 6-5 Cost of a Beta node ................................................................. 91

Figure 6-6 Beta node with a sigma child ...................................................... 91

Figure 6-7 Comparing running times of Phases I and II of TPS............................. 96

Figure 6-8 Relative running times with selective Joins and non-selective sigmas........ 98 Figure 6-9 Relative running times with less selective joins, non-selective sigmas ....... 99 Figure 6- 10 Variations in running times of OPS and TPS with less selective joins....99 Figure 6-11 Relative running times with large joins and selective sigmas............... 100

Figure 6-12 Variations in the running times of OPS and TPS with selective sigmas..101 Figure 6-13 Impacts of "Kill Beta" and "Push Sigma" operators on a network......... 103 Figure 6-14 Relative network costs for selective joins and non-selective sigmas....... 106 Figure 6-15 Relative network costs for non-selective joins and non-selective sigmas ... 107 Figure 6-16 Relative network costs for non-selective joins and selective sigmas....... 107


ix















Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy SELECTION PREDICATE PLACEMENT IN
DATABASE DISCRIMINATION NETWORKS By

Mokhtar Kandil

August 1998

Chairman: Eric N. Hanson
Major Department: Computer and Information Science and Engineering

Discrimination networks are data structures that are used to implement rule condition testing in active databases. Rules allow actions to be automatically undertaken (user notification, firing of another rule, etc.) when the database enters a particular state following a data insertion, deletion or modification.

Traditional database management systems handle limited (built-in) sets of datatypes and operators, typically numbers, strings of characters, money and datetime types, and basic operators. The extensibility feature allows knowledgeable users to "plugin" newly defined datatypes (such as polygons and time-series) and operators on those datatypes (such as overlapping of polygons, variations in the time-series). As a result, the spectrum of use of database management systems is extended to many more types of user applications (scientific, business, stock exchange, etc.).


x









The focus of this dissertation is on generating and optimizing discrimination networks, which may contain user-defined datatypes and operators as discussed above.

A discrimination network uses two sets of operators to match rule conditions: selections and joins. Traditionally, selections have been put at the top of the network so that they are computed first; the reason being that they are usually cheap and simple operators with respect to joins. While that assumption might hold for non-extensible systems, user defined operators (such as polygon overlapping, or fingerprint matching) can be expensive and complex operators, hence, their cost is no longer negligible with respect to that of joins. In such cases, pushing a selection operator down the discrimination network (in order to delay its evaluation in the rule condition) may prove to represent an effective cost improvement for rule matching.

Query optimization techniques have been proposed to handle the issue of joins and selection predicate ordering in query trees. This dissertation first shows how the aforementioned methods do not necessarily apply to discrimination networks, and how a direct application of those could fail to produce a correct ordering of joins and selections. We proceed by proposing a set of strategies (of increasing levels of complexity) which can be applied towards the placement of selection predicates in a discrimination network, and consider the corresponding implementation issues.


xi















CHAPTER 1
INTRODUCTION: THE TRIGGERMAN PROJECT

1. 1 Rules and Active Databases Overview:


In a passive (traditional) database management system (DBMS), data is manipulated (created, updated and deleted) only in response to queries issued by users or user-level applications. On the other hand, a system fitted with active capabilities (ADBMS) has the ability to perform such operations automatically. More generally, any action can be invoked in response to a particular event, when some predefined condition(s) is (are) true. The active behavior of such database systems is expressed by rules, also referred to as triggers, alerters, or monitors in the literature. In their most popular formn, active rules are defined with a threefold structure: Event-Condition-Action, very often referred to as the ECA model in the literature [Cha89, Day96J. The existence of a rule processor (rule-processing module) allows for a number of tasks to be carried out by active database systems without a dedicated application. A few examples of such tasks are enumerated in the following three sections.



1. 1.1 Integrity Constraint Checking and Repair:

Issues related to automating integrity constraints checking were fairly extensively investigated and are a classical application area for active databases. An article proposed by Ceri et al [Cer92] describes a system that ensures automatic correction of violated


I









constraints through production rules. Additional work by Ceri and Widom [Cer9O] also proposes the use of active rules for constraint maintenance. A special language for defining the constraints is described, as well as a framework for translating those constraints into rules. On a similar note, Etzion [Etz9 1] proposes to use active capabilities to handle incomplete or exceptional information in a database system. Fraternali and Paraboschi's paper [Fra93] is also a good overview of integrity constraints maintenance techniques, mostly based on first order logic. In another research effort, Gertz [Ger94J proposed concentrating on the repair aspect of integrity constraints, rather than mere detection. More generally, many active database research projects did undertake this question in some way or another: the Jasmine prototype [1sh93] is an object-oriented

(00) database with constraint facilities, able to handle both intra and inter-object constraints. Ode is another object-oriented DBMS developped at AT&T Bell Laboratories. A paper by Jagadish and Qian [Jag92] introduces integrity maintenance in Ode and focuses on inter-object constraints including referential integrity, uniqueness integrity and relational integrity. POSTGRES [Sto87] and PARDES [Etz93a, Etz93b, Etz94] also had modules and related publication centered on this idea.



1. 1.2 Time and Temporal Issues:

The notions of events and time are tightly related, and many models allow the user to express (with more or less simplicity) temporal or time-sensitive constraints. Consequently, multiple results from temporal logic and temporal database research have been integrated into active database management systems (ADBMS) models. Dittrich and Gatziu [Dit93] give an overview of the useful extensions to the active model if a notion









of time is supported. In a short paper [Sis92I, Sistla and Wolfson introduce the definition of a temporal language as an extension to complex event definition. The authors discuss two classes of temporal operators: "past temporal language" and "future temporal language." The operators: since, last time and previously apply to the accumulated history up to the time when the trigger was introduced; while until, next time and eventually apply to incoming events from the time when the trigger was defined. The same authors present some additional work on temporal conditions and integrity constraints in active database systems [Sis95aI, and temporal triggers in active databases [Sis95b]. Etzion, Gal and Segev [Etz92] also discussed issues related to supporting temporal rules in the PARDES project. Dori, Gal and Etzion wrote more recently on temporal active databases as a key to computer integrated manufacturing [Dor96]. The POSTGRES [Sto9 11 project proposed the idea of time travel, which allows users to run historical queries. This approach was implemented using a "no-overwrite" (also referred to as append-only) storage management technique.



1. 1.3 Materialized View Maintenance:

The importance of views in query processing is undeniable. Views can remain as predicates that are expanded at query processing time (using query modification techniques [Sto75]), or be materialized to improve query response time. The problem of efficiently and automatically maintaining materialized views has caught the interest of the database research community for several years. Rules in active databases have proved to be of great assistance in the process of automating the maintenance of materialized views. The work carried out by Ceri and Widom [Cer9O, Cer9 I] is a typical application of that






4


concept: The user defines a view as an SQL select expression and the system automatically derives production rules to maintain the materialization of that view. Other examples include ideas from the POSTGRES project [Sto87], additional work by Stonebraker et al [Sto9O] on rules, procedures caching and views, and a more recent publication by Botzer and Etzion [Bot96]. Furthermore, the fairly recent emergence of data warehousing as a research area in the database community has triggered a renewal of interest for further improvements in the processes of maintenance of materialized views.


1. 1.4 Advantages and Shortcomings of Existing Rule Systems:

Because rule systems have proved to be very useful in business-oriented as well as technical applications, mostly from a software engineering aspect, an increasingly larger number of (commercial) database management systems are now offering built-in support for some form of "trigger"~ or "rule" definition capabilities. Indeed, by factoring out the non-trivial and non-easily maintainable procedural coding issues of many applications, rule systems have facilitated application development, scalability and maintenance. However, and despite those multiple advantages, the applicability of "on-board" trigger systems in realistic environments is limited for a number of practical reasons:


Excessively simplified models:

The rule execution modules are often relatively too simple to handle realistic condition testing cases because a trigger can only be defined on a single "table," (in relational terminology) or data source more generally. The definition of a rule can potentially be as complex as a query with multiple joins and selections conditions. Systematically precompiling a complex rule definition as an SQL query to be run against






5


the database at the occurrence of each event can be very expensive in terms of systems resources used.


Lack of portability and flexibility:

Built-in trigger systems are very tightly coupled to the system they were written for. This lack of modularity in the design of active capabilities can cause important roadblocks in the processes of application migration between Database Management Systems, causing lower degrees of flexibility and portability.


Synchronous condition testing:

Despite numerous research designs aimed at describing decoupled or deferred approaches to perform rule condition testing (POSTGRES [5to86], HiPAC [Cha89, Day96]), many, if not all, of the existing implementations of triggers still run the condition checking synchronously, as part of the user-level transaction. While such an approach is acceptable in environments with limited numbers of rules of limited complexity, it does not scale well at all and can cause significant delays in commit times, and important overhead to DBMS's response times. A simple table insert operation can suddenly become extremely lengthy and expensive, when it causes the firing of hundreds of rules.






6


1.2 Extensibility and Extensible Databases Background:


As database management systems gained popularity within the industry, the need for user defined data types, structures and functions became more significant. Therefore, a second major focus of the research community in the past few years (in addition to adding active capabilities to database management systems) was aimed at providing better and generalized support for nontraditional applications as they were becoming more popular among users. Examples of such applications include Geographic Information Systems (GIS), spatial packages, CAD systems, multimedia and others.

Integrating all the necessary code within the kernel of the DBMS makes for large monolithic systems that still could be lacking some specific user requirements. Since no single database system could integrate built-in support for all possible applications, but most of those applications still reused an important set of common functionalities, research groups looked at ways of adding extensibility features to conventional database systems. As accurately described in the compilation of readings in database systems by Stonebraker [Sto88], there are several levels at which extensibility can be achieved, and the corresponding systems may be classified in different groups. An extensible DBMS may support:

1. New data types (such as polygons),

2. New operators on those types (such as overlapping),

3. New aggregate operators (third largest),

4. New operators on relations (such as the transitive closure),

5. New access methods (such as R-Trees operating on polygons),






7


6. Modifications to the data model.

In the most common form of extensibility that has come to be known as Abstract Data Type (ADT) support, we refer the reader to projects such as ADT-INGRES [0ng84], RAD [0sb86], POSTGRES [Sto86I, PROBE [Day87], Starburst [Haa9O], R 2D 2 [Lin881 and Sabrina [Gar89I. This form of extensibility, pioneered by Stonebraker et al [Sto83], consists of providing support for user defined data types (and functions operating on those types), through a set of precisely defined interfaces within the system. Using ADT support, users (programmers) can write modules defining new data types (box, polygon, image, etc.) and functions operating on those types (intersect, overlap, extract-contours) and "hook" or "plug" those modules directly into the DBMS (essentially points 1, 2, 3 and 5). All the projects mentioned above adopted this model, and differed only in terms of basic data models, query languages and ease of adding extension modules. It is important to note that in this approach, the data model itself remains fixed (usually referred to as "object-relational").

A more ambitious form of extensibility (often referred to as a "toolkit" approach) is that where the data model itself is considered as variable (points 4 and 6), has been investigated in projects such as EXODUS [Car86], GENESIS [Bat88] and DASDBS [Pau87, Sch9O]. The database kernel provides storage management and transaction support. The application specific modules and DBMS layers are implemented "on top" of that kernel. The "value added server" capability of the SHORE [Car94] project (of which PARADISE [Dew93, Dew94] is a good example) constitutes a good implementation of this concept, carried out at the University of Wisconsin, Madison. Object-Oriented database systems such as ORION [Ban87], OSAM [Su93] and others are also extensible






8


by means of their data sublanguage (ie at the datatype level). Once again, we refer the interested reader to [Sto88] for a more comprehensive discussion of this topic. We will simply summarize here the proposed terminology translation table, between the Object Oriented and the relational worlds: A class represents a type, and an object is an instance of the type. Furthermore, a method and a message represent respectively a function and a function call. Since extensibility in OODBs relates to allowing the creation of new classes and new corresponding methods, then passing messages to the newly created objects, it follows that it is also an ADT type of extensibility, centered mainly on points 1 and 2.

Even though the concept of extensibility using ADT support is less ambitious than the "toolkit" approach is, it still carries far-reaching consequences for the architecture of the database system. As a matter of fact, the DBMS design needs to be carefully thought out so that it is extensible at multiple levels:

Language level:

At the language level, the user should be able to declare new data types and functions (methods, operators) and use them in appropriate queries (and rules).


Storage management and access methods level:

The introduction of new types such as polygons and boxes often requires the definition of new access methods allowing efficient retrieval and indexing of objects of such types. In the case of spatial data, an example of a new access method could be RTrees [Gut84].






9


Query optimizer level:

The query optimizer is usually a very complex component of the database management system that is "closed" (with built-in logic). The extensibility at the storage management level discussed above is totally useless if the query optimizer can not be made aware of the existence of the new access method, and make proper use of it while generating plans. Thus, the query optimizer architecture needs to be designed in an open formnat (sometimes referred to as "table-driven") allowing the addition of new access methods or functions with their associated costs.

1.3 TriggerMan:


The TriggerMan [Han97a] project (under development at the Database Research Center at the University of Florida) implements an asynchronous and extensible trigger processor system. Many of the design features and concepts adopted for TriggerMan were meant to handle or reduce the significant shortcomings of "traditional" rule processors as they were described earlier in section 1.1.4, and integrate the ADT based extensibility technology described in section 1.2 above. The reader will find below a brief description of the major design decisions that went into TriggerMan. Asynchronous condition testing:

TriggerMan is an asynchronous, external ("outboard") trigger processor, which has been designed to cooperate with various heterogeneous data sources. Data sources can be any combination of legacy systems, relational, object-relational or object oriented DBMSs, simple unstructured data sets or applications. Updates are forwarded to TriggerMan (for rule condition testing) after they have committed within their






10


corresponding data source, hence the "asynchronous" designation. The forwarding of updates is performed by special data sources applications providing standard interfaces between TriggerMan and the data sources. From a scalability point of view, this approach is expected to perform much better than the (mainstream) synchronous one for large sets of rules. Furthermore, the external aspect of TriggerMan will allow it to cooperate with multiple heterogeneous data sources more easily than built-in trigger processors can.


Use of Gator networks:

TriggerMan uses a data structure called a "Gator network" to perform the rule condition testing. The Gator network is a generalization of the earlier Rete [For82] and TREAT [Mir87] discrimination networks, and was developed at the University of Florida as part of the Ariel project [Han92, Han96a]. Discrimination networks are data structures borrowed from Al production rule systems such as OPS5 [Bro85]. The use of discrimination networks in TriggerMan is expected to overcome the simplistic models of many trigger processors, and allow efficient monitoring of sophisticated conditions based on multiple joins, without the added overhead associated with SQL query compilation and execution.


Temporal component:

TriggerMan incorporates a highly expressive and non-procedural (SQL like) temporal language for rule definition. The user is provided with an initial set of built-in temporal functions (increase(), decrease etc.) and completeness is achieved through the extensibility feature, which enables users to write and plug-in a new temporal function the same way new data types and methods are [Alfa98].






I1I


The architecture of the environment in which TriggerM an operates, as well as a brief overview of the decisions that were made regarding the implementation issues, are covered next.



1.3.1 The TriggerMan Environment:

As described in Figure I-A below, the architecture of the TriggerMan environment consist of one or more external data sources "feeding" the Asynchronous Trigger Processor with data consisting of "update descriptors". Those external data sources can be of multiple types:

- An application generating real time data (stock exchange ticker etc.).

- A replication server reading data from a DBMS (such as Sybase replication server

[5yb96]).

- A DBMS application that uses simple built-in triggers to catch local updates, and

forwards them.

- A non-structured data source such as a spreadsheet or a word processor file.

A "data source program" application will act as an interface between TriggerMan and the external data sources. Depending on the inherent characteristics of those external data sources (structured, queriable, active etc.), the role and complexity of the data source program will vary on a per-case basis. Hence, a TriggerMan data source application may range from a process with advanced logic, regularly performing differencing on a passive non-structured, non queriable source, to a much simpler "forwarding" layer (Figure 1- 1).






12


DBMS
Application OR ---- Replication ---- OR with simple
Server Dtriggers












TriggerMan TriggerMan TriggerMan
Data Source Data Source Data Source
Program Program Program




















ConsolClientn


Figure I1-I The architecture of the TriggerMan environment.






13


In addition to providing TriggerMan with data (update descriptors) for rule condition testing, data source applications may perform some processing on the formats of those update descriptors in order to convert them to a format "understood" by TriggerMan when (and if) necessary. The conversion will be performed by a "translator" or "translation layer" reading the external data source format and calling a precisely defined API, in order to generate the tokens in the internal format required by TriggerMan as illustrated in Figure 1-1. A number of client applications may then register with TriggerMan for particular event notifications that can be raised as part of rule actions. When events are raised, each application that registered for it is notified. A console application is used as the "administrative" interface to the system, allowing startups, shutdowns, activation, deactivation of triggers and other administrative tasks.


1.3.2 The TriggerMan Physical Architecture:

The design team was faced with a few possible implementation strategies for the first prototype of TriggerMan. Each of the envisaged approaches was fully compatible with the logical architecture described in section 1.3. 1, but the required amounts of implementation and coding, and the capabilities and speed of the resulting system varied substantially depending on the chosen scheme.


Stand-alone version with local storage management:

In this strategy, TriggerMan is to be implemented as a stand-alone application, using a local storage manager supporting transactions and recovery management. Parallelism support being an integral part of the design of the project, the overall physical architecture is described in Figure 1-2 below:






14


SMP


I High Speed Interconnection


SMP- SM SMP-A29P

























Figure 1-2 The architecture of TriggerMan


In an ideal parallel configuration of this architecture, a copy of the TriggerMan server runs on each node of a shared-nothing architecture. The nodes of the sharednothing environment are connected with a high-speed medium (bus or fast network) and


SMP


-------------------


------------------------- ---------------


/






15


each of them can either be a single processor machine, or an SMP cluster grouping a number of processors sharing a local memory area. A Virtual Processor (VProc) abstraction concept, embedded within TriggerMan will map VProcs to real processors in a flexible way, allowing high fault tolerance and dynamic load sharing capabilities within the server. Each VProc runs a pool of threads handling various tasks including:

- Local rule condition testing.

- Tokens exchanges with other VProcs according to the data-partitioning scheme.

- Possible local basic storage management.

- Caching strategies of the local partitions of the memory nodes.

Although this approach would have represented the ideal implementation, given the unavailability of an open and reliable storage management layer, and considering the tremendous amount of development effort required to write one, it was decided not to try to utilize this architecture.


Standalone version with external storage management:

This approach is very similar to the previous one and differs only in the storage management strategy. Instead of relying on its own storage manager, TriggerMan would use a commercial DBMS for storage purposes, using ODBC or any other message passing interface which could be made available. While this plan of action was technically feasible, it suffered from serious performance limitations in handling large amounts of data through the message passing type of communication. Furthermore, a few additional technical problems would have increased the internal complexity of TriggerMan without any major benefit. It was therefore decided to put this scheme aside.






16


Plug-in within an extensible DBMS:

A middle ground strategy, procuring a reliable storage management layer for TriggerMan, without extensive code writing (or excessive communication overhead), was to develop the project as an extension to an object-relational DBMS with ADT support. The advantages of this strategy are multiple and can be summarized as follows:

- As a plug-in module running in the same address space as the DBMS, TriggerMan would benefit from "commercial strength" local storage management, without the

overhead associated with message passing.

- Any support of parallelism within the DBMS is freely available to TriggerMan through the transparent SQL "call-back" interface.


TriggerMan
Datablade

Generic
DataBlade











IDSIUDO

Figure 1-3 TriggerMan as an Informix Datablade Figure 1-3 above describes the implementation strategy that has been chosen to develop the first prototype of TriggerMan, as an Informix Dynamic Server with Universal Data Option (IDS/UDO) Datablade. The TriggerMan catalogs and rules are stored within






17


the Informix server in regular "tables". The access to those elements is made through standard SQL calls made to IDS/UDO.


Because of technical implementation issues related to Informix threads handling in datablades, it was decided to keep the TriggerMan code passive, and signal it on a regular basis through an external simple timer application: the TriggerMan driver.















CHAPTER 2
PROBLEM STATEMENT



As mentioned in the introduction, TriggerMan relies on Gator networks to perform rule condition testing within the context of an Object-Relational Database Management system and environment. In this chapter, the reader will find a brief tutorial of discrimination networks, and a precise description of the specific issues that this dissertation will cover.

2.1 A Discrimination Network Discussion and Example:


For the reader who is unfamiliar with discrimination (and Gator) networks, this section is meant to present a brief tutorial of these data structures through a general discussion of their characteristics. A simple example presented below will be referred to throughout the discussion for illustration purposes. The reader who is already familiar with discrimination networks may skip this section. Consider a simple two-table schema: EMPLOYEE (SSN#, Name, Age, Salary, Dno) DEPARTMENT (Dno, Dname)

And a rule condition defined on the schema above as follows: "If the record of a Sales department employee with a salary higher than $100,000 is accessed (inserted or modified), then notify Bob, the manager of the department." This condition could be defined in a rule language as follows:


18






19


Create Trigger HighSalesSalary
From EMPLOYEE as EMP, DEPARTMENT as DEP
When EMP.Salary >= 100000
And EMP.Dno = DEP.Dno And DEP.Dname = 'Sales'
Then raiseEvent HighSalesSalaryUpdate( "Bob", EMP.Dno).


Just as queries can be represented as graphs referred to as "query graphs," rules can also be represented in a similar way: the rule graph. A rule processor or active database will then internally represent the trigger HighSalesSalary defined above, in the format of a graph such as the one drawn in Figure 2-1 below:


(71: Salary > 100,000 (T2: Dname = Sales


EMP.Dno = DEP.Dno
EMP DE

Figure 2-1 Rule Graph for Rule HighSalesSalary Where:

- EMP is a tuple variable that represents the table EMPLOYEE

- DEP is a tuple variable that represents the table DEPARTMENT

- a, represents the selection "Salary > 100000"

- (a2 represents the selection Dname = 'Sales'

- The horizontal edge between EMP and DEP represents the join condition written just

above it.






20


A corresponding discrimination network would be that given in Figure 2-2 below: Routing
Node







cL.Dno =o2.Dno








Pnode

Figure 2-2 Discrimination network for rule HighSalesSalary In Figure 2-2:

-cc, is derived from tuple variable EMP from the rule graph in Figure 2-1.

-cc,- is derived from tuple variable DEP from the rule graph in Figure 2-1.

-a, and a2, have the same meanings as in the rule graph from Figure 2- 1.

-The edge drawn between (xi and ciX2 represents the Join condition between EMP and

DEP from the rule graph. By convention, join conditions will be represented as dashed lines joining sibling nodes. Sibling nodes are always at the same distance

from the Pnode.

-The full arrows represent a child --- parent relationship. The parent node stores the

results of the joins across all of its children. The arrows have been drawn here for






21


illustration purposes, but typically, simple lines or segments joining the child to its

parent will replace them.

-The routing node and the dashed arrows are also drawn here for illustration and

tutorial purposes. The dashed arrows represent the routing capabilities of the routing node to send incoming data to the right alpha node. The presence of the routing node and the dashed arrows will be implicit in all further discrimination and Gator

networks figures.


General structure of a discrimination network:

In general, a discrimination network can contain the following types of memory' nodes:

- A Pnode, which is the actual root of the tree representing the discrimination network.

As we will see very shortly, data (tokens) are propagated through the network, and the semantics of discrimination networks is such that when a token reaches the

Pnode, all the rule conditions have been met and the trigger fires.

- A set of alpha nodes, one for each tuple variable from the query graph of the rule.

- A set of beta nodes. Beta nodes are intermediate nodes used to store the results of

two or more alphas and/or betas. (Since the current example has only one join condition, it does not contain any beta node. A more general network with beta nodes

will be presented at the end of this discussion).

- The "Routing" node was sometimes called a "Root" node in earlier papers, but it is

not to be confused with the Pnode of the network. The "Routing" node redirects



1Despite their denomination, memory nodes can be (and often are) stored on disk.









incoming tokens to the appropriate o nodes for further processing. We will therefore

use the term "Root" only to refer to the Pnode.

Now that the structure of discrimination networks has been described, it is necessary to present the dynamic aspect of these structures, represented by the token propagation that was briefly mentioned a few lines above.


Token propagation in discrimination networks:

The process of testing a rule condition when an update occurs (in the database) is sequenced in the following way: each access to a relation appearing in a rule condition will generate the creation of an update descriptor. An update descriptor (also referred to as a token) is simply the tuple that was accessed, to which a few auxiliary fields are added for rule processing purposes. The token in question will be transferred to the Routing node of the discrimination network(s) representing the rules defined on the current schema. Based on a table (or data source) field in the header of the token, it will be forwarded to the appropriate ot node and propagated down the network from there (for testing the rule condition). Consider again the example of the schema described at the beginning of this section; each access to relations EMPLOYEE or DEPARTMENT will therefore generate the creation of an update descriptor that will be transferred to the Routing node described in Figure 2-2 above. Suppose that table DEPARTMENT is the following: Dno Dname
I Finance
2 Sales
3 Research & Development






23


And a tuple (SSN =1, Name = "Bob", Age = 30, Salary = 150000, Dno = 2) was inserted in table EMPLOYEE. An update descriptor of the form [EMPLOYEE, INSERT, 1, "Bob", 30, 150 000, 21 is generated and sent to the routing node of the discrimination network. The same token is then forwarded to al based on the EMPLOYEE field. The field INSERT is used to determine the type of database operation that was performed, since that can be part of the rule condition (on INSERT, on DELETE etc.), and filtered out by the routing node if necessary. The selection predicate represented by al (salary > 100000) is then tested and succeeds. The update descriptor is therefore inserted into the node al~ where the join condition with the departments is checked. Since the current department is "Sales," and the Dno for Sales is "2," the join succeeds too and a resulting compound token (consisting of the concatenation of both tokens from atl and cX2) is created and inserted into the Pnode. As mentioned earlier, the insertion of a new token into the Pnode signals the matching of the rule condition for the current database access. The network used for this particular example was simple enough not to require any beta node to store intermediate joins results. In general, rules with three or more tuple variables will generate networks with three or more alpha nodes, and the presence of beta nodes will become likely. Consider the network below:




u1- (k"' cX3 r4-' '(X5 (X-6

V/






Pnode






24


In this case, the rule has six tuple variables (six alpha nodes) and three selection predicates. When the network is first created, it is "primed," which means that all nodes are loaded with their respective data. Alphas are primed from the relations they are derived from (with the selection condition applied), and betas are primed by executing the join operation across all their children. (The Pnode is always emptied after each rule condition testing and does not require primingg. The results of the joins between the first three alphas are stored in P3, while P32 stores the results of the joins between at4 and a5. The Pnode contains the result of joins between 1, P32 and a6. The propagation of tokens is applied in the same way it was described earlier. Each time a token is inserted into a node, it is joined with all the siblings of that node (according to a predefined ordering of those) and the result of that join (potentially many new "compound tokens"), is inserted into the parent node and so on. As mentioned earlier, a compound token is simply the concatenation of the tokens that joined from each of the sibling nodes. As an example, consider an update descriptor originating from the relation from which (X2 is derived. The routing node (not drawn) routes the token to CC2, which causes G2 to be tested first. If the test succeeds, the token is inserted into a2 and joined with a1i and aC3 (the joining order is part of the optimization algorithm and is hard coded within the 'join plan" of each node). All the tokens generated by this join are inserted into 1 and joined with 132 and cN. Any resulting token is inserted into the Pnode and causes the rule to fire. Note that all tokens inserted into the Pnode do not remain there, but are deleted when the rule executes. Those tokens are needed in the Pnode for condition/action binding purposes, when the action of a rule refers to the data that had caused it to fire.






25


2.2 Need for Selection Predicate Pushdown:


In the extensible and object-relational environment in which TriggerMan operates, selection predicates can be arbitrary user-defined functions or methods. Suppose that the table EMPLOYEE had one additional field "fingerprint" of type "image" or "vector data." Suppose furthermore that the (selection) predicate al is no longer a condition on the salary, but a complex operation executed by one of the image (vector data) data type methods (such as matching fingerprints etc.). The computation of a1, now becomes expensive enough for the cost of the join operation a1 .Dno = C(2.Dno to become negligible in comparison with it. In such a case, it would probably be a good optim-ization strategy to perform the join operation first, and delay the selection evaluation as much as possible. A "Push Down" operation is then performed on Cri, which is replaced by a "True" operator at the leaf level, in order to allow all tokens to get through. The resulting network is shown in Figure 2-3 below:


True G,)

ct1.Dno = cX2.Dno













Figure 2-3 Selection pushdown






26


Indeed, the potential reduction of the number of tokens flowing through a, represents in turn a potential optimization of this network with respect to the one described earlier in Figure 2-2.

Remark:

Since Gator networks are always represented with the root node (Pnode) at the bottom and the leaves (alpha nodes) at the top, a selection pushdown means moving the sigma node so that it becomes closer to the root of the network. Since query trees are usually represented the other way around, with their root at the top, the pushdown and pull-up operations in query optimization [Hel92, He193] have opposite meanings to those used in discrimination network optimization strategies.

All Gator networks designs and implementation have traditionally placed selection predicates at their leaves so that they can be performed first. While this approach proved worthwhile and effective in existing optimization strategies, the ability of TriggerMan to operate in extensible environments makes this heuristic no longer true. As a matter of fact, selection predicates defined on objects such as polygons or images may prove to be more expensive than (some) joins, and a more elaborate heuristic (or set of heuristics) for selection placement in the discrimination network will become necessary.

The intention of this dissertation is to investigate the approaches that can be taken towards achieving an optimal or near optimal placement of selection predicates in discrimination networks in general, and more particularly in Gator networks. The organization of the rest of the chapters of this work is structured as follows:






27


In Chapter 3, the reader will be familiarized with the most important aspects of the existing Gator network optimizer as it is implemented in the Ariel project [Han92]. Chapter 4 will cover the related work that was accomplished in selection predicate placement in query processing, and show how those results cannot be applied to discrimination networks optimization strategies. In Chapter 5, we will propose an alternative set of strategies extending the existent Gator networks optimizer that was described in Chapter 3, and analyze the consequences of each of them (on the current optin-izer) mainly from a conceptual point of view. Chapter 6 will describe the practical issues that arise during the implementation of the strategies described in Chapter 5, and present the practical results that were obtained. In Chapter 7, we will conclude with a brief review of this work and present additional related issues and open research questions.















CHAPTER 3
BACKGROUND ON GATOR NETWORK OPTIMIZATION

TREAT [Mir87] and Rete [For82] networks have been successful approaches for performing non-procedural rule matching in active databases. Treat networks do not make use of the intermediate beta nodes described earlier, and have all their alpha memory nodes feed directly into the Pnode. Therefore, Treat networks are always twolevel networks (The HighSalesSalary network shown above (Figure 2-2) was a TREAT network with a single join condition). The Rete network uses betas that can store the results of intermediate two-way joins only. In other words, a beta node never has more than two children in a Rete network.

Gator networks extend the shapes of discrimination networks to general shapes by allowing betas to have N other nodes as direct children. As a result, TREAT and Rete become special cases of the more general Gator structure.

Because of its generalized shape, the optimization of a Gator network becomes an expensive procedure with the increase of the number of tuple variables involved in the query graph. As a matter of fact, it is easy to see that the number of alternatives for building a Gator network increases extremely rapidly with the number of alpha nodes. Since our approach consists of an extension of the existing cost model, we will present a brief review of the current Gator network optimization strategy before presenting the selection predicate placement strategy itself.


28






29


3.1 Brief Review of the Gator Network Optimization Algorithm:


Randomized algorithms have been noticed to produce very acceptable solutions (nearly optimal) with computation or execution times that are orders of magnitude smaller than dynamic programming approaches, when dealing with exponential search spaces. Such algorithms include:

- Simulated Annealing [1oa87],

- Iterative Improvement [Swa88],

- Two-phase Optimization, which combines the two previous algorithms.

The existing implementation in the Ariel[Han89, Han96, Has93] project includes the use of such algorithms for the generation of Gator networks. The reader will undoubtedly notice that the used approach is a greedy algorithm, starting with an initial solution and trying to improve it at each step of the optimization process. The solution is of course prone to the local minimum problem common to all greedy algorithms, but the use of simulated annealing and similar techniques (such as using multiple starting points) are meant to reduce, if not eliminate, this problem.

The remaining part of this chapter gives a general overview of the existing implementation of Gator network optimizers carried out at the University of Florida. Below we describe a typical randomized algorithm (As a function named GetOptima iGator) described in high-level pseudo code, then we give a brief overview of the proposed cost model.






30


Gator * GetOptima iGator (RileGraph * Graph)

Begin

BestSolution = BuildlnitialGatorNetwork(Graph)

While (no satisfying solution found)

New-Gator = ApplyLocalPerturbation (BestSolution)

If (Cost (New-Gator) < Cost (BestSolution) ) then DeleteGator (BestSolution) BestSolution = New-Gator Else

I DeleteGator (New-Gator)
Endif

End While
Return (BestSolution)

End


The exit condition from the While ioop described as "no satisfying solution found" can be implemented in different ways according to the chosen solution used for the randomized approach. Possible conditions for a satisfying solution include:

- Perform a specified number of iterations then exit.

- Iterate until the cost drops below some predefined threshold then exit.

- Iterate until a certain number of operators called successively fail to improve the

network.

- A combination of both previous cases: iterate until a specified number of iterations

are reached or the cost drops below a certain limit, whichever comes first.

The next algorithm gives a description of the function BuildInitialGatorNetwork that is called at the beginning of GetOptimalGator described above.






31


Gator* Buildin itialGatorNetwork (RuleGraph * Graph)

Begin

LocalList = New List

NeighborsList = New List


For (each tuple variable in Graph)

AlphaNode = new Alpha

If (tuple variable has a a with selectivity s)

I AlphaNode-* selectivity = s
Endif

LocalList 4)1nsert (AlphaNode)

EndFor


While (LocalList not empty)

Node 1 = Randomly select one element from LocalList

NeighborsList = FindNeighbors (Node 1, LocalList)

k = Size of (NeighborsList)

Node2 = Randomly select an element from NeighborsList

BetaNode =Combine (Node I, Node2)

LocalList -~remove (Node 1) LocalList -*remove (Node2)

LocalList -*insert (BetaNode)

End While

End

The function FindNeighbors simply walks down the list of nodes and returns all the nodes that have a direct edge connection with the node currently passed as a parameter.

As the reader may have already noticed, the Gator network is incrementally built using the Combine(Node *, Node *) function. This function implements a set of three basic






32


combination operators for nodes, which incrementally return larger networks. Below is a description of the operators implemented in the function Comnbine.

3. 1.1 Combine Operators:

The three operators, Join Absorb and Merge are shown in this section. him:

Two existing nodes are made children of a newly created one (beta), the new network is one level deeper.


Absorb:

The white node is absorbed and becomes child of an existing P~ node.


+ 0 -


Merge:

Merge applies only to two P3 nodes.






33


Once the initial network is randomly built using the operators above, it is improved gradually by using a set of three local change operators "Create Beta," "Kill Beta" and "Merge Sibling" described in the following section.

3.1.2 Local Change Operators:

Gator* ApplyLocalPerturbation (Gator* Current) Begin
Perturbation = Choose Randomly In { CreateBeta, KillBeta, MergeSibling)
New-Gator = ApplyPerturbation (Current, Perturbation)
Return (New-Gator)
End

Gator* ApplyPerturbation (Gator* Current, char* perturbation)

A set of three local perturbation operators has been specified. At each iteration, one of the operators is randomly chosen and applied to the network. Below is a description of those local operators.

Create Beta:

ct- (X 0('~


Pnode


Pnode


X


Figure 3-1 Local change operator "Create Beta" The Create Beta operator requires a node with at least three children. In the example above, the node in question is the Pnode itself, but it can potentially be any P3 in the network or the Pnode.






34


Kill Beta:


U






Pnode


Pnode


Figure 3-2 Local change operator "Kill Beta" The beta is killed, and all its children become direct children of the (old) parent of that beta. In the example above, all four alphas become children of the Pnode. Merge Sibling:

(X",,,0 (X"IX - c x- *c









PnodePnode Figure 3-3 Local change operator "Merge Sibling"

The merge sibling operator requires a node to have at least three children of which at least one is a P~ (in the example above, the Pnode has been selected again as such a node but that is not a necessity). In the case where more than one P3 node is available, one of them is randomly chosen as a "MergeNode." One of the siblings of the "MergeNode" is then selected and moved to a child position with respect to "MergeNode" as shown in Figure 3-3 above.






35


3.2 The Cost Model:


The comprehensive description of the Gator discrimination network cost model is somewhat lengthy and will not be covered in detail. A complete reference can be found in [Han96b]. In this section, the reader will find a short overview of the cost model, reviewing the cases where all nodes are stored and do not have any indexes defined on any of the attributes. Since the use of indexes and virtual nodes does not have any influence on the remainder of this work, there will be no particular need in covering extensively all the cases.



Notation:

CPUweight The relative weight of the CPU utilization.

"/Oweight The relative weight of the 1/0 operations.

N A node in the discrimination network (cc, f3 or Pnode).

Fi(N) The insert frequency in node N.

Fd(N) The delete frequency in node N.

Card(N) The cardinality of node N.

Ci(N) Insertion cost in node N.

Cd(N) Deletion cost in node N.

Sel(A) Selectivity of the predicate/condition A.

JSF(A,B) Join selectivity factor between nodes A and B.






36


The network below will be used as a reference example that we will refer to throughout the description of the cost formulas in the remaining sections of this chapter.


G~I True GY2 True 03~ True





Q I. --- -P1




Pnode
Figure 3-4 Example of a Gator network


3.2.1 Cost Functions for a~ Nodes:

The cardinality of an alpha node is estimated as the product of the cardinality of the relation on which the node is defined and the selectivity Sel(a;) of the selection predicate defined over the same relation. If there is no selection defined on the relation, a is considered to be the logical "TRUE" predicate and Sel(aY) is taken to be 1. The cost of an alpha node is defined by

Cost Wc) = CM(X) + Cd(cX)

Note: As indicated at the beginning of this section, we will only consider here the case of stored alpha nodes with no index defined on any of the attributes. (The use of indexes is not crucial in the context of this work. The reader will find an elaboration on this aspect in chapter 5).






37


Cost of insert in the alpha node:

The insertion of a new tuple in a relation requires one disk read and one disk write, therefore two I/Os do occur. Since only one tuple is inserted, CPUweight is incurred once.

CM(c) = (CPUweighl + 2 * I/Oweight) * Fi(oa)




Cost of delete from alpha nodes:

In order to delete a tuple from an alpha node, all the pages of the alpha node are read once and tuples in those pages are scanned in order to identify the one to delete. Then the page where the tuple was rewritten is written back to disk.

Cd(aX) = [CPUweight * Card(ca) + 1/Oweight * (Page(ct) + 1)] * Fd(ct)




3.2.2 Cost of Generating Temporar Join Results, (TR):

When a token is inserted in a node with multiple siblings, a sequence of two-way joins is executed until the complete join is computed. The order in which the joins are executed is predefined in the "join plan" of each node. The join plan is computed when the network is built, as the sequence generating the smallest possible intermediate join results: TRn's. In the example network above, an insertion in (X2 would join with either 0C, or (X3 (depending on the join plan Of aX2), then the resulting set of compound tokens would be joined to the remaining sibling in the join plan. The formulas below estimate the cost of generating these sequences of temporary joins for both alpha and beta nodes.









Stored Alpha nodes - No index on join attribute:

Assuming the system has enough memory, the cost will consist of reading the alpha node from disk and performing the join of TR,, with the alpha node:

Cj(TRn, W: = JlOweigh, * Pages(c) + CPUweight * Card(0:) * Card(TRO)

With TR0 = 1,

Card (TR,,)= Card (TR,-.j * JSF(Nn.1 4* NO).

Note: Nn-I - Nn represents the two nodes in sequence in the join plan of the alpha node where the token was initially inserted.



Beta nodes - No join attribute index:

The cost of joining from a TR,, to a beta node (without indexes) is similar to that of joining with an alpha node. The cost formula is the same as the previous one, where alphas are replaced with beta nodes. Cj(TRn, P~) = /Oweight * Pages(f3) + CPUweight * Card(f3) * TRn With TR0 = 1

Card(TRn)= Card(TRn-1) * JSF(Nn-I --- NJ) Note: Nn-I -- Nn represents the two nodes (alpha or beta) in sequence in the join plan of the beta node where the token was initially inserted.



3.2.3 Cost of Beta Nodes:

The cost of a beta node consists of two components:

- A local cost noted LocalCost.


38






39


- The cost for children nodes. Cost(p1) =LocalCost( P3) + I Cost(N)
N e children) P)

LocalCost(13) ={Fi(N) * InsertCost (N, P) + Ed(N) * DeleteCost (N, P))
N E children( 3)

The LocalCost represents the cost incurred to update the beta node itself. This cost has two components: insertion and deletion costs related to the beta node. Each of these two components is computed incrementally with respect to each of the children of the beta node. Each elementary contribution relative to the current child, is considered relatively to the "weight" of that child, which is represented by the corresponding insertion and deletion frequencies Fi and Fd. Below is a high-level pseudo code description of the InsertCost and Delete Cost procedures. Considering the current child N of node P3, the following procedure gives the participation of N in the insertion cost of1. Insertfost (N, 1.)

Begin
TRsize = 1
TempCost = 0
For each node n in the join order plan of N
TempCost = TempCost + Cj(TRsize,n)
I TRsize = TRsize * Sel(ThisNode - NextlnPlan)
EndFor

updateCost TRsize2* /eih
tuplesPerPage(/3)1*2*lOwih
return (TempCost + updateCost) End

Similarly to insertion costs, deletion costs are given by the following procedure for the current child N of a node 13. The Yao(Pages(13), TRsize) function appearing in the






40


procedure DeleteCost below simply estimates the number of pages of P3 that will be touched when TRsize tokens are (randomly) selected from the node. The exact definition of the function Yao as well as estimates of it is given by Bernstein et al [Ber8 1]. DeleteCost (N. P3):

Begin

TRsize = I

For each node n in Join order plan of N

TempCost = TempCost + Cj(TRsize,n)

TRsize =TRsize * Sel(ThisNode --*NextlnPlan)

EndFor

updateCost =[Yao(Pages(pj), TRsize) + Pages(3)] * "Oweight +

Trsize * Card(13) * CPUweight

Return (updateCost)

End

As mentioned earlier in this chapter, a more detailed description of the cost formulas and procedures can be found in Hanson's papers on the optimization of Gator networks [Han96b, Han97b]. The reduced version outlined in the previous pages is however sufficient for the purposes of this work. In the subsequent chapter, we will introduce the work accomplished for selection predicate placement in query optimization and relate it to the selection predicate placement problem in discrimination networks.






41


CHAPTER 4
RELATED WORK: INADEQUACY OF THE RANK METRIC.

Several research projects and related articles have addressed the issue of optimal selection predicate placement among joins in a query tree. The LDL algorithm was first proposed in [Chi89] then later adapted for use with object-relational models in [Yaj9l]. In this approach, the selection predicate is passed to the query optimizer as a relation and treated as such during the whole optimization phase. Many of the subsequent algorithms use the notion of a rank that is assigned to each of the predicates (joins and selections) in the query tree. The notion of a rank metric was explored in operations research [Mon79], as well as database issues [Cha96, Kri86]. After assigning the ranks, the algorithms iterate over each "leaf to root" path of the query tree and reorganize the current stream in ascending order of the rank metric. The rank metric is defined to be: rank = Selectivity - 1 cost
Without a formal proof, it is intuitively easy to see that a low rank will be produced by both a small selectivity factor (which represents a very selective operator) and cost (i.e. a cheap operator). Non-selective and expensive predicates yield higher ranks, and are therefore delayed as much as possible, while cheap and selective predicates are applied earlier. The ordering of selections and joins based on the rank metric yields satisfying plans in query processing [Cha96, He192, He193, He195]. Yet, it appears to fail to produce consistent results when applied to the placement of selection predicates among joins in a discrimination network structure. Because it does not incorporate important parameters such as insert and delete frequencies, it seems that the rank metric is not






42


"rich" enough to capture with accuracy the cost model developed for Gator networks. As a matter of fact, it is fairly easy to infer from the cost formulas presented earlier, that different distributions of token frequencies can sensibly change the cost of a discrimination network, and thus the decision of moving a selection predicate or not. Consequently, the results given by a rank based algorithm do not always represent the correct ordering of predicates in the discrimination network tree. Before considering a simple example, it is necessary to introduce at this point a minor extension to the proposed Gator cost model, as well as the basic modeling approach assumptions that will be used throughout this chapter. Later in the chapter, the reader will find a short mathematical analysis of the same example, showing how various distributions of token frequencies can impact the decision surface for moving selection predicates.



4.1 Extensions to The Cost Model and Assumptions:


The main extension to the Gator networks cost model will be to define a selectivity or "filtering capacity" for edges joining a child to its parent node. We will refer to this property as the "Edge Selectivity Factor" and denote it ESF.

4. 1.1 The Edge Selectivity Factor:

In most relational systems and since the introduction of System R [Ast76], joins have been heavily modeled with a Join Selectivity Factor (JSF) value defined as the ratio of the cardinality of the result of the join, over the products of the cardinalities of the base relations (or nodes). As a result, the value of the JSF is a rational number lying in the interval [0,1]. Practical values for the JSF fall within the interval (0,I) since the






43


values 0 and 1 represent extreme cases where the Join result is either empty or equal to the Cartesian product of the base relations (nodes). Consider the following Join: JSF 1/1000

Cardinality: 1000 R -SCardinality: 100



JCardinality: 100

The figure above gives an illustration of the JSF values as they were just described. The cardinalities of R, S and J are respectively 1000, 100 and 100. The value of the JSF is taken as the ratio of card(J) divided by card(R) * card(S) which is in this case 100 / (100 x 1000) = 1/1000. While this value models the join from a global point of view, it does not correctly depict the participation of "each side" (R and S here) in the joining operation in terms of numbers (or percentages) of tokens flowing across the join edges. As a matter of fact, the cardinality of node J is not one thousandth of either that of R or S. Instead, each token or tuple from S has a match in J (on average), while oniy about one tenth of the tuples from R have a match (on average) in J. Since the selection predicate is always located on one of the relations participating in the join operator, it is important to correctly model the selectivity of the Join with respect to each of the relations participating in it. The Gator network cost model uses the notion of a "directional" selectivity factor for joins, based on the distribution of the join attribute in each relation.

Consider a join condition (from the figure above) of the form R.x = S.x where 'Y is the joining attribute. Under the uniform distribution assumption for attribute x as it is defined by Selinger et al [Se179], the selectivity of R-4S is taken as the ratio: Sel(R->S) = Card(S) /Val[x /S] , where Val[x/ 5] represents the number of distinct values of x in relation






44


S. Symmetrically, the selectivity S-4R is given by Sel(S-*R) = Card(R) / Val[x / R]. Surely, this approach gives correct results under the specified assumptions of uniformity. However, it does suffer from a number of limitations, which are listed below:

- The hypothesis that a token joins on average with Card(Relaiion) / VAlx / Relation] tokens

is valid for equijoins. But when considering non-equijoins such as R.x > S.x or R.x

overlaps S.x (if x is a polygon attribute), this approach becomes inadequate.

- When considering join conditions such as R.x overlaps S.x -rather common in the

extensible environment for which TriggerMan was designed- the notion of distinct

values for an attribute (such as a polygon) may be difficult to define [Haa.95].

- Since the number of distinct values of an attribute cannot exceed the cardinality of

the relation (i.e. the attribute is a key, and all values are distinct), the ratio Card(Relation) / Val[x / Relation] never drops below 1. On the other hand, the selectivity of a selection predicate is always in the interval (0,1). Recalling the definition of the rank= Seetvt , ranks of joins will therefore be positive
cost
numbers while those of selection predicates will remain strictly negative, no matter what the costs of the predicates are. Consequently, the optimization strategy becomes useless because the ordering of predicates always puts selections before joins, since

the ranks of joins are always strictly larger than those of the selection predicates.

In spite of the fact that selection predicate placement in discrimination networks will not be directly reusing ranks, the general idea behind the heuristic is also based on the possible "good" selectivity of joins, allowing a reduction in the number of times an expensive sigma has to be done. If the cost reduction incurred by doing the (expensive) selection predicate "less often" surpasses the increase incurred by doing the joins more






45


often, the strategy works and the selection predicate is replaced in the Gator network. However, If the selectivity factor of a join is always greater or equal to 1, the "traffic" (percentage of tokens flowing along an edge) is never reduced after that Join operation is executed, and the optimization fails in all cases. Therefore, the Gator network model needs to be extended to represent joins that can have selectivity factors in the [0,1] interval, which are the joins of "Interest" as far as the selection predicate placement is concerned. Going back to the previous example with relations R and S joining into J, the join operation resulting in the J node can be viewed in two symmetrical ways:

- Start off with node R empty and node S full, and gradually insert tokens in R until it

gets full. At that point, the cardinality of the node J will have reached 100 and that of R will be 1000. On average, each token inserted in R will have contributed with 0.1 tokens on the edge R-*J, and one out of every ten tokens inserted in R will have

joined if each joining token finds a single match.

- Start off with node R full and node S empty, then gradually fill up S until it reaches

its cardinality of 100 tokens. Since the join operation is independent of the joining order (R join S = S join R), the resulting joining node is exactly identical to the one obtained in the first approach. In this case, each token inserted in S will have

contributed, on average, with one token on the edge S--.>J. (see figure below).


JSF: 1/1000
Cardinality: 1000 R . - - --- Cardinality: 100

0.11
Edge Selectivity: 0. 1 Edge Selectivity: I

JCardinality: 100






46


The selectivity of joins will therefore be defined with respect to each input stream to the join. The value of that selectivity represents the "filtering" capacity of the join with respect to the corresponding base node; i.e. what proportion of tokens is generated along that edge for each token coming into the base node. The value 0. 1 means as described above, that for each token inserted in R (whether it does join or not), in average 0.1 tokens will be generated down the edge R-4J. In other words, the edge R-4 divides the traffic coming into R by a factor of ten.

Remarks:

- The Edges Selectivity Factors (referred to as ESFs) will be represented as shown

above, directly on the edge.

- ESF(R--*J) is computed as JSF(R*->S) * Card(S) and symmetrically ESF(S-4) is

computed as JSF(R<-4S) * Card(R). Since Card(J) is defined as JSF(R<-4S) * Card(R) * Card(S), it follows that ESF(R-4J) = Card(J) / Card(R) and ESF(S-*J) = Card(J)/

Card(S)

- The ESF is not confined to the interval [0, 1] (or (0, 1) ) as JSF values are.




4.1.2 Predicates Selectivities and Costs:

The standard assumptions of query optimnizers about the accuracy of database statistics (estimates of parameters values) and the independence of the selectivity factors of predicates will be made in the remainder of this work.






47


The results on the existence of a constant differential (per-tuple) cost for all the well-known Join algorithms [He192, He1951 will be reused in this work when describing a rank based operation:

The partial differential costs of joins with respect to their base nodes are constants for all the well-known join algorithms, and hence the cost of a join per tuple of each input is typically well defined and independent of the cardinality of either input.




4.2 Allee-Gator Example:


Given the definition of the edge selectivity factor, let us consider a practical example. Doctor Guy Thorfan from the wildlife department of the University of Florida is currently working on a research project involving fauna observation in multiple locations around the state of Florida. One of his concerns is the presence of GATORS in the lake Allee area in Alachua County. Data is gathered as follows: graduate students using laptops and state of the art GPS technology, continually observe the areas of interest and type in their "sightings" that are immediately sent back to the central database. In order to spot the presence of GATORS around Lake Allee, Dr Thorfan writes the following trigger against the database of the department: Create Trigger Allee-Gator
From Fauna_-Sightings, InterestArea
Where InterestArea.Box contains Fauna-Sigthings.Location And Fauna-ightings. Animal = "Gator"
And InterestArea. User = "Thorfan"


The Fauna-Sightings table is defined as FaunaSightings(ID, Animal varchar(255), Location point), and a listing of its data is given in Appendix A. The Lake






48


Allee area of interest is supposed to be geographically represented with the box delimited with the corner points (0,0) and (1,1) in a predetermined system of coordinates and we will suppose, for simplicity reasons but without loss of generality, that the selection "User =Thorfan" returns only that box. The InterestArea table is also shown in Appendix A. All nodes in the network are considered to be stored without any indexes defined on any of the attributes. Also, the database will be supposed to be insert-only so that delete frequencies can be ignored in the context of this example. The discrimination network representing this rule is described below: Cost: 10


Selectivity: 0.6 (7 Animal = "Gator" G2User='Thorfan"

ca,.Box contains c!1.Location
Cardinality: 12 (Xi- C 20 CG: 5 0-()2 Cardinality: 1



I3Cardinality: 1

Node ax1 is based on the relation FaunaSightings and has a selection predicate CF, "Animal = Gator" defined on it. The selectivity of al is represented by the number of tuples where "Animal = Gator" (displayed with a gray shaded background in Appendix A) over the cardinality of FaunaSightings: 12/20 = 0.6. Also, the selection aY2 is supposed to return one unique row from table InterestArea with a Box value of (0,0,1,1). The selectivity Of (72 is not critical in the context of this example; therefore we will give it an arbitrary value of 0.1. The selection predicate (71 and the join operations between cc, and (X2 have been assigned differential cost values indicating the ratio of the complexity of one operation with respect to the other in some predefined unit.






49


The rank of (1 is defined as: rank(oTi) =Slm Cost(rnl)

Therefore rank (al) =0.6 -1I= -0.04 10

The result of the join of aii and 0C2 is a one-row table:

Animal Location User Box
Gator _+(1,1) Thorfan (0,0,1,1)


And the selectivity of the join with respect to al~ is then 1/12 =_0.083. Therefore, rank(Join/(xi) - 0.083 - 1 - -0.0458 20
Representing both ranks on a linear axis, we can see that the rank of the join is strictly smaller than that of the selection predicate.


Join/c1 a,

-0.046 -0.04 0 ranks axis

And according to the rank ordering of predicates, the selection predicate (71 should be pushed down to produce: True(2


OI1-- ..(






50


Let us now reconsider the same network with insert frequencies assigned to each input edge and check the results given by the discrimination network cost formulas: Cost: 10 802%


0.6 G,
48% 2%
12 : 2125
12




If C is the initial cost of the network, let AC' and AC- represent respectively the increase and decrease components that occur in C when the (71 operator is pushed down. The AC' part of the variation of C will have three components:

- The increase in the frequency of execution of the join operation between ax1 and (X2

for each token inserted in cl.

- The increase in the frequency of insert operations applied to node al. That value will

be represented as Ar.

- The increase in the differential Cost C2 of joining tokens from aX2 to aX1 (since a1i is

larger) noted AC2. (AC2 > 0).

AC- will be the decrease in the frequency of execution of al (if any). When aT1 is pushed down, some of the parameters of the network do change, as follows:

- The cardinality of a, grows from 12 to 20 tuples (or tokens).

- The resulting join of a1 with U2 becomes the following table






51


Animal Location User Box
Gator t(,)Thorfan (0,0,1,1)
Crow (1)Thorfan (0,0,1,1)
Frog (,)Thorfan (0,0,1,1)

And the Join selectivity factor becomes 3/(20 * 1) = 0. 15. The selectivity of edge @(Xi-->P) becomes 0. 15 1 = 0. 15 and that of edge(cL2-*fP) becomes 0. 15 * 20 =3.


80% 20%


G2%

Cardinality: 20 (XI - C,'> lC2*U2 Cardinality: 1

0.X15 3

Cost: 10






The values for the increase and decrease in the global cost of the network are then: AC' : (0. 8 - 0.48) * 20 + Al' + 0.02 *AC2 =0. 32 * 20 + AP' + 0.02 * AC2 = 6.4 +Ar +0.02 *AC2

AC-: [0.8 -(0.8 *0.15 +0.02 *3)] 10 (0.8 -0.18) * 10=0.62 *10 =6.2

Since AC' is greater than AC- (Ar' and AC2 are strictly positive terms), the operation of pushing the selection predicate down increases the cost of the network and should not be done. The 80/20 distribution of tokens is therefore not favorable to a pushdown of the selection predicate (TI.









Consider now the same network with a different distribution of insert frequencies:


Cost: 10t 99 1%



59.4%1 0.1%
Card: 12 (XI : 5Ot2 Crd


01 1


99% 1%



0.1%
Card: 20 cx 1:0CI>~X Card: 1

0.15 3


P


In this case we have voluntarily biased most of the insert traffic towards the edge containing the selection predicate al. The pushing of a1i transforms the network as described in the figure above. The values of AC' and AC_ are given below: AC+ : (0.99 - 0.594) * 20 + Ar + 0.00 1 *AC2 =0. 396 * 20 + Ar + 0.00 1 * AC'=7.92 +AF+ +0.001 *AC2

AC_: [0.99 - (0.99 * 0.15 + 0.001 * 3)] 10 = (0.99 - 0.1515) * 10 =0.8385 * 10 = 8.38 Clearly, AC+ could be smaller than AC_ in this case. If Ar and AC2 are small enough and/or can be neglected, then the selection predicate pushdown operation improves the cost of the network and should be executed.

This example pointed out at how the distribution of insert frequencies alone can affect the decision surface regarding the proper placement of selection predicates in discrimination networks. In the following section, the reader will find a generalization of this example and a short analysis of the behavior of cost formula-based decision surfaces with respect to distributions of tokens frequencies.










4.3 -Generalization of the-Cost Considerations:


F,


Cost: Co' F, F2


So, G~ IU2 502
F, *So, F2* S2
OCIx JSF12 CjU2 Q, 2


Pnode


F2 * 52

(XJ J F12 cil2

Sj1 Sj2'

F, Si, + F2 *So2 * S2'j2') r G Cost: Ca, Pnode


Figure 4-1 Analysis of selection predicate pushdown


Let us consider in this analysis (without loss of generality) an insert-only discrimination network, with two alpha nodes joining into a Pnode as described on the left side of Figure 4-1 above. It will be considered in the context of this study, that there is no index defined on any attribute, as adding indexes will increase the complexity of the cost formulas without having any impact on the results of the analysis. (The reader will find an elaboration on this aspect in Chapter 5, section 5.2.3).

F, and F2 are the frequencies of tokens arriving respectively into the left and right input edges of the network. The selectivities of selection predicates a,1 and G2 are denoted So, and 502. The edge selectivity factors of edges (c,1-43>) and cL-1)are represented by Sj, and Si2. (Recall that Sj, and 5j2 are defined as JSF12 * Card(cX,2) and JSF12 * Card(cx,) respectively). Finally, ay, has been assigned a cost of Ca, and the differential costs of the


join between c, and c2 are denoted C, and i2


F1






54


By definition of the rank metric, the rank of (71 is: rank(m) = -" I Cal

And the rank of the Join with respect to cul is defined as: rank(J/ai) = ____CII

The rank based heuristic orders the predicates in ascending order of the rank metric, and will therefore decide to perform a PUSHDOWN when the rank of the join is smaller than the rank of the selection predicate (71: rank(J/ct1) < rank(a1) Replacing the expressions of rank(J/c1) and rank(a1l) with their respective values as they were defined above, we can rewrite:

rank(J/ a i) < rank( a i) <-*QJI Cal

Then by multiplying each side by Ci, and dividing by (So, - 1), we get Cj < j.-I(reversing the comparison since S.. - 1 < 0) C01 Sail I
Which can be rewritten as:



The rank heuristic will therefore perform a PUSHDOWN when the condition (1) above is true and not perform the PUSHDOWN when CJl !I- * Cal

Let us now consider the relocation of the selection predicate a1 above the Pnode from the aspect of the Gator network cost formula. After pushing down the selection predicate (TI, we get the discrimination network described on the right side of Figure 4-1. Since the values of JSF12 and Sa, remain unchanged under the assumption of independence of the selectivity of predicates, the edges selectivity factors values become Si' and Sj,' (Sj, = JSF12 * Card(ct2), which is unchanged, and Si2' = JSF12 * Card(ct1), which is changed). The node cc, now receives an input stream of frequency F, instead of Fi* Sol. The frequency of tokens now going through (71 is given by (F1 * Sp + F, * S02 *






55


Sj2') where F, * SJ, is the contribution from the left side of the Join, and F, * S'* SJ2' the contribution from the right side.

Reusing the same notations to describe the variations in the cost of the Gator network, let AC' and AC- represent respectively the increase and decrease components in the cost C of the network when a, is pushed down as shown in Figure 4-1. AC' has three components:

- The increase in the number of times the join from c I to (X2 is executed.

- The increase in the number of insertions into node al.

- The increase in the cost of joining tokens from (X2 to atl: AC32 (positive quantity). The value of AC' is therefore:

AC' : (F1 - Fi * So,) *Q,~ + (FI - F, * SGi) * (CP~eight + 2 * "/Oweight) + F2 * So, - AC12 Since (CPUweight + 2 * llwetght) is the insertion cost into an alpha node without indexes. (The reader may refer to chapter 3 and [Han92] for more details on the cost formulas). Replacing the quantity (CPUweight + 2 * llOweight) by "Ic" to simplify our notations, we can rewrite AC' as follows:

AC': F1 * (1 - SOO) * [Ci, + (CPUweight + 2 * llOweight)] + F2 * So, -ACJ,

F1* (1 So,) * [Ci, + Ic] + F2 * S02 -AQ2

Since the selectivity factor So, is a value strictly smaller than 1, it follows that F, * So, 1s also strictly smaller than F1, and therefore AC' is always a positive number. (AC' > 0). The decrease in the cost of the network C lies in the (possible) reduction of the frequency of execution of the operator a1I:

AC- = [F1 - (F1 * Sj, + F, * SoI * S52')] * CG3






56


In order to analyze the influence of the distribution of token frequencies over the pushdown heuristic, let us consider that F, = m * F2 with mE- Q+* (Q+* being the set of positive, non-null rational numbers). Replacing F2 by F1/m we can rewrite ACSAC_ [Fi - (Fi * Sil + Fi* . * Sj2')] * C.1
m
Then by factoring out F, and using m as a common denominator

<*AC_ = Fi *(L-Sj - Sj2, . .
m

SAC_ =Fi M - Sj_ * S2___- Sj) C.,
m
The PUSHDOWN operation will be beneficial when the decrease in the network cost C is strictly greater than the increase: 0 < AC+ < AC-. After replacing AC+ and AC_ by their respective values, we can rewrite:


Fi (I- O')* Ci +IC]+ F *S.2 AJ m m
And after simplifying by F, (non-null) on both sides of the inequality: S.2* AC12 <=>~____ (I____ - . C, C SO) * C0
m m
Rearranging terms and dividing each side of the inequality by (1-Sat,), we obtain

M - Sj2'* S.2_ j

<=>Q, -m S al *6- So. * ACJ2 Ic (2)


Therefore, the selection predicate (71 has to be pushed down when the inequality (2) above is true and will not be pushed down when the same inequality is reversed. To simplify the notations, we will rewrite the quantity -SG2* AC12 as Q (Q > 0).
m (I- 50) m






57


Table 1 Conditions for pushing down a selection predicate


PUSHDOWN NO PUSHDOWN


Rank Cj j a C l-SJl*Ca
Based llI l IlIS.5l




Cost formula M~ - 512'* S.2 _ mj M- jSj2 s-
Based mjl __Cl__Q_______>__Ca i
i-Salm i-S.oi


Table 1 above recapitulates the results of this analysis so far. Setting the insertion cost Ic aside temporarily, one can notice that the limit of the cost formula based criterion (when m is very large: mn- o) becomes the rank criterion. While a more precise analysis is required to reach meaningful conclusions, this remark gives an initial picture on the following issues:

- What is the influence of the distribution of token frequencies (recall that different values of m represent different distributions of token frequencies) over the decision

for pushing down a selection predicate using the cost formula based approach?


- How do the rank and cost formula based approaches relate to each other, and how can different token distributions affect this relationship?


In order to gain a better understanding of the two points raised above, it is necessary to proceed at this point of the analysis with a study of the function f(m) that is defined below:
M - Sj2* S0 j

Let f(m) = m Q____ With Sil, 5)2, Q, CGI E- R' and Sat, Sa, E (0, 1)
I-S~l m *C..






58


f(m) is simply the cost formula based criterion where both sides have been divided by the constant Cal and the insertion cost Ic set aside (temporarily). The remaining part of this chapter will cover a short mathematical analysis of the function f(m), followed by a set of comments on the corresponding results.


4.3.1 Influence of Token Distributions on Decision Surfaces for Pushing a Sigma:

As mentioned above, a brief study of the function f(m) will show the effects of m, (i.e. the distribution of token frequencies) over the decision surface for pushing selection predicates in the discrimination network, as well as its connection with the rank based ordering approach.

Domain of definition of f(m):

The domain of definition of f(m) Df is R+*, i.e. mn E (0, +o Limits of f(m):

The lim-its of f(m) are as follows:




1 -Sa

Continuity of f(m):

f(m) is continuous over its domain of definition Df. Variations of f(m):

The variations of f(m) are given by the sign of its first derivative over the domain of definition Df. Since the derivative of Sj,/(l - Sal) with respect to m is null, the first derivative of f(m) with respect to mn is given by the following term:






59


a - - .2* " - Q
af(m)- m*(l-So) m*Co'


The quantity above being the difference of two ratios, it can be developed into:

a[M- Si2'*S02] * [M * (I - S11)1 a[(l -Sol) * m] * (M - Sj2'* S12) a[Q
af(m) a m am m*Ca1
am [m* (1 -S.,)]2a

The values of the derivatives of each term with respect to m allow us to rewrite it as:

af(M) M *(I- S.1) -(- S.1) *(M -Sj2'* SG2) Q
am[m * (I-S",)]2 +C.,* M2

Factoring out (I - Sa,) in the numerator of the first term

Df(m) (I - S"') * [m - (m - 5J2'* S.2)] Q
am [m *(I -S.,)]2 +C *M

And simplifying, we finally get

af(m) (I -S1) * S2'* S02 + Q
am [m * (I- S.o) C.,*M2


af(m) is defined over the domain of definition Df. It is the sum of two positive quantities. Indeed, the numerator of the first quantity is the product of three strictly positive terms, and the denominator is a square. Similarly, Q is also a positive quantity and so is the denominator Ca, * in2. Consequently, the derivative is strictly positive over Df, and it follows that f(m) is a monotonically strictly growing function of m. The table below recapitulates the results of the analysis of f(m) obtained so far.






60


Table 2 Analysis offin)


m 0 +>

af(m)
am+
I -Sp'
f(m) I -S.1




For values of mn ranging from 0 to infinity, the first derivative remains strictly positive and the values of f(m) will range (grow) from negative infinity to the ratio I - S (which is the rank criterion). As described in Table 1, the function f(m) represents a decision surface (or the variable constituent of a decision surface, if Ic is taken into picture) for the selection predicate push down criterion. A good way to visualize the effect of parameter m over that decision surface, is to geometrically represent f(m), and observe how various values of mn can affect it. The geometric visualization of f(m) will be the focus of the next section of this chapter.


4.3.2 Geometric Interpretation of the Analysis of f(m):


Let us consider f(m) as a family of two-dimensional surfaces (denotedX)

defined in a 3D space. Each member of YXm, is a 2D surface and will be denoted Srn(Sj,, Sa,) and defined as:

m - Sj2'* S.2 - 50p
S.i(Sj, Sa,)2 = m S___2_*_____2
ISIII m * (I- S..) * C"'



2 In order to simplify the notations, Srn(Sj,, Sa,) and Smn will refer to the same element.






61


ImWill therefore consist of a continuous set of two-dimensional surfaces ranging from Sm'0 (denoted So by convention even though So is not defined), to Sm_ (denoted S_ by convention). Disregarding the actual shape of the surfaces for the purpose of this analysis, an Sa (arbitrary value of m =a) can be conceptually represented as in Figure 4-2 below:


Z =Sm(Sji, So,)




~S S(Sj,' Sol)

Sao



5jl


Figure 4-2 Representation of a surface Sm

For each value of Sj, and Sa, (representing the X and Y coordinates axis, and ranging in their respective intervals of definition), the Z axis value is taken as Sa(Sji, So,) and represents the point on the surface Sa.

From the analysis of f(m) in section 3.3. 1, the Z axis position of any Sj is strictly above that of any Sk when j > k, for given values of Sj, and So, (The reason being the fact that f(m) is monotonically growing with respect to m, see Table 2). The highest of all surfaces is therefore S , defined by the limit Of rn, when m becomes very large, and the lowest So, or rather the limit of Sm when m becomes very small, since the function is not defined when m is equal to 0. A visual representation is given in Figure 4-3 below:






62


Z = Sm(Sji, So1) Av e 3 S- Srank


Cj1/Ccm- ---- ---- -S


SSo
p.,








Figure 4-3 Relative positions of Smn for different values of m


In Figure 4-3 above, S_ is drawn in black and So is transparent (dotted line). The gray surfaces in between represent two values j and k (j > k) of m, between 0 and infinity. Recalling the results of Table 1 (still leaving aside the component Ic), the condition for pushing down a selection predicate using the cost formula based approach can be expressed as:
m - SJ2*S.,- j
m - mQ -which is C < Sm(Sj, Sai)


In other words, the ratio defined by Cj,/Co, (a strictly positive number) has to be located under the surface S.. for the pushdown operation to improve (reduce) the cost of the Gator network.

The condition for pushing down the selection predicate using the rank based approach can be expressed as:






63


Q I-Sp Ci
- < ~which is -< S-4Sjl, S61i) = Srank


In other words: the ratio defined by Cj/Co; has to be located under the surface Swhich happens to be the decision surface for the rank Srank.

Consider now Figure 4-3 again: the value of the ratio Cj,/Ccr1 is such that it is located under S- but above Si and Sk; which represents two cases (two values of mn) where the rank based approach would erroneously perform a pushdown operation. As a matter of fact, both those values of mn (mn = j and mn = k) represent cases where the pushdown operation would induce an increase in the cost of the network. However, there exists values of mn, (let one of those values be i, i > j > k) such that the cost formula based surface gets close enough to S_ (which is also Srank) for the pushdown operation to reduce the cost of the network. In that case, (illustrated in Figure 4-4 below), the rank based approach gives a correct reordering of the predicates in the Gator network.


Z =Sm(Sjl, Sol) S-= S,.ki

SJ




SGI








Figure 4-4 Addition of surface Si






64


4.3.3 Qualitative Interpretation of the Analysis of f(mz):

The analysis of the function f(m) showed that a cost formula based decision becomes equivalent to a rank based decision when the distribution of tokens is very largely biased towards the edge from which the selection predicate is being pushed. This mathematical result can be validated with the following qualitative reasoning:

The rank-based approach relies exclusively on the rank of the selection predicate and the rank of the join with respect to the current base node to make a pushdown decision. In our example from Figure 4-1, the participation of the right side of the network (tokens coming through cX2) is ignored in the process of considering the pushdown of cy, with the rank-based approach. On the other hand, the cost formula based approach does take into consideration the participation of both sides of the network into the pushdown decision. Since very large values of the parameter m represent configurations where the totality of the tokens pass through F1, they also represent configurations where the right side of the network does not participate in the decision for pushing down the selection predicate (71. On the other hand, smaller values of m describe configurations where the right side of the network has a more substantial impact on the decision for perform-ing the pushdown. Consequently, we can validate our mathematical results showing a strong similarity between rank-based and cost formula-based decisions for large values of m, since they correspond to configurations where the participation of F2 is negligible or null, which is the underlying assumption of using a rank-based strategy.






65


4.4 Conclusion:


The rank based decision surface represents in this case an upper-limit to the cost formula based decision surface, that is reached when parameter m is very large. (Large values of mn represent cases where the token distribution is extremely biased towards the input edge where the selection predicate is located). In such configurations, the rank and cost formula based approaches are likely to reach similar decisions. However when m drops towards values balancing the token frequencies among the input edges, or even lower values reversing the bias in token distributions, the cost formula based decision surface moves away from Srank and erroneous results can be obtained with the rank based reordering of predicates.

Reconsidering the results of Table 1, the exact condition for pushing down a selection predicate, based on the cost formula approach was:


_m Cm-*2S-- I


which can be rewritten as follows:

mn - Sj2'* S.,- m
C < m *Cal- Q - A where A is aconstant =I
Cal 1-Sal m * Cal Cal

The only effect of constant A on the analysis, performed in sections 4.3.1 and 4.3.2, will be to change the position of S_~ to be located at a Z coordinate distance of A under Srank rather than being equal to it. S,_ therefore becomes distinct from Srank, but none of the influence of m over the relative positioning of Sm and Srank is affected, and our conclusions are still valid.






66


Taking into consideration delete tokens introduces additional algebraic complexity, in the cost formulas and in their notations (mainly for AC'), but no conceptual change with respect to insert tokens considered in this analysis. The extension to cases where a beta node has more than two children is briefly covered in Appendix B and does not introduce any conceptual change either.

In conclusion, the influence of the tokens distribution remains an important parameter, causing the position of the decision surface linked to the cost formula to vary widely and making a rank-based decision possibly incorrect.















CHAPTER 5
STRATEGIES: DESIGN

The example and analysis presented in Chapter 4 showed why the rank metric was not adopted to perform the placement of selection predicates in discrimination networks. The position of the decision surface for pushing down (or pulling up) a selection predicate varies according to multiple parameters including:

- The distribution of token frequencies among input edges,

- The selectivity factor of the selection predicate,

- The selectivity factors of the network edges the sigma is moved along,

- The cost of the selection predicate,

- The differential costs of the joins along the path of the pushdown. And by ignoring the distribution of token frequencies among input edges and various node related costs, the rank metric is likely to fail to produce a correct reordering of the predicates in many cases. In the process of investigating the possible strategies that can be applied towards obtaining a correct solution this problem, the most straightforward idea is to try to adapt the rank metric to discrimination network by trying to incorporate the missing parameters. In the following section, the idea of adapting the rank metric to Gator networks is investigated in detail.


67






68


5. 1 Adapting the Rank Metric to Gator Networks:


The selection predicate pushdown criterion based on the cost formula from Table

1 was:
m- Si,' * S.2-S
l SM* I- .


If the insertion cost Ic is considered negligible (Ic =_0), and the assumption about the differential cost Q,2 being constant (see section 4.1.2) admitted (AQj2 = 0), the rank of the join (with respect to (xi) can be redefined to the following expression:



rank(J/ai) = mQ- m - (3)


As a matter of fact, rewriting rank(J/ci) < rank((Y1) (the rank based criterion for the selection predicate pushdown) gives now the same result as the cost formula based approach. When the insertion cost Ic is taken into perspective, the definition of the rank of the join has to be changed to:

S -[m -SJz'* S02 I':* (I-S.') S.-[m -S2'* S112+ c*rnm rank(J/ai) = m Q] Cal [ + Q *rakl)


The elimination of the second assumption about the differential Cost CQ2 introduces an extra term of the form S.2 *AC in the numerator of the expression above. Without M C.
complicating the expression of the rank of the join any further, it can be noticed from the expression above, that the rank of the join is not an independent quantity anymore, but is related to the rank of the selection predicate being pushed down. Indeed (and unless the insertion cost Ic is neglected), parameters of the rank of the selection predicate now appear in the expression of the rank of the join. This new characteristic of the rank of the






69


join is certainly not a desirable one, since the positioning of a selection predicate with respect to a join predicate may cause the rank value of the latter to change. One of the consequences of this is an impact on the convergence of a rank based algorithm for Gator networks.

Clearly, a complete analytical model for a Gator networks adapted rank metric, would quickly reach the limits of tractability, especially when considering the following required cases to be incorporated into the already cumbersome expression shown above:

- Incorporate the cases where beta nodes can have more than two children.

- Incorporate the cost of delete tokens into the global cost analysis of the network.

- Take into account the possibility of more than "one-level" pushdowns.

Working with a (relatively) simplified representation of the rank such as the one presented in equation (3) above is not an alternative in this case for the following reason: The resulting loss of accuracy would make this model inadequate for integration with the existing Gator networks optimization algorithm(s). Indeed, the overall accuracy (and reliability) of the optimizer would become limited by that of its "worst" component, the selection predicate placement module in this matter. Therefore, adding the selection predicate placement heuristic would reduce the accuracy of the existing optimizer, which is not a desirable effect.

While the ordering of selections and joins based on the rank metric does yield satisfying results in query processing, it is our opinion that trying to adapt it to the problem of ordering selection predicates among joins in a Gator network will suffer from the following major limitations:






70


-First, the rank metric does not naturally extend the existing cost model and

randomized algorithm(s) described earlier in Chapter 3, based on a set of local change operators. Thus, the introduction of the notion of a rank ordering of predicates in the Gator network optimizer would prove relatively complex to combine with the existing cost formulas in order to achieve a homogeneous and

integrated global discrimination network optimizer.

-While the idea of developing a rank metric for Gator networks is conceptually

feasible, the practical formulation of such a metric is very likely to become extremely cumbersome and unpractical of use within the optimizer. The impracticality of a Gator network rank will be furthermore increased by the contextual nature it exhibits,

as shown earlier in this section.

Since the introduction of a rank based heuristic would leave us with either a crippled (using a simplified version of the rank) or an overly complex and ill-integrated optim-izer, it has been decided to avoid investigating further any approach based on an adaptation of the rank metric to Gator networks. Instead, our efforts will be directed towards the design of alternative strategies, better adapted to this problem, and satisfying the following requirements:

- In order to maintain a maximum continuity with the current cost model and

optimizer, any proposed approach should reuse the existing work, including cost formulas, heuristics and local change operators. This will also have the advantage of keeping the complexity of the optimizer as low as possible from both a conceptual

and coding / maintenance aspects.






71


-A rank based ordering will still be applied whenever possible, such as during a final

pass on groups of selection predicates lying on the same edge of a network.

The remaining sections of this chapter will therefore cover a set of proposed strategies for incorporating selection predicate placement heuristics within the Gator network optimizer. Those strategies can be classified into two major groups:

- Low complexity strategies with little to no impact on the current optimizer search

space size and local change operators. This type of approach will use two distinct

phases and will be therefore referred to as "Two Phase Strategies."

- Higher complexity methods requiring extensions to the currently proposed model and

resulting in larger search space sizes. For such methods, we will propose a minimal, but necessary, set of extensions and semantic changes to the present local change operators "Kill Beta" and "Merge Sibling." This class of methods runs the optimization in one single phase and will therefore be referred to as "One Phase

Strategies."



5.2 Two Phase Strategzies:


As indicated by its name, this approach will operate in two distinct phases and can be labeled as "simple" for two primary reasons:

- It does not increase the size of the search space in which the current Gator network

optimization algorithm operates. In fact, this strategy comes as an additional pass invoked by the optimizer after the best Gator network has been found, and adds a

negligible amount of work to the existing algorithm.






72


-It does not require any modification to the existing local change operators that were

described earlier in chapter 3.

The incorporation of this method into the existing Gator networks optimizer will require a minimum set of perturbations to the logic of the existing implementation. An algorithmic, high level C++ description of the addition of this method to the existing Gator optimizer is as follows:

5.2.1 Integrating the Simple Strategy With the Gator Optimizer:

Gator* GetOptimalGator (QueryGraph* Graph)

Begin

GatorNet* Best-Gator;

Best-Gator = RunGatorOptimizer (Graph)

Best-Gator = Runfigmaflacement (Best-Gator)

Return (Best-Gator)

End


The function RunGatorOptimizer() represents an abstraction of the existing Gator optimizer, which takes a rule graph as a parameter and returns the best possible discrimination network with the selection predicates placed above the alpha nodes. RunfigmaPlacemento takes the best structure found so far and tries to improve it further by "sliding" the selection predicate nodes (referred to as sigmas) down the network. The proposed heuristic for pushing down a selection predicate is driven by the token frequencies and can be simply described in English as follows:

"Let C be the initial cost of the best Gator network before trying to push down the sigma node. When a selection predicate a~ is pushed down, all the insertion, deletion and






73


join operations above it are executed more often, and cause a cost increase AC'. In order to achieve a global reduction of C, the (Y operation has to be done less frequently, and the incurred decrease AC- has to be large enough to counter balance the increase AC'. A necessary (but not sufficient) condition for that configuration to happen, is to find a location in the network where the frequency of tokens Fi' is strictly smaller than the original Fi when the a was at the top of the network." Consider the case of Figure 5-1 below: F, F2 F3 F4 F1 F2 F3 F4


------ -- I - - z

cl! a a 0lk, U 'c0((






C2F12 C2

Pnode Pnode

Figure 5-1 Token frequency driven heuristic Theorem:

A necessary condition for the positioning of (71 above the P3 node to be beneficial is: F12 < F,



Proof:

Consider the network on the left side of Figure 5-1 and let:

- So, and Ca, represent respectively the selectivity and cost of the selection predicate al.






74


- Q an d Q2 represent the differential cost of the join between the two alpha nodes in the sub network defined by the dashed-line box.

-F, represent the frequency of tokens coming into aYi.

-C represent the total cost of the network and C, and C, represent respectively the costs of the sub networks in the dashed-line and full line. By definition, we have:

- C=CI +C2

- C = (Fi * Ca) +(Fi * Sa * Cj,)+(F2 *Cj2)


Consider now the network on the right side of Figure 5-1 and let F12 represent the frequency of tokens flowing through the selection predicate (Ti. Lemma:

The properties of relational algebra [Smi75, Tam9l, Ull82] are such that the cost of the network in the box drawn in the full line is unchanged and remains C2. The cost of the network in the box in the dashed line becomes: C1' = (Fi * Cj ) + (F2 * Cj2') + (F12 * Ca,) and the total cost is C' = C1' + C2 C, < C Cl' + C2 < C1 + C2 * Cl' < C1 In order to achieve C' < C, it is necessary that C1' < C1 Comparing the terms of C1 and C I' we have: F, * Ci, > F, * Saj * Ci, (Since 0 < Sa, < 1 by definition) Similarly, since (xi becomes larger after the selection pushdown, the differential Cost CQ2' is greater or equal to CQ2, and we have F2_ * Cj,' >= F, * CQ2. Therefore, if F12 > F1, it






75


follows that F12 * Ca,> F, * Ca, and Cl' > C, since all three components of C1' are larger

than those of C1. Hence, the necessary condition for Cl' to be smaller than C, is that F12

be (strictly) smaller than F1: F12 < F, (QED).

Below is a high-level algorithmic description of this heuristic.

Algorithm:

(It is assumed that the best network without the selection predicate placement

optimization is returned in Best-Gator, before calling this function).

RunSigmaPlacemnent(Gator* Best-Gator)

Begin

C = NetCost (Best-Gator)
For (each (Ti)
F =Fi
(x parent of (Ti
Remove aT, from network and recompute frequencies on path cti - Pnode.
For (each node nk of type Pk or Gk on path c(xi Pnode)
IFk< New-Gator = PlaceSigma (cyt, nk)
C' = NetCost (New-Gator)
If (C' < C) then
Delete (Best-Gator) Best-Gator = New-Gator Else
IDelete (New-Gator) Endif
Endif
EndFor
EndFor
For (each leaf --> Pnode path)
If (more than one aT on same edge)
I Reorder by rank
Endif
EndFor
End






76


5.2.2 Complexity Analysis:

A Gator network with N tuple variables has a maximum height of N- I when it is a left (or right) deep binary tree as shown in Figure 5-2 below: (7 (2 (74
1 _". II
01 *2 *.~U *U4 .AX5





Pnd
Figure~~~~ 5- aiu/egto ao ewr
For~~~~~~~~~ eahslcinpeiae h ucto ufmfaeetprom











wit it ithcae figonue io 5-2 Maximumaheightuof as G ato networktl monto

Fokseo eah seligoe peica thfoe ofuntieoron Run ) *gm N -a1)m*ntperformsa conantf aontyp of orin Woyfcnn the copectesi pathfrmted sga nrpoesallith way pot-otesnode serhn (for ape a sitabl ositing o roate it. The legths ofin tha pthei sae mosN)a the numberoun of wiselctio preae to oces i cor nside not t


ithcitsi the caseaof conjuncito of prediuctes, such s:an po)lThenotial. amnethof







Gator network optimizer calls one of three possible operators at each iteration, the






77


dimension of the search tree is therefore 0(3 N) for N iterations, and the global state space is exponential by nature. (Some states could be visited more than once during an optimization phase, but the upper bound still remains 0(3 N)). Consequently, the addition of RunSigmaPlacement does not increase the size of the state space from a combinatorial complexity aspect.


5.2.3 Influence of Indexes on the Heuristic:

It was mentioned at the beginning of section 4.3 that the use of indexes (and the corresponding cost formulas) was not taken into account as far as the problem of selection predicate placement was concerned. The reason behind this simplification is the fact that the pushdown heuristic is independent of the presence or absence of indexes in the Gator network. Indeed (and as described above), the only parameter considered by the heuristic to perform a pushdown is the frequency of tokens flowing across various edges of the network lying between the location of the selection predicate and the Pnode. Since the definition of an index over a (join) attribute has no effect on the "number" of tokens generated by that join condition, it does not have any influence on the resulting frequency of tokens and has therefore no impact on the logic of the heuristic. As a matter of fact, the amount of tokens generated by a join is dependent solely on the selectivity of the join condition, which is a property of the join itself. While the presence of indexes has no consequence on the heuristic itself, it is important to notice that indexes will modify the costs of joins. Consequently, the presence of indexes will have an impact on the global decision regarding whether the cost of the network is reduced by a pushdown or not.






78


5.3 One Phase Strategies:


Contrarily to the strategy described in the previous section, the existing Gator optimizer will not be directly reusable in a straightforward manner it was in the previous case. Instead of adding the selection predicate placement as an extra phase, the idea here is to integrate it within the process of building the discrimination network. In order to do so, one (or more) local change operators are added to the existing set of three: "Create Beta," "Kill Beta" and "Merge Sibling." These new operators (Push Sigma and Pull Sigma) will be called randomlAy during the local perturbations phase, and compete with the others. The set of strategies based on this approach is of higher complexity than the previous one for the following reasons:

- The incorporation of new operators into the existing algorithm is a "double edged

sword." Indeed, the immediate consequence is a substantial increase in the search space size. While a larger search space allows potentially better solutions to be found, it also has the major disadvantage of possibly causing the optimizer to follow "dead-ends" search paths while searching for local minimums. An important tradeoff

issue of optimization complexity versus result quality is to be considered here.

- The addition of new operators allowing the placement of a a node while the network

is being built, brings up issues of semantic nature about the meanings of "Kill Beta"

and "Merge Sibling" and their impact on the integrity of the Gator network.

Therefore, the behavior of some of the existing operators has to be revised before this

type of approach can be implemented.






79


5.3. 1 Complexity Analysis:

In this type of strategy, the number of local change operators is increased from three to five. The number of potential branches at iterative step of the optimizer is therefore also increased from three to five. As a result, the total potential number of states that can be visited is raised from 0(3 N) to 0(5 N) for N iterations (once again, it is possible that multiple states can be visited several times during an optimization phase, but the upper bound still remains). This type of approach has therefore a more serious impact on the complexity than the previous strategy: even though the order of magnitude of the search space is not increased, the potentially added size is exponential versus polynomial in the previous case.

5.3.2 Semantic Issues for Kill Beta:














P3~

Figure 5-3 Semantic issues for Kill Beta


Consider the case of Figure 5-3 above. The left side of the figure describes a configuration where a selection predicate (T has been pushed down in the network, just above node 1, as the result of a previous local change operator sequence. Suppose that the next iteration of the optimizer decides to perform a "Kill Beta" operation on node f31.






80


If the "Kill Beta" operator is kept unchanged (i.e. behaves the same way it does in the current optimizer and as described in chapter 3), once the operation is performed (node 3 killed), the node (T becomes sibling Of P32, as shown on the right side of Figure 5-3. Consequently, every token inserted into P32 has to join with (T, which is not supported by the semantics of Gator networks algorithms, since (T nodes are predicates only, and not memory nodes. As a result, the Gator network gets into the inconsistent state shown on the right side of the figure above, and a fix becomes necessary for this type of situation.


5.3.3 Semantic Issues for Merge Sibling:


Pnode


Figure 5-4 Semantic issues for Merge Sibling In the case of a Figure 5-4 above, a Merge Sibling operation is considered after the selection predicate has been pushed above 1. When node 132 is merged with 01 it becomes sibling of a as shown on the right side of Figure 5-4. This situation again causes inconsistency in the structure of the Gator network similar to the one presented in 5.3.2.


(X-"' -"cc CV*"-"CC -(X



----------P2

P1



Pnode






81


5.3.4 Suggested Fixes for the Semantic Issues:

There are multiple solutions that can be applied towards correcting the inconsistent state in which a Gator network might fall during the operations described in sections 5.3.1 and 5.3.2.

.Conservative approach: No Merge-No Kill:

The conservative approach is to simply "forbid" the killing or the merging of a beta node, when a sigma node is located above the node to kill or the merge node. While this approach is the simplest from an implementation point of view, it may be limited by the disadvantage of possibly reducing the number of operations the optimnizer can perform, which can in turn have some impact on the results that the heuristic could produce.

Progressive approaches:

A more progressive approach consists of allowing the operation to take place, but taking extra actions to fix the inconsistency of the network by relocating the sigma node to a "suitable" location in the network. This would become equivalent to combining a "Push Sigma" or "Pull Sigma" with the "Kill Beta" or "Merge Sibling," every time this type of inconsistent situation is encountered. An example solution would be to bring the selection predicate back to the top of the network, which will be referred as the "Pop To Top" approach. The decision on where to relocate the selection predicate is left as an implementation issue.






82


5.3.5 Sub-Strategies:

Depending on different placement options for the selection predicates, it is possible to generate a number of sub strategies all derived from the same idea consisting of mixing the three traditional local change operators with the selection predicate placement ones. Some examples are listed below: Sigma placement during network build up:










The placement of cy nodes at different levels of the discrimination network can be performed during the network build-up phase, before applying any local transform-ation. A more conservative approach would require that new networks be always built with selection predicates at the top and that only local perturbation operators perform the placements.


Sigma Push and Pull:

The local perturbation operators "Push Sigma" and "Pull Sigma" can themselves operate in a number of different combinations:

- Random Push and Pull: the placement of the selection predicates is driven by random

decisions only.

- Searched Push, Random Pull: The pushdown operation is guided by the heuristic

described for the Two Phase strategy above, while the pulling up is done randomly.






83


-Extensive search for pull-ups and searched push downs for networks with small

numbers of tuple variables: extensively search all the possible placements for the selection predicate on the path from the leaf to the Pnode when performing a pull-up

and apply the heuristic for a pushdown.

Investigating the possible variants analytically would be exciting but is a very difficult and probably not beneficial task, because of the complexity of the corresponding mathematical models. It is our belief that the investigation of the relative performances, advantages and disadvantages of each sub-strategy will be better handled at the implementation level, with numerical results.

In the next chapter, we will describe an implementation of the Gator network optimizer with selection predicate placement capabilities, and discuss some implementation issues and difficulties, and proceed with some simulation results for both simple and advanced strategies (Two Phase and One Phase).















CHAPTER 6
STRATEGIES: IMPLEMENTATION

In the previous chapter, we presented a detailed description of the proposed strategies to perform selection predicate placement in Gator networks from a design point of view. In this chapter, the reader will find an elaboration on some important issues and questions that arise on different aspects of the optimizer, when the proposed strategies are put into practice. The purpose of this chapter was not to write a complete full-fledged implementation of a Gator network optimizer, with selection predicate placement capabilities. Rather, the goal was to investigate and bring into light the possible roadblocks and coding difficulties that the implementers might encounter when writing most of the proposed strategies. The important points that will be addressed in this chapter are as follows:

- Investigate the coding difficulties related to a new (more efficient) implementation

of the local change operators.

- Look over any necessary revisions to be applied to cost formulas based on the new

operators allowing relocation of selection predicates.

- Verify whether the proposed strategies work and improve the Gator network.

- Compare the relative running times and result quality of both the simple and complex

strategies, and find out whether the latter yields any tangible improvement with

respect to the simple one, and in what cases.


84






85


6.1 Implementation of Local Change Operators:


The local change operators "Create Beta, ". .Kill Beta" and "Merge Sibling" were implemented in a modified version of Ariel [Han97b] by regenerating a completely new network at each iteration. The function ApplyPerturbation described in section 3.1.2:

New-Gator = ApplyPerturbation (CurrentGator, Perturbation)

takes as parameters the current Gator network and one of the local change operators, and applies the change to return a new network that is a duplicate of the current one with the local change applied to it. If the cost of the network is improved by the current operation, then the old network is discarded, and the new one kept as the current network, otherwise, the new network is discarded and the old one remains as the current network.

Having seen room for improvement in both the memory use and speed of the optimizer, we have decided to implement the local change operators differently this time around. Each of the operators comes in a tripartite structure (three function or methods in the object oriented terminology) which allows trying it and then either validating or undoing the modifications that were made to the network, based on the old and new costs. Consequently, each of the local change operators is implemented with the following methods:

- Gator* Try~perator (Gator*, char* LocalChange)

- Gator* DoOperator (Gator*, char* LocalChange)

- Gator* UndoOperator (Gator*, char* LocalChange) with Operator c J Create Beta, Kill Beta, Merge Sibling, Push Sigma, Pull Sigma). As mentioned above, the advantages of this approach are straightforward:






86


-Better memory usage (only one network in memory all the time).

-Faster optimization process by not having to regenerate and delete a whole network

at each step.

The implementation of those methods required maintaining some additional state information within the structure of the nodes of the network (old neighbors, old parents, old children etc) in order to allow the rollback when necessary. However, the duplicated information is always localized to the "region" of the network being affected by the current perturbation only, and therefore kept to a minimum.


6.2 Revision of the Cost Formulas Implementation:


Since selection predicates were systematically placed at the top of the Gator network in the traditional optimizer, they did not have a "proper" existence but were rather considered as a preprocessing filtering step to the network. Consequently, the cost of the selection was not computed as a part of the global cost of the network. In this implementation, the ability to relocate a selection predicate potentially anywhere in the network, has necessitated implementing it as an entity (node) with an associated method to compute its cost contribution. Furthermore, the possible presence of a sigma just above a beta node has been at the origin of a few minor, but necessary modifications to the method for computing the cost of beta nodes. A more detailed discussion of both these issues is presented in the following sections.


6.2.1 Cost of a Sigma Node:

Three distinct cases were identified for computing the cost of a selection predicate:






87


-The sigma is at the top of the network.

-The sigma is pushed down and has "regular" children (Alphas and Betas).

-The sigma is pushed down and has another sigma node as a child. A description of each of the three cases above follows. Sigma at the top of the network:

F1 F2 F3
I I I
Vl V2 V

(X' (X3





Figure 6-1 Sigma at the top of the network In this case depicted in Figure 6-1 above, the cost of the sigma node is simply the frequency of tokens flowing through the selection predicate, multiplied by the cost of computing the selection condition (noted SelectionCost).

SigmaCost (aY)= Fj SelectionCost (a1i) Sigzma with re-gular (Alpha and Beta) children:

F1 F2 F3










Figure 6-2 Sigma with regular children









In this case, the selection predicate cost will have two components:

- The cost of computing the selection condition multiplied by the frequency of tokens

flowing through the predicate. This frequency of tokens is computed in a similar fashion as it is for beta nodes, by adding up the anticipated participation of each of the children nodes in producing the amount of tokens passing through the sigma

node.

- The cost of executing the joins across the children is also added to the cost of the

selection predicate here, as it is the case for beta nodes.

SigmaCost (a1i)- Fi * SelectionCost (a1) + CostJoinChildren(ai) Note:

The reader will notice that the PerChildlnsertCostO and PerChildDeleteCosto components that are also added to a beta node with children is not applied to the selection predicate. The reason for this is that sigmas are not nodes in which tokens can be inserted into (or deleted from) but simply predicates acting as filters and allowing tokens through, or not. The PerChildlnsertCost() and PerChildDeleteCosto components cannot however be lost, and have to be saved until the first beta node above this sigma is found. At that point, the per-child insertion and deletion costs are applied to that beta node. In order to handle this "transfer" of cost to the next beta node in the network, it was necessary to attach to sigma nodes a simple structure with one entry for each child. Each entry of the list maintains the join size (with respect to that child), and insert and delete frequencies of that child. We will refer to these lists as "update costs lists" and a sample one is presented in Figure 6-3 below:






89


F1 F) F3


U3
U1I IX' (X3



al (X, X
join size * SeI((Ti)V [ join size * SeI(ca1) join size * SeI(al) iFreq * SeI(o71) iFreq *SeI(01 iFreq *SeI((71)
dFreq * SeI(a1l) dFreq *SeI(ul) dFreq *SeI(a71)

Figure 6-3 Sigma list for transfers of PerChildlnsertCost and PerChildDeleteCost

In this case, al has been pushed just above the beta node. The cost of al includes the cost of executing the selection predicate computation (Fal * SelectionCost(a71)) plus the costs of executing the joins originating from each of the children al, a2 and a3. The cost relative to the insertion or deletion of the results of those joins is not applied at the level of a71, but is transferred to the f3 below (71 through the list described above. The join size, insert frequency and delete frequency are multiplied by the selectivity of the selection predicate (filtering effect) and appended to the list described above, to be processed at the level of the beta node for computing the PerChildlnsertCost() and PerChildDeleteCosto components.


Sigma with a Sigma child:

When multiple selection predicates are pushed down during the optimization phase, it is possible to reach configurations where two (or more) sigmas end up on the same edge of a network. In such a case, a sigma node has another sigma as a child. (Note




Full Text

PAGE 1

SELECTION PREDICATE PLACEMENT IN DATABASE DISCRIMINATION NETWORKS By MOKHTAR KANDIL A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY UNIVERSITY OF FLORIDA 1998

PAGE 2

©Copyright 1998 By Mokhtar Kandil

PAGE 3

Dedicated To: My parents, Abdelaziz and Touria, And my sister, Mouna.

PAGE 4

ACKNOWLEDGMENTS I would like to express my deepest gratitude to my advisor, Dr. Eric Hanson, for all the guidance, advice and supervision he provided me with throughout the duration of this work. I thank Dr. Stanley Su, Dr. Sharma Chakravarthy, Dr. Randy Chow and Dr. John Alexander for kindly agreeing to serve on my supervisory committee, and providing me with a great deal of valuable input and advice. I cannot omit extending many thanks to Sharon Grant for both her help with various administrative tasks, and the organization of unforgettable database center social gatherings. On a more personal note, I would like to express the deepest gratitude to my parents, Abdelaziz Kandil and Touria Bouabid, for their continuous support and encouragement throughout my life and more particularly during this entire experience. I cannot forget to thank my sister, Mouna, whose love, support and encouragement have always been a source of great comfort and inspiration. A special expression of my gratitude goes to my sponsors, the Fulbright binational commission and the Amideast staff who provided essential financial and administrative support during the complete duration of my studies in the USA. Finally, I thank all my friends and colleagues who all participated into making this stay in Gainesville a memorable one. iv

PAGE 5

TABLE OF CONTENTS page ACKNOWLEDGMENTS iv LIST OF TABLES vii LIST OF FIGURES viii ABSTRACT x LINTRODUCTION: THE TRIGGERMAN PROJECT 1 1 . 1 Rules and Active Databases Overview: 1 L L 1 Integrity Constraint Checking and Repair: 1 1.1.2 Time and Temporal Issues: 2 1.1.3 Materialized View Maintenance: 3 1.1.4 Advantages and Shortcomings of Existing Rule Systems: 4 1.2 Extensibility and Extensible Databases Background: 6 1.3 TriggerMan: 9 1.3.1 The TriggerMan Environment: 11 1.3.2 The TriggerMan Physical Architecture: 13 2. PROBLEM STATEMENT 18 2.1 A Discrimination Network Discussion and Example: 18 2.2 Need for Selection Predicate Pushdown: 25 3. BACKGROUND ON GATOR NETWORK OPTIMIZATION 28 3.1 Brief Review of the Gator Network Optimization Algorithm: 29 3.1.1 Combine Operators: 32 3.1.2 Local Change Operators: 33 3.2 The Cost Model: 35 3.2. 1 Cost Functions for a Nodes: 36 3.2.2 Cost of Generating Temporary Join Results, (TR): 37 3.2.3 Cost of Beta Nodes: 38 4. RELATED WORK: INADEQUACY OF THE RANK METRIC 41 4.1 Extensions to The Cost Model and Assumptions: 42 4.1.1 The Edge Selectivity Factor: 42 4.1.2 Predicates Selectivities and Costs: 46 V

PAGE 6

4.2 Allee-Gator Example: 47 4.3 Generalization of the Cost Considerations: 53 4.3.1 Influence of Token Distributions on Decision Surfaces for Pushing a Sigma: 58 4.3.2 Geometric Interpretation of the Analysis of f{m): 60 4.3.3 Qualitative Interpretation of the Analysis of f(m): 64 4.4 Conclusion: 65 5.STRATEGIES: DESIGN 67 5. 1 Adapting the Rank Metric to Gator Networks: 68 5.2 Two Phase Strategies: 71 5.2.1 Integrating the Simple Strategy With the Gator Optimizer: 72 5.2.2 Complexity Analysis: 76 5.2.3 Influence of Indexes on the Heuristic: 77 5.3 One Phase Strategies: 78 5.3. 1 Complexity Analysis: 79 5.3.2 Semantic Issues for Kill Beta: 79 Semantic Issues for Merge Sibling: 80 5.3.4 Suggested Fixes for the Semantic Issues: 81 5.3.5 Sub-Strategies: 82 6.STRATEGIES: IMPLEMENTATION 84 6. 1 Implementation of Local Change Operators: 85 6.2 Revision of the Cost Formulas Implementation: 86 6.2.1 Cost of a Sigma Node: 86 6.2.2 Cost of a Beta Node: 90 6.3 The Optimizer: 92 6.3. 1 Implemented Randomized Algorithms: 92 6.3.2 The Optimizer Characteristics: 95 6.4 Experimental Results: 96 6.4. 1 Two Phase Strategy Running Times: 96 6.4.2 Testing Running Times for Both Strategies: 97 6.4.3 Analysis: 101 6.4.4 Testing Result Quality for Both Strategies: 106 6.4.5 Analysis: 108 6.4.6 Conclusion: 109 7.C0NCLUSI0N AND FUTURE WORK 1 13 GLOSSARY 116 LIST OF REFERENCES 125 BIOGRAPHICAL SKETCH 134 vi

PAGE 7

LIST OF TABLES Table ESge Table 1 Conditions for pushing down a selection predicate 57 Table 2 Analysis of f(m) 60 Table 3 Comparing running times of OPS and TPS 109 Table 4 Comparing networks costs of OPS and TPS 110 Table 5 Recapitulation of the conditions for pushing down Gi 123 vii

PAGE 8

LIST OF FIGURES Figure 1-1 The architecture of the TriggerMan environment 12 Figure 1-2 The architecture of TriggerMan 14 Figure 1-3 TriggerMan as an Informix Datablade 16 Figure 2-1 Rule Graph for Rule HighSalesS alary 19 Figure 2-2 Discrimination network for rule HighSalesSalary 20 Figure 2-3 Selection pushdown 25 Figure 3-1 Local change operator "Create Beta" 33 Figure 3-2 Local change operator "Kill Beta" 34 Figure 3-3 Local change operator "Merge Sibling" 34 Figure 3-4 Example of a Gator network 36 Figure 4-1 Analysis of selection predicate pushdown 53 Figure 4-2 Representation of a surface Sm 61 Figure 4-3 Relative positions of Sm for different values of m 62 Figure 4-4 Addition of surface Si 63 Figure 5-1 Token frequency driven heuristic 73 Figure 5-2 Maximum height of a Gator network 76 Figure 5-3 Semantic issues for Kill Beta 79 Figure 5-4 Semantic issues for Merge Sibling 80 Figure 6-1 Sigma at the top of the network 87 Figure 6-2 Sigma with regular children 87 Figure 6-3 Sigma list for transfers of PerChildlnsertCost and PerChildDeleteCost 89 viii

PAGE 9

Figure 6-4 Sigma with a Sigma child 90 Figure 6-5 Cost of a Beta node 91 Figure 6-6 Beta node with a sigma child 91 Figure 6-7 Comparing running times of Phases I and II of TPS 96 Figure 6-8 Relative running times with selective joins and non-selective sigmas 98 Figure 6-9 Relative running times with less selective joins, non-selective sigmas 99 Figure 6-10 Variations in running times of OPS and TPS with less selective joins 99 Figure 6-11 Relative running times with large joins and selective sigmas 100 Figure 6-12 Variations in the running times of OPS and TPS with selective sigmas 101 Figure 6-13 Impacts of "Kill Beta" and "Push Sigma" operators on a network 103 Figure 6-14 Relative network costs for selective joins and non-selective sigmas 106 Figure 6-15 Relative network costs for non-selective joins and non-selective sigmas ... 107 Figure 6-16 Relative network costs for non-selective joins and selective sigmas 107 ix

PAGE 10

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy SELECTION PREDICATE PLACEMENT IN DATABASE DISCRIMINATION NETWORKS By Mokhtar Kandil August 1998 Chairman: Eric N. Hanson Major Department: Computer and Information Science and Engineering Discrimination networks are data structures that are used to implement rule condition testing in active databases. Rules allow actions to be automatically undertaken (user notification, firing of another rule, etc.) when the database enters a particular state following a data insertion, deletion or modification. Traditional database management systems handle limited (built-in) sets of datatypes and operators, typically numbers, strings of characters, money and datetime types, and basic operators. The extensibility feature allows knowledgeable users to "plugin" newly defined datatypes (such as polygons and time-series) and operators on those datatypes (such as overlapping of polygons, variations in the time-series). As a result, the spectrum of use of database management systems is extended to many more types of user applications (scientific, business, stock exchange, etc.).

PAGE 11

The focus of this dissertation is on generating and optimizing discrimination networks, which may contain user-defined datatypes and operators as discussed above. A discrimination network uses two sets of operators to match rule conditions: selections and joins. Traditionally, selections have been put at the top of the network so that they are computed first; the reason being that they are usually cheap and simple operators with respect to joins. While that assumption might hold for non-extensible systems, user defined operators (such as polygon overlapping, or fingerprint matching) can be expensive and complex operators, hence, their cost is no longer negligible with respect to that of joins, hi such cases, pushing a selection operator down the discrimination network (in order to delay its evaluation in the rule condition) may prove to represent an effective cost improvement for rule matching. Query optimization techniques have been proposed to handle the issue of Joins and selection predicate ordering in query trees. This dissertation first shows how the aforementioned methods do not necessarily apply to discrimination networks, and how a direct application of those could fail to produce a correct ordering of joins and selections. We proceed by proposing a set of strategies (of increasing levels of complexity) which can be applied towards the placement of selection predicates in a discrimination network, and consider the corresponding implementation issues. xi

PAGE 12

CHAPTER 1 INTRODUCTION: THE TRIGGERMAN PROJECT 1 1 Rules and Active Databases Overview: In a passive (traditional) database management system (DBMS), data is manipulated (created, updated and deleted) only in response to queries issued by users or user-level applications. On the other hand, a system fitted with active capabilities (ADBMS) has the ability to perform such operations automatically. More generally, any action can be invoked in response to a particular event, when some predefined condition(s) is (are) true. The active behavior of such database systems is expressed by rules, also referred to as triggers, alerters, or monitors in the hterature. In their most popular form, active rules are defined with a threefold structure: Event-ConditionAction, very often referred to as the EC A model in the literature [Cha89, Day96]. The existence of a rule processor (rule-processing module) allows for a number of tasks to be carried out by active database systems without a dedicated application. A few examples of such tasks are enumerated in the following three sections. 1.1.1 Integritv Constraint Checking and Repair: Issues related to automating integrity constraints checking were fairly extensively investigated and are a classical application area for active databases. An article proposed by Ceri et al [Cer92] describes a system that ensures automatic correction of violated 1

PAGE 13

2 constraints through production rules. Additional work by Ceri and Widom [Cer90] also proposes the use of active rules for constraint maintenance. A special language for defining the constraints is described, as well as a framework for translating those constraints into rules. On a similar note, Etzion [Etz91] proposes to use active capabilities to handle incomplete or exceptional information in a database system. Fratemali and Paraboschi's paper [Fra93] is also a good overview of integrity constraints maintenance techniques, mostly based on first order logic. In another research effort, Gertz [Ger94] proposed concentrating on the repair aspect of integrity constraints, rather than mere detection. More generally, many active database research projects did undertake this question in some way or another: the Jasmine prototype [Ish93] is an object-oriented (OO) database with constraint facilities, able to handle both intra and inter-object constraints. Ode is another object-oriented DBMS developped at AT&T Bell Laboratories. A paper by Jagadish and Qian [Jag92] introduces integrity maintenance in Ode and focuses on inter-object constraints including referential integrity, uniqueness integrity and relational integrity. POSTGRES [Sto87] and PARDES [Etz93a, Etz93b, Etz94] also had modules and related publication centered on this idea. 1.1.2 Time and Temporal Issues: The notions of events and time are tightly related, and many models allow the user to express (with more or less simplicity) temporal or time-sensitive constraints. Consequently, multiple results from temporal logic and temporal database research have been integrated into active database management systems (ADBMS) models. Dittrich and Gatziu [Dit93] give an overview of the useful extensions to the active model if a notion

PAGE 14

3 of time is supported. In a stiort paper [Sis92], Sistla and Wolfson introduce the definition of a temporal language as an extension to complex event definition. The authors discuss two classes of temporal operators: "past temporal language" and "future temporal language." The operators: since, last time and previously apply to the accumulated history up to the time when the trigger was introduced; while until, next time and eventually apply to incoming events from the time when the trigger was defined. The same authors present some additional work on temporal conditions and integrity constraints in active database systems [Sis95a], and temporal triggers in active databases [Sis95b]. Etzion, Gal and Segev [Etz92] also discussed issues related to supporting temporal rules in the PARDES project. Dori, Gal and Etzion wrote more recently on temporal active databases as a key to computer integrated manufacturing [Dor96]. The POSTGRES [Sto91] project proposed the idea of time travel, which allows users to run historical queries. This approach was implemented using a "no-overwrite" (also referred to as append-only) storage management technique. 1.1.3 Materialized View Maintenance: The importance of views in query processing is undeniable. Views can remain as predicates that are expanded at query processing time (using query modification techniques [Sto75]), or be materialized to improve query response time. The problem of efficiently and automatically maintaining materialized views has caught the interest of the database research community for several years. Rules in active databases have proved to be of great assistance in the process of automating the maintenance of materialized views. The work carried out by Ceri and Widom [Cer90, Cer91] is a typical application of that

PAGE 15

4 concept: The user defines a view as an SQL select expression and the system automatically derives production rules to maintain the materialization of that view. Other examples include ideas from the POSTGRES project [Sto87], additional work by Stonebraker et al [Sto90] on rules, procedures caching and views, and a more recent publication by Botzer and Etzion [Bot96]. Furthermore, the fairly recent emergence of data warehousing as a research area in the database community has triggered a renewal of interest for further improvements in the processes of maintenance of materialized views. 1.1.4 Advantages and Shortcomings of Existing Rule Systems: Because rule systems have proved to be very useful in business-oriented as well as technical applications, mostly from a software engineering aspect, an increasingly larger number of (commercial) database management systems are now offering built-in support for some form of "trigger" or "rule" definition capabilities. Indeed, by factoring out the non-trivial and non-easily maintainable procedural coding issues of many applications, rule systems have facilitated application development, scalability and maintenance. However, and despite those multiple advantages, the applicability of "on-board" trigger systems in realistic environments is limited for a number of practical reasons: Excessively simplified models: The rule execution modules are often relafively too simple to handle realistic condition testing cases because a trigger can only be defined on a single "table," (in relational terminology) or data source more generally. The definifion of a rule can potentially be as complex as a query with multiple joins and selections conditions. Systematically precompiling a complex rule definition as an SQL query to be run against

PAGE 16

5 the database at the occurrence of each event can be very expensive in terms of systems resources used. Lack of portability and flexibility: Built-in trigger systems are very tightly coupled to the system they were written for. This lack of modularity in the design of active capabilities can cause important roadblocks in the processes of application migration between Database Management Systems, causing lower degrees of flexibility and portability. Synchronous condition testing: Despite numerous research designs aimed at describing decoupled or deferred approaches to perform rule condition testing (POSTGRES [Sto86], HiPAC [Cha89, Day96]), many, if not all, of the existing implementations of triggers still run the condition checking synchronously, as part of the user-level transaction. While such an approach is acceptable in environments with limited numbers of rules of limited complexity, it does not scale well at all and can cause significant delays in commit times, and important overhead to DBMS's response times. A simple table insert operation can suddenly become extremely lengthy and expensive, when it causes the firing of hundreds of rules.

PAGE 17

6 1.2 Extensibility and Extensible Databases Background: As database management systems gained popularity within the industry, the need for user defined data types, structures and functions became more significant. Therefore, a second major focus of the research community in the past few years (in addition to adding active capabilities to database management systems) was aimed at providing better and generalized support for nontraditional applications as they were becoming more popular among users. Examples of such applications include Geographic Information Systems (GIS), spatial packages, CAD systems, multimedia and others. Integrating all the necessary code within the kernel of the DBMS makes for large monolithic systems that still could be lacking some specific user requirements. Since no single database system could integrate built-in support for all possible applications, but most of those applications still reused an important set of common functionalities, research groups looked at ways of adding extensibility features to conventional database systems. As accurately described in the compilation of readings in database systems by Stonebraker [Sto88], there are several levels at which extensibility can be achieved, and the corresponding systems may be classified in different groups. An extensible DBMS may support: 1. New data types (such as polygons), 2. New operators on those types (such as overlapping), 3. New aggregate operators (third largest), 4. New operators on relations (such as the transitive closure), 5. New access methods (such as R-Trees operating on polygons).

PAGE 18

7 6. Modifications to the data model. In the most common form of extensibility that has come to be known as Abstract Data Type (ADT) support, we refer the reader to projects such as ADT-INGRES [Ong84], RAD [Osb86], POSTGRES [Sto86], PROBE [DaySV], Starburst [Haa90], 2 2 R D [Lin88] and Sabrina [Gar89]. This form of extensibility, pioneered by Stonebraker et al [Sto83], consists of providing support for user defined data types (and functions operating on those types), through a set of precisely defined interfaces within the system. Using ADT support, users (programmers) can write modules defining new data types (box, polygon, image, etc.) and functions operating on those types (intersect, overlap, extract_contours) and "hook" or "plug" those modules directly into the DBMS (essentially points 1, 2, 3 and 5). All the projects mentioned above adopted this model, and differed only in terms of basic data models, query languages and ease of adding extension modules. It is important to note that in this approach, the data model itself remains fixed (usually referred to as "object-relational"). A more ambitious form of extensibility (often referred to as a "toolkit" approach) is that where the data model itself is considered as variable (points 4 and 6), has been investigated in projects such as EXODUS [Car86], GENESIS [Bat88] and DASDBS [Pau87, Sch90]. The database kernel provides storage management and transaction support. The application specific modules and DBMS layers are implemented "on top" of that kernel. The "value added server" capability of the SHORE [Car94] project (of which PARADISE [Dew93, Dew94] is a good example) constitutes a good implementation of this concept, carried out at the University of Wisconsin, Madison. Object-Oriented database systems such as ORION [Ban87], OS AM [Su93] and others are also extensible

PAGE 19

8 by means of their data sublanguage (ie at the datatype level). Once again, we refer the interested reader to [Sto88] for a more comprehensive discussion of this topic. We will simply summarize here the proposed terminology translation table, between the Object Oriented and the relational worlds: A class represents a type, and an object is an instance of the type. Furthermore, a method and a message represent respectively a function and a function call. Since extensibility in OODBs relates to allowing the creation of new classes and new corresponding methods, then passing messages to the newly created objects, it follows that it is also an ADT type of extensibility, centered mainly on points 1 and 2. Even though the concept of extensibility using ADT support is less ambitious than the "toolkit" approach is, it still carries far-reaching consequences for the architecture of the database system. As a matter of fact, the DBMS design needs to be carefully thought out so that it is extensible at multiple levels: Language level: At the language level, the user should be able to declare new data types and functions (methods, operators) and use them in appropriate queries (and rules). Storage management and access methods level: The introduction of new types such as polygons and boxes often requires the definition of new access methods allowing efficient retrieval and indexing of objects of such types. In the case of spatial data, an example of a new access method could be RTrees [Gut84].

PAGE 20

Quen/ optimizer level: The query optimizer is usually a very complex component of the database management system that is "closed" (with built-in logic). The extensibility at the storage management level discussed above is totally useless if the query optimizer can not be made aware of the existence of the new access method, and make proper use of it while generating plans. Thus, the query optimizer architecture needs to be designed in an open format (sometimes referred to as "table-driven") allowing the addition of new access methods or functions with their associated costs. 1.3 TriggerMan: The TriggerMan [Han97a] project (under development at the Database Research Center at the University of Florida) implements an asynchronous and extensible trigger processor system. Many of the design features and concepts adopted for TriggerMan were meant to handle or reduce the significant shortcomings of "traditional" rule processors as they were described earlier in section 1.1.4, and integrate the ADT based extensibility technology described in section 1.2 above. The reader will find below a brief description of the major design decisions that went into TriggerMan. Asynchronous condition testing: TriggerMan is an asynchronous, external ("outboard") trigger processor, which has been designed to cooperate with various heterogeneous data sources. Data sources can be any combination of legacy systems, relational, object-relational or object oriented DBMSs, simple unstructured data sets or applications. Updates are forwarded to TriggerMan (for rule condition testing) after they have committed within their

PAGE 21

10 corresponding data source, hence the "asynchronous" designation. The forwarding of updates is performed by special data sources applications providing standard interfaces between TriggerMan and the data sources. From a scalability point of view, this approach is expected to perform much better than the (mainstream) synchronous one for large sets of rules. Furthermore, the external aspect of TriggerMan will allow it to cooperate with multiple heterogeneous data sources more easily than built-in trigger processors can. Use of Gator networks: TriggerMan uses a data structure called a "Gator network" to perform the rule condition testing. The Gator network is a generalization of the earlier Rete [For82] and TREAT [Mir87] discrimination networks, and was developed at the University of Florida as part of the Ariel project [Han92, Han96a]. Discrimination networks are data structures borrowed from AI production rule systems such as OPS5 [Bro85]. The use of discrimination networks in TriggerMan is expected to overcome the simplistic models of many trigger processors, and allow efficient monitoring of sophisticated conditions based on multiple joins, without the added overhead associated with SQL query compilation and execution. Temporal component: TriggerMan incorporates a highly expressive and non-procedural (SQL like) temporal language for rule definition. The user is provided with an initial set of built-in temporal functions (increase(), decrease() etc.) and completeness is achieved through the extensibility feature, which enables users to write and plug-in a new temporal function the same way new data types and methods are [Alfa98].

PAGE 22

11 The architecture of the environment in which TriggerMan operates, as well as a brief overview of the decisions that were made regarding the implementation issues, are covered next. 1.3.1 The TriggerMan Environment: As described in Figure 1-1 below, the architecture of the TriggerMan environment consist of one or more external data sources "feeding" the Asynchronous Trigger Processor with data consisting of "update descriptors". Those external data sources can be of multiple types: An application generating real time data (stock exchange ticker etc.). A replication server reading data from a DBMS (such as Sybase replication server [Syb96]). A DBMS application that uses simple built-in triggers to catch local updates, and forwards them. A non-structured data source such as a spreadsheet or a word processor file. A "data source program" application will act as an interface between TriggerMan and the external data sources. Depending on the inherent characteristics of those external data sources (structured, queriable, active etc.), the role and complexity of the data source program will vary on a per-case basis. Hence, a TriggerMan data source application may range from a process with advanced logic, regularly performing differencing on a passive non-structured, non queriable source, to a much simpler "forwarding" layer (Figure 1-1).

PAGE 23

12 Figure 1 -1 The architecture of the TriggerMan environment.

PAGE 24

13 In addition to providing TriggerMan with data (update descriptors) for rule condition testing, data source applications may perform some processing on the formats of those update descriptors in order to convert them to a format "understood" by TriggerMan when (and if) necessary. The conversion will be performed by a "translator" or "translation layer" reading the external data source format and calling a precisely defined API, in order to generate the tokens in the internal format required by TriggerMan as illustrated in Figure 1-1. A number of client applications may then register with TriggerMan for particular event notifications that can be raised as part of rule actions. When events are raised, each application that registered for it is notified. A console application is used as the "administrative" interface to the system, allowing startups, shutdowns, activation, deactivation of triggers and other administrative tasks. 1.3.2 The TriggerMan Physical Architecture: The design team was faced with a few possible implementation strategies for the first prototype of TriggerMan. Each of the envisaged approaches was fully compatible with the logical architecture described in section 1.3.1, but the required amounts of implementation and coding, and the capabilities and speed of the resulting system varied substantially depending on the chosen scheme. Stand-alone version with local storage management: In this strategy, TriggerMan is to be implemented as a stand-alone application, using a local storage manager supporting transactions and recovery management. Parallelism support being an integral part of the design of the project, the overall physical architecture is described in Figure 1-2 below:

PAGE 25

14 SMP SMP High Speed Interconnection zzzczz I SMP SMP M 1 SMP A /I VProc eh Figure 1-2 The architecture of TriggerMan In an ideal parallel configuration of this architecture, a copy of the TriggerMan server runs on each node of a shared-nothing architecture. The nodes of the sharednothing environment are connected with a high-speed medium (bus or fast network) and

PAGE 26

15 each of them can either be a single processor machine, or an SMP cluster grouping a number of processors sharing a local memory area. A Virtual Processor (VProc) abstraction concept, embedded within TriggerMan will map VProcs to real processors in a flexible way, allowing high fault tolerance and dynamic load sharing capabilities within the server. Each VProc runs a pool of threads handling various tasks including: Local rule condition testing. Tokens exchanges with other VProcs according to the data-partitioning scheme. Possible local basic storage management. Caching strategies of the local partitions of the memory nodes. Although this approach would have represented the ideal implementation, given the unavailability of an open and reliable storage management layer, and considering the tremendous amount of development effort required to write one, it was decided not to try to utilize this architecture. Standalone version with external storage management: This approach is very similar to the previous one and differs only in the storage management strategy. Instead of relying on its own storage manager, TriggerMan would use a commercial DBMS for storage purposes, using ODBC or any other message passing interface which could be made available. While this plan of action was technically feasible, it suffered from serious performance limitations in handling large amounts of data through the message passing type of communication. Furthermore, a few additional technical problems would have increased the internal complexity of TriggerMan without any major benefit. It was therefore decided to put this scheme aside.

PAGE 27

16 Plug-in within an extensible DBMS: A middle ground strategy, procuring a reliable storage management layer for TriggerMan, without extensive code writing (or excessive communication overhead), was to develop the project as an extension to an object-relational DBMS with ADT support. The advantages of this strategy are multiple and can be summarized as follows: As a plug-in module running in the same address space as the DBMS, TriggerMan would benefit from "commercial strength" local storage management, without the overhead associated with message passing. Any support of parallelism within the DBMS is freely available to TriggerMan through the transparent SQL "call-back" interface. TriggerMan Driver TriggerMan Datablade Generic DataBlade SQL IDS/UDO Figure 1-3 TriggerMan as an Informix Datablade Figure 1-3 above describes the implementation strategy that has been chosen to develop the first prototype of TriggerMan, as an Informix Dynamic Server with Universal Data Option (IDSAJDO) Datablade. The TriggerMan catalogs and rules are stored within

PAGE 28

17 the Informix server in regular "tables". The access to those elements is made through standard SQL calls made to IDS/UDO. Because of technical implementation issues related to Informix threads handling in datablades, it was decided to keep the TriggerMan code passive, and signal it on a regular basis through an external simple timer application: the TriggerMan driver.

PAGE 29

CHAPTER 2 PROBLEM STATEMENT As mentioned in the introduction, TriggerMan relies on Gator networks to perform rule condition testing within the context of an Object-Relational Database Management system and environment. In this chapter, the reader will find a brief tutorial of discrimination networks, and a precise description of the specific issues that this dissertation will cover. 2. 1 A Discrimination Network Discussion and Example: For the reader who is unfamiliar with discrimination (and Gator) networks, this section is meant to present a brief tutorial of these data structures through a general discussion of their characteristics. A simple example presented below will be referred to throughout the discussion for illustration purposes. The reader who is already familiar with discrimination networks may skip this section. Consider a simple two-table schema: EMPLOYEE (SSN#, Name, Age, Salary, Dno) DEPARTMENT (Dno, Dname) And a rule condition defined on the schema above as follows: "If the record of a Sales department employee with a salary higher than $100,000 is accessed (inserted or modified), then notify Bob, the manager of the department." This condition could be defined in a rule language as follows: 18

PAGE 30

19 Create Trigger HighSalesSalary From EMPLOYEE as EMP, DEPARTMENT as DEP When EMRSalary >= 100000 And EMP.Dno = DEP.Dno And DEP.Dname = 'Sales' Then raiseEvent HighSalesSalaryUpdate("Bob", EMP.Dno). Just as queries can be represented as graphs referred to as "query graphs," rules can also be represented in a similar way: the rule graph. A rule processor or active database will then internally represent the trigger HighSalesSalary defined above, in the format of a graph such as the one drawn in Figure 2-1 below: Where: EMP is a tuple variable that represents the table EMPLOYEE DEP is a tuple variable that represents the table DEPARTMENT CTi represents the selection "Salary > 100000" 02 represents the selection Dname = 'Sales' The horizontal edge between EMP and DEP represents the join condition written just Ci: Salary > 100,000 ay. Dname = Sales Figure 2-1 Rule Graph for Rule HighSalesSalary above it.

PAGE 31

20 A corresponding discrimination network would be that given in Figure 2-2 below: Figure 2-2 Discrimination network for rule HighSalesSalary In Figure 2-2: tti is derived from tuple variable EMP from the rule graph in Figure 2-1 . a2 is derived from tuple variable DEP from the rule graph in Figure 2-1. Oi and (72 have the same meanings as in the rule graph from Figure 2-1 . The edge drawn between a\ and ai represents the join condition between EMP and DEP from the rule graph. By convention, join conditions will be represented as dashed lines joining sibling nodes. Sibling nodes are always at the same distance from the Pnode. The full arrows represent a child -> parent relationship. The parent node stores the results of the joins across all of its children. The arrows have been drawn here for

PAGE 32

21 illustration purposes, but typically, simple lines or segments joining the child to its parent will replace them. The routing node and the dashed arrows are also drawn here for illustration and tutorial purposes. The dashed arrows represent the routing capabilities of the routing node to send incoming data to the right alpha node. The presence of the routing node and the dashed arrows will be implicit in all further discrimination and Gator networks figures. General structure of a discrimination network: In general, a discrimination network can contain the following types of memory ^ nodes: A Pnode, which is the actual root of the tree representing the discrimination network. As we will see very shortly, data (tokens) are propagated through the network, and the semantics of discrimination networks is such that when a token reaches the Pnode, all the rule conditions have been met and the trigger fires. A set of alpha nodes, one for each tuple variable from the query graph of the rule. A set of beta nodes. Beta nodes are intermediate nodes used to store the results of two or more alphas and/or betas. (Since the current example has only one join condition, it does not contain any beta node. A more general network with beta nodes will be presented at the end of this discussion). The "Routing" node was sometimes called a "Root" node in earlier papers, but it is not to be confused with the Pnode of the network. The "Routing" node redirects ^ Despite their denomination, memory nodes can be (and often are) stored on disk.

PAGE 33

22 incoming tokens to the appropriate a nodes for further processing. We will therefore use the term "Root" only to refer to the Pnode. Now that the structure of discrimination networks has been described, it is necessary to present the dynamic aspect of these structures, represented by the token propagation that was briefly mentioned a few lines above. Token propagation in discrimination networks: The process of testing a rule condition when an update occurs (in the database) is sequenced in the following way: each access to a relation appearing in a rule condition will generate the creation of an update descriptor. An update descriptor (also referred to as a token) is simply the tuple that was accessed, to which a few auxiliary fields are added for rule processing purposes. The token in question will be transferred to the Routing node of the discrimination network(s) representing the rules defined on the current schema. Based on a table (or data source) field in the header of the token, it will be forwarded to the appropriate a node and propagated down the network from there (for testing the rule condition). Consider again the example of the schema described at the beginning of this section; each access to relations EMPLOYEE or DEPARTMENT will therefore generate the creation of an update descriptor that will be transferred to the Routing node described in Figure 2-2 above. Suppose that table DEPARTMENT is the following: Dno Dname 1 Finance 2 Sales 3 Research & Development

PAGE 34

23 And a tuple (SSN = 1, Name = "Bob", Age = 30, Salary = 150000, Dno = 2) was inserted in table EMPLOYEE. An update descriptor of the form [EMPLOYEE, INSERT, 1, "Bob", 30, 150 000, 2] is generated and sent to the routing node of the discrimination network. The same token is then forwarded to ai based on the EMPLOYEE field. The field INSERT is used to determine the type of database operation that was performed, since that can be part of the rule condition (on INSERT, on DELETE etc.), and filtered out by the routing node if necessary. The selection predicate represented by 100000) is then tested and succeeds. The update descriptor is therefore inserted into the node tti where the join condition with the departments is checked. Since the current department is "Sales," and the Dno for Sales is "2," the join succeeds too and a resulting compound token (consisting of the concatenation of both tokens from ai and ai) is created and inserted into the Pnode. As mentioned earlier, the insertion of a new token into the Pnode signals the matching of the rule condition for the current database access. The network used for this particular example was simple enough not to require any beta node to store intermediate joins results. In general, rules with three or more tuple variables will generate networks with three or more alpha nodes, and the presence of beta nodes will become likely. Consider the network below: Pnode

PAGE 35

24 In this case, the rule has six tuple variables (six alpha nodes) and three selection predicates. When the network is first created, it is "primed," which means that all nodes are loaded with their respective data. Alphas are primed from the relations they are derived from (with the selection condition applied), and betas are primed by executing the join operation across all their children. (The Pnode is always emptied after each rule condition testing and does not require priming). The results of the joins between the first three alphas are stored in Pi while P2 stores the results of the joins between 04 and aj. The Pnode contains the result of joins between Pi, P2 and Og. The propagation of tokens is applied in the same way it was described earlier. Each time a token is inserted into a node, it is joined with all the siblings of that node (according to a predefined ordering of those) and the result of that join (potentially many new "compound tokens"), is inserted into the parent node and so on. As mentioned earlier, a compound token is simply the concatenation of the tokens that joined from each of the sibling nodes. As an example, consider an update descriptor originating from the relation from which a2 is derived. The routing node (not drawn) routes the token to a2, which causes Cz to be tested first. If the test succeeds, the token is inserted into az and joined with ttj and (the joining order is part of the optimization algorithm and is hard coded within the "join plan" of each node). All the tokens generated by this join are inserted into Pi and joined with P2 and OLe. Any resulting token is inserted into the Pnode and causes the rule to fire. Note that all tokens inserted into the Pnode do not remain there, but are deleted when the rule executes. Those tokens are needed in the Pnode for condition/action binding purposes, when the action of a rule refers to the data that had caused it to fire.

PAGE 36

25 2.2 Need for Selection Predicate Pushdown: In the extensible and object-relational environment in which TriggerMan operates, selection predicates can be arbitrary user-defined functions or methods. Suppose that the table EMPLOYEE had one additional field "fingerprint" of type "image" or "vector data." Suppose furthermore that the (selection) predicate C\ is no longer a condition on the salary, but a complex operation executed by one of the image (vector data) data type methods (such as matching fingerprints etc.). The computation of a\ now becomes expensive enough for the cost of the join operation tti.Dno = aa.Dno to become negligible in comparison with it. In such a case, it would probably be a good optimization strategy to perform the join operation first, and delay the selection evaluation as much as possible. A "Push Down" operation is then performed on d, which is replaced by a "True" operator at the leaf level, in order to allow all tokens to get through. The resulting network is shown in Figure 2-3 below: Figure 2-3 Selection pushdown

PAGE 37

26 Indeed, the potential reduction of the number of tokens flowing through C\ represents in turn a potential optimization of this network with respect to the one described earlier in Figure 2-2. Remark: Since Gator networks are always represented with the root node (Pnode) at the bottom and the leaves (alpha nodes) at the top, a selection pushdown means moving the sigma node so that it becomes closer to the root of the network. Since query trees are usually represented the other way around, with their root at the top, the pushdown and pull-up operations in query optimization [Hel92, Hel93] have opposite meanings to those used in discrimination network optimization strategies. All Gator networks designs and implementation have traditionally placed selection predicates at their leaves so that they can be performed first. While this approach proved worthwhile and effective in existing optimization strategies, the ability of TriggerMan to operate in extensible environments makes this heuristic no longer true. As a matter of fact, selection predicates defined on objects such as polygons or images may prove to be more expensive than (some) joins, and a more elaborate heuristic (or set of heuristics) for selection placement in the discrimination network will become necessary. The intention of this dissertation is to investigate the approaches that can be taken towards achieving an optimal or near optimal placement of selection predicates in discrimination networks in general, and more particularly in Gator networks. The organization of the rest of the chapters of this work is structured as follows:

PAGE 38

27 In Chapter 3, the reader will be familiarized with the most important aspects of the existing Gator network optimizer as it is implemented in the Ariel project [Han92]. Chapter 4 will cover the related work that was accomplished in selection predicate placement in query processing, and show how those results cannot be applied to discrimination networks optimization strategies. In Chapter 5, we will propose an alternative set of strategies extending the existent Gator networks optimizer that was described in Chapter 3, and analyze the consequences of each of them (on the current optimizer) mainly from a conceptual point of view. Chapter 6 will describe the practical issues that arise during the implementation of the strategies described in Chapter 5, and present the practical results that were obtained. In Chapter 7, we will conclude with a brief review of this work and present additional related issues and open research questions.

PAGE 39

CHAPTER 3 BACKGROUND ON GATOR NETWORK OPTIMIZATION TREAT [Mir87] and Rete [For82] networks have been successful approaches for performing non-procedural rule matching in active databases. Treat networks do not make use of the intermediate beta nodes described earlier, and have all their alpha memory nodes feed directly into the Pnode. Therefore, Treat networks are always twolevel networks (The HighSalesSalary network shown above (Figure 2-2) was a TREAT network with a single join condition). The Rete network uses betas that can store the results of intermediate two-way joins only. In other words, a beta node never has more than two children in a Rete network. Gator networks extend the shapes of discrimination networks to general shapes by allowing betas to have N other nodes as direct children. As a result, TREAT and Rete become special cases of the more general Gator structure. Because of its generalized shape, the optimization of a Gator network becomes an expensive procedure with the increase of the number of tuple variables involved in the query graph. As a matter of fact, it is easy to see that the number of alternatives for building a Gator network increases extremely rapidly with the number of alpha nodes. Since our approach consists of an extension of the existing cost model, we will present a brief review of the current Gator network optimization strategy before presenting the selection predicate placement strategy itself. 28

PAGE 40

29 3. 1 Brief Review of the Gator Network Optimization Algorithm: Randomized algorithms have been noticed to produce very acceptable solutions (nearly optimal) with computation or execution times that are orders of magnitude smaller than dynamic programming approaches, when dealing with exponential search spaces. Such algorithms include: Simulated Annealing [Ioa87], Iterative hnprovement [Swa88], Two-phase Optimization, which combines the two previous algorithms. The existing implementation in the Ariel[Han89, Han96, Has93] project includes the use of such algorithms for the generation of Gator networks. The reader will undoubtedly notice that the used approach is a greedy algorithm, starting with an initial solution and trying to improve it at each step of the optimization process. The solution is of course prone to the local minimum problem common to all greedy algorithms, but the use of simulated annealing and similar techniques (such as using multiple starting points) are meant to reduce, if not eliminate, this problem. The remaining part of this chapter gives a general overview of the existing implementation of Gator network optimizers carried out at the University of Florida. Below we describe a typical randomized algorithm (As a function named GetOptimalGator) described in high-level pseudo code, then we give a brief overview of the proposed cost model.

PAGE 41

30 Gator* GetOptimalGator (RuleGraph* Graph) Begin BestSolution = BuildlnitialGatorNetwork(Graph) While (no satisfying solution found) New-Gator = ApplyLocalPerturbation (BestSolution) If (Cost (New-Gator) < Cost (BestSolution) ) then DeleteGator (BestSolution) BestSolution = New-Gator ilse DeleteGator (New-Gator) Endif EndWhile Return (BestSolution) End The exit condition from the While loop described as "no satisfying solution found" can be implemented in different ways according to the chosen solution used for the randomized approach. Possible conditions for a satisfying solution include: Perform a specified number of iterations then exit. Iterate until the cost drops below some predefined threshold then exit. Iterate until a certain number of operators called successively fail to improve the network. A combination of both previous cases: iterate until a specified number of iterations are reached or the cost drops below a certain limit, whichever comes first. The next algorithm gives a description of the function BuildlnitialGatorNetwork that is called at the beginning of GetOptimalGator described above.

PAGE 42

31 Gator* BuildlnitialGatorNetwork (RuleGraph* Graph) Begin LocalList = New List NeighborsList = New List For (each tuple variable in Graph) AlphaNode = new Alpha If (tuple variable has a a with selectivity s) AlphaNode^ selectivity = s Endif LocalList ^Insert (AlphaNode) EndFor While (LocalList not empty) Node 1 = Randomly select one element from LocalList NeighborsList = FindNeighbors (Nodel, LocalList) k = Size of (NeighborsList) Node2 = Randomly select an element from NeighborsList BetaNode = Combine (Nodel, Node2) LocalList -> remove (Nodel) LocalList remove (Node2) LocalList insert (BetaNode) EndWhile End The function FindNeighbors simply walks down the list of nodes and returns all the nodes that have a direct edge connection with the node currently passed as a parameter. As the reader may have already noticed, the Gator network is incrementally built using the Combine(Node* Node*) function. This function implements a set of three basic

PAGE 43

32 combination operators for nodes, whicii incrementally return larger networks. Below is a description of the operators implemented in the function Combine. 3.1.1 Combine Operators: The three operators, Join Absorb and Merge are shown in this section. Join: Two existing nodes are made children of a newly created one (beta), the new network is one level deeper. Absorb; The white node is absorbed and becomes child of an existing P node. Merge: Merge applies only to two P nodes.

PAGE 44

33 Once the initial network is randomly built using the operators above, it is improved gradually by using a set of three local change operators "Create Beta," "Kill Beta" and "Merge Sibling" described in the following section. 3.1.2 Local Change Operators: Gator* ApplyLocalPerturbation (Gator* Current) Begin End Perturbation = Choose Randomly In {CreateBeta, KillBeta, MergeSibling} New-Gator = ApplyPerturbation (Current, Perturbation) Return (New-Gator) Gator* ApplyPerturbation (Gator* Current, char* perturbation) A set of three local perturbation operators has been specified. At each iteration, one of the operators is randomly chosen and applied to the network. Below is a description of those local operators. Create Beta: tt'' "a ,a ,a Pnode Pnode Figure 3-1 Local change operator "Create Beta" The Create Beta operator requires a node with at least three children. In the example above, the node in question is the Pnode itself, but it can potentially be any P in the network or the Pnode.

PAGE 45

34 Kill Beta: Pnode Pnode Figure 3-2 Local change operator "Kill Beta" The beta is killed, and all its children become direct children of the (old) parent of that beta. In the example above, all four alphas become children of the Pnode. Merge Sibling: Pnode Figure 3-3 Local change operator "Merge Sibling Pnode The merge sibling operator requires a node to have at least three children of which at least one is a P (in the example above, the Pnode has been selected again as such a node but that is not a necessity). In the case where more than one (3 node is available, one of them is randomly chosen as a "MergeNode." One of the siblings of the "MergeNode" is then selected and moved to a child position with respect to "MergeNode" as shown in Figure 3-3 above.

PAGE 46

35 3.2 The Cost Model: The comprehensive description of the Gator discrimination network cost model is somewhat lengthy and will not be covered in detail. A complete reference can be found in [Han96b]. In this section, the reader will find a short overview of the cost model, reviewing the cases where all nodes are stored and do not have any indexes defined on any of the attributes. Since the use of indexes and virtual nodes does not have any influence on the remainder of this work, there will be no particular need in covering extensively all the cases. Notation: CPUweight The relative weight of the CPU utilization. I/Oweight The relative weight of the I/O operations. N A node in the discrimination network (a, P or Pnode). Fi(N) The insert frequency in node N. Fd(N) The delete frequency in node N. Card(N) The cardinality of node N. Ci(N) Insertion cost in node N. Cd(N) Deletion cost in node N. Sel(A) Selectivity of the predicate/condition A. JSF(A,B) Join selectivity factor between nodes A and B.

PAGE 47

36 The network below will be used as a reference example that we will refer to throughout the description of the cost formulas in the remaining sections of this chapter. Oi True (52 Tme O3 True 3.2.1 Cost Functions for a Nodes: The cardinality of an alpha node is estimated as the product of the cardinality of the relation on which the node is defined and the selectivity Sel(a) of the selection predicate defined over the same relation. If there is no selection defined on the relation,
PAGE 48

37 Cost of insert in the alpha node: The insertion of a new tuple in a relation requires one disk read and one disk write, therefore two I/Os do occur. Since only one tuple is inserted, CPUweight is incurred once. Ci(a) = (CPUweight + 2 * I/Oweight) * Fi(a) Cost of delete from alpha nodes: In order to delete a tuple from an alpha node, all the pages of the alpha node are read once and tuples in those pages are scanned in order to identify the one to delete. Then the page where the tuple was rewritten is written back to disk. Cd(a) = [CPUweight * Card(a) + I/Oweight * (Page(a) + 1)] * Fd(a) 3.2.2 Cost of Generating Temporary Join Results, (TR): When a token is inserted in a node with multiple siblings, a sequence of two-way joins is executed until the complete join is computed. The order in which the joins are executed is predefined in the "join plan" of each node. The join plan is computed when the network is built, as the sequence generating the smallest possible intermediate join results: TRn's. Li the example network above, an insertion in a2 would join with either tti or a3 (depending on the join plan of ai), then the resulting set of compound tokens would be joined to the remaining sibling in the join plan. The formulas below estimate the cost of generating these sequences of temporary joins for both alpha and beta nodes.

PAGE 49

38 Stored Alpha nodes No index on join attribute: Assuming the system has enough memory, the cost will consist of reading the alpha node from disk and performing the join of TRn with the alpha node: Cj(TRn, a) = I/Oweight * Pages(a) + CPUweight * Card(a) * Card(TRn) With TRo= 1, Card (TRn)= Card (TRn-i) * JSF(Nn.i ^ N„). Note: Nn-i -> Nn represents the two nodes in sequence in the join plan of the alpha node where the token was initially inserted. Beta nodes No join attribute index: The cost of joining from a TRn to a beta node (without indexes) is similar to that of joining with an alpha node. The cost formula is the same as the previous one, where alphas are replaced with beta nodes. Cj(TR„, P) = FOweight * Pages(p) + CPUweight * Card(p) * TR„ With TRo = 1 Card(TRn)= Card(TR„.,) * JSF(Nn-i ^ Nn) Note: Nn-i Nn represents the two nodes (alpha or beta) in sequence in the join plan of the beta node where the token was initially inserted. 3.2.3 Cost of Beta Nodes: The cost of a beta node consists of two components: A local cost noted LocalCost.

PAGE 50

39 The cost for children nodes. Cost( p) = LocalCost( p) + ^ Cost(N) Nechildren(P) LocalCost( P) = 2 {Fi(N) * InsertCost (N, p) + Fd(N) * DeleteCost (N, p)} N€children(P) The LocalCost represents the cost incurred to update the beta node itself. This cost has two components: insertion and deletion costs related to the beta node. Each of these two components is computed incrementally with respect to each of the children of the beta node. Each elementary contribution relative to the current child, is considered relatively to the "weight" of that child, which is represented by the corresponding insertion and deletion frequencies Fi and Fj. Below is a high-level pseudo code description of the InsertCost and DeleteCost procedures. Considering the current child N of node P, the following procedure gives the participation of N in the insertion cost of p. InsertCost (N. p): 3egin TRsize = 1 TempCost = 0 For each node n in the join order plan of N TempCost = TempCost + Cj(TRsize,n) TRsize = TRsize * Sel(ThisNode ^NextlnPlan) EndFor updateCost = [ ] * 2 * I/Owdght tuplesPerPage{/3) return (TempCost + updateCost) End Similarly to insertion costs, deletion costs are given by the following procedure for the current child N of a node p. The Yao(Pages(p), TRsize) function appearing in the

PAGE 51

40 procedure DeleteCost below simply estimates the number of pages of (3 that will be touched when TRsize tokens are (randomly) selected from the node. The exact definition of the function Yao as well as estimates of it is given by Bernstein et al [BerSl]. DeleteCost (N. (3): Begin TRsize = 1 For each node n in join order plan of N TempCost = TempCost + Cj(TRsize,n) TRsize = TRsize * Sel(ThisNode ->NextInPlan) EndFor updateCost = [Yao(Pages(p), TRsize) + Pages(p)] * I/Owdght + Trsize * Card(p) * CPUweight Return (updateCost) End As mentioned earlier in this chapter, a more detailed description of the cost formulas and procedures can be found in Hanson's papers on the optimization of Gator networks [Han96b, Han97b]. The reduced version outlined in the previous pages is however sufficient for the purposes of this work. In the subsequent chapter, we will introduce the work accomplished for selection predicate placement in query optimization and relate it to the selection predicate placement problem in discrimination networks.

PAGE 52

41 CHAPTER 4 RELATED WORK: INADEQUACY OF THE RANK METRIC. Several research projects and related articles have addressed the issue of optimal selection predicate placement among joins in a query tree. The LDL algorithm was first proposed in [Chi89] then later adapted for use with object-relational models in [Yaj91]. In this approach, the selection predicate is passed to the query optimizer as a relation and treated as such during the whole optimization phase. Many of the subsequent algorithms use the notion of a rank that is assigned to each of the predicates (joins and selections) in the query tree. The notion of a rank metric was explored in operations research [Mon79], as well as database issues [Cha96, Kri86]. After assigning the ranks, the algorithms iterate over each "leaf to root" path of the query tree and reorganize the current stream in , . . r^, , . . , ^ , . , Selectivity I ascendmg order of the rank metric. The rank metric is defined to be: rank = cost Without a formal proof, it is intuitively easy to see that a low rank will be produced by both a small selectivity factor (which represents a very selective operator) and cost (i.e. a cheap operator). Non-selective and expensive predicates yield higher ranks, and are therefore delayed as much as possible, while cheap and selective predicates are applied earlier. The ordering of selections and joins based on the rank metric yields satisfying plans in query processing [Cha96, Hel92, Hel93, Hel95]. Yet, it appears to fail to produce consistent results when applied to the placement of selection predicates among joins in a discrimination network structure. Because it does not incorporate important parameters such as insert and delete frequencies, it seems that the rank metric is not

PAGE 53

42 "rich" enough to capture with accuracy the cost model developed for Gator networks. As a matter of fact, it is fairly easy to infer from the cost formulas presented earlier, that different distributions of token frequencies can sensibly change the cost of a discrimination network, and thus the decision of moving a selection predicate or not. Consequently, the results given by a rank based algorithm do not always represent the correct ordering of predicates in the discrimination network tree. Before considering a simple example, it is necessary to introduce at this point a minor extension to the proposed Gator cost model, as well as the basic modeling approach assumptions that will be used throughout this chapter. Later in the chapter, the reader will find a short mathematical analysis of the same example, showing how various distributions of token frequencies can impact the decision surface for moving selection predicates. 4.1 Extensions to The Cost Model and Assumptions: The main extension to the Gator networks cost model will be to define a selectivity or "filtering capacity" for edges joining a child to its parent node. We will refer to this property as the "Edge Selectivity Factor" and denote it ESP. 4. 1 . 1 The Edge Selectivity Factor: hi most relational systems and since the introduction of System R [Ast76], joins have been heavily modeled with a Join Selectivity Factor (JSF) value defined as the ratio of the cardinality of the result of the join, over the products of the cardinalities of the base relations (or nodes). As a result, the value of the JSF is a rational number lying in the interval [0,1]. Practical values for the JSF fall within the interval (0,1) since the

PAGE 54

„ 43 values 0 and 1 represent extreme cases where the join result is either empty or equal to the Cartesian product of the base relations (nodes). Consider the following join: JSF: 1/1000 The figure above gives an illustration of the JSF values as they were just described. The cardinalities of R, S and J are respectively 1000, 100 and 100. The value of the JSF is taken as the ratio of card(J) divided by card(R) * card(S) which is in this case 100 / (100 x 1000) = 1/1000. While this value models the join from a global point of view, it does not correctly depict the participation of "each side" (R and S here) in the joining operation in terms of numbers (or percentages) of tokens flowing across the join edges. As a matter of fact, the cardinality of node J is not one thousandth of either that of R or S. Instead, each token or tuple from S has a match in J (on average), while only about one tenth of the tuples from R have a match (on average) in J. Since the selection predicate is always located on one of the relations participating in the join operator, it is important to correctly model the selectivity of the join with respect to each of the relations participating in it. The Gator network cost model uses the notion of a "directional" selectivity factor for joins, based on the distribution of the join attribute in each relation. Consider a join condition (from the figure above) of the form R.x = S.x where "x" is the joining attribute. Under the uniform distribution assumption for attribute x as it is defined by Selinger et al [Sel79], the selectivity of R^S is taken as the ratio: Sel(R->S) = Caid(S)/Val[x/S] , where Val[x/S] represents the number of distinct values of x in relation R ' ' Cardinality: 100 J Cardinality: 100

PAGE 55

44 S. Symmetrically, the selectivity S^R is given by Sel(S^R) = Caiti(R) / Val[x / R]. Surely, this approach gives correct results under the specified assumptions of uniformity. However, it does suffer from a number of limitations, which are listed below: The hypothesis that a token joins on average with Caid(Relatic»i)/Val[x/ Relation] tokens is valid for equijoins. But when considering non-equijoins such as R.x > S.x or R.x overlaps S.x (if x is a polygon attribute), this approach becomes inadequate. When considering join conditions such as R.x overlaps S.x -rather common in the extensible environment for which TriggerMan was designedthe notion of distinct values for an attribute (such as a polygon) may be difficult to define [Haa95]. Since the number of distinct values of an attribute cannot exceed the cardinality of the relation (i.e. the attribute is a key, and all values are distinct), the ratio Card(Relation) / Val[x / Relation] never drops below 1. On the other hand, the selectivity of a selection predicate is always in the interval (0,1). Recalling the definition of the rank= ^^^^^^^^^^y ~ ^ ^ ranks of joins will therefore be positive cost numbers while those of selection predicates will remain strictly negative, no matter what the costs of the predicates are. Consequently, the optimization strategy becomes useless because the ordering of predicates always puts selections before joins, since the ranks of joins are always strictly larger than those of the selection predicates. In spite of the fact that selection predicate placement in discrimination networks will not be directly reusing ranks, the general idea behind the heuristic is also based on the possible "good" selectivity of joins, allowing a reduction in the number of times an expensive sigma has to be done. If the cost reduction incurred by doing the (expensive) selection predicate "less often" surpasses the increase incurred by doing the joins more

PAGE 56

45 often, the strategy works and the selection predicate is replaced in the Gator network. However, If the selectivity factor of a join is always greater or equal to 1 , the "traffic" (percentage of tokens flowing along an edge) is never reduced after that join operation is executed, and the optimization fails in all cases. Therefore, the Gator network model needs to be extended to represent joins that can have selectivity factors in the [0,1] interval, which are the joins of "interest" as far as the selection predicate placement is concerned. Going back to the previous example with relations R and S joining into J, the join operation resulting in the J node can be viewed in two symmetrical ways: Start off with node R empty and node S full, and gradually insert tokens in R until it gets full. At that point, the cardinality of the node J will have reached 100 and that of R will be 1000. On average, each token inserted in R will have contributed with 0.1 tokens on the edge R^J, and one out of every ten tokens inserted in R will have joined if each joining token finds a single match. Start off with node R full and node S empty, then gradually fill up S until it reaches its cardinality of 100 tokens. Since the join operation is independent of the joining order (R join S = S join R), the resulting joining node is exactly identical to the one obtained in the first approach. In this case, each token inserted in S will have contributed, on average, with one token on the edge S->J. (see figure below). JSF: 1/1000 Cardinality: 1000 R Cardinality: 100 Edge Selectivity: 0.1 Cardinality: 100 Edge Selectivity: 1

PAGE 57

46 The selectivity of joins will therefore be defined with respect to each input stream to the join. The value of that selectivity represents the "filtering" capacity of the join with respect to the corresponding base node; i.e. what proportion of tokens is generated along that edge for each token coming into the base node. The value 0.1 means as described above, that for each token inserted in R (whether it does join or not), in average 0.1 tokens will be generated down the edge R->J. In other words, the edge R-^J divides the traffic coming into R by a factor of ten. Remarks : The Edges Selectivity Factors (referred to as ESFs) will be represented as shown above, directly on the edge. ESF(R-^J) is computed as JSF(R<-^S) * Card(S) and symmetrically ESF(S^J) is computed as JSF(R<-4S) * Card(R). Since Card(J) is defined as JSF(R<-^S) * Card(R) * Card(S), it follows that ESF(R-^J) = Card(J) / Card(R) and ESF(S^J) = Card(J) / Card(S) The ESF is not confined to the interval [0,1] (or (0,1) ) as JSF values are. 4. 1 .2 Predicates Selectivities and Costs: The standard assumptions of query optimizers about the accuracy of database statistics (estimates of parameters values) and the independence of the selectivity factors of predicates will be made in the remainder of this work.

PAGE 58

47 The results on the existence of a constant differential (per-tuple) cost for all the well-known join algorithms [Hel92, Hel95] will be reused in this work when describing a rank based operation: The partial differential costs of joins with respect to their base nodes are constants for all the well-known join algorithms, and hence the cost of a join per tuple of each input is typically well defined and independent of the cardinality of either input. 4.2 Allee-Gator Example: Given the definition of the edge selectivity factor, let us consider a practical example. Doctor Guy Thorfan from the wildlife department of the University of Florida is currently working on a research project involving fauna observation in multiple locations around the state of Florida. One of his concerns is the presence of GATORS in the lake AUee area in Alachua County. Data is gathered as follows: graduate students using laptops and state of the art GPS technology, continually observe the areas of interest and type in their "sightings" that are immediately sent back to the central database. In order to spot the presence of GATORS around Lake Allee, Dr Thorfan writes the following trigger against the database of the department: Create Trigger Allee-Gator From Fauna_Sightings, InterestArea Where hiterestArea.Box contains Fauna_Sigthings.Location And Fauna_Sightings.Animal = "Gator" And InterestArea.User = "Thorfan" The Fauna_Sightings table is defined as Fauna_Sightings(ID, Animal varchar(255), Location point), and a listing of its data is given in Appendix A. The Lake

PAGE 59

48 Allee area of interest is supposed to be geographically represented with the box delimited with the comer points (0,0) and (1,1) in a predetermined system of coordinates and we will suppose, for simplicity reasons but without loss of generality, that the selection "User = Thorfan" returns only that box. The InterestArea table is also shown in Appendix A. All nodes in the network are considered to be stored without any indexes defined on any of the attributes. Also, the database will be supposed to be insert-only so that delete frequencies can be ignored in the context of this example. The discrimination network representing this rule is described below: Cost: 10 / Selectivity: 0.6 <7l Animal = "Gator^2 Usei="Thorfan" tti. Box contains tti. location Cardinality: 12 ai-^-^' 2o C2:50'^2 Cardinality: 1 ^^/12\^^^^/'^^ P Cardinality: 1 Node tti is based on the relation Fauna_Sightings and has a selection predicate (5\ "Animal = Gator" defined on it. The selectivity of <3\ is represented by the number of tuples where "Animal = Gator" (displayed with a gray shaded background in Appendix A) over the cardinality of Fauna_Sightings: 12/20 = 0.6. Also, the selection is supposed to return one unique row from table InterestArea with a Box value of (0,0,1,1). The selectivity of C2 is not critical in the context of this example; therefore we will give it an arbitrary value of 0.1. The selection predicate G\ and the join operations between tti and tti have been assigned differential cost values indicating the ratio of the complexity of one operation with respect to the other in some predefined unit.

PAGE 60

49 The rank of a\ is defined as: rank(ai) = Therefore rank (Oi) = Mzii ^ _o.04 10 Sel(ai)-1 Cost(ai) The result of the join of ai and ai is a one-row table: Animal Location User Box Gator (1,1) Thorfan (0,0,1,1) And the selectivity of the join with respect to tti is then 1/12 = 0.083. Therefore, rank(Join/ai) = Q Q^^ ~ ^ = -0.0458 20 Representing both ranks on a linear axis, we can see that the rank of the join is strictly smaller than that of the selection predicate. Join/a 1 Gi 1 h-0.046 -0.04 0 ranks axis And according to the rank ordering of predicates, the selection predicate C\ should be pushed down to produce: True a, -

PAGE 61

50 Let us now reconsider the same network with insert frequencies assigned to each input edge and checlc the results given by the discrimination network cost formulas: If C is the initial cost of the network, let AC^ and AC" represent respectively the increase and decrease components that occur in C when the G\ operator is pushed down. The AC part of the variation of C will have three components: The increase in the frequency of execution of the join operation between tti and for each token inserted in tti. The increase in the frequency of insert operations applied to node ai. That value will be represented as AT. The increase in the differential cost C2 of joining tokens from to ai (since tti is larger) noted AC2. (AC2 > 0). AC~ will be the decrease in the frequency of execution of Gi (if any). When Oi is pushed down, some of the parameters of the network do change, as follows: The cardinality of a\ grows from 12 to 20 tuples (or tokens). The resulting join of tti with tti becomes the following table

PAGE 62

51 Animal Location User Box Gator (1,1) Thorfan (0,0,1,1) Crow (1,1) Thorfan (0,0,1,1) Frog (1,1) Thorfan (0,0,1,1) And the join selectivity factor becomes 3/(20 * 1) = 0.15. The selectivity of edge (tti^P) becomes 0.15 * 1 = 0.15 and that of edge(a2^p) becomes 0.15 * 20 = 3. P The values for the increase and decrease in the global cost of the network are then: AC : (0.8 0.48) * 20 + AT + 0.02 * AC2 = 0.32 * 20 + Af + 0.02 * AC2 = 6.4 + AT + 0.02 * AC2 AC": [0.8 -(0.8* 0.15 + 0.02* 3)] * 10 = (0.8 0.18) * 10 = 0.62* 10 = 6.2 Since AC* is greater than AC (AT and AC2 are strictly positive terms), the operation of pushing the selection predicate down increases the cost of the network and should not be done. The 80/20 distribution of tokens is therefore not favorable to a pushdown of the selection predicate ai.

PAGE 63

52 Consider now the same network with a different distribution of insert frequencies: P I P In this case we have voluntarily biased most of the insert traffic towards the edge containing the selection predicate G\. The pushing of C\ transforms the network as described in the figure above. The values of AC* and AC" are given below: AC^ : (0.99 0.594) * 20 + AT + 0.001 * AC2 = 0.396 * 20 + AT + 0.001 * AC2 = 7.92 + AT + 0.001 * AC2 AC: [0.99 -(0.99* 0.15 + 0.001 * 3) ] * 10 = (0.99 0.1515) * 10 = 0.8385 * 10 = 8.38 Clearly, AC* could be smaller than AC in this case. If At and AC2 are small enough and/or can be neglected, then the selection predicate pushdown operation improves the cost of the network and should be executed. This example pointed out at how the distribution of insert frequencies alone can affect the decision surface regarding the proper placement of selection predicates in discrimination networks. In the following section, the reader will find a generalization of this example and a short analysis of the behavior of cost formula-based decision surfaces with respect to distributions of tokens frequencies.

PAGE 64

53 4.3 Generalization of the Cost Considerations: Fi F2 Pnode Figure 4-1 Analysis of selection predicate pushdown Let us consider in this analysis (without loss of generality) an insert-only discrimination network, with two alpha nodes joining into a Pnode as described on the left side of Figure 4-1 above. It will be considered in the context of this study, that there is no index defined on any attribute, as adding indexes will increase the complexity of the cost formulas without having any impact on the results of the analysis. (The reader will find an elaboration on this aspect in Chapter 5, section 5.2.3). Fi and F2 are the frequencies of tokens arriving respectively into the left and right input edges of the network. The selectivities of selection predicates a\ and C2 are denoted Sa, and 802. The edge selectivity factors of edges (ai^p) and (tta^P) are represented by Sj, and Sj2. (Recall that Sj, and Sj2 are defined as JSF12 * Card(a2) and JSF12 * Card(ai) respectively). Finally, ai has been assigned a cost of C0, and the differential costs of the join between tti and a2 are denoted Q, and Q2.

PAGE 65

54 Sc-l By definition of the rank metric, the rank of a\ is: rank(ai) = Coi Si — 1 And the rank of the join with respect to ai is defined as: rank(J/ ai) = — The rank based heuristic orders the predicates in ascending order of the rank metric, and will therefore decide to perform a PUSHDOWN when the rank of the join is smaller than the rank of the selection predicate Gi: rank(J/ai) < rank(ai) Replacing the expressions of rank(J/ai) and rank(ai) with their respective values as they were defined above, we can rewrite: rank(J/ ai) < rank( ai) <=> ^ < — — Cji Coi Then by multiplying each side by Cji and dividing by (Soi1), we get C 1 Si — 1 < — (reversing the comparison since Soi 1 < 0) Col Sol — 1 Which can be rewritten as: o Cji <^!^* Coi (1) l-Soi The rank heuristic will therefore perform a pushdown when the condition (1) above is 1 S'l true and not perform the pushdown when Cji > * Coi 1-Soi Let us now consider the relocation of the selection predicate ai above the Pnode from the aspect of the Gator network cost formula. After pushing down the selection predicate Oi, we get the discrimination network described on the right side of Figure 4-1. Since the values of JSF12 and So, remain unchanged under the assumption of independence of the selectivity of predicates, the edges selectivity factors values become Sj, and Sj2' (Sj, = JSF12 * Card(a2), which is unchanged, and Sj,' = JSF12 * Card(ai), which is changed). The node ai now receives an input stream of frequency Fi instead of Fi* So,. The frequency of tokens now going through ai is given by (Fi * Sj, + F2 * S02 *

PAGE 66

55 Sj,') where Fi * Sj, is the contribution from the left side of the join, and Ft * So: * Sj/ the contribution from the right side. Reusing the same notations to describe the variations in the cost of the Gator network, let AC"^ and AC" represent respectively the increase and decrease components in the cost C of the network when Oi is pushed down as shown in Figure 4-1. AC"^ has three components: The increase in the number of times the join from ai to a2 is executed. The increase in the number of insertions into node ai . The increase in the cost of joining tokens from to ai : ACj2 (positive quantity). The value of AC"*^ is therefore: AC" : (F, Fi * Sa.) * Cj, + (F, Fi * So,) * (CPU^eigh, + 2 * I/Owdght) + F2 * So, * AQ, Since (CPUweight + 2 * I/Oweight) is the insertion cost into an alpha node without indexes. (The reader may refer to chapter 3 and [Han92] for more details on the cost formulas). Replacing the quantity (CPUweight + 2 * I/Oweight) by "Ic" to simplify our notations, we can rewrite AC"^ as follows: AC" : = Fi * (1 Sa,) * [Q, + (CPUwdgh. + 2 * I/Oweight)] + F2 * Sa, * AQ, = = Fi * (1 Sa,) * [Cj, + Ic] + F2 * So: * ACj2 Since the selectivity factor Sa, is a value strictly smaller than 1, it follows that Fi * Sa, is also strictly smaller than Fi, and therefore AC^ is always a positive number. (AC* > 0). The decrease in the cost of the network C lies in the (possible) reduction of the frequency of execution of the operator C\ : AC" = [Fi (Fi * Sj, + F2 * Sa, * Sj,')] * Ca,

PAGE 67

56 In order to analyze the influence of the distribution of token frequencies over the pushdown heuristic, let us consider that Fi = m * F2 with me Q"^* (Q^* being the set of positive, non-null rational numbers). Replacing F2 by F|/m we can rewrite AC~ o AC" = [Fi (Fi * Sj, + ^ Sc2 * Sj2')] * Ca, m Then by factoring out Fi and using m as a common denominator o AC" =Fi*(l-Sj,*^''V Ca. m ^AC"=F.*( "^"^^^'*^" -S.)*Co. m The PUSHDOWN operation will be beneficial when the decrease in the network cost C is strictly greater than the increase: 0 < AC"^ < AC". After replacing AC"^ and AC" by their respective values, we can rewrite: O Fi * ( 1 Sa,) * [Q. + Ic] + — * S02* ACj2 < Fi*( """^^''*^°' Sj, ) * Cm m m And after simplifying by Fi (non-null) on both sides of the inequality: « (1 S.) * [Q, + Ic] + -^^i^ < ^}IL^>L^ . s. ) * Co, m m Rearranging terms and dividing each side of the inequality by (I-S0,), we obtain m-Sj2'*So2 ^ oQ,< 21 !!*Ca,--^^i^^ -Ic (2) l-So, m(l-SaO Therefore, the selection predicate Ci has to be pushed down when the inequality (2) above is true and will not be pushed down when the same inequality is reversed. To So^ * AC 2 O simplify the notations, we will rewrite the quantity — — as — (Q > 0). m(l-So,) m

PAGE 68

57 Table 1 Conditions for pushing down a selection predicate PUSHDOWN NO PUSHDOWN Rank Based 1-Sc, Q,>i:i^*Ca, l-Sa, Cost formula Based m-Sj2'*Sa2 -Sj, m -Sj2'*So2 Cj,< m 1-Sa, *Ca,— -Ic m m l-Sm *Ca,— -Ic m Table 1 above recapitulates the results of this analysis so far. Setting the insertion cost Ic aside temporarily, one can notice that the limit of the cost formula based criterion (when m is very large: m^<») becomes the rank criterion. While a more precise analysis is required to reach meaningful conclusions, this remark gives an initial picture on the following issues: What is the influence of the distribution of token frequencies (recall that different values of m represent different distributions of token frequencies) over the decision for pushing down a selection predicate using the cost formula based approach? How do the rank and cost formula based approaches relate to each other, and how can different token distributions affect this relationship? In order to gain a better understanding of the two points raised above, it is necessary to proceed at this point of the analysis with a study of the function f(m) that is defined below: m-Sj2'*So2 Letf(m) =m -Sj> 1-Sa, m*Co, With Sj„ Sj2', Q, Ca, e R** and Sa„ Saj e (0,1)

PAGE 69

58 f(m) is simply the cost formula based criterion where both sides have been divided by the constant Co, and the insertion cost Ic set aside (temporarily). The remaining part of this chapter will cover a short mathematical analysis of the function f(m), followed by a set of comments on the corresponding results. 4.3.1 Influence of Token Distributions on Decision Surfaces for Pushing a Sigma: As mentioned above, a brief study of the function f(m) will show the effects of m, (i.e. the distribution of token frequencies) over the decision surface for pushing selection predicates in the discrimination network, as well as its connection with the rank based ordering approach. Domain of definition of f(m): The domain of definition of f(m) Df is R"^*, i.e. m e (0, +0°) Limits of f(m): The limits of f(m) are as follows: Limf(m) ) -°o Lim f(m) ) 1-Sa, Continuity of f(m): f(m) is continuous over its domain of definition Df. Variations of f(m): The variations of f(m) are given by the sign of its first derivative over the domain of definition Df. Since the derivative of Sj,/(1 Sa,) with respect to m is null, the first derivative of f(m) with respect to m is given by the following term:

PAGE 70

59 m-V*So: Q af(m) _ m*(l-So,) m*Coi 3m dm The quantity above being the difference of two ratios, it can be developed into: -i-— ^ * [m * ( 1 So,)] * (m Sj: '* SaO ai^TT ] dt(m) 3ni 3m m*Coi ^ y — 3m [m*(l-So,)f 3m The values of the derivatives of each term with respect to m allow us to rewrite it as: 3f(m) m*(l-Sa,)-(l-Sa,)*(m-Sj2'*So2) Q ^ — I = 1" 3m [m*(l-So,)f Ca,*m^ Factoring out (1 So,) in the numerator of the first term 3f(m) (l-Sa,)*[m-(m-Sj2'*Sa2)] Q = 72 + 3m [m*(l-So,)] Co,*m^ And simplifying, we finally get 3f(m) (l-Sa,)*Sj2'*So2 Q ^ — = — I3m [m*(l-Sa,)f Coi*m^ — — is defined over the domain of definition Df. It is the sum of two positive 3m quantities. Indeed, the numerator of the first quantity is the product of three strictly positive terms, and the denominator is a square. Similarly, Q is also a positive quantity and so is the denominator Co, * m^. Consequently, the derivative is strictly positive over Df, and it follows that f(m) is a monotonically strictly growing function of m. The table below recapitulates the results of the analysis of f(m) obtained so far.

PAGE 71

60 Table 2 Analysis of f(m) m 0 +00 af(m) 3m + _^ 1-Sj, f(m) -OO — 1-Sa, For values of m ranging from 0 to infinity, the first derivative remains strictly 1-Sj, positive and the values of f(m) will range (grow) from negative infinity to the ratio 1-So, (which is the rank criterion). As described in Table 1, the function f(m) represents a decision surface (or the variable constituent of a decision surface, if Ic is taken into picture) for the selection predicate push down criterion. A good way to visualize the effect of parameter m over that decision surface, is to geometrically represent f(m), and observe how various values of m can affect it. The geometric visualization of f(m) will be the focus of the next section of this chapter. 4.3.2 Geometric Interpretation of the Analysis of f(m}: Let us consider f(m) as a family of two-dimensional surfaces (denoted Zm) defined in a 3D space. Each member of Zm is a 2D surface and will be denoted Sn,(Sj,, S(ji) and defined as: m-Sj2'*Sa2 Sm(Sji, Sa,)2 = Sj. m So2*ACj2 l-Sc, m*(l-Sm)*Ca, 2 In order to simplify the notations, Sm(Sj„ Sa,) and will refer to the same element.

PAGE 72

61 Xm will therefore consist of a continuous set of two-dimensional surfaces ranging from Sm^o (denoted So by convention even though So is not defined), to Sm^ (denoted S„ by convention). Disregarding the actual shape of the surfaces for the purpose of this analysis, an Sa (arbitrary value of m = a) can be conceptually represented as in Figure 4-2 below: For each value of Sj, and Sa, (representing the X and Y coordinates axis, and ranging in their respective intervals of definition), the Z axis value is taken as Sa(Sj,, So,) and represents the point on the surface SaFrom the analysis of f(m) in section 3.3.1, the Z axis position of any Sj is strictly above that of any Sk when j > k, for given values of Sj, and So,. (The reason being the fact that f(m) is monotonically growing with respect to m, see Table 2). The highest of all surfaces is therefore S„, defined by the limit of Sn, when m becomes very large, and the lowest So, or rather the limit of S^ when m becomes very small, since the function is not defined when m is equal to 0. A visual representation is given in Figure 4-3 below: Z = Sm(Sji, Soi) A Figure 4-2 Representation of a surface S, 'm

PAGE 73

62 Figure 4-3 Relative positions of Sm for different values of m In Figure 4-3 above, is drawn in black and So is transparent (dotted line). The gray surfaces in between represent two values j and k (j > k) of m, between 0 and infinity. Recalling the results of Table 1 (still leaving aside the component Ic), the condition for pushing down a selection predicate using the cost formula based approach can be expressed as: m-Sj2'*So2 g — < ^ 9— which is -^
PAGE 74

63 — ^< — which is — ^ < Soo(Sji, Soi) = Srank Coi 1 Sal Col In other words: the ratio defined by Cj/Ca, has to be located under the surface S„ which happens to be the decision surface for the rank SrankConsider now Figure 4-3 again: the value of the ratio Cji/Co, is such that it is located under but above Sj and S^; which represents two cases (two values of m) where the rank based approach would erroneously perform a pushdown operation. As a matter of fact, both those values of m (m = j and m = k) represent cases where the pushdown operation would induce an increase in the cost of the network. However, there exists values of m, (let one of those values be i, i > j > k) such that the cost formula based surface gets close enough to S_ (which is also Srank) for the pushdown operation to reduce the cost of the network. In that case, (illustrated in Figure 4-4 below), the rank based approach gives a correct reordering of the predicates in the Gator network. Figure 4-4 Addition of surface Si

PAGE 75

64 4.3.3 Qualitative Interpretation of the Analysis of f(m): The analysis of the function f(m) showed that a cost formula based decision becomes equivalent to a rank based decision when the distribution of tokens is very largely biased towards the edge from which the selection predicate is being pushed. This mathematical result can be validated with the following qualitative reasoning: The rank-based approach relies exclusively on the rank of the selection predicate and the rank of the join with respect to the current base node to make a pushdown decision. In our example from Figure 4-1, the participation of the right side of the network (tokens coming through aa) is ignored in the process of considering the pushdown of (J\ with the rank-based approach. On the other hand, the cost formula based approach does take into consideration the participation of both sides of the network into the pushdown decision. Since very large values of the parameter m represent configurations where the totality of the tokens pass through Fi, they also represent configurations where the right side of the network does not participate in the decision for pushing down the selection predicate ai. On the other hand, smaller values of m describe configurations where the right side of the network has a more substantial impact on the decision for performing the pushdown. Consequently, we can validate our mathematical results showing a strong similarity between rank-based and cost formula-based decisions for large values of m, since they correspond to configurations where the participation of F2 is negligible or null, which is the underlying assumption of using a rank-based strategy.

PAGE 76

65 4.4 Conclusion: The rank based decision surface represents in this case an upper-Hmit to the cost formula based decision surface, that is reached when parameter m is very large. (Large values of m represent cases where the token distribution is extremely biased towards the input edge where the selection predicate is located). In such configurations, the rank and cost formula based approaches are likely to reach similar decisions. However when m drops towards values balancing the token frequencies among the input edges, or even lower values reversing the bias in token distributions, the cost formula based decision surface moves away from Srank and erroneous results can be obtained with the rank based reordering of predicates. Reconsidering the results of Table 1, the exact condition for pushing down a selection predicate, based on the cost formula approach was: m-Sj2'*So2 i>j> Q Cj,< ^ *Ca,-^-Ic l-Soi m which can be rewritten as follows: m-Sj2'*So2 Ci O Ic — < — * Cai A where A is a constant = — Coi 1-Soi m*Coi Coi The only effect of constant A on the analysis, performed in sections 4.3.1 and 4.3.2, will be to change the position of S„ to be located at a Z coordinate distance of A under Smnk rather than being equal to it. S„ therefore becomes distinct from Srank, but none of the influence of m over the relative positioning of Sm and Srank is affected, and our conclusions are still valid.

PAGE 77

66 Taking into consideration delete tokens introduces additional algebraic complexity, in the cost formulas and in their notations (mainly for AC), but no conceptual change with respect to insert tokens considered in this analysis. The extension to cases where a beta node has more than two children is briefly covered in Appendix B and does not introduce any conceptual change either. In conclusion, the influence of the tokens distribution remains an important parameter, causing the position of the decision surface linked to the cost formula to vary widely and making a rank-based decision possibly incorrect.

PAGE 78

CHAPTER 5 STRATEGffiS: DESIGN The example and analysis presented in Chapter 4 showed why the rank metric was not adopted to perform the placement of selection predicates in discrimination networks. The position of the decision surface for pushing down (or pulling up) a selection predicate varies according to multiple parameters including: The distribution of token frequencies among input edges, The selectivity factor of the selection predicate, The selectivity factors of the network edges the sigma is moved along, The cost of the selection predicate. The differential costs of the joins along the path of the pushdown. And by ignoring the distribution of token frequencies among input edges and various node related costs, the rank metric is likely to fail to produce a correct reordering of the predicates in many cases. In the process of investigating the possible strategies that can be applied towards obtaining a correct solution this problem, the most straightforward idea is to try to adapt the rank metric to discrimination network by trying to incorporate the missing parameters. In the following section, the idea of adapting the rank metric to Gator networks is investigated in detail. 67

PAGE 79

68 5. 1 Adapting the Rank Metric to Gator Networks: The selection predicate pushdown criterion based on the cost formula from Table 1 was: m-Sj:'*So2 -Sj, 1-Soi m*(l-Sai) If the insertion cost Ic is considered negligible (Ic = 0), and the assumption about the differential cost Ch being constant (see section 4.1.2) admitted (AQz = 0), the rank of the join (with respect to aO can be redefined to the following expression: m-Sj2'*So2 Sj rank(J/ai) = (3) Cji As a matter of fact, rewriting rank(J/ai) < rank(ai) (the rank based criterion for the selection predicate pushdown) gives now the same result as the cost formula based approach. When the insertion cost Ic is taken into perspective, the definition of the rank of the join has to be changed to: , m-S/*S.. Ic*(l-So.) , _ ^ m-Sj2'*S.2 , ^ ^ Sji-[ J Sji-[ I-Ic*rank(ai)] rank(J/ai) = 3^— ^ = ^ Cji Cji The elimination of the second assumption about the differential cost Ck introduces So2 * AC 2 an extra term of the form — in the numerator of the expression above. Without m^Loi complicating the expression of the rank of the join any further, it can be noticed from the expression above, that the rank of the join is not an independent quantity anymore, but is related to the rank of the selection predicate being pushed down. Indeed (and unless the insertion cost Ic is neglected), parameters of the rank of the selection predicate now appear in the expression of the rank of the join. This new characteristic of the rank of the

PAGE 80

69 join is certainly not a desirable one, since the positioning of a selection predicate with respect to a join predicate may cause the rank value of the latter to change. One of the consequences of this is an impact on the convergence of a rank based algorithm for Gator networks. Clearly, a complete analytical model for a Gator networks adapted rank metric, would quickly reach the limits of tractability, especially when considering the following required cases to be incorporated into the already cumbersome expression shown above: Incorporate the cases where beta nodes can have more than two children. Incorporate the cost of delete tokens into the global cost analysis of the network. Take into account the possibility of more than "one-level" pushdowns. Working with a (relatively) simplified representation of the rank such as the one presented in equation (3) above is not an alternative in this case for the following reason: The resulting loss of accuracy would make this model inadequate for integration with the existing Gator networks optimization algorithm(s). Indeed, the overall accuracy (and reliability) of the optimizer would become limited by that of its "worst" component, the selection predicate placement module in this matter. Therefore, adding the selection predicate placement heuristic would reduce the accuracy of the existing optimizer, which is not a desirable effect. While the ordering of selections and joins based on the rank metric does yield satisfying results in query processing, it is our opinion that trying to adapt it to the problem of ordering selection predicates among joins in a Gator network will suffer from the following major limitations:

PAGE 81

70 First, the rank metric does not naturally extend the existing cost model and randomized algorithm(s) described earlier in Chapter 3, based on a set of local change operators. Thus, the introduction of the notion of a rank ordering of predicates in the Gator network optimizer would prove relatively complex to combine with the existing cost formulas in order to achieve a homogeneous and integrated global discrimination network optimizer. While the idea of developing a rank metric for Gator networks is conceptually feasible, the practical formulation of such a metric is very likely to become extremely cumbersome and unpractical of use within the optimizer. The impracticality of a Gator network rank will be furthermore increased by the contextual nature it exhibits, as shown earlier in this section. Since the introduction of a rank based heuristic would leave us with either a crippled (using a simplified version of the rank) or an overly complex and ill-integrated optimizer, it has been decided to avoid investigating further any approach based on an adaptation of the rank metric to Gator networks. Instead, our efforts will be directed towards the design of alternative strategies, better adapted to this problem, and satisfying the following requirements: In order to maintain a maximum continuity with the current cost model and optimizer, any proposed approach should reuse the existing work, including cost formulas, heuristics and local change operators. This will also have the advantage of keeping the complexity of the optimizer as low as possible from both a concepmal and coding / maintenance aspects.

PAGE 82

71 A rank based ordering will still be applied whenever possible, such as during a final pass on groups of selection predicates lying on the same edge of a network. The remaining sections of this chapter will therefore cover a set of proposed strategies for incorporating selection predicate placement heuristics within the Gator network optimizer. Those strategies can be classified into two major groups: Low complexity strategies with little to no impact on the current optimizer search space size and local change operators. This type of approach will use two distinct phases and will be therefore referred to as "Two Phase Strategies." Higher complexity methods requiring extensions to the currently proposed model and resulting in larger search space sizes. For such methods, we will propose a minimal, but necessary, set of extensions and semantic changes to the present local change operators "Kill Beta" and "Merge Sibling." This class of methods runs the optimization in one single phase and will therefore be referred to as "One Phase Strategies." 5.2 Two Phase Strategies: As indicated by its name, this approach will operate in two distinct phases and can be labeled as "simple" for two primary reasons: It does not increase the size of the search space in which the current Gator network optimization algorithm operates. In fact, this strategy comes as an additional pass invoked by the optimizer after the best Gator network has been found, and adds a negligible amount of work to the existing algorithm.

PAGE 83

72 It does not require any modification to the existing local change operators that were described earlier in chapter 3. The incorporation of this method into the existing Gator networks optimizer will require a minimum set of perturbations to the logic of the existing implementation. An algorithmic, high level C++ description of the addition of this method to the existing Gator optimizer is as follows: 5.2. 1 hitegrating the Simple Strategy With the Gator Optimizer: Gator* GetOptimalGator (QueryGraph* Graph) Begin GatorNet* Best-Gator; Best-Gator = RunGatorOptimizer (Graph) Best-Gator = RunSigmaPlacement (Best-Gator) Return (Best-Gator) End The function RunGatorOptimizer() represents an abstraction of the existing Gator optimizer, which takes a rule graph as a parameter and returns the best possible discrimination network with the selection predicates placed above the alpha nodes. RunSigmaPlacementO takes the best structure found so far and tries to improve it further by "sliding" the selection predicate nodes (referred to as sigmas) down the network. The proposed heuristic for pushing down a selection predicate is driven by the token frequencies and can be simply described in English as follows: "Let C be the initial cost of the best Gator network before trying to push down the Sigma node. When a selection predicate a is pushed down, all the insertion, deletion and

PAGE 84

73 join operations above it are executed more often, and cause a cost increase AC"^. In order to achieve a global reduction of C, the a operation has to be done less frequently, and the incurred decrease AC" has to be large enough to counter balance the increase AC^. A necessary (but not sufficient) condition for that configuration to happen, is to find a location in the network where the frequency of tokens Fi' is strictly smaller than the original Fj when the a was at the top of the network." Consider the case of Figure 5-1 below: Figure 5-1 Token frequency driven heuristic Theorem: A necessary condition for the positioning of Gi above the P node to be beneficial is: Fi2 < Fi Proof; Consider the network on the left side of Figure 5-1 and let: So, and Co, represent respectively the selectivity and cost of the selection predicate a\.

PAGE 85

74 Cj, and Cj2 represent the differential cost of the join between the two alpha nodes in the sub network defined by the dashed-line box. Fi represent the frequency of tokens coming into a\. C represent the total cost of the network and Ci and C2 represent respectively the costs of the sub networks in the dashed-line and full line. By definition, we have: C = Ci+C2 Ci = (Fi * Co,) + (Fi * So, * Cj,) + (F2 * QO Consider now the network on the right side of Figure 5-1 and let F12 represent the frequency of tokens flowing through the selection predicate C\. Lemma: The properties of relational algebra [Smi75, Tam91, U1182] are such that the cost of the network in the box drawn in the full line is unchanged and remains C2. The cost of the network in the box in the dashed line becomes: Cr = (Fi * Cj,) + (F2 * Cj2') -I(F12 * Co,) and the total cost is C = C,' + C2 C < C 0 C + C2 < Ci + C2 o Ci' < Ci In order to achieve C < C, it is necessary that Ci' < Ci Comparing the terms of C| and Ci' we have: Fi * Cj, > F, * Sa, * Cj, (Since 0 < Sa, < 1 by definition) Similarly, since ai becomes larger after the selection pushdown, the differential cost Cj:' is greater or equal to Q:, and we have F2 * Cj2' >= F2 * Q: . Therefore, if F12 > Fi, it

PAGE 86

75 follows that Fi2 * Ca, > Fi * Ca, and Cf > Ci since all three components of Cr are larger than those of Ci. Hence, the necessary condition for Ci' to be smaller than Ci is that F12 be (strictly) smaller than Fi: F12 < Fi (QED). Below is a high-level algorithmic description of this heuristic. Algorithm: (It is assumed that the best network without the selection predicate placement optimization is returned in Best-Gator, before calling this function). RunSiginaPlaceinent(Gator* Best-Gator) Begin C = NetCost (Best-Gator) For (each <5\ ) F = Fi ai = parent of G\ Remove ai from network and recompute frequencies on path tti — > Pnode. For (each node n^ of type pk or ak on path a; Pnode) If (Fk
PAGE 87

76 5.2.2 Complexity Analysis: A Gator network with N tuple variables has a maximum height of N-1 when it is a left (or right) deep binary tree as shown in Figure 5-2 below: Pnode Figure 5-2 Maximum height of a Gator network For each selection predicate, the function RunSigmaPlacement performs a constant amount of work W by scanning the complete path from the sigma node all the way to the Pnode, searching for a suitable position to relocate it. The length of that path is at most (N-1) and the number of selection predicates to process is considered not to exceed 0(N) (each tuple variable can have one or a few selection predicates associated with it, in the case of conjunction of predicates, such as: di and Cz). The total amount of work spent on all sigma nodes is therefore of the order of 0(N) * (N 1 ) * W which is 0(WN ). If any type of sorting of the selection predicates is performed as a preprocessing or post-processing step (for example, a final sorting by rank of all sigmas lying on the same edges), the added amount of work is not worse than O(N^) [Cor90], and does not increase the overall complexity of the function which remains polynomial. Since the Gator network optimizer calls one of three possible operators at each iteration, the

PAGE 88

77 dimension of the search tree is therefore 0(3"^) for N iterations, and the global state space is exponential by nature. (Some states could be visited more than once during an optimization phase, but the upper bound still remains 0(3^)). Consequently, the addition of RunSigmaPlacement does not increase the size of the state space from a combinatorial complexity aspect. 5.2.3 Influence of Indexes on the Heuristic: It was mentioned at the beginning of section 4.3 that the use of indexes (and the corresponding cost formulas) was not taken into account as far as the problem of selection predicate placement was concerned. The reason behind this simplification is the fact that the pushdown heuristic is independent of the presence or absence of indexes in the Gator network. Indeed (and as described above), the only parameter considered by the heuristic to perform a pushdown is the frequency of tokens flowing across various edges of the network lying between the location of the selection predicate and the Pnode. Since the definition of an index over a (join) attribute has no effect on the "number" of tokens generated by that join condition, it does not have any influence on the resulting frequency of tokens and has therefore no impact on the logic of the heuristic. As a matter of fact, the amount of tokens generated by a join is dependent solely on the selectivity of the join condition, which is a property of the join itself. While the presence of indexes has no consequence on the heuristic itself, it is important to notice that indexes will modify the costs of joins. Consequently, the presence of indexes will have an impact on the global decision regarding whether the cost of the network is reduced by a pushdown or not.

PAGE 89

78 5.3 One Phase Strategies: Contrarily to the strategy described in the previous section, the existing Gator optimizer will not be directly reusable in a straightforward manner it was in the previous case. Instead of adding the selection predicate placement as an extra phase, the idea here is to integrate it within the process of building the discrimination network. In order to do so, one (or more) local change operators are added to the existing set of three: "Create Beta," "Kill Beta" and "Merge Sibling." These new operators (Push Sigma and Pull Sigma) will be called randomly during the local perturbations phase, and compete with the others. The set of strategies based on this approach is of higher complexity than the previous one for the following reasons: The incorporation of new operators into the existing algorithm is a "double edged sword." Indeed, the immediate consequence is a substantial increase in the search space size. While a larger search space allows potentially better solutions to be found, it also has the major disadvantage of possibly causing the optimizer to follow "dead-ends" search paths while searching for local minimums. An important tradeoff issue of optimization complexity versus result quality is to be considered here. The addition of new operators allowing the placement of a a node while the network is being built, brings up issues of semantic nature about the meanings of "Kill Beta" and "Merge Sibling" and their impact on the integrity of the Gator network. Therefore, the behavior of some of the existing operators has to be revised before this type of approach can be implemented.

PAGE 90

79 5.3.1 Complexity Analysis: < In this type of strategy, the number of local change operators is increased from three to five. The number of potential branches at iterative step of the optimizer is therefore also increased from three to five. As a result, the total potential number of states that can be visited is raised from 0(3"^) to 0(5'^) for N iterations (once again, it is possible that multiple states can be visited several times during an optimization phase, but the upper bound still remains). This type of approach has therefore a more serious impact on the complexity than the previous strategy: even though the order of magnitude of the search space is not increased, the potentially added size is exponential versus polynomial in the previous case. 5.3.2 Semantic Issues for Kill Beta: Figure 5-3 Semantic issues for Kill Beta Consider the case of Figure 5-3 above. The left side of the figure describes a configuration where a selection predicate a has been pushed down in the network, just above node pi, as the result of a previous local change operator sequence. Suppose that the next iteration of the optimizer decides to perform a "Kill Beta" operation on node Pi.

PAGE 91

80 If the "Kill Beta" operator is kept unchanged (i.e. behaves the same way it does in the current optimizer and as described in chapter 3), once the operation is performed (node (3] killed), the node a becomes sibling of ^2, as shown on the right side of Figure 5-3. Consequently, every token inserted into |32 has to join with a, which is not supported by the semantics of Gator networks algorithms, since a nodes are predicates only, and not memory nodes. As a result, the Gator network gets into the inconsistent state shown on the right side of the figure above, and a fix becomes necessary for this type of situation. 5.3.3 Semantic Issues for Merge Sibling: Pnode Figure 5-4 Semantic issues for Merge Sibling In the case of a Figure 5-4 above, a Merge Sibling operation is considered after the selection predicate has been pushed above Pi. When node p2 is merged with Pi, it becomes sibling of a as shown on the right side of Figure 5-4. This situation again causes inconsistency in the structure of the Gator network similar to the one presented in 5.3.2.

PAGE 92

81 5.3.4 Suggested Fixes for the Semantic Issues: There are multiple solutions that can be applied towards correcting the inconsistent state in which a Gator network might fall during the operations described in sections 5.3.1 and 5.3.2. Conservative approach: No Merge-No Kill: The conservative approach is to simply "forbid" the killing or the merging of a beta node, when a sigma node is located above the node to kill or the merge node. While this approach is the simplest from an implementation point of view, it may be limited by the disadvantage of possibly reducing the number of operations the optimizer can perform, which can in turn have some impact on the results that the heuristic could produce. Progressive approaches: A more progressive approach consists of allowing the operation to take place, but taking extra actions to fix the inconsistency of the network by relocating the sigma node to a "suitable" location in the network. This would become equivalent to combining a "Push Sigma" or "Pull Sigma" with the "Kill Beta" or "Merge Sibling," every time this type of inconsistent situation is encountered. An example solution would be to bring the selection predicate back to the top of the network, which will be referred as the "Pop To Top" approach. The decision on where to relocate the selection predicate is left as an implementation issue.

PAGE 93

82 5.3.5 Sub-Strategies: Depending on different placement options for the selection predicates, it is possible to generate a number of sub strategies all derived from the same idea consisting of mixing the three traditional local change operators with the selection predicate placement ones. Some examples are listed below: Sigma placement during network build up: The placement of a nodes at different levels of the discrimination network can be performed during the network build-up phase, before applying any local transformation. A more conservative approach would require that new networks be always built with selection predicates at the top and that only local perturbation operators perform the placements. Sigma Push and Pull: The local perturbation operators "Push Sigma" and "Pull Sigma" can themselves operate in a number of different combinations: Random Push and Pull: the placement of the selection predicates is driven by random decisions only. Searched Push, Random Pull: The pushdown operation is guided by the heuristic described for the Two Phase strategy above, while the pulling up is done randomly.

PAGE 94

83 Extensive search for pull-ups and searched push downs for networks with small numbers of tuple variables: extensively search all the possible placements for the selection predicate on the path from the leaf to the Pnode when performing a pull-up and apply the heuristic for a pushdown. Livestigating the possible variants analytically would be exciting but is a very difficult and probably not beneficial task, because of the complexity of the corresponding mathematical models. It is our belief that the investigation of the relative performances, advantages and disadvantages of each sub-strategy will be better handled at the implementation level, with numerical results. In the next chapter, we will describe an implementation of the Gator network optimizer with selection predicate placement capabilities, and discuss some implementation issues and difficulties, and proceed with some simulation results for both simple and advanced strategies (Two Phase and One Phase).

PAGE 95

CHAPTER 6 STRATEGIES: IMPLEMENTATION In the previous chapter, we presented a detailed description of the proposed strategies to perform selection predicate placement in Gator networks from a design point of view. In this chapter, the reader will find an elaboration on some important issues and questions that arise on different aspects of the optimizer, when the proposed strategies are put into practice. The purpose of this chapter was not to write a complete full-fledged implementation of a Gator network optimizer, with selection predicate placement capabilities. Rather, the goal was to investigate and bring into light the possible roadblocks and coding difficulties that the implementers might encounter when writing most of the proposed strategies. The important points that will be addressed in this chapter are as follows: Investigate the coding difficulties related to a new (more efficient) implementation of the local change operators. Look over any necessary revisions to be applied to cost formulas based on the new operators allowing relocation of selection predicates. Verify whether the proposed strategies work and improve the Gator network. Compare the relative running times and result quality of both the simple and complex strategies, and find out whether the latter yields any tangible improvement with respect to the simple one, and in what cases. 84

PAGE 96

85 6. 1 Implementation of Local Change Operators: The local change operators "Create Beta," "Kill Beta" and "Merge Sibling" were implemented in a modified version of Ariel [Han97b] by regenerating a completely new network at each iteration. The function ApplyPerturbation described in section 3.1.2: New-Gator = ApplyPerturbation (CurrentGator, Perturbation) takes as parameters the current Gator network and one of the local change operators, and applies the change to return a new network that is a duplicate of the current one with the local change applied to it. If the cost of the network is improved by the current operation, then the old network is discarded, and the new one kept as the current network, otherwise, the new network is discarded and the old one remains as the current network. Having seen room for improvement in both the memory use and speed of the optimizer, we have decided to implement the local change operators differently this time around. Each of the operators comes in a tripartite structure (three function or methods in the object oriented terminology) which allows trying it and then either validating or undoing the modifications that were made to the network, based on the old and new costs. Consequently, each of the local change operators is implemented with the following methods: Gator* TryOperator (Gator*, char* LocalChange) Gator* DoOperator (Gator*, char* LocalChange) Gator* UndoOperator (Gator*, char* LocalChange) with Operator e {Create Beta, Kill Beta, Merge Sibling, Push Sigma, Pull Sigma}. As mentioned above, the advantages of this approach are straightforward:

PAGE 97

S6 Better memory usage (only one network in memory all the time). Faster optimization process by not having to regenerate and delete a whole network at each step. The implementation of those methods required maintaining some additional state information within the structure of the nodes of the network (old neighbors, old parents, old children etc) in order to allow the rollback when necessary. However, the duplicated information is always localized to the "region" of the network being affected by the current perturbation only, and therefore kept to a minimum. 6.2 Revision of the Cost Formulas hnplementation: Since selection predicates were systematically placed at the top of the Gator network in the traditional optimizer, they did not have a "proper" existence but were rather considered as a preprocessing filtering step to the network. Consequently, the cost of the selection was not computed as a part of the global cost of the network. In this implementation, the ability to relocate a selection predicate potentially anywhere in the network, has necessitated implementing it as an entity (node) with an associated method to compute its cost contribution. Furthermore, the possible presence of a sigma just above a beta node has been at the origin of a few minor, but necessary modifications to the method for computing the cost of beta nodes. A more detailed discussion of both these issues is presented in the following sections. 6.2.1 Cost of a Sigma Node: Three distinct cases were identified for computing the cost of a selection predicate:

PAGE 98

87 The sigma is at the top of the network. The sigma is pushed down and has "regular" children (Alphas and Betas). The sigma is pushed down and has another sigma node as a child. A description of each of the three cases above follows. Sigma at the top of the network: Figure 6-1 Sigma at the top of the network In this case depicted in Figure 6-1 above, the cost of the sigma node is simply the frequency of tokens flowing through the selection predicate, multiplied by the cost of computing the selection condition (noted SelectionCost). SigmaCost (ai)= Fi * SelectionCost {d) Sigma with regular (Alpha and Beta) children: Figure 6-2 Sigma with regular children

PAGE 99

88 In this case, the selection predicate cost will have two components: The cost of computing the selection condition multiplied by the frequency of tokens flowing through the predicate. This frequency of tokens is computed in a similar fashion as it is for beta nodes, by adding up the anticipated participation of each of the children nodes in producing the amount of tokens passing through the sigma node. The cost of executing the joins across the children is also added to the cost of the selection predicate here, as it is the case for beta nodes. SigmaCost (ai)= Fj * SelectionCost (0,) + CostJoinChildren(ai) Note: The reader will notice that the PerChildInsertCost() and PerChildDeleteCost() components that are also added to a beta node with children is not applied to the selection predicate. The reason for this is that sigmas are not nodes in which tokens can be inserted into (or deleted from) but simply predicates acting as filters and allowing tokens through, or not. The PerChildInsertCost() and PerChildDeleteCost() components cannot however be lost, and have to be saved until the first beta node above this sigma is found. At that point, the per-child insertion and deletion costs are applied to that beta node. In order to handle this "transfer" of cost to the next beta node in the network, it was necessary to attach to sigma nodes a simple structure with one entry for each child. Each entry of the list maintains the join size (with respect to that child), and insert and delete frequencies of that child. We will refer to these lists as "update costs lists" and a sample one is presented in Figure 6-3 below:

PAGE 100

89 a, join size * Sel(ai) join size * Sel(Oi) join size * Sel(ai) iFreq * Sel(ai) iFreq * Sel(ai) iFreq * Sel(Oi) dFreq * Sel(ai) dFreq * Sel(ai) dFreq * Sel(Oi) Figure 6-3 Sigma list for transfers of PerChildlnsertCost and PerChildDeleteCost In this case, <5\ has been pushed just above the beta node. The cost of G| includes the cost of executing the selection predicate computation (Fai * SelectionCost(ai)) plus the costs of executing the joins originating from each of the children tti, and a^. The cost relative to the insertion or deletion of the results of those joins is not applied at the level of CS\, but is transferred to the P below G\ through the list described above. The join size, insert frequency and delete frequency are multiplied by the selectivity of the selection predicate (filtering effect) and appended to the list described above, to be processed at the level of the beta node for computing the PerChildInsertCost() and PerChildDeleteCostO components. Sigma with a Sigma child: When multiple selection predicates are pushed down during the optimization phase, it is possible to reach configurations where two (or more) sigmas end up on the same edge of a network. In such a case, a sigma node has another sigma as a child. (Note

PAGE 101

90 that sigmas are always "unique" children of their parent node). This case is illustrated in Figure 6-4 below: F, F: «i'''"^«2'''""a3 ^3 Figure 6-4 Sigma with a Sigma child The selection predicate C2 has a unique child Gi and will simply "relay" the update costs list from child d, after applying the filtering effect Sel(C2) to each value in the list of G\ . The cost of a2 is here again reduced to the selection cost only: SigmaCost (02)= Foj * SelectionCost (CT2) Where Fa2 is computed as Fai * Sel(a2). 6.2.2 Cost of a Beta Node: As a reminder of the Gator network cost model presented in Chapter 3, the cost of a beta node is computed as follows: LocalCost( p) = ^ { Fi(N) * InsertCost (N, P) + Fd(N) * DeleteCost (N, p)} Nechjldren(P)

PAGE 102

91 Where procedures InsertCost and DeleteCost include respectively the cost of performing the join across all children, plus the costs of inserting and deleting from the beta node. This is described in Figure 6-5 below: 1 (^2 ai' ' a2'' " as Join result (3 Cost = Join cost + update costs Figure 6-5 Cost of a Beta node While this is valid for beta nodes with "regular" children, the presence of a sigma node right above a beta has an impact on the direct application of this cost formula. Fi F2 F3 O2 03 II I I \ ^ Selection result 5 Cost = update costs Figure 6-6 Beta node with a sigma child As shown in Figure 6-6 above, updates resulting from joins across the children of the beta node are no longer directly done to the beta node, but go through the filtering

PAGE 103

92 effect of the selection predicate first. Consequently, the update cost for the beta node is computed from the "update costs list" associated with the sigma below it. 6.3 The Optimizer: The optimizer was written in C++ and reached a size of about ten thousand lines of code. Approximately one third of that code is dedicated to the implementation of the local change operators "Create Beta," "Kill Beta," "Merge Sibling," "Push Sigma" and "Pull Sigma," each of them written in the tripartite structure: try operator, undo operator and do operator described earlier in 6.1. In this section, the reader will find a short description of the major features of this Gator network optimizer. 6.3.1 Implemented Randomized Algorithms: The main difficulty in writing this optimizer lay in the correctness of each of the three methods implementing each of the five local change operators. In order to keep a better focus on this task, it was decided to implement only Iterative Improvement as a general randomized algorithm to be executed by this optimizer. The addition of other methods such as Simulated Annealing, should however be a straightforward task once the local change operators are available. Below is an algorithmic description of our implementation of Iterative Improvement.

PAGE 104

93 Begin double oldCost, newCost Gator *BestGator, *CurrentGator int maxNetworks, maxFailures, numNetworks = 0, numFailures BestGator = BuildlnitialGatorNetwork (RuleGraph) While (numNetworks < maxNetworks) CurrentGator = BuildlnitialGatorNetwork (RuleGraph) numFailures = 0 While (numFailures < maxFailures) Operator = GetOp(random) oldCost = Cost(CurrentGator) CurrentGator = TryOperator (CurrentGator, Operator) newCost = Cost(CurrentGator) If (newCost < oldCost) CurrentGator = DoOperator (CurrentGator, Operator) numFailures = 0 Else CurrentGator = UndoOperator (CurrentGator, Operator) numFaiIures++ Endif EndWhile If (Cost(CurrentGator) < Cost(BestGator)) Delete (BestGator) BestGator = CurrentGator Else Delete (CurrentGator) Endif NumNetworks++ EndWhile End

PAGE 105

94 Description of the algorithm above: The algorithm generates a first discrimination network that is considered to be the best one so far and enters the double while loop. A new network is then generated (from the rule graph) at each step of the outer while loop, and modified with local change operators within the inner loop. At each step of the inner loop, one of the operators is randomly selected and applied to the network with the TryOperator method. If the local change improves (reduces the cost) of the network, then the operation is validated, otherwise it is undone. The exit criterion from the inner loop is based on a maximum number of successive failures while trying to improve the network by applying local operators. Each time the cost of the network is reduced, the number of successive failures is reset to zero. When the inner counter reaches the upper bound "maxFailures," it is assumed that the optimization process is stuck in a local minimum and it is time to generate a new network. The current network is then compared with the current overall best (BestGator) and the one with the lowest cost is kept. This process is repeated until the maximum number of networks has been generated in the outer loop. Note: Both parameters maxNetworks and maxFailures are dynamically generated based on the number of tuple variables in the rule graph, allowing by this way longer searches for larger rules (and networks). The factors connecting the number of tuple variables to maxNetworks and maxFailures are also parameters to the optimizer. Minimum thresholds for maxNetworks and maxFailures are also set in order to guarantee a "reasonable" amount of searching in all cases. Below is a quick overview of the most important parameters of the optimizer.

PAGE 106

95 6.3.2 The Optimizer Characteristics: Unless specified otherwise for a particular test, the characteristics of the optimizer will be tuned as follows: Sizes of relations: The cardinalities of the relations will range from a minimum of 1000 to a maximum of 100,000 tuples. The distribution of the cardinalities uses an 80/20 rule with a majority of small to medium relations (80%) and a minority of large ones (20%). Selectivity of joins: The selectivity of joins will be tuned so that the result of a join is never larger than the largest of the base nodes participating in it. lOWeight and CPUWeight: Secondary storage access time has not improved dramatically during the past decade. On the other hand, microelectronics and VLSI technology improvements have caused central processing units to almost double in power every couple of years. The weight of CPU operations is therefore becoming less and less important with respect to that of a disk access. The ratio of lOWeight to CPUWeight has been set to 2000 to simulate tests on Pentium Pro workstations running at clock speeds of 200Mhz. Size of disk pages: The size of disk pages will be set to 8 Kilobytes. Size of tuples: The maximum size of tuples has been set to 100 bytes.

PAGE 107

96 6.4 Experimental Results: In this section we will present some experimental results which were obtained after implementing chosen variants of the strategies described in chapter 5: Two-Phase Strategy (noted TPS) and One-Phase Strategy with No-Merge No-Kill (noted OPS). All the tests were run on an 8-way 248 Mhz SUNW, UltraSPARC-II CPU Sun Sparc machine (model Ultra Enterprise 4000/5000) with 2 GB of memory and a reduced workload, allowing our process to dispose of the full computing power of one processor per run. The running times are therefore for one CPU, and since none of the tests was run with the machine in single user mode, most results may not be "exactly" reproducible. 6.4.1 Two Phase Strategy Running Times: From the complexity analysis of section 5.2.2, it was expected that the second phase of a two-phase strategy should not have any significant impact on the state space of the optimizer. Consequently, the addition of the second phase in which selection predicates are relocated in the network should not have a significant impact on the optimizer's running times. Running times of Phase I and Phase II 35 I ilf^MMIII^Mr^^^MMim 11^1 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Number of tuple variables Figure 6-7 Comparing running times of Phases I and n of TPS

PAGE 108

97 In the graph from Figure 6-7 above, we have plotted the running times (in seconds) of phases I and II of the optimizer for a number of tuple variables ranging from 2 to 15. While the total searching time (summation of running times of both phases) sensibly increases with the number of tuple variables, notice that the impact of the selection predicate placement phase remains negligible with respect to the running time of the first phase. The figure below represents the ratios of the running times of both phases, for the same range of number of tuple variables. Ratios of running times PII/PI 2.00% w 2 1.50% a. o 1.00% c B 0.50% 0.00% nnnfifinnnnn phase II 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Number of tuple variables Observe that the running time of the second phase remains a small fraction (under 2% on average) of the running time of the first phase. Also, as the number of tuple variables increases (above four or five), the ratio decreases sensibly to values not larger than 0.5% on average. This is due to the fast increase of running times of Phase I for networks with sizes larger than four tuple variables (see Figure 6-7). 6.4.2 Testing Running Times for Both Strategies: In this paragraph we will perform a quantitative analysis of the speed of each of the implemented strategies, with respect to different settings of the selectivity factors of joins and selection predicates. The next section will focus on the quality of the results

PAGE 109

98 (costs of networks). Throughout this section (6.4.2), the vertical bars with a percentage vertical axis will represent the difference (in percent) between the running times of OPS and TPS: 100 * (time(OPS) time(TPS)) / time(OPS). Below is a series of three tests focusing on the relative running times of the OPS with respect to the TPS approach for different configurations. Test with selective joins and non-selective sigmas: The initial set of joins and sigma selectivity factors was defined at opposite extremes in order to allow a better analysis of changing of each of them on the global running times of each strategy. In the first graph below (Figure 6-8), joins were selective (the result of a join is smaller than the base nodes participating in the join), but selection predicates were non-selective (majority of sigmas with selectivity larger than 0.9). Relative running times of OPS and TPS 100% 80% 60% 40% 20% 0% JZL nil n 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Number of tuple variables Figure 6-8 Relative running times with selective joins and non-selective sigmas From Figure 6-8 above, it can be noticed that the two-phase strategy is systematically faster than the one-phase approach. The difference in running times increases from 20% for small networks to approximately 60% for larger ones (mean at 42%), indicating running times for OPS about twice the length those of TPS. The

PAGE 110

99 absolute values for the running times ranged from 0.02 to 52 seconds for OPS and from 0.02 to 28 seconds for TPS. In order to examine the impact of the selectivity of joins, the next test will be run with the selectivities of joins set higher (less selective joins) and the same selection predicates. Test with non-selective joins and non-selective sigmas: In the graph below, the same tests were run with the selectivity factors of the joins set to values about 5 times larger than those from Figure 6-8. 100% 80% 60% 40% 20% 0% Relative running times of OPS and TPS .,^. l | , )HHI^^,H,^ l ,^ | ^HIIjpi^ ^^ EL r-l r-i n n 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Number of tuple variables Figure 6-9 Relative running times with less selective joins, non-selective sigmas While one phase optimization is still slower, a clear reduction in the differences between running times can be noticed immediately (average of the difference is now at 23%). This reduction is due to both a decrease in the running times of OPS as well as an increase in the running times of TPS as shown in Figure 6-10 below: Variations in running times of OPS 60 50 40 30 I 20 10 0 1 OPSKjId OPS-new 1 2 3 4 6 6 7 8 9 10 11 12 13 14 15 Number of tuple variables Variations in running times of TPS 50 — 40 1 30 I 20 ~ 10 0 I rm na FBl TPS-old TPS-nev* 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Number of tuple variables Figure 6-10 Variations in running times of OPS and TPS with less selective joins

PAGE 111

100 In order to examine the effect of the selectivity of selection predicates on each strategy, the following test was run with the same non-selective joins (Figure 6-9), and the selection predicates made more selective (a majority of the selection predicates will have their selectivity set under 0. 1 ). Test with non-selective joins and selective sigmas: From Figure 6-11 below, it can be noticed that the gap between the running times of one and two phase strategies is again widened in favor of TPS when the selection predicates are made more selective. The average of the difference between the running times climbs to approximately 60%, giving TPS optimization times about three times shorter than OPS on average. Relative running times of OPS and TPS 80% ! 60% 40% 20% 0% 1 2 3 4 5 6 7 8 9 10 1 1 1 2 13 14 15 Number of tuple variables Figure 6-11 Relative running times with large joins and selective sigmas This increase in the difference of running times between both strategies is caused by a moderate increase in the running times of OPS combined with a significant decrease in the running times of TPS. This variations (illustrated in Figure 6-12 below) were expected since the effect of increasing the selectivity of selection predicates is identical to that of increasing the selectivity of joins from a structural point of view: both lead to higher and narrower networks.

PAGE 112

101 Variations in running times of OPS 70 60 ? 50 S 40 • 30 I 20 10 0 rir:::7i7iltflfl OPS-old aOPS-new 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Number of tuple variables Variations in running times of TPS 1 2 3 4 5 6 7 a 9 10 11 12 13 14 15 Number of tuple variables Figure 6-12 Variations in the running times of OPS and TPS with selective sigmas 6.4.3 Analysis: Before looking into the analysis of the results of the various running times of OPS and TPS, it is necessary to make a few preliminary remarks and define the meaning of a term, which will be repeatedly reused throughout the analysis. Definition of a No-Op: We will define a "No-Op" as a call to a local change operator that cannot be executed because the node on which it is being tried does not qualify to perform it. In such a case, the optimizer increments the inner-loop counter and randomly generates a call to a new operator. Examples of a No-Op are: A call to "Create Beta" on a node that has only two children A call to "Merge Sibling" with a beta node having only two children. A call to "Merge Sibling" with a beta node that has only alpha children. The running time of a No-Op will be essentially zero comparatively with a complete sequence consisting of trying an operator, followed by either a validation or a rollback. The proportion of No-Ops in a test run will therefore have a significant impact on the overall running time of the optimization process.

PAGE 113

102 Preliminary remarks on the effects of selectivity factors: The increase in the selectivity factors of joins (joins less selective) produces larger internal nodes, which makes them more expensive to maintain. Consequently, the optimizer generates "flatter" networks (internal nodes have more children) when joins become less selective. Inversely, when joins are selective, the shape of the network tends towards that of a binary tree where each internal node has two children. The same remark can be made on the selectivity of selection predicates: the less selective they are, the larger the internal nodes will be and the flatter the networks will become, and vice versa. The immediate assertion that can be made from the outcome of the previous three tests is that the two-phase strategy is sensibly faster than the one-phase strategy in all cases. The reason behind this difference of optimization times is predominantly due to two factors: The difference of complexities between the two approaches as they were described in chapter 5 (sections 5.2.2 and 5.3.1). Indeed, the larger state space of one-phase strategies causes the inner loop of Iterative Improvement (see section 6.3.1) to restart more often than in a two-phase strategy approach (each time the network is improved, the inner counter is reset to 0). As a result, each newly generated network from the outer loop requires a larger number of iterations to settle in a minimum state.

PAGE 114

103 The operators "Push Sigma" and "Pull Sigma" are substantially more complicated than the three original ones both in terms of lines of code executed as well as in terms of impact on the network. Indeed, while creating or killing a beta node only affects a network locally, the action of relocating a selection predicate can potentially have a repercussion on a much larger number of nodes as illustrated in the example of Figure 6-13 below: P Figure 6-13 Impacts of "Kill Beta" and "Push Sigma" operators on a network Impact of killing node p ^: When node pi is killed, tti and a2 become children of ^2, but the cardinality of this latter is not affected. The changes in the network are therefore "contained" to restructuring the children of P2 and re-computing its cost. The same remarks are valid for "Create Beta" and "Merge Sibling." Impact of pushing above the Pnode P: If ai is pushed down the network and placed just above P, the cardinalities of nodes ai. Pi and P2 are changed (increased). Consequently, the cost of each of these nodes needs to be re-computed. The change in the cardinality of P2 changes not only the cost of the node itself, but can potentially affect the join plan (order in which the joins are

PAGE 115

104 done) of each of its siblings p3 and Of,. This is true whenever a node, for which the join plan contains more than two nodes, has its cardinality changed. As a result, the relocation of Oi above P has a ripple effect causing a more significant number of nodes to be recalculated and is therefore a lengthier operation. (The same remark also applies to the "Pull Sigma" operator). Since one-phase strategies are likely to execute the "Push Sigma" and "Pull Sigma" operators more often than a two-phase approach, this factor also participates into making OPS slower than TPS. Going back to the results of Figure 6-10, the increase in the running time of TPS (for larger join selectivity factors) is primarily attributed to a smaller proportion of NoOps in the optimization process. When considering the original three local change operators, "Create Beta" "Kill Beta" and "Merge Sibling" that are called in the first phase of TPS, "Create Beta" and "Merge Sibling" require internal nodes with at least three children. (Note that "Merge Sibling" even requires that at least one of those children be a beta node). Therefore, when the shape of the network tends towards that of a binary tree (with more selective joins), the number of operations that can be executed is sensibly reduced and more calls to the local change operators generate No-Ops, leading to faster optimization times. On the other hand, when the internal nodes have more children, the number of No-Ops reduces, and each operator actually executes (try operator, followed by either validation or rolling back), leading to longer optimization times despite a smaller number of iterations (see Note below). In conclusion, an increase in the running times of TPS is actually expected with less selective joins, as obtained experimentally.

PAGE 116

105 Note: An interesting observation we made while running the tests is that high-andnarrow networks will typically require a larger number of iterations than bushy ones. The reason seems to be the fact that the initial random networks usually have bushy shapes and therefore require more iterations to reach their final configuration, when it is (close to) a non-bushy tree (network). Still considering the results of Figure 6-10, the decrease in the running times of OPS is explained by a tangible reduction in the number of iterations (see Note above) combined with an increase in the proportion of No-Ops when the joins become less selective. While the lessening in join selectivity (leading to flatter networks) caused a reduction of the proportion of No-Ops in TPS (and slower running times), it does have an opposite effect in this case and makes OPS faster. This increase in the ratios of No-Ops is attributed to the presence of sigmas within networks that have less internal nodes (because they are flatter), causing higher proportions of "Kill Betas" and "Merge Siblings" to fail. (See sections 5.3.2, 5.3.3 and 5.3.4 about the "No-Merge No-Kill" approach). In the next test (Figure 6-12), the trends were reversed: OPS got moderately slower while the running times of TPS became significanfly faster with respect to those measured from Figure 6-9. As mentioned earlier, these variations were expected since the reduction of the selectivity factors of selection predicates has essentially the same effect on the morphology of the network as the reduction in the selectivity of joins.

PAGE 117

106 6.4.4 Testing Result Quality for Both Strategies: In this section, we will put the focus on the quality of the results (costs of the networks) for the same sets of parameters as those presented in section 6.4.2. The vertical bars will represent here a percentage difference between the costs of the networks obtained by one and two phase strategies: 100 * (cost(OPS) cost(TPS)) / cost(OPS). Note that negative values of the percentages will indicate cases where OPS yields a better result (smaller network cost) than OPS. Testing with selective joins and large sigmas: As the reader will notice from the graph represented in Figure 6-14 below, the differences in the quality of the results are significantly smaller than those obtained for the running times. The absolute value of the difference remains under 10% and OPS performs slightly better than TPS for 4 and 10 tuple variables. In both cases the improvement over TPS remains under 2% and is not considered as a significant one. Relative network costs of OPS and TPS 10.00% 8.00% 6.00% 4.00% 2.00% 0.00% -2.00% -iD D: _ o -J 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Number of tuple variables Figure 6-14 Relative network costs for selective joins and non-selective sigmas In the following test, the join selectivity factors were again multiplied by a factor of 5 in order to estimate the influence of JSFs over the quality of the results.

PAGE 118

107 Testing with non-selective joins and large sigmas: As observed in Figure 6-15 below, the increase in the selectivity of joins does not have any significant impact on the relative quality of the results. The two-phase approach still yields better results despite a moderate decrease in the differences with respect to OPS. Relative network costs of OPS and TPS 10.00% 8.00% 6.00% 4.00% 2.00% 0.00% -2.00% H -J D 1=. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Number of tuple variables Figure 6-15 Relative network costs for non-selective joins and non-selective sigmas In the following run, the joins were left with the same values as in Figure 6-15, and selection predicates were changed so that a majority would have their selectivity factors below 0. 1 . Testing with non-selective joins and selective sigmas: Relative network costs of OPS and TPS 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0.00% -5.00% — -10.00% -15.00% -20.00% -25.00% TT Number of tuple variables Figure 61 6 Relative network costs for non-selective joins and selective sigmas

PAGE 119

108 As can be seen in the graph above, the reduction in the selectivity factors of sigmas did have a significant impact on the resulting networks. Indeed, with rather selective selection predicates, the one-phase approach does systematically return better results than the two-phase. The improvement ranges from a minimum of about 2% to a maximum surpassing 20%, with an average improvement of 10% approximately. 6.4.5 Analysis: From the results of the set of tests ran in section 6.4.4, it shows that the selectivity of selection predicates is the most significant parameter influencing the quality of the results obtained. by the implemented strategies. A majority of selection predicates with a large selectivity factor gives the advantage to a two-phase based approach while a greater proportion of selective predicates changes the trend in a tangible way. From the analysis of the shapes of the generated networks, we have been able to reach the conclusion that the difference in the quality of results is due the fact that the larger state space factor works in favor of OPS in the case of selective selection predicates. Indeed, with a majority of selective selection predicates placed at the top of the network, a two-phase approach almost systematically yields a final network (before the second phase is applied) with a binary shape since all internal nodes are fairly small. The second phase in which selection predicates are pushed down does not change the binary property of the network. On the other hand, the ability of a one-phase approach to relocate selection predicates anytime during the whole optimization process often reaches configurations where one or more internal nodes have more than two children. More generally, the positioning of selective selection predicates is more "sensitive" than that of non-selective ones (i.e. has a

PAGE 120

i 109 more significant impact on the overall cost and shape of the network). Therefore, the larger state space and number of "tries" to relocate sigmas in OPS is beneficial when they are selective and less beneficial when they are not. 6.4.6 Conclusion: In the tables below, the reader will find a summary and recapitulation of the experimental results that were obtained in the previous sections. Table 3 gives a comparison of the running times of one and two-phase strategies as the selectivities of joins and selection predicates become larger (as the predicates become less selective). Table 4 focuses on the quality aspect of the results obtained by each of the strategies. In both tables, the horizontal dimension represents the selectivity factors of the predicates (growing from left to right). Table 3 Comparing running times of OPS and TPS Sigma Selectivity Selective Sigma selectivity factors increase from left to right on this axis Not selective Joins Selectivity Selective Join selectivity factors increase from left to right on this axis Not selective Network Shape \J Networks become flatter and wider from left to right on this axis OPS Running Time TPS much faster than OPS Slow (High running ~ ^.^^ Becomes faster TPS faster TPS Running Time ^ — *^ Becomes ^^^^ -^^^ slower (Low running ^ ' times) ^"""^ than OPS As the selectivity factors of sigmas and joins become larger, the shapes of the networks change from high and narrow to flatter and wider ones (the reason is that

PAGE 121

110 internal nodes become larger and more expensive to maintain, which results in a smaller number of them). On the left side of the table above, OPS running times are high and get lower as we move further to the right. The trend is reversed for TPS. Consequently, TPS is significantly faster (up to 3 or 4 times) than OPS for selective predicates, and the difference in the running times drops towards more moderate values (1.5 or 2) as the predicates become less selective. Note that TPS remains faster than OPS in all cases. However, it is important to point out at the fact that the absolute values of the running times are fairly small, and considering the relative values only might be misleading when choosing a strategy to run. Indeed, even with a running time up to 3 times slower than TPS, OPS still optimizes a network in under a minute for large networks (14 or 15 tuple variables) and under 15 seconds for regular ones (5 to 10 tuple variables). Table 4 below recapitulates the result qualities of both strategies. Table 4 Comparing networks costs of OPS and TPS Sigma Selectivity Selective Sigma selectivity factors increase from left to right on this axis Not Selective OPS Relative Networks Costs OPS returns better results than TPS OPS networks ^ — """'^ costs are low --^ ^ — ^ OPS networks ^ costs become higher relatively to TPS TPS returns better results than OPS TPS Relative Networks Costs (networks with smaller costs) TPS networks ^^^^ costs are high — TPS networks costs become lower relatively to OPS When selection predicates are not selective (right side of the table), the two-phase strategy returns better results on average than a one-phase approach. However, as they become more selective on the left side of the table, the tendency is reversed and OPS becomes more effective. The reason for this shifting in the effectiveness of each strategy

PAGE 122

Ill is explained by the relationship between the selectivity of sigmas, and the optimization process. Non-selective selection predicates have little bearing on the overall result of the optimization process, since their filtering capacity is low. The placement of non-selective sigmas in the network can be considered as an orthogonal problem to that of optimizing the network itself, and trying to mix both in one single phase causes a larger state space without any significant benefit. It is better to treat the optimization of the network and the placement of the sigmas separately, in two phases. On the other hand, when selection predicates become (very) selective, their influence on the shape of the network is more significant, and their placement critical. The problem of placing selection predicates is no more (or less) orthogonal to that of optimizing the network, and running the optimization in one phase becomes beneficial and the larger state space advantageous. TPS or OPS? Given the global results presented so far, our recommendations for the implementation of a complete Gator networks optimizer with selection predicate placement capabilities are the following. If only one strategy is to be implemented, then a two-phase approach is definitely recommended because of the subsequent reasons: The running time of the second phase in which sigmas are placed is completely negligible with respect to the running time of the first phase. Therefore, the addition of the second phase does not increase the running time of the optimizer and "will not harm, if it does not help."

PAGE 123

112 The results obtained by the two-phase strategy in terms of the quality of the returned networks are always acceptable. Even when OPS returns better results than TPS, the differences in network costs are never different by an order of magnitude. If one is willing to go to the extra complexity of implementing both methods, then the optimizer can be tuned to select one strategy or the other based on the relevant parameters. Or both? As we mentioned earlier, even the slowest running times are still under a minute, for networks of 14 or 15 tuple variables. The optimization times drop to about 15 seconds for networks of sizes ranging between 5 and 10 tuple variables. Moreover, these running times will even become smaller with state of the art processors running at clock speeds of 300 to 400 Mhz. Therefore, a very conceivable approach could be to run both strategies and pick the best result.

PAGE 124

CHAPTER 7 CONCLUSION AND FUTURE WORK Although the support of complex triggering capabilities is still limited in commercial database systems (see chapter 1), it is our belief that the design and implementation of TriggerMan as an Asynchronous Rule Processor, and its full compatibility with the Object-Relational model, are an important step towards achieving this goal. In this dissertation, we have presented the major issues, which one is likely to come across while implementing and integrating selection predicate placement capabilities, within a randomized discrimination network optimizer. After briefly describing the Gator network cost model (chapter 3), the inadequacy of rank-based query processing techniques was shown (chapter 4), and a set of alternative strategies (of increasing levels of complexity) was proposed (chapter 5). Next, an implementation phase was carried out (chapter 6), during which the feasibility and relative efficiencies of the proposed strategies were tested and evaluated. In the course of the coding phase, a new and more efficient approach was suggested for the implementation of the local change operators, using a tripartite structure: TryOperator, DoOperator and UndoOperator for each of them. Finally, and based on the experimental results obtained in chapter 6, a set of suggestions was made for integrating the selection predicate placement techniques within the TriggerMan code. 113

PAGE 125

114 While answering many questions on the issue of selection predicate placement in Gator and discrimination networks, this work still leaves open research questions and issues to be investigated in future research endeavors: A general optimization technique widely used to reduce the cost of all types of expensive operations is caching. While this issue was left aside in this work, the possibility of caching the results of a selection predicate for a particular token is a viable optimization add-on, to consider improving token propagation times in discrimination networks. A simple approach would be to implement sigma nodes as lookup tables instead of pure selection predicates. Each time a compound token containing the attribute on which the selection is defined, is propagated through a sigma (cyattr=vaiue( token) = true or false), the result of applying the sigma to that token, or value of the attribute can be cached (function caching). Subsequent (potentially expensive) computations for tokens with the same value of the attribute can be looked up instead of recomputed. The integration of caching in the optimization process would require the development of a corresponding cost formula or model, which is an orthogonal problem to ours. Once a particular cost model or formula has been developed, its integration with our optimizer becomes straightforward since it does not require anything more than plugging the new formulas in the proper methods. (This would be very similar to extending our optimizer with the cases that were left aside such as having indexes on join attributes etc.) As mentioned in chapter 1, many implementations of materialized views and their maintenance reuse active capabilities and active rules. A simplified (but accurate)

PAGE 126

115 description of the relationship between a materiahzed view and a Gator network is given by the following correspondence: If tokens inserted in the Pnode of a Gator network are never cleared out systematically, and kept in the Pnode until they are erased by a corresponding A(read delta minus) token [Han92], then the Pnode becomes a materialized view of the database, matching the condition defined by the Gator network. Consequently, the issue of optimizing materialized views with aggregates, becomes very related to our problem, since aggregates can be considered "similar" to selection predicates, in the sense that they can potentially be relocated at multiple places in the materialized view. Given the heavy use of materialized views in important areas of data management such as data warehousing, it is our belief that the optimization of materialized views with aggregates could benefit from, or reuse and extend, some of the insight and ideas exposed in this work. As a closing remark to this work, we would like to emphasize the investigative nature of this dissertation as a first attempt to explore the various issues and questions related to selection predicate placement, in Gator and discrimination networks in general. A two-phase strategy emerged as the best approach for a first implementation of the optimizer to be written for TriggerMan. However, the introduction of a one-phase approach is not excluded for future versions, as the project gains maturity and efficiency issues take the stand over major end-to-end functionality implementation tasks.

PAGE 127

GLOSSARY ADBMS Active Database Management System. ADT Abstract Data Type. AI Artificial Intelligence. ARP Asynchronous Rule Processor. (Synonymous with ATP) ATP Asynchronous Trigger Processor. DBMS Database Management System. ESF Edge Selectivity Factor. GIS Geographic Information System. IDS/UDO Informix Dynamic Server with Universal Data Option. JSF Join Selectivity Factor. 00 Object Oriented. OPS One-Phase Strategy. TPS Two-Phase Strategy. 116

PAGE 128

I I APPENDIX A ANIMAL LOCATION Gator (1,1) Fros (3,1) Gator (3,4) Gator (5,5) Crow (1,1) Gator (8.4) Raccoon (4,3) Gator (3,0) Gator (1,4) Gator (4,1) Frog (7,3) Gator (3,3) Frog (1,1) Gator (1,4) Gator (3,2) Rabbit (4,0) Gator (3,4) Raccoon (5,0) Mouse (5,1) Gator (5,5) Table Fauna_Sightings User Box Thorfan (0,0,1,1) Thorbait (5,5,30,30) Etc... Table InterestArea 117

PAGE 129

1 APPENDIX B The analysis of the inadequacy of the rank metric for discrimination networks that was presented in chapter 3, rehed on an example network where a beta node had only two children. The reader will find below a generalization of that analysis to cases where beta nodes can have more than two children. For ease of reading, this example is restricted to three children only. However, a simple change in notations is required to represent cases with N children. Consider the network (or sub network) below where the beta node has three children tti, tti and a3. Cost: Ca2 Fi, F2 and F3 are the frequencies of tokens arriving respectively into the left, middle and right input edges of the network. The selectivities of selection predicates ai, G2 and are denoted So,, 802 and So,. The selectiviy factors of the joins are represented by JSFi and JSF2. The "directional" selectivity factors are still defined as they were in chapter 3: 118

PAGE 130

119 Sj,: = JSF, * Card(a2). Sj,, = JSFi * Card(ai). Sj23 = JSFi * Card(a3). Sj32 = JSFi * Card(a2). The semantics of the directional selectivity factor are as follows: Each token inserted into tti produces on average Sj,2 tokens after joining with a2. Likewise, each token inserted into produces on average Sjzi and Sjij tokens when joining with tti and tts respectively. In a similar fashion, a token inserted into as produces on average Sj32 tokens when joining with a2. Tokens inserted into ai will therefore generate an average total of Sjn * Sj23 tokens after joining with both and as. Consequently, the ESF of edge (tti^P) denoted Sji will be the result of the product of Sj^and Sjas : Sjl = Sji2 * Sj23 and likewise: Sj2 = Sj2i * Sj23 and Sjs = Sj32 * Sj2i The reader will notice that when a beta node has more than two children, the ESFs are no longer equal to the "directional" selectivity factors, histead, the ESF of an edge (child-^(3) is computed as the product of all the "directional" selectivity factors encountered in the join plan of the child node in question. Finally, the selection predicate Cj has been assigned a cost of C02, and the join conditions between ai and a2 and a2 and have been assigned differential cost values of Cj,2, Cj2i, Cj23 and Cj32.

PAGE 131

120 Let us consider now the operation of pushing down the selection predicate CSj as described in the figure below. C2 S(T2 p With the assumption that predicates have independant selectivities still holding, the JSF values of each join remain unchanged (as well as the selectivity of the selection predicate 02). Consequently, Sj^ and Sjn change to Sj^' and Sj32'. The changes occuring to Sj^ and Sj32 cause the values Sj, and Sjj to change to Sj,' and Sjj' respectively. Sj2 and the selectivity of csj remain unchanged. Recalling the definition of the rank metric, the rank of G2 is: rank(a2) = ^"'"^ C
PAGE 132

121 Replacing rank(J/a2) and rank(a2) with their respective values defined above, we can 3 , _ 1 So2 — 1 rewrite: rank(J/ ai) < rank( 02) <=> -j^ < — ^ (-^02 Ca2 Then by multiplying each side of the equation above by Q/aa and dividing by (S021), we get the following inequality — ^ < — (reversing the comparison since S02 1 < 0) C02 S02 — 1 1 S 2 Which can be rewritten as I Uj/a2 < * Ua2 (4) I-S02 after each side of the inequality (4) above is multiplied by the positive value Ca2. The rank based heuristic will therefore perform a pushdown when Q/ai is strictly smaller than the right side of (4) and not perform the pushdown otherwise. Reusing the notation from chapter 4, let AC"^ and AC" represent the variations in the cost of the network when the selection predicate G2 is pushed down. AC^: F2 * (Cj2, + Sin * Cj23 + la^) F2 * Saa * (Cj, + * Sja, * Cj, + la^ + Fi * Sa, * AQn + F3 * Sa3 + ACj32 AC^: F2 * (1 SaO * (Cj2, + Sj2, * Cj23 + laa) + F, * Sa, * AQn + F3 * S03 * AQ„ The (potential) decrease in the execution of the selection predicate condition is given by: AC": F2 * Ca2 * [F, * Sa, * Sjr + F2 * Si2 + F3 * Sa, * Sj3'] * Ca2 ACT: {F2 [Fi * Sa, * Sj,' + F2 * Sj^ + F3 * Sa, * Sj3']} * Ca, Analogously to the approach followed in chapter 4, let us define ratios linking the various token frequencies in order to study the influence of the token distributions among the various input edges of the network: Let F| = F2 / mi2 with mi2 e Q"^' And F3 = F2 / m32 with m32 e Q^*

PAGE 133

122 Where Q^* is the set of positive, non-null rational numbers. AC" can now be rewritten as: [F2 (— * So, * Sj,' + F2 * Sj2 + — * So, * Sj,')] * C02 Factoring out F2 F2 * [1 Sj2 J * C02 mi2 m32 And reducing to a common denominator „ . rmi2 *m32 -mn *Soi *Sji'-mi2 *S-< „ J. „ _ , Soi * ACjl2 Sa3 * ACj32 (1 S02) * [Cj21 + Sj21* Cj23 + lai] + — + — < ran m32 mi2 * m32 m32 * Soi * Sji'mi2 * Sa3 * Sj3' I Sj2] mi2 * m32 Finally dividing each side of the inequality above by (1 802), and rearranging terms we get

PAGE 134

123 With the quantities Qi and Q2 being equal to — — ^ — and mi2*(l-SoO m32*(l-Sa2) respectively. The table below recapitulates the results obtained so far. The rank based decision is the one obtained earlier in inequality (4) and the cost formula based approach is taken from the inequality above. Table 5 Recapitulation of the conditions for pushing down 02 PUSHDOWN Rank Based 1 Sr Ci/02 < ... * (2a2 1-Sa2 Cost formula Based mi2 *ni32 -m32 *Soi *Sji'-mi2 *Sa3*Sj3' ^ Cj, + Sj2,*Cj2< m,2*m,2 Qi Q2 1 S02 mi2 m32 While Cj/a2 and (Q, + Sjzi * Qj) might seem to be different at first sight, a more careful examination of the cost incurred when a token is inserted into node clearly shows that they represent the same quantities. Indeed, for each token inserted in a2, cost Q, is incurred once during the join with tti, then the resulting average of Sjzi tokens is joined with as, incurring a cost of Sj2i * Q2. The function defined as f(m) in chapter 4 becomes in this case f(m,2, m32).

PAGE 135

124 mi2 * m?: — mi2 * Soi * Sji ' — m,2 *So,*Sj,' S r. Let f(mi2,mi:) = mi2 * mi2 1-S mi2 ni32 With Sj,', Sj2, Sjj', Qi and Q2 6 R^* and Soi, S02 g (0,1) The properties (limits, continuity and variations) of f(mi2, m32) with respect to the variables niu and are similar to those of the function f(m) defined in chapter 4. The same conclusions can therefore be reached in the cases where beta nodes have multiple children: The variations in the bias of the distributions of token frequencies will cause erroneous decisions if based on the rank metric.

PAGE 136

LIST OF REFERENCES [Alfa98] Al-Fayoumi N. Design and Implementation of a Temporal Trigger Subsystem for the TriggerMan Asynchronous Rule Processor. Ph.D. dissertation, University of Florida. April 1998. [Ast76] Astrahan M. M et al. System R: A relational approach to database management. In ACM Transactions on Database Systems, Vol 1, No 2, Pages 97-137, June 1976. [BerBl] Bernstein P. A., Goodman N, Wong E., Reeve C. L., and J. James B. Rothnie. Query processing in a system for distributed databases (SDD-1). In ACM Transactions on Database Systems, 6(4):602-625, December 1981. [Ban87] Banerjee Jay, Hon-Tai Chou, Garza Jorge F, Won Kim, Darrell Woelk, and Ballou Nat. Data Model Issues for Object-Oriented Applications. In ACM Transaction on Office Information Systems. Volume 5, Number 1, pages 156-173 January 1987. [Bat88] Batory D.S, Bamett J.R, Garza J.F, Smith K.P, Tsukuda K, Twichell B.C and Wise T.E. GENESIS: An extensible database management system. In IEEE Transactions on Software Engineering 14, 1 1 (November 1988), Pages 1711-1730. [Bot96] Botzer D. and O. Etzion. Optimization of materialization strategies for derived data-elements in active databases. In IEEE Transactions on Knowledge and Data Engineering, 8(2): Pages 260-276, April 1996. [Bro85] Brownston L., Ferrell R, Kant E, and Martin N. Programming Expert Systems in 0PS5: An Introduction to rule-based programming. Addison-Wesley. Massachusetts, 1985. [Car86] Carey J.M, DeWitt D.J, Frank D, Graefe G, Muralikrishna M, Richardson J.E and Shekita E.J. The architecture of the EXODUS extensible DBMS. In Proceedings of the IEEE/ACM International 125

PAGE 137

126 i { [Car94] [Cer90] [Cer91] [Cer92] [Cha89] [Cha96] [Chi89] [Cor90] Workshop on Object-Oriented Database Systems. (Pacific Grove, California, September 1986). Pages 52-65. Also available In Proceedings of the 12''' International Conference on Very Large Data Bases. Carey M. J, Dewitt D. J, Franklin M. J, Hall N. E, McAuliffe M, Naughton J. E, Schuh D. T, Solomon M. H, Tan C. K, Tsatalos O, White S and Zwilling M. J. Shoring up Persistent Objects. In Proceedings of the ACM-SIGMOD International Conference on Management of Data. Pages 383-394. Minneapolis, Minnesota. May 1994. Ceri S. and Widom J. Deriving production rules for constraint maintenance. In Proceedings of the 16"' International Conference on Very Large Databases, Brisbane, Australia, August 1990. Ceri S. and Widom J. Deriving production rules for incremental view maintenance. In Proceedings of the IT International Conference on Very Large Databases, Barcelona, September 1991. Ceri S, Fratemali P, Paraboschi S, and Tanca L. Constraint enforcement through production rules: Putting active databases to work. In IEEE Bulletin of the Technical Committee on Data Engineering, Vol.15 (No. 1-4), December 1992. Chakravarthy Sharma et Al, "HiPAC: A research project in active, time-constrained database management", TR XAIT-89-02, Xerox Advanced Information Technology, 1989. Chaudhuri Surajit and Kyuseok Shim. Optimization of Queries with User-defined Predicates. In Proceedings of the 22"^ International Conference on Very Large Databases, Mumbai(Bombay), India, 1996. Chimenfi Danette, Gamboa Ruben, and Krishnamurthy Ravi. Towards an Open Architecture of LDL. In Proceedings of the 15"' International conference on Very Large Data Bases, Amsterdam, August 1989. Cormen Thomas H, Leiserson Charles E, and Ronal L. Rivest. Introduction to Algorithms. Chapter H. The MIT Press, Cambridge, Massachusetts London, England. Mc Graw-Hill Book Company. 1990.

PAGE 138

127 [Day87] Dayal U, Manola F, Buchman A, Chakravarthy U, Goldhirsch D, Heiler S, Orenstein J and Rosenthal A. Simplifying complex objects: The PROBE approach to modelling and querying them. In Proceedings of the BTW 87, H. J. Schek and G. Schlageter, Eds, (Springer 1987). Pages 17-37. [Day96] Dayal U, Buchmann A, and Chakravarthy S. "The HiPac project", In Active Database Systems-Triggers and Rules For Advanced Database Processing. Morgan Kaufman Publishers Inc, Chapter 7, pages 177206, 1996. [Dew93] DeWitt D.J, Luo J, Patel J. M, and Yu J. Paradise A Parallel Geographic Information System. In Proceedings of the ACM Workshop on Advances in Geographic Information Systems. Arlington, Virginia. November 1993. [Dew94] DeWitt D. J, Kabra N, Luo J, Patel J. M and Yu J. Client-Server Paradise. In Proceedings of the !(/'' International Conference on Very Large Data Bases. Santiago, Chile. September 1994. [Dit93] Dittrich K. R and Gatziu S. Time issues in active databases. In Proceedings of the International Workshop on an Infrastructure for Temporal Databases, Texas, June 1993. [Dor96] Dori A, Gal D and O. Etzion. Temporal active databases: A key to computer integrated manufacturing. In Journal of Computer Integrated Manufacturing, 9(2): Pages 89-104, 1996. [Etz91] Etzion O. Active handling of incomplete or exceptional information in database systems. In Workshop on Information Technologies and Systems, pages 46-60, Cambridge, MA, December 1991. [Etz92] Etzion O, A. Gal, and A. Segev. A temporal support in active databases. In Workshop on Information Technologies and Systems, Workshops in Computing, pages 245-254, December 1992. [Etz93a] Etzion O. Pardes A data driven oriented active database model. In ACMSIGMOD Record, Vol.22 (No. 1), March 1993. [Etz93b] Etzion O, A. Gal, and A Segev. Data driven and temporal rules in pardes. In N. Paton and M. Williams, editors. Rules in Database Systems, workshops in Computing, pages 92-108. Springer,

PAGE 139

128 September 1993. [Etz94] Etzion O. An alternative paradigm for active databases. In IEEE Proceedings of the 4''' International Workshop on Research Issw Data Engineering, Houston, Texas, USA, February 1994. [For82] Forgy C. L. Rete: A Fast Algorithm for the Many Pattern/Many Object Pattern Match Problem. In Artificial Intelligence, Vol 19, pages 17-37, 1982 [Fra93] Fratemali P and S. Paraboschi. A review of repairing techniques for integrity maintenance. In N. Paton and M. Williams, editors, Rules in Database Systems, Workshops in Computing, pages 333-346. Springer, September 1993. [Gar89] Gardarin G. et al. Managing Complex Objects in an Extensible Relational Database Management System. In Proceedings of the 15"' International Conference on Very Large Data Bases. Amsterdam, The Netherlands. August 1989. [Ger94] Gertz M. Specifying reactive integrity control for active databases. In IEEE RIDE Proceedings of the 4"' International Workshop on Research Issues in Data Engineering, Houston, Texas, USA, February 1994. [Gut84] Guttman Antonin. R-Trees A Dynamic Index Structure for Spatial Searching. In Proceedings of the International ACM-SIGMOD Conference on Management of Data. Pages 47-57. 1984. [Haa90] Haas L.M., Chang W, Lohman G.M., McPherson J, Wilms P. F., Lapis G., Lindsay B., Pirahesh H., Carey M., and Shekita E. Starburst Mid-Flight: As the Dust Clears. In IEEE Transactions on Data and Knowledge Engineering. 2(1). Pages 143-160. March 1990. [Haa95] Haas P.J., Naughton J. F., Seshadri S., Stokes L. Sampling-Based Estimation of the Number of Disctinct Values of an Attribute. In Proceedings of the IT' International Conference on Very Large Data Bases, Zurich, September 1995. [Han89] Hanson Eric N. An Initial Report on The Design of Ariel: A DBMS With an Integrated Production Rule System. In SIGMOD RECORD, Vol 18, No 3, September 1989

PAGE 140

129 1 [Han92] Hanson Eric N. Rule Condition Testing and Action Execution in Ariel. In Proceedings of the ACM-SIGMOD International Conference on Management of Data. Pages 49-58. San Diego, California, June 1992. [Han96a] Hanson Eric N. The Design and Implementation of the Ariel active database rule system. In IEEE Transactions on Knowledge and Data Engineering, S{1): 157-172, February 1996. [Han96b] Hanson Eric N., Sreenath Bodagala, UUas Chadagala, Yi Goa, Mohammed Hasan, Goutam Kulkami and Jayashree Rangarajan. Optimized Trigger Condition Testing Using Gator Networks. Rm 301 CSE, P.O. Box 11 6 120. CISE Department. University of Florida. November 1996. [Han97a] Hanson Eric N., Khoshla Samir. An Introduction to the TriggerMan Asynchronous Trigger Processor. University of Florida Technical Report TR-97-007. 301 CSE, CISE department. http://www.cise.ufl.edu/research/tech-reports/tr97-abstracts.shtml [Han97b] Hanson Eric N., Bodagala Sreenath, Chadaga Ullas. Optimized Trigger Condition Testing in Ariel Using Gator Networks. University of Florida Technical Report TR-97-021. 301 CSE, CISE department. http://www.cise.ufl.edu/research/tech-reports/tr97-abstracts.shtnil [Has93] Hasan Mohammed. Optimization of discrimination networks for active databases. Master's thesis. University of Florida, CISE Department, November 1993. [Hel92] Hellerstein Joseph M. Predicate Migration: Optimizing Queries With Expensive Predicates. Technical Report Sequoia 2000 92/13, University of California, Berkeley, December 1992. [Hel93] Hellerstein Joseph M. and Stonebraker Micheal. Predicate Migration: Optimizing Queries With Expensive Predicates. In Proceedings of the ACM-SIGMOD International Conference on Management of Data, Washington D.C, May 1993. [Hel95] Hellerstein Joseph M. Optimization and Execution Techniques for Queries with Expensive Methods. Ph.D. dissertation. University of Wisconsin Madison. 1995. [Ioa87] loannidis Yiannis, Wong Eugene. Query Optimization by Simulated

PAGE 141

130 Annealing. In Proceedings of the ACM SIGMOD International Conference on Management of Data, San Fransisco, California, USA. May 27-29 1987. [Ish93] Ishikawa H. and Kubota K. An active object-oriented database: A multi-paradigm approach to constraint management. In Proceedings of the 19"' International Conference on Very Large Data Bases, Dublin, Ireland, August 1993. [Ish94] Ishida Torn. An Optimization Algorithm for Production Systems. In IEEE Transactions on Knowledge and Data Engineering, Vol 6, No 4, August 1994. [Jag92] Jagadish H. V. and Qian X. Intergrity maintenance in an objectoriented database. In Proceedings of the 18''' International Conference on Very Large Data Bases, Vancouver, August 1992. [Kri86] Krishnamurthy R., Boral H., and Zanialo C. Optimization of nonrecursive queries. In Proceedings of International Conference on Very Large Databases, pages 128-137, Kyoto, Japan. August 1986. [Lin88] Linnemann V, Kuspert K, Dadam P, Pistor P, Erbe R, Kemper A, Sudkamp N, Walch G, and Wallarath M. Design and Implementation of an Extensible Database Management System Supporting User Defined Data Types and Functions. In Proceedings of the 14"' Conference on Very Large Data Bases. Los Angeles, California. Pages 294-305. August 1988. [Mir87] Miranker Daniel P. TREAT: A Better Match Algorithm for AI Production Systems. In AAAI, pages 42-47. August 1987. [Mon79] Monma C. L. and Sidney J.B. Sequencing with series-parallel precedence constraints. Mathematics of Operations Research, 4:215224, 1979. [Ong84] [Osb86] Ong J, Fogg D, and Stonebraker M. Implementation of Data Abstraction in the Relational Database System INGRES. In Proceedings of the ACM-SIGMOD Record. 14(1), March 1984. Osbom, S and Heaven, T. The Design of a Relational Database System with Abstract Data Types. In ACM Transactions on Database Systems. Volume 11. Issue 3. September 1986.

PAGE 142

131 1 [Pau87] Paul H. -B., Schek H. -J, Scholl M. H., Weikum G., and Deppisch U. Architecture and Implementation of the Darmstadt Database Kernel System. In Proceedings of the ACM-SIGMOD International Conference on Management of Data. 16(3). Pages 196-207. San Fransisco, California, USA. May 27-29 1987. [Sch90] Schek H. -J. et al. The DASDBS Project: Objectives, Experiences, and Future Perspectives. In IEEE Transactions on Data and Knowledge Engineering. 2(1). March 1990. [Sel79] Selinger Patricia G., Astrahan M., Chamberlin D., Lorie Raymond, and T. Price. Access Path Selection in a Relational Database Management System. In Proceedings ACM-SIGMOD International Conference on Management of Data, Boston, June 1979. [Sis92] Sistla A. P. and Wolfson O. Triggers on database histories. In IEEE Bulletin of the Technical Committee on Data Engineering, 15(1-4), December 1992. [Sis95a] Sistla A. P. and Wolfson O. Temporal conditions and integrity constraints in active database systems. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 269-279, San Jose, California, USA, May 1995. ACM. [Sis95b] Sistla A. P. and Wolfson O. Temporal triggers in active databases. In IEEE Transactions on Knowledge and Data Engineering, 7(3): Pages 471-486, June 1995. [Smi75] Smith J. M. and Chang P.Y. Optimizing the Performance of a Relational Algebra Database Interface. In Communications of the ACM, 18(10). Pages 668-579. 1975 [Sto75] Stonebraker M. Implementation of Integrity Constraints and Views by Query Modification. In Proceedings of the ACMSIGMOD International Conference on Management of Data. Conference on Management of Data, San Jose, California, June 1975. [Sto83] Stonebraker M., Rubinstein B., and Guttmann A. Application of Abstract Data Types and Abstract Indices to CAD Databases. In Proceedings of the ACM Engineering Design Applications. Pages 107-114. 1983.

PAGE 143

132 [Sto86] Stonebraker M and Rowe L. A. The design of POSTGRES. In Proceedings of the ACM SIGMOD International Conference on Management of Data. (Washington, DC, May 1986). Pages 340-355. [Sto87] Stonebraker M., Hanson E. N., and Hong C. H. The design of the Postgres rule system. In Proceedings of the 3^^ International IEEE Conference on Data Engineering, 1987. [Sto88] [Sto90] Stonebraker M. Readings in Database Systems. Morgan Kaufmann Publishers, Inc. pages M1A1%. San Mateo, California 94403, 1988. (Also available in 1994 edition by same publisher). Stonebraker M., Jhingran A., Goh J., and Potamianos S. On rules, procedures, caching and views in data bases systems. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Atlantic City, May 1990. [Sto91] Stonebraker Micheal, Kemnitz Greg. Postgres Next-Generation Database Management System. In Communications of the ACM, Vol 34, No 10, October 1991. [Su93] [Swa88] Su S. Y. W., Lam H., Eddula S. R., Arroyo J., Prasad N. and Zhuang R. OSAM*.KBMS: An Object-oriented Knowledge Base Management System for Supporting Advanced Applications. In Proceedings of the ACM-SIGMOD International Conference on Management of Data, pages 540-541. Washington D.C, May 1993 Swami A. and Gupta A. Optimization of large join queries. In Proceedings of the ACM-SIGMOD International Conference on Management of Data. Pages 8-17. Chicago, June 1988. [Syb96] Clifford, Charles B. Sybase Replication Server. McGraw-Hill Series on Computer Communications. Publisher: McGraw Hill. February 1996. [Tam91] [U1182] [Yaj91] Tamer Ozsu M. and Valduriez Patrick. Principles of Distributed Database Systems. Publisher Prentice Hall. Pages 196-197. 1991 Ullman J.D. Principles of Database Systems (2" edition). Rockville, Maryland: Computer Science press, 1982. Yajima Kenichi, Kitagawa Hiroyuki, Yamaguchi Kazunori, Ohbo Nobuo, and Fujiwara Yuzura. Optimization of Queries Including

PAGE 144

133 ADT functions. In Proceedings of the 2'"' International Symposium on Database Systems for Advanced Applications, pages 366-373, Tokyo, April 1991.

PAGE 145

BIOGRAPHICAL SKETCH Mokhtar Kandil was bom in Morocco on the 3^** of September 1969. After graduating from high school in 1987, he joined the National Institute of Applied Sciences (INSA-LYON) in Lyon, France, where he got a degree of "Ingenieur d'Etat" with a computer science and engineering option, in 1992. He then joined an Agency of the Moroccan Ministry of Housing, where he assumed the position of systems engineer. He joined the doctoral program of the school of Computer and Information Sciences and Engineering at the University of Florida at Gainesville on a Fulbright scholarship in fall of 1994. He has worked since then as a research assistant in the Database Systems Research and Development Center, with interests including computer networks, distributed computing and database systems. 134

PAGE 146

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy. Eric N. Hanson, Chairman Assistant Professor of Computer and Information Science and Engineering I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy. Stanley Y.W Su Professor of Computer and Information Science and Engineering I certify' that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy. Yiiin C. Chow Professor of Computer and Information Science and Engineering I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy. Sharma U. Chakxavarthy Associate Professor of Computer and Information Science and Engineering

PAGE 147

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, This dissertation was submitted to the Graduate Faculty of the College of Engineering and to the Graduate School and was accepted as panial fulfillment of the requirements for the degree of Doctor of Philosophy. August, 1998 Winfred M. Phillips Dean, College of Engineering Karen A. Holbrook Dean, Graduate School