
Citation 
 Permanent Link:
 http://ufdc.ufl.edu/AA00031459/00001
Material Information
 Title:
 Static analysis of ECA rules and use of these rules for incremental computation of general aggregate expressions
 Creator:
 Kim, SeungKyum, 1961
 Publication Date:
 1996
 Language:
 English
 Physical Description:
 viii, 95 leaves : ill. ; 29 cm.
Subjects
 Subjects / Keywords:
 Algebra ( jstor )
Databases ( jstor )
Equivalence relation ( jstor )
Indices of summation ( jstor )
Integers ( jstor )
International conferences ( jstor )
Mathematical variables ( jstor )
Polynomials ( jstor )
Untranslated regions ( jstor )
Warehouses ( jstor )
Computer and Information Science and Engineering thesis, Ph. D
Database management ( lcsh )
Dissertations, Academic -- Computer and Information Science and Engineering -- UF
Rule-based programming ( lcsh )
 Genre:
 bibliography ( marcgt )
nonfiction ( marcgt )
Notes
 Thesis:
 Thesis (Ph. D.)--University of Florida, 1996.
 Bibliography:
 Includes bibliographical references (leaves 92-94).
 General Note:
 Typescript.
 General Note:
 Vita.
 Statement of Responsibility:
 by SeungKyum Kim.
Record Information
 Source Institution:
 University of Florida
 Holding Location:
 University of Florida
 Rights Management:
 The University of Florida George A. Smathers Libraries respect the intellectual property rights of others and do not claim any copyright interest in this item. This item may be protected by copyright but is made available here under a claim of fair use (17 U.S.C. §107) for nonprofit research and educational purposes. Users of this work have responsibility for determining copyright status prior to reusing, publishing or reproducing this item for purposes other than what is allowed by fair use or other copyright exemptions. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder. The Smathers Libraries would like to learn more about this item and invite individuals or organizations to contact the RDS coordinator (ufdissertations@uflib.ufl.edu) with any additional information they can provide.
 Resource Identifier:
 023176982 ( ALEPH )
35030497 ( OCLC )

STATIC ANALYSIS OF ECA RULES AND USE OF THESE RULES
FOR INCREMENTAL COMPUTATION OF GENERAL AGGREGATE EXPRESSIONS
By
SEUNGKYUM KIM
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
1996
Dedicated to My Parents
ACKNOWLEDGEMENTS
I would like to express my sincere gratitude to Dr. Sharma Chakravarthy for the continuous guidance and support during the course of this work. I thank Dr. Eric Hanson, Dr. Herman Lam, Dr. Stanley Su, and Dr. Suleyman Tufekci (in alphabetic order) for serving on my supervisory committee and for their perusal of this dissertation. I would like to thank Mrs. Sharon Grant for keeping the warm and comfortable research environment. I also thank many fellow students at the Database Systems R&D Center for their help and friendship.
Finally, I am deeply indebted to my parents for their endless sacrifice and love to me.
This work was supported in part by the Office of Naval Research and the Navy Command, Control and Ocean Surveillance Center RDT&E Division, and by the Rome Laboratory.
TABLE OF CONTENTS
ACKNOWLEDGEMENTS
LIST OF FIGURES
ABSTRACT
CHAPTERS
1 INTRODUCTION
1.1 Active Databases
1.2 Data Warehouses
1.3 Problem Statement
1.3.1 Support for Alternative User Requirements in Active Rules Execution
1.3.2 Efficient Support of Aggregates in Data Warehouses
1.3.3 Structure of the Dissertation
2 STATIC ANALYSIS OF ACTIVE RULES
2.1 Introduction
2.2 Limitations of the Earlier Rule Execution Models
2.3 Assumptions and Definitions
2.3.1 Rule Execution Sequence (RES) and Rule Commutativity
2.3.2 Dependencies and Dependency Graph
2.3.3 Trigger Graph
2.4 Confluence and Priority Specification
3 IMPLEMENTATION OF CONFLUENT RULE SCHEDULER
3.1 Strict Order Preserving Rule Execution Model
3.1.1 Extended Execution Graph
3.1.2 Strict Order Preserving Executions
3.1.3 Implementation
3.1.4 Parallel Rule Executions
3.2 Alternative Policies for Handling Overlapping Trigger Paths
3.2.1 Serial Trigger-Path Executions
3.2.2 Serializable Trigger-Path Executions
3.2.3 Comparisons with Strict Order-Preserving Execution
3.3 Discussion and Conclusions
4 AGGREGATE CACHE
4.1 Motivation
4.2 Updating Cached Aggregates
4.3 Incremental Update of Aggregates
4.3.1 Syntactic Conventions
4.3.2 Incrementally Updatable Aggregates
4.3.3 Algebraic Aggregates and Non-Algebraic Aggregates
4.3.4 Summative Aggregates
4.3.5 Binding of Variables
4.3.6 Decomposition of Summative Aggregates
4.3.7 Normalization of Summative Aggregates
4.3.8 Incremental Updatability of Nested Summative Aggregates
4.4 Looking-Up Cached Aggregates
4.5 Conclusions
5 CONCLUSIONS
REFERENCES
BIOGRAPHICAL SKETCH
LIST OF FIGURES
2.1 Rule execution graphs
2.2 Overlapped trigger paths
2.3 A conflicting rule set
2.4 Priority graphs for Figure 1.4
2.5 A pair of trigger graph and dependency graph
2.6 A priority graph
2.7 An execution graph
3.1 Overlapping trigger paths and extended execution graph
3.2 Three different orderings of dependency edges in Figure 3.1(b)
3.3 Extended execution graph in strict order preserving executions
3.4 Extended execution graph with rule.counts
3.5 Algorithm - BuildEG()
3.6 Algorithm - Schedule()
3.7 A priority graph with absolute priorities
3.8 Translation to serializable trigger-path execution - Step 1
3.9 Translation to serializable trigger-path execution - Step 2
4.1 A Data Warehouse and Aggregate Cache
Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy
STATIC ANALYSIS OF ECA RULES AND USE OF THESE RULES FOR INCREMENTAL COMPUTATION OF GENERAL AGGREGATE EXPRESSIONS
By
SeungKyum Kim
May, 1996
Chairman: Dr. Sharma Chakravarthy
Major Department: Computer and Information Science and Engineering
In this work we address two major issues that are related within the framework of active databases. Firstly, we propose a practical approach to rule analysis. We show how alternative rule designer choices can be supported using our approach to achieve confluent rule execution in active databases. Our model employs priority information to resolve conflicts between rules, and uses a rule scheduler based on the topological sort to achieve correct confluent rule executions. Given a rule set, a trigger graph and a dependency graph are built from the information obtained by analyzing the rule set at compile time. The two graphs are combined to form a priority graph, on which the user is requested to specify priorities (or resolve conflicts) only if there exist dependencies in the dependency graph. The user can have multiple priority graphs by specifying different priorities depending on application semantics. From a priority graph, an execution graph is derived for every user transaction that triggers one or more rules. The rule scheduler uses the execution graph. Our model also correctly handles the situation where the trigger paths of rules triggered by a user transaction overlap, a situation that existing models do not handle. We prove that our model achieves maximum parallelism in rule executions.
Next, we propose a cache mechanism, called the aggregate cache, for efficiently supporting complex aggregate computations in data warehouses. We discuss several cache update strategies in the context of maintaining consistency between base databases and aggregates cached in the data warehouse. We formally define the incremental update of aggregates, which is a prime issue for the aggregate cache. Further, we classify algebraic aggregates into summative aggregates, which include a vast variety of aggregates applicable in data warehouses to support decision making and statistical data analysis. We prove that there is a precise subclass of summative aggregates that can be incrementally updated. For the incrementally updatable class of summative aggregates, we propose an efficient cache mechanism that allows many user queries to share accesses to the cached aggregates in a transparent way.
CHAPTER 1
INTRODUCTION
1.1 Active Databases
For the past decade, making traditional passive databases active by incorporating a set of rules has drawn a lot of attention from the database research and development community. Originating from the concept of triggers proposed for System R [16] and largely developed from production rule languages for expert systems such as OPS5 [9], research on active databases is now maturing, and several active database systems have been or are being implemented, including HiPAC [12], Ode [19], Postgres [35], Starburst [39], Samos [17], Ariel [29], and Sentinel [13].
While there are variations, a representative active database paradigm is to use ECA (Event-Condition-Action) rules [12] with either set-oriented semantics or tuple-oriented (instance-oriented) semantics over a relational or object-oriented database. An ECA rule consists of three parts: event, condition, and action. Events usually correspond to database operations, especially data manipulation operations such as insert, delete, and update. In some systems, such as HiPAC and Sentinel, temporal events and external events (e.g., user signals) are included too. All of these events are called primitive events. In the object-oriented model, a method call is regarded as an event as well. Conditions are generally assumed to be predicates over parameters and database queries without side effects. An action consists of a set of data manipulation operations or a function call.
When an event occurs in the system, the rules whose event part corresponds to that event are triggered. Of the triggered rules, one rule is picked based on some criteria (or by a process known as conflict resolution). Then, the condition part of the selected rule is tested. If the condition is satisfied, the action part of the rule is executed. The process of picking one triggered rule, testing its condition, and executing its action is repeated until no more triggered rules remain.
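The pick, test, and execute cycle described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the execution model of any particular system: the `Rule` class, the integer `priority` used for conflict resolution, and the convention that an action returns the names of the events it raises are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, int]


@dataclass
class Rule:
    name: str
    event: str                            # event that triggers the rule
    condition: Callable[[State], bool]    # side-effect-free predicate
    action: Callable[[State], List[str]]  # updates state, returns raised events
    priority: int = 0                     # hypothetical conflict-resolution key


def process(rules: List[Rule], state: State, first_event: str) -> List[str]:
    """Repeat pick / test / execute until no triggered rule remains."""
    triggered = [r for r in rules if r.event == first_event]
    executed = []
    while triggered:
        # Conflict resolution: pick one triggered rule (highest priority here).
        rule = max(triggered, key=lambda r: r.priority)
        triggered.remove(rule)
        if rule.condition(state):
            # Events raised by the action may trigger further rules.
            for raised in rule.action(state):
                triggered.extend(r for r in rules if r.event == raised)
            executed.append(rule.name)
    return executed


def restock(state: State) -> List[str]:
    state["stock"] += 10
    return ["update_stock"]


def log_change(state: State) -> List[str]:
    state["log_entries"] += 1
    return []


rules = [
    Rule("restock", "insert_order", lambda s: s["stock"] < 5, restock, priority=1),
    Rule("audit", "update_stock", lambda s: True, log_change),
]
state = {"stock": 3, "log_entries": 0}
order = process(rules, state, "insert_order")
```

In this run the insert event triggers `restock`, whose action raises an update event that in turn triggers `audit`, so the loop ends after two rule executions.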
For event specification and detection in active databases (adopting the ECA rule paradigm), three major approaches have been taken. The common goal of these approaches is to support composite events. A composite event is a composition of primitive events and/or other composite events. For instance, in Sentinel [13], given two events E1 and E2, their disjunction E1 ∨ E2 is defined to occur when either E1 or E2 occurs. While similar sets of composite events are defined in all the approaches, they differ in how such composite events are detected. In Ode, finite automata are used to detect composite events expressed by a variation of regular expressions [21, 20], while in Samos, a labeled Petri net is adopted [18]. In Sentinel, on the other hand, we use an event graph where each leaf node represents a primitive event and an intermediate node represents a composite event consisting of the events represented by its child nodes. When a primitive event occurs, the occurrence is propagated to its parent node. A parent node, with an appropriate restriction for each type of composite event, collects occurrences of its child events, signals an occurrence of the composite event if a certain condition is met, and propagates the occurrence of the composite event to its own parent node [13].
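The event-graph idea can be illustrated with a small sketch: leaves stand for primitive events, inner nodes for composite events, and occurrences flow from children to parents. The class names and the `on_detect` callback below are hypothetical and are not Sentinel's actual interface. The `And` operator is included only to show a node that must collect occurrences before it fires; the text above defines only the disjunction.

```python
from typing import Callable, List, Optional


class EventNode:
    """A node in the event graph; a plain EventNode is a primitive event."""

    def __init__(self, name: str,
                 on_detect: Optional[Callable[[str], None]] = None):
        self.name = name
        self.parents: List["CompositeNode"] = []
        self.on_detect = on_detect

    def signal(self) -> None:
        """Report an occurrence of this event and propagate it upward."""
        if self.on_detect:
            self.on_detect(self.name)
        for parent in self.parents:
            parent.child_occurred(self)


class CompositeNode(EventNode):
    def __init__(self, name, children, on_detect=None):
        super().__init__(name, on_detect)
        self.children = list(children)
        for child in self.children:
            child.parents.append(self)

    def child_occurred(self, child: EventNode) -> None:
        raise NotImplementedError


class Or(CompositeNode):
    """Disjunction: occurs whenever any child event occurs."""

    def child_occurred(self, child: EventNode) -> None:
        self.signal()


class And(CompositeNode):
    """Conjunction (assumed operator): occurs once every child has occurred."""

    def __init__(self, name, children, on_detect=None):
        super().__init__(name, children, on_detect)
        self.seen = set()

    def child_occurred(self, child: EventNode) -> None:
        self.seen.add(child)
        if len(self.seen) == len(self.children):
            self.seen.clear()
            self.signal()


detected: List[str] = []
e1, e2, e3 = EventNode("E1"), EventNode("E2"), EventNode("E3")
either = Or("E1 or E2", [e1, e2], detected.append)
both = And("(E1 or E2) and E3", [either, e3], detected.append)

e1.signal()  # fires the disjunction, which is then collected by the conjunction
e3.signal()  # completes the conjunction
```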
For condition specification and testing, few systems other than Ariel take a sophisticated approach. When a condition is represented by a database query, that query should not update database contents (i.e., it should be free of side effects), and the condition is interpreted as satisfied if the query returns a nonempty result. In Ariel, condition testing is done by an algorithm called A-TREAT [28]. It uses a discrimination network to efficiently compare a large number of patterns to a large number of data items without repetitive scanning. A-TREAT can speed up rule processing in a database environment and reduce the storage requirement of the TREAT algorithm.
There are two representative rule execution semantics: tuple-oriented (or instance-oriented) semantics and set-oriented semantics [30]. When an event occurs, it can trigger one or more rules. These triggered rules are called instances of their respective rules. Even a single rule may have multiple instances at any given time. Note that, as events, data manipulation operations have quite coarse granularity. For instance, an insert event on a relational table is usually defined to occur when any tuple is inserted into the table. In other words, the granularity of the insert event is the whole relation, not a tuple. Also, in SQL, an insert, delete, or update statement can modify multiple tuples in one execution. Therefore, if a rule is triggered by any of the data manipulation operations, such an event is likely to give rise to multiple instances of the same rule. The two rule execution semantics differ in such situations. Suppose one execution of an insert statement inserts three tuples, t1, t2, and t3, into a table, and a rule r1 is triggered by that insert event. In the tuple-oriented semantics, t1, t2, and t3 each trigger their own instance of r1; for each instance, the condition is tested and, if it is satisfied, the action is executed. In the set-oriented semantics, however, only one instance of r1 is executed for t1, t2, and t3; that is, r1's condition is tested just once and, if it is satisfied, the action part of r1 is carried out on each of t1, t2, and t3.
Postgres has the tuple-oriented rule execution semantics, while Starburst and Ariel have the set-oriented semantics. HiPAC and Sentinel, on the other hand, have a rather different rule execution semantics. In these systems, conceptually, all triggered rules are executed concurrently using the nested transaction model [12, 6].
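The difference between the two semantics in the three-tuple insert example can be made concrete with a short sketch. The function names, the salary values, and the choice of condition ("some inserted value exceeds 100") are assumptions made for illustration; real systems express conditions as queries over the inserted tuples.

```python
from typing import Callable, List


def tuple_oriented(inserted: List[int],
                   condition: Callable[[List[int]], bool],
                   action: Callable[[int], None]) -> None:
    """One rule instance per tuple: test and act tuple by tuple."""
    for t in inserted:
        if condition([t]):
            action(t)


def set_oriented(inserted: List[int],
                 condition: Callable[[List[int]], bool],
                 action: Callable[[int], None]) -> None:
    """One rule instance for the whole set: test once, then act on each tuple."""
    if condition(inserted):
        for t in inserted:
            action(t)


def any_high(values: List[int]) -> bool:
    # Hypothetical condition: at least one inserted value exceeds 100.
    return any(v > 100 for v in values)


inserted = [50, 120, 80]          # stands in for tuples t1, t2, t3

flagged_per_tuple: List[int] = []
tuple_oriented(inserted, any_high, flagged_per_tuple.append)

flagged_per_set: List[int] = []
set_oriented(inserted, any_high, flagged_per_set.append)
```

With the same rule and the same insert, the tuple-oriented run acts only on the tuple that satisfies the condition by itself, while the set-oriented run acts on all three tuples because the condition holds for the set as a whole.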
1.2 Data Warehouses
In order to support efficient processing of queries that are often complex and span vast amounts of data, which in turn may be distributed and even heterogeneous, one viable approach is to maintain a separate, dedicated repository that holds the gist of the base data and to process the queries over that repository. By doing so, decision support applications, which usually involve many such otherwise expensive queries, can also be detached from the ordinary online transactions performed over the underlying base databases, thereby significantly enhancing the performance of both categories of applications. This approach, known as data warehousing [31], is rapidly becoming popular, and considerable research effort is being devoted to solving various related issues [33].
A data warehouse system can be viewed as a multilayered database system. At the bottom level, there are several databases, called base databases, each of which is an operational, independent database system. At the top level, there is one specialized, (almost) read-only database system that contains the most abstracted or summarized data derived from the databases one level below. As a result, this system constitutes a pyramid-structured abstraction of information (or data). Usually, however, only a two-layer structure is assumed; that is, base databases and a summarized database, which we call the data warehouse. A physical separation is generally assumed between the two layers. The base databases can also be distributed and even heterogeneous. All these separate databases are connected through a network.
Since the data warehouse and base databases are separated and the data warehouse contains information derived from the base databases, it is a natural requirement that, as base databases change, the data warehouse be updated accordingly if the changes are relevant to the information it holds. Therefore, an intermediate layer between the two layers is necessary to mediate communication between them. This intermediate layer is responsible for monitoring changes made to the base databases and propagating them to the data warehouse to update its contents. In practice, however, the intermediate layer can be realized as two interface layers, one residing in each base database and one in the data warehouse. The interface layer in a base database is responsible for detecting changes in that database, and it communicates with the interface layer in the data warehouse to propagate the changes. One issue arising here is when, or how often, to propagate the changes. Approaches proposed in the literature include eager propagation for a currency-critical data set, polling for a data set that changes slowly or whose currency requirement is not critical, and lazy (or on-demand) propagation for a data set that changes slowly and is not used frequently.
A closely related issue to the change propagation is the incremental update of the data warehouse. It is unthinkable to repopulate (or rederive) the data warehouse whenever a change is made in a base database. Therefore, there should be a way to incrementally update the data warehouse with the propagated changes. In fact, this issue is not a new one. View materialization [8, 27] in the relational database deals with a similar problem, but not identical. Although the data warehouse can be viewed as a set of materialized views (derived from base databases), unlike view materialization, the data warehouse has a lot of problems that need special attention. The views defined in the data warehouse can be very complex for which the conventional view materialization techniques are inadequate. They usually contain historical data and highly aggregated and summarized data. This is another reason why the view materialization techniques cannot be directly applied [37]. Recently, the issue of view materialization has been revisited by several researchers including [23, 24].
1.3 Problem Statement
1.3.1 Support for Alternative User Requirements in Active Rules Execution
Using ECA rules in active database systems for real-life applications involves implementing, debugging, and maintaining a large number of rules. Experience in developing large production rule systems has amply demonstrated the need for understanding the behavior of rules, especially when their execution is nondeterministic. Availability of rules in active database systems and their semantics create additional complexity for both modeling and verifying the correctness of such systems.
For the effective deployment of active database systems, there is a clear need for an analysis (both static and dynamic) and explanation facility to understand the interaction among rules, between rules and events, and between rules and database objects. Due to the event-based nature of active database systems, special attention has to be paid to making the context of rule execution explicit [15]. Unlike in an imperative programming environment, rules are triggered in the context of the transaction execution, and hence both the order and the rules triggered vary from one transaction/application to another.¹
¹ Use of the nested transaction model for rule execution in Sentinel [6, 36] provides such a context. Our graphics visualization tool [14] using Motif displays transaction and rule execution by using directed graphs to indicate both the context (i.e., transactions/composite events) and the execution of (cascading) rules. We plan to augment the existing visualization capability with the static analysis proposed in this paper.
Ideally, the support environment needs to be able to accept expected behavior and compare it with the actual behavior to provide useful feedback on the differences between the two. This has to be done in the context of database objects, rules, events, and transaction sets that are executed concurrently.
Short of this, it is useful to provide alternatives that allow the designer to choose among various options that are meaningful to his/her application. For example, user requirements may come from the following set of answers as far as rule execution is concerned:
1. No preference; any arbitrary execution order of rules and their final result is acceptable.
2. Arbitrary final state is NOT acceptable. The designer wants to see a unique final result whenever the same set of rules executes from the same initial database state (i.e., Confluent Rule Execution is desirable).
3. This particular group of rules must give a unique result when any subset of these rules executes. No preference for the rest of the rules.
4. This application must have this order and that application must have that order, if a different execution order gives a different result.
It is evident that the ability to interact with the designer and to support alternative user requirements is quite important. As a result, support environments for the design of ECA rules in the context of active databases need to support:
* ECA rule analysis to provide feedback on the interaction of rules either globally or with respect to transactions,
* Confluence analysis at rule scheduling time (in addition to static analysis),
* Parallel rule execution where possible for efficiency reasons, and
* Visualization of actual run time rule execution and related contextual information.
In the first part of this thesis (Chapters 2 and 3), we propose an approach that addresses the above. We address confluent rule executions (which deal with obtaining a unique final database state for any initial set of rules) for a user transaction. We show that previous rule execution models are inadequate for confluent rule executions in some cases, and propose extensions that can readily meet the alternative user requirements with respect to confluent executions. We also show that our model naturally supports parallel rule executions.
1.3.2 Efficient Support of Aggregates in Data Warehouses
Although there has been great interest in the data warehouse for years, both within the database industry and within academia, it does not appear that an efficient, fully functional, and flexible data warehouse system has emerged yet. There are many reasons for that. Some of them are purely engineering problems. Many legacy database systems are still running and lack the capability to interface with any newer systems. Extracting data from such a system and monitoring data changes may not be very challenging theoretically, but from the engineering point of view they are not trivial tasks at all. On the other hand, there are still issues to be answered in more systematic and fundamental ways. One such issue that we find very important is the efficient support of aggregates in data warehouses. As data warehouses contain a lot of highly summarized data and many applications in data warehouses perform analyses over the data that range from simple to very complex, it is crucial to improve the performance of such aggregate or summary operations. In a naive implementation, it could take more than several hours to perform a simple summation over millions of records dispersed across base databases. With such a response time, no one would expect interactive data analysis, which is an important requirement for the data warehouse to be truly useful as a cooperative tool.
With this consideration in mind, in Chapter 4, we propose a practical means to boost the performance of aggregate processes in data warehouses. The aggregate cache that we propose saves previous results of aggregate computations in the system (i.e., the data warehouse). As base databases change, the stored results are updated appropriately. Along with this engineering aspect, in our work we identify an exact class of aggregates and prove that its members are guaranteed to be incrementally updatable, a result that we believe is a theoretical contribution to the research on data warehouses.
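As a minimal sketch of what updating a stored result incrementally means, consider a cached average that keeps a running sum and count and adjusts them on each base-data change instead of rescanning the data. The class and method names are hypothetical, and the example covers only this one simple aggregate; it does not represent the general class of aggregates characterized in Chapter 4.

```python
from typing import Iterable, Optional


class CachedAverage:
    """Cached AVG maintained from a running sum and count (illustrative only)."""

    def __init__(self, values: Iterable[float]):
        values = list(values)
        # Computed once from the base data, then maintained incrementally.
        self.total = sum(values)
        self.count = len(values)

    def on_insert(self, value: float) -> None:
        self.total += value
        self.count += 1

    def on_delete(self, value: float) -> None:
        self.total -= value
        self.count -= 1

    def on_update(self, old: float, new: float) -> None:
        self.total += new - old

    @property
    def value(self) -> Optional[float]:
        return self.total / self.count if self.count else None


cache = CachedAverage([10, 20, 30])
after_initial = cache.value
cache.on_insert(40)
after_insert = cache.value
cache.on_delete(10)
after_delete = cache.value
cache.on_update(20, 50)
after_update = cache.value
```

Each change costs a constant number of arithmetic operations, independent of how many records contribute to the aggregate, which is the property that makes caching worthwhile.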
1.3.3 Structure of the Dissertation
The rest of this dissertation is structured as follows. In Chapter 2, we introduce formal definitions of confluent rule execution and show the conditions under which confluent rule execution can be achieved. In Chapter 3, we generalize the notions developed in Chapter 2 to obtain confluent rule execution in general situations, and we present rule scheduling algorithms along with a proof of the maximal parallelism that our approach attains. In Chapter 4, we propose the aggregate cache and identify a class of aggregates that we prove to be incrementally updatable. Chapter 5 contains general conclusions for this dissertation.
CHAPTER 2
STATIC ANALYSIS OF ACTIVE RULES
2.1 Introduction
Incorporating ECA rules (Event-Condition-Action rules) into traditional database systems, thereby making them active databases, can broaden database applications significantly [5, 12, 19, 30]. Also, ECA rules provide more flexible and general alternatives for implementing many database features, such as integrity constraint enforcement, that are traditionally hardwired into a DBMS [10, 11, 35, 38, 39].
An ECA rule consists of three parts: event, condition, and action. Execution of ECA rules goes through three phases: event detection, condition test, and execution of the action. An event can be a data manipulation or retrieval operation, a method invocation in object-oriented databases, a signal from a timer or from users, or a combination thereof. An active database system monitors occurrences of the events prespecified by ECA rules. Once the specified events have occurred, the condition part of the relevant rule is tested. If the test is satisfied, the rule's action part can be executed. In Sentinel [13], a rule is said to be triggered when the rule has passed the event detection phase; that is, when one or more events that the rule is waiting for have occurred. When an ECA rule has passed the condition test phase, it is said to be eligible for execution.¹ In this work, we use "trigger" to describe the eligible rules, assuming that the condition part has been satisfied or is nil.
¹ The definition of trigger has become blurred as condition-action rules such as production rules [9] have evolved into ECA rules. A condition-action rule is triggered and eligible for execution when the current database state satisfies the specified condition, whereas for an ECA rule to be ready to execute, it has to pass separate event detection and condition test phases.
The ECA rule execution model (and rule execution models in general) has to address several issues. First, for various reasons, multiple rules can be triggered and eligible for execution at the same time. For example, suppose that two rules ri and rj are defined, respectively, on two events E1 and (E1 ∨ E2), and that no conditions are specified for the rules, where (E1 ∨ E2) is a disjunction of the two component events E1 and E2. A disjunction occurs when either component event occurs [13]. Now, if event E1 occurs, it will trigger both ri and rj. As addressed by Aiken et al. [3, 4], multiple triggered rules pose problems when different execution orders can produce different final database states. If an active database system randomly chooses a rule to execute (out of several triggered rules), as many extant systems do as a last resort, the final database state becomes nondeterministic. This adds to the problem of understanding applications that trigger rules.
To deal with multiple triggered rules, a generally taken approach is to define priorities among conflicting rules [2, 12, 28, 34]. When multiple conflicting rules are triggered at the same time, a rule with the highest priority is selected for execution. While we believe the prioritization is a sound approach, we notice that the previous priority schemes are incomplete and inadequate to handle the complexity caused by trigger relationships between rules.
On the other hand, Aiken et al. [3] focus on testing whether a given rule set has the confluence property. A rule set is said to be confluent if any permutation of the rules yields the same final database state when the rules are executed. If a rule set turns out not to be confluent, either the rule set is rewritten to remove the conflicts or priorities are explicitly defined over the conflicting rules. Then, the new rule set is retested to see if it has the confluence property. A problem with this approach is that it tends to repeat the time-consuming test process until the rule set eventually becomes confluent. Also, it has not been shown by which mechanism confluence can be guaranteed as priorities between conflicting rules are added to the system as a means of conflict resolution.
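The definition of confluence quoted above can be checked by brute force on a toy rule set: execute every permutation of the rules from the same initial state and compare the final states. The sketch below does exactly that. It is not the static analysis of Aiken et al., it examines only one initial state, and the three rules and the state layout are invented for illustration.

```python
from itertools import permutations
from typing import Callable, Dict, List, Set, Tuple

State = Dict[str, int]
RuleAction = Callable[[State], None]


def final_states(rules: List[RuleAction], initial: State) -> Set[Tuple]:
    """Collect the final states reached over all execution orders."""
    results = set()
    for order in permutations(rules):
        state = dict(initial)          # every order starts from the same state
        for rule in order:
            rule(state)
        results.add(tuple(sorted(state.items())))
    return results


def is_confluent(rules: List[RuleAction], initial: State) -> bool:
    """True if every permutation yields the same final state."""
    return len(final_states(rules, initial)) == 1


def add_bonus(state: State) -> None:
    state["pay"] += 100


def double_pay(state: State) -> None:
    state["pay"] *= 2


def mark_audited(state: State) -> None:
    state["audited"] = 1


initial = {"pay": 1000, "audited": 0}
conflicting = is_confluent([add_bonus, double_pay], initial)
independent = is_confluent([add_bonus, mark_audited], initial)
```

The first pair is not confluent because addition and doubling do not commute (2200 versus 2100), while the second pair touches disjoint data and therefore reaches a single final state. Since the number of permutations grows factorially, exhaustive testing of this kind is practical only for very small rule sets.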
There are other subtle problems with the ECA rule execution model. Suppose that the rules ri and rj mentioned previously have condition parts. When event E1 occurs, the two rules pass the event detection phase. Assume that both rules have passed the condition test and are ready to execute their action parts. It is possible that the execution of one rule's action, say ri's, invalidates rj's condition, which was already tested to be true; that is, ri can untrigger rj. Apart from the issue of whether the condition test should be delayed until (or retested at) the point just before execution of the action part, if one rule untriggers other rules, it is very likely that the rule set is not confluent. The opposite situation can also happen. Suppose the condition of ri was not met, so its action part would not be executed. Execution of rj's action, however, could change the database state so that ri's condition becomes satisfied. Therefore, if rj executes first and ri's condition is tested after that, ri will be able to execute too. Again, the execution order of the two rules makes a difference. Instead of proposing a more rigorous rule execution model to deal with these anomalies, we consider such rules as conflicting with one another so that the rule designer can be informed of them. This view allows us to cover the problem within the framework of confluent rule executions.
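The untriggering anomaly can be reproduced with a small sketch in which each rule's condition is tested immediately before its action. That timing is an assumption chosen for the illustration, and the rules, thresholds, and state fields are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, int]
Rule = Tuple[str, Callable[[State], bool], Callable[[State], None]]


def run(order: List[Rule], initial: State) -> Tuple[List[str], State]:
    """Execute rules in the given order, testing each condition just in time."""
    state = dict(initial)
    executed = []
    for name, condition, action in order:
        if condition(state):
            action(state)
            executed.append(name)
    return executed, state


def ship_order(state: State) -> None:
    state["stock"] -= 5


def flag_surplus(state: State) -> None:
    state["flagged"] = 1


# ri lowers the stock; rj's condition depends on the stock level,
# so executing ri first invalidates (untriggers) rj.
ri: Rule = ("ri", lambda s: s["stock"] >= 5, ship_order)
rj: Rule = ("rj", lambda s: s["stock"] >= 8, flag_surplus)

initial = {"stock": 10, "flagged": 0}
ri_first = run([ri, rj], initial)
rj_first = run([rj, ri], initial)
```

Both conditions hold in the initial state, yet only the order that runs rj first executes both rules, so the two orders end in different database states. The opposite case, where one rule's action makes another rule's condition true, follows the same pattern with the roles reversed.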
In this work we explore problems of confluent rule executions, which deal with obtaining a unique final database state for any initial set of rules that is triggered by a user transaction. We show that previous rule execution models are inadequate for confluent rule executions and propose a new rule execution model that guarantees confluent executions. It is also shown that our model is a perfect fit for parallel rule execution.
2.2 Limitations of the Earlier Rule Execution Models
Early rule execution models, such as the one used in OPS5 [9], deal with problems of confluent rule executions only in terms of conflict resolution. When multiple rules are triggered (and eligible for execution), the rule scheduler selects a rule to execute according to a certain set of criteria, such as recency of trigger time and complexity of conditions. Although this scheme has been used in the AI domain, users in the database domain prefer deterministic states from the executions of transactions. Furthermore, the conflict resolution approach is not a complete answer to the confluence problem since it is based on dynamic components such as recency of trigger time. For the above reasons, we do not consider this approach in our work.
A somewhat different approach taken in active database systems such as Starburst [2, 4] and Postgres [34, 35] is to statically assign execution priorities over rules. In these systems if multiple rules are triggered, a rule with the highest priority among them is executed first. However, rule execution models in these systems cannot guarantee confluent rule executions unless all the rules (not only conflicting ones) are totally ordered. This problem is illustrated in the following examples.
Example 2.2.1 Figure 2.1(a) shows nondeterministic behavior of rule execution even when all conflicting rules are ordered. In the figure, solid arrows represent trigger relationships. Dashed lines represent conflicts, and an arrow on a dashed line indicates priority between two conflicting rules. As shown, two pairs of rules are conflicting: $(r_2, r_5)$ and $(r_3, r_5)$. The conflicting rules are ordered in such a way that $r_2$ precedes $r_5$ and $r_5$ precedes $r_3$ in execution when the pairs of conflicting rules are triggered at the same time. Now suppose $r_1$ and $r_4$ are triggered by the user transaction at the same time. (Note that these rules are denoted by solid circles in the graph.) In a rule execution model such as Starburst, one of $r_1$ and $r_4$ will be randomly picked for execution since there is no conflict between them, and thus no order is specified. Suppose $r_4$ is executed first; then it will trigger $r_5$. Yet there is no order between $r_1$ and $r_5$, which are ready to execute. So $r_5$ may go first, and its execution will trigger $r_6$. Then $r_6$, $r_1$, $r_2$, and $r_3$ may follow. Including this execution sequence, two of all legitimate execution sequences for the rule set are as follows: (1) $(r_4\, r_5\, r_6\, r_1\, r_2\, r_3)$ and (2) $(r_1\, r_2\, r_3\, r_4\, r_5\, r_6)$. Note that the relative orders of two conflicting rules, $r_2$ and $r_5$ (as well as $r_3$ and $r_5$), in the two rule execution sequences are different, so confluent execution cannot be guaranteed. □

Figure 2.1. Rule execution graphs (panels (a) and (b))
Example 2.2.2 Figure 2.2 illustrates another situation where the previous rule execution models fail to achieve confluent rule executions. There is a dependency between $r_k$ and $r_l$, and $r_k$ has priority over $r_l$. In this example, $r_i$ and $r_j$ are triggered by the user transaction. Note also that $r_i$ is an ancestor of $r_j$ in the trigger relationship, and thereby the trigger paths originating from the two rules overlap one another. Given that priority, the following two (and more) sequences of rule executions are possible: $(r_i\, r_j\, r_k\, r_l\, r_j\, r_k\, r_l)$ and $(r_i\, r_j\, r_k\, r_j\, r_k\, r_l\, r_l)$. Now the relative orders of the two conflicting rules $r_k$ and $r_l$ in the two execution sequences are different. Therefore confluent rule executions cannot be guaranteed in the given situation. □
As the previous examples suggest, a problem that the extant active rule execution models fail to address properly is that even though two rules do not directly conflict with each other, they may trigger other rules that do. Depending on the execution order of the triggering rules, the directly conflicting rules may be executed in a different order from what the user specified, likely resulting in nonconfluent rule executions. Unless all the direct conflicts are removed by rewriting the rules, one possible remedy for this problem, implied in Starburst [3], would be to regard the indirectly conflicting rules as conflicting ones. Figure 2.1(b) illustrates how the conflicts of Figure 2.1(a) are propagated toward ancestor rules in the trigger relationship when this approach is taken. An undesirable consequence of propagating conflicts is that it severely limits parallel rule execution. In addition, it is not always clear how to propagate conflicts, as Figure 2.2(a) shows.

Figure 2.2. Overlapped trigger paths (panels (a) and (b))
Another problem that the previous rule execution models do not handle was shown in Example 2.2.2, where trigger paths of rules triggered by the user transaction overlap. In fact, this situation poses additional problems for priority specification. That is, no static priority scheme specified before rule execution can range over all possible permutations of conflicting rule executions, since one cannot anticipate which rules will be triggered by the user transaction, or how many times. For instance, given the rule set of Figure 2.2(a), there can be two distinct final database states, which result from the rule execution sequences $(r_i\, r_j\, r_k\, r_l\, r_j\, r_k\, r_l)$ and $(r_i\, r_j\, r_k\, r_j\, r_k\, r_l\, r_l)$. All other legitimate rule execution sequences are equivalent to one of the two sequences in terms of final database states. However, if $r_j$ is triggered twice and $r_i$ once by the user transaction, the number of distinct final database states increases up to five. As $r_i$ and $r_j$ are triggered more times, the number of different final database states increases exponentially. Therefore, it is not realistic to provide every possible alternative for these cases. Rather, a less general scheme of priority specification, which provides only some specific alternatives, needs to be considered.
Figure 2.2(b) shows one way of specifying priority for the rule set of Figure 2.2(a), which is similar to the priority scheme adopted in Postgres [34]. Numbers in brackets denote absolute priorities associated with rules. A larger number denotes a higher priority. This priority specification guarantees confluent rule executions, although nonconflicting rules ($r_i$ and $r_j$) also need to be assigned priorities. Note that the given priority specification is (unnecessarily) so strong that it effectively imposes a serial execution order $(r_i \to r_j \to r_k \to r_l)$, thereby ruling out any parallel rule executions. For instance, one instance of $r_j$ could run in parallel with a series of $r_i$ and $r_j$ without affecting the final database state.
In the subsequent sections, we develop a novel rule execution model and a priority scheme that not only ensures confluent rule executions but also allows greater parallelism.
2.3 Assumptions and Definitions
2.3.1 Rule Execution Sequence (RES) and Rule Commutativity
Informally, a rule execution sequence (RES) is a sequence of rules that the system can execute when a user transaction triggers at least one rule in the sequence. To characterize RESs, we first define partial RESs. Throughout this work, $R$ denotes the system rule set, a set of rules defined in the system by the user. $D$ denotes the set of all possible database states determined by the database schema. $(d_j, R_k)$, with $d_j \in D$ and $R_k \subseteq R$, denotes a pair of a database state and a triggered rule set. If $R_k$ is a set of rules directly triggered by a user transaction, it is specially called a UTRS, which stands for User-Triggered Rules Set. A UTRS is, in fact, a multiset since, as we shall see later, multiple instances of a rule can be in it. $S$ denotes the set of all partial RESs (see below) defined over $R$ and $D$.
Partial RES. Given $R$ and $D$, for a nonempty set of triggered rules $R_k \subseteq R$ and a database state $d_j \in D$, a partial RES, $\sigma$, is defined to be a sequence of rules that connects pairs of a database state and a triggered rule set as follows:

$$\sigma = \left((d_j, R_k) \xrightarrow{r_i} (d_{j+1}, R_{k+1}) \xrightarrow{r_{i+1}} \cdots \xrightarrow{r_{i+m-1}} (d_{j+m}, R_{k+m})\right)$$

where $d_{j+l} \in D$ ($1 \le l \le m$) is a new database state obtained by execution of $r_{i+l-1}$, and each rule $r_{i+l}$ ($0 \le l < m$) is in the triggered rule set $R_{k+l}$ and eligible for execution in $d_{j+l}$, i.e., $d_{j+l}$ evaluates the rule's condition test to true. Each triggered rule set $R_{k+l} \subseteq R$ ($1 \le l \le m$) is built as $R_{k+l} = ((R_{k+l-1} - \{r_{i+l-1}\}) - R^{u}_{k+l}) \cup R^{t}_{k+l}$, where $R^{u}_{k+l}$ is the set of rules untriggered by $r_{i+l-1}$ and $R^{t}_{k+l}$ is the set of rules triggered by $r_{i+l-1}$ (see footnote 2).
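The update of the triggered rule set after one rule execution can be sketched in a few lines of Python. This is only an illustration of the set expression in the definition above: the function name `next_triggered_set` and the sample rule names are hypothetical, and plain sets are used for brevity even though a UTRS is described as a multiset.

```python
def next_triggered_set(triggered, executed, untriggered_by, triggered_by):
    """One step of R_{k+l} = ((R_{k+l-1} - {r}) - R^u) | R^t.

    triggered      -- rules triggered before the step (R_{k+l-1})
    executed       -- the rule r that has just executed
    untriggered_by -- rules that r untriggers (R^u)
    triggered_by   -- rules that r triggers (R^t)
    """
    return ((set(triggered) - {executed}) - set(untriggered_by)) | set(triggered_by)


# Hypothetical step: r1 executes, untriggers r3, and triggers r2.
current = {"r1", "r3", "r4"}
print(sorted(next_triggered_set(current, "r1", {"r3"}, {"r2"})))  # ['r2', 'r4']
```

A complete RES corresponds to repeating this step until the returned set is empty.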
In this work, if only sequences of rule executions are of interest, for simplicity we write a partial RES without associated database states and triggered rule sets. For example, the partial RES above can be denoted as $\sigma = (r_i\, r_{i+1} \cdots r_{i+m-1})$, and we already used this form of partial RESs in the previous sections.
Among partial RESs, we are interested in some, called complete RESs (or simply, RESs), that satisfy certain conditions.
Complete RES. Given $R$ and $D$, for a nonempty set $R_k \subseteq R$ that is a set of rules triggered by a user transaction (i.e., a UTRS) and a database state $d_j \in D$ produced by operations in the user transaction, a complete RES (or RES), $\sigma$, is defined to be a partial RES:

$$\sigma = \left((d_j, R_k) \xrightarrow{r_i} (d_{j+1}, R_{k+1}) \xrightarrow{r_{i+1}} \cdots \xrightarrow{r_{i+m-1}} (d_{j+m}, R_{k+m})\right)$$

where no triggered (and eligible) rules remain after execution of the last rule $r_{i+m-1}$ (i.e., $R_{k+m} = \emptyset$). □

Footnote 2. Subscripts $i, i+1, \ldots, i+m-1$ attached to rules are intended to mean that they are $m$ rules that need not be distinct (similarly for $d$'s and $R$'s). They do not represent any sequential order of rules with respect to subscript numbers. That is, they should not be interpreted as, for instance, $r_{10}, r_{11}, r_{12}, \ldots$ in the case where $r_i$ is $r_{10}$. For a precise denotation, we could use $i_0, i_1, \ldots$ instead. However, we have opted for the less precise notation in favor of simplicity throughout this work.
Note that given $R_k$ and $d_j$, there may be multiple different RESs, even in a case where there is only one rule in $R_k$, and those RESs do not necessarily have the same set of rules executed, since a rule's triggering or untriggering of other rules may depend on the current database state. In this work we use rule schedule in informal settings interchangeably with complete RES.
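Rule shuffling. Given a partial RES $\sigma_1$, two rules $r_i$ and $r_j$ in $\sigma_1$ can exchange their positions provided $r_j \in R_y$, yielding a different partial RES $\sigma_2$ as below:

$$\sigma_1 = \left((d_x, R_y) \xrightarrow{r_i} (d_k, R_l) \xrightarrow{r_j} (d_m, R_n)\right)$$

$$\sigma_2 = \left((d_x, R_y) \xrightarrow{r_j} (d_u, R_v) \xrightarrow{r_i} (d_s, R_t)\right)$$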
Next we define an important property of rules that is used to show if a system rule set is confluent. Two rules are defined to be commutative if shuffling them always yields the same result.
Rule commutativity. Given $R$ and $D$, two rules $r_i, r_j \in R$ are defined to be commutative if, for all $R_y \subseteq R$ where $r_i, r_j \in R_y$, and for all database states $d_x \in D$, the following two partial RESs can be defined:

$$\left((d_x, R_y) \xrightarrow{r_i} (d_k, R_l) \xrightarrow{r_j} (d_m, R_n)\right)$$

$$\left((d_x, R_y) \xrightarrow{r_j} (d_u, R_v) \xrightarrow{r_i} (d_m, R_n)\right)$$

where $d_x, d_k, d_m, d_u \in D$ need not be distinct and likewise $R_y, R_l, R_n, R_v \subseteq R$ need not be distinct.
Note that any rule is trivially commutative to itself.
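As a rough illustration of the commutativity idea, the sketch below models a rule as a pure function from a database state (a Python dict) to a new state and compares both execution orders. It is a simplification under stated assumptions: it samples only a few states rather than all of $D$, it ignores the triggered rule sets that the definition also requires to coincide, and the rules `inc_x`, `double_x`, and `reset_y` are hypothetical.

```python
def commute(rule_a, rule_b, states):
    """True if running rule_a then rule_b gives the same state as the
    reverse order, for every sampled starting state."""
    return all(rule_b(rule_a(dict(s))) == rule_a(rule_b(dict(s))) for s in states)


# Hypothetical rule actions over a two-attribute database state.
def inc_x(state):
    return {**state, "x": state["x"] + 1}

def double_x(state):
    return {**state, "x": state["x"] * 2}

def reset_y(state):
    return {**state, "y": 0}


sample_states = [{"x": x, "y": y} for x in range(3) for y in range(3)]
print(commute(inc_x, reset_y, sample_states))   # True: they touch different data
print(commute(inc_x, double_x, sample_states))  # False: both write x
```

A sampled check like this can refute commutativity but cannot prove it, which is one reason a static analysis of read and write sets is attractive.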
Equivalent partial RESs. Two partial RESs $\sigma_i$ and $\sigma_j$ are defined to be equivalent ($\equiv$) if:
1. $\sigma_i$ and $\sigma_j$ begin with the same pair of database state and triggered rule set, and end with the same pair of database state and triggered rule set; and
2. in $\sigma_i$ and $\sigma_j$ the same set of rules is triggered, possibly in different orders. □

The two partial RESs shown in the definition of rule commutativity are equivalent. In fact, rule commutativity is used to prove that two or more partial RESs are equivalent, and the equivalence of partial RESs is, in turn, used to show whether a given system rule set is confluent or not. Incidentally, it should be noted that without condition 2, the equivalence definition can still be valid. However, with our static analysis method it is not possible to identify all such equivalent partial RESs. To make the presentation of this work coherent, we chose a more restrictive form of equivalence.
The equivalence of partial RESs naturally lends itself to the definition of equivalence classes of partial RESs. For given $R$ and $D$, the set of all partial RESs, $S$, is partitioned into disjoint classes by the equivalence relation ($\equiv$). All partial RESs belonging to an equivalence class have the same final result, i.e., the same database state and the same triggered rule set.
Equivalence class of partial RESs. For a partial RES $\sigma_i \in S$, the equivalence class of $\sigma_i$ is the set $S_{\sigma_i}$ defined as follows:

$$S_{\sigma_i} = \{\sigma_j \in S \mid \sigma_j \equiv \sigma_i\}$$
Of the partial RESs belonging to the same equivalence class, for the discussion in this work we define the canonical partial RES, or canonical RES for short, to be the partial RES that comes first when all the partial RESs are sorted by their rules' concatenated subscripts in lexicographical order. For instance, assuming that an equivalence class includes only three partial RESs, $\sigma_i = (r_1\, r_2\, r_4\, r_3)$, $\sigma_j = (r_1\, r_4\, r_2\, r_3)$, and $\sigma_k = (r_1\, r_2\, r_3\, r_4)$, $\sigma_k$ is the canonical RES representing that equivalence class (by lexicographically sorting the concatenated subscripts of the partial RESs, i.e., 1234 ($\sigma_k$), 1243 ($\sigma_i$), and 1423 ($\sigma_j$)). Of prime interest to us is the equivalence class of complete RESs.
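The selection of a canonical RES can be illustrated with a short Python sketch. It assumes rules are written as strings such as "r1" so that the subscript is everything after the first character; the helper names `subscript_key` and `canonical_res` are hypothetical.

```python
def subscript_key(res):
    """Concatenate the subscripts of a RES, e.g. ("r1", "r2", "r4", "r3") -> "1243"."""
    return "".join(rule[1:] for rule in res)


def canonical_res(equivalence_class):
    """Return the RES whose concatenated subscripts sort first lexicographically."""
    return min(equivalence_class, key=subscript_key)


sigma_i = ("r1", "r2", "r4", "r3")
sigma_j = ("r1", "r4", "r2", "r3")
sigma_k = ("r1", "r2", "r3", "r4")
print(canonical_res([sigma_i, sigma_j, sigma_k]))  # ('r1', 'r2', 'r3', 'r4')
```

The keys are compared as strings, which matches the lexicographical ordering described in the text.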
Confluent rule set. Given $R$ and $D$, if there exists only one equivalence class of complete RESs for every nonempty set $R' \subseteq R$ and every $d \in D$, $R$ is defined to be confluent.
2.3.2 Dependencies and Dependency Graph
If a different execution sequence of the same rules can produce a different final database state, it is because of certain interactions between rules and between rules and the environment. If we assume the execution environment to be fixed and there is no interference from the user while rules are executing, the interactions between rules must be the sole reason for nonconfluent rule executions. Based on this, we define rules' interactions responsible for nonconfluence as dependencies between rules, much like those of the concurrency control in transaction processing. We define two kinds of dependencies.
Data dependency. Two distinct rules $r_i$ and $r_j$ are defined to have data dependency with each other if $r_i$ writes in its action part to a data object that $r_j$ reads or writes in its action part, or vice versa. □
Untrigger dependency. Two distinct rules $r_i$ and $r_j$ are defined to have untrigger dependency with each other if $r_i$ writes in its action part to a data object that $r_j$ reads in its condition part, or vice versa. □
If two rules have data dependency with each other, the input to one rule can be altered by the other rule's action. Thus it is very likely that the affected rule would behave differently. The data dependency can also mean that one rule's output can be overridden by the other rule's output. This also has a bearing on the final outcome. If there is no data dependency, two rules act independently. Therefore, there should be no difference in the final outcome due to a different relative execution order of the two rules.
On the other hand, if there is untrigger dependency between two rules $r_i$ and $r_j$, this implies that one rule's action can change the condition that determines whether the other rule is to execute or not. If the affected rule, say $r_i$, has already executed first, it is unrealistic to revoke the effect of $r_i$. As a result, both $r_i$ and $r_j$ will execute in this case. However, if the affecting rule $r_j$ executes first, it can prevent $r_i$ from executing. Since it is assumed that there are no read-only rules, the two different execution sequences can result in different database states even though there is no data dependency (see footnote 3).
From the observation above, it is clear that the absence of data dependency and untrigger dependency between two rules is a sufficient condition for the two rules to be commutative. (The reverse is not necessarily true.) If there exists either dependency between two rules, the rules are said to conflict with each other. Obviously, conflicting rules are noncommutative.

Footnote 3. It should be noted that whether or not the untrigger dependency can indeed affect confluent execution depends on the rule execution model employed by an active database system. If the rule execution model does not recheck the condition part of a rule just before it executes the action part of that rule, then no rule is untriggered. In such a case, it can appear that the untrigger dependency is no longer a problem and only data dependency matters.
Lemma 1 Given a partial RES $\sigma$, a new partial RES $\sigma'$ obtained by freely shuffling rules in $\sigma$ is equivalent to $\sigma$, as long as the relative orders of conflicting rules in both RESs are equal, if there are any conflicting rules.

Proof of Lemma 1 Suppose $\sigma$ and $\sigma'$ are not equivalent despite the same relative orders of conflicting rules in them. Then, there must be one or more pairs of nonconflicting rules in $\sigma$ that can be shuffled but result in a different (nonequivalent) partial RES. These nonconflicting rules are, then, conflicting and should have the same relative orders in $\sigma$ and $\sigma'$, which is a contradiction. □
Below, we define dependency graph that represents dependencies between rules in the system rule set.
Dependency graph. Given system rule set $R$, a dependency graph $DG = (R, E_D)$ is an undirected graph where $R$ is a node set in which each node maps one-to-one to a rule and $E_D$ is a dependency edge set. For any nodes $u$ and $v$ in $R$, a dependency edge $(u, v)$ is in $E_D$ if and only if there is data dependency or untrigger dependency (or both) between $u$ and $v$. □
A dependency graph is nontransitive; that is, (u, v) and (v, w) in ED do not imply (u, w) in ED. Edges in a dependency graph represent only direct dependencies. An indirect (transitive) dependency is represented by a path consisting of a set of connected dependency edges.
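The two dependency definitions and the dependency graph lend themselves to a small static check over read and write sets. The following Python sketch assumes each rule has been summarized by three sets of data-object names (objects read in the condition, read in the action, and written in the action); the `Rule` class, its field names, and the sample rules are hypothetical and not taken from any particular system.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Rule:
    name: str
    cond_reads: frozenset = frozenset()     # objects read by the condition part
    action_reads: frozenset = frozenset()   # objects read by the action part
    action_writes: frozenset = frozenset()  # objects written by the action part


def data_dependent(a, b):
    """One rule's action writes an object the other's action reads or writes."""
    return bool(a.action_writes & (b.action_reads | b.action_writes)
                or b.action_writes & (a.action_reads | a.action_writes))


def untrigger_dependent(a, b):
    """One rule's action writes an object the other's condition reads."""
    return bool(a.action_writes & b.cond_reads or b.action_writes & a.cond_reads)


def dependency_edges(rules):
    """Undirected edge set E_D: one edge per conflicting pair of distinct rules."""
    return {frozenset((a.name, b.name)) for a, b in combinations(rules, 2)
            if data_dependent(a, b) or untrigger_dependent(a, b)}


rules = [
    Rule("r1", action_writes=frozenset({"x"})),
    Rule("r2", action_reads=frozenset({"x"}), action_writes=frozenset({"w"})),
    Rule("r3", cond_reads=frozenset({"y"}), action_writes=frozenset({"z"})),
    Rule("r4", action_writes=frozenset({"y"})),
]
edges = dependency_edges(rules)
print(sorted(sorted(e) for e in edges))  # [['r1', 'r2'], ['r3', 'r4']]
```

Only direct dependencies produce edges, so the resulting graph is nontransitive in the sense described above.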
2.3.3 Trigger Graph
A trigger graph (TG) is an acyclic directed graph representing trigger relationships between rules within a given system rule set $R$. For system rule set $R$, $TG = (R, E_T)$ has a node set $R$ in which each node maps one-to-one to a rule, and a trigger edge set $E_T$. For any two nodes (i.e., rules) $r_i$ and $r_j$ in $R$, $E_T$ contains a directed edge $(r_i \to r_j)$, called a trigger edge, if and only if $r_i$ can trigger $r_j$. It is defined that $r_i$ can trigger $r_j$ if execution of $r_i$'s action can give rise to an event that is referenced in the event specification of $r_j$ (see footnote 4). A trigger path in a trigger graph is a (linear) path starting from any node and ending at a reachable leaf node.
Note that for rules $r_i$ and $r_j$ above, it is possible that $r_j$ is not triggered by $r_i$ at run time if $r_i$'s action part contains a conditional statement. Nevertheless, we conservatively maintain a trigger edge if there is any possibility of $r_i$'s triggering $r_j$. In addition, we assume that a trigger graph is acyclic to guarantee termination of rule executions [3]. If a trigger graph contains a cycle, it is possible that once a rule in the cycle is triggered, all the rules in the cycle keep triggering the next rule indefinitely. We also assume that there exists a separate mechanism for detecting cycles in a trigger graph so that the rule designer can rewrite the rules in such a case.
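The text assumes a separate cycle-detection mechanism without describing it. One common way to provide it is a depth-first search with three node colors, sketched below in Python. This is an illustrative stand-in rather than the mechanism the author had in mind; the function name `find_cycle` and the sample edges are hypothetical.

```python
def find_cycle(trigger_edges):
    """Return one trigger cycle as a list of rules, or None if the graph is acyclic."""
    graph = {}
    for src, dst in trigger_edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])

    WHITE, GRAY, BLACK = 0, 1, 2          # unvisited, on the current path, finished
    color = {node: WHITE for node in graph}
    path = []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for nxt in graph[node]:
            if color[nxt] == GRAY:        # back edge: nxt is already on the path
                return path[path.index(nxt):] + [nxt]
            if color[nxt] == WHITE:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in graph:
        if color[node] == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


print(find_cycle([("r1", "r2"), ("r2", "r3")]))                # None
print(find_cycle([("r1", "r2"), ("r2", "r3"), ("r3", "r1")]))  # ['r1', 'r2', 'r3', 'r1']
```

Reporting the cycle itself, rather than a yes/no answer, gives the rule designer the rules that need rewriting.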
Incidentally, it should be noted that a trigger relationship between two rules does not necessarily imply a dependency between the rules. For instance, given a trigger edge $(r_i \to r_j)$, if $r_i$ for sure triggers $r_j$ and no other rules are triggered from $r_i$ and $r_j$, there are only two possible partial RESs for the two rules, $(r_i\, r_j\, r_j)$ and $(r_j\, r_i\, r_j)$. If there is no data or untrigger dependency between $r_i$ and $r_j$ (i.e., the two rules are commutative), the two RESs are equivalent despite the trigger edge.
2.4 Confluence and Priority Specification
In this section we present basic ideas that give us a handle for dealing with conflicting rules in order to obtain confluent rule executions. We consider simple cases first. When there are $n$ distinct rules to execute and $m$ pairs of conflicting rules among them, intuitively, the maximum number of different final database states that can result from all differing RESs is conservatively bounded by $2^m$, since each pair of conflicting rules can possibly produce two different final database states by changing their relative order.

Figure 2.3. A conflicting rule set

Footnote 4. We admit that this definition of "can trigger" is rather crude. In Sentinel, for example, if a rule is waiting for an occurrence of (E1; E2), which is a composite event sequence and occurs when E2 occurs provided E1 occurred before, the occurrence of E1 alone never triggers that rule. In our current work, however, we do not pursue this issue any further. (For event specifications in Sentinel, see Chakravarthy et al. [13].)
Example 2.4.1 Figure 2.3 is a redrawing of Figure 2.1(a) with removal of directions on dependency edges $(r_2, r_5)$ and $(r_3, r_5)$. Note that $r_1$ and $r_4$, denoted by solid nodes in the graph, are in the UTRS, the set of rules initially triggered by a user transaction. Assuming that all six rules are executed, all complete RESs that can be generated should be equal to the set of possible merged sequences of the two partial RESs $(r_1\, r_2\, r_3)$ and $(r_4\, r_5\, r_6)$. Then, all the possible merged (now complete) RESs can be partitioned into up to four groups by the relative orders between $r_2$ and $r_5$ and between $r_3$ and $r_5$ as follows: (1) $(r_2 \to r_5)(r_3 \to r_5)$, (2) $(r_2 \to r_5)(r_3 \leftarrow r_5)$, (3) $(r_2 \leftarrow r_5)(r_3 \leftarrow r_5)$, and (4) $(r_2 \leftarrow r_5)(r_3 \to r_5)$. However, since there exists an inherent order between $r_2$ and $r_3$, i.e., $(r_2 \to r_3)$, dictated by a trigger relationship, no merged RES can contain combination (4), because a cycle would be formed. Combination (4) is dropped from consideration. Since the cumulative effect of all the other rules is the same regardless of their execution order, the three remaining combinations are the only factor that can make a difference in the final database state. Therefore, in this example, up to three distinct final database states can be produced by all possible complete RESs. □
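The counting argument of Example 2.4.1 can be checked by brute force. The Python sketch below enumerates every merge of the two partial RESs $(r_1\, r_2\, r_3)$ and $(r_4\, r_5\, r_6)$ and groups the merges by the relative order of each conflicting pair. The helper names `merges` and `order_signature` are hypothetical.

```python
from itertools import combinations


def merges(a, b):
    """Yield every interleaving of a and b that keeps each sequence's own order."""
    total = len(a) + len(b)
    for slots in combinations(range(total), len(a)):
        chosen, ai, bi = set(slots), iter(a), iter(b)
        yield tuple(next(ai) if k in chosen else next(bi) for k in range(total))


def order_signature(res, conflicts):
    """For each conflicting pair (u, v), record whether u executes before v."""
    position = {rule: k for k, rule in enumerate(res)}
    return tuple(position[u] < position[v] for u, v in conflicts)


conflicts = [("r2", "r5"), ("r3", "r5")]
groups = {}
for res in merges(("r1", "r2", "r3"), ("r4", "r5", "r6")):
    groups.setdefault(order_signature(res, conflicts), []).append(res)

print(sum(len(g) for g in groups.values()))  # 20 merged RESs in total
print(len(groups))                           # 3 groups
print((False, True) in groups)               # False: r5 before r2 but r3 before r5 never occurs
```

The missing signature is exactly combination (4), which would require $r_3$ to run before $r_5$ and $r_5$ before $r_2$ even though $r_2$ triggers $r_3$.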
Figure 2.4. Priority graphs for Figure 2.3 (panels (a), (b), and (c))
Using the three possible orderings of conflicting rules in Example 2.4.1, we can assign directions to dependency edges in the graph of Figure 2.3. Resulting graphs, which we call priority graphs, are shown in Figure 2.4. These priority graphs present how priorities can be specified over conflicting rules in order to make rule executions confluent. Also, importantly, they represent partial orders that the rule scheduler needs to follow as it schedules rule executions. As we shall see in the following section, the rule scheduler basically uses a topological sort algorithm working on a subgraph of priority graph, and this demands the priority graph to be acyclic.
Example 2.4.2 All possible topological sorts on the priority graph of Figure 2.4(a) correspond to an equivalence class represented by a canonical RES, $\sigma_1 = (r_1\, \bar{r}_2\, r_4\, \bar{r}_5\, \bar{r}_3\, r_6)$; for clarity we use an overbar hereafter to denote conflicting rules, as in $\bar{r}_2$. Note that a RES $(r_1\, r_4\, \bar{r}_2\, \bar{r}_5\, \bar{r}_3\, r_6)$ is equivalently converted to $\sigma_1$ by shuffling $r_2$ and $r_4$, which are commutative. Similarly, $\sigma_2 = (r_1\, \bar{r}_2\, \bar{r}_3\, r_4\, \bar{r}_5\, r_6)$ and $\sigma_3 = (r_1\, r_4\, \bar{r}_5\, \bar{r}_2\, \bar{r}_3\, r_6)$ represent equivalence classes obtained when the topological sort is carried out on the priority graphs of Figures 2.4(b) and (c), respectively. □
The formal definition of the priority graph is given below.
Priority graph. Given trigger graph $TG = (R, E_T)$ and dependency graph $DG = (R, E_D)$, a priority graph $PG = (R, E_P)$ is a directed acyclic multigraph formed by merging TG and DG, where $R$ is a node set defined as before and $E_P$ is a priority edge set. For any two distinct nodes $u, v \in R$, $(u \xrightarrow{T} v) \in E_P$ if and only if $(u \to v) \in E_T$, and either $(u \xrightarrow{D} v) \in E_P$ or $(v \xrightarrow{D} u) \in E_P$ if and only if $(u, v) \in E_D$. $(u \xrightarrow{D} v)$ (or $(v \xrightarrow{D} u)$) is called a directed dependency edge to distinguish it from the undirected ones in $E_D$, and the direction of the edge is given by the user. □
For a PG, a trigger edge is depicted by a solid arrow line while a directed dependency edge is depicted by a dashed arrow line. A PG is defined to be a multigraph because it can have more than one edge (actually two edges, i.e., a trigger edge and a directed dependency edge) between two nodes. It should be noted that if a node $u$ is an ancestor of a node $v$ in TG and there is a dependency edge $(u, v)$ in DG, the corresponding directed dependency edge in PG is automatically set to $(u \xrightarrow{D} v)$, so as not to form a cycle in PG.
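The orientation of dependency edges described above can be sketched as follows: an edge between an ancestor and its descendant in TG is directed by the system, and any other edge takes the direction the user supplies, if any. The Python below is a minimal illustration under those assumptions; `descendants`, `orient_dependency_edges`, and the extra dependency between r1 and r3 are hypothetical and added only to exercise the ancestor case.

```python
def descendants(trigger_edges):
    """Map each rule to the set of rules reachable from it in an acyclic TG."""
    graph = {}
    for src, dst in trigger_edges:
        graph.setdefault(src, set()).add(dst)
        graph.setdefault(dst, set())
    reach = {}

    def visit(node):
        if node not in reach:
            reach[node] = set()
            for nxt in graph[node]:
                reach[node] |= {nxt} | visit(nxt)
        return reach[node]

    for node in graph:
        visit(node)
    return reach


def orient_dependency_edges(dependency_edges, trigger_edges, user_choice):
    """Direct each undirected dependency edge (u, v).

    Ancestor-to-descendant direction is forced by the trigger graph; otherwise the
    direction listed in user_choice is used; unspecified edges stay undirected and
    are simply omitted from the result.
    """
    reach = descendants(trigger_edges)
    directed = set()
    for u, v in dependency_edges:
        if v in reach.get(u, ()):
            directed.add((u, v))
        elif u in reach.get(v, ()):
            directed.add((v, u))
        elif (u, v) in user_choice:
            directed.add((u, v))
        elif (v, u) in user_choice:
            directed.add((v, u))
    return directed


trigger = [("r1", "r2"), ("r2", "r3"), ("r4", "r5"), ("r5", "r6")]
dependency = [("r2", "r5"), ("r3", "r5"), ("r3", "r1")]   # last pair is hypothetical
user_choice = {("r2", "r5"), ("r5", "r3")}
print(sorted(orient_dependency_edges(dependency, trigger, user_choice)))
# [('r1', 'r3'), ('r2', 'r5'), ('r5', 'r3')]
```

The user-specified directions should still be checked for cycles together with the trigger edges, since only the ancestor case is safe by construction.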
Example 2.4.3 Figures 2.5(a) and (b) show a trigger graph and its dependency graph counterpart, respectively. Figure 2.6 shows a priority graph built out of the two graphs. Note that directions of dependency edges are determined by the user as a way of specifying priorities between conflicting rules. However, the direction of the dependency edge between $r_1$ and $r_8$ (between $r_2$ and $r_4$ as well) is set by the system as shown in the graph because $r_1$ is an ancestor of $r_8$ in a trigger path. □
Note that given a system rule set R, when an active database system is running, subsets of R will be triggered and executed dynamically. In order to schedule those rules, the rule scheduler builds a subgraph of a given PG, called execution graph, when a user transaction triggers rules.
Execution graph. Given a system rule set $R$, a priority graph $PG = (R, E_P)$, and a UTRS $R' \subseteq R$, an execution graph $EG = (R_E, E_E)$ is a subgraph of PG where $R_E$ is a node set and $E_E$ is an edge set. $R_E$ is recursively defined as

$$R_E = \{\, r \mid r \in R' \lor \exists r' \,(r' \in R_E \land (r' \xrightarrow{T} r) \in E_P) \,\}.$$

For any two distinct nodes $u, v \in R_E$, $(u \xrightarrow{T} v) \in E_E$ if $(u \xrightarrow{T} v) \in E_P$, and $(u \xrightarrow{D} v) \in E_E$ if $(u \xrightarrow{D} v) \in E_P$.

Figure 2.5. A pair of trigger graph and dependency graph (panels (a) and (b))

Figure 2.6. A priority graph
Simply stated, the node set $R_E$ consists of a UTRS plus those rules that are reachable from rules in the UTRS through trigger paths in PG. The edge set $E_E$ is the set of trigger and directed dependency edges that connect nodes in $R_E$.

Example 2.4.4 Figure 2.7 shows an execution graph derived from the priority graph of Figure 2.6 when a UTRS has rules $r_3$, $r_5$, and $r_{10}$. A rule schedule can be obtained by performing the topological sort on the execution graph. The canonical RES for the equivalence class represented by the execution graph is $(r_3\, r_6\, r_7\, r_5 \ldots)$ [the remainder of the sequence is illegible in the source]. Note that the priority graphs shown in Figure 2.4 are, in fact, execution graphs as well.

Figure 2.7. An execution graph
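The construction of an execution graph from a priority graph and a UTRS amounts to a reachability computation over trigger edges followed by an edge filter. The Python sketch below illustrates this on a small hypothetical priority graph; it does not reproduce Figure 2.6, whose edges cannot be recovered here, and the function name `execution_graph` is an assumption.

```python
def execution_graph(utrs, trigger_edges, directed_dep_edges):
    """Return (R_E, trigger edges in E_E, directed dependency edges in E_E)."""
    children = {}
    for src, dst in trigger_edges:
        children.setdefault(src, []).append(dst)

    nodes, frontier = set(), list(utrs)
    while frontier:                      # follow trigger edges only
        rule = frontier.pop()
        if rule not in nodes:
            nodes.add(rule)
            frontier.extend(children.get(rule, []))

    def keep(edges):
        # An edge of PG survives only if both endpoints are in R_E.
        return {(u, v) for u, v in edges if u in nodes and v in nodes}

    return nodes, keep(trigger_edges), keep(directed_dep_edges)


# Hypothetical priority graph: solid (trigger) and dashed (dependency) edges.
trigger = [("r1", "r2"), ("r1", "r3"), ("r3", "r4"), ("r5", "r6")]
dependency = [("r2", "r6"), ("r4", "r6")]
nodes, e_t, e_d = execution_graph({"r3", "r5"}, trigger, dependency)
print(sorted(nodes))  # ['r3', 'r4', 'r5', 'r6']
print(sorted(e_d))    # [('r4', 'r6')]
```

Dependency edges are never followed during the reachability step, so a rule that merely conflicts with a triggered rule is not pulled into the execution graph.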
Lemma 2 Given execution graph EG = (RE, EE), a set of rule execution sequences corresponding to all feasible topological sorts on EG constitutes an equivalence class, independent of initial database state.
Proof of Lemma 2 Since EG is acyclic and all pairs of conflicting rules in EG are ordered (i.e., have an edge between them), all topological sorts on EG should have the same relative orders of conflicting rules. Then, by Lemma 1, RESs represented by the topological sorts are equivalent to each other. Also, since Lemma 1 holds without any premise on initial database states, Lemma 2 also holds regardless of initial database states. □
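Lemma 2 can be illustrated on the execution graph of Figure 2.4(a), taking its edges to be the trigger edges $r_1 \to r_2 \to r_3$ and $r_4 \to r_5 \to r_6$ plus the directed dependency edges $r_2 \to r_5$ and $r_5 \to r_3$, as Examples 2.4.1 and 2.4.2 suggest. The Python sketch below enumerates all topological sorts by backtracking and checks that every one orders the conflicting pairs identically. The function name `topological_sorts` is hypothetical, and exhaustive enumeration is only practical for small graphs.

```python
def topological_sorts(nodes, edges):
    """Yield every topological sort, lexicographically smallest first."""
    preds = {n: set() for n in nodes}
    for u, v in edges:
        preds[v].add(u)

    def extend(prefix, remaining):
        if not remaining:
            yield tuple(prefix)
            return
        done = set(prefix)
        for n in sorted(remaining):
            if preds[n] <= done:             # all predecessors already scheduled
                yield from extend(prefix + [n], remaining - {n})

    yield from extend([], frozenset(nodes))


nodes = ["r1", "r2", "r3", "r4", "r5", "r6"]
trigger = [("r1", "r2"), ("r2", "r3"), ("r4", "r5"), ("r5", "r6")]
dependency = [("r2", "r5"), ("r5", "r3")]        # directed dependency edges
schedules = list(topological_sorts(nodes, trigger + dependency))

print(len(schedules))   # 6
print(schedules[0])     # ('r1', 'r2', 'r4', 'r5', 'r3', 'r6')
print(all(s.index("r2") < s.index("r5") < s.index("r3") for s in schedules))  # True
```

The first schedule produced coincides with the canonical RES $\sigma_1$ of Example 2.4.2, because candidates are tried in sorted order at every step.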
The power of choice is now given to the users. There may be one system-wide priority graph for all rules defined in the system. All applications will be governed by a single type of confluent rule executions in such a case. More preferably, however, each user (or each application) may have a separate priority graph tailored for specific needs. Also, given a conflicting rule set, the user may choose to specify priorities over only a part of the conflicting rules and not be concerned about the rest of them. The rule scheduler will ignore unspecified (thus undirected) dependency edges in a priority graph. Taking this approach, our rule execution model can readily meet various user requirements with respect to confluent rule executions.
CHAPTER 3
IMPLEMENTATION OF CONFLUENT RULE SCHEDULER
3.1 Strict Order Preserving Rule Execution Model
In the previous chapter, only simple cases were considered. In particular, the structures of trigger graphs were trees, all rules in a UTRS were distinct, and no trigger paths existed between these rules. As a result, no trigger paths in the execution graph overlapped with one another. When rules in a UTRS have overlapping trigger paths, the priority graph and execution graph defined in the previous chapter do not capture the semantics. For example, consider Figure 3.1(a), which is the same as Figure 2.2(a). When $r_i$ and $r_j$ are triggered by a transaction, both rules instantiate their own trigger paths, and these trigger paths overlap with each other (see footnote 1). If we think of the graph as an execution graph, it yields two partial RESs: $(r_i\, r_j\, \bar{r}_k\, \bar{r}_l)$ and $(r_j\, \bar{r}_k\, \bar{r}_l)$. Therefore, a rule schedule (alternatively, a complete RES) should be a merged RES of the two partial RESs. A possible merged RES is $(r_i\, r_j\, r_j\, \bar{r}_k\, \bar{r}_l\, \bar{r}_k\, \bar{r}_l)$. The issues to be addressed in this chapter are (i) obtaining rule schedules from an execution graph where trigger paths overlap, (ii) assuring the confluence property when rules are executed in accordance with any such rule schedule, and (iii) parallel rule scheduling that takes advantage of the availability of multiple confluent rule schedules.
3.1.1 Extended Execution Graph
In order to understand the effect of overlapping trigger paths, we introduce an extended execution graph, used only for illustration purposes. Recall that given a system rule set along with its trigger graph, each rule in a UTRS instantiates a subgraph of the trigger graph whose sole root is that rule. However, it is possible to derive an execution graph, from a given priority graph and a UTRS, such that every rule in the UTRS becomes a root node in the resulting extended execution graph.

Figure 3.1. Overlapping trigger paths and extended execution graph (panels (a) and (b))

Footnote 1. The overlapping of trigger paths is not directly visible in Figure 3.1(a).
Example 3.1.1 Figure 3.1(b) shows the extended execution graph of Figure 3.1(a). In the extended execution graph, $r_j$ and $r_j'$ (similarly other rules as well) are the same rule and only represent different instantiations. Since there is a dependency between rules $r_k$ and $r_l$, this dependency may well be present between all instantiations of $r_k$ and $r_l$, as shown in Figure 3.1(b). Directions of dependency edges in an extended execution graph might be either inferred from the priority graph or specified by the user. Figure 3.2 shows three different acyclic orderings of the relevant dependency edges of Figure 3.1(b). Once an acyclic extended execution graph is given, the rule scheduler can schedule rule executions using the topological sort. All possible topological sorts constitute one equivalence class. □
The extended execution graph, however, cannot be used for priority specification; it is created only at rule execution time and priorities need to be specified before that. Thus we need to find alternative ways to interpret a priority graph.
Figure 3.2. Three different orderings of dependency edges in Figure 3.1(b) (panels (a), (b), and (c))
Figure 3.3. Extended execution graph in strict order preserving executions (panels (a) and (b))
3.1.2 Strict Order Preserving Executions
One way to derive an extended execution graph is to faithfully follow what the user specifies in the priority graph, i.e., priorities between conflicting rules. In strict order-preserving executions, if rule $r_i$ has precedence over rule $r_j$ in a priority graph, all instances of $r_i$ precede all instances of $r_j$ in the resulting rule schedules. Given a priority graph, an extended execution graph is obtained by simply adding the directed dependency edges in the priority graph to duplicated overlapping trigger paths. This scheme provides a simple solution for overlapping trigger paths, regardless of the number of times trigger paths overlap.
Example 3.1.2 Figure 3.3(a) shows a priority graph where rules r1, r2, and r3 are in UTRS and their overlapping trigger paths are denoted by the partial RESs (r1 · r2 · r4 · r5), (r2 · r4 · r5), and (r3 · r4 · r5). Figure 3.3(b) illustrates how an extended execution graph is built using strict order-preserving executions. First, overlapping trigger paths are separated. Second, any dependency edge in the priority graph that connects a rule in an overlapping trigger path and a rule in any trigger path is also introduced in the extended execution graph. For example, (r2 r4) and [...] the extended execution graph, the rule scheduler can schedule rule executions by performing a topological sort. All feasible topological sorts should constitute one equivalence class, since in every topological sort the executions of all conflicting rules are ordered in the same way. The canonical RES for the extended execution graph of Figure 3.3(b) is
3.1.3 Implementation
In order to implement a rule scheduler conforming to the strict order-preserving execution, one could directly use extended execution graphs such as Figure 3.3(b). However, there is a simpler way to derive an execution graph without having to duplicate every overlapping trigger path. Consider the extended execution graph of Figure 3.3(b). In the strict order-preserving execution, the directions of all dependency edges incoming to and outgoing from the copies of an overlapping trigger path are the same, as shown in the graph. Therefore, it is unnecessary to duplicate overlapping trigger paths. Instead, we add a rule_count to each node of a plain execution graph. The rule_count attached to a node indicates how many rules in UTRS share the trigger path to which the node (i.e., rule) belongs. Figure 3.4 depicts how the plain execution graph of Figure 3.3(a) is extended using rule_counts. In the new graph, m(i) indicates that the trigger path is shared by i instances of the associated rule.
Figure 3.4. Extended execution graph with rule_counts (node labels: r1 m(1), r2 m(2), r3 m(1), r4 m(3), r5 m(3))
Now the new extended execution graph can be used with a minor modification to the topological sort. Whenever a rule is scheduled, its rule_count in the execution graph is decreased by 1. If the rule_count reaches 0, the node and its outgoing edges are removed from the execution graph.
Figure 3.5 describes an algorithm, Build_EG(), for building an execution graph from a given priority graph (PG) and a UTRS. It uses M[ ], an array whose size is that of the system rule set R, to hold the rule_count values of rules in EG. If rule ri is (to be) in EG, M[i] represents the rule_count attached to ri. Build_EG() calls a subroutine, DFS_Copy_Tree(), for every instance in UTRS. Remember that UTRS is a multiset: it can have multiple instances of the same rule due to multiple triggers. DFS_Copy_Tree() traverses the PG in depth-first fashion and copies the portion of the PG that is reachable through trigger edges. It also increases the rule_count of each node that it visits. The execution graph of Figure 3.4 is obtained by applying Build_EG() to the priority graph of Figure 3.3(a) with UTRS = {r1, r2, r3}.
Once an execution graph is built by Build_EG(), the rule scheduler can schedule rule executions. However, it is possible that some rules in a trigger path in an execution graph are not triggered at all by their parent rule at run time. In such a case, rules in other trigger paths that have an incoming dependency edge from such a rule should not wait; otherwise the rule scheduler will get stuck. Furthermore,
Given PG and UTRS:

Build_EG()
{
    EG = ∅;
    initialize array M[ ] to 0's;    // M[ ] is the rule_count array
    for every ri ∈ UTRS do
        call DFS_Copy_Tree(ri);
}

DFS_Copy_Tree(ri)
{
    if (ri ∉ EG) then
        copy ri into EG;
    M[i]++;    // increase rule_count of node ri
    // Copy trigger edges
    for all rj such that ∃(ri →T rj) ∈ PG do
    {
        call DFS_Copy_Tree(rj);
        if ((ri →T rj) ∉ EG) then
            copy (ri →T rj) into EG;
    }
    // Copy dependency edges
    for all rj such that ∃(ri →D rj) ∈ PG do
        if (rj ∈ EG) and ((ri →D rj) ∉ EG) then
            copy (ri →D rj) into EG;
}

Figure 3.5. Algorithm Build_EG()
if a rule is not triggered, none of its descendant rules in the trigger path will be triggered either. This consideration must be applied recursively down the trigger paths.
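The construction of Figure 3.5 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the thesis's implementation: the function name `build_eg`, the dictionary and set representation of the priority graph, and the trigger edges of the example (inferred from the rule_counts in Figure 3.4) are all assumptions. It also copies dependency edges in a separate pass after all nodes are known, which is a simplification of the per-node copying in the figure, and it assumes the trigger graph is acyclic so that the recursion terminates.

```python
from collections import defaultdict


def build_eg(trigger, dependency, utrs):
    """Build an execution graph with rule_counts.

    trigger:    dict mapping a rule to the rules it may trigger (assumed acyclic)
    dependency: set of directed dependency edges (a, b) from the priority graph
    utrs:       list of triggered rules; a list because UTRS is a multiset
    """
    nodes = set()
    trigger_edges = set()
    rule_count = defaultdict(int)

    def dfs_copy_tree(rule):
        nodes.add(rule)
        rule_count[rule] += 1  # one more instance shares this trigger path
        for child in trigger.get(rule, ()):
            dfs_copy_tree(child)
            trigger_edges.add((rule, child))

    for rule in utrs:
        dfs_copy_tree(rule)

    # Assumption: dependency edges are copied once all nodes are present.
    dependency_edges = {(a, b) for (a, b) in dependency
                        if a in nodes and b in nodes}
    return nodes, trigger_edges, dependency_edges, dict(rule_count)


# Trigger edges inferred from Figure 3.4 (an assumption of this sketch).
trigger = {"r1": ["r2"], "r2": ["r4"], "r3": ["r4"], "r4": ["r5"]}
nodes, t_edges, d_edges, m = build_eg(trigger, set(), ["r1", "r2", "r3"])
print(m)
```

With UTRS = {r1, r2, r3}, the computed counts agree with the m(i) labels of Figure 3.4: r2 is visited twice (once as a root and once below r1), while r4 and r5 are visited three times.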
Figure 3.6 describes the rule scheduling algorithm, Schedule(), which is a modified version of the topological sort. Given an execution graph EG, it arbitrarily selects a node (i.e., rule) with indegree 0. Since EG is acyclic, there is always at least one node with indegree 0. After executing the selected rule, the scheduler decreases the rule_count of the node by 1, and if the rule_count reaches 0, the node is removed along with any trigger and dependency edges outgoing from it. Before the removal, however, it checks whether the executed rule has triggered its child rules in the trigger paths. If there are child rules that are not triggered, Schedule() calls a subroutine, DFS_Dec_M(), for those child rules. DFS_Dec_M() traverses down EG in depth-first fashion and decreases the rule_count of each node it visits by 1. If the rule_count of a node becomes 0, it removes the node and all its outgoing edges.
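The scheduling loop described above might look like the following Python sketch. The names `schedule` and `was_triggered` are hypothetical; `was_triggered` stands in for the run-time check of whether a parent actually triggered a child. Edges are modeled as (source, target, kind) triples with kind "T" or "D", and when a node is removed the sketch drops incoming edges as well as outgoing ones, which is a defensive simplification rather than something the text prescribes. The example graph uses the trigger edges inferred from Figure 3.4.

```python
def schedule(nodes, edges, rule_count, was_triggered=lambda parent, child: True):
    """Return the order in which rule instances are executed."""
    nodes, edges, m = set(nodes), set(edges), dict(rule_count)
    order = []

    def children(rule):
        # Children along trigger edges only.
        return sorted(d for (s, d, kind) in edges if s == rule and kind == "T")

    def remove(rule):
        nodes.discard(rule)
        edges.difference_update({e for e in edges if rule in (e[0], e[1])})

    def dfs_dec_m(rule):
        # Propagate "not triggered" down the trigger path.
        if rule not in nodes:
            return
        kids = children(rule)
        m[rule] -= 1
        for kid in kids:
            dfs_dec_m(kid)
        if m[rule] == 0:
            remove(rule)

    while nodes:
        ready = sorted(r for r in nodes if all(d != r for (_, d, _) in edges))
        if not ready:
            raise ValueError("execution graph is not acyclic")
        rule = ready[0]          # any indegree-0 node may be chosen
        order.append(rule)       # stands in for "execute rule"
        m[rule] -= 1
        for kid in children(rule):
            if not was_triggered(rule, kid):
                dfs_dec_m(kid)
        if m[rule] == 0:
            remove(rule)
    return order


nodes = {"r1", "r2", "r3", "r4", "r5"}
edges = {("r1", "r2", "T"), ("r2", "r4", "T"), ("r3", "r4", "T"), ("r4", "r5", "T")}
counts = {"r1": 1, "r2": 2, "r3": 1, "r4": 3, "r5": 3}
print(schedule(nodes, edges, counts))
print(schedule(nodes, edges, counts, lambda p, c: (p, c) != ("r3", "r4")))
```

In the second call, r3 does not trigger r4, so one pending instance each of r4 and r5 is cancelled and the schedule is two executions shorter.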
Theorem 1 The strict order-preserving rule execution model guarantees confluent rule executions.
Proof of Theorem 1 Based on Lemma 2, algorithms Build_EG() and Schedule() together serve as a constructive proof of the theorem: by these algorithms, overlapping trigger paths are separated, effectively making them ordinary acyclic graphs, and a topological sort is performed on the graphs. □
3.1.4 Parallel Rule Executions
The execution graph naturally allows parallel execution of rules. In the extended form, such as Figure 3.3(b), all rules with indegree 0 can be launched in parallel for execution. Since there should be no dependency edges between nodes with indegree 0 in an execution graph,2 relative execution order of those independent rules does not
2Note that an execution graph reduces its size as rules are executed.
Given EG:

Schedule()
{
    while (EG ≠ ∅) do
    {
        choose a node ri with indegree 0;
        execute ri;
        M[i]--;    // decrease rule_count of node ri
        for all rj such that ∃(ri →T rj) ∈ EG do
            if (rj was not triggered by execution of ri) then
                call DFS_Dec_M(rj);
        if (M[i] = 0) then
            delete node ri and edges (ri →T rk) and (ri →D rl),
                for any k and l, from EG;
    }
}

DFS_Dec_M(rj)
{
    M[j]--;    // decrease rule_count of node rj
    for all rk such that ∃(rj →T rk) ∈ EG do
        call DFS_Dec_M(rk);
    if (M[j] = 0) then
        delete node rj and edges (rj →T rl) and (rj →D rm),
            for any l and m, from EG;
    // don't need to delete dependency edges incoming to rj
}

Figure 3.6. Algorithm Schedule()
affect the final outcome. Note also that multiple instances of the same rule can be scheduled for execution at the same time.
In an execution graph with rule_counts, which we use for our work, all rules with indegree 0 are scheduled simultaneously, each as many times as its associated rule_count. In Figure 3.4, for instance, after scheduling and executing one instance each of r1 and r3 in parallel, two instances of r2 can be scheduled for execution since the rule_count of r2 is 2. In order to implement parallel rule executions, some changes must be made to the algorithm of Figure 3.6, which we will not elaborate in this work. Whenever execution of one instance of a rule is completed, the associated rule_count needs to be decreased by 1 and its child rules have to be checked to see whether or not they were triggered by the parent rule. If some are not triggered, DFS_Dec_M() should be called to recursively decrease rule_counts along the trigger paths. Since the rule_count array M[ ] and the execution graph are shared data structures, some locking mechanism needs to be used to avoid update anomalies within the data structures.
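The grouping of rule instances into parallel rounds can be illustrated with a small sequential simulation. This sketch makes several simplifying assumptions: the function name `parallel_rounds` is hypothetical, every instance is assumed to trigger its children (so DFS_Dec_M() is never needed), no actual threads or locks are involved, and the example uses only the trigger edges inferred from Figure 3.4.

```python
def parallel_rounds(edges, rule_count):
    """Group rule instances into rounds that could be launched concurrently.

    edges:      set of (source, target) pairs, trigger and dependency edges alike
    rule_count: dict mapping each rule to its rule_count
    Returns a list of rounds; each round lists (rule, number_of_instances).
    """
    remaining = dict(rule_count)
    edges = set(edges)
    rounds = []
    while remaining:
        # Every node with indegree 0 is ready, with all of its instances.
        ready = sorted(r for r in remaining if all(d != r for (_, d) in edges))
        if not ready:
            raise ValueError("execution graph is not acyclic")
        rounds.append([(r, remaining[r]) for r in ready])
        for r in ready:
            del remaining[r]
        edges = {(s, d) for (s, d) in edges if s not in ready}
    return rounds


edges = {("r1", "r2"), ("r2", "r4"), ("r3", "r4"), ("r4", "r5")}
counts = {"r1": 1, "r2": 2, "r3": 1, "r4": 3, "r5": 3}
for number, batch in enumerate(parallel_rounds(edges, counts), start=1):
    print(number, batch)
```

The first round contains one instance each of r1 and r3, and the second contains the two instances of r2, matching the walk-through of Figure 3.4 above.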
One important measure in parallel processing is the degree of parallelism. In the active rule system, the maximum parallelism is bounded by the dependencies between rules in the system rule set. For instance, if all the rules are independent of each other, ideally all triggered rules can be executed in parallel. As dependencies between rules increase, the degree of parallelism decreases. However, other components can restrict parallelism too. As discussed in Section 2.2, an improper priority specification and rule execution model may serially execute a rule set that could have been executed in parallel. In our work specifically, two components can hamper parallelism, both resulting from static analysis. First, the precision of dependency analysis between two rules can affect parallelism. Even though there is a data dependency between two rules, they can be commutative in reality. Being unable to detect such hidden commutativity results in a false dependency edge in an execution graph, likely costing
parallelism. Second, the precision of trigger relationship analysis can similarly affect parallelism. If we know for sure that one rule triggers another rule, the trigger edge between the two rules can be deleted after all rule_count values are computed. This way, the two rules can be scheduled in parallel if there is no other path connecting them in the resultant execution graph. Using static analysis, we cannot completely avoid uncertain trigger edges, and the presence of uncertain trigger edges can cost parallelism. However, ignoring the loss caused by the imprecision of static analysis, the strict order-preserving rule executions exploit the maximum parallelism existing in a given system rule set. We state this in Theorem 2.
Theorem 2 Using the strict order-preserving rule executions, the active rule execution model achieves the maximum parallelism within the limitations of static trigger and dependency analysis.
Proof of Theorem 2 Given any acyclic extended execution graph, there are two kinds of edges: trigger edges and dependency edges. We first assume that no trigger edges can be removed, that is, they are all uncertain trigger edges. Now suppose there are superfluous dependency edges in the execution graph whose absence does not affect the final database state. (Therefore we could remove them safely to increase parallelism.) There can be only two types of dependency edges in an acyclic execution graph. Given any dependency edge (ri → rj), ri is either a proper ancestor of rj in a trigger path containing both ri and rj, or not an ancestor of rj in any trigger path. In the first case, even though the dependency edge is redundant in terms of representation, removal of that edge does not allow rj to execute before ri, since rj is yet to be triggered by ri or its descendant. Thus, removal of the first type of dependency edge has no effect of increasing parallelism. In the second case, if (ri → rj) is the only dependency path, formed by trigger edges and dependency edges, that connects ri to rj, obviously it cannot be removed at any rate. If
there exist other dependency paths connecting ri to rj (of course, if such paths exist, they must be longer than the one-edge path (ri → rj)), then (ri → rj) is redundant, but again, removal of the dependency edge does not allow rj to execute before ri. By applying this argument to all dependency edges present in the extended execution graph, we can see that dependency edges are either necessary or redundant, but removal of redundant edges does not increase parallelism. Since the execution graph with rule_counts is equivalent to the extended execution graph under the strict order-preserving rule executions, we can conclude that the rule scheduler exploits the maximum parallelism inherent in the system rule set. □
3.2 Alternative Policies for Handling Overlapping Trigger Paths
3.2.1 Serial Trigger-Path Executions
In order to handle overlapping trigger paths, there may be approaches other than the strict order-preserving rule executions. One obvious approach is serial trigger-path execution, in which all rules in an overlapping trigger path execute before any rules in other overlapping trigger paths. In other words, when trigger paths overlap, the rules in those overlapping trigger paths execute trigger path by trigger path. For instance, in Figure 3.1, (ri r3  rk , iFkf) is a serial trigger-path execution.
Although the serial trigger-path execution could appear more intuitive, unfortunately it brings back the old problem. Different serial trigger-path executions may result in different final database states when there exist dependencies between the overlapping trigger paths. Therefore the user has to choose one serial order over the conflicting overlapping trigger paths to obtain a unique final database state. However, choosing a serial order in advance (i.e., at compile time) is not always possible, since, as discussed in Section 2.2, multiple instances of the same rule may be in UTRS and one cannot predict them before the rules execute. To reiterate
Figure 3.7. A priority graph with absolute priorities (r1 [20], r2 [15], r3 [10])
that, let's assume that A, B, and C are three overlapping trigger paths rooted at rules a, b, and c, respectively, and that the trigger paths all conflict with each other. Then, if one instance each of a, b, and c is in UTRS, any of the six permutations (3!) of A, B, and C (i.e., serial trigger-path executions) may produce a different result. However, if rule a is triggered twice while the other rules remain as before, four trigger paths A, A', B, C will be instantiated and the number of permutations increases to 24 (4!). As more instances of a, b, or c are added to UTRS, the number of permutations grows factorially.
From the observation above, it is apparent that we again have to settle for a scheme, discussed below, that is less general than the full-fledged serial trigger-path execution. Figure 3.7 shows a priority graph that is slightly different from Figure 3.3(a). The new graph has dependency edges between r1 and r4, and between r3 and r5, as before. Also, it uses absolute priorities (numbers in brackets) attached to rules rather than directed dependency edges. Using this priority graph, the rule scheduler can schedule a serial trigger-path execution by first executing the rules in the trigger path whose root rule has the highest priority. For instance, if rules r1, r2, and r3 are each triggered once by a user transaction, given the priorities, the serial trigger-path execution being scheduled will be (r1 · r2 · r4 · r5 · r2 · r4 · r5 · r3 · r4 · r5). Now if r2 is triggered twice and r1 and r3 once, then since the priority of r2's instances is
lower than that of r1 and higher than that of r3, the resultant serial trigger-path execution will be (r1 · r2 · r4 · r5 · r2 · r4 · r5 · r2 · r4 · r5 · r3 · r4 · r5).
Although we will not elaborate in this work, it would be worthwhile to investigate more efficient and convenient ways of specifying serial execution orders if the serial trigger-path execution is to be pursued for implementation. Using absolute priorities could be a hassle, as pointed out in Section 2.2. On the other hand, sacrificing generality in serial order specification might do more good than harm. For example, it can be assumed that in a trigger path, ancestors have priority over their descendants when they are in UTRS (or vice versa). If such an assumption is made, the explicit priority specification between r1 and r2 in Figure 3.7 can be omitted. It should be noted that in order to schedule serial trigger-path executions, the rule scheduler described in the previous section needs to be modified appropriately to deal with the added absolute priorities.
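Scheduling by absolute priority, as described for Figure 3.7, can be sketched as follows. This is an illustration under assumptions rather than the scheduler of the previous section: the function names are hypothetical, the trigger edges are those inferred from Figure 3.4, each trigger path is expanded in depth-first order, and every rule is assumed to trigger all of its children.

```python
def trigger_path(root, trigger):
    """Expand the trigger path rooted at `root` in depth-first order."""
    path = [root]
    for child in trigger.get(root, ()):
        path.extend(trigger_path(child, trigger))
    return path


def serial_execution(utrs, priority, trigger):
    """Run whole trigger paths one after another, highest root priority first."""
    ordered_roots = sorted(utrs, key=lambda rule: -priority[rule])
    sequence = []
    for root in ordered_roots:
        sequence.extend(trigger_path(root, trigger))
    return sequence


trigger = {"r1": ["r2"], "r2": ["r4"], "r3": ["r4"], "r4": ["r5"]}
priority = {"r1": 20, "r2": 15, "r3": 10}   # absolute priorities of Figure 3.7

print(serial_execution(["r1", "r2", "r3"], priority, trigger))
print(serial_execution(["r1", "r2", "r2", "r3"], priority, trigger))
```

Only root rules need a priority in this sketch, because a path is always executed as a unit once its root has been selected.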
3.2.2 Serializable Trigger-Path Executions
A more interesting scheduling policy than the serial trigger-path execution is serializable trigger-path execution, in which the rule execution sequence may not be a serial trigger-path execution but the final result is equivalent to that of one. In Figure 3.7, for example, one can see that a rule execution sequence (r, r3 r.  4f F f) is equivalent to the serial trigger-path execution (ri r1  .4f r1  fk  fl). Therefore, the former is a serializable trigger-path execution.
Once a serial execution order of overlapping trigger paths is set, it can be readily translated to a serializable trigger-path execution using the extended execution graph. Figure 3.8 illustrates the first step in translating the serial trigger-path execution of Figure 3.7, assuming r2 is triggered twice, to an equivalent serializable trigger-path execution. In Figure 3.8, dashed lines, presently undirected, represent dependencies between rules (only inter-trigger-path dependencies are shown), and directed dotted
Figure 3.8. Translation to serializable trigger-path execution: Step 1
lines connecting the last rule in a trigger path to the first rule in the next trigger path (between r5 and r2, for example) impose the serial order specified in Figure 3.7. Ignoring the undirected dependency edges and regarding the newly added dotted arrows as the only dependency edges, our rule scheduler will schedule the original serial trigger-path execution. Then, as the second step of the translation, the directions of the dependency edges are set by following the trigger paths bridged by the dotted arrows, so that the directions are consistent with the serial order. After that, the dotted arrows are deleted. Figure 3.9 shows the result of the second step. Now if the translated execution graph is fed to the rule scheduler, an equivalent serializable trigger-path execution will be generated. Note that there may be multiple equivalent serializable trigger-path executions, and it is these that enable parallel rule executions. The maximum parallelism of Theorem 2 still holds for the translated execution graph.
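The second translation step, directing each inter-path dependency from the earlier trigger path to the later one, might be sketched as below. The function name `orient_dependencies` and the representation of rule instances as (path index, rule) pairs are hypothetical, and the two dependent rule pairs used in the example are an assumption about Figure 3.7. Dependencies inside a single trigger path are ignored, as in Figure 3.8.

```python
def orient_dependencies(paths, dependent_pairs):
    """Direct inter-path dependency edges consistently with a serial order.

    paths:           trigger paths listed in the chosen serial order
    dependent_pairs: unordered pairs of rules that depend on each other
    Returns directed edges between (path_index, rule) instances.
    """
    directed = set()
    for p, earlier in enumerate(paths):
        for q in range(p + 1, len(paths)):
            later = paths[q]
            for a, b in dependent_pairs:
                # Try both roles, since the pair itself is undirected.
                for x, y in ((a, b), (b, a)):
                    if x in earlier and y in later:
                        directed.add(((p, x), (q, y)))
    return directed


# Serial order for Figure 3.7 with r2 triggered twice.
paths = [["r1", "r2", "r4", "r5"],
         ["r2", "r4", "r5"],
         ["r2", "r4", "r5"],
         ["r3", "r4", "r5"]]
dependent_pairs = {("r1", "r4"), ("r3", "r5")}   # assumed dependencies

for edge in sorted(orient_dependencies(paths, dependent_pairs)):
    print(edge)
```

Once the edges are directed this way, the dotted arrows that enforced the serial order are no longer needed, and rules that are not connected by any remaining path can be scheduled in parallel.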
3.2.3 Comparisons with Strict Order-Preserving Execution
Comparing the serializable (or serial) trigger-path execution with the strict order-preserving execution raises an interesting point. Although choosing between them remains largely a subjective matter, enterprise policies and application semantics could favor one over the other. Suppose rule ri triggers rule rj and there is a dependency between them. When multiple instances of ri are triggered
by a user transaction, in the traditional transaction processing environment the serializable trigger-path execution would be favored, since regarding trigger paths as subtransactions [...] the environment. On the other hand, when ri stands for an absolute raise to all employees' salaries (say $1000) and rj for a relative raise [...], the strict order-preserving execution [...] would make more sense [...] raise before relative raises. Considering that serializability [...] has been proposed to be used for rule executions in some [...] maximum parallelism in rule executions.
Figure 3.9. Translation to serializable trigger-path execution: Step 2
3.3 Discussion and Conclusions
In this work we have proposed a new active-rule execution model along with priority specification schemes to achieve confluent rule executions in active databases. Like other rule execution models, we employ prioritization to resolve conflicts between rules. Ideally, by prioritizing executions of conflicting rules whose different relative execution orders can yield different database states, one can achieve confluent rule executions. It is necessary, however, to prioritize as few rules as possible, since prioritizing many rules in a meaningful way could itself be a challenge, and the more rules are prioritized, the fewer rules can execute in parallel. Prioritizing conflicting rules only, on the other hand, may lead to incorrect results, as executions of seemingly independent rules can trigger and execute conflicting rules in the wrong order. Also, when rules in the same trigger path are triggered, resulting in overlapping trigger paths, more problems can arise. Unlike previous rule execution models, our model uses a rule scheduler based on the topological sort to respect all the specified priorities where applicable. This way, rules triggered and executed from a user transaction follow the execution sequence imposed by the priority specification, which makes their execution confluent. We have also proposed the strict order-preserving rule execution to deal with overlapping trigger paths. In the strict order-preserving rule execution, when part or all of a trigger path is executed multiple times and there are priorities between rules in the trigger path, the rules are executed in such a way that no rule appears before rules with higher priorities, if any. It has been shown that our rule execution model can exploit maximum parallelism in rule execution. Lastly, we have discussed other possible ways of handling overlapping trigger paths, i.e., the serial trigger-path execution and the serializable trigger-path execution.
There are other related issues not pursued in our current work. One such issue is the precision of data dependency (or dependency in general). Our definition in
Section 2.3.2 may be too coarse, as some rules might be commutative despite the presence of the defined dependencies. If an active rule language has a definite form, as SQL does, the definition of dependency and its detection may be tightened by analyzing static expressions in rule definitions, as done in Baralis and Widom [7]. In Sentinel, a rule has a very general form since the condition and action parts can be arbitrary functions. In such a system, even detecting dependency with a margin of imprecision can be challenging. However, we know empirically that in general only a few stored data items are referenced in those functions. The rule designer should be able to readily recognize the referenced data items and classify them into a read set and a write set. Once the read set and write set are obtained, the dependency graph can be drawn as usual.
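As an illustration of how a dependency graph might be derived once read sets and write sets are known, the following sketch marks two rules as dependent when one writes a data item that the other reads or writes. This conflict test is the conventional read-write and write-write criterion and is an assumption here; the exact definition of Section 2.3.2 may differ. The function name, rule names, and data items are all hypothetical.

```python
from itertools import combinations


def dependency_pairs(read_set, write_set):
    """Return unordered pairs of rules whose read/write sets conflict."""
    rules = sorted(read_set.keys() | write_set.keys())
    pairs = set()
    for a, b in combinations(rules, 2):
        reads_a, reads_b = read_set.get(a, set()), read_set.get(b, set())
        writes_a, writes_b = write_set.get(a, set()), write_set.get(b, set())
        # Conflict: a write meets a read or a write of the other rule.
        if writes_a & (reads_b | writes_b) or writes_b & reads_a:
            pairs.add((a, b))
    return pairs


# Hypothetical rules and data items, classified by the rule designer.
read_set = {"r1": {"salary"}, "r2": {"budget"}, "r3": {"age"}}
write_set = {"r1": {"bonus"}, "r2": {"salary"}, "r3": {"log"}}

print(dependency_pairs(read_set, write_set))
```

A coarse test of this kind can report false dependencies for rules that are in fact commutative, which is the imprecision discussed above.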
The problem of confluent executions can be further complicated in active databases such as Sentinel [13] and HiPAC [12], where coupling modes between event and condition and between condition and action are defined. In Sentinel and HiPAC three coupling modes are defined: immediate, deferred, and detached. For coupling between event and condition, the immediate mode prescribes that the condition be tested immediately after a relevant event is detected. In deferred mode, the condition is tested after all user-defined operations (except commit/abort) in the current transaction are performed. In detached mode, the condition is tested in a different transaction. The semantics of these modes holds for the coupling between condition and action as well.
Our work described in this thesis assumes the immediate mode both between event and condition and between condition and action. However, considering that in the deferred mode tests (or actions) are carried out in the same order as the occurrence order of the events that triggered the rules, our model also works well for deferred-deferred couplings between event and condition and between condition and action.
Interestingly, in a situation where the event-condition coupling is the immediate mode and the condition-action coupling is the deferred mode, our model is still applicable. But this combination of coupling modes precludes the possibility of one rule's untriggering others, since all tests are completed before any action is performed. On the other hand, the detached mode does not suit our model, nor other models that deal only with conflicts within a transaction boundary. By detaching the execution of the action part from the current transaction, even the execution of a single rule can result in different final database states depending on the interaction between the detached transaction and the parent transaction. Those interactions are governed by the transaction model the system employs. Unless a rule execution model is integrated with the transaction model, there is no room to control the inter-transaction interactions to make rule executions confluent.
CHAPTER 4
AGGREGATE CACHE
4.1 Motivation
Complex aggregate expressions are frequently computed in data warehouses to support decision making and for statistical data analysis. These computations are expensive to perform as they require at least one full scan of usually huge (or even distributed) base databases (or operational databases). Hence, it is vital for data warehouse applications to have a means of reducing the computation overhead. An approach used to address this problem is to materialize views defined in the data warehouse, assuming that both the underlying base databases and the data warehouse use the relational data model. There already exists a body of literature on view materialization in the context of relational databases or deductive databases [8, 22, 25, 27]. Recently, the same issue has been reexamined by several researchers [23, 24, 40] from the data warehouse viewpoint. However, the bulk of the literature deals with SPJ (Select-Project-Join) views. Quite a few papers mention materialization of aggregates as a means of performance improvement, but no in-depth study has been reported except [32]. Using view materialization, some of the SPJ views defined in the data warehouse schema are chosen to be materialized. Then, when base databases change, materialized views that are affected by the change are updated incrementally if possible; otherwise the affected views are rematerialized. Realistically, however, application of traditional view materialization techniques seems inappropriate for the data warehouse environment for the following two reasons:
1. Unlike views, aggregates are usually not predefined and it is hard to predict
what aggregates on what data items are computed, since aggregates tend to be
used in ad hoc queries.
2. Due to the natural process of data analysis or decision making, many aggregates
are not likely to be used again (or rarely used) after some period of heavy usage.
Note that the second reason above implies the presence of temporal locality property of aggregates usage in data warehouse applications. This is one good reason to introduce the cache mechanism for handling aggregates in data warehouses. In addition, by using a cache, an aggregate is cached the first time when it is actually used. That is, the user is not required to declare in advance what aggregates would be used. Therefore this is also a good way to deal with ad hoc queries containing aggregates.
In addition to temporal locality, spatial locality is also observed in aggregate usage in data warehouse applications. That is, when an aggregate is used in a query, other aggregates closely related to it from a data modeling perspective are also expected to be queried sooner or later. For example, if the average salary of all employees in a department has been queried, it is more likely that the average age of the same group of employees will be queried in a short period of time than the total of orders placed today. Unfortunately, this sort of spatial locality is too loose and abstract to be used to enhance the efficiency of the cache. Therefore we will not consider spatial locality in our work.
In this work we propose an aggregate cache. As Figure 4.1 shows, the aggregate cache can be added to a data warehouse as a plug-in. The aggregate cache uses part of the data warehouse (i.e., disks) as a cache. It interacts with the query processor of the data warehouse manager (therefore, the query processor needs to be modified to some extent) to expedite computation of aggregates. It also interacts with the integrator,
Figure 4.1. A Data Warehouse and Aggregate Cache (components shown: Data Warehouse Manager, Aggregate Cache Manager, Integrator, and a Change Detector & Notifier with DBMS for each of Base Database 1, Base Database 2, ..., Base Database n)
which intercepts change notifications from base databases that are relevant to cached aggregates, and updates affected cached aggregates appropriately.
When an aggregate in a query (of the data warehouse user) is processed, the query processor of the data warehouse manager first consults the aggregate cache manager. If the aggregate is found in the cache, the stored value is used for processing the query. Otherwise, the aggregate needs to be recomputed from data in base databases or from data in the data warehouse if an appropriate (presumably, summarized) copy is present there. In any case, the computed aggregate is cached for later use, unless otherwise directed.
For the aggregate cache proposed in this work, not all aggregates need to be cached. If the user knows that some of the aggregates will not be used again, the user should be able to mark such aggregates in queries so that they are not cached by the aggregate cache. On the other hand, if a certain set of aggregates is known to be used frequently, these aggregates can be precomputed and installed in the cache, and will not be replaced until explicitly requested. Between these two extremes, all ordinary aggregates are cached when they are first referenced in queries and are replaced later according to a cache replacement policy. By taking this approach, as far as aggregates are concerned, the aggregate cache as a whole behaves as an adaptive schema for a data warehouse that dynamically adapts itself to uncertain and varying user requirements.
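The three classes of aggregates described above (never cached, pinned, and ordinary) can be illustrated with a small in-memory sketch. The class name `AggregateCache`, its methods, and the choice of least-recently-used replacement are assumptions made for illustration; the text does not fix a replacement policy, and the proposed cache resides on the data warehouse's disks rather than in memory.

```python
from collections import OrderedDict


class AggregateCache:
    """Toy aggregate cache with pinned entries and LRU replacement (assumed)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pinned = {}              # precomputed, never replaced implicitly
        self.entries = OrderedDict()  # ordinary aggregates, oldest first

    def pin(self, key, value):
        self.pinned[key] = value

    def unpin(self, key):
        self.pinned.pop(key, None)

    def get(self, key, compute, cacheable=True):
        """Return the aggregate for `key`, computing it on a miss."""
        if key in self.pinned:
            return self.pinned[key]
        if key in self.entries:
            self.entries.move_to_end(key)     # mark as recently used
            return self.entries[key]
        value = compute()                     # stands in for a scan of base data
        if cacheable:
            self.entries[key] = value
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # evict least recently used
        return value


scans = []

def scan_average_salary():
    scans.append("scan")
    return 52000.0

cache = AggregateCache(capacity=2)
cache.get("avg(salary) dept=10", scan_average_salary)
cache.get("avg(salary) dept=10", scan_average_salary)   # served from the cache
cache.get("one-off", scan_average_salary, cacheable=False)
print(len(scans), list(cache.entries))
```

Passing `cacheable=False` corresponds to the user marking an aggregate in a query so that it is not cached, and `pin` corresponds to installing a precomputed aggregate that stays until explicitly removed.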
4.2 Updating Cached Aggregates
As base databases change, there should be some way of reflecting those changes to aggregates cached in the data warehouse, if the changes are relevant. Presuming affected aggregates are known, one way is to invalidate the affected aggregates. This method was addressed by Sellis [32] in the context of maintaining derived data in relational databases. However, in data warehouse environments, where changes to base databases are expected to be frequent (one of the very reasons why the data warehouse was introduced), simply invalidating affected aggregates will not be a good approach as cached aggregates would not have much chance to be reused. Therefore, our natural choice is updating affected aggregates as necessary.
Again assuming affected aggregates are known when base databases change, there may be several ways of updating the aggregates, as outlined below:
Rematerialization:
Affected aggregates are recomputed by rescanning base databases.
Periodic Update:
Changes to base databases are kept in the respective base databases and periodically the changes are propagated to the data warehouse. The data warehouse then collects the changes and incrementally updates affected aggregates.
Eager Update:
Every time a change is made to a base database, if the change is relevant to any cached aggregates, it is propagated to the data warehouse so that the data warehouse can incrementally update the affected aggregates.
On-demand Update:
Changes to base databases are accumulated either on the side of the base databases or on the side of the data warehouse, forming delta files. When a cached aggregate is accessed by a query, relevant delta files are collected and the aggregate is updated using the delta files.
The rematerialization approach is the most expensive. In the aggregate cache, rematerializing affected aggregates prior to their usage makes little sense unless the system load is particularly light at that time. This method, however, can be combined with an incremental method to handle a few cases where the incremental method cannot be applied.
A naive implementation of the periodic update could be troublesome. If the frequency of update is too low, values of cached aggregates will be outdated. If the frequency is too high, on the other hand, it will degrade the performance of the system. Rather than being used independently, the periodic update can be combined with the on-demand update, as will be explained later.
The eager update could appear to be the best way to deliver up-to-date values of cached aggregates quickly. However, unless the system's timeliness requirement is
very stringent, this method is considered overkill, since generally one does not know when cached aggregates are to be used and whether or not they will ever be used again before they are replaced. Furthermore, there is no guarantee that the eager update will always outperform other methods in timeliness, since it could take up too much of the system's resources if base databases change often and there are many cached aggregates to update. In fact, a simulation result reported by Adelberg et al. [1], in which various recomputation methods for maintaining derived data were compared, shows that the eager update is inferior to the on-demand update in terms of query response time. Although the simulation setting does not completely match the situation of aggregate maintenance in a data warehouse environment, we think we can safely extrapolate this result to draw a similar conclusion for the aggregate cache.1
The on-demand update is considered the best fit for the aggregate cache, since the base databases are expected to change much more frequently than aggregates in the data warehouse are accessed. As mentioned earlier, the delta files can be kept either in base databases or in the data warehouse. Keeping delta files in the data warehouse is, in effect, closer to the eager update, since whenever a related base database changes, the change needs to be transferred to the data warehouse; unlike the eager update, however, affected cached aggregates are recalculated only when they are referenced. On the other hand, keeping delta files in each base database will cause some delay when an affected cached aggregate is referenced. If
1 We expect that the gap between the eager update and the on-demand update will widen in the aggregate cache, since the simulation by Adelberg et al. [1] does not consider the overhead of change propagation between a base database and the data warehouse, which is likely to be substantial. Extensive change propagation will degrade the performance of the eager update further. Moreover, we can regard an aggregate as a derived datum with a huge "fan-in" (the term used in the simulation for the number of source data of a derived datum), since an aggregate needs to be recomputed whenever a row in a referenced relation is modified. According to the simulation result, the gap between the eager update and the on-demand update widens as the fan-in increases.
the data warehouse and base databases are connected through a network, network connection and data transfer time will dominate the delay. In our work, delta files are kept in base databases by default; if a base database has difficulty keeping them and the network overhead between that base database and the data warehouse is not severe, its delta files may instead be kept in the data warehouse.
An important premise for using the on-demand update (and the eager and periodic updates as well) is that aggregates can be computed incrementally. Although a large class of aggregates can be computed incrementally, some cannot. Such aggregates include order-related ones such as min, max, and median [26, 27]. When an aggregate cannot be computed incrementally, it is rematerialized (i.e., recomputed) on demand.
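The contrast between an incrementally computable aggregate and an order-related one can be sketched as follows. This is a minimal illustration, not the thesis implementation: the function names and the `rescan` callback are hypothetical, and the base data is assumed to hold distinct values.

```python
def update_sum(cached_sum, inserted, deleted):
    """Incremental update: only the cached value and the deltas are needed."""
    return cached_sum + sum(inserted) - sum(deleted)


def update_max(cached_max, inserted, deleted, rescan):
    """Update a cached max, rematerializing when the cached value is deleted.

    `rescan` is a hypothetical callback that recomputes max over the base data.
    """
    if cached_max in deleted:
        # The cached value alone cannot reveal the next-largest element.
        return rescan()
    return max([cached_max, *inserted])


base = [3, 9, 4]
cached_sum, cached_max = sum(base), max(base)

# One batch of changes to the base data: delete 9, insert 5.
inserted, deleted = [5], [9]
base = [v for v in base if v not in deleted] + inserted

rescans = []  # records how often rematerialization was required


def rescan():
    rescans.append(1)
    return max(base)


new_sum = update_sum(cached_sum, inserted, deleted)
new_max = update_max(cached_max, inserted, deleted, rescan)
```

The sum is refreshed from the deltas alone, whereas the max falls back to a full scan as soon as the current maximum is deleted.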
One problem with the on-demand update is that, if the base databases related to a cached aggregate change frequently and the aggregate is not referenced for a long time, delta files can grow too large in the base databases. This problem can be handled in several ways:
1. The data warehouse periodically polls base databases on behalf of cached aggregates that have not been referenced for a certain period of time. The base databases then respond by sending delta files to the data warehouse. This is the periodic update.
2. Each base database monitors the sizes of its delta files. If the size of a delta file grows beyond a limit, the base database notifies the data warehouse and sends the oversized delta file to it. This is an aperiodic (or asynchronous) counterpart of the periodic update.
3. Instead of keeping delta files, base databases keep numeric deltas of affected aggregates, and the numeric values are sent when the aggregates are referenced in the data warehouse. As a simple example, when the aggregate count(*) on a relation in a base database is cached, the numeric delta is the number of tuples inserted into that relation minus the number of tuples deleted from it. When the aggregate is referenced, the numeric delta is sent to the data warehouse instead of a delta file containing the inserted and deleted tuples.
4. The overgrown delta files are discarded and related cached aggregates are replaced out.
Clearly, the aperiodic update (the second one above) is better than the periodic update since in the aperiodic update, transfer of delta files takes place only when necessary.
The numeric delta method (the third one) has an effect of distributing burden of the data warehouse to base databases. If base databases can sustain the newly imposed load, the performance of data warehouse should improve since network traffic can be reduced substantially and incremental updates in the aggregate cache become simpler. A drawback of this method is the added complexity, especially on the sides of base databases. Base databases now have to know what aggregates are being cached and all the information needed to perform partial computations of affected aggregates.
Discarding overgrown delta files is closely coupled with the cache replacement policy. If the policy is LRU (Least Recently Used) or its variation, cached aggregates not referenced for a long period of time can be replaced out or just flushed. Therefore, it is justifiable to discard overgrown delta files if related cached aggregates are not used for a long time. Of course, if delta files are to be discarded, related cached aggregates should also be deleted.
To sum up, in our aggregate cache we use both the aperiodic update and the numeric delta method in conjunction with the on-demand update. The aperiodic update is used when a base database is unable to implement the numeric delta method due to its limited functionality, or when an aggregate does not allow the numeric delta method.
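The difference between shipping a delta file and shipping a numeric delta on demand can be sketched as below. All class and method names (`BaseRelation`, `CachedCount`, `CachedSum`, `ship_delta_file`, `ship_numeric_delta`) are hypothetical, and a relation is simplified to a list of numbers; the sketch only assumes that each cached aggregate has its own change log at the base database.

```python
from dataclasses import dataclass, field


@dataclass
class BaseRelation:
    """Hypothetical base-database relation that logs its own changes."""
    rows: list = field(default_factory=list)
    inserted: list = field(default_factory=list)  # delta file: inserted tuples
    deleted: list = field(default_factory=list)   # delta file: deleted tuples
    count_delta: int = 0                          # numeric delta for count(*)

    def insert(self, row):
        self.rows.append(row)
        self.inserted.append(row)
        self.count_delta += 1

    def delete(self, row):
        self.rows.remove(row)
        self.deleted.append(row)
        self.count_delta -= 1

    def ship_delta_file(self):
        """Hand over and clear the logged tuples."""
        delta = (self.inserted, self.deleted)
        self.inserted, self.deleted = [], []
        return delta

    def ship_numeric_delta(self):
        """Hand over and clear a single number instead of tuples."""
        delta, self.count_delta = self.count_delta, 0
        return delta


class CachedCount:
    """Warehouse-side count(*), refreshed on demand from a numeric delta."""
    def __init__(self, relation):
        self.relation = relation
        self.value = len(relation.rows)

    def read(self):
        self.value += self.relation.ship_numeric_delta()
        return self.value


class CachedSum:
    """Warehouse-side sum, refreshed on demand from a delta file."""
    def __init__(self, relation):
        self.relation = relation
        self.value = sum(relation.rows)

    def read(self):
        inserted, deleted = self.relation.ship_delta_file()
        self.value += sum(inserted) - sum(deleted)
        return self.value


relation = BaseRelation(rows=[10, 20, 30])
cached_count, cached_sum = CachedCount(relation), CachedSum(relation)
relation.insert(5)
relation.delete(20)
relation.insert(7)
```

Nothing is transferred while the base relation changes; the work happens only inside `read()`, which is the on-demand behavior described above.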
4.3 Incremental Update of Aggregates
As mentioned in the previous section, incremental update is a premise for the on-demand update adopted in the aggregate cache. In our current work, an aggregate is defined in a general way as a function that takes as input one or more nonempty sets of objects, called aggregate input sets, and zero or more scalar variables, and returns as output one scalar value. Elements in an aggregate input set are denoted by a variable called the aggregate variable. An aggregate input set corresponds to a group of values of an attribute in a relational table.
4.3.1 Syntactic Conventions
We present syntactic conventions used in subsequent sections.
* Aggregate input sets are denoted by capital letters such as $X$, $Y$, and $Z$. If the size (i.e., cardinality) of a set $X$ is of significance and the size is $n$, a positive integer, the set is denoted by $X_n$.
* The aggregate variable of an aggregate input set $X$ is denoted by the lowercase letter $x$, and it represents an element in $X$. For $X_n$, elements in the set are distinguished by attaching a distinct subscript to $x$, as in $X_n = \{x_1, x_2, \ldots, x_i, \ldots, x_n\}$.
* Aggregates are denoted by capital script letters such as $\mathcal{F}$ and $\mathcal{H}$. An aggregate $\mathcal{F}$ with its arguments can be denoted as follows:
$$\mathcal{F}(X_n, Y_m, \ldots, Z_o, a, b, \ldots)$$
where $X_n, Y_m, \ldots, Z_o$ are aggregate input sets and $a, b, \ldots$ are zero or more independent scalar variables.
4.3.2 Incrementally Updatable Aggregates
Informally, incremental update of an aggregate means that when a new element is added to the aggregate input set (assuming the aggregate has only one input set) or an old element is deleted from it, the new value of the aggregate is computed from the added (or deleted) element and the current value of the aggregate stored in the system. If an aggregate has multiple aggregate input sets, let us assume for the time being that all the aggregate input sets are of the same size and that insertions and deletions take place in such a way that all the sets remain the same size. Many aggregates used in statistical analysis fall into this category. In order to define incremental update precisely, we first define positive delta and negative delta below.
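Let us assume an aggregate $\mathcal{F}(X_n, Y_n, \ldots, Z_n, a, b, \ldots)$. For convenience, let $\Omega_{n-1}$ and $\Omega_n$ be two sets containing parameters for $\mathcal{F}$ as follows:
$$\Omega_{n-1} = \{X_{n-1}, Y_{n-1}, \ldots, Z_{n-1}, a, b, \ldots\}, \qquad \Omega_n = \{X_n, Y_n, \ldots, Z_n, a, b, \ldots\}$$
Now, suppose that for the aggregate $\mathcal{F}$, the current aggregate input sets are $X_{n-1}, Y_{n-1}, \ldots, Z_{n-1}$, and $x_n, y_n, \ldots, z_n$ are the elements inserted into their respective input sets (thereby making them $X_n, Y_n, \ldots, Z_n$, respectively).
Positive delta, $\Delta_n$, is defined such that
$$\Delta_n = \mathcal{F}(\Omega_n) - \mathcal{F}(\Omega_{n-1}) \qquad (4.1)$$
Aggregate $\mathcal{F}$ is incrementally updatable on the elements inserted into the aggregate input sets if there exists an algebraic function $g$ such that
$$\Delta_n = g(\mathcal{F}_1(\Omega_{n-1}), \mathcal{F}_2(\Omega_{n-1}), \ldots, \mathcal{F}_I(\Omega_{n-1}), \mathcal{H}_1(\Omega_n), \mathcal{H}_2(\Omega_n), \ldots, \mathcal{H}_J(\Omega_n), x_n, y_n, \ldots, z_n) \qquad (4.2)$$
where all $\mathcal{F}_k$'s ($1 \le k \le I$) and $\mathcal{H}_l$'s ($1 \le l \le J$) are aggregates whose values are already known or incrementally updatable and none of the $\mathcal{H}_l$'s is $\mathcal{F}$.
For the same aggregate $\mathcal{F}$, suppose that the current aggregate input sets are $X_n, Y_n, \ldots, Z_n$ and $x', y', \ldots, z'$ are the elements deleted from their respective input sets (thereby making them $X_{n-1}, Y_{n-1}, \ldots, Z_{n-1}$).
Negative delta, $\nabla_{n-1}$, is defined such that
$$\nabla_{n-1} = \mathcal{F}(\Omega_{n-1}) - \mathcal{F}(\Omega_n) \qquad (4.3)$$
Aggregate $\mathcal{F}$ is incrementally updatable on the elements deleted from the aggregate input sets if there exists an algebraic function $\bar{g}$ such that
$$\nabla_{n-1} = \bar{g}(\mathcal{F}_1(\Omega_n), \mathcal{F}_2(\Omega_n), \ldots, \mathcal{F}_I(\Omega_n), \mathcal{H}_1(\Omega_{n-1}), \mathcal{H}_2(\Omega_{n-1}), \ldots, \mathcal{H}_J(\Omega_{n-1}), x', y', \ldots, z') \qquad (4.4)$$
where all $\mathcal{F}_k$'s ($1 \le k \le I$) and $\mathcal{H}_l$'s ($1 \le l \le J$) are aggregates whose values are already known or incrementally updatable and none of the $\mathcal{H}_l$'s is $\mathcal{F}$.
An aggregate $\mathcal{F}$ is defined to be incrementally updatable if $\mathcal{F}$ has both a positive delta $\Delta_n$ (equation 4.1) and a negative delta $\nabla_{n-1}$ (equation 4.3) that are computable for all $n > 1$.
Once $\Delta_n$ is computed, $\mathcal{F}(\Omega_n)$ in equation 4.1 can be obtained by adding $\Delta_n$ to the known value of $\mathcal{F}(\Omega_{n-1})$. Likewise, $\mathcal{F}(\Omega_{n-1})$ in equation 4.3 is obtained by adding $\nabla_{n-1}$ to the known value of $\mathcal{F}(\Omega_n)$.
Now, if an aggregate is sensitive to every insertion into (or deletion from) any one aggregate input set, the definitions of positive delta and negative delta can be extended in a straightforward way so that the aggregate can be incrementally updated whenever an insertion (or a deletion) is made. We do not elaborate on this extension in this work.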
4.3.3 Algebraic Aggregates and Nonalgebraic Aggregates
Aggregates are first classified into algebraic and nonalgebraic ones. An algebraic aggregate is an aggregate that takes as input only aggregate input sets of real numbers and independent real-valued scalar variables, returns as output one real number, and consists only of algebraic operations defined over the real numbers.² A nonalgebraic aggregate is an aggregate whose domain or range is nonnumeric or that uses nonalgebraic operations.
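Example 4.3.1 Here are some examples of algebraic aggregates.
$$\mathrm{count}(X_n) = \sum_{i=1}^{n} 1$$
$$\mathrm{sum}(X_n) = \sum_{i=1}^{n} x_i$$
$$\mathrm{average}(X_n) = \frac{\sum_{i=1}^{n} x_i}{\sum_{i=1}^{n} 1} = \frac{\mathrm{sum}(X_n)}{\mathrm{count}(X_n)}$$
$$S_{xy}(X_n, Y_n) = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}$$
$S_{xy}(X_n, Y_n)$ is the sum of products of the distances of $x$ and $y$ from their means, $\bar{x}$ and $\bar{y}$ (i.e., average($X_n$) and average($Y_n$)), and is used to compute simple linear regression.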
Example 4.3.2 Nonalgebraic aggregates include max($X_n$), min($X_n$), median($X_n$), and rth_percentile($X_n$, $r$), which is the value such that $r$ percent of the $x$'s in $X_n$ fall at or below that value. These aggregates all involve procedural operations and thus are not algebraic aggregates. ◁
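2 Note that the real numbers subsume the integers. Therefore, aggregates that take integer sets and return an integer or real number, such as count(*), are covered by the definition.
Example 4.3.3 For some simple algebraic aggregates, the positive delta ($\Delta$) and negative delta ($\nabla$) defined in Section 4.3.2 can be directly obtained by algebraic manipulations. Below, we show how the positive delta and negative delta of average($X_n$) are derived.
$$\begin{aligned}
\Delta_n &= \mathrm{average}(X_n) - \mathrm{average}(X_{n-1}) \\
&= \frac{\sum_{i=1}^{n} x_i}{n} - \frac{\sum_{i=1}^{n-1} x_i}{n-1} \\
&= \frac{(n-1)\sum_{i=1}^{n} x_i - n\sum_{i=1}^{n-1} x_i}{n(n-1)} \\
&= \frac{n\left(\sum_{i=1}^{n} x_i - \sum_{i=1}^{n-1} x_i\right) - \sum_{i=1}^{n} x_i}{n(n-1)} \\
&= \frac{n x_n - x_n - \sum_{i=1}^{n-1} x_i}{n(n-1)} \\
&= \frac{(n-1)x_n - \sum_{i=1}^{n-1} x_i}{n(n-1)} \\
&= \frac{x_n}{n} - \frac{\mathrm{average}(X_{n-1})}{n} \\
&= \frac{x_n - \mathrm{average}(X_{n-1})}{\mathrm{count}(X_n)}
\end{aligned}$$
$$\begin{aligned}
\nabla_{n-1} &= \mathrm{average}(X_{n-1}) - \mathrm{average}(X_n) \\
&= \frac{\sum_{i=1}^{n} x_i - x'}{n-1} - \frac{\sum_{i=1}^{n} x_i}{n} \\
&= \frac{n\left(\sum_{i=1}^{n} x_i - x'\right) - (n-1)\sum_{i=1}^{n} x_i}{(n-1)n} \\
&= \frac{\sum_{i=1}^{n} x_i - n x'}{(n-1)n} \\
&= \frac{\mathrm{average}(X_n)}{n-1} - \frac{x'}{n-1} \\
&= \frac{\mathrm{average}(X_n) - x'}{\mathrm{count}(X_{n-1})}
\end{aligned}$$
$\Delta_n$ and $\nabla_{n-1}$ above are computable for every $n > 1$. Therefore, average($X_n$) is incrementally updatable. Compare $\Delta_n$ and $\nabla_{n-1}$ to equations 4.2 and 4.4, respectively. In $\Delta_n$, average($X_{n-1}$) is already known and count($X_n$) can be incrementally computed from the value of count($X_{n-1}$). In $\nabla_{n-1}$, average($X_n$) is known and count($X_{n-1}$) can be computed from the value of count($X_n$). ◁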
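The two deltas derived for the average translate directly into an update routine. The following is a minimal sketch, assuming a hypothetical `RunningAverage` class that caches only count and average; the handling of the empty set is an added convention, not part of the derivation.

```python
class RunningAverage:
    """Caches count(X) and average(X) and updates them from single elements."""

    def __init__(self):
        self.count = 0
        self.average = 0.0

    def insert(self, x):
        # Positive delta: (x_n - average(X_{n-1})) / count(X_n)
        self.count += 1
        self.average += (x - self.average) / self.count

    def delete(self, x):
        # Negative delta: (average(X_n) - x') / count(X_{n-1})
        if self.count == 0:
            raise ValueError("cannot delete from an empty input set")
        if self.count == 1:
            # Added convention: reset instead of dividing by zero.
            self.count, self.average = 0, 0.0
            return
        self.count -= 1
        self.average += (self.average - x) / self.count


avg = RunningAverage()
for value in (2, 4, 9):
    avg.insert(value)
after_inserts = avg.average
avg.delete(4)
after_delete = avg.average
```

Neither method touches the stored elements themselves; each update uses only the changed element and the cached count and average.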
4.3.4 Summative Aggregates
The vast majority of aggregates that are used, or have good potential for being used, in decision-making applications are those that perform some type of summation operation. We call such aggregates summative aggregates. In light of this observation, we focus our effort on making them incrementally updatable.
Let $X_n, Y_m, \ldots, Z_o$ be aggregate input sets whose respective current sizes are $n, m, \ldots, o$, and let $x_i \in X_n$ ($1 \le i \le n$), $y_j \in Y_m$ ($1 \le j \le m$), ..., $z_k \in Z_o$ ($1 \le k \le o$).
Summative Aggregate: Given aggregate input sets $X_n, Y_m, \ldots, Z_o$, a summative aggregate is defined to be an algebraic aggregate that has summation operators ($\Sigma$'s) in it as shown below:
$$\mathcal{F}(X_n, Y_m, \ldots, Z_o) = \sum_{\rho=1}^{\mu} f(\psi_\rho, \rho, \mathcal{H}(X_n, Y_m, \ldots, Z_o)) \qquad (4.5)$$
where $f(\psi_\rho, \rho, \mathcal{H}(X_n, Y_m, \ldots, Z_o))$ is called the summation body, $\rho$ is called the index variable and always starts from 1, $\mu$ is called the termination variable and indicates the final index value, $\psi_\rho \in \{x_i, y_j, \ldots, z_k\}$, $(\rho, \mu) \in \{(i, n), (j, m), \ldots, (k, o)\}$, and $\mathcal{H}(X_n, Y_m, \ldots, Z_o)$ is a summative aggregate nested in $\mathcal{F}$ that differs from $\mathcal{F}$. If the aggregate input sets have the same size (i.e., $|X_n| = |Y_n| = \cdots = |Z_n|$), the definition of summative aggregate can be simplified as follows:
$$\mathcal{F}(X_n, Y_n, \ldots, Z_n) = \sum_{i=1}^{n} f(x_i, y_i, \ldots, z_i, i) \qquad (4.6)$$
where the summation body $f(x_i, y_i, \ldots, z_i, i)$ may (recursively) contain other summative aggregates.
When the summation body of a summative aggregate contains one or more summation operators, the aggregate is called a nested summative aggregate.
The summation body in a summative aggregate is defined to be an aggregative polynomial, defined below.
Aggregative polynomial. An aggregative polynomial is a polynomial that consists of zero or more plus signs (+'s) as operators and aggregative monomials as operands.
Aggregative monomial. An aggregative monomial is a nonzero real constant multiplied by zero or more factors. A factor is either an aggregate variable, an index variable, a summative aggregate, or a parenthesized aggregative polynomial, any of which may be raised to a nonzero rational power or be an argument of a transcendental function on the real numbers such as the logarithm.³ An aggregative monomial that does not contain another aggregate is called atomic. An aggregative monomial is sometimes called a term.
Example 4.3.4 Given aggregate variables $x$, $y$, and $z$ and index variables $i$ and $j$, the following are aggregative monomials:
1, lxi, 2.5i'x,, x~ y Z. 3)2 ,y =(i=,~2,lg
3 Ceiling ($\lceil x \rceil$), floor ($\lfloor x \rfloor$), and absolute value ($|x|$) functions are included too.
In $x_i^2 y_i (\sum_j z_j)^2$, the factors are $x_i^2$, $y_i$, and $(\sum_j z_j)^2$, whereas in $\log_2 x_i$, the sole factor is $\log_2 x_i$. Except for $x_i^2 y_i (\sum_j z_j)^2$, all the aggregative monomials are atomic.
Example 4.3.5 Given aggregate variables, x and y and index variables, i, j, and k, the following are aggregative polynomials:
Aggregative polynomial $x_i - (\sum_{j=1}^{n} x_j)/(\sum_{j=1}^{n} 1)$ has two terms (aggregative monomials), $x_i$ and $-(\sum_{j=1}^{n} x_j)/(\sum_{j=1}^{n} 1)$, while the others have one term. ◁
Example 4.3.6 All aggregates but average($X_n$) shown in Example 4.3.1 are summative aggregates whose summation bodies are aggregative polynomials. ◁
For a summative aggregate, one might be tempted to derive a positive delta and a negative delta directly, as done in Example 4.3.3. However, deriving deltas for a nontrivial summative aggregate is not only arduous but also not always possible. Furthermore, unless the derivation can be fully automated in an efficient way (which is very improbable), the aggregate cache cannot use the deltas to incrementally update cached aggregates. For summative aggregates, there exists a simpler and more efficient way of performing incremental updates. We first consider nonnested summative aggregates in this subsection.
Lemma 3 A summative aggregate whose summation body is an atomic aggregative monomial is incrementally updatable.
Proof of Lemma 3 Obvious by the definition of summative aggregate. □
Lemma 4 A summative aggregate whose summation body is an aggregative polynomial is incrementally updatable if all summative aggregates in aggregative monomials of the aggregative polynomial are incrementally updatable.
Proof of Lemma 4 Assume any aggregative polynomial, $f_1(*) + f_2(*) + \cdots + f_v(*)$, where each $f_s(*)$, $1 \le s \le v$, is an aggregative monomial and $(*)$ represents any subset of the aggregate variables and index variables defined in the original summative aggregate; the subsets need not be equal to each other. Then, by the distributive property of the $\Sigma$ operator over additive (plus and minus) terms, the following equality holds:
$$\sum_{i=1}^{n} \bigl(f_1(*) + f_2(*) + \cdots + f_v(*)\bigr) = \sum_{i=1}^{n} f_1(*) + \sum_{i=1}^{n} f_2(*) + \cdots + \sum_{i=1}^{n} f_v(*)$$
Thus, if each $\sum_{i=1}^{n} f_s(*)$ can be updated incrementally, the original summative aggregate on the left-hand side of the equation above can be computed incrementally. □
Now, the summative aggregate is extended to include more algebraic aggregates as follows.
Extended summative aggregate. Given a set of aggregate input sets, $\Omega = \{X_n, Y_m, \ldots, Z_o\}$, and a set of summative aggregates over subsets of $\Omega$, $\mathcal{F}_1(\Omega_1), \mathcal{F}_2(\Omega_2), \ldots, \mathcal{F}_v(\Omega_v)$, where $\Omega_s \subseteq \Omega$ ($1 \le s \le v$), an aggregate $\mathcal{F}$ over $\Omega$ is defined to be an extended summative aggregate if there exists an algebraic function $g$ on the real numbers such that
$$\mathcal{F}(X_n, Y_m, \ldots, Z_o) = g(\mathcal{F}_1(\Omega_1), \mathcal{F}_2(\Omega_2), \ldots, \mathcal{F}_v(\Omega_v)) \qquad (4.7)$$
Note that in equation 4.7, the algebraic function $g$ is not a summative aggregate by definition. $g$ takes no set as an argument; its arguments are any number of scalar real values (since each $\mathcal{F}_s$ aggregate returns a scalar real number), whereas an aggregate is defined to take at least one aggregate input set.
Example 4.3.7 While the definition of extended summative aggregate covers a vast variety of algebraic aggregates that perform certain types of cumulative operations, perhaps the best known example of an extended summative aggregate is
$$\mathrm{average}(X_n) = \sum_{i=1}^{n} x_i \Big/ \sum_{i=1}^{n} 1 = \mathrm{sum}(X_n) / \mathrm{count}(X_n).$$
Another example is
$$\log_{10}\Bigl(\sum_{i=1}^{n} x_i\Bigr) = \log_{10}(\mathrm{sum}(X_n)).$$
Example 4.3.8 This example shows an extended summative aggregate, grand_average($X_n$, $Y_m$), which is an average over two related aggregate input sets, $X$ and $Y$.
$$\mathrm{grand\_average}(X_n, Y_m) = \frac{\mathrm{grand\_sum}(X_n, Y_m)}{\mathrm{grand\_count}(X_n, Y_m)} = \frac{\sum_{i=1}^{n} x_i + \sum_{j=1}^{m} y_j}{\sum_{i=1}^{n} 1 + \sum_{j=1}^{m} 1}$$
Lemma 5 The extended summative aggregate of equation 4.7 is incrementally updatable if all $\mathcal{F}_s(\Omega_s)$'s ($1 \le s \le v$) are incrementally updatable.⁴
Proof of Lemma 5 Since $g$ is a function, over a defined interval of its domain (see footnote 4) it maps every vector of input values to exactly one real number. The mapping therefore remains the same regardless of whether the input values are obtained through an incremental update or recomputed on the new aggregate input sets of the underlying summative aggregates. □
4 A function on the real numbers may have intervals of its domain where the function is undefined (e.g., the logarithm over negative values). Therefore, it is possible that a certain $g$ cannot be (incrementally) computed on some values of the input summative aggregates. However, that is not because of the incremental update, but because of the definition of that specific aggregate.
Incidentally, it is interesting to note that if function $g$ of equation 4.7 is a procedural (i.e., nonalgebraic) function, Lemma 5 does not hold in general. A simple counterexample is $g$ being max().
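The idea behind Lemma 5 can be sketched by caching a few component sums and applying an algebraic function to them on demand. The class name `ComponentSums` and its methods are hypothetical; the formula used for $S_{xy}$ is the standard identity $\sum x_i y_i - (\sum x_i)(\sum y_i)/n$.

```python
class ComponentSums:
    """Caches count, sum(x), sum(y), and sum(x*y) over paired input sets."""

    def __init__(self):
        self.n = 0
        self.sum_x = 0.0
        self.sum_y = 0.0
        self.sum_xy = 0.0

    def insert(self, x, y):
        self.n += 1
        self.sum_x += x
        self.sum_y += y
        self.sum_xy += x * y

    def delete(self, x, y):
        self.n -= 1
        self.sum_x -= x
        self.sum_y -= y
        self.sum_xy -= x * y

    def average_x(self):
        # g(sum, count) = sum / count
        return self.sum_x / self.n

    def s_xy(self):
        # g(sum_xy, sum_x, sum_y, count) = sum_xy - sum_x * sum_y / count
        return self.sum_xy - self.sum_x * self.sum_y / self.n


def s_xy_direct(pairs):
    """Recompute S_xy from scratch as the sum of (x - mean_x) * (y - mean_y)."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in pairs)


cache = ComponentSums()
for x, y in [(1, 2), (2, 4), (4, 5)]:
    cache.insert(x, y)
cache.delete(2, 4)
cache.insert(3, 7)
current_pairs = [(1, 2), (4, 5), (3, 7)]
```

Only the four component sums are maintained incrementally; `average_x` and `s_xy` are plain functions of cached scalars, so they never need the input sets.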
4.3.5 Binding of Variables
As shown in equation 4.5, a summative aggregate has two additional types of variables (other than aggregate variables): index variables and termination variables. Below, we describe how these variables are bound to a value or to another variable:
* The subscript of every aggregate variable is bound to an index variable; that is, an index variable is used to refer to a specific element in each aggregate input set.
* An instance of an index variable used in an aggregative monomial is bound to the index variable.
* The initial value of an index variable is always bound to 1 by definition.
* The final value of an index variable is bound to its matching termination variable.
* The termination variable of the outermost $\Sigma$ operator is either unbound (in most cases) or bound to a constant. If it is unbound, it should be bound to the current size (cardinality) of an associated aggregate input set when the aggregate is referenced by a query.
* The termination variable of an inner $\Sigma$ operator (i.e., in a summation body) is either unbound, bound to an index variable of one of its outer $\Sigma$ operators, or bound to a constant.
It should be noted that when a termination variable is bound to the cardinality of an aggregate input set, the actual value of the cardinality need not be known unless it is explicitly used in the summation body. In an incremental update, the purpose of the termination variable is to identify the latest element to be added to (or deleted from) an aggregate input set. On the other hand, if the cardinality is explicitly used in the summation body, it is interpreted as the simplest summative aggregate, count(*), of the associated aggregate input set.
Scope of bindings. Each binding has a scope, and a binding is in effect only within its scope. The scope of a summative aggregate is the inside of its outermost $\Sigma$ operator. The scope of the index and termination variables associated with a $\Sigma$ operator is the inside of the summation operation, which is also called the $\Sigma$'s scope. The scope of an aggregate variable is the scope of its associated index variable. A nested summative aggregate has multiple nested scopes. For any two index variables, either one variable's scope properly contains the other's or the two scopes are independent. If the scopes of two index variables are independent, the two index variables may be denoted by the same letter.
Example 4.3.9 The following are summative aggregates in which $\Sigma$ operators are nested (i.e., nested summative aggregates).
n .1
"l (X, Y., Zn) = Z(Xi (Yj Ei2Zk))
i=1 j=1 k=1
n
JF2(X., Y., Zn) = Z(X, L(y, E i2 Zk))
i=1 j=1 k=1
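$$\mathcal{F}_3(X_n, Y_n, Z_n) = \sum_{i=1}^{n}\Bigl(x_i\Bigl(\sum_{j=1}^{n} y_j - \sum_{k=1}^{n} z_k\Bigr)\Bigr) = \sum_{i=1}^{n}\Bigl(x_i\Bigl(\sum_{j=1}^{n} y_j - \sum_{j=1}^{n} z_j\Bigr)\Bigr)$$
In particular, in $\mathcal{F}_3$, the scopes of $\sum_{j=1}^{n} y_j$ and $\sum_{k=1}^{n} z_k$ are independent, so the two summative aggregates can be represented using the same index variable $j$. ◁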
4.3.6 Decomposition of Summative Aggregates
In the aggregate cache, the basic unit cached is a summation over an aggregative monomial. It is, therefore, more beneficial to decompose a complex summative aggregate into a set of smaller component summative aggregates and store their values in the aggregate cache. When the value of the original summative aggregate is needed, it can easily be restored from the values of the cached component aggregates. This approach facilitates sharing cached component aggregates among many summative aggregates. Furthermore, the decomposition plays a more important role in the aggregate cache: while decomposing the original summative aggregate and normalizing the decomposed summative aggregates, it becomes known whether or not the original summative aggregate can be incrementally updated.⁵ If the original summative aggregate is not incrementally updatable, it can still be cached, but the cached value should be invalidated if changes in base databases affect it.
In principle, the decomposition process is done in two steps: expansion of the summation body of a given summative aggregate and distribution of $\Sigma$ operators over the expanded summation body. If a given summative aggregate is nested (i.e., it contains other summative aggregates), the summation body is recursively expanded until no further expansion is possible. Then, $\Sigma$ operators are distributed over the whole expansion. By the distributive property of the $\Sigma$ operator over additive terms, the sum of the decomposed summative aggregates is equivalent to the original summative aggregate.
In order to describe the expansion procedure precisely, let us represent an arbitrary aggregative monomial $h(*)$ as follows:
$$h(*) = c\,a_1^{p_1} a_2^{p_2} \cdots a_t^{p_t} \qquad (4.8)$$
where $*$ is a union of aggregate input sets and index variables, $c$ is a nonzero real constant, each $a_l^{p_l}$ ($1 \le l \le t$) is a factor, assuming that if $t = 0$, no factor is present in the aggregative monomial,⁶ and each $p_l$ ($1 \le l \le t$) is a nonzero rational number. If $a_l^{p_l}$ denotes a factor other than a transcendental function, it means $a_l$ raised to $p_l$. Otherwise, $a_l^{p_l}$ simply denotes a transcendental function. (See Section 4.3.4 and Example 4.3.4.)
5 The normalization will be discussed in Section 4.3.7.
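In order to expand the aggregative monomial $h(*)$ of equation 4.8, each factor $a_l^{p_l}$ ($1 \le l \le t$) is expanded first. If it is expanded into multiple terms, $h(*)$ is expanded accordingly into multiple aggregative monomials by the usual algebraic manipulations.
Example 4.3.10 If $a_1^{p_1}$ and $a_t^{p_t}$ in equation 4.8 are expanded into $(a_{11}^{p_{11}} + a_{12}^{p_{12}})$ and $(a_{t1}^{p_{t1}} - a_{t2}^{p_{t2}})$ respectively, $h(*)$ is expanded into four aggregative monomials as follows:
$$\begin{aligned}
h(*) &= c\,a_1^{p_1} a_2^{p_2} \cdots a_t^{p_t} \\
&= c\,(a_{11}^{p_{11}} + a_{12}^{p_{12}})\,a_2^{p_2} \cdots (a_{t1}^{p_{t1}} - a_{t2}^{p_{t2}}) \\
&= c\,a_{11}^{p_{11}} a_2^{p_2} \cdots a_{t1}^{p_{t1}} - c\,a_{11}^{p_{11}} a_2^{p_2} \cdots a_{t2}^{p_{t2}} \\
&\quad + c\,a_{12}^{p_{12}} a_2^{p_2} \cdots a_{t1}^{p_{t1}} - c\,a_{12}^{p_{12}} a_2^{p_2} \cdots a_{t2}^{p_{t2}}
\end{aligned}$$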
Expansion of factor. Expansion of a factor $a_l^{p_l}$ in equation 4.8 is obtained by the factor expansion procedure below:
1. If $a_l^{p_l}$ denotes a transcendental function, the expansion of $a_l^{p_l}$ is $a_l^{p_l}$ itself. That is, a transcendental function is not (or cannot be) expanded.⁷
2. If $p_l$ (of $a_l^{p_l}$) is not a positive integer, the expansion of factor $a_l^{p_l}$ is $a_l^{p_l}$ itself.⁸
3. Otherwise, $a_l$ (of $a_l^{p_l}$) is expanded first as follows:
(a) If $a_l$ is a parenthesized aggregative polynomial, its component aggregative monomials are recursively expanded first and their results are added, yielding $h_{l1}(*') + h_{l2}(*') + \cdots + h_{lw}(*')$ ($w \ge 1$), where $*'$ is a subset of $*$ and each $*'$ need not be equal to each other. Then, the expansion of $a_l$ is $h_{l1}(*') + h_{l2}(*') + \cdots + h_{lw}(*')$.
(b) If $a_l$ is a summative aggregate $\sum_{\rho=1}^{\mu} f_l(*')$, its parenthesized summation body $(f_l(*'))$ is recursively expanded first, yielding $h_{l1}(*') + h_{l2}(*') + \cdots + h_{lw}(*')$ ($w \ge 1$). Then, the expansion of $a_l$ is $\sum_{\rho=1}^{\mu} h_{l1}(*') + \sum_{\rho=1}^{\mu} h_{l2}(*') + \cdots + \sum_{\rho=1}^{\mu} h_{lw}(*')$.
(c) If $a_l$ is an index variable or an aggregate variable, the expansion of $a_l$ is $a_l$ itself.
After expanding $a_l$, if the expansion of $a_l$ is a single term, the expansion of factor $a_l^{p_l}$ is $a_l^{p_l}$. Otherwise, the expansion of factor $a_l^{p_l}$ is obtained by applying the multinomial theorem (see below) to the expansion of $a_l$.
6 Hereafter, for a sequence of any entities $a_1, a_2, \ldots, a_t$ ($t \ge 0$), if $t = 0$, the sequence is assumed to be null (i.e., no entity in the sequence).
7 For some transcendental functions, there are a few cases in which expansion is possible. For example, $\log(x_i / y_i)$ can be expanded into $\log x_i - \log y_i$. However, in most cases such an expansion is not possible. Therefore, in our work we do not expand transcendental functions.
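After the expansion of a summation body is completed, if there exist factors of the form $(c_s\,a_{s_1}^{p_{s_1}} a_{s_2}^{p_{s_2}} \cdots a_{s_r}^{p_{s_r}})^{p_q}$ ($r \ge 1$) in the expansion, they are converted into $c_s^{p_q}\,a_{s_1}^{p_{s_1} p_q} a_{s_2}^{p_{s_2} p_q} \cdots a_{s_r}^{p_{s_r} p_q}$.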
Decomposition of summative aggregates. Suppose a summative aggregate (not extended), $\sum_{i=1}^{n} f(*)$, where $*$ is a union of aggregate input sets and index variable $i$. In order to apply the expansion procedure described so far, the summation body $f(*)$ (which is an aggregative polynomial) is changed into an aggregative monomial by parenthesizing it as follows:
$$\sum_{i=1}^{n} f(*) = \sum_{i=1}^{n} (f(*)) \qquad (4.9)$$
Then, after expanding the new summation body, let the result be:
$$(f(*)) = h_1(*') + h_2(*') + \cdots + h_v(*') \quad (v \ge 1) \qquad (4.10)$$
where each $h_s(*')$ ($1 \le s \le v$) is an aggregative monomial, $*'$ is a subset of $*$, and each $*'$ need not be equal to each other. Then, by distributing the $\Sigma$ operator over the right-hand side of equation 4.10, the following decomposed summative aggregate is obtained:
$$\sum_{i=1}^{n} f(*) = \sum_{i=1}^{n} h_1(*') + \sum_{i=1}^{n} h_2(*') + \cdots + \sum_{i=1}^{n} h_v(*') \qquad (4.11)$$
Decomposition of extended summative aggregates. For an extended summative aggregate, $\mathcal{F}(*) = g(\mathcal{F}_1(*'), \mathcal{F}_2(*'), \ldots, \mathcal{F}_v(*'))$ ($v \ge 1$), each argument summative aggregate $\mathcal{F}_s(*')$ ($1 \le s \le v$) is decomposed first, yielding $\mathcal{F}'_s(*')$. Then, the original argument summative aggregates are substituted with the decomposed ones, yielding
$$\mathcal{F}(*) = g(\mathcal{F}'_1(*'), \mathcal{F}'_2(*'), \ldots, \mathcal{F}'_v(*'))$$
8 Again, for an expression $(a_1 + a_2 + \cdots + a_n)^p$, if the exponent $p$ is not a positive (rather, nonnegative) integer, it is generally not possible to expand the expression in a finite number of steps.
Any monomial (not only an aggregative monomial) of the form $(a_1 + a_2 + \cdots + a_n)^p$, where $p$ is a positive integer, can be expanded using the multinomial theorem, which is an extension of the binomial theorem.⁹ The binomial theorem and multinomial theorem are presented below.
Binomial theorem. Let $p$ be a positive integer. Then for all $x$ and $y$,
$$(x + y)^p = \sum_{k=0}^{p} \binom{p}{k} x^{k} y^{p-k}$$
where $\binom{p}{k} = p! / (k!\,(p-k)!)$.
Multinomial theorem. Let $p$ be a positive integer. Then for all $x_1, x_2, \ldots, x_t$,
$$(x_1 + x_2 + \cdots + x_t)^p = \sum \binom{p}{p_1, p_2, \ldots, p_t} x_1^{p_1} x_2^{p_2} \cdots x_t^{p_t}$$
where the summation extends over all possible sequences of nonnegative integers $p_1, p_2, \ldots, p_t$ with $p_1 + p_2 + \cdots + p_t = p$, and $\binom{p}{p_1, p_2, \ldots, p_t} = p! / (p_1!\,p_2! \cdots p_t!)$.
9 In fact, the multinomial theorem (and the binomial theorem as well) holds even when the exponent $p$ is a nonzero rational number. However, in general, expanding such a monomial results in an infinite series, which is of no use for the aggregate cache.
Example 4.3.11 When $(x_1 + x_2 + x_3 + x_4 + x_5)^7$ is expanded, the coefficient of $x_1^3 x_2 x_4^2 x_5$ ($= x_1^3 x_2^1 x_3^0 x_4^2 x_5^1$) equals
$$\binom{7}{3, 1, 0, 2, 1} = \frac{7!}{3!\,1!\,0!\,2!\,1!} = 420.$$
When (4x,  2X2 + 3X3)' is expanded, the coefficient Of X2X3 X3 equals ( 2 )2(2 )331 = 3840.
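The multinomial coefficients used in such expansions can be computed mechanically. The sketch below uses hypothetical helper names and assumes exact integer arithmetic; it checks the coefficient 420 for the exponent sequence 3, 1, 0, 2, 1 (assuming the power is 7, the sum of those exponents), and, as an illustration chosen here, the coefficient of $x_i y_i z_i$ in $(2x_i - y_i + z_i)^3$.

```python
from math import comb, factorial, prod


def multinomial(exponents):
    """Return p! / (p_1! p_2! ... p_t!) where p is the sum of the exponents."""
    p = sum(exponents)
    return factorial(p) // prod(factorial(e) for e in exponents)


def term_coefficient(constants, exponents):
    """Coefficient of x_1^p_1 ... x_t^p_t in (c_1 x_1 + ... + c_t x_t)^p."""
    scale = prod(c ** e for c, e in zip(constants, exponents))
    return multinomial(exponents) * scale


def number_of_terms(t, p):
    """Number of distinct monomials when a sum of t terms is raised to p."""
    return comb(p + t - 1, t - 1)


coefficient_5_terms = multinomial([3, 1, 0, 2, 1])
coefficient_xyz = term_coefficient([2, -1, 1], [1, 1, 1])
terms_cubic_trinomial = number_of_terms(3, 3)
```

`number_of_terms` gives a quick way to see how fast an expansion grows before deciding whether expanding a given power is worthwhile.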
Using the multinomial theorem, it is possible to expand any aggregative monomial raised to a positive integer power; if its component monomials are themselves raised to powers, they can be expanded recursively. From a practical viewpoint, however, it is not a good idea to expand an aggregative monomial raised to a power greater than 3, since the number of monomials produced grows rapidly. In this regard, we expand (recursively) only those aggregative monomials raised to the power of 3 or less.
Example 4.3.12 Consider the following summative aggregate:
$$\sum_{i=1}^{n} (x_i - y_i)^2$$
where the summation body is $(x_i - y_i)^2$. Assume that the current aggregate input sets are $X_{n-1}$ and $Y_{n-1}$ and that the aggregate value on these input sets, $\sum_{i=1}^{n-1} (x_i - y_i)^2$, is stored in the aggregate cache. Then, upon insertions of $x_n$ and $y_n$, the new aggregate value can be computed by adding the value of $(x_n - y_n)^2$ to the stored aggregate value. The summative aggregate is decomposed as follows:
$$\sum_{i=1}^{n} (x_i - y_i)^2 = \sum_{i=1}^{n} (x_i^2 - 2x_i y_i + y_i^2) = \sum_{i=1}^{n} x_i^2 - 2\sum_{i=1}^{n} x_i y_i + \sum_{i=1}^{n} y_i^2$$
Instead of storing $\sum_{i=1}^{n} (x_i - y_i)^2$ directly, it is much better to store $\sum_{i=1}^{n} x_i^2$, $\sum_{i=1}^{n} x_i y_i$, and $\sum_{i=1}^{n} y_i^2$ so that other summative aggregates can also make use of the stored values.
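Similarly, a more complex summative aggregate is decomposed as follows:
$$\begin{aligned}
\sum_{i=1}^{n} (2x_i - y_i + z_i)^3 &= \sum_{i=1}^{n} \bigl(8x_i^3 - y_i^3 + z_i^3 - 12x_i^2 y_i + 12x_i^2 z_i + 6x_i y_i^2 \\
&\qquad\quad + 3y_i^2 z_i + 6x_i z_i^2 - 3y_i z_i^2 - 12x_i y_i z_i\bigr) \\
&= 8\sum_{i=1}^{n} x_i^3 - \sum_{i=1}^{n} y_i^3 + \sum_{i=1}^{n} z_i^3 - 12\sum_{i=1}^{n} x_i^2 y_i + 12\sum_{i=1}^{n} x_i^2 z_i + 6\sum_{i=1}^{n} x_i y_i^2 \\
&\quad + 3\sum_{i=1}^{n} y_i^2 z_i + 6\sum_{i=1}^{n} x_i z_i^2 - 3\sum_{i=1}^{n} y_i z_i^2 - 12\sum_{i=1}^{n} x_i y_i z_i.
\end{aligned}$$
This summative aggregate shows the rapid increase in the number of aggregative monomials as the exponent increases.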
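The benefit of caching decomposed components rather than the original aggregate can be sketched as follows. The class `MonomialCache` and its method names are hypothetical; the second aggregate, $\sum (x_i + y_i)^2$, is an assumption added here to illustrate sharing and does not appear in the example above.

```python
class MonomialCache:
    """Caches sum(x^2), sum(x*y), and sum(y^2) for paired input sets."""

    def __init__(self):
        self.sum_x2 = 0
        self.sum_xy = 0
        self.sum_y2 = 0

    def insert(self, x, y):
        # Each component is a summation over an atomic monomial,
        # so one new pair adds exactly one term to each cached sum.
        self.sum_x2 += x * x
        self.sum_xy += x * y
        self.sum_y2 += y * y

    def sum_squared_difference(self):
        # sum (x - y)^2 = sum x^2 - 2 sum xy + sum y^2
        return self.sum_x2 - 2 * self.sum_xy + self.sum_y2

    def sum_squared_sum(self):
        # sum (x + y)^2 reuses the same three cached components.
        return self.sum_x2 + 2 * self.sum_xy + self.sum_y2


pairs = [(1, 4), (3, 1), (5, 2)]
cache = MonomialCache()
for x, y in pairs:
    cache.insert(x, y)
```

Both derived aggregates are restored from the same three cached values, which is the sharing that decomposition is meant to enable.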
4.3.7 Normalization of Summative Aggregates
Normalization of a summative aggregate is a process of simplifying the summative aggregate. The simplest normalization is to pull a constant that multiplies the summation body out of the scope of its summation operator. For instance, $\sum_{i=1}^{n} 4x_i = 4\sum_{i=1}^{n} x_i$. By normalizing a summative aggregate in this way, $\sum_{i=1}^{n} x_i$ can be cached instead of $\sum_{i=1}^{n} 4x_i$, so that cache operations can be more efficient. In fact, normalization plays a more profound role than making cache operations efficient. It enables certain types of summative aggregates to be incrementally updated that otherwise could not be. In this subsection we extend this simple idea in order to normalize complex nested summative aggregates.
As summative aggregates are nested, some unbound entities (variables and component summative aggregates) can be present within the scope of a summative aggregate. These unbound entities hinder the incremental update of summative aggregates, since their values are not known at the time of an incremental update. There are several entities that can be unbound within a component summative aggregate of a nested summative aggregate.
1. An index variable of an outer summation operator.
2. An aggregate variable indexed by an index variable of an outer summation
operator.
3. A sumnmative aggregate whose termination variable is bound to the cardinality
of an aggregate input set.
4. A sumnmative aggregate whose termination variable is an index variable of an
outer summation operator.
The goal of normalization is to make, recursively, every summative aggregate contain only those entities bound to its index variable, by moving unbound entities, together with any constant multiplying the aggregative monomial, out of the scope of the summative aggregate. Therefore, once a normalization is completed, no unbound entities shall remain within any summative aggregate's scope.
Normalization Process
Without loss of generality, it is assumed that a summative aggregate is decomposed maximally (i.e., as far as possible) before it is normalized. After the decomposition, each decomposed summative aggregate should have an aggregative monomial as its summation body. Then, the normalization is performed on each decomposed summative aggregate. The following is a decomposed summative aggregate to be normalized:
$$\sum_{\rho=1}^{\mu} c\,a_1^{p_1} a_2^{p_2} \cdots a_t^{p_t} \quad (t > 0) \tag{4.12}$$
where $\rho$ is an index variable, $\mu$ is a termination variable, $c$ is a nonzero real constant, each $a_l^{p_l}$ ($1 \le l \le t$) is a factor, and each $p_l$ ($1 \le l \le t$) is a nonzero rational number. (See equation 4.8.)
Given an index variable $\rho$, a factor is fully bound to $\rho$ if $\rho$ is the sole index variable that appears in the factor. A factor is partially bound to $\rho$ if $\rho$ and other index variables appear in the factor. If $\rho$ does not appear in a factor, the factor is unbound to $\rho$.
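Normalization of summative aggregate 4.12 takes two passes:
1. For each factor $a_l^{p_l}$, if $a_l$ (of $a_l^{p_l}$) is a summative aggregate, it is normalized first, yielding $c_l^{p_l} a_{l1}^{p_l} a_{l2}^{p_l} \cdots a_{lt_l}^{p_l}$. After all component summative aggregates are normalized, let the result be the following:
$$\sum_{\rho=1}^{\mu} c\,(c_1^{p_1} a_{11}^{p_1} a_{12}^{p_1} \cdots a_{1t_1}^{p_1})(c_2^{p_2} a_{21}^{p_2} a_{22}^{p_2} \cdots a_{2t_2}^{p_2}) \cdots (c_t^{p_t} a_{t1}^{p_t} a_{t2}^{p_t} \cdots a_{tt_t}^{p_t}) \tag{4.13}$$
where each $t_s \ge 1$ ($1 \le s \le t$).
2. Unbound factors in summative aggregate 4.13 are moved out of the scope of the $\sum$ operator as follows:
$$(c\,c_1^{p_1} c_2^{p_2} \cdots c_t^{p_t})\, a_{u_1}^{p_{u_1}} a_{u_2}^{p_{u_2}} \cdots a_{u_v}^{p_{u_v}} \sum_{\rho=1}^{\mu} a_{b_1}^{p_{b_1}} a_{b_2}^{p_{b_2}} \cdots a_{b_w}^{p_{b_w}} \quad (v \ge 0,\ w \ge 0) \tag{4.14}$$
where each $a_{u_i}^{p_{u_i}}$ is a factor unbound to $\rho$, each $a_{b_j}^{p_{b_j}}$ is a factor either fully or partially bound to $\rho$, and $v + w = q$. [Footnote 10: Note that by the normalization a summative aggregate can be transformed to an extended summative aggregate. (See $\mathcal{F}_2$ and $\mathcal{F}_3$ in Example 4.3.13.)]
In the transformed summative aggregate 4.14, if the right-hand side of the $\sum$ operator contains only factors bound to $\rho$ and all component summative aggregates, if any, are normalized, the summative aggregate is called normalized. Otherwise, the original summative aggregate 4.12 is called unnormalizable.
Theorem 3 A normalized summative aggregate is equivalent to the original summative aggregate, that is:
$$\sum_{\rho=1}^{\mu} c\,a_1^{p_1} a_2^{p_2} \cdots a_t^{p_t} = (c\,c_1^{p_1} c_2^{p_2} \cdots c_t^{p_t})\, a_{u_1}^{p_{u_1}} a_{u_2}^{p_{u_2}} \cdots a_{u_v}^{p_{u_v}} \sum_{\rho=1}^{\mu} a_{b_1}^{p_{b_1}} a_{b_2}^{p_{b_2}} \cdots a_{b_w}^{p_{b_w}}$$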
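Proof of Theorem 3 We first show that summative aggregates 4.13 and 4.14 are equivalent, that is:
$$\sum_{\rho=1}^{\mu} c\,(c_1^{p_1} a_{11}^{p_1} a_{12}^{p_1} \cdots a_{1t_1}^{p_1})(c_2^{p_2} a_{21}^{p_2} a_{22}^{p_2} \cdots a_{2t_2}^{p_2}) \cdots (c_t^{p_t} a_{t1}^{p_t} a_{t2}^{p_t} \cdots a_{tt_t}^{p_t}) = (c\,c_1^{p_1} c_2^{p_2} \cdots c_t^{p_t})\, a_{u_1}^{p_{u_1}} a_{u_2}^{p_{u_2}} \cdots a_{u_v}^{p_{u_v}} \sum_{\rho=1}^{\mu} a_{b_1}^{p_{b_1}} a_{b_2}^{p_{b_2}} \cdots a_{b_w}^{p_{b_w}} \tag{4.15}$$
To show the equivalence above, we put $(c\,c_1^{p_1} c_2^{p_2} \cdots c_t^{p_t})\, a_{u_1}^{p_{u_1}} a_{u_2}^{p_{u_2}} \cdots a_{u_v}^{p_{u_v}}$ of summative aggregate 4.14 back inside the $\sum$ operator. Then, the following equality should hold since the left-hand side and the right-hand side of the equation are literally equivalent ($q = v + w$):
$$\sum_{\rho=1}^{\mu} c\,(c_1^{p_1} a_{11}^{p_1} \cdots a_{1t_1}^{p_1})(c_2^{p_2} a_{21}^{p_2} \cdots a_{2t_2}^{p_2}) \cdots (c_t^{p_t} a_{t1}^{p_t} \cdots a_{tt_t}^{p_t}) = \sum_{\rho=1}^{\mu} (c\,c_1^{p_1} c_2^{p_2} \cdots c_t^{p_t})\, a_{u_1}^{p_{u_1}} \cdots a_{u_v}^{p_{u_v}}\, a_{b_1}^{p_{b_1}} \cdots a_{b_w}^{p_{b_w}}$$
Letting $(c\,c_1^{p_1} c_2^{p_2} \cdots c_t^{p_t})\, a_{u_1}^{p_{u_1}} a_{u_2}^{p_{u_2}} \cdots a_{u_v}^{p_{u_v}}$ be $\phi$ and expanding the $\sum$ operator of the right-hand side of the equation above, the following equation is obtained:
$$\sum_{\rho=1}^{\mu} \phi\, a_{b_1}^{p_{b_1}} a_{b_2}^{p_{b_2}} \cdots a_{b_w}^{p_{b_w}} = \phi_1 (a_{b_1}^{p_{b_1}})_1 (a_{b_2}^{p_{b_2}})_1 \cdots (a_{b_w}^{p_{b_w}})_1 + \phi_2 (a_{b_1}^{p_{b_1}})_2 (a_{b_2}^{p_{b_2}})_2 \cdots (a_{b_w}^{p_{b_w}})_2 + \cdots + \phi_\mu (a_{b_1}^{p_{b_1}})_\mu (a_{b_2}^{p_{b_2}})_\mu \cdots (a_{b_w}^{p_{b_w}})_\mu$$
where $\phi_\rho$ or $(a_{b_j}^{p_{b_j}})_\rho$ ($1 \le \rho \le \mu$) represents an expression in which all occurrences of index variable $\rho$ are substituted by a value of $\rho$. However, since $\phi$ is unbound to $\rho$ (i.e., there is no occurrence of $\rho$ in $\phi$), all $\phi_\rho$'s ($1 \le \rho \le \mu$) should be the same in the above equation. Therefore, we get the following equation, thereby proving the equality of equation 4.15:
$$\phi_1 \big( (a_{b_1}^{p_{b_1}})_1 \cdots (a_{b_w}^{p_{b_w}})_1 + (a_{b_1}^{p_{b_1}})_2 \cdots (a_{b_w}^{p_{b_w}})_2 + \cdots + (a_{b_1}^{p_{b_1}})_\mu \cdots (a_{b_w}^{p_{b_w}})_\mu \big) = (c\,c_1^{p_1} c_2^{p_2} \cdots c_t^{p_t})\, a_{u_1}^{p_{u_1}} a_{u_2}^{p_{u_2}} \cdots a_{u_v}^{p_{u_v}} \sum_{\rho=1}^{\mu} a_{b_1}^{p_{b_1}} a_{b_2}^{p_{b_2}} \cdots a_{b_w}^{p_{b_w}}$$
The next step is to show the equivalence between summative aggregates 4.12 and 4.13. If any factor $a_l^{p_l}$ in summative aggregate 4.12 is a power of a summative aggregate (i.e., $a_l$ is a summative aggregate), the factor is normalized to $(c_l^{p_l} a_{l1}^{p_l} a_{l2}^{p_l} \cdots a_{lt_l}^{p_l})$. Then, the proof shown above implies the equivalence of $a_l^{p_l}$ and $c_l^{p_l} a_{l1}^{p_l} a_{l2}^{p_l} \cdots a_{lt_l}^{p_l}$. (Equation 4.15 directly proves $a_l = c_l a_{l1} a_{l2} \cdots a_{lt_l}$.) If the summation body of $a_l$ contains other summative aggregates again, they are (recursively) equivalently normalized before $a_l$ is normalized. Therefore, we can conclude that summative aggregates 4.12 and 4.13 are equivalent, thereby proving Theorem 3.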
Normalization of extended summative aggregates. In order to normalize an extended summative aggregate, $\mathcal{F}(*) = g(\mathcal{F}_1(*'), \mathcal{F}_2(*'), \ldots, \mathcal{F}_v(*'))$ ($v \ge 1$), its argument summative aggregates are decomposed first. Then, each decomposed argument summative aggregate of $\mathcal{F}_s(*')$ ($1 \le s \le v$) is normalized. Finally, the original argument summative aggregates of $g$ are substituted with the normalized ones.
Example 4.3.13 The following nested summative aggregates are converted to their respective normalized forms.
$$\mathcal{F}_1(X_n, Y_n) = \sum_{i=1}^{n}\Big(x_i \sum_{j=1}^{i} 2i^3 y_j\Big) = 2\sum_{i=1}^{n}\Big(i^3 x_i \sum_{j=1}^{i} y_j\Big)$$
In $\mathcal{F}_1$, $2$ and $i^3$ are unbound in $\sum_{j=1}^{i} 2i^3 y_j$. So, these entities are moved out of index variable $j$'s scope. As a constant, 2 is further moved out of index variable $i$'s scope.
$$\mathcal{F}_2(X_n, Y_m) = \sum_{i=1}^{n}\Big(x_i \sum_{j=1}^{m} y_j\Big) = \Big(\sum_{j=1}^{m} y_j\Big)\Big(\sum_{i=1}^{n} x_i\Big)$$
In $\mathcal{F}_2$, the component aggregate $\sum_{j=1}^{m} y_j$ is an unbound summative aggregate. Thus, it can be treated as a constant from the viewpoint of any surrounding summative aggregates. So is the result. Note also that the normalized $\mathcal{F}_2$ is no longer an ordinary summative aggregate, but an extended summative aggregate. (The same holds for $\mathcal{F}_3$ below.)
$$\mathcal{F}_3(X_n, Y_n, Z_n) = \sum_{i=1}^{n}\Big(x_i \sum_{j=1}^{i}\Big(y_j \sum_{k=1}^{n} j z_k\Big)\Big) = \sum_{i=1}^{n}\Big(x_i \Big(\sum_{k=1}^{n} z_k\Big) \sum_{j=1}^{i} j y_j\Big) = \Big(\sum_{k=1}^{n} z_k\Big)\Big(\sum_{i=1}^{n}\Big(x_i \sum_{j=1}^{i} j y_j\Big)\Big)$$
In $\mathcal{F}_3$, the component summative aggregate $\sum_{k=1}^{n} j z_k$ is unbound and, in itself, contains an unbound index variable $j$. So, $j$ is moved up to $j$'s scope, and the aggregate itself is moved out of the outermost $\sum$'s scope.
$$\mathcal{F}_4(X_n, Y_n) = \sum_{i=1}^{n}\Big(x_i - \sum_{j=1}^{n} y_j\Big)^{1/2}$$
In $\mathcal{F}_4$, on the other hand, even though $\sum_{j=1}^{n} y_j$ is unbound, there is no known way of moving the unbound summative aggregate out of the outer index variable $i$'s scope. Therefore, $\mathcal{F}_4$ is unnormalizable. □
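As a numerical sanity check of the normalized forms in this example, the following Python sketch evaluates nested and normalized versions side by side. It assumes that the inner sums of F1 and F3 run up to the outer index i and that the innermost sum of F3 runs over the whole input set; the function names are hypothetical.

```python
def f1_nested(x, y):
    # F1 as written: sum_i x_i * sum_{j<=i} 2 * i^3 * y_j (1-based indices).
    n = len(x)
    return sum(
        x[i - 1] * sum(2 * i**3 * y[j - 1] for j in range(1, i + 1))
        for i in range(1, n + 1)
    )


def f1_normalized(x, y):
    # Constant 2 and i^3 moved out of j's scope; 2 also moved out of i's scope.
    n = len(x)
    return 2 * sum(
        i**3 * x[i - 1] * sum(y[j - 1] for j in range(1, i + 1))
        for i in range(1, n + 1)
    )


def f2_nested(x, y):
    # F2 as written: sum_i x_i * sum_j y_j.
    return sum(xi * sum(yj for yj in y) for xi in x)


def f2_normalized(x, y):
    # Product of two independent (non-nested) sums.
    return sum(y) * sum(x)


def f3_nested(x, y, z):
    # F3 as written: sum_i x_i * sum_{j<=i} (y_j * sum_k j * z_k).
    n = len(x)
    return sum(
        x[i - 1] * sum(y[j - 1] * sum(j * zk for zk in z) for j in range(1, i + 1))
        for i in range(1, n + 1)
    )


def f3_normalized(x, y, z):
    # sum_k z_k moved to the outermost level; j moved into j's own scope.
    n = len(x)
    return sum(z) * sum(
        x[i - 1] * sum(j * y[j - 1] for j in range(1, i + 1))
        for i in range(1, n + 1)
    )


X, Y, Z = [1, 2, 3], [4, 5, 6], [7, 8, 9]
same_f1 = f1_nested(X, Y) == f1_normalized(X, Y)
same_f2 = f2_nested(X, Y) == f2_normalized(X, Y)
same_f3 = f3_nested(X, Y, Z) == f3_normalized(X, Y, Z)
```

Integer inputs are used so that the comparison is exact; with floating-point data the two forms may differ by rounding error.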
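Example 4.3.14 In this example, the standard deviation, an extended summative aggregate, is normalized. The standard deviation has two argument summative aggregates, $\sum_{i=1}^{n}(x_i - \bar{x})^2$ and $n$ (i.e., $\sum_{i=1}^{n} 1$), which are decomposed in equations 4.18 and 4.19 below. Then, the decomposed argument summative aggregates are normalized in equation 4.20. Note that in the normalized aggregate, there are only three unique summative aggregates, $\sum_{i=1}^{n} x_i^2$, $\sum_{i=1}^{n} x_i$, and $\sum_{i=1}^{n} 1$.
$$\mathcal{F}(X_n) = \sigma_x \tag{4.16}$$
$$= \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}} \tag{4.17}$$
$$= \sqrt{\frac{\sum_{i=1}^{n}\Big(x_i - \frac{\sum_{k=1}^{n} x_k}{\sum_{k=1}^{n} 1}\Big)^2}{\big(\sum_{i=1}^{n} 1\big) - 1}} \tag{4.18}$$
$$= \sqrt{\frac{\sum_{i=1}^{n} x_i^2 - 2\sum_{i=1}^{n}\Big(x_i \frac{\sum_{k=1}^{n} x_k}{\sum_{k=1}^{n} 1}\Big) + \sum_{i=1}^{n}\Big(\frac{\sum_{k=1}^{n} x_k}{\sum_{k=1}^{n} 1}\Big)^2}{\big(\sum_{i=1}^{n} 1\big) - 1}} \tag{4.19}$$
$$= \sqrt{\frac{\sum_{i=1}^{n} x_i^2 - 2\,\frac{\sum_{k=1}^{n} x_k}{\sum_{k=1}^{n} 1}\sum_{i=1}^{n} x_i + \Big(\frac{\sum_{k=1}^{n} x_k}{\sum_{k=1}^{n} 1}\Big)^2 \sum_{i=1}^{n} 1}{\big(\sum_{i=1}^{n} 1\big) - 1}} \tag{4.20}$$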
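A minimal Python sketch of this example follows. It assumes the sample standard deviation (denominator n - 1) and a hypothetical class `StdDevCache` that caches only the three unique summative aggregates named in the example; the result is compared against the standard library's `statistics.stdev`.

```python
import math
import statistics


class StdDevCache:
    """Hypothetical cache of the three unique summative aggregates:
    sum(x^2), sum(x) and sum(1)."""

    def __init__(self):
        self.sum_x2 = 0.0
        self.sum_x = 0.0
        self.count = 0

    def insert(self, x):
        # Each cached aggregate is updated incrementally on insertion.
        self.sum_x2 += x * x
        self.sum_x += x
        self.count += 1

    def stddev(self):
        # Evaluate the normalized expression from the cached sums only.
        n = self.count
        if n < 2:
            raise ValueError("at least two values are required")
        mean = self.sum_x / n
        sum_sq_dev = self.sum_x2 - 2.0 * mean * self.sum_x + mean * mean * n
        return math.sqrt(sum_sq_dev / (n - 1))


values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
sd_cache = StdDevCache()
for v in values:
    sd_cache.insert(v)

cached_sd = sd_cache.stddev()
reference_sd = statistics.stdev(values)
```

No algebraic simplification is applied in `stddev`, mirroring the normalized (not simplified) form; a production implementation would also need to consider numerical cancellation for large inputs.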
4.3.8 Incremental Updatability of Nested Summative Aggregates
As Lemma 3 states, a summative aggregate is incrementally updatable if its summation body is an atomic aggregative monomial. In case its summation body is an aggregative polynomial, by Lemma 4, the summative aggregate is incrementally updatable if each aggregative monomial in the aggregative polynomial is atomic. For nested summative aggregates, however, it turns out that not all of them are incrementally updatable. In this subsection we identify those nested aggregates that are not incrementally updatable. Also, for those that are incrementally updatable, we present how to compute them.
Summative aggregates shown in Example 4.3.13 are quite suggestive of which nested summative aggregates are incrementally updatable and which are not. In short, $\mathcal{F}_1$, $\mathcal{F}_2$, and $\mathcal{F}_3$ in the example are all incrementally updatable, whereas $\mathcal{F}_4$ is not.
Let us compare $\mathcal{F}_2$ and $\mathcal{F}_4$ first. It should be noted that without the normalization, even $\mathcal{F}_2$ cannot be incrementally updated. $\mathcal{F}_2$ has a summation body $(x_i \sum_{j=1}^{m} y_j)$ that nests an unbound summative aggregate $\sum_{j=1}^{m} y_j$. One cannot save the value of $\mathcal{F}_2(X_{n-1}, Y_{m-1}) = \sum_{i=1}^{n-1}(x_i \sum_{j=1}^{m-1} y_j)$ and use that value to compute $\mathcal{F}_2(X_n, Y_m)$ on insertion of $(x_n, y_m)$, due to the following inequality:
$$\mathcal{F}_2(X_n, Y_m) = \sum_{i=1}^{n}\Big(x_i \sum_{j=1}^{m} y_j\Big) \ne \sum_{i=1}^{n-1}\Big(x_i \sum_{j=1}^{m-1} y_j\Big) + \Big(x_n \sum_{j=1}^{m} y_j\Big) = \mathcal{F}_2(X_{n-1}, Y_{m-1}) + \Big(x_n \sum_{j=1}^{m} y_j\Big).$$
Only after normalizing $\mathcal{F}_2$ is the following equality obtained, thereby making incremental update possible:
$$\mathcal{F}_2(X_n, Y_m) = \Big(\sum_{i=1}^{n} x_i\Big)\Big(\sum_{j=1}^{m} y_j\Big) = \Big(\sum_{i=1}^{n-1} x_i + x_n\Big)\Big(\sum_{j=1}^{m-1} y_j + y_m\Big),$$
where $\sum_{i=1}^{n-1} x_i$ and $\sum_{j=1}^{m-1} y_j$ are the two cached factors of $\mathcal{F}_2(X_{n-1}, Y_{m-1})$, each of which is updated independently by adding $x_n$ and $y_m$, respectively.
The reason why $\mathcal{F}_2$ is not incrementally updatable before the normalization is that the unbound summative aggregate $\sum_{j=1}^{m} y_j$ within the scope of index variable $i$ should act as a constant while it takes part in the outer summation operator's operation, but without the normalization, its value varies in successive incremental updates. By normalizing $\mathcal{F}_2$, however, the original aggregate is transformed to a multiplication of two summative aggregates, each of which can be independently incrementally updated. Then, by Lemma 5, the normalized summative aggregate is incrementally updatable.
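The contrast between the unnormalized and normalized F2 can be sketched in Python as follows. The class `F2Cache` and the variable `naive` are hypothetical names; `naive` mimics the failed attempt to reuse the previous value of the nested form, while `F2Cache` keeps the two factors of the normalized form separately.

```python
class F2Cache:
    """Hypothetical cache for normalized F2 = (sum of x) * (sum of y)."""

    def __init__(self):
        self.sum_x = 0
        self.sum_y = 0

    def insert(self, x, y):
        # Each factor is an independent, incrementally updatable sum.
        self.sum_x += x
        self.sum_y += y

    def value(self):
        return self.sum_x * self.sum_y


def f2_direct(xs, ys):
    # Nested definition: sum_i (x_i * sum_j y_j), recomputed from scratch.
    return sum(x * sum(ys) for x in xs)


pairs = [(1, 4), (2, 5), (3, 6)]
f2_cache = F2Cache()
naive = 0          # previous nested value plus x_n * (current sum of y)
xs, ys = [], []
for x, y in pairs:
    xs.append(x)
    ys.append(y)
    f2_cache.insert(x, y)
    naive = naive + x * sum(ys)

correct = f2_direct(xs, ys)
```

After the three insertions the normalized cache agrees with the direct computation, while the naive running value does not, because earlier terms were computed with a stale inner sum.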
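On the other hand, as mentioned in Example 4.3.13, $\mathcal{F}_4$ cannot be normalized. Similar to $\mathcal{F}_2$ before the normalization, $\mathcal{F}_4$ has the following inequality:
$$\mathcal{F}_4(X_n, Y_n) = \sum_{i=1}^{n}\Big(x_i - \sum_{j=1}^{n} y_j\Big)^{1/2} \ne \sum_{i=1}^{n-1}\Big(x_i - \sum_{j=1}^{n-1} y_j\Big)^{1/2} + \Big(x_n - \sum_{j=1}^{n} y_j\Big)^{1/2} = \mathcal{F}_4(X_{n-1}, Y_{n-1}) + \Big(x_n - \sum_{j=1}^{n} y_j\Big)^{1/2}$$
Since $\mathcal{F}_4$ cannot be normalized, there is no chance of $\mathcal{F}_4$ being incrementally updatable.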
$\mathcal{F}_1$ and $\mathcal{F}_3$ are incrementally updatable as mentioned. Unlike $\mathcal{F}_2$, however, these summative aggregates are still nested even after the normalization. Nonetheless, $\mathcal{F}_1$ has the following equality:
$$\mathcal{F}_1(X_n, Y_n) = 2\sum_{i=1}^{n}\Big(i^3 x_i \sum_{j=1}^{i} y_j\Big) = \Big(2\sum_{i=1}^{n-1}\Big(i^3 x_i \sum_{j=1}^{i} y_j\Big)\Big) + \Big(2n^3 x_n \sum_{j=1}^{n} y_j\Big) = \mathcal{F}_1(X_{n-1}, Y_{n-1}) + \Big(2n^3 x_n \sum_{j=1}^{n} y_j\Big)$$
The above equation itself can serve as a proof by induction of the incremental updatability. When $i = 1$ (base), the equality holds trivially, letting $\mathcal{F}_1(X_0, Y_0) = 0$. Assuming $\mathcal{F}_1(X_{n-1}, Y_{n-1})$ is incrementally updatable (induction hypothesis), $\mathcal{F}_1(X_n, Y_n)$ is incrementally updatable since $\sum_{j=1}^{n} y_j$ too is incrementally updatable. Thus, proved.
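The induction step can be mirrored by a short Python sketch. It assumes the normalized form F1 = 2 * sum_i (i^3 * x_i * sum_{j<=i} y_j); the class `F1Cache` is a hypothetical name. The inner aggregate (the running sum of y) is updated first, and its new value is then used to update the outer aggregate.

```python
class F1Cache:
    """Hypothetical incremental evaluator for
    F1 = 2 * sum_i (i^3 * x_i * sum_{j<=i} y_j)."""

    def __init__(self):
        self.n = 0          # count(*) of insertions so far
        self.sum_y = 0      # inner aggregate: sum_{j<=n} y_j
        self.body_sum = 0   # outer aggregate without the constant 2

    def insert(self, x, y):
        self.n += 1
        self.sum_y += y                                # update inner aggregate first
        self.body_sum += self.n**3 * x * self.sum_y    # then add the new outer term

    def value(self):
        return 2 * self.body_sum


def f1_direct(xs, ys):
    # Full recomputation from the definition, for comparison only.
    n = len(xs)
    return 2 * sum(
        i**3 * xs[i - 1] * sum(ys[j - 1] for j in range(1, i + 1))
        for i in range(1, n + 1)
    )


xs, ys = [1, 2, 3], [4, 5, 6]
f1_cache = F1Cache()
for x, y in zip(xs, ys):
    f1_cache.insert(x, y)
```

Each insertion costs a constant number of operations, whereas `f1_direct` recomputes every inner sum.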
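Now consider a little experiment with $\mathcal{F}_4$. $\mathcal{F}_4$ is modified to $\mathcal{F}_5$ as below so that the inner summative aggregate becomes bound:
$$\mathcal{F}_5(X_n, Y_n) = \sum_{i=1}^{n}\Big(x_i - \sum_{j=1}^{i} y_j\Big)^{1/2} = \sum_{i=1}^{n-1}\Big(x_i - \sum_{j=1}^{i} y_j\Big)^{1/2} + \Big(x_n - \sum_{j=1}^{n} y_j\Big)^{1/2}$$
A similar induction can be used to prove correctness of the above equality. Therefore, $\mathcal{F}_5$ is incrementally updatable.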
For every summative aggregate, it is true that if the summative aggregate contains any unbound variables or unbound summative aggregates, the summative aggregate in question cannot be incrementally updated.
Lemma 6 A summative aggregate that contains only bound entities is incrementally updatable, if, recursively, all its component summative aggregates, if any, contain only bound entities.
Proof of Lemma 6 For any summative aggregate,
$$\mathcal{F}(X_n, Y_n, \ldots, Z_n) = \sum_{i=1}^{n} f(x_i, y_i, \ldots, z_i, i).$$
Proof by induction. (Below we show only the case of insertion. For the case of deletion, similar induction steps can be easily applied.)
1. When $i = 1$ (base):
Letting $\mathcal{F}(X_0, Y_0, \ldots, Z_0) = 0$, the incremental update on the first insertion is
$$\mathcal{F}(X_1, Y_1, \ldots, Z_1) = \mathcal{F}(X_0, Y_0, \ldots, Z_0) + f(x_1, y_1, \ldots, z_1, 1).$$
Thus, the summative aggregate is incrementally updatable on the first insertion.
2. Induction hypothesis:
Suppose for $n > 1$ that $\mathcal{F}(X_{n-1}, Y_{n-1}, \ldots, Z_{n-1})$ is incrementally updatable. [Footnote 11: It should be noted that $n$ is not directly used in aggregate computations. Its sole purpose is to distinguish the previous (or the next) update from the current update. If $n$ appears in a summation body, it should be interpreted as the aggregate count(*).]
3. Induction:
On the $n$th insertion $(x_n, y_n, \ldots, z_n)$, by the induction hypothesis,
$$\mathcal{F}(X_n, Y_n, \ldots, Z_n) = \sum_{i=1}^{n-1} f(x_i, y_i, \ldots, z_i, i) + f(x_n, y_n, \ldots, z_n, n) = \mathcal{F}(X_{n-1}, Y_{n-1}, \ldots, Z_{n-1}) + f(x_n, y_n, \ldots, z_n, n).$$
Therefore, if $f(x_n, y_n, \ldots, z_n, n)$ can be incrementally computed, $\mathcal{F}(X_n, Y_n, \ldots, Z_n)$ should be incrementally updatable. By the postulation of the lemma, all entities in $f(x_n, y_n, \ldots, z_n, n)$ are bound, so their values, except those of bound component summative aggregates, are immediately known. Thus, if there is no component summative aggregate in $f()$, the value of $f()$ can be computed immediately. Otherwise, $f()$ waits until the values of all component aggregates, say, $\mathcal{F}'()$'s, become known. The $\mathcal{F}'()$'s, in turn, are all bound by the postulation.
Therefore, by applying induction steps similar to those thus far, each $\mathcal{F}'()$ can be immediately computed by adding its previous value to the value of its summation body, say, $f'(x_n, y_n, \ldots, z_n, n)$, if $f'(x_n, y_n, \ldots, z_n, n)$ can be computed immediately. Otherwise, $\mathcal{F}'()$ waits again. Applying these steps repeatedly, the computation eventually reaches the deepest nest, where no summative aggregates are used. Then, the summation body of that deepest aggregate can be computed from the values in $(x_n, y_n, \ldots, z_n, n)$, and the recursion is unwound up into $f'()$, and eventually $f(x_n, y_n, \ldots, z_n, n)$ is computed. □
Theorem 4 Every normalized summative aggregate is incrementally updatable.
Proof of Theorem 4 Straightforward from Lemma 6 and the definition of normalization. □
Corollary 1 Every normalized extended summative aggregate is incrementally updatable.
Proof of Corollary 1 Straightforward from Lemma 5 and the definition of normalization of extended summative aggregates. □
It should be noted that Corollary 1 does not imply that the class of normalized (rather, normalizable) extended summative aggregates is the largest class of summative aggregates that can be incrementally updated. There are a few cases in which summative aggregates that are unnormalizable by our normalization method can be incrementally updated by "manually" normalizing them, as shown in the following example.
Example 4.3.15 Given the summative aggregate of equation 4.21, if the user manually expands equation 4.21 into equation 4.22 (equivalently, if the user writes the summative aggregate in the form of equation 4.22), the given summative aggregate can be normalized to equation 4.24, thereby becoming incrementally updatable.
n n
EZlog E I log Xi (4.21)
i=1 =
n (lon1) (4.22
=~ E (o i) log xi (4.23)
lo= o i (4.24)
4.4 Looking-Up Cached Aggregates
In the preceding sections we have described how aggregates, especially summative aggregates, cached in the aggregate cache can be updated incrementally. In this section we investigate an efficient way of looking up a cached aggregate that is requested by a query.
Unlike in an ordinary cache mechanism, cache lookup has great importance in the aggregate cache. First of all, cache lookup is no longer a simple task. For system built-in aggregates, finding them in the cache still remains simple since they must have fixed names and fixed argument formats. However, for user-defined aggregates, their names cannot be good keys for locating them. Moreover, it is a doubtful approach to allow the users to name their aggregates. The aggregate cache needs to be as transparent to the users as possible. Explicitly naming an aggregate implies that if the name is forgotten or not known, the cached aggregate cannot be used or shared. Related and more important is the fact that even when an aggregate is requested for the first time, and thus is not cached yet, it is possible that a similar aggregate is already cached and the requested aggregate can be derived from the cached one. If cache lookup relies entirely on explicit naming, it is not possible to discover such derivability.
Example 4.4.1 Suppose that two users are using the same database. Suppose further that average is not a built-in aggregate. One user wants to compute the average over an aggregate input set X, and names it avg(X). Not long after the first user has completed the computation, the second user wants the same aggregate but does not know that it has already been computed under the name avg. So, he/she requests the aggregate by the name average(X), and the same average is recomputed even though a copy is in the cache. Now, a third user comes in and wants sum, which is again supposed not to be built-in, over X. If average(X) and count(X) are in the cache, it should be possible to derive sum(X) from them. But if any of the three is user-defined and its semantics is not correctly understood by the system, such a derivation would not be possible. □
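The derivation mentioned at the end of the example can be sketched as below. This is an illustrative Python fragment only: the dictionary-based cache, the key layout `(aggregate, input_set)`, and the function `derive_sum` are assumptions, and the sketch presumes the system knows the semantics of all three aggregates.

```python
def derive_sum(cache, input_set):
    """Return sum over input_set from the cache, deriving it from
    average and count when sum itself is not cached; None if impossible."""
    sum_key = ("sum", input_set)
    avg_key = ("average", input_set)
    cnt_key = ("count", input_set)
    if sum_key in cache:
        return cache[sum_key]
    if avg_key in cache and cnt_key in cache:
        # sum(X) = average(X) * count(X)
        return cache[avg_key] * cache[cnt_key]
    return None


# Hypothetical cache contents: average and count over X are cached, sum is not.
example_cache = {("average", "X"): 2.5, ("count", "X"): 4}
derived = derive_sum(example_cache, "X")
missing = derive_sum(example_cache, "Y")
```

If the cached entries were known only by user-chosen names such as avg or average, this derivation could not be triggered, which is the difficulty the example describes.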
In our approach, we provide a small number of built-in (predefined) aggregates that have fixed names and semantics. User-defined summative aggregates too can have names, but those names are only for definitional convenience. That is, a user can define named summative aggregates and use the names in his/her queries to refer to the defined aggregates. However, when a summative aggregate is cached, whether built-in or user-defined, it is decomposed and normalized into several smaller summative aggregates. Then, each of the decomposed summative aggregates is cached if that aggregate is not cached already. When a query requests a summative aggregate, the aggregate is looked up in the cache by following similar steps. The queried aggregate is parsed, decomposed, and normalized, and each decomposed aggregate is searched for in the cache. If all the decomposed aggregates are found in the cache, the value of the queried aggregate is readily computed by using the information obtained when the aggregate was parsed and the values retrieved from the cache. Thus, for cached summative aggregates, their names are immaterial as far as cache lookup is concerned.
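A minimal sketch of name-independent lookup is given below. It assumes each decomposed, normalized component is a sum of a monomial over aggregate variables, and it uses a canonical key (a sorted tuple of variable/exponent pairs) in place of a user-given name. The names `monomial_key`, `AggregateCache`, `put`, and `lookup` are hypothetical, and parsing and normalization of the query are assumed to have happened already.

```python
def monomial_key(factors):
    """Canonical, order-independent key for the sum of a monomial.
    factors maps a variable name to its exponent, e.g. {"x": 1, "y": 1}."""
    return tuple(sorted((var, exp) for var, exp in factors.items() if exp != 0))


class AggregateCache:
    """Hypothetical cache keyed by normalized component, not by name."""

    def __init__(self):
        self._store = {}

    def put(self, factors, value):
        self._store[monomial_key(factors)] = value

    def lookup(self, terms):
        """terms is a list of (coefficient, factors) pairs produced by
        decomposition. Returns the combined value, or None when any
        component is missing and must be recomputed."""
        total = 0.0
        for coeff, factors in terms:
            key = monomial_key(factors)
            if key not in self._store:
                return None
            total += coeff * self._store[key]
        return total


agg_cache = AggregateCache()
agg_cache.put({"x": 2}, 26.0)           # sum of x^2
agg_cache.put({"x": 1, "y": 1}, 21.0)   # sum of x*y
agg_cache.put({"y": 2}, 30.0)           # sum of y^2

# sum((x - y)^2) decomposed into three cached components.
query = [(1, {"x": 2}), (-2, {"y": 1, "x": 1}), (1, {"y": 2})]
hit = agg_cache.lookup(query)
miss = agg_cache.lookup([(1, {"z": 1})])
```

Because the key depends only on the structure of the component, two users who define the same aggregate under different names would reach the same cache entries.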
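Example 4.4.2 This is a comprehensive example. When $S_{xy}(X_n, Y_n)$, which was shown in Example 4.3.1, is cached or looked up in the cache, the following equations show how the original aggregate is decomposed and normalized.
$$S_{xy}(X_n, Y_n) = \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) \tag{4.25}$$
$$= \sum_{i=1}^{n}\Big(x_i - \frac{\sum_{j=1}^{n} x_j}{\sum_{k=1}^{n} 1}\Big)\Big(y_i - \frac{\sum_{j=1}^{n} y_j}{\sum_{k=1}^{n} 1}\Big) \tag{4.26}$$
$$= \sum_{i=1}^{n}\Big(x_iy_i - x_i\frac{\sum_{j=1}^{n} y_j}{\sum_{k=1}^{n} 1} - y_i\frac{\sum_{j=1}^{n} x_j}{\sum_{k=1}^{n} 1} + \frac{\sum_{j=1}^{n} x_j}{\sum_{k=1}^{n} 1}\cdot\frac{\sum_{j=1}^{n} y_j}{\sum_{k=1}^{n} 1}\Big) \tag{4.27}$$
$$= \sum_{i=1}^{n} x_iy_i - \sum_{i=1}^{n}\Big(x_i\frac{\sum_{j=1}^{n} y_j}{\sum_{k=1}^{n} 1}\Big) - \sum_{i=1}^{n}\Big(y_i\frac{\sum_{j=1}^{n} x_j}{\sum_{k=1}^{n} 1}\Big) + \sum_{i=1}^{n}\Big(\frac{\sum_{j=1}^{n} x_j}{\sum_{k=1}^{n} 1}\cdot\frac{\sum_{j=1}^{n} y_j}{\sum_{k=1}^{n} 1}\Big) \tag{4.28}$$
$$= \sum_{i=1}^{n} x_iy_i - \frac{\sum_{j=1}^{n} y_j}{\sum_{k=1}^{n} 1}\sum_{i=1}^{n} x_i - \frac{\sum_{j=1}^{n} x_j}{\sum_{k=1}^{n} 1}\sum_{i=1}^{n} y_i + \frac{\sum_{j=1}^{n} x_j \sum_{j=1}^{n} y_j}{\sum_{k=1}^{n} 1 \sum_{k=1}^{n} 1}\sum_{i=1}^{n} 1 \tag{4.29}$$
$$= \sum_{i=1}^{n}(x_iy_i) - \frac{1}{n}\sum_{i=1}^{n} x_i\sum_{j=1}^{n} y_j - \frac{1}{n}\sum_{i=1}^{n} y_i\sum_{j=1}^{n} x_j + \frac{1}{n}\sum_{j=1}^{n} x_j\sum_{k=1}^{n} y_k \tag{4.30}$$
$$= \sum_{i=1}^{n}(x_iy_i) - \frac{1}{n}\sum_{i=1}^{n} x_i\sum_{k=1}^{n} y_k \tag{4.31}$$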
First, $\bar{x}$ and $\bar{y}$ (the averages of $X_n$ and $Y_n$) in equation 4.25 are rewritten into summative aggregates, resulting in equation 4.26. Equation 4.26 is then decomposed and normalized in the subsequent equations. The resulting summative aggregates are first searched for in the cache. If they are all found in the cache, then $S_{xy}(X_n, Y_n)$ can be obtained immediately. If any of them is not found, the missing summative aggregate has to be recomputed from the base databases (or from the data warehouse if an appropriate copy is being maintained). The value of $S_{xy}(X_n, Y_n)$ is then computed, and all the recomputed summative aggregates are cached for later use. Later on, if another query requests any of the cached summative aggregates, say, the summation of $X_n$, such an aggregate will be found in the cache until it is replaced by another aggregate. On the other hand, equations 4.30 and 4.31 show a process of algebraic simplification of equation 4.29. However, we do not carry out any such simplification. In most cases, the number of unique summative aggregates does not decrease even after such a simplification. Therefore, the penalty for not simplifying should be minimal.
4.5 Conclusions
In this work we have proposed the aggregate cache to improve the performance of complex aggregate queries in the context of data warehouses. Considering that aggregate computation is a frequent operation in data warehouse applications and that such a computation is expensive to perform, reuse of once-computed results is a natural choice. On top of that, the temporal locality of aggregate accesses observed in decision-making processes makes such a cache approach more attractive.
The aggregate cache has a close bearing on view materialization since a cached aggregate can be deemed a materialized view of underlying tables in base databases. Just as incremental view update is an important issue in view materialization, incremental update of a cached aggregate as relevant underlying tables change is a crucial problem in the aggregate cache. If an underlying table is updated and the update affects a cached aggregate, the update should be propagated to the cache so that the cached aggregate too can be updated appropriately.
Based on the currency requirement for cached aggregates, there are several schemes for when to update cached aggregates as base databases change. Rematerialization, periodic update, eager update, and on-demand update have been discussed, and on-demand update has been chosen for the aggregate cache since it is not only well matched to the cache philosophy, but also outperforms the other approaches.
We have investigated a way of incrementally updating summative aggregates, which cover a vast variety of aggregates performing some type of cumulative operation. Importantly, we have identified a class of summative aggregates that is incrementally updatable and inclusive enough to cover many aggregates used in data warehouse applications, and proposed an efficient cache lookup method.
CHAPTER 5
CONCLUSIONS
In this work we have addressed two related issues within the framework of active databases. First, we have proposed a practical approach to static analysis of active rules and their confluent execution based on different user requirements. When the user wants fully confluent rule execution, that requirement can be easily met by removing conflicts in the rule set or by specifying priorities between the conflicting rules. If confluent rule execution is unnecessary, the system can avoid controlling rule executions. Confluence can also be enforced for only a subset of rules. Using our approach, it is also possible to support multiple, application-based confluence controls. In addition, our approach is the best fit for parallel rule execution.
In the second part of our work, we have proposed the aggregate cache, a cache mechanism for aggregates used in data warehouses. The aggregate cache can improve the performance of aggregate computations significantly by saving previous results. It uses a novel approach to cache lookup, in which a queried aggregate is found in the aggregate cache by looking up its component aggregates. Our aggregate cache is transparent to the user; no intervention from the user is necessary to run the aggregate cache. Also importantly, we have identified a precise class of aggregates that can be incrementally updated. We expect that the aggregate cache can be implemented by using active rules over active database systems. Change detection and propagation can be implemented using the event detection facility in the underlying active base databases. In the data warehouse, which is assumed to be an active database too, incrementally updating cached aggregates as changes are propagated to the aggregate cache can also be implemented using active rules. However, the query processor of the data warehouse should be modified appropriately to make use of cached aggregates.
REFERENCES
[1] B. Adelberg, B. Kao, and H. Garcia-Molina. Database support for efficiently maintaining derived data. Technical report, Department of Computer Science, Stanford University, Stanford, CA, 1995.
[2] R. Agrawal, R. Cochrane, and B. Lindsay. On maintaining priorities in a production rule system. In Proceedings International Conference on Very Large Data Bases, pages 479-487, Barcelona, Spain, 1991.
[3] A. Aiken, J. Hellerstein, and J. Widom. Static analysis techniques for predicting the behavior of active database rules. ACM Transactions on Database Systems, 20(1):3-41, Mar. 1995.
[4] A. Aiken, J. Widom, and J. Hellerstein. Behavior of database production rules: Termination, confluence, and observable determinism. In Proceedings International Conference on Management of Data, pages 59-68, San Diego, CA, 1992.
[5] E. Anwar, L. Maugis, and S. Chakravarthy. A new perspective on rule support for object-oriented databases. In Proceedings International Conference on Management of Data, pages 99-108, Washington, DC, May 1993.
[6] R. Badani. Nested transactions for concurrent execution of rules: Design and implementation. Master's thesis, CIS Department, University of Florida, Gainesville, FL, October 1993.
[7] E. Baralis and J. Widom. An algebraic approach to rule analysis in expert database systems. In Proceedings International Conference on Very Large Data Bases, pages 475-486, Santiago, Chile, 1994.
[8] J. Blakely, P. Larson, and F. Tompa. Efficiently updating materialized views. In Proceedings ACM SIGMOD Conference on Management of Data, pages 61-71, Los Angeles, May 1986.
[9] L. Brownston, R. Farrell, E. Kant, and N. Martin. Programming Expert Systems in OPS5: An Introduction to Rule-Based Programming. Addison-Wesley, Reading, MA, 1985.
[10] S. Ceri and J. Widom. Deriving production rules for constraint maintenance. In Proceedings International Conference on Very Large Data Bases, pages 566-577, Brisbane, Australia, 1990.
[11] S. Ceri and J. Widom. Deriving production rules for incremental view maintenance. In Proceedings International Conference on Very Large Data Bases, pages 577-589, Barcelona, Spain, 1991.

Full Text 
STATIC ANALYSIS OF ECA RULES AND USE OF THESE RULES FOR INCREMENTAL COMPUTATION OF GENERAL AGGREGATE EXPRESSIONS
By
SEUNGKYUM KIM
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
1996
Dedicated to My Parents
ACKNOWLEDGEMENTS
I would like to express my sincere gratitude to Dr. Sharma Chakravarthy for his continuous guidance and support during the course of this work. I thank Dr. Eric Hanson, Dr. Herman Lam, Dr. Stanley Su, and Dr. Suleyman Tufekci (in alphabetic order) for serving on my supervisory committee and for their perusal of this dissertation. I would like to thank Mrs. Sharon Grant for maintaining a warm and comfortable research environment. I also thank many fellow students at the Database Systems R&D Center for their help and friendship. Finally, I am deeply indebted to my parents for their endless sacrifice and love. This work was supported in part by the Office of Naval Research and the Navy Command, Control and Ocean Surveillance Center RDT&E Division, and by the Rome Laboratory.
TABLE OF CONTENTS
ACKNOWLEDGEMENTS
LIST OF FIGURES
ABSTRACT
CHAPTERS
1 INTRODUCTION
  1.1 Active Databases
  1.2 Data Warehouses
  1.3 Problem Statement
    1.3.1 Support for Alternative User Requirements in Active Rules Execution
    1.3.2 Efficient Support of Aggregates in Data Warehouses
    1.3.3 Structure of the Dissertation
2 STATIC ANALYSIS OF ACTIVE RULES
  2.1 Introduction
  2.2 Limitations of the Earlier Rule Execution Models
  2.3 Assumptions and Definitions
    2.3.1 Rule Execution Sequence (RES) and Rule Commutativity
    2.3.2 Dependencies and Dependency Graph
    2.3.3 Trigger Graph
  2.4 Confluence and Priority Specification
3 IMPLEMENTATION OF CONFLUENT RULE SCHEDULER
  3.1 Strict Order-Preserving Rule Execution Model
    3.1.1 Extended Execution Graph
    3.1.2 Strict Order-Preserving Executions
    3.1.3 Implementation
    3.1.4 Parallel Rule Executions
  3.2 Alternative Policies for Handling Overlapping Trigger Paths
    3.2.1 Serial Trigger-Path Executions
    3.2.2 Serializable Trigger-Path Executions
    3.2.3 Comparisons with Strict Order-Preserving Execution
  3.3 Discussion and Conclusions
4 AGGREGATE CACHE
  4.1 Motivation
  4.2 Updating Cached Aggregates
  4.3 Incremental Update of Aggregates
    4.3.1 Syntactic Conventions
    4.3.2 Incrementally Updatable Aggregates
    4.3.3 Algebraic Aggregates and Non-Algebraic Aggregates
    4.3.4 Summative Aggregates
    4.3.5 Binding of Variables
    4.3.6 Decomposition of Summative Aggregates
    4.3.7 Normalization of Summative Aggregates
    4.3.8 Incremental Updatability of Nested Summative Aggregates
  4.4 Looking-Up Cached Aggregates
  4.5 Conclusions
5 CONCLUSIONS
REFERENCES
BIOGRAPHICAL SKETCH
LIST OF FIGURES
2.1 Rule execution graphs
2.2 Overlapped trigger paths
2.3 A conflicting rule set
2.4 Priority graphs for Figure 1.4
2.5 A pair of trigger graph and dependency graph
2.6 A priority graph
2.7 An execution graph
3.1 Overlapping trigger paths and extended execution graph
3.2 Three different orderings of dependency edges in Figure 3.1(b)
3.3 Extended execution graph in strict order-preserving executions
3.4 Extended execution graph with rule.counts
3.5 Algorithm Build_EG()
3.6 Algorithm Schedule()
3.7 A priority graph with absolute priorities
3.8 Translation to serializable trigger-path execution: Step 1
3.9 Translation to serializable trigger-path execution: Step 2
4.1 A Data Warehouse and Aggregate Cache
Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy
STATIC ANALYSIS OF ECA RULES AND USE OF THESE RULES FOR INCREMENTAL COMPUTATION OF GENERAL AGGREGATE EXPRESSIONS
By
SeungKyum Kim
May, 1996
Chairman: Dr. Sharma Chakravarthy
Major Department: Computer and Information Science and Engineering
In this work we address two major issues that are related within the framework of active databases. Firstly, we propose a practical approach to rule analysis. We show how alternative rule designer choices can be supported using our approach to achieve confluent rule execution in active databases. Our model employs priority information to resolve conflicts between rules, and uses a rule scheduler based on the topological sort to achieve correct confluent rule executions. Given a rule set, a trigger graph and a dependency graph are built from the information obtained by analyzing the rule set at compile time. The two graphs are combined to form a priority graph, on which the user is requested to specify priorities (or resolve conflicts) only if there exist dependencies in the dependency graph. The user can have multiple priority graphs by specifying different priorities depending on application semantics. From a priority graph, an execution graph is derived for every user transaction that triggers one or more rules. The rule scheduler uses the execution graph. Our model also correctly handles the situation where trigger paths of rules triggered by a user transaction overlap, which is not handled by existing models. We prove that our model achieves maximum parallelism in rule executions.
Next, we propose a cache mechanism, called the aggregate cache, for efficiently supporting complex aggregate computations in data warehouses. We discuss several cache update strategies in the context of maintaining consistency between base databases and aggregates cached in the data warehouse. We formally define the incremental update of aggregates, which is a prime issue for the aggregate cache. Further, we classify algebraic aggregates into summative aggregates that include a vast variety of aggregates applicable in data warehouses to support decision making and statistical data analysis. We prove that there is a precise subclass of summative aggregates that can be incrementally updated. For the incrementally updatable class of summative aggregates, we propose an efficient cache mechanism that allows many user queries to share accesses to the cached aggregates in a transparent way.
PAGE 9
CHAPTER 1 INTRODUCTION 1.1 Active Databases For the past decade, making the traditional passive databases active by incorporating a set of rules has drawn a lot of attention from the database research and development community. Originated from the concept of triggers proposed for the System R [16] and largely developed from the production rule languages for expert systems such as 0PS5 [9], the research on active databases is now getting matured and several active database systems are being (or were) implemented including HiPAC [12], Ode [19], Postgres [35], Starburst [39], Samos [17], Ariel [29], and Sentinel [13]. While there are variations, a representative active database paradigm is to use the ECA rules (EventConditionAction) [12] with either the setoriented semantics or the tupleoriented (instanceoriented) semantics over the relational or ObjectOriented database. An ECA rule consists of three parts, event, condition, and action parts. Events usually correspond to database operations, especially data manipulation operations such as insert, delete, and update. For some systems as HiPAC and Sentinel, temporal events and external events (e.g., user signals) are included too. All of these events are called primitive events. For the ObjectOriented model, a method call is regarded as an event as well. Conditions are generally assumed to be predicates over parameters and database queries without side effects. An action consists of a set of data manipulation operations or a function call. When an event occurs in the system, rules whose event part corresponds to the event occurred are triggered. Of the triggered rules, one rule is picked based on some 1
criteria (or by a process known as conflict resolution). Then the condition part of the selected rule is tested. If the condition is satisfied, the action part of the rule is executed. The process of picking one triggered rule, testing its condition, and executing its action is repeated until no more triggered rules remain.

For event specification and detection in active databases (adopting the ECA rule paradigm), three major approaches have been taken. The common goal in these approaches is to support composite events. A composite event is a composition of primitive events and/or other composite events. For instance, in Sentinel [13], given two events E1 and E2, their disjunction, E1 ∨ E2, is defined to occur when either E1 or E2 occurs. While similar sets of composite events are defined in all the approaches, they differ in the way such composite events are detected. In Ode, finite automata are used to detect composite events expressed by a variation of regular expressions [21, 20], while in Samos, a labeled Petri net is adopted [18]. In Sentinel, on the other hand, we use an event graph where each leaf node represents a primitive event and an intermediate node represents a composite event consisting of other events represented by its child nodes. When a primitive event occurs, the occurrence is propagated to its parent node. A parent node, with an appropriate restriction for each type of composite event, collects occurrences of its child events, signals an occurrence of the composite event if a certain condition is met, and propagates the occurrence of the composite event to its own parent node [13].

For condition specification and testing, few systems other than Ariel take a sophisticated approach. When a condition is represented by a database query, that query should not update database contents (i.e., it should be free of side effects), and the condition is interpreted as satisfied if the query returns a nonempty result.
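The basic rule-processing cycle described at the start of this passage (pick one triggered rule, test its condition, execute its action, repeat until no triggered rules remain) can be sketched as follows. This is only an illustrative sketch and not the execution model of any of the systems cited: the names Rule, process, reorder, log_order, and RULES, the numeric priority used for conflict resolution, and the dictionary used as a database state are all assumptions made for the example. It also assumes that rules do not trigger each other cyclically, so the loop terminates.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, int]  # hypothetical stand-in for a database state


@dataclass
class Rule:
    name: str
    event: str                              # primitive event the rule waits for
    condition: Callable[[State], bool]      # side-effect-free predicate
    action: Callable[[State], List[str]]    # updates the state, returns raised events
    priority: int = 0                       # assumed conflict-resolution criterion


def process(rules: List[Rule], state: State, events: List[str]) -> List[str]:
    """Run the pick / test / execute cycle until no triggered rule remains."""
    triggered = [r for e in events for r in rules if r.event == e]
    executed: List[str] = []
    while triggered:
        # Conflict resolution: highest priority first, ties broken by position.
        rule = max(triggered, key=lambda r: r.priority)
        triggered.remove(rule)
        if rule.condition(state):
            raised = rule.action(state)
            executed.append(rule.name)
            # Events raised by the action may trigger further rules.
            triggered.extend(r for e in raised for r in rules if r.event == e)
    return executed


def reorder(state: State) -> List[str]:
    state["orders"] += 1
    return ["insert_order"]


def log_order(state: State) -> List[str]:
    state["log"] += 1
    return []


RULES = [
    Rule("reorder", "update_stock", lambda s: s["stock"] < 10, reorder),
    Rule("log_order", "insert_order", lambda s: True, log_order),
]
```

Calling process(RULES, state, ["update_stock"]) on a state with low stock runs reorder and then the cascaded log_order; with a high stock value the condition of reorder fails and nothing is executed.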
In Ariel, condition testing is done by an algorithm called A-TREAT [28]. It uses a discrimination network to efficiently compare a large number of patterns to a large amount of data
without repetitive scanning. A-TREAT can speed up rule processing in a database environment and reduce the storage requirement of the TREAT algorithm.

On the other hand, there are two representative rule execution semantics: tuple-oriented (or instance-oriented) semantics and set-oriented semantics [30]. When an event occurs, it can trigger one or more rules. These triggered rules are called instances of their respective rules. Even for a single rule, multiple instances may exist at any given time. It should be noted that, as events, data manipulation operations have quite coarse granularity. For instance, an insert event on a relational table is usually defined to occur when any tuple is inserted into the table. In other words, the granularity of the insert event is the whole relation, not a tuple. Also, in SQL, an insert, delete, or update statement can modify multiple tuples in one execution. Therefore, if a rule is triggered by any of the data manipulation operations, it is likely that such an event will give rise to multiple instances of the same rule. The two rule execution semantics differ in such situations. Suppose one execution of an insert statement inserts three tuples, t1, t2, and t3, into a table, and a rule r1 is triggered by that insert event. In the tuple-oriented semantics, t1, t2, and t3 each trigger their own instance of r1, and for each instance of r1 its condition is tested and, if satisfied, its action is executed. However, in the set-oriented semantics, only one instance of r1 is executed for t1, t2, and t3; that is, r1's condition is tested just once and, if the condition is satisfied, the action part of r1 is carried out on each of t1, t2, and t3.

Postgres has tuple-oriented rule execution semantics, while Starburst and Ariel have set-oriented semantics. HiPAC and Sentinel, on the other hand, have a rather different rule execution semantics.
In these systems, conceptually all triggered rules are executed concurrently using the nested transaction model [12, 6].
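The difference between the tuple-oriented and set-oriented semantics described above can be illustrated with a small sketch. It is a simplified model written for this example, not the implementation of Postgres, Starburst, or Ariel: the function names, the representation of tuples as dictionaries, and the sample condition are assumptions.

```python
from typing import Callable, Dict, List

Row = Dict[str, int]  # hypothetical representation of an inserted tuple


def run_tuple_oriented(inserted: List[Row],
                       condition: Callable[[List[Row]], bool],
                       action: Callable[[Row], None]) -> int:
    """One rule instance per inserted tuple; returns the number of condition tests."""
    tests = 0
    for row in inserted:
        tests += 1
        if condition([row]):      # condition sees only this tuple
            action(row)
    return tests


def run_set_oriented(inserted: List[Row],
                     condition: Callable[[List[Row]], bool],
                     action: Callable[[Row], None]) -> int:
    """One rule instance for the whole set; the condition is tested once."""
    if condition(inserted):       # condition sees all inserted tuples
        for row in inserted:
            action(row)
    return 1


def has_large_quantity(rows: List[Row]) -> bool:
    return any(r["qty"] > 10 for r in rows)


INSERTED = [{"id": 1, "qty": 5}, {"id": 2, "qty": 50}, {"id": 3, "qty": 7}]

audit_tuple: List[int] = []
tests_tuple = run_tuple_oriented(INSERTED, has_large_quantity,
                                 lambda r: audit_tuple.append(r["id"]))

audit_set: List[int] = []
tests_set = run_set_oriented(INSERTED, has_large_quantity,
                             lambda r: audit_set.append(r["id"]))
```

With the same three inserted tuples, the tuple-oriented run tests the condition three times and acts only on the tuple that satisfies it, whereas the set-oriented run tests the condition once and then acts on every inserted tuple.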
1.2 Data Warehouses

In order to support efficient processing of queries that are often complex and span vast amounts of data, which in turn may be distributed and even heterogeneous, one viable approach is to maintain a separate dedicated repository that holds the gist of the base data and to process the queries over that repository. By doing so, decision support applications, which usually involve many such otherwise expensive queries, can also be detached from ordinary online transactions being performed over the underlying base databases, thereby significantly enhancing the performance of both categories of applications. This approach, known as data warehousing [31], is rapidly becoming popular, and a lot of research effort has recently been devoted to solving various related issues [33].

A data warehouse system can be viewed as a multilayered database system. At the bottom level, there are several databases, called base databases, each of which is an operational, independent database system. At the top level, there is one specialized, (almost) read-only database system that contains the most abstracted or summarized data derived from the databases one level below. As a result, this system constitutes a pyramid-structured abstraction of information (or data). Usually, however, only a two-layer structure is assumed; that is, base databases and a summarized database, which we call the data warehouse. Between the two layers, a physical separation is generally assumed. The base databases can also be distributed and even heterogeneous. All these separate databases are connected through a network.

Since the data warehouse and base databases are separated and the data warehouse contains information derived from the base databases, it is a natural requirement that as base databases change, the data warehouse should be updated accordingly, if the changes are relevant to the information in the data warehouse. Therefore,
an intermediate layer between the two layers is necessary to mediate communications between them. This intermediate layer is responsible for monitoring changes made to the base databases and propagating them to the data warehouse to update its contents. In practice, however, the intermediate layer can consist of two interface layers, one residing in each base database and one in the data warehouse. The interface layer in a base database is responsible for detecting changes in that base database, and it communicates with the interface layer in the data warehouse to propagate the changes.

One issue arising here is when, or how often, to propagate the changes. Approaches proposed in the literature include eager propagation for a currency-critical data set, polling for a data set that changes slowly or whose currency requirement is not critical, and lazy (or on-demand) propagation for a data set that changes slowly and is not used frequently.

An issue closely related to change propagation is the incremental update of the data warehouse. It is unthinkable to repopulate (or rederive) the data warehouse whenever a change is made in a base database. Therefore, there should be a way to incrementally update the data warehouse with the propagated changes. In fact, this issue is not a new one. View materialization [8, 27] in relational databases deals with a similar, but not identical, problem. Although the data warehouse can be viewed as a set of materialized views (derived from base databases), the data warehouse poses many problems that need special attention. The views defined in the data warehouse can be so complex that conventional view materialization techniques are inadequate. They usually contain historical data and highly aggregated and summarized data, which is another reason why view materialization techniques cannot be directly applied [37].
Recently, the issue of view materialization has been revisited by several researchers including [23, 24].
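As a minimal illustration of what incremental update means for summarized data, the sketch below maintains an average from a running sum and count, so that propagated insertions and deletions can be applied without rescanning the base data. It is a generic textbook-style example and is not the aggregate cache mechanism proposed in this dissertation; the class name AvgCache and its methods are assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class AvgCache:
    """Hypothetical cached aggregate: keeps SUM and COUNT so AVG can be derived."""
    total: float = 0.0
    count: int = 0

    def apply(self, inserted: Iterable[float] = (), deleted: Iterable[float] = ()) -> None:
        # Fold the propagated changes into the stored partial results.
        for value in inserted:
            self.total += value
            self.count += 1
        for value in deleted:
            self.total -= value
            self.count -= 1

    @property
    def average(self) -> Optional[float]:
        return self.total / self.count if self.count else None


cache = AvgCache()
cache.apply(inserted=[10, 20, 30])        # initial population from the base data
initial_average = cache.average
cache.apply(inserted=[40], deleted=[10])  # a later change in a base database
updated_average = cache.average
recomputed = sum([20, 30, 40]) / 3        # full recomputation, for comparison only
```

The point of the design is that the cost of apply depends only on the size of the change, not on the size of the base data, while the result agrees with a full recomputation.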
1.3 Problem Statement

1.3.1 Support for Alternative User Requirements in Active Rules Execution

Using ECA rules in active database systems for real-life applications involves implementing, debugging, and maintaining a large number of rules. Experience in developing large production rule systems has amply demonstrated the need for understanding the behavior of rules, especially when their execution is nondeterministic. The availability of rules in active database systems and their semantics create additional complexity for both modeling and verifying the correctness of such systems. For the effective deployment of active database systems, there is a clear need for an analysis (both static and dynamic) and explanation facility to understand the interaction among rules, between rules and events, and between rules and database objects.

Due to the event-based nature of active database systems, special attention has to be paid to making the context of rule execution explicit [15]. Unlike in an imperative programming environment, rules are triggered in the context of the transaction execution, and hence both the order and the rules triggered vary from one transaction/application to another.* Ideally, the support environment needs to be able to accept the expected behavior and compare it with the actual behavior to provide useful feedback on the differences between the two. This has to be done in the context of database objects, rules, events, and transaction sets that are executed concurrently. Short of this, it is useful to provide alternatives that allow the designer to choose among various options that are meaningful to his/her application. For example, user

* Use of the nested transaction model for rule execution in Sentinel [6, 36] provides such a context.
Our graphics visualization tool [14] using Motif displays transaction and rule execution by using directed graphs to indicate both the context (i.e., transactions/composite events) and the execution of (cascading) rules. We plan to augment the existing visualization capability with the static analysis proposed in this paper.
requirements may come from the following set of answers as far as rule execution is concerned:

1. No preference; any arbitrary execution order of rules and their final result is acceptable.
2. Arbitrary final state is NOT acceptable. The designer wants to see a unique final result whenever the same set of rules executes from the same initial database state (i.e., Confluent Rule Execution is desirable).
3. This particular group of rules must give a unique result when any subset of these rules executes. No preference for the rest of the rules.
4. This application must have this order and that application must have that order, if a different execution order gives a different result.

It is evident that the ability to interact with the designer and to support alternative user requirements is quite important. As a result, support environments for the design of ECA rules in the context of active databases need to support:

• ECA rule analysis to provide feedback on the interaction of rules either globally or with respect to transactions,
• Confluence analysis at rule scheduling time (in addition to static analysis),
• Parallel rule execution where possible for efficiency reasons, and
• Visualization of actual run time rule execution and related contextual information.

In the first part of this thesis (Chapters 2 and 3), we propose an approach that addresses the above. We address confluent rule executions (which deal with obtaining a unique final database state for any initial set of rules) for a user transaction. We
show that previous rule execution models are inadequate for confluent rule executions in some cases, and propose extensions that can readily meet the alternative user requirements with respect to confluent executions. We also show that our model naturally supports parallel rule executions.

1.3.2 Efficient Support of Aggregates in Data Warehouses

Although for years there has been great interest in the data warehouse within the database industry and within academia as well, it does not appear that an efficient, fully functional, and flexible data warehouse system has emerged yet. There are many reasons for that. Some of them are purely engineering problems. There are many legacy database systems that are still running and lack the capability to interface with any newer systems. Extracting data from such a system and monitoring data changes may not be very challenging theoretically, but from the engineering point of view they are not trivial tasks at all. On the other hand, there are still issues to be answered in more systematic and fundamental ways. One such issue that we find very important is the efficient support of aggregates in data warehouses. As data warehouses contain a lot of highly summarized data and many applications in data warehouses perform analyses ranging from simple to very complex over the data, it is crucial to improve the performance of such aggregate or summary operations. In a naive implementation, it could take several hours or more to perform a simple summation over millions of records dispersed in base databases. With such a response time, no one would expect interactive data analysis, which is an important requirement for the data warehouse to be truly useful as a cooperative tool.

With this consideration in mind, in Chapter 4 we propose a practical means to boost the performance of aggregate processes in data warehouses.
The aggregate cache that we propose saves in the system (i.e., the data warehouse) previous results of aggregate computations. As base databases change, the stored results are updated
appropriately. Along with this engineering aspect, in our work we identify and prove that there is an exact class of aggregates that are guaranteed to be incrementally updatable, a result that we believe is a theoretical contribution to the research on data warehouses.

1.3.3 Structure of the Dissertation

The rest of this dissertation is structured as follows. In Chapter 2, we introduce formal definitions of confluent rule execution and show conditions under which confluent rule execution can be achieved. In Chapter 3, we generalize the notions developed in Chapter 2 to obtain confluent rule execution in general situations, and present rule scheduling algorithms along with a proof of the maximal parallelism that our approach attains. In Chapter 4, we propose the aggregate cache, and identify and prove a class of aggregates that are incrementally updatable. Chapter 5 contains general conclusions for this dissertation.
CHAPTER 2
STATIC ANALYSIS OF ACTIVE RULES

2.1 Introduction

Incorporating ECA rules (Event-Condition-Action rules) into traditional database systems, thereby making them active databases, can broaden database applications significantly [5, 12, 19, 30]. Also, ECA rules provide more flexible and general alternatives for implementing many database features, such as integrity constraint enforcement, that are traditionally hardwired into a DBMS [10, 11, 35, 38, 39]. An ECA rule consists of three parts: event, condition, and action. Execution of ECA rules goes through three phases: event detection, condition test, and execution of the action. An event can be a data manipulation or retrieval operation, a method invocation in object-oriented databases, a signal from a timer or from users, or a combination thereof. An active database system monitors occurrences of events prespecified by ECA rules. Once the specified events have occurred, the condition part of the relevant rule is tested. If the test is satisfied, the rule's action part can be executed. In Sentinel [13], a rule is said to be triggered when the rule has passed the event detection phase; that is, when one or more events which the rule is waiting for have occurred. When an ECA rule has passed the condition test phase, it is said to be eligible for execution.* In this work, we use "trigger" to describe the eligible rules, assuming that the condition part has been satisfied or is nil.

* The definition of trigger has become blurred as condition-action rules such as production rules [9] have evolved into ECA rules. A condition-action rule is triggered and eligible for execution when the current database state satisfies the specified condition, whereas for an ECA rule to be ready to execute, it has to pass separate event detection and condition test phases.
The ECA rule execution model (and rule execution models in general) has to address several issues. First, for various reasons multiple rules can be triggered and eligible for execution at the same time. For example, suppose that two rules r_i and r_j are defined, respectively, on two events E1 and (E1 ∨ E2), and there are no conditions specified for the rules, where (E1 ∨ E2) is a disjunction of two component events E1 and E2. A disjunction occurs when either component event occurs [13]. Now, if event E1 occurs, it will trigger both r_i and r_j. As addressed by Aiken et al. [3, 4], multiple triggered rules pose problems when different execution orders can produce different final database states. If an active database system randomly chooses a rule to execute (out of several triggered rules), as many extant systems do as a last resort, the final database state becomes nondeterministic. This adds to the problem of understanding applications that trigger rules.

To deal with multiple triggered rules, a generally taken approach is to define priorities among conflicting rules [2, 12, 28, 34]. When multiple conflicting rules are triggered at the same time, the rule with the highest priority is selected for execution. While we believe prioritization is a sound approach, we notice that the previous priority schemes are incomplete and inadequate to handle the complexity caused by trigger relationships between rules.

On the other hand, Aiken et al. [3] focus on testing whether a given rule set has the confluence property. A rule set is said to be confluent if any permutation of the rules yields the same final database state when the rules are executed. If a rule set turns out not to be confluent, either the rule set is rewritten to remove the conflicts or priorities are explicitly defined over the conflicting rules. Then the new rule set is retested to see if it has the confluence property.
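The informal definition of confluence given above (any permutation of the rules yields the same final database state) can be checked by brute force for a small rule set and a single initial state. The sketch below does only that: it is not the static test of Aiken et al., it ignores triggering between rules and rule conditions, and the rule functions and state layout are assumptions chosen for the example.

```python
from itertools import permutations
from typing import Callable, Dict, List, Set, Tuple

State = Dict[str, int]           # hypothetical database state
RuleAction = Callable[[State], None]


def final_states(rules: List[RuleAction], initial: State) -> Set[Tuple[Tuple[str, int], ...]]:
    """Execute every permutation of the rules and collect the distinct final states."""
    results = set()
    for order in permutations(rules):
        state = dict(initial)    # each permutation starts from the same initial state
        for rule in order:
            rule(state)
        results.add(tuple(sorted(state.items())))
    return results


def is_confluent(rules: List[RuleAction], initial: State) -> bool:
    return len(final_states(rules, initial)) == 1


def add_bonus(state: State) -> None:
    state["salary"] += 100


def double_salary(state: State) -> None:
    state["salary"] *= 2


def touch_log(state: State) -> None:
    state["log"] += 1


INITIAL = {"salary": 1000, "log": 0}
conflicting = is_confluent([add_bonus, double_salary], INITIAL)
independent = is_confluent([add_bonus, touch_log], INITIAL)
```

Here add_bonus and double_salary write the same data item, so the two execution orders end in different salaries, while add_bonus and touch_log touch disjoint data and give the same final state in either order. The number of permutations grows factorially, which is one reason such an exhaustive check is only useful for illustration.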
A problem with this approach is that it tends to repeat the time-consuming test process until the rule set eventually becomes confluent. Also, it has not been shown by which mechanism confluence can be
guaranteed as priorities between conflicting rules are added to the system as a means of conflict resolution.

There are other subtle problems with the ECA rule execution model. Suppose that r_i and r_j mentioned previously have condition parts. As event E1 occurs, the two rules pass the event detection phase. Assume that both rules have passed the condition test and are ready to execute their action parts. It is possible that the execution of one rule's action, say r_i's, can invalidate r_j's condition that was already tested to be true; that is, r_i can untrigger r_j. Apart from the issue of whether or not the condition test should be delayed up to the point (or retested at the point) just before execution of the action part, if one rule untriggers other rules, it is very likely that the rule set is not confluent. The opposite situation can also happen. Suppose the condition of r_i was not met, so its action part would not be executed. But execution of r_j's action could change the database state so that r_i's condition is satisfied this time. Therefore, if r_j executes first and r_i's condition is tested after that, r_i will be able to execute too. Again, the execution order of the two rules makes a difference. Instead of proposing a more rigorous rule execution model to deal with these anomalies, we consider such rules as conflicting with one another so that the rule designer can be informed of these rules. This view allows us to cover the problem within the framework of confluent rule executions.

In this work we explore problems of confluent rule executions, which deal with obtaining a unique final database state for any initial set of rules that is triggered by a user transaction. We show that previous rule execution models are inadequate for confluent rule executions and propose a new rule execution model that guarantees confluent executions. It is also shown that our model is a perfect fit for parallel rule execution.
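The untriggering anomaly discussed in this section can be made concrete with a small sketch. It assumes an execution model in which each rule's condition is (re)tested just before its action runs, which is only one of the options mentioned above; the rules r_i and r_j, their conditions and actions, and the state layout are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, int]  # hypothetical database state
Rule = Tuple[str, Callable[[State], bool], Callable[[State], None]]


def run(order: List[Rule], initial: State) -> Tuple[List[str], State]:
    """Execute rules in the given order, testing each condition just before its action."""
    state = dict(initial)
    executed: List[str] = []
    for name, condition, action in order:
        if condition(state):
            action(state)
            executed.append(name)
    return executed, state


def clear_stock(state: State) -> None:
    state["stock"] = 0           # writes the data item that r_j's condition reads


def ship(state: State) -> None:
    state["shipped"] += 1


r_i: Rule = ("r_i", lambda s: True, clear_stock)
r_j: Rule = ("r_j", lambda s: s["stock"] > 0, ship)

INITIAL = {"stock": 5, "shipped": 0}
executed_ij, state_ij = run([r_i, r_j], INITIAL)   # r_i first untriggers r_j
executed_ji, state_ji = run([r_j, r_i], INITIAL)   # r_j first: both rules execute
```

Although the two rules write different data items, the two orders end in different database states, because r_i changes the value on which r_j's condition depends. This is the kind of pair that the text proposes to treat as conflicting.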
2.2 Limitations of the Earlier Rule Execution Models

Early rule execution models such as the one used in OPS5 [9] deal with problems of confluent rule executions only in terms of conflict resolution. When multiple rules are triggered (and eligible for execution), the rule scheduler selects a rule to execute according to a certain set of criteria, such as recency of trigger time or complexity of conditions. Although this scheme has been used in the AI domain, users in the database domain prefer deterministic states from the executions of transactions. Furthermore, the conflict resolution approach is not a complete answer to the confluence problem since it is based on dynamic components such as recency of trigger time. For the above reasons, we do not consider this approach in our work.

A somewhat different approach, taken in active database systems such as Starburst [2, 4] and Postgres [34, 35], is to statically assign execution priorities over rules. In these systems, if multiple rules are triggered, the rule with the highest priority among them is executed first. However, rule execution models in these systems cannot guarantee confluent rule executions unless all the rules (not only conflicting ones) are totally ordered. This problem is illustrated in the following examples.

Example 2.2.1 Figure 2.1(a) shows a nondeterministic behavior of rule execution even when all conflicting rules are ordered. In the figure, solid arrows represent trigger relationships. Dashed lines represent conflicts, and an arrow on a dashed line indicates priority between two conflicting rules. As shown, two pairs of rules are conflicting: (r2, r5) and (r3, r5). The conflicting rules are ordered in such a way that r2 precedes r5 and r3 precedes r5 in execution when the pairs of conflicting rules are triggered at the same time. Now suppose r1 and r4 are triggered by the user transaction at the same time. (Note that these rules are denoted by solid circles in the graph.)
In a rule execution model such as that of Starburst, one of r1 and r4 will be randomly picked for execution since there is no conflict between them, and thus no order is specified. Suppose
r4 is executed first; then it will trigger r5. Yet there is no order between r1 and r5, which are both ready to execute. So r5 may go first, and its execution will trigger r6. Then r6, r1, r2, and r3 may follow. Including this execution sequence, two of the legitimate execution sequences for the rule set are as follows: (1) (r4 · r5 · r6 · r1 · r2 · r3) and (2) (r1 · r2 · r3 · r4 · r5 · r6). Note that the relative orders of the two conflicting rules r2 and r5 (as well as r3 and r5) differ between the two rule execution sequences, so confluent execution cannot be guaranteed. ◁

[Figure 2.1. Rule execution graphs, panels (a) and (b).]

Example 2.2.2 Figure 2.2 illustrates another situation where the previous rule execution models fail to achieve confluent rule executions. There is a dependency between r_k and r_l, and r_k has priority over r_l. In this example, r_i and r_j are triggered by the user transaction. Note also that r_i is an ancestor of r_j in the trigger relationship, and thereby the trigger paths originating from both rules overlap one another. Given that priority, the following two (and more) sequences of rule executions are possible: (r_i · r_j · r_k · r_l · r_j · r_k · r_l) and (r_i · r_j · r_k · r_j · r_k · r_l · r_l). Now the relative orders of the two conflicting rules r_k and r_l in the two execution sequences are different. Therefore confluent rule executions cannot be guaranteed in the given situation. ◁

As the previous examples suggest, a problem that the extant active rule execution models fail to address properly is that even though two rules do not directly conflict with each other, they may trigger other rules that do directly conflict. Depending
on the execution order of the triggering rules, the directly conflicting rules may be executed in an order different from what the user specified, likely resulting in nonconfluent rule executions.

[Figure 2.2. Overlapped trigger paths.]

Unless all the direct conflicts are removed by rewriting the rules, one possible remedy for this problem, implied in Starburst [3], would be to regard the indirectly conflicting rules as conflicting ones. Figure 2.1(b) illustrates how the conflicts of Figure 2.1(a) are propagated toward ancestor rules in the trigger relationship when this approach is taken. An undesirable consequence of propagating conflicts is that it severely limits parallel rule execution. In addition, it is not always clear how to propagate conflicts in some cases, as Figure 2.2(a) shows.

Another problem that the previous rule execution models do not handle was shown in Example 2.2.2, where the trigger paths of rules triggered by the user transaction overlap. In fact, this new situation poses additional problems for priority specification. That is, no static priority scheme specified before rule execution can range over all possible permutations of conflicting rule executions, since one cannot anticipate which rules will be triggered by the user transaction, or how many times. For instance, given the rule set of Figure 2.2(a), there can be two distinct final database states, which result from the rule execution sequences (r_i · r_j · r_k · r_l · r_j · r_k · r_l) and (r_i · r_j · r_k · r_j · r_k · r_l · r_l). All other legitimate rule execution sequences are equivalent to one of the two sequences
in terms of final database states. However, if r_j is triggered twice and r_i once by the user transaction, the number of distinct final database states increases to five. As r_i and r_j are triggered more times, the number of different final database states increases exponentially. Therefore, it is not realistic to provide every possible alternative for these cases. Rather, a less general scheme of priority specification, which provides only some specific alternatives, needs to be considered.

Figure 2.2(b) shows one way of specifying priority for the rule set of Figure 2.2(a), which is similar to the priority scheme adopted in Postgres [34]. Numbers in brackets denote absolute priorities associated with rules. A larger number denotes a higher priority. This priority specification guarantees confluent rule executions, although the nonconflicting rules (r_i and r_j) need to be assigned priorities too. Note that the given priority specification is (unnecessarily) so strong that it effectively imposes a serial execution order (r_j · r_k · r_l · r_i · r_j · r_k · r_l), thereby ruling out any parallel rule executions. For instance, one instance of r_j could run in parallel with a series of r_i and r_j without affecting the final database state.

In the subsequent sections, we develop a novel rule execution model and a priority scheme that not only ensure confluent rule executions but also allow greater parallelism.

2.3 Assumptions and Definitions

2.3.1 Rule Execution Sequence (RES) and Rule Commutativity

Informally, a rule execution sequence (RES) is a sequence of rules that the system can execute when a user transaction triggers at least one rule in the sequence. To characterize RESs, we first define partial RESs. Throughout this work, R denotes the system rule set, the set of rules defined in the system by the user. D denotes the set of all possible database states determined by the database schema.
(d_j, R_k), where d_j ∈ D and R_k ⊆ R, denotes a pair of a database state and a triggered rule set. If R_k is a
set of rules directly triggered by a user transaction, it is specially called the UTRS, which stands for User-Triggered Rules Set. The UTRS is, in fact, a multiset since, as we shall see later, multiple instances of a rule can be in it. S denotes the set of all partial RESs (see below) defined over R and D.

Partial RES. Given R and D, for a nonempty set of triggered rules R_k ⊆ R and a database state d_j ∈ D, a partial RES σ is defined to be a sequence of rules that connects pairs of a database state and a triggered rule set as follows:

\[ \sigma = \bigl( (d_j, R_k) \xrightarrow{r_i} (d_{j+1}, R_{k+1}) \xrightarrow{r_{i+1}} \cdots \xrightarrow{r_{i+m-1}} (d_{j+m}, R_{k+m}) \bigr) \]

where d_{j+l} ∈ D (1 ≤ l ≤ m) is a new database state obtained by execution of r_{i+l-1}, and each rule r_{i+l} (0 ≤ l < m) is in the triggered rule set R_{k+l} and eligible for execution in d_{j+l}, i.e., d_{j+l} evaluates the rule's condition test to true. Each triggered rule set R_{k+l} ⊆ R (1 [...] is a database state produced

* Subscripts i, i+1, ..., i+m-1 attached to rules are intended to mean that they are m rules that need not be distinct (similarly for d's and R's). They do not represent any sequential order of rules with respect to subscript numbers. That is, they should not be interpreted as, for instance, r_10, r_11, r_12, ..., in the case where r_i is r_10. For a precise denotation, we could use i_0, i_1, ..., i_{m-1} instead. However, we have opted for the less precise notation in favor of simplicity throughout this work.
by operations in the user transaction, a complete RES (or RES) σ is defined to be a partial RES:

\[ \sigma = \bigl( (d_j, R_k) \xrightarrow{r_i} (d_{j+1}, R_{k+1}) \xrightarrow{r_{i+1}} \cdots \xrightarrow{r_{i+m-1}} (d_{j+m}, R_{k+m} = \emptyset) \bigr) \]

where no triggered (and eligible) rules remain after execution of the last rule r_{i+m-1} (i.e., R_{k+m} = ∅). ◁

Note that given R_k and d_j, there may be multiple different RESs, even in a case where there is only one rule in R_k, and those RESs do not necessarily have the same set of rules executed, since a rule's triggering/untriggering of other rules may depend on the current database state. In this work we use rule schedule in informal settings interchangeably with complete RES.

Rule shuffling. Given a partial RES σ_1, two rules r_i and r_j in
[...] ((d_x, R_y) → (d_m, R_n) → (d_u, R_v)) where d_x, d_k, d_m, d_u ∈ D need not be distinct and likewise R_y, R_l, R_n, R_v ⊆ R need not be distinct. ◁

Note that any rule is trivially commutative to itself.

Equivalent partial RESs. Two partial RESs
Equivalence class of partial RESs. For a partial RES σ ∈ S, the equivalence class of σ is the set S_σ defined as follows: S_σ = {τ ∈ S | τ ≡ σ}.

Among partial RESs belonging to the same equivalence class, for the discussion in this work we define the canonical partial RES, or canonical RES for short, to be the partial RES that comes first when all the partial RESs are sorted by their rules' concatenated subscripts in lexicographical order. For instance, assuming that an equivalence class includes only three partial RESs,
Data dependency. Two distinct rules r_i and r_j are defined to have a data dependency with each other if r_i writes in its action part to a data object that r_j reads or writes in its action part, or vice versa. ◁

Untrigger dependency. Two distinct rules r_i and r_j are defined to have an untrigger dependency with each other if r_i writes in its action part to a data object that r_j reads in its condition part, or vice versa. ◁

If two rules have a data dependency with each other, the input to one rule can be altered by the other rule's action. Thus it is very likely that the affected rule would behave differently. The data dependency can also mean that one rule's output can be overridden by the other rule's output. This also has a bearing on the final outcome. If there is no data dependency, the two rules act independently. Therefore, there should be no difference in the final outcome due to a different relative execution order of the two rules. On the other hand, if there is an untrigger dependency between two rules r_i and r_j, one rule's action can change the condition that determines whether the other rule is to execute or not. If the affected rule, say r_i, has already executed first, it is unrealistic to revoke the effect of r_i. As a result, both r_i and r_j will execute in this case. However, if the affecting rule r_j executes first, it can prevent r_i from executing. Since it is assumed that there are no read-only rules, the two different execution sequences can result in different database states even though there is no data dependency.*

From the observation above, it is clear that the absence of data dependency and untrigger dependency between two rules is a sufficient condition for the two rules to be

* It should be noted that whether or not the untrigger dependency can indeed affect confluent execution depends on the rule execution model employed by an active database system.
If the rule execution model does not recheck the condition part of a rule just before it executes the action part of that rule, then no rule is untriggered. In such a case, it can appear that the untrigger dependency is no longer a problem and only data dependency matters.
PAGE 30
22 commutative. (The reverse is not necessarily true.) If there exists either dependency between two rules, the rules are said to conflict with each other. Obviously, conflicting rules are noncommutative. Lemma 1 Given a partial RES a, a new partial RES c' obtained by freely shuffling rules in a is equivalent to cr, as long as relative orders of conflicting rules in both RESs are equal if there are any conflicting rules. Proof of Lemma 1 Suppose
has a node set R in which each node has a one-to-one mapping to a rule, and a trigger edge set E_t. For any two nodes (i.e., rules) r_i and r_j in R, the trigger edge set E_t contains a directed edge (r_i → r_j), called a trigger edge, if and only if r_i can trigger r_j. It is defined that r_i can trigger r_j if execution of r_i's action can give rise to an event that is referenced in the event specification of r_j (see the footnote below). A trigger path in a trigger graph is a (linear) path starting from any node and ending at a reachable leaf node.

Note that for rules r_i and r_j above, it is possible that r_j is not triggered by r_i at run time if r_i's action part contains a conditional statement. Nevertheless, we conservatively maintain a trigger edge if there is any possibility of r_i's triggering r_j. In addition, we assume that a trigger graph is acyclic to guarantee termination of rule executions [3]. If a trigger graph contains a cycle, it is possible that once a rule in the cycle is triggered, all the rules in the cycle keep triggering the next rule indefinitely. We also assume that there exists a separate mechanism for detecting cycles in a trigger graph so that the rule designer can rewrite the rules in such a case.

Incidentally, it should be noted that a trigger relationship between two rules does not necessarily imply a dependency between the rules. For instance, given a trigger edge (r_i → r_j), if r_i for sure triggers r_j and no other rules are triggered from r_i and r_j, there are only two possible partial RESs for the two rules, (r_i • r_j • r_j) and (r_j • r_i • r_j). If there is no data or untrigger dependency between r_i and r_j (i.e., the two rules are commutative), the two RESs are equivalent despite the trigger edge.

Footnote: We admit that this definition of "can trigger" is rather crude. In Sentinel, for example, if a rule is waiting for an occurrence of (E1; E2), which is a composite event sequence and occurs when E2 occurs provided E1 occurred before, the occurrence of E1 alone never triggers that rule. In our current work, however, we do not pursue this issue any further. (For event specifications in Sentinel, see Chakravarthy et al. [13].)

2.4 Confluence and Priority Specification

In this section we present basic ideas that give us a handle for dealing with conflicting rules in order to obtain confluent rule executions. We consider simple cases first. When there are n distinct rules to execute and m pairs of conflicting rules among them, intuitively, the maximum number of different final database states that can result from all differing RESs is conservatively bounded by 2^m, since each pair of conflicting rules can possibly produce two different final database states by changing their relative order.

Figure 2.3. A conflicting rule set

Example 2.4.1. Figure 2.3 is a redrawing of Figure 2.1(a) with the directions removed from dependency edges (r2, r5) and (r3, r5). Note that r1 and r4, denoted by solid nodes in the graph, are in UTRS, the set of rules initially triggered by a user transaction. Assuming that all six rules are executed, the complete RESs that can be generated are exactly the possible merged sequences of the two partial RESs (r1 • r2 • r3) and (r4 • r5 • r6). All the possible merged (now complete) RESs can then be partitioned into up to four groups by the relative orders between r2 and r5 and between r3 and r5, as follows: (1) (r2 → r5) (r3 → r5), (2) (r2 → r5) (r3 ← r5), (3) (r2 ← r5) (r3 ← r5), and (4) (r2 ← r5) (r3 → r5). However, since there exists an inherent order between r2 and r3, i.e., (r2 → r3), dictated by a trigger relationship, no merged RES can contain combination (4), because a cycle would be formed. Combination (4) is therefore dropped from consideration. Since the cumulative effect of all the other rules is the same regardless of their execution order, the three remaining combinations are the only factor that can make a difference in the final database state. Therefore, in this example, up to three distinct final database states can be produced by all possible complete RESs.
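The grouping argument of Example 2.4.1 can be checked by brute force. The following is a minimal sketch, not part of the original work: the helper names `merges` and `relative_orders` are hypothetical, and the conflicting pairs are assumed to be (r2, r5) and (r3, r5). It enumerates every merge of the two partial RESs and groups the merges by the relative order of each conflicting pair.

```python
from itertools import combinations

def merges(a, b):
    """Yield every interleaving of sequences a and b that keeps each one's internal order."""
    n = len(a) + len(b)
    for positions in combinations(range(n), len(a)):
        slots = set(positions)          # positions taken by elements of a
        ia, ib = iter(a), iter(b)
        yield tuple(next(ia) if i in slots else next(ib) for i in range(n))

def relative_orders(res, conflicts):
    """For each conflicting pair (u, v), record whether u executes before v in res."""
    index = {rule: i for i, rule in enumerate(res)}
    return tuple(index[u] < index[v] for u, v in conflicts)

partial_a = ("r1", "r2", "r3")
partial_b = ("r4", "r5", "r6")
conflicts = [("r2", "r5"), ("r3", "r5")]   # assumed conflicting pairs

all_merges = list(merges(partial_a, partial_b))
groups = {}
for res in all_merges:
    groups.setdefault(relative_orders(res, conflicts), []).append(res)

print(len(all_merges), "merged RESs fall into", len(groups), "groups")
```

Of the 2^2 = 4 conceivable combinations, the one with r5 before r2 and r3 before r5 never occurs, because r2 always precedes r3 in the first partial RES.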
Figure 2.4. Priority graphs for Figure 1.4: panels (a), (b), and (c)

Using the three possible orderings of conflicting rules in Example 2.4.1, we can assign directions to the dependency edges in the graph of Figure 2.3. The resulting graphs, which we call priority graphs, are shown in Figure 2.4. These priority graphs show how priorities can be specified over conflicting rules in order to make rule executions confluent. Importantly, they also represent the partial orders that the rule scheduler needs to follow as it schedules rule executions. As we shall see in the following section, the rule scheduler basically uses a topological sort algorithm working on a subgraph of the priority graph, which requires the priority graph to be acyclic.

Example 2.4.2. All possible topological sorts on the priority graph of Figure 2.4(a) correspond to an equivalence class represented by a canonical RES, σ1 = (r1 • r2 • r4 • r5 • r3 • r6) (for clarity, we hereafter use a distinguishing mark to denote conflicting rules). Note that a RES, (r1 • r4 • r2 • r5 • [illegible] • [illegible]), is equivalently converted to
by merging TG and DG, where R is a node set defined as before and E_p is a priority edge set. For any two distinct nodes u, v ∈ R, (u → v) ∈ E_p if and only if (u → v) ∈ E_t, and either (u ⇢ v) ∈ E_p or (v ⇢ u) ∈ E_p if and only if (u, v) ∈ E_d. The edge (u ⇢ v) (or (v ⇢ u)) is called a directed dependency edge to distinguish it from the undirected ones in E_d, and the direction of the edge is given by the user.
Figure 2.5. A pair of trigger graph and dependency graph: panels (a) and (b)

Figure 2.6. A priority graph

r ∈ R' ∨ (∃r' (r' ∈ R_e ∧ (r' → r) ∈ E_p))}. For any two distinct nodes u, v ∈ R_e, (u → v) ∈ E_e if (u → v) ∈ E_p, and (u ⇢ v) ∈ E_e if (u ⇢ v) ∈ E_p.

Simply stated, the node set R_e consists of a UTRS plus those rules that are reachable from rules in the UTRS through trigger paths in PG. The edge set E_e is the set of trigger and directed dependency edges that connect nodes in R_e.

Example 2.4.4. Figure 2.7 shows an execution graph derived from the priority graph of Figure 2.6 when a UTRS has rules ra, rs, and rio. A rule schedule can be obtained by performing the topological sort on the execution graph. The canonical RES for the
equivalence class represented by the execution graph is (ra • re • rr • rs • rg • rg • rio • rn). Note that the priority graphs shown in Figure 2.4 are, in fact, execution graphs as well.

Figure 2.7. An execution graph

Lemma 2. Given an execution graph EG = (R_e, E_e), the set of rule execution sequences corresponding to all feasible topological sorts on EG constitutes an equivalence class, independent of the initial database state.

Proof of Lemma 2. Since EG is acyclic and all pairs of conflicting rules in EG are ordered (i.e., have an edge between them), all topological sorts on EG have the same relative orders of conflicting rules. Then, by Lemma 1, the RESs represented by the topological sorts are equivalent to each other. Also, since Lemma 1 holds without any premise on initial database states, Lemma 2 also holds regardless of initial database states.

The power of choice is now given to the users. There may be one system-wide priority graph for all rules defined in the system, in which case all applications are governed by a single type of confluent rule execution. More preferably, however, each user (or each application) may have a separate priority graph tailored for specific needs. Also, given a conflicting rule set, the user may choose to specify priorities over only a part of the conflicting rules and leave the rest unspecified. The rule scheduler will ignore unspecified (thus undirected) dependency edges in a priority graph. Taking this approach, our rule execution model can readily meet various user requirements with respect to confluent rule executions.
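The derivation of an execution graph from a priority graph and a UTRS, followed by a topological sort, can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the scheduler of the original work: the function names are hypothetical, rules are identified by their integer subscripts, undirected dependency edges are simply left out of the input, and the canonical RES is approximated by always picking the ready rule with the smallest subscript. The sample data assumes the priority graph of Figure 2.4(a) has trigger edges r1 → r2 → r3 and r4 → r5 → r6 and directed dependency edges (r2 ⇢ r5) and (r5 ⇢ r3).

```python
import heapq

def execution_graph(trigger, directed_dep, utrs):
    """Keep the UTRS rules plus everything reachable from them through trigger edges."""
    nodes, stack = set(), list(utrs)
    while stack:
        rule = stack.pop()
        if rule not in nodes:
            nodes.add(rule)
            stack.extend(trigger.get(rule, ()))
    edges = {(u, v) for u, targets in trigger.items() for v in targets
             if u in nodes and v in nodes}
    edges |= {(u, v) for (u, v) in directed_dep if u in nodes and v in nodes}
    return nodes, edges

def canonical_res(nodes, edges):
    """Topological sort that always picks the smallest-subscript rule with indegree 0."""
    indegree = {n: 0 for n in nodes}
    successors = {n: [] for n in nodes}
    for u, v in edges:
        indegree[v] += 1
        successors[u].append(v)
    ready = [n for n in nodes if indegree[n] == 0]
    heapq.heapify(ready)
    order = []
    while ready:
        u = heapq.heappop(ready)
        order.append(u)
        for v in successors[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                heapq.heappush(ready, v)
    if len(order) != len(nodes):
        raise ValueError("the graph contains a cycle")
    return order

trigger = {1: [2], 2: [3], 4: [5], 5: [6]}   # assumed trigger edges
directed_dep = {(2, 5), (5, 3)}              # assumed user-specified directions
nodes, edges = execution_graph(trigger, directed_dep, utrs=[1, 4])
print(canonical_res(nodes, edges))
```

Under these assumptions the result agrees with the canonical RES (r1 • r2 • r4 • r5 • r3 • r6) given in Example 2.4.2.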
CHAPTER 3
IMPLEMENTATION OF CONFLUENT RULE SCHEDULER

3.1 Strict Order-Preserving Rule Execution Model

In the previous chapter, only simple cases were considered. In particular, the structures of trigger graphs were trees, all rules in a UTRS were distinct, and no trigger paths existed between these rules. As a result, no trigger paths in the execution graph overlapped with one another.

When rules in a UTRS have overlapping trigger paths, the priority graph and execution graph defined in the previous chapter do not capture the semantics. For example, consider Figure 3.1(a), which is the same as Figure 2.2(a). When r_i and r_j are triggered by a transaction, both rules instantiate their own trigger paths, and these trigger paths overlap with each other. (The overlapping of trigger paths is not directly visible in Figure 3.1(a).) If we think of the graph as an execution graph, it yields two partial RESs: (r_i • r_j • r_k • r_l) and (r_j • r_k • r_l). Therefore, a rule schedule (alternatively, a complete RES) should be a merged RES of the two partial RESs. A possible merged RES is (r_i • r_j • r_j • r_k • r_l • r_k • r_l).

The issues to be addressed in this chapter are (i) obtaining rule schedules from an execution graph where trigger paths overlap, (ii) assuring the confluence property when rules are executed in accordance with any such rule schedule, and (iii) parallel rule scheduling that takes advantage of the availability of multiple confluent rule schedules.

3.1.1 Extended Execution Graph

In order to understand the effect of overlapping trigger paths, we introduce an extended execution graph, used only for illustration purposes. Recall that given a system rule set along with its trigger graph, each rule in a UTRS instantiates a subgraph of the trigger graph whose sole root is that rule. However, it is possible to derive an execution graph, from a given priority graph and a UTRS, such that every rule in the UTRS becomes a root node in the resulting extended execution graph.

Figure 3.1. Overlapping trigger paths and extended execution graph: panels (a) and (b)

Example 3.1.1. Figure 3.1(b) shows the extended execution graph of Figure 3.1(a). In the extended execution graph, r_j and r_j' (and similarly for the other rules) are the same rule and only represent different instantiations. Since there is a dependency between rules r_k and r_l, this dependency may well be present between all instantiations of r_k and r_l, as shown in Figure 3.1(b). The directions of dependency edges in an extended execution graph might be either inferred from the priority graph or specified by the user. Figure 3.2 shows three different acyclic orderings of the relevant dependency edges of Figure 3.1(b). Once an acyclic extended execution graph is given, the rule scheduler can schedule rule executions using the topological sort. All possible topological sorts constitute one equivalence class.

The extended execution graph, however, cannot be used for priority specification; it is created only at rule execution time, and priorities need to be specified before that. Thus we need to find alternative ways to interpret a priority graph.
Figure 3.2. Three different orderings of dependency edges in Figure 3.1(b): panels (a), (b), and (c)

Figure 3.3. Extended execution graph in strict order-preserving executions: panels (a) and (b)

3.1.2 Strict Order-Preserving Executions

One way to derive an extended execution graph is to faithfully follow what the user specifies in the priority graph, i.e., the priorities between conflicting rules. In strict order-preserving executions, if rule r_i has precedence over rule r_j in a priority graph, all instances of r_i precede all instances of r_j in the resulting rule schedules. Given a priority graph, an extended execution graph is obtained by simply adding the directed dependency edges of the priority graph to the duplicated overlapping trigger paths. This scheme provides a simple solution for overlapping trigger paths, regardless of the number of times the trigger paths overlap.
Example 3.1.2. Figure 3.3(a) shows a priority graph where rules r1, r2, and r3 are in UTRS and their overlapping trigger paths are denoted by the partial RESs (r1 • r2 • r4 • r5), (r2 • r4 • r5), and (r3 • r4 • r5). Figure 3.3(b) illustrates how an extended execution graph is built using strict order-preserving executions. First, overlapping trigger paths are separated. Second, any dependency edge in the priority graph that connects a rule in an overlapping trigger path and a rule in any trigger path is also introduced in the extended execution graph. For example, (r2 r^) and (r'^ r'^). Given the extended execution graph, the rule scheduler can schedule rule executions by performing the topological sort. All feasible topological sorts constitute one equivalence class, since in all of them the executions of conflicting rules are ordered in the same way. The canonical RES for the extended execution graph of Figure 3.3(b) is (r1 • r2 • r2' • r3 • r4 • r4' • r4'' • r5 • r5' • r5''). (Note that r_i, r_i', and r_i'' are the same rule.)

3.1.3 Implementation

In order to implement a rule scheduler conforming to the strict order-preserving execution, one could directly use extended execution graphs such as Figure 3.3(b). However, there is a simpler way to derive an execution graph without having to duplicate every overlapping trigger path. Consider the extended execution graph of Figure 3.3(b). In the strict order-preserving execution, the directions of all dependency edges incoming to and outgoing from the overlapping trigger paths are the same, as shown in the graph. Therefore, it is unnecessary to duplicate overlapping trigger paths. Instead, we add a rule_count to each node of a plain execution graph. The rule_count attached to a node indicates how many rules in UTRS share the trigger path to which the node (i.e., rule) belongs. Figure 3.4 depicts how the plain execution graph of Figure 3.3(a) is extended using rule_counts. In the new graph, m(i) indicates that the trigger path is shared by i instances of the associated rule.
Figure 3.4. Extended execution graph with rule_counts

Now the new extended execution graph can be used with a minor modification to the topological sort. Whenever a rule is scheduled, its rule_count in the execution graph is decreased by 1. If the rule_count reaches 0, the node and its outgoing edges are removed from the execution graph.

Figure 3.5 describes an algorithm, Build_EG(), for building an execution graph from a given priority graph (PG) and a UTRS. It uses M[], an array whose size is that of the system rule set R, to hold the rule_count values of rules in EG. If rule r_i is (to be) in EG, M[i] represents the rule_count attached to r_i. Build_EG() calls a subroutine, DFS_Copy_Tree(), for every instance in UTRS. Remember that UTRS is a multiset: it can have multiple instances of the same rule due to multiple triggers. DFS_Copy_Tree() traverses the PG in a depth-first search fashion and copies the portion of the PG that is reachable through trigger edges in PG. It also increases the rule_count of each node that it visits. The execution graph of Figure 3.4 is obtained by applying Build_EG() to the priority graph of Figure 3.3(a) with UTRS = {r1, r2, r3}.

Once an execution graph is built by Build_EG(), the rule scheduler can schedule rule executions. However, it is possible that some rules in a trigger path in an execution graph are not triggered at all by their parent rule at run time. In such a case, other rules in other trigger paths that have an incoming dependency edge from that rule should not wait. Otherwise the rule scheduler will get stuck. Furthermore,
Given PG and UTRS:

Build_EG() {
    EG = ∅;
    initialize array M[] to 0's;          // M[] is the rule_count array
    for every r_i ∈ UTRS do
        call DFS_Copy_Tree(r_i);
}

DFS_Copy_Tree(r_i) {
    if (r_i ∉ EG) then copy r_i into EG;
    M[i]++;                               // increase rule_count of node r_i
    // copy trigger edges
    for all r_j such that ∃(r_i → r_j) ∈ PG do {
        call DFS_Copy_Tree(r_j);
        if ((r_i → r_j) ∉ EG) then copy (r_i → r_j) into EG;
    }
    // copy dependency edges
    for all r_j such that ∃(r_i ⇢ r_j) ∈ PG do {
        if (r_j ∈ EG and (r_i ⇢ r_j) ∉ EG) then copy (r_i ⇢ r_j) into EG;
    }
}

Figure 3.5. Algorithm Build_EG()
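The algorithm of Figure 3.5 can be sketched in Python as follows. This is a hedged illustration rather than the original implementation: the names `build_eg` and `dfs_copy_tree`, the dictionary and set representation, and the edge labels "trigger" and "dep" are all assumptions. One deliberate deviation is that dependency edges are copied in a single pass after all nodes have been copied, so that an edge is not missed when its target is copied later than its source. The sample trigger edges r1 → r2 → r4 → r5 and r3 → r4 are inferred from the partial RESs of Example 3.1.2.

```python
from collections import Counter

def build_eg(trigger, directed_dep, utrs):
    """Copy the part of the priority graph reachable from the UTRS and count shared paths.

    trigger:      dict mapping a rule to the rules it can trigger
    directed_dep: set of (u, v) directed dependency edges
    utrs:         list of initially triggered rules (a multiset, so repeats are allowed)
    """
    nodes, edges, counts = set(), set(), Counter()

    def dfs_copy_tree(rule):
        nodes.add(rule)
        counts[rule] += 1                      # one more instance shares this node
        for child in trigger.get(rule, ()):
            dfs_copy_tree(child)
            edges.add((rule, child, "trigger"))

    for rule in utrs:
        dfs_copy_tree(rule)

    # Deviation from Figure 3.5: dependency edges are added after all nodes are known.
    for u, v in directed_dep:
        if u in nodes and v in nodes:
            edges.add((u, v, "dep"))
    return nodes, edges, dict(counts)

trigger = {1: [2], 2: [4], 3: [4], 4: [5]}     # assumed trigger edges
nodes, edges, counts = build_eg(trigger, set(), utrs=[1, 2, 3])
print(counts)
```

With UTRS = {r1, r2, r3}, the computed rule_count of r2 is 2 and that of r4 and r5 is 3, which is consistent with the m(i) labels described for Figure 3.4.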
if a rule is not triggered, none of its descendant rules in the trigger path will be triggered either. This consideration should be applied recursively down the trigger paths.

Figure 3.6 describes the rule scheduling algorithm, Schedule(), which is a modified version of the topological sort. Given an execution graph EG, it arbitrarily selects a node (i.e., rule) with indegree 0. Since EG is acyclic, there should always be at least one node with indegree 0. After executing the selected rule, the scheduler decreases the rule_count of the node by 1, and if the rule_count reaches 0, the node is removed along with any trigger and dependency edges outgoing from the node. However, before the removal, it checks whether the executed rule has triggered its child rules in the trigger paths. If there are child rules that were not triggered, Schedule() calls a subroutine, DFS_Dec_M(), for those child rules. DFS_Dec_M() traverses down EG in a depth-first search fashion and decreases the rule_count of each node it visits by 1. If the rule_count of a node becomes 0, it removes the node and all its outgoing edges.

Theorem 1. The strict order-preserving rule execution model guarantees confluent rule executions.

Proof of Theorem 1. Based on Lemma 2, algorithms Build_EG() and Schedule() together serve as a constructive proof of the theorem, since by these algorithms overlapping trigger paths are separated, effectively making them ordinary acyclic graphs, and the topological sort is performed on the graphs.

3.1.4 Parallel Rule Executions

The execution graph naturally allows parallel execution of rules. In the extended form, such as Figure 3.3(b), all rules with indegree 0 can be launched in parallel for execution. Since there should be no dependency edges between nodes with indegree 0 in an execution graph (note that an execution graph shrinks as rules are executed), the relative execution order of those independent rules does not
Given EG:

Schedule() {
    while (EG ≠ ∅) do {
        choose a node r_i with indegree 0;
        execute r_i;
        M[i]--;                           // decrease rule_count of node r_i
        for all r_j such that ∃(r_i → r_j) ∈ EG do
            if (r_j was not triggered by execution of r_i) then
                call DFS_Dec_M(r_j);
        if (M[i] = 0) then
            delete node r_i and edges (r_i ⇢ r_k) and (r_i → r_l), for any k and l, from EG;
    }
}

DFS_Dec_M(r_j) {
    M[j]--;
    for all r_k such that ∃(r_j → r_k) do
        call DFS_Dec_M(r_k);
    if (M[j] = 0) then
        delete node r_j and edges (r_j ⇢ r_l) and (r_j → r_m), for any l and m, from EG;
    // don't need to delete dependency edges incoming to r_j!
}

Figure 3.6. Algorithm Schedule()
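The scheduling algorithm of Figure 3.6 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the original scheduler: the function names, the `(source, target, kind)` edge triples, and the `execute` callback (which is assumed to run a rule and return the set of child rules it actually triggered) are all hypothetical. Where Figure 3.6 chooses a ready node arbitrarily, the sketch picks the smallest subscript so that the output is deterministic.

```python
def schedule(nodes, edges, counts, execute):
    """Run rules in a topological order, honoring rule_counts and untriggered children."""
    nodes, edges, counts = set(nodes), set(edges), dict(counts)
    order = []

    def children(rule):
        # child rules reached from `rule` through trigger edges still in the graph
        return [v for (u, v, kind) in edges
                if u == rule and kind == "trigger" and v in nodes]

    def drop_if_done(rule):
        # remove the node and its outgoing edges once its rule_count reaches 0
        if counts[rule] == 0:
            nodes.discard(rule)
            for edge in [e for e in edges if e[0] == rule]:
                edges.discard(edge)

    def dfs_dec_m(rule):
        counts[rule] -= 1
        for child in children(rule):
            dfs_dec_m(child)
        drop_if_done(rule)

    while nodes:
        ready = sorted(n for n in nodes
                       if not any(v == n and u in nodes for (u, v, _) in edges))
        if not ready:
            raise ValueError("execution graph contains a cycle")
        rule = ready[0]
        triggered = execute(rule)
        order.append(rule)
        counts[rule] -= 1
        for child in children(rule):
            if child not in triggered:
                dfs_dec_m(child)          # this instance's trigger path is cut off here
        drop_if_done(rule)
    return order

# Assumed sample graph: trigger edges r1 -> r2 -> r4 -> r5 and r3 -> r4, no dependency edges.
sample_edges = {(1, 2, "trigger"), (2, 4, "trigger"), (3, 4, "trigger"), (4, 5, "trigger")}
sample_counts = {1: 1, 2: 2, 3: 1, 4: 3, 5: 3}

def triggers_everything(rule):
    return {v for (u, v, _) in sample_edges if u == rule}

def r2_never_triggers(rule):
    return set() if rule == 2 else triggers_everything(rule)

full_run = schedule(sample_counts.keys(), sample_edges, sample_counts, triggers_everything)
cut_run = schedule(sample_counts.keys(), sample_edges, sample_counts, r2_never_triggers)
print(full_run)
print(cut_run)
```

In the second run, each execution of r2 fails to trigger r4, so the rule_counts of r4 and r5 are reduced by DFS_Dec_M and those rules execute only once, on behalf of r3.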
affect the final outcome. Note also that multiple instances of the same rule can be scheduled for execution at the same time. In an execution graph with rule_counts, which we use for our work, all rules with indegree 0 are scheduled simultaneously as many times as the rule_counts associated with the rules. In Figure 3.4, for instance, after scheduling and executing one instance each of r1 and r3 in parallel, two instances of r2 can be scheduled for execution, since the rule_count of r2 is 2.

In order to implement parallel rule executions, we have to make some changes to the algorithm of Figure 3.6, which we will not elaborate in this work. Whenever execution of one instance of a rule is completed, the associated rule_count needs to be decreased by 1, and its child rules have to be checked to see whether or not they were triggered by the parent rule. If some were not triggered, DFS_Dec_M() should be called to recursively decrease rule_counts along the trigger paths. Since the rule_count array M[] and the execution graph are shared data structures, some locking mechanism needs to be used to avoid update anomalies within the data structures.

One important measure in parallel processing is the degree of parallelism. In the active rule system, the maximum parallelism is bounded by the dependencies between rules in the system rule set. For instance, if all the rules are independent of each other, ideally all triggered rules can be executed in parallel. As dependencies between rules increase, the degree of parallelism decreases. However, other components can restrict parallelism too. As discussed in Section 2.2, an improper priority specification and rule execution model may serially execute a rule set that could be executed in parallel. Specifically in our work, two components can hamper parallelism, both resulting from static analysis. First, the precision of dependency analysis between two rules can affect parallelism. Even though there is a data dependency between two rules, they can be commutative in reality. Being unable to detect such hidden commutativity results in a false dependency edge in an execution graph, likely costing
parallelism. Second, the precision of trigger relationship analysis can similarly affect parallelism. If we know for sure that one rule triggers another rule, the trigger edge between the two rules can be deleted after all rule_count values are computed. This way, the two rules can be scheduled in parallel if there is no other path connecting them in the resultant execution graph. Using static analysis, we cannot completely avoid uncertain trigger edges, and the presence of uncertain trigger edges can cost parallelism. However, ignoring the loss caused by the imprecision of static analysis, the strict order-preserving rule executions exploit the maximum parallelism existing in a given system rule set. We state this in Theorem 2.

Theorem 2. Using the strict order-preserving rule executions, the active rule execution model achieves the maximum parallelism within the limitations of static trigger and dependency analysis.

Proof of Theorem 2. Given any acyclic extended execution graph, there are two kinds of edges: trigger edges and dependency edges. We first assume that no trigger edges can be removed, that is, they are all uncertain trigger edges. Now suppose there are superfluous dependency edges in the execution graph whose absence does not affect the final database state. (Therefore we could remove them safely to increase parallelism.) There can be only two types of dependency edges in an acyclic execution graph. Given any dependency edge (r_i ⇢ r_j), r_i is either a proper ancestor of r_j in a trigger path containing both r_i and r_j, or not an ancestor of r_j in any trigger path. In the first case, even though the dependency edge is redundant in terms of representation, removal of that edge does not allow r_j to execute before r_i, since r_j is yet to be triggered by r_i or its descendant. Thus, removal of the first type of dependency edge has no effect of increasing parallelism. In the second case, if (r_i ⇢ r_j) is the only dependency path, made up of trigger edges and dependency edges, that connects r_i to r_j, obviously it cannot be removed at any rate. If
there exist other dependency paths connecting r_i to r_j that are longer than (r_i ⇢ r_j) (of course, if such paths exist, they must be longer than the one-edge path (r_i ⇢ r_j)), then (r_i ⇢ r_j) is redundant, but again, removal of the dependency edge does not allow r_j to execute before r_i. By applying this argument to all dependency edges present in the extended execution graph, we can see that dependency edges are either necessary or redundant, but removal of redundant edges does not increase parallelism. Since the execution graph with rule_counts is equivalent to the extended execution graph under the strict order-preserving rule executions, we can conclude that the rule scheduler exploits the maximum parallelism inherent in the system rule set.

3.2 Alternative Policies for Handling Overlapping Trigger Paths

3.2.1 Serial Trigger-Path Executions

In order to handle overlapping trigger paths, there may be approaches other than the strict order-preserving rule executions. One obvious approach among them is serial trigger-path execution, in which all rules in an overlapping trigger path execute before any rules in other overlapping trigger paths. In other words, when trigger paths overlap, rules in those overlapping trigger paths execute in a trigger-path by trigger-path fashion. For instance, in Figure 3.1, (r_i • r_j • r_k • r_l • r_j • r_k • r_l) is a serial trigger-path execution.

Although the serial trigger-path execution could appear more intuitive, unfortunately it brings the old problem back. Different serial trigger-path executions may result in different final database states when there exist dependencies between the overlapping trigger paths. Therefore the user has to choose one serial order over the conflicting overlapping trigger paths to obtain a unique final database state. However, choosing a serial order in advance (i.e., at compile time) is not always possible, since, as discussed in Section 2.2, multiple instances of the same rule may be in UTRS and one cannot predict them before the rules execute. To reiterate
that, let us assume that A, B, and C are three overlapping trigger paths rooted at rules a, b, and c, respectively, and that the trigger paths all conflict with each other. Then, if one instance each of a, b, and c is in UTRS, any of the six permutations (3!) of A, B, and C (i.e., serial trigger-path executions) may produce a different result. However, if rule a is triggered twice while the other rules remain as before, four trigger paths A, A', B, C will be instantiated and the number of permutations increases to 24 (4!). As more instances of a, b, or c are added to UTRS, the number of permutations grows factorially.

From the observation above, it is apparent that we again have to settle for a scheme, discussed below, that is less general than the full-fledged serial trigger-path execution.

Figure 3.7. A priority graph with absolute priorities

Figure 3.7 shows a priority graph that is slightly different from Figure 3.3(a). The new graph has dependency edges between r1 and r4, and between rs and [illegible] as before. Also, it uses absolute priorities (numbers in brackets) attached to rules rather than directed dependency edges. Using this priority graph, the rule scheduler will be able to schedule a serial trigger-path execution by executing the rules in the trigger path whose root rule's priority is the highest. For instance, if rules r1, r2, and r3 are each triggered once by a user transaction, given the priorities, the serial trigger-path execution being scheduled will be (r1 • r2 • r4 • r5 • r2 • r4 • r5 • r3 • r4 • r5). Now if r2 is triggered twice and r1 and r3 once, then since the priority of r2's instances is
lower than that of r1 and higher than that of r3, the resultant serial trigger-path execution will be (r1 • r2 • r4 • r5 • r2 • r4 • r5 • r2 • r4 • r5 • r3 • r4 • r5).

Although we will not elaborate in this work, it will be worthwhile to investigate more efficient and convenient ways of specifying serial execution orders if the serial trigger-path execution is to be pursued for implementation. Using absolute priorities could be a hassle, as pointed out in Section 2.2. On the other hand, sacrificing generality in serial order specification might do more good than harm. For example, it can be assumed that in a trigger path, ancestors have priority over their descendants when they are in UTRS (or vice versa). If such an assumption is made, the explicit priority specification between r1 and r2 in Figure 3.7 can be omitted. It should be noted that in order to schedule serial trigger-path executions, the rule scheduler described in the previous section needs to be modified appropriately to deal with the added absolute priorities.

3.2.2 Serializable Trigger-Path Executions

A more interesting scheduling than the serial trigger-path execution would be serializable trigger-path execution, in which the rule execution sequence may not be a serial trigger-path execution but the final result is equivalent to that of a serial trigger-path execution. In Figure 3.1, for example, one can see that the rule execution sequence (r_i • r_j • r_j • r_k • r_l • r_k • r_l) is equivalent to the serial trigger-path execution (r_i • r_j • r_k • r_l • r_j • r_k • r_l). Therefore, the former is a serializable trigger-path execution.

Once a serial execution order of overlapping trigger paths is set, it can be readily translated to a serializable trigger-path execution using the extended execution graph. Figure 3.8 illustrates the first step in translating the serial trigger-path execution shown in Figure 3.7, assuming r2 is triggered twice, to an equivalent serializable trigger-path execution. In Figure 3.8, dashed lines, presently undirected, represent dependencies between rules (only inter-trigger-path dependencies are shown), and directed dotted
lines connecting the last rule in a trigger path to the first rule in the next trigger path (between r5 and r2, for example) impose the serial order specified in Figure 3.7. Ignoring the undirected dependency edges and regarding the newly added dotted arrows as the only dependency edges, our rule scheduler will schedule the original serial trigger-path execution. Then, as the second step of translation, the directions of dependency edges are set by following the trigger paths bridged by the dotted arrows so that the directions are consistent with the serial order. After that, the dotted arrows are deleted. Figure 3.9 shows the result of the second step of translation. Now if the translated execution graph is fed to the rule scheduler, an equivalent serializable trigger-path execution will be generated. Note that there may be multiple equivalent serializable trigger-path executions, and they are the ones that enable parallel rule executions. The maximum parallelism of Theorem 2 still holds for the translated execution graph.

Figure 3.8. Translation to serializable trigger-path execution: Step 1

3.2.3 Comparisons with Strict Order-Preserving Execution

Comparing the serializable (or serial) trigger-path execution with the strict order-preserving execution raises an interesting point. Although choosing between them remains a largely subjective matter, enterprise policies and applications' semantics could favor one over the other. Suppose rule r_i triggers rule r_j and there is a dependency between them. When multiple instances of r_i are triggered
by a user transaction, in the traditional transaction processing environment the serializable trigger-path execution would be favored since, regarding a trigger path as a subtransaction, it appears to fit well into what already exists in that environment. On the other hand, when r_i stands for an absolute raise to an employee's salary (say, $1000) and r_j for a relative raise (say, 5%), the strict order-preserving execution would make more sense if the company has a policy that applies all applicable absolute raises before relative raises. Considering that serializability is often used as a correctness criterion for executions of subtransactions in the nested transaction model, which, in turn, has been proposed for rule executions in some active databases [6, 12], this example illustrates that serializability may not be an appropriate choice in some cases, even if it is strengthened to yield confluent rule executions.

Figure 3.9. Translation to serializable trigger-path execution: Step 2

As for implementation, the serializable trigger-path execution is less favorable in our current framework. Compared to the strict order-preserving execution, it requires a different priority specification scheme than the simple priority graph. However, once transformed properly to the form of Figure 3.9, the serializable trigger-path execution is on a par with the strict order-preserving execution in terms of exploiting maximum parallelism in rule executions.
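The salary example above can be made concrete with a little arithmetic. The sketch below is a hypothetical illustration, not taken from the original work: the starting salary of $50,000 is an assumed figure, and two instances of r_i (a $1000 absolute raise), each triggering r_j (a 5% relative raise), are assumed to be in UTRS. It compares the strict order-preserving sequence (r_i • r_i • r_j • r_j) with the serial trigger-path sequence (r_i • r_j • r_i • r_j), using exact fractions to avoid rounding effects.

```python
from fractions import Fraction

def absolute_raise(salary):
    """Stands for r_i: add a fixed $1000."""
    return salary + 1000

def relative_raise(salary):
    """Stands for r_j: raise by 5%."""
    return salary * Fraction(105, 100)

def run(sequence, salary):
    """Apply the rule actions in the given order and return the final salary."""
    for rule in sequence:
        salary = rule(salary)
    return salary

start = Fraction(50000)  # assumed starting salary

strict = run([absolute_raise, absolute_raise, relative_raise, relative_raise], start)
serial = run([absolute_raise, relative_raise, absolute_raise, relative_raise], start)

print("strict order-preserving:", float(strict))
print("serial trigger-path:    ", float(serial))
```

Both orders are internally consistent, yet they produce different final salaries, so the choice between the two execution policies is a matter of which policy the application intends.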
3.3 Discussion and Conclusions

In this work we have proposed a new active-rule execution model, along with priority specification schemes, to achieve confluent rule executions in active databases. Like other rule execution models, we employ prioritization to resolve conflicts between rules. Ideally, by prioritizing executions of conflicting rules whose different relative execution orders can yield different database states, one can achieve confluent rule executions. It is necessary, however, to prioritize as few rules as possible, since prioritizing many rules in a meaningful way can itself be a challenge, and the more rules are prioritized, the fewer rules can execute in parallel. Prioritizing conflicting rules only, on the other hand, may produce incorrect results, since executions of seemingly independent rules can trigger and execute conflicting rules in the wrong order. Also, when rules in the same trigger path are triggered, resulting in overlapping trigger paths, further problems can arise.

Unlike previous rule execution models, our model uses a rule scheduler based on the topological sort to respect all the specified priorities where applicable. This way, rules triggered and executed from a user transaction can follow the execution sequence imposed by the priority specification, which makes their execution confluent. We have also proposed the strict order-preserving rule execution to deal with overlapping trigger paths. In the strict order-preserving rule execution, when part or all of a trigger path is executed multiple times and there are priorities between rules in the trigger path, the rules are executed in such a way that no rule executes before rules with higher priorities, if any. It has been shown that our rule execution model can exploit maximum parallelism in rule execution. Lastly, we have discussed other possible ways of handling overlapping trigger paths, namely the serial trigger-path execution and the serializable trigger-path execution.
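The scheduling idea summarized above, ordering triggered rules by a topological sort of the priority graph, can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `schedule_levels`, the rule names, and the representation of priorities as `(higher, lower)` pairs are hypothetical and are not taken from the scheduler described in this thesis. Each returned level holds rules with no unexecuted higher-priority predecessor, so the rules within one level could in principle run in parallel.

```python
from collections import defaultdict


def schedule_levels(rules, priorities):
    """Group triggered rules into levels using Kahn's topological sort.

    rules: iterable of rule names (hypothetical identifiers).
    priorities: iterable of (higher, lower) pairs, meaning that
    `higher` must execute before `lower`.
    """
    indegree = {r: 0 for r in rules}
    successors = defaultdict(list)
    for higher, lower in priorities:
        successors[higher].append(lower)
        indegree[lower] += 1

    # Rules with no pending predecessor form the first level.
    level = sorted(r for r, d in indegree.items() if d == 0)
    levels = []
    scheduled = 0
    while level:
        levels.append(level)
        scheduled += len(level)
        next_level = []
        for r in level:
            for s in successors[r]:
                indegree[s] -= 1
                if indegree[s] == 0:
                    next_level.append(s)
        level = sorted(next_level)

    # Any rule left unscheduled sits on a cycle of priorities.
    if scheduled != len(indegree):
        raise ValueError("priority graph contains a cycle")
    return levels
```

For instance, with priorities `r1 -> r3`, `r2 -> r3`, and `r3 -> r4`, the sketch places `r1` and `r2` in the same level, followed by `r3` and then `r4`.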
There are other related issues not pursued in our current work. One such issue is the precision of data dependency (or dependency in general). Our definition in
Section 2.3.2 may be too coarse, as some rules might be commutative despite the presence of the defined dependencies. If an active rule language has a definite form, as SQL does, the definition of dependency and its detection may be tightened by analyzing static expressions in rule definitions, as done by Baralis and Widom [7]. In Sentinel, a rule has a very general form, since the condition and action parts can be arbitrary functions. In such a system, even detecting dependency with a margin of imprecision can be challenging. However, we know empirically that, in general, only a few stored data items are referenced in those functions. The rule designer should be able to readily recognize the referenced data items and classify them into a readset and a writeset. Once the readset and writeset are obtained, the dependency graph can be drawn as usual.

The problem of confluent executions can be further complicated in active databases such as Sentinel [13] and HiPAC [12], where coupling modes between event and condition and between condition and action are defined. In Sentinel and HiPAC, three coupling modes are defined: immediate, deferred, and detached. For the coupling between event and condition, the immediate mode prescribes that the condition be tested immediately after a relevant event is detected. In deferred mode, the condition is tested after all user-defined operations (except commit/abort) in the current transaction are performed. In detached mode, the condition is tested in a different transaction. The semantics of these modes holds for the coupling between condition and action as well. The work described in this thesis assumes the immediate mode both between event and condition and between condition and action.
However, considering that in the deferred mode, tests (or actions) are carried out in the same order as the occurrence order of the events that triggered the rules, our model also works well for deferred-deferred couplings between event and condition and between condition and action.
Interestingly, in a situation where the event-condition coupling is the immediate mode and the condition-action coupling is the deferred mode, our model is still applicable. However, this combination of coupling modes precludes the possibility of one rule untriggering others, since all tests are completed before any action is performed. On the other hand, the detached mode does not suit our model, or other models that deal only with conflicts within a transaction boundary. By detaching the execution of the action part from the current transaction, the execution of even a single rule can result in different final database states, depending on the interaction between the detached transaction and the parent transaction. Those interactions are governed by the transaction model the system employs. Unless a rule execution model is geared to the transaction model, there is no room to control the inter-transaction interactions so as to make rule executions confluent.
CHAPTER 4
AGGREGATE CACHE

4.1 Motivation

Complex aggregate expressions are frequently computed in data warehouses to support decision making and statistical data analysis. These computations are expensive, as they require at least one full scan of usually huge (or even distributed) base databases (or operational databases). Hence, it is vital for data warehouse applications to have a means of reducing the computation overhead. One approach to this problem is to materialize views defined in the data warehouse, assuming that both the underlying base databases and the data warehouse use the relational data model. There already exists a body of literature on view materialization in the context of relational databases or deductive databases [8, 22, 25, 27]. Recently, the same issue has been reexamined by several researchers [23, 24, 40] from the data warehouse viewpoint. However, the bulk of the literature deals with SPJ (Select-Project-Join) views. Quite a few papers mention materialization of aggregates as a means of performance improvement, but no in-depth study has been reported except [32].

Using view materialization, some of the SPJ views defined in the data warehouse schema are chosen to be materialized. Then, when base databases change, materialized views that are affected by the change are updated incrementally if possible; otherwise the affected views are rematerialized. Realistically, however, applying traditional view materialization techniques to aggregates seems inappropriate in the data warehouse environment, for the following two reasons:
1. Unlike views, aggregates are usually not predefined, and it is hard to predict which aggregates on which data items will be computed, since aggregates tend to be used in ad hoc queries.
2. Due to the natural process of data analysis or decision making, many aggregates are unlikely to be used again (or are rarely used) after some period of heavy usage.

Note that the second reason above implies that aggregate usage in data warehouse applications exhibits temporal locality. This is one good reason to introduce a cache mechanism for handling aggregates in data warehouses. In addition, with a cache, an aggregate is cached the first time it is actually used. That is, the user is not required to declare in advance which aggregates will be used. A cache is therefore also a good way to deal with ad hoc queries containing aggregates.

In addition to temporal locality, spatial locality is also observed in aggregate usage in data warehouse applications. That is, when an aggregate is used in a query, other aggregates closely related to it from a data modeling perspective are also expected to be queried sooner or later. For example, if the average salary of all employees in a department has been queried, it is more likely that the average age of the same group of employees will be queried shortly afterwards than the total of orders placed today. Unfortunately, this sort of spatial locality is too loose and abstract to be used to enhance the efficiency of the cache. Therefore we do not consider spatial locality in our work.

In this work we propose an aggregate cache. As Figure 4.1 shows, the aggregate cache can be added to a data warehouse as a plug-in. The aggregate cache uses part of the data warehouse (i.e., disks) as a cache. It interacts with the query processor of the data warehouse manager (therefore, the query processor needs to be modified to some extent) to expedite the computation of aggregates. It also interacts with the integrator, which intercepts change notifications from base databases that are relevant to cached aggregates, and updates the affected cached aggregates appropriately.

Figure 4.1. A Data Warehouse and Aggregate Cache (components shown: Data Warehouse Manager, Aggregate Cache Manager, Integrator, Change Detector & Notifier, DBMS, and Base Databases 1, 2, ..., n)

When an aggregate in a query (of the data warehouse user) is processed, the query processor of the data warehouse manager first consults the aggregate cache manager. If the aggregate is found in the cache, the stored value is used for processing the query. Otherwise, the aggregate needs to be computed from data in the base databases, or from data in the data warehouse if an appropriate (presumably summarized) copy is present there. In either case, the computed aggregate is cached for later use, unless otherwise directed.
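The lookup path just described can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the class name `AggregateCache`, the `compute` callback, and the `no_cache` flag are hypothetical names, and the replacement policy and change handling are deliberately left out.

```python
class AggregateCache:
    """Toy in-memory stand-in for the aggregate cache manager."""

    def __init__(self):
        self._values = {}

    def get(self, key, compute, no_cache=False):
        """Return the aggregate identified by `key`.

        compute: zero-argument callable that computes the aggregate
        from base (or warehouse) data; invoked only on a cache miss.
        no_cache: the user's directive that the result not be cached.
        """
        if key in self._values:
            # Cache hit: the stored value is used directly.
            return self._values[key]
        value = compute()
        if not no_cache:
            # Cache the freshly computed aggregate for later use.
            self._values[key] = value
        return value
```

A query processor in this sketch would call `cache.get("avg(salary)", compute_avg_salary)`, where `compute_avg_salary` is whatever (hypothetical) routine scans the underlying data.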
For the aggregate cache proposed in this work, not all aggregates need to be cached. If the user knows that some aggregates will not be used again, the user should be able to mark such aggregates in queries so that they are not cached. On the other hand, if a certain set of aggregates is known to be used frequently, these aggregates can be precomputed and installed in the cache, and they will not be replaced until this is explicitly requested. Between these two extremes, all ordinary aggregates are cached when they are first referenced in queries and are later replaced according to a cache replacement policy. With this approach, as far as aggregates are concerned, the aggregate cache as a whole behaves as an adaptive schema for a data warehouse that dynamically adapts itself to uncertain and varying user requirements.

4.2 Updating Cached Aggregates

As base databases change, there should be some way of reflecting those changes in the aggregates cached in the data warehouse, if the changes are relevant. Presuming the affected aggregates are known, one way is to invalidate them. This method was addressed by Sellis [32] in the context of maintaining derived data in relational databases. However, in data warehouse environments, where changes to base databases are expected to be frequent (one of the very reasons why the data warehouse was introduced), simply invalidating affected aggregates is not a good approach, as cached aggregates would have little chance of being reused. Therefore, our natural choice is to update affected aggregates as necessary. Again assuming the affected aggregates are known when base databases change, there are several ways of updating them, as outlined below:

Rematerialization: Affected aggregates are recomputed by rescanning base databases.
Periodic Update: Changes to base databases are kept in the respective base databases, and the changes are periodically propagated to the data warehouse. The data warehouse then collects the changes and incrementally updates the affected aggregates.

Eager Update: Every time a change is made to a base database, if the change is relevant to any cached aggregate, it is propagated to the data warehouse so that the data warehouse can incrementally update the affected aggregates.

On-demand Update: Changes to base databases are accumulated either on the side of the base databases or on the side of the data warehouse, forming delta files. When a cached aggregate is accessed by a query, the relevant delta files are collected and the aggregate is updated using them.

The rematerialization approach is the most expensive. In the aggregate cache, rematerializing affected aggregates prior to their usage makes little sense unless the system load is particularly light at that time. This method, however, can be combined with an incremental method to handle the few cases where the incremental method cannot be applied.

A naive implementation of the periodic update could be troublesome. If the update frequency is too low, the values of cached aggregates will be outdated. If the frequency is too high, on the other hand, it will degrade the performance of the system. Rather than being used independently, the periodic update can be combined with the on-demand update, as will be explained later.

The eager update could appear to be the best way to deliver up-to-date values of cached aggregates quickly. However, unless the system's timeliness requirement is
so stringent, this method is considered overkill, since one generally does not know when cached aggregates will be used, or whether they will ever be used again before they are replaced. Furthermore, there is no guarantee that the eager update will always outperform other methods in timeliness, since it could take up too much of the system's resources if base databases change often and there are many cached aggregates to update. In fact, a simulation result reported by Adelberg et al. [1], in which various recomputation methods for maintaining derived data were compared, shows that the eager update is inferior to the on-demand update in terms of query response time. Although the simulation setting does not completely match the situation of aggregate maintenance in a data warehouse environment, we think we can safely extrapolate this result to draw a similar conclusion for the aggregate cache.¹

¹ We expect that the gap between the eager update and the on-demand update will widen in the aggregate cache, since the simulation by Adelberg et al. [1] does not consider the overhead of change propagation between a base database and the data warehouse, which is likely to be substantial. Extensive change propagation will deteriorate the performance of the eager update further. Moreover, we can regard an aggregate as a derived datum with a huge "fan-in," a term used in the simulation to describe the number of source data for a derived datum, since an aggregate needs to be recomputed whenever a row in a referenced relation is modified. According to the simulation result, the gap between the eager update and the on-demand update widens as the fan-in increases.

The on-demand update is considered the best fit for the aggregate cache, since the frequency with which the base databases change is expected to be much higher than the access frequency of aggregates in the data warehouse. As mentioned earlier, the delta files can be kept either in the base databases or in the data warehouse. Keeping delta files in the data warehouse is, in effect, more like the eager update, since whenever a related base database changes, the change needs to be transferred to the data warehouse, although, unlike the eager update, recalculation of affected cached aggregates happens only when they are referenced. On the other hand, keeping delta files in each base database will cause some delay when an affected cached aggregate is referenced. If the data warehouse and base databases are connected through a network, network connection and data transfer time will dominate the delay. In our work, by default we keep delta files in base databases, but if a base database has difficulty keeping them and the network overhead between the base database and the data warehouse is not severe, the delta files for that base database can be kept in the data warehouse.

An important premise for using the on-demand update (and the eager update and periodic update as well) is that aggregates can be computed incrementally. Although a large class of aggregates can be computed incrementally, some cannot. Such aggregates include order-related ones such as min, max, and median [26, 27]. When an aggregate cannot be computed incrementally, it is rematerialized (i.e., recomputed) on demand.

One problem with the on-demand update is that, for a cached aggregate, if the related base databases change frequently and the aggregate is not referenced for a long time, delta files can grow too big in the base databases. This problem can be handled in several ways:

1. The data warehouse periodically polls base databases on behalf of cached aggregates that have not been referenced for a certain period of time. The base databases then respond by sending delta files to the data warehouse. This is the periodic update.
2. Each base database monitors the sizes of its delta files. If the size of a delta file grows over a limit, the base database notifies the data warehouse and sends the oversized delta file to it. This is an aperiodic (or asynchronous) counterpart of the periodic update.
3. Instead of keeping delta files, base databases keep numeric deltas of affected aggregates, and the numeric values are sent when aggregates are referenced in the
data warehouse. As a simplistic example, when the aggregate count(*) on a relation in a base database is cached, the numeric delta is the difference between the number of tuples inserted into and the number of tuples deleted from that relation. When the aggregate is referenced, the numeric delta is sent to the data warehouse instead of a delta file containing the inserted and deleted tuples.
4. The overgrown delta files are discarded and the related cached aggregates are replaced.

Clearly, the aperiodic update (the second one above) is better than the periodic update, since in the aperiodic update the transfer of delta files takes place only when necessary. The numeric delta method (the third one) has the effect of shifting part of the data warehouse's burden to the base databases. If the base databases can sustain the newly imposed load, the performance of the data warehouse should improve, since network traffic can be reduced substantially and incremental updates in the aggregate cache become simpler. A drawback of this method is the added complexity, especially on the side of the base databases. Base databases now have to know which aggregates are being cached, along with all the information needed to perform partial computations of the affected aggregates.

Discarding overgrown delta files is closely coupled with the cache replacement policy. If the policy is LRU (Least Recently Used) or a variation of it, cached aggregates not referenced for a long period of time can be replaced or simply flushed. Therefore, it is justifiable to discard overgrown delta files if the related cached aggregates have not been used for a long time. Of course, if delta files are to be discarded, the related cached aggregates should also be deleted.

To sum up, in our aggregate cache we use both the aperiodic update and the numeric delta in conjunction with the on-demand update. The aperiodic update is used when
a base database is unable to implement the numeric delta method due to its limited functionality, or when an aggregate does not allow the numeric delta method.

4.3 Incremental Update of Aggregates

As mentioned in the previous section, incremental update is a premise for the on-demand update adopted in the aggregate cache. In our current work, an aggregate is defined in a general way to be a function that takes as input one or more nonempty sets of objects, called aggregate input sets, and zero or more scalar variables, and returns as output one scalar value. Elements in an aggregate input set are denoted by a variable called the aggregate variable. An aggregate input set corresponds to a group of values of an attribute in a relational table.

4.3.1 Syntactic Conventions

We present the syntactic conventions used in subsequent sections.

• Aggregate input sets are denoted by capital letters such as $X$, $Y$, and $Z$. If the size (i.e., cardinality) of a set $X$ is of significance and the size is $n$, a positive integer, the set is denoted by $X_n$.
• The aggregate variable of an aggregate input set $X$ is denoted by the lowercase letter $x$, and it represents an element in $X$. For $X_n$, elements in the set are distinguished by attaching a distinct subscript to $x$, as in $X_n = \{x_1, x_2, \ldots, x_i, \ldots, x_n\}$.
• Aggregates are denoted by capital script letters such as $\mathcal{F}$ and $\mathcal{H}$. An aggregate $\mathcal{F}$ with its arguments can be denoted as $\mathcal{F}(X_n, Y_m, \ldots, Z_o, \alpha, \beta, \ldots, \gamma)$, where $X_n, Y_m, \ldots, Z_o$ are aggregate input sets and $\alpha, \beta, \ldots, \gamma$ are zero or more independent scalar variables.
4.3.2 Incrementally Updatable Aggregates

Informally, incremental update of an aggregate means that when a new element is added to the aggregate input set (assuming the aggregate has only one input set) or an old element is deleted from it, the new value of the aggregate is computed from the added (or deleted) element and the current value of the aggregate stored in the system. If an aggregate has multiple aggregate input sets, let us assume for the time being that all the aggregate input sets are of the same size and that insertions or deletions take place in such a way that all the sets remain the same size. Many aggregates used in statistical analysis fall into this category. In order to define incremental update precisely, we first define positive delta and negative delta below.

Let us assume an aggregate $\mathcal{F}(X_n, Y_n, \ldots, Z_n, \alpha, \beta, \ldots, \gamma)$. For convenience, let $\Omega_{n-1}$ and $\Omega_n$ be two sets containing parameters for $\mathcal{F}$ as follows:

\[
\Omega_{n-1} = \{X_{n-1}, Y_{n-1}, \ldots, Z_{n-1}, \alpha, \beta, \ldots, \gamma\}, \qquad
\Omega_n = \{X_n, Y_n, \ldots, Z_n, \alpha, \beta, \ldots, \gamma\}.
\]

Now, suppose that for the aggregate $\mathcal{F}$ the current aggregate input sets are $X_{n-1}, Y_{n-1}, \ldots, Z_{n-1}$, and $x_n, y_n, \ldots, z_n$ are the elements inserted into their respective input sets (thereby making them $X_n, Y_n, \ldots, Z_n$, respectively). The positive delta $\Delta_n$ is defined as

\[
\Delta_n = \mathcal{F}(\Omega_n) - \mathcal{F}(\Omega_{n-1}). \tag{4.1}
\]

Aggregate $\mathcal{F}$ is incrementally updatable on the elements inserted into the aggregate input sets if there exists an algebraic function $g$ such that

\[
\Delta_n = g\bigl(\mathcal{F}_1(\Omega_{n-1}), \mathcal{F}_2(\Omega_{n-1}), \ldots, \mathcal{F}_i(\Omega_{n-1}),\; \mathcal{H}_1(\Omega_n), \mathcal{H}_2(\Omega_n), \ldots, \mathcal{H}_j(\Omega_n),\; x_n, y_n, \ldots, z_n\bigr) \tag{4.2}
\]

where all $\mathcal{F}_k$'s ($1 \le k \le i$) and $\mathcal{H}_l$'s ($1 \le l \le j$) are aggregates whose values are already known or incrementally updatable, and none of the $\mathcal{H}_l$'s is $\mathcal{F}$.
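As a simple illustration of the positive delta and of the function $g$, assume the familiar aggregate sum over a single input set $X$ (chosen here only as an example). Its positive delta is

```latex
\[
\Delta_n
  = \mathit{sum}(X_n) - \mathit{sum}(X_{n-1})
  = \sum_{i=1}^{n} x_i - \sum_{i=1}^{n-1} x_i
  = x_n .
\]
```

In this case $g$ simply returns the inserted element $x_n$, and no other aggregate value is needed to obtain $\mathit{sum}(X_n)$ from the known value of $\mathit{sum}(X_{n-1})$.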
For the same aggregate $\mathcal{F}$, suppose that the current aggregate input sets are $X_n, Y_n, \ldots, Z_n$ and $x'_n, y'_n, \ldots, z'_n$ are the elements deleted from their respective input sets (thereby making them $X_{n-1}, Y_{n-1}, \ldots, Z_{n-1}$). The negative delta $\Delta_{n-1}$ is defined as

\[
\Delta_{n-1} = \mathcal{F}(\Omega_{n-1}) - \mathcal{F}(\Omega_n). \tag{4.3}
\]

Aggregate $\mathcal{F}$ is incrementally updatable on the elements deleted from the aggregate input sets if there exists an algebraic function $g$ such that

\[
\Delta_{n-1} = g\bigl(\mathcal{F}_1(\Omega_n), \mathcal{F}_2(\Omega_n), \ldots, \mathcal{F}_i(\Omega_n),\; \mathcal{H}_1(\Omega_{n-1}), \mathcal{H}_2(\Omega_{n-1}), \ldots, \mathcal{H}_j(\Omega_{n-1}),\; x'_n, y'_n, \ldots, z'_n\bigr) \tag{4.4}
\]

where all $\mathcal{F}_k$'s ($1 \le k \le i$) and $\mathcal{H}_l$'s ($1 \le l \le j$) are aggregates whose values are already known or incrementally updatable, and none of the $\mathcal{H}_l$'s is $\mathcal{F}$.

An aggregate $\mathcal{F}$ is defined to be incrementally updatable if $\mathcal{F}$ has both a positive delta $\Delta_n$ (equation 4.1) and a negative delta $\Delta_{n-1}$ (equation 4.3) that are computable for all $n > 1$. Once $\Delta_n$ is computed, $\mathcal{F}(\Omega_n)$ in equation 4.1 can be obtained by adding $\Delta_n$ to the known value of $\mathcal{F}(\Omega_{n-1})$. Likewise, from $\Delta_{n-1}$, $\mathcal{F}(\Omega_{n-1})$ in equation 4.3 is obtained by adding $\Delta_{n-1}$ to the known value of $\mathcal{F}(\Omega_n)$.

Now, if an aggregate is sensitive to every insertion into (or deletion from) any one aggregate input set, the definitions of positive delta and negative delta can be extended in a straightforward way so that the aggregate can be incrementally updated whenever an insertion (or a deletion) is made. We do not elaborate on the extension in this work.

4.3.3 Algebraic Aggregates and Non-Algebraic Aggregates

Aggregates are first classified into algebraic ones and non-algebraic ones. An algebraic aggregate is an aggregate that takes as input only aggregate input sets of
real numbers and independent real-valued scalar variables, returns as output one real number, and consists only of algebraic operations defined over the real numbers. A non-algebraic aggregate is an aggregate whose domain or range is non-numeric, or that uses non-algebraic operations.

Example 4.3.1 Here are some examples of algebraic aggregates.

\[
\begin{aligned}
\mathit{count}(X_n) &= \sum_{i=1}^{n} 1 \\
\mathit{sum}(X_n) &= \sum_{i=1}^{n} x_i \\
\mathit{average}(X_n) &= \frac{1}{\mathit{count}(X_n)} \sum_{i=1}^{n} x_i \\
S_{xy}(X_n, Y_n) &= \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
\end{aligned}
\]

$S_{xy}(X_n, Y_n)$ is the sum of products of the distances of $x$ and $y$ from their means, $\bar{x}$ and $\bar{y}$ (i.e., $\mathit{average}(X_n)$ and $\mathit{average}(Y_n)$), and is used to compute simple linear regression.
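For aggregates such as count and average, the incremental update defined in Section 4.3.2 can be sketched as below. The class `RunningAverage` and its method names are hypothetical and serve only as an illustration. The deltas used are $(x_n - \mathit{average}(X_{n-1}))/\mathit{count}(X_n)$ on insertion and $(\mathit{average}(X_n) - x_n)/\mathit{count}(X_{n-1})$ on deletion, which follow from the definitions of positive and negative delta. The sketch assumes that the input set stays nonempty, as the definition of an aggregate requires.

```python
class RunningAverage:
    """Maintains count and average of a nonempty multiset incrementally."""

    def __init__(self, first):
        self.count = 1
        self.average = float(first)

    def insert(self, x):
        # count(X_n) is obtained from count(X_{n-1}).
        self.count += 1
        # Positive delta: (x_n - average(X_{n-1})) / count(X_n).
        delta = (x - self.average) / self.count
        self.average += delta

    def delete(self, x):
        if self.count < 2:
            raise ValueError("input set must stay nonempty")
        # count(X_{n-1}) is obtained from count(X_n).
        self.count -= 1
        # Negative delta: (average(X_n) - x_n) / count(X_{n-1}).
        delta = (self.average - x) / self.count
        self.average += delta
```

Neither operation rescans the stored elements; only the cached count, the cached average, and the inserted or deleted value are used.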
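For average, the positive delta is

\[
\begin{aligned}
\Delta_n &= \mathit{average}(X_n) - \mathit{average}(X_{n-1}) \\
&= \frac{\sum_{i=1}^{n} x_i}{n} - \frac{\sum_{i=1}^{n-1} x_i}{n-1} \\
&= \frac{(n-1)\sum_{i=1}^{n} x_i}{n(n-1)} - \frac{n\sum_{i=1}^{n-1} x_i}{n(n-1)} \\
&= \frac{n\bigl(\sum_{i=1}^{n} x_i - \sum_{i=1}^{n-1} x_i\bigr)}{n(n-1)} - \frac{\sum_{i=1}^{n} x_i}{n(n-1)} \\
&= \frac{n x_n - \sum_{i=1}^{n} x_i}{n(n-1)} \\
&= \frac{(n-1)x_n - \sum_{i=1}^{n-1} x_i}{n(n-1)} \\
&= \frac{(n-1)x_n}{n(n-1)} - \frac{\sum_{i=1}^{n-1} x_i}{n(n-1)} \\
&= \frac{x_n}{n} - \frac{\mathit{average}(X_{n-1})}{n} \\
&= \frac{x_n - \mathit{average}(X_{n-1})}{n} \\
&= \frac{x_n - \mathit{average}(X_{n-1})}{\mathit{count}(X_n)}
\end{aligned}
\]

and the negative delta is

\[
\begin{aligned}
\Delta_{n-1} &= \mathit{average}(X_{n-1}) - \mathit{average}(X_n) \\
&= \frac{\sum_{i=1}^{n-1} x_i}{n-1} - \frac{\sum_{i=1}^{n} x_i}{n} \\
&= \frac{n\sum_{i=1}^{n-1} x_i - (n-1)\sum_{i=1}^{n} x_i}{(n-1)n} \\
&= \frac{n\sum_{i=1}^{n-1} x_i - n\sum_{i=1}^{n} x_i + \sum_{i=1}^{n} x_i}{(n-1)n} \\
&= \frac{n\bigl(\sum_{i=1}^{n-1} x_i - \sum_{i=1}^{n} x_i\bigr) + \sum_{i=1}^{n} x_i}{(n-1)n} \\
&= \frac{-n x_n + \sum_{i=1}^{n} x_i}{(n-1)n} \\
&= \frac{-n x_n}{(n-1)n} + \frac{\sum_{i=1}^{n} x_i}{(n-1)n} \\
&= \frac{\mathit{average}(X_n)}{n-1} - \frac{x_n}{n-1} \\
&= \frac{\mathit{average}(X_n) - x_n}{n-1} \\
&= \frac{\mathit{average}(X_n) - x_n}{\mathit{count}(X_{n-1})}.
\end{aligned}
\]

$\Delta_n$ and $\Delta_{n-1}$ above are computable for every $n > 1$. Therefore, $\mathit{average}(X)$ is incrementally updatable. Compare $\Delta_n$ and $\Delta_{n-1}$ to equations 4.2 and 4.4, respectively. In $\Delta_n$, $\mathit{average}(X_{n-1})$ is already known and $\mathit{count}(X_n)$ can be incrementally computed from the value of $\mathit{count}(X_{n-1})$. In $\Delta_{n-1}$, $\mathit{average}(X_n)$ is known and $\mathit{count}(X_{n-1})$ can be computed from the value of $\mathit{count}(X_n)$.

4.3.4 Summative Aggregates

The vast majority of aggregates that are used, or have good potential for being used, in decision making applications are those that perform some type of summation operation. We call such aggregates summative aggregates. In light of this observation, we focus our effort on making them incrementally updatable.

Let $X_n, Y_m, \ldots, Z_o$ be aggregate input sets whose respective current sizes are $n, m, \ldots, o$, and let $x_i \in X_n$ ($1 \le i \le n$), $y_j \in Y_m$ ($1 \le j \le m$), $\ldots$, $z_k \in Z_o$ ($1 \le k \le o$) be aggregate variables whose data types are numeric.

Summative Aggregate: Given aggregate input sets $X_n, Y_m, \ldots, Z_o$, a summative aggregate is defined to be an algebraic aggregate that has summation operators ($\sum$'s) in it, as shown below:

\[
\mathcal{F}(X_n, Y_m, \ldots, Z_o) = \sum_{p=1}^{\mu} f\bigl(\psi_p,\, p,\, \mathcal{H}(X_n, Y_m, \ldots, Z_o)\bigr) \tag{4.5}
\]

where $f(\psi_p, p, \mathcal{H}(X_n, Y_m, \ldots, Z_o))$ is called the summation body, $p$ is called the index variable and always starts from 1, $\mu$ is called the termination variable and indicates the final index value, $\psi_p \in \{x_i, y_j, \ldots, z_k\}$, $(p, \mu) \in \{(i, n), (j, m), \ldots, (k, o)\}$, and $\mathcal{H}(X_n, Y_m, \ldots, Z_o)$ is a summative aggregate that is nested in $\mathcal{F}$ and differs from $\mathcal{F}$. If aggregate input sets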
have the same size (i.e., $X_n, Y_n, \ldots, Z_n$), the definition of summative aggregate can be simplified as follows:

\[
\mathcal{F}(X_n, Y_n, \ldots, Z_n) = \sum_{i=1}^{n} f(x_i, y_i, \ldots, z_i, i) \tag{4.6}
\]

where the summation body $f(x_i, y_i, \ldots, z_i, i)$ may (recursively) contain other summative aggregates. When the summation body of a summative aggregate contains one or more summation operators, the aggregate is called a nested summative aggregate.

The summation body in a summative aggregate is defined to be an aggregative polynomial, defined below.

Aggregative polynomial. An aggregative polynomial is a polynomial that consists of zero or more pluses (+'s) as operators and aggregative monomials as operands.

Aggregative monomial. An aggregative monomial is a nonzero real constant multiplied by zero or more factors. A factor is either an aggregate variable, an index variable, a summative aggregate, or a parenthesized aggregative polynomial, any of which may be raised to a nonzero rational power or be an argument of transcendental functions on the real numbers, such as the logarithm.³ An aggregative monomial that does not contain another aggregate is called atomic. An aggregative monomial is sometimes called a term.

Example 4.3.4 Given aggregate variables $x$, $y$, and $z$ and index variables $i$ and $j$, the following are aggregative monomials:

\[
1, \quad 7x_i, \quad 2.5\,i^{2}x_i, \quad x_i^{2}\,y_i^{-1}\Bigl(\sum_{j=1}^{n} z_j\Bigr)^{2}, \quad \sqrt{x_i y_i}\ \bigl(= (x_i y_i)^{1/2}\bigr), \quad \log_2 x_i .
\]

³ Ceiling ($\lceil\ \rceil$), floor ($\lfloor\ \rfloor$), and absolute value ($|\ |$) functions are included too.
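In $x_i^{2}\,y_i^{-1}\bigl(\sum_{j=1}^{n} z_j\bigr)^{2}$, the factors are $x_i^{2}$, $y_i^{-1}$, and $\bigl(\sum_{j=1}^{n} z_j\bigr)^{2}$, whereas in $\log_2 x_i$, the sole factor is $\log_2 x_i$. Except for the monomial containing $\bigl(\sum_{j=1}^{n} z_j\bigr)^{2}$, all the aggregative monomials are atomic.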
Lemma 4 A summative aggregate whose summation body is an aggregative polynomial is incrementally updatable if all summative aggregates in the aggregative monomials of the aggregative polynomial are incrementally updatable.

Proof of Lemma 4 Assume any aggregative polynomial $f_1(*) + f_2(*) + \cdots + f_v(*)$, where each $f_s(*)$, $1 \le s \le v$, is an aggregative monomial and each $(*)$ represents any subset of the aggregate variables and index variables defined in the original summative aggregate; the subsets need not be equal to each other. Then, by the distribution property of the $\sum$ operator over additive (plus and minus) terms, the following equality holds:

\[
\sum_{i=1}^{n} \bigl(f_1(*) + f_2(*) + \cdots + f_v(*)\bigr) = \sum_{i=1}^{n} f_1(*) + \sum_{i=1}^{n} f_2(*) + \cdots + \sum_{i=1}^{n} f_v(*).
\]

Thus, if each $\sum_{i=1}^{n} f_s(*)$ can be updated incrementally, the original summative aggregate on the left-hand side of the equation above can be computed incrementally.

Now, the summative aggregate is extended to include more algebraic aggregates as follows.

Extended summative aggregate. Given a set of aggregate input sets $\Omega = \{X_n, Y_m, \ldots, Z_o\}$ and a set of summative aggregates over subsets of $\Omega$, $\mathcal{F}_1(\Omega_1), \mathcal{F}_2(\Omega_2), \ldots, \mathcal{F}_v(\Omega_v)$, with $\Omega_s \subseteq \Omega$ ($1 \le s \le v$), an aggregate $\mathcal{F}'$ over $\Omega$ is defined to be an extended summative aggregate if there exists an algebraic function $g$ on the real numbers such that

\[
\mathcal{F}'(X_n, Y_m, \ldots, Z_o) = g\bigl(\mathcal{F}_1(\Omega_1), \mathcal{F}_2(\Omega_2), \ldots, \mathcal{F}_v(\Omega_v)\bigr). \tag{4.7}
\]

Note that in equation 4.7, the algebraic function $g$ is not a summative aggregate by definition: $g$ takes no set as an argument, and its arguments are any number of scalar real values (since each aggregate $\mathcal{F}_s$ returns a scalar real number), whereas an aggregate is defined to take at least one aggregate input set.
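To illustrate Lemma 4 and equation 4.7 together, the sketch below caches four component summative aggregates (the count, $\sum x_i$, $\sum y_i$, and $\sum x_i y_i$) and recovers $S_{xy}$ through an algebraic function $g$ of those scalars. The identity $S_{xy} = \sum x_i y_i - (\sum x_i)(\sum y_i)/n$ is standard algebra. The class and method names are hypothetical, and the sketch is only one possible way to organize such a computation.

```python
class SxyCache:
    """Keeps component sums so that S_xy can be updated incrementally."""

    def __init__(self):
        self.n = 0
        self.sum_x = 0.0
        self.sum_y = 0.0
        self.sum_xy = 0.0

    def insert(self, x, y):
        self._apply(x, y, +1)

    def delete(self, x, y):
        self._apply(x, y, -1)

    def _apply(self, x, y, sign):
        # Each component is a summation over one aggregative monomial,
        # so its delta depends only on the inserted or deleted pair.
        self.n += sign
        self.sum_x += sign * x
        self.sum_y += sign * y
        self.sum_xy += sign * x * y

    def value(self):
        """g(n, sum_x, sum_y, sum_xy): algebraic, takes only scalars."""
        if self.n == 0:
            raise ValueError("S_xy is undefined on an empty input set")
        return self.sum_xy - self.sum_x * self.sum_y / self.n
```

Because the component sums are kept separately, they could also be shared with other aggregates over the same input sets, such as the averages of $X$ and $Y$.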
Example 4.3.7 While the definition of extended summative aggregate covers a vast variety of algebraic aggregates that perform certain types of cumulative operations, perhaps the best known example of an extended summative aggregate is

\[
\mathit{average}(X_n) = \sum_{i=1}^{n} x_i \Big/ \sum_{i=1}^{n} 1 = \mathit{sum}(X_n) / \mathit{count}(X_n).
\]

Another example is

\[
\log\Bigl(\sum_{i=1}^{n} x_i\Bigr) = \log\bigl(\mathit{sum}(X_n)\bigr).
\]

Example 4.3.8 This example shows an extended aggregate, $\mathit{grand\_average}(X_n, Y_m)$, which is an average over two related aggregate input sets, $X$ and $Y$:

\[
\mathit{grand\_average}(X_n, Y_m) = \frac{\mathit{grand\_sum}(X_n, Y_m)}{\mathit{grand\_count}(X_n, Y_m)}.
\]

Lemma 5 The extended summative aggregate of equation 4.7 is incrementally updatable if all $\mathcal{F}_s(\Omega_s)$'s ($1 \le s \le v$) are incrementally updatable.⁴

Proof of Lemma 5 Since $g$ is a function, over a defined interval of its domain (see footnote 4) it maps every unique vector of input values to the same real number. The mapping therefore remains the same no matter whether the input values are obtained through an incremental update or recomputed on the new aggregate input sets of the underlying summative aggregates.

⁴ A function on the real numbers may have intervals of its domain where the function is undefined (e.g., the logarithm over negative values). Therefore, it is possible that for a certain $g$, $g$ cannot be (incrementally) computed on some values of the input summative aggregates. However, that is not because of the incremental update, but because of the definition of that specific aggregate.
Incidentally, it is interesting to note that if the function $g$ of equation 4.7 is a procedural function (i.e., a non-algebraic function), Lemma 5 does not hold in general. A simple counterexample is $g$ being max().

4.3.5 Binding of Variables

As shown in equation 4.5, a summative aggregate has two additional types of variables (other than aggregate variables): index variables and termination variables. Below, we describe how these variables are bound to a value or to another variable:

• The subscript of every aggregate variable is bound to an index variable; that is, an index variable is used to refer to a specific element in each aggregate input set.
• An instance of an index variable used in an aggregative monomial is bound to the index variable.
• The initial value of an index variable is always bound to 1 by definition.
• The final value of an index variable is bound to its matching termination variable.
• The termination variable of the outermost $\sum$ operator is either unbound (in most cases) or bound to a constant. If it is unbound, it should be bound to the current size (cardinality) of an associated aggregate input set when the aggregate is referenced by a query.
• The termination variable of an inner $\sum$ operator (i.e., one in a summation body) is either unbound, bound to an index variable of an outer $\sum$ operator, or bound to a constant.
It should be noted that when a termination variable is bound to the cardinality of an aggregate input set, the actual value of the cardinality need not be known unless it is explicitly used in the summation body. In an incremental update, the purpose of the termination variable is to identify the latest element to be added to (or deleted from) an aggregate input set. On the other hand, if the cardinality is explicitly used in the summation body, it is interpreted as the simplest summative aggregate, COUNT(*), of the associated aggregate input set.

Scope of bindings. Each binding has a scope. A binding is in effect only within its scope. The scope of a summative aggregate is the inside of its outermost $\sum$ operator. The scope of the index and termination variables associated with a $\sum$ operator is the inside of the summation operation, which is also called the $\sum$'s scope. The scope of an aggregate variable is the scope of its associated index variable. A nested summative aggregate has multiple nested scopes. For any two index variables, either one variable's scope properly contains the other variable's scope or the two scopes are independent. If the scopes of two index variables are independent, the two index variables may be denoted by the same letter.

Example 4.3.9 The following are summative aggregates in which $\sum$ operators are nested (i.e., nested summative aggregates).

[The display equations defining $\mathcal{F}_1(X_n, Y_n, Z_n)$ and $\mathcal{F}_2(X_n, Y_n, Z_n)$ are illegible in the source.]

\[ \mathcal{F}_3(X_n, Y_n, Z_n) = \sum_{i=1}^{n} \Bigl( x_i \Bigl( \sum_{j=1}^{n} y_j \sum_{k=1}^{n} z_k \Bigr) \Bigr) = \sum_{i=1}^{n} \Bigl( x_i \Bigl( \sum_{j=1}^{n} y_j \sum_{j=1}^{n} z_j \Bigr) \Bigr) \]

In particular, in $\mathcal{F}_3$, the scopes of $\sum_{j=1}^{n} y_j$ and $\sum_{k=1}^{n} z_k$ are independent. So, both summative aggregates can be represented using the same index variable $j$.
4.3.6 Decomposition of Summative Aggregates

In the aggregate cache, the basic unit cached is a summation over an aggregative monomial. It is, therefore, more beneficial to decompose a complex summative aggregate into a set of smaller component summative aggregates and store their values in the aggregate cache. When the value of the original summative aggregate is needed, it can be easily restored from the values of the cached component aggregates. This approach facilitates sharing cached component aggregates among many summative aggregates.

Furthermore, the decomposition plays a more important role in the aggregate cache. While the original summative aggregate is decomposed and the decomposed summative aggregates are normalized, it becomes known whether the original summative aggregate can be incrementally updated or not.^5 If the original summative aggregate is not incrementally updatable, it can still be cached, but the cached value must be invalidated if changes in base databases affect it.

^5 The normalization is discussed in Section 4.3.7.

In principle, the decomposition is done in two steps: expansion of the summation body of a given summative aggregate, and distribution of $\sum$ operators over the expanded summation body. If a given summative aggregate is nested (i.e., it contains other summative aggregates), the summation body is recursively expanded until no further expansion is possible. Then, $\sum$ operators are distributed over the whole expansion. By the distributive property of the $\sum$ operator over additive terms, the sum of the decomposed summative aggregates is equivalent to the original summative aggregate.

In order to describe the expansion procedure precisely, let us represent an arbitrary aggregative monomial $h(*)$ as follows:

\[ h(*) = c\, a_1^{p_1} a_2^{p_2} \cdots a_t^{p_t} \tag{4.8} \]
where $*$ is a union of aggregate input sets and index variables, $c$ is a nonzero real constant, each $a_l^{p_l}$ ($1 \le l \le t$) is a factor (if $t = 0$, no factor is present in the aggregative monomial^6), and each $p_l$ ($1 \le l \le t$) is a nonzero rational number. If $a_l^{p_l}$ denotes a factor other than a transcendental function, it means $a_l$ raised to the power $p_l$. Otherwise, $a_l^{p_l}$ simply denotes a transcendental function. (See Section 4.3.4 and Example 4.3.4.)

In order to expand the aggregative monomial $h(*)$ of equation 4.8, each factor $a_l^{p_l}$ ($1 \le l \le t$) is expanded first. If it is expanded into multiple terms, $h(*)$ is expanded accordingly into multiple aggregative monomials by the usual algebraic manipulations.

Example 4.3.10 If $a_1^{p_1}$ and $a_2^{p_2}$ in equation 4.8 are expanded into $(a_{11}^{p_{11}} + a_{12}^{p_{12}})$ and $(a_{21}^{p_{21}} - a_{22}^{p_{22}})$ respectively, $h(*)$ is expanded into four aggregative monomials as follows:

\[ h(*) = c\, a_1^{p_1} a_2^{p_2} \cdots a_t^{p_t} = c\, a_{11}^{p_{11}} a_{21}^{p_{21}} \cdots a_t^{p_t} + c\, a_{12}^{p_{12}} a_{21}^{p_{21}} \cdots a_t^{p_t} - c\, a_{11}^{p_{11}} a_{22}^{p_{22}} \cdots a_t^{p_t} - c\, a_{12}^{p_{12}} a_{22}^{p_{22}} \cdots a_t^{p_t} \] ◁

Expansion of factor. The expansion of a factor $a_l^{p_l}$ in equation 4.8 is obtained by the factor expansion procedure below:

1. If $a_l^{p_l}$ denotes a transcendental function, the expansion of $a_l^{p_l}$ is $a_l^{p_l}$ itself. That is, a transcendental function is not (or cannot be) expanded.^7

^6 Hereafter, for a sequence of any entities $a_1, a_2, \ldots, a_t$ ($t \ge 0$), if $t = 0$, the sequence is assumed to be null (i.e., there is no entity in the sequence).

^7 For some transcendental functions, there are a few cases in which expansion is possible. For example, $\log(x_i / y_i)$ can be expanded into $\log x_i - \log y_i$. However, in most cases such an expansion is not possible. Therefore, in our work we do not expand transcendental functions.
2. If $p_l$ (of $a_l^{p_l}$) is not a positive integer, the expansion of factor $a_l^{p_l}$ is $a_l^{p_l}$ itself.^8

3. Otherwise, $a_l$ (of $a_l^{p_l}$) is expanded first as follows:

(a) If $a_l$ is a parenthesized aggregative polynomial, its component aggregative monomials are recursively expanded first and their results are added, yielding $h_{l_1}(*') + h_{l_2}(*') + \cdots + h_{l_w}(*')$ ($w \ge 1$), where $*'$ is a subset of $*$ and the $*'$ need not be equal to each other. Then, the expansion of $a_l$ is this sum, $h_{l_1}(*') + h_{l_2}(*') + \cdots + h_{l_w}(*')$.

(b) If $a_l$ is a summative aggregate $\sum_{\rho=1}^{\mu} f_l(*')$, the parenthesized summation body $(f_l(*'))$ is recursively expanded first, yielding $h_{l_1}(*') + h_{l_2}(*') + \cdots + h_{l_w}(*')$ ($w \ge 1$). Then, the expansion of $a_l$ is $\sum_{\rho=1}^{\mu} h_{l_1}(*') + \sum_{\rho=1}^{\mu} h_{l_2}(*') + \cdots + \sum_{\rho=1}^{\mu} h_{l_w}(*')$.

(c) If $a_l$ is an index variable or an aggregate variable, the expansion of $a_l$ is $a_l$ itself.

After expanding $a_l$, if the expansion of $a_l$ is a single term, the expansion of factor $a_l^{p_l}$ is $a_l^{p_l}$. Otherwise, the expansion of factor $a_l^{p_l}$ is obtained by applying the multinomial theorem (see below) to the expansion of $a_l$. □

^8 Again, for an expression $(a_1 + a_2 + \cdots + a_n)^p$, if the exponent $p$ is not a positive (rather, nonnegative) integer, it is generally not possible to expand the expression in a finite number of steps.

After the expansion of a summation body is completed, if there exist factors of the form $(c_l\, a_{l_1}^{p_{l_1}} a_{l_2}^{p_{l_2}} \cdots a_{l_r}^{p_{l_r}})^{p_l}$ ($r \ge 1$) in the expansion, they are converted into $c_l^{p_l}\, a_{l_1}^{p_{l_1} p_l} a_{l_2}^{p_{l_2} p_l} \cdots a_{l_r}^{p_{l_r} p_l}$.

Decomposition of summative aggregates. Suppose a summative aggregate (not extended), $\sum_{i=1}^{n} f(*)$, where $*$ is a union of aggregate input sets and index variable $i$. In order to apply the expansion procedure described so far, the summation body
$f(*)$ (which is an aggregative polynomial) is changed into an aggregative monomial by parenthesizing it as follows:

\[ \sum_{i=1}^{n} f(*) = \sum_{i=1}^{n} (f(*)) \tag{4.9} \]

Then, after expanding the new summation body, let the result be:

\[ (f(*)) = h_1(*') + h_2(*') + \cdots + h_v(*') \quad (v \ge 1) \tag{4.10} \]

where each $h_s(*')$ ($1 \le s \le v$) is an aggregative monomial, $*'$ is a subset of $*$, and the $*'$ need not be equal to each other. Then, by distributing the $\sum$ operator over the right-hand side of equation 4.10, the following decomposed summative aggregate is obtained:

\[ \sum_{i=1}^{n} f(*) = \sum_{i=1}^{n} h_1(*') + \sum_{i=1}^{n} h_2(*') + \cdots + \sum_{i=1}^{n} h_v(*'). \tag{4.11} \]

Decomposition of extended summative aggregates. For an extended summative aggregate, $\mathcal{F}(*) = g(\mathcal{F}_1(*'), \mathcal{F}_2(*'), \ldots, \mathcal{F}_v(*'))$ ($v \ge 1$), each argument summative aggregate $\mathcal{F}_s(*')$ ($1 \le s \le v$) is decomposed first, yielding $\mathcal{F}'_s(*')$. Then, the original argument summative aggregates are substituted with the decomposed ones, yielding $\mathcal{F}(*) = g(\mathcal{F}'_1(*'), \mathcal{F}'_2(*'), \ldots, \mathcal{F}'_v(*'))$.

Any monomial (not only an aggregative monomial) of the form $(a_1 + a_2 + \cdots + a_n)^p$, where $p$ is a positive integer, can be expanded using the multinomial theorem, which is an extension of the binomial theorem.^9 The binomial theorem and the multinomial theorem are presented below.

^9 In fact, the multinomial theorem (and the binomial theorem as well) holds even when the exponent $p$ is a nonzero rational number. However, in general, expanding such a monomial results in an infinite series, which is of no use for the aggregate cache.

Binomial theorem. Let $p$ be a positive integer. Then for all $x$ and $y$,

\[ (x + y)^p = \sum_{k=0}^{p} \binom{p}{k} x^k y^{p-k} \]
where $\binom{p}{k} = p! / (k!\,(p-k)!)$.

Multinomial theorem. Let $p$ be a positive integer. Then for all $x_1, x_2, \ldots, x_t$,

\[ (x_1 + x_2 + \cdots + x_t)^p = \sum \binom{p}{p_1\; p_2\; \cdots\; p_t} x_1^{p_1} x_2^{p_2} \cdots x_t^{p_t} \]

where the summation extends over all possible sequences of nonnegative integers $p_1, p_2, \ldots, p_t$ with $p_1 + p_2 + \cdots + p_t = p$, and

\[ \binom{p}{p_1\; p_2\; \cdots\; p_t} = \frac{p!}{p_1!\, p_2! \cdots p_t!}. \]

Example 4.3.11 When $(x_1 + x_2 + x_3 + x_4 + x_5)^7$ is expanded, the coefficient of $x_1^3 x_2 x_4^2 x_5$ ($= x_1^3 x_2^1 x_3^0 x_4^2 x_5^1$) equals

\[ \binom{7}{3\; 1\; 0\; 2\; 1} = \frac{7!}{3!\,1!\,0!\,2!\,1!} = 420. \]

When $(4x_1 - 2x_2 + 5x_3)^6$ is expanded, the coefficient of $x_1^2 x_2^3 x_3$ equals

\[ \binom{6}{2\; 3\; 1} 4^2 (-2)^3\, 5 = -38{,}400. \] ◁

Using the multinomial theorem, it is possible to expand any aggregative monomial raised to a positive integer power; if its component monomials are themselves raised to powers, they can be recursively expanded. From a practical viewpoint, however, it is not a good idea to expand an aggregative monomial raised to a power greater than 3, since the number of monomials produced, $\binom{p+t-1}{t-1}$ for $t$ terms raised to the power $p$, grows rapidly. In this regard, we expand (recursively) only those aggregative monomials raised to a power of 3 or less.

Example 4.3.12 Consider the following summative aggregate:

\[ \sum_{i=1}^{n} (x_i - y_i)^2 \]

where the summation body is $(x_i - y_i)^2$. Assume that the current aggregate input sets are $X_{n-1}$ and $Y_{n-1}$ and the aggregate value on these input sets, $\sum_{i=1}^{n-1} (x_i - y_i)^2$, is
stored in the aggregate cache. Then, upon insertion of $x_n$ and $y_n$, the new aggregate value can be computed by adding the value of $(x_n - y_n)^2$ to the stored aggregate value. The summative aggregate is decomposed as follows:

\[ \sum_{i=1}^{n} (x_i - y_i)^2 = \sum_{i=1}^{n} (x_i^2 - 2 x_i y_i + y_i^2) = \sum_{i=1}^{n} x_i^2 - 2 \sum_{i=1}^{n} x_i y_i + \sum_{i=1}^{n} y_i^2 \]

Instead of storing $\sum_{i=1}^{n} (x_i - y_i)^2$ directly, it is much better to store $\sum_{i=1}^{n} x_i^2$, $\sum_{i=1}^{n} x_i y_i$, and $\sum_{i=1}^{n} y_i^2$ so that other summative aggregates too can make use of the stored values. Similarly, a more complex summative aggregate is decomposed as follows:

\[ \sum_{i=1}^{n} (2x_i - y_i + z_i)^3 = \sum_{i=1}^{n} (8x_i^3 - y_i^3 + z_i^3 - 12x_i^2 y_i + 12x_i^2 z_i + 6x_i y_i^2 + 3y_i^2 z_i + 6x_i z_i^2 - 3y_i z_i^2 - 12x_i y_i z_i) \]
\[ = 8\sum_{i=1}^{n} x_i^3 - \sum_{i=1}^{n} y_i^3 + \sum_{i=1}^{n} z_i^3 - 12\sum_{i=1}^{n} x_i^2 y_i + 12\sum_{i=1}^{n} x_i^2 z_i + 6\sum_{i=1}^{n} x_i y_i^2 + 3\sum_{i=1}^{n} y_i^2 z_i + 6\sum_{i=1}^{n} x_i z_i^2 - 3\sum_{i=1}^{n} y_i z_i^2 - 12\sum_{i=1}^{n} x_i y_i z_i \]

This summative aggregate shows how rapidly the number of aggregative monomials increases as the exponent increases. ◁

4.3.7 Normalization of Summative Aggregates

Normalization of a summative aggregate is a process of simplifying the summative aggregate. The simplest normalization is to pull a constant that multiplies the summation body out of the scope of its summation operator. For instance, $\sum_{i=1}^{n} -4x_i = -4 \sum_{i=1}^{n} x_i$. By normalizing a summative aggregate in this way, $\sum_{i=1}^{n} x_i$ can be cached instead of $\sum_{i=1}^{n} -4x_i$, so that cache operations can be more efficient. In fact, normalization plays a more profound role than making cache operations efficient. It enables certain types of summative aggregates to be incrementally updated which otherwise could not be. In this subsection we extend this simple idea in order to normalize complex nested summative aggregates.
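The decomposition of Example 4.3.12 can be reproduced with a short sketch of the multinomial expansion. The function names `multinomial`, `expand_power`, and `decomposed_sum` are hypothetical and introduced only for illustration; the sketch assumes the summation body is a linear combination of aggregate variables raised to a positive integer power, which is narrower than the general expansion procedure of Section 4.3.6.

```python
from itertools import product
from math import factorial


def multinomial(p, exponents):
    """Return p! / (p_1! p_2! ... p_t!) for exponents that sum to p."""
    assert sum(exponents) == p
    result = factorial(p)
    for e in exponents:
        result //= factorial(e)  # each intermediate quotient is an integer
    return result


def expand_power(coeffs, p):
    """Expand (c_1*v_1 + ... + c_t*v_t)**p by the multinomial theorem.

    Returns {exponent tuple: integer coefficient}; each key stands for one
    aggregative monomial v_1**e_1 * ... * v_t**e_t of the expansion.
    """
    terms = {}
    for exps in product(range(p + 1), repeat=len(coeffs)):
        if sum(exps) != p:
            continue
        coef = multinomial(p, exps)
        for c, e in zip(coeffs, exps):
            coef *= c ** e
        terms[exps] = coef
    return terms


def decomposed_sum(coeffs, p, rows):
    """Evaluate sum_i (c . row_i)**p from the per-monomial component sums."""
    total = 0
    for exps, coef in expand_power(coeffs, p).items():
        component = 0  # the value a cache would hold for this monomial
        for row in rows:
            mono = 1
            for v, e in zip(row, exps):
                mono *= v ** e
            component += mono
        total += coef * component  # the constant stays outside the summation
    return total
```

For the body $(2x_i - y_i + z_i)^3$, `expand_power([2, -1, 1], 3)` yields ten monomials, which is why the text limits expansion to exponents of 3 or less.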
As summative aggregates are nested, some unbound entities (variables and component summative aggregates) can be present within the scope of a summative aggregate. These unbound entities hinder the incremental update of summative aggregates since their values are not known at the time of an incremental update. There are several entities that can be unbound within a component summative aggregate of a nested summative aggregate:

1. An index variable of an outer summation operator.
2. An aggregate variable indexed by an index variable of an outer summation operator.
3. A summative aggregate whose termination variable is bound to the cardinality of an aggregate input set.
4. A summative aggregate whose termination variable is an index variable of an outer summation operator.

The goal of normalization is to make, recursively, every summative aggregate contain only those entities bound to its index variable, by moving unbound entities, plus any constant multiplying the aggregative monomial, out of the scope of the summative aggregate. Therefore, once a normalization is completed, no unbound entities remain within any summative aggregate's scope.

Normalization Process

Without loss of generality, it is assumed that a summative aggregate is decomposed maximally (i.e., as far as possible) before it is normalized. After the decomposition, each decomposed summative aggregate has a summation body consisting of one aggregative monomial. Then, the normalization is performed on each decomposed summative aggregate. The following is a decomposed summative aggregate to
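be normalized:

\[ \sum_{\rho=1}^{\mu} c\, a_1^{p_1} a_2^{p_2} \cdots a_t^{p_t} \quad (t \ge 0) \tag{4.12} \]

where $\rho$ is an index variable, $\mu$ is a termination variable, $c$ is a nonzero real constant, each $a_l^{p_l}$ ($1 \le l \le t$) is a factor, and each $p_l$ ($1 \le l \le t$) is a nonzero rational number. (See equation 4.8.)

Given an index variable $\rho$, a factor is fully bound to $\rho$ if $\rho$ is the sole index variable that appears in the factor. A factor is partially bound to $\rho$ if $\rho$ and other index variables appear in the factor. If $\rho$ does not appear in a factor, the factor is unbound to $\rho$.

Normalization of a summative aggregate 4.12 takes two passes:

1. For each factor $a_l^{p_l}$, if $a_l$ (of $a_l^{p_l}$) is a summative aggregate, it is normalized first, yielding $c_l\, a'_{l_1} a'_{l_2} \cdots a'_{l_{t_l}}$.^10 After all component summative aggregates are normalized, let the result be the following:

\[ \sum_{\rho=1}^{\mu} (c\, c_1^{p_1} c_2^{p_2} \cdots c_t^{p_t})\, {a'}_{1_1}^{p_1} \cdots {a'}_{1_{t_1}}^{p_1}\, {a'}_{2_1}^{p_2} \cdots {a'}_{2_{t_2}}^{p_2} \cdots {a'}_{t_1}^{p_t} \cdots {a'}_{t_{t_t}}^{p_t} \tag{4.13} \]

where each $t_s \ge 1$ ($1 \le s \le t$), and let $q = t_1 + t_2 + \cdots + t_t$ be the total number of factors.

2. The constant and the factors that are unbound to $\rho$ are moved out of the scope of the $\sum$ operator, yielding:

\[ (c\, c_1^{p_1} c_2^{p_2} \cdots c_t^{p_t})\, a_{u_1}^{p_{u_1}} a_{u_2}^{p_{u_2}} \cdots a_{u_v}^{p_{u_v}} \sum_{\rho=1}^{\mu} a_{b_1}^{p_{b_1}} a_{b_2}^{p_{b_2}} \cdots a_{b_w}^{p_{b_w}} \quad (v \ge 0,\ w \ge 0) \tag{4.14} \]

where each $a_{u_k}^{p_{u_k}}$ is a factor unbound to $\rho$, each $a_{b_k}^{p_{b_k}}$ is a factor either fully or partially bound to $\rho$, and $v + w = q$.

^10 Note that by the normalization a summative aggregate can be transformed to an extended summative aggregate. (See $\mathcal{F}_2$ and $\mathcal{F}_3$ in Example 4.3.13.)

In the transformed summative aggregate 4.14, if the right-hand side of the $\sum$ operator contains only factors bound to $\rho$ and all component summative aggregates, if any, are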
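normalized, the summative aggregate is called normalized. Otherwise, the original summative aggregate 4.12 is called unnormalizable.

Theorem 3 A normalized summative aggregate is equivalent to the original summative aggregate, that is:

\[ \sum_{\rho=1}^{\mu} c\, a_1^{p_1} \cdots a_t^{p_t} = (c\, c_1^{p_1} c_2^{p_2} \cdots c_t^{p_t})\, a_{u_1}^{p_{u_1}} a_{u_2}^{p_{u_2}} \cdots a_{u_v}^{p_{u_v}} \sum_{\rho=1}^{\mu} a_{b_1}^{p_{b_1}} \cdots a_{b_w}^{p_{b_w}}. \]

Proof of Theorem 3 We first show that summative aggregates 4.13 and 4.14 are equivalent, that is:

\[ \sum_{\rho=1}^{\mu} (c\, c_1^{p_1} c_2^{p_2} \cdots c_t^{p_t})\, {a'}_{1_1}^{p_1} \cdots {a'}_{1_{t_1}}^{p_1} \cdots {a'}_{t_1}^{p_t} \cdots {a'}_{t_{t_t}}^{p_t} = (c\, c_1^{p_1} c_2^{p_2} \cdots c_t^{p_t})\, a_{u_1}^{p_{u_1}} \cdots a_{u_v}^{p_{u_v}} \sum_{\rho=1}^{\mu} a_{b_1}^{p_{b_1}} \cdots a_{b_w}^{p_{b_w}}. \tag{4.15} \]

To show the equivalence above, we put $(c\, c_1^{p_1} c_2^{p_2} \cdots c_t^{p_t})\, a_{u_1}^{p_{u_1}} \cdots a_{u_v}^{p_{u_v}}$ of summative aggregate 4.14 back inside the $\sum$ operator. Then, the following equality holds, since the left-hand side and the right-hand side of the equation are literally equivalent ($q = v + w$):

\[ \sum_{\rho=1}^{\mu} (c\, c_1^{p_1} \cdots c_t^{p_t})\, {a'}_{1_1}^{p_1} \cdots {a'}_{t_{t_t}}^{p_t} = \sum_{\rho=1}^{\mu} (c\, c_1^{p_1} \cdots c_t^{p_t})\, a_{u_1}^{p_{u_1}} \cdots a_{u_v}^{p_{u_v}}\, a_{b_1}^{p_{b_1}} \cdots a_{b_w}^{p_{b_w}} \]

Let $\phi$ denote $(c\, c_1^{p_1} \cdots c_t^{p_t})\, a_{u_1}^{p_{u_1}} \cdots a_{u_v}^{p_{u_v}}$, and let $\phi_\rho$ denote its value in the $\rho$th term of the summation. Since $\phi$ is unbound to $\rho$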
(i.e., there is no occurrence of $\rho$ in $\phi$), all $\phi_\rho$'s ($1 \le \rho \le \mu$) are the same in the above equation. Therefore, we get the following equation, thereby proving the equality of equation 4.15:

\[ \phi\,[(a_{b_1}^{p_{b_1}} \cdots a_{b_w}^{p_{b_w}})_1 + (a_{b_1}^{p_{b_1}} \cdots a_{b_w}^{p_{b_w}})_2 + \cdots + (a_{b_1}^{p_{b_1}} \cdots a_{b_w}^{p_{b_w}})_\mu] = (c\, c_1^{p_1} c_2^{p_2} \cdots c_t^{p_t})\, a_{u_1}^{p_{u_1}} \cdots a_{u_v}^{p_{u_v}} \sum_{\rho=1}^{\mu} a_{b_1}^{p_{b_1}} \cdots a_{b_w}^{p_{b_w}}. \]

The next step is to show the equivalence between summative aggregates 4.12 and 4.13. If any factor $a_l^{p_l}$ in summative aggregate 4.12 is a power of a summative aggregate (i.e., $a_l$ is a summative aggregate), the factor is normalized to $c_l^{p_l}\, {a'}_{l_1}^{p_l} {a'}_{l_2}^{p_l} \cdots {a'}_{l_{t_l}}^{p_l}$. Then, the proof shown above implies the equivalence of $a_l^{p_l}$ and $c_l^{p_l}\, {a'}_{l_1}^{p_l} {a'}_{l_2}^{p_l} \cdots {a'}_{l_{t_l}}^{p_l}$. (Equation 4.15 directly proves $a_l = c_l\, a'_{l_1} a'_{l_2} \cdots a'_{l_{t_l}}$.) If the summation body of $a_l$ contains other summative aggregates in turn, they are (recursively) equivalently normalized before $a_l$ is normalized. Therefore, we can conclude that summative aggregates 4.12 and 4.13 are equivalent, thereby proving Theorem 3.

Normalization of extended summative aggregates. In order to normalize an extended summative aggregate, $\mathcal{F}(*) = g(\mathcal{F}_1(*'), \mathcal{F}_2(*'), \ldots, \mathcal{F}_v(*'))$ ($v \ge 1$), it is decomposed first, yielding $\mathcal{F}(*) = g(\mathcal{F}'_1(*'), \mathcal{F}'_2(*'), \ldots, \mathcal{F}'_v(*'))$. Then, each decomposed argument summative aggregate $\mathcal{F}'_s(*')$ ($1 \le s \le v$) is normalized, yielding $\mathcal{F}''_s(*')$. Then, the decomposed argument summative aggregates are substituted with the normalized ones, yielding $\mathcal{F}(*) = g(\mathcal{F}''_1(*'), \mathcal{F}''_2(*'), \ldots, \mathcal{F}''_v(*'))$.

Example 4.3.13 The following nested summative aggregates are converted to their respective normalized forms.

\[ \mathcal{F}_1(X_n, Y_n) = \sum_{i=1}^{n} \Bigl( x_i \sum_{j=1}^{i} 2 i^2 y_j \Bigr) = 2 \sum_{i=1}^{n} \Bigl( i^2 x_i \sum_{j=1}^{i} y_j \Bigr) \]
In $\mathcal{F}_1$, 2 and $i^2$ are unbound in $\sum_{j=1}^{i} 2 i^2 y_j$. So, these entities are moved out of index variable $j$'s scope. As a constant, 2 is further moved out of index variable $i$'s scope.

\[ \mathcal{F}_2(X_n, Y_m) = \sum_{i=1}^{n} \Bigl( x_i \sum_{j=1}^{m} y_j \Bigr) = \sum_{j=1}^{m} y_j \sum_{i=1}^{n} x_i \]

In $\mathcal{F}_2$, the component aggregate $\sum_{j=1}^{m} y_j$ is an unbound summative aggregate. Thus, it can be treated as a constant from the viewpoint of any surrounding summative aggregate and moved out of index variable $i$'s scope. Note also that the normalized $\mathcal{F}_2$ is no longer an ordinary summative aggregate, but an extended summative aggregate. (The same holds for $\mathcal{F}_3$ below.)

\[ \mathcal{F}_3(X_n, Y_n, Z_n) = \sum_{i=1}^{n} \Bigl( x_i \sum_{j=1}^{i} \Bigl( y_j \sum_{k=1}^{n} j z_k \Bigr) \Bigr) = \sum_{i=1}^{n} \Bigl( x_i \sum_{k=1}^{n} z_k \sum_{j=1}^{i} j y_j \Bigr) = \sum_{k=1}^{n} z_k \sum_{i=1}^{n} \Bigl( x_i \sum_{j=1}^{i} j y_j \Bigr) \]

In $\mathcal{F}_3$, the component summative aggregate $\sum_{k=1}^{n} j z_k$ is unbound and, in itself, contains an unbound index variable $j$. So, it moves $j$ up to $j$'s scope and it itself moves out of the outermost $i$'s scope.

\[ \mathcal{F}_4(X_n, Y_m) = \sum_{i=1}^{n} \Bigl( x_i + \sum_{j=1}^{m} y_j \Bigr)^{1/2} \]

In $\mathcal{F}_4$, on the other hand, even though $\sum_{j=1}^{m} y_j$ is unbound, there is no known way of moving the unbound summative aggregate out of the outer index variable $i$'s scope. Therefore, $\mathcal{F}_4$ is unnormalizable. ◁

Example 4.3.14 In this example, the standard deviation is normalized as an extended summative aggregate. The standard deviation has two argument summative aggregates, $\sum_{i=1}^{n} (x_i - \bar{x})^2$ and $n$ (i.e., $\sum_{i=1}^{n} 1$), which are decomposed in equations 4.18 and 4.19 below. Then, the decomposed argument summative aggregates are normalized in equation 4.20. Note that in the normalized aggregate, there are only three unique summative aggregates, $\sum_{i=1}^{n} x_i^2$, $\sum_{i=1}^{n} x_i$, and $\sum_{i=1}^{n} 1$.
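\[ \mathit{stddev}(X_n) = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \tag{4.16} \]

\[ = \sqrt{\frac{\sum_{i=1}^{n} \bigl( x_i - \sum_{k=1}^{n} x_k \big/ \sum_{k=1}^{n} 1 \bigr)^2}{(\sum_{i=1}^{n} 1) - 1}} \tag{4.17} \]

\[ = \sqrt{\frac{\sum_{i=1}^{n} \Bigl( x_i^2 - 2 x_i \frac{\sum_{k=1}^{n} x_k}{\sum_{k=1}^{n} 1} + \Bigl( \frac{\sum_{k=1}^{n} x_k}{\sum_{k=1}^{n} 1} \Bigr)^2 \Bigr)}{(\sum_{i=1}^{n} 1) - 1}} \tag{4.18} \]

\[ = \sqrt{\frac{\sum_{i=1}^{n} x_i^2 - \sum_{i=1}^{n} 2 x_i \frac{\sum_{k=1}^{n} x_k}{\sum_{k=1}^{n} 1} + \sum_{i=1}^{n} \Bigl( \frac{\sum_{k=1}^{n} x_k}{\sum_{k=1}^{n} 1} \Bigr)^2}{(\sum_{i=1}^{n} 1) - 1}} \tag{4.19} \]

\[ = \sqrt{\frac{\sum_{i=1}^{n} x_i^2 - 2 \frac{\sum_{k=1}^{n} x_k}{\sum_{k=1}^{n} 1} \sum_{i=1}^{n} x_i + \Bigl( \frac{\sum_{k=1}^{n} x_k}{\sum_{k=1}^{n} 1} \Bigr)^2 \sum_{i=1}^{n} 1}{(\sum_{i=1}^{n} 1) - 1}} \tag{4.20} \]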
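Example 4.3.14 can be made concrete with a minimal sketch that maintains only the three component aggregates it names ($\sum x_i^2$, $\sum x_i$, and $\sum 1$) and derives the standard deviation from them on request. The class `StdDevCache` is hypothetical, and the sketch assumes the sample form of the standard deviation, with divisor $(\sum 1) - 1$.

```python
import math


class StdDevCache:
    """Maintains the three component aggregates of the standard deviation.

    Hypothetical illustration; assumes the sample standard deviation
    (divisor n - 1).
    """

    def __init__(self):
        self.sum_x2 = 0.0  # sum of x_i**2
        self.sum_x = 0.0   # sum of x_i
        self.n = 0         # sum of 1

    def insert(self, x):
        self.sum_x2 += x * x
        self.sum_x += x
        self.n += 1

    def delete(self, x):
        self.sum_x2 -= x * x
        self.sum_x -= x
        self.n -= 1

    def stddev(self):
        if self.n < 2:
            raise ValueError("sample standard deviation needs n >= 2")
        mean = self.sum_x / self.n
        # sum (x_i - mean)**2 expressed through the three cached components
        squares = self.sum_x2 - 2 * mean * self.sum_x + mean * mean * self.n
        # guard against a tiny negative value caused by rounding
        return math.sqrt(max(squares, 0.0) / (self.n - 1))
```

One caveat of this design is numerical: subtracting large, nearly equal sums can lose precision in floating point, so a production implementation might prefer a compensated or Welford-style update.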
Let us compare $\mathcal{F}_2$ and $\mathcal{F}_4$ first. It should be noted that without the normalization, even $\mathcal{F}_2$ cannot be incrementally updated. $\mathcal{F}_2$ has a summation body of $(x_i \sum_{j=1}^{m} y_j)$ that nests an unbound summative aggregate $\sum_{j=1}^{m} y_j$. One cannot save the value of $\mathcal{F}_2(X_{n-1}, Y_{m-1}) = \sum_{i=1}^{n-1} (x_i \sum_{j=1}^{m-1} y_j)$ and use that value to compute $\mathcal{F}_2(X_n, Y_m)$ on insertion of $(x_n, y_m)$, due to the following inequality:

\[ \mathcal{F}_2(X_n, Y_m) = \sum_{i=1}^{n} \Bigl( x_i \sum_{j=1}^{m} y_j \Bigr) \ne \sum_{i=1}^{n-1} \Bigl( x_i \sum_{j=1}^{m-1} y_j \Bigr) + \Bigl( x_n \sum_{j=1}^{m} y_j \Bigr) = \mathcal{F}_2(X_{n-1}, Y_{m-1}) + \Bigl( x_n \sum_{j=1}^{m} y_j \Bigr). \]

Only after normalizing $\mathcal{F}_2$ is the following equality obtained, thereby making incremental update possible:

\[ \mathcal{F}_2(X_n, Y_m) = \Bigl( \sum_{i=1}^{n} x_i \Bigr) \Bigl( \sum_{j=1}^{m} y_j \Bigr) = \Bigl( \sum_{i=1}^{n-1} x_i + x_n \Bigr) \Bigl( \sum_{j=1}^{m-1} y_j + y_m \Bigr). \]

The reason why $\mathcal{F}_2$ is not incrementally updatable before the normalization is that the unbound summative aggregate $\sum_{j=1}^{m} y_j$ within the scope of index variable $i$ should act as a constant while it takes part in the outer summation, but without the normalization its value varies across successive incremental updates. By normalizing $\mathcal{F}_2$, however, the original aggregate is transformed into a product of two summative aggregates, each of which can be incrementally updated independently. Then, by Lemma 5, the normalized summative aggregate is incrementally updatable.

On the other hand, as mentioned in Example 4.3.13, $\mathcal{F}_4$ cannot be normalized. Similar to $\mathcal{F}_2$ before the normalization, $\mathcal{F}_4$ has the following inequality:

\[ \sum_{i=1}^{n} \Bigl( x_i + \sum_{j=1}^{m} y_j \Bigr)^{1/2} \ne \sum_{i=1}^{n-1} \Bigl( x_i + \sum_{j=1}^{m-1} y_j \Bigr)^{1/2} + \Bigl( x_n + \sum_{j=1}^{m} y_j \Bigr)^{1/2} \]
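Since $\mathcal{F}_4$ cannot be normalized, there is no chance of $\mathcal{F}_4$ being incrementally updatable.

$\mathcal{F}_1$ and $\mathcal{F}_3$ are incrementally updatable, as mentioned. Unlike $\mathcal{F}_2$, however, these summative aggregates are still nested even after the normalization. Nonetheless, $\mathcal{F}_1$ has the following equality:

\[ \mathcal{F}_1(X_n, Y_n) = 2 \sum_{i=1}^{n} \Bigl( i^2 x_i \sum_{j=1}^{i} y_j \Bigr) = \Bigl( 2 \sum_{i=1}^{n-1} \Bigl( i^2 x_i \sum_{j=1}^{i} y_j \Bigr) \Bigr) + \Bigl( 2 n^2 x_n \sum_{j=1}^{n} y_j \Bigr) = \mathcal{F}_1(X_{n-1}, Y_{n-1}) + \Bigl( 2 n^2 x_n \sum_{j=1}^{n} y_j \Bigr) \]

The above equation itself can serve as a proof by induction of the incremental updatability. When $n = 1$ (base), the equality holds trivially, letting $\mathcal{F}_1(X_0, Y_0) = 0$. Assuming $\mathcal{F}_1(X_{n-1}, Y_{n-1})$ is incrementally updatable (induction hypothesis), $\mathcal{F}_1(X_n, Y_n)$ is incrementally updatable since $\sum_{j=1}^{n} y_j$ too is incrementally updatable. Thus, the claim is proved.

Now consider a little experiment with $\mathcal{F}_4$. $\mathcal{F}_4$ is modified as below so that the inner summative aggregate becomes bound:

\[ \mathcal{F}_5(X_n, Y_n) = \sum_{i=1}^{n} \Bigl( x_i + \sum_{j=1}^{i} y_j \Bigr)^{1/2} = \sum_{i=1}^{n-1} \Bigl( x_i + \sum_{j=1}^{i} y_j \Bigr)^{1/2} + \Bigl( x_n + \sum_{j=1}^{n} y_j \Bigr)^{1/2} = \mathcal{F}_5(X_{n-1}, Y_{n-1}) + \Bigl( x_n + \sum_{i=1}^{n} y_i \Bigr)^{1/2} \]

A similar induction can be used to prove the correctness of the above equality. Therefore, $\mathcal{F}_5$ is incrementally updatable.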
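The contrast between the unnormalized and normalized forms can be checked numerically with a small sketch. It assumes the readings $\mathcal{F}_2 = \sum_i (x_i \sum_j y_j)$ and, for the normalized $\mathcal{F}_1$, $2 \sum_i (i^2 x_i \sum_{j \le i} y_j)$, and it assumes that $x$ and $y$ values arrive in pairs (so $n = m$). All names in the code are hypothetical.

```python
def f2_direct(xs, ys):
    """F2 as written before normalization: sum_i (x_i * sum_j y_j)."""
    return sum(x * sum(ys) for x in xs)


def f1_direct(xs, ys):
    """Normalized F1: 2 * sum_i (i**2 * x_i * sum_{j<=i} y_j), i from 1."""
    return 2 * sum((i ** 2) * xs[i - 1] * sum(ys[:i])
                   for i in range(1, len(xs) + 1))


class IncrementalF1F2:
    """Maintains F1 and the two components of normalized F2 on insertion."""

    def __init__(self):
        self.n = 0
        self.sum_x = 0  # component of normalized F2
        self.sum_y = 0  # component of normalized F2, also F1's inner sum
        self.f1 = 0

    def insert(self, x, y):
        self.n += 1
        self.sum_x += x
        self.sum_y += y  # now equals sum_{j<=n} y_j
        # F1(X_n, Y_n) = F1(X_{n-1}, Y_{n-1}) + 2 * n**2 * x_n * sum_{j<=n} y_j
        self.f1 += 2 * self.n ** 2 * x * self.sum_y

    def f2(self):
        # normalized F2 is the product of two independently updated sums
        return self.sum_x * self.sum_y


def naive_f2_update(xs, ys):
    """The invalid shortcut: old F2 value plus x_n * sum_j y_j."""
    return f2_direct(xs[:-1], ys[:-1]) + xs[-1] * sum(ys)
```

With `xs = [1, 2, 3]` and `ys = [4, 5, 6]`, the naive shortcut disagrees with recomputation, while the incrementally maintained values agree with `f2_direct` and `f1_direct`.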
For every summative aggregate, it is true that if the summative aggregate contains any unbound variables or unbound summative aggregates, the summative aggregate in question cannot be incrementally updated.

Lemma 6 A summative aggregate that contains only bound entities is incrementally updatable if, recursively, all its component summative aggregates, if any, contain only bound entities.

Proof of Lemma 6 For any summative aggregate,

\[ \mathcal{F}(X_n, Y_n, \ldots, Z_n) = \sum_{i=1}^{n} f(x_i, y_i, \ldots, z_i, i), \]

the proof is by induction. (Below we show only the case of insertion. For the case of deletion, similar induction steps can be easily applied.)

1. When $n = 1$ (base): Letting $\mathcal{F}(X_0, Y_0, \ldots, Z_0) = 0$, the incremental update on the first insertion $(x_1, y_1, \ldots, z_1)$ gives $\mathcal{F}(X_1, Y_1, \ldots, Z_1) = \mathcal{F}(X_0, Y_0, \ldots, Z_0) + f(x_1, y_1, \ldots, z_1, 1)$. Thus, the summative aggregate is incrementally updatable on the first insertion.

2. Induction hypothesis: Suppose for $n > 1$ that $\mathcal{F}(X_{n-1}, Y_{n-1}, \ldots, Z_{n-1})$ is incrementally updatable.^11

3. Induction: On the $n$th insertion $(x_n, y_n, \ldots, z_n)$, by the induction hypothesis,

\[ \mathcal{F}(X_n, Y_n, \ldots, Z_n) = \Bigl( \sum_{i=1}^{n-1} f(x_i, y_i, \ldots, z_i, i) \Bigr) + f(x_n, y_n, \ldots, z_n, n) = \mathcal{F}(X_{n-1}, Y_{n-1}, \ldots, Z_{n-1}) + f(x_n, y_n, \ldots, z_n, n). \]

^11 It should be recalled that $n$ is not directly used in aggregate computations. Its sole purpose is to distinguish the previous (or the next) update from the current update. If $n$ appears in a summation body, it should be interpreted as the aggregate count(*).
Therefore, if $f(x_n, y_n, \ldots, z_n, n)$ can be incrementally computed, $\mathcal{F}(X_n, Y_n, \ldots, Z_n)$ is incrementally updatable. By the postulation of the lemma, all entities in $f(x_n, y_n, \ldots, z_n, n)$ are bound, so their values, except those of bound component summative aggregates, are immediately known. Thus, if there is no component summative aggregate in $f()$, the value of $f()$ can be computed immediately. Otherwise, $f()$ waits until the values of all component aggregates, say, $\mathcal{F}'()$'s, become known. The $\mathcal{F}'()$'s, in turn, are all bound by the postulation. Therefore, by applying induction steps similar to those thus far, each $\mathcal{F}'()$ can be immediately computed by adding its previous value to the value of its summation body, say, $f'(x_n, y_n, \ldots, z_n, n)$, if $f'(x_n, y_n, \ldots, z_n, n)$ can be computed immediately. Otherwise, $\mathcal{F}'()$ waits again. Applying these steps repeatedly, the computation eventually reaches the deepest nest, where no summative aggregates are used. Then, the summation body of that deepest aggregate can be computed from the values in $(x_n, y_n, \ldots, z_n)$, the recursion is unwound up into $f()$, and eventually $f(x_n, y_n, \ldots, z_n, n)$ is computed.

Theorem 4 Every normalized summative aggregate is incrementally updatable.

Proof of Theorem 4 Straightforward from Lemma 6 and the definition of normalization.

Corollary 1 Every normalized extended summative aggregate is incrementally updatable.

Proof of Corollary 1 Straightforward from Lemma 5 and the definition of normalization of extended summative aggregates.
It should be noted that Corollary 1 does not imply that the class of normalized (rather, normalizable) extended summative aggregates is the largest class of summative aggregates that can be incrementally updated. There are a few cases in which summative aggregates that are unnormalizable by our normalization method can be incrementally updated by "manually" normalizing them, as shown in the following example.

Example 4.3.15 Given a summative aggregate $\sum_{i=1}^{n} \log(n / x_i)$, if the user manually expands equation 4.21 into equation 4.22 (equivalently, if the user writes the summative aggregate in the form of equation 4.22), the given summative aggregate can be normalized to equation 4.24, thereby becoming incrementally updatable.

\[ \sum_{i=1}^{n} \log\Bigl(\frac{n}{x_i}\Bigr) = \sum_{i=1}^{n} \log\Bigl(\frac{\sum_{j=1}^{n} 1}{x_i}\Bigr) \tag{4.21} \]
\[ = \sum_{i=1}^{n} \Bigl( \log \sum_{j=1}^{n} 1 - \log x_i \Bigr) \tag{4.22} \]
\[ = \sum_{i=1}^{n} \Bigl( \log \sum_{j=1}^{n} 1 \Bigr) - \sum_{i=1}^{n} \log x_i \tag{4.23} \]
\[ = \Bigl( \log \sum_{j=1}^{n} 1 \Bigr) \sum_{i=1}^{n} 1 - \sum_{i=1}^{n} \log x_i \tag{4.24} \]

4.4 Looking Up Cached Aggregates

In the preceding sections we have described how aggregates, especially summative aggregates, cached in the aggregate cache can be updated incrementally. In this section we investigate an efficient way of looking up a cached aggregate that is requested by a query.

Unlike in an ordinary cache mechanism, cache lookup is of great importance in the aggregate cache. First of all, cache lookup is no longer a simple task. For system built-in aggregates, finding them in the cache still remains simple since they must
have fixed names and fixed argument formats. However, for user-defined aggregates, their names cannot be good keys for locating them. Moreover, it is a doubtful approach to allow the users to name their aggregates. The aggregate cache needs to be as transparent to the users as possible. Explicitly naming an aggregate implies that if the name is forgotten or not known, the cached aggregate cannot be used or shared. Related and more important is the fact that even when an aggregate is requested for the first time, and thus is not cached yet, a similar aggregate may already be cached from which the requested aggregate can be derived. If cache lookup relies entirely on explicit naming, it is not possible to discover such derivability.

Example 4.4.1 Suppose that two users are using the same database. Suppose further that average is not a built-in aggregate. One user wants to compute the average over an aggregate input set $X$, and names it avg(X). Not long after the first user has completed the computation, the second user wants the same aggregate but does not know that it has already been computed under the name avg. So, he/she requests the aggregate by the name average(X), and the same average is recomputed even though a copy is in the cache. Now, a third user comes in and wants sum, which is again supposed not to be built-in, over $X$. If average(X) and count(X) are in the cache, it should be possible to derive sum(X) from them. But if any of the three is user-defined and its semantics is not correctly understood by the system, such a derivation would not be possible. ◁

In our approach, we provide a small number of built-in (predefined) aggregates that have fixed names and semantics. User-defined summative aggregates can have names too, but those names are only for definitional convenience. That is, a user can define named summative aggregates and use the names in his/her queries to refer to the defined aggregates. However, when a summative aggregate
is cached, whether built-in or user-defined, it is decomposed and normalized into several smaller summative aggregates. Then, each of the decomposed summative aggregates is cached if it is not cached already. When a query requests a summative aggregate, the aggregate is looked up in the cache by following similar steps. The queried aggregate is parsed, decomposed, and normalized, and each decomposed aggregate is searched for in the cache. If all the decomposed aggregates are found in the cache, the value of the queried aggregate is readily computed by using the information obtained when the aggregate was parsed and the values retrieved from the cache. Thus, for cached summative aggregates, their names are immaterial as far as cache lookup is concerned.

Example 4.4.2 This is a comprehensive example. When $S_{xy}(X_n, Y_n)$, which was shown in Example 4.3.1, is cached or looked up in the cache, the following equations
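show how the original aggregate is decomposed and normalized.

\[ S_{xy}(X_n, Y_n) = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \tag{4.25} \]
\[ = \sum_{i=1}^{n} \Bigl( x_i - \frac{\sum_{k=1}^{n} x_k}{\sum_{k=1}^{n} 1} \Bigr) \Bigl( y_i - \frac{\sum_{k=1}^{n} y_k}{\sum_{k=1}^{n} 1} \Bigr) \tag{4.26} \]
\[ = \sum_{i=1}^{n} \Bigl( x_i y_i - x_i \frac{\sum_{k=1}^{n} y_k}{\sum_{k=1}^{n} 1} - y_i \frac{\sum_{k=1}^{n} x_k}{\sum_{k=1}^{n} 1} + \frac{\sum_{k=1}^{n} x_k}{\sum_{k=1}^{n} 1} \cdot \frac{\sum_{k=1}^{n} y_k}{\sum_{k=1}^{n} 1} \Bigr) \tag{4.27} \]
\[ = \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \frac{\sum_{k=1}^{n} y_k}{\sum_{k=1}^{n} 1} - \sum_{i=1}^{n} y_i \frac{\sum_{k=1}^{n} x_k}{\sum_{k=1}^{n} 1} + \sum_{i=1}^{n} \frac{\sum_{k=1}^{n} x_k}{\sum_{k=1}^{n} 1} \cdot \frac{\sum_{k=1}^{n} y_k}{\sum_{k=1}^{n} 1} \tag{4.28} \]
\[ = \sum_{i=1}^{n} x_i y_i - \frac{\sum_{k=1}^{n} y_k}{\sum_{k=1}^{n} 1} \sum_{i=1}^{n} x_i - \frac{\sum_{k=1}^{n} x_k}{\sum_{k=1}^{n} 1} \sum_{i=1}^{n} y_i + \frac{\sum_{k=1}^{n} x_k}{\sum_{k=1}^{n} 1} \cdot \frac{\sum_{k=1}^{n} y_k}{\sum_{k=1}^{n} 1} \sum_{i=1}^{n} 1 \tag{4.29} \]
\[ = \sum_{i=1}^{n} x_i y_i - \frac{2 \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n} + \frac{\sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n} \tag{4.30} \]
\[ = \sum_{i=1}^{n} x_i y_i - \frac{\sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n} \tag{4.31} \]

First, $\bar{x}$ and $\bar{y}$ (the averages of $X_n$ and $Y_n$) in equation 4.25 are rewritten as summative aggregates, resulting in equation 4.26. Equation 4.26 is then expanded to equation 4.27, and the $\sum$'s are distributed over the summation body of equation 4.27, resulting in equation 4.28. Now, the normalization is performed on equation 4.28. Equation 4.29 is the result of the normalization. It contains four unique normalized (decomposed) summative aggregates, $\sum_{i=1}^{n} x_i$, $\sum_{i=1}^{n} y_i$, $\sum_{i=1}^{n} x_i y_i$, and $\sum_{i=1}^{n} 1$, all of which are incrementally updatable. These summative aggregates are first searched for in the cache. If they are all found in the cache, then $S_{xy}(X_n, Y_n)$ can be obtained immediately. If any of them is not found, then that aggregate has to be recomputed from values in base databases (or in the data warehouse if an appropriate copy is being maintained). The value of $S_{xy}(X_n, Y_n)$ is obtained using cached/recomputed summative aggregates,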
and all the recomputed summative aggregates are cached for later use. Later on, if another query requests any of the cached summative aggregates, say, the summation of $X_n$, that aggregate will be found in the cache until it is replaced by another aggregate.

On the other hand, equations 4.30 and 4.31 show a process of algebraic simplification of equation 4.29. However, we do not carry out any such simplification. In most cases, the number of unique summative aggregates does not decrease even after such a simplification. Therefore, the penalty for not simplifying should be minimal. ◁

4.5 Conclusions

In this work we have proposed the aggregate cache to improve the performance of complex aggregate queries in the context of data warehouses. Considering that aggregate computation is a frequent operation in data warehouse applications and that such a computation is expensive to perform, reuse of once-computed results is a natural choice. On top of that, the temporal locality of aggregate accesses observed in decision-making processes makes such a cache approach more attractive.

The aggregate cache has a close bearing on view materialization, since a cached aggregate can be regarded as a materialized view of underlying tables in base databases. Just as incremental view update is an important issue in view materialization, incremental update of a cached aggregate as relevant underlying tables change is a crucial problem in the aggregate cache. If an underlying table is updated and the update affects a cached aggregate, the update should be propagated to the cache so that the cached aggregate too can be updated appropriately. Based on the currency requirement for cached aggregates, there are several schemes for when to update cached aggregates as base databases change. 
Rematerialization, periodic update, eager update, and on-demand update have been discussed, and the on-demand update has been chosen for the aggregate cache since it not only fits the cache philosophy well, but also outperforms the other approaches.
We have investigated a way of incrementally updating summative aggregates, which cover a vast variety of aggregates performing some types of cumulative operations. Importantly, we have identified a class of summative aggregates that is incrementally updatable and inclusive enough to cover many aggregates used in data warehouse applications, and proposed an efficient cache lookup method.
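The lookup method of Section 4.4 can be sketched, under simplifying assumptions, as a cache keyed by component aggregates rather than by user-chosen names. The class `AggregateCache`, the key strings such as `"sum(x*y)"`, and the list-of-pairs stand-in for base data are all hypothetical; parsing, decomposition, and normalization are assumed to have been done already, so the sketch starts from the four normalized components of $S_{xy}$ named in Example 4.4.2.

```python
class AggregateCache:
    """Toy cache keyed by canonical component-aggregate names (assumed form)."""

    # Summation bodies of the component aggregates; the key strings stand in
    # for whatever canonical form a real normalizer would produce.
    BODIES = {
        "sum(x)": lambda x, y: x,
        "sum(y)": lambda x, y: y,
        "sum(x*y)": lambda x, y: x * y,
        "sum(1)": lambda x, y: 1,
    }

    def __init__(self, rows):
        self.rows = rows      # stand-in for base data: a list of (x, y) pairs
        self.store = {}       # component key -> cached value
        self.recomputed = []  # keys that had to be computed from base data

    def lookup(self, key):
        """Return a component value, recomputing and caching it on a miss."""
        if key not in self.store:
            body = self.BODIES[key]
            self.store[key] = sum(body(x, y) for x, y in self.rows)
            self.recomputed.append(key)
        return self.store[key]

    def s_xy(self):
        """S_xy assembled from its four normalized components."""
        sx, sy = self.lookup("sum(x)"), self.lookup("sum(y)")
        sxy, n = self.lookup("sum(x*y)"), self.lookup("sum(1)")
        return sxy - (sy / n) * sx - (sx / n) * sy + (sx / n) * (sy / n) * n

    def average_x(self):
        """A different aggregate that reuses the same cached components."""
        return self.lookup("sum(x)") / self.lookup("sum(1)")
```

After `s_xy()` has been answered once, a later request for `average_x()` is served entirely from cached components, which is the kind of sharing that name-based lookup would miss.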
CHAPTER 5
CONCLUSIONS

In this work we have addressed two related issues within the framework of active databases. First, we have proposed a practical approach to static analysis of active rules and their confluent execution based on different user requirements. When the user wants fully confluent rule execution, that requirement can be easily met by removing conflicts in the rule set or by specifying priorities between the conflicting rules. If confluent rule execution is unnecessary, the system can avoid controlling rule executions. Confluence can also be enforced for only a subset of rules. Using our approach, it is also possible to support multiple, application-based confluence controls. In addition, our approach is well suited to parallel rule execution.

In the second part of our work, we have proposed the aggregate cache, a cache mechanism for aggregates used in data warehouses. The aggregate cache can improve the performance of aggregate computations significantly by saving previous results. It uses a novel approach to cache lookup, in which a queried aggregate is found in the aggregate cache by looking up its component aggregates. Our aggregate cache is transparent to the user; no intervention from the user is necessary to run the aggregate cache. Also importantly, we have identified a precise class of aggregates that can be incrementally updated.

We expect that the aggregate cache can be implemented by using active rules over active database systems. Change detection and propagation can be implemented using the event detection facility in the underlying active base databases. If the data warehouse is assumed to be an active database too, incrementally updating cached aggregates as changes are propagated to the aggregate cache can also be implemented using active rules. However,
PAGE 99
Â§1 the query processor of data warehouse should be modified appropriately to make use of cached aggregates.
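The idea of finding a queried aggregate through its component aggregates, and of refreshing cached components when a change is propagated, can be sketched as follows. This is a minimal illustration under stated assumptions, not the lookup method of this work: the `COMPONENTS` table, the `AggregateCache` class, and the `on_insert` hook (standing in for the action of an active rule) are all hypothetical names introduced here.

```python
from typing import Callable, Dict, List, Optional, Tuple

Key = Tuple[str, str]  # (aggregate function, column), e.g. ("SUM", "price")

# Hypothetical decomposition table: the components each aggregate is built from,
# and the function that combines them into the final answer.
COMPONENTS: Dict[str, Tuple[List[str], Callable[..., float]]] = {
    "SUM": (["SUM"], lambda s: s),
    "COUNT": (["COUNT"], lambda c: c),
    "AVG": (["SUM", "COUNT"], lambda s, c: s / c),  # assumes COUNT > 0
}


class AggregateCache:
    def __init__(self) -> None:
        self._store: Dict[Key, float] = {}

    def put(self, func: str, column: str, value: float) -> None:
        self._store[(func, column)] = value

    def lookup(self, func: str, column: str) -> Optional[float]:
        """Answer func(column) from cached components; return None on a miss."""
        parts, combine = COMPONENTS[func]
        values = []
        for part in parts:
            key = (part, column)
            if key not in self._store:
                return None  # a missing component means the query must be computed
            values.append(self._store[key])
        return combine(*values)

    def on_insert(self, column: str, x: float) -> None:
        """Apply a propagated insertion to the cached components of a column."""
        if ("SUM", column) in self._store:
            self._store[("SUM", column)] += x
        if ("COUNT", column) in self._store:
            self._store[("COUNT", column)] += 1


cache = AggregateCache()
cache.put("SUM", "price", 60.0)
cache.put("COUNT", "price", 3)
hit = cache.lookup("AVG", "price")      # assembled from cached SUM and COUNT
miss = cache.lookup("AVG", "quantity")  # no components cached for this column
cache.on_insert("price", 40.0)          # a change propagated to the cache
updated = cache.lookup("AVG", "price")
```

Because AVG is never stored directly, one pair of cached components serves SUM, COUNT, and AVG queries alike, and a single update to the components keeps all three answers current.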
REFERENCES

[1] B. Adelberg, B. Kao, and H. Garcia-Molina. Database support for efficiently maintaining derived data. Technical report, Department of Computer Science, Stanford University, Stanford, CA, 1995.
[2] R. Agrawal, R. Cochrane, and B. Lindsay. On maintaining priorities in a production rule system. In Proceedings International Conference on Very Large Data Bases, pages 479-487, Barcelona, Spain, 1991.
[3] A. Aiken, J. Hellerstein, and J. Widom. Static analysis techniques for predicting the behavior of active database rules. ACM Transactions on Database Systems, 20(1):3-41, Mar. 1995.
[4] A. Aiken, J. Widom, and J. Hellerstein. Behavior of database production rules: Termination, confluence, and observable determinism. In Proceedings International Conference on Management of Data, pages 59-68, San Diego, CA, 1992.
[5] E. Anwar, L. Maugis, and S. Chakravarthy. A new perspective on rule support for object-oriented databases. In Proceedings International Conference on Management of Data, pages 99-108, Washington, DC, May 1993.
[6] R. Badani. Nested transactions for concurrent execution of rules: Design and implementation. Master's thesis, CIS Department, University of Florida, Gainesville, FL, October 1993.
[7] E. Baralis and J. Widom. An algebraic approach to rule analysis in expert database systems. In Proceedings International Conference on Very Large Data Bases, pages 475-486, Santiago, Chile, 1994.
[8] J. Blakely, P. Larson, and F. Tompa. Efficiently updating materialized views. In Proceedings ACM SIGMOD Conference on Management of Data, pages 61-71, Los Angeles, May 1986.
[9] L. Brownston, R. Farrell, E. Kant, and N. Martin. Programming Expert Systems in OPS5: An Introduction to Rule-Based Programming. Addison-Wesley, Reading, MA, 1985.
[10] S. Ceri and J. Widom. Deriving production rules for constraint maintenance. In Proceedings International Conference on Very Large Data Bases, pages 566-577, Brisbane, Australia, 1990.
[11] S. Ceri and J. Widom. Deriving production rules for incremental view maintenance. In Proceedings International Conference on Very Large Data Bases, pages 577-589, Barcelona, Spain, 1991.
[12] S. Chakravarthy, B. Blaustein, A. P. Buchmann, M. Carey, U. Dayal, D. Goldhirsch, M. Hsu, R. Jauhari, R. Ladin, M. Livny, D. McCarthy, R. McKee, and A. Rosenthal. HiPAC: A research project in active, time-constrained database management (final report). Technical Report XAIT-89-02, Xerox Advanced Information Technology, Cambridge, MA, Aug. 1989.
[13] S. Chakravarthy, V. Krishnaprasad, E. Anwar, and S.-K. Kim. Composite events for active databases: Semantics, contexts, and detection. In Proceedings International Conference on Very Large Data Bases, pages 606-617, Santiago, Chile, Sep. 1994.
[14] S. Chakravarthy, Z. Tamizuddin, and J. Zhou. SIEVE: An interactive visualization and explanation tool for active databases. In Proc. of the 2nd International Workshop on Rules in Database Systems (RIDS'95), pages 179-191, October 1995.
[15] O. Diaz, A. Jaime, and N. W. Paton. Dear: A debugger for active rules in an object-oriented context. In Proc. of the 1st International Conference on Rules in Database Systems, September 1993.
[16] K. P. Eswaran. Specifications, implementations, and interactions of a trigger subsystem in an integrated data base system. IBM Research Report RJ1820, Aug. 1976.
[17] S. Gatziu and K. R. Dittrich. SAMOS: An active, object-oriented database system. IEEE Quarterly Bulletin on Data Engineering, 15(1-4):23-26, December 1992.
[18] S. Gatziu and K. R. Dittrich. Events in an object-oriented database system. In Proc. of the 1st International Conference on Rules in Database Systems, September 1993.
[19] N. Gehani and H. Jagadish. Ode as an active database: Constraints and triggers. In Proceedings International Conference on Very Large Data Bases, pages 327-336, Barcelona, Spain, 1991.
[20] N. H. Gehani, H. V. Jagadish, and O. Shmueli. COMPOSE: A system for composite event specification and detection. Technical report, AT&T Bell Laboratories, Murray Hill, NJ, December 1992.
[21] N. H. Gehani, H. V. Jagadish, and O. Shmueli. Event specification in an object-oriented database. In Proceedings International Conference on Management of Data, pages 81-90, San Diego, CA, June 1992.
[22] T. Griffin and L. Libkin. Incremental maintenance of views with duplicates. In Proceedings International Conference on Management of Data, pages 328-339, San Jose, CA, 1995.
[23] A. Gupta, V. Harinarayan, and D. Quass. Aggregate-query processing in data warehousing environments. In Proceedings International Conference on Very Large Data Bases, pages 358-369, Zurich, Switzerland, 1995.
[24] A. Gupta, I. Mumick, and K. Ross. Adapting materialized views after redefinitions. In Proceedings International Conference on Management of Data, pages 211-222, San Jose, CA, 1995.
[25] A. Gupta, I. S. Mumick, and V. S. Subrahmanian. Maintaining views incrementally. In Proceedings International Conference on Management of Data, pages 157-166, Washington, DC, 1993.
[26] E. Hanson. User-defined aggregates in the relational database system INGRES. Master's report, University of California, Berkeley, CA, 1984.
[27] E. Hanson. Efficient Support for Rules and Derived Objects in Relational Database Systems. PhD thesis, University of California, Berkeley, CA, 1987.
[28] E. Hanson. The design and implementation of the Ariel active database rule system. Technical Report UF-CIS-018-92, CIS Department, University of Florida, Gainesville, FL, 1992.
[29] E. Hanson. Rule condition testing and action execution in Ariel. In Proceedings International Conference on Management of Data, pages 49-58, San Diego, CA, 1992.
[30] E. Hanson and J. Widom. An overview of production rules in database systems. The Knowledge Engineering Review, 8(3):121-143, Sep. 1993.
[31] W. Inmon and R. Hackathorn. Using the Data Warehouse. John Wiley & Sons, Inc., New York, 1994.
[32] T. Sellis. Efficiently supporting procedures in relational database systems. In Proceedings International Conference on Management of Data, San Francisco, CA, 1987.
[33] IEEE Computer Society. IEEE data engineering bulletin: Special issue on materialized views and data warehousing, June 1995.
[34] M. Stonebraker, E. Hanson, and S. Potamianos. The POSTGRES rule manager. IEEE Transactions on Software Engineering, 14(7):897-907, July 1988.
[35] M. Stonebraker, A. Jhingran, J. Goh, and S. Potamianos. On rules, procedures, caching and views in database systems. In Proceedings International Conference on Management of Data, pages 281-290, Atlantic City, NJ, 1990.
[36] Z. Tamizuddin. Rule execution and visualization in active OODBMS. Master's thesis, CIS Department, University of Florida, Gainesville, FL, May 1994.
[37] J. Widom. Research problems in data warehousing. In Proceedings International Conference on Information and Knowledge Management (CIKM), Nov. 1995.
[38] J. Widom, R. Cochrane, and B. Lindsay. Implementing set-oriented production rules as an extension to Starburst. In Proceedings International Conference on Very Large Data Bases, pages 275-285, Barcelona, Spain, 1991.
[39] J. Widom and S. Finkelstein. Set-oriented production rules in relational database systems. In Proceedings International Conference on Management of Data, pages 259-270, Atlantic City, NJ, 1990.
[40] Y. Zhuge, H. Garcia-Molina, J. Hammer, and J. Widom. View maintenance in a warehousing environment. In Proceedings International Conference on Management of Data, pages 316-327, San Jose, CA, 1995.
BIOGRAPHICAL SKETCH

Seung-Kyum Kim was born on December 23, 1961, in Inju, Chungnam, South Korea. He received his Bachelor of Engineering degree in computer science from Ajou University, South Korea, in 1985. After graduating, he worked as a research engineer at ETRI (Electronics and Telecommunication Research Institute) in South Korea, where he took part in a research project developing a relational DBMS. In Fall 1990, he joined the Department of Computer and Information Science and Engineering at the University of Florida for his graduate studies. He received his Master of Science degree in computer and information science and engineering in 1993. Having continued his studies in the same department, he will receive his Doctor of Philosophy degree in May 1996. His research interests include active databases, data warehouses, temporal databases, and transaction processing.

