Group Title: Optimized trigger condition testing in Ariel using Gator Networks
Title: Optimized trigger condition testing in Ariel using Gator networks
Permanent Link: http://ufdc.ufl.edu/UF00095400/00002
 Material Information
Title: Optimized trigger condition testing in Ariel using Gator networks
Physical Description: Book
Language: English
Creator: Hanson, Eric N.
Publisher: Department of Computer and Information Science and Engineering, University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: November 12, 1997
 Record Information
Bibliographic ID: UF00095400
Volume ID: VID00002
Source Institution: University of Florida
Holding Location: University of Florida
Rights Management: All rights reserved by the source institution and holding location.

Table of Contents
    Introduction
    Gator networks
    Cost functions
    Optimization strategy
    Modifications to Ariel
    Optimizer characteristics and performance
    Performance evaluation and optimizer validation
    Conclusion
    References






1 Introduction

Rete and TREAT are rule condition testing structures that have been used both in production-rule
systems such as OPS5 and in active database systems [4, 3, 7]. It has been observed in a simulation study
that TREAT can outperform Rete, but the right Rete network can vastly outperform TREAT in some
situations [?]. The reason that TREAT is sometimes better than Rete, particularly in a limited-buffer-space
environment, is that the cost of maintaining materialized join (β) nodes is sometimes greater than
their benefit. However, if, for example, update frequency is skewed toward one or a few relations in the
database, a particular Rete network structure can significantly outperform TREAT, as well as other Rete
structures. It has been shown that Rete networks can be optimized, giving speedups of a factor of three
or more in real forward-chaining rule system programs [15], which are like sets of triggers operating on a
small, main-memory database. But even optimized Rete networks still have a fixed number of β nodes,
which take time to maintain and use up space. With Gator, it is possible to get additional advantages from
optimization, since β nodes are only materialized when they are beneficial.
This paper describes how trigger conditions can be tested using a Gator network, outlines a cost model for
Gator networks, and presents how the Gator optimizer and trigger condition matching algorithm have been
implemented in a modified version of Ariel. Performance figures are given that demonstrate a substantial
speedup in trigger condition testing performance using Gator compared with optimized Rete and TREAT.
The performance (running time) characteristics of the optimizer are given, along with a discussion of how the
quality of the optimization results varies with different optimizer parameters. In addition, information on
optimizer validation is included showing that the cost estimates the optimizer generates for Gator networks
correlate well with the actual cost of using the networks for trigger condition testing.



2 Gator Networks

Gator networks are made up of the following general components:

selection nodes These test single-relation selection conditions against descriptions of database tuple
updates, or tokens. Selection nodes are also sometimes called "t-const" nodes, since they typically test
tuples to see if they match constant values.

stored-α nodes These hold the set of tuples matching a single-table selection condition.

virtual-α nodes These are views containing a single-relation selection condition, but not the tuples
matching the condition.

β nodes These hold sets of tuples resulting from the join of two or more α and/or β nodes.

P-nodes There is one P-node for each trigger. If a trigger only involves one table, then its P-node has a
selection node as its input. If a trigger involves two or more tables, its P-node has as input two or more
α and/or β nodes. If new tuples arrive at the P-node, the trigger is fired. The P-node is logically the
root of a tree joining all the α and β nodes for the trigger. It is straightforward to use Gator networks
for materialized view maintenance by replacing the P-node of the network with a β node designated
as the materialized view.

root node The purpose of this node is to pass tokens to the selection nodes for testing. The root node is
not the root of the join tree. The term "root" is used for historical reasons because it is used in the
Rete algorithm [6].

By convention, the α nodes are drawn at the top, and the P-node is drawn at the bottom. In Gator
networks for triggers involving more than one table, β nodes and P-nodes can have two or more child nodes,
or "inputs." These inputs can be either α or β nodes. Every α and β node has a parent node that is either
a β node or a P-node.

Rete and TREAT networks are special cases of Gator networks. Rete networks are always binary trees,
with a full set of β nodes, all of which have two inputs. TREAT networks have no β nodes; all α nodes in
a TREAT network feed into the P-node.

To begin illustrating Gator networks with an example, consider the following schema describing real
estate for sale in a city, real estate customers and salespeople, and neighborhoods in the city.


customer(cno, name, phone, minprice, maxprice, spno)
salesperson(spno, name)
neighborhood(nno, name, desc)
desired_nh(cno, nno) ; desired neighborhoods for customers
covers_nh(spno, nno) ; neighborhoods covered by salespeople
house(hno, spno, address, nno, price, desc)

A trigger defined on this schema might be "If a customer of salesperson Iris is interested in a house in
a neighborhood that Iris represents, and there is a house available in the customer's desired price range in
that neighborhood, make this information known to Iris." This could be expressed as follows in the Ariel
rule language [7]:


define rule IrisRule
if salesperson.name = "Iris" and customer.spno = salesperson.spno
and customer.cno = desired_nh.cno and salesperson.spno = covers_nh.spno
and desired_nh.nno = covers_nh.nno and house.nno = desired_nh.nno
and house.price >= customer.minprice and house.price <= customer.maxprice
then raise event CustomerHouseMatch("Iris",customer.cno,house.hno)

The raise event command in the rule action is used to signal an application program, which would take
appropriate action [9]. Internally, Ariel represents the condition of a rule as a rule condition graph, similar
to a query graph. The structure of the rule condition graph for IrisRule is shown in Figure 1 (A).
Sample Rete, TREAT and Gator networks for IrisRule are shown in Figures 1 (B), 2 (A), and 2 (B),
respectively.


























Figure 1: Rule condition graph (A) and Rete network (B) for IrisRule.
[Figure not reproduced. The α nodes are α1 (reln = salesperson, name = "Iris"), α2 (reln = customer),
α3 (reln = desired_nh), α4 (reln = covers_nh) and α5 (reln = house). The Rete network in (B) joins
α1 with α2, then α3, then α4, then α5, and feeds the result to P-node(IrisRule).]

















Figure 2: TREAT network (A) and Gator network (B) for IrisRule.
[Figure not reproduced. In (A), α1 through α5 all feed P-node(IrisRule) directly. In (B), β1 joins α1, α2
and α3, and P-node(IrisRule) takes β1, α4 and α5 as inputs.]










Gator networks use objects called "+" tokens to represent inserted tuples, and "-" tokens to represent
deleted tuples. Modified tuples are treated as deletes followed by inserts.

When a + token is generated due to inserting a tuple in a table, it is propagated through the Gator
network to see if any triggers need to fire. Propagation of + tokens is explained below in object-oriented
terms, describing what happens when a token arrives at the types of nodes listed (- tokens are treated in a
comparable fashion; details are omitted):¹

root When the token arrives at the root node, the token is passed through a selection predicate index [8, 10]
to reduce the set of selection nodes whose conditions must be tested against the token. The token is
tested against each selection node that is not eliminated from consideration in the previous step. This
identifies each α node to which the token must be passed. The token is passed to each of these nodes in
turn.

stored α node The tuple contained in the token is inserted into the node. The node will have a list of one
or more other nodes called its "sibling" nodes, all of which have the same parent node. The token is
joined with its siblings, using a specific join order that was saved at the time the Gator network was
created (the choice of this join order is discussed in more detail later). A set of tuples is produced by
this join operation. These tuples are packaged as + tokens and passed to the parent node.

virtual α node The work done is the same as that for a stored α node, except that the token is not inserted
into the virtual α node.

β node The logic for this case is the same as for a stored α node.

P-node The rule is triggered for the data in the token.

As an example of Gator matching, suppose that the Gator network shown in Figure 2 (B) is being used,
and a new customer for Iris is inserted. This would cause the creation of a "+" token t1 containing the new
customer tuple. Token t1 would arrive at α2 and be inserted into α2. Then, it could be joined with either α1
or α3. Assume that it is joined first with α1, where it matches the tuple for Iris. The resulting joining
pair is joined with α3. If elements of α3 join with this pair, each joining triple is packaged as a + token and
forwarded to β1. Upon arriving at β1, a + token is stored in β1. Then, the token can be joined to either
α4 or α5 via the join conditions shown on the dashed edges from β1 to α4 or α5, respectively. Assume it is
joined to α4 first. The results would be joined next to α5. If a combination of tokens matched all the way
across the three nodes β1, α4 and α5 in this example, then that combination would be packaged as one +
token and placed in the P-node, triggering the rule.

¹ The actual Ariel implementation has a few other more specialized types of nodes (see [7]), but the token
propagation logic works as described here.
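The propagation rules above can be sketched in a few lines of Python. This is an illustrative sketch only, not Ariel's implementation: the `Node` class, `link`, `propagate_plus` and the `join_ok` predicate are hypothetical names, virtual α nodes and the selection predicate index are left out, and the order of a parent's child list stands in for the saved join order.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Node:
    """Hypothetical memory node: a stored alpha node, a beta node, or a P-node."""
    name: str
    is_pnode: bool = False
    parent: Optional["Node"] = None
    children: list = field(default_factory=list)
    tuples: list = field(default_factory=list)  # stored tuples (P-node: firings)


def link(parent: Node, *kids: Node) -> None:
    for kid in kids:
        parent.children.append(kid)
        kid.parent = parent


def propagate_plus(node: Node, combo: dict, join_ok: Callable[[dict], bool]) -> None:
    """Handle a + token carrying `combo` (relation name -> row) arriving at `node`."""
    node.tuples.append(combo)          # memory node: store it; P-node: record a firing
    if node.is_pnode:
        return
    partial = [combo]
    for sib in node.parent.children:   # child order stands in for the saved join order
        if sib is node:
            continue
        partial = [{**p, **t} for p in partial for t in sib.tuples
                   if join_ok({**p, **t})]
    for result in partial:             # each joining combination becomes a + token
        propagate_plus(node.parent, result, join_ok)


def join_ok(c: dict) -> bool:
    # Only one join condition of IrisRule is modelled in this toy example.
    if "customer" in c and "salesperson" in c:
        return c["customer"]["spno"] == c["salesperson"]["spno"]
    return True


a1, a2 = Node("alpha1"), Node("alpha2")
p = Node("P-node", is_pnode=True)
link(p, a1, a2)
propagate_plus(a1, {"salesperson": {"spno": 1, "name": "Iris"}}, join_ok)
propagate_plus(a2, {"customer": {"cno": 7, "spno": 1}}, join_ok)   # joins with Iris
propagate_plus(a2, {"customer": {"cno": 8, "spno": 2}}, join_ok)   # no match
```

In this toy network only the second customer insert reaches the P-node; a β node would be handled by the same function, since its logic matches that of a stored α node.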










3 Cost Functions


As part of this work, cost functions were developed to estimate the cost of a Gator network relative to other
Gator networks for a particular trigger. These functions are based on standard catalog statistics, such as
relation cardinality and attribute cardinality, as well as on update frequency. The catalogs of Ariel have
been extended to keep track of insert, delete and update frequency for each table. An update is considered
equivalent to a delete followed by an insert, except in the special case of triggers that have ON UPDATE
event specifications. The cost functions estimate the expense to propagate tokens through a Gator network,
assuming a frequency of token arrival at different nodes determined by the frequency statistics, relation
cardinality, attribute cardinality, selection and join predicate selectivity, and the presence of ON EVENT
specifications for relations appearing in a trigger condition.
The parameters relevant to the cost of a Gator network are as follows:

CPUweight:      The relative cost of a CPU operation.
I/Oweight:      The relative cost of an I/O operation.
R(α):           The base relation of the α node, α.
N:              Any node in a discrimination network: α, β or a P-node.
Sel(α):         The selectivity factor of the selection predicate associated with an α-memory node, α.
Fi(R):          The insert frequency of relation R relative to other relations.
Fd(R):          The delete frequency of relation R relative to other relations.
Pages(N):       The number of pages occupied by a node N.
Card(N):        Cardinality of N.
Card(N.attr):   Cardinality of attribute attr in N.
Card(R):        Cardinality of relation R.
Card(R.attr):   Cardinality of attribute attr in R.
fanout:         Fanout of a node in a B+-tree. Cost functions involving indexes are given
                only for B+-trees.


The cost model given below assumes that buffer space is limited, charging an amount I/Oweight for each
I/O. However, the effect of having large buffer space can be approximated by setting I/Oweight to zero
or some very small value. Both limited- and plentiful-buffer-space environments are addressed later in the
paper. The small and large buffer space cost models are referred to as CM1 and CM2, respectively.
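The cost formulas for a Gator network are defined recursively. The base case is for α nodes. The
cardinality and insert and delete frequency for α nodes are as follows:

\[
\begin{aligned}
\text{Cardinality:} \quad & \mathrm{Card}(\alpha) = \mathrm{Card}(R(\alpha)) \times \mathrm{Sel}(\alpha) \\
\text{Insert frequency:} \quad & F_i(\alpha) = F_i(R(\alpha)) \times \mathrm{Sel}(\alpha) \\
\text{Delete frequency:} \quad & F_d(\alpha) = F_d(R(\alpha)) \times \mathrm{Sel}(\alpha)
\end{aligned}
\]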


The cost of an α node is given by the cost of maintaining tuples that are stored in it. The insert cost
of an α node, Ci(α), is the cost of inserting a tuple into it. The delete cost, Cd(α), is the cost of deleting a
tuple from the α node. The insert and delete costs for virtual α nodes, stored α nodes with no indexes, and
stored α nodes with an index on a column involved in a join with another memory node are as follows:










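\[
C_i(\alpha) =
\begin{cases}
0, & \text{virtual } \alpha \text{ node} \\
\mathrm{CPUweight} + 2 \times \mathrm{I/Oweight}, & \text{stored } \alpha \text{ node with no index} \\
\mathrm{CPUweight} + 2 \times \mathrm{I/Oweight} + \lceil \log_{\mathrm{fanout}} \mathrm{Card}(\alpha) \rceil \times \mathrm{CPUweight}, & \text{stored } \alpha \text{ node with an index on a join attribute}
\end{cases}
\]

\[
C_d(\alpha) =
\begin{cases}
0, & \text{virtual } \alpha \text{ node} \\
\mathrm{CPUweight} \times \mathrm{Card}(\alpha) + \mathrm{I/Oweight} \times (\mathrm{Pages}(\alpha) + 1), & \text{stored } \alpha \text{ node with no index} \\
\mathrm{CPUweight} \times \mathrm{Card}(\alpha) + \mathrm{I/Oweight} \times (\mathrm{Pages}(\alpha) + 1) + \lceil \log_{\mathrm{fanout}} \mathrm{Card}(\alpha) \rceil \times \mathrm{CPUweight}, & \text{stored } \alpha \text{ node with an index on a join attribute}
\end{cases}
\]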


The complete formula for the cost of an α node is:

\[
\mathrm{Cost}(\alpha) = F_i(\alpha) \times C_i(\alpha) + F_d(\alpha) \times C_d(\alpha)
\]
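As a worked illustration of the α node formulas, the following Python sketch evaluates Cost(α) for the three kinds of α node. The function names, the weights (CPUweight = 1, I/Oweight = 10), the fanout of 100 and the catalog statistics are assumptions chosen for the example; they are not values from Ariel.

```python
def ceil_log(x, base):
    """Smallest integer h with base**h >= x, i.e. ceil(log_base x) for x >= 1."""
    h, reach = 0, 1
    while reach < x:
        reach *= base
        h += 1
    return h


def alpha_stats(card_r, fi_r, fd_r, sel):
    """Card(alpha), Fi(alpha), Fd(alpha) from base-relation statistics and Sel(alpha)."""
    return card_r * sel, fi_r * sel, fd_r * sel


def alpha_cost(kind, card, pages, fi, fd, cpu_w=1.0, io_w=10.0, fanout=100):
    """Cost(alpha) = Fi * Ci + Fd * Cd; kind is 'virtual', 'stored' or 'indexed'."""
    if kind == "virtual":
        return 0.0                               # nothing is stored, so nothing to maintain
    ci = cpu_w + 2 * io_w                        # insert cost, no index
    cd = cpu_w * card + io_w * (pages + 1)       # delete cost, no index
    if kind == "indexed":
        index_work = ceil_log(card, fanout) * cpu_w
        ci += index_work
        cd += index_work
    return fi * ci + fd * cd


# Hypothetical statistics: 10,000 base tuples, selectivity 0.01, node occupies 2 pages.
card, fi, fd = alpha_stats(card_r=10000, fi_r=0.5, fd_r=0.2, sel=0.01)
stored_cost = alpha_cost("stored", card, pages=2, fi=fi, fd=fd)     # about 0.365
indexed_cost = alpha_cost("indexed", card, pages=2, fi=fi, fd=fd)   # about 0.372
```

The index adds a small maintenance charge here; under this model it pays off only through the cheaper joins described later in this section.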


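The cost of a β node is defined as the cost of each of its children, plus the local cost of joining tokens that
arrive at each child with the other children, and of updating the stored β node itself.
Let

\[
\Pi_{R}(\beta) = \prod_{\alpha \in \mathrm{leaves}(\beta)} \mathrm{Card}(R(\alpha)), \qquad
\Pi_{Sel}(\beta) = \prod_{\alpha \in \mathrm{leaves}(\beta)} \mathrm{Sel}(\alpha), \qquad
\Pi_{JSF}(\beta) = \prod \mathrm{JSF}(\alpha_i, \alpha_j),
\]

where the last product ranges over pairs α_i, α_j ∈ leaves(β) whose relations are joined by an edge in the
rule condition graph.
Define the cardinality of a β node as

\[
\mathrm{Card}(\beta) = \Pi_{R}(\beta) \times \Pi_{Sel}(\beta) \times \Pi_{JSF}(\beta)
\]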
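A small numeric example may help. The sketch below evaluates Card(β) as the product of the leaf relation cardinalities, the leaf selectivities, and the JSF values of the condition graph edges among the leaves. The function name and all statistics are hypothetical, chosen only to make the arithmetic easy to follow.

```python
from math import prod


def beta_card(leaf_cards, leaf_sels, join_sfs):
    """Card(beta): product of Card(R(alpha)), Sel(alpha) and JSF over the leaves of beta."""
    return prod(leaf_cards) * prod(leaf_sels) * prod(join_sfs)


# Hypothetical beta node over salesperson (name = "Iris"), customer and desired_nh.
cards = [50, 1000, 3000]         # Card(R(alpha)) for the three leaves
sels = [0.02, 1.0, 1.0]          # only the salesperson leaf has a selection predicate
jsfs = [1 / 50, 1 / 1000]        # customer-salesperson and customer-desired_nh edges
estimated = beta_card(cards, sels, jsfs)   # about 60 tuples
```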


Let TokenGenCount(N, β) represent the number of tokens generated at a β node, β, due to a single
token arriving at a child node, N, of β.
Define the insert and delete frequencies (Fi and Fd) of a β node as:

\[
F_i(\beta) = \sum_{N \in \mathrm{children}(\beta)} F_i(N) \times \mathrm{TokenGenCount}(N, \beta)
\]

\[
F_d(\beta) = \sum_{N \in \mathrm{children}(\beta)} F_d(N) \times \mathrm{TokenGenCount}(N, \beta)
\]

JoinSizeAndCost(N, β) estimates the cost of performing a sequence of two-way joins when a token
arrives at node N, as well as the expected size of the result. The cost and size are calculated together for
convenience since the calculations are done in a similar way. JoinSizeAndCost(N, β) will be described in
detail later. The value of TokenGenCount(N, β) is obtained as follows:
TokenGenCount(N, β) { (size, cost) = JoinSizeAndCost(N, β); return(size) }
The cost of a β node is the sum of the following: (1) the cost of the children of the β node, (2) the
cost of performing joins for tokens fed into all the children of the β node, and (3) the cost associated with
maintaining (updating) the β node. A formula for the cost of a β node is thus:


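\[
\mathrm{Cost}(\beta) = \mathrm{LocalCost}(\beta) + \sum_{N \in \mathrm{children}(\beta)} \mathrm{Cost}(N)
\]

where

\[
\mathrm{LocalCost}(\beta) = \sum_{N \in \mathrm{children}(\beta)} \{ F_i(N) \times \mathrm{PerChildInsCost}(N, \beta) + F_d(N) \times \mathrm{PerChildDelCost}(N, \beta) \}
\]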

PerChildInsCost(N, β) and PerChildDelCost(N, β) indicate the respective costs of processing a + and -
token arriving at a child N of the β node. PerChildInsCost(N, β) consists of the cost of a multi-way join
performed on a tuple arriving at a child N of the β node and the cost of doing any needed inserts into the
β node. Following the join order plan associated with the node N, a sequence of two-way joins with each
of the siblings of this node is performed. PerChildDelCost(N, β) is analogous to PerChildInsCost(N, β).
JoinSizeAndCost(N, β) returns the estimated join size and the estimated join processing cost.


PerChildInsCost(N, β) {
    (size, cost) = JoinSizeAndCost(N, β)
    insertCost = ⌈size / tuplesPerPage(β)⌉ × 2 × I/Oweight + size × CPUweight
    return (cost + insertCost)
}

PerChildDelCost(N, β) {
    (size, cost) = JoinSizeAndCost(N, β)
    deleteCost = (Yao(Card(β), Pages(β), size) + Pages(β)) × I/Oweight +
                 Card(β) × CPUweight
    return (cost + deleteCost)
}

The function Yao(n, m, k) estimates the number of pages that would be touched when k tuples are randomly
selected from a relation or node that occupies m pages, each page containing n/m records.
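For readers unfamiliar with this estimate, the sketch below computes it with Yao's standard formula, m × (1 − ∏_{i=1..k} (n − n/m − i + 1)/(n − i + 1)), which assumes the k records are drawn without replacement. The source does not show how Ariel evaluates the function, so this Python version is an assumption; in particular, k is taken to be an integer here, whereas the cost model may pass fractional sizes.

```python
def yao(n, m, k):
    """Expected number of pages touched when k of n records are chosen at random
    (without replacement) from m pages that hold n/m records each."""
    if k <= 0:
        return 0.0
    per_page = n / m
    if k > n - per_page:
        return float(m)            # too many records chosen to miss any page
    miss = 1.0                     # probability that one given page is never touched
    for i in range(1, k + 1):
        miss *= (n - per_page - i + 1) / (n - i + 1)
    return m * (1.0 - miss)


one_record = yao(100, 10, 1)       # a single record touches one page
five_records = yao(100, 10, 5)     # between 1 and 5 pages
all_records = yao(100, 10, 100)    # every page is touched
```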
Consider again JoinSizeAndCost(N, β). In addition to size and update frequency information, its
calculation is also based in part on the join order plan of the node N. Let TR represent the temporary join
result formed during the join process, TRSize represent the cardinality of TR, and Card(TR.JoinAttr) represent
the estimated cardinality of the join attribute in TR. JoinSizeAndCost(N, β) can be computed as follows:

JoinSizeAndCost(N, β) {
    TRSize = 1
    TempCost = 0
    for each node n in the Join Order plan of N
    {
        TempCost = TempCost + TwoWayJoinCost(TRSize, n)
        TRSize = TRSize × Card(n) / max(Card(TR.JoinAttr), Card(n.JoinAttr))
    }
    return (TRSize, TempCost)
}



TwoWayJoinCost(TRSize, n) represents the cost of performing a join between the temporary result
(TR) of size TRSize and a node n. There are two cases to consider for computing TwoWayJoinCost. The
first case is when n is a stored α node (without an index on the join attribute), a β node, or a virtual α
node without an index on the join attribute on its base relation (for this case, Pages(n) refers to the pages
of the base relation of the virtual α node). The formula for this case is:

\[
\mathrm{TwoWayJoinCost}(\mathit{TRSize}, n) = \mathrm{Pages}(n) \times \mathrm{I/Oweight} + \mathit{TRSize} \times \mathrm{Card}(n) \times \mathrm{CPUweight}
\]


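The second case is when n is a stored α node with an index on a join attribute, or a virtual α node
with an index on the join attribute on its base relation. The cost for this case is:

\[
\begin{aligned}
\mathrm{TwoWayJoinCost}(\mathit{TRSize}, n) = {} & \mathit{TRSize} \times \mathrm{Yao}\!\left(\mathrm{Card}(n), \mathrm{Pages}(n), \left\lceil \frac{\mathrm{Card}(n)}{\mathrm{Card}(n.\mathit{JoinAttr})} \right\rceil\right) \times \mathrm{I/Oweight} \\
& + \mathit{TRSize} \times \left\lceil \frac{\mathrm{Card}(n)}{\mathrm{Card}(n.\mathit{JoinAttr})} \right\rceil \times \mathrm{CPUweight} \\
& + \mathit{TRSize} \times \lceil \log_{\mathrm{fanout}} \mathrm{Card}(n) \rceil \times \mathrm{CPUweight}
\end{aligned}
\]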
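The JoinSizeAndCost loop can be traced with a short Python sketch. It is a simplification under stated assumptions: only the first (no-index) case of TwoWayJoinCost is modelled, Card(TR.JoinAttr) is supplied by the caller for each step rather than estimated, and the class name, field names, weights and statistics are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class JoinStep:
    """One sibling n in the join order plan, with the statistics the model needs."""
    card: float               # Card(n)
    pages: float              # Pages(n)
    join_attr_card: float     # Card(n.JoinAttr)
    tr_join_attr_card: float  # Card(TR.JoinAttr) at this step (assumed to be given)


def two_way_join_cost(tr_size, n, cpu_w=1.0, io_w=10.0):
    """No-index case: read the pages of n and compare every TR tuple with every tuple of n."""
    return n.pages * io_w + tr_size * n.card * cpu_w


def join_size_and_cost(join_order, cpu_w=1.0, io_w=10.0):
    tr_size, temp_cost = 1.0, 0.0          # a single token starts the join
    for n in join_order:
        temp_cost += two_way_join_cost(tr_size, n, cpu_w, io_w)
        tr_size = tr_size * n.card / max(n.tr_join_attr_card, n.join_attr_card)
    return tr_size, temp_cost


plan = [
    JoinStep(card=100, pages=2, join_attr_card=50, tr_join_attr_card=1),
    JoinStep(card=1000, pages=20, join_attr_card=100, tr_join_attr_card=2),
]
size, cost = join_size_and_cost(plan)   # size 20.0, cost 2320.0
```

TokenGenCount(N, β) is simply the `size` component of this result.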


The cardinality of a join attribute jAttr in the temporary result TR is estimated in the following way [5]:
Let jAttr be the attribute of a relation R and let n = Card(R) and b = Card(R.jAttr). Assuming uniform
distribution of values, independence of value distribution in different columns, and random selection of values
for the join attribute in TR from the original relation, estimation of Card(TR.jAttr) can be reduced to the
following statistical problem: Given n objects uniformly distributed over b colors, how many different colors
are selected if one randomly selects t = Card(TR) objects? Bernstein et al. [2] give an approximation of this
value, shown here as the function Estimate(n, b, t):


\[
\mathrm{Estimate}(n, b, t) =
\begin{cases}
t, & \text{for } t < b/2 \\
(t + b)/3, & \text{for } b/2 \le t < 2b \\
b, & \text{for } t \ge 2b
\end{cases}
\]
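The three-piece approximation translates directly into code. The Python sketch below is a transcription of the formula as given; the function name is ours, and the parameter n is kept only to match the signature in the text, since the approximation does not use it.

```python
def estimate(n, b, t):
    """Approximate number of distinct colors (out of b) seen when t of n objects are drawn."""
    if t < b / 2:
        return float(t)        # few draws: almost every draw hits a new color
    if t < 2 * b:
        return (t + b) / 3     # intermediate range
    return float(b)            # many draws: every color is expected to appear


few = estimate(1000, 100, 10)      # 10.0
middle = estimate(1000, 100, 100)  # (100 + 100) / 3
many = estimate(1000, 100, 500)    # 100.0
```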


This function is used internally to estimate the cardinality of a join attribute in a temporary result. The
same approach is also used to estimate the cardinalities of attributes in the α and β nodes (except the
ones that participate in a selection condition).

Finally, the cost of a P-node, i.e. the cost of the whole discrimination network rooted at the P-node, is
given as follows:

\[
\mathrm{Cost}(P) = \mathrm{LocalCost}(P) + \sum_{N \in \mathrm{children}(P)} \mathrm{Cost}(N)
\]

where

\[
\mathrm{LocalCost}(P) = \sum_{N \in \mathrm{children}(P)} \{ F_i(N) \times \mathrm{PerChildInsCost}(N, P) + F_d(N) \times \mathrm{PerChildDelCost}(N, P) \}
\]

PerChildInsCost(N, P) is essentially the same as PerChildInsCost(N, β) except that there is no update
cost here. Since a P-node does not store any data, PerChildDelCost(N, P) involves no cost. This concludes
the presentation of the cost functions. The discussion now turns to how these cost functions can be used to
guide an optimizer in choosing a good discrimination network.



4 Optimization Strategy

For a given rule there can be many possible Gator networks. The efficiency of the rule condition testing
mechanism depends on the shape of the Gator network used. The number of Gator networks for a rule can
be extremely large compared to the number of left-deep Rete networks. This can be seen from the fact that
the set of left-deep binary trees is a subset of the set of binary trees, which is in turn a subset of the
set of rooted trees with fanout of two or more at each level (i.e., Gator networks).

To cope with the large search space for Gator networks, the Gator network optimizer implemented
in Ariel uses a randomized state-space search technique. The use of a randomized approach to Gator
network optimization was motivated by the fact that it has been used successfully for optimizing large join
queries [12], a problem that also has a very large search space. Early experiments were conducted [11, 19]
which demonstrated that a randomized approach to Gator network optimization is superior to a dynamic
programming (DP) approach like that used in traditional query optimizers [21]. With a DP approach, for
rules with more than seven tuple variables, optimization time to build a Gator network exceeded several
minutes. It was infeasible to optimize the network for rules with eight or more tuple variables. An approach
to optimizing Gator networks needs to be able to produce good-quality results in a few minutes or less for
up to 15 tuple variables. (This is the same as the limit on the number of tuple variables in an SQL SELECT
statement in at least one major commercial DBMS product; more tuple variables than this are rarely needed
in real applications.)

Three randomized state-space search strategies were considered: iterative improvement (II), simulated
annealing (SA) and two-phase optimization (TPO, a combination of II and SA). These generic algorithms
require the specification of three problem-specific parameters, namely state space, neighbors function and
cost function [14, 12, 13].
In the following discussion, two sibling nodes in the discrimination network are said to be connected if
the following holds. First, the condition graph node set of a Gator network node N, CGNS(N), is defined
to be the set of condition graph nodes corresponding to the leaf α nodes of N. Two sibling Gator network
nodes N1 and N2 are connected if there is a rule condition graph edge between an element of CGNS(N1)
and an element of CGNS(N2).

Figure 3: Local change operators (Create-Beta, Kill-Beta and Merge-Sibling). [Figure not reproduced.]
For the optimization of Gator networks, the following parameters were defined:

State Space The state space of the Gator network optimization problem for a given trigger is defined as
the set of all possible shapes of the complete Gator network for that trigger. Each possible shape of
the Gator network corresponds to a state in the state space. The state space is constrained so that
no β node is created that requires a cross product to be formed among two or more of its children.
It is assumed that all trigger condition graphs are connected, so it is always possible to find a Gator
network that does not require cross products.²

Neighbors Function The neighbors function in the optimization problem is specified by the following set
of transformation rules, which are also illustrated using examples in Figure 3.

Kill-Beta: Kill-Beta removes a randomly selected β node, KB, and adds the children of the
node KB as children of the parent of the node KB.

Create-Beta: Create-Beta adds a new β node, CB, to the discrimination network. It first
selects a random β node or the P-node (call this node PARENT). If PARENT has more than two
children, Create-Beta randomly selects two connected siblings rooted at PARENT, makes them
the children of CB and makes CB the child of PARENT.

Merge-Sibling: Merge-Sibling makes a node the child of one of its siblings. This operation first
selects a random β node or the P-node. If this node has more than two children, then two
connected siblings rooted at this node are randomly selected and one of them is made a child of
the other. The node to which a child is added must be a β node.

² If trigger condition graphs are not connected, the implementation adds dummy join edges with "true" as
the join condition to make them connected.










Cost Function The cost function is that outlined in section 3.
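To make the neighbors function concrete, the following Python sketch applies Kill-Beta and Create-Beta to a toy representation in which an α node is a relation name and a β node (or the P-node) is a list of children. The representation, the function names and the deterministic choice of which children to touch are assumptions for illustration; the optimizer described here picks nodes at random and works on Ariel's own data structures.

```python
def leaves(node):
    """Condition graph node set (CGNS) of a network node."""
    if isinstance(node, str):
        return {node}
    return set().union(*(leaves(c) for c in node))


def connected(n1, n2, edges):
    """True if some condition graph edge links a leaf of n1 to a leaf of n2."""
    return any(frozenset((a, b)) in edges for a in leaves(n1) for b in leaves(n2))


def kill_beta(parent, idx):
    """Remove the beta child parent[idx]; its children become children of parent."""
    beta = parent[idx]
    if not isinstance(beta, list):
        raise ValueError("child is not a beta node")
    return parent[:idx] + beta + parent[idx + 1:]


def create_beta(parent, i, j, edges):
    """Group children i and j of parent under a new beta node, or return None if illegal."""
    if len(parent) <= 2 or not connected(parent[i], parent[j], edges):
        return None                      # would be pointless or need a cross product
    rest = [c for k, c in enumerate(parent) if k not in (i, j)]
    return rest + [[parent[i], parent[j]]]


# Condition graph edges of IrisRule, and the TREAT network as the P-node's child list.
EDGES = {frozenset(e) for e in [
    ("salesperson", "customer"), ("customer", "desired_nh"),
    ("salesperson", "covers_nh"), ("desired_nh", "covers_nh"),
    ("house", "desired_nh"), ("house", "customer"),
]}
TREAT = ["salesperson", "customer", "desired_nh", "covers_nh", "house"]

with_beta = create_beta(TREAT, 0, 1, EDGES)         # salesperson and customer grouped
back_to_flat = kill_beta(with_beta, 3)              # same leaves, no beta node
rejected = create_beta(TREAT, 0, 4, EDGES)          # salesperson and house: no edge
```

Merge-Sibling can be written in the same style by appending one connected sibling to the child list of another sibling that is already a β node.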


The optimizer implemented is capable of using II, SA and TPO. Each of the II, SA and TPO algorithms
needs to be able to construct a random start state (feasible Gator network) given a condition graph for a
trigger. Random start states are built in the following way:

1. Assume the condition graph has N nodes. Then N α nodes are created and inserted into a list.

2. While there is more than one element in the list, a number K where 2 ≤ K ≤ N is generated. A single
starting element is selected from the list. Then, K − 1 siblings for this node are selected from among
the other elements of the list. This is done by following join edges leading out of the initially selected
element to identify other elements of the list that have a join relationship with the initially selected
element. The total of K elements identified are removed from the list, and a β node with them as
children is formed. This β node is inserted in the list.

When the list has only one element, that element is a complete Gator network for the trigger. A general
description of II, SA and TPO is given below.
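The two steps above can be sketched as follows. This Python version makes several assumptions that the text leaves open: K is drawn from the current list length, fewer than K − 1 siblings are taken when fewer are joinable with the starting element, and the single element left at the end is read as the child list of the P-node. The nested-list representation and all names are hypothetical.

```python
import random


def leaves(node):
    if isinstance(node, str):
        return {node}
    return set().union(*(leaves(c) for c in node))


def connected(n1, n2, edges):
    return any(frozenset((a, b)) in edges for a in leaves(n1) for b in leaves(n2))


def random_start_state(relations, edges, rng):
    pool = list(relations)                       # step 1: one alpha node per relation
    while len(pool) > 1:                         # step 2: repeatedly group K elements
        k = rng.randint(2, len(pool))
        i = rng.randrange(len(pool))             # the starting element
        joinable = [j for j in range(len(pool))
                    if j != i and connected(pool[i], pool[j], edges)]
        if not joinable:
            raise ValueError("condition graph is not connected")
        picked = rng.sample(joinable, min(k - 1, len(joinable)))
        chosen = {i, *picked}
        group = [pool[j] for j in sorted(chosen)]            # children of the new node
        pool = [x for j, x in enumerate(pool) if j not in chosen] + [group]
    return pool[0]


RELS = ["salesperson", "customer", "desired_nh", "covers_nh", "house"]
EDGES = {frozenset(e) for e in [
    ("salesperson", "customer"), ("customer", "desired_nh"),
    ("salesperson", "covers_nh"), ("desired_nh", "covers_nh"),
    ("house", "desired_nh"), ("house", "customer"),
]}
network = random_start_state(RELS, EDGES, random.Random(0))
```

Because every sibling is chosen among elements joinable with the starting element, no node built this way needs a cross product.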


4.1 Iterative Improvement

The Iterative Improvement (II) technique [23] performs a sequence of local optimizations initiated at multiple
random starting states. In each local optimization, it accepts random downhill movements until a local
minimum is reached. This sequence of starting with a random state and performing local optimizations is
repeated until some stopping condition is met. The final result is the local minimum with the lowest cost.
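A generic form of II is sketched below in Python on a toy one-dimensional problem. The function signature, the use of a fixed number of restarts as the stopping condition, and the toy cost function are assumptions for illustration; a state is treated as a local minimum when none of `n_probe` randomly chosen neighbors (with repetition) is cheaper.

```python
import random


def iterative_improvement(random_state, neighbors, cost, n_starts, n_probe, rng):
    best = None
    for _ in range(n_starts):                  # stopping condition: fixed number of restarts
        state = random_state(rng)
        while True:
            for _ in range(n_probe):           # probe random neighbors for a downhill move
                candidate = rng.choice(neighbors(state))
                if cost(candidate) < cost(state):
                    state = candidate          # accept the downhill move
                    break
            else:
                break                          # no probe was cheaper: local minimum
        if best is None or cost(state) < cost(best):
            best = state
    return best


def toy_cost(x):
    return (x - 7) ** 2


def toy_neighbors(x):
    return [y for y in (x - 1, x + 1) if 0 <= y <= 20]


best = iterative_improvement(lambda r: r.randint(0, 20), toy_neighbors, toy_cost,
                             n_starts=5, n_probe=32, rng=random.Random(1))
```

For Gator networks, `random_state` would build a random start state, `neighbors` would apply the local change operators, and `cost` would be the cost function of section 3.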


4.2 Simulated Annealing

Simulated Annealing (SA) is a Monte Carlo optimization technique proposed by Kirkpatrick et al. [16] for
problems with many degrees of freedom. It is a probabilistic hill-climbing approach where both uphill and
downhill moves are accepted. A downhill move (i.e. a move to a lower-cost state) is always accepted. The
probability with which uphill moves are accepted is controlled by a parameter called temperature. The
higher the value of temperature, the higher the probability of an uphill move. However, as the temperature
decreases with time, the chances of an uphill move tend to zero [16, 14].
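The role of the temperature can be illustrated with the usual Metropolis acceptance rule, under which an uphill move of size Δ is accepted with probability exp(−Δ/T). The text above does not state the exact rule used in the Ariel optimizer, so the rule, the geometric cooling schedule, and every name and parameter value in this Python sketch are assumptions.

```python
import math
import random


def accept(delta, temperature, rng):
    """Downhill moves (delta <= 0) always; uphill moves with probability exp(-delta / T)."""
    return delta <= 0 or rng.random() < math.exp(-delta / temperature)


def anneal(start, neighbors, cost, t0, cooling, moves_per_stage, t_min, rng):
    current = best = start
    t = t0
    while t > t_min:                             # "frozen" once the temperature is tiny
        for _ in range(moves_per_stage):         # moves made at one temperature
            nxt = rng.choice(neighbors(current))
            if accept(cost(nxt) - cost(current), t, rng):
                current = nxt
                if cost(current) < cost(best):
                    best = current
        t *= cooling                             # hypothetical geometric cooling
    return best


def toy_cost(x):
    return (x - 7) ** 2


def toy_neighbors(x):
    return [y for y in (x - 1, x + 1) if 0 <= y <= 20]


result = anneal(20, toy_neighbors, toy_cost, t0=10.0, cooling=0.9,
                moves_per_stage=10, t_min=0.01, rng=random.Random(2))
```

As T approaches zero, exp(−Δ/T) approaches zero for any uphill Δ, which matches the statement that the chance of an uphill move tends to zero as the temperature decreases.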


4.3 Two Phase Optimization

In its first phase, TPO runs II for a small period of time, performing a few local optimizations. The output
of the first phase, i.e. the best local minimum, is input as the initial state to SA, which is run with a
very low initial temperature. Intuitively, this approach picks a local minimum and then searches the space
around it. This approach is capable of extricating itself from local minima. However, the low initial
temperature makes climbing very high hills virtually impossible. It has been observed that TPO performs
better than both II and SA for optimizing large join queries [12].


4.4 Optimizer Tuning

The parameters used in this study for II, SA and TPO are given in Figures 4, 5 and 6, respectively. These
parameters were chosen after extensive experimentation and by following guidelines given in the literature
[1, 24, 23, 12]. TPO needed considerably more tuning effort than the other two algorithms. The performance
of TPO depends on the performance of both the II and SA phases, and hence more effort is needed to
balance the two phases. Also, it was noticed that the performance of TPO is very sensitive to the initial
temperature of the SA phase, in addition to the number of local optimizations of the II phase.

parameter                        value
stopping condition               same time as that of TPO
local minimum                    r-local minimum
next state                       random neighbor

Figure 4: Parameters for II

parameter                        value
initial state                    random state
initial temp                     2*cost(initial state)
temp reduction                   [factor illegible] * T_old
frozen                           same as that of the SA phase of TPO

Figure 5: Parameters for SA

parameter                        value
stopping condition (II phase)    20 local optimizations
initial state                    best of II (best_II)
initial temperature (T0)         0.5*cost(best_II), if cost(best_II) < 20000
                                 0.05*cost(best_II), otherwise
equilibrium                      number of edges in the rule condition graph
temp reduction                   [factor illegible] * T_old
frozen                           temp < T0/1000 and best state unchanged for 5 stages, or
                                 total time >= ([multiplier illegible] * number of relations in rule condition) seconds

Figure 6: Parameters for TPO

For deciding the local minimum in II, the same approximation was used as by Ioannidis [12]. A state is
considered to be an r-local minimum if the cost of that state is less than the cost of n randomly chosen
neighbors (with repetition) of that state. In this paper, n was chosen to be the number of edges in the
condition graph of a rule. This is equivalent to the maximum number of β nodes in any Gator network for that
rule and hence is an upper bound on the number of times a create-beta or a kill-beta can be applied. Deciding
a local minimum by exhaustively searching the neighbors of a state is an expensive process, and hence we
believe that using an r-local minimum is the more practical choice.
The optimizer in this study creates a complete new Gator network every time it applies a local change
operator. This is an inefficient way to move from one state to another. An alternative is to directly modify
the data structure representing the state to move it to the next state, and to undo these changes if they are
not beneficial. Moreover, the Exodus E compiler [20], an unsupported compiler that is not of commercial
quality, was used. The Gator network data structures were implemented using "dbclasses" [20] rather than
regular C++ classes, further slowing optimizer performance, since dereferencing a pointer to a dbclass object
takes many instructions instead of one. The E compiler was used to reduce total coding effort, since the
Gator networks must be made persistent, and dbclasses provided persistence without writing extra code. Due
to these performance limitations of our current implementation, we are confident that the Gator optimizer
could easily be sped up by a factor of 10 or more without changing the algorithms used for search.



5 Modifications to Ariel

The first implementation of the Ariel active DBMS was based on the A-TREAT algorithm, which did not
use β nodes. Ariel was thus modified to support β nodes. A discrimination network must be "primed" at
the time a trigger is created; in other words, its stored α and β nodes must be loaded with data. Ariel's
priming mechanism was modified to allow β nodes to be primed. Also, Ariel's token propagation strategy
was modified to make use of and maintain β nodes.
In the original Ariel system there were seven kinds of α nodes with slightly different behavior
[7]. Memory nodes in Ariel can be either static, in which case their contents are persistent and are stored
between transactions, or dynamic, in which case they are flushed after each transaction.
To implement Gator, the memory node class hierarchy was modified to include the following types of β
nodes:

BetaMemory This is the superclass of the other β node types.

StaticBeta An ordinary β node. If none of the children of a β node is a dynamic node, i.e. neither
dynamic-α nor dynamic-β, then that β node is a StaticBeta.

DynamicBeta If any of the children of a β node is a dynamic node, i.e. either dynamic-α or dynamic-β,
then that β node is a DynamicBeta.

TransBeta (short for Transparent Beta). An instance of this class is used at the root of the Gator
network as a place holder for the P-node.
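The static/dynamic rule for β nodes can be illustrated with a small Python class hierarchy. This is a sketch under assumptions, not Ariel's actual classes (which were written in E): the names MemoryNode, StoredAlpha, DynamicAlpha and make_beta are hypothetical stand-ins, and only the four β class names come from the list above.

```python
class MemoryNode:
    """Hypothetical common base class for all memory nodes."""
    is_dynamic = False

    def __init__(self, children=()):
        self.children = list(children)

class StoredAlpha(MemoryNode):
    """Stand-in for a static (persistent) alpha node."""

class DynamicAlpha(MemoryNode):
    """Stand-in for an alpha node that is flushed after each transaction."""
    is_dynamic = True

class BetaMemory(MemoryNode):
    """Superclass of the other beta node types."""

class StaticBeta(BetaMemory):
    """Beta node none of whose children is dynamic."""

class DynamicBeta(BetaMemory):
    """Beta node with at least one dynamic child."""
    is_dynamic = True

class TransBeta(BetaMemory):
    """Placeholder for the P-node at the root of the network."""

def make_beta(children):
    """Pick the beta class from the children, following the rule in the text."""
    if any(child.is_dynamic for child in children):
        return DynamicBeta(children)
    return StaticBeta(children)

static = make_beta([StoredAlpha(), StoredAlpha()])
dynamic = make_beta([StoredAlpha(), DynamicAlpha()])
nested = make_beta([dynamic, StoredAlpha()])
print(type(static).__name__, type(dynamic).__name__, type(nested).__name__)
```

Because a DynamicBeta is itself a dynamic node, the property propagates upward: any β node above a dynamic node is also dynamic.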










Virtual β nodes similar to virtual α nodes are not needed since the non-existence of a β node implies the
need to reconstruct its contents as required.
In Ariel, stored α and β nodes are primed. However, since the contents of dynamic α and β nodes do not
outlive a transaction, they need not be primed. Also, the virtual α nodes are not materialized during priming.
To prime a stored α node, a one-tuple-variable query is formed internally to retrieve the data to be stored in
the α node. This one-variable query is passed to the query optimizer, and the resulting plan is executed.
The data retrieved are stored in the α memory. To prime a β node, first its children are primed, and then
the children are joined to find the data to put in the β node.
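The priming order described above (children first, then the join) can be sketched as a short recursive routine. This is a minimal illustration under assumptions: nodes are plain dictionaries, the keys kind, relation, pred, children, join_pred and contents are hypothetical, a list comprehension stands in for the internally formed one-variable query, and a nested-loop join stands in for whatever plan the query optimizer would choose. Dynamic nodes are left out of the sketch.

```python
from itertools import product

def current_rows(node, db):
    """Rows a node contributes to a join; virtual alphas are filtered on demand."""
    if node["kind"] == "virtual_alpha":
        return [r for r in db[node["relation"]] if node["pred"](r)]
    return node["contents"]

def prime(node, db):
    """Load stored alpha and beta memories with data at trigger creation time."""
    if node["kind"] == "stored_alpha":
        # Stand-in for the one-tuple-variable query on the base relation.
        node["contents"] = [r for r in db[node["relation"]] if node["pred"](r)]
    elif node["kind"] == "beta":
        for child in node["children"]:
            prime(child, db)  # children are primed before the parent
        combos = product(*(current_rows(c, db) for c in node["children"]))
        merged = ({k: v for row in combo for k, v in row.items()} for combo in combos)
        node["contents"] = [m for m in merged if node["join_pred"](m)]
    # Virtual alpha nodes are never materialized, so nothing is stored for them.

db = {
    "R1": [{"R1.a": 1}, {"R1.a": 2}, {"R1.a": 3}],
    "R2": [{"R2.b": 2}, {"R2.b": 3}, {"R2.b": 4}],
}
a1 = {"kind": "stored_alpha", "relation": "R1", "pred": lambda r: r["R1.a"] > 1}
a2 = {"kind": "virtual_alpha", "relation": "R2", "pred": lambda r: True}
beta = {"kind": "beta", "children": [a1, a2],
        "join_pred": lambda m: m["R1.a"] == m["R2.b"]}
prime(beta, db)
print(beta["contents"])
```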

Generating Token Join Order Plans
Every node with a sibling in the Gator network has a join plan attached to it. The join plan is a sequence
of two-way joins regulating the order in which tokens arriving at the node would be joined with each of its
siblings. An important objective is to choose a join plan with the minimum cost. However, since choosing
token join plans must be done very frequently while finding an optimized Gator network for one trigger, it
is too expensive to use traditional query optimization [21] to find the join order plan. Instead, the following
heuristic is used: during each of the two-way joins, the current result should be joined with the smallest
connected sibling. This gives a reasonable join order plan quickly.
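The smallest-connected-sibling heuristic can be sketched as follows. The function name greedy_join_plan, the representation of join edges as a set of frozenset pairs, and the fallback to an unconnected sibling when nothing is connected are assumptions made for this illustration; the text only specifies the greedy choice itself.

```python
def greedy_join_plan(start, siblings, sizes, edges):
    """Order in which tokens arriving at `start` are joined with its siblings.

    sizes maps a node name to its estimated size; edges is a set of
    frozenset({x, y}) pairs, one per join edge between two nodes.
    """
    joined = {start}
    remaining = set(siblings) - {start}
    plan = []
    while remaining:
        connected = [s for s in remaining
                     if any(frozenset((s, j)) in edges for j in joined)]
        # Assumed fallback: if nothing is connected, take any remaining sibling.
        candidates = connected or sorted(remaining)
        nxt = min(candidates, key=lambda s: (sizes[s], s))
        plan.append(nxt)
        joined.add(nxt)
        remaining.remove(nxt)
    return plan

edges = {frozenset(("A", "B")), frozenset(("B", "C")), frozenset(("A", "D"))}
sizes = {"B": 100, "C": 5, "D": 50}
print(greedy_join_plan("A", ["B", "C", "D"], sizes, edges))
```

In this example C is the smallest sibling but is joined last, because it only becomes connected to the current result after B has been joined.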



6 Optimizer Characteristics and Performance

This section presents the details of various experiments conducted to study the relative behavior of II, SA
and TPO for various rules under different update frequency distributions, catalogs and cost models. Rules
were created on synthetically generated databases, which had the following properties:

             Relation cardinality      % of unique values in attributes
Catalog 1    [1000, 100000]            [90, 100]
Catalog 2    [10, 100]: 20%            (0, 20): 70%
             [100, 1000]: 64%          [20, 100): 5%
             [1000, 10000]: 16%        100: 25%
Catalog 3    [1000, 10000]             [90, 100]
The table gives the cardinality distribution for tables in each catalog. Every table has a primary key attribute.
For the other attributes, the table shows the percentage of attributes which fall in each range of unique values.
E.g. for Catalog 2, 64% of the tables have between 100 and 1000 tuples. Furthermore, 5% of the attributes
have at least 20% and fewer than 100% as many unique values as the primary key.
Indices were created only on large relations in the first database. Experiments were performed on rules
having the following types of Rule Condition Graphs (RCGs):

String type Each relation in the rule condition participates in a join with two other relations such that the
rule condition graph looks like a string. The two relations at the two ends of the string participate in
only one join.










Star type One relation participates in a join with all the other relations in the rule condition.

Random type Joins between relations are chosen randomly to create a connected rule condition graph
with no cycles.

Here are examples of rules with the three types of RCGs:

String Type              Star Type                Random Type
define rule Rule1        define rule Rule2        define rule Rule3
if R1.a = R2.b           if R1.a = R2.b           if R1.a = R2.b
and R2.c = R3.d          and R1.c = R3.c          and R1.c = R3.c
and R3.d = R4.e          and R1.b = R4.d          and R1.b = R4.d
then <action>            then <action>            and R4.d = R5.e
                                                  then <action>
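The three RCG shapes can be generated with a few lines of Python. This is a sketch under assumptions: relations are identified by the integers 1..n (standing for R1..Rn), each join is an edge between two integers, and the function names are hypothetical. The random generator attaches each new relation to a randomly chosen earlier one, which always yields a connected graph with no cycles but is not claimed to be the procedure used in the experiments.

```python
import random

def string_rcg(n):
    """Chain R1 - R2 - ... - Rn; the two end relations take part in one join."""
    return [(i, i + 1) for i in range(1, n)]

def star_rcg(n):
    """R1 joins with every other relation."""
    return [(1, i) for i in range(2, n + 1)]

def random_rcg(n, rng):
    """Random connected acyclic graph: relation i joins one earlier relation."""
    return [(rng.randint(1, i - 1), i) for i in range(2, n + 1)]

print(string_rcg(4))
print(star_rcg(4))
print(random_rcg(5, random.Random(1)))
```

All three generators return n - 1 edges, the number of joins in any connected acyclic rule condition graph over n relations.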

The update frequency distribution of various relations in the database significantly affects the performance
of discrimination networks. The following three update frequency distributions were chosen:

Skewed One of the relations has a very high update frequency and the other relations have low frequencies.

Even All the relations have the same update frequency.

Step The update frequencies of relations decrease in a stair-like manner.

The actual frequencies used are summarized in the table below. In all cases, frequencies sum to one.

                Equal        Step                                               Skewed
5 relations     0.2 each     0.4, 0.3, 0.2, 0.05, 0.05                          0.8, 0.05 others
10 relations    0.1 each     0.4, 0.3, 4 each with 0.05 and 0.025               0.7, 0.124, 0.022 others
15 relations    1/15 each    0.3, 0.2, 0.14, 0.04, 0.03, and 4 each with 0.02   0.6, 0.14, 0.02 others

The size of a rule is the number of tuple variables in its condition. Rules of size 5, 10 and 15 were created
with string, star and random type rule condition graphs. Each one of these rules was tested with equal,
step and skewed frequencies, with cost models CM1 and CM2 and catalogs Catalog1 and Catalog2. For a
number of (RCG, size, frequency, cost model, catalog) combinations, each of TPO, SA and II was run 10
times with different random seeds. For each algorithm, the average cost of the output state over all 10 runs
was computed. II was run for the same amount of time as was taken by TPO. For each of these cases and for
each algorithm, the average scaled cost was computed by dividing the average cost of the 10 runs of that
algorithm by the cost of the best state found by all the runs of all the algorithms for that case.
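As a worked illustration of the average scaled cost, the following Python sketch computes it for one case. The function name average_scaled_costs and the cost figures are made up for this example and are not measurements from the experiments.

```python
def average_scaled_costs(runs):
    """Map each algorithm to (its average final cost) / (best cost over all runs).

    runs maps an algorithm name to the list of final costs of its runs.
    """
    best = min(c for costs in runs.values() for c in costs)
    return {alg: (sum(costs) / len(costs)) / best for alg, costs in runs.items()}

# Hypothetical final costs for one (RCG, size, frequency, cost model, catalog) case.
runs = {"TPO": [100, 100], "SA": [100, 140], "II": [120, 130]}
print(average_scaled_costs(runs))
```

A value of 1 means that every run of the algorithm reached the best state found by any algorithm for that case; larger values indicate less stable behavior.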
Due to lack of space, only a few interesting graphs that exhibit the behavior of II, SA and TPO are
shown in Figure 7. Results are shown for Catalog1 and Catalog2 and both cost models for a rule with star
type RCG. Also, results are shown for a rule with string type RCG with equal frequency for Catalog1 and
CM1, and for a rule with random type RCG with skewed frequency for Catalog2 and CM2.
It can be seen that for rules of size 5, irrespective of the catalogs, cost models and frequencies, all the
algorithms do well. There is no difference in the costs of states produced by different algorithms. As










the rule size increases from 5 to 15, the difference in the relative behavior of the algorithms also increases.
Also, as the rule size increases, the average behavior of the algorithms tends to move away from 1, i.e. the
algorithms become less stable. In general, TPO performs better than II and SA and there is no clear winner
between II and SA. Both II and SA exhibit worst case behavior in some cases. No significant difference was
observed in the behavior of the algorithms for the cost models CM1 and CM2. It was observed that when
considering the best of all runs in the experiments, all of II, SA and TPO performed well.
Graphs D, E and F in Figure 7 show interesting cases concerning the relative behavior of the three
algorithms. In graph F, all three algorithms do well and their average behavior is very close to 1
(similar to the behavior of the algorithms for rules of size 5). Graphs D and E show cases where II or
SA does better than TPO. These graphs also illustrate the difficulty in tuning TPO to perform well
in all cases. The behavior of the algorithms in these graphs is explained next and some general conclusions
about the overall behavior of the algorithms are then given.
In graph D, II performs slightly better than TPO. Here, the problem is in deciding the crossover point
between II and SA in TPO. This decision is crucial, especially when the search space contains many local
minima at high-cost states with a small but significant portion of them at low-cost states (space A2, similar
to the search space of left deep query trees in [13]). In this case, doing only a few iterations in the II phase
of TPO might leave the starting state of SA at a high-cost state (because there are many local minima at
high-cost states), making the overall result of TPO unsatisfactory. Doing more iterations or local
optimizations in II is always beneficial in this case, because that helps to find a low-cost local minimum.
The presence of many high-cost local minima also explains the behavior of SA in this case. SA searches
high-cost states when the temperature is high, and when the temperature gets low, it reaches a local minimum
state and searches around that state. Since there are many local minima at high-cost states, SA also can get
trapped in one of the high-cost states and hence its performance is not good. In this case, if the number of
iterations in the II phase of TPO is increased, then TPO is going to do at least as well as II.
In graph E, SA does better than TPO and II does worse than TPO. Here, partly, the problem is in
estimating the initial temperature of the SA phase of TPO. Here, many runs of II generate a high-cost local
minimum and hence its performance is not good. TPO seems to extricate itself from these high-cost states
in many of the cases but not all. The reason seems to be that the cost of states separating the low-cost states
is not very low and hence the initial temperature (0.5*cost(best_II), since here cost(best_II) < 20000) seems
not to be enough to jump over those states. Also, the low value of SA shows the presence of low-cost local
minima. In fact, for this case, when we repeated the experiments with a high initial temperature
(1.0*cost(best_II)) the average value of TPO came down to [value illegible] compared to the current value of
1.405 and the SA value of 1.230. In general, we noticed that the behavior of TPO is very sensitive to the
initial temperature and we tuned it carefully for various cases. Here, part of the reason that TPO is not doing
well could also be the existence of many local minima at high-cost states. This is because the average
behavior of SA is also not close to 1, which suggests the possibility of many high-cost local minima.










In general, it can be stated that TPO does well and the performance of II and SA is close to that of TPO
(within 10-20% in most cases). A possible explanation for this is that the search space, in general, contains
most of the local minima at low-cost states with reasonably high-cost states separating the local minima.
Each iteration of II generates a random state and follows downhill moves until it reaches a local minimum.
Since most of the local minima are at low-cost states, II is able to find a good local minimum. SA explores
high-cost states when the temperature is high and reaches the local minima states at low temperatures and
searches around those states. Since, again, there are many local minima at low-cost states, it is able to find a
good one in most of the cases. TPO starts with a good local minimum state and performs search starting
with a low initial temperature. It seems that, since the temperature is low, it is able to search or come across
a lot of low-cost states and hence it has a high chance of finding a better state. Also, in other experiments, it
was noticed that the SA phase of TPO does not do well when the starting temperature is very low. Based on
this, coupled with graph E, it can be said that the cost of states separating the low-cost states is reasonably
high (for states with cost < 20000, the SA phase of TPO never performs well when the starting temperature
is less than 0.5*cost(best_II)).
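The three search strategies compared in this section can be sketched compactly in Python. This is a minimal illustration under stated assumptions, not the Gator optimizer: the toy search space and all function names are hypothetical, the cooling factor of 0.95 is an assumed value, the time limit in the frozen condition is omitted, and local optimization stops at an r-local minimum as defined earlier. The number of II restarts (20), the initial temperature rule (0.5 or 0.05 times the cost of the best II state, depending on whether that cost is below 20000) and the frozen test (temperature below T0/1000 with the best state unchanged for 5 stages) follow the TPO parameters given in the paper.

```python
import math
import random

def local_optimize(state, cost, neighbor, n, rng):
    """Accept downhill moves until n sampled neighbors in a row fail to improve."""
    current, failures = state, 0
    while failures < n:
        candidate = neighbor(current, rng)
        if cost(candidate) < cost(current):
            current, failures = candidate, 0
        else:
            failures += 1
    return current

def iterative_improvement(random_state, cost, neighbor, n, rng, restarts=20):
    """II: keep the best local minimum found from `restarts` random starts."""
    minima = [local_optimize(random_state(rng), cost, neighbor, n, rng)
              for _ in range(restarts)]
    return min(minima, key=cost)

def simulated_annealing(start, t0, cost, neighbor, n, rng, cooling=0.95):
    """SA with n moves per stage; `cooling` is an assumed reduction factor."""
    if t0 <= 0:
        return start
    current = best = start
    temp, unchanged_stages = t0, 0
    # Frozen: temperature below T0/1000 and best state unchanged for 5 stages.
    while not (temp < t0 / 1000 and unchanged_stages >= 5):
        improved = False
        for _ in range(n):  # equilibrium: n moves at each temperature
            candidate = neighbor(current, rng)
            delta = cost(candidate) - cost(current)
            if delta <= 0 or rng.random() < math.exp(-delta / temp):
                current = candidate
                if cost(current) < cost(best):
                    best, improved = current, True
        unchanged_stages = 0 if improved else unchanged_stages + 1
        temp *= cooling
    return best

def two_phase_optimization(random_state, cost, neighbor, n, rng):
    """TPO: run II, then start SA from its result at a low temperature."""
    best_ii = iterative_improvement(random_state, cost, neighbor, n, rng)
    c = cost(best_ii)
    t0 = 0.5 * c if c < 20000 else 0.05 * c
    return simulated_annealing(best_ii, t0, cost, neighbor, n, rng)

# Toy problem (hypothetical): integers 0..99 with a bumpy, strictly positive cost.
def toy_cost(x):
    return (x - 70) ** 2 / 10 + 10 * math.sin(x) + 20

def toy_neighbor(x, rng):
    return min(99, max(0, x + rng.choice((-1, 1))))

def toy_random_state(rng):
    return rng.randrange(100)

result = two_phase_optimization(toy_random_state, toy_cost, toy_neighbor, 8, random.Random(7))
print(result, round(toy_cost(result), 2))
```

Because the SA phase starts from the II result and only ever replaces its best state with a cheaper one, TPO in this sketch can never return a state worse than the one its II phase found.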
In graph B, the performance of SA is worse than that of II and TPO. II does well in this situation,
which means that there are enough valleys containing a low-cost minimum so that II can almost always find
a good overall solution. SA seems to get trapped in high-cost states and does not reach the low-cost
local minimum states. The solution space in this case seems to contain high-cost valleys and the temperature
does not seem to be high enough to escape these valleys.
Even though II and SA perform close to TPO in most of the cases, the ability of TPO to avoid
worst-case behavior makes it the winner. It is used in the next section for generating the optimal Gator
network to compare with optimal Rete and TREAT.
Tables showing the average optimization time in seconds taken by TPO and SA for all the cases are
shown in Figure 8. The time taken by II is not shown here because it was given the same amount of time
as TPO. Except in a few cases, TPO takes less time than SA. The difference in the time taken by TPO and
SA increases as rule size rises from 5 to 10. At size 15, both II and SA take almost the same time. Again,
no significant differences were found between the optimization times of the two cost models.



7 Performance Evaluation and Optimizer Validation

This section presents the details of various experiments conducted to study the performance of Gator, Rete
and TREAT discrimination networks. The performance metric in all the experiments is the rule condition
evaluation time. This is the time to evaluate a rule condition using a discrimination network (i.e. the time
to pass a set of tokens through the network).
The Ariel active relational DBMS was used as a testbed for conducting all the experiments. The average
rule activation time was measured by processing a randomly generated stream of updates. The table to which

















Figure 7: Average scaled cost of the optimal Gator network found by SA, II and TPO. Each panel plots the
average scaled cost of TPO, SA and II against the number of relations.

(A) Star type RCG with step frequency distribution, CM1, Catalog 1
(B) Star type RCG with step frequency distribution, CM1, Catalog 2
(C) Star type RCG with step frequency distribution, CM2, Catalog 1
(D) Star type RCG with step frequency distribution, CM2, Catalog 2
(E) String type RCG with equal frequency distribution, CM1, Catalog 1
(F) Random type RCG with skewed frequency distribution, CM2, Catalog 2



























Figure 8: Average optimization time (in seconds) of TPO and SA.


an update was applied was determined using a frequency distribution equivalent to the update frequency
statistics maintained in the system catalog. Inserts were done on each table, and the token testing time for
each was measured. Then, a "total rule condition testing time" was calculated by multiplying the time spent
propagating a token for each table by the update frequency for the table.
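The measurement procedure can be illustrated with a short Python sketch. It rests on assumptions: the function names and the numbers are hypothetical, and the per-table products are summed to obtain a single figure, which the text implies but does not state explicitly.

```python
import random

def total_testing_time(token_time, update_freq):
    """Frequency-weighted sum of per-table token propagation times."""
    return sum(token_time[t] * update_freq[t] for t in token_time)

def update_stream(update_freq, length, rng):
    """Draw the table for each update according to the update frequencies."""
    tables = list(update_freq)
    weights = [update_freq[t] for t in tables]
    return rng.choices(tables, weights=weights, k=length)

# Hypothetical measured token propagation times and update frequencies.
token_time = {"R1": 10.0, "R2": 2.0}
update_freq = {"R1": 0.25, "R2": 0.75}
print(total_testing_time(token_time, update_freq))
print(update_stream(update_freq, 5, random.Random(0)))
```

Weighting by update frequency means that a table with slow token propagation contributes little to the total if it is rarely updated.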
For all the tests discussed, an optimized Gator network is compared with a TREAT network (for which
there is only one choice) and an optimized, left-deep Rete network. The Rete networks were optimized using
a dynamic programming-style Rete network optimizer. Optimized Rete networks were considered in order
to give a fair comparison between Gator and Rete.
The different parameters that were varied were database statistics, number of tuple variables in the RCG,
the shape of the RCG, the placement and number of selection conditions in the rules, and the frequency
distribution of updates. Since the relation sizes are small in Catalogs 2 and 3, no indexes were created. For
Catalog 1, B+-tree indexes were created on primary keys of large relations. As it is not possible to show all
the results here, a sample of the results of various experiments is given below. Due to the fact that Exodus
does not allow buffer space to be set to be very low, and it does an excellent job of buffer management, it was
not possible to truly test token propagation cost for cost model 2. All tests were run on a Sun SPARCstation
5/110.
Figures 9 and 12 show relative estimated vs. actually measured token propagation times for Gator, Rete
and TREAT for various parameter combinations. Figure 9 (A) and (B) show the results for a rule with five
tuple variables, with string and star type RCGs and step frequency distribution. It can be seen that Gator
does much better than Rete and TREAT in (A). In (B), the Gator network which the optimizer comes up
with is the same as the optimal Rete network. This shows the flexibility of Gator: when Rete or TREAT
is best, Gator will normally arrive at that result.
Figure 9 (C) and (D) show the results for a rule with ten tuple variables, with random type RCG and step
frequency with two different databases. The RCG for the rule used was as shown in Figure 10 (A). It can
be seen that in both the cases, the Gator network outperforms both Rete and TREAT networks. However,


        star, step, CM1,     star, step, CM1,     star, step, CM2,
        Catalog1             Catalog2             Catalog1
Size    5     10    15       5     10    15       5     10    15
TPO     39    215   598      29    196   600      36    230   600
SA      47    247   600      28    201   537      44    243   600

        star, step, CM2,     string, equal, CM1,  random, skew, CM2,
        Catalog2             Catalog1             Catalog2
Size    5     10    15       5     10    15       5     10    15
TPO     40    237   600      33    191   ?        39    209   583
SA      37    271   600      40    219   556      48    204   587

(? marks a value that is illegible in the source.)

























Figure 9: Comparison of estimated costs and rule condition evaluation times. Each panel shows scaled
estimated costs and scaled rule condition evaluation times for Gator, Rete and TREAT.

(A) String type RCG with step frequency distribution.
(B) Star type RCG with step frequency distribution.
(C) Random type RCG with step frequency distribution, using Catalog 2.
(D) Random type RCG with step frequency distribution, using Catalog 3.
(E) String type RCG with equal frequency distribution, using Catalog 2.
(F) String type RCG with equal frequency distribution, using Catalog 3.




















Figure 10: A random type RCG of size ten and its optimal Gator and Rete networks.

(A) Random type RCG with ten relations. The '*' indicates a selection condition on a relation.
(B) Gator and Rete networks. Legend: Virtual Alpha, Stored Alpha, Static Beta, Trans Beta.


it may be noticed that the estimated costs and the actual rule condition evaluation times are proportional
only in the case of (C), whereas in (D) the Gator optimizer overestimates the cost of the TREAT network.
During the tests, it was observed that this was so in a few cases. We believe that this is most likely due
to over-estimation of the size of temporary results when a token is joined across a set of nodes in TREAT.
Getting accurate estimates for the size of temporary results is challenging because of the way errors in join
selectivity compound during a long join sequence.
The Gator and Rete networks generated for (D) are shown in Figure 10 (B). In all the networks, the
optimizer decided to create virtual α nodes for relations with no selection predicate in the rule condition,
preventing the duplication of relations and thus saving space. In the case of the Gator network, the relations
with high update frequencies (D=0.4 and I=0.3) were pushed down the discrimination network, towards the
P-node. This means that fewer token joins need to be done as tokens propagate through the network due
to updates. Also, the stored α nodes with low size were at the top of the network, which helps to reduce the
size of the β nodes below them. Another interesting observation was that the optimizer was clever enough
to never form a β node with two virtual alpha nodes as children, since that could potentially increase the
size of the β node. Along these lines, it can be seen from Figure 10 (A) that the relations B and E do have
a join edge between them. Neither of them has a selection condition on it. They could have been made the
children of the same β node; however, the optimizer pruned out that case.
In the case of the Rete network, relation I was closest to the P-node, which one would expect to be
beneficial. However, D was at the top. Intuitively, one would expect this to create a higher-cost network.
However, we forced the creation of a network with both I and D near the bottom, and both the predicted
and the actual costs were higher than that of the network chosen by the optimizer. This illustrates the value
of using optimization for discrimination networks: there are so many competing cost factors that seemingly obvious






























Figure 11: Gator, Rete and TREAT networks.

(C) Rete network
(D) TREAT network
Relation labels: N O M L K J I H G F E D C B A. Legend: Virtual Alpha, Stored Alpha, Static Beta, Trans Beta.


improvements may not really be beneficial. In the case of TREAT, whenever a new token enters the network
it always has to participate in a join across the other α nodes, which explains its higher rule condition testing
time.
Figure 9 (E) and (F) show the results for a rule with fifteen tuple variables, with string type RCG and
equal frequency distribution, with two different databases. Again, it can be observed that in both the cases,
Gator and Rete perform much better than TREAT, and in one case, the Gator network does much better
than the Rete network as well. The Gator, Rete and TREAT networks generated are shown in Figure 11. It
can be seen in Figure 11 that the optimal Gator networks found by the Gator optimizer for the two different
cost models differ, in that the optimizer decides to have fewer β nodes for cost model 2 (where I/O cost is
significant). The reason is that the cost of maintaining β nodes is higher than the cost for performing a join
between more nodes in this case. It was also observed for this case that the cost of the optimal Rete network
was higher than that of the TREAT network.
A Rete network is constrained to be a left-deep binary tree. Since the frequency distribution is equal, the
Rete network loses out to the Gator network on the whole, since it takes a lot more time for rule condition
evaluation in the case of updates that are made to relations which are at the top of the network. However,
in the Gator network, the time taken for rule condition evaluation is more or less the same for most of the
relations.
Figure 12 (A) and (B) show the results for a star rule with fifteen tuple variables and step frequency
distribution with two different databases. It can be seen that the performance of Gator and Rete networks
was comparable as predicted by the estimated costs. And again, both these networks outperformed the


(A) Gator network with cost model 1


(B) Gator network with cost model 2












Figure 12: Comparison of estimated costs and rule condition evaluation times. Each panel shows scaled
estimated costs and scaled rule condition evaluation times for Gator, Rete and TREAT.

(A) Star type RCG with step frequency distribution, using Catalog 2.
(B) Star type RCG with step frequency distribution, using Catalog 3.
(C) Star type RCG with step frequency distribution, using Catalog 1.


TREAT network.
Figure 12 (C) shows the results for a star type rule with ten relations and step frequency distribution.
In this case, indexes were created on large relations. It can be seen from the graphs that the Rete network
did slightly better than the Gator network, as the cost formulae had correctly predicted. This means that
the randomized Gator optimizer was not able to arrive at the exact optimal Rete as the optimal Gator.
Occasional errors like this are to be expected in a randomized optimization algorithm. However, it was
observed during all the various tests that were run that this was a very rare occurrence. In this case, one
point in favor of the Gator network optimizer was that it took about half the time which the Rete optimizer
took to come up with the optimal Gator network. In general, it was noted that the time needed for obtaining
an optimal Rete network was large in the case of rules with star RCG.
In many cases it is not intuitive why one network is better than another, in large part because there are
so many competing factors that influence the performance of a network. Hence a conclusion of this work
is that it is better to use cost functions and search to perform optimization of Gator networks than to use
heuristics to pick a good network.



8 Conclusion


This paper has introduced Gator networks, a new discrimination network structure for optimized rule
condition testing in active databases. A cost model for Gator has been developed, which is based on traditional
database catalog statistics, plus additional information regarding update frequency. A randomized Gator
network optimizer has been implemented and tested as part of the Ariel active DBMS.
An interesting result of this work is that for most cases, even when update frequencies are evenly distributed,
the optimal Gator network has a few β nodes, but not a full complement of them, i.e. it is neither a TREAT
network nor a Rete network. When I/O cost is significant, Gator is clearly better than both Rete and
TREAT. When I/O cost is low, e.g. in an environment with plentiful buffer space, the Gator networks













generated have very few if any β nodes with more than two inputs. Hence, for environments like this,
optimized bushy Rete networks would do approximately as well as optimized Gator networks. Optimized
left-deep Rete networks are only competitive with Gator networks when update frequency is at least partly
skewed and I/O cost is low. Optimized Gator networks virtually always have short root-to-leaf paths. They
are balanced, and the fanout of a β node is usually in the range two to four. The short root-to-leaf path
length reduces the number of nodes that need to be touched when tokens are processed. Moreover, as much
as possible, the Gator optimizer avoids materializing large β nodes, which saves time both for rule condition
testing and for β node update. Overall, it is clearly beneficial to use a general discrimination network
structure (Gator), instead of limiting the possibilities to TREAT or Rete.
This work also shows that update frequency distribution has a tremendous influence on the choice of the best
discrimination network. Moreover, it is indeed feasible to develop a cost model and search strategies that
allow effective Gator network optimization.
This work has clearly demonstrated the value of optimizing the testing of trigger conditions involving
many joins in active databases. This study has gone beyond merely an optimizer simulation to actually
validate the results of the optimizer during rule condition testing. Optimizer validation is rarely attempted
due to the difficulty of doing the tests, as well as showing strong correlation between expected and actual
results. An exception is the paper by Mackert and Lohman [17]. The work given here can help make it
possible to implement the ability to efficiently and incrementally process triggers with a large number of
joins in their conditions in commercial database systems, thus making a new, powerful tool available to
database application developers. Finally, the same implementation of optimized Gator networks could be
used to optimize maintenance of materialized views. This would provide fast materialized view maintenance
essentially "for free" in terms of DBMS implementation effort.




References

[1] E. Aarts and J. Korst. Simulated Annealing and Boltzmann Machines. John Wiley and Sons, 1990.
[2] P. A. Bernstein, N. Goodman, E. Wong, C. L. Reeve, and J. B. Rothnie. Query processing in a system for
distributed databases (SDD-1). ACM Transactions on Database Systems, 6(4):602-625, December 1981.
[3] D. A. Brant and D. P. Miranker. Index support for rule activation. In Proceedings of the ACM SIGMOD
International Conference on Management of Data, pages 42-48, May 1993.
[4] L. Brownston, R. Farrell, E. Kant, and N. Martin. Programming Expert Systems in OPS5: an Introduction to
Rule-Based Programming. Addison Wesley, 1985.
[5] S. Ceri and G. Pelagatti. Distributed databases. McGraw-Hill computer science series, 1984.
[6] C. L. Forgy. Rete: A fast algorithm for the many pattern/many object pattern match problem. Artificial
Intelligence, 19:17-37, 1982.
[7] E. N. Hanson. The design and implementation of the Ariel active database rule system. IEEE Transactions on
Knowledge and Data Engineering, 8(1):157-172, February 1996.
[8] E. N. Hanson, M. Chaabouni, C. Kim, and Y. Wang. A predicate matching algorithm for database rule systems.
In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 271-280, May
1990.
[9] E. N. Hanson, I.-C. Chen, R. Dastur, K. Engel, V. Ramaswamy, C. Xu, and W. Tan. Flexible and recoverable
interaction between applications and active databases. VLDB Journal, 7(1), 1998. In press.
[10] E. N. Hanson and T. Johnson. Selection predicate indexing for active databases using interval skip lists.
Information Systems, 21(3):269-298, 1996.
[11] M. Hasan. Optimization of discrimination networks for active databases. Master's thesis, University of Florida,
CIS Department, 1993.
[12] Y. Ioannidis and Y. C. Kang. Randomized algorithms for optimizing large join queries. In Proceedings of the
ACM SIGMOD International Conference on Management of Data, pages 312-321, May 1990.
[13] Y. Ioannidis and Y. C. Kang. Left-deep vs. bushy trees: An analysis of strategy spaces and its implications for
query optimization. In Proceedings of the ACM SIGMOD International Conference on Management of Data,
pages 168-177, May 1991.
[14] Y. Ioannidis and E. Wong. Query optimization by simulated annealing. In Proceedings of the ACM SIGMOD
International Conference on Management of Data, 1987.
[15] T. Ishida. An optimization algorithm for production systems. IEEE Transactions on Knowledge and Data
Engineering, 6(4):549-558, August 1994.
[16] S. Kirkpatrick, C. C. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science, 220:671-680,
1983.
[17] L. F. Mackert and G. M. Lohman. R* optimizer validation and performance evaluation for local queries. Technical
Report RJ4989, IBM Almaden Research Center, January 1986.
[18] D. P. Miranker. TREAT: A better match algorithm for AI production systems. In Proc. AAAI National
Conference on Artificial Intelligence, pages 42-47, August 1987.
[19] J. Rangarajan. A randomized optimizer for rule condition testing in active databases. Master's thesis, University
of Florida, CIS Department, December 1993.
[20] J. E. Richardson, M. J. Carey, and D. T. Schuh. The design of the E programming language. ACM Transactions
on Programming Languages and Systems, 15(3), 1993.
[21] P. Selinger et al. Access path selection in a relational database management system. In Proceedings of the ACM
SIGMOD International Conference on Management of Data, June 1979. (reprinted in [22]).
[22] M. Stonebraker, editor. Readings in Database Systems. Morgan Kaufmann, 1994.
[23] A. Swami and A. Gupta. Optimization of large join queries. In Proceedings of the ACM SIGMOD International
Conference on Management of Data, pages 8-17, 1988.
[24] P. J. M. van Laarhoven and E. H. Aarts. Simulated Annealing: Theory and Applications. D. Reidel Publishing
Company, 1987.
[25] Y. Wang and E. N. Hanson. A performance comparison of the Rete and TREAT algorithms for testing database
rule conditions. In Proc. IEEE Data Eng. Conf., pages 88-97, February 1992.



