Paper A166
Incremental Mining of Constrained Associations*
Shiby Thomas† Sharma Chakravarthy
Database Systems Research and Development Center
Computer and Information Science and Engineering Department
University of Florida, Gainesville FL 32611
email: {sthomas, sharma}@cise.ufl.edu
Abstract
Data mining algorithms have received much attention in the last several years. The
advent of data warehouses has shifted the focus of data mining from file-based data to data stored
in databases. Architectures and techniques for optimizing mining algorithms for relational as well as
object-relational databases are being explored with a view to tightly integrating mining into data
warehouses. More recently, interactive mining has been proposed as a way to bring decision makers
into the loop, to enhance the utility of mining, and to support goal-oriented mining. Incremental
mining is another technique useful for speeding up the mining process when new data is added,
which is typical of the data collection process.
In this paper, we show that by viewing the negative border concept as a constraint relaxation
technique, incremental data mining can be readily generalized to efficiently mine association
rules with various types of constraints. We divide the constraints into four categories and show
how they are used in the algorithm based on the negative border to perform incremental mining.
In this approach, traditional incremental mining can be treated as a special case of relaxing
the frequency constraint. For concreteness, we show how the generalized incremental approach,
including constraint handling, can be cast into the SQL framework. We present optimizations
for incremental mining using SQL and report some promising performance gains of SQL-based
incremental data mining on synthetic datasets. Finally, we demonstrate the applicability of
the approach proposed in this paper to several variants of the data mining problems from the
literature.
1 Introduction
Data mining or knowledge discovery is the process of identifying useful information such as previously
unknown patterns embedded in a large data set. Efficient identification of useful information is central
*This work was supported in part by the Office of Naval Research and the SPAWAR System Center San Diego,
by the Rome Laboratory, DARPA, and the NSF Grant IRI-9528390.
†Contact author. Full Address: Database Systems R&D Center, 470 CSE, PO Box 116125, Gainesville FL 32611-
6125, USA. Email: sthomas@cise.ufl.edu
to the problem of data mining as the data sets used are extremely large and making multiple passes
over the data set is computationally expensive. Although data mining is used in its own right for
many applications, it is likely to become a key component of decision support systems and as a result
will complement the data analysis/retrieval done through traditional query processing. Of all the
data mining techniques, association rule mining has received significant attention from the research
community. Briefly, association rules represent co-occurrences of items that satisfy the user-specified
support requirement in a given data set. In a market-basket data set, a basket consists of a set of
items bought together by a customer. A number of such baskets (corresponding to the point of sales,
for example) make up the market-basket data set. beer → diapers is an example of an association
rule over the set of items that includes beer and diapers. Support refers to the percentage of baskets
that contain the items of interest (beer and diapers, in our case). Confidence refers to the percentage
of baskets containing the antecedent that also contain the consequent.
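The support and confidence computations can be made concrete with a small sketch (the baskets and item names below are invented for illustration, and Python is used here only for exposition):

```python
# Hypothetical mini market-basket data set: each basket is a set of items.
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"beer", "soda"},
    {"milk", "diapers"},
]

def support(itemset):
    """Fraction of baskets that contain every item of `itemset`."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """Fraction of antecedent-containing baskets that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"beer", "diapers"}))       # 2 of 4 baskets -> 0.5
print(confidence({"beer"}, {"diapers"}))  # 2 of 3 beer baskets -> 0.666...
```

Here the rule beer → diapers has 50% support and about 67% confidence.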
A number of algorithms [AIS93, AS94, SON95, Toi96] have been developed for mining association
rules assuming the data is stored in files. The Apriori algorithm and its variants [AMS+96,
BMUT97] generate all possible association rules from a given data set that satisfy user-specified
support and confidence values. Frequent itemsets (a frequent itemset is a set of items that satisfies
the minimum support requirement) are generated in a levelwise manner prior to generating the rules.
The main optimization to reduce the number of candidate itemsets (potential frequent itemsets)
generated at each level is based on the observation that if an itemset S appears in c baskets, then
any subset of S appears in at least c baskets.
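This observation underlies the apriori candidate generation step. A minimal sketch (not the paper's code; itemsets are represented as sorted tuples):

```python
from itertools import combinations

def apriori_gen(frequent_k):
    """Generate (k+1)-candidates from frequent k-itemsets (sorted tuples),
    pruning any candidate that has a non-frequent k-subset."""
    frequent = set(frequent_k)
    k = len(frequent_k[0])
    candidates = set()
    for a in frequent_k:
        for b in frequent_k:
            # join step: same (k-1)-prefix, last item of a < last item of b
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                cand = a + (b[-1],)
                # prune step: every k-subset must be frequent
                if all(tuple(s) in frequent for s in combinations(cand, k)):
                    candidates.add(cand)
    return candidates

# frequent 2-itemsets over items 1..4
f2 = [(1, 2), (1, 3), (2, 3), (2, 4)]
print(apriori_gen(f2))   # {(1, 2, 3)}; (2, 3, 4) is pruned since (3, 4) is not frequent
```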
The need for applying association rule mining to data stored in databases/data warehouses has
motivated researchers to [SK97, TUA+98, STA98, TC98] i) study alternative architectures
for mining over data stored in databases, ii) translate association rule mining algorithms to work
with relational and object/relational databases, iii) optimize the mining algorithms beyond what the
current relational query optimizers are capable of, and iv) compare the performance of database
mining with file-based mining. Although it has been shown [STA98] that SQL-based approaches
currently cannot compete with ad hoc file processing algorithms and their variants, as pointed out
rightly in [TUA+98], this does not negate the importance of the approach. For this reason, we
formulate some of the optimizations proposed in this paper using SQL and evaluate the performance
of the SQL-based approaches. We would like to point out that these optimizations can also be carried
over to other approaches without great difficulty or loss of generality.
The process of generating all association rules assumes that the parameters of interest are mostly
support and confidence. However, given the cost of computing all association rules, this black-box
approach leaves little room for user exploration and control. Interactive mining that includes the
specification of expressive constraints is proposed in [NLHP98]. They also develop algorithms for
the optimal use of these constraints in the apriori class of algorithms.
The incremental approach to mining, which is the subject of this paper, provides another vehicle for
optimizing the generation of frequent itemsets when either the data or the constraints change. Incremental
mining is predicated upon selectively retaining item combinations (that would otherwise have been
discarded) at each level for later use. The negative border approach identifies the itemsets to be
retained so that when new data is added, complete recomputation of frequent itemsets is avoided.
When mining over relational databases, incremental view materialization can be used to implement
incremental data mining algorithms.
1.1 Overview of the Paper
Given the computational cost of data mining and how the data is gathered, it is important to
optimize not only the algorithms within a mining computation, but also mining that is performed multiple
times over the same data on account of additions and deletions (additions are more common than deletions).
Incremental mining refers to the optimizations that can be done across mining computations over an
updated data set. The negative border concept proposed in [Toi96] uses the trick used in the apriori
algorithm not just to prune, but also to materialize data across the levels of the apriori algorithm.
This data is used to compute the frequent itemsets that need to be added or deleted when the underlying
data changes.
This paper generalizes the use of incremental mining techniques to association rule mining with
constraints. The negative border concept [Toi96] is used as the basis of incremental mining [TBAR97,
FAAM97]. By viewing the negative border concept as a constraint relaxation technique, incremental
data mining can be readily used to efficiently mine association rules with various types of constraints.
We divide the constraints into four categories and show how they are used in the algorithm based on
the negative border to perform incremental mining. In this approach, traditional incremental mining can
be treated as a special case of relaxing the frequency constraint. For concreteness, we show how the
generalized incremental approach, including constraint handling, can be cast into the SQL framework.
We present optimizations for incremental mining using SQL and report some promising performance
gains of SQL-based incremental data mining on synthetic datasets. Finally, we demonstrate the
applicability of the approach proposed in this paper to several variants of the data mining problems
from the literature.
The contributions of this paper are: i) the generalization of the negative border concept to
a larger class of constraints (in addition to the frequency constraint) that satisfy the downward
closure property, ii) the extension of the algorithm for incremental frequent itemset computation to handle
constraints, iii) optimizations of the generalized incremental data mining algorithm in the relational
context and the concomitant performance gains, and iv) wider applicability of the approach presented
to problems such as mining closed sets, query flocks, etc. We believe that the ramifications of the
framework for incremental mining presented in this paper go far beyond black-box mining and
will be most useful in interactive mining.
Paper Organization: The remainder of this paper is organized as follows. Section 1.2 briefly
discusses related work on association rule mining in the presence of constraints. Section 2 describes
the types of constraints handled and the generalization of the incremental negative border algorithm
for them. Section 3 introduces incremental mining along with a description of the incremental
frequent itemset mining algorithm. In Section 4, we present SQL formulations of incremental mining
along with the performance results. In Section 5 we show the applicability of the approach presented
in the paper to closed sets and query flocks. Section 6 contains conclusions.
1.2 Related Work
Domain knowledge in the form of constraints can be used to restrict the number of association rules
generated. This can be done either as a post-processing step after generating the frequent itemsets
or can be integrated into the algorithm that generates the frequent itemsets. The use of constraints
and the ability to relax them can also provide a focused approach to mining. [SVA97]
discusses how to optimize the frequent itemset generation procedure for a set of boolean constraints
over a set of items. The paper discusses boolean constraints expressed over the taxonomy of items.
Three approaches that exploit these constraints as part of the frequent itemset generation step are
presented. The paper shows that the integrated approach provides significant performance gains as
compared to the post-processing of constraints, although the choice of the approach is likely to depend
upon the data characteristics. Incremental techniques are not discussed for the use of constraints.
[NLHP98] introduces constraints for the purposes of interactive mining, to facilitate user exploration
and goal-directed mining. Constraints are categorized, and the anti-monotone and succinctness
properties are used to determine whether the constraint computation can be pushed into the frequent
itemset generation process. Two algorithms are presented along with their performance evaluation.
Incremental mining techniques are not discussed for the use of constraints.
Query flocks, introduced in [TUA+98], consist of a parameterized query and a filter that selects
certain assignments of values for the parameters by applying a condition (the filter) to the result of the
query for that value assignment. The techniques described in the paper apply to any monotone
filter condition. Later, we describe how we can generalize the approach presented in this paper to
incrementally compute the safe subqueries of a query flock.
2 Constrained Associations
In this section, we introduce associations with different kinds of constraints on the itemsets or
constraints that characterize the dataset from which the associations are derived. Let Z = {i1, i2, ..., in}
be a set of literals, called items, that are attribute values of a set of relational tables. A constrained
association is defined as a set of itemsets {X | X ⊆ Z ∧ C(X)}, where C denotes one or more boolean
constraints. Note that we do not concentrate on generating association rules in the traditional
sense [AIS93]. However, the associations between attribute values that we generate can be used for
rule generation.
2.1 Categories of Constraints
We divide the constraints into four different categories that are outlined below¹. We illustrate each
of them with sample mining computations. The data model used in our examples is that of a point-
of-sale (POS) model for a retail chain. When a customer buys a product or series of products at a
register, that information is stored in a transactional system, which is likely to hold other information
such as who made the purchase, what types of promotions were involved, etc. The data is stored in
three relational tables sales, products_sold and product with respective schemas as shown in Figure 1.
    SALES               PRODUCTS_SOLD       PRODUCT
    Transaction id      Transaction id      Product id
    Customer id         Product id          Type
    Total price         Price               Name
    No. of products     Promotion id        Description

Figure 1: Point of sales data model
2.1.1 Frequency constraint
This is the same as the minimum support threshold in the classical framework for association
rule mining [AS94]. An itemset X is said to be frequent if it appears in at least s transactions,
where s is the minimum support. In the point-of-sale data model a transaction corresponds to a
customer transaction. Note that in other domains, the notion of a transaction and of an itemset
appearing in a transaction could be different. Frequent itemsets, the ones which satisfy the frequency
constraint, are defined as {X | f(X) ≥ s} where f(X) is the frequency of X.
Most of the algorithms for frequent itemset discovery utilize the downward closure property of
itemsets with respect to the frequency constraint: that is, if an itemset is frequent, then so are all
its subsets. Levelwise algorithms [MT97] find all itemsets with a given property among itemsets
of size k (k-itemsets) and use this knowledge to explore (k + 1)-itemsets. Downward closure is a
pruning property. We start with the assumption that all (k + 1)-itemsets are frequent (frequency
is just an example of a downward closed property). As the k-itemsets are examined, we prune out
some (k + 1)-itemsets that we know cannot be frequent. In effect we use the contrapositive of the
frequent itemset definition: "if any subset of a (k + 1)-itemset is not frequent, then neither can the
(k + 1)-itemset be". After the pruning, we go through the remaining list, checking each (k + 1)-itemset
for its frequency. The downward closure property is similar to the anti-monotonicity property defined
in [NLHP98].
¹We refer the reader to [SVA97, NLHP98] for nice discussions of various kinds of constraints. Here we categorize
them based on their usage in the mining process, which is explained in Section 2.2 with an example.
In the context of the pointofsale data model, the frequency constraint can be used to discover
products bought together frequently.
2.1.2 Item constraint
In order to discover goal-oriented associations, it is often required to impose constraints on the
itemsets. For instance, find only the itemsets containing at least one item from a user-defined subset
of items. Let B be a boolean expression over the set of items Z. The problem is to find itemsets that
satisfy the constraint B. Typically the item constraint will also be associated with a frequency constraint,
where we want to find frequent itemsets that satisfy B. Three different algorithms for mining
associations with item constraints are presented in [SVA97].
Item constraints enable us to pose mining queries such as "What are the products whose
sales are affected by the sale of, say, barbecue sauce?" and "What products are bought together with
sodas and snacks?". The 1-variable constraints in [NLHP98] are a special case of item constraints.
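In its simplest post-processing form, an item constraint B is just a boolean filter over itemsets. A sketch with invented item names (the integrated algorithms of [SVA97] push B into candidate generation instead, which is far more efficient):

```python
# B requires the itemset to contain "barbecue sauce" or "soda";
# the items and candidate itemsets are hypothetical.
def B(itemset):
    return "barbecue sauce" in itemset or "soda" in itemset

candidates = [
    {"barbecue sauce", "chips"},
    {"soda", "snacks"},
    {"milk", "bread"},
]
satisfying = [c for c in candidates if B(c)]
print(satisfying)   # the first two candidates pass; {"milk", "bread"} is filtered out
```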
2.1.3 Aggregation constraint
These are constraints involving aggregate functions on the items that form the itemset. For instance,
in the POS example an aggregation constraint could be of the form min(products_sold.price) >
p. Here we consider a product as an item. The aggregate function could be min, max, sum,
count, avg, etc., or any other user-defined aggregate function. An aggregation constraint of the
form min(products_sold.price) > p can be used to find "expensive" products that are bought
together. Similarly max(products_sold.price) < q can be used to find "inexpensive" products that are
bought together. The aggregate functions can be combined in various ways to express a whole
range of useful mining computations. For example, the constraint (min(products_sold.price) <
p) & (avg(products_sold.price) > q) targets the mining process to inexpensive products that are
bought together with the expensive ones.
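The distinction that matters for the mining framework is whether an aggregation constraint is downward closed. A sketch with hypothetical item prices: a min(price) threshold survives the removal of items, so it can prune candidates, whereas an avg(price) threshold does not and must be checked post-counting:

```python
# Hypothetical price table for illustration.
price = {"caviar": 40, "wine": 30, "chips": 2, "soda": 1}

def satisfies_min(itemset, p):
    """min(price) >= p is downward closed: dropping items can only raise the min."""
    return min(price[i] for i in itemset) >= p

def satisfies_avg(itemset, q):
    """avg(price) >= q is NOT downward closed, so it is checked post-counting."""
    return sum(price[i] for i in itemset) / len(itemset) >= q

print(satisfies_min({"caviar", "wine"}, 25))        # True, as for every subset
print(satisfies_avg({"caviar", "chips"}, 20))       # True (avg 21), yet {"chips"} fails
```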
2.1.4 External constraint
External constraints filter the data used in the mining process. These are constraints on attributes
which do not appear in the final result (which we call external attributes). For example, if we want
to find products bought during big purchases, where the total sale price of the transaction is larger
than some amount, we can impose a constraint of the form sales.total_price > P. These constraints
are useful to target the mining process to just the relevant data, thereby speeding up the process.
2.2 Constrained association mining
Figure 2 shows the general framework for the kth level in a levelwise approach for mining associations
with constraints. Note that we could also use SQL-based approaches for implementing the various
operations in the different phases. The different constraints are applied at different steps in the
Figure 2: Framework for constrained association mining (F_k and the input data tables feed the
kth level; the constraints are applied at the various phases to produce F_{k+1})
mining process. For example, the item constraints and the aggregation constraints that satisfy the
closure property can be used in the candidate generation phase for pruning unwanted candidates.
Three different algorithms for generating candidate itemsets in the presence of item constraints are
presented in [SVA97] and we could use them here. The aggregation constraints can be applied on the
result of the candidate generation process with item constraints. The external constraints and some
of the aggregation and item constraints can be applied at the data filtering stage to reduce the size
of the input data to the support counting phase (see Figure 3 for a specific example). The frequency
constraint is applied during support counting to filter out the non-frequent itemsets. Finally, any
unprocessed constraints are checked as a post-counting operation. The frequent itemsets before the
post-counting constraint check are used for the next-level candidate generation, since these constraints
do not possess the closure property.
In the basic frequent itemset mining, only the frequency constraint is present and the candidate
generation step uses only the subset pruning strategy [AS94].
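The phases described above can be sketched for one level. This is a deliberately simplified Python illustration, not the paper's SQL formulation: the join is done via set unions, transactions are plain sets, and external-constraint data filtering is assumed to have been applied to `data` beforehand:

```python
from itertools import combinations

def mine_level(F_k, data, minsup, closed_constraints, other_constraints):
    """One level of the constrained framework (illustrative sketch).
    F_k: frequent k-itemsets as frozensets; data: pre-filtered transactions;
    closed_constraints prune candidates; other_constraints are checked
    post-counting. Returns (itemsets feeding the next level, answers)."""
    k = len(next(iter(F_k)))
    # candidate generation: union-join plus subset pruning, then the
    # closure-satisfying item/aggregation constraints prune further
    C = set()
    for a in F_k:
        for b in F_k:
            cand = a | b
            if len(cand) == k + 1 and \
               all(frozenset(s) in F_k for s in combinations(cand, k)) and \
               all(p(cand) for p in closed_constraints):
                C.add(cand)
    # support counting applies the frequency constraint
    counts = {c: sum(c <= t for t in data) for c in C}
    F_next = {c for c, n in counts.items() if n >= minsup}
    # post-counting check; note F_next (pre-check) feeds the next level
    answers = {c for c in F_next if all(p(c) for p in other_constraints)}
    return F_next, answers
```

For example, with F_1 = {{a}, {b}} and three transactions, a minimum support of 2 yields the frequent pair {a, b}.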
Example: We illustrate the application of the various constraints here with a specific example
using the point-of-sale data model in Section 2. The example shown in Figure 3 finds product
combinations containing barbecue sauce, where all the products cost less than $50 and the average
price is more than $25 (some notion of similarly priced products). The combinations should appear in
at least 100 sales transactions with the total price of the transaction greater than $500. This gives
an idea of what other moderately priced products people buy with barbecue sauce in big purchases
(perhaps for parties). It could help the shop owner to decide on various promotions.
In the above example, the constraint on the total price of the transaction is an external constraint
and is applied in the data filtering stage. Since the maximum price of a product in the desired
combination is $50, we can also filter out records that do not satisfy this condition in the data
filtering stage. max(Price) < 50 is an aggregation constraint which satisfies the closure property
and can be applied in the candidate generation phase. The constraint to include barbecue sauce in
Figure 3: Point-of-sales example for constrained association mining. The input tables sales,
products_sold and product pass through data filtering (external constraints and aggregation
constraints useful for data filtering: Sales.total_price >= 500, Products_sold.price <= 50);
candidate generation from F_k applies the aggregation and item constraints that satisfy the
closure property (Price <= 50) to produce C_{k+1}; support counting applies the frequency
constraint (support > 100); and the post-counting constraint check applies the constraints that
do not satisfy the closure property (contains(barbecue sauce), avg(product price) > 25) to
produce F_{k+1}.
the combination and the constraint on the average price are checked in the postcounting phase since
they do not satisfy the closure property.
3 Incremental Mining
In this section, we outline the incremental mining algorithm based on the negative border concept and
discuss how it can be generalized to handle constrained associations and certain kinds of constraint
relaxation.
3.1 Incremental association rule mining
The closure property that all subsets of a frequent itemset are also frequent is used in the candidate
generation process of the apriori algorithm to limit the number of candidates. In the support counting
phase, the support of all minimal non-frequent itemsets (also termed the negative border [Toi96]) is
also counted. This computation can be effectively used for maintaining the frequent itemsets when
the transaction database is updated.
The algorithm for the incremental maintenance of frequent itemsets presented in [TBAR97]
utilizes the work spent in counting the support of itemsets in the negative border. All the itemsets in
the negative border along with their support counts are maintained. It has been proved that for an
itemset s which was not originally present in the frequent set or the negative border to appear in
the updated frequent set or negative border, an itemset t ⊆ s (such that t was in the original negative
border) must become frequent (see [TBAR97] for the proof and the details of the algorithm). The
incremental algorithm can be summarized as:
1. Find the frequent itemsets in the increment database db. Simultaneously count the support of
all itemsets X ∈ FrequentSets ∪ NegativeBorder in db.
2. Update the support count of all itemsets in FrequentSets ∪ NegativeBorder to include their
support in db.
3. Find the itemsets in NegativeBorder that became frequent by the addition of db (call them
PromotedNegativeBorder, PNb).
4. Find the candidate closure with PNb as the seed.
5. If there are no new itemsets in the candidate closure, skip this step. Otherwise, count the
support of all new itemsets in the candidate closure against the whole dataset.
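Steps 2 and 3, the bookkeeping part of the algorithm, can be sketched as follows (itemsets as frozensets; the candidate-closure computation of steps 4 and 5 is elided, and the numbers are invented):

```python
def merge_and_promote(F, NB, counts, db_counts, minsup):
    """Steps 2 and 3 of the incremental algorithm (a sketch).
    F, NB: frequent set and negative border as sets of frozensets;
    counts / db_counts: itemset -> support in the original / increment data;
    minsup: absolute support threshold for the combined data."""
    for x in F | NB:                      # step 2: fold in increment supports
        counts[x] = counts.get(x, 0) + db_counts.get(x, 0)
    PNb = {x for x in NB if counts[x] >= minsup}   # step 3: promoted border
    return F | PNb, NB - PNb, PNb

F = {frozenset({"a"})}
NB = {frozenset({"b"})}
counts = {frozenset({"a"}): 5, frozenset({"b"}): 2}
db_counts = {frozenset({"a"}): 2, frozenset({"b"}): 2}
F2, NB2, PNb = merge_and_promote(F, NB, counts, db_counts, minsup=4)
print(PNb)   # {"b"} is promoted: 2 + 2 >= 4
```

A non-empty PNb is exactly the condition under which steps 4 and 5 (candidate closure and a full-data counting pass) are needed.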
3.2 Incremental constrained association mining
The negative border based incremental mining algorithm is applicable for mining associations with
constraints that are closed with respect to the set inclusion property; that is, if an itemset satisfies
the constraint then so do all its subsets. For incremental mining of constrained associations we also
need to materialize and store all the itemsets in the negative border and their support counts. The
reason why this is enough is that when new transaction data is added, only the support counts of the
itemsets can change. We assume that there is a frequency constraint, which is typically the case
in association mining. The definition of the negative border also remains the same: the set
of minimal itemsets that did not satisfy the frequency constraint. We list the various steps of the
incremental algorithm, which is very similar to the case with just the frequency constraint except for
a few differences.
1. Find the frequent itemsets in the increment database db. Simultaneously count the support
of all itemsets X ∈ FrequentSets ∪ NegativeBorder in db. For this we use the framework
outlined in Section 2.2. The different constraints that are present can be handled at the various
steps in the mining process as shown in Figure 2.
2. Update the support count of all itemsets in FrequentSets ∪ NegativeBorder to include their
support in db.
3. Find the itemsets in NegativeBorder that became frequent by the addition of db (call them
PromotedNegativeBorder, PNb).
4. Find the candidate closure with PNb as the seed. During this computation, we use the candi
date generation procedure with constraints as described in Section 2.2.
5. If there are no new itemsets in the candidate closure, skip this step. Otherwise, count the
support of all new itemsets in the candidate closure against the whole dataset. The dataset is
subjected to the data filtering step before the support counting, as shown in Figure 2.
3.3 Constraint relaxation
The negative border idea can be used to handle some cases of constraint relaxation as well, especially
relaxations of the external constraints and the frequency constraint. For instance, in the example
in Section 2.2, if we relax the constraint on the sales transaction to sales.total_price > 300, it
can be cast as an incremental mining problem. In this case, the increment dataset would be the
transactions with total_price between 300 and 500. Note that, for this to work, the relaxed constraint
should subsume the initial constraint. Relaxation of the frequency constraint involves two cases.
In cases where the frequency threshold is increased, it is straightforward to find the new frequent
itemsets by just filtering the itemsets with the new threshold. On the other hand, if the frequency
threshold is lowered, we need to use an approach similar to the incremental mining algorithm, where
we compute the promoted negative borders, candidate closure, etc. to determine if the support of any
new itemset needs to be found.
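The two cases of frequency relaxation can be sketched as follows (counts holds the supports of the frequent itemsets and the negative border; the candidate-closure exploration that must follow a promotion is elided, and the example numbers are invented):

```python
def relax_frequency(F, NB, counts, new_minsup, old_minsup):
    """Sketch of relaxing the frequency constraint.
    Raising the threshold only filters the current frequent set; lowering it
    promotes negative-border itemsets, after which the candidate closure
    must be explored. Returns (new frequent set, promoted border)."""
    if new_minsup >= old_minsup:
        return {x for x in F if counts[x] >= new_minsup}, set()
    PNb = {x for x in NB if counts[x] >= new_minsup}
    return F | PNb, PNb

F = {frozenset({"a"})}
NB = {frozenset({"b"})}
counts = {frozenset({"a"}): 10, frozenset({"b"}): 6}
print(relax_frequency(F, NB, counts, 5, 8))   # lowering 8 -> 5 promotes {"b"}
```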
4 SQL formulations of incremental mining
In this section, we present SQL formulations for the incremental mining algorithm. This shows how
the SQL-based mining framework in [STA98] can be extended for incremental mining. Note that
incremental mining can be considered as a special case of relaxing the frequency constraint. It is
possible to handle other kinds of constraints also within the same framework. We present two SQL-
based approaches for incremental frequent itemset mining, based on the ones presented in [STA98]
for frequent itemset mining.
Two categories of SQL formulations for frequent itemset mining, one based on SQL-92 and
the other using the object-relational extensions to SQL (SQL-OR), and their performance
implications are presented in [STA98]. A cost-based analysis of the SQL-92 approaches and the
related performance optimizations are discussed in [TC98]. We discuss here how these techniques
can be adapted for incremental mining. Efficient SQL formulation of the incremental algorithm
entails counting the support of candidates of multiple sizes together in the same pass. In step 5 of the
incremental algorithm in Section 3.1, the new candidate itemsets produced by the candidate closure
computation could be of different sizes and they are counted together in the same pass. The
input transaction data is stored in a relational table T with the schema (tid, item). The increment
transaction table δT also has the same schema. The frequent itemsets and negative border of size
k are stored in tables with the schema (item_1, ..., item_k, count). We present two approaches for
support counting, based on the Subquery and the Vertical approaches which performed the best in
the SQL-92 and SQL-OR categories in the experiments reported in [STA98]. We also outline the
SQL-based candidate closure computation (step 4 of the incremental algorithm).
4.1 Subquery approach
In this approach, support counting is done by a set of k nested subqueries where k is the size of
the largest itemset. We present here the extensions to the subquery approach in [STA98] to count
candidates of different sizes. Subquery Q_l finds all tids that support the distinct candidate l-itemsets
(d_l). d_l comprises the distinct l-item prefixes of all candidate itemsets. However, it is sufficient
to use C_l, the candidate l-itemsets, instead of d_l since all l-item prefixes of candidate itemsets with
more than l items will be present in C_l. The output of Q_l is grouped on the l items to find the
support of the candidate l-itemsets. Q_l is also joined with δT (T while counting the support in the
whole database) and C_{l+1} to get Q_{l+1}. The SQL queries and the corresponding tree diagrams for the
above computations are given in Figure 4. δB_l stores the support counts of all frequent and negative
border l-itemsets in δT.
insert into δB_l
select item_1, ..., item_l, count(*)
from (Subquery Q_l) t
group by item_1, ..., item_l

Subquery Q_l (for any l between 1 and k):
select item_1, ..., item_l, tid
from δT t_l, (Subquery Q_{l-1}) as r_{l-1}, C_l
where r_{l-1}.item_1 = C_l.item_1 and ... and
      r_{l-1}.item_{l-1} = C_l.item_{l-1} and
      r_{l-1}.tid = t_l.tid and
      t_l.item = C_l.item_l

Subquery Q_0: no subquery Q_0.

Figure 4: Support counting using subqueries (in the tree diagram for Subquery Q_l, δT is joined
with the result of Subquery Q_{l-1} and with C_l on the shared items and tid; a group-by over
Subquery Q_l yields δB_l, and Q_l also feeds Subquery Q_{l+1})
The output of subquery Q_l needs to be materialized since it is used both for counting the support of l-
itemsets and to generate Q_{l+1}. If the query processor is augmented to support multiple streams, where
the output of an operator can be piped into more than one subsequent operator, the materialization
of the Q_l's can be avoided. In the basic association rule mining, we do not have to count itemsets of
different sizes in the same pass since C_{l+1} becomes available only after the frequent l-itemsets are
computed.
For steps 1 and 5 of the incremental algorithm, we use the queries outlined above. Tables B_l and δB_l
store the frequent and negative border l-itemsets and their support counts in T and δT respectively.
Step 2 can be performed by joining B_l and δB_l and adding the corresponding support counts. We
add another attribute to B_l to keep track of promoted negative borders, and this can be done along
with step 2. In step 3, we simply select the itemsets corresponding to the promoted negative border.
4.2 Computing candidate closure
In the apriori algorithm candidate itemsets are generated in two steps: the join step and the prune
step. In the join step, two sets of (k-1)-itemsets called generators and extenders are joined to get
k-itemsets. An itemset s1 in generators joins with s2 in extenders if the first (k-2) items of s1
and s2 are the same and the last item of s1 is lexicographically smaller than the last item of s2.
The join results in an itemset that is s1 extended with the last item of s2. The result of the join
step is subjected to subset pruning, which filters out all itemsets with a non-frequent (k-1)-subset.
This can be accomplished by subsequent joins with (k-2) copies of a set of itemsets termed filters.
While generating C_k from F_{k-1} in the basic apriori algorithm, F_{k-1} acts as generators, extenders
and filters.
In the incremental algorithm, we compute the candidate closure to avoid multiple passes while
counting the support of the new candidates. It can be seen that all the new candidates will be
supersets of promoted borders. Therefore, it is sufficient to use the itemsets that are promoted
borders as generators. In order to generate C_k, the candidates of size k, we use PB_{k-1} ∪ C_{k-1} as
generators and PB_{k-1} ∪ C_{k-1} ∪ F_{k-1} as extenders and filters. PB_{k-1} and F_{k-1} denote promoted
borders and frequent itemsets of size (k-1) respectively. The candidate generation process starts
with C_0 as the empty set and terminates when C_k becomes empty. It is straightforward to derive
SQL queries for this process and we do not present them here.
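One step of this candidate closure can be sketched as follows (a Python illustration rather than the SQL formulation; itemsets are sorted tuples, and the level-k sets here play the role of the PB_{k-1} ∪ C_{k-1} construction above when producing the (k+1)-candidates):

```python
from itertools import combinations

def next_candidates(PB_k, C_k, F_k):
    """Sketch of one candidate-closure step: generators are PB_k ∪ C_k,
    extenders and filters are PB_k ∪ C_k ∪ F_k, so every generated
    candidate contains a promoted border or a new candidate."""
    generators = PB_k | C_k
    if not generators:
        return set()
    extenders = filters = PB_k | C_k | F_k
    k = len(next(iter(generators)))
    out = set()
    for s1 in generators:
        for s2 in extenders:
            # join: same (k-1)-prefix, differing last items (either order,
            # since a promoted border may contribute either of the two items)
            if s1[:-1] == s2[:-1] and s1[-1] != s2[-1]:
                cand = s1[:-1] + tuple(sorted((s1[-1], s2[-1])))
                # subset pruning against the filters
                if all(t in filters for t in combinations(cand, k)):
                    out.add(cand)
    return out

PB2 = {(1, 2)}
F2 = {(1, 3), (2, 3)}
print(next_candidates(PB2, set(), F2))   # {(1, 2, 3)} contains the promoted border (1, 2)
```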
4.3 Vertical
In the Subquery approach, for every transaction that supports an itemset we generate (itemset, tid)
tuples, resulting in large intermediate tables. The Vertical approach avoids this by collecting all tids
that support an itemset into a BLOB (binary large object) and generating (itemset, tidlist) tuples.
Initially, tidlists for individual items are created using a table function. The tidlist for an itemset
is obtained by intersecting the tidlists of its items using a user-defined function (UDF). The SQL
queries for support counting are similar in structure to those of the Subquery approach, except for the
use of UDFs to intersect the tidlists. We refer the reader to [STA98] for the details.
The increment transaction table δT is transformed into the vertical format by creating the delta
tidlists of the items. The delta tidlists are used to count the support of the candidate itemsets in δT
and are later merged with the original tidlists. This can be accomplished by joining the original
tidlist table with the delta tidlist table and merging the tidlists with a UDF. If the incremental
algorithm requires a pass over the complete data, the merged tidlists are used for support counting.
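The tidlist operations can be sketched as follows (plain Python lists of integer tids stand in for the BLOB-encoded tidlists and the UDFs; a sketch, not the paper's implementation):

```python
def intersect_tidlists(l1, l2):
    # stand-in for the UDF that intersects two tidlists
    s = set(l2)
    return [t for t in l1 if t in s]

def merge_tidlists(original, delta):
    # merge an item's original tidlist with its delta tidlist from the
    # increment; increment tids are assumed to follow the original ones
    return original + delta

# support of an itemset = length of the intersection of its items' tidlists
tid_a, tid_b = [1, 2, 4], [2, 3, 4]
delta_a, delta_b = [5, 6], [6]
support_in_delta = len(intersect_tidlists(delta_a, delta_b))  # over the increment only
full_a = merge_tidlists(tid_a, delta_a)
full_b = merge_tidlists(tid_b, delta_b)
support_full = len(intersect_tidlists(full_a, full_b))        # over the full data
```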
4.4 Performance results
In this section, we report the results of some of our performance experiments to quantify the
advantages of the incremental mining algorithm. Note that incremental mining can be considered a
special case of relaxing the frequency constraint. The experiments were performed on Version 5 of
IBM DB2 Universal Server installed on a Sun Ultra 5 Model 270 with a 270 MHz UltraSPARC-IIi
CPU, 128 MB main memory and a 4 GB disk. We report the results of the SQL formulations of the
incremental algorithm based on the Subquery and Vertical approaches.
The experiments were performed on synthetic data generated using the same technique as
in [AS94]. The dataset used for the experiment was T10.I4.D100K (average size of a transaction = 10,
average size of maximal potentially large itemsets = 4, number of transactions = 100 thousand). The
increment database is created as follows: we generate 100 thousand transactions, of which (100 - d)
thousand are used for the initial computation and d thousand are used as the increment, where d is the
fractional size (in percentage) of the increment.
Figure 5: Speedup of the incremental algorithm based on the Subquery approach (speedup vs. minimum support in %)
Figure 6: Speedup of the incremental algorithm based on the Vertical approach (speedup vs. minimum support in %)
We compare the execution time of the incremental algorithm with respect to mining the whole
dataset. Figures 5 and 6 show the corresponding speedups of the incremental algorithm based on
the Subquery and the Vertical approaches for different minimum support thresholds. We report the
results for increment sizes of 1%, 5% and 10% (shown in the legend). We can make the following
observations from the graphs:
The incremental algorithm based on the Subquery approach achieves a speedup of about 3 to
20 as compared to mining the whole dataset. However, the maximum speedup of the Vertical
approach is only about 4. For support counting, the Vertical approach uses a user-defined
function (UDF) to intersect the tidlists. The incremental algorithm must invoke the
UDF at least the same number of times, since the support of all the itemsets in the frequent
set and the negative border needs to be found in the increment database. In cases where the
support of new candidates needs to be counted, the number of invocations will be even larger as
compared to mining the whole dataset. The time taken by the Vertical approach in the support
counting phase is directly proportional to the number of times the UDF is called. However,
the incremental algorithm saves in the tidlist creation phase, since the size of the increment
dataset is only a fraction of the whole dataset. This explains why the speedup of the Vertical
approach is low. In contrast, the Subquery approach achieves higher speedup since the time
taken is proportional to the size of the dataset.
It is possible to achieve better speedup for the Vertical approach by allocating a smaller BLOB
(binary large object) for computations involving the increment dataset. Note that the tidlists
for the items are stored as BLOBs. In our experiments, we used the same BLOB size for the
increment dataset and the initial dataset in order to use the same user-defined function for
support counting and the same table function for tidlist creation (refer to [STA98] for a detailed
description of the Vertical approach).
The speedup reduces as the minimum support threshold is lowered. At lower support values
the chances of the negative border expanding are higher, and as a result the incremental algorithm
may have to compute the candidate closure and count the support of the new candidates in
the whole dataset.
The speedup is higher for smaller increment sizes since the incremental algorithm needs to
process less data.
With respect to the absolute execution time, the Subquery and the Vertical approaches followed
the same trend as reported in [STA98]: the Vertical approach was about 3 to 6 times faster
than the Subquery approach.
4.5 New-candidate optimization
In the basic incremental algorithm, we find the frequent itemsets in the increment database db along
with counting the support of all the itemsets in the frequent set and the negative border. However,
the frequent itemsets in db are used only to prune the non-frequent itemsets in db while computing
the candidate closure. In the candidate closure computation, we assume that the new candidate k-
itemsets are frequent while generating the (k+1)-itemsets. At this step, the new candidate k-itemsets
that are infrequent in db are known to be infrequent in the whole dataset as well and can be pruned.
This is because the new candidate k-itemsets were infrequent in the old dataset (they were not even
in the negative border). Therefore, they need to be frequent at least in db to have a chance of being
frequent in the whole dataset.
With the new-candidate optimization, we count the support of an itemset in db only if it is
required. In the first phase, while counting the support in db of the itemsets in the frequent set and
the negative border, we do not find all the frequent itemsets in db. During the candidate closure
computation, we count the support in db of just the new candidates, for the pruning explained above.
This results in better speedup as compared to the basic incremental algorithm.
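The pruning test can be sketched as follows (a Python sketch with our own naming: `min_support` is the relative threshold and `db_size` the number of transactions in db; neither name is from the paper):

```python
def prune_new_candidates(new_cands, support_in_db, min_support, db_size):
    """New candidate k-itemsets were not even in the old negative border,
    so they must be frequent in the increment db to have any chance of
    being frequent in the whole dataset; the rest can be pruned."""
    threshold = min_support * db_size
    return {c for c in new_cands if support_in_db.get(c, 0) >= threshold}

kept = prune_new_candidates(
    {("a", "b"), ("a", "c")},            # new candidates from the closure
    {("a", "b"): 30, ("a", "c"): 2},     # support counted in db only
    0.01, 1000)                          # 1% of a 1000-transaction increment
```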
Figure 7: Speedup of the incremental algorithm based on the Subquery approach with the new-candidate optimization (speedup vs. minimum support in %)
Figure 8: Speedup of the incremental algorithm based on the Vertical approach with the new-candidate optimization (speedup vs. minimum support in %)
Figures 7 and 8 show the speedup of the Subquery and Vertical approaches with the new-
candidate optimization. We can observe that this optimization achieves noticeably better speedups
than the basic incremental algorithm. The improvement is greater at smaller increment sizes. The
reason is that, when the increment is smaller, we have to use proportionately smaller minimum
support values while finding the frequent itemsets in db. This could result in counting too many
spurious candidates.
5 Applicability beyond association mining
In this section, we discuss how the incremental approach presented in Section 3 can be extended to
other data mining and decision support problems. In Section 5.1, we discuss the applicability of the
incremental algorithm to mining closed sets. Applicability to incremental evaluation of query flocks
and to certain kinds of materialized views is described in Sections 5.2 and 5.3 respectively.
5.1 Mining closed sets
All the efficient algorithms for mining association rules exploit the closure property of frequent
itemsets. Minimum support, which characterizes frequent itemsets, is downward closed: if an itemset
has minimum support then all its subsets also have minimum support. The idea of the negative border
can be used for all incremental mining problems that possess closure properties. If the closure
property is also incrementally updatable (for instance, support), it is possible to limit the database
access to at most one scan of the whole database, as shown in the incremental frequent itemset
mining example. A property is incrementally updatable if it is possible to derive its value from the
corresponding values of different partitions of the input data. A few examples are COUNT, SUM,
MIN, MAX, etc. We list below a few other data mining problems that have closure properties.
1. Sequential patterns: In sequential pattern mining, the frequent patterns are closed with respect
to minimum support, much like the frequent itemsets. The support is incrementally updatable.
2. Correlation rules: Correlation rules [BMS97] are upward closed with respect to correlation: if
a set of items S is correlated, so is every superset of S. A set of items is said to be correlated
if any of its subsets are dependent. For efficient evaluation of correlation rules, a notion of
support is introduced in [BMS97], which is downward closed. However, these properties are not
incrementally updatable.
3. Consequent part of association rules: For any frequent itemset, if a rule with consequent c
holds (that is, it has minimum confidence) then so do rules with consequents that are subsets
of c [AMS+96]; that is, such rules are downward closed. Note that the confidence is not
incrementally updatable.
4. Maximal frequent itemsets: Random walk algorithms [GMS97], which work by generating a
series of random walks along the itemset lattice, exploit the downward closure property of
maximal frequent itemsets. Support, which characterizes frequent itemsets in this case, is also
incrementally updatable.
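The notion of an incrementally updatable property used above (COUNT, SUM, MIN, MAX) can be illustrated with a small sketch: the value over the whole data is derivable from the per-partition values, so the old partition never has to be rescanned when an increment arrives.

```python
# Two partitions of the input data: the old data and the increment.
old_part, increment = [3, 7, 2], [9, 1]

count_total = len(old_part) + len(increment)       # COUNT combines by addition
sum_total   = sum(old_part) + sum(increment)       # SUM combines by addition
min_total   = min(min(old_part), min(increment))   # MIN combines by min
max_total   = max(max(old_part), max(increment))   # MAX combines by max
```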
5.2 Query flocks
Boolean association rules were defined in the context of market basket data, and they have been
generalized to mine associations across relational tables expressed as query flocks [TUA+98]. A
query flock is a parameterized query with a filter condition to eliminate the values of parameters
that are "uninteresting". For example, let us assume that we want to evaluate the mining query
"What are the interesting drivers that caused customers to buy the widgets from a catalog?". A
driver is deemed interesting if it has caused at least 10 customers to buy the widget. Let the
data be stored in a set of relational tables catalog(widget, manufacturer), sale(customer, widget),
driver(customer, widget, driver). The above query can be written as a query flock in Datalog as shown
below. The filter condition prunes out values which do not have minimum support. In Section 5.2.1,
we discuss how the negative border idea can be used for efficient incremental evaluation of query
flocks.
QUERY:
answer(C) :- sale(C, $W) AND
             driver(C, $W, $D) AND
             catalog($W, manufacturer)
FILTER:
COUNT(C) >= 10
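For concreteness, one possible SQL rendering of this flock (our translation, not given in the paper): the parameters $W and $D become GROUP BY columns and the filter becomes a HAVING clause. A threshold of 2 is used here only so that the toy data stays small; the paper's filter uses 10.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
    CREATE TABLE sale(customer, widget);
    CREATE TABLE driver(customer, widget, driver);
    CREATE TABLE catalog(widget, manufacturer);
    INSERT INTO sale VALUES ('c1','w1'),('c2','w1'),('c3','w2');
    INSERT INTO driver VALUES ('c1','w1','ad'),('c2','w1','ad'),
                              ('c3','w2','mail');
    INSERT INTO catalog VALUES ('w1','m1'),('w2','m1');
""")
cur.execute("""
    SELECT d.widget, d.driver, COUNT(DISTINCT s.customer)
    FROM sale s
    JOIN driver d ON d.customer = s.customer AND d.widget = s.widget
    JOIN catalog c ON c.widget = d.widget
    GROUP BY d.widget, d.driver
    HAVING COUNT(DISTINCT s.customer) >= 2   -- scaled-down filter
""")
result = cur.fetchall()
```

Only the (widget, driver) parameter combinations whose answer set passes the filter survive, mirroring how the flock keeps only interesting parameter values.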
5.2.1 Incremental evaluation of query flocks
Applying the apriori technique for evaluating the above query flock will result in the following safe
subqueries [TUA+98]:
1. Q1: answer(C) :- sale(C, $W)
2. Q2: answer(C) :- driver(C, $W, $D)
3. Q3: answer(C) :- sale(C, $W) AND driver(C, $W, $D)
4. Q4: answer(C) :- sale(C, $W) AND driver(C, $W, $D) AND catalog($W, manufacturer)
The query flocks corresponding to the safe subqueries form a lattice with query containment as the
partial order and the original query flock as the top element. That is, if Q1 and Q2 are elements of
the lattice, Q1 ≤ Q2 implies that the result of Q2 is contained in the result of Q1.
During the execution of the subqueries of the query flock, all records with parameter values
which satisfy the filter condition are propagated to the next higher subquery in the lattice for
further evaluation. For example, after Q1 is evaluated, all the records corresponding to widgets
that are bought by at least 10 customers will be piped to Q3, which is immediately above Q1 in the
subquery lattice. The parameter value combinations in this case are analogous to itemsets in boolean
association rule mining.
The negative border in this apriori-based query flock evaluation is the set of parameter value
combinations that do not pass the filter condition test. These combinations, if materialized and
stored along with their support counts, can be effectively used to update the result of the query flock
when the base tables over which it is defined are updated. We first check if any of the records in
the negative border pass the filter condition as a result of changes to the base tables. This can be
done starting at the bottom element of the lattice and proceeding upwards, propagating the deltas
corresponding to the new combinations. If none of the records in the negative border passes the
filter condition, we do not have to evaluate the subqueries for the lattice elements above. The base
tables may have to be looked up (for example, to be joined with the deltas) depending upon the
subquery representing the lattice element. It is also possible to evaluate the filter condition for all
the lattice elements' negative borders together. However, this might involve more computations than
propagating the deltas through the lattice.
5.3 View maintenance
Incremental mining can be seen as materialized view maintenance. In boolean association rules, the
frequent itemsets and the negative border are in fact aggregate views over the transaction table. In
query flocks, each element in the subquery lattice can be considered as a view defined on the base
tables. Therefore, view maintenance techniques similar to the ones in [MQM97] can be used for
incremental mining. On the other hand, the negative border based change propagation can also be
applied for the maintenance of views involving monotone aggregate functions that satisfy the apriori
subset property. For example, if the view definition has a filter condition such as the SQL "having"
clause, it could be beneficial to also store the records which do not pass the filter condition test
(the analog of the negative border). When the base tables are updated, these records can be used to
maintain the view rather than rematerializing it.
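A minimal sketch of this idea, under our own in-memory representation (dicts of per-group counts stand in for the view, the stored failing groups, and the insert delta; this is not the paper's mechanism):

```python
def apply_delta(view, border, delta, threshold):
    """Maintain a view with a HAVING COUNT(*) >= threshold filter.
    `view` holds groups passing the filter, `border` the stored groups
    that currently fail it, and `delta` the per-group counts of newly
    inserted base tuples."""
    for group, d in delta.items():
        if group in view:
            view[group] += d
        else:
            border[group] = border.get(group, 0) + d
            if border[group] >= threshold:       # promoted into the view
                view[group] = border.pop(group)
    return view, border

view, border = apply_delta({"x": 12}, {"y": 8},
                           {"x": 1, "y": 3, "z": 2}, threshold=10)
```

Because the failing groups are kept with their counts, the inserts can be absorbed without touching the base table; without the stored border, group "y" would have forced a recomputation.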
An itemset can also be treated as a point in the data cube [GBLP96] defined by the items as
dimensions and support as the measure. The data cube points can be arranged as a lattice according
to the partial order on the itemsets. In that case, data cube maintenance algorithms similar
to that of [RKR97] are also applicable here. However, this approach might not be viable in cases
involving a large number of items.
5.4 Other approaches
In this section, we discuss the applicability of the various architectural alternatives for database
integration of mining (presented in [STA98]) to incremental mining and handling constraints.
In the Loose-coupling approach, the transaction data is read tuple by tuple from the DBMS
to the mining kernel using a cursor interface. Data is never copied to a file system outside the
DBMS. The mining process runs in a different address space from the DBMS. This architecture can
be extended to handle incremental mining and constraints just by implementing the appropriate
algorithms in the mining kernel. The DBMS interface does not require any change. In cases where
the support of new itemsets needs to be counted, limiting the data access to just one scan of the
whole database entails counting candidate itemsets of multiple sizes in the same pass. This can be
accomplished by passing the transactions through all the candidates of different sizes and updating
their support counts.
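The single-pass, multi-size counting described above can be sketched as (itemsets again as sorted tuples; a sketch, not the paper's implementation):

```python
from itertools import combinations

def count_candidates(transactions, candidates_by_size):
    """One scan over the data that updates support counts for candidate
    itemsets of several sizes at once."""
    counts = {c: 0 for cs in candidates_by_size.values() for c in cs}
    for t in transactions:
        items = sorted(set(t))
        for k, cs in candidates_by_size.items():
            if len(items) >= k:
                for sub in combinations(items, k):
                    if sub in cs:
                        counts[sub] += 1
    return counts

counts = count_candidates(
    [["a", "b", "c"], ["b", "a"]],
    {2: {("a", "b")}, 3: {("a", "b", "c")}})
```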
In the Stored-procedure approach, the mining algorithm is encapsulated as a stored procedure
that runs in the same address space as the DBMS. The Cache-Mine approach is a variation
of the Loose-coupling approach where, after reading the entire data once from the DBMS, the
algorithm temporarily caches the relevant data in a lookaside buffer on a local disk. Both these
approaches can be made to handle incremental mining and constraints by implementing the respective
algorithms in the appropriate manner. However, the Cache-Mine approach might not give
better performance than the others, since the incremental algorithm requires at most one scan of the
entire data. In the UDF approach, the mining algorithm is expressed as a collection of user-defined
functions (UDFs) that are appropriately placed in SQL data scan queries [AS96]. Extending this
approach for incremental mining is straightforward and will involve writing UDFs for the different
steps of the incremental algorithm. Constraints can also be handled in a similar way.
6 Summary and Future Work
In this paper, we have established a general approach for incremental mining of association rules by
applying the negative border concept to constraints having the downward closure property. Further-
more, we have shown how incremental mining can be carried out in the relational framework by
using materialized views. This is important because data mining typically involves computation of
views with aggregate conditions, and the proposed approach demonstrates a way to incrementally
compute them. The negative border concept helps in determining what is to be materialized and how
to propagate the required deltas along the computation lattice. The approach is general and depends
only on the property of the constraints and the structure of the view (or the mining algorithm).
This is amply demonstrated in its applicability to the different approaches to association rule mining
shown in this paper.
In this paper, we have also addressed the problem of incrementally mining association rules from
data stored in relational tables. Our focus here was on showing the applicability of the negative border
concept to incremental mining in relational databases, and on formulating it as a view maintenance
problem. Furthermore, we have also sketched how the various approaches to association rule mining
presented in [STA98] can be extended for incremental mining. The concept of the negative border,
which is the key to the incremental algorithm, has other applications as well. It can be used for mining
association rules with varying support and confidence values. For instance, the negative border can
be used to determine the updated frequent itemsets if the support is changed. If the support is
increased, it is trivial to update the frequent itemsets. However, if the support is lowered, the itemsets
in the promoted negative border can be used to determine if the support of any new itemset needs
to be counted, in a manner similar to incremental mining. This could be quite useful in cases where
determining the "right" support value is difficult. Initially, the frequent itemsets for an approximate
support can be computed, which is then further refined based on user feedback.
6.1 Future Work
There are several interesting problems beyond what is presented in this paper. It is useful to establish
a correspondence between the constraints proposed in the literature and the ones used in this paper.
The framework also needs to be extended to include constraints beyond the ones used in this paper.
It is not clear how the incremental approach can be exploited for incremental mining using some of
the mining algorithms presented in the literature (for example, [BMUT97]).
The success of SQL as a data manipulation language stems mainly from its ad hoc query support.
The recent proposal of query flocks [TUA+98] can be viewed as an attempt in that direction. We
show that the negative border concept is applicable to query flocks as well and hence can be used for
incremental flock evaluation. We have considered only query flocks involving conjunctive predicates.
More work needs to be done to determine if the concept of the negative border holds for flocks with
negations and unions.
The translation of algorithms to implementations is not always straightforward. Some of the
subtleties may result in less than expected speedup. This needs to be investigated, at least in the
relational domain, if mining is to be done in the database or data warehouse context.
References
[AIS93] Rakesh Agrawal, Tomasz Imielinski, and Arun Swami. Mining association rules between sets of
items in large databases. In Proc. of the ACM SIGMOD Conference on Management of Data,
pages 207-216, Washington, D.C., May 1993.
[AMS+96] Rakesh Agrawal, Heikki Mannila, Ramakrishnan Srikant, Hannu Toivonen, and A. Inkeri Verkamo.
Fast Discovery of Association Rules. In Usama M. Fayyad, Gregory Piatetsky-Shapiro, Padhraic
Smyth, and Ramasamy Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining,
chapter 12, pages 307-328. AAAI/MIT Press, 1996.
[AS94] Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules. In
Proc. of the 20th Int'l Conference on Very Large Databases, Santiago, Chile, September 1994.
[AS96] Rakesh Agrawal and Kyuseok Shim. Developing tightly-coupled data mining applications on a
relational database system. In Proc. of the 2nd Int'l Conference on Knowledge Discovery in
Databases and Data Mining, Portland, Oregon, August 1996.
[BMS97] Sergey Brin, Rajeev Motwani, and Craig Silverstein. Beyond market baskets: Generalizing asso-
ciation rules to correlations. In Proc. of the ACM SIGMOD Conference on Management of Data,
May 1997.
[BMUT97] Sergey Brin, Rajeev Motwani, Jeffrey D. Ullman, and Shalom Tsur. Dynamic itemset counting
and implication rules for market basket data. In Proc. of the ACM SIGMOD Conference on
Management of Data, May 1997.
[FAAM97] R. Feldman, Y. Aumann, A. Amir, and H. Mannila. Efficient Algorithms for Discovering Frequent
Sets in Incremental Databases. In Proceedings of the 1997 SIGMOD Workshop on Research Issues
on Data Mining and Knowledge Discovery, Tucson, Arizona, May 1997.
[GBLP96] J. Gray, A. Bosworth, A. Layman, and H. Pirahesh. Data cube: A relational aggregation oper-
ator generalizing group-by, cross-tab, and sub-total. In Proc. of the 1996 Int'l Conference on Data
Engineering, New Orleans, LA, February 1996.
[GMS97] D. Gunopulos, H. Mannila, and S. Saluja. Discovering all most specific sentences by randomized
algorithms. In Proc. of the 6th International Conference on Database Theory, January 1997.
[HS95] Maurice Houtsma and Arun Swami. Set-oriented mining of association rules. In Int'l Conference
on Data Engineering, Taipei, Taiwan, March 1995.
[MQM97] I. Mumick, D. Quass, and B. Mumick. Maintenance of data cubes and summary tables in a
warehouse. In Proc. of the ACM SIGMOD Conference on Management of Data, Tucson, Arizona,
May 1997.
[NLHP98] Raymond T. Ng, Laks V.S. Lakshmanan, Jiawei Han, and Alex Pang. Exploratory Mining and
Pruning Optimizations of Constrained Association Rules. In Proc. of the ACM SIGMOD Confer-
ence on Management of Data, Seattle, Washington, June 1998.
[RKR97] N. Roussopoulos, Y. Kotidis, and M. Roussopoulos. Cubetree: Organization of and bulk incre-
mental updates on the data cube. In Proc. of the ACM SIGMOD Conference on Management of
Data, Tucson, Arizona, May 1997.
[SK97] A. Siebes and M. L. Kersten. KESO: Minimizing Database Interaction. In Proc. of the 3rd Int'l
Conference on Knowledge Discovery and Data Mining, Newport Beach, California, August 1997.
[SON95] A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in
large databases. In Proc. of the VLDB Conference, Zurich, Switzerland, September 1995.
[STA98] Sunita Sarawagi, Shiby Thomas, and Rakesh Agrawal. Integrating Association Rule Mining with
Relational Database Systems: Alternatives and Implications. In Proc. of the ACM SIGMOD
Conference on Management of Data, Seattle, Washington, June 1998.
[SVA97] Ramakrishnan Srikant, Quoc Vu, and Rakesh Agrawal. Mining Association Rules with Item
Constraints. In Proc. of the 3rd Int'l Conference on Knowledge Discovery in Databases and Data
Mining, Newport Beach, California, August 1997.
[TBAR97] Shiby Thomas, Sreenath Bodagala, Khaled Alsabti, and Sanjay Ranka. An Efficient Algorithm
for the Incremental Updation of Association Rules in Large Databases. In Proc. of the 3rd Int'l
Conference on Knowledge Discovery and Data Mining, Newport Beach, California, August 1997.
[TC98] Shiby Thomas and Sharma Chakravarthy. Performance Evaluation and Optimization of Join
Queries for Association Rule Mining. Technical Report TR 98-16, University of Florida,
Gainesville, Florida, October 1998.
[Toi96] Hannu Toivonen. Sampling large databases for association rules. In Proc. of the 22nd Int'l
Conference on Very Large Databases, pages 134-145, Mumbai (Bombay), India, September 1996.
[TS98] Shiby Thomas and Sunita Sarawagi. Mining Generalized Association Rules and Sequential Pat-
terns Using SQL Queries. In Proc. of the 4th Int'l Conference on Knowledge Discovery and Data
Mining, New York, August 1998.
[TUA+98] Dick Tsur, Jeffrey Ullman, Serge Abiteboul, Chris Clifton, Rajeev Motwani, Svetlozar Nestorov,
and Arnon Rosenthal. Query Flocks: A Generalization of Association Rule Mining. In Proc. of
the ACM SIGMOD Conference on Management of Data, Seattle, Washington, June 1998.
