Group Title: Department of Computer and Information Science and Engineering Technical Reports
Title: Performance evaluation and optimization of join queries for association rule mining
Full Citation
Permanent Link: http://ufdc.ufl.edu/UF00095427/00001
 Material Information
Title: Performance evaluation and optimization of join queries for association rule mining
Series Title: Department of Computer and Information Science and Engineering Technical Reports
Physical Description: Book
Language: English
Creator: Thomas, Shiby
Chakravarthy, Sharma
Affiliation: University of Florida
Publisher: Department of Computer and Information Science and Engineering, University of Florida
Place of Publication: Gainesville, Fla.
Copyright Date: 1998
 Record Information
Bibliographic ID: UF00095427
Volume ID: VID00001
Source Institution: University of Florida
Holding Location: University of Florida
Rights Management: All rights reserved by the source institution and holding location.







Paper U076


Performance Evaluation and Optimization of Join Queries for

Association Rule Mining*


Shiby Thomas†                                  Sharma Chakravarthy

Database Systems Research and Development Center
Computer and Information Science and Engineering Department
University of Florida, Gainesville FL 32611
email: {sthomas, sharma}@cise.ufl.edu




Abstract

The rapid growth in data warehousing technology and the tremendous drop in storage prices have
enabled collection of large volumes of data in an ever increasing number of business organizations.
One of the greatest challenges today is how to turn these rapidly expanding data stores into nuggets
of actionable knowledge. The state-of-the-art data mining tools integrate loosely with data stored
in DBMSs, typically through a cursor interface. In this paper, we consider several formulations
of association rule mining using SQL-92 queries, primarily from a performance perspective. We
study the performance of different join orders and join methods for executing the various queries.
Based on the performance study we identify certain optimizations and develop the Set-oriented
Apriori approach. We analyze the cost of the different execution plans, which provides a basis
to incorporate the semantics of association rule mining into future query optimizers. The cost
expressions we develop are in terms of the relational operators and hence can be used as add-ons
to the current optimizers. This work is an initial step towards developing "SQL-aware"
mining algorithms and exploring the enhancements to current relational DBMSs to make them
"mining-aware", thereby bridging the gap between the two.


1 Introduction

A large number of business organizations are installing data warehouses based on relational database
technology. This trend combined with the drop in storage prices has made it possible to collect large
volumes of customer transaction data in retail stores, mail order companies, banks, stock markets
*This work was supported in part by the Office of Naval Research and the SPAWAR System Center San Diego, by
the Rome Laboratory, DARPA, and the NSF Grant IRI-9528390.
†Contact author. Full Address: Database Systems R&D Center, 470 CSE, PO Box 116125, Gainesville FL 32611-6125,
USA. Email: sthomas@cise.ufl.edu









etc. In order to transform these huge amounts of data into business competitiveness and profits, it is
extremely important to be able to mine nuggets of useful and understandable information from these
data warehouses.
The initial efforts in data mining research were to cull together techniques from machine learning
and statistics to define new mining operations and develop algorithms for them. A majority of the
mining algorithms were built for data stored in file systems and each had its own specialized data
structures, buffer management strategies etc. In cases where the data is stored in a DBMS, data
access was provided through an ODBC or SQL cursor interface. Integrating mining with
databases is becoming increasingly important and there have been several research efforts recently
aimed at tighter integration of mining with database systems.
The research on database integration of mining can be broadly classified into two categories: one
which proposes new mining operators and the other which leverages the query processing capabilities of
current relational DBMSs. In the former category, there have been language proposals to extend SQL
with specialized mining operators. A few examples are DMQL [HFK96], M-SQL [IVA96], and the
mine rule operator [MPC96]. However, these proposals do not address the processing techniques for
these operators inside a database engine. In the second category, researchers have addressed the issue
of exploiting the capabilities of conventional relational systems and their object-relational extensions
to execute mining operations. This entails transforming the mining operations into database queries
and in some cases developing newer techniques that are more appropriate in the database context. The
UDF-based (user-defined function) approach in [AS96], the SETM algorithm [HS95], the formulation
of association rule mining as query flocks [TUA+98], and SQL queries for mining [STA98], all aim at
tighter database integration.
[STA98] presents a detailed study of the various architectural alternatives for mining data stored in
a DBMS. It has been reported that for association rule mining, with carefully tuned SQL formulations
it is possible to achieve performance comparable to mining systems that cache data in files. The
potential advantages of developing SQL-based mining algorithms are manifold. The development
effort spent in making the database systems robust, portable, scalable and highly concurrent can be
profitably leveraged by making use of the database indexing, query processing and buffer management
capabilities. The use of parallel query processing engines makes it possible to exploit the underlying SQL
parallelization, particularly on an SMP machine. The development and maintenance costs of mining
systems will be significantly reduced and the DBMS support for crash recovery and check-pointing
can be useful for long-running mining operations.
Two categories of SQL implementations for association rule mining, one based purely on SQL-92
and the other using the object-relational extensions to SQL (SQL-OR), are presented in [STA98].
The experimental results show that SQL-OR outperforms SQL-92 for most of the datasets. However,
the object-relational extensions like table functions and user-defined functions (UDFs) used in the
SQL-OR approaches are not yet standardized across the major DBMS vendors and hence portability









could suffer. Moreover, optimization and parallelization of the object-relational extensions could be
harder.
In this paper, we analyze the performance of the various SQL-92 approaches and study the implications
of different join orders and join methods. The motivation for this study is to understand how
well we can do with SQL-92. We derive cost formulae for the different approaches in terms of the
relational operators and the input data parameters. These cost expressions are general and hence can
be used in any cost-based optimizer. Based on the performance experiments and the cost formulae, we
identify certain optimizations and develop the Set-oriented Apriori approach that performs better than
the best SQL-92 approach in [STA98]. We also study the scale-up properties of Set-oriented Apriori.
The rest of the paper is organized as follows: We review association rule mining and a few SQL
formulations of it in Section 2. In Section 3, we present a cost based analysis of the SQL approaches.
Section 4 presents the performance optimizations and their impact on the execution cost and discusses
the Set-oriented Apriori approach. We report the results of some of our performance experiments in
Section 5 and conclude in Section 6.


2 Association Rules

Association rules capture recurring patterns in large databases of transactions, each of which is a set
of items. The intuitive meaning of a typical association rule X→Y, where X and Y are sets of items,
is that the items in X and Y tend to co-occur in the transactions. An example of such a rule might
be that "60% of transactions that contain beer also contain diapers; 5% of all transactions contain
both these items". Here 60% is called the confidence of the rule and 5% the support of the rule. The
problem of mining association rules is to find all rules that satisfy a user-specified minimum support
and confidence threshold.
This problem can be decomposed into two subproblems [AIS93]:

Find all combinations of items whose support is greater than minimum support, called frequent
itemsets.

Use the frequent itemsets to generate the desired rules. The idea is that if, say, ABCD and AB
are frequent itemsets, then we can determine if the rule AB→CD holds by computing the ratio
r = support(ABCD)/support(AB). The rule holds only if r > minimum confidence. Note that
the rule will have minimum support because ABCD is frequent.
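The ratio computation above can be illustrated in a few lines; the itemsets and support counts below are invented purely for illustration and are not from the paper's datasets.

```python
# Hypothetical absolute support counts, invented for illustration.
support = {("A", "B", "C", "D"): 50, ("A", "B"): 100}

minconf = 0.4  # minimum confidence threshold

# Confidence of the rule AB -> CD is support(ABCD) / support(AB).
r = support[("A", "B", "C", "D")] / support[("A", "B")]

# The rule holds only if r exceeds the minimum confidence; its support is
# support(ABCD), which meets minimum support since ABCD is frequent.
holds = r > minconf

print(r, holds)
```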

We review the Apriori algorithm in Section 2.1, SQL-based Apriori candidate generation in Sec-
tion 2.2, the input transaction format in Section 2.3, and the KwayJoin approach to support counting
and the Subquery optimization in Section 2.4.









2.1 Apriori Algorithm


The basic Apriori algorithm for discovering frequent itemsets makes multiple passes over the data.
In the kth pass it finds all itemsets having k items, called the k-itemsets. Each pass consists of two
phases: the candidate generation phase and the support counting phase. In the candidate generation
phase, the set of frequent (k-1)-itemsets, Fk-1, is used to generate the set of potentially frequent
candidate k-itemsets, Ck. The support counting phase counts the support of all the itemsets in Ck
by examining the transactions and retains the itemsets having the minimum support. The algorithm
terminates when Ck+1 becomes empty.


2.2 Apriori candidate generation using SQL

We briefly outline the SQL-based candidate generation process of [STA98] here. Ck is obtained by
joining two copies of Fk-1 as:

insert into Ck select I1.item1, ..., I1.itemk-1, I2.itemk-1
from Fk-1 I1, Fk-1 I2
where I1.item1 = I2.item1 and
      ...
      I1.itemk-2 = I2.itemk-2 and
      I1.itemk-1 < I2.itemk-1

The join result, which is a set of k-itemsets, is further pruned using the subset pruning strategy:
all subsets of a frequent itemset must also be frequent. The subset pruning can be accomplished in SQL
by additional joins with (k-2) more copies of Fk-1. The join predicates for these joins, which act as
filters, are enumerated by skipping one item at a time from the k-itemset. We need only (k-2) additional
joins since a k-itemset has k subsets of length (k-1), and we do not need to check the two of
these k subsets from which the k-itemset was generated.
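As a concrete sketch, the pass-3 instance of this candidate generation (the self-join plus the single subset-pruning join) can be executed with SQLite; the frequent 2-itemsets below are invented toy data.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("create table F2 (item1 text, item2 text)")
# Toy frequent 2-itemsets: {A,B}, {A,C}, {B,C}.
cur.executemany("insert into F2 values (?, ?)",
                [("A", "B"), ("A", "C"), ("B", "C")])

# k = 3: join two copies of F2 that agree on the first (k-2 = 1) item, then
# prune with a third copy checking the one remaining 2-subset (item2, item3),
# obtained by skipping item1 from the 3-itemset.
cur.execute("""
    select I1.item1, I1.item2, I2.item2
    from F2 I1, F2 I2, F2 I3
    where I1.item1 = I2.item1
      and I1.item2 < I2.item2
      and I3.item1 = I1.item2
      and I3.item2 = I2.item2
""")
c3 = cur.fetchall()
print(c3)
```

Here the join of (A,B) and (A,C) proposes the candidate (A,B,C), and the pruning join keeps it because its remaining subset (B,C) is also frequent.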


2.3 Transaction format

The transaction data is stored in a relational table T with the schema (tid, item). For a given tid, there
are as many rows in T as the number of items in that transaction. This schema is more convenient
than alternate representations where a transaction and all its items are stored as a single tuple: for
real-life datasets, the maximum and minimum number of items per transaction differ widely, and the
maximum number of items could even exceed the number of columns allowed for a table.


2.4 Support counting by K-Way joins

In this approach, the support counting is formulated as a join query. In the kth pass, k copies of the
transaction table T are joined with the candidate table Ck, followed by a group by on the
itemsets as shown in Figure 1.


insert into Fk select item1, ..., itemk, count(*)
from Ck, T t1, ... T tk
where t1.item = Ck.item1 and
      ...
      tk.item = Ck.itemk and
      t1.tid = t2.tid and
      ...
      tk-1.tid = tk.tid
group by item1, item2, ..., itemk
having count(*) > :minsup

Figure 1: Support Counting by K-way join


Figure 1 also shows a tree diagram of the query. Note that the plan tree generated by the query
processor for this query could look quite different from this tree diagram.
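For concreteness, here is the k = 2 instance of the KwayJoin query executed with SQLite on an invented toy transaction table; minsup is an absolute transaction count here rather than a fraction.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("create table T (tid int, item text)")
cur.execute("create table C2 (item1 text, item2 text)")
# Toy transactions: {A,B,C}, {A,B}, {A,C}, {B,C}, {A,B,C}.
cur.executemany("insert into T values (?, ?)",
                [(1, "A"), (1, "B"), (1, "C"), (2, "A"), (2, "B"),
                 (3, "A"), (3, "C"), (4, "B"), (4, "C"),
                 (5, "A"), (5, "B"), (5, "C")])
cur.executemany("insert into C2 values (?, ?)",
                [("A", "B"), ("A", "C"), ("B", "C")])

# KwayJoin for k = 2: two copies of T joined with C2, grouped on the itemset.
cur.execute("""
    select C2.item1, C2.item2, count(*)
    from C2, T t1, T t2
    where t1.item = C2.item1
      and t2.item = C2.item2
      and t1.tid = t2.tid
    group by C2.item1, C2.item2
    having count(*) > :minsup
""", {"minsup": 2})
f2 = sorted(cur.fetchall())
print(f2)
```

Each of the pairs AB, AC and BC occurs in 3 of the 5 toy transactions, so all three survive the minimum support filter.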

Subquery optimization The basic KwayJoin approach can be optimized to make use of common
prefixes between the itemsets in Ck by splitting the support counting phase into a sequence of k nested
subqueries [STA98]:

insert into Fk select item1, ..., itemk, count(*)
from (Subquery Qk) t
group by item1, item2, ..., itemk
having count(*) > :minsup


Subquery Qj (for any j between 1 and k):
select item1, ..., itemj, tid
from T tj, (Subquery Qj-1) as rj-1,
     (select distinct item1, ..., itemj from Ck) as dj
where rj-1.item1 = dj.item1 and ... and
      rj-1.itemj-1 = dj.itemj-1 and
      rj-1.tid = tj.tid and
      tj.item = dj.itemj

Subquery Q0: No subquery Q0.


Figure 2: Subquery optimization for KwayJoin (tree diagram for Subquery Qj)
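The nested formulation can be unfolded by hand for k = 2 and run with SQLite on the same style of invented toy data; for k = 2 the prefix sharing buys little, but the sketch shows the shape of Subquery Qj.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("create table T (tid int, item text)")
cur.execute("create table C2 (item1 text, item2 text)")
cur.executemany("insert into T values (?, ?)",
                [(1, "A"), (1, "B"), (1, "C"), (2, "A"), (2, "B"),
                 (3, "A"), (3, "C"), (4, "B"), (4, "C"),
                 (5, "A"), (5, "B"), (5, "C")])
cur.executemany("insert into C2 values (?, ?)",
                [("A", "B"), ("A", "C"), ("B", "C")])

# Q1 joins T with the distinct 1-item prefixes d1 of C2; Q2 extends each
# Q1 row by one item using the distinct 2-item prefixes d2 (= C2 itself).
cur.execute("""
    select item1, item2, count(*)
    from (select d2.item1, d2.item2, t2.tid
          from T t2,
               (select t1.item as item1, t1.tid
                from T t1, (select distinct item1 from C2) d1
                where t1.item = d1.item1) r1,
               (select distinct item1, item2 from C2) d2
          where r1.item1 = d2.item1
            and r1.tid = t2.tid
            and t2.item = d2.item2) q2
    group by item1, item2
    having count(*) > 2
""")
f2 = sorted(cur.fetchall())
print(f2)
```

The result is identical to the KwayJoin formulation; the difference is that each Qj joins against only the distinct j-item prefixes of Ck.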










3 Analysis of execution plans


A relational query processor can execute the KwayJoin query in several different ways, and the
performance depends on the execution plan chosen. We experimented with a number of alternative
execution plans for this query. We could force the query processor to choose different plans by
creating different indices on T and Ck, and in some cases by disabling certain join methods. We
elaborate more on this in Section 5.
Figure 3: K-way join plan with Ck as inner        Figure 4: K-way join plan with Ck as outer


We present two different execution plans, one with Ck as the outermost relation (Section 3.1)
and another with Ck as the innermost relation (Section 3.2), together with the cost analysis for them.
The effect of the subquery optimization on the cost estimates is outlined in Section 3.3.

A schematic diagram of the two execution plans is given in Figures 3 and 4. In the cost
analysis, we use the mining-specific data parameters and knowledge about association rule mining
(the Apriori algorithm [AS94] in this case) to estimate the cost of joins and the size of join results.
Even though current relational optimizers do not use this mining-specific semantic information, the
analysis provides a basis for developing "mining-aware" optimizers. The cost formulae are presented
in terms of operator costs in order to make them general; for instance, join(p, q, r) denotes the cost of
joining two relations of size p and q to get a result of size r. The actual cost, which depends on the join
method used and the system parameters, can be derived by substituting the appropriate values in the
given formulae. The data parameters and operators used in the analysis are summarized in Table 1.


3.1 KwayJoin plan with Ck as outer

Start with Ck as the outermost relation and perform a series of joins with the k copies of T. The
final join result is grouped on the k items to find the support counts. The choice of join method for
each of the intermediate joins depends on the availability of indices, the size of intermediate results,
the amount of available memory, etc. For instance, efficient execution of a nested loops join requires an
index (item, tid) on T. If the intermediate join result is large, it could be advantageous to materialize
it and perform a sort-merge join.










R              number of records in the input transaction table
T              number of transactions
N              average number of items per transaction = R/T
F1             number of frequent items
S(C)           sum of the supports of the itemsets in set C
s_k            average support of a frequent k-itemset = S(Fk)/|Fk|
Rf             number of records out of R involving frequent items = S(F1)
Nf             average number of frequent items per transaction = Rf/T
Ck             number of candidate k-itemsets
C(n, k)        number of combinations of size k possible out of a set of size n = n!/(k!(n-k)!)
group(n, m)    cost of grouping n records out of which m are distinct
join(p, q, r)  cost of joining two relations of size p and q to get a result of size r

Table 1: Notations used in cost analysis

For each candidate itemset in Ck, the join with T produces as many records as the support of its
first item. Therefore, the size of the join result can be estimated as the product of the number
of k-candidates and the average support of a frequent item, and hence the cost of this join is given
by join(Ck, R, Ck·s_1). Similarly, the relation obtained after joining Ck with l copies of T contains
as many records as the sum of the support counts of the l-item prefixes of the candidate k-itemsets.
Hence the cost of the lth join is join(Ck·s_{l-1}, R, Ck·s_l), where s_0 = 1. Note that the values of the s_l's can
be computed from statistics collected in the previous passes. The cost of the last join (with tk) cannot
be estimated using the above formula since the k-item prefix of a k-candidate is not frequent and we
do not know the value of s_k. However, the last join produces S(Ck) records (there will be as many
records for each candidate as its support) and therefore its cost is join(Ck·s_{k-1}, R, S(Ck)). S(Ck)
can be estimated by adding the support estimates of all the itemsets in Ck. A good estimate for the
support of a candidate itemset is the minimum of the support counts of all its subsets. The overall
cost of this plan expressed in terms of operator costs is:

    Σ_{l=1}^{k-1} join(Ck·s_{l-1}, R, Ck·s_l)  +  join(Ck·s_{k-1}, R, S(Ck))  +  group(S(Ck), Ck)
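This plan cost can be evaluated numerically once concrete operator cost functions are supplied. The sketch below plugs in a deliberately crude linear cost model (join(p, q, r) = p + q + r and group(n, m) = n), which is our own placeholder assumption rather than a model from the paper; any functions with these signatures can be substituted.

```python
def join(p, q, r):
    # Placeholder linear cost model: read both inputs, write the result.
    return p + q + r

def group(n, m):
    # Placeholder: cost proportional to the number of records grouped.
    return n

def kwayjoin_outer_cost(ck, r_recs, s, s_ck):
    """Cost of the KwayJoin plan with Ck outermost.

    ck: number of candidate k-itemsets; r_recs: records in T;
    s: [s_1, ..., s_{k-1}], average supports of the l-item prefixes;
    s_ck: S(Ck), total support of all candidates.
    """
    s_full = [1.0] + list(s)            # s_0 = 1
    cost = 0.0
    for l in range(1, len(s_full)):     # the first k-1 joins with T
        cost += join(ck * s_full[l - 1], r_recs, ck * s_full[l])
    cost += join(ck * s_full[-1], r_recs, s_ck)  # last join produces S(Ck)
    cost += group(s_ck, ck)                      # final group-by
    return cost

# Hypothetical parameters: 10 candidates, 100 records, s_1 = 4, S(Ck) = 30.
print(kwayjoin_outer_cost(10, 100, [4], 30))
```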

3.2 KwayJoin plan with Ck as inner

In this plan, we join the k copies of T and the resulting k-item combinations are joined with Ck to
filter out non-candidate item combinations. The final join result is grouped on the k items.
The result of joining l copies of T is the set of all possible l-item combinations of transactions,
of which there are C(N, l)·T. We know that the items in a candidate itemset are
lexicographically ordered, and hence we can add extra join predicates as shown in Figure 3 to limit









the join result to l-item combinations (without these extra predicates the join would produce l-item
permutations). When Ck is the outermost relation these predicates are not required. A mining-
aware optimizer should be able to rewrite the query appropriately. The lth join produces (l+1)-item
combinations and therefore its cost is join(C(N, l)·T, R, C(N, l+1)·T). The last join produces
S(Ck) records as in the previous case and hence its cost is join(C(N, k)·T, Ck, S(Ck)). The overall
cost of this plan is:

    Σ_{l=1}^{k-1} join(C(N, l)·T, R, C(N, l+1)·T)  +  join(C(N, k)·T, Ck, S(Ck))  +  group(S(Ck), Ck)

Note that in the above expression C(N, 1)·T = R.


3.3 Effect of subquery optimization

The subquery optimization makes use of common prefixes among candidate itemsets. Unfolding all
the subqueries results in a query tree which structurally resembles the KwayJoin plan tree shown
in Figure 4. Subquery Ql produces d_l^k · s_l records, where d_j^k denotes the number of distinct j-item
prefixes of the itemsets in Ck. In contrast, the lth join in the KwayJoin plan results in Ck·s_l records. The
output of subquery Qk contains S(Ck) records. The total cost of this approach can be estimated as:

    Σ_{l=1}^{k} trijoin(R, s_{l-1}·d_{l-1}^k, d_l^k, s_l·d_l^k)  +  group(S(Ck), Ck)

where trijoin(p, q, r, s) denotes the cost of joining three relations of size p, q and r respectively, producing a
result of size s. The value of s_k, the average support of a frequent k-itemset, can be estimated
as mentioned in Section 3.1.
In the experimental results reported in [STA98], the subquery optimization gave much better
performance than the basic KwayJoin approach (an order of magnitude better in some cases). We
observed the same trend in our experiments on synthetic datasets (the datasets we used are
detailed below). The main reason for this is that the number of distinct l-item prefixes
is much smaller than the total number of candidate itemsets, which results in joins between
smaller tables. The number of candidate itemsets and the corresponding distinct item prefixes for
various passes in one of our experiments is given in Figure 5. These numbers are for the dataset
T10.I4.D100K and 0.33% support. Note that d_k^k is not shown since it is the same as Ck. In pass 3, the
number of distinct 2-item prefixes is almost a factor of 10 smaller than the number of itemsets in C3.
This results in correspondingly smaller intermediate tables, as shown in the analysis above, which is
the key to the performance gain.

Experimental datasets We used synthetic data generated according to the procedure explained
in [AS94] for our experiments. The results reported in this paper are for the datasets T5.I2.D100K
and T10.I4.D100K (for scale-up experiments we used other datasets also). The first dataset consists









Figure 5: Number of candidate itemsets vs. distinct item prefixes (T10.I4.D100K, 0.33% support)



Dataset         # Records     # Transactions   # Items   Avg. # items
T5.I2.D100K     ~550,000      100,000          1,000     5
T10.I4.D100K    ~1,100,000    100,000          1,000     10

Table 2: Description of experimental datasets.


of 100 thousand transactions, each containing an average of 5 items. The average size of the maximal
potentially frequent itemsets (denoted as I) is 2. The transaction table corresponding to this dataset
had approximately 550 thousand records. The second dataset has 100 thousand transactions, each
containing an average of 10 items (a total of about 1.1 million records), and the average size of the maximal
potentially frequent itemsets is 4. The different parameters of the two datasets are summarized in
Table 2.
All the experiments reported in this paper were performed on PostgreSQL Version 6.3 [PG98],
a public-domain DBMS, installed on an 8-processor Sun Ultra Enterprise 5000 with 248 MHz
CPUs and 256 MB main memory per processor, running Solaris 2.6. Note that PostgreSQL is not
parallelized. It supports nested loops, hash-based and sort-merge join methods and provides finer
control of the optimizer to disable any of the join methods. We have found it to be a useful platform
for studying the performance of different join methods and execution plans.



4 Performance optimizations


The cost analysis presented above provides some insight into the different components of the execution
time in the different passes and what can be optimized to achieve better performance. In this section,
we present three optimizations to the KwayJoin approach (other than the subquery optimization) and










discuss how they impact the cost in Sections 4.1, 4.2 and 4.3 respectively. Based on these optimizations,

we develop the Set-oriented Apriori approach in Section 4.4.


4.1 Pruning non-frequent items:

The size of the transaction table is a major factor in the cost of joins involving T. It can be reduced
by pruning the non-frequent items from the transactions after the first pass. We store the transaction
data as (tid, item) tuples in a relational table, and hence this pruning simply means dropping the tuples
corresponding to non-frequent items. This can be achieved by joining T with the frequent items table
F1 as follows:

insert into Tf select t.tid, t.item
from T t, F1 f
where t.item = f.item

We insert the pruned transactions into table Tf, which has the same schema as T. In the
subsequent passes, joins with T can be replaced with corresponding joins with Tf. This could result
in improved performance, especially for higher support values where the frequent item selectivity is
low, since we join smaller tables. For some of the synthetic datasets we used in our experiments, this
pruning reduced the size of the transaction table to about half its original size. It could be even
more useful for real-life datasets, which typically contain lots of non-frequent items. For example,
some of the real-life datasets used for the experiments reported in [STA98] contained on the order of
100 thousand items, out of which only a few hundred were frequent. Figure 6 shows the reduction in
transaction table size due to this optimization for our experimental datasets. The initial size (R) and
the size after pruning (Rf) for different support values are shown.


Figure 6: Reduction in transaction table size by non-frequent item pruning (R and Rf for various support values, T10.I4.D100K and T5.I2.D100K)









With this optimization, in the cost formulae of Section 3, R can be replaced with Rf (the number
of records in T involving frequent items) and N with Nf (the average number of frequent items per
transaction). Recall that with the subquery optimization we could achieve significant performance
improvements by reducing the relation sizes of the joins.
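A runnable sketch of this pruning with SQLite, on an invented toy table where a single item D is non-frequent (minsup is an absolute count of 2, using the strict count(*) > :minsup filter as in the paper's queries):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("create table T (tid int, item text)")
cur.executemany("insert into T values (?, ?)",
                [(1, "A"), (1, "B"), (1, "C"), (1, "D"), (2, "A"), (2, "B"),
                 (3, "A"), (3, "C"), (4, "B"), (4, "C"),
                 (5, "A"), (5, "B"), (5, "C")])

# First pass: frequent items, i.e. support strictly greater than minsup = 2.
cur.execute("""
    create table F1 as
    select item from T group by item having count(*) > 2
""")

# Prune non-frequent items: keep only (tid, item) tuples of frequent items.
cur.execute("create table Tf (tid int, item text)")
cur.execute("""
    insert into Tf select t.tid, t.item
    from T t, F1 f
    where t.item = f.item
""")

r = cur.execute("select count(*) from T").fetchone()[0]
rf = cur.execute("select count(*) from Tf").fetchone()[0]
print(r, rf)
```

The lone D tuple is dropped, so Tf is one record smaller than T; on realistic data with many non-frequent items the reduction is far larger.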


4.2 Eliminating candidate generation in second pass:

This optimization aims at reducing the cost of the second pass, which is a significant portion of the
total cost in some cases. In the second pass, C2 is almost a cartesian product of the two F1's used to
generate it, and hence materializing it and joining it with the T's (or Tf's) could be expensive. In order to
circumvent the problem of counting the large C2, most mining algorithms use special techniques in the
second pass. A few examples are the two-dimensional arrays in the IBM Quest data mining system [AAB+96]
and the hash filters proposed in [PCY95] to limit the size of C2. The generation of C2 can be completely
eliminated by formulating the join query to find F2 as:

insert into F2 select p.item, q.item, count(*)
from Tf p, Tf q
where p.tid = q.tid and p.item < q.item
group by p.item, q.item
having count(*) > :minsup

The cost of the second pass with this optimization is:

    join(Rf, Rf, C(Nf, 2)·T) + group(C(Nf, 2)·T, C(F1, 2))

Even though the grouping cost remains the same, there is a big reduction in the join costs relative
to the basic KwayJoin approach. Figure 7 compares the running time of the second pass with this optimiza-
tion to that of the basic KwayJoin approach for the two experimental datasets. For the KwayJoin approach,
the best execution plan was the one which generates all 2-item combinations, joins them with the
candidate set and groups the join result. We can see that this optimization has a significant impact
on the running time.
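On the same style of invented toy data, the C2-free second pass runs as:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("create table Tf (tid int, item text)")
cur.executemany("insert into Tf values (?, ?)",
                [(1, "A"), (1, "B"), (1, "C"), (2, "A"), (2, "B"),
                 (3, "A"), (3, "C"), (4, "B"), (4, "C"),
                 (5, "A"), (5, "B"), (5, "C")])

# Second pass without materializing C2: a self-join of Tf enumerates each
# 2-item combination exactly once (p.item < q.item), then the group-by
# counts its support.
cur.execute("""
    select p.item, q.item, count(*)
    from Tf p, Tf q
    where p.tid = q.tid and p.item < q.item
    group by p.item, q.item
    having count(*) > :minsup
""", {"minsup": 2})
f2 = sorted(cur.fetchall())
print(f2)
```

No candidate table is ever built; the lexicographic predicate p.item < q.item plays the role of the ordering predicate in the candidate generation join.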


4.3 Reusing the item combinations from previous pass:

The SQL formulations of association rule mining are based on generating item combinations in various
ways, and similar work is performed in all the different passes. Therefore, reusing the item combina-
tions will improve the performance, especially in the higher passes. We explain this
optimization and the corresponding cost reduction in the next section.











Figure 7: Benefit of second pass optimization (running times with vs. without the optimization; T5.I2.D100K at 2%, 1% and 0.5% support, T10.I4.D100K at 1%, 0.75% and 0.33% support)



4.4 Set-oriented Apriori


In this section, we develop a set-oriented version of the Apriori algorithm combining all the performance
optimizations outlined above, which we call Set-oriented Apriori. We store the item combinations
generated in each pass and use them in the subsequent passes, instead of generating them in every pass
from the transaction tables. In the kth pass of the support counting phase, we generate a table Tk
which contains all k-item combinations that are candidates. Tk has the schema (tid, item1, ..., itemk).
We join Tk-1, Tf and Ck as shown below to generate Tk; a tree diagram of the query is given
in Figure 8. The frequent itemsets Fk are obtained by grouping the tuples of Tk on the k items and
applying the minimum support filtering.

insert into Tk select p.tid, p.item1, ..., p.itemk-1, q.item
from Ck, Tk-1 p, Tf q
where p.item1 = Ck.item1 and
      ...
      p.itemk-1 = Ck.itemk-1 and
      q.item = Ck.itemk and
      p.tid = q.tid

Figure 8: Generation of Tk

We can further prune Tk by filtering out item combinations that turned out to be non-frequent.
However, this is not essential since we join it with the candidate set Ck+1 in the next pass to generate
Tk+1. The only advantage of pruning Tk is that we would have a smaller table to join in the next pass,
but at the expense of joining Tk with Fk.
We use the optimization discussed above for the second pass and hence do not materialize and
store T2. Therefore, we generate T3 directly by joining Tf with C3 as:

insert into T3 select p.tid, p.item, q.item, r.item
from Tf p, Tf q, Tf r, C3
where p.item = C3.item1 and q.item = C3.item2 and r.item = C3.item3 and
      p.tid = q.tid and q.tid = r.tid

We can also use the Subquery approach to generate T3 if that is estimated to be less expensive. T3
will contain exactly the same tuples produced by subquery Q3.
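A SQLite sketch of generating T3 and then grouping it into F3, on invented toy data with a single candidate {A, B, C} and an absolute minsup:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("create table Tf (tid int, item text)")
cur.execute("create table C3 (item1 text, item2 text, item3 text)")
cur.executemany("insert into Tf values (?, ?)",
                [(1, "A"), (1, "B"), (1, "C"), (2, "A"), (2, "B"),
                 (3, "A"), (3, "C"), (4, "B"), (4, "C"),
                 (5, "A"), (5, "B"), (5, "C")])
cur.execute("insert into C3 values ('A', 'B', 'C')")

# T3: all candidate 3-item combinations present in the transactions.
cur.execute("create table T3 (tid int, item1 text, item2 text, item3 text)")
cur.execute("""
    insert into T3 select p.tid, p.item, q.item, r.item
    from Tf p, Tf q, Tf r, C3
    where p.item = C3.item1 and q.item = C3.item2 and r.item = C3.item3
      and p.tid = q.tid and q.tid = r.tid
""")
t3 = cur.execute("select * from T3 order by tid").fetchall()

# F3: group T3 on the three items and apply the minimum support filter;
# in pass 4, T3 (not Tf alone) would feed the join with C4.
cur.execute("""
    select item1, item2, item3, count(*)
    from T3 group by item1, item2, item3
    having count(*) > 1
""")
f3 = cur.fetchall()
print(t3, f3)
```

Only transactions 1 and 5 contain all of A, B and C, so T3 holds two tuples and {A,B,C} passes the support filter with a count of 2.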
The Set-oriented Apriori algorithm bears some resemblance to the three-way join approach
in [STA98] and the AprioriTid algorithm in [AS94]. In the three-way join approach, the tempo-
rary table Tk stores, for each transaction, the identifiers of the candidates it supported. Tk is generated
by joining two copies of Tk-1 with Ck. The generation of Fk requires a further join of Tk with Ck.
AprioriTid makes use of special data structures which are difficult to maintain in the SQL formulation.

4.4.1 Cost comparison

In Section 3.3, we saw that the cost of the kth pass with the subquery optimization is:

    Σ_{l=1}^{k} trijoin(R, s_{l-1}·d_{l-1}^k, d_l^k, s_l·d_l^k)  +  group(S(Ck), Ck)

As a result of the materialization and reuse of item combinations, Set-oriented Apriori requires only a
single 3-way join in the kth pass.(1) The cost of the kth pass in Set-oriented Apriori is:

    trijoin(Rf, |Tk-1|, Ck, S(Ck)) + group(S(Ck), Ck)

where |Tk-1| and Ck denote the cardinalities of the corresponding tables. The grouping cost is the same
as that of the subquery approach. The table Tk-1 contains exactly the same tuples as subquery
Qk-1 and hence has size s_{k-1}·d_{k-1}^k. Also, d_k^k is the same as Ck. Therefore, the kth pass cost of
Set-oriented Apriori is the same as the kth term in the join cost summation of the subquery approach.
This results in significant performance improvements, especially in the higher passes.
Figure 9 compares the running times of the subquery and Set-oriented Apriori approaches for the
dataset T10.I4.D100K at 0.33% support. We show only the times for passes 3 and higher since both
approaches are identical in the first two passes.
(1) Note that this may be executed as two 2-way joins since 3-way joins are not generally supported in current relational
systems.











Figure 9: Benefit of reusing item combinations (running times of Set-oriented Apriori vs. Subquery, passes 3-7)



4.4.2 Space overhead


The Set-oriented Apriori approach requires additional space in order to store the item combinations
generated. The size of the table T_k is the same as S(C_k), which is the total support of all the k-item
candidates. Assuming that the tid and item attributes are integers, each tuple in T_k consists of k + 1
integer attributes. Therefore, the space requirement of T_k is |T_k|·(k + 1) integers. Figure 10 shows the
space required to store T_k, in terms of number of integers, for the dataset T10.I4.D100K for two different
support values. The space needed for the input data table T is also shown for comparison. T_2 is not
shown in the graph since we do not materialize and store it in the Set-oriented Apriori approach. Note
that once T_k is materialized, T_{k-1} can be deleted unless it needs to be retained for some other purpose.
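The space formula is simple enough to evaluate directly. The cardinality and the 4-byte integer width in this sketch are assumptions chosen for illustration, not figures from the experiments.

```python
def t_k_size_integers(num_rows, k, int_bytes=4):
    """Space taken by table T_k: each row holds a tid plus k items,
    i.e. k + 1 integer attributes (assuming int_bytes-byte integers)."""
    n_ints = num_rows * (k + 1)
    return n_ints, n_ints * int_bytes

# Example with a made-up cardinality: 500,000 supported 3-candidates.
ints, nbytes = t_k_size_integers(500_000, k=3)
print(ints, nbytes)  # 2,000,000 integers = 8,000,000 bytes
```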


Figure 10: Space requirements of the set-oriented apriori approach (number of integers needed to store T_k, at supports 0.75% and 0.33%, with the input table T shown for comparison)


In the earlier passes, if the number of item combinations is too large and we do not want to store
them, the Subquery approach can be used instead. However, the Subquery approach will also generate
large intermediate tables, except in cases where the output of an intermediate join is pipelined to the
subsequent operation. The transition to Set-oriented Apriori can be made in any pass by materializing
the item combinations. An advantage of Set-oriented Apriori is that it requires only simple joins and
hence is easier for the optimizer to handle. With multiple levels of nested subqueries, optimization
becomes much harder.



5 Performance experiments


In this section, we compare the total running time of the Subquery and Set-oriented Apriori approaches.
We studied the performance of the Subquery approach (the best SQL-92 approach in [STA98]) and
Set-oriented Apriori for a wide range of data parameters and support values. We report the results
on two of the datasets, T5.I2.D100K and T10.I4.D100K, which are described in Section 3.3.

Figure 11: Comparison of Subquery and Set-oriented Apriori approaches (total time per pass: T10.I4.D100K at supports 2%, 1%, 0.75%, 0.33%; T5.I2.D100K at supports 2%, 1%, 0.5%, 0.1%)


In Figure 11, we show the relative performance of the Subquery and Set-oriented Apriori approaches
for the two datasets. The chart shows the total time taken for each of the different passes. We also ran
the SETM algorithm [HS95] for a few support values and found that it is an order of magnitude
slower. Set-oriented Apriori performs better than Subquery for all the support values. The first two
passes of both approaches are similar and take approximately equal amounts of time. The
difference between Set-oriented Apriori and Subquery widens for higher numbered passes, as explained
in Section 4.4. For T5.I2.D100K, F_2 was empty for support values higher than 0.3% and therefore
we chose lower support values to study the relative performance in higher numbered passes.

In some cases, the optimizer did not choose the best plan. For example, for joins with T (T_k for
Set-oriented Apriori), the optimizer chose a nested loops plan using the (item, tid) index on T in many
cases where the corresponding sort-merge plan was faster (an order of magnitude faster in some cases).
We were able to experiment with different plans by disabling certain join methods (disabling nested
loops join for the above case). We also broke down the multi-way joins in some cases into simpler
two-way joins to study the performance implications. The reported times correspond to the best join
order and join methods.

Figure 12: Comparison of CPU and I/O times (T10.I4.D100K at supports 2%, 1%, 0.75%, 0.33%)


Figure 12 shows the CPU time and I/O time taken for the dataset T10.I4.D100K. The two ap-
proaches show the same relative trend as the total time. However, it should be noted that the I/O
time is less than one third of the CPU time. This shows that there is a need to revisit the traditional
optimization and parallelization strategies, designed to optimize for I/O time, in order to handle the
newer decision support and mining queries efficiently. The experiments on T5.I2.D100K also exhibited
similar behavior.


5.1 Scale-up Experiment

We experimented with several synthetic datasets to study the scale-up behavior of Set-oriented Apriori
with respect to increasing number of transactions and increasing transaction size. Figure 13 shows
how Set-oriented Apriori scales up as the number of transactions is increased from 10,000 to 1 million.
We used the datasets T5.I2 and T10.I4 for the average sizes of transactions and itemsets respectively.
The minimum support level was kept at 1%. The first graph shows the absolute execution times
and the second one shows the times normalized with respect to the times for the 10,000 transaction
datasets. It can be seen that the execution times scale quite linearly and both datasets exhibit
similar scale-up behavior.

The scale-up with increasing transaction size is shown in Figure 14. In these experiments, we kept
the physical size of the database roughly constant by keeping the product of the average transaction










Figure 13: Number of transactions scale-up (absolute and normalized execution times for T10.I4 and T5.I2)


size and the number of transactions constant. The number of transactions ranged from 200,000 for the
database with an average transaction size of 5, to 20,000 for the database with an average transaction
size of 50. We fixed the minimum support level in terms of the number of transactions, since fixing it
as a percentage would have led to large increases in the number of frequent itemsets as the transaction
size increased. The numbers in the legend (e.g. 1000) refer to this minimum support. The execution
times increase with the transaction size, but only gradually. The main reason for this increase is
that the number of item combinations present in a transaction increases with the transaction size.
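The growth in per-transaction work can be made concrete with a short calculation: a transaction of length t contains C(t, k) distinct k-item combinations, so the combination count grows polynomially in the transaction size. The transaction lengths below are chosen to span the range used in the experiments.

```python
from math import comb

# Number of k-item combinations contributed by a single transaction of
# length t is the binomial coefficient C(t, k); e.g. doubling t roughly
# quadruples the pass-2 combinations.
for t in (5, 10, 25, 50):
    print(t, [comb(t, k) for k in (2, 3)])
```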

There has been increasing interest in developing scalable data mining techniques [SBMU98, BFR98,
Cha98]. For the scalability of the SQL approaches, we leverage the development effort spent in making
relational query processors scalable.



6 Conclusion and Future work


We explored the problem of developing SQL-aware implementations of association rule mining. We
analyzed the best available SQL-92 formulation, the KwayJoin approach with Subquery optimization,
primarily from a performance perspective, and conducted detailed performance experiments to under-
stand how well current relational DBMSs handle such queries. Based on the cost evaluation and the
performance study, we identify certain optimizations and develop a set-oriented version of the apriori
algorithm. For the higher numbered passes, Set-oriented Apriori performs significantly better than the
Subquery approach. The cost analysis presented in this paper points to useful enhancements to current
optimizers to make them more "mining-aware". We also studied the scale-up behavior of Set-oriented
Apriori with respect to increase in the number of transactions and average transaction size.

Recently, efficient techniques to evaluate iceberg queries have been proposed in [FSGM+98].

Figure 14: Transaction size scale-up (execution time vs. average transaction length, for absolute minimum supports of 1000, 750, and 500)


The association rule mining approaches presented in this paper can benefit from these techniques, since
several of them contain icebergs as sub-queries. The integration of data warehousing and data mining
is attracting significant attention from researchers. However, we believe that a lot more research
into SQL-aware mining systems and improved mining-aware SQL systems is required to enable the two
technologies to grow synergistically.



References

[AAB+96] Rakesh Agrawal, Andreas Arning, Toni Bollinger, Manish Mehta, John Shafer, and Ramakrishnan
Srikant. The Quest Data Mining System. In Proc. of the 2nd Int'l Conference on Knowledge
Discovery in Databases and Data Mining, Portland, Oregon, August 1996.

[AIS93] Rakesh Agrawal, Tomasz Imielinski, and Arun Swami. Mining association rules between sets of
items in large databases. In Proc. of the ACM SIGMOD Conference on Management of Data,
pages 207-216, Washington, D.C., May 1993.

[AMS+96] Rakesh Agrawal, Heikki Mannila, Ramakrishnan Srikant, Hannu Toivonen, and A. Inkeri Verkamo.
Fast Discovery of Association Rules. In Usama M. Fayyad, Gregory Piatetsky-Shapiro, Padhraic
Smyth, and Ramasamy Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining,
chapter 12, pages 307-328. AAAI/MIT Press, 1996.

[AS94] Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules. In
Proc. of the 20th Int'l Conference on Very Large Databases, Santiago, Chile, September 1994.

[AS96] Rakesh Agrawal and Kyuseok Shim. Developing tightly-coupled data mining applications on a rela-
tional database system. In Proc. of the 2nd Int'l Conference on Knowledge Discovery in Databases
and Data Mining, Portland, Oregon, August 1996.

[BFR98] Paul Bradley, Usama Fayyad, and Cory Reina. Scaling Clustering Algorithms to Large Databases.
In Proc. of the 4th Int'l Conference on Knowledge Discovery and Data Mining, New York, August
1998.

[Cha98] Surajit Chaudhuri. Data mining and database systems: Where is the intersection? IEEE Data
Engineering Bulletin, 21(1):4-8, March 1998.

[FSGM+98] Min Fang, Narayanan Shivakumar, Hector Garcia-Molina, Rajeev Motwani, and Jeffrey D. Ullman.
Computing Iceberg Queries Efficiently. In Proc. of the VLDB Conference, New York, August 1998.

[HFK+96] J. Han, Y. Fu, K. Koperski, W. Wang, and O. Zaiane. DMQL: A data mining query language for
relational databases. In Proc. of the 1996 SIGMOD workshop on research issues on data mining
and knowledge discovery, Montreal, Canada, May 1996.

[HS95] Maurice Houtsma and Arun Swami. Set-oriented mining of association rules. In Int'l Conference
on Data Engineering, Taipei, Taiwan, March 1995.

[IM96] Tomasz Imielinski and Heikki Mannila. A database perspective on knowledge discovery. Communications
of the ACM, 39(11):58-64, November 1996.

[IBM96] International Business Machines. IBM Intelligent Miner User's Guide, Version 1 Release 1, SH12-
6213-00 edition, July 1996.

[IVA96] T. Imielinski, A. Virmani, and A. Abdulghani. Discovery Board Application Programming Interface
and Query Language for Database Mining. In Proc. of the 2nd Int'l Conference on Knowledge
Discovery and Data Mining, Portland, Oregon, August 1996.

[MPC96] R. Meo, G. Psaila, and S. Ceri. A new SQL like operator for mining association rules. In Proc. of
the 22nd Int'l Conference on Very Large Databases, Bombay, India, September 1996.

[PCY95] Jong Soo Park, Ming-Syan Chen, and Philip S. Yu. An effective hash based algorithm for mining
association rules. In Proc. of the ACM SIGMOD Conference on Management of Data, San Jose,
California, May 1995.

[Pos98] PostgreSQL Organization. PostgreSQL 6.3 User Manual, February 1998.
http://www.postgresql.org.

[SBMU98] Craig Silverstein, Sergey Brin, Rajeev Motwani, and Jeff Ullman. Scalable Techniques for Mining
Causal Structures. In Proc. of the VLDB Conference, New York, August 1998.

[STA98] Sunita Sarawagi, Shiby Thomas, and Rakesh Agrawal. Integrating Association Rule Mining with
Relational Database Systems: Alternatives and Implications. In Proc. of the ACM SIGMOD
Conference on Management of Data, Seattle, Washington, June 1998.

[TUA+98] Dick Tsur, Jeffrey Ullman, Serge Abiteboul, Chris Clifton, Rajeev Motwani, Svetlozar Nestorov,
and Arnon Rosenthal. Query Flocks: A Generalization of Association Rule Mining. In Proc. of
the ACM SIGMOD Conference on Management of Data, Seattle, Washington, June 1998.



