Architectures and optimizations for integrating data mining algorithms with database systems


Material Information

Title: Architectures and optimizations for integrating data mining algorithms with database systems
Creator: Thomas, Shiby, 1971-
Physical Description: xiv, 172 leaves : ill. ; 29 cm.
Publication Date: 1998

Subjects / Keywords:
Data mining   ( lcsh )
Data warehousing   ( lcsh )
Database management   ( lcsh )
Computer and Information Science and Engineering thesis, Ph.D   ( lcsh )
Dissertations, Academic -- Computer and Information Science and Engineering -- UF   ( lcsh )
bibliography   ( marcgt )
non-fiction   ( marcgt )


Thesis (Ph.D.)--University of Florida, 1998.
Includes bibliographical references (leaves 162-171).
Statement of Responsibility:
by Shiby Thomas.

Record Information

Source Institution:
University of Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
aleph - 030040414
oclc - 40941992







My parents, sisters and brothers


I would like to express my sincere gratitude to Sharma for his encouragement and

support throughout my dissertation and my stay with his group. We have had endless

arguments and discussions about various things, especially during the group meetings;

this might have invited the wrath of several other students because, at times, it must

have tested their patience. On the personal side, he and his family have also been my

good friends. I am greatly indebted to Rakesh Agrawal and Sunita Sarawagi of IBM

Almaden Research Center for their help and for giving me the opportunity to work with their

group. The several discussions I had with them while I was at IBM and later through

e-mail were very useful. The work that I did with them contributed to a good part of

my dissertation. It was a nice experience working with the Quest data mining group at

Almaden and the enthusiasm and hard work of Sunita were contagious. I am also thankful

to Sanjay Ranka for his help and suggestions during my initial work on data mining. I

thank Professors Eric Hanson, Sartaj Sahni, Stanley Su and Suleyman Tufekci for being on

my committee and for their comments and suggestions.

I am grateful to many other people for helping me in several ways. In particular I

thank Raja, Sreenath, Mokhtar, Nabeel, Roger, and all my friends at the database center

for the chat sessions, lunch sessions, and so on. Many thanks to Sharon Grant for her help

with everything. She takes care of everything in the database center and her tireless spirit

and attention to even the minute details make things a whole lot easier. My sincere thanks

to S. Seshadri, my master's thesis advisor. My first exposure to database research was while

working with him and he is partially responsible for my pursuing further studies. I owe my

graduate studies to my parents, sisters and brothers for their continual reassurance.

This work was supported in part by the research grants of Sharma Chakravarthy

from the Office of Naval Research and the SPAWAR System Center San Diego, Rome

Laboratory, DARPA and the NSF grant IRI-9528390. During my first year Theodore

Johnson provided support through his research grants until he left the University. His

initial support helped me to come to the United States for graduate studies. The CISE

department also provided me with teaching assistantships in times of need. I gratefully

acknowledge all the support.


ACKNOWLEDGMENTS

LIST OF FIGURES

LIST OF TABLES

ABSTRACT


1 INTRODUCTION

1.1 Data Warehousing
1.2 Data Mining
1.2.1 Association Rule
1.2.2 Sequential Patterns
1.2.3 Classification
1.2.4 Clustering
1.3 Mining Databases
1.4 Goal
1.5 Thesis Organization


2.1 Related Work
2.2 Architectural Alternatives
2.2.1 Loose-Coupling
2.2.2 Cache-Mine
2.2.3 Stored-Procedure
2.2.4 User-Defined Function
2.2.5 SQL-Based Approach
2.2.6 Integrated Approach
2.3 Summary

3 ASSOCIATION RULES

3.1 Apriori Algorithm
3.2 Other Algorithms
3.3 Input-Output Formats
3.4 Associations in SQL
3.4.1 Candidate Generation in SQL
3.4.2 Counting Support to Find Frequent Itemsets
3.4.3 Rule Generation
3.5 Support Counting Using SQL-92
3.5.1 K-way Join
3.5.2 Three-way Join
3.5.3 Two Group-by
3.5.4 Subquery-Based
3.6 Performance Comparison
3.7 Cost Analysis
3.7.1 KWayJoin Plan with Ck as Outer Relation
3.7.2 KWayJoin Plan with Ck as Inner Relation
3.7.3 Effect of Subquery Optimization
3.8 Performance Optimizations
3.8.1 Pruning Non-Frequent Items
3.8.2 Eliminating Candidate Generation in Second Pass
3.8.3 Reusing the Item Combinations from Previous Pass
3.8.4 Set-Oriented Apriori
3.9 Performance Experiments with Set-Oriented Apriori
3.9.1 Scale-up Experiment
3.10 Summary

4 SUPPORT COUNTING USING SQL WITH OBJECT-RELATIONAL EXTENSIONS

4.1 GatherJoin
4.1.1 Special Pass 2 Optimization
4.1.2 Variations of GatherJoin Approach
4.1.3 Cost Analysis of GatherJoin and its Variants
4.2 Vertical
4.2.1 Special Pass 2 Optimization
4.2.2 Cost Analysis
4.3 SQL-Bodied Functions
4.4 Performance Comparison
4.5 Final Hybrid Approach
4.6 Architecture Comparisons
4.6.1 Timing Comparison
4.6.2 Scale-up Experiment
4.6.3 Impact of Longer Names
4.6.4 Space Overhead of Different Approaches
4.7 Summary of Comparison Between Different Architectures
4.8 Other Associations Algorithms
4.9 Summary

5 GENERALIZED ASSOCIATION RULES

5.1 Input-Output Formats
5.2 Cumulate Algorithm
5.3 Pre-Computing Ancestors
5.4 Candidate Generation
5.5 Support Counting to Find Frequent Itemsets
5.6 Support Counting Using SQL-92
5.6.1 K-way Join
5.6.2 Subquery Optimization
5.7 Support Counting Using SQL-OR
5.7.1 GatherJoin
5.7.2 GatherExtend
5.7.3 Cost Analysis
5.7.4 Vertical
5.8 Performance Results
5.9 Summary

6 SEQUENTIAL PATTERNS

6.1 Input-Output Formats
6.1.1 Input Format
6.1.2 Output Format
6.2 GSP Algorithm
6.3 Candidate Generation
6.4 Support Counting to Find Frequent Sequences
6.5 Support Counting Using SQL-92
6.5.1 K-way Join
6.5.2 Subquery Optimization
6.6 Support Counting Using SQL-OR
6.6.1 Vertical
6.6.2 GatherJoin
6.7 Taxonomies
6.8 Summary

7 INCREMENTAL MINING

7.1 Incremental Updating of Frequent Itemsets
7.1.1 Computing NBd(F) from F
7.1.2 Addition of New Transactions
7.1.3 Deletion of Existing Transactions
7.2 Experimental Results
7.3 Comparison with FUP
7.4 Database Integration of Incremental Mining
7.4.1 SQL Formulations of Incremental Mining
7.4.2 Performance Results
7.4.3 New-Candidate Optimization
7.4.4 Other Approaches
7.5 Constrained Associations
7.5.1 Categories of Constraints
7.5.2 Constrained Association Mining
7.5.3 Incremental Constrained Association Mining
7.5.4 Constraint Relaxation
7.6 Applicability Beyond Association Mining
7.6.1 Mining Closed Sets
7.6.2 Query Flocks
7.6.3 View Maintenance
7.7 Summary

8 CONCLUSIONS

8.1 Proposed Extensions
8.1.1 Richer Set Operations
8.1.2 Enhanced Aggregation
8.1.3 Multiple Streams
8.1.4 Sampling
8.2 Contributions
8.3 Future Work
8.4 Closing

REFERENCES

BIOGRAPHICAL SKETCH


1.1 Data warehousing architecture
1.2 Credit card classification example
1.3 Typical data warehouse usage
2.1 Taxonomy of architectural alternatives
2.2 Loose-coupling architecture
2.3 Cache-mine architecture
2.4 Stored-procedure architecture
2.5 UDF-based mining architecture
2.6 SQL architecture for mining in a DBMS
2.7 Architecture for mining in next-generation DBMSs
3.1 Apriori algorithm
3.2 Candidate generation for any k
3.3 Candidate generation for k = 4
3.4 Rule generation
3.5 Support counting by K-way join
3.6 Support counting using subqueries
3.7 Comparison of different SQL-92 approaches
3.8 K-way join plan with Ck as inner relation
3.9 K-way join plan with Ck as outer relation
3.10 Number of candidate itemsets vs. distinct item prefixes
3.11 Reduction in transaction table size by non-frequent item pruning
3.12 Benefit of second pass optimization
3.13 Generation of Tk
3.14 Benefit of reusing item combinations
3.15 Space requirements of the set-oriented apriori approach
3.16 Comparison of Subquery and Set-oriented Apriori approaches
3.17 Comparison of CPU and I/O times
3.18 Number of transactions scale-up
3.19 Transaction size scale-up
4.1 Support counting by GatherJoin
4.2 Support counting by GatherJoin in the second pass
4.3 Tid-list creation
4.4 Support counting using UDF
4.5 Comparison of four SQL-OR approaches: Vertical, GatherPrune, GatherJoin and GatherCount
4.6 Effect of increasing transaction length (average number of items per transaction)
4.7 Comparison of four architectures
4.8 Scale-up with increasing number of transactions
4.9 Scale-up with increasing transaction length
4.10 Comparison of different architectures on space requirements
5.1 Example of a taxonomy
5.2 Pre-computing ancestors
5.3 Generation of C2
5.4 Transaction extension subquery
5.5 Support counting by K-way join
5.6 Support counting by GatherJoin
5.7 Support counting by GatherExtend
5.8 Interior nodes' tid-list generation by union
5.9 Interior nodes' tid-list generation from T*
5.10 Comparison of different SQL approaches
6.1 Candidate generation for any k
6.2 Candidate generation for k = 4
6.3 Support counting by K-way join
6.4 Subquery optimization for KwayJoin approach
6.5 Support counting by Vertical
7.1 A high-level description of the apriori-gen function
7.2 A high-level description of the negativeborder-gen function
7.3 A high-level description of the Update-Frequent-Itemset function
7.4 Speed-up of the incremental algorithm
7.5 Support counting using subqueries
7.6 Speed-up of the incremental algorithm based on the Subquery approach
7.7 Speed-up of the incremental algorithm based on the Vertical approach
7.8 Speed-up of the incremental algorithm based on the Subquery approach with the new-candidate optimization
7.9 Speed-up of the incremental algorithm based on the Vertical approach with the new-candidate optimization
7.10 Point-of-sales data model
7.11 Framework for constrained association mining
7.12 Point-of-sales example for constrained association mining


3.1 Description of different real-life datasets
3.2 Notations used in cost analysis
3.3 Description of synthetic datasets
4.1 Pros and cons of different architectural options ranked on a scale of 1 (good) to 4 (bad)
5.1 An example of the taxonomy table
5.2 Additional notations used for cost analysis

Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy



Shiby Thomas

December, 1998

Chairman: Dr. Sharma Chakravarthy
Major Department: Computer and Information Science and Engineering

Data mining on large data warehouses is becoming increasingly important. In support

of this trend, we consider a spectrum of architectural alternatives for integrating mining

with database systems. These alternatives include loose-coupling through a SQL cursor

interface; encapsulation of the mining algorithm in a stored procedure; caching the data to

a file system on-the-fly and mining; tight-coupling using primarily user-defined functions;

and SQL implementations for processing in the DBMS. First, we comprehensively study

the option of expressing the association rule mining algorithm in the form of SQL queries.

We consider four options in SQL-92 and six options in SQL enhanced with object-relational

extensions (SQL-OR). Our evaluation of the different architectural alternatives shows that

from a performance perspective, the Cache-Mine option is superior, although the SQL-

OR option comes a close second. Both the Cache-Mine and the SQL-OR approaches

incur a higher storage penalty than the loose-coupling approach which performance-wise

is a factor of 3 to 4 worse than Cache-Mine. We also compare these alternatives on the

basis of qualitative factors like automatic parallelization, development ease, portability

and inter-operability.

We further analyze the SQL-92 approaches with the twin goals of studying how best

a DBMS without any object-relational extensions can execute these queries and identifying

ways of incorporating the semantics of mining into cost-based query optimizers. We develop

cost formulae for the mining queries based on the input data parameters and relational

operator costs. We also identify certain optimizations which improve the performance.

Next, we study generalized association rule and sequential pattern mining and develop

SQL formulations for them, thereby demonstrating that more complex mining operations

can be handled in the SQL framework.

We develop an incremental association rule mining algorithm which does not need

to examine the old data if the frequent itemsets do not change. Even then, access

to the old database can be limited to just one scan. We categorize the various kinds of

constraints on the items that are useful in the context of interactive mining to facilitate goal-

oriented mining. We show how the incremental mining technique can be adapted to handle

constraints and certain kinds of constraint relaxation. We also show the applicability of

the incremental algorithm to other classes of data mining and decision support problems.

Finally, we identify certain primitive operators that are useful for a large class of data

mining and decision support applications. Supporting them natively in the DBMS could

enable these applications to run faster.


The rapid growth in data warehousing technology combined with the significant drop

in storage prices has made it possible to collect large volumes of data about customer

transactions in retail stores, mail order companies, banks, stock markets, telecommunica-

tion companies and so on. For example, AT&T call records are about 1 gigabyte per

hour [73] and supermarket chains like WalMart collect terabytes of data. In order to

transform these huge amounts of data into business competitiveness and profits, it is ex-

tremely important to be able to mine nuggets of useful and understandable information

from these data warehouses. In this chapter, we introduce data warehousing and data min-

ing technologies in Sections 1.1 and 1.2 respectively and in Section 1.3, we motivate the

need for coupling the two which is the focus of this dissertation. In Section 1.4, we discuss

and list the specific problems addressed in this dissertation and in Section 1.5, we outline

the dissertation organization.

1.1 Data Warehousing

A data warehouse is simply a single, complete and consistent store of data obtained

from a variety of sources and made available to end users in a way they can understand

and use. Achieving completeness and consistency of data in today's information systems

environment is far from simple. The first problem is to discover how completeness and

consistency can be defined. In the business context, this entails understanding the business

strategies and the data required to support and track their achievement. This process-

called enterprise modeling-requires substantial involvement of business users and is tra-

ditionally a long-term process. A data warehouse consists of several components and tools

which are depicted in Figure 1.1 [34].

[Figure: data warehousing architecture showing data sources (operational systems and external sources), load and refresh tools, the central data warehouse with a metadata repository built from the enterprise model, data marts, monitoring and administration tools, and business information tools]
Figure 1.1. Data warehousing architecture

Knowing what data are required is just the first step. These data exist today in

various sources on different platforms, and must be copied from these sources for use in the

warehouse. They must be combined according to the enterprise model, even though it was

not originally designed to support such integration. They must be cleansed of structural and

content errors. This step-typically known as data warehouse population-requires tools

for extracting data from multiple operational databases and external sources, for cleaning,

transforming and integrating these data; and for loading data into the data warehouse.

In addition to the main warehouse, there may be several departmental data marts. It

also requires tools for periodically refreshing the warehouse to maintain consistency and to

purge data from the warehouse, perhaps onto slower archival storage.

In order to understand and profitably use the data in a business context, they must

be transformed into information. The warehouse data are stored and managed by one or

more warehouse servers, which present multidimensional views of the data to a variety of

front end tools: query tools, report writers, analysis tools and data mining tools. There is

also a repository which stores metadata derived using the enterprise model, and tools for

monitoring and administering the warehousing system.

The warehouse may be distributed for load balancing, scalability and higher availabil-

ity. In such a distributed architecture, the metadata repository is usually replicated with

each fragment of the warehouse, and the entire warehouse is administered centrally. An

alternative architecture is a federation of data warehouses or data marts, each with its own

repository and decentralized administration.

1.2 Data Mining

Data mining, also referred to as knowledge discovery in databases, is the process of

extracting implicit, understandable, previously unknown and potentially useful information

from data. In other words, data mining is the act of drilling through huge volumes of data in

order to discover relationships, or to answer specific questions that are too broad in nature

for traditional query tools. Data mining is also projected as the next step beyond online

analytical processing (OLAP) for querying data warehouses. Rather than seek out known

relationships, it sifts through data for unknown relationships. For instance, consider the

transaction data in a mail-order company stored in the following relations: sales(customer,

widget, state, year), booster(customer, widget, driver), catalog(widget, manufacturer). The

sale information is stored in the sales table and the catalog table stores the widgets of

different manufacturers. The booster table stores the driver that influences the particular

sale. In this database, OLAP finds answers to queries of the form "How many widgets were

sold in the first quarter of 1998 in California vs. Florida?" However, data mining attempts

to answer queries like "What are the drivers that caused people to buy these widgets from

our catalog?" Fundamentally, data mining is statistical analysis and has been in practice for

a long time. But, until recently, statistical analysis was a time-consuming, manual process

which limited the amount of data that could be analyzed and the accuracy depended heavily

on the personnel involved in the analysis. Today, with the advent of various sophisticated

technologies, tools exist that automate the process, making data mining a practical solution

for a wide range of companies. For example, Fingerhut's (a direct-mail catalog company)

statistical analysis was limited to taking samples of 10 to 20 percent of its customers. With

data mining, it can examine 300 specific characteristics of each of the 10 to 12 million

customers in a much more focused way [15].
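The OLAP query over the mail-order schema above can be sketched directly in SQL. The sketch below uses sqlite3 with a few made-up rows; since the sales relation as given carries only a year column, the query restricts on year rather than on the first quarter:

```python
import sqlite3

# Hypothetical data for the sales(customer, widget, state, year) relation.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sales (customer TEXT, widget TEXT, state TEXT, year INTEGER)")
cur.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [("alice", "w1", "CA", 1998), ("bob", "w2", "CA", 1998),
     ("carol", "w1", "FL", 1998), ("dave", "w3", "NY", 1998)],
)

# "How many widgets were sold in 1998 in California vs. Florida?"
cur.execute(
    "SELECT state, COUNT(*) FROM sales "
    "WHERE year = 1998 AND state IN ('CA', 'FL') "
    "GROUP BY state ORDER BY state"
)
print(cur.fetchall())  # [('CA', 2), ('FL', 1)]
```

A mining query like "what drives these sales?" has no such direct formulation, which is exactly the gap the rest of the dissertation addresses.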

The initial efforts on data mining research were to cull together techniques from ma-

chine learning and statistics [24, 28, 29, 37, 39, 40, 60, 65, 81, 90, 95, 102, 103, 104] to define

new mining operations and develop algorithms for them [5, 7, 19, 77]. In the remainder of

this section, we briefly introduce the various data mining problems.

1.2.1 Association Rule

Association rule which captures co-occurrence of items or events was introduced in

the context of market basket data [7]. An example of such a rule might be that 60% of

transactions containing beer also contain diapers and 2% of transactions contain both these

items. Here 2% is the support and 60% is the confidence of the rule beer ⇒ diapers.

Association rule mining is stated formally as follows [13]. Let I = {i1, i2, ..., im} be

a set of literals, called items. Let D be a set of transactions, where each transaction T is

a set of items such that T ⊆ I. Each transaction has a unique identifier, called its tid.

An association rule is an implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I, and

X ∩ Y = ∅. The rule X ⇒ Y holds in the transaction set D with confidence c if c% of

transactions in D that contain X also contain Y. The rule X ⇒ Y has support s in the

transaction set D if s% of transactions in D contain X ∪ Y. Given a set of transactions D,

the problem of mining association rules is to generate all association rules that have support

and confidence greater than the user-specified minimum support and minimum confidence.
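The definitions above can be made concrete with a small sketch; the transactions here are invented for illustration:

```python
# Toy transaction set (each transaction is a set of items).
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"beer", "bread"},
    {"milk", "bread"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y, transactions):
    """Fraction of transactions containing X that also contain Y."""
    return support(x | y, transactions) / support(x, transactions)

# The rule {beer} => {diapers}:
print(support({"beer"} | {"diapers"}, transactions))    # 0.5
print(confidence({"beer"}, {"diapers"}, transactions))  # 0.666...
```

A mining algorithm such as Apriori enumerates all rules whose support and confidence clear the user-specified minimums, rather than evaluating one rule at a time as done here.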

The association rule mining problem has attracted tremendous attention from data

mining researchers and several algorithms have been proposed for it [13, 26, 66, 111, 131].

Researchers have also proposed several variants of the basic association rule mining to

handle taxonomies over items [63, 118], numeric attributes [76, 92, 119] and constraints on

items appearing in rules [97, 121].

1.2.2 Sequential Patterns

Sequential pattern mining was first introduced in Agrawal and Srikant [14] and further

generalized in Srikant and Agrawal [120]. Given a set of data-sequences each of which is a list

of transactions ordered by the transaction time, the problem of mining sequential patterns is

to discover all sequences with a user-specified minimum support. Each transaction contains

a set of items. A sequential pattern is an ordered list (sequence) of itemsets. The itemsets

that are contained in the sequence are called elements of the sequence. For example,

((computer, modem)(printer)) is a sequence with two elements: (computer, modem) and

(printer). The support of a sequential pattern is the percentage of data-sequences that

contain the sequence. A sequential pattern can be further qualified by specifying maximum

and/or minimum time gaps between adjacent elements and a sliding time window within

which items are considered part of the same sequence element. These time constraints are

specified by three parameters, max-gap, min-gap and window-size. We refer the reader to

Srikant and Agrawal [120] for a formal definition of the problem.
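Ignoring the max-gap, min-gap and window-size constraints, sequence containment and support can be sketched as follows (the customer data-sequences below are made up):

```python
def contains(data_sequence, pattern):
    """True if `pattern` (a list of itemsets) occurs, in order, in
    `data_sequence` (a time-ordered list of transactions, each a set)."""
    pos = 0
    for element in pattern:
        # Greedily find the next transaction containing this element.
        while pos < len(data_sequence) and not element <= data_sequence[pos]:
            pos += 1
        if pos == len(data_sequence):
            return False
        pos += 1
    return True

def sequence_support(data_sequences, pattern):
    """Percentage of data-sequences that contain the pattern."""
    return 100 * sum(contains(d, pattern) for d in data_sequences) / len(data_sequences)

pattern = [{"computer", "modem"}, {"printer"}]
customers = [
    [{"computer", "modem"}, {"mouse"}, {"printer"}],  # contains the pattern
    [{"computer"}, {"printer"}],  # first element never matched
]
print(sequence_support(customers, pattern))  # 50.0
```

The greedy earliest-match scan is sufficient for plain containment; the time-gap and sliding-window qualifications of Srikant and Agrawal [120] would require tracking transaction times as well.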

1.2.3 Classification

Classification is a well-studied problem [91,129]. However, only recently has there been

focus on algorithms that can handle large datasets [17, 53, 85, 106, 114]. Classification is

the process of generating a description or a model for each class of a given dataset, called

the training set. Each record of the training set consists of several attributes which could

be continuous (coming from an ordered domain), or categorical (coming from an unordered

domain). One of the attributes, called the classifying attribute, indicates the class to which

each record belongs. Once a model is built from the given examples, it can be used to

determine the class of future unclassified records.

Several classification models based on neural networks, statistical models like lin-

ear/quadratic discriminants, decision trees and genetic models have been proposed over the

years. Decision trees are particularly suited for data mining since they can be constructed

relatively fast compared to other methods and they are simple and easy to understand.

Moreover, trees can be easily converted into SQL statements that can be used to access

databases efficiently [5].

A decision tree is a class discriminator that recursively partitions the training set

until each partition consists entirely or dominantly of examples from one class. Each non-

leaf node of the tree contains a split point which is a test on one or more attributes and

determines how the data is partitioned. Figure 1.2 shows a sample decision-tree classifier

and the training set from which it is derived. Each record represents a credit card applicant

and we are interested in building a model that categorizes the applicants into high and low

risk classes.
Age  Salary  Married  Risk
21   30,000  No       High
18   28,000  Yes      High
30   50,000  Yes      Low
35   20,000  Yes      High
23   60,000  No       High
47   80,000  Yes      Low

[Figure: decision tree with tests on Age < 25, Married, and Salary < 25,000; leaves labeled High or Low risk]
Figure 1.2. Credit card classification example
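A decision tree amounts to nested attribute tests. The sketch below shows one tree consistent with the training records above (a reconstruction for illustration, not necessarily the exact splits of the figure), along with the kind of SQL predicate a leaf converts to:

```python
# An illustrative decision tree consistent with the training data:
# applicants under 25 are High risk; otherwise risk depends on salary.
def classify(age, salary):
    if age < 25:
        return "High"
    return "High" if salary < 25000 else "Low"

training = [
    (21, 30000, "High"), (18, 28000, "High"), (30, 50000, "Low"),
    (35, 20000, "High"), (23, 60000, "High"), (47, 80000, "Low"),
]
# The tree reproduces the class of every training record.
assert all(classify(age, salary) == risk for age, salary, risk in training)

# As the text notes, each leaf converts directly into a SQL predicate, e.g.
# the "Low risk" leaf of this tree:
#   SELECT * FROM applicants WHERE age >= 25 AND salary >= 25000
```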

1.2.4 Clustering

The fundamental clustering problem is that of grouping together (clustering) similar

data items and is useful for discovering interesting distributions and patterns in the under-

lying data. Clustering has been formulated in various ways in the machine learning [48],

pattern recognition [51], optimization [112] and statistics literature [20]. The problem of

clustering can be defined as follows: given n data points in a d-dimensional metric space,

partition the data points into k clusters such that the data points within a cluster are more

similar to each other than data points in different clusters. The classic K-Means clustering

algorithm starts with estimated initial values for the k cluster centroids. Each of the data

points is assigned to the cluster with the nearest centroid. After the assignment the cen-

troids are refined to the mean of the data points in that cluster. This process is repeated

several times until an acceptable convergence is reached. There are several research efforts

reported in the data mining literature for clustering large databases [23, 42, 57, 132].
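The K-Means loop just described can be sketched in a few lines; toy data and a fixed iteration count stand in for a real convergence test:

```python
def kmeans(points, centroids, iterations=10):
    """Assign each point to its nearest centroid, then refine each
    centroid to the mean of its cluster; repeat."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 8.5)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)  # [(1.25, 1.5), (8.5, 8.25)]
```

On large databases this naive loop scans all the data on every iteration, which is precisely why the clustering work cited above [23, 42, 57, 132] focuses on reducing passes over the data.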

Research on time series analysis [10, 43, 75], similarity search [3, 8, 21, 72, 115],

high dimensional data analysis [4, 22, 79], and text data management [30, 31] can also be

classified under the broad category of data mining.

1.3 Mining Databases

The initial research efforts on data mining were aimed at defining new mining problems

and a majority of the algorithms for them were developed for data stored in file systems.

Each had its own specialized data structures, buffer management strategies and so on [8, 13,

14, 22, 49, 50, 62, 63, 74, 84, 85, 86, 111, 115, 119, 121]. In cases where the data are stored in

a DBMS, data access was provided through an ODBC or SQL cursor interface [2, 64, 68, 71].

Data mining tools are being used in several application domains [16, 18, 27, 38, 41, 46, 52,

70, 82, 96, 101, 108, 130]. Coupling the data mining tools with a growing base of accessible

enterprise data-often in the form of a data warehouse-gives business institutions a

tool with immense implications. According to the vice president of Mellon

Bank's advanced technology group, "Data mining is the carrot that justifies the expensive

stick of building a data warehouse."

The majority of the warehouse stores-systems used for storing warehouse data-are

relational databases or their variants. The advantages of using database systems are numer-

ous: SQL was invented for direct query of data, most client/server connectivity is supplied

by relational vendors, most replication systems have been designed with relational sources

and targets and most of the relational vendors are delivering parallel database solutions.

There are important alternatives in this segment, however. The OLAP multidimensional en-

gines offer unique performance characteristics across their problem domain. We might also

see traditional file stores providing significantly better performance for some data mining

operations. Differentiators for the relational engines will be overall scalability, availability

across a broad spectrum of hardware platforms, "affinity" with legacy data platforms and

stores, availability of extensions for image, text and so on to support the next generation

of applications, and availability of supporting tools.

The investment in building and managing a data warehouse is enormous. With the

advent of business intelligence and decision support systems, it is imperative that the

data warehouse support multiple applications. Figure 1.3 illustrates how a data warehouse

will typically be used in a business organization. The spectrum of applications include

basic querying and reporting tools, OLAP tools, multidimensional analysis tools and data

mining tools. There are several excellent and popular query/report writers. There are also

tools that support multidimensional analysis directly on relational data stores without the

separate OLAP engine. Note that these tools are like tools in a tool box; that is, they can

be used in combination to produce the desired result. There is no single class of tools that

satisfies the broad range of decision support and business intelligence system requirements.

Therefore, it is crucial that the data mining tools integrate with relational data warehouses

much like the query/report and OLAP tools.

1.4 Goal

The goal of this dissertation is to explore the various issues of database/data ware-

house integration of mining operations. We first study the various architectural alternatives

for coupling data mining with relational database systems, primarily from a performance

perspective. We develop various SQL formulations for association rules [7], a representative

mining problem, and analyze how competitive the SQL implementation can be compared

with other specialized implementations of association rule mining. We further focus on the

[Figure: business intelligence tools for query/reporting, OLAP and data mining sit on top of the data warehouse, which in turn draws on business information sources and data marts.]

Figure 1.3. Typical data warehouse usage

analysis of various execution plans chosen by relational database systems for executing some

of the SQL-based mining queries. We expect that this study will reveal the domain-specific

semantic information of the mining algorithms that needs to be integrated into next gen-

eration query optimizers to handle mining computations efficiently. We also develop SQL

formulations for a few other mining operations, namely, generalized association rules [118]

and sequential patterns [14, 120]. We also propose a few primitive database operators that

are useful for mining and other decision support applications. These operators, if natively

supported in a database system, could potentially speed up the execution of mining queries.

Data warehouses on which the mining tools operate typically are populated incremen-

tally. In order to improve the reliability and usefulness of the discovered information, large

volumes of data need to be collected and analyzed over a period of time. A naive approach

to update the mined information, when new data are added or part of the current data

are deleted, is to recompute them from scratch. However, it would be ideal to develop an

incremental algorithm so that the computation effort spent on the original data can be

effectively utilized. We develop an incremental algorithm and its SQL formulations for up-

dating the association rules, based on the negative border concept. It is based on the closure

property of frequent itemsets, that is, all subsets of a frequent itemset are also frequent.

We show that the incremental mining algorithm can be generalized to handle various kinds

of constraints. We categorize the constraints based on their usage in the mining process

and develop a general framework to process them. We further show that the incremental

technique is applicable to other classes of mining and decision support problems such as

sequential patterns, correlation rules, query flocks and so on.

We summarize and list the goals of this dissertation below:

1. Survey the various data mining problems and algorithms,

2. Study the different database integration alternatives for data mining,

3. Develop and implement SQL formulations of mining algorithms,

4. Analyze the performance profile of current DBMSs to execute the above SQL queries,

5. Explore the enhancements to current cost-based optimizers to incorporate the domain-specific semantics of mining,

6. Develop an incremental association rule mining algorithm and its SQL formulations,

7. Generalize the incremental algorithm for mining constrained associations and show its applicability to other data mining and decision support problems, and

8. Explore primitive operators for mining in databases.

This work has a significant impact on the state-of-the-art in data mining system archi-

tectures and comes at the appropriate time when the data mining community is looking for

answers to "How to mine data warehouses?" Given the amount of data involved in mining,

its potential impact on various business sectors and the fact that OLAP is finding its way

into commercial database systems (for example, the cube operator), it is only a matter of

time before mining becomes an integral part of database systems. We believe that this work

is a small but strong step in the right direction. This will also have a significant impact on

query optimization and parallel query processing techniques.

1.5 Thesis Organization

The rest of this dissertation is organized as follows: We discuss the various architectural

alternatives for integrating mining with database systems/data warehouses in Chapter 2.

The various SQL formulations of association rules, their performance profiles and optimiza-

tions for improving the performance are detailed in Chapter 3. In Chapter 4, we present

the use of object-relational extensions to SQL for improving the performance of SQL-based

association rule mining and the performance comparison of the various architectural al-

ternatives. SQL-based mining of generalized association rules and sequential patterns are

described in Chapters 5 and 6 respectively. Chapter 7 presents the incremental associa-

tion rule mining algorithm, performance comparison, SQL formulation and generalizations

for mining constrained associations. We conclude in Chapter 8 with a discussion of the

proposed database operators and avenues for further research.


The "first-generation" KDD systems offer isolated discovery features using tree induc-

ers, neural nets and rule discovery algorithms. Such systems cannot be embedded into a

large application and typically offer just one knowledge discovery feature. The current state

of data mining systems is very much similar to the days in which database applications were

written in COBOL with just read and write commands as the interface to data stored in

large files. The advent of relational database systems which offered SQL for ad hoc queries

and various relational APIs (application programming interfaces) for application program-

ming made database applications much easier to develop and manage. Data mining has to

undergo a similar transition from the current "file mining" to data warehouse mining and

a richer set of APIs for developing business intelligence and decision support applications.

In the remainder of this chapter, we survey some of the prior research related to the

database integration of mining in Section 2.1. The various architectural alternatives are

discussed in Section 2.2.

2.1 Related Work

Researchers have started to focus on various issues related to integrating mining with

databases [6, 67, 68]. The research on database integration of mining can be broadly clas-

sified into two categories: one which proposes new mining operators and the other which

leverages the query processing capabilities of current relational DBMSs. In the former cate-

gory, there have been language proposals to extend SQL with specialized mining operators.

A few examples are (i) the query language DMQL proposed by Han et al. [64] which ex-

tends SQL with a collection of operators for mining characteristic rules, discriminant rules,

classification rules, association rules and so on, (ii) The M-SQL language of Imielinski and

Virmani [69] which extends SQL with a special unified operator Mine to generate and

query a whole set of propositional rules, and (iii) the mine rule operator proposed by Meo

et al. [89] for a generalized version of the association rule discovery problem. However, they

do not address the processing techniques for these operators inside a database engine and

the interaction of the standard relational operators and the proposed extensions. It is also

important to break these operators down to a finer level of granularity in order to identify com-

monalities between them and derive a set of primitive operators that should be supported

natively in a database engine.

In the second category, researchers have addressed the issue of exploiting the capa-

bilities of conventional relational systems and their object-relational extensions to execute

mining operations. This entails transforming the mining operations into database queries

and in some cases developing newer techniques that are more appropriate in the database

context. The proposal of Agrawal and Shim [12] for tightly coupling a mining algorithm

with a relational database system makes use of user-defined functions (UDFs) in SQL

statements to selectively push parts of the application that perform computations on data

records into the database system. The objective was to avoid one-at-a-time record retrieval

from the database to the application address space, saving both the copying and process

context switching costs. In the KESO project [116], the focus is on developing a data

mining system which interacts with standard DBMSs. The interaction with the database

is restricted to two-way table queries, a special kind of aggregate query. Two-way tables,

which are used in the mining process, have sets of source and target attributes and an

associated count. Association rule mining was formulated as SQL queries in the SETM

algorithm [66]. However, it does not use the subset property-all subsets of a frequent

itemset are frequent-for candidate generation. As a result, SETM counts a large number

of candidate itemsets in the support counting phase and hence is not efficient [9]. Query

flocks generalize boolean association rules to mine associations across relational tables. A

query flock is a parameterized query with a filter condition to eliminate the values of param-

eters that are "uninteresting". Tsur et al. [128] present the use of query flocks for mining

and emphasize the need for incorporating the a-priori technique into new generation query

optimizers to handle mining queries efficiently.

2.2 Architectural Alternatives

The various architectural alternatives for integrating mining with relational systems,

proposed in [109], can be categorized as shown in Figure 2.1. It shows a taxonomy of

various alternatives from loose-coupling to an integrated approach. In the remainder of

this section, we describe each of the alternatives.

2.2.1 Loose-Coupling

This is an example of integrating mining applications into the client in a client/server

architecture or into the application server in a multi-tier architecture. The mining kernel

can be considered as the application server. In this approach, data are read tuple by tuple

from the DBMS to the mining kernel using a cursor interface. The intermediate and the

final results are stored back into the DBMS. The data are never copied on to a file system.

Instead, the DBMS is used as a file system. This is the approach followed by most existing

[Figure: a spectrum running from loose to tight integration: loose-coupling and cache-mine (mining as an application on the client/application server); stored procedures and user-defined functions (mining as an application on the database server); the SQL-based approach and mining extenders/blades (mining using SQL and its extensions); and mining integrated with the SQL query engine (the integrated approach).]

Figure 2.1. Taxonomy of architectural alternatives

mining systems. A potential problem with this approach is the high context switch costs

between the DBMS and the mining kernel processes since they run in different address

spaces. In spite of the block-read optimization present in many systems (for example,

Oracle [98], DB2 [32]) where a block of tuples is read at a time instead of reading a single

tuple at a time, the performance could suffer. This architecture is outlined in Figure 2.2.

[Figure: a GUI or mining language front end sends mining requests to the mining kernel, which exchanges data and results with the DBMS.]

Figure 2.2. Loose-coupling architecture

2.2.2 Cache-Mine

This is a special case of the loose-coupling approach where the mining algorithm reads

data from the DBMS only once and caches the relevant data in flat files on local disk for

future references. The data could be transformed and cached in a format that enables more

efficient access in the future. The mined results, first generated as flat files, are imported

into the DBMS. This method has all the advantages of the stored procedure approach

(described below) plus it promises to have better performance. The disadvantage is that it

requires additional disk space for caching. This architecture is outlined in Figure 2.3.

[Figure: the mining kernel reads data from the DBMS once, caches it on the local file system, and stores the mined results back in the DBMS.]

Figure 2.3. Cache-mine architecture

2.2.3 Stored-Procedure

This architecture is representative of the case where the mining logic is embedded as

applications on the database server. In this approach, shown in Figure 2.4, the mining

algorithm is encapsulated as a stored procedure [32] that is executed in the same address

space as the DBMS. The main advantage of this as well as the loose-coupling and cache-

mine approach is greater programming flexibility. Also, any existing file system code can

be easily transformed to work on data stored in the DBMS.

[Figure: the GUI or mining language front end invokes stored procedures for mining, which execute inside the DBMS.]

Figure 2.4. Stored-procedure architecture

2.2.4 User-Defined Function

This approach is another variant of embedding mining as an application on the database

server if the user-defined functions are run in the unfenced mode (same address space as

the server) [32]. In this case, the entire mining algorithm is encapsulated as a collection of

user-defined functions (UDFs) [32] that are appropriately placed in SQL data scan queries.

The architecture is represented in Figure 2.5. Most of the processing happens in the UDF

and the DBMS is used simply to provide tuples to these UDFs. Little use is made of the

query processing capabilities of the DBMS. The UDFs can be run in either fenced (different

address space) or unfenced (same address space) mode. The main attraction of this method

is performance since when run in the unfenced mode individual tuples never have to cross

the DBMS boundary. Otherwise, the processing happens in almost the same manner as

in the stored procedure case. The main disadvantage is the development cost [12] since

the entire mining algorithm has to be written as UDFs involving significant code rewrites.

Further, these are "heavy-weight" UDFs which involve significant processing and memory usage.

[Figure: the front end issues SQL queries containing UDFs; the UDFs for mining execute inside the DBMS.]

Figure 2.5. UDF-based mining architecture

In order to provide a query interface or application programming interface to the

discovered rules, they can be passed through a post-processing step. The rule discovery

itself could be done by any of the above alternatives.

2.2.5 SQL-Based Approach

This is the integration architecture explored in this dissertation. In this approach, the

mining algorithm is formulated as SQL queries which are executed by the DBMS query

processor. We develop several SQL formulations for a few representative mining operations

in order to better understand the performance profile of current database query processors

in executing these queries. We believe that it will enable us to identify what portions of

these mining operations can be pushed down to the query processing engine of a DBMS.

There are also several potential advantages of a SQL implementation. One can prof-

itably make use of the database indexing and query processing capabilities thereby leverag-

ing on more than two decades of development effort spent in making these systems robust,

portable, scalable, and highly concurrent. Rather than devising specialized paralleliza-

tions, one can potentially exploit the underlying SQL parallelization, particularly in an

SMP environment. The current approach to parallelizing mining algorithms is to develop

specialized parallelizations for each of the algorithms [11, 61, 113, 114]. The DBMS sup-

port for check-pointing and space management can be especially valuable for long-running

mining algorithms on huge volumes of data. The development of new algorithms could be

significantly faster if expressed declaratively using a few SQL operations. This approach is

also extremely portable across DBMSs since porting becomes trivial if the SQL approaches

use only the standard SQL features.

[Figure: mining requests, posed in extended SQL or through a GUI, pass through a preprocessor/optimizer that embodies the domain semantics of mining and emits SQL-92 or SQL-3/SQL-4 queries for an (object-)relational DBMS.]

Figure 2.6. SQL architecture for mining in a DBMS

The architecture we have in mind is schematically shown in Figure 2.6. We visualize

that the desired mining operation will be expressed in some extension of SQL or a graphical

language. A preprocessor will generate appropriate SQL translation for this operation.

This preprocessor will be able to select the right translation taking into account input

data distributions. We consider SQL translations that can be executed on a SQL-92 [88]

relational engine, as well as translations that require some of the newer object-relational

capabilities being designed for SQL [78]. Specifically, we assume availability of blobs, user-

defined functions, and table functions [80] in the Object-Relational engine. We do not

require any mining specific extension in the underlying execution engine; identification of

such extensions is one of the goals of this study.

We do quantitative and qualitative comparisons of some of the architectural alter-

natives listed here. Our primary focus is on the performance of various architectural al-

ternatives and the identification of possible enhancements to the query optimizer and the

query processing engine. The issues of the language constructs required to extend SQL

with mining features, and the details of the preprocessing step shown in Figure 2.6 are beyond the scope of this work.


It might also be possible to integrate mining with databases using the newer extension

technologies like database extenders, data cartridges or data blades.

2.2.6 Integrated Approach

This is the tightest form of integration where the mining operations are an integral

part of the database query engine. In this approach there is no clear boundary between

simple querying, OLAP and mining; that is, querying and mining are treated as similar

operations. The user's goal is to get information from the data store. He/she should not

have to make the distinction as to whether it is the result of querying/OLAP/mining.

This entails unbundling the bulky mining operations and identifying common operator

primitives with which the mining operations can be composed. We cannot expect to have

a specialized operator for every mining task. It also needs a language in which the required

operations can be specified. In order to realize this goal, it requires a tremendous amount

of research in various aspects like designing language extensions, better query processing

and optimization strategies. However, we envision that the query processing engine will

eventually be extended with primitive mining operators. When that is accomplished, a

mining system architecture will resemble the one shown in Figure 2.7.

[Figure: extended SQL requests from the GUI go directly to an (object-)relational DBMS (SQL-92, SQL-3/SQL-4) whose query engine is enhanced with the domain semantics of mining.]

Figure 2.7. Architecture for mining in next-generation DBMSs

2.3 Summary

The first two approaches in the architecture taxonomy in Figure 2.1, namely, mining in

the application server or database server, facilitate the move from file mining to database

mining rather easily. However, as explained in the loose-coupling, cache-mine, stored-

procedure and user-defined function approaches, they do not utilize the query processing

functionality provided by the DBMS.

In this dissertation we pursue the third approach in Figure 2.1, which uses SQL and

its extensions to implement the mining algorithms. This acts as a pre-cursor to determine

the extensions to current query processors and optimizers in order to move towards the last

approach which is the truly integrated approach.

Note The architectural alternatives in Section 2.2 were developed primarily by

researchers from IBM Almaden Research Center and the author was a contributor.


In this chapter, we discuss the various SQL-92 (SQL with no object-relational ex-

tensions) formulations of association rule mining. We start with a review of the apriori

algorithm for association rule mining in Section 3.1. A few other algorithms for mining

association rules are briefly outlined in Section 3.2. The input-output data formats are de-

scribed in Section 3.3 and in Section 3.4, we introduce SQL-based association rule mining.

The various SQL-92 formulations are presented in Section 3.5. We present experimental

results showing the performance of these formulations on some real-life datasets in Sec-

tion 3.6. In Section 3.7, we develop cost formulae for the cost of executing the above SQL

queries on a query processor, based on the input data parameters and relational operator

costs. A few performance optimizations to the basic SQL-92 approaches and the corre-

sponding performance gains are presented in Section 3.8. Section 3.9 quantifies the overall

performance improvements of the optimizations with experiments on synthetic datasets.

The association rule mining problem outlined in Section 1.2.1 can be decomposed into

two subproblems [7].

1. Find all combinations of items whose support is greater than minimum support. Call

those combinations frequent itemsets.

2. Use the frequent itemsets to generate the desired rules. The idea is that if, say, ABCD

and AB are frequent itemsets, then we can determine if the rule AB → CD holds by

computing the ratio r = support(ABCD)/support(AB). The rule holds only if r >

minimum confidence. Note that the rule will have minimum support because ABCD

is frequent.
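The ratio test in the second subproblem is easy to illustrate. The support values below are hypothetical, chosen to reproduce the AB → CD example.

```python
# Hypothetical support values for two frequent itemsets.
supports = {frozenset("ABCD"): 0.30, frozenset("AB"): 0.40}

def confidence(lhs, full):
    """r = support(full) / support(lhs); the rule lhs -> (full - lhs)
    holds only if r is at least the minimum confidence."""
    return supports[frozenset(full)] / supports[frozenset(lhs)]

r = confidence("AB", "ABCD")   # 0.30 / 0.40 = 0.75
print(r >= 0.7)                # AB -> CD holds at a 70% minimum confidence
```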

3.1 Apriori Algorithm

We use the Apriori algorithm [9] as the basis for our presentation. There are recent

proposals aimed at improving the performance of the Apriori algorithm by reducing the

number of data passes [26, 127]. They all have the same basic data-flow structure as the

Apriori algorithm. Our goal is to understand how best to integrate this basic structure

within a database system.

F1 = {frequent 1-itemsets}
for (k = 2; Fk-1 ≠ ∅; k++) do
    Ck = apriori-gen(Fk-1); // generate new candidates
    forall transactions t ∈ D do
        Ct = subset(Ck, t); // find all candidates contained in t
        forall candidates c ∈ Ct do
            c.count++;
    Fk = {c ∈ Ck | c.count ≥ minsup}
Answer = ∪k Fk;

Figure 3.1. Apriori algorithm

The basic Apriori algorithm shown in Figure 3.1 makes multiple passes over the data.

In the kth pass it finds all itemsets with k items having the minimum support, called the

frequent k-itemsets. Each pass consists of two phases. Let Fk represent the set of frequent

k-itemsets, and Ck the set of candidate k-itemsets (potentially frequent itemsets). First is

the candidate generation phase where the set of all frequent (k-1)-itemsets, Fk-1, found

in pass (k-1), is used to generate the candidate itemsets Ck. The candidate generation

procedure ensures that Ck is a superset of the set of all frequent k-itemsets. The algorithm

builds a specialized hash-tree data structure in memory out of Ck. Then is the support

counting phase where the transaction database is scanned. For each transaction, the

algorithm determines which of the candidates in Ck are contained in the transaction using

the hash-tree data structure and increments their support count. At the end of the pass, Ck

is examined to determine which of the candidates are frequent, yielding Fk. The algorithm

terminates when Ck+1 becomes empty.
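The level-wise loop can be sketched in Python. This is an illustrative in-memory version only: the hash-tree of the actual algorithm is replaced by a plain dictionary, and the lexicographic join is simplified to set unions followed by the subset-property prune.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Level-wise frequent itemset mining in the spirit of Figure 3.1."""
    items = sorted({i for t in transactions for i in t})
    F = {}
    # F1 = {frequent 1-itemsets}
    Fk = {frozenset([i]) for i in items
          if sum(i in t for t in transactions) >= minsup}
    k = 1
    while Fk:
        F[k] = Fk
        # apriori-gen: keep only candidates whose every k-subset is frequent
        Ck = set()
        for a in Fk:
            for b in Fk:
                u = a | b
                if len(u) == k + 1 and all(
                        frozenset(s) in Fk for s in combinations(u, k)):
                    Ck.add(u)
        # support counting pass over the transaction database
        count = dict.fromkeys(Ck, 0)
        for t in transactions:
            for c in Ck:
                if c <= t:
                    count[c] += 1
        Fk = {c for c in Ck if count[c] >= minsup}
        k += 1
    return F

D = [frozenset(t) for t in ("ABCD", "ABC", "ABD", "ACD", "BC")]
F = apriori(D, minsup=3)
print(sorted("".join(sorted(s)) for s in F[2]))  # ['AB', 'AC', 'AD', 'BC']
```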

3.2 Other Algorithms

There are several other file-based algorithms for mining association rules. However,

many of them follow the basic apriori framework and try to improve on the computation

and I/O requirements.

The partition algorithm [111] reduces the I/O requirements to just two passes over the

entire dataset. The reason the database needs to be scanned multiple times is because the

number of possible itemsets to be tested for support is exponentially large if it must be done

in a single scan of the database. The partition algorithm generates a set of all potentially

frequent itemsets in one scan of the database. This set is a superset of all frequent itemsets.

In the second scan, the actual support of these itemsets in the whole database is computed.

The algorithm executes in two phases. In the first phase, it divides the database into a

number of non-overlapping partitions. The frequent itemsets in each of these partitions

are computed separately which will involve multiple passes over the data. However, the

partition sizes are chosen such that an entire partition fits in main memory so that it is

read only once in each phase. At the end of the first phase, the frequent itemsets from all

the partitions are merged together to generate the set of all potentially frequent itemsets.

This set is a superset of all the frequent itemsets since all itemsets that are frequent in

the whole database have to be frequent at least in one partition. In the second phase, the

actual support of these itemsets is counted and the frequent itemsets are identified.
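The two phases can be sketched as follows. `local_frequent` is a hypothetical brute-force stand-in for the level-wise miner run on each in-memory partition; a real implementation would size partitions to fit in memory.

```python
import math
from itertools import combinations

def local_frequent(transactions, minsup):
    """Toy miner for a tiny partition: enumerate every subset of every
    transaction and keep those meeting the local support threshold."""
    counts = {}
    for t in transactions:
        for r in range(1, len(t) + 1):
            for s in combinations(sorted(t), r):
                key = frozenset(s)
                counts[key] = counts.get(key, 0) + 1
    return {s for s, c in counts.items() if c >= minsup}

def partition_frequent(transactions, n_parts, minsup_ratio):
    # Phase 1: union of locally frequent itemsets, a superset of the
    # globally frequent itemsets.
    size = math.ceil(len(transactions) / n_parts)
    candidates = set()
    for p in range(n_parts):
        part = transactions[p * size:(p + 1) * size]
        candidates |= local_frequent(part, math.ceil(minsup_ratio * len(part)))
    # Phase 2: one full scan counts the actual support of the candidates.
    minsup = math.ceil(minsup_ratio * len(transactions))
    return {c for c in candidates
            if sum(c <= t for t in transactions) >= minsup}

D = [frozenset(t) for t in ("AB", "AB", "AC", "BC")]
print(sorted("".join(sorted(s)) for s in partition_frequent(D, 2, 0.5)))
```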

Toivonen proposes a sampling based algorithm [127]. The idea there is to pick a

random sample, use it to determine all association rules that probably hold in the whole

database, and then to verify the results with the rest of the database. The algorithm thus

produces exact association rules in one full pass over the database. In those rare cases where

the sampling method does not produce all association rules, the missing rules can be found

in a second pass. A superset of the frequent itemsets can be determined efficiently from a

random sample by applying any level-wise algorithm on the sample in main memory, and

by using a lowered support threshold. In cases where approximate results are sufficient, the

sampling approach can significantly reduce the computational and I/O requirements since

it works on a much smaller dataset.

A different way of counting support is proposed in Savasere et al. [111] and Zaki

et al. [131]. Associated with each item is a tidlist which consists of all the transaction

identifiers that contain that item. The support for an itemset can be obtained by counting

the number of transactions that contain all the items in the itemset. If the tidlists are

kept sorted, this operation can be done by performing a merge-scan of the tidlists of all the

items in the itemset.
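A merge-scan over sorted tidlists can be sketched as below; the tidlists themselves are hypothetical data for illustration.

```python
def support_from_tidlists(tidlists, itemset):
    """Count transactions containing every item of `itemset` by a
    merge-scan over the items' sorted tidlists."""
    lists = [tidlists[i] for i in itemset]
    pos = [0] * len(lists)          # one cursor per tidlist
    count = 0
    while all(pos[j] < len(lists[j]) for j in range(len(lists))):
        heads = [lists[j][pos[j]] for j in range(len(lists))]
        if len(set(heads)) == 1:    # this tid contains all the items
            count += 1
            pos = [p + 1 for p in pos]
        else:                       # advance every cursor at the minimum tid
            m = min(heads)
            pos = [p + (lists[j][p] == m) for j, p in enumerate(pos)]
    return count

tidlists = {"A": [1, 2, 4, 5], "B": [1, 3, 4], "C": [2, 4, 5]}
print(support_from_tidlists(tidlists, ["A", "B"]))       # 2 (tids 1 and 4)
print(support_from_tidlists(tidlists, ["A", "B", "C"]))  # 1 (tid 4)
```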

3.3 Input-Output Formats

Input format The transaction table T normally has two column attributes: transac-

tion identifier (tid) and item identifier (item). For a given tid, typically there are multiple

rows in the transaction table corresponding to different items that belong to the same trans-

action. The number of items per transaction is variable and unknown during table creation

time. Thus, alternative schemas may not be convenient. In particular, assuming that all

items in a transaction appear as different columns of a single tuple [105] is not practical,

because often the number of items per transaction can be more than the maximum number

of columns that the database supports. For instance, for one of the real-life datasets we

experimented with, the maximum number of items per transaction is 872 and for another

it is 700. In contrast, the corresponding average number of items per transaction is only

9.6 and 4.4 respectively. Even if the database supports so many columns for a table, there

will be a lot of space wastage in that scheme.
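The (tid, item) schema handles variable-length transactions naturally, as this small SQLite sketch (toy data, not from the dissertation) shows:

```python
import sqlite3

# One row per item occurrence: a transaction with n items occupies
# n rows, so no column limit is ever hit.
con = sqlite3.connect(":memory:")
con.execute("create table T (tid integer, item integer)")
con.executemany("insert into T values (?, ?)",
                [(1, 10), (1, 20), (1, 30),   # transaction 1: three items
                 (2, 10), (2, 40)])           # transaction 2: two items
rows = con.execute(
    "select tid, count(*) from T group by tid order by tid").fetchall()
print(rows)  # [(1, 3), (2, 2)]
```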

Output format The output is a collection of rules of varying length. The maximum

length of these rules is much smaller than the total number of items and is rarely more

than a dozen. Therefore, a rule is represented as a tuple in a fixed-width table where the

extra column values are set to NULL to accommodate rules involving smaller itemsets. The

schema of a rule is (item1,..., itemk, len, rulem, confidence, support) where k is the size of

the largest frequent itemset. The len attribute gives the length of the rule (number of items

in the rule) and the rulem attribute gives the position of the → in the rule. For instance,

if k = 5, the rule AB → CD which has 90% confidence and 30% support is represented by

the tuple (A, B, C, D, NULL, 4, 2, 0.9, 0.3). The frequent itemsets are represented the

same way as rules but do not have the rulem and confidence attributes.
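The fixed-width encoding above can be sketched in Python, assuming k = 5 as in the example:

```python
NULL = None
K = 5  # size of the largest frequent itemset -> fixed-width schema

def rule_tuple(lhs, rhs, confidence, support):
    """Encode a rule in the (item1..itemK, len, rulem, confidence,
    support) layout described above, padding unused items with NULLs."""
    items = list(lhs) + list(rhs)
    return tuple(items + [NULL] * (K - len(items))
                 + [len(items), len(lhs), confidence, support])

# AB -> CD with 90% confidence and 30% support:
print(rule_tuple(["A", "B"], ["C", "D"], 0.9, 0.3))
```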

3.4 Associations in SQL

In Section 3.4.1 we present the candidate generation process in SQL. In Section 3.4.2

we present the support counting process, and in Section 3.4.3 we present the rule generation process.


3.4.1 Candidate Generation in SQL

Recall that the apriori algorithm for finding frequent itemsets proceeds in a level-wise

manner. In each pass k of the algorithm we first need to generate a set of candidate itemsets

Ck from frequent itemsets Fk-1 of the previous pass.

Given Fk-1, the set of all frequent (k 1)-itemsets, the Apriori candidate generation

procedure [9] returns a superset of the set of all frequent k-itemsets. We assume that the

items in an itemset are lexicographically ordered. Since all subsets of a frequent itemset

are also frequent, we can generate Ck from Fk-1 as follows.

First, in the join step, we generate a superset of the candidate itemsets Ck by joining

Fk-1 with itself as shown below.

insert into Ck select I1.item1, ..., I1.itemk-1, I2.itemk-1
from Fk-1 I1, Fk-1 I2
where I1.item1 = I2.item1 and
      ... and
      I1.itemk-2 = I2.itemk-2 and
      I1.itemk-1 < I2.itemk-1

For example, let F3 be {{1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2 3 4}}. After the join step, C4

will be {{1 2 3 4}, {1 3 4 5}}.

Next, in the prune step, all itemsets c ∈ Ck, where some (k-1)-subset of c is not

in Fk-1, are deleted. Continuing with the example above, the prune step will delete the

itemset {1 3 4 5} because the subset {1 4 5} is not in F3. We will then be left with only

{1 2 3 4} in C4. We can perform the prune step in the same SQL statement as the join

step above by writing it as a k-way join as shown in Figure 3.2. A k-way join is used

since for any k-itemset there are k subsets of length (k 1) for which we need to check in

Fk-1 for membership. The join predicates on I1 and I2 remain the same. After the join

between I1 and I2 we get a k-itemset consisting of (I1.item1, ..., I1.itemk-1, I2.itemk-1)

as shown above. For this itemset, two of its (k-1)-length subsets are already known to be

frequent since it was generated from two itemsets in Fk-1. We check the remaining k-2

subsets using additional joins. The predicates for these joins are enumerated by skipping

one item at a time from the k-itemset as follows. We first skip item1 and check if the

subset (I1.item2, ..., I1.itemk-1, I2.itemk-1) belongs to Fk-1 as shown by the join with I3

in Figure 3.2. In general, for a join with Ir (3 <= r <= k), we skip item r-2, which gives us

join predicates of the form

I1.item1 = Ir.item1 and
      ... and
I1.itemr-3 = Ir.itemr-3 and
I1.itemr-1 = Ir.itemr-2 and
      ... and
I1.itemk-1 = Ir.itemk-2 and
I2.itemk-1 = Ir.itemk-1.

Figure 3.3 gives an example for k = 4.

We construct a primary index on (item1, ..., itemk-1) of Fk-1 to efficiently process

these k-way joins using index probes. Note that sometimes it may not be necessary to

materialize Ck before the counting phase. Instead, the candidate generation can be pipelined

with the subsequent SQL queries used for support counting.

[Tree diagram: the k-way join over Fk-1 copies I1, ..., Ik; the join of I1 and I2 uses I1.item1 = I2.item1, ..., I1.itemk-2 = I2.itemk-2, I1.itemk-1 < I2.itemk-1, and the join with each Ir (r >= 3) skips item r-2 as described above]

Figure 3.2. Candidate generation for any k

(Skip item2): I1.item1 = I4.item1, I1.item3 = I4.item2, I2.item3 = I4.item3
(Skip item1): I1.item2 = I3.item1, I1.item3 = I3.item2, I2.item3 = I3.item3
Join: I1.item1 = I2.item1, I1.item2 = I2.item2, I1.item3 < I2.item3 (over F3 I1, F3 I2, F3 I3, F3 I4)

Figure 3.3. Candidate generation for k = 4
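As a concrete check of the join and prune steps, the k = 4 example above can be replayed on a SQL engine. The sketch below uses SQLite via Python purely for illustration; the table and column names follow the text's conventions, and the combined join-plus-prune query mirrors Figure 3.3.

```python
# Illustrative sketch: Apriori candidate generation (join + prune) for
# k = 4, run on SQLite. F3 holds the example frequent 3-itemsets from
# the text; the joins with I3 and I4 check the remaining two 3-subsets.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table F3 (item1 int, item2 int, item3 int)")
cur.executemany("insert into F3 values (?,?,?)",
                [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)])

cur.execute("""
    select I1.item1, I1.item2, I1.item3, I2.item3
    from F3 I1, F3 I2, F3 I3, F3 I4
    where I1.item1 = I2.item1 and I1.item2 = I2.item2
      and I1.item3 < I2.item3
      -- prune: skip item1, so (item2, item3, item4) must be in F3
      and I1.item2 = I3.item1 and I1.item3 = I3.item2 and I2.item3 = I3.item3
      -- prune: skip item2, so (item1, item3, item4) must be in F3
      and I1.item1 = I4.item1 and I1.item3 = I4.item2 and I2.item3 = I4.item3
""")
c4 = cur.fetchall()
print(c4)  # only {1 2 3 4} survives; {1 3 4 5} is pruned
```

The join step alone would produce both {1 2 3 4} and {1 3 4 5}; the joins with I3 and I4 implement the prune step and eliminate the latter.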

3.4.2 Counting Support to Find Frequent Itemsets

This is the most time-consuming part of the association rule mining algorithm. We

use the candidate itemsets Ck and the data table T to count the support of the itemsets in

Ck. We consider two different categories of SQL implementations.

(A) The first one is based purely on SQL-92. We discuss four approaches in this category

in Section 3.5.

(B) The second utilizes the new SQL object-relational extensions like UDFs, BLOBs

(binary large objects), table functions and so on. Table functions [80] are virtual

tables associated with a user defined function which generate tuples on the fly. Like

normal physical tables they have pre-defined schemas. The function associated with

a table function can be implemented as any other UDF. Thus, table functions can be

viewed as UDFs that return a collection of tuples instead of scalar values.

We discuss six approaches in this category in Chapter 4. Note that UDFs in this

approach are not heavyweight and do not require extensive memory allocation and

coding, unlike a purely UDF-based implementation [12].

3.4.3 Rule Generation

In the second phase of the association rule mining algorithm, we use the frequent

itemsets to generate rules with the user specified minimum confidence, minconf. For every

frequent itemset l, we first find all non-empty proper subsets of l. Then, for each of those

subsets m, we find the confidence of the rule m -> (l - m) and output the rule if it is at least

minconf.

In the support counting phase, the frequent itemsets of size k are stored in table Fk.

Before the rule generation phase, we merge all the frequent itemsets into a single table F.

The schema of F consists of k + 2 attributes (item1, ..., itemk, support, len), where k is

the size of the largest frequent itemset and len is the length of the itemset as discussed

earlier in Section 3.3.

We use the table function GenRules to generate all possible rules from a frequent item-

set. The input argument to the function is a frequent itemset. For each itemset, it outputs

tuples corresponding to rules with all non-empty proper subsets of the itemset in the consequent.

The table function outputs tuples with k + 3 attributes, T_item1, ..., T_itemk,

T_support, T_len, T_rulem. The output is joined with F to find the support of the antecedent,

and the confidence of the rule is calculated by taking the ratio of the support

values. The predicates in the where clause match the antecedent of the rule with the frequent

itemset corresponding to the antecedent. While checking for this match, we need to

check only up to item_j for j <= T_rulem. The or part (t1.T_rulem < j) in the j-th predicate

accomplishes this. Figure 3.4 illustrates the rule generation query.

We can also do rule generation without using table functions and base it purely on

SQL-92. The rules are generated in a level-wise manner where in each level k we generate

rules with consequents of size k. Further, we make use of the property that for any frequent

itemset, if a rule with consequent c holds, then, so do rules with consequents that are subsets

of c as suggested in Agrawal et al. [9]. We can use this property to generate rules in level

k using rules with (k-1)-long consequents found in the previous level, much like the way

we did candidate generation in Section 3.4.1.

The fraction of the total running time spent in rule generation is very small. Therefore,

we do not focus much on rule generation algorithms.

insert into R select t1.T_item1, ..., t1.T_itemk, t1.T_len, t1.T_rulem,
             t1.T_support, t1.T_support/f2.support
from F f1, table(GenRules(f1.item1, ..., f1.itemk, f1.len, f1.support)) as t1, F f2
where (t1.T_item1 = f2.item1 or t1.T_rulem < 1) and
      ... and
      (t1.T_itemk = f2.itemk or t1.T_rulem < k) and
      t1.T_rulem = f2.len and
      t1.T_support/f2.support > :minconf

Figure 3.4. Rule generation
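For readers without a table-function-capable engine, the rule-generation logic can also be sketched in a host language. The Python fragment below is illustrative only; the support counts are invented example values, and this is not the GenRules table function itself.

```python
# Illustrative sketch of the rule-generation phase: for every frequent
# itemset l, emit m -> (l - m) for each non-empty proper subset m whose
# confidence support(l)/support(m) reaches minconf. Supports are made up.
from itertools import combinations

support = {
    frozenset([1]): 6, frozenset([2]): 7, frozenset([3]): 6,
    frozenset([1, 2]): 4, frozenset([1, 3]): 4, frozenset([2, 3]): 4,
    frozenset([1, 2, 3]): 3,
}
minconf = 0.7

rules = []
for itemset, sup in support.items():
    for r in range(1, len(itemset)):  # sizes of non-empty proper subsets
        for ante in combinations(sorted(itemset), r):
            conf = sup / support[frozenset(ante)]
            if conf >= minconf:
                rules.append((ante, tuple(sorted(itemset - set(ante))), conf))

for ante, cons, conf in rules:
    print(ante, "->", cons, round(conf, 2))
```

With these example supports, only the three rules with 2-item antecedents of {1, 2, 3} reach the 0.7 confidence threshold.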

3.5 Support Counting Using SQL-92

We present four approaches in this category.

3.5.1 K-way Join

In each pass k, we join the candidate itemsets Ck with k copies of the transaction table T and

follow it up with a group by on the itemsets as shown in Figure 3.5.

Figure 3.5 also shows a tree diagram of the query. These tree diagrams are not to be

confused with the plan trees which could look quite different.

insert into Fk select item1, ..., itemk, count(*)
from Ck, T t1, ..., T tk
where t1.item = Ck.item1 and
      ... and
      tk.item = Ck.itemk and
      t1.tid = t2.tid and
      ... and
      tk-1.tid = tk.tid
group by item1, item2, ..., itemk
having count(*) > :minsup

[Tree diagram: T t1 joined with T t2, ..., T tk on tid, then joined with Ck on t1.item = Ck.item1, ..., tk.item = Ck.itemk, followed by a group by on item1, ..., itemk and the filter count(*) > :minsup]

Figure 3.5. Support counting by K-way join
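A runnable miniature of this query for pass k = 2 is sketched below on SQLite; the transactions and minsup are invented, and the text's > :minsup threshold is written as >= so that minsup means "supported by at least minsup transactions".

```python
# Minimal sketch of the KwayJoin support-counting query for pass k = 2,
# on SQLite with a tiny invented dataset. T holds (tid, item) pairs.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table T (tid int, item int)")
cur.executemany("insert into T values (?,?)",
                [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2),
                 (3, 2), (3, 3), (4, 1), (4, 2)])
cur.execute("create table C2 (item1 int, item2 int)")
cur.executemany("insert into C2 values (?,?)", [(1, 2), (1, 3), (2, 3)])

minsup = 2  # support threshold: at least 2 transactions
cur.execute("""
    select item1, item2, count(*)
    from C2, T t1, T t2
    where t1.item = C2.item1
      and t2.item = C2.item2
      and t1.tid = t2.tid
    group by item1, item2
    having count(*) >= ?
""", (minsup,))
f2 = sorted(cur.fetchall())
print(f2)  # {1 2} in tids 1, 2, 4; {2 3} in tids 1, 3; {1 3} only in tid 1
```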

This SQL computation, when merged with the candidate generation step, is similar to

the one proposed in Tsur et al. [128] as a possible mechanism to implement query flocks.

In Section 3.7, we discuss the different execution plans for this query and the related

performance issues.

3.5.2 Three-way Join

The above approach requires (k + 1)-way joins in the kth pass. We can reduce the

cardinality of joins to 3 using the following approach which bears some resemblance to

the AprioriTid algorithm in Agrawal et al. [9]. Each candidate itemset Ck, in addition

to attributes (item1,... ,itemk) has three new attributes (oid, idl, id2). oid is a unique

identifier associated with each itemset and id1 and id2 are oids of the two itemsets in Fk-1

from which the itemset in Ck was generated (as discussed in Section 3.4.1). In addition,

in the kth pass we generate a new copy of the data table Tk with attributes (tid, oid) that

keeps for each tid the oid of each itemset in Ck that it supported. For support counting,

we first generate Tk from Tk-1 and Ck and then do a group-by on Tk to find Fk as follows.

insert into Tk select t1.tid, oid
from Ck, Tk-1 t1, Tk-1 t2
where t1.oid = Ck.id1 and t2.oid = Ck.id2 and t1.tid = t2.tid

insert into Fk select oid, item1, ..., itemk, cnt
from Ck,
     (select oid as cid, count(*) as cnt from Tk
      group by oid having count(*) > :minsup) as temp
where Ck.oid = cid
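The two queries above can be exercised on a small invented example; the sketch below (SQLite, pass k = 2, made-up oids and data) is illustrative only.

```python
# Sketch of the three-way join approach for pass k = 2 on SQLite. T1
# records, for each tid, the oid of every frequent 1-itemset it
# supports, and C2 carries (oid, item1, item2, id1, id2).
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table T1 (tid int, oid int)")
cur.executemany("insert into T1 values (?,?)",  # oids 10/20/30 = items 1/2/3
                [(1, 10), (1, 20), (1, 30), (2, 10), (2, 20),
                 (3, 20), (3, 30)])
cur.execute("create table C2 (oid int, item1 int, item2 int, id1 int, id2 int)")
cur.executemany("insert into C2 values (?,?,?,?,?)",
                [(100, 1, 2, 10, 20), (200, 1, 3, 10, 30), (300, 2, 3, 20, 30)])

cur.execute("create table T2 (tid int, oid int)")
cur.execute("""
    insert into T2
    select t1.tid, C2.oid
    from C2, T1 t1, T1 t2
    where t1.oid = C2.id1 and t2.oid = C2.id2 and t1.tid = t2.tid
""")

minsup = 2
cur.execute("""
    select C2.oid, item1, item2, cnt
    from C2,
         (select oid as cid, count(*) as cnt from T2
          group by oid having count(*) >= ?) as temp
    where C2.oid = cid
""", (minsup,))
f2 = sorted(cur.fetchall())
print(f2)  # candidates 100 = {1 2} and 300 = {2 3} are frequent
```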

3.5.3 Two Group-by

Another way to avoid multi-way joins is to first join T and Ck based on whether the

"item" of a (tid, item) pair of T is equal to any of the k items of Ck, then do a group by on

(item1,...,itemk,tid) filtering tuples with count equal to k. This gives all (itemset, tid)

pairs such that the tid supports the itemset. Finally, as in the previous approaches, do a

group-by on the itemset (item1, ..., itemk) filtering tuples that meet the support condition.

insert into Fk select item1, ..., itemk, count(*)
from (select item1, ..., itemk, tid
      from T, Ck
      where item = Ck.item1 or
            ... or
            item = Ck.itemk
      group by item1, ..., itemk, tid
      having count(*) = k) as temp
group by item1, ..., itemk
having count(*) > :minsup
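A small illustration of this query for k = 2, on SQLite with invented data:

```python
# Sketch of the 2GroupBy query for k = 2 on SQLite. The inner group-by
# keeps (itemset, tid) pairs where the tid matched exactly k items of
# the candidate; the outer group-by then counts support per itemset.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table T (tid int, item int)")
cur.executemany("insert into T values (?,?)",
                [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2),
                 (3, 2), (3, 3), (4, 1), (4, 2)])
cur.execute("create table C2 (item1 int, item2 int)")
cur.executemany("insert into C2 values (?,?)", [(1, 2), (1, 3), (2, 3)])

minsup, k = 2, 2
cur.execute("""
    select item1, item2, count(*)
    from (select item1, item2, tid
          from T, C2
          where T.item = C2.item1 or T.item = C2.item2
          group by item1, item2, tid
          having count(*) = ?) as temp
    group by item1, item2
    having count(*) >= ?
""", (k, minsup))
f2 = sorted(cur.fetchall())
print(f2)
```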

3.5.4 Subquery-Based

This approach makes use of common prefixes between the itemsets in Ck to reduce the

amount of work done during support counting. We break up the support counting phase

into a cascade of k subqueries. The l-th subquery Q_l finds all tids that match the distinct

itemsets formed by the first l columns of Ck (call it d_l). The output of Q_l is joined with T

and d_l+1 (the distinct itemsets formed by the first l + 1 columns of Ck) to get Q_l+1. The

final output is obtained by doing a group-by on the k items to count support as above. The

queries and the tree diagram for subquery Q_l are given in Figure 3.6.

3.6 Performance Comparison

In this section we compare the performance of the four SQL-92 approaches. All of

our experiments were performed on Version 5 of IBM DB2 Universal Server installed on a

RS/6000 Model 140 with a 200 MHz CPU, 256 MB main memory and a 9 GB disc with a

measured transfer rate of 8 MB/s. These experimental results are also reported in Sarawagi

et al. [109].

insert into Fk select item1, ..., itemk, count(*)
from (Subquery Qk) t
group by item1, item2, ..., itemk
having count(*) > :minsup

Subquery Q_l (for any l between 1 and k):

select item1, ..., iteml, tid
from T t_l, (Subquery Q_l-1) as r_l-1,
     (select distinct item1, ..., iteml from Ck) as d_l
where r_l-1.item1 = d_l.item1 and ... and
      r_l-1.iteml-1 = d_l.iteml-1 and
      r_l-1.tid = t_l.tid and
      t_l.item = d_l.iteml

Subquery Q_0: No subquery Q_0.

[Tree diagram for Subquery Q_l: T t_l and d_l (select distinct item1, ..., iteml from Ck) joined with Subquery Q_l-1 on r_l-1.item1 = d_l.item1, ..., r_l-1.iteml-1 = d_l.iteml-1 and t_l.item = d_l.iteml, producing (item1, ..., iteml, tid)]

Figure 3.6. Support counting using subqueries
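The cascade can be written out concretely for k = 3; the sketch below (SQLite, one invented candidate) unfolds subqueries Q1 through Q3 inline.

```python
# Sketch of the Subquery approach for pass k = 3 on SQLite. Each
# subquery Q_l narrows the tids matching the distinct l-item prefixes
# d_l of C3; the data and candidate are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table T (tid int, item int)")
cur.executemany("insert into T values (?,?)",
                [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3),
                 (3, 1), (3, 2), (4, 2), (4, 3)])
cur.execute("create table C3 (item1 int, item2 int, item3 int)")
cur.execute("insert into C3 values (1, 2, 3)")

minsup = 2
cur.execute("""
    select item1, item2, item3, count(*)
    from (select r2.item1, r2.item2, d3.item3, t3.tid        -- Q3
          from T t3,
               (select r1.item1, d2.item2, t2.tid            -- Q2
                from T t2,
                     (select d1.item1, t1.tid                -- Q1
                      from T t1,
                           (select distinct item1 from C3) as d1
                      where t1.item = d1.item1) as r1,
                     (select distinct item1, item2 from C3) as d2
                where r1.item1 = d2.item1 and r1.tid = t2.tid
                  and t2.item = d2.item2) as r2,
               (select distinct item1, item2, item3 from C3) as d3
          where r2.item1 = d3.item1 and r2.item2 = d3.item2
            and r2.tid = t3.tid and t3.item = d3.item3) as r3
    group by item1, item2, item3
    having count(*) >= ?
""", (minsup,))
f3 = cur.fetchall()
print(f3)  # only tids 1 and 2 contain all of {1, 2, 3}
```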

Table 3.1. Description of different real-life datasets

Datasets     # Records     # Transactions   # Items       Avg. # items
             (millions)    (millions)       (thousands)
Dataset-A    2.5           0.57             85            4.4
Dataset-B    7.5           2.5              15.8          2.62
Dataset-C    6.6           0.21             15.8          31
Dataset-D    14            1.44             480           9.62

We selected a collection of four real-life datasets obtained from various mail-order

companies and retail stores for the experiments. These datasets have differing values of

parameters like the total number of (tid,item) pairs, the number of transactions (tids), the

number of items and the average number of items per transaction. Table 3.1 summarizes

these parameters.

In this dissertation, we report the performance with only Dataset-A. The overall

observation was that mining implementations in pure SQL-92 are too slow to be practical.

For these experiments we built a composite index (item1, ..., itemk) on Ck, k different indices

on each of the k items of Ck and a (tid, item) and a (item, tid) index on the data table.

The goal was to let the optimizer choose the best plan possible. We do not include the

index building cost in the total time.

In Figure 3.7 we show the total time taken by the four approaches: KwayJoin, 3wayJoin,

Subquery and 2GroupBy. For comparison, we also show the time taken by the Loose-coupling

approach because this is the approach currently used by existing systems. The graph shows

the total time split into candidate generation time (Cgen) and the time for each pass. The

candidate generation time and the time for the first pass are much smaller compared to the

total time. From this set of experiments we can make the following observations.

[Bar chart: total time for Dataset-A at supports 0.50%, 0.35% and 0.20%, split into Cgen, Pass 1, Pass 2 and Pass 3, for the KwayJoin, 3wayJoin, Subquery, 2GroupBy and Loose-coupling approaches]

Figure 3.7. Comparison of different SQL-92 approaches

* The best approach in the SQL-92 category is the Subquery approach. An important

reason for its performing better than the other approaches is exploitation of common

prefixes between candidate itemsets. None of the other three approaches uses this

optimization. Although the Subquery approach is comparable to the Loose-coupling

approach in some cases, for other cases it did not complete even after taking ten times

more time than the Loose-coupling approach.

* The 2GroupBy approach is significantly worse than the other three approaches because

it involves an index-ORing operation on k indices for each pass k of the algorithm.

In addition, the inner group-by requires sorting a large intermediate relation. The

outer group-by is comparatively faster because the sorted result is of size at most Ck

which is much smaller than the result size of the inner group-by. The DBMS does

aggregation during sorting therefore the size of the result is an important factor in

the total cost.

* The 3wayJoin approach is comparable to the KwayJoin approach for this dataset

because the number of passes is at most three. As shown in Agrawal et al. [9], there

might be other datasets, especially ones where there is a significant reduction in the size

of Tk as k increases, where 3wayJoin might perform better than KwayJoin. However,

one disadvantage of the 3wayJoin approach is that it requires space to store and log

the temporary relations Tk generated in each pass.

3.7 Cost Analysis

In this section, we analyze the cost of the KwayJoin and Subquery approaches which

represent the better ones among the SQL-92 approaches. A relational query engine can

execute the KwayJoin query in several different ways and the performance depends on the

execution plan chosen. We experimented with a number of alternative execution plans

for this query. We could force the query processor to choose different plans by creating

different indices on T and Ck, and in some cases by disabling certain join methods in our

experiments using Postgres. We will elaborate more on this in Section 3.9.

We present two different execution plans-one with Ck as the outermost relation in

Section 3.7.1 and another with Ck as the innermost relation in Section 3.7.2-and the cost

analysis for them. The effect of subquery optimization on the cost estimates is outlined in

Section 3.7.3.

Schematic diagrams of the two different execution plans are given in Figures 3.8 and

3.9. In the cost analysis, we use the mining-specific data parameters and knowledge about

association rule mining (Apriori algorithm [9] in this case) to estimate the cost of joins and

the size of join results. Even though current relational optimizers do not use this mining-

specific semantic information, the analysis provides a basis for developing "mining-aware"

[Plan tree: the k copies of T are joined on tid (with the extra predicates t_l-1.item < t_l.item), the result is joined with Ck on Ck.item1 = t1.item, ..., Ck.itemk = tk.item, followed by a group by on item1, ..., itemk and the filter count(*) > :minsup]

Figure 3.8. K-way join plan with Ck as inner relation

[Plan tree: Ck is joined successively with T t1, ..., T tk on Ck.iteml = tl.item and tl.tid = t_l-1.tid, followed by a group by on item1, ..., itemk and the filter count(*) > :minsup]

Figure 3.9. K-way join plan with Ck as outer relation

optimizers. The cost formulae are presented in terms of operator costs in order to make

them general; for instance join(p, q, r) denotes the cost of joining two relations of size p

and q to get a result of size r. The actual cost which is based on the join method used

and the system parameters can be derived by substituting the appropriate values in the

given formulae. The data parameters and operators used in the analysis are summarized

in Table 3.2.

3.7.1 KWayJoin Plan with Ck as Outer Relation

Start with Ck as the outermost relation and perform a series of joins with the k copies

of T. The final join result is grouped on the k items to find the support counts. The

choice of join methods for each of the intermediate joins depends on factors such as the

Table 3.2. Notations used in cost analysis

R              number of records in the input transaction table
T              number of transactions
N              average number of items per transaction = R/T
F1             number of frequent items
S(C)           sum of support of each itemset in set C
s_k            average support of a frequent k-itemset = S(Fk)/Fk
Rf             number of records out of R involving frequent items = S(F1)
Nf             average number of frequent items per transaction = Rf/T
Ck             number of candidate k-itemsets
C(n, k)        number of combinations of size k possible out of a set of size n = n!/(k!(n-k)!)
t_k            cost of generating a k-item combination using table function Comb-k
group(n, m)    cost of grouping n records out of which m are distinct
join(p, q, r)  cost of joining two relations of size p and q to get a result of size r
blob(n)        cost of passing a BLOB of size n integers as an argument

availability of indices, the size of intermediate results, and the amount of available memory.

For instance, the efficient execution of nested loops joins require an index (item, tid) on

T. If the intermediate join result is large, it could be advantageous to materialize it and

perform sort-merge join.

For each candidate itemset in Ck, the join with T produces as many records as the

support of its first item. Therefore, the size of the join result can be estimated to be the

product of the number of k-candidates and the average support of a frequent item and

hence the cost of this join is given by join(Ck, R, Ck · s1). Similarly, the relation obtained

after joining Ck with l copies of T contains as many records as the sum of the support

counts of the l-item prefixes of the candidate k-itemsets. Hence the cost of the l-th join

is join(Ck · s_l-1, R, Ck · s_l), where s_0 = 1. Note that the values of the s_l's can be computed

from statistics collected in the previous passes. Cost of the last join (with Tk) cannot be

estimated using the above formula since the k-item prefix of a k-candidate is not frequent


and we do not know the value of s_k. However, the last join produces S(Ck) records

(there will be as many records for each candidate as its support) and therefore, the cost is

join(Ck · s_k-1, R, S(Ck)). S(Ck) can be estimated by adding the support estimates of all

the itemsets in Ck. A good estimate for the support of a candidate itemset is the minimum

of the support counts of all its subsets. The overall cost of this plan expressed in terms of

operator costs is

{ sum over l = 1, ..., k-1 of join(Ck · s_l-1, R, Ck · s_l) } + join(Ck · s_k-1, R, S(Ck)) + group(S(Ck), Ck)

3.7.2 KWayJoin Plan with Ck as Inner Relation

In this plan, we join the k copies of T and the resulting k-item combinations are joined

with Ck to filter out non-candidate item combinations. The final join result is grouped on

the k-items.

The result of joining l copies of T is the set of all possible l-item combinations of

transactions and there are C(N, l) · T such combinations. We know that the items in the

candidate itemset are lexicographically ordered and hence we can add extra join predicates

as shown in Figure 3.8 to limit the join result to l-item combinations (without these extra

predicates the join will result in l-item permutations). When Ck is the outermost relation

these predicates are not required. A mining-aware optimizer should be able to rewrite the

query appropriately. The l-th join produces (l + 1)-item combinations and therefore, its cost

is join(C(N, l) · T, R, C(N, l + 1) · T). The last join produces S(Ck) records as in the

previous case and hence its cost is join(C(N, k) · T, Ck, S(Ck)). The overall cost of this

plan is

{ sum over l = 1, ..., k-1 of join(C(N, l) · T, R, C(N, l + 1) · T) } + join(C(N, k) · T, Ck, S(Ck)) + group(S(Ck), Ck)

Note that in the above expression C(N, 1) · T = R.

3.7.3 Effect of Subquery Optimization

The subquery optimization makes use of common prefixes among candidate itemsets.

Unfolding all the subqueries will result in a query tree which structurally resembles the

KwayJoin plan tree shown in Figure 3.9. Subquery Q_l produces d_l^k · s_l records, where d_j^k

denotes the number of distinct j-item prefixes of itemsets in Ck. In contrast, the l-th join in

the KwayJoin plan results in Ck · s_l records. Typically d_l^k is much smaller compared to Ck,

which explains why the Subquery approach performs better than the KwayJoin approach.

The output of subquery Qk contains S(Ck) records. The total cost of this approach can be

estimated to be

{ sum over l of trijoin(R, s_l-1 · d_l-1^k, d_l^k, s_l · d_l^k) } + group(S(Ck), Ck)

where trijoin(p, q, r, s) denotes the cost of joining three relations of size p, q, r respectively

producing a result of size s. The value of s_k, which is the average support of a frequent

k-itemset, can be estimated as mentioned in Section 3.7.1.

The experimental results presented in Section 3.6 and in Sarawagi et al. [110] show

that the subquery optimization gave much better performance than the basic KwayJoin

approach (an order of magnitude better in some cases). We observed the same trend in our

additional experiments using synthetic datasets. We used synthetic datasets for some of the

experiments because the real-life datasets were not available outside IBM. The synthetic

datasets used in our experiments are detailed below. The main reason for the subquery

approach performing better is that the number of distinct l-item prefixes is much less

compared to the total number of candidate itemsets which results in joins between smaller

tables. The number of candidate itemsets and the corresponding distinct item prefixes for

various passes in one of our experiments is given in Figure 3.10. These numbers are for the

dataset T10.I4.D100K and 0.33% support. Note that d_k^k is not shown since it is the same as

Ck. In pass 3, C3 contains 2453 itemsets whereas d_1^3 has only 295 1-item prefixes (almost

a factor of 10 less than C3). This results in correspondingly smaller intermediate tables as

shown in the analysis above, which is the key to the performance gain.

[Bar chart: number of candidate itemsets Ck versus distinct l-item prefixes d_l^k in each pass, for T10.I4.D100K at 0.33% support]
Figure 3.10. Number of candidate itemsets vs distinct item prefixes

Experimental datasets. We used synthetic data generated according to the procedure

explained in Agrawal and Srikant [13] for some of the experiments. The results reported

here are for the datasets T5.I2.D100K and T10.I4.D100K. The first dataset consists of 100

thousand transactions, each containing an average of 5 items. The average size of the

Table 3.3. Description of synthetic datasets

Datasets        # Records    # Transactions   # Items   Avg. # items
T5.I2.D100K     546651       100000           1000      5
T10.I4.D100K    1154995      100000           1000      10

maximal potentially frequent itemsets (denoted as I) is 2. The transaction table corre-

sponding to this dataset had approximately 550 thousand records. The second dataset

has 100 thousand transactions, each containing an average of 10 items (total of about 1.1

million records) and the average size of maximal potentially frequent itemsets is 4. The

different parameters of the two datasets are summarized in Table 3.3.

The experiments reported in the rest of this chapter were performed on PostgreSQL

Version 6.3 [100], a public domain DBMS, installed on an 8-processor Sun Ultra Enterprise

4000/5000 with 248 MHz CPUs and 256 MB main memory per processor, running Solaris

2.6. Note that PostgreSQL is not parallelized. It supports nested loops, hash-based and

sort-merge join methods and provides finer control of the optimizer to disable any of the

join methods. We have found it to be a useful platform for studying the performance of

different join methods and execution plans.

3.8 Performance Optimizations

The cost analysis presented above provides some insight into the different components

of the execution time in the different passes and what can be optimized to achieve better

performance. In this section, we present three optimizations to the KwayJoin approach

(other than the subquery optimization) and discuss how they impact the cost. Based on

these optimizations, we develop the Set-oriented Apriori approach in Section 3.8.4.

3.8.1 Pruning Non-Frequent Items

The size of the transaction table is a major factor in the cost of joins involving T.

It can be reduced by pruning the non-frequent items from the transactions after the first

pass. We store the transaction data as (tid, item) tuples in a relational table and hence

this pruning means simply dropping the tuples corresponding to non-frequent items. This

can be achieved by joining T with the frequent items table F1 as follows.

insert into Tf select t.tid, t.item

from T t, F1 f

where t.item = f.item

We insert the pruned transactions into table Tf which has the same schema as that of T.

In the subsequent passes, joins with T can be replaced with corresponding joins with Tf.
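A minimal illustration of this pruning step on SQLite, with invented data and minsup = 2:

```python
# Sketch of non-frequent item pruning on SQLite: item 9 appears in only
# one transaction, so it is absent from F1 and is dropped from Tf.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table T (tid int, item int)")
cur.executemany("insert into T values (?,?)",
                [(1, 1), (1, 2), (1, 9), (2, 1), (2, 2), (3, 1)])
# F1: items appearing in at least minsup = 2 transactions
cur.execute("""create table F1 as
               select item from T group by item having count(*) >= 2""")
cur.execute("create table Tf (tid int, item int)")
cur.execute("""insert into Tf
               select t.tid, t.item from T t, F1 f where t.item = f.item""")
n = cur.execute("select count(*) from Tf").fetchone()[0]
print(n)  # 5 of the 6 (tid, item) pairs survive the pruning
```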

This could result in improved performance especially for higher support values where the

frequent item selectivity is low, since we join smaller tables. For some of the synthetic

datasets we used in our experiments, this pruning reduced the size of the transaction table

to about half its original size. This could be even more useful for real-life datasets which

typically contain lots of non-frequent items. For example, some of the real-life datasets

used for the experiments reported in Sarawagi et al. [109] contained of the order of 100

thousand items out of which only a few hundreds were frequent. Figure 3.11 shows the

reduction in transaction table size due to this optimization for our experimental datasets.

The initial size (R) and the size after pruning (Rf) for different support values are shown.

With this optimization, in the cost formulae of Section 3.7, R can be replaced with

Rf, the number of records in T involving frequent items, and N with Nf, the average


[Bar chart: transaction table size R and pruned size Rf for different support values, for T10.I4.D100K and T5.I2.D100K]
Figure 3.11. Reduction in transaction table size by non-frequent item pruning

number of frequent items per transaction. Note that with the subquery optimization we

could achieve significant performance improvements by reducing the relation sizes of joins.

3.8.2 Eliminating Candidate Generation in Second Pass

This optimization aims at reducing the cost of the second pass which is a significant

portion of the total cost in some cases. In the second pass, C2 is almost a cartesian product

of the two Fis used to generate it and hence materializing it and joining with the T's (or

Tf's) could be expensive. In order to circumvent the problem of counting the large C2,

most mining algorithms use special techniques in the second pass. A few examples are

two-dimensional arrays in IBM's Quest data mining system [2] and hash filters proposed in

Park et al. [99] to limit the size of C2. The generation of C2 can be completely eliminated

by formulating the join query to find F2 as

insert into F2 select p.item, q.item, count(*)

from Tf p, Tf q

where p.tid = q.tid and p.item < q.item

group by p.item, q.item

having count(*) > :minsup
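A small illustration of this query on SQLite, with invented data:

```python
# Sketch of the second-pass optimization on SQLite: F2 is computed by a
# self-join of the pruned table Tf, with no candidate set C2 at all.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table Tf (tid int, item int)")
cur.executemany("insert into Tf values (?,?)",
                [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (3, 1), (3, 3)])

minsup = 2
cur.execute("""
    select p.item, q.item, count(*)
    from Tf p, Tf q
    where p.tid = q.tid and p.item < q.item
    group by p.item, q.item
    having count(*) >= ?
""", (minsup,))
f2 = sorted(cur.fetchall())
print(f2)  # {1 2} in tids 1, 2; {1 3} in tids 1, 3; {2 3} only in tid 1
```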

The cost of second pass with this optimization is

join(Rf, Rf, C(Nf, 2) · T) + group(C(Nf, 2) · T, C(F1, 2))

Even though the grouping cost remains the same, there is a big reduction from the basic

KwayJoin approach in the join costs. Figure 3.12 compares the running time of the sec-

ond pass with this optimization to the basic KwayJoin approach for the two experimental

datasets. For the KwayJoin approach, the best execution plan was the one which generates

all 2-item combinations, joins them with the candidate set and groups the join result. We

can see that this optimization has a significant impact on the running time.

3.8.3 Reusing the Item Combinations from Previous Pass

The SQL formulations of association rule mining are based on generating item combinations

in various ways and similar work is performed in all the different passes. Therefore,

reusing the item combinations will improve the performance especially in the higher passes.

We will explain more about this optimization and the corresponding cost reduction in the

next section.

3.8.4 Set-Oriented Apriori

In this section, we develop a set-oriented version of the apriori algorithm combining

all the performance optimizations outlined above, which we call Set-oriented Apriori. We

store the item combinations generated in each pass and use them in the subsequent passes

instead of generating them in every pass from the transaction tables. In the kth pass of

the support counting phase, we generate a table Tk which contains all k-item combinations

[Bar charts: pass-2 running time with and without the optimization, for T10.I4.D100K at supports 2%, 1%, 0.75% and 0.33%, and for T5.I2.D100K at supports 2%, 1%, 0.50% and 0.10%]
Figure 3.12. Benefit of second pass optimization

that are candidates. Tk has the schema (tid, item1, ..., itemk). We join Tk-1, Tf and Ck as

shown below to generate Tk. A tree diagram of the query is also given in Figure 3.13. The

frequent itemset table Fk is obtained by grouping the tuples of Tk on the k items and applying

the minimum support filtering.

We can further prune Tk by filtering out item combinations that turned out to be

non-frequent. However, this is not essential since we join it with the candidate set Ck+1 in

insert into Tk
select p.tid, p.item1, ..., p.itemk-1, q.item
from Ck, Tk-1 p, Tf q
where p.item1 = Ck.item1 and
      ... and
      p.itemk-1 = Ck.itemk-1 and
      q.item = Ck.itemk and
      p.tid = q.tid

[Tree diagram: Tk-1 p joined with Tf q on p.tid = q.tid and p.itemk-1 < q.item, then joined with Ck on p.item1 = Ck.item1, ..., p.itemk-1 = Ck.itemk-1, q.item = Ck.itemk]

Figure 3.13. Generation of Tk

the next pass to generate Tk+1. The only advantage of pruning Tk is that we will have a

smaller table to join in the next pass; but at the additional cost of joining Tk with Fk.

We use the optimization discussed above for the second pass and hence do not mate-

rialize and store the 2-item combinations T2. Therefore, we generate T3 directly by joining

Tf with C3 as

insert into T3 select p.tid, p.item, q.item, r.item
from Tf p, Tf q, Tf r, C3
where p.item = C3.item1 and q.item = C3.item2 and r.item = C3.item3 and
      p.tid = q.tid and q.tid = r.tid

We can also use the Subquery approach to generate T3 if that is estimated to be less

expensive. T3 will contain exactly the same tuples produced by subquery Q3.
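The T3 generation and the subsequent group-by can be illustrated on a small invented example (SQLite):

```python
# Sketch of Set-oriented Apriori's T3 generation on SQLite: T3
# materializes every candidate 3-item combination per tid, and F3 is
# then a simple group-by over T3. Data and candidate are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table Tf (tid int, item int)")
cur.executemany("insert into Tf values (?,?)",
                [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3),
                 (3, 1), (3, 2)])
cur.execute("create table C3 (item1 int, item2 int, item3 int)")
cur.execute("insert into C3 values (1, 2, 3)")

cur.execute("create table T3 (tid int, item1 int, item2 int, item3 int)")
cur.execute("""
    insert into T3
    select p.tid, p.item, q.item, r.item
    from Tf p, Tf q, Tf r, C3
    where p.item = C3.item1 and q.item = C3.item2 and r.item = C3.item3
      and p.tid = q.tid and q.tid = r.tid
""")
minsup = 2
cur.execute("""select item1, item2, item3, count(*) from T3
               group by item1, item2, item3 having count(*) >= ?""",
            (minsup,))
f3 = cur.fetchall()
print(f3)  # tids 1 and 2 contain {1, 2, 3}
```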

The Set-oriented Apriori algorithm bears some resemblance to the three-way join

approach in Section 3.5.2 and the AprioriTid algorithm in Agrawal and Srikant [13]. In the

three-way join approach, the temporary table Tk stores for each transaction, the identifiers

of the candidates it supported. Tk is generated by joining two copies of Tk-1 with Ck. The

generation of Fk requires a further join of Tk with Ck. AprioriTid makes use of special

data structures which are difficult to maintain in the SQL formulation.

Cost Comparison

In Section 3.7.3, we saw that the cost of the kth pass with the subquery optimization is

{ sum over l of trijoin(R, s_l-1 · d_l-1^k, d_l^k, s_l · d_l^k) } + group(S(Ck), Ck)

As a result of the materialization and reuse of item combinations, Set-oriented Apriori re-

quires only a single 3-way join in the kth pass 1. The cost of the kth pass in Set-oriented

Apriori is

trijoin(Rf, Tk-1, Ck, S(Ck)) + group(S(Ck), Ck)

where Tk-1 and Ck denote the cardinality of the corresponding tables. The grouping cost

is the same as that of the subquery approach. The table Tk-1 contains exactly the same

tuples as that of subquery Qk-1 and hence has a size of s_k-1 · d_k-1^k. Also, d_k^k is the same

as Ck. Therefore, the kth pass cost of Set-oriented Apriori is the same as the kth term in

1. Note that this may be executed as two 2-way joins, since 3-way joins are not generally supported in
current relational systems.

the join cost summation of the subquery approach. This results in significant performance

improvements especially in the higher passes.

Figure 3.14 compares the running times of the subquery and Set-oriented Apriori ap-

proaches for the dataset T10.I4.D100K for 0.33% support. We show only the times for

passes 3 and higher since both the approaches are the same in the first two passes.



Figure 3.14. Benefit of reusing item combinations

Space Overhead

The Set-oriented Apriori approach requires additional space in order to store the item

combinations generated. The size of the table Tk is the same as S(Ck), which is the total

support of all the k-item candidates. Assuming that the tid and item attributes are integers,

each tuple in Tk consists of k + 1 integer attributes. Figure 3.15 shows the space required

to store Tk in terms of number of integers, for the dataset T10.I4.D100K for two different

support values. The space needed for the input data table T is also shown for comparison.

T2 is not shown in the graph since we do not materialize and store it in the Set-oriented

Apriori approach. Note that once Tk is materialized Tk_1 can be deleted unless it needs to

be retained for some other purposes.



[Bar chart: storage size in number of integers for T and for T3, T4, T5 at supports 0.75% and 0.33%.]

Figure 3.15. Space requirements of the set-oriented apriori approach
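The storage figure follows directly from the tuple layout described above; a small helper (ours) makes the arithmetic explicit:

```python
def tk_space_in_integers(k, candidate_supports):
    """Integers needed to store Tk: each supporting (tid, item1, ..., itemk)
    tuple takes k + 1 integer attributes, and Tk has S(Ck) tuples in total,
    where S(Ck) is the sum of the supports of all k-item candidates."""
    s_ck = sum(candidate_supports)
    return s_ck * (k + 1)
```

For instance, two 3-item candidates with supports 10 and 5 give S(C3) = 15 tuples of 4 integers each, i.e. 60 integers.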

In the earlier passes, if the number of item combinations is too large and we do not

want to store them, the Subquery approach can be used instead. However, the Subquery

approach will also generate large intermediate tables except in cases where the output of an

intermediate join is pipelined to the subsequent operation. The transition to Set-oriented

Apriori can be made in any pass by materializing the item combinations. An advantage of

the Set-oriented Apriori is that it requires only simple joins and hence it is easier for the

optimizer to handle them. With multiple levels of nested subqueries, optimization becomes

much harder.

3.9 Performance Experiments with Set-Oriented Apriori

In this section, we compare the total running time of the Subquery and Set-oriented

Apriori approaches.

We studied the performance of the Subquery approach (the best SQL-92 approach in

Section 3.6) and the Set-oriented Apriori for a wide range of data parameters and support

values. We report the results on two of the datasets-T5.I2.D100K and T10.I4.D100K-

which are described in Section 3.7.3.


[Two bar charts, "T10.I4.D100K: Total time" (Pass 1 through Pass 7; supports 2%, 1%, 0.75%, 0.33%) and "T5.I2.D100K: Total time" (Pass 1 through Pass 4; supports 2%, 1%, 0.5%, 0.1%), showing per-pass running times of the two approaches.]

Figure 3.16. Comparison of Subquery and Set-oriented Apriori approaches

In Figure 3.16, we show the relative performance of Subquery and Set-oriented Apriori

approaches for the two datasets. The chart shows the total time taken for each of the

different passes. Set-oriented Apriori performs better than Subquery for all the support

values. The first two passes of both the approaches are similar and they take approximately

equal amount of time. The difference between Set-oriented Apriori and Subquery widens for

higher numbered passes as explained in Section 3.8.4. For T5.I2.D100K, F2 was empty

for support values higher than 0.3% and therefore we chose lower support values to study

the relative performance in higher numbered passes.

In some cases, the optimizer did not choose the best plan. For example, for joins

with T (Tf for Set-oriented Apriori), the optimizer chose a nested loops plan using the (item, tid)

index on T in many cases where the corresponding sort-merge plan was faster; an order

of magnitude faster in some cases. We were able to experiment with different plans by

disabling certain join methods (disabling nested loops join for the above case). We also

broke down the multi-way joins in some cases into simpler two-way joins to study the

performance implications. The reported times correspond to the best join order and join methods.


Figure 3.17 shows the CPU time and I/O time taken for the dataset T10.I4.D100K.

The two approaches show the same relative trend as the total time. However, it should be

noted that the I/O time is less than one third of the CPU time. This shows that there

is a need to revisit the traditional optimization and parallelization strategies designed to

optimize for I/O time, in order to handle the newer decision support and mining queries.


3.9.1 Scale-up Experiment

We experimented with several synthetic datasets to study the scale-up behavior of Set-

oriented Apriori with respect to increasing number of transactions and increasing transaction

size. Figure 3.18 shows how Set-oriented Apriori scales up as the number of transactions is

increased from 10,000 to 1 million. We used the datasets T5.I2 and T10.I4 for the average

sizes of transactions and itemsets respectively. The minimum support level was kept at 1%.

The first graph shows the absolute execution times and the second one shows the times


[Two bar charts, "T10.I4.D100K: CPU time" and "T10.I4.D100K: I/O time", broken down by pass (Pass 1 through Pass 7) at supports 2%, 1%, 0.75%, 0.33%.]

Figure 3.17. Comparison of CPU and I/O times

normalized with respect to the times for the 10,000 transaction datasets. It can be seen

that the execution times scale quite linearly and both the datasets exhibit similar scale-up behavior.


The scale-up with increasing transaction size is shown in Figure 3.19. In these exper-

iments we kept the physical size of the database roughly constant by keeping the product

of the average transaction size and the number of transactions constant. The number of

transactions ranged from 200,000 for the database with an average transaction size of 5 to

[Two line charts for T10.I4 and T5.I2: absolute execution time, and time relative to the 10,000-transaction datasets, plotted against the number of transactions (in thousands, up to 1,200).]

Figure 3.18. Number of transactions scale-up

20,000 for the database with an average transaction size of 50. We fixed the minimum sup-

port level in terms of the number of transactions, since fixing it as a percentage would have

led to large increases in the number of frequent itemsets as the transaction size increased.

The numbers in the legend (for example, 1000) refer to this minimum support. The execu-

tion times increase with the transaction size, but only gradually. The main reason for this

increase was that the number of item combinations present in a transaction increases with

the transaction size.





[Line chart: execution time plotted against average transaction length (5 to 50), one curve per minimum support level.]

Figure 3.19. Transaction size scale-up
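The growth noted above is combinatorial: a transaction of length n contributes C(n, k) k-item combinations, so the per-transaction work grows polynomially with transaction length. A one-line illustration (ours):

```python
from math import comb

def combinations_per_transaction(n, k):
    """Number of k-item combinations a transaction of length n contributes;
    this C(n, k) growth explains why execution time rises with transaction
    length even though the total data size stays roughly constant."""
    return comb(n, k)
```

A length-50 transaction yields C(50, 3) = 19600 3-item combinations, versus only C(5, 3) = 10 for a length-5 transaction.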

There has been increasing interest in developing scalable data mining techniques [23,

33, 117]. For the scalability of the SQL approaches, we leverage the development effort

spent in making the relational query processors scalable.

3.10 Summary

In this chapter, we presented four SQL-92 formulations for association rule mining. We

analyzed the best approach and developed cost formulae for the different execution plans

of the corresponding SQL queries based on the input data parameters and the relational

operator costs. This cost analysis provides a basis for incorporating the semantics of mining

algorithms into future query optimizers.

While doing the experiments, it was difficult to force the optimizer to choose certain

execution plans and join methods since commercial DBMSs do not provide that level of

control. Postgres was relatively better in that respect, since we could control the choice

of join methods. However, the Postgres optimizer did not optimize long queries well,

especially the ones involving nested subqueries.


Note: A part of the work described in this chapter was primarily done by researchers

from IBM Almaden Research Center. Specifically, the SQL-based candidate generation in

Section 3.4.1 and the support counting approaches in Section 3.5 were developed by them.

They are included in this dissertation for completeness.


In this chapter, we study alternative approaches that make use of additional object-

relational features in SQL. For each approach, we also outline a cost-based analysis of the

execution time to enable one to choose between these different approaches. We present six

different approaches, optimizations and their cost estimates in Sections 4.1, 4.2 and 4.3.

Experimental results comparing the performance of these approaches are presented in Sec-

tion 4.4. In Section 4.5, we propose a hybrid approach which combines the best of all

approaches. The performance of different architectural alternatives described in Chapter 2

is compared in Section 4.6. In Section 4.7, we summarize qualitative comparisons of these

architectures. The applicability of the SQL-based approach to other association rule mining

algorithms is briefly discussed in Section 4.8.

4.1 GatherJoin

This approach (see Figure 4.1) is based on the use of table functions described in Section 3.4.2. It generates all possible k-item combinations of items contained in a transaction,

joins them with the candidate table Ck and counts the support of the itemsets by grouping

the join result. It uses two table functions Gather and Comb-K. The data table T is scanned

in the (tid, item) order and passed to the table function Gather. This table function col-

lects all the items of a transaction (in other words, items of all tuples of T with the same

tid) in memory and outputs a record for each transaction. Each such record consists of two

attributes, the tid and item-list which is a collection of all its items in a VARCHAR or a

BLOB. The output of Gather is passed to another table function Comb-K which returns all

k-item combinations formed out of the items of a transaction. A record output by Comb-K

has k attributes T_itm1, ..., T_itmk, which can be directly used to probe into the Ck table.

An index is constructed on all the items of Ck to make the probe efficient. Figure 4.1

presents SQL queries for this approach.

This approach is analogous to the KwayJoin approach where we have replaced the

k-way self join of T with table functions Gather and Comb-K. These table functions are

easy to code and do not require a large amount of memory. Also, it is possible to merge

them together as a single table function GatherComb-K, which is what we did in our imple-

mentation. The Gather function is not required when the data is already in a horizontal

format where each tid is followed by a collection of all its items.
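The behavior of the merged GatherComb-K function can be sketched in a few lines of Python (our illustration; the real table functions stream records rather than returning a list):

```python
from itertools import combinations, groupby
from operator import itemgetter

def gather_comb_k(t_rows, k):
    """Emulate the combined GatherComb-K table function: t_rows is the data
    table T as (tid, item) pairs scanned in (tid, item) order; one output
    record per k-item combination of each transaction."""
    out = []
    for tid, group in groupby(t_rows, key=itemgetter(0)):
        items = [item for _, item in group]      # Gather: collect the transaction
        for combo in combinations(items, k):     # Comb-K: k-item combinations
            out.append((tid,) + combo)
    return out
```

For a transaction {1, 2, 3} this emits the three 2-item records (tid, 1, 2), (tid, 1, 3) and (tid, 2, 3), which are then joined with Ck.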

4.1.1 Special Pass 2 Optimization

Note that for k = 2, the 2-candidate set C2 is simply a join of Fi with itself. Therefore,

we can specially optimize the pass 2 by replacing the join with C2 by a join with F1 before

the table function (see Figure 4.2). That way, the table function gets only frequent items

and generates significantly fewer 2-item combinations. This optimization can be useful for

other passes too but unlike for pass 2 we still have to do the join with Ck.

4.1.2 Variations of GatherJoin Approach


One variation of the GatherJoin approach for pass two is the GatherCount approach

where we push the group-by inside the table function instead of doing it outside. The

candidate 2-itemsets (C2) are represented as a two dimensional array inside the modified

insert into Fk select item1, ..., itemk, count(*)

from Ck, (select t2.T_itm1, ..., t2.T_itmk from T,

table (Gather(T.tid, T.item)) as t1,

table (Comb-K(t1.tid, t1.item-list)) as t2)

where t2.T_itm1 = Ck.item1 and ... and

t2.T_itmk = Ck.itemk

group by Ck.item1, ..., Ck.itemk

having count(*) > :minsup

[Tree diagram of the query plan: T ordered by (tid, item), passed through the Gather and Comb-K table functions, joined with Ck on the k items, grouped by item1, ..., itemk and filtered by count(*) > :minsup.]


Figure 4.1. Support counting by GatherJoin

table function Gather-Cnt for doing the support counting. Instead of outputting the 2-item

combinations of a transaction, it directly uses it to update support counts in the memory

and output only the frequent 2-itemsets, F2 and their support after the last transaction.

Thus, the table function Gather-Cnt is an extension of the GatherComb-2 table function

used in GatherJoin.
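A minimal sketch of Gather-Cnt's counting logic (ours; items are assumed to be renumbered as consecutive integers so a plain 2-D array suffices, and only frequent items are assumed to reach the function):

```python
def gather_cnt(transactions, num_items, minsup):
    """Emulate Gather-Cnt for pass 2: count every 2-item combination of each
    transaction in a two-dimensional array and emit only the frequent
    2-itemsets with their support (items are integers 0..num_items-1)."""
    counts = [[0] * num_items for _ in range(num_items)]
    for items in transactions:
        items = sorted(set(items))
        for i in range(len(items)):
            for j in range(i + 1, len(items)):
                counts[items[i]][items[j]] += 1
    return [(a, b, counts[a][b])
            for a in range(num_items)
            for b in range(a + 1, num_items)
            if counts[a][b] >= minsup]
```

Because the counts live in the array, no outer group-by is needed; only the frequent pairs leave the function after the last transaction.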

The absence of the outer grouping makes this option rather attractive. The UDF code

is also small since it only needs to maintain a 2D array. We could apply the same trick for

subsequent passes but the coding becomes considerably more complicated because of the

insert into F2 select tt2.T_itm1, tt2.T_itm2, count(*)

from (select * from T, F1 where T.item = F1.item1) as tt1,

table (GatherComb-2(tt1.tid, tt1.item)) as tt2

group by tt2.T_itm1, tt2.T_itm2

having count(*) > :minsup

[Tree diagram: T joined with F1 on T.item = F1.item1, passed through the GatherComb-2 table function, grouped and filtered by count(*) > :minsup.]


Figure 4.2. Support Counting by GatherJoin in the second pass

need to maintain hash-tables to index the Ck's. The disadvantage of this approach is that

it can require a large amount of memory to store C2. If enough memory is not available,

C2 needs to be partitioned and the process has to be repeated for each partition. Another

serious problem with this approach is that it cannot be automatically parallelized unlike

the other approaches.


A potential problem with the GatherJoin approach is the high cost of joining the large number of item combinations with Ck. In the GatherPrune variation, we push the join with Ck inside the table function and thus reduce the number of such combinations: Ck is converted to a BLOB and passed as an argument to the table function.

The cost of passing the BLOB for every tuple of T could be high. In general, we can

reduce the parameter passing cost by using a smaller Blob that only approximates the real

Ck. The trade-off is increased cost for other parts notably grouping because not as many

combinations are filtered.


Another variation of GatherJoin is the Horizontal approach that first uses the

Gather function to transform the data to the horizontal format but is otherwise similar

to the GatherJoin approach.

Rajamani et al. [105] propose finding associations using an approach quite similar to

this Horizontal approach. They assume (rather unrealistically) that the data is already in

a horizontal format. However, they do not use the frequent item-set filtering optimization

we outlined for pass 2. Without this optimization, the time for pass 2 for most real-life

datasets blows up even for relatively high support values. Also, at the time of candidate

generation, rather than doing self-join on Fk-1, they join Fk-1 with F1, thereby generating

considerably more combinations than needed. Thus, the approach in Rajamani et al. [105]

is likely to perform worse than Horizontal.

4.1.3 Cost Analysis of GatherJoin and its Variants

The choice of the best approach depends on a number of data characteristics like the

number of items, total number of transactions, average length of a transaction and so on.

We express the costs of different approaches in each pass in terms of parameters that are

known or can be estimated after the candidate generation step of each pass. We include

only the terms that were found to be the dominant part of the total cost in practice. We

use the notations of Table 3.2 in the cost analysis.

The cost of GatherJoin includes the cost of generating k-item combinations, joining

with Ck and grouping to count the support. The number of k-item combinations generated,

Tk is C(N, k) T. Join with Ck filters out the non-candidate item combinations. The size

of the join result is the sum of the support of all the candidates denoted by S(Ck). The

actual value of the support of a candidate itemset will be known only after the support

counting phase. However, we get a good estimate by approximating it to the minimum of

the support of all its (k-1)-subsets in Fk-1. The total cost of the GatherJoin approach is

Tk tk + join(Tk, Ck, S(Ck)) + group(S(Ck), Ck), where Tk = C(N, k) T

The above cost formula needs to be modified to reflect the special optimization of

joining with F1 to consider only frequent items. We need a new term join(R, Fi, Rf) and

need to change the formula for Tk to include only frequent items Nf instead of N.

For the second pass, we do not need the outer join with Ck. The total cost of

GatherJoin in the second pass is

join(R, F1, Rf) + T2 t2 + group(T2, C2), where T2 = C(Nf, 2) T

Cost of GatherCount in the second pass is similar to that for basic GatherJoin except

for the final grouping cost. In this formula, "groupint" denotes the cost of doing the

support counting inside the table function.

join(R, F1, Rf) + groupint(T2, C2) + F2 t2

For GatherPrune the cost equation is

R blob(k Ck) + S(Ck) tk + group(S(Ck), Ck).

We use blob(k Ck) for the BLOB passing cost since each itemset in Ck contains k items.

The cost estimate of Horizontal is similar to that of GatherJoin except that here

the data is materialized in the horizontal format before generating the item combinations.

4.2 Vertical

In this approach, we first transform the data table into a vertical format by creating for

each item a BLOB containing all tids that contain that item (Tid-list creation phase) and

then count the support of itemsets by merging together these tid-lists (support counting

phase). This approach is related to the approaches discussed in Savasere et al. [111] and

Zaki et al. [131].

For creating the Tid-lists we use a table function Gather. This is the same as the

Gather function in GatherJoin except that here we create the tid-list for each frequent

item. The data table T is scanned in the (item, tid) order and passed to the function

Gather. The function collects the tids of all tuples of T with the same item in memory and

outputs an (item, tid-list) tuple for items that meet the minimum support criterion. The tid-

lists are represented as BLOBs and stored in a new TidTable with attributes (item, tid-list).

The SQL query which does the transformation to vertical format is given in Figure 4.3.

insert into TidTable select tt2.item, tt2.tid-list

from (select * from T order by item, tid) as tt1,

table (Gather(tt1.item, tt1.tid, :minsup)) as tt2


Figure 4.3. Tid-list creation
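The effect of the Gather function can be sketched in Python (our illustration; tid-lists are kept as Python lists instead of BLOBs):

```python
from itertools import groupby
from operator import itemgetter

def build_tidtable(t_rows, minsup):
    """Emulate the Gather table function of the tid-list creation query:
    t_rows is T as (item, tid) pairs scanned in (item, tid) order; returns
    the TidTable as {item: sorted tid-list}, keeping only items that appear
    in at least minsup transactions."""
    tidtable = {}
    for item, group in groupby(t_rows, key=itemgetter(0)):
        tids = [tid for _, tid in group]
        if len(tids) >= minsup:
            tidtable[item] = tids
    return tidtable
```

Because T is scanned in (item, tid) order, each tid-list comes out already sorted, which is what makes the later merge-based intersection possible.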

In the support counting phase, conceptually for each itemset in Ck we want to collect

the tid-lists of all k items and use a UDF to count the number of tids in the intersection

of these k lists. The tids are in the same sorted order in all the tid-lists and therefore the

intersection can be done easily and efficiently by a single pass of the k lists. This conceptual

step can be improved further by decomposing the intersect operation so that we can share

these operations across itemsets having common prefixes as follows:

We first select distinct (item1, item2) pairs from Ck. For each distinct pair we first perform the intersect operation to get a new result-tidlist, then find distinct triples (item1, item2, item3) from Ck with the same first two items, intersect result-tidlist with the tid-list of item3 for each triple, and continue with item4 and so on until all k tid-lists per itemset are intersected.

The above sequence of operations can be written as a single SQL query for any k as

shown in Figure 4.4. The final intersect operation can be merged with the count operation

to return a count instead of the tid-list. We do not include this optimization in the query

of Figure 4.4 for simplicity.
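The merge-based intersection underlying the UDF can be sketched as follows (our illustration; the actual UDF operates on BLOB-encoded tid-lists, and the prefix sharing happens in the SQL query rather than here):

```python
from functools import reduce

def intersect(a, b):
    """Single-pass merge of two sorted tid-lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def support(itemset, tidlists):
    """Support of an itemset = size of the intersection of the tid-lists of
    all its items, folded left to right as in subqueries Q2, ..., Qk."""
    return len(reduce(intersect, (tidlists[x] for x in itemset)))
```

Each merge touches every tid of its two inputs once, which is why the cost analysis below charges twice the average tid-list length per invocation.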

4.2.1 Special Pass 2 Optimization

For pass 2 we need not generate C2 and join the TidTables with C2. Instead, we

perform a self-join on the TidTable using predicate tl.item < t2.item.

insert into F2 select item1, item2, cnt

from (select t1.item as item1, t2.item as item2,

CountIntersect(t1.tid-list, t2.tid-list) as cnt

from TidTable t1, TidTable t2

where t1.item < t2.item) as t

where cnt > :minsup

insert into Fk select item1, ..., itemk, count(tid-list) as cnt

from (Subquery Qk) t where cnt > :minsup

Subquery Ql (for any l between 2 and k):

select item1, ..., iteml, Intersect(rl-1.tid-list, tl.tid-list) as tid-list

from TidTable tl, (Subquery Ql-1) as rl-1,

(select distinct item1, ..., iteml from Ck) as dl

where rl-1.item1 = dl.item1 and ... and

rl-1.iteml-1 = dl.iteml-1 and

tl.item = dl.iteml

Subquery Q1: (select * from TidTable)

[Tree diagram for Subquery Ql: TidTable tl joined with Subquery Ql-1 and the distinct l-item prefixes dl from Ck, producing (item1, ..., iteml, tid-list).]

Figure 4.4. Support counting using UDF

4.2.2 Cost Analysis

The cost of the Vertical approach during support counting phase is dominated by

the cost of invoking the UDFs and intersecting the tid-lists. The UDF is first called for

each distinct item pair in Ck, then for each distinct item triple with the same first two

items and so on. Let dj be the number of distinct j-item prefixes in Ck. Then the total number of times the UDF is invoked is Σ_{j=2..k} dj. In each invocation two BLOBs of tid-list

are passed as arguments. The UDF intersects the tid-lists by a merge pass and hence the

cost is proportional to twice the average length of a tid-list. The average length of a tid-list can be approximated by Rf/F1. Note that with each intersect the tid-list keeps shrinking. However,

we ignore such effects for simplicity.

In addition to the intersect cost it includes the cost of joins in the query also. The

join cost of subquery Ql can be recursively defined as

C(Ql) = trijoin(F1, dl-1, dl, dl) + C(Ql-1)

where tri-join(p, q, r, s) denotes the cost of joining three relations of size p, q, r respectively

producing a result of size s. The exact cost of the 3-way join will depend on the join order.

The cost of subquery Q1 is the cost of scanning the TidTable, which has F1 tuples. The result size of subquery Ql is dl and the result size of Q1 is F1. The total cost of the

Vertical approach is

C(Qk) + (Σ_{j=2..k} dj) {2 Blob(Rf/F1) + Intersect(2 Rf/F1)}

In the formula above Intersect(n) denotes the cost of intersecting two tid-lists with

a combined size of n. The total cost is dominated by the intersect cost and join costs

account for only a small fraction. Therefore, we can safely ignore the join costs in the

above formulae.

The total cost of the second pass is

join(F1, F1, (F1)^2 / 2) + C2 {2 Blob(Rf/F1) + Intersect(2 Rf/F1)}

4.3 SQL-Bodied Functions

This approach is based on SQL-bodied functions commonly known as SQL/PSM [87].

SQL/PSMs extend SQL with additional control structures. We make use of one such

construct, the for ... do ... end for loop.

We use the for construct to scan the transaction table T in the (tid, item) order. Then,

for each tuple (tid, item) of T, we update those tuples of Ck that contain one matching

item. Ck is extended with 3 extra attributes (prevTid, match, supp). The prevTid attribute

keeps the tid of the previous tuple of T that matched that itemset. The match attribute

contains the number of items of prevTid matched so far and supp holds the current support

of that itemset. On each column of Ck an index is built to do a searched update.

for this as select * from T do

update Ck set prevTid = tid,

match = case when tid = prevTid then match + 1 else 1 end,

supp = case when match = k-1 and tid = prevTid then supp + 1

else supp end

where item = item1 or ... or item = itemk

end for

insert into Fk select item1, ..., itemk, supp

from Ck where supp > :minsup
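To trace the update semantics, the loop can be simulated in Python (our sketch; as in a SQL searched update, all right-hand sides are evaluated against the pre-update values of prevTid, match and supp):

```python
def sbf_count(t_rows, candidates, k):
    """Simulate the SQL/PSM loop: scan T in (tid, item) order, keeping
    (prevTid, match, supp) per k-item candidate; supp is incremented when the
    k-th matching item of a candidate is seen within the same transaction."""
    state = {c: (None, 0, 0) for c in candidates}
    for tid, item in t_rows:
        for cand, (prev, match, supp) in state.items():
            if item in cand:
                # Right-hand sides use the values from before the update.
                new_match = match + 1 if tid == prev else 1
                new_supp = supp + 1 if (match == k - 1 and tid == prev) else supp
                state[cand] = (tid, new_match, new_supp)
    return {cand: st[2] for cand, st in state.items()}
```

For instance, with candidate (1, 2) and transactions {1, 2}, {1, 2}, {1}, the candidate reaches match = k in the first two transactions only, giving a support of 2.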

The cost of this approach can be mainly attributed to the cost of updates to the

candidate table Ck. For each tuple of the data table T, for all the candidate itemsets in Ck

which contain that item, three updates are performed (the attributes prevTid, match, supp

of the itemset are updated). If Nk is the average number of k-item candidates containing

any given item, the total number of updates is 3 R Nk. The cost due to updates for this

approach is U(3 R Nk) where U(n) is the cost of n updates. If the updates are logged

this cost includes the logging cost also.

4.4 Performance Comparison

We studied the performance of six approaches in this category: GatherJoin and its

variants GatherPrune, Horizontal and GatherCount; Vertical; and SBF. We used the four

datasets summarized in Table 3.1. In Figure 4.5 we show the performance of only the four

approaches: GatherJoin, GatherCount, GatherPrune and Vertical. For the other

two approaches the running times were comparatively so large that we had to abort the

runs in many cases. The main reason why the Horizontal approach was significantly

worse than the GatherJoin approach was the time to transform the data to the horizontal

format. For instance, for Dataset-C it was 3.5 hours which is almost 20 times more than

the time taken by Vertical for 2% support. For Dataset-B the process was aborted


after running for 5 hours. After the transformation, compared to GatherJoin the time

taken by Horizontal was also significantly worse when run without the frequent itemset

filtering optimization but with the optimization the performance was comparable. The SBF

approach had significantly worse performance because of the expensive index ORing of

the k join predicates. Another problem with this approach is the large number of updates

to the Ck table. In DB2, all of these updates are logged, resulting in severe performance degradation.


[Four bar charts, one per dataset (Data set-A through Data set-D), showing total running time broken into a "Prep" stage and per-pass times at decreasing support values.]

Figure 4.5. Comparison of four SQL-OR approaches: Vertical, GatherPrune, GatherJoin and GatherCount

Figure 4.5 shows the total running time of the different approaches. The time taken

is broken down by each pass and an initial "prep" stage where any one-time data transfor-

mation cost is included. We can make several observations from the experimental results.

First, let us concentrate on the overall comparison between the different approaches. Then

we will compare the approaches based on how they perform in each pass of the algorithm.

Overall, the Vertical approach has the best performance and is sometimes more than

an order of magnitude better than the other three approaches.

The majority of the time of the Vertical approach is spent in transforming the

data to the Vertical format in most cases (shown as "prep" in Figure 4.5). The vertical

representation is like an index on the item attribute. If we think of this time as a one-

time activity like index building then performance looks even better. Note that the time

to transform the data to the Vertical format was much smaller than the time for the

horizontal format although both formats write almost the same amount of data. The main

reason was the difference in the number of records written. The number of frequent items

is often two to three orders of magnitude smaller than the number of transactions.

Between GatherJoin and GatherPrune, neither strictly dominates the other. The

special optimization in GatherJoin of pruning based on F1 had a big impact on perfor-

mance. With this optimization, for Dataset-B with support 0.1%, the running time for

pass 2 alone was reduced from 5.2 hours to 10 minutes.

When we compare these different approaches based on time spent in each pass we

observe that no single approach is "the best" for all different passes of the different datasets

especially for the second pass.

For pass three onwards, Vertical is often two or more orders of magnitude better than

the other approaches. Even in cases like Dataset-B, support 0.01% where it spends three

hours in the second pass, the total time for next two passes is only 14 seconds whereas it is

more than an hour for the other two approaches. For subsequent passes the performance

degrades dramatically for GatherJoin, because the table function GatherComb-K generates

a large number of combinations. For instance, for pass 3 of Dataset-C, even at a support

value of 2%, the pass did not complete after 5.2 hours, whereas for Vertical pass 3 finished

in 0.2 seconds. GatherPrune is better than GatherJoin for the third and later passes. For

pass 2 GatherPrune is worse because the overhead of passing a large object as an argument

dominates cost.

The Vertical approach sometimes ended up spending too much time in the second

pass. In some of these cases the GatherJoin approach was better in the second pass

(for instance for low support values of Dataset-B) whereas in other cases (for instance,

Dataset-C support 0.25%) GatherCount was the only good option. For this case both the

GatherPrune and GatherJoin did not complete after more than six hours even for pass 2.

Further, they caused a storage overflow error because of the large size of the intermediate

results to be sorted. We had to divide the dataset into four equal parts and run the second

pass independently on each partition to avoid this problem.

[Line chart: pass-2 time of the Vertical and GatherJoin approaches plotted against average transaction length (0 to 60).]

Figure 4.6. Effect of increasing transaction length (average number of items per transaction)

Two factors that affect the choice amongst the Vertical, GatherJoin and GatherCount

approaches in different passes, and pass 2 in particular, are: the number of frequent items (F1)

and the average number of frequent items per transaction (Nf). From the graphs in Fig-

ure 4.5 we notice that as the value of the support is decreased for each dataset causing

the size of F1 to increase, the performance of pass 2 of the Vertical approach degrades

rapidly. This trend is also clear from our cost formulae. The cost of the Vertical approach

increases quadratically with F1. GatherJoin depends more critically on the number of fre-

quent items per transaction. For Dataset-B even when the size of F1 increases by a factor

of 10, the value of Nf remains close to 2, therefore the time taken by GatherJoin does not

increase as much. However, for Dataset-C the size of Nf increases from 3.2 to 10 as the

support is decreased from 2.0% to 0.25% causing GatherJoin to deteriorate rapidly. From

the cost formula for GatherJoin we notice that the total time for pass 2 increases almost

quadratically with Nf.

We validate this observation further by running experiments on synthetic datasets

for varying values of the number of frequent items per transaction. We used the synthetic

dataset generator described in Agrawal et al. [9] for this purpose. We varied the transaction

length, the number of transactions and the support values while keeping the total number

of records and the number of frequent items fixed. In Figure 4.6 we show the total time

spent in pass 2 of the Vertical and GatherJoin approaches. As the number of items per

transaction (transaction length) increases, the cost of Vertical remains almost unchanged

whereas the cost of GatherJoin increases.

4.5 Final Hybrid Approach

The previous performance section helps us draw the following conclusion: Overall, the

Vertical approach is the best option especially for higher passes. When the size of the

candidate itemsets is too large, the performance of the Vertical approach could suffer.

In such cases, GatherJoin is a good option as long as the number of frequent items per

transaction (Nf) is not too large. When that happens GatherCount may be the only

good option even though it may not be easily parallelizable. These qualitative differences are

captured by the cost formulae we presented earlier and are used by our final hybrid scheme.

The hybrid scheme chooses the best of the three approaches GatherJoin, GatherCount

and Vertical for each pass based on the cost estimation outlined in the previous sections.

The parameter values used for the estimation are all available at the end of the previous

pass. In Section 4.6 we plot the final running time for the different datasets based on this

hybrid approach.
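The pass-level selection can be sketched as follows. This is a minimal illustration in Python; the cost functions and constants are hypothetical stand-ins for the cost formulae of the earlier sections, and only the choose-the-cheapest logic reflects the actual scheme:

```python
# Sketch of the hybrid scheme's per-pass choice: estimate the cost of each
# approach from statistics available at the end of the previous pass and
# pick the cheapest. The cost functions below are hypothetical stand-ins
# for the real formulae; only the selection logic is illustrated.

def cost_vertical(f1, t, nf):
    # pass-2 cost grows roughly quadratically with the size of F1
    return 1e-4 * f1 ** 2

def cost_gather_join(f1, t, nf):
    # cost grows roughly quadratically with Nf, frequent items per transaction
    return 1e-6 * t * nf ** 2

def cost_gather_count(f1, t, nf):
    # modeled as linear in Nf; cheap, but not easily parallelizable
    return 3e-6 * t * nf

def choose_approach(f1, t, nf):
    costs = {
        "Vertical": cost_vertical(f1, t, nf),
        "GatherJoin": cost_gather_join(f1, t, nf),
        "GatherCount": cost_gather_count(f1, t, nf),
    }
    return min(costs, key=costs.get)

print(choose_approach(f1=500, t=100_000, nf=2))    # GatherJoin
print(choose_approach(f1=500, t=100_000, nf=20))   # GatherCount
print(choose_approach(f1=100, t=100_000, nf=20))   # Vertical
```

With these toy constants, a large F1 with few frequent items per transaction favors GatherJoin, a large Nf pushes the choice away from GatherJoin, and a small F1 favors Vertical, mirroring the qualitative conclusions above.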

4.6 Architecture Comparisons

In this section our goal is to compare five architectural alternatives: Loose-coupling,

Stored-procedure, Cache-Mine, UDF, and the best SQL implementation.

For Loose-coupling, we use the implementation of the Apriori algorithm [9] for find-

ing association rules provided with the IBM data mining product, Intelligent Miner [71].

For Stored-procedure, we extracted the Apriori implementation in Intelligent Miner and

created a stored procedure out of it. The stored procedure is run in the unfenced mode

in the database address space. For Cache-Mine, we used an option provided in Intelligent

Miner that causes the input data to be cached as a binary file after the first scan of the data

from the DBMS. The data is copied in the horizontal format where each tid is followed by

an encoding of all its frequent items.

For the UDF-architecture, we use the UDF implementation of the Apriori algorithm

described in Agrawal and Shim [12]. In this implementation, first a UDF is used to initialize

state and allocate memory for candidate itemsets. Next, for each pass a collection of UDFs

are used for generating candidates, counting support, and checking for termination. These

UDFs access the initially allocated memory, the address of which is passed around in BLOBs.

Candidate generation creates the in-memory hash-trees of candidates. This happens en-

tirely in the UDF without any involvement of the DBMS. During support counting, the

data table is scanned sequentially and for each tuple a UDF is used for updating the counts

on the memory-resident hash-tree.
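The counting step can be sketched as follows, with a plain dictionary standing in for the in-memory hash-tree (a simplified illustration, not the Intelligent Miner UDF code):

```python
from itertools import combinations

# Sketch of the UDF's support-counting step: candidate itemsets live in
# memory across calls (a plain dict stands in for the hash-tree here), and
# each tuple scanned from the data table updates the counts of the
# candidates contained in that transaction.

def count_support(transactions, candidates, k):
    counts = {c: 0 for c in candidates}
    for items in transactions:
        # enumerate the k-item subsets of the transaction and bump matches
        for subset in combinations(sorted(items), k):
            key = frozenset(subset)
            if key in counts:
                counts[key] += 1
    return counts

txns = [{1, 2, 3}, {1, 2}, {2, 3}]
cands = {frozenset({1, 2}), frozenset({1, 3}), frozenset({2, 3})}
counts = count_support(txns, cands, 2)
print(counts[frozenset({1, 2})], counts[frozenset({1, 3})],
      counts[frozenset({2, 3})])
# 2 1 2
```

The real implementation uses a hash-tree rather than a dictionary precisely to avoid enumerating all k-subsets when Nf is large.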

4.6.1 Timing Comparison

In Figure 4.7, we show the performance of the four architectural alternatives: Cache-Mine,

Stored-procedure, UDF and our best SQL implementation for the datasets in Table 3.1.

We do not show the times for the Loose-coupling option because its performance was very

close to the Stored-procedure option. For each dataset three different support values are

used. The total time is broken down by the time spent in each pass.

We can make the following observations.

* Cache-Mine has the best or close to the best performance in all cases. 80-90% of its

total time is spent in the first pass where data is accessed from the DBMS and cached

in the file system. Compared with the SQL approach, Cache-Mine is a factor of 0.8 to

2 faster.

[Figure 4.7 panels: for each dataset, the total time of the architectures at three support values, broken down by pass.]

* The Stored-procedure approach is the worst. The difference between Cache-Mine

and Stored-procedure is directly related to the number of passes. For instance,

for Dataset-A the number of passes increases from two to three when decreasing

support from 0.5% to 0.35% causing the time taken to increase from two to three

times. The time spent in each pass for Stored-procedure is the same except when

the algorithm makes multiple passes over the data since all candidates could not

fit in memory together. This happens for the lowest support values of Dataset-B,


Figure 4.7. Comparison of four architectures

Dataset-C and Dataset-D. Time taken by Stored-procedure can be expressed

approximately as the number of passes times the time taken by Cache-Mine.

* UDF is similar to Stored-procedure. The only difference is that the time per pass

decreases by 30-50% for UDF because of closer coupling with the database.

* The SQL approach comes second in performance after the Cache-Mine approach for

low support values and is even somewhat better for high support values. The cost

of converting the data to the vertical format for SQL is typically lower than the cost

of transforming data to binary format outside the DBMS for Cache-Mine. However,

after the initial transformation subsequent passes take negligible time for Cache-Mine.

For the second pass SQL takes significantly more time than Cache-Mine particularly

when we decrease support. For subsequent passes even the SQL approach does not

spend too much time. Therefore, the difference between Cache-Mine and SQL is not

very sensitive to the number of passes because both approaches spend negligible time

in higher passes.

* The SQL approach is 1.8 to 3 times better than the Stored-procedure or Loose-coupling

approaches. As we decreased the support value so that the number of passes over the

dataset increased, the gap widened. Note that we could have implemented a Stored-

procedure using the same hybrid algorithm that we used for SQL instead of using the

IM algorithm. We would then expect the performance of Stored-procedure to improve

because the number of passes over the data would decrease. However, we would pay the

storage penalty of making an additional copy of the data, as we did in the Cache-Mine

approach. The performance of Stored-procedure cannot be better than Cache-Mine

because, as we have observed, most of Cache-Mine's time is spent in the first pass,

which cannot be changed for Stored-procedure.

4.6.2 Scale-up experiment

Our experiments with the four real-life datasets above have shown the scaling properties

of the different approaches with decreasing support value and increasing number of frequent

itemsets. We experiment with synthetic datasets to study other forms of scaling: increasing

number of transactions and increasing average length of transactions. In Figure 4.8 we show

how Stored-procedure, Cache-Mine and SQL scale with increasing number of transactions.

UDF and Loose-coupling have scale-up behavior similar to Stored-procedure; therefore

we do not show these approaches in the figure. We used a dataset with an average of 10

items per transaction, 100 thousand total items and a default pattern length (defined in [9])

of 4. Thus, the size of the dataset is 10 times the number of transactions. As the number of

transactions is increased from 10K to 3000K, the time taken increases proportionately. The

largest frequent itemset was 5 items long, so Stored-procedure makes roughly five passes

over the data; this explains the five-fold difference in performance

between the Stored-procedure and the Cache-Mine approach. Figure 4.9 shows the scaling

when the transaction length changes from 3 to 50 while keeping the number of transactions

fixed at 100K. All three approaches scale linearly with increasing transaction length.

Figure 4.8. Scale-up with increasing number of transactions


Figure 4.9. Scale-up with increasing transaction length

4.6.3 Impact of longer names

In these experiments we assumed that the tids and item-ids are all integers. Often in

practice these are character strings longer than four characters. Longer strings need more

storage and cost more during comparisons. This could hurt all four of the alternatives.

For the Stored-procedure, UDF and Cache-Mine approach the time taken to transfer data

will increase. The Intelligent Miner code [71] maps all character fields to integers using an

in-memory hash-table. Therefore, beyond the increase in the data transfer and mapping

costs (which accounts for the bulk of the time), we do not expect the processing time of

these three alternatives to increase. For the SQL approach we cannot assume an in-memory

hash-table for doing the mapping; therefore we use an alternative approach based on table

functions.

For the SQL approach we discuss the hybrid approach. The two (already expensive) steps

that could suffer because of longer names are (1) final group-bys during pass 2 or higher

when the GatherJoin approach is chosen and (2) tid-list operations when the Vertical

approach is chosen. For efficient performance, the first step requires a mapping of item-ids

and the second one requires us to map tids. We use a table function to map the tids to

unique integers efficiently in one pass and without making extra copies. The input to the

table function is the data table in the tid order. The table function remembers the previous

tid and maintains a counter. Every time the tid changes, the counter is incremented.

This counter value is the mapping assigned to each tid. We need to do the tid mapping only

once before creating the TidTable in the Vertical approach and therefore we can pipeline

these two steps. The item mapping is done slightly differently. After the first pass, we add

a column to F1 containing a unique integer for each item. We do the same for the TidTable.

The GatherJoin approach already joins the data table T with F1 before passing to table

function Gather. Therefore, we can pass to Gather the integer mappings of each item from

F1 instead of its original character representation. After these two transformations, the tid

and item fields are integers for all the remaining queries including candidate generation and

rule generation. By mapping the fields this way, we expect longer names to have similar

performance impact on all of our architectural options.
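The tid-mapping table function can be sketched as a streaming generator (a Python stand-in for the DB2 table function; it assumes, as stated above, that input rows arrive in tid order):

```python
def map_tids(rows):
    """Streaming stand-in for the tid-mapping table function.

    rows: iterable of (tid, item) pairs, already sorted by tid.
    Yields (tid_int, item), where tid_int is a dense integer assigned in
    order of first appearance, in one pass and without buffering the input.
    """
    prev_tid = None
    counter = 0
    for tid, item in rows:
        if tid != prev_tid:      # tid changed: advance the counter
            counter += 1
            prev_tid = tid
        yield (counter, item)

rows = [("cust-07", "a"), ("cust-07", "b"), ("cust-19", "a"), ("cust-42", "c")]
print(list(map_tids(rows)))
# [(1, 'a'), (1, 'b'), (2, 'a'), (3, 'c')]
```

Because the function only remembers the previous tid and a counter, its output can be pipelined directly into the creation of the TidTable, as described above.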

4.6.4 Space Overhead of Different Approaches

In Figure 4.10 we summarize the space required for different datasets for three options:

Stored-procedure, Cache-Mine and SQL. For these experiments we assume that the tids

and items are integers. The first part refers to the space used in caching data and the

second part refers to any temporary space used by the DBMS for sorting or alternately

for constructing indices to be used during sorting. The size of the data is the same as the

space utilization of the Stored-procedure approach. The space requirements for UDF are

the same as those for Stored-procedure, which requires less space than the Cache-Mine and

SQL approaches. The Cache-Mine and SQL approaches have comparable storage overheads.

For Stored-procedure and UDF we do not need any extra storage for caching. However,

all three options Cache-Mine, Stored-procedure and UDF require data in each pass to be

grouped on the tid. In a relational DBMS we cannot assume any order on the physical

layout of a table, unlike in a file system. Therefore, we need either an index on the data

table or need to sort the table every time to ensure a particular order. Let R denote the

total number of (tid,item) pairs in the data table. Either option has a space overhead of

2 x R integers. The Cache-Mine approach caches the data in an alternative binary format

where each tid is followed by all the items it contains. Thus, the size of the cached data

in Cache-Mine is at most: R + T integers where T is the number of transactions. For SQL

we use the hybrid Vertical option. This requires creation of an initial TidTable of size at

most I + R where I is the number of items. Note that this is slightly less than the cache

required by the Cache-Mine approach. The SQL approach needs to sort data in pass 1 in

all cases and pass 2 in some cases where we used the GatherJoin approach instead of the

Vertical approach. This explains the large space requirement for Dataset-B. However, in

practice when the item-ids or tids are character strings instead of integers, the extra space

needed by Cache-Mine and SQL is a much smaller fraction of the total data size because

before caching we always convert item-ids to their compact integer representation and store

in binary format.
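The space formulae above can be checked with a small worked example (the values of R, T and I below are illustrative, not from our datasets):

```python
# Worked example of the space-overhead formulae, in units of integers.
# R: (tid, item) pairs, T: transactions, I: distinct items -- illustrative values.
R, T, I = 1_000_000, 100_000, 10_000

sort_or_index = 2 * R     # index or sort overhead for any tid-grouped option
cache_mine = R + T        # binary cache: each tid followed by its items
sql_tidtable = I + R      # initial TidTable for the hybrid Vertical option

print(sort_or_index, cache_mine, sql_tidtable)
# 2000000 1100000 1010000

# As noted in the text, the TidTable is slightly smaller than the
# Cache-Mine cache whenever I < T.
assert sql_tidtable < cache_mine
```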

4.7 Summary of Comparison Between Different Architectures

In Table 4.1 we present a summary of the pros and cons of the different architectures

by ranking them on a scale of 1 (good) to 4 (bad) on each of the following yardsticks: (a)

performance (execution time); (b) storage overhead; (c) scope for automatic parallelization;

(d) development and maintenance ease; (e) portability; and (f) inter-operability.


Figure 4.10. Comparison of different architectures on space requirements.

Table 4.1. Pros and cons of different architectural options ranked on a scale of 1 (good) to 4 (bad)

Metric Stored-proc. UDF Cache-Mine SQL
Performance 4 3 1 2
Storage overhead 1 1 2 2-3
Automatic Parallelism 2 2 2 1(?)
Development and maintenance ease 2 3 2 1-2
Portability 1 3 1 2
Inter-operability 2 2 2 1(?)

In terms of performance, the Cache-Mine approach is the best option followed by the

SQL approach. The SQL approach was within a factor of 0.8 to 2 of the Cache-Mine ap-

proach for all of our experiments. The UDF approach is better than the Stored-procedure

approach in performance by 30 to 50% but it loses on the metrics of development and

maintenance costs and portability. In terms of space requirements, the Cache-Mine and

the SQL approach lose to the UDF or the Stored-procedure approach. Between the

Stored-procedure and the Cache-Mine implementation, the performance difference is ex-

actly a function of the number of passes made on the data; that is, if we make four passes

of the data the Stored-procedure approach is four times slower than Cache-Mine. There-

fore, if one is not willing to pay the penalty of extra storage the best strategy for improving

performance is to reduce the number of passes to the data even if it comes at the cost

of extra processing. Some of the recent proposals [26, 127] that attempt to minimize the

number of data passes to 2 or 3 might be useful in that regard.

The SQL approach is not the winner in terms of performance and space requirements

but it is competitive. The benefit of the SQL approach could arise from other secondary advantages.


The SQL implementation has the potential for automatic parallelization. Paralleliza-

tion could come for free because the bulk of our processing is expressed in terms of standard

SQL queries. As long as the database supports efficient parallelization of these queries the

mining code can be easily parallelized. The problem case is where the UDFs use scratch

pads. The only such function in our queries is the Gather table function. This function

essentially implements a user defined aggregate, and would have been easy to parallelize if

the DBMS provided support for user defined aggregates or allowed explicit control from the

application about how to partition the data amongst different parallel instances of the func-

tion. For MPPs, one could rely on the DBMS to come up with a data partitioning strategy.

However, it might be possible to better tune performance if the application could provide

hints about the best partitioning [11] to use. Further experiments are required to assess how

the performance of these automatic parallelizations would compare with algorithm-specific

parallelizations [11].

The development time and code size using SQL could be shorter if one can get efficient

implementations out of expressing the mining algorithms declaratively using a few SQL

statements. Thus, one can avoid writing and debugging code for memory management,

indexing and space management all of which are already provided in a database system.

However, there are some detractors to easy development using the SQL alternative. First,

any attached UDF code will be harder to debug than stand-alone C++ code due to lack

of debugging tools. Second, stand-alone code can be debugged and tested faster when run

against flat file data. Running against flat files is typically a factor of five to ten faster

compared to running against data stored in DBMS tables. Finally, some mining algorithms

(for instance, neural-net based) might be too awkward to express in SQL.

The ease of porting of the SQL alternative depends on the kind of SQL used. Within the

same DBMS, porting from one OS platform to another requires porting only the small UDF

code and hence is easy. In contrast the Stored-procedure and Cache-Mine alternatives

require porting a much larger body of code. Porting from one DBMS to another could be hard

for the SQL approach if non-standard, DBMS-specific features are used. Unfortunately, we found

SQL-92 implementations (which would have been quite portable) to be unacceptable from

the performance viewpoint. Our preferred SQL implementation relies on the availability

of DB2's table functions. Table functions in Oracle 8, for example, do not have the same

interface and semantics as DB2. Also, if different features have different performance

characteristics on different database systems, considerable tuning would be required. In

contrast, the Stored-procedure and Cache-Mine approaches are thus not tied to any

DBMS-specific features. The UDF implementation has the worst of both worlds; it is large and is

tied to a DBMS.

The biggest attraction of the SQL implementation is inter-operability and usage flex-

ibility. The ad hoc querying support provided by the DBMS enables flexible usage and

exposes potential for pipelining the input and output operators of the mining process with
