Citation
Architectures and optimizations for integrating data mining algorithms with database systems

Material Information

Title:
Architectures and optimizations for integrating data mining algorithms with database systems
Creator:
Thomas, Shiby, 1971-
Publication Date:
Language:
English
Physical Description:
xiv, 172 leaves : ill. ; 29 cm.

Subjects

Subjects / Keywords:
Algorithms ( jstor )
Cost analysis ( jstor )
Data collection ( jstor )
Database management systems ( jstor )
Databases ( jstor )
Datasets ( jstor )
Mining ( jstor )
SQL ( jstor )
Text analytics ( jstor )
Warehouses ( jstor )
Computer and Information Science and Engineering thesis, Ph.D ( lcsh )
Data mining ( lcsh )
Data warehousing ( lcsh )
Database management ( lcsh )
Dissertations, Academic -- Computer and Information Science and Engineering -- UF ( lcsh )
Genre:
bibliography ( marcgt )
non-fiction ( marcgt )

Notes

Thesis:
Thesis (Ph.D.)--University of Florida, 1998.
Bibliography:
Includes bibliographical references (leaves 162-171).
General Note:
Typescript.
General Note:
Vita.
Statement of Responsibility:
by Shiby Thomas.

Record Information

Source Institution:
University of Florida
Holding Location:
University of Florida
Rights Management:
Copyright [name of dissertation author]. Permission granted to the University of Florida to digitize, archive and distribute this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.
Resource Identifier:
030040414 ( ALEPH )
40941992 ( OCLC )



Full Text














ARCHITECTURES AND OPTIMIZATIONS FOR INTEGRATING
DATA MINING ALGORITHMS WITH DATABASE SYSTEMS









By

SHIBY THOMAS


A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY


UNIVERSITY OF FLORIDA


1998






































To
My parents, sisters and brothers



















ACKNOWLEDGMENTS


I would like to express my sincere gratitude to Sharma for his encouragement and

support throughout my dissertation and my stay with his group. We have had endless

arguments and discussions about various things. Especially during the group meetings,

this might have invited the wrath of several other students because, at times, it would

have even tested their patience. On the personal side he and his family have been my

good friends also. I am greatly indebted to Rakesh Agrawal and Sunita Sarawagi of IBM

Almaden Research Center for their help and for giving me an opportunity to work with their

group. The several discussions I had with them while I was at IBM and later through

e-mail were very useful. The work that I did with them contributed to a good part of

my dissertation. It was a nice experience working with the Quest data mining group at

Almaden and the enthusiasm and hard work of Sunita were contagious. I am also thankful

to Sanjay Ranka for his help and suggestions during my initial work on data mining. I

thank Professors Eric Hanson, Sartaj Sahni, Stanley Su and Suleyman Tufekci for being on

my committee and for their comments and suggestions.

I am grateful to many other people for helping me in several ways. In particular I

thank Raja, Sreenath, Mokhtar, Nabeel, Roger, and all my friends at the database center

for the chat sessions, lunch sessions, and so on. Many thanks to Sharon Grant for her help

with everything. She takes care of everything in the database center and her tireless spirit

and attention to even the minute details make things a whole lot easier. My sincere thanks

to S. Seshadri, my master's thesis advisor. My first exposure to database research was while

working with him and he is partially responsible for my pursuing further studies. I owe my

graduate studies to my parents, sisters and brothers for their continual reassurance.

This work was supported in part by the research grants of Sharma Chakravarthy

from the Office of Naval Research and the SPAWAR System Center San Diego, Rome

Laboratory, DARPA and the NSF grant IRI-9528390. During my first year Theodore

Johnson provided support through his research grants until he left the University. His

initial support helped me to come to the United States for graduate studies. The CISE

department also provided me with teaching assistantships in times of need. I gratefully

acknowledge all the support.
















TABLE OF CONTENTS


ACKNOWLEDGMENTS .......... iii

LIST OF FIGURES .......... ix

LIST OF TABLES .......... xii

ABSTRACT .......... xiii

CHAPTERS

1 INTRODUCTION .......... 1

   1.1 Data Warehousing .......... 1
   1.2 Data Mining .......... 3
      1.2.1 Association Rule .......... 4
      1.2.2 Sequential Patterns .......... 5
      1.2.3 Classification .......... 6
      1.2.4 Clustering .......... 7
   1.3 Mining Databases .......... 8
   1.4 Goal .......... 9
   1.5 Thesis Organization .......... 12

2 FROM FILE MINING TO DATABASE MINING .......... 13

   2.1 Related Work .......... 13
   2.2 Architectural Alternatives .......... 15
      2.2.1 Loose-Coupling .......... 15
      2.2.2 Cache-Mine .......... 16
      2.2.3 Stored-Procedure .......... 17
      2.2.4 User-Defined Function .......... 17
      2.2.5 SQL-Based Approach .......... 18
      2.2.6 Integrated Approach .......... 20
   2.3 Summary .......... 21

3 ASSOCIATION RULES .......... 22

   3.1 Apriori Algorithm .......... 23
   3.2 Other Algorithms .......... 24
   3.3 Input-Output Formats .......... 26
   3.4 Associations in SQL .......... 27
      3.4.1 Candidate Generation in SQL .......... 27
      3.4.2 Counting Support to Find Frequent Itemsets .......... 30
      3.4.3 Rule Generation .......... 30
   3.5 Support Counting Using SQL-92 .......... 32
      3.5.1 K-way Join .......... 32
      3.5.2 Three-way Join .......... 33
      3.5.3 Two Group-by .......... 34
      3.5.4 Subquery-Based .......... 35
   3.6 Performance Comparison .......... 35
   3.7 Cost Analysis .......... 39
      3.7.1 KWayJoin Plan with Ck as Outer Relation .......... 40
      3.7.2 KWayJoin Plan with Ck as Inner Relation .......... 42
      3.7.3 Effect of Subquery Optimization .......... 43
   3.8 Performance Optimizations .......... 45
      3.8.1 Pruning Non-Frequent Items .......... 46
      3.8.2 Eliminating Candidate Generation in Second Pass .......... 47
      3.8.3 Reusing the Item Combinations from Previous Pass .......... 48
      3.8.4 Set-Oriented Apriori .......... 48
   3.9 Performance Experiments with Set-Oriented Apriori .......... 53
      3.9.1 Scale-up Experiment .......... 55
   3.10 Summary .......... 58

4 SUPPORT COUNTING USING SQL WITH OBJECT-RELATIONAL EXTENSIONS .......... 60

   4.1 GatherJoin .......... 60
      4.1.1 Special Pass 2 Optimization .......... 61
      4.1.2 Variations of GatherJoin Approach .......... 61
      4.1.3 Cost Analysis of GatherJoin and its Variants .......... 64
   4.2 Vertical .......... 66
      4.2.1 Special Pass 2 Optimization .......... 67
      4.2.2 Cost Analysis .......... 69
   4.3 SQL-Bodied Functions .......... 70
   4.4 Performance Comparison .......... 71
   4.5 Final Hybrid Approach .......... 76
   4.6 Architecture Comparisons .......... 76
      4.6.1 Timing Comparison .......... 77
      4.6.2 Scale-up experiment .......... 80
      4.6.3 Impact of longer names .......... 81
      4.6.4 Space Overhead of Different Approaches .......... 82
   4.7 Summary of Comparison Between Different Architectures .......... 83
   4.8 Other Associations Algorithms .......... 87
   4.9 Summary .......... 88

5 GENERALIZED ASSOCIATION RULES .......... 89

   5.1 Input-Output Formats .......... 90
   5.2 Cumulate Algorithm .......... 91
   5.3 Pre-Computing Ancestors .......... 91
   5.4 Candidate Generation .......... 92
   5.5 Support Counting to Find Frequent Itemsets .......... 93
   5.6 Support Counting Using SQL-92 .......... 94
      5.6.1 K-way Join .......... 94
      5.6.2 Subquery Optimization .......... 96
   5.7 Support Counting Using SQL-OR .......... 96
      5.7.1 GatherJoin .......... 96
      5.7.2 GatherExtend .......... 97
      5.7.3 Cost Analysis .......... 98
      5.7.4 Vertical .......... 100
   5.8 Performance Results .......... 102
   5.9 Summary .......... 104

6 SEQUENTIAL PATTERNS .......... 105

   6.1 Input-Output Formats .......... 105
      6.1.1 Input Format .......... 105
      6.1.2 Output Format .......... 106
   6.2 GSP Algorithm .......... 106
   6.3 Candidate Generation .......... 107
   6.4 Support Counting to Find Frequent Sequences .......... 109
   6.5 Support Counting Using SQL-92 .......... 111
      6.5.1 K-way Join .......... 111
      6.5.2 Subquery Optimization .......... 112
   6.6 Support Counting Using SQL-OR .......... 112
      6.6.1 Vertical .......... 112
      6.6.2 GatherJoin .......... 117
   6.7 Taxonomies .......... 117
   6.8 Summary .......... 118

7 INCREMENTAL MINING .......... 119

   7.1 Incremental Updating of Frequent Itemsets .......... 121
      7.1.1 Computing NBd(F) from F .......... 121
      7.1.2 Addition of New Transactions .......... 123
      7.1.3 Deletion of Existing Transactions .......... 127
   7.2 Experimental Results .......... 128
   7.3 Comparison with FUP .......... 129
   7.4 Database Integration of Incremental Mining .......... 131
      7.4.1 SQL Formulations of Incremental Mining .......... 131
      7.4.2 Performance Results .......... 135
      7.4.3 New-Candidate Optimization .......... 138
      7.4.4 Other Approaches .......... 140
   7.5 Constrained Associations .......... 140
      7.5.1 Categories of Constraints .......... 141
      7.5.2 Constrained Association Mining .......... 144
      7.5.3 Incremental Constrained Association Mining .......... 146
      7.5.4 Constraint Relaxation .......... 147
   7.6 Applicability Beyond Association Mining .......... 148
      7.6.1 Mining Closed Sets .......... 148
      7.6.2 Query Flocks .......... 149
      7.6.3 View Maintenance .......... 151
   7.7 Summary .......... 152

8 CONCLUSIONS .......... 153

   8.1 Proposed Extensions .......... 156
      8.1.1 Richer Set Operations .......... 156
      8.1.2 Enhanced Aggregation .......... 157
      8.1.3 Multiple Streams .......... 158
      8.1.4 Sampling .......... 158
   8.2 Contributions .......... 159
   8.3 Future Work .......... 160
   8.4 Closing .......... 160

REFERENCES .......... 162

BIOGRAPHICAL SKETCH .......... 172

















LIST OF FIGURES





1.1 Data warehousing architecture .......... 2
1.2 Credit card classification example .......... 7
1.3 Typical data warehouse usage .......... 10

2.1 Taxonomy of architectural alternatives .......... 16
2.2 Loose-coupling architecture .......... 16
2.3 Cache-mine architecture .......... 17
2.4 Stored-procedure architecture .......... 17
2.5 UDF-based mining architecture .......... 18
2.6 SQL architecture for mining in a DBMS .......... 19
2.7 Architecture for mining in next-generation DBMSs .......... 21

3.1 Apriori algorithm .......... 23
3.2 Candidate generation for any k .......... 29
3.3 Candidate generation for k = 4 .......... 29
3.4 Rule generation .......... 32
3.5 Support counting by K-way join .......... 33
3.6 Support counting using subqueries .......... 36
3.7 Comparison of different SQL-92 approaches .......... 38
3.8 K-way join plan with Ck as inner relation .......... 40
3.9 K-way join plan with Ck as outer relation .......... 40
3.10 Number of candidate itemsets vs distinct item prefixes .......... 44
3.11 Reduction in transaction table size by non-frequent item pruning .......... 47
3.12 Benefit of second pass optimization .......... 49
3.13 Generation of Tk .......... 50
3.14 Benefit of reusing item combinations .......... 52
3.15 Space requirements of the set-oriented apriori approach .......... 53
3.16 Comparison of Subquery and Set-oriented Apriori approaches .......... 54
3.17 Comparison of CPU and I/O times .......... 56
3.18 Number of transactions scale-up .......... 57
3.19 Transaction size scale-up .......... 58

4.1 Support counting by GatherJoin .......... 62
4.2 Support Counting by GatherJoin in the second pass .......... 63
4.3 Tid-list creation .......... 66
4.4 Support counting using UDF .......... 68
4.5 Comparison of four SQL-OR approaches: Vertical, GatherPrune, GatherJoin and GatherCount .......... 72
4.6 Effect of increasing transaction length (average number of items per transaction) .......... 74
4.7 Comparison of four architectures .......... 78
4.8 Scale-up with increasing number of transactions .......... 80
4.9 Scale-up with increasing transaction length .......... 81
4.10 Comparison of different architectures on space requirements .......... 84

5.1 Example of a taxonomy .......... 89
5.2 Pre-computing ancestors .......... 92
5.3 Generation of C2 .......... 93
5.4 Transaction extension subquery .......... 94
5.5 Support counting by K-way join .......... 95
5.6 Support counting by GatherJoin .......... 97
5.7 Support counting by GatherExtend .......... 99
5.8 Interior nodes' tid-list generation by union .......... 101
5.9 Interior nodes' tid-list generation from T* .......... 102
5.10 Comparison of different SQL approaches .......... 103

6.1 Candidate generation for any k .......... 110
6.2 Candidate generation for k = 4 .......... 110
6.3 Support counting by K-way join .......... 113
6.4 Subquery optimization for KwayJoin approach .......... 114
6.5 Support counting by Vertical .......... 116

7.1 A high-level description of the apriori-gen function .......... 122
7.2 A high-level description of the negativeborder-gen function .......... 122
7.3 A high-level description of the Update-Frequent-Itemset function .......... 126
7.4 Speed-up of the incremental algorithm .......... 129
7.5 Support counting using subqueries .......... 133
7.6 Speed up of the incremental algorithm based on the Subquery approach .......... 136
7.7 Speed up of the incremental algorithm based on the Vertical approach .......... 136
7.8 Speed up of the incremental algorithm based on the Subquery approach with the new-candidate optimization .......... 139
7.9 Speed up of the incremental algorithm based on the Vertical approach with the new-candidate optimization .......... 139
7.10 Point of sales data model .......... 141
7.11 Framework for constrained association mining .......... 144
7.12 Point-of-sales example for constrained association mining .......... 145



















LIST OF TABLES





3.1 Description of different real-life datasets .......... 37
3.2 Notations used in cost analysis .......... 41
3.3 Description of synthetic datasets .......... 45

4.1 Pros and cons of different architectural options ranked on a scale of 1 (good) to 4 (bad) .......... 84

5.1 An example of the taxonomy table .......... 90
5.2 Additional notations used for cost analysis .......... 98















Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy



ARCHITECTURES AND OPTIMIZATIONS FOR INTEGRATING
DATA MINING ALGORITHMS WITH DATABASE SYSTEMS

By

Shiby Thomas

December, 1998


Chairman: Dr. Sharma Chakravarthy
Major Department: Computer and Information Science and Engineering


Data mining on large data warehouses is becoming increasingly important. In support

of this trend, we consider a spectrum of architectural alternatives for integrating mining

with database systems. These alternatives include loose-coupling through a SQL cursor

interface; encapsulation of the mining algorithm in a stored procedure; caching the data to

a file system on-the-fly and mining; tight-coupling using primarily user-defined functions;

and SQL implementations for processing in the DBMS. First, we comprehensively study

the option of expressing the association rule mining algorithm in the form of SQL queries.

We consider four options in SQL-92 and six options in SQL enhanced with object-relational

extensions (SQL-OR). Our evaluation of the different architectural alternatives shows that

from a performance perspective, the Cache-Mine option is superior, although the SQL-OR

option comes a close second. Both the Cache-Mine and the SQL-OR approaches

incur a higher storage penalty than the loose-coupling approach, which performance-wise

is a factor of 3 to 4 worse than Cache-Mine. We also compare these alternatives on the

basis of qualitative factors like automatic parallelization, development ease, portability and

interoperability.

We further analyze the SQL-92 approaches with the twin goals of studying how best

a DBMS without any object-relational extensions can execute these queries and identifying

ways of incorporating the semantics of mining into cost-based query optimizers. We develop

cost formulae for the mining queries based on the input data parameters and relational

operator costs. We also identify certain optimizations which improve the performance.

Next, we study generalized association rule and sequential pattern mining and develop

SQL formulations for them, thereby demonstrating that more complex mining operations

can be handled in the SQL framework.

We develop an incremental association rule mining algorithm which does not need

to examine the old data if the frequent itemsets do not change. Even otherwise, access

to the old database can be limited to just one scan. We categorize the various kinds of

constraints on the items that are useful in the context of interactive mining to facilitate

goal-oriented mining. We show how the incremental mining technique can be adapted to handle

constraints and certain kinds of constraint relaxation. We also show the applicability of

the incremental algorithm to other classes of data mining and decision support problems.

Finally, we identify certain primitive operators that are useful for a large class of data

mining and decision support applications. Supporting them natively in the DBMS could

enable these applications to run faster.


















CHAPTER 1
INTRODUCTION



The rapid growth in data warehousing technology combined with the significant drop

in storage prices has made it possible to collect large volumes of data about customer

transactions in retail stores, mail order companies, banks, stock markets, telecommunication

companies and so on. For example, AT&T call records are about 1 gigabyte per

hour [73] and supermarket chains like WalMart collect terabytes of data. In order to

transform these huge amounts of data into business competitiveness and profits, it is

extremely important to be able to mine nuggets of useful and understandable information

from these data warehouses. In this chapter, we introduce data warehousing and data mining

technologies in Sections 1.1 and 1.2 respectively and in Section 1.3, we motivate the

need for coupling the two, which is the focus of this dissertation. In Section 1.4, we discuss

and list the specific problems addressed in this dissertation and in Section 1.5, we outline

the dissertation organization.


1.1 Data Warehousing

A data warehouse is simply a single, complete and consistent store of data obtained

from a variety of sources and made available to end users in a way they can understand

and use. Achieving completeness and consistency of data in today's information systems

environment is far from simple. The first problem is to discover how completeness and

consistency can be defined. In the business context, this entails understanding the business

strategies and the data required to support and track their achievement. This process,

called enterprise modeling, requires substantial involvement of business users and is traditionally

a long-term process. A data warehouse consists of several components and tools

which are depicted in Figure 1.1 [34].

[Figure 1.1: diagram not reproduced. Labels recoverable from the original: Enterprise Modeling; Monitoring & Administration tools; Metadata Repository; Business Information Guide; Data Sources (Operational Systems, External Sources); Load / Refresh; Data Warehouse; Data Marts; Business Information Interface.]

Figure 1.1. Data warehousing architecture


Knowing what data are required is just the first step. These data exist today in

various sources on different platforms, and must be copied from these sources for use in the

warehouse. They must be combined according to the enterprise model, even though they were

not originally designed to support such integration. They must be cleansed of structural and

content errors. This step, typically known as data warehouse population, requires tools

for extracting data from multiple operational databases and external sources; for cleaning,

transforming and integrating these data; and for loading data into the data warehouse.

In addition to the main warehouse, there may be several departmental data marts. Population

also requires tools for periodically refreshing the warehouse to maintain consistency and for

purging data from the warehouse, perhaps onto slower archival storage.










In order to understand and profitably use the data in a business context, they must

be transformed into information. The warehouse data are stored and managed by one or

more warehouse servers, which present multidimensional views of the data to a variety of

front end tools: query tools, report writers, analysis tools and data mining tools. There is

also a repository which stores metadata derived using the enterprise model, and tools for

monitoring and administering the warehousing system.

The warehouse may be distributed for load balancing, scalability and higher availability.

In such a distributed architecture, the metadata repository is usually replicated with

each fragment of the warehouse, and the entire warehouse is administered centrally. An

alternative architecture is a federation of data warehouses or data marts, each with its own

repository and decentralized administration.

1.2 Data Mining

Data mining, also referred to as knowledge discovery in databases, is the process of

extracting implicit, understandable, previously unknown and potentially useful information

from data. In other words, data mining is the act of drilling through huge volumes of data in

order to discover relationships, or to answer specific questions that are too broad in nature

for traditional query tools. Data mining is also projected as the next step beyond online

analytical processing (OLAP) for querying data warehouses. Rather than seek out known

relationships, it sifts through data for unknown relationships. For instance, consider the

transaction data in a mail-order company stored in the following relations: sales(customer,

widget, state, year), booster(customer, widget, driver), catalog(widget, manufacturer). The

sale information is stored in the sales table and the catalog table stores the widgets of

different manufacturers. The booster table stores the driver that influences the particular

sale. In this database, OLAP finds answers to queries of the form "How many widgets were

sold in the first quarter of 1998 in California vs. Florida?" However, data mining attempts

to answer queries like "What are the drivers that caused people to buy these widgets from

our catalog?" Fundamentally, data mining is statistical analysis and has been in practice for

a long time. But, until recently, statistical analysis was a time-consuming, manual process

which limited the amount of data that could be analyzed and the accuracy depended heavily

on the personnel involved in the analysis. Today, with the advent of various sophisticated

technologies, tools exist that automate the process, making data mining a practical solution

for a wide range of companies. For example, Fingerhut's (a direct-mail catalog company)

statistical analysis was limited to taking samples of 10 to 20 percent of its customers. With

data mining, it can examine 300 specific characteristics of each of the 10 to 12 million

customers in a much more focused way [15].

The initial efforts in data mining research brought together techniques from machine

learning and statistics [24, 28, 29, 37, 39, 40, 60, 65, 81, 90, 95, 102, 103, 104] to define

new mining operations and develop algorithms for them [5, 7, 19, 77]. In the remainder of

this section, we briefly introduce the various data mining problems.

1.2.1 Association Rule

An association rule captures the co-occurrence of items or events and was introduced in

the context of market basket data [7]. An example of such a rule might be that 60% of

transactions containing beer also contain diapers and 2% of transactions contain both these

items. Here 60% is the confidence and 2% is the support of the rule $\text{beer} \Rightarrow \text{diaper}$.

Association rule mining is stated formally as follows [13]. Let $I = \{i_1, i_2, \ldots, i_m\}$ be

a set of literals, called items. Let $D$ be a set of transactions, where each transaction $T$ is

a set of items such that $T \subseteq I$. Each transaction has a unique identifier, called its tid.

An association rule is an implication of the form $X \Rightarrow Y$, where $X \subset I$, $Y \subset I$, and

$X \cap Y = \emptyset$. The rule $X \Rightarrow Y$ holds in the transaction set $D$ with confidence $c$ if $c\%$ of

transactions in $D$ that contain $X$ also contain $Y$. The rule $X \Rightarrow Y$ has support $s$ in the

transaction set $D$ if $s\%$ of transactions in $D$ contain $X \cup Y$. Given a set of transactions $D$,

the problem of mining association rules is to generate all association rules that have support

and confidence greater than the user-specified minimum support and minimum confidence.
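
The definitions of support and confidence can be made concrete with a small sketch. The Python code below is an illustration only, not the implementation used in this dissertation; the function names and the five sample transactions are hypothetical. It computes both measures as fractions rather than percentages.

```python
from typing import Iterable, List, Set


def support(transactions: List[Set[str]], itemset: Iterable[str]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    wanted = set(itemset)
    hits = sum(1 for t in transactions if wanted <= t)
    return hits / len(transactions)


def confidence(transactions: List[Set[str]],
               x: Iterable[str], y: Iterable[str]) -> float:
    """Fraction of the transactions containing X that also contain Y.

    Assumes X occurs in at least one transaction; otherwise the
    division below raises ZeroDivisionError.
    """
    return support(transactions, set(x) | set(y)) / support(transactions, x)


# Hypothetical market-basket data: one set of items per transaction.
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"beer", "milk"},
    {"milk", "bread"},
    {"diapers", "bread"},
]

# Rule beer => diapers: 2 of the 5 transactions contain both items,
# and 2 of the 3 transactions containing beer also contain diapers.
rule_support = support(transactions, {"beer", "diapers"})
rule_confidence = confidence(transactions, {"beer"}, {"diapers"})
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

A rule would be reported only if both values exceed the user-specified minimum support and minimum confidence.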

The association rule mining problem has attracted tremendous attention from data

mining researchers and several algorithms have been proposed for it [13, 26, 66, 111, 131].

Researchers have also proposed several variants of the basic association rule mining to

handle taxonomies over items [63, 118], numeric attributes [76, 92, 119] and constraints on

items appearing in rules [97, 121].

1.2.2 Sequential Patterns

Sequential pattern mining was first introduced in Agrawal and Srikant [14] and further

generalized in Srikant and Agrawal [120]. Given a set of data-sequences each of which is a list

of transactions ordered by the transaction time, the problem of mining sequential patterns is

to discover all sequences with a user-specified minimum support. Each transaction contains

a set of items. A sequential pattern is an ordered list (sequence) of itemsets. The itemsets

that are contained in the sequence are called elements of the sequence. For example,

((computer, modem)(printer)) is a sequence with two elements: (computer, modem) and

(printer). The support of a sequential pattern is the percentage of data-sequences that

contain the sequence. A sequential pattern can be further qualified by specifying maximum

and/or minimum time gaps between adjacent elements and a sliding time window within

which items are considered part of the same sequence element. These time constraints are

specified by three parameters, max-gap, min-gap and window-size. We refer the reader to

Srikant and Agrawal [120] for a formal definition of the problem.
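
As an illustration of the support definition above, the following Python sketch tests whether a data-sequence contains a sequential pattern and computes the pattern's support. It is a simplified reading of the problem: the max-gap, min-gap and window-size constraints are ignored, and the function names and sample data are hypothetical.

```python
from typing import List, Set

Sequence = List[Set[str]]  # transactions (or pattern elements) in time order


def contains(data_sequence: Sequence, pattern: Sequence) -> bool:
    """True if each pattern element is a subset of a distinct transaction,
    with the matched transactions appearing in the same order as the elements.
    Matching each element to the earliest possible transaction is sufficient
    when there are no time constraints."""
    pos = 0
    for element in pattern:
        while pos < len(data_sequence) and not element <= data_sequence[pos]:
            pos += 1
        if pos == len(data_sequence):
            return False
        pos += 1  # the next element must match a later transaction
    return True


def sequence_support(data_sequences: List[Sequence], pattern: Sequence) -> float:
    """Fraction of data-sequences that contain the pattern."""
    hits = sum(1 for s in data_sequences if contains(s, pattern))
    return hits / len(data_sequences)


# Hypothetical data-sequences, one per customer.
customers = [
    [{"computer", "modem"}, {"printer"}],
    [{"computer"}, {"modem"}, {"printer"}],  # computer and modem bought separately
    [{"computer", "modem", "mouse"}, {"paper"}, {"printer", "ink"}],
    [{"printer"}, {"computer", "modem"}],    # elements appear in the wrong order
]
pattern = [{"computer", "modem"}, {"printer"}]

print(sequence_support(customers, pattern))
```

Only the first and third customers contain the pattern, so its support in this toy dataset is two out of four data-sequences.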

1.2.3 Classification

Classification is a well-studied problem [91, 129]. However, only recently has there been

focus on algorithms that can handle large datasets [17, 53, 85, 106, 114]. Classification is

the process of generating a description or a model for each class of a given dataset, called

the training set. Each record of the training set consists of several attributes which could

be continuous (coming from an ordered domain), or categorical (coming from an unordered

domain). One of the attributes, called the classifying attribute, indicates the class to which

each record belongs. Once a model is built from the given examples, it can be used to

determine the class of future unclassified records.

Several classification models based on neural networks, statistical models like linear/quadratic

discriminants, decision trees and genetic models have been proposed over the

years. Decision trees are particularly suited for data mining since they can be constructed

relatively fast compared to other methods and they are simple and easy to understand.

Moreover, trees can be easily converted into SQL statements that can be used to access

databases efficiently [5].

A decision tree is a class discriminator that recursively partitions the training set

until each partition consists entirely or dominantly of examples from one class. Each non-leaf

node of the tree contains a split point which is a test on one or more attributes and

determines how the data is partitioned. Figure 1.2 shows a sample decision-tree classifier

and the training set from which it is derived. Each record represents a credit card applicant

and we are interested in building a model that categorizes the applicants into high and low

risk classes.

Training set:

Age   Salary   Married   Risk
21    30,000   No        High
18    28,000   Yes       High
30    50,000   Yes       Low
35    20,000   Yes       High
23    60,000   No        High
47    80,000   Yes       Low

Decision tree:

Age < 25
   Yes: High
   No:  Married
           No:  High
           Yes: Salary < 25,000
                   Yes: High
                   No:  Low

Figure 1.2. Credit card classification example
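
To illustrate how such a tree acts as a class discriminator, the sketch below encodes a decision tree as nested tests and applies it to the six training records of Figure 1.2. The split order used here (age, then marital status, then salary) is a reading of the figure and should be treated as an assumption; the function name `classify` is hypothetical.

```python
def classify(age: int, salary: int, married: bool) -> str:
    """Walk the tree from the root; each non-leaf node tests one attribute."""
    if age < 25:
        return "High"
    if not married:
        return "High"
    if salary < 25_000:
        return "High"
    return "Low"


# Training set from Figure 1.2: (age, salary, married, risk).
training_set = [
    (21, 30_000, False, "High"),
    (18, 28_000, True, "High"),
    (30, 50_000, True, "Low"),
    (35, 20_000, True, "High"),
    (23, 60_000, False, "High"),
    (47, 80_000, True, "Low"),
]

# Every leaf reached by a training record holds records of a single class.
for age, salary, married, risk in training_set:
    assert classify(age, salary, married) == risk

# The model can then label a future, unclassified applicant.
print(classify(age=40, salary=90_000, married=True))
```

Because each root-to-leaf path is a conjunction of attribute tests, a tree of this form translates directly into SQL predicates, which is one reason decision trees suit database-resident data.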



1.2.4 Clustering

The fundamental clustering problem is that of grouping together (clustering) similar

data items and is useful for discovering interesting distributions and patterns in the underlying

data. Clustering has been formulated in various ways in the machine learning [48],

pattern recognition [51], optimization [112] and statistics literature [20]. The problem of

clustering can be defined as follows: given n data points in a d-dimensional metric space,

partition the data points into k clusters such that the data points within a cluster are more

similar to each other than data points in different clusters. The classic K-Means clustering

algorithm starts with estimated initial values for the k cluster centroids. Each of the data

points is assigned to the cluster with the nearest centroid. After the assignment, each centroid

is refined to the mean of the data points in its cluster. This process is repeated

several times until an acceptable convergence is reached. There are several research efforts

reported in the data mining literature for clustering large databases [23, 42, 57, 132].
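
The K-Means procedure described above can be sketched in a few lines of Python. This is a minimal illustration that assumes Euclidean distance and caller-supplied initial centroids; the function name, the convergence tolerance and the sample points are hypothetical, and the code makes no attempt at the scalability addressed by the cited work.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def kmeans(points: Sequence[Point], centroids: Sequence[Point],
           max_iter: int = 100, tol: float = 1e-9):
    """Alternate assignment and refinement until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    clusters: List[List[Point]] = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Refinement step: move each centroid to the mean of its members.
        # An empty cluster keeps its previous centroid.
        new_centroids = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else old
            for old, members in zip(centroids, clusters)
        ]
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tol:  # acceptable convergence reached
            break
    return centroids, clusters


# Hypothetical 2-dimensional data with two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.0), (8.0, 9.0)]
final_centroids, final_clusters = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(final_centroids)
```

The result depends on the initial centroids, which is why the algorithm is usually described as starting from estimated initial values.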









Research on time series analysis [10, 43, 75], similarity search [3, 8, 21, 72, 115],

high dimensional data analysis [4, 22, 79], and text data management [30, 31] can also be

classified under the broad category of data mining.


1.3 Mining Databases

The initial research efforts on data mining were aimed at defining new mining problems

and a majority of the algorithms for them were developed for data stored in file systems.

Each had its own specialized data structures, buffer management strategies and so on [8, 13,

14, 22, 49, 50, 62, 63, 74, 84, 85, 86, 111, 115, 119, 121]. In cases where the data were stored in

a DBMS, data access was provided through an ODBC or SQL cursor interface [2, 64, 68, 71].

Data mining tools are being used in several application domains [16, 18, 27, 38, 41, 46, 52,

70, 82, 96, 101, 108, 130]. Coupling the data mining tools with a growing base of accessible

enterprise data, often in the form of a data warehouse, puts a tool with immense implications

at the disposal of business institutions. According to the vice president of Mellon

Bank's advanced technology group, "Data mining is the carrot that justifies the expensive

stick of building a data warehouse."

The majority of the warehouse stores (systems used for storing warehouse data) are

relational databases or their variants. The advantages of using database systems are numer-

ous: SQL was invented for direct query of data, most client/server connectivity is supplied

by relational vendors, most replication systems have been designed with relational sources

and targets and most of the relational vendors are delivering parallel database solutions.

There are important alternatives in this segment, however. The OLAP multidimensional en-

gines offer unique performance characteristics across their problem domain. We might also

see traditional file stores providing significantly better performance for some data mining









operations. Differentiators for the relational engines will be overall scalability, availability

across a broad spectrum of hardware platforms, "affinity" with legacy data platforms and

stores, availability of extensions for image, text and so on to support the next generation

of applications, and availability of supporting tools.

The investment in building and managing a data warehouse is enormous. With the

advent of business intelligence and decision support systems, it is imperative that the

data warehouse support multiple applications. Figure 1.3 illustrates how a data warehouse

will typically be used in a business organization. The spectrum of applications includes

basic querying and reporting tools, OLAP tools, multidimensional analysis tools and data

mining tools. There are several excellent and popular query/report writers. There are also

tools that support multidimensional analysis directly on relational data stores without the

separate OLAP engine. Note that these tools are like tools in a tool box; that is, they can

be used in combination to produce the desired result. There is no single class of tools that

satisfies the broad range of decision support and business intelligence system requirements.

Therefore, it is crucial that the data mining tools integrate with relational data warehouses

much like the query/report and OLAP tools.


1.4 Goal

The goal of this dissertation is to explore the various issues of database/data ware-

house integration of mining operations. We first study the various architectural alternatives

for coupling data mining with relational database systems, primarily from a performance

perspective. We develop various SQL formulations for association rules [7], a representative

mining problem, and analyze how competitive the SQL implementation can be compared

to other specialized implementations of association rule mining. We further focus on the










[Figure 1.3 labels: Business Intelligence tools (Query/Reporting, OLAP, Data Mining); Data Warehouse; Business Applications; Business Information; Data Marts]


Figure 1.3. Typical data warehouse usage


analysis of various execution plans chosen by relational database systems for executing some

of the SQL-based mining queries. We expect that this study will reveal the domain-specific

semantic information of the mining algorithms that needs to be integrated into next-generation

query optimizers to handle mining computations efficiently. We also develop SQL

formulations for a few other mining operations, namely, generalized association rules [118]

and sequential patterns [14, 120]. We also propose a few primitive database operators that

are useful for mining and other decision support applications. These operators, if natively

supported in a database system, could potentially speed up the execution of mining queries.

Data warehouses on which the mining tools operate typically are populated incremen-

tally. In order to improve the reliability and usefulness of the discovered information, large

volumes of data need to be collected and analyzed over a period of time. A naive approach

to update the mined information, when new data are added or part of the current data

are deleted, is to recompute it from scratch. However, it would be ideal to develop an

incremental algorithm so that the computation effort spent on the original data can be









effectively utilized. We develop an incremental algorithm and its SQL formulations for up-

dating the association rules, based on the negative border concept. It is based on the closure

property of frequent itemsets, that is, all subsets of a frequent itemset are also frequent.

We show that the incremental mining algorithm can be generalized to handle various kinds

of constraints. We categorize the constraints based on their usage in the mining process

and develop a general framework to process them. We further show that the incremental

technique is applicable to other classes of mining and decision support problems such as

sequential patterns, correlation rules, query flocks and so on.

We summarize and list the goals of this dissertation below:


Survey the various data mining problems and algorithms,


Study the different database integration alternatives for data mining,


Develop and implement SQL formulations of mining algorithms,


Analyze the performance profile of current DBMSs to execute the above SQL queries,


Explore the enhancements to current cost-based optimizers to incorporate the domain-

specific semantics of mining,


Develop an incremental association rule mining algorithm and its SQL formulations,


Generalize the incremental algorithm for mining constrained associations and show

its applicability to other data mining and decision support problems, and


Explore primitive operators for mining in databases.


This work has a significant impact on the state-of-the-art in data mining system archi-

tectures and comes at the appropriate time when the data mining community is looking for










answers to "How to mine data warehouses?" Given the amount of data involved in mining,

its potential impact on various business sectors and the fact that OLAP is finding its way

into commercial database systems (for example, the cube operator), it is only a matter of

time before mining becomes an integral part of database systems. We believe that this work

is a small but strong step in the right direction. This will also have a significant impact on

query optimization and parallel query processing techniques.

1.5 Thesis Organization

The rest of this dissertation is organized as follows: We discuss the various architectural

alternatives for integrating mining with database systems/data warehouses in Chapter 2.

The various SQL formulations of association rules, their performance profiles and optimiza-

tions for improving the performance are detailed in Chapter 3. In Chapter 4, we present

the use of object-relational extensions to SQL for improving the performance of SQL-based

association rule mining and the performance comparison of the various architectural alternatives.

SQL-based mining of generalized association rules and sequential patterns is

described in Chapters 5 and 6 respectively. Chapter 7 presents the incremental associa-

tion rule mining algorithm, performance comparison, SQL formulation and generalizations

for mining constrained associations. We conclude in Chapter 8 with a discussion of the

proposed database operators and avenues for further research.


















CHAPTER 2
FROM FILE MINING TO DATABASE MINING



The "first-generation" KDD systems offer isolated discovery features using tree induc-

ers, neural nets and rule discovery algorithms. Such systems cannot be embedded into a

large application and typically offer just one knowledge discovery feature. The current state

of data mining systems is very similar to that of the days in which database applications were

written in COBOL with just read and write commands as the interface to data stored in

large files. The advent of relational database systems which offered SQL for ad hoc queries

and various relational APIs (application programming interfaces) for application program-

ming made database applications much easier to develop and manage. Data mining has to

undergo a similar transition from the current "file mining" to data warehouse mining and

a richer set of APIs for developing business intelligence and decision support applications.

In the remainder of this chapter, we survey some of the prior research related to the

database integration of mining in Section 2.1. The various architectural alternatives are

discussed in Section 2.2.


2.1 Related Work

Researchers have started to focus on various issues related to integrating mining with

databases [6, 67, 68]. The research on database integration of mining can be broadly classified

into two categories: one that proposes new mining operators and another that










leverages the query processing capabilities of current relational DBMSs. In the former cate-

gory, there have been language proposals to extend SQL with specialized mining operators.

A few examples are (i) the query language DMQL proposed by Han et al. [64] which ex-

tends SQL with a collection of operators for mining characteristic rules, discriminant rules,

classification rules, association rules and so on, (ii) the M-SQL language of Imielinski and

Virmani [69] which extends SQL with a special unified operator Mine to generate and

query a whole set of propositional rules, and (iii) the mine rule operator proposed by Meo

et al. [89] for a generalized version of the association rule discovery problem. However, they

do not address the processing techniques for these operators inside a database engine and

the interaction of the standard relational operators and the proposed extensions. It is also

important to break these operators down to a finer level of granularity in order to identify commonalities

between them and derive a set of primitive operators that should be supported

natively in a database engine.

In the second category, researchers have addressed the issue of exploiting the capa-

bilities of conventional relational systems and their object-relational extensions to execute

mining operations. This entails transforming the mining operations into database queries

and in some cases developing newer techniques that are more appropriate in the database

context. The proposal of Agrawal and Shim [12] for tightly coupling a mining algorithm

with a relational database system makes use of user-defined functions (UDFs) in SQL

statements to selectively push parts of the application that perform computations on data

records into the database system. The objective was to avoid one-at-a-time record retrieval

from the database to the application address space, saving both the copying and process

context switching costs. In the KESO project [116], the focus is on developing a data

mining system which interacts with standard DBMSs. The interaction with the database










is restricted to two-way table queries, a special kind of aggregate query. Two-way tables,

which are used in the mining process, have sets of source and target attributes and an

associated count. Association rule mining was formulated as SQL queries in the SETM

algorithm [66]. However, it does not use the subset property (all subsets of a frequent

itemset are frequent) for candidate generation. As a result, SETM counts a large number

of candidate itemsets in the support counting phase and hence is not efficient [9]. Query

flocks generalize boolean association rules to mine associations across relational tables. A

query flock is a parameterized query with a filter condition to eliminate the values of parameters

that are "uninteresting". Tsur et al. [128] present the use of query flocks for mining

and emphasize the need for incorporating the a-priori technique into new-generation query

optimizers to handle mining queries efficiently.

2.2 Architectural Alternatives

The various architectural alternatives for integrating mining with relational systems,

proposed in [109], can be categorized as shown in Figure 2.1. It shows a taxonomy of

various alternatives from loose-coupling to an integrated approach. In the remainder of

this section, we describe each of the alternatives.

2.2.1 Loose-Coupling

This is an example of integrating mining applications into the client in a client/server

architecture or into the application server in a multi-tier architecture. The mining kernel

can be considered as the application server. In this approach, data are read tuple by tuple

from the DBMS to the mining kernel using a cursor interface. The intermediate and the

final results are stored back into the DBMS. The data are never copied on to a file system.

Instead, the DBMS is used as a file system. This is the approach followed by most existing










Loose <---------------------- Integration ----------------------> Tight

Category                                        Alternatives
Mining as application on client/app. server     Loose coupling; Cache-Mine
Mining as application on database server        Stored procedure; User-defined function
Mining using SQL + extensions                   SQL-based approach; Mining extenders/blades
Integrated approach                             Integrated with SQL query engine

Figure 2.1. Taxonomy of architectural alternatives

mining systems. A potential problem with this approach is the high context switch costs

between the DBMS and the mining kernel processes since they run in different address

spaces. In spite of the block-read optimization present in many systems (for example,

Oracle [98], DB2 [32]) where a block of tuples is read at a time instead of reading a single

tuple at a time, the performance could suffer. This architecture is outlined in Figure 2.2.

[Figure 2.2 labels: GUI or Mining Language; Mining Request; Mining Kernel; Data; Results]





Figure 2.2. Loose-coupling architecture
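
The loose-coupling data flow can be illustrated with a small sketch: the mining code runs outside the DBMS, pulls tuples through a cursor, counts in application memory, and stores the result back as a table. The example uses Python's built-in sqlite3 module purely as a stand-in; since SQLite is an embedded library, it does not exhibit the cross-process context switches discussed above. The table names T and F1, the sample rows, and the function names are hypothetical, and the support threshold is treated as a strict lower bound on the occurrence count.

```python
import sqlite3
from collections import Counter


def make_demo_db():
    """Build a tiny in-memory transaction table T(tid, item) with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE T (tid INTEGER, item TEXT)")
    rows = [(1, "A"), (1, "B"), (2, "A"), (2, "C"), (3, "A"), (3, "B")]
    conn.executemany("INSERT INTO T VALUES (?, ?)", rows)
    return conn


def frequent_items_loosely_coupled(conn, minsup):
    """Count item occurrences in the application, one fetched tuple at a time."""
    counts = Counter()
    for _tid, item in conn.execute("SELECT tid, item FROM T"):
        counts[item] += 1  # all mining logic lives outside the DBMS
    frequent = sorted((item, n) for item, n in counts.items() if n > minsup)
    # The final result is stored back into the DBMS.
    conn.execute("CREATE TABLE F1 (item TEXT, support INTEGER)")
    conn.executemany("INSERT INTO F1 VALUES (?, ?)", frequent)
    return frequent
```

The DBMS contributes nothing here beyond a table scan, which is the sense in which it is used as a file system.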


2.2.2 Cache-Mine

This is a special case of the loose-coupling approach where the mining algorithm reads

data from the DBMS only once and caches the relevant data in flat files on local disk for

future references. The data could be transformed and cached in a format that enables more

efficient access in the future. The mined results, first generated as flat files, are imported










into the DBMS. This method has all the advantages of the stored procedure approach

(described below) plus it promises to have better performance. The disadvantage is that it

requires additional disk space for caching. This architecture is outlined in Figure 2.3.

[Figure 2.3 labels: GUI or Mining Language; Mining Request; Mining Kernel; Data / result; File System]


Figure 2.3. Cache-mine architecture


2.2.3 Stored-Procedure


This architecture is representative of the case where the mining logic is embedded as

applications on the database server. In this approach, shown in Figure 2.4, the mining

algorithm is encapsulated as a stored procedure [32] that is executed in the same address

space as the DBMS. The main advantage of this as well as the loose-coupling and cache-mine

approaches is greater programming flexibility. Also, any existing file system code can

be easily transformed to work on data stored in the DBMS.


[Figure 2.4 labels: GUI; Mining Language; Stored-procedure invocation; DBMS; Stored-procedures for mining]




Figure 2.4. Stored-procedure architecture


2.2.4 User-Defined Function


This approach is another variant of embedding mining as an application on the database

server if the user-defined functions are run in the unfenced mode (same address space as










the server) [32]. In this case, the entire mining algorithm is encapsulated as a collection of

user-defined functions (UDFs) [32] that are appropriately placed in SQL data scan queries.

The architecture is represented in Figure 2.5. Most of the processing happens in the UDF

and the DBMS is used simply to provide tuples to these UDFs. Little use is made of the

query processing capabilities of the DBMS. The UDFs can be run in either fenced (different

address space) or unfenced (same address space) mode. The main attraction of this method

is performance since when run in the unfenced mode individual tuples never have to cross

the DBMS boundary. Otherwise, the processing happens in almost the same manner as

in the stored procedure case. The main disadvantage is the development cost [12] since

the entire mining algorithm has to be written as UDFs involving significant code rewrites.

Further, these are "heavy-weight" UDFs which involve significant processing and memory

management.

[Figure 2.5 labels: GUI or Mining Language; SQL queries (containing UDFs); DBMS; UDFs for mining]




Figure 2.5. UDF-based mining architecture
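
A rough analogue of the UDF-based architecture is sketched below. A user-defined aggregate is registered with the database and placed in a plain scan query; the DBMS only feeds tuples to it, while all counting and filtering happens inside the function. The sketch uses Python's sqlite3 module and its create_aggregate call as a stand-in for the UDF facilities of a server DBMS, so it says nothing about fenced versus unfenced execution. The table T, the sample rows, the function name frequent_items, and the fixed threshold are hypothetical.

```python
import json
import sqlite3
from collections import Counter


class FrequentItems:
    """User-defined aggregate: sees every item of the scan, returns the frequent ones."""

    MINSUP = 1  # hypothetical threshold: keep items occurring more than once

    def __init__(self):
        self.counts = Counter()

    def step(self, item):
        # Called by the DBMS once per scanned tuple.
        self.counts[item] += 1

    def finalize(self):
        # Called once at the end of the scan; the result is returned as JSON text.
        frequent = {i: n for i, n in self.counts.items() if n > self.MINSUP}
        return json.dumps(frequent, sort_keys=True)


def make_demo_db():
    """Build a tiny in-memory transaction table T(tid, item) with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE T (tid INTEGER, item TEXT)")
    rows = [(1, "A"), (1, "B"), (2, "A"), (2, "C"), (3, "A"), (3, "B")]
    conn.executemany("INSERT INTO T VALUES (?, ?)", rows)
    return conn


def mine_with_udf(conn):
    conn.create_aggregate("frequent_items", 1, FrequentItems)
    (result,) = conn.execute("SELECT frequent_items(item) FROM T").fetchone()
    return json.loads(result)
```

Even in this toy form the function has to manage its own state and result encoding, which hints at why such "heavy-weight" UDFs are costly to develop.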


In order to provide a query interface or application programming interface to the

discovered rules, they can be passed through a post-processing step. The rule discovery

itself could be done by any of the above alternatives.

2.2.5 SQL-Based Approach

This is the integration architecture explored in this dissertation. In this approach, the

mining algorithm is formulated as SQL queries which are executed by the DBMS query










processor. We develop several SQL formulations for a few representative mining operations

in order to better understand the performance profile of current database query processors

in executing these queries. We believe that it will enable us to identify what portions of

these mining operations can be pushed down to the query processing engine of a DBMS.

There are also several potential advantages of a SQL implementation. One can profitably

make use of the database indexing and query processing capabilities, thereby leveraging

more than two decades of development effort spent in making these systems robust,

portable, scalable, and highly concurrent. Rather than devising specialized paralleliza-

tions, one can potentially exploit the underlying SQL parallelization, particularly in an

SMP environment. The current approach to parallelizing mining algorithms is to develop

specialized parallelizations for each of the algorithms [11, 61, 113, 114]. The DBMS sup-

port for check-pointing and space management can be especially valuable for long-running

mining algorithms on huge volumes of data. The development of new algorithms could be

significantly faster if expressed declaratively using a few SQL operations. This approach is

also extremely portable across DBMSs since porting becomes trivial if the SQL approaches

use only the standard SQL features.
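
As a small illustration of what expressing a mining step declaratively can look like, the following sketch counts the support of all 2-itemsets with a single self-join and group-by over a transaction table, leaving join order, indexing and aggregation to the query processor. It is only meant to convey the idea and is not necessarily one of the formulations studied later. The table T(tid, item), the sample rows and the function names are hypothetical, sqlite3 stands in for a relational DBMS, and the threshold is a strict lower bound on the count.

```python
import sqlite3

# Pairs of items that occur together in more than `minsup` transactions.
FREQUENT_PAIRS_SQL = """
SELECT t1.item, t2.item, COUNT(*) AS support
FROM T t1, T t2
WHERE t1.tid = t2.tid AND t1.item < t2.item
GROUP BY t1.item, t2.item
HAVING COUNT(*) > ?
ORDER BY t1.item, t2.item
"""


def make_demo_db():
    """Build a tiny in-memory transaction table T(tid, item) with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE T (tid INTEGER, item TEXT)")
    rows = [(1, "A"), (1, "B"), (2, "A"), (2, "C"), (3, "A"), (3, "B")]
    conn.executemany("INSERT INTO T VALUES (?, ?)", rows)
    return conn


def frequent_pairs(conn, minsup):
    """Run the query; all counting is done by the DBMS query processor."""
    return conn.execute(FREQUENT_PAIRS_SQL, (minsup,)).fetchall()
```

The query uses only standard SQL constructs, so the same text should run on other relational engines with at most cosmetic changes.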

[Figure 2.6 labels: Extended SQL or GUI; Preprocessor; Optimizer; SQL-92, SQL-3/SQL-4; (Object) Relational DBMS; Domain semantics of mining]

Figure 2.6. SQL architecture for mining in a DBMS


The architecture we have in mind is schematically shown in Figure 2.6. We visualize

that the desired mining operation will be expressed in some extension of SQL or a graphical

language. A preprocessor will generate appropriate SQL translation for this operation.









This preprocessor will be able to select the right translation taking into account input

data distributions. We consider SQL translations that can be executed on a SQL-92 [88]

relational engine, as well as translations that require some of the newer object-relational

capabilities being designed for SQL [78]. Specifically, we assume availability of blobs, user-

defined functions, and table functions [80] in the Object-Relational engine. We do not

require any mining specific extension in the underlying execution engine; identification of

such extensions is one of the goals of this study.

We do quantitative and qualitative comparisons of some of the architectural alter-

natives listed here. Our primary focus is on the performance of various architectural al-

ternatives and the identification of possible enhancements to the query optimizer and the

query processing engine. The issues of the language constructs required to extend SQL

with mining features, and the details of the preprocessing step shown in Figure 2.6 are

secondary.

It might also be possible to integrate mining with databases using the newer extension

technologies like database extenders, data cartridges or data blades.

2.2.6 Integrated Approach

This is the tightest form of integration where the mining operations are an integral

part of the database query engine. In this approach there is no clear boundary between

simple querying, OLAP and mining; that is, querying and mining are treated as similar

operations. The user's goal is to get information from the data store. He/she should not

have to make the distinction as to whether it is the result of querying/OLAP/mining.

This entails unbundling the bulky mining operations and identifying common operator

primitives with which the mining operations can be composed. We cannot expect to have










a specialized operator for every mining task. It also needs a language in which the required

operations can be specified. Realizing this goal requires a tremendous amount

of research on aspects such as designing language extensions and better query processing

and optimization strategies. However, we envision that the query processing engine will

eventually be extended with primitive mining operators. When that is accomplished, a

mining system architecture will resemble the one shown in Figure 2.7.

[Figure 2.7 labels: Extended SQL or GUI; SQL-92, SQL-3/SQL-4; Enhanced Optimizer; (Object) Relational DBMS; Information; Domain semantics of mining]

Figure 2.7. Architecture for mining in next-generation DBMSs



2.3 Summary


The first two approaches in the architecture taxonomy in Figure 2.1, namely, mining in

the application server or database server, facilitate the move from file mining to database

mining rather easily. However, as explained in the loose-coupling, cache-mine, stored-

procedure and user-defined function approaches, they do not utilize the query processing

functionality provided by the DBMS.

In this dissertation we pursue the third approach in Figure 2.1, which uses SQL and

its extensions to implement the mining algorithms. This acts as a precursor to determine

the extensions to current query processors and optimizers in order to move towards the last

approach which is the truly integrated approach.


Note The architectural alternatives in Section 2.2 were developed primarily by

researchers from IBM Almaden Research Center and the author was a contributor.


















CHAPTER 3
ASSOCIATION RULES



In this chapter, we discuss the various SQL-92 (SQL with no object-relational ex-

tensions) formulations of association rule mining. We start with a review of the apriori

algorithm for association rule mining in Section 3.1. A few other algorithms for mining

association rules are briefly outlined in Section 3.2. The input-output data formats are de-

scribed in Section 3.3 and in Section 3.4, we introduce SQL-based association rule mining.

The various SQL-92 formulations are presented in Section 3.5. We present experimental

results showing the performance of these formulations on some real-life datasets in Sec-

tion 3.6. In Section 3.7, we develop cost formulae for the cost of executing the above SQL

queries on a query processor, based on the input data parameters and relational operator

costs. A few performance optimizations to the basic SQL-92 approaches and the corre-

sponding performance gains are presented in Section 3.8. Section 3.9 quantifies the overall

performance improvements of the optimizations with experiments on synthetic datasets.

The association rule mining problem outlined in Section 1.2.1 can be decomposed into

two subproblems [7].


Find all combinations of items whose support is greater than minimum support. Call

those combinations frequent itemsets.


Use the frequent itemsets to generate the desired rules. The idea is that if, say, ABCD

and AB are frequent itemsets, then we can determine if the rule AB → CD holds by









computing the ratio r = support(ABCD)/support(AB). The rule holds only if r >

minimum confidence. Note that the rule will have minimum support because ABCD

is frequent.
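
The second subproblem can be sketched directly from the ratio above. Given the support of every frequent itemset, each way of splitting an itemset into an antecedent and a consequent yields one candidate rule, whose confidence is the support of the whole itemset divided by the support of the antecedent. The representation (a dictionary keyed by frozensets) and the function name are illustrative choices; the comparison is strict, following the text.

```python
from itertools import combinations


def generate_rules(supports, minconf):
    """supports maps each frequent itemset (a frozenset) to its support.

    Returns (antecedent, consequent, confidence, support) tuples for every
    rule whose confidence exceeds minconf.
    """
    rules = []
    for itemset, sup in supports.items():
        for size in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), size):
                a = frozenset(antecedent)
                if a not in supports:
                    # Cannot happen if supports holds all frequent itemsets,
                    # since every subset of a frequent itemset is frequent.
                    continue
                confidence = sup / supports[a]
                if confidence > minconf:
                    consequent = tuple(sorted(itemset - a))
                    rules.append((antecedent, consequent, confidence, sup))
    return rules
```

No access to the transaction data is needed at this stage; the supports computed in the first subproblem are sufficient.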


3.1 Apriori Algorithm

We use the Apriori algorithm [9] as the basis for our presentation. There are recent

proposals aimed at improving the performance of the Apriori algorithm by reducing the

number of data passes [26, 127]. They all have the same basic data-flow structure as the

Apriori algorithm. Our goal is to understand how best to integrate this basic structure

within a database system.



F1 = {frequent 1-itemsets};
for (k = 2; Fk-1 ≠ ∅; k++) do
    Ck = apriori-gen(Fk-1);        // generate new candidates
    forall transactions t ∈ D do
        Ct = subset(Ck, t);        // find all candidates contained in t
        forall candidates c ∈ Ct do
            c.count++;
        done
    done
    Fk = {c ∈ Ck | c.count > minsup};
done
Answer = ∪k Fk;


Figure 3.1. Apriori algorithm










The basic Apriori algorithm shown in Figure 3.1 makes multiple passes over the data.

In the kth pass it finds all itemsets with k items having the minimum support, called the

frequent k-itemsets. Each pass consists of two phases. Let Fk represent the set of frequent

k-itemsets, and Ck the set of candidate k-itemsets (potentially frequent itemsets). First is

the candidate generation phase where the set of all frequent (k-1)-itemsets, Fk-1, found

in pass (k-1), is used to generate the candidate itemsets Ck. The candidate generation

procedure ensures that Ck is a superset of the set of all frequent k-itemsets. The algorithm

builds a specialized hash-tree data structure in memory out of Ck. Next is the support

counting phase where the transaction database is scanned. For each transaction, the

algorithm determines which of the candidates in Ck are contained in the transaction using

the hash-tree data structure and increments their support count. At the end of the pass, Ck

is examined to determine which of the candidates are frequent, yielding Fk. The algorithm

terminates when Ck+1 becomes empty.
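
The level-wise structure described above can be sketched as a short in-memory program. It is a simplification in two respects, both of which are assumptions of this example: candidates are matched against each transaction by a linear scan instead of the hash-tree, and minsup is an absolute count compared strictly, as in Figure 3.1. Itemsets are represented as sorted tuples.

```python
from itertools import combinations


def apriori_gen(prev_frequent, k):
    """Generate candidate k-itemsets from the frequent (k-1)-itemsets."""
    prev = set(prev_frequent)
    candidates = set()
    for a in prev:
        for b in prev:
            # Join: same first k-2 items, last item of a smaller than that of b.
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                c = a + (b[-1],)
                # Prune: every (k-1)-subset of c must itself be frequent.
                if all(s in prev for s in combinations(c, k - 1)):
                    candidates.add(c)
    return candidates


def apriori(transactions, minsup):
    """Return a dict mapping every frequent itemset to its support count."""
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:  # pass 1: count single items
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    frequent = {c: n for c, n in counts.items() if n > minsup}
    answer = dict(frequent)
    k = 2
    while frequent:
        candidates = apriori_gen(frequent, k)
        counts = dict.fromkeys(candidates, 0)
        for t in transactions:  # pass k: support counting
            for c in candidates:
                if t.issuperset(c):
                    counts[c] += 1
        frequent = {c: n for c, n in counts.items() if n > minsup}
        answer.update(frequent)
        k += 1
    return answer
```

Each iteration of the while loop corresponds to one pass over the data, so the number of passes grows with the size of the longest frequent itemset.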

3.2 Other Algorithms

There are several other file-based algorithms for mining association rules. However,

many of them follow the basic Apriori framework and try to improve on the computation

and I/O requirements.

The partition algorithm [111] reduces the I/O requirements to just two passes over the

entire dataset. The reason the database needs to be scanned multiple times is that the

number of possible itemsets to be tested for support is exponentially large if it must be done

in a single scan of the database. The partition algorithm generates a set of all potentially

frequent itemsets in one scan of the database. This set is a superset of all frequent itemsets.

In the second scan, the actual support of these itemsets in the whole database is computed.









The algorithm executes in two phases. In the first phase, it divides the database into a

number of non-overlapping partitions. The frequent itemsets in each of these partitions

are computed separately which will involve multiple passes over the data. However, the

partition sizes are chosen such that an entire partition fits in main memory so that it is

read only once in each phase. At the end of the first phase, the frequent itemsets from all

the partitions are merged together to generate the set of all potentially frequent itemsets.

This set is a superset of all the frequent itemsets since all itemsets that are frequent in

the whole database have to be frequent in at least one partition. In the second phase, the

actual support of these itemsets is counted and the frequent itemsets are identified.
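
The two phases can be sketched as follows. To keep the example short, the local frequent itemsets of a partition are found by enumerating every subset of every transaction, which is only feasible for tiny inputs and is not how the partition algorithm itself mines a partition. The support threshold is expressed as a fraction of the transactions so that it can be applied to a partition and to the whole database alike; the function names and this representation are assumptions of the sketch.

```python
from collections import Counter
from itertools import combinations


def count_itemsets(transactions, candidates=None):
    """Count itemset occurrences; with candidates=None, count all subsets."""
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        if candidates is None:
            for size in range(1, len(items) + 1):
                counts.update(combinations(items, size))
        else:
            present = set(items)
            counts.update(c for c in candidates if present.issuperset(c))
    return counts


def partition_mine(transactions, min_fraction, n_partitions):
    n = len(transactions)
    size = -(-n // n_partitions)  # ceiling division
    # Phase 1: collect the itemsets that are frequent in some partition.
    candidates = set()
    for start in range(0, n, size):
        part = transactions[start:start + size]
        local = count_itemsets(part)
        candidates |= {c for c, cnt in local.items()
                       if cnt > min_fraction * len(part)}
    # Phase 2: one more scan to obtain the actual support of the candidates.
    global_counts = count_itemsets(transactions, candidates)
    return {c: cnt for c, cnt in global_counts.items()
            if cnt > min_fraction * n}
```

Because an itemset that is frequent in the whole database must be frequent in at least one partition, the second phase never misses a frequent itemset.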

Toivonen proposes a sampling based algorithm [127]. The idea there is to pick a

random sample, use it to determine all association rules that probably hold in the whole

database, and then to verify the results with the rest of the database. The algorithm thus

produces exact association rules in one full pass over the database. In those rare cases where

the sampling method does not produce all association rules, the missing rules can be found

in a second pass. A superset of the frequent itemsets can be determined efficiently from a

random sample by applying any level-wise algorithm on the sample in main memory, and

by using a lowered support threshold. In cases where approximate results are sufficient, the

sampling approach can significantly reduce the computational and I/O requirements since

it works on a much smaller dataset.

A different way of counting support is proposed in Savasere et al. [111] and Zaki

et al. [131]. Associated with each item is a tidlist which consists of all the transaction

identifiers that contain that item. The support for an itemset can be obtained by counting

the number of transactions that contain all the items in the itemset. If the tidlists are









kept sorted, this operation can be done by performing a merge-scan of the tidlists of all the

items in the itemset.
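
A minimal sketch of tidlist-based support counting is given below. It assumes that each tidlist is a sorted list of distinct transaction identifiers and folds a two-way merge over the items of the itemset; the function names and the dictionary representation are illustrative.

```python
def intersect_sorted(a, b):
    """Merge-scan two sorted tidlists and keep the identifiers common to both."""
    i = j = 0
    common = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            common.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return common


def support(itemset, tidlists):
    """Number of transactions that contain every item of a non-empty itemset."""
    items = iter(itemset)
    common = tidlists[next(items)]
    for item in items:
        common = intersect_sorted(common, tidlists[item])
    return len(common)
```

Each merge touches every element of its two inputs at most once, so the cost is linear in the total length of the tidlists involved.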


3.3 Input-Output Formats

Input format The transaction table T normally has two column attributes: transac-

tion identifier (tid) and item identifier (item). For a given tid, typically there are multiple

rows in the transaction table corresponding to different items that belong to the same trans-

action. The number of items per transaction is variable and unknown during table creation

time. Thus, alternative schemas may not be convenient. In particular, assuming that all

items in a transaction appear as different columns of a single tuple [105] is not practical,

because often the number of items per transaction can be more than the maximum number

of columns that the database supports. For instance, for one of the real-life datasets we

experimented with, the maximum number of items per transaction is 872 and for another

it is 700. In contrast, the corresponding average number of items per transaction is only

9.6 and 4.4 respectively. Even if the database supports so many columns for a table, there

will be a lot of wasted space in that scheme.


Output format The output is a collection of rules of varying length. The maximum

length of these rules is much smaller than the total number of items and is rarely more

than a dozen. Therefore, a rule is represented as a tuple in a fixed-width table where the

extra column values are set to NULL to accommodate rules involving smaller itemsets. The

schema of a rule is (item1,..., itemk, len, rulem, confidence, support) where k is the size of

the largest frequent itemset. The len attribute gives the length of the rule (number of items

in the rule) and the rulem attribute gives the position of the → in the rule. For instance,

if k = 5, the rule AB → CD, which has 90% confidence and 30% support, is represented by










the tuple (A, B, C, D, NULL, 4, 2, 0.9, 0.3). The frequent itemsets are represented the

same way as rules but do not have the rulem and confidence attributes.
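
The fixed-width encoding of a rule can be sketched as a small helper that pads the item columns up to width k. Python's None plays the role of NULL, and the function name and argument order are illustrative.

```python
def encode_rule(antecedent, consequent, confidence, support, k):
    """Encode a rule as (item1, ..., itemk, len, rulem, confidence, support)."""
    items = list(antecedent) + list(consequent)
    if len(items) > k:
        raise ValueError("rule is longer than the largest frequent itemset")
    padding = [None] * (k - len(items))  # unused item columns are NULL
    length = len(items)                  # len: number of items in the rule
    rulem = len(antecedent)              # rulem: position of the arrow
    return tuple(items + padding + [length, rulem, confidence, support])
```

For the example in the text, encode_rule(["A", "B"], ["C", "D"], 0.9, 0.3, 5) yields the nine-column tuple given above.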


3.4 Associations in SQL


In Section 3.4.1 we present the candidate generation process in SQL. In Section 3.4.2

we present the support counting process and in Section 3.4.3 we present the rule generation

process.


3.4.1 Candidate Generation in SQL


Recall that the apriori algorithm for finding frequent itemsets proceeds in a level-wise

manner. In each pass k of the algorithm we first need to generate a set of candidate itemsets

Ck from frequent itemsets Fk-1 of the previous pass.

Given Fk-1, the set of all frequent (k-1)-itemsets, the Apriori candidate generation

procedure [9] returns a superset of the set of all frequent k-itemsets. We assume that the

items in an itemset are lexicographically ordered. Since all subsets of a frequent itemset

are also frequent, we can generate Ck from Fk-1 as follows.

First, in the join step, we generate a superset of the candidate itemsets Ck by joining

Fk-1 with itself as shown below.


insert into Ck select I1.item1, ..., I1.itemk-1, I2.itemk-1
from Fk-1 I1, Fk-1 I2
where I1.item1 = I2.item1 and
      ...
      I1.itemk-2 = I2.itemk-2 and
      I1.itemk-1 < I2.itemk-1









For example, let F3 be {{1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2 3 4}}. After the join step, C4

will be {{1 2 3 4}, {1 3 4 5}}.
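The join step can be tried out on the F3 example with a small Python sketch. SQLite (through the standard sqlite3 module) is used here only as a convenient stand-in for the DBMS, and the table and column names follow the text with k fixed at 4; both choices are assumptions of this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE F3 (item1 INT, item2 INT, item3 INT)")
conn.executemany("INSERT INTO F3 VALUES (?, ?, ?)",
                 [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)])

# Join step for k = 4: equal (k-2)-item prefix, last items strictly ordered.
join_step = """
    SELECT I1.item1, I1.item2, I1.item3, I2.item3
    FROM F3 I1, F3 I2
    WHERE I1.item1 = I2.item1
      AND I1.item2 = I2.item2
      AND I1.item3 < I2.item3
    ORDER BY 1, 2, 3, 4
"""
c4_before_prune = conn.execute(join_step).fetchall()
print(c4_before_prune)  # [(1, 2, 3, 4), (1, 3, 4, 5)]
```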

Next, in the prune step, all itemsets c ∈ Ck for which some (k-1)-subset of c is not in Fk-1 are deleted. Continuing with the example above, the prune step will delete the itemset {1 3 4 5} because the subset {1 4 5} is not in F3. We will then be left with only {1 2 3 4} in C4. We can perform the prune step in the same SQL statement as the join step above by writing it as a k-way join, as shown in Figure 3.2. A k-way join is used because any k-itemset has k subsets of length (k-1) whose membership in Fk-1 must be checked. The join predicates on I1 and I2 remain the same. After the join between I1 and I2 we get a k-itemset consisting of (I1.item1, ..., I1.itemk-1, I2.itemk-1), as shown above. For this itemset, two of its (k-1)-length subsets are already known to be frequent since it was generated from two itemsets in Fk-1. We check the remaining k-2 subsets using additional joins. The predicates for these joins are enumerated by skipping one item at a time from the k-itemset as follows. We first skip item1 and check if the subset (I1.item2, ..., I1.itemk-1, I2.itemk-1) belongs to Fk-1, as shown by the join with I3 in Figure 3.2. In general, for a join with Ir (3 ≤ r ≤ k), we skip item r-2, which gives us join predicates of the form

I1.item1 = Ir.item1 and
...
I1.itemr-3 = Ir.itemr-3 and
I1.itemr-1 = Ir.itemr-2 and
...
I1.itemk-1 = Ir.itemk-2 and
I2.itemk-1 = Ir.itemk-1.











Figure 3.3 gives an example for k = 4.


We construct a primary index on (item1, ..., itemk-1) of Fk-1 to efficiently process these k-way joins using index probes. Note that sometimes it may not be necessary to materialize Ck before the counting phase. Instead, the candidate generation can be pipelined with the subsequent SQL queries used for support counting.
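A minimal sketch of the combined join and prune steps for k = 4, again using SQLite through Python's sqlite3 module as a stand-in DBMS (an assumption of this example). The primary key on F3 plays the role of the index described above, and the aliases I3 and I4 implement the "skip item1" and "skip item2" membership checks.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE F3 (item1 INT, item2 INT, item3 INT, "
             "PRIMARY KEY (item1, item2, item3))")
conn.executemany("INSERT INTO F3 VALUES (?, ?, ?)",
                 [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)])
conn.execute("CREATE TABLE C4 (item1 INT, item2 INT, item3 INT, item4 INT)")

conn.execute("""
    INSERT INTO C4
    SELECT I1.item1, I1.item2, I1.item3, I2.item3
    FROM F3 I1, F3 I2, F3 I3, F3 I4
    WHERE I1.item1 = I2.item1 AND I1.item2 = I2.item2 AND I1.item3 < I2.item3
      -- skip item1: (item2, item3, item4) must be in F3
      AND I1.item2 = I3.item1 AND I1.item3 = I3.item2 AND I2.item3 = I3.item3
      -- skip item2: (item1, item3, item4) must be in F3
      AND I1.item1 = I4.item1 AND I1.item3 = I4.item2 AND I2.item3 = I4.item3
""")
c4 = conn.execute("SELECT * FROM C4").fetchall()
print(c4)  # [(1, 2, 3, 4)]
```

The candidate {1 3 4 5} disappears because no row of F3 satisfies the I3 predicates for it, which is the prune step expressed as a join.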

[Figure 3.2: query tree. Fk-1 I1 is joined with Fk-1 I2 on I1.item1 = I2.item1, ..., I1.itemk-2 = I2.itemk-2, I1.itemk-1 < I2.itemk-1. The result is joined with Fk-1 I3 (skip item1) on I1.item2 = I3.item1, ..., I1.itemk-1 = I3.itemk-2, I2.itemk-1 = I3.itemk-1, and so on up to Fk-1 Ik (skip itemk-2) on I1.item1 = Ik.item1, ..., I1.itemk-1 = Ik.itemk-2, I2.itemk-1 = Ik.itemk-1.]

Figure 3.2. Candidate generation for any k


[Figure 3.3: query tree for k = 4. F3 I1 is joined with F3 I2 on I1.item1 = I2.item1, I1.item2 = I2.item2, I1.item3 < I2.item3; the result is joined with F3 I3 (skip item1) on I1.item2 = I3.item1, I1.item3 = I3.item2, I2.item3 = I3.item3, and then with F3 I4 (skip item2) on I1.item1 = I4.item1, I1.item3 = I4.item2, I2.item3 = I4.item3.]

Figure 3.3. Candidate generation for k = 4










3.4.2 Counting Support to Find Frequent Itemsets

This is the most time-consuming part of the association rule mining algorithm. We

use the candidate itemsets Ck and the data table T to count the support of the itemsets in

Ck. We consider two different categories of SQL implementations.


(A) The first one is based purely on SQL-92. We discuss four approaches in this category

in Section 3.5.


(B) The second utilizes the new SQL object-relational extensions like UDFs, BLOBs

(binary large objects), table functions and so on. Table functions [80] are virtual

tables associated with a user defined function which generate tuples on the fly. Like

normal physical tables they have pre-defined schemas. The function associated with

a table function can be implemented as any other UDF. Thus, table functions can be

viewed as UDFs that return a collection of tuples instead of scalar values.

We discuss six approaches in this category in Chapter 4. Note that UDFs in this

approach are lightweight and do not require extensive memory allocation and

coding, unlike a purely UDF-based implementation [12].


3.4.3 Rule Generation

In the second phase of the association rule mining algorithm, we use the frequent

itemsets to generate rules with the user specified minimum confidence, minconf. For every

frequent itemset l, we first find all non-empty proper subsets of l. Then, for each of those

subsets m, we find the confidence of the rule m → (l − m) and output the rule if it is at least

minconf.
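The subset enumeration just described can be sketched in a few lines of Python. This is a brute-force illustration rather than the SQL formulation used in this chapter; the function name generate_rules and the dictionary layout (sorted item tuples mapped to support counts) are assumptions of the example.

```python
from itertools import combinations

def generate_rules(support, minconf):
    """Return (antecedent, consequent, confidence, support) for every rule
    m -> (l - m) whose confidence is at least minconf."""
    rules = []
    for itemset, sup in support.items():
        for size in range(1, len(itemset)):
            for m in combinations(itemset, size):   # non-empty proper subsets
                conf = sup / support[m]             # support(l) / support(m)
                if conf >= minconf:
                    consequent = tuple(i for i in itemset if i not in m)
                    rules.append((m, consequent, conf, sup))
    return rules

# Support counts of all frequent itemsets, keyed by sorted item tuples.
support = {("A",): 4, ("B",): 3, ("A", "B"): 3}
rules = generate_rules(support, minconf=0.8)
print(rules)  # [(('B',), ('A',), 1.0, 3)]
```

The lookup support[m] always succeeds because every subset of a frequent itemset is itself frequent.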

In the support counting phase, the frequent itemsets of size k are stored in table Fk. Before the rule generation phase, we merge all the frequent itemsets into a single table F. The schema of F consists of k + 2 attributes (item1, ..., itemk, support, len), where k is the size of the largest frequent itemset and len is the length of the itemset as discussed earlier in Section 3.3.

We use the table function GenRules to generate all possible rules from a frequent itemset. The input argument to the function is a frequent itemset. For each itemset, it outputs tuples corresponding to rules with all non-empty proper subsets of the itemset in the consequent. The table function outputs tuples with k + 3 attributes: T_item1, ..., T_itemk, T_support, T_len, T_rulem. The output is joined with F to find the support of the antecedent, and the confidence of the rule is calculated by taking the ratio of the support values. The predicates in the where clause match the antecedent of the rule with the frequent itemset corresponding to the antecedent. While checking for this match, we need to check only up to itemk where k ≤ T_rulem. The or part (t1.T_rulem < k) in the predicate accomplishes this. Figure 3.4 illustrates the rule generation query.

We can also do rule generation without using table functions and base it purely on

SQL-92. The rules are generated in a level-wise manner where in each level k we generate

rules with consequents of size k. Further, we make use of the property that for any frequent

itemset, if a rule with consequent c holds, then, so do rules with consequents that are subsets

of c as suggested in Agrawal et al. [9]. We can use this property to generate rules in level

k using rules with (k-1)-long consequents found in the previous level, much like the way

we did candidate generation in Section 3.4.1.

The fraction of the total running time spent in rule generation is very small. Therefore,

we do not focus much on rule generation algorithms.












insert into R select T_item1, ..., T_itemk,
       t1.T_support, T_len, T_rulem, t1.T_support/f2.support
from F f1, table(GenRules(f1.item1, ..., f1.itemk, f1.len, f1.support)) as t1, F f2
where (t1.T_item1 = f2.item1 or t1.T_rulem < 1) and
      ...
      (t1.T_itemk = f2.itemk or t1.T_rulem < k) and
      t1.T_rulem = f2.len and
      t1.T_support/f2.support > :minconf

[Figure 3.4: query tree. F f1 feeds the table function GenRules; its output t1 is joined with F f2, filtered on conf > :minconf, and projected to (item1, ..., itemk, len, rulem, confidence, support).]

Figure 3.4. Rule generation


3.5 Support Counting Using SQL-92


We present four approaches in this category.


3.5.1 K-way Join


In each pass k, we join the candidate itemsets Ck with k transaction tables T and

follow it up with a group by on the itemsets as shown in Figure 3.5.

Figure 3.5 also shows a tree diagram of the query. These tree diagrams are not to be

confused with the plan trees which could look quite different.













insert into Fk select item1, ..., itemk, count(*)
from Ck, T t1, ..., T tk
where t1.item = Ck.item1 and
      ...
      tk.item = Ck.itemk and
      t1.tid = t2.tid and
      ...
      tk-1.tid = tk.tid
group by item1, item2, ..., itemk
having count(*) > :minsup

[Figure 3.5: query tree. T t1 is joined with T t2 on t1.tid = t2.tid, and so on up to T tk on tk-1.tid = tk.tid; the result is joined with Ck on Ck.item1 = t1.item, ..., Ck.itemk = tk.item, followed by a group by on item1, ..., itemk and having count(*) > :minsup.]

Figure 3.5. Support counting by K-way join


This SQL computation, when merged with the candidate generation step, is similar to

the one proposed in Tsur et al. [128] as a possible mechanism to implement query flocks.

In Section 3.7, we discuss the different execution plans for this query and the related

performance issues.
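To make the K-way join concrete, the sketch below runs the query for k = 2 on a toy transaction table. SQLite via Python's sqlite3 module stands in for the DBMS, the four transactions are invented for illustration, and minsup is treated as an inclusive absolute count; all three are assumptions of this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE T  (tid INT, item INT);
    CREATE TABLE C2 (item1 INT, item2 INT);
    -- transactions: 1:{1,2,3}  2:{1,2}  3:{1,2}  4:{2,3}
    INSERT INTO T VALUES (1,1),(1,2),(1,3),(2,1),(2,2),(3,1),(3,2),(4,2),(4,3);
    INSERT INTO C2 VALUES (1,2),(1,3),(2,3);
""")

# One copy of T per item position, all tied to the same tid.
f2 = conn.execute("""
    SELECT C2.item1, C2.item2, COUNT(*)
    FROM C2, T t1, T t2
    WHERE t1.item = C2.item1 AND t2.item = C2.item2 AND t1.tid = t2.tid
    GROUP BY C2.item1, C2.item2
    HAVING COUNT(*) >= :minsup
    ORDER BY C2.item1, C2.item2
""", {"minsup": 2}).fetchall()
print(f2)  # [(1, 2, 3), (2, 3, 2)]
```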


3.5.2 Three-way Join


The above approach requires (k + 1)-way joins in the kth pass. We can reduce the cardinality of joins to 3 using the following approach, which bears some resemblance to the AprioriTid algorithm in Agrawal et al. [9]. Each candidate itemset in Ck, in addition to attributes (item1, ..., itemk), has three new attributes (oid, id1, id2). oid is a unique identifier associated with each itemset, and id1 and id2 are oids of the two itemsets in Fk-1 from which the itemset in Ck was generated (as discussed in Section 3.4.1). In addition, in the kth pass we generate a new copy of the data table Tk with attributes (tid, oid) that keeps for each tid the oid of each itemset in Ck that it supported. For support counting, we first generate Tk from Tk-1 and Ck and then do a group-by on Tk to find Fk as follows.


insert into Tk select t1.tid, oid
from Ck, Tk-1 t1, Tk-1 t2
where t1.oid = Ck.id1 and t2.oid = Ck.id2 and t1.tid = t2.tid

insert into Fk select oid, item1, ..., itemk, cnt
from Ck,
     (select oid as cid, count(*) as cnt from Tk
      group by oid having count(*) > :minsup) as temp
where Ck.oid = cid
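The two statements can be exercised for k = 2 with the following sketch. SQLite via Python's sqlite3 module stands in for the DBMS; the table names Tprev and Tcur (for Tk-1 and Tk), the oid values and the toy data are all assumptions made for illustration, and minsup is treated as an inclusive count.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Tprev (tid INT, oid INT);   -- T_{k-1}: oids 10, 20, 30 stand for items 1, 2, 3
    CREATE TABLE Tcur  (tid INT, oid INT);   -- T_k, filled below
    CREATE TABLE C2 (oid INT, item1 INT, item2 INT, id1 INT, id2 INT);
    INSERT INTO Tprev VALUES (1,10),(1,20),(1,30),(2,10),(2,20),
                             (3,10),(3,20),(4,20),(4,30);
    INSERT INTO C2 VALUES (100,1,2,10,20),(101,1,3,10,30),(102,2,3,20,30);

    -- A tid supports a candidate if it supported both generating itemsets.
    INSERT INTO Tcur
    SELECT a.tid, C2.oid
    FROM C2, Tprev a, Tprev b
    WHERE a.oid = C2.id1 AND b.oid = C2.id2 AND a.tid = b.tid;
""")

f2 = conn.execute("""
    SELECT C2.oid, item1, item2, cnt
    FROM C2,
         (SELECT oid AS cid, COUNT(*) AS cnt FROM Tcur
          GROUP BY oid HAVING COUNT(*) >= :minsup) AS temp
    WHERE C2.oid = cid
    ORDER BY C2.oid
""", {"minsup": 2}).fetchall()
print(f2)  # [(100, 1, 2, 3), (102, 2, 3, 2)]
```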


3.5.3 Two Group-by

Another way to avoid multi-way joins is to first join T and Ck based on whether the

"item" of a (tid, item) pair of T is equal to any of the k items of Ck, then do a group by on

(item1,...,itemk,tid) filtering tuples with count equal to k. This gives all (itemset, tid)

pairs such that the tid supports the itemset. Finally, as in the previous approaches, do a

group-by on the itemset (item1, ..., itemk) filtering tuples that meet the support condition.


insert into Fk select item1, ..., itemk, count(*)
from (select item1, ..., itemk, count(*)
      from T, Ck
      where item = Ck.item1 or
            ...
            item = Ck.itemk
      group by item1, ..., itemk, tid
      having count(*) = k) as temp
group by item1, ..., itemk
having count(*) > :minsup
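A small sketch of the two group-by formulation for k = 2, run on SQLite through Python's sqlite3 module. The DBMS choice, the toy data and the inclusive treatment of minsup are assumptions of this example; it also assumes that each (tid, item) pair occurs at most once in T, since the inner count relies on that.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE T  (tid INT, item INT);
    CREATE TABLE C2 (item1 INT, item2 INT);
    -- transactions: 1:{1,2,3}  2:{1,2}  3:{1,2}  4:{2,3}
    INSERT INTO T VALUES (1,1),(1,2),(1,3),(2,1),(2,2),(3,1),(3,2),(4,2),(4,3);
    INSERT INTO C2 VALUES (1,2),(1,3),(2,3);
""")

f2 = conn.execute("""
    SELECT item1, item2, COUNT(*)
    FROM (SELECT C2.item1 AS item1, C2.item2 AS item2, tid, COUNT(*) AS n
          FROM T, C2
          WHERE item = C2.item1 OR item = C2.item2
          GROUP BY C2.item1, C2.item2, tid
          HAVING COUNT(*) = 2) AS temp      -- the tid contains both items
    GROUP BY item1, item2
    HAVING COUNT(*) >= :minsup
    ORDER BY item1, item2
""", {"minsup": 2}).fetchall()
print(f2)  # [(1, 2, 3), (2, 3, 2)]
```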


3.5.4 Subquery-Based

This approach makes use of common prefixes between the itemsets in Ck to reduce the amount of work done during support counting. We break up the support counting phase into a cascade of k subqueries. The l-th subquery Ql finds all tids that match the distinct itemsets formed by the first l columns of Ck (call it dl). The output of Ql is joined with T and dl+1 (the distinct itemsets formed by the first l + 1 columns of Ck) to get Ql+1. The final output is obtained by doing a group-by on the k items to count support as above. The queries and the tree diagram for subquery Ql are given in Figure 3.6.
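The cascade can be sketched for k = 2 by composing the subqueries as strings in Python and running them on SQLite (through the sqlite3 module). The DBMS choice, the variable names q1 and q2, the toy data and the inclusive treatment of minsup are assumptions made for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE T  (tid INT, item INT);
    CREATE TABLE C2 (item1 INT, item2 INT);
    -- transactions: 1:{1,2,3}  2:{1,2}  3:{1,2}  4:{2,3}
    INSERT INTO T VALUES (1,1),(1,2),(1,3),(2,1),(2,2),(3,1),(3,2),(4,2),(4,3);
    INSERT INTO C2 VALUES (1,2),(1,3),(2,3);
""")

# Q1: tids matching the distinct 1-item prefixes of C2.
q1 = """
    SELECT d1.item1 AS item1, t1.tid AS tid
    FROM T t1, (SELECT DISTINCT item1 FROM C2) AS d1
    WHERE t1.item = d1.item1
"""
# Q2: extend each Q1 match by one more item using the 2-item prefixes.
q2 = f"""
    SELECT d2.item1 AS item1, d2.item2 AS item2, t2.tid AS tid
    FROM T t2, ({q1}) AS r1, (SELECT DISTINCT item1, item2 FROM C2) AS d2
    WHERE r1.item1 = d2.item1 AND r1.tid = t2.tid AND t2.item = d2.item2
"""
f2 = conn.execute(f"""
    SELECT item1, item2, COUNT(*)
    FROM ({q2}) AS q
    GROUP BY item1, item2
    HAVING COUNT(*) >= :minsup
    ORDER BY item1, item2
""", {"minsup": 2}).fetchall()
print(f2)  # [(1, 2, 3), (2, 3, 2)]
```

Here the three candidates share only two distinct 1-item prefixes, so Q1 is evaluated over two prefix rows instead of three candidate rows.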


3.6 Performance Comparison

In this section we compare the performance of the four SQL-92 approaches. All of

our experiments were performed on Version 5 of IBM DB2 Universal Server installed on a

RS/6000 Model 140 with a 200 MHz CPU, 256 MB main memory and a 9 GB disc with a

measured transfer rate of 8 MB/s. These experimental results are also reported in Sarawagi

et al. [109].














insert into Fk select item1, ..., itemk, count(*)
from (Subquery Qk) t
group by item1, item2, ..., itemk
having count(*) > :minsup

Subquery Ql (for any l between 1 and k):
select item1, ..., iteml, tid
from T tl, (Subquery Ql-1) as rl-1,
     (select distinct item1, ..., iteml from Ck) as dl
where rl-1.item1 = dl.item1 and ... and
      rl-1.iteml-1 = dl.iteml-1 and
      rl-1.tid = tl.tid and
      tl.item = dl.iteml

Subquery Q0: No subquery Q0.

[Figure 3.6: tree diagram for subquery Ql. Subquery Ql-1 (as rl-1), the distinct (item1, ..., iteml) projection of Ck (as dl) and T tl are joined using the predicates of the query above, producing (item1, ..., iteml, tid).]

Figure 3.6. Support counting using subqueries










Table 3.1. Description of different real-life datasets

Datasets    # Records     # Transactions   # Items       Avg. # items
            (millions)    (millions)       (thousands)
Dataset-A   2.5           0.57             85            4.4
Dataset-B   7.5           2.5              15.8          2.62
Dataset-C   6.6           0.21             15.8          31
Dataset-D   14            1.44             480           9.62



We selected a collection of four real-life datasets obtained from various mail-order

companies and retail stores for the experiments. These datasets have differing values of

parameters like the total number of (tid,item) pairs, the number of transactions (tids), the

number of items and the average number of items per transaction. Table 3.1 summarizes

these parameters.

In this dissertation, we report the performance with only Dataset-A. The overall

observation was that mining implementations in pure SQL-92 are too slow to be practical.

For these experiments we built a composite index (item1, ..., itemk) on Ck, k different indices

on each of the k items of Ck and a (tid, item) and a (item, tid) index on the data table.

The goal was to let the optimizer choose the best plan possible. We do not include the

index building cost in the total time.

In Figure 3.7 we show the total time taken by the four approaches: KwayJoin, 3wayJoin,

Subquery and 2GroupBy. For comparison, we also show the time taken by the Loose-coupling

approach because this is the approach currently used by existing systems. The graph shows

the total time split into candidate generation time (Cgen) and the time for each pass. The

candidate generation time and the time for the first pass are much smaller compared to the

total time. From this set of experiments we can make the following observations.












[Figure 3.7: bar chart for Dataset-A. Each bar shows the total time split into Cgen, Pass 1, Pass 2 and Pass 3; vertical axis: time (ticks from 1000 to 10000); horizontal axis: support values 0.50%, 0.35% and 0.20%.]

Figure 3.7. Comparison of different SQL-92 approaches


* The best approach in the SQL-92 category is the Subquery approach. An important


reason for its performing better than the other approaches is exploitation of common


prefixes between candidate itemsets. None of the other three approaches uses this


optimization. Although the Subquery approach is comparable to the Loose-coupling


approach in some cases, for other cases it did not complete even after taking ten times


more time than the Loose-coupling approach.



* The 2GroupBy approach is significantly worse than the other three approaches because


it involves an index-ORing operation on k indices for each pass k of the algorithm.


In addition, the inner group-by requires sorting a large intermediate relation. The


outer group-by is comparatively faster because the sorted result is of size at most Ck


which is much smaller than the result size of the inner group-by. The DBMS does


aggregation during sorting; therefore, the size of the result is an important factor in


the total cost.










* The 3wayJoin approach is comparable to the KwayJoin approach for this dataset

because the number of passes is at most three. As shown in Agrawal et al. [9], there

might be other datasets, especially ones where there is a significant reduction in the size

of Tk as k increases, where 3wayJoin might perform better than KwayJoin. However,

one disadvantage of the 3wayJoin approach is that it requires space to store and log

the temporary relations Tk generated in each pass.


3.7 Cost Analysis

In this section, we analyze the cost of the KwayJoin and Subquery approaches which

represent the better ones among the SQL-92 approaches. A relational query engine can

execute the KwayJoin query in several different ways and the performance depends on the

execution plan chosen. We experimented with a number of alternative execution plans

for this query. We could force the query processor to choose different plans by creating

different indices on T and Ck, and in some cases by disabling certain join methods in our

experiments using Postgres. We will elaborate more on this in Section 3.9.

We present two different execution plans-one with Ck as the outermost relation in

Section 3.7.1 and another with Ck as the innermost relation in Section 3.7.2-and the cost

analysis for them. The effect of subquery optimization on the cost estimates is outlined in

Section 3.7.3.

Schematic diagrams of the two different execution plans are given in Figures 3.8 and

3.9. In the cost analysis, we use the mining-specific data parameters and knowledge about

association rule mining (Apriori algorithm [9] in this case) to estimate the cost of joins and

the size of join results. Even though current relational optimizers do not use this mining-

specific semantic information, the analysis provides a basis for developing "mining-aware"













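[Figure 3.8: query tree. T t1 is joined with T t2 on t1.tid = t2.tid and t1.item < t2.item, and so on up to T tk on tk-1.tid = tk.tid and tk-1.item < tk.item; the result is joined with Ck on Ck.item1 = t1.item, ..., Ck.itemk = tk.item, followed by a group by and having count(*) > :minsup.]

Figure 3.8. K-way join plan with Ck as inner relation

[Figure 3.9: query tree. Ck is joined with T t1 on Ck.item1 = t1.item, then with T t2 on Ck.item2 = t2.item and t2.tid = t1.tid, and so on up to T tk on Ck.itemk = tk.item and tk.tid = tk-1.tid, followed by a group by on item1, ..., itemk and having count(*) > :minsup.]

Figure 3.9. K-way join plan with Ck as outer relation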



optimizers. The cost formulae are presented in terms of operator costs in order to make


them general; for instance join(p, q, r) denotes the cost of joining two relations of size p


and q to get a result of size r. The actual cost which is based on the join method used


and the system parameters can be derived by substituting the appropriate values in the


given formulae. The data parameters and operators used in the analysis are summarized


in Table 3.2.



3.7.1 KWayJoin Plan with Ck as Outer Relation



Start with Ck as the outermost relation and perform a series of joins with the k copies


of T. The final join result is grouped on the k items to find the support counts. The


choice of join methods for each of the intermediate joins depends on factors such as the










Table 3.2. Notations used in cost analysis

R               number of records in the input transaction table
T               number of transactions
N               average number of items per transaction = R/T
F1              number of frequent items
S(C)            sum of support of each itemset in set C
sk              average support of a frequent k-itemset = S(Fk)/Fk
Rf              number of records out of R involving frequent items = S(F1)
Nf              average number of frequent items per transaction = Rf/T
Ck              number of candidate k-itemsets
C(n, k)         number of combinations of size k possible out of a set of size n = n!/(k!(n-k)!)
tk              cost of generating a k-item combination using table function Comb-k
group(n, m)     cost of grouping n records out of which m are distinct
join(p, q, r)   cost of joining two relations of size p and q to get a result of size r
blob(n)         cost of passing a BLOB of size n integers as an argument


availability of indices, the size of intermediate results, and the amount of available memory.

For instance, the efficient execution of nested loops joins requires an index (item, tid) on

T. If the intermediate join result is large, it could be advantageous to materialize it and

perform sort-merge join.

For each candidate itemset in Ck, the join with T produces as many records as the support of its first item. Therefore, the size of the join result can be estimated to be the product of the number of k-candidates and the average support of a frequent item, and hence the cost of this join is given by join(Ck, R, Ck * s1). Similarly, the relation obtained after joining Ck with l copies of T contains as many records as the sum of the support counts of the l-item prefixes of the candidate k-itemsets. Hence the cost of the lth join is join(Ck * sl-1, R, Ck * sl), where s0 = 1. Note that the values of the sl's can be computed from statistics collected in the previous passes. The cost of the last join (with the kth copy of T) cannot be estimated using the above formula since the k-item prefix of a k-candidate is not necessarily frequent and we do not know the value of sk. However, the last join produces S(Ck) records (there will be as many records for each candidate as its support) and therefore its cost is join(Ck * sk-1, R, S(Ck)). S(Ck) can be estimated by adding the support estimates of all the itemsets in Ck. A good estimate for the support of a candidate itemset is the minimum of the support counts of all its subsets. The overall cost of this plan expressed in terms of operator costs is


$$\sum_{l=1}^{k-1} \mathrm{join}(C_k * s_{l-1},\, R,\, C_k * s_l) \;+\; \mathrm{join}(C_k * s_{k-1},\, R,\, S(C_k)) \;+\; \mathrm{group}(S(C_k),\, C_k)$$


3.7.2 KWayJoin Plan with Ck as Inner Relation

In this plan, we join the k copies of T and the resulting k-item combinations are joined

with Ck to filter out non-candidate item combinations. The final join result is grouped on

the k-items.

The result of joining l copies of T is the set of all possible l-item combinations of transactions, and there are C(N, l) * T such combinations. We know that the items in the candidate itemset are lexicographically ordered and hence we can add extra join predicates as shown in Figure 3.8 to limit the join result to l-item combinations (without these extra predicates the join will result in l-item permutations). When Ck is the outermost relation these predicates are not required. A mining-aware optimizer should be able to rewrite the query appropriately. The lth join produces (l + 1)-item combinations and therefore its cost is join(C(N, l) * T, R, C(N, l + 1) * T). The last join produces S(Ck) records as in the previous case and hence its cost is join(C(N, k) * T, Ck, S(Ck)). The overall cost of this plan is


$$\sum_{l=1}^{k-1} \mathrm{join}(C(N, l) * T,\, R,\, C(N, l+1) * T) \;+\; \mathrm{join}(C(N, k) * T,\, C_k,\, S(C_k)) \;+\; \mathrm{group}(S(C_k),\, C_k)$$


Note that in the above expression C(N, 1) * T = R.

3.7.3 Effect of Subquery Optimization


The subquery optimization makes use of common prefixes among candidate itemsets. Unfolding all the subqueries will result in a query tree which structurally resembles the KwayJoin plan tree shown in Figure 3.9. Subquery Ql produces d_l^k * sl records, where d_j^k denotes the number of distinct j-item prefixes of itemsets in Ck. In contrast, the lth join in the KwayJoin plan results in Ck * sl records. Typically d_l^k is much smaller than Ck, which explains why the Subquery approach performs better than the KwayJoin approach. The output of subquery Qk contains S(Ck) records. The total cost of this approach can be estimated to be


$$\sum_{l=1}^{k} \mathrm{trijoin}(R,\, s_{l-1} * d_{l-1}^{k},\, d_l^{k},\, s_l * d_l^{k}) \;+\; \mathrm{group}(S(C_k),\, C_k)$$


where trijoin(p, q, r, s) denotes the cost of joining three relations of size p, q, r respectively

producing a result of size s. The value of sk, which is the average support of a frequent

k-itemset, can be estimated as mentioned in Section 3.7.1.

The experimental results presented in Section 3.6 and in Sarawagi et al. [110] show

that the subquery optimization gave much better performance than the basic KwayJoin

approach (an order of magnitude better in some cases). We observed the same trend in our

additional experiments using synthetic datasets. We used synthetic datasets for some of the










experiments because the real-life datasets were not available outside IBM. The synthetic

datasets used in our experiments are detailed below. The main reason for the subquery approach performing better is that the number of distinct l-item prefixes is much less than the total number of candidate itemsets, which results in joins between smaller tables. The number of candidate itemsets and the corresponding distinct item prefixes for various passes in one of our experiments are given in Figure 3.10. These numbers are for the dataset T10.I4.D100K and 0.33% support. Note that d_k^k is not shown since it is the same as Ck. In pass 3, C3 contains 2453 itemsets whereas d_1^3 has only 295 1-item prefixes (almost a factor of 10 less than C3). This results in correspondingly smaller intermediate tables as shown in the analysis above, which is the key to the performance gain.
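The prefix counts discussed here are easy to compute for any candidate set. The Python sketch below is only an illustration; the function name distinct_prefix_counts and the small candidate set are hypothetical and are not taken from the experiments.

```python
def distinct_prefix_counts(candidates):
    """Return the number of distinct l-item prefixes for l = 1..k,
    given a list of candidate k-itemsets as sorted tuples."""
    k = len(candidates[0])
    return [len({c[:l] for c in candidates}) for l in range(1, k + 1)]

c3 = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (2, 3, 4), (2, 3, 5)]
print(distinct_prefix_counts(c3))  # [2, 3, 5]
```

The last entry always equals the number of candidates, which is why the k-item prefix count carries no extra information.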

[Figure 3.10: chart of the number of candidate itemsets and distinct item prefixes per pass; vertical axis ticks up to 2500.]

Figure 3.10. Number of candidate itemsets vs distinct item prefixes




Experimental datasets We used synthetic data generated according to the procedure

explained in Agrawal and Srikant [13] for some of the experiments. The results reported

here are for the datasets T5.I2.D100K and T10.I4.D100K. The first dataset consists of 100

thousand transactions, each containing an average of 5 items. The average size of the










Table 3.3. Description of synthetic datasets

Datasets        # Records   # Transactions   # Items   Avg. # items
T5.I2.D100K     546651      100000           1000      5
T10.I4.D100K    1154995     100000           1000      10



maximal potentially frequent itemsets (denoted as I) is 2. The transaction table corre-

sponding to this dataset had approximately 550 thousand records. The second dataset

has 100 thousand transactions, each containing an average of 10 items (total of about 1.1

million records) and the average size of maximal potentially frequent itemsets is 4. The

different parameters of the two datasets are summarized in Table 3.3.

The experiments reported in the rest of this chapter were performed on PostgreSQL

Version 6.3 [100], a public domain DBMS, installed on an 8-processor Sun Ultra Enterprise

4000/5000 with 248 MHz CPUs and 256 MB main memory per processor, running Solaris

2.6. Note that PostgreSQL is not parallelized. It supports nested loops, hash-based and

sort-merge join methods and provides finer control of the optimizer to disable any of the

join methods. We have found it to be a useful platform for studying the performance of

different join methods and execution plans.


3.8 Performance Optimizations

The cost analysis presented above provides some insight into the different components

of the execution time in the different passes and what can be optimized to achieve better

performance. In this section, we present three optimizations to the KwayJoin approach

(other than the subquery optimization) and discuss how they impact the cost. Based on

these optimizations, we develop the Set-oriented Apriori approach in Section 3.8.4.









3.8.1 Pruning Non-Frequent Items

The size of the transaction table is a major factor in the cost of joins involving T.

It can be reduced by pruning the non-frequent items from the transactions after the first

pass. We store the transaction data as (tid, item) tuples in a relational table and hence

this pruning means simply dropping the tuples corresponding to non-frequent items. This

can be achieved by joining T with the frequent items table F1 as follows.


insert into Tf select t.tid, t.item

from T t, F1 f

where t.item = f.item
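The effect of this pruning can be checked on a toy table with the sketch below. SQLite via Python's sqlite3 module stands in for the DBMS; the data, the alias tx and the inclusive minimum support of 2 used to build F1 are assumptions of this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE T  (tid INT, item INT);
    CREATE TABLE F1 (item INT, support INT);
    CREATE TABLE Tf (tid INT, item INT);
    -- items 8 and 9 occur only once and are therefore not frequent
    INSERT INTO T VALUES (1,1),(1,2),(1,3),(2,1),(2,2),(2,9),
                         (3,1),(3,2),(4,2),(4,3),(4,8);

    -- first pass: frequent items with support >= 2
    INSERT INTO F1 SELECT item, COUNT(*) FROM T GROUP BY item HAVING COUNT(*) >= 2;

    -- pruning: keep only the tuples whose item is frequent
    INSERT INTO Tf SELECT tx.tid, tx.item FROM T tx, F1 f WHERE tx.item = f.item;
""")
r = conn.execute("SELECT COUNT(*) FROM T").fetchone()[0]
rf = conn.execute("SELECT COUNT(*) FROM Tf").fetchone()[0]
print(r, rf)  # 11 9
```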


We insert the pruned transactions into table Tf which has the same schema as that of T.

In the subsequent passes, joins with T can be replaced with corresponding joins with Tf.

This could result in improved performance especially for higher support values where the

frequent item selectivity is low, since we join smaller tables. For some of the synthetic

datasets we used in our experiments, this pruning reduced the size of the transaction table

to about half its original size. This could be even more useful for real-life datasets, which

typically contain many non-frequent items. For example, some of the real-life datasets

used for the experiments reported in Sarawagi et al. [109] contained on the order of 100

thousand items, out of which only a few hundred were frequent. Figure 3.11 shows the

reduction in transaction table size due to this optimization for our experimental datasets.

The initial size (R) and the size after pruning (Rf) for different support values are shown.

With this optimization, in the cost formulae of Section 3.7, R can be replaced with

Rf, the number of records in T involving frequent items, and N with Nf, the average











[Figure 3.11: bar chart of the transaction table size, showing R and Rf for different support values, for T10.I4.D100K and T5.I2.D100K; vertical axis: number of records (0 to 1200000).]

Figure 3.11. Reduction in transaction table size by non-frequent item pruning


number of frequent items per transaction. Note that with the subquery optimization we


could achieve significant performance improvements by reducing the relation sizes of joins.


3.8.2 Eliminating Candidate Generation in Second Pass


This optimization aims at reducing the cost of the second pass which is a significant


portion of the total cost in some cases. In the second pass, C2 is almost a Cartesian product


of the two F1s used to generate it and hence materializing it and joining with the T's (or


Tf's) could be expensive. In order to circumvent the problem of counting the large C2,


most mining algorithms use special techniques in the second pass. A few examples are


two-dimensional arrays in IBM's Quest data mining system [2] and hash filters proposed in


Park et al. [99] to limit the size of C2. The generation of C2 can be completely eliminated


by formulating the join query to find F2 as


insert into F2 select p.item, q.item, count(*)
from Tf p, Tf q
where p.tid = q.tid and p.item < q.item
group by p.item, q.item
having count(*) > :minsup
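The second-pass query can be run directly on a pruned transaction table, as in the following sketch. SQLite via Python's sqlite3 module stands in for the DBMS, and the toy contents of Tf and the inclusive treatment of minsup are assumptions of this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Tf (tid INT, item INT);
    -- pruned transactions: 1:{1,2,3}  2:{1,2}  3:{1,2}  4:{2,3}
    INSERT INTO Tf VALUES (1,1),(1,2),(1,3),(2,1),(2,2),(3,1),(3,2),(4,2),(4,3);
""")

# Self-join of Tf: every ordered item pair within a transaction is counted,
# so no candidate table C2 is needed.
f2 = conn.execute("""
    SELECT p.item, q.item, COUNT(*)
    FROM Tf p, Tf q
    WHERE p.tid = q.tid AND p.item < q.item
    GROUP BY p.item, q.item
    HAVING COUNT(*) >= :minsup
    ORDER BY p.item, q.item
""", {"minsup": 2}).fetchall()
print(f2)  # [(1, 2, 3), (2, 3, 2)]
```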


The cost of second pass with this optimization is



join(Rf, Rf, C(Nf, 2)) + group(C(Nf, 2), C(F1, 2))



Even though the grouping cost remains the same, there is a big reduction from the basic

KwayJoin approach in the join costs. Figure 3.12 compares the running time of the sec-

ond pass with this optimization to the basic KwayJoin approach for the two experimental

datasets. For the KwayJoin approach, the best execution plan was the one which generates

all 2-item combinations, joins them with the candidate set and groups the join result. We

can see that this optimization has a significant impact on the running time.

3.8.3 Reusing the Item Combinations from Previous Pass

The SQL formulations of association rule mining are based on generating item combinations

in various ways, and similar work is performed in all the different passes. Therefore,

reusing the item combinations will improve the performance especially in the higher passes.

We will explain more about this optimization and the corresponding cost reduction in the

next section.

3.8.4 Set-Oriented Apriori

In this section, we develop a set-oriented version of the apriori algorithm combining

all the performance optimizations outlined above, which we call Set-oriented Apriori. We

store the item combinations generated in each pass and use them in the subsequent passes

instead of generating them in every pass from the transaction tables. In the kth pass of

the support counting phase, we generate a table Tk which contains all k-item combinations














[Figure 3.12: two bar charts of second-pass running time with and without the optimization. Panel 1: Pass 2 optimization (T10.I4.D100K), support values 2%, 1%, 0.75% and 0.33%. Panel 2: Pass 2 optimization (T5.I2.D100K), support values 2%, 1%, 0.50% and 0.10%. Vertical axis: time.]

Figure 3.12. Benefit of second pass optimization



that are candidates. Tk has the schema (tid, item1, ..., itemk). We join Tk-1, Tf and Ck as


shown below to generate Tk. A tree diagram of the query is also given in Figure 3.13. The


frequent itemsets Fk are obtained by grouping the tuples of Tk on the k items and applying


the minimum support filtering.


We can further prune Tk by filtering out item combinations that turned out to be


non-frequent. However, this is not essential since we join it with the candidate set Ck+1 in












insert into Tk

select p.tid, p.iteml, ...p.itemkl, q.item

from Ck, Tk-1 P, Tf q

where p.iteml = Ck.iteml and



p.itemk-I = Ck.itemk-1 and

q.item = Ck.itemk and

p.tid = q.tid

Tk
p.iteml =Ck.iteml
p.itemk-1 Ck.item k- I
q.item = Ck.itenmk

p.tid = q.tid <' Ck
p.item_k- < q.item


T_k-1 p TI q


Figure 3.13. Generation of Tk


the next pass to generate Tk+1. The only advantage of pruning Tk is that we will have a

smaller table to join in the next pass; but at the additional cost of joining Tk with Fk.
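As a rough illustration of the Tk generation query and the subsequent grouping step, the following Python sketch runs them against an in-memory SQLite database. The helper names (generate_tk, frequent_itemsets, build_demo), the table naming scheme (T3, T4, C4, Tf) and the toy data are assumptions made for this example; they are not taken from the experiments reported here. The strict ">" in the support filter mirrors the queries in the text.

```python
import sqlite3

def generate_tk(conn, k):
    """Materialize T{k} by joining C{k}, T{k-1} and Tf (assumes k >= 3)."""
    items = [f"item{i}" for i in range(1, k + 1)]
    conn.execute(f"CREATE TABLE T{k} (tid, {', '.join(items)})")
    # p.item1 = c.item1 AND ... AND p.item{k-1} = c.item{k-1}
    prefix_match = " AND ".join(f"p.{col} = c.{col}" for col in items[:-1])
    prefix_cols = ", ".join(f"p.{col}" for col in items[:-1])
    conn.execute(
        f"INSERT INTO T{k} "
        f"SELECT p.tid, {prefix_cols}, q.item "
        f"FROM C{k} c, T{k - 1} p, Tf q "
        f"WHERE {prefix_match} AND q.item = c.item{k} AND p.tid = q.tid"
    )

def frequent_itemsets(conn, k, minsup):
    """F_k: group T{k} on its k items and apply the minimum support filter."""
    cols = ", ".join(f"item{i}" for i in range(1, k + 1))
    return conn.execute(
        f"SELECT {cols}, COUNT(*) FROM T{k} "
        f"GROUP BY {cols} HAVING COUNT(*) > ?",
        (minsup,),
    ).fetchall()

def build_demo():
    """Toy data: transactions 1 and 2 hold items {1,2,3,4}; transaction 3 holds {1,2,3}."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Tf (tid, item)")
    conn.execute("CREATE TABLE T3 (tid, item1, item2, item3)")
    conn.execute("CREATE TABLE C4 (item1, item2, item3, item4)")
    conn.executemany(
        "INSERT INTO Tf VALUES (?, ?)",
        [(t, i) for t in (1, 2) for i in (1, 2, 3, 4)] + [(3, 1), (3, 2), (3, 3)],
    )
    triples = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (2, 3, 4)]
    conn.executemany(
        "INSERT INTO T3 VALUES (?, ?, ?, ?)",
        [(t, *c) for t in (1, 2) for c in triples] + [(3, 1, 2, 3)],
    )
    conn.execute("INSERT INTO C4 VALUES (1, 2, 3, 4)")
    return conn
```

Calling generate_tk(conn, 4) on the toy database keeps only the two transactions that extend the prefix (1, 2, 3) with item 4, which is the reuse of the previous pass that the section describes.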

We use the optimization discussed above for the second pass and hence do not materialize and store the 2-item combinations T2. Therefore, we generate T3 directly by joining Tf with C3 as

insert into T3
select p.tid, p.item, q.item, r.item
from Tf p, Tf q, Tf r, C3
where p.item = C3.item1 and q.item = C3.item2 and r.item = C3.item3 and
      p.tid = q.tid and q.tid = r.tid
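The direct generation of T3 can be tried out in the same way. The sketch below is only an illustration under assumed names (generate_t3, toy_database) and toy data; it runs the query on SQLite rather than on the systems used in our experiments.

```python
import sqlite3

def generate_t3(conn):
    """Materialize T3 directly from three copies of Tf and the candidates C3."""
    conn.execute("CREATE TABLE T3 (tid, item1, item2, item3)")
    conn.execute(
        "INSERT INTO T3 "
        "SELECT p.tid, p.item, q.item, r.item "
        "FROM Tf p, Tf q, Tf r, C3 "
        "WHERE p.item = C3.item1 AND q.item = C3.item2 AND r.item = C3.item3 "
        "AND p.tid = q.tid AND q.tid = r.tid"
    )
    return conn.execute("SELECT * FROM T3 ORDER BY tid").fetchall()

def toy_database():
    """Three transactions over frequent items 1..3; transaction 2 lacks item 3."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Tf (tid, item)")
    conn.execute("CREATE TABLE C3 (item1, item2, item3)")
    conn.executemany(
        "INSERT INTO Tf VALUES (?, ?)",
        [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (3, 1), (3, 2), (3, 3)],
    )
    conn.execute("INSERT INTO C3 VALUES (1, 2, 3)")
    return conn
```

Only transactions 1 and 3 contain all three items of the single candidate, so only they contribute tuples to T3.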









We can also use the Subquery approach to generate T3 if that is estimated to be less expensive. T3 will contain exactly the same tuples as those produced by subquery Q3.

The Set-oriented Apriori algorithm bears some resemblance to the three-way join approach in Section 3.5.2 and the AprioriTid algorithm in Agrawal and Srikant [13]. In the

three-way join approach, the temporary table Tk stores for each transaction, the identifiers

of the candidates it supported. Tk is generated by joining two copies of Tk-1 with Ck. The

generation of Fk requires a further join of Tk with Ck. AprioriTid makes use of special

data structures which are difficult to maintain in the SQL formulation.

Cost Comparison

In Section 3.7.3, we saw that the cost of the kth pass with the subquery optimization is

$$\sum_{l=1}^{k} \left\{ \mathrm{trijoin}(R_f, s^k_{l-1}, d^k_l, s^k_l) \right\} + \mathrm{group}(S(C_k), C_k)$$

As a result of the materialization and reuse of item combinations, Set-oriented Apriori requires only a single 3-way join in the kth pass (see footnote 1). The cost of the kth pass in Set-oriented Apriori is

$$\mathrm{trijoin}(R_f, T_{k-1}, C_k, S(C_k)) + \mathrm{group}(S(C_k), C_k)$$

where Tk-1 and Ck denote the cardinality of the corresponding tables. The grouping cost is the same as that of the subquery approach. The table Tk-1 contains exactly the same tuples as the output of subquery Qk-1 and hence has a size of $s^k_{k-1}$. Also, $d^k_k$ is the same as Ck. Therefore, the kth pass cost of Set-oriented Apriori is the same as the kth term in the join cost summation of the subquery approach. This results in significant performance improvements, especially in the higher passes.

Footnote 1: Note that this may be executed as two 2-way joins, since 3-way joins are not generally supported in current relational systems.

Figure 3.14 compares the running times of the subquery and Set-oriented Apriori ap-

proaches for the dataset T10.I4.D100K for 0.33% support. We show only the times for

passes 3 and higher since both the approaches are the same in the first two passes.

[Bar chart: running times of Set-Apriori and Subquery.]

Figure 3.14. Benefit of reusing item combinations



Space Overhead


The Set-oriented Apriori approach requires additional space in order to store the item combinations generated. The size of the table Tk is the same as S(Ck), which is the total support of all the k-item candidates. Assuming that the tid and item attributes are integers, each tuple in Tk consists of k + 1 integer attributes. Figure 3.15 shows the space required to store Tk in terms of number of integers, for the dataset T10.I4.D100K for two different support values. The space needed for the input data table T is also shown for comparison. T2 is not shown in the graph since we do not materialize and store it in the Set-oriented Apriori approach. Note that once Tk is materialized, Tk-1 can be deleted unless it needs to be retained for some other purposes.
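The space estimate above amounts to a one-line calculation, sketched here in Python. The function name and the example support values are hypothetical; they are not measurements from our datasets.

```python
def tk_space_in_integers(k, candidate_supports):
    """Integers needed to store T_k: S(C_k) tuples, each with one tid and k items."""
    total_support = sum(candidate_supports)  # S(C_k), total support of all k-item candidates
    return (k + 1) * total_support
```

For example, three 3-item candidates with supports 120, 80 and 50 give S(C3) = 250 tuples of four integers each, that is, 1000 integers.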












[Bar chart: space required, in number of integers, for T and the tables Tk at support 0.75% and support 0.33%.]

Figure 3.15. Space requirements of the Set-oriented Apriori approach


In the earlier passes, if the number of item combinations is too large and we do not


want to store them, the Subquery approach can be used instead. However, the Subquery


approach will also generate large intermediate tables except in cases where the output of an


intermediate join is pipelined to the subsequent operation. The transition to Set-oriented


Apriori can be made in any pass by materializing the item combinations. An advantage of


the Set-oriented Apriori is that it requires only simple joins and hence it is easier for the


optimizer to handle them. With multiple levels of nested subqueries, optimization becomes


much harder.


3.9 Performance Experiments with Set-Oriented Apriori


In this section, we compare the total running time of the Subquery and Set-oriented


Apriori approaches.


We studied the performance of the Subquery approach (the best SQL-92 approach in


Section 3.6) and the Set-oriented Apriori for a wide range of data parameters and support


values. We report the results on two of the datasets, T5.I2.D100K and T10.I4.D100K, which are described in Section 3.7.3.








[Two stacked bar charts, broken down by pass: "T10.I4.D100K: Total time" (Pass 1 to Pass 7; support 2%, 1%, 0.75%, 0.33%) and "T5.I2.D100K: Total time" (support 2%, 1%, 0.5%, 0.1%).]

Figure 3.16. Comparison of Subquery and Set-oriented Apriori approaches



In Figure 3.16, we show the relative performance of Subquery and Set-oriented Apriori


approaches for the two datasets. The chart shows the total time taken for each of the


different passes. Set-oriented Apriori performs better than Subquery for all the support


values. The first two passes of both approaches are similar and take approximately equal amounts of time. The difference between Set-oriented Apriori and Subquery widens for


higher numbered passes as explained in Section 3.8.4. For T5.I2.D100K, F2 was empty









for support values higher than 0.3% and therefore we chose lower support values to study

the relative performance in higher numbered passes.

In some cases, the optimizer did not choose the best plan. For example, for joins with T (Tf for Set-oriented Apriori), the optimizer often chose a nested-loops plan using the (item, tid) index on T even where the corresponding sort-merge plan was faster, by an order of magnitude in some cases. We were able to experiment with different plans by disabling certain join methods (disabling nested-loops join in the above case). We also broke down the multi-way joins in some cases into simpler two-way joins to study the performance implications. The reported times correspond to the best join order and join methods.

Figure 3.17 shows the CPU time and I/O time taken for the dataset T10.I4.D100K.

The two approaches show the same relative trend as the total time. However, it should be

noted that the I/O time is less than one third of the CPU time. This shows that there

is a need to revisit the traditional optimization and parallelization strategies designed to

optimize for I/O time, in order to handle the newer decision support and mining queries

efficiently.

3.9.1 Scale-up Experiment

We experimented with several synthetic datasets to study the scale-up behavior of Set-

oriented Apriori with respect to increasing number of transactions and increasing transaction

size. Figure 3.18 shows how Set-oriented Apriori scales up as the number of transactions is

increased from 10,000 to 1 million. We used the datasets T5.I2 and T10.I4 for the average

sizes of transactions and itemsets respectively. The minimum support level was kept at 1%.

The first graph shows the absolute execution times and the second one shows the times







[Two stacked bar charts, broken down by pass (Pass 1 to Pass 7) at support 2%, 1%, 0.75%, 0.33%: "T10.I4.D100K: CPU time" and "T10.I4.D100K: I/O time".]

Figure 3.17. Comparison of CPU and I/O times



normalized with respect to the times for the 10,000 transaction datasets. It can be seen


that the execution times scale quite linearly and both the datasets exhibit similar scale-up


behavior.


The scale-up with increasing transaction size is shown in Figure 3.19. In these exper-


iments we kept the physical size of the database roughly constant by keeping the product


of the average transaction size and the number of transactions constant. The number of


transactions ranged from 200,000 for the database with an average transaction size of 5 to













[Two line charts for T10.I4 and T5.I2; x-axis: Number of Transactions (in thousands), 0 to 1200.]

Figure 3.18. Number of transactions scale-up



20,000 for the database with an average transaction size of 50. We fixed the minimum sup-


port level in terms of the number of transactions, since fixing it as a percentage would have


led to large increases in the number of frequent itemsets as the transaction size increased.


The numbers in the legend (for example, 1000) refer to this minimum support. The execu-


tion times increase with the transaction size, but only gradually. The main reason for this


increase was that the number of item combinations present in a transaction increases with


the transaction size.






[Line chart; x-axis: Average transaction length, 0 to 60.]

Figure 3.19. Transaction size scale-up


There has been increasing interest in developing scalable data mining techniques [23,


33, 117]. For the scalability of the SQL approaches, we leverage the development effort

spent in making the relational query processors scalable.


3.10 Summary


In this chapter, we presented four SQL-92 formulations for association rule mining. We


analyzed the best approach and developed cost formulae for the different execution plans


of the corresponding SQL queries based on the input data parameters and the relational


operator costs. This cost analysis provides a basis for incorporating the semantics of mining


algorithms into future query optimizers.


While doing the experiments, it was difficult to force the optimizer to choose certain


execution plans and join methods since commercial DBMSs do not provide that level of

control. Postgres was relatively better in that respect, since we could control the choice


of join methods. However, the Postgres optimizer did not optimize long queries, especially


the ones involving nested subqueries, well.









Note: A part of the work described in this chapter was primarily done by researchers

from IBM Almaden Research Center. Specifically, the SQL-based candidate generation in

Section 3.4.1 and the support counting approaches in Section 3.5 were developed by them.

They are included in this dissertation for completeness.


















CHAPTER 4
SUPPORT COUNTING USING SQL WITH OBJECT-RELATIONAL EXTENSIONS



In this chapter, we study alternative approaches that make use of additional object-

relational features in SQL. For each approach, we also outline a cost-based analysis of the

execution time to enable one to choose between these different approaches. We present six

different approaches, optimizations and their cost estimates in Sections 4.1, 4.2 and 4.3.

Experimental results comparing the performance of these approaches are presented in Sec-

tion 4.4. In Section 4.5, we propose a hybrid approach which combines the best of all

approaches. The performance of different architectural alternatives described in Chapter 2

is compared in Section 4.6. In Section 4.7, we summarize qualitative comparisons of these

architectures. The applicability of the SQL-based approach to other association rule mining algorithms is briefly discussed in Section 4.8.


4.1 GatherJoin

This approach (see Figure 4.1) is based on the use of table functions described in Section 3.4.2. It generates all possible k-item combinations of items contained in a transaction,

joins them with the candidate table Ck and counts the support of the itemsets by grouping

the join result. It uses two table functions Gather and Comb-K. The data table T is scanned

in the (tid, item) order and passed to the table function Gather. This table function col-

lects all the items of a transaction (in other words, items of all tuples of T with the same

tid) in memory and outputs a record for each transaction. Each such record consists of two









attributes, the tid and item-list which is a collection of all its items in a VARCHAR or a

BLOB. The output of Gather is passed to another table function Comb-K which returns all

k-item combinations formed out of the items of a transaction. A record output by Comb-K

has k attributes T_itm1, ..., T_itmk, which can be directly used to probe into the Ck table.

An index is constructed on all the items of Ck to make the probe efficient. Figure 4.1

presents SQL queries for this approach.

This approach is analogous to the KwayJoin approach where we have replaced the

k-way self join of T with table functions Gather and Comb-K. These table functions are

easy to code and do not require a large amount of memory. Also, it is possible to merge

them together as a single table function GatherComb-K, which is what we did in our imple-

mentation. The Gather function is not required when the data is already in a horizontal

format where each tid is followed by a collection of all its items.
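To make the roles of the two table functions concrete, the following Python sketch mimics Gather and Comb-K with generators and performs the join with Ck and the grouping in memory. It is a simplified stand-in rather than the table-function code used in our implementation; the function names and the toy data are assumptions, and the rows are assumed to arrive sorted by (tid, item).

```python
from collections import Counter
from itertools import combinations, groupby

def gather(rows):
    """rows: (tid, item) pairs sorted by (tid, item). Yields (tid, item_list)."""
    for tid, group in groupby(rows, key=lambda r: r[0]):
        yield tid, [item for _, item in group]

def comb_k(gathered, k):
    """Yield every k-item combination of each transaction's item list."""
    for _tid, items in gathered:
        yield from combinations(items, k)

def gather_join(rows, candidates, k, minsup):
    """Join the combinations with C_k, group, and apply the support filter."""
    cand = set(candidates)
    counts = Counter(c for c in comb_k(gather(rows), k) if c in cand)
    return {c: n for c, n in counts.items() if n > minsup}
```

Because the items of each transaction arrive in sorted order, every combination is emitted in the same item order as the candidates, so a plain equality probe suffices.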

4.1.1 Special Pass 2 Optimization

Note that for k = 2, the 2-candidate set C2 is simply a join of F1 with itself. Therefore,

we can specially optimize pass 2 by replacing the join with C2 by a join with F1 before

the table function (see Figure 4.2). That way, the table function gets only frequent items

and generates significantly fewer 2-item combinations. This optimization can be useful for

other passes too but unlike for pass 2 we still have to do the join with Ck.
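The effect of filtering with F1 before generating pairs can be seen in a small Python sketch. The function name and the data are hypothetical; the sketch assumes rows sorted by (tid, item) with each (tid, item) pair occurring once, and it also returns the number of pairs generated so that the saving is visible.

```python
from collections import Counter
from itertools import combinations, groupby

def pass2_with_f1_filter(rows, minsup):
    """Pass 2 with the F1 filter applied before pair generation."""
    item_counts = Counter(item for _, item in rows)
    f1 = {item for item, n in item_counts.items() if n > minsup}
    filtered = [(tid, item) for tid, item in rows if item in f1]  # join of T with F1
    pair_counts = Counter()
    generated = 0
    for _tid, group in groupby(filtered, key=lambda r: r[0]):
        for pair in combinations([item for _, item in group], 2):
            pair_counts[pair] += 1
            generated += 1
    f2 = {pair: n for pair, n in pair_counts.items() if n > minsup}
    return f2, generated
```

In the toy data used to check this sketch, the filter reduces the number of generated pairs from 7 to 3 without changing F2.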

4.1.2 Variations of GatherJoin Approach

GatherCount

One variation of the GatherJoin approach for pass two is the GatherCount approach

where we push the group-by inside the table function instead of doing it outside. The

candidate 2-itemsets (C2) are represented as a two dimensional array inside the modified












insert into Fk select item1, ..., itemk, count(*)
from Ck, (select t2.T_itm1, ..., t2.T_itmk from T,
          table (Gather(T.tid, T.item)) as t1,
          table (Comb-K(t1.tid, t1.item-list)) as t2)
where t2.T_itm1 = Ck.item1 and
      ...
      t2.T_itmk = Ck.itemk
group by Ck.item1, ..., Ck.itemk
having count(*) > :minsup

[Tree diagram, bottom to top: T, ordered by (tid, item); table function Gather; table function Comb-K; join with Ck on t2.T_itm1 = Ck.item1, ..., t2.T_itmk = Ck.itemk; group by item1, ..., itemk; having count(*) > :minsup.]

Figure 4.1. Support counting by GatherJoin


table function Gather-Cnt for doing the support counting. Instead of outputting the 2-item combinations of a transaction, it uses them directly to update support counts in memory and outputs only the frequent 2-itemsets, F2, and their support after the last transaction. Thus, the table function Gather-Cnt is an extension of the GatherComb-2 table function used in GatherJoin.

The absence of the outer grouping makes this option rather attractive. The UDF code

is also small since it only needs to maintain a 2D array. We could apply the same trick for

subsequent passes but the coding becomes considerably more complicated because of the












insert into F2 select tt2.T_itm1, tt2.T_itm2, count(*)
from (select * from T, F1 where T.item = F1.item1) as tt1,
     table (GatherComb-2(tid, item)) as tt2
group by tt2.T_itm1, tt2.T_itm2
having count(*) > :minsup

[Tree diagram, bottom to top: join of T and F1 on T.item = F1.item1; table function GatherComb-2 (output tt2); group by tt2.T_itm1, tt2.T_itm2; having count(*) > :minsup.]

Figure 4.2. Support counting by GatherJoin in the second pass


need to maintain hash-tables to index the Ck's. The disadvantage of this approach is that

it can require a large amount of memory to store C2. If enough memory is not available,

C2 needs to be partitioned and the process has to be repeated for each partition. Another

serious problem with this approach is that it cannot be automatically parallelized unlike

the other approaches.
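The in-memory counting of GatherCount can be sketched as follows in Python. This is an illustration only: the name gather_count and the data layout are assumptions, the two-dimensional array is indexed by positions of the frequent items, and the partitioning of C2 for limited memory is not shown.

```python
from itertools import combinations, groupby

def gather_count(rows, frequent_items, minsup):
    """Count 2-itemset supports in a 2D array and return only the frequent pairs."""
    names = sorted(frequent_items)
    index = {item: i for i, item in enumerate(names)}
    n = len(names)
    counts = [[0] * n for _ in range(n)]  # 2D array standing in for C2
    for _tid, group in groupby(rows, key=lambda r: r[0]):
        ids = sorted(index[item] for _, item in group if item in index)
        for a, b in combinations(ids, 2):
            counts[a][b] += 1  # update the support count in place
    # Output happens only once, after the last transaction has been processed.
    return {
        (names[a], names[b]): counts[a][b]
        for a in range(n)
        for b in range(a + 1, n)
        if counts[a][b] > minsup
    }
```

The array has one cell per pair of frequent items, which is why memory use grows quadratically with the number of frequent items.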


GatherPrune


A potential problem with the GatherJoin approach is the high cost of joining the

large number of item combinations with Ck. We can push the join with Ck inside the table

function and thus reduce the number of such combinations. Ck is converted to a BLOB

and passed as an argument to the table function.

The cost of passing the BLOB for every tuple of T could be high. In general, we can reduce the parameter passing cost by using a smaller BLOB that only approximates the real Ck. The trade-off is increased cost for other parts, notably grouping, because not as many combinations are filtered.

Horizontal

Another variation of GatherJoin is the Horizontal approach that first uses the

Gather function to transform the data to the horizontal format but is otherwise similar

to the GatherJoin approach.

Rajamani et al. [105] propose finding associations using an approach quite similar to

this Horizontal approach. They assume (rather unrealistically) that the data is already in

a horizontal format. However, they do not use the frequent item-set filtering optimization

we outlined for pass 2. Without this optimization, the time for pass 2 for most real-life

datasets blows up even for relatively high support values. Also, at the time of candidate

generation, rather than doing self-join on Fk-1, they join Fk-1 with F1, thereby generating

considerably more combinations than needed. Thus, the approach in Rajamani et al. [105]

is likely to perform worse than Horizontal.

4.1.3 Cost Analysis of GatherJoin and its Variants

The choice of the best approach depends on a number of data characteristics like the

number of items, total number of transactions, average length of a transaction and so on.

We express the costs of different approaches in each pass in terms of parameters that are

known or can be estimated after the candidate generation step of each pass. We include

only the terms that were found to be the dominant part of the total cost in practice. We

use the notations of Table 3.2 in the cost analysis.

The cost of GatherJoin includes the cost of generating k-item combinations, joining with Ck and grouping to count the support. The number of k-item combinations generated, Tk, is C(N, k) × T. The join with Ck filters out the non-candidate item combinations. The size of the join result is the sum of the support of all the candidates, denoted by S(Ck). The actual value of the support of a candidate itemset will be known only after the support counting phase. However, we get a good estimate by approximating it to the minimum of the support of all its (k-1)-subsets in Fk-1. The total cost of the GatherJoin approach is

$$T_k \times t_k + \mathrm{join}(T_k, C_k, S(C_k)) + \mathrm{group}(S(C_k), C_k), \quad \text{where } T_k = C(N, k) \times T$$



The above cost formula needs to be modified to reflect the special optimization of joining with F1 to consider only frequent items. We need a new term join(R, F1, Rf) and need to change the formula for Tk to include only frequent items Nf instead of N.

For the second pass, we do not need the outer join with Ck. The total cost of GatherJoin in the second pass is

$$\mathrm{join}(R, F_1, R_f) + T_2 \times t_2 + \mathrm{group}(T_2, C_2), \quad \text{where } T_2 = C(N_f, 2) \times T \approx \frac{N_f^2}{2} \times T$$

The cost of GatherCount in the second pass is similar to that for basic GatherJoin except for the final grouping cost. In this formula, "groupint" denotes the cost of doing the support counting inside the table function.

$$\mathrm{join}(R, F_1, R_f) + \mathrm{groupint}(T_2, C_2) + F_2 \times t_2$$

For GatherPrune the cost equation is

$$R \times \mathrm{blob}(k \times C_k) + S(C_k) \times t_k + \mathrm{group}(S(C_k), C_k).$$

We use blob(k × Ck) for the BLOB passing cost since each itemset in Ck contains k items.

The cost estimate of Horizontal is similar to that of GatherJoin except that here

the data is materialized in the horizontal format before generating the item combinations.
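Two of the quantities used in the cost formulae of this section can be computed directly, as the following Python sketch shows: the estimate of S(Ck) obtained from the minimum support of the (k-1)-subsets of each candidate, and the number of combinations C(N, k) × T. The function names are hypothetical, and N is assumed to be an integer here because math.comb requires integer arguments, whereas in the analysis it is an average.

```python
from itertools import combinations
from math import comb

def estimate_candidate_support(candidate, prev_support):
    """Estimate a k-item candidate's support as the minimum over its (k-1)-subsets."""
    k = len(candidate)
    return min(prev_support[subset] for subset in combinations(candidate, k - 1))

def estimate_total_support(candidates, prev_support):
    """Estimated S(C_k): sum of the per-candidate estimates."""
    return sum(estimate_candidate_support(c, prev_support) for c in candidates)

def combinations_generated(items_per_transaction, k, num_transactions):
    """T_k = C(N, k) * T for an integer N."""
    return comb(items_per_transaction, k) * num_transactions
```

With 10 items per transaction and 100,000 transactions, the second pass would generate 45 × 100,000 = 4,500,000 pairs, which illustrates why restricting N to the frequent items Nf matters.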


4.2 Vertical

In this approach, we first transform the data table into a vertical format by creating for

each item a BLOB containing all tids that contain that item (Tid-list creation phase) and

then count the support of itemsets by merging together these tid-lists (support counting

phase). This approach is related to the approaches discussed in Savasere et al. [111] and

Zaki et al. [131].

For creating the Tid-lists we use a table function Gather. This is the same as the

Gather function in GatherJoin except that here we create the tid-list for each frequent

item. The data table T is scanned in the (item, tid) order and passed to the function

Gather. The function collects the tids of all tuples of T with the same item in memory and

outputs an (item, tid-list) tuple for items that meet the minimum support criterion. The tid-lists are represented as BLOBs and stored in a new TidTable with attributes (item, tid-list).

The SQL query which does the transformation to vertical format is given in Figure 4.3.

insert into TidTable select item, tid-list
from (select * from T order by item, tid) as tt1,
     table (Gather(item, tid, :minsup)) as tt2

[Tree diagram, bottom to top: T, ordered by (item, tid); table function Gather (output tt2); output tt2.item, tt2.tid-list.]

Figure 4.3. Tid-list creation
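The tid-list creation step can be mimicked in a few lines of Python. This is a sketch under assumptions: build_tid_table is a hypothetical name, a dictionary of sorted lists stands in for the TidTable of BLOBs, and the strict ">" follows the support filters in the queries of this chapter.

```python
from itertools import groupby

def build_tid_table(rows, minsup):
    """rows: (tid, item) pairs. Returns {item: sorted tid-list} for frequent items."""
    by_item = sorted((item, tid) for tid, item in rows)  # scan in (item, tid) order
    tid_table = {}
    for item, group in groupby(by_item, key=lambda r: r[0]):
        tids = [tid for _, tid in group]
        if len(tids) > minsup:  # keep only items meeting the support criterion
            tid_table[item] = tids
    return tid_table
```

Because the scan is in (item, tid) order, every tid-list comes out sorted, which is what makes the later merge-based intersection possible.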


In the support counting phase, conceptually for each itemset in Ck we want to collect

the tid-lists of all k items and use a UDF to count the number of tids in the intersection










of these k lists. The tids are in the same sorted order in all the tid-lists and therefore the

intersection can be done easily and efficiently by a single pass of the k lists. This conceptual

step can be improved further by decomposing the intersect operation so that we can share

these operations across itemsets having common prefixes as follows:

We first select distinct (item1, item2) pairs from Ck. For each distinct pair we first perform the intersect operation to get a new result-tidlist, then find distinct triples (item1, item2, item3) from Ck with the same first two items, intersect result-tidlist with the tid-list for item3 for each triple, and continue with item4 and so on until all k tid-lists per itemset are intersected.

The above sequence of operations can be written as a single SQL query for any k as

shown in Figure 4.4. The final intersect operation can be merged with the count operation

to return a count instead of the tid-list. We do not include this optimization in the query

of Figure 4.4 for simplicity.
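The sharing of intersections across candidates with common prefixes can be sketched in Python as below. The names intersect and count_support are assumptions, a dictionary cache plays the role of the nested subqueries, and the second return value reports how many intersections were actually performed.

```python
def intersect(a, b):
    """Merge-intersect two sorted tid-lists in a single pass."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def count_support(tid_table, candidates):
    """Support of each candidate (a tuple of items), sharing work across prefixes."""
    cache = {}  # prefix of length >= 2 -> intersected tid-list

    def tidlist(prefix):
        if len(prefix) == 1:
            return tid_table[prefix[0]]
        if prefix not in cache:
            cache[prefix] = intersect(tidlist(prefix[:-1]), tid_table[prefix[-1]])
        return cache[prefix]

    supports = {c: len(tidlist(c)) for c in candidates}
    return supports, len(cache)
```

For the two candidates (a, b, c) and (a, b, d), the tid-list for the prefix (a, b) is computed once and reused, giving three intersections instead of four.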

4.2.1 Special Pass 2 Optimization


For pass 2 we need not generate C2 and join the TidTables with C2. Instead, we

perform a self-join on the TidTable using predicate tl.item < t2.item.


insert into F2 select item1, item2, cnt
from (select t1.item as item1, t2.item as item2,
             CountIntersect(t1.tid-list, t2.tid-list) as cnt
      from TidTable t1, TidTable t2
      where t1.item < t2.item) as t
where cnt > :minsup
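A Python sketch of this second pass is given below. It is only illustrative: count_intersect stands in for the CountIntersect UDF, pass2_vertical is a hypothetical name, and the self-join with the predicate t1.item < t2.item is written as a loop over ordered pairs of frequent items.

```python
from itertools import combinations

def count_intersect(a, b):
    """Count the tids common to two sorted tid-lists with a single merge pass."""
    count, i, j = 0, 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            count += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return count

def pass2_vertical(tid_table, minsup):
    """Self-join of the TidTable on t1.item < t2.item, keeping the frequent pairs."""
    f2 = {}
    for x, y in combinations(sorted(tid_table), 2):
        cnt = count_intersect(tid_table[x], tid_table[y])
        if cnt > minsup:
            f2[(x, y)] = cnt
    return f2
```

The loop visits every pair of frequent items, so the number of calls to the counting function grows quadratically with the size of F1.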














insert into Fk select item1, ..., itemk, count(tid-list) as cnt
from (Subquery Qk) t where cnt > :minsup

Subquery Q_l (for any l between 2 and k):

select item1, ..., item_l, Intersect(r_l-1.tid-list, t_l.tid-list) as tid-list
from TidTable t_l, (Subquery Q_l-1) as r_l-1,
     (select distinct item1, ..., item_l from Ck) as d_l
where r_l-1.item1 = d_l.item1 and ... and
      r_l-1.item_l-1 = d_l.item_l-1 and
      t_l.item = d_l.item_l

Subquery Q1: (select * from TidTable)

[Tree diagram for Subquery Q_l: Subquery Q_l-1 is joined with (select distinct item1, ..., item_l from Ck) on r_l-1.item1 = d_l.item1, ..., r_l-1.item_l-1 = d_l.item_l-1; the result is joined with TidTable t_l on t_l.item = d_l.item_l; the Intersect UDF is applied to r_l-1.tid-list and t_l.tid-list; the output is item1, ..., item_l, tid-list.]

Figure 4.4. Support counting using UDF









4.2.2 Cost Analysis

The cost of the Vertical approach during support counting phase is dominated by

the cost of invoking the UDFs and intersecting the tid-lists. The UDF is first called for

each distinct item pair in Ck, then for each distinct item triple with the same first two

items and so on. Let $d^k_j$ be the number of distinct j-item tuples in Ck. Then the total number of times the UDF is invoked is $\sum_{j=2}^{k} d^k_j$. In each invocation two BLOBs of tid-lists are passed as arguments. The UDF intersects the tid-lists by a merge pass and hence the cost is proportional to 2 × the average length of a tid-list. The average length of a tid-list can be approximated by $R_f / F_1$. Note that with each intersect the tid-list keeps shrinking. However,

we ignore such effects for simplicity.
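The number of UDF invocations in this analysis, the sum over j from 2 to k of the number of distinct j-item prefixes in Ck, can be computed as in the following Python sketch. The function name is hypothetical, and the candidates are assumed to be tuples of equal length k with items in a fixed order.

```python
def udf_invocations(candidates):
    """Sum over j = 2..k of the number of distinct j-item prefixes in C_k."""
    k = len(candidates[0])
    return sum(len({c[:j] for c in candidates}) for j in range(2, k + 1))
```

For the candidates (a, b, c), (a, b, d) and (a, c, d) there are two distinct pairs and three distinct triples, giving five invocations.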

In addition to the intersect cost, the total cost includes the cost of the joins in the query. The join cost of subquery Q_l can be recursively defined as

$$C(Q_l) = \mathrm{trijoin}(F_1, Q_{l-1}, C_k, d^k_l) + C(Q_{l-1})$$

where trijoin(p, q, r, s) denotes the cost of joining three relations of size p, q, r respectively, producing a result of size s. The exact cost of the 3-way join will depend on the join order. The cost of subquery Q1 is the cost of scanning the TidTable, which has F1 tuples. The result size of the subquery Q_l is $d^k_l$ and the result size of Q1 is F1. The total cost of the Vertical approach is

$$C(Q_k) + \Big(\sum_{j=2}^{k} d^k_j\Big) \times \Big\{ 2 \times \mathrm{Blob}\Big(\frac{R_f}{F_1}\Big) + \mathrm{Intersect}\Big(\frac{2 R_f}{F_1}\Big) \Big\}$$









In the formula above Intersect(n) denotes the cost of intersecting two tid-lists with

a combined size of n. The total cost is dominated by the intersect cost and join costs

account for only a small fraction. Therefore, we can safely ignore the join costs in the

above formulae.

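The total cost of the second pass is

$$\mathrm{join}\Big(F_1, F_1, \frac{(F_1)^2}{2}\Big) + C_2 \times \Big\{ 2 \times \mathrm{Blob}\Big(\frac{R_f}{F_1}\Big) + \mathrm{Intersect}\Big(\frac{2 R_f}{F_1}\Big) \Big\}$$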
2 Fi F,


4.3 SQL-Bodied Functions

This approach is based on SQL-bodied functions commonly known as SQL/PSM [87]. SQL/PSM extends SQL with additional control structures. We make use of one such construct, for ... do ... end for.

We use the for construct to scan the transaction table T in the (tid, item) order. Then,

for each tuple (tid, item) of T, we update those tuples of Ck that contain one matching

item. Ck is extended with 3 extra attributes (prevTid, match, supp). The prevTid attribute

keeps the tid of the previous tuple of T that matched that itemset. The match attribute

contains the number of items of prevTid matched so far and supp holds the current support

of that itemset. On each column of Ck an index is built to do a searched update.


for this as select * from T do
    update Ck set prevTid = tid,
        match = case when tid = prevTid then match+1 else 1 end,
        supp = case when match = k-1 and tid = prevTid then supp+1
                    else supp end
    where item = item1 or
          ...
          item = itemk
end for

insert into Fk select item1, ..., itemk, supp
from Ck where supp > :minsup
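The bookkeeping done by the update statement can be traced with a small Python simulation. This is a sketch under assumptions: sbf_support_count is a hypothetical name, a dictionary holds the three extra attributes of each candidate, a linear scan over the candidates replaces the indexed searched update, and each (tid, item) pair is assumed to occur once in a table scanned in (tid, item) order.

```python
def sbf_support_count(rows, candidates, minsup):
    """Simulate the per-tuple update of (prevTid, match, supp) on each candidate."""
    k = len(candidates[0])
    state = {c: {"prevTid": None, "match": 0, "supp": 0} for c in candidates}
    for tid, item in rows:  # scan T in (tid, item) order
        for cand, s in state.items():
            if item not in cand:
                continue
            same_tid = tid == s["prevTid"]
            # Right-hand sides use the old column values, as in an SQL UPDATE.
            new_supp = s["supp"] + 1 if (s["match"] == k - 1 and same_tid) else s["supp"]
            new_match = s["match"] + 1 if same_tid else 1
            s["prevTid"], s["match"], s["supp"] = tid, new_match, new_supp
    return {c: s["supp"] for c, s in state.items() if s["supp"] > minsup}
```

Every matching tuple of T touches all three attributes of every candidate containing that item, which is the source of the update volume discussed in the cost estimate that follows.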


The cost of this approach can be mainly attributed to the cost of updates to the

candidate table Ck. For each tuple of the data table T, for all the candidate itemsets in Ck

which contains that item, three updates are performed (the attributes prevTid, match, supp

of the itemset are updated). If Nk is the average number of k-item candidates containing any given item, the total number of updates is 3 × R × Nk. The cost due to updates for this approach is U(3 × R × Nk), where U(n) is the cost of n updates. If the updates are logged, this cost includes the logging cost also.


4.4 Performance Comparison

We studied the performance of six approaches in this category: GatherJoin and its variants GatherPrune, Horizontal and GatherCount; Vertical; and SBF. We used the four

datasets summarized in Table 3.1. In Figure 4.5 we show the performance of only the four

approaches: GatherJoin, GatherCount, GatherPrune and Vertical. For the other

two approaches the running times were comparatively so large that we had to abort the

runs in many cases. The main reason why the Horizontal approach was significantly

worse than the GatherJoin approach was the time to transform the data to the horizontal

format. For instance, for Dataset-C it was 3.5 hours which is almost 20 times more than

the time taken by Vertical for 2% support. For Dataset-B the process was aborted












after running for 5 hours. After the transformation, compared to GatherJoin the time


taken by Horizontal was also significantly worse when run without the frequent itemset


filtering optimization, but with the optimization the performance was comparable. The SBF approach had significantly worse performance because of the expensive index ORing of the k join predicates. Another problem with this approach is the large number of updates


to the Ck table. In DB2, all of these updates are logged resulting in severe performance


degradation.


[Four stacked bar charts, one each for Dataset-A, Dataset-B, Dataset-C and Dataset-D, showing running time broken down into Prep and the individual passes at several support values.]

Figure 4.5. Comparison of four SQL-OR approaches: Vertical, GatherPrune, GatherJoin and GatherCount










Figure 4.5 shows the total running time of the different approaches. The time taken

is broken down by each pass and an initial "prep" stage where any one-time data transfor-

mation cost is included. We can make several observations from the experimental results.

First, let us concentrate on the overall comparison between the different approaches. Then

we will compare the approaches based on how they perform in each pass of the algorithm.

Overall, the Vertical approach has the best performance and is sometimes more than

an order of magnitude better than the other three approaches.

In most cases, the majority of the time of the Vertical approach is spent in transforming the data to the vertical format (shown as "prep" in Figure 4.5). The vertical

representation is like an index on the item attribute. If we think of this time as a one-

time activity like index building then performance looks even better. Note that the time

to transform the data to the Vertical format was much smaller than the time for the

horizontal format although both formats write almost the same amount of data. The main

reason was the difference in the number of records written. The number of frequent items

is often two to three orders of magnitude smaller than the number of transactions.

Between GatherJoin and GatherPrune, neither strictly dominates the other. The

special optimization in GatherJoin of pruning based on F1 had a big impact on perfor-

mance. With this optimization, for Dataset-B with support 0.1%, the running time for

pass 2 alone was reduced from 5.2 hours to 10 minutes.

When we compare these different approaches based on time spent in each pass we

observe that no single approach is "the best" for all different passes of the different datasets

especially for the second pass.

For pass three onwards, Vertical is often two or more orders of magnitude better than

the other approaches. Even in cases like Dataset-B, support 0.01% where it spends three











hours in the second pass, the total time for next two passes is only 14 seconds whereas it is


more than an hour for the other two approaches. For subsequent passes the performance


degrades dramatically for GatherJoin, because the table function Gather-Comb-K generates


a large number of combinations. For instance, for Dataset-C, even at a support value of 2%, pass 3 of GatherJoin did not complete after 5.2 hours, whereas for Vertical pass 3 finished in 0.2 seconds. GatherPrune is better than GatherJoin for the third and later passes. For


pass 2 GatherPrune is worse because the overhead of passing a large object as an argument


dominates cost.


The Vertical approach sometimes ended up spending too much time in the second


pass. In some of these cases the GatherJoin approach was better in the second pass


(for instance for low support values of Dataset-B) whereas in other cases (for instance,


Dataset-C support 0.25%) GatherCount was the only good option. For this case neither


GatherPrune nor GatherJoin completed pass 2 even after more than six hours.


Further, they caused a storage overflow error because of the large size of the intermediate


results to be sorted. We had to divide the dataset into four equal parts and run the second


pass independently on each partition to avoid this problem.

[Figure 4.6 plot: time spent in pass 2 for the Vertical and GatherJoin approaches (y-axis) against average transaction length, 0 to 60 (x-axis).]

Figure 4.6. Effect of increasing transaction length (average number of items per transaction)










Two factors that affect the choice amongst the Vertical, GatherJoin and GatherCount

approaches in different passes, and pass 2 in particular, are the number of frequent items (F1)

and the average number of frequent items per transaction (Nf). From the graphs in Figure 4.5

we notice that as the support value is decreased for each dataset, causing

the size of F1 to increase, the performance of pass 2 of the Vertical approach degrades

rapidly. This trend is also clear from our cost formulae. The cost of the Vertical approach

increases quadratically with F1. GatherJoin depends more critically on the number of fre-

quent items per transaction. For Dataset-B even when the size of F1 increases by a factor

of 10, the value of Nf remains close to 2, therefore the time taken by GatherJoin does not

increase as much. However, for Dataset-C the size of Nf increases from 3.2 to 10 as the

support is decreased from 2.0% to 0.25% causing GatherJoin to deteriorate rapidly. From

the cost formula for GatherJoin we notice that the total time for pass 2 increases almost

quadratically with Nf.
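The quadratic trends can be illustrated with a small calculation. The sketch below assumes that the dominant pass-2 terms are proportional to the number of 2-item combinations: pairs of frequent items for Vertical, and pairs of frequent items within a transaction for GatherJoin. This is a simplification of the cost formulae, and the helper name `pairs` and the base value of 1000 frequent items are hypothetical; the Nf values 3.2 and 10 are the ones quoted above for Dataset-C.

```python
def pairs(n):
    """Number of 2-item combinations out of n items, n*(n-1)/2 (n may be an average)."""
    return n * (n - 1) / 2.0

# GatherJoin: Nf grows from 3.2 to 10 for Dataset-C as support drops.
growth = pairs(10.0) / pairs(3.2)

# Vertical: a tenfold increase in F1 (hypothetical base of 1000 frequent items).
vertical_growth = pairs(10000) / pairs(1000)

print(f"pairs per transaction grow by a factor of {growth:.1f}")
print(f"candidate pairs grow by a factor of {vertical_growth:.1f}")
```

Under these assumptions a threefold increase in Nf multiplies the per-transaction work by more than ten, and a tenfold increase in F1 multiplies the number of candidate pairs by about one hundred.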

We validate this observation further by running experiments on synthetic datasets

for varying values of the number of frequent items per transaction. We used the synthetic

dataset generator described in Agrawal et al. [9] for this purpose. We varied the transaction

length, the number of transactions and the support values while keeping the total number

of records and the number of frequent items fixed. In Figure 4.6 we show the total time

spent in pass 2 of the Vertical and GatherJoin approaches. As the number of items per

transaction (transaction length) increases, the cost of Vertical remains almost unchanged

whereas the cost of GatherJoin increases.









4.5 Final Hybrid Approach

The previous performance section helps us draw the following conclusion: Overall, the

Vertical approach is the best option especially for higher passes. When the size of the

candidate itemsets is too large, the performance of the Vertical approach could suffer.

In such cases, GatherJoin is a good option as long as the number of frequent items per

transaction (Nf) is not too large. When Nf is too large, GatherCount may be the only

good option even though it may not be easily parallelizable. These qualitative differences are

captured by the cost formulae we presented earlier and are used by our final hybrid scheme.

The hybrid scheme chooses the best of the three approaches GatherJoin, GatherCount

and Vertical for each pass based on the cost estimation outlined in the previous sections.

The parameter values used for the estimation are all available at the end of the previous

pass. In Section 4.6 we plot the final running time for the different datasets based on this

hybrid approach.
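The per-pass selection can be sketched as follows. This is a minimal illustration and not the actual implementation: the function `choose_approach` and the numeric estimates are hypothetical stand-ins for the cost formulae, which would be evaluated from the statistics available at the end of the previous pass.

```python
def choose_approach(estimates):
    """Return the approach with the smallest estimated cost for the next pass."""
    return min(estimates, key=estimates.get)

# Hypothetical cost estimates in arbitrary units (not measured values).
pass_estimates = [
    {"Vertical": 900.0, "GatherJoin": 120.0, "GatherCount": 400.0},  # pass 2
    {"Vertical": 5.0, "GatherJoin": 800.0, "GatherCount": 300.0},    # pass 3
]

plan = [choose_approach(e) for e in pass_estimates]
print(plan)
```

With these made-up numbers the scheme would run GatherJoin in pass 2 and switch to Vertical in pass 3, which mirrors the qualitative behavior described above.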


4.6 Architecture Comparisons

In this section our goal is to compare five architectural alternatives: Loose-coupling,

Stored-procedure, Cache-Mine, UDF, and the best SQL implementation.

For Loose-coupling, we use the implementation of the Apriori algorithm [9] for find-

ing association rules provided with the IBM data mining product, Intelligent Miner [71].

For Stored-procedure, we extracted the Apriori implementation in Intelligent Miner and

created a stored procedure out of it. The stored procedure is run in the unfenced mode

in the database address space. For Cache-Mine, we used an option provided in Intelligent

Miner that causes the input data to be cached as a binary file after the first scan of the data









from the DBMS. The data is copied in the horizontal format where each tid is followed by

an encoding of all its frequent items.

For the UDF-architecture, we use the UDF implementation of the Apriori algorithm

described in Agrawal and Shim [12]. In this implementation, first a UDF is used to initialize

state and allocate memory for candidate itemsets. Next, for each pass a collection of UDFs

is used for generating candidates, counting support, and checking for termination. These

UDFs access the initially allocated memory, the address of which is passed around in BLOBs.

Candidate generation creates the in-memory hash-trees of candidates. This happens en-

tirely in the UDF without any involvement of the DBMS. During support counting, the

data table is scanned sequentially and for each tuple a UDF is used for updating the counts

on the memory resident hash-tree.
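The division of labor among these UDFs can be sketched in Python. This is only an illustration under several assumptions: the class `MiningState` stands in for the shared memory block, a plain dictionary replaces the hash-tree, candidate generation is a simplified subset check rather than the exact Apriori join, and one call is made per transaction rather than per tuple. None of these names come from the implementation in [12].

```python
from itertools import combinations

class MiningState:
    """Stands in for the memory block whose address the UDFs share."""
    def __init__(self):
        self.counts = {}  # candidate itemset (tuple) -> support count

def gen_candidates(state, frequent_prev, k):
    """Keep every k-itemset whose (k-1)-subsets are all frequent."""
    items = sorted({i for s in frequent_prev for i in s})
    prev = set(frequent_prev)
    state.counts = {c: 0 for c in combinations(items, k)
                    if all(sub in prev for sub in combinations(c, k - 1))}

def count_support(state, transaction, k):
    """Called once per transaction during the sequential scan."""
    for c in combinations(sorted(transaction), k):
        if c in state.counts:
            state.counts[c] += 1

def frequent(state, min_count):
    """Return the candidates that met the support threshold, in sorted order."""
    return sorted(c for c, n in state.counts.items() if n >= min_count)

# Hypothetical toy data for pass 2.
transactions = [{1, 2, 3}, {1, 2}, {2, 3}, {1, 2, 3}]
state = MiningState()
gen_candidates(state, [(1,), (2,), (3,)], k=2)
for t in transactions:
    count_support(state, t, k=2)
print(frequent(state, min_count=3))
```

The algorithm would terminate when `frequent` returns an empty list, which corresponds to the termination check mentioned above.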

4.6.1 Timing Comparison

In Figure 4.7, we show the performance of the four architectural alternatives: Cache-Mine,

Stored-procedure, UDF and our best SQL implementation for the datasets in Table 3.1.

We do not show the times for the Loose-coupling option because its performance was very

close to the Stored-procedure option. For each dataset three different support values are

used. The total time is broken down by the time spent in each pass.

We can make the following observations.


* Cache-Mine has the best or close to the best performance in all cases. 80-90% of its

total time is spent in the first pass where data is accessed from the DBMS and cached

in the file system. Compared to the SQL approach this approach is a factor of 0.8 to

2 times faster.















[Figure 4.7 panels: one chart per dataset showing total time broken down by pass for each architecture. Support values: Dataset-A 0.50%, 0.35%, 0.20%; Dataset-B 0.1%, 0.03%, 0.01%; Dataset-C 2.0%, 1.0%, 0.25%; Dataset-D 0.2%, 0.05%, 0.02%.]

Figure 4.7. Comparison of four architectures

* The Stored-procedure approach is the worst. The difference between Cache-Mine

and Stored-procedure is directly related to the number of passes. For instance,

for Dataset-A the number of passes increases from two to three when decreasing

support from 0.5% to 0.35%, causing the time taken to increase from two to three

times that of Cache-Mine. The time spent in each pass for Stored-procedure is the same except when

the algorithm makes multiple passes over the data since all candidates could not

fit in memory together. This happens for the lowest support values of Dataset-B,

Dataset-C and Dataset-D. Time taken by Stored-procedure can be expressed

approximately as number of passes times time taken by Cache-Mine.


* UDF is similar to Stored-procedure. The only difference is that the time per pass

decreases by 30-50% for UDF because of closer coupling with the database.


* The SQL approach comes second in performance after the Cache-Mine approach for

low support values and is even somewhat better for high support values. The cost

of converting the data to the vertical format for SQL is typically lower than the cost

of transforming data to binary format outside the DBMS for Cache-Mine. However,

after the initial transformation subsequent passes take negligible time for Cache-Mine.

For the second pass SQL takes significantly more time than Cache-Mine particularly

when we decrease support. For subsequent passes even the SQL approach does not

spend too much time. Therefore, the difference between Cache-Mine and SQL is not

very sensitive to the number of passes because both approaches spend negligible time

in higher passes.

The SQL approach is 1.8 to 3 times better than Stored-procedure or Loose-coupling

approach. As we decreased the support value so that the number of passes over the

dataset increases, the gap widens. Note that we could have implemented a Stored-

procedure using the same hybrid algorithm that we used for SQL instead of using the

IM algorithm. Then, we expect the performance of Stored-procedure to improve

because the number of passes to the data will decrease. However, we will pay the

storage penalty of making an additional copy of the data as we did in the Cache-Mine

approach. The performance of Stored-procedure cannot be better than Cache-Mine










because, as we have observed, most of the time of Cache-Mine is spent in the first

pass which cannot be changed for Stored-procedure.
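The approximate relationships stated in the observations above can be written down as a back-of-the-envelope model. The function names and the input of 100 seconds are hypothetical; only the proportionalities (number of passes times the Cache-Mine time, and a 30-50% lower time per pass for UDF) come from the text.

```python
def stored_procedure_time(cache_mine_time, passes):
    """Approximation from the text: number of passes times the Cache-Mine time."""
    return passes * cache_mine_time

def udf_time_range(cache_mine_time, passes):
    """UDF time per pass is 30-50% lower than for Stored-procedure."""
    sp = stored_procedure_time(cache_mine_time, passes)
    return (0.5 * sp, 0.7 * sp)

# Hypothetical input: Cache-Mine takes 100 seconds and the algorithm makes 3 passes.
sp_estimate = stored_procedure_time(100.0, 3)
udf_low, udf_high = udf_time_range(100.0, 3)
print(sp_estimate, udf_low, udf_high)
```

This is a rough model only; it ignores the extra scans that occur when the candidates do not fit in memory.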


4.6.2 Scale-up experiment


Our experiments with the four real-life datasets above have shown the scaling properties

of the different approaches with decreasing support value and increasing number of frequent

itemsets. We experiment with synthetic datasets to study other forms of scaling: increasing

number of transactions and increasing average length of transactions. In Figure 4.8 we show

how Stored-procedure, Cache-Mine and SQL scale with increasing number of transactions.

UDF and Loose-coupling have scale-up behavior similar to that of Stored-procedure; therefore

we do not show these approaches in the figure. We used a dataset with an average of 10

items per transaction, 100 thousand total items and a default pattern length (defined in [9])

of 4. Thus, the size of the dataset is 10 times the number of transactions. As the number of

transactions is increased from 10K to 3000K the time taken increases proportionately. The

largest frequent itemset was 5 long. This explains the five fold difference in performance

between the Stored-procedure and the Cache-Mine approach. Figure 4.9 shows the scaling

when the transaction length changes from 3 to 50 while keeping the number of transactions

fixed at 100K. All three approaches scale linearly with increasing transaction length.









[Figure 4.8 plot: time taken against number of transactions for Stored-procedure, Cache-Mine and SQL.]

Figure 4.8. Scale-up with increasing number of transactions











[Figure 4.9 plot: time taken for Cache-Mine, Stored-procedure and SQL (y-axis) against average transaction length, 0 to 60 (x-axis).]

Figure 4.9. Scale-up with increasing transaction length



4.6.3 Impact of longer names


In these experiments we assumed that the tids and item-ids are all integers. Often in


practice these are character strings longer than four characters. Longer strings need more


storage and cost more during comparisons. This could hurt all four of the alternatives.


For the Stored-procedure, UDF and Cache-Mine approaches, the time taken to transfer data


will increase. The Intelligent Miner code [71] maps all character fields to integers using an


in-memory hash-table. Therefore, beyond the increase in the data transfer and mapping


costs (which account for the bulk of the time), we do not expect the processing time of


these three alternatives to increase. For the SQL approach we cannot assume an in-memory


hash-table for doing the mapping therefore we use an alternative approach based on table


functions.


For the SQL approach, we discuss the hybrid approach. The two (already expensive) steps


that could suffer because of longer names are (1) final group-bys during pass 2 or higher


when the GatherJoin approach is chosen and (2) tid-list operations when the Vertical


approach is chosen. For efficient performance, the first step requires a mapping of item-ids


and the second one requires us to map tids. We use a table function to map the tids to









unique integers efficiently in one pass and without making extra copies. The input to the

table function is the data table in the tid order. The table function remembers the previous

tid and maintains a counter. Every time the tid changes, the counter is incremented.

This counter value is the mapping assigned to each tid. We need to do the tid mapping only

once before creating the TidTable in the Vertical approach and therefore we can pipeline

these two steps. The item mapping is done slightly differently. After the first pass, we add

a column to F1 containing a unique integer for each item. We do the same for the TidTable.

The GatherJoin approach already joins the data table T with F1 before passing to table

function Gather. Therefore, we can pass to Gather the integer mappings of each item from

F1 instead of its original character representation. After these two transformations, the tid

and item fields are integers for all the remaining queries including candidate generation and

rule generation. By mapping the fields this way, we expect longer names to have similar

performance impact on all of our architectural options.
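The two mappings described above can be sketched in Python. This is an illustration only and not the DB2 table function itself: the generator `map_tids`, the dictionary `item_id` and the sample rows are assumptions, and the input is assumed to arrive already sorted by tid.

```python
def map_tids(rows):
    """Yield (integer_tid, item) in one pass over rows sorted by tid."""
    previous = None
    counter = 0
    for tid, item in rows:
        if tid != previous:      # the tid changed: move to the next integer
            counter += 1
            previous = tid
        yield counter, item

# Hypothetical data table with character tids and item names, in tid order.
rows = [("T-0001", "milk"), ("T-0001", "bread"),
        ("T-0007", "milk"), ("T-0042", "eggs")]
mapped = list(map_tids(rows))

# Item mapping: a unique integer per frequent item, as in the extra column of F1.
frequent_items = ["bread", "milk"]
item_id = {item: n for n, item in enumerate(sorted(frequent_items), start=1)}

# Joining with F1 drops infrequent items and replaces names by integers.
encoded = [(tid, item_id[item]) for tid, item in mapped if item in item_id]
print(mapped)
print(encoded)
```

Because the generator keeps only the previous tid and a counter, it needs no extra copy of the data and can be pipelined into the step that builds the TidTable.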

4.6.4 Space Overhead of Different Approaches

In Figure 4.10 we summarize the space required for different datasets for three options:

Stored-procedure, Cache-Mine and SQL. For these experiments we assume that the tids

and items are integers. The first part refers to the space used in caching data and the

second part refers to any temporary space used by the DBMS for sorting or alternately

for constructing indices to be used during sorting. The size of the data is the same as the

space utilization of the Stored-procedure approach. The space requirement for UDF is

the same as that for Stored-procedure, which requires less space than the Cache-Mine and

SQL approaches. The Cache-Mine and SQL approaches have comparable storage overheads.

For Stored-procedure and UDF we do not need any extra storage for caching. However,









all three options Cache-Mine, Stored-procedure and UDF require data in each pass to be

grouped on the tid. In a relational DBMS we cannot assume any order on the physical

layout of a table, unlike in a file system. Therefore, we need either an index on the data

table or need to sort the table every time to ensure a particular order. Let R denote the

total number of (tid,item) pairs in the data table. Either option has a space overhead of

2 x R integers. The Cache-Mine approach caches the data in an alternative binary format

where each tid is followed by all the items it contains. Thus, the size of the cached data

in Cache-Mine is at most: R + T integers where T is the number of transactions. For SQL

we use the hybrid Vertical option. This requires creation of an initial TidTable of size at

most I + R where I is the number of items. Note that this is slightly less than the cache

required by the Cache-Mine approach. The SQL approach needs to sort data in pass 1 in

all cases and pass 2 in some cases where we used the GatherJoin approach instead of the

Vertical approach. This explains the large space requirement for Dataset-B. However, in

practice when the item-ids or tids are character strings instead of integers, the extra space

needed by Cache-Mine and SQL is a much smaller fraction of the total data size because

before caching we always convert item-ids to their compact integer representation and store

in binary format.
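The space bounds above can be collected into a small helper. The function name and the sample values of R, T and I are hypothetical; the three expressions are the upper bounds quoted in the text, counted in integers.

```python
def space_overheads(R, T, I):
    """Upper bounds, in integers, for the extra space discussed above.

    R: number of (tid, item) pairs, T: number of transactions, I: number of items.
    """
    return {
        "sort_or_index": 2 * R,       # sorting the table or indexing it on tid
        "cache_mine_cache": R + T,    # each tid followed by its items
        "sql_tidtable": I + R,        # one tid-list per item
    }

# Hypothetical dataset: one million records, 100K transactions, 1000 items.
overheads = space_overheads(R=1_000_000, T=100_000, I=1_000)
print(overheads)
```

Since the number of items is usually much smaller than the number of transactions, the TidTable bound is slightly below the Cache-Mine bound, as noted above.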

4.7 Summary of Comparison Between Different Architectures

In Table 4.1 we present a summary of the pros and cons of the different architectures

by ranking them on a scale of 1 (good) to 4 (bad) on each of the following yardsticks: (a)

performance (execution time); (b) storage overhead; (c) scope for automatic parallelization;

(d) development and maintenance ease; (e) portability (f) inter-operability.












[Figure 4.10 chart: space required for Dataset-A, Dataset-B, Dataset-C and Dataset-D, split into cache space and sort space.]

Figure 4.10. Comparison of different architectures on space requirements.


Table 4.1. Pros and cons of different architectural options ranked on a scale of 1 (good) to
4 (bad)

Metric                             Stored-proc.   UDF   Cache-Mine   SQL
Performance                        4              3     1            2
Storage overhead                   1              1     2            2-3
Automatic parallelism              2              2     2            1(?)
Development and maintenance ease   2              3     2            1-2
Portability                        1              3     1            2
Inter-operability                  2              2     2            1(?)



In terms of performance, the Cache-Mine approach is the best option followed by the

SQL approach. The SQL approach was within a factor of 0.8 to 2 of the Cache-Mine ap-

proach for all of our experiments. The UDF approach is better than the Stored-procedure

approach in performance by 30 to 50% but it loses on the metrics of development and

maintenance costs and portability. In terms of space requirements, the Cache-Mine and

the SQL approaches lose to the UDF or the Stored-procedure approach. Between the

Stored-procedure and the Cache-Mine implementation, the performance difference is ex-

actly a function of the number of passes made on the data; that is, if we make four passes









of the data the Stored-procedure approach is four times slower than Cache-Mine. There-

fore, if one is not willing to pay the penalty of extra storage the best strategy for improving

performance is to reduce the number of passes to the data even if it comes at the cost

of extra processing. Some of the recent proposals [26, 127] that attempt to minimize the

number of data passes to 2 or 3 might be useful in that regard.

The SQL approach is not the winner in terms of performance and space requirements

but it is competitive. The benefit of the SQL approach could arise from other secondary

advantages.

The SQL implementation has the potential for automatic parallelization. Parallelization

could come for free because the bulk of our processing is expressed in terms of standard

SQL queries. As long as the database supports efficient parallelization of these queries the

mining code can be easily parallelized. The problem case is where the UDFs use scratch

pads. The only such function in our queries is the Gather table function. This function

essentially implements a user defined aggregate, and would have been easy to parallelize if

the DBMS provided support for user defined aggregates or allowed explicit control from the

application about how to partition the data amongst different parallel instances of the func-

tion. For MPPs, one could rely on the DBMS to come up with a data partitioning strategy.

However, it might be possible to better tune performance if the application could provide

hints about the best partitioning [11] to use. Further experiments are required to assess how

the performance of these automatic parallelizations would compare with algorithm-specific

parallelizations [11].

The development time and code size using SQL could be shorter if one can get efficient

implementations out of expressing the mining algorithms declaratively using a few SQL

statements. Thus, one can avoid writing and debugging code for memory management,









indexing and space management all of which are already provided in a database system.

However, there are some detractors to easy development using the SQL alternative. First,

any attached UDF code will be harder to debug than stand-alone C++ code due to the lack

of debugging tools. Second, stand-alone code can be debugged and tested faster when run

against flat file data. Running against flat files is typically a factor of five to ten faster

compared to running against data stored in DBMS tables. Finally, some mining algorithms

(for instance, neural-net based) might be too awkward to express in SQL.

The ease of porting of the SQL alternative depends on the kind of SQL used. Within the

same DBMS, porting from one OS platform to another requires porting only the small UDF

code and hence is easy. In contrast the Stored-procedure and Cache-Mine alternatives

require porting more lines of code. Porting from one DBMS to another could get hard for

the SQL approach if non-standard DBMS-specific features are used. Unfortunately, we found

SQL-92 implementations (which would have been quite portable) to be unacceptable from

the performance viewpoint. Our preferred SQL implementation relies on the availability

of DB2's table functions. Table functions, for example, in Oracle 8 do not have the same

interface and semantics as DB2. Also, if different features have different performance

characteristics on different database systems, considerable tuning would be required. In

contrast, the Stored-procedure and Cache-Mine approaches are thus not tied to any

DBMS-specific features. The UDF implementation has the worst of both worlds; it is large and is

tied to a DBMS.

The biggest attraction of the SQL implementation is inter-operability and usage flex-

ibility. The ad hoc querying support provided by the DBMS enables flexible usage and

exposes potential for pipelining the input and output operators of the mining process with




I1.eno3 - I1.eno2 = I2.eno2 - I2.eno1 and
I1.eno_{k-1} - I1.eno_{k-2} = I2.eno_{k-2} - I2.eno_{k-3}
In the above query, subsequence matching is expressed as join predicates on the attributes
of F_{k-1}. Note the special join predicates on the eno fields that ensure that not only do
the joined sequences contain the same set of items but that these items are grouped in the
same manner into elements. The result of the join is the sequence obtained by extending
s1 with the last item of s2. The added item becomes a separate element if it was a separate
element in s2, and part of the last element of s1 otherwise. In our representation of the
candidate sequence, eno_{k-1} and eno_{k-2} determine if the added item is a separate element.
For example, let F3 be {((1,2) (3)), ((1,2) (4)), ((1) (3,4)), ((1,3) (5)), ((2) (3,4)),
((2) (3) (5))}. In the join step, the sequence ((1,2) (3)) joins with ((2) (3,4)) to generate
((1,2) (3,4)) and with ((2) (3) (5)) to generate ((1,2) (3) (5)). There are no other join
compatible sequences in F3.
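The join step on this example can be checked with a short Python sketch. It operates on in-memory tuples rather than on the (item, eno) columns of the SQL formulation, it ignores the special case of joining 1-sequences, and the function names are hypothetical. A sequence is a tuple of elements and each element is a tuple of items.

```python
def drop_first(seq):
    """Remove the first item of the sequence (and its element if it becomes empty)."""
    head = seq[0][1:]
    return ((head,) if head else ()) + seq[1:]

def drop_last(seq):
    """Remove the last item of the sequence (and its element if it becomes empty)."""
    tail = seq[-1][:-1]
    return seq[:-1] + ((tail,) if tail else ())

def join_sequences(s1, s2):
    """Extend s1 with the last item of s2 if the two are join compatible, else None."""
    if drop_first(s1) != drop_last(s2):
        return None
    last = s2[-1][-1]
    if len(s2[-1]) == 1:                     # separate element in s2
        return s1 + ((last,),)
    return s1[:-1] + (s1[-1] + (last,),)     # otherwise part of s1's last element

F3 = [((1, 2), (3,)), ((1, 2), (4,)), ((1,), (3, 4)),
      ((1, 3), (5,)), ((2,), (3, 4)), ((2,), (3,), (5,))]

candidates = []
for s1 in F3:
    for s2 in F3:
        c = join_sequences(s1, s2)
        if c is not None:
            candidates.append(c)
print(candidates)
```

Running the join over all ordered pairs of F3 yields exactly the two candidates named in the example, ((1,2) (3,4)) and ((1,2) (3) (5)).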
In the prune step, all candidate sequences that have a non-frequent contiguous (k-1)-subsequence
are deleted. In the above example, the sequence ((1,2) (3) (5)) will be deleted
since its contiguous subsequence ((1) (3) (5)) is not frequent.
We perform both the join and prune steps in the same SQL statement by writing
the above query as a k-way join as shown in Figure 6.1. For any k-sequence there are at
most k contiguous subsequences of length (k-1) for which F_{k-1} needs to be checked for
membership. Note that all (k-1)-subsequences may not be contiguous because of the max-gap
constraint between consecutive elements. The join predicates on I1 and I2 remain the
same. The join of I1 and I2 generates a k-sequence (I1.item1, ..., I1.item_{k-1}, I2.item_{k-1})


Ck. The trade-off is increased cost for other parts notably grouping because not as many
combinations are filtered.
Horizontal
Another variation of GatherJoin is the Horizontal approach that first uses the
Gather function to transform the data to the horizontal format but is otherwise similar
to the GatherJoin approach.
Rajamani et al. [105] propose finding associations using an approach quite similar to
this Horizontal approach. They assume (rather unrealistically) that the data is already in
a horizontal format. However, they do not use the frequent item-set filtering optimization
we outlined for pass 2. Without this optimization, the time for pass 2 for most real-life
datasets blows up even for relatively high support values. Also, at the time of candidate
generation, rather than doing self-join on F_{k-1}, they join F_{k-1} with F1, thereby generating
considerably more combinations than needed. Thus, the approach in Rajamani et al. [105]
is likely to perform worse than Horizontal.
4.1.3 Cost Analysis of GatherJoin and its Variants
The choice of the best approach depends on a number of data characteristics like the
number of items, total number of transactions, average length of a transaction and so on.
We express the costs of different approaches in each pass in terms of parameters that are
known or can be estimated after the candidate generation step of each pass. We include
only the terms that were found to be the dominant part of the total cost in practice. We
use the notations of Table 3.2 in the cost analysis.
The cost of GatherJoin includes the cost of generating k-item combinations, joining
with Ck and grouping to count the support. The number of k-item combinations generated,


Table 3.2. Notations used in cost analysis

R              number of records in the input transaction table
T              number of transactions
N              average number of items per transaction = R/T
F1             number of frequent items
S(C)           sum of support of each itemset in set C
sk             average support of a frequent k-itemset = S(Fk)/Fk
Rf             number of records out of R involving frequent items = S(F1)
Nf             average number of frequent items per transaction = Rf/T
Ck             number of candidate k-itemsets
C(n, k)        number of combinations of size k possible out of a set of size n = n!/(k!(n-k)!)
tk             cost of generating a k item combination using table function Comb-k
group(n, m)    cost of grouping n records out of which m are distinct
join(p, q, r)  cost of joining two relations of size p and q to get a result of size r
blob(n)        cost of passing a BLOB of size n integers as an argument
availability of indices, the size of intermediate results, and the amount of available memory.
For instance, the efficient execution of nested loops joins requires an index (item, tid) on
T. If the intermediate join result is large, it could be advantageous to materialize it and
perform sort-merge join.
For each candidate itemset in Ck, the join with T produces as many records as the
support of its first item. Therefore, the size of the join result can be estimated to be the
product of the number of k-candidates and the average support of a frequent item and
hence the cost of this join is given by join(Ck, R, Ck * s1). Similarly, the relation obtained
after joining Ck with l copies of T contains as many records as the sum of the support
counts of the l-item prefixes of the candidate k-itemsets. Hence the cost of the lth join
is join(Ck * s_{l-1}, R, Ck * s_l) where s0 = 1. Note that the values of the s_l can be computed
from statistics collected in the previous passes. Cost of the last join (with Tk) cannot be
estimated using the above formula since the k-item prefix of a k-candidate is not frequent


CHAPTER 1
INTRODUCTION
The rapid growth in data warehousing technology combined with the significant drop
in storage prices has made it possible to collect large volumes of data about customer
transactions in retail stores, mail order companies, banks, stock markets, telecommunication
companies and so on. For example, AT&T call records are about 1 gigabyte per
hour [73] and supermarket chains like WalMart collect terabytes of data. In order to
transform these huge amounts of data into business competitiveness and profits, it is
extremely important to be able to mine nuggets of useful and understandable information
from these data warehouses. In this chapter, we introduce data warehousing and data mining
technologies in Sections 1.1 and 1.2 respectively and in Section 1.3, we motivate the
need for coupling the two which is the focus of this dissertation. In Section 1.4, we discuss
and list the specific problems addressed in this dissertation and in Section 1.5, we outline
the dissertation organization.
1.1 Data Warehousing
A data warehouse is simply a single, complete and consistent store of data obtained
from a variety of sources and made available to end users in a way they can understand
and use. Achieving completeness and consistency of data in today's information systems
environment is far from simple. The first problem is to discover how completeness and
consistency can be defined. In the business context, this entails understanding the business
[Figure 7.4 plot: speed-up of the incremental algorithm against support threshold (in %).]
Figure 7.4. Speed-up of the incremental algorithm
support thresholds, the number of frequent itemsets and the number of passes in the level-
wise algorithm are less and hence it is less costly to run Apriori on the whole database. At
low support thresholds, the probability of the negative border expanding is higher and as a
result the incremental algorithm may have to scan the whole database. Also, the speed up
is higher for smaller increment sizes since the incremental algorithm needs to process less
data.
7.3 Comparison with FUP
The framework of FUP [36] is similar to that of Apriori and contains a number of
iterations. Each iteration is associated with a complete scan of the whole database and
in iteration k all the frequent k-itemsets are found. The candidate sets for iteration k + 1
are generated based on the frequent itemsets found in iteration k. FUP uses the following
steps to compute the frequent itemsets.
1. At each iteration k, the support of the frequent k-itemsets (Lk) is updated against
the increment database db to filter out the itemsets that are no longer frequent in the
updated database.


The 3wayJoin approach is comparable to the KwayJoin approach for this dataset
because the number of passes is at most three. As shown in Agrawal et al. [9] there
might be other datasets especially ones where there is significant reduction in the size
of Tk as k increases where 3wayJoin might perform better than KwayJoin. However,
one disadvantage of the 3wayJoin approach is that it requires space to store and log
the temporary relations Tk generated in each pass.
3.7 Cost Analysis
In this section, we analyze the cost of the KwayJoin and Subquery approaches which
represent the better ones among the SQL-92 approaches. A relational query engine can
execute the KwayJoin query in several different ways and the performance depends on the
execution plan chosen. We experimented with a number of alternative execution plans
for this query. We could force the query processor to choose different plans by creating
different indices on T and Ck, and in some cases by disabling certain join methods in our
experiments using Postgres. We will elaborate more on this in Section 3.9.
We present two different execution plans, one with Ck as the outermost relation in
Section 3.7.1 and another with Ck as the innermost relation in Section 3.7.2, and the cost
analysis for them. The effect of subquery optimization on the cost estimates is outlined in
Section 3.7.3.
Schematic diagrams of the two different execution plans are given in Figures 3.8 and
3.9. In the cost analysis, we use the mining-specific data parameters and knowledge about
association rule mining (Apriori algorithm [9] in this case) to estimate the cost of joins and
the size of join results. Even though current relational optimizers do not use this mining-
specific semantic information, the analysis provides a basis for developing mining-aware


example). The frequency constraint is applied during support counting to filter out the
non-frequent itemsets. Finally, any unprocessed constraints are checked as a post-counting
operation. The frequent itemsets before the post-counting constraint check are used for the
next level candidate generation since these constraints do not possess the closure property.
In the basic frequent itemset mining, only the frequency constraint is present and the
candidate generation step uses only the subset pruning strategy [13].
Example: We illustrate the application of the various constraints here with a specific
example using the point-of-sales data model in Section 7.5. The example shown in
Figure 7.12 finds product combinations containing barbecue sauce where all the products
cost less than $50 and the average price is more than $25 (some notion of similarly priced
products). The combinations should appear in at least 100 sales transactions with the total
price of the transaction greater than $500. This gives an idea of what other moderately
priced products people buy with barbecue sauce in big purchases (perhaps for parties). It
could help the shop owner to decide on various promotions.
[Query tree omitted; nodes: F_k+1, F_k, Products_sold, Sales, Products_sold, Product]
Figure 7.12. Point-of-sales example for constrained association mining


[Plot omitted; series: Support 0.75% and Support 0.33%]
Figure 3.15. Space requirements of the set-oriented apriori approach
In the earlier passes, if the number of item combinations is too large and we do not
want to store them, the Subquery approach can be used instead. However, the Subquery
approach will also generate large intermediate tables except in cases where the output of an
intermediate join is pipelined to the subsequent operation. The transition to Set-oriented
Apriori can be made in any pass by materializing the item combinations. An advantage of
the Set-oriented Apriori is that it requires only simple joins and hence it is easier for the
optimizer to handle them. With multiple levels of nested subqueries, optimization becomes
much harder.
3.9 Performance Experiments with Set-Oriented Apriori
In this section, we compare the total running time of the Subquery and Set-oriented
Apriori approaches.
We studied the performance of the Subquery approach (the best SQL-92 approach in
Section 3.6) and the Set-oriented Apriori for a wide range of data parameters and support
values. We report the results on two of the datasets, T5.I2.D100K and T10.I4.D100K,
which are described in Section 3.7.3.


7.5 Constrained Associations
7.5.1 Categories of Constraints
7.5.2 Constrained Association Mining
7.5.3 Incremental Constrained Association Mining
7.5.4 Constraint Relaxation
7.6 Applicability Beyond Association Mining
7.6.1 Mining Closed Sets
7.6.2 Query Flocks
7.6.3 View Maintenance
7.7 Summary
8 CONCLUSIONS
8.1 Proposed Extensions
8.1.1 Richer Set Operations
8.1.2 Enhanced Aggregation
8.1.3 Multiple Streams
8.1.4 Sampling
8.2 Contributions
8.3 Future Work
8.4 Closing
REFERENCES
BIOGRAPHICAL SKETCH


into the DBMS. This method has all the advantages of the stored procedure approach
(described below) plus it promises to have better performance. The disadvantage is that it
requires additional disk space for caching. This architecture is outlined in Figure 2.3.
[Diagram omitted; front end: GUI or mining language]
Figure 2.3. Cache-mine architecture
2.2.3 Stored-Procedure
This architecture is representative of the case where the mining logic is embedded as
applications on the database server. In this approach, shown in Figure 2.4, the mining
algorithm is encapsulated as a stored procedure [32] that is executed in the same address
space as the DBMS. The main advantage of this approach, as well as of the loose-coupling and
cache-mine approaches, is greater programming flexibility. Also, any existing file system code can
be easily transformed to work on data stored in the DBMS.
[Diagram omitted: a GUI or mining language issues a stored-procedure invocation to the DBMS, which hosts the stored procedures for mining]
Figure 2.4. Stored-procedure architecture
2.2.4 User-Defined Function
This approach is another variant of embedding mining as an application on the database
server if the user-defined functions are run in the unfenced mode (same address space as


[Plot omitted; panels: T10.I4.D100K and T5.I2.D100K, by support value]
Figure 3.11. Reduction in transaction table size by non-frequent item pruning
number of frequent items per transaction. Note that with the subquery optimization we
could achieve significant performance improvements by reducing the relation sizes of joins.
3.8.2 Eliminating Candidate Generation in Second Pass
This optimization aims at reducing the cost of the second pass which is a significant
portion of the total cost in some cases. In the second pass, C2 is almost a cartesian product
of the two F1's used to generate it and hence materializing it and joining with the T's (or
Tf's) could be expensive. In order to circumvent the problem of counting the large C2,
most mining algorithms use special techniques in the second pass. A few examples are
two-dimensional arrays in IBM's Quest data mining system [2] and hash filters proposed in
Park et al. [99] to limit the size of C2. The generation of C2 can be completely eliminated
by formulating the join query to find F2 as
insert into F2 select p.item, q.item, count(*)
from Tf p, Tf q
where p.tid = q.tid and p.item < q.item
group by p.item, q.item


the tuple (A, B, C, D, NULL, 4, 2, 0.9,0.3). The frequent itemsets are represented the
same way as rules but do not have the rulem and confidence attributes.
3.4 Associations in SQL
In Section 3.4.1 we present the candidate generation process in SQL. In Section 3.4.2
we present the support counting process and in Section 3.4.3 we present the rule generation
process.
3.4.1 Candidate Generation in SQL
Recall that the apriori algorithm for finding frequent itemsets proceeds in a level-wise
manner. In each pass k of the algorithm we first need to generate a set of candidate itemsets
Ck from frequent itemsets Fk-1 of the previous pass.
Given Fk-1, the set of all frequent (k-1)-itemsets, the Apriori candidate generation
procedure [9] returns a superset of the set of all frequent k-itemsets. We assume that the
items in an itemset are lexicographically ordered. Since all subsets of a frequent itemset
are also frequent, we can generate Ck from Fk-1 as follows.
First, in the join step, we generate a superset of the candidate itemsets Ck by joining
Fk-1 with itself as shown below.
insert into Ck select I1.item1, ..., I1.itemk-1, I2.itemk-1
from Fk-1 I1, Fk-1 I2
where I1.item1 = I2.item1 and
...
I1.itemk-2 = I2.itemk-2 and
I1.itemk-1 < I2.itemk-1
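The join step can be tried out for k = 3 with a small sketch. It uses Python's built-in sqlite3 module purely for illustration (the experiments in this work used other database systems), and the table contents are made-up sample data. Only the join step is shown; the subsequent prune step is not applied here.

```python
import sqlite3

# Illustrative sample data: the frequent 2-itemsets F2 (hypothetical values).
conn = sqlite3.connect(":memory:")
conn.execute("create table F2 (item1 text, item2 text)")
conn.executemany(
    "insert into F2 values (?, ?)",
    [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("C", "D")],
)
conn.execute("create table C3 (item1 text, item2 text, item3 text)")

# Join step for k = 3: pair up 2-itemsets that share the first item and
# order the last items lexicographically.
conn.execute("""
    insert into C3
    select I1.item1, I1.item2, I2.item2
    from F2 I1, F2 I2
    where I1.item1 = I2.item1 and I1.item2 < I2.item2
""")

candidates = sorted(conn.execute("select item1, item2, item3 from C3").fetchall())
print(candidates)
```

In this sample the itemset (A, B, D) is produced by the join even though (B, D) is not in F2, which is why the join result is only a superset of the candidates.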


We use blob(k Ck) for the BLOB passing cost since each itemset in Ck contains k items.
The cost estimate of Horizontal is similar to that of GatherJoin except that here
the data is materialized in the horizontal format before generating the item combinations.
4.2 Vertical
In this approach, we first transform the data table into a vertical format by creating for
each item a BLOB containing all tids that contain that item (Tid-list creation phase) and
then count the support of itemsets by merging together these tid-lists (support counting
phase). This approach is related to the approaches discussed in Savasere et al. [111] and
Zaki et al. [131].
For creating the Tid-lists we use a table function Gather. This is the same as the
Gather function in GatherJoin except that here we create the tid-list for each frequent
item. The data table T is scanned in the (item, tid) order and passed to the function
Gather. The function collects the tids of all tuples of T with the same item in memory and
outputs a (item, tid-list) tuple for items that meet the minimum support criterion. The tid-
lists are represented as BLOBs and stored in a new TidTable with attributes (item, tid-list).
The SQL query which does the transformation to vertical format is given in Figure 4.3.
insert into TidTable select item, tid-list
from (select * from T order by item, tid) as tt1,
table(Gather(item, tid, :minsup)) as tt2
[Tree diagram: T -> order by item, tid -> table function Gather -> tt2.item, tt2.tid-list]
Figure 4.3. Tid-list creation
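The behaviour described for the Gather table function can be sketched in plain Python. This is only an illustration under stated assumptions, not the DB2 table function itself: the function name gather, the use of Python lists in place of BLOBs, the sample data, and the reading of "meets the minimum support criterion" as "at least minsup tids" are all assumptions.

```python
from itertools import groupby

def gather(rows, minsup):
    """Collect tids per item from (item, tid) rows sorted by item, then tid.

    Yields (item, tid_list) only for items that occur in at least `minsup`
    transactions (assumed interpretation of the support criterion).
    """
    for item, group in groupby(rows, key=lambda row: row[0]):
        tids = [tid for _, tid in group]
        if len(tids) >= minsup:
            yield item, tids

# Hypothetical transaction table T as (tid, item) pairs.
T = [(1, "A"), (1, "B"), (2, "A"), (2, "C"), (3, "A"), (3, "B")]

# Scan T in (item, tid) order, as the query in Figure 4.3 does.
rows = sorted((item, tid) for tid, item in T)
tid_table = dict(gather(rows, minsup=2))

# Support counting idea: intersect the tid-lists of the items in an itemset.
support_AB = len(set(tid_table["A"]) & set(tid_table["B"]))
print(tid_table, support_AB)
```

The sketch depends on the input order in the same way the table function does: groupby only merges adjacent rows, so rows for the same item must arrive together.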
In the support counting phase, conceptually for each itemset in Ck we want to collect
the tid-lists of all k items and use a UDF to count the number of tids in the intersection


hours in the second pass, the total time for next two passes is only 14 seconds whereas it is
more than an hour for the other two approaches. For subsequent passes the performance
degrades dramatically for GatherJoin, because the table function Gather-Comb-K generates
a large number of combinations. For instance, for Dataset-C, even for a support
value of 2%, pass 3 did not complete after 5.2 hours, whereas for Vertical pass 3 finished
in 0.2 seconds. GatherPrune is better than GatherJoin for the third and later passes. For
pass 2 GatherPrune is worse because the overhead of passing a large object as an argument
dominates the cost.
The Vertical approach sometimes ended up spending too much time in the second
pass. In some of these cases the GatherJoin approach was better in the second pass
(for instance for low support values of Dataset-B) whereas in other cases (for instance,
Dataset-C support 0.25%) GatherCount was the only good option. For this case both the
GatherPrune and GatherJoin did not complete after more than six hours even for pass 2.
Further, they caused a storage overflow error because of the large size of the intermediate
results to be sorted. We had to divide the dataset into four equal parts and run the second
pass independently on each partition to avoid this problem.
[Plot omitted; x-axis: average transaction length]
Figure 4.6. Effect of increasing transaction length (average number of items per transaction)


from (select item1, ..., itemk, count(*)
from T, Ck
where item = Ck.item1 or
... or
item = Ck.itemk
group by item1, ..., itemk, tid
having count(*) = k) as temp
group by item1, ..., itemk
having count(*) > :minsup
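For k = 2 the two nested group-bys can be exercised with the following sketch. It runs on Python's built-in sqlite3 module for illustration only; the sample tables, the value of minsup, and the outer select list (the beginning of the query is not reproduced above) are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table T (tid integer, item text)")
conn.executemany(
    "insert into T values (?, ?)",
    [(1, "A"), (1, "B"), (1, "C"), (2, "A"), (2, "B"), (3, "A"), (3, "C")],
)
conn.execute("create table C2 (item1 text, item2 text)")
conn.executemany(
    "insert into C2 values (?, ?)", [("A", "B"), ("A", "C"), ("B", "C")]
)

# Inner group-by: per (candidate, tid), count how many of the candidate's
# items occur in the transaction and keep those where all k = 2 occur.
# Outer group-by: count the supporting transactions of each candidate.
rows = conn.execute("""
    select item1, item2, count(*)
    from (select item1, item2, count(*)
          from T, C2
          where item = C2.item1 or item = C2.item2
          group by item1, item2, tid
          having count(*) = 2) as temp
    group by item1, item2
    having count(*) > :minsup
    order by item1, item2
""", {"minsup": 1}).fetchall()
print(rows)
```

The sketch assumes that a transaction never lists the same item twice; otherwise the inner count would no longer equal the number of matched candidate items.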
3.5.4 Subquery-Based
This approach makes use of common prefixes between the itemsets in Ck to reduce the
amount of work done during support counting. We break up the support counting phase
into a cascade of k subqueries. The l-th subquery Ql finds all tids that match the distinct
itemsets formed by the first l columns of Ck (call it dl). The output of Ql is joined with T
and dl+1 (the distinct itemsets formed by the first l + 1 columns of Ck) to get Ql+1. The
final output is obtained by doing a group-by on the k items to count support as above. The
queries and the tree diagram for subquery Ql are given in Figure 3.6.
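A minimal sketch of the cascade for k = 2 follows, again using Python's sqlite3 module only as a convenient way to run the SQL. The aliases (q1, q2, d1, d2), the sample data and the minsup value are assumptions made for illustration; the exact queries of Figure 3.6 are not reproduced here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table T (tid integer, item text)")
conn.executemany(
    "insert into T values (?, ?)",
    [(1, "A"), (1, "B"), (1, "C"), (2, "A"), (2, "B"), (3, "A"), (3, "C")],
)
conn.execute("create table C2 (item1 text, item2 text)")
conn.executemany(
    "insert into C2 values (?, ?)", [("A", "B"), ("A", "C"), ("B", "C")]
)

# q1: tids matching the distinct first columns of C2 (d1).
# q2: q1 joined with T and the distinct first two columns of C2 (d2).
# The final group-by on both items counts the support.
rows = conn.execute("""
    select item1, item2, count(*)
    from (select q1.item1 as item1, t2.item as item2, t2.tid as tid
          from T t2,
               (select d1.item1 as item1, t1.tid as tid
                from T t1, (select distinct item1 from C2) as d1
                where t1.item = d1.item1) as q1,
               (select distinct item1, item2 from C2) as d2
          where q1.item1 = d2.item1 and
                q1.tid = t2.tid and
                t2.item = d2.item2) as q2
    group by item1, item2
    having count(*) > :minsup
    order by item1, item2
""", {"minsup": 1}).fetchall()
print(rows)
```

Because candidates (A, B) and (A, C) share the prefix A, the matching of A against the transactions is done once in q1 and reused for both.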
3.6 Performance Comparison
In this section we compare the performance of the four SQL-92 approaches. All of
our experiments were performed on Version 5 of IBM DB2 Universal Server installed on a
RS/6000 Model 140 with a 200 MHz CPU, 256 MB main memory and a 9 GB disc with a
measured transfer rate of 8 MB/s. These experimental results are also reported in Sarawagi
et al. [109].


Lemma i Let s be an itemset such that s ∈ FDB. Then s is no longer frequent in the updated
database only if s ∈ Fdb. That is, a frequent itemset s will become infrequent only if s ∈ Fdb.
This lemma can be proved in the same way as Lemma 2.
The algorithm to compute the frequent itemset and the negative border of DB is
similar to the one in the case where new transactions are added to the database and can
be derived easily.
7.2 Experimental Results
We conducted a set of experiments to compare the performance of our incremental
algorithm. These experiments were performed on a Sun SPARCstation 4 running SunOS
5.5. In this section, we report on the results of some of those experiments.
The experiments were performed on synthetic data generated using the same technique
as in Agrawal and Srikant [13]. The dataset used for the baseline experiment was
T10.I4.D100K (Mean size of a transaction = 10, Mean size of maximal potentially large
itemsets = 4, Number of transactions = 100 thousand). The increment database is created
as follows: We generate 100 thousand transactions, of which (100 - d) thousand are used for
the initial computation and d thousand are used as the increment, where d is the fractional
size (in percentage) of the increment.
We compare the execution time of the incremental algorithm with respect to running
Apriori on the whole data set. Figure 7.4 shows the speed up of the incremental algorithm
over Apriori for different minimum support thresholds. We report the results for increment
sizes of 1%, 2%, 5% and 10% (shown in the legend). From the graph, it can be seen that
the incremental algorithm achieves speed up of about 3 to 20. The algorithm shows better
speed up for medium support threshold than low and high support thresholds. At high


[Diagram omitted; alternatives ordered from loose to tight integration: mining as an application on the client/application server (loose coupling, cache-mine), mining as an application on the database server (stored procedure, user-defined function), mining using SQL + extensions (SQL-based approach, mining extenders/blades), and the integrated approach (integrated with the SQL query engine)]
Figure 2.1. Taxonomy of architectural alternatives
mining systems. A potential problem with this approach is the high context switch costs
between the DBMS and the mining kernel processes since they run in different address
spaces. In spite of the block-read optimization present in many systems (for example,
Oracle [98], DB2 [32]) where a block of tuples is read at a time instead of reading a single
tuple at a time, the performance could suffer. This architecture is outlined in Figure 2.2.
[Diagram omitted; front end: GUI or mining language]
Figure 2.2. Loose-coupling architecture
2.2.2 Cache-Mine
This is a special case of the loose-coupling approach where the mining algorithm reads
data from the DBMS only once and caches the relevant data in flat files on local disk for
future references. The data could be transformed and cached in a format that enables more
efficient access in the future. The mined results, first generated as flat files, are imported


[Diagram omitted; includes business intelligence tools]
Figure 1.3. Typical data warehouse usage
analysis of various execution plans chosen by relational database systems for executing some
of the SQL-based mining queries. We expect that this study will reveal the domain-specific
semantic information of the mining algorithms that needs to be integrated into next-generation
query optimizers to handle mining computations efficiently. We also develop SQL
formulations for a few other mining operations, namely, generalized association rules [118]
and sequential patterns [14, 120]. We also propose a few primitive database operators that
are useful for mining and other decision support applications. These operators, if natively
supported in a database system, could potentially speed up the execution of mining queries.
Data warehouses on which the mining tools operate typically are populated incrementally.
In order to improve the reliability and usefulness of the discovered information, large
volumes of data need to be collected and analyzed over a period of time. A naive approach
to update the mined information, when new data are added or part of the current data
are deleted, is to recompute them from scratch. However, it would be ideal to develop an
incremental algorithm so that the computation effort spent on the original data can be


LIST OF FIGURES
1.1 Data warehousing architecture
1.2 Credit card classification example
1.3 Typical data warehouse usage
2.1 Taxonomy of architectural alternatives
2.2 Loose-coupling architecture
2.3 Cache-mine architecture
2.4 Stored-procedure architecture
2.5 UDF-based mining architecture
2.6 SQL architecture for mining in a DBMS
2.7 Architecture for mining in next-generation DBMSs
3.1 Apriori algorithm
3.2 Candidate generation for any k
3.3 Candidate generation for k = 4
3.4 Rule generation
3.5 Support counting by K-way join
3.6 Support counting using subqueries
3.7 Comparison of different SQL-92 approaches
3.8 K-way join plan with Ck as inner relation
3.9 K-way join plan with Ck as outer relation


insert into Fk
select item1, eno1, ..., itemk, enok, count(*)
from (select distinct sid, item1, eno1, ...,
itemk, enok from (Subquery Qk))
group by item1, eno1, ..., itemk, enok
having count(*) > :minsup
Subquery Ql (for any l between 1 and k):
select item1, eno1, time1, ..., iteml, enol, tl.time, sid
from D tl, (Subquery Ql-1) as rl-1,
(select distinct item1, eno1, ...,
iteml, enol from Ck) as dl
where rl-1.item1 = dl.item1 and ... and
rl-1.iteml-1 = dl.iteml-1 and
rl-1.sid = tl.sid and
tl.item = dl.iteml and SubQ-PRED(l)
Subquery Q0: No subquery Q0.
[Tree diagram for subquery Ql: D tl is joined with subquery Ql-1 and with the select distinct item1, eno1, ..., iteml, enol projection of Ck]
Figure 6.4. Subquery optimization for KwayJoin approach


We can also use the Subquery approach to generate T3 if that is estimated to be less
expensive. T3 will contain exactly the same tuples produced by subquery Q3.
The Set-oriented Apriori algorithm bears some resemblance to the three-way join
approach in Section 3.5.2 and the AprioriTid algorithm in Agrawal and Srikant [13]. In the
three-way join approach, the temporary table Tk stores for each transaction, the identifiers
of the candidates it supported. Tk is generated by joining two copies of Tk-1 with Ck. The
generation of Fk requires a further join of Tk with Ck. AprioriTid makes use of special
data structures which are difficult to maintain in the SQL formulation.
Cost Comparison
In Section 3.7.3, we saw that the cost of the kth pass with the subquery optimization
is
\[ \Big\{ \sum_{l=1}^{k} \mathrm{trijoin}(R_f,\; s_{l-1} d^{k}_{l-1},\; d^{k}_{l},\; s_{l} d^{k}_{l}) \Big\} + \mathrm{group}(S(C_k), C_k) \]
As a result of the materialization and reuse of item combinations, Set-oriented Apriori requires
only a single 3-way join in the kth pass.¹ The cost of the kth pass in Set-oriented
Apriori is
\[ \mathrm{trijoin}(R_f,\; T_{k-1},\; C_k,\; S(C_k)) + \mathrm{group}(S(C_k), C_k) \]
where $T_{k-1}$ and $C_k$ denote the cardinality of the corresponding tables. The grouping cost
is the same as that of the subquery approach. The table $T_{k-1}$ contains exactly the same
tuples as that of subquery $Q_{k-1}$ and hence has a size of $s_{k-1} d^{k-1}_{k-1}$. Also, $d^{k}_{k}$ is the same
as $C_k$. Therefore, the kth pass cost of Set-oriented Apriori is the same as the kth term in
¹ Note that this may be executed as two 2-way joins since 3-way joins are not generally supported in
current relational systems.


gets multiplied by the average depth of the taxonomy. In the GatherJoin approach, we
show the performance numbers for only the second pass. Note that just the time for second
pass is an order of magnitude more than the total time for all the passes of Vertical. For
0.5% support, second pass of GatherJoin took over 20,000 seconds while the total time for
Vertical was only about 3000 seconds. We did not do extensive experimentation here
because, based on the SQL formulations and the analysis, we expect similar performance
observations as in the case of boolean association rules.
5.9 Summary
In this chapter, we presented various SQL formulations for mining generalized association
rules. They show that the boolean association rule framework can be easily extended for
generalized association rules. The major additions were to extend the input transaction
table (transform T to T*) and to pre-compute the ancestors.
We have presented here the extensions only for a few representative SQL approaches
for boolean association rules in Chapters 3 and 4. The other approaches can be extended
in a similar manner. We expect the performance trends from boolean association rules
to hold for all these approaches for generalized associations also. There could be better
ways of handling the taxonomy like the EstMerge and Stratify algorithms presented in
Srikant and Agrawal [118]. However, with the currently available SQL operators they will
be too awkward to program. One possibility is to push some of those optimizations inside
user-defined functions and it remains to be seen how they will influence the performance.


6.5 Support Counting Using SQL-92
6.5.1 K-way Join
In the kth pass, we join the candidate table Ck with k copies of the data-sequence
table D and group the join result on the candidate sequences as shown in Figure 6.3. This
approach is very similar to the KwayJoin approach for association rules (refer Section 3.5.1)
except for the following two key differences.
1. We use select distinct to ensure that only distinct data-sequences are counted.
2. We have additional predicates between sequence numbers that are denoted
as PRED(k) in the query for brevity. The predicate PRED(k) is a conjunction (and) of
terms Pij(k) corresponding to each pair of items from Ck. Pij(k) is expressed as
(Ck.enoj = Ck.enoi and abs(dj.time - di.time) < window-size) or
(Ck.enoj = Ck.enoi + 1 and dj.time - di.time < max-gap and
dj.time - di.time > min-gap) or
(Ck.enoj > Ck.enoi + 1)
Intuitively, these predicates check (a) if two items of a candidate sequence belong
to the same element, then the difference of their corresponding transaction times is
at most window-size and (b) if two items belong to adjacent elements, then their
transaction times are at most max-gap and at least min-gap apart.
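The term Pij(k) can be read as an ordinary boolean function of the two element numbers and the two transaction times. The sketch below is an illustrative Python rendering, not part of the SQL formulation; the function and parameter names are hypothetical, and the strict comparisons simply follow the operators as printed in the predicate above.

```python
def p_ij(eno_i, eno_j, time_i, time_j, window_size, max_gap, min_gap):
    """Evaluate one term Pij for a pair of candidate items (hypothetical helper).

    eno_i, eno_j   -- element numbers of the two items in the candidate
    time_i, time_j -- transaction times of the matching data-sequence tuples
    """
    # (a) Same element: the two transaction times must lie within the window.
    same_element = eno_j == eno_i and abs(time_j - time_i) < window_size
    # (b) Adjacent elements: the time difference must respect both gaps.
    adjacent = (
        eno_j == eno_i + 1
        and time_j - time_i < max_gap
        and time_j - time_i > min_gap
    )
    # Elements further apart are not constrained by this term.
    further_apart = eno_j > eno_i + 1
    return same_element or adjacent or further_apart

# Example checks with window-size 5, max-gap 30 and min-gap 2.
print(p_ij(1, 1, 10, 12, 5, 30, 2))  # same element, 2 time units apart
print(p_ij(1, 2, 10, 50, 5, 30, 2))  # adjacent elements, 40 time units apart
```

PRED(k) would then be the conjunction of this function over the item pairs of a candidate.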
We compute the frequent 1-sequences by grouping the data-sequences table on the
item attribute, counting the number of distinct sequences in which the item is present and
filtering the non-frequent items. The SQL query for this computation is given below.


We also used the set difference operation for pruning itemsets containing an item and
its ancestor, in generalized association rules.
8.1.2 Enhanced Aggregation
Another common operation that we used is the Gather operation that can transform
two attributes in a data table to a form where for distinct values of one of the attributes
(called the grouping attribute) we collect together in a set all values of the other attribute.
We can think of Gather as a glorified aggregate function. Its reverse operation scatter is
a special case of the subset operation where the subset size is 1.
We used this operation in four of our SQL approaches: GatherJoin, GatherPrune,
Vertical and Horizontal. To implement the Gather table function in our queries we
required data to appear in a particular order as inputs to the table function. This would be
trivial to express in SQL if order by clauses were allowed in inner subqueries. In the absence
of such a feature, we relied on our knowledge of DB2's internals to force such an order in
our experiments. The Gather function can also be treated as an aggregate function that
simply concatenates its arguments. Systems that have support for user-defined aggregate
functions [122] can easily support such a functionality.
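As an illustration of Gather treated as a user-defined aggregate, the following sketch registers one in SQLite through Python's sqlite3 module. This is not the DB2 table function used in the experiments; the aggregate name gather, the comma-separated string result and the sample rows are assumptions made for the example.

```python
import sqlite3

class Gather:
    """User-defined aggregate that collects the values of a group into one string."""

    def __init__(self):
        self.values = []

    def step(self, value):
        # Called once per row of the group.
        self.values.append(value)

    def finalize(self):
        # Sort inside the aggregate so the result does not depend on input order.
        return ",".join(str(v) for v in sorted(self.values))

conn = sqlite3.connect(":memory:")
conn.create_aggregate("gather", 1, Gather)
conn.execute("create table T (tid integer, item text)")
conn.executemany(
    "insert into T values (?, ?)",
    [(1, "B"), (1, "A"), (2, "C"), (1, "C"), (2, "A")],
)

rows = conn.execute(
    "select tid, gather(item) from T group by tid order by tid"
).fetchall()
print(rows)
```

With an aggregate, grouping is handled by the query engine, so no particular input order has to be forced for the function to see all rows of a group together.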
There are several other decision support applications where gather could be extremely
useful. For instance, suppose we are trying to classify customers of a credit card company
and the data, in addition to static customer attributes like age and salary, consists of
detailed transaction data about each purchase activity of a customer. In this case, we
would like to gather all activities of a customer as one time series and extract features
from these time series for classification. In OLAP applications too, data often needs to be
converted back and forth from a format where different measures are gathered together as


5.6 Support Counting Using SQL-92
5.6.1 K-way Join
In each pass k, we join the candidate itemsets Ck with k copies of an extended transaction
table T* (defined below), and follow it up with a group by on the itemsets.
The extended transaction table T* is obtained by augmenting T to include (tid, item)
entries for all ancestors of items appearing in T. This can be formulated as a SQL query
as shown in Figure 5.4.
Query to generate T*
select item, tid from T union
select distinct A.ancestor as item, T.tid
from T, Ancestor A
where A.descendant = T.item
[Tree diagram: T* is the union of T with the (T.tid, A.ancestor as item) projection of T joined with Ancestor A on T.item = A.descendant]
Figure 5.4. Transaction extension subquery
The select distinct clause is used to eliminate duplicate records due to extension of
items with a common ancestor. Note that for this approach we do not materialize T*.
Instead we use the SQL support for common subexpressions (with construct) to pipeline
the generation of T* with the join operations. The final SQL query and the corresponding
tree diagram are shown in Figure 5.5.
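The transaction extension can be tried on a toy taxonomy as follows. The sketch uses Python's sqlite3 module and SQLite's with clause for illustration; the name Tstar, the taxonomy and the transactions are made-up examples, and the Ancestor table is assumed to hold all (ancestor, descendant) pairs already.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table T (tid integer, item text)")
conn.execute("create table Ancestor (ancestor text, descendant text)")
conn.executemany(
    "insert into T values (?, ?)",
    [(1, "jacket"), (1, "ski pants"), (2, "shirt")],
)
conn.executemany(
    "insert into Ancestor values (?, ?)",
    [
        ("outerwear", "jacket"),
        ("outerwear", "ski pants"),
        ("clothes", "jacket"),
        ("clothes", "ski pants"),
        ("clothes", "shirt"),
    ],
)

# T* as a common subexpression: the original tuples plus one (tid, ancestor)
# tuple for every ancestor of every item in the transaction.
t_star = conn.execute("""
    with Tstar(item, tid) as (
        select item, tid from T
        union
        select distinct A.ancestor as item, T.tid
        from T, Ancestor A
        where A.descendant = T.item)
    select tid, item from Tstar order by tid, item
""").fetchall()
print(t_star)
```

Transaction 1 contains two items with the common ancestors outerwear and clothes, yet each ancestor appears only once for that transaction in the result.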


having count(*) > :minsup
The cost of the second pass with this optimization is
\[ \mathrm{join}(R_f, R_f, C(N_f, 2)) + \mathrm{group}(C(N_f, 2), C(F_1, 2)) \]
Even though the grouping cost remains the same, there is a big reduction from the basic
KwayJoin approach in the join costs. Figure 3.12 compares the running time of the second
pass with this optimization to the basic KwayJoin approach for the two experimental
datasets. For the KwayJoin approach, the best execution plan was the one which generates
all 2-item combinations, joins them with the candidate set and groups the join result. We
can see that this optimization has a significant impact on the running time.
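The second-pass query that avoids materializing C2 can be run on a small example. The sketch below uses Python's sqlite3 module for illustration only; the contents of Tf, the column names of F2 and the minsup value are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Tf: the transaction table after non-frequent items have been pruned
# (hypothetical sample data).
conn.execute("create table Tf (tid integer, item text)")
conn.executemany(
    "insert into Tf values (?, ?)",
    [(1, "A"), (1, "B"), (1, "C"), (2, "A"), (2, "B"), (3, "A"), (3, "C")],
)
conn.execute("create table F2 (item1 text, item2 text, cnt integer)")

# Self-join of Tf on tid generates every 2-item combination of each
# transaction; grouping counts the support without a candidate table.
conn.execute("""
    insert into F2
    select p.item, q.item, count(*)
    from Tf p, Tf q
    where p.tid = q.tid and p.item < q.item
    group by p.item, q.item
    having count(*) > :minsup
""", {"minsup": 1})

f2 = sorted(conn.execute("select item1, item2, cnt from F2").fetchall())
print(f2)
```

The condition p.item < q.item keeps each pair once and in lexicographic order, matching the ordering assumed for itemsets.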
3.8.3 Reusing the Item Combinations from Previous Pass
The SQL formulations of association rule mining are based on generating item combinations
in various ways and similar work is performed in all the different passes. Therefore,
reusing the item combinations will improve the performance especially in the higher passes.
We will explain more about this optimization and the corresponding cost reduction in the
next section.
3.8.4 Set-Oriented Apriori
In this section, we develop a set-oriented version of the apriori algorithm combining
all the performance optimizations outlined above, which we call Set-oriented Apriori. We
store the item combinations generated in each pass and use them in the subsequent passes
instead of generating them in every pass from the transaction tables. In the kth pass of
the support counting phase, we generate a table which contains all k-item combinations


insert into Fk select item1, ..., itemk, count(*)
from Ck,
(select t2.T_itm1, ..., t2.T_itmk from T* t1,
table (GatherComb-K(t1.tid, t1.item)) as t2)
where t2.T_itm1 = Ck.item1 and
... and
t2.T_itmk = Ck.itemk
group by Ck.item1, ..., Ck.itemk
having count(*) > :minsup
[Tree diagram: T* -> order by tid, item -> table function GatherComb-K -> join with Ck on T_itm1 = Ck.item1, ..., T_itmk = Ck.itemk -> group by item1, ..., itemk -> having count(*) > :minsup]
Figure 5.6. Support counting by GatherJoin
5.7.2 GatherExtend
This is a variation of the GatherJoin approach where we push the transaction extension
inside the table function. For each item, we create an ancestor-list (stored as a BLOB)
which contains a list of ancestors of that item. This can be accomplished using similar SQL
queries as in the tid-list creation phase of the Vertical approach for boolean associations
(refer Section 4.2). The (item, ancestor-list) pairs are stored in a new AncListTable. Note
that ancestor-list needs to be created only for items that appear in the input transactions.


8.1 Proposed Extensions
One of the goals of our work has been to unbundle complex mining operations like
associations and classifications into smaller primitives that can be supported efficiently by
a general purpose DBMS and be useful for a large class of mining algorithms. Our exercise
with some of the mining operations has helped us identify a few such primitives that we
believe could be generally useful in a number of other decision support applications.
8.1.1 Richer Set Operations
We expect a richer collection of set operations to be useful for mining. For the mining
problems addressed in this dissertation, we used three different versions of the basic subset
operation. For candidate generation we needed to find all (k-1)-subsets of a set of k elements
for doing the pruning. We enumerated the subsets explicitly in the query predicate because
the size of the set and the subsets was fixed and known in advance. For support counting,
we used the Comb-K table function which generated all k-subsets of a variable length set.
For rule generation, we used the GenSubset table function to generate all subsets of a
variable length set. The subset operation would also be useful in building decision trees on
categorical attributes [85].
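The three versions of the subset operation can be illustrated with Python's itertools. The function names below are hypothetical stand-ins for the query predicate, Comb-K and GenSubset; in particular, restricting gen_subsets to non-empty proper subsets is an assumption made for this sketch.

```python
from itertools import combinations

def k_minus_1_subsets(itemset):
    """All (k-1)-subsets of a k-itemset, as needed for candidate pruning."""
    return list(combinations(itemset, len(itemset) - 1))

def comb_k(items, k):
    """All k-subsets of a variable-length set, as needed for support counting."""
    return list(combinations(sorted(items), k))

def gen_subsets(items):
    """All non-empty proper subsets (assumed scope), as needed for rule generation."""
    items = sorted(items)
    return [c for r in range(1, len(items)) for c in combinations(items, r)]

prune_checks = k_minus_1_subsets(("A", "B", "C"))
pairs = comb_k({"C", "A", "B", "D"}, 2)
antecedents = gen_subsets(["A", "B", "C"])
print(prune_checks, len(pairs), len(antecedents))
```

The first function has a fixed output size for a given k, which is why it could be written out explicitly in a query predicate, whereas the other two depend on the length of the input set.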
Another important set operation was the intersect operation used in the Vertical
approach. Most systems already have internal implementations of this operation for doing
AND and OR operations on RID (record identifier) lists obtained during multiple index
scans [93]. In current OLAP and data warehousing systems this operation is rampant in
the popular bit-mapped indices. Our Vertical approach is curiously similar to this bit-mapped
approach with one important difference. Instead of performing ANDs on RIDs, we
perform the ANDs on another attribute, the transaction identifier.


of the data the Stored-procedure approach is four times slower than Cache-Mine. Therefore,
if one is not willing to pay the penalty of extra storage the best strategy for improving
performance is to reduce the number of passes to the data even if it comes at the cost
of extra processing. Some of the recent proposals [26, 127] that attempt to minimize the
number of data passes to 2 or 3 might be useful in that regard.
The SQL approach is not the winner in terms of performance and space requirements
but it is competitive. The benefit of the SQL approach could arise from other secondary
advantages.
The SQL implementation has the potential for automatic parallelization. Parallelization
could come for free because the bulk of our processing is expressed in terms of standard
SQL queries. As long as the database supports efficient parallelization of these queries the
mining code can be easily parallelized. The problem case is where the UDFs use scratch
pads. The only such function in our queries is the Gather table function. This function
essentially implements a user defined aggregate, and would have been easy to parallelize if
the DBMS provided support for user defined aggregates or allowed explicit control from the
application about how to partition the data amongst different parallel instances of the function.
For MPPs, one could rely on the DBMS to come up with a data partitioning strategy.
However, it might be possible to better tune performance if the application could provide
hints about the best partitioning [11] to use. Further experiments are required to assess how
the performance of these automatic parallelizations would compare with algorithm-specific
parallelizations [11].
The development time and code size using SQL could be shorter if one can get efficient
implementations out of expressing the mining algorithms declaratively using a few SQL
statements. Thus, one can avoid writing and debugging code for memory management,


the server) [32]. In this case, the entire mining algorithm is encapsulated as a collection of
user-defined functions (UDFs) [32] that are appropriately placed in SQL data scan queries.
The architecture is represented in Figure 2.5. Most of the processing happens in the UDF
and the DBMS is used simply to provide tuples to these UDFs. Little use is made of the
query processing capabilities of the DBMS. The UDFs can be run in either fenced (different
address space) or unfenced (same address space) mode. The main attraction of this method
is performance since when run in the unfenced mode individual tuples never have to cross
the DBMS boundary. Otherwise, the processing happens in almost the same manner as
in the stored procedure case. The main disadvantage is the development cost [12] since
the entire mining algorithm has to be written as UDFs involving significant code rewrites.
Further, these are heavy-weight UDFs which involve significant processing and memory
management.
[Diagram omitted: a GUI or mining language issues SQL queries (containing UDFs) to the DBMS, which hosts the UDFs for mining]
Figure 2.5. UDF-based mining architecture
In order to provide a query interface or application programming interface to the
discovered rules, they can be passed through a post-processing step. The rule discovery
itself could be done by any of the above alternatives.
2.2.5 SQL-Based Approach
This is the integration architecture explored in this dissertation. In this approach, the
mining algorithm is formulated as SQL queries which are executed by the DBMS query


The basic Apriori algorithm shown in Figure 3.1 makes multiple passes over the data.
In the kth pass it finds all itemsets with k items having the minimum support, called the
frequent k-itemsets. Each pass consists of two phases. Let F_k represent the set of frequent
k-itemsets, and C_k the set of candidate k-itemsets (potentially frequent itemsets). First is
the candidate generation phase where the set of all frequent (k-1)-itemsets, F_{k-1}, found
in pass (k-1), is used to generate the candidate itemsets C_k. The candidate generation
procedure ensures that C_k is a superset of the set of all frequent k-itemsets. The algorithm
builds a specialized hash-tree data structure in memory out of C_k. Then is the support
counting phase where the transaction database is scanned. For each transaction, the
algorithm determines which of the candidates in C_k are contained in the transaction using
the hash-tree data structure and increments their support count. At the end of the pass, C_k
is examined to determine which of the candidates are frequent, yielding F_k. The algorithm
terminates when C_{k+1} becomes empty.
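The two phases described above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the implementation of Figure 3.1: a plain subset test stands in for the hash-tree, the minimum support is given as an absolute count, and the function name apriori is an assumption of the sketch.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {frozenset: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    # Pass 1: count individual items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    result = dict(frequent)

    k = 2
    while frequent:
        prev = set(frequent)
        # Candidate generation: combine pairs of frequent (k-1)-itemsets and
        # keep a k-itemset only if all of its (k-1)-subsets are frequent.
        candidates = set()
        for a in prev:
            for b in prev:
                union = a | b
                if len(union) == k and all(
                    frozenset(sub) in prev for sub in combinations(union, k - 1)
                ):
                    candidates.add(union)
        # Support counting: one scan over the transactions.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent = {s: c for s, c in counts.items() if c >= min_count}
        result.update(frequent)
        k += 1
    return result


example = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
frequent_itemsets = apriori(example, min_count=3)
```

With a minimum count of 3, all three items and all three pairs are frequent, while {a, b, c} (support 2) is generated as a candidate in pass 3 and rejected; the loop then stops because no frequent 3-itemsets remain.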
3.2 Other Algorithms
There are several other file-based algorithms for mining association rules. However,
many of them follow the basic Apriori framework and try to improve on the computation
and I/O requirements.
The partition algorithm [111] reduces the I/O requirements to just two passes over the
entire dataset. The reason the database needs to be scanned multiple times is because the
number of possible itemsets to be tested for support is exponentially large if it must be done
in a single scan of the database. The partition algorithm generates a set of all potentially
frequent itemsets in one scan of the database. This set is a superset of all frequent itemsets.
In the second scan, the actual support of these itemsets in the whole database is computed.
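The two-scan structure can be sketched as follows. The sketch assumes the usual way of obtaining the superset in the first scan: the database is split into partitions that fit in memory, and every itemset that is frequent within at least one partition becomes a candidate (an itemset that is frequent overall must be frequent in some partition). The brute-force local miner and the names local_frequent and partition_mine are illustrative only.

```python
from itertools import combinations


def local_frequent(partition, min_frac):
    """Brute-force all itemsets that are frequent within one in-memory partition."""
    threshold = min_frac * len(partition)
    counts = {}
    for t in partition:
        items = sorted(t)
        for size in range(1, len(items) + 1):
            for combo in combinations(items, size):
                counts[combo] = counts.get(combo, 0) + 1
    return {s for s, c in counts.items() if c >= threshold}


def partition_mine(transactions, min_frac, n_parts):
    # Scan 1: the union of locally frequent itemsets is a superset of the
    # globally frequent itemsets.
    size = max(1, -(-len(transactions) // n_parts))  # ceiling division
    candidates = set()
    for start in range(0, len(transactions), size):
        candidates |= local_frequent(transactions[start:start + size], min_frac)

    # Scan 2: count the actual support of every candidate in the whole database.
    counts = {c: 0 for c in candidates}
    for t in transactions:
        for c in candidates:
            if set(c) <= set(t):
                counts[c] += 1
    threshold = min_frac * len(transactions)
    return {c: n for c, n in counts.items() if n >= threshold}


data = [{"a", "b"}, {"a", "c"}, {"a", "b"}, {"b", "c"}]
partition_result = partition_mine(data, min_frac=0.5, n_parts=2)
```

In this small example the first scan proposes six candidates, including (a, c) and (b, c), and the second scan discards those two because each occurs in only one of the four transactions.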


Figure 3.3 gives an example for k = 4.
We construct a primary index on (item_1, ..., item_{k-2}) of F_{k-1} to efficiently process
these k-way joins using index probes. Note that sometimes it may not be necessary to
materialize C_k before the counting phase. Instead, the candidate generation can be pipelined
with the subsequent SQL queries used for support counting.
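[Figure 3.2 diagram: a k-way join tree over aliases I1, ..., Ik of F_{k-1}. Recoverable join predicates: (skip item_1) I1.item_2 = I3.item_1, ..., I1.item_{k-1} = I3.item_{k-2}, I2.item_{k-1} = I3.item_{k-1}; (skip item_{k-2}) I1.item_1 = Ik.item_1, ..., I1.item_{k-1} = Ik.item_{k-2}, I2.item_{k-1} = Ik.item_{k-1}.]
Figure 3.2. Candidate generation for any k
[Figure 3.3 diagram: the same join tree for k = 4. Recoverable join predicate: (skip item_2) I1.item_1 = I4.item_1, ...]
Figure 3.3. Candidate generation for k = 4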
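To make the join structure concrete, the following sketch generates C_3 from F_2 with a three-way self-join, run through Python's sqlite3 module as a stand-in DBMS. The first two predicates are the usual Apriori join step (shared prefix, ordered last items), which is assumed here because it is not visible in the figure residue; the last two are the "skip item_1" predicates. The table name F2, the sample rows and the primary key (standing in for the index on the prefix columns) are assumptions of the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE F2 (item1 TEXT, item2 TEXT, PRIMARY KEY (item1, item2))"
)
conn.executemany(
    "INSERT INTO F2 VALUES (?, ?)",
    [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")],
)

C3_QUERY = """
SELECT I1.item1, I1.item2, I2.item2 AS item3
FROM F2 I1, F2 I2, F2 I3
WHERE I1.item1 = I2.item1      -- join step: same 1-item prefix
  AND I1.item2 < I2.item2      -- join step: ordered last items
  AND I1.item2 = I3.item1      -- prune step (skip item1):
  AND I2.item2 = I3.item2      --   (item2, item3) must also be in F2
ORDER BY 1, 2, 3
"""
c3 = conn.execute(C3_QUERY).fetchall()
```

The join step alone would produce (a, b, c), (a, b, d) and (a, c, d); the third copy of F2 removes the last two because (b, d) and (c, d) are not frequent.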


generates (itemset, tid-list) tuples. Initially, tid-lists for individual items are created using a
table function. The tid-list for an itemset is obtained by intersecting the tid-lists of its items
using a user-defined function (UDF). The SQL queries for support counting are similar in
structure to that of the Subquery approach except for the use of UDFs to intersect the
tid-lists. We refer the reader to Section 4.2 for the details.
The increment transaction table δT is transformed into the vertical format by creating
the delta tid-lists of the items. The delta tid-lists are used to count the support of the
candidate itemsets in δT and are later merged with the original tid-lists. This can be
accomplished by joining the original tid-list table with the delta tid-list table and merging
the tid-lists with a UDF. If the incremental algorithm requires a pass over the complete
data, the merged tid-lists are used for support counting.
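The two tid-list operations described here, intersection for support counting and merging of original and delta lists, can be sketched in plain Python. In the SQL-OR formulation these would be UDFs invoked from queries; the function names and the sample tid-lists below are hypothetical.

```python
def intersect_tidlists(*tidlists):
    """Intersect tid-lists; the length of the result is the itemset's support."""
    common = set(tidlists[0])
    for tl in tidlists[1:]:
        common &= set(tl)
    return sorted(common)


def merge_tidlists(original, delta):
    """Merge an item's original tid-list with its delta tid-list."""
    return sorted(set(original) | set(delta))


# Vertical format: one tid-list per item, for the old data and for the increment.
original = {"a": [1, 2, 3], "b": [2, 3], "c": [1, 3]}
delta = {"a": [4, 5], "b": [5], "c": [4]}

# Support of {a, b} within the increment only.
delta_support_ab = len(intersect_tidlists(delta["a"], delta["b"]))

# Merged tid-lists, used if a pass over the complete data is required.
merged = {item: merge_tidlists(original[item], delta.get(item, [])) for item in original}
full_support_ab = len(intersect_tidlists(merged["a"], merged["b"]))
```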
7.4.2 Performance Results
In this section, we report the results of some of our performance experiments to quantify
the advantages of the incremental mining algorithm. Note that incremental mining can
be considered as a special case of relaxing the frequency constraint. These experiments were
performed on Version 5 of IBM DB2 Universal Server installed on a Sun Ultra 5 Model 270
with a 269 MHz UltraSPARC-IIi CPU, 128 MB main memory and a 4 GB disk. We report
the results of the SQL formulations of the incremental algorithm based on the Subquery
and Vertical approaches.
The experiments were performed on synthetic data generated using the technique in
Agrawal and Srikant [13]. The dataset used for the experiment was T10.I4.D100K (Mean
size of a transaction = 10, Mean size of maximal potentially large itemsets = 4, Number of
transactions = 100 thousand). The increment database is created as follows: We generate


[55] J. Gray, A. Bosworth, A. Layman, and H. Pirahesh. Data cube: A relational aggregation
operator generalizing group-by, cross-tab, and sub-total. In Proc. of 1996 Intl
Conference on Data Engineering, New Orleans, USA, February 1996.
[56] J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, F. Pellow, and H. Pirahesh. Data
cube: A relational aggregation operator generalizing group-by, cross-tab and sub-totals.
Data Mining and Knowledge Discovery, 1(1):29-53, 1997.
[57] S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for
large databases. In Proc. of the ACM SIGMOD Conference on Management of Data,
Seattle, Washington, June 1998.
[58] D. Gunopulos, H. Mannila, and S. Saluja. Discovering all most specific sentences by
randomized algorithms. In Proc. of the Sixth International Conference on Database
Theory, January 1997.
[59] P. J. Haas, J. F. Naughton, S. Seshadri, and L. Stokes. Sampling-based estimation
of the number of distinct values of an attribute. In Proceedings of the Eighth
International Conference on Very Large Databases (VLDB), pages 311-22, Zurich,
Switzerland, September 1995.
[60] T. Hagerup and C. Rüb. A guided tour of Chernoff bounds. Information Processing
Letters, 33:305-308, 1989/90.
[61] E. H. Han, G. Karypis, and V. Kumar. Scalable parallel data mining for association
rules. In Proc. of the ACM SIGMOD Conference on Management of Data, Tucson,
Arizona, May 1997.
[62] J. Han, Y. Cai, and N. Cercone. Knowledge discovery in databases: An attribute-oriented
approach. In Proc. of the VLDB Conference, pages 547-559, Vancouver,
British Columbia, Canada, 1992.
[63] J. Han and Y. Fu. Discovery of multiple-level association rules from large databases.
In Proc. of the 21st Intl Conference on Very Large Databases, Zurich, Switzerland,
September 1995.
[64] J. Han, Y. Fu, K. Koperski, W. Wang, and O. Zaiane. DMQL: A data mining query
language for relational databases. In Proc. of the 1996 SIGMOD workshop on research
issues on data mining and knowledge discovery, Montreal, Canada, May 1996.
[65] S. J. Hong. R-MINI: A heuristic algorithm for generating minimal rules from examples.
In 3rd Pacific Rim Intl Conference on AI, August 1994.
[66] M. Houtsma and A. Swami. Set-oriented mining of association rules. In Intl Conference
on Data Engineering, Taipei, Taiwan, March 1995.
[67] T. Imielinski. From file mining to database mining. In Proc. of the 1996 SIGMOD
workshop on research issues on data mining and knowledge discovery, pages 35-39,
May 1996.
[68] T. Imielinski and H. Mannila. A database perspective on knowledge discovery.
Communications of the ACM, 39(11):58-64, Nov 1996.


leverages the query processing capabilities of current relational DBMSs. In the former category,
there have been language proposals to extend SQL with specialized mining operators.
A few examples are (i) the query language DMQL proposed by Han et al. [64] which extends
SQL with a collection of operators for mining characteristic rules, discriminant rules,
classification rules, association rules and so on, (ii) the M-SQL language of Imielinski and
Virmani [69] which extends SQL with a special unified operator Mine to generate and
query a whole set of propositional rules, and (iii) the mine rule operator proposed by Meo
et al. [89] for a generalized version of the association rule discovery problem. However, they
do not address the processing techniques for these operators inside a database engine and
the interaction of the standard relational operators and the proposed extensions. It is also
important to break these operators to a finer level of granularity in order to identify commonalities
between them and derive a set of primitive operators that should be supported
natively in a database engine.
In the second category, researchers have addressed the issue of exploiting the capabilities
of conventional relational systems and their object-relational extensions to execute
mining operations. This entails transforming the mining operations into database queries
and in some cases developing newer techniques that are more appropriate in the database
context. The proposal of Agrawal and Shim [12] for tightly coupling a mining algorithm
with a relational database system makes use of user-defined functions (UDFs) in SQL
statements to selectively push parts of the application that perform computations on data
records into the database system. The objective was to avoid one-at-a-time record retrieval
from the database to the application address space, saving both the copying and process
context switching costs. In the KESO project [116], the focus is on developing a data
mining system which interacts with standard DBMSs. The interaction with the database


Table 5.1. An example of the taxonomy table
Parent            Child
Beverages         Soft drinks
Beverages         Alcoholic drinks
Soft drinks       Pepsi
Soft drinks       Coke
Alcoholic drinks  Beer
Snacks            Pretzels
Snacks            Chocolate bar
We present SQL formulations of ancestor pre-computation and candidate generation in
Sections 5.3 and 5.4 respectively. Several approaches for support counting based on SQL-92
and SQL-OR are presented in Sections 5.6 and 5.7. In Section 5.8, we discuss some
experimental results.
5.1 Input-Output Formats
Input format The transaction table T has the same schema, (tid, item) as in the
case of boolean association rules. While transactions normally contain only items that are
leaves in the taxonomy, it is not a requirement. The taxonomy table Tax has two column
attributes, parent and child. Each record in Tax corresponds to an edge in the taxonomy
DAG. The taxonomy shown in Figure 5.1 is represented by Table 5.1. The number of
children for the interior nodes and the depth of the DAG are unknown at table creation
time and therefore alternative schemas may not be convenient.
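To illustrate why the edge-list schema is convenient, the sketch below loads the edges of Table 5.1 into a Tax table and computes the ancestors of an item with a recursive query, using Python's sqlite3 module as a stand-in DBMS. This is only one possible way to query the edge list and is not claimed to be the ancestor pre-computation of Section 5.3; the Ancestor relation and its column names are assumptions of the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Tax (parent TEXT, child TEXT)")
conn.executemany(
    "INSERT INTO Tax VALUES (?, ?)",
    [
        ("Beverages", "Soft drinks"),
        ("Beverages", "Alcoholic drinks"),
        ("Soft drinks", "Pepsi"),
        ("Soft drinks", "Coke"),
        ("Alcoholic drinks", "Beer"),
        ("Snacks", "Pretzels"),
        ("Snacks", "Chocolate bar"),
    ],
)

# Each Tax record is one edge of the taxonomy DAG; ancestors are obtained by
# following edges upwards, whatever the depth of the DAG.
ANCESTOR_QUERY = """
WITH RECURSIVE Ancestor(ancestor, descendant) AS (
    SELECT parent, child FROM Tax
    UNION
    SELECT Tax.parent, Ancestor.descendant
    FROM Tax JOIN Ancestor ON Tax.child = Ancestor.ancestor
)
SELECT ancestor FROM Ancestor WHERE descendant = ? ORDER BY ancestor
"""
pepsi_ancestors = [row[0] for row in conn.execute(ANCESTOR_QUERY, ("Pepsi",))]
```

Because the schema stores one edge per record, neither the fan-out of interior nodes nor the depth of the DAG has to be known when the table is created.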
Output format The output is a collection of all rules that satisfy the user-specified
minimum support and minimum confidence. The schema of the rules is the same as in the
boolean associations case 3.3.


\[ \frac{t_{DB}(s) + t_{db}(s)}{t_{DB} + t_{db}} < \mathit{minSupport} \]
Therefore, s ∉ F^{DB+}, which is a contradiction.
Lemma 3 Let s be an itemset such that s ∈ NBd(F). Then all possible subsets of s must
be present in F.
Proof: For a contradiction, let t be an itemset such that t ⊂ s and t ∉ F. By the definition
of negative border, NBd(F) consists of the minimal itemsets not in F. Since t ∉ F, s is not
a minimal itemset not in F. Therefore s cannot be in NBd(F), which is a contradiction.

Theorem 1 Let s be an itemset such that s ∉ F^{DB} ∪ NBd(F^{DB}) and s ∈ F^{DB+}. Then
there exists an itemset t such that t ⊂ s, t ∈ NBd(F^{DB}) and t ∈ F^{DB+}. That is, some
subset of s has moved from NBd(F^{DB}) to F^{DB+}.
Proof: Since s ∈ F^{DB+}, all possible subsets of s should be in F^{DB+}. But all the subsets
of s cannot be in F^{DB} because if that was the case, then s should be present in at least
NBd(F^{DB}) if not in F^{DB} itself. By our assumption, s ∉ F^{DB} ∪ NBd(F^{DB}). Therefore,
there exists an itemset t such that t ⊂ s and t ∉ F^{DB}. Now we have two cases.
Case i : t ∈ NBd(F^{DB}).
In this case, t ∈ F^{DB+} since s ∈ F^{DB+} and t ⊂ s. Therefore, we have found a subset of s
which has moved from NBd(F^{DB}) to F^{DB+}.
Case ii : t ∉ NBd(F^{DB}).
That is, t ∉ F^{DB} ∪ NBd(F^{DB}). But, we know that t ∈ F^{DB+} since s ∈ F^{DB+} and t ⊂ s.
Therefore, t ∉ F^{DB} ∪ NBd(F^{DB}) and t ∈ F^{DB+} and hence we can apply the theorem
recursively on t. Note that the size of t is less than the size of s since t ⊂ s.
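The definition of the negative border used in these proofs (the minimal itemsets not in F) can be made concrete with a small brute-force sketch. It assumes F is downward closed, as the set of frequent itemsets always is; the function name negative_border and the example data are illustrative only.

```python
from itertools import combinations


def negative_border(frequent, items):
    """Minimal itemsets over `items` that are not in `frequent`.

    Assumes `frequent` is downward closed (all subsets of a member are members).
    """
    frequent = {frozenset(s) for s in frequent}
    border = set()
    max_size = max((len(s) for s in frequent), default=0) + 1
    for size in range(1, max_size + 1):
        for combo in combinations(sorted(items), size):
            s = frozenset(combo)
            if s in frequent:
                continue
            # s is minimal if every subset with one item removed is frequent
            # (for single items the only such subset is the empty set).
            if all(
                not sub or frozenset(sub) in frequent
                for sub in combinations(combo, size - 1)
            ):
                border.add(s)
    return border


F = [{"a"}, {"b"}, {"c"}, {"a", "b"}, {"a", "c"}]
border = negative_border(F, items={"a", "b", "c", "d"})
```

Here the border is {d} and {b, c}. As Lemma 3 states, every non-empty proper subset of a border itemset is in F: both {b} and {c} are frequent, whereas {a, b, c} is excluded from the border because its subset {b, c} is not.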


operations. Differentiators for the relational engines will be overall scalability, availability
across a broad spectrum of hardware platforms, affinity with legacy data platforms and
stores, availability of extensions for image, text and so on to support the next generation
of applications, and availability of supporting tools.
The investment in building and managing a data warehouse is enormous. With the
advent of business intelligence and decision support systems, it is imperative that the
data warehouse support multiple applications. Figure 1.3 illustrates how a data warehouse
will typically be used in a business organization. The spectrum of applications includes
basic querying and reporting tools, OLAP tools, multidimensional analysis tools and data
mining tools. There are several excellent and popular query/report writers. There are also
tools that support multidimensional analysis directly on relational data stores without the
separate OLAP engine. Note that these tools are like tools in a tool box; that is, they can
be used in combination to produce the desired result. There is no single class of tools that
satisfies the broad range of decision support and business intelligence system requirements.
Therefore, it is crucial that the data mining tools integrate with relational data warehouses
much like the query/report and OLAP tools.
1.4 Goal
The goal of this dissertation is to explore the various issues of database/data warehouse
integration of mining operations. We first study the various architectural alternatives
for coupling data mining with relational database systems, primarily from a performance
perspective. We develop various SQL formulations for association rules [7], a representative
mining problem, and analyze how competitive the SQL implementation can be compared
to other specialized implementations of association rule mining. We further focus on the


[Figure 3.19 plot: x-axis is average transaction length.]
Figure 3.19. Transaction size scale-up
There has been increasing interest in developing scalable data mining techniques [23,
33, 117]. For the scalability of the SQL approaches, we leverage the development effort
spent in making the relational query processors scalable.
3.10 Summary
In this chapter, we presented four SQL-92 formulations for association rule mining. We
analyzed the best approach and developed cost formulae for the different execution plans
of the corresponding SQL queries based on the input data parameters and the relational
operator costs. This cost analysis provides a basis for incorporating the semantics of mining
algorithms into future query optimizers.
While doing the experiments, it was difficult to force the optimizer to choose certain
execution plans and join methods since commercial DBMSs do not provide that level of
control. Postgres was relatively better in that respect, since we could control the choice
of join methods. However, the Postgres optimizer did not optimize long queries, especially
the ones involving nested subqueries, well.




10. Explore primitive operators to support mining and decision support in databases.
8.3 Future Work
The work presented in this dissertation points to several directions for future research.
A natural next step is to experiment with other kinds of mining operations such as classification
and clustering [45], to verify if our conclusions hold for these other cases too. It is
also important to derive a set of primitive operators with which the different mining and
decision support operations can be composed. The operators we have identified provide
some headway in that direction. Another useful direction is to explore what kind of support
is needed for answering short, interactive, ad hoc queries involving a mix of mining and
relational operations. The success of SQL as the most popular data management language
can be mainly attributed to its ad hoc query support. In the mining context, what is an
ad hoc mining query? Is it just mining computations expressed with some constraints on
the result? How much support can we leverage from existing relational engines for mining?
What data model and language extensions are needed? Does the execution of these
operators involve selective materialization and incremental evaluation techniques? Some of
these questions are orthogonal to whether the bulky mining operations are implemented
using SQL or not. Nevertheless, these are important in providing the analysts with a
well-integrated platform where mining and relational operations can be inter-mixed in useful
and flexible ways.
8.4 Closing
In the coming years, database/data warehouse systems will be called upon to deliver
return-on-investment in the form of nuggets of knowledge or better insights about the huge
volumes of data they store and manage. In order to efficiently support the multiplicity of


lattice for further evaluation. For example, after Q1 is evaluated all the records corresponding
to widgets that are bought by at least 10 customers will be piped to Q3 which is
immediately above Q1 in the subquery lattice. The parameter value combinations in this
case are analogous to itemsets in boolean association rule mining.
The negative border in this apriori-based query flock evaluation is the set of parameter
value combinations that does not pass the filter condition test. These combinations if
materialized and stored along with their support counts can be effectively used to update
the result of the query flock when the base tables over which it is defined are updated.
We first check if any of the records in the negative border pass the filter condition as a
result of changes to the base tables. This can be done starting at the bottom element
of the lattice and proceeding upwards, propagating the deltas corresponding to the new
combinations. If none of the records in the negative border passes the filter condition, we
do not have to evaluate the subqueries for the lattice elements above. The base tables
may have to be looked up (for example, to be joined with the deltas) depending upon the
subquery representing the lattice element. It is also possible to evaluate the filter condition
for the negative borders of all the lattice elements together. However, this might involve more
computations than propagating the deltas through the lattice.
7.6.3 View Maintenance
Incremental mining can also be seen as materialized view maintenance. In boolean
association rules, the frequent itemsets and the negative border are in fact aggregate views
over the transaction table. In query flocks each element in the subquery lattice can be
considered as a view defined on the base tables. Therefore, view maintenance techniques
similar to the ones in Mumick et al. [94] can be used for incremental mining. On the other


Table 3.3. Description of synthetic datasets
Datasets       # Records   # Transactions   # Items   Avg. # items
T5.I2.D100K    546651      100000           1000      5
T10.I4.D100K   1154995     100000           1000      10
maximal potentially frequent itemsets (denoted as I) is 2. The transaction table corresponding
to this dataset had approximately 550 thousand records. The second dataset
has 100 thousand transactions, each containing an average of 10 items (total of about 1.1
million records) and the average size of maximal potentially frequent itemsets is 4. The
different parameters of the two datasets are summarized in Table 3.3.
The experiments reported in the rest of this chapter were performed on PostgreSQL
Version 6.3 [100], a public domain DBMS, installed on an 8-processor Sun Ultra Enterprise
4000/5000 with 248 MHz CPUs and 256 MB main memory per processor, running Solaris
2.6. Note that PostgreSQL is not parallelized. It supports nested loops, hash-based and
sort-merge join methods and provides finer control of the optimizer to disable any of the
join methods. We have found it to be a useful platform for studying the performance of
different join methods and execution plans.
3.8 Performance Optimizations
The cost analysis presented above provides some insight into the different components
of the execution time in the different passes and what can be optimized to achieve better
performance. In this section, we present three optimizations to the KwayJoin approach
(other than the subquery optimization) and discuss how they impact the cost. Based on
these optimizations, we develop the Set-oriented Apriori approach in Section 3.8.4.


In order to understand and profitably use the data in a business context, they must
be transformed into information. The warehouse data are stored and managed by one or
more warehouse servers, which present multidimensional views of the data to a variety of
front end tools: query tools, report writers, analysis tools and data mining tools. There is
also a repository which stores metadata derived using the enterprise model, and tools for
monitoring and administering the warehousing system.
The warehouse may be distributed for load balancing, scalability and higher availability.
In such a distributed architecture, the metadata repository is usually replicated with
each fragment of the warehouse, and the entire warehouse is administered centrally. An
alternative architecture is a federation of data warehouses or data marts, each with its own
repository and decentralized administration.
1.2 Data Mining
Data mining, also referred to as knowledge discovery in databases, is the process of
extracting implicit, understandable, previously unknown and potentially useful information
from data. In other words, data mining is the act of drilling through huge volumes of data in
order to discover relationships, or to answer specific questions that are too broad in nature
for traditional query tools. Data mining is also projected as the next step beyond online
analytical processing (OLAP) for querying data warehouses. Rather than seek out known
relationships, it sifts through data for unknown relationships. For instance, consider the
transaction data in a mail-order company stored in the following relations: sales (customer,
widget, state, year), booster(customer, widget, driver), catalog (widget, manufacturer). The
sale information is stored in the sales table and the catalog table stores the widgets of
different manufacturers. The booster table stores the driver that influences the particular


effectively utilized. We develop an incremental algorithm and its SQL formulations for updating
the association rules, based on the negative border concept. It is based on the closure
property of frequent itemsets, that is, all subsets of a frequent itemset are also frequent.
We show that the incremental mining algorithm can be generalized to handle various kinds
of constraints. We categorize the constraints based on their usage in the mining process
and develop a general framework to process them. We further show that the incremental
technique is applicable to other classes of mining and decision support problems such as
sequential patterns, correlation rules, query flocks and so on.
We summarize and list the goals of this dissertation below:
Survey the various data mining problems and algorithms,
Study the different database integration alternatives for data mining,
Develop and implement SQL formulations of mining algorithms,
Analyze the performance profile of current DBMSs to execute the above SQL queries,
Explore the enhancements to current cost-based optimizers to incorporate the
domain-specific semantics of mining,
Develop an incremental association rule mining algorithm and its SQL formulations,
Generalize the incremental algorithm for mining constrained associations and show
its applicability to other data mining and decision support problems, and
Explore primitive operators for mining in databases.
This work has a significant impact on the state-of-the-art in data mining system architectures
and comes at the appropriate time when the data mining community is looking for


generation procedure ensures that C_k is a superset of F_k. The counting phase makes a
pass over the input data-sequences to find the support for each candidate sequence. In the
support counting phase, for each candidate sequence s ∈ C_k, the number of distinct
data-sequences that contain s is counted. Two techniques to perform this operation efficiently
are explained in Srikant and Agrawal [120]: one which uses a hash-tree data structure and
another based on transforming the data-sequences into an alternate representation. The
algorithm terminates when the set of candidate sequences Ck or the frequent sequences Fk
becomes empty.
6.3 Candidate Generation
In each pass k, the candidate k-sequences C_k are generated from the frequent (k-1)-sequences
F_{k-1}. The schema of C_k is the same as that of frequent sequences explained
above in Section 6.1, except that we do not require the len attribute since all the tuples in
Ck have the same length k.
Candidates are generated in two steps. The join step generates a superset of C_k by
joining F_{k-1} with itself. A sequence s_1 joins with s_2 if the subsequence obtained by dropping
the first item of s_1 is the same as the one obtained by dropping the last item of s_2. This
can be expressed in SQL as follows.
insert into C_k select I1.item_1, I1.eno_1, ..., I1.item_{k-1}, I1.eno_{k-1}, I2.item_{k-1},
I1.eno_{k-1} + I2.eno_{k-1} - I2.eno_{k-2}
from F_{k-1} I1, F_{k-1} I2
where I1.item_2 = I2.item_1 and
I1.item_{k-1} = I2.item_{k-2} and
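The join condition stated in the text can also be sketched procedurally. In the Python sketch below a sequence is a tuple of elements and each element is a tuple of items; this representation, the helper names, and the rule for placing the extra item (a new element if it formed its own element in s_2, otherwise an extension of the last element of s_1) are assumptions of the example, and the special case of joining two 1-sequences is not handled.

```python
def drop_first(seq):
    """Remove the first item of the first element."""
    head = seq[0][1:]
    return ((head,) if head else ()) + seq[1:]


def drop_last(seq):
    """Remove the last item of the last element."""
    tail = seq[-1][:-1]
    return seq[:-1] + ((tail,) if tail else ())


def join_sequences(s1, s2):
    """Return the candidate obtained by joining s1 with s2, or None if they do not join."""
    if drop_first(s1) != drop_last(s2):
        return None
    last_item = s2[-1][-1]
    if len(s2[-1]) == 1:
        # The last item of s2 was an element on its own: append a new element.
        return s1 + ((last_item,),)
    # Otherwise it extends the last element of s1.
    return s1[:-1] + (s1[-1] + (last_item,),)


s1 = ((1, 2), (3,))
joined_same_element = join_sequences(s1, ((2,), (3, 4)))
joined_new_element = join_sequences(s1, ((2,), (3,), (5,)))
not_joined = join_sequences(s1, ((1,), (3, 4)))
```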


insert into δB_l select item_1, ..., item_l, count(*)
from (Subquery Q_l) t
group by item_1, ..., item_l
Subquery Q_l (for any l between 1 and k):
select item_1, ..., item_l, tid
from δT t_l, (Subquery Q_{l-1}) as r_{l-1}, C_l
where r_{l-1}.item_1 = C_l.item_1 and ... and
r_{l-1}.item_{l-1} = C_l.item_{l-1} and
r_{l-1}.tid = t_l.tid and
t_l.item = C_l.item_l
Subquery Q_0: No subquery Q_0.
[Figure 7.5 tree diagram: Subquery Q_l joins Subquery Q_{l-1} with C_l on r_{l-1}.item_1, ..., r_{l-1}.item_{l-1} and produces (item_1, ..., item_l, tid), which feeds both the group-by for δB_l and Subquery Q_{l+1}.]
Figure 7.5. Support counting using subqueries
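A small runnable instance of this query pattern for l = 1 and l = 2 is sketched below, using Python's sqlite3 module as a stand-in DBMS. The table names dT (for the increment transaction table), C1 and C2, the sample rows, and the assumption that the 1-item prefix of every candidate in C2 appears in C1 are all choices made for the example. The output of the l = 1 subquery is used twice: once for the group-by that counts 1-item candidates and once as input to the l = 2 subquery.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dT (tid INTEGER, item TEXT);
CREATE TABLE C1 (item1 TEXT);
CREATE TABLE C2 (item1 TEXT, item2 TEXT);
INSERT INTO dT VALUES (1,'a'),(1,'b'),(2,'a'),(2,'b'),(2,'c'),(3,'a'),(3,'c');
INSERT INTO C1 VALUES ('a'),('b');
INSERT INTO C2 VALUES ('a','b'),('a','c');
""")

# Subquery for l = 1: (item1, tid) pairs for 1-item candidates in the increment.
Q1 = """SELECT c.item1 AS item1, t.tid AS tid
        FROM dT t, C1 c
        WHERE t.item = c.item1"""

# Subquery for l = 2: extend the rows of Q1 by one item that matches a candidate in C2.
Q2 = f"""SELECT c.item1 AS item1, c.item2 AS item2, t.tid AS tid
         FROM dT t, ({Q1}) AS r, C2 c
         WHERE r.item1 = c.item1 AND r.tid = t.tid AND t.item = c.item2"""

# Group-bys that count the support of candidates of both sizes.
support_1 = conn.execute(
    f"SELECT item1, COUNT(*) FROM ({Q1}) GROUP BY item1 ORDER BY item1"
).fetchall()
support_2 = conn.execute(
    f"SELECT item1, item2, COUNT(*) FROM ({Q2}) "
    "GROUP BY item1, item2 ORDER BY item1, item2"
).fetchall()
```

In this sketch the l = 1 subquery is simply re-evaluated inside the l = 2 subquery; a system that could pipe one operator's output to several consumers would evaluate it once.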


different attributes of the same row to a format where different measures are scattered as
different rows.
8.1.3 Multiple Streams
In current relational DBMSs, there is only a single stream of control along which data
flows. Support for multiple streams, where the output of an operation can be piped to
more than one subsequent operation, will be useful for a large class of mining and decision
support operations. In the SQL formulation of incremental mining, we show how it will be
useful for counting the support of multiple-sized candidates in one pass. This idea can be
used for association rule mining also to reduce the number of passes utilizing some of the
newer algorithms [26, 127].
In classification, we have to compute the class distribution corresponding to different
split points and attribute value combinations. This will require multiple scans using
standard SQL. The Unpivot operator proposed in Graefe et al. [54] attempts to circumvent
the problem of multiple scans. However, support for multiple streams will make it much
simpler for doing it in one scan of the database. In the OLAP world also this will be useful
for computing the datacube [56] and simultaneous multidimensional aggregates [1, 133].
8.1.4 Sampling
Sampling is a useful technique for scaling up algorithms that handle large volumes of
data. The use of sampling for query size estimation, histogram construction, computing
quantiles and so on has been addressed in several research papers [35, 59].
Sampling can be used to get quick and good approximate answers to the mining and
decision support queries. The algorithms can be first run on a sample to give the user
approximate answers based on which he/she can decide whether to run it on the whole


[Figure 4.9 plot: x-axis is average transaction length.]
Figure 4.9. Scale-up with increasing transaction length
4.6.3 Impact of longer names
In these experiments we assumed that the tids and item-ids are all integers. Often in
practice these are character strings longer than four characters. Longer strings need more
storage and cost more during comparisons. This could hurt all four of the alternatives.
For the Stored-procedure, UDF and Cache-Mine approaches, the time taken to transfer data
will increase. The Intelligent Miner code [71] maps all character fields to integers using an
in-memory hash-table. Therefore, beyond the increase in the data transfer and mapping
costs (which account for the bulk of the time), we do not expect the processing time of
these three alternatives to increase. For the SQL approach, we cannot assume an in-memory
hash-table for doing the mapping; therefore, we use an alternative approach based on table
functions.
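The string-to-integer mapping idea can be illustrated with a short sketch in which a Python dictionary plays the role of the in-memory hash-table. This shows only the general technique, not the Intelligent Miner code or the table-function approach; the function name and the sample rows are hypothetical.

```python
def build_mapping(values):
    """Assign consecutive integer ids to distinct strings, in first-seen order."""
    mapping = {}
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping)
    return mapping


rows = [("t-0001", "chocolate bar"), ("t-0001", "pretzels"), ("t-0002", "pretzels")]

item_ids = build_mapping(item for _, item in rows)
tid_ids = build_mapping(tid for tid, _ in rows)

# After mapping, all later comparisons and joins operate on small integers.
encoded = [(tid_ids[tid], item_ids[item]) for tid, item in rows]
```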
For the SQL approach, we discuss the hybrid approach. The two (already expensive) steps
that could suffer because of longer names are (1) final group-bys during pass 2 or higher
when the Gather Join approach is chosen and (2) tid-list operations when the Vertical
approach is chosen. For efficient performance, the first step requires a mapping of item-ids
and the second one requires us to map tids. We use a table function to map the tids to


[Figure 3.12 plots: two panels, "Pass 2 optimization (T10.I4.D100K)" and "Pass 2 optimization (T5.I2.D100K)", each comparing "With Opt." and "Without Opt." over support values.]
Figure 3.12. Benefit of second pass optimization
that are candidates. T_k has the schema (tid, item_1, ..., item_k). We join T_{k-1} and C_k as
shown below to generate T_k. A tree diagram of the query is also given in Figure 3.13. The
set of frequent itemsets F_k is obtained by grouping the tuples of T_k on the k items and applying
the minimum support filtering.
We can further prune T_k by filtering out item combinations that turned out to be
non-frequent. However, this is not essential since we join it with the candidate set C_{k+1} in

negative border closure may increase the size of the candidate set. However, a majority of
those itemsets would have been present in the original negative border or frequent itemset.
Only those itemsets which were not covered by the negative border need to be checked
against the whole database. As a result, the size of the candidate set in the final scan could
potentially be much smaller as compared to FUP.
7.4 Database Integration of Incremental Mining
In this section, we discuss the various database integration architectures for incremental
mining based on the alternatives presented in Chapter 2. In Section 7.4.1, we present
two SQL-based approaches for incremental frequent itemset computation. The applicability
of the other architectural alternatives is discussed in Section 7.4.4.
7.4.1 SQL Formulations of Incremental Mining
Two categories of SQL formulations for frequent itemset mining based on SQL-92 and
SQL-OR are presented in Chapters 3 and 4 respectively. A cost-based analysis of the SQL-92
approaches and the related performance optimizations are discussed in Thomas and
Chakravarthy [125]. We discuss here how these techniques can be adapted for incremental
mining. Efficient SQL formulation of the incremental algorithm entails counting support
of multiple-sized candidates together in the same pass.
The input transaction data is stored in a relational table T with the schema (tid, item).
The increment transaction table δT also has the same schema. The frequent itemsets and
negative border of size k are stored in tables with the schema (item_1, ..., item_k, count).
We discuss the extensions to Subquery and Vertical, two representative approaches from
the SQL-92 and SQL-OR categories. We also outline the SQL-based candidate closure
computation to compute negative border closure.


experiments because the real-life datasets were not available outside IBM. The synthetic
datasets used in our experiments are detailed below. The main reason for the subquery
approach performing better is that the number of distinct l-item prefixes is much less
compared to the total number of candidate itemsets, which results in joins between smaller
tables. The number of candidate itemsets and the corresponding distinct item prefixes for
various passes in one of our experiments is given in Figure 3.10. These numbers are for the
dataset T10.I4.D100K and 0.33% support. Note that d_k is not shown since it is the same as
C_k. In pass 3, C_3 contains 2453 itemsets whereas d_1 has only 295 1-item prefixes (almost
a factor of 10 less than C3). This results in correspondingly smaller intermediate tables as
shown in the analysis above, which is the key to the performance gain.
Figure 3.10. Number of candidate itemsets vs distinct item prefixes
Experimental datasets We used synthetic data generated according to the procedure
explained in Agrawal and Srikant [13] for some of the experiments. The results reported
here are for the datasets T5.I2.D100K and T10.I4.D100K. The first dataset consists of 100
thousand transactions, each containing an average of 5 items. The average size of the


2. The set of candidate sets C_k is generated by applying the apriori-gen function on
L_{k-1}^{DB+} (frequent (k-1)-itemsets in the updated database). The itemsets in L_k are
pruned from C_k because they have already been handled in the above step. C_k is
further pruned by removing the itemsets that do not have minimum support in db
since they cannot be frequent in the updated database.
3. Scan the whole database to count the support of all the itemsets in C_k and the ones
which have the minimum support are identified as the new frequent k-itemsets.
The speed-up of FUP over Apriori can be mainly attributed to the reduction in the
number of candidate itemsets, which reduces the computation. However, there is no
reduction in the I/O requirements because, like Apriori, FUP may also require O(n) passes
over the database where n is the size of the maximal frequent itemset.
The steps involved in our incremental algorithm can be summarized as follows.
1. Compute the frequent itemsets of the increment database (L^{db}) using any level-wise
algorithm like Apriori or Partition. Simultaneously, count the support in db of all
the frequent itemsets and the negative border.
2. Using the above support counts and the negativeborder-gen function, determine if
the negative border has expanded.
3. In case of a negative border expansion, find the negative border closure and scan the
whole database once to count the support of the newly added itemsets. The ones
with minimum support are added to the set of frequent itemsets.
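The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the names `negative_border` and `border_expansion`, the dictionary storage of support counts, and the comparison `>=` against a fractional `minsup` are all chosen for the example. The sketch covers only the decision of step 2, that is, which itemsets (if any) would require the single full scan of step 3; the negative border closure itself is omitted.

```python
from itertools import combinations


def negative_border(frequent, items):
    """Return the minimal itemsets over `items` that are not in `frequent`.

    `frequent` is a set of frozensets and is assumed to be downward closed.
    """
    # Every 1-itemset that is not frequent belongs to the border.
    border = {frozenset([i]) for i in items} - frequent
    for s in frequent:
        for i in items:
            cand = s | {i}
            if i in s or cand in frequent:
                continue
            # cand is minimal if all of its immediate subsets are frequent.
            if all(frozenset(sub) in frequent
                   for sub in combinations(cand, len(cand) - 1)):
                border.add(cand)
    return border


def border_expansion(old_counts, n_old, inc_counts, n_inc, items, minsup):
    """Combine stored counts with increment counts (hypothetical helper).

    old_counts: support in DB of every itemset in the frequent set and
                the negative border
    inc_counts: support in db of the same itemsets
    Returns (new_frequent, itemsets_needing_a_full_scan).
    """
    threshold = minsup * (n_old + n_inc)  # assumption: fractional minsup, >=
    total = {s: c + inc_counts.get(s, 0) for s, c in old_counts.items()}
    new_frequent = {s for s, c in total.items() if c >= threshold}
    new_border = negative_border(new_frequent, items)
    # Border itemsets without a stored count signal a border expansion.
    return new_frequent, new_border - set(total)
```

If the second component of the result is empty, the negative border has not expanded and the updated frequent itemsets are known without touching the original database.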
The notable feature here is that the whole database is scanned only if required (and
that too only once), thereby reducing the I/O requirements drastically. Computing the


5.4 Transaction extension subquery 94
5.5 Support counting by K-way join 95
5.6 Support counting by GatherJoin 97
5.7 Support counting by GatherExtend 99
5.8 Interior nodes tid-list generation by union 101
5.9 Interior nodes tid-list generation from T* 102
5.10 Comparison of different SQL approaches 103
6.1 Candidate generation for any k 110
6.2 Candidate generation for k = 4 110
6.3 Support counting by K-way join 113
6.4 Subquery optimization for KwayJoin approach 114
6.5 Support counting by Vertical 116
7.1 A high-level description of the apriori-gen function 122
7.2 A high-level description of the negativeborder-gen function 122
7.3 A high-level description of the Update-Frequent-Itemset function 126
7.4 Speed-up of the incremental algorithm 129
7.5 Support counting using subqueries 133
7.6 Speed up of the incremental algorithm based on the Subquery approach 136
7.7 Speed up of the incremental algorithm based on the Vertical approach 136
7.8 Speed up of the incremental algorithm based on the Subquery approach with
the new-candidate optimization 139
7.9 Speed up of the incremental algorithm based on the Vertical approach with
the new-candidate optimization 139
7.10 Point of sales data model 141
7.11 Framework for constrained association mining 144
7.12 Point-of-sales example for constrained association mining 145


The speed-up reduces as the minimum support threshold is lowered. At lower support
values the chances of the negative border expanding are higher and, as a result, the
incremental algorithm may have to compute the candidate closure and count the
support of the new candidates in the whole dataset.
The speed-up is higher for smaller increment sizes since the incremental algorithm
needs to process less data.
With respect to the absolute execution time, the Subquery and the Vertical approaches
followed the same trend as explained in Chapter 4. The Vertical approach
was about 3 to 6 times faster than the Subquery approach.
7.4.3 New-Candidate Optimization
In the basic incremental algorithm, we find the frequent itemsets in the increment
database db along with counting the support of all the itemsets in the frequent set and
the negative border. However, the frequent itemsets in db are used only to prune the
non-frequent itemsets in db while computing the candidate closure. In the candidate closure
computation we assume that the new candidate k-itemsets are frequent while generating
the (k + 1)-itemsets. At this step the new candidate k-itemsets that are infrequent in db
are known to be infrequent in the whole dataset as well and can be pruned. This is because
the new candidate k-itemsets were infrequent in the old dataset (they were not even in the
negative border). Therefore, they need to be frequent at least in db to have a chance of
being frequent in the whole dataset.
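The pruning argument above can be illustrated with a short Python sketch. The function names, the in-memory list of transactions and the `>=` comparison against a fractional `minsup` are assumptions made for the example; in the SQL formulations the same counts would come from the increment table.

```python
def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def prune_new_candidates(new_candidates, increment, minsup):
    """Keep only the new candidates that are frequent in the increment db.

    A new candidate was not frequent in the old database, so it can be
    frequent in the updated database only if it is frequent in db.
    """
    threshold = minsup * len(increment)  # assumption: fractional minsup, >=
    return {c for c in new_candidates
            if support_count(c, increment) >= threshold}
```

Only the surviving candidates are then used to generate the next level of the closure, which keeps the final full-database scan small.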
With the new-candidate optimization, we count the support of an itemset in db only
if it is required. In the first phase, while counting the support in db of the itemsets in the
frequent set and the negative border, we do not find all the frequent itemsets in db. During


Proof: The lemma follows directly from the definition of negative border since any 1-
itemset not in F is a minimal itemset not in F.
7.1.2 Addition of New Transactions
When new transactions are added to the database, an old frequent itemset could
potentially become infrequent in the updated database. Similarly, an old infrequent itemset
could potentially become frequent in the new database.
In order to solve the update problem efficiently, we maintain the frequent itemset
and the negative border along with their support count in the database. That is, for
every s ∈ F ∪ NBd(F), we maintain s.count. In the rest of this section, DB denotes the
original database, db denotes the transactions that are newly added and DB+ denotes the
updated database. Also F^{DB}, F^{db} and F^{DB+} denote the frequent itemsets and NBd(F^{DB}),
NBd(F^{db}) and NBd(F^{DB+}) denote the negative borders of the original database, increment
database and the updated database respectively.
Lemma 2 Let s be any itemset such that s ∉ F^{DB}. Then s ∈ F^{DB+} only if s ∈ F^{db}.
Proof: Assume that there exists an itemset s such that s ∈ F^{DB+}, s ∉ F^{DB} and s ∉ F^{db}.
Let t_{DB}(s) and t_{db}(s) be the number of transactions in DB and db respectively containing
the itemset s. Also let t_{DB} and t_{db} be the total number of transactions in DB and db
respectively. Since s ∉ F^{DB} and s ∉ F^{db},
\[ \frac{t_{DB}(s)}{t_{DB}} < \mathit{minSupport} \quad \text{and} \quad \frac{t_{db}(s)}{t_{db}} < \mathit{minSupport}. \]
From these two equations, it can be shown that


[10] R. Agrawal, G. Psaila, E. L. Wimmers, and M. Zaït. Querying shapes of histories.
In Proc. of the 21st Intl Conference on Very Large Databases, Zurich, Switzerland,
September 1995.
[11] R. Agrawal and J. Shafer. Parallel mining of association rules. IEEE Transactions
on Knowledge and Data Engineering, 8(6), December 1996.
[12] R. Agrawal and K. Shim. Developing tightly-coupled data mining applications on
a relational database system. In Proc. of the 2nd Intl Conference on Knowledge
Discovery in Databases and Data Mining, Portland, Oregon, August 1996.
[13] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. of
the 20th Intl Conference on Very Large Databases, Santiago, Chile, September 1994.
[14] R. Agrawal and R. Srikant. Mining sequential patterns. In Proc. of the 11th Intl
Conference on Data Engineering, Taipei, Taiwan, March 1995.
[15] S. Alexander. Users find tangible rewards digging into data mines. Enterprise Computing,
July 1997. http://www.infoworld.com/cgi-bin/displayArchive.pl7/97/27/e01-27.61.htm.
[16] K. Ali, S. Manganaris, and R. Srikant. Partial classification using association rules.
In Proc. of the 3rd Intl Conference on Knowledge Discovery in Databases and Data
Mining, Newport Beach, California, August 1997.
[17] K. Alsabti, S. Ranka, and V. Singh. CLOUDS: A decision tree classifier for large
datasets. In Proc. of the 4th Intl Conference on Knowledge Discovery and Data
Mining, New York, August 1998.
[18] C. Apte and S. J. Hong. Predicting equity returns from securities data with minimal
rule generation. In KDD-94: AAAI Workshop on Knowledge Discovery in Databases,
pages 407-418, Seattle, Washington, July 1994.
[19] C. Apte and S. Weiss. Data mining with decision trees and decision rules. FGCS
Journal, Special Issue on Data Mining, 1997.
[20] J. Banfield and A. Raftery. Model-based Gaussian and non-Gaussian clustering.
Biometrics, 49:15-34, 1993.
[21] S. Berchtold, C. Bohm, B. Braunmuller, D. A. Keim, and H. Kriegel. Fast parallel
similarity search in multimedia databases. In Proc. of the ACM SIGMOD Conference
on Management of Data, May 1997.
[22] S. Berchtold, D. A. Keim, and H. Kriegel. The X-Tree: An index structure for high
dimensional data. In Proc. of the 22nd Intl Conference on Very Large Databases,
Bombay, India, September 1996.
[23] P. Bradley, U. Fayyad, and C. Reina. Scaling clustering algorithms to large databases.
In Proc. of the 4th Intl Conference on Knowledge Discovery and Data Mining, New
York, August 1998.
[24] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and
Regression Trees. Wadsworth, Belmont, 1984.


the AprioriTid algorithm in Agrawal et al. [9]. Each candidate itemset in C_k, in addition
to attributes (item_1, ..., item_k), has three new attributes (oid, id_1, id_2). oid is a unique
identifier associated with each itemset and id_1 and id_2 are oids of the two itemsets in F_{k-1}
from which the itemset in C_k was generated (as discussed in Section 3.4.1). In addition,
in the kth pass we generate a new copy of the data table T_k with attributes (tid, oid) that
keeps for each tid the oid of each itemset in C_k that it supported. For support counting,
we first generate T_k from T_{k-1} and C_k and then do a group-by on T_k to find F_k as follows.
insert into T_k select t_1.tid, oid
from C_k, T_{k-1} t_1, T_{k-1} t_2
where t_1.oid = C_k.id_1 and t_2.oid = C_k.id_2 and t_1.tid = t_2.tid

insert into F_k select oid, item_1, ..., item_k, cnt
from C_k,
(select oid as cid, count(*) as cnt from T_k
group by oid having count(*) > :minsup) as temp
where C_k.oid = cid
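To make the two queries above concrete, the following sketch runs them for k = 2 on an in-memory SQLite database through Python's standard `sqlite3` module. It is only an illustration: the function name `frequent_pairs`, the concrete table layouts, and the treatment of minimum support as an absolute count compared with `>=` are assumptions, and SQLite stands in for the DB2 system used in the experiments.

```python
import sqlite3


def frequent_pairs(t1_rows, c2_rows, minsup_count):
    """Run the T_k / F_k queries for k = 2 (illustrative helper).

    t1_rows: (tid, oid) pairs, one per frequent item a transaction supports
    c2_rows: (oid, item1, item2, id1, id2) candidate rows
    """
    con = sqlite3.connect(":memory:")
    con.executescript(
        """
        CREATE TABLE T1 (tid INTEGER, oid INTEGER);
        CREATE TABLE T2 (tid INTEGER, oid INTEGER);
        CREATE TABLE C2 (oid INTEGER, item1 TEXT, item2 TEXT,
                         id1 INTEGER, id2 INTEGER);
        CREATE TABLE F2 (oid INTEGER, item1 TEXT, item2 TEXT, cnt INTEGER);
        """
    )
    con.executemany("INSERT INTO T1 VALUES (?, ?)", t1_rows)
    con.executemany("INSERT INTO C2 VALUES (?, ?, ?, ?, ?)", c2_rows)
    # A transaction supports a candidate if it supports both of the
    # itemsets from which the candidate was generated.
    con.execute(
        """
        INSERT INTO T2
        SELECT t1.tid, C2.oid
        FROM C2, T1 AS t1, T1 AS t2
        WHERE t1.oid = C2.id1 AND t2.oid = C2.id2 AND t1.tid = t2.tid
        """
    )
    # Assumption: minimum support is an absolute count compared with >=.
    con.execute(
        """
        INSERT INTO F2
        SELECT C2.oid, item1, item2, cnt
        FROM C2,
             (SELECT oid AS cid, COUNT(*) AS cnt FROM T2
              GROUP BY oid HAVING COUNT(*) >= ?) AS temp
        WHERE C2.oid = cid
        """,
        (minsup_count,),
    )
    rows = con.execute(
        "SELECT item1, item2, cnt FROM F2 ORDER BY item1, item2"
    ).fetchall()
    con.close()
    return rows
```

The join touches only the (tid, oid) table of the previous pass, which is the point of the approach: the original transaction table is not read again.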
3.5.3 Two Group-by
Another way to avoid multi-way joins is to first join T and C_k based on whether the
item of a (tid, item) pair of T is equal to any of the k items of C_k, then do a group-by on
(item_1, ..., item_k, tid) filtering tuples with count equal to k. This gives all (itemset, tid)
pairs such that the tid supports the itemset. Finally, as in the previous approaches, do a
group-by on the itemset (item_1, ..., item_k) filtering tuples that meet the support condition.
insert into F_k select item_1, ..., item_k, count(*)
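The two group-by idea described above can be tried out for k = 2 with the sketch below, which uses Python's standard `sqlite3` module. The query text is a reconstruction from the prose description rather than the query of this section; the function name `two_group_by`, the table layouts and the `>=` comparison against an absolute support count are assumptions. It also assumes that T holds no duplicate (tid, item) pairs.

```python
import sqlite3


def two_group_by(t_rows, c2_rows, minsup_count):
    """Support counting for candidate 2-itemsets with two group-bys.

    t_rows:  (tid, item) pairs of the transaction table T
    c2_rows: (item1, item2) candidate pairs
    """
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE T (tid INTEGER, item TEXT)")
    con.execute("CREATE TABLE C2 (item1 TEXT, item2 TEXT)")
    con.executemany("INSERT INTO T VALUES (?, ?)", t_rows)
    con.executemany("INSERT INTO C2 VALUES (?, ?)", c2_rows)
    rows = con.execute(
        """
        SELECT item1, item2, COUNT(*) AS cnt
        FROM (SELECT C2.item1 AS item1, C2.item2 AS item2, T.tid AS tid
              FROM T, C2
              WHERE T.item = C2.item1 OR T.item = C2.item2
              -- first group-by: keep (itemset, tid) pairs where the
              -- transaction matched all k = 2 items of the candidate
              GROUP BY C2.item1, C2.item2, T.tid
              HAVING COUNT(*) = 2) AS supported
        -- second group-by: count supporting transactions per itemset
        GROUP BY item1, item2
        HAVING COUNT(*) >= ?
        ORDER BY item1, item2
        """,
        (minsup_count,),
    ).fetchall()
    con.close()
    return rows
```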


insert into TidTable select t_2.item, t_2.tid-list
from T* t_1, table(Gather(t_1.item, t_1.tid, :minsup)) as t_2
[Query tree, bottom to top: Order by item, tid; Table function Gather; output t_2.item, t_2.tid-list]
Figure 5.9. Interior nodes tid-list generation from T*
R*
that the average length of a tid-list in this case is where R*j is the sum of support
counts of all the frequent items.
5.8 Performance Results
In this section, we report the results of some of our performance experiments on real-
life datasets. We present only one set of results since our emphasis here is on showing that
the boolean association rule framework can be easily extended to handle more complex
mining computations like generalized association rules.
The experiments reported here were performed on Version 5 of IBM DB2 Universal
Server installed on a RS/6000 Model 140 with a 200 MHz CPU, 256 MB main memory and a
9 GB disk with a measured transfer rate of 8 MB/sec. Figure 5.10 shows the performance of
three approaches: Vertical, GatherJoin (shown as Gcj) and Stored-procedure (shown as
Sproc). The chart shows the preprocessing time and the time taken for the different passes.
For Vertical the preprocessing time includes ancestor pre-computation and tid-list creation
times, whereas for GatherJoin it is just the time for ancestor pre-computation. In the
stored procedure approach, the mining algorithm is encapsulated as a stored procedure [32]


7.6 Applicability Beyond Association Mining
The incremental algorithm we have proposed makes use of the closure property of
frequent itemsets. In this section, we discuss how the incremental approach can be generalized
to other data mining and decision support problems. In Section 7.6.1, we discuss
the applicability of the incremental algorithm to mine closed sets. Generalizations to incremental
evaluation of query flocks and certain kinds of materialized views are described
in Sections 7.6.2 and 7.6.3 respectively.
7.6.1 Mining Closed Sets
All the efficient algorithms for mining association rules exploit the closure property
of frequent itemsets. Minimum support, which characterizes frequent itemsets, is downward
closed: if an itemset has minimum support then all its subsets also have minimum support.
The idea of negative border can be used for all incremental mining problems that possess
closure properties. If the closure property is also incrementally updatable (for instance, support),
it is possible to limit the database access to at most one scan of the whole database
as shown in the incremental frequent itemset mining example. A property is incrementally
updatable if it is possible to derive its value from the corresponding values of different
partitions of the input data. A few examples are COUNT, SUM, MIN and MAX. We
list below a few other data mining problems that have closure properties.
1. Sequential patterns: In sequential pattern mining, the frequent patterns are closed
with respect to minimum support, much like the frequent itemsets. The support is
incrementally updatable.
2. Correlation rules: Correlation rules [25] are upward closed with respect to correlation;
that is, if a set of items S is correlated, so is every superset of S. A set of items is


5.2 Cumulate Algorithm
Srikant and Agrawal [118] present several algorithms: Basic, Cumulate, Stratify and
EstMerge. We picked Cumulate for our SQL-formulations since it has the best performance.
Stratify and EstMerge are sometimes marginally better but they are far too complicated
to merit the additional development cost. Cumulate has the same basic structure as the
Apriori algorithm [9] for boolean associations. It makes multiple passes over the data where
in the kth pass it finds frequent itemsets of size k. Each pass consists of two phases: In the
candidate generation phase, the frequent (k-1)-itemsets, F_{k-1}, found in the previous pass
are used as the seed set to generate candidate k-itemsets (C_k) that are potentially frequent.
In the support counting phase, for each itemset t ∈ C_k, the number of extended transactions
(transactions augmented with all the ancestors of their items) that contain t is counted. At
the end of the pass, the frequent candidates are identified, yielding F_k. The algorithm
terminates when F_k or C_{k+1} becomes empty.
The above basic structure is augmented with a few optimizations. The important ones
are pruning itemsets containing an item and its ancestor and pre-computing the ancestors
for each item. We extend the SQL-based boolean association rule framework in Chapters 3
and 4 with these optimizations.
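The two optimizations named above can be sketched in a few lines of Python. This is an illustration only: the taxonomy is assumed to be given as a child-to-parents dictionary, and the function names `ancestors`, `extend_transaction` and `has_item_and_ancestor` are hypothetical, not part of the SQL formulations of this chapter.

```python
def ancestors(item, parents):
    """All ancestors of `item` in a taxonomy given as child -> parents."""
    seen = set()
    stack = list(parents.get(item, ()))
    while stack:
        a = stack.pop()
        if a not in seen:
            seen.add(a)
            stack.extend(parents.get(a, ()))
    return seen


def extend_transaction(transaction, parents):
    """Augment a transaction with all the ancestors of its items."""
    extended = set(transaction)
    for item in transaction:
        extended |= ancestors(item, parents)
    return extended


def has_item_and_ancestor(itemset, parents):
    """True if the itemset contains some item together with its ancestor.

    Such candidates can be pruned before support counting.
    """
    items = set(itemset)
    return any(ancestors(i, parents) & items for i in items)
```

In practice the ancestor sets would be computed once and stored, as the next section does with the Ancestor table, rather than recomputed for every call as in this sketch.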
5.3 Pre-Computing Ancestors
In this section, we explain how to compute the set of ancestors of each item using SQL.
We call x̂ an ancestor of x if there is a directed path from x̂ to x in Tax. The Ancestor
table is primarily used for (i) pruning candidates containing an item and its ancestor and
(ii) extending the transactions by adding all the ancestors of its items. The ancestor


to S. Seshadri, my master's thesis advisor. My first exposure to database research was while
working with him and he is partially responsible for my pursuing further studies. I owe my
graduate studies to my parents, sisters and brothers for their continual reassurance.
This work was supported in part by the research grants of Sharma Chakravarthy
from the Office of Naval Research and the SPAWAR System Center San Diego, Rome
Laboratory, DARPA and the NSF grant IRI-9528390. During my first year Theodore
Johnson provided support through his research grants until he left the University. His
initial support helped me to come to the United States for graduate studies. The CISE
department also provided me with teaching assistantships in times of need. I gratefully
acknowledge all the support.


transaction correspond to a customer transaction. Note that in other domains, the notion
of a transaction and an itemset appearing in a transaction could be different. Frequent
itemsets, the ones which satisfy the frequency constraint, are defined as {X | f(X) > s}
where f(X) is the frequency of X.
Most of the algorithms for frequent itemset discovery utilize the downward closure
property of itemsets with respect to the frequency constraint; that is, if an itemset is
frequent, then so are all its subsets. Downward closure is a pruning property. Level-wise
algorithms [7] find all itemsets with a given property among itemsets of size k (k-itemsets)
and use this knowledge to explore (k + 1)-itemsets. They start with the assumption that
all (k + 1)-itemsets are potentially frequent (frequency is just an example for a downward
closed property). As the k-itemsets are examined, they prune out some (k + 1)-itemsets
that cannot be frequent. In effect, for pruning, they use the contrapositive of the frequent
itemset definition: if any subset of a (k + 1)-itemset is not frequent, then neither can the
(k + 1)-itemset be. After the pruning, they go through the remaining list, checking each
(k + 1)-itemset for its frequency. The downward closure property is similar to the
anti-monotonicity property defined in Ng et al. [97].
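A minimal Python sketch of this level-wise pruning step follows. It uses the familiar join-and-prune form of candidate generation; the function name `generate_candidates` and the representation of itemsets as sorted tuples are assumptions made for the illustration.

```python
from itertools import combinations


def generate_candidates(frequent_k):
    """Generate (k + 1)-itemsets whose k-subsets are all frequent.

    Two frequent k-itemsets that share their first k - 1 items are joined,
    and the result is kept only if every k-subset is frequent, which is the
    contrapositive of downward closure.
    """
    frequent_k = {tuple(sorted(s)) for s in frequent_k}
    candidates = set()
    for a in frequent_k:
        for b in frequent_k:
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                cand = a + (b[-1],)
                if all(sub in frequent_k
                       for sub in combinations(cand, len(cand) - 1)):
                    candidates.add(cand)
    return candidates
```

For example, with the frequent 2-itemsets {a,b}, {a,c}, {b,c} and {b,d}, the join produces {a,b,c} and {b,c,d}, and the second is pruned because its subset {c,d} is not frequent.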
In the context of the point-of-sale data model, the frequency constraint can be used
to discover products bought together frequently.
Item Constraint
In order to discover goal-oriented associations, it is often required to impose constraints
on the itemsets. For instance, find only the itemsets containing at least one item from a
user-defined subset of items. Let B be a boolean expression over the set of items I. The
problem is to find itemsets that satisfy the constraint B. Typically the item constraint will


In the above example, the constraint on the total price of the transaction is an external
constraint and is applied in the data filtering stage. Since the maximum price of a product
in the desired combination is $50, we can also filter out records that do not satisfy this
condition in the data filtering stage. max(Price) < 50 is an aggregation constraint which
satisfies the closure property and can be applied in the candidate generation phase. The
constraint to include barbecue sauce in the combination and the constraint on the average
price are checked in the post-counting phase since they do not satisfy the closure property.
7.5.3 Incremental Constrained Association Mining
The negative border based incremental mining algorithm is applicable for mining associations
with constraints that are closed with respect to the set inclusion property, that
is, if an itemset satisfies the constraint then so do all its subsets. For incremental mining
of constrained associations also we need to materialize and store all the itemsets in the
negative border and their support counts. The reason why this is enough is that when new
transaction data is added, only the support count of the itemsets could change. We assume
that there is a frequency constraint, which is typically the case in association mining. The
definition of the negative border also remains the same, which is the set of minimal itemsets
that did not satisfy the frequency constraint. We list the various steps of the incremental
algorithm which is very similar to the case with just the frequency constraint except for a
few differences.
1. Find the frequent itemsets in the increment database db. Simultaneously count the
support of all itemsets X ∈ FrequentSets ∪ NegativeBorder in db. For this we use
the framework outlined in Section 7.5.2. The different constraints that are present
can be handled at the various steps in the mining process as shown in Figure 7.11.


6.6.2 GatherJoin
The GatherJoin approach for association rules (refer Section 4.1) can be extended
to mine sequential patterns also. In this approach, we first Gather all the (tid, item) pairs
corresponding to fixed values of sid. We then generate all possible k-sequences using a
table function, join them with C_k and group the join result to count the support of the
candidate sequences. The time constraints are checked on the table function output using
join predicates PRED() as in the KwayJoin approach. The number of tuples generated
by the table function can be reduced by applying the max-gap, min-gap and window-size
constraints inside the table function. However, PRED() is still required to check if the
sequences in C_k are contained in the generated k-sequences. The items in a candidate
sequence are not lexicographically ordered. Therefore, a data-sequence containing n items
can potentially support n^k k-sequences. For real-life datasets, the value of n is large and
hence this approach will generate too many sequences, thereby making it less attractive
in terms of performance. However, an approach similar to GatherCount will be feasible
where we keep the candidate sequences inside the table function and generate only sequences
present in the candidate set. The GatherJoin approach was not implemented for sequential
patterns.
6.7 Taxonomies
The approaches to handle taxonomies for association rules (refer Chapter 5) are applicable
for sequential patterns also. The basic idea is to replace each data-sequence d with
an extended-sequence d'. Each transaction of d is extended with all the ancestors of its
items to get d'. In order to optimize the performance, we pre-compute the ancestors and
prune candidates with an element that contains both an item and its ancestor.


[Figure: two panels, "T10.I4.D100K: CPU time" and "T10.I4.D100K: I/O time", showing the time per pass (Pass 1 to Pass 7) at support values of 2%, 1%, 0.75% and 0.33%]
Figure 3.17. Comparison of CPU and I/O times
normalized with respect to the times for the 10,000 transaction datasets. It can be seen
that the execution times scale quite linearly and both the datasets exhibit similar scale-up
behavior.
The scale-up with increasing transaction size is shown in Figure 3.19. In these experiments
we kept the physical size of the database roughly constant by keeping the product
of the average transaction size and the number of transactions constant. The number of
transactions ranged from 200,000 for the database with an average transaction size of 5 to


insert into F_2 select tt_2.T_itm1, tt_2.T_itm2, count(*)
from (select * from T, F_1 where T.item = F_1.item_1) as tt_1,
table (GatherComb-2(tid, item)) as tt_2
group by tt_2.T_itm1, tt_2.T_itm2
having count(*) > :minsup
[Query tree, bottom to top: T joined with F_1 on T.item = F_1.item_1; Table function GatherComb-K; Group by tt_2.T_itm1, tt_2.T_itm2; having count(*) > :minsup]
Figure 4.2. Support Counting by GatherJoin in the second pass
need to maintain hash-tables to index the C_k's. The disadvantage of this approach is that
it can require a large amount of memory to store C_2. If enough memory is not available,
C_2 needs to be partitioned and the process has to be repeated for each partition. Another
serious problem with this approach is that it cannot be automatically parallelized unlike
the other approaches.
GatherPrune
A potential problem with the GatherJoin approach is the high cost of joining the
large number of item combinations with C_k. We can push the join with C_k inside the table
function and thus reduce the number of such combinations. C_k is converted to a BLOB
and passed as an argument to the table function.
The cost of passing the BLOB for every tuple of T could be high. In general, we can
reduce the parameter passing cost by using a smaller Blob that only approximates the real


Subquery Approach
In this approach, support counting is done by a set of k nested subqueries where k is
the size of the largest itemset. We present here the extensions to the subquery approach in
Section 3.5.4 to count candidates of different sizes. Subquery Q_l finds all tids that support
the distinct candidate l-itemsets (d_l). d_l comprises the distinct l-item prefixes of all
candidate itemsets. However, it is sufficient to use C_l, the candidate l-itemsets, instead of
d_l since all l-item prefixes of candidate itemsets with more than l items will be present
in C_l. The output of Q_l is grouped on the l items to find the support of the candidate
l-itemsets. Q_l is also joined with ST (T while counting the support in the whole database)
and C_{l+1} to get Q_{l+1}. The SQL queries and the corresponding tree diagrams for the above
computations are given in Figure 7.5. SB_l stores the support counts of all frequent and
negative border itemsets in ST.
The output of subquery Q_l needs to be materialized since it is used for counting the
support of l-itemsets and for generating Q_{l+1}. If the query processor is augmented to
support multiple streams where the output of an operator can be piped into more than one
subsequent operator, the materialization of the Q_l's can be avoided. In the basic association
rule mining, we do not have to count itemsets of different sizes in the same pass since C_{l+1}
becomes available only after the frequent l-itemsets are computed.
Tables B_l and SB_l store the frequent and negative border l-itemsets and their support
counts in T and ST respectively. Support counts of itemsets in the whole database can be
computed by joining B_l and SB_l and adding the corresponding support counts. We add
another attribute to B_l to keep track of promoted borders (itemsets that moved from the
negative border to the frequent set).
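The join that combines the two sets of support counts can be illustrated for 2-itemsets with Python's standard `sqlite3` module. The sketch is not the query used in this work: the function name `merged_counts` and the column names are assumptions, and the left join with a default of 0 is a choice made here so that itemsets with no supporting transaction in the increment are retained.

```python
import sqlite3


def merged_counts(b_rows, sb_rows):
    """Add the stored counts (B) and the increment counts (SB) per itemset.

    b_rows:  (item1, item2, cnt) for frequent and negative border 2-itemsets
             in the original table
    sb_rows: (item1, item2, cnt) for the same itemsets in the increment
    """
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE B2 (item1 TEXT, item2 TEXT, cnt INTEGER)")
    con.execute("CREATE TABLE SB2 (item1 TEXT, item2 TEXT, cnt INTEGER)")
    con.executemany("INSERT INTO B2 VALUES (?, ?, ?)", b_rows)
    con.executemany("INSERT INTO SB2 VALUES (?, ?, ?)", sb_rows)
    rows = con.execute(
        """
        SELECT B2.item1, B2.item2, B2.cnt + COALESCE(SB2.cnt, 0) AS total
        FROM B2 LEFT JOIN SB2
          ON B2.item1 = SB2.item1 AND B2.item2 = SB2.item2
        ORDER BY B2.item1, B2.item2
        """
    ).fetchall()
    con.close()
    return rows
```

Comparing each total against the minimum support of the updated database then identifies the promoted borders without reading the original transactions.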


insert into F_1 (select item, 1, count(*)
from (select distinct sid, item
from D) as t
group by item
having count(*) > :minsup)
6.5.2 Subquery Optimization
The subquery optimization for association rules can be applied for sequential patterns
also by splitting the support counting query in pass k into a cascade of k subqueries. The
predicates p_ij can be applied either on the output of subquery Q_k or sprinkled across
the different subqueries (shown as SubQ-PRED(l) in Figure 6.4). The predicate
SubQ-PRED(l) is a conjunction of terms p_ij(l) corresponding to each pair of items from d_l, the
distinct l-item prefixes of C_k. p_ij(l) is expressed as
(d_l.eno_j = d_l.eno_i and abs(time_j - time_i) < window-size) or
(d_l.eno_j = d_l.eno_i + 1 and time_j - time_i < max-gap and
time_j - time_i > min-gap) or
(d_l.eno_j > d_l.eno_i + 1)
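The predicate can be written as a small Python function to see how the three cases interact. This is a sketch under assumptions: the function name `pair_predicate` and its argument order are invented for the illustration, and the comparisons are taken as strict exactly as they appear in the expression above; if the intended bounds are inclusive, `<` becomes `<=` in the first two cases.

```python
def pair_predicate(eno_i, time_i, eno_j, time_j,
                   window_size, min_gap, max_gap):
    """Evaluate p_ij for one pair of items of a candidate prefix.

    eno_i, eno_j:   element numbers of the two items in the candidate
    time_i, time_j: transaction times of the matching data items
    """
    return (
        # same element: the two items must fall within the sliding window
        (eno_j == eno_i and abs(time_j - time_i) < window_size)
        # adjacent elements: the time difference must respect both gaps
        or (eno_j == eno_i + 1
            and time_j - time_i < max_gap
            and time_j - time_i > min_gap)
        # elements further apart are not constrained by this term
        or (eno_j > eno_i + 1)
    )
```

SubQ-PRED(l) would then be the conjunction of this function over all pairs of items in the l-item prefix.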
6.6 Support Counting Using SQL-OR
6.6.1 Vertical
This approach is similar to the Vertical approach for association rules (refer
Section 4.2), where we convert the transaction table into a vertical format. For each item, we
create a BLOB (s-list) containing all (sid, time) pairs corresponding to that item. We use
a table function Gather for creating the s-lists. The sequence table D is scanned in the


[Figure: query tree joining copies of F_{k-1} (aliases I1, I2, ..., Ik) on item and eno predicates; the individual join predicates are not recoverable from the extracted text]
Figure 6.1. Candidate generation for any k
[Figure: the corresponding query tree for k = 4, joining copies of F_3 (aliases I1 to I4)]
Figure 6.2. Candidate generation for k = 4


When this is applied recursively, there are two possibilities. First, for some subset of
t, case i holds true, in which case there is a subset of t which has moved from NBd(F^{DB}) to
F^{DB+}, and hence the theorem is proved. Otherwise, t will finally become a 1-itemset. By
Lemma 1, we know that all 1-itemsets are present in F^{DB} ∪ NBd(F^{DB}). Since t ∉ F^{DB},
t ∈ NBd(F^{DB}), which contradicts the assumption for case ii. That is, case ii is not possible
if t is a 1-itemset.
By Theorem 1, if none of the itemsets move from the negative border to the frequent
itemset, we do not need to scan the whole database. Even in cases where some itemsets
move from the negative border to the frequent itemset, a complete database scan is required
only if the negative border expands because, for all the itemsets in the negative border, we
can derive the updated support count without a database scan.
We maintain the support count for all itemsets in the frequent itemset and the negative
border. First, we compute the frequent itemset in db using a level-wise algorithm like Apriori
or Partition. Simultaneously we count the support for all itemsets in F^{DB} ∪ NBd(F^{DB})
in db. If an itemset t ∈ F^{DB} does not have minimum support in DB ∪ db, then t is
removed from F^{DB}. This can be easily checked since we know the support count for t in
DB and db. The change in F^{DB} could potentially change NBd(F^{DB}) also. Therefore, we
have to recompute the negative border using the negativeborder-gen function explained in
subsection 7.1.1.
On the other hand there could be some new itemsets which become frequent in the
updated database. Let s be an itemset which gets added to the frequent itemset of the
updated database. By Lemma 2, we know that s has to be in F^{db}. We also know by
Theorem 1 that some subset of s must move from NBd(F^{DB}) to F^{DB+}. For each itemset
s ∈ F^{db}, we check if s gets the minimum support to move from NBd(F^{DB}) to F^{DB+}. If


[84] H. Mannila, H. Toivonen, and A. I. Verkamo. Discovering frequent episodes in sequences.
In Proc. of the 1st Intl Conference on Knowledge Discovery in Databases
and Data Mining, Montreal, Canada, August 1995.
[85] M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A fast scalable classifier for data
mining. In Proc. of the Fifth Intl Conference on Extending Database Technology
(EDBT), Avignon, France, March 1996.
[86] M. Mehta, J. Rissanen, and R. Agrawal. MDL-based decision tree pruning. In Proc.
of the 1st Intl Conference on Knowledge Discovery in Databases and Data Mining,
Montreal, Canada, August 1995.
[87] J. Melton and N. Mattos. SQL3 - a tutorial. Twenty-second international conference
on Very large data bases, tutorial, September 1996.
[88] J. Melton and A. Simon. Understanding the new SQL: A complete guide. Morgan
Kaufmann, 1992.
[89] R. Meo, G. Psaila, and S. Ceri. A new SQL like operator for mining association rules.
In Proc. of the 22nd Intl Conference on Very Large Databases, Bombay, India,
September 1996.
[90] R. S. Michalski. A theory and methodology of inductive learning. In Michalski
et al., editors, Machine Learning, An Artificial Intelligence Approach, Vol. 1. Morgan
Kaufmann, 1983.
[91] D. Michie, D. J. Spiegelhalter, and C. C. Taylor. Machine Learning, Neural and
Statistical Classification. Ellis Horwood, 1994.
[92] R. J. Miller and Y. Yang. Association rules over interval data. In Proc. of the ACM
SIGMOD Conference on Management of Data, pages 452-461, May 1997.
[93] C. Mohan, D. Haderle, Y. Wang, and J. Cheng. Single table access using multi
ple indexes: optimization, execution, and concurrency control techniques. In Proc.
International Conference on Extending Database Technology, pages 29-43, 1990.
[94] I. Mumick, D. Quass, and B. Mumick. Maintenance of data cubes and summary
tables in a warehouse. In Proc. of the ACM SIGMOD Conference on Management of
Data, Tucson, Arizona, May 1997.
[95] R. Musick, J. Catlett, and S. Russell. Decision-theoretic subsampling for induction
on large datsets. In 9th Intl Conference on Machine Learning, 1993.
[96] J. P. Nearhos, M. J. Rothman, and M. S. Viveros. Applying data mining techniques
to a health insurance information system. In Proc. of the 22nd Intl Conference on
Very Large Databases, Mumbai (Bombay), India, September 1996.
[97] R. T. Ng, L. V. S. Lakshmanan, J. Han, and A. Pang. Exploratory mining and
pruning optimizations of constrained association rules. In Proc. of the ACM SIGMOD
Conference on Management of Data, Seattle, Washington, June 1998.
[98] Oracle. Oracle RDBMS Database Administrators Guide Volumes I, II (Version 7.0),
May 1992.


In the support counting phase, we join the transaction table T with the AncListTable and
the resulting (tid, item, ancestor-list) tuples are passed to a table function GExtendComb-K
in tid order. The table function collects all the items and the corresponding ancestor-lists
of the same transaction, extends it and outputs $k$-item combinations of the extended
transaction. Figure 5.7 illustrates the SQL queries for this approach. This approach was not
implemented because it might not perform better than the Vertical approach explained
next.
This approach can be combined with the GatherPrune approach (refer to Section 4.1)
to prune out item combinations that are not candidates. In that case, we can also delete
from the extended transactions items that are not present in any of the candidates, before
generating the item combinations.
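
The behavior described for the table function can be sketched in Python. This is only an illustration under stated assumptions: the function name `g_extend_comb_k`, the row layout, and the sample rows are hypothetical, and the sketch does not attempt any of the pruning discussed above (for example, it does not drop combinations that contain both an item and its ancestor).

```python
from itertools import combinations, groupby
from operator import itemgetter


def g_extend_comb_k(rows, k):
    """Yield (tid, k-item combination) pairs from extended transactions.

    rows: iterable of (tid, item, ancestor_list) tuples, already in tid order,
    standing in for the join of the transaction table with AncListTable.
    """
    for tid, group in groupby(rows, key=itemgetter(0)):
        extended = set()
        for _, item, ancestors in group:
            extended.add(item)           # the item itself
            extended.update(ancestors)   # plus all of its ancestors
        # Emit every k-item combination of the extended transaction.
        for combo in combinations(sorted(extended), k):
            yield tid, combo


# Hypothetical join result, in tid order (item names follow Figure 5.1).
rows = [
    (1, "Pepsi", ["Soft drinks", "Beverages"]),
    (1, "Pretzels", ["Snacks"]),
    (2, "Beer", ["Alcoholic drinks", "Beverages"]),
]
pairs = list(g_extend_comb_k(rows, 2))
```

Transaction 1 extends to five items and transaction 2 to three, so the sketch emits 10 + 3 pairs. Writing it as a generator mirrors the pipelined, one-pass nature of a table function.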
5.7.3 Cost Analysis
In the cost analysis we use the notations of Table 5.2 in addition to the notations
introduced for boolean associations in Table 3.2.
Table 5.2. Additional notations used for cost analysis
Symbol     Description
D          number of records in the input taxonomy table
d          average depth of the taxonomy
l          number of leaf nodes in the taxonomy
A          number of records in the Ancestor table (l × d)
N*         average number of items in an extended transaction (N × d)
R*         total number of records after transaction extension (R × d)
e_k        cost of generating a k-item combination using GExtendComb-K
order(n)   cost of sorting n records
The cost of GatherJoin includes the cost of extending the transactions by joining the
transaction table with the ancestor table, generating $k$-item combinations, joining them
with $C_k$ and grouping the join result to count the support. Note that the transaction


is restricted to two-way table queries, a special kind of aggregate query. Two-way tables,
which are used in the mining process, have sets of source and target attributes and an
associated count. Association rule mining was formulated as SQL queries in the SETM
algorithm [66]. However, it does not use the subset property (all subsets of a frequent
itemset are frequent) for candidate generation. As a result, SETM counts a large number
of candidate itemsets in the support counting phase and hence is not efficient [9]. Query
flocks generalize boolean association rules to mine associations across relational tables. A
query flock is a parameterized query with a filter condition to eliminate the values of
parameters that are uninteresting. Tsur et al. [128] present the use of query flocks for mining
and emphasize the need for incorporating the a-priori technique into new generation query
optimizers to handle mining queries efficiently.
2.2 Architectural Alternatives
The various architectural alternatives for integrating mining with relational systems,
proposed in [109], can be categorized as shown in Figure 2.1. It shows a taxonomy of
various alternatives from loose-coupling to an integrated approach. In the remainder of
this section, we describe each of the alternatives.
2.2.1 Loose-Coupling
This is an example of integrating mining applications into the client in a client/server
architecture or into the application server in a multi-tier architecture. The mining kernel
can be considered as the application server. In this approach, data are read tuple by tuple
from the DBMS to the mining kernel using a cursor interface. The intermediate and the
final results are stored back into the DBMS. The data are never copied on to a file system.
Instead, the DBMS is used as a file system. This is the approach followed by most existing


we discuss how the negative border idea can be used for efficient incremental evaluation of
query flocks.
QUERY:
answer(C) :- sale(C, $W) AND
             driver(C, $W, $D) AND
             catalog($W, manufacturer)
FILTER:
COUNT(C) > 10
Incremental Evaluation of Query Flocks
Applying the apriori technique for evaluating the above query flock will result in the
following safe subqueries [128].
1. Q1: answer(C) :- sale(C, $W)
2. Q2: answer(C) :- driver(C, $W, $D)
3. Q3: answer(C) :- sale(C, $W) AND driver(C, $W, $D)
4. Q4: answer(C) :- sale(C, $W) AND driver(C, $W, $D) AND
   catalog($W, manufacturer)
The query flocks corresponding to the safe subqueries form a lattice with query containment
as the partial order and the original query flock as the top element. That is, if Q1 and Q2
are elements of the lattice, Q1 ≤ Q2 ⇔ the result of Q2 is contained in the result of Q1.
During the execution of the subqueries of the query flock, all records with parameter
values which satisfy the filter condition are propagated to the next higher subquery in the


a set of items such that $T \subseteq I$. Each transaction has a unique identifier, called its tid.
An association rule is an implication of the form $X \Rightarrow Y$, where $X \subset I$, $Y \subset I$, and
$X \cap Y = \emptyset$. The rule $X \Rightarrow Y$ holds in the transaction set $\mathcal{D}$ with confidence $c$ if $c\%$ of
transactions in $\mathcal{D}$ that contain $X$ also contain $Y$. The rule $X \Rightarrow Y$ has support $s$ in the
transaction set $\mathcal{D}$ if $s\%$ of transactions in $\mathcal{D}$ contain $X \cup Y$. Given a set of transactions $\mathcal{D}$,
the problem of mining association rules is to generate all association rules that have support
and confidence greater than the user-specified minimum support and minimum confidence.
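
As a small illustration of these two definitions, the following Python sketch computes the support and confidence of a single rule over a toy transaction set. The function name and the sample transactions are hypothetical and are not taken from the dissertation.

```python
def support_and_confidence(transactions, x, y):
    """Return (support %, confidence %) of the rule X => Y.

    transactions: list of frozensets of items; x and y: disjoint item collections.
    """
    x, y = frozenset(x), frozenset(y)
    n_x = sum(1 for t in transactions if x <= t)           # transactions containing X
    n_xy = sum(1 for t in transactions if (x | y) <= t)    # transactions containing X u Y
    support = 100.0 * n_xy / len(transactions)
    confidence = 100.0 * n_xy / n_x if n_x else 0.0
    return support, confidence


# Hypothetical toy data.
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]
s, c = support_and_confidence(transactions, {"bread"}, {"milk"})
```

Here two of the four transactions contain both bread and milk, and three contain bread, so the rule bread => milk has 50% support and about 66.7% confidence.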
The association rule mining problem has attracted tremendous attention from data
mining researchers and several algorithms have been proposed for it [13, 26, 66, 111, 131].
Researchers have also proposed several variants of the basic association rule mining to
handle taxonomies over items [63, 118], numeric attributes [76, 92, 119] and constraints on
items appearing in rules [97, 121].
1.2.2 Sequential Patterns
Sequential pattern mining was first introduced in Agrawal and Srikant [14] and further
generalized in Srikant and Agrawal [120]. Given a set of data-sequences each of which is a list
of transactions ordered by the transaction time, the problem of mining sequential patterns is
to discover all sequences with a user-specified minimum support. Each transaction contains
a set of items. A sequential pattern is an ordered list (sequence) of itemsets. The itemsets
that are contained in the sequence are called elements of the sequence. For example,
⟨(computer, modem)(printer)⟩ is a sequence with two elements: (computer, modem) and
(printer). The support of a sequential pattern is the percentage of data-sequences that
contain the sequence. A sequential pattern can be further qualified by specifying maximum
and/or minimum time gaps between adjacent elements and a sliding time window within


[114] J. Shafer, R. Agrawal, and M. Mehta. SPRINT: A scalable parallel classifier for data
mining. In Proc. of the 22nd Int'l Conference on Very Large Databases, Bombay,
India, September 1996.
[115] K. Shim, R. Srikant, and R. Agrawal. High-dimensional similarity joins. In Proc. of
the 13th Int'l Conference on Data Engineering, Birmingham, U.K., April 1997.
[116] A. Siebes and M. L. Kersten. KESO: Minimizing database interaction. In Proc. of
the 3rd Int'l Conference on Knowledge Discovery and Data Mining, Newport Beach,
California, August 1997.
[117] C. Silverstein, S. Brin, R. Motwani, and J. Ullman. Scalable techniques for mining
causal structures. In Proc. of the VLDB Conference, New York, August 1998.
[118] R. Srikant and R. Agrawal. Mining generalized association rules. In Proc. of the 21st
Int'l Conference on Very Large Databases, Zurich, Switzerland, September 1995.
[119] R. Srikant and R. Agrawal. Mining quantitative association rules in large relational
tables. In Proc. of the ACM SIGMOD Conference on Management of Data, Montreal,
Canada, June 1996.
[120] R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and
performance improvements. In Proc. of the Fifth Int'l Conference on Extending Database
Technology (EDBT), Avignon, France, March 1996.
[121] R. Srikant, Q. Vu, and R. Agrawal. Mining association rules with item constraints.
In Proc. of the 3rd Int'l Conference on Knowledge Discovery in Databases and Data
Mining, Newport Beach, California, August 1997.
[122] M. R. Stonebraker and G. Kemnitz. The POSTGRES next generation database
management system. Communications of the ACM, 34(10), 1991.
[123] S. Thomas, S. Bodagala, K. Alsabti, and S. Ranka. An efficient algorithm for the
incremental updation of association rules in large databases. In Proc. of the 3rd Int'l
Conference on Knowledge Discovery and Data Mining, Newport Beach, California,
August 1997.
[124] S. Thomas and S. Chakravarthy. Incremental mining of constrained associations.
Technical Report TR 98-018, University of Florida, Gainesville, Florida, October
1998.
[125] S. Thomas and S. Chakravarthy. Performance evaluation and optimization of join
queries for association rule mining. Technical Report TR 98-017, University of Florida,
Gainesville, Florida, October 1998.
[126] S. Thomas and S. Sarawagi. Mining generalized association rules and sequential
patterns using SQL queries. In Proc. of the 4th Int'l Conference on Knowledge Discovery
and Data Mining, New York, August 1998.
[127] H. Toivonen. Sampling large databases for association rules. In Proc. of the 22nd
Int'l Conference on Very Large Databases, pages 134-145, Mumbai (Bombay), India,
September 1996.


all three options Cache-Mine, Stored-procedure and UDF require data in each pass to be
grouped on the tid. In a relational DBMS we cannot assume any order on the physical
layout of a table, unlike in a file system. Therefore, we need either an index on the data
table or need to sort the table every time to ensure a particular order. Let R denote the
total number of (tid,item) pairs in the data table. Either option has a space overhead of
2 x R integers. The Cache-Mine approach caches the data in an alternative binary format
where each tid is followed by all the items it contains. Thus, the size of the cached data
in Cache-Mine is at most: R + T integers where T is the number of transactions. For SQL
we use the hybrid Vertical option. This requires creation of an initial TidTable of size at
most I + R where I is the number of items. Note that this is slightly less than the cache
required by the Cache-Mine approach. The SQL approach needs to sort data in pass 1 in
all cases and pass 2 in some cases where we used the GatherJoin approach instead of the
Vertical approach. This explains the large space requirement for Dataset-B. However, in
practice when the item-ids or tids are character strings instead of integers, the extra space
needed by Cache-Mine and SQL is a much smaller fraction of the total data size because
before caching we always convert item-ids to their compact integer representation and store
in binary format.
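
The two size bounds quoted above (R + T integers for the horizontal cache and I + R integers for the initial TidTable) can be checked on a toy table. The sketch below is illustrative only: `layout_sizes` and the sample pairs are hypothetical, and it ignores any pruning of non-frequent items, which is why the text states the bounds as "at most".

```python
from collections import defaultdict


def layout_sizes(pairs):
    """Return (horizontal, vertical) sizes, in integers, for (tid, item) pairs."""
    by_tid = defaultdict(list)
    by_item = defaultdict(list)
    for tid, item in pairs:
        by_tid[tid].append(item)
        by_item[item].append(tid)
    # Horizontal cache: each tid followed by its items -> T + R integers.
    horizontal = sum(1 + len(items) for items in by_tid.values())
    # Vertical TidTable: each item followed by its tid-list -> I + R integers.
    vertical = sum(1 + len(tids) for tids in by_item.values())
    return horizontal, vertical


# Hypothetical data: R = 7 pairs, T = 4 transactions, I = 3 items.
pairs = [(1, 10), (1, 20), (2, 10), (3, 10), (3, 30), (4, 20), (4, 10)]
horizontal, vertical = layout_sizes(pairs)
```

With fewer distinct items than transactions, as is typical for basket data, the vertical layout comes out slightly smaller, which matches the remark in the text.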
4.7 Summary of Comparison Between Different Architectures
In Table 4.1 we present a summary of the pros and cons of the different architectures
by ranking them on a scale of 1 (good) to 4 (bad) on each of the following yardsticks: (a)
performance (execution time); (b) storage overhead; (c) scope for automatic parallelization;
(d) development and maintenance ease; (e) portability; and (f) inter-operability.


from the DBMS. The data is copied in the horizontal format where each tid is followed by
an encoding of all its frequent items.
For the UDF-architecture, we use the UDF implementation of the Apriori algorithm
described in Agrawal and Shim [12]. In this implementation, first a UDF is used to initialize
state and allocate memory for candidate itemsets. Next, for each pass a collection of UDFs
is used for generating candidates, counting support, and checking for termination. These
UDFs access the initially allocated memory, the address of which is passed around in BLOBs.
Candidate generation creates the in-memory hash-trees of candidates. This happens
entirely in the UDF without any involvement of the DBMS. During support counting, the
data table is scanned sequentially and for each tuple a UDF is used for updating the counts
on the memory resident hash-tree.
4.6.1 Timing Comparison
In Figure 4.7, we show the performance of the four architectural alternatives: Cache-Mine,
Stored-procedure, UDF and our best SQL implementation for the datasets in Table 3.1.
We do not show the times for the Loose-coupling option because its performance was very
close to the Stored-procedure option. For each dataset three different support values are
used. The total time is broken down by the time spent in each pass.
We can make the following observations.
Cache-Mine has the best or close to the best performance in all cases. 80-90% of its
total time is spent in the first pass where data is accessed from the DBMS and cached
in the file system. Compared to the SQL approach this approach is a factor of 0.8 to
2 times faster.


CHAPTER 5
GENERALIZED ASSOCIATION RULES
In most real-life applications the set of items that appear in transactions can be
categorized according to a taxonomy (is-a hierarchy) on the items. The taxonomy shown in
Figure 5.1 says that Pepsi is-a soft drink is-a beverage and so on. In general, a taxonomy
can be represented as a directed acyclic graph (DAG). Given a set of transactions T each of
which is a set of items, and a taxonomy Tax, the problem of mining generalized association
rules is to discover all rules of the form $X \Rightarrow Y$, with the user-specified minimum support
and minimum confidence. X and Y can be sets of items at any level of the taxonomy, such
that no item in Y is an ancestor of any item in X [118]. For example, there might be a
rule which says that 50% of transactions that contain Soft drinks also contain Snacks; 5%
of all transactions contain both these items.
[Figure 5.1: taxonomy tree. Beverages: Soft drinks (Pepsi, Coke), Alcoholic drinks (Beer). Snacks: Pretzels, Chocolate bar.]
Figure 5.1. Example of a taxonomy
In this chapter, we present several SQL formulations of generalized association rule
mining [126]. In Section 5.1, we describe the input-output formats and in Section 5.2,
we briefly outline the Cumulate algorithm for generalized association rule mining [118].


3.10 Number of candidate itemsets vs distinct item prefixes 44
3.11 Reduction in transaction table size by non-frequent item pruning 47
3.12 Benefit of second pass optimization 49
3.13 Generation of 50
3.14 Benefit of reusing item combinations 52
3.15 Space requirements of the set-oriented apriori approach 53
3.16 Comparison of Subquery and Set-oriented Apriori approaches 54
3.17 Comparison of CPU and I/O times 56
3.18 Number of transactions scale-up 57
3.19 Transaction size scale-up 58
4.1 Support counting by GatherJoin 62
4.2 Support Counting by GatherJoin in the second pass 63
4.3 Tid-list creation 66
4.4 Support counting using UDF 68
4.5 Comparison of four SQL-OR approaches: Vertical, GatherPrune, GatherJoin
and GatherCount 72
4.6 Effect of increasing transaction length (average number of items per transaction) 74
4.7 Comparison of four architectures 78
4.8 Scale-up with increasing number of transactions 80
4.9 Scale-up with increasing transaction length 81
4.10 Comparison of different architectures on space requirements 84
5.1 Example of a taxonomy 89
5.2 Pre-computing ancestors 92
5.3 Generation of C2 93


unique integers efficiently in one pass and without making extra copies. The input to the
table function is the data table in the tid order. The table function remembers the previous
tid and maintains a counter. Every time the tid changes, the counter is incremented.
This counter value is the mapping assigned to each tid. We need to do the tid mapping only
once before creating the TidTable in the Vertical approach and therefore we can pipeline
these two steps. The item mapping is done slightly differently. After the first pass, we add
a column to $F_1$ containing a unique integer for each item. We do the same for the TidTable.
The GatherJoin approach already joins the data table T with $F_1$ before passing to table
function Gather. Therefore, we can pass to Gather the integer mappings of each item from
$F_1$ instead of its original character representation. After these two transformations, the tid
and item fields are integers for all the remaining queries including candidate generation and
rule generation. By mapping the fields this way, we expect longer names to have similar
performance impact on all of our architectural options.
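
The tid-mapping table function described above can be sketched as a Python generator. This is a minimal illustration, assuming the input arrives as (tid, item) rows in tid order; the name `map_tids` and the sample rows are hypothetical.

```python
def map_tids(rows):
    """Map arbitrary tids to consecutive integers in one pass.

    rows: iterable of (tid, item) pairs in tid order. Yields (int_tid, item).
    """
    previous = object()   # sentinel that never equals a real tid
    counter = 0
    for tid, item in rows:
        if tid != previous:   # the tid changed: assign the next integer
            counter += 1
            previous = tid
        yield counter, item


# Hypothetical rows with character-string tids.
rows = [("txn-a", "x"), ("txn-a", "y"), ("txn-b", "x"), ("txn-c", "z")]
mapped = list(map_tids(rows))
```

Because the generator keeps only the previous tid and a counter, it needs no extra copy of the data and can be pipelined into the step that builds the TidTable.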
4.6.4 Space Overhead of Different Approaches
In Figure 4.10 we summarize the space required for different datasets for three options:
Stored-procedure, Cache-Mine and SQL. For these experiments we assume that the tids
and items are integers. The first part refers to the space used in caching data and the
second part refers to any temporary space used by the DBMS for sorting or alternately
for constructing indices to be used during sorting. The size of the data is the same as the
space utilization of the Stored-procedure approach. The space requirements for UDF are
the same as those for Stored-procedure, which requires less space than the Cache-Mine and
SQL approaches. The Cache-Mine and SQL approaches have comparable storage overheads.
For Stored-procedure and UDF we do not need any extra storage for caching. However,


CHAPTER 8
CONCLUSIONS
We analyzed various architectural alternatives for integrating mining with a relational
database system. We studied various mining algorithms with the twin goals of finding
the trade-offs between the architectural options and the extensions needed in a DBMS to
efficiently support mining. First, we experimented with different ways of implementing the
association rule mining algorithm in SQL to find if it is at all possible to get competitive
performance out of SQL implementations.
We considered two categories of SQL implementations. We experimented with four
different implementations based purely on SQL-92. We picked the best SQL-92
implementation for further analysis. We developed cost formulae for different execution plans based
on the input data parameters and the relational operator costs. This cost analysis provides
a basis for incorporating the semantics of mining algorithms into future query optimizers.
We next experimented with a collection of approaches that made use of the new
object-relational extensions like UDFs, BLOBs, and table functions. With this extended SQL we
got significant performance improvements over the SQL-92 based implementations.
We compared the SQL implementation with different architectural alternatives. We
observed that based just on performance the Cache-Mine approach is the winner. A close
second is the SQL-OR approach that was sometimes slightly better than Cache-Mine and


of these k lists. The tids are in the same sorted order in all the tid-lists and therefore the
intersection can be done easily and efficiently by a single pass of the k lists. This conceptual
step can be improved further by decomposing the intersect operation so that we can share
these operations across itemsets having common prefixes as follows:
We first select distinct $(item_1, item_2)$ pairs from $C_k$. For each distinct pair we first
perform the intersect operation to get a new result-tidlist, then find distinct triples
$(item_1, item_2, item_3)$ from $C_k$ with the same first two items, intersect result-tidlist with
the tid-list for $item_3$ for each triple and continue with $item_4$ and so on until all $k$ tid-lists
per itemset are intersected.
The above sequence of operations can be written as a single SQL query for any k as
shown in Figure 4.4. The final intersect operation can be merged with the count operation
to return a count instead of the tid-list. We do not include this optimization in the query
of Figure 4.4 for simplicity.
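
The prefix-sharing idea can be illustrated outside SQL with a short Python sketch. It is not the query of Figure 4.4: the function names, the in-memory dictionary of tid-lists, and the sample data are all assumptions made for illustration. Intersections for a common prefix are computed once and reused by every candidate that extends it, and the final step returns a count rather than a tid-list.

```python
def intersect_sorted(a, b):
    """Intersect two sorted tid-lists in a single merge-style pass."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def count_support(candidates, tid_lists):
    """Return {candidate: support}, sharing work across common prefixes.

    candidates: tuples of items; tid_lists: {item: sorted list of tids}.
    """
    prefix_cache = {}

    def tids_for(prefix):
        if len(prefix) == 1:
            return tid_lists[prefix[0]]
        if prefix not in prefix_cache:
            # Intersect the shorter prefix's result with the last item's tid-list.
            prefix_cache[prefix] = intersect_sorted(
                tids_for(prefix[:-1]), tid_lists[prefix[-1]]
            )
        return prefix_cache[prefix]

    return {c: len(tids_for(c)) for c in candidates}


# Hypothetical tid-lists and 3-item candidates sharing the prefix ("a", "b").
tid_lists = {"a": [1, 2, 3, 5], "b": [1, 2, 4, 5], "c": [2, 5, 6], "d": [1, 2, 5]}
support = count_support([("a", "b", "c"), ("a", "b", "d")], tid_lists)
```

In this example the intersection for the prefix ("a", "b") is computed once and then intersected separately with the tid-lists of "c" and "d".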
4.2.1 Special Pass 2 Optimization
For pass 2 we need not generate $C_2$ and join the TidTables with $C_2$. Instead, we
perform a self-join on the TidTable using the predicate t1.item < t2.item.
insert into F2 select item1, item2, cnt
from (select t1.item as item1, t2.item as item2,
             CountIntersect(t1.tid-list, t2.tid-list) as cnt
      from TidTable t1, TidTable t2
      where t1.item < t2.item) as t
where cnt > minsup


function Update-Frequent-Itemset($F^{DB}$, $\mathcal{N}Bd(F^{DB})$, $db$)
  // DB and db denote the number of transactions in the original database and the
  // increment database respectively.
  Compute $F^{db}$
  for each itemset $s \in F^{DB} \cup \mathcal{N}Bd(F^{DB})$ do
    $t_{db}(s)$ = number of transactions in $db$ containing $s$
  $F^{DB+} = \emptyset$
  for each itemset $s \in F^{DB}$ do
    if $(t_{DB}(s) + t_{db}(s)) > minsup * (DB + db)$ then $F^{DB+} = F^{DB+} \cup s$
  for each itemset $s \in F^{db}$ do
    if $s \notin F^{DB}$ and $s \in \mathcal{N}Bd(F^{DB})$ and $(t_{DB}(s) + t_{db}(s)) > minsup * (DB + db)$
      then $F^{DB+} = F^{DB+} \cup s$
  if $F^{DB} \neq F^{DB+}$ then
    $\mathcal{N}Bd(F^{DB+})$ = negativeborder-gen($F^{DB+}$)
  else $\mathcal{N}Bd(F^{DB+}) = \mathcal{N}Bd(F^{DB})$
  if $F^{DB} \cup \mathcal{N}Bd(F^{DB}) \neq F^{DB+} \cup \mathcal{N}Bd(F^{DB+})$ then
    $S = F^{DB+}$
    repeat
      compute $S = S \cup \mathcal{N}Bd(S)$
    until $S$ does not grow
    $F^{DB+} = \{x \in S \mid support(x) > minsup\}$
    // support(x) is the support count of x in $DB \cup db$
    $\mathcal{N}Bd(F^{DB+})$ = negativeborder-gen($F^{DB+}$)
Figure 7.3. A high-level description of the Update-Frequent-Itemset function
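
A compact Python sketch of the same idea is given below. It is not the author's implementation and it simplifies the figure in several places, all of which are assumptions: the negative border is computed directly from its definition rather than with negativeborder-gen; the rescan test is reduced to checking whether every itemset in the new negative border already has a known count; the closure is the pruned variant (only itemsets frequent in the increment are added); the threshold test uses `>=`; and border itemsets whose counts are still unknown after the rescan are counted in an extra pass. The helper `mine` is a brute-force level-wise miner used only to set up and check the example.

```python
def count(transactions, itemsets):
    """Support counts of the given itemsets (one pass over the transactions)."""
    return {s: sum(1 for t in transactions if s <= t) for s in itemsets}


def negative_border(frequent, items):
    """Minimal itemsets not in `frequent`, assuming `frequent` is downward closed."""
    border = set()
    for x in frequent | {frozenset()}:
        for i in items - x:
            cand = x | {i}
            if cand in frequent:
                continue
            if len(cand) == 1 or all(cand - {j} in frequent for j in cand):
                border.add(cand)
    return border


def mine(transactions, minsup, items):
    """Level-wise mining from scratch; returns (frequent, border, counts)."""
    frequent, counts = set(), {}
    while True:
        border = negative_border(frequent, items)
        counts.update(count(transactions, [s for s in border if s not in counts]))
        promoted = {s for s in border if counts[s] >= minsup * len(transactions)}
        if not promoted:
            return frequent, border, counts
        frequent |= promoted


def update(frequent, border, counts, old_db, inc_db, minsup, items):
    """Incremental update; returns (frequent, border, counts) for old_db + inc_db."""
    n = len(old_db) + len(inc_db)
    tracked = frequent | border
    inc = count(inc_db, tracked)                      # scan only the increment
    total = {s: counts[s] + inc[s] for s in tracked}
    new_frequent = {s for s in tracked if total[s] >= minsup * n}
    new_border = negative_border(new_frequent, items)
    if new_border <= tracked:
        return new_frequent, new_border, total        # old_db is not rescanned
    # Pruned negative border closure: add only untracked itemsets frequent in inc_db.
    closure = set(new_frequent)
    while True:
        grow = {s for s in negative_border(closure, items) if s not in tracked
                and count(inc_db, [s])[s] >= minsup * len(inc_db)}
        if not grow:
            break
        closure |= grow
    full = old_db + inc_db
    total.update(count(full, closure - new_frequent))  # the full database scan
    final = {s for s in closure if total[s] >= minsup * n}
    final_border = negative_border(final, items)
    # Assumption of this sketch: unknown border counts are found in an extra pass.
    total.update(count(full, [s for s in final_border if s not in total]))
    return final, final_border, total


# Hypothetical toy data; frozenset("ab") is the itemset {a, b}.
items = {"a", "b", "c"}
old_db = [frozenset("ab"), frozenset("ab"), frozenset("a"), frozenset("c")]
inc_db = [frozenset("ac")] * 4
f_old, b_old, c_old = mine(old_db, 0.5, items)
new_f, new_b, new_counts = update(f_old, b_old, c_old, old_db, inc_db, 0.5, items)
```

In this example the increment promotes {c} out of the negative border, which brings the untracked itemset {a, c} into the new border and forces the rescan branch; the result agrees with mining the combined database from scratch.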


we present the performance results of the SQL-based incremental algorithm [124]. In
Section 7.5, we introduce constrained associations and generalize the incremental approach to
handle constraints. The incremental algorithm is applicable to a larger class of data mining
and decision support problems. Some of them are briefly outlined in Section 7.6.
7.1 Incremental Updating of Frequent Itemsets
In this section, we develop an efficient algorithm for updating the frequent itemsets
when the database is updated. In the context of basket data, database update effectively
means addition of new transactions to the database or deletion of existing transactions. In
Section 7.1.1, we describe how to compute the negative border of a set F of frequent
itemsets. In Section 7.1.2, we present and prove some lemmas and theorems before describing
the algorithm to handle the addition of transactions. Handling the deletion of transactions
is outlined in Section 7.1.3.
7.1.1 Computing $\mathcal{N}Bd(F)$ from $F$
The negative border ($\mathcal{N}Bd(F)$) of a set of frequent itemsets $F$ can be computed by
repeating the join and prune steps of the apriori-gen function in the apriori algorithm [13].
This computation can be done using only the set of frequent itemsets $F$ and the database
need not be scanned.
Definition 1 The negative border $\mathcal{N}Bd(F)$ of a collection of itemsets $F$ is defined as
follows: Given a collection $F \subseteq \mathcal{P}(R)$ of sets, closed with respect to the set inclusion relation,
the negative border $\mathcal{N}Bd(F)$ of $F$ consists of the minimal itemsets $X \subseteq R$ not in $F$ [83].
The apriori-gen function (described in Figure 7.1) takes as argument $F_{k-1}$, the set of
all frequent $(k-1)$-itemsets. It returns the set of $k$-itemsets that are potentially frequent.
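
The join and prune steps, and their use for computing the negative border without touching the database, can be sketched in Python as follows. This is an illustrative reading of the description above, not a transcription of Figure 7.1: itemsets are represented as sorted tuples, and the function names and sample data are hypothetical.

```python
from itertools import combinations


def apriori_gen(f_prev):
    """Join and prune: f_prev is a set of sorted (k-1)-tuples; returns k-tuples."""
    prev = sorted(f_prev)
    candidates = set()
    for i, p in enumerate(prev):
        for q in prev[i + 1:]:
            if p[:-1] == q[:-1]:                 # join: same first k-2 items
                c = p + (q[-1],)
                # prune: every (k-1)-subset of c must itself be frequent
                if all(s in f_prev for s in combinations(c, len(c) - 1)):
                    candidates.add(c)
    return candidates


def negative_border_gen(frequent, items):
    """Negative border of a downward-closed set of frequent itemsets."""
    by_size = {}
    for s in frequent:
        by_size.setdefault(len(s), set()).add(s)
    # Level 1: single items that are not frequent.
    border = {(i,) for i in items} - by_size.get(1, set())
    k = 2
    while by_size.get(k - 1):
        # Level k: candidates generated from F_{k-1} that are not in F_k.
        border |= apriori_gen(by_size[k - 1]) - by_size.get(k, set())
        k += 1
    return border


# Hypothetical frequent itemsets over the items a, b, c, d.
frequent = {("a",), ("b",), ("c",), ("a", "b"), ("a", "c")}
border = negative_border_gen(frequent, {"a", "b", "c", "d"})
```

Here ("d",) is in the border because it is an infrequent single item, and ("b", "c") is in the border because both of its subsets are frequent while it is not. The candidate ("a", "b", "c") is pruned, since its subset ("b", "c") is not frequent.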


because, as we have observed, most of the time of Cache-Mine is spent in the first
pass which cannot be changed for Stored-procedure.
4.6.2 Scale-up experiment
Our experiments with the four real-life datasets above have shown the scaling property
of the different approaches with decreasing support value and increasing number of frequent
itemsets. We experiment with synthetic datasets to study other forms of scaling: increasing
number of transactions and increasing average length of transactions. In Figure 4.8 we show
how Stored-procedure, Cache-Mine and SQL scale with increasing number of transactions.
UDF and Loose-coupling have scale-up behavior similar to Stored-procedure; therefore,
we do not show these approaches in the figure. We used a dataset with an average of 10
items per transaction, 100 thousand total items and a default pattern length (defined in [9])
of 4. Thus, the size of the dataset is 10 times the number of transactions. As the number of
transactions is increased from 10K to 3000K the time taken increases proportionately. The
largest frequent itemset was of length 5. This explains the five-fold difference in performance
between the Stored-procedure and the Cache-Mine approach. Figure 4.9 shows the scaling
when the transaction length changes from 3 to 50 while keeping the number of transactions
fixed at 100K. All three approaches scale linearly with increasing transaction length.
Figure 4.8. Scale-up with increasing number of transactions


where two of its $(k-1)$-subsequences are known to be frequent; it is generated from two such
$(k-1)$-subsequences. Membership checks for the other $(k-2)$ subsequences are performed
using additional joins. The join predicates for these joins are enumerated by dropping one
item at a time from the generated $k$-sequence. We first drop $item_2$ and if it results in a
contiguous subsequence, we check for its membership in $F_{k-1}$ by the join with $I_3$. For the
join with $I_r$, we drop $item_{r-1}$.
The OR clause in the join predicate is to avoid checking for non-contiguous subsequences
that are formed when $eno_{r+1} - eno_{r-1} = 2$. When there is no max-gap constraint,
the join predicates will not contain the OR part.
While joining F\ with itself to get C2, we need to generate sequences where both the
items appear as a single element as well as two separate elements. Also note that the prune
step will not delete any candidate sequences. The generation of C2 is expressed in SQL as
insert into C2 (select I1.item1, 1, I2.item1, 2
                from F1 I1, F1 I2) union
               (select I1.item1, 1, I2.item1, 1
                from F1 I1, F1 I2
                where I1.item1 < I2.item1)
For instance, if F\ contains {(1), (2)}, C2 will have {((1) (1)), ((1) (2)), ((2) (1)), ((2) (2)),
((1,2))}.
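
The generation of C2 can be tried out with Python's built-in `sqlite3` module. This is a sketch under assumptions: the column names `eno1` and `eno2` (element numbers) and the table definitions are hypothetical, and the parentheses around the two selects are dropped because SQLite does not accept parenthesized members of a compound select.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE F1 (item1 INTEGER)")
# Hypothetical schema: each item is paired with the number of the element it belongs to.
conn.execute(
    "CREATE TABLE C2 (item1 INTEGER, eno1 INTEGER, item2 INTEGER, eno2 INTEGER)"
)
conn.executemany("INSERT INTO F1 VALUES (?)", [(1,), (2,)])

conn.execute(
    """
    INSERT INTO C2
    SELECT I1.item1, 1, I2.item1, 2      -- two separate elements
    FROM F1 I1, F1 I2
    UNION
    SELECT I1.item1, 1, I2.item1, 1      -- both items in a single element
    FROM F1 I1, F1 I2
    WHERE I1.item1 < I2.item1
    """
)
c2 = sorted(conn.execute("SELECT item1, eno1, item2, eno2 FROM C2"))
```

For F1 = {(1), (2)} this yields five rows: four two-element sequences (eno2 = 2) and the single-element candidate with both items in element 1, matching the instance given in the text.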
6.4 Support Counting to Find Frequent Sequences
In each pass $k$, we use the candidate table $C_k$ and the input data-sequences table $D$
to count the support. We consider SQL implementations based on SQL-92 and SQL-OR
in Sections 6.5 and 6.6 respectively.


Research on time series analysis [10, 43, 75], similarity search [3, 8, 21, 72, 115],
high dimensional data analysis [4, 22, 79], and text data management [30, 31] can also be
classified under the broad category of data mining.
1.3 Mining Databases
The initial research efforts on data mining were aimed at defining new mining problems
and a majority of the algorithms for them were developed for data stored in file systems.
Each had its own specialized data structures, buffer management strategies and so on [8, 13,
14, 22, 49, 50, 62, 63, 74, 84, 85, 86, 111, 115, 119, 121]. In cases where the data are stored in
a DBMS, data access was provided through an ODBC or SQL cursor interface [2, 64, 68, 71].
Data mining tools are being used in several application domains [16, 18, 27, 38, 41, 46, 52,
70, 82, 96, 101, 108, 130]. Coupling the data mining tools with a growing base of accessible
enterprise data (often in the form of a data warehouse) provides business institutions with
a tool of immense implications. According to the vice president of Mellon
Bank's advanced technology group, "Data mining is the carrot that justifies the expensive
stick of building a data warehouse."
The majority of the warehouse stores (systems used for storing warehouse data) are
relational databases or their variants. The advantages of using database systems are numerous:
SQL was invented for direct query of data, most client/server connectivity is supplied
by relational vendors, most replication systems have been designed with relational sources
and targets, and most of the relational vendors are delivering parallel database solutions.
There are important alternatives in this segment, however. The OLAP multidimensional
engines offer unique performance characteristics across their problem domain. We might also
see traditional file stores providing significantly better performance for some data mining


sale price of the transaction is larger than some amount we can impose a constraint of the
form sales.total price > P. These constraints are useful to target the mining process to
just the relevant data, thereby speeding up the process.
7.5.2 Constrained Association Mining
[Figure 7.11: diagram not recoverable; surviving labels: "F_k+1", "Input data tables".]
Figure 7.11. Framework for constrained association mining
Figure 7.11 shows the general framework for the kth level in a level-wise approach for
mining associations with constraints. Note that we could also use SQL-based approaches for
implementing the various operations in the different phases. The different constraints are
applied at different steps in the mining process. For example, the item constraints and the
aggregation constraints that satisfy the closure property can be used in the candidate
generation phase for pruning unwanted candidates. Three different algorithms for generating
candidate itemsets in the presence of item constraints are presented in Srikant et al. [121]
and we could use them here. The aggregation constraints can be applied on the result of
the candidate generation process with item constraints. The external constraints and some
of the aggregation and item constraints can be applied at the data filtering stage to reduce
the size of the input data to the support counting phase (see Figure 7.12 for a specific


In the context of goal-oriented mining, we often have to mine associations with various
kinds of constraints. We identify and categorize the different constraints and show how they
can be processed in the relational framework. The incremental approach can also be used
to handle certain kinds of constraint relaxation. We develop a general framework for the
incremental approach to mine associations with constraints having the downward closure
property.
The concept of negative border which is the key to the incremental algorithm has
other applications also. It can be used for mining association rules with varying support
and confidence values. For instance, the negative border can be used to determine the
updated frequent itemsets if the support is changed. If the support is increased it is trivial
to update the frequent itemsets. However, if the support is lowered the itemsets in the
promoted negative border can be used to determine if the support of any new itemset
needs to be counted in a similar manner to incremental mining. This could be quite useful
in cases where determining the correct support value is difficult. Initially the frequent
itemsets for an approximate support can be computed which is further refined based on
user feedback.
We further show the applicability of the incremental algorithm to certain classes of
data mining and decision support problems.
In Section 8.1, we identify certain extensions to the current database management
systems that are useful for mining. We enumerate the specific contributions of this dissertation
in Section 8.2 and in Section 8.3, we list possibilities for further research.


none of the itemsets in $NBd(F^{DB})$ gets the minimum support, no new itemsets will be
added to $F^{DB^+}$. If some itemsets in $NBd(F^{DB})$ get the minimum support, move them to
$F^{DB^+}$ and recompute the negative border. If $F^{DB^+} \cup NBd(F^{DB^+}) \neq F^{DB} \cup NBd(F^{DB})$,
we have to find the negative border closure of $F^{DB^+}$ and scan the entire database ($DB^+$)
once to find the updated frequent itemset and negative border. The negative border closure
of $F$ is found by repeatedly computing $F = F \cup NBd(F)$ until $F$ does not grow.
During the database scan, all the itemsets which are in the negative border closure
that were not originally in $F \cup NBd(F)$ are used as the candidate itemsets and their support
count is computed. The candidate set can further be pruned by applying an optimization
while finding the negative border closure. It can be observed that an itemset which is not
frequent in the increment database (db) cannot get added to the updated set of frequent
itemsets. Therefore, such itemsets can be pruned at each step of the negative border closure
computation to get the pruned negative border closure. However, the support count of these
pruned itemsets should also be found since they may potentially be in the updated negative
border.
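The negative border and its closure can be illustrated with a minimal Python sketch. It assumes that the set of frequent itemsets is downward closed and that the item universe is known; the function names `negative_border` and `negative_border_closure` are hypothetical and are not taken from the dissertation. The pruning optimization based on the increment database is not shown.

```python
from itertools import combinations


def negative_border(frequent, items):
    """Return the minimal itemsets that are not in `frequent` but whose
    proper subsets all are. Assumes `frequent` is downward closed."""
    frequent = set(frequent)
    # Infrequent single items belong to the border (their only proper
    # subset is the empty set).
    border = {frozenset([i]) for i in items} - frequent
    by_size = {}
    for itemset in frequent:
        by_size.setdefault(len(itemset), set()).add(itemset)
    for k, level in by_size.items():
        for a in level:
            for b in level:
                cand = a | b
                if len(cand) != k + 1 or cand in frequent:
                    continue
                # Every k-subset must be frequent.
                if all(frozenset(s) in frequent for s in combinations(cand, k)):
                    border.add(cand)
    return border


def negative_border_closure(frequent, items):
    """Repeat F = F U NBd(F) until F stops growing."""
    closure = set(frequent)
    while True:
        border = negative_border(closure, items)
        if not border:
            return closure
        closure |= border
```

For a frequent set {a}, {b}, {a, b} over the items a, b, c, the negative border is the single itemset {c}. Without pruning, the closure grows to all seven non-empty subsets, which is why pruning by the increment database matters in practice.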
7.1.3 Deletion of Existing Transactions
Similar to the case where new transactions are added to the database, the frequent
itemset and its negative border could potentially change when some existing transactions
are deleted from the database. As in the former case, we maintain the frequent itemset and
the negative border along with their support count in the database. Let $DB^-$ denote the
updated database and $F^{DB^-}$ and $NBd(F^{DB^-})$ denote its frequent itemset and negative
border respectively.


I certify that I have read this study and that in my opinion it conforms to acceptable
standards of scholarly presentation and is fully adequate, in scope and quality, as a
dissertation for the degree of Doctor of Philosophy.
Sharma Chakravarthy, Chairman
Associate Professor of Computer and
Information Science and Engineering
I certify that I have read this study and that in my opinion it conforms to acceptable
standards of scholarly presentation and is fully adequate, in scope and quality, as a
dissertation for the degree of Doctor of Philosophy.
Eric N. Hanson
Associate Professor of Computer and
Information Science and Engineering
I certify that I have read this study and that in my opinion it conforms to acceptable
standards of scholarly presentation and is fully adequate, in scope and quality, as a
dissertation for the degree of Doctor of Philosophy.
Distinguished Professor of Computer and
Information Science and Engineering


4.2.2 Cost Analysis
The cost of the Vertical approach during support counting phase is dominated by
the cost of invoking the UDFs and intersecting the tid-lists. The UDF is first called for
each distinct item pair in Ck, then for each distinct item triple with the same first two
items and so on. Let $d_j$ be the number of distinct $j$-item tuples in $C_k$. Then the total
number of times the UDF is invoked is $\sum_{j=2}^{k} d_j$. In each invocation two BLOBs of tid-lists
are passed as arguments. The UDF intersects the tid-lists by a merge pass and hence the
cost is proportional to 2 × the average length of a tid-list. The average length of a tid-list can be
approximated by $R/F_1$. Note that with each intersection the tid-list keeps shrinking. However,
we ignore such effects for simplicity.
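The merge pass performed by the UDF can be sketched in Python as follows. This is only an illustration under the assumption that tid-lists are sorted lists of distinct transaction identifiers; the function name `intersect_tid_lists` is hypothetical, and the actual UDF operates on BLOBs inside the DBMS.

```python
def intersect_tid_lists(a, b):
    """Intersect two sorted tid-lists with a single merge pass.
    The work is proportional to len(a) + len(b)."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result
```

The support of an itemset is then the length of the intersected list, for example `len(intersect_tid_lists([1, 3, 5, 7], [3, 4, 5, 8]))`.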
In addition to the intersect cost, the total cost includes the cost of the joins in the query. The
join cost of subquery $Q_l$ can be defined recursively as
$$C(Q_l) = \mathrm{trijoin}(F_1, Q_{l-1}, C_k, d_l) + C(Q_{l-1})$$
where trijoin(p, q, r, s) denotes the cost of joining three relations of size p, q, r respectively
producing a result of size s. The exact cost of the 3-way join will depend on the join order.
The cost of subquery $Q_1$ is the cost of scanning the TidTable which has $F_1$ tuples. The
result size of the subquery $Q_l$ is $d_l$ and the result size of $Q_1$ is $F_1$. The total cost of the
Vertical approach is
$$C(Q_k) + \Big(\sum_{j=2}^{k} d_j\Big) \cdot \{2 \cdot \mathrm{Blob}(R/F_1) + \mathrm{Intersect}(2R/F_1)\}$$


computing the ratio r = support(ABCD)/support(AB). The rule holds only if r ≥
minimum confidence. Note that the rule will have minimum support because ABCD
is frequent.
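As a small worked illustration of the confidence test, the following Python sketch computes the ratio for a rule such as AB => CD from a table of support counts. The function name `rule_confidence` and the support values are hypothetical.

```python
def rule_confidence(support, antecedent, itemset):
    """Confidence of the rule antecedent => (itemset - antecedent),
    computed as support(itemset) / support(antecedent)."""
    return support[frozenset(itemset)] / support[frozenset(antecedent)]


# Hypothetical support counts, for illustration only.
support = {frozenset("AB"): 4, frozenset("ABCD"): 3}
r = rule_confidence(support, "AB", "ABCD")
```

With these counts r is 0.75, so the rule AB => CD holds for any minimum confidence of at most 75%.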
3.1 Apriori Algorithm
We use the Apriori algorithm [9] as the basis for our presentation. There are recent
proposals aimed at improving the performance of the Apriori algorithm by reducing the
number of data passes [26, 127]. They all have the same basic data-flow structure as the
Apriori algorithm. Our goal is to understand how best to integrate this basic structure
within a database system.
F_1 = {frequent 1-itemsets}
for (k = 2; F_{k-1} ≠ ∅; k++) do
    C_k = apriori-gen(F_{k-1});    // generate new candidates
    forall transactions t ∈ D do
        C_t = subset(C_k, t);      // find all candidates contained in t
        forall candidates c ∈ C_t do
            c.count++;
        done
    done
    F_k = {c ∈ C_k | c.count ≥ minsup}
done
Answer = ∪_k F_k
Figure 3.1. Apriori algorithm
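The data flow of Figure 3.1 can be made concrete with a short in-memory Python sketch. It is not the implementation studied in this dissertation: transactions are plain Python sets, minsup is an absolute count, and the names `apriori_gen` and `apriori` are chosen here only to mirror the figure.

```python
from itertools import combinations


def apriori_gen(prev_frequent, k):
    """Join frequent (k-1)-itemsets and keep the k-itemsets whose
    (k-1)-subsets are all frequent."""
    prev = set(prev_frequent)
    candidates = set()
    for a in prev:
        for b in prev:
            union = a | b
            if len(union) == k and all(
                frozenset(s) in prev for s in combinations(union, k - 1)
            ):
                candidates.add(union)
    return candidates


def apriori(transactions, minsup):
    """Return a dict mapping each frequent itemset to its support count."""
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:                      # pass 1: count single items
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= minsup}
    answer = dict(frequent)
    k = 2
    while frequent:                             # loop while F_{k-1} is non-empty
        candidates = apriori_gen(frequent.keys(), k)
        counts = {c: 0 for c in candidates}
        for t in transactions:                  # support counting pass
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent = {s: c for s, c in counts.items() if c >= minsup}
        answer.update(frequent)
        k += 1
    return answer
```

The inner loop over all candidates stands in for the subset function of the figure. It is the simplest correct choice for a sketch and makes no attempt at efficiency.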


strategies and the data required to support and track their achievement. This process,
called enterprise modeling, requires substantial involvement of business users and is
traditionally a long-term process. A data warehouse consists of several components and tools
which are depicted in Figure 1.1 [34].
Figure 1.1. Data warehousing architecture
Knowing what data are required is just the first step. These data exist today in
various sources on different platforms, and must be copied from these sources for use in the
warehouse. They must be combined according to the enterprise model, even though it was
not originally designed to support such integration. They must be cleansed of structural and
content errors. This step, typically known as data warehouse population, requires tools
for extracting data from multiple operational databases and external sources; for cleaning,
transforming and integrating these data; and for loading data into the data warehouse.
In addition to the main warehouse, there may be several departmental data marts. It
also requires tools for periodically refreshing the warehouse to maintain consistency and to
purge data from the warehouse, perhaps onto slower archival storage.


[Chart: Performance comparison. Legend: Prep, Pass 1 through Pass 6. Support values: 2%, 1%, 0.5%.]
Mail order data:
Total number of records = 2.5 million
Number of transactions = 568000
Number of items (leaf nodes in taxonomy DAG) = 85161
Total number of items (including interior nodes) = 228428
Max. depth of the taxonomy = 7
Avg. number of children per node = 1.6
Max. number of parents = 3
Figure 5.10. Comparison of different SQL approaches
which runs in the same address space as the DBMS. For the Stored-procedure experiment,
we used the generalized association rule implementation provided with the IBM data
mining product, Intelligent Miner [71]. For all the support values, the Vertical approach
performs as well as the Stored-procedure approach. In some of the experiments
on other datasets, the Vertical approach performed better than the Stored-procedure
approach. The GatherJoin approach is worse mainly due to the large number of item
combinations generated. In the GatherJoin approach, the extended transactions are passed
to the GatherComb table function and hence the effective number of items per transaction


Figure 4.5 shows the total running time of the different approaches. The time taken
is broken down by each pass and an initial prep stage where any one-time data transformation
cost is included. We can make several observations from the experimental results.
First, let us concentrate on the overall comparison between the different approaches. Then
we will compare the approaches based on how they perform in each pass of the algorithm.
Overall, the Vertical approach has the best performance and is sometimes more than
an order of magnitude better than the other three approaches.
The majority of the time of the Vertical approach is spent in transforming the
data to the Vertical format in most cases (shown as prep in Figure 4.5). The vertical
representation is like an index on the item attribute. If we think of this time as a one-time
activity like index building, then performance looks even better. Note that the time
to transform the data to the Vertical format was much smaller than the time for the
horizontal format although both formats write almost the same amount of data. The main
reason was the difference in the number of records written. The number of frequent items
is often two to three orders of magnitude smaller than the number of transactions.
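The transformation to the vertical format can be pictured with a minimal Python sketch that turns (tid, item) rows into one sorted tid-list per frequent item. The function name `to_vertical` and the in-memory representation are assumptions made for illustration; in the experiments the tid-lists are stored as BLOBs in a table.

```python
def to_vertical(rows, minsup=1):
    """Convert (tid, item) rows into {item: sorted tid-list},
    keeping only items that occur in at least `minsup` transactions.
    Assumes each (tid, item) pair appears at most once."""
    tid_lists = {}
    for tid, item in sorted(rows):          # sorted by tid, so lists stay ordered
        tid_lists.setdefault(item, []).append(tid)
    return {item: tids for item, tids in tid_lists.items() if len(tids) >= minsup}
```

The output has one record per frequent item rather than one per transaction, which matches the observation above about the number of records written.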
Between GatherJoin and GatherPrune, neither strictly dominates the other. The
special optimization in GatherJoin of pruning based on $F_1$ had a big impact on
performance. With this optimization, for Dataset-B with support 0.1%, the running time for
pass 2 alone was reduced from 5.2 hours to 10 minutes.
When we compare these different approaches based on the time spent in each pass, we
observe that no single approach is the best for all passes of all datasets,
especially for the second pass.
For pass three onwards, Vertical is often two or more orders of magnitude better than
the other approaches. Even in cases like Dataset-B, support 0.01% where it spends three


In the formula above Intersect(n) denotes the cost of intersecting two tid-lists with
a combined size of n. The total cost is dominated by the intersect cost and join costs
account for only a small fraction. Therefore, we can safely ignore the join costs in the
above formulae.
The total cost of the second pass is
$$\mathrm{join}(F_1, F_1, C_2) + C_2 \cdot \{2 \cdot \mathrm{Blob}(R/F_1) + \mathrm{Intersect}(2R/F_1)\}$$
4.3 SQL-Bodied Functions
This approach is based on SQL-bodied functions commonly known as SQL/PSM [87].
SQL/PSMs extend SQL with additional control structures. We make use of one such
construct, for .. do .. end.
We use the for construct to scan the transaction table T in the (tid, item) order. Then,
for each tuple (tid, item) of T, we update those tuples of Ck that contain one matching
item. Ck is extended with 3 extra attributes (prevTid, match, supp). The prevTid attribute
keeps the tid of the previous tuple of T that matched that itemset. The match attribute
contains the number of items of prevTid matched so far and supp holds the current support
of that itemset. On each column of Ck an index is built to do a searched update.
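The bookkeeping with prevTid, match and supp described above can be imitated outside the database with a short Python sketch. It is a simulation of the logic only, not SQL/PSM: the function name `count_support` is hypothetical, candidates are frozensets of size k ≥ 2, and the rows of T are assumed to contain each (tid, item) pair once. As in an SQL update, the new values are computed from the old values of the row.

```python
def count_support(rows, candidates):
    """Scan (tid, item) rows in (tid, item) order and maintain, per
    candidate, the previous matching tid, the number of items matched
    for that tid, and the support count."""
    state = {c: {"prev_tid": None, "match": 0, "supp": 0} for c in candidates}
    for tid, item in sorted(rows):
        for cand, s in state.items():
            if item not in cand:
                continue                       # only candidates containing the item
            same_tid = s["prev_tid"] == tid
            # The k-th matching item of the same transaction completes the itemset.
            if same_tid and s["match"] == len(cand) - 1:
                s["supp"] += 1
            s["match"] = s["match"] + 1 if same_tid else 1
            s["prev_tid"] = tid
    return {cand: s["supp"] for cand, s in state.items()}
```

The inner loop over all candidates plays the role of the searched update; in the database the indices on the columns of Ck locate the matching candidates instead.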
for this as select * from T do
    update Ck set prevTid = tid,
        match = case when tid = prevTid then match+1 else 1 end,
        supp = case when match = k-1 and tid = prevTid then supp+1
               else supp
               end


and we do not know the value of $s_k$. However, the last join produces $S(C_k)$ records
(there will be as many records for each candidate as its support) and therefore, the cost is
join($C_k * s_{k-1}$, $R$, $S(C_k)$). $S(C_k)$ can be estimated by adding the support estimates of all
the itemsets in $C_k$. A good estimate for the support of a candidate itemset is the minimum
of the support counts of all its subsets. The overall cost of this plan expressed in terms of
operator costs is
$$\sum_{l=1}^{k-1} \{\mathrm{join}(C_k * s_{l-1}, R, C_k * s_l)\} + \mathrm{join}(C_k * s_{k-1}, R, S(C_k)) + \mathrm{group}(S(C_k), C_k)$$
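The estimate of $S(C_k)$ described above can be sketched in a few lines of Python. The sketch assumes that the support counts of the frequent (k-1)-itemsets are available in a dictionary; since support never increases when an itemset grows, the minimum over the (k-1)-subsets equals the minimum over all subsets. The function names are hypothetical.

```python
from itertools import combinations


def estimate_support(candidate, subset_support):
    """Upper-bound estimate: minimum support over the (k-1)-subsets."""
    k = len(candidate)
    return min(subset_support[frozenset(s)] for s in combinations(candidate, k - 1))


def estimate_total_support(candidates, subset_support):
    """Estimate S(C_k) as the sum of the per-candidate estimates."""
    return sum(estimate_support(c, subset_support) for c in candidates)
```

For example, with supports 5, 3 and 4 for the pairs {a, b}, {a, c} and {b, c}, the estimate for the candidate {a, b, c} is 3.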
3.7.2 KwayJoin Plan with $C_k$ as Inner Relation
In this plan, we join the $k$ copies of $T$ and the resulting $k$-item combinations are joined
with $C_k$ to filter out non-candidate item combinations. The final join result is grouped on
the $k$ items.
The result of joining $l$ copies of $T$ is the set of all possible $l$-item combinations of
transactions and there are $C(N,l) * T$ such combinations. We know that the items in the
candidate itemset are lexicographically ordered and hence we can add extra join predicates
as shown in Figure 3.8 to limit the join result to $l$-item combinations (without these extra
predicates the join will result in $l$-item permutations). When $C_k$ is the outermost relation
these predicates are not required. A mining-aware optimizer should be able to rewrite the
query appropriately. The $l$th join produces $(l+1)$-item combinations and therefore, its cost
is join($C(N,l) * T$, $R$, $C(N,l+1) * T$). The last join produces $S(C_k)$ records as in the
previous case and hence its cost is join($C(N,k) * T$, $C_k$, $S(C_k)$). The overall cost of this


REFERENCES
[1] S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan,
and S. Sarawagi. On the computation of multidimensional aggregates. In
Proc. of the 22nd Int'l Conference on Very Large Databases, pages 506-521, Mumbai
(Bombay), India, September 1996.
[2] R. Agrawal, A. Arning, T. Bollinger, M. Mehta, J. Shafer, and R. Srikant. The Quest
data mining system. In Proc. of the 2nd Int'l Conference on Knowledge Discovery in
Databases and Data Mining, Portland, Oregon, August 1996.
[3] R. Agrawal, C. Faloutsos, and A. Swami. Efficient similarity search in sequence
databases. In Proc. of the Fourth Int'l Conference on Foundations of Data Organization
and Algorithms, Chicago, October 1993. Also in Lecture Notes in Computer
Science 730, Springer Verlag, 1993, 69-84.
[4] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering
of high dimensional data for data mining applications. In Proc. of the ACM
SIGMOD Conference on Management of Data, Seattle, Washington, June 1998.
[5] R. Agrawal, S. Ghosh, T. Imielinski, B. Iyer, and A. Swami. An interval classifier
for database mining applications. In Proc. of the VLDB Conference, pages 560-573,
Vancouver, British Columbia, Canada, August 1992.
[6] R. Agrawal, T. Imielinski, and A. Swami. Database mining: A performance perspective.
IEEE Transactions on Knowledge and Data Engineering, 5(6):914-925,
December 1993.
[7] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of
items in large databases. In Proc. of the ACM SIGMOD Conference on Management
of Data, pages 207-216, Washington, D.C., May 1993.
[8] R. Agrawal, K. I. Lin, H. S. Sawhney, and K. Shim. Fast similarity search in the
presence of noise, scaling, and translation in time-series databases. In Proc. of the
21st Int'l Conference on Very Large Databases, pages 490-501, Zurich, Switzerland,
September 1995.
[9] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. I. Verkamo. Fast discovery
of association rules. In U. M. Fayyad, G. P. Shapiro, P. Smyth, and R. Uthurusamy,
editors, Advances in Knowledge Discovery and Data Mining, chapter 12, pages 307-328.
AAAI/MIT Press, 1996.


[41] Direct Marketing Association. Managing database marketing technology for success,
1992.
[42] M. Ester, H. Kriegel, and X. Xu. A database interface for clustering in large spatial
databases. In Proc. of the 1st Int'l Conference on Knowledge Discovery in Databases
and Data Mining, Montreal, Canada, August 1995.
[43] C. Faloutsos, M. Ranganathan, and Y. Manolopoulos. Fast subsequence matching in
time-series databases. In Proc. of the ACM SIGMOD Conference on Management of
Data, May 1994.
[44] U. Fayyad, C. Reina, and P. Bradley. Initialization of iterative refinement clustering
algorithms. In Proc. of the 4th Int'l Conference on Knowledge Discovery and Data
Mining, New York, August 1998.
[45] U. Fayyad, G. P. Shapiro, P. Smyth, and R. Uthurusamy, editors. Advances in
Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
[46] U. Fayyad, N. Weir, and S. G. Djorgovski. Skicat: A machine learning system for
automated cataloging of large scale sky surveys. In 10th Int'l Conference on Machine
Learning, June 1993.
[47] R. Feldman, Y. Aumann, A. Amir, and H. Mannila. Efficient algorithms for discovering
frequent sets in incremental databases. In Proceedings of the 1997 SIGMOD
Workshop on Research Issues on Data Mining and Knowledge Discovery, Tucson,
Arizona, May 1997.
[48] D. H. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine
Learning, 2(2), 1987.
[49] T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Data mining using
two-dimensional optimized association rules: Scheme, algorithms, and visualization. In
Proc. of the ACM SIGMOD Conference on Management of Data, Montreal, Canada,
June 1996.
[50] T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Mining optimized association
rules for numeric attributes. In Proc. of the 15th ACM Symposium on Principles
of Database Systems, Montreal, Canada, June 1996.
[51] K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, 1990.
[52] Gartner Group. Data Mining: The Next Generation of Business Intelligence?, ATG
research Note T-517-246, Gartner Group Inc., Stamford, CT edition, March 1994.
[53] J. Gehrke, R. Ramakrishnan, and V. Ganti. Rainforest - a framework for fast decision
tree construction of large datasets. In Proc. of the VLDB Conference, New York,
August 1998.
[54] G. Graefe, U. Fayyad, and S. Chaudhuri. On the efficient gathering of sufficient statistics
for classification from large SQL databases. In Proc. of the 4th Int'l Conference
on Knowledge Discovery and Data Mining, New York, August 1998.


insert into F_k select item_1, ..., item_k, count(*)
from (Subquery Q_k) t
group by item_1, item_2, ..., item_k
having count(*) > :minsup

Subquery Q_l (for any l between 1 and k):
select item_1, ..., item_l, tid
from T t_l, (Subquery Q_{l-1}) as r_{l-1},
     (select distinct item_1, ..., item_l from C_k) as d_l
where r_{l-1}.item_1 = d_l.item_1 and ... and
      r_{l-1}.item_{l-1} = d_l.item_{l-1} and
      r_{l-1}.tid = t_l.tid and
      t_l.item = d_l.item_l

Subquery Q_0: No subquery Q_0.

[Tree diagram for Subquery Q_l: Subquery Q_{l-1}, T t_l and (select distinct item_1, ..., item_l from C_k) are joined to produce item_1, ..., item_l, tid]
Figure 3.6. Support counting using subqueries
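The way the nested subqueries of Figure 3.6 share work across candidates with a common prefix can be imitated in memory with the following Python sketch. It is an illustration only: candidates are lexicographically ordered tuples, the rows of T are (tid, item) pairs, the function name `subquery_support` is hypothetical, and the loop starts from an empty prefix per transaction, which is equivalent to having no subquery Q_0. The threshold comparison follows the figure (count strictly greater than minsup).

```python
from collections import Counter


def subquery_support(rows, candidates, minsup):
    """Count candidate supports by extending (prefix, tid) pairs one item
    at a time, keeping only prefixes that occur in some candidate."""
    candidates = set(candidates)
    if not candidates:
        return {}
    k = len(next(iter(candidates)))
    items_by_tid = {}
    for tid, item in rows:
        items_by_tid.setdefault(tid, set()).add(item)
    q = [((), tid) for tid in items_by_tid]          # empty prefix per transaction
    for l in range(1, k + 1):
        d_l = {c[:l] for c in candidates}            # distinct l-item prefixes of C_k
        q = [
            (prefix + (item,), tid)
            for prefix, tid in q
            for item in items_by_tid[tid]
            if prefix + (item,) in d_l               # join with d_l
        ]
    counts = Counter(prefix for prefix, _ in q)      # group by item_1..item_k
    return {c: n for c, n in counts.items() if n > minsup}
```

Each intermediate list q corresponds to the result of one subquery Q_l, so its length shows directly how many records a shared prefix saves.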


and we are interested in building a model that categorizes the applicants into high and low
risk classes.
[Decision tree diagram; only the node test "Age < 25" survives]
Age    Salary    Married    Risk
21     30,000    No         High
18     28,000    Yes        High
30     50,000    Yes        Low
35     20,000    Yes        High
23     60,000    No         High
47     80,000    Yes        Low
Figure 1.2. Credit card classification example
1.2.4 Clustering
The fundamental clustering problem is that of grouping together (clustering) similar
data items and is useful for discovering interesting distributions and patterns in the
underlying data. Clustering has been formulated in various ways in the machine learning [48],
pattern recognition [51], optimization [112] and statistics literature [20]. The problem of
clustering can be defined as follows: given n data points in a d-dimensional metric space,
partition the data points into k clusters such that the data points within a cluster are more
similar to each other than data points in different clusters. The classic K-Means clustering
algorithm starts with estimated initial values for the k cluster centroids. Each of the data
points is assigned to the cluster with the nearest centroid. After the assignment the
centroids are refined to the mean of the data points in that cluster. This process is repeated
several times until an acceptable convergence is reached. There are several research efforts
reported in the data mining literature for clustering large databases [23, 42, 57, 132].
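The K-Means iteration described above can be written as a minimal Python sketch. Several choices here are assumptions made for illustration and are not prescribed by the text: the initial centroids are sampled from the data, squared Euclidean distance is used, and convergence means that no point changes cluster.

```python
import random


def kmeans(points, k, iterations=100, seed=0):
    """Cluster a list of equal-length numeric tuples into k clusters.
    Returns (centroids, assignment)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)                # estimated initial centroids
    assignment = None
    for _ in range(iterations):
        # Assign every point to the cluster with the nearest centroid.
        new_assignment = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(pt, centroids[c])))
            for pt in points
        ]
        if new_assignment == assignment:             # converged
            break
        assignment = new_assignment
        # Refine each centroid to the mean of its member points.
        for c in range(k):
            members = [pt for pt, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment
```

A call such as `kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], 2)` separates the two lower-left points from the two upper-right ones.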


plan is
$$\sum_{l=1}^{k-1} \{\mathrm{join}(C(N,l) * T, R, C(N,l+1) * T)\} + \mathrm{join}(C(N,k) * T, C_k, S(C_k)) + \mathrm{group}(S(C_k), C_k)$$
Note that in the above expression $C(N,1) * T = R$.
3.7.3 Effect of Subquery Optimization
The subquery optimization makes use of common prefixes among candidate itemsets.
Unfolding all the subqueries will result in a query tree which structurally resembles the
KwayJoin plan tree shown in Figure 3.9. Subquery $Q_l$ produces $d_k^l * s_l$ records where $d_k^j$
denotes the number of distinct $j$-item prefixes of itemsets in $C_k$. In contrast, the $l$th join in
the KwayJoin plan results in $C_k * s_l$ records. Typically $d_k^l$ is much smaller than $C_k$, which
explains why the Subquery approach performs better than the KwayJoin approach.
The output of subquery $Q_k$ contains $S(C_k)$ records. The total cost of this approach can be
estimated to be
$$\sum_{l=1}^{k} \{\mathrm{trijoin}(R, s_{l-1} * d_k^{l-1}, d_k^l, s_l * d_k^l)\} + \mathrm{group}(S(C_k), C_k)$$
where trijoin(p, q, r, s) denotes the cost of joining three relations of size p, q, r respectively
producing a result of size s. The value of $s_k$, which is the average support of a frequent
$k$-itemset, can be estimated as mentioned in Section 3.7.1.
The experimental results presented in Section 3.6 and in Sarawagi et al. [110] show
that the subquery optimization gave much better performance than the basic KwayJoin
approach (an order of magnitude better in some cases). We observed the same trend in our
additional experiments using synthetic datasets. We used synthetic datasets for some of the


[Chart: Data set A. Bars broken down by pass (Pass 1 to Pass 3 legible in the legend); horizontal axis: Support.]
Figure 3.7. Comparison of different SQL-92 approaches
The best approach in the SQL-92 category is the Subquery approach. An important
reason for its performing better than the other approaches is exploitation of common
prefixes between candidate itemsets. None of the other three approaches uses this
optimization. Although the Subquery approach is comparable to the Loose-coupling
approach in some cases, for other cases it did not complete even after taking ten times
more time than the Loose-coupling approach.
The 2GroupBy approach is significantly worse than the other three approaches because
it involves an index-ORing operation on k indices for each pass k of the algorithm.
In addition, the inner group-by requires sorting a large intermediate relation. The
outer group-by is comparatively faster because the sorted result is of size at most Ck
which is much smaller than the result size of the inner group-by. The DBMS does
aggregation during sorting; therefore, the size of the result is an important factor in
the total cost.


other operators in the DBMS. However, to exploit this feature one either needs to implement
the mining operators inside the DBMS or use table functions in novel ways. The first
alternative would require major rework in existing database systems. The second alternative
requires table functions that can execute SQL statements (a facility not currently
available in DB2 UDB). Furthermore, some mining operators generate multiple different
kinds of output (for instance, model tree and statistics for decision trees). In such cases,
pipelining of mining operators becomes harder to fit in existing relational models. Note
that the SQL approach presented here is based on embedded SQL and cannot provide operator
pipelining and inter-operability. Queries on the mined result are possible with all four
alternatives as long as the mined results are stored back in the DBMS.
4.8 Other Associations Algorithms
There are recent proposals [26, 127] that attempt to minimize the number of data
passes in the Apriori algorithm to two or three. Currently, the Stored-procedure approach
makes as many passes as the length of the largest itemset and in each pass, most of the
time is spent in reading the data from the DBMS. Therefore, by using these alternative
algorithms it is possible to reduce the time spent by Stored-procedure by as many times
as the number of passes. However, the Cache-Mine approach is almost a lower bound on the
best performance possible for doing associations outside the DBMS because it spends most
of its time in the first pass which is the minimum amount of work any mining algorithm
that works outside the DBMS has to spend; read all of the data at least once. Since our SQL
approach is competitive with the one-pass Cache-Mine approach, we expect our conclusions
to hold for other algorithms that spend 1.5 to 3 times the time spent by Cache-Mine.
These alternative algorithms are not likely to help improve the performance of our SQL


the second pass (for $C_2$). In the SQL formulation as shown in Figure 5.3, we prune all
(ancestor, descendant) pairs from $C_2$, which is generated by joining $F_1$ with itself.
insert into C2
    (select I1.item1, I2.item1 from F1 I1, F1 I2
     where I1.item1 < I2.item1)
except
    (select ancestor, descendant from Ancestor union
     select descendant, ancestor from Ancestor)
[Tree diagram: C2 is the result of EXCEPT applied to the self-join of F1 I1 and F1 I2 on I1.item1 < I2.item1 and to the UNION of (ancestor, descendant) and (descendant, ancestor) taken from Ancestor]
Figure 5.3. Generation of C2
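The set operations of Figure 5.3 translate directly into a short Python sketch. The function name `generate_c2` and the example item names are hypothetical; the Ancestor table is represented as a set of (ancestor, descendant) pairs.

```python
def generate_c2(f1, ancestor_pairs):
    """Self-join F1 on item1 < item2, then remove every pair that
    relates an item to one of its ancestors (in either order)."""
    pairs = {(a, b) for a in f1 for b in f1 if a < b}
    excluded = set(ancestor_pairs) | {(d, a) for a, d in ancestor_pairs}
    return pairs - excluded


# Hypothetical taxonomy: "clothes" is an ancestor of "jackets".
c2 = generate_c2({"clothes", "jackets", "shoes"}, {("clothes", "jackets")})
```

Here c2 contains (clothes, shoes) and (jackets, shoes) but not (clothes, jackets), since the support of an item and its ancestor together is just the support of the item.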
5.5 Support Counting to Find Frequent Itemsets
We use the candidate itemsets $C_k$, the data table T and the ancestor table Ancestor
to count the support of the itemsets in $C_k$. We consider two categories of SQL
implementations based on SQL-92 and SQL-OR in Sections 5.6 and 5.7 respectively. All the
SQL approaches developed for boolean associations in Chapters 3 and 4 can be extended
to handle taxonomies. However, we present the extensions to only a few representative
approaches. In particular, we consider the KwayJoin approach from SQL-92 and Vertical
and GatherJoin from SQL-OR.


[69] T. Imielinski, A. Virmani, and A. Abdulghani. Discovery board application programming
interface and query language for database mining. In Proc. of the 2nd Int'l
Conference on Knowledge Discovery and Data Mining, Portland, Oregon, August
1996.
[70] International Business Machines. White paper on data mining, Technical report, 1995.
[71] International Business Machines. IBM Intelligent Miner User's Guide, Version 1
Release 1, SH12-6213-00 edition, July 1996.
[72] H. V. Jagadish. A retrieval technique for similar shapes. In Proc. of the ACM
SIGMOD Conference on Management of Data, pages 208-217, Denver, May 1994.
[73] H. V. Jagadish and C. Faloutsos. Data reduction - a tutorial. Fourth international
conference on Knowledge Discovery and Data Mining, tutorial, August 1998.
[74] M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo. Finding
interesting rules from large sets of discovered association rules. In Third Int'l Conference
on Information and Knowledge Management, pages 401-407, Gaithersburg,
Maryland, November 1994. ACM Press.
[75] F. Korn, H. V. Jagadish, and C. Faloutsos. Efficiently supporting ad hoc queries
in large datasets of time sequences. In Proc. of the ACM SIGMOD Conference on
Management of Data, pages 289-300, 1997.
[76] F. Korn, A. Labrinidis, Y. Kotidis, and C. Faloutsos. Ratio rules: A new paradigm for
fast, quantifiable data mining. In Proc. of the VLDB Conference, New York, August
1998.
[77] R. Krishnamurthy and T. Imielinski. Practitioner problems in need of database
research: Research directions in knowledge discovery. SIGMOD RECORD, 20(3):76-78,
September 1991.
[78] K. Kulkarni. Object oriented extensions in SQL3: a status report. SIGMOD
RECORD, 1994.
[79] K. I. Lin, H. V. Jagadish, and C. Faloutsos. The TV-Tree: An index structure for
high-dimensional data. VLDB Journal, 3(4):517-542, 1994.
[80] G. Lohman, B. Lindsay, H. Pirahesh, and K. B. Schiefer. Extensions to starburst:
Objects, types, functions, and rules. Communications of the ACM, 34(10), October
1991.
[81] D. J. Lubinsky. Discovery from databases: A review of AI and statistical techniques.
In IJCAI-89 Workshop on Knowledge Discovery in Databases, pages 204-218, Detroit,
August 1989.
[82] J. A. Major and C. Feng. EFD: A hybrid knowledge/statistical-based system for the
detection of fraud. Int'l Journal of Intelligent Systems, 7(7), 1992.
[83] H. Mannila and H. Toivonen. On an algorithm for finding all interesting sentences.
In Cybernetics and Systems, Volume II, The 13th European Meeting on Cybernetics
and Systems Research, Vienna, Austria, April 1996.




said to be correlated if any of its subsets are dependent. For efficient evaluation
of correlation rules, a notion of support is introduced in Brin et al. [25], which is
downward closed. However, these properties are not incrementally updatable.
3. Consequent part of association rules: For any frequent itemset, if a rule with
consequent c holds (that is, it has minimum confidence) then so do rules with consequents
that are subsets of c [9]; that is, such rules are downward closed. Note that the
confidence is not incrementally updatable.
4. Maximal frequent itemsets: Random walk algorithms [58], which work by generating
a series of random walks along the itemset lattice, exploit the downward closure
property of maximal frequent itemsets. Support, which characterizes frequent itemsets
in this case, is also incrementally updatable.
7.6.2 Query Flocks
Boolean association rules were first defined in the context of market basket data
and have been generalized to mine associations across relational tables expressed as query
flocks [128]. A query flock is a parameterized query with a filter condition to eliminate the
values of parameters that are uninteresting. For example, let us assume that we want
to evaluate the mining query 'What are the interesting drivers that caused customers to
buy the widgets from a catalog?'. A driver is deemed interesting if it has caused at
least 10 customers to buy the widget. Let the data be stored in a set of relational tables,
namely, catalog(widget, manufacturer), sale(customer, widget), driver(customer, widget,
driver). The above query can be written as a query flock in Datalog as shown below. The
filter condition prunes out values which do not have minimum support. In Section 7.6.2,


ACKNOWLEDGMENTS
I would like to express my sincere gratitude to Sharma for his encouragement and
support throughout my dissertation and my stay with his group. We have had endless
arguments and discussions about various things. Especially during the group meetings,
this might have invited the wrath of several other students because, at times, it would
have even tested their patience. On the personal side he and his family have been my
good friends also. I am greatly indebted to Rakesh Agrawal and Sunita Sarawagi of IBM
Almaden Research Center for their help and for giving me an opportunity to work with their
group. The several discussions I had with them while I was at IBM and later through
e-mail were very useful. The work that I did with them contributed to a good part of
my dissertation. It was a nice experience working with the Quest data mining group at
Almaden and the enthusiasm and hard work of Sunita were contagious. I am also thankful
to Sanjay Ranka for his help and suggestions during my initial work on data mining. I
thank Professors Eric Hanson, Sartaj Sahni, Stanley Su and Suleyman Tufekci for being on
my committee and for their comments and suggestions.
I am grateful to many other people for helping me in several ways. In particular I
thank Raja, Sreenath, Mokhtar, Nabeel, Roger, and all my friends at the database center
for the chat sessions, lunch sessions, and so on. Many thanks to Sharon Grant for her help
with everything. She takes care of everything in the database center and her tireless spirit
and attention to even the minute details make things a whole lot easier. My sincere thanks


[25] S. Brin, R. Motwani, and C. Silverstein. Beyond market baskets: Generalizing associ
ation rules to correlations. In Proc. of the ACM SIGMOD Conference on Management
of Data, May 1997.
[26] S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and
implication rules for market basket data. In Proc. of the ACM SIGMOD Conference
on Management of Data, May 1997.
[27] Business Week. Database marketing, September 1994.
[28] J. Catlett. Megainduction: Machine Learning on Very Large Databases. PhD thesis,
University of Sydney, 1991.
[29] J. Catlett. Overpruning large decision trees. In 12th Intl Joint Conference on AI,
August 1991.
[30] S. Chakrabarti, B. Dom, R. Agrawal, and P. Raghavan. Using taxonomy, discrimi
nants, and signatures for navigating in text databases. In Proc. of the VLDB Con
ference, Athens, Greece, August 1997.
[31] S. Chakrabarti, B. Dom, and P. Indyk. Enhanced hypertext categorization using
hyperlinks. In Proc. of the ACM SIGMOD Conference on Management of Data,
Seattle, Washington, June 1998.
[32] D. Chamberlin. Using the New DB2: IBMs Object-Relational Database System.
Morgan Kaufmann, 1996.
[33] S. Chaudhuri. Data mining and database systems: Where is the intersection? IEEE
Data Engineering Bulletin, 21(1):4-8, March 1998.
[34] S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP technology.
ACM SIGMOD Record, 26(1):65-74, March 1997.
[35] S. Chaudhuri, R. Motwani, and V. Narasayya. Random sampling for histogram
construction: How much is enough? In Proc. of the ACM SIGMOD Conference on
Management of Data, Seattle, Washington, June 1998.
[36] D. Cheung, J. Han, V. Ng, and C. Y. Wong. Maintenance of discovered association
rules in large databases: An incremental updating technique. In Proc. of 1996 Int'l
Conference on Data Engineering, New Orleans, USA, February 1996.
[37] G. Cooper and E. Herskovits. A Bayesian method for the induction of probabilistic
networks from data. Machine Learning, 1992.
[38] David Shepard Associates, Business One Irwin, Illinois. The new direct marketing,
1990.
[39] D. Denning. An intrusion-detection model. IEEE Transactions on Software Engineering,
13(2), 1987.
[40] T. G. Dietterich and R. S. Michalski. Discovering patterns in sequences of events.
Artificial Intelligence, 25:187-232, 1985.


[Figure 3.16 charts: two panels, "T10.I4.D100K: Total time" and "T5.I2.D100K: Total time", showing the time spent in each pass (Pass 1 through Pass 7 for T10.I4.D100K) at support values of 2%, 1%, 0.75% and 0.33%.]
Figure 3.16. Comparison of Subquery and Set-oriented Apriori approaches
In Figure 3.16, we show the relative performance of Subquery and Set-oriented Apriori
approaches for the two datasets. The chart shows the total time taken for each of the
different passes. Set-oriented Apriori performs better than Subquery for all the support
values. The first two passes of both the approaches are similar and they take approximately
equal amounts of time. The difference between Set-oriented Apriori and Subquery widens for
higher numbered passes as explained in Section 3.8.4. For T5.I2.D100K, F2 was empty


LIST OF TABLES
3.1 Description of different real-life datasets 37
3.2 Notations used in cost analysis 41
3.3 Description of synthetic datasets 45
4.1 Pros and cons of different architectural options ranked on a scale of 1 (good)
to 4 (bad) 84
5.1 An example of the taxonomy table 90
5.2 Additional notations used for cost analysis 98


TABLE OF CONTENTS
ACKNOWLEDGMENTS iii
LIST OF FIGURES ix
LIST OF TABLES xii
ABSTRACT xiii
CHAPTERS
1 INTRODUCTION 1
1.1 Data Warehousing 1
1.2 Data Mining 3
1.2.1 Association Rule 4
1.2.2 Sequential Patterns 5
1.2.3 Classification 6
1.2.4 Clustering 7
1.3 Mining Databases 8
1.4 Goal 9
1.5 Thesis Organization 12
2 FROM FILE MINING TO DATABASE MINING 13
2.1 Related Work 13
2.2 Architectural Alternatives 15
2.2.1 Loose-Coupling 15
2.2.2 Cache-Mine 16
2.2.3 Stored-Procedure 17
2.2.4 User-Defined Function 17
2.2.5 SQL-Based Approach 18
2.2.6 Integrated Approach 20
2.3 Summary 21
3 ASSOCIATION RULES 22
3.1 Apriori Algorithm 23
3.2 Other Algorithms 24
3.3 Input-Output Formats 26


3.4 Associations in SQL 27
3.4.1 Candidate Generation in SQL 27
3.4.2 Counting Support to Find Frequent Itemsets 30
3.4.3 Rule Generation 30
3.5 Support Counting Using SQL-92 32
3.5.1 K-way Join 32
3.5.2 Three-way Join 33
3.5.3 Two Group-by 34
3.5.4 Subquery-Based 35
3.6 Performance Comparison 35
3.7 Cost Analysis 39
3.7.1 KWayJoin Plan with Ck as Outer Relation 40
3.7.2 KWayJoin Plan with Ck as Inner Relation 42
3.7.3 Effect of Subquery Optimization 43
3.8 Performance Optimizations 45
3.8.1 Pruning Non-Frequent Items 46
3.8.2 Eliminating Candidate Generation in Second Pass 47
3.8.3 Reusing the Item Combinations from Previous Pass 48
3.8.4 Set-Oriented Apriori 48
3.9 Performance Experiments with Set-Oriented Apriori 53
3.9.1 Scale-up Experiment 55
3.10 Summary 58
4 SUPPORT COUNTING USING SQL WITH OBJECT-RELATIONAL EXTENSIONS 60
4.1 Gather Join 60
4.1.1 Special Pass 2 Optimization 61
4.1.2 Variations of GatherJoin Approach 61
4.1.3 Cost Analysis of GatherJoin and its Variants 64
4.2 Vertical 66
4.2.1 Special Pass 2 Optimization 67
4.2.2 Cost Analysis 69
4.3 SQL-Bodied Functions 70
4.4 Performance Comparison 71
4.5 Final Hybrid Approach 76
4.6 Architecture Comparisons 76
4.6.1 Timing Comparison 77
4.6.2 Scale-up experiment 80
4.6.3 Impact of longer names 81
4.6.4 Space Overhead of Different Approaches 82
4.7 Summary of Comparison Between Different Architectures 83
4.8 Other Associations Algorithms 87
4.9 Summary 88


100 thousand transactions, of which (100 - d) thousand are used for the initial computation
and d thousand are used as the increment, where d is the fractional size (in percentage) of
the increment.
Figure 7.6. Speed up of the incremental algorithm based on the Subquery approach
Figure 7.7. Speed up of the incremental algorithm based on the Vertical approach
We compare the execution time of the incremental algorithm with respect to mining
the whole dataset. Figures 7.6 and 7.7 show the corresponding speed-ups of the incremental
algorithm based on the Subquery and the Vertical approaches for different minimum


processor. We develop several SQL formulations for a few representative mining operations
in order to better understand the performance profile of current database query processors
in executing these queries. We believe that it will enable us to identify what portions of
these mining operations can be pushed down to the query processing engine of a DBMS.
There are also several potential advantages of a SQL implementation. One can profitably
make use of the database indexing and query processing capabilities, thereby leveraging
more than two decades of development effort spent in making these systems robust,
portable, scalable, and highly concurrent. Rather than devising specialized parallelizations,
one can potentially exploit the underlying SQL parallelization, particularly in an
SMP environment. The current approach to parallelizing mining algorithms is to develop
specialized parallelizations for each of the algorithms [11, 61, 113, 114]. The DBMS support
for check-pointing and space management can be especially valuable for long-running
mining algorithms on huge volumes of data. The development of new algorithms could be
significantly faster if expressed declaratively using a few SQL operations. This approach is
also extremely portable across DBMSs since porting becomes trivial if the SQL approaches
use only the standard SQL features.
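[Figure 2.6 diagram: a mining operation expressed in Extended SQL passes through a Preprocessor, which emits SQL-92 or SQL-3/SQL-4 for the Optimizer of an (Object) Relational DBMS; the diagram also carries the label "Domain semantics of mining".]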
Figure 2.6. SQL architecture for mining in a DBMS
The architecture we have in mind is schematically shown in Figure 2.6. We visualize
that the desired mining operation will be expressed in some extension of SQL or a graphical
language. A preprocessor will generate appropriate SQL translation for this operation.


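[Figure 3.8 diagram: query plan tree topped by a group-by on item1, ..., itemk with having count(*) > :minsup.]
Figure 3.8. K-way join plan with Ck as inner relation
[Figure 3.9 diagram: query plan tree topped by a group-by on item1, ..., itemk with having count(*) > :minsup.]
Figure 3.9. K-way join plan with Ck as outer relation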
optimizers. The cost formulae are presented in terms of operator costs in order to make
them general; for instance join(p, q, r) denotes the cost of joining two relations of size p
and q to get a result of size r. The actual cost which is based on the join method used
and the system parameters can be derived by substituting the appropriate values in the
given formulae. The data parameters and operators used in the analysis are summarized
in Table 3.2.
3.7.1 KWayJoin Plan with Ck as Outer Relation
Start with Ck as the outermost relation and perform a series of joins with the k copies
of T. The final join result is grouped on the k items to find the support counts. The
choice of join methods for each of the intermediate joins depends on factors such as the


BIOGRAPHICAL SKETCH
Shiby Thomas was born on May 30, 1971, in Kothamangalam, Kerala, India. He received his
Bachelor of Technology degree in computer engineering from Regional Engineering College,
Calicut, India, in May 1992. After his graduation he worked for a year as a systems
engineer at Wipro Systems, Bangalore, India. He received his Master of Technology degree
in computer science and engineering from Indian Institute of Technology, Bombay, India,
in January 1995.
He joined the Department of Computer and Information Science and Engineering
at the University of Florida in fall, 1995. He has worked as a teaching assistant in the
Computer and Information Science and Engineering Department of the University and as
a research assistant in the Database Systems Research and Development Center of the
department. In Summer 1997, he worked as a summer intern in the Quest data mining
group at the IBM Almaden Research Center, San Jose, California. He will receive his
Doctor of Philosophy degree in December 1998.
His research interests include data mining and decision support technologies, real-time
databases and transaction processing.


sale. In this database, OLAP finds answers to queries of the form "How many widgets were
sold in the first quarter of 1998 in California vs. Florida?" However, data mining attempts
to answer queries like "What are the drivers that caused people to buy these widgets from
our catalog?" Fundamentally, data mining is statistical analysis and has been in practice for
a long time. But, until recently, statistical analysis was a time-consuming, manual process
which limited the amount of data that could be analyzed and the accuracy depended heavily
on the personnel involved in the analysis. Today, with the advent of various sophisticated
technologies, tools exist that automate the process, making data mining a practical solution
for a wide range of companies. For example, Fingerhut's (a direct-mail catalog company)
statistical analysis was limited to taking samples of 10 to 20 percent of its customers. With
data mining, it can examine 300 specific characteristics of each of the 10 to 12 million
customers in a much more focused way [15].
The initial efforts on data mining research were to cull together techniques from machine
learning and statistics [24, 28, 29, 37, 39, 40, 60, 65, 81, 90, 95, 102, 103, 104] to define
new mining operations and develop algorithms for them [5, 7, 19, 77]. In the remainder of
this section, we briefly introduce the various data mining problems.
1.2.1 Association Rule
Association rules, which capture the co-occurrence of items or events, were introduced in
the context of market basket data [7]. An example of such a rule might be that 60% of
transactions containing beer also contain diapers and 2% of transactions contain both these
items. Here 60% is the confidence and 2% is the support of the rule beer ==> diaper.
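The two measures can be made concrete with a small sketch. It is only an illustration under stated assumptions: the function names `support` and `confidence` and the five toy baskets are hypothetical and do not come from the thesis. Support is taken as the fraction of transactions containing every item of the rule, and confidence as the fraction of transactions containing the antecedent that also contain the consequent.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing antecedent that also contain consequent."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Hypothetical toy data: five market baskets, each a set of items.
baskets = [
    {"beer", "diaper", "chips"},
    {"beer", "diaper"},
    {"beer", "milk"},
    {"milk", "bread"},
    {"diaper", "bread"},
]

rule_support = support(baskets, {"beer", "diaper"})          # 2 of 5 baskets
rule_confidence = confidence(baskets, {"beer"}, {"diaper"})  # 2 of the 3 beer baskets
```

On this toy data the rule beer ==> diaper holds in two of the three baskets that contain beer, while both items appear together in two of the five baskets overall.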
Association rule mining is stated formally as follows [13]. Let I = {i1, i2, ..., im} be
a set of literals, called items. Let D be a set of transactions, where each transaction T is


$T_k$ is $C(N,k) \cdot T$. Join with $C_k$ filters out the non-candidate item combinations. The size
of the join result is the sum of the support of all the candidates, denoted by $S(C_k)$. The
actual value of the support of a candidate itemset will be known only after the support
counting phase. However, we get a good estimate by approximating it by the minimum of
the support of all its $(k-1)$-subsets in $F_{k-1}$. The total cost of the GatherJoin approach is
$$T_k \cdot t_k + \mathrm{join}(T_k, C_k, S(C_k)) + \mathrm{group}(S(C_k), C_k), \quad \text{where } T_k = C(N,k) \cdot T.$$
The above cost formula needs to be modified to reflect the special optimization of
joining with $F_1$ to consider only frequent items. We need a new term $\mathrm{join}(R, F_1, R_f)$ and
need to change the formula for $T_k$ to include only frequent items $N_f$ instead of $N$.
For the second pass, we do not need the outer join with $C_k$. The total cost of
GatherJoin in the second pass is
$$\mathrm{join}(R, F_1, R_f) + T_2 \cdot t_2 + \mathrm{group}(T_2, C_2), \quad \text{where } T_2 = C(N_f, 2) \cdot T.$$
Cost of GatherCount in the second pass is similar to that for basic GatherJoin except
for the final grouping cost. In this formula, group_int denotes the cost of doing the
support counting inside the table function.
$$\mathrm{join}(R, F_1, R_f) + \mathrm{group\_int}(T_2, C_2) + F_2 \cdot t_2$$
For GatherPrune the cost equation is
$$R \cdot \mathrm{blob}(k \cdot C_k) + S(C_k) \cdot t_k + \mathrm{group}(S(C_k), C_k).$$


I certify that I have read this study and that in my opinion it conforms to acceptable
standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation
for the degree of Doctor of Philosophy.
Stanley Y. W. Su
Professor of Computer and Information
Science and Engineering
I certify that I have read this study and that in my opinion it conforms to acceptable
standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation
for the degree of Doctor of Philosophy.
Suleyman Tufekci
Associate Professor of Industrial
and Systems Engineering
This dissertation was submitted to the Graduate Faculty of the College
of Engineering and to the Graduate School and was accepted as partial fulfillment of the
requirements for the degree of Doctor of Philosophy.
December 1998
Winfred M. Phillips
Dean, College of Engineering
M. J. Ohanian
Dean, Graduate School


attributes, the tid and item-list which is a collection of all its items in a VARCHAR or a
BLOB. The output of Gather is passed to another table function Comb-K which returns all
k-item combinations formed out of the items of a transaction. A record output by Comb-K
has k attributes T_itm1, ..., T_itmk, which can be directly used to probe into the Ck table.
An index is constructed on all the items of Ck to make the probe efficient. Figure 4.1
presents SQL queries for this approach.
This approach is analogous to the KWayJoin approach where we have replaced the
k-way self join of T with table functions Gather and Comb-K. These table functions are
easy to code and do not require a large amount of memory. Also, it is possible to merge
them together as a single table function GatherComb-K, which is what we did in our
implementation. The Gather function is not required when the data is already in a horizontal
format where each tid is followed by a collection of all its items.
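The data flow of Gather followed by Comb-K can be sketched outside the DBMS. The following is not the table-function code used in the thesis (which runs inside DB2); it is a minimal Python analogue in which the names `gather`, `comb_k` and `count_support` are assumptions, a Python set stands in for the index on Ck, and candidates are assumed to be tuples with items in sorted order.

```python
from collections import Counter, defaultdict
from itertools import combinations


def gather(rows):
    """Group (tid, item) rows into one sorted item list per transaction."""
    by_tid = defaultdict(set)
    for tid, item in rows:
        by_tid[tid].add(item)
    return {tid: sorted(items) for tid, items in by_tid.items()}


def comb_k(item_list, k):
    """Return every k-item combination of one transaction, in item order."""
    return combinations(item_list, k)


def count_support(rows, candidates, k, minsup):
    """Probe each k-combination against the candidates and count the matches."""
    candidate_set = set(candidates)  # stands in for the index on Ck
    counts = Counter()
    for items in gather(rows).values():
        for combo in comb_k(items, k):
            if combo in candidate_set:
                counts[combo] += 1
    # Mirrors the "having count(*) > :minsup" filter of the SQL formulation.
    return {c: n for c, n in counts.items() if n > minsup}


# Hypothetical toy input in the (tid, item) layout.
rows = [(1, "a"), (1, "b"), (1, "c"), (2, "a"), (2, "b"), (3, "a"), (3, "c")]
c2 = [("a", "b"), ("a", "c"), ("b", "c")]
frequent_pairs = count_support(rows, c2, k=2, minsup=1)
```

Because `gather` emits each item list in sorted order, every combination produced by `comb_k` is already in the canonical order needed to probe the candidate set directly.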
4.1.1 Special Pass 2 Optimization
Note that for k = 2, the 2-candidate set C2 is simply a join of F1 with itself. Therefore,
we can specially optimize pass 2 by replacing the join with C2 by a join with F1 before
the table function (see Figure 4.2). That way, the table function gets only frequent items
and generates significantly fewer 2-item combinations. This optimization can be useful for
other passes too, but unlike for pass 2 we still have to do the join with Ck.
4.1.2 Variations of GatherJoin Approach
GatherCount
One variation of the GatherJoin approach for pass two is the GatherCount approach
where we push the group-by inside the table function instead of doing it outside. The
candidate 2-itemsets (C2) are represented as a two dimensional array inside the modified


be associated with a frequency constraint also, where we want to find frequent itemsets
that satisfy B. Three different algorithms for mining associations with item constraints are
presented in Srikant et al. [121].
Item constraints enable us to pose mining queries such as "What are the products
whose sales are affected by the sale of, say, barbecue sauce?" and "What products are
bought together with sodas and snacks?". The 1-variable constraints in Ng et al. [97] are
a special case of item constraints.
Aggregation Constraint
These are constraints involving aggregate functions on the items that form the itemset.
For instance, in the POS example an aggregation constraint could be of the form
min(products_sold.price) > p. Here we consider a product as an item. The aggregate function
could be min, max, sum, count, avg or any other user-defined aggregate function. An
aggregation constraint of the form min(products_sold.price) > p can be used to find expensive
products that are bought together. Similarly max(products_sold.price) < q can
be used to find inexpensive products that are bought together. These aggregate functions
can be combined in various ways to express a whole range of useful mining computations.
For example, the constraint (min(products_sold.price) < p) & (avg(products_sold.price) >
q) targets the mining process to inexpensive products that are bought together with the
expensive ones.
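As a rough illustration of how such a combined constraint filters itemsets, the sketch below evaluates (min(price) < p) & (avg(price) > q) for a few itemsets. The function name `satisfies`, the price table and the thresholds are all hypothetical; the thesis itself expresses these constraints inside the mining queries rather than as a post-filter.

```python
def satisfies(itemset, price, p, q):
    """True if the cheapest item costs less than p and the average price exceeds q."""
    prices = [price[item] for item in itemset]
    return min(prices) < p and sum(prices) / len(prices) > q


# Hypothetical price list and itemsets.
price = {"gum": 1.0, "pen": 2.0, "tv": 500.0}
itemsets = [("gum", "tv"), ("gum", "pen"), ("tv",)]

selected = [s for s in itemsets if satisfies(s, price, p=5.0, q=100.0)]
```

Only the itemset that mixes a cheap item with an expensive one passes both conditions; the all-cheap itemset fails the average test and the single expensive item fails the minimum test.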
External Constraint
External constraints filter the data used in the mining process. These are constraints
on attributes which do not appear in the final result (which we call external attributes).
For example, if we want to find products bought during big purchases where the total




CHAPTER 7
INCREMENTAL MINING
Data mining discovers information within data warehouses and finds answers to questions
about your data that you haven't thought to ask. The rules discovered from a database
only reflect the current state of the database. In order to make the discovered rules reliable
and useful, large volumes of data need to be collected and analyzed over a period of time.
This entails the development of techniques to handle large volumes of data, and to maintain
rules over a significantly long period of time. Therefore, efficient algorithms to update,
maintain and manage the discovered rules are central to the database mining technology.
Association rule mining is an important data mining problem which finds applications
in various business domains. Association rules are discovered from a transaction database
and updates to it could potentially invalidate existing rules or introduce new rules. The
problem of updating the rules can be reduced to finding the new set of frequent itemsets.
Since rule generation is computationally inexpensive, it is not critical to develop incremental
rule generation algorithms. A simple solution to the update problem is to re-compute the
frequent itemsets from the updated database. This is clearly inefficient because all the
computations done initially for finding the old frequent itemsets are wasted. An algorithm,
FUP (Fast Update), for updating the frequent itemsets has been developed for the addition
of new transactions to the database [36]. It is based on the Apriori algorithm and needs


3.8.1 Pruning Non-Frequent Items
The size of the transaction table is a major factor in the cost of joins involving T.
It can be reduced by pruning the non-frequent items from the transactions after the first
pass. We store the transaction data as (tid, item) tuples in a relational table and hence
this pruning means simply dropping the tuples corresponding to non-frequent items. This
can be achieved by joining T with the frequent items table F1 as follows.
insert into Tf select t.tid, t.item
from T t, F1 f
where t.item = f.item
We insert the pruned transactions into table Tf which has the same schema as that of T.
In the subsequent passes, joins with T can be replaced with corresponding joins with Tf.
This could result in improved performance especially for higher support values where the
frequent item selectivity is low, since we join smaller tables. For some of the synthetic
datasets we used in our experiments, this pruning reduced the size of the transaction table
to about half its original size. This could be even more useful for real-life datasets which
typically contain lots of non-frequent items. For example, some of the real-life datasets
used for the experiments reported in Sarawagi et al. [109] contained on the order of 100
thousand items out of which only a few hundred were frequent. Figure 3.11 shows the
reduction in transaction table size due to this optimization for our experimental datasets.
The initial size (R) and the size after pruning (Rf) for different support values are shown.
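The pruning join can be tried on a toy table. The sketch below runs the same style of query through Python's built-in sqlite3 module rather than the DB2 system used in the experiments; the wrapper name `prune_nonfrequent`, the (item, supp) schema assumed for F1, the `>=` threshold used to fill F1, and the sample rows are all assumptions made for illustration.

```python
import sqlite3


def prune_nonfrequent(rows, minsup):
    """Return the (tid, item) rows of T whose item is frequent."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        create table T  (tid integer, item text);
        create table F1 (item text, supp integer);
        create table Tf (tid integer, item text);
        """
    )
    conn.executemany("insert into T values (?, ?)", rows)
    # First pass (assumed form): items with a count of at least minsup go into F1.
    conn.execute(
        "insert into F1 select item, count(*) from T "
        "group by item having count(*) >= ?",
        (minsup,),
    )
    # Pruning step: keep only the tuples of T that join with F1.
    conn.execute(
        "insert into Tf select t.tid, t.item from T t, F1 f where t.item = f.item"
    )
    pruned = conn.execute("select tid, item from Tf order by tid, item").fetchall()
    conn.close()
    return pruned


# Hypothetical transactions: items x and y occur only once each.
rows = [(1, "a"), (1, "b"), (1, "x"), (2, "a"), (2, "b"), (3, "a"), (3, "y")]
pruned_rows = prune_nonfrequent(rows, minsup=2)
```

In this toy run the pruned table keeps five of the seven original tuples, since the two tuples for the infrequent items x and y are dropped.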
With this optimization, in the cost formulae of Section 3.7, R can be replaced with
Rf, the number of records in T involving frequent items, and N with Nf, the average


Two factors that affect the choice amongst the Vertical, GatherJoin and GatherCount
approaches in different passes, and pass 2 in particular, are the number of frequent items (F1)
and the average number of frequent items per transaction (Nf). From the graphs in Figure 4.5
we notice that as the value of the support is decreased for each dataset, causing
the size of F1 to increase, the performance of pass 2 of the Vertical approach degrades
rapidly. This trend is also clear from our cost formulae. The cost of the Vertical approach
increases quadratically with F1. GatherJoin depends more critically on the number of frequent
items per transaction. For Dataset-B, even when the size of F1 increases by a factor
of 10, the value of Nf remains close to 2; therefore the time taken by GatherJoin does not
increase as much. However, for Dataset-C the value of Nf increases from 3.2 to 10 as the
support is decreased from 2.0% to 0.25%, causing GatherJoin to deteriorate rapidly. From
the cost formula for GatherJoin we notice that the total time for pass 2 increases almost
quadratically with Nf.
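The quadratic growth comes from the number of 2-item combinations generated per transaction. A small sketch, under the simplifying assumption that every transaction holds exactly the same whole number of frequent items (the thesis works with averages such as 3.2, which this toy calculation does not model); the function name `pass2_combinations` and the transaction count are hypothetical.

```python
from math import comb


def pass2_combinations(items_per_transaction, num_transactions):
    """2-item combinations generated when every transaction has the same item count."""
    return comb(items_per_transaction, 2) * num_transactions


# Hypothetical table of 1000 transactions with 2, 3 and 10 frequent items each.
growth = {n: pass2_combinations(n, 1000) for n in (2, 3, 10)}
```

Going from 3 to 10 frequent items per transaction multiplies the number of generated pairs by 15 in this idealized setting, which is in line with the steep deterioration described above.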
We validate this observation further by running experiments on synthetic datasets
for varying values of the number of frequent items per transaction. We used the synthetic
dataset generator described in Agrawal et al. [9] for this purpose. We varied the transaction
length, the number of transactions and the support values while keeping the total number
of records and the number of frequent items fixed. In Figure 4.6 we show the total time
spent in pass 2 of the Vertical and GatherJoin approaches. As the number of items per
transaction (transaction length) increases, the cost of Vertical remains almost unchanged
whereas the cost of GatherJoin increases.


indexing and space management all of which are already provided in a database system.
However, there are some detractors to easy development using the SQL alternative. First,
any attached UDF code will be harder to debug than stand-alone C++ codes due to lack
of debugging tools. Second, stand-alone code can be debugged and tested faster when run
against flat file data. Running against flat files is typically a factor of five to ten faster
compared to running against data stored in DBMS tables. Finally, some mining algorithms
(for instance, neural-net based) might be too awkward to express in SQL.
The ease of porting of the SQL alternative depends on the kind of SQL used. Within the
same DBMS, porting from one OS platform to another requires porting only the small UDF
code and hence is easy. In contrast, the Stored-procedure and Cache-Mine alternatives
require porting more lines of code. Porting from one DBMS to another could get hard for
the SQL approach if non-standard DBMS-specific features are used. Unfortunately, we found
SQL-92 implementations (which would have been quite portable) to be unacceptable from
the performance viewpoint. Our preferred SQL implementation relies on the availability
of DB2's table functions. Table functions in Oracle 8, for example, do not have the same
interface and semantics as those in DB2. Also, if different features have different performance
characteristics on different database systems, considerable tuning would be required. In
contrast, the Stored-procedure and Cache-Mine approaches are not tied to any DBMS-specific
features. The UDF implementation has the worst of both worlds; it is large and is
tied to a DBMS.
The biggest attraction of the SQL implementation is inter-operability and usage flexibility.
The ad hoc querying support provided by the DBMS enables flexible usage and
exposes potential for pipelining the input and output operators of the mining process with


where item = item1 or ... or
item = itemk
end for
insert into Fk select item1, ..., itemk, supp
from Ck where supp > :minsupp
The cost of this approach can be mainly attributed to the cost of updates to the
candidate table Ck. For each tuple of the data table T, for all the candidate itemsets in Ck
which contain that item, three updates are performed (the attributes prevTid, match, supp
of the itemset are updated). If Nk is the average number of k-item candidates containing
any given item, the total number of updates is 3 * R * Nk. The cost due to updates for this
approach is U(3 * R * Nk) where U(n) is the cost of n updates. If the updates are logged,
this cost includes the logging cost also.
4.4 Performance Comparison
We studied the performance of six approaches in this category: GatherJoin and its
variants (GatherPrune, Horizontal and GatherCount), Vertical and SBF. We used the four
datasets summarized in Table 3.1. In Figure 4.5 we show the performance of only four
approaches: GatherJoin, GatherCount, GatherPrune and Vertical. For the other
two approaches the running times were comparatively so large that we had to abort the
runs in many cases. The main reason why the Horizontal approach was significantly
worse than the GatherJoin approach was the time to transform the data to the horizontal
format. For instance, for Dataset-C it was 3.5 hours which is almost 20 times more than
the time taken by Vertical for 2% support. For Dataset-B the process was aborted


3.4.2 Counting Support to Find Frequent Itemsets
This is the most time-consuming part of the association rule mining algorithm. We
use the candidate itemsets Ck and the data table T to count the support of the itemsets in
Ck. We consider two different categories of SQL implementations.
(A) The first one is based purely on SQL-92. We discuss four approaches in this category
in Section 3.5.
(B) The second utilizes the new SQL object-relational extensions like UDFs, BLOBs
(binary large objects), table functions and so on. Table functions [80] are virtual
tables associated with a user defined function which generate tuples on the fly. Like
normal physical tables they have pre-defined schemas. The function associated with
a table function can be implemented as any other UDF. Thus, table functions can be
viewed as UDFs that return a collection of tuples instead of scalar values.
We discuss six approaches in this category in Chapter 4. Note that UDFs in this
approach are not heavy weight and do not require extensive memory allocations and
coding unlike in a purely UDF-based implementation [12].
3.4.3 Rule Generation
In the second phase of the association rule mining algorithm, we use the frequent
itemsets to generate rules with the user specified minimum confidence, minconf. For every
frequent itemset l, we first find all non-empty proper subsets of l. Then, for each of those
subsets m, we find the confidence of the rule m ==> (l - m) and output the rule if it is at least
minconf.
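The rule generation step described above can be sketched in a few lines. This is an in-memory illustration rather than the SQL formulation of the thesis; the function name `generate_rules` and the toy support counts are assumptions, and the sketch assumes that the support of every subset of a frequent itemset is available in the same dictionary.

```python
from itertools import combinations


def generate_rules(supports, minconf):
    """Return (m, l - m, confidence) for every rule meeting minconf.

    supports maps frozenset itemsets to support counts and must contain
    every non-empty subset of each frequent itemset.
    """
    rules = []
    for l, supp_l in supports.items():
        # Enumerate all non-empty proper subsets m of l.
        for size in range(1, len(l)):
            for m in map(frozenset, combinations(sorted(l), size)):
                conf = supp_l / supports[m]
                if conf >= minconf:
                    rules.append((m, l - m, conf))
    return rules


# Hypothetical support counts for items a, b and the pair {a, b}.
supports = {frozenset("a"): 4, frozenset("b"): 2, frozenset("ab"): 2}
rules = generate_rules(supports, minconf=0.6)
```

With these counts the rule b ==> a has confidence 2/2 and is output, while a ==> b has confidence 2/4 and is rejected.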
In the support counting phase, the frequent itemsets of size k are stored in table Fk.
Before the rule generation phase, we merge all the frequent itemsets into a single table F.


function apriori-gen(F_{k-1})
    for each p and q in F_{k-1} do
        if p.item_1 = q.item_1, ..., p.item_{k-2} = q.item_{k-2} and p.item_{k-1} < q.item_{k-1}
        then insert p.item_1, p.item_2, ..., p.item_{k-1}, q.item_{k-1} into C_k
    for each c in C_k
        delete c from C_k if some (k-1)-subset of c is not in F_{k-1}
Figure 7.1. A high-level description of the apriori-gen function
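The join and prune steps of apriori-gen translate almost line for line into executable code. The sketch below is one possible rendering, assuming each itemset is represented as a tuple of items in sorted order; the name `apriori_gen` and the sample input are illustrative only.

```python
from itertools import combinations


def apriori_gen(frequent_prev):
    """Generate candidate k-itemsets from the frequent (k-1)-itemsets.

    Each itemset is a tuple whose items are in sorted order.
    """
    prev = set(frequent_prev)
    candidates = set()
    # Join step: pair itemsets that agree on the first k-2 items.
    for p in prev:
        for q in prev:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                candidates.add(p + (q[-1],))
    # Prune step: drop candidates that have an infrequent (k-1)-subset.
    return {
        c
        for c in candidates
        if all(sub in prev for sub in combinations(c, len(c) - 1))
    }


# Hypothetical frequent 2-itemsets.
f2 = {(1, 2), (1, 3), (1, 4), (2, 3)}
c3 = apriori_gen(f2)
```

Here the join step proposes (1, 2, 3), (1, 2, 4) and (1, 3, 4), and the prune step keeps only (1, 2, 3) because (2, 4) and (3, 4) are not frequent.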
The negative border consists of all itemsets that were candidates which did not have
the minimum support in the level-wise method. That is, NBd(Fk) = Ck - Fk, where Ck is
the set of candidate k-itemsets, Fk is the set of frequent k-itemsets and NBd(Fk) is the set
of k-itemsets in NBd(F). Therefore, Fk ∪ NBd(Fk) = Ck. The negativeborder-gen function
to compute NBd(F) is explained in Figure 7.2.
function negativeborder-gen(F)
    Split F into F_1, F_2, ..., F_n where n is the size of the largest itemset in F
    forall k = 1, 2, ..., n do
        compute C_{k+1} using apriori-gen(F_k)
    F ∪ NBd(F) = (∪_{k=2}^{n+1} C_k) ∪ I_1, where I_1 is the set of 1-itemsets.
Figure 7.2. A high-level description of the negativeborder-gen function
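A runnable sketch of the negative border computation follows. It repeats a compact candidate generator so that the block is self-contained, and it assumes that F is downward closed (every subset of a frequent itemset is also in F) and that the full set of items is supplied by the caller; the names `negative_border_gen` and `negative_border` are hypothetical.

```python
from itertools import combinations


def apriori_gen(frequent_prev):
    """Candidate k-itemsets (sorted tuples) from frequent (k-1)-itemsets."""
    prev = set(frequent_prev)
    joined = {
        p + (q[-1],) for p in prev for q in prev
        if p[:-1] == q[:-1] and p[-1] < q[-1]
    }
    return {
        c for c in joined
        if all(sub in prev for sub in combinations(c, len(c) - 1))
    }


def negative_border_gen(frequent, all_items):
    """Return F ∪ NBd(F): candidates C_2 .. C_{n+1} plus all 1-itemsets."""
    frequent = {tuple(sorted(f)) for f in frequent}
    n = max((len(f) for f in frequent), default=0)
    result = {(item,) for item in all_items}  # I_1, the set of 1-itemsets
    for k in range(1, n + 1):
        f_k = {f for f in frequent if len(f) == k}
        result |= apriori_gen(f_k)  # C_{k+1}
    return result


def negative_border(frequent, all_items):
    """NBd(F): candidates that turned out not to be frequent."""
    frequent = {tuple(sorted(f)) for f in frequent}
    return negative_border_gen(frequent, all_items) - frequent


# Hypothetical example: items 1..3, with {1}, {2} and {1, 2} frequent.
f = {(1,), (2,), (1, 2)}
border = negative_border(f, all_items={1, 2, 3})
```

In this example item 3 is the only candidate that is not frequent, so it forms the entire negative border.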
Lemma 1 All 1-itemsets should be present in F ∪ NBd(F).


was never worse than a factor of two on the real-life datasets. Both these approaches require
additional space for caching, however. The Stored-procedure approach does not
require any extra space (except possibly for initially sorting the data in the DBMS) and
can perhaps be made to be within a factor of two to three of Cache-Mine with the recent
algorithms [26, 127]. The UDF approach is about a factor of two faster than the
Stored-procedure approach but is significantly harder to code. The SQL approach offers
some secondary advantages like easier development and maintenance and potential for automatic
parallelization. However, it might not be as portable as the Cache-Mine approach
across different database management systems.
Next, we wanted to find out if it is possible to handle other more complex mining tasks
within the same SQL framework. We studied generalized association rules and sequential
patterns with the goal of showing that the association rule framework can be extended easily
for these mining problems as well. We developed several SQL formulations for generalized
association rule and sequential pattern mining; some of them by extending the association
rule approaches. The major addition for generalized association rule was to extend the
input transaction table (transform T to T*). For sequential patterns, the join predicates
for candidate generation and support counting were significantly different. We conducted
some experiments on real-life datasets and found that the performance observations hold
here also.
In order to handle the volume of data stored in present day data warehouses, which
keeps streaming in, it is important to develop incremental mining algorithms. We developed
an incremental association rule mining algorithm which does not need to examine the old
data if the frequent itemsets do not change. Even in cases where new frequent itemsets are
added, access to the old database can be limited to just one scan.


extension cost can be ignored if T* is materialized because in that case, it will be a one
time cost. The result size of the join which extends the transactions is R* and the number
of k-item combinations generated, Tk, is C(N*, k) * T. Therefore, the total cost of the
GatherJoin approach is
$$\mathrm{join}(R, A, R^*) + \mathrm{order}(R^*) + T_k \cdot s_k + \mathrm{join}(T_k, C_k, S(C_k)) + \mathrm{group}(S(C_k), C_k)$$
In GatherExtend, transaction extension is done inside the table function. The cost of
passing the ancestor-list as a BLOB is blob(d) since the average length of the ancestor-list
is the same as the depth of the taxonomy. The cost of GatherExtend is
$$\mathrm{join}(R, I, R) + R \cdot \mathrm{blob}(d) + T_k \cdot e_k + \mathrm{join}(T_k, C_k, S(C_k)) + \mathrm{group}(S(C_k), C_k)$$
5.7.4 Vertical
In this approach, the transactions are first converted into a vertical format by creating
for each item a BLOB (tid-list) containing all tids that contain that item. The support for
each itemset is counted by merging the tid-lists of all its items. The tid-list of leaf node items
can be created using a table function in the same way as in the boolean associations case.
We present two approaches for creating the tid-list of the interior nodes in the taxonomy
DAG.
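For leaf items, the vertical format and the tid-list merge can be sketched as follows. This is an in-memory illustration, not the BLOB-based table functions of the thesis; the names `build_tidlists`, `intersect_sorted` and `support_count` are assumptions, and the sketch covers only leaf-level counting, not the taxonomy handling discussed next.

```python
from collections import defaultdict


def build_tidlists(rows):
    """Map each item to the ascending list of tids that contain it."""
    tidlists = defaultdict(list)
    for tid, item in sorted(set(rows)):  # sorting by tid keeps each list ordered
        tidlists[item].append(tid)
    return dict(tidlists)


def intersect_sorted(a, b):
    """Merge two ascending tid-lists, keeping the tids present in both."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def support_count(tidlists, itemset):
    """Support of a non-empty itemset = size of the merged tid-list of its items."""
    items = list(itemset)
    common = tidlists.get(items[0], [])
    for item in items[1:]:
        common = intersect_sorted(common, tidlists.get(item, []))
    return len(common)


# Hypothetical (tid, item) rows.
rows = [(1, "a"), (1, "b"), (1, "c"), (2, "a"), (2, "b"), (3, "a"), (3, "c")]
tidlists = build_tidlists(rows)
ab_support = support_count(tidlists, ("a", "b"))
```

Keeping every tid-list in ascending order is what allows the merge to run in a single linear pass over the two lists.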
The first approach is based on doing the union of the descendants' tid-lists of an
interior node. In the first phase, we create an initial TidTable containing the tid-lists of
the leaf nodes. The TidTable is joined with the ancestor table and the join result is passed
in the order of the ancestor attribute to the table function TUnion. For every node x, the


insert into Fk select item1, ..., itemk, count(*)
from Ck, (select t2.T_itm1, ..., t2.T_itmk
          from (select tid, item, ancestor-list
                from T, AncListTable A
                where T.item = A.item) as t1,
               table (GExtendComb-K(t1.tid, t1.item, t1.ancestor-list)) as t2)
where t2.T_itm1 = Ck.item1 and ... and
      t2.T_itmk = Ck.itemk
group by Ck.item1, ..., Ck.itemk
having count(*) > :minsup
[Figure 5.7 diagram: query plan tree. T is joined with AncListTable A to produce (T.tid, T.item, A.ancestor-list), which feeds the table function GExtendComb-K; its output is joined with Ck, grouped by item1, ..., itemk, and filtered with having count(*) > :minsup.]
Figure 5.7. Support counting by GatherExtend


The algorithm executes in two phases. In the first phase, it divides the database into a
number of non-overlapping partitions. The frequent itemsets in each of these partitions
are computed separately which will involve multiple passes over the data. However, the
partition sizes are chosen such that an entire partition fits in main memory so that it is
read only once in each phase. At the end of the first phase, the frequent itemsets from all
the partitions are merged together to generate the set of all potentially frequent itemsets.
This set is a superset of all the frequent itemsets since all itemsets that are frequent in
the whole database have to be frequent at least in one partition. In the second phase, the
actual support of these itemsets is counted and the frequent itemsets are identified.
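The two phases can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the published algorithm: the local miner is a brute-force subset counter suitable only for tiny inputs, support is given as a fraction and compared with >=, and the names frequent_itemsets and partition_mine are hypothetical.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_frac):
    """Brute-force local miner: count every non-empty subset of every transaction."""
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        for k in range(1, len(items) + 1):
            counts.update(combinations(items, k))
    threshold = min_frac * len(transactions)
    return {s for s, c in counts.items() if c >= threshold}

def partition_mine(transactions, min_frac, n_parts):
    """Two-phase mining over non-overlapping partitions (assumes n_parts >= 1
    and a non-empty transaction list)."""
    size = -(-len(transactions) // n_parts)  # ceiling division
    parts = [transactions[i:i + size] for i in range(0, len(transactions), size)]

    # Phase 1: the union of locally frequent itemsets is a superset of the
    # globally frequent ones.
    candidates = set()
    for part in parts:
        candidates |= frequent_itemsets(part, min_frac)

    # Phase 2: one more scan to count the actual support of every candidate.
    counts = Counter()
    for t in transactions:
        present = set(t)
        for c in candidates:
            if present.issuperset(c):
                counts[c] += 1
    threshold = min_frac * len(transactions)
    return {c: n for c, n in counts.items() if n >= threshold}

data = [["a", "b"], ["a", "c"], ["a", "b"], ["b", "c"]]
result = partition_mine(data, min_frac=0.5, n_parts=2)
```

On this sample, phase 1 proposes (a, c) and (b, c) as candidates because each is frequent in one partition, and phase 2 discards them because their global support is only 1.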
Toivonen proposes a sampling based algorithm [127]. The idea there is to pick a
random sample, use it to determine all association rules that probably hold in the whole
database, and then to verify the results with the rest of the database. The algorithm thus
produces exact association rules in one full pass over the database. In those rare cases where
the sampling method does not produce all association rules, the missing rules can be found
in a second pass. A superset of the frequent itemsets can be determined efficiently from a
random sample by applying any level-wise algorithm on the sample in main memory, and
by using a lowered support threshold. In cases where approximate results are sufficient, the
sampling approach can significantly reduce the computational and I/O requirements since
it works on a much smaller dataset.
A different way of counting support is proposed in Savasere et al. [111] and Zaki
et al. [131]. Associated with each item is a tidlist which consists of all the transaction
identifiers that contain that item. The support for an itemset can be obtained by counting
the number of transactions that contain all the items in the itemset. If the tidlists are


Computing Candidate Closure
In the apriori algorithm candidate itemsets are generated in two steps: the join step
and the prune step. In the join step, two sets of (k-1)-itemsets called generators and
extenders are joined to get k-itemsets. An itemset s1 in generators joins with s2 in extenders
if the first (k-2) items of s1 and s2 are the same and the last item of s1 is lexicographically
smaller than the last item of s2. The join results in an itemset that is s1 extended with the
last item of s2. The result of the join step is subjected to subset pruning, which filters out
all itemsets with a non-frequent (k-1)-subset. This can be accomplished by subsequent
joins with (k-2) copies of a set of itemsets termed filters. While generating Ck from Fk-1
in the basic apriori algorithm, Fk-1 acts as generators, extenders and filters.
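The join and prune steps can be sketched in Python. This is an in-memory illustration of the basic case only, where Fk-1 plays all three roles; it is not the SQL formulation, and the function name generate_candidates and the sample itemsets are assumptions.

```python
from itertools import combinations

def generate_candidates(generators, extenders, filters):
    """Join generators with extenders, then prune using filters.

    All itemsets are sorted tuples of the same length k-1.
    """
    joined = set()
    for s1 in generators:
        for s2 in extenders:
            # Same first k-2 items, and last item of s1 smaller than that of s2.
            if s1[:-1] == s2[:-1] and s1[-1] < s2[-1]:
                joined.add(s1 + (s2[-1],))
    # Subset pruning: every (k-1)-subset of a candidate must be in filters.
    return {c for c in joined
            if all(sub in filters for sub in combinations(c, len(c) - 1))}

# Hypothetical frequent 2-itemsets; F2 acts as generators, extenders and filters.
F2 = {(1, 2), (1, 3), (2, 3), (2, 4)}
C3 = generate_candidates(F2, F2, F2)
```

The join produces (1, 2, 3) and (2, 3, 4); the second is pruned because its subset (3, 4) is not in F2, leaving C3 = {(1, 2, 3)}.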
In the incremental algorithm, we compute the candidate closure to avoid multiple
passes while counting the support of the new candidates. It can be seen that all the
new candidates will be supersets of promoted borders. Therefore, it is sufficient to use the
itemsets that are promoted borders as generators. In order to generate Ck, the candidates of
size k, we use PBk-1 ∪ Ck-1 as generators and PBk-1 ∪ Ck-1 ∪ Fk-1 as extenders and filters.
PBk-1 and Fk-1 denote promoted borders and frequent itemsets of size (k-1) respectively.
The candidate generation process starts with C0 as the empty set and terminates when Ck
becomes empty. It is straightforward to derive SQL queries for this process and we do not
present them here (refer to Section 3.4.1).
Vertical
In the Subquery approach, for every transaction that supports an itemset we generate
(itemset, tid) tuples resulting in large intermediate tables. The Vertical approach avoids
this by collecting all tids that support an itemset into a BLOB (binary large object) and


insert into TidTable select t2.item, t2.tid-list
from (select A.ancestor, T.tid-list
      from TidTable T, Ancestor A
      where T.item = A.descendant
      order by ancestor) as t1,
     table(TUnion(t1.ancestor, t1.tid-list)) as t2
Figure 5.8. Interior nodes tid-list generation by union
table function collects the tid-lists of all the leaf nodes reachable from x, unions them and
outputs the tid-list for x. The SQL query for this is illustrated in Figure 5.8. This approach
can be further optimized to do the tid-list union of only the children of each interior node.
However, in the first phase tid-lists have to be created and materialized for every leaf node
item irrespective of its support. This is very expensive especially when the number of items
is large.
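The behavior of the union-based table function can be sketched in Python as follows. This is a hedged illustration rather than the actual TUnion code: the function name t_union, the item names, and the in-memory join with the ancestor table are all assumptions made for the example.

```python
from itertools import groupby

def t_union(rows):
    """rows: (ancestor, tid_list) pairs sorted by ancestor.

    Yields one (ancestor, merged_tid_list) pair per interior node.
    """
    for ancestor, group in groupby(rows, key=lambda r: r[0]):
        merged = set()
        for _, tids in group:
            merged.update(tids)
        yield ancestor, sorted(merged)

# Hypothetical leaf tid-lists and (ancestor, descendant) pairs.
leaf_tidlists = {"jacket": [1, 3], "ski pants": [2, 3], "shoes": [4]}
ancestor_table = [("outerwear", "jacket"), ("outerwear", "ski pants"),
                  ("clothes", "jacket"), ("clothes", "ski pants"),
                  ("footwear", "shoes")]

# Join on item = descendant, then order by ancestor before calling t_union.
joined = sorted((anc, leaf_tidlists[desc])
                for anc, desc in ancestor_table if desc in leaf_tidlists)
interior = dict(t_union(joined))
```

Because the input arrives in ancestor order, the function only needs to hold the tid-lists of one interior node at a time.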
The second approach is to pass T* (defined in Section 5.6.1) to the Gather table
function, as shown in Figure 5.9. The table function outputs tid-lists for all the items in
the taxonomy.
The support counting queries are exactly the same as for boolean associations explained
in Section 4.2. The cost formula is also the same as in boolean associations except


implementation because they are aimed at reducing the number of passes over the data.
We have seen that with our SQL implementations hardly any time is spent on passes beyond
2 because of the Vertical approach. The time spent in pass 2 is not reduced because they
all require counting support of all of C2 on the entire data anyway.
4.9 Summary
In this chapter, we developed and experimented with a set of approaches that made
use of the new object-relational extensions like UDFs, BLOBs and table functions. These
approaches performed much better than the SQL-92 approaches in Chapter 3. We developed
a hybrid scheme which picks the best approach for each pass based on the cost estimates. We
also compare the various architectural alternatives both qualitatively and quantitatively.
Note: A part of the work described in this chapter was primarily done by researchers
from IBM Almaden Research Center. The author's primary contributions were the optimizations
to the approaches in Sections 4.1.1 and 4.2.1, the hybrid approach in Section 4.5
and the cost analysis of the various support counting approaches. The author has also
contributed to the performance experiments which led to the various comparisons.


4.5 Final Hybrid Approach
The previous performance section helps us draw the following conclusion: Overall, the
Vertical approach is the best option especially for higher passes. When the size of the
candidate itemsets is too large, the performance of the Vertical approach could suffer.
In such cases, GatherJoin is a good option as long as the number of frequent items per
transaction (Nf) is not too large. When that happens GatherCount may be the only
good option even though it may not be easily parallelizable. These qualitative differences are
captured by the cost formulae we presented earlier and are used by our final hybrid scheme.
The hybrid scheme chooses the best of the three approaches GatherJoin, GatherCount
and Vertical for each pass based on the cost estimation outlined in the previous sections.
The parameter values used for the estimation are all available at the end of the previous
pass. In Section 4.6 we plot the final running time for the different datasets based on this
hybrid approach.
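The per-pass selection can be sketched as a simple minimum over cost estimates. The sketch below assumes the estimates have already been computed from the cost formulae; the numeric values are invented placeholders, and choose_approach is a hypothetical name.

```python
def choose_approach(cost_estimates):
    """Return the name of the approach with the lowest estimated cost."""
    return min(cost_estimates, key=cost_estimates.get)

# Placeholder estimates per pass (made-up numbers, not measured values).
per_pass_costs = {
    2: {"GatherJoin": 40.0, "GatherCount": 25.0, "Vertical": 60.0},
    3: {"GatherJoin": 30.0, "GatherCount": 45.0, "Vertical": 12.0},
}
plan = {k: choose_approach(costs) for k, costs in per_pass_costs.items()}
```

With these placeholder values the plan uses GatherCount in pass 2 and Vertical in pass 3.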
4.6 Architecture Comparisons
In this section our goal is to compare five architectural alternatives: Loose-coupling,
Stored-procedure, Cache-Mine, UDF, and the best SQL implementation.
For Loose-coupling, we use the implementation of the Apriori algorithm [9] for finding
association rules provided with the IBM data mining product, Intelligent Miner [71].
For Stored-procedure, we extracted the Apriori implementation in Intelligent Miner and
created a stored procedure out of it. The stored procedure is run in the unfenced mode
in the database address space. For Cache-Mine, we used an option provided in Intelligent
Miner that causes the input data to be cached as a binary file after the first scan of the data


a specialized operator for every mining task. It also needs a language in which the required
operations can be specified. Realizing this goal requires a tremendous amount
of research in various aspects like designing language extensions and better query processing
and optimization strategies. However, we envision that the query processing engine will
eventually be extended with primitive mining operators. When that is accomplished, a
mining system architecture will resemble the one shown in Figure 2.7.
[Figure 2.7 components: GUI; Extended SQL, SQL-92 or SQL-3/SQL-4 interface; (Object) Relational DBMS with Enhanced Optimizer; domain semantics of mining]
Figure 2.7. Architecture for mining in next-generation DBMSs
2.3 Summary
The first two approaches in the architecture taxonomy in Figure 2.1, namely, mining in
the application server or database server, facilitate the move from file mining to database
mining rather easily. However, as explained in the loose-coupling, cache-mine, stored-
procedure and user-defined function approaches, they do not utilize the query processing
functionality provided by the DBMS.
In this dissertation we pursue the third approach in Figure 2.1, which uses SQL and
its extensions to implement the mining algorithms. This acts as a precursor to determine
the extensions to current query processors and optimizers in order to move towards the last
approach which is the truly integrated approach.
Note: The architectural alternatives in Section 2.2 were developed primarily by
researchers from IBM Almaden Research Center and the author was a contributor.


2. Update the support count of all itemsets in FrequentSets ∪ NegativeBorder to include
their support in db.
3. Find the itemsets in NegativeBorder that became frequent by the addition of db (call
them Promoted-NegativeBorder, PNb).
4. Find the candidate closure with PNb as the seed. During this computation, we use
the candidate generation procedure with constraints as described in Section 7.5.2.
5. If there are no new itemsets in the candidate closure, skip this step. Otherwise, count
the support of all new itemsets in the candidate closure against the whole dataset.
The dataset is subjected to the data filtering step before the support counting as shown
in Figure 7.11.
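Steps 2 and 3 above can be sketched in Python. This is a minimal in-memory illustration, not the SQL formulation: it assumes support counts are kept in a dictionary keyed by frozenset, that the minimum support is a fraction compared with >=, and that promoted_negative_border is a hypothetical helper name.

```python
def promoted_negative_border(old_counts, negative_border, db, n_old, min_frac):
    """Update counts with the increment db and find promoted border itemsets.

    old_counts: itemset (frozenset) -> support count in the old data, for
                every itemset in FrequentSets U NegativeBorder.
    negative_border: the subset of those itemsets that were not frequent.
    n_old: number of transactions in the old data.
    """
    new_counts = dict(old_counts)
    for transaction in db:
        present = set(transaction)
        for itemset in new_counts:
            if itemset <= present:
                new_counts[itemset] += 1
    threshold = min_frac * (n_old + len(db))
    promoted = {s for s in negative_border if new_counts[s] >= threshold}
    return new_counts, promoted

# Hypothetical state: 10 old transactions, minimum support 40%.
old_counts = {frozenset("a"): 6, frozenset("b"): 5,
              frozenset("c"): 3, frozenset("ab"): 4}
border = {frozenset("c")}
increment = [["a", "b", "c"], ["a", "b", "c"]]
new_counts, promoted = promoted_negative_border(
    old_counts, border, increment, n_old=10, min_frac=0.4)
```

Only the increment is scanned here. The old data would need to be read again only if the candidate closure seeded by the promoted itemsets contains new itemsets.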
7.5.4 Constraint Relaxation
The negative border idea can be used to handle some cases of constraint relaxation
also, especially relaxations to the external constraints and the frequency constraint. For
instance, in the example in Section 7.5.2, if we relax the constraint on the sales transaction
to sales.Total_price > 300, it can be cast as an incremental mining problem. In this case,
the increment dataset would be the transactions with Total_price between 300 and 500.
Note that, for this to work the relaxed constraint should subsume the initial constraint.
Relaxations to the frequency constraint involve two cases. In cases where the frequency
threshold is increased, it is straight-forward to find the new frequent itemsets by just filtering
the itemsets with the new threshold. On the other hand, if the frequency threshold is
lowered we need to use an approach similar to the incremental mining algorithm where we
compute the promoted negative borders and candidate closure to determine if the support
of any new itemset needs to be found.


Dataset-C and Dataset-D. Time taken by Stored-procedure can be expressed
approximately as number of passes times time taken by Cache-Mine.
UDF is similar to Stored-procedure. The only difference is that the time per pass
decreases by 30-50% for UDF because of closer coupling with the database.
The SQL approach comes second in performance after the Cache-Mine approach for
low support values and is even somewhat better for high support values. The cost
of converting the data to the vertical format for SQL is typically lower than the cost
of transforming data to binary format outside the DBMS for Cache-Mine. However,
after the initial transformation subsequent passes take negligible time for Cache-Mine.
For the second pass SQL takes significantly more time than Cache-Mine particularly
when we decrease support. For subsequent passes even the SQL approach does not
spend too much time. Therefore, the difference between Cache-Mine and SQL is not
very sensitive to the number of passes because both approaches spend negligible time
in higher passes.
The SQL approach is 1.8 to 3 times better than Stored-procedure or Loose-coupling
approach. As we decreased the support value so that the number of passes over the
dataset increases, the gap widens. Note that we could have implemented a Stored-
procedure using the same hybrid algorithm that we used for SQL instead of using the
IM algorithm. Then, we expect the performance of Stored-procedure to improve
because the number of passes over the data will decrease. However, we will pay the
storage penalty of making an additional copy of the data as we did in the Cache-Mine
approach. The performance of Stored-procedure cannot be better than Cache-Mine


[99] J. S. Park, M. S. Chen, and P. S. Yu. An effective hash based algorithm for mining
association rules. In Proc. of the ACM-SIGMOD Conference on Management of Data,
San Jose, California, May 1995.
[100] PostgreSQL Organization. PostgreSQL 6.3 User Manual, February 1998.
http://www.postgresql.org.
[101] Quest: IBM develops market basket analysis system. Stores, May 1994.
[102] J. R. Quinlan. Induction over large databases. Technical Report STAN-CS-739,
Stanford University, 1979.
[103] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[104] J. R. Quinlan and R. L. Rivest. Inferring decision trees using minimum description
length principle. Information and Computation, 1989.
[105] K. Rajamani, B. Iyer, and A. Chaddha. Using DB2's object-relational extensions
for mining association rules. Technical Report TR 03-690, Santa Teresa Laboratory,
IBM Corporation, September 1997.
[106] R. Rastogi and K. Shim. PUBLIC: A decision tree classifier that integrates building
and pruning. In Proc. of the VLDB Conference, New York, August 1998.
[107] N. Roussopoulos, Y. Kotidis, and M. Roussopoulos. Cube tree: Organization of
and bulk incremental updates on the data cube. In Proc. of the ACM SIGMOD
Conference on Management of Data, Tucson, Arizona, May 1997.
[108] M. A. Roytberg. A search for common patterns in many sequences. Computer
Applications in the Biosciences, 8(1):57-64, 1992.
[109] S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with
relational database systems: Alternatives and implications. In Proc. of the ACM
SIGMOD Conference on Management of Data, Seattle, Washington, June 1998.
[110] S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with
relational database systems: Alternatives and implications. Research Report RJ
10107 (91923), IBM Almaden Research Center, San Jose, California, March 1998.
(Longer version of the SIGMOD '98 paper).
[111] A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining
association rules in large databases. In Proc. of the VLDB Conference, Zurich, Switzerland,
September 1995.
[112] S. Z. Selim and M. A. Ismail. K-means-type algorithms: A generalized convergence
theorem and characterization of local optimality. IEEE Transactions on Pattern
Analysis and Machine Intelligence, PAMI-6(1), 1984.
[113] J. Shafer and R. Agrawal. Parallel algorithms for high-dimensional similarity joins for
data mining applications. In Proc. of the VLDB Conference, Athens, Greece, August
1997.


CHAPTER 6
SEQUENTIAL PATTERNS
Sequential pattern mining utilizes the time associated with the transaction data to
find frequently occurring patterns. A sequential pattern is an ordered list (sequence) of
itemsets (refer to Section 1.2.2 for a brief description of sequential patterns). Given a set of
data-sequences each of which is a list of transactions ordered by the transaction time, the
problem of mining sequential patterns is to discover all such patterns with a user-specified
minimum support.
We present several SQL formulations of sequential pattern mining in this chapter. In
Section 6.1, we describe the input-output formats and in Section 6.2, we briefly outline
the GSP algorithm for sequential pattern mining [120]. SQL-based sequential pattern
candidate generation is presented in Section 6.3. In Sections 6.5 and 6.6, we present several
SQL formulations for support counting based on SQL-92 and SQL-OR. In Section 6.7, we
briefly mention how to handle taxonomies over items. We concentrate on showing that
sequential pattern mining can also be handled within the same SQL-based framework. We
expect the performance trends from association rule mining to hold here also.
6.1 Input-Output Formats
6.1.1 Input Format
The input table D of data-sequences has three column attributes: sequence identifier
(sid), transaction time (time) and item identifier (item). Every data-sequence typically


the candidate closure computation, we count the support in db of just the new-candidates
for the pruning explained above. This results in better speed-up as compared to the basic
incremental algorithm.
Figure 7.8. Speed-up of the incremental algorithm based on the Subquery approach with
the new-candidate optimization (legend: 1%, 5%, 10%)
Figure 7.9. Speed-up of the incremental algorithm based on the Vertical approach with the
new-candidate optimization (legend: 1%, 5%, 10%)
Figures 7.8 and 7.9 show the speed-up of the Subquery and Vertical approaches with
the new-candidate optimization. We can observe that this optimization achieves speed-
ups that are up to 25% better. The improvement is more at smaller increment sizes.


insert into R select t1.T_item1, ..., t1.T_itemk,
       t1.T_support, t1.T_len, t1.T_rulem, t1.T_support/f2.support
from F f1, table(GenRules(f1.item1, ..., f1.itemk, f1.len, f1.support)) as t1, F f2
where (t1.T_item1 = f2.item1 or t1.T_rulem < 1) and
      ...
      (t1.T_itemk = f2.itemk or t1.T_rulem < k) and
      t1.T_rulem = f2.len and
      t1.T_support/f2.support > :minconf
Figure 3.4. Rule generation
3.5 Support Counting Using SQL-92
We present four approaches in this category.
3.5.1 K-way Join
In each pass k, we join the candidate itemsets Ck with k transaction tables T and
follow it up with a group by on the itemsets as shown in Figure 3.5.
Figure 3.5 also shows a tree diagram of the query. These tree diagrams are not to be
confused with the plan trees which could look quite different.
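For k = 2, a query along these lines can be run against an in-memory SQLite database from Python. This is a sketch under assumptions, not the exact query of Figure 3.5: the table names T and C2 follow the text, the sample rows are invented, and the support test uses >= with an absolute count.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table T (tid integer, item text);
    create table C2 (item1 text, item2 text);
    insert into T values (1,'a'),(1,'b'),(1,'c'),(2,'a'),(2,'b'),(3,'a'),(3,'c');
    insert into C2 values ('a','b'),('a','c'),('b','c');
""")

minsup = 2  # absolute support threshold (assumed for this example)

# Join C2 with two copies of T on item and tid, then group by the itemset.
F2 = conn.execute("""
    select C2.item1, C2.item2, count(*)
    from C2, T t1, T t2
    where t1.item = C2.item1 and
          t2.item = C2.item2 and
          t1.tid = t2.tid
    group by C2.item1, C2.item2
    having count(*) >= ?
    order by C2.item1, C2.item2
""", (minsup,)).fetchall()
conn.close()
```

Each additional item in the candidate adds one more copy of T to the join, which is why the query text grows with the pass number.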


6.8 Summary
In this chapter, we presented various SQL formulations for mining sequential patterns.
We extended a few representative association rule mining approaches in order to show that
sequential pattern mining can also be handled within the SQL framework. The major
changes were in the join predicates for candidate generation and support counting.
We did not do extensive experimentation for the sequential pattern mining approaches.
Among the SQL-based approaches, the Vertical approach performed the best in our experiments.
The SQL queries for the higher numbered passes become too long since there are
several join predicates and as a result the optimizer does not generate optimal execution
plans. Further, in the Vertical approach, the user-defined functions for support counting are
much more complex than their association rule counterparts. We also did not implement
the extensions to handle taxonomies in this case.


hand, the negative border based change propagation can also be applied for the maintenance
of views involving monotone aggregate functions that satisfy the apriori subset property.
For example, if the view definition has a filter condition such as the SQL having clause,
it could be beneficial to also store the records which do not pass the filter condition test
(same as the negative border). When the base tables are updated, these records will be
useful to maintain the view rather than re-materializing it.
An itemset can also be treated as a point in the data cube [55] defined by the items as
dimensions and support as the measure. The data cube points can be arranged as a lattice
according to the partial order on the itemsets. In that case, the data cube maintenance
algorithms similar to that of Roussopoulos et al. [107] are also applicable here. However,
this approach might not be viable in cases consisting of large number of items.
7.7 Summary
In this chapter, we developed an incremental algorithm for mining frequent itemsets.
We developed SQL formulations for the incremental approach based on a few representative
approaches from Chapters 3 and 4. We also show how the incremental algorithm can
be generalized to handle certain kinds of constraints and constraint relaxations and its
applicability to other mining problems.


contains several transactions each of which is a set of items. The data-sequence table
contains multiple rows corresponding to different items that belong to transactions in the
data-sequence. The taxonomy on the items is represented in the same way as for the
generalized association rules.
6.1.2 Output Format
The output is a collection of frequent sequences. A sequence which is represented as a
tuple in a fixed-width table consists of an ordered list of elements where each element is a set
of items. The schema of the frequent sequences table is (item1, eno1, ..., itemk, enok, len).
The len attribute gives the length of the sequence, that is, the total number of items in
all the elements of the sequence. The eno attributes store the element number of the
corresponding items. For sequences of smaller length, the extra column values are set to
NULL. For example, if k = 4, the sequence ((computer, modem)(printer)) is represented
by the tuple (computer, 1, modem, 1, printer, 2, NULL, NULL, 3).
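The fixed-width encoding can be sketched in Python, with None standing in for NULL. The function name encode_sequence is an assumption made for this illustration.

```python
def encode_sequence(sequence, k):
    """Encode a sequence as (item1, eno1, ..., itemk, enok, len).

    sequence: list of elements, each a list of items.
    k: maximum sequence length supported by the table.
    """
    row = []
    length = 0
    for eno, element in enumerate(sequence, start=1):
        for item in element:
            row.extend([item, eno])
            length += 1
    if length > k:
        raise ValueError("sequence is longer than k items")
    row.extend([None] * (2 * (k - length)))  # pad unused columns with NULL
    row.append(length)
    return tuple(row)

encoded = encode_sequence([["computer", "modem"], ["printer"]], k=4)
```

For the example in the text this yields ("computer", 1, "modem", 1, "printer", 2, None, None, 3).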
6.2 GSP Algorithm
We use the GSP algorithm of Srikant and Agrawal [120] as the basis for our SQL
formulations. The basic structure of the GSP algorithm is similar to that of the apriori
algorithm for association rule mining [9], although the specific details are quite different.
Therefore, the SQL-based association rule framework can be used for sequential patterns
also. The algorithm makes multiple passes over the data-sequences where in the kth pass
it finds frequent sequences of length k. Each pass consists of two phases: the candidate
generation phase and the support counting phase. In the candidate generation phase, the
frequent (k-1)-length sequences, Fk-1, found in the previous pass are used as the seed set
to generate candidate k-length sequences (Ck) that are potentially frequent. The candidate


CHAPTER 4
SUPPORT COUNTING USING SQL WITH OBJECT-RELATIONAL EXTENSIONS
In this chapter, we study alternative approaches that make use of additional object-
relational features in SQL. For each approach, we also outline a cost-based analysis of the
execution time to enable one to choose between these different approaches. We present six
different approaches, optimizations and their cost estimates in Sections 4.1, 4.2 and 4.3.
Experimental results comparing the performance of these approaches are presented in Section 4.4.
In Section 4.5, we propose a hybrid approach which combines the best of all
approaches. The performance of different architectural alternatives described in Chapter 2
is compared in Section 4.6. In Section 4.7, we summarize qualitative comparisons of these
architectures. The applicability of the SQL-based approach to other association rule mining
algorithms is briefly discussed in Section 4.8.
4.1 GatherJoin
This approach (see Figure 4.1) is based on the use of table functions described in Section 3.4.2.
It generates all possible k-item combinations of items contained in a transaction,
joins them with the candidate table Ck and counts the support of the itemsets by grouping
the join result. It uses two table functions Gather and Comb-K. The data table T is scanned
in the (tid, item) order and passed to the table function Gather. This table function collects
all the items of a transaction (in other words, items of all tuples of T with the same
tid) in memory and outputs a record for each transaction. Each such record consists of two


The schema of F consists of k + 2 attributes (item1, ..., itemk, support, len), where k is
the size of the largest frequent itemset and len is the length of the itemset as discussed
earlier in Section 3.3.
We use the table function GenRules to generate all possible rules from a frequent itemset.
The input argument to the function is a frequent itemset. For each itemset, it outputs
tuples corresponding to rules with all non-empty proper subsets of the itemset in the consequent.
The table function outputs tuples with k + 3 attributes: T_item1, ..., T_itemk,
T_support, T_len, T_rulem. The output is joined with F to find the support of the antecedent,
and the confidence of the rule is calculated by taking the ratio of the support
values. The predicates in the where clause match the antecedent of the rule with the frequent
itemset corresponding to the antecedent. While checking for this match, we need to
check only up to itemk where k <= T_rulem. The or part (t1.T_rulem < k) in the predicate
accomplishes this. Figure 3.4 illustrates the rule generation query.
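The same computation can be sketched in memory with Python. This is an illustration of the idea rather than the GenRules table function: it assumes the support dictionary contains every frequent itemset (so each antecedent can be looked up), uses >= for the confidence test, and gen_rules is a hypothetical name.

```python
from itertools import combinations

def gen_rules(support, minconf):
    """Generate (antecedent, consequent, support, confidence) tuples.

    support: dict mapping frequent itemsets (sorted tuples) to support counts.
    """
    rules = []
    for itemset, sup in support.items():
        for m in range(1, len(itemset)):  # m = number of items in the antecedent
            for antecedent in combinations(itemset, m):
                # Confidence = support(itemset) / support(antecedent).
                conf = sup / support[antecedent]
                if conf >= minconf:
                    consequent = tuple(i for i in itemset if i not in antecedent)
                    rules.append((antecedent, consequent, sup, conf))
    return rules

# Hypothetical frequent itemsets with their support counts.
support = {("a",): 4, ("b",): 2, ("a", "b"): 2}
rules = gen_rules(support, minconf=0.8)
```

Here the rule b => a has confidence 2/2 = 1.0 and is kept, while a => b has confidence 2/4 = 0.5 and is dropped.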
We can also do rule generation without using table functions and base it purely on
SQL-92. The rules are generated in a level-wise manner where in each level k we generate
rules with consequents of size k. Further, we make use of the property that for any frequent
itemset, if a rule with consequent c holds, then, so do rules with consequents that are subsets
of c as suggested in Agrawal et al. [9]. We can use this property to generate rules in level
k using rules with (k-1)-long consequents found in the previous level, much like the way
we did candidate generation in Section 3.4.1.
The fraction of the total running time spent in rule generation is very small. Therefore,
we do not focus much on rule generation algorithms.


is a factor of 3 to 4 worse than Cache-Mine. We also compare these alternatives on the
basis of qualitative factors like automatic parallelization, development ease, portability and
interoperability.
We further analyze the SQL-92 approaches with the twin goals of studying how best
a DBMS without any object-relational extensions can execute these queries and identifying
ways of incorporating the semantics of mining into cost-based query optimizers. We develop
cost formulae for the mining queries based on the input data parameters and relational
operator costs. We also identify certain optimizations which improve the performance.
Next, we study generalized association rule and sequential pattern mining and develop
SQL formulations for them, thereby demonstrating that more complex mining operations
can be handled in the SQL framework.
We develop an incremental association rule mining algorithm which does not need
to examine the old data if the frequent itemsets do not change. Even otherwise, access
to the old database can be limited to just one scan. We categorize the various kinds of
constraints on the items that are useful in the context of interactive mining to facilitate goal-
oriented mining. We show how the incremental mining technique can be adapted to handle
constraints and certain kinds of constraint relaxation. We also show the applicability of
the incremental algorithm to other classes of data mining and decision support problems.
Finally, we identify certain primitive operators that are useful for a large class of data
mining and decision support applications. Supporting them natively in the DBMS could
enable these applications to run faster.


which items are considered part of the same sequence element. These time constraints are
specified by three parameters, max-gap, min-gap and window-size. We refer the reader to
Srikant and Agrawal [120] for a formal definition of the problem.
1.2.3 Classification
Classification is a well-studied problem [91, 129]. However, only recently has there been
focus on algorithms that can handle large datasets [17, 53, 85, 106, 114]. Classification is
the process of generating a description or a model for each class of a given dataset, called
the training set. Each record of the training set consists of several attributes which could
be continuous (coming from an ordered domain), or categorical (coming from an unordered
domain). One of the attributes, called the classifying attribute, indicates the class to which
each record belongs. Once a model is built from the given examples, it can be used to
determine the class of future unclassified records.
Several classification models based on neural networks, statistical models like linear/quadratic
discriminants, decision trees and genetic models have been proposed over the
years. Decision trees are particularly suited for data mining since they can be constructed
relatively fast compared to other methods and they are simple and easy to understand.
Moreover, trees can be easily converted into SQL statements that can be used to access
databases efficiently [5].
A decision tree is a class discriminator that recursively partitions the training set
until each partition consists entirely or dominantly of examples from one class. Each non-leaf
node of the tree contains a split point which is a test on one or more attributes and
determines how the data is partitioned. Figure 1.2 shows a sample decision-tree classifier
and the training set from which it is derived. Each record represents a credit card applicant


Figure 3.18. Number of transactions scale-up
20,000 for the database with an average transaction size of 50. We fixed the minimum support
level in terms of the number of transactions, since fixing it as a percentage would have
led to large increases in the number of frequent itemsets as the transaction size increased.
The numbers in the legend (for example, 1000) refer to this minimum support. The execution
times increase with the transaction size, but only gradually. The main reason for this
increase was that the number of item combinations present in a transaction increases with
the transaction size.


insert into Fk select item1, ..., itemk, count(*)
from Ck, (select t2.T_itm1, ..., t2.T_itmk from T,
          table (Gather(T.tid, T.item)) as t1,
          table (Comb-K(t1.tid, t1.item-list)) as t2)
where t2.T_itm1 = Ck.item1 and
      ...
      t2.T_itmk = Ck.itemk
group by Ck.item1, ..., Ck.itemk
having count(*) > :minsup
Figure 4.1. Support counting by GatherJoin
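The data flow of Figure 4.1 can be mimicked in Python. This is a rough in-memory analogue, not the DB2 table functions themselves: gather and comb_k are hypothetical stand-ins for Gather and Comb-K, the join with Ck is a set lookup, and the support test uses >= with an absolute count.

```python
from collections import Counter
from itertools import combinations, groupby

def gather(rows):
    """Collect the items of each transaction from rows in (tid, item) order."""
    for tid, group in groupby(sorted(rows), key=lambda r: r[0]):
        yield tid, [item for _, item in group]

def comb_k(item_list, k):
    """Output all k-item combinations of one transaction's item list."""
    return combinations(item_list, k)

def gather_join(rows, candidates, k, minsup):
    """Count support of candidate k-itemsets (a set of sorted tuples)."""
    counts = Counter()
    for _tid, items in gather(rows):
        for combo in comb_k(items, k):
            if combo in candidates:   # join with Ck
                counts[combo] += 1    # group by itemset and count(*)
    return {c: n for c, n in counts.items() if n >= minsup}

# Hypothetical sample data.
rows = [(1, "a"), (1, "b"), (1, "c"), (2, "a"), (2, "b"), (3, "a"), (3, "c")]
C2 = {("a", "b"), ("b", "c")}
F2 = gather_join(rows, C2, k=2, minsup=2)
```

Note that every k-item combination of every transaction is generated before the join, which is why this approach becomes costly when transactions contain many frequent items.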
table function Gather-Cnt for doing the support counting. Instead of outputting the 2-item
combinations of a transaction, it directly uses them to update support counts in memory
and outputs only the frequent 2-itemsets, F2, and their support after the last transaction.
Thus, the table function Gather-Cnt is an extension of the GatherComb-2 table function
used in GatherJoin.
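The second-pass counting idea can be sketched in Python with a two-dimensional array of counters. This is an assumption-laden illustration, not the Gather-Cnt code: items are taken to be integers 0..n_items-1, the support test uses >= with an absolute count, and gather_count_pass2 is a hypothetical name.

```python
from itertools import combinations, groupby

def gather_count_pass2(rows, n_items, minsup):
    """Count 2-itemsets directly in memory and return only the frequent ones.

    rows: (tid, item) pairs with items numbered 0..n_items-1.
    """
    counts = [[0] * n_items for _ in range(n_items)]
    for _tid, group in groupby(sorted(rows), key=lambda r: r[0]):
        items = [item for _, item in group]
        # Items are sorted within a transaction, so i < j for every pair.
        for i, j in combinations(items, 2):
            counts[i][j] += 1
    # Emit results only after the last transaction has been processed.
    return {(i, j): counts[i][j]
            for i in range(n_items) for j in range(i + 1, n_items)
            if counts[i][j] >= minsup}

rows = [(1, 0), (1, 1), (1, 2), (2, 0), (2, 1), (3, 0), (3, 2)]
F2 = gather_count_pass2(rows, n_items=3, minsup=2)
```

No intermediate combinations leave the function, so the outer join and grouping of GatherJoin are not needed.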
The absence of the outer grouping makes this option rather attractive. The UDF code
is also small since it only needs to maintain a 2D array. We could apply the same trick for
subsequent passes but the coding becomes considerably more complicated because of the


for support values higher than 0.3% and therefore we chose lower support values to study
the relative performance in higher numbered passes.
In some cases, the optimizer did not choose the best plan. For example, for joins
with T (Tj for Set-oriented Apriori), the optimizer chose a nested-loops plan using the (item, tid)
index on T in many cases where the corresponding sort-merge plan was faster; an order
of magnitude faster in some cases. We were able to experiment with different plans by
disabling certain join methods (disabling nested loops join for the above case). We also
broke down the multi-way joins in some cases into simpler two-way joins to study the
performance implications. The reported times correspond to the best join order and join
methods.
Figure 3.17 shows the CPU time and I/O time taken for the dataset T10.I4.D100K.
The two approaches show the same relative trend as the total time. However, it should be
noted that the I/O time is less than one third of the CPU time. This shows that there
is a need to revisit the traditional optimization and parallelization strategies designed to
optimize for I/O time, in order to handle the newer decision support and mining queries
efficiently.
3.9.1 Scale-up Experiment
We experimented with several synthetic datasets to study the scale-up behavior of Set-
oriented Apriori with respect to increasing number of transactions and increasing transaction
size. Figure 3.18 shows how Set-oriented Apriori scales up as the number of transactions is
increased from 10,000 to 1 million. We used the datasets T5.I2 and T10.I4 for the average
sizes of transactions and itemsets respectively. The minimum support level was kept at 1%.
The first graph shows the absolute execution times and the second one shows the times


5 GENERALIZED ASSOCIATION RULES 89
5.1 Input-Output Formats 90
5.2 Cumulate Algorithm 91
5.3 Pre-Computing Ancestors 91
5.4 Candidate Generation 92
5.5 Support Counting to Find Frequent Itemsets 93
5.6 Support Counting Using SQL-92 94
5.6.1 K-way Join 94
5.6.2 Subquery Optimization 96
5.7 Support Counting Using SQL-OR 96
5.7.1 GatherJoin 96
5.7.2 GatherExtend 97
5.7.3 Cost Analysis 98
5.7.4 Vertical 100
5.8 Performance Results 102
5.9 Summary 104
6 SEQUENTIAL PATTERNS 105
6.1 Input-Output Formats 105
6.1.1 Input Format 105
6.1.2 Output Format 106
6.2 GSP Algorithm 106
6.3 Candidate Generation 107
6.4 Support Counting to Find Frequent Sequences 109
6.5 Support Counting Using SQL-92 111
6.5.1 K-way Join 111
6.5.2 Subquery Optimization 112
6.6 Support Counting Using SQL-OR 112
6.6.1 Vertical 112
6.6.2 GatherJoin 117
6.7 Taxonomies 117
6.8 Summary 118
7 INCREMENTAL MINING 119
7.1 Incremental Updating of Frequent Itemsets 121
7.1.1 Computing NBd(F) from F 121
7.1.2 Addition of New Transactions 123
7.1.3 Deletion of Existing Transactions 127
7.2 Experimental Results 128
7.3 Comparison with FUP 129
7.4 Database Integration of Incremental Mining 131
7.4.1 SQL Formulations of Incremental Mining 131
7.4.2 Performance Results 135
7.4.3 New-Candidate Optimization 138
7.4.4 Other Approaches 140



ARCHITECTURES AND OPTIMIZATIONS FOR INTEGRATING DATA MINING ALGORITHMS WITH DATABASE SYSTEMS

By SHIBY THOMAS

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

To my parents, sisters and brothers

ACKNOWLEDGMENTS

I would like to express my sincere gratitude to Sharma for his encouragement and support throughout my dissertation and my stay with his group. We have had endless arguments and discussions about various things. Especially during the group meetings, this might have invited the wrath of several other students, because at times it would have even tested their patience. On the personal side, he and his family have been my good friends.

I am also greatly indebted to Rakesh Agrawal and Sunita Sarawagi of IBM Almaden Research Center for their help and for giving me an opportunity to work with their group. The several discussions I had with them while I was at IBM and later through email were very useful. The work that I did with them contributed to a good part of my dissertation. It was a nice experience working with the Quest data mining group at Almaden, and the enthusiasm and hard work of Sunita were contagious.

I am also thankful to Sanjay Ranka for his help and suggestions during my initial work on data mining. I thank Professors Eric Hanson, Sartaj Sahni, Stanley Su and Suleyman Tufekci for being on my committee and for their comments and suggestions.

I am grateful to many other people for helping me in several ways. In particular, I thank Raja, Sreenath, Mokhtar, Nabeel, Roger and all my friends at the database center for the chat sessions, lunch sessions and so on. Many thanks to Sharon Grant for her help with everything. She takes care of everything in the database center, and her tireless spirit and attention to even the minute details make things a whole lot easier. My sincere thanks

to S. Seshadri, my master's …

…SIONS
GatherJoin
Special Pass Optimization
Variations of GatherJoin Approach
Cost Analysis of GatherJoin and its Variants
Vertical
Special Pass Optimization
Cost Analysis
SQL-Bodied Functions
Performance Comparison
Final Hybrid Approach
Architecture Comparisons
Timing Comparison
Scale-up Experiment
Impact of Longer Names
Space Overhead of Different Approaches
Summary of Comparison Between Different Architectures
Other Associations Algorithms
Summary


Constrained Associations
Categories of Constraints
Constrained Association Mining
Incremental Constrained Association Mining
Constraint Relaxation
Applicability Beyond Association Mining
Mining Closed Sets
Query Flocks
View Maintenance
Summary
CONCLUSIONS
Proposed Extensions
Richer Set Operations
Enhanced Aggregation
Multiple Streams
Sampling
Contributions
Future Work
Closing
REFERENCES
BIOGRAPHICAL SKETCH

LIST OF FIGURES

Data warehousing architecture
Credit card classification example
Typical data warehouse usage
Taxonomy of architectural alternatives
Loose-coupling architecture
Cache-mine architecture
Stored-procedure architecture
UDF-based mining architecture
SQL architecture for mining in a DBMS
Architecture for mining in next-generation DBMSs
Apriori algorithm
Candidate generation for any k
Candidate generation for k = …
Rule generation
Support counting by K-way join
Support counting using subqueries
Comparison of different SQL approaches
K-way join plan with Ck as inner relation
K-way join plan with Ck as outer relation

…action)
Comparison of four architectures
Scale-up with increasing number of transactions
Scale-up with increasing transaction length
Comparison of different architectures on space requirements
Example of a taxonomy
Pre-computing ancestors
Generation of C…

Transaction extension subquery
Support counting by K-way join
Support counting by GatherJoin
Support counting by GatherExtend
Interior nodes' tid-list generation by union
Interior nodes' tid-list generation from T*

LIST OF TABLES

Description of different real-life datasets
Notations used in cost analysis
Description of synthetic datasets
Pros and cons of different architectural options ranked on a scale of 1 (good) to … (bad)
An example of the taxonomy table
Additional notations used for cost analysis

… Our evaluation of the different architectural alternatives shows that, from a performance perspective, the Cache-Mine option is superior, although the SQL-OR option comes a close second. Both the Cache-Mine and the SQL-OR approaches incur a higher storage penalty than the loose-coupling approach, which performance-wise …

CHAPTER 1
INTRODUCTION

The rapid growth in data warehousing technology, combined with the significant drop in storage prices, has made it possible to collect large volumes of data about customer transactions in retail stores, mail order companies, banks, stock markets, telecommunication companies and so on. For example, AT&T call records are about … gigabyte per hour [] and supermarket chains like Wal-Mart collect terabytes of data. In order to transform these huge amounts of data into business competitiveness and profits, it is extremely important to be able to mine nuggets of useful and understandable information from these data warehouses. In this chapter, we introduce data warehousing and data min…

…'s information systems environment is far from simple. The first problem is to discover how completeness and consistency can be defined. In the business context, this entails understanding the business

strategies and the data required to support and track their achievement. This process, called enterprise modeling, requires substantial involvement of business users and is tra…

…, typically known as data warehouse population, requires tools for extracting data from multiple operational databases and external sources; for cleaning, transforming and integrating these data; and for loading data into the data warehouse. In addition to the main warehouse, there may be several departmental data marts. It also requires tools for periodically refreshing the warehouse to maintain consistency and to purge data from the warehouse, perhaps onto slower archival storage.

…) for querying data warehouses. Rather than seek out known relationships, it sifts through data for unknown relationships. For instance, consider the transaction data in a mail-order company stored in the following relations:

sales(customer, widget, state, year)
booster(customer, widget, driver)
catalog(widget, manufacturer)

The sale information is stored in the sales table and the catalog table stores the widgets of different manufacturers. The booster table stores the driver that influences the particular

sale. In this database, OLAP finds answers to queries of the form "How many widgets were sold in the first quarter of … in California vs. Florida?" However, data mining attempts to answer queries like "What are the drivers that caused people to buy these widgets from our catalog?"

Fundamentally, data mining is statistical analysis and has been in practice for a long time. But until recently, statistical analysis was a time-consuming manual process, which limited the amount of data that could be analyzed, and the accuracy depended heavily on the personnel involved in the analysis. Today, with the advent of various sophisticated technologies, tools exist that automate the process, making data mining a practical solution for a wide range of companies. For example, Fingerhut's (a direct-mail catalog company) statistical analysis was limited to taking samples of … to … percent of its customers. With data mining, it can examine specific characteristics of each of the … to … million customers in a much more focused way [].

The initial efforts on data mining research were to cull together techniques from machine learning and statistics [] to define new mining operations and develop algorithms for them []. In the remainder of this section, we briefly introduce the various data mining problems.

Association Rule

Association rule, which captures co-occurrence of items or events, was introduced in the context of market basket data []. An example of such a rule might be that …% of transactions containing beer also contain diapers and …% of transactions contain both these items. Here …% is the support and …% is the confidence of the rule beer → diaper. Association rule mining is stated formally as follows []: Let I = {i1, i2, ..., im} be a set of literals, called items. Let D be a set of transactions, where each transaction T is

a set of items such that T ⊆ I. Each transaction has a unique identifier, called its tid. An association rule is an implication of the form X → Y, where X ⊂ I, Y ⊂ I and X ∩ Y = ∅. The rule X → Y holds in the transaction set D with confidence c if c% of transactions in D that contain X also contain Y. The rule X → Y has support s in the transaction set D if s% …

… of itemsets. The itemsets that are contained in the sequence are called elements of the sequence. For example, ⟨(computer, modem)(printer)⟩ is a sequence with two elements, (computer, modem) and (printer). The support of a sequential pattern is the percentage of data-sequences that contain the sequence. A sequential pattern can be further qualified by specifying maximum and/or minimum time gaps between adjacent elements and a sliding time window within

…) or categorical (coming from an unordered domain). One of the attributes, called the classifying attribute, indicates the class to which each record belongs. Once a model is built from the given examples, it can be used to determine the class of future unclassified records. Several classification models based on neural networks, statistical models like lin…

… leaf node of the tree contains a split point, which is a test on one or more attributes and determines how the data is partitioned. Figure … shows a sample decision-tree classifier and the training set from which it is derived. Each record represents a credit card applicant

…troids are refined to the mean of the data points in that cluster. This process is repeated several times until an acceptable convergence is reached. There are several research efforts reported in the data mining literature for clustering large databases [].

…, often in the form of a data warehouse, provides business institutions at its disposal a tool with immense implications. According to the vice president of Mellon Bank's advanced technology group, "Data mining is the carrot that justifies the expensive stick of building a data warehouse."

The majority of the warehouse stores (systems used for storing warehouse data) are relational databases or their variants. The advantages of using database systems are numerous: SQL was invented for direct query of data, most client-server connectivity is supplied by relational vendors, most replication systems have been designed with relational sources and targets, and most of the relational vendors are delivering parallel database solutions. There are important alternatives in this segment, however. The OLAP multidimensional engines offer unique performance characteristics across their problem domain. We might also see traditional file stores providing significantly better performance for some data mining

operations. Differentiators for the relational engines will be overall scalability, availability across a broad spectrum of hardware platforms, "affinity" …

…house integration of mining operations. We first study the various architectural alternatives for coupling data mining with relational database systems, primarily from a performance perspective. We develop various SQL formulations for association rules [], a representative mining problem, and analyze how competitive the SQL implementation can be compared to other specialized implementations of association rule mining. We further focus on the

Figure …. Typical data warehouse usage

analysis of various execution plans chosen by relational database systems for executing some of the SQL-based mining queries. We expect that this study will reveal the domain-specific semantic information of the mining algorithms that needs to be integrated into next gen…

…tally. In order to improve the reliability and usefulness of the discovered information, large volumes of data need to be collected and analyzed over a period of time. A naive approach to update the mined information when new data are added or part of the current data are deleted is to recompute them from scratch. However, it would be ideal to develop an incremental algorithm so that the computation effort spent on the original data can be

effectively utilized. We develop an incremental algorithm and its SQL formulations for up…

• Survey the various data mining problems and algorithms.
• Study the different database integration alternatives for data mining.
• Develop and implement SQL formulations of mining algorithms.
• Analyze the performance profile of current DBMSs to execute the above SQL queries.
• Explore the enhancements to current cost-based optimizers to incorporate the domain-specific semantics of mining.
• Develop an incremental association rule mining algorithm and its SQL formulations.
• Generalize the incremental algorithm for mining constrained associations and show its applicability to other data mining and decision support problems, and
• Explore primitive operators for mining in databases.

This work has a significant impact on the state of the art in data mining system architectures and comes at the appropriate time, when the data mining community is looking for

answers to "How to mine data warehouses?" Given the amount of data involved in mining, its potential impact on various business sectors and the fact that OLAP is finding its way into commercial database systems (for example, the cube operator) …

…tions for improving the performance are detailed in Chapter 3. In Chapter 4, we present the use of object-relational extensions to SQL for improving the performance of SQL-based association rule mining and the performance comparison of the various architectural alternatives. SQL-based mining of generalized association rules and sequential patterns is described in Chapters 5 and 6 respectively. Chapter 7 presents the incremental association rule mining algorithm, performance comparison, SQL formulation and generalizations for mining constrained associations. We conclude in Chapter 8 with a discussion of the proposed database operators and avenues for further research.

CHAPTER 2
FROM FILE MINING TO DATABASE MINING

The "first-generation" KDD systems offer isolated discovery features using tree induc…

…) for application programming made database applications much easier to develop and manage. Data mining has to undergo a similar transition from the current "file mining" to data warehouse mining and a richer set of APIs for developing business intelligence and decision support applications.

In the remainder of this chapter, we survey some of the prior research related to the database integration of mining in Section …. The various architectural alternatives are discussed in Section ….

Related Work

Researchers have started to focus on various issues related to integrating mining with databases []. The research on database integration of mining can be broadly classified into two categories: one which proposes new mining operators and the other which

leverages the query processing capabilities of current relational DBMSs. In the former category, there have been language proposals to extend SQL with specialized mining operators. A few examples are (i) the query language DMQL proposed by Han et al. [], which extends SQL with a collection of operators for mining characteristic rules, discriminant rules, classification rules, association rules and so on, (ii) the M-SQL language of Imielinski and Virmani [], which extends SQL with a special unified operator Mine to generate and query a whole set of propositional rules, and (iii) the mine rule operator proposed by Meo et al. [] for a generalized version of the association rule discovery problem. However, they do not address the processing techniques for these operators inside a database engine and the interaction of the standard relational operators and the proposed extensions. It is also important to break these operators to a finer level of granularity in order to identify commonalities between them and derive a set of primitive operators that should be supported natively in a database engine.

In the second category, researchers have addressed the issue of exploiting the capabilities of conventional relational systems and their object-relational extensions to execute mining operations. This entails transforming the mining operations into database queries and, in some cases, developing newer techniques that are more appropriate in the database context. The proposal of Agrawal and Shim [] for tightly coupling a mining algorithm with a relational database system makes use of user-defined functions (UDFs) in SQL statements to selectively push parts of the application that perform computations on data records into the database system. The objective was to avoid one-at-a-time record retrieval from the database to the application address space, saving both the copying and process context switching costs. In the KESO project [], the focus is on developing a data mining system which interacts with standard DBMSs. The interaction with the database

is restricted to two-way table queries, a special kind of aggregate query. Two-way tables, which are used in the mining process, have sets of source and target attributes and an associated count. Association rule mining was formulated as SQL queries in the SETM algorithm []. However, it does not use the subset property (all subsets of a frequent itemset are frequent) for candidate generation. As a result, SETM counts a large number of candidate itemsets in the support counting phase and hence is not efficient []. Query flocks generalizes boolean association rules to mine associations across relational tables. A query flock is a parameterized query with a filter condition to eliminate the values of parameters that are "uninteresting" …

… into the DBMS. This method has all the advantages of the stored procedure approach (described below) …

… the server) []. In this case, the entire mining algorithm is encapsulated as a collection of user-defined functions (UDFs) [] that are appropriately placed in SQL data scan queries. The architecture is represented in Figure …. Most of the processing happens in the UDF and the DBMS is used simply to provide tuples to these UDFs. Little use is made of the query processing capabilities of the DBMS. The UDFs can be run in either fenced (different address space) or unfenced (same address space) mode. The main attraction of this method is performance since, when run in the unfenced mode, individual tuples never have to cross the DBMS boundary. Otherwise, the processing happens in almost the same manner as in the stored procedure case. The main disadvantage is the development cost [], since the entire mining algorithm has to be written as UDFs, involving significant code rewrites. Further, these are "heavy-weight" UDFs which involve significant processing and memory management.

Figure …. UDF-based mining architecture

In order to provide a query interface or application programming interface to the discovered rules, they can be passed through a post-processing step. The rule discovery itself could be done by any of the above alternatives.

SQL-Based Approach

This is the integration architecture explored in this dissertation. In this approach, the mining algorithm is formulated as SQL queries which are executed by the DBMS query

processor. We develop several SQL formulations for a few representative mining operations in order to better understand the performance profile of current database query processors in executing these queries. We believe that it will enable us to identify what portions of these mining operations can be pushed down to the query processing engine of a DBMS.

There are also several potential advantages of a SQL implementation. One can profitably make use of the database indexing and query processing capabilities, thereby leveraging more than two decades of development effort spent in making these systems robust, portable, scalable and highly concurrent. Rather than devising specialized parallelizations, one can potentially exploit the underlying SQL parallelization, particularly in an SMP environment. The current approach to parallelizing mining algorithms is to develop specialized parallelizations for each of the algorithms []. The DBMS support for checkpointing and space management can be especially valuable for long-running mining algorithms on huge volumes of data. The development of new algorithms could be significantly faster if expressed declaratively using a few SQL operations. This approach is also extremely portable across DBMSs, since porting becomes trivial if the SQL approaches use only the standard SQL features.

Figure …. SQL architecture for mining in a DBMS

The architecture we have in mind is schematically shown in Figure …. We visualize that the desired mining operation will be expressed in some extension of SQL or a graphical language. A preprocessor will generate the appropriate SQL translation for this operation.

…natives listed here. Our primary focus is on the performance of various architectural al…

…'s goal is to get information from the data store. He/she should not have to make the distinction as to whether it is the result of querying/OLAP/mining. This entails unbundling the bulky mining operations and identifying common operator primitives with which the mining operations can be composed. We cannot expect to have

a specialized operator for every mining task. It also needs a language in which the required operations can be specified. In order to realize this goal, it requires a tremendous amount of research in various aspects like designing language extensions and better query processing and optimization strategies. However, we envision that the query processing engine will eventually be extended with primitive mining operators. When that is accomplished, a mining system architecture will resemble the one shown in Figure ….

[Figure: Extended SQL; SQL; (Object-)Relational DBMS]

CHAPTER 3
ASSOCIATION RULES

In this chapter, we discuss the various SQL-92 (SQL with no object-relational extensions) formulations of association rule mining. We start with a review of the Apriori algorithm for association rule mining in Section …. A few other algorithms for mining association rules are briefly outlined in Section …. The input-output data formats are described in Section … and in Section … we introduce SQL-based association rule mining. The various SQL-92 formulations are presented in Section …. We present experimental results showing the performance of these formulations on some real-life datasets in Section …. In Section …, we develop cost formulae for the cost of executing the above SQL queries on a query processor based on the input data parameters and relational operator costs. A few performance optimizations to the basic SQL-92 approaches and the corresponding performance gains are presented in Section …. Section … quantifies the overall performance improvements of the optimizations with experiments on synthetic datasets.

The association rule mining problem outlined in Section … can be decomposed into two subproblems []:

• Find all combinations of items whose support is greater than minimum support. Call those combinations frequent itemsets.
• Use the frequent itemsets to generate the desired rules. The idea is that if, say, ABCD and AB are frequent itemsets, then we can determine if the rule AB → CD holds by

computing the ratio r = support(ABCD)/support(AB). …

F1 = {frequent 1-itemsets}
for (k = 2; F_{k-1} ≠ ∅; k++) do
    Ck = apriori-gen(F_{k-1});    // generate new candidates
    forall transactions t ∈ D do
        Ct = subset(Ck, t);       // find all candidates contained in t
        forall candidates c ∈ Ct do
            c.count++;
        done
    done
    Fk = {c ∈ Ck | c.count ≥ minsup}
done
Answer = ∪k Fk

Figure …. Apriori algorithm
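The level-wise loop of the Apriori algorithm can be written compactly in Python. This is a minimal in-memory sketch for illustration only: the function name apriori, the use of frozensets, the naive subset test over every candidate, and minsup given as an absolute count with a "greater than or equal" test are assumptions of this example, not details taken from the dissertation.

```python
from itertools import combinations


def apriori(transactions, minsup):
    """Return {itemset: support count} for all itemsets with count >= minsup."""
    transactions = [frozenset(t) for t in transactions]

    # Pass 1: count individual items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: n for s, n in counts.items() if n >= minsup}
    answer = dict(frequent)

    k = 2
    while frequent:
        prev = set(frequent)
        # Candidate generation: unions of two frequent (k-1)-itemsets that have
        # k items and whose (k-1)-subsets are all frequent.
        candidates = set()
        for a in prev:
            for b in prev:
                union = a | b
                if len(union) == k and all(
                    frozenset(s) in prev for s in combinations(union, k - 1)
                ):
                    candidates.add(union)
        # Support counting: one scan of the data per pass.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent = {c: n for c, n in counts.items() if n >= minsup}
        answer.update(frequent)
        k += 1
    return answer


# Hypothetical toy data.
result = apriori([{1, 2, 3}, {1, 2}, {1, 3}, {1, 2, 3}], minsup=2)
print(len(result))
```

The loop ends as soon as a pass produces no frequent itemsets, which mirrors the termination condition of the algorithm's for loop.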

The basic Apriori algorithm shown in Figure … makes multiple passes over the data. In the kth pass it finds all itemsets with k items having the minimum support, called the frequent k-itemsets. Each pass consists of two phases. Let Fk represent the set of frequent k-itemsets and Ck the set of candidate k-itemsets (potentially frequent itemsets). First is the candidate generation phase, where the set of all frequent (k−1)-itemsets, F_{k-1}, found in pass (k − 1) …

… kept sorted, this operation can be done by performing a merge-scan of the tid-lists of all the items in the itemset.

Input-Output Formats

Input format: The transaction table T normally has two column attributes: transaction identifier (tid) and item identifier (item). For a given tid, typically there are multiple rows in the transaction table corresponding to different items that belong to the same trans…

…) where k is the size of the largest frequent itemset. The len attribute gives the length of the rule (number of items in the rule) and the rulem attribute gives the position of the → in the rule. For instance, if k = …, the rule AB → CD which has …% confidence and …% support is represented by

the tuple (A, B, C, D, NULL, …).

… (k−1)-itemsets, the Apriori candidate generation procedure [] returns a superset of the set of all frequent k-itemsets. We assume that the items in an itemset are lexicographically ordered. Since all subsets of a frequent itemset are also frequent, we can generate Ck from F_{k-1} as follows.

First, in the join step, we generate a superset of the candidate itemsets Ck by joining F_{k-1} with itself, as shown below.

insert into Ck select I1.item1, ..., I1.item_{k-1}, I2.item_{k-1}
from F_{k-1} I1, F_{k-1} I2
where I1.item1 = I2.item1 and
      ...
      I1.item_{k-2} = I2.item_{k-2} and
      I1.item_{k-1} < I2.item_{k-1}

For example, let F3 be {{1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2 3 4}}. After the join step, C4 will be {{1 2 3 4}, {1 3 4 5}}. Next, in the prune step, all itemsets c ∈ Ck where some (k−1)-subset of c is not in F_{k-1} are deleted. Continuing with the example above, the prune step will delete the itemset {1 3 4 5} because the subset {1 4 5} is not in F3. We will then be left with only {1 2 3 4} in C4.

We can perform the prune step in the same SQL statement as the join step above by writing it as a k-way join, as shown in Figure …. A k-way join is used since for any k-itemset there are k subsets of length (k − 1) for which we need to check in F_{k-1} for membership. The join predicates on I1 and I2 remain the same. After the join between I1 and I2 we get a k-itemset consisting of (I1.item1, ..., I1.item_{k-1}, I2.item_{k-1}) as shown above. For this itemset, two of its (k−1)-length subsets are already known to be frequent since it was generated from two itemsets in F_{k-1}. We check the remaining k − 2 subsets using additional joins. The predicates for these joins are enumerated by skipping one item at a time from the k-itemset as follows: We first skip item1 and check if the subset (I1.item2, ..., I1.item_{k-1}, I2.item_{k-1}) belongs to F_{k-1}, as shown by the join with I3 in Figure …. In general, for a join with Ir (3 ≤ r ≤ k), we skip item r − 2, which gives us join predicates of the form

I1.item1 = Ir.item1 and
...
I1.item_{r-3} = Ir.item_{r-3} and
I1.item_{r-1} = Ir.item_{r-2} and
...
I1.item_{k-1} = Ir.item_{k-2} and
I2.item_{k-1} = Ir.item_{k-1}

Figure … gives an example for k = ….

We construct a primary index on (item1, ..., item_{k-1}) of F_{k-1} to efficiently process these k-way joins using index probes. Note that sometimes it may not be necessary to materialize Ck before the counting phase. Instead, the candidate generation can be pipelined with the subsequent SQL queries used for support counting.

Figure …. Candidate generation for any k

Figure …. Candidate generation for k = …
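The join and prune steps of Apriori candidate generation can be checked on a small example with the sketch below. It is an in-memory illustration, not the SQL formulation: the function name apriori_gen and the representation of itemsets as sorted tuples are assumptions of this example, and the input F3 is the standard textbook example from the Apriori literature.

```python
from itertools import combinations


def apriori_gen(f_prev):
    """Return (joined, pruned) candidate k-itemsets from frequent (k-1)-itemsets.

    f_prev is an iterable of sorted tuples, all of the same length k-1.
    """
    prev = set(f_prev)
    if not prev:
        return [], []
    k = len(next(iter(prev))) + 1

    # Join step: pair itemsets that agree on the first k-2 items and whose
    # last items are in increasing order.
    joined = [i1 + (i2[-1],)
              for i1 in prev for i2 in prev
              if i1[:-1] == i2[:-1] and i1[-1] < i2[-1]]

    # Prune step: drop a candidate if any (k-1)-subset is not frequent.
    pruned = [c for c in joined
              if all(s in prev for s in combinations(c, k - 1))]
    return sorted(joined), sorted(pruned)


F3 = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)]
joined, C4 = apriori_gen(F3)
print(joined, C4)
```

In the SQL version the prune step becomes k − 2 extra joins against F_{k-1}; here the same membership checks are performed with set lookups, which play the role of the index probes.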

Counting Support to Find Frequent Itemsets

This is the most time-consuming part of the association rule mining algorithm. We use the candidate itemsets Ck and the data table T to count the support of the itemsets in Ck. We consider two different categories of SQL implementations.

(A) The first one is based purely on SQL-92. We discuss four approaches in this category in Section ….
(B) The second utilizes the new SQL object-relational extensions like UDFs, BLOBs (binary large objects) …

… → (l − m)) and output the rule if it is at least minconf.

In the support counting phase, the frequent itemsets of size k are stored in table Fk. Before the rule generation phase, we merge all the frequent itemsets into a single table F.

The schema of F consists of k + 2 attributes (item1, ..., itemk, support, len), where k is the size of the largest frequent itemset and len is the length of the itemset, as discussed earlier in Section ….

We use the table function GenRules to generate all possible rules from a frequent itemset. The input argument to the function is a frequent itemset. For each itemset, it outputs tuples corresponding to rules with all non-empty proper subsets of the itemset in the consequent. The table function outputs tuples with k + 3 attributes: T_item1, ..., T_itemk, T_support, T_len, T_rulem. The output is joined with F to find the support of the antecedent, and the confidence of the rule is calculated by taking the ratio of the support values. The predicates in the where clause match the antecedent of the rule with the frequent itemset corresponding to the antecedent. While checking for this match, we need to check only up to item_k' where k' = T_rulem. The or part (t1.T_rulem < k) …

… long consequents found in the previous level, much like the way we did candidate generation in Section ….

The fraction of the total running time spent in rule generation is very small. Therefore, we do not focus much on rule generation algorithms.
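Rule generation from the merged table of frequent itemsets can be sketched in a few lines. This is an illustration under stated assumptions, not the GenRules table function itself: the function name gen_rules, the dictionary keyed by sorted item tuples, and the "at least minconf" test are choices made for this example. It relies on the subset property, so every antecedent is itself a key of the dictionary.

```python
from itertools import combinations


def gen_rules(frequent, minconf):
    """Return (antecedent, consequent, support, confidence) for confident rules.

    frequent maps each frequent itemset (a sorted tuple) to its support count.
    """
    rules = []
    for itemset, support in frequent.items():
        # Every non-empty proper subset is tried as the antecedent.
        for size in range(1, len(itemset)):
            for antecedent in combinations(itemset, size):
                consequent = tuple(i for i in itemset if i not in antecedent)
                confidence = support / frequent[antecedent]
                if confidence >= minconf:
                    rules.append((antecedent, consequent, support, confidence))
    return rules


# Hypothetical support counts over four transactions.
frequent = {("a",): 4, ("b",): 3, ("a", "b"): 3}
rules = gen_rules(frequent, minconf=0.8)
print(rules)
```

The lookup frequent[antecedent] corresponds to the join of the table function's output with F, and the division corresponds to the ratio of the two support values.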

PAGE 46

    insert into R
    select T_item1, ..., T_itemk, t1.T_support, T_len, T_rulem,
           t1.T_support / f2.support
    from F f1,
         table(GenRules(f1.item1, ..., f1.itemk, f1.len, f1.support)) as t1,
         F f2
    where (t1.T_item1 = f2.item1 or t1.T_rulem < 1) and
          ...
          (t1.T_itemk = f2.itemk or t1.T_rulem < k)
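The rule generation step can be illustrated outside the database. The following is a minimal Python sketch, not the GenRules table function itself: it assumes the frequent itemsets are held in a dictionary keyed by itemset (a hypothetical in-memory stand-in for table F) and it enumerates antecedents rather than consequents, which yields the same set of rules. The function name `gen_rules` and the toy data are illustrative assumptions.

```python
from itertools import combinations

def gen_rules(frequent, minconf):
    """frequent maps frozenset(itemset) -> support count (stand-in for table F).
    Returns (antecedent, consequent, support, confidence) for every rule whose
    confidence is at least minconf."""
    rules = []
    for itemset, supp in frequent.items():
        if len(itemset) < 2:
            continue  # a rule needs a non-empty antecedent and consequent
        for r in range(1, len(itemset)):
            for ante in combinations(sorted(itemset), r):
                ante = frozenset(ante)
                # Every subset of a frequent itemset is frequent, so the lookup succeeds.
                conf = supp / frequent[ante]
                if conf >= minconf:
                    rules.append((tuple(sorted(ante)),
                                  tuple(sorted(itemset - ante)),
                                  supp, conf))
    return rules

# Toy data (assumed): support counts of the frequent itemsets.
frequent = {frozenset("a"): 4, frozenset("b"): 3, frozenset("ab"): 3}
rules = gen_rules(frequent, 0.8)
```

Here the confidence of a rule is the support of the whole itemset divided by the support of its antecedent, which is the ratio the join with F computes in the query above.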

PAGE 47

    insert into F_k
    select item1, ..., itemk, count(*)
    from C_k, T t1, ..., T tk
    where t1.item = C_k.item1 and
          ...
          tk.item = C_k.itemk and
          t1.tid = t2.tid and
          ...
          t_{k-1}.tid = tk.tid
    group by item1, item2, ..., itemk
    having count(*) > :minsup

[Figure: Support counting by K-way join (tree diagram: k copies of T joined with C_k, then group by item1, ..., itemk having count(*) > minsup)]

This SQL computation, when merged with the candidate generation step, is similar to the one proposed in Tsur et al. [...] as a possible mechanism to implement query flocks. In Section [...], we discuss the different execution plans for this query and the related performance issues.

Three-way Join

The above approach requires (k+1)-way joins in the kth pass. We can reduce the cardinality of joins to 3 using the following approach, which bears some resemblance to

PAGE 48

the AprioriTid algorithm in Agrawal et al. [...]. Each candidate itemset C_k, in addition to attributes (item1, ..., itemk), has three new attributes (oid, id1, id2). oid is a unique identifier associated with each itemset, and id1 and id2 are oids of the two itemsets in F_{k-1} from which the itemset in C_k was generated (as discussed in Section [...]). In addition, in the kth pass we generate a new copy of the data table, T_k, with attributes (tid, oid) that keeps for each tid the oid of each itemset in C_k that it supported. For support counting, we first generate T_k from T_{k-1} and C_k and then do a group-by on T_k to find F_k as follows:

    insert into T_k
    select t1.tid, oid
    from C_k, T_{k-1} t1, T_{k-1} t2
    where t1.oid = C_k.id1 and t2.oid = C_k.id2 and t1.tid = t2.tid

    insert into F_k
    select oid, item1, ..., itemk, cnt
    from C_k,
         (select oid as cid, count(*) as cnt
          from T_k
          group by oid
          having count(*) > :minsup) as temp
    where C_k.oid = cid

Two Group-by

Another way to avoid multi-way joins is to first join T and C_k based on whether the "item" of a (tid, item) pair of T is equal to any of the k items of C_k, then do a group by on (item1, ..., itemk, tid), filtering tuples with count equal to k. This gives all (itemset, tid) pairs such that the tid supports the itemset. Finally, as in the previous approaches, do a group-by on the itemset (item1, ..., itemk), filtering tuples that meet the support condition.

    insert into F_k
    select item1, ..., itemk, count(*)

PAGE 49

    from (select item1, ..., itemk, count(*)
          from T, C_k
          where item = C_k.item1 or
                ...
                item = C_k.itemk
          group by item1, ..., itemk, tid
          having count(*) = k) as temp
    group by item1, ..., itemk
    having count(*) > :minsup

Subquery-Based

This approach makes use of common prefixes between the itemsets in C_k to reduce the amount of work done during support counting. We break up the support counting phase into a cascade of k subqueries. The lth subquery Q_l finds all tids that match the distinct itemsets formed by the first l columns of C_k (call it d_l). The output of Q_l is joined with T and d_{l+1} (the distinct itemsets formed by the first l+1 columns of C_k)

PAGE 50

    insert into F_k
    select item1, ..., itemk, count(*)
    from (Subquery Q_k) t
    group by item1, item2, ..., itemk
    having count(*) > :minsup

Subquery Q_l (for any l between 1 and k):

    select item1, ..., iteml, tid
    from T tl,
         (Subquery Q_{l-1}) as r_{l-1},
         (select distinct item1, ..., iteml from C_k) as d_l
    where r_{l-1}.item1 = d_l.item1 and
          ... and
          r_{l-1}.item_{l-1} = d_l.item_{l-1} and
          r_{l-1}.tid = tl.tid and
          tl.item = d_l.iteml

Subquery Q_0: No subquery Q_0.

[Tree diagram for Subquery Q_l: Subquery Q_{l-1} joined with T tl and with the result of "select distinct item1, ..., iteml from C_k"]

Figure [...]: Support counting using subqueries
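To make the pure-SQL formulations concrete, the sketch below runs the K-way join and the Two Group-by queries for k = 2 on a toy table using Python's built-in sqlite3 module. The table contents, the value of minsup and the use of SQLite are illustrative assumptions; the strict comparison count(*) > minsup follows the queries above.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
create table T  (tid integer, item text);
create table C2 (item1 text, item2 text);
insert into T values (1,'a'),(1,'b'),(1,'c'),(2,'a'),(2,'b'),(3,'a'),(3,'c'),(4,'b');
insert into C2 values ('a','b'),('a','c'),('b','c');
""")
minsup = 1  # assumed threshold: keep itemsets with count(*) > minsup

# K-way join: one copy of T per item of the candidate.
kway = con.execute("""
    select item1, item2, count(*)
    from C2, T t1, T t2
    where t1.item = C2.item1 and t2.item = C2.item2 and t1.tid = t2.tid
    group by item1, item2
    having count(*) > ?
    order by item1, item2
""", (minsup,)).fetchall()

# Two Group-by: the inner group-by keeps (itemset, tid) pairs in which
# all k = 2 items matched; the outer group-by counts the supporting tids.
two_groupby = con.execute("""
    select item1, item2, count(*)
    from (select item1, item2, tid
          from T, C2
          where item = C2.item1 or item = C2.item2
          group by item1, item2, tid
          having count(*) = 2) as temp
    group by item1, item2
    having count(*) > ?
    order by item1, item2
""", (minsup,)).fetchall()
```

Both formulations return the same frequent 2-itemsets on this data; they differ only in the shape of the plan the database has to execute.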

PAGE 51

Table [...]: Description of different real-life datasets
Columns: Datasets; Records (in millions); Transactions (in millions); Items (in thousands); Avg. items
Rows: Dataset-A, Dataset-B, Dataset-C, Dataset-D [...]

We selected a collection of four real-life datasets obtained from various mail-order companies and retail stores for the experiments. These datasets have differing values of parameters like the total number of (tid, item) pairs, the number of transactions (tids), the number of items and the average number of items per transaction. Table [...] summarizes these parameters. In this dissertation, we report the performance with only Dataset-A.

The overall observation was that mining implementations in pure SQL are too slow to be practical. For these experiments, we built a composite index (item1, ..., itemk) on C_k, k different indices on each of the k items of C_k, and a (tid, item) and an (item, tid) index on the data table. The goal was to let the optimizer choose the best plan possible. We do not include the index building cost in the total time.

In Figure [...], we show the total time taken by the four approaches: KwayJoin, 3wayJoin, Subquery and 2GroupBy. For comparison, we also show the time taken by the Loose-coupling approach because this is the approach currently used by existing systems. The graph shows the total time split into candidate generation time (Cgen) and the time for each pass. The candidate generation time and the time for the first pass are much smaller compared to the total time. From this set of experiments we can make the following observations:

PAGE 52

[Figure [...]: Comparison of different SQL approaches (Dataset A; bars split into Cgen and per-pass times; x-axis: Support)]

• The best approach in the SQL category is the Subquery approach. An important reason for its performing better than the other approaches is exploitation of common prefixes between candidate itemsets. None of the other three approaches uses this optimization. Although the Subquery approach is comparable to the Loose-coupling approach in some cases, for other cases it did not complete even after taking ten times more time than the Loose-coupling approach.

PAGE 53

• The 3wayJoin approach is comparable to the KwayJoin approach for this dataset because the number of passes is at most three. As shown in Agrawal et al. [...], there might be other datasets, especially ones where there is significant reduction in the size of T_k as k increases, where 3wayJoin might perform better than KwayJoin. However, one disadvantage of the 3wayJoin approach is that it requires space to store and log the temporary relations T_k.

[...] (one with C_k as the outermost relation, in Section [...], and another with C_k as the innermost relation, in Section [...]) and the cost analysis for them. The effect of subquery optimization on the cost estimates is outlined in Section [...]. Schematic diagrams of the two different execution plans are given in Figures [...] and [...].

In the cost analysis, we use the mining-specific data parameters and knowledge about association rule mining (the Apriori algorithm [...] in this case) to estimate the cost of joins and the size of join results. Even though current relational optimizers do not use this mining-specific semantic information, the analysis provides a basis for developing "mining-aware"

PAGE 54

optimizers. The cost formulae are presented in terms of operator costs in order to make them general; for instance, join(p, q, r)

[Figure [...]: K-way join plan with C_k as inner relation (group by item1, ..., itemk having count(*) > minsup)]

[Figure [...]: K-way join plan with C_k as outer relation (group by item1, ..., itemk having count(*) > minsup)]

PAGE 55

Table [...]: Notations used in cost analysis

    R              number of records in the input transaction table
    T              number of transactions
    N              average number of items per transaction (= R / T)
    F_1            number of frequent items
    S(C)           sum of support of each itemset in set C
    s_k            average support of a frequent k-itemset (= S(F_k) / F_k)
    R_f            number of records out of R involving frequent items (= S(F_1))
    N_f            average number of frequent items per transaction (= R_f / T)
    C_k            number of candidate k-itemsets
    C(n, k)        number of combinations of size k possible out of a set of size n (= n! / (k! (n-k)!))
    t_k            cost of generating a k-item combination using table function Comb-k
    group(n, m)    cost of grouping n records out of which m are distinct
    join(p, q, r)  cost of joining two relations of size p and q to get a result of size r
    blob(n)        cost of passing a BLOB of size n integers as an argument

[...] availability of indices, the size of intermediate results, and the amount of available memory. For instance, the efficient execution of nested loops joins requires an index (item, tid) on T. If the intermediate join result is large, it could be advantageous to materialize it and perform a sort-merge join.

For each candidate itemset in C_k, the join with T produces as many records as the support of its first item. Therefore, the size of the join result can be estimated to be the product of the number of candidates and the average support of a frequent item, and hence the cost of this join is given by join(C_k, R, C_k * s_1). Similarly, the relation obtained after joining C_k with l copies of T contains as many records as the sum of the support counts of the l-item prefixes of the candidate itemsets. Hence the cost of the lth join is join(C_k * s_{l-1}, R, C_k * s_l), where s_0 = 1. Note that values of the s_j's can be computed from statistics collected in the previous passes. The cost of the last join (with T_k) cannot be estimated using the above formula since the k-item prefix of a k-candidate is not frequent

PAGE 56

and we do not know the value of s_k. However, the last join produces S(C_k) records (there will be as many records for each candidate as its support) and therefore the cost is join(C_k * s_{k-1}, R, S(C_k)). S(C_k) can be estimated by adding the support estimates of all the itemsets in C_k. A good estimate for the support of a candidate itemset is the minimum of the support counts of all its subsets. The overall cost of this plan, expressed in terms of operator costs, is:

    \sum_{l=1}^{k-1} { join(C_k * s_{l-1}, R, C_k * s_l) } + join(C_k * s_{k-1}, R, S(C_k)) + group(S(C_k), C_k)

K-Way Join Plan with C_k as Inner Relation

In this plan, we join the k copies of T, and the resulting k-item combinations are joined with C_k to filter out non-candidate item combinations. The final join result is grouped on the k items.

The result of joining l copies of T is the set of all possible l-item combinations of transactions, and there are C(N, l) * T such combinations. We know that the items in the candidate itemset are lexicographically ordered, and hence we can add extra join predicates as shown in Figure [...] to limit the join result to item combinations (without these extra predicates the join will result in item permutations). When C_k is the outermost relation these predicates are not required. A mining-aware optimizer should be able to rewrite the query appropriately. The lth join produces (l+1)-item combinations and therefore its cost is join(C(N, l) * T, R, C(N, l+1) * T). The last join produces S(C_k) records as in the previous case, and hence its cost is join(C(N, k) * T, C_k, S(C_k)). The overall cost of this

PAGE 57

plan is:

    \sum_{l=1}^{k-1} { join(C(N, l) * T, R, C(N, l+1) * T) } + join(C(N, k) * T, C_k, S(C_k)) + group(S(C_k), C_k)

Note that in the above expression C(N, 1) * T = R.

Effect of Subquery Optimization

The subquery optimization makes use of common prefixes among candidate itemsets. Unfolding all the subqueries will result in a query tree which structurally resembles the KwayJoin plan tree shown in Figure [...]. Subquery Q_l produces d_l^k * s_l records, where d_j^k denotes the number of distinct j-item prefixes of itemsets in C_k. In contrast, the lth join in the KwayJoin plan results in C_k * s_l records. Typically d_l^k is much smaller compared to C_k, which explains why the Subquery approach performs better than the KwayJoin approach. The output of subquery Q_k contains S(C_k) records. The total cost of this approach can be estimated to be:

    \sum_{l=1}^{k} { trijoin(R, s_{l-1} * d_{l-1}^k, d_l^k, s_l * d_l^k) } + group(S(C_k), C_k)

where trijoin(p, q, r, s) denotes the cost of joining three relations of size p, q, r respectively, producing a result of size s. The value of s_k, which is the average support of a frequent k-itemset, can be estimated as mentioned in Section [...].

The experimental results presented in Section [...] and in Sarawagi et al. [...] show that the subquery optimization gave much better performance than the basic KwayJoin approach (an order of magnitude better in some cases). We observed the same trend in our additional experiments using synthetic datasets. We used synthetic datasets for some of the

PAGE 58

[...]% support. Note that d_k^k is not shown since it is the same as C_k. In pass [...], C_[...] contains [...] itemsets whereas d_[...] has only [...] [...]-item prefixes (almost a factor of [...] less than C_[...]).
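The prefix counts d_j^k used in the Subquery cost estimate are easy to compute for a given candidate set. The helper below is a small Python sketch; the function name `prefix_counts` and the toy candidate set are assumptions made for illustration only.

```python
def prefix_counts(candidates):
    """Return [d_1, ..., d_k], where d_j is the number of distinct j-item
    prefixes among the (lexicographically ordered) k-item candidates."""
    k = len(candidates[0])
    return [len({c[:j] for c in candidates}) for j in range(1, k + 1)]

# Toy candidate set C_3 (assumed data).
C3 = [("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"), ("b", "c", "d")]
d = prefix_counts(C3)  # d[j-1] holds d_j; the last entry equals the size of C_3
```

The smaller the early entries are relative to the number of candidates, the more join work the Subquery approach shares between candidates.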

PAGE 59

Table [...]: Description of synthetic datasets
Columns: Datasets; Records; Transactions; Items; Avg. items
Rows: T[...].I[...].D[...]K, T[...].I[...].D[...]K [...]

[...] maximal potentially frequent itemsets (denoted as I) is [...]. The transaction table corresponding to this dataset had approximately [...] thousand records. The second dataset has [...] thousand transactions, each containing an average of [...] items (total of about [...] million records) [...]

[...] and discuss how they impact the cost. Based on these optimizations, we develop the Set-oriented Apriori approach in Section [...].

PAGE 60

Pruning Non-Frequent Items

The size of the transaction table is a major factor in the cost of joins involving T. It can be reduced by pruning the non-frequent items from the transactions after the first pass. We store the transaction data as (tid, item) tuples in a relational table, and hence this pruning means simply dropping the tuples corresponding to non-frequent items. This can be achieved by joining T with the frequent items table F_1 as follows.

    insert into T_f
    select t.tid, t.item
    from T t, F_1 f
    where t.item = f.item1

We insert the pruned transactions into table T_f, which has the same schema as that of T. In the subsequent passes, joins with T can be replaced with corresponding joins with T_f. This could result in improved performance, especially for higher support values where the frequent item selectivity is low, since we join smaller tables. For some of the synthetic datasets we used in our experiments, this pruning reduced the size of the transaction table to about half its original size. This could be even more useful for real-life datasets, which typically contain lots of non-frequent items. For example, some of the real-life datasets used for the experiments reported in Sarawagi et al. [...] contained on the order of [...] thousand items, out of which only a few hundred were frequent.

Figure [...] shows the reduction in transaction table size due to this optimization for our experimental datasets. The initial size (R) and the size after pruning (R_f) for different support values are shown.

With this optimization, in the cost formulae of Section [...], R can be replaced with R_f (the number of records in T involving frequent items) and N with N_f (the average

PAGE 61

[...]'s or T's) could be expensive. In order to circumvent the problem of counting the large C_2, most mining algorithms use special techniques in the second pass. A few examples are two-dimensional arrays in IBM's Quest data mining system [...] and hash filters proposed in Park et al. [...] to limit the size of C_2. The generation of C_2 can be completely eliminated by formulating the join query to find F_2 as:

    insert into F_2
    select p.item, q.item, count(*)
    from T_f p, T_f q
    where p.tid = q.tid and p.item < q.item
    group by p.item, q.item

PAGE 62

    having count(*) > :minsup

The cost of the second pass with this optimization is:

    join(R_f, R_f, C(N_f, 2)) + group(C(N_f, 2), C(F_1, 2))

Even though the grouping cost remains the same, there is a big reduction from the basic KwayJoin approach in the join costs. Figure [...] compares the running time of the second pass with this optimization to the basic KwayJoin approach for the two experimental datasets. For the KwayJoin approach, the best execution plan was the one which generates all item combinations, joins them with the candidate set and groups the join result. We can see that this optimization has a significant impact on the running time.

Reusing the Item Combinations from Previous Pass

The SQL formulations of association rule mining are based on generating item combi- [...] T_k, which contains all k-item combinations

PAGE 63

[Figure [...]: Benefit of second pass optimization (two panels, "Pass 2 optimization", one per synthetic dataset; bars: With Opt, Without Opt; x-axis: Support)]

that are candidates. T_k has the schema (tid, item1, ..., itemk). We join T_{k-1}, T_f and C_k as shown below to generate T_k. A tree diagram of the query is also given in Figure [...]. The frequent itemsets F_k are obtained by grouping the tuples of T_k on the k items and applying the minimum support filtering.

We can further prune T_k by filtering out item combinations that turned out to be non-frequent. However, this is not essential since we join it with the candidate set C_{k+1} in

PAGE 64

    insert into T_k
    select p.tid, p.item1, ..., p.item_{k-1}, q.item
    from C_k, T_{k-1} p, T_f q
    where p.item1 = C_k.item1 and
          ...
          p.item_{k-1} = C_k.item_{k-1} and
          q.item = C_k.itemk and
          p.tid = q.tid

[Figure [...]: Generation of T_k]

the next pass to generate T_{k+1}. The only advantage of pruning T_k is that we will have a smaller table to join in the next pass, but at the additional cost of joining T_k with F_k.

We use the optimization discussed above for the second pass and hence do not materialize and store the 2-item combinations T_2. Therefore, we generate T_3 directly by joining T_f with C_3 as:

    insert into T_3
    select p.tid, p.item, q.item, r.item
    from T_f p, T_f q, T_f r, C_3
    where p.item = C_3.item1 and
          q.item = C_3.item2 and
          r.item = C_3.item3 and
          p.tid = q.tid and
          q.tid = r.tid
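The three steps described above (pruning T to T_f, finding F_2 without generating C_2, and materializing T_3 directly from T_f and C_3) can be tried end to end with Python's sqlite3 module. This is only a sketch under assumptions: the toy data, the hand-written C_3, the extra count column in F1 and the choice of SQLite are not from the dissertation, and candidate generation is left out.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
create table T  (tid integer, item text);
insert into T values
  (1,'a'),(1,'b'),(1,'c'),(1,'x'),
  (2,'a'),(2,'b'),(2,'c'),
  (3,'a'),(3,'b'),(3,'y'),
  (4,'c');
create table F1 (item1 text, cnt integer);
create table Tf (tid integer, item text);
create table F2 (item1 text, item2 text, cnt integer);
create table C3 (item1 text, item2 text, item3 text);
create table T3 (tid integer, item1 text, item2 text, item3 text);
""")
minsup = 1  # assumed threshold: keep count(*) > minsup

# First pass: frequent items.
con.execute("insert into F1 select item, count(*) from T "
            "group by item having count(*) > ?", (minsup,))
# Prune non-frequent items: T_f keeps only tuples whose item is in F_1.
con.execute("insert into Tf select t.tid, t.item from T t, F1 f "
            "where t.item = f.item1")
# Second pass without generating C_2: self-join of T_f.
con.execute("""insert into F2
               select p.item, q.item, count(*)
               from Tf p, Tf q
               where p.tid = q.tid and p.item < q.item
               group by p.item, q.item
               having count(*) > ?""", (minsup,))
# C_3 is supplied by hand here (assumption).
con.execute("insert into C3 values ('a','b','c')")
# T_3 is generated directly from three copies of T_f and C_3.
con.execute("""insert into T3
               select p.tid, p.item, q.item, r.item
               from Tf p, Tf q, Tf r, C3
               where p.item = C3.item1 and q.item = C3.item2 and
                     r.item = C3.item3 and
                     p.tid = q.tid and q.tid = r.tid""")

t_rows = con.execute("select count(*) from T").fetchone()[0]
tf_rows = con.execute("select count(*) from Tf").fetchone()[0]
f2 = con.execute("select * from F2 order by item1, item2").fetchall()
t3 = con.execute("select * from T3 order by tid").fetchall()
```

Grouping T3 on its three item columns and applying the support filter would then give F_3, and T3 would play the role of T_{k-1} in the next pass.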

PAGE 65

    [...] s_l * d_l^k) } + group(S(C_k), C_k)

As a result of the materialization and reuse of item combinations, Set-oriented Apriori requires only a single 3-way join in the kth pass. The cost of the kth pass in Set-oriented Apriori is

    trijoin(R_f, T_{k-1}, C_k, S(C_k)) + group(S(C_k), C_k)

where T_{k-1} and C_k denote the cardinality of the corresponding tables. The grouping cost is the same as that of the subquery approach. The table T_{k-1} contains exactly the same tuples as that of subquery Q_{k-1} and hence has a size of s_{k-1} * d_{k-1}^k. Also, d_k^k is the same as C_k. Therefore, the kth pass cost of Set-oriented Apriori is the same as the kth term in

(Footnote: Note that this may be executed as two 2-way joins, since 3-way joins are not generally supported in current relational systems.)

PAGE 66

the join cost summation of the subquery approach. This results in significant performance improvements, especially in the higher passes.

Figure [...] compares the running times of the subquery and Set-oriented Apriori approaches for the dataset T[...].I[...].D[...]K for [...]% support. We show only the times for passes 3 and higher since both the approaches are the same in the first two passes.

[Figure [...]: Benefit of reusing item combinations (bars: Set-Apriori, Subquery; one group per pass)]

Space Overhead

The Set-oriented Apriori approach requires additional space in order to store the item combinations generated. The size of the table T_k is the same as S(C_k), which is the total support of all the k-item candidates. Assuming that the tid and item attributes are integers, each tuple in T_k consists of k+1 integer attributes. Figure [...] shows the space required to store T_k, in terms of number of integers, for the dataset T[...].I[...].D[...]K for two different support values. The space needed for the input data table T is also shown for comparison. T_2 is not shown in the graph since we do not materialize and store it in the Set-oriented Apriori approach. Note that once T_k is materialized, T_{k-1} can be deleted unless it needs to be retained for some other purposes.

PAGE 67

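[Figure: space required to store T_k (two panels, one per support value; bars for T and the materialized T_k tables)]

[...] and the Set-oriented Apriori for a wide range of data parameters and support values. We report the results on two of the datasets, T[...].I[...].D[...]K and T[...].I[...].D[...]K, which are described in Section [...].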

PAGE 68

[Figure [...]: Comparison of Subquery and Set-oriented Apriori approaches (two panels, total time for each synthetic dataset; bars split by pass; x-axis: Support)]

In Figure [...], we show the relative performance of the Subquery and Set-oriented Apriori approaches for the two datasets. The chart shows the total time taken for each of the different passes. Set-oriented Apriori performs better than Subquery for all the support values. The first two passes of both the approaches are similar and they take approximately equal amounts of time. The difference between Set-oriented Apriori and Subquery widens for higher numbered passes, as explained in Section [...]. For T[...].I[...].D[...]K, F_[...] was empty

PAGE 69

for support values higher than [...]%, and therefore we chose lower support values to study the relative performance in higher numbered passes.

In some cases, the optimizer did not choose the best plan. For example, for joins with T (T_f for Set-oriented Apriori), the optimizer chose a nested loops plan using the (item, tid) index on T in many cases where the corresponding sort-merge plan was faster, an order of magnitude faster in some cases. We were able to experiment with different plans by disabling certain join methods (disabling nested loops join for the above case) [...]

[...]%. The first graph shows the absolute execution times and the second one shows the times

PAGE 70

[Figure [...]: Comparison of CPU and I/O times (panels: CPU time and I/O time for the synthetic datasets; bars split by pass; x-axis: Support)]

normalized with respect to the times for the [...] transaction datasets. It can be seen that the execution times scale quite linearly and both the datasets exhibit similar scaleup behavior.

The scaleup with increasing transaction size is shown in Figure [...]. In these experiments, we kept the physical size of the database roughly constant by keeping the product of the average transaction size and the number of transactions constant. The number of transactions ranged from [...] for the database with an average transaction size of [...] to

PAGE 71

[Figure [...]: Number of transactions scaleup]

[...] for the database with an average transaction size of [...]. We fixed the minimum support level in terms of the number of transactions, since fixing it as a percentage would have led to large increases in the number of frequent itemsets as the transaction size increased. The numbers in the legend (for example, [...]) refer to this minimum support. The execution times increase with the transaction size, but only gradually. The main reason for this increase was that the number of item combinations present in a transaction increases with the transaction size.

PAGE 72


PAGE 73

Note: A part of the work described in this chapter was primarily done by researchers from IBM Almaden Research Center. Specifically, the SQL-based candidate generation in Section [...] and the support counting approaches in Section [...] were developed by them. They are included in this dissertation for completeness.

PAGE 74

CHAPTER [...]
SUPPORT COUNTING USING SQL WITH OBJECT-RELATIONAL EXTENSIONS

In this chapter, we study alternative approaches that make use of additional object-relational features in SQL. For each approach, we also outline a cost-based analysis of the execution time to enable one to choose between these different approaches. We present six different approaches, optimizations and their cost estimates in Sections [...] and [...]. Experimental results comparing the performance of these approaches are presented in Section [...]. In Section [...], we propose a hybrid approach which combines the best of all approaches. The performance of different architectural alternatives described in Chapter [...] is compared in Section [...]. In Section [...], we summarize qualitative comparisons of these architectures. The applicability of the SQL-based approach to other association rule mining algorithms is briefly discussed in Section [...].

GatherJoin

This approach (see Figure [...]) is based on the use of table functions described in Section [...]. It generates all possible k-item combinations of items contained in a transaction, joins them with the candidate table C_k and counts the support of the itemsets by grouping the join result. It uses two table functions, Gather and Comb-K. The data table T is scanned in the (tid, item) order and passed to the table function Gather. This table function collects all the items of a transaction (in other words, items of all tuples of T with the same tid) in memory and outputs a record for each transaction. Each such record consists of two

PAGE 75

[...]mentation. The Gather function is not required when the data is already in a horizontal format where each tid is followed by a collection of all its items.

Special Pass 2 Optimization: Note that for k = 2, the candidate set C_2 is simply a join of F_1 with itself. Therefore, we can specially optimize pass 2 by replacing the join with C_2 by a join with F_1 before the table function (see Figure [...]). That way, the table function gets only frequent items and generates significantly fewer 2-item combinations. This optimization can be useful for other passes too, but unlike for pass 2 we still have to do the join with C_k.

Variations of GatherJoin Approach

GatherCount: One variation of the GatherJoin approach for pass two is the GatherCount approach, where we push the group-by inside the table function instead of doing it outside. The candidate 2-itemsets (C_2) are represented as a two-dimensional array inside the modified

PAGE 76

    insert into F_k
    select item1, ..., itemk, count(*)
    from C_k,
         (select t2.T_itm1, ..., t2.T_itmk
          from T,
               table(Gather(T.tid, T.item)) as t1,
               table(Comb-K(t1.tid, t1.item-list)) as t2)
    where t2.T_itm1 = C_k.item1 and
          ...
          t2.T_itmk = C_k.itemk
    group by C_k.item1, ..., C_k.itemk
    having count(*) > :minsup

[Figure [...]: Support counting by GatherJoin (tree diagram, bottom to top: T; order by tid, item; table function Gather; table function Comb-K; join with C_k on t2.T_itm1 = C_k.item1, ..., t2.T_itmk = C_k.itemk; group by item1, ..., itemk having count(*) > :minsup)]

table function Gather-Cnt for doing the support counting. Instead of outputting the 2-item combinations of a transaction, it directly uses them to update support counts in memory and outputs only the frequent 2-itemsets, F_2, and their support after the last transaction. Thus, the table function Gather-Cnt is an extension of the GatherComb-2 table function used in GatherJoin. The absence of the outer grouping makes this option rather attractive. The UDF code is also small since it only needs to maintain a 2D array. We could apply the same trick for subsequent passes, but the coding becomes considerably more complicated because of the

PAGE 77

    insert into F_2
    select tt2.T_itm1, tt2.T_itm2, count(*)
    from (select * from T, F_1 where T.item = F_1.item1) as tt1,
         table(GatherComb-2(tid, item)) as tt2
    group by tt2.T_itm1, tt2.T_itm2
    having count(*) > :minsup
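The data flow of GatherJoin (gather the items of each transaction, generate k-item combinations, keep those that are candidates, then group and count) can be mimicked in a few lines of Python. This is an illustrative sketch only: `gather` and `gather_join` are hypothetical names standing in for the Gather and Comb-K table functions and the surrounding join and group-by, and the in-memory lists stand in for tables T and C_k.

```python
from collections import Counter
from itertools import combinations, groupby

def gather(rows):
    """Mimic Gather: rows are (tid, item) pairs in (tid, item) order;
    yield one (tid, item_list) record per transaction."""
    for tid, grp in groupby(rows, key=lambda r: r[0]):
        yield tid, [item for _, item in grp]

def gather_join(rows, candidates, k, minsup):
    cand = set(candidates)
    counts = Counter()
    for _, items in gather(sorted(rows)):
        for combo in combinations(items, k):  # role of Comb-K
            if combo in cand:                 # role of the join with C_k
                counts[combo] += 1            # role of group by ... count(*)
    return {c: n for c, n in counts.items() if n > minsup}

# Toy data (assumed).
T = [(1, "a"), (1, "b"), (1, "c"), (2, "a"), (2, "b"), (3, "a"), (3, "c"), (4, "b")]
C2 = [("a", "b"), ("a", "c"), ("b", "c")]
F2 = gather_join(T, C2, 2, 1)
```

Because the items of each transaction are sorted, every generated combination is lexicographically ordered and can be compared directly with the ordered candidates.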

PAGE 78

[...] C_k. The trade-off is increased cost for other parts, notably grouping, because not as many combinations are filtered.

Horizontal: Another variation of GatherJoin is the Horizontal approach, which first uses the Gather function to transform the data to the horizontal format but is otherwise similar to the GatherJoin approach. Rajamani et al. [...] propose finding associations using an approach quite similar to this Horizontal approach. They assume (rather unrealistically) that the data is already in a horizontal format. However, they do not use the frequent itemset filtering optimization we outlined for pass 2. Without this optimization, the time for pass 2 for most real-life datasets blows up even for relatively high support values. Also, at the time of candidate generation, rather than doing a self-join on F_{k-1}, they join F_{k-1}

PAGE 79

[...] T_k is C(N, k) * T. The join with C_k filters out the non-candidate item combinations. The size of the join result is the sum of the support of all the candidates, denoted by S(C_k). The actual value of the support of a candidate itemset will be known only after the support counting phase. However, we get a good estimate by approximating it to the minimum of the support of all its (k - 1)-subsets in F_{k-1}. The total cost of the GatherJoin approach is:

    T_k * t_k + join(T_k, C_k, S(C_k)) + group(S(C_k), C_k), where T_k = C(N, k) * T

The above cost formula needs to be modified to reflect the special optimization of joining with F_1 to consider only frequent items. We need a new term join(R, F_1, R_f) and need to change the formula for T_k to include only frequent items (N_f instead of N).

For the second pass, we do not need the outer join with C_k. The total cost of GatherJoin in the second pass is:

    join(R, F_1, R_f) + T_2 * t_2 + group(T_2, C_2), where T_2 = C(N_f, 2) * T [...]

The cost of GatherCount in the second pass is similar to that for basic GatherJoin except for the final grouping cost. In this formula, "group_int" denotes the cost of doing the support counting inside the table function:

    join(R, F_1, R_f) + group_int(T_2, C_2) + F_2 * t_2

For GatherPrune the cost equation is:

    R * blob(k * C_k) + S(C_k) * t_k + group(S(C_k), C_k)
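The effect of replacing N by N_f in T_k = C(N, k) * T can be checked with a quick calculation. The snippet below is a sketch with made-up parameter values (N = 10, N_f = 5, T = 1000 are assumptions, not measurements from the experiments); `combos_generated` is a hypothetical helper name.

```python
from math import comb

def combos_generated(items_per_transaction, k, transactions):
    """T_k = C(N, k) * T: number of k-item combinations produced by Comb-K,
    treating every transaction as having the average number of items."""
    return comb(items_per_transaction, k) * transactions

# Hypothetical parameters: N = 10 items per transaction on average,
# of which N_f = 5 are frequent, over T = 1000 transactions.
without_pruning = combos_generated(10, 2, 1000)
with_pruning = combos_generated(5, 2, 1000)
```

Since C(N, k) grows quickly with N, even a modest reduction in the number of items passed to the table function shrinks both the t_k term and the input to the grouping step.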

PAGE 80

We use blob(k * C_k) for the BLOB passing cost since each itemset in C_k contains k items. The cost estimate of Horizontal is similar to that of GatherJoin except that here the data is materialized in the horizontal format before generating the item combinations.

Vertical

In this approach, we first transform the data table into a vertical format by creating for each item a BLOB containing all tids that contain that item (Tid-list creation phase) and then count the support of itemsets by merging together these tid-lists (support counting phase). This approach is related to the approaches discussed in Savasere et al. [...] and Zaki et al. [...]. For creating the Tid-lists we use a table function Gather. This is the same as the Gather function in GatherJoin except that here we create the tid-list for each frequent item. The data table T is scanned in the (item, tid) order and passed to the function Gather. The function collects the tids of all tuples of T with the same item in memory and outputs an (item, tid-list) tuple for items that meet the minimum support criterion. The tid-lists are represented as BLOBs and stored in a new TidTable with attributes (item, tid-list). The SQL query which does the transformation to vertical format is given in Figure [...].

    insert into TidTable
    select item, tid-list
    from (select * from T order by item, tid) as tt1,
         table(Gather(item, tid, minsup)) as tt2

[Figure [...]: Tid-list creation (tree diagram, bottom to top: T; order by item, tid; table function Gather; output tt2.item, tt2.tid-list)]

In the support counting phase, conceptually for each itemset in C_k we want to collect the tid-lists of all k items and use a UDF to count the number of tids in the intersection

PAGE 81

of these k lists. The tids are in the same sorted order in all the tid-lists, and therefore the intersection can be done easily and efficiently by a single pass of the k lists. This conceptual step can be improved further by decomposing the intersect operation so that we can share these operations across itemsets having common prefixes, as follows:

We first select distinct (item1, item2) pairs from C_k. For each distinct pair, we first perform the intersect operation to get a new result-tidlist, then find distinct triples (item1, item2, item3) from C_k with the same first two items, intersect result-tidlist with the tid-list for item3 for each triple, and continue with item4 and so on until all k tid-lists per itemset are intersected.

The above sequence of operations can be written as a single SQL query for any k, as shown in Figure [...]. The final intersect operation can be merged with the count operation to return a count instead of the tid-list. We do not include this optimization in the query of Figure [...] for simplicity.

Special Pass 2 Optimization: For pass 2, we need not generate C_2 and join the TidTables with C_2. Instead, we perform a self-join on the TidTable using the predicate t1.item < t2.item.

    insert into F_2
    select item1, item2, cnt
    from (select t1.item as item1, t2.item as item2,
                 CountIntersect(t1.tid-list, t2.tid-list) as cnt
          from TidTable t1, TidTable t2
          where t1.item < t2.item) as t
    where cnt > :minsup

PAGE 82

    insert into F_k
    select item1, ..., itemk, count(tid-list) as cnt
    from (Subquery Q_k) t
    where cnt > :minsup

Subquery Q_l (for any l between 2 and k):

    select item1, ..., iteml,
           Intersect(r_{l-1}.tid-list, tl.tid-list) as tid-list
    from TidTable tl,
         (Subquery Q_{l-1}) as r_{l-1},
         (select distinct item1, ..., iteml from C_k) as d_l
    where r_{l-1}.item1 = d_l.item1 and
          ... and
          r_{l-1}.item_{l-1} = d_l.item_{l-1} and
          tl.item = d_l.iteml

Subquery Q_1: (select * from TidTable)

[Tree diagram for Subquery Q_l: Subquery Q_{l-1} joined with TidTable tl and with "select distinct item1, ..., iteml from C_k", producing item1, ..., iteml and the tid-list computed by the Intersect UDF]

Figure [...]: Support counting using UDF
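The prefix-sharing idea behind the Vertical approach can be sketched in Python as follows. This is an assumption-laden illustration, not the UDF code: tid-lists are plain sorted Python lists rather than BLOBs, `intersect` plays the role of the Intersect UDF, and `vertical_count` is a hypothetical driver that caches the tid-list of each distinct prefix so that candidates sharing a prefix reuse it.

```python
def intersect(a, b):
    """Merge-intersect two sorted tid-lists in a single pass."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def vertical_count(tidtable, candidates, minsup):
    """tidtable maps item -> sorted tid-list. Returns the frequent candidates
    with their support and the number of intersect calls performed."""
    cache = {}  # prefix tuple -> tid-list, shared by candidates with that prefix
    calls = 0

    def tids(prefix):
        nonlocal calls
        if len(prefix) == 1:
            return tidtable[prefix[0]]
        if prefix not in cache:
            calls += 1
            cache[prefix] = intersect(tids(prefix[:-1]), tidtable[prefix[-1]])
        return cache[prefix]

    support = {c: len(tids(c)) for c in candidates}
    return {c: n for c, n in support.items() if n > minsup}, calls

# Toy TidTable and candidates (assumed data).
tidtable = {"a": [1, 2, 3], "b": [1, 2, 4], "c": [1, 3], "d": [1, 2]}
C3 = [("a", "b", "c"), ("a", "b", "d")]
F3, n_calls = vertical_count(tidtable, C3, 1)
```

In this example the pair (a, b) is intersected once and reused by both candidates, so three intersect calls are made instead of four.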

PAGE 83

Cost Analysis

The cost of the Vertical approach during the support counting phase is dominated by the cost of invoking the UDFs and intersecting the tid-lists. The UDF is first called for each distinct item pair in C_k, then for each distinct item triple with the same first two items, and so on. Let d_j^k be the number of distinct j-item tuples in C_k. Then the total number of times the UDF is invoked is [...]
PAGE 84

In the formula above, Intersect(n) denotes the cost of intersecting two tid-lists with a combined size of n. The total cost is dominated by the intersect cost, and join costs account for only a small fraction. Therefore, we can safely ignore the join costs in the above formulae.

The total cost of the second pass is:

    join(F_1, F_1, C_2) + C_2 * { 2 * Blob(s_1) + Intersect(2 s_1) }

SQL-Bodied Functions

This approach is based on SQL-bodied functions, commonly known as SQL/PSM [...]. SQL/PSMs extend SQL with additional control structures. We make use of one such construct: for ... do ... end. We use the for construct to scan the transaction table T in the (tid, item) order. Then, for each tuple (tid, item) of T, we update those tuples of C_k that contain one matching item. C_k is extended with 3 extra attributes (prevTid, match, supp). The prevTid attribute keeps the tid of the previous tuple of T that matched that itemset. The match attribute contains the number of items of prevTid matched so far, and supp holds the current support of that itemset. On each column of C_k an index is built to do a searched update.

    for this as select * from T do
        update C_k
        set prevTid = tid,
            match = case when tid = prevTid then match + 1 else 1 end,
            supp = case when match = k - 1 and tid = prevTid then supp + 1 else supp end

PAGE 85

        where item = item1 or ... or item = itemk
    end for

    insert into Fk select item1, ..., itemk, supp
    from Ck where supp > :minsup

The cost of this approach can be mainly attributed to the cost of updates to the candidate table C_k. For each tuple of the data table T, for all the candidate itemsets in C_k which contain that item, three updates are performed (the attributes prevTid, match and supp of the itemset are updated). If N_k is the average number of k-item candidates containing any given item, the total number of updates is 3 * R * N_k. The cost due to updates for this approach is U(3 * R * N_k), where U(n) [...] % support. For Dataset-B the process was aborted
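The bookkeeping done by the SQL-bodied function above (prevTid, match, supp) can be sketched in plain Python. This is only an illustration under stated assumptions: the function name `count_support_psm` is hypothetical, the candidate table is held in a dictionary, the indexed searched update is replaced by a linear scan over the candidates, every candidate has k >= 2 distinct items, and T contains no duplicate (tid, item) rows.

```python
def count_support_psm(transactions, candidates):
    """Simulate the prevTid/match/supp updates for k-item candidates.

    transactions: (tid, item) tuples sorted in (tid, item) order.
    candidates:   tuples of k distinct items, all of the same length k.
    """
    k = len(candidates[0])
    state = {c: {"prevTid": None, "match": 0, "supp": 0} for c in candidates}
    for tid, item in transactions:
        for cand, s in state.items():
            if item not in cand:
                continue  # the "where item = item1 or ... or item = itemk" filter
            same_tid = tid == s["prevTid"]
            # Like a SQL UPDATE, both expressions are evaluated on the old row values.
            new_supp = s["supp"] + 1 if same_tid and s["match"] == k - 1 else s["supp"]
            new_match = s["match"] + 1 if same_tid else 1
            s["prevTid"], s["match"], s["supp"] = tid, new_match, new_supp
    return {c: s["supp"] for c, s in state.items()}


# Hypothetical transaction table in (tid, item) order.
T = [(1, "a"), (1, "b"), (1, "c"), (2, "a"), (2, "c"), (3, "b"), (3, "c")]
supports = count_support_psm(T, [("a", "b"), ("a", "c"), ("b", "c")])
```

The support of a candidate is incremented exactly when its k-th item is seen within the same transaction, which is the role of the `match = k - 1` test.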

PAGE 86

Figure [...]: Comparison of four SQL-OR approaches: Vertical, GatherPrune, GatherJoin and GatherCount (one panel per dataset; for each support value, the running time is broken down into Prep and the individual passes)

PAGE 87

Figure [...] shows the total running time of the different approaches. The time taken is broken down by each pass and an initial "prep" stage where any one-time data transfor[...] ("prep" in figure [...]). The vertical representation is like an index on the item attribute. If we think of this time as a one[...]mance. With this optimization, for Dataset-B with support [...]%, the running time for pass [...] alone was reduced from [...] hours to [...] minutes. When we compare these different approaches based on time spent in each pass, we observe that no single approach is "the best" for all different passes of the different datasets, especially for the second pass. For pass three onwards, Vertical is often two or more orders of magnitude better than the other approaches. Even in cases like Dataset-B, support [...]%, where it spends three

PAGE 88

hours in the second pass, the total time for the next two passes is only [...] seconds, whereas it is more than an hour for the other two approaches. For subsequent passes the performance degrades dramatically for GatherJoin, because the table function GatherCombK generates a large number of combinations. For instance, for pass [...] of Dataset-C, even for a support value of [...]%, pass [...] did not complete after [...] hours, whereas for Vertical pass [...] finished in [...] seconds. GatherPrune is better than GatherJoin for the third and later passes. For pass [...] GatherPrune is worse because the overhead of passing a large object as an argument dominates cost. The Vertical approach sometimes ended up spending too much time in the second pass. In some of these cases the GatherJoin approach was better in the second pass (for instance, for low support values of Dataset-B), whereas in other cases (for instance, Dataset-C, support [...]%)

PAGE 89

Two factors that affect the choice amongst the Vertical, GatherJoin and GatherCount approaches in different passes, and pass 2 in particular, are the number of frequent items (F_1) and the average number of frequent items per transaction (N_f). From the graphs in Figure [...] we notice that as the value of the support is decreased for each dataset, causing the size of F_1 to increase, the performance of pass 2 of the Vertical approach degrades rapidly. This trend is also clear from our cost formulae. The cost of the Vertical approach increases quadratically with F_1. GatherJoin depends more critically on the number of frequent items per transaction. For Dataset-B, even when the size of F_1 increases by a factor of [...], the value of N_f remains close to [...]; therefore the time taken by GatherJoin does not increase as much. However, for Dataset-C the size of N_f increases from [...] to [...] as the support is decreased from [...]% to [...]% [...] increases, the cost of Vertical remains almost unchanged whereas the cost of GatherJoin increases.

PAGE 90

Final Hybrid Approach

The previous performance section helps us draw the following conclusion: overall, the Vertical approach is the best option, especially for higher passes. When the size of the candidate itemsets is too large, the performance of the Vertical approach could suffer. In such cases, GatherJoin is a good option as long as the number of frequent items per transaction (N_f) [...]ing association rules provided with the IBM data mining product, Intelligent Miner []. For Stored-procedure, we extracted the Apriori implementation in Intelligent Miner and created a stored procedure out of it. The stored procedure is run in the unfenced mode in the database address space. For Cache-Mine, we used an option provided in Intelligent Miner that causes the input data to be cached as a binary file after the first scan of the data.

PAGE 91

• Cache-Mine has the best or close to the best performance in all cases. [...]% of its total time is spent in the first pass, where data is accessed from the DBMS and cached in the file system. Compared to the SQL approach, this approach is a factor of [...] to [...] times faster.

PAGE 92

Figure [...]: Comparison of four architectures (one panel per dataset; y-axis: time in sec; x-axis: support; bars broken down by pass)

• The Stored-procedure approach is the worst. The difference between Cache-Mine and Stored-procedure is directly related to the number of passes. For instance, for Dataset-A the number of passes increases from two to three when decreasing support from [...]% to [...]%, causing the time taken to increase from two to three times. The time spent in each pass for Stored-procedure is the same, except when the algorithm makes multiple passes over the data since all candidates could not fit in memory together. This happens for the lowest support values of Dataset-B,

PAGE 93

Dataset-C and Dataset-D. Time taken by Stored-procedure can be expressed approximately as number of passes times time taken by Cache-Mine.

• UDF is similar to Stored-procedure. The only difference is that the time per pass decreases by [...]% for UDF because of closer coupling with the database.

• [...]

PAGE 94

[...] of [...]. Thus, the size of the dataset is [...] times the number of transactions. As the number of transactions is increased from [...] to [...], the time taken increases proportionately. The largest frequent itemset was [...] long. This explains the five-fold difference in performance between the Stored-procedure and the Cache-Mine approach. Figure [...] shows the scaling when the transaction length changes from [...] to [...] while keeping the number of transactions fixed at [...]. All three approaches scale linearly with increasing transaction length.

Figure [...]: Scale-up with increasing number of transactions

PAGE 95

Figure [...]: Scale-up with increasing transaction length (x-axis: average transaction length)

Impact of longer names

In these experiments we assumed that the tids and itemids are all integers. Often in practice these are character strings longer than four characters. Longer strings need more storage and cost more during comparisons. This could hurt all four of the alternatives. For the Stored-procedure, UDF and Cache-Mine approaches, the time taken to transfer data will increase. The Intelligent Miner code [] maps all character fields to integers using an in-memory hash table. Therefore, beyond the increase in the data transfer and mapping costs (which account for the bulk of the time), we do not expect the processing time of these three alternatives to increase. For the SQL approach we cannot assume an in-memory hash table for doing the mapping; therefore, we use an alternative approach based on table functions. For the SQL approach we discuss the hybrid approach. The two (already expensive) steps that could suffer because of longer names are (1) final group-bys during pass [...] or higher when the GatherJoin approach is chosen, and (2) tidlist operations when the Vertical approach is chosen. For efficient performance, the first step requires a mapping of itemids and the second one requires us to map tids. We use a table function to map the tids to

PAGE 96



PAGE 97

all three options, Cache-Mine, Stored-procedure and UDF, require data in each pass to be grouped on the tid. In a relational DBMS we cannot assume any order on the physical layout of a table, unlike in a file system. Therefore, we need either an index on the data table or need to sort the table every time to ensure a particular order. Let R denote the total number of (tid, item) [...] to [...] (bad) on each of the following yardsticks: (a) performance (execution time), (b) storage overhead, (c) scope for automatic parallelization, (d) development and maintenance ease, (e) portability, (f) inter-operability.

PAGE 98

Figure [...]: Comparison of different architectures on space requirements (one group of bars each for Dataset-A, Dataset-B, Dataset-C and Dataset-D)

Table [...]: Pros and cons of different architectural options ranked on a scale of 1 (good) to [...] (bad)

    Metric                              Stored-proc   UDF     Cache-Mine   SQL
    Performance                         [...]         [...]   [...]        [...]
    Storage overhead                    [...]         [...]   [...]        [...]
    Automatic parallelism               [...]         [...]   [...]        [...]
    Development and maintenance ease    [...]         [...]   [...]        [...]
    Portability                         [...]         [...]   [...]        [...]
    Inter-operability                   [...]         [...]   [...]        [...]

In terms of performance, the Cache-Mine approach is the best option, followed by the SQL approach. The SQL approach was within a factor of [...] to [...] of the Cache-Mine approach for all of our experiments. The UDF approach is better than the Stored-procedure approach in performance by [...] to [...]%, but it loses on the metrics of development and maintenance costs and portability. In terms of space requirements, the Cache-Mine and the SQL approach lose to the UDF or the Stored-procedure approach. Between the Stored-procedure and the Cache-Mine implementation, the performance difference is exactly a function of the number of passes made on the data; that is, if we make four passes

PAGE 99

of the data, the Stored-procedure approach is four times slower than Cache-Mine. Therefore, if one is not willing to pay the penalty of extra storage, the best strategy for improving performance is to reduce the number of passes over the data, even if it comes at the cost of extra processing. Some of the recent proposals [] that attempt to minimize the number of data passes to [...] or [...] might be useful in that regard. The SQL approach is not the winner in terms of performance and space requirements, but it is competitive. The benefit of the SQL approach could arise from other secondary advantages. The SQL implementation has the potential for automatic parallelization. Paralleliza[...]

PAGE 100

[...] to be unacceptable from the performance viewpoint. Our preferred SQL implementation relies on the availability of DB[...]ibility. The ad hoc querying support provided by the DBMS enables flexible usage and exposes potential for pipelining the input and output operators of the mining process with

PAGE 101

other operators in the DBMS. However, to exploit this feature one either needs to implement the mining operators inside the DBMS or use table functions in novel ways. The first alternative would require major rework in existing database systems. The second alternative requires table functions that can execute SQL statements (a facility not currently available in DB2 UDB). Furthermore, some mining operators generate multiple different kinds of output (for instance, model tree and statistics for decision trees). In such cases, pipelining of mining operators becomes harder to fit in existing relational models. Note that the SQL approach presented here is based on embedded SQL and cannot provide oper[...]

PAGE 102

implementation because they are aimed at reducing the number of passes over the data. We have seen that with our SQL implementations hardly any time is spent on passes beyond [...] because of the Vertical approach. The time spent in pass [...] is not reduced because they all require counting support of all of C[...]'s primary contributions were the optimizations to the approaches in Sections [...] and [...], the hybrid approach in Section [...] and the cost analysis of the various support counting approaches. The author has also contributed to the performance experiments which led to the various comparisons.

PAGE 103

CHAPTER [...]
GENERALIZED ASSOCIATION RULES

In most real-life applications, the set of items that appear in transactions can be categorized according to a taxonomy (is-a hierarchy) on the items. The taxonomy shown in Figure [...] says that Pepsi is-a soft drink is-a beverage, and so on. In general, a taxonomy can be represented as a directed acyclic graph (DAG). Given a set of transactions T, each of which is a set of items, and a taxonomy Tax, the problem of mining generalized association rules is to discover all rules of the form X → Y with the user-specified minimum support and minimum confidence. X and Y can be sets of items at any level of the taxonomy, such that no item in Y is an ancestor of any item in X []. For example, there might be a rule which says that "[...]% of transactions that contain Soft drinks also contain Snacks; [...]% of all transactions contain both these items".

    Level 1:  Beverages, Snacks
    Level 2:  Soft drinks, Alcoholic drinks, Pretzels, Chocolate bar
    Level 3:  Pepsi, Coke, Beer

Figure [...]: Example of a taxonomy

In this chapter, we present several SQL formulations of generalized association rule mining []. In Section [...] we describe the input-output formats, and in Section [...] we briefly outline the Cumulate algorithm for generalized association rule mining [].

PAGE 104


PAGE 105

[...] (k - 1)-itemsets F_{k-1} found in the previous pass is used as the seed set to generate candidate k-itemsets (C_k) that are potentially frequent. In the support counting phase, for each itemset t ∈ C_k, the number of extended transactions (transactions augmented with all the ancestors of its items) [...] pruning candidates containing an item and its ancestor, and (ii) extending the transactions by adding all the ancestors of its items. The ancestor

PAGE 106

computation can be expressed in SQL using a recursive query as shown in Figure [...]. The result of the query is stored in table Ancestor having the schema (ancestor, descendant).

    insert into Ancestor with R-Tax (ancestor, descendant) as
        (select parent, child from Tax
         union all
         select p.ancestor, c.child from R-Tax p, Tax c
         where p.descendant = c.parent)
    select ancestor, descendant from R-Tax

Figure [...]: Pre-computing ancestors

Candidate Generation

In the candidate generation phase, we use the frequent itemsets F_{k-1} found in the (k - 1)th pass to generate a set of candidate itemsets C_k that contains all k-itemsets such that all k of its (k - 1)-length subsets are in F_{k-1}. Section [...] shows how to express this operation as a k-way join between the frequent (k - 1)-itemsets (F_{k-1}'s). We can use the same formulation, except that we need to prune from C_k itemsets containing an item and its ancestor. Srikant and Agrawal [] prove that this pruning needs to be done only in

PAGE 107

the second pass (for C_2). In the SQL formulation, as shown in Figure [...], we prune all (ancestor, descendant) pairs from C_2, which is generated by joining F_1 with itself.

    insert into C2
        (select I1.item1, I2.item1 from F1 I1, F1 I2
         where I1.item1 < I2.item1)
        except
        (select ancestor, descendant from Ancestor
         union
         select descendant, ancestor from Ancestor)

Figure [...]: Generation of C_2

Support Counting to Find Frequent Itemsets

We use the candidate itemsets C_k, the data table T and the ancestor table Ancestor to count the support of the itemsets in C_k. We consider two categories of SQL implementations, based on SQL-92 and SQL-OR, in Sections [...] and [...] respectively. All the SQL approaches developed for boolean associations in Chapters [...] and [...] can be extended to handle taxonomies. However, we present the extensions to only a few representative approaches. In particular, we consider the KwayJoin approach from SQL-92, and Vertical and GatherJoin from SQL-OR.
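The ancestor pre-computation and the pruned generation of C_2 described above can be tried out with Python's built-in sqlite3 module. This is a sketch under assumptions, not the DB2 setup used in the thesis: the sample taxonomy and F_1 are invented, the recursive table is named RTax, the WITH clause is placed before INSERT, and the union inside EXCEPT is wrapped in a subquery because SQLite does not accept parenthesized compound selects.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
create table Tax(parent text, child text);
insert into Tax values ('Beverages', 'Soft drinks'),
                       ('Soft drinks', 'Pepsi'),
                       ('Soft drinks', 'Coke');
create table Ancestor(ancestor text, descendant text);
create table F1(item1 text);
insert into F1 values ('Beverages'), ('Coke'), ('Pepsi'), ('Soft drinks');
create table C2(item1 text, item2 text);
""")

# Transitive closure of the taxonomy: every (ancestor, descendant) pair.
con.execute("""
with recursive RTax(ancestor, descendant) as (
    select parent, child from Tax
    union all
    select p.ancestor, c.child from RTax p, Tax c
    where p.descendant = c.parent)
insert into Ancestor select ancestor, descendant from RTax
""")

# C2 = F1 joined with itself, minus pairs made of an item and its ancestor.
con.execute("""
insert into C2
select I1.item1, I2.item1 from F1 I1, F1 I2 where I1.item1 < I2.item1
except
select * from (select ancestor, descendant from Ancestor
               union
               select descendant, ancestor from Ancestor)
""")

ancestors = sorted(con.execute("select ancestor, descendant from Ancestor"))
c2 = con.execute("select item1, item2 from C2 order by 1, 2").fetchall()
```

With this sample data, every pair that combines an item with one of its ancestors is removed, so only the sibling pair remains in C2.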

PAGE 108

Support Counting Using SQL-92

K-way Join

In each pass k, we join the candidate itemsets C_k with k copies of an extended transaction table T* (defined below) and follow it up with a group by on the itemsets. The extended transaction table T* is obtained by augmenting T to include (tid, item) entries for all ancestors of items appearing in T. This can be formulated as a SQL query as shown in Figure [...].

    Query to generate T*:
    select item, tid from T
    union
    select distinct A.ancestor as item, T.tid
    from T, Ancestor A
    where A.descendant = T.item

Figure [...]: Transaction extension subquery

The select distinct clause is used to eliminate duplicate records due to extension of items with a common ancestor. Note that for this approach we do not materialize T*. Instead, we use the SQL support for common subexpressions (with construct) to pipeline the generation of T* with the join operations. The final SQL query and the corresponding tree diagram are shown in Figure [...].

PAGE 109

    insert into Fk with T*(tid, item) as (Query for T* defined above)
    select item1, ..., itemk, count(*)
    from Ck, T* t1, ..., T* tk
    where t1.item = Ck.item1 and
          ...
          tk.item = Ck.itemk and
          t1.tid = t2.tid and
          ...
          t_{k-1}.tid = tk.tid
    group by item1, item2, ..., itemk
    having count(*) > :minsup

Figure [...]: Support counting by K-way join
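For k = 2, the K-way join over the extended transaction table can be run as a small experiment with Python's sqlite3 module. The sketch below is illustrative only: the table contents, the name Tstar (SQLite identifiers cannot contain "*"), the (tid, item) column order and the threshold value are assumptions, and the result is fetched rather than inserted into F_2.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
create table T(tid integer, item text);
insert into T values (1, 'Pepsi'), (1, 'Pretzels'),
                     (2, 'Coke'),  (2, 'Pretzels'),
                     (3, 'Pepsi');
create table Ancestor(ancestor text, descendant text);
insert into Ancestor values ('Soft drinks', 'Pepsi'), ('Soft drinks', 'Coke');
create table C2(item1 text, item2 text);
insert into C2 values ('Pretzels', 'Soft drinks'), ('Pepsi', 'Pretzels');
""")

minsup = 1  # hypothetical threshold; candidates must occur in more transactions
f2 = con.execute("""
with Tstar(tid, item) as (
    select tid, item from T
    union
    select distinct T.tid, A.ancestor from T, Ancestor A
    where A.descendant = T.item)
select C2.item1, C2.item2, count(*)
from C2, Tstar t1, Tstar t2
where t1.item = C2.item1 and t2.item = C2.item2 and t1.tid = t2.tid
group by C2.item1, C2.item2
having count(*) > ?
order by 1, 2
""", (minsup,)).fetchall()
```

Here the candidate (Pretzels, Soft drinks) is supported by transactions 1 and 2 only because both were extended with the ancestor Soft drinks; no original transaction contains that item.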

PAGE 110

The [...]wayJoin and GroupBy approaches for boolean associations can be extended to handle taxonomies in a similar way, by replacing T with T* in the corresponding queries.

Subquery Optimization

The basic KwayJoin approach can be optimized to make use of common prefixes between the itemsets in C_k by splitting the support counting phase into a cascade of k subqueries. The subqueries in this case are exactly similar to those for boolean associations presented in [...], except for the use of T* instead of T.

Support Counting Using SQL-OR

In this section, we present three approaches that make use of the object-relational features of SQL. We also present a cost-based analysis of the execution time of the various approaches.

GatherJoin

The GatherJoin approach, which is based on the use of table functions [], generates all possible k-item combinations of extended transactions, joins them with the candidate table C_k and counts the support by grouping the join result. The extended transactions T* (defined in Section [...]) are passed to the table function GatherCombK in the (tid, item) order. A record output by the table function is a k-item combination supported by a transaction and has k attributes T_itm1, ..., T_itmk. The SQL queries for this approach are presented in Figure [...]. The special optimization for pass [...] and the variations of the GatherJoin approach, namely GatherCount, GatherPrune and Horizontal (refer Section [...]), are also applicable here.
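The GatherJoin idea described above (gather the items of each extended transaction, emit every k-item combination, join with the candidates and count) can be sketched in Python. The function names, the in-memory set used for the join with C_k and the sample rows are assumptions for illustration; the thesis implements the combination step as a DB2 table function.

```python
from collections import Counter
from itertools import combinations, groupby


def gather_comb_k(rows, k):
    """Yield every k-item combination of each transaction.

    rows: (tid, item) tuples of the extended table, sorted in (tid, item) order.
    """
    for _, group in groupby(rows, key=lambda r: r[0]):
        items = [item for _, item in group]
        yield from combinations(items, k)


def gather_join(rows, candidates, k, minsup):
    """Join the generated combinations with the candidates and count support."""
    cand = set(candidates)
    counts = Counter(c for c in gather_comb_k(rows, k) if c in cand)
    return {c: n for c, n in counts.items() if n > minsup}


# Hypothetical extended transactions (ancestor 'Soft drinks' already added).
t_star = [(1, "Pepsi"), (1, "Pretzels"), (1, "Soft drinks"),
          (2, "Coke"), (2, "Pretzels"), (2, "Soft drinks"),
          (3, "Pepsi"), (3, "Soft drinks")]
c2 = [("Pretzels", "Soft drinks"), ("Pepsi", "Pretzels")]
f2 = gather_join(t_star, c2, k=2, minsup=1)
```

Since a transaction with n items yields C(n, k) combinations, extending transactions with ancestors inflates the output of this step quickly, which matches the cost concern raised for GatherJoin.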

PAGE 111

    insert into Fk select item1, ..., itemk, count(*)
    from Ck,
         (select t2.T_itm1, ..., t2.T_itmk
          from T* t1, table(GatherCombK(t1.tid, t1.item)) as t2)
    where t2.T_itm1 = Ck.item1 and
          ...
          t2.T_itmk = Ck.itemk
    group by Ck.item1, ..., Ck.itemk
    having count(*) > :minsup

Figure [...]: Support counting by GatherJoin

GatherExtend

This is a variation of the GatherJoin approach where we push the transaction extension inside the table function. For each item, we create an ancestor-list (stored as a BLOB) which contains a list of ancestors of that item. This can be accomplished using similar SQL queries as in the tidlist creation phase of the Vertical approach for boolean associations (refer Section [...]). The (item, ancestor-list) pairs are stored in a new AncListTable. Note that an ancestor-list needs to be created only for items that appear in the input transactions.

PAGE 112

In the support counting phase, we join the transaction table T with the AncListTable, and the resulting (tid, item, ancestor-list) tuples are passed to a table function GExtendCombK in tid order. The table function collects all the items and the corresponding ancestor-lists of the same transaction, extends it, and outputs k-item combinations of the extended trans[...]

    [...]       [...] * d
    N*          average number of items in an extended transaction (N * d)
    R*          total number of records after transaction extension (R * d)
    e_k         cost of generating a k-item combination using GExtendCombK
    order(n)    cost of sorting n records

The cost of GatherJoin includes the cost of extending the transactions by joining the transaction table with the ancestor table, generating k-item combinations, joining them with C_k, and grouping the join result to count the support. Note that the transaction

PAGE 113

    insert into Fk select item1, ..., itemk, count(*)
    from Ck,
         (select t2.T_itm1, ..., t2.T_itmk
          from (select tid, item, ancestorlist
                from T, AncListTable A
                where T.item = A.item) as t1,
               table(GExtendCombK(t1.tid, t1.item, t1.ancestorlist)) as t2)
    where t2.T_itm1 = Ck.item1 and
          ...
          t2.T_itmk = Ck.itemk
    group by Ck.item1, ..., Ck.itemk
    having count(*) > :minsup

Figure [...]: Support counting by GatherExtend
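The difference from GatherJoin is that the extension with ancestors happens inside the table function, which receives each item together with its ancestor-list. A small Python sketch of that behaviour follows; the function name, the dictionary standing in for AncListTable and the sample data are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations, groupby


def gextend_comb_k(rows, k):
    """Extend each transaction with its items' ancestors, then emit k-item combinations.

    rows: (tid, item, ancestor_list) tuples sorted by tid.
    """
    for _, group in groupby(rows, key=lambda r: r[0]):
        extended = set()
        for _, item, ancestor_list in group:
            extended.add(item)
            extended.update(ancestor_list)  # duplicates from shared ancestors collapse
        yield from combinations(sorted(extended), k)


# Hypothetical inputs: transaction table T and AncListTable (item -> ancestor-list).
T = [(1, "Pepsi"), (1, "Pretzels"), (2, "Coke"), (2, "Pretzels"), (3, "Pepsi")]
anc_list = {"Pepsi": ["Soft drinks"], "Coke": ["Soft drinks"], "Pretzels": []}

# Join of T with AncListTable, passed to the function in tid order.
rows = [(tid, item, anc_list[item]) for tid, item in T]
c2 = {("Pretzels", "Soft drinks"), ("Pepsi", "Pretzels")}
counts = Counter(c for c in gextend_comb_k(rows, 2) if c in c2)
```

Only R input tuples (one per original record) cross the function boundary here, at the price of carrying an ancestor-list with each of them.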

PAGE 114

extension cost can be ignored if T* is materialized, because in that case it will be a one-time cost. The result size of the join which extends the transactions is R*, and the number of k-item combinations generated, T_k, is C(N*, k) * T. Therefore, the total cost of the GatherJoin approach is

    join(R, A, R*) + order(T) + T_k * s_k + join(T_k, C_k, S(C_k)) + group(S(C_k), C_k)

In GatherExtend, transaction extension is done inside the table function. The cost of passing the ancestor-list as a BLOB is blob(d), since the average length of the ancestor-list is the same as the depth of the taxonomy. The cost of GatherExtend is

    join(R, J, R) + R * blob(d) + T_k * e_k + join(T_k, C_k, S(C_k)) + group(S(C_k), C_k)

Vertical

In this approach, the transactions are first converted into a vertical format by creating, for each item, a BLOB (tidlist) containing all tids that contain that item. The support for each itemset is counted by merging the tidlists of all its items. The tidlists of leaf node items can be created using a table function in the same way as in the boolean associations case. We present two approaches for creating the tidlists of the interior nodes in the taxonomy DAG. The first approach is based on doing the union of the descendants' tidlists of an interior node. In the first phase, we create an initial TidTable containing the tidlists of the leaf nodes. The TidTable is joined with the ancestor table, and the join result is passed in the order of the ancestor attribute to the table function TUnion. For every node x, the

PAGE 115

LQVHUW LQWR 7LG7DEOH VHOHFW LWHP WLGOLVW IURP VHOHFW $DQFHVWRU 7WLGOLVW IURP 7LG7DEOH 7 $QFHVWRU $ ZKHUH 7LWHP $GHVFHQGDQW RUGHU E\ DQFHVWRUf DV W? WDEOH78QLRQLLDQFHVWRU LWLGOLVWff DV 4 WO LWHP WO WLGOLVW W 7DEOH IXQFWLRQ 78QLRQ W 2UGHU E\ DQFHVWRU $DQFHVWRU 7WLGOLVW 7LWHP $GHVFHQGDQW ;? 7LG7DEOH 7 $QFHVWRU $ )LJXUH ,QWHULRU QRGHVfr GHILQHG LQ 6HFWLRQ f WR WKH *DWKHU WDEOH IXQFWLRQ DV VKRZQ LQ )LJXUH 7KH WDEOH IXQFWLRQ RXWSXWV WLGOLVWV IRU DOO WKH LWHPV LQ WKH WD[RQRP\ 7KH VXSSRUW FRXQWLQJ TXHULHV DUH H[DFWO\ WKH VDPH DV IRU ERROHDQ DVVRFLDWLRQV H[n SODLQHG LQ 6HFWLRQ 7KH FRVW IRUPXOD LV DOVR WKH VDPH DV LQ ERROHDQ DVVRFLDWLRQV H[FHSW

PAGE 116

LQVHUW LQWR 7LG7DEOH VHOHFW LWHP WLGOLVW IURP 7r L WDEOH*DWKHULLWHP  WLG PLQVXSff DV  WLWHP WWLGOLVW W 7DEOH IXQFWLRQ *DWKHU W 2UGHU E\ LWHP WLG W \r )LJXUH ,QWHULRU QRGHVf WLGOLVW JHQHUDWLRQ IURP 7r 5r WKDW WKH DYHUDJH OHQJWK RI D WLGOLVW LQ WKLV FDVH LV SI ZKHUH 5rf DQG 6WRUHGSURFHGXUH VKRZQ DV 6SURFf 7KH FKDUW VKRZV WKH SUHSURFHVVLQJ WLPH DQG WKH WLPH WDNHQ IRU WKH GLIIHUHQW SDVVHV )RU 9HUWLFDO WKH SUHSURFHVVLQJ WLPH LQFOXGHV DQFHVWRU SUHFRPSXWDWLRQ DQG WLGOLVW FUHn DWLRQ WLPHV ZKHUH DV IRU *DWKHU-RLQ LW LV MXVW WKH WLPH IRU DQFHVWRU SUHFRPSXWDWLRQ ,Q WKH VWRUHG SURFHGXUH DSSURDFK WKH PLQLQJ DOJRULWKP LV HQFDSVXODWHG DV D VWRUHG SURFHGXUH >@

PAGE 117

Figure [...]: Comparison of different SQL approaches (performance comparison chart; bars broken down into Prep and the individual passes)

    Mail order data
    Total number of records                               [...] million
    Number of transactions                                [...]
    Number of items (leaf nodes in taxonomy DAG)          [...]
    Total number of items (including interior nodes)      [...]
    Max. depth of the taxonomy                            [...]
    Avg. number of children per node                      [...]
    Max. number of parents                                [...]

which runs in the same address space as the DBMS. For the Stored-procedure experiment, we used the generalized association rule implementation provided with the IBM data mining product, Intelligent Miner []. For all the support values, the Vertical approach performs equally well as the Stored-procedure approach. In some of the experiments on other datasets, the Vertical approach performed better than the Stored-procedure approach. The GatherJoin approach is worse, mainly due to the large number of item combinations generated. In the GatherJoin approach, the extended transactions are passed to the GatherComb table function, and hence the effective number of items per transaction

PAGE 118

gets multiplied by the average depth of the taxonomy. In the GatherJoin approach, we show the performance numbers for only the second pass. Note that just the time for the second pass is an order of magnitude more than the total time for all the passes of Vertical. For [...]% support, the second pass of GatherJoin took over [...] seconds, while the total time for Vertical was only about [...] seconds. We did not do extensive experimentation here because, based on the SQL formulations and the analysis, we expect similar performance observations as in the case of boolean association rules.

Summary

In this chapter, we presented various SQL formulations for mining generalized association rules. It shows that the boolean association rule framework can be easily extended for generalized association rules. The major additions were to "extend" the input transaction table (transform T to T*)

PAGE 119

CHAPTER [...]
SEQUENTIAL PATTERNS

Sequential pattern mining utilizes the time associated with the transaction data to find frequently occurring patterns. A sequential pattern is an ordered list (sequence) of itemsets (refer Section [...] for a brief description of sequential patterns) [...] transaction time (time) and item identifier (item). Every data-sequence typically

PAGE 120

[...] The len attribute gives the length of the sequence, that is, the total number of items in all the elements of the sequence. The eno attributes store the element number of the corresponding items. For sequences of smaller length, the extra column values are set to NULL. For example, if k = [...], the sequence ((computer, modem)(printer)) is represented by the tuple ([...] computer, [...], modem, [...], printer, [...], NULL, NULL) [...] (k - 1)-length sequences F_{k-1} found in the previous pass is used as the seed set to generate candidate k-length sequences (C_k) that are potentially frequent. The candidate

PAGE 121

sequences C_k are generated from the frequent (k - 1)-sequences F_{k-1}. The schema of C_k is the same as that of frequent sequences explained above in Section [...], except that we do not require the len attribute, since all the tuples in C_k have the same length k. Candidates are generated in two steps. The join step generates a superset of C_k by joining F_{k-1} with itself. A sequence s1 joins with s2 if the subsequence obtained by dropping the first item of s1 is the same as the one obtained by dropping the last item of s2. This can be expressed in SQL as follows:

    insert into Ck
    select I1.item1, I1.eno1, ..., I1.item_{k-1}, I1.eno_{k-1},
           I2.item_{k-1}, I1.eno_{k-1} + I2.eno_{k-1} - I2.eno_{k-2}
    from F_{k-1} I1, F_{k-1} I2
    where I1.item2 = I2.item1 and ... and I1.item_{k-1} = I2.item_{k-2} and

PAGE 122

          I1.eno3 - I1.eno2 = I2.eno2 - I2.eno1 and ... and
          I1.eno_{k-1} - I1.eno_{k-2} = I2.eno_{k-2} - I2.eno_{k-3}

[...] In the join step, the sequence [...] joins with [...] to generate [...], and with [...] to generate [...]. There are no other join-compatible sequences in F_[...]. In the prune step, all candidate sequences that have a non-frequent contiguous (k - 1)-subsequence are deleted. In the above example, the sequence [...] will be deleted since its contiguous subsequence [...] is not frequent.

We perform both the join and prune steps in the same SQL statement by writing the above query as a k-way join, as shown in Figure [...]. For any k-sequence, there are at most k contiguous subsequences of length (k - 1) for which F_{k-1} needs to be checked for membership. Note that all (k - 1)-subsequences may not be contiguous because of the max-gap constraint between consecutive elements. The join predicates on I1 and I2 remain the same. The join of I1 and I2 generates a k-sequence (I1.item1, ..., I1.item_{k-1}, I2.item_{k-1})

PAGE 123

where two of its (k - 1)-subsequences are known to be frequent: it is generated from two such (k - 1)-subsequences. Membership checks for the other (k - 2) subsequences are performed using additional joins. The join predicates for these joins are enumerated by dropping one item at a time from the generated k-sequence. We first drop item2 and, if it results in a contiguous subsequence, we check for its membership in F_{k-1} by the join with I3. For the join with Ir, we drop item_{r-1}. The OR clause in the join predicate is to avoid checking for non-contiguous subsequences that are formed when eno_{r-1} [...] eno_{r[...]}. When there is no max-gap constraint, the join predicates will not contain the OR part.

While joining F_1 with itself to get C_2, we need to generate sequences where both the items appear as a single element as well as two separate elements. Also note that the prune step will not delete any candidate sequences. The generation of C_2 is expressed in SQL as:

    insert into C2
        (select I1.item1, 1, I2.item1, 2 from F1 I1, F1 I2)
        union
        (select I1.item1, 1, I2.item1, 1 from F1 I1, F1 I2
         where I1.item1 < I2.item1)

For instance, if F_1 contains {(1), (2)}, C_2 will have {((1)(1)), ((1)(2)), ((2)(1)), ((2)(2)), ((1, 2))}.

Support Counting to Find Frequent Sequences

In each pass k, we use the candidate table C_k and the input data-sequences table D to count the support. We consider SQL implementations based on SQL-92 and SQL-OR in Sections [...] and [...] respectively.
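The join step and the special case for C_2 described above can be sketched in Python. This is an illustration under assumptions, not the thesis code: a sequence is represented as a tuple of (item, eno) pairs with element numbers starting at 1, the function names are hypothetical, the sample sequences are invented, and the prune step is left out.

```python
def generate_c2(f1):
    """All 2-sequences over the frequent items: two elements, or one shared element."""
    items = sorted(f1)
    two_elements = [((a, 1), (b, 2)) for a in items for b in items]
    one_element = [((a, 1), (b, 1)) for a in items for b in items if a < b]
    return two_elements + one_element


def _renumber(seq):
    """Shift element numbers so that the first item is in element 1."""
    shift = seq[0][1] - 1
    return tuple((item, eno - shift) for item, eno in seq)


def join_step(prev):
    """Join (k-1)-sequences: s1 without its first item must equal s2 without its last."""
    out = []
    for s1 in prev:
        tail = _renumber(s1[1:])
        for s2 in prev:
            if tail == s2[:-1]:
                item, eno = s2[-1]
                # Same element offset for the new item as it had in s2.
                new_eno = s1[-1][1] + eno - s2[-2][1]
                out.append(s1 + ((item, new_eno),))
    return out


# Hypothetical frequent 3-sequences: ((1,2)(3)), ((2)(3,4)) and ((2)(3)(5)).
f3 = [((1, 1), (2, 1), (3, 2)),
      ((2, 1), (3, 2), (4, 2)),
      ((2, 1), (3, 2), (5, 3))]
c4 = sorted(join_step(f3))
c2 = generate_c2([1, 2])
```

Comparing the renumbered tail of s1 with the head of s2 plays the same role as the equalities on item attributes and on differences of eno attributes in the SQL join.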

PAGE 124

Figure [...]: Candidate generation for any k (tree diagram of the k-way join of copies I1, I2, ..., Ik of F_{k-1}; each additional join is labelled with the dropped item, from item2 to item_{k-1}, and carries the corresponding item and eno join predicates with their OR clauses)

Figure [...]: Candidate generation for k = [...] (the same tree diagram instantiated for a fixed k)

PAGE 125

Support Counting Using SQL-92

K-way Join

In the kth pass, we join the candidate table C_k with k copies of the data-sequence table D and group the join result on the candidate sequences, as shown in Figure [...]. This approach is very similar to the KwayJoin approach for association rules (refer Section [...]) except for the following two key differences. First, we use select distinct to ensure that only distinct data-sequences are counted. Second, we have additional predicates between sequence numbers that are denoted as PRED(k) in the query for brevity. The predicate PRED(k) is a conjunct (and) of terms p_ij(k) corresponding to each pair of items from C_k. p_ij(k) is expressed as:

    (Ck.eno_j = Ck.eno_i and abs(d_j.time - d_i.time) <= window-size) or
    (Ck.eno_j = Ck.eno_i + 1 and d_j.time - d_i.time <= max-gap
                             and d_j.time - d_i.time >= min-gap) or
    (Ck.eno_j > Ck.eno_i + 1)

Intuitively, these predicates check (a) if two items of a candidate sequence belong to the same element, then the difference of their corresponding transaction times is at most window-size, and (b) if two items belong to adjacent elements, then their transaction times are at most max-gap and at least min-gap apart.

We compute the frequent 1-sequences by grouping the data-sequences table on the item attribute, counting the number of distinct sequences in which the item is present and filtering the non-frequent items. The SQL query for this computation is given below.

PAGE 126

    insert into F1 select item, count(*)
    from (select distinct sid, item from D) as t
    group by item
    having count(*) > :minsup

Subquery Optimization

The subquery optimization for association rules can be applied for sequential patterns also, by splitting the support counting query in pass k into a cascade of k subqueries. The predicates p_ij can be applied either on the output of subquery Q_k or sprinkled across the different subqueries (shown as SubQ-PRED(l) in Figure [...]). The predicate SubQ-PRED(l) is a conjunct of terms p_ij(l) corresponding to each pair of items from d_l, the distinct l-item prefixes of C_k. p_ij(l) is expressed as:

    (d_l.eno_j = d_l.eno_i and abs(time_j - time_i) <= window-size) or
    (d_l.eno_j = d_l.eno_i + 1 and time_j - time_i <= max-gap
                               and time_j - time_i >= min-gap) or
    (d_l.eno_j > d_l.eno_i + 1)

Support Counting Using SQL-OR

Vertical

This approach is similar to the Vertical approach for association rules (refer Section [...]), where we convert the transaction table into a vertical format. For each item, we create a BLOB (slist) containing all (sid, time) pairs corresponding to that item. We use a table function Gather for creating the slists. The sequence table D is scanned in the

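    insert into F_k select item_1, eno_1, ..., item_k, eno_k, count(*)
    from (select distinct d_1.sid, item_1, eno_1, ..., item_k, eno_k
          from C_k, D d_1, ..., D d_k
          where d_1.item = C_k.item_1 and ... and d_k.item = C_k.item_k
            and d_1.sid = d_2.sid and ... and d_{k−1}.sid = d_k.sid
            and PRED(k)) as t
    group by item_1, eno_1, ..., item_k, eno_k
    having count(*) >= :minsup

[Tree diagram: join of C_k with d_1, ..., d_k on C_k.item_l = d_l.item under PRED(k), followed by a group by on item_1, eno_1, ..., item_k, eno_k and the filter count(*) >= :minsup.]

Figure [...]: Support counting by K-way join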
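The pairwise time-constraint terms that make up PRED(k) can be sketched outside SQL as follows. This is a minimal illustration, not the dissertation's code: the function names `p_ij` and `pred_k` are hypothetical, and the comparison operators (at most window-size, at most max-gap, at least min-gap) are an assumption taken from the intuitive description in the text.

```python
def p_ij(eno_i, eno_j, time_i, time_j, window_size, min_gap, max_gap):
    """Time-constraint term for one pair of candidate items with eno_i <= eno_j."""
    dt = time_j - time_i
    if eno_j == eno_i:
        # Same element: transaction times must lie within the sliding window.
        return abs(dt) <= window_size
    if eno_j == eno_i + 1:
        # Adjacent elements: gap must be at least min_gap and at most max_gap.
        return min_gap <= dt <= max_gap
    # Non-adjacent elements carry no direct constraint.
    return eno_j > eno_i + 1


def pred_k(enos, times, window_size, min_gap, max_gap):
    """Conjunction of p_ij over all pairs i < j of a candidate's items."""
    k = len(enos)
    return all(
        p_ij(enos[i], enos[j], times[i], times[j], window_size, min_gap, max_gap)
        for i in range(k)
        for j in range(i + 1, k)
    )
```

Here `enos` holds the element number of each item in the candidate and `times` the transaction time of the matching tuple in the data-sequence, so a call such as `pred_k([1, 1, 2], [10, 11, 15], 2, 1, 10)` checks one joined row.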

    insert into F_k select item_1, eno_1, ..., item_k, eno_k, count(*)
    from (select distinct sid, item_1, eno_1, ..., item_k, eno_k
          from (Subquery Q_k)) as t
    group by item_1, eno_1, ..., item_k, eno_k
    having count(*) >= :minsup

    Subquery Q_l (for any l between 1 and k):
    select item_1, eno_1, time_1, ..., item_l, eno_l, t_l.time, sid
    from D t_l, (Subquery Q_{l−1}) as r_{l−1},
         (select distinct item_1, eno_1, ..., item_l, eno_l from C_k) as d_l
    where r_{l−1}.item_1 = d_l.item_1 and ... and r_{l−1}.item_{l−1} = d_l.item_{l−1}
      and r_{l−1}.sid = t_l.sid and t_l.item = d_l.item_l
      and SubQPRED(l)

    Subquery Q_0: No subquery Q_0.

[Tree diagram: subquery Q_l built from subquery Q_{l−1}, the table D (as t_l) and the distinct l-item prefixes of C_k.]

Figure [...]: Subquery optimization for the KwayJoin approach

(item, sid, time) order and passed to the function Gather, which collects the (sid, time) attribute values of all tuples of D with the same item in memory. For items that meet the minimum support criterion, the function outputs an (item, slist) pair. The slists are maintained sorted using sid as the major key and time as the minor key, and are stored in a new SlistTable with the schema (item, slist).

[...]aries of it in each of the slists. Note that the slists are maintained sorted on (sid, time). The UDF also groups the slists according to the elements of the candidate sequence us-

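    insert into F_k select t.item_1, t.eno_1, ..., t.item_k, t.eno_k, cnt
    from (select item_1, eno_1, ..., item_k, eno_k,
                 CountIntersectK(C_k.eno_1, ..., C_k.eno_k, d_1.slist, ..., d_k.slist,
                                 window-size, min-gap, max-gap) as cnt
          from C_k, SlistTable d_1, ..., SlistTable d_k
          where d_1.item = C_k.item_1 and ... and d_k.item = C_k.item_k) as t
    where cnt >= :minsup

[Tree diagram: C_k joined with the SlistTable copies, the UDF CountIntersectK computing cnt, and the filter cnt >= :minsup.]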

GatherJoin

The GatherJoin approach for association rules (refer to Section [...]) can also be extended to mine sequential patterns. In this approach, we first Gather all the (tid, item) pairs corresponding to fixed values of sid. We then generate all possible k-sequences using a table function, join them with C_k, and group the join result to count the support of the candidate sequences. The time constraints are checked on the table function output using join predicates PRED(k), as in the KwayJoin approach. The number of tuples generated by the table function can be reduced by applying the max-gap, min-gap and window-size constraints inside the table function. However, PRED(k) is still required to check whether the sequences in C_k are contained in the generated sequences. The items in a candidate sequence are not lexicographically ordered. Therefore, a data-sequence containing n items can potentially support n^k [...]

[...] are applicable for sequential patterns also. The basic idea is to replace each data-sequence d with an "extended-sequence" d'. Each transaction of d is extended with all the ancestors of its items to get d'. In order to optimize the performance, we pre-compute the ancestors and prune candidates with an element that contains both an item and its ancestor.
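The construction of an extended-sequence can be illustrated with a short sketch. It assumes, purely for illustration, that the taxonomy is a tree stored as a child-to-parent dictionary and that a data-sequence is a list of (time, items) transactions; the names `ancestors` and `extend_sequence` are hypothetical.

```python
def ancestors(item, parent):
    """All ancestors of `item` in an acyclic child -> parent taxonomy."""
    result = []
    while item in parent:
        item = parent[item]
        result.append(item)
    return result


def extend_sequence(d, parent):
    """Build d' by adding to every transaction the ancestors of its items."""
    extended = []
    for time, items in d:
        ext = set(items)
        for it in items:
            ext.update(ancestors(it, parent))
        extended.append((time, ext))
    return extended
```

In practice the ancestor lists would be pre-computed once, as the text notes, rather than walked for every transaction.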

[...] experiments. The SQL queries for the higher-numbered passes become too long, since there are several join predicates, and as a result the optimizer does not generate optimal execution plans. Further, in the Vertical approach, the user-defined functions for support counting are much more complex than their association rule counterparts. We also did not implement the extensions to handle taxonomies in this case.

CHAPTER [...]
INCREMENTAL MINING

Data mining discovers information within data warehouses and finds answers to questions about your data that you haven't thought to ask. The rules discovered from a database only reflect the current state of the database. In order to make the discovered rules reliable and useful, large volumes of data need to be collected and analyzed over a period of time. This entails the development of techniques to handle large volumes of data and to main[...]

[...] for updating the frequent itemsets has been developed for the addition of new transactions to the database []. It is based on the Apriori algorithm and needs

O(n) passes over the database, where n is the size of the maximal frequent itemset. Another incremental algorithm is presented in Feldman et al. [].

In this chapter, we present an algorithm to find the new frequent itemsets with minimal recomputation when new transactions are added to or deleted from the transaction database []. Deletion is important in cases where we want to analyze the data in a sliding time window. The important characteristics of our algorithm are the following:

• Along with the frequent itemsets, we also maintain the negative border []. The algorithm uses negative borders to decide when to scan the whole database, and it can be used in conjunction with any level-wise algorithm like Apriori [] or Partition [].

Note that the negative border can be maintained while computing the frequent itemsets without any additional computation overhead.

we present the performance results of the SQL-based incremental algorithm []. In Sec[...]

[...] from F. The negative border NBd(F) of a set of frequent itemsets F can be computed by repeating the join and prune steps of the apriori-gen function in the Apriori algorithm []. This computation can be done using only the set of frequent itemsets F, and the database need not be scanned.

Definition: The negative border NBd(F) of a collection of itemsets F is defined as follows. Given a collection F ⊆ P(R) of sets, closed with respect to the set inclusion relation, the negative border NBd(F) of F consists of the minimal itemsets X ⊆ R not in F [].

The apriori-gen function (described in Figure [...]) takes as argument F_{k−1}, the set of all frequent (k−1)-itemsets. It returns the set of k-itemsets that are potentially frequent.

    function apriori-gen(F_{k−1})
      for each p and q ∈ F_{k−1} do
        if p.item_1 = q.item_1, ..., p.item_{k−2} = q.item_{k−2} and p.item_{k−1} < q.item_{k−1} then
          insert p.item_1, p.item_2, ..., p.item_{k−1}, q.item_{k−1} into C_k
      for each c ∈ C_k
        delete c from C_k if some (k−1)-subset of c is not in F_{k−1}

Figure [...]: A high-level description of the apriori-gen function

The negative border consists of all itemsets that were candidates which did not have the minimum support in the level-wise method. That is, NBd(F_k) = C_k − F_k, where C_k is the set of candidate k-itemsets, F_k is the set of frequent k-itemsets, and NBd(F_k) is the set of k-itemsets in NBd(F). Therefore, NBd(F_k) ⊆ C_k. The negativeborder-gen function to compute NBd(F) is explained in Figure [...].

    function negativeborder-gen(F)
      split F into F_1, F_2, ..., F_n, where n is the size of the largest itemset in F
      for all k ≤ n do
        compute C_{k+1} using apriori-gen(F_k)
      F ∪ NBd(F) = (∪_{k=2,...,n+1} C_k) ∪ I_1, where I_1 is the set of 1-itemsets

Figure [...]: A high-level description of the negativeborder-gen function

Lemma [...]: All 1-itemsets should be present in F ∪ NBd(F).
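The two functions above can be sketched in Python as follows. This is an illustrative reading of the figures, under the assumption that itemsets are represented as lexicographically sorted tuples and that the full item universe is passed in explicitly as `items` (a parameter the figures leave implicit).

```python
from itertools import combinations


def apriori_gen(f_prev):
    """Join and prune steps: candidate k-itemsets from frequent (k-1)-itemsets."""
    f_prev = set(f_prev)
    candidates = set()
    for p in f_prev:
        for q in f_prev:
            # Join: equal on the first k-2 items, ordered on the last item.
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # Prune: every (k-1)-subset of c must itself be frequent.
                if all(s in f_prev for s in combinations(c, len(c) - 1)):
                    candidates.add(c)
    return candidates


def negative_border_gen(frequent, items):
    """NBd(F): all 1-itemsets plus the candidates of every level, minus F."""
    frequent = set(frequent)
    by_size = {}
    for s in frequent:
        by_size.setdefault(len(s), set()).add(s)
    universe = {(i,) for i in items}  # I_1, the set of all 1-itemsets
    for level in by_size.values():
        universe |= apriori_gen(level)
    return universe - frequent
```

For example, with items a, b, c, d and F containing a, b, c and all three of their pairs, the border consists of the 1-itemset d and the 3-itemset abc; no database scan is involved.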

Proof: The lemma follows directly from the definition of the negative border, since any 1-itemset not in F is a minimal itemset not in F. □

Addition of New Transactions

When new transactions are added to the database, an old frequent itemset could potentially become infrequent in the updated database. Similarly, an old infrequent itemset could potentially become frequent in the new database. In order to solve the update problem efficiently, we maintain the frequent itemsets and the negative border along with their support counts in the database. That is, for every s ∈ F ∪ NBd(F), we maintain s.count. In the rest of this section, DB denotes the original database, db denotes the transactions that are newly added, and DB+ denotes the updated database. Also, F^DB, F^db and F^DB+ denote the frequent itemsets, and NBd(F^DB), NBd(F^db) and NBd(F^DB+) denote the negative borders, of the original database, the increment database and the updated database respectively.

Lemma [...]: Let s be any itemset such that s ∉ F^DB. Then s ∈ F^DB+ only if s ∈ F^db.

Proof: Assume that there exists an itemset s such that s ∉ F^DB, s ∈ F^DB+ and s ∉ F^db. Let t_DB(s) and t_db(s) be the number of transactions in DB and db respectively containing the itemset s. Also, let t_DB and t_db be the total number of transactions in DB and db respectively. Since s ∉ F^DB and s ∉ F^db,

    t_DB(s) / t_DB < minSupport  and  t_db(s) / t_db < minSupport.

From these two inequalities it can be shown that

    (t_DB(s) + t_db(s)) / (t_DB + t_db) < minSupport.

Therefore s ∉ F^DB+, which is a contradiction. □

Lemma [...]: Let s be an itemset such that s ∈ NBd(F). Then all possible (proper) subsets of s must be present in F.

Proof: For a contradiction, let t be an itemset such that t ⊂ s and t ∉ F. By the definition of the negative border, NBd(F) consists of the minimal itemsets not in F. Since t ∉ F, s is not a minimal itemset not in F. Therefore s cannot be in NBd(F), which is a contradiction. □

Theorem [...]: Let s be an itemset such that s ∉ F^DB ∪ NBd(F^DB) and s ∈ F^DB+. Then there exists an itemset t such that t ⊂ s, t ∈ NBd(F^DB) and t ∈ F^DB+. That is, some subset of s has moved from NBd(F^DB) to F^DB+.

Proof: Since s ∈ F^DB+, all possible subsets of s should be in F^DB+. But all the subsets of s cannot be in F^DB, because if that were the case, then s should be present in at least NBd(F^DB), if not in F^DB itself. By our assumption, s ∉ F^DB ∪ NBd(F^DB). Therefore there exists an itemset t such that t ⊂ s and t ∉ F^DB. Now we have two cases.

Case (i): t ∈ NBd(F^DB). In this case, t ∈ F^DB+, since s ∈ F^DB+ and t ⊂ s. Therefore we have found a subset of s which has moved from NBd(F^DB) to F^DB+.

Case (ii): t ∉ NBd(F^DB). That is, t ∉ F^DB ∪ NBd(F^DB). But we know that t ∈ F^DB+, since s ∈ F^DB+ and t ⊂ s. Therefore t ∉ F^DB ∪ NBd(F^DB) and t ∈ F^DB+, and hence we can apply the theorem recursively on t. Note that the size of t is less than the size of s, since t ⊂ s.

When this is applied recursively, there are two possibilities. First, for some subset of t, case (i) holds true, in which case there is a subset of t which has moved from NBd(F^DB) to F^DB+ and hence the theorem is proved. Otherwise, t will finally become a 1-itemset. By Lemma [...], we know that all 1-itemsets are present in F^DB ∪ NBd(F^DB). Since t ∉ F^DB, t ∈ NBd(F^DB), which contradicts the assumption for case (ii). That is, case (ii) is not possible if t is a 1-itemset. □

[...] in db. If an itemset t ∈ F^DB does not have minimum support in DB ∪ db, then t is removed from F^DB. This can be easily checked, since we know the support count for t in DB and db. The change in F^DB could potentially change NBd(F^DB) also. Therefore, we have to recompute the negative border using the negativeborder-gen function explained in subsection [...].

On the other hand, there could be some new itemsets which become frequent in the updated database. Let s be an itemset which gets added to the frequent itemsets of the updated database. By Lemma [...], we know that s has to be in F^db. We also know by Theorem [...] that some subset of s must move from NBd(F^DB) to F^DB+. For each itemset s ∈ F^db, we check if s gets the minimum support to move from NBd(F^DB) to F^DB+. If

    function Update-Frequent-Itemset(F^DB, NBd(F^DB), db)
      // |DB| and |db| denote the number of transactions in the original database
      // and the increment database respectively
      compute F^db
      for each itemset s ∈ F^DB ∪ NBd(F^DB) do
        t_db(s) = number of transactions in db containing s
      F^DB+ = ∅
      for each itemset s ∈ F^DB do
        if (t_DB(s) + t_db(s)) ≥ minsup * (|DB| + |db|) then
          F^DB+ = F^DB+ ∪ s
      for each itemset s ∈ F^db do
        if s ∉ F^DB and s ∈ NBd(F^DB) and (t_DB(s) + t_db(s)) ≥ minsup * (|DB| + |db|) then
          F^DB+ = F^DB+ ∪ s
      if F^DB ≠ F^DB+ then
        NBd(F^DB+) = negativeborder-gen(F^DB+)
      else
        NBd(F^DB+) = NBd(F^DB)
      if F^DB ∪ NBd(F^DB) ≠ F^DB+ ∪ NBd(F^DB+) then
        S = F^DB+
        repeat
          compute S = S ∪ NBd(S)
        until S does not grow
        F^DB+ = {x ∈ S | support(x) ≥ minsup}
          // support(x) is the support count of x in DB ∪ db
        NBd(F^DB+) = negativeborder-gen(F^DB+)

Figure [...]: A high-level description of the Update-Frequent-Itemset function
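A runnable sketch of this update step is given below. It is an interpretation rather than the original implementation and makes several labeled assumptions: itemsets are sorted tuples, transactions are Python sets held in lists, `minsup` is a fraction, and the stored counts for F^DB ∪ NBd(F^DB) arrive in the dictionary `counts`. Two simplifications are also assumed: promotion from the border is decided directly from the combined counts (by the lemma, any promoted itemset is necessarily frequent in db), and border expansion is detected as "some itemset in the new frequent set or border has no stored count". While growing the closure, new candidates that are infrequent in db are dropped, which keeps the closure from expanding to every possible itemset.

```python
from itertools import combinations


def count(itemset, transactions):
    """Number of transactions (sets of items) containing every item of itemset."""
    return sum(1 for t in transactions if set(itemset) <= t)


def negative_border(frequent, items):
    """Minimal itemsets (sorted tuples) not in `frequent`, via join and prune."""
    border = {(i,) for i in items} - frequent
    for p in frequent:
        for q in frequent:
            if len(p) == len(q) and p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                if c not in frequent and all(
                        s in frequent for s in combinations(c, len(c) - 1)):
                    border.add(c)
    return border


def update_frequent_itemsets(counts, freq_db, nbd_db, DB, db, items, minsup):
    """Return (F^DB+, NBd(F^DB+), scanned_whole_database)."""
    threshold = minsup * (len(DB) + len(db))
    tracked = freq_db | nbd_db
    # Only the increment db is scanned to refresh the stored counts.
    total = {s: counts[s] + count(s, db) for s in tracked}
    freq_new = {s for s in tracked if total[s] >= threshold}
    nbd_new = negative_border(freq_new, items)
    if freq_new | nbd_new <= tracked:
        return freq_new, nbd_new, False
    # The border expanded: grow the candidate closure, keeping only new
    # candidates that are frequent in db (others cannot be frequent overall).
    closure = set(freq_new)
    while True:
        new = {c for c in negative_border(closure, items)
               if c not in total and count(c, db) >= minsup * len(db)}
        if not new:
            break
        closure |= new
    whole = DB + db  # the single scan of the updated database
    for s in closure - set(total):
        total[s] = count(s, whole)
    freq_new = {s for s in closure if total[s] >= threshold}
    return freq_new, negative_border(freq_new, items), True
```

The third return value makes the key property visible: when the increment does not push the border outward, the original database is never read.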

none of the itemsets in NBd(F^DB) gets the minimum support, no new itemsets will be added to F^DB+. If some itemsets in NBd(F^DB) get the minimum support, we move them to F^DB+ and recompute the negative border. If F^DB ∪ NBd(F^DB) ≠ F^DB+ ∪ NBd(F^DB+), we have to find the negative border closure of F^DB+ and scan the entire database (DB+) once to find the updated frequent itemsets and negative border. The negative border closure of F is found by repeatedly computing F = F ∪ NBd(F) until F does not grow. During the database scan, all the itemsets in the negative border closure that were not originally in F ∪ NBd(F) are used as the candidate itemsets, and their support counts are computed.

The candidate set can be pruned further by applying an optimization while finding the negative border closure. It can be observed that an itemset which is not frequent in the increment database db [...]

[...] DB− denote the updated database, and F^DB− and NBd(F^DB−) denote its frequent itemsets and negative border respectively.

Lemma [...]: Let s be an itemset such that s ∈ F^DB. Then s ∉ F^DB− only if s ∈ F^db. That is, a frequent itemset s will become infrequent only if s ∈ F^db.

This lemma can be proved in the same way as Lemma [...]. The algorithm to compute the frequent itemsets and the negative border of DB− is similar to the one for the case where new transactions are added to the database, and can be derived easily.

Experimental Results

We conducted a set of experiments to evaluate the performance of our incremental algorithm. These experiments were performed on a Sun SPARCstation [...] running SunOS [...]. In this section, we report the results of some of those experiments.

The experiments were performed on synthetic data generated using the same technique as in Agrawal and Srikant []. The dataset used for the baseline experiment was T[...].I[...].D[...]K (mean size of a transaction = [...], mean size of maximal potentially large itemsets = [...], number of transactions = [...] thousand). The increment database is created as follows. We generate [...] thousand transactions, of which ([...] − d) thousand are used for the initial computation and d thousand are used as the increment, where d is the fractional size (in percentage) of the increment.

We compare the execution time of the incremental algorithm with that of running Apriori on the whole data set. Figure [...] shows the speedup of the incremental algorithm over Apriori for different minimum support thresholds. We report the results for increment sizes of [...]%, [...]%, [...]% and [...]% (shown in the legend). From the graph, it can be seen that the incremental algorithm achieves a speedup of about [...] to [...]. The algorithm shows better speedup for medium support thresholds than for low and high support thresholds. At high

[Figure [...]: Speedup of the incremental algorithm. x-axis: support threshold (in %); one series per increment size.]

support thresholds, the number of frequent itemsets and the number of passes in the level-wise algorithm are smaller, and hence it is less costly to run Apriori on the whole database. At low support thresholds, the probability of the negative border expanding is higher, and as a result the incremental algorithm may have to scan the whole database. Also, the speedup is higher for smaller increment sizes, since the incremental algorithm needs to process less data.

Comparison with FUP

The framework of FUP [] is similar to that of Apriori and contains a number of iterations. Each iteration is associated with a complete scan of the whole database, and in iteration k all the frequent k-itemsets are found. The candidate sets for each iteration are generated based on the frequent itemsets found in the previous iteration. FUP uses the following steps to compute the frequent itemsets.

At each iteration k, the support of the frequent k-itemsets (L_k) is updated against the increment database db to filter out the itemsets that are no longer frequent in the updated database.

The set of candidate sets C_k is generated by applying the apriori-gen function on the frequent (k−1)-itemsets in the updated database [...]

[...] passes over the database, where n is the size of the maximal frequent itemset. The steps involved in our incremental algorithm can be summarized as follows.

Compute the frequent itemsets of the increment database (L^db) [...]

[...] thereby reducing the I/O requirements drastically. Computing the

[...] The increment transaction table δT also has the same schema. The frequent itemsets and the negative border of size k are stored in tables with the schema (item_1, ..., item_k, count). We discuss the extensions to Subquery and Vertical, two representative approaches from the SQL-92 and SQL-OR categories. We also outline the SQL-based candidate closure computation used to compute the negative border closure.

Subquery Approach

In this approach, support counting is done by a set of k nested subqueries, where k is the size of the largest itemset. We present here the extensions to the subquery approach in Section [...] to count candidates of different sizes. Subquery Q_l finds all tids that support the distinct candidate l-itemsets (d_l). d_l comprises the distinct l-item prefixes of all candidate itemsets. However, it is sufficient to use C_l, the candidate l-itemsets, instead of d_l, since all l-item prefixes of candidate itemsets with more than l items will be present in C_l. The output of Q_l is grouped on the l items to find the support of the candidate l-itemsets. Q_l is also joined with δT (T while counting the support in the whole database) and C_{l+1} to get Q_{l+1}. The SQL queries and the corresponding tree diagrams for the above computations are given in Figure [...]. δB_l stores the support counts of all frequent and negative border l-itemsets in δT.

The output of subquery Q_l needs to be materialized, since it is used both for counting the support of l-itemsets and for generating Q_{l+1}. If the query processor is augmented to support multiple streams, where the output of an operator can be piped into more than one subsequent operator, the materialization of the Q_l's can be avoided. In basic association rule mining, we do not have to count itemsets of different sizes in the same pass, since C_{l+1} becomes available only after the frequent l-itemsets are computed.

Tables B_l and δB_l store the frequent and negative border l-itemsets and their support counts in T and δT respectively. Support counts of itemsets in the whole database can be computed by joining B_l and δB_l and adding the corresponding support counts. We add another attribute to B_l to keep track of promoted borders (itemsets that moved from the negative border to the frequent set).
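The combination of the stored counts with the counts from the increment can be sketched with an in-memory SQLite database driven from Python. The table names `B2` and `dB2` (standing in for B_l and δB_l with l = 2), the column name `cnt` and the sample rows are all hypothetical; the dissertation's actual schema uses a column named count.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table B2  (item1 text, item2 text, cnt integer);  -- counts in T
    create table dB2 (item1 text, item2 text, cnt integer);  -- counts in delta T
    insert into B2  values ('a', 'b', 40), ('a', 'c', 12);
    insert into dB2 values ('a', 'b', 3),  ('a', 'c', 9);
""")

# Join the two tables on the itemset and add the corresponding support counts.
rows = conn.execute("""
    select b.item1, b.item2, b.cnt + d.cnt as total
    from B2 b join dB2 d
      on b.item1 = d.item1 and b.item2 = d.item2
    order by b.item1, b.item2
""").fetchall()
conn.close()
```

Comparing `total` against the minimum support of the updated database then identifies both the itemsets that stay frequent and the promoted borders.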

    insert into δB_l select item_1, ..., item_l, count(*)
    from (Subquery Q_l) t
    group by item_1, ..., item_l

    Subquery Q_l (for any l between 1 and k):
    select item_1, ..., item_l, tid
    from δT t_l, (Subquery Q_{l−1}) as r_{l−1}, C_l
    where r_{l−1}.item_1 = C_l.item_1 and ... and r_{l−1}.item_{l−1} = C_l.item_{l−1}
      and r_{l−1}.tid = t_l.tid and t_l.item = C_l.item_l

    Subquery Q_0: No subquery Q_0.

[Tree diagram: subquery Q_l built from subquery Q_{l−1}, δT and C_l, with a group by feeding δB_l.]

Figure [...]: Support counting using subqueries

Computing Candidate Closure

In the Apriori algorithm, candidate itemsets are generated in two steps: the join step and the prune step. In the join step, two sets of (k−1)-itemsets, called generators and extenders, are joined to get k-itemsets. An itemset s_1 in generators joins with s_2 in extenders if the first (k−2) items of s_1 and s_2 are the same and the last item of s_1 is lexicographically smaller than the last item of s_2. The join results in an itemset that is s_1 extended with the last item of s_2. The result of the join step is subjected to subset pruning, which filters out all itemsets with a non-frequent (k−1)-subset. This can be accomplished by subsequent joins with (k−2) [...]

[...] respectively. The candidate generation process starts with C_0 as the empty set and terminates when C_k becomes empty. It is straightforward to derive SQL queries for this process, and we do not present them here (refer to Section [...]).

Vertical

In the Subquery approach, for every transaction that supports an itemset we generate (itemset, tid) tuples, resulting in large intermediate tables. The Vertical approach avoids this by collecting all tids that support an itemset into a BLOB (binary large object) and

generates (itemset, tidlist) tuples. Initially, tidlists for individual items are created using a table function. The tidlist for an itemset is obtained by intersecting the tidlists of its items using a user-defined function (UDF). The SQL queries for support counting are similar in structure to those of the Subquery approach, except for the use of UDFs to intersect the tidlists. We refer the reader to Section [...] for the details.

The increment transaction table δT is transformed into the vertical format by creating the delta tidlists of the items. The delta tidlists are used to count the support of the candidate itemsets in δT and are later merged with the original tidlists. This can be accomplished by joining the original tidlist table with the delta tidlist table and merging the tidlists with a UDF. If the incremental algorithm requires a pass over the complete data, the merged tidlists are used for support counting.

Performance Results

In this section, we report the results of some of our performance experiments to quan[...]

[...] The increment database is created as follows. We generate

[...] thousand transactions, of which ([...] − d) thousand are used for the initial computation and d thousand are used as the increment, where d is the fractional size (in percentage) of the increment.

[Figure [...]: Speedup of the incremental algorithm based on the Subquery approach]

[Figure [...]: Speedup of the incremental algorithm based on the Vertical approach]

We compare the execution time of the incremental algorithm with that of mining the whole dataset. Figures [...] and [...] show the corresponding speedups of the incremental algorithm based on the Subquery and the Vertical approaches for different minimum

support thresholds. We report the results for increment sizes of [...]%, [...]% and [...]% (shown in the legend). We can make the following observations from the graphs:

• The incremental algorithm based on the Subquery approach achieves a speedup of about [...] to [...] as compared to mining the whole dataset. However, the maximum speedup of the Vertical approach is only about [...]. For support counting, the Vertical approach uses a user-defined function (UDF) to intersect the tidlists. The incremental algorithm must also invoke the UDF at least the same number of times, since the support of all the itemsets in the frequent set and the negative border needs to be found in the increment database. In cases where the support of new candidates needs to be counted, the number of invocations will be even higher than when mining the whole dataset. The time taken by the Vertical approach in the support counting phase is directly proportional to the number of times the UDF is called. However, the incremental algorithm saves time in the tidlist creation phase, since the size of the increment dataset is only a fraction of the whole dataset. This explains why the speedup of the Vertical approach is low. In contrast, the Subquery approach achieves higher speedup, since the time taken is proportional to the size of the dataset.

It is possible to achieve better speedup for the Vertical approach by allocating a smaller BLOB (binary large object) for computations involving the increment dataset. Note that the tidlists for the items are stored as BLOBs. In our experiments, we used the same BLOB size for the increment dataset and the initial dataset in order to use the same user-defined function for support counting and the same table function for tidlist creation (refer to Section [...] for a detailed description of the Vertical approach).

• The speedup reduces as the minimum support threshold is lowered. At lower support values, the chance of the negative border expanding is higher, and as a result the incremental algorithm may have to compute the candidate closure and count the support of the new candidates in the whole dataset.

• The speedup is higher for smaller increment sizes, since the incremental algorithm needs to process less data.

• With respect to the absolute execution time, the Subquery and the Vertical ap[...]

[...]-itemsets. At this step, the new candidate k-itemsets that are infrequent in db are known to be infrequent in the whole dataset as well and can be pruned. This is because the new candidate k-itemsets were infrequent in the old dataset (they were not even in the negative border). Therefore, they need to be frequent at least in db to have a chance of being frequent in the whole dataset.

With the new-candidate optimization, we count the support of an itemset in db only if it is required. In the first phase, while counting the support in db of the itemsets in the frequent set and the negative border, we do not find all the frequent itemsets in db. During

the candidate closure computation, we count the support in db of just the new candidates for the pruning explained above. This results in better speedup as compared to the basic incremental algorithm.

[Figure [...]: Speedup of the incremental algorithm based on the Subquery approach with the new-candidate optimization]

[Figure [...]: Speedup of the incremental algorithm based on the Vertical approach with the new-candidate optimization]

Figures [...] and [...] show the speedup of the Subquery and Vertical approaches with the new-candidate optimization. We can observe that this optimization achieves speedups that are up to [...]% better. The improvement is greater at smaller increment sizes.

The reason is that when the increment is smaller, we have to use proportionately smaller minimum support values while finding the frequent itemsets in db. This could result in counting too many spurious candidates.

Other Approaches

In the Loose-coupling approach, the transaction data is read tuple by tuple from the DBMS into the mining kernel using a cursor interface. This architecture can be extended to handle incremental mining just by implementing the incremental algorithm outlined in Section [...] in the mining kernel. The DBMS interface does not require any change. In cases where the support of new itemsets needs to be counted, limiting the data access to just one scan of the whole database entails counting candidate itemsets of multiple sizes in the same pass. This can be accomplished by passing the transactions through all the candidates of different sizes and updating their support counts.

The Stored-procedure approach, where the mining algorithm is encapsulated as a stored procedure that runs in the same address space as the DBMS, and the Cache-Mine approach, where the data is cached outside the DBMS, can also be extended for incremen[...]

[...]rived. Let I = {i_1, i_2, ..., i_m} be a set of literals, called items, that are attribute values

of a set of relational tables. A constrained association is defined as a set of itemsets {X | (X ⊆ I) & C(X)}, where C denotes one or more boolean constraints. Note that we do not concentrate on generating the association rules in the traditional sense []. However, the associations between attribute values that we generate can be used for rule generation.

Categories of Constraints

We divide the constraints into four different categories that are outlined below. We illustrate each of them with sample mining computations. The data model used in our examples is that of a point-of-sale (POS) [...]

[...] (support-confidence) framework for association rule mining []. An itemset X is said to be frequent if it appears in at least s transactions, where s is the minimum support. In the point-of-sales data model, a

We refer the reader to Srikant et al. [] and Ng et al. [] for nice discussions of various kinds of constraints. Here, we categorize them based on their usage in the mining process, which is explained in Section [...] with an example.

transaction corresponds to a customer transaction. Note that in other domains, the notion of a transaction and of an itemset appearing in a transaction could be different. Frequent itemsets, the ones which satisfy the frequency constraint, are defined as {X | f(X) ≥ s}, where f(X) is the frequency of X.

Most of the algorithms for frequent itemset discovery utilize the downward closure property of itemsets with respect to the frequency constraint: that is, if an itemset is frequent, then so are all its subsets. Downward closure is a pruning property. Level-wise algorithms [] find all itemsets with a given property among itemsets of size k (k-itemsets) and use this knowledge to explore (k+1)-itemsets. They start with the assumption that all (k+1)-itemsets are potentially frequent (frequency is just an example of a downward closed property). As the k-itemsets are examined, they prune out some (k+1)-itemsets that cannot be frequent. In effect, for pruning they use the contrapositive of the frequent itemset definition: "if any subset of a (k+1)-itemset is not frequent, then neither can the (k+1)-itemset be". After the pruning, they go through the remaining list, checking each (k+1)-itemset for its frequency. The downward closure property is similar to the anti-

be associated with a frequency constraint also, where we want to find frequent itemsets that satisfy B. Three different algorithms for mining associations with item constraints are presented in Srikant et al. []. The item constraints enable us to pose mining queries such as "What are the products whose sales are affected by the sale of, say, barbecue sauce?" and "What products are bought together with sodas and snacks?". The variable constraints in Ng et al. [] are a special case of item constraints.

Aggregation Constraint

These are constraints involving aggregate functions on the items that form the itemset. For instance, in the POS example, an aggregation constraint could be of the form min(products_sold.price) ≥ p. Here we consider a product as an item. The aggregate function could be min, max, sum, count, avg or any other user-defined aggregate function. An aggregation constraint of the form min(products_sold.price) ≥ p can be used to find "expensive products that are bought together". Similarly, max(products_sold.price) ≤ q can be used to find "inexpensive products that are bought together". These aggregate functions can be combined in various ways to express a whole range of useful mining computations. For example, the constraint (min(products_sold.price) ≤ p) & (avg(products_sold.price) ≥ q) targets the mining process to inexpensive products that are bought together with the expensive ones.

External Constraint

External constraints filter the data used in the mining process. These are constraints on attributes which do not appear in the final result (which we call external attributes). For example, if we want to find "products bought during big purchases where the total

sale price of the transaction is larger than some amount", [...]

example [...]

[...]cific example using the point-of-sales data model in Section [...]. The example shown in Figure [...] finds product combinations containing "barbecue sauce" where all the products cost less than $[...] and the average price is more than $[...] (some notion of similarly priced products). The combinations should appear in at least [...] sales transactions with the total price of the transaction greater than $[...]. This gives an idea of what other moderately priced products people buy with barbecue sauce in big purchases (perhaps for parties). It could help the shop owner to decide on various promotions.

[Tree diagram: F_k and F_{k+1} computed from the tables Products_sold, Sales and Product.]

Figure [...]: Point-of-sales example for constrained association mining

In the above example, the constraint on the total price of the transaction is an external constraint and is applied in the data filtering stage. Since the maximum price of a product in the desired combination is $[...], we can also filter out records that do not satisfy this condition in the data filtering stage. The constraint on max(Price) is an aggregation constraint which satisfies the closure property and can be applied in the candidate generation phase. The constraint to include barbecue sauce in the combination and the constraint on the average price are checked in the post-counting phase, since they do not satisfy the closure property.

Incremental Constrained Association Mining

The negative border based incremental mining algorithm is applicable for mining as[...]

[...] FrequentSets ∪ NegativeBorder in db. For this, we use the framework outlined in Section [...]. The different constraints that are present can be handled at the various steps in the mining process, as shown in Figure [...].
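The placement of the different constraints in the barbecue-sauce example can be sketched as below. Everything concrete here is hypothetical: the function names, the parameter names, and the thresholds and prices used when calling it are placeholders, since the actual values are not recoverable from the text. For brevity the level-wise miner uses only the join step, and the price cap is applied once in the data filtering stage, which the text notes is possible because max(Price) satisfies the closure property.

```python
from itertools import combinations  # kept for readers who add subset pruning


def frequent_itemsets(baskets, min_count):
    """Plain level-wise mining; returns {sorted tuple: support count}."""
    level = [(i,) for i in sorted({i for b in baskets for i in b})]
    result = {}
    while level:
        counts = {c: sum(1 for b in baskets if set(c) <= b) for c in level}
        freq = {c: n for c, n in counts.items() if n >= min_count}
        result.update(freq)
        keys = sorted(freq)
        # Join step only: extend itemsets that agree on all but the last item.
        level = [p + (q[-1],) for p in keys for q in keys
                 if p[:-1] == q[:-1] and p[-1] < q[-1]]
    return result


def constrained_associations(sales, price, required, max_price, min_avg,
                             min_total, min_count):
    # Data filtering stage: external constraint on the transaction total.
    big = [b for b in sales if sum(price[p] for p in b) > min_total]
    # Closure-satisfying aggregation constraint: drop products at or above the cap.
    capped = [{p for p in b if price[p] < max_price} for b in big]
    freq = frequent_itemsets(capped, min_count)
    # Post-counting stage: required item and average price are checked last.
    return {s: n for s, n in freq.items()
            if required in s and sum(price[p] for p in s) / len(s) > min_avg}
```

Checking the required item and the average price only at the end mirrors the text: neither constraint is downward closed, so applying them earlier could discard itemsets whose supersets qualify.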

Update the support count of all itemsets in FrequentSets ∪ NegativeBorder to include their support in db.

Find the itemsets in NegativeBorder that became frequent by the addition of db; call them Promoted-NegativeBorder (PNb).

Applicability Beyond Association Mining

The incremental algorithm we have proposed makes use of the closure property of frequent itemsets. In this section, we discuss how the incremental approach can be generalized to other data mining and decision support problems. In Section [...], we discuss the applicability of the incremental algorithm to mine closed sets. Generalizations to incremental evaluation of query flocks and certain kinds of materialized views are described in Sections [...] and [...], respectively.

Mining Closed Sets

All the efficient algorithms for mining association rules exploit the closure property of frequent itemsets. Minimum support, which characterizes frequent itemsets, is downward closed: if an itemset has minimum support, then all its subsets also have minimum support. The idea of negative border can be used for all incremental mining problems that possess closure properties. If the closure property is incrementally updatable (for instance, support) [...]

[...] said to be correlated if any of its subsets are dependent. For efficient evaluation of correlation rules, a notion of support is introduced in Brin et al. [], which is downward closed. However, these properties are not incrementally updatable.

Consequent part of association rules: For any frequent itemset, if a rule with consequent c holds (that is, it has minimum confidence) [...] “uninteresting” [...]

For example, let us assume that we want to evaluate the mining query “What are the interesting drivers that caused customers to buy the widgets from a catalog?” A driver is deemed “interesting” if it has caused at least [...] customers to buy the widget. Let the data be stored in a set of relational tables, namely catalog(widget, manufacturer), sale(customer, widget), driver(customer, widget, driver). The above query can be written as a query flock in Datalog as shown below. The filter condition prunes out values which do not have minimum support. In Section [...], we discuss how the negative border idea can be used for efficient incremental evaluation of query flocks.

QUERY: answer(C) :- sale(C, W) AND driver(C, W, D) AND catalog(W, manufacturer)
FILTER: COUNT(C) >= [...]

Incremental Evaluation of Query Flocks

Applying the a-priori technique for evaluating the above query flock will result in the following safe subqueries []:

Q1: answer(C) :- sale(C, W)
Q2: answer(C) :- driver(C, W, D)
Q3: answer(C) :- sale(C, W) AND driver(C, W, D)
Q4: answer(C) :- sale(C, W) AND driver(C, W, D) AND catalog(W, manufacturer)

The query flocks corresponding to the safe subqueries form a lattice with query containment as the partial order and the original query flock as the top element. That is, if Qi and Qj are elements of the lattice, Qi ≤ Qj when the result of Qj is contained in the result of Qi. During the execution of the subqueries of the query flock, all records with parameter values which satisfy the filter condition are propagated to the next higher subquery in the lattice for further evaluation. For example, after Q1 is evaluated, all the records corre[...] depending upon the subquery representing the lattice element. It is also possible to evaluate the filter condition for all the lattice elements [...]
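The widget query flock and one of its safe subqueries can be sketched in SQL, run here through Python's standard `sqlite3` module. This is an illustration under assumptions rather than the formulation used in the dissertation: the driver is taken to be the only flock parameter, the catalog join merely restricts the result to catalogued widgets, and the function name and threshold argument are hypothetical.

```python
import sqlite3


def interesting_drivers(sales, drivers, catalog, min_customers):
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE sale(customer, widget);
        CREATE TABLE driver(customer, widget, driver);
        CREATE TABLE catalog(widget, manufacturer);
    """)
    con.executemany("INSERT INTO sale VALUES (?, ?)", sales)
    con.executemany("INSERT INTO driver VALUES (?, ?, ?)", drivers)
    con.executemany("INSERT INTO catalog VALUES (?, ?)", catalog)
    # The IN subquery plays the role of a safe subquery on driver alone:
    # a driver that fails the filter there cannot pass it in the full query,
    # so it is pruned before the three-way join is grouped.
    return con.execute("""
        SELECT d.driver, COUNT(DISTINCT s.customer) AS customers
        FROM sale AS s
        JOIN driver AS d ON d.customer = s.customer AND d.widget = s.widget
        JOIN catalog AS c ON c.widget = s.widget
        WHERE d.driver IN (
            SELECT driver FROM driver
            GROUP BY driver
            HAVING COUNT(DISTINCT customer) >= ?)
        GROUP BY d.driver
        HAVING COUNT(DISTINCT s.customer) >= ?
        ORDER BY d.driver
    """, (min_customers, min_customers)).fetchall()
```

The pruning is valid because the filter is monotone: adding subgoals can only shrink the set of customers counted for a given driver.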

[...] hand, the negative border based change propagation can also be applied for the maintenance of views involving monotone aggregate functions that satisfy the a-priori subset property. For example, if the view definition has a filter condition (such as the SQL “having” clause), it could be beneficial to also store the records which do not pass the filter condition test (same as the negative border) [...]

[...] was never worse than a factor of two on the real-life datasets. Both these approaches require additional space for caching, however. The Stored-procedure approach does not require any extra space (except possibly for initially sorting the data in the DBMS) and can perhaps be made to be within a factor of two to three of Cache-Mine with the recent algorithms []. The UDF approach is about a factor of two faster than the Stored-procedure approach, but is significantly harder to code. The SQL approach offers some secondary advantages like easier development and maintenance and potential for au[...] “extend” the input transaction table (transform T to T*). For sequential patterns, the join predicates for candidate generation and support counting were significantly different. We conducted some experiments on real-life datasets and found that the performance observations hold here also.

In order to handle the volume of data stored in present day data warehouses, which keeps streaming in, it is important to develop incremental mining algorithms. We developed an incremental association rule mining algorithm which does not need to examine the old data if the frequent itemsets do not change. Even in cases where new frequent itemsets are added, access to the old database can be limited to just one scan.

[...] “correct” support value is difficult. Initially, the frequent itemsets for an approximate support can be computed, which is further refined based on user feedback. We further show the applicability of the incremental algorithm to certain classes of data mining and decision support problems.

In Section [...], we identify certain extensions to the current database management systems that are useful for mining. We enumerate the specific contributions of this dissertation in Section [...], and in Section [...] we list possibilities for further research.

Proposed Extensions

One of the goals of our work has been to “unbundle” [...] lists obtained during multiple index scans []. In current OLAP and data warehousing systems, this operation is rampant in the popular bit-mapped indices. Our Vertical approach is curiously similar to this bit-mapped approach, with one important difference. Instead of performing ANDs on RIDs, we perform the ANDs on another attribute, the transaction identifier.

We also used the set difference operation for pruning itemsets containing an item and its ancestor in generalized association rules.

Enhanced Aggregation

Another common operation that we used is the Gather operation, that can transform two attributes in a data table to a form where, for distinct values of one of the attributes (called the grouping attribute) [...]

[...] dataset and also to fine tune input parameters. There are several mining algorithms which use sampling [].

Contributions

In this dissertation, we have addressed the following problems. (1)

1. Analyze the different database integration alternatives for data mining.
2. Develop and implement various SQL-based approaches for association rule mining.
3. Study the performance profile of current DBMSs to execute the above SQL queries.
4. Compare the different database integration architectures quantitatively and qualitatively.
5. Develop cost formulae for the SQL approaches based on input data parameters and relational operator costs. These provide some insights into enhancing current cost-based optimizers to incorporate the domain-specific semantics of mining algorithms.
6. Extend the association rule framework for mining generalized association rules and sequential patterns.
7. Develop an incremental association rule mining algorithm.
8. SQL formulations of the incremental algorithm and its generalization to handle constraints.
9. Generalize the incremental algorithm for constrained association mining and demonstrate its applicability to a larger class of data mining and decision support problems.
10. Explore primitive operators to support mining and decision support in databases.

(1) The work corresponding to the first four items was primarily done by researchers from IBM Almaden Research Center and the author was a contributor. For the remaining items, the author was the primary contributor. The file-based incremental association rule mining algorithm (item [...]) was developed primarily by the author as part of the Introduction to Parallel Computing course project.

Future Work

The work presented in this dissertation points to several directions for future research. A natural next step is to experiment with other kinds of mining operations such as classification and clustering [] to verify if our conclusions hold for these other cases too. It is also important to derive a set of primitive operators with which the different mining and decision support operations can be composed. The operators we have identified provide some headway in that direction. Another useful direction is to explore what kind of support is needed for answering short, interactive, ad hoc queries involving a mix of mining and relational operations. The success of SQL as the most popular data management language can be mainly attributed to its ad hoc query support. In the mining context, what is an ad hoc mining query? Is it just mining computations expressed with some constraints on the result? How much support can we leverage from existing relational engines for min[...]

[...] data mining and decision support operations on the same data, tighter integration of these operations with database systems will be required. This dissertation presents strategies aimed at tighter database integration of mining and identifies optimizations and primitive operators to make database systems a better platform for mining and decision support.


5()(5(1&(6 >@ 6 $JDUZDO 5 $JUDZDO 3 0 'HVKSDQGH $ *XSWD ) 1DXJKWRQ 5 5DPDNU LVKQDQ DQG 6 6DUDZDJL 2Q WKH FRPSXWDWLRQ RI PXOWLGLPHQVLRQDO DJJUHJDWHV ,Q 3URF RI WKH QG ,QWfO &RQIHUHQFH RQ 9HU\ /DUJH 'DWDEDVHV SDJHV 0XPEDL %RPED\f ,QGLD 6HSWHPEHU >@ 5 $JUDZDO $ $UQLQJ 7 %ROOLQJHU 0 0HKWD 6KDIHU DQG 5 6ULNDQW 7KH 4XHVW GDWD PLQLQJ V\VWHP ,Q 3URF RI WKH QG ,QW fO &RQIHUHQFH RQ .QRZOHGJH 'LVFRYHU\ LQ 'DWDEDVHV DQG 'DWD 0LQLQJ 3RUWODQG 2UHJRQ $XJXVW >@ 5 $JUDZDO & )DORXWVRV DQG $ 6ZDPL (IILFLHQW VLPLODULW\ VHDUFK LQ VHTXHQFH GDWDEDVHV ,Q 3URF RI WKH )RXUWK ,QWfO &RQIHUHQFH RQ )RXQGDWLRQV RI 'DWD 2UJDQLn ]DWLRQ DQG $OJRULWKPV &KLFDJR 2FWREHU $OVR LQ /HFWXUH 1RWHV LQ &RPSXWHU 6FLHQFH 6SULQJHU 9HUODJ >@ 5 $JUDZDO *HKUNH *XQRSXORV DQG 3 5DJKDYDQ $XWRPDWLF VXEVSDFH FOXVn WHULQJ RI KLJK GLPHQVLRQDO GDWD IRU GDWD PLQLQJ DSSOLFDWLRQV ,Q 3URF RI WKH $&0 6,*02' &RQIHUHQFH RQ 0DQDJHPHQW RI 'DWD 6HDWWOH :DVKLQJWRQ -XQH >@ 5 $JUDZDO 6 *KRVK 7 ,PLHOLQVNL % ,\HU DQG $ 6ZDPL $Q LQWHUYDO FODVVLILHU IRU GDWDEDVH PLQLQJ DSSOLFDWLRQV ,Q 3URF RI WKH 9/'% &RQIHUHQFH SDJHV 9DQFRXYHU %ULWLVK &ROXPELD &DQDGD $XJXVW >@ 5 $JUDZDO 7 ,PLHOLQVNL DQG $ 6ZDPL 'DWDEDVH PLQLQJ $ SHUIRUPDQFH SHUn VSHFWLYH ,((( 7UDQVDFWLRQV RQ .QRZOHGJH DQG 'DWD (QJLQHHULQJ f 'HFHPEHU >@ 5 $JUDZDO 7 ,PLHOLQVNL DQG $ 6ZDPL 0LQLQJ DVVRFLDWLRQ UXOHV EHWZHHQ VHWV RI LWHPV LQ ODUJH GDWDEDVHV ,Q 3URF RI WKH $&0 6,*02' &RQIHUHQFH RQ 0DQDJHPHQW RI 'DWD SDJHV :DVKLQJWRQ '& 0D\ >@ 5 $JUDZDO /LQ + 6 6DZKQH\ DQG 6KLP )DVW VLPLODULW\ VHDUFK LQ WKH SUHVHQFH RI QRLVH VFDOLQJ DQG WUDQVODWLRQ LQ WLPHVHULHV GDWDEDVHV ,Q 3URF RI WKH VW ,QWfO &RQIHUHQFH RQ 9HU\ /DUJH 'DWDEDVHV SDJHV =XULFK 6ZLW]HUODQG 6HSWHPEHU >@ 5 $JUDZDO + 0DQQLOD 5 6ULNDQW + 7RLYRQHQ DQG $ 9HUNDPR )DVW GLVFRYHU\ RI DVVRFLDWLRQ UXOHV ,Q 8 0 )D\\DG 3 6KDSLUR 3 6P\WK DQG 5 8WKXUXVDP\ HGLWRUV $GYDQFHV LQ .QRZOHGJH 'LVFRYHU\ DQG 'DWD 0LQLQJ FKDSWHU SDJHV $$$,0,7 3UHVV


>@ 5 $JUDZDO 3VDLOD ( / :LPPHUV DQG 0 =DLW 4XHU\LQJ VKDSHV RI KLVWRULHV ,Q 3URF RI WKH VW ,QWfO &RQIHUHQFH RQ 9HU\ /DUJH 'DWDEDVHV =XULFK 6ZLW]HUODQG 6HSWHPEHU >@ 5 $JUDZDO DQG 6KDIHU 3DUDOOHO PLQLQJ RI DVVRFLDWLRQ UXOHV ,((( 7UDQVDFWLRQV RQ .QRZOHGJH DQG 'DWD (QJLQHHULQJ f 'HFHPEHU >@ 5 $JUDZDO DQG 6KLP 'HYHORSLQJ WLJKWO\FRXSOHG GDWD PLQLQJ DSSOLFDWLRQV RQ D UHODWLRQDO GDWDEDVH V\VWHP ,Q 3URF RI WKH QG ,QWfO &RQIHUHQFH RQ .QRZOHGJH 'LVFRYHU\ LQ 'DWDEDVHV DQG 'DWD 0LQLQJ 3RUWODQG 2UHJRQ $XJXVW >@ 5 $JUDZDO DQG 5 6ULNDQW )DVW DOJRULWKPV IRU PLQLQJ DVVRFLDWLRQ UXOHV ,Q 3URF RI WKH WK ,QWfO &RQIHUHQFH RQ 9HU\ /DUJH 'DWDEDVHV 6DQWLDJR &KLOH 6HSWHPEHU >@ 5 $JUDZDO DQG 5 6ULNDQW 0LQLQJ VHTXHQWLDO SDWWHUQV ,Q 3URF RI WKH WK ,QWfO &RQIHUHQFH RQ 'DWD (QJLQHHULQJ 7DLSHL 7DLZDQ 0DUFK >@ 6 $OH[DQGHU 8VHUV ILQG WDQJLEOH UHZDUGV GLJJLQJ LQWR GDWD PLQHV (QWHUSULVH &RPn SXWLQJ -XO\ KWWSZZZLQIRZRUOGFRPFJLELQGLVSOD\$UFKLYHSOH KWP >@ $OL 6 0DQJDQDULV DQG 5 6ULNDQW 3DUWLDO FODVVLILFDWLRQ XVLQJ DVVRFLDWLRQ UXOHV ,Q 3URF RI WKH UG ,QWfO &RQIHUHQFH RQ .QRZOHGJH 'LVFRYHU\ LQ 'DWDEDVHV DQG 'DWD 0LQLQJ 1HZSRUW %HDFK &DOLIRUQLD $XJXVW >@ $OVDEWL 6 5DQND DQG 9 6LQJK &/28'6 $ GHFLVLRQ WUHH FODVVLILHU IRU ODUJH GDWDVHWV ,Q 3URF RI WKH WK ,QWfO &RQIHUHQFH RQ .QRZOHGJH 'LVFRYHU\ DQG 'DWD 0LQLQJ 1HZ @ & $SWH DQG 6 +RQJ 3UHGLFWLQJ HTXLW\ UHWXUQV IURP VHFXULWLHV GDWD ZLWK PLQLPDO UXOH JHQHUDWLRQ ,Q .'' $$$, :RUNVKRS RQ .QRZOHGJH 'LVFRYHU\ LQ 'DWDEDVHV SDJHV 6HDWWOH :DVKLQJWRQ -XO\ >@ & $SWH DQG 6 :HLVV 'DWD PLQLQJ ZLWK GHFLVLRQ WUHHV DQG GHFLVLRQ UXOHV )*&6 -RXUQDO 6SHFLDO ,VVXH RQ 'DWD 0LQLQJ >@ %DQILHOG DQG $ 5DIWHU\ 0RGHOEDVHG JDXVVLDQ DQG QRQJDXVVLDIW FOXVWHULQJ %LRn PHWULFV >@ 6 %HUFKWROG & %RKP % %UDXQPXOOHU $ .HLP DQG + .ULHJHO )DVW SDUDOOHO VLPLODULW\ VHDUFK LQ PXOWLPHGLD GDWDEDVHV ,Q 3URF RI WKH $&0 6,*02' &RQIHUHQFH RQ 0DQDJHPHQW RI 'DWD 0D\ >@ 6 %HUFKWROG $ .HLP DQG + .ULHJHO 7KH ;7UHH $Q LQGH[ VWUXFWXUH IRU KLJKn GLPHQVLRQDO GDWD ,Q 3URF RI WKH QG ,QWfO &RQIHUHQFH RQ 9HU\ /DUJH 'DWDEDVHV 
%RPED\ ,QGLD 6HSWHPEHU >@ 3 %UDGOH\ 8 )D\\DG DQG & 5HLQD 6FDOLQJ FOXVWHULQJ DOJRULWKPV WR ODUJH GDWDEDVHV ,Q 3URF RI WKH WK ,QWfO &RQIHUHQFH RQ .QRZOHGJH 'LVFRYHU\ DQG 'DWD 0LQLQJ 1HZ @ / %UHLPDQ + )ULHGPDQ 5 $ 2OVKHQ DQG & 6WRQH &ODVVLILFDWLRQ DQG 5HJUHVVLRQ 7UHHV :DGVZRUWK %HOPRQW


>@ 6 %ULQ 5 0RWZDQL DQG & 6LOYHUVWHLQ %H\RQG PDUNHW EDVNHWV *HQHUDOL]LQJ DVVRFLn DWLRQ UXOHV WR FRUUHODWLRQV ,Q 3URF RI WKH $&0 6,*02' &RQIHUHQFH RQ 0DQDJHPHQW RI 'DWD 0D\ >@ 6 %ULQ 5 0RWZDQL 8OOPDQ DQG 6 7VXU '\QDPLF LWHPVHW FRXQWLQJ DQG LPSOLFDWLRQ UXOHV IRU PDUNHW EDVNHW GDWD ,Q 3URF RI WKH $&0 6,*02' &RQIHUHQFH RQ 0DQDJHPHQW RI 'DWD 0D\ >@ %XVLQHVV :HHN 'DWDEDVH PDUNHWLQJ 6HSWHPEHU >@ &DWOHWW 0HJDLQGXFWLRQ 0DFKLQH /HDUQLQJ RQ 9HU\ /DUJH 'DWDEDVHV 3K' WKHVLV 8QLYHUVLW\ RI 6\GQH\ >@ &DWOHWW 2YHUSUXQLQJ ODUJH GHFLVLRQ WUHHV ,Q WK ,QWfO -RLQW &RQIHUHQFH RQ $, $XJXVW >@ 6 &KDNUDEDUWL % 'RP 5 $JUDZDO DQG 3 5DJKDYDQ 8VLQJ WD[RQRP\ GLVFULPLn QDQWV DQG VLJQDWXUHV IRU QDYLJDWLQJ LQ WH[W GDWDEDVHV ,Q 3URF RI WKH 9/'% &RQn IHUHQFH $WKHQV *UHHFH $XJXVW >@ 6 &KDNUDEDUWL % 'RP DQG 3 ,QG\N (QKDQFHG K\SHUWH[W FDWHJRUL]DWLRQ XVLQJ K\SHUOLQNV ,Q 3URF RI WKH $&0 6,*02' &RQIHUHQFH RQ 0DQDJHPHQW RI 'DWD 6HDWWOH :DVKLQJWRQ -XQH >@ &KDPEHUOLQ 8VLQJ WKH 1HZ '% ,%0fV 2EMHFW5HODWLRQDO 'DWDEDVH 6\VWHP 0RUJDQ .DXIPDQQ >@ 6 &KDXGKXUL 'DWD PLQLQJ DQG GDWDEDVH V\VWHPV :KHUH LV WKH LQWHUVHFWLRQ" ,((( 'DWD (QJLQHHULQJ %XOOHWLQ ff§ 0DUFK >@ 6 &KDXGKXUL DQG 8 'D\DO $Q RYHUYLHZ RI GDWD ZDUHKRXVLQJ DQG RODS WHFKQRORJ\ $&0 6,*02' 5HFRUG Of 0DUFK >@ 6 &KDXGKXUL 5 0RWZDQL DQG 9 1DUDVD\\D 5DQGRP VDPSOLQJ IRU KLVWRJUDP FRQVWUXFWLRQ +RZ PXFK LV HQRXJK" ,Q 3URF RI WKH $&0 6,*02' &RQIHUHQFH RQ 0DQDJHPHQW RI 'DWD 6HDWWOH :DVKLQJWRQ -XQH >@ &KHXQJ +DQ 9 1J DQG & < :RQJ 0DLQWHQDQFH RI GLVFRYHUHG DVVRFLDWLRQ UXOHV LQ ODUJH GDWDEDVHV $Q LQFUHPHQWDO XSGDWLQJ WHFKQLTXH ,Q 3URF RI ,QWfO &RQIHUHQFH RQ 'DWD (QJLQHHULQJ 1HZ 2UOHDQV 86$ )HEUXDU\ >@ &RRSHU DQG ( +HUVNRYLWV $ %D\HVLDQ PHWKRG IRU WKH LQGXFWLRQ RI SUREDELOLVWLF QHWZRUNV IURP GDWD 0DFKLQH /HDUQLQJ >@ 'DYLG 6KHSDUG $VVRFLDWHV %XVLQHVV 2QH ,UZLQ ,OOLQRLV 7KH QHZ GLUHFW PDUNHWLQJ >@ 'HQQLQJ $Q LQWUXVLRQGHWHFWLRQ PRGHO ,((( 7UDQVDFWLRQV RQ 6RIWZDUH (QJLn QHHULQJ f >@ 7 'LHWWHULFK DQG 5 6 0LFKDOVNL 'LVFRYHULQJ SDWWHUQV LQ VHTXHQFHV RI HYHQWV $UWLILFLDO 
,QWHOOLJHQFH


>@ 'LUHFW 0DUNHWLQJ $VVRFLDWLRQ 0DQDJLQJ GDWDEDVH PDUNHWLQJ WHFKQRORJ\ IRU VXFFHVV >@ 0 (VWHU + .ULHJHO DQG ; ;X $ GDWDEDVH LQWHUIDFH IRU FOXVWHULQJ LQ ODUJH VSDWLDO GDWDEDVHV ,Q 3URF RI WKH VW ,QWfO &RQIHUHQFH RQ .QRZOHGJH 'LVFRYHU\ LQ 'DWDEDVHV DQG 'DWD 0LQLQJ 0RQWUHDO &DQDGD $XJXVW >@ & )DORXWVRV 0 5DQJDQDWKDQ DQG < 0DQRORSRXORV )DVW VXEVHTXHQFH PDWFKLQJ LQ WLPHVHULHV GDWDEDVHV ,Q 3URF RI WKH $&0 6,*02' &RQIHUHQFH RQ 0DQDJHPHQW RI 'DWD 0D\ >@ 8 )D\\DG & 5HLQD DQG 3 %UDGOH\ ,QLWLDOL]DWLRQ RI LWHUDWLYH UHILQHPHQW FOXVWHULQJ DOJRULWKPV ,Q 3URF RI WKH IWK ,QWfO &RQIHUHQFH RQ .QRZOHGJH 'LVFRYHU\ DQG 'DWD 0LQLQJ 1HZ @ 8 )D\\DG 3 6KDSLUR 3 6P\WK DQG 5 8WKXUXVDP\ HGLWRUV $GYDQFHV LQ .QRZOHGJH 'LVFRYHU\ DQG 'DWD 0LQLQJ $$$,0,7 3UHVV >@ 8 )D\\DG 1 :HLU DQG 6 'MRUJRYVNL 6NLFDW $ PDFKLQH OHDUQLQJ V\VWHP IRU DXWRPDWHG FDWDORJLQJ RI ODUJH VFDOH VN\ VXUYH\V ,Q 2WK ,QW &RQIHUHQFH RQ 0DFKLQH /HDUQLQJ -XQH >@ 5 )HOGPDQ < $XPDQQ $ $PLU DQG + 0DQQLOD (IILFLHQW DOJRULWKPV IRU GLVFRYn HULQJ IUHTXHQW VHWV LQ LQFUHPHQWDO GDWDEDVHV ,Q 3URFHHGLQJV RI WKH 6,*02' :RUNVKRS RQ 5HVHDUFK ,VVXHV RQ 'DWD 0LQLQJ DQG .QRZOHGJH 'LVFRYHU\ 7XFVRQ $UL]RQD 0D\ >@ + )LVKHU .QRZOHGJH DFTXLVLWLRQ YLD LQFUHPHQWDO FRQFHSWXDO FOXVWHULQJ 0DFKLQH /HDUQLQJ f >@ 7 )XNXGD < 0RULPRWR 6 0RULVKLWD DQG 7 7RNX\DPD 'DWD PLQLQJ XVLQJ WZR GLPHQVLRQDO RSWLPL]HG DVVRFLDWLRQ UXOHV 6FKHPH DOJRULWKPV DQG YLVXDOL]DWLRQ ,Q 3URF RI WKH $&0 6,*02' &RQIHUHQFH RQ 0DQDJHPHQW RI 'DWD 0RQWUHDO &DQDGD -XQH >@ 7 )XNXGD < 0RULPRWR 6 0RULVKLWD DQG 7 7RNX\DPD 0LQLQJ RSWLPL]HG DVVRFLDnn WLFV IRU FODVVLILFDWLRQ IURP ODUJH 64/ GDWDEDVHV ,Q 3URF RI WKH WK ,QWfO &RQIHUHQFH RQ .QRZOHGJH 'LVFRYHU\ DQG 'DWD 0LQLQJ 1HZ

>@ *UD\ $ %RVZRUWK $ /D\PDQ DQG + 3LUDKHVK 'DWD FXEH $ UHODWLRQDO DJJUHn JDWLRQ RSHUDWRU JHQHUDOL]LQJ JURXSE\ FURVVWDE DQG VXEWRWDO ,Q 3URF RI ,QWfO &RQIHUHQFH RQ 'DWD (QJLQHHULQJ 1HZ 2UOHDQV 86$ )HEUXDU\ >@ *UD\ 6 &KDXGKXUL $ %RVZRUWK $ /D\PDQ ) 3HOORZ DQG + 3LUDKHVK 'DWD FXEH $ UHODWLRQDO DJJUHJDWLRQ RSHUDWRU JHQHUDOL]LQJ JURXSE\ FURVVWDE DQG VXEn WRWDOV 'DWD 0LQLQJ DQG .QRZOHGJH 'LVFRYHU\ ff§ >@ 6 *XKD 5 5DVWRJL DQG 6KLP &85( $Q HIILFLHQW FOXVWHULQJ DOJRULWKP IRU ODUJH GDWDEDVHV ,Q 3URF RI WKH $ &0 6,*02' &RQIHUHQFH RQ 0DQDJHPHQW RI 'DWD 6HDWWOH :DVKLQJWRQ -XQH >@ *XQRSXORV + 0DQQLOD DQG 6 6DOXMD 'LVFRYHULQJ DOO PRVW VSHFLILF VHQWHQFHV E\ UDQGRPL]HG DOJRULWKPV ,Q 3URF RI WKH VL[WK ,QWHUQDWLRQDO &RQIHUHQFH RQ 'DWDEDVH 7KHRU\ -DQXDU\ >@ 3 +DDV ) 1DXJKWRQ 6 6HVKDGUL DQG / 6WRNHV 6DPSOLQJEDVHG HVWLPDn WLRQ RI WKH QXPEHU RI GLVWLQFW YDOXHV RI DQ DWWULEXWH ,Q 3URFHHGLQJV RI WKH (LJKWK ,QWHUQDWLRQDO &RQIHUHQFH RQ 9HU\ /DUJH 'DWDEDVHV 9/'%ffO &RQIHUHQFH RQ 9HU\ /DUJH 'DWDEDVHV =XULFK 6ZLW]HUODQG 6HSWHPEHU >@ +DQ < )X .RSHUVNL : :DQJ DQG 2 =DLDQH '04/ $ GDWD PLQLQJ TXHU\ ODQJXDJH IRU UHODWLRQDO GDWEDVHV ,Q 3URF RI WKH 6,*02' ZRUNVKRS RQ UHVHDUFK LVVXHV RQ GDWD PLQLQJ DQG NQRZOHGJH GLVFRYHU\ 0RQWUHDO &DQDGD 0D\ >@ 6 +RQJ 50,1, $ KHXULVWLF DOJRULWKP IRU JHQHUDWLQJ PLQLPDO UXOHV IURP H[DPn SOHV ,Q UG 3DFLILF 5LP ,QWfO &RQIHUHQFH RQ $, $XJXVW >@ 0 +RXWVPD DQG $ 6ZDPL 6HWRULHQWHG PLQLQJ RI DVVRFLDWLRQ UXOHV ,Q ,QWfO &RQn IHUHQFH RQ 'DWD (QJLQHHULQJ 7DLSHL 7DLZDQ 0DUFK >@ 7 ,PLHOLQVNL )URP ILOH PLQLQJ WR GDWDEDVH PLQLQJ ,Q 3URF RI WKH 6,*02' ZRUNVKRS RQ UHVHDUFK LVVXHV RQ GDWD PLQLQJ DQG NQRZOHGJH GLVFRYHU\ SDJHV 0D\ >@ 7 ,PLHOLQVNL DQG + 0DQQLOD $ GDWDEDVH SHUVSHFWLYH RQ NQRZOHGJH GLVFRYHU\ &RPn PXQLFDWLRQ RI WKH $&0 OOf 1RY


>@ 7 ,PLHOLQVNL $ 9LUPDQL DQG $ $EGXOJKDQL 'LVFRYHU\ ERDUG DSSOLFDWLRQ SURJUDPn PLQJ LQWHUIDFH DQG TXHU\ ODQJXDJH IRU GDWDEDVH PLQLQJ ,Q 3URF RI WKH QG ,QWfO &RQIHUHQFH RQ .QRZOHGJH 'LVFRYHU\ DQG 'DWD 0LQLQJ 3RUWODQG 2UHJRQ $XJXVW >@ ,QWHUQDWLRQDO %XVLQHVV 0DFKLQHV :KLWH SDSHU RQ GDWD PLQLQJ 7HFKQLFDO UHSRUW >@ ,QWHUQDWLRQO %XVLQHVV 0DFKLQHV ,%0 ,QWHOOLJHQW 0LQHU 8VHUfV *XLGH 9HUVLRQ 5Hn OHDVH 6+ HGLWLRQ -XO\ >@ + 9 -DJDGLVK $ UHWULHYDO WHFKQLTXH IRU VLPLODU VKDSHV ,Q 3URF RI WKH $&0 6,*02' &RQIHUHQFH RQ 0DQDJHPHQW RI 'DWD SDJHV 'HQYHU 0D\ >@ + 9 -DJDGLVK DQG & )DORXWVRV 'DWD UHGXFWLRQ D WXWRULDO )RXUWK LQWHUQDWLRQDO FRQIHUHQFH RQ .QRZOHGJH 'LVFRYHU\ DQG 'DWD 0LQLQJ WXWRULDO $XJXVW >@ 0 .OHPHWWLQHQ + 0DQQLOD 3 5RQNDLQHQ + 7RLYRQHQ DQG $ 9HUNDPR )LQGLQJ LQWHUHVWLQJ UXOHV IURP ODUJH VHWV RI GLVFRYHUHG DVVRFLDWLRQ UXOHV ,Q 7KLUG ,QWfO &RQnf 6HSWHPEHU >@ .XONDUQL 2EMHFW RULHQWHG H[WHQVLRQV LQ 64/ D VWDWXV UHSRUW 6,*02' 5(&25' >@ /LQ + 9 -DJDGLVK DQG & )DORXWVRV 7KH 797UHH $Q LQGH[ VWUXFWXUH IRU KLJKGLPHQVLRQDO GDWD 9/'% -RXUQDO f >@ /RKPDQ % /LQGVD\ + 3LUDKHVK DQG % 6FKLHIHU ([WHQVLRQV WR VWDUEXUVW 2EMHFWV W\SHV IXQFWLRQV DQG UXOHV &RPPXQLFDWLRQV RI WKH $&0 f 2FWREHU >@ /XELQVN\ 'LVFRYHU\ IURP GDWDEDVHV $ UHYLHZ RI $, DQG VWDWLVWLFDO WHFKQLTXHV ,Q ,-&$, :RUNVKRS RQ .QRZOHGJH 'LVFRYHU\ LQ 'DWDEDVHV SDJHV 'HWURLW $XJXVW >@ $ 0DMRU DQG & )HQJ ()' $ K\EULG NQRZOHGJHVWDWLVWLFDOEDVHG V\VWHP IRU WKH GHWHFWLRQ RI IUDXG ,QWfO -RXUQDO RI ,QWHOOLJHQW 6\VWHPV f >@ + 0DQQLOD DQG + 7RLYRQHQ 2Q DQ DOJRULWKP IRU ILQGLQJ DOO LQWHUHVWLQJ VHQWHQFHV ,Q &\EHUQHWLFV DQG 6\VWHPV 9ROXPH ,, 7KH WK (XURSHDQ 0HHWLQJ R Q &\EHUQHWLFV DQG 6\VWHPV 5HVHDUFK 9LHQQD $XVWULD $SULO


>@ + 0DQQLOD + 7RLYRQHQ DQG $ 9HUNDPR 'LVFRYHULQJ IUHTXHQW HSLVRGHV LQ VHn TXHQFHV ,Q 3URF RI WKH VW ,QWfO &RQIHUHQFH RQ .QRZOHGJH 'LVFRYHU\ LQ 'DWDEDVHV DQG 'DWD 0LQLQJ 0RQWUHDO &DQDGD $XJXVW >@ 0 0HKWD 5 $JUDZDO DQG 5LVVDQHQ 6/,4 $ IDVW VFDODEOH FODVVLILHU IRU GDWD PLQLQJ ,Q 3URF RI WKH )LIWK ,QWfO &RQIHUHQFH RQ ([WHQGLQJ 'DWDEDVH 7HFKQRORJ\ ('%7f $YLJQRQ )UDQFH 0DUFK >@ 0 0HKWD 5LVVDQHQ DQG 5 $JUDZDO 0'/EDVHG GHFLVLRQ WUHH SUXQLQJ ,Q 3URF RI WKH VW ,QWfO &RQIHUHQFH RQ .QRZOHGJH 'LVFRYHU\ LQ 'DWDEDVHV DQG 'DWD 0LQLQJ 0RQWUHDO &DQDGD $XJXVW >@ 0HOWRQ DQG 1 0DWWRV 64/f§D WXWRULDO 7ZHQW\VHFRQG LQWHUQDWLRQDO FRQIHUHQFH RQ 9HU\ ODUJH GDWD EDVHV WXWRULDO 6HSWHPEHU >@ 0HOWRQ DQG $ 6LPRQ 8QGHUVWDQGLQJ WKH QHZ 64/ $ FRPSOHWH JXLGH 0RUJDQ .DXIIPDQ >@ 5 0HR 3VDLOD DQG 6 &HUL $ QHZ 64/ OLNH RSHUDWRU IRU PLQLQJ DVVRFLDWLRQ UXOHV ,Q 3URF RI WKH QG ,QWfO &RQIHUHQFH RQ 9HU\ /DUJH 'DWDEDVHV %RPED\ ,QGLD 6HS >@ 5 6 0LFKDOVNL $ WKHRU\ DQG PHWKRGRORJ\ RI LQGXFWLYH OHDUQLQJ ,Q 0LFKDOVNL HW DO HGLWRUV 0DFKLQH /HDUQLQJ $Q $UWLILFLDO ,QWHOOLJHQFH $SSURDFK 9RO 0RUJDQ .DXIPDQQ >@ 0LFKLH 6SLHJHOKDOWHU DQG & & 7D\ORU 0DFKLQH /HDUQLQJ 1HXUDO DQG 6WDWLVWLFDO &ODVVLILFDWLRQ (OOLV +RUZRRG >@ 5 0LOOHU DQG < @ & 0RKDQ +DGHUOH < :DQJ DQG &KHQJ 6LQJOH WDEOH DFFHVV XVLQJ PXOWLn SOH LQGH[HV RSWLPL]DWLRQ H[HFXWLRQ DQG FRQFXUUHQF\ FRQWURO WHFKQLTXHV ,Q 3URF ,QWHUQDWLRQDO &RQIHUHQFH RQ ([WHQGLQJ 'DWDEDVH 7HFKQRORJ\ SDJHV >@ 0XPLFN 4XDVV DQG % 0XPLFN 0DLQWHQDQFH RI GDWD FXEHV DQG VXPPDU\ WDEOHV LQ D ZDUHKRXVH ,Q 3URF RI WKH $&0 6,*02' &RQIHUHQFH RQ 0DQDJHPHQW RI 'DWD 7XFVRQ $UL]RQD 0D\ >@ 5 0XVLFN &DWOHWW DQG 6 5XVVHOO 'HFLVLRQWKHRUHWLF VXEVDPSOLQJ IRU LQGXFWLRQ RQ ODUJH GDWVHWV ,Q WK ,QWfO &RQIHUHQFH RQ 0DFKLQH /HDUQLQJ >@ 3 1HDUKRV 0 5RWKPDQ DQG 0 6 9LYHURV $SSO\LQJ GDWD PLQLQJ WHFKQLTXHV WR D KHDOWK LQVXUDQFH LQIRUPDWLRQ V\VWHP ,Q 3URF RI WKH QG ,QWfO &RQIHUHQFH RQ 9HU\ /DUJH 'DWDEDVHV 0XPEDL %RPED\f ,QGLD 6HSWHPEHU >@ 5 7 1J / 9 6 /DNVKPDQDQ +DQ DQG $ 3DQJ ([SORUDWRU\ PLQLQJ DQG SUXQLQJ RSWLPL]DWLRQV 
RI FRQVWUDLQHG DVVRFLDWLRQ UXOHV ,Q 3URF RI WKH $&0 6,*02' &RQIHUHQFH RQ 0DQDJHPHQW RI 'DWD 6HDWWOH :DVKLQJWRQ -XQH >@ 2UDFOH 2UDFOH 5'%06 'DWDEDVH $GPLQLVWUDWRUfV *XLGH 9ROXPHV ,, 9HUVLRQ f 0D\


>@6 3DUN 0 6 &KHQ DQG 3 6 @ 3RVWJUH64/ 2UJDQL]DWLRQ 3RVWJUH64/ 8VHU 0DQXDO )HEUXDU\ KWWSZZZSRVWJUHVTORUJ >@ 4XHVW ,%0 GHYHORSV PDUNHW EDVNHW DQDO\VLV V\VWHP 6WRUHV 0D\ >@ 5 4XLQODQ ,QGXFWLRQ RYHU ODUJH GDWDEDVHV 7HFKQLFDO 5HSRUW 67$1&6 6WDQIRUG 8QLYHUVLW\ >@ 5 4XLQODQ &I 3URJUDPV IRU 0DFKLQH /HDUQLQJ 0RUJDQ .DXIPDQ >@ 5 4XLQODQ DQG 5 / 5LYHVW ,QIHUULQJ GHFLVLRQ WUHHV XVLQJ PLQLPXP GHVFULSWLRQ OHQJWK SULQFLSOH ,QIRUPDWLRQ DQG &RPSXWDWLRQ >@ 5DMDPDQL % ,\HU DQG $ &KDGGKD 8VLQJ '%fn SOLFDWLRQV LQ WKH %LRVFLHQFHV Of >@ 6 6DUDZDJL 6 7KRPDV DQG 5 $JUDZDO ,QWHJUDWLQJ DVVRFLDWLRQ UXOH PLQLQJ ZLWK UHODWLRQDO GDWDEDVH V\VWHPV $OWHUQDWLYHV DQG LPSOLFDWLRQV ,Q 3URF RI WKH $&0 6,*02' &RQIHUHQFH RQ 0DQDJHPHQW RI 'DWD 6HDWWOH :DVKLQJWRQ -XQH >@ 6 6DUDZDJL 6 7KRPDV DQG 5 $JUDZDO ,QWHJUDWLQJ DVVRFLDWLRQ UXOH PLQLQJ ZLWK UHODWLRQDO GDWDEDVH V\VWHPV $OWHUQDWLYHV DQG LPSOLFDWLRQV 5HVHDUFK 5HSRUW 5f ,%0 $OPDGQ 5HVHDUFK &HQWHU 6DQ -RVH &DOLIRUQLD 0DUFK /RQJHU YHUVLRQ RI WKH 6,*02' SDSHUf >@ $ 6DYDVHUH ( 2PLHFLQVNL DQG 6 1DYDWKH $Q HIILFLHQW DOJRULWKP IRU PLQLQJ DVVRFLn DWLRQ UXOHV LQ ODUJH GDWDEDVHV ,Q 3URF RI WKH 9/'% &RQIHUHQFH =XULFK 6ZLW]HUODQG 6HSWHPEHU >@ 6 = 6HOLP DQG 0 $ ,VPDLO .PHDQVW\SH DOJRULWKPV $ JHQHUDOL]HG FRQYHUJHQFH WKHRUHP DQG FKDUDFWHUL]DWLRQ RI ORFDO RSWLPDOLW\ ,((( 7UDQVDFWLRQV RQ 3DWWHUQ $QDO\VLV DQG 0DFKLQH ,QWHOOLJHQFH 3$0,f >@ 6KDIHU DQG 5 $JUDZDO 3DUDOOHO DOJRULWKPV IRU KLJKGLPHQVLRQDO VLPLODULW\ MRLQV IRU GDWD PLQLQJ DSSOLFDWLRQV ,Q 3URF RI WKH 9/'% &RQIHUHQFH $WKHQV *UHHFH $XJXVW


>@ 6KDIHU 5 $JUDZDO DQG 0 0HKWD 635,17 $ VFDODEOH SDUDOOHO FODVVLILHU IRU GDWD PLQLQJ ,Q 3URF RI WKH QG ,QWfO &RQIHUHQFH RQ 9HU\ /DUJH 'DWDEDVHV %RPED\ ,QGLD 6HSWHPEHU >@ 6KLP 5 6ULNDQW DQG 5 $JUDZDO +LJKGLPHQVLRQDO VLPLODULW\ MRLQV ,Q 3URF RI WKH WK ,QWfO &RQIHUHQFH RQ 'DWD (QJLQHHULQJ %LUPLQJKDP 8. $SULO >@ $ 6LHEHV DQG 0 / .HUVWHQ .(62 0LQLPL]LQJ GDWDEDVH LQWHUDFWLRQ ,Q 3URF RI WKH UG ,QWfO &RQIHUHQFH RQ .QRZOHGJH 'LVFRYHU\ DQG 'DWD 0LQLQJ 1HZSRUW %HDFK &DOLIRUQLD $XJXVW >@ & 6LOYHUVWHLQ 6 %ULQ 5 0RWZDQL DQG 8OOPDQ 6FDODEOH WHFKQLTXHV IRU PLQLQJ FDXVDO VWUXFWXUHV ,Q 3URF RI WKH 9/'% &RQIHUHQFH 1HZ @ 5 6ULNDQW DQG 5 $JUDZDO 0LQLQJ JHQHUDOL]HG DVVRFLDWLRQ UXOHV ,Q 3URF RI WKH VW ,QWfO &RQIHUHQFH RQ 9HU\ /DUJH 'DWDEDVHV =XULFK 6ZLW]HUODQG 6HSWHPEHU >@ 5 6ULNDQW DQG 5 $JUDZDO 0LQLQJ TXDQWLWDWLYH DVVRFLDWLRQ UXOHV LQ ODUJH UHODWLRQDO WDEOHV ,Q 3URF RI WKH $&0 6,*02' &RQIHUHQFH RQ 0DQDJHPHQW RI 'DWD 0RQWUHDO &DQDGD -XQH >@ 5 6ULNDQW DQG 5 $JUDZDO 0LQLQJ VHTXHQWLDO SDWWHUQV *HQHUDOL]DWLRQV DQG SHUIRUn PDQFH LPSURYHPHQWV ,Q 3URF RI WKH )LIWK ,QWfO &RQIHUHQFH RQ ([WHQGLQJ 'DWDEDVH 7HFKQRORJ\ ('%7f $YLJQRQ )UDQFH 0DUFK >@ 5 6ULNDQW 4 9X DQG 5 $JUDZDO 0LQLQJ DVVRFLDWLRQ UXOHV ZLWK LWHP FRQVWUDLQWV ,Q 3URF RI WKH UG ,QW &RQIHUHQFH RQ .QRZOHGJH 'LVFRYHU\ LQ 'DWDEDVHV DQG 'DWD 0LQLQJ 1HZSRUW %HDFK &DOLIRUQLD $XJXVW >@ 0 5 6WRQHEUDNHU DQG .HPQLW] 7KH 3267*5(6 QH[W JHQHUDWLRQ GDWDEDVH PDQDJHPHQW V\VWHP &RPPXQLFDWLRQV RI WKH $&0 f >@ 6 7KRPDV 6 %RGDJDOD $OVDEWL DQG 6 5DQND $Q HIILFLHQW DOJRULWKP IRU WKH LQFUHPHQWDO XSGDWLRQ RI DVVRFLDWLRQ UXOHV LQ ODUJH GDWDEDVHV ,Q 3URF RI WKH UG ,QWffO &RQIHUHQFH RQ .QRZOHGJH 'LVFRYHU\ DQG 'DWD 0LQLQJ 1HZ @ + 7RLYRQHQ 6DPSOLQJ ODUJH GDWDEDVHV IRU DVVRFLDWLRQ UXOHV ,Q 3URF RI WKH QG ,QWfO &RQIHUHQFH RQ 9HU\ /DUJH 'DWDEDVHV SDJHV 0XPEDL %RPED\f ,QGLD 6HSWHPEHU


>@ 7VXU 8OOPDQ 6 $ELWHERXO & &OLIWRQ 5 0RWZDQL 6 1HVWRURY DQG $ 5RVHQn WKDO 4XHU\ )ORFNV $ JHQHUDOL]DWLRQ RI DVVRFLDWLRQ UXOH PLQLQJ ,Q 3URF RI WKH $&0 6,*02' &RQIHUHQFH RQ 0DQDJHPHQW RI 'DWD 6HDWWOH :DVKLQJWRQ -XQH >@ 6 0 :HLVV DQG & $ .XOLNRZVNL &RPSXWHU 6\VWHPV WKDW /HDUQ &ODVVLILFDWLRQ DQG 3UHGLFWLRQ 0HWKRGV IURP 6WDWLVWLFV 1HXUDO 1HWV 0DFKLQH /HDUQLQJ DQG ([SHUW 6\VWHPV 0RUJDQ .DXIPDQ >@ 5 % @ 0 =DNL 6 3DUWKDVDUDWK\ 0 2JLKDUD DQG : /L 1HZ DOJRULWKPV IRU IDVW GLVFRYn HU\ RI DVVRFLDWLRQ UXOHV ,Q 3URF RI WKH UG ,QWfO &RQIHUHQFH RQ .QRZOHGJH 'LVFRYHU\ DQG 'DWD 0LQLQJ 1HZSRUW %HDFK &DOLIRUQLD $XJXVW >@ 7 =KDQJ 5 5DPDNULVKQDQ DQG 0 /LYQ\ %,5&+ $Q HIILFLHQW GDWD FOXVWHULQJ PHWKRG IRU YHU\ ODUJH GDWDEDVHV ,Q 3URF RI WKH $&0 6,*02' &RQIHUHQFH RQ 0DQDJHPHQW RI 'DWD 0RQWUHDO &DQDGD -XQH >@ < =KDR 3 0 'HVKSDQGH DQG ) 1DXJKWRQ $Q DUUD\EDVHG DOJRULWKP IRU VLPXOn WDQHRXV PXOWLGLPHQVLRQDO DJJUHJDWHV ,Q 3URF RI WKH $&0 6,*02' &RQIHUHQFH RQ 0DQDJHPHQW RI 'DWD 7XFVRQ $UL]RQD 0D\




I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Sharma Chakravarthy, Chairman
Associate Professor of Computer and Information Science and Engineering

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Eric N. Hanson
Associate Professor of Computer and Information Science and Engineering

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

[...]
Distinguished Professor of Computer and Information Science and Engineering

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Stanley Y. W. S[...]
Professor of Computer and Information Science and Engineering

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Suleyman Tufekci
Associate Professor of Industrial and Systems Engineering

This dissertation was submitted to the Graduate Faculty of the College of Engineering and to the Graduate School and was accepted as partial fulfillment of the requirements for the degree of Doctor of Philosophy.

December 1998

Winfred M. Phillips
Dean, College of Engineering

M. Ohanian
Dean, Graduate School



This preprocessor will be able to select the right translation taking into account input
data distributions. We consider SQL translations that can be executed on a SQL-92 [88]
relational engine, as well as translations that require some of the newer object-relational
capabilities being designed for SQL [78]. Specifically, we assume availability of blobs, user-
defined functions, and table functions [80] in the Object-Relational engine. We do not
require any mining specific extension in the underlying execution engine; identification of
such extensions is one of the goals of this study.
We do quantitative and qualitative comparisons of some of the architectural alternatives
listed here. Our primary focus is on the performance of various architectural alternatives
and the identification of possible enhancements to the query optimizer and the
query processing engine. The issues of the language constructs required to extend SQL
with mining features, and the details of the preprocessing step shown in Figure 2.6 are
secondary.
It might also be possible to integrate mining with databases using the newer extension
technologies like database extenders, data cartridges or data blades.
2.2.6 Integrated Approach
This is the tightest form of integration where the mining operations are an integral
part of the database query engine. In this approach there is no clear boundary between
simple querying, OLAP and mining; that is, querying and mining are treated as similar
operations. The user's goal is to get information from the data store. He/she should not
have to make the distinction as to whether it is the result of querying/OLAP/mining.
This entails unbundling the bulky mining operations and identifying common operator
primitives with which the mining operations can be composed. We cannot expect to have


The 3wayJoin and 2GroupBy approaches for boolean associations can be extended to
handle taxonomies in a similar way, by replacing T with T* in the corresponding queries.
5.6.2 Subquery Optimization
The basic KwayJoin approach can be optimized to make use of common prefixes between
the itemsets in Ck by splitting the support counting phase into a cascade of k subqueries.
The subqueries in this case are the same as those for boolean associations
presented in Section 3.5.4, except for the use of T* instead of T.
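A basic k-way join for k = 2 over the extended table can be sketched with Python's standard `sqlite3` module. This is an illustration under assumptions: the table names `Tstar` and `C2`, the function name and the sample data are hypothetical, T* is taken to hold each transaction's items together with their taxonomy ancestors (one distinct row per tid and item), and the cascade of subqueries described above is not shown.

```python
import sqlite3


def kway_join_support(extended_rows, candidates, minsup):
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE Tstar (tid, item)")
    con.execute("CREATE TABLE C2 (item1, item2)")
    con.executemany("INSERT INTO Tstar VALUES (?, ?)", extended_rows)
    con.executemany("INSERT INTO C2 VALUES (?, ?)", candidates)
    # Join each candidate with one copy of T* per item; rows that agree on
    # tid witness one supporting transaction, so grouping counts support.
    return con.execute("""
        SELECT c.item1, c.item2, COUNT(*)
        FROM C2 AS c
        JOIN Tstar AS t1 ON t1.item = c.item1
        JOIN Tstar AS t2 ON t2.item = c.item2 AND t2.tid = t1.tid
        GROUP BY c.item1, c.item2
        HAVING COUNT(*) >= ?
        ORDER BY c.item1, c.item2
    """, (minsup,)).fetchall()
```

With an ancestor such as "outerwear" present in T*, a generalized itemset can reach minimum support even when none of its specializations does.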
5.7 Support Counting Using SQL-OR
In this section we present three approaches that make use of the object-relational
features of SQL. We also present a cost-based analysis of the execution time of the various
approaches.
5.7.1 GatherJoin
The GatherJoin approach, which is based on the use of table functions [80], generates
all possible k-item combinations of extended transactions, joins them with the candidate
table Ck and counts the support by grouping the join result. The extended transactions
T* (defined in Section 5.6.1) are passed to the table function GatherComb-K in the (tid,
item) order. A record output by the table function is a k-item combination supported by
a transaction and has k attributes T_itm1, ..., T_itmk. The SQL queries for this approach are
presented in Figure 5.6.
The special optimization for pass 2 and the variations of the GatherJoin approach,
namely GatherCount, GatherPrune and Horizontal (refer Section 4.1) are also applicable
here.
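The GatherJoin idea can be sketched outside the database as follows. This is only an illustration, not the table function used in the dissertation: the names gather_comb_k and gather_join_support, the sample rows, and the use of an absolute count threshold (count >= minsup) are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations, groupby
from operator import itemgetter


def gather_comb_k(rows, k):
    """Yield every k-item combination of each transaction.

    `rows` must be (tid, item) pairs sorted in (tid, item) order, mirroring
    the order in which the text says T* is passed to the table function.
    (Hypothetical stand-in for the GatherComb-K table function.)
    """
    for _tid, group in groupby(rows, key=itemgetter(0)):
        items = sorted({item for _, item in group})
        yield from combinations(items, k)


def gather_join_support(rows, candidates, k, minsup):
    """Join the generated combinations with Ck and count support per candidate."""
    candidate_set = set(candidates)
    counts = Counter(c for c in gather_comb_k(rows, k) if c in candidate_set)
    # Assumption: minsup is an absolute transaction count.
    return {c: n for c, n in counts.items() if n >= minsup}


# Hypothetical extended transactions T* as (tid, item) rows;
# "snacks" plays the role of an ancestor of "chips" in a taxonomy.
t_star = sorted([
    (1, "beer"), (1, "chips"), (1, "snacks"),
    (2, "beer"), (2, "snacks"),
    (3, "beer"), (3, "chips"), (3, "snacks"),
])
c2 = [("beer", "chips"), ("beer", "snacks")]
frequent = gather_join_support(t_star, c2, k=2, minsup=2)
```

The in-memory set lookup stands in for the join with Ck, and the Counter stands in for the group-by; inside a DBMS both would be relational operators.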


kept sorted, this operation can be done by performing a merge-scan of the tidlists of all the
items in the itemset.
3.3 Input-Output Formats
Input format The transaction table T normally has two column attributes: transaction
identifier (tid) and item identifier (item). For a given tid, typically there are multiple
rows in the transaction table corresponding to different items that belong to the same
transaction. The number of items per transaction is variable and unknown during table creation
time. Thus, alternative schemas may not be convenient. In particular, assuming that all
items in a transaction appear as different columns of a single tuple [105] is not practical,
because often the number of items per transaction can be more than the maximum number
of columns that the database supports. For instance, for one of the real-life datasets we
experimented with, the maximum number of items per transaction is 872 and for another
it is 700. In contrast, the corresponding average number of items per transaction is only
9.6 and 4.4 respectively. Even if the database supports that many columns for a table, there
will be a lot of wasted space in that scheme.
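A minimal sketch of the (tid, item) layout, using Python's built-in sqlite3 module purely for illustration. The sample rows are invented; only the table name T and the column names tid and item come from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One row per (transaction, item) pair, as described for table T.
conn.execute("CREATE TABLE T (tid INTEGER, item TEXT)")
conn.executemany(
    "INSERT INTO T VALUES (?, ?)",
    [(1, "A"), (1, "B"), (1, "C"), (2, "A"), (3, "B"), (3, "C")],
)

# Maximum and average number of items per transaction, the two statistics
# the text uses to argue against a one-column-per-item schema.
max_items, avg_items = conn.execute(
    """
    SELECT MAX(n), AVG(n)
    FROM (SELECT COUNT(*) AS n FROM T GROUP BY tid)
    """
).fetchone()
```

With this layout a transaction of any length needs no schema change, whereas a one-column-per-item table would have to be as wide as the longest transaction.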
Output format The output is a collection of rules of varying length. The maximum
length of these rules is much smaller than the total number of items and is rarely more
than a dozen. Therefore, a rule is represented as a tuple in a fixed-width table where the
extra column values are set to NULL to accommodate rules involving smaller itemsets. The
schema of a rule is (item1, ..., itemk, len, rulem, confidence, support) where k is the size of
the largest frequent itemset. The len attribute gives the length of the rule (number of items
in the rule) and the rulem attribute gives the position of the → in the rule. For instance,
if k = 5, the rule AB → CD which has 90% confidence and 30% support is represented by



PAGE 1

ARCHITECTURES AND OPTIMIZATIONS FOR INTEGRATING
DATA MINING ALGORITHMS WITH DATABASE SYSTEMS

By

SHIBY THOMAS

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

PAGE 2

To my parents, sisters and brothers

PAGE 3



PAGE 4

to S. Seshadri (my master's

PAGE 5



PAGE 6

-SIONS
GatherJoin
Special Pass Optimization
Variations of GatherJoin Approach
Cost Analysis of GatherJoin and its Variants
Vertical
Special Pass Optimization
Cost Analysis
SQL-Bodied Functions
Performance Comparison
Final Hybrid Approach
Architecture Comparisons
Timing Comparison
Scale-up experiment
Impact of longer names
Space Overhead of Different Approaches
Summary of Comparison Between Different Architectures
Other Associations Algorithms
Summary

PAGE 7

from F
Addition of New Transactions
Deletion of Existing Transactions
Experimental Results
Comparison with FUP
Database Integration of Incremental Mining
SQL Formulations of Incremental Mining
Performance Results
New-Candidate Optimization
Other Approaches

PAGE 8

Constrained Associations
Categories of Constraints
Constrained Association Mining
Incremental Constrained Association Mining
Constraint Relaxation
Applicability Beyond Association Mining
Mining Closed Sets
Query Flocks
View Maintenance
Summary
CONCLUSIONS
Proposed Extensions
Richer Set Operations
Enhanced Aggregation
Multiple Streams
Sampling
Contributions
Future Work
Closing
REFERENCES
BIOGRAPHICAL SKETCH

PAGE 9

LIST OF FIGURES

Data warehousing architecture
Credit card classification example
Typical data warehouse usage
Taxonomy of architectural alternatives
Loose-coupling architecture
Cache-mine architecture
Stored-procedure architecture
UDF-based mining architecture
SQL architecture for mining in a DBMS
Architecture for mining in next-generation DBMSs
Apriori algorithm
Candidate generation for any k
Candidate generation for k =
Rule generation
Support counting by K-way join
Support counting using subqueries
Comparison of different SQL approaches
K-way join plan with Ck as inner relation
K-way join plan with Ck as outer relation

PAGE 10

-action)
Comparison of four architectures
Scale-up with increasing number of transactions
Scale-up with increasing transaction length
Comparison of different architectures on space requirements
Example of a taxonomy
Pre-computing ancestors
Generation of C

PAGE 11

Transaction extension subquery
Support counting by K-way join
Support counting by GatherJoin
Support counting by GatherExtend
Interior nodes' tidlist generation by union
Interior nodes' tidlist generation from T*

PAGE 12

LIST OF TABLES

Description of different real-life datasets
Notations used in cost analysis
Description of synthetic datasets
Pros and cons of different architectural options ranked on a scale of 1 (good) to (bad)
An example of the taxonomy table
Additional notations used for cost analysis

PAGE 13

Our evaluation of the different architectural alternatives shows that, from a performance perspective, the Cache-Mine option is superior, although the SQL-OR option comes a close second. Both the Cache-Mine and the SQL-OR approaches incur a higher storage penalty than the loose-coupling approach, which performance-wise

PAGE 14

is a factor of to worse than Cache-Mine. We also compare these alternatives on the basis of qualitative factors like automatic parallelization, development ease, portability and inter-operability.

We further analyze the SQL approaches with the twin goals of studying how best a DBMS without any object-relational extensions can execute these queries and identifying ways of incorporating the semantics of mining into cost-based query optimizers. We develop cost formulae for the mining queries based on the input data parameters and relational operator costs. We also identify certain optimizations which improve the performance.

Next, we study generalized association rule and sequential pattern mining and develop SQL formulations for them, thereby demonstrating that more complex mining operations can be handled in the SQL framework.

We develop an incremental association rule mining algorithm which does not need to examine the old data if the frequent itemsets do not change. Even otherwise, access to the old database can be limited to just one scan. We categorize the various kinds of constraints on the items that are useful in the context of interactive mining to facilitate goal-oriented mining. We show how the incremental mining technique can be adapted to handle constraints and certain kinds of constraint relaxation. We also show the applicability of the incremental algorithm to other classes of data mining and decision support problems.

Finally, we identify certain primitive operators that are useful for a large class of data mining and decision support applications. Supporting them natively in the DBMS could enable these applications to run faster.

PAGE 15

CHAPTER 1
INTRODUCTION

The rapid growth in data warehousing technology, combined with the significant drop in storage prices, has made it possible to collect large volumes of data about customer transactions in retail stores, mail order companies, banks, stock markets, telecommunication companies and so on. For example, AT&T call records are about gigabyte per hour [ ] and supermarket chains like Wal-Mart collect terabytes of data. In order to transform these huge amounts of data into business competitiveness and profits, it is extremely important to be able to mine nuggets of useful and understandable information from these data warehouses. In this chapter, we introduce data warehousing and data min-

's information systems environment is far from simple. The first problem is to discover how completeness and consistency can be defined. In the business context, this entails understanding the business

PAGE 16

strategies and the data required to support and track their achievement. This process, called enterprise modeling, requires substantial involvement of business users and is tra-

typically known as data warehouse population, requires tools for extracting data from multiple operational databases and external sources; for cleaning, transforming and integrating these data; and for loading data into the data warehouse. In addition to the main warehouse, there may be several departmental data marts. It also requires tools for periodically refreshing the warehouse to maintain consistency and to purge data from the warehouse, perhaps onto slower archival storage.

PAGE 17

for querying data warehouses. Rather than seek out known relationships, it sifts through data for unknown relationships. For instance, consider the transaction data in a mail-order company stored in the following relations:

    sales(customer, widget, state, year)
    booster(customer, widget, driver)
    catalog(widget, manufacturer)

The sale information is stored in the sales table and the catalog table stores the widgets of different manufacturers. The booster table stores the driver that influences the particular

PAGE 18

sale. In this database, OLAP finds answers to queries of the form "How many widgets were sold in the first quarter of in California vs. Florida?" However, data mining attempts to answer queries like "What are the drivers that caused people to buy these widgets from our catalog?"

Fundamentally, data mining is statistical analysis and has been in practice for a long time. But until recently, statistical analysis was a time-consuming manual process, which limited the amount of data that could be analyzed, and the accuracy depended heavily on the personnel involved in the analysis. Today, with the advent of various sophisticated technologies, tools exist that automate the process, making data mining a practical solution for a wide range of companies. For example, Fingerhut's (a direct-mail catalog company) statistical analysis was limited to taking samples of to percent of its customers. With data mining, it can examine specific characteristics of each of the to million customers in a much more focused way [ ].

The initial efforts on data mining research were to cull together techniques from machine learning and statistics [ ] to define new mining operations and develop algorithms for them [ ]. In the remainder of this section, we briefly introduce the various data mining problems.

Association Rule

The association rule, which captures co-occurrence of items or events, was introduced in the context of market basket data [ ]. An example of such a rule might be that % of transactions containing beer also contain diapers and % of transactions contain both these items. Here % is the support and % is the confidence of the rule beer → diaper. Association rule mining is stated formally as follows [ ]: Let I = {i1, i2, ..., im} be a set of literals, called items. Let D be a set of transactions, where each transaction T is

PAGE 19

a set of items such that T ⊆ I. Each transaction has a unique identifier, called its tid. An association rule is an implication of the form X → Y, where X ⊂ I, Y ⊂ I and X ∩ Y = ∅. The rule X → Y holds in the transaction set D with confidence c if c% of transactions in D that contain X also contain Y. The rule X → Y has support s in the transaction set D if s%

of itemsets. The itemsets that are contained in the sequence are called elements of the sequence. For example, ((computer, modem)(printer)) is a sequence with two elements, (computer, modem) and (printer). The support of a sequential pattern is the percentage of data-sequences that contain the sequence. A sequential pattern can be further qualified by specifying maximum and/or minimum time gaps between adjacent elements and a sliding time window within

PAGE 20

) or categorical (coming from an unordered domain). One of the attributes, called the classifying attribute, indicates the class to which each record belongs. Once a model is built from the given examples, it can be used to determine the class of future unclassified records.

Several classification models based on neural networks, statistical models like linear/quadratic discriminants, decision trees and genetic models have been proposed over the years. Decision trees are particularly suited for data mining since they can be constructed relatively fast compared to other methods and they are simple and easy to understand. Moreover, trees can be easily converted into SQL statements that can be used to access databases efficiently [ ].

A decision tree is a class discriminator that recursively partitions the training set until each partition consists entirely or dominantly of examples from one class. Each non-leaf node of the tree contains a split point, which is a test on one or more attributes and determines how the data is partitioned. Figure shows a sample decision-tree classifier and the training set from which it is derived. Each record represents a credit card applicant

PAGE 21

centroids are refined to the mean of the data points in that cluster. This process is repeated several times until an acceptable convergence is reached. There are several research efforts reported in the data mining literature for clustering large databases [ ].

PAGE 22

, often in the form of a data warehouse, provides business institutions at its disposal a tool with immense implications. According to the vice president of Mellon Bank's advanced technology group, "Data mining is the carrot that justifies the expensive stick of building a data warehouse."

The majority of the warehouse stores (systems used for storing warehouse data) are relational databases or their variants. The advantages of using database systems are numerous: SQL was invented for direct query of data, most client/server connectivity is supplied by relational vendors, most replication systems have been designed with relational sources and targets, and most of the relational vendors are delivering parallel database solutions. There are important alternatives in this segment, however. The OLAP multidimensional engines offer unique performance characteristics across their problem domain. We might also see traditional file stores providing significantly better performance for some data mining

PAGE 23

operations. Differentiators for the relational engines will be overall scalability, availability across a broad spectrum of hardware platforms, "affinity"

-house integration of mining operations. We first study the various architectural alternatives for coupling data mining with relational database systems, primarily from a performance perspective. We develop various SQL formulations for association rules [ ], a representative mining problem, and analyze how competitive the SQL implementation can be compared to other specialized implementations of association rule mining. We further focus on the

PAGE 24

Figure : Typical data warehouse usage

analysis of various execution plans chosen by relational database systems for executing some of the SQL-based mining queries. We expect that this study will reveal the domain-specific semantic information of the mining algorithms that needs to be integrated into next gen-

-tally. In order to improve the reliability and usefulness of the discovered information, large volumes of data need to be collected and analyzed over a period of time. A naive approach to update the mined information when new data are added or part of the current data are deleted is to recompute them from scratch. However, it would be ideal to develop an incremental algorithm so that the computation effort spent on the original data can be

PAGE 25

effectively utilized. We develop an incremental algorithm and its SQL formulations for up-

(1) Survey the various data mining problems and algorithms,
(2) Study the different database integration alternatives for data mining,
(3) Develop and implement SQL formulations of mining algorithms,
(4) Analyze the performance profile of current DBMSs to execute the above SQL queries,
(5) Explore the enhancements to current cost-based optimizers to incorporate the domain specific semantics of mining,
(6) Develop an incremental association rule mining algorithm and its SQL formulations,
(7) Generalize the incremental algorithm for mining constrained associations and show its applicability to other data mining and decision support problems, and
(8) Explore primitive operators for mining in databases.

This work has a significant impact on the state-of-the-art in data mining system architectures and comes at the appropriate time when the data mining community is looking for

PAGE 26

answers to "How to mine data warehouses?" Given the amount of data involved in mining, its potential impact on various business sectors, and the fact that OLAP is finding its way into commercial database systems (for example, the cube operator

-tions for improving the performance are detailed in Chapter . In Chapter , we present the use of object-relational extensions to SQL for improving the performance of SQL-based association rule mining and the performance comparison of the various architectural alternatives. SQL-based mining of generalized association rules and sequential patterns is described in Chapters and , respectively. Chapter presents the incremental association rule mining algorithm, performance comparison, SQL formulation and generalizations for mining constrained associations. We conclude in Chapter with a discussion of the proposed database operators and avenues for further research.

PAGE 27

CHAPTER 2
FROM FILE MINING TO DATABASE MINING

The "first-generation" KDD systems offer isolated discovery features using tree induc-

for application programming made database applications much easier to develop and manage. Data mining has to undergo a similar transition from the current "file mining" to data warehouse mining and a richer set of APIs for developing business intelligence and decision support applications.

In the remainder of this chapter, we survey some of the prior research related to the database integration of mining in Section . The various architectural alternatives are discussed in Section .

Related Work

Researchers have started to focus on various issues related to integrating mining with databases [ ]. The research on database integration of mining can be broadly classified into two categories: one which proposes new mining operators and the other which

PAGE 28

leverages the query processing capabilities of current relational DBMSs. In the former category, there have been language proposals to extend SQL with specialized mining operators. A few examples are (i) the query language DMQL proposed by Han et al. [ ], which extends SQL with a collection of operators for mining characteristic rules, discriminant rules, classification rules, association rules and so on; (ii) the MSQL language of Imielinski and Virmani [ ], which extends SQL with a special unified operator Mine to generate and query a whole set of propositional rules; and (iii) the mine rule operator proposed by Meo et al. [ ] for a generalized version of the association rule discovery problem. However, they do not address the processing techniques for these operators inside a database engine and the interaction of the standard relational operators and the proposed extensions. It is also important to break these operators to a finer level of granularity in order to identify commonalities between them and derive a set of primitive operators that should be supported natively in a database engine.

In the second category, researchers have addressed the issue of exploiting the capabilities of conventional relational systems and their object-relational extensions to execute mining operations. This entails transforming the mining operations into database queries and, in some cases, developing newer techniques that are more appropriate in the database context. The proposal of Agrawal and Shim [ ] for tightly coupling a mining algorithm with a relational database system makes use of user-defined functions (UDFs) in SQL statements to selectively push parts of the application that perform computations on data records into the database system. The objective was to avoid one-at-a-time record retrieval from the database to the application address space, saving both the copying and process context switching costs. In the KESO project [ ], the focus is on developing a data mining system which interacts with standard DBMSs. The interaction with the database

PAGE 29

is restricted to two-way table queries, a special kind of aggregate query. Two-way tables, which are used in the mining process, have sets of source and target attributes and an associated count. Association rule mining was formulated as SQL queries in the SETM algorithm [ ]. However, it does not use the subset property (all subsets of a frequent itemset are frequent) for candidate generation. As a result, SETM counts a large number of candidate itemsets in the support counting phase and hence is not efficient [ ]. Query flocks generalize boolean association rules to mine associations across relational tables. A query flock is a parameterized query with a filter condition to eliminate the values of parameters that are "uninteresting"

PAGE 30


PAGE 31

into the DBMS. This method has all the advantages of the stored procedure approach (described below)

Figure : Stored-procedure architecture

User-Defined Function

This approach is another variant of embedding mining as an application on the database server if the user-defined functions are run in the unfenced mode (same address space as

PAGE 32

the server) [ ]. In this case, the entire mining algorithm is encapsulated as a collection of user-defined functions (UDFs) [ ] that are appropriately placed in SQL data scan queries. The architecture is represented in Figure . Most of the processing happens in the UDF and the DBMS is used simply to provide tuples to these UDFs. Little use is made of the query processing capabilities of the DBMS. The UDFs can be run in either fenced (different address space) or unfenced (same address space) mode. The main attraction of this method is performance since, when run in the unfenced mode, individual tuples never have to cross the DBMS boundary. Otherwise, the processing happens in almost the same manner as in the stored procedure case. The main disadvantage is the development cost [ ], since the entire mining algorithm has to be written as UDFs, involving significant code rewrites. Further, these are "heavy-weight" UDFs which involve significant processing and memory management.

Figure : UDF-based mining architecture

In order to provide a query interface or application programming interface to the discovered rules, they can be passed through a post-processing step. The rule discovery itself could be done by any of the above alternatives.

SQL-Based Approach

This is the integration architecture explored in this dissertation. In this approach, the mining algorithm is formulated as SQL queries which are executed by the DBMS query

PAGE 33

processor. We develop several SQL formulations for a few representative mining operations in order to better understand the performance profile of current database query processors in executing these queries. We believe that it will enable us to identify what portions of these mining operations can be pushed down to the query processing engine of a DBMS.

There are also several potential advantages of a SQL implementation. One can profitably make use of the database indexing and query processing capabilities, thereby leveraging more than two decades of development effort spent in making these systems robust, portable, scalable and highly concurrent. Rather than devising specialized parallelizations, one can potentially exploit the underlying SQL parallelization, particularly in an SMP environment. The current approach to parallelizing mining algorithms is to develop specialized parallelizations for each of the algorithms [ ]. The DBMS support for checkpointing and space management can be especially valuable for long-running mining algorithms on huge volumes of data. The development of new algorithms could be significantly faster if expressed declaratively using a few SQL operations. This approach is also extremely portable across DBMS's, since porting becomes trivial if the SQL approaches use only the standard SQL features.

Figure : SQL architecture for mining in a DBMS

The architecture we have in mind is schematically shown in Figure . We visualize that the desired mining operation will be expressed in some extension of SQL or a graphical language. A preprocessor will generate appropriate SQL translation for this operation.

PAGE 34

-natives listed here. Our primary focus is on the performance of various architectural al-

's goal is to get information from the data store. He/she should not have to make the distinction as to whether it is the result of querying/OLAP/mining. This entails unbundling the bulky mining operations and identifying common operator primitives with which the mining operations can be composed. We cannot expect to have

PAGE 35


PAGE 36

CHAPTER 3
ASSOCIATION RULES

In this chapter, we discuss the various SQL (SQL with no object-relational extensions) formulations of association rule mining. We start with a review of the apriori algorithm for association rule mining in Section . A few other algorithms for mining association rules are briefly outlined in Section . The input-output data formats are described in Section 3.3 and in Section we introduce SQL-based association rule mining. The various SQL formulations are presented in Section . We present experimental results showing the performance of these formulations on some real-life datasets in Section . In Section , we develop cost formulae for the cost of executing the above SQL queries on a query processor based on the input data parameters and relational operator costs. A few performance optimizations to the basic SQL approaches and the corresponding performance gains are presented in Section . Section quantifies the overall performance improvements of the optimizations with experiments on synthetic datasets.

The association rule mining problem outlined in Section can be decomposed into two subproblems [ ]:

(1) Find all combinations of items whose support is greater than minimum support. Call those combinations frequent itemsets.
(2) Use the frequent itemsets to generate the desired rules. The idea is that if, say, ABCD and AB are frequent itemsets, then we can determine if the rule AB → CD holds by

PAGE 37

computing the ratio r = support(ABCD)/support(AB).

    F1 = {frequent 1-itemsets}
    for (k = 2; Fk-1 ≠ ∅; k++) do
        Ck = apriori-gen(Fk-1)        // generate new candidates
        forall transactions t ∈ D do
            Ct = subset(Ck, t)        // find all candidates contained in t
            forall candidates c ∈ Ct do
                c.count++
            done
        done
        Fk = {c ∈ Ck | c.count ≥ minsup}
    done
    Answer = ∪k Fk

Figure : Apriori algorithm
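The level-wise Apriori loop can be sketched in a few lines of Python. This is a minimal in-memory illustration, assuming itemsets are represented as sorted tuples and minsup is an absolute count; the function names and the sample transactions are hypothetical and it is not the SQL formulation studied in this chapter.

```python
from itertools import combinations


def apriori_gen(prev_frequent, k):
    """Build candidate k-itemsets from the frequent (k-1)-itemsets."""
    prev = set(prev_frequent)
    candidates = set()
    for a in prev:
        for b in prev:
            # Join step: same first k-2 items, last item of a < last item of b.
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                c = a + (b[-1],)
                # Prune step: every (k-1)-subset must itself be frequent.
                if all(s in prev for s in combinations(c, k - 1)):
                    candidates.add(c)
    return candidates


def apriori(transactions, minsup):
    """Return {itemset: count} for every itemset with count >= minsup."""
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    frequent = {c: n for c, n in counts.items() if n >= minsup}
    answer = dict(frequent)
    k = 2
    while frequent:
        candidates = apriori_gen(frequent, k)
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if t.issuperset(c):
                    counts[c] += 1
        frequent = {c: n for c, n in counts.items() if n >= minsup}
        answer.update(frequent)
        k += 1
    return answer


# Hypothetical market-basket data.
baskets = [["A", "B", "C"], ["A", "B"], ["A", "C"], ["B", "C"], ["A", "B", "C"]]
result = apriori(baskets, minsup=3)
```

Each pass scans all transactions once, which is the cost that the SQL support counting approaches later in the chapter try to push into the query processor.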

PAGE 38

The basic Apriori algorithm shown in Figure makes multiple passes over the data. In the k-th pass, it finds all itemsets with k items having the minimum support, called the frequent k-itemsets. Each pass consists of two phases. Let Fk represent the set of frequent k-itemsets and Ck the set of candidate k-itemsets (potentially frequent itemsets). First is the candidate generation phase, where the set of all frequent (k-1)-itemsets, Fk-1, found in pass (k-1)

PAGE 39



PAGE 40

kept sorted, this operation can be done by performing a merge-scan of the tidlists of all the items in the itemset.

3.3 Input-Output Formats

Input format The transaction table T normally has two column attributes: transaction identifier (tid) and item identifier (item). For a given tid, typically there are multiple rows in the transaction table corresponding to different items that belong to the same trans-

where k is the size of the largest frequent itemset. The len attribute gives the length of the rule (number of items in the rule) and the rulem attribute gives the position of the → in the rule. For instance, if k = 5, the rule AB → CD which has 90% confidence and 30% support is represented by

PAGE 41

the tuple (A, B, C, NULL

(k-1)-itemsets, the Apriori candidate generation procedure [ ] returns a superset of the set of all frequent k-itemsets. We assume that the items in an itemset are lexicographically ordered. Since all subsets of a frequent itemset are also frequent, we can generate Ck from Fk-1 as follows.

First, in the join step, we generate a superset of the candidate itemsets Ck by joining Fk-1 with itself, as shown below:

    insert into Ck
    select I1.item1, ..., I1.itemk-1, I2.itemk-1
    from Fk-1 I1, Fk-1 I2
    where I1.item1 = I2.item1 and
          ...
          I1.itemk-2 = I2.itemk-2 and
          I1.itemk-1 < I2.itemk-1

PAGE 42

For example, let F be {{ }, { }, { }, { }, { }}. After the join step, C will be {{ }, { }}. Next, in the prune step, all itemsets c ∈ Ck, where some (k-1)-subset of c is not in Fk-1, are deleted. Continuing with the example above, the prune step will delete the itemset { } because the subset { } is not in F . We will then be left with only { } in C .

We can perform the prune step in the same SQL statement as the join step above by writing it as a k-way join as shown in Figure . A k-way join is used since for any k-itemset there are k subsets of length (k-1) for which we need to check in Fk-1 for membership. The join predicates on I1 and I2 remain the same. After the join between I1 and I2 we get a k-itemset consisting of (I1.item1, ..., I1.itemk-1, I2.itemk-1) as shown above. For this itemset, two of its (k-1)-length subsets are already known to be frequent since it was generated from two itemsets in Fk-1. We check the remaining k-2 subsets using additional joins. The predicates for these joins are enumerated by skipping one item at a time from the k-itemset as follows: We first skip item1 and check if the subset (I1.item2, ..., I1.itemk-1, I2.itemk-1) belongs to Fk-1, as shown by the join with I3 in Figure . In general, for a join with Ir (3 ≤ r ≤ k), we skip item r-2, which gives us join predicates of the form

    I1.item1 = Ir.item1 and
    ...
    I1.itemr-3 = Ir.itemr-3 and
    I1.itemr-1 = Ir.itemr-2 and
    ...
    I1.itemk-1 = Ir.itemk-2 and
    I2.itemk-1 = Ir.itemk-1

PAGE 43

Figure gives an example for k = .

We construct a primary index on (item1, ..., itemk-1) of Fk-1 to efficiently process these k-way joins using index probes. Note that sometimes it may not be necessary to materialize Ck before the counting phase. Instead, the candidate generation can be pipelined with the subsequent SQL queries used for support counting.

Figure : Candidate generation for any k

Figure : Candidate generation for k =
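As an illustration, the combined join-and-prune query for k = 4 can be run with Python's built-in sqlite3 module. The table name F3, the alias names and the sample itemsets are assumptions for this sketch; the sample data is the small example commonly used in the Apriori literature and is not claimed to be the one whose numbers are missing from the text above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Frequent 3-itemsets; the primary key doubles as the index used for probing.
conn.execute(
    """
    CREATE TABLE F3 (
        item1 INTEGER, item2 INTEGER, item3 INTEGER,
        PRIMARY KEY (item1, item2, item3)
    )
    """
)
conn.executemany(
    "INSERT INTO F3 VALUES (?, ?, ?)",
    [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)],
)

c4 = conn.execute(
    """
    SELECT I1.item1, I1.item2, I1.item3, I2.item3
    FROM F3 I1, F3 I2, F3 I3, F3 I4
    WHERE I1.item1 = I2.item1 AND I1.item2 = I2.item2   -- join step
      AND I1.item3 < I2.item3
      -- skip item1: (I1.item2, I1.item3, I2.item3) must be in F3
      AND I1.item2 = I3.item1 AND I1.item3 = I3.item2 AND I2.item3 = I3.item3
      -- skip item2: (I1.item1, I1.item3, I2.item3) must be in F3
      AND I1.item1 = I4.item1 AND I1.item3 = I4.item2 AND I2.item3 = I4.item3
    ORDER BY 1, 2, 3, 4
    """
).fetchall()
```

The join step alone would produce (1, 2, 3, 4) and (1, 3, 4, 5); the extra joins with I3 and I4 drop the second one because (3, 4, 5) is not in F3.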

PAGE 44

Counting Support to Find Frequent Itemsets

This is the most time-consuming part of the association rule mining algorithm. We use the candidate itemsets Ck and the data table T to count the support of the itemsets in Ck. We consider two different categories of SQL implementations.

(A) The first one is based purely on SQL. We discuss four approaches in this category in Section .

(B) The second utilizes the new SQL object-relational extensions like UDFs, BLOBs (binary large objects), table functions and so on. Table functions [ ] are virtual tables associated with a user defined function which generate tuples on the fly. Like normal physical tables, they have pre-defined schemas. The function associated with a table function can be implemented as any other UDF. Thus, table functions can be viewed as UDFs that return a collection of tuples instead of scalar values. We discuss six approaches in this category in Chapter . Note that UDFs in this approach are not heavy-weight and do not require extensive memory allocations and coding, unlike in a purely UDF-based implementation [ ].

Rule Generation

In the second phase of the association rule mining algorithm, we use the frequent itemsets to generate rules with the user specified minimum confidence, minconf. For every frequent itemset l, we first find all non-empty proper subsets of l. Then, for each of those subsets m, we find the confidence of the rule m → (l − m) and output the rule if it is at least minconf.

In the support counting phase, the frequent itemsets of size k are stored in table Fk. Before the rule generation phase, we merge all the frequent itemsets into a single table F.

PAGE 45

The schema of F consists of k+2 attributes (item_1, ..., item_k, support, len), where k is the size of the largest frequent itemset and len is the length of the itemset, as discussed earlier.

We use the table function GenRules to generate all possible rules from a frequent itemset. The input argument to the function is a frequent itemset. For each itemset, it outputs tuples corresponding to rules with all non-empty proper subsets of the itemset in the consequent. The table function outputs tuples with k+3 attributes: T_item1, ..., T_itemk, T_support, T_len, T_rulem. The output is joined with F to find the support of the antecedent, and the confidence of the rule is calculated by taking the ratio of the support values. The predicates in the where clause match the antecedent of the rule with the frequent itemset corresponding to the antecedent. While checking for this match, we need to check only up to item_k where k = T_rulem. The or part (t1.T_rulem < k) of each predicate accomplishes this.

[...] rules with (k-1)-long consequents found in the previous level, much like the way we did candidate generation.

The fraction of the total running time spent in rule generation is very small. Therefore, we do not focus much on rule generation algorithms.

PAGE 46

    insert into R
    select t1.T_item1, ..., t1.T_itemk, t1.T_support, t1.T_len, t1.T_rulem,
           t1.T_support / f2.support
    from F f1,
         table(GenRules(f1.item1, ..., f1.itemk, f1.len, f1.support)) as t1,
         F f2
    where (t1.T_item1 = f2.item1 or t1.T_rulem < 1) and
          ...
          (t1.T_itemk = f2.itemk or t1.T_rulem < k)
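The rule generation step described above can be sketched outside the database as follows. This is a minimal illustration in Python, not the GenRules table function itself: the function name `gen_rules`, the dictionary layout of the frequent-itemset table and the toy support counts are all assumptions made for the example.

```python
from itertools import combinations

def gen_rules(frequent, minconf):
    """Yield (antecedent, consequent, support, confidence) for every rule
    m -> (l - m) whose confidence is at least minconf.

    `frequent` (hypothetical layout) maps each frequent itemset, stored as a
    sorted tuple, to its support count, playing the role of the merged table F.
    """
    for itemset, support in frequent.items():
        # Enumerate all non-empty proper subsets m of the itemset l.
        for size in range(1, len(itemset)):
            for antecedent in combinations(itemset, size):
                # Every subset of a frequent itemset is frequent, so the
                # antecedent's support can be looked up in the same table.
                confidence = support / frequent[antecedent]
                if confidence >= minconf:
                    consequent = tuple(i for i in itemset if i not in antecedent)
                    yield antecedent, consequent, support, confidence

# Toy frequent itemsets with made-up support counts.
frequent = {
    ("a",): 4, ("b",): 3, ("c",): 3,
    ("a", "b"): 3, ("a", "c"): 2,
}
rules = sorted(gen_rules(frequent, minconf=0.8))
```

The lookup `frequent[antecedent]` corresponds to the join of the table function output with F, and the division corresponds to the ratio of the two support values.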

PAGE 47

    insert into F_k
    select item1, ..., itemk, count(*)
    from C_k, T t1, ..., T tk
    where t1.item = C_k.item1 and
          ...
          tk.item = C_k.itemk and
          t1.tid = t2.tid and
          ...
          t_{k-1}.tid = tk.tid
    group by item1, item2, ..., itemk
    having count(*) > minsup

[Figure: Support counting by K-way join]

This SQL computation, when merged with the candidate generation step, is similar to the one proposed in Tsur et al. as a possible mechanism to implement query flocks. In a later section, we discuss the different execution plans for this query and the related performance issues.

Three-way Join

The above approach requires (k+1)-way joins in the kth pass. We can reduce the cardinality of joins to 3 using the following approach, which bears some resemblance to

PAGE 48

the AprioriTid algorithm in Agrawal et al. Each candidate itemset in C_k, in addition to attributes (item_1, ..., item_k), has three new attributes (oid, id1, id2). oid is a unique identifier associated with each itemset, and id1 and id2 are oids of the two itemsets in F_{k-1} from which the itemset in C_k was generated (as discussed in the section on candidate generation). In addition, in the kth pass we generate a new copy of the data table, T_k, with attributes (tid, oid) that keeps for each tid the oid of each itemset in C_k that it supported. For support counting, we first generate T_k from T_{k-1} and C_k and then do a group-by on T_k to find F_k as follows:

    insert into T_k
    select t1.tid, oid
    from C_k, T_{k-1} t1, T_{k-1} t2
    where t1.oid = C_k.id1 and t2.oid = C_k.id2 and t1.tid = t2.tid

    insert into F_k
    select oid, item1, ..., itemk, cnt
    from C_k,
         (select oid as cid, count(*) as cnt
          from T_k
          group by oid
          having count(*) > minsup) as temp
    where C_k.oid = cid

Two Group-bys

Another way to avoid multi-way joins is to first join T and C_k based on whether the "item" of a (tid, item) pair of T is equal to any of the k items of C_k, then do a group by on (item_1, ..., item_k, tid), filtering tuples with count equal to k. This gives all (itemset, tid) pairs such that the tid supports the itemset. Finally, as in the previous approaches, do a group-by on the itemset (item_1, ..., item_k), filtering tuples that meet the support condition.

    insert into F_k
    select item1, ..., itemk, count(*)

PAGE 49

    from (select item1, ..., itemk, count(*)
          from T, C_k
          where item = C_k.item1 or
                ...
                item = C_k.itemk
          group by item1, ..., itemk, tid
          having count(*) = k) as temp
    group by item1, ..., itemk
    having count(*) > minsup

Subquery-Based

This approach makes use of common prefixes between the itemsets in C_k to reduce the amount of work done during support counting. We break up the support counting phase into a cascade of k subqueries. The l-th subquery Q_l finds all tids that match the distinct itemsets formed by the first l columns of C_k (call it d_l). The output of Q_l is joined with T and d_{l+1} (the distinct itemsets formed by the first l+1 columns of C_k)

PAGE 50

    insert into F_k
    select item1, ..., itemk, count(*)
    from (Subquery Q_k) t
    group by item1, item2, ..., itemk
    having count(*) > minsup

    Subquery Q_l (for any l between 1 and k):
    select item1, ..., iteml, tid
    from T t_l, (Subquery Q_{l-1}) as r_{l-1},
         (select distinct item1, ..., iteml from C_k) as d_l
    where r_{l-1}.item1 = d_l.item1 and ... and
          r_{l-1}.item_{l-1} = d_l.item_{l-1} and
          r_{l-1}.tid = t_l.tid and
          t_l.item = d_l.iteml

    Subquery Q_0: No subquery Q_0.

[Figure: Support counting using subqueries (with a tree diagram for Subquery Q_l)]
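To make the SQL-92 formulations concrete, the following sketch runs the K-way join and the two group-by formulations for k = 2 on a toy table using Python's standard `sqlite3` module. The table contents, the table name `C2`, the alias `matched` and the threshold value are assumptions for illustration; the experiments in this dissertation were not run on SQLite.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table T (tid integer, item text);
    create table C2 (item1 text, item2 text);
    insert into T values (1,'a'),(1,'b'),(1,'c'),
                         (2,'a'),(2,'b'),
                         (3,'a'),(3,'c'),
                         (4,'b'),(4,'c'),
                         (5,'a'),(5,'b');
    insert into C2 values ('a','b'),('a','c'),('b','c');
""")
minsup = 2  # toy threshold, compared with ">" as in the queries above

# K-way join: one copy of T per item of the candidate itemset.
kway = con.execute("""
    select item1, item2, count(*)
    from C2, T t1, T t2
    where t1.item = C2.item1 and t2.item = C2.item2 and t1.tid = t2.tid
    group by item1, item2
    having count(*) > ?
    order by item1, item2
""", (minsup,)).fetchall()

# Two group-bys: the inner group-by keeps (itemset, tid) pairs in which the
# transaction matched both items; the outer one counts the support.
two_group_by = con.execute("""
    select item1, item2, count(*)
    from (select item1, item2, tid
          from T, C2
          where item = C2.item1 or item = C2.item2
          group by item1, item2, tid
          having count(*) = 2) as matched
    group by item1, item2
    having count(*) > ?
    order by item1, item2
""", (minsup,)).fetchall()
```

Both queries return the same frequent 2-itemsets on this input; they differ only in how the work is split between joins and grouping.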

PAGE 51

[Table: Description of different real-life datasets. Columns: Datasets, # Records in millions, # Transactions in millions, # Items in thousands, Avg. # items. Rows: Dataset-A, Dataset-B, Dataset-C, Dataset-D.]

We selected a collection of four real-life datasets obtained from various mail-order companies and retail stores for the experiments. These datasets have differing values of parameters like the total number of (tid, item) pairs, the number of transactions (tids), the number of items and the average number of items per transaction. The table above summarizes these parameters.

In this dissertation, we report the performance with only Dataset-A. The overall observation was that mining implementations in pure SQL are too slow to be practical. For these experiments, we built a composite index (item_1, ..., item_k) on C_k, k different indices on each of the k items of C_k, and a (tid, item) and an (item, tid) index on the data table. The goal was to let the optimizer choose the best plan possible. We do not include the index building cost in the total time.

In the accompanying figure, we show the total time taken by the four approaches: K-way-Join, 3-way-Join, Subquery and 2-Group-By. For comparison, we also show the time taken by the Loose-coupling approach because this is the approach currently used by existing systems. The graph shows the total time split into candidate generation time (Cgen) and the time for each pass. The candidate generation time and the time for the first pass are much smaller compared to the total time. From this set of experiments we can make the following observations.

PAGE 52

[Figure: Comparison of different SQL-92 approaches (Dataset-A; time for Cgen and for each pass, by support value)]

• The best approach in the SQL-92 category is the Subquery approach. An important reason for its performing better than the other approaches is exploitation of common prefixes between candidate itemsets. None of the other three approaches uses this optimization. Although the Subquery approach is comparable to the Loose-coupling approach in some cases, for other cases it did not complete even after taking ten times more time than the Loose-coupling approach.

PAGE 53

• The 3-way-Join approach is comparable to the K-way-Join approach for this dataset because the number of passes is at most three. As shown in Agrawal et al., there might be other datasets, especially ones where there is significant reduction in the size of T_k as k increases, where 3-way-Join might perform better than K-way-Join. However, one disadvantage of the 3-way-Join approach is that it requires space to store and log the temporary relations T_k.

[...] one with C_k as the outermost relation and another with C_k as the innermost relation, and the cost analysis for them. The effect of subquery optimization on the cost estimates is outlined afterwards. Schematic diagrams of the two different execution plans are given in the accompanying figures. In the cost analysis, we use the mining-specific data parameters and knowledge about association rule mining (the Apriori algorithm in this case) to estimate the cost of joins and the size of join results. Even though current relational optimizers do not use this mining-specific semantic information, the analysis provides a basis for developing "mining-aware"

PAGE 54

optimizers. The cost formulae are presented in terms of operator costs in order to make them general; for instance, join(p, q, r) [...]

[Figure: K-way join plan with C_k as inner relation]
[Figure: K-way join plan with C_k as outer relation]

PAGE 55

Table: Notations used in cost analysis

    R              number of records in the input transaction table
    T              number of transactions
    N              average number of items per transaction (R/T)
    F_1            number of frequent items
    S(C)           sum of support of each itemset in set C
    s_k            average support of a frequent k-itemset
    R_f            number of records out of R involving frequent items (= S(F_1))
    N_f            average number of frequent items per transaction (R_f/T)
    C_k            number of candidate k-itemsets
    C(n, k)        number of combinations of size k possible out of a set of size n: n! / (k! (n-k)!)
    t_k            cost of generating a k-item combination using table function Comb-k
    group(n, m)    cost of grouping n records out of which m are distinct
    join(p, q, r)  cost of joining two relations of size p and q to get a result of size r
    blob(n)        cost of passing a BLOB of size n integers as an argument

[...] availability of indices, the size of intermediate results and the amount of available memory. For instance, the efficient execution of nested loops joins requires an index (item, tid) on T. If the intermediate join result is large, it could be advantageous to materialize it and perform a sort-merge join.

For each candidate itemset in C_k, the join with T produces as many records as the support of its first item. Therefore, the size of the join result can be estimated to be the product of the number of candidates and the average support of a frequent item, and hence the cost of this join is given by join(C_k, R, C_k * s_1). Similarly, the relation obtained after joining C_k with l copies of T contains as many records as the sum of the support counts of the l-item prefixes of the candidate itemsets. Hence the cost of the l-th join is join(C_k * s_{l-1}, R, C_k * s_l), where s_0 = 1. Note that values of the s_l's can be computed from statistics collected in the previous passes. The cost of the last join (with T_k) cannot be estimated using the above formula since the k-item prefix of a k-candidate is not frequent

PAGE 56

and we do not know the value of s_k. However, the last join produces S(C_k) records (there will be as many records for each candidate as its support), and therefore the cost is join(C_k * s_{k-1}, R, S(C_k)). S(C_k) can be estimated by adding the support estimates of all the itemsets in C_k. A good estimate for the support of a candidate itemset is the minimum of the support counts of all its subsets. The overall cost of this plan, expressed in terms of operator costs, is:

\[
\sum_{l=1}^{k-1} \mathrm{join}(C_k \cdot s_{l-1},\; R,\; C_k \cdot s_l) \;+\; \mathrm{join}(C_k \cdot s_{k-1},\; R,\; S(C_k)) \;+\; \mathrm{group}(S(C_k),\; C_k)
\]

K-Way-Join Plan with C_k as Inner Relation

In this plan, we join the k copies of T and the resulting k-item combinations are joined with C_k to filter out non-candidate item combinations. The final join result is grouped on the k items.

The result of joining l copies of T is the set of all possible l-item combinations of transactions, and there are C(N, l) * T such combinations. We know that the items in the candidate itemset are lexicographically ordered, and hence we can add extra join predicates, as shown in the plan figure, to limit the join result to l-item combinations (without these extra predicates the join will result in l-item permutations). When C_k is the outermost relation these predicates are not required. A mining-aware optimizer should be able to rewrite the query appropriately. The l-th join produces (l+1)-item combinations and therefore its cost is join(C(N, l) * T, R, C(N, l+1) * T). The last join produces S(C_k) records as in the previous case and hence its cost is join(C(N, k) * T, C_k, S(C_k)). The overall cost of this

PAGE 57

plan is:

\[
\sum_{l=1}^{k-1} \mathrm{join}(C(N, l) \cdot T,\; R,\; C(N, l+1) \cdot T) \;+\; \mathrm{join}(C(N, k) \cdot T,\; C_k,\; S(C_k)) \;+\; \mathrm{group}(S(C_k),\; C_k)
\]

Note that in the above expression, C(N, 1) * T = R.

Effect of Subquery Optimization

The subquery optimization makes use of common prefixes among candidate itemsets. Unfolding all the subqueries will result in a query tree which structurally resembles the K-way-Join plan tree shown earlier. Subquery Q_l produces d_l^k * s_l records, where d_j^k denotes the number of distinct j-item prefixes of itemsets in C_k. In contrast, the l-th join in the K-way-Join plan results in C_k * s_l records. Typically d_l^k is much smaller than C_k, which explains why the Subquery approach performs better than the K-way-Join approach. The output of subquery Q_k contains S(C_k) records. The total cost of this approach can be estimated to be

\[
\sum_{l=1}^{k} \mathrm{trijoin}(R,\; s_{l-1} \cdot d_{l-1}^{k},\; d_l^{k},\; s_l \cdot d_l^{k}) \;+\; \mathrm{group}(S(C_k),\; C_k)
\]

where trijoin(p, q, r, s) denotes the cost of joining three relations of size p, q, r respectively, producing a result of size s. The value of s_k, which is the average support of a frequent k-itemset, can be estimated as mentioned earlier.

The experimental results presented earlier and in Sarawagi et al. show that the subquery optimization gave much better performance than the basic K-way-Join approach (an order of magnitude better in some cases). We observed the same trend in our additional experiments using synthetic datasets. We used synthetic datasets for some of the

PAGE 58

[...] support. Note that d_k^k is not shown since it is the same as C_k. In one of the passes, the candidate set contains many more itemsets than there are distinct item prefixes of the corresponding length.

PAGE 59

[Table: Description of synthetic datasets. Columns: Datasets, # Records, # Transactions, # Items, Avg. # items. Two rows, one per synthetic dataset.]

[...] maximal potentially frequent itemsets (denoted as I). The transaction table corresponding to the first dataset has a record count in the thousands; the second dataset, counting all the items of all its transactions, has a total record count in the millions.

[...] and discuss how they impact the cost. Based on these optimizations, we develop the Set-oriented Apriori approach in a later section.

PAGE 60

Pruning Non-Frequent Items

The size of the transaction table is a major factor in the cost of joins involving T. It can be reduced by pruning the non-frequent items from the transactions after the first pass. We store the transaction data as (tid, item) tuples in a relational table, and hence this pruning means simply dropping the tuples corresponding to non-frequent items. This can be achieved by joining T with the frequent items table F_1 as follows.

    insert into T_f
    select t.tid, t.item
    from T t, F_1 f
    where t.item = f.item1

We insert the pruned transactions into table T_f, which has the same schema as that of T. In the subsequent passes, joins with T can be replaced with corresponding joins with T_f. [...] and the size after pruning (R_f) for different support values are shown. With this optimization, in the cost formulae given earlier, R can be replaced with R_f, the number of records in T involving frequent items, and N with N_f, the average

PAGE 61

[...] or T_f's) could be expensive. In order to circumvent the problem of counting the large C_2, most mining algorithms use special techniques in the second pass. A few examples are two-dimensional arrays in IBM's Quest data mining system and hash filters proposed in Park et al. to limit the size of C_2. The generation of C_2 can be completely eliminated by formulating the join query to find F_2 as

    insert into F_2
    select p.item, q.item, count(*)
    from T_f p, T_f q
    where p.tid = q.tid and p.item < q.item
    group by p.item, q.item

PAGE 62

    having count(*) > minsup

The cost of the second pass with this optimization is

\[
\mathrm{join}(R_f,\; R_f,\; C(N_f, 2) \cdot T) \;+\; \mathrm{group}(C(N_f, 2) \cdot T,\; C(F_1, 2))
\]

Even though the grouping cost remains the same, there is a big reduction from the basic K-way-Join approach in the join costs. The accompanying figure compares the running time of the second pass with this optimization to the basic K-way-Join approach for the two experimental datasets. For the K-way-Join approach, the best execution plan was the one which generates all 2-item combinations, joins them with the candidate set and groups the join result. We can see that this optimization has a significant impact on the running time.

Reusing the Item Combinations from Previous Pass

The SQL formulations of association rule mining are based on generating item combinations [...] k-item combinations

PAGE 63

[Figure: Benefit of second pass optimization (one panel per synthetic dataset; running time of pass 2 with and without the optimization, by support value)]

[...] that are candidates. T_k has the schema (tid, item_1, ..., item_k). We join T_{k-1}, T_f and C_k as shown below to generate T_k. A tree diagram of the query is also given in the accompanying figure. The frequent itemsets F_k are obtained by grouping the tuples of T_k on the k items and applying the minimum support filtering. We can further prune T_k by filtering out item combinations that turned out to be non-frequent. However, this is not essential since we join it with the candidate set C_{k+1} in

PAGE 64

the next pass to generate T_{k+1}. The only advantage of pruning T_k is that we will have a smaller table to join in the next pass, but at the additional cost of joining T_k with F_k.

The query that generates T_k is:

    insert into T_k
    select p.tid, p.item1, ..., p.item_{k-1}, q.item
    from C_k, T_{k-1} p, T_f q
    where p.item1 = C_k.item1 and
          ...
          p.item_{k-1} = C_k.item_{k-1} and
          q.item = C_k.itemk and
          p.tid = q.tid

[Figure: Generation of T_k]

We use the optimization discussed above for the second pass and hence do not materialize and store the 2-item combinations T_2. Therefore, we generate T_3 directly by joining T_f with C_3 as

    insert into T_3
    select p.tid, p.item, q.item, r.item
    from T_f p, T_f q, T_f r, C_3
    where p.item = C_3.item1 and
          q.item = C_3.item2 and
          r.item = C_3.item3 and
          p.tid = q.tid and
          q.tid = r.tid
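The first three passes of this scheme can be tried end to end on a toy table. The sketch below uses Python's standard `sqlite3` module; the data, the table names `Tf`, `F1`, `F2`, `C3`, `T3`, `F3`, the hand-written candidate set and the use of a literal threshold with `>=` are all assumptions made for illustration, not the setup used in the experiments.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table T (tid integer, item text);
    insert into T values (1,'a'),(1,'b'),(1,'c'),(1,'x'),
                         (2,'a'),(2,'b'),(2,'c'),
                         (3,'a'),(3,'b'),(3,'y'),
                         (4,'b'),(4,'c');

    -- Pass 1: frequent items (toy threshold: at least 2 transactions).
    create table F1 as
        select item as item1, count(*) as support
        from T group by item having count(*) >= 2;

    -- Prune the non-frequent items from the transactions.
    create table Tf as
        select t.tid, t.item from T t, F1 f where t.item = f.item1;

    -- Pass 2: self-join on Tf, so that C2 is never generated.
    create table F2 as
        select p.item as item1, q.item as item2, count(*) as support
        from Tf p, Tf q
        where p.tid = q.tid and p.item < q.item
        group by p.item, q.item having count(*) >= 2;

    -- Candidate 3-itemsets, written by hand here instead of derived from F2.
    create table C3 (item1 text, item2 text, item3 text);
    insert into C3 values ('a','b','c');

    -- T3 is generated directly from Tf and C3 because T2 is not materialized.
    create table T3 as
        select p.tid, p.item as item1, q.item as item2, r.item as item3
        from Tf p, Tf q, Tf r, C3
        where p.item = C3.item1 and q.item = C3.item2 and r.item = C3.item3
          and p.tid = q.tid and q.tid = r.tid;

    -- Pass 3: group the materialized item combinations.
    create table F3 as
        select item1, item2, item3, count(*) as support
        from T3 group by item1, item2, item3 having count(*) >= 2;
""")

pruned_rows = con.execute("select count(*) from Tf").fetchone()[0]
f2 = con.execute("select * from F2 order by item1, item2").fetchall()
t3_rows = con.execute("select count(*) from T3").fetchone()[0]
f3 = con.execute("select * from F3").fetchall()
```

In a fourth pass, T3 would take the place of T_{k-1} in the general query, so only one further join with Tf and the candidate table is needed.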

PAGE 65

With R replaced by R_f, the cost of the subquery approach is

\[
\sum_{l=1}^{k} \mathrm{trijoin}(R_f,\; s_{l-1} \cdot d_{l-1}^{k},\; d_l^{k},\; s_l \cdot d_l^{k}) \;+\; \mathrm{group}(S(C_k),\; C_k)
\]

As a result of the materialization and reuse of item combinations, Set-oriented Apriori requires only a single 3-way join in the kth pass. (Note that this may be executed as two 2-way joins, since 3-way joins are not generally supported in current relational systems.) The cost of the kth pass in Set-oriented Apriori is

\[
\mathrm{trijoin}(R_f,\; T_{k-1},\; C_k,\; S(C_k)) \;+\; \mathrm{group}(S(C_k),\; C_k)
\]

where T_{k-1} and C_k denote the cardinality of the corresponding tables. The grouping cost is the same as that of the subquery approach. The table T_{k-1} contains exactly the same tuples as that of subquery Q_{k-1} and hence has a size of s_{k-1} * d_{k-1}^k. Also, d_k^k is the same as C_k. Therefore, the kth pass cost of Set-oriented Apriori is the same as the kth term in

PAGE 66

the join cost summation of the subquery approach. This results in significant performance improvements, especially in the higher passes.

The accompanying figure compares the running times of the subquery and Set-oriented Apriori approaches for one of the synthetic datasets at a fixed support value. We show only the times for passes 3 and higher since both the approaches are the same in the first two passes.

[Figure: Benefit of reusing item combinations (Set-Apriori versus Subquery, time per pass)]

Space Overhead

The Set-oriented Apriori approach requires additional space in order to store the item combinations generated. The size of the table T_k is the same as S(C_k), which is the total support of all the k-item candidates. Assuming that the tid and item attributes are integers, each tuple in T_k consists of k+1 integer attributes. The accompanying figure shows the space required to store T_k, in terms of number of integers, for one of the synthetic datasets for two different support values. The space needed for the input data table T is also shown for comparison. T_2 is not shown in the graph since we do not materialize and store it in the Set-oriented Apriori approach. Note that once T_k is materialized, T_{k-1} can be deleted unless it needs to be retained for some other purposes.

PAGE 67

[Figure: panels plotted against support (%)]

[...] and the Set-oriented Apriori for a wide range of data parameters and support values. We report the results on two of the synthetic datasets, which are described in an earlier section.

PAGE 68

[Figure: Total time (one panel per synthetic dataset; time for each pass, by support value)]

PAGE 69

[...] for the higher support values, and therefore we chose lower support values to study the relative performance in higher numbered passes.

In some cases, the optimizer did not choose the best plan. For example, for joins with T (T_f for Set-oriented Apriori), the optimizer chose a nested loops plan using the (item, tid) index on T in many cases where the corresponding sort-merge plan was faster, an order of magnitude faster in some cases. We were able to experiment with different plans by disabling certain join methods (disabling nested loops join for the above case).

[...] The first graph shows the absolute execution times and the second one shows the times

PAGE 70

[Figure: Comparison of CPU and I/O times (panels: CPU time and I/O time for each pass, by support value)]

normalized with respect to the times for the [...] transaction datasets. It can be seen that the execution times scale quite linearly and both the datasets exhibit similar scaleup behavior.

The scaleup with increasing transaction size is shown in the accompanying figure. In these experiments we kept the physical size of the database roughly constant by keeping the product of the average transaction size and the number of transactions constant. The number of transactions was therefore varied inversely with the average transaction size,

PAGE 71

[Figure: Number of transactions scaleup]

from the database with the smallest average transaction size to the one with the largest. We fixed the minimum support level in terms of the number of transactions, since fixing it as a percentage would have led to large increases in the number of frequent itemsets as the transaction size increased. The numbers in the legend refer to this minimum support. The execution times increase with the transaction size, but only gradually. The main reason for this increase was that the number of item combinations present in a transaction increases with the transaction size.

PAGE 72


PAGE 73

Note: A part of the work described in this chapter was primarily done by researchers from IBM Almaden Research Center. Specifically, the SQL-based candidate generation and the SQL-92 support counting approaches were developed by them. They are included in this dissertation for completeness.

PAGE 74

CHAPTER
SUPPORT COUNTING USING SQL WITH OBJECT-RELATIONAL EXTENSIONS

In this chapter, we study alternative approaches that make use of additional object-relational features in SQL. For each approach, we also outline a cost-based analysis of the execution time to enable one to choose between these different approaches. We first present six different approaches, optimizations and their cost estimates. Experimental results comparing the performance of these approaches are presented next. We then propose a hybrid approach which combines the best of all approaches. The performance of the different architectural alternatives described in an earlier chapter is compared afterwards, followed by a summary of qualitative comparisons of these architectures. The applicability of the SQL-based approach to other association rule mining algorithms is briefly discussed at the end of the chapter.

GatherJoin

This approach (see the accompanying figure) is based on the use of table functions described earlier. It generates all possible k-item combinations of items contained in a transaction, joins them with the candidate table C_k and counts the support of the itemsets by grouping the join result. It uses two table functions, Gather and Comb-K. The data table T is scanned in the (tid, item) order and passed to the table function Gather. This table function collects all the items of a transaction (in other words, items of all tuples of T with the same tid) in memory and outputs a record for each transaction. Each such record consists of two

PAGE 75

[...] implementation. The Gather function is not required when the data is already in a horizontal format where each tid is followed by a collection of all its items.

Special Pass 2 Optimization

Note that for k = 2, the candidate set C_2 is simply a join of F_1 with itself. Therefore, we can specially optimize pass 2 by replacing the join with C_2 by a join with F_1 before the table function (see the accompanying figure). That way, the table function gets only frequent items and generates significantly fewer 2-item combinations. This optimization can be useful for other passes too, but unlike for pass 2 we still have to do the join with C_k.

Variations of GatherJoin Approach

GatherCount

One variation of the GatherJoin approach for pass two is the GatherCount approach, where we push the group-by inside the table function instead of doing it outside. The candidate 2-itemsets (C_2) are represented as a two-dimensional array inside the modified

PAGE 76

table function Gather-Cnt for doing the support counting. Instead of outputting the 2-item combinations of a transaction, it directly uses them to update support counts in memory and outputs only the frequent 2-itemsets, F_2, and their support after the last transaction. Thus, the table function Gather-Cnt is an extension of the GatherComb-2 table function used in GatherJoin. The absence of the outer grouping makes this option rather attractive. The UDF code is also small since it only needs to maintain a 2D array. We could apply the same trick for subsequent passes, but the coding becomes considerably more complicated because of the

The GatherJoin query is:

    insert into F_k
    select item1, ..., itemk, count(*)
    from C_k,
         (select t2.T_itm1, ..., t2.T_itmk
          from T,
               table (Gather(T.tid, T.item)) as t1,
               table (Comb-K(t1.tid, t1.item-list)) as t2)
    where t2.T_itm1 = C_k.item1 and
          ...
          t2.T_itmk = C_k.itemk
    group by C_k.item1, ..., C_k.itemk
    having count(*) > minsup

[Figure: Support counting by GatherJoin]

PAGE 77

    insert into F_2
    select tt2.T_itm1, tt2.T_itm2, count(*)
    from (select * from T, F_1 where T.item = F_1.item1) as tt1,
         table (GatherComb-2(tid, item)) as tt2
    group by tt2.T_itm1, tt2.T_itm2
    having count(*) > minsup

[...] inside the table function and thus reduce the number of such combinations. C_k is converted to a BLOB and passed as an argument to the table function.

The cost of passing the BLOB for every tuple of T could be high. In general, we can reduce the parameter passing cost by using a smaller BLOB that only approximates the real

PAGE 78

C_k. The trade-off is increased cost for other parts, notably grouping, because not as many combinations are filtered.

Horizontal

Another variation of GatherJoin is the Horizontal approach, which first uses the Gather function to transform the data to the horizontal format but is otherwise similar to the GatherJoin approach. Rajamani et al. propose finding associations using an approach quite similar to this Horizontal approach. They assume (rather unrealistically) that the data is already in a horizontal format. However, they do not use the frequent itemset filtering optimization we outlined for pass 2. Without this optimization, the time for pass 2 for most real-life datasets blows up even for relatively high support values. Also, at the time of candidate generation, rather than doing a self-join on F_{k-1}, they join F_{k-1} with F_1, thereby generating considerably more combinations than needed. Thus, the approach in Rajamani et al. is likely to perform worse than Horizontal.

Cost Analysis of GatherJoin and its Variants

The choice of the best approach depends on a number of data characteristics like the number of items, total number of transactions, average length of a transaction and so on. We express the costs of different approaches in each pass in terms of parameters that are known or can be estimated after the candidate generation step of each pass. We include only the terms that were found to be the dominant part of the total cost in practice. We use the notations of the table given earlier in the cost analysis.

The cost of GatherJoin includes the cost of generating k-item combinations, joining with C_k and grouping to count the support. The number of k-item combinations generated,

PAGE 79

T_k, is C(N, k) * T. The join with C_k filters out the non-candidate item combinations. The size of the join result is the sum of the support of all the candidates, denoted by S(C_k). The actual value of the support of a candidate itemset will be known only after the support counting phase. However, we get a good estimate by approximating it to the minimum of the support of all its (k-1)-subsets in F_{k-1}. The total cost of the GatherJoin approach is:

\[
T_k \cdot t_k \;+\; \mathrm{join}(T_k,\; C_k,\; S(C_k)) \;+\; \mathrm{group}(S(C_k),\; C_k), \quad \text{where } T_k = C(N, k) \cdot T
\]

The above cost formula needs to be modified to reflect the special optimization of joining with F_1 to consider only frequent items. We need a new term join(R, F_1, R_f) and need to change the formula for T_k to include only frequent items, N_f, instead of N. For the second pass, we do not need the outer join with C_k. The total cost of GatherJoin in the second pass is:

\[
\mathrm{join}(R,\; F_1,\; R_f) \;+\; T_2 \cdot t_2 \;+\; \mathrm{group}(T_2,\; C_2), \quad \text{where } T_2 = C(N_f, 2) \cdot T \approx \frac{N_f^2 \cdot T}{2}
\]

The cost of GatherCount in the second pass is similar to that for basic GatherJoin except for the final grouping cost. In this formula, "group_int" denotes the cost of doing the support counting inside the table function.

\[
\mathrm{join}(R,\; F_1,\; R_f) \;+\; \mathrm{group\_int}(T_2,\; C_2) \;+\; F_2 \cdot t_2
\]

For GatherPrune the cost equation is:

\[
R \cdot \mathrm{blob}(k \cdot C_k) \;+\; S(C_k) \cdot t_k \;+\; \mathrm{group}(S(C_k),\; C_k)
\]
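The GatherJoin pipeline and the size estimate T_k = C(N, k) * T can be checked on a small example. The following Python sketch only mimics the data flow in memory: the functions `gather`, `comb_k` and `gather_join` are hypothetical stand-ins for the table functions and the surrounding query, and the toy transactions all have the same length so that the estimate is exact.

```python
from itertools import combinations, groupby
from math import comb
from operator import itemgetter

def gather(rows):
    """Stand-in for Gather: rows arrive in (tid, item) order; emit one
    (tid, item_list) record per transaction."""
    for tid, group in groupby(rows, key=itemgetter(0)):
        yield tid, [item for _, item in group]

def comb_k(records, k):
    """Stand-in for Comb-K: emit every k-item combination of each transaction."""
    for _tid, items in records:
        yield from combinations(items, k)

def gather_join(rows, candidates, k, minsup):
    """Return the frequent k-itemsets and the number of combinations generated."""
    counts = {}
    generated = 0
    for combo in comb_k(gather(rows), k):
        generated += 1
        if combo in candidates:                       # the join with C_k
            counts[combo] = counts.get(combo, 0) + 1  # the group-by
    frequent = {c: n for c, n in counts.items() if n > minsup}
    return frequent, generated

# Three transactions (T = 3) with three items each (N = 3), in (tid, item) order.
rows = sorted([(1, "a"), (1, "b"), (1, "c"),
               (2, "a"), (2, "b"), (2, "d"),
               (3, "a"), (3, "b"), (3, "c")])
candidates = {("a", "b"), ("a", "c"), ("b", "c")}
frequent, generated = gather_join(rows, candidates, k=2, minsup=1)
```

Here `generated` equals C(3, 2) * 3, while only the combinations that survive the membership test reach the grouping step, which is the S(C_k) term of the cost formula.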

PAGE 80

We use blob(k * C_k) for the BLOB passing cost since each itemset in C_k contains k items.

The cost estimate of Horizontal is similar to that of GatherJoin except that here the data is materialized in the horizontal format before generating the item combinations.

Vertical

In this approach, we first transform the data table into a vertical format by creating for each item a BLOB containing all tids that contain that item (tid-list creation phase) and then count the support of itemsets by merging together these tid-lists (support counting phase). This approach is related to the approaches discussed in Savasere et al. and Zaki et al. For creating the tid-lists we use a table function Gather. This is the same as the Gather function in GatherJoin except that here we create the tid-list for each frequent item. The data table T is scanned in the (item, tid) order and passed to the function Gather. The function collects the tids of all tuples of T with the same item in memory and outputs an (item, tid-list) tuple for items that meet the minimum support criterion. The tid-lists are represented as BLOBs and stored in a new TidTable with attributes (item, tid-list). The SQL query which does the transformation to vertical format is:

    insert into TidTable
    select item, tid-list
    from (select * from T order by item, tid) as tt1,
         table(Gather(item, tid, minsup)) as tt2

[Figure: Tid-list creation]

In the support counting phase, conceptually, for each itemset in C_k we want to collect the tid-lists of all k items and use a UDF to count the number of tids in the intersection

PAGE 81

of these k lists. The tids are in the same sorted order in all the tid-lists, and therefore the intersection can be done easily and efficiently by a single pass of the k lists. This conceptual step can be improved further by decomposing the intersect operation so that we can share these operations across itemsets having common prefixes, as follows.

We first select distinct (item_1, item_2) pairs from C_k. For each distinct pair we first perform the intersect operation to get a new result-tidlist, then find distinct triples (item_1, item_2, item_3) from C_k with the same first two items, intersect result-tidlist with the tid-list for item_3 for each triple, and continue with item_4 and so on until all k tid-lists per itemset are intersected.

The above sequence of operations can be written as a single SQL query for any k, as shown in the figure on support counting using UDF. The final intersect operation can be merged with the count operation to return a count instead of the tid-list. We do not include this optimization in that query for simplicity.

Special Pass 2 Optimization

For pass 2 we need not generate C_2 and join the TidTables with C_2. Instead, we perform a self-join on the TidTable using the predicate t1.item < t2.item.

    insert into F_2
    select item1, item2, cnt
    from (select t1.item as item1, t2.item as item2,
                 CountIntersect(t1.tid-list, t2.tid-list) as cnt
          from TidTable t1, TidTable t2
          where t1.item < t2.item) as t
    where cnt > minsup
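The single-pass intersection and the sharing of work across common prefixes can be illustrated with a short Python sketch. It is not the Intersect UDF used in the experiments: the functions `intersect` and `count_support`, the in-memory dictionary standing in for TidTable, and the toy tid-lists are assumptions for the example.

```python
def intersect(a, b):
    """Merge two sorted tid-lists in a single pass and return the common tids."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def count_support(candidates, tidlists):
    """Count the support of each candidate, intersecting each distinct prefix once.

    `tidlists` (hypothetical stand-in for TidTable) maps item -> sorted tid-list.
    Returns the support counts and the cache of prefix tid-lists.
    """
    cache = {}

    def tids_for(itemset):
        if len(itemset) == 1:
            return tidlists[itemset[0]]
        if itemset not in cache:
            # Extend the shared prefix by one item, as subquery Q_l does.
            cache[itemset] = intersect(tids_for(itemset[:-1]), tidlists[itemset[-1]])
        return cache[itemset]

    support = {c: len(tids_for(c)) for c in candidates}
    return support, cache

tidlists = {"a": [1, 2, 3, 5], "b": [1, 2, 5], "c": [1, 3, 5], "d": [2, 5]}
candidates = [("a", "b", "c"), ("a", "b", "d")]
support, cache = count_support(candidates, tidlists)
```

Both candidates share the prefix (a, b), so its tid-list is computed once and reused, which is the saving the decomposition into subqueries aims for.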

PAGE 82

    insert into F_k
    select item1, ..., itemk, count(tid-list) as cnt
    from (Subquery Q_k) t
    where cnt > minsup

    Subquery Q_l (for any l between 2 and k):
    select item1, ..., iteml,
           Intersect(r_{l-1}.tid-list, t_l.tid-list) as tid-list
    from TidTable t_l, (Subquery Q_{l-1}) as r_{l-1},
         (select distinct item1, ..., iteml from C_k) as d_l
    where r_{l-1}.item1 = d_l.item1 and ... and
          r_{l-1}.item_{l-1} = d_l.item_{l-1} and
          t_l.item = d_l.iteml

    Subquery Q_1: (select * from TidTable)

[Figure: Support counting using UDF (with a tree diagram for Subquery Q_l)]

[…] × average length of a tidlist. The average length of a tidlist can be approximated to […]. Note that with each intersect the tidlist keeps shrinking. However, we ignore such effects for simplicity.

In addition to the intersect cost, it includes the cost of joins in the query also. The join cost of subquery Q_l can be recursively defined as

\[ C(Q_l) = \mathrm{trijoin}(F_1, Q_{l-1}, C_k, d_l) + C(Q_{l-1}) \]

where trijoin(p, q, r, s) denotes the cost of joining three relations of size p, q, r respectively, producing a result of size s. The exact cost of the 3-way join will depend on the join order. The cost of subquery Q_1 is the cost of scanning the TidTable, which has F_1 tuples. The result size of the subquery Q_l is d_l, and the result size of Q_1 is F_1. The total cost of the Vertical approach is […] C(Q_k).

In the formula above, Intersect(n) denotes the cost of intersecting two tidlists with a combined size of n. The total cost is dominated by the intersect cost, and join costs account for only a small fraction. Therefore, we can safely ignore the join costs in the above formulae.

The total cost of the second pass is

\[ \mathrm{join}(F_1, F_1, […]) + C_2 \times \{ […] \times \mathrm{Blob}([…]) + \mathrm{Intersect}([…]) \} \]

SQL-Bodied Functions

This approach is based on SQL-bodied functions, commonly known as SQL/PSM […]. SQL/PSMs extend SQL with additional control structures. We make use of one such construct: for ... do ... end. We use the for construct to scan the transaction table T in the (tid, item) order. Then, for each tuple (tid, item) of T, we update those tuples of C_k that contain one matching item. C_k is extended with three extra attributes (prevTid, match, supp). The prevTid attribute keeps the tid of the previous tuple of T that matched that itemset. The match attribute contains the number of items of prevTid matched so far, and supp holds the current support of that itemset. On each column of C_k an index is built to do a searched update.

    for this as select * from T do
        update C_k
        set prevTid = tid,
            match = case when tid = prevTid then match + 1 else 1 end,
            supp = case when match = k - 1 and tid = prevTid
                        then supp + 1 else supp end

        where item = item_1 or ... or item = item_k
    end for;
    insert into F_k select item_1, ..., item_k, supp from C_k where supp > minsupp

The cost of this approach can be mainly attributed to the cost of updates to the candidate table C_k. For each tuple of the data table T, for all the candidate itemsets in C_k which contain that item, three updates are performed (the attributes prevTid, match and supp of the itemset are updated). If N_k is the average number of k-item candidates containing any given item, the total number of updates is 3 × R × N_k. The cost due to updates for this approach is U(3 × R × N_k), where U(n) is the cost of n updates. If the updates are logged, this cost includes the logging cost also.

Performance Comparison

We studied the performance of six approaches in this category: GatherJoin and its variants (GatherPrune, Horizontal and GatherCount), Vertical, and SBF. We used the four datasets summarized in Table […]. In Figure […], we show the performance of only four approaches: GatherJoin, GatherCount, GatherPrune and Vertical. For the other two approaches, the running times were comparatively so large that we had to abort the runs in many cases. The main reason why the Horizontal approach was significantly worse than the GatherJoin approach was the time to transform the data to the horizontal format. For instance, for Dataset-C it was […] hours, which is almost […] times more than the time taken by Vertical for […]% support. For Dataset-B the process was aborted
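The prevTid/match/supp bookkeeping of the SQL-bodied function can be traced with a small in-memory emulation. This Python sketch is only an illustration under stated assumptions: `sbf_count` is a hypothetical name, candidates have k ≥ 2 items, T contains no duplicate (tid, item) rows, and the supp update reads the old value of match, as the assignments in a single SQL UPDATE would.

```python
def sbf_count(transactions, candidates):
    """transactions: (tid, item) rows; candidates: tuples of k >= 2 items.
    Returns the support of every candidate."""
    k = len(candidates[0])
    state = {c: {"prevTid": None, "match": 0, "supp": 0} for c in candidates}
    for tid, item in sorted(transactions):        # scan T in (tid, item) order
        for c, s in state.items():
            if item not in c:                     # where item = item_1 or ... or item = item_k
                continue
            same = tid == s["prevTid"]
            # supp is computed from the old match value: the k-th matching
            # item of the same transaction completes the itemset.
            if same and s["match"] == k - 1:
                s["supp"] += 1
            s["match"] = s["match"] + 1 if same else 1
            s["prevTid"] = tid
    return {c: s["supp"] for c, s in state.items()}

T = [(1, "a"), (1, "b"), (1, "c"), (2, "a"), (2, "c"), (3, "b"), (3, "c")]
result = sbf_count(T, [("a", "b"), ("a", "c"), ("b", "c")])
```

Every matching row touches the three attributes of each candidate that contains the item, which is where the 3 × R × N_k update count in the cost estimate comes from.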

[Figure […]: Comparison of four SQL-OR approaches: Vertical, GatherPrune, GatherJoin and GatherCount. One panel per dataset (Dataset-C, Dataset-B, […]); x-axis: support (%); bars broken down into Prep and per-pass times.]

Figure […] shows the total running time of the different approaches. The time taken is broken down by each pass and an initial "prep" stage where any one-time data transfor[…] ("prep" in figure […]). The vertical representation is like an index on the item attribute. If we think of this time as a one-[…] […]mance. With this optimization, for Dataset-B with support […]%, the running time for pass […] alone was reduced from […] hours to […] minutes.

When we compare these different approaches based on time spent in each pass, we observe that no single approach is "the best" for all different passes of the different datasets, especially for the second pass. For pass three onwards, Vertical is often two or more orders of magnitude better than the other approaches. Even in cases like Dataset-B, support […]%, where it spends three

hours in the second pass, the total time for the next two passes is only […] seconds, whereas it is more than an hour for the other two approaches. For subsequent passes the performance degrades dramatically for GatherJoin, because the table function Gather-Comb-K generates a large number of combinations. For instance, for pass […] of Dataset-C, even for a support value of […]%, pass […] did not complete after […] hours, whereas for Vertical pass […] finished in […] seconds. GatherPrune is better than GatherJoin for the third and later passes. For pass 2, GatherPrune is worse because the overhead of passing a large object as an argument dominates the cost.

The Vertical approach sometimes ended up spending too much time in the second pass. In some of these cases the GatherJoin approach was better in the second pass (for instance, for low support values of Dataset-B), whereas in other cases (for instance, Dataset-C, support […]%) GatherCount was the only good option. For this case, both GatherPrune and GatherJoin did not complete after more than six hours, even for pass 2. Further, they caused a storage overflow error because of the large size of the intermediate results to be sorted. We had to divide the dataset into four equal parts and run the second pass independently on each partition to avoid this problem.

[Figure […]: Effect of increasing transaction length (average number of items per transaction). x-axis: average transaction length.]

Two factors that affect the choice amongst the Vertical, GatherJoin and GatherCount approaches in different passes, and pass 2 in particular, are the number of frequent items (F_1) and the average number of frequent items per transaction (N_f). From the graphs in Figure […], we notice that as the value of the support is decreased for each dataset, causing the size of F_1 to increase, the performance of pass 2 of the Vertical approach degrades rapidly. This trend is also clear from our cost formulae. The cost of the Vertical approach increases quadratically with F_1. GatherJoin depends more critically on the number of frequent items per transaction. For Dataset-B, even when the size of F_1 increases by a factor of […], the value of N_f remains close to […]; therefore the time taken by GatherJoin does not increase as much. However, for Dataset-C the size of N_f increases from […] to […] as the support is decreased from […]% to […]% […] increases, the cost of Vertical remains almost unchanged whereas the cost of GatherJoin increases

Final Hybrid Approach

The previous performance section helps us draw the following conclusion: overall, the Vertical approach is the best option, especially for higher passes. When the size of the candidate itemsets is too large, the performance of the Vertical approach could suffer. In such cases, GatherJoin is a good option as long as the number of frequent items per transaction (N_f) […]

[…]ing association rules provided with the IBM data mining product, Intelligent Miner […]. For Stored-procedure, we extracted the Apriori implementation in Intelligent Miner and created a stored procedure out of it. The stored procedure is run in the unfenced mode in the database address space. For Cache-Mine, we used an option provided in Intelligent Miner that causes the input data to be cached as a binary file after the first scan of the data.

• Cache-Mine has the best or close to the best performance in all cases. […]% of its total time is spent in the first pass, where data is accessed from the DBMS and cached in the file system. Compared to the SQL approach, this approach is a factor of […] to […] times faster.

[Figure […]: Comparison of four architectures. One panel per dataset (Dataset-A, Dataset-B, Dataset-C, Dataset-D); y-axis: time in seconds; x-axis: support (%); bars broken down by pass.]

• The Stored-procedure approach is the worst. The difference between Cache-Mine and Stored-procedure is directly related to the number of passes. For instance, for Dataset-A the number of passes increases from two to three when decreasing support from […]% to […]%, causing the time taken to increase from two to three times. The time spent in each pass for Stored-procedure is the same, except when the algorithm makes multiple passes over the data since all candidates could not fit in memory together. This happens for the lowest support values of Dataset-B,

Dataset-C and Dataset-D. The time taken by Stored-procedure can be expressed approximately as the number of passes times the time taken by Cache-Mine.

• UDF is similar to Stored-procedure. The only difference is that the time per pass decreases by […]% for UDF because of closer coupling with the database.

[…] we do not expect the processing time of these three alternatives to increase. For the SQL approach, we cannot assume an in-memory hash table for doing the mapping; therefore we use an alternative approach based on table functions. For the SQL approach, we discuss the hybrid approach. The two (already expensive) steps that could suffer because of longer names are (1) the final group-bys during pass 2 or higher when the GatherJoin approach is chosen, and (2) tidlist operations when the Vertical approach is chosen. For efficient performance, the first step requires a mapping of item-ids and the second one requires us to map tids. We use a table function to map the tids to

unique integers efficiently in one pass and without making extra copies. The input to the table function is the data table in the tid order. The table function remembers the previous tid and maintains a counter. Every time the tid changes, the counter is incremented. This counter value is the mapping assigned to each tid. We need to do the tid mapping only once, before creating the TidTable in the Vertical approach, and therefore we can pipeline these two steps. The item mapping is done slightly differently. After the first pass, we add a column to F_1 containing a unique integer for each item. We do the same for the TidTable. The GatherJoin approach already joins the data table T with F_1 before passing to the table function Gather. Therefore, we can pass to Gather the integer mappings of each item from F_1 instead of its original character representation. After these two transformations, the tid and item fields are integers for all the remaining queries, including candidate generation and rule generation. By mapping the fields this way, we expect longer names to have a similar performance impact on all of our architectural options.

Space Overhead of Different Approaches

In Figure […], we summarize the space required for different datasets for three options: Stored-procedure, Cache-Mine and SQL. For these experiments, we assume that the tids and items are integers. The first part refers to the space used in caching data, and the second part refers to any temporary space used by the DBMS for sorting, or alternately for constructing indices to be used during sorting. The size of the data is the same as the space utilization of the Stored-procedure approach.

The space requirement for UDF is the same as that for Stored-procedure, which requires less space than the Cache-Mine and SQL approaches. The Cache-Mine and SQL approaches have comparable storage overheads. For Stored-procedure and UDF, we do not need any extra storage for caching. However,

all three options, Cache-Mine, Stored-procedure and UDF, require data in each pass to be grouped on the tid. In a relational DBMS we cannot assume any order on the physical layout of a table, unlike in a file system. Therefore, we need either an index on the data table or need to sort the table every time to ensure a particular order. Let R denote the total number of (tid, item) pairs in the data table. Either option has a space overhead of 2 × R integers. The Cache-Mine approach caches the data in an alternative binary format where each tid is followed by all the items it contains. Thus, the size of the cached data in Cache-Mine is at most R + T integers, where T is the number of transactions. For SQL, we use the hybrid Vertical option. This requires creation of an initial TidTable of size at most R + I, where I is the number of items. Note that this is slightly less than the cache required by the Cache-Mine approach. The SQL approach needs to sort data in pass 1 in all cases, and in pass 2 in some cases where we used the GatherJoin approach instead of the Vertical approach. This explains the large space requirement for Dataset-B. However, in practice, when the item-ids or tids are character strings instead of integers, the extra space needed by Cache-Mine and SQL is a much smaller fraction of the total data size, because before caching we always convert item-ids to their compact integer representation and store them in binary format.

Summary of Comparison Between Different Architectures

In Table […] we present a summary of the pros and cons of the different architectures by ranking them on a scale of 1 (good) to 4 (bad) on each of the following yardsticks: (a) performance (execution time), (b) storage overhead, (c) scope for automatic parallelization, (d) development and maintenance ease, (e) portability, (f) inter-operability.
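Two mechanisms described earlier in this section, the one-pass tid mapping (remember the previous tid, increment a counter when it changes) and the Cache-Mine layout in which each tid is followed by its items, are easy to check on a toy table. The Python sketch below is an illustration only; `map_tids` and `cache_layout` are hypothetical names, and the flat list is a simplification of whatever binary format Intelligent Miner actually writes.

```python
def map_tids(rows):
    """rows: (tid, item) pairs already in tid order.
    Yields (integer_tid, item); the counter advances whenever the tid changes."""
    prev, counter = object(), 0        # sentinel that equals no real tid
    for tid, item in rows:
        if tid != prev:
            counter += 1
            prev = tid
        yield counter, item

def cache_layout(rows):
    """Flatten tid-ordered (tid, item) rows into
    [tid, item, item, ..., tid, item, ...]."""
    out, prev = [], None
    for tid, item in rows:
        if tid != prev:
            out.append(tid)
            prev = tid
        out.append(item)
    return out

rows = [("cust-0009", 7), ("cust-0009", 8), ("cust-0042", 7)]
mapped = list(map_tids(rows))
cached = cache_layout(mapped)
R = len(rows)                          # number of (tid, item) pairs
T = len({tid for tid, _ in mapped})    # number of transactions
```

For this table the flattened cache holds R + T = 5 integers, matching the bound quoted above, while a (tid, item) index or sort area would hold two integers per pair.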

[Figure […]: Comparison of different architectures on space requirements. One group of bars per dataset: Dataset-A, Dataset-B, Dataset-C, Dataset-D.]

Table […]: Pros and cons of different architectural options ranked on a scale of 1 (good) to 4 (bad)

    Metric                              Stored-proc   UDF   Cache-Mine   SQL
    Performance                         […]
    Storage overhead                    […]
    Automatic parallelism               […]
    Development and maintenance ease    […]
    Portability                         […]
    Inter-operability                   […]

In terms of performance, the Cache-Mine approach is the best option, followed by the SQL approach. The SQL approach was within a factor of […] to […] of the Cache-Mine approach for all of our experiments. The UDF approach is better than the Stored-procedure approach in performance by […] to […]%, but it loses on the metrics of development and maintenance costs and portability. In terms of space requirements, the Cache-Mine and the SQL approaches lose to the UDF or the Stored-procedure approach. Between the Stored-procedure and the Cache-Mine implementation, the performance difference is exactly a function of the number of passes made on the data; that is, if we make four passes

of the data, the Stored-procedure approach is four times slower than Cache-Mine. There[…]

[…]tion could come for free, because the bulk of our processing is expressed in terms of standard SQL queries. As long as the database supports efficient parallelization of these queries, the mining code can be easily parallelized. The problem case is where the UDFs use scratch pads. The only such function in our queries is the Gather table function. This function essentially implements a user-defined aggregate, and would have been easy to parallelize if the DBMS provided support for user-defined aggregates or allowed explicit control from the application about how to partition the data amongst different parallel instances of the func[…]

[…] to be unacceptable from the performance viewpoint. Our preferred SQL implementation relies on the availability of DB[…]

[…]ibility. The ad hoc querying support provided by the DBMS enables flexible usage and exposes potential for pipelining the input and output operators of the mining process with

other operators in the DBMS. However, to exploit this feature one either needs to implement the mining operators inside the DBMS or use table functions in novel ways. The first alternative would require major rework in existing database systems. The second alternative requires table functions that can execute SQL statements, a facility not currently available in DB2 (UDB). Furthermore, some mining operators generate multiple different kinds of output (for instance, model tree and statistics for decision trees). In such cases, pipelining of mining operators becomes harder to fit in existing relational models. Note that the SQL approach presented here is based on embedded SQL and cannot provide oper[…]

implementation because they are aimed at reducing the number of passes over the data. We have seen that with our SQL implementations hardly any time is spent on passes beyond 2 because of the Vertical approach. The time spent in pass 2 is not reduced, because they all require counting the support of all of C_2 on the entire data anyway.

Summary

In this chapter, we developed and experimented with a set of approaches that made use of the new object-relational extensions like UDFs, BLOBs and table functions. These approaches performed much better than the SQL-92 approaches in Chapter […]. We developed a hybrid scheme which picks the best approach for each pass based on the cost estimates. We also compared the various architectural alternatives, both qualitatively and quantitatively.

Note: A part of the work described in this chapter was primarily done by researchers from IBM Almaden Research Center. The author's primary contributions were the optimizations to the approaches in Sections […] and […], the hybrid approach in Section […], and the cost analysis of the various support counting approaches. The author has also contributed to the performance experiments which led to the various comparisons.

CHAPTER […]
GENERALIZED ASSOCIATION RULES

In most real-life applications, the set of items that appear in transactions can be categorized according to a taxonomy (is-a hierarchy) on the items. The taxonomy shown in Figure […] says that Pepsi is-a soft drink is-a beverage, and so on. In general, a taxonomy can be represented as a directed acyclic graph (DAG). Given a set of transactions T, each of which is a set of items, and a taxonomy Tax, the problem of mining generalized association rules is to discover all rules of the form X → Y, with the user-specified minimum support and minimum confidence. X and Y can be sets of items at any level of the taxonomy, such that no item in Y is an ancestor of any item in X […]. For example, there might be a rule which says that "[…]% of transactions that contain Soft drinks also contain Snacks; […]% of all transactions contain both these items".

[Figure […]: Example of a taxonomy. Node labels, top level to bottom level: Beverages, Snacks; Soft drinks, Alcoholic drinks, Pretzels, Chocolate bar; Pepsi, Coke, Beer.]

In this chapter, we present several SQL formulations of generalized association rule mining […]. In Section […] we describe the input-output formats, and in Section […] we briefly outline the Cumulate algorithm for generalized association rule mining […].

[…] (k − 1)-itemsets F_(k-1) found in the previous pass is used as the seed set to generate candidate k-itemsets (C_k) that are potentially frequent. In the support counting phase, for each itemset t ∈ C_k, the number of extended transactions (transactions augmented with all the ancestors of its items) […] pruning candidates containing an item and its ancestor, and (ii) extending the transactions by adding all the ancestors of its items. The ancestor

computation can be expressed in SQL using a recursive query, as shown in Figure […]. The result of the query is stored in table Ancestor having the schema (ancestor, descendant).

    insert into Ancestor
    with RTax (ancestor, descendant) as
        (select parent, child from Tax
         union all
         select p.ancestor, c.child from RTax p, Tax c
         where p.descendant = c.parent)
    select ancestor, descendant from RTax

Figure […]: Pre-computing ancestors

Candidate Generation

In the candidate generation phase, we use the frequent itemsets F_(k-1) found in the (k − 1)th pass to generate a set of candidate itemsets C_k that contains all k-itemsets such that all k of its (k − 1)-length subsets are in F_(k-1). Section […] shows how to express this operation as a k-way join between the frequent (k − 1)-itemsets (F_(k-1)'s). We can use the same formulation, except that we need to prune from C_k itemsets containing an item and its ancestor. Srikant and Agrawal […] prove that this pruning needs to be done only in

the second pass (for C_2). In the SQL formulation, as shown in Figure […], we prune all (ancestor, descendant) pairs from C_2, which is generated by joining F_1 with itself.

    insert into C_2
    (select I1.item1, I2.item1 from F_1 I1, F_1 I2
     where I1.item1 < I2.item1)
    except
    (select ancestor, descendant from Ancestor
     union
     select descendant, ancestor from Ancestor)

Figure […]: Generation of C_2

Support Counting to Find Frequent Itemsets

We use the candidate itemsets C_k, the data table T and the ancestor table Ancestor to count the support of the itemsets in C_k. We consider two categories of SQL implementations, based on SQL-92 and SQL-OR, in Sections […] and […] respectively. All the SQL approaches developed for boolean associations in Chapters […] and […] can be extended to handle taxonomies. However, we present the extensions to only a few representative approaches. In particular, we consider the K-way Join approach from SQL-92, and Vertical and GatherJoin from SQL-OR.
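The ancestor pre-computation and the pruned generation of C_2 can be mirrored in a few lines of Python. This is a sketch under assumptions rather than the thesis's code: `ancestors` and `generate_c2` are hypothetical names, the taxonomy edges other than Pepsi → soft drink → beverage are made up for the example, and sets are used so duplicate paths in a DAG collapse automatically.

```python
def ancestors(tax):
    """tax: set of (parent, child) edges of a taxonomy DAG.
    Returns every (ancestor, descendant) pair, like the recursive RTax query."""
    closure = set(tax)
    frontier = set(tax)
    while frontier:
        # extend each known (ancestor, descendant) pair by one taxonomy edge
        new = {(a, c) for (a, d) in frontier for (p, c) in tax if d == p}
        new -= closure
        closure |= new
        frontier = new
    return closure

def generate_c2(f1, anc):
    """All ordered pairs of frequent items, minus pairs related by ancestry."""
    pairs = {(x, y) for x in f1 for y in f1 if x < y}
    related = anc | {(d, a) for (a, d) in anc}    # both orientations, as in the SQL
    return pairs - related

tax = {("beverage", "soft drink"), ("soft drink", "pepsi"),
       ("snack", "pretzels")}                     # last edge is an assumed example
anc = ancestors(tax)
c2 = generate_c2({"beverage", "pepsi", "pretzels"}, anc)
```

The pair (beverage, pepsi) is dropped because its support would always equal the support of pepsi alone, which is why such candidates carry no information.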

Support Counting Using SQL-92

K-way Join

In each pass k, we join the candidate itemsets C_k with k copies of an extended transaction table T* (defined below), and follow it up with a group by on the itemsets. The extended transaction table T* is obtained by augmenting T to include (tid, item) entries for all ancestors of items appearing in T. This can be formulated as a SQL query, as shown in Figure […].

    Query to generate T*:
    select item, tid from T
    union
    select distinct A.ancestor as item, T.tid
    from T, Ancestor A
    where A.descendant = T.item

Figure […]: Transaction extension subquery

The select distinct clause is used to eliminate duplicate records due to extension of items with a common ancestor. Note that for this approach we do not materialize T*. Instead, we use the SQL support for common subexpressions (with construct) to pipeline the generation of T* with the join operations. The final SQL query and the corresponding tree diagram are shown in Figure […].

    insert into F_k
    with T*(tid, item) as (Query for T* defined above)
    select item_1, ..., item_k, count(*)
    from C_k, T* t_1, ..., T* t_k
    where t_1.item = C_k.item_1 and ... and
          t_k.item = C_k.item_k and
          t_1.tid = t_2.tid and ... and
          t_(k-1).tid = t_k.tid
    group by item_1, item_2, ..., item_k
    having count(*) > minsup

Figure […]: Support counting by K-way join
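The effect of transaction extension on support counting can be seen in a small Python sketch. The names `extend` and `support` are hypothetical, the data is invented, and the counting step checks set containment per transaction instead of performing a literal k-way join; it yields the same counts but is not the query plan shown above. The threshold comparison (at least minsup) is likewise an assumption.

```python
from collections import defaultdict

def extend(t_rows, anc):
    """Build T*: the original (tid, item) rows plus one (tid, ancestor) row for
    every ancestor of every item; the set removes duplicates (select distinct)."""
    ext = set(t_rows)
    for tid, item in t_rows:
        for a, d in anc:
            if d == item:
                ext.add((tid, a))
    return ext

def support(candidates, t_star, minsup):
    """Count, for each candidate, the extended transactions containing all of
    its items, and keep those reaching minsup."""
    items_by_tid = defaultdict(set)
    for tid, item in t_star:
        items_by_tid[tid].add(item)
    counts = {c: sum(1 for items in items_by_tid.values() if set(c) <= items)
              for c in candidates}
    return {c: n for c, n in counts.items() if n >= minsup}

anc = {("beverage", "pepsi"), ("beverage", "coke")}
T = [(1, "pepsi"), (1, "coke"), (1, "pretzels"), (2, "coke")]
t_star = extend(T, anc)
frequent = support([("beverage", "pretzels"), ("beverage", "coke")], t_star, minsup=2)
```

Transaction 1 contains two descendants of beverage but contributes a single (1, beverage) row, which is the duplicate elimination the select distinct clause performs.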

The […]-way Join and […] Group-By approaches for boolean associations can be extended to handle taxonomies in a similar way, by replacing T with T* in the corresponding queries.

Subquery Optimization

The basic K-way Join approach can be optimized to make use of common prefixes between the itemsets in C_k by splitting the support counting phase into a cascade of k subqueries. The subqueries in this case are exactly similar to those for boolean associations presented in […], except for the use of T* instead of T.

Support Counting Using SQL-OR

In this section, we present three approaches that make use of the object-relational features of SQL. We also present a cost-based analysis of the execution time of the various approaches.

GatherJoin

The GatherJoin approach, which is based on the use of table functions […], generates all possible k-item combinations of extended transactions, joins them with the candidate table C_k, and counts the support by grouping the join result. The extended transactions T* (defined in Section […]) are passed to the table function Gather-Comb-K in the (tid, item) order. A record output by the table function is a k-item combination supported by a transaction and has k attributes T_itm_1, ..., T_itm_k. The SQL queries for this approach are presented in Figure […]. The special optimization for pass 2 and the variations of the GatherJoin approach, namely GatherCount, GatherPrune and Horizontal (refer to Section […]), are also applicable here.

    insert into F_k
    select item_1, ..., item_k, count(*)
    from C_k,
         (select t2.T_itm_1, ..., t2.T_itm_k
          from T* t1, table(Gather-Comb-K(t1.tid, t1.item)) as t2)
    where t2.T_itm_1 = C_k.item_1 and ... and
          t2.T_itm_k = C_k.item_k
    group by C_k.item_1, ..., C_k.item_k
    having count(*) > minsup

Figure […]: Support counting by GatherJoin

Gather-Extend

This is a variation of the GatherJoin approach where we push the transaction extension inside the table function. For each item, we create an ancestor-list (stored as a BLOB) which contains a list of ancestors of that item. This can be accomplished using SQL queries similar to those in the tidlist creation phase of the Vertical approach for boolean associations (refer to Section […]). The (item, ancestor-list) pairs are stored in a new AncListTable. Note that the ancestor-list needs to be created only for items that appear in the input transactions.
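The GatherJoin pipeline (generate combinations per transaction, join with the candidates, group and count) can be imitated in Python. In this sketch `gather_comb_k` and `gather_join` are hypothetical stand-ins for the table function and the surrounding query, candidates are assumed to be tuples of items in sorted order, and the threshold comparison (at least minsup) is an assumption.

```python
from collections import Counter
from itertools import combinations, groupby

def gather_comb_k(t_star_rows, k):
    """Yield every k-item combination of each extended transaction.
    Rows are (tid, item) pairs; they are processed in (tid, item) order."""
    ordered = sorted(set(t_star_rows))
    for _tid, group in groupby(ordered, key=lambda row: row[0]):
        items = [item for _, item in group]
        yield from combinations(items, k)

def gather_join(t_star_rows, candidates, k, minsup):
    """Join the generated combinations with the candidates and count them."""
    cand = set(candidates)
    counts = Counter(c for c in gather_comb_k(t_star_rows, k) if c in cand)
    return {c: n for c, n in counts.items() if n >= minsup}

t_star = [(1, "beverage"), (1, "pepsi"), (1, "pretzels"),
          (2, "beverage"), (2, "pretzels")]
frequent = gather_join(t_star, [("beverage", "pretzels")], k=2, minsup=2)
combos = list(gather_comb_k(t_star, 2))
```

Even this tiny input produces four combinations for one candidate; since extension multiplies the items per transaction, the number of combinations grows quickly, which is the weakness of GatherJoin on extended transactions.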

In the support counting phase, we join the transaction table T with the AncListTable, and the resulting (tid, item, ancestor-list) tuples are passed to a table function GExtend-Comb-K in tid order. The table function collects all the items and the corresponding ancestor-lists of the same transaction, extends it, and outputs k-item combinations of the extended transaction. Figure […] illustrates the SQL queries for this approach. This approach was not implemented because it might not perform better than the Vertical approach explained next.

This approach can be combined with the GatherPrune approach (refer to Section […]) to prune out item combinations that are not candidates. In that case, we can also delete from the extended transactions items that are not present in any of the candidates before generating the item combinations.

Cost Analysis

In the cost analysis, we use the notations of Table […] in addition to the notations introduced for boolean associations in Table […].

Table […]: Additional notations used for cost analysis

    […]        number of records in the input taxonomy table
    d          average depth of the taxonomy
    l          number of leaf nodes in the taxonomy
    A          number of records in the Ancestor table ≈ l × d
    N*         average number of items in an extended transaction ≈ N × d
    R*         total number of records after transaction extension ≈ R × d
    e_k        cost of generating a k-item combination using GExtend-Comb-K
    order(n)   cost of sorting n records

The cost of GatherJoin includes the cost of extending the transactions by joining the transaction table with the ancestor table, generating k-item combinations, joining them with C_k, and grouping the join result to count the support. Note that the transaction

    insert into F_k
    select item_1, ..., item_k, count(*)
    from C_k,
         (select t2.T_itm_1, ..., t2.T_itm_k
          from (select T.tid, T.item, A.ancestorlist
                from T, AncListTable A
                where T.item = A.item) as t1,
               table(GExtend-Comb-K(t1.tid, t1.item, t1.ancestorlist)) as t2)
    where t2.T_itm_1 = C_k.item_1 and ... and
          t2.T_itm_k = C_k.item_k
    group by C_k.item_1, ..., C_k.item_k
    having count(*) > minsup

Figure […]: Support counting by Gather-Extend

extension cost can be ignored if T* is materialized, because in that case it will be a one-time cost. The result size of the join which extends the transactions is R*, and the number of k-item combinations generated, T_k, is C(N*, k) × T. Therefore, the total cost of the GatherJoin approach is

\[ \mathrm{join}(R, A, R^*) + \mathrm{order}(R^*) + T_k \times s_k + \mathrm{join}(T_k, C_k, S(C_k)) + \mathrm{group}(S(C_k), C_k) \]

In Gather-Extend, transaction extension is done inside the table function. The cost of passing the ancestor-list as a BLOB is blob(d), since the average length of the ancestor-list is the same as the depth of the taxonomy. The cost of Gather-Extend is

\[ \mathrm{join}(R, […], R) + R \times \mathrm{blob}(d) + T_k \times e_k + \mathrm{join}(T_k, C_k, S(C_k)) + \mathrm{group}(S(C_k), C_k) \]

Vertical

In this approach, the transactions are first converted into a vertical format by creating for each item a BLOB (tidlist) containing all tids that contain that item. The support for each itemset is counted by merging the tidlists of all its items. The tidlists of leaf node items can be created using a table function in the same way as in the boolean associations case. We present two approaches for creating the tidlists of the interior nodes in the taxonomy DAG.

The first approach is based on doing the union of the descendants' tidlists of an interior node. In the first phase, we create an initial TidTable containing the tidlists of the leaf nodes. The TidTable is joined with the ancestor table, and the join result is passed in the order of the ancestor attribute to the table function TUnion. For every node x, the

    insert into TidTable
    select item, tidlist
    from (select A.ancestor, T.tidlist
          from TidTable T, Ancestor A
          where T.item = A.descendant
          order by ancestor) as t1,
         table(TUnion(t1.ancestor, t1.tidlist)) as t2

Figure […]: Interior nodes' […]

[…] T* (defined in Section […]) to the Gather table function, as shown in Figure […]. The table function outputs tidlists for all the items in the taxonomy.

The support counting queries are exactly the same as for boolean associations explained in Section […]. The cost formula is also the same as in boolean associations, except

that the average length of a tidlist in this case is […], where R* […]

    insert into TidTable
    select item, tidlist
    from T* t1, table(Gather(t1.item, t1.tid, minsup)) as t2

Figure […]: Interior nodes' tidlist generation from T*

[…] and Stored-procedure (shown as Sproc). The chart shows the preprocessing time and the time taken for the different passes. For Vertical, the preprocessing time includes ancestor pre-computation and tidlist creation times, whereas for GatherJoin it is just the time for ancestor pre-computation. In the stored procedure approach, the mining algorithm is encapsulated as a stored procedure […]
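The union-based way of building interior-node tidlists can be sketched in Python. `tunion` is a hypothetical stand-in for the TUnion table function, the sample taxonomy is invented, and Python sets are used for the union where a real implementation over sorted BLOBs would more likely merge the lists; no minimum-support filter is applied because the source does not say whether TUnion applies one.

```python
def tunion(tid_table, anc):
    """tid_table: {item: sorted tidlist} for items occurring in transactions.
    anc: set of (ancestor, descendant) pairs.
    Returns a table that also holds a tidlist for every ancestor, formed as the
    union of its descendants' tidlists (and its own, if it has one)."""
    merged = {}
    for a, d in sorted(anc):                      # processed in ancestor order
        if d in tid_table:
            merged.setdefault(a, set()).update(tid_table[d])
    result = dict(tid_table)
    for a, tids in merged.items():
        result[a] = sorted(tids | set(tid_table.get(a, [])))
    return result

leaf_tidlists = {"pepsi": [1, 3], "coke": [2, 3], "pretzels": [4]}
anc = {("soft drink", "pepsi"), ("soft drink", "coke"),
       ("beverage", "pepsi"), ("beverage", "coke"),
       ("beverage", "soft drink"), ("snack", "pretzels")}
full = tunion(leaf_tidlists, anc)
```

Because the Ancestor table already holds the transitive closure, each interior node is resolved from leaf tidlists alone and no particular processing order among interior nodes is required.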

which runs in the same address space as the DBMS.

[Figure […]: Comparison of different SQL approaches. Performance comparison chart on the mail order data; x-axis: support (%); bars broken down into Prep and per-pass times. Listed dataset parameters (values not recoverable): total number of records (millions), number of transactions, number of items (leaf nodes in the taxonomy DAG), total number of items (including interior nodes), maximum depth of the taxonomy, average number of children per node, maximum number of parents.]

For the Stored-procedure experiment, we used the generalized association rule implementation provided with the IBM data mining product, Intelligent Miner […]. For all the support values, the Vertical approach performs equally well as the Stored-procedure approach. In some of the experiments on other datasets, the Vertical approach performed better than the Stored-procedure approach.

The GatherJoin approach is worse, mainly due to the large number of item combinations generated. In the GatherJoin approach, the extended transactions are passed to the Gather-Comb table function, and hence the effective number of items per transaction

gets multiplied by the average depth of the taxonomy. In the GatherJoin approach, we show the performance numbers for only the second pass. Note that just the time for the second pass is an order of magnitude more than the total time for all the passes of Vertical. For […]% support, the second pass of GatherJoin took over […] seconds, while the total time for Vertical was only about […] seconds.

We did not do extensive experimentation here because, based on the SQL formulations and the analysis, we expect similar performance observations as in the case of boolean association rules.

Summary

In this chapter, we presented various SQL formulations for mining generalized association rules. It shows that the boolean association rule framework can be easily extended for generalized association rules. The major additions were to "extend" the input transaction table (transform T to T*) […]

CHAPTER […]
SEQUENTIAL PATTERNS

Sequential pattern mining utilizes the time associated with the transaction data to find frequently occurring patterns. A sequential pattern is an ordered list (sequence) of itemsets (refer to Section […] for a brief description of sequential patterns). […] transaction time (time) and item identifier (item). Every data-sequence typically

PAGE 120

The len attribute gives the length of the sequence, that is, the total number of items in all the elements of the sequence. The eno attributes store the element number of the corresponding items. For sequences of smaller length, the extra column values are set to NULL. For example, if k = [?], the sequence <(computer, modem)(printer)> is represented by the tuple (computer, [?], modem, [?], printer, [?], NULL, NULL, [?]).

[...] (k - 1)-length sequences, F_{k-1}, found in the previous pass is used as the seed set to generate candidate k-length sequences (C_k) that are potentially frequent. The candidate

PAGE 121

sequences C_k are generated from the frequent (k - 1)-sequences F_{k-1}. The schema of C_k is the same as that of frequent sequences explained above in Section [?], except that we do not require the len attribute since all the tuples in C_k have the same length k.

Candidates are generated in two steps. The join step generates a superset of C_k by joining F_{k-1} with itself. A sequence s1 joins with s2 if the subsequence obtained by dropping the first item of s1 is the same as the one obtained by dropping the last item of s2. This can be expressed in SQL as follows:

    insert into C_k
    select I1.item_1, I1.eno_1, ..., I1.item_{k-1}, I1.eno_{k-1},
           I2.item_{k-1}, I1.eno_{k-1} + I2.eno_{k-1} - I2.eno_{k-2}
    from F_{k-1} I1, F_{k-1} I2
    where I1.item_2 = I2.item_1 and ... and I1.item_{k-1} = I2.item_{k-2} and

PAGE 122

          I1.eno_3 - I1.eno_2 = I2.eno_2 - I2.eno_1 and ... and
          I1.eno_{k-1} - I1.eno_{k-2} = I2.eno_{k-2} - I2.eno_{k-3}

[Example: a set F of six frequent sequences, {<...>, <...>, <...>, <...>, <...>, <...>}.] In the join step, the sequence <...> joins with <...> to generate <...> and with <...> to generate <...>. There are no other join-compatible sequences in F.

In the prune step, all candidate sequences that have a non-frequent contiguous (k - 1)-subsequence are deleted. In the above example, the sequence <...> will be deleted since its contiguous subsequence <...> is not frequent.

We perform both the join and prune steps in the same SQL statement by writing the above query as a k-way join, as shown in Figure [?]. For any k-sequence, there are at most k contiguous subsequences of length (k - 1) for which F_{k-1} needs to be checked for membership. Note that all (k - 1)-subsequences may not be contiguous because of the max-gap constraint between consecutive elements. The join predicates on I1 and I2 remain the same. The join of I1 and I2 generates a k-sequence (I1.item_1, ..., I1.item_{k-1}, I2.item_{k-1})

PAGE 123

where two of its (k - 1)-subsequences are known to be frequent (it is generated from two such (k - 1)-subsequences). Membership checks for the other (k - 2) subsequences are performed using additional joins. The join predicates for these joins are enumerated by dropping one item at a time from the generated k-sequence. We first drop item_2 and, if it results in a contiguous subsequence, we check for its membership in F_{k-1} by the join with I3. For the join with I_r, we drop item_{r-1}. The OR clause in the join predicate is to avoid checking for non-contiguous subsequences that are formed when [...]. When there is no max-gap constraint, the join predicates will not contain the OR part.

While joining F_1 with itself to get C_2, we need to generate sequences where both the items appear as a single element as well as two separate elements. Also note that the prune step will not delete any candidate sequences. The generation of C_2 is expressed in SQL as:

    insert into C_2
    select I1.item_1, 1, I2.item_1, 2
    from F_1 I1, F_1 I2
    union
    select I1.item_1, 1, I2.item_1, 1
    from F_1 I1, F_1 I2
    where I1.item_1 < I2.item_1

For instance, if F_1 contains {<...>, <...>}, C_2 will have {<...>, <...>, <...>, <...>, <...>}.

Support Counting to Find Frequent Sequences

In each pass k, we use the candidate table C_k and the input data-sequences table to count the support. We consider SQL implementations based on SQL-92 and SQL-OR in Sections [?] and [?] respectively.

PAGE 124

[Figure [?]: Candidate generation for any k. Tree diagram of a k-way join over copies I1, I2, ..., Ik of F_{k-1}. The join of I1 and I2 generates the candidate; each further join is labelled with the item it drops (through "Drop item_{k-1}") and carries equality predicates on the item and eno columns, with an OR branch for the non-contiguous case.]

[Figure [?]: Candidate generation for k = [?]. The same join tree written out for a fixed k.]
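The join and prune steps of candidate generation described above can be sketched outside SQL as follows. This is a minimal illustration under stated assumptions, not the dissertation's implementation. The tuple-of-tuples sequence representation, the function names and the sample set `F3` are all hypothetical, and the numeric values of the original worked example did not survive in the text. The sketch covers the general case only; the special handling of C_2 is not included.

```python
def drop_first(seq):
    """Remove the first item of the first element (dropping an emptied element)."""
    head = seq[0][1:]
    return ((head,) if head else ()) + seq[1:]


def drop_last(seq):
    """Remove the last item of the last element (dropping an emptied element)."""
    tail = seq[-1][:-1]
    return seq[:-1] + ((tail,) if tail else ())


def join_step(frequent):
    """s1 joins s2 when s1 minus its first item equals s2 minus its last item."""
    candidates = set()
    for s1 in frequent:
        for s2 in frequent:
            if drop_first(s1) != drop_last(s2):
                continue
            item = s2[-1][-1]
            if len(s2[-1]) == 1:
                # The last item of s2 was its own element: append a new element.
                candidates.add(s1 + ((item,),))
            else:
                # Otherwise it extends the last element of s1.
                candidates.add(s1[:-1] + (s1[-1] + (item,),))
    return candidates


def contiguous_subsequences(seq):
    """Yield subsequences obtained by dropping one item from the first or
    last element, or from any element holding at least two items."""
    last = len(seq) - 1
    for e, element in enumerate(seq):
        if e in (0, last) or len(element) > 1:
            for i in range(len(element)):
                rest = element[:i] + element[i + 1:]
                yield seq[:e] + ((rest,) if rest else ()) + seq[e + 1:]


def prune_step(candidates, frequent):
    """Keep candidates whose contiguous (k-1)-subsequences are all frequent."""
    return {
        c for c in candidates
        if all(sub in frequent for sub in contiguous_subsequences(c))
    }


# Hypothetical frequent 3-sequences; each sequence is a tuple of elements,
# and each element is a sorted tuple of items.
F3 = {
    ((1, 2), (3,)), ((1, 2), (4,)), ((1,), (3, 4)),
    ((1, 3), (5,)), ((2,), (3, 4)), ((2,), (3,), (5,)),
}
C4 = prune_step(join_step(F3), F3)
```

With this sample input, the join produces two candidates and the prune step removes the one that has an infrequent contiguous subsequence.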

PAGE 125

Support Counting Using SQL-92

K-way Join

In the kth pass, we join the candidate table C_k with k copies of the data-sequence table and group the join result on the candidate sequences, as shown in Figure [?]. This approach is very similar to the KwayJoin approach for association rules (refer Section [?]) except for the following two key differences. First, we use select distinct to ensure that only distinct data-sequences are counted. Second, we have additional predicates between sequence numbers that are denoted as PRED(k) in the query for brevity. The predicate PRED(k) is a conjunct (and) of terms P_ij(k) corresponding to each pair of items from C_k. P_ij(k) is expressed as:

    (C_k.eno_j = C_k.eno_i and abs(d_j.time - d_i.time) <= window-size) or
    (C_k.eno_j = C_k.eno_i + 1 and d_j.time - d_i.time <= max-gap
                               and d_j.time - d_i.time >= min-gap) or
    (C_k.eno_j > C_k.eno_i + 1)

Intuitively, these predicates check (a) if two items of a candidate sequence belong to the same element, then the difference of their corresponding transaction times is at most window-size and (b) if two items belong to adjacent elements, then their transaction times are at most max-gap and at least min-gap apart.

We compute the frequent 1-sequences by grouping the data-sequences table on the item attribute, counting the number of distinct sequences in which the item is present and filtering the non-frequent items. The SQL query for this computation is given below.

PAGE 126

    insert into F_1
    select item, count(*)
    from (select distinct sid, item from D) as t
    group by item
    having count(*) > :minsup

Subquery Optimization

The subquery optimization for association rules can be applied for sequential patterns also, by splitting the support counting query in pass k into a cascade of k subqueries. The predicates P_ij can be applied either on the output of subquery Q_k or sprinkled across the different subqueries (shown as SubQ_PRED(l) in Figure [?]). The predicate SubQ_PRED(l) is a conjunct of terms P_ij(l) corresponding to each pair of items from d_l, the distinct l-item prefixes of C_k. P_ij(l) is expressed as:

    (d_l.eno_j = d_l.eno_i and abs(time_j - time_i) <= window-size) or
    (d_l.eno_j = d_l.eno_i + 1 and time_j - time_i <= max-gap
                               and time_j - time_i >= min-gap) or
    (d_l.eno_j > d_l.eno_i + 1)

Support Counting Using SQL-OR

Vertical

This approach is similar to the Vertical approach for association rules (refer Section [?]), where we convert the transaction table into a vertical format. For each item, we create a BLOB (s-list) containing all (sid, time) pairs corresponding to that item. We use a table function Gather for creating the s-lists. The sequence table is scanned in the

PAGE 127

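    insert into F_k
    select item_1, eno_1, ..., item_k, eno_k, count(*)
    from (select distinct d_1.sid, item_1, eno_1, ..., item_k, eno_k
          from C_k, D d_1, ..., D d_k
          where d_1.item = C_k.item_1 and ... and d_k.item = C_k.item_k and
                d_1.sid = d_2.sid and ... and d_{k-1}.sid = d_k.sid and
                PRED(k)) as t
    group by item_1, eno_1, ..., item_k, eno_k
    having count(*) > :minsup

[Tree diagram: C_k is joined with D d_1, ..., D d_k on C_k.item_l = d_l.item and matching sid, subject to PRED(k); the result t is grouped by item_1, eno_1, ..., item_k, eno_k and filtered by having count(*) > :minsup.]

Figure [?]: Support counting by K-way join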
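The pairwise time-constraint test that PRED(k) stands for in the query above can be sketched as a small function. This is an illustration only. The comparison operators were lost in the text and are reconstructed here from the prose description ("at most window-size", "at most max-gap and at least min-gap apart"). The function and parameter names are hypothetical.

```python
from itertools import combinations


def pair_ok(eno_i, eno_j, time_i, time_j, window_size, min_gap, max_gap):
    """One term P_ij for candidate items i < j matched at time_i and time_j."""
    if eno_j == eno_i:
        # Same element: the two transactions must fall within one window.
        return abs(time_j - time_i) <= window_size
    if eno_j == eno_i + 1:
        # Adjacent elements: the time difference must lie in [min_gap, max_gap].
        return min_gap <= time_j - time_i <= max_gap
    # Non-adjacent elements: no direct constraint on this pair (assumed).
    return eno_j > eno_i + 1


def pred(enos, times, window_size, min_gap, max_gap):
    """Conjunction of P_ij over every pair of items of one candidate."""
    return all(
        pair_ok(enos[i], enos[j], times[i], times[j],
                window_size, min_gap, max_gap)
        for i, j in combinations(range(len(enos)), 2)
    )


# A candidate with element numbers (1, 1, 2), matched against transactions
# at hypothetical times 10, 11 and 15.
supported = pred([1, 1, 2], [10, 11, 15], window_size=2, min_gap=1, max_gap=5)
```

Because the test is a conjunction over all pairs, a single violated pair is enough to reject a match, which is why the query can apply it as one join predicate.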

PAGE 128

    insert into F_k
    select item_1, eno_1, ..., item_k, eno_k, count(*)
    from (select distinct sid, item_1, eno_1, ..., item_k, eno_k
          from (Subquery Q_k))
    group by item_1, eno_1, ..., item_k, eno_k
    having count(*) > :minsup

Subquery Q_l (for any l between 1 and k):

    select item_1, eno_1, time_1, ..., item_l, eno_l, t_l.time, sid
    from D t_l, (Subquery Q_{l-1}) as r_{l-1},
         (select distinct item_1, eno_1, ..., item_l, eno_l from C_k) as d_l
    where r_{l-1}.item_1 = d_l.item_1 and ... and
          r_{l-1}.item_{l-1} = d_l.item_{l-1} and
          r_{l-1}.sid = t_l.sid and
          t_l.item = d_l.item_l and
          SubQ_PRED(l)

Subquery Q_0: No subquery Q_0.

[Tree diagram: Subquery Q_l is built by joining D t_l, the output r_{l-1} of Subquery Q_{l-1}, and the distinct l-item prefixes d_l of C_k.]

Figure [?]: Subquery optimization for KwayJoin approach

PAGE 129

(item, sid, time) order and passed to the function Gather, which collects the (sid, time) attribute values of all tuples of D with the same item in memory. For items that meet the minimum support criterion, the function outputs an (item, s-list) pair. The s-lists are maintained sorted using sid as the major key and time as the minor key, and are stored in a new SlistTable with the schema (item, s-list).

[...]aries of it in each of the s-lists. Note that the s-lists are maintained sorted on (sid, time). The UDF also groups the s-lists according to the elements of the candidate sequence us-

PAGE 130

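    insert into F_k
    select t.item_1, t.eno_1, ..., t.item_k, t.eno_k, cnt
    from (select item_1, eno_1, ..., item_k, eno_k,
                 CountIntersect-K(C_k.eno_1, ..., C_k.eno_k,
                                  d_1.s-list, ..., d_k.s-list,
                                  window-size, min-gap, max-gap) as cnt
          from C_k, SlistTable d_1, ..., SlistTable d_k
          where d_1.item = C_k.item_1 and ... and d_k.item = C_k.item_k) as t
    where cnt > :minsup

[Tree diagram: C_k is joined with k copies of SlistTable; the UDF CountIntersect-K computes cnt, and the result t is filtered on cnt > :minsup.]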

PAGE 131

GatherJoin

The GatherJoin approach for association rules (refer Section [?]) can be extended to mine sequential patterns also. In this approach, we first Gather all the (tid, item) pairs corresponding to fixed values of sid. We then generate all possible sequences using a table function, join them with C_k and group the join result to count the support of the candidate sequences. The time constraints are checked on the table function output using join predicates PRED(k), as in the KwayJoin approach. The number of tuples generated by the table function can be reduced by applying the max-gap, min-gap and window-size constraints inside the table function. However, PRED(k) is still required to check if the sequences in C_k are contained in the generated sequences.

The items in a candidate sequence are not lexicographically ordered. Therefore, a data-sequence containing n items can potentially support n^k [...]

[...] are applicable for sequential patterns also. The basic idea is to replace each data-sequence d with an "extended-sequence" d'. Each transaction of d is extended with all the ancestors of its items to get d'. In order to optimize the performance, we precompute the ancestors and prune candidates with an element that contains both an item and its ancestor

PAGE 132

[...] experiments. The SQL queries for the higher numbered passes become too long, since there are several join predicates, and as a result the optimizer does not generate optimal execution plans. Further, in the Vertical approach, the user-defined functions for support counting are much more complex than their association rule counterparts. We also did not implement the extensions to handle taxonomies in this case.

PAGE 133

CHAPTER [?]
INCREMENTAL MINING

Data mining discovers information within data warehouses and finds answers to questions about your data that you haven't thought to ask. The rules discovered from a database only reflect the current state of the database. In order to make the discovered rules reliable and useful, large volumes of data need to be collected and analyzed over a period of time. This entails the development of techniques to handle large volumes of data and to main[...]

[...] for updating the frequent itemsets has been developed for the addition of new transactions to the database []. It is based on the Apriori algorithm and needs

PAGE 134

O(n) passes over the database, where n is the size of the maximal frequent itemset. Another incremental algorithm is presented in Feldman et al. [].

In this chapter, we present an algorithm to find the new frequent itemsets with minimal recomputation when new transactions are added to or deleted from the transaction database []. Deletion is important in cases where we want to analyze the data in a sliding time window. The important characteristics of our algorithm are the following:

• Along with the frequent itemsets, we also maintain the negative border []. The algorithm uses negative borders to decide when to scan the whole database, and it can be used in conjunction with any level-wise algorithm like Apriori [] or Partition [].

[...]

Note that the negative border can be maintained while computing the frequent itemsets without any additional computation overhead.

PAGE 135

we present the performance results of the SQL-based incremental algorithm []. In Sec[...]

[...] from F. The negative border (NBd(F)) of a set of frequent itemsets F can be computed by repeating the join and prune steps of the apriori-gen function in the Apriori algorithm []. This computation can be done using only the set of frequent itemsets F, and the database need not be scanned.

Definition [?]. The negative border NBd(F) of a collection of itemsets F is defined as follows: Given a collection F ⊆ P(R) of sets, closed with respect to the set inclusion relation, the negative border NBd(F) of F consists of the minimal itemsets X ⊆ R not in F [].

The apriori-gen function (described in Figure [?]) takes as argument F_{k-1}, the set of all frequent (k - 1)-itemsets. It returns the set of k-itemsets that are potentially frequent

PAGE 136

    function apriori-gen(F_{k-1})
        for each p and q ∈ F_{k-1} do
            if p.item_1 = q.item_1, ..., p.item_{k-2} = q.item_{k-2}
               and p.item_{k-1} < q.item_{k-1} then
                insert p.item_1, p.item_2, ..., p.item_{k-1}, q.item_{k-1} into C_k
        for each c ∈ C_k
            delete c from C_k if some (k - 1)-subset of c is not in F_{k-1}

Figure [?]: A high-level description of the apriori-gen function

The negative border consists of all itemsets that were candidates which did not have the minimum support in the level-wise method. That is, NBd(F_k) = C_k - F_k, where C_k is the set of candidate k-itemsets, F_k is the set of frequent k-itemsets and NBd(F_k) is the set of k-itemsets in NBd(F). Therefore, F_k ∪ NBd(F_k) = C_k. The negativeborder-gen function to compute NBd(F) is explained in Figure [?].

    function negativeborder-gen(F)
        Split F into F_1, F_2, ..., F_n, where n is the size of the largest itemset in F
        for all k ≤ n do
            compute C_{k+1} using apriori-gen(F_k)
        F ∪ NBd(F) = (∪_{k=2,...,n+1} C_k) ∪ I_1, where I_1 is the set of 1-itemsets

Figure [?]: A high-level description of the negativeborder-gen function

Lemma [?]. All 1-itemsets should be present in F ∪ NBd(F).

PAGE 137

Proof: The lemma follows directly from the definition of negative border, since any 1-itemset not in F is a minimal itemset not in F. □

Addition of New Transactions

When new transactions are added to the database, an old frequent itemset could potentially become infrequent in the updated database. Similarly, an old infrequent itemset could potentially become frequent in the new database. In order to solve the update problem efficiently, we maintain the frequent itemset and the negative border along with their support counts in the database. That is, for every s ∈ F ∪ NBd(F), we maintain s.count.

In the rest of this section, DB denotes the original database, db denotes the transactions that are newly added and DB+ denotes the updated database. Also, F^DB, F^db and F^DB+ denote the frequent itemsets, and NBd(F^DB), NBd(F^db) and NBd(F^DB+) denote the negative borders, of the original database, the increment database and the updated database respectively.

Lemma [?]. Let s be any itemset such that s ∉ F^DB. Then s ∈ F^DB+ only if s ∈ F^db.

Proof: Assume that there exists an itemset s such that s ∉ F^DB, s ∈ F^DB+ and s ∉ F^db. Let t_DB(s) and t_db(s) be the number of transactions in DB and db, respectively, containing the itemset s. Also, let t_DB and t_db be the total number of transactions in DB and db respectively. Since s ∉ F^DB and s ∉ F^db,

    t_DB(s) / t_DB < minSupport  and  t_db(s) / t_db < minSupport.

From these two equations, it can be shown that

PAGE 138

    (t_DB(s) + t_db(s)) / (t_DB + t_db) < minSupport.

Therefore s ∉ F^DB+, which is a contradiction. □

Lemma [?]. Let s be an itemset such that s ∈ NBd(F). Then all possible subsets of s must be present in F.

Proof: For a contradiction, let t be an itemset such that t ⊂ s and t ∉ F. By the definition of negative border, NBd(F) consists of the minimal itemsets not in F. Since t ∉ F, s is not a minimal itemset not in F. Therefore, s cannot be in NBd(F), which is a contradiction. □

Theorem [?]. Let s be an itemset such that s ∉ F^DB ∪ NBd(F^DB) and s ∈ F^DB+. Then there exists an itemset t such that t ⊂ s, t ∈ NBd(F^DB) and t ∈ F^DB+. That is, some subset of s has moved from NBd(F^DB) to F^DB+.

Proof: Since s ∈ F^DB+, all possible subsets of s should be in F^DB+. But all the subsets of s cannot be in F^DB, because if that was the case then s should be present in at least NBd(F^DB), if not in F^DB itself. By our assumption, s ∉ F^DB ∪ NBd(F^DB). Therefore, there exists an itemset t such that t ⊂ s and t ∉ F^DB. Now we have two cases.

Case (i): t ∈ NBd(F^DB). In this case, t ∈ F^DB+ since s ∈ F^DB+ and t ⊂ s. Therefore, we have found a subset of s which has moved from NBd(F^DB) to F^DB+.

Case (ii): t ∉ NBd(F^DB). That is, t ∉ F^DB ∪ NBd(F^DB). But we know that t ∈ F^DB+ since s ∈ F^DB+ and t ⊂ s. Therefore, t ∉ F^DB ∪ NBd(F^DB) and t ∈ F^DB+, and hence we can apply the theorem recursively on t. Note that the size of t is less than the size of s since t ⊂ s.

PAGE 139

When this is applied recursively, there are two possibilities. First, for some subset of t, case (i) holds true, in which case there is a subset of t which has moved from NBd(F^DB) to F^DB+ and hence the theorem is proved. Otherwise, t will finally become a 1-itemset. By Lemma [?], we know that all 1-itemsets are present in F^DB ∪ NBd(F^DB). Since t ∉ F^DB, t ∈ NBd(F^DB), which contradicts the assumption for case (ii). That is, case (ii) is not possible if t is a 1-itemset. □

By Theorem [?], if none of the itemsets move from the negative border to the frequent itemset, we do not need to scan the whole database. Even in cases where some itemsets move from the negative border to the frequent itemset, a complete database scan is required only if the negative border expands, because for all the itemsets in the negative border we can derive the updated support count without a database scan. We maintain the support count for all itemsets in the frequent itemset and the negative border.

First, we compute the frequent itemset in db using a level-wise algorithm like Apriori or Partition. Simultaneously, we count the support for all itemsets in F^DB ∪ NBd(F^DB) in db. If an itemset t ∈ F^DB does not have minimum support in DB ∪ db, then t is removed from F^DB. This can be easily checked since we know the support count for t in DB and db. The change in F^DB could potentially change NBd(F^DB) also. Therefore, we have to recompute the negative border using the negativeborder-gen function explained in subsection [?].

On the other hand, there could be some new itemsets which become frequent in the updated database. Let s be an itemset which gets added to the frequent itemset of the updated database. By Lemma [?], we know that s has to be in F^db. We also know, by Theorem [?], that some subset of s must move from NBd(F^DB) to F^DB+. For each itemset s ∈ F^db, we check if s gets the minimum support to move from NBd(F^DB) to F^DB+. If

PAGE 140

    function Update-Frequent-Itemset(F^DB, NBd(F^DB), db)
        // DB and db denote the number of transactions in the original
        // database and the increment database respectively.
        Compute F^db
        for each itemset s ∈ F^DB ∪ NBd(F^DB) do
            t_db(s) = number of transactions in db containing s
        F^DB+ = φ
        for each itemset s ∈ F^DB do
            if (t_DB(s) + t_db(s)) ≥ minsup * (DB + db) then
                F^DB+ = F^DB+ ∪ s
        for each itemset s ∈ F^db do
            if s ∉ F^DB and s ∈ NBd(F^DB) and
               (t_DB(s) + t_db(s)) ≥ minsup * (DB + db) then
                F^DB+ = F^DB+ ∪ s
        if F^DB ≠ F^DB+ then
            NBd(F^DB+) = negativeborder-gen(F^DB+)
        else
            NBd(F^DB+) = NBd(F^DB)
        if F^DB ∪ NBd(F^DB) ≠ F^DB+ ∪ NBd(F^DB+) then
            S = F^DB+
            repeat
                compute S = S ∪ NBd(S)
            until S does not grow
            F^DB+ = {x ∈ S | support(x) ≥ minsup}
                // support(x) is the support count of x in DB ∪ db
            NBd(F^DB+) = negativeborder-gen(F^DB+)

Figure [?]: A high-level description of the Update-Frequent-Itemset function
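The first phase of the update procedure in the figure above, together with the negative border computation it relies on, can be sketched in a few lines. This is a simplified illustration under stated assumptions rather than the algorithm as implemented in the dissertation. It joins candidates by set union instead of the prefix rule (after subset pruning the candidate set is the same), it skips the F^db filter that the lemma permits, it treats minsup as a fraction, and it only reports that a full scan is needed instead of computing the negative border closure. All function and variable names are hypothetical.

```python
from itertools import combinations


def apriori_gen(prev):
    """Candidate k-itemsets whose (k-1)-subsets are all in `prev`."""
    prev = set(prev)
    candidates = set()
    for p in prev:
        for q in prev:
            union = p | q
            if len(union) == len(p) + 1 and all(
                frozenset(sub) in prev
                for sub in combinations(union, len(p))
            ):
                candidates.add(union)
    return candidates


def negative_border(frequent, items):
    """Minimal itemsets not in `frequent`: infrequent 1-itemsets plus
    the candidates of every level that are not themselves frequent."""
    by_size = {}
    for s in frequent:
        by_size.setdefault(len(s), set()).add(s)
    border = {frozenset([i]) for i in items} - by_size.get(1, set())
    for k, level in by_size.items():
        border |= apriori_gen(level) - by_size.get(k + 1, set())
    return border


def update_frequent(counts_db, n_db, increment, minsup, items):
    """counts_db maps every itemset of F ∪ NBd(F) to its count in DB.

    Returns (frequent, border, needs_full_scan) for DB plus the increment.
    """
    n_total = n_db + len(increment)
    # Add the support observed in the increment to the stored counts.
    total = {
        s: c + sum(1 for t in increment if s <= t)
        for s, c in counts_db.items()
    }
    frequent = {s for s, c in total.items() if c >= minsup * n_total}
    border = negative_border(frequent, items)
    # A scan of the whole database is needed only when the border now
    # contains itemsets whose counts were never maintained.
    needs_full_scan = bool((frequent | border) - set(counts_db))
    return frequent, border, needs_full_scan


fs = frozenset
# Hypothetical counts for F ∪ NBd(F) over an original database of 4 transactions.
counts = {fs("a"): 3, fs("b"): 2, fs("c"): 1, fs("ab"): 2}
result = update_frequent(counts, 4, [{"a", "b"}, {"a", "b"}], 0.5, "abc")
```

In this sample run the increment leaves the border unchanged, so the stored counts suffice. An increment that promotes an item out of the negative border can introduce new border itemsets, and the function then flags the full scan.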

PAGE 141

none of the itemsets in NBd(F^DB) gets the minimum support, no new itemsets will be added to F^DB+. If some itemsets in NBd(F^DB) get the minimum support, we move them to F^DB+ and recompute the negative border. If F^DB ∪ NBd(F^DB) ≠ F^DB+ ∪ NBd(F^DB+), we have to find the negative border closure of F^DB+ and scan the entire database once to find the updated frequent itemset and negative border. The negative border closure of F is found by repeatedly finding F = F ∪ NBd(F) until F does not grow. During the database scan, all the itemsets which are in the negative border closure that were not originally in F ∪ NBd(F) are used as the candidate itemsets and their support count is computed.

The candidate set can further be pruned by applying an optimization while finding the negative border closure. It can be observed that an itemset which is not frequent in the increment database (db) [...]

[...] denote the updated database, and F^DB- and NBd(F^DB-) denote its frequent itemset and negative border respectively.

PAGE 142

Lemma [?]. Let s be an itemset such that s ∈ F^DB. Then s ∉ F^DB- only if s ∈ F^db. That is, a frequent itemset s will become infrequent only if s ∈ F^db.

This lemma can be proved in the same way as Lemma [?]. The algorithm to compute the frequent itemset and the negative border of DB- is similar to the one in the case where new transactions are added to the database and can be derived easily.

Experimental Results

We conducted a set of experiments to compare the performance of our incremental algorithm. These experiments were performed on a Sun SPARCstation [?] running SunOS [?]. In this section, we report on the results of some of those experiments.

The experiments were performed on synthetic data generated using the same technique as in Agrawal and Srikant []. The dataset used for the baseline experiment was T[?].I[?].D[?]K (Mean size of a transaction = [?], Mean size of maximal potentially large itemsets = [?], Number of transactions = [?] thousand). The increment database is created as follows: We generate [?] thousand transactions, of which ([?] - d) thousand is used for the initial computation and d thousand is used as the increment, where d is the fractional size (in percentage) of the increment.

We compare the execution time of the incremental algorithm with respect to running Apriori on the whole data set. Figure [?] shows the speedup of the incremental algorithm over Apriori for different minimum support thresholds. We report the results for increment sizes of [?]%, [?]%, [?]% and [?]% (shown in the legend). From the graph, it can be seen that the incremental algorithm achieves a speedup of about [?] to [?]. The algorithm shows better speedup for medium support thresholds than for low and high support thresholds. At high

PAGE 143

support thresholds, the number of frequent itemsets and the number of passes in the level-wise algorithm are less and hence it is less costly to run Apriori on the whole database. At low support thresholds, the probability of the negative border expanding is higher and, as a result, the incremental algorithm may have to scan the whole database. Also, the speedup is higher for smaller increment sizes since the incremental algorithm needs to process less data.

[Figure [?]: Speedup of the incremental algorithm. One series per increment size (in %), plotted against Support Threshold (in %).]

Comparison with FUP

The framework of FUP [] is similar to that of Apriori and contains a number of iterations. Each iteration is associated with a complete scan of the whole database, and in iteration k all the frequent k-itemsets are found. The candidate sets for iteration k + 1 are generated based on the frequent itemsets found in iteration k. FUP uses the following steps to compute the frequent itemsets.

At each iteration k, the support of the frequent k-itemsets (L_k) is updated against the increment database db to filter out the itemsets that are no longer frequent in the updated database.

PAGE 144

The set of candidate sets C_k is generated by applying the apriori-gen function on the frequent (k - 1)-itemsets in the updated database. [...]

[...] passes over the database, where n is the size of the maximal frequent itemset.

The steps involved in our incremental algorithm can be summarized as follows.

Compute the frequent itemsets of the increment database (L^db). [...]

[...] thereby reducing the I/O requirements drastically. Computing the

PAGE 145

[...] incremental mining based on the alternatives presented in Chapter [?]. In Section [?], we present two SQL-based approaches for incremental frequent itemset computation. The applicability of the other architectural alternatives is discussed in Section [?].

SQL Formulations of Incremental Mining

Two categories of SQL formulations for frequent itemset mining, based on SQL-92 and SQL-OR, are presented in Chapters [?] and [?] respectively. A cost-based analysis of the SQL approaches and the related performance optimizations are discussed in Thomas and Chakravarthy []. We discuss here how these techniques can be adapted for incremental mining. Efficient SQL formulation of the incremental algorithm entails counting the support of multiple-sized candidates together in the same pass.

The input transaction data is stored in a relational table T with the schema (tid, item). The increment transaction table δT also has the same schema. The frequent itemsets and negative border of size k are stored in tables with the schema (item_1, ..., item_k, count). We discuss the extensions to Subquery and Vertical, two representative approaches from the SQL-92 and SQL-OR categories. We also outline the SQL-based candidate closure computation to compute the negative border closure.

PAGE 146

Subquery Approach

In this approach, support counting is done by a set of k nested subqueries, where k is the size of the largest itemset. We present here the extensions to the subquery approach in Section [?] to count candidates of different sizes. Subquery Q_l finds all tids that support the distinct candidate l-itemsets (d_l). d_l comprises the distinct l-item prefixes of all candidate itemsets. However, it is sufficient to use C_l, the candidate l-itemsets, instead of d_l, since all l-item prefixes of candidate itemsets with more than l items will be present in C_l. The output of Q_l is grouped on the l items to find the support of the candidate l-itemsets. Q_l is also joined with δT (T while counting the support in the whole database) and C_{l+1} to get Q_{l+1}. The SQL queries and the corresponding tree diagrams for the above computations are given in Figure [?]. δB_l stores the support counts of all frequent and negative border itemsets in δT.

The output of subquery Q_l needs to be materialized since it is used for counting the support of l-itemsets and for generating Q_{l+1}. If the query processor is augmented to support multiple streams, where the output of an operator can be piped into more than one subsequent operator, the materialization of the Q_l's can be avoided. In the basic association rule mining, we do not have to count itemsets of different sizes in the same pass since C_{l+1} becomes available only after the frequent l-itemsets are computed.

Tables B_l and δB_l store the frequent and negative border l-itemsets and their support counts in T and δT respectively. Support counts of itemsets in the whole database can be computed by joining B_l and δB_l and adding the corresponding support counts. We add another attribute to B_l to keep track of promoted borders (itemsets that moved from the negative border to the frequent set).

PAGE 147

    insert into δB_l
    select item_1, ..., item_l, count(*)
    from (Subquery Q_l) t
    group by item_1, ..., item_l

Subquery Q_l (for any l between 1 and k):

    select item_1, ..., item_l, tid
    from δT t_l, (Subquery Q_{l-1}) as r_{l-1}, C_l
    where r_{l-1}.item_1 = C_l.item_1 and ... and
          r_{l-1}.item_{l-1} = C_l.item_{l-1} and
          r_{l-1}.tid = t_l.tid and
          t_l.item = C_l.item_l

Subquery Q_0: No subquery Q_0.

[Tree diagram: Subquery Q_l joins δT t_l, the output r_{l-1} of Subquery Q_{l-1}, and C_l; its output feeds both the group by that fills δB_l and Subquery Q_{l+1}.]

Figure [?]: Support counting using subqueries
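The step of combining the stored counts with the counts from the increment, by joining B_l with δB_l and adding the corresponding support counts, can be tried out on a toy instance. The sketch below uses Python's built-in sqlite3 module purely for illustration. The table names `B2` and `dB2`, the column name `cnt` and the sample rows are hypothetical, and the dissertation's experiments did not use SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Support counts of 2-itemsets in the original table T (hypothetical data).
    CREATE TABLE B2  (item1 TEXT, item2 TEXT, cnt INTEGER);
    -- Support counts of the same 2-itemsets in the increment table.
    CREATE TABLE dB2 (item1 TEXT, item2 TEXT, cnt INTEGER);
    INSERT INTO B2  VALUES ('a', 'b', 40), ('a', 'c', 12);
    INSERT INTO dB2 VALUES ('a', 'b', 3),  ('a', 'c', 9);
""")

# Join the two tables on the itemset columns and add the counts to obtain
# the support in the whole (updated) database.
rows = conn.execute("""
    SELECT b.item1, b.item2, b.cnt + d.cnt AS total
    FROM B2 b JOIN dB2 d
      ON b.item1 = d.item1 AND b.item2 = d.item2
    ORDER BY b.item1, b.item2
""").fetchall()
conn.close()
```

Comparing each `total` against the minimum support for the updated database then decides which itemsets stay frequent and which border itemsets are promoted.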

PAGE 148

Computing Candidate Closure

In the Apriori algorithm, candidate itemsets are generated in two steps: the join step and the prune step. In the join step, two sets of (k - 1)-itemsets, called generators and extenders, are joined to get k-itemsets. An itemset s1 in generators joins with s2 in extenders if the first (k - 2) items of s1 and s2 are the same and the last item of s1 is lexicographically smaller than the last item of s2. The join results in an itemset that is s1 extended with the last item of s2. The result of the join step is subjected to subset pruning, which filters out all itemsets with a non-frequent (k - 1)-subset. This can be accomplished by subsequent joins with (k - 2) [...]

[...] PB_{k-1} ∪ C_{k-1} as generators and PB_{k-1} ∪ C_{k-1} ∪ F_{k-1} as extenders and filters. PB_{k-1} and F_{k-1} denote promoted borders and frequent itemsets of size (k - 1) respectively. The candidate generation process starts with C_0 as the empty set and terminates when C_k becomes empty. It is straightforward to derive SQL queries for this process and we do not present them here (refer Section [?]).

Vertical

In the Subquery approach, for every transaction that supports an itemset we generate (itemset, tid) tuples, resulting in large intermediate tables. The Vertical approach avoids this by collecting all tids that support an itemset into a BLOB (binary large object) and

PAGE 149

generates (itemset, tid-list) tuples. Initially, tid-lists for individual items are created using a table function. The tid-list for an itemset is obtained by intersecting the tid-lists of its items using a user-defined function (UDF).

[...]

The increment database is created as follows: We generate

PAGE 150

[?] thousand transactions, of which ([?] - d) thousand is used for the initial computation and d thousand is used as the increment, where d is the fractional size (in percentage) of the increment.

[Figure [?]: Speedup of the incremental algorithm based on the Subquery approach. One series per increment size (in %).]

[Figure [?]: Speedup of the incremental algorithm based on the Vertical approach.]

We compare the execution time of the incremental algorithm with respect to mining the whole dataset. Figures [?] and [?] show the corresponding speedups of the incremental algorithm based on the Subquery and the Vertical approaches for different minimum

PAGE 151

support thresholds. We report the results for increment sizes of [?]%, [?]% and [?]% (shown in the legend). We can make the following observations from the graphs.

• The incremental algorithm based on the Subquery approach achieves a speedup of about [?] to [?] as compared to mining the whole dataset. However, the maximum speedup of the Vertical approach is only about [?]. For support counting, the Vertical approach uses a user-defined function (UDF) to intersect the tid-lists. The incremental algorithm should also invoke the UDF at least the same number of times, since the support of all the itemsets in the frequent set and the negative border needs to be found in the increment database. In cases where the support of new candidates needs to be counted, the number of invocations will be even more as compared to mining the whole dataset. The time taken by the Vertical approach in the support counting phase is directly proportional to the number of times the UDF is called. However, the incremental algorithm saves in the tid-list creation phase since the size of the increment dataset is only a fraction of the whole dataset. This explains why the speedup of the Vertical approach is low. In contrast, the Subquery approach achieves higher speedup since the time taken is proportional to the size of the dataset.

  It is possible to achieve better speedup for the Vertical approach by allocating a smaller BLOB (binary large object) for computations involving the increment dataset. Note that the tid-lists for the items are stored as BLOBs. In our experiments, we used the same BLOB size for the increment dataset and the initial dataset in order to use the same user-defined function for support counting and the same table function for tid-list creation (refer Section [?] for a detailed description of the Vertical approach).

PAGE 152

• The speedup reduces as the minimum support threshold is lowered. At lower support values, the chance of the negative border expanding is higher and, as a result, the incremental algorithm may have to compute the candidate closure and count the support of the new candidates in the whole dataset.

• The speedup is higher for smaller increment sizes since the incremental algorithm needs to process less data.

• With respect to the absolute execution time, the Subquery and the Vertical ap[...]

[...]-itemsets. At this step, the new candidate k-itemsets that are infrequent in db are known to be infrequent in the whole dataset as well and can be pruned. This is because the new candidate k-itemsets were infrequent in the old dataset (they were not even in the negative border). Therefore, they need to be frequent at least in db to have a chance of being frequent in the whole dataset.

With the new-candidate optimization, we count the support of an itemset in db only if it is required. In the first phase, while counting the support in db of the itemsets in the frequent set and the negative border, we do not find all the frequent itemsets in db. During

PAGE 153

WKH FDQGLGDWH FORVXUH FRPSXWDWLRQ ZH FRXQW WKH VXSSRUW LQ GE RI MXVW WKH QHZFDQGLGDWHV IRU WKH SUXQLQJ H[SODLQHG DERYH 7KLV UHVXOWV LQ EHWWHU VSHHGXS DV FRPSDUHG WR WKH EDVLF LQFUHPHQWDO DOJRULWKP R b ‘ b QL b)LJXUH 6SHHG XS RI WKH LQFUHPHQWDO DOJRULWKP EDVHG RQ WKH 6XETXHU\ DSSURDFK ZLWK WKH QHZFDQGLGDWH RSWLPL]DWLRQ ( b P b & 4b )LJXUH 6SHHG XS RI WKH LQFUHPHQWDO DOJRULWKP EDVHG RQ WKH 9HUWLFDO DSSURDFK ZLWK WKH QHZFDQGLGDWH RSWLPL]DWLRQ )LJXUHV DQG VKRZ WKH VSHHGXS RI WKH 6XETXHU\ DQG 9HUWLFDO DSSURDFKHV ZLWK WKH QHZFDQGLGDWH RSWLPL]DWLRQ :H FDQ REVHUYH WKDW WKLV RSWLPL]DWLRQ DFKLHYHV VSHHG XSV WKDW DUH XS WR b EHWWHU 7KH LPSURYHPHQW LV PRUH DW VPDOOHU LQFUHPHQW VL]HV
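The pruning argument behind the new-candidate optimization can be sketched in Python. This is only an illustration under stated assumptions, not the SQL formulation used in this work: the function names, the in-memory list of transactions standing in for db, and the use of a relative minimum support (a fraction between 0 and 1) are all introduced here for the example.

```python
def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def prune_new_candidates(new_candidates, increment, min_support):
    """Keep only the new candidates that are frequent in the increment.

    Assumption: every itemset in `new_candidates` was infrequent in the old
    dataset (it was not even in the negative border). Such an itemset can be
    frequent in old + increment only if it is frequent in the increment, so
    the others are dropped before the scan of the whole dataset.
    """
    threshold = min_support * len(increment)
    return {c for c in new_candidates
            if support_count(c, increment) >= threshold}


# Hypothetical increment of four transactions.
db = [frozenset("ab"), frozenset("ac"), frozenset("abc"), frozenset("b")]
candidates = {frozenset("ab"), frozenset("bc"), frozenset("ac")}
survivors = prune_new_candidates(candidates, db, min_support=0.5)
```

Only the survivors need to be counted against the whole dataset; in this toy input the itemset {b, c} is dropped because it occurs in a single transaction of the increment.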

The reason is that when the increment is smaller, we have to use proportionately smaller minimum support values while finding the frequent itemsets in db. This could result in counting too many spurious candidates.

Other Approaches

In the Loose-coupling approach, the transaction data is read tuple by tuple from the DBMS to the mining kernel using a cursor interface. This architecture can be extended to handle incremental mining just by implementing the incremental algorithm outlined in Section [...] in the mining kernel. The DBMS interface does not require any change. In cases where the support of new itemsets needs to be counted, limiting the data access to just one scan of the whole database entails counting candidate itemsets of multiple sizes in the same pass. This can be accomplished by passing the transactions through all the candidates of different sizes and updating their support counts.

The Stored-procedure approach, where the mining algorithm is encapsulated as a stored procedure that runs in the same address space as the DBMS, and the Cache-Mine approach, where the data is cached outside the DBMS, can also be extended for incremen[...]

[...]rived. Let I = {i1, i2, ..., im} be a set of literals, called items, that are attribute values of a set of relational tables. A constrained association is defined as a set of itemsets {X | X ⊆ I & C(X)}, where C denotes one or more boolean constraints. Note that we do not concentrate on generating the association rules in the traditional sense []. However, the associations between attribute values that we generate can be used for rule generation.

Categories of Constraints

We divide the constraints into four different categories that are outlined below. We illustrate each of them with sample mining computations. The data model used in our examples is that of a point-of-sale (POS) [...]

Footnote: We refer the reader to Srikant et al. [] and Ng et al. [] for nice discussions of various kinds of constraints. Here, we categorize them based on their usage in the mining process, which is explained in Section [...] with an example.

[...] (support-confidence) framework for association rule mining []. An itemset X is said to be frequent if it appears in at least s transactions, where s is the minimum support. In the point-of-sales data model, a transaction corresponds to a customer transaction. Note that in other domains, the notion of a transaction and of an itemset appearing in a transaction could be different. Frequent itemsets, the ones which satisfy the frequency constraint, are defined as {X | f(X) ≥ s}, where f(X) is the frequency of X.

Most of the algorithms for frequent itemset discovery utilize the downward closure property of itemsets with respect to the frequency constraint: that is, if an itemset is frequent, then so are all its subsets. Downward closure is a pruning property. Levelwise algorithms [] find all itemsets with a given property among itemsets of size k (k-itemsets) and use this knowledge to explore (k+1)-itemsets. They start with the assumption that all (k+1)-itemsets are potentially frequent (frequency is just an example of a downward closed property). As the k-itemsets are examined, they prune out some (k+1)-itemsets that cannot be frequent. In effect, for pruning they use the contrapositive of the frequent itemset definition ("if any subset of a (k+1)-itemset is not frequent, then neither can the (k+1)-itemset"). After the pruning, they go through the remaining list, checking each (k+1)-itemset for its frequency. The downward closure property is similar to the anti-[...]

[...] be associated with a frequency constraint also, where we want to find frequent itemsets that satisfy B. Three different algorithms for mining associations with item constraints are presented in Srikant et al. []. The item constraints enable us to pose mining queries such as "What are the products whose sales are affected by the sale of, say, barbecue sauce?" and "What products are bought together with sodas and snacks?" The [...]variable constraints in Ng et al. [] are a special case of item constraints.

Aggregation Constraint

These are constraints involving aggregate functions on the items that form the itemset. For instance, in the POS example, an aggregation constraint could be of the form min(products_sold.price) ≥ p. Here we consider a product as an item. The aggregate function could be min, max, sum, count, avg or any other user-defined aggregate function. An aggregation constraint of the form min(products_sold.price) ≥ p can be used to find "expensive products that are bought together". Similarly, max(products_sold.price) ≤ q can be used to find "inexpensive products that are bought together". These aggregate functions can be combined in various ways to express a whole range of useful mining computations. For example, the constraint (min(products_sold.price) ≤ p) & (avg(products_sold.price) ≥ q) targets the mining process to inexpensive products that are bought together with the expensive ones.

External Constraint

External constraints filter the data used in the mining process. These are constraints on attributes which do not appear in the final result (which we call external attributes). For example, if we want to find "products bought during big purchases where the total sale price of the transaction is larger than some amount" [...]

[...] specific example using the point-of-sales data model in Section [...]. The example shown in Figure [...] finds product combinations containing "barbecue sauce" where all the products cost less than [...] and the average price is more than [...] (some notion of similar priced products). The combinations should appear in at least [...] sales transactions with the total price of the transaction greater than [...]. This gives an idea of what other moderately priced products people buy with barbecue sauce in big purchases (perhaps for parties). It could help the shop owner to decide on various promotions.

Figure [...]: Point-of-sales example for constrained association mining (labels recoverable from the figure: F_k, Products_sold, Sales, Product)

In the above example, the constraint on the total price of the transaction is an external constraint and is applied in the data filtering stage. Since the maximum price of a product in the desired combination is [...], we can also filter out records that do not satisfy this condition in the data filtering stage. max(Price) < [...] is an aggregation constraint which satisfies the closure property and can be applied in the candidate generation phase. The constraint to include barbecue sauce in the combination and the constraint on the average price are checked in the post-counting phase, since they do not satisfy the closure property.

Incremental Constrained Association Mining

The negative border based incremental mining algorithm is applicable for mining as[...]

[...] FrequentSets ∪ NegativeBorder in db. For this, we use the framework outlined in Section [...]. The different constraints that are present can be handled at the various steps in the mining process as shown in Figure [...].
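The placement of the constraints in the barbecue sauce example can be illustrated with a small Python sketch. Everything concrete in it is an assumption made for illustration: the toy prices and transactions, the threshold values, and the function name `mine_constrained`. It also enumerates itemsets by brute force level by level rather than using the candidate generation of this work, so it shows only where each kind of constraint is applied, not how the mining is implemented.

```python
from itertools import combinations

# Hypothetical toy data: item prices and sales transactions.
PRICE = {"bbq_sauce": 4, "chips": 3, "soda": 2, "steak": 20}
SALES = [
    {"bbq_sauce", "chips", "soda", "steak"},
    {"bbq_sauce", "chips", "steak"},
    {"chips", "soda"},
    {"bbq_sauce", "chips", "soda", "steak"},
]


def mine_constrained(sales, price, min_total, max_price, min_avg,
                     required, min_count):
    # Data filtering stage: the external constraint on the transaction total.
    big = [t for t in sales if sum(price[i] for i in t) > min_total]
    # max(price) < max_price satisfies the closure property, so items that
    # violate it can be dropped before any itemset is formed.
    filtered = [{i for i in t if price[i] < max_price} for t in big]
    items = sorted(set().union(*filtered))

    results = {}
    for k in range(1, len(items) + 1):
        any_frequent = False
        for combo in combinations(items, k):
            itemset = frozenset(combo)
            count = sum(1 for t in filtered if itemset <= t)
            if count < min_count:          # frequency constraint
                continue
            any_frequent = True
            # Post-counting stage: constraints without the closure property.
            avg = sum(price[i] for i in itemset) / len(itemset)
            if required in itemset and avg > min_avg:
                results[itemset] = count
        if not any_frequent:               # downward closure: stop early
            break
    return results


found = mine_constrained(SALES, PRICE, min_total=20, max_price=10,
                         min_avg=2.5, required="bbq_sauce", min_count=2)
```

The item inclusion and average price checks are deliberately applied only after counting: an itemset that fails them may still have supersets that pass, so pruning on them earlier would lose results.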

• Update the support count of all itemsets in FrequentSets ∪ NegativeBorder to include their support in db.

• Find the itemsets in NegativeBorder that became frequent by the addition of db (call them Promoted-NegativeBorder (PNb)).
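The two steps above can be sketched as follows. This is a minimal in-memory illustration, assuming itemsets are `frozenset` objects, support counts are kept in a dictionary, and the minimum support is a fraction of the dataset size; the name `update_with_increment` is hypothetical and the remaining steps of the algorithm are not shown.

```python
def update_with_increment(counts, frequent, negative_border,
                          old_size, increment, min_support):
    """Add the support in the increment to the stored counts and return
    the updated counts together with the promoted negative border."""
    new_counts = dict(counts)
    for itemset in frequent | negative_border:
        in_db = sum(1 for t in increment if itemset <= t)
        new_counts[itemset] = counts[itemset] + in_db

    # An itemset is frequent in the updated dataset if its total count
    # reaches min_support times the new number of transactions.
    threshold = min_support * (old_size + len(increment))
    promoted = {x for x in negative_border if new_counts[x] >= threshold}
    return new_counts, promoted


# Hypothetical state after mining an old dataset of four transactions.
a, b, ab = frozenset("a"), frozenset("b"), frozenset("ab")
old_counts = {a: 3, b: 2, ab: 1}
new_counts, pnb = update_with_increment(
    old_counts, frequent={a, b}, negative_border={ab},
    old_size=4, increment=[ab, ab], min_support=0.5)
```

In this toy input the itemset {a, b} moves from the negative border into the promoted set, which is the situation in which the candidate closure has to be computed.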

Applicability Beyond Association Mining

The incremental algorithm we have proposed makes use of the closure property of frequent itemsets. In this section, we discuss how the incremental approach can be generalized to other data mining and decision support problems. In Section [...], we discuss the applicability of the incremental algorithm to mine closed sets. Generalizations to in[...]

[...] said to be correlated if any of its subsets are dependent. For efficient evaluation of correlation rules, a notion of support is introduced in Brin et al. [], which is downward closed. However, these properties are not incrementally updatable.

Consequent part of association rules: For any frequent itemset, if a rule with consequent c holds (that is, it has minimum confidence) [...]

[...] "uninteresting"). For example, let us assume that we want to evaluate the mining query "What are the interesting drivers that caused customers to buy the widgets from a catalog?" A driver is deemed "interesting" if it has caused at least [...] customers to buy the widget. Let the data be stored in a set of relational tables, namely catalog(widget, manufacturer), sale(customer, widget), driver(customer, widget, driver). The above query can be written as a query flock in Datalog as shown below. The filter condition prunes out values which do not have minimum support. In Section [...], we discuss how the negative border idea can be used for efficient incremental evaluation of query flocks.

    QUERY:  answer(C) :- sale(C, W) AND driver(C, W, D) AND catalog(W, manufacturer)
    FILTER: COUNT(C) ≥ [...]

Incremental Evaluation of Query Flocks

Applying the a-priori technique for evaluating the above query flock will result in the following safe subqueries []:

    Q1: answer(C) :- sale(C, W)
    Q2: answer(C) :- driver(C, W, D)
    Q3: answer(C) :- sale(C, W) AND driver(C, W, D)
    Q4: answer(C) :- sale(C, W) AND driver(C, W, D) AND catalog(W, manufacturer)

The query flocks corresponding to the safe subqueries form a lattice with query containment as the partial order and the original query flock as the top element. That is, if Qi and Qj are elements of the lattice, Qi ≤ Qj ⇔ the result of Qj is contained in the result of Qi. During the execution of the subqueries of the query flock, all records with parameter values which satisfy the filter condition are propagated to the next higher subquery in the lattice for further evaluation. For example, after Q1 is evaluated, all the records corre[...]

[...] depending upon the subquery representing the lattice element. It is also possible to evaluate the filter condition for all the lattice elements [...]

[...] hand, the negative border based change propagation can also be applied for the maintenance of views involving monotone aggregate functions that satisfy the apriori subset property. For example, if the view definition has a filter condition (such as the SQL "having" clause), it could be beneficial to also store the records which do not pass the filter condition test (same as the negative border) [...]
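The idea of keeping the records that fail the filter can be illustrated with a small Python sketch. It is an assumption-laden toy, not a description of any DBMS: the class name `CountView` is hypothetical, the view is a grouped COUNT with a "having COUNT(*) >= threshold" style filter, and only insertions are handled.

```python
from collections import Counter


class CountView:
    """Toy materialized view: group counts filtered by a minimum count.

    Counts are stored for every group, including the groups that currently
    fail the filter (the analogue of the negative border), so insertions
    can be applied without rescanning the base data.
    """

    def __init__(self, rows, threshold):
        self.threshold = threshold
        self.counts = Counter(rows)

    def visible(self):
        """Groups that pass the filter condition."""
        return {k: c for k, c in self.counts.items() if c >= self.threshold}

    def insert(self, rows):
        """Apply inserted rows; return the groups that newly pass the filter."""
        before = set(self.visible())
        self.counts.update(rows)
        return set(self.visible()) - before


view = CountView(["x", "x", "y"], threshold=2)
newly_visible = view.insert(["y", "z"])
```

Had the count for group "y" been discarded when it failed the filter, the insertion could not have been resolved from the stored view alone.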

[...] was never worse than a factor of two on the real-life datasets. Both these approaches require additional space for caching, however. The Stored-procedure approach does not require any extra space (except possibly for initially sorting the data in the DBMS) and can perhaps be made to be within a factor of two to three of Cache-Mine with the recent algorithms []. The UDF approach is about a factor of two faster than the Stored-procedure approach but is significantly harder to code. The SQL approach offers some secondary advantages like easier development and maintenance and potential for automatic parallelization. However, it might not be as portable as the Cache-Mine approach across different database management systems.

Next, we wanted to find out if it is possible to handle other more complex mining tasks within the same SQL framework. We studied generalized association rules and sequential patterns with the goal of showing that the association rule framework can be extended easily for these mining problems as well. We developed several SQL formulations for generalized association rule and sequential pattern mining, some of them by extending the association rule approaches. The major addition for generalized association rules was to "extend" the input transaction table (transform T to T*) [...]

[...] "correct" support value is difficult. Initially, the frequent itemsets for an approximate support can be computed, which is further refined based on user feedback. We further show the applicability of the incremental algorithm to certain classes of data mining and decision support problems.

In Section [...], we identify certain extensions to the current database management systems that are useful for mining. We enumerate the specific contributions of this dissertation in Section [...], and in Section [...] we list possibilities for further research.

Proposed Extensions

One of the goals of our work has been to "unbundle" [...]

[...] lists obtained during multiple index scans []. In current OLAP and data warehousing systems, this operation is rampant in the popular bit-mapped indices. Our Vertical approach is curiously similar to this bit-mapped approach, with one important difference. Instead of performing ANDs on RIDs, we perform the ANDs on another attribute, the transaction identifier.
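A minimal Python sketch of this AND on transaction identifiers is given below. It is an illustration under assumptions: tid-lists are plain sorted Python lists rather than BLOBs handled by a user-defined function, and the names `build_tidlists`, `intersect` and `support` are hypothetical.

```python
def build_tidlists(transactions):
    """Map each item to the sorted list of transaction ids containing it."""
    tidlists = {}
    for tid, items in transactions:
        for item in set(items):
            tidlists.setdefault(item, []).append(tid)
    for tids in tidlists.values():
        tids.sort()
    return tidlists


def intersect(left, right):
    """Merge-style intersection of two sorted tid-lists (the AND step)."""
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            out.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1
        else:
            j += 1
    return out


def support(itemset, tidlists):
    """Support of a non-empty itemset: size of the intersected tid-lists."""
    lists = [tidlists.get(item, []) for item in itemset]
    result = lists[0]
    for tids in lists[1:]:
        result = intersect(result, tids)
    return len(result)


# Hypothetical transactions as (tid, items) pairs.
tl = build_tidlists([(1, {"a", "b"}), (2, {"a"}), (3, {"a", "b", "c"})])
```

Because each tid-list is sorted, one linear merge per pair of lists is enough, which mirrors ANDing two RID lists from index scans except that the lists hold transaction identifiers.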

We also used the set difference operation for pruning itemsets containing an item and its ancestor in generalized association rules.

Enhanced Aggregation

Another common operation that we used is the Gather operation, which can transform two attributes in a data table to a form where, for distinct values of one of the attributes (called the grouping attribute) [...]

[...]'s internals to force such an order in our experiments. The Gather function can also be treated as an aggregate function that simply concatenates its arguments. Systems that have support for user-defined aggregate functions [] can easily support such a functionality.

There are several other decision support applications where gather could be extremely useful. For instance, suppose we are trying to classify customers of a credit card company and the data, in addition to static customer attributes like age and salary, consists of detailed transaction data about each purchase activity of a customer. In this case, we would like to gather all activities of a customer as one time series and extract features from these time series for classification. In OLAP applications too, data often needs to be converted back and forth from a format where different measures are gathered together as [...]

[...] dataset and also to fine tune input parameters. There are several mining algorithms which use sampling [].

Contributions

In this dissertation, we have addressed the following problems:

1. Analyze the different database integration alternatives for data mining.
2. Develop and implement various SQL-based approaches for association rule mining.
3. Study the performance profile of current DBMSs to execute the above SQL queries.
4. Compare the different database integration architectures quantitatively and qualitatively.
5. Develop cost formulae for the SQL approaches based on input data parameters and relational operator costs. These provide some insights into enhancing current cost-based optimizers to incorporate the domain-specific semantics of mining algorithms.
6. Extend the association rule framework for mining generalized association rules and sequential patterns.
7. Develop an incremental association rule mining algorithm.
8. SQL formulations of the incremental algorithm and its generalization to handle constraints.
9. Generalize the incremental algorithm for constrained association mining and demonstrate its applicability to a larger class of data mining and decision support problems.
10. Explore primitive operators to support mining and decision support in databases.

Footnote: The work corresponding to the first four items was primarily done by researchers from IBM Almaden Research Center and the author was a contributor. For the remaining items, the author was the primary contributor. The file-based incremental association rule mining algorithm (item [...]) was developed primarily by the author as part of the Introduction to Parallel Computing course project.

Future Work

The work presented in this dissertation points to several directions for future research. A natural next step is to experiment with other kinds of mining operations, such as classification and clustering [], to verify if our conclusions hold for these other cases too. It is also important to derive a set of primitive operators with which the different mining and decision support operations can be composed. The operators we have identified provide some headway in that direction. Another useful direction is to explore what kind of support is needed for answering short, interactive, ad hoc queries involving a mix of mining and relational operations. The success of SQL as the most popular data management language can be mainly attributed to its ad hoc query support. In the mining context, what is an ad hoc mining query? Is it just mining computations expressed with some constraints on the result? How much support can we leverage from existing relational engines for min[...]

[...] data mining and decision support operations on the same data, tighter integration of these operations with database systems will be required. This dissertation presents strategies aimed at tighter database integration of mining and identifies optimizations and primitive operators to make database systems a better platform for mining and decision support.

REFERENCES

[] S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, F. Naughton, R. Ramakrishnan, and S. Sarawagi. On the computation of multidimensional aggregates. In Proc. of the [...]nd Int'l Conference on Very Large Databases, pages [...], Mumbai (Bombay), India, September [...].
[] R. Agrawal, A. Arning, T. Bollinger, M. Mehta, Shafer, and R. Srikant. The Quest data mining system. In Proc. of the [...]nd Int'l Conference on Knowledge Discovery in Databases and Data Mining, Portland, Oregon, August [...].
[] R. Agrawal, C. Faloutsos, and A. Swami. Efficient similarity search in sequence databases. In Proc. of the Fourth Int'l Conference on Foundations of Data Organization and Algorithms, Chicago, October [...]. Also in Lecture Notes in Computer Science [...], Springer Verlag.
[] R. Agrawal, Gehrke, Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In Proc. of the ACM SIGMOD Conference on Management of Data, Seattle, Washington, June [...].
[] R. Agrawal, S. Ghosh, T. Imielinski, B. Iyer, and A. Swami. An interval classifier for database mining applications. In Proc. of the VLDB Conference, pages [...], Vancouver, British Columbia, Canada, August [...].
[] R. Agrawal, T. Imielinski, and A. Swami. Database mining: A performance perspective. IEEE Transactions on Knowledge and Data Engineering, [...], December [...].
[] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proc. of the ACM SIGMOD Conference on Management of Data, pages [...], Washington, D.C., May [...].
[] R. Agrawal, Lin, H. S. Sawhney, and Shim. Fast similarity search in the presence of noise, scaling, and translation in time-series databases. In Proc. of the [...]st Int'l Conference on Very Large Databases, pages [...], Zurich, Switzerland, September [...].
[] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. Verkamo. Fast discovery of association rules. In U. M. Fayyad, P. Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, chapter [...], pages [...]. AAAI/MIT Press, [...].

[] R. Agrawal, Psaila, E. L. Wimmers, and M. Zait. Querying shapes of histories. In Proc. of the [...]st Int'l Conference on Very Large Databases, Zurich, Switzerland, September [...].
[] R. Agrawal and Shafer. Parallel mining of association rules. IEEE Transactions on Knowledge and Data Engineering, [...], December [...].
[] R. Agrawal and Shim. Developing tightly-coupled data mining applications on a relational database system. In Proc. of the [...]nd Int'l Conference on Knowledge Discovery in Databases and Data Mining, Portland, Oregon, August [...].
[] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. of the [...]th Int'l Conference on Very Large Databases, Santiago, Chile, September [...].
[] R. Agrawal and R. Srikant. Mining sequential patterns. In Proc. of the [...]th Int'l Conference on Data Engineering, Taipei, Taiwan, March [...].
[] S. Alexander. Users find tangible rewards digging into data mines. Enterprise Computing, July [...]. http://www.infoworld.com/cgi-bin/displayArchive.pl[...]htm
[] Ali, S. Manganaris, and R. Srikant. Partial classification using association rules. In Proc. of the [...]rd Int'l Conference on Knowledge Discovery in Databases and Data Mining, Newport Beach, California, August [...].
[] Alsabti, S. Ranka, and V. Singh. CLOUDS: A decision tree classifier for large datasets. In Proc. of the 4th Int'l Conference on Knowledge Discovery and Data Mining, New [...]
[] C. Apte and S. Hong. Predicting equity returns from securities data with minimal rule generation. In KDD-[...]: AAAI Workshop on Knowledge Discovery in Databases, pages [...], Seattle, Washington, July [...].
[] C. Apte and S. Weiss. Data mining with decision trees and decision rules. FGCS Journal, Special Issue on Data Mining, [...].
[] Banfield and A. Raftery. Model-based Gaussian and non-Gaussian clustering. Biometrics, [...].
[] S. Berchtold, C. Bohm, B. Braunmuller, A. Keim, and H. Kriegel. Fast parallel similarity search in multimedia databases. In Proc. of the ACM SIGMOD Conference on Management of Data, May [...].
[] S. Berchtold, A. Keim, and H. Kriegel. The X-Tree: An index structure for high-dimensional data. In Proc. of the [...]nd Int'l Conference on Very Large Databases, Bombay, India, September [...].
[] P. Bradley, U. Fayyad, and C. Reina. Scaling clustering algorithms to large databases. In Proc. of the 4th Int'l Conference on Knowledge Discovery and Data Mining, New [...]
[] L. Breiman, H. Friedman, R. A. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth, Belmont, [...].

[] S. Brin, R. Motwani, and C. Silverstein. Beyond market baskets: Generalizing association rules to correlations. In Proc. of the ACM SIGMOD Conference on Management of Data, May [...].
[] S. Brin, R. Motwani, Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In Proc. of the ACM SIGMOD Conference on Management of Data, May [...].
[] Business Week. Database marketing, September [...].
[] Catlett. Megainduction: Machine Learning on Very Large Databases. PhD thesis, University of Sydney, [...].
[] Catlett. Overpruning large decision trees. In [...]th Int'l Joint Conference on AI, August [...].
[] S. Chakrabarti, B. Dom, R. Agrawal, and P. Raghavan. Using taxonomy, discriminants, and signatures for navigating in text databases. In Proc. of the VLDB Conference, Athens, Greece, August [...].
[] S. Chakrabarti, B. Dom, and P. Indyk. Enhanced hypertext categorization using hyperlinks. In Proc. of the ACM SIGMOD Conference on Management of Data, Seattle, Washington, June [...].
[] Chamberlin. Using the New DB2: IBM's Object-Relational Database System. Morgan Kaufmann, [...].
[] S. Chaudhuri. Data mining and database systems: Where is the intersection? IEEE Data Engineering Bulletin, [...], March [...].
[] S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP technology. ACM SIGMOD Record, [...], March [...].
[] S. Chaudhuri, R. Motwani, and V. Narasayya. Random sampling for histogram construction: How much is enough? In Proc. of the ACM SIGMOD Conference on Management of Data, Seattle, Washington, June [...].
[] Cheung, Han, V. Ng, and C. Y. Wong. Maintenance of discovered association rules in large databases: An incremental updating technique. In Proc. of Int'l Conference on Data Engineering, New Orleans, USA, February [...].
[] Cooper and E. Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, [...].
[] David Shepard Associates, Business One Irwin, Illinois. The new direct marketing, [...].
[] Denning. An intrusion-detection model. IEEE Transactions on Software Engineering, [...].
[] T. Dietterich and R. S. Michalski. Discovering patterns in sequences of events. Artificial Intelligence, [...].

[] Direct Marketing Association. Managing database marketing technology for success, [...].
[] M. Ester, H. Kriegel, and X. Xu. A database interface for clustering in large spatial databases. In Proc. of the [...]st Int'l Conference on Knowledge Discovery in Databases and Data Mining, Montreal, Canada, August [...].
[] C. Faloutsos, M. Ranganathan, and Y. Manolopoulos. Fast subsequence matching in time-series databases. In Proc. of the ACM SIGMOD Conference on Management of Data, May [...].
[] U. Fayyad, C. Reina, and P. Bradley. Initialization of iterative refinement clustering algorithms. In Proc. of the 4th Int'l Conference on Knowledge Discovery and Data Mining, New [...]
[] U. Fayyad, P. Shapiro, P. Smyth, and R. Uthurusamy, editors. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, [...].
[] U. Fayyad, N. Weir, and S. Djorgovski. Skicat: A machine learning system for automated cataloging of large scale sky surveys. In [...]th Int. Conference on Machine Learning, June [...].
[] R. Feldman, Y. Aumann, A. Amir, and H. Mannila. Efficient algorithms for discovering frequent sets in incremental databases. In Proceedings of the [...] SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, Tucson, Arizona, May [...].
[] H. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, [...].
[] T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Data mining using two-dimensional optimized association rules: Scheme, algorithms, and visualization. In Proc. of the ACM SIGMOD Conference on Management of Data, Montreal, Canada, June [...].
[] T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Mining optimized associa[...]
[...]tics for classification from large SQL databases. In Proc. of the [...]th Int'l Conference on Knowledge Discovery and Data Mining, New [...]

[] Gray, A. Bosworth, A. Layman, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-total. In Proc. of Int'l Conference on Data Engineering, New Orleans, USA, February [...].
[] Gray, S. Chaudhuri, A. Bosworth, A. Layman, F. Pellow, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. Data Mining and Knowledge Discovery, [...].
[] S. Guha, R. Rastogi, and Shim. CURE: An efficient clustering algorithm for large databases. In Proc. of the ACM SIGMOD Conference on Management of Data, Seattle, Washington, June [...].
[] Gunopulos, H. Mannila, and S. Saluja. Discovering all most specific sentences by randomized algorithms. In Proc. of the Sixth International Conference on Database Theory, January [...].
[] P. Haas, F. Naughton, S. Seshadri, and L. Stokes. Sampling-based estimation of the number of distinct values of an attribute. In Proceedings of the Eighth International Conference on Very Large Databases (VLDB) [...]
[...] Int'l Conference on Very Large Databases, Zurich, Switzerland, September [...].
[] Han, Y. Fu, Koperski, W. Wang, and O. Zaiane. DMQL: A data mining query language for relational databases. In Proc. of the [...] SIGMOD workshop on research issues on data mining and knowledge discovery, Montreal, Canada, May [...].
[] S. Hong. R-MINI: A heuristic algorithm for generating minimal rules from examples. In [...]rd Pacific Rim Int'l Conference on AI, August [...].
[] M. Houtsma and A. Swami. Set-oriented mining of association rules. In Int'l Conference on Data Engineering, Taipei, Taiwan, March [...].
[] T. Imielinski. From file mining to database mining. In Proc. of the [...] SIGMOD workshop on research issues on data mining and knowledge discovery, pages [...], May [...].
[] T. Imielinski and H. Mannila. A database perspective on knowledge discovery. Communications of the ACM, [...](11), Nov [...].

[] T. Imielinski, A. Virmani, and A. Abdulghani. Discovery board application programming interface and query language for database mining. In Proc. of the [...]nd Int'l Conference on Knowledge Discovery and Data Mining, Portland, Oregon, August [...].
[] International Business Machines. White paper on data mining. Technical report, [...].
[] International Business Machines. IBM Intelligent Miner User's Guide, Version [...] Release [...], SH[...] edition, July [...].
[] H. V. Jagadish. A retrieval technique for similar shapes. In Proc. of the ACM SIGMOD Conference on Management of Data, pages [...], Denver, May [...].
[] H. V. Jagadish and C. Faloutsos. Data reduction: a tutorial. Fourth International Conference on Knowledge Discovery and Data Mining, tutorial, August [...].
[] M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. Verkamo. Finding interesting rules from large sets of discovered association rules. In Third Int'l Con[...], September [...].
[] Kulkarni. Object oriented extensions in SQL3: a status report. SIGMOD Record, [...].
[] Lin, H. V. Jagadish, and C. Faloutsos. The TV-Tree: An index structure for high-dimensional data. VLDB Journal, [...].
[] Lohman, B. Lindsay, H. Pirahesh, and B. Schiefer. Extensions to Starburst: Objects, types, functions, and rules. Communications of the ACM, [...], October [...].
[] Lubinsky. Discovery from databases: A review of AI and statistical techniques. In IJCAI-[...] Workshop on Knowledge Discovery in Databases, pages [...], Detroit, August [...].
[] A. Major and C. Feng. EFD: A hybrid knowledge/statistical-based system for the detection of fraud. Int'l Journal of Intelligent Systems, [...].
[] H. Mannila and H. Toivonen. On an algorithm for finding all interesting sentences. In Cybernetics and Systems, Volume II, The [...]th European Meeting on Cybernetics and Systems Research, Vienna, Austria, April [...].

[] H. Mannila, H. Toivonen, and A. Verkamo. Discovering frequent episodes in sequences. In Proc. of the [...]st Int'l Conference on Knowledge Discovery in Databases and Data Mining, Montreal, Canada, August [...].
[] M. Mehta, R. Agrawal, and Rissanen. SLIQ: A fast scalable classifier for data mining. In Proc. of the Fifth Int'l Conference on Extending Database Technology (EDBT), Avignon, France, March [...].
[] M. Mehta, Rissanen, and R. Agrawal. MDL-based decision tree pruning. In Proc. of the [...]st Int'l Conference on Knowledge Discovery in Databases and Data Mining, Montreal, Canada, August [...].
[] Melton and N. Mattos. SQL3: a tutorial. Twenty-second International Conference on Very Large Data Bases, tutorial, September [...].
[] Melton and A. Simon. Understanding the new SQL: A complete guide. Morgan Kaufmann, [...].
[] R. Meo, Psaila, and S. Ceri. A new SQL-like operator for mining association rules. In Proc. of the [...]nd Int'l Conference on Very Large Databases, Bombay, India, Sep [...].
[] R. S. Michalski. A theory and methodology of inductive learning. In Michalski et al., editors, Machine Learning: An Artificial Intelligence Approach, Vol. [...]. Morgan Kaufmann, [...].
[] Michie, Spiegelhalter, and C. C. Taylor. Machine Learning, Neural and Statistical Classification. Ellis Horwood, [...].
[] R. Miller and Y. [...]
[] C. Mohan, Haderle, Y. Wang, and Cheng. Single table access using multiple indexes: optimization, execution, and concurrency control techniques. In Proc. International Conference on Extending Database Technology, pages [...], [...].
[] Mumick, Quass, and B. Mumick. Maintenance of data cubes and summary tables in a warehouse. In Proc. of the ACM SIGMOD Conference on Management of Data, Tucson, Arizona, May [...].
[] R. Musick, Catlett, and S. Russell. Decision theoretic subsampling for induction on large datasets. In [...]th Int'l Conference on Machine Learning, [...].
[] P. Nearhos, M. Rothman, and M. S. Viveros. Applying data mining techniques to a health insurance information system. In Proc. of the [...]nd Int'l Conference on Very Large Databases, Mumbai (Bombay), India, September [...].
[] R. T. Ng, L. V. S. Lakshmanan, Han, and A. Pang. Exploratory mining and pruning optimizations of constrained association rules. In Proc. of the ACM SIGMOD Conference on Management of Data, Seattle, Washington, June [...].
[] Oracle. Oracle RDBMS Database Administrator's Guide, Volumes [...], II (Version [...]), May [...].

[] S. Park, M. S. Chen, and P. S. [...]
[] PostgreSQL Organization. PostgreSQL [...] User Manual, February [...]. http://www.postgresql.org
[] Quest. IBM develops market basket analysis system. Stores, May [...].
[] R. Quinlan. Induction over large databases. Technical Report STAN-CS-[...], Stanford University, [...].
[] R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, [...].
[] R. Quinlan and R. L. Rivest. Inferring decision trees using minimum description length principle. Information and Computation, [...].
[] Rajamani, B. Iyer, and A. Chaddha. Using DB2's [...]
[...]plications in the Biosciences, [...].
[] S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. In Proc. of the ACM SIGMOD Conference on Management of Data, Seattle, Washington, June [...].
[] S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. Research Report R[...], IBM Almaden Research Center, San Jose, California, March [...]. (Longer version of the SIGMOD [...] paper.)
[] A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. In Proc. of the VLDB Conference, Zurich, Switzerland, September [...].
[] S. Z. Selim and M. A. Ismail. K-means-type algorithms: A generalized convergence theorem and characterization of local optimality. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-[...], [...].
[] Shafer and R. Agrawal. Parallel algorithms for high-dimensional similarity joins for data mining applications. In Proc. of the VLDB Conference, Athens, Greece, August [...].

PAGE 184

>@ 6KDIHU 5 $JUDZDO DQG 0 0HKWD 635,17 $ VFDODEOH SDUDOOHO FODVVLILHU IRU GDWD PLQLQJ ,Q 3URF RI WKH QG ,QWfO &RQIHUHQFH RQ 9HU\ /DUJH 'DWDEDVHV %RPED\ ,QGLD 6HSWHPEHU >@ 6KLP 5 6ULNDQW DQG 5 $JUDZDO +LJKGLPHQVLRQDO VLPLODULW\ MRLQV ,Q 3URF RI WKH WK ,QWfO &RQIHUHQFH RQ 'DWD (QJLQHHULQJ %LUPLQJKDP 8. $SULO >@ $ 6LHEHV DQG 0 / .HUVWHQ .(62 0LQLPL]LQJ GDWDEDVH LQWHUDFWLRQ ,Q 3URF RI WKH UG ,QWfO &RQIHUHQFH RQ .QRZOHGJH 'LVFRYHU\ DQG 'DWD 0LQLQJ 1HZSRUW %HDFK &DOLIRUQLD $XJXVW >@ & 6LOYHUVWHLQ 6 %ULQ 5 0RWZDQL DQG 8OOPDQ 6FDODEOH WHFKQLTXHV IRU PLQLQJ FDXVDO VWUXFWXUHV ,Q 3URF RI WKH 9/'% &RQIHUHQFH 1HZ @ 5 6ULNDQW DQG 5 $JUDZDO 0LQLQJ JHQHUDOL]HG DVVRFLDWLRQ UXOHV ,Q 3URF RI WKH VW ,QWfO &RQIHUHQFH RQ 9HU\ /DUJH 'DWDEDVHV =XULFK 6ZLW]HUODQG 6HSWHPEHU >@ 5 6ULNDQW DQG 5 $JUDZDO 0LQLQJ TXDQWLWDWLYH DVVRFLDWLRQ UXOHV LQ ODUJH UHODWLRQDO WDEOHV ,Q 3URF RI WKH $&0 6,*02' &RQIHUHQFH RQ 0DQDJHPHQW RI 'DWD 0RQWUHDO &DQDGD -XQH >@ 5 6ULNDQW DQG 5 $JUDZDO 0LQLQJ VHTXHQWLDO SDWWHUQV *HQHUDOL]DWLRQV DQG SHUIRUn PDQFH LPSURYHPHQWV ,Q 3URF RI WKH )LIWK ,QWfO &RQIHUHQFH RQ ([WHQGLQJ 'DWDEDVH 7HFKQRORJ\ ('%7f $YLJQRQ )UDQFH 0DUFK >@ 5 6ULNDQW 4 9X DQG 5 $JUDZDO 0LQLQJ DVVRFLDWLRQ UXOHV ZLWK LWHP FRQVWUDLQWV ,Q 3URF RI WKH UG ,QW fO &RQIHUHQFH RQ .QRZOHGJH 'LVFRYHU\ LQ 'DWDEDVHV DQG 'DWD 0LQLQJ 1HZSRUW %HDFK &DOLIRUQLD $XJXVW >@ 0 5 6WRQHEUDNHU DQG .HPQLW] 7KH 3267*5(6 QH[W JHQHUDWLRQ GDWDEDVH PDQDJHPHQW V\VWHP &RPPXQLFDWLRQV RI WKH $&0 f >@ 6 7KRPDV 6 %RGDJDOD $OVDEWL DQG 6 5DQND $Q HIILFLHQW DOJRULWKP IRU WKH LQFUHPHQWDO XSGDWLRQ RI DVVRFLDWLRQ UXOHV LQ ODUJH GDWDEDVHV ,Q 3URF RI WKH UG ,QWffO &RQIHUHQFH RQ .QRZOHGJH 'LVFRYHU\ DQG 'DWD 0LQLQJ 1HZ @ + 7RLYRQHQ 6DPSOLQJ ODUJH GDWDEDVHV IRU DVVRFLDWLRQ UXOHV ,Q 3URF RI WKH QG ,QWfO &RQIHUHQFH RQ 9HU\ /DUJH 'DWDEDVHV SDJHV 0XPEDL %RPED\f ,QGLD 6HSWHPEHU

PAGE 185

I certify that I have read this study and that in my opinion it conforms to acceptable
standards of scholarly presentation and is fully adequate, in scope and quality, as a
dissertation for the degree of Doctor of Philosophy.

Sharma Chakravarthy, Chairman
Associate Professor of Computer and Information Science and Engineering

I certify that I have read this study and that in my opinion it conforms to acceptable
standards of scholarly presentation and is fully adequate, in scope and quality, as a
dissertation for the degree of Doctor of Philosophy.

Eric N. Hanson
Associate Professor of Computer and Information Science and Engineering

I certify that I have read this study and that in my opinion it conforms to acceptable
standards of scholarly presentation and is fully adequate, in scope and quality, as a
dissertation for the degree of Doctor of Philosophy.

Distinguished Professor of Computer and Information Science and Engineering

I certify that I have read this study and that in my opinion it conforms to acceptable
standards of scholarly presentation and is fully adequate, in scope and quality, as a
dissertation for the degree of Doctor of Philosophy.

Stanley Y. W. S
Professor of Computer and Information Science and Engineering

I certify that I have read this study and that in my opinion it conforms to acceptable
standards of scholarly presentation and is fully adequate, in scope and quality, as a
dissertation for the degree of Doctor of Philosophy.

Suleyman Tufekci
Associate Professor of Industrial and Systems Engineering

This dissertation was submitted to the Graduate Faculty of the College of Engineering and
to the Graduate School and was accepted as partial fulfillment of the requirements for the
degree of Doctor of Philosophy.

December 1998

Winfred M. Phillips
Dean, College of Engineering

M. Ohanian
Dean, Graduate School



support thresholds. We report the results for increment sizes of 1%, 5% and 10% (shown
in the legend). We can make the following observations from the graphs.

The incremental algorithm based on the Subquery approach achieves a speed-up of
about 3 to 20 compared to mining the whole dataset. However, the maximum
speed-up of the Vertical approach is only about 4. For support counting, the Vertical
approach uses a user-defined function (UDF) to intersect the tid-lists. The incremental
algorithm must invoke the UDF at least as many times as a full run, since the
support of all the itemsets in the frequent set and the negative border needs to be
found in the increment database. In cases where the support of new candidates needs
to be counted, the number of invocations is even higher than when mining
the whole dataset. The time taken by the Vertical approach in the support counting
phase is directly proportional to the number of times the UDF is called. The
incremental algorithm saves time only in the tid-list creation phase, since the
increment dataset is only a fraction of the whole dataset. This explains why the speed-up
of the Vertical approach is low. In contrast, the Subquery approach achieves a higher
speed-up since the time taken is proportional to the size of the dataset.

It is possible to achieve a better speed-up for the Vertical approach by allocating a
smaller BLOB (binary large object) for computations involving the increment dataset.
Note that the tid-lists for the items are stored as BLOBs. In our experiments, we used
the same BLOB size for the increment dataset and the initial dataset in order to use
the same user-defined function for support counting and the same table function for
tid-list creation (refer to Section 4.2 for a detailed description of the Vertical approach).
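The observation about UDF invocations can be illustrated with a minimal Python sketch. It assumes tid-lists are plain sorted lists rather than BLOBs, and the names `intersect` and `count_supports` are hypothetical stand-ins for the UDF and the support counting step, not the dissertation's implementation. The point it shows is that the number of intersect calls depends on the itemsets being counted, not on the size of the data.

```python
def intersect(a, b):
    """Merge-intersect two sorted tid-lists (hypothetical stand-in for the UDF)."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def count_supports(tid_lists, itemsets):
    """Return (support per itemset, number of intersect calls made)."""
    calls = 0
    supports = {}
    for itemset in itemsets:
        current = tid_lists.get(itemset[0], [])
        for item in itemset[1:]:
            current = intersect(current, tid_lists.get(item, []))
            calls += 1
        supports[itemset] = len(current)
    return supports, calls


# Toy data: a "whole" dataset and a much smaller increment.
whole = {"a": [1, 2, 3, 4, 5, 6, 7, 8], "b": [1, 2, 4, 6, 8], "c": [2, 4, 5, 8]}
increment = {"a": [9, 10], "b": [9], "c": [10]}
itemsets = [("a", "b"), ("a", "c"), ("a", "b", "c")]

whole_supports, whole_calls = count_supports(whole, itemsets)
inc_supports, inc_calls = count_supports(increment, itemsets)
```

In this toy run both datasets need the same number of intersect calls, even though the increment is a quarter of the size; only the length of each tid-list shrinks.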


insert into T_k
select p.tid, p.item_1, ..., p.item_{k-1}, q.item
from C_k, T_{k-1} p, T_f q
where p.item_1 = C_k.item_1 and
      ...
      p.item_{k-1} = C_k.item_{k-1} and
      q.item = C_k.item_k and
      p.tid = q.tid
Figure 3.13. Generation of T_k
the next pass to generate T_{k+1}. The only advantage of pruning T_k is that we will have a
smaller table to join in the next pass, but at the additional cost of joining T_k with F_k.
We use the optimization discussed above for the second pass and hence do not materialize
and store the 2-item combinations T_2. Therefore, we generate T_3 directly by joining
T_f with C_3 as
insert into T_3 select p.tid, p.item, q.item, r.item
from T_f p, T_f q, T_f r, C_3
where p.item = C_3.item_1 and q.item = C_3.item_2 and r.item = C_3.item_3 and
      p.tid = q.tid and q.tid = r.tid
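The T_3 query can be tried out on a toy instance. The following is a minimal sketch using Python's standard `sqlite3` module; the table layouts (`Tf`, `C3`, `T3`) and the sample rows are assumptions made for illustration, and SQLite stands in for the DBMS used in the experiments.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table Tf (tid integer, item integer);           -- filtered transactions
create table C3 (item1 integer, item2 integer, item3 integer);
create table T3 (tid integer, item1 integer, item2 integer, item3 integer);
""")

# Hypothetical data: three transactions and two 3-item candidates.
conn.executemany("insert into Tf values (?, ?)", [
    (10, 1), (10, 2), (10, 3),
    (20, 1), (20, 2),
    (30, 1), (30, 2), (30, 3), (30, 4),
])
conn.executemany("insert into C3 values (?, ?, ?)", [(1, 2, 3), (1, 2, 4)])

# Three copies of Tf are joined on tid and matched against the candidates.
conn.execute("""
insert into T3
select p.tid, p.item, q.item, r.item
from Tf p, Tf q, Tf r, C3
where p.item = C3.item1 and q.item = C3.item2 and r.item = C3.item3
  and p.tid = q.tid and q.tid = r.tid
""")

t3_rows = sorted(conn.execute("select * from T3").fetchall())
```

Each row of `T3` records one transaction that contains one candidate, so the support of a candidate is the number of rows carrying its item combination.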


insert into F_k
select item_1, eno_1, ..., item_k, eno_k, count(*)
from (select distinct d_1.sid, item_1, eno_1, ..., item_k, eno_k
      from C_k, D d_1, ..., D d_k
      where d_1.item = C_k.item_1 and
            ...
            d_k.item = C_k.item_k and
            d_1.sid = d_2.sid and
            ...
            d_{k-1}.sid = d_k.sid and
            PRED(k)
     ) as t
group by item_1, eno_1, ..., item_k, eno_k
having count(*) > :minsup
Figure 6.3. Support counting by K-way join


To
My parents, sisters and brothers


O(n) passes over the database, where n is the size of the maximal frequent itemset. Another
incremental algorithm is presented in Feldman et al. [47].
In this chapter, we present an algorithm to find the new frequent itemsets with minimal
re-computation when new transactions are added to or deleted from the transaction
database [123]. Deletion is important in cases where we want to analyze the data in a
sliding time window. The important characteristics of our algorithm are the following.

Along with the frequent itemsets, we also maintain the negative border [127].¹ The
algorithm uses negative borders to decide when to scan the whole database, and it
can be used in conjunction with any level-wise algorithm like Apriori [13] or
Partition [111].

We first compute the frequent itemsets of the increment database. The algorithm
requires a full scan of the whole database only if the negative border of the frequent
itemsets expands, that is, if an itemset outside the negative border gets added to the
frequent itemsets or its negative border. Even in such cases, it requires only one I/O
pass over the whole data set.
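The role of the negative border can be sketched in a few lines of Python. This is an illustrative sketch, not the algorithm of Section 7.1: it assumes the frequent itemsets are downward closed, takes the negative border to be the itemsets that are not frequent but whose proper subsets all are, and the function names `negative_border` and `needs_full_scan` are hypothetical.

```python
from itertools import combinations


def negative_border(frequent, items):
    """Itemsets that are not frequent but whose proper subsets are all frequent."""
    freq = {frozenset(s) for s in frequent}
    # Infrequent single items are in the border by definition.
    border = {frozenset([i]) for i in items if frozenset([i]) not in freq}
    largest = max((len(s) for s in freq), default=0)
    for k in range(1, largest + 1):
        level = [s for s in freq if len(s) == k]
        for a, b in combinations(level, 2):
            cand = a | b
            if len(cand) != k + 1 or cand in freq:
                continue
            if all(frozenset(sub) in freq for sub in combinations(cand, k)):
                border.add(cand)
    return border


def needs_full_scan(old_frequent, new_frequent, items):
    """True if the updated frequent set or its border contains an untracked itemset."""
    old_tracked = {frozenset(s) for s in old_frequent} | negative_border(old_frequent, items)
    new_tracked = {frozenset(s) for s in new_frequent} | negative_border(new_frequent, items)
    return not new_tracked <= old_tracked


items = [1, 2, 3, 4]
old_frequent = [{1}, {2}, {3}, {1, 2}, {1, 3}]
border = negative_border(old_frequent, items)

# {2, 3} moves from the border into the frequent set, which pulls in {1, 2, 3}.
expanded = needs_full_scan(old_frequent, old_frequent + [{2, 3}], items)
# {1, 3} drops out of the frequent set; every tracked itemset is still known.
shrunk = needs_full_scan(old_frequent, [{1}, {2}, {3}, {1, 2}], items)
```

When an itemset only moves between the frequent set and the existing border, the supports already maintained are enough and no scan of the whole database is triggered in this sketch.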
Constrained association mining is another useful technique for goal-oriented mining.
We generalize the incremental mining algorithm to handle cases with constraints on the
items and certain kinds of constraint relaxation.
In Section 7.1, we present the incremental algorithm to update the frequent itemsets.
Experimental results quantifying the performance advantages of the incremental algorithm
are presented in Section 7.2 and in Section 7.3, we compare it with FUP. In Section 7.4,
we explore the database integration alternatives of incremental mining and in Section 7.4.2
¹ Note that the negative border can be maintained while computing the frequent itemsets
without any additional computation overhead.


The reason is that, when the increment is smaller, we have to use proportionately smaller
minimum support values while finding the frequent itemsets in db. This could result in
counting too many spurious candidates.
7.4.4 Other Approaches
In the Loose-coupling approach, the transaction data is read tuple by tuple from the
DBMS to the mining kernel using a cursor interface. This architecture can be extended
to handle incremental mining just by implementing the incremental algorithm outlined in
Section 7.1 in the mining kernel. The DBMS interface does not require any change. In
cases where the support of new itemsets needs to be counted, limiting the data access to
just one scan of the whole database entails counting candidate itemsets of multiple sizes
in the same pass. This can be accomplished by passing the transactions through all the
candidates of different sizes and updating their support counts.
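Counting candidates of several sizes in one pass can be sketched as follows. This is a minimal Python illustration, assuming transactions arrive one at a time (as from a cursor) and that candidates are simply checked by subset test; the function name `count_in_one_scan` is hypothetical, and a real mining kernel would typically use a hash tree or similar structure instead of the linear check shown here.

```python
def count_in_one_scan(transactions, candidates):
    """Count candidates of mixed sizes during a single pass over the data."""
    counts = {frozenset(c): 0 for c in candidates}
    for transaction in transactions:      # one cursor-style scan
        items = set(transaction)
        for cand in counts:               # candidates of every size
            if cand <= items:
                counts[cand] += 1
    return counts


transactions = [[1, 2, 3], [1, 2], [2, 3, 4], [1, 2, 3, 4]]
candidates = [{4}, {2, 3}, {1, 2, 3}]     # sizes 1, 2 and 3 together
counts = count_in_one_scan(transactions, candidates)
```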
The Stored-procedure approach, where the mining algorithm is encapsulated as a
stored procedure that runs in the same address space as the DBMS, and the Cache-Mine
approach, where the data is cached outside the DBMS, can also be extended for incremental
mining. However, the Cache-Mine approach might not give better performance than
the others since the incremental algorithm requires at most one scan of the entire data.
Extending the UDF approach for incremental mining is straightforward and will involve
writing UDFs for the different steps of the incremental algorithm.
7.5 Constrained Associations
In this section, we introduce associations with different kinds of constraints on the
itemsets or constraints that characterize the dataset from which the associations are
derived. Let I = {i_1, i_2, ..., i_m} be a set of literals, called items, that are attribute values


after running for 5 hours. After the transformation, the time taken by Horizontal was
still significantly worse than GatherJoin when run without the frequent itemset
filtering optimization, but with the optimization the performance was comparable. The SBF
approach had significantly worse performance because of the expensive index ORing of
the k join predicates. Another problem with this approach is the large number of updates
to the C_k table. In DB2, all of these updates are logged, resulting in severe performance
degradation.
[Four bar-chart panels, one each for Dataset-A, Dataset-B, Dataset-C and Dataset-D,
showing time split into Prep, Pass 1, Pass 2, Pass 3 and Pass 4 at several support values.]
Figure 4.5. Comparison of four SQL-OR approaches: Vertical, GatherPrune, GatherJoin
and GatherCount


Table 3.1. Description of different real-life datasets

Datasets     # Records       # Transactions   # Items          Avg. # items
             (in millions)   (in millions)    (in thousands)
Dataset-A    2.5             0.57             85               4.4
Dataset-B    7.5             2.5              15.8             2.62
Dataset-C    6.6             0.21             15.8             31
Dataset-D    14              1.44             480              9.62

We selected a collection of four real-life datasets obtained from various mail-order
companies and retail stores for the experiments. These datasets have differing values of
parameters like the total number of (tid,item) pairs, the number of transactions (tids), the
number of items and the average number of items per transaction. Table 3.1 summarizes
these parameters.
In this dissertation, we report the performance with only Dataset-A. The overall
observation was that mining implementations in pure SQL-92 are too slow to be practical.
For these experiments we built a composite index (item_1, ..., item_k) on C_k, k different
indices on each of the k items of C_k, and a (tid, item) and an (item, tid) index on the
data table. The goal was to let the optimizer choose the best plan possible. We do not
include the index building cost in the total time.
In Figure 3.7 we show the total time taken by the four approaches: KwayJoin, 3wayJoin,
Subquery and 2GroupBy. For comparison, we also show the time taken by the Loose-coupling
approach because this is the approach currently used by existing systems. The graph shows
the total time split into candidate generation time (Cgen) and the time for each pass. The
candidate generation time and the time for the first pass are much smaller compared to the
total time. From this set of experiments we can make the following observations.


insert into F_k
with T*(tid, item) as (Query for T* defined above)
select item_1, ..., item_k, count(*)
from C_k, T* t_1, ..., T* t_k
where t_1.item = C_k.item_1 and
      ...
      t_k.item = C_k.item_k and
      t_1.tid = t_2.tid and
      ...
      t_{k-1}.tid = t_k.tid
group by item_1, item_2, ..., item_k
having count(*) > :minsup
Figure 5.5. Support counting by K-way join


of a set of relational tables. A constrained association is defined as a set of itemsets
{X | X ⊆ I and C(X)}, where C denotes one or more boolean constraints. Note that we do
not concentrate on generating the association rules in the traditional sense [7]. However,
the associations between attribute values that we generate can be used for rule generation.
7.5.1 Categories of Constraints
We divide the constraints into four different categories that are outlined below.² We
illustrate each of them with sample mining computations. The data model used in our
examples is that of a point-of-sale (POS) model for a retail chain. When a customer buys
a product or series of products at a register, that information is stored in a transactional
system, which is likely to hold other information such as who made the purchase and what
types of promotions were involved. The data is stored in three relational tables, sales,
products_sold and product, with respective schemas as shown in Figure 7.10.

SALES              PRODUCTS_SOLD      PRODUCT
Transaction_id     Transaction_id     Product_id
Customer_id        Product_id         Type
Total price        Price              Name
No. of products    Promotion_id       Description

Figure 7.10. Point of sales data model
Frequency Constraint
This is the same as the minimum support threshold in the support-confidence framework
for association rule mining [13]. An itemset X is said to be frequent if it appears in
at least s transactions, where s is the minimum support. In the point-of-sales data model a
² We refer the reader to Srikant et al. [121] and Ng et al. [97] for discussions of various
kinds of constraints. Here we categorize them based on their usage in the mining process,
which is explained in Section 7.5.2 with an example.


[128] D. Tsur, J. Ullman, S. Abiteboul, C. Clifton, R. Motwani, S. Nestorov, and A. Rosenthal.
Query Flocks: A generalization of association rule mining. In Proc. of the ACM
SIGMOD Conference on Management of Data, Seattle, Washington, June 1998.
[129] S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn: Classification
and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert
Systems. Morgan Kaufmann, 1991.
[130] R. B. Yates and G. H. Gonnet. A new approach to text searching. Communications
of the ACM, 35(10):74-82, October 1992.
[131] M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. New algorithms for fast discovery
of association rules. In Proc. of the 3rd Int'l Conference on Knowledge Discovery
and Data Mining, Newport Beach, California, August 1997.
[132] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering
method for very large databases. In Proc. of the ACM SIGMOD Conference on
Management of Data, Montreal, Canada, June 1996.
[133] Y. Zhao, P. M. Deshpande, and J. F. Naughton. An array-based algorithm for
simultaneous multidimensional aggregates. In Proc. of the ACM SIGMOD Conference on
Management of Data, Tucson, Arizona, May 1997.


insert into F_k
select t.item_1, t.eno_1, ..., t.item_k, t.eno_k, cnt
from (select item_1, eno_1, ..., item_k, eno_k,
             CountIntersect-K(C_k.eno_1, ..., C_k.eno_k, d_1.s-list, ..., d_k.s-list,
                              window-size, min-gap, max-gap) as cnt
      from C_k, SlistTable d_1, ..., SlistTable d_k
      where d_1.item = C_k.item_1 and
            ...
            d_k.item = C_k.item_k) as t
where cnt > :minsup
Figure 6.5. Support counting by Vertical
corresponding to the intersection is created and is passed to Count, which counts the number
of data-sequences containing the candidate. In the later passes of the GSP algorithm, this
approach might be better since several candidates may share common prefixes and also
the length of the intersected s-list is likely to be small. However, in the approach using
CountIntersect-K, the intersected s-list is not materialized since the UDF interleaves the
intersection and counting. Therefore, it is suitable for the earlier passes.


computation can be expressed in SQL using a recursive query as shown in Figure 5.2. The
result of the query is stored in table Ancestor having the schema (ancestor, descendant).
insert into Ancestor
with R_Tax (ancestor, descendant) as
    (select parent, child from Tax
     union all
     select p.ancestor, c.child from R_Tax p, Tax c
     where p.descendant = c.parent)
select ancestor, descendant from R_Tax
Figure 5.2. Pre-computing ancestors
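The recursive query of Figure 5.2 can be exercised on a small taxonomy. The sketch below uses Python's standard `sqlite3` module as a stand-in DBMS; the item names are made up for illustration, and the statement is written in SQLite's dialect (`with recursive` placed before `insert`), which differs slightly from the syntax in the figure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table Tax (parent text, child text);
create table Ancestor (ancestor text, descendant text);
""")

# Hypothetical taxonomy: Clothes -> Outerwear -> Jackets, Clothes -> Shirts.
conn.executemany("insert into Tax values (?, ?)", [
    ("Clothes", "Outerwear"),
    ("Outerwear", "Jackets"),
    ("Clothes", "Shirts"),
])

# Transitive closure of the parent/child relation.
conn.execute("""
with recursive R_Tax(ancestor, descendant) as (
    select parent, child from Tax
    union all
    select p.ancestor, c.child from R_Tax p, Tax c
    where p.descendant = c.parent
)
insert into Ancestor select ancestor, descendant from R_Tax
""")

ancestor_rows = sorted(
    conn.execute("select ancestor, descendant from Ancestor").fetchall()
)
```

Besides the three direct edges, the result contains the derived pair (Clothes, Jackets), which is what lets a transaction containing Jackets also count toward Clothes.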
5.4 Candidate Generation
In the candidate generation phase, we use the frequent itemsets F_{k-1} found in the
(k-1)th pass to generate a set of candidate itemsets C_k that contains all k-itemsets such
that all k of its (k-1)-length subsets are in F_{k-1}. Section 3.4.1 shows how to express this
operation as a k-way join between the frequent (k-1)-itemsets (F_{k-1}s). We can use the
same formulation except that we need to prune from C_k itemsets containing an item and
its ancestor. Srikant and Agrawal [118] prove that this pruning needs to be done only in


(item, sid, time) order and passed to the function Gather, which collects the (sid, time)
attribute values of all tuples of D with the same item in memory. For items that meet
the minimum support criterion, the function outputs an (item, s-list) pair. The s-lists are
maintained sorted using sid as the major key and time as the minor key, and are stored in a
new SlistTable with the schema (item, s-list).
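The behaviour described for Gather can be sketched in Python. This is an illustrative sketch under stated assumptions: the input is a list of (item, sid, time) tuples, support is counted as the number of distinct data-sequences (sids) containing the item, the threshold test is `>=`, and the function name `gather` and the dictionary output are hypothetical stand-ins for the table function and SlistTable.

```python
from collections import defaultdict


def gather(rows, minsup):
    """Build (item -> s-list) for items that meet the minimum support."""
    by_item = defaultdict(list)
    # Sorting the tuples gives (item, sid, time) order, so each s-list
    # ends up sorted on sid (major key) and time (minor key).
    for item, sid, time in sorted(rows):
        by_item[item].append((sid, time))
    slist_table = {}
    for item, slist in by_item.items():
        support = len({sid for sid, _ in slist})   # distinct data-sequences
        if support >= minsup:
            slist_table[item] = slist
    return slist_table


rows = [("a", 1, 10), ("b", 1, 20), ("a", 2, 5), ("a", 1, 30), ("c", 2, 7)]
slist_table = gather(rows, minsup=2)
```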
In the support counting phase, we collect the s-lists of all the items of a candidate
and intersect them to determine the data-sequences containing that sequence. For each
candidate sequence in C_k, we select from the SlistTable the s-lists corresponding to the
k items in the sequence and pass them to a UDF CountIntersect-K along with the eno
attributes of the candidate, as shown in Figure 6.5. The UDF intersects the s-lists and does
the support counting. Even though this approach is the same as the one for associations,
the CountIntersect-K function is quite different.
The UDF first finds the data-sequence containing all the items and marks its boundaries
in each of the s-lists. Note that the s-lists are maintained sorted on (sid, time).
The UDF also groups the s-lists according to the elements of the candidate sequence
using the values of the eno parameters. For determining whether the candidate is contained
in the data-sequence, we use an algorithm similar to the one described in Srikant and
Agrawal [120]. The details of this procedure are different due to the vertical representation
of data-sequences. If the candidate is contained in the data-sequence, the support count is
incremented and the same steps are repeated for the subsequent data-sequences.
The intersect operation can be decomposed to share it across sequences having common
prefixes, similar to the subquery optimization in the KwayJoin approach. In this approach
we need to use two UDFs, Intersect and Count. The former intersects the s-lists to filter
out the data-sequences that do not contain all the items of the candidate. A new s-list


the join cost summation of the subquery approach. This results in significant performance
improvements, especially in the higher passes.
Figure 3.14 compares the running times of the subquery and Set-oriented Apriori
approaches for the dataset T10.I4.D100K for 0.33% support. We show only the times for
passes 3 and higher since both the approaches are the same in the first two passes.
[Bar chart: Set-Apriori versus Subquery, for Pass 3 through Pass 7.]
Figure 3.14. Benefit of reusing item combinations
Space Overhead
The Set-oriented Apriori approach requires additional space in order to store the item
combinations generated. The size of the table T_k is the same as S(C_k), which is the total
support of all the k-item candidates. Assuming that the tid and item attributes are integers,
each tuple in T_k consists of k + 1 integer attributes. Figure 3.15 shows the space required
to store T_k in terms of number of integers, for the dataset T10.I4.D100K for two different
support values. The space needed for the input data table T is also shown for comparison.
T_2 is not shown in the graph since we do not materialize and store it in the Set-oriented
Apriori approach. Note that once T_k is materialized, T_{k-1} can be deleted unless it needs to
be retained for some other purposes.
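The space estimate above reduces to simple arithmetic, sketched here in Python. The function name `tk_space_in_integers` and the sample supports are hypothetical; the formula itself follows the text: the table has S(C_k) tuples of k + 1 integers each.

```python
def tk_space_in_integers(candidate_supports, k):
    """Integers needed for T_k: S(C_k) tuples, each with a tid and k items."""
    total_support = sum(candidate_supports)   # S(C_k)
    return total_support * (k + 1)


# Hypothetical supports of three 3-item candidates.
space_t3 = tk_space_in_integers([120, 80, 50], k=3)
```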


Figure 4.10. Comparison of different architectures on space requirements.
Table 4.1. Pros and cons of different architectural options ranked on a scale of 1 (good) to
4 (bad)

Metric                              Stored-proc.   UDF   Cache-Mine   SQL
Performance                         4              3     1            2
Storage overhead                    1              1     2            2-3
Automatic Parallelism               2              2     2            1(?)
Development and maintenance ease    2              3     2            1-2
Portability                         1              3     1            2
Inter-operability                   2              2     2            1(?)

In terms of performance, the Cache-Mine approach is the best option followed by the
SQL approach. The SQL approach was within a factor of 0.8 to 2 of the Cache-Mine
approach for all of our experiments. The UDF approach is better than the Stored-procedure
approach in performance by 30 to 50% but it loses on the metrics of development and
maintenance costs and portability. In terms of space requirements, the Cache-Mine and
the SQL approaches lose to the UDF and the Stored-procedure approaches. Between the
Stored-procedure and the Cache-Mine implementations, the performance difference is
exactly a function of the number of passes made on the data; that is, if we make four passes


insert into F_k select item_1, ..., item_k, count(*)
from C_k, T t_1, ..., T t_k
where t_1.item = C_k.item_1 and
      ...
      t_k.item = C_k.item_k and
      t_1.tid = t_2.tid and
      ...
      t_{k-1}.tid = t_k.tid
group by item_1, item_2, ..., item_k
having count(*) > :minsup
Figure 3.5. Support counting by K-way join
This SQL computation, when merged with the candidate generation step, is similar to
the one proposed in Tsur et al. [128] as a possible mechanism to implement query flocks.
In Section 3.7, we discuss the different execution plans for this query and the related
performance issues.
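For concreteness, the K-way join of Figure 3.5 can be run for k = 2 on a toy table. The following sketch uses Python's standard `sqlite3` module as a stand-in DBMS; the table contents and the threshold value are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table T (tid integer, item integer);
create table C2 (item1 integer, item2 integer);
create table F2 (item1 integer, item2 integer, cnt integer);
""")

# Hypothetical transactions: {1,2,3}, {1,2}, {2,3}.
conn.executemany("insert into T values (?, ?)", [
    (1, 1), (1, 2), (1, 3),
    (2, 1), (2, 2),
    (3, 2), (3, 3),
])
conn.executemany("insert into C2 values (?, ?)", [(1, 2), (1, 3), (2, 3)])

# k = 2: the candidate table is joined with two copies of T on tid.
conn.execute("""
insert into F2
select C2.item1, C2.item2, count(*)
from C2, T t1, T t2
where t1.item = C2.item1 and t2.item = C2.item2 and t1.tid = t2.tid
group by C2.item1, C2.item2
having count(*) > :minsup
""", {"minsup": 1})

f2_rows = sorted(conn.execute("select * from F2").fetchall())
```

The candidate (1, 3) occurs in only one transaction, so it is filtered out by the `having` clause, while (1, 2) and (2, 3) each occur in two.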
3.5.2 Three-way Join
The above approach requires (k + 1)-way joins in the kth pass. We can reduce the
cardinality of joins to 3 using the following approach which bears some resemblance to


insert into F_k select item_1, ..., item_k, count(tid-list) as cnt
from (Subquery Q_k) t where cnt > :minsup

Subquery Q_l (for any l between 2 and k)
select item_1, ..., item_l, Intersect(r_{l-1}.tid-list, t_l.tid-list) as tid-list
from TidTable t_l, (Subquery Q_{l-1}) as r_{l-1},
     (select distinct item_1, ..., item_l from C_k) as d_l
where r_{l-1}.item_1 = d_l.item_1 and ... and
      r_{l-1}.item_{l-1} = d_l.item_{l-1} and
      t_l.item = d_l.item_l

Subquery Q_1: (select * from TidTable)
Figure 4.4. Support counting using UDF


For example, let F_3 be {{1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2 3 4}}. After the join step, C_4
will be {{1 2 3 4}, {1 3 4 5}}.
Next, in the prune step, all itemsets c ∈ C_k, where some (k-1)-subset of c is not
in F_{k-1}, are deleted. Continuing with the example above, the prune step will delete the
itemset {1 3 4 5} because the subset {1 4 5} is not in F_3. We will then be left with only
{1 2 3 4} in C_4. We can perform the prune step in the same SQL statement as the join
step above by writing it as a k-way join as shown in Figure 3.2. A k-way join is used
since for any k-itemset there are k subsets of length (k-1) which we need to check
for membership in F_{k-1}. The join predicates on I_1 and I_2 remain the same. After the join
between I_1 and I_2 we get a k-itemset consisting of (I_1.item_1, ..., I_1.item_{k-1}, I_2.item_{k-1})
as shown above. For this itemset, two of its (k-1)-length subsets are already known to be
frequent since it was generated from two itemsets in F_{k-1}. We check the remaining k-2
subsets using additional joins. The predicates for these joins are enumerated by skipping
one item at a time from the k-itemset as follows. We first skip item_1 and check if the
subset (I_1.item_2, ..., I_1.item_{k-1}, I_2.item_{k-1}) belongs to F_{k-1} as shown by the join with I_3
in Figure 3.2. In general, for a join with I_r (3 ≤ r ≤ k), we skip item r-2, which gives us
join predicates of the form
    I_1.item_1 = I_r.item_1 and
    ...
    I_1.item_{r-3} = I_r.item_{r-3} and
    I_1.item_{r-1} = I_r.item_{r-2} and
    ...
    I_1.item_{k-1} = I_r.item_{k-2} and
    I_2.item_{k-1} = I_r.item_{k-1}
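The join and prune steps of the example above can be checked with a short Python sketch. It is an in-memory illustration of the candidate generation logic, not the SQL formulation of Figure 3.2, and the function name `generate_candidates` is hypothetical.

```python
from itertools import combinations


def generate_candidates(frequent_prev):
    """Return (joined, pruned) k-itemsets built from frequent (k-1)-itemsets."""
    prev = sorted(tuple(sorted(s)) for s in frequent_prev)
    if not prev:
        return [], []
    prev_set = set(prev)
    k = len(prev[0]) + 1
    # Join step: pair itemsets that agree on the first k-2 items.
    joined = [a + (b[-1],)
              for a in prev for b in prev
              if a[:-1] == b[:-1] and a[-1] < b[-1]]
    # Prune step: every (k-1)-subset must itself be frequent.
    pruned = [c for c in joined
              if all(sub in prev_set for sub in combinations(c, k - 1))]
    return joined, pruned


f3 = [{1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {1, 3, 5}, {2, 3, 4}]
joined, c4 = generate_candidates(f3)
```

Here `joined` holds the two itemsets produced by the join step and `c4` keeps only {1 2 3 4}, matching the worked example.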


data mining and decision support operations on the same data, tighter integration of these
operations with database systems will be required.
This dissertation presents strategies aimed at tighter database integration of mining
and identifies optimizations and primitive operators to make database systems a better
platform for mining and decision support.


[Four bar-chart panels, one each for Dataset-A, Dataset-B, Dataset-C and Dataset-D,
showing time in seconds split by pass at several support values.]
Figure 4.7. Comparison of four architectures
The Stored-procedure approach is the worst. The difference between Cache-Mine
and Stored-procedure is directly related to the number of passes. For instance,
for Dataset-A the number of passes increases from two to three when decreasing
support from 0.5% to 0.35% causing the time taken to increase from two to three
times. The time spent in each pass for Stored-procedure is the same except when
the algorithm makes multiple passes over the data since all candidates could not
fit in memory together. This happens for the lowest support values of Dataset-B,


CHAPTER 3
ASSOCIATION RULES
In this chapter, we discuss the various SQL-92 (SQL with no object-relational extensions)
formulations of association rule mining. We start with a review of the Apriori
algorithm for association rule mining in Section 3.1. A few other algorithms for mining
association rules are briefly outlined in Section 3.2. The input-output data formats are
described in Section 3.3 and in Section 3.4, we introduce SQL-based association rule mining.
The various SQL-92 formulations are presented in Section 3.5. We present experimental
results showing the performance of these formulations on some real-life datasets in
Section 3.6. In Section 3.7, we develop cost formulae for the cost of executing the above SQL
queries on a query processor, based on the input data parameters and relational operator
costs. A few performance optimizations to the basic SQL-92 approaches and the
corresponding performance gains are presented in Section 3.8. Section 3.9 quantifies the overall
performance improvements of the optimizations with experiments on synthetic datasets.
The association rule mining problem outlined in Section 1.2.1 can be decomposed into
two subproblems [7].

Find all combinations of items whose support is greater than minimum support. Call
those combinations frequent itemsets.

Use the frequent itemsets to generate the desired rules. The idea is that if, say, ABCD
and AB are frequent itemsets, then we can determine if the rule AB → CD holds by


answers to "How to mine data warehouses?" Given the amount of data involved in mining,
its potential impact on various business sectors and the fact that OLAP is finding its way
into commercial database systems (for example, the cube operator), it is only a matter of
time before mining becomes an integral part of database systems. We believe that this work
is a small but strong step in the right direction. This will also have a significant impact on
query optimization and parallel query processing techniques.
1.5 Thesis Organization
The rest of this dissertation is organized as follows. We discuss the various architectural
alternatives for integrating mining with database systems/data warehouses in Chapter 2.
The various SQL formulations of association rules, their performance profiles and
optimizations for improving the performance are detailed in Chapter 3. In Chapter 4, we present
the use of object-relational extensions to SQL for improving the performance of SQL-based
association rule mining and the performance comparison of the various architectural
alternatives. SQL-based mining of generalized association rules and sequential patterns are
described in Chapters 5 and 6 respectively. Chapter 7 presents the incremental association
rule mining algorithm, performance comparison, SQL formulation and generalizations
for mining constrained associations. We conclude in Chapter 8 with a discussion of the
proposed database operators and avenues for further research.


Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy
ARCHITECTURES AND OPTIMIZATIONS FOR INTEGRATING
DATA MINING ALGORITHMS WITH DATABASE SYSTEMS
By
Shiby Thomas
December, 1998
Chairman: Dr. Sharma Chakravarthy
Major Department: Computer and Information Science and Engineering
Data mining on large data warehouses is becoming increasingly important. In support
of this trend, we consider a spectrum of architectural alternatives for integrating mining
with database systems. These alternatives include loose-coupling through a SQL cursor
interface; encapsulation of the mining algorithm in a stored procedure; caching the data to
a file system on-the-fly and mining; tight-coupling using primarily user-defined functions;
and SQL implementations for processing in the DBMS. First, we comprehensively study
the option of expressing the association rule mining algorithm in the form of SQL queries.
We consider four options in SQL-92 and six options in SQL enhanced with object-relational
extensions (SQL-OR). Our evaluation of the different architectural alternatives shows that
from a performance perspective, the Cache-Mine option is superior, although the SQL-
OR option comes a close second. Both the Cache-Mine and the SQL-OR approaches
incur a higher storage penalty than the loose-coupling approach which performance-wise


Note: Part of the work described in this chapter was done primarily by researchers
from the IBM Almaden Research Center. Specifically, the SQL-based candidate generation in
Section 3.4.1 and the support counting approaches in Section 3.5 were developed by them.
They are included in this dissertation for completeness.


dataset and also to fine-tune input parameters. Several mining algorithms use
sampling [44, 127].
8.2 Contributions1
In this dissertation, we have addressed the following problems.
1. Analyze the different database integration alternatives for data mining.
2. Develop and implement various SQL-based approaches for association rule mining.
3. Study the performance profile of current DBMSs to execute the above SQL queries.
4. Compare the different database integration architectures quantitatively and
qualitatively.
5. Develop cost formulae for the SQL approaches based on input data parameters and
relational operator costs. These provide some insights into enhancing current cost-
based optimizers to incorporate the domain-specific semantics of mining algorithms.
6. Extend the association rule framework for mining generalized association rules and
sequential patterns.
7. Develop an incremental association rule mining algorithm.
8. Develop SQL formulations of the incremental algorithm and of its generalization to
handle constraints.
9. Generalize the incremental algorithm for constrained association mining and
demonstrate its applicability to a larger class of data mining and decision support problems.
1 The work corresponding to the first four items was done primarily by researchers from the
IBM Almaden Research Center, with the author as a contributor. For the remaining items, the
author was the primary contributor. The file-based incremental association rule mining
algorithm (item 7) was developed primarily by the author as part of the Introduction to
Parallel Computing course project.


CHAPTER 2
FROM FILE MINING TO DATABASE MINING
The first-generation KDD systems offer isolated discovery features using tree
inducers, neural nets and rule discovery algorithms. Such systems cannot be embedded into a
large application and typically offer just one knowledge discovery feature. The current state
of data mining systems is much like the days in which database applications were
written in COBOL with just read and write commands as the interface to data stored in
large files. The advent of relational database systems, which offered SQL for ad hoc queries
and various relational APIs (application programming interfaces) for application
programming, made database applications much easier to develop and manage. Data mining has
to undergo a similar transition from the current file mining to data warehouse mining and
to a richer set of APIs for developing business intelligence and decision support applications.
In the remainder of this chapter, we survey some of the prior research related to the
database integration of mining in Section 2.1. The various architectural alternatives are
discussed in Section 2.2.
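
As a minimal sketch of two of the architectural alternatives, loose-coupling through a
SQL cursor interface and SQL processing inside the DBMS, consider counting the support of
single items. This is not the implementation studied in this dissertation. The table
sales(tid, item), the function names and the use of an in-memory SQLite database as a
stand-in for the DBMS are assumptions made only for illustration, and each (tid, item)
pair is assumed to occur at most once.

```python
import sqlite3
from collections import Counter


def make_db():
    """Build a tiny in-memory database (hypothetical schema: sales(tid, item))."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (tid INTEGER, item TEXT)")
    rows = [
        (1, "bread"), (1, "milk"),
        (2, "bread"), (2, "beer"),
        (3, "bread"), (3, "milk"),
    ]
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn


def frequent_items_cursor(conn, minsup):
    """Loose-coupling: pull every row through a cursor and count in the application."""
    counts = Counter()
    for _tid, item in conn.execute("SELECT tid, item FROM sales"):
        counts[item] += 1  # one row per (tid, item), so this is the item's support
    return {item: c for item, c in counts.items() if c >= minsup}


def frequent_items_sql(conn, minsup):
    """SQL formulation: the DBMS does the counting and the filtering."""
    cur = conn.execute(
        "SELECT item, COUNT(*) FROM sales GROUP BY item HAVING COUNT(*) >= ?",
        (minsup,),
    )
    return dict(cur)


if __name__ == "__main__":
    conn = make_db()
    print(frequent_items_cursor(conn, 2))
    print(frequent_items_sql(conn, 2))
```

Both functions return the same item supports. In the first, every row crosses the
DBMS boundary before it is counted, while in the second only the qualifying items do.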
2.1 Related Work
Researchers have started to focus on various issues related to integrating mining with
databases [6, 67, 68]. The research on database integration of mining can be broadly
classified into two categories: one which proposes new mining operators and the other which