Tree-based Incremental Classification for Large Datasets
Hankil Yoon* Khaled Alsabti† Sanjay Ranka*
hyoon@cise.ufl.edu alsabti@ccis.ksu.edu.sa ranka@cise.ufl.edu
CISE Department, University of Florida
TR99013
Abstract
Classification is an important, well-known problem in the field of data mining, and has been studied
extensively by several research communities. Thanks to the advances in data collection technologies
and large-scale business enterprises, the datasets for data mining applications are usually large and may
involve several millions of records with high dimensionality, which makes the task of classification
computationally very expensive. In addition, rapid growth of data can continuously make the previously
constructed classifier obsolete. In this paper, we propose a framework named ICE for incrementally
classifying such ever-growing large datasets. The framework is scalable with minimal data access
requirements, and can be easily migrated to parallel as well as distributed computing environments.
We provide the mathematical background for incremental classification based on weighted samples
and sampling techniques that extract weighted samples from decision trees. Experimental results show
that our framework outperforms, or is at least comparable to, random sampling and promises to be a
basis of incremental classification for large, ever-growing datasets where fast development of a decision
tree classifier is needed.
1 Introduction
Classification is an important, well-known problem in the field of data mining, and has remained an
extensive research topic within several research communities. Over the years, classification has been
successfully applied to diverse areas such as retail target marketing, medical diagnosis, weather prediction,
credit approval, customer segmentation, and fraud detection [MST94]. The classification models [LLS97]
that have been proposed in the literature include Bayesian classification [CKS88], neural networks [Rip96],
statistical models such as linear/quadratic discriminants [Jam85], genetic models [Gol89], and decision
trees [BFOS84, Qui93, SAM96].
Among these models, decision trees are particularly suited for data mining for the following reasons.
First of all, compared to a neural network or a Bayesian classifier, a decision tree is easily interpreted
and comprehended by human beings, and can be constructed relatively fast [BFOS84]. While training
neural networks takes a long time and thousands of iterations, inducing a decision tree is efficient and is
thus suitable for large datasets [SAM96]. Also, decision tree generation algorithms do not require additional
information, e.g., prior knowledge of domains or distributions of the data and classes, other than that
already contained in the dataset. Finally, decision trees result in good classification accuracy compared
to the other models [MST94]. For these reasons, we focus on the decision tree model in this paper.
*Department of Computer & Information Science & Engineering, University of Florida, Gainesville, FL 32611
†Department of Computer Science, King Saud University, Saudi Arabia
Decision tree classification   The input to building a classifier is a training dataset of records, each of
which is tagged with a class label. A set of attribute values defines each record. Attributes with discrete
domains are referred to as categorical, whereas those with ordered domains are referred to as numerical.
Classification is the process of generating a concise description or model for each class of the training
dataset in terms of the predictor attributes defined in the dataset. The model may subsequently be
tested with a test dataset for verification purposes. The model is then used to classify future records
whose classes are unknown.
A decision tree is a class discriminator that recursively partitions the training dataset until the halting
criterion (each partition consists entirely or dominantly of records from one class) is satisfied [BFOS84,
MAR96, Qui93, SAM96]. Each nonleaf node of the tree contains a splitting condition, which is a test on
one or more attributes and determines how the data is partitioned.
Derivation of a decision tree classifier typically consists of a construction phase and a pruning phase. In
the construction phase, the initial decision tree is built using the training dataset. Building the tree requires
recursive partitioning of the training dataset into two or more subpartitions (subtasks) using splitting
conditions. In order to discover the splitting condition at each node, classifiers such as CART, SLIQ,
and SPRINT use the gini index [BFOS84]. The advantage of the gini index is that its calculation requires
only the class distribution in each of the partitions. A partition which meets the halting criterion is
not divided further, and the node that represents the partition is labeled with the dominant class. This
recursive process generates a hierarchical structure with the root representing the entire training dataset
and the other nodes representing subtasks.
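The recursive construction just described can be sketched for a single numeric attribute as follows. This is a simplified illustration, not the CART/SLIQ/SPRINT implementations; all function names and the example data are ours:

```python
from collections import Counter

def gini(labels):
    """Gini index of a set of class labels: 1 minus the sum of squared class fractions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_numeric_split(records, labels):
    """Try midpoints between consecutive sorted values of a single numeric
    attribute; return (threshold, weighted gini of the split)."""
    order = sorted(range(len(records)), key=lambda i: records[i])
    xs = [records[i] for i in order]
    ys = [labels[i] for i in order]
    best = (None, float("inf"))
    for k in range(1, len(xs)):
        if xs[k] == xs[k - 1]:
            continue
        t = (xs[k] + xs[k - 1]) / 2
        g = (k * gini(ys[:k]) + (len(xs) - k) * gini(ys[k:])) / len(xs)
        if g < best[1]:
            best = (t, g)
    return best

def build_tree(records, labels, min_size=1):
    """Recursively partition until a partition is pure or too small (halting
    criterion); a halted partition becomes a leaf labeled with the dominant class."""
    if len(set(labels)) == 1 or len(records) <= min_size:
        return {"class": Counter(labels).most_common(1)[0][0]}
    t, _ = best_numeric_split(records, labels)
    if t is None:
        return {"class": Counter(labels).most_common(1)[0][0]}
    left = [i for i in range(len(records)) if records[i] <= t]
    right = [i for i in range(len(records)) if records[i] > t]
    return {"split": t,
            "left": build_tree([records[i] for i in left], [labels[i] for i in left], min_size),
            "right": build_tree([records[i] for i in right], [labels[i] for i in right], min_size)}
```

On a toy dataset where ages up to 35 belong to class C1 and older ages to C2, the root split lands at 37.5 and both children become pure leaves.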
The construction phase builds a perfect tree that accurately classifies every record from the training
dataset. However, new records can often be classified with higher accuracy by an imperfect, smaller
decision tree than by one which perfectly classifies all known records [QR89]. The reason is that a
decision tree which is perfect for the known records may be overly sensitive to statistical irregularities
and idiosyncrasies of the training dataset. Thus, most algorithms perform a pruning phase after the
construction phase, in which nodes are iteratively pruned to prevent overfitting and to obtain a tree
with higher accuracy for future records. In general, the pruning phase is performed separately from the
construction phase and needs access only to the full-grown decision tree from the training dataset. An
algorithm based on the minimum description length (MDL) principle [QR89] is used in this research to
prune the decision tree. In addition, since the pruning phase can generally be executed in-memory and
its cost is very small compared to that of the construction phase, we restrict our attention to the
construction phase in this paper.
Figure 1 shows a sample decision tree classifier for the given training dataset. Each record describes
a person and is tagged with one of the two class labels C1 and C2. Each record is characterized by
two attributes, age and profession, the former being numeric (specifically, a positive integer) and the latter
being categorical with domain { clerk, business, teacher, professor }, which denotes the occupation of the
person described in each record. The splitting conditions (age <= 35) and (profession = clerk) partition
the records into the corresponding classes. The goal of building the classifier is to discover, from the
training dataset, concise and meaningful conditions involving age and profession by which a person is
classified into one of the two classes, C1 or C2.
Classification algorithms have different properties, which are used to compare classifiers. The accuracy
of a classifier is the primary metric. It measures the predictive performance of the classifier and is
determined by the percentage of the test dataset examples that are correctly classified [WK91]. The
accuracy metric assumes that all (mis)classifications are of equal importance. The misclassification cost
metric has been defined to deal with different misclassification costs, i.e., when misclassifying an instance
with class x costs more than with class y. The objective is to minimize the overall misclassification cost.
The risk metric extends the misclassification metric to include the gain of cases that are correctly classified.
No  Age  Profession  Class
1   24   Clerk       C1
2   17   Business    C1
3   22   Teacher     C1
4   35   Professor   C1
5   40   Professor   C2
6   48   Clerk       C1
7   52   Teacher     C2
8   46   Clerk       C1

[Decision tree: node A tests (Age <= 35). Records 1-4 fall into leaf B with class C1; records 5-8 reach
node C, which tests (Profession = Clerk). Records 6 and 8 fall into leaf D with class C1, and records 5
and 7 fall into leaf E with class C2.]

(a) Training data set    (b) Decision tree

Figure 1: Example of a decision tree classifier
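The splitting conditions of the example tree can be applied directly. The following sketch (ours) reproduces the classification of the training records in Figure 1:

```python
def classify(age, profession):
    """Apply the splitting conditions of the decision tree in Figure 1."""
    if age <= 35:
        return "C1"          # leaf B
    if profession == "Clerk":
        return "C1"          # leaf D
    return "C2"              # leaf E

training = [
    (24, "Clerk", "C1"), (17, "Business", "C1"), (22, "Teacher", "C1"),
    (35, "Professor", "C1"), (40, "Professor", "C2"), (48, "Clerk", "C1"),
    (52, "Teacher", "C2"), (46, "Clerk", "C1"),
]
# The tree in Figure 1 classifies every training record correctly.
assert all(classify(a, p) == c for a, p, c in training)
```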
Classifiers that are easy to understand and interpret are preferred. This is called the comprehensibility
metric. The size of the model generated can be used to measure this metric. The generalization metric
determines the generalization property of a classifier: compact models are preferred as they generalize
better. The ability of a classifier to deal with missing values is another metric.
Needs for incremental classification   The datasets for data mining applications are usually large and
may involve several millions of records. Each record typically consists of tens to hundreds of attributes, some
of them with a large number of distinct values. In general, using large datasets results in an improvement
in the accuracy of the classifier [Cat84], but the enormity and complexity of the data involved
in these applications make the task of classification computationally very expensive. Since the datasets are
large, they cannot reside completely in-memory, which makes I/O a significant bottleneck. Performing
classification for such large datasets requires the development of new techniques that limit accesses to the
secondary storage in order to minimize the overall execution time.
Another problem that makes classification more difficult is that most datasets grow over time,
continuously making the previously constructed classifier obsolete. Addition of new records (or deletion
of old records) may potentially invalidate the existing classifier that has been built on the old dataset
(which is still valid). However, constructing a classifier for a large growing dataset from scratch would be
extremely wasteful and computationally prohibitive.
Many machine learning techniques and statistical methods have been developed for small datasets.
These techniques are generally iterative in nature and require several passes over the dataset, which makes
them inappropriate for data mining applications. As an alternative, techniques such as discretization
and random sampling can be used to scale up decision tree classifiers to large datasets. However,
generating good samples is still challenging for these techniques. In some cases, the techniques are still
expensive, and the samples may still be large for some large applications. In addition, these techniques
may deteriorate the accuracy of the classifier for datasets with a large number of special cases (small
clusters in space).
In this paper, we present a framework that incrementally classifies a database of records which grows
over time, and that still results in a decision tree of comparable quality. The framework, named ICE
(Incremental Classification for Ever-growing large datasets), incrementally builds decision trees for large
growing datasets based on tree-based sampling techniques. The framework is described in greater detail
in Section 4.
The rest of the paper is organized as follows. A brief survey of related work on decision tree classifiers
is given in Section 2. The mathematical background of incremental classification is presented in Section 3,
together with a few sampling techniques for incremental classification. The framework for incremental
classification is proposed in Section 4. The experimental results and a comparison of the framework
against random sampling are provided in the next section. The last section concludes the paper and outlines
possible extensions.
2 Related work
Most algorithms in machine learning and statistics are main-memory algorithms, whereas today's
databases are typically much larger than main memory [AIS93]. The construction phases of the various
decision tree classifiers differ in selecting the test criterion for partitioning a set of records. CLS [HMS66],
one of the earliest classifiers, examines the solution space of all possible decision trees to a fixed depth.
Then it chooses a test that minimizes the cost of classifying a record, which is the sum of the cost of
determining the feature values for testing and the cost of misclassification. ID3 [Qui86] and C4.5 [Qui93]
replace the computationally expensive lookahead scheme of CLS with a simple information-theory-
driven scheme that selects a test which minimizes the impurity of the partition. On the other hand,
CART [BFOS84] and SPRINT [SAM96] select the test with the lowest gini index. Classifiers like C4.5
and CART assume that the training dataset fits in main memory. SPRINT, however, can handle
large training datasets with several millions of records by maintaining separate lists for each attribute.
A recently proposed algorithm, CLOUDS [ARS98], samples the splitting points for numeric attributes,
followed by an estimation step to narrow the search space of the best split. CLOUDS reduces computation
and I/O overhead substantially compared to the above-mentioned classifiers, while maintaining the
quality of the generated trees in terms of accuracy and overall tree size.
Several techniques have been proposed to scale up decision tree classifiers for large datasets.
There are mainly two approaches: exact and approximate. Discretization and sampling are approximate
techniques. Sampling the dataset is simple: a random subset of the entire dataset is used to build
the classifier. The windowing technique used in C4.5 [Qui93] repeats the following process a number
of times. A small sample is drawn from the dataset and subsequently used to build a new tree. The
sample is augmented with the examples that are misclassified by the tree, and then another tree is built
with the augmented sample. Stratification is a selective sampling method in which the classes have
approximately the same distribution [Cat84]. Two new sampling techniques are proposed in [ARS98]:
one in which the splitting criteria are derived from a limited number of splitting points, and the other
which estimates the gini index values in order to narrow the search space of the best split. Both methods
evaluate the gini index at only a subset of points along each numeric attribute.
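The windowing loop described above can be sketched as follows. A one-level threshold stump stands in for the full tree builder here (our simplification, so the sketch illustrates the loop rather than C4.5 itself); all names are ours:

```python
import random

def train_stump(sample):
    """Toy 'tree': a one-level threshold stump on a single numeric attribute.
    Stands in for the real decision tree learner inside the windowing loop."""
    best = None
    for t in sorted(set(x for x, _ in sample)):
        for lo, hi in (("A", "B"), ("B", "A")):
            err = sum((lo if x <= t else hi) != y for x, y in sample)
            if best is None or err < best[0]:
                best = (err, t, lo, hi)
    _, t, lo, hi = best
    return lambda x: lo if x <= t else hi

def windowing(dataset, window_size=4, rounds=10, seed=0):
    """C4.5-style windowing: train on a small window, then grow the window
    with the examples the current tree misclassifies, and retrain."""
    rng = random.Random(seed)
    window = rng.sample(dataset, window_size)
    for _ in range(rounds):
        tree = train_stump(window)
        missed = [(x, y) for x, y in dataset if tree(x) != y and (x, y) not in window]
        if not missed:          # current tree is consistent with the whole dataset
            return tree
        window += missed
    return train_stump(window)
```

For a dataset separable by one threshold, the loop terminates with a tree that classifies every record correctly while having trained mostly on a small window.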
Discretization is a process of transforming a numeric attribute into an ordered discrete attribute with
a small number of values, by which the sorting operation is avoided in the construction phase. All
discretization methods for classification that take the class label into account when discretizing assume
that the database fits into main memory [Qui93, FI93]. The static discretization techniques perform
the discretization prior to building the tree [Cat84]. Another technique, called peepholing, iterates several
passes to reduce the number of attributes that are considered for deriving the splitting point. The SPEC
algorithm is an approximation technique designed to deal with disk-resident data. It uses a
clustering technique to derive the splitting point of the numeric attributes.
The SPRINT algorithm is an exact technique that scales up the decision tree for disk-resident
data [SAM96]. The algorithm employs a presorting technique to avoid the sorting operation at every
node of the tree during the construction phase. While partitioning the data using the attribute with
the lowest gini index, a hash table is created to partition the other attribute lists and determine their
destination.
While offering a fast, scalable decision tree classification algorithm, SPRINT has the following
drawbacks [ARS98]. In the beginning, the initial training dataset is transformed into another representation,
and the size of the new dataset is expected to be at least twice as large as the initial dataset for most
datasets. It also requires sorting the entire training dataset. In cases where the hash table cannot fit
in-memory, the partitioning process requires several passes over the dataset.
The tree induction algorithm ID5 restructures an existing decision tree in a dynamic environment
under the assumption that the training dataset fits in-memory [Utg89]. Utgoff et al. present a set of
restructuring operations that can be used to derive a decision tree construction algorithm for a dynamically
changing dataset while maintaining the optimal tree [UBC97], but they also assume that the dataset fits
in-memory. Meta-learning techniques can be used to scale up the classification process to large datasets
while achieving comparable accuracy [CS97]. These techniques combine a set of (base) classifiers to form a
global classifier. The meta-learning techniques can be used efficiently for classifying partitioned or large
disk-resident datasets and can be applied hierarchically.
A framework called RainForest has recently been proposed for developing fast and scalable algorithms for
constructing decision trees that gracefully adapt to the amount of main memory available [GRG98].
BOAT [GGRL99] optimistically constructs exactly the same decision tree from samples extracted by
using a bootstrapping technique. With changes in data distribution, BOAT requires another pass over the
dataset and needs temporary information to incrementally construct the tree.
Note that there is no "best" classification method for all applications; this is known as the conservation
law [Sch94]. It has been shown that no algorithm uniformly outperforms the others in generalization accuracy. The
best or appropriate method for a particular application depends on the characteristics of the application.
The user's goal may help in choosing an appropriate classification method.
3 Incremental classification
In this section, we define the problem of incremental classification and present the background on which
our approach to incremental classification is established. For the rest of the paper, we assume binary,
invariant decision trees. However, the approach can easily be extended to other types of decision trees.
Problem definition   A decision tree T has been built for a dataset D of size N by a decision tree
classification algorithm. After a certain period of time, a new incremental dataset d of size n has been
collected. We now want to build a decision tree T' for the combined dataset D + d. The problem is
nontrivial because the addition of new records may change some splitting points, and consequently the
new decision tree T' may be quite different from T. The goal is to build a decision tree classifier T' with
minimal cost such that T' is comparably as accurate as the decision tree that would be built for D + d
from scratch.
Formally, let M be a method (either a framework or an algorithm) for incremental decision tree
classification. Then the decision tree T induced for D needs to be transformed by M as follows:
M(T, d) = T'
We want to keep the cost of inducing T' from T for the incremental dataset d much lower than that of
inducing T' from D + d from scratch, while generating a T' of equal or better quality. We discuss these
issues in Section 4.3. A few quality measures of decision trees are also presented in Section 5.
Reusing previous computation   It would be very attractive if some of the previous computation that was
performed on the original decision tree T could be reused. For instance, if an algorithm like SPRINT is
used to build T, then the values of the splitting numerical attribute at each leaf node of the tree are sorted.
The difficulty lies in the fact that only the splitting numerical attribute is sorted in a particular leaf node,
and each leaf node may have a different splitting attribute.
We can also modify the decision tree algorithm so that some useful information, such as the class
distribution, is stored at each node while T is being built. However, such information is dependent on the
splitting attribute. Saving the class distribution with regard to other attributes may be undesirable
because it is difficult to choose the attributes that may turn out to be the next splitting point for T'.
The class distribution alone does not help much to discover the new splitting point because of the
instability of impurity-based split selection methods described next. We may have to recompute from
scratch to find the exact splitting point. For approximation, the class distribution may be useful to compute
the splitting point for categorical attributes, whereas it is not so helpful for numeric attributes.
Impurity-based split selection methods   We use the gini index [BFOS84] to illustrate impurity-
based split selection methods. Let n and C_i be the total number of records in a set S and the number
of records that belong to class i, respectively. Let m be the number of classes. Then the gini index is
defined as follows:

    gini(S) = 1 - sum_{i=1}^{m} (C_i / n)^2    (1)
When S is split into two subsets, S1 and S2, with n1 and n2 records, respectively, the corresponding
gini index of the split set S is calculated in terms of the gini index of the subsets as follows:

    gini_split(S) = (n1/n) gini(S1) + (n2/n) gini(S2)    (2)
The gini index defined above does not lend itself to reuse in accordance with changes in the
number of records (and hence the class distribution). Impurity-based split methods such as the gini index
exhibit such instability [GGRL99], which makes the induction of T' from T difficult.
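Equations (1) and (2), and the instability just described, can be illustrated with a small single-attribute sketch (the data and function names are ours, chosen for illustration):

```python
from collections import Counter

def gini(labels):
    """Equation (1): gini(S) = 1 - sum over classes of (C_i / n)^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(left, right):
    """Equation (2): size-weighted gini of a two-way split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_split(points):
    """Lowest-gini_split midpoint for (value, label) pairs on one attribute."""
    pts = sorted(points)
    best_t, best_g = None, float("inf")
    for k in range(1, len(pts)):
        if pts[k][0] == pts[k - 1][0]:
            continue
        g = gini_split([y for _, y in pts[:k]], [y for _, y in pts[k:]])
        if g < best_g:
            best_t, best_g = (pts[k][0] + pts[k - 1][0]) / 2, g
    return best_t

D = [(10, "A"), (20, "A"), (30, "B"), (40, "B")]
print(best_split(D))        # 25.0: a clean class boundary
d = [(5, "B"), (22, "B"), (24, "B")]
print(best_split(D + d))    # 21.0: a few new records move the best splitting point
```

The shift from 25.0 to 21.0 after adding only three records is exactly why T' cannot, in general, be obtained from T by keeping the old splitting points.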
3.1 Weighted gini index
The gini index presented above works on records with equal weight (i.e., equal importance). When
each record carries a different weight (i.e., represents as many points), the gini index must also change
accordingly. In this section, we provide a new gini index as a mathematical basis for incremental
classification.
The unweighted gini index is given in Equations 1 and 2. Now suppose that there exists a set S' of n
sorted numeric points in nondecreasing order, with each point (call it a w-point from now on) associated
with a weight w >= 1, where w is a positive integer. A w-point with weight w represents as many
original points (each with weight 1) that belong to the same class. Let w_j be the weight of the j-th w-point
in S'. Then, the total weight sum of S' is:

    W = sum_{j=1}^{n} w_j >= n    (3)
The set S' is now divided into two subsets, S'_1 and S'_2, with n1 and n2 w-points, respectively, such
that n = n1 + n2 and all the w-points in S'_1 are less than any one in S'_2. Now the question is how we
take the weights into account in calculating the impurity of a set. Recall that C_i/n
in Equation 1 is the probability for a point in S to be classified into class i. Let Q_i be the weight sum
of the w-points that belong to class i in S'. Then, Equation 1 is rewritten as follows:

    gini_w(S') = 1 - sum_{i=1}^{m} (Q_i / W)^2    (4)
We now consider the distribution of w-points between S'_1 and S'_2 in computing the gini index in
Equation 2. In Equation 2, the terms n1/n and n2/n represent the portions of the subsets S1 and S2,
respectively, in the set S. In other words, these simple fractions mean that the impurity of a set tends
to increase as the number of records contained in the set increases. When the points are associated with
weights, this statement translates to: the impurity of a set tends to increase as the weight sum of the points
contained in the set increases.
Let W1 and W2 be the weight sums of subsets S'_1 and S'_2, respectively. Then we can modify
Equation 2 based on Equations 3 and 4 as follows:

    gini_w,split(S') = (W1/W) gini_w(S'_1) + (W2/W) gini_w(S'_2)    (5)
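Equations (3) through (5) can be checked with a short sketch (ours). It also verifies that collapsing same-class points that fall entirely on one side of a split into a single w-point leaves the split index unchanged:

```python
from collections import defaultdict

def gini_w(wpoints):
    """Equation (4): weighted gini over (weight, label) pairs, with Q_i the
    weight sum of class i and W the total weight sum of Equation (3)."""
    W = sum(w for w, _ in wpoints)
    Q = defaultdict(float)
    for w, y in wpoints:
        Q[y] += w
    return 1.0 - sum((q / W) ** 2 for q in Q.values())

def gini_w_split(S1, S2):
    """Equation (5): weight-sum-weighted gini of a two-way split."""
    W1 = sum(w for w, _ in S1)
    W2 = sum(w for w, _ in S2)
    W = W1 + W2
    return W1 / W * gini_w(S1) + W2 / W * gini_w(S2)

# With unit weights the weighted index reduces to the original gini index.
assert gini_w([(1, "A"), (1, "A"), (1, "B"), (1, "B")]) == 0.5

left  = [(1, "A"), (1, "A"), (1, "B")]
right = [(1, "B"), (1, "B")]
# Collapsing the two left "A" points into one w-point of weight 2 (a clear-cut
# grouping: no w-point straddles the split) leaves the split index unchanged.
merged_left = [(2, "A"), (1, "B")]
assert abs(gini_w_split(left, right) - gini_w_split(merged_left, right)) < 1e-12
```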
Now let us check whether gini_w,split is consistent with the original gini_split index, i.e., gini_w,split = gini_split at
a common splitting point x. Suppose that w-points are chosen from S such that, for any p represented by
s_{i-1} and q represented by s_i for two adjacent w-points s_{i-1} and s_i, p < q holds, as depicted in Figure 2.
Then there always exists a point x such that p < x < q and x does not cut into any w-point.
[Figure omitted: adjacent w-points s_{i-1} and s_i represent points p and q, with p < x < q.]
Figure 2: The set S' of w-points along dimension Z
Lemma 1   The gini_w,split index value at a clear-cut splitting point x on a dimension is consistent with the
original gini index value gini_split at the same splitting point.
Proof   Suppose a point x clear-cuts the set S' of size n into two subsets S'_1 and S'_2, with n1 and n2
w-points, respectively, where n = n1 + n2. Since there is no w-point which contains x, the number
of points before (after) x remains n1 (n2). This yields W1 = n1 and W2 = n2, thus making
W = W1 + W2 = n1 + n2 = n. Therefore, with respect to x, gini_w(S') = gini(S), meaning that the class
distribution Q_i remains unchanged for the subsets S'_1 and S'_2, and thus gini_w,split(S') = gini_split(S) at any
such point x. □
Now we extend this property of w-points to a multidimensional space. Since each node in the decision
tree represents a subpartition which is clearly divided from the partition of its parent node by the splitting
conditions along the path from the root to the node, one can view each node of the decision tree as a hyperbox in the
multidimensional space composed of the attributes of the dataset (Figure 3). Assuming that a single
w-point represents the entire subpartition of a node, we have the following lemma.
Lemma 2   The gini_w,split index value is consistent with the original gini index value gini_split with respect
to the splitting conditions.
Proof   It directly follows from Lemma 1 and from the fact that the splitting conditions along the path
from the root to any child node clearly divide each dimension of the hyperspace. □
When more than one w-point is chosen to represent a node, Lemma 2 may not hold, eventually
resulting in a decision tree that is not as good as the original one in terms of accuracy as the impurity of
such w-points increases. Note that the gini index is always computed between two adjacent w-points
when constructing a decision tree on a set of w-points. Therefore, if each w-point is chosen to be around
the center of a set of original unweighted points, assuming the set of points represented by a w-point is
clustered together, gini_w,split would not be very different from gini_split. This implies that techniques
such as clustering can be used to select w-points from the original dataset. We present various sampling
techniques including clustering in Section 3.2.
Categorical attributes   The weighted gini index deals only with the (numerical) attributes in an ordered
domain. Categorical attributes need to be handled in a different fashion, since it is difficult to define
similarity between the values of categorical attributes due to the lack of an a priori structure. There has been some
work on categorical databases in the areas of clustering [GKR98] and association rule finding,
although the problem has yet to be tackled in the area of classification.
In order to have one value represent the others for a categorical attribute, the simplest method is to
take the majority: the value with the largest count is chosen to represent the others. Another is to create
a vector of values to keep the histogram information. Identifying the value with the largest count is still
easy, and as the dataset is partitioned, correct histogram information for the attributes can be passed down
the decision tree. In this research, we take the majority method.
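The majority method and the histogram pass-down just described can be sketched as follows (function names are ours); the example uses the profession values of Figure 1:

```python
from collections import Counter

def representative(values):
    """Majority method: the most frequent categorical value stands for the rest."""
    return Counter(values).most_common(1)[0][0]

def partition_histogram(hist, taken):
    """As the dataset is partitioned, the histogram of a categorical attribute
    can be passed down the tree by subtracting what one branch takes."""
    return {v: c - taken.get(v, 0) for v, c in hist.items() if c - taken.get(v, 0) > 0}

professions = ["Clerk", "Business", "Teacher", "Professor",
               "Professor", "Clerk", "Teacher", "Clerk"]
assert representative(professions) == "Clerk"      # 3 of the 8 records
hist = dict(Counter(professions))                  # full histogram at the parent node
below = partition_histogram(hist, {"Clerk": 3})    # e.g. the (profession = clerk) branch
assert "Clerk" not in below and below["Teacher"] == 2
```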
3.2 Sampling for incremental classification
In this section, we discuss extracting samples for incremental classification. Note that there is supposedly
no superior sampling method, as there exists no method that consistently outperforms the others. The
quality of the resulting classifier depends primarily on the characteristics of the application (hence the
dataset). One of the goals pursued in this paper is to identify sampling methods that are better suited
for incremental classification in terms of accuracy and performance.
Sampling from decision tree classifier   One of the problems with random sampling is that it is
sensitive to the distribution of the data: when the dataset is skewed, the randomly chosen samples may not
be a good representation of the dataset. Due to this sensitivity to the data distribution, the resulting decision
tree may be generated nondeterministically, depending on which samples are chosen. We observe that
the decision tree built for a dataset can be used to generate a good sample of the dataset. Since the
skewness of the dataset is already reflected in the tree itself, we can consequently generate good samples
in a more consistent manner from the tree. Such a sampling method is expected to result in a new decision
tree of better or comparable quality than random sampling does.
Since each node in the decision tree includes a subpartition of the dataset, which is a subset of
the partition in its parent node, one can view each node of the decision tree as a hyperbox in a
multidimensional space composed of the attributes of the dataset. Figure 3 shows such hyperboxes of the
decision tree derived from the dataset in Figure 1(a). An area enclosed by dotted lines constitutes a
hyperbox which a leaf node represents. The leftmost box corresponds to the node B, which represents
[Figure omitted: hyperboxes over the (age, profession) plane, with age on the horizontal axis (split at 35).
Node B covers age <= 35 with profession in { business, teacher, professor, clerk }; node E covers age > 35
with profession in { business, teacher, professor }; node D covers age > 35 with profession = clerk.]
Figure 3: The hyperbox representation of a decision tree classifier
records with age <= 35 and all kinds of profession. The box on the top right corner corresponds to the
node E, which represents records with age > 35 and profession in { business, teacher, professor }, whereas
the box on the bottom right corner corresponds to the node D, which represents records with age > 35
and profession = { clerk }. Note that the vertical axis of profession is not an ordered domain. Note also
that the node A represents the entire dataset, and the node C consists of the union of the partitions that
the nodes D and E represent. While the decision tree is being built, samples from the nodes of the tree
can be extracted easily. If the samples are to be from level 1, for example, we extract samples from the
nodes B and C. If the samples are to be from the leaf nodes, we extract samples from the nodes B, D,
and E.
Tree-based sampling   As aforementioned, the decision tree being constructed can be used to extract
the samples. Indeed, we expect these samples to show better characteristics than the ones extracted
directly from the dataset, and hence to consequently result in a better decision tree classifier in an
incremental fashion.
Random sampling   Samples are chosen randomly from each node of the decision tree classifier for
the current incremental partition.

Stratified sampling   Samples are randomly chosen from each class of the node under consideration,
in proportion to the size of the class. The choice of samples within a class is random.
Local clustering   As suggested in the previous section, we can apply one of the well-known clustering
algorithms, such as BIRCH [ZRL96] and DBSCAN [EKSX96], to the partition in the node under
consideration. The samples are the cluster centers found by the clustering algorithm. The classes
of the records are considered in clustering the records. Since clustering is executed against the local
partition in each node of the decision tree, this method is called local clustering. It can also be
called post-clustering because clustering is applied after the decision tree is constructed. Again, the
samples can be extracted in two ways: random and stratified. The samples are chosen in the same
way as described above, except that they are selected from each of the clusters found.
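A minimal sketch (ours) of the stratified variant: from one tree node, samples are drawn per class in proportion to the class size, with the choice inside each class left to chance:

```python
import random
from collections import defaultdict

def stratified_node_sample(records, labels, k, seed=0):
    """Draw about k samples from one tree node, per class in proportion to
    the class size; the choice within each class is random."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for r, y in zip(records, labels):
        by_class[y].append(r)
    n = len(records)
    sample = []
    for y, recs in by_class.items():
        take = max(1, round(k * len(recs) / n))   # at least one sample per class
        sample += [(r, y) for r in rng.sample(recs, min(take, len(recs)))]
    return sample

# A node with 100 records, 80 of class C1 and 20 of class C2: a stratified
# sample of 10 keeps the 4:1 class proportion.
s = stratified_node_sample(list(range(100)), ["C1"] * 80 + ["C2"] * 20, k=10)
assert sum(1 for _, y in s if y == "C1") == 8
assert sum(1 for _, y in s if y == "C2") == 2
```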
In the remainder of this section, we discuss other issues of selecting samples based on a decision tree.
We list them in the following:
Attribute value   When a sample is chosen, each attribute of the sample is supposed to represent
all the values of the corresponding attribute in the records which the sample represents. For
a categorical attribute, the majority value is chosen. For a numerical attribute, the
average of the values may be chosen, while the maximum or the minimum value can also be a
choice.
Weight assignment   We provide a background for handling weighted records in Section 3.1. When
the samples are actually extracted, however, weights may or may not be assigned to the samples.
The reasoning behind this is that the values of the attributes in the chosen samples may not be
the best representative values. In some cases, simple weight assignment (i.e., weight is always 1)
may result in a better classifier.
Small node Since most datasets for classification are high-dimensional, the hyperspace composed of
the attributes is expected to be very sparse. A node with a small number of records may consist of
outliers. For this reason, nodes with a small number of records may be ignored in sampling.
An issue here is to determine how many records in a node should be considered small.
Use of training set for pruning When a classifier is built, part of the training dataset (possibly the
most recent partition) can optionally be used for pruning the classifier. The rationale behind this
is that more weight needs to be given to the most recent data partition.
Weight distribution When s samples are chosen from x records in a node, we distribute the weight
equally: each sample represents records with a total weight of w/s, where w is the total weight
of the x records. As the incremental steps iterate, some sample records have weights of 1, whereas
others have weights greater than 1.
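As a minimal sketch of two of these conventions, the representative attribute value and the w/s weight split (the function names here are ours, not part of the paper):

```python
from collections import Counter
from statistics import mean

def representative(values, categorical):
    """Representative value for one attribute of a sample:
    the most frequent value if categorical, the average if numeric
    (the maximum or minimum are possible alternatives)."""
    if categorical:
        return Counter(values).most_common(1)[0][0]
    return mean(values)

def sample_weight(weights, s):
    """Weight carried by each of s samples drawn from a node whose
    records have total weight w: every sample gets w/s."""
    w = sum(weights)
    return w / s
```

For example, six unit-weight records in a node sampled down to s = 3 give each sample a weight of 2.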
4 ICE: A scalable framework for incremental classification
Today's advances in data collection technologies have created another problem: scaling existing
applications to large datasets. Data mining technologies, including classification, have been developed
to deal with this problem. Yet, the growth of data and its high dimensionality remain the key factors
that still make the problem difficult. Addition of new records (or deletion of old records) may potentially
invalidate the existing classifier, and there is strong demand that a new classifier be built quickly. In
order to overcome such difficulties in classifying large, growing datasets, we propose the framework
illustrated in Figure 4.
When the dataset is growing over time and a new classification model is in demand, running a
decision tree algorithm on the entire dataset will be very inefficient and impractical. Further, the cost
of keeping all the old data partitions may become expensive, and oftentimes the old data partitions
may not even be available after a while. The proposed ICE framework works around these problems.
Since any decision tree classification algorithm may be used, ICE is independent of any particular
features of the selected algorithm in developing a new decision tree classifier incrementally.
4.1 The ICE framework
In Figure 4, Di denotes a partition of the entire dataset that is collected during the ith time period
(epoch i). Note that the partition may have been collected over a certain period of time since the dataset
up until Di-1 was classified, or that it is partitioned simply because the dataset is too large to be processed
at once. A decision tree Ti is built for the partition Di, and a set Si of samples is extracted from Ti (see
Section 3.2 for sampling techniques). The set Si is combined with the previous sets of samples
(denoted Ui = S1 ∪ ... ∪ Si). Ui is then used as the training set for the eventual decision tree classifier
Ci for the entire dataset up until Di. The new set of samples Ui = Ui-1 ∪ Si is preserved for the next
epoch of incremental classification. In Figure 4, a straight line represents data flow while a dotted line
denotes the chosen decision tree algorithm that converts a data partition into a decision tree.
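One epoch of this pipeline can be sketched as follows, where `build_tree` and `extract_samples` are placeholders for the chosen decision tree algorithm and one of the sampling techniques of Section 3.2 (this is our illustration, not the paper's code):

```python
def ice_epoch(D_i, U_prev, build_tree, extract_samples):
    """One ICE epoch: build T_i on the new partition D_i, extract the
    sample set S_i from T_i, merge it with the preserved samples, and
    build the classifier C_i on the merged set."""
    T_i = build_tree(D_i)        # tree for the new partition only
    S_i = extract_samples(T_i)   # weighted samples representing D_i
    U_i = U_prev + S_i           # U_i = U_{i-1} union S_i; kept for the next epoch
    C_i = build_tree(U_i)        # classifier for all data seen so far
    return C_i, U_i
```

Note that only Di and the (much smaller) Ui-1 are ever read; the older partitions are never touched.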
Figure 4: The basic framework for incremental classification
The process seems a little complicated as shown in Figure 4. However, only one epoch needs to be
processed at any given moment. Suppose, for instance, that a data partition D4 has recently been
collected, and we are required to build a new decision tree classifier for the collected partitions D1
through D4. The classifier C3 has been the current classifier until D4 becomes available. When C3 was
constructed, the set U3 was collected and preserved as well, combining S3 with U2. Now, D4 is used to
construct T4, from which a new set of samples, S4, is extracted. S4 is in turn combined with U3 to build
the new decision tree classifier C4.
The framework requires building two separate decision trees, Ti and then Ci, at epoch i. However,
only a small amount of time needs to be spent running the chosen decision tree classification algorithm
on two comparably small datasets (Di and Ui) separately, whereas the entire dataset (the partitions up
until Di) is presumably very large compared to the sizes of Di and Ui, making construction of the decision
tree classifier from scratch inefficient and impractical. The dataset Ui is the space overhead of ICE.
This overhead saves at least one pass over the entire dataset, and makes the overall process very efficient
and practical. In an extreme situation, the previous data partitions can be abandoned and only Ui is
preserved for the future at epoch i.
Note that ICE is different from Rainforest, presented in [GRG98], where a few data access algorithms
are used to scale to the size of the dataset. The main objective of Rainforest is to apply algorithms
from the literature and produce a scalable version of the applied algorithm without modifying its result.
On the other hand, the primary goal of ICE is to present a framework for incremental classification
of datasets that grow over a long period of time. ICE not only deals well with the scalability of the
algorithm to the datasets, but also addresses the issue of decision tree growth.
4.2 Characteristics of ICE
ICE not only is suitable for incremental classification when the dataset is ever-growing, but can also
be easily extended to diverse applications. The framework presented in Figure 4 provides solutions for
building a reasonably accurate classification model for large datasets that grow over time (i.e., temporal
scalability), while minimizing data access and running independently of the decision tree classification
algorithm. The fundamental characteristic of the framework that lends itself to being easily extended
to diverse applications is that each step (epoch) deals with a relatively small increment of data. This
characteristic can be used in the following ways:
Size scalability When the dataset is exceptionally large, it takes too long to build a classification
model. To get around this problem, the dataset is divided into n partitions, each of which is used to
build a partial decision tree (Ti). A set Si of samples is extracted from Ti and saved onto disk. When all
the samples are collected, the final decision tree is constructed using the set Un. This process may repeat
for several levels as needed. This method can also be used for rapid prototyping of a classification
model with a very large dataset. The intermediate decision tree built on the set Ui can be used to guide
how the dataset should be partitioned, how samples should be chosen, and so forth.
Parallelization Parallel computing brings the same or comparable results within a much shorter
time than it would take with a single processor. Suppose that the entire dataset (∪ Di) is available and
each epoch in Figure 4 is processed on a single processing node. Each Ti is constructed locally at the
same time and produces Si. The set Un is then collected and processed at a node where the final decision
tree is built. Note that this process may repeat hierarchically. An important issue here is how to divide
and distribute the dataset in order to generate the best possible decision tree.
Distributed computing Similarly to the idea of parallelizing the classification process presented above,
the framework can be further extended to a distributed computing environment, where computers as
well as datasets are geographically dispersed. Each epoch in Figure 4 is assigned to a remotely located
machine, and the resulting sets Si are collected together to generate the final decision tree classifier.
In addition to size scalability, parallelization, and distributed computing discussed above, other
important characteristics of ICE are summarized as follows:
Temporal scalability The framework scales well to datasets that grow over a certain period of
time.
Algorithm independence Any decision tree classification algorithm can be used in the framework
for incrementally constructing a new decision tree classifier.
Minimal data access There is no need to access the previous partitions (the old dataset). Only a
small portion of the old dataset and the new incoming partition are all that is needed to generate
a new decision tree classifier. With a very large dataset, this characteristic saves most of the
time spent on I/O.
Flexibility While ICE does not deal with addition and removal of individual records, it provides
flexibility for insertion as well as deletion of partitions. When an old partition becomes obsolete or
expires, the samples from that partition are removed from Ui and a new classifier is rebuilt. A
new partition is handled by creating a new epoch and hence a new classifier.
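Deleting an expired partition could be sketched like this, assuming (our assumption, not stated in the paper) that each preserved sample is tagged with the epoch it came from:

```python
def retire_partition(U, expired_epoch, build_tree):
    """Remove the samples that originated from an expired partition
    and rebuild the classifier from the remaining samples."""
    U_new = [s for s in U if s['epoch'] != expired_epoch]
    return build_tree(U_new), U_new
```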
Note that ICE can be used for rapid prototyping of a classification model for a very large dataset. The
resultant intermediate classifiers can be used for feature selection when the final decision tree classifier
is built. Yet another application of ICE is to generate good samples for other, more expensive methods
such as neural networks. The practical applications of the framework include classifying data from the
world-wide web, medical diagnosis, and retail target marketing, where data is large, ever-growing,
and continuously added to the existing dataset.
4.3 Cost, performance and quality
Cost of ICE Let T(A, D) be the cost function for building a decision tree classifier for a dataset
D using an algorithm A. In general, T(A, D) takes the form |D| · f(A, D), where f(A, D) depends
primarily on the size of D. When |D| exceeds some threshold so that A must run out-of-core, or A
involves expensive operations such as sorting, f(A, D) becomes asymptotically large, raising the
total cost.
For a constant r such that 0 < r < 1, the cost of ICE at any epoch i using an algorithm A is
composed of three components: the cost of building a tree for Di, the cost of producing samples from
that tree (proportional to |Di|), and the cost of building the final classifier Ci on Ui. Therefore, the
cost of ICE is defined as:

    T_ICE(A, D) = T(A, Di) + r · |Di| + T(A, Ui)

The ratio R of the cost of ICE to the cost of building a decision tree from scratch is:

    R = (T(A, Di) + r · |Di| + T(A, Ui)) / T(A, D)
Assume that each partition Di is of the same size, namely n. Then, at any epoch i, |D| = i · n and
|Ui| = r · i · n. Assuming f(A, D) is a constant, R becomes approximately 1/i + r, which converges to r
for large i. In practice, however, since |Di| and |Ui| are much smaller than |D|, f(A, Di) and f(A, Ui) are
asymptotically smaller than f(A, D). This implies that 1/i + r is an upper bound of R. Therefore, it is
clear that as the dataset grows over time, building a decision tree for the entire dataset becomes more
expensive, whereas the cost of running ICE remains very low, regardless of the size of the entire dataset.
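The ratio under these simplifying assumptions can be checked numerically; this snippet only illustrates the 1/i + r behavior and is not part of the framework:

```python
def cost_ratio(i, r, n=1000.0, f=1.0):
    """Cost ratio R at epoch i, assuming equal partitions of n records,
    |U_i| = r*i*n, and a constant per-record cost f(A, D) = f."""
    t_ice = n * f + r * n + r * i * n * f   # T(A,D_i) + r*|D_i| + T(A,U_i)
    t_full = i * n * f                      # T(A,D), with |D| = i*n
    return t_ice / t_full

# As i grows, R falls toward r: with r = 0.15, R is 1.3 at epoch 1
# but only about 0.151 at epoch 1000.
```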
Performance and quality As shown in [Als98], the performance of a decision tree algorithm naturally
degrades as the dataset becomes large, mainly due to excessive data access and other operations
such as sorting. Since ICE works with small partitions of a large dataset, its execution time is expected
to be much shorter than running the base algorithm on the entire dataset, regardless of the algorithm
chosen. The more important issue with ICE is to eventually attain a decision tree classifier whose
quality is close to what would be achieved using the entire dataset. Another goal is to use the framework
as a rapid prototyping tool that allows us to build the best possible classifier under the given
circumstances.
Another issue is how to choose an appropriate sampling method that can result in a better decision
tree classifier in terms of the metrics discussed in Section 2. This issue is discussed in Section 3.2, and
Section 5 experimentally shows which sampling methods can be used efficiently to generate a decision
tree classifier of equal or better quality.
5 Experimental results
In the experiments, we have used the exhaustive brute force method as the base decision tree
classifier algorithm in ICE. At each node, the algorithm sorts all the numeric attributes of the records
to calculate the gini index. The histograms of categorical attributes are computed at the root node and
passed down the tree iteratively. The attribute with the lowest gini index is chosen to split the records
at the next level. Note, however, that any decision tree classification algorithm can be used in the
framework.
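For reference, a generic gini computation of the kind this brute force search minimizes (a textbook formulation, not the paper's code):

```python
def gini(partition):
    """Gini impurity of one side of a split: 1 - sum(p_c^2),
    where p_c is the fraction of records with class label c."""
    total = len(partition)
    if total == 0:
        return 0.0
    counts = {}
    for label in partition:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

def gini_split(left, right):
    """Weighted gini index of a binary split; the brute force
    algorithm picks the attribute/split point minimizing this."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)
```

A pure split (all of one class on each side) scores 0, the best possible value.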
The experimental results of ICE are compared against random sampling. The samples used in the
framework are extracted using the tree-based local clustering described in Section 3.2. In finding
local clusters, priority is given to records that are already representatives of other records (i.e.,
w-points from previous epochs) so that they become cluster seeds. Other cluster seeds are chosen
randomly, as needed, and a simple k-means clustering algorithm [KR90] has been used in the experiments.
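The seeding rule can be sketched as follows; the helper name and the (point, weight) record representation are our assumptions, and only the seed selection, not the full k-means iteration, is shown:

```python
import random

def choose_seeds(records, k):
    """Seed selection for k-means: records that already represent
    others (weight > 1, i.e. w-points from previous epochs) are
    preferred as cluster seeds; the rest are drawn at random."""
    wpoints = [r for r in records if r[1] > 1]   # (point, weight) pairs
    seeds = wpoints[:k]
    rest = [r for r in records if r[1] <= 1]
    if len(seeds) < k:
        seeds += random.sample(rest, k - len(seeds))
    return seeds
```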
In this paper, we concentrate on measuring and comparing the quality of the decision tree classifiers
generated. Performance in time is indeed an important quantitative measure, but it is clear that ICE
executes on large datasets very quickly by its nature, as described in Section 4.3. Among the many
aspects of quality discussed in Section 2, two are considered for comparing the quality of a decision tree
classifier: the accuracy and the size of the classifier.
5.1 Experimental results with letter dataset
The letter dataset, obtained from [BKM98], contains 15,000 training records and 5,000 test records.
There are 26 class values and each record consists of 16 attributes, all of which are numeric (integers).
Each record represents the characteristics of a handwritten alphabet letter, describing the shape of the
written letter. The dataset is divided into 5 equal-size partitions (epochs), each with 3,000 training
records.
Figure 5 shows the classification accuracy of each Ci after the first epoch (at the first epoch, T1 = C1
as depicted in Figure 4). The solid straight line (denoted Cumulative) denotes the accuracy of the
decision tree for the cumulative dataset ∪ Di up to the current epoch i, using the brute force algorithm.
The dotted straight line (denoted Partition i) represents the accuracy of the decision tree Ti built on the
current partition Di. For ICE, x% of the records of Di are extracted from Ti as samples, where x varies
over 1, 2, 5, 10, 20, 30, and 50. On the other hand, the random samples are chosen from the cumulative
dataset ∪ Di at epoch i, and the brute force method is applied to them (denoted Random).
Figure 5 leads us to conclude that ICE consistently results in better classification accuracy for any
sampling rate at any epoch. One might speculate that the impurity of the samples increases as epochs
evolve and samples are chosen repeatedly. However, in this case, such an effect of cumulative impurity
seems either nonexistent or very minor. With a sampling rate of about 15% or higher, the framework
results in better accuracy than that of Di and gets close to the accuracy of the decision tree built on the
cumulative dataset.
Table 1 collectively shows the measures of decision tree quality for a sampling rate of 15%. The
results for epoch 1 are dropped since ICE produces the same results as for the cumulative dataset. Each
number in the table is an average over 10 executions, and the numbers in parentheses are standard
deviations that show how the results vary over different runs on the same dataset. Other parameters
remain unchanged.
In terms of accuracy, it is obvious that ICE consistently results in decision trees with higher
accuracy than random sampling does. Note the standard deviations in the table, which reveal how the
results can vary over different runs. In terms of tree size, the table also shows that the tree generated by
ICE is more compact than that of random sampling, hence making it more comprehensible. The results
also support the cost model given in Section 4.3, by which ICE can build a decision tree very quickly.
With a more realistically large dataset, the speedup will be more evident. Overall, the table empirically
shows that random sampling is sensitive to which samples are chosen and hence nondeterministically
generates a decision tree. On the other hand, it shows that ICE can generate a decision tree of better
quality in every facet in a more consistent fashion. Note, however, that the results do not suggest that we always
Figure 5: Experimental results for the letter dataset with five epochs
Measure             Method      Epoch 2       Epoch 3       Epoch 4       Epoch 5
Accuracy (%)        Cumulative  75.1          78.8          81.1          83.4
                    Random      55.4 (0.99)   — (1.12)      63.4 (0.90)   — (0.99)
                    ICE         62.8 (1.09)   65.1 (—)      — (0.76)      69.1 (0.50)
Total number        Cumulative  1077          —             1621          —
of leaves           Random      278 (12.47)   370 (5.59)    — (12.15)     527 (17.92)
                    ICE         173 (5.59)    224 (4.34)    274 (—)       390 (6.24)
Execution time      Cumulative  3.87          6.02          8.08          10.37
(seconds)           Random      0.50          0.78          1.06          1.29
                    ICE         2.26          2.46          2.77          —
Table 1: The quality measures for the letter dataset at a 15% sampling rate
avoid using random sampling. Rather, they imply that we may apply both methods to a large dataset
very quickly and identify the method that better suits the application.
5.2 Experimental results with chess dataset
The chess dataset, obtained from [BKM98], contains 25,000 training records and 3,056 test records.
There are 18 class values and each record consists of 6 attributes, all of which are numeric (integers).
Each record represents the locations of three chessmen on the chessboard at some point in time. The
dataset is divided into 5 equal-size partitions (epochs), each with 5,000 training records.
The experimental circumstances are the same as those for the letter dataset, except that the sampling
rates are 5%, 10%, 20%, 30%, and 50%. Again, Figure 6 confirms the observations made in the previous
section. The other quality measures are not shown because the results are similar to those in Table 1
for the letter dataset.
5.3 Experimental results with synthetic datasets
In order to study the applicability of ICE to large datasets, we created synthetic datasets using the
data generator used in [AIS93, SAM96], which is available from the IBM Quest home page1. Each record
in the datasets has nine attributes, 3 of which are categorical. A description of the attributes is given in
Table 2. We generated training datasets of 500,000 records with 3 different numbers of classes: 2, 3, and
4. The class distributions for the 2-class, 3-class, and 4-class datasets are 0.55/0.45, 0.4/0.3/0.3, and
0.3/0.3/0.2/0.2, respectively. The number of records in the test datasets is 10,000. Each training dataset
is divided into 5 equal-size partitions with 100,000 records.
Different data distributions are generated by using distinct classification functions to assign class
labels to the records. Table 3 describes the predicate involved with each function in terms of the number
of attributes and the number of classes. Further details on these predicates can be found in [AIS93]. To
model fuzzy boundaries between classes, a perturbation factor for numeric attributes can be supplied to
the data generator [AIS93]. In our experiments, a perturbation factor of 5% was used, and 10% noise
was created in the datasets.
Figures 7 and 8 present the experimental results for the synthetic datasets with 2 classes, and with
3 and 4 classes, respectively. The dotted straight line (denoted EntireSet) at the top of each figure is
the accuracy of the brute force algorithm applied to the entire dataset. The sampling rate varies
1The URL for the page is http://www.almaden.ibm.com/cs/quest/demos.html.
Figure 6: Experimental results for the chess dataset with five epochs
Attribute    Description            Value
salary       Salary                 Uniformly distributed from 20000 to 150000
commission   Commission             If salary > 75000 then commission is zero,
                                    else uniformly distributed from 10000 to 75000
age          Age                    Uniformly distributed from 20 to 80
ed_level     Education level        Uniformly chosen from 0 to 4
car          Make of the car        Uniformly chosen from 1 to 20
zipcode      Zip code of the town   Uniformly chosen from 9 available zipcodes
value        Value of the house     Uniformly distributed from 0.5k·100000 to 1.5k·100000,
                                    where k ∈ {0, ..., 9} depends on zipcode
hyears       Years of house owned   Uniformly distributed from 1 to 30
loan         Total loan amount      Uniformly distributed from 0 to 500000
Table 2: Description of attributes in the synthetic datasets
Function    F1   F3   F5   F9   F10  F24  F31  F33  F36  F37  F41  F47
Attributes  2    3    3    2    2    3    2    3    3    5    2    5
Classes     2    2    2    2    2    2    3    3    3    3    4    4
Table 3: Description of the predicate functions used to generate the synthetic datasets
within 0.5%, 2.5%, and 5%, and the other parameters are similar to those applied to the letter and chess
datasets in the previous sections.
At each epoch, the first two bars represent the accuracy of ICE and random sampling, respectively,
for a sampling rate of 5%. Similarly, the next two pairs of bars show the results for sampling rates of
2.5% and 0.5%, for ICE and random sampling, respectively. The distance from the top of each bar to
the dotted straight line is the drift from the accuracy achieved on the entire dataset. Other quality
measures are omitted intentionally because they are very similar to those presented in Table 1.
In Figure 7, ICE is better than random sampling in most cases, especially at lower sampling rates
such as 0.5%. The results at a sampling rate of 5% show that ICE can achieve the accuracy of the brute
force method applied to the entire dataset. It seems that ICE produces consistent results at any
epoch regardless of sampling rate. On the other hand, random sampling shows signs of inconsistency
(for example, in Figures 7(a), 7(b), and 7(f)) at lower sampling rates, due to its sensitivity to the data
distribution. Such sensitivity will consequently result in decision trees of inferior quality.
In Figure 8, ICE again outperforms random sampling in most cases, especially at lower sampling
rates. ICE remains very consistent, while the results of random sampling seem quite unpredictable. ICE
can achieve accuracy close to the dotted line even at a very low sampling rate (0.5%). This suggests
that ICE is capable of handling large datasets while generating comparably accurate decision tree
classifiers.
Overall, the experiments suggest that ICE is a good choice for classifying large datasets incrementally,
say, with less than 1% of the dataset as samples. Since the cost at early epochs is relatively low, one
may run a classification algorithm on the entire dataset at that point. At later epochs, where the cost
becomes expensive, ICE can be used to incrementally classify large datasets.
Figure 7: Experimental results for synthetic datasets with 2 classes (predicate functions 1, 3, 5, 9, 10, and 24)
Figure 8: Experimental results for synthetic datasets with 3 classes and 4 classes (predicate functions 31, 33, 36, 37, 41, and 47)
6 Concluding remarks
Classification is an important, well-known problem in the field of data mining, and has remained an
extensive research topic within several research communities. Thanks to the advances in data collection
technologies and large-scale business enterprises, the datasets for data mining applications are usually
large and may involve several millions of records. In addition, each record typically consists of tens to
hundreds of attributes, some of them with a large number of distinct values. The enormity and complexity
of the data involved in these applications make the task of classification computationally very expensive.
In this paper, we propose a scalable framework named ICE for incrementally classifying ever-growing
large datasets. The framework has the following properties: temporal scalability, algorithm independence,
size scalability, minimal data access, inherent parallelism, suitability for distributed computing, and
flexibility regarding addition and deletion of partitions.
We provide a mathematical background for incremental classification based on weighted samples and
a few sampling techniques that extract weighted samples from a decision tree built on a small data
partition. The weighted samples from the previous partitions (i.e., epochs) are used together with the
current partition to build a new decision tree classifier. The issues of cost, performance, and quality of
the decision tree are discussed throughout the paper. We have conducted thorough experiments on two
real datasets as well as synthetic datasets, and the experimental results show that ICE outperforms, or
at least is comparable to, random sampling, and promises to be a basis of incremental classification for
large, ever-growing datasets where fast development of a decision tree classifier is needed.
Finally, we point out a few factors that must be dealt with to refine the ICE framework.
How to assign weights to categorical attributes.
The time at which a record was created can be considered in updating the decision tree classifier.
We need a new weight assignment method based on a time parameter.
New attributes that did not exist in the previous dataset, but that are added in the new incremental
dataset, must be handled properly. The effects of the new attributes on the updated classifier have
to be identified.
A quantitative measure of tree quality needs to be refined. Based on the measured quality, one
must be able to determine which method better suits the application.
The feature selection information from the current decision tree classifier can be used to optimize
the induction of the new decision tree classifier when a new partition is added.
The difference between the old dataset and the new incremental dataset needs to be measured.
One way of determining and/or measuring the difference is to compare the classifiers built for the
two datasets.
Using the tree-based sampling method to generate good samples for other, more expensive mining
methods such as neural networks.
References
[AIS93] Rakesh Agrawal, Tomasz Imielinski, and Arun Swami. Database Mining: A Performance
Perspective. IEEE Transactions on Knowledge and Data Engineering, 5(6):914-925, December
1993.
[Als98] Khaled Alsabti. Efficient Algorithms for Data Mining. Ph.D. Thesis, Syracuse University,
December 1998.
[ARS98] Khaled Alsabti, Sanjay Ranka, and Vineet Singh. CLOUDS: A Decision Tree Classifier for
Large Datasets. In Proc. of the 4th Int'l Conf. on Knowledge Discovery and Data Mining,
New York, 1998.
[BFOS84] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression
Trees. Wadsworth, Belmont, 1984.
[BKM98] C. Blake, E. Keogh, and C. J. Merz. UCI Repository of Machine Learning Databases
(http://www.ics.uci.edu/~mlearn/MLRepository.html). Irvine, CA: University of Cal-
ifornia, Department of Information and Computer Science, 1998.
[Cat91] J. Catlett. Megainduction: Machine Learning on Very Large Databases. PhD Thesis,
University of Sydney, 1991.
[CKS88] P. Cheeseman, James Kelly, Matthew Self, et al. AutoClass: A Bayesian Classification
System. In Proc. of the 5th International Conference on Machine Learning. Morgan
Kaufmann, June 1988.
[CS97] P. K. Chan and S. J. Stolfo. On the Accuracy of Meta-learning for Scalable Data Mining.
Journal of Intelligent Information Systems, 8:5-28, 1997.
[EKSX96] M. Ester, H. Kriegel, J. Sander, and X. Xu. A Density-based Algorithm for Discovering
Clusters in Large Spatial Databases with Noise. In Proc. of the 2nd Int'l Conf. on Knowledge
Discovery and Data Mining, August 1996.
[FI93] Usama Fayyad and Keki B. Irani. Multi-interval Discretization of Continuous-valued At-
tributes for Classification Learning. In Proc. of the 13th Int'l Joint Conference on Artificial
Intelligence, pp. 1022-1027, 1993.
[GGRL99] J. Gehrke, V. Ganti, R. Ramakrishnan, and W.-Y. Loh. BOAT: Optimistic Decision Tree
Construction. To appear in Proc. of 1999 SIGMOD Conference, Philadelphia, PA, 1999.
[GKR98] D. Gibson, J. Kleinberg, and P. Raghavan. Clustering Categorical Data: An Approach
Based on Dynamical Systems. In Proc. of VLDB Conference, New York, August 1998.
[Gol89] D. E. Goldberg. Genetic Algorithms in Search, Optimization and Machine Learning.
Addison-Wesley, 1989.
[GRG98] Johannes Gehrke, Raghu Ramakrishnan, and Venkatesh Ganti. RainForest: A Framework
for Fast Decision Tree Classification of Large Datasets. In Proc. of VLDB Conference,
pp. 416-427, New York, NY, August 1998.
[HMS66] E. B. Hunt, J. Marin, and P. J. Stone. Experiments in Induction. Academic Press, New
York, 1966.
[Jam85] M. James. Classification Algorithms. Wiley, 1985.
[KR90] L. Kaufman and P. Rousseeuw. Finding Groups in Data: An Introduction to Cluster
Analysis. John Wiley and Sons, 1990.
[LLS97] T.-S. Lim, W.-Y. Loh, and Y.-S. Shih. An Empirical Comparison of Decision Trees and
Other Classification Methods. TR 979, Department of Statistics, University of Wisconsin-
Madison, June 1997.
[MST94] D. Michie, D. J. Spiegelhalter, and C. C. Taylor. Machine Learning, Neural and Statistical
Classification. Ellis Horwood, 1994.
[QR89] J. R. Quinlan and R. L. Rivest. Inferring Decision Trees Using the Minimum Description
Length Principle. Information and Computation, 1989.
[Qui86] J. R. Quinlan. Induction of Decision Trees. Machine Learning, 1:81-106, 1986.
[Qui93] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[Rip96] B. D. Ripley. Pattern Recognition and Neural Networks. Cambridge University Press,
Cambridge, 1996.
[RS98] R. Rastogi and K. Shim. PUBLIC: A Decision Tree Classifier that Integrates Building and
Pruning. In Proc. of VLDB Conference, pp. 404-415, New York, NY, August 1998.
[SAM96] John Shafer, Rakesh Agrawal, and Manish Mehta. SPRINT: A Scalable Parallel Classifier
for Data Mining. In Proc. of VLDB Conference, Bombay, India, September 1996.
[Sch94] C. Schaffer. A Conservation Law for Generalization Performance. In Proc. of the 11th Int'l
Conf. on Machine Learning, 1994.
[SSHK98] A. Srivastava, V. Singh, E. Han, and V. Kumar. An Efficient, Scalable, Parallel Classifier
for Data Mining. Technical Report, University of Minnesota, 1997.
[UBC97] P. E. Utgoff, N. C. Berkman, and J. A. Clouse. Decision Tree Induction Based on Efficient
Tree Restructuring. Machine Learning, 1997.
[Utg89] P. E. Utgoff. Incremental Induction of Decision Trees. Machine Learning, 4:161-186, 1989.
[WK91] S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn: Classification and Pre-
diction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems. 1991.
[MFM+98] Y. Morimoto, T. Fukuda, H. Matsuzawa, T. Tokuyama, and K. Yoda. Algorithms for Mining
Association Rules for Binary Segmentations of Huge Categorical Databases. In Proc. of
VLDB Conference, New York, August 1998.
[ZRL96] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An Efficient Data Clustering Method
for Very Large Databases. In Proc. of ACM SIGMOD Int'l Conf. on Management of Data,
1996.