RANDOMIZED DECISION TREES FOR DATA MINING
By
VIDYAMANI PARKHE
A THESIS PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF SCIENCE
UNIVERSITY OF FLORIDA
2000
ACKNOWLEDGMENTS
I would like to express my sincere gratitude to Dr. Raj for his support and
encouragement throughout my graduate program at the University of Florida. It
was due to his motivation and advice that the challenges of initially a transfer
to CISE and then a master's program were never a hurdle. I am indebted to
Professors Sumi Helal and Manuel Bermudez for their undepletable attention and
their great help since the very first day at UF. I thank Professors Sartaj Sahni and
Joachim Hammer for being on my committee and for their suggestions and comments.
There are a few people to whom I am grateful for multiple reasons. Firstly,
my family back home in India, without whose support and trust nothing would
have been possible. Next, my closest ever friends, Amit, Latha, Sangi, Prashant,
Mahesh, Prateek, Subha and Hari amongst others, for being my family here. Without
their support, encouragement and wrath the journey would have been an impossible
one. Special thanks go to John Bowers, Nisi and Victoria for being there, always!
I cannot stop short of thanking Leo's, Chilis and Taco Bell for their extended
hours and exquisite food, which has now become an integral part of our lives.
TABLE OF CONTENTS

ACKNOWLEDGMENTS
ABSTRACT

CHAPTERS

1 INTRODUCTION
  1.1 Data Warehousing
  1.2 Data Mining
    1.2.1 Association Rules
    1.2.2 Clustering
    1.2.3 Sequential Patterns
    1.2.4 Classification
  1.3 Goal

2 RELATED WORK IN THE AREA OF CLASSIFICATION DATA MINING
  2.1 Gini Calculation
  2.2 SLIQ Classifier for Data Mining
    2.2.1 Tree Building
    2.2.2 Tree Pruning
  2.3 SPRINT - A Parallel Classifier
    2.3.1 The SPRINT Algorithm
    2.3.2 Speedup over SLIQ
    2.3.3 Exploiting Parallelism
  2.4 CLOUDS - A Large Dataset Classifier
    2.4.1 Dataset Sampling (DS)
    2.4.2 Sampling the Splitting Points (SS)
    2.4.3 Sampling the Splitting Points with Estimation (SSE)
  2.5 Incremental Learners

3 ALGORITHMS
  3.1 Sorting Is Evil
  3.2 Randomized Approach to Growing Decision Trees
    3.2.1 SSE Without Sorting
    3.2.2 Sampling a Large Number of Potential Split Points
    3.2.3 Improvised Storage Structure
    3.2.4 Better Splitpoints
    3.2.5 Accelerated Collection of Statistics
  3.3 Multilevel Sampling
  3.4 "Say No to Randomization!"
    3.4.1 Accelerated Collection of Statistics, the Reprise
    3.4.2 Statistics ... To Go!
  3.5 Incremental Decision Trees

4 IMPLEMENTATION AND RESULTS
  4.1 Implementation
    4.1.1 Datastructures
    4.1.2 Implementing SPRINT
    4.1.3 Implementing the Randomized Algorithms
    4.1.4 Iterative as Opposed to Recursive
    4.1.5 Alternative Data Sources
    4.1.6 Embedded Server for Runtime Statistics
    4.1.7 MyVector Class for Better Array Management
  4.2 Results
    4.2.1 Performance
    4.2.2 Accuracy

5 CONCLUSIONS AND FUTURE WORK

REFERENCES

BIOGRAPHICAL SKETCH
Abstract of Thesis Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Master of Science
RANDOMIZED DECISION TREES FOR DATA MINING
By
Vidyamani Parkhe
December 2000
Chairman: Sanguthevar Rajasekaran
Major Department: Computer and Information Science and Engineering
Classification data mining is used widely in the areas of retail analysis, disease
diagnosis and scam detection. Of late, classification data mining is also being applied
to the area of web development, web applications and analysis. The major challenges
to this new facet of classification are the enormous amount of data, data inconsistencies,
pressure for time and accuracy of prediction. The contemporary algorithms for
classification, which mainly use decision diagrams, are less useful in such a scenario.
The major impediments are the large amount of static time required in building a
model (decision diagram) for accurate prediction or decision making at runtime and
the lack of an efficient incremental algorithm. Randomized and sampling techniques
researched for the problem have been less accurate. The present work discusses
deterministic and randomized algorithms for classification data mining that are easily
parallelizable and have better performance. The algorithms employ novel methods,
like multiple levels of intelligent sampling and partitioning, to collect record
distributions in a database for faster evaluation of gini indexes. An incremental
algorithm, to absorb newly available datasets, is also discussed. A combination of
these characteristics, along with very high accuracy in decision making, makes these
algorithms well suited for data mining and more specifically web mining.
Key words: classification, data mining, randomized algorithms, decision
diagrams, incremental algorithms, gini index, web mining.
CHAPTER 1
INTRODUCTION
With the current trend in industry to learn from one's own mistakes and
those of others, setups from small retailers to large corporations are looking
towards the "Knowledge Discovery" paradigm to get a more comprehensive overview
of their own data. This has been further made possible by the increased performance
of data warehousing and data mining algorithms, falling costs of storage units and
increased processing speeds. The sections that follow illustrate techniques used to
store and mine data, in order to derive information from it that had never been
available or thought of heretofore.
1.1 Data Warehousing
With the excessively large number of customer transactions happening every second
in large supermarkets, internet sites, banks, insurance or phone companies, there is a
need to come up with alternative methods to store the historical data, so that there
is no loss of information. There could also be a need, at a later date, to draw out
the hidden knowledge from the data without a tangible loss of information. The
study of data warehousing comprises everything from the machine architecture and
the data store that is best suited to the specific form of data, to the algorithms
used to handle and store the data.
A data warehouse can be thought of as a collection of different sources of
data, put together, that have been cleaned and checked for inconsistencies prior to
merging. These data sources could be from different locations of a chain of
supermarkets, like Wal-Mart, or data recorded at the same location over time. The
need to remove inconsistencies in the data is dire; otherwise the information
drawn out after mining the entire data would be incorrect. There are various
algorithms used for the purpose of data source cleaning, data merging and storing
the data in a format from which the applications1 using the data can access it in the
most efficient manner. Such algorithms have been documented in the literature [1, 2].
Widom [2] also quotes a few issues in data warehousing designs that significantly
affect the applications using the data.
1 There could be various applications that could make use of the data store, like OLAP, data
visualization, data mining or even transactional ones for day-to-day transactions.
1.2 Data Mining
Data mining can be viewed as an application that resides over a data warehouse
and uses the data to search for certain unknown patterns. The patterns
could be in the form of rules or clusters or some classification, as described in the
following subsections. Data mining is different from OLAP,2 in that OLAP uses
query techniques to confirm results known in the past or by heuristics.
In contrast, data mining is indeed a search for the unknown, wherein the
entire data set is used to draw out some information about the data at large, rather
than the specifics of the data itself.
In the following subsections we will have a closer look at the various techniques
used for data mining.
2 Online Analytical Processing.
1.2.1 Association Rules
This form of data mining is used to relate two or more quantities together
that otherwise would apparently not have coexisted. An example would be that a
certain percentage of the people buying beer also buy diapers, and a certain
percentage of all the transactions happening contain both beer and diapers. This
can be stated in the form of a rule, Beer => Diapers. Such a rule is said to have a
certain confidence and a certain support. The task of Association Rule Mining is
to come up with a set of rules that satisfy a minimum confidence and support level.
One of the leading algorithms used for Association Rule Mining is the Apriori
Algorithm [3, 4, 5].
1.2.2 Clustering
The principle used in clustering is to group together (or cluster) the data
points that have a common characteristic. The whole idea is to partition the entire
data set into categories, depending upon some feature(s), such that the items in
one group or cluster are more similar to each other than to the ones in other
clusters. Clustering is used in various facets of knowledge discovery and learning,
like machine learning, pattern recognition, optimization, etc. A classic algorithm
for clustering starts off by designating k data points that act as centroids for the
k clusters, and proceeds by evaluating the nearest centroid for every datapoint (to
be clustered) and then re-evaluating the mean (centroid) of the datapoints in each
cluster.
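The classic procedure just described can be sketched in a few lines of Java. The
sketch below is illustrative only; it assumes one-dimensional points and a fixed
number of iterations, and all class and method names are assumptions made for the
example.

    // Minimal sketch of the clustering procedure described above (k-means style).
    import java.util.Arrays;
    import java.util.Random;

    public class SimpleKMeans {
        public static double[] cluster(double[] points, int k, int iterations) {
            Random rnd = new Random(42);
            double[] centroids = new double[k];
            for (int i = 0; i < k; i++)                   // designate k data points as initial centroids
                centroids[i] = points[rnd.nextInt(points.length)];
            int[] assignment = new int[points.length];
            for (int it = 0; it < iterations; it++) {
                for (int p = 0; p < points.length; p++) { // assign each point to its nearest centroid
                    int best = 0;
                    for (int c = 1; c < k; c++)
                        if (Math.abs(points[p] - centroids[c]) < Math.abs(points[p] - centroids[best]))
                            best = c;
                    assignment[p] = best;
                }
                double[] sum = new double[k];
                int[] count = new int[k];
                for (int p = 0; p < points.length; p++) { // re-evaluate centroid = mean of its cluster
                    sum[assignment[p]] += points[p];
                    count[assignment[p]]++;
                }
                for (int c = 0; c < k; c++)
                    if (count[c] > 0) centroids[c] = sum[c] / count[c];
            }
            return centroids;
        }

        public static void main(String[] args) {
            double[] data = {1.0, 1.2, 0.9, 5.0, 5.3, 4.8};
            System.out.println(Arrays.toString(cluster(data, 2, 10)));
        }
    }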
1.2.3 Sequential Patterns
Initially proposed by Agrawal and Srikant [6], this technique uses time (in
most cases) to detect a similarity (pattern) in the occurrence of events. It can be
used in various applications, like detecting patterns in which books are being read
by a set of library users, or detecting a chain-referral scam among medical
practitioners, or even more critical ones like disease diagnosis [3].
1.2.4 Classification
"One's ability to make the correct set of decisions while solving a certain
problem and doing so in the specific allotted time is the key factor in one's success,
time and again. This is true for any sort of problemsmay it be from a research
perspective or a political oneonly the variables and constants change."
Anoni inl. ';
Classification data mining is associated with those aspects of knowledge
discovery, wherein there is a need to categorize data points, associating them with
a certain classification or category, based upon the classification of a few known
data points. Here, the objective is to traverse the training3 data set and coming
up with a model that could be used for future classification of a test4 data set. The
technique of classification has been used since a very long time for the purpose of
machine learning [7], optimizations using neural networks, decision trees ... etc.
A decision tree or a decision diragram comprises a root node, that one uses
to make his first decision, while classifying a test record. A nonleaf node in the
decision diagrams represents the data represented by its subtrees, while a leaf node
represents data belonging to one class and satisfies the conditions of all its ancestor
nodes. The decision is binary, in most cases, in that it could be either true or false
which takes one to one of the subtrees of the root, for which the same procedure
could be recursively applied, until one reaches the leaf node, that determines the
classification for the record under consideration. The nonleaf nodes are decision
making nodes (mostly binary) and hold a condition like Age < 25, which would
lead on to two subtrees, the data in which ah,i satisfies the condition laid forth
by the common parent node, i.e. Age < 25.
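The traversal just described is easy to express in code. The following Java sketch
is illustrative only; the node fields and the Age < 25 style condition are assumptions
made for the example, not part of a specific algorithm from the literature.

    // Illustrative sketch of classifying a test record by walking a binary
    // decision diagram of the kind described above. All names are assumptions.
    import java.util.Map;

    class TreeNode {
        String splitAttribute;     // e.g. "Age" (non-leaf nodes only)
        double splitValue;         // e.g. 25, for the condition Age < 25
        TreeNode left, right;      // left subtree satisfies the condition, right falsifies it
        String classLabel;         // set only for leaf nodes

        boolean isLeaf() { return classLabel != null; }

        String classify(Map<String, Double> record) {
            TreeNode node = this;
            while (!node.isLeaf())  // follow the subtree matching the outcome of the condition
                node = record.get(node.splitAttribute) < node.splitValue ? node.left : node.right;
            return node.classLabel; // the leaf determines the classification
        }

        public static void main(String[] args) {
            TreeNode young = new TreeNode(); young.classLabel = "class1";
            TreeNode old = new TreeNode();   old.classLabel = "class2";
            TreeNode root = new TreeNode();
            root.splitAttribute = "Age"; root.splitValue = 25; root.left = young; root.right = old;
            System.out.println(root.classify(Map.of("Age", 30.0)));  // prints class2
        }
    }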
1.3 Goal
Most of the contemporary algorithms for growing decision trees discuss
cost-effective methods by which the size (height and spread) of the tree is optimized,
and so is the dynamic time required for decision making. The algorithms output
the most compact form of a tree for a given training set, but are cost-inefficient at
the tree-growing stage. The principle used in most of the algorithms is to come up
with the best split attribute for a given dataset, about which the entire dataset
can be categorized into two sections (for binary decision diagrams), such that
most of the records of a type belong to one of the subtrees. Now, for a given
dataset, it is very time consuming to come up with the very best split attribute
at every stage in the tree. Gini values are used (in most cases) to determine the
best split at a certain stage. The detailed explanation of gini values and their
calculation is a subject of the following chapters.
The approach, mentioned above, is acceptable and often used in situations
where the decision diagrams are made statically and then used for the purpose of
decision making at runtime. Also, if there is no need to update the classification
for a long time, an approach that gives the most succinct tree is required, as then,
the time required for making decisions at runtime would be largely reduced. But,
in situations where it is necessary to update the decision diagram very frequently,
it might be required to come up with an approach that does so, in a very short
interval of time. As discussed before, the most time-consuming task, in the tree
building stage, is to determine which is the best split attribute-value pair. If one
spends time in deciding over the very best split at every stage, undoubtedly the
most concise and compact form of the tree would be obtained, but this could be
heavily time consuming. Conversely, if the very best split is not determined the
trees tend to be wider and longer (increasing the time required to make decisions
at runtime, while using the tree).
If the decision diagrams are to be used for the purpose of making the most
critical decisions, ones that have a very high risk level, one might be more inclined to
use an algorithm that, although it takes a lot of static creation time, outputs the
most concise and compact embodiment of the life-critical data set, a query on
which would not take long to execute. On the other hand, an application that has
no critical hazard might be greatly helped by an inherently incremental algorithm,
which would help in assimilating and consuming the most recent data within the
decision diagram. Thus, the trade-offs between creation time and usage time are
determined by the application in mind and the ultimate usage.
Web applications, in most cases, are like the latter ones described above.
An example would be click-stream analysis, wherein the objective is to observe a
pattern from the web-clicks of various users to a set of webpages. The problem
can be stated as follows:
Imagine yourself to be the owner of a web-based commercial store that sells
books. There is a group of loyal customers that can be identified using their login
names and passwords. Every click made by every person, till date, has been
recorded. This comprises a range of customers, some that merely surf through the
webpages under your company's domain and buy nothing and others that are avid
buyers. Would it not be an interesting piece of knowledge to know who is buying
exactly what, and more specifically whether there is a pattern in the types of books
being bought by various customers over a specific range of time! It might prove to be
commercially advantageous to be able to predict the buying pattern of a set of
customers (or potential customers) depending upon the buying patterns of other
customers. But, with the amount of clicks being made on the webpages and the
increasing number of transactions happening every second, it might be impossible,
at runtime, to search for and identify the parallelism between a set of clicks of
one particular customer and another, in the past. An effective datastructure like
a decision diagram would certainly come in handy in such cases, where it is most
important to interest a customer more in what he would otherwise, anyway, have
been interested in, and thereby make business. In such a case, the problem is mostly
one sided; the presence of a decision-maker or a next-click predictor is not of
primary importance, but having one (a good one) would definitely help the
growth of the business.
Also, as in the case of the click-stream example above, an algorithm that is
incremental, in that it can incorporate fresh data into the decision-making
datastructure, would be of significant use, rather than having a static decision-maker
that reflects the choices and trends in the market from an earlier era. As in the
above case, it could prove advantageous to be able to make decisions based on
some clicks, results or transactions happening just the previous second!5
The current work concentrates on the issues mentioned above, using techniques
like randomized algorithms and sampling to achieve speedups, without loss
of accuracy. An attempt is also made at making the algorithms incremental,
so that any additions to the data sets can be reflected in the decision-maker (also
referred to as the learner). In the chapters that follow, the algorithms and the
implementation details are presented, giving details of the data structures used for
the purpose.
5 Though it might be difficult and highly cost-inefficient to try to accommodate data from a
transaction that occurred just a few minutes back.
CHAPTER 2
RELATED WORK IN THE AREA OF CLASSIFICATION DATA MINING
Decision diagrams and other classifiers like genetic algorithms, Bayesian classifiers and
neural networks have been used for a very long time for the purpose of simple
classification and decision support. Anahory and Murray [1] give a detailed analysis
of how one could use tools like decision diagrams for data mining, which can
be very effective in the case of decision support over a data warehouse. Since the
evolution of decision diagrams, a lot of algorithms, varying in time complexity
and the kind of data to be classified, have been devised for the purpose of
classification. Some of the famous algorithms developed include ID3 [8], C4.5 [7], SLIQ [9],
SPRINT [10], CLOUDS [11], and others. All these algorithms and other previous
work in the area of classification data mining have sought the best possible
way, given a dataset or a database of records, to provide the classification with
the most concise representation or datastructure. Some of the common forms of
representation used for the purpose of classification data mining are neural
networks, decision diagrams, etc. Another issue of primary importance is the time
required to pack the given dataset into the selected format of representation, in the
minimum possible time frame.
In all the algorithms to build decision diagrams mentioned above, a common
premise and one of the most important objectives is that the tree building
algorithm should be precise. The tree should be an exact representation of the
given training dataset. But, in the race to come up with a perfect tree, a lot of time
is spent in building the tree in the first place. These algorithms are cost effective
and employ techniques like parallel and simultaneous execution for a faster growth
of the tree.
2.1 Gini Calculation

Table 2.1: Sample dataset for gini calculation
Class: 1, 2, 1, 3, 2, 2, 1, 1, 1, 3, 2

Figure 2.1: Compact Tree
Figure 2.2: Skewed Tree
Consider a split that partitions S into S1 and S2. The gini value of the split can be
estimated using

    gini_split = (n1/n) * gini(S1) + (n2/n) * gini(S2)

where n1 and n2 are the number of data points in S1 and S2, respectively, and n
is the number of datapoints in S. Here gini(S) = 1 - sum_j (p_j)^2, where p_j is the
fraction of the records in S that belong to class j.
As can be seen, the calculation of the gini index is the most important
step in the node-splitting stage of a decision tree. Also, it can be trivially observed
that the process could be time consuming, since, to be able to calculate the gini of
one particular potential split value, all the records have to be considered in order
to obtain n1, n2 and all the p_j's for each of S1 and S2. Since it is of primary
importance to calculate the gini at all the potential points, viz., all the distinct
data points in the current dataset for each attribute, any algorithm that does the
gini calculations this way would necessarily take O(a*n^2) time, where n is the
number of records in the dataset and a is the number of attributes.
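For concreteness, the calculation above can be sketched in a few lines of Java; the
class and method names are illustrative, and the example split in main is one possible
two-way partition of the 11 class labels in Table 2.1.

    // Sketch of the gini calculation described above. classCounts[j] holds the
    // number of records of class j in a partition; giniSplit() is the
    // size-weighted combination over the two partitions S1 and S2.
    public class Gini {
        static double gini(int[] classCounts) {
            int n = 0;
            for (int c : classCounts) n += c;
            if (n == 0) return 0.0;
            double sumSquares = 0.0;
            for (int c : classCounts) {
                double p = (double) c / n;
                sumSquares += p * p;                  // accumulate sum_j p_j^2
            }
            return 1.0 - sumSquares;                  // gini(S) = 1 - sum_j p_j^2
        }

        static double giniSplit(int[] countsS1, int[] countsS2) {
            int n1 = 0, n2 = 0;
            for (int c : countsS1) n1 += c;
            for (int c : countsS2) n2 += c;
            double n = n1 + n2;
            return (n1 / n) * gini(countsS1) + (n2 / n) * gini(countsS2);
        }

        public static void main(String[] args) {
            // one possible split of the 11 records of Table 2.1:
            // {5 of class 1, 1 of class 2} against {3 of class 2, 2 of class 3}
            System.out.println(giniSplit(new int[]{5, 1, 0}, new int[]{0, 3, 2}));
        }
    }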
2.2 SLIQ Classifier for Data Mining
SLIQ was one of the first algorithms of its kind to introduce the concept of the gini
index to grow decision trees. The algorithm is divided into two phases, viz., Tree Building
and Tree Pruning, for building decision diagrams. In the following subsections,
these two stages are discussed.
2.2.1 Tree Building
This comprises two steps: i) evaluation of the splits for each attribute and selection
of the best split, and ii) creation of the partitions using the best split. This is done in
the following manner. First the given table is split into separate attribute lists, in which
the records are sorted according to that particular attribute but maintain a pointer to
the other attribute values of the same record, as can be seen below.
Algorithm EvaluateSplits()
    for each attribute A do
        traverse the attribute list of A
        for each value v in the attribute list do
            find the entry in the class list, and hence the class and the leaf node l
            update the histogram (statistics) in leaf l
            if A is a numeric attribute then
                compute the splitting index for the test (A < v) for leaf l
        if A is a categorical attribute then
            for each leaf of the tree do
                find the subset of A with the best split

Figure 2.3: Algorithm EvaluateSplits()
The second step creates the partitions of the tree, depending upon the best split
condition A < v. The records satisfying the condition are placed in the left subtree,
and the others in the right subtree.1
2.2.2 Tree Pruning
The tree is built using the entire training dataset. This could contain some
spurious "noisy" data, which could lead to an error in determining the class for the
test data. Those branches that could potentially be misleading at runtime for
class estimation are removed from the tree, using a pruning algorithm described in
Mehta et al. [12].
2.3 SPRINT - A Parallel Classifier
SPRINT was one of the pioneering algorithms for building decision diagrams
that are exact classifiers, as opposed to approximate classifiers, which compromise
on accuracy for a better time complexity. Some of the approximate
algorithms include C4.5 [7] and D2. The algorithm was designed in such a way
that it would be inherently parallel in nature, hence leaving further scope
for a speedup as compared to the contemporary algorithms.
1 This could vary depending upon the application using the decision tree. For a particular
application the condition could be modified to A <= v.
Parallelism can be exploited in the above algorithm by assigning each attribute list
to a separate processor for gini calculation, and then putting the results together to
estimate the best split at a node. The node splitting can also be done in parallel in
the following manner. Two processors enlist the subtree that each record should belong
to after the split, depending upon the split attribute value. The attribute list being
sorted, it is trivial to decide the cutoff boundaries for each processor and hence they
can work in parallel. Since there can be no record common to both, they can both
work on a common array in shared memory. This array can then be used to
split the other attribute lists depending upon the entry in the shared memory array.
Hence, splitting can be done in O(n) time using O(s) processors, where n is the
number of records in a node and s is the number of attributes, hence preserving
the total processor work at O(ns). This can be further extended to calculate the
total splitting time complexity at the tree growth stage. Since there can be at most
O(N) records across all the nodes at any level in the tree, the total splitting time
at any level is O(N), where N is the total number of records in the dataset.
Assuming a well distributed full tree, the total time complexity can be estimated
to be O(N log N) for an O(s)-processor parallel machine.
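A minimal sketch of this splitting step is given below, assuming a simple
(value, record id) attribute-list representation; it is written sequentially, but the
partitioning of the remaining attribute lists is exactly the part that a parallel
version would hand to separate processors. All class and method names are
illustrative.

    // Sketch of the node-splitting step described above. The split attribute's
    // list decides which child each record id belongs to; the resulting shared
    // array is then used to partition every other attribute list, which keeps
    // each of them in sorted order.
    import java.util.ArrayList;
    import java.util.List;

    class AttributeEntry {
        double value;
        int recordId;
        AttributeEntry(double value, int recordId) { this.value = value; this.recordId = recordId; }
    }

    public class NodeSplit {
        // childOf[recordId] = 0 (left) if the record satisfies value < splitValue, else 1 (right)
        static int[] buildAssignment(List<AttributeEntry> splitAttrList, double splitValue, int totalRecords) {
            int[] childOf = new int[totalRecords];
            for (AttributeEntry e : splitAttrList)
                childOf[e.recordId] = (e.value < splitValue) ? 0 : 1;
            return childOf;
        }

        // Partition any other attribute list using the shared assignment array.
        static List<List<AttributeEntry>> partition(List<AttributeEntry> attrList, int[] childOf) {
            List<AttributeEntry> left = new ArrayList<>();
            List<AttributeEntry> right = new ArrayList<>();
            for (AttributeEntry e : attrList)
                (childOf[e.recordId] == 0 ? left : right).add(e);
            List<List<AttributeEntry>> children = new ArrayList<>();
            children.add(left);
            children.add(right);
            return children;
        }
    }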
2.4 CLOUDS - A Large Dataset Classifier
CLOUDS2 was the first algorithm of its kind to use sampling for the purpose of
classification. The sampling step was followed by an estimation step to determine a
closer and better split attribute-value pair. The CLOUDS algorithm assumes the
following two properties of gini indexes for real datasets [11]:
Given a sorted dataset, the gini value generally increases or decreases
slowly. This implies that the number of good local minima is significantly less
than the size of the dataset, especially for the best split attribute.
The minimum gini value (potential split) for an attribute is significantly
lower than the gini values at the other datapoints along the same attribute, and
along other attributes too.
Using these two principles as guidelines, a couple of sampling techniques were
developed:
2 Classification of Large or OUt-of-core DataSets.
2.4.1 Dataset Sampling (DS)
In this algorithm, a random sample of the dataset is obtained, and the
direct method (DM)3 for classification is applied to it. In order to maintain the quality
of the classifier, the gini values are calculated using the entire dataset, but only at
the sampled datapoints.
2.4.2 Sampling the Splitting Points (SS)
Here, a quantiling technique is used to partition the attribute domain into q parts.
Gini values are calculated at each of the boundaries of the q-quantiles, and the
lowest is chosen for the split attribute. Hence, it is required to have prior knowledge
of the type and range of the attribute values (metadata).
2.4.3 Sampling the Splitting Points with Estimation (SSE)
The SSE technique uses SS to estimate the gini values at the boundaries
of the q-quantiles for each attribute of the dataset. Then, as in the case of SS,
the minimum, gini_min, is chosen, here for the purpose of determining the threshold
for the next (estimation) step, which determines the lowest gini value. Using the
boundary gini values, the lowest possible gini value attainable in each quantile,
gini_low, is estimated. Intervals that do not qualify against the threshold are
discarded, i.e., intervals such that gini_low > gini_min are eliminated. For the
surviving intervals, gini values are calculated at every data point to determine the
lowest possible gini value.
3 Something like SPRINT, wherein the gini at every attribute value is calculated for estimating
the best split.
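The interval-elimination step can be sketched as follows. How gini_low is estimated
inside an interval is the CLOUDS-specific part and is taken here as a precomputed
input (an assumption of the sketch), so only the thresholding against the best
boundary gini is shown.

    // Sketch of SSE interval elimination: gini values are evaluated at the
    // q-quantile boundaries, the minimum acts as a threshold, and intervals
    // whose estimated lower bound exceeds it are discarded before the
    // point-by-point search.
    import java.util.ArrayList;
    import java.util.List;

    public class SSEIntervals {
        // boundaryGini[i]   : exact gini at the i-th quantile boundary
        // intervalGiniLow[i]: estimated lowest gini attainable inside interval i
        static List<Integer> survivingIntervals(double[] boundaryGini, double[] intervalGiniLow) {
            double giniMin = Double.MAX_VALUE;
            for (double g : boundaryGini) giniMin = Math.min(giniMin, g);
            List<Integer> surviving = new ArrayList<>();
            for (int i = 0; i < intervalGiniLow.length; i++)
                if (intervalGiniLow[i] <= giniMin)   // keep only intervals that might beat the best boundary
                    surviving.add(i);
            return surviving;                        // these intervals are then scanned point by point
        }
    }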
CLOUDS uses both of the above ideas to classify the dataset using sampling
techniques. The sampling technique determines the size of the decision tree, while
the quantiling technique governs the accuracy rate and the time required at every stage.
2.5 Incremental Learners
The objective in having an incremental algorithm is that, with the existing
algorithms for building decision diagrams, for newly arriving data the entire
classifier (learner) would need to be destroyed and a new learner created using the
old and new data. Such a process would take a long time and would be repeated
frequently. An incremental algorithm is such that the time required corresponds
merely to the new data, rather than the total of new and old data. One of the
ways to achieve incrementality in the algorithm is to have some technique
to merge two learners together to obtain one learner that is a combination
of the two. Chan and Stolfo [13, 14] have suggested some methods for
merging trees together. The following are the two major techniques suggested.
Hypothesis boosting is a method in which a number of different algorithms
are used on the same dataset to generate various learners. Then, using a
meta-learner, all these various learners are combined. Thus, the properties of all
the different learner algorithms are present in the new learner.
Parallel learning is a technique in which a dataset is broken up into
various parts, on which the same algorithm is applied to obtain different parallel
learners, which can be combined together to obtain a learner for the whole
dataset.
The other techniques comprise a combination of these ideas.
The following chapters give the algorithms and the implementation details
of the randomized decision tree algorithms, along with performance statistics.
CHAPTER 3
ALGORITHMS
Having discussed, in the previous chapter, the previous work in the area of
classification data mining and specifically in the area of algorithms for decision trees,
this chapter deals with the algorithms devised for the purpose of building (growing)
randomized decision diagrams. The contemporary algorithms like SPRINT
and SLIQ aim at building the most concise and compact form of the tree for a
given dataset. But, as discussed before, this approach is extremely time
consuming. The present chapter discusses a few randomized algorithms that possibly
could have the same time complexity, but are estimated to run faster, without a
loss in accuracy in the outputted learner. In certain cases, the height and width of
the tree are greater than in the SPRINT/SLIQ version of the tree for the same dataset.
The following sections give the drawbacks of the above contemporary
algorithms that render them less useful for rapidly changing, enormous amounts of data
or for applications where the data could become outdated very early.
3.1 Sorting Is Evil
One of the most important characteristics of web-based applications is that
the data is changing on a continuous basis and very little down time is permissible,
if any. In an application like click-stream analysis it could be required, in most
cases, to absorb and reflect a newly available dataset into the learner.
In such cases, the decision tree should not be made from scratch but should be
an addition to the already existing one. Hence, if it is required to sort the entire
dataset for each attribute, the operation will be extremely costly. Thus, the sorting
operation that needs to be done (though only once) at the root node should be
avoided as far as possible. If sorting cannot be avoided, then the number of records
that have to be sorted should be reduced drastically.
Inspired by the SS approach suggested in Alsabti et al. [11], the following
algorithms suggest ways in which one can come up with decision diagrams without
sorting the entire dataset.
3.2 Randomized Approach to Growing Decision Trees
Motwani and Raghavan [15], Horowitz et al. [16], Cormen et al. [17] and
others suggest algorithms in which randomized approaches help in reducing the
time complexity of an algorithm, without significant loss of accuracy and in most
cases with very high accuracy.
One disadvantage with using randomized algorithms, as suggested before, is
that though one does not lose out on accuracy, the resulting trees could be wider
and longer, resulting in greater time to make a decision using the learner. If
the tree growth process is not controlled, the trees could end up being skewed,
increasing the time complexity of the decision-making algorithms.
In applications where accuracy is of extreme importance, examples being
high-risk or life-critical applications, it might not be feasible to
use such algorithms. Examples are disease diagnosis or a learner that
differentiates a poisonous mushroom from a non-poisonous one. But even in such cases,
if the learner assures full accuracy at the cost of higher search/decision time,
randomized approaches could prove to be useful.
In the subsections that follow, randomized algorithms and their modifications
for building decision diagrams are suggested.
3.2.1 SSE Without Sorting
Sorting the attribute lists is the most time consuming task in the calculation of
gini values before a node can be split. The attribute lists have to be sorted
separately, as there can be no correlation between the order of any two attributes in
a dataset, the reason being that given a dataset with n attributes, (n - 1) of
them are independent attributes while one is a dependent attribute, referred to as
the class attribute.
The algorithm stated below would work well in one of the following
scenarios:
There are one or more attributes that are partially dependent on one or
more other attributes, in that their values/order can be predicted based upon the
values/order of other attributes or a combination thereof.
The application that uses the decision tree can tolerate faulty results
some of the time. This is possible if the application uses the decision diagrams to
predict the behavior of a non-life-threatening entity. It could also be of use in
scenarios where the result is required urgently; a faulty one would not do harm
to the application, but a timely procurement of a healthy result would certainly
help.
One has a certain amount of prior knowledge of the dataset, in that one
can, after looking at a few datapoints, make a fairly good guess of the nature of
the neighboring points. An example would be of a dataset generated at a weather
station. Looking at the temperatures of a few datapoints, one can definitely make
calculated guesses about the neighboring points (at least, one is sure that the night
temperature is lower than the day temperature).
The algorithm proceeds by sampling a certain percentage of records and
initially working with them. The gini values at these points are evaluated. Since
we will have a constant number of sampled points, the complexity would necessarily
be O(n), where n is the total number of records in the dataset.
Here the gini values for the sampled records are calculated using the entire
dataset, and hence these gini values are exact as opposed to approximate. This
can be achieved in one of the following ways:
Since we have only a constant number, s, of sampled records, one can
obtain the statistics required for the gini calculation by merely comparing every
record in the dataset with every record in the sampled set (the records for which the
gini is to be evaluated). This would require O(ns) time or, if only a constant
number of records are sampled, O(n) time.
If the number of records sampled is large, it could be costly to compare
every one of the sampled records with the ones in the dataset. Here, we sort just
the sampled records in O(s log s) time and then use the above process, in such a
way that, if a record X lies ahead, in order, of another record Y for an attribute
z in the sampled set, then one can assume that for a record M in the dataset, if
M.z < X.z, then M.z < Y.z is also true. Thus, using techniques like binary search
or searching the array in the reverse order can help reduce the time required to
determine the statistics.
Using the gini values for the sampled datapoints, the surviving intervals are
selected as in the case of SSE. Here, unlike SSE, since the sampled points have
not been picked from a presorted dataset, one cannot guarantee the location of
the ultimate minima. But, with a certain prior knowledge about the data, like of the
type mentioned above, it could be possible to figure out an approximate position of
a local minimum in an interval using techniques like binary search, to reduce the time
spent in carrying out the search. As explained above, such a method would not
yield the best of results and the trees could be larger (longer and wider), but in cases
as discussed above, it could be worth having an algorithm that builds a larger tree
in a shorter time frame.
3.2.2 Sampling a Large Number of Potential Split Points
In most of the contemporary databases, one does have prior knowledge about
the data itself, in the form of metadata (or data about data). One does know
the domain of possible attribute values a particular attribute could have. It would
prove advantageous to exploit this knowledge so that one can build a classifier
much faster. Now, note that a classifier is a datastructure such
that at every level one makes a decision, wherein one selects a path one would
traverse, depending upon a certain attribute value. The deciding factor is an
attribute and the threshold value that determines whether to search (or continue
traversal) in the left or right subtree. This threshold value is such that one gets
the best possible tree, in that the decision be made as soon as possible, with no
requirement that the value must exist in the training dataset (the dataset required
to grow the decision diagram). The datapoint selection can be done in one of the two
ways explained below.
If one has information about the dataset and the range of values each
attribute could have, then the sampled datapoints could be synthetically generated,
so that they lie in the range covering all the possible values one can find in the
dataset. Then, one could use the same method used in the algorithm described
above to obtain a set of gini values for the selected datapoints. Further techniques,
like searching for a lower interval within the surviving intervals, could also be exploited
to zero in on the lowest (best) possible gini value, hence determining the best split.
Another technique one could use is to sample a few records and use only
those as potential splits. This technique could be useful in cases where
the dataset contains a lot of repeated datapoints. In such cases, if a good
sampling technique is used, one can expect the best split point to be sampled for gini
calculation.
Depending upon the dataset, one or more of the above methods could be
used for sampling. If the range of possible values for an attribute is small and
discrete, then it could prove to be advantageous to synthetically generate a large
number of potential split points for that attribute. If the attribute values are
continuous, then one could use the method of sampling a percentage of the records for
further calculation. Thus, depending upon the type of attribute, one could change
the strategy being used for sampling a smaller set of potential split points.
3.2.3 Improvised Storage Structure
In SPRINT and the algorithms discussed so far, the dataset is converted to
an intermediate representation, wherein the attributes are split into various
attribute lists that can be individually sorted. To preserve the records, the class list
is created, having incoming pointers from the individual attribute lists, and it stores
the node that every record belongs to at any stage in the algorithm. The advantage
in having separate lists is that one can bring just one list at a time into memory
and process it in isolation (detached from the other parts of the record). But, with
the algorithms stated above, this could imply a large number of comparisons and
memory swap-ins and swap-outs.
Thus, if one can have the whole records stored in memory before the
comparison stage, all the comparisons required with a record from the dataset could
be done at one time. Thus, it could be very convenient to compare the zth attribute
of the sth record from the sampled set and the nth record from the original dataset,
for each z. A 3-dimensional array could be one such implementation.
Figure 3.1 suggests a method in which one can store the distribution statistics
for each record. The three dimensions are records (or record IDs), attributes
and classes. The algorithm to populate the 3-dimensional structure is shown in
Figure 3.2 below.
Figure 3.1: 3-dimensional array, with one dimension each for records, attributes and classes
Algorithm Populate3dArray()
    Let the original dataset, N, contain n records
    Sample s records from N to form the sampled set, S
    for each record n from N belonging to the current node do
        for each record s from S do
            for each attribute z do
                if n.z < s.z then
                    let k be the class of record n
                    increment the kth class position of s

Figure 3.2: Algorithm Populate3dArray()
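A direct Java transcription of the pseudocode in Figure 3.2 might look as follows;
the record layout (a two-dimensional value array plus a class array) is an assumption
made for the sketch.

    // Transcription of Populate3dArray: data[n][z] holds attribute z of record n
    // and classOf[n] its class. stats[s][z][k] counts the records of class k whose
    // attribute z value is less than that of sampled record s.
    public class Populate3dArray {
        static int[][][] populate(double[][] data, int[] classOf, int[] sampledIds, int numClasses) {
            int numAttributes = data[0].length;
            int[][][] stats = new int[sampledIds.length][numAttributes][numClasses];
            for (int n = 0; n < data.length; n++) {            // every record in the current node
                int k = classOf[n];                            // class of record n
                for (int s = 0; s < sampledIds.length; s++) {  // every sampled record
                    for (int z = 0; z < numAttributes; z++) {  // every attribute
                        if (data[n][z] < data[sampledIds[s]][z])
                            stats[s][z][k]++;                  // increment the kth class position
                    }
                }
            }
            return stats;
        }
    }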
Using the statistics from the 3-dimensional array, one can calculate the
gini indexes for each sampled record s. The advantage of having such a
storage structure is that one can then pass on the information, if need be, from
one stage to the next, more specifically from the parent node to one or each of
its children. This reduces the time complexity by obviating the need to collect
the statistics every time. But, this is not possible in the algorithm as is. Another
issue is that, while sampling, the best split attribute-value might not have been sampled (and
hence could not be used to decide the best split). Using the present algorithm can help reduce
the severity of this problem. As an example, in Figures 3.1 and 3.2, assume that
the best split is obtained at attribute value 12.1, which was not sampled. Using
the technique of the thirds, a closer value, viz., 12, was obtained, which could at
times prove to be better than 12.1 itself, in that the resultant tree could be
smaller.
Using the method of the thirds, a closer sampling interval is obtained for a
better granularity. This can aid in zeroing in on the best split point, or the near-best
split point, for algorithms like the ones described in [11] or a modification thereof,
as suggested in Section 3.2.1.
In the present algorithm and the other randomized algorithms described above,
the sampling technique reduces the number of gini calculations being performed,
hence reducing the time required at every stage in the algorithm. Yet the major
bottleneck, viz., the collection of the statistics of class distributions for gini calculations,
remains cost inefficient. The following sections address this issue.
3.2.5 Accelerated Collection of Statistics
In most of the algorithms used for building decision trees, every record is
compared with every other record for the collection of the statistics used in gini
calculations, with SPRINT as an exception. In the above randomized algorithms also,
every record s from the sample set S is compared with every record n from the
original dataset N. Since the number of records in S is near constant, the
complexity of the overall comparison is O(n), nearly linear. But in scenarios where
a higher sampling rate is required, the performance would deteriorate. The following
approach helps in reducing the number of comparisons.
Algorithm FastStats()
    Let the original dataset, N, contain n records
    Sample s records from N to form the sampled set, S
    for each attribute z do
        sort the s records according to the zth attribute
        insert the zth attribute values (only) of the sorted records into the 3-d array
    for each record n from N belonging to the current node do
        let k be the classification of n
        for each attribute z do
            use BinarySearch() to find the first record in S such that s.z < n.z; let it be q
            increment the kth cell contents in the class dimension for q

Figure 3.3: Algorithm FastStats()
The counts collected in this way are turned into cumulative class distributions by a
prefix computation along the sorted samples; such arithmetic operations can be done
faster than record comparisons on any standard machine. Also, due to its inherent
nature, the prefix computation algorithm is parallelizable. Horowitz et
al. [16] cite parallel algorithms for the computation of prefixes.
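The following Java sketch shows the idea for a single attribute: each dataset record
is placed against the sorted sample values with a binary search, and a prefix
computation then yields the cumulative class distributions. The placement convention
used here (count a record at the first sampled value not smaller than its own) is an
assumption made for the sketch.

    // Sketch of the FastStats idea for one attribute: binary-search placement
    // followed by a prefix computation over the class dimension.
    import java.util.Arrays;

    public class FastStats {
        static int[][] collect(double[] values, int[] classOf, double[] sortedSampleValues, int numClasses) {
            int s = sortedSampleValues.length;
            int[][] stats = new int[s][numClasses];
            for (int n = 0; n < values.length; n++) {
                int pos = Arrays.binarySearch(sortedSampleValues, values[n]);
                if (pos < 0) pos = -pos - 1;               // first sampled value >= values[n]
                if (pos < s) stats[pos][classOf[n]]++;     // records beyond the largest sample are ignored here
            }
            for (int i = 1; i < s; i++)                    // prefix computation along the sample dimension
                for (int k = 0; k < numClasses; k++)
                    stats[i][k] += stats[i - 1][k];
            return stats;                                  // stats[i][k] = #records of class k with value <= sample i
        }
    }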
Putting it together, the techniques of sampling potential split datapoints,
the 3-dimensional storage structure for the sampled records and the class
distributions, and the accelerated collection of statistics for gini calculations can help in
achieving a very high speedup, at no loss of accuracy.
3.3 Multilevel Sampling
Unlike traditional databases, wherein no two records are identical (ideally),
in the case of web data there could be a lot of duplicate records. In fact, in many
cases the data could even be contradictory. Records could be contradictory in a
scenario whereby, given a data set with n attributes, (n - 1) of them being
independent attributes and one the dependent attribute or the classification of the record,
there exist two records A and B in the dataset such that all the (n - 1) independent
attribute values of A and B match, but the dependent attribute does not.
Thus, while generating the decision tree, either A or B or both would have to be
eliminated. This fact could be made use of while sampling the dataset, such that
a small number of records are used for classification.
The techniques used before ensure that the datapoints at which gini
values are calculated are a good sample of the dataset, such that the gini values
are not calculated for duplicates. This improves the performance at the cost of
the size of the tree, but does not affect the accuracy. The following algorithm is
approximate, in that the outputted learner could produce inaccurate results for a few
cases.
The algorithm proceeds by drawing out a random sample from the dataset.
The percentage of sampling can vary according to the degree of inaccuracy
tolerated. Using these sampled records and datapoints, one can build a learner
using any algorithm stated above. It can be argued that, due to the nature of
web data, the classifier would be fairly accurate for a good sample of the data.
The following chapter comments on the accuracy of the algorithm that uses
two levels of sampling for building a classifier. The accuracy can be further
improved by iterating through the process a fixed number of times using an
incremental algorithm, as described in Section 3.5.
3.4 "Say No to Randomization!"
Randomized algorithms for building decision diagrams can prove to be most
beneficial to applications that would only benefit from a classification tool. Also,
they can be very effective in applications where the tree needs to be reconstructed
over and over again, frequently, over a small interval of time. In applications where
data generated from web-clicks could be misleading and inconsistent to begin
with, the classifier would only be as good as the data itself. Hence, in such cases,
using the two-level sampling technique to reduce the time required to build the
tree, and an incremental algorithm, discussed in Section 3.5, to better the classifier, would
be the best solution. But randomized algorithms do have a few disadvantages and
can be unacceptable in a few situations.
To list some of the disadvantages of the randomized algorithms for building
classifiers:
The method of selectively calculating the gini indexes of a few sampled
datapoints ensures that the time complexity of the node split action is near
O(n log n). But it does not ensure that at every stage the best split attribute
would be exploited. As a result, the trees could be much wider and longer than those of a
traditional top-down algorithm that selects the best gini value at every node split,
e.g., SPRINT.
In the case of two-level sampling, the first level of sampling, if not sufficient, could
lose out on some nontrivial datapoints, leading to a larger inaccuracy rate for
the classifier at large. Thus, there is a trade-off between accuracy and the time spent
to build the decision tree.
Applications that have a high risk factor and are life threatening could benefit
little from such techniques, for the following reasons:
The classifier for such applications would be expected to have a very high
accuracy rate, in the absence of which the application would produce faulty results.
It could also be required to have a compact tree for the purpose of
classification, so that the runtime to query the tree is reduced. With longer trees the
application would not be as beneficial.2
Inspired by the techniques and datastructures used in the randomized
algorithms, the following algorithms use the complete dataset for the purpose of
building decision diagrams, without randomization or sampling. The trees generated
using the following algorithms are the most compact possible, because at
every stage the best split value is selected.
2 This is theoretically true, although, as can be observed in the chapter that follows, the
lengths of the trees formed using a randomized algorithm are comparable with the most compact
representation, and hence the runtimes are also comparable.
3.4.1 Accelerated Collection of Statistics, the Reprise
This algorithm follows from the randomized version of the FastStats algorithm. As
described before, here the 3-dimensional array (storage structure) is used to store
the class distributions prior to calculating the gini indexes. An algorithm similar
to the one described before is used for the collection of statistics. Here,
the records are not sampled. The entire dataset, in sorted order, is stored in the
3-dimensional structure. The complexity of the algorithm is hence O(n log n), the
time required to sort the entire dataset. But this needs to be done just once,
for the entire dataset. Unlike the randomized methods, since all the records
have been sorted once, one does not need to sort them again at every node.
An additional O(n log n) time is required at every node for the collection of statistics
and populating the class dimension. This is done using binary search, as before,
and then a prefix computation is performed to obtain the class-distribution
statistics. Since the operations happening at every stage are mostly mathematical,
one can expect a speedup over an algorithm that collects statistics by record
comparisons.
Yet, since the node can split at any attribute value, the statistics cannot
be carried forward from a parent node to any of its children. The reason is that
the statistics give the class distribution of the records belonging to that node that are
less than or equal to the current record's attribute value, and the split point is not
known in advance. The following algorithm stores the statistics in such a format
that they can be passed over to one of the children nodes.
3.4.2 Statistics ... To Go!
SPRINT can have a very high speedup when parallelized. One interpretation
of a parallelized version of SPRINT would be assigning every attribute list
to a processor that calculates the statistics for that attribute list and does the
gini calculation. That is, the dataset is vertically fragmentable, to be processed
in parallel. But, for every attribute list, the statistics are incremented linearly,
and hence it would be nontrivial to parallelize over the dataset horizontally as well.
The present algorithm stores the statistics in such a way that their collection can be
parallelized horizontally and vertically over the dataset. Also, they can be passed
over to the child node without loss of content.
The algorithm, for the sake of simplicity, assumes that all the values in an
attribute list are unique. This assumption does not hurt the sequential version of
the algorithm, but for the parallel version an extra (compensating) step would need
to be done to take care of duplicate elements. The algorithm proceeds as follows.
Sort each attribute list individually, according to the attribute values. Scanning
the records sequentially, for every record j, the kth location in the class
dimension is initialized to one, where k is the classification for that record. These
are the preliminary statistics for the attribute list. Using these preliminary statistics,
a prefix computation is done over all the records to get the actual statistics, or
class distribution.
Since the preliminary statistics are merely class occurrences of the elements in
the attribute list, when a node is split these can certainly be passed over to
one of the children nodes. At the child node, the algorithm can use the same
preliminary statistics to obtain the actual statistics (class distributions), using prefix
computation.
For the parallel version of the algorithm, the prefix computation can be done in
parallel, using algorithms described in Horowitz et al. [16].
Figures 3.4 and 3.5 depict the preliminary statistics and actual statistics for an
attribute with no duplicates. Figure 3.6 shows the preliminary statistics being
carried forward after the node split.
Figure 3.4: Preliminary statistics for an attribute list, no duplicates

    value  1  3  4  5  8  9
    A      1  0  0  1  1  0
    B      0  0  1  0  0  0
    C      0  1  0  0  0  1

Figure 3.5: Actual statistics for an attribute list, no duplicates

    value  1  3  4  5  8  9
    A      1  1  1  2  3  3
    B      0  0  1  1  1  1
    C      0  1  1  1  1  2

Figure 3.6: Preliminary statistics of the children after the node split
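For one attribute list, the preliminary-to-actual conversion of Figures 3.4 and 3.5
can be sketched as follows; the array layout is an assumption, and a child node
would simply call actual() again on the preliminary rows of the records it receives.

    // Sketch of the "statistics to go" idea for one sorted attribute list:
    // preliminary statistics are a one-hot class entry per position (Figure 3.4);
    // a prefix computation turns them into the actual distributions (Figure 3.5).
    public class StatisticsToGo {
        // classOfSorted[i] = class index of the i-th record in the sorted attribute list
        static int[][] preliminary(int[] classOfSorted, int numClasses) {
            int[][] prelim = new int[classOfSorted.length][numClasses];
            for (int i = 0; i < classOfSorted.length; i++)
                prelim[i][classOfSorted[i]] = 1;            // one-hot class occurrence per position
            return prelim;
        }

        static int[][] actual(int[][] prelim) {
            int[][] act = new int[prelim.length][];
            act[0] = prelim[0].clone();
            for (int i = 1; i < prelim.length; i++) {       // prefix computation gives class distributions
                act[i] = act[i - 1].clone();
                for (int k = 0; k < prelim[i].length; k++) act[i][k] += prelim[i][k];
            }
            return act;
        }
    }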
To take care of duplicate attribute values, one of the following two methods
could be used:
While generating the preliminary statistics, each entry is treated as unique,
and the preliminary statistics are collected as before. Then, at the stage
of prefix computation, the normal procedure is performed, but for every unique
attribute value x a pointer is maintained to the first occurrence of x. As soon as
the attribute value changes, an equalizer function is applied to all the occurrences of
x, which necessarily lie between the first and last occurrence thereof.3 Then one can
proceed to a new value of x and repeat the process to obtain the actual statistics.
Another method to solve the duplicate attribute-values problem is to have
an extra valid bit in the 3-dimensional array for each entry in the attribute-record
plane, i.e., each attribute value in the attribute list. During the process of prefix
computation, the valid bit for only the last occurrence of every attribute value x
is enabled; the other (prior) occurrences are disabled. Only the enabled or valid
attribute values are considered for gini calculation.
3 Attribute lists are always maintained sorted.
However, while splitting the nodes, since only the preliminary statistics are
passed over, the procedure remains unchanged.
The time complexity of the algorithm is necessarily O(n log n), but because the
work at each level in the tree consists of purely mathematical computations, the
algorithm can be expected to have better performance as compared to SPRINT. Also,
the duplicate elimination technique mentioned above reduces the number of gini
calculations being performed, yet produces the most compact form of the tree.
3.5 Incremental Decision Trees
As discussed before, the need for an incremental algorithm is dire in applications
where new data is being generated at a high rate and it is essential to use
it in the process of decision making. In such a scenario, rebuilding a tree periodically
could be a solution, but it has certain drawbacks. If the most compact
form of the tree is required, one that is completely accurate (with respect to the
training dataset), the tree building algorithm could be time consuming. In that case,
there would be intervals of time wherein the tree either would reflect only the old data
(the new data being uncommitted to the decision-maker) or would itself be unavailable
for decision making. To cope with the dire requirement for an incremental algorithm
that is continually available and is always accurate, the following algorithm could be
used.
Consider a decision tree, T, having m levels and representing n records. The
tree T is similar to the ones described before, barring that the leaf nodes hold
pointers to the records that they represent. Let A be a new record that has to be
inserted into the tree. The classification of A is c. To insert A into T, the tree is
traversed starting at the root node, along the path depending upon the attribute
values of A. At every stage, there could be one of two cases:
A lands at a non-leaf node with the split condition, attribute j < x. If
A.j < x, traverse the left subtree, else the right subtree, subject to the convention that the
left subtree satisfies the condition and the right subtree falsifies it.
A lands at a leaf node L, symbolizing class C. In this case, there could
be two possibilities:
o The class of A, i.e., c, conforms with the class of the node, viz. C. In this
case, the record is dumped into the pool of records embodied by L.
o The class of A, i.e., c, does not conform with the class of the node, viz.
C. In this case, the entire pool of records represented by L, together with A, needs
to be put together in the form of a tree. Any algorithm could be used at
this stage to build a tree using the already-existing records of the node
and A. The root of this new tree replaces L. In this case, the height
of the tree could possibly increase by one.
The resultant tree represents n + 1 records, and has a height that satisfies
m <= height <= m + 1.
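A sketch of this insertion procedure is given below. The node fields and the
rebuild() routine are assumptions; Chapter 4 uses a SPRINT object for rebuilding a
leaf's pool, so rebuild() is left as a placeholder here.

    // Sketch of the incremental insertion just described. Leaves keep pointers
    // to their records; a class disagreement at a leaf triggers a local rebuild
    // of just that pool of records.
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    class Record {
        Map<String, Double> attributes;
        String classLabel;
        Record(Map<String, Double> a, String c) { attributes = a; classLabel = c; }
    }

    class DTNode {
        String splitAttribute;
        double splitValue;
        DTNode left, right;
        String classLabel;                                // non-null only for leaf nodes
        List<Record> pool = new ArrayList<>();            // records represented by a leaf

        boolean isLeaf() { return classLabel != null; }
    }

    public class IncrementalTree {
        // Insert record a into tree T; returns the (possibly new) root.
        static DTNode insert(DTNode root, Record a) {
            DTNode node = root, parent = null;
            boolean wentLeft = false;
            while (!node.isLeaf()) {                      // follow the split conditions downward
                parent = node;
                wentLeft = a.attributes.get(node.splitAttribute) < node.splitValue;
                node = wentLeft ? node.left : node.right;
            }
            node.pool.add(a);
            if (node.classLabel.equals(a.classLabel))     // classes agree: just pool the record
                return root;
            DTNode rebuilt = rebuild(node.pool);          // classes disagree: rebuild this small pool
            if (parent == null) return rebuilt;           // the leaf was the root itself
            if (wentLeft) parent.left = rebuilt; else parent.right = rebuilt;
            return root;
        }

        static DTNode rebuild(List<Record> pool) {
            // placeholder: any exact tree builder (e.g. a SPRINT object, as in Chapter 4) over the pool
            throw new UnsupportedOperationException("plug in an exact classifier here");
        }
    }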
The above approach will serve as an incremental algorithm, but, as the
number of records in the classifier increase, could prove to the highly inefficient.
This is because, the tree increases in height at the leaf level, only, maintaining the
same root node and other nonleaf nodes. Thus, once a node becomes a nonleaf
node, it would remain there permanently. Thus, as an alternative, the entire data
set could be used to rebuild a new tree T' when the number of records in the
dataset, represented by T reaches 150'. of its original value, or the record count
crosses 1.5n. One could also maintain the old tree T until T' has been created
from the records held in the leaf nodes of T. It can be argued that such an approach
leads to the most compact tree structure most of the time, while the tree predicts
accurate results at all times.
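The rebuild policy can likewise be sketched as below, reusing the TreeNode outline
above; the field names and the rebuildFromLeaves(...) helper are assumptions
made for illustration and are not part of the implemented system.

    class RebuildPolicy {
        TreeNode root;
        int recordCount;        // records currently represented by the tree
        int countAtLastBuild;   // n at the time the tree was last (re)built

        void afterInsert() {
            recordCount++;
            if (recordCount >= 1.5 * countAtLastBuild) {
                // build T' from the records held in T's leaves; T keeps answering
                // queries until T' is ready, and only then is it swapped out
                root = rebuildFromLeaves(root);
                countAtLastBuild = recordCount;
            }
        }

        // assumed helper, e.g. a full (exact) rebuild such as a SPRINT run
        TreeNode rebuildFromLeaves(TreeNode t) { return t; }
    }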
A few of the algorithms described in this chapter have been implemented. The
implementation details and results are the subject of the next chapter.
CHAPTER 4
IMPLEMENTATION AND RESULTS
A few of the algorithms described in the previous chapter have been implemented.
This chapter provides the implementation details and the performance results.
SPRINT is taken to be the benchmark for comparison.
4.1 Implementation
The present section describes the author's experience implementing the
algorithms. Java is the language of implementation and the data sources are
simple flat files. In the subsections that follow, methods are suggested for the
use of alternative data sources. The datastructures, techniques and tools used for
faster execution of the algorithms are discussed below.
4.1.1 Datastructures
The datastructures used for the implementation of the algorithms have been
defined in terms of generic, reusable Java classes. A few of the native Java classes
have been used in some cases, without significantly affecting performance. The
MyVector class, which has better performance than Java's Vector class, has been
defined to replace arrays in the algorithms. The structure and implementation
details of the MyVector class are the subject of section 4.1.7.
4.1.2 Implementing SPRINT
SPRINT is used as a benchmark of both performance and accuracy: the
exact randomized algorithms have been compared with SPRINT to test for
performance, and the approximate ones for accuracy. The comparison results
are given in section 4.2. For an accurate measure, SPRINT has been implemented
in Java using, where needed, the same generic datastructures as the ones used for
the randomized algorithms.
In the incremental algorithms, when there is a disagreement between the class
of the new record and the class value of the leaf node, the datapoints represented
by the leaf node and the new record have to be reclassified. Randomized algorithms
cannot be used efficiently here, because they tend to have reduced performance on
such small datasets: the number of records contained in a leaf node is of the order
of a few hundred, for a dataset with about 50000 tuples. Hence, the randomized
algorithms are the worse option, and a SPRINT object is used for reclassification
of the leaf records.
Since SPRINT is an exact classification algorithm, the classification of test data
obtained using the randomized algorithms is tested against SPRINT.
4.1.3 Implementing the Randomized Algorithms
Random samples can be generated using one of the following methods; each
has a complexity of O(n) and is suited to a different scenario.
One method traverses the dataset once, completely. At every record, a
coin is flipped: a random number between 0 and 100 is generated and compared
with the sampling percentage to decide whether the record is to be sampled or
not. This method proves useful if one needs to scan through the dataset anyway
to collect information, and it also guarantees unique records in the sampled set.
It can be used in cases like determining the number of records in the entire
dataset belonging to each class, where a complete prior scan of the entire dataset
is required.
Another way to sample records is to generate a set, S, of the required
number of random record indices, and then, for every s ∈ S, select the s-th record
from the dataset. This method is effective when the dataset is memory resident
and there is no need to scan through the entire dataset.
In the implementation of the randomized algorithms, the dataset is scanned
once and stored in the form of arrays. At every stage in the algorithm, i.e. at every
node, a fresh sample is generated, so that the resulting tree is unbiased. At deeper
levels in the tree, the number of records to be classified shrinks, so only a small
number of records need to be sampled, and it would be cost-ineffective to scan
through the dataset to select very few records. Hence, the latter method is used.
To ensure that only unique records are sampled, a bit array is maintained and
tested before a random number (sample index) is accepted.
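Both sampling strategies can be sketched in Java as follows; the class and method
names are illustrative, and the sampling interface of the actual implementation
may differ.

    import java.util.Random;

    class Sampler {
        private static final Random rng = new Random();

        // Method 1: one complete scan; for every record a number in [0, 100) is
        // drawn and compared with the sampling percentage. Useful when a full
        // scan is needed anyway (e.g. to count the records of each class), and
        // every record can be selected at most once.
        static boolean[] sampleByScan(int n, double percent) {
            boolean[] chosen = new boolean[n];
            for (int i = 0; i < n; i++)
                chosen[i] = rng.nextInt(100) < percent;
            return chosen;
        }

        // Method 2: draw the required number of record indices directly; a bit
        // array guarantees that every sampled record is unique. Useful when the
        // dataset is memory resident and no full scan is needed.
        static int[] sampleByIndex(int n, int sampleSize) {
            boolean[] taken = new boolean[n];
            int[] sample = new int[sampleSize];
            int filled = 0;
            while (filled < sampleSize) {
                int s = rng.nextInt(n);
                if (!taken[s]) {          // reject indices that were already drawn
                    taken[s] = true;
                    sample[filled++] = s;
                }
            }
            return sample;
        }
    }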
To maintain pointers to the datapoints at the leaf-node level, the DecisionTreeNode
class, an extension of the generic NodeBinary class, has been defined; it provides a
generalized object usable by the incremental algorithms.
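A hypothetical outline of these node classes is shown below; the exact field layout
of the implemented NodeBinary and DecisionTreeNode classes may differ.

    import java.util.ArrayList;
    import java.util.List;

    class NodeBinary {
        NodeBinary left, right;
    }

    class DecisionTreeNode extends NodeBinary {
        int splitAttribute;                          // split condition: attribute j < x
        double splitValue;
        int leafClass;                               // class label when the node is a leaf
        List<Integer> recordIds = new ArrayList<>(); // pointers to the datapoints held by
                                                     // a leaf, used by the incremental algorithms
    }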
4.1.4 Iterative as Opposed to Recursive
Java copies parameters and object references across function calls and scopes.
Nested scopes thus hold duplicated data, and the space occupied by it cannot be
reclaimed until the scope is exited. Due to the nature of the algorithms for building
decision diagrams, multiple nested scopes are generated by a recursive (and easier)
version of the algorithm, which necessarily traverses the leftmost path before
entering the right subtree. This could cause memory to thrash.
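The iterative alternative can be sketched as follows, reusing the DecisionTreeNode
outline above and assuming a splitNode(...) helper that either attaches children
and returns true, or marks the node a leaf and returns false; nodes still to be
expanded are kept in an explicit work list instead of nested call frames.

    import java.util.ArrayDeque;
    import java.util.Deque;

    class IterativeGrower {
        // assumed helper; its real counterpart would compute the split for the node
        boolean splitNode(DecisionTreeNode node) { return false; }

        void grow(DecisionTreeNode root) {
            Deque<DecisionTreeNode> work = new ArrayDeque<>();
            work.push(root);
            while (!work.isEmpty()) {
                DecisionTreeNode node = work.pop();
                if (splitNode(node)) {
                    // children simply go back on the work list; no call scopes pile up
                    work.push((DecisionTreeNode) node.right);
                    work.push((DecisionTreeNode) node.left);
                }
            }
        }
    }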
Both iterative and recursive versions have been implemented for SPRINT
as well as for the random decision tree generators. Iterative implementations tend
to have lower execution speeds, but are optimized in memory usage, as the same
datastore can be iterated through for different nodes. The decision tree nodes have
to be stored in an array format for the iterative implementation, while the recursive
Figure 4.1: Alternative data sources
Figure 4.2: Embedded Server - Remote Client architecture
continual basis. Failure to do so can result in unpredictable behavior.
When the array needs to be expanded dynamically, it has to be recreated at an
alternative location and the already existing data has to be copied. This poses an
extra overhead for array management.
To do away with the above disadvantages, the MyVector class is defined; it uses
arrays for internal storage, in the form of blocks. The storage can be extended by
adding blocks to the current store, making space for new data without having to
move the old data. Thus, it avoids the overhead of managing bounds and of copying
data for extendible storage. Figure 4.3 depicts the architecture of MyVector class
objects.
Java provides a Vector class that also removes the overhead of bound management;
the MyVector class, however, tends to have better performance than the Vector
class.
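A small sketch of the idea behind the MyVector class is given below; the block size
and the method names are assumptions made for illustration, not the actual
interface of the implemented class.

    import java.util.ArrayList;

    class MyVector {
        private static final int BLOCK_SIZE = 1024;
        private final ArrayList<Object[]> blocks = new ArrayList<>();
        private int size = 0;

        void add(Object o) {
            // extend the store by one block when it is full; old data is never moved
            if (size == blocks.size() * BLOCK_SIZE)
                blocks.add(new Object[BLOCK_SIZE]);
            blocks.get(size / BLOCK_SIZE)[size % BLOCK_SIZE] = o;
            size++;
        }

        Object get(int i) {
            return blocks.get(i / BLOCK_SIZE)[i % BLOCK_SIZE];
        }

        int size() { return size; }
    }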
Figure 4.3: Architecture of MyVector class objects: a linear extendible array of blocks
4.2 Results
This section reports the results for the implemented algorithms. Various tests
were run to compare the performance and accuracy of the randomized algorithms.
The tests were done on eclipse, an 8-processor Sun SPARC machine running
Solaris 5.6.
Two main types of test were run: performance tests for speed and accuracy
tests for prediction reliability.
4.2.1 Performance
A randomized algorithm is expected to perform better, in terms of the time
required for execution, than a sequential (non-randomized) algorithm.
One of the randomized algorithms discussed before, viz. accelerated collection
of statistics using binary search and prefix computation, is compared against
SPRINT. The tests were run on the machine described above, with a dataset of
43500 records, 9 independent attributes and 1 dependent attribute (the
classification). The tests were run at various levels of first- and second-level
sampling.
Figure 4.4 plots the two-level sampling results against time. As can be seen,
the algorithm outperforms SPRINT by a large margin and scales more or less
linearly.
Figure 4.4: Performance of the two-level sampling algorithm (time versus first-level
sampling in %). Legend: dot-dashed line (SPRINT), dotted line (5%), dashed line
(2%) and solid line (1% second-level sampling)
Another performance comparison is between two randomized algorithms: the
potential split points algorithm against the accelerated collection of statistics
algorithm. As expected, the latter performs better than the former, due
Figure 4.5: Accuracy of the two-level sampling algorithm, plotted against first-level
sampling in %, for two different datasets: solid line, a dataset with 15000 records;
dashed line, a dataset with 846 records
CHAPTER 5
CONCLUSIONS AND FUTURE WORK
The current work discusses the need for decision makers for web and other
applications. In some cases, the need for the model to be an exact embodiment
of the input dataset, or training set, is dire. In other cases, like clickstream
analysis, a fairly good prediction made at runtime using the model can help the
application or boost profits. The other consideration is the time required to build
the model and that required to run a query on it. Depending on the application
in use, one or both need to be optimized, a decision made while choosing the
algorithm used to build the tree.
Randomized algorithms for building decision diagrams have been discussed.
These have varying time complexities, static build times, dynamic query times and
accuracy rates. For life-critical applications, an exact classifier with an optimized
runtime would be required, while for a business application an algorithm that
builds trees in the smallest possible time frame, at a slight expense of accuracy,
could be acceptable or even desired. In cases where the data itself is inaccurate,
one could profit from an algorithm of the latter kind.
Incremental algorithms are required in scenarios where data continuously flows
in and any changes must be reflected in the model at the earliest. The incremental
algorithm discussed here optimizes both the static and the dynamic time while
remaining incremental in nature, achieving the best of both worlds.
Thus, using the set of algorithms discussed, most applications can benefit,
achieving in a much smaller time almost the same result as an exact classifier
would produce.
REFERENCES
[1] Sam Anahory and Dennis Murray. Data Warehousing in the Real World.
Addison-Wesley, Reading, Mass., 1997.
[2] Jennifer Widom. Research Problems in Data Warehousing. In Proc. of the 4th
Int'l Conference on Information and Knowledge Management (CIKM-95),
Baltimore, Maryland, November 1995. (Invited paper).
[3] Rakesh Agarwal, Manish Mehta, Ramakrishnan Srikant, Andreas Arning, and
Toni Bollinger. The Quest Data Mining System. In Proc. of the 2nd Int'l
Conference on Knowledge Discovery in Databases and Data Mining, Portland,
Oregon, August 1996.
[4] Rakesh Agarwal and Ramakrishnan Srikant. Fast Algorithms for Mining
Association Rules. In Proc. of the 20th VLDB Conference, Santiago, Chile, 1994.
[5] Ramakrishnan Srikant and Rakesh Agarwal. Mining Quantitative Association
Rules in Large Relational Tables. In Proc. of the ACM SIGMOD 1996
Conference on Management of Data, Montreal, Canada, June 1996.
[6] Rakesh Agarwal and Ramakrishnan Srikant. Mining Sequential Patterns. In
Proc. of the 11th Int'l Conference on Data Engineering, Taipei, Taiwan, March
1995.
[7] J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann,
San Mateo, California, 1993.
[8] J. Wirth and J. Catlett. Experiments on the Costs and Benefits of Windowing
in ID3. In 5th Int'l Conference on Machine Learning, pages 87-99, Ann Arbor,
Michigan, June 1988.
[9] Manish Mehta, Rakesh Agarwal, and Jorma Rissanen. SLIQ: A Fast Scalable
Classifier for Data Mining. In Proc. of the Fifth Int'l Conference on Extending
Database Technology (EDBT), Avignon, France, March 1996.
[10] John Shafer, Rakesh Agarwal, and Manish Mehta. SPRINT: A Scalable
Parallel Classifier for Data Mining. In Proc. of the 22nd Int'l Conference on Very
Large Databases, Bombay, India, September 1996.
[11] Khaled Alsabti, Sanjay Ranka, and Vineet Singh. CLOUDS: A Decision Tree
Classifier for Large Datasets. In 4th Int'l Conference on Knowledge Discovery
and Data Mining (KDD-98), New York City, August 1998.
[12] Manish Mehta, Jorma Rissanen, and Rakesh Agarwal. MDL-based Decision
Tree Pruning. In Proc. of the 1st Int'l Conference on Knowledge Discovery
in Databases and Data Mining, Montreal, Canada, August 1995.
[13] Philip K. Chan and Salvatore J. Stolfo. Experiments on multistrategy learning
by meta-learning. In Proc. of the 2nd Int'l Conference on Information and
Knowledge Management (CIKM-93), pages 314-323, Washington, November 1993.
[14] Philip K. Chan and Salvatore J. Stolfo. Meta-learning for multistrategy and
parallel learning. In Proc. of the Second Int'l Workshop on Multistrategy Learning
(MSL-93), pages 150-165, Harpers Ferry, Virginia, May 1993.
[15] Rajeev Motwani and Prabhakar Raghavan. Randomized Algorithms.
Cambridge University Press, New York, 1995.
[16] Ellis Horowitz, Sartaj Sahni, and Sanguthevar Rajasekaran. Computer
Algorithms. W.H. Freeman and Company, New York, 1997.
[17] Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest. Introduction
to Algorithms. MIT Press, Cambridge, Mass., 1990.
BIOGRAPHICAL SKETCH
Vidyamani Parkhe was born on April 21, 1976, in Indore, India. He completed
his bachelor's degree in electrical engineering at the Govt. College of Engineering,
Pune, India, in June 1998. He worked as an intern development engineer at the
Loudspeaker Development Lab., Philips Sound Systems, Pimpri, India.
He joined the University of Florida in August 1998, where he worked as a
research and teaching assistant for several courses. He completed his Master of
Science degree in computer engineering at the University of Florida, Gainesville,
in December 2000.
His research interests include randomized algorithms, data structures, databases
and data mining.
