
- Permanent Link:
- https://ufdc.ufl.edu/UF00100813/00001
## Material Information
- Title:
- Randomized decision trees for data mining
- Creator:
- Parkhe, Vidyamani, 1976- (*Dissertant*)
- Rajasekaran, Sanguthevar (*Thesis advisor*)
- Sahni, Sartaj (*Reviewer*)
- Hammer, Joachim (*Reviewer*)
- Place of Publication:
- Florida
- Publisher:
- State University System of Florida
- Publication Date:
- 2000
- Copyright Date:
- 2000
- Language:
- English
## Subjects
- Subjects / Keywords:
- Algorithms ( jstor )
- Copyrights ( jstor )
- Customers ( jstor )
- Datasets ( jstor )
- Decision trees ( jstor )
- Gini index ( jstor )
- Information classification ( jstor )
- Mining ( jstor )
- Sampling methods ( jstor )
- Statistics ( jstor )
- Computer and Information Science and Engineering thesis, M.S ( lcsh )
- Data mining ( lcsh )
- Decision trees ( lcsh )
- Dissertations, Academic -- Computer and Information Science and Engineering -- UF ( lcsh )
- Genre:
- bibliography ( marcgt )
- theses ( marcgt )
- government publication (state, provincial, territorial, dependent) ( marcgt )
- non-fiction ( marcgt )
## Notes
- Abstract:
- Classification data mining is used widely in the area of retail analysis, disease diagnosis and scam detections. Of late, the application of classification data mining to the area of web development, web applications and analysis is being exercised. The major challenges to this new facet of classification are the enormous amount of data, data inconsistencies, pressure for time and accuracy of prediction. The contemporary algorithms for classification, that majorly use decision diagrams, are less useful in such a scenario. The major impediment is the large amount of static time required in building a model (decision diagram) for accurate prediction or decision making at run-time and the lack of an efficient incremental algorithm. Randomized and sampling techniques researched for the problem have been less accurate. The present work discusses deterministic and randomized algorithms for classification data mining that are easily parallelizable and have better performance. The algorithms suggest novel methods, like multiple levels of intelligent sampling and partitioning, to collect record distributions in a database, for faster evaluation of gini indexes. An incremental algorithm, to absorb newly available data-sets, is also discussed. A combination of these characteristics, along with very high accuracy in decision-making, makes these algorithms adept for data mining and more specifically web mining.
- Subject:
- KEYWORDS: classification, data mining, randomized algorithms, decision diagrams, incremental algorithms, gini index, web mining
- Thesis:
- Thesis (M.S.)--University of Florida, 2000.
- Bibliography:
- Includes bibliographical references (p. 52-53).
- System Details:
- System requirements: World Wide Web browser and PDF reader.
- System Details:
- Mode of access: World Wide Web.
- General Note:
- Title from first page of PDF file.
- General Note:
- Document formatted into pages; contains vi, 54 p.; also contains graphics.
- General Note:
- Vita.
- Statement of Responsibility:
- by Vidyamani Parkhe.
## Record Information
- Source Institution:
- University of Florida
- Holding Location:
- University of Florida
- Rights Management:
- All applicable rights reserved by the source institution and holding location.
- Resource Identifier:
- 47680277 ( OCLC )
- 002678735 ( AlephBibNum )
- ANE5962 ( NOTIS )
## Full Text

RANDOMIZED DECISION TREES FOR DATA MINING

By VIDYAMANI PARKHE

A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE

UNIVERSITY OF FLORIDA 2000

ACKNOWLEDGMENTS

I would like to express my sincere gratitude to Dr. Raj for his support and encouragement throughout my graduate program at the University of Florida. It was due to his motivation and advice that the challenges of first a transfer to CISE and then a master's program were never a hurdle. I am indebted to Professors Sumi Helal and Manuel Bermudez for their undepletable attention and their great help, since the very first day at UF. I thank Professors Sartaj Sahni and Joachim Hammer for being on my committee and for their suggestions and comments.

There are a few people to whom I am grateful for multiple reasons. Firstly, my family back home in India: without their support and trust, nothing would have been possible. Next, my closest ever friends, Amit, Latha, Sangi, Prashant, Mahesh, Prateek, Subha and Hari amongst others, for being my family here. Without their support, encouragement and wrath the journey would have been an impossible one. Special thanks go to John Bowers, Nisi and Victoria for being there, always! I cannot stop short of thanking Leo's, Chilis and Taco Bell for their extended hours and exquisite food, which has now become an integral part of our lives.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
ABSTRACT
CHAPTERS
1 INTRODUCTION
  1.1 Data Warehousing
  1.2 Data Mining
    1.2.1 Association Rules
    1.2.2 Clustering
    1.2.3 Sequential Patterns
    1.2.4 Classification
  1.3 Goal
2 RELATED WORK IN THE AREA OF CLASSIFICATION DATA MINING
  2.1 Gini Calculation
  2.2 SLIQ Classifier for Data Mining
    2.2.1 Tree Building
    2.2.2 Tree Pruning
  2.3 SPRINT: A Parallel Classifier
    2.3.1 The SPRINT Algorithm
    2.3.2 Speedup over SLIQ
    2.3.3 Exploiting Parallelism
  2.4 CLOUDS: A Large Data-set Classifier
    2.4.1 Data-set Sampling (DS)
    2.4.2 Sampling the Splitting Points (SS)
    2.4.3 Sampling the Splitting Points with Estimation (SSE)
  2.5 Incremental Learners
3 ALGORITHMS
  3.1 Sorting Is Evil
  3.2 Randomized Approach to Growing Decision Trees
    3.2.1 SSE Without Sorting
    3.2.2 Sampling a Large Number of Potential Split Points
    3.2.3 Improvised Storage Structure
    3.2.4 Better Split-points
    3.2.5 Accelerated Collection of Statistics
  3.3 Multi-level Sampling
  3.4 "Say No to Randomization!"
    3.4.1 Accelerated Collection of Statistics, the Reprise
    3.4.2 Statistics ... To Go!
  3.5 Incremental Decision Trees
4 IMPLEMENTATION AND RESULTS
  4.1 Implementation
    4.1.1 Datastructures
    4.1.2 Implementing SPRINT
    4.1.3 Implementing the Randomized Algorithms
    4.1.4 Iterative as Opposed to Recursive
    4.1.5 Alternative Data Sources
    4.1.6 Embedded Server for Run-time Statistics
    4.1.7 MyVector Class for Better Array Management
  4.2 Results
    4.2.1 Performance
    4.2.2 Accuracy
5 CONCLUSIONS AND FUTURE WORK
REFERENCES
BIOGRAPHICAL SKETCH
Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Master of Science

RANDOMIZED DECISION TREES FOR DATA MINING

By Vidyamani Parkhe

December 2000

Chairman: Sanguthevar Rajasekaran
Major Department: Computer and Information Science and Engineering

Classification data mining is used widely in the area of retail analysis, disease diagnosis and scam detections. Of late, the application of classification data mining to the area of web development, web applications and analysis is being exercised. The major challenges to this new facet of classification are the enormous amount of data, data inconsistencies, pressure for time and accuracy of prediction. The contemporary algorithms for classification, that majorly use decision diagrams, are less useful in such a scenario. The major impediment is the large amount of static time required in building a model (decision diagram) for accurate prediction or decision making at run-time and the lack of an efficient incremental algorithm. Randomized and sampling techniques researched for the problem have been less accurate. The present work discusses deterministic and randomized algorithms for classification data mining that are easily parallelizable and have better performance. The algorithms suggest novel methods, like multiple levels of intelligent sampling and partitioning, to collect record distributions in a database, for faster evaluation of gini indexes. An incremental algorithm, to absorb newly available data-sets, is also discussed. A combination of these characteristics, along with very high accuracy in decision-making, makes these algorithms adept for data mining and more specifically web mining.

Key words: Classification, data mining, randomized algorithms, decision diagrams, incremental algorithms, gini index and web mining.
CHAPTER 1
INTRODUCTION

With the current trend in industry to learn from one's own mistakes and those of others, organizations from small retailers to large corporations are looking towards the "Knowledge Discovery" paradigm to get a more comprehensive overview of their own data. This has been made possible by the improved performance of data warehousing and data mining algorithms, falling costs of storage units and increased processing speeds. The sections that follow illustrate techniques used to store and mine data in order to derive information that had not previously been available or even thought of.

1.1 Data Warehousing

With the very large number of customer transactions happening every second in large supermarkets, internet sites, banks, insurance or phone companies, there is a need for alternative methods of storing historical data so that no information is lost. There could also be a need, at a later date, to draw out the hidden knowledge from the data without a tangible loss of information. The study of data warehousing comprises everything from the machine architecture and the data store best suited to the specific form of data, to the algorithms used to handle and store the data.

A data warehouse can be thought of as a collection of different sources of data, put together after being cleaned and checked for inconsistencies prior to merging. These data sources could be different locations of a chain of supermarkets, like Wal-Mart, or data recorded at the same location over time. Removing the inconsistencies in the data is essential; otherwise the information drawn out after mining the entire data would be incorrect. There are various algorithms for cleaning data sources, merging the data and storing it in a format from which the applications¹ using the data can access it in the most efficient manner.
Such algorithms have been documented in the literature [1, 2]. Widom [2] also raises a few issues in data warehouse design that significantly affect the applications using the data.

1.2 Data Mining

Data mining can be viewed as an application that resides over a data warehouse and uses the data to search for certain unknown patterns. The patterns could be in the form of rules, clusters or some classification, as described in the following sub-sections. Data mining is different from OLAP² in that OLAP uses query techniques to confirm results known from the past or from heuristics. In contrast, data mining is a search for the unknown, wherein the entire data set is used to draw out information about the data at large, rather than the specifics of individual data items. The following sub-sections take a closer look at the various techniques used for data mining.

1.2.1 Association Rules

This form of data mining is used to relate two or more quantities that would not otherwise appear to be related. An example would be: x% of the people buying beer also buy diapers, and y% of all the transactions contain both beer and diapers. This can be stated in the form of a rule, Beer ⇒ Diapers. This rule is said to have x% confidence and y% support. The task of association rule mining is to come up with a set of rules that satisfy a minimum confidence and support level. One of the leading algorithms used for association rule mining is the Apriori algorithm [3, 4, 5].

¹ There could be various applications that make use of the data store, like OLAP, data visualization, data mining or even transactional ones for day-to-day transactions.
² On-line Analytical Processing.

1.2.2 Clustering

The principle used in clustering is to group together (or cluster) the data points that have a common characteristic.
The idea is to partition the entire data set into categories, depending upon some feature(s), such that the items in one group or cluster are more similar to each other than to the items in other clusters. Clustering is used in various facets of knowledge discovery and learning, like machine learning, pattern recognition, optimization, etc. A classic algorithm for clustering starts by designating k data points that act as centroids for the k clusters. It then finds the nearest centroid for every data point to be clustered, and re-evaluates the mean (centroid) of the data points in each cluster.

1.2.3 Sequential Patterns

Initially proposed by Agrawal and Srikant [6], the technique uses time (in most cases) to detect a similarity (pattern) in the occurrence of events. This can be used in various applications, like detecting patterns in which books are being read by a set of library users, detecting a chain-referral scam among medical practitioners, or even more critical ones like disease diagnosis [3].

1.2.4 Classification

"One's ability to make the correct set of decisions while solving a certain problem and doing so in the specific allotted time is the key factor in one's success, time and again. This is true for any sort of problems, may it be from a research perspective or a political one; only the variables and constants change."
- Anonymous

Classification data mining is associated with those aspects of knowledge discovery in which there is a need to categorize data points, associating them with a certain classification or category, based upon the classification of a few known data points. Here, the objective is to traverse the training³ data set and come up with a model that can be used for the future classification of a test⁴ data set. The technique of classification has long been used for the purpose of machine learning [7], optimizations using neural networks, decision trees, etc.
A decision tree or decision diagram comprises a root node, at which the first decision is made while classifying a test record. A non-leaf node in the decision diagram represents the data represented by its sub-trees, while a leaf node represents data that belongs to one class and satisfies the conditions of all its ancestor nodes. The decision is binary in most cases: it is either true or false, and its outcome leads to one of the sub-trees of the root, to which the same procedure is applied recursively until a leaf node is reached. The leaf node determines the classification of the record under consideration. The non-leaf nodes are decision-making nodes (mostly binary) and hold a condition like Age < 25, which leads to two sub-trees; the data in each sub-tree always satisfies the corresponding outcome of the condition laid down by the common parent node, i.e. Age < 25.

³ A data set for which a classification is known.
⁴ A data set for which the classification is not known and has to be determined.

1.3 Goal

Most of the contemporary algorithms for growing decision trees discuss cost-effective methods by which the size (height and spread) of the tree is optimized, and with it the dynamic time required for decision making. The algorithms output the most compact form of a tree for a given training set, but are cost-inefficient at the tree-growing stage. The principle used in most of the algorithms is to come up with the best split attribute for a given data-set, about which the entire data-set can be divided into two sections (for binary decision diagrams), such that most of the records of a type belong to one of the sub-trees. For a given data-set, it is very time consuming to come up with the very best split attribute at every stage in the tree. Gini values are used (in most cases) to determine the best split at a certain stage. The detailed explanation of gini values and their calculation is the subject of the following chapters.
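The traversal of a binary decision tree described in Section 1.2.4 can be sketched as follows. This is a minimal illustration, not the thesis implementation: the `Node` and `Leaf` classes, the attribute names and the class labels are all hypothetical, and only the condition `Age < 25` comes from the text.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    label: str  # class assigned to every record that reaches this leaf


@dataclass
class Node:
    attribute: str    # attribute tested at this node, e.g. "age"
    threshold: float  # the condition is record[attribute] < threshold
    if_true: Union["Node", Leaf]
    if_false: Union["Node", Leaf]


def classify(tree, record):
    """Walk from the root to a leaf, applying one binary test per level."""
    node = tree
    while isinstance(node, Node):
        if record[node.attribute] < node.threshold:
            node = node.if_true
        else:
            node = node.if_false
    return node.label


# Hypothetical two-level tree: Age < 25 at the root, then Income < 30000.
tree = Node("age", 25, Leaf("high_risk"),
            Node("income", 30000, Leaf("medium_risk"), Leaf("low_risk")))

print(classify(tree, {"age": 22, "income": 50000}))
print(classify(tree, {"age": 40, "income": 20000}))
```

The number of tests performed equals the depth of the leaf that is reached, which is why a wider or more skewed tree increases decision time at run-time.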
The approach mentioned above is acceptable, and often used, in situations where the decision diagrams are built statically and then used for decision making at run-time. Also, if there is no need to update the classification for a long time, an approach that gives the most succinct tree is preferable, as the time required for making decisions at run-time is then greatly reduced. But in situations where the decision diagram must be updated very frequently, an approach is needed that does so in a very short interval of time.

As discussed before, the most time-consuming task in the tree-building stage is to determine the best split attribute-value pair. If one spends time deciding on the very best split at every stage, the most concise and compact form of the tree is undoubtedly obtained, but this can be very time consuming. Conversely, if the very best split is not determined, the trees tend to be wider and longer, increasing the time required to make decisions at run-time. If the decision diagrams are to be used for the most critical decisions, with a very high risk level, one might be more inclined to use an algorithm that, although it takes a lot of static creation time, outputs the most concise and compact embodiment of the life-critical data set, a query on which would not take long to execute. On the other hand, an application with no critical hazard might be greatly helped by an inherently incremental algorithm, which would help in assimilating the most recent data into the decision diagram. Thus, the trade-offs between creation time and usage time are determined by the application in mind and its ultimate usage. Web applications, in most cases, are like the latter ones described above.
An example would be click-stream analysis, wherein the objective is to observe a pattern in the web-clicks of various users on a set of web-pages. The problem can be stated as follows. Imagine yourself to be the owner of a web-based commercial store that sells books. There is a group of loyal customers that can be identified by their login names and passwords. Every click made by every person to date has been recorded. The customers range from those who merely surf through the web-pages under your company's domain and buy nothing to those who are avid buyers. It would be an interesting piece of knowledge to know who is buying exactly what and, more specifically, whether there is a pattern in the types of books being bought by various customers over a specific range of time. It might prove commercially advantageous to be able to predict the buying pattern of a set of customers (or potential customers) from the buying patterns of other customers.

But with the number of clicks being made on the web-pages and the increasing number of transactions happening every second, it might be impossible, at run-time, to search for and identify the similarity between a set of clicks of one particular customer and those of another in the past. An effective data structure like a decision diagram would certainly come in handy in such cases, where it is most important to interest a customer further in what he or she would have been interested in anyway, and so generate business. In such a case the problem is mostly one-sided: the presence of a decision-maker or next-click predictor is not of primary importance, but having a good one would definitely help the business grow.
Also, as in the case of the click-stream example above, an algorithm that is incremental, in that it can incorporate fresh data into the decision-making data structure, would be of significant use, compared with a static decision-maker that reflects the choices and trends in the market from an earlier era. As in the above case, it could prove advantageous to be able to make decisions based on clicks, results or transactions that happened just the previous second!⁵

The current work concentrates on the issues mentioned above, using techniques like randomized algorithms and sampling to achieve speedups without loss of accuracy. An attempt is also made to make the algorithms incremental, so that any additions to the data sets can be reflected in the decision-maker (also referred to as the learner). The chapters that follow present the algorithms and the implementation, including details of the data structures used.

⁵ Though it might be difficult and highly cost-inefficient to try and accommodate data from a transaction that occurred just a few minutes back.

CHAPTER 2
RELATED WORK IN THE AREA OF CLASSIFICATION DATA MINING

Decision diagrams and other classifiers like genetic algorithms, Bayesian and neural networks have been used for a very long time for the purpose of simple classification and decision support. Anahory and Murray [1] give a detailed analysis of how tools like decision diagrams can be used for data mining, which can be very effective for decision support over a data warehouse. Since the evolution of decision diagrams, many algorithms, varying in time complexity and in the kind of data to be classified, have been devised for the purpose of classification. Some of the famous algorithms developed include ID3 [8], C4.5 [7], SLIQ [9], SPRINT [10], CLOUDS [11], and others. All these algorithms and other previous work in the area of classification data mining have sought the best possible way, given a data-set or a database of records, to provide the classification with the most concise representation or data structure. Some of the common forms of representation used for classification data mining are neural networks, decision diagrams, etc. Another issue of primary importance is the time required to pack the given data-set into the selected representation, which should be as short as possible.

In all the algorithms for building decision diagrams mentioned above, a common premise and one of the most important objectives is that the tree-building algorithm should be precise. The tree should be an exact representation of the given training data-set. But in the race to come up with a perfect tree, a lot of time is spent in building the tree in the first place. These algorithms are cost effective and suggest techniques like parallel and simultaneous execution for a faster growth [...]

Table 2.1: Sample dataset for gini calculation
Class: 1, 2, 1, 3, 2, 2, 1, 1, 1, 3, 2

Figure 2.1: Compact Tree

Figure 2.2: Skewed Tree

[...] partitions S into S1 and S2. The gini value of the split can be estimated using

$$gini_{split}(S) = \frac{n_1}{n}\, gini(S_1) + \frac{n_2}{n}\, gini(S_2)$$

where n1 and n2 are the number of data points in S1 and S2, respectively, and n is the number of data points in S. As can be seen, the calculation of the gini index is the most important step in the node-splitting stage of a decision tree. It can also be observed that the process could be time consuming: to calculate the gini of one particular potential split value, all the records have to be considered in order to obtain n1, n2 and all the pj's for each of S1 and S2.
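The split formula above can be made concrete with a short sketch. It assumes the standard definition of the gini index used by SLIQ and SPRINT, $gini(S) = 1 - \sum_j p_j^2$, where $p_j$ is the relative frequency of class j in S. The function names and the toy data are hypothetical and are not taken from the thesis.

```python
from collections import Counter


def gini(labels):
    """gini(S) = 1 - sum_j p_j^2, with p_j the relative frequency of class j."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_split(left_labels, right_labels):
    """Weighted gini of a binary split: (n1/n) gini(S1) + (n2/n) gini(S2)."""
    n1, n2 = len(left_labels), len(right_labels)
    n = n1 + n2
    return (n1 / n) * gini(left_labels) + (n2 / n) * gini(right_labels)


def best_split_exhaustive(values, labels):
    """Try every distinct value v as the test `value < v`.

    Each candidate needs a full pass over the records, which is where the
    quadratic cost per attribute comes from.
    """
    best = None
    for v in sorted(set(values)):
        left = [c for x, c in zip(values, labels) if x < v]
        right = [c for x, c in zip(values, labels) if x >= v]
        if not left or not right:
            continue  # skip splits that leave one side empty
        score = gini_split(left, right)
        if best is None or score < best[1]:
            best = (v, score)
    return best


# Hypothetical single-attribute data set (ages) with two classes.
ages = [18, 22, 24, 30, 35, 41, 50, 63]
classes = ["A", "A", "A", "B", "B", "B", "B", "A"]
print(best_split_exhaustive(ages, classes))
```

On this toy data the lowest weighted gini is obtained for the test `age < 30`, which puts three records of class A on one side and four of the five remaining records of class B on the other.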
Since the gini has to be calculated at all the potential split points, viz. all the distinct data points in the current data-set for each attribute, a straightforward algorithm for the gini calculations takes $O(an^2)$ time, where n is the number of records in the data-set and a is the number of attributes.

2.2 SLIQ Classifier for Data Mining

SLIQ was one of the first of its kind to introduce the concept of the gini index to grow decision trees. The algorithm builds decision diagrams in two phases, viz. Tree Building and Tree Pruning, which are discussed in the following sub-sections.

2.2.1 Tree Building

This comprises two steps: i) evaluating the splits for each attribute and selecting the best split, and ii) creating the partitions using the best split. This is done in the following manner. First the given table is split into separate lists, one per attribute, in which the records are sorted according to that particular attribute but maintain a pointer to the other attribute values of the same record, as can be seen below. The second [...]

Algorithm EvaluateSplits()
    for each attribute A do
        traverse attribute list of A
        for each value v in the attribute list do
            find the entry in the class list, and hence the class and leaf node l
            update the histogram (statistics) in leaf l
            if A is a numeric attribute then
                compute splitting index for test (A < v) for leaf l
        if A is a categorical attribute then
            for each leaf of the tree do
                find the subset of A with the best split

Figure 2.3: EvaluateSplits()

[...] tree, depending upon the best split condition A < v. The records satisfying the condition are placed in the left sub-tree, and the others in the right sub-tree.¹

2.2.2 Tree Pruning

The tree is built using the entire training data-set. This could contain some spurious "noisy" data, which could lead to an error in determining the class for the test data.
Those branches that could potentially be misleading at run-time for class estimation are removed from the tree, using the pruning algorithm described in Mehta et al. [12].

¹ This could vary depending upon the application using the decision tree. For a particular application the condition could be modified as A < v.

2.3 SPRINT: A Parallel Classifier

SPRINT was one of the pioneering algorithms for building decision diagrams that are exact classifiers, as opposed to approximate classifiers that compromise on accuracy for a better time complexity. Some of the approximate algorithms include C4.5 [7] and D-2. The algorithm was designed to be inherently parallel in nature, leading to further scope for speed-up compared to the contemporary algorithms.

[...] the above algorithm, by assigning each attribute list to a separate processor for gini calculation, and then putting the results together to estimate the best split at a node. The node splitting can also be done in parallel in the following manner. Two processors record the sub-tree that each record should belong to after the split, depending upon the split attribute value. Since the attribute list is sorted, it is trivial to decide on the cut-off boundaries for each processor, and hence they can work in parallel. Since no record can be common to both, they can both work on a common array in shared memory. This array can then be used to split the other attribute lists, depending upon the entry in the shared-memory array. Hence, splitting can be done in O(n) time using O(s) processors, where n is the number of records in a node and s is the number of attributes, for a total processor work of O(ns). This can be further extended to calculate the total splitting time complexity at the tree growth stage.
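The splitting step described above can be sketched sequentially as follows. This is an illustration under stated assumptions, not SPRINT's actual code: each attribute list is assumed to hold `(value, class_label, rid)` tuples sorted by value, a plain Python list stands in for the shared-memory array, and the function and variable names are hypothetical.

```python
def split_attribute_lists(attribute_lists, split_attr, threshold, n_records):
    """Partition every attribute list of a node using one shared side array.

    attribute_lists maps attribute name -> list of (value, class_label, rid),
    each list sorted by value. Returns (left_lists, right_lists).
    """
    # Step 1: scan the winning attribute list and record each record's side.
    goes_left = [False] * n_records  # stands in for the shared-memory array
    for value, _label, rid in attribute_lists[split_attr]:
        goes_left[rid] = value < threshold

    # Step 2: split every list by probing the array. Each list needs one
    # O(n) pass, and the passes are independent of each other, so each
    # could be handed to its own processor.
    left, right = {}, {}
    for name, entries in attribute_lists.items():
        left[name] = [e for e in entries if goes_left[e[2]]]
        right[name] = [e for e in entries if not goes_left[e[2]]]
    return left, right


# Hypothetical node with four records (rid 0..3) and two attributes.
lists = {
    "age": [(19, "A", 2), (23, "A", 0), (31, "B", 3), (40, "B", 1)],
    "salary": [(20, "B", 1), (40, "B", 3), (50, "A", 0), (70, "A", 2)],
}
left, right = split_attribute_lists(lists, "age", 25, n_records=4)
print(left["salary"])
print(right["salary"])
```

Because every list is filtered in its existing order, the child lists remain sorted and no re-sorting is needed below the root.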
Since there can be at most N records across all the nodes at any level of the tree, where N is the total number of records in the data-set, the total splitting time per level is O(N). A well-distributed full tree has O(log N) levels, so the total time complexity can be estimated to be O(N log N) on a parallel machine with O(s) processors.

2.4 CLOUDS: A Large Data-set Classifier

CLOUDS² was the first of its kind to use sampling for the purpose of classification. The sampling step is followed by an estimation step to determine a closer and better split attribute-value pair. The CLOUDS algorithm assumes the following two properties of gini indexes for real data-sets [11]:

- Given a sorted data-set, the gini value generally increases or decreases slowly. This implies that the number of good local minima is significantly smaller than the size of the data-set, especially for the best split attribute.
- The minimum gini value (potential split) for an attribute is significantly lower than the gini values at the other data points along the same attribute, and along other attributes too.

Using these two principles as guidelines, a couple of sampling techniques were developed.

² Classification of Large or OUt-of-core DataSets.

2.4.1 Data-set Sampling (DS)

In this algorithm, a random sample of the data-set is obtained, and the direct method (DM)³ for classification is applied. In order to maintain the quality of the classifier, the gini values are calculated using the entire data-set, but only for the sampled data points.

2.4.2 Sampling the Splitting Points (SS)

Here, a quantiling technique is used to partition the attribute domain into q parts. Gini values are calculated at each of the boundaries of the q-quantiles, and the lowest is chosen for the split attribute. Hence, prior knowledge of the type and range of the attribute values (meta-data) is required.
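The SS idea can be sketched in a few lines. This is a simplified illustration rather than the CLOUDS implementation: for brevity the quantile boundaries are computed by sorting the values directly, whereas the text assumes they are known from meta-data, and all function names and the toy data are hypothetical.

```python
from collections import Counter


def gini(labels):
    """gini(S) = 1 - sum_j p_j^2 over the class frequencies in S."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def quantile_boundaries(values, q):
    """Interior boundaries cutting the sorted values into q near-equal parts."""
    ordered = sorted(values)
    n = len(ordered)
    return sorted({ordered[(i * n) // q] for i in range(1, q)})


def best_split_ss(values, labels, q):
    """Evaluate the weighted gini only at the q-quantile boundaries."""
    n = len(values)
    best = None
    for v in quantile_boundaries(values, q):
        left = [c for x, c in zip(values, labels) if x < v]
        right = [c for x, c in zip(values, labels) if x >= v]
        if not left or not right:
            continue
        score = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
        if best is None or score < best[1]:
            best = (v, score)
    return best


# Hypothetical data: the exact class boundary lies at value 13,
# which is not one of the quantile boundaries for q = 4.
values = list(range(1, 21))
labels = ["A" if x <= 12 else "B" for x in values]
print(quantile_boundaries(values, 4))
print(best_split_ss(values, labels, 4))
```

With q = 4 only three candidate points are evaluated instead of twenty, and the chosen boundary is close to, but not exactly at, the true class boundary. This is the gap that the estimation step of SSE is meant to close.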
2.4.3 Sampling the Splitting Points with Estimation (SSE)

The SSE technique uses SS to estimate the gini values at the boundaries of the q-quantiles for each attribute of the data-set. Then, as in the case of SS, the minimum $gini_{min}$ is chosen, here for the purpose of determining the threshold value for the next (estimation) step, which determines the lowest gini value. Using the gini values, the lowest possible gini value within each quantile, $gini_{low}$, is determined. Intervals that do not meet the threshold are discarded, i.e. intervals for which $gini_{low} > gini_{min}$ are eliminated. For the surviving intervals, gini values are calculated at every data point to determine the lowest possible gini value.

³ Something like SPRINT, wherein the gini at every attribute value is calculated for estimating the best split.

CLOUDS uses both of the above to classify the data-set using sampling techniques. The sampling technique determines the size of the decision tree. The quantiling technique governs the accuracy rate and the time required at every stage.

2.5 Incremental Learners

The motivation for an incremental algorithm is that, with the existing algorithms for building decision diagrams, whenever new data arrives the entire classifier (learner) has to be destroyed and a new learner created from the old and new data. Such a process takes a long time and has to be repeated frequently. An incremental algorithm is one in which the time required corresponds merely to the new data, rather than to the total of new and old data. One of the ways to achieve incrementality is to have some technique for merging two learners into one learner that is a combination of the two. Chan and Stolfo [13, 14] have suggested some methods for merging trees together. The following are the two major techniques suggested:

- Hypothesis boosting is a method in which a number of different algorithms are used on the same data-set to generate various learners. Then, using a meta-learner, all these learners are combined. Thus, the properties of all the different learning algorithms are present in the new learner.
- Parallel learning is a technique in which a data-set is broken up into various parts, to each of which the same algorithm is applied to obtain different parallel learners, which can be combined to obtain a learner for the whole data-set.

The other techniques comprise a combination of these ideas. The following chapters give the algorithms and the implementation details of the Randomized Decision Tree algorithm, along with performance statistics.

CHAPTER 3
ALGORITHMS

Having discussed in the previous chapter the earlier work in the area of classification data mining, and specifically the algorithms for decision trees, this chapter deals with the algorithms devised for building (growing) randomized decision diagrams. Contemporary algorithms like SPRINT and SLIQ aim at building the most concise and compact form of the tree for a given data-set. But, as discussed before, this approach is extremely time consuming. The present chapter discusses a few randomized algorithms that may have the same time complexity but are estimated to run faster, without a loss of accuracy in the resulting learner. In certain cases, the height and width of the tree are greater than those of the SPRINT/SLIQ tree for the same data-set.

The following sections give the drawbacks of the above contemporary algorithms that render them less useful for rapidly changing, enormous amounts of data, or for applications where the data could become outdated very early.

3.1 Sorting Is Evil

One of the most important characteristics of web-based applications is that the data is changing on a continuous basis and very little down time is permissible, if any.
In an application like Click-stream Analysis, it could in most cases be required to absorb and reflect a newly available data-set into the learner. In such cases, the decision tree should not be built from scratch but should be an addition to the already existing one. Hence, if the entire data-set has to be sorted for each attribute, the operation will be extremely costly. Thus, the sorting operation that needs to be done (though only once) at the root node should be avoided as far as possible. If sorting cannot be avoided, then the number of records that have to be sorted should be reduced drastically. Inspired by the SS approach suggested in Alsabti et al. [11], the following algorithms suggest ways in which one can come up with decision diagrams without sorting the entire data-set.

3.2 Randomized Approach to Growing Decision Trees

Motwani and Raghavan [15], Horowitz et al. [16], Cormen et al. [17] and others suggest algorithms in which randomized approaches help in reducing the time complexity of an algorithm without significant loss of accuracy, and in most cases with 9[?]% or higher accuracy.

One disadvantage of using randomized algorithms, as suggested before, is that although one does not lose out on accuracy, the resulting trees could be wider and longer, resulting in greater time to make a decision using the learner. If the tree growth process is not controlled, the trees could end up being skewed, increasing the time complexity of the decision-making algorithms.

In applications where accuracy is of extreme importance, such as high-risk or life-critical applications, it might not be feasible to use such algorithms. Examples are disease diagnosis or a learner that differentiates a poisonous mushroom from a non-poisonous one. But even in such cases, if the learner assures 100% accuracy at the cost of higher search/decision time, randomized approaches could prove to be useful.
In the subsections that follow, randomized algorithms and their modifications for building decision diagrams are suggested.

3.2.1 SSE Without Sorting

Sorting the attribute lists is the most time-consuming task in the calculation of gini values before a node can be split. The attribute lists have to be sorted separately, as there need be no correlation between the orders of any two attributes in a data-set: given a data-set with n attributes, (n - 1) of them are independent attributes while 1 is a dependent attribute, referred to as the class attribute. The algorithm stated below would work well in any one of the following scenarios:

- There are one or more attributes that are partially dependent on one or more other attributes, in that their values/order can be predicted from the value/order of other attributes or a combination thereof.
- The application that uses the decision tree can tolerate faulty results some of the time. This is possible if the application uses the decision diagrams to predict the behavior of a non-life-threatening entity. It could also be of use in scenarios where the result is required urgently: a faulty result would do no harm to the application, but timely procurement of a sound result would certainly help.
- One has a certain amount of pre-knowledge of the data-set, in that one can, after looking at a few data-points, make a fairly good guess about the nature of the neighboring points. An example would be a data-set generated at a weather station. Looking at the temperatures of a few data-points, one can make calculated guesses about the neighboring points (at least, one is sure that the night temperature is lower than the day temperature).

The algorithm proceeds by sampling a certain percentage of records and initially working with them. The gini values at these points are evaluated.
Since we have a constant number of sampled points, the complexity is O(n), where n is the total number of records in the data-set. Here the gini values for the sampled records are calculated using the entire data-set, and hence these gini values are exact as opposed to approximate. This can be achieved in one of the following ways:

- Since we have only a constant number, s, of sampled records, one can obtain the statistics required for the gini calculation by merely comparing every record in the data-set with every record in the sampled set (the records for which the gini is to be evaluated). This requires O(ns) time or, if only a constant number of records are sampled, O(n) time.
- If the number of records sampled is large, it could be costly to compare every one of the sampled records with the ones in the data-set. Here, we sort just the sampled records in O(s log s) time and then use the above process, exploiting the fact that if a record X precedes another record Y in the sampled set, ordered on an attribute z, then for any record M in the data-set, M.z < X.z implies M.z < Y.z. Thus, techniques like BinarySearch, or searching the array in reverse order, can help reduce the time required to determine the statistics.

Using the gini values for the sampled data-points, the surviving intervals are selected as in the case of SSE. Here, unlike SSE, since the sampled points have not been picked from a pre-sorted data-set, one cannot guarantee the location of the ultimate minimum. But, with a certain pre-knowledge about the data of the type mentioned above, it could be possible to find an approximate position of a local minimum in an interval, using techniques like BinarySearch to reduce the time spent in carrying out the search.
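The first of the two ways above (direct comparison of every record against every sampled point) can be sketched as follows. This is only a minimal illustration, assuming the standard gini definitions used by SPRINT, gini(S) = 1 - sum_k p_k^2 and gini_split = (n1/n) gini(S1) + (n2/n) gini(S2), a non-empty data-set, and records represented as dictionaries whose class is stored under a hypothetical key "cls". The function names are illustrative and not taken from the implementation described later.

```python
from collections import Counter

def gini(counts):
    """Gini impurity 1 - sum(p_k^2) of a class-count mapping."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

def gini_split(records, attr, threshold, class_attr="cls"):
    """Weighted gini of splitting records on record[attr] < threshold."""
    left, right = Counter(), Counter()
    for r in records:
        side = left if r[attr] < threshold else right
        side[r[class_attr]] += 1
    n_left, n_right = sum(left.values()), sum(right.values())
    n = n_left + n_right
    return (n_left / n) * gini(left) + (n_right / n) * gini(right)

def best_sampled_split(records, sample, attr):
    """Exact gini at every sampled value of attr: O(n * s) comparisons.

    Returns the pair (lowest gini, threshold value that achieves it).
    """
    return min((gini_split(records, attr, s[attr]), s[attr]) for s in sample)
```

Because every sampled threshold is evaluated against the whole data-set, the gini values are exact for the sampled points; only the choice of candidate points is randomized.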
As explained above, such a method would not yield the best of results and the trees could be larger (longer and wider), but in the cases discussed above it could be worth having an algorithm that builds a larger tree in a shorter time frame.

3.2.2 Sampling a Large Number of Potential Split Points

In most contemporary databases, one does have pre-knowledge about the data itself, in the form of meta-data (data about data). One knows the domain of possible values that a particular attribute could have. It would prove advantageous to exploit this knowledge so that a classifier can be built much faster. Now, note that a classifier is a data-structure such that at every level one makes a decision, selecting the path to traverse depending upon a certain attribute value. The deciding factor is an attribute and the threshold value that determines whether to search (or continue traversal) in the left or the right subtree. This threshold value is chosen so that one gets the best possible tree, in that a decision is reached as soon as possible; there is no requirement that the value must exist in the training data-set (the data-set used to grow the decision diagram). The data-point selection can be done in one of the two ways explained below.

If one has information about the data-set and the range of values each attribute could have, then the sampled data-points could be synthetically generated, so that they lie in the range covering all the possible values one can find in the data-set. Then, one could use the same method as in the algorithm described above to obtain a set of gini values for the selected data-points. Further techniques, like searching for a lower interval within surviving intervals, could also be exploited to zero in on the lowest (best) possible gini value, hence determining the best split.

Another technique one could use is to sample a few records and use only those as potential splits.
This technique could be useful in cases where the data-set contains a lot of repeated data-points. In such cases, if a good sampling technique is used, one can expect the best split point to be sampled for gini calculation.

Depending upon the data-set, one or more of the above methods could be used for sampling. If the range of possible values for an attribute is small and discrete, then it could prove advantageous to synthetically generate a large number of potential split points for that attribute. If the attribute values are continuous, then one could sample a percentage of the records for further calculation. Thus, depending upon the type of attribute, one could change the strategy used for sampling a smaller set of potential split points.

3.2.3 Improvised Storage Structure

In SPRINT and the algorithms discussed so far, the data-set is converted to an intermediate representation, wherein the attributes are split into various attribute lists that can be individually sorted. To preserve the records, a class list is created that has incoming pointers from the individual attribute lists and stores the node that every record belongs to at any stage in the algorithm. The advantage of having separate lists is that one can bring just one list at a time into memory and process it in isolation (detached from the other parts of the record). But, with the algorithms stated above, this could imply a large number of comparisons and of memory swap-ins and swap-outs.

Thus, if one can have whole records stored in memory before the comparison stage, all the comparisons required with a record from the data-set can be done at one time. It then becomes very convenient to compare the z-th attribute of the s-th record from the sampled set with that of the n-th record from the original data-set, for each z. A 3-dimensional array could be one such implementation.
Figure 3.1 suggests a method in which one can store the distribution statistics for each record. The three dimensions are records (or record IDs), attributes and classes. The algorithm to populate the 3-D structure is shown in Figure 3.2 below.

[Figure 3.1: 3-dimensional array, with axes Records, Attributes and Classes]

Algorithm Populate3dArray
    Let the original data-set, N, contain n records
    Sample s records from N to form the sampled set, S
    for each n from N, belonging to current node do
        for each s from S do
            for each attribute z do
                if n.z < s.z then
                    Let k be the class of record n
                    increment the k-th class-position of s for attribute z

Figure 3.2: Algorithm Populate3dArray()

Using the statistics from the 3-dimensional array, one can calculate the gini indexes for each of the sampled records s. The advantage of having such a storage structure is that one can then pass on the information, if need be, from one stage to the other, more specifically from the parent node to one or each of its children. This reduces the time complexity by obviating the need to collect the statistics every time.

But this is not possible in the algorithm as is. The reason is that, while sampling, the best split attribute-value might not have been sampled (and hence could not be used to decide the best split). Using the present algorithm can help reduce the severity of this problem. As an example, in Figures 3.1 and 3.2, assume that the best split is obtained at attribute value 12.1, which was not sampled. Using the technique of the thirds, a closer value, viz. 12, was obtained, which could at times prove to be better than 12.1 itself, in that the resultant tree could be smaller.

Using the method of the thirds, a closer sampling interval is obtained for better granularity. This can aid in zeroing in on the best split point, or a near-best split point, for algorithms like the ones described in [11] or a modification thereof, as suggested in section 3.2.1.
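Algorithm Populate3dArray (Figure 3.2) can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the thesis implementation: records are dictionaries, the class is stored under a hypothetical key "cls", and the 3-dimensional array is a nested list indexed as counts[sampled record][attribute][class].

```python
def populate_3d_array(records, sample, attributes, classes, class_attr="cls"):
    """Build counts[i][j][k]: the number of records r of class classes[k]
    with r[attributes[j]] < sample[i][attributes[j]]."""
    class_pos = {c: k for k, c in enumerate(classes)}
    counts = [[[0] * len(classes) for _ in attributes] for _ in sample]
    for r in records:                        # every record of the current node
        k = class_pos[r[class_attr]]
        for i, s in enumerate(sample):       # every sampled record
            for j, z in enumerate(attributes):
                if r[z] < s[z]:              # the n.z < s.z test of Figure 3.2
                    counts[i][j][k] += 1
    return counts
```

For a sampled record i and attribute j, the row counts[i][j] is the class distribution of the records lying below the candidate split. The distribution on the other side follows by subtracting it from the class totals of the node, which is all that a gini calculation needs.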
In the present algorithm and the other randomized algorithms described above, the sampling technique reduces the number of gini calculations being performed, hence reducing the time required at every stage of the algorithm. Yet the major bottleneck, viz. the collection of class-distribution statistics for gini calculations, remains cost-inefficient. The following sections address the issue.

3.2.5 Accelerated Collection of Statistics

In most of the algorithms used for building decision trees, every record is compared with every other record for the collection of statistics used in gini calculations, with SPRINT as an exception. In the above randomized algorithms also, every record s from the sample set S is compared with every record n from the original data-set N. Since the number of records in S is near constant, the complexity of the overall comparison is O(n), nearly linear. But in scenarios where a higher sampling rate is required, the performance would deteriorate. The following approach helps in reducing the number of comparisons.

Algorithm FastStats
    Let the original data-set, N, contain n records
    Sample s records from N to form the sampled set, S
    for each attribute z do
        Sort the s records according to the z-th attribute
        Insert the z-th attribute-values (only) for the sorted records in the 3-d array
    for each n from N, belonging to current node do
        Let k be the classification of n
        for each attribute z do
            Use BinarySearch() to find the first record in S, such that s.z < n.z. Let it be q.
            Increment the k-th cell contents in the class dimension for q.

Figure 3.3: Algorithm FastStats()

[...] can be done faster than record-comparisons, on any standard machine. Also, due to its inherent nature, the PrefixComputation algorithm is parallelizable. Horowitz et al. [16] cite parallel algorithms for the computation of prefixes.
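The combination of binary search and prefix computation can be illustrated for a single attribute as follows. This is a sketch under assumptions rather than the thesis implementation: the binary search is taken to locate the first sampled value strictly greater than the record's value, which matches the n.z < s.z test of Populate3dArray, and the function and parameter names are hypothetical. Python's standard bisect and itertools.accumulate stand in for BinarySearch and PrefixComputation.

```python
from bisect import bisect_right
from itertools import accumulate

def fast_stats(values, labels, sampled_values, classes):
    """Class counts of records lying below each sampled threshold.

    Returns (thresholds, stats), where stats[q][k] is the number of records
    of class classes[k] whose value is < thresholds[q].
    """
    thresholds = sorted(set(sampled_values))      # O(s log s) sort of the sample
    class_pos = {c: k for k, c in enumerate(classes)}
    # bins[q][k]: records of class k whose first strictly greater
    # threshold is thresholds[q]
    bins = [[0] * len(classes) for _ in thresholds]
    for v, c in zip(values, labels):
        q = bisect_right(thresholds, v)           # O(log s) per record
        if q < len(thresholds):                   # values >= every threshold are skipped
            bins[q][class_pos[c]] += 1
    # prefix computation per class turns bin counts into cumulative counts
    per_class = [list(accumulate(b[k] for b in bins))
                 for k in range(len(classes))]
    return thresholds, [list(row) for row in zip(*per_class)]
```

Each record costs one binary search instead of s comparisons, so collecting the statistics for one attribute takes O(n log s) time plus an O(s) prefix pass per class.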
Putting it together, the techniques of sampling potential split data-points, the 3-dimensional storage structure for the sampled records and the class distribution, and the accelerated collection of statistics for gini calculations can help in achieving a very high speed-up at no loss of accuracy.

3.3 Multi-level Sampling

Unlike traditional databases, wherein (ideally) no two records are identical, web data could contain a lot of duplicate records. In fact, in many cases the data could even be contradictory. Consider a data-set with n attributes, (n - 1) of them independent and one dependent (the classification of the record). Two records A and B in the data-set are contradictory if all their n - 1 independent attribute values match but the dependent attribute does not. Thus, while generating the decision tree, either A or B or both would have to be eliminated.

This fact can be made use of while sampling the data-set, such that a small number of records are used for classification. The techniques used before ensure that the data-points at which gini values are calculated are a good sample of the data-set, such that the gini values are not calculated for duplicates. This improves the performance at the cost of the size of the tree, but does not affect the accuracy. The following algorithm is approximate, in that the output learner could produce inaccurate results in a few cases.

The algorithm proceeds by drawing a random sample from the data-set. The sampling percentage can vary according to the degree of inaccuracy tolerated. Using these sampled records and data-points, one can build a learner using any algorithm stated above. It can be argued that, due to the nature of web data, the classifier would be fairly accurate for a good sample of the data.
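One possible reading of the two sampling levels is sketched below. It assumes that the first level selects the records used to grow the tree and the second level selects, from those, the data-points at which gini values are evaluated. That interpretation, the function name and the percentage parameters are assumptions made for illustration only.

```python
import random

def two_level_sample(records, first_pct, second_pct, seed=None):
    """Draw a first-level sample of the records, then a second-level
    sample of candidate split points from within the first-level sample.

    Assumes a non-empty data-set; both levels sample without replacement.
    """
    rng = random.Random(seed)
    n_first = max(1, round(len(records) * first_pct / 100))
    first = rng.sample(records, n_first)       # records used to build the tree
    n_second = max(1, round(len(first) * second_pct / 100))
    second = rng.sample(first, n_second)       # candidate split points
    return first, second
```

With 1000 records, first_pct=20 and second_pct=5, the tree would be grown from 200 records with gini values evaluated at 10 of them. Both percentages trade accuracy against construction time.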
The following chapter comments on the accuracy of the algorithm that uses two levels of sampling for building a classifier. The accuracy can be further improved by iterating through the process a fixed number of times, using an incremental algorithm as described in section 3.5.

3.4 "Say No to Randomization!"

Randomized algorithms for building decision diagrams can prove to be most beneficial to applications that would only benefit from a classification tool. Also, they can be very effective in applications where the tree needs to be re-constructed over and over again, at short intervals of time. In applications where the data, generated by web-clicks, could be misleading and inconsistent to begin with, the classifier would only be as good as the data itself. Hence, in such cases, using the two-level sampling technique to reduce the time required to build the tree, and an incremental algorithm (discussed in section 3.5) to better the classifier, would be the best solution. But randomized algorithms do have a few disadvantages and can be unacceptable in a few situations.

Some of the disadvantages of the randomized algorithms for building classifiers are:

- The method of selectively calculating the gini indexes of a few sampled data-points ensures that the time complexity of the node split action is near O(n log n). But it does not ensure that the best split attribute is exploited at every stage. As a result, the trees could be much wider and longer than those of a traditional top-down algorithm that selects the best gini value at every node split, e.g. SPRINT.
- In the case of two-level sampling, the first-level sampling, if not sufficient, could lose out on some non-trivial data-points, leading to a larger inaccuracy rate for the classifier at large. Thus, there is a trade-off between accuracy and time spent to build the decision tree.
Applications that have a high risk factor or are life-threatening could benefit little from such techniques, for the following reasons:

- The classifier for such applications would be expected to have a very high accuracy rate, in the absence of which the application would produce faulty results.
- It could also be required to have a compact tree for the purpose of classification, so that the run-time of a query on the tree is reduced. With longer trees the application would not be as beneficial. (This is theoretically true; however, as can be observed in the chapter that follows, the lengths of the trees formed using a randomized algorithm are comparable with those of the most compact representation, and hence the run-times are also comparable.)

Inspired by the techniques and data-structures used in the randomized algorithms, the following algorithms use the complete data-set for building decision diagrams, without randomization or sampling. The trees generated using the following algorithms are the most compact possible, because at every stage the best split value is selected.

3.4.1 Accelerated Collection of Statistics, the Reprise

This algorithm follows from the randomized version of the FastStats algorithm. As described before, the 3-dimensional array (storage structure) is used to store the class distributions prior to calculating the gini indexes. An algorithm similar to the one described before is used for the collection of statistics. Here, the records are not sampled. The entire data-set, in sorted order, is stored in the 3-dimensional structure. The complexity of the algorithm is hence O(n log n), the time required to sort the entire data-set. But this needs to be done just once for the entire data-set. Unlike in the randomized methods, since all the records have been sorted once, one does not need to sort them again at every node.
An additional O(n log n) time is required at every node for the collection of statistics and for populating the class dimension. This is done using BinarySearch, as before, and then PrefixComputation is performed to obtain the class-distribution statistics. Since the operations happening at every stage are mostly mathematical, one can expect a speed-up over an algorithm that collects statistics by record comparisons.

Yet the statistics, in their current format, cannot be carried forward from a parent node to any of its children. The reason is that the statistics give the class distribution of the records that belong to that node and are less than or equal to the current record's attribute-value, while the node can split at any attribute-value. The following algorithm stores the statistics in such a format that they can be passed over to one of the children nodes.

3.4.2 Statistics ... To Go!

SPRINT can have a very high speed-up when parallelized. One interpretation of a parallelized version of SPRINT would be to assign every attribute list to a processor that calculates the statistics for that attribute list and does the gini calculation. That is, the data-set is vertically fragmentable, to be processed in parallel. But, for every attribute list, the statistics are linearly incremented, and hence it would be non-trivial to parallelize the data-set horizontally as well.

The present algorithm stores the statistics in such a way that the storage can be parallelized horizontally and vertically over the data-set. Also, they can be passed over to the child node without loss of content. The algorithm, for the sake of simplicity, assumes that all the values in an attribute list are unique.
This assumption does not hurt the sequential version of the algorithm, but for the parallel version an extra (compensating) step would need to be done to take care of duplicate elements. The algorithm proceeds as follows.

Sort each attribute list individually, according to the attribute values. Scanning the records sequentially, for every record j, the k-th location in the class dimension is initialized to one, where k is the classification of that record. These are the preliminary statistics for the attribute list. Using these preliminary statistics, a prefix computation is done over all the records to get the actual statistics, or class distribution.

Since the preliminary statistics are merely class occurrences of elements in the attribute list, when a node is split these can certainly be passed over to one of the children nodes. At the child node, the algorithm can use the same preliminary statistics to obtain the actual statistics (class distributions) using prefix computation. For the parallel version of the algorithm, the prefix computation can be done in parallel, using the algorithms described in Horowitz et al. [16].

Figures 3.4 and 3.5 depict the preliminary statistics and the actual statistics for an attribute with no duplicates. Figure 3.6 shows the preliminary statistics being carried forward after the node split.

    Attribute value   1   3   4   5   8   9
    Class A           1   -   -   1   1   -
    Class B           -   -   1   -   -   -
    Class C           -   1   -   -   -   1

Figure 3.4: Preliminary Statistics for an attribute list, no duplicates

    Attribute value   1   3   4   5   8   9
    Class A           1   1   1   2   3   3
    Class B           0   0   1   1   1   1
    Class C           0   1   1   1   1   2

Figure 3.5: Actual Statistics for an attribute list, no duplicates

[Figure 3.6: Preliminary Statistics of children (the preliminary statistics of the parent carried over to its children after the split)]

To take care of duplicate attribute values, one of the following two methods could be used.

While generating the preliminary statistics, each entry is treated as unique, and the preliminary statistics are collected as before.
Then, at the stage of prefix computation, the normal procedure is performed, but for every unique attribute-value x a pointer is maintained to the first occurrence of x. As soon as the attribute-value changes, an equalizer function is applied to all the occurrences of x, which necessarily lie between its first and last occurrence (attribute lists are always maintained sorted). Then, one can proceed to a new value of x and repeat the process to obtain the actual statistics.

Another method of solving the duplicate attribute-values problem is to have an extra valid bit in the 3-dimensional array for each entry in the attribute-record plane, i.e. for each attribute-value in the attribute list. During the process of prefix computation, the valid bit is enabled only for the last occurrence of every attribute-value x; the other (prior) occurrences are disabled. Only the enabled, or valid, attribute values are considered for gini calculation. However, while splitting the nodes, since only the preliminary statistics are passed over, the procedure remains unchanged.

The time complexity of the algorithm is necessarily O(n log n) but, since only mathematical computations are performed at each level in the tree, the algorithm can be expected to perform better than SPRINT. Also, the duplicate elimination technique mentioned above reduces the number of gini calculations being performed, yet produces the most compact form of the tree.

3.5 Incremental Decision Trees

As discussed before, the need for an incremental algorithm is dire in applications where new data is being generated at a high rate and it is essential to use it in the process of decision making. In such a scenario, re-building the tree periodically could be a solution, but it has certain drawbacks. If the most compact form of the tree is required, one that is completely accurate (with respect to the training data-set), the tree building algorithm could be time consuming.
In that case, there would be intervals of time wherein the tree either would hold only the old data (with the new data uncommitted to the decision-maker) or would itself be unavailable for decision-making. To meet the requirement for an incremental algorithm that is continually available and is always accurate, the following algorithm could be used.

Consider a decision tree, T, having m levels and representing n records. The tree T is similar to the ones described before, except that the leaf nodes hold pointers to the records that they represent. Let A be a new record that has to be inserted into the tree. The classification of A is c. To insert A into T, the tree is traversed starting at the root node, along the path determined by the attribute values of A. At every stage, there is one of two cases:

- A lands at a non-leaf node with the split condition attribute j < x. If A.j < x, traverse the left subtree, else the right subtree, given that the left subtree satisfies the condition and the right subtree falsifies it.
- A lands at a leaf node L, symbolizing class C. In this case, there are two possibilities:
  - The class of A, i.e. c, conforms with the class of the node, viz. C. In this case, the record is added to the pool of records embodied by L.
  - The class of A, i.e. c, does not conform with the class of the node, viz. C. In this case, the entire pool of records represented by L, together with A, needs to be put together in the form of a tree. Any algorithm could be used at this stage to build a tree from the already-existing records of the node and A. The root of this new tree replaces L. In this case, the height of the tree could possibly increase by one.

The resultant tree represents n + 1 records, and has a height that satisfies m ≤ height ≤ m + 1.

The above approach will serve as an incremental algorithm but, as the number of records in the classifier increases, could prove to be highly inefficient.
This is because the tree increases in height at the leaf level only, maintaining the same root node and other non-leaf nodes. Thus, once a node becomes a non-leaf node, it remains one permanently. As an alternative, the entire data-set could be used to re-build a new tree T' when the number of records represented by T reaches 150% of its original value, i.e. when the record count crosses 1.5n. One could also maintain the old tree T until T' has been created using the records held in the leaf nodes of T. It can be argued that such an approach leads to the most compact tree structure frequently, while the tree predicts accurate results all the time.

A few algorithms described in this chapter have been implemented. The implementation details and results are the subject of the next chapter.

CHAPTER 4
IMPLEMENTATION AND RESULTS

A few of the algorithms described in the previous chapter have been implemented. This chapter provides the implementation details and the performance results. SPRINT is taken to be the benchmark for comparison.

4.1 Implementation

The present section describes the author's experience in implementing the algorithms. Java is selected as the language for implementation, and the data sources are simple flat-files. In the subsections that follow, methods are suggested for the use of alternative data sources. The datastructures, techniques and tools used for faster execution of the algorithms are discussed below.

4.1.1 Datastructures

The datastructures used for the implementation of the algorithms have been defined in terms of generic, re-usable Java classes. A few of the native Java classes have been used in some cases, without significantly affecting performance. The MyVector class, which has better performance than Java's Vector class, has been defined to replace arrays in the algorithms. The structure and implementation details of the MyVector class are the subject of section 4.1.6.
4.1.2 Implementing SPRINT

SPRINT is used as a benchmark of performance as well as accuracy: the exact randomized algorithms have been compared with SPRINT to test for performance, and the approximate ones for accuracy. The comparison characteristics are given in section 4.2. For an accurate measure, SPRINT has been implemented in Java using, where needed, the same generic datastructures as the ones used for the randomized algorithms.

In incremental algorithms, for the case in which there is a disagreement between the class of the new record and that of the leaf node, the data-points represented by the leaf node and the new record have to be re-classified. Randomized algorithms cannot be used efficiently here, because they tend to have reduced performance for a small data-set. The number of records contained in a leaf node is of the order of a few hundred for a data-set with about 50000 tuples. Hence, the randomized algorithms are a worse option. Thus, for re-classification of the leaf records, a SPRINT object is used.

Since SPRINT is an exact classification algorithm, the classification of test data obtained using the randomized algorithms is tested using SPRINT.

4.1.3 Implementing the Randomized Algorithms

Random samples can be generated using one of the following methods; each has a complexity of O(n) and can be used in different scenarios.

One method traverses the data-set once, completely. At every record, a coin is flipped: a random number between 0 and 100 is generated and compared against the sampling percentage to decide whether the record is to be sampled or not. This method proves to be useful if one needs to scan through the data-set anyway to collect information. The method also guarantees unique records in the sampled set. It can be used in cases such as determining the number of records in the entire data-set belonging to each class, wherein a complete prior scan of the entire data-set is required.
Another way to sample records is to generate a set, S, of the required number of random record positions. Then, for every s ∈ S, the s-th record is selected from the data-set. This method can be effective where the data-set is memory resident and there is no need to scan through the entire data-set.

In the implementation of the randomized algorithms, the data-set is scanned once and stored in the form of arrays. At every stage in the algorithm, i.e. at every node, a fresh sample is generated, for the purpose of generating an un-biased tree. At deeper levels in the tree, the number of records that need to be classified reduces, and hence only a small number of records need to be sampled; it would be cost-ineffective to scan through the data-set to select a very few records. Hence, the latter method is used. To ensure that only unique records are sampled, a bit array is maintained, which is tested during the selection of the set of random numbers (samples).

To maintain pointers to the data-points at the leaf-node level, the DecisionTreeNode class, extended from the generic NodeBinary class, has been defined, to aid in defining a generalized object usable by incremental algorithms.

4.1.4 Iterative as Opposed to Recursive

Java copies the parameters and objects across functions and scopes. Nested scopes produce duplicated data, and the space occupied by it cannot be reclaimed unless the scope is exited. Due to the nature of the algorithms for building decision diagrams, multiple nested scopes are generated in a recursive (easier) version of the algorithm. The algorithm would necessarily have to traverse the left-most path before entering the right sub-tree. This could cause memory to thrash.

Both iterative and recursive algorithms have been implemented for SPRINT as well as for the random decision tree generators. Iterative implementations tend to have lower execution speeds, but are optimized in memory usage, as the same data-store can be iterated through for different nodes.
The decision tree nodes have to be stored in an array format for the iterative implementation, while the recursive [...]

[Figure 4.1: Alternative data sources. Labels: Decision Tree, Web Pages, Relations, DB or Flat-files, Data Warehouse]

[Figure 4.2: Embedded Server - Remote Client architecture. Labels: Embedded Server, TCP communication, Classification Algorithm, Remote Client]

[...] continual basis. Failure to do so can result in unpredictable results. When an array needs to be expanded dynamically, it has to be re-created at a different location and the existing data has to be copied. This poses an extra overhead for array management.

To do away with the above disadvantages, the MyVector class is defined, which uses arrays for internal storage in the form of blocks. The storage can be extended by adding blocks to the current store, making space for new data without having to move the old data. Thus, it does away with the overhead of managing bounds and of having to copy data for extendible storage. Figure 4.3 depicts the architecture of MyVector class objects. Java provides a Vector class that also does away with the overhead of having to manage the bounds. The MyVector class, however, tends to perform better than the Vector class.

[Figure 4.3: Architecture of MyVector class objects. Label: Linear Extendible Array Blocks]

4.2 Results

This section reports the performance results for the implemented algorithms. Various tests were run to compare the performance and accuracy of the randomized algorithms. The tests were done on eclipse, a SUN-Sparc 8 processor machine running Solaris 5.6. Two main types of test were run: performance tests for speed and accuracy tests for prediction reliability.

4.2.1 Performance

A randomized algorithm is expected to perform better (in terms of the time required for execution) than a sequential (non-randomized) algorithm.
One of the randomized algorithms discussed before, viz. accelerated collection of statistics using binary search and prefix computation, is compared against SPRINT. The tests were run on the machine described above, with a data-set of 43500 records, 9 independent attributes and 1 dependent attribute (the classification). The tests were run at various levels of first- and second-level sampling. Figure 4.4 plots the two-level sampling results against time. As can be seen, it outperforms SPRINT by a large margin. The algorithm scales more or less linearly.

[Figure 4.4: Performance of the two-level sampling algorithm. Horizontal axis: first-level sampling in %. Legend: dot-dashed line (SPRINT), dotted line (5%), dashed line (2%) and solid (un-cut) line (1% second-level sampling)]

Another performance comparison is between two randomized algorithms: the potential split points algorithm against the accelerated collection of statistics algorithm. As expected, the latter performs better than the former, due [...]

[Figure 4.5: Accuracy of the two-level sampling algorithm for two different data-sets. Horizontal axis: first-level sampling in %. Legend: solid (un-cut) line, a data-set with 15000 records; dashed line, a data-set with 846 records]

CHAPTER 5
CONCLUSIONS AND FUTURE WORK

The current work discusses the need for decision-makers for web and other applications. In some cases, there is a dire need for the model to be an exact embodiment of the input data-set or training set. In others, like click-stream analysis, a fairly good prediction made at run-time using the model can help the application or boost profits. The other requirement concerns the time required to build the model and the time required to run a query on it.
Depending on the application in use, one or both need to be optimized, a decision made while choosing the algorithm used to build the tree.

Randomized algorithms for building decision diagrams have been discussed. These have varying time complexities (static build-time and dynamic query-time) and accuracy rates. For life-critical applications, an exact classifier with an optimized run-time would be required, while for a business application an algorithm that builds trees in the smallest possible time frame at a slight expense of accuracy could be desirable or acceptable. In cases where the data itself is inaccurate, one could profit from an algorithm like the latter.

Incremental algorithms are required in scenarios where data flows in continuously and changes, if any, must be reflected in the model at the earliest. The incremental algorithm discussed here optimizes both the static and the dynamic time and is yet incremental in nature, achieving the best of both worlds.

Thus, using the set of algorithms discussed, most applications can benefit by achieving, in much less time, almost the same result as an exact classifier would produce.

BIOGRAPHICAL SKETCH

Vidyamani Parkhe was born on April 21st, 1976, in Indore, India. He completed his bachelor's degree in electrical engineering at the Govt. College of Engineering, Pune, India, in June 1998. He worked as an intern development engineer at the Loudspeaker Development Lab., Philips Sound Systems, Pimpri, India. He joined the University of Florida in August 1998, where he worked as a research and teaching assistant for several courses. He completed his Master of Science degree in computer engineering at the University of Florida, Gainesville, in December 2000. His research interests include randomized algorithms, data structures, databases and data mining.