


DATA MINING MEETS E-COMMERCE: USING DATA MINING TO IMPROVE CUSTOMER RELATIONSHIP MANAGEMENT

By

DARRYL M. ADDERLY

A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE

UNIVERSITY OF FLORIDA

2002

Copyright 2002 by Darryl M. Adderly

I would like to dedicate this thesis to a recent blessing in my life, Buttons Kismet Adderly.

ACKNOWLEDGMENTS

I would like to first thank God for providing the opportunity and giving me the strength to complete this thesis. I really appreciate Dr. Joachim Hammer's patience and guidance throughout the duration of this process. I thank Ardiniece "Nisi" Caudle and John "Jon B." Bowers for assisting me with administrative items. I thank the Office of Graduate Minority Programs (OGMP) for the financial assistance. I would also like to thank my bible study group (Adrian, JD, Jonathan, Kamini, and Ursula) for all of their prayers and spiritual support, and last but not least Jean-David Oladele for the friendship and support up until the very last minute.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT
1 INTRODUCTION
1.1 Motivation for Research
1.2 Thesis Goals
2 RESEARCH BACKGROUND
2.1 Association Rule Mining
2.2 Clustering
2.2.1 Partitioning Algorithms
2.2.2 Hierarchical Algorithms
2.2.3 Density-based Methods
2.2.4 Grid-based Methods
2.2.5 K-means
3 GENERAL APPROACH TO WEB USAGE MINING
3.1 The Mining of Web Usage Data
3.1.1 Preprocessing Data for Mining
3.1.2 Pattern Discovery
3.1.3 Pattern Analysis
3.2 Web Usage Mining with k-means
3.2.1 Our Web Usage Mining Approach
4 ARCHITECTURE AND IMPLEMENTATION
4.1 Architecture Overview
4.1.1 Phase 1: Preprocessing
4.1.2 Phase 2: Pattern Discovery
4.1.3 Phase 3: Pattern Analysis
4.2 Algorithm Implementation
5 PERFORMANCE ANALYSIS
5.1 Experimental Evaluation
5.2 Web Clusters
6 CONCLUSION
6.1 Contributions
6.2 Proposed Extensions and Future Work
LIST OF REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

2-1 Data mining algorithms
5-1 Cluster representations

LIST OF FIGURES

3.1 High Level Web Usage Mining Process
4.1 Our Web Usage Mining Architecture
4.2 The ReadData module
4.3 The ClusterValues module
5.1 A sample SQL*Loader control file
5.2 Order clustering results
5.3 Data Mining Software Order clustering results
Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Master of Science

DATA MINING MEETS E-COMMERCE: USING DATA MINING TO IMPROVE CUSTOMER RELATIONSHIP MANAGEMENT

By Darryl M. Adderly

December 2002

Chair: Joachim Hammer
Major Department: Computer and Information Science and Engineering

The application of data mining techniques to the World Wide Web, referred to as Web mining, enables businesses to use knowledge discovered from the past to understand the present and make critical business decisions about the future. For example, this can be done by analyzing the Web pages that visitors have clicked on, the items that they have selected or purchased, or the registration information provided while browsing. To perform this analysis effectively, businesses find the natural groupings of users, pages, etc., by clustering data stored in their Web logs. The standard k-means algorithm, an iterative refinement algorithm, is one of the most popular clustering methods used today and has proven to be an efficient clustering technique. However, its numerous iterations over the data set and repeated recalculation of cluster centroid values are time consuming. In this thesis, we improve the time complexity of the standard algorithm. Our single-pass, non-iterative k-means algorithm scans the data only once, calculating all the point and centroid values based on the desired attributes of interest, and places the items within their respective cluster thresholds. Our Web mining process consists of three phases, preprocessing, pattern discovery, and pattern analysis, which are described in detail in the thesis. We use our implementation of the k-means algorithm to uncover meaningful Web trends and, after analyzing the results, provide recommendations that may have improved the visitor's website experience.
We find that the clustering results of our algorithm provide the same amount of knowledge for analysts as one of the industry's leading data mining applications.

CHAPTER 1
INTRODUCTION

Consumers are conducting business via the Internet more than ever before due to the economical cost of high-speed Internet service providers (ISPs) and the high level of security (secure transactions). However, recognition of a company's online presence alone does not ensure long-lived prosperity. Customer retention and satisfaction strategies remain among the most important issues for organizations expecting profits. Thus, companies work hard to improve and/or maintain their customer relationships. To achieve this, companies must capture the navigational behavior of visitors on their website in a web log and subsequently analyze this data to understand and address their consumers' business needs.

1.1 Motivation for Research

The relationship between companies and customers has evolved into a significant research concept called Customer Relationship Management (CRM). CRM can be defined as a process that manages the interactions between a company and its customers [The02]. CRM solutions create a mutually beneficial relationship between the customer and the organization and are critical to a company's future success. The ultimate goals of CRM are to acquire new customers, retain old customers, and increase customer profitability [CY00]. In the current economic slowdown, companies are using their limited budgets to reduce operational costs or increase revenues while concentrating on improving efforts to acquire new customers and develop customer loyalty. The sources of web-based CRM customer data (user profiles, access patterns for pages, etc.) are customer web interactions. The advent of the World Wide Web (WWW) has caused an evolution of the Internet. Information is now readily available from any location in the world at any hour of the day.
Information on the WWW is important not only to individuals but also to business organizations for critical decision-making. This explosion of information sources on the web has increased the necessity of automated tools to find the desired resources and to track and analyze usage patterns. An electronic trail of data is left behind each time a user visits a website. The megabytes and gigabytes of data logged from these trails seem not to yield any information at first glance. However, when analyzed intelligently, those logs contain a wealth of information providing valuable knowledge for business intelligence solutions. Early attempts to understand the data with statistical tools and online analytical processing (OLAP) systems achieved limited success, that is, until the concept of data mining was introduced. Data mining is the process of discovering hidden interesting knowledge from large amounts of data stored in databases, data warehouses, or other information repositories. Web data mining, or web mining, can be broadly defined as the discovery and analysis of useful information from Web data. Online businesses learn from the past, understand the present, and plan for the future by mining, analyzing, and transforming records into meaningful information. Web mining, when viewed in data mining terms, can be said to have three operations of interest: clustering (finding natural groupings of users, pages, etc.), associations (which URLs tend to be requested together), and sequential analysis (the order in which URLs tend to be accessed) [JK98]. The first two have proven to be of greater interest, and this research heavily favors the use of clustering techniques and algorithms to support web mining. Data clustering is a process of partitioning a set of data into a set of classes, called clusters, with the members of each cluster sharing some interesting common properties [CGHK97].
Clustering itself is the process of organizing similar items into disjoint groups. Investigating the properties of the set of items belonging to each group illuminates relationships that may otherwise have been overlooked. The k-means algorithm is one of the most widely used techniques for clustering [AlD95]. It has been shown to be effective in producing good clustering results for many practical applications. The two main goals of clustering techniques are to ensure that the data within each distinct cluster is homogeneous (group items are similar) and that each cluster differs from the other clusters (data belonging to one cluster should not be present in another cluster). The k-means algorithm is an iterative refinement algorithm with an input parameter of k predefined clusters. "Means" simply represents the average, as in the average location of all members of a particular cluster, conceptualized as the centroid. The centroid of a cluster, often termed the representative element, is an artificial point in the space of records that represents the average location. The time complexity of the k-means algorithm is heavily dependent on the point (centroid) selection process of its first step. Some implementations require user-provided or randomly generated starting points, but most implementations of the k-means algorithm do not address the issue of initialization at all. The remaining steps of the algorithm focus on minimizing the intra-cluster error (among items belonging to a specific cluster) by using a distance function (e.g., the Euclidean distance [Bla02a] or Manhattan distance [Bla02b] function) and optimizing the inter-cluster (between data items of different clusters) relationships. The standard algorithm typically requires many iterations over a data set to converge to a solution, accessing each data item on each iteration. This approach may be sufficient for small data sets, but it is clearly inefficient when scanning large data sets.
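The iterative refinement described above (initialize k centroids, assign each point to its nearest centroid, recompute the means, and repeat until convergence) can be sketched as follows. This is an illustrative Python sketch of the textbook algorithm, not the implementation developed in this thesis; random initialization and Euclidean distance are just two of the common choices mentioned above.

```python
import math
import random

def kmeans(points, k, max_iter=100):
    """Standard iterative k-means on a list of numeric tuples."""
    # Step 1: pick k initial centroids (here: k random data points).
    centroids = random.sample(points, k)
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [math.dist(p, c) for c in centroids]
            clusters[d.index(min(d))].append(p)
        # Step 3: recompute each centroid as the mean of its members.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                new_centroids.append(tuple(sum(m[j] for m in members) / len(members)
                                           for j in range(dim)))
            else:
                # A "dead" cluster: keep the old centroid rather than drop it.
                new_centroids.append(centroids[i])
        # Step 4: stop once the centroids no longer move (convergence).
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters
```

Note that every iteration of steps 2 and 3 touches every data item, which is exactly the cost the single-pass variant developed in this thesis tries to avoid.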
The k-means algorithm has proven to be well suited to cases where the resulting clusters have similar spherical shapes. However, when data items in a given cluster are closer to the center of another cluster than to that of their own (for example, when clusters have widely different sizes or non-convex shapes), this algorithm may not be as useful. In comparison with other clustering methods, revised k-means-based methods are promising for their efficient processing of large data sets; however, their use is often limited to numeric data. For these reasons, we propose yet another version of the k-means algorithm to improve performance when applied to large data sets of high dimensionality. Also, very little research has been done on applying the k-means algorithm to web log data because of its non-numeric nature. In our experimental section, we show that the application of our algorithm to web mining is comparable to, and in some instances outperforms, the clustering technique of one of the industry's leading data mining applications.

1.2 Thesis Goals

In web mining, the goal is to uncover meaningful web trends to understand and improve the visitor's website experience. Clustering techniques are exercised to enable companies to find the natural groupings of customers. The standard k-means algorithm, by design, optimally partitions a data set into clusters of similar data items, after which the human analytical process begins. In this thesis, we have developed a single-pass, non-iterative k-means algorithm. We attempt to improve the time complexity of the standard algorithm, without refining the initial points, when applied to large data sets. The traditional algorithm repeats the clustering steps until cluster assignments no longer change, scanning the data set as often as necessary. Multiple scans of the data set increase the cluster quality at the expense of execution time. Many data sets are large and cannot fit into main memory.
Scanning a data set stored on disk or tape repeatedly is time consuming. Our algorithm scans a portion of the data set (residing in memory) only once, calculating all the point values and finally clustering the items accordingly. We use only a sample and a reduced number of attributes for the sake of efficiency and scalability with respect to large databases. Dead clusters are created when a centroid does not have any members in its cluster, which may arise from bad initialization. We address this issue by calculating the centroids based on the number of clusters k and the deviation between the minimum and maximum point values. The application should handle all the data types accepted by the database application, some of which are very complex (e.g., hypertext data). Applying the k-means algorithm to the data allows us to group customers together on the basis of similarity by virtue of the attributes chosen and, after analyzing the results, to get a good grasp of consumers' behavior and make intelligent predictions about their future behavior. Visitor behavioral predictions serve as a good starting point for improving a website's navigational experience. The suggestions and/or recommendations resulting from the analysis need to be implemented to discover the true success of the algorithm. The data set used in the experimental section was obtained from the KDD Cup 2000¹ competition and contains data from an e-commerce site that no longer exists; therefore, we were unable to confirm the predictions made from our analysis of the results. We will show that our method is superior in speed when compared to the standard k-means algorithm, while maintaining cluster quality comparable to one of the industry's leading data mining products. The rest of this thesis is organized as follows. Chapter 2 shares background information on related research. Chapter 3 explains our approach to web mining with k-means.
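The idea of deriving centroids from k and the deviation between the minimum and maximum point values, and then assigning every item in a single scan, can be illustrated with the one-dimensional sketch below. The function name and the even spacing of the centroids are our assumptions for illustration only; the actual implementation is described in Chapter 4.

```python
def single_pass_kmeans(values, k):
    """One-pass clustering sketch: centroids are spaced evenly between the
    minimum and maximum values, then every item is assigned in a single scan.
    (Hypothetical illustration of the thesis idea, not the real module.)"""
    lo, hi = min(values), max(values)
    step = (hi - lo) / k
    # Place each centroid at the middle of its interval; since every interval
    # spans part of the data range, no "dead" cluster can be created by a
    # centroid falling outside the data.
    centroids = [lo + step * (i + 0.5) for i in range(k)]
    clusters = [[] for _ in range(k)]
    for v in values:  # the single scan over the data
        # Assign by interval (cluster threshold) instead of repeated
        # distance computations against every centroid.
        idx = min(int((v - lo) / step), k - 1) if step else 0
        clusters[idx].append(v)
    return centroids, clusters
```

Because each item is examined exactly once and assignment is a constant-time threshold test, the cost is linear in the number of items, in contrast with the repeated full scans of the standard algorithm.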
Chapter 4 describes the architecture used for the development of our algorithm and its implementation. Chapter 5 analyzes the performance of our algorithm, and we conclude with a summary of the thesis, a review of our contributions, and future work in Chapter 6.

¹ http://www.ecn.purdue.edu/KDDCUP/

CHAPTER 2
RESEARCH BACKGROUND

Clustering techniques have been applied to a variety of areas including machine learning, statistics, and data and web mining. As widely used as they are, the fundamental clustering problem remains the task of grouping together similar data items in a given data set. There are four main classifications of clustering algorithms: partitioning algorithms, hierarchical algorithms, density-based methods, and grid-based methods. There has been a plethora of proposals to improve or refine existing algorithms for each respective approach. The k-means algorithm, which is classified as a partitioning algorithm, is not an exception. Enhancements to the traditional k-means algorithm involve, but are not limited to, refining the initial points, improving scalability with respect to large data sets, minimizing the clustering error, and reducing the number of clustering iterations (data set scans). Data mining is the process of discovering hidden interesting knowledge from large amounts of data stored in databases, data warehouses, or other information repositories. The main idea behind data mining is to identify novel, valid, potentially useful, and ultimately understandable patterns in data. The spectrum of uses of data mining tools ranges from financial and telecommunications applications to government policy settings, medical management, and food service menu analysis. Different data mining algorithms are more appropriate for certain types of problems. These algorithms can be classified into two categories: descriptive and predictive. Descriptive data mining describes the data in a summary manner and presents interesting general properties of the data.
Predictive data mining constructs one or more sets of models, performs inference on the available data, and attempts to predict the behavior of new data sets. These two styles are also known as undirected and directed data mining, respectively. The former uses a bottom-up approach, finding patterns in the data and leaving it up to the user to determine whether or not these patterns are important. The latter uses a top-down approach and is used when one has a good grasp of what one is looking for or would like to predict, applying knowledge gained in the past to the future. There are several classes of algorithms applicable to data mining, but the most commonly used are association rules [AS94, LOPZ97], Bayesian networks [Myl02], clustering [Fas99], decision trees [Mur98], and neural networks [CS97]. Table 2-1 provides a brief overview of data mining algorithms. The application of data mining techniques to the WWW, often referred to as web mining, is a direct result of the dramatic increase in Internet usage. The various data from the WWW stored in web logs include HTTP request information, client IP addresses, the contents of the website (product information, published articles about the company, etc.), visitor behavior data (navigational paths or clickstream data and purchasing data), and web structure data. Thus, current research efforts in WWW data mining focus on three issues: web content mining, web structure mining, and web usage mining. Web content mining describes the automatic search of information resources available online. The automated discovery of web-based information is difficult because of the lack of structure permeating the information sources on the web. Traditional search engines generally do not provide structured information, nor do they categorize, filter, or interpret documents [CMS97].
These factors have prompted researchers to develop more intelligent tools for information retrieval and to extend data mining efforts to provide a higher level of organization for the semi-structured data available on the web.

Table 2-1. Data mining algorithms
- Association rules: Descriptive and predictive. Determines when items occur together. Common applications: understanding consumer product data.
- Bayesian networks: Predictive. Learns by determining conditional probabilities. Common applications: predicting what a consumer would like to do on a website based on previous and current behavior.
- Clustering: Descriptive. Identifies and groups similar data. Common applications: determining consumer groups.
- Decision trees: Predictive. A flow chart of if-then conditions leading to a decision. Common applications: predicting credit risk.
- Neural networks: Predictive. Modeled after the human brain; a classic Artificial Intelligence algorithm. Common applications: optical character recognition and fraud detection.

Web structure mining deals with mining the web document's structure and links to identify relevant documents. Web structure mining is useful in generating information such as visible web documents, luminous web documents, and luminous paths (a path common to most of the results returned) [BLMN99]. Web usage mining is the discovery of user access patterns from web server logged data. Companies automatically collect large volumes of data from daily website operations in server access logs. They analyze this web log data to aid in future business decisions. In this thesis, we use clickstream and purchasing data collected prior to an e-commerce website going out of business. This data set resembles the data used during the web data mining process. Web mining, when viewed from a data mining perspective, is assumed to have three operations of interest: sequential analysis, associations, and clustering. Sequential analysis provides insight into the order in which URLs tend to be accessed.
Determining which URLs are usually requested together (associations) and finding the natural groupings of users, pages, etc. (clustering) are more useful in today's real-world web mining applications.

2.1 Association Rule Mining

Association rule mining is the discovery of association relationships (or correlations) among a set of items. These relationships are often expressed in the form of a rule showing attribute-value conditions that occur frequently together in a given set of data. An example of an association rule is X => Y, which is interpreted by Jiawei Han [Han99] as: database tuples that satisfy X are likely to satisfy Y. Association algorithms are efficient for deriving rules, but both the support and confidence factors are key for an analyst to make a judgment about the validity and importance of the rules. The support factor indicates the relative occurrence of the detected association rules within the overall data set of transactions, and the confidence factor is the degree to which the rule is true across individual records. The main goal of association discovery is to find items that imply the presence of other items in the same transaction. It is widely used in transaction data analysis for direct marketing, catalog design, and other business decision-making processes. This technique was a candidate for implementation in the experimental section, but clustering proved to be a better fit for our research. Association discovery's simplistic nature gives it a significant advantage over the other data mining techniques. It is also very scalable, since it basically counts the occurrences of all possible combinations of items and involves reading a table sequentially from top to bottom each time a new dimension is added. Thus, it is able to handle large amounts of data (in this case, large numbers of transactions). Association rules do not suffer from overfitting, so they tend to generalize better than other types of classifiers.
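The support and confidence factors defined above reduce to simple counting over the transaction set. The small sketch below (the function name is ours, for illustration) computes both for a rule X => Y:

```python
def rule_stats(transactions, x, y):
    """Support and confidence of the rule X => Y over a list of item sets."""
    n = len(transactions)
    x, y = set(x), set(y)
    n_x = sum(1 for t in transactions if x <= t)          # transactions containing X
    n_xy = sum(1 for t in transactions if (x | y) <= t)   # transactions containing X and Y
    support = n_xy / n                       # relative occurrence in the whole data set
    confidence = n_xy / n_x if n_x else 0.0  # fraction of X-transactions that also hold Y
    return support, confidence
```

For example, over the four baskets {bread, milk}, {bread, butter}, {milk}, and {bread, milk, butter}, the rule bread => milk has support 0.5 (two of four baskets contain both) and confidence 2/3 (two of the three bread baskets also contain milk).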
Association rules have some serious limitations, however, such as the number of rules generated. Too many rules may overwhelm an inexperienced user, while too few may not suffice. Another drawback is that the generated rules give no information about causation. The rules can only tell which things tend to happen together, without specifying anything about the cause.

2.2 Clustering

Clustering is the task of grouping together "similar" items in a data set. Clustering techniques attempt to look for similarities and differences within a data set and group similar rows into clusters. A good clustering method produces high-quality clusters, ensuring that the inter-cluster similarity is low and the intra-cluster similarity is high. Clustering algorithms can be classified into four main groups: partitioning algorithms, hierarchical algorithms, density-based algorithms, and grid-based algorithms.

2.2.1 Partitioning Algorithms

Partitioning algorithms attempt to break a data set of N objects into a set of k clusters such that the partition optimizes a given criterion. These algorithms are usually classified as static or dynamic. Static partitioning is performed prior to the execution of the simulation, and the resulting partition is fixed during the simulation [JK96]. Dynamic partitioning attempts to conserve system resources by combining the computation with the simulation. There are mainly two approaches: the k-means algorithm, where each cluster is represented by the center of gravity of the cluster, and the k-medoid algorithm, where each cluster is represented by one of the objects of the cluster located near its center [CSZ98]. Partitioning applications such as PAM, CLARA, and CLARANS are centered around k-medoids. Other applications involve the traditional k-means algorithm or a slight variation/extension of it, such as our implementation.
PAM (Partitioning Around Medoids) [KR90] uses arbitrarily selected representative objects, called medoids, during its initial steps to find k clusters. Medoids are meant to be the most centrally located objects within each cluster. Each non-selected object is then grouped with the medoid to which it is most similar. In each step, a swap between a selected object (medoid) and a non-selected object is made if it would result in an improvement of the quality of clustering. The quality of clustering (i.e., the combined quality of the chosen medoids) is measured by the average dissimilarity values given as input. Experimental results by Kaufman and Rousseeuw have shown PAM to work satisfactorily for small data sets (for example, 100 objects in 5 clusters), but it is not efficient when dealing with medium to large data sets. The slow processing time, which is O(k(N-k)²) per iteration [CSZ98] due to the comparison of each object with the entire data set, motivated the development of CLARA. CLARA (Clustering LARge Applications) relies on sampling to handle large data sets. CLARA draws a sample of a data set, applies PAM to the sample, and then finds the medoids of the sample instead of the entire data set. The medoids of the sample approximate the medoids of the entire data set. Multiple data samples are drawn to derive better approximations and return the best clustering output. The quality of clustering for CLARA is measured based on the average dissimilarity of all objects in the entire data set, not only of those in the samples. Kaufman and Rousseeuw's experimental results show that CLARA performs satisfactorily for data sets such as one containing 1000 objects in 10 clusters. Since CLARA only applies PAM to the samples, each iteration reduces to O(k(40+k)² + k(N-k)) [KR90], using 5 samples of size 40 + 2k. Although these data sets are larger than those used for the PAM experiments, CLARA is still not ideal for web mining analysis.
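PAM's swap criterion, accepting a medoid/non-medoid exchange only when it lowers the total dissimilarity, can be illustrated as follows. This is our sketch: Euclidean distance stands in for the dissimilarity values given as input, and the function names are hypothetical.

```python
import math

def pam_cost(points, medoids):
    """PAM's clustering quality: each object contributes its dissimilarity
    to the nearest medoid (Euclidean distance here, as an assumption)."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def improves(points, medoids, old, new):
    """Would swapping medoid `old` for non-medoid `new` lower the total cost?
    PAM evaluates such swaps for every (medoid, non-medoid) pair each step,
    which is the source of its O(k(N-k)^2) per-iteration cost."""
    swapped = [new if m == old else m for m in medoids]
    return pam_cost(points, swapped) < pam_cost(points, medoids)
```

CLARA's speedup comes from running exactly this procedure on a small sample rather than on all N objects, then measuring the resulting medoids against the full data set.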
CLARANS (Clustering LARge Applications based on RANdomized Search) [HN94] stems from the work done on PAM and CLARA. It relies on a randomized search of a graph of nodes, each represented by a set of k objects, to find the medoids of the clusters. Each node represents a collection of k medoids and therefore corresponds to a clustering. Thus, each node is assigned a cost, the total dissimilarity between every object and the medoid of its cluster. The algorithm takes as parameters the maximum number of neighbors of a node that can be examined (maxneighbor) and the maximum number of local minima that can be collected (numlocal). After selecting a random node, CLARANS checks a sample of the node's neighbors, moving to a neighbor if it improves the cost, and continues until the maxneighbor criterion is met; otherwise, it declares the current node a local minimum and starts a new search for another local minimum. After the specified number (numlocal) of local minima has been collected, the best of these local values is recorded as the medoids of the clustering. The PAM algorithm can be viewed as the method used to search for the local minima. For large values of N, examining all k(N-k) neighbors of a node is time consuming. Although Ng and Han claim that CLARANS is linearly proportional to the number of points, the time consumed in each step of the search is O(kN²), making the overall performance at least quadratic [Kol01]. CLARANS, without extra focusing techniques, cannot handle large data sets, and it was not designed to handle high-dimensional data; both are characteristics of the data stored in web logs.

2.2.2 Hierarchical Algorithms

Hierarchical algorithms create a hierarchical decomposition of a database. These techniques produce a nested sequence of clusters with a single all-inclusive cluster at the top and single-point clusters at the bottom.
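The bottom-up construction of such a nested sequence can be illustrated by repeatedly merging the two closest clusters until the desired number remains. In the sketch below, single-link distance (the distance between the closest pair of members) is our assumed merge criterion; other linkage choices are equally valid.

```python
import math

def agglomerate(points, k):
    """Bottom-up hierarchical clustering sketch: start with every point as
    its own cluster, then repeatedly merge the two closest clusters until
    k clusters remain."""
    clusters = [[p] for p in points]

    def link(a, b):
        # Single-link distance: closest pair of members across the clusters.
        return min(math.dist(p, q) for p in a for q in b)

    while len(clusters) > k:
        # Find the pair of clusters with the smallest link distance.
        pairs = [(link(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        _, i, j = min(pairs)
        clusters[i] += clusters.pop(j)   # merge the closest pair
    return clusters
```

Recording the merge order instead of only the final partition yields exactly the dendrogram described next.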
The hierarchical decomposition can be represented by a dendrogram, a tree that iteratively splits the database into smaller subsets until each subset consists of only one object [EKSX96]. The dendrogram can be created from the leaves up to the root (agglomerative approach) or from the root down to the leaves (divisive approach) by merging or dividing clusters at each step. Agglomerative hierarchical algorithms begin with all the data points as separate clusters, followed by recursive steps of merging the two most similar (or least expensive) cluster pairs until the desired number of clusters is obtained or the distance between the two closest clusters rises above a certain threshold. Divisive hierarchical algorithms work by repeatedly partitioning a data set into "leaves" of clusters. A path down a well-structured tree should visit sets of increasingly tightly related elements, conveniently displaying the number of clusters and the compactness of each cluster. BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is a clustering method developed to address large data sets and the minimization of input/output (I/O) costs. It incrementally and dynamically clusters incoming multidimensional metric data points to try to produce the best quality clustering with the available resources (i.e., available memory and time constraints) [LRZ96]. BIRCH typically clusters well with a single scan of the data; however, optional additional passes can be used to improve the cluster quality further. BIRCH contains four phases, two of which are optional (namely the second and the fourth). During phase one, the data is scanned and the initial tree is built using the given amount of memory and recycling space on disk. The optional phase two condenses the tree by scanning the leaf entries to rebuild a smaller one, removing outliers and grouping crowded subclusters into larger ones.
BIRCH uses a self-created height-balanced Clustering Feature (CF) tree at the core of its clustering step. Each node, or CF vector, of the tree contains the number of data points in the cluster, the linear sum of the data points, and the square sum of the data points. The CF tree has two parameters: branching factor B and threshold T. Each non-leaf node contains at most B entries. The tree size is a function of T: the larger T is, the smaller the tree. The mandatory phase three uses a global algorithm to cluster all leaf entries. This global algorithm is a preexisting method selected before beginning the BIRCH process. BIRCH also allows the user to specify either the desired number of clusters or the desired threshold (in diameter or radius) for clusters. Up to this point, the original data has only been scanned once, although the tree and outlier information have been scanned multiple times. After phase three, some inaccuracies may exist from the initial creation of the CF tree. Phase four is optional and entails the cost of additional passes over the data to correct those inaccuracies and refine the clusters further. This phase uses the centroids produced in phase three as seeds to migrate points and/or create new clusters. [LRZ96] contains a performance analysis against CLARANS. The authors conclude that BIRCH uses much less memory yet is faster, more accurate, and less order-sensitive than CLARANS. BIRCH, in general, scales well but handles only numeric data, and its results depend on the order of the records. CURE (Clustering Using REpresentatives) [GRS98] is a bottom-up (agglomerative) clustering algorithm based on choosing a well-formed group of points to identify the distance between clusters. CURE begins by choosing a constant number c of well-scattered points from a cluster, which are used to identify the shape and size of the cluster.
The next step uses a predetermined fraction between 0 and 1 to shrink the selected points toward the centroid of the cluster. With the new (shrunken) positions of these points identifying the cluster, the algorithm then finds the clusters with the closest pairs of identifying points. This merging continues until only the desired number of clusters, k, an input parameter, remains. A k-d tree [Sam90] is used to store the representative points of the clusters. CURE uses a random sample of the database to handle very large data sets, in contrast with BIRCH, which preclusters all the data points for large data sets. Random sampling can eliminate significant input/output (I/O) costs, since the sample may be designed to fit into main memory, and it also helps to filter outliers. If random samples are drawn such that the probability of missing clusters is low, accurate information about the geometry of the clusters is still preserved [GRS98]. CURE partitions and partially clusters the data points of the random sample to speed up the clustering process when sample sizes increase. Multiple representative points are used to label the clusters, assigning each data point to the cluster with the closest representative point. The use of multiple points enables the algorithm to identify arbitrarily shaped clusters. The worst-case time complexity of CURE is O(n² log n), where n is the number of sampled points, proving to be no worse than BIRCH [Kol01]. The computational complexity of CURE is quadratic with respect to the sample size and is not related to the size of the data set. 2.2.3 Density-based Methods Density-based clustering algorithms locate clusters by constructing a density function that reflects the spatial distribution of the data points. The density-based notion of a cluster is defined as a set of density-connected points that is maximal with respect to density-reachability. In other words, the density of points inside each cluster is considerably higher than outside of the cluster.
In addition, the density within the areas of noise is lower than the density in any of the clusters. Two examples of density-based methods are DBSCAN and OPTICS. DBSCAN (Density Based Spatial Clustering of Applications with Noise) [EKSX96] is a locality-based algorithm relying on a density-based notion of clustering, which states that within each cluster the density of points is significantly higher than the density of points outside the cluster [Kol01]. The algorithm uses two parameters, Eps and MinPts, to control the density of the clusters. Eps represents the neighborhood (radius) of a point, and MinPts is the minimum number of points that must be contained in the neighborhood of that point in the cluster. DBSCAN discovers clusters of arbitrary shapes, can distinguish noise, and requires little input. Determining Eps is a major drawback, however, because the user must choose it manually for each run of the algorithm. The stated runtime of the algorithm, O(N log N), does not account for the significant calculation time of Eps, so it is very misleading. The algorithm can handle large amounts of data, but it is not designed to handle higher-dimensional data. OPTICS (Ordering Points To Identify the Clustering Structure) [ABKS99] is a cluster analysis algorithm that creates an augmented ordering of the database representing its density-based clustering structure. This differs from the purpose of traditional clustering methods, which produce an explicit clustering of the data set. This cluster ordering contains information that is equivalent to the density-based clusterings corresponding to a broad range of parameter settings. OPTICS works in principle like an extended DBSCAN algorithm for an infinite number of distance parameters Eps_i that are smaller than a "generating distance" Eps (i.e., 0 <= Eps_i <= Eps).
However, instead of assigning cluster memberships, the algorithm stores the objects in the order they are processed, together with the information that an extended DBSCAN algorithm would use to assign cluster memberships (if doing so were possible for an infinite number of parameters). This information consists of only two values: the core-distance and a reachability-distance. The core-distance of an object p is the smallest distance for which p qualifies as a core object, i.e., for which its neighborhood contains at least MinPts objects. The reachability-distance of an object p with respect to a core object o is the smallest distance such that p is directly density-reachable from o. The OPTICS algorithm creates an ordering of a database, additionally storing the core-distance and a suitable reachability-distance for each object. Objects that are directly density-reachable from a current core object are inserted into a seed list for further expansion. The seed-list objects are sorted by their reachability-distance to the closest core object from which they are directly density-reachable. The reachability-distance for each object is determined with respect to the center object. Objects not yet in the priority queue (seed list) are inserted with their reachability-distance. If the new reachability-distance of an object already in the queue is smaller than its previous reachability-distance, the object is moved further toward the top of the queue. [ABKS99] performed extensive performance tests using different data sets and different parameter settings, showing that the runtime of OPTICS is nearly the same as the runtime of DBSCAN. If OPTICS must scan through the entire database, the runtime is O(N²); if a tree-based spatial index can be used, the runtime is reduced to O(N log N). For medium-sized data sets, the cluster ordering can be represented graphically, and for very large data sets, OPTICS extends a pixel-oriented visualization technique to present the attribute values belonging to different dimensions.
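To make the density-based notions above concrete, the following is a minimal Python sketch of the DBSCAN idea (Eps and MinPts map to `eps` and `min_pts`). The function names and the brute-force neighborhood search are our simplifications for illustration, not the published implementation, which uses a spatial index to achieve the O(N log N) runtime discussed above.

```python
import math

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points)
            if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    labels = [None] * len(points)            # None = not yet visited
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:         # not a core point
            labels[i] = -1                   # tentatively noise
            continue
        cluster += 1                         # start a new cluster from core point i
        labels[i] = cluster
        seeds = list(neighbors)
        while seeds:                         # expand via density-reachability
            j = seeds.pop()
            if labels[j] == -1:              # border point: claim it, don't expand
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is also a core point
                seeds.extend(j_neighbors)
    return labels
```

Run on two tight groups plus a far-away point, this labels the groups 0 and 1 and marks the outlier as noise (-1), illustrating how the density threshold separates clusters from noise.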
2.2.4 Grid-based Methods Grid-based algorithms quantize the space into a finite number of cells and then perform all operations on the quantized space. These approaches tend to have fast processing times that depend only on the number of cells in each dimension of the quantized space and remain independent of the number of data objects. Grid-based techniques such as STING [MWY97] and WaveCluster [CSZ98] have linear computational complexity and are very efficient for large databases; however, they are not typically feasible for analyzing web logs. Grid-based methods are more applicable to spatial data mining. Spatial data mining is the extraction of implicit knowledge and spatial relations, and the discovery of interesting characteristics and patterns that are not explicitly represented in the databases. Spatial data geometrically describes information related to the space occupied by objects. The data may be either a single point in multidimensional space (discrete) or it may span a region of space (continuous). Huge amounts of spatial data may be obtained from satellite images, medical imagery, Geographic Information Systems, etc., making it unrealistic to examine spatial data in detail. 2.2.5 K-means As mentioned earlier in this chapter, we now revisit the various contributions, improvements, and modifications to the standard k-means algorithm. Historically known as Forgy's method [For65] or MacQueen's algorithm [Mac67], the k-means algorithm has emerged as one of the most widely used techniques for solving clustering problems. The process consists of three main steps [HHK02]: 1. Partition the items into k initial clusters. 2. Proceed through the list of items, assigning each item to the cluster whose centroid (mean) is nearest. Recalculate the centroid for the cluster receiving the new item and for the cluster losing the item. 3. Repeat step 2 until no more assignments take place.
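The three steps above can be sketched in Python as follows. This is a minimal batch variant: centroids are recomputed once per pass rather than after each individual item as in MacQueen's formulation, and the names and the optional `init` parameter are ours, not from any cited implementation.

```python
import math
import random

def kmeans(points, k, init=None, seed=0):
    """Basic batch k-means over tuples of numbers.

    init: optional list of k starting centroids (one of the three
    initialization options discussed in the text); if omitted, k points
    are sampled at random (step 1 of the outline above).
    """
    centroids = list(init) if init else random.Random(seed).sample(points, k)
    while True:
        # Step 2: assign every point to the cluster with the nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster
        # (an empty cluster keeps its old centroid).
        new_centroids = [
            tuple(sum(xs) / len(cl) for xs in zip(*cl)) if cl else centroids[c]
            for c, cl in enumerate(clusters)
        ]
        # Step 3: repeat until no centroid (hence no assignment) changes.
        if new_centroids == centroids:
            return centroids, clusters
        centroids = new_centroids
```

For two well-separated groups with user-provided initial points, the centroids converge to the group means in two passes; the sensitivity to the choice of initial points, discussed next, is exactly what motivates the refinement methods below.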
Step 1 may be completed in one of three ways: randomly selecting k points to represent each cluster, requiring the user to enter k initial points, or using the first k points to represent each cluster. Most implementations randomly select k representative objects (centroids) to start the process. [BF98] illustrate the importance of good initial points: an initial cluster center that attracts no data may remain empty, while a starting point with no empty clusters usually produces better solutions. Our version of the algorithm does not address the initialization issue; others that do assume the initial points are either user-provided or randomly chosen. Duda and Hart mention a recursive method, [CCMT97] takes the mean of the entire data set and randomly perturbs it k times, and [BFR98] refine the initial points using small random subsamples of the data. The latter approach is primarily intended for large databases. As database size increases, efficient and accurate initialization becomes critical. When the refinement is applied to an appropriately sized random subsample of the database, the authors show that accurate clustering can be achieved with improved results over the classic k-means. The only memory requirement of this refinement algorithm is to hold a small subsample in RAM, allowing it to scale easily to very large databases. In the remaining steps of the algorithm, the main focus is to optimize the clustering criterion. The most widely used criterion is the clustering error, which computes for each point its squared distance from the corresponding cluster center and then sums these distances over all points in the data set [LVV01]. The Intelligent Autonomous Systems group has proposed the global k-means algorithm, a deterministic and effective global clustering method that employs the k-means algorithm as a local search procedure.
The algorithm is an incremental approach to clustering that dynamically adds one cluster center at a time through a deterministic global search procedure consisting of N (the size of the data set) executions of the k-means algorithm from suitable initial positions. To solve a clustering problem with M clusters, it sequentially solves all intermediate problems with 1, 2, ..., M-1 clusters. The underlying principle of this method is that an optimal solution for a clustering problem with M clusters can be obtained by using the k-means algorithm to conduct a series of local searches, each of which starts from the M-1 cluster centers placed at their optimal positions for the corresponding smaller clustering problem within the data space. Since for M = 1 the optimal solution is known, the global algorithm can iteratively apply the above procedure to find solutions for all k-clustering problems with k = 1, ..., M. In terms of computational complexity, the method requires N executions of the k-means algorithm for each value of k (k = 1, ..., M). The experimental results show that for a small data set (for example, N = 250 and M = 15) the performance of this method is excellent; however, the technique has not been tested on large-scale data mining problems. Repeated iterations can be expensive when applying the k-means algorithm. To reduce the time complexity as well as the number of iterations, and to increase the scalability of k-means clustering for large data sets, single-pass k-means algorithms were introduced [BFR98]. The main idea is to use a buffer in which points from the data set are saved in compressed form. The first step initializes the means of the clusters as in the standard k-means. The next step fills the buffer completely with points from the database, followed by a two-phase compression process. The first of the two, called primary compression, identifies points that are unlikely to ever move to a different cluster, using two methods.
The first method measures the Mahalanobis distance [Rei99] from each point to the cluster mean (centroid) it is associated with and discards the point if it is within a certain radius. The second method involves creating confidence intervals for each centroid. Then, a worst-case scenario is set up by perturbing the centroids within the confidence intervals with respect to each point: the centroid associated with the point is moved away from the point, and the means of all other clusters are moved toward the point. If the point is still closest to the same cluster mean after the perturbations, it is unlikely to change cluster membership. Points that are unlikely to change are removed from the buffer and placed in a discard set of one of the main clusters. The second phase is called secondary compression. The aim of this phase is to save buffer space by storing some auxiliary clusters instead of individual points. During this stage, another k-means clustering is performed on the remaining points in the buffer, with a larger number of clusters than for the main clustering. The points in the buffer must satisfy a tightness criterion (remain below a certain threshold). After primary and secondary compression, the available buffer space is filled with new points and the whole procedure is repeated. The algorithm ends after one scan of the data set, or earlier if the centers of the main clusters do not change significantly as more points are added. A special case of the algorithm of [BFR98], not mentioned in their paper, would be to discard all the points in the buffer each time. The algorithm is [EFL00]: 1. Randomly initialize the cluster means. Let each cluster have a discard set in the buffer that keeps track of the sufficient statistics for all points from previous iterations. 2. Fill the buffer with points. 3. Perform iterations of k-means on the points and discard sets in the buffer, until convergence.
For this clustering, each discard set is treated like a regular point placed at the mean of the discard set, but weighted with the number of points in the discard set. 4. For each cluster, update the sufficient statistics of the discard set with the points assigned to the cluster. Remove all points from the buffer. 5. If the data set is exhausted, then finish. Otherwise, repeat from step 2. According to the lesion experiment in [EFL00] (synthetic data sets of 1,000,000 points, 100 dimensions, and 5 clusters), the cluster quality of the simple single-pass k-means method is equivalent to that of the standard k-means, while the method is more reliable (in terms of trapping of centers) and about 40% faster than the standard k-means. With real data from the KDD contest data set (95,412 points with 10 clusters), however, the cluster distortion of the original k-means algorithm was significantly less than that of the simple single-pass algorithm. CHAPTER 3 GENERAL APPROACH TO WEB USAGE MINING In Chapter 2, we mentioned the categorization of web mining into three areas of interest: web content mining, web structure mining, and web usage mining. Web content mining focuses on techniques for searching the web for documents whose contents meet web users' queries [BS02]. Web structure mining is used to analyze the information contained in links, aiming to generate a structural summary about web sites and web pages. Web usage mining attempts to identify (and predict) web users' behavior by applying data mining techniques to the discovery of usage patterns from their interactions while surfing the web. In this chapter, we introduce our approach to mining web usage data using the k-means algorithm to address the issues identified in Section 1.1. 3.1 The Mining of Web Usage Data Companies apply web usage mining techniques to understand and better serve the needs of their current customers and to acquire new customers.
The process of web usage mining can be separated into three distinct phases: preprocessing, pattern discovery, and pattern analysis [CDST00]. The web usage mining process can also be classified into one of two commonly used approaches [BL99]. One approach applies preprocessing techniques directly to the log data prior to applying a data mining technique. The other approach maps the usage data from the logs into relational tables before the mining is performed. The sample data we obtained from KDD Cup 2000 were in flat files; therefore, we chose the second of the two approaches for our implementation. Figure 3.1 depicts the web usage mining process from a high-level perspective [CMS99]. The subsequent sections of this chapter explain the three phases of the process. [Figure 3.1 High-Level Web Usage Mining Process: raw usage data (clickstream and server logs) flows through preprocessing, pattern discovery, and pattern analysis, yielding rules, patterns, and statistics.] 3.1.1 Preprocessing Data for Mining The raw data collected in web server logs tend to be abstruse, and the data must be organized to make it easier to mine for knowledge. Preprocessing consists of converting the usage information contained in the various available data sources into the abstractions necessary for pattern discovery [BS02]. A number of issues in preprocessing data for mining must be addressed prior to running the mining algorithm. These include developing a model of access log data, developing techniques to filter the raw data to eliminate irrelevant items, grouping individual page accesses into units (i.e., transactions), and specializing generic data mining algorithms to take advantage of the specific nature of the access log data [CMS97]. The first preprocessing task, referred to as data cleaning, essentially eliminates irrelevant items that may impact the analysis result.
This involves determining whether there are important accesses or specific access data that are not recorded in the access log. Improving data quality requires user cooperation, which is very difficult (but understandably so) because individuals may feel that the information requested of them violates their privacy. Another preprocessing task is the identification of specific transactions or sessions. The goal of this task is to clearly discern users based on certain criteria (in our case, attributes). The formats of these transactions and/or sessions are tightly coupled with the data collection process; a poor selection of the values to collect about users increases the difficulty of this identification task. 3.1.2 Pattern Discovery The next phase of the web usage mining process, pattern discovery, varies depending on the needs of the analyst. Algorithms and techniques from various research areas such as statistics, machine learning, and data mining are applied during this phase. Our focus is on finding trends in the data by grouping users, transactions, sessions, etc., to understand the behavior of the visitors. Clustering, a data mining technique, is well suited for our desired results. By analyzing the results of clustered web log data, web usage mining can facilitate the development and execution of future marketing strategies and promote efficient and effective web site management. There are different ways to break down the clustering process. One way is to divide it into five basic steps [Mas02]: 1. Preprocessing and feature selection. Most clustering models assume all data items are represented by n-dimensional feature vectors. To improve the scalability of the problem space, it is often desirable to choose a subset of all the features (attributes) available. During this first step, the appropriate feature set is chosen, and the appropriate preprocessing and feature extraction are performed on data items to measure the values of the chosen feature set.
This step requires a good deal of domain knowledge and data analysis. NOTE: Do not confuse this step with the preprocessing step of web usage mining; this step is done after the data has been cleansed. 2. Similarity measure. This is a function that receives two data items (or two sets of data items) as input and returns a similarity measure between them as output. Item-item versions include the Hamming distance [Bla02c], Mahalanobis distance, Euclidean distance, inner product, and edit distance. Item-set versions use the item-item versions as subroutines and include the max/min/average distance; another approach evaluates the distance from an item to a representative of the set, where point representatives (centroids) are chosen as the mean vector, mean center, or median center of the set, and hyperplane or hyperspherical representatives of the set can also be used. 3. Clustering algorithm. Clustering algorithms generally use particular similarity measures as subroutines. The choice of clustering algorithm depends on the desired properties of the final clustering and on the time and space complexity. Clustering user information or data items from web server logs aids companies with web site enhancements, such as automated return mail to visitors falling within a specific cluster or dynamically changing a particular site for a customer/user on a return visit, based on past classification of that visitor [CMS99]. 4. Result validation. Do the results make sense? If not, we may want to iterate back to a prior stage. It may also be useful to test for clustering tendency, to estimate whether clusters are present at all. NOTE: Any clustering algorithm will produce some clusters regardless of whether or not natural clusters exist. 5. Result interpretation and application.
Typical applications of clustering include data compression (representing data samples by their cluster representatives), hypothesis generation (looking for patterns in the clustering of the data), hypothesis testing (e.g., verifying feature correlation or other data properties through a high degree of cluster formation), and prediction (once clusters have been formed from the data and characterized, new data items can be classified by the characteristics of the cluster to which they would belong). 3.1.3 Pattern Analysis The final stage of web usage mining is pattern analysis. The discovery of web usage patterns would be meaningless without mechanisms and tools to help analysts better understand them. The main objective of pattern analysis is to eliminate irrelevant rules or patterns and to extract interesting rules or patterns from the output of the previous stage (pattern discovery). The output of web mining algorithms, in its original state, is usually incomprehensible to the naked eye and thus must be transformed into a more readable format. These techniques have been drawn from fields such as statistics, graphics and visualization, and database querying. Visualization techniques have been very successful in helping people understand various kinds of phenomena. Bharat and Pitkow [BP94] proposed a web path paradigm in which sets of server log entries are used to extract subsequences of web traversal patterns called web paths, along with the development of their WebViz system for visualizing WWW access patterns. Through the use of WebViz, analysts have the opportunity to filter out any portion of the web deemed unimportant and selectively analyze those portions of interest. In [Dyr97], OLAP tools proved to be applicable to web usage data, since the analysis needs are similar to those of a data warehouse. The rapid growth of access information increases the size of the server logs quite expeditiously, reducing the possibility of providing online analysis of all of it.
Therefore, to make online analysis feasible, there is a need to summarize the log data. Query languages allow an application or user to express what conditions must be satisfied by the data it needs, rather than having to specify how to get the required data [CMS97]. Potentially, a large number of patterns may be mined; thus, a mechanism to specify the focus of the analysis is necessary. One approach would be to place constraints on the database to restrict the mining to a certain portion of the database. Another method would be to perform the querying on the knowledge that has been extracted by the mining process, which would require a language for querying knowledge rather than data. 3.2 Web Usage Mining with k-means The algorithms used for most of the initial web mining efforts were highly susceptible to failure when operating on real data, which can be quite noisy. In [JK98], Joshi and Krishnapuram introduce some robust clustering methods. Robust techniques typically deal only with a single component and thus increase the complexity when applied to multiple clusters. Fuzzy clustering techniques are capable of addressing the problem of multiple clusters. Fuzzy clustering provides a better description tool when clusters are not well separated [Bez81], which may happen during web mining. Fuzzy clustering for grouping web users has been proposed in [BH93], [FKN95], and [KK93]. Rough set theory [Paw82] has been considered an alternative to fuzzy set theory, though there is limited research on clustering based on rough set theory. Lingras and West [LW02] adapted the k-means algorithm to find cluster intervals of web users based on rough set theory. They applied preprocessing techniques directly to the log data prior to applying a data mining technique. This was possible because of their involvement in the data collection process, which allowed them to filter information into specific predefined categories before mining the data.
After applying the k-means method, they analyzed the data based on their knowledge of the initial classifications. 3.2.1 Our Web Usage Mining Approach In this thesis, our approach was indirectly imposed on us by the original format of the log data. We chose the second of the two approaches mentioned in Section 3.1, while still applying the three-phase process also described in that section. In the preprocessing phase, we convert the flat files into relational tables to utilize the advantages of structured query languages for retrieving the desired data from the logs. The feature selection step of our pattern discovery phase is taken as input from the analyst (or user of our algorithm). We chose to implement a variation of the k-means algorithm because of its computational strengths for large data sets. For pattern analysis, we graphed the results discovered in the previous phase to improve human comprehension of the knowledge. The next chapter describes the architecture and implementation strategies for our k-means algorithm when used for web mining. CHAPTER 4 ARCHITECTURE AND IMPLEMENTATION The web usage mining process discussed in Section 3.1 is commonly used throughout the research community. The architecture of our web usage mining solution encompasses most of the phases and steps mentioned in Chapter 3; however, choosing to use our version of k-means as our clustering method led to the exclusion of a few steps. Another reason for omitting steps was our lack of input into the data collection. Section 4.1 provides insight into our architectural structure, and Section 4.2 explains the details of our k-means implementation. 4.1 Architecture Overview Our algorithm's architecture consists of two Java modules carrying out three execution phases. The first class, ReadData, accepts the user input, reads the data from the files, and clusters the data points accordingly.
The ClusterValues class maintains cluster information such as the number of points in each cluster, all of the point values in each cluster, and the centroid value of the cluster. The three phases have the same goals as those described in the previous chapter for the web usage mining process; however, our clustering algorithm implementation gave us the freedom to omit time-consuming steps. The architecture divides the web usage mining process into two main parts. The first part involves the domain-dependent processes of transforming the web data into a suitable transaction form. The second part includes the application of our k-means algorithm for data mining, together with pattern matching and analysis techniques. Figure 4.1 depicts the architecture of our web usage mining project. This section describes the steps taken to complete each phase of the process; the next section explains our algorithm in its entirety, in conjunction with the modular interaction. [Figure 4.1 Our Web Usage Mining Architecture: server log data is transformed into relational tables and transaction/session data, then mined and analyzed to produce knowledge.] 4.1.1 Phase 1 Preprocessing We began our preprocessing phase with the data already condensed in one format, flat files, as our input. Typical web usage data exists in web server logs, referral logs, registration files, and index server logs. Intelligent integration and correlation of information from these diverse sources can reveal usage information that may not be evident from any one of them individually. We have assumed that the content of these files was already in its integrated state when obtained from KDD Cup 2000. The data learning task of our preprocessing phase primarily involved improving the understandability of the data. Column names and, in some instances, lists of column values for the comma-delimited flat files were provided; however, the values were still difficult to discern.
We decided to convert the flat files into relational tables, both to match the column values with their column names and to take advantage of the data retrieval methods provided by relational database management systems (RDBMS) during the mining stage. After transforming the format of the data, we removed empty-valued columns and those columns deemed uninteresting and/or unnecessary for our desired results at this stage of the process. The transaction identification task of this phase distinguishes independent users, transactions, or sessions. This task is simplified when the data collected is carefully selected and conducive to the overall objectives of the mining process. The data set used in this thesis was divided into two tables: one containing the visitors' clickstream data, the other customer order information. We did not apply any identification techniques to the data; we simply "learned" the data itself and focused on attributes/columns that were relevant to a user, transaction, or session. For example, the clickstream data has session-related attributes (e.g., SESSION_ID, SESSION_FIRST_REQUEST_DAY_OF_WEEK, etc.) that we used to identify sessions. At this point, we retained the data comprising specific users, transactions, and sessions in the tables for future refinement in the next phase, pattern discovery. 4.1.2 Phase 2 Pattern Discovery As we enter the pattern discovery phase, we reiterate our web usage mining goal of finding trends in the data to understand the behavior of the visitors. The clustering techniques used in this research area group together similar users based on analyst-specified parameters. We begin the clustering process by reducing the dimensionality of the data set during the preprocessing/feature selection step. This step allows the analyst to select the attributes necessary to explore the targeted regions of the data set.
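As an illustration of how the relational format supports this kind of feature selection, the following sketch loads a few session rows into an in-memory SQLite table and projects two attributes for the clusterer. The table layout and sample values are hypothetical stand-ins, not the actual KDD Cup 2000 schema; only the column-name style follows the session attributes mentioned above.

```python
import sqlite3

# In-memory stand-in for the clickstream table built from the flat files.
# The schema is a simplified assumption for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE clickstream (
                    SESSION_ID INTEGER,
                    SESSION_FIRST_REQUEST_DAY_OF_WEEK TEXT,
                    REQUEST_COUNT INTEGER)""")
conn.executemany("INSERT INTO clickstream VALUES (?, ?, ?)",
                 [(1, 'MON', 12), (2, 'TUE', 3), (3, 'MON', 7)])

# Feature selection: project only the attributes the analyst asked for,
# yielding the feature vectors handed to the clustering algorithm.
rows = conn.execute("""SELECT SESSION_ID, REQUEST_COUNT
                       FROM clickstream
                       WHERE SESSION_FIRST_REQUEST_DAY_OF_WEEK = 'MON'
                       ORDER BY SESSION_ID""").fetchall()
print(rows)   # → [(1, 12), (3, 7)]
```

The declarative WHERE/SELECT clauses are exactly the "what, not how" advantage of query languages noted in Section 3.1.3.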
This preprocessing clustering step differs from the preprocessing phase of web usage mining because it identifies the specific features needed as input for the clustering algorithm, as opposed to the general information resulting from data cleaning and transaction identification. The columns chosen during this step represent the n-dimensional feature vectors. The heart of this thesis lies in the next step: the selection and implementation of the clustering technique. We chose the popular k-means algorithm because of its ability to produce good clusters and its efficient processing of large data sets. Limited research has been done using k-means for web mining outside of the fuzzy and rough set approaches mentioned in Chapter 3. Section 4.2 explains the implementation of our version of the k-means algorithm. After executing the algorithm, we reviewed the results to assess their legitimacy. If the results seemed unreasonable, we returned to the feature selection step to refine our query. This refinement process is intended to assist in finding patterns in the clustering, also known as hypothesis generation. Hypothesis generation exposes trends in the data. Had the website still existed, we could also have used the results to predict future customer behavior; such analysis could have helped the company retain and acquire customers and therefore prevented it from going under.

4.1.3 Phase 3 Pattern Analysis

The pattern analysis phase provides tools and mechanisms to improve the analyst's understanding of the patterns discovered in the previous phase. During this phase, we eliminated content (patterns) that did not reveal useful information. We did not use a tool to aid in our analysis. Instead, we used a non-automated graphing method to visualize our results. The visualization depicted the mined data in a manner that permitted the extraction of knowledge by the analyst.
4.2 Algorithm Implementation

The pattern discovery phase is a critical component of the web mining process and usually adopts one of several techniques to complete successfully: statistical analysis, association rules, classification, sequential patterns, dependency modeling, or clustering [Wan00]. Statistical analysis of the information contained in a periodic web system report can be useful for improving system performance, enhancing the security of a system, facilitating the site modification task, and providing support for marketing decisions [Coo00]. In web usage mining, association rules refer to sets of pages that are accessed together with a support value exceeding some specified threshold. Classification techniques establish a profile of users belonging to a particular class or category by mapping users (based on specific attributes) into one of several predefined classes. Sequential pattern analysis aims to retrieve subsequent item sets in a time-ordered set of sessions or episodes, for example to help place appropriate advertisements for certain user groups. Dependency modeling techniques expose significant dependencies among the various variables in the web domain, providing a theoretical framework for analyzing user behavior and predicting future web resource consumption. Clustering techniques group together data items with similar characteristics. In our research, we would like to extract knowledge from the data set based on specific attributes of interest, and cluster analysis grants us the opportunity to achieve that goal. In Section 2.2, we discussed various clustering techniques and algorithms. Web server log files can grow very quickly, depending on the amount of data collected per user visit. The navigational and purchasing information collected for our data set totaled approximately 1.7 gigabytes over a period of two months in the year 2000, when the concept of web data mining was in its infancy.
Current data collection methods and techniques are far more advanced and may collect the same amount of data daily. Therefore, the clustering algorithm needed for the pattern discovery phase had to be reliable and efficient when applied to large data sets. The traditional k-means algorithm would suffice; however, a few characteristics of our data set expose drawbacks in the algorithm. The total number of attributes in the combined data files is 449 (217 clickstream and 232 purchasing). We needed to reduce the vector dimensionality and use a representative sample set of the data to improve the scalability and the efficiency of the algorithm, respectively. Web logs contain non-numeric and alphanumeric data, both of which are prohibited as input to the standard k-means algorithm, so our algorithm must accept non-numeric values as input. In this section, we discuss our version of the k-means algorithm and how it addresses the issues above. Recall from Section 4.1 the feature selection step in the pattern discovery phase. This step essentially covers the first two tasks of our algorithm and requires user input. The first task is entering the desired number of clusters, with the maximum being ten. Excessive clusters dilute the data, which could further complicate the analysis. The other user-required input is the set of attributes to query. Arbitrary selection of these attributes produces meaningless clusters; this task requires at least some knowledge of the data as well as a predetermined goal. Querying completely unrelated attributes could return interesting results, but that is unlikely. The preprocessing phase cleanses and organizes the data to prepare it for pattern discovery. The first two tasks of our algorithm, which together serve as the feature selection step, allow the analyst to select specific attributes to mine for knowledge.
1  class ReadData
2  {
4
5      //variable initializations
6
7
8
9      public ReadData()
10     {
11         stop = false;
12         file = "Click_7500.txt";
13
14
15
16     public void readInLine()
17     {
18
19
20         //initialize method variables
21
22
23         for (resultColNum = (pArrayMax - 1); resultColNum < pArrayMax; resultColNum++)
24         {
25             for (int c = 0; c <= (int)(kClusters + 1); c++)
26             {
27                 if (pointsArray[resultColNum] <= clusterMax[c])
28                 {
29                     cvStorage.addClusterValues(c + 1, pointsArray[resultColNum], d);
30                     cvVector.addElement(cvStorage);
31                     d++;
32                     break;
33                 }
34         //retrieves the cluster information for the specified cluster number
35
36
37         cvStorage = new ClusterValues();
38
39
40
41
42         //if the specified cluster contains any points, then compute
43         //the centroid value
44
45
46
47

Figure 4.2 The ReadData module

The values mentioned in the previous paragraph are collected in the main method of the ReadData class, shown in Figure 4.2, which is used to implement our algorithm. Once these two values have been determined, we call the method located at line 16 of Figure 4.2, readInLine(), to perform the grunt work of the implementation. This method begins by reading the first line of the file specified in the class constructor. The target file contains sample data generated by a simple query run against one or both of the tables; the results of the query are exported to a delimited file, which essentially serves as the cleansed version of the log file. As the first line of data is parsed, we transition smoothly into the next task of our algorithm, which involves calculating the data point values. The number of attributes, n, selected during the feature selection step determines the vector size. The values in a web log can be numeric, non-numeric, or alphanumeric, so, unlike traditional k-means algorithms, our algorithm must support all three value types.
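The reading and parsing step just described can be sketched as follows. This is a minimal illustration of what readInLine() does with each line of the exported file; the class and method names here are ours, not taken from the thesis code.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Sketch of the file-reading step performed by readInLine() (Figure 4.2):
// each line of the exported, comma-delimited sample file is split into its
// attribute values.
public class LineReader {

    // Split one comma-delimited line into its attribute values.
    static String[] parseLine(String line) {
        return line.split(",", -1); // keep empty (missing/null) columns
    }

    // Read every row of the sample file, e.g. "Click_7500.txt".
    static List<String[]> readAll(String file) throws IOException {
        List<String[]> rows = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new FileReader(file))) {
            String line;
            while ((line = in.readLine()) != null) {
                rows.add(parseLine(line));
            }
        }
        return rows;
    }
}
```

Each parsed attribute value may be numeric, non-numeric, or alphanumeric, which is exactly the mix of value types the algorithm must accommodate.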
We handle the mixed value types by using the ASCII (American Standard Code for Information Interchange) value of each character (letter, digit, or special character) for computation. We begin by calculating the value of each individual attribute A_i, where i = 0, ..., n-1, of the n-dimensional vector:

(1)   A_i = Σ_{d=0}^{D-1} c_d

where D is the length of the attribute's character array and c_d is the ASCII value of the d-th character. Next, we compute the vector value of the entire row of n attributes. This is done by dividing the sum of the individual values A_i by the number of columns n:

(2)   R_m = ( Σ_{i=0}^{n-1} A_i ) / n

where m is the row number in the table. After R_1 is computed, it becomes the minimum value (min) by default. The next non-equivalent row vector value R_m detected replaces the min if it is lower than R_1, or it becomes the maximum value (max) if it is higher. The point values computed after the max and min have been selected are compared to both values and replace them accordingly, if necessary. We then subtract the min from the max to determine the range of the points:

(3)   diff = max - min

The diff value obtained at the end of the third task is the numerator of the fraction used to compute the cluster thresholds. The denominator of that fraction is the number of clusters provided by the analyst during the feature selection step:

t = diff / k,   where k is the number of clusters

The threshold value t does not represent the threshold of each individual cluster; rather, it is used when computing the upper boundary of each cluster. For example, the threshold range of the first cluster, t_1, extends from the min to the min plus t minus one hundred-thousandth (0.00001), both values inclusive. Continuing to the threshold of the second cluster, t_2, the minimum value of t_2 is min_1 plus t, and the maximum value is min_2 (the minimum value of the second cluster) plus t minus one hundred-thousandth.
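The point-value computations of Equations (1) and (2) can be sketched as follows. The class and method names are illustrative only, not taken from the thesis code.

```java
// Sketch of the point-value computation in Eqs. (1) and (2): every character's
// ASCII value contributes to an attribute value A_i, and the row value R_m is
// the mean of the n attribute values.
public class PointValue {

    // Eq. (1): A_i = sum of the ASCII values of the attribute's characters
    static float attributeValue(String attribute) {
        float sum = 0;
        for (int d = 0; d < attribute.length(); d++) {
            sum += attribute.charAt(d); // ASCII/char code of the d-th character
        }
        return sum;
    }

    // Eq. (2): R_m = (sum of the A_i) / n, for a row of n attributes
    static float rowValue(String[] row) {
        float total = 0;
        for (String attribute : row) {
            total += attributeValue(attribute);
        }
        return total / row.length;
    }
}
```

With the row values in hand, the running min and max and the threshold value t = diff / k follow directly, and the cluster boundaries spaced t apart, as described in the preceding paragraph, determine where each point falls.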
The threshold scheme just described can be represented mathematically as:

Cluster 1: [min_1, min_1 + t - 0.00001],  where min_1 is the minimum point value
Cluster 2: [min_2, max_2],                where min_2 = min_1 + t and max_2 = min_2 + t - 0.00001
...
Cluster n: [min_n, max_n],                where min_n = min_{n-1} + t and max_n = min_n + t - 0.00001

If a point value falls between two consecutive thresholds, it is rounded to the nearest hundred-thousandth for clustering purposes, without changing its original value in the cluster. We chose the hundred-thousandth figure because most of the data points were calculated to that precision. Once the final row vector value, R_m, has been calculated, the data points, the min and max, and the cluster thresholds have all been determined, and each data point has been placed in its proper cluster after only one scan of the data set.

1  class ClusterValues
2  {
3
4      private int clusterNum = 0;
5      private float pointValue = 0;
6      private int pointNumber = 0;
7      private ClusterValues cvals = this;
8      private Vector cv = new Vector();
9
10
11     public ClusterValues()
12     {
13         clusterNum = 0;
14         pointValue = 0;
15         pointNumber = 0;
16     }
17
18
19     public void addClusterValues(int c, float v, int p)
20     {
21
22
23
24
25
26     }
27
28
29     public void calculateCentroids(int c)
30     {
31
32
33
34
35
36     }
37
38 }

Figure 4.3 The ClusterValues module

The final step of our algorithm calculates the centroids (representative points) of each cluster. The centroid computation takes place in the method beginning at line 29, calculateCentroids(), of the ClusterValues class displayed in Figure 4.3. The ClusterValues module shown in Figure 4.3 is the structure responsible for maintaining all the relevant information about each cluster, such as the point value(s) and the number of points present in the cluster. The addClusterValues() method, which starts at line 19 in Figure 4.3, requires the cluster number, the point value, and the element number of the cluster, all of which are calculated in ReadData.readInLine().
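The one-pass interval clustering described above can be sketched as follows. The clusterMax[] array in Figure 4.2 plays the role of upperBound() here; the Thresholds class name and its methods are ours, not from the thesis code.

```java
// Minimal sketch of the one-pass interval clustering: cluster c (1-based)
// covers point values up to min + c*t - 0.00001, so a point can be assigned
// to a cluster by simple comparisons, with no distance function.
public class Thresholds {
    final float min, t;
    final int k;

    Thresholds(float min, float max, int k) {
        this.min = min;
        this.k = k;
        this.t = (max - min) / k; // t = diff / k
    }

    // Upper boundary of cluster c, per the interval scheme above
    float upperBound(int c) {
        return min + c * t - 0.00001f;
    }

    // Assign a point to the first cluster whose upper bound it does not exceed
    int clusterOf(float value) {
        for (int c = 1; c < k; c++) {
            if (value <= upperBound(c)) {
                return c;
            }
        }
        return k; // the last cluster's range ends at the max value
    }
}
```

In the thesis implementation, each assignment is then recorded by calling addClusterValues with the cluster number, the point value, and the element number of the point within that cluster.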
These values are stored in a Java Vector (java.util.Vector) object and retrieved in ClusterValues.calculateCentroids() to calculate the centroid value. We perform this task by dividing the sum of the point values in a specific cluster by the number of points in that cluster, provided the cluster contains any point values. This point represents the mean of the cluster without measuring the distance between each point and the centroid. This permits the exclusion of step two mentioned in Section 3.1.2 and therefore reduces the computational complexity. Referring back to Section 2.2.5, one will notice several differences between the procedures used to implement our k-means algorithm and other implementations. The first significant difference appears as early as the first step, the selection of the initial points. These initial points influence the clustering results tremendously. In most cases, they are randomly selected, which may require numerous executions of the algorithm or a large amount of knowledge of the data set on the part of the analyst; the former could become tedious and the latter may be an unrealistic expectation. Our first two tasks, projecting the number of clusters needed and selecting the attributes to query, do not require a great deal of knowledge about the data set. The only prerequisite of our algorithm is a clearly defined goal. This allows the analyst to specify the appropriate number of categories (clusters) based on targeted characteristics (attributes). Our centroid creation process is performed as the very last task, after all of the vector values (data points) have been calculated and clustered, to determine what the clusters represent. This reduces the algorithm's execution time because it removes the similarity measurement task, in which each data point is compared to the centroids using a distance function to identify the shortest distance and cluster that point, from our implementation. The run time is reduced further in our algorithm because we scan and cluster the data only once.
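The final centroid computation described above can be sketched as follows. The names are illustrative; in the thesis this work happens inside ClusterValues.calculateCentroids().

```java
import java.util.List;

// Sketch of the final centroid step: the centroid of a cluster is simply the
// mean of its point values; no distance function is involved.
public class Centroid {

    // Mean of the point values in one cluster; empty clusters are skipped
    static Float centroid(List<Float> clusterPoints) {
        if (clusterPoints.isEmpty()) {
            return null; // no centroid for an empty cluster
        }
        float sum = 0;
        for (float p : clusterPoints) {
            sum += p;
        }
        return sum / clusterPoints.size();
    }
}
```

Because the centroid is computed once per cluster at the very end, no point ever has to be compared against a centroid and reassigned.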
Multiple iterations over the data points and recalculations of the centroids improve clustering quality at the expense of time. Chapter 5 will present a performance analysis of our algorithm compared to other proposed k-means algorithms, and Chapter 6 will show how our method fared against one of the industry's leading applications in data mining.

CHAPTER 5
PERFORMANCE ANALYSIS

When writing software, the criteria for evaluation pertain to the correctness of the algorithm with respect to the specifications and the readability of the code. There are other criteria for judging algorithms that have a more direct relationship to performance, namely their computing time and storage requirements. The time complexity (or run/execution time) of an algorithm is the amount of computer time it needs to run to completion. The space complexity of an algorithm is the amount of memory it needs to run to completion [HRS98]. The time complexity is based on the access time for each data point, in our case, each row of data. If each row is accessed and recalculated over multiple iterations, the k-means algorithm can become inefficient for large databases. The space complexity deals with the data set size and the variables that may affect it. We will not evaluate the space complexity of our algorithm. In the second part of this chapter, we compare the clustering results obtained by applying a leading data mining software package to the KDD Cup 2000 data set with the results obtained by applying our algorithm to the same data. We will show that our k-means method produces clusters of quality comparable to those of one of the leading data mining tools. We will then conclude our research efforts and contributions in the final chapter, Chapter 6.

5.1 Experimental Evaluation

The development of our k-means algorithm initially began on Microsoft's Windows 98 operating system, using pcGRASP Version 6 (http://www.eng.auburn.edu/grasp), a free programming application developed at Auburn University, as our Java programming environment.
pcGRASP was the recommended environment for completing our programming assignments in the Programming Languages Principles (PLP) course taught by Dr. Beverly Sanders. The engine of this home personal computer (PC) consisted of 164 megabytes of random access memory (RAM), a 450-megahertz Pentium II processor, and 8 gigabytes of hard disk space. Previously installed software, along with important documents and files, occupied almost 50% of the hard disk, leaving roughly 4 gigabytes available during execution. The combined data sets, stored in flat files, consume about 1.5 gigabytes of disk space. Although we used samples of the data during the experiments, we suspected that the remaining 2.5 gigabytes of disk space would be inadequate. We therefore purchased and installed a 20-gigabyte hard drive as the primary master partition, moving the contents of the 8-gigabyte disk to the new one. Prior to the installation of additional software, this gave us a total of 21.5 gigabytes of free space: 13.5 on the C:\ drive and 8 on the newly formatted D:\ drive. It was rather difficult to produce samples of the data set from flat files, so the search for a database began. The limited availability of resources restricted our options to either Sybase or Microsoft Access. The obvious choice, since Sybase is Unix-based, was Microsoft Access. Microsoft Access was able to handle the large amount of data; however, it took several hours to load (import) the data, and the database only created a link from the table defined in Access to the flat file that contained the data, which would definitely have had a negative effect on performance. Fortunately, the DBCenter (http://www.cise.ufl.edu/dbcenter) acquired a license for Oracle 8i. Oracle 8i, however, only supports imported data that result from the export utility of a previous version of Oracle.
We unsuccessfully attempted to use Oracle's SQL*Loader utility to load our delimited flat-file data into the database, due to various data type incompatibilities with the syntax needed for this utility's control file (see Figure 5.1).

LOAD DATA
INFILE 'd:\thesis\code\click_data_7500.txt'
REPLACE
INTO TABLE clickdata
TRAILING NULLCOLS
(
  CUSTOMER_ID   INTEGER EXTERNAL TERMINATED BY ',',
  SESSION_ID    INTEGER EXTERNAL TERMINATED BY ',',
  PRODUCT       CHAR TERMINATED BY ',',
  FLAGS         INTEGER EXTERNAL TERMINATED BY ',',
  HIT_NUMBER    INTEGER EXTERNAL TERMINATED BY ',',
  TIMESTAMP     DATE "yyyyddmm hh.mm.ss",
  REFERRAL_URL  CHAR TERMINATED BY ','
)

Figure 5.1 A sample SQL*Loader control file

A typical control file (.ctl) would not specify the data types of each field, because the utility requires the table to exist in the Oracle database prior to loading data into it. However, if the format of the data confuses the tool, one must specify the data types per column in the control file. After obtaining a copy of IBM's DB2 application, several prerequisites had to be met prior to installing the software. DB2 version 7, Personal or Enterprise Edition, requires the user to have administrative privileges on the operating system. Windows 98 does not support administrative users, which prohibited the installation; therefore, we decided to change the operating system to Windows 2000 Professional Edition. After installing DB2 version 7.2 FixPak 5 and creating the structured query language (SQL) statements to define the tables that store the data, we loaded the data from the flat files into the database using DB2's import wizard in a matter of minutes. The data set used in the experimental portion of the thesis is from the KDD Cup 2000 competition. It contains clickstream and order information from an ecommerce website that went out of business after only a short period of existence.
A clickstream can be defined as a sequential record of a user's navigational path throughout a website visit. Order data includes product information, the number of items purchased, and so on. The clickstream data is significantly larger (over 700,000 rows) than the order data; however, both files (in our case, tables) may be used in the web mining process. The clickstream data provided was collected over roughly two months, January 30, 2000 through March 31, 2000, but contained 98 (out of 217) attribute column values (per row of data) that were either missing or null. To improve scalability, we chose to use a sample of the data. We chose the first 7,500 data-intensive rows for our research purposes for two reasons: the sample represents roughly one percent of the entire clickstream data set, and it is approximately twice the number of rows provided for the order data (3,465 rows). The majority of the sample click data ranges from Sunday, January 30, 2000 through Tuesday, February 2, 2000. The order data, which we used in its entirety, remains within the two-month timeframe and has only 6 columns out of 232 that were deemed irrelevant. Although close to 50 percent of the click data columns were not conducive to our research, we were still able to gain valuable knowledge from the clustering results of the data set because of the significance of the remaining columns. In the next section, we discuss the clustering results from mining both the order and click data. When discussing the efficiency of our algorithm, we use the following notation:

m   number of k-means passes over a data set
m'  number of k-means passes over a buffer refill
n   number of data points
b   size of the buffer, as a fraction of n
d   number of dimensions
k   number of clusters

Using this notation, the time complexity of the standard k-means algorithm becomes, more specifically, O(nkdm), where m grows slowly with n [EFL00].
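To give a concrete sense of these magnitudes, consider the sample sizes used in our experiments (n = 7500 rows, k = 9 clusters, d = 8 attributes). The pass count m = 10 below is an assumed figure chosen purely for illustration, not a measured one, and the counts are proportional costs that ignore constant factors:

```latex
\begin{aligned}
\text{multi-pass $k$-means: } & nkdm = 7500 \times 9 \times 8 \times 10 = 5{,}400{,}000\\
\text{single pass ($m = 1$): } & nkd  = 7500 \times 9 \times 8 = 540{,}000
\end{aligned}
```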
For our algorithm, which scans the data only once, m is always equal to one. This not only reduces the computational time to O(nkd), it also removes the computational time necessary for cluster refinement (i.e., similarity measurements). As for disk I/O complexity, for the standard k-means algorithm it is O(ndm): the number of points times the number of dimensions times the number of passes over the data set [EFL00]. Our algorithm passes over the data once; therefore its disk I/O complexity is O(nd).

5.2 Web Clusters

The software tool used in our experimental section uses its own core data mining technology to uncover high-value intelligence from large amounts of enterprise data, including transaction data such as that generated by point-of-sale, automatic teller machine (ATM), credit card, call center, or ecommerce applications. Early releases of this industry-leading tool embodied proven data mining technology and scalability options while placing significant emphasis on usability and productivity for data mining analysts. The version used for these experiments places an increased focus on bringing the value of data mining to more business intelligence users by broadening access to mining functions and results at the business analyst's desktop. The types of mining functions available with this tool include association, classification, clustering, sequential patterns, and similar sequences. We compare and contrast our k-means clustering results with the results of the tool's clustering function.

[Figure 5.2 Order clustering results: a pie chart of the sizes of the nine clusters produced by our algorithm.]

Our example involved eight attributes from the order data pertaining to consumers' weekly purchasing habits, such as the weekday, time of day, location, and order amount, represented using nine clusters. Figures 5.2 and 5.3 graphically display the number of data points present in each individual cluster using our method and the software tool, respectively.
The cluster sizes differ by as little as 3% (Cluster 4) and as much as 22% (Cluster 6) because the clustering results have different representations in the two applications. Table 5-1 elaborates on the nine clusters for the two applications. Our algorithm, by design, sorts the data points in ascending order before clustering and calculating the centroid values, creating a set of clusters as diverse as that of the tool. The software tool's results are obtained from a modular standpoint, in which frequency statistics of the raw data are emphasized. In our implementation, the analysis of the raw values, which are printed to a file before the data point is calculated, aids in determining the categorization of each cluster. Although the resulting clusters from the tool differ in size and data representation from our results, we show that the knowledge gained from our algorithm is potentially just as useful.

[Figure 5.3 Data mining software order clustering results: a pie chart of the sizes of the nine clusters produced by the tool.]

The information provided in Table 5-1 is indicative of the relationship between the cluster percentages mentioned in the previous paragraph. For example, the Cluster 4 results of the two techniques are the most similar, while the Cluster 6 results are the most dissimilar. The statistical results of the tool, however, comprise the most frequently occurring values of the active fields (attributes), which may lead to the analyst making decisions based on assumptions about the raw data rather than on knowledge gained from the raw data itself. In the tool's results, there was no information pertaining to male shoppers. Our results, in contrast, did not rely on any modular calculation, and provided monthly and age ranges in conjunction with location and sex, allowing decision-making based on factual data instead of generalizations.
Regardless of the application used to analyze the data, it would be nearly impossible to gain knowledge from the data if a human viewed it in its original state. Both applications aid the business user considerably with their clustering results, with the software tool having the edge because of its visualization and reporting tools. Nevertheless, our numerical representation of the results brought us to the same conclusion(s) as their visualizations: California-dwelling women who spent under $12 per order dominated the consumer base, which means that the company needed to advertise more items (perhaps higher-priced items as well) for women to maintain its current customers, while targeting men in the very near future to gain new customers. The previous statement may seem intuitive; however, if this company had had tools to perform this analysis back in 2000, it might still be in business today!

Table 5-1 Cluster representations (for each cluster, the first description is from our algorithm, the second from the data mining software tool)

CLUSTER 1
  Ours: Predominantly women, ages 26-58, living in CA, who shop Tuesday through Friday
  Tool: Women from San Francisco, CA who shop on Mondays at 1pm
CLUSTER 2
  Ours: Men, 28-50 years of age, who usually shop on the weekend
  Tool: Women from New York, NY who shop on Wednesdays at 10am
CLUSTER 3
  Ours: Mix of men and women shoppers from all over who do not average $12 per order
  Tool: Women from Texas who shop on Tuesdays at 5pm, spending $13.95
CLUSTER 4
  Ours: Women ages 26-58 who shop Tuesday through Saturday
  Tool: Wednesday shoppers at 8pm from CA
CLUSTER 5
  Ours: Women who spent at least $22 on their purchase, from all over the US, all week
  Tool: 36-year-old women from Hermosa Beach, CA who usually shop on Thursdays at 1am
CLUSTER 6
  Ours: Texans (unspecified sex), ages 22-52, who shop mostly on Friday
  Tool: New York-dwelling women, shopping on Tuesdays at 4pm
CLUSTER 7
  Ours: Thursday shoppers, where the men are from the mid and upper west and the women from eastern states
  Tool: Women from PA shopping on Wednesdays at 7am, but no later than 10pm (all week)
CLUSTER 8
  Ours: Women ordering between 8am and 9am, Thursday through Sunday
  Tool: 36-year-old women who spend over $12/order and shop on Wednesdays at 7pm
CLUSTER 9
  Ours: Women shoppers of unspecified ages from TX and NY
  Tool: Thursday shoppers of unspecified age and sex, from Stamford, CT

CHAPTER 6
CONCLUSION

6.1 Contributions

This thesis, simply stated, has improved the time complexity of a widely used pre-existing algorithm and demonstrated its value if used appropriately by a profit-seeking corporation. Our version of the k-means algorithm effectively removed two expensive operations from the original algorithm, namely the refinement step(s), which include scanning the data set multiple times and recalculating the representative points (centroids) of each cluster. The implementation presented in this paper reduces the execution time of the algorithm by a factor of m, the number of k-means passes over a data set, while also excluding the optional computations necessary for cluster refinement (i.e., similarity measurements, etc.), bringing our total run time to O(nkd), where k is the number of clusters and d is the number of dimensions (or active attributes). Since our algorithm scans the data only once, the disk I/O is also reduced by a factor of m, giving a disk I/O complexity of O(nd). We also showed that our algorithm, when used as the clustering technique during the pattern discovery phase of the web usage mining process, performs comparably to an industry-leading data mining tool.

6.2 Proposed Extensions and Future Work

We chose to leave the comparison of our algorithm to the standard k-means algorithm for future work. This would require a slight variation when implementing the original algorithm so that it accepts not only numerical data, but also non-numerical and alphanumerical data as input. Another potential research interest would be to develop a schema or warehouse to store both the navigational and purchasing data and mine them as one unit. Usage data collection over the web is incremental and distributed by its very nature.
Valuable information about the data could be extracted if all the data were integrated before mining. However, in the distributed case, collecting data from all possible server logs is both non-scalable and impractical, mainly because of the networking issues involved. Hence, there needs to be an approach whereby knowledge mined from various logs can be integrated into a more comprehensive model. As a continuation of that issue, the creation of intelligent tools that can assist in the interpretation of mined knowledge remains an open problem. Such tools would assist the business analyst by revealing commonalities or "obvious" trends sooner, allowing him or her to focus on the non-intuitive results.

LIST OF REFERENCES

[AS94] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules," In J.B. Bocca, M. Jarke, and C. Zaniolo, editors, Proceedings of the Twentieth International Conference on Very Large Data Bases (VLDB), p. 487-499, Morgan Kaufmann, 1994.

[AlD95] M.B. Al-Daoud, The Development of Clustering Methods for Large Geographic Applications, doctoral dissertation, School of Computer Studies, University of Leeds, 1995.

[ABKS99] M. Ankerst, M. Breunig, H.-P. Kriegel and J. Sander, "OPTICS: Ordering Points To Identify the Clustering Structure," In Proceedings of the ACM SIGMOD'99 International Conference on Management of Data, Philadelphia, p. 49-60, 1999.

[BS02] P. Batista and M.J. Silva, "Mining Web Access Logs of an On-line Newspaper," Second International Conference on Adaptive Hypermedia and Adaptive Web-Based Systems, Workshop on Recommendation and Personalization in E-Commerce, Malaga, Spain, May 2002.

[Bez81] J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press, New York, 1981.

[BH93] J.C. Bezdek and R.J. Hathaway, "Switching Regression Models and Fuzzy Clustering," IEEE Transactions on Fuzzy Systems, Vol. 1, No. 3, p. 195-204, 1993.

[BP94] K. Bharat and J.E.
Pitkow, "WebViz: A Tool for WWW Access Log Analysis," In Proceedings of the First International Conference on the World-Wide Web, 1994.

[BLMN99] S.S. Bhowmick, E.P. Lim, S. Madria and W.K. Ng, "Research Issues in Web Data Mining," In Proceedings of the First International Conference on Data Warehousing and Knowledge Discovery (DaWaK99), p. 303-312, 1999.

[Bla02a] P.E. Black, "Euclidean Distance," National Institute of Standards and Technology (NIST), http://www.nist.gov/dads/HTML/euclidndstnc.html (October 2002).

[Bla02b] P.E. Black, "Manhattan Distance," National Institute of Standards and Technology (NIST), http://www.nist.gov/dads/HTML/manhttndstnc.html (October 2002).

[Bla02c] P.E. Black, "Hamming Distance," National Institute of Standards and Technology (NIST), http://www.nist.gov/dads/HTML/hammingdist.html (October 2002).

[BL99] J. Borges and M. Levene, "Data Mining of User Navigation Patterns," In Proceedings of the Workshop on Web Usage Analysis and User Profiling (WEBKDD'99), p. 31-36, San Diego, CA, August 15, 1999.

[BF98] P.S. Bradley and U.M. Fayyad, "Refining Initial Points for K-means Clustering," In Proceedings of the Fifteenth International Conference on Machine Learning, p. 91-99, Morgan Kaufmann, San Francisco, CA, 1998.

[BFR98] P.S. Bradley, U.M. Fayyad, and C.A. Reina, "Scaling Clustering Algorithms to Large Databases," In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, p. 9-15, New York, NY, August 27-31, 1998.

[CY00] W.L. Chang and S.T. Yuan, "A Synthesized Learning Approach for Web-Based CRM," In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery in Databases (KDD'2000), p. 43-59, Boston, MA, August 20, 2000.

[CCMT97] M. Charikar, C. Chekuri, T. Feder and R. Motwani, "Incremental Clustering and Dynamic Information Retrieval," In Proceedings of the Twenty-ninth Annual ACM Symposium on Theory of Computing, p. 626-635, 1997.

[CSZ98] S. Chatterjee, G. Sheikholeslami and A.
Zhang, "WaveCluster: A Multi Resolution Clustering Approach for Very Large Spatial Databases," In Proceedings of the Twentyfourth International Conference on Very Large Data Bases, p. 428439, August 1998. [CGHK97] S. Chee, J. Chen, Q. Chen, S. Cheng, J. Chiang, W. Gong, J. Han, M. Kamber, K. Koperski, G. Liu, Y. Lu, N. Stefanovic, L. Winstone, B. Xia, O. R. Zaiane, S. Zhang and H. Zhu, "DBMiner: A System for Data Mining in Relational Databases and Data Warehouses," In Proceedings CASCON'97: Meeting ofMinds, p. 249260, Toronto, Canada, November 1997. [Coo00] R. Cooley, Web Usage Mining: Discovery and Application of interesting Patterns from Web data, doctoral dissertation, Department of Computer Science, University of Minnesota, May 2000. [CDSTOO] R. Cooley, M. Deshpande, J. Srivastava and PN. Tan, "Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data," SIGKDD Explorations, Vol. 1, Issue 2, 2000. [CMS97] R. Cooley, B. Mobasher and J. Srivastava, "Web Mining: Information and Pattern Discovery on the World Wide Web," In Proceedings of the Ninth IEEE International Conference on Tools i ith Artificial Intelligence (ICTAI'97), 1997. [CMS99] R. Cooley, B. Mobasher and J. Srivastava, "Creating Adaptive Web sites through Usagebased Clustering of Urls," In IEEE Knowledge andData Engineering Workshop (KDEX'99), November 1999. [CS97] M.W. Craven and J.W. Shavlik, "Using Neural Networks for Data Mining," Future Generation Computer Systems, Vol. 13, p. 211229, 1997. [Dyr97] C. Dyreson, "Using an Incomplete Data Cube as a Summary Data Sieve," Bulletin of the IEEE Technical Committee on Data Engineering, p. 1926, March 1997. [EFLOO] C. Elkan, F. Fanstrom and J. Lewis, "Scalability for Clustering Algorithms Revisited," SIGKDD Explorations, Vol. 2, No. 1, p. 5157, June 2000. [EKSX96] M. Ester, HP. Kriegel, J. Sander and X. 
Xu, "A DensityBased Algorithm for Discovering Clusters in Large Spatial Databases with Noise," In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD'96), Portland, Oregon, August 1996. [Fas99] D. Fasulo, "An Analysis of Recent Work on Clustering Algorithms," Technical report, University of Washington, 1999. [For65] E. Forgy, "Cluster Analysis of Multivariate Data: Efficiency vs. Interpretability of Classifications," Biometrics 21:768, 1965. [FKN95] H. Frigui, R. Krishnapuram and O. Nasraoui, "Fuzzy and Possibilistic Shell Clustering Algorithms and their Application to Boundary Detection and Surface Approximation: Parts I and II," IEEE Transactions on Fuzzy Systems, Vol. 3, No. 1, p. 2960, 1995. [GRS98] S. Guha, R. Rastogi and K. Shim, "CURE: An Efficient and Scalable Subspace Clustering for Very Large Databases," In Proceedings ofACMSIGMOD International Conference on Management ofData, p. 7384, New York, NY, 1998. [Han99] J. Han, "Data Mining," In J. Urban and P. Dasgupta (eds.), Encyclopedia of Distributed Computing, Kluwer Academic Publishers, Boston, MA, 1999. [HN94] J. Han and R. Ng, "Efficient and Effective Clustering Method for Spatial Data Mining," In Proceedings of 1994 International Conference on Very Large Data Bases (VLDB'94), p. 144155, Santiago, Chile, September 1994. [HHK02] W. Hardle, Z. H1ivka and S. Klinke, "XploRe Applications Guide," Quantlets, http://www.quantlet.de/scripts/xag/htmlbook/xploreapplichtmlnode54.html (August 2002). [JK96] J. Jean and H.K. Kim, "Concurrency Preserving Partitioning (CPP) for Parallel Logic Simulation," In Proceedings of Tenth Workshop on Parallel and Distributed Simulation (PADS'96), p. 98105, May 1996. [JK98] A. Joshi and R. Krishnapuram, "Robust Fuzzy Clustering Methods to Support Web Mining," In S. Chaudhuri and U. Dayal, editors, In Proceedings ACMSIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, June 1998. [KR90] L. Kaufman and P. J. 
Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, Inc., 1990. [KK93] R. Keller and R Krishnapuram, "A Possibilistic Approach to Clustering," IEEE Transactions on Fuzzy Systems, Vol. 1, No. 2, p. 98110, 1993. [KolOl] E. Kolatch, "Clustering Algorithms for Spatial Databases: A Survey," Dept. of Computer Science, University of Maryland, College Park, 2001. [LOPZ97] W. Li, M. Ogihara, S. Parthasarathy and M.J. Zaki, "New Algorithms for Fast Discovery of Association Rules," In Proceedings of Third International Conference on Knowledge Discovery and Data Mining (KDD), August 1997. [LVV01] A. Likas, N. Vlassis and J.J. Verbeek, "The Global Kmeans Clustering Algorithm," Technical report, Computer Science Institute, University of Amsterdam, The Netherlands, February 2001. IASUVA0102. [LW02] P. J. Lingras and C. Chad West, "Interval Set Clustering of Web Users with Rough Kmeans," submitted to the IEEE computer for publication, 2002. [LRZ96] M. Livny, R. Ramakrishnan and T. Zhang, "BIRCH: An Efficient Data Clustering Method for Very Large Databases," In Proceedings of the Fifteenth ACM SICACTSICMODSICART Symposium on Principles of Database Systems: PODS 1996. [Mac67] J. MacQueen, "Some Methods for Classification and Analysis ofMultivariate Observations," In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics andProbability, Vol. I, Statistics, L. M. LeCam and J. Neyman editors, University of California Press, 1967. [Mas02] H. Masum, "Clustering Algorithms," Active Interests, http://www.carleton.ca/hmasum/clustering.html (August 2002). [MWY97] R. Muntz, W. Wang and J. Yang, "STING: A Statistical Information Grid Approach to Spatial Data Mining," In Proceedings of the Twentythird International Conference on Very Large Databases, p. 186195, Athens, Greece, August 1997. [Myl02] P. 
Myllymaki, "Advantages of Bayesian Networks in Data Mining and Knowledge Discovery," Complex Systems Computation Group, Helsinki Institute for Information Technology, http://www.bayesit.com/docs/advantages.html (October 2002). [Paw82] Z. Pawlak, "Rough Sets," International Journal ofInformation and Computer Sciences, Vol. 11, p. 145172, 1982. [Rei99] T. Reiners, "Mahalanobis Distance," Distances, http://server3.winforms.phil.tu bs.de/treiners/diplom/node3 1 .html (October 2002). [Sam90] H. Samet, The Design and Analysis of Spatial Data Structures, Addison Wesley, Reading, MA, 1990. [The02] K. Thearling, "Data Mining and Customer Relationship," Data Mining White Papers, http://www.thearling.com/text/whexcerpt/whexcerpt.htm (October 2002). [WanOO] Y. Wang, "Web Mining and Knowledge Discovery of Usage Patterns," CS 748T Project (Part I), http://db.uwaterloo.ca/tozsu/courses/cs748t/surveys/wang.pdf (February, 2000). BIOGRAPHICAL SKETCH Darryl M. Adderly, born September 2, 1976, to Renia L. Adderly and Kevin A. Adderly in Miami, Florida, was raised as a military child up until age thirteen when his mother, younger sister (Kadra T. Adderly), and he moved back to Miami where he earned his high school diploma at Miami Northwestern Senior High in June 1994. He began his college career in Tallahassee, Florida at Florida Agricultural & Mechanical University, earning his Bachelor of Science in computer information systems (science option) with a mathematics minor in May 1998. After spending one year working as a software engineer in Raleigh, North Carolina, Darryl was accepted into the University of Florida's computer and information science and engineering graduate program. With the coursework requirements completed, he opted to return to the industry as a software developer for another year. In the fall of 2002, he returned to Gainesville, Florida, to complete and defend his thesis on Web data mining to receive his Master of Science degree. 
Darryl is an ambitious, hard-working, analytical, and astute individual with a thirst for knowledge in all facets of life. He enjoys cardiovascular activities, weight lifting, football, basketball, and golf (although he is still a novice!). Outdoor activities (such as camping, white-water rafting, and hiking) and traveling are at the top of his list of things to do once he obtains his master's degree.