
Data Mining Meets E-Commerce: Using Data Mining to Improve Customer Relationship Management



DATA MINING MEETS E-COMMERCE: USING DATA MINING TO IMPROVE CUSTOMER RELATIONSHIP MANAGEMENT

By

DARRYL M. ADDERLY

A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE

UNIVERSITY OF FLORIDA

2002


Copyright 2002 by Darryl M. Adderly


I would like to dedicate this thesis to a recent blessing in my life, Buttons Kismet Adderly.


ACKNOWLEDGMENTS

I would like to first thank God for providing the opportunity and giving me the strength to complete this thesis. I really appreciate Dr. Joachim Hammer's patience and guidance throughout the duration of this process. I thank Ardiniece "Nisi" Caudle and John "Jon B." Bowers for assisting me with administrative items. I thank the Office of Graduate Minority Programs (OGMP) for the financial assistance. I would also like to thank my Bible study group (Adrian, JD, Jonathan, Kamini, and Ursula) for all of their prayers and spiritual support, and last but not least Jean-David Oladele for the friendship and support up until the very last minute.


TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

1 INTRODUCTION
  1.1 Motivation for Research
  1.2 Thesis Goals

2 RESEARCH BACKGROUND
  2.1 Association Rule Mining
  2.2 Clustering
    2.2.1 Partitioning Algorithms
    2.2.2 Hierarchical Algorithms
    2.2.3 Density-based Methods
    2.2.4 Grid-based Methods
    2.2.5 K-means

3 GENERAL APPROACH TO WEB USAGE MINING
  3.1 The Mining of Web Usage Data
    3.1.1 Pre-processing Data for Mining
    3.1.2 Pattern Discovery
    3.1.3 Pattern Analysis
  3.2 Web Usage Mining with k-means
    3.2.1 Our Web Usage Mining Approach

4 ARCHITECTURE AND IMPLEMENTATION
  4.1 Architecture Overview
    4.1.1 Phase 1: Pre-processing
    4.1.2 Phase 2: Pattern Discovery
    4.1.3 Phase 3: Pattern Analysis
  4.2 Algorithm Implementation


5 PERFORMANCE ANALYSIS
  5.1 Experimental Evaluation
  5.2 Web Clusters

6 CONCLUSION
  6.1 Contributions
  6.2 Proposed Extensions and Future Work

LIST OF REFERENCES
BIOGRAPHICAL SKETCH


LIST OF TABLES

2-1 Data Mining Algorithms
5-1 Cluster representations


LIST OF FIGURES

3.1 High Level Web Usage Mining Process
4.1 Our Web Usage Mining Architecture
4.2 The ReadData module
4.3 The ClusterValues module
5.1 A sample SQL*Loader control file
5.2 Order clustering results
5.3 Data Mining Software Order clustering results


Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Master of Science

DATA MINING MEETS E-COMMERCE: USING DATA MINING TO IMPROVE CUSTOMER RELATIONSHIP MANAGEMENT

By

Darryl M. Adderly

December 2002

Chair: Joachim Hammer
Major Department: Computer and Information Science and Engineering

The application of data mining techniques to the World Wide Web, referred to as Web mining, enables businesses to use knowledge discovered from the past to understand the present and make critical business decisions about the future. For example, this can be done by analyzing the Web pages that visitors have clicked on, the items that they have selected or purchased, or the registration information provided while browsing. To perform this analysis effectively, businesses find the natural groupings of users, pages, etc., by clustering the data stored in their Web logs. The standard k-means algorithm, an iterative refinement algorithm, is one of the most popular clustering methods used today, and it has proven to be an efficient clustering technique. However, its numerous iterations over the data set and repeated re-calculation of cluster centroid values are time-consuming. In this thesis, we improve the time complexity of the standard algorithm. Our single-pass, non-iterative k-means algorithm scans the data only once, calculating all the point and centroid values


based on the desired attributes of interest, and places the items within their respective cluster thresholds. Our Web mining process consists of three phases, pre-processing, pattern discovery, and pattern analysis, which are described in detail in the thesis. We use our implementation of the k-means algorithm to uncover meaningful Web trends and, after analyzing the results, to provide recommendations that might have improved the visitors' website experience. We find that the clustering results of our algorithm provide the same amount of knowledge for analysts as one of the industry's leading data mining applications.


CHAPTER 1
INTRODUCTION

Consumers are conducting business via the Internet more than ever before due to the affordable cost of high-speed Internet service providers (ISPs) and the high level of security (secure transactions). However, recognition of a company's on-line presence alone does not ensure long-lived prosperity. Customer retention and satisfaction strategies remain among the most important issues for organizations expecting profits. Thus, companies work hard to improve and/or maintain their customer relationships. To achieve this, companies must capture the navigational behavior of visitors on their website in a web log and subsequently analyze this data to understand and address their consumers' business needs.

1.1 Motivation for Research

The relationship between companies and customers has evolved into a significant research concept called Customer Relationship Management (CRM). One definition of CRM is a process that manages the interactions between a company and its customers [The02]. CRM solutions create a mutually beneficial relationship between the customer and the organization and are critical to a company's future success. The ultimate goals of CRM are to acquire new customers, retain old customers, and increase customer profitability [CY00]. In the current economic slowdown, companies are using their limited budgets to reduce operational costs or increase revenues while concentrating on improving efforts to acquire new customers and develop customer loyalty. The sources


of web-based CRM customer data (user profiles, access patterns for pages, etc.) are derived from customers' web interactions.

The advent of the World Wide Web (WWW) has caused an evolution of the Internet. Information is now readily available from any location in the world at any hour of the day. Information on the WWW is not only important to individuals, but also to business organizations for critical decision-making. This explosion of information sources on the web has increased the necessity of automated tools to find the desired resources and to track and analyze usage patterns.

An electronic trail of data is left behind each time a user visits a website. The megabytes and gigabytes of data logged from these trails seem not to yield any information at first glance. However, when analyzed intelligently, those logs contain a wealth of information providing valuable knowledge for business intelligence solutions. Early attempts to understand the data with statistical tools and on-line analytical processing (OLAP) systems achieved limited success, that is, until the concept of data mining was introduced. Data mining is the process of discovering hidden, interesting knowledge from large amounts of data stored in databases, data warehouses, or other information repositories. Web data mining, or web mining, can be broadly defined as the discovery and analysis of useful information from Web data. On-line businesses learn from the past, understand the present, and plan for the future by mining, analyzing, and transforming records into meaningful information.

Web mining, when viewed in data mining terms, can be said to have three operations of interest: clustering (finding natural groupings of users, pages, etc.), associations (which URLs tend to be requested together), and sequential analysis (the order in


which URLs tend to be accessed) [JK98]. Although the first two have proven to be of greater interest, this research heavily favors the use of clustering techniques and algorithms to support web mining.

Data clustering is a process of partitioning a set of data into a set of classes, called clusters, with the members of each cluster sharing some interesting common properties [CGHK97]. Clustering itself is the process of organizing similar items into disjoint groups. Investigating the properties of the set of items belonging to each group illuminates relationships that might otherwise have been overlooked. The k-means algorithm is one of the most widely used techniques for clustering [Al-D95]. It has been shown to be effective in producing good clustering results for many practical applications. The two main goals of clustering techniques are to ensure that the data within each distinct cluster are homogeneous (items in a group are similar) and that each cluster differs from the others (data belonging to one cluster should not be present in another cluster).

The k-means algorithm is an iterative refinement algorithm that takes as input the number k of pre-defined clusters. "Means" simply represents the average, as in the average location of all members of a particular cluster, conceptualized as the centroid. The centroid of a cluster, often termed the representative element, is an artificial point in the space of records that represents the average location. The time complexity of the k-means algorithm is heavily dependent on the point (centroid) selection process of its first step. Some implementations require either user-provided or randomly generated starting points, but most implementations of the k-means algorithm do not address the issue of initialization at all. The remaining steps of the algorithm focus on minimizing the intra-cluster error (among items belonging to the same cluster) by


using a distance function (e.g., the Euclidean distance [Bla02a] or Manhattan distance [Bla02b]) and maximizing the inter-cluster separation between data items of different clusters. The standard algorithm typically requires many iterations over a data set to converge to a solution, accessing each data item on each iteration. This approach may be sufficient for small data sets, but it is clearly inefficient when scanning large data sets. The k-means algorithm has proven to be well suited when the resulting clusters have similar spherical shapes. However, when data items in a given cluster are closer to the center of another cluster than to that of their own (for example, when clusters have widely different sizes or convex shapes), this algorithm may not be as useful. In comparison with other clustering methods, the revised k-means-based methods are promising for their efficient processing of large data sets; however, their use is often limited to numeric data. For these reasons, we propose yet another version of the k-means algorithm to improve performance when applied to large data sets of high dimensionality. In addition, very little research has applied the k-means algorithm to web log data because of its non-numeric nature. In our experimental section, we show that our algorithm, applied to web mining, is comparable to, and in some instances outperforms, the clustering technique of one of the industry's leading data mining applications.

1.2 Thesis Goals

In web mining, the goal is to uncover meaningful web trends in order to understand and improve the visitors' website experience. Clustering techniques are exercised to enable companies to find the natural groupings of customers. The standard k-means algorithm, by design, optimally partitions a data set into clusters of similar data items, after which the human analytical process begins.
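For reference, the partition that k-means seeks, and the two distance functions mentioned above, can be stated with the standard textbook formulas (supplied here for clarity; they are not drawn from the thesis itself). Writing mu_j for the centroid of cluster C_j over n-dimensional records:

\[
J \;=\; \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
d_{\mathrm{Euclidean}}(x, y) = \sqrt{\sum_{a=1}^{n} (x_a - y_a)^2},
\qquad
d_{\mathrm{Manhattan}}(x, y) = \sum_{a=1}^{n} \lvert x_a - y_a \rvert .
\]

Each assignment and re-centering pass of the iterative refinement can only decrease J, which is why the standard algorithm converges, but possibly only after many scans of the data.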


In this thesis, we have developed a single-pass, non-iterative k-means algorithm. We will attempt to improve the time complexity of the standard algorithm, without refining the initial points, when it is applied to large data sets. The traditional algorithm repeats the clustering steps until cluster assignment has been exhausted, scanning the data set as often as necessary. Multiple scans of the data set increase cluster quality at the expense of execution time. Many data sets are large and cannot fit into main memory, and repeatedly scanning a data set stored on disk or tape is time-consuming. Our algorithm scans a portion of the data set (residing in memory) only once, calculating all the point values and finally clustering the items accordingly. We use only a sample and a reduced number of attributes for the sake of efficiency and scalability with respect to large databases. Dead clusters are created when a centroid does not have any members in its cluster, which may arise from bad initialization. We plan to address this issue by calculating the centroids based on the number of clusters k and the deviation between the minimum and maximum point values. The application should handle all the data types accepted by the database application, some of which are very complex (e.g., hypertext data). Applying the k-means algorithm to the data allows us to group customers together on the basis of similarity with respect to the chosen attributes and, after analyzing the results, to get a good grasp of consumer behavior and make intelligent predictions about their future behavior. Visitor behavioral predictions serve as a good starting point for improving a website's navigational experience. The suggestions and/or recommendations resulting from the analysis need to be implemented to discover the true success of the algorithm. The data set used in the experimental section was obtained


from the KDD Cup 2000¹ competition and contains data from an e-commerce site that no longer exists; therefore we were unable to confirm the predictions made from our analysis of the results. We will show that our method is superior in speed when compared to the standard k-means algorithm, while maintaining a cluster quality comparable to that of one of the industry's leading data mining products.

The rest of this thesis is organized as follows. Chapter 2 provides background on related research. Chapter 3 explains our approach to web mining with k-means. Chapter 4 describes the architecture used for the development of our algorithm and its implementation. Chapter 5 analyzes the performance of our algorithm, and we conclude with a summary of the thesis, a review of our contributions, and future work in Chapter 6.

¹ http://www.ecn.purdue.edu/KDDCUP/


CHAPTER 2
RESEARCH BACKGROUND

Clustering techniques have been applied to a variety of areas including machine learning, statistics, and data and web mining. As widely used as they are, the fundamental clustering problem remains the task of grouping together similar data items of a given data set. There are four main classifications of clustering algorithms: partitioning algorithms, hierarchical algorithms, density-based methods, and grid-based methods. There has been a plethora of proposals to improve or refine existing algorithms for each respective approach. The k-means algorithm, which is classified as a partitioning algorithm, is no exception. Enhancements to the traditional k-means algorithm involve, but are not limited to, refining the initial points, improving scalability with respect to large data sets, minimizing the clustering error, and reducing the number of clustering iterations (data set scans).

Data mining is the process of discovering hidden, interesting knowledge from large amounts of data stored in databases, data warehouses, or other information repositories. The main idea behind data mining is to identify novel, valid, potentially useful, and ultimately understandable patterns in data. The spectrum of uses of data mining tools ranges from financial and telecommunications applications to government policy settings, medical management, and food service menu analysis. Different data mining algorithms are more appropriate for certain types of problems. These algorithms can be classified into two categories: descriptive and predictive. Descriptive data mining describes the data in a summary manner and presents interesting general properties of the


data. Predictive data mining constructs one or more sets of models, performs inference on the available data, and attempts to predict the behavior of new data sets. These two styles are also known as undirected and directed data mining, respectively. The former uses a bottom-up approach, finding patterns in the data and leaving it to the user to determine whether or not these patterns are important. The latter uses a top-down approach and is used when one has a good grasp of what one is looking for or would like to predict, applying knowledge gained in the past to the future. Several classes of algorithms are applicable to data mining, but the most commonly used are association rules [AS94, LOPZ97], Bayesian networks [Myl02], clustering [Fas99], decision trees [Mur98], and neural networks [CS97]. Table 2-1 provides a brief overview of data mining algorithms.

Table 2-1 Data Mining Algorithms

  Association rules: Descriptive and predictive. Determines when items occur together. Common applications: understanding consumer product data.
  Bayesian networks: Predictive. Learns through determining conditional probabilities. Common applications: predicting what a consumer would like to do on a web site based on previous and current behavior.
  Clustering: Descriptive. Identifies and groups similar data. Common applications: determining consumer groups.
  Decision trees: Predictive. A flow chart of if-then conditions leading to a decision. Common applications: predicting credit risk.
  Neural networks: Predictive. Modeled after the human brain; the classic Artificial Intelligence algorithm. Common applications: optical character recognition and fraud detection.

The application of data mining techniques to the WWW, often referred to as web mining, is a direct result of the dramatic increase in Internet usage. Data from the WWW stored in web logs include HTTP request information, client IP addresses, the contents of the website (product information, published articles about the company, etc.), visitor behavior data (navigational paths or clickstream data and purchasing data), and web structure data. Thus, current research efforts in WWW data mining focus on three issues: web content mining, web structure mining, and web usage mining. Web content mining describes the automatic search of information resources available on-line. The automated discovery of web-based information is difficult because of the lack of structure permeating the information sources on the web. Traditional search engines generally do not provide structured information, nor do they categorize, filter, or interpret documents [CMS97]. These factors have prompted researchers to develop


more intelligent tools for information retrieval and to extend data mining efforts to provide a higher level of organization for the semi-structured data available on the web. Web structure mining deals with mining the web document's structure and links to identify relevant documents. Web structure mining is useful in generating information such as visible web documents, luminous web documents, and luminous paths (a path common to most of the results returned) [BLMN99]. Web usage mining is the discovery of user access patterns from web server log data. Companies automatically collect large volumes of data from daily website operations in server access logs. They analyze this web log data to aid in future business decisions. In this thesis, we use


clickstream and purchasing data collected prior to an e-commerce website going out of business. This data set resembles the data used during the web data mining process.

Web mining, when viewed from a data mining perspective, is assumed to have three operations of interest: sequential analysis, associations, and clustering. Sequential analysis provides insight into the order in which URLs tend to be accessed. Determining which URLs are usually requested together (associations) and finding the natural groupings of users, pages, etc. (clustering) are more useful in today's real-world web mining applications.

2.1 Association Rule Mining

Association rule mining is the discovery of association relationships (or correlations) among a set of items. These relationships are often expressed in the form of a rule showing attribute-value conditions that occur frequently together in a given set of data. An example of an association rule would be X => Y, which is interpreted by Jiawei Han [Han99] as: database tuples that satisfy X are likely to satisfy Y. Association algorithms are efficient for deriving rules, but both the support and confidence factors (formally defined below) are key for an analyst to judge the validity and importance of the rules. The support factor indicates the relative occurrence of the detected association rules within the overall data set of transactions, and the confidence factor is the degree to which the rule is true across individual records.

The main goal of association discovery is to find items that imply the presence of other items in the same transaction. It is widely used in transaction data analysis for directed marketing, catalog design, and other business decision-making processes. This technique was a candidate to implement in the experimental section, but clustering proved to be a better fit for our research.
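In terms of transaction counts, the support and confidence factors just mentioned are usually defined as follows (standard definitions, supplied here for reference rather than quoted from the thesis), where D is the set of transactions:

\[
\mathrm{support}(X \Rightarrow Y) = \frac{\lvert \{ T \in D : X \cup Y \subseteq T \} \rvert}{\lvert D \rvert},
\qquad
\mathrm{confidence}(X \Rightarrow Y) = \frac{\mathrm{support}(X \cup Y)}{\mathrm{support}(X)} .
\]

For example, if 2% of all transactions contain both X and Y while 5% contain X, the rule X => Y has 2% support and 40% confidence.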


Association discovery's simplistic nature gives it a significant advantage over the other data mining techniques. It is also very scalable, since it basically counts the occurrences of all possible combinations of items and involves reading a table sequentially from top to bottom each time a new dimension is added. Thus, it is able to handle large amounts of data (in this case, large numbers of transactions). Association rules do not suffer from overfitting, so they tend to generalize better than other types of classifiers. Association rules have some serious limitations, however, such as the number of rules defined: too many rules may overwhelm an inexperienced user, while too few may not suffice. Another drawback is that the rules generated give no information about causation. The rules can only tell what things tend to happen together, without specifying anything about the cause.

2.2 Clustering

Clustering is the task of grouping together similar items in a data set. Clustering techniques attempt to look for similarities and differences within a data set and group similar rows into clusters. A good clustering method produces high-quality clusters, ensuring that the inter-cluster similarity is low and the intra-cluster similarity is high. Clustering algorithms can be classified into four main groups: partitioning algorithms, hierarchical algorithms, density-based algorithms, and grid-based algorithms.

2.2.1 Partitioning Algorithms

Partitioning algorithms attempt to break a data set of N objects into a set of k clusters such that the partition optimizes a given criterion. These algorithms are usually classified as static or dynamic. Static partitioning is performed prior to the execution of the simulation, and the resulting partition is fixed during the simulation [JK96]. Dynamic


partitioning attempts to conserve system resources by combining the computation with the simulation. There are mainly two approaches: the k-means algorithm, where each cluster is represented by the center of gravity of the cluster, and the k-medoid algorithm, where each cluster is represented by one of the objects of the cluster located near its center [CSZ98]. Partitioning applications such as PAM, CLARA, and CLARANS are centered around k-medoids. Other applications involve the traditional k-means algorithm or a slight variation/extension of it, such as our implementation.

PAM (Partitioning Around Medoids) [KR90] uses arbitrarily selected representative objects, called medoids, during its initial steps to find k clusters. Medoids are meant to be the most centralized objects within each cluster. Each non-selected object is thereafter grouped with the medoid to which it is most similar. In each step, a swap between a selected object (medoid) and a non-selected object is made if it would result in an improvement of the quality of clustering. The quality of clustering (i.e., the combined quality of the chosen medoids) is measured by the average dissimilarity values given as input. Experimental results by Kaufman and Rousseeuw have shown PAM to work satisfactorily for small data sets (for example, 100 objects in 5 clusters), but it is not efficient when dealing with medium to large data sets. The slow processing time, which is O(k(N-k)²) [CSZ98] due to the comparison of each object with the entire data set, motivated the development of CLARA.

CLARA (Clustering LARge Applications) relies on sampling to handle large data sets. CLARA draws a sample of a data set, applies PAM to the sample, and then finds the medoids of the sample instead of the entire data set. The medoids of the sample approximate the medoids of the entire data set. Multiple data samples are drawn to


derive better approximations and return the best clustering output. The quality of clustering for CLARA is measured based on the average dissimilarity of all objects in the entire data set, not only of those in the samples. Kaufman and Rousseeuw's experimental results show that CLARA performs satisfactorily for data sets such as one containing 1000 objects in 10 clusters. Since CLARA only applies PAM to the samples, each iteration reduces to O(k(40+k)² + k(N-k)) [KR90], using 5 samples of size 40 + 2k. Although this data set is larger than that used for the PAM experiments, the method is still not ideal for web mining analysis.

CLARANS (Clustering LARge Applications based on RANdomized Search) [HN94] stems from the work done on PAM and CLARA. It relies on a randomized search over a graph of nodes, each of which is represented by a set of k objects, to find the medoids of the clusters. Each node represents a collection of k medoids; therefore it corresponds to a clustering. Each node is assigned a cost, which is the total dissimilarity value between every object and the medoid of its cluster. The algorithm takes as parameters the maximum number of neighbors of a node that can be examined (maxneighbor) and the maximum number of local minima that can be collected (numlocal). After selecting a random node, CLARANS checks a sample of the neighbors of the node, moves to a neighbor based on the cost differential, and continues until the maxneighbor criterion is met. Otherwise, it declares the current node a local minimum and starts a new search for local minima. After the specified number numlocal of local minima has been collected, the best of these local values is recorded as the medoids of the clustering. The PAM algorithm can be viewed as the method used to search for the local minima. For large values of N, examining all k(N-k) neighbors of a node is time-consuming. Although


Ng and Han claim that CLARANS is linearly proportional to the number of points, the time consumed in each step of searching is O((kN)²), making the overall performance at least quadratic [Kol01]. CLARANS, without any extra focusing techniques, cannot handle large data sets. It was also not designed to handle high-dimensional data. Both are characteristics of the data stored in web logs.

2.2.2 Hierarchical Algorithms

Hierarchical algorithms create a hierarchical decomposition of a database. These techniques produce a nested sequence of clusters with a single all-inclusive cluster at the top and single-point clusters at the bottom. The hierarchical decomposition can be represented by a dendrogram, a tree that iteratively splits the database into smaller subsets until each subset consists of only one object [EKSX96]. The dendrogram can be created from the leaves up to the root (agglomerative approach) or from the root down to the leaves (divisive approach) by merging or dividing clusters at each step. Agglomerative hierarchical algorithms begin with each data point as a separate cluster, followed by recursive steps of merging the two most similar (or least expensive) cluster pairs until the desired number of clusters is obtained or the distance between the two closest clusters is above a certain threshold distance. Divisive hierarchical algorithms work by repeatedly partitioning a data set into leaves of clusters. A path down a well-structured tree should visit sets of increasingly tightly related elements, conveniently displaying the number of clusters and the compactness of each cluster.

BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is a clustering method developed to address large data sets and the minimization of input/output (I/O) costs. It incrementally and dynamically clusters incoming multi-dimensional metric


data points to try to produce the best-quality clustering with the available resources (i.e., available memory and time constraints) [LRZ96]. BIRCH typically clusters well with a single scan of the data; optional additional passes can be used to improve the cluster quality further. BIRCH contains four phases, two of which are optional (namely the second and the fourth). During phase one, the data is scanned and the initial tree is built using the given amount of memory and recycling space on disk. The optional phase two condenses the tree by scanning the leaf entries to rebuild a smaller tree, removing outliers and grouping crowded subclusters into larger ones. The application uses a self-created height-balanced Clustering Feature (CF) tree at the core of its clustering step. Each node, or CF vector, of the tree contains the number of data points in the cluster, the linear sum of the data points, and the square sum of the data points. The CF tree has two parameters: the branching factor B and the threshold T. Each non-leaf node contains at most B entries. The tree size is a function of T: the larger T is, the smaller the tree. The mandatory phase three uses a global algorithm to cluster all leaf entries. This global algorithm is a pre-existing method selected before beginning the BIRCH process. BIRCH also allows the user to specify either the desired number of clusters or the desired threshold (in diameter or radius) for clusters. Up to this point, the original data has only been scanned once, although the tree and outlier information may have been scanned multiple times. After phase three, some inaccuracies may exist from the initial creation of the CF tree. Phase four is optional and entails the cost of additional passes over the data to correct those inaccuracies and refine the clusters further. This phase uses the centroids produced in phase three as seeds to migrate points and/or create new clusters. [LRZ96] contains a


performance analysis versus CLARANS. The authors conclude that BIRCH uses much less memory and is faster, more accurate, and less order-sensitive when compared with CLARANS. BIRCH, in general, scales well but handles only numeric data, and the results depend on the order of the records.

CURE (Clustering Using REpresentatives) [GRS98] is a bottom-up (agglomerative) clustering algorithm based on choosing a well-formed group of points to identify the distance between clusters. CURE begins by choosing a constant number c of well-scattered points from a cluster; these points are used to identify the shape and size of the cluster. The next step uses a predetermined fraction between 0 and 1 to shrink the selected points toward the centroid of the cluster. With the new (shrunken) positions of these points identifying the cluster, the algorithm then merges the clusters with the closest pairs of identifying points. This merging continues until the desired number of clusters, k, an input parameter, remains. A k-d tree [Sam90] is used to store the representative points for the clusters.

CURE uses a random sample of the database to handle very large data sets, in contrast with BIRCH, which pre-clusters all the data points for large data sets. Random sampling can eliminate significant input/output (I/O) costs, since the sample may be designed to fit into main memory, and it also helps to filter outliers. If random samples are drawn such that the probability of missing clusters is low, accurate information about the geometry of the clusters is still preserved [GRS98]. CURE partitions and partially clusters the data points of the random sample to speed up the clustering process when sample sizes increase. Multiple representative points are used to label the clusters, assigning each data point to the cluster with the closest representative point. The use of


multiple points enables the algorithm to identify arbitrarily shaped clusters. The worst-case time complexity of CURE is O(n² log n), where n is the number of sampled points, proving to be no worse than BIRCH [Kol01]. The computational complexity of CURE is quadratic with respect to the sample size and is not related to the size of the data set.

2.2.3 Density-based Methods

Density-based clustering algorithms locate clusters by constructing a density function that reflects the spatial distribution of the data points. The density-based notion of a cluster is defined as a set of density-connected points that is maximal with respect to density-reachability. In other words, the density of points inside each cluster is considerably higher than outside the cluster. In addition, the density within the areas of noise is lower than the density in any of the clusters. Two examples of density-based methods are DBSCAN and OPTICS.

DBSCAN (Density Based Spatial Clustering of Applications with Noise) [EKSX96] is a locality-based algorithm relying on a density-based notion of clustering, which states that within each cluster the density of points is significantly higher than the density of points outside the cluster [Kol01]. The algorithm uses two parameters, Eps and MinPts, to control the density of the clusters. Eps represents the neighborhood of a point (a radius), and MinPts is the minimum number of points that must be contained in the neighborhood of that point in the cluster. DBSCAN discovers clusters of arbitrary shapes, can distinguish noise, and requires only one input parameter. That input value is a major drawback, however, because the user must manually determine Eps for each run of the algorithm. The runtime of the algorithm, O(N log N), does not factor in the significant calculation time of Eps, so it is


very misleading. The algorithm can handle large amounts of data, but it is not designed to handle higher-dimensional data.

OPTICS (Ordering Points To Identify the Clustering Structure) [ABKS99] is a cluster analysis algorithm that creates an augmented ordering of the database representing its density-based clustering structure. This differs from traditional clustering methods' purpose of producing an explicit clustering of the data set. The cluster ordering contains information that is equivalent to the density-based clusterings corresponding to a broad range of parameter settings. OPTICS works in principle like an extended DBSCAN algorithm for an infinite number of distance parameters Epsi that are smaller than a generating distance Eps (i.e., 0 <= Epsi <= Eps). However, instead of assigning cluster memberships, this algorithm stores the objects in the order they are processed, together with the information that an extended DBSCAN algorithm would use to assign cluster memberships (if it were possible to do so for an infinite number of parameters). This information consists of only two values: the core-distance and the reachability-distance. The core-distance of an object p is the smallest distance such that the neighborhood of p with that radius contains at least MinPts objects. The reachability-distance of an object p with respect to a core object o is the smallest distance such that p is directly density-reachable from o. The OPTICS algorithm creates an ordering of a database, additionally storing the core-distance and a suitable reachability-distance for each object. Objects that are directly density-reachable from a current core object are inserted into a seed-list for further expansion. The seed-list objects are sorted by their reachability-distance to the closest core object from which they are directly density-reachable. The reachability-distance for each object is determined with respect to the center-object. Objects that are not yet in the priority


queue (seed-list) are inserted with their reachability-distance. If the new reachability-distance of an object is smaller than its previous reachability-distance and the object already exists in the queue, it is moved further toward the top of the queue. [ABKS99] performed extensive performance tests using different data sets and different parameter settings to show that the run-time of OPTICS is nearly the same as that of DBSCAN. If OPTICS scans through the entire database, the run-time is O(N²). If a tree-based spatial index can be used, the run-time is reduced to O(N log N). For medium-sized data sets, the cluster ordering can be represented graphically, and for very large data sets, OPTICS extends a pixel-oriented visualization technique to present the attribute values belonging to different dimensions.

2.2.4 Grid-based Methods

Grid-based algorithms quantize the space into a finite number of cells and then do all operations on the quantized space. These approaches tend to have fast processing times that depend only on the number of cells in each dimension of the quantized space and remain independent of the number of data objects. Grid-based techniques such as STING [MWY97] and WaveCluster [CSZ98] have linear computational complexity and are very efficient for large databases; however, they are not typically feasible for analyzing web logs. Grid-based methods are more applicable to spatial data mining. Spatial data mining is the extraction of implicit knowledge and spatial relations, and the discovery of interesting characteristics and patterns that are not explicitly represented in the databases. Spatial data geometrically describes information related to the space occupied by objects. The data may be either a single point in multi-dimensional space (discrete) or it may span a region of space (continuous). Huge amounts of spatial


data may be obtained from satellite images, medical imagery, Geographic Information Systems, etc., making it unrealistic to examine spatial data in detail.

2.2.5 K-means

As mentioned earlier in this chapter, we now revisit the various contributions, improvements, and modifications to the standard k-means algorithm. Historically known as Forgy's method [For65] or MacQueen's algorithm [Mac67], the k-means algorithm has emerged as one of the most widely used techniques for solving clustering problems. The process consists of three main steps [HHK02]:

1. Partition the items into k initial clusters.
2. Proceed through the list of items, assigning each item to the cluster whose centroid (mean) is nearest. Recalculate the centroid for the cluster receiving the new item and for the cluster losing the item.
3. Repeat step 2 until no more assignments take place.

Step 1 may be completed in one of three ways: randomly select k points to represent each cluster, require the user to enter k initial points, or use the first k points to represent each cluster. Most implementations randomly select k representative objects (centroids) to start the process. [BF98] use the following observation to illustrate the importance of good initial points: an initial cluster center that attracts no data may remain empty, while a starting point with no empty clusters usually produces better solutions. Our version of the algorithm does not address the initialization issue; others that do assume the initial points are either user-provided or randomly chosen. Duda and Hart mention a recursive method, [CCMT97] takes the mean of the entire data set and randomly perturbs it k times, and [BFR98] refine the initial points using small random sub-samples of the data. The latter approach is primarily intended to work on large databases. As a database's size increases, efficient and accurate initialization becomes critical. When applied to an appropriately sized random


subsample of the database, they show that accurate clustering can be achieved with improved results over the classic k-means. The only memory requirement of this refinement algorithm is to hold a small subsample in RAM, allowing it to scale easily to very large databases.

Moving on to the remaining steps of the algorithm, the main focus is to optimize the clustering criterion. The most widely used criterion is the clustering error, which for each point computes its squared distance from the corresponding cluster center and then sums these distances over all points in the data set [LVV01]. The Intelligent Autonomous Systems group [LVV01] has proposed the global k-means algorithm, a deterministic global clustering method that employs the k-means algorithm as a local search procedure. It is an incremental approach that dynamically adds one cluster center at a time through a deterministic global search consisting of N executions of the k-means algorithm (where N is the size of the data set) from suitable initial positions. To solve a clustering problem with M clusters, it solves all intermediate problems with 1, 2, ..., M-1 clusters sequentially. The underlying principle of this method is that an optimal solution for a clustering problem with M clusters can be obtained by using the k-means algorithm to conduct a series of local searches: each local search keeps M-1 cluster centers at their optimal positions from the (M-1)-cluster problem and starts the new center from a candidate position in the data space. Since for M=1 the optimal solution is known, the global algorithm can iteratively apply the above procedure to find optimal solutions for all k-clustering problems, k = 1, ..., M. In terms of computational complexity, the method requires N executions of the k-means algorithm for each value of k (k = 1, ..., M). The experimental results show that


for a small data set (for example, N = 250 and M = 15) the performance of this method is excellent; however, the technique has not been tested on large-scale data mining problems.

Recursive iterations can be expensive when applying the k-means algorithm. To reduce the time complexity as well as the number of iterations, and to increase the scalability of k-means clustering for large data sets, single-pass k-means algorithms were introduced [BFR98]. The main idea is a buffer in which points from the data set are saved in compressed form. The first step is to initialize the means of the clusters as with the standard k-means. The next step is to fill the buffer completely with points from the database, followed by a two-phase compression process. The first phase, called primary compression, identifies points that are unlikely to ever move to a different cluster, using two methods. The first measures the Mahalanobis distance [Rei99] from each point to the cluster mean (centroid) it is associated with and discards the point if it is within a certain radius. The second method involves creating confidence intervals for each centroid. Then, a worst-case scenario is set up by perturbing the centroids within the confidence intervals with respect to each point: the centroid associated with the point is moved away from the point, and the means of all other clusters are moved toward the point. If the point is still closest to the same cluster mean after the perturbations, it is unlikely to change cluster membership. Points that are unlikely to change are removed from the buffer and placed in a discard set of one of the main clusters. The second phase, called secondary compression, aims to save buffer space by storing some auxiliary clusters instead of individual points. During this stage, another k-means clustering is performed, with a larger number


of clusters than for the main clustering, on the remaining points in the buffer. The points in these auxiliary clusters must satisfy a tightness criterion (remain below a certain threshold). After primary and secondary compression, the freed buffer space is filled with new points and the whole procedure is repeated. The algorithm ends after one scan of the data set, or earlier if the centers of the main clusters do not change significantly as more points are added. A special case of the algorithm of [BFR98], not mentioned in their paper, is to discard all the points in the buffer each time. The algorithm is [EFL00] (a code sketch is given at the end of this section):

1. Randomly initialize the cluster means. Let each cluster have a discard set in the buffer that keeps track of the sufficient statistics for all points from previous iterations.
2. Fill the buffer with points.
3. Perform iterations of k-means on the points and discard sets in the buffer, until convergence. For this clustering, each discard set is treated like a regular point placed at the mean of the discard set, but weighted with the number of points in the discard set.
4. For each cluster, update the sufficient statistics of the discard set with the points assigned to the cluster. Remove all points from the buffer.
5. If the data set is exhausted, then finish. Otherwise, repeat from step 2.

According to the lesion experiment in [EFL00], for synthetic data sets of 1,000,000 points, 100 dimensions, and 5 clusters, the cluster quality of the simple single-pass k-means method is equivalent to that of the standard k-means, but it is more reliable (in terms of the trapping of centers) and about 40% faster than the standard k-means. With real data from the KDD contest data set (95,412 points with 10 clusters), the cluster distortion of the original k-means algorithm was significantly less than that of the simple single-pass algorithm.
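The following sketch illustrates the simple single-pass variant listed above. It is a minimal illustration written for this summary, not the implementation of [EFL00] or of this thesis; the function and variable names are invented, the random initialization stands in for step 1, and each cluster's discard set is reduced to its sufficient statistics (a count and a vector sum).

```python
import numpy as np

def weighted_kmeans(points, weights, centers, iters=100):
    """Inner loop (step 3): assign each point to its nearest center, then
    recompute each center as the weighted mean of its members."""
    for _ in range(iters):
        dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(len(centers)):
            members = labels == j
            if members.any():
                w = weights[members][:, None]
                new_centers[j] = (points[members] * w).sum(axis=0) / w.sum()
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

def single_pass_kmeans(buffers, k, dim, seed=0):
    """One scan over the data: 'buffers' yields successive in-memory blocks
    (2-D arrays) of points; each cluster keeps only (count, sum) of the
    points it has already absorbed, i.e., its discard set."""
    rng = np.random.default_rng(seed)
    centers = rng.standard_normal((k, dim))            # step 1: random init
    counts, sums = np.zeros(k), np.zeros((k, dim))     # discard-set statistics
    for buffer in buffers:                             # step 2: fill the buffer
        kept = counts > 0
        pseudo = sums[kept] / counts[kept][:, None]    # discard sets act as
        pts = np.vstack([buffer, pseudo])              # weighted pseudo-points
        wts = np.concatenate([np.ones(len(buffer)), counts[kept]])
        centers, labels = weighted_kmeans(pts, wts, centers)    # step 3
        for j, x in zip(labels[:len(buffer)], buffer):          # step 4
            counts[j] += 1
            sums[j] += x
        # step 5: the buffer is emptied; continue until the data are exhausted
    return centers
```

The buffered algorithm of [BFR98] additionally applies the primary and secondary compression described above before emptying the buffer; that refinement is omitted here for brevity.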


CHAPTER 3
GENERAL APPROACH TO WEB USAGE MINING

In Chapter 2, we mentioned the categorization of web mining into three areas of interest: web content mining, web structure mining, and web usage mining. Web content mining focuses on techniques for searching the web for documents whose contents meet web users' queries [BS02]. Web structure mining is used to analyze the information contained in links, aiming to generate a structural summary of web sites and web pages. Web usage mining attempts to identify (and predict) web users' behavior by applying data mining techniques to discover usage patterns from their interactions while surfing the web. In this chapter, we introduce our approach to mining web usage data using the k-means algorithm to address the issues identified in Section 1.1.

3.1 The Mining of Web Usage Data

Companies apply web usage mining techniques to understand and better serve the needs of their current customers and to acquire new customers. The process of web usage mining can be separated into three distinct phases: pre-processing, pattern discovery, and pattern analysis [CDST00]. The web usage mining process can also be classified into one of two commonly used approaches [BL99]. One approach applies pre-processing techniques directly to the log data prior to adapting a data mining technique. The other approach maps the usage data from the logs into relational tables before the mining is performed. The sample data we obtained from KDD Cup 2000 were in flat files; therefore, we chose the second of the two approaches for our implementation.
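As a small illustration of the second approach, the sketch below maps a comma-delimited clickstream extract into a relational table. It is only a toy: the thesis itself loads its flat files into Oracle with SQL*Loader (see Figure 5.1), whereas this example uses SQLite, and the file layout and column names are invented for illustration.

```python
import csv
import sqlite3

def load_clickstream(flat_file, db_path="weblog.db"):
    """Map one flat file of click records into a 'clicks' relation so that
    the later mining phases can query it with SQL."""
    con = sqlite3.connect(db_path)
    con.execute("""CREATE TABLE IF NOT EXISTS clicks (
                       session_id   TEXT,
                       request_time TEXT,
                       page_url     TEXT,
                       order_amount REAL)""")
    with open(flat_file, newline="") as f:
        rows = ((r["session_id"], r["request_time"], r["page_url"],
                 float(r["order_amount"] or 0.0))
                for r in csv.DictReader(f))
        con.executemany("INSERT INTO clicks VALUES (?, ?, ?, ?)", rows)
    con.commit()
    con.close()
```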


Figure 3.1 depicts the web usage mining process from a high-level perspective [CMS99]. The subsequent sections of this chapter explain the three phases of the process.

Figure 3.1 High Level Web Usage Mining Process

3.1.1 Pre-processing Data for Mining

The raw data collected in web server logs tend to be abstruse and need to be organized to make them easier to mine for knowledge. Pre-processing consists of converting the usage information contained in the various available data sources into the abstractions necessary for pattern discovery [BS02]. A number of issues in pre-processing data for mining must be addressed prior to applying the mining algorithm. These include developing a model of access log data, developing techniques to filter the raw data to eliminate irrelevant items, grouping individual page accesses into units (i.e., transactions), and specializing generic data mining algorithms to take advantage of the specific nature of the access log data [CMS97].
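To make the filtering and grouping tasks concrete, here is a hedged sketch of the kind of pre-processing described above. It assumes raw entries in the Common Log Format, which is not the format of the KDD Cup 2000 flat files used in this thesis, so it illustrates the general task rather than our actual pipeline; the 30-minute timeout and the list of ignored file extensions are common conventions, not values taken from the thesis.

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta

# Common Log Format: host ident user [timestamp] "method url protocol" status bytes
LOG_LINE = re.compile(r'(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) \S+" (\d{3}) \S+')
NON_PAGE = ('.gif', '.jpg', '.png', '.css', '.js')   # embedded assets, not page views

def clean_and_sessionize(lines, timeout=timedelta(minutes=30)):
    """Drop irrelevant or malformed entries, then group the remaining page
    views into per-visitor sessions separated by an inactivity gap."""
    sessions = defaultdict(list)          # visitor host -> list of sessions
    last_seen = {}
    for line in lines:
        m = LOG_LINE.match(line)
        if m is None:
            continue                      # malformed record: eliminate it
        host, stamp, method, url, status = m.groups()
        if method != "GET" or not status.startswith("2"):
            continue                      # keep only successful page fetches
        if url.lower().endswith(NON_PAGE):
            continue                      # filter images, style sheets, scripts
        t = datetime.strptime(stamp.split()[0], "%d/%b/%Y:%H:%M:%S")
        if host not in last_seen or t - last_seen[host] > timeout:
            sessions[host].append([])     # inactivity gap: start a new session
        sessions[host][-1].append(url)    # group page accesses into a unit
        last_seen[host] = t
    return sessions
```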


The first pre-processing task, referred to as data cleaning, essentially eliminates irrelevant items that may impact the analysis results. This involves determining whether there are important accesses or specific access data that are not recorded in the access log. Improving data quality involves user cooperation, which is very difficult (but understandably so) because individuals may feel that the information requested of them violates their privacy. Another pre-processing task is the identification of specific transactions or sessions. The goal of this task is to clearly discern users based on certain criteria (in our case, attributes). The formats of these transactions and/or sessions are tightly coupled with the data collection process; a poor selection of values to collect about the users increases the difficulty of this identification task.

3.1.2 Pattern Discovery

The next phase of the web usage mining process, pattern discovery, varies depending on the needs of the analyst. Algorithms and techniques from various research areas, such as statistics, machine learning, and data mining, are applied during this phase. Our focus is on finding trends in the data by grouping users, transactions, sessions, etc., to understand the behavior of the visitors. Clustering, a data mining technique, is well suited to our desired results. Web usage mining can facilitate the development and execution of future marketing strategies and promote efficient and effective web site management by analyzing the results of clustered web log data.

There are different ways to break down the clustering process. One way is to divide it into five basic steps [Mas02]:

1. Pre-processing and feature selection. Most clustering models assume all data items are represented by n-dimensional feature vectors. To improve the scalability of the problem space, it is often desirable to choose a subset of all the features (attributes)


available. During this first step, the appropriate features are chosen, and the appropriate pre-processing and feature extraction are performed on the data items to measure the values of the chosen feature set. This step requires a good deal of domain knowledge and data analysis. Note: do not confuse this step with the pre-processing step of web usage mining; this step is done after the data has been cleansed.

2. Similarity measure. This is a function that receives two data items (or two sets of data items) as input and returns a similarity measure between them as output. Item-item versions include the Hamming distance [Bla02c], Mahalanobis distance, Euclidean distance, inner product, and edit distance. Item-set versions use any of the item-item versions as subroutines and include the max/min/average distance; another approach evaluates the distance from the item to a representative of the set, where point representatives (centroids) are chosen as the mean vector/mean center/median center of the set, and hyperplane or hyperspherical representatives of the set can also be used.

3. Clustering algorithm. Clustering algorithms generally use particular similarity measures as subroutines. The choice of clustering algorithm depends on the desired properties of the final clustering and on the time and space complexity. Clustering user information or data items from web server logs aids companies with web site enhancements such as automated return mail to visitors falling within a specific cluster, or dynamically changing a particular site for a customer/user on a return visit based on past classifications of that visitor [CMS99].

4. Result validation. Do the results make sense? If not, we may want to iterate back to a prior stage. It may also be useful to test for clustering tendency, to estimate whether clusters are present at all. Note: any clustering algorithm will produce some clusters regardless of whether or not natural clusters exist.

5. Result interpretation and application. Typical applications of clustering include data compression (representing data samples by their cluster representative), hypothesis generation (looking for patterns in the clustering of data), hypothesis testing (e.g., verifying feature correlation or other data properties through a high degree of cluster formation), and prediction (once clusters have been formed from the data and characterized, new data items can be classified by the characteristics of the cluster to which they would belong).

3.1.3 Pattern Analysis

The final stage of web usage mining is pattern analysis. The discovery of web usage patterns would be meaningless without mechanisms and tools to help analysts better understand them. The main objective of pattern analysis is to eliminate irrelevant

PAGE 38

rules or patterns and extracting the interesting rules or patterns from the output of the previous stage (pattern discovery). The output of web mining algorithms, in its original state, is usually incomprehensible to the naked eye and thus must be transformed into a more readable format. These techniques have been drawn from fields such as statistics, graphics and visualization, and database querying.

Visualization techniques have been very successful in helping people understand various kinds of phenomena. Bharat and Pitkow [BP94] proposed a web path paradigm in which sets of server log entries are used to extract subsequences of web traversal patterns called web paths, along with the development of their WebViz system for visualizing WWW access patterns. Through the use of WebViz, analysts are provided the opportunity to filter out any portion of the web deemed unimportant and selectively analyze those portions of interest.

In [Dyr97], OLAP tools proved to be applicable to web usage data since the analysis needs were similar to those of a data warehouse. The rapid growth of access information increases the size of the server logs quite expeditiously, reducing the possibility of providing on-line analysis of all of it. Therefore, to make on-line analysis feasible, there is a need to summarize the log data.

Query languages allow an application or user to express what conditions must be satisfied by the data it needs rather than having to specify how to get the required data [CMS97]. Potentially, a large number of patterns may be mined, thus a mechanism to specify the focus of analysis is necessary. One approach would be to place constraints on the database to restrict mining to a certain portion of the database. Another method would
be to perform the querying on the knowledge that has been extracted by the mining process, which would require a language for querying knowledge rather than data.

3.2 Web Usage Mining with k-means

The algorithms used for most of the initial web mining efforts were highly susceptible to failure when operating on real data, which can be quite noisy. In [JK98], Joshi and Krishnapuram introduce some robust clustering methods. Robust techniques typically deal only with a single component and thus increase the complexity when applied to multiple clusters. Fuzzy clustering techniques are capable of addressing the problem of multiple clusters. Fuzzy clustering provides a better description tool when clusters are not well separated [Bez81], which may happen during web mining. Fuzzy clustering for grouping web users has been proposed in [BH93], [FKN95], and [KK93]. Rough set theory [Paw82] has been considered an alternative to fuzzy set theory. There is limited research on clustering based on rough set theory. Lingras and West [LW02] adapted the k-means algorithm to find cluster intervals of web users based on rough set theory. They applied a pre-processing technique directly to the log data prior to applying a data mining technique. This was possible because of their involvement in the data collection process, which allowed them to filter information into specific pre-defined categories before mining the data. After applying the k-means method, they analyzed the data based on the knowledge of the initial classifications.

3.2.1 Our Web Usage Mining Approach

In this thesis, our approach was indirectly imposed on us by the original format of the log data. We chose the second of the two approaches mentioned in Section 3.1, while still applying the three-phase process also described in that section. In the pre-processing phase, we convert the flat files into relational tables to utilize the advantages
of structured query languages to retrieve desired data from the logs. The feature selection step of our pattern discovery phase is taken as input from the analyst (or user of our algorithm). We chose to implement a variation of the k-means algorithm due to its computational strengths for large data sets. For pattern analysis, we graphed the results discovered in the previous phase to improve human comprehension of the knowledge. The next chapter describes the architecture and implementation strategies for our k-means algorithm when used in conjunction with web mining.
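Because the cleansed logs live in relational tables, feature selection can be expressed as an ordinary SQL query whose result is exported to a delimited file for clustering. The following Java/JDBC fragment is only a hedged illustration of that idea: the table name CLICKSTREAM, the output file name, and the connection URL are assumptions, not part of the thesis implementation, although the session columns are among those named in Chapter 4.

    // Hedged illustration only: CLICKSTREAM, features.csv, and the JDBC URL are
    // assumed names, not taken from the actual implementation.
    import java.io.PrintWriter;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class FeatureExport {
        public static void main(String[] args) throws Exception {
            try (Connection con = DriverManager.getConnection("jdbc:db2:weblogs");
                 Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery(
                     "SELECT SESSION_ID, SESSION_FIRST_REQUEST_DAY_OF_WEEK FROM CLICKSTREAM");
                 PrintWriter out = new PrintWriter("features.csv")) {
                while (rs.next()) {
                    // Export the analyst-selected attributes as a comma-delimited row.
                    out.println(rs.getString(1) + "," + rs.getString(2));
                }
            }
        }
    }

The exported delimited file then serves as the cleansed input that the clustering code reads, as described in the next chapter.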

CHAPTER 4
ARCHITECTURE AND IMPLEMENTATION

The web usage mining process discussed in Section 3.1 is commonly used throughout the research community. The architecture of our web usage mining solution encompasses most of the phases and steps mentioned in Chapter 3; however, choosing to use our version of k-means as our clustering method led to the exclusion of a few steps. Another reason for omitting steps was our lack of input into the data collection process. Section 4.1 will provide insight into our architectural structure and Section 4.2 will explain the details of our k-means implementation.

4.1 Architecture Overview

Our algorithm's architectural structure consists of two Java modules carrying out three execution phases. The first class, namely ReadData, accepts the user input, reads the data from the files, and clusters the data points accordingly. The ClusterValues class maintains cluster information such as the number of points in each cluster, all of the point values in each cluster, and the centroid value of the cluster. The three phases have the same goals as those mentioned in the previous chapter for the web usage mining process; however, our clustering algorithm implementation gave us the freedom to omit time-consuming steps.

The architecture divides the web usage mining process into two main parts. The first part involves the usage domain-dependent processes of transforming the web data into a suitable transaction form. The second part includes the application of our k-means algorithm for data mining, together with pattern matching and analysis techniques. Figure 4.1 depicts the architecture for our web usage mining project. This section describes the
steps taken to complete each phase in the process. The next section explains our algorithm in its entirety in conjunction with the modular interaction.

Figure 4.1 Our Web Usage Mining Architecture

4.1.1 Phase 1 – Pre-processing

We began our pre-processing phase with the data already condensed in one format, flat files, as our input. Typical web usage data exists in web server logs, referral logs, registration files, and index server logs. Intelligent integration and correlation of information from these diverse sources can reveal usage information that may not be evident from any one of them individually. We have assumed that the content of these files was already in its integrated state when obtained from KDD Cup 2000.

The data learning task of our pre-processing phase primarily involved improving the understandability of the data. Column names and, in some instances, a list of column values for the comma-delimited flat files were provided; however, the values were still difficult to discern. We decided to convert the flat files into relational tables to both match the column values with their column names and take advantage of the data retrieval methods provided by relational database management systems (RDBMS) during the mining stage. After transforming the format of the data, we removed empty-valued
columns and those columns deemed uninteresting and/or unnecessary for our desired results at this stage of the process.

The transaction identification task of this phase distinguishes independent users, transactions, or sessions. This task is simplified when the data collected is carefully selected and conducive to the overall objectives of the mining process. The data set used in this thesis was divided into two tables: one containing the visitors' click-stream data, the other customer order information. We did not apply any identification techniques to the data; we simply learned the data itself and focused on attributes/columns that were relevant to a user, transaction, or session. For example, the click-stream data has session-related attributes (e.g., SESSION_ID, SESSION_FIRST_REQUEST_DAY_OF_WEEK) that we used to identify sessions. At this point, we retained the data comprising specific users, transactions, and sessions in the tables for further refinement in the next phase, pattern discovery.

4.1.2 Phase 2 – Pattern Discovery

As we enter the pattern discovery phase, we would like to reiterate our web usage mining goal of finding trends in the data to understand the behavior of the visitors. Clustering techniques used in this research area group together similar users based on the analyst-specified parameters. We begin the clustering process by reducing the dimensionality of the data set during the pre-processing/feature selection step. This step allows the analyst to select the attributes necessary to explore the targeted regions of the data set. This pre-processing clustering step differs from the pre-processing phase of web usage mining because it identifies the features needed as input for the clustering algorithm specifically, as opposed to the general information resulting from data cleaning
and transaction identification. The columns chosen during this step represent the n-dimensional feature vectors.

The heart of this thesis lies in the next step, the clustering technique selection and implementation. We chose to use the popular k-means algorithm because of its ability to produce good clusters and its efficient processing of large data sets. There has been limited research done using k-means for web mining outside of the fuzzy and rough set approaches mentioned in Chapter 3. Section 4.2 explains the implementation of our version of the k-means algorithm. After executing the algorithm, we reviewed the results to judge their legitimacy. If the results seemed unreasonable, we returned to the feature selection step to refine our query. This refinement process is intended to assist in finding patterns in the clustering, also known as hypothesis generation. Hypothesis generation exposes trends in the data. We could also have used the results to predict the future behavior of the customers if this website still existed. The analysis of those results could have helped the company retain its customers and acquire new ones, and therefore might have prevented it from going under.

4.1.3 Phase 3 – Pattern Analysis

The pattern analysis phase provides tools and mechanisms to improve analysts' understanding of the patterns discovered in the previous phase. During this phase, we eliminated content (patterns) that did not reveal useful information. We did not use a tool to aid in our analysis. Instead, we used a non-automated graphing method to visualize our results. The visualization depicted the mined data in a manner that permitted the extraction of knowledge by the analyst.
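As a hedged illustration of the kind of summary that feeds such a graph, the small helper below turns raw cluster sizes into the percentage breakdown that appears later in Figures 5.2 and 5.3. The class and method names are ours, chosen only for this sketch, and are not part of the thesis implementation.

    // Hypothetical helper: convert raw cluster sizes into a percentage breakdown
    // suitable for graphing (cf. Figures 5.2 and 5.3).
    public class ClusterSummary {
        static double[] clusterPercentages(int[] clusterSizes) {
            int total = 0;
            for (int size : clusterSizes) {
                total += size;
            }
            double[] pct = new double[clusterSizes.length];
            if (total == 0) {
                return pct;                         // no points clustered yet
            }
            for (int c = 0; c < clusterSizes.length; c++) {
                pct[c] = 100.0 * clusterSizes[c] / total;
            }
            return pct;
        }
    }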

4.2 Algorithm Implementation

The pattern discovery phase is a critical component of the web mining process and usually adopts one of several techniques to complete successfully: statistical analysis, association rules, classification, sequential patterns, dependency modeling, or clustering [Wan00]. Statistical analysis of the information contained in a periodic web system report can be potentially useful for improving system performance, enhancing the security of a system, facilitating the site modification task, and providing support for marketing decisions [Coo00]. In web usage mining, association rules refer to sets of pages that are accessed together with a support value exceeding some specified threshold. Classification techniques are used to establish a profile of users belonging to a particular class or category by mapping users (based on specific attributes) into one of several predefined classes. Sequential pattern analysis aims to retrieve subsequent item sets in a time-ordered set of sessions or episodes to help place appropriate advertisements for certain user groups. Dependency modeling techniques display significant dependencies among the various variables in the web domain to provide a theoretical framework for analyzing user behavior and predicting future web resource consumption. Clustering techniques group together data items with similar characteristics. In our research, we would like to extract knowledge from the data set based on specific attributes of interest. Cluster analysis grants the opportunity to achieve such a goal. In Section 2.2, we discussed various clustering techniques and algorithms.

Web server log data files can grow exponentially depending on the amount of data collected per visit by the user. The navigational and purchasing information collected for our data set totaled approximately 1.7 gigabytes over a period of two months back in the year 2000, when the concept of web data mining was in its infancy. The current data
collection methods and techniques are far more advanced and may collect the same amount of data daily. Therefore, the clustering algorithm needed for the pattern discovery phase had to be reliable and efficient when applied to large data sets. The traditional k-means algorithm would suffice; however, there are a few characteristics of our data set that expose drawbacks in the algorithm. The total number of attributes in the combined data files is 449 (217 click-stream and 232 purchasing). We needed to reduce the vector dimensionality and use a representative sample set of the data to improve the scalability and efficiency of the algorithm, respectively. Web logs contain non-numeric and alphanumeric data, which are both prohibited as input for the standard k-means algorithm. Our algorithm must therefore accept non-numeric values as input to the clustering process. In this section, we discuss our version of the k-means algorithm and how it addresses the issues above.

Recall from Section 4.1 the feature selection step in the pattern discovery phase. This step essentially covers the first two tasks of our algorithm and requires user input. The first task is entering the desired number of clusters, with the maximum number being ten. Excessive clusters create a dilution of the data, which could potentially further complicate the analysis. The other user-required input is the set of attributes to query. An arbitrary selection of these attributes produces meaningless clusters. This task requires at least some knowledge of the data as well as a predetermined goal. Querying completely unrelated attributes could return interesting results; however, that is unlikely. The pre-processing phase cleanses and organizes the data to prepare it for pattern discovery. The first two tasks of our algorithm, which actually serve as
the feature selection step, allow the analyst to select specific attributes to mine for knowledge.

Figure 4.2 The ReadData module

The values mentioned in the previous paragraph are collected in the main method of the ReadData class, shown in Figure 4.2, which is used to implement our algorithm. Once these two values have been determined, we call the method located at line 16 of Figure 4.2, readInLine(), to perform the grunt work of the implementation. This method begins with reading the first line of the file specified in the class constructor. The target file would contain sample data that had been generated from a simple query run against one or both
of the tables. The results of the query would then be exported to a delimited file and essentially serve as the cleansed version of the log file. As the first line of data is parsed, we smoothly transition into the next task of our algorithm, which involves calculating the data point values. The number n of attributes selected during the feature selection step determines the vector size. The values in a web log can be numeric, non-numeric, or alphanumeric so, unlike traditional k-means algorithms, our algorithm must support all three value types. We handle this issue by using the ASCII (American Standard Code for Information Interchange) value of each character, digit, and special character for computation. We begin by calculating the value for each individual attribute Ai, where i = 0, ..., n-1, of the n-dimensional vector:

(1)   A_i = \frac{1}{d} \sum_{j=0}^{d-1} c_j      where d is the length of the attribute's character array and c_j is the ASCII value of its j-th character

Next, we compute the vector value of the entire row of n attributes. This is done by dividing the sum of the individual attribute values Ai by the number of columns:

(2)   R_m = \frac{1}{n} \sum_{i=0}^{n-1} A_i      where m is the row number in the table and n is the number of selected columns

After R1 is computed, it becomes the minimum value (min) by default. The next nonequivalent row vector value Rm detected replaces the min if it is lower than R1, or it becomes the maximum value (max) if it is higher. The point values computed after the max and min values have been selected are compared to both values, which are replaced accordingly, if necessary. We then subtract the min from the max value to determine the range of the points.
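To make Equations (1) and (2) concrete, the following minimal Java sketch computes the per-attribute value and the row vector value from the ASCII codes of the characters. The class and method names are illustrative only; they are not the actual ReadData code.

    // Illustrative sketch of Equations (1) and (2); names are hypothetical,
    // not the actual ReadData implementation.
    public class RowValue {

        // Equation (1): average ASCII value of the characters in one attribute.
        static double attributeValue(String attribute) {
            if (attribute.length() == 0) {
                return 0.0;                          // guard against empty fields (an assumption)
            }
            double sum = 0.0;
            for (int j = 0; j < attribute.length(); j++) {
                sum += attribute.charAt(j);          // code of the j-th character
            }
            return sum / attribute.length();         // divide by the array length d
        }

        // Equation (2): average of the attribute values of one row (one data point).
        static double rowValue(String[] selectedColumns) {
            double sum = 0.0;
            for (String col : selectedColumns) {
                sum += attributeValue(col);
            }
            return sum / selectedColumns.length;     // divide by the number of columns n
        }

        public static void main(String[] args) {
            // Example row with two selected attributes (hypothetical values).
            String[] row = {"1357", "Sunday"};
            System.out.println(rowValue(row));
        }
    }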

(3)   diff = max - min

The diff value obtained at the end of the third task is the numerator of the fraction used to compute the cluster thresholds. The denominator of that fraction is the number of clusters provided by the analyst during the feature selection step.

(4)   t = diff / k      where k is the number of clusters

The threshold value t is not itself the threshold of each individual cluster, but it is used when computing the upper boundary of each cluster. For example, the threshold interval for the first cluster, t1, ranges from the min to the min plus t minus one hundred-thousandth (0.00001), both endpoints inclusive. Continuing to the threshold of the second cluster, t2, the minimum value of t2 would be min1 plus t, and the maximum value would be min2 (the minimum value of the second cluster) plus t minus one hundred-thousandth. The previous paragraph can be represented mathematically as:

(5)   t_1 = [min_1, min_1 + t - 0.00001]      where min_1 is the minimum point value (min)
      t_2 = [min_2, max_2]      where min_2 = min_1 + t and max_2 = min_2 + t - 0.00001
      ...
      t_k = [min_k, max_k]      where min_k = min_(k-1) + t and max_k = min_k + t - 0.00001

If a point value falls between two consecutive threshold intervals, it is rounded to the nearest hundred-thousandth and clustered accordingly, without changing its original value in the cluster. We chose the hundred-thousandth figure because most of the data points were calculated to that precision. Once the final row vector value Rm has been calculated, the data points, the min and max, and the cluster thresholds have all been determined, and each data point has been placed in its proper cluster after only one scan of the data set.

Figure 4.3 The ClusterValues module

The final step of our algorithm calculates the centroids (representative points) for each cluster. The centroid computation takes place in the method beginning at line 29, calculateCentroids(), of the ClusterValues class displayed in Figure 4.3. The ClusterValues module shown in Figure 4.3 is the structure responsible for maintaining all the relevant information about each cluster, such as the point value(s) and the number of points present in the cluster. The addClusterValues() method, which starts at line 19 in Figure 4.3, requires the cluster number, the point value, and the element number of the cluster, all of which are calculated in ReadData.readInLine(). These values are stored in Java's Vector (java.util.Vector) object and retrieved in ClusterValues.calculateCentroids() to calculate the centroid value. We perform this task by dividing the sum of the point values in a specific cluster by the number of points in that cluster, provided the cluster contains any point values. This point represents the mean of the cluster, computed without measuring the distance between each point and the centroid. This permits the exclusion of step two mentioned in Section 3.1.2 and therefore reduces the computational complexity.

If you refer back to Section 2.2.5, you will notice several differences between the procedures used to implement our k-means algorithm and other implementations. The first significant difference appears as early as the first step, the selection of the initial points. These initial points influence the clustering results tremendously. In most cases, these points are randomly selected, which may require numerous executions of the algorithm or a large amount of knowledge of the data set on the part of the analyst. The former could become tedious and the latter may be an unrealistic expectation. Our first two tasks, projecting the number of clusters needed and selecting the attributes to query, do not require a great deal of knowledge about the data
set. The only prerequisite of our algorithm is a clearly defined goal. This allows the analyst to specify the appropriate number of categories (clusters) based on targeted characteristics (attributes).

Our centroid creation process is performed as the very last task. It is done after all of the vector values (data points) have been calculated and clustered, to determine what the clusters represent. This reduces the algorithm's execution time because it removes from our implementation the similarity measurement task, in which each data point is compared to the centroids using a distance function to identify the shortest distance and cluster that point. The run time is reduced further in our algorithm because we scan and cluster the data only once. Multiple iterations over the data points and re-calculations of the centroids improve the clustering quality at the expense of time. Chapter 5 will present a performance analysis of our algorithm compared to other proposed k-means algorithms and will also show how our method fared against one of the industry's leading data mining applications; Chapter 6 concludes the thesis.
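To tie the pieces together, here is a compact, hypothetical sketch of the single-pass procedure just described. It is not the ReadData implementation itself; it assumes the helper sketches given earlier in this chapter (RowValue, Thresholds, ClusterValues) and an exported comma-delimited input file.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical single-pass driver, assuming the helpers sketched above.
    public class SinglePassKMeans {
        public static double[] cluster(String file, int k) throws Exception {
            List<Double> values = new ArrayList<>();
            double min = Double.MAX_VALUE;
            double max = -Double.MAX_VALUE;

            // One scan of the exported, comma-delimited query results.
            try (BufferedReader in = new BufferedReader(new FileReader(file))) {
                String line;
                while ((line = in.readLine()) != null) {
                    double r = RowValue.rowValue(line.split(","));   // Equations (1) and (2)
                    values.add(r);
                    if (r < min) min = r;
                    if (r > max) max = r;
                }
            }

            // Threshold derivation and assignment use the row values already held in
            // memory, so the data set itself is read from disk only once.
            double t = (max - min) / k;                              // Equations (3) and (4)
            ClusterValues clusters = new ClusterValues(k);
            for (double r : values) {
                clusters.addClusterValues(Math.min((int) ((r - min) / t), k - 1), r);
            }
            return clusters.calculateCentroids();                    // final step: centroids
        }
    }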

CHAPTER 5
PERFORMANCE ANALYSIS

When writing software, the criteria for evaluation typically pertain to the correctness of the algorithm with respect to the specifications and the readability of the code. There are other criteria for judging algorithms that have a more direct relationship to performance, which involve their computing time and storage requirements. The time complexity (or run/execution time) of an algorithm is the amount of computer time it needs to run to completion. The space complexity of an algorithm is the amount of memory it needs to run to completion [HRS98]. The time complexity is based on the access time for each data point, in our case, a row of data. If each row is accessed and re-calculated over multiple iterations, the k-means algorithm can become inefficient for large databases. The space complexity deals with the data set size and the variables that may affect it. We will not evaluate the space complexity of our algorithm.

In the second part of this chapter, we compare the clustering results obtained when applying a leading data mining software package to the KDD Cup 2000 data set with the results obtained when applying our algorithm to the same data. We will show that our k-means method produces clusters of a quality comparable to that of one of the leading data mining tools. We then conclude our research efforts and contributions in the final chapter, Chapter 6.
5.1 Experimental Evaluation

The development of our k-means algorithm initially began on Microsoft's Windows 98 operating system, using pcGRASP Version 6 (http://www.eng.auburn.edu/grasp), a free programming application developed at Auburn University, as our Java programming environment. pcGRASP was the recommended environment for completing our programming assignments in the Programming Languages Principles (PLP) course instructed by Dr. Beverly Sanders. The engine of this home personal computer (PC) consisted of 164 megabytes of random access memory (RAM), a 450 megahertz Pentium II processor, and 8 gigabytes of hard disk space. Previously installed software along with important documents and files occupied almost 50% of the hard disk, leaving roughly 4 gigabytes available during execution. The size of the combined data sets, stored in flat files, consumes about 1.5 gigabytes of disk space. Although we used only samples of the data in the experiments, we suspected that the remaining 2.5 gigabytes of disk space would be inadequate. We therefore purchased and installed a 20 gigabyte hard drive as the primary master partition, moving the contents from the 8 gigabyte disk to the new one. Prior to the installation of additional software, this gave us a total of 21.5 gigabytes of free space: 13.5 on the C: drive and 8 on the newly formatted D: drive.

It was rather difficult to produce samples of the data set from flat files, so the search for a database system began. The limited availability of resources restricted our options to either Sybase or Microsoft Access. The obvious choice, since Sybase is Unix-based, was Microsoft Access. Microsoft Access was able to handle the large amount of data; however, it took several hours to load (import) the data, and the database only created a link from the table defined in Access to the flat file that contained the data. This would
definitely have a negative effect on performance. Fortunately, the DBCenter (http://www.cise.ufl.edu/dbcenter) acquired a license for Oracle 8i. Oracle 8i only supports imported data that result from the export utility of a previous version of Oracle. We unsuccessfully attempted to use Oracle's SQL*Loader utility to load our delimited flat file data into the database due to various data type incompatibilities with the syntax needed for this utility's control file (see Figure 5.1).

Figure 5.1 A sample SQL*Loader control file

A typical control file (.ctl) would not specify the data types of each field because the utility requires the table to exist in the Oracle database prior to loading data into it. However, if the format of the data confuses the tool, one must specify the data types per column in the control file. After obtaining a copy of IBM's DB2 application, several prerequisites had to be met prior to installing the software. DB2 version 7, Personal or Enterprise Edition, requires the user to have administrative privileges on the operating system. Windows 98 does not support administrative users, which prohibited the installation; therefore, we decided to change the operating system to Windows 2000
Professional Edition. After installing DB2 version 7.2 fixpack 5 and creating the structured query language (SQL) statements to define the tables that store the data, we loaded the data from the flat files into the database using DB2's wizard for importing data in a matter of minutes.

The data set used in the experimental portion of the thesis is from the KDD Cup 2000 competition. It contains clickstream and order information from an e-commerce website which went out of business after only a short period of existence. A clickstream can be defined as the sequential record of a user's navigational path throughout a website visit. Order data includes product information, the number of items purchased, etc. The clickstream data is significantly larger (over 700,000 rows) than the order data; however, both files (in our case, tables) may be applied to the web mining process. The clickstream data provided was collected for roughly two months (January 30, 2000 through March 31, 2000) but contained 98 (out of 217) attribute column values (per row of data) that were either missing or null. To improve scalability, we chose to use a sample selection of the data. We chose the first data-intensive 7,500 rows of data for our research purposes for two reasons: it represents a little over 10 percent of the entire data set, and it is approximately twice the number of rows provided for the order data (3,465 rows). The majority of the sample click data ranges from Sunday, January 30, 2000 through Tuesday, February 2, 2000. The order data, which we used in its entirety, remains within the two-month timeframe and has only 6 columns out of 232 that were deemed irrelevant. Although close to 50 percent of the click data columns were not conducive to our research, we were still able to gain valuable knowledge from
the clustering results of the data set because of the significance of the remaining columns. In the next section, we discuss the clustering results from mining both the order and click data.

When discussing the efficiency of our algorithm, we use the following notation:

m - number of k-means passes over a data set
m - number of k-means passes over a buffer refill
n - number of data points
b - size of buffer, as a fraction of n
d - number of dimensions
k - number of clusters

Using the above notation, the time complexity of the standard k-means algorithm becomes, more specifically, O(nkdm), where m grows slowly with n [EFL00]. In our algorithm, which scans the data only once, m is always equal to one. This not only reduces the computational time to O(nkd), it also removes the computational time necessary for cluster refinement (i.e., similarity measurements). As for the disk I/O complexity, for the standard k-means it is O(ndm), the number of points times the number of dimensions times the number of passes over the data set [EFL00]. Our algorithm passes over the data once; therefore its disk I/O complexity is O(nd).

5.2 Web Clusters

The software tool used in our experimental section uses its own core data-mining technology to uncover high-value intelligence from large amounts of enterprise data, including transaction data such as that generated by point-of-sale, automatic teller machine (ATM), credit card, call center, or e-commerce applications. Early releases of this industry-leading tool embodied proven data mining technology and scalability
options while placing significant emphasis on usability and productivity for data mining analysts. The version used for these experiments places an increased focus on bringing the value of data mining to more business intelligence users by broadening access to mining functions and results at the business analyst's desktop. The types of mining functions available with this tool include association, classification, clustering, sequential patterns, and similar sequences. We compare/contrast our k-means clustering results with the results of the clustering function of the tool.

Figure 5.2 Order clustering results (percentages of data points per cluster, clusters 1 through 9: 0%, 1%, 4%, 11%, 23%, 29%, 23%, 8%, 1%)

Our example involved eight attributes from the order data pertaining to consumers' weekly purchasing habits, such as the weekday, time of day, location, order amount, etc., represented using nine clusters. Figures 5.2 and 5.3 graphically display the number of data points present in each individual cluster using our method and the software tool, respectively. The cluster sizes differ by at least 3% (Cluster 4) and at most 22% (Cluster 6) because the clustering results have different representations in the two applications. Table 5-1 elaborates on the nine clusters for the two applications. Our
algorithm, by design, sorts the data points in ascending order before clustering and calculating the centroid values, creating a set of clusters as diverse as that of the tool. The software tool's results are obtained from a modular standpoint, where frequency statistics of the raw data are emphasized. In our implementation, the analysis of the raw values, which are printed to a file before the data point is calculated, aids in determining the categorization of each cluster. Although the resulting clusters from the tool differ in size and data representation from our results, we show that the knowledge gained from our algorithm is potentially just as useful.

Figure 5.3 Data Mining Software Order clustering results (percentages of data points per cluster, clusters 1 through 9: 8%, 8%, 11%, 14%, 10%, 7%, 16%, 17%, 9%)

The information provided in Table 5-1 is indicative of the relationship between the cluster percentages mentioned in the previous paragraph. For example, the Cluster 4 results of the two techniques are the most similar, while the Cluster 6 results seem to be the most dissimilar. The statistical results of the tool are comprised of the most frequently used values of the active fields (attributes), which may lead to analysts making decisions
based on assumptions about the raw data and not on the knowledge gained from the raw data itself. In the tool's results, there was no information pertaining to male shoppers. Our data, in contrast, did not specify any modular calculation, but did provide monthly and age ranges in conjunction with location and sex, allowing decision-making based on factual data instead of generalizations.

Regardless of the application used to analyze the data, it would be nearly impossible to gain knowledge from the data if it were viewed by a human in its original state. Both applications aid the business user considerably with their clustering results, with the software tool having the edge because of its visualization and reporting tools. Nevertheless, our numerical representation of the results brought us to the same conclusion(s) as their visualizations: California-dwelling women who spent under $12 per order dominated the consumer base, which means that the company needed to advertise more items (perhaps higher-priced items as well) for women to maintain its current customers, while targeting men in the very near future to gain new customers. The previous statement may seem intuitive; however, if this company had had tools to perform this analysis back in 2000, it might still be in business today!
Table 5-1 Cluster representations

Cluster 1
  Single-pass, non-iterative k-means: Predominantly women, ages 26-58, living in CA, who shop from Tuesday through Friday
  Data Mining Software: Thursday shoppers of unspecified age and sex, from Stamford, CT

Cluster 2
  Single-pass, non-iterative k-means: Men, 28-50 years of age, that usually shop on the weekend
  Data Mining Software: Women from San Francisco, CA that shop on Mondays at 1pm

Cluster 3
  Single-pass, non-iterative k-means: Mix of men and women shoppers from all over, that do not average $12 per order
  Data Mining Software: Women from New York, NY that shop on Wednesdays at 10am

Cluster 4
  Single-pass, non-iterative k-means: Women ages 26-58 that shop Tuesday through Saturday
  Data Mining Software: Women from Texas that shop on Tuesdays at 5pm, spending $13.95

Cluster 5
  Single-pass, non-iterative k-means: Women that spent at least $22 on their purchase, from all over the US, all week
  Data Mining Software: Wednesday shoppers at 8pm from CA

Cluster 6
  Single-pass, non-iterative k-means: Texans (unspecified sex), ages 22-52, who shop mostly on Fridays
  Data Mining Software: 36-year-old women from Hermosa Beach, CA who usually shop on Thursdays at 11am

Cluster 7
  Single-pass, non-iterative k-means: Thursday shoppers where the men are from the mid and upper west, women from eastern states
  Data Mining Software: New York dwelling women, shopping on Tuesdays at 4pm

Cluster 8
  Single-pass, non-iterative k-means: Women ordering between 8am and 9am
  Data Mining Software: Women from PA shopping on Wednesdays at 7am, but no later than 10pm (all week)

Cluster 9
  Single-pass, non-iterative k-means: Thursday-Sunday women shoppers of unspecified ages from TX and NY
  Data Mining Software: 36-year-old women who spend over $12/order, shop on Wednesdays at 7pm
CHAPTER 6
CONCLUSION

6.1 Contributions

This thesis, simply stated, has improved the time complexity of a widely used pre-existing algorithm and demonstrated its value if used appropriately by a profit-seeking corporation. Our version of the k-means algorithm effectively removed two expensive operations from the original algorithm, namely the refinement step(s) that include scanning the data set multiple times and re-calculating the representative points (centroids) of each cluster. The implementation presented in this thesis reduces the execution time of the algorithm by a factor of m, the number of k-means passes over a data set, while also excluding the optional computations necessary for cluster refinement (i.e., similarity measurements, etc.), bringing our total run time to O(nkd), where n is the number of data points, k is the number of clusters, and d is the number of dimensions (or active attributes). Since our algorithm scans the data only once, the disk I/O is also reduced by a factor of m, giving a disk I/O complexity of O(nd). We also showed that our algorithm, when used as the clustering technique during the pattern discovery phase of the web usage mining process, performs comparably to an industry-leading data mining tool.

6.2 Proposed Extensions and Future Work

We chose to leave the comparison of our algorithm to the standard k-means algorithm for future work. This would require a slight variation of the original algorithm so that an implementation of it receives not only numerical data but also non-numerical and alphanumerical data as input. Another potential research interest would be
to develop a schema or warehouse to store both the navigational and purchasing data and mine them as one unit. Usage data collection over the web is incremental and distributed by its very nature. Valuable information about the data could be extracted if all the data were to be integrated before mining. However, in the distributed case, a data collection approach that gathers all possible server logs is both non-scalable and impractical, mainly because of the networking issues involved. Hence, there needs to be an approach where mined knowledge from various logs can be integrated into a more comprehensive model. As a continuation of that issue, the creation of intelligent tools that can assist in the interpretation of mined knowledge remains an open problem. Such tools would assist the business analyst by revealing commonalities or obvious trends sooner, allowing him/her to focus on the non-intuitive results.
LIST OF REFERENCES

[AS94] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules," in J. B. Bocca, M. Jarke, and C. Zaniolo, editors, Proceedings of the Twentieth International Conference on Very Large Data Bases (VLDB), p. 487-499, Morgan Kaufmann, 1994.

[Al-D95] M.B. Al-Daoud, The Development of Clustering Methods for Large Geographic Applications, doctoral dissertation, School of Computer Studies, University of Leeds, 1995.

[ABKS99] M. Ankerst, M. Breunig, H-P. Kriegel and J. Sander, "OPTICS: Ordering Points To Identify the Clustering Structure," in Proceedings of the ACM SIGMOD'99 International Conference on Management of Data, Philadelphia, p. 49-60, 1999.

[BS02] P. Baptist and M.J. Silva, "Mining Web Access Logs of an On-line Newspaper," Second International Conference on Adaptive Hypermedia and Adaptive Web Based Systems, Workshop on Recommendation and Personalization in E-Commerce, Málaga, Spain, May 2002.

[Bez81] J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press, New York, 1981.

[BH93] J.C. Bezdek and R.J. Hathaway, "Switching Regression Models and Fuzzy Clustering," IEEE Transactions on Fuzzy Systems, Vol. 1, No. 3, p. 195-204, 1993.

[BP94] K. Bharat and J.E. Pitkow, "WebViz: A Tool for WWW Access Log Analysis," in Proceedings of the First International Conference on the World-Wide Web, 1994.

[BLMN99] S.S. Bhowmick, E.P. Lim, S. Madria and W-K. Ng, "Research Issues in Web Data Mining," in Proceedings of the First International Conference on Data Warehousing and Knowledge Discovery (DaWaK'99), p. 303-312, 1999.

[Bla02a] P.E. Black, "Euclidean Distance," National Institute of Standards and Technology (NIST), http://www.nist.gov/dads/HTML/euclidndstnc.html (October 2002).

[Bla02b] P.E. Black, "Manhattan Distance," National Institute of Standards and Technology (NIST), http://www.nist.gov/dads/HTML/manhttndstnc.html (October 2002).

[Bla02c] P.E. Black, "Hamming Distance," National Institute of Standards and Technology (NIST), http://www.nist.gov/dads/HTML/hammingdist.html (October 2002).
[BL99] J. Borges and M. Levene, "Data Mining of User Navigation Patterns," in Proceedings of the Workshop on Web Usage Analysis and User Profiling (WEBKDD'99), p. 31-36, San Diego, CA, August 15, 1999.

[BF98] P.S. Bradley and U.M. Fayyad, "Refining Initial Points for K-means Clustering," in Proceedings of the Fifteenth International Conference on Machine Learning, p. 91-99, Morgan Kaufmann, San Francisco, CA, 1998.

[BFR98] P.S. Bradley, U.M. Fayyad, and C.A. Reina, "Scaling Clustering Algorithms to Large Databases," in Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, p. 9-15, New York, NY, August 27-31, 1998.

[CY00] W-L. Chang and S-T. Yuan, "A Synthesized Learning Approach for Web-Based CRM," in Proceedings of the ACM-SIGKDD Conference on Knowledge Discovery in Databases (KDD'2000), p. 43-59, Boston, MA, August 20, 2000.

[CCMT97] M. Charikar, C. Chekuri, T. Feder and R. Motwani, "Incremental Clustering and Dynamic Information Retrieval," in Proceedings of the Twenty-ninth Annual ACM Symposium on Theory of Computing, p. 626-635, 1997.

[CSZ98] S. Chatterjee, G. Sheikholeslami and A. Zhang, "WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases," in Proceedings of the Twenty-fourth International Conference on Very Large Data Bases, p. 428-439, August 1998.

[CGHK97] S. Chee, J. Chen, Q. Chen, S. Cheng, J. Chiang, W. Gong, J. Han, M. Kamber, K. Koperski, G. Liu, Y. Lu, N. Stefanovic, L. Winstone, B. Xia, O. R. Zaiane, S. Zhang and H. Zhu, "DBMiner: A System for Data Mining in Relational Databases and Data Warehouses," in Proceedings of CASCON'97: Meeting of Minds, p. 249-260, Toronto, Canada, November 1997.

[Coo00] R. Cooley, Web Usage Mining: Discovery and Application of Interesting Patterns from Web Data, doctoral dissertation, Department of Computer Science, University of Minnesota, May 2000.

[CDST00] R. Cooley, M. Deshpande, J. Srivastava and P-N. Tan, "Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data," SIGKDD Explorations, Vol. 1, Issue 2, 2000.

[CMS97] R. Cooley, B. Mobasher and J. Srivastava, "Web Mining: Information and Pattern Discovery on the World Wide Web," in Proceedings of the Ninth IEEE International Conference on Tools with Artificial Intelligence (ICTAI'97), 1997.
[CMS99] R. Cooley, B. Mobasher and J. Srivastava, "Creating Adaptive Web Sites through Usage-based Clustering of URLs," in IEEE Knowledge and Data Engineering Workshop (KDEX'99), November 1999.

[CS97] M.W. Craven and J.W. Shavlik, "Using Neural Networks for Data Mining," Future Generation Computer Systems, Vol. 13, p. 211-229, 1997.

[Dyr97] C. Dyreson, "Using an Incomplete Data Cube as a Summary Data Sieve," Bulletin of the IEEE Technical Committee on Data Engineering, p. 19-26, March 1997.

[EFL00] C. Elkan, F. Farnstrom and J. Lewis, "Scalability for Clustering Algorithms Revisited," SIGKDD Explorations, Vol. 2, No. 1, p. 51-57, June 2000.

[EKSX96] M. Ester, H-P. Kriegel, J. Sander and X. Xu, "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise," in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD'96), Portland, Oregon, August 1996.

[Fas99] D. Fasulo, "An Analysis of Recent Work on Clustering Algorithms," Technical report, University of Washington, 1999.

[For65] E. Forgy, "Cluster Analysis of Multivariate Data: Efficiency vs. Interpretability of Classifications," Biometrics, 21:768, 1965.

[FKN95] H. Frigui, R. Krishnapuram and O. Nasraoui, "Fuzzy and Possibilistic Shell Clustering Algorithms and their Application to Boundary Detection and Surface Approximation: Parts I and II," IEEE Transactions on Fuzzy Systems, Vol. 3, No. 1, p. 29-60, 1995.

[GRS98] S. Guha, R. Rastogi and K. Shim, "CURE: An Efficient Clustering Algorithm for Large Databases," in Proceedings of the ACM SIGMOD International Conference on Management of Data, p. 73-84, New York, NY, 1998.

[Han99] J. Han, "Data Mining," in J. Urban and P. Dasgupta (eds.), Encyclopedia of Distributed Computing, Kluwer Academic Publishers, Boston, MA, 1999.

[HN94] J. Han and R. Ng, "Efficient and Effective Clustering Method for Spatial Data Mining," in Proceedings of the 1994 International Conference on Very Large Data Bases (VLDB'94), p. 144-155, Santiago, Chile, September 1994.

[HHK02] W. Härdle, Z. Hlávka and S. Klinke, "XploRe Applications Guide," Quantlets, http://www.quantlet.de/scripts/xag/htmlbook/xploreapplichtmlnode54.html (August 2002).
[JK96] J. Jean and H.K. Kim, "Concurrency Preserving Partitioning (CPP) for Parallel Logic Simulation," in Proceedings of the Tenth Workshop on Parallel and Distributed Simulation (PADS'96), p. 98-105, May 1996.

[JK98] A. Joshi and R. Krishnapuram, "Robust Fuzzy Clustering Methods to Support Web Mining," in S. Chaudhuri and U. Dayal, editors, Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, June 1998.

[KR90] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, Inc., 1990.

[KK93] R. Keller and R. Krishnapuram, "A Possibilistic Approach to Clustering," IEEE Transactions on Fuzzy Systems, Vol. 1, No. 2, p. 98-110, 1993.

[Kol01] E. Kolatch, "Clustering Algorithms for Spatial Databases: A Survey," Department of Computer Science, University of Maryland, College Park, 2001.

[LOPZ97] W. Li, M. Ogihara, S. Parthasarathy and M.J. Zaki, "New Algorithms for Fast Discovery of Association Rules," in Proceedings of the Third International Conference on Knowledge Discovery and Data Mining (KDD), August 1997.

[LVV01] A. Likas, N. Vlassis and J.J. Verbeek, "The Global K-means Clustering Algorithm," Technical report IAS-UVA-01-02, Computer Science Institute, University of Amsterdam, The Netherlands, February 2001.

[LW02] P. J. Lingras and C. Chad West, "Interval Set Clustering of Web Users with Rough K-means," submitted to IEEE Computer for publication, 2002.

[LRZ96] M. Livny, R. Ramakrishnan and T. Zhang, "BIRCH: An Efficient Data Clustering Method for Very Large Databases," in Proceedings of the Fifteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS), 1996.

[Mac67] J. MacQueen, "Some Methods for Classification and Analysis of Multivariate Observations," in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. I, Statistics, L. M. LeCam and J. Neyman, editors, University of California Press, 1967.

[Mas02] H. Masum, "Clustering Algorithms," Active Interests, http://www.carleton.ca/~hmasum/clustering.html (August 2002).

[MWY97] R. Muntz, W. Wang and J. Yang, "STING: A Statistical Information Grid Approach to Spatial Data Mining," in Proceedings of the Twenty-third International Conference on Very Large Databases, p. 186-195, Athens, Greece, August 1997.
[Myl02] P. Myllymäki, "Advantages of Bayesian Networks in Data Mining and Knowledge Discovery," Complex Systems Computation Group, Helsinki Institute for Information Technology, http://www.bayesit.com/docs/advantages.html (October 2002).

[Paw82] Z. Pawlak, "Rough Sets," International Journal of Information and Computer Sciences, Vol. 11, p. 145-172, 1982.

[Rei99] T. Reiners, "Mahalanobis Distance," Distances, http://server3.winforms.phil.tubs.de/~treiners/diplom/node31.html (October 2002).

[Sam90] H. Samet, The Design and Analysis of Spatial Data Structures, Addison Wesley, Reading, MA, 1990.

[The02] K. Thearling, "Data Mining and Customer Relationship," Data Mining White Papers, http://www.thearling.com/text/whexcerpt/whexcerpt.htm (October 2002).

[Wan00] Y. Wang, "Web Mining and Knowledge Discovery of Usage Patterns," CS 748T Project (Part I), http://db.uwaterloo.ca/~tozsu/courses/cs748t/surveys/wang.pdf (February 2000).
BIOGRAPHICAL SKETCH

Darryl M. Adderly, born September 2, 1976, to Renia L. Adderly and Kevin A. Adderly in Miami, Florida, was raised as a military child up until age thirteen, when his mother, younger sister (Kadra T. Adderly), and he moved back to Miami, where he earned his high school diploma at Miami Northwestern Senior High in June 1994. He began his college career in Tallahassee, Florida, at Florida Agricultural & Mechanical University, earning his Bachelor of Science in computer information systems (science option) with a mathematics minor in May 1998. After spending one year working as a software engineer in Raleigh, North Carolina, Darryl was accepted into the University of Florida's computer and information science and engineering graduate program. With the coursework requirements completed, he opted to return to the industry as a software developer for another year. In the fall of 2002, he returned to Gainesville, Florida, to complete and defend his thesis on Web data mining to receive his Master of Science degree.

Darryl is an ambitious, hard-working, analytical, and astute individual with a thirst for knowledge in all facets of life. He enjoys cardiovascular activities, weight lifting, football, basketball, and golf (although still a novice!). Outdoor activities (such as camping, white water rafting, and hiking) and traveling are at the top of his list of things to do once obtaining his master's degree.


Permanent Link: http://ufdc.ufl.edu/UFE0000500/00001

Material Information

Title: Data Mining Meets E-Commerce: Using Data Mining to Improve Customer Relationship Management
Physical Description: Mixed Material
Copyright Date: 2008

Record Information

Source Institution: University of Florida
Holding Location: University of Florida
Rights Management: All rights reserved by the source institution and holding location.
System ID: UFE0000500:00001

Permanent Link: http://ufdc.ufl.edu/UFE0000500/00001

Material Information

Title: Data Mining Meets E-Commerce: Using Data Mining to Improve Customer Relationship Management
Physical Description: Mixed Material
Copyright Date: 2008

Record Information

Source Institution: University of Florida
Holding Location: University of Florida
Rights Management: All rights reserved by the source institution and holding location.
System ID: UFE0000500:00001


This item has the following downloads:


Full Text











DATA MINING MEETS E-COMMERCE: USING DATA MINING TO IMPROVE
CUSTOMER RELATIONSHIP MANAGEMENT
















By

DARRYL M. ADDERLY


A THESIS PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF SCIENCE

UNIVERSITY OF FLORIDA


2002




























Copyright 2002

by

Darryl M. Adderly




























I would like to dedicate this thesis to a recent blessing in my life, Buttons Kismet
Adderly.















ACKNOWLEDGMENTS

I would like to first thank God for providing the opportunity and giving me the

strength to complete this thesis. I really appreciate Dr. Joachim Hammer's patience and

guidance throughout the duration of this process. I thank Ardiniece "Nisi" Caudle

and John "Jon B." Bowers for assisting me with administrative items. I thank the Office

of Graduate Minority Programs (OGMP) for the financial assistance. I would also like to

thank my bible study group (Adrian, JD, Jonathan, Kamini, and Ursula) for all of their

prayers/spritual support and last but not least Jean-David Oladele for the friendship and

support up until the very last minute.
















TABLE OF CONTENTS
page

A C K N O W L E D G M E N T S ................................................................................................. iv

LIST OF TABLES ....................................................... ............ .............. .. vii

LIST OF FIGURES ..................................................... .......... ................ viii

ABSTRACT .............. .......................................... ix

1 IN TR OD U CTION .............................................. .. ....... .... .............. .

1.1 M motivation for R research ................................................. .............................. 1
1.2 T hesis G oals............................................. ......... ..... 4

2 RESEARCH BACKGROUND ........................................ ................................. 7

2 .1 A association R ule M ining ........................................................................................ 10
2 .2 C lu stern g ........................................................................... 1 1
2.2.1 Partitioning A lgorithm s............................................ ........... .............. 11
2.2.2 H ierarchical A lgorithm s...................................................... ... ................. 14
2.2.3 D ensity-based M ethods........................................................ .............. 17
2.2.4 G rid-based M ethods ......................................................... .............. 19
2.2.5 K-means ........................................... 20

3 GENERAL APPROACH TO WEB USAGE MINING..............................................24

3.1 The M ining of W eb Usage Data ...................................... ...... .................... 24
3.1.1 Pre-processing Data for Mining .... .................. .............. 25
3.1.2 P pattern D discovery ........................................... .. ................... ........... 26
3.1.3 Pattern Analysis ... ..... .............................................. ............... 27
3.2 Web Usage Mining with k-means.......................................................... 29
3.2.1 Our W eb Usage M ining Approach .............. ............................. ....... ....... 29

4 ARCHITECTURE and IMPLEMENTATION.......................... ..................31

4 .1 A architecture O v erview .................................................................. .................... 3 1
4 .1.1 P hase 1 P re-processing ..................................................... ... ................. 32
4.1.2 Phase 2 Pattern Discovery......................................................... 33
4.1.3 Phase 3 Pattern Analysis .......................................................... 34
4.2 Algorithm Implementation..................... ........ ........................... 35


v











5 PER FO R M A N CE A N A LY SIS ............................................................. ....................43

5.1 E xperim ental E valuation.................................................... ........................... 44
5.2 W eb C lu sters ............................................... 47

6 CONCLUSION........ .......... .......... .. .... .... .... ........... 52

6.1 Contributions............................ .................. 52
6.2 Proposed Extensions and Future W ork.............................................................. 52

LIST OF REFERENCES ................................. ............................................54

BIOGRAPH ICAL SKETCH ..................................................... 59
















LIST OF TABLES

Table page

2-1 D ata M inning A lgorithm s................................................. ...................................... 9

5-1 Cluster representations ......................................................... .. 51
















LIST OF FIGURES

Figure p

3.1 High Level Web Usage Mining Process ............................. ..................... 25

4.1 Our Web Usage Mining Architecture.......... ........................................32

4 .2 T he R eadD ata m odule ............................................................................... ..... .... 37

4.3 The ClusterValues m odule................................................. .............................. 40

5.1 A sample SQL*Loader control file................................................... 45

5.2 O order clustering results ....... ..................... .......... ........................ ............... 48

5.3 Data M ining Software Order clustering results ................................. ............... 49






























viii















Abstract of Thesis Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Master of Science

DATA MINING MEETS E-COMMERCE: USING DATA MINING TO IMPROVE
CUSTOMER RELATIONSHIP MANAGEMENT


By

Darryl M. Adderly

December 2002

Chair: Joachim Hammer
Major Department: Computer and Information Science and Engineering

The application of data mining techniques to the World Wide Web, referred to as

Web mining, enables businesses to use knowledge discovered from the past to understand

the present and make critical business decisions about the future. For example, this can

be done by analyzing the Web pages that visitors have clicked on, items that they have

selected or purchased, or registration information provided while browsing. To perform

this analysis effectively, businesses find the natural groupings of users, pages, etc., by

clustering data stored in the Web logs. The standard k-means algorithm, an iterative

refinement algorithm, is one of the most popular clustering methods used today and it has

proven to be an efficient clustering technique. However, numerous iterations over the

data set and re-calculating cluster centroid values are time consuming. In this thesis, we

improve the time complexity of the standard algorithm. Our single-pass, non-iterative k-

means algorithm scans the data only once, calculating all the point and centroid values









based on the desired attributes of interest, and places the items within their respective

cluster thresholds. Our Web mining process consists of three phases: pre-processing,

pattern discovery, and pattern analysis, which are described in detail in the thesis. We

will use our implementation of the k-means algorithm to uncover meaningful Web trends

to understand and, after analyzing the results, provide recommendations that may have

improved the visitor's website experience. We find that the clustering results of our

algorithm provide the same amount of knowledge for analysts as one of the industry's

leading data mining applications.














CHAPTER 1
INTRODUCTION

Consumers are conducting business via the Internet more than ever before due to the affordable cost of high-speed Internet service providers (ISPs) and the high level of security of on-line transactions. However, recognition of a company's on-line presence alone does not ensure long-lived prosperity. Customer retention and satisfaction strategies remain among the most important issues for organizations expecting profits. Thus, companies work hard to improve and/or maintain their customer relationships. To achieve this, companies must capture the navigational behavior of visitors on their website in a web log and subsequently analyze this data to understand and address their consumers' business needs.

1.1 Motivation for Research

The relationship between companies and customers has evolved into a significant research concept called Customer Relationship Management (CRM). CRM can be defined as a process that manages the interactions between a company and its customers [The02]. CRM solutions create a mutually beneficial relationship between the customer and the organization and are critical to a company's future success. The ultimate goals of CRM are to acquire new customers, retain old customers, and increase customer profitability [CY00]. In the current economic slowdown, companies are using their limited budgets to reduce operational costs or increase revenues while concentrating on improving efforts to acquire new customers and develop customer loyalty. The sources of web-based CRM customer data (user profiles, access patterns for pages, etc.) are the customers' interactions with the website.

The advent of the World Wide Web (WWW) has caused an evolution of the Internet.

Information is now readily available from any location in the world at any hour of the

day. Information on the WWW is not only important to individuals, but also to business

organizations for critical decision-making. This explosion of information sources on the

web has increased the necessity to utilize automated tools to find the desired resources

and to track and analyze usage patterns.

An electronic trail of data is left behind each time a user visits a website. The

megabytes and gigabytes of data logged from these trails seem to not yield any

information at first glance. However, when analyzed intelligently, those logs contain a

wealth of information providing valuable knowledge for business intelligence solutions.

Early attempts to understand the data with statistical tools and on-line analytical

processing (OLAP) systems achieved limited success--that is until the concept of data

mining was introduced. Data mining is the process of discovering hidden interesting

knowledge from large amounts of data stored in databases, data warehouses, or other

information repositories. Web data mining, or web mining, can be broadly defined as the

discovery and analysis of useful information from the Web data. On-line businesses

learn from the past, understand the present, and plan for the future by mining, analyzing,

and transforming records into meaningful information.

Web mining, when viewed in data mining terms, can be said to have three operations of interest: clustering (finding natural groupings of users, pages, etc.), associations (which URLs tend to be requested together), and sequential analysis (the order in which URLs tend to be accessed) [JK98]. Although the first two have proven to be of greater interest, this research heavily favors the use of clustering techniques and algorithms to support web mining.

Data clustering is a process of partitioning a set of data into a set of classes, called

clusters, with members of each cluster sharing some interesting common properties

[CGHK97]. Clustering itself is the process of organizing similar items into disjoint

groups. The investigation of the properties of the set of items belonging to each group

illuminates relationships that may have been otherwise overlooked.

The k-means algorithm is one of the most widely used techniques for clustering [Al-D95]. It has been shown to be effective in producing good clustering results for many practical applications. The two main goals of clustering techniques are to ensure that the data within each distinct cluster is homogeneous (grouped items are similar) and that each cluster differs from the other clusters (data belonging to one cluster should not be present in another cluster). The k-means algorithm is an iterative refinement algorithm that takes as input the number k of pre-defined clusters. "Means" simply represents the average, as in the average location of all members of a particular cluster, conceptualized as the centroid. The centroid of a cluster, often termed the representative element, is an artificial point in the space of records that represents that average location. The time complexity of the k-means algorithm is heavily dependent on the initial point (centroid) selection process of its first step. Some implementations require either user-provided or randomly generated starting points, but most implementations of the k-means algorithm do not address the issue of initialization at all. The remaining steps of the algorithm focus on minimizing the intra-cluster error (the distances between items and the center of the cluster they belong to), measured with a distance function (e.g., the Euclidean distance [Bla02a] or Manhattan distance [Bla02b] function), and optimizing the inter-cluster relationships (the separation between data items of different clusters).

relationships. The standard algorithm typically requires many iterations over a data set to

converge to a solution, accessing each data item on each iteration. This approach may be

sufficient for small data sets but it is obviously inefficient when scanning large data sets.

The k-means algorithm has proven to be well suited when clustered results are of similar

spherical shapes. However, when data items in a given cluster are closer to the center of

another cluster than that of its own (for example, when clusters have widely different

sizes or have convex shapes), this algorithm may not be as useful. In comparison with

other clustering methods, the revised k-means based methods are promising for their

efficient processing of large data sets, however, their use is often limited to numeric data.

For the reasons mentioned in this paragraph, we have proposed yet another version of the

k-means algorithm to improve the performance when applied to large data sets of high

dimensionality. Also, there has been very little research done in applying the k-means

algorithm to web log data because of its non-numeric nature. In our experimental

section, we prove that the application of our algorithm for web mining is comparable and

in some instances outperforms the clustering technique of one of the industry's leading

data mining applications.

1.2 Thesis Goals

In web mining, the goal is to uncover meaningful web trends in order to understand and improve the visitors' website experience. Clustering techniques are exercised to enable companies to find the natural groupings of customers. The standard k-means algorithm, by design, optimally partitions a data set into clusters of similar data items, after which the human analytical process begins.









In this thesis, we have developed a single-pass, non-iterative k-means algorithm. We attempt to improve the time complexity of the standard algorithm, without refining the initial points, when it is applied to large data sets. The traditional algorithm repeats the clustering steps until cluster assignment has been exhausted, scanning the data set as often as necessary. Multiple scans of the data set improve the quality of the clusters at the expense of execution time. Many data sets are large and cannot fit into main memory, and scanning a data set stored on disk or tape repeatedly is time consuming. Our algorithm scans a portion of the data set (residing in memory) only once, calculating all the point values, and finally clustering the items accordingly. We use only a sample and a reduced number of attributes for the sake of efficiency and scalability with respect to large databases. Dead clusters are created when a centroid does not have any members in its cluster, which may arise due to bad initialization. We plan to address this issue by calculating the centroids based on the number of clusters k and the deviation between the minimum and maximum point values. The application should handle all the data types accepted by the database application, some of which are very complex (e.g., hypertext data). Applying the k-means algorithm to the data allows us to group customers together on the basis of similarity by virtue of the attributes chosen and, after analyzing the results, to get a good grasp of the consumers' behavior and make intelligent predictions about their future behavior. Visitor behavioral predictions serve as a good starting point for improving a website's navigational experience. The suggestions and/or recommendations resulting from the analysis need to be implemented to discover the true success of the algorithm. The data set used in the experimental section was obtained from the KDD Cup 2000¹ competition and contains data from an e-commerce site that no longer exists; therefore, we were unable to confirm the predictions made from our analysis of the results. We will show that our method is superior in speed when compared to the standard k-means algorithm, while maintaining a cluster quality comparable to that of one of the industry's leading data mining products.

The rest of this thesis is organized as follows. Chapter 2 shares background

information of related research. Chapter 3 explains our approach for web mining with k-

means. Chapter 4 describes the architecture used for the development of our algorithm

and the implementation. Chapter 5 analyzes the performance of our algorithm and we

then conclude with a summary of the thesis, review of our contributions, and future work

in Chapter 6.


1 http://www.ecn.purdue.edu/KDDCUP/














CHAPTER 2
RESEARCH BACKGROUND

Clustering techniques have been applied to a variety of areas including machine

learning, statistics, and data and web mining. As widely used as they are, the

fundamental clustering problem remains the task of grouping together similar data items

of a given data set. There are four main classifications of clustering algorithms:

partitioning algorithms, hierarchical algorithms, density-based methods, and grid-based

methods. There has been a plethora of proposals to improve or refine upon existing

algorithms for each respective approach. The k-means algorithm, which is classified as a

partitioning algorithm, is no exception. Enhancements to the traditional k-means algorithm involve, but are not limited to, refining the initial points, improving scalability with respect to large data sets, minimizing the clustering error, and reducing the number of clustering iterations (data set scans).

Data mining is the process of discovering hidden interesting knowledge from

large amounts of data stored in databases, data warehouses, or other information

repositories. The main idea behind data mining is to identify novel, valid, potentially

useful, ultimately understandable patterns in data. The spectrum of uses of data mining

tools ranges from financial and telecommunications applications to government policy

settings, medical management, and food service menu analysis. Different data mining

algorithms are more appropriate for certain types of problems. These algorithms can be

classified into two categories: descriptive and predictive. Descriptive data mining

describes the data in a summary manner and presents interesting general properties of the









data. Predictive data mining constructs one or more sets of models, infers on the

available set of data, and attempts to predict the behavior of new data sets. These two

styles are also known as undirected and directed data mining, respectively. The former

uses a bottom-up approach, finding patterns in the data and leaving the decision up to the

user to determine whether or not these patterns are important. The latter uses a top-down

approach and is used when one has a good grasp on what it is he or she is looking for or

would like to predict, applying knowledge gained in the past to the future. There are

several classes of algorithms applicable to data mining but the most commonly used are

association rules [AS94, LOPZ97], Bayesian networks [Myl02], clustering [Fas99],

decision trees [Mur98], and neural networks [CS97]. Table 2-1 provides a brief overview

of data mining algorithms.

The application of data mining techniques to the WWW, often referred to as web

mining, is a direct result of the dramatic increase of Internet usage. Various data from the

WWW stored in web logs include http request information, client IP addresses, the

contents of the website (product information, published articles about the company, etc.),

visitor behavior data (navigational paths or clickstream data and purchasing data), and

web structure data. Thus, the current research efforts of WWW data mining focus on

three issues: web content mining, web structure mining, and web usage mining. Web

content mining is used to describe the automatic search of information resources

available on-line. The automated discovery of web-based information is difficult because

of the lack of structure permeating the information sources on the web. Traditional

search engines generally do not provide structured information nor categorize, filter, or

interpret documents [CMS97]. These factors have prompted researchers to develop









Table 2-1 Data Mining Algorithms

ALGORITHM           DESCRIPTION                                  COMMON APPLICATIONS

Association rules   Descriptive and predictive. Determines       Understanding consumer
                    when items occur together.                   product data.

Bayesian networks   Predictive. Learns through determining       Predicting what a consumer
                    conditional probabilities.                   would like to do on a web site
                                                                 by previous and current behavior.

Clustering          Descriptive. Identifies and groups           Determining consumer groups.
                    similar data.

Decision trees      Predictive. A flow chart of if-then          Predicting credit risk.
                    conditions leading to a decision.

Neural networks     Predictive. Modeled after the human          Optical character recognition
                    brain; a classic Artificial Intelligence     and fraud detection.
                    algorithm.


more intelligent tools for information retrieval and extend data mining efforts to provide

a higher level of organization for the semi-structured data available on the web. Web structure mining deals with mining the web documents' structure and links to identify

relevant documents. Web structure mining is useful in generating information such as

visible web documents, luminous web documents, and luminous paths (a path common to

most of the results returned) [BLMN99]. Web usage mining is the discovery of user

access patterns from web server logged data. Companies automatically collect large

volumes of data from daily website operations in server access logs. They analyze this

web log data to essentially aid in future business decisions. In this thesis, we use









clickstream and purchasing data collected prior to an e-commerce website going out of

business. This data set resembles data used during the web data mining process.

Web mining, when viewed from a data mining perspective, is assumed to have three operations of interest: sequential analysis, associations, and clustering. Sequential

analysis provides insight on the order that URLs tend to be accessed. Determining which

URLs are usually requested together (associations) and finding the natural groupings of

users, pages, etc. (clustering) are more useful in today's real-world web mining

applications.

2.1 Association Rule Mining

Association rule mining is the discovery of association relationships (or correlations)

amongst a set of items. These relationships are often expressed in the form of a rule by

showing attribute-value conditions that occur frequently together in a given set of data.

An example of an association rule would be X => Y, which is interpreted by Jiawei Han [Han99] as: database tuples that satisfy X are likely to also satisfy Y.

Association algorithms are efficient for deriving rules but both the support and

confidence factors are key for an analyst to make a judgment about the validity and

importance of the rules. The support factor indicates the relative occurrence of the

detected association rules within the overall data set of transactions and the confidence

factor is the degree to which the rule is true across individual records.
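Stated in their usual form (the standard definitions, not a formulation specific to this thesis), for a rule X => Y mined from a set D of transactions, these two factors are

\[
\mathrm{support}(X \Rightarrow Y) \;=\; \frac{\lvert \{\, T \in D : X \cup Y \subseteq T \,\} \rvert}{\lvert D \rvert},
\qquad
\mathrm{confidence}(X \Rightarrow Y) \;=\; \frac{\mathrm{support}(X \cup Y)}{\mathrm{support}(X)} .
\]

For example, if 1,000 of 10,000 logged transactions contain both X and Y, and 2,000 transactions contain X, then the rule X => Y has a support factor of 0.10 and a confidence factor of 0.50.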

The main goal of association discovery is to find items that imply the presence of

other items in the same transaction. It is widely used in transaction data analysis for

directed marketing, catalog design, and other business decision-making processes. This

technique was a candidate to implement in the experimental section, but clustering

proved to be a better fit for our research.









Association discovery's simplistic nature gives it a significant advantage over the

other data mining techniques. It is also very scalable since it basically counts the

occurrences of all possible combinations of items and involves reading a table

sequentially from top to bottom each time a new dimension is added. Thus, it is able to

handle large amounts of data (in this case, large numbers of transactions). Association

rules do not suffer from overfitting, so they tend to generalize better than other types of

classifiers.

Association rules have some serious limitations, however, such as the number of

rules defined. Too many rules may overwhelm an inexperienced user while too few may

not suffice. Another drawback is that the rules generated give no information about

causation. The rules can only tell what things tend to happen together, without specifying

information about the cause.

2.2 Clustering

Clustering is the task of grouping together "similar" items in a data set.

Clustering techniques attempt to look for similarities and differences within a data set and

group similar rows into clusters. A good clustering method produces high quality

clusters to ensure that the inter-cluster similarity is low and the intra-cluster similarity is

high. Clustering algorithms can be classified into four main groups: partitioning

algorithms, hierarchical algorithms, density-based algorithms, and grid-based algorithms.

2.2.1 Partitioning Algorithms

Partitioning algorithms attempt to break a data set of N objects into a set of k

clusters such that the partition optimizes a given criterion. These algorithms are usually

classified as static or dynamic. Static partitioning is performed prior to the execution of

the simulation and the resulting partition is fixed during the simulation [JK96]. Dynamic









partitioning attempts to conserve system resources by combining the computation with the

simulation. There are mainly two approaches: the k-means algorithm, where each cluster

is represented by the center of gravity of the cluster and the k-medoid algorithm, where

each cluster is represented by one of the objects of the cluster located near the center

[CSZ98]. Partitioning applications such as PAM, CLARA, and CLARANS are centered

around k-medoids. Other applications involve the traditional k-means algorithm or a

slight variation/extension of it, such as our implementation.

PAM (Partitioning Around Medoids) [KR90] uses arbitrarily selected

representative objects, called medoids, during its initial steps to find k clusters. Medoids

are meant to be the most centralized object within each cluster. Each non-selected object is thereafter grouped with the medoid to which it is most similar. In each step, a swap

between a selected object (medoid) and a non-selected object is made if it would result in

an improvement of the quality of clustering. The quality of clustering (i.e., the combined

quality of the chosen medoids) is measured by the average dissimilarity values given as

input. Experimental results by Kaufman and Rousseeuw have shown PAM to work

satisfactorily for small data sets (for example, 100 objects in 5 clusters), but it is not

efficient when dealing with medium to large data sets. The slow processing time, which is O(k(N-k)²) per iteration [CSZ98] due to the comparison of each object with the entire data set,

motivated the development of CLARA.

CLARA (Clustering LARge Applications) relies on sampling to handle large data

sets. CLARA draws a sample of a data set, applies PAM to the sample, and then finds

the medoids of the sample instead of the entire data set. The medoids of the sample

approximate the medoids of the entire data set. Multiple data samples are drawn to









derive better approximations and return the best clustering output. The quality of

clustering for CLARA is measured based on the average dissimilarity of all objects in the

entire data set, not only of those in the samples. Kaufman and Rousseeuw's experimental

results prove that CLARA performs satisfactorily for data sets such as one containing

1000 objects using 10 clusters. Since CLARA only applies PAM to the samples, each

iteration reduces to O(k(40 + k)² + k(N - k)) [KR90], using 5 samples of size 40 + 2k. Although this data set is larger than those used for the PAM experiments, it is not ideal for web mining analysis.

CLARANS (Clustering LARge Applications based on RANdomized Search)

[HN94] stems from the work done on PAM and CLARA. It relies on the randomized

search of a group of nodes, which are represented by a set of k objects, to find the

medoids of the clusters. Each node represents a collection of k medoids; therefore, it

corresponds to a clustering. Thus, each node is assigned a cost that is the total

dissimilarity value between every object and the medoid of its cluster. The algorithm

takes the maximum number of neighbors of a node that can be examined (maxneighbor)

and the maximum number of local minimums that can be collected (numlocal). After

selecting a random node, CLARANS checks a sample of the neighbors of the node,

clusters the neighbor based on the cost differential, and continues until the maxneighbor

criterion is met. Otherwise, it declares the current node a local minimum and starts a new

search for the local minima. After a specified number of numlocal values are collected,

the best of these local values are recorded as the medoid of the cluster. The PAM

algorithm can be viewed as the method used to search for the local minima. For large

values of N, examining all of k(N-k) neighbors of a node is time consuming. Although









Ng and Han claim that CLARANS is linearly proportional to the number of points, but the time consumed in each step of the search is O(kN²), making the overall performance at least quadratic [Kol01].

CLARANS, without any extra focusing techniques, cannot handle large data sets. Also, it was not designed to handle high-dimensional data. Both are characteristics of the data stored in web logs.

2.2.2 Hierarchical Algorithms

Hierarchical algorithms create a hierarchical decomposition of a database. These

techniques produce a nested sequence of clusters with a single all-inclusive cluster at the

top and single point clusters at the bottom. The hierarchical decomposition can be

represented by a dendrogram, which is a tree that iteratively splits the database into

smaller subsets until each subset consists of only one object [EKSX96]. The dendrogram

can be created from the leaves up to the root (agglomerative approach) or from the root

down to the leaves (divisive approach) by merging or dividing clusters at each step.

Agglomerative hierarchical algorithms begin with all the data points as a separate cluster,

followed by recursive steps of merging the two most similar (or least expensive) cluster

pairs until the desired number of clusters is obtained or the distance between the two

closest clusters is above a certain threshold distance. Divisive hierarchical algorithms work

by repeatedly partitioning a data set into "leaves" of clusters. A path down a well-

structured tree should visit sets of increasingly tightly related elements, conveniently

displaying the number of clusters and the compactness of each cluster.

BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is a

clustering method developed to address large data sets and minimization of input/output

(I/O) costs. It incrementally and dynamically clusters incoming multi-dimensional metric









data points to try to produce the best quality clustering with available resources (i.e.,

available memory and time constraints) [LRZ96]. BIRCH typically clusters well with a

single scan of the data, however, optional additional passes can be used to improve the

cluster quality further.

BIRCH contains four phases, two of which are optional (namely the second and

the fourth). During phase one, the data is scanned and the initial tree is built using the

given amount of memory and recycling space on disk. The optional phase two condenses

the tree by scanning the leaf entries to rebuild a smaller one, removing outliers and

grouping crowded subclusters into larger ones. The application uses a self-created

height-balanced Clustering Feature (CF) tree at the core of their clustering step. Each

node, or CF vector, of the tree contains the number of data points in the cluster, the linear

sum of the data points, and the square sum of the data points. The CF tree has two

parameters: branching factor B and threshold T. Each non-leaf node contains at most B

entries. The tree size is a function of T: the larger T is, the smaller the tree. The

mandatory phase three uses a global algorithm to cluster all leaf entries. This global

algorithm is a pre-existing method selected before beginning the BIRCH process.

BIRCH also allows the user to specify either the desired number of clusters or the desired

threshold (in diameter or radius) for clusters. Up to this point, the original data has only

been scanned once, although the tree and outlier information have been scanned multiple

times. After phase three, some inaccuracies may exist from the initial creation of the CF

tree. Phase four is optional and entails the cost of additional passes of the data to correct

those inaccuracies and refine the clusters further. This phase uses the centroids produced

in phase three as seeds to migrate and/or create new clusters. [LRZ96] contains a









performance analysis versus CLARANS. They conclusively state that BIRCH uses much

less memory and is faster, more accurate, and less order sensitive when compared with

CLARANS. BIRCH, in general, scales well but handles only numeric data and the

results depend on the order of the records.

CURE (Clustering Using REpresentatives) [GRS98] is a bottom-up

(agglomerative) clustering algorithm based on choosing a well-formed group of points to

identify the distance between the clusters. CURE begins by choosing a constant number

c of well-scattered points from a cluster used to identify the shape and size of the cluster.

The next step uses a predetermined fraction between 0 and 1 to shrink the selected points

toward the centroid of the cluster. With the new (shrunken) position of these points

identifying the cluster, the algorithm then finds the clusters with the closest pairs of

identifying points. This merging continues until the desired number of clusters, k, an

input parameter, remains. A k-d tree [Sam90] is used to store the representative points

for the clusters.

CURE uses a random sample of the database to handle very large data sets, in

contrast with BIRCH, which pre-clusters all the data points for large data sets. Random

sampling can eliminate significant input/output (I/O) costs since the sample may be

designed to fit into main memory and it also helps to filter outliers. If random samples

are derived such that the probability of missing clusters is low, accurate information

about the geometry of the clusters is still preserved [GRS98]. CURE partitions and

partially clusters the data points of the random sample to speed up the clustering process

when sample sizes increase. Multiple representative points are used to label the clusters

assigning each data point to the cluster with the closest representative point. The use of









multiple points enables the algorithm to identify arbitrarily shaped clusters. The worst-

case time complexity of CURE is O(n² log n), where n is the number of sampled points,

proving to be no worse than BIRCH [Kol01]. The computational complexity of CURE is

quadratic with respect to the sample size and is not related to the size of the dataset.

2.2.3 Density-based Methods

Density-based clustering algorithms locate clusters by constructing a density

function that reflects the spatial distribution of the data points. The density-based notion

of a cluster is defined as a set of density-connected points that is maximal with respect to

density-reachability. In other words, the density of points inside each cluster is

considerably higher than outside of the cluster. In addition, the density within the areas

of noise is lower than the density in any of the clusters. Two examples of density-based methods are DBSCAN and OPTICS.

DBSCAN (Density Based Spatial Clustering of Applications with Noise)

[EKSX96] is a locality-based algorithm, relying on a density-based notion of clustering.

The density-based notion of clustering states that within each cluster, the density of the

points is significantly higher than the density of points outside the cluster [Kol01]. This

algorithm uses two parameters, Eps and MinPts, to control the density of the cluster. Eps

represents the neighborhood of a point (radius) and MinPts is the minimum number of

points that must be contained in the neighborhood of that point in the cluster.

DBSCAN discovers clusters of arbitrary shapes, can distinguish noise, and requires only one input parameter from the user. This input value is also a major drawback, because the user must manually determine Eps for each run of the algorithm. The reported runtime of the algorithm, O(N log N), does not factor in the significant time needed to determine Eps, so it can be misleading. This algorithm can handle large amounts of data, but it is not designed to handle higher-dimensional data.
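To make the role of the two parameters concrete, the following is a minimal sketch of the basic DBSCAN procedure (an illustration only, not code from [EKSX96]); it uses a brute-force neighborhood search for brevity, whereas the published algorithm relies on a spatial index (an R*-tree) for efficiency.

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Minimal illustrative sketch of the basic DBSCAN procedure on d-dimensional points.
public class DbscanSketch {
    static final int NOISE = -1;
    static final int UNVISITED = 0;

    // Returns a label per point: -1 for noise, otherwise a cluster id starting at 1.
    public static int[] cluster(double[][] points, double eps, int minPts) {
        int[] label = new int[points.length];
        int clusterId = 0;
        for (int i = 0; i < points.length; i++) {
            if (label[i] != UNVISITED) continue;
            List<Integer> neighbors = regionQuery(points, i, eps);
            if (neighbors.size() < minPts) { label[i] = NOISE; continue; }   // not a core point
            clusterId++;
            label[i] = clusterId;
            Deque<Integer> seeds = new ArrayDeque<>(neighbors);
            while (!seeds.isEmpty()) {
                int j = seeds.pop();
                if (label[j] == NOISE) label[j] = clusterId;                 // border point of this cluster
                if (label[j] != UNVISITED) continue;
                label[j] = clusterId;
                List<Integer> jNeighbors = regionQuery(points, j, eps);
                if (jNeighbors.size() >= minPts) seeds.addAll(jNeighbors);   // j is also a core point
            }
        }
        return label;
    }

    // All indices whose points lie within Euclidean distance eps of point p.
    static List<Integer> regionQuery(double[][] points, int p, double eps) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < points.length; i++) {
            double sum = 0.0;
            for (int d = 0; d < points[p].length; d++) {
                double diff = points[p][d] - points[i][d];
                sum += diff * diff;
            }
            if (Math.sqrt(sum) <= eps) result.add(i);
        }
        return result;
    }
}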

OPTICS (Ordering Points To Identify the Clustering Structure) [ABKS99] is a

cluster analysis algorithm that creates an augmented ordering of the database representing

its density-based clustering structure. This differs from traditional clustering methods

purpose of producing an explicit clustering of the data set. This cluster ordering contains

information that is equivalent to the density-based clustering corresponding to a broad

range of parameter settings. OPTICS works in principle like an extended DBSCAN algorithm for an infinite number of distance parameters Epsi that are smaller than a "generating distance" Eps (i.e., 0 <= Epsi <= Eps). However, instead of assigning cluster memberships, this algorithm stores the objects in the order in which they are processed, together with the information that an extended DBSCAN algorithm would use to assign cluster membership (if it were possible to do so for an infinite number of parameters). This information

consists of only two values: the core-distance and a reachability-distance. The core-distance of an object p is the smallest distance for which the neighborhood of p contains at least MinPts objects. The reachability-distance of an object p with respect to a core object o is the smallest distance such that p is directly density-reachable from o. The OPTICS algorithm creates

an ordering of a database, additionally storing the core-distance and a suitable

reachability distance for each object. Objects, which are directly density-reachable from

a current core object, are inserted into a seed-list for further expansion. The "seed-list"

objects are sorted by their reachability distance to the closest core object from which they

have been directly density-reachable. The reachability-distance for each object is

determined with respect to the center-object. Objects that are not yet in the priority-









queue (seed-list) are inserted with their reachability-distance. If the new reachability-

distance of an object is smaller than the previous reachability-distance and it already

exists in the queue, it is moved further to the top of the queue. [ABKS99] performed

extensive performance tests using different data sets and different parameter settings to

prove that the run-time of OPTICS is nearly the same as the run-time for DBSCAN. If

OPTICS scans through the entire database, then the run-time will be O (N2). If a tree-

based spatial index can be used, the run-time is reduced to O (NlogN). For medium sized

data sets, the cluster ordering can be represented graphically and for very large data sets,

OPTICS extends a pixel-oriented visualization technique to present the attribute values

belonging to different dimensions.

2.2.4 Grid-based Methods

Grid-based algorithms quantize the space into a finite number of cells and then do

all operations on the quantized space. These approaches tend to have fast processing

times, depending only on the number of cells in each dimension quantized in space,

remaining independent of the number of data objects. Grid-based techniques such as

STING [MWY97] and WaveCluster [CSZ98] have linear computation complexity and

are very efficient for large databases; however, they are not typically feasible for

analyzing web logs. Grid-based methods are more applicable for spatial data mining.

Spatial data mining is the extraction of implicit knowledge, spatial relations, and the

discovery of interesting characteristics and patterns that are not explicitly represented in

the databases. Spatial data geometrically describes information related to the space

occupied by objects. The data may be either a single point in multi-dimensional space

(discrete) or it may span across a region of space (continuous). Huge amounts of spatial









data may be obtained from satellite images, medical imagery, Geographic Information

Systems, etc., making it unrealistic to examine spatial data in detail.

2.2.5 K-means

As mentioned earlier in this chapter, we now revisit the various contributions, improvements, and modifications to the standard k-means algorithm. Historically known

as Forgy's method [For65] or MacQueen's algorithm [Mac67], the k-means algorithm

has emerged as one of the most widely used techniques for solving clustering problems.

This process consists of mainly three steps [HHK02]:

1. Partition the items into k initial clusters.

2. Proceed through the list of items, assigning each item to the cluster whose centroid
(mean) is nearest. Recalculate the centroid for the cluster receiving the new item and
for the cluster losing the item.

3. Repeat step 2 until no more assignments take place.
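To make these steps concrete, the following is a minimal sketch of the standard iterative algorithm on numeric feature vectors. It is an illustration only (it recomputes all centroids once per pass rather than after every single reassignment), and it is not the single-pass, non-iterative variant developed in this thesis; the random initialization and the use of the squared Euclidean distance are assumptions of the sketch.

import java.util.Arrays;
import java.util.Random;

// Minimal sketch of the standard iterative k-means algorithm on d-dimensional points.
public class StandardKMeansSketch {

    // Returns the cluster index (0..k-1) assigned to each point.
    public static int[] cluster(double[][] points, int k, int maxIterations) {
        Random random = new Random(42);
        int d = points[0].length;
        double[][] centroids = new double[k][];
        // Step 1: randomly pick k points of the data set as the initial centroids.
        for (int c = 0; c < k; c++) {
            centroids[c] = points[random.nextInt(points.length)].clone();
        }
        int[] assignment = new int[points.length];
        for (int iteration = 0; iteration < maxIterations; iteration++) {
            // Step 2: assign every point to the cluster whose centroid is nearest.
            boolean changed = false;
            for (int i = 0; i < points.length; i++) {
                int nearest = 0;
                double best = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double dist = squaredEuclidean(points[i], centroids[c]);
                    if (dist < best) { best = dist; nearest = c; }
                }
                if (assignment[i] != nearest) { assignment[i] = nearest; changed = true; }
            }
            // Step 3: stop once no point changes cluster; otherwise recompute the centroids.
            if (!changed && iteration > 0) break;
            double[][] sums = new double[k][d];
            int[] counts = new int[k];
            for (int i = 0; i < points.length; i++) {
                counts[assignment[i]]++;
                for (int j = 0; j < d; j++) sums[assignment[i]][j] += points[i][j];
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) continue;   // a "dead" cluster simply keeps its old centroid
                for (int j = 0; j < d; j++) centroids[c][j] = sums[c][j] / counts[c];
            }
        }
        return assignment;
    }

    static double squaredEuclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int j = 0; j < a.length; j++) sum += (a[j] - b[j]) * (a[j] - b[j]);
        return sum;
    }

    public static void main(String[] args) {
        double[][] data = { {1.0, 1.0}, {1.2, 0.8}, {0.9, 1.1}, {8.0, 8.0}, {8.2, 7.9} };
        System.out.println(Arrays.toString(cluster(data, 2, 100)));
    }
}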

Step 1 may be completed in one of three ways: randomly selecting k points to represent each cluster, requiring the user to enter k initial points, or using the first k points to represent each cluster. Most implementations randomly select k representative objects (centroids) to start the process. [BF98] use the following observation to illustrate the importance of good initial points: an initial cluster center that attracts no data may remain empty, while a starting configuration with no empty clusters usually produces better solutions. Our version of the algorithm does not address the initialization issue; versions that do not address it assume the starting points are either user-provided or randomly chosen. Duda and Hart mention a recursive method,

[CCMT97] takes the mean of the entire data and randomly perturbs it k times, and

[BFR98] refine using small random sub-samples of the data. The latter is primarily

intended to work on large databases. As a database size increases, efficient and accurate

initialization becomes critical. When applied to an appropriately sized random









subsample of the database, they show that accurate clustering can be achieved with

improved results over the classic k-means. The only memory requirement of this

refinement algorithm is to hold a small subsample in RAM, allowing it to scale easily to

very large databases.

As we continue on to the remaining steps of the algorithm, the main focus is to optimize the clustering criterion. The most widely used criterion is the clustering error, which for each point computes its squared distance from the corresponding cluster center and then sums these distances over all points in the data set [LVV01].
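Written out in standard notation (with C_i denoting the i-th cluster and \mu_i its centroid), this clustering error for a partition of the data set into k clusters is

\[
E \;=\; \sum_{i=1}^{k} \; \sum_{x \in C_i} \lVert x - \mu_i \rVert^{2} .
\]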

The Intelligent Autonomous Systems group has proposed the global k-means algorithm, a deterministic and effective global clustering method that employs the k-means algorithm as a local search procedure to minimize this error. This algorithm is an incremental approach

to clustering that dynamically adds one cluster center at a time through a deterministic

global search procedure consisting on N, the size of the data set, executions of the k-

means algorithm from suitable initial positions. It solves all intermediate problems with

1, 2,..., M-1 clusters sequentially to solve a clustering problem with M clusters. The

underlying principle of this method is that an optimal solution for a clustering problem

with M clusters can be obtained by using the k-means algorithm to conduct a series of

local searches. Each local search places the M-1 cluster centers at their optimal positions

corresponding to the clustering problem within the data space. Since for M = 1 the optimal solution is known, this global algorithm can iteratively apply the above procedure to find optimal solutions for all k-clustering problems, k = 1, ..., M.

In terms of computational complexity, the method requires N executions of the k-

means algorithm for each value of k (k = 1,..., M). The experimental results prove that









for a small data set (for example, N = 250 and M = 15), the performance of this method is

excellent, however, the technique has not been tested on large-scale data mining

problems.

Recursive iterations can be expensive when applying the k-means algorithm. To

reduce time complexity as well as the iterations of steps and to increase the scalability of

k-means clustering for large data sets, single-pass k-means algorithms were introduced

[BFR98]. The main idea is to buffer where points from the data set are saved in

compressed form. The first step is to initialize the means of the clusters as with the

standard k-means. The next step is to fill the buffer completely with points from the

database followed by a two-phase compression process. The first of the two, called

primary compression, identifies points that are unlikely to ever move to a different cluster

using two methods. The first measures the Mahalanobis distance [Rei99] from each point to the cluster mean (centroid) it is associated with and discards the point if it is within a certain radius. The second method involves creating confidence intervals for each

centroid. Then, a worst-case scenario is set up by perturbing the centroids within the

confidence intervals with respect to each point. The centroid associated with each point is moved away from the point, and the cluster means of all other clusters are moved

towards the point. If the point is closest to the same cluster mean after the perturbations,

it is unlikely to change cluster membership. Points that are unlikely to change are

removed from the buffer and placed in a discard set of one of the main clusters. We are

now ready to begin the second phase called the secondary compression. The aim of this

phase is to save buffer space by storing some auxiliary clusters instead of individual

points. During this stage, another k-means clustering is performed with a larger number









of clusters than for the main clustering on the remaining points in the buffer. The points

in the buffer must satisfy a tightness criterion (remain below a certain threshold). After

primary and secondary compression, the available buffer space is filled with new points

and the whole procedure is repeated. The algorithm ends after one scan of the data set or

if the centers of the main clusters do not change significantly as more points are added.

A special case of the algorithm of [BFR98], not mentioned in their paper, would

be to discard all the points in the buffer each time. The algorithm is [EFL00]:

1. Randomly initialize cluster means. Let each cluster have a discard set in the buffer
that keeps track of the sufficient statistics for all points from previous iterations.

2. Fill the buffer with points.

3. Perform iterations of k-means on the points and discard sets in the buffer, until
convergence. For this clustering, each discard set is treated like a regular point placed
at the mean of the discard set, but weighed with the number of points in the discard
set.

4. For each cluster, update the sufficient statistics of the discard set with the points
assigned to the cluster. Remove all points from the buffer.

5. If the data set is exhausted, then finish. Otherwise, repeat from step 2.

According to the lesion experiments in [EFL00], for synthetic data sets of 1,000,000 points, 100 dimensions, and 5 clusters, the cluster quality of the simple single-pass k-means method is equivalent to that of the standard k-means, but the method is more reliable (in terms of trapping of centers) and is about 40% faster than the standard k-means. With real data from the KDD contest data set (95,412 points with 10 clusters), the cluster distortion of the original k-means algorithm was significantly less than that of the simple single-pass algorithm.














CHAPTER 3
GENERAL APPROACH TO WEB USAGE MINING

In Chapter 2, we mention the categorization of web mining into three areas of

interest: web content mining, web structure mining, and web usage mining. Web content

mining focuses on techniques for searching the web for documents whose contents meet web users' queries [BS02]. Web structure mining is used to analyze the information contained in links, aiming to generate a structural summary about web sites and web pages. Web usage mining attempts to identify (and predict) web users' behavior by applying data mining techniques to discover usage patterns from their interactions while surfing the web. In this chapter, we introduce our approach to mining web usage data using the k-means algorithm to address the issues identified in Section 1.1.

3.1 The Mining of Web Usage Data

Companies apply web usage mining techniques to understand and better serve the

needs of their current customers and to acquire new customers. The process of web

usage mining can be separated into three distinct phases: pre-processing, pattern

discovery, and pattern analysis [CDSTOO]. The web usage mining process could also be

classified into one of two commonly used approaches [BL99]. One approach applies pre-

processing techniques directly to the log data prior to adapting a data mining technique.

The other approach maps the usage data from the logs into relational tables before the

mining is performed. The sample data we obtained from KDD Cup 2000 were in flat

files, therefore, we chose the second of the two approaches for our implementation.










Figure 3.1 depicts the web usage mining process from a high-level perspective [CMS99].

The subsequent sections of this chapter will explain the three phases of the process.






[Figure: the three phases of the process — Preprocessing, Pattern Discovery, and Pattern Analysis — transform raw usage data (clickstream and server log data) into rules, patterns, and statistics.]


Figure 3.1 High Level Web Usage Mining Process

3.1.1 Pre-processing Data for Mining

The raw data collected in web server logs tends to be abstruse and must be organized to make it easier to mine for knowledge. Pre-processing consists of

converting usage information contained in the various available data sources into the

abstractions necessary for pattern discovery [BS02]. There are a number of issues in pre-

processing data for mining that must be addressed prior to utilizing the mining algorithm.

These include developing a model of access log data, developing techniques to filter the

raw data to eliminate irrelevant items, grouping individual page access into units (i.e.,

transactions), and specializing generic data mining algorithms to take advantage of the

specific nature of the access log data [CMS97].









The first pre-processing task, referred to as data cleaning, essentially eliminates

irrelevant items that may impact the analysis result. This involves determining if there

are important accesses or specific access data that are not recorded in the access log.

Improving data quality involves user cooperation, which is very difficult (but

understandably so) because the individual may feel as if the information requested of

them violates their privacy needs.

Another pre-processing task is the identification of specific transactions or

sessions. The goal of this task is to clearly discern users based on certain criteria (in our

case, attributes). The formats of these transactions and/or sessions are tightly coupled

with the data collection process. The poor selection of values to collect about the users

increases the difficulty of this identification task.

3.1.2 Pattern Discovery

The next phase of the web usage mining process, pattern discovery, varies

depending on the needs of the analyst. Algorithms and techniques from various research

areas such as statistics, machine learning, and data mining are applied during this phase.

Our focus is on finding trends in the data by grouping users, transactions, sessions, etc.,

to understand the behavior of the visitors. Clustering, a data mining technique, is well

suited for our desired results.

Web usage mining can facilitate the development and execution of future

marketing strategies and promote efficient and effective web site management by

analyzing the results of clustered web log data. There are different ways to break down

the clustering process. One way is to divide it into five basic steps [Mas02]:

1. Pre-processing and feature selection. Most clustering models assume all data items
are represented by n-dimensional feature vectors. To improve the scalability of the
problem space, it is often desirable to choose a subset of all the features (attributes)









available. During this first step, the appropriate feature is chosen as well as the
appropriate pre-processing and feature extraction on data items to measure the values
of the chosen feature set. This step requires a good deal of domain knowledge and
data analysis.
NOTE: Do not confuse this step with the pre-processing step of web usage
mining. This step is done after the data has been cleansed.

2. Similarity measure. This is a function that receives two data items (or two sets of
data items) as input and returns a similarity measure between them as output. Item-
item versions include the Hamming distance [Bla02c], Mahalanobis distance,
Euclidean distance, inner product, and edit distance (simple versions of a few of these
are sketched after this list). Item-set versions use any item-item version as a
subroutine and include the max/min/average distance; another approach evaluates the
distance from the item to a representative of the cluster, where point representatives
(centroids) are chosen as the mean vector/mean center/median center of the set, and
hyperplane or hyperspherical representatives of the set can also be used.

3. Clustering algorithm. Clustering algorithms generally use particular similarity
measures as subroutines. The choice of clustering algorithm depends on the desired
properties of the final clustering and the time and space complexity. Clustering user
information or data items from web server logs aid companies with web site
enhancements such as automated return mail to visitors falling within a specific
cluster or dynamically changing a particular site for a customer/user on a return visit,
based on past classification of that visitor [CMS99].

4. Result validation. Do the results make sense? If not, we may want to iterate back to
a prior stage. It may also be useful to do a test of clustering tendency, to estimate the
presence of clusters at all.
NOTE: Any clustering algorithm will produce some clusters regardless of
whether or not natural clusters exist.

5. Result interpretation and application. Typical applications of clustering include
data compression (representing data samples by their cluster representative),
hypothesis generation (looking for patterns in the clustering of data), hypothesis
testing (e.g., verifying feature correlation or other data properties through a high
degree of cluster formation), and prediction (once clusters have been formed from the
data and characterized, new data items can be classified by the characteristics of the
cluster to which they would belong).
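As noted in step 2, the following are small, self-contained versions of three of the item-item similarity measures; they are generic illustrations rather than code from our implementation (the Mahalanobis distance is omitted because it additionally requires the covariance matrix of the data).

// Simple item-item distance functions usable as the similarity measures listed in step 2.
public class DistanceSketch {

    // Euclidean distance between two numeric feature vectors.
    public static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum);
    }

    // Manhattan (city-block) distance between two numeric feature vectors.
    public static double manhattan(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) sum += Math.abs(a[i] - b[i]);
        return sum;
    }

    // Hamming distance: the number of positions in which two categorical vectors differ.
    public static int hamming(String[] a, String[] b) {
        int differences = 0;
        for (int i = 0; i < a.length; i++) if (!a[i].equals(b[i])) differences++;
        return differences;
    }

    public static void main(String[] args) {
        System.out.println(euclidean(new double[] {0, 0}, new double[] {3, 4}));                  // 5.0
        System.out.println(manhattan(new double[] {0, 0}, new double[] {3, 4}));                  // 7.0
        System.out.println(hamming(new String[] {"Mon", "home"}, new String[] {"Mon", "cart"}));  // 1
    }
}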


3.1.3 Pattern Analysis

The final stage of web usage mining is pattern analysis. The discovery of web

usage patterns would be meaningless without mechanisms and tools to help analysts

better understand them. The main objective of pattern analysis is eliminating irrelevant









rules or patterns and extracting rules or patterns from the output of the previous stage

(pattern discovery). The output of web mining algorithms, in its original state, is usually incomprehensible to the naked eye and thus must be transformed into a more readable

format. These techniques have been drawn from fields such as statistics, graphics and

visualizations, and database querying.

Visualization techniques have been very successful in helping people understand

various kinds of phenomena. Bharat and Pitkow [BP94] proposed a web path paradigm

in which sets of server log entries are used to extract subsequences of web traversal

patterns called web paths along with the development of their WebViz system for

visualizing WWW access patterns. Through the use of WebViz, analysts are provided

the opportunity to filter out any portion of the web deemed unimportant and selectively

analyze those portions of interest.

In [Dyr97], OLAP tools proved to be applicable to web usage data since the analysis needs were similar to those of a data warehouse. The rapid growth of access information increases the size of the server logs very quickly, making on-line analysis of all of it impractical. Therefore, to make on-line analysis feasible, there is a need to summarize the log data.

Query languages allow an application or user to express what conditions must be

satisfied by the data it needs rather than having to specify how to get the required data

[CMS97]. Potentially, a large number of patterns may be mined, thus a mechanism to

specify the focus of analysis is necessary. One approach would be to place constraints on

the database to restrict a certain portion of the database to mine. Another method would









be to perform the querying on the knowledge that has been extracted by the mining

process, which would require a language for querying knowledge rather than data.

3.2 Web Usage Mining with k-means

The algorithms used for most of the initial web mining efforts were highly

susceptible to failure when operating on real data, which can be quite noisy. In [JK98],

Joshi and Krishnapuram introduce some robust clustering methods. Robust techniques

typically deal only with a single component and thus increase the complexity when

applied to multiple clusters. Fuzzy clustering techniques are capable of addressing the

problem of multiple clusters. Fuzzy clustering provides a better description tool when

clusters are not well separated [Bez81], which may happen during web mining. Fuzzy

clustering for grouping web users has been proposed in [BH93], [FKN95], and [KK93].

Rough set theory [Paw82] has been considered an alternative to the fuzzy set

theory. There is limited research on clustering based on rough set theory. Lingras and

West [LW02] adapted the k-means algorithm to find cluster intervals of web users based

on rough set theory. They applied a pre-processing technique directly to the log data

prior to adapting a data mining technique. This was possible because of their involvement in the data collection process, which allowed them to filter information into specific pre-defined categories before mining the data. After applying the k-means

method, they analyzed the data based on the knowledge of the initial classifications.

3.2.1 Our Web Usage Mining Approach

In this thesis, our approach was indirectly imposed on us due to the original

format of the log data. We chose the second of the two approaches mentioned in Section 3.1, while still applying the three-phase process also described in that section. In the pre-

processing phase, we convert the flat files into relational tables to utilize the advantages








of structured query languages to retrieve desired data from the logs. The feature selection

step of our pattern discovery phase is taken as input from the analyst (or user of our

algorithm). We chose to implement a variation of the k-means algorithm due to its

computational strengths for large data sets. For pattern analysis, we graphed the results

discovered in the previous phase to improve human comprehension of the knowledge.

The next chapter describes the architecture and implementation strategies for our k-

means algorithm when used in accordance with web mining.














CHAPTER 4
ARCHITECTURE AND IMPLEMENTATION

The web usage mining process discussed in Section 3.1 is commonly used

throughout the research community. The architecture of our web usage mining solution encompasses most of the phases and steps mentioned in Chapter 3; however, choosing to use our version of k-means as our clustering method led to the exclusion of a few steps. Another reason for omitting steps was our lack of input into the data collection. Section 4.1 provides insight into our architectural structure, and Section 4.2 explains the details of our k-means implementation.

4.1 Architecture Overview

Our algorithm's architectural structure consists of two Java modules carrying out

three execution phases. The first class, namely ReadData, accepts the user input, reads

the data from the files, and clusters the data points accordingly. The ClusterValues class

maintains cluster information such as the number of points in each cluster, all of the point

values in each cluster, and the centroid value of the cluster. The three phases have the

same goals as those mentioned in the previous chapter for the web usage mining process,

however, our clustering algorithm implementation gave us the freedom to omit time

consuming steps. The architecture divides the web usage mining process into two main

parts. The first part involves the usage-domain-dependent processes of transforming the

web data into suitable transaction form. The second part includes the application of our

k-means algorithm for data mining and pattern matching and analysis techniques. Figure

4.1 depicts the architecture for our web usage mining project. This section describes the










steps taken to complete each phase in the process. The next section explains our

algorithm in its entirety in conjunction with the modular interaction.






[Figure: pipeline from server log data to relational tables to transaction/session data, which is analyzed (using the selected attributes and number of clusters K) to produce knowledge.]




Figure 4.1 Our Web Usage Mining Architecture
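For orientation, a hypothetical skeleton of the ClusterValues bookkeeping class is shown below; it is a sketch reconstructed from the description above, not the actual source code, and every member name other than the class name is an assumption.

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the ClusterValues bookkeeping class described above.
// Member names other than the class name are illustrative assumptions.
public class ClusterValues {
    private final double centroid;                          // centroid value of this cluster
    private final List<Double> points = new ArrayList<>();  // point values assigned to this cluster

    public ClusterValues(double centroid) {
        this.centroid = centroid;
    }

    public void addPoint(double value) {
        points.add(value);
    }

    public int getNumberOfPoints() {
        return points.size();
    }

    public double getCentroid() {
        return centroid;
    }

    public List<Double> getPoints() {
        return points;
    }
}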

4.1.1 Phase 1 Pre-processing

We began our pre-processing phase with the data already condensed in one format,

flat files, as our input. Typical web usage data exists in web server logs, referral logs,

registration files, and index server logs. Intelligent integration and correlation of

information from these diverse sources can reveal usage information that may not be

evident from any one of these individually. We have assumed that the content of these

files was already in its integrated state when obtained from KDD Cup 2000.

The data learning task of our pre-processing phase primarily involved improving the

understandability of the data. Column names and, in some instances, a list of column

values for the comma-delimited flat files were provided; however, the values were still

difficult to discern. We decided to convert the flat files into relational tables to both

match the column values with their column names and take advantage of the data

retrieval methods provided by relational database management systems (RDBMS) during

the mining stage. After transforming the format of the data, we removed empty-valued









columns and those columns deemed uninteresting and/or unnecessary for our desired

results at this stage of the process.

The transaction identification task of this phase distinguishes independent users,

transactions, or sessions. This task is simplified when the data collected is carefully

selected and conducive to the overall objectives of the mining process. The data set used

in this thesis was divided into two tables: one containing the visitors' click-stream data,

the other customer order information. We did not apply any identification techniques to

the data; we simply "learned" the data itself and focused on attributes/columns that were

relevant to a user, transaction, or session. For example, the click-stream data has session

related attributes (e.g., SESSION_ID, SESSION_FIRST_REQUEST_DAY_OF_WEEK,

etc.) that we used to identify sessions. At this point, we retained data comprising

specific users, transactions, and sessions in the tables for future refinement in the next

phase, pattern discovery.

4.1.2 Phase 2 Pattern Discovery

As we enter the pattern discovery phase, we would like to reiterate our web usage

mining goal of finding trends in the data to understand the behavior of the visitors.

Clustering techniques used in this research area will group together similar users based

on the analyst-specified parameters. We begin the clustering process by reducing the

dimensionality of the data set during the pre-processing/feature selection step. This step

allows the analyst to select the attributes necessary to explore the targeted regions of the

data set. This pre-processing clustering step differs from the pre-processing phase of web

usage mining because it identifies the features needed as input for the clustering

algorithm specifically as opposed to the general information resulting from data cleaning









and transaction identification. The columns chosen during this step represent the n-

dimensional feature vectors.

The heart of this thesis lies in this next step, which is the clustering

technique selection and implementation. We chose to use the popular k-means because

of its ability to produce good clusters and its efficient processing of large data sets. There

has been limited research done using k-means for web mining outside of fuzzy and rough

set approaches mentioned in Chapter 3. Section 4.2 explains the implementation of our

version of the k-means algorithm.

After executing the algorithm, we reviewed the results to judge their legitimacy.

If the results seemed unreasonable, we regressed back to the feature selection step to

refine our query. This refinement process is intended to assist in finding patterns in the

clustering, also known as hypothesis generation. Hypothesis generation exposes trends in

the data. We could also have used the results to predict the future behavior of the customers if
this website still existed. The analysis of those results could have helped the company maintain
its current customers and acquire new ones, and therefore prevented it from going under.

4.1.3 Phase 3 Pattern Analysis

The pattern analysis phase provides tools and mechanisms to improve the analyst's

understanding of the patterns discovered in the previous phase. During this phase, we

eliminated content (patterns) that did not reveal useful information. We did not use a tool

to aid in our analysis. Instead, we used a non-automated graphing method to visualize

our results. The visualization depicted the mined data in a manner that permitted the

extraction of knowledge by the analyst.









4.2 Algorithm Implementation

The pattern discovery phase is a critical component of the web mining process

and usually adopts one of several techniques to complete successfully: statistical
analysis, association rules, classification, sequential patterns, dependency modeling, and
clustering [Wan00]. Statistical analysis of information contained in a periodic web

system report can be potentially useful for improving system performance, enhancing the

security of a system, facilitation of the site modification task, and providing support for

marketing decisions [Coo00]. In web usage mining, association rules refer to sets of

pages that are accessed together with a support value exceeding some specific threshold.

Classification techniques are used to establish a profile of users belonging to a particular

class or category by mapping users (based on specific attributes) into one of several

predefined classes. Sequential pattern analysis aims to retrieve subsequent item sets in a

time-ordered set of sessions or episodes to help place appropriate advertisements for

certain user groups. Dependency modeling techniques display significant dependencies

amongst the various variables in the web domain to provide a theoretical framework for

analyzing user behavior and predicting future web resource consumption. Clustering

techniques group together data items with similar characteristics. In our research, we

would like to extract knowledge from the data set based on specific attributes of interest.

Cluster analysis grants the opportunity to achieve such a goal.

In Section 2.2, we discussed various clustering techniques and algorithms. Web

server log data files can grow very quickly depending on the amount of data collected

per visit by the user. The navigational and purchasing information collected for our data

set totaled approximately 1.7 gigabytes over a period of two months back in the year

2000 when the concept of web data mining was in its infancy. The current data









collection methods and techniques are far more advanced and may collect the same

amount of data daily. Therefore, the clustering algorithm needed for the pattern

discovery phase had to be reliable and efficient when applied to large data sets. The

traditional k-means algorithm would suffice; however, there are a few characteristics

about our data set that expose drawbacks in the algorithm. The total number of attributes

of the combined data files is 449 (217 click-stream and 232 purchasing). We needed to

reduce the vector dimensionality and use a representative sample set of data to improve

the scalability and efficiency of the algorithm, respectively. Web logs contain non-

numeric and alphanumeric data, which are both prohibited as input for the standard k-

means algorithm. Our clustering algorithm must therefore accept such values as input.
In this section, we discuss our version of the k-means algorithm and

how it addressed the issues above.

Recall Section 4.1 when we mentioned the feature selection step in the pattern

discovery phase. This step essentially covers the first two tasks of our algorithm and

requires user input. The first task is entering the desired number of clusters with the

maximum number being ten. Excessive clusters create a dilution of data, which could

potentially further complicate the analysis. The other user-required input is the attributes

to query. The arbitrary selection of these attributes produces meaningless clusters. This

task requires at least some knowledge of the data as well as a predetermined goal.

Querying completely unrelated attributes could return interesting results; however, that

may be unlikely. The pre-processing phase cleanses and organizes the data to prepare the

data for pattern discovery. The first two tasks of our algorithm, which actually serve as












the feature selection step, allow the analyst to select specific attributes to mine for


knowledge.
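To make these two tasks concrete, the following is a minimal sketch of how the number of clusters and the attribute list might be collected and validated; it is illustrative only, and names such as FeatureSelection and MAX_CLUSTERS are hypothetical rather than taken from our actual code.

import java.util.Scanner;

// Minimal sketch of the feature selection step: the analyst supplies the number of
// clusters (capped at ten) and the attributes to query. Class and variable names
// (FeatureSelection, MAX_CLUSTERS) are hypothetical.
public class FeatureSelection {
    static final int MAX_CLUSTERS = 10;   // excessive clusters dilute the data

    public static void main(String[] args) {
        Scanner in = new Scanner(System.in);

        System.out.print("Number of clusters (1-" + MAX_CLUSTERS + "): ");
        int k = Integer.parseInt(in.nextLine().trim());
        if (k < 1 || k > MAX_CLUSTERS) {
            throw new IllegalArgumentException("k must be between 1 and " + MAX_CLUSTERS);
        }

        System.out.print("Attributes to query (comma separated): ");
        String[] attributes = in.nextLine().split(",");   // e.g. SESSION_ID,ORDER_AMOUNT

        System.out.println("Clustering " + attributes.length + " attributes into " + k + " clusters");
    }
}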


1 class ReadData
2 {
4
5    //variable initializations
6
7
8
9    public ReadData()
10   {
11      stop = false;
12      file = "Click_7500.txt";
13
14
15
16   public void readInLine()
17   {
18
19
20      //initialize method variables
21
22
23      for (resultColNum = (pArrayMax - 1); resultColNum < pArrayMax; resultColNum++)
24      {
25         for (int c = 0; c <= (int)(kClusters + 1); c++)
26         {
27            if (pointsArray[resultColNum] <= clusterMax[c])
28            {
29               cvStorage.addClusterValues(c + 1, pointsArray[resultColNum], d);
30               cvVector.addElement(cvStorage);
31               d++;
32               break;
33            }
34      //retrieves the cluster information for the specified cluster number
35
36
37      cvStorage = new ClusterValues();
38
39
40
41
42      //if the specified cluster contains any points, then compute
43      //the centroid value
44
45
46
47




Figure 4.2 The ReadData module


The values mentioned in the previous paragraph are collected in the main method of


the ReadData class, shown in Figure 4.2, used to implement our algorithm. Once these


two values have been determined, we call the method located at line 16 of Figure 4.2,


readInLineO, to perform the grunt work of the implementation. This method begins with


reading the first line of the file specified in the class constructor. The target file would


contain sample data that had been generated from a simple query run against one or both










of the tables. The results of the query would then be exported to a delimited file and

essentially serve as the cleansed version of the log file.
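As an illustration of this query-and-export step, the sketch below uses JDBC to pull a few session-related columns and write them to a comma-delimited file; the connection URL, credentials, table name, and column names are assumptions made for the example, and the appropriate JDBC driver is assumed to be on the classpath.

import java.io.PrintWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Sketch: run a simple query against the click-stream table and export the result
// set to a delimited file that serves as the cleansed input for clustering.
// The URL, credentials, table, and column names are hypothetical.
public class ExportSample {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection("jdbc:db2:WEBDATA", "user", "password");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT SESSION_ID, SESSION_FIRST_REQUEST_DAY_OF_WEEK, PRODUCT "
                 + "FROM CLICKDATA FETCH FIRST 7500 ROWS ONLY");
             PrintWriter out = new PrintWriter("Click_7500.txt")) {

            int cols = rs.getMetaData().getColumnCount();
            while (rs.next()) {
                StringBuilder line = new StringBuilder();
                for (int c = 1; c <= cols; c++) {        // JDBC columns are 1-based
                    if (c > 1) line.append(",");
                    line.append(rs.getString(c));        // raw value, one column per field
                }
                out.println(line);                       // one row of the sample per line
            }
        }
    }
}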

As the first line of data is parsed, we smoothly transition into the next task of our

algorithm, which involves calculating the data point values. The number n of attributes
selected during the feature selection step determines the vector size. The values in a web log

can be numeric, non-numeric, or alphanumeric so, unlike traditional k-means algorithms,

our algorithm must support all three value types. We handle this issue by using the

ASCII (American Standard Code for Information Interchange) value of each character
(digit, letter, or special character) for computation. We begin by calculating the value
for each individual attribute A_i, where i = 0, ..., n-1, of the n-dimensional vector.

(1)   A_i = (c_0 + c_1 + ... + c_{d-1}) / d ,   where d is the length (number of characters) of the attribute and c_j is the ASCII value of its j-th character

Next, we compute the vector value of the entire row of n attributes. This is done

by dividing the sum of the individual attribute values A_i by the number of columns n.

(2)   R_m = (A_0 + A_1 + ... + A_{n-1}) / n ,   where m is the row number in the table
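A minimal sketch of these two calculations is shown below, assuming each selected attribute is handled as a string; the method names are illustrative and not taken verbatim from our ReadData implementation.

// Sketch of equations (1) and (2): an attribute value is the average ASCII code of
// its characters, and the row (data point) value is the average of the attribute values.
public class PointValue {

    // Equation (1): average ASCII value of the characters of one attribute.
    static double attributeValue(String attribute) {
        if (attribute.isEmpty()) {
            return 0;                          // guard against empty columns
        }
        double sum = 0;
        for (int j = 0; j < attribute.length(); j++) {
            sum += attribute.charAt(j);        // a char widens to its ASCII/Unicode code
        }
        return sum / attribute.length();
    }

    // Equation (2): average of the n attribute values in one row.
    static double rowValue(String[] attributes) {
        double sum = 0;
        for (String a : attributes) {
            sum += attributeValue(a);
        }
        return sum / attributes.length;
    }

    public static void main(String[] args) {
        // One comma-delimited row with three selected attributes.
        String[] row = "10234,Tuesday,12.95".split(",");
        System.out.println("R = " + rowValue(row));
    }
}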

After R_1 is computed, it becomes the minimum value (min) by default. The next
nonequivalent row vector value R_m detected replaces the min if it is lower than R_1 or it

becomes the maximum value (max) if it is higher. The point values computed after the

max and min values have been selected are compared to both values and replaced

accordingly, if necessary. We then subtract the min from the max value to determine the

range of the points.









(3)   diff = max - min

The diff value obtained at the end of the third task is the numerator of the fraction

used to compute the cluster thresholds. The denominator of that fraction is the number of

clusters provided by the analyst during the feature selection step.


, where k is the number of clusters


k

The threshold value t does not represent the threshold value for each individual

cluster but is used when computing the upper boundary of each cluster. For example,
the threshold interval for the first cluster, t1, ranges from the min to the min plus t
minus one hundred-thousandth (0.00001), both endpoints inclusive. Continuing to
the threshold of the second cluster, t2, the minimum value of t2 would be min1 plus t
and the maximum value would be min2 (the minimum value of the second cluster) plus t
minus 0.00001. This description can be represented mathematically as:


t1 = [min1, min1 + t - 0.00001]          where min1 is the minimum point value (the min)

t2 = [min2, max2]                        where min2 = min1 + t and max2 = min2 + t - 0.00001

. . .

tk = [mink, maxk]                        where mink = min(k-1) + t and maxk = mink + t - 0.00001










If a point value exists between two consecutive thresholds, its value is rounded to the

nearest hundred thousandth and clustered accordingly without changing its original value

in the cluster. We chose to use the hundred-thousandth figure because most of the data
points were calculated to that precision. Once the final row vector value, R_m, has been

calculated, the data points, the min and max, and the cluster thresholds have been

determined and each data point has been placed in its proper cluster only after one scan of

the data set.
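The sketch below illustrates equations (3) and (4) and the single-pass assignment just described; it uses a small hard-coded array of row values and omits the rounding refinement, so it is a simplified illustration under those assumptions rather than the actual ReadData code.

import java.util.ArrayList;
import java.util.List;

// Sketch: the range of the row values determines k equal-width intervals, and every
// point is placed into the first cluster whose upper boundary it does not exceed.
public class ThresholdClustering {
    public static void main(String[] args) {
        double[] points = {1.2, 3.7, 2.9, 8.4, 5.5, 7.1};   // example row values R_m
        int k = 3;                                           // analyst-chosen number of clusters

        double min = points[0], max = points[0];
        for (double p : points) {                            // find the range of the points
            if (p < min) min = p;
            if (p > max) max = p;
        }
        double diff = max - min;                             // equation (3)
        double t = diff / k;                                 // equation (4)

        double[] clusterMax = new double[k];                 // upper boundary of each cluster
        for (int c = 0; c < k; c++) {
            clusterMax[c] = min + (c + 1) * t;
        }

        List<List<Double>> clusters = new ArrayList<>();
        for (int c = 0; c < k; c++) {
            clusters.add(new ArrayList<>());
        }

        // Single scan: assign each point to its cluster.
        for (double p : points) {
            for (int c = 0; c < k; c++) {
                if (p <= clusterMax[c] + 1e-9) {             // small tolerance for the top boundary
                    clusters.get(c).add(p);
                    break;
                }
            }
        }
        for (int c = 0; c < k; c++) {
            System.out.println("Cluster " + (c + 1) + ": " + clusters.get(c));
        }
    }
}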

1 class ClusterValues
2 {
3
4    private int clusterNum = 0;
5    private float pointValue = 0;
6    private int pointNumber = 0;
7    private ClusterValues cvals = this;
8    private Vector cv = new Vector();
9
10
11   public ClusterValues()
12   {
13      clusterNum = 0;
14      pointValue = 0;
15      pointNumber = 0;
16   }
17
18
19   public void addClusterValues(int c, float v, int p)
20   {
21
22
23
24
25
26   }
27
28
29   public void calculateCentroids(int c)
30   {
31
32
33
34
35
36   }
37
38 }


Figure 4.3 The ClusterValues module









The final step of our algorithm calculates the centroids (representative points) for

each cluster. The centroid computation takes place in the method beginning at line 29,

calculateCentroids(), of the ClusterValues class displayed in Figure 4.3. The
ClusterValues module shown in Figure 4.3 is the structure responsible for maintaining all
the relevant information about each cluster, such as the point value(s) and the number of
points present in the cluster. The addClusterValues() method, which starts at line 19 in
Figure 4.3, requires the cluster number, the point value, and the element number of the
cluster, all of which are calculated in ReadData.readInLine(). These values are stored in
Java's Vector (java.util.Vector) object and retrieved in
ClusterValues.calculateCentroids() to calculate the centroid value. We perform this task

by dividing the sum of the point values in a specific cluster by the number of points in

that cluster if that cluster contains any point values. This point represents the mean of the

cluster without measuring the distance between each point and centroid. This permits the

exclusion of step two mentioned in Section 3.1.2 and therefore reduces the computational

complexity.
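A minimal sketch of this final step is given below, assuming the point values of each cluster have already been collected in a list; the class and method names are illustrative.

import java.util.Arrays;
import java.util.List;

// Sketch: the centroid of a non-empty cluster is simply the mean of its point values,
// computed without any distance measurements between points and centroids.
public class Centroids {
    static double centroid(List<Double> clusterPoints) {
        if (clusterPoints.isEmpty()) {
            return Double.NaN;                 // an empty cluster has no representative point
        }
        double sum = 0;
        for (double p : clusterPoints) {
            sum += p;
        }
        return sum / clusterPoints.size();
    }

    public static void main(String[] args) {
        System.out.println(centroid(Arrays.asList(1.2, 2.9, 3.7)));   // mean of 1.2, 2.9, and 3.7
    }
}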

If you refer back to Section 2.2.5, you will notice several differences between our

procedures used to implement the k-means algorithm and other implementations. The

first significant difference appears as early as the first step of the standard algorithm, the
selection of the initial points, which influence the clustering results tremendously. In most
cases, these points are randomly

selected and may require numerous executions or a large amount of knowledge of the

data set by the analyst. The former could become tedious and the latter may be an

unrealistic expectation. Our first two tasks, projecting the number of clusters needed and

selecting the attributes to query, do not require a great deal of knowledge about the data









set. The only pre-requisite of our algorithm is a clearly defined goal. This allows the

analyst to specify the appropriate amount of categories (clusters) based on targeted

characteristics (attributes). Our centroid creation process is performed as the very last

task. It is done after all of the vector values (data points) have been calculated and

clustered to determine what the clusters represent. This reduces the algorithm's

execution time because it removes from our implementation the similarity measurement
task, in which each data point is compared to each centroid using a distance function and
assigned to the nearest one. The run time is reduced further

in our algorithm because we scan and cluster the data only once. Multiple iterations of

the data points and re-calculations of the centroids improve the clustering quality at

the expense of time. Chapter 5 will present a performance analysis of our algorithm

compared to other proposed k-means algorithms and Chapter 6 will show how our

method fared against one of the industry's leading applications in data mining.














CHAPTER 5
PERFORMANCE ANALYSIS

When writing software, the criteria for evaluation pertain to the correctness of

the algorithm with respect to the specifications and the readability of the code. There are

other criteria for judging algorithms that have a more direct relationship to performance,

namely their computing time and storage requirements. The time complexity (or

run/execution time) of an algorithm is the amount of computer time it needs to run to its

completion. The space complexity of an algorithm is the amount of memory it needs to

run to completion [HRS98]. The time complexity is based on the access time for each

data point (in our case, a row of data). If each row is accessed and re-calculated over multiple

iterations, the k-means algorithm could become inefficient for large databases. The space

complexity deals with the data set size and variables that may affect it. We will not

evaluate the space complexity of our algorithm.

In the second part of this chapter, we compare the clustering results of the KDD

Cup 2000 data set when using a leading data mining software to the results obtained

when applying our algorithm to the data. We will show that our k-means method

produces a comparable quality of clusters as one of the leading data mining tools. We

will then conclude our research efforts and contributions in the final chapter, Chapter 6.









5.1 Experimental Evaluation

The development of our k-means algorithm initially began on Microsoft's

Windows 98 operating system, using pcGRASP2 Version 6, a free programming

application developed at Auburn University, as our Java programming environment.

pcGRASP was the recommended environment for completing our programming

assignments in the Programming Languages Principles (PLP) course instructed by Dr.

Beverly Sanders. The engine of this home personal computer (PC) consisted of 164

megabytes of random access memory (RAM), a 450 megaHertz Pentium II processor,

and 8 gigabytes of hard disk space. Previously installed software along with important

documents and files occupied almost 50% of the hard disk, leaving roughly 4 gigabytes

during execution. The size of the combined data sets, stored in flat files, consumes about

1.5 gigabytes of disk space. Although we planned to use samples of the data during the
experimental phase, we suspected that the remaining 2.5 gigabytes of disk space would be
inadequate. We then

purchased and installed a 20 gigabyte hard drive as the primary master partition, moving

the contents from the 8 gigabyte disk to the new one. Now, prior to installation of

additional software, we had a total of 21.5 gigabytes of free space: 13.5 on the c:\ drive

and 8 on the newly formatted d:\ drive.

It was rather difficult to produce samples of the data set from flat files, so the
database search began. The minute availability of resources limited our options to either

Sybase or Microsoft Access. The obvious choice, since Sybase is Unix-based, was

Microsoft Access. Microsoft Access was able to handle the large amount of data,

however, it took several hours to load (import) the data and the database only created a

link from the table defined in Access to the flat file that contained the data. This would










definitely have a negative effect on performance. Fortunately, the DBCenter3 acquired a

license for Oracle 8i. Oracle 8i only supports imported data that result from the export

utility of a previous version of Oracle. We unsuccessfully attempted to use Oracle's

SQL*Loader utility to load our delimited flat file data into the database due to various

data type incompatibilities with the syntax needed for this utility's control file (see Figure

5.1).

LOAD DATA
INFILE 'd:\thesis\code\click_data_7500.txt'
REPLACE
INTO TABLE clickdata TRAILING NULLCOLS
(
CUSTOMER_ID INTEGER EXTERNAL TERMINATED BY ',',
SESSION_ID INTEGER EXTERNAL TERMINATED BY ',',
PRODUCT CHAR TERMINATED BY ',',
FLAGS INTEGER EXTERNAL TERMINATED BY ',',
HIT_NUMBER INTEGER EXTERNAL TERMINATED BY ',',
TIMESTAMP DATE "yyyy-dd-mm-hh.mm.ss",

REFERRAL_URL CHAR TERMINATED BY ','


Figure 5.1 A sample SQL*Loader control file

A typical control file (.ctl) would not specify the data types of each field because

the utility requires the existence of the table in the Oracle database prior to loading data

to it. However, if the format of the data confuses the tool, one must specify the data

types per column in the control file. We therefore obtained a copy of IBM's DB2 application,
but several prerequisites had to be met prior to installing the software. DB2 version 7,
Personal or Enterprise Edition, requires the user to have administrative privileges on the

operating system. Windows 98 does not support administrative users, which prohibited

the installation; therefore, we decided to change the operating system to Windows 2000


2 http://www.eng.auburn.edu/grasp









Professional Edition. After installing DB2 version 7.2 fixpack 5 and creating the

structured query language (SQL) to define the tables to store the data, we loaded the data

from the flat files to the database using DB2's wizard for importing data in a matter of

minutes.

The data set used in the experimental portion of the thesis is from a KDD Cup

2000 competition. It contains clickstream and order information from an e-commerce

website which went out of business after only a short period of existence. A clickstream
can be defined as a sequential record of a user's navigational path throughout a website visit.

Order data includes product information, number of items purchased, etc. The

clickstream data is significantly larger (over 700,000 rows) than the order data;
however, both files (in our case, tables) may be applied to the web mining process.

The clickstream data provided was collected over roughly two months (January 30, 2000
through March 31, 2000) but contained 98 (out of 217) attribute column values (per row of

data) that were either missing or null. To improve scalability, we chose to use a sample
of the data: the first data-intensive 7500 rows. We selected this sample for two reasons: it
represents roughly 1 percent of the entire data set, and it is approximately twice the number
of rows provided for the order data (3465 rows). The majority of the sample click data ranges

from Sunday, January 30, 2000 through Tuesday, February 2, 2000. The order data, which is

used in its entirety, remains within the two-month timeframe and has only 6 of its 232
columns deemed irrelevant. Although close to 50 percent of the click data columns

were not conducive to our research, we were still able to gain valuable knowledge from


3 http://www.cise.ufl.edu/dbcenter









the clustering results of the data set because of their significance. In the next section, we

discuss the clustering results from mining both the order and click data.


When discussing the efficiency of our algorithm, we use the following notation:


m  = number of k-means passes over a data set
m' = number of k-means passes over a buffer refill
n  = number of data points
b  = size of the buffer, as a fraction of n
d  = number of dimensions
k  = number of clusters


The time complexity of the standard k-means algorithm when using the above

notation becomes, more specifically, O(nkdm), where m grows slowly with n [EFL00].
In our algorithm, which scans the data only once, m is always equal to one. This not only

reduces the computational time to O(nkd), it also removes the computational time

necessary for cluster refinement (i.e., similarity measurements). As for the disk I/O

complexity, for the standard k-means it is O(ndm), the number of points times the

dimensions times the number of passes over the data set [EFL00]. Our algorithm passes

over the data once, therefore the disk I/O complexity would be O(nd).
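As a rough, purely illustrative comparison using sizes drawn from the experiments in this chapter (n = 7500 rows, d = 8 attributes, k = 9 clusters) and an assumed value of m = 10 passes for a multi-pass implementation (the value of m is hypothetical), the single scan removes the factor m:

O(nkdm) = 7500 x 9 x 8 x 10 = 5,400,000 operations
O(nkd)  = 7500 x 9 x 8      =   540,000 operations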

5.2 Web Clusters

The software tool used in our experimental section uses its own core data-

mining technology to uncover high-value intelligence from large amounts of enterprise

data including transaction data such as that generated by point-of-sale, automatic teller

machines (ATMs), credit cards, call center, or e-commerce applications. Early releases

of this industry-leading tool embodied proven data mining technology and scalability









options while placing significant emphasis on usability and productivity for data mining

analysts. The version used for these experiments places an increased focus on bringing

the value of data mining to more business intelligence users by broadening access to

mining functions and results at the business analyst's desktop. The types of mining

functions available with this tool include association, classification, clustering, sequential

patterns, and similar sequences. We compare/contrast our k-means clustering results with

the results of the clustering function of the tool.


[Figure: "Order Clustering Percentages" chart showing the share of data points in each of the
nine clusters produced by our algorithm.]

Figure 5.2 Order clustering results

Our example involved eight attributes from the order data pertaining to consumers'
weekly purchasing habits (the weekday, time of day, location, order amount, etc.),
represented using nine clusters. Figures 5.2 and 5.3 graphically display the number of

data points present in each individual cluster using our method and the software tool,

respectively. The cluster sizes differ by as little as 3% (Cluster 4) and by as much as 22%
(Cluster 6) because the clustering results have different representations in the two
applications. Table 5-1 elaborates on the nine clusters for the two applications. Our









algorithm, by design, sorts the data points in ascending order before clustering and

calculating the centroid values, creating a set of clusters distinct from that of the tool. The
software tool's results are obtained from a modal standpoint, where frequency statistics of the
raw data are emphasized. In our implementation, the analysis of the raw values, which are

printed to a file before calculating the data point, aids in determining the categorization of

each cluster. Although the resulting clusters from the tool differ in size and data

representation from our results, we show that the knowledge gained from our algorithm is

potentially just as useful.


[Figure: "Order Clustering Percentages" chart showing the share of data points in each of the
nine clusters produced by the data mining software tool.]

Figure 5.3 Data Mining Software Order clustering results

The information provided in Table 5-1 is indicative of the relationship between the cluster

percentages mentioned in the previous paragraph. For example, the Cluster 4 results of

the two techniques are most similar while the Cluster 6 results seem to be the most

dissimilar. The statistical results of the tool are comprised of the most frequently
used values of the active fields (attributes), which may lead to the analyst making









decisions based on assumptions about the raw data and not the knowledge gained from

the raw data itself. In their results, there was not any information pertaining to male

shoppers. Our data, in contrast, did not emphasize any single modal value, but did provide
monetary and age ranges in conjunction with location and sex, allowing decision-making

based on factual data instead of generalizations.

Regardless of the application used to analyze the data, it would be nearly

impossible to gain knowledge from the data if viewed by a human in its original state.

Both aid the business user considerably with the clustering results, with the software tool

having the edge because of its visualization and reporting tools. Nevertheless, our

numerical representation of the results brought us to the same conclusion(s) as their

visualizations: California-dwelling women who spent under $12 per order dominated

their consumer base, which means that the company needed to advertise more items

(maybe higher priced items as well) for women to maintain their current customers while

targeting men in the very near future to gain new customers. The previous statement may

seem intuitive, however, if this company had had tools to perform this analysis back in

2000, it may still be in business today!









Table 5-1 Cluster representations

            Our k-means algorithm                          Data mining software tool

CLUSTER 1   Predominantly women, ages 26-58, living        Thursday shoppers of unspecified age and
            in CA, who shop from Tuesday-Friday            sex, from Stamford, CT

CLUSTER 2   Men, 28-50 years of age, that usually          Women from San Fran, CA that shop on
            shop on the weekend                            Monday's @ 1pm

CLUSTER 3   Mix of men and women shoppers from all         Women from New York, NY that shop on
            over, that do not avg $12 per order            Wednesday's @ 10am

CLUSTER 4   Women ages 26-58 that shop Tuesday             Women from Texas that shop on Tuesday's
            thru Saturday                                  at 5pm, spending $13.95

CLUSTER 5   Women that spent at least $22 on their         Wednesday shoppers at 8pm from CA
            purchase, from all over the US, all week

CLUSTER 6   Texans (unspecified sex), ages 22-52,          36 year old women from Hermosa Beach,
            who shop mostly on Friday                      CA who usually shop on Thursday's @ 1am

CLUSTER 7   Thursday shoppers where the men are            New York dwelling women, shopping on
            from the mid and upper west, women             Tuesday's @ 4pm
            from eastern states

CLUSTER 8   Women ordering between 8am and 9am             Women from PA shopping on Wednesday's
                                                           @ 7am, but no later than 10pm (all week)

CLUSTER 9   Thursday-Sunday women shoppers of              36 year old women who spend over
            unspecified ages from TX and NY                $12/order, shop on Wednesday's @ 7pm














CHAPTER 6
CONCLUSION

6.1 Contributions

This thesis, simply stated, has improved the time complexity of a widely used pre-

existing algorithm and demonstrated its value if used appropriately by a profit-seeking

corporation. Our version of the k-means algorithm effectively removed two expensive

operations from the original algorithm: namely, the refinement step(s) that
include scanning the data set multiple times and re-calculating the representative points
(centroids) of each cluster. The implementation presented in this thesis reduces the

execution time of the algorithm by m, the number of k-means passes over a data set,

while also excluding the optional computations necessary for cluster refinement (i.e.

similarity measurements, etc.), bringing our total run time to O(nkd), where k is the

number of clusters and d is the number of dimensions (or active attributes). Since our

algorithm scans the data only once, the disk I/O is also reduced by m, therefore giving us

a disk I/O of O(nd). We also showed that our algorithm, when used as the clustering
technique during the pattern discovery phase of the web usage mining process, performs
comparably to an industry-leading data mining tool.

6.2 Proposed Extensions and Future Work

We chose to leave the comparison of our algorithm to the standard k-means algorithm

for future work. This would require a slight variation of the original algorithm so that it
can accept not only numerical data, but also non-numerical and
alphanumerical data as input. Another potential research interest would be to develop a









schema or warehouse to store the data for both the navigational and

purchasing data and mine them as one unit. Usage data collection over the web is

incremental and distributed by its very nature. Valuable information about the data could

be extracted if all the data were to be integrated before mining. However, in the

distributed case, a data collection approach from all possible server logs is both non-

scalable and impractical mainly because of the networking issues involved. Hence, there

needs to be an approach where mined knowledge from various logs can be integrated

together into a more comprehensive model. As a continuation of that issue, the creation

of intelligent tools that can assist in the interpretation of mined knowledge remains open.

This would assist the business analyst by revealing commonalities or "obvious" trends

sooner to allow him/her to focus on the non-intuitive results.















LIST OF REFERENCES


[AS94] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules," In
J. B. Bocca, M. Jarke, and C. Zaniolo, editors, In Proceedings of the Twentieth International
Conference on Very Large Data Bases (VLDB), p. 487-499, Morgan Kaufmann, 1994.

[Al-D95] M.B. Al-Daoud, The Development of Clustering Methods for Large
Geographic Applications, doctoral dissertation, School of Computer Studies, University
of Leeds, 1995.

[ABKS99] M. Ankerst, M. Breunig, H-P.Kriegel and J. Sander, "OPTICS: Ordering
Points To Identify the Clustering Structure," In Proceedings ACM SIGMOD99
International Conference on Management of Data, Philadelphia, p. 49-60, 1999.

[BS02] P. Baptist and M.J. Silva, "Mining Web Access Logs of an On-line Newspaper,"
Second International Conference on Adaptive Hypermedia and Adaptive Web Based
Systems, Workshop on Recommendation and Personalization in E-Commerce, Malaga,
Spain, May 2002.

[Bez81] J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms,
Plenum Press, New York, 1981.

[BH93] J.C. Bezdek and R.J. Hathaway, "Switching Regression Models and Fuzzy
Clustering," IEEE Transactions on Fuzzy Systems, Vol. 1, No. 3, p. 195-204, 1993.

[BP94] K. Bharat and J.E. Pitkow, "WebViz: A Tool for WWW Access Log Analysis,"
In Proceedings of the First International Conference on the World-Wide Web, 1994.

[BLMN99] S.S. Bhowmick, E.P. Lim, S. Madria and W-K. Ng, "Research Issues in Web
Data Mining," In Proceedings of the First International Conference on Data
Warehousing and Knowledge Discovery (DaWaK99), p. 303-312, 1999.

[Bla02a] P.E. Black, "Euclidean Distance," National Institute of Standards and
Technology (NIST), http://www.nist.gov/dads/HTML/euclidndstnc.html (October 2002).

[Bla02b] P.E. Black, "Manhattan Distance," National Institute of Standards and
Technology (NIST), http://www.nist.gov/dads/HTML/manhttndstnc.html (October
2002).

[Bla02c] P.E. Black, "Hamming Distance," National Institute of Standards and
Technology (NIST), http://www.nist.gov/dads/HTML/hammingdist.html (October 2002).










[BL99] J. Borges and M. Levene, "Data Mining of User Navigation Patterns," In
Proceedings of the Workshop on Web Usage Analysis and User Profiling
(WEBKDD'99), p. 31-36, San Diego, CA, August 15,1999.

[BF98] P.S. Bradley and U.M. Fayyad, "Refining Initial Points for K-means Clustering,"
In Proceedings of the Fifteenth International Conference on Machine Learning, p. 91-99,
Morgan Kaufmann, San Francisco, CA, 1998.

[BFR98] P.S. Bradley, U.M. Fayyad, and C.A. Reina, "Scaling Clustering Algorithms to
Large Databases," In Proceedings of the Fourth International Conference on Knowledge
Discovery and Data Mining, p. 9-15, New York, NY, August 27-31, 1998.

[CY00] W-L. Chang and S-T. Yuan, "A Synthesized Learning Approach for Web-Based
CRM," In Proceeding ofACM-SIGKDD Conference on Knowledge Discovery in
Databases (KDD'2000), p. 43-59, Boston, MA, August 20, 2000.

[CCMT97] M. Charikar, C. Chekuri, T. Feder and R. Motwani, "Incremental Clustering
and Dynamic Information Retrieval," In Proceedings of the Twenty-ninth Annual ACM
Symposium on Theory of Computing, p. 626-635, 1997.

[CSZ98] S. Chatterjee, G. Sheikholeslami and A. Zhang, "WaveCluster: A Multi-
Resolution Clustering Approach for Very Large Spatial Databases," In Proceedings of
the Twenty-fourth International Conference on Very Large Data Bases, p. 428-439,
August 1998.

[CGHK97] S. Chee, J. Chen, Q. Chen, S. Cheng, J. Chiang, W. Gong, J. Han, M.
Kamber, K. Koperski, G. Liu, Y. Lu, N. Stefanovic, L. Winstone, B. Xia, O. R. Zaiane,
S. Zhang and H. Zhu, "DBMiner: A System for Data Mining in Relational Databases and
Data Warehouses," In Proceedings CASCON'97: Meeting ofMinds, p. 249-260, Toronto,
Canada, November 1997.

[Coo00] R. Cooley, Web Usage Mining: Discovery and Application of Interesting
Patterns from Web Data, doctoral dissertation, Department of Computer Science,
University of Minnesota, May 2000.

[CDSTOO] R. Cooley, M. Deshpande, J. Srivastava and P-N. Tan, "Web Usage Mining:
Discovery and Applications of Usage Patterns from Web Data," SIGKDD Explorations,
Vol. 1, Issue 2, 2000.

[CMS97] R. Cooley, B. Mobasher and J. Srivastava, "Web Mining: Information and
Pattern Discovery on the World Wide Web," In Proceedings of the Ninth IEEE
International Conference on Tools with Artificial Intelligence (ICTAI'97), 1997.









[CMS99] R. Cooley, B. Mobasher and J. Srivastava, "Creating Adaptive Web sites
through Usage-based Clustering of URLs," In IEEE Knowledge and Data Engineering
Workshop (KDEX'99), November 1999.

[CS97] M.W. Craven and J.W. Shavlik, "Using Neural Networks for Data Mining,"
Future Generation Computer Systems, Vol. 13, p. 211-229, 1997.

[Dyr97] C. Dyreson, "Using an Incomplete Data Cube as a Summary Data Sieve,"
Bulletin of the IEEE Technical Committee on Data Engineering, p. 19-26, March 1997.

[EFL00] C. Elkan, F. Farnstrom and J. Lewis, "Scalability for Clustering Algorithms
Revisited," SIGKDD Explorations, Vol. 2, No. 1, p. 51-57, June 2000.

[EKSX96] M. Ester, H-P. Kriegel, J. Sander and X. Xu, "A Density-Based Algorithm for
Discovering Clusters in Large Spatial Databases with Noise," In Proceedings of the
Second International Conference on Knowledge Discovery and Data Mining (KDD'96),
Portland, Oregon, August 1996.

[Fas99] D. Fasulo, "An Analysis of Recent Work on Clustering Algorithms," Technical
report, University of Washington, 1999.

[For65] E. Forgy, "Cluster Analysis of Multivariate Data: Efficiency vs. Interpretability
of Classifications," Biometrics 21:768, 1965.

[FKN95] H. Frigui, R. Krishnapuram and O. Nasraoui, "Fuzzy and Possibilistic Shell
Clustering Algorithms and their Application to Boundary Detection and Surface
Approximation: Parts I and II," IEEE Transactions on Fuzzy Systems, Vol. 3, No. 1, p.
29-60, 1995.

[GRS98] S. Guha, R. Rastogi and K. Shim, "CURE: An Efficient and Scalable Subspace
Clustering for Very Large Databases," In Proceedings of the ACM SIGMOD International
Conference on Management of Data, p. 73-84, New York, NY, 1998.

[Han99] J. Han, "Data Mining," In J. Urban and P. Dasgupta (eds.), Encyclopedia of
Distributed Computing, Kluwer Academic Publishers, Boston, MA, 1999.

[HN94] J. Han and R. Ng, "Efficient and Effective Clustering Method for Spatial Data
Mining," In Proceedings of 1994 International Conference on Very Large Data Bases
(VLDB'94), p. 144-155, Santiago, Chile, September 1994.

[HHK02] W. Hardle, Z. Hlavka and S. Klinke, "XploRe Applications Guide," Quantlets,
http://www.quantlet.de/scripts/xag/htmlbook/xploreapplichtmlnode54.html (August
2002).









[JK96] J. Jean and H.K. Kim, "Concurrency Preserving Partitioning (CPP) for Parallel
Logic Simulation," In Proceedings of Tenth Workshop on Parallel
and Distributed Simulation (PADS'96), p. 98-105, May 1996.

[JK98] A. Joshi and R. Krishnapuram, "Robust Fuzzy Clustering Methods to Support
Web Mining," In S. Chaudhuri and U. Dayal, editors, In Proceedings ACMSIGMOD
Workshop on Research Issues in Data Mining and Knowledge Discovery, June 1998.

[KR90] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to
Cluster Analysis, John Wiley & Sons, Inc., 1990.

[KK93] R. Keller and R. Krishnapuram, "A Possibilistic Approach to Clustering," IEEE
Transactions on Fuzzy Systems, Vol. 1, No. 2, p. 98-110, 1993.

[Kol01] E. Kolatch, "Clustering Algorithms for Spatial Databases: A Survey," Dept. of
Computer Science, University of Maryland, College Park, 2001.

[LOPZ97] W. Li, M. Ogihara, S. Parthasarathy and M.J. Zaki, "New Algorithms for Fast
Discovery of Association Rules," In Proceedings of Third International Conference on
Knowledge Discovery and Data Mining (KDD), August 1997.

[LVV01] A. Likas, N. Vlassis and J.J. Verbeek, "The Global K-means Clustering
Algorithm," Technical report, Computer Science Institute, University of Amsterdam, The
Netherlands, February 2001. IAS-UVA-01-02.

[LW02] P. J. Lingras and C. Chad West, "Interval Set Clustering of Web Users with
Rough K-means," submitted to the IEEE computer for publication, 2002.

[LRZ96] M. Livny, R. Ramakrishnan and T. Zhang, "BIRCH: An Efficient Data
Clustering Method for Very Large Databases," In Proceedings of the Fifteenth ACM
SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems: PODS 1996.

[Mac67] J. MacQueen, "Some Methods for Classification and Analysis of Multivariate
Observations," In Proceedings of the Fifth Berkeley Symposium on Mathematical
Statistics and Probability, Vol. I, Statistics, L. M. LeCam and J. Neyman editors,
University of California Press, 1967.

[Mas02] H. Masum, "Clustering Algorithms," Active Interests,
http://www.carleton.ca/~hmasum/clustering.html (August 2002).

[MWY97] R. Muntz, W. Wang and J. Yang, "STING: A Statistical Information Grid
Approach to Spatial Data Mining," In Proceedings of the Twenty-third International
Conference on Very Large Databases, p. 186-195, Athens, Greece, August 1997.









[Myl02] P. Myllymaki, "Advantages of Bayesian Networks in Data Mining and
Knowledge Discovery," Complex Systems Computation Group, Helsinki Institute for
Information Technology, http://www.bayesit.com/docs/advantages.html (October 2002).

[Paw82] Z. Pawlak, "Rough Sets," International Journal of Information and Computer
Sciences, Vol. 11, p. 145-172, 1982.

[Rei99] T. Reiners, "Mahalanobis Distance," Distances, http://server3.winforms.phil.tu-
bs.de/~treiners/diplom/node31.html (October 2002).

[Sam90] H. Samet, The Design and Analysis of Spatial Data Structures, Addison
Wesley, Reading, MA, 1990.

[The02] K. Thearling, "Data Mining and Customer Relationship," Data Mining White
Papers, http://www.thearling.com/text/whexcerpt/whexcerpt.htm (October 2002).

[Wan00] Y. Wang, "Web Mining and Knowledge Discovery of Usage Patterns," CS
748T Project (Part I), http://db.uwaterloo.ca/~tozsu/courses/cs748t/surveys/wang.pdf
(February, 2000).















BIOGRAPHICAL SKETCH

Darryl M. Adderly, born September 2, 1976, to Renia L. Adderly and Kevin A.

Adderly in Miami, Florida, was raised as a military child up until age thirteen when his

mother, younger sister (Kadra T. Adderly), and he moved back to Miami where he earned

his high school diploma at Miami Northwestern Senior High in June 1994. He began his

college career in Tallahassee, Florida at Florida Agricultural & Mechanical University,

earning his Bachelor of Science in computer information systems (science option) with a

mathematics minor in May 1998. After spending one year working as a software

engineer in Raleigh, North Carolina, Darryl was accepted into the University of Florida's

computer and information science and engineering graduate program. With the

coursework requirements completed, he opted to return to the industry as a software

developer for another year. In the fall of 2002, he returned to Gainesville, Florida, to

complete and defend his thesis on Web data mining to receive his Master of Science

degree.

Darryl is an ambitious, hard-working, analytical, and astute individual with a thirst for

knowledge in all facets of life. He enjoys cardiovascular activities, weight lifting,

football, basketball, and golf (although still a novice!). Outdoor activities (such as

camping, white water rafting, and hiking) and traveling are at the top of his list of things

to do once he obtains his master's degree.