AUTOMATIC DETECTION OF WORDS NOT SIGNIFICANT TO TOPIC CLASSIFICATION IN LATENT DIRICHLET ALLOCATION

By

DEBARSHI ROY

A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE

UNIVERSITY OF FLORIDA

2011

© 2011 Debarshi Roy

To my Mom and Dad

ACKNOWLEDGMENTS

I would like to begin by thanking my supervisor, Arunava Banerjee, for introducing me to the field of document classification and topic modeling, for numerous helpful discussions, and for many constructive comments about this thesis itself. On a personal note, I would like to thank my parents, who have taught me much, but above all the value of hard work. I would also like to thank all my friends who made this journey that much easier.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF ABBREVIATIONS
ABSTRACT

CHAPTER
1 INTRODUCTION
2 DOCUMENT CLASSIFICATION
   Document Representation
   Term Frequency-Inverse Document Frequency (tf-idf)
   Latent Semantic Indexing
3 TOPIC MODELING
   Probabilistic Latent Semantic Indexing
   Latent Dirichlet Allocation
4 BEYOND LDA
   Problem
   Approach
   Experimental Process
5 CONCLUSION
   Summary
   Suggestions for the Future

APPENDIX
A LATENT DIRICHLET ALLOCATION
B MUTUAL INFORMATION AND ENTROPY
   Entropy
   Mutual Information
C TERM FREQUENCY-INVERSE DOCUMENT FREQUENCY

LIST OF REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

2-1 Example term-document matrix
4-1 List of words with the conditional entropies of topics given the word

LIST OF ABBREVIATIONS

LDA     Latent Dirichlet Allocation
LSI     Latent Semantic Indexing
pLSI    Probabilistic Latent Semantic Indexing
tf-idf  Term Frequency-Inverse Document Frequency

Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Master of Science

AUTOMATIC DETECTION OF WORDS NOT SIGNIFICANT TO TOPIC CLASSIFICATION IN LATENT DIRICHLET ALLOCATION

By Debarshi Roy

December 2011

Chair: Arunava Banerjee
Major: Computer Engineering

Topic models, like Latent Dirichlet Allocation (LDA), have recently been used to automatically generate topics for text corpora and to subdivide the corpus words among those topics. However, not all of the words are useful for classification; some do not correspond to genuine topics and instead act as noise, remaining prominent in most of the topics and degrading classification performance. Current approaches to topic modeling preprocess the data fed into the algorithm, for example by removing stop words. This work automates the process of finding such words, which are of no help in classification. The basic idea is to compute the conditional entropy of the topics given a word; words with high conditional topic entropy are considered irrelevant and discarded, so that a better result can be obtained in the next iteration.

CHAPTER 1
INTRODUCTION

Machine learning, a branch of artificial intelligence, is concerned with the development of algorithms that allow a machine to learn via inductive inference, based on observing data that represent incomplete information about a statistical phenomenon. Classification, also referred to as pattern recognition, is an important task in machine learning: the goal is to recognize patterns, to distinguish between exemplars based on their different patterns, and to make intelligent decisions. Machine learning technology is usable in a vast range of applications, and more uses are being found as time passes. It allows computer systems to improve in dynamic environments, where the input signals are unknown and the best decisions can only be learned from history. One such example is document classification. The task is to assign an electronic document to one or more categories, based on its contents.

CHAPTER 2
DOCUMENT CLASSIFICATION

The Internet is growing at an increasing rate, and it is obvious that it will be difficult to search for information in this gigantic digital library. An estimate of the size of the Internet from February 1999 indicates that there were about 800 million pages on the World Wide Web, on about 3 million servers. One way to sift through numerous documents is to use keyword search engines. However, keyword searches have limitations. One major drawback is that keyword searches do not discriminate by context.
In many languages, a word or phrase may have multiple meanings, so a search may return many matches that are not on the desired topic. For example, a query on the phrase "river bank" might return documents about the Hudson River Bank & Trust Company, because the word "bank" has two meanings. An alternative strategy is to have human beings sort through documents and classify them by content, but this is not feasible for very large volumes of documents.

Retrieval of text information is a difficult task. The problem may be that the information is misinterpreted because of natural-language ambiguities, or that the information need is imprecisely or vaguely defined by the user. This calls for improved automatic methods for searching and organizing text documents, so that information of interest can be accessed quickly and accurately. To realize this goal, several classification algorithms have been developed over the years.

Document Representation

In order to reduce the complexity of the documents and make them easier to handle, each document has to be transformed from its full-text version into a document vector that describes its contents. In most algorithms, documents are represented by a term-document matrix, where the rows correspond to the terms and the columns correspond to the documents. For example, for a small collection of three documents D1, D2, and D3, the term-document matrix might be:

Table 2-1. Example term-document matrix
Term      D1   D2   D3
I          1    1    1
like       1    0    2
hate       0    1    0
dogs       1    0    0
cats       0    1    0
florida    0    0    1

Term Frequency-Inverse Document Frequency (tf-idf)

This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document, but it is offset by the frequency of the word in the corpus. Suppose we have a set of English text documents and wish to determine which document is most relevant to the query "the brown cow." A simple way to start is to eliminate documents that do not contain all three words "the," "brown," and "cow," but this still leaves many documents. To further distinguish them, we might count the number of times each term occurs in each document and sum the counts; the number of times a term occurs in a document is called its term frequency. However, because the term "the" is so common, this will tend to incorrectly emphasize documents which happen to use the word "the" more frequently, without giving enough weight to the more meaningful terms "brown" and "cow." Moreover, the term "the" is not a good keyword for distinguishing relevant from non-relevant documents and terms. On the contrary, the words "brown" and "cow," which occur rarely, are good keywords for distinguishing relevant documents from non-relevant ones. Hence an inverse document frequency factor is incorporated, which diminishes the weight of terms that occur very frequently in the collection and increases the weight of terms that occur rarely. (See Appendix C for details.)
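To make the weighting just described concrete, the following short Python sketch (not part of the thesis; the toy documents and names are assumptions chosen purely for illustration) scores three documents against the query "the brown cow" with raw term counts and with tf-idf. Because "the" appears in every document, its idf is zero and it contributes nothing to the weighted score, while "brown" and "cow" dominate.

import math

# Toy corpus; these documents are made up for illustration only.
docs = {
    "d1": "the brown cow jumped over the brown fence".split(),
    "d2": "the the the cat sat on the mat".split(),
    "d3": "a brown dog chased the cow".split(),
}
query = ["the", "brown", "cow"]
N = len(docs)

def idf(term):
    # log(total documents / documents containing the term)
    df = sum(1 for words in docs.values() if term in words)
    return math.log(N / df) if df else 0.0

for name, words in docs.items():
    raw = sum(words.count(t) for t in query)                # summed term frequencies
    weighted = sum(words.count(t) * idf(t) for t in query)  # tf-idf score
    print(name, raw, round(weighted, 3))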
Latent Semantic Indexing

While the tf-idf reduction has some appealing features, notably its basic identification of sets of words that are discriminative for documents in the collection, the approach provides only a relatively small amount of reduction in description length and reveals little in the way of inter- or intra-document statistical structure. To address these shortcomings, IR researchers have proposed several other dimensionality reduction techniques, most notably latent semantic indexing (LSI) (Deerwester et al., 1990). LSI uses a singular value decomposition of the term-document matrix X to identify a linear subspace in the space of tf-idf features that captures most of the variance in the collection. This approach can achieve significant compression in large collections. Furthermore, Deerwester et al. argue that the derived features of LSI, which are linear combinations of the original tf-idf features, can capture some aspects of basic linguistic notions such as synonymy and polysemy.

CHAPTER 3
TOPIC MODELING

Probabilistic topic modeling is an emerging Bayesian approach to summarizing data, such as text, in terms of a small set of latent variables that correspond, ideally, to the underlying themes or topics. It is a statistical generative model that represents documents as a mixture of probabilistic topics and topics as a mixture of words.

Probabilistic Latent Semantic Indexing

A significant step forward in this regard was made by Hofmann (1999), who presented the probabilistic LSI (pLSI) model, also known as the aspect model, as an alternative to LSI. The pLSI approach models each word in a document as a sample from a mixture model, where the mixture components are multinomial random variables that can be viewed as representations of topics. Each word is thus generated from a single topic, and different words in a document may be generated from different topics. Each document is represented as a list of mixing proportions for these mixture components and is thereby reduced to a probability distribution over a fixed set of topics.

Latent Dirichlet Allocation

Among the variety of topic models proposed, Latent Dirichlet Allocation (LDA) is a truly generative model that is capable of generalizing the topic distributions, so that it can be used to generate unseen documents as well. The completeness of the generative process for documents is achieved by placing Dirichlet priors on the document distributions over topics and on the topic distributions over words.

In LDA, each document may be viewed as a mixture of various topics. This is similar to probabilistic latent semantic analysis (pLSA), except that in LDA the topic distribution is assumed to have a Dirichlet prior. In practice, this results in more reasonable mixtures of topics in a document. It has been noted, however, that the pLSA model is equivalent to the LDA model under a uniform Dirichlet prior distribution.

For example, an LDA model might have topics that can be classified as CAT and DOG. However, the naming is arbitrary, because the topic that encompasses these words cannot itself be named by the model. A topic has probabilities of generating various words, such as "milk," "meow," and "kitten," which can be classified and interpreted by the viewer as "CAT"; naturally, "cat" itself will have high probability given this topic. The DOG topic likewise has probabilities of generating each word: "puppy," "bark," and "bone" might have high probability. A document is then assumed to be characterized by a particular mixture of these topics. This is the standard bag-of-words model assumption, and it makes the individual words exchangeable.
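As a concrete illustration of fitting such a model, the sketch below uses scikit-learn's CountVectorizer and LatentDirichletAllocation (assuming a recent scikit-learn is available); the toy corpus, the number of topics, and the random seed are assumptions for illustration and are unrelated to the experiments reported later in this thesis.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["milk meow kitten cat", "puppy bark bone dog",
        "cat kitten milk", "dog puppy bone"]           # toy documents
vec = CountVectorizer()
counts = vec.fit_transform(docs)                       # term-document counts
vocab = vec.get_feature_names_out()

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
for k, topic in enumerate(lda.components_):            # each row is proportional to p(w|z)
    top = topic.argsort()[::-1][:3]                    # indices of the top-probability words
    print("topic", k, [vocab[i] for i in top])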
CHAPTER 4
BEYOND LDA

The inference step of Latent Dirichlet Allocation (LDA) produces a set of results from which we can obtain the probability of a word w given a certain topic z, p(w|z). These values are especially useful for understanding the inherent correlation of the different words. Observing the resulting values, we can decipher the innate mixture of topics hidden in the documents. The different topics that make up the corpus can be clearly understood from this. We characterize the topics by taking the words with the top probabilities in each of the topics; that is, for a particular z, we take the words with the top p(w|z) values. This leads to an understanding of the topics in a more human-friendly manner. From another set of outputs, we get the weights with which the different topics contribute to generating the different documents. From these, we get the top topics from which a document is generated, and thus we can easily correlate the different documents.

Problem

While analyzing the results, we find some words that do not contribute to topic differentiation. These words have high probabilities for many of the topics. Looking closer at these words, we found that they are words which are not topic specific. They could easily have been removed as stop words before the process was started. This would have given a better result in LDA, as it would not have had to consider these irrelevant words. It would have helped rank up the words that truly contribute to building up a topic and would have given the topics understandable meanings.

Approach

To characterize these words mathematically, we used the concept of conditional entropy. We consider the conditional entropy of the topics given a specific word, H(z|w) = -Σ_z p(z|w) log p(z|w). In the extreme case, the topics have equal probabilities given the word, and the conditional entropy attains its maximum value log k (where k is the total number of topics). The conditional entropy of the topics was computed for every word. This gave us a good basis for finding the unimportant words. Intuitively, it measures how certain we are of the topic once the word is known. When we examined the words with high conditional entropies, we saw that, from common knowledge, they were indeed not useful. The main logic behind this method is that if the topics have high conditional entropy given a word, then more information is needed to determine which topic the word belongs to, and thus the word is not suitable for classifying documents into different topics. The conditional entropy of topics given a word, H(z|w), can be seen as the amount of uncertainty in the topics, z, remaining after the word, w, is known.

Experimental Process

We ran the LDA inference process on the first 1000 documents of the Reuters-21578 Text Categorization Collection Data Set from the UCI Machine Learning Repository. We preprocessed the input with tf-idf and ultimately had 6932 unique words as input to LDA. The number of topics was set to 32. The LDA process gave us an output file containing the log p(w|z) values of all the words for each topic. We obtain the p(z|w) values by normalizing the p(w|z) values for each word; the conditional entropy of the topics given each word was then calculated. Here we assume p(z) to be uniform for simplicity.

In the first iteration, the maximum entropy value was 3.71. The words with the top conditional entropy values were:

Table 4-1. List of words with the conditional entropies of topics given the word
Word         Conditional Entropy
unit         3.71
co           3.52
products     3.49
department   3.47
terms        3.46
corp         3.32
assets       3.3
acquired     3.28
operation    3.27
completed    3.24
based        3.2

As can be seen, common knowledge tells us that these words are not useful in topic determination, and they could easily have been added to a stop-word list and removed.
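The entropy computation just described can be sketched in a few lines of Python. This is an illustration only, not the code used in the experiments: it assumes a topic-word matrix whose rows are (proportional to) p(w|z), a vocabulary list aligned with its columns, and an illustrative threshold; following the text, p(z) is taken to be uniform, so p(z|w) is obtained by normalizing the p(w|z) values of each word over the topics.

import numpy as np

def topic_entropy_per_word(topic_word):
    # topic_word: (K, V) array, row z proportional to p(w | z)
    p_z_given_w = topic_word / topic_word.sum(axis=0, keepdims=True)  # p(z | w) under uniform p(z)
    p_z_given_w = np.clip(p_z_given_w, 1e-12, None)                   # avoid log(0)
    return -(p_z_given_w * np.log2(p_z_given_w)).sum(axis=0)          # H(z | w) per word, in bits

def words_to_discard(topic_word, vocab, threshold):
    h = topic_entropy_per_word(topic_word)
    order = np.argsort(-h)                                            # highest entropy first
    return [(vocab[i], float(h[i])) for i in order if h[i] > threshold]

# In the iterative procedure described below, the returned words are removed from
# the vocabulary, tf-idf is recomputed, and LDA is run again on the reduced input.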
In the next iteration, these words were removed, the tf-idf was calculated again, and the whole process was repeated. This was done with the threshold on the conditional entropy being slowly decreased to 2.5, and it was seen that the maximum conditional entropy of the topics given a word was also going down. We found the same results on different sets of documents from the Reuters-21578 Text Categorization Collection Data Set.

CHAPTER 5
CONCLUSION

Summary

In this thesis, a new method to automatically find insignificant words in topic modeling has been discussed, which leads to an improvement in the resultant topics that are generated. Using the conditional entropy of topics given a word has given very good results, and the same result has been observed over different sets of data.

Suggestions for the Future

To the best of my knowledge, this approach of automatically pruning insignificant words has not been attempted before. We have used the conditional entropy of topics given words in this thesis. Other objective functions could be explored, which might give better results. Also, rigorous mathematical reasoning is required to prove the convergence of the iterations (i.e., that the maximum value of the conditional entropy is decreasing). We leave this thesis as an experimental observation so that future work can build on this novel idea.

APPENDIX A
LATENT DIRICHLET ALLOCATION

Latent Dirichlet allocation (LDA) is a generative probabilistic model for collections of discrete data such as text corpora. It is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document.

Latent Dirichlet allocation is a generative probabilistic model of a corpus. The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words. LDA assumes the following generative process for each document w in a corpus D:

1. Choose N ~ Poisson(ξ).
2. Choose θ ~ Dir(α).
3. For each of the N words w_n:
   (a) Choose a topic z_n ~ Multinomial(θ).
   (b) Choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n.

Several simplifying assumptions are made in this basic model. First, the dimensionality k of the Dirichlet distribution (and thus the dimensionality of the topic variable z) is assumed known and fixed. Second, the word probabilities are parameterized by a k x V matrix β, where β_ij = p(w^j = 1 | z^i = 1), which for now we treat as a fixed quantity that is to be estimated. Finally, the Poisson assumption is not critical to anything that follows, and more realistic document length distributions can be used as needed. Furthermore, note that N is independent of all the other data-generating variables (θ and z). It is thus an ancillary variable, and we will generally ignore its randomness in the subsequent development.
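The generative process above can be mimicked numerically in a few lines. The sketch below is illustrative only: the vocabulary, the topic-word probabilities β, the Dirichlet parameter α, and the Poisson rate are made-up toy values, not quantities taken from this thesis.

import numpy as np

rng = np.random.default_rng(0)
vocab = ["milk", "meow", "kitten", "puppy", "bark", "bone"]    # toy vocabulary
k = 2
alpha = np.full(k, 0.5)                            # Dirichlet parameter over the k topics
beta = np.array([[0.4, 0.3, 0.3, 0.0, 0.0, 0.0],   # p(w | z = 0), a "CAT"-like topic
                 [0.0, 0.0, 0.0, 0.4, 0.3, 0.3]])  # p(w | z = 1), a "DOG"-like topic

N = rng.poisson(8)                                 # 1. document length N ~ Poisson(xi)
theta = rng.dirichlet(alpha)                       # 2. topic mixture theta ~ Dir(alpha)
words = []
for _ in range(N):                                 # 3. for each of the N word positions
    z = rng.choice(k, p=theta)                     #    (a) topic z_n ~ Multinomial(theta)
    words.append(str(rng.choice(vocab, p=beta[z])))  #  (b) word w_n ~ p(w | z_n, beta)
print(theta, words)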
A k-dimensional Dirichlet random variable θ can take values in the (k-1)-simplex (a k-vector θ lies in the (k-1)-simplex if θ_i ≥ 0 and θ_1 + ... + θ_k = 1), and it has the following probability density on this simplex:

p(θ | α) = [ Γ(α_1 + ... + α_k) / ( Γ(α_1) ... Γ(α_k) ) ] θ_1^(α_1 - 1) ... θ_k^(α_k - 1),

where the parameter α is a k-vector with components α_i > 0, and where Γ(x) is the Gamma function. The Dirichlet is a convenient distribution on the simplex: it is in the exponential family, has finite-dimensional sufficient statistics, and is conjugate to the multinomial distribution.

Given the parameters α and β, the joint distribution of a topic mixture θ, a set of N topics z, and a set of N words w is given by

p(θ, z, w | α, β) = p(θ | α) Π_{n=1..N} p(z_n | θ) p(w_n | z_n, β),

where p(z_n | θ) is simply θ_i for the unique i such that z_n^i = 1. Integrating over θ and summing over z, we obtain the marginal distribution of a document:

p(w | α, β) = ∫ p(θ | α) ( Π_{n=1..N} Σ_{z_n} p(z_n | θ) p(w_n | z_n, β) ) dθ.

Finally, taking the product of the marginal probabilities of single documents, we obtain the probability of a corpus:

p(D | α, β) = Π_{d=1..M} ∫ p(θ_d | α) ( Π_{n=1..N_d} Σ_{z_dn} p(z_dn | θ_d) p(w_dn | z_dn, β) ) dθ_d.

The parameters α and β are corpus-level parameters, assumed to be sampled once in the process of generating a corpus. The variables θ_d are document-level variables, sampled once per document. Finally, the variables z_dn and w_dn are word-level variables and are sampled once for each word in each document.

APPENDIX B
MUTUAL INFORMATION AND ENTROPY

Entropy

In information theory, entropy is a measure of the uncertainty associated with a random variable. In this context, the term usually refers to the Shannon entropy, which quantifies the expected value of the information contained in a message, usually in units such as bits. In this context, a "message" means a specific realization of the random variable. Equivalently, the Shannon entropy is a measure of the average information content one is missing when one does not know the value of the random variable. The concept was introduced by Claude E. Shannon in his 1948 paper "A Mathematical Theory of Communication."

Named after Boltzmann's H-theorem, Shannon denoted the entropy H of a discrete random variable X with possible values {x_1, ..., x_n} as

H(X) = E(I(X)).

Here E is the expected value and I is the information content of X; I(X) is itself a random variable. If p denotes the probability mass function of X, then the entropy can explicitly be written as

H(X) = Σ_{i=1..n} p(x_i) I(x_i) = -Σ_{i=1..n} p(x_i) log_b p(x_i),

where b is the base of the logarithm used. Common values of b are 2, Euler's number e, and 10, and the unit of entropy is the bit for b = 2, the nat for b = e, and the dit (or digit) for b = 10. In the case of p_i = 0 for some i, the value of the corresponding summand 0 log_b 0 is taken to be 0, which is consistent with the limit of p log_b p as p goes to 0 from above.
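As a small numerical check of the definition above, the following sketch (the example distribution is made up) computes the entropy of a discrete distribution in bits and in nats.

import math

def shannon_entropy(p, base=2):
    # Entropy of a discrete distribution p, with 0 * log(0) treated as 0.
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

p = [0.5, 0.25, 0.25]
print(shannon_entropy(p))            # 1.5 bits
print(shannon_entropy(p, math.e))    # about 1.04 nats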
Mutual Information

Mutual information is one of many quantities that measure how much one random variable tells us about another. It is a dimensionless quantity with (generally) units of bits, and it can be thought of as the reduction in uncertainty about one random variable given knowledge of another. High mutual information indicates a large reduction in uncertainty; low mutual information indicates a small reduction; and zero mutual information between two random variables means the variables are independent.

For two discrete variables X and Y whose joint probability distribution is P_XY(x, y), the mutual information between them, denoted I(X;Y), is given by (Shannon and Weaver, 1949; Cover and Thomas, 1991)

I(X;Y) = Σ_{x,y} P_XY(x, y) log [ P_XY(x, y) / ( P_X(x) P_Y(y) ) ] = E_{P_XY} log [ P_XY / (P_X P_Y) ].

Here P_X(x) = Σ_y P_XY(x, y) and P_Y(y) = Σ_x P_XY(x, y) are the marginals, and E_P is the expected value over the distribution P. The focus here is on discrete variables, but most results derived for discrete variables extend very naturally to continuous ones: one simply replaces sums by integrals. One should be aware, though, that the formal replacement of sums by integrals hides a great deal of subtlety and, for distributions that are not sufficiently smooth, may not even work. See (Gray, 1990) for details. The units of information depend on the base of the logarithm. If base 2 is used (the most common choice, and the one used here), information is measured in bits.

Intuitively, mutual information measures the information that X and Y share: it measures how much knowing one of these variables reduces uncertainty about the other. For example, if X and Y are independent, then knowing X does not give any information about Y and vice versa, so their mutual information is zero. At the other extreme, if X and Y are identical, then all information conveyed by X is shared with Y: knowing X determines the value of Y and vice versa. As a result, in the case of identity the mutual information is the same as the uncertainty contained in Y (or X) alone, namely the entropy of Y (or X: clearly if X and Y are identical they have equal entropy). Mutual information can be expressed in terms of entropies as

I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X,Y) = H(X,Y) - H(X|Y) - H(Y|X),

where H(X) and H(Y) are the marginal entropies, H(X|Y) and H(Y|X) are the conditional entropies, and H(X,Y) is the joint entropy of X and Y.

APPENDIX C
TERM FREQUENCY-INVERSE DOCUMENT FREQUENCY

In information retrieval and text mining, term frequency-inverse document frequency, also called tf-idf, is a well-known method for evaluating how important a word is in a document. tf-idf is also a very useful way to convert the textual representation of information into numerical vectors, as in the vector space model (VSM).

The term count in a given document is simply the number of times a given term appears in that document. Thus we have the term frequency tf(t, d), defined in the simplest case as the occurrence count of a term t in a document d. The inverse document frequency is a measure of the general importance of the term, obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient:

idf(t) = log ( |D| / |{d : t ∈ d}| ),

where |D| is the cardinality of D, that is, the total number of documents in the corpus, and |{d : t ∈ d}| is the number of documents in which the term t appears (i.e., tf(t, d) ≠ 0). If the term is not in the corpus, this will lead to a division by zero. It is therefore common to adjust the denominator to 1 + |{d : t ∈ d}|. Then

tf-idf(t, d) = tf(t, d) x idf(t).

A high tf-idf weight is reached by a high term frequency (in the given document) and a low document frequency of the term in the whole collection of documents; the weights hence tend to filter out common terms. The tf-idf value for a term will be greater than zero if and only if the ratio inside the idf's log function is greater than 1. Depending on whether a 1 is added to the denominator, a term that appears in all documents will have either a zero or a negative idf, and if the 1 is added to the denominator, a term that occurs in all but one document will have an idf equal to zero.
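The formulas above can be checked with a short sketch; the two-document toy corpus and the function names are assumptions for illustration. It shows the zero weight obtained by a term present in every document and the negative idf produced by the adjusted denominator, as discussed above.

import math

docs = [["this", "is", "a", "sample"],
        ["this", "is", "another", "example", "example"]]   # toy corpus

def tf(term, doc):
    return doc.count(term)                                 # raw occurrence count

def idf(term, adjusted=False):
    df = sum(1 for d in docs if term in d)                 # documents containing the term
    denom = 1 + df if adjusted else df                     # optional adjustment for df = 0
    return math.log(len(docs) / denom)

def tf_idf(term, doc, adjusted=False):
    return tf(term, doc) * idf(term, adjusted)

print(tf_idf("example", docs[1]))       # rare term: positive weight (about 1.386)
print(tf_idf("this", docs[0]))          # term in every document: weight 0
print(idf("this", adjusted=True))       # negative idf with the adjusted denominator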
LIST OF REFERENCES

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research 3, 2003.

Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, et al. Improving Information Retrieval with Latent Semantic Indexing. Proceedings of the 51st Annual Meeting of the American Society for Information Science 25, 1988.

Mark Girolami and Ata Kaban. On an Equivalence between PLSI and LDA. Proceedings of SIGIR 2003.

Thomas Hofmann. Probabilistic Latent Semantic Indexing. Proceedings of the Twenty-Second Annual International SIGIR Conference, 1999.

Claude E. Shannon. A Mathematical Theory of Communication. Bell System Technical Journal, vol. 27, pp. 379-423, 623-656, 1948.

BIOGRAPHICAL SKETCH

Debarshi Roy did his undergraduate study at Jadavpur University, India, where he earned his Bachelor of Engineering from the Department of Computer Science and Engineering. He then worked for two years with Adobe Systems India as a software developer. After that, he joined the University of Florida and earned his Master of Science in Computer Engineering from the Computer and Information Science Department.