
Citation 
 Permanent Link:
 http://ufdc.ufl.edu/AA00058575/00001
Material Information
 Title:
 Multilevel automatic classification for sequential reference retrieval
 Creator:
 Osteen, Robert Ernest, 1936
 Publication Date:
 1972
 Language:
 English
 Physical Description:
 xi, 246 leaves. : illus. ; 28 cm.
Subjects
 Subjects / Keywords:
 Automatic indexing ( lcsh )
Dissertations, Academic -- Electrical Engineering -- UF
Electrical Engineering thesis, Ph. D.
Information storage and retrieval systems ( lcsh )
 Genre:
 bibliography ( marcgt )
nonfiction ( marcgt )
Notes
 Thesis:
Thesis -- University of Florida.
 Bibliography:
Bibliography: leaves 242-244.
 General Note:
 Typescript.
 General Note:
 Vita.
Record Information
 Source Institution:
 University of Florida
 Holding Location:
 University of Florida
 Rights Management:
The University of Florida George A. Smathers Libraries respect the intellectual property rights of others and do not claim any copyright interest in this item. This item may be protected by copyright but is made available here under a claim of fair use (17 U.S.C. §107) for nonprofit research and educational purposes. Users of this work have responsibility for determining copyright status prior to reusing, publishing or reproducing this item for purposes other than what is allowed by fair use or other copyright exemptions. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder. The Smathers Libraries would like to learn more about this item and invite individuals or organizations to contact the RDS coordinator (ufdissertations@uflib.ufl.edu) with any additional information they can provide.
 Resource Identifier:
 022922298 ( ALEPH )
14198893 ( OCLC )

MULTILEVEL AUTOMATIC CLASSIFICATION FOR SEQUENTIAL REFERENCE RETRIEVAL
By
ROBERT ERNEST OSTEEN
A DISSERTATION PRESENTED TO THE GRADUATE COUNCIL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA 1972
To Susan Spencer
ACKNOWLEDGMENTS
This work was supported by the Army Research Office-Durham under Grant Number DA-ARO-D-31-124-70-G92 and by the National Science Foundation under Grant Number GK-2786. The author wishes to express his thanks to the Center for Informatics Research of the University of Florida for providing financial assistance, as well as the necessary research facilities.
I am indebted to the members of my Supervisory
Committee for their guidance. I am particularly grateful to Dr. J. T. Tou, Director of the Center for Informatics Research, and to Dr. A. R. Bednarek, Chairman of the Department of Mathematics, for their counsel and assistance.
Mrs. Betty Taylor, the Director of the University of
Florida Law Library, made available to me, in punch cards, the bibliographic data and the subject indexes of papers published in legal journals. I am thankful for this kindness, which facilitated the experimental phase of this work.
I am indebted to my good friend and colleague, James
Hollan, for his critical reading of the first draft of this work.
Finally, I am particularly grateful to my wife, Darcy Meeker, for her excellent job of typing this dissertation.
TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER
1. INTRODUCTION
   1.1. Summary of the Remaining Chapters
2. CLASSIFICATION AND REFERENCE RETRIEVAL
   2.1. Introduction
   2.2. Methods of Automatic Classification
      2.2.1. Bayesian Classification
      2.2.2. Factor Analysis
      2.2.3. Clumps
   2.3. FERRET, a Feedback Reference Retrieval System
   2.4. Concluding Remarks
3. GRAPH THEORETICAL COVER GENERATION
   3.1. Introduction
   3.2. Basic Definitions
   3.3. Clusters in Graphs
   3.4. Concluding Remarks
4. CLIQUE DETECTION ALGORITHMS
   4.1. Introduction
   4.2. Review of Selected Algorithms
      4.2.1. Point Removal Definitions
      4.2.2. Point Removal Theorems
      4.2.3. A Point Removal Algorithm
   4.3. The Neighborhood Approach to Clique Detection
      4.3.1. Special Definitions
      4.3.2. Neighborhood Clique Detection Theorems
      4.3.3. The Neighborhood Clique Detection Algorithm
   4.4. The Line Removal Approach to Clique Detection
      4.4.1. Line Removal Definitions
      4.4.2. Line Removal Theorems
      4.4.3. The Line Removal Clique Detection Algorithm
   4.5. Algorithm Timing Experiments
   4.6. Conclusions
5. AUTOMATIC CLASSIFICATION DERIVATION
   5.1. Introduction
   5.2. Cover Evaluation by Typicality
      5.2.1. A Metric for the Class of Collections of Nonempty Subsets of a Finite Set
   5.3. Evaluation by Cluster Homogeneity and Cost Considerations
      5.3.1. Cluster Homogeneity
      5.3.2. An Idealized Cost Function
      5.3.3. An Evaluation Function
   5.4. The Classification Derivation Algorithm
6. THE SEQUENTIAL SEARCH TREE
   6.1. Introduction
   6.2. Class Representation Transformation
   6.3. Updating the Search Tree
   6.4. Search and Retrieval
7. EXPERIMENTAL INQUIRY AND CONCLUSIONS
   7.1. Introduction
   7.2. The Experimental Document Set
   7.3. The Classification Derivation
   7.4. Basic Searches
   7.5. Conclusions

APPENDICES
A. PROOFS OF THE POINT REMOVAL CLIQUE DETECTION THEOREMS
B. PROOFS OF THE NEIGHBORHOOD CLIQUE DETECTION THEOREMS
C. PROOFS OF THE LINE REMOVAL CLIQUE DETECTION THEOREMS
D. A METRIC ON THE CLASS OF COLLECTIONS OF NONEMPTY SUBSETS OF A FINITE SET
E. AN IDEALIZED CLASSIFICATION TREE

LIST OF REFERENCES
BIOGRAPHICAL SKETCH
LIST OF TABLES

Table
3.1 Adjacencies in the Generalized Clique Graphs
4.1 Algorithm 4.1 Applied to the Graph of Figure 4.2
4.2 Algorithm 4.3 Applied to the Graph of Figure 4.2
4.3 Timing Comparison of Algorithms Based on Theorems 4.4 and 4.5
4.4 Timing Comparison of Algorithms 4.1, 4.2, and 4.3
4.5 Execution Times of Algorithms 4.2 and 4.3 on the Graphs of Figure 4.5
5.1 Metrics D and D, Applied to the Covers of Figure 5.3
5.2 Illustration of the Cost Function
6.1 A Document-Term Matrix for Figure 6.1
7.1 Serial and Basic Search Responses
7.2 Performance Figures for the Sample Query
7.3 Serial and Basic Search Time Averages
7.4 Recall-Precision Summary
LIST OF FIGURES

Figure
2.1 FERRET Initialization
2.2 FERRET Updating and Retrieval
3.1 The Clique and Component Clusterings of a Graph
3.2 The Lattice of Clusterings of a Graph
3.3 The Lattice of Nonarbitrary Clusterings of a Graph
3.4 The k-Partitions of Jardine and Sibson
3.5 Type-1 k-Clique Graphs of a Graph
3.6 A Graph Whose Type-1 2-Clustering Differs from the Jardine-Sibson 2-Partition
3.7 A Graph in Which One Type-1 2-Cluster Contains Another
3.8 Type-1 and Type-2 k-Clusterings of a Graph
3.9 A Graph for the Illustration of the Differences among the Type-1, Type-2, and Type-3 Clusterings
3.10 The Three Types of Intermediate Generalized Clique Graphs of the Graph of Figure 3.9
3.11 The Type-1 Intermediate Clusterings of the Graph of Figure 3.9
3.12 The Type-2 Intermediate Clusterings of the Graph of Figure 3.9
3.13 The Type-3 Intermediate Clusterings of the Graph of Figure 3.9
3.14 Numbers of Clusters of the Graph of Figure 3.9 versus the Parameter k
3.15 An Unsatisfactory Cover Belonging to Γ(G)
4.1 Counterexamples to L = B ∪ C ∪ D
4.2 A Graph for Clique Detection Algorithm Illustration
4.3 A Counterexample to L = N ∪ Q
4.4 Counterexamples to L* = L0 ∪ L1 ∪ L2 ∪ L3
4.5 Two Graphs, G = (V, E) and G' = (V, E'), with E ⊂ E'
5.1 Two Efficient Covers Which Induce the Same Graph
5.2 A Clustering Which Is Not a Classification
5.3 Four Covers of a Set of Eight Points
5.4 An Interval Weighting Function
6.1 A Small Classification Tree and the Corresponding Search Tree
7.1 Summary of the (C=0.4)-Classification
7.2 Completion of the Summary of the (C=0.2)-Classification
Abstract of Dissertation Presented to the Graduate Council
of the University of Florida in Partial Fulfillment
of the Requirements for the Degree of Doctor of Philosophy
MULTILEVEL AUTOMATIC CLASSIFICATION
FOR SEQUENTIAL REFERENCE RETRIEVAL
By
Robert Ernest Osteen
August, 1972
Chairman: Dr. Julius T. Tou
Major Department: Electrical Engineering
The primary concern of this work is the automatic
derivation of a multilevel nonhierarchical classification of a set of documents, given the logical or numerical subject indexes of the documents. The utilization of such a classification by mechanized search procedures is also treated.
The classification is derived from a quantitative measure of the document-document similarities, based on indexes of the pairs of documents. Application of a threshold to the document-document similarities transforms the document set into a graph. The graph is subjected to a cluster analysis which typically provides several distinct clusterings, that is, covers consisting of clusters of points of the graph. A real-valued evaluation function provides the means to select the best of the clusterings. (The evaluation function depends on the homogeneities of the
members of a clustering, and takes into account certain cost considerations of the implementation machinery available for a mechanized search system.) The clusters of the selected clustering constitute the immediate subclasses of the document set. This process is repeated on unsubclassified document subsets until no unsubclassified subset is large enough to warrant analysis into subclasses.
A thorough analysis of the concept of clusters in graphs culminates in a specific definition of the clusterings of a graph having the following properties.
A clustering of a graph is an efficient cover of the set of points of the graph (no member of a clustering is a subset of any other member). Each member of a clustering is a union of cliques (maximal complete subgraphs) of the graph. Each clustering refines the collection of the connected components of the graph and is refined by the collection of the cliques of the graph, both of which qualify as clusterings. Under the refinement relation the clusterings form a chain or tower from the collection of cliques of the graph to the collection of components of the graph.
It is contended that any other efficient cover contains an element of arbitrariness in its formulation; that is, it requires a violation of one of the criteria underlying the definition of clusterings: a clustering is defined in terms of the adjacencies of the points and the identities of the cliques (implicit in the adjacencies), using only this information and evading none of it; and all members of a
particular clustering are formed by the application of exactly the same rule.
Because the cliques of a graph are fundamental to the definition of the clusterings of the graph, the problem of the identification of cliques is treated in detail. Two new clique detection algorithms are presented. One of these is intended for use in special circumstances in which it has an efficiency advantage over other known clique detection algorithms: besides the set of lines of the graph, there is given the set of cliques of a specified subgraph on the same point set. The other new clique detection algorithm is applicable to the general clique detection problem: it identifies the cliques of a graph, given only the points and lines of the graph. Timing experiments indicate that this algorithm is substantially faster than those previously available.
The cover selection technique is combined with the
graph theoretical cover generation scheme into an algorithm for multilevel classification derivation.
The resulting classification is retained in the form of a sequential search tree. Search procedures, designed to utilize the sequential search tree for more efficient searching and for interactive searching, are presented.
Finally, the results of a preliminary experimental
investigation of the classification derivation and search utilization techniques are reported.
CHAPTER 1
INTRODUCTION
This study is concerned with the machine derivation and the search utilization of a multilevel classification of a document corpus by a mechanized document or reference retrieval system.
A reference retrieval system [1, 2] consists of a set of document references or names, document surrogates or representations for the members of the document set, and search procedures, i.e., mechanisms for the production of responses to queries. A response is, basically, a set of document references. A query is a request for a response expressed in a form which is consonant with that of the document representations and for which a search procedure exists. The document representations indicate, in some specific form, the subject matters of the respective documents.
A document retrieval system differs from a reference retrieval system in that a response is a set of documents, rather than a set of document references; consequently, there is the additional issue of the physical storage and retrieval of documents. Viewing a document retrieval system as an extension of a reference retrieval system, the document search and retrieval constitutes an additional step,
following reference retrieval. In a conventional library, for example, a searcher performs reference retrieval by means of the card catalog; a retrieved reference includes a decimal code by means of which the corresponding document can be physically located.
Although useful search strategies exist which are based upon author, publication date, bibliographic citations, etc., this work is concerned only with subject searching; that is, in all that follows, queries, document representations, and search strategies are concerned with the subject matters of documents.
In a document or reference retrieval system, a document representation is the result of a process of content analysis. Whether performed intellectually or mechanically, such a process necessarily treats the individual words of a document as the primary source of information concerning the subject matter of the document.
The content analysis may also make use of syntax, as exemplified by the Syntol diagrammatic document surrogates [3]. The Syntol diagram for a document is a digraph (directed graph) with labeled directed lines. The points of the digraph are Syntol words, chosen to reflect the subject matter of the given document, each representing a state, an action, or an entity. The directed lines reflect relationships among the Syntol words, and the labels of the lines specify the type of relationship: coordinative, consecutive, associative, or predicative. A request is
similarly analyzed to produce a query, which is a Syntol diagrammatic representation of the request. The search procedure then consists in matching the query against the representations of the documents. The response consists of those documents which match the query. An example of a matching function is as follows: if there is a directed line labeled type X from Syntol word A to word B in the
query, then there is a directed path in the document representation from word A to word B, each directed line of which is of relational type X.
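The matching function just described amounts to a reachability test restricted to lines of one relational type. As a modern sketch (Python is used purely for illustration, and the triple format for labeled lines is an assumption of this sketch, not part of Syntol), one query line can be checked by a breadth-first search:

```python
from collections import deque

def matches(doc_edges, a, b, rel_type):
    """Check one query line (a -[rel_type]-> b): is there a directed
    path from a to b in the document digraph, each line of which
    carries the required relational type?
    doc_edges: a set of (tail, head, label) triples (hypothetical format)."""
    # Keep only the lines of the required relational type.
    adj = {}
    for tail, head, label in doc_edges:
        if label == rel_type:
            adj.setdefault(tail, []).append(head)
    # Breadth-first search from a toward b.
    seen, queue = {a}, deque([a])
    while queue:
        node = queue.popleft()
        if node == b:
            return True
        for nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False
```

A full Syntol match would apply this test once per directed line of the query diagram.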
More commonly, and somewhat less elaborately, content
analysis procedures ignore syntax, which is to say, produce a document representation based only on the content-bearing words occurring in the text of the document. Such a process, which is known as keyword indexing, is clearly amenable to mechanization, assuming that the documents are available in a machine-readable medium. The simplest product of such a procedure is the representation of a document by the list of keywords occurring in the document, in which case the indexing operation is termed logical indexing. In numerical indexing, a numerical value is associated with each keyword of the document.
This numerical value might be the occurrence frequency of the word in the document, or some normalization thereof, as discussed by Lancaster [ 4 ]; in probabilistic indexing [ 5 ] the value is an estimate of the probability that the document is relevant to the information needs of a user whose query consists of the single keyword.
A descriptor is a natural language expression denoting a subject area, e.g., "Life insurance" and "Data processing systems." Unlike keywords, descriptors are not restricted to single words. Moreover, the application of a descriptor to a document might not require the occurrence of the descriptor in the document. Consequently, descriptor indexing is generally performed by intellectual content analysis. Just as with keywords, however, the product of the content analysis may take the form of logical indexing or numerical indexing. Consequently, whenever the details of the content analysis and the indexing language are not of primary concern, the phrase "index terms" (or, more briefly, "terms") is used to refer to keywords, descriptors, and elements of any similar indexing vocabulary.
The type of data base which this work assumes to be given consists of a set of terms and a set of documents, logically or numerically indexed with respect to the term set.
The main result of this study is the development of a technique for the automatic organization of such a document set in the form of a multilevel classification. To supplement the presentation of the classification derivation method, a description is given of a mechanized reference retrieval system which utilizes such a classification for the purposes of efficient searching and user-system interactive searching.
The multilevel classification is derived from a
quantitative measure of the document-document similarities, which measure is a suitable function of the indexes of the pairs of documents. A set of documents on which a similarity is defined is transformed into a graph by the application of a threshold to the document-document similarities. A new graph theoretical cluster analysis technique is applied to the resulting graph to identify the subclasses of the given set of documents. This process is repeated on the resulting subclasses until no unsubclassified set is large enough to warrant further classification.
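The threshold step described above can be sketched in a few lines (Python used purely as modern illustration; the names and the representation of documents as term sets are assumptions of this sketch):

```python
from itertools import combinations

def similarity_graph(docs, sim, threshold):
    """Transform an indexed document set into a graph: the documents
    become the points, and a line joins each pair whose similarity
    exceeds the threshold. `docs` maps a document name to its index
    (here a set of terms); `sim` is any symmetric similarity function."""
    points = list(docs)
    lines = [(a, b) for a, b in combinations(points, 2)
             if sim(docs[a], docs[b]) > threshold]
    return points, lines
```

The resulting point and line sets are exactly what the cluster analysis of Chapter 3 takes as input.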
The mechanized reference retrieval system with user-system interactive search procedures is termed FERRET, a Feedback Reference Retrieval system. (This system is a specific instance of a class of Sequential Feedback Information Retrieval Systems under development, the SEFIRE systems.) FERRET retains the multilevel document classification in the form of a sequential search tree which is used by its search procedures for the identification of the subset of documents to be retrieved in response to a given query.
The following section provides further discussion of the main result, along with the organization of this presentation.
1.1. Summary of the Remaining Chapters
1.1.1. Chapter 2
The uses and roles of classification in reference retrieval are discussed. Several specific methods of automatic classification are reviewed.
An overview of FERRET, a Feedback Reference Retrieval system, is presented.
1.1.2. Chapter 3
A definition of clusters of points of a graph is
developed and validated, including particularly those clusters which are intermediate to the components (maximal connected subgraphs) and the cliques (maximal complete subgraphs) of the graph. This graph-theoretical cluster-analysis method is related to similar efforts by other workers.
1.1.3. Chapter 4
Chapter 4 is devoted to the problem of the identification of the cliques of a graph. Available clique detection algorithms are reviewed. Two new algorithms are presented, along with their validating theorems and timing experiments illustrating their respective virtues.
1.1.4. Chapter 5
Chapter 3 (supported by Chapter 4) provides a method for the generation of covers of a set of documents
consisting of graph-theoretically defined clusters. The graph in question has the document set as point set, with lines joining pairs of documents whose similarities exceed a certain threshold. This cover generation method generally produces several distinct covers of a given document set.
Chapter 5 is concerned with the selection of one such cover. Two quite different methods of selection are explored.
One selection criterion is the extent to which a cover is typical of the collection of all generated covers. The discussion of the "typicality" criterion includes the presentation of two possible metrics for its realization. One of these is taken from the literature; the other is a new metric devised by the author for this application.
The second cover selection criterion explored in Chapter 5 is an evaluation function depending on the homogeneities of the clusters of a cover, and taking into account certain cost considerations of the machinery available for an implementation of FERRET.
Chapter 5 concludes with an algorithm based on this latter method of cover selection and the cover generation method given in Chapters 3 and 4. This algorithm produces the complete multilevel classification of a given document set from a quantitative measure of document-document similarities.
1.1.5. Chapter 6
This chapter presents the form in which the
classification is retained in FERRET for search and retrieval. Also given are the FERRET search procedures, which include provision for system-user interaction. Chapter 6 also includes a brief treatment of the problem of updating the classification as documents are acquired subsequent to the initial classification derivation.
1.1.6. Chapter 7
Chapter 7 reports a preliminary experimental inquiry into the presented classification technique, illustrating both the classification derivation and the FERRET search procedures. The chapter concludes with an evaluative discussion of the classification derivation and utilization.
CHAPTER 2
CLASSIFICATION AND REFERENCE RETRIEVAL
2.1. Introduction
As stated in Chapter 1, the reference retrieval systems under consideration are those in which the documents are logically or numerically indexed. Document representations within such systems may be formally viewed as a document-term matrix: each row corresponds to a document; each column corresponds to a term; the ij-th element of the matrix indicates, logically or numerically, the applicability of the j-th term to the i-th document. The information of a row of the document-term matrix constitutes the subject index of the corresponding document, i.e., the representation of the subject matter of the document. Whether or not the document is retrieved by a search procedure in response to a query depends chiefly upon the document index and the query itself. The most straightforward search procedure is the serial search: each row of the document-term matrix is matched against the query; whether or not the corresponding document is included in the response to the query is decided according to the result or value of the match.
Consider a Boolean document-term matrix representing a logically indexed document set. A query is any Boolean expression of terms, and the matching process is a logical matching. The query is evaluated according to the values of the terms of a row of the document-term matrix; the corresponding document is retrieved in case that query evaluation is 1 (true), in which case the document representation is said to satisfy the logic of the query.
In case the document-term matrix is numerical, a query is an assignment of weights (numerical values) to terms. Such a query has the form of a document index, or row of the document-term matrix, and may be viewed as a description of the subject matter of a hypothetical document sought by the source of the query, i.e., the user. The matching process in this case is the numerical evaluation of a measure of relevance or similarity, e.g., the cosine correlation between a row of the document-term matrix and the query, regarded as an additional row. The document "score" resulting from this type of evaluation permits the ordering of the members of the response according to the degree of relevance to the query.
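The numerical serial search just described can be sketched directly (a modern Python illustration; the function names and the cutoff parameter are assumptions of this sketch, not part of the original system):

```python
import math

def cosine(u, v):
    """Cosine correlation between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def serial_search(matrix, query, cutoff=0.0):
    """Serial search of a numerical document-term matrix: every row
    is scored against the query (regarded as an additional row), and
    the response is ordered by degree of relevance (score)."""
    scored = [(cosine(row, query), i) for i, row in enumerate(matrix)]
    return sorted([(s, i) for s, i in scored if s > cutoff],
                  reverse=True)
```

Note that every row is scored whether or not it belongs to the response, which is precisely the inefficiency discussed next.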
An obvious disadvantage to such serial searching is that, although the response to a given query normally consists of only a small fraction of the document set, a relevance computation is required for each document, however irrelevant to the query; the same effort is required to merely negate the membership of a document in a response as
is required to identify a document as a member of the response.
One way to eliminate much of the unproductive search effort of the serial search is to use an inverted file for the organization of the document representations rather than a direct file. An entry in a direct file corresponds to a document; its value is the index of the document, e.g., a list of applicable terms. An entry in an inverted file corresponds to a term; its value is the set or list of all those documents to which the term applies. An inverted file entry therefore corresponds to a column of the document-term matrix, whereas a direct file entry corresponds to a row.
Now suppose that the documents of a document set are logically indexed and consider a query consisting of the conjunction of two terms. Such a query is a request for just those documents indexed by both of the specified terms. The response is simply the intersection of the classes or sets of documents which are the values of the entries of the terms. Unlike serial search of the direct file, this search expends no effort on documents indexed by neither term.
The inverted file organization of the document
representations constitutes a classification of the document set. To each term there corresponds a pair of elementary classes of documents: those to which the term applies, and those to which it does not. The file entry associated with the term represents the former explicitly and the latter
implicitly. A query is a Boolean expression of terms. The response is the class of documents specified by the set-theoretic expression obtained from the query by interpreting each term as the class of documents indexed by that term, i.e., the inverted file entry of that term, and by replacing conjunction, disjunction, and negation by intersection, union, and complementation with respect to the document set, respectively. Thus, the search procedure produces the response corresponding to a query formulated in terms of index terms by performing operations on the elementary classes in accordance with the logic of the query.
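The inverted file construction and the set-theoretic query evaluation above can be sketched as follows (Python for illustration; the tuple encoding of Boolean queries is an assumption of this sketch):

```python
def build_inverted(direct):
    """Invert a direct file (document -> set of terms) into an
    inverted file (term -> set of documents), i.e., turn the rows of
    the document-term matrix into its columns."""
    inverted = {}
    for doc, terms in direct.items():
        for term in terms:
            inverted.setdefault(term, set()).add(doc)
    return inverted

def evaluate(query, inverted, universe):
    """Answer a Boolean query by operations on the elementary classes:
    a query is a term, or a tuple ('and' | 'or', q1, q2), or ('not', q).
    Conjunction, disjunction, and negation become intersection, union,
    and complementation with respect to the document set."""
    if isinstance(query, str):
        return inverted.get(query, set())
    op = query[0]
    if op == 'and':
        return evaluate(query[1], inverted, universe) & \
               evaluate(query[2], inverted, universe)
    if op == 'or':
        return evaluate(query[1], inverted, universe) | \
               evaluate(query[2], inverted, universe)
    if op == 'not':
        return universe - evaluate(query[1], inverted, universe)
    raise ValueError("unknown operator: %r" % op)
```

A conjunctive query touches only the entries of its own terms, which is the efficiency gain over the serial search noted above.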
The inverted file system illustrates the primary
utility of classification in reference retrieval systems: to provide for more efficient search procedures. It also illustrates the price which one must pay for more efficient searching: the more costly file creation and maintenance, i.e., the initial derivation and the updating of the classification. The information provided by the indexing operation on a document is precisely the value of an entry of the direct file. Updating the direct file in the event of the acquisition of a new document requires only the creation of the new entry; in particular, no existing entry is affected. With the inverted file, however, the acquisition of a document requires the modification of existing file entries: the entry of each term applied to the document is modified by the addition of the identifier of the acquired document.
Although serial searching is practicable in a mechanized reference retrieval system (e.g., MEDLARS [6], the medical literature analysis and retrieval system of the National Library of Medicine), such is not the case in conventional reference retrieval systems, i.e., systems in which search procedures are executed by humans with little or no help from machines. Consequently, conventional libraries have long been concerned with the problem of classification.
The traditional method of classification begins with an a priori hierarchy of subject areas or categories by means of which a hierarchical classification of the document set is derived. The category hierarchy has the form of a rooted tree, the root representing (implicitly) the totality of subject areas of the document set. Those nodes adjacent to the root correspond to the major division of subject matters into broad categories. The major categories are divided into subcategories, and so forth, down to the most specific categories, which correspond to endpoints of the tree. Each category is labeled with a code, e.g., a string of decimal digits, the length of which depends on the degree of specificity of the category. Consider the following example from the Dewey Decimal Classification [7]:
398 Folklore
398.2 Tales and Legends
398.21 Fairy Tales
398.22 Tales and Legends of Heroes
398.23 Tales and Legends of Places
398.24 Tales and Legends of Animals and Plants
398.3 The Real
398.4 The Unreal
This decimal code reflects the specificity relation: Tales and Legends is specific to Folklore, and the code of the latter (398) is a truncation of the former (398.2); indeed, Folklore is generic to any category whose decimal code begins with "398." In this case, Tales and Legends (398.2) has been further analyzed into four subcategories, each of which is an endpoint.
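The truncation rule for genericity is mechanical, as the following sketch shows (Python for illustration; treating truncation as string prefixing is a simplification assumed by this sketch, since real Dewey codes require slightly more care):

```python
def is_generic(code_a, code_b):
    """Category code_a is generic to code_b when code_a is a proper
    truncation of code_b, as 398 is of 398.2."""
    return code_b.startswith(code_a) and code_a != code_b

def classes_for(doc_code, categories):
    """All classes to which a document labeled doc_code belongs: the
    class of its own category plus every class generic to it."""
    return {c for c in categories
            if c == doc_code or is_generic(c, doc_code)}
```

Applied to the example, a document labeled 398.2 falls in the classes for 398.2 and 398 but in none of the more specific or sibling classes.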
The document classification corresponding to such a
subject category hierarchy has a class of documents for each category. A document is intellectually analyzed to determine the most specific subject category which applies to the document, and the document is labeled with the decimal code for that category. The document belongs to the class associated with the category, and to every class associated with categories generic to the category. A document labeled 398.2, for example, belongs to the classes associated with 398.2 and 398, but not to the classes associated with 398.21, 398.22, 398.3, or 398.4. Thus, any pair of classes are disjoint or related by the inclusion relation. That is, if a pair of classes have one document in common, then one class contains all the documents
belonging to the other. Another aspect of this type of classification is that it is suitable for the physical organization and storage of documents: the documents are stored according to the decimal codes with which they are labeled.
The above hierarchical classification is an example of a multilevel classification, in contradistinction to a simple classification such as the inverted file. A simple classification of a finite set consists of an efficient cover of the set, i.e., a collection of classessubsets of the setwhose union coincides with the set, and such that no subset is properly contained in any other. A multilevel classification extends a simple classification: some of the classes of the simple classification of the finite set are themselves endowed with a simple classification, some of whose members may be further classified, and so forth.
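The defining conditions of an efficient cover, as stated above, are directly checkable (a minimal Python sketch; the function name is this sketch's, not the dissertation's):

```python
def is_efficient_cover(universe, classes):
    """Test the definition of a simple classification: the classes'
    union coincides with the set, and no class is properly contained
    in any other."""
    classes = [set(c) for c in classes]
    covers = set().union(*classes) == set(universe)
    # `a < b` is Python's proper-subset test on sets.
    efficient = not any(a < b for a in classes for b in classes)
    return covers and efficient
```

Note that, unlike a partition, an efficient cover may have overlapping classes, as the first test case below shows.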
The basis of a (subject) classification of documents is similarity of subject matter. A pair of documents in an entry of an inverted file, for example, are similar in that they have at least one term in common. A pair of documents assigned to the same class of a traditional hierarchical classification are similar in that they are both judged to be concerned with the subject area corresponding to the class.
It is evident that any automatic classification scheme requires a quantitative measure of similarity on the set of objects to be classified. The attributes of the objects to
be classified may be formally represented in a logical or numerical object-attribute matrix; a row of this matrix is referred to as the attribute vector of the corresponding object. The quantitative similarity of a pair of objects is defined in terms of the attribute vectors of the objects.
Suppose the attributes are binary valued, that is, that each attribute applies, or does not apply, to each object. The basic data in terms of which a similarity function may be defined are as follows for a given pair of objects: the number of mismatches, i.e., attributes applicable to just one of the pair; the number of positive matches, i.e., attributes applicable to both; and the number of negative matches, i.e., attributes applicable to neither. Sokal and Sneath [8] discuss various specific possibilities for a similarity measure defined in terms of those quantities; these vary chiefly with respect to the issue of equal or unequal weightings of matches and mismatches and the issue of whether or not negative matches are taken into account.
In case the objects are documents and the attributes are terms, negative matches must be ignored: one may not reasonably construe the inapplicability of a particular term to either of a pair of documents as evidence of their similarity. The Tanimoto similarity measure [ 9 ] is particularly suitable for this application; it is defined to be the ratio of the number of attributes (terms) possessed by both objects (documents) to the number of attributes possessed by either. This similarity function has values
between zero and unity; a similarity of zero indicates that the pair of objects have no attribute in common, while a similarity of unity indicates that the objects have identical attributes, i.e., an attribute applies to one object if and only if it applies to the other. The function value may be regarded [9 ] as the probability that an attribute applies to both objects, given that it applies to at least one of the pair.
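The binary Tanimoto measure just described may be sketched in a few lines of Python (a modern illustration, not part of the original system; the index vectors are hypothetical):

```python
def tanimoto_binary(a, b):
    """Tanimoto similarity for binary attribute vectors: the number of
    attributes possessed by both objects divided by the number
    possessed by either."""
    both = sum(1 for x, y in zip(a, b) if x and y)   # positive matches
    either = sum(1 for x, y in zip(a, b) if x or y)  # positive matches plus mismatches
    return both / either if either else 0.0

# Two hypothetical document index vectors over five terms.
doc1 = [1, 1, 0, 1, 0]
doc2 = [1, 0, 0, 1, 1]
similarity = tanimoto_binary(doc1, doc2)  # 2 shared terms of 4 possessed -> 0.5
```

A similarity of zero arises only when the two index sets are disjoint, and unity only when they coincide, matching the interpretation given above.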
Numerous similarity functions exist for application to attribute vectors having nonnegative numerical values. Statistical correlation, that is, the Pearson correlation coefficient [ 10 ], is the measure of similarity used in the factor analytic approach to automatic classification, which is described in the following section. The attribute vector is regarded as the explicit specification of a discrete random variable; the statistical correlation of a pair of objects is then the covariance of the corresponding pair of random variables, standardized with respect to zero mean and unit variance.
The following similarity function, attributed by Salton [3] to Tanimoto, is an extension of the Tanimoto similarity for binary attribute vectors to numerical attribute vectors:

S(v, w) = (Σ_i v_i w_i) / (Σ_i v_i² + Σ_i w_i² - Σ_i v_i w_i)     (2.1)

The range of this function, like that of the function which it extends, is the unit interval, provided, that is, that v and w are nonzero vectors over the unit interval.
The cosine correlation similarity measure [3] is the normalized inner product of a given pair of attribute vectors:

S(v, w) = (Σ_i v_i w_i) / (Σ_i v_i² × Σ_i w_i²)^(1/2)     (2.2)
The cosine correlation ranges from -1 to +1 and requires only that v and w be nonzero vectors over the real numbers. In particular, the components of the attribute vectors are not necessarily nonnegative. Consequently, a user may assign negative weights, rather than merely zero weight, to selected terms, which is analogous to negation rather than omission in a logical system.
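The extended Tanimoto function and the cosine correlation just described may be sketched as follows (a Python illustration with hypothetical vectors):

```python
import math

def tanimoto_numeric(v, w):
    """Extended Tanimoto similarity for numerical attribute vectors."""
    dot = sum(vi * wi for vi, wi in zip(v, w))
    return dot / (sum(vi * vi for vi in v) + sum(wi * wi for wi in w) - dot)

def cosine_correlation(v, w):
    """Cosine correlation: the normalized inner product."""
    dot = sum(vi * wi for vi, wi in zip(v, w))
    return dot / math.sqrt(sum(vi * vi for vi in v) * sum(wi * wi for wi in w))

v = [0.5, 0.0, 1.0]   # components in the unit interval for the Tanimoto form
w = [0.5, 0.5, 0.0]
t = tanimoto_numeric(v, w)                 # lies in the unit interval
c = cosine_correlation([1, -1], [-1, 1])   # negative weights are permitted
```

Note that the cosine measure accepts the negative term weights discussed above, while the extended Tanimoto form presumes vectors over the unit interval.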
Consider now a given document-term matrix. There are two general approaches to the problem of automatic document classification. The direct approach is to regard the rows of the document-term matrix as attribute vectors of objects, to define a measure of document-document similarity, and to classify the documents by reference to their similarities. The indirect approach is to regard the columns of the document-term matrix as attribute vectors of objects, to define a measure of term-term similarity, to classify the
terms by reference to their similarities, and to classify the documents by reference to the term classes and the document indexes.
One may distinguish, moreover, three major constituents of the document classification problem: the classification derivation problem, the class characterization problem, and the document assignment problem. The effort demanded by each of these component problems depends upon the purpose and nature of the classification scheme.
For example, the traditional hierarchical
classification, whose purpose is to aid the human searcher, requires a great intellectual effort for the derivation of the hierarchy of subject categories; the assignment of a document to a class requires a small intellectual effort; and the characterization of a document class requires no effort, since each document class is characterized by the associated subject category, e.g., Folklore.
The inverted file system, whose purpose is to provide for more efficient mechanized searching, obviously requires only minimal effort for each of the three aspects of classification: a document is assigned to the elementary classes corresponding to the terms with which the document is indexed. Additional classification effort is required, however, during query processing: the particular document class constituting the response must be computed from the application of the logic of the query to the elementary classes.
The direct approach to automatic document
classification from the document-term matrix requires no effort for document assignments, since the classes to which a document belongs are determined by the classification derivation; the characterization of the classes in terms of index terms, however, does require additional computation. The indirect approach, on the other hand, first produces classes of terms, which essentially constitute the corresponding document class characterizations; in this case, the additional computation is the formation of the document classes from the term classes by a process of document assignment based on the class characterizations and the document indexes.
The next section provides descriptions of some specific methods which have been applied to the problems of automatic document classification.
2.2. Methods of Automatic Classification
Several approaches to automatic classification for
reference retrieval purposes are described below: Bayesian classification, factor analytically derived categories, and clumps. Specifically graph theoretical techniques are not included because those are discussed in some detail in Chapter 3.
2.2.1. Bayesian Classification
Maron [11] applies probability theory to the problem of assigning documents to subject categories. The initial document set or training set is intellectually analyzed to determine subject categories, the documents of the training set are assigned by the analysts to the corresponding classes, and the most promising clue words (index terms) for category prediction are manually selected. The document-term incidence matrix for the training set is then formed: the ijth element indicates the occurrence or nonoccurrence of the jth term in the ith document.
The classes are characterized by probability estimates derived from the document-term matrix. The a priori probability P(Cj) of each category Cj is estimated by the ratio of the number of training documents manually assigned to Cj to the total number of training documents. The conditional probability P(Wi|Cj) that term Wi applies to a document of category Cj is estimated by the ratio of the number of occurrences of Wi in training set documents manually assigned to Cj to the number of occurrences of all terms in these documents.
The automatic assignment of a document to a category is viewed as a probability prediction based upon evidence and hypotheses. The hypotheses are the above probability estimates relating the categories and the terms. The evidence is the index of the document to be classified, that is, the specification of those terms or clue words
which apply to the document. The prediction method is based upon Bayes' rule [10] simplified by means of an independence assumption. The assumption is that, given any category Cj, any pair of terms Wp and Wm are independent with respect to that category:
P(Wp | Cj, Wm) = P(Wp | Cj)
Let Wp, Wm, ..., Wr be the terms applicable to a given document. The Bayesian prediction for category Cj, or "attribute number" for Cj, is P(Cj | Wp, Wm, ..., Wr) = k P(Cj) P(Wp|Cj) P(Wm|Cj) ... P(Wr|Cj), where the scaling factor k for the particular set of applicable terms is determined by equating unity with the sum over all the categories of the attribute numbers. The document is then assigned to the category having the greatest attribute number with respect to the set of terms applicable to the document.
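The prediction rule may be sketched as follows; the categories, clue words, and probability estimates below are hypothetical, and the normalization fixes the scaling factor k implicitly:

```python
def bayes_predict(doc_terms, priors, cond):
    """Attribute numbers under the independence assumption:
    k * P(Cj) * P(Wp|Cj) * ... * P(Wr|Cj), with k fixed by requiring
    the attribute numbers to sum to unity over the categories."""
    scores = {}
    for c, prior in priors.items():
        s = prior
        for t in doc_terms:
            s *= cond[c].get(t, 0.0)
        scores[c] = s
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()} if total else scores

# Hypothetical estimates for two categories and three clue words.
priors = {"logic": 0.6, "memory": 0.4}
cond = {"logic":  {"circuit": 0.5, "gate": 0.3, "core": 0.2},
        "memory": {"circuit": 0.2, "gate": 0.1, "core": 0.7}}
attr = bayes_predict(["circuit", "gate"], priors, cond)
best = max(attr, key=attr.get)  # the category of greatest attribute number
```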
The documents used for the experiments by Maron [11] are abstracts of computer literature published in the March, June, and September, 1959 issues of the IRE Transactions on Electronic Computers. The training set consists of 260 of these, while the remaining 145 abstracts constitute the test set. The intellectual analysis of the training set yielded 32 categories and the selection of 90 index terms. (Documents from the test set, of course, were not included in this analysis, nor in the formation of the hypotheses, i.e., the estimation of P(Cj) for each category and
P(Wi|Cj) for each category and index term.) Considering only documents having at least two index terms and assigned by human classifiers to just one category, the agreement between the automatic assignments and the human assignments of documents to subject categories was 91% for the training set and 50% for the test set. Taking into account the number of categories (32), this latter figure is by no means as poor as it might seem at first glance.
Indeed, one may reasonably conclude that this study demonstrates the possibility, if not the practicality, of automatic assignment of documents to classes of a manually defined classification of subject categories.
2.2.2. Factor Analysis
Factor analysis [ 12, 13, 14 ] is applied by Borko
[15] and Borko and Bernick [16, 17] to the problem of automatic document classification, including the classification derivation, the class characterizations, and the assignment of documents to classes. The general approach of this method is to determine from the words or index terms occurring in the documents of a document set the subject categories in terms of the index terms, and to assign documents to categories by reference to the terms occurring in the documents and the characterizations of the categories. The main tools of the method are statistics and matrix theory.
The method begins with the document-term frequency
matrix of a document set, in which the ijth element is the number of occurrences of the jth term in the ith document. The document-term matrix of Borko [15] gives the occurrence frequency of the significant words of the documents, which are abstracts appearing in Psychological Abstracts. Each term is construed as a discrete random variable defined by the associated column of the document-term matrix. The means and variances of the terms and the covariances of the pairs of terms are computed; from these quantities the Pearson correlation matrix is formed. The eigenvalues and eigenvectors of the correlation matrix are then found.
The eigenvectors represent "factors" underlying the
correlations; the eigenvalues represent the portion of the correlation attributable to the corresponding factors. The largest eigenvector, i.e., the eigenvector corresponding to the largest eigenvalue, gives the direction in term space along which correlation is maximum. The category space is the subspace of the term space generated by the several largest eigenvectors of the correlation matrix.
The number of eigenvectors is determined on the basis of the percentage of the total correlation accounted for by the corresponding factor space. For example, if 75 % of the total correlation is accounted for, then the k eigenvectors corresponding to the k largest eigenvalues are selected, where the sum of the k largest eigenvalues
is at least 75% of the total correlation and the sum of the k - 1 largest eigenvalues is less than 75% of the total correlation.
The factor loadings of a term are the respective normalized inner products of the term vector with the eigenvectors, which is to say the particular term components of the respective normalized eigenvectors.
The document-term matrix may be regarded as representing the documents in Euclidean t-space, t being the number of terms. From the factor loadings of the terms and the document-term matrix, the documents may be represented in the category space, a k-dimensional subspace of the term space, k being the required number of eigenvectors. These document vectors in the category subspace may be regarded as vectors in the term space approximating respectively the original document vectors in the term space, i.e., the rows of the document-term matrix. As such, these approximations are optimal in that the mean square error is minimized, subject to the constraint that the approximations lie in a k-dimensional subspace of the term space.
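The derivation of the category space may be sketched as follows, assuming a document-term frequency matrix whose term columns have nonzero variance; the data and the 75% threshold are illustrative:

```python
import numpy as np

def category_space(doc_term, fraction=0.75):
    """Project the documents onto the k largest eigenvectors of the term
    correlation matrix, k being the smallest number of eigenvalues whose
    sum is at least `fraction` of the total correlation (the trace)."""
    R = np.corrcoef(doc_term, rowvar=False)   # Pearson correlations of term columns
    vals, vecs = np.linalg.eigh(R)            # eigenvalues in ascending order
    order = np.argsort(vals)[::-1]            # reorder largest first
    vals, vecs = vals[order], vecs[:, order]
    cum = np.cumsum(vals) / vals.sum()
    k = int(np.searchsorted(cum, fraction)) + 1
    return doc_term @ vecs[:, :k], k          # documents in the category space

rng = np.random.default_rng(0)
X = rng.random((8, 4))        # 8 hypothetical documents over 4 terms
projected, k = category_space(X)
```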
In a study of feature extraction in pattern recognition, Tou and Heydorn [18] provide a proof that the estimation error is minimized by the choice of the k largest eigenvectors (as the basis for the k-dimensional subspace). On the other hand, two other optimality criteria [18] dictate the choice of the k smallest eigenvectors, viz., the minimum mean square distance criterion and the minimum
entropy criterion. However, these two criteria are devised expressly to extract the common features of all the pattern vectors of a single class, i.e., the intraset features of a particular pattern class; in consequence, these criteria, unlike the minimum estimation error criterion, are not appropriate to the purpose under discussion.
While the subspace dimensionality required to represent a given percentage of the total correlation is uniquely determined and the optimum subspace is also uniquely determined, the choice of a basis for the subspace remains open. That is, the number of factors and the subspace generated by them are uniquely determined; the subspace is that generated by the eigenvectors; and the eigenvectors may be, but need not be, interpreted as the factors or categories. The eigenvectors are mathematically rotated by Borko [ 15 ] to approximate "simple structure," the purpose of which is to permit meaningful interpretation of the factors in terms of the index terms. The basic idea is to achieve many small factor loadings for all the terms; that is, each index term vector expressed as a linear combination of the factors has many small components and few large components.
The subject categories corresponding to the factors or basis vectors are conceptually identified by intellectual analysis of the factor loadings. For example, the terms having a factor loading of 0.18 or more on a particular one of the factors [ 15 ] were "girls," "boys," "school,"
"achievement," and "reading," from which the factor interpretation "academic achievement" was inferred.
The assignment of documents to the classes or
categories corresponding to the basis vectors is done as follows. Given a document and its index, a score is computed for each category; the document is assigned to the category of highest score. These factor scores are computed from the factor loadings of the terms and the term frequencies of the document. The score of a factor is the sum over the index terms of the products of the term frequencies and the loadings of the terms on the particular factor; that is, a factor score for a given document is the cosine correlation between the document index and the particular factor.
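The assignment step may be sketched as follows; the loading vectors and document index below are hypothetical:

```python
import math

def factor_scores(doc_vec, loadings):
    """Score a document against each factor: the cosine correlation of the
    document's term-frequency vector with the factor's term loadings."""
    d_norm = math.sqrt(sum(x * x for x in doc_vec))
    scores = []
    for f in loadings:
        f_norm = math.sqrt(sum(x * x for x in f))
        dot = sum(x * y for x, y in zip(doc_vec, f))
        scores.append(dot / (d_norm * f_norm))
    return scores

# A hypothetical document over three terms, and two factor loading vectors.
doc = [2, 0, 1]
loadings = [[1, 0, 0], [0, 1, 1]]
scores = factor_scores(doc, loadings)
assigned = scores.index(max(scores))  # assign to the category of highest score
```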
Borko and Bernick [17] compared experimentally the factor analytic technique with the Bayesian prediction technique, using the index terms, the training set, and the test set of Maron [11]. Factor analysis was applied to the training set, producing 21 subject categories. The documents of the training set were manually assigned to the categories. From this classification of the training set the hypotheses (probability estimates) were derived, from which each document of the training set and the test set was assigned to a class by means of the Bayesian technique. Each document was also assigned to a class by the factor analysis method, i.e., according to the factor scores. The assignments of the documents of the training set to the factor-analytically derived categories by each of the three
methods (manual, Bayesian, and factor) were compared. The results may be summarized by the percentage of the documents for which assignments of methods coincide: manual-Bayesian, 80%; manual-factor, 58%; Bayesian-factor, 67%. The agreement summaries for the test set of documents were, respectively, 45%, 39%, and 57%. In the test set, the two automatic document assignment techniques are in better agreement than is either with the manual method. This fact is believed to be indicative of a greater inherent consistency in mechanized methods of assignment, relative to human assignment.
This study confirms the conclusion of Maron [ 11 ] that the automatic assignment of documents to classes is feasible. In addition, moreover, the automatic methods (both Bayesian and factor) performed about as well, relative to the human assignments, with the automatically derived classification as did the Bayesian method with the manually defined classification.
Thus, this study demonstrates the possibility, not only of the automatic assignment of documents to subject categories of a classification, but also of the automatic derivation of such a classification. It should be emphasized that the feasibility of both the automatic document assignment and the classification derivation are here judged by means of comparing the automatic assignments with the document assignments to the subject categories by human classifiers.
2.2.3. Clumps
The automatic classification techniques discussed above may be characterized as applications of machine techniques to problems of conventional document classification. Subject categories are defined or derived, each document is assigned to one category, and a class of documents consists of all those documents assigned to a given category. The document classification is evaluated against the standard of human assignments of documents to categories; it is suitable for the physical grouping of the documents; and, finally, it is not primarily intended for a specifically mechanical search strategy.
The clumps of Needham and Sparck Jones [19], on the contrary, were devised expressly for the purpose of improving the retrieval effectiveness of a mechanized document retrieval system. The associated document classification, it will be seen, is not suitable for physical grouping, since documents generally belong to several classes. Finally, because of the purpose of this classification scheme, there would be little point in comparing the automatic assignments of documents to classes with those of human indexers.
The initial classification is, again, a grouping of
terms. The analysis begins with the document-term incidence matrix of the document set, in which the ijth element indicates the occurrence or nonoccurrence of the jth term
in the ith document. The Tanimoto similarity function is applied to the pairs of columns of the document-term matrix: the similarity of a pair of terms is the ratio of the number of documents in which both terms occur to the number of documents in which either term occurs.
The aggregate of coefficients of a term to a set of terms is defined to be the sum of the similarities of the given term to the terms of the set. A clump is then defined to be a minimal, nonempty, proper subset of the term set such that each term belonging to the subset has a greater aggregate of coefficients to the subset than to its complement, and each term not belonging to the subset has an aggregate of coefficients to its complement not less than its aggregate of coefficients to the subset.
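The defining condition (minimality aside) can be checked mechanically; a sketch, with a hypothetical term-term similarity table:

```python
def satisfies_clump_condition(subset, terms, sim):
    """Check the clump condition for a nonempty proper subset of the term
    set: each member's aggregate of coefficients to the subset exceeds its
    aggregate to the complement, and each nonmember's aggregate to the
    complement is at least its aggregate to the subset.
    (Minimality is not verified here.)"""
    subset = set(subset)
    complement = set(terms) - subset
    if not subset or not complement:
        return False
    def aggregate(t, group):
        return sum(sim[t][u] for u in group if u != t)
    return (all(aggregate(t, subset) > aggregate(t, complement) for t in subset)
            and all(aggregate(t, complement) >= aggregate(t, subset)
                    for t in complement))

# Hypothetical similarities: two tightly linked pairs, weak cross-links.
terms = ["folk", "tale", "memory", "core"]
sim = {t: {} for t in terms}
for a, b, v in [("folk", "tale", 0.9), ("memory", "core", 0.8),
                ("folk", "memory", 0.1), ("folk", "core", 0.1),
                ("tale", "memory", 0.1), ("tale", "core", 0.1)]:
    sim[a][b] = sim[b][a] = v
```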
The sets of clumps to which the terms respectively
belong are formed. This information and the document-term matrix are then used to associate clumps with documents, i.e., to classify the documents. The clump set for a document is the union of the sets of clumps of the terms occurring in the document. In other words, a certain clump is applied to a document (or the document is assigned to a certain class) if there exists a term which occurs in the document and which belongs to the clump. (The criterion of applicability of a clump to a document could easily be made more stringent.) The result is the document-clump incidence matrix, which amounts to an augmentation of the document-term matrix: the (binary) value of the ijth element
indicates whether or not the jth clump is applicable to the ith document.
A search request is a specification of a subset of the term set. Just as for a document, a clump set for the request is formed; the request descriptor set consists of all those clumps containing at least one of the terms of the request. The search begins with a matching procedure relative to the clumps. The retrieved documents are those having document clump sets which contain the request clump set.
The second stage of processing is the matching of the terms of the retrieved documents against the terms of the request, thereby providing for a ranking of the retrieved documents according to a quantitative measure of relevance, e.g., the similarity of the term set of a document to the term set of the request.
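The two-stage search may be sketched as follows, with a hypothetical term-to-clump mapping and document term sets; the second-stage ranking here uses the Tanimoto similarity of term sets:

```python
def clump_search(request_terms, term_clumps, doc_terms):
    """Stage one: retrieve documents whose clump sets contain the request
    clump set. Stage two: rank the retrieved documents by the Tanimoto
    similarity of their term sets to the request's term set."""
    request = set(request_terms)
    request_clumps = set().union(*(term_clumps.get(t, set()) for t in request))
    ranked = []
    for doc, terms in doc_terms.items():
        doc_clumps = set().union(*(term_clumps.get(t, set()) for t in terms))
        if request_clumps <= doc_clumps:          # clump set containment
            similarity = len(request & terms) / len(request | terms)
            ranked.append((doc, similarity))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)

term_clumps = {"tale": {1}, "folk": {1, 2}, "myth": {2}}  # hypothetical clumps
doc_terms = {"d1": {"tale", "folk"}, "d2": {"myth"}}
hits = clump_search(["tale"], term_clumps, doc_terms)
```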
Thus not only are documents typically assigned to several classes (clumps) but the search procedure is predicated upon such multiple classifications, in contrast to hierarchical classifications and their associated search strategies.
The role of clumps is analogous to that of descriptors used to logically index documents. The clump technique is intended for application to a mechanically derived document-term matrix, reflecting the keyword logical indexing of a document set. The objective is to combine the ease of indexing by keywords with the retrieval effectiveness of
descriptor indexing, and to do so altogether mechanically. In particular, manual descriptor indexing is replaced by the automatic clump derivation and the associated document-clump incidence matrix determination.
Moreover, it is not necessary to interpret the term clumps, i.e., to assign descriptor-like labels to them, because requests are expressed in terms of keywords and the associated clump set is determined automatically. This permits the use of both "descriptors" (clumps) and keywords in the search procedure, without the necessity of clump interpretation for the user.
One important aspect of the clump system which has not yet been mentioned is the updating problem. Although this issue was not investigated in any detail in the exploratory study [ 19 ], its importance was nonetheless acknowledged there.
Consider the acquisition of a document subsequent to
the clump analysis of the initial document set, and suppose the keyword indexing of the new document is given.
The straightforward approach to updating is to
reinitialize, i.e., to do whatever is required so that the resulting system is identical to what it would have been if the new document had been a member of the initial document set. The updated document-term matrix has an additional row for the new document. It also has a new column for each keyword of the new document which does not occur in any of the other documents. The updated term-term similarity
matrix is of larger order, according to the number of new keywords.
However, if any keyword of the original document-term
matrix is applied to the new document, then the similarities of this term with all the others will, in general, be different. Consequently, the updated term-term similarities will differ from the initial ones from which the clumps were derived, even if the new document introduces no new keyword into the system. Therefore, the updated clumps may well differ from those of the initial document set. However, the quantity of effort required to repeat the classification derivation is clearly incommensurate with the event of the acquisition of one additional document.
A more modest approach to updating begins with the assumption that the term-term similarities are not significantly affected by the acquisition of the new document, so that the initially derived clumps are still adequate. In this case, the classification derivation is not repeated. All that is required is to assign the new document to the appropriate classes, as determined by its keywords and the clumps to which each belongs, as initially. Thus, the document-term and document-clump matrices are updated by the addition of a row for the new document, thereby rendering it retrievable through the search strategy as originally described.
The most practical approach may be a combination of the two approaches given above. The system is reinitialized
occasionally, e.g., whenever the document set size has increased by, say, 10% of the size it had for the last initialization. During the interim between initializations, updating is limited to the more modest approach. The premise of this updating strategy is that, although subject matters of a document set will vary with time, the changes will be gradual.
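The combined policy can be stated as a small sketch; the 10% growth trigger is the figure suggested above, and the class bookkeeping itself is omitted:

```python
class UpdatePolicy:
    """Hybrid updating: assign each new document incrementally, and signal
    a full reinitialization once the collection has grown by a given
    fraction since the last classification derivation."""
    def __init__(self, initial_size, growth=0.10):
        self.base = initial_size   # size at the last initialization
        self.size = initial_size
        self.growth = growth
    def add_document(self):
        """Record an acquisition; return True when reinitialization is due."""
        self.size += 1
        return (self.size - self.base) / self.base >= self.growth
    def reinitialize(self):
        """Repeat the classification derivation and reset the baseline."""
        self.base = self.size

policy = UpdatePolicy(100)
flags = [policy.add_document() for _ in range(10)]  # True only on the 10th
```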
Clearly, the complete elimination of system
reinitialization would undercut the premise of automatic classification, viz., that a classification derived from analysis of the document set is potentially superior to an a priori classification schedule, devised without reference to the particular document collection at hand.
2.3. FERRET: A Feedback Reference Retrieval System
The primary concern of this work is the automatic
derivation of a multilevel document classification intended to serve two purposes. The first is to provide for an automatic sequential search procedure which is substantially more efficient than the serial search, without degrading retrieval effectiveness. The second purpose is to provide for interactive search procedures.
In view of the purposes of the classification it is mandatory that certain other aspects of a mechanized reference retrieval system be specified, particularly the natures of the search procedures and the form of class
characterizations which these require. Accordingly, this section presents an overview of a complete reference retrieval system, FERRET. (This system, it should be noted, is similar in certain respects to those previously described. Specifically, Salton [3] has recognized the potential efficiency advantage of a multilevel classification over a simple classification and has discussed the basic search strategy associated with such a classification.)
The processes required for the initialization of the system are indicated in Figure 2.1.
The first step of initialization is the production of the representations of the documents by the analysis of the documents with respect to subject matter. The process of content analysis is not particularly pertinent to the classification problem, and so will not be discussed further. What is of importance here is the form of the document representations resulting from that indexing operation: each document is represented, for subject searching and classification purposes, by a logical or numerical term vector (index), that is, an attribute vector whose components correspond to the index terms of the system.
Thus, the data by means of which the classification derivation process produces a document classification are the document-term matrix. The classification derivation is developed in detail in the following chapters.

[Figure 2.1: FERRET Initialization. Documents pass through content analysis to produce the document representations (term vectors); classification derivation and the document class representation transformation then yield the classification of the documents and the sequential search tree, which are retained in the information store.]

Briefly, a measure of document-document similarity is defined in terms of the attribute vectors of the documents. A
similarity threshold is applied to the document-document similarity matrix to produce a graph: the points of the graph are the documents; a pair of points are joined by a line in case the similarity of the pair of documents is not less than the similarity threshold. Graph theoretical techniques are applied to the graph in order to identify clusters of points. These clusters of points constitute the major document classes.
The process is repeated on each class which is suitably large, and so on, until each class not subclassified is suitably small. Consequently, the resulting document classification may be regarded as a rooted tree of subsets of the document set: the root of the tree is the whole document set; the successors of the root are the major classes; the successors of any nonendpoint of the tree constitute the subclassification of that nonendpoint; an endpoint is a subset of the document set which is not subclassified.
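The thresholding step may be sketched as follows (the similarity values are hypothetical; the clustering techniques applied to the resulting graph are the subject of Chapter 3):

```python
def threshold_graph(similarity, threshold):
    """Build the graph whose points are the documents and whose lines join
    pairs with similarity not less than the threshold; `similarity` is a
    symmetric matrix given as a list of rows."""
    n = len(similarity)
    return {i: {j for j in range(n)
                if j != i and similarity[i][j] >= threshold}
            for i in range(n)}

# A hypothetical document-document similarity matrix for three documents.
S = [[1.0, 0.8, 0.1],
     [0.8, 1.0, 0.2],
     [0.1, 0.2, 1.0]]
G = threshold_graph(S, 0.5)   # adjacency sets of the graph
```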
In order to realize the objective of efficient search, the classes of the classification must be represented in a form which is suitable for matching against queries. The class representation transformation step of initialization performs this task. Each class (except the root, i.e., the whole document set) is considered to be a document; the class representation is a numerical term vector (or aggregate index) computed from the term vectors of the documents belonging to the class; if the class is an
endpoint of the classification then the class representation includes the identity of the class in addition to the aggregate index.
The result of the class representation transformation is therefore a tree, the sequential search tree, isomorphic to the classification tree. The root represents (implicitly, not explicitly) the whole document set; any other point includes the aggregate index of the represented class, and the class itself, if it is an endpoint. Clearly, the identity of any class of the classification may be readily determined from the sequential search tree: the class is the union of the classes included in the endpoints of the subtree subtended by the representation of the class in question.
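The tree of class representations may be sketched as follows; the node fields follow the description above, and the identity of any class is recovered as the union of the endpoint classes of its subtree (the aggregate indexes and document identifiers are hypothetical):

```python
class SearchTreeNode:
    """A point of the sequential search tree: an aggregate index, a list
    of successor nodes, and, at an endpoint, the class itself (a set of
    document identifiers)."""
    def __init__(self, aggregate_index, children=None, documents=None):
        self.aggregate_index = aggregate_index
        self.children = children or []
        self.documents = documents         # present at endpoints only
    def class_set(self):
        """The class this node represents: the union of the classes
        included in the endpoints of the subtended subtree."""
        if not self.children:
            return set(self.documents)
        members = set()
        for child in self.children:
            members |= child.class_set()
        return members

leaf1 = SearchTreeNode([0.9, 0.1], documents={1, 2})
leaf2 = SearchTreeNode([0.2, 0.8], documents={3})
major = SearchTreeNode([0.6, 0.4], children=[leaf1, leaf2])
```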
The sequential search tree, which represents the
document classification and the aggregate indexes of the classes, is retained in the information store. Also retained in the information store are two items associated with each document. The first is the document index, used principally during the last step of a search. The other is just the text which constitutes the reference given to the user when the document is included in the response.
After initialization the system is operational. As indicated in Figure 2.2, there are two general types of operational activity: system updating, and search and retrieval.
[Figure 2.2: FERRET Updating and Retrieval. The user submits a query and receives class characterizations from the system; the user replies with class selections and query modifications; fetch commands retrieve documents, and the system returns the response.]
The search procedures are developed and discussed in
detail in Chapter 6. There are two general types of search procedures: the basic search and the feedback search.
The basic search procedure illustrates the underlying philosophy of the classification structure. Many query-document relevance computations for future queries are, in effect, done in advance with the results stored implicitly in the retained classification representation, thereby reducing the number of relevance computations required for any given query search.
A query is an assignment of numerical weights, positive, negative, or zero, to the terms of the system. The stored classification structure makes it unnecessary to perform a relevance computation for each document. Instead, a relevance computation is performed relative to each major class by reference to the respective aggregate indexes of the classes. The major class of greatest score is selected, and becomes the decision node of the sequential search tree, replacing the root.
The successors of the selected class are similarly
scored with respect to relevance to the query; the subclass of greatest relevance becomes the next decision node. This continues until the selected decision node is an endpoint of the sequential search tree. The representations of the members of the terminal class are then used to compute the relevance of these individual documents to the query and to provide for their ordering by decreasing relevance. The relevance-ordered list of references of the documents of the terminal class of the search constitutes the response to the query.
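The descent just described may be sketched in modern notation (Python, used here purely for illustration). The Node layout and the weighted-sum relevance measure are assumptions of the sketch, not the FERRET implementation.

```python
class Node:
    """A point of the sequential search tree (assumed layout)."""
    def __init__(self, agg_index, children=None, docs=None):
        self.agg_index = agg_index      # aggregate index: term -> weight
        self.children = children or []  # subclass nodes
        self.docs = docs or {}          # endpoint only: doc id -> document index

def relevance(query, index):
    """Query-to-index relevance as a weighted sum (an assumed measure)."""
    return sum(w * index.get(term, 0) for term, w in query.items())

def basic_search(root, query):
    """Descend from the root, selecting the subclass of greatest score at
    each level, until an endpoint is reached; then order the documents of
    the terminal class by decreasing relevance."""
    node = root
    while node.children:
        node = max(node.children, key=lambda c: relevance(query, c.agg_index))
    return sorted(node.docs,
                  key=lambda d: relevance(query, node.docs[d]),
                  reverse=True)
```

Only one relevance computation per class on the chosen chain is performed, rather than one per document, which is the point of the stored classification.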
It is clear that the total number of relevance
computations for such a search may be substantially smaller than the total number of documents. Moreover, the sequential search tree permits a fuller exploitation of the user's understanding of his objective than would be possible within the framework of a serial search, i.e., the
computation of a relevance value for each document.
In the feedback search, the user participates in the
search in two ways: he makes decisions, and he modifies his query. The basis for both user actions is the information provided him by the system, concerning the nature of the alternatives of a given decision node. The alternatives are the subclasses of the class represented by the decision node of the sequential search tree; the system presents the user with a suitable characterization of these subclasses.
Based upon this information the user selects the next
decision node (subclass). Moreover, these characterizations of the alternatives enable the user to specify his objective with better precision and in a manner which better matches his information needs with the stored information of the system. Consequently, the user may modify his query on each transaction, i.e., from decision node to decision node, from the root of the sequential search tree to the terminal class.
The other major activity of the operational FERRET
system is updating: as documents are acquired subsequent to system initialization, they must be assimilated. As indicated in Figure 2.2, the first step of updating is content analysis. The resulting document representation and the document reference for presentation to the user are retained in the information store.
To make the new document retrievable, however, requires that it be included in an appropriate endpoint of the sequential search tree; the new document must be classified within the classification derived prior to the acquisition of the document. This is achieved by a process similar to the basic search procedure.
The document representation is treated as a query in
order to identify the classes at each level of classification to which the document belongs. The document identifier is included in the class identity of the terminal class of the search tree. The class representation of each class of the chain of classes to which the document is assigned, from the major class to the terminal class, is modified in accordance with the representation of the new document.
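A hypothetical sketch of this assimilation step (Python for illustration; the tree layout, the score function, and the additive merging of indexes are assumptions of the sketch, since the details are not fixed here):

```python
from collections import Counter

class Node:
    def __init__(self, agg_index=None, children=None, docs=None):
        self.agg_index = Counter(agg_index or {})     # class representation
        self.children = children or []                # subclass nodes
        self.docs = docs if docs is not None else {}  # endpoint: doc id -> index

def score(doc_index, agg_index):
    return sum(w * agg_index.get(t, 0) for t, w in doc_index.items())

def assimilate(root, doc_id, doc_index):
    """Treat the document index as a query: descend to a terminal class,
    folding the document index into the class representation of every
    class on the chain, and include the document in the terminal class."""
    node = root
    while True:
        node.agg_index.update(doc_index)   # modify the class representation
        if not node.children:
            node.docs[doc_id] = doc_index  # include in the terminal class
            return
        node = max(node.children, key=lambda c: score(doc_index, c.agg_index))
```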
Evidently this method of updating, which is the assignment of documents to predefined classes, presupposes that the assimilation of a new document would not materially affect the classification, i.e., that the initial document set is adequately representative of all future documents relative to subject matter. To the extent that this assumption is not valid, the classification is degraded and retrieval effectiveness deteriorates. This can be remedied by infrequent reinitialization.
However, since this study is concerned primarily with the classification derivation and the search procedures associated with the classification, the updating problem is not treated in detail here.
2.4. Concluding Remarks
The organization of documents in document and reference retrieval systems for retrieval purposes has been discussed. Several efforts toward automatic classification for mechanized reference retrieval systems have been reviewed in illustration of specific approaches to the problem of automatic classification and of the different facets of the problem: classification derivation, assignment, and class characterization.
FERRET, a feedback reference retrieval system, has been introduced to provide the necessary context for the specific concern of this study: the machine construction of a multilevel classification of a document set based on a measure of similarity on the documents and intended for more efficient query processing and the effective use of user feedback during the search.
CHAPTER 3
GRAPH THEORETICAL COVER GENERATION
3.1. Introduction
The problem with which this chapter and the following
two chapters are concerned is that of deriving automatically a multilevel classification of a document set based on the given logical or numeric document indexes. The classification representation transformation and the search procedures are discussed in Chapter 6.
The first step toward the solution of the problem is straightforward: the computation of the document-document similarity matrix by the application of a suitable measure of similarity to the pairs of document representations.
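The measure of similarity is deliberately left unspecified here; as one concrete possibility, the sketch below (Python, for illustration only) uses the cosine measure on term-weight indexes.

```python
import math

def cosine(a, b):
    """Cosine similarity of two indexes (term -> weight dictionaries)."""
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = math.sqrt(sum(w * w for w in a.values())) * \
          math.sqrt(sum(w * w for w in b.values()))
    return num / den if den else 0.0

def similarity_matrix(indexes):
    """The document-document similarity matrix for a list of indexes."""
    n = len(indexes)
    return [[cosine(indexes[i], indexes[j]) for j in range(n)]
            for i in range(n)]
```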
The identification of the major classes from the
similarity matrix requires two distinct activities: the generation of one or more covers of the document set, and the evaluation of these different covers for the selection of the best cover as the collection of major classes. Evidently, the subclassification of any class requires the same processes, viz., the generation of covers of the class and the selection of the best.
This chapter and Chapter 4 are concerned with the cover generation problem. Chapter 5 provides for cover evaluation and selection and unifies the cover generation and cover selection techniques into an algorithm for the derivation of
a multilevel classification of the document set.
The fundamental approach to the problem of generating covers of a given class of documents is a specific form of graph theoretical cluster analysis. A definition of
clusters in a graph, developed below, is applied to the graph of the documents of the class induced by a particular similarity threshold. (The method of selecting a decreasing sequence of similarity thresholds for a given class of documents is discussed in Chapter 5.) The key notion in the definition of clusters in a graph is that of maximal complete subgraphs of the graph, or "cliques," as they are termed by Harary and Ross [20]; clusters are defined below to be the unions of certain collections of cliques of the graph.
This approach is based on the premise that a class of documents is composed of an unknown number of clusters of documents with respect to subject matter; and that documents of similar subject matter have similar representations, i.e., indexes. In this event, a clique of the graph formed by the application of a threshold to the documentdocument similarities constitutes a maximal set of documents in which the similarity between each pair is at least as great as the threshold. The formation of clusters
from cliques produces a cover which is not necessarily a partition, i.e., the subclasses of the class are not necessarily pairwise disjoint. This is perfectly appropriate to the task at hand, however, since there is no justification for the restrictive assumption that the
subject areas of discourse of the document set are nonoverlapping.
This represents an advantage of this method for the
present application over those which necessarily produce disjoint subclasses and are devised primarily for pattern recognition problems, e.g., the k-means method of J. MacQueen described by Nagy [21], and the minimum spanning tree technique of Zahn [22]. Another advantage of the present method is that the number of classes of the cover is not required to be input to the cover generation procedure, as is the case in the k-means method, for example. Unlike many pattern recognition problems [23], in the problem at hand one has no a priori knowledge of the number of subclasses of the class of objects under analysis.
The remainder of this chapter is organized into two
sections. The first presents the basic definitions required, particularly those from graph theory; the second is concerned with the definition of clusters in a graph.
3.2. Basic Definitions
A cover of a set S is a collection U of subsets of the set such that S = ∪U; U is an efficient cover in case no member of U is properly contained in any other. The term clustering is also used, in the context of the problem at hand, to denote a cover resulting from a cluster analysis; the members of such a cover are termed clusters.
A collection A of sets refines a collection B of sets in case A ∈ A implies the existence of a member B of B such that A ⊆ B.
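These definitions translate directly into small predicates; a sketch (Python for illustration, with the member sets represented as frozensets):

```python
def is_cover(U, S):
    """U is a cover of S in case S equals the union of U."""
    return set().union(*U) == set(S)

def is_efficient(U):
    """No member of U is properly contained in any other."""
    return not any(a < b for a in U for b in U)

def refines(A, B):
    """A refines B in case each member of A is contained in some member of B."""
    return all(any(a <= b for b in B) for a in A)
```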
The terminology and graph theoretical definitions given below are essentially as given by Harary [24].
A graph G = (V(G), E(G)), or G = (V, E), consists of (1) a nonempty finite set V whose elements are termed points; and (2) a set of lines, E ⊆ {X : X ⊆ V, |X| = 2}, i.e., a collection of unordered pairs of points. Equivalently, a graph is an irreflexive symmetric binary relation on a nonempty finite set. A line {u, v} ∈ E is denoted more briefly by uv. The distinct points u and v are adjacent points, joined by the line uv. Each of the endpoints u and v of the line uv is also said to be incident with uv. An isolated point of a graph is a point adjacent to no other point.
If V is a singleton then G = (V, E) is a trivial graph; otherwise, G is a nontrivial graph. G is a null graph (or totally disconnected graph) in case E = ∅. G is a complete graph in case E consists of all unordered pairs of points of G, i.e., in case each pair of points of G is joined by a line.
The neighborhood N(u) of a point u of a graph G = (V, E) is the set of all points adjacent to u, together with the point u: N(u) = {u} ∪ {v : uv ∈ E}. The deleted neighborhood N0(u) of point u is N0(u) = N(u) − {u}, i.e., the set of all points adjacent to point u.
A subgraph of G = (V, E) is a graph G' = (V', E') such that V' ⊆ V and E' ⊆ E. If X is a nonempty subset of V, the subgraph G[X] generated by X is that subgraph of G whose point set is X and whose line set consists of all the lines of G which join points in X. In particular, if v is a point of a nontrivial graph G = (V(G), E(G)), the removal of the point v from G is the subgraph G − v = G[V − {v}]; the point set of G − v is V(G − v) = V(G) − {v}; the line set of G − v consists of all those lines of G not incident with the point v.
If G = (V, E) is a graph and uv ∈ E, the removal of the line uv from G is the subgraph G − uv = (V', E') of G, with V' = V and E' = E − {uv}.
The notions of neighborhoods of points, point removals, and line removals are of particular utility to the problem of clique detection.
A complete subgraph of a graph G = (V, E) is a subgraph of G which is a complete graph. A clique G' of G is a maximal complete subgraph of G, i.e., a complete subgraph of G such that if G" is a complete subgraph of G and G' is a subgraph of G", then G' = G". Because a clique is a complete graph, it suffices to specify only the point set of the clique in order to fully specify the clique. Consequently, it creates no confusion to use the term "clique" to refer to the point set of the clique, in the interest of economy of language. Indeed, one could define a clique of a graph to be a maximal (relative to set theoretic inclusion) subset X of the point set V of the graph, having the property that each pair of points of the subset are adjacent points of the graph.
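Under this point-set view, the cliques of a graph can be enumerated mechanically. The sketch below (Python, for illustration) uses the Bron-Kerbosch recursion, a standard method that is not the Harary-Ross procedure cited above.

```python
def cliques(adj):
    """Enumerate the cliques (maximal complete subgraphs) of a graph.
    adj maps each point to its deleted neighborhood (the set of points
    adjacent to it). Cliques are returned as frozensets of points."""
    found = []
    def extend(R, P, X):
        # R: current complete subgraph; P: candidate extensions;
        # X: points already tried (prevents non-maximal output).
        if not P and not X:
            found.append(frozenset(R))
            return
        for v in list(P):
            extend(R | {v}, P & adj[v], X & adj[v])
            P = P - {v}
            X = X | {v}
    extend(set(), set(adj), set())
    return found
```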
Suppose V = {v1, v2, ..., vn} is a nonempty finite collection of sets. Let E consist of those subsets of V of two elements which meet, i.e., if vi, vj ∈ V then {vi, vj} ∈ E if and only if vi ≠ vj and vi ∩ vj ≠ ∅. The graph G = (V, E) is termed the intersection graph of the collection V. In particular, if V is the collection of the cliques of a graph G', then G is the clique graph of the graph G'.
The notion of cliques, that of a certain generalization of the notion of clique graphs of graphs, and that of components of a graph are basic to the definition of clusters in graphs, which is developed in the following section. A component X of a graph G = (V, E) is a maximal subset of V such that if u, v ∈ X then u and v are connected points. The connectivity relation on V is defined as follows: suppose u, v ∈ V; then u and v are connected if u = v, or if u and v are joined by a walk in G. A walk in G is a sequence of points u0, u1, u2, ..., un with 1 ≤ n and ui−1ui ∈ E for each i = 1, 2, ..., n; such a walk joins the initial and terminal points, u0 and un. A path is a walk on which no point has multiple occurrences. That the connectivity relation is an equivalence relation on V is apparent; indeed, the components of V are the equivalence classes induced by the connectivity relation, and so partition the point set V of the graph.
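The components, being the equivalence classes of the connectivity relation, are obtained by a standard traversal; a sketch (Python for illustration):

```python
def components(adj):
    """Partition the point set of a graph into its components.
    adj maps each point to the set of points adjacent to it."""
    seen, parts = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            stack.extend(adj[u] - comp)   # walk to every connected point
        seen |= comp
        parts.append(frozenset(comp))
    return parts
```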
3.3. Clusters in Graphs
The objective of this section is to develop a means of generating clusterings of a given graph G = (V, E). A clustering of a graph is an efficient cover of the point set of the graph consisting of clusters, a cluster being a set of points which satisfies a specific definition of clusters in a graph.
In view of the definitions of the preceding section, two definitions of clusters are immediately apparent: a cluster of G is a component of G; and a cluster of G is a clique of G.
Since the component set K(G), that is, the collection of all the components of G, partitions the point set, K(G) is an efficient cover. The clustering K(G) of the graph G is termed the component clustering of G.
The clique clustering of G is the clique set Q(G), that is, the collection of all the cliques of G. That Q(G) is a cover of V follows from the fact that if p ∈ V then {p} is a complete subgraph of G and so is contained in a maximal complete subgraph of G; consequently, V ⊆ ∪Q(G). That Q(G) is an efficient cover is an immediate consequence of the definition of cliques: each member of Q(G) is a maximal complete subgraph.
The component clustering and the clique clustering of a graph are illustrated in Figure 3.1, in which the cliques are enclosed by dotted lines. The limitations of these clusterings may be appreciated by reference to the figure.
It is apparent from the figure that the component clustering and the clique clustering have complementary limitations. The 7-point component consists of two cliques of four points each, intersecting in one point. In the component clustering, of course, these two cliques are inseparable. In the clique clustering, each of the cliques of four points is a cluster; this result, two clusters overlapping in one point, seems more plausible than the single cluster of the component clustering.

Consider, on the other hand, the 5-point component, which consists of two cliques of four points meeting in three points. In this case the component clustering produces the more plausible result, a single cluster. Thus, the component clustering unites a pair of highly overlapping cliques into one cluster, which the clique clustering does not; and the clique clustering discriminates a pair of cliques with a small intersection into two clusters, which the component clustering does not.
Figure 3.1 The Clique and Component Clusterings of a Graph
The fact that two distinct points belong to the same cluster of the component clustering does not imply that the corresponding pair of objects (e.g., documents) are actually similar. In the 4-point component of the figure, for example, points 1 and 3 are both adjacent to point 2, and so are in the same component. In terms of objects and their similarities, this means only that each of the pair is similar to a third object, and so the two are in the same cluster.
More generally, a pair of objects are in the same cluster of the component clustering if there is a sequence of objects joining the pair in which each consecutive pair of objects is similar, i.e., the similarity of each consecutive pair of the sequence is not less than the similarity threshold by which the graph is defined. Indeed, a pair of objects of zero similarity may belong to the same component cluster. This "chaining" phenomenon is the consequence of the policy of liberal inference implicit in this clustering definition, viz., that for purposes of classification, similarity is transitive. The component clustering of the graph of Figure 3.1, for example, would be unaffected by the addition to the graph of lines joining point 1 to each of points 3 and 4, thereby rendering the 4-point component complete.
The clique clustering, on the other hand, embodies no such inference: a pair of points of a clique cluster are adjacent in the graph, i.e., are similar. Moreover, the clique clustering is necessarily changed by the deletion or addition of a line to the graph.
Thus, the clique clustering is the embodiment of the strictest construction possible of the given information (the graph) for the attainment of its objective in the context of this problem: the identification of subclasses of the class of which the graph is the threshold graph, relative to the numerical similarities of the elements of the class. In the same sense, the component clustering represents the loosest construction possible.
If, on the contrary, the cluster analysis dictated that a pair of points from different components of the graph belonged to the same cluster, it could do so only arbitrarily, there being no information implicit in the graph from which such a condition could be inferred. It would also clearly be arbitrary, and therefore unjustified, to produce a cover such that a clique of the graph was not contained in any member of the cover.
In view of the foregoing considerations, the concept of clustering may be made more specific: a clustering of a graph is an efficient cover of the point set of the graph which refines the component set of the graph and which is refined by the clique set of the graph, i.e., an efficient cover C satisfying Q(G) < C < K(G). (That Q(G) refines K(G) is an obvious and immediate consequence of the definitions of cliques and components.)
It will now be shown that the clusterings of G under
this definition form a lattice under the refinement relation. Specifically, this lattice is an interval of the lattice of all efficient covers of V .
The refinement relation < on the collection E(V) of all efficient covers of V is a partial order. The reflexivity and transitivity of the relation are immediate consequences of its definition and the reflexivity and transitivity of the inclusion relation.
Suppose that A ∈ E(V), B ∈ E(V), A < B, and B < A. Let A ∈ A; since A < B, there is a member B of B such that A ⊆ B. Since B ∈ B and B < A, there is a member A' of A such that B ⊆ A'. Since A ⊆ B ⊆ A' and A is an efficient cover, A = A'; hence, A = B ∈ B. That is, A ⊆ B. The same argument proves that B ⊆ A. Therefore (E(V), <) is a poset.
Let A ∨ B denote the collection {M : M ∈ A ∪ B and if M ⊆ M' ∈ A ∪ B then M = M'}, i.e., the maximal (relative to inclusion) members of A ∪ B. It is evident that A ∨ B is an efficient cover of V and that it is refined by each of A and B. Suppose that C is an efficient cover refined by each of A and B. Since A < C and B < C, A ∪ B < C; and since A ∨ B ⊆ A ∪ B, A ∨ B < C. Therefore, A ∨ B is the least upper bound of A and B.
Now let A ∧ B denote the collection {A ∩ B : A ∈ A, B ∈ B, and if A' ∈ A, B' ∈ B, and A ∩ B ⊆ A' ∩ B', then A ∩ B = A' ∩ B'}, i.e., the maximal (relative to inclusion) members of {A ∩ B : A ∈ A, B ∈ B}. One sees easily that A ∧ B is an efficient cover of V. If M ∈ A ∧ B then M = A ∩ B for some A ∈ A and B ∈ B; thus, M ⊆ A ∈ A and M ⊆ B ∈ B. Hence, A ∧ B refines each of A and B. Suppose now that D refines each of A and B, and let D ∈ D. Since D < A there is an A ∈ A such that D ⊆ A; and since D < B there exists a member B of B such that D ⊆ B. Thus, D ⊆ A ∩ B ∈ {A' ∩ B' : A' ∈ A, B' ∈ B}, so there exists a maximal member M of the collection such that A ∩ B ⊆ M. That is, D ⊆ A ∩ B ⊆ M ∈ A ∧ B, which establishes that D < A ∧ B. Therefore, A ∧ B is the greatest lower bound of A and B.
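The join and meet just constructed may be computed directly as the maximal members of the indicated collections; a sketch (Python for illustration, covers given as sets of frozensets):

```python
def maximal(sets):
    """The members maximal relative to set-theoretic inclusion."""
    sets = set(sets)
    return {a for a in sets if not any(a < b for b in sets)}

def join(A, B):
    """The join: maximal members of the union of the two collections."""
    return maximal(set(A) | set(B))

def meet(A, B):
    """The meet: maximal members of the pairwise intersections."""
    return maximal(a & b for a in A for b in B)
```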
Since each pair of elements of the poset (E(V), <) has a greatest lower bound and a least upper bound, (E(V), <) is a lattice, as claimed.
Returning now to the more specific definition of clusterings of a graph G, let Ξ(G) denote the collection of all efficient covers of V which refine K(G) and are refined by Q(G). Then (Ξ(G), <), i.e., the interval [Q(G), K(G)], is a sublattice of (E(V), <), with zero Q(G) and unit K(G). Figure 3.2 provides an illustration of such a lattice. That is, the figure exhibits every efficient cover of the point set of the graph G which refines K(G) and is refined by Q(G).
The example of Figure 3.2 illustrates that it would be not only inefficient but definitely undesirable to generate all the members of Ξ(G). The cover {{1, 2, 3}, {2, 3, 4}, {5, 6, 7, 8, 9}} is one of four in which just one of the points 3 and 4 is included in the cluster containing the points 1 and 2, in spite of the symmetry of points 3 and 4 relative to points 1 and 2. Whatever basis exists for the inclusion of the point 3 (or 4) in the cluster containing points 1 and 2 applies as well to the point 4 (or 3). Consequently, any clustering generation procedure from which the arbitrary is absent will produce only those elements of the lattice of Figure 3.2 which are included in Figure 3.3.

Figure 3.2 The Lattice of Clusterings of a Graph
The arbitrary covers of the lattice of Figure 3.2
excluded from the subset of Figure 3.3 are disqualified as clusterings by the following more specific definition of clusterings of a graph.
A clustering A of a graph G = (V, E) is an efficient cover of V which satisfies: (1) A refines K(G), the component set of G; (2) A is refined by Q(G), the clique set of G; (3) if A ∈ A then A = ∪P for some P ⊆ Q(G). This last condition, which simply states that a cluster is a union of cliques, assures that if a pair of points belong to just the same cliques then they belong to just the same clusters.

Suppose that A is a clustering under this definition, that A ∈ A, and that p and q are points such that p ∈ A and q ∉ A. For some P ⊆ Q(G), A = ∪P; since p ∈ ∪P there is a clique M ∈ P such that p ∈ M. Since M ⊆ A and q ∉ A, q ∉ M. Thus, if a pair of points do not have membership in precisely the same members of A, then their clique memberships differ.
Figure 3.3 The Lattice of Nonarbitrary Clusterings of a Graph
The subset Γ(G) of Ξ(G) consisting of clusterings under this definition clearly includes K(G) and Q(G). Thus, as with Ξ(G), the refinement relation on Γ(G) establishes a lattice (Γ(G), <), a sublattice of (Ξ(G), <).
The remaining issue is how one chooses a subcollection P of the clique set Q(G) from which to form a cluster A = ∪P. The identification of the clusterings K(G) and Q(G) presents no theoretical difficulty; the remainder of this section is concerned with the problem of generating other clusterings, i.e., those intermediate to Q(G) and K(G).
Augustson and Minker [25] recognized explicitly that K(G) and Q(G) are the extremes of graph clusterings; and the potential value of intermediate clusterings has been widely appreciated [25, 26, 27, 28, 29].
An early effort in graph theoretical cluster analysis, in which the concept of cliques of a graph is exploited, is that of Bonner [26]. In addition, there is given by Bonner [26] an algorithm for the identification of the cliques of a graph. However, the definition of clustering is given only implicitly, in the form of a procedure. Moreover, the generation of a single cover, presumed to be suitable, is inconsonant with the strategy of this work, which separates the task into two parts: the cover generation, which produces several covers; and the cover evaluation, which selects the most suitable.
Jardine and Sibson [27] provide a systematic method for the generation of intermediate clusterings of a graph from its cliques. A k-partition of V is defined, for a natural number k, to be an efficient cover of V in which each pair of subsets intersects in fewer than k points. The concept of a k-partition is a generalization of that of a partition, a 1-partition being a partition. The intermediate clusterings are defined to be the particular k-partitions of V generated by the following procedure.
Step 1. Initialize Lk = Q(G), the clique set of G.

Step 2. If A ∈ Lk, B ∈ Lk, and A ≠ B implies |A ∩ B| < k, then stop.

Step 3. Let A and B be a pair of distinct members of Lk such that k ≤ |A ∩ B|; replace Lk by (Lk ∪ {A ∪ B}) − {A, B} and go to Step 2.
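The three-step procedure translates directly; a sketch (Python for illustration, the clique set supplied as frozensets of points):

```python
def k_partition(clique_set, k):
    """The Jardine-Sibson procedure: repeatedly merge any two members
    meeting in at least k points, until every pairwise intersection
    has fewer than k points."""
    L = set(clique_set)                          # Step 1: L_k = Q(G)
    while True:
        pair = next(((A, B) for A in L for B in L
                     if A != B and len(A & B) >= k), None)
        if pair is None:                         # Step 2: every overlap < k
            return L
        A, B = pair                              # Step 3: merge, repeat
        L = (L - {A, B}) | {A | B}
```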
If k is equal to or greater than the largest number m of points of any clique of G, Step 3 is never executed: the intersection of any pair of cliques has strictly fewer members than does either of the cliques. Hence, Lk = Q(G), the clique clustering.
It is evident from the procedure that each member of Lk, for any k, is the union of a nonempty subcollection of Q(G), the clique set of G; that is, if M ∈ Lk then M = ∪S for some S ⊆ Q(G), S ≠ ∅. Thus, Lk is refined by Q(G).
If |S| = 1 then, since obviously any clique is a subset of some component, M is contained in a component of G. Suppose |S| = 2, and let K1 and K2 be components containing S1 and S2, the members of S. Since S1 ∩ S2 ≠ ∅, K1 ∩ K2 ≠ ∅; but K1, K2 ∈ K(G), which is a partition of V; hence, K1 = K2. That is, M = S1 ∪ S2 is contained in some component. As an inductive hypothesis, assume that for any natural number n < n0, where n0 > 1, if |S| = n then M = ∪S is contained in some component of G. Suppose |S| = n0; since 1 < n0, S is the union (formed in Step 3) of two subcollections S' and S" of Q(G) such that (1) 1 ≤ |S'|, |S"| ≤ n0 − 1; and (2) (∪S') ∩ (∪S") ≠ ∅. By the induction hypothesis and condition (1), ∪S' ⊆ K1 and ∪S" ⊆ K2 for some K1 ∈ K(G) and K2 ∈ K(G). Because of condition (2), K1 ∩ K2 ≠ ∅; hence, K1 = K2 and ∪S ⊆ K1 ∈ K(G). This proves that Lk < K(G) for any k.
In particular, L1 < K(G). It will now be proven that K(G) < L1. Suppose M1, M2 ∈ L1 and M1 ≠ M2. If M1 and M2 were not disjoint then 1 ≤ |M1 ∩ M2|; hence, because of Step 3 of the procedure, M1 ∪ M2 replaces M1 and M2, i.e., M1, M2 ∉ L1, a contradiction. Thus, L1 is a disjoint collection; since, moreover, L1 is refined by Q(G), L1 covers V; hence, L1 partitions V. Let K be a component and M be a member of L1 which meets K; the existence of such an M is assured by K ⊆ V = ∪L1, since L1 covers V. Suppose that K ∩ (V − M) ≠ ∅. Thus, K ∩ M ≠ ∅ and K − M ≠ ∅, so {K ∩ M, K − M} partitions the component K of G; since K is connected, there exists a line uv in G with u ∈ K ∩ M and v ∈ K − M. Since {u, v} generates a complete subgraph of G, there exists a clique Q of G such that {u, v} ⊆ Q. Since Q(G) refines L1, there is a member M' of L1 such that Q ⊆ M'. Thus, since {u, v} ⊆ M', u ∈ M, and v ∉ M, M ∩ M' ≠ ∅ and M ≠ M', contradicting the fact that L1 is a partition. Therefore the supposition that a component meets more than one member of L1 is false. That is, K(G) refines L1.

Thus, K(G) < L1 and L1 < K(G), with each of K(G) and L1 a partition and, hence, an efficient cover of V. Since the refinement relation on the class of efficient covers of V is a partial ordering, L1 = K(G).
To summarize the Jardine-Sibson k-partitions: L1 = K(G); Q(G) < Lk < K(G) for each k = 1, 2, ...; and Lk = Q(G) for each k = m, m+1, ..., where m = max {|Q| : Q ∈ Q(G)}. Moreover, since any member of any Lk is a union of cliques, each k-partition Lk qualifies as a clustering, i.e., each is a member of Γ(G). On the other hand, it is certainly the case that not every member of Γ(G) is, in general, a k-partition.
The character of the k-partitions is illustrated in Figure 3.4. In particular, A = {{1, 2, 3, 4, 5}, {6, 7, 8, 9}, {7, 8, 9, 10}} is a member of Γ(G) which is not a k-partition. However, A seems less plausible than each of the k-partitions. This suggests that the definition of clustering is still, after all, lacking in specificity, i.e., that it is not desirable to generate all members of Γ(G); this issue will be further pursued later in this section.

Figure 3.4 The k-Partitions of Jardine and Sibson
An unfortunate aspect of the k-partitions is that the intermediate clusterings, i.e., those Lk with k = 2, 3, ..., m−1, are difficult to characterize, because of the procedural nature of the definition of k-partitions. An alternative definition is suggested by the recognition that the procedural definition applied to the case k = 1 produces clusters M ∈ L1 having the special property that M = ∪S, where S is a component of the clique graph of G. The clique graph of G has Q(G) as point set, with a pair of cliques of G adjacent in case their intersection is nonempty. Indeed, it is clear that the union of all the points of a component of the clique graph of G is precisely a component of G, and that the collection of all such unions is K(G) = L1.
The clustering definition given below, as an alternative to the Jardine-Sibson procedural definition of k-partitions, is the first of three to be considered, all of which are based on generalizations of the concept of the clique graph of a graph. For k = 1, 2, ..., m, where m is the number of points of a largest clique of G, the type-1 k-clique graph G¹ₖ(G) = (V¹ₖ, E¹ₖ) of G is defined as follows: V¹ₖ = Q(G); if M1, M2 ∈ Q(G) then M1M2 ∈ E¹ₖ if and only if M1 ≠ M2 and k ≤ |M1 ∩ M2|. A type-1 k-cluster is the union of a subcollection of Q(G) which constitutes a component of G¹ₖ(G). The type-1 k-clustering of G consists of all the maximal k-clusters of G. The type-1 k-clique graphs of the graph of Figure 3.4 are exhibited in Figure 3.5, along with the graph and its cliques; from this information one easily sees that the type-1 k-clusterings coincide precisely with the k-partitions of Jardine and Sibson.
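Restated computationally, the definition amounts to: build the type-1 k-clique graph over the clique set, take its components, form the unions, and retain the maximal ones. A sketch (Python for illustration):

```python
def type1_clustering(clique_list, k):
    """The type-1 k-clustering of a graph, computed from its cliques
    (given as frozensets of points)."""
    C = list(clique_list)
    # Adjacency of the type-1 k-clique graph: intersection of size >= k.
    adj = {i: {j for j in range(len(C))
               if j != i and len(C[i] & C[j]) >= k}
           for i in range(len(C))}
    clusters, seen = set(), set()
    for s in adj:                      # components of the k-clique graph
        if s in seen:
            continue
        comp, stack = set(), [s]
        while stack:
            u = stack.pop()
            if u not in comp:
                comp.add(u)
                stack.extend(adj[u] - comp)
        seen |= comp
        clusters.add(frozenset().union(*(C[i] for i in comp)))
    # Keep only the maximal k-clusters, as the definition requires.
    return {c for c in clusters if not any(c < d for d in clusters)}
```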
Figure 3.5 Type-1 k-Clique Graphs of a Graph
However, a proof that each k-partition of any graph coincides with the type-1 k-clustering would be difficult indeed, because such is not, in fact, the case, although counterexamples, one of which is given in Figure 3.6, are quite rare.
a. The graph G and its cliques
b. The type-1 2-clique graph
c. The type-1 2-clustering
d. The Jardine-Sibson 2-partition

Figure 3.6 A Graph Whose Type-1 2-Clustering Differs from the Jardine-Sibson 2-Partition
Before criticism of the type-1 clustering is begun, the necessity of the term maximal in the definitions of k-clusterings will be demonstrated. The graph of Figure 3.7 has five cliques of 3 points and one clique of 2 points. The type-1 2-clique graph has two components: one consists of all the 3-point cliques; the other is the isolated 2-point clique. Thus, there are two 2-clusters of the graph; however, one contains the other. The requirement that the covers of V be efficient therefore requires the specification of maximality of the k-clusters, as in the definition.
An immediate consequence of the definition of G¹ₖ(G) is that the point M ∈ Q(G) is an isolated point of G¹ₖ(G) for every k ≥ |M|. Therefore, {M} is a component of G¹ₖ(G), so that M = ∪{M} is a type-1 k-cluster of G. Although it is possible, as in the graph of Figure 3.7, that M be properly contained in another type-1 k-cluster, it is more likely that M be a maximal type-1 k-cluster, i.e., that the clique M of G be a member of the type-1 k-clustering. Thus, each clique of two points normally constitutes an entire cluster, except in the component clustering. Similarly, a clique of i points is generally a member of the type-1 k-clustering for all k ≥ i. The type-1 intermediate clusterings consequently tend to be rather highly fragmented covers, consisting of many small clusters and a few large clusters.
a. A graph G and its cliques
b. The type-1 2-clique graph of G
c. The type-1 2-clusters of G

Figure 3.7 A Graph in Which One Type-1 2-Cluster Contains Another
The type-2 k-clusterings of a graph G correct this shortcoming by taking into account the sizes of the cliques in determining the adjacencies in the generalized clique graph of G. The type-2 k-clique graph G^2_k(G) = (V2, E^2_k) has point set V2 = Q(G); if k ≥ m, the maximum clique size of G, then G^2_k(G) is totally disconnected; otherwise, a pair of distinct cliques M1 and M2 of G are adjacent in G^2_k(G) in case neither is a singleton and |M1 ∩ M2| ≥ min {k, |M1| - 1, |M2| - 1}. A type-2 k-cluster is the union of all the cliques of G of a component of G^2_k(G), and the type-2 k-clustering of G consists of all the maximal type-2 k-clusters of G.
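Stated as a predicate, the type-2 adjacency test depends only on the two cliques, the level k, and the maximum clique size m. The sketch below is illustrative; the function name and example cliques are hypothetical.

```python
def type2_adjacent(m1, m2, k, m):
    """Type-2 adjacency for distinct cliques m1 and m2 of a graph whose
    largest clique has m points: for k >= m the k-clique graph is totally
    disconnected; otherwise neither clique may be a singleton, and the
    overlap must reach min(k, |m1| - 1, |m2| - 1)."""
    if k >= m or len(m1) == 1 or len(m2) == 1:
        return False
    return len(m1 & m2) >= min(k, len(m1) - 1, len(m2) - 1)

# two 2-point cliques meeting in one point remain adjacent for every k < m:
print(type2_adjacent({1, 2}, {2, 3}, 2, 4))   # True
print(type2_adjacent({1, 2}, {2, 3}, 4, 4))   # False, since k = m
```

Note that such a pair stays together at every level below m, which is precisely the behavior criticized in the sequel.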
A pair of cliques which meet in the maximum number of points possible, viz., one less than the number of points of the smaller, are contained in the same type-2 k-cluster for every k < m; i.e., they are in separate clusters only in the clique clustering. The type-2 1-clustering and m-clustering are the component and clique clusterings, just as those of type-1. The intermediate clusterings, however, differ, as illustrated in Figure 3.8.
It is clear from the definitions of E^1_k and E^2_k that G^1_k is a subgraph of G^2_k for each value of k: if a pair of cliques of G are adjacent in G^1_k, they meet in at least k points, and so are adjacent in G^2_k.
The type-2 2-clustering of the graph of Figure 3.8 is apparently superior to the type-1 2-clustering, particularly viewed as the successor of the 1-clustering.
[Figure 3.8 Type-1 and Type-2 k-Clusterings of a Graph: the type-1 and type-2 k-clique graphs and k-clusterings for k = 1, 2, 3, 4]
However, considering the 3-clustering as the predecessor of the clique clustering, the type-1 3-clustering is apparently superior to the type-2 3-clustering. Indeed, although the type-2 clustering does remedy the noted defect of the type-1 clustering, it does so by means of a complementary defect of its own.
This situation is most clearly exemplified by a pair of cliques M1 and M2 of G which meet in one of the two points of M1. The type-1 k-clusterings separate M1 and M2 in every case but the component clustering, while the type-2 k-clusterings separate M1 and M2 in no case but the clique clustering. Thus, the type-1 sequence of clusterings for k = 1, 2, ..., m has a large discontinuity from k = 1 to k = 2, whereas the type-2 clusterings have a large discontinuity from k = m - 1 to k = m. Expressed differently, the type-1 intermediate clusterings form a sequence which approaches the clique clustering, while the type-2 intermediate clusterings form a sequence which approaches the component clustering.
Now the general approach of both definitions provides the component clustering, the clique clustering, and at most m - 2 intermediate clusterings, with each pair of the resulting clusterings related under the refinement relation. That is, the resulting sequence of clusterings forms a chain in the lattice of clusterings from the unit to the zero of the lattice, consisting of at most m clusterings, m being the size of a largest clique of the graph. Such a chain is generally a subsequence of a maximal chain of length greater than m. For example, the chain of type-1 clusterings of Figure 3.8 is a proper subsequence of the chain: the 1-clustering, the type-2 2-clustering, the type-1 2-clustering, the type-1 3-clustering, and the 4-clustering. Similarly, the chain of type-2 clusterings is a proper subsequence of the chain: the 1-clustering, the type-2 2-clustering, the type-2 3-clustering, the type-1 3-clustering, and the 4-clustering.
The defects of the type-1 and type-2 clusterings may both be characterized as a failure to satisfy a continuity criterion: each successive pair of the sequence of m clusterings should be separated by about the same number of clusterings on the maximal chain containing the chain. That is, the chain of clusterings should constitute a sequence of m - 1 equal-sized steps from the component clustering to the clique clustering.
Indeed, this criterion can easily be made quantitatively precise; e.g., a chain of m clusterings in the lattice from the unit to the zero is continuous in case the root-mean-square of the m - 1 distances between successive members of the chain is minimized. (A suitable metric is given in Chapter 5.) However, the application of such a criterion would require the generation of the entire lattice of clusterings, which is contrary to the present general strategy.
The final generalization of the clique graph of a graph for the generation of clusterings is motivated by the foregoing considerations. The type-3 k-clique graph G^3_k(G) of the graph G has point set V3 = Q(G); a pair of cliques M1 and M2 of G are adjacent in G^3_k(G) in case k ≤ ⌈(1 + m) |M1 ∩ M2| / |M1 ∪ M2|⌉, where m is the size of a largest clique of G and ⌈x⌉ denotes the smallest integer not less than the real number x. The factor |M1 ∩ M2| / |M1 ∪ M2| is the Tanimoto similarity [9] of the sets M1 and M2, discussed in Chapter 2, in which it was denoted S1. The adjacencies in type-1 k-clique graphs take into account only the overlap of a pair of cliques of G; the type-2 adjacencies take into account the overlap of a pair and the size of the smaller; the type-3 adjacencies take into account the overlap and the sizes of both cliques of a pair. (The use of clique-clique similarity for combining cliques into clusters was anticipated by Gotlieb and Kumar [28], as a method for the generation of one particular intermediate clustering: a similarity threshold is applied to the clique-clique similarity matrix to define a graph whose cliques are found, etc., until the number of cliques at some level is suitable; then the procedure backs out, taking the union of the cliques at each level, until, finally, one has a cover of the point set consisting of a suitable number of subsets of points.)
Suppose M1, M2 ∈ Q(G) and M1 ∩ M2 = ∅. Then S1(M1, M2) = 0, so for no k = 1, 2, ... is k ≤ ⌈(1 + m) S1(M1, M2)⌉ = ⌈0⌉ = 0. Thus, a pair of disjoint cliques are nonadjacent in each type-3 k-clique graph.
Suppose now that M1 ∩ M2 ≠ ∅. Then 0 < (1 + m) S1(M1, M2), so that 1 ≤ ⌈(1 + m) S1(M1, M2)⌉; i.e., M1 and M2 are adjacent in the type-3 1-clique graph. Therefore, the type-3 1-clustering is again the component clustering of G.
It is clear that the greatest possible similarity S1(M1, M2) of a pair of cliques is (m - 1) / (m + 1), corresponding to the overlap of a pair of largest cliques in m - 1 points. In this case, ⌈(1 + m) S1(M1, M2)⌉ = ⌈(1 + m)(m - 1) / (m + 1)⌉ = ⌈m - 1⌉ = m - 1 < m. Thus, the type-3 m-clustering is again the clique clustering.
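Both boundary facts, adjacency of disjoint cliques at no level and the clique clustering at k = m, can be checked numerically. The sketch below computes ⌈(m + 1) S1⌉, the largest level at which a pair is type-3 adjacent; the helper name and example cliques are hypothetical, and exact rational arithmetic keeps the ceiling free of floating-point error.

```python
from math import ceil
from fractions import Fraction

def type3_max_k(m1, m2, m):
    """Largest k for which cliques m1 and m2 are adjacent in the type-3
    k-clique graph: ceil((m + 1) * S1), with S1 the Tanimoto similarity
    |m1 & m2| / |m1 | m2|."""
    s1 = Fraction(len(m1 & m2), len(m1 | m2))
    return ceil((m + 1) * s1)

# two largest cliques (m = 5) overlapping in m - 1 = 4 points:
print(type3_max_k({1, 2, 3, 4, 5}, {2, 3, 4, 5, 6}, 5))   # 4, i.e., m - 1
# disjoint cliques are adjacent at no level:
print(type3_max_k({1, 2}, {3, 4}, 5))                      # 0
```

The first pair reaches level m - 1 but never level m, which is exactly why the type-3 m-clustering is the clique clustering.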
In spite of the fact that the chain of type-3 intermediate clusterings is, loosely speaking, intermediate to those of type-1 and type-2, it is not in general the case that for each k, G^1_k(G) is a subgraph of G^3_k(G), nor that G^3_k(G) is a subgraph of G^2_k(G). A counterexample to the first is a graph G of two cliques, M1 and M2, each of 9 points, meeting in 3 points. In this case, M1M2 ∈ E^1_3(G) but M1M2 ∉ E^3_3(G), since 3 > ⌈(9 + 1)(3) / (15)⌉ = ⌈2⌉ = 2. A counterexample to the latter is a graph G having a pair M1 and M2 of cliques of 3 points, meeting in one point, and a maximum clique size m = 5. Since 2 ≤ ⌈(5 + 1)(1) / (5)⌉ = ⌈6/5⌉ = 2, M1M2 ∈ E^3_2(G); however, since |M1 ∩ M2| = 1 < 2 = min {2, |M1| - 1, |M2| - 1}, M1M2 ∉ E^2_2(G).
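Both counterexamples can be verified by evaluating the adjacency conditions directly; the following sketch is illustrative only.

```python
from math import ceil
from fractions import Fraction

# First counterexample: two 9-point cliques meeting in 3 points (m = 9).
a, b = set(range(1, 10)), set(range(7, 16))
print(len(a & b) >= 3)                               # True:  M1M2 in E^1_3
print(3 <= ceil((9 + 1) * Fraction(3, 15)))          # False: M1M2 not in E^3_3

# Second counterexample: 3-point cliques meeting in one point, m = 5.
c, d = {1, 2, 3}, {3, 4, 5}
print(2 <= ceil((5 + 1) * Fraction(1, 5)))           # True:  M1M2 in E^3_2
print(len(c & d) >= min(2, len(c) - 1, len(d) - 1))  # False: M1M2 not in E^2_2
```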
To illustrate the differences among the three types of clusterings, a graph G of thirteen cliques of two, three, and four points is given in Figure 3.9.
For each of the three types of generalized clique graphs, it is obviously the case that each line of the (k+1)-clique graph is a line of the k-clique graph, i.e., for each k = 1, 2, ..., m-1, E^i_{k+1}(G) ⊆ E^i_k(G) for i = 1, 2, 3. Consequently, the generalized clique graphs may be conveniently specified by giving, for each pair of nondisjoint cliques, the maximum value of k for which the pair are adjacent in the k-clique graph.
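Because of this nesting, the whole sequence of generalized clique graphs of each type is determined by one number per pair of overlapping cliques, in the manner of Table 3.1. A sketch of the computation follows; the clique list is hypothetical, not that of Figure 3.9, and the closed forms for the type-2 column follow from the definition above.

```python
from itertools import combinations
from math import ceil
from fractions import Fraction

def max_adjacency_table(cliques, m):
    """For each pair of nondisjoint cliques, the largest k at which the
    pair is adjacent in the type-1, type-2, and type-3 k-clique graphs."""
    rows = []
    for a, b in combinations(cliques, 2):
        ov = len(a & b)
        if ov == 0:
            continue                     # disjoint pairs are never adjacent
        k1 = ov                          # type-1: adjacent iff overlap >= k
        if min(len(a), len(b)) == 1:
            k2 = 0                       # singletons are never type-2 adjacent
        elif ov >= min(len(a), len(b)) - 1:
            k2 = m - 1                   # adjacent for every k < m
        else:
            k2 = ov                      # min(k, sizes - 1) = k up to the overlap
        k3 = ceil((m + 1) * Fraction(ov, len(a | b)))   # type-3
        rows.append((tuple(sorted(a)), tuple(sorted(b)), ov, k1, k2, k3))
    return rows

for row in max_adjacency_table([{1, 2}, {2, 3, 7}, {3, 4, 5, 6}], m=4):
    print(row)
```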
This manner of specification is used in Table 3.1 for the sequences of generalized clique graphs of the graph G of Figure 3.9, for each of the three definitions. Each row of the table corresponds to a pair of cliques of G having at least one point in common. The pair of cliques are identified according to their designations in Figure 3.9; the sizes of the cliques are given along with their names. Following is the overlap of the pair, i.e., the number of points in common. Finally, the three numbers k1, k2, and k3 indicate the greatest values of k for which the pair of cliques are adjacent in the three types of generalized clique graphs.
[Figure 3.9 A Graph for the Illustration of the Differences Among Type-1, Type-2, and Type-3 Clusterings]

[Table 3.1 Adjacencies in the Generalized Clique Graphs. Each row lists a pair of nondisjoint cliques Mi and Mj of the graph of Figure 3.9, their sizes |Mi| and |Mj|, their overlap |Mi ∩ Mj|, and the largest k for which Mi and Mj are adjacent in the type-1, type-2, and type-3 k-clique graphs; the tabulated values are not recoverable from this copy.]

In the first row, for example, one reads that cliques M1 and M2 are cliques of two points and that they have one point in common. Since 1 = |M1 ∩ M2| < 2, M1 and M2 are adjacent only in the type-1 1-clique graph. Since |M1 ∩ M2| = |M1| - 1, M1 and M2 are adjacent in the type-2 k-clique graphs for k ≤ 3 = m - 1. Since (m + 1) S1(M1, M2) = 5(1/3) and 1 < 5/3 ≤ 2, M1 and M2 are adjacent in the type-3 k-clique graphs for k ≤ 2.
The generalized clique graphs specified by the table are shown in Figure 3.10 for k = 2 and k = 3 = m - 1, the intermediate cases. As expected, the type-3 2-clique graph closely resembles the type-2 2-clique graph; the latter has one additional line. Similarly, the type-3 (m-1)-clique graph differs little from the type-1 (m-1)-clique graph; the former has one additional line.
The intermediate clusterings under the three definitions are indicated in Figures 3.11, 3.12, and 3.13. The 1-clusterings and m-clusterings are just the component and clique clusterings for each type; since these are given in Figure 3.9, they are not repeated here.
That the type-3 chain of clusterings is the smoothest sequence of the three from the component clustering to the clique clustering is apparent from the figures. A quantitative indication of this may be seen in Figure 3.14, which plots the number of k-clusters as a function of k for each of the three definitions. The number of type-3 clusters is more nearly a linear function of the parameter k than is either of the other two. Predictably, the number of type-1 clusters experiences a large jump between k = 1 and k = 2; similarly, the number of type-2 clusters has a large gap between k = m - 1 and k = m.

[Figure 3.10 The Three Types of Intermediate Generalized Clique Graphs of the Graph of Figure 3.9, for k = 2 and k = 3]

[Figure 3.11 The Type-1 Intermediate Clusterings of the Graph of Figure 3.9: (a) k = 2; (b) k = 3]

[Figure 3.12 The Type-2 Intermediate Clusterings of the Graph of Figure 3.9: (a) k = 2; (b) k = 3]

[Figure 3.13 The Type-3 Intermediate Clusterings of the Graph of Figure 3.9: (a) k = 2; (b) k = 3]

[Figure 3.14 Numbers of Clusters of the Graph of Figure 3.9 Versus the Parameter k]
The superiority of the type-3 clusterings over those of type-1 and type-2 is due principally to the fact that the similarities of the pairs of cliques determine adjacency in the type-3 generalized clique graph, whereas the other types utilize less information for that purpose. Indeed, the definition of type-3 k-clusterings provides the key to an even more specific, and final, definition of clusterings of a graph.
Consider a pair M1 and M2 of cliques of a graph G, and suppose that M1 and M2 are adjacent in the type-3 k-clique graph of G. Then k ≤ ⌈(m + 1) S1(M1, M2)⌉, where m is the number of points of a largest clique of G. Thus, k - 1 < (m + 1) S1(M1, M2), or (k - 1) / (m + 1) < S1(M1, M2). For k = 1, 2, ..., m, define tk = (k - 1) / (m + 1); then 0 = t1 < t2 < ... < tm = (m - 1) / (m + 1), the greatest similarity possible between a pair of sets, each having no more than m points. Consequently, the type-3 k-clique graph may be defined alternatively as follows. The clique tk-graph Htk(G) of a graph G has the clique set Q(G) as its point set, with a pair of distinct cliques of G adjacent in Htk(G) in case the similarity of the pair exceeds tk.
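That the two formulations agree, i.e., that k ≤ ⌈(m + 1) s⌉ exactly when s > tk, can be checked exhaustively over a grid of rational similarity values; the sketch below, for m = 4, is illustrative only.

```python
from math import ceil
from fractions import Fraction

m = 4
thresholds = [Fraction(k - 1, m + 1) for k in range(1, m + 1)]
# t1 = 0, t2 = 1/5, t3 = 2/5, t4 = 3/5

# k <= ceil((m + 1) * s)  iff  s > t_k, for any rational similarity s
for k, t in zip(range(1, m + 1), thresholds):
    for num in range(13):
        for den in range(1, 13):
            s = Fraction(num, den)
            assert (k <= ceil((m + 1) * s)) == (s > t)
print("equivalence verified")   # prints "equivalence verified"
```

The equivalence rests on the fact that, for an integer k, k ≤ ⌈x⌉ holds precisely when x > k - 1.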
This generalizes immediately by relaxing the range of choice for the threshold: the clique t-graph Ht(G), where t is a number from the unit interval. There are, however, only finitely many distinct pairs of cliques of G; therefore, the set T(G), consisting of zero and S1(M, M') for each pair of distinct cliques M and M' of G, is a nonempty finite set. Let the members of T(G) be indexed in increasing order: T(G) = {s0, s1, s2, ..., sr} with s0 < s1 < ... < sr. Clearly, for any s ≥ sr, Hs(G) is totally disconnected. More generally, if s lies between consecutive members, si ≤ s < si+1, then Hs(G) = Hsi(G). Therefore, H(G) = {Hs(G): 0 ≤ s ≤ 1} = {Hs(G): s = s0, s1, ..., sr}.
For a given threshold s ∈ I = {x: 0 ≤ x ≤ 1}, an s-similarity cluster of G is the union of the cliques of G which constitute a component of the clique s-graph Hs(G) of G. The s-similarity clustering of G consists of the maximal s-similarity clusters of G. This last stipulation assures, as usual, that inefficient covers of V are excluded.
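Since T(G) is finite, the entire family of s-similarity clusterings can be enumerated by sweeping the distinct similarity values. The following sketch is illustrative, with a hypothetical clique list.

```python
from itertools import combinations
from fractions import Fraction

def similarity(a, b):
    """Tanimoto similarity S1 of two point sets."""
    return Fraction(len(a & b), len(a | b))

def s_similarity_clustering(cliques, t):
    """Union the cliques of each component of the clique t-graph
    (cliques adjacent when S1 exceeds t); keep only the maximal unions."""
    n = len(cliques)
    parent = list(range(n))          # union-find over clique indices
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i, j in combinations(range(n), 2):
        if similarity(cliques[i], cliques[j]) > t:
            parent[find(i)] = find(j)
    comps = {}
    for i in range(n):
        comps.setdefault(find(i), set()).update(cliques[i])
    clusters = list(comps.values())
    return [c for c in clusters if not any(c < d for d in clusters)]

def clustering_tower(cliques):
    """One clustering per distinct threshold in T(G) = {0} plus the
    pairwise similarities, in increasing order of threshold."""
    ts = {Fraction(0)} | {similarity(a, b)
                          for a, b in combinations(cliques, 2)}
    return [(t, s_similarity_clustering(cliques, t)) for t in sorted(ts)]

cliques = [{1, 2, 3}, {3, 4, 5}, {4, 5, 6}, {7, 8}]
for t, clustering in clustering_tower(cliques):
    print(t, sorted(sorted(c) for c in clustering))
```

For this hypothetical input the sweep produces three clusterings: the component clustering at threshold 0, an intermediate clustering at 1/5, and the clique clustering at 1/2, each refining its predecessor.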
The collection Ω(G) of all s-similarity clusterings of G is a finite collection, since each member of Ω(G) is the s-similarity clustering of G for some s = si in the finite set T(G). If si, sj ∈ T(G) and i ≤ j, then Hsj(G) is a subgraph of Hsi(G); consequently, the sj-similarity clustering refines the si-similarity clustering. Thus, each pair of members of Ω(G) is related under the refinement relation, i.e., the refinement relation linearly orders the collection Ω(G). In particular, the 0-similarity clustering is the type-3 1-clustering, and thus is the component clustering. The sr-similarity clustering, where sr is the largest member of T(G), is the clique clustering, since Hsr(G) is totally disconnected. Consequently, the component and clique clusterings belong to Ω(G). Moreover, if 0 ≤ t ≤ sr, then Hsr(G) is a subgraph of Ht(G), which is a subgraph of H0(G); therefore, every member of Ω(G) refines the component clustering of G and is refined by the clique clustering of G.

Thus, each member A of Ω(G) is an efficient cover; each member of A is a union of cliques; and Q(G) ≤ A ≤ K(G). Therefore, (Ω(G), ≤) is a chain in the lattice (P(G), ≤) joining the unit and the zero of the lattice, i.e., a chain of clusterings from the component clustering to the clique clustering.
It will be recalled that some members of F(G) are plainly unsatisfactory, as, for example, the cover of Figure 3.15.
[Figure 3.15 An Unsatisfactory Cover Belonging to F(G)]
The development of Ω(G), to which the cover of the figure does not belong, enables the precise identification of the reasons behind the defect: a pair of cliques of similarity 1/5 are grouped together, while a pair of cliques of greater similarity, 1/2, are not merged into one cluster. Such a grouping is clearly arbitrary and unjustified, in that it requires that certain available information, the clique-clique similarities, be ignored.
Referring back to Table 3.1 and Figure 3.10, one sees that the type-1 and type-2 k-clusterings both have this flaw. Cliques M10 and M11 are adjacent in the type-1 2-clique graph, but M1 and M2 are not, although S1(M10, M11) = 1/3 = S1(M1, M2). Similarly, M7 and M8 are adjacent in the type-2 3-clique graph, but M10 and M11 are not, although S1(M10, M11) = 1/3 > 1/5 = S1(M7, M8).
Thus, just as the restriction from H(G) to F(G) was based on the elimination of arbitrariness relative to the points of G, so the restriction from F(G) to Ω(G) is based on the elimination of arbitrariness relative to the cliques of G.
The foregoing considerations justify a more specific definition of clusterings of a graph G, viz., s-similarity clusterings for thresholds s ∈ I. That is, the set of all clusterings of G is precisely Ω(G). As illustrated above, a type-1 or type-2 k-clustering may not qualify as a clustering under this more specific definition.
It is contended that this definition is valid, in the sense that it takes into account all the information available within the context of the problem, and it imposes no arbitrary conditions. The context of the problem is that a graph G = (V, E) is given, from which it is required to generate efficient covers of V. As previously discussed, a cover A of V which does not refine K(G), or which is not refined by Q(G), is not generated from the given information.
The necessity to generate covers only from the given information also requires the exclusion of nonmembers of F(G) from consideration. And, finally, it is that same requirement that necessitates the deletion from F(G) of nonmembers of Ω(G). Therefore, the degree of specificity of the definition is fully justified, i.e., each condition of the definition is logically required by the nature of the problem.
On the other hand, a more specific definition would require an additional condition. Such a condition would eliminate members of Ω(G), which is a chain or tower of covers corresponding to different clique similarity thresholds. Evidently, such a condition is not to be inferred within the context of the problem. Rather, it must be derived from considerations external to the information implicit in G = (V, E); that is, any preference criterion must be determined according to the particular application of the cluster analysis of the graph.
The type-3 k-clusterings, it will be recalled, are precisely the tk-similarity clusterings, with tk = (k - 1) / (m + 1). Thus, each type-3 k-clustering qualifies as a clustering, i.e., belongs to Ω(G). Let A(G) denote the collection of all the type-3 k-clusterings of G. Since the type-3 1-clustering and m-clustering are the component and clique clusterings, (A(G), ≤) is a subchain of (Ω(G), ≤) including the greatest and least members of Ω(G). The sequence of thresholds defining the

PAGE 2
To Susan Spencer
PAGE 3
ACKNOWLEDGMENTS This work was supported by the Army Researc h Office Durham under Grant Number DAAROD3112470G92 and by the National Science Foundation under Grant Number GK 2786 The author wishes to express his thanks to the Center for Informatics Research of the University of Florida for providing financial assistance, as well as the necessary research facilities. I am indebted to the members of my Supervisory Committee for their guidance. I am particularly grateful to Dr. J. T. Tou, Director of the Center for Informatics Research, and to Dr. A. R. Bednarek, Chairman of the Department of Mathematics, for their counsel and assistance Mrs. Betty Taylor, the Director of the University of Florida Law Library, made available to mein punch cardsthe bibliographic data and the subject inde xes of papers published in legal journals. I am thankful for this kindness, which facilitated the experimental phase of this work I am indebted to my good friend and colleague, James Hollan, for his critical reading of the first draft of this work. Finally, I am part icularl y gratefu l to my wife, Darcy Meeker, for her excellent job of typing this d i ssertation iii
PAGE 4
TABLE OF CONTENTS ACKNOWLEDGMEN T S LI S T O F TAB LE S L I ST OF FI GUR E S ABSTRAC T CHAP T ER 1. I NTRODUCTION ..... ..... 1 .1. S u mmary of the Remaining Chapters 2. CLASSIFICAT I ON AND REFERENCE RETRIEVAL. 2.1. Introduction ........... 2 2. Methods of Automatic Classification 2 2 1 Bayesian Classification .. 2 2 2 Factor Analysis .. .. 2 2 3 C 1 umps . . . 2 3. FERRET A Feedback Reference Retrieval System 2 4 Conc l uding Remarks 3 GRAPH THEORETICAL COVER GE NERATION 3 1. Introduction . 3 2. Basic Definit i ons 3 3. Clusters in Graphs 3 4. Conc l uding Remarks 4. CLIQUE DETECTION ALGORITHMS 4 .1. 4. 2. 4 2 1. 4 2 2 4 2 3. 4 3 4 3 .1. 4 3 2 4 3. 3 Introduction. . Review of Selected Algorithms Point Removal Definitions Point Remov~ l Theorems .. A Point Removal Algorithm. The Neighborhood Approach to Clique Detect ion Special Definitions . . Neighborhood Qlique Detection Theorems The Ne i ghborhood Clique Dete ction Algorithm. iv Page i i i vi vii ix 1 6 9 9 20 21 23 29 34 43 4 Ii 44 46 50 92 9 ii 94 95 98 99 102 106 106 107 111
PAGE 5
Page 4.4 The Line Removal Approac h to Clique Detection 119 4.4.1. Line Removal Def initions .... 4.4.2. Line Remova l Theorems ....... 4.4.3. The Line Removal Clique Detection Algorithm 4.5. Algorithm Timin g E xpe riments .. 4.6. Conclusions .......... 120 122 125 128 1 35 5. AUTOMAT I C CLASSIFICATION DERI V A TION 137 5,1. Introduction. . . 137 5.2. Cover Evaluation by Typic a lity. 138 5.2.1. A Met ric for the Class of Collections of Nonempty Subsets of a Finite Set 145 5.3. Eva luati6n b y Cluster Homogeneity and Cost Considerations 149 5,3.1. Cluster Homogeneity . 150 5.3.2. An Idealized Cost Function. . 151 5,3.3. An Evaluation F unction. . 155 5.4. The Classification Derivation Algorithm 160 6. THE SEQUENTIAL SEARCH T R EE. . 169 6 .1. Introduction. . . 169 6.2. Class Representat i on Transformation 170 6. 3. Updat in g the Sea rch Tree. . 17 3 6.4. Search and Retrieval. . 175 7. 7 .1. 7. 2 7. 3 7 4 7 5 EXPERIMENTAL INQUIRY AN D CONCLUSIONS. Introduction. . . The Experimental Document Set The C l assificat i on Der iv a tion Basic Searc h es Conclusions APPEND ICES A. PROOFS OF T HE POIN T REMOVAL CLIQUE 184 184 187 188 195 202 DETECTION THEOREMS 208 B. PROOFS OF THE NEIGHBORHOOD CLIQUE DETECTION THEOREMS 213 C. PROOFS OF THE LINE REMOVAL CLIQUE DETECTION T HEOREMS 219 D. A METR IC ON THE CLASS OF COLLEC TI ONS OF NONEMPTY SUBSETS OF A F I NITE SET 226 E. AN IDEALIZED CLASSIFICAT I ON TREE 239 LIST OF RE FE RENCES BIOGRAPH ICAL SKE TCH V 242 245
PAGE 6
LIST OF TABLES Table Page 3.1 Adjacencies in the Generalized Clique Graphs. 78 4.1 Algorithm 4.1 Applied to the Graph of Figure 4.2 .. 105 4.2 Algorithm 4.3 Applied to the Graph of Figure 4.2 129 4.3 Timing Comparison of Algorithms Based on Theorems 4.4 and 4.5 132 4.4 Timing Comparison of Algorithms 4.1, 4.2, and 4.3 133 4.5 Execution Times of A1 ~ 6rithms 4.2 and 4.3 on the Graphs of Figure 4.5 135 5.1 Metrics D and D 1 Applied to the Covers of Fi g ure 5.3 147 5.2 Illustration of the Cost Function 156 6.1 A DocumentTerm Matrix for Figure 6.1 172 7.1 Serial and Basic Search Respo nses 197 7.2 Performance Figures for the Sample Query. 201 7.3 Serial and Basic Search Time Averages 201 7.4 RecallPrecision Summary. . 202 vi
PAGE 7
LIST OF FIGURES Figure 2.1 FERRET Initialization .... Page 36 2.2 FERRET Updating and Retrieval 39 3.1 The Clique and Component Clusterings of a Graph 52 3.2 The Lattice of Clusterings of a Graph 57 3.3 The Lattice of Nonarbitrary Clusterings of a Graph 59 3.4 The kPartitions of Jardine and Sibson. 64 3.5 Type1 kClique Graphs of a Graph 66 3.6 A Graph Whose Type1 2Clusterin g Differs from the JardineSibson 2Partition 67 3.7 A Graph in Which One Type1 2Cluster Contains Another 69 3.8 Type1 and Type2 kClusterin g s of a Graph. 71 3.9 A Graph for the Illustration of the Differences amon g the Type1, Type2, and Type3 Clusterin g s 77 3.10 The Three Types of Intermedi a te Generalized Clique Graphs of the Graph of Fi g ure 3.9 80 3.11 The Type1 Intermediate Clusterin g s of the Graph of Fi g ure 3.9 81 3.12 The Type2 Intermediate Clust e rin g s of the Gra p h of F i g ure 3.9 82 3.13 The Type3 Inter me diate Clu s terin gs of the Gr a ph o f Fi g ure 3.9 83 3.14 Numbers of Clusters of the Graph of Fi g ure 3.9 versus the Parameter k 84 vii
PAGE 8
Figure Page 3.15 An Unsatisfactory Cover Belonging to f(G). 87 4.1 Counterexamples to L =Bu Cu D .. 101 4.2 A Graph for Clique Detection Algorithm Illustration 104 4. 3 4.4 4. 5 A Counterexample to Counterexamples to Two Graphs, G = (V, G' = L = L* = E) (V, N u g_ 1o u 11 u and EI) with 12 u E C ~3 II E' 110 125 134 5.1 Two Efficient Covers Which Induce the Sarne Graph 141 5.2 A Clustering Which Is Not a BClassification .. 144 5.3 Four Covers of a Set of Eight Po ints. 146 5.4 An Interval Weighting Function. 159 6.1 A Small Classification Tree and the Correspondin g Search Tree 171 7.1 Summary of the (C=0.4)Classification 192 7.2 Completion of the Summary of the (C=0.2)Classification 193 viii
PAGE 9
Abstract of Dissertation Presented to the Graduate Counci l of the University of Florida in Part i al Fu l fillment of the Requirement$ for the Degree of Doctor of Phi lo sop hy MULTILEVEL AUTOMA TIC CLASSIFICATIO N FOR SEQUENT I AL REFERENCE RETRIEVA L By Robert Ernest Os teen August, 1972 Ch a irman: Dr. Julius T. Tou Ma jor Department: Electrical Engineering The primary concern of this work is the automatic derivation of a multilevel nonhierarchical classification of a set of documents, g iven the l ogica l or numerical subject indexes of the documents. The utilization of s uch a classification by mechanized search procedures is also treated. The classification i s derived from a quantitative mea sure of the documentdocument similarities based on in dexes of the pairs of documents. Application of a threshold to th e documentdocument simil arit ies transforms the document set into a graph The g raph is subjected to a c luster ana lysis wh ich typically provides several distinct clusteringsthat i s covers consisting of clusters of po int s of the g r aph A re a lv a lued evaluation function provides the mea n s to se l ect the best of the c lu sterings ( The evaluatlon function depends on the h omogene ities of the ix
PAGE 10
members of a clustering, and takes into account certain cost considerations of the implementation machinery available for a mechanized search system.) The clusters of the selected clustering constitute the immediate subclasses of the document set. This process is repeated on unsubclassified document subsets until no unsubclassified subset is large enough to warrant analysis into subclasses. A thorough analysis of the concept of clusters in graphs culminates in a specific definition of the clusterings of a graph having the following properties. A clustering of a graph is an efficient cover of the set of points of the graph (no member of a clustering is a subset of any other member). Each member of a clustering is a union of cliques (maximal complete subgraphs) of the g r ap h. Each clustering refines the collection of the connected components of the graph and is refined by the collection of the cliques of the graph, both of which qualify as clusterings. Under the refinement relation the clusterings form a chain or tower from the collection of cliques of the graph to the collection of components of the graph. It is contended that any other efficient cover contains an element of arbitrariness in its formulationthat is, requires a violation of one of the criteria underlying the definition of clusterj_ngs: a clustering i s defined in terms of the adjacencies of the points and the id ent ities of the cliques (implicit in the adjacencies)using only this information and evading none of it; and all members of a X
PAGE 11
particular clustering are formed by the application of exactly the same rule. Because the cliques of a graph are fundamental to the definition of the clusterings of the graph, the problem of the identification of cliques is treated in detail. Two new clique detection algorithms are presented. One of these is intended for use in special circumstances in which it has an efficiency advantage over other known clique detection algorithms: besides the set of lines of the graph, there is given the set of cliques of a specified subgraph on the same point set. The other new clique detection algorithm is applicable to the general clique detection problemit identifies the cliques of a graph, given only the points and lines of the graph. Timing experiments indicate that this algorithm is substantially faster than those previously available. The cover selection technique is combined with the graph theoretical cover ge neration scheme into an algorithm for multilevel classification derivation. The resulting classification is retained in the form of a sequential search tree. Search procedures, designed to utilize the sequential search tree for more efficient searching and for interactive searching, are presented. Finally, the results of a preliminary experimental investigation of the classification derivation and search utilization techniques are reported. xi
PAGE 12
CHAPTER I INTRODUCTION This study is concerned with the machine derivation and the search utilization of a multilevel classification of a document corpus by a mechanized document or reference retrieval system. A reference retrieval system [ 1, 2 J consists of a set of document references or names, document surrogates or representations for the members of the document set, and search procedures, i.e., mechanisms for the production of responses to queries. A response is, basically, a set of document references. A query is a request for a response expressed in a form which is consonant with that of the document representations and for which a search procedure exists. The document representations indic a tein some specific formthe subject matters of the r espect j_v e doc uments. A document retrieval system differs fro m a ref e~e nce retrieval system in that a re sponse is a set of documents rather than a set of document references ; consequently, there is the additional issue of the physical storage and retrieval of docu me nts. Viewing a document r e triev a l sys t em as an extension of a referenee retrieval system the document search and retrieval constitutes an a dditional step, 1
PAGE 13
2 following reference retr i eval. In a conventional library, for example, a searcher performs reference retrieval by means of the card catalog; a retrieved reference includes a decima l code by means of which the corresponding document can be physically located Although useful search strategies exist which are based upon author, publication date bibliographic citations etc., this work is concerned on l y with subject searching; that is in a ll that f ollows, quer i es document representations, and search strategies are concerned with the subject matters of documents In a document or reference retrieval system, a document representation is the result of a process of content analysis. Whether performed intellectually or mechanically, such a process necessarily treats the individual words of a document as the primary source of information concerning the subject matter of tha document. The content analysis may also make use of syntax, as exemplif i ed by the Syntol diagrammatic document surrogates [ 3 J The Syntol diagram for a document is a digraph (directed graph) with labeled directed lines The p o ints of the digraph are Syntol words, chosen to reflect the subject matter of the given document, each representing a state, an action, or an entity. The directed lines reflect relationships among the Syntol words, and the labels of the lines specify the type of relationship: coordinative consecutive, associative or predicative. A request is
similarly analyzed to produce a query, which is a Syntol diagrammatic representation of the request. The search procedure then consists in matching the query against the representations of the documents. The response consists of those documents which match the query. An example of a matching function is as follows: if there is a directed line labeled type X from Syntol word A to word B in the query, then there is a directed path in the document representation from word A to word B, each directed line of which is of relational type X.

More commonly, and somewhat less elaborately, content analysis procedures ignore syntax, which is to say, produce a document representation based only on the content-bearing words occurring in the text of the document. Such a process, which is known as keyword indexing, is clearly amenable to mechanization, assuming that the documents are available in a machine-readable medium. The simplest product of such a procedure is the representation of a document by the list of keywords occurring in the document, in which case the indexing operation is termed logical indexing. In numerical indexing, a numerical value is associated with each keyword of the document. This numerical value might be the occurrence frequency of the word in the document, or some normalization thereof, as discussed by Lancaster [4]; in probabilistic indexing [5], the value is an estimate of the probability that the document is relevant to the information needs of a user whose query consists of the single keyword.
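Logical and numerical keyword indexing can be sketched as follows; the tokenization and the small stoplist are illustrative assumptions, not the procedures of this study:

```python
from collections import Counter

STOPWORDS = {"the", "of", "a", "is", "and", "in", "to"}  # illustrative stoplist

def keywords(text):
    """Content-bearing words: lowercased tokens with stopwords removed."""
    return [w for w in text.lower().split() if w not in STOPWORDS]

def logical_index(text):
    """Logical indexing: the set of keywords occurring in the document."""
    return set(keywords(text))

def numerical_index(text):
    """Numerical indexing: each keyword paired with its occurrence frequency."""
    return Counter(keywords(text))

doc = "the retrieval of documents is the retrieval of references"
```

A probabilistic index would replace the raw frequencies with estimated relevance probabilities, as in [5].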
A descriptor is a natural language expression denoting a subject area, e.g., "Life insurance" and "Data processing systems." Unlike keywords, descriptors are not restricted to single words. Moreover, the application of a descriptor to a document might not require the occurrence of the descriptor in the document. Consequently, descriptor indexing is generally performed by intellectual content analysis. Just as with keywords, however, the product of the content analysis may take the form of logical indexing or numerical indexing. Consequently, whenever the details of the content analysis and the indexing language are not of primary concern, the phrase index terms, or more briefly terms, is used to refer to keywords, descriptors, and elements of any similar indexing vocabulary.

The type of data base which this work assumes to be given consists of a set of terms and a set of documents, logically or numerically indexed with respect to the term set. The main result of this study is the development of a technique for the automatic organization of such a document set in the form of a multilevel classification. To supplement the presentation of the classification derivation method, a description is given of a mechanized reference retrieval system which utilizes such a classification for the purposes of efficient searching and user-system interactive searching.
The multilevel classification is derived from a quantitative measure of the document-document similarities, which measure is a suitable function of the indexes of the pairs of documents. A set of documents on which a similarity is defined is transformed into a graph by the application of a threshold to the document-document similarities. A new graph-theoretical cluster analysis technique is applied to the resulting graph to identify the subclasses of the given set of documents. This process is repeated on the resulting subclasses until no unsubclassified set is large enough to warrant further classification.

The mechanized reference retrieval system with user-system interactive search procedures is termed FERRET, a Feedback Reference Retrieval system. (This system is a specific instance of a class of Sequential Feedback Information Retrieval Systems under development, the SEFIRE systems.) FERRET retains the multilevel document classification in the form of a sequential search tree, which is used by its search procedures for the identification of the subset of documents to be retrieved in response to a given query.

The following section provides further discussion of the main result, along with the organization of this presentation.
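The thresholding step can be sketched as below. For brevity the sketch extracts only connected components, the loosest of the graph-theoretical clusters; the clusters actually used in this study (Chapter 3) lie between components and cliques.

```python
def threshold_graph(sim, theta):
    """Turn a document-document similarity matrix into a graph:
    documents are points; a line joins i and j when sim[i][j] > theta."""
    n = len(sim)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] > theta:
                adj[i].add(j)
                adj[j].add(i)
    return adj

def components(adj):
    """Connected components of the graph, found by depth-first search."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps

sim = [[1.0, 0.8, 0.1, 0.0],
       [0.8, 1.0, 0.2, 0.0],
       [0.1, 0.2, 1.0, 0.9],
       [0.0, 0.0, 0.9, 1.0]]
clusters = components(threshold_graph(sim, 0.5))
```

With the threshold at 0.5 only the pairs (0, 1) and (2, 3) are joined, so the four documents fall into two subclasses.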
1.1. Summary of the Remaining Chapters

1.1.1. Chapter 2

The uses and roles of classification in reference retrieval are discussed. Several specific methods of automatic classification are reviewed. An overview of FERRET, a Feedback Reference Retrieval system, is presented.

1.1.2. Chapter 3

A definition of clusters of points of a graph is developed and validated, including particularly those clusters which are intermediate to the components (maximal connected subgraphs) and the cliques (maximal complete subgraphs) of the graph. This graph-theoretical cluster analysis method is related to similar efforts by other workers.

1.1.3. Chapter 4

Chapter 4 is devoted to the problem of the identification of the cliques of a graph. Available clique detection algorithms are reviewed. Two new algorithms are presented, along with their validating theorems and timing experiments illustrating their respective virtues.

1.1.4. Chapter 5

Chapter 3 (supported by Chapter 4) provides a method for the generation of covers of a set of documents
consisting of graph-theoretically defined clusters. The graph in question has the document set as point set, with lines joining pairs of documents whose similarities exceed a certain threshold. This cover generation method generally produces several distinct covers of a given document set. Chapter 5 is concerned with the selection of one such cover. Two quite different methods of selection are explored.

One selection criterion is the extent to which a cover is typical of the collection of all generated covers. The discussion of the "typicality" criterion includes the presentation of two possible metrics for its realization. One of these is taken from the literature; the other is a new metric devised by the author for this application.

The second cover selection criterion explored in Chapter 5 is an evaluation function depending on the homogeneities of the clusters of a cover, and taking into account certain cost considerations of the machinery available for an implementation of FERRET. Chapter 5 concludes with an algorithm based on this latter method of cover selection and the cover generation method given in Chapters 3 and 4. This algorithm produces the complete multilevel classification of a given document set from a quantitative measure of document-document similarities.
1.1.5. Chapter 6

This chapter presents the form in which the classification is retained in FERRET for search and retrieval. Also given are the FERRET search procedures, which include provision for system-user interaction. Chapter 6 also includes a brief treatment of the problem of updating the classification as documents are acquired subsequent to the initial classification derivation.

1.1.6. Chapter 7

Chapter 7 reports a preliminary experimental inquiry into the presented classification technique, illustrating both the classification derivation and the FERRET search procedures. The chapter concludes with an evaluative discussion of the classification derivation and utilization.
CHAPTER 2

CLASSIFICATION AND REFERENCE RETRIEVAL

2.1. Introduction

As stated in Chapter 1, the reference retrieval systems under consideration are those in which the documents are logically or numerically indexed. Document representations within such systems may be formally viewed as a document-term matrix: each row corresponds to a document; each column corresponds to a term; the ijth element of the matrix indicates, logically or numerically, the applicability of the jth term to the ith document. The information of a row of the document-term matrix constitutes the subject index of the corresponding document, i.e., the representation of the subject matter of the document. Whether or not the document is retrieved by a search procedure in response to a query depends chiefly upon the document index and the query itself.

The most straightforward search procedure is the serial search: each row of the document-term matrix is matched against the query; whether or not the corresponding document is included in the response to the query is decided according to the result or value of the match.
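A serial search is a single pass over the rows of the document-term matrix; a minimal sketch, with exact conjunctive matching as one illustrative match rule:

```python
# Boolean document-term matrix: rows are documents, columns are terms.
doc_term = [
    [1, 1, 0],  # document 0 is indexed by terms 0 and 1
    [0, 1, 1],  # document 1
    [1, 0, 1],  # document 2
]

def conjunctive_match(row, query_terms):
    """True when every queried term applies to the document."""
    return all(row[t] == 1 for t in query_terms)

def serial_search(matrix, query_terms, match=conjunctive_match):
    """Match the query against every row; note that rejecting a
    document costs as much work as accepting one."""
    return [i for i, row in enumerate(matrix) if match(row, query_terms)]

response = serial_search(doc_term, [0, 1])
```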
Consider a Boolean document-term matrix representing a logically indexed document set. A query is any Boolean expression of terms, and the matching process is a logical matching. The query is evaluated according to the values of the terms of a row of the document-term matrix; the corresponding document is retrieved in case that query evaluation is 1 (true), in which case the document representation is said to satisfy the logic of the query.

In case the document-term matrix is numerical, a query is an assignment of weights (numerical values) to terms. Such a query has the form of a document index or row of the document-term matrix, and may be viewed as a description of the subject matter of a hypothetical document sought by the source of the query, i.e., the user. The matching process in this case is the numerical evaluation of a measure of relevance or similarity, e.g., the cosine correlation between a row of the document-term matrix and the query, regarded as an additional row. The document "score" resulting from this type of evaluation permits the ordering of the members of the response according to the degree of relevance to the query.

An obvious disadvantage to such serial searching is that, although the response to a given query normally consists of only a small fraction of the document set, a relevance computation is required for each document, however irrelevant to the query; the same effort is required to merely negate the membership of a document in a response as
is required to identify a document as a member of the response.

One way to eliminate much of the unproductive search effort of the serial search is to use an inverted file for the organization of the document representations rather than a direct file. An entry in a direct file corresponds to a document; its value is the index of the document, e.g., a list of applicable terms. An entry in an inverted file corresponds to a term; its value is the set or list of all those documents to which the term applies. An inverted file entry therefore corresponds to a column of the document-term matrix, whereas a direct file entry corresponds to a row.

Now suppose that the documents of a document set are logically indexed, and consider a query consisting of the conjunction of two terms. Such a query is a request for just those documents indexed by both of the specified terms. The response is simply the intersection of the classes or sets of documents which are the values of the entries of the terms. Unlike the serial search of the direct file, this search expends no effort on documents indexed by neither term.

The inverted file organization of the document representations constitutes a classification of the document set. To each term there corresponds a pair of elementary classes of documents: those to which the term applies and those to which it does not. The file entry associated with the term represents the former explicitly and the latter
implicitly. A query is a Boolean expression of terms. The response is the class of documents specified by the set-theoretic expression obtained from the query by interpreting each term as the class of documents indexed by that term, i.e., the inverted file entry of that term, and by replacing conjunction, disjunction, and negation by intersection, union, and complementation with respect to the document set, respectively. Thus, the search procedure produces the response corresponding to a query formulated in terms of index terms by performing operations on the elementary classes in accordance with the logic of the query.

The inverted file system illustrates the primary utility of classification in reference retrieval systems: to provide for more efficient search procedures. It also illustrates the price which one must pay for more efficient searching: the more costly file creation and maintenance, i.e., the initial derivation and the updating of the classification. The information provided by the indexing operation on a document is precisely the value of an entry of the direct file. Updating the direct file in the event of the acquisition of a new document requires only the creation of the new entry; in particular, no existing entry is affected. With the inverted file, however, the acquisition of a document requires the modification of existing file entries: the entry of each term applied to the document is modified by the addition of the identifier of the acquired document.
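The inverted-file search can be sketched as follows; the query here is hand-translated into set operations rather than parsed from a Boolean expression:

```python
# Direct file: document -> set of applicable terms.
direct = {
    "d1": {"folklore", "legends"},
    "d2": {"legends", "heroes"},
    "d3": {"folklore", "heroes"},
}

def invert(direct_file):
    """Build the inverted file: term -> set of documents to which it applies."""
    inverted = {}
    for document, terms in direct_file.items():
        for term in terms:
            inverted.setdefault(term, set()).add(document)
    return inverted

inv = invert(direct)
universe = set(direct)

# Query: folklore AND NOT heroes.  Conjunction becomes intersection and
# negation becomes complementation with respect to the document set.
response = inv["folklore"] & (universe - inv["heroes"])
```

Only the entries of the queried terms are touched; documents indexed by neither term cost nothing.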
Although serial searching is practicable in a mechanized reference retrieval system, e.g., MEDLARS [6], the medical literature analysis and retrieval system of the National Library of Medicine, such is not the case in conventional reference retrieval systems, i.e., systems in which search procedures are executed by humans with little or no help from machines. Consequently, conventional libraries have long been concerned with the problem of classification.

The traditional method of classification begins with an a priori hierarchy of subject areas or categories, by means of which a hierarchical classification of the document set is derived. The category hierarchy has the form of a rooted tree, the root representing (implicitly) the totality of subject areas of the document set. Those nodes adjacent to the root correspond to the major division of subject matters into broad categories. The major categories are divided into subcategories, and so forth, down to the most specific categories, which correspond to endpoints of the tree. Each category is labeled with a code, e.g., a string of decimal digits, the length of which depends on the degree of specificity of the category. Consider the following example from the Dewey Decimal Classification [7]:

    398       Folklore
    398.2     Tales and Legends
    398.21    Fairy Tales
    398.22    Tales and Legends of Heroes
    398.23    Tales and Legends of Places
    398.24    Tales and Legends of Animals and Plants
    398.3     The Real
    398.4     The Unreal

This decimal code reflects the specificity relation: Tales and Legends is specific to Folklore, and the code of the latter (398) is a truncation of that of the former (398.2); indeed, Folklore is generic to any category whose decimal code begins with "398." In this case, Tales and Legends (398.2) has been further analyzed into four subcategories, each of which is an endpoint.

The document classification corresponding to such a subject category hierarchy has a class of documents for each category. A document is intellectually analyzed to determine the most specific subject category which applies to the document, and the document is labeled with the decimal code for that category. The document belongs to the class associated with the category and to every class associated with categories generic to the category. A document labeled 398.2, for example, belongs to the classes associated with 398.2 and 398, but not to the classes associated with 398.21, 398.22, 398.3, or 398.4. Thus, any pair of classes are either disjoint or related by the inclusion relation. That is, if a pair of classes have one document in common, then one class contains all the documents
belonging to the other. Another aspect of this type of classification is that it is suitable for the physical organization and storage of documents: the documents are stored according to the decimal codes with which they are labeled.

The above hierarchical classification is an example of a multilevel classification, in contradistinction to a simple classification such as the inverted file. A simple classification of a finite set consists of an efficient cover of the set, i.e., a collection of classes, subsets of the set, whose union coincides with the set, and such that no subset is properly contained in any other. A multilevel classification extends a simple classification: some of the classes of the simple classification of the finite set are themselves endowed with a simple classification, some of whose members may be further classified, and so forth.

The basis of a (subject) classification of documents is similarity of subject matter. A pair of documents in an entry of an inverted file, for example, are similar in that they have at least one term in common. A pair of documents assigned to the same class of a traditional hierarchical classification are similar in that they are both judged to be concerned with the subject area corresponding to the class. It is evident that any automatic classification scheme requires a quantitative measure of similarity on the set of objects to be classified. The attributes of the objects to
be classified may be formally represented in a logical or numerical object-attribute matrix; a row of this matrix is referred to as the attribute vector of the corresponding object. The quantitative similarity of a pair of objects is defined in terms of the attribute vectors of the objects.

Suppose the attributes are binary-valued, that is, that each attribute applies, or does not apply, to each object. The basic data in terms of which a similarity function may be defined are as follows, for a given pair of objects: the number of mismatches, i.e., attributes applicable to just one of the pair; the number of positive matches, i.e., attributes applicable to both; and the number of negative matches, i.e., attributes applicable to neither. Sokal and Sneath [8] discuss various specific possibilities for a similarity measure defined in terms of those quantities; these vary chiefly with respect to the issue of equal or unequal weightings of matches and mismatches and the issue of whether or not negative matches are taken into account. In case the objects are documents and the attributes are terms, negative matches must be ignored: one may not reasonably construe the inapplicability of a particular term to either of a pair of documents as evidence of their similarity. The Tanimoto similarity measure [9] is particularly suitable for this application; it is defined to be the ratio of the number of attributes (terms) possessed by both objects (documents) to the number of attributes possessed by either. This similarity function has values
between zero and unity; a similarity of zero indicates that the pair of objects have no attribute in common, while a similarity of unity indicates that the objects have identical attributes, i.e., an attribute applies to one object if and only if it applies to the other. The function value may be regarded [9] as the probability that an attribute applies to both objects, given that it applies to at least one of the pair.

Numerous similarity functions exist for application to attribute vectors having nonnegative numerical values. Statistical correlation, that is, the Pearson correlation coefficient [10], is the measure of similarity used in the factor analytic approach to automatic classification which is described in the following section. The attribute vector is regarded as the explicit specification of a discrete random variable; the statistical correlation of a pair of objects is then the covariance of the corresponding pair of random variables, standardized with respect to zero mean and unit variance.

The following similarity function, attributed by Salton [3] to Tanimoto, is an extension of the Tanimoto similarity for binary attribute vectors to numerical attribute vectors:

    S(v, w) = Σᵢ vᵢwᵢ / (Σᵢ vᵢ + Σᵢ wᵢ - Σᵢ vᵢwᵢ)        (2.1)
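Both the binary Tanimoto measure and its numerical extension (2.1) can be sketched directly from the definitions:

```python
def tanimoto_binary(a, b):
    """Tanimoto similarity of two binary indexes, given as term sets:
    attributes possessed by both over attributes possessed by either."""
    return len(a & b) / len(a | b)

def tanimoto_numeric(v, w):
    """Equation (2.1): extension to vectors with components in [0, 1]."""
    dot = sum(x * y for x, y in zip(v, w))
    return dot / (sum(v) + sum(w) - dot)
```

For 0-1 vectors the two forms agree, since Σᵢ vᵢwᵢ counts the positive matches while Σᵢ vᵢ + Σᵢ wᵢ - Σᵢ vᵢwᵢ counts the attributes possessed by either object.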
The range of this function, like that of the function which it extends, is the unit interval, provided, that is, that v and w are nonzero vectors over the unit interval.

The cosine correlation similarity measure [3] is the normalized inner product of a given pair of attribute vectors:

    S(v, w) = Σᵢ vᵢwᵢ / (Σᵢ vᵢ² Σᵢ wᵢ²)^(1/2)        (2.2)

The cosine correlation ranges from -1 to +1 and requires only that v and w be nonzero vectors over the real numbers. In particular, the components of the attribute vectors are not necessarily nonnegative. Consequently, a user may assign negative weights, rather than merely zero weight, to selected terms, which is analogous to negation rather than omission in a logical system.

Consider now a given document-term matrix. There are two general approaches to the problem of automatic document classification. The direct approach is to regard the rows of the document-term matrix as attribute vectors of objects, to define a measure of document-document similarity, and to classify the documents by reference to their similarities. The indirect approach is to regard the columns of the document-term matrix as attribute vectors of objects, to define a measure of term-term similarity, to classify the
terms by reference to their similarities, and to classify the documents by reference to the term classes and the document indexes.

One may distinguish, moreover, three major constituents of the document classification problem: the classification derivation problem, the class characterization problem, and the document assignment problem. The effort demanded by each of these component problems depends upon the purpose and nature of the classification scheme. For example, the traditional hierarchical classification, whose purpose is to aid the human searcher, requires a great intellectual effort for the derivation of the hierarchy of subject categories; the assignment of a document to a class requires a small intellectual effort; and the characterization of a document class requires no effort, since each document class is characterized by the associated subject category, e.g., Folklore. The inverted file system, whose purpose is to provide for more efficient mechanized searching, obviously requires only minimal effort for each of the three aspects of classification: a document is assigned to the elementary classes corresponding to the terms with which the document is indexed. Additional classification effort is required, however, during query processing: the particular document class constituting the response must be computed from the application of the logic of the query to the elementary classes.
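The cosine correlation of equation (2.2), introduced above, can be sketched as:

```python
import math

def cosine(v, w):
    """Equation (2.2): normalized inner product of two nonzero real
    vectors; the value always lies between -1 and +1."""
    dot = sum(x * y for x, y in zip(v, w))
    return dot / math.sqrt(sum(x * x for x in v) * sum(y * y for y in w))
```

Because components may be negative, a query weight of -1 on a term depresses the scores of documents indexed by that term, the numerical analogue of negation.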
The direct approach to automatic document classification from the document-term matrix requires no effort for document assignments, since the classes to which a document belongs are determined by the classification derivation; the characterization of the classes in terms of index terms, however, does require additional computation. The indirect approach, on the other hand, first produces classes of terms, which essentially constitute the corresponding document class characterizations; in this case the additional computation is the formation of the document classes from the term classes by a process of document assignment based on the class characterizations and the document indexes.

The next section provides descriptions of some specific methods which have been applied to the problems of automatic document classification.

2.2. Methods of Automatic Classification

Several approaches to automatic classification for reference retrieval purposes are described below: Bayesian classification, factor-analytically derived categories, and "clumps." Specifically graph-theoretical techniques are not included, because those are discussed in some detail in Chapter 3.
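Before turning to those methods, note that the "efficient cover" condition underlying a simple classification (the union of the classes equals the set, and no class is properly contained in another) is directly checkable; a minimal sketch:

```python
def is_efficient_cover(classes, universe):
    """True when the classes cover the universe and no class is a
    proper subset of another (the 'efficient' condition)."""
    if set().union(*classes) != universe:
        return False
    # a < b tests proper set inclusion and is False when a == b.
    return not any(a < b for a in classes for b in classes)

cover = [frozenset({1, 2}), frozenset({2, 3}), frozenset({4})]
```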
2.2.1. Bayesian Classification

Maron [11] applies probability theory to the problem of assigning documents to subject categories. The initial document set, or training set, is intellectually analyzed to determine subject categories, the documents of the training set are assigned by the analysts to the corresponding classes, and the most promising clue words (index terms) for category prediction are manually selected. The document-term incidence matrix for the training set is then formed: the ijth element indicates the occurrence or nonoccurrence of the jth term in the ith document.

The classes are characterized by probability estimates derived from the document-term matrix. The a priori probability P(Cj) of each category Cj is estimated by the ratio of the number of training documents manually assigned to Cj to the total number of training documents. The conditional probability P(Wi | Cj) that term Wi applies to a document of category Cj is estimated by the ratio of the number of occurrences of Wi in training set documents manually assigned to Cj to the number of occurrences of all terms in these documents.

The automatic assignment of a document to a category is viewed as a probability prediction based upon evidence and hypotheses. The hypotheses are the above probability estimates relating the categories and the terms. The evidence is the index of the document to be classified, that is, the specification of those terms or clue words
which apply to the document. The prediction method is based upon Bayes' rule [10], simplified by means of an independence assumption. The assumption is that, given any category Cj, any pair of terms Wp and Wm are independent with respect to that category:

    P(Wp, Wm | Cj) = P(Wp | Cj) P(Wm | Cj)

Let Wp, Wm, ..., Wr be the terms applicable to a given document. The Bayesian prediction, or "attribute number," for category Cj is then

    k P(Cj) P(Wp | Cj) P(Wm | Cj) ... P(Wr | Cj)

where the scaling factor k for the particular set of applicable terms is determined by equating unity with the sum over all the categories of the attribute numbers. The document is then assigned to the category having the greatest attribute number with respect to the set of terms applicable to the document.

The documents used for the experiments by Maron [11] are abstracts of computer literature published in the March, June, and September, 1959, issues of the IRE Transactions on Electronic Computers. The training set consists of 260 of these, while the remaining 145 abstracts constitute the test set. The intellectual analysis of the training set yielded 32 categories and the selection of 90 index terms. (Documents from the test set, of course, were not included in this analysis, nor in the formation of the hypotheses, i.e., the estimation of P(Cj) for each category and
P(Wi | Cj) for each category and index term.) Considering only documents having at least two index terms and assigned by human classifiers to just one category, the agreement between the automatic assignments and the human assignments of documents to subject categories was 91% for the training set and 50% for the test set. Taking into account the number of categories (32), this latter figure is by no means as poor as it might seem at first glance. Indeed, one may reasonably conclude that this study demonstrates the possibility, if not the practicality, of automatic assignment of documents to classes of a manually defined classification of subject categories.

2.2.2. Factor Analysis

Factor analysis [12, 13, 14] is applied by Borko [15] and Borko and Bernick [16, 17] to the problem of automatic document classification, including the classification derivation, the class characterizations, and the assignment of documents to classes. The general approach of this method is to determine, from the words or index terms occurring in the documents of a document set, the subject categories in terms of the index terms, and to assign documents to categories by reference to the terms occurring in the documents and the characterizations of the categories. The main tools of the method are statistics and matrix theory.
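Before detailing the factor-analytic method, the attribute-number prediction of Section 2.2.1 can be sketched as follows; the two categories, two terms, and their probability estimates are invented for illustration:

```python
def attribute_numbers(priors, cond, doc_terms):
    """For each category Cj form P(Cj) times the product of P(Wi | Cj)
    over the document's terms, then scale so the values sum to unity."""
    raw = {}
    for c, prior in priors.items():
        score = prior
        for w in doc_terms:
            score *= cond[c].get(w, 0.0)
        raw[c] = score
    total = sum(raw.values())
    return {c: s / total for c, s in raw.items()}

priors = {"C1": 0.5, "C2": 0.5}            # estimates of P(Cj)
cond = {"C1": {"w1": 0.6, "w2": 0.2},      # estimates of P(Wi | Cj)
        "C2": {"w1": 0.1, "w2": 0.7}}

attrs = attribute_numbers(priors, cond, ["w1", "w2"])
best = max(attrs, key=attrs.get)           # category of greatest attribute number
```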
The method begins with the document-term frequency matrix of a document set, in which the ijth element is the number of occurrences of the jth term in the ith document. The document-term matrix of Borko [15] gives the occurrence frequencies of the significant words of the documents, which are abstracts appearing in Psychological Abstracts. Each term is construed as a discrete random variable defined by the associated column of the document-term matrix. The means and variances of the terms and the covariances of the pairs of terms are computed; from these quantities the Pearson correlation matrix is formed.

The eigenvalues and eigenvectors of the correlation matrix are then found. The eigenvectors represent "factors" underlying the correlations; the eigenvalues represent the portion of the correlation attributable to the corresponding factors. The largest eigenvector, i.e., the eigenvector corresponding to the largest eigenvalue, gives the direction in term space along which correlation is maximum. The category space is the subspace of the term space generated by the several largest eigenvectors of the correlation matrix. The number of eigenvectors is determined on the basis of the percentage of the total correlation accounted for by the corresponding factor space. For example, if 75% of the total correlation is to be accounted for, then the k eigenvectors corresponding to the k largest eigenvalues are selected, where the sum of the k largest eigenvalues
is at least 75% of the total correlation and the sum of the k - 1 largest eigenvalues is less than 75% of the total correlation. The factor loadings of a term are the respective normalized inner products of the term vector with the eigenvectors, which is to say, the particular term components of the respective normalized eigenvectors.

The document-term matrix may be regarded as representing the documents in Euclidean t-space, t being the number of terms. From the factor loadings of the terms and the document-term matrix, the documents may be represented in the category space, a k-dimensional subspace of the term space, k being the required number of eigenvectors. These document vectors in the category subspace may be regarded as vectors in the term space approximating, respectively, the original document vectors in the term space, i.e., the rows of the document-term matrix. As such, these approximations are optimal in that the mean square error is minimized, subject to the constraint that the approximations lie in a k-dimensional subspace of the term space. In a study of feature extraction in pattern recognition, Tou and Heydorn [18] provide a proof that the estimation error is minimized by the choice of the k largest eigenvectors (as the basis for the k-dimensional subspace). On the other hand, two other optimality criteria [18] dictate the choice of the k smallest eigenvectors, viz., the minimum mean square distance criterion and the minimum
entropy criterion. However, these two criteria are devised expressly to extract the common features of all the pattern vectors of a single class, i.e., the intraset features of a particular pattern class; in consequence, these criteria, unlike the minimum estimation error criterion, are not appropriate to the purpose under discussion.

While the subspace dimensionality required to represent a given percentage of the total correlation is uniquely determined, and the optimum subspace is also uniquely determined, the choice of a basis for the subspace remains open. That is, the number of factors and the subspace generated by them are uniquely determined; the subspace is that generated by the eigenvectors; and the eigenvectors may be, but need not be, interpreted as the factors or categories. The eigenvectors are mathematically rotated by Borko [15] to approximate "simple structure," the purpose of which is to permit meaningful interpretation of the factors in terms of the index terms. The basic idea is to achieve many small factor loadings for all the terms; that is, each index term vector, expressed as a linear combination of the factors, has many small components and few large components. The subject categories corresponding to the factors or basis vectors are conceptually identified by intellectual analysis of the factor loadings. For example, the terms having a factor loading of 0.18 or more on a particular one of the factors [15] were "girls," "boys," "school,"
" ach i evement ," and "reading," from which the factor interpretation "academic achievement" was inferred. 27 The assignment of documents to the classes or categories corresponding to the b a sis v ecto rs is done as follows. Given a document and its index, a score is computed for each category; the document is assi g ned to the category of highest score. These factor scores are computed from the factor loadings of the terms and the term frequencies of the document. The score of a factor is the sum over the index terms of the term frequency and the loading of the term on the p art icul a r factor; that is, a factor score for a given document is the cosine corre l ation between the document index and the particular factor. Borko and Bernick [ 17 J compared experiffiental ly the factor analytic technique with the Bayesian prediction technique, using the index terms, the training set, and the test set of Maron [ 11 ]. Factor analysis w as applied to the training ~et, producing 21 subject categories The documents of the training set were manually assi g ned to the categories From this classification of the training set the hypotheses (prob a bility estimates) were derived from which e a ch document of the trainin g set and the te s t set was assigned to a class by means of the Bayesian technique Each document was also assigned to a class by the factor analysis method, i.e., according to the factor scores. The assignments of the documents of the training set to the factoranalytically derived cate g ories by each of the three
methods (manual, Bayesian, and factor) were compared. The results may be summarized by the percentage of the documents for which the assignments of methods coincide: manual-Bayesian, 80%; manual-factor, 58%; Bayesian-factor, 67%. The agreement summaries for the test set of documents were, respectively, 45%, 39%, and 57%. In the test set, the two automatic document assignment techniques are in better agreement than is either with the manual method. This fact is believed to be indicative of a greater inherent consistency in mechanized methods of assignment relative to human assignment. This study confirms the conclusion of Maron [11] that the automatic assignment of documents to classes is feasible. Moreover, the automatic methods (both Bayesian and factor) performed about as well, relative to the human assignments, with the automatically derived classification as did the Bayesian method with the manually defined classification. Thus, this study demonstrates the possibility not only of the automatic assignment of documents to subject categories of a classification, but also of the automatic derivation of such a classification. It should be emphasized that the feasibility of both the automatic document assignment and the classification derivation are here judged by comparing the automatic assignments with the document assignments to the subject categories by human classifiers.
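In modern notation, the eigenvalue cutoff and the factor scoring described above can be sketched as follows. This is an illustrative reconstruction, not Borko and Bernick's program: the document-term matrix is toy data, the factors are taken to be the unrotated principal eigenvectors (no "simple structure" rotation), and only the 75% figure is taken from the text.

```python
import numpy as np

def choose_k(eigenvalues, fraction=0.75):
    """Smallest k whose k largest eigenvalues account for at least
    `fraction` of the total correlation (the trace)."""
    ev = np.sort(eigenvalues)[::-1]
    cum = np.cumsum(ev) / ev.sum()
    return int(np.searchsorted(cum, fraction) + 1)

def factor_scores(doc_term_freqs, factors):
    """Cosine correlation of a document's term-frequency vector with
    each factor (each column of `factors`)."""
    d = doc_term_freqs / np.linalg.norm(doc_term_freqs)
    f = factors / np.linalg.norm(factors, axis=0)
    return d @ f

# Toy 4-document x 3-term frequency matrix (hypothetical data).
D = np.array([[2., 0., 1.],
              [1., 0., 2.],
              [0., 3., 0.],
              [0., 2., 1.]])
R = np.corrcoef(D, rowvar=False)              # term-term correlations
vals, vecs = np.linalg.eigh(R)
k = choose_k(vals)                            # number of factors retained
factors = vecs[:, np.argsort(vals)[::-1][:k]]
scores = factor_scores(D[0], factors)
best = int(np.argmax(scores))                 # category of highest score
```

Document 0 would be assigned to category `best`; repeating the last two lines for each row of D reproduces the assignment procedure described above.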
2.2.3. Clumps

The automatic classification techniques discussed above may be characterized as applications of machine techniques to problems of conventional document classification. Subject categories are defined or derived, each document is assigned to one category, and a class of documents consists of all those documents assigned to a given category. The document classification is evaluated against the standard of human assignments of documents to categories; it is suitable for the physical grouping of the documents; and, finally, it is not primarily intended for a specifically mechanical search strategy. The clumps of Needham and Sparck Jones [19], on the contrary, were devised expressly for the purpose of improving the retrieval effectiveness of a mechanized document retrieval system. The associated document classification, it will be seen, is not suitable for physical grouping, since documents generally belong to several classes. Finally, because of the purpose of this classification scheme, there would be little point in comparing the automatic assignments of documents to classes with those of human indexers. The initial classification is again a grouping of terms. The analysis begins with the document-term incidence matrix of the document set, in which the ijth element indicates the occurrence or nonoccurrence of the jth term
in the ith document. The Tanimoto similarity function is applied to the pairs of columns of the document-term matrix: the similarity of a pair of terms is the ratio of the number of documents in which both terms occur to the number of documents in which either term occurs. The aggregate of coefficients of a term to a set of terms is defined to be the sum of the similarities of the given term to the terms of the set. A clump is then defined to be a minimal, nonempty proper subset of the term set such that each term belonging to the subset has a greater aggregate of coefficients to the subset than to its complement, and each term not belonging to the subset has an aggregate of coefficients to its complement not less than its aggregate of coefficients to the subset. The sets of clumps to which the terms respectively belong are formed. This information and the document-term matrix are then used to associate clumps with documents, i.e., to classify the documents. The clump set for a document is the union of the sets of clumps of the terms occurring in the document. In other words, a certain clump is applied to a document (or the document is assigned to a certain class) if there exists a term which occurs in the document and which belongs to the clump. (The criterion of applicability of a clump to a document could easily be made more stringent.) The result is the document-clump incidence matrix, which amounts to an augmentation of the document-term matrix: the (binary) value of the ijth element
indicates whether or not the jth clump is applicable to the ith document. A search request is a specification of a subset of the term set. Just as for a document, a clump set for the request is formed; the request descriptor set consists of all those clumps containing at least one of the terms of the request. The search begins with a matching procedure relative to the clumps. The retrieved documents are those having document clump sets which contain the request clump set. The second stage of processing is the matching of the terms of the retrieved documents against the terms of the request, thereby providing for a ranking of the retrieved documents according to a quantitative measure of relevance, e.g., the similarity of the term set of a document to the term set of the request. Thus, not only are documents typically assigned to several classes (clumps), but the search procedure is predicated upon such multiple classification, in contrast to hierarchical classifications and their associated search strategies. The role of clumps is analogous to that of descriptors used to logically index documents. The clump technique is intended for application to a mechanically derived document-term matrix, reflecting the keyword logical indexing of a document set. The objective is to combine the ease of indexing by keywords with the retrieval effectiveness of
descriptor indexing, and to do so altogether mechanically. In particular, manual descriptor indexing is replaced by the automatic clump derivation and the associated document-clump incidence matrix determination. Moreover, it is not necessary to interpret the term clumps, i.e., to assign descriptor-like labels to them, because requests are expressed in terms of keywords and the associated clump set is determined automatically. This permits the use of both "descriptors" (clumps) and keywords in the search procedure, without the necessity of clump interpretation for the user. One important aspect of the clump system which has not yet been mentioned is the updating problem. Although this issue was not investigated in any detail in the exploratory study [19], its importance was nonetheless acknowledged there. Consider the acquisition of a document subsequent to the clump analysis of the initial document set, and suppose the keyword indexing of the new document is given. The straightforward approach to updating is to reinitialize, i.e., to do whatever is required so that the resulting system is identical to what it would have been if the new document had been a member of the initial document set. The updated document-term matrix has an additional row for the new document. It also has a new column for each keyword of the new document which does not occur in any of the other documents. The updated term-term similarity
matrix is of larger order, according to the number of new keywords. However, if any keyword of the original document-term matrix is applied to the new document, then the similarities of this term with all the others will, in general, be different. Consequently, the updated term-term similarities will differ from the initial ones from which the clumps were derived, even if the new document introduces no new keyword into the system. Therefore, the updated clumps may well differ from those of the initial document set. However, the quantity of effort required to repeat the classification derivation is clearly incommensurate with the event of the acquisition of one additional document. A more modest approach to updating begins with the assumption that the term-term similarities are not significantly affected by the acquisition of the new document, so that the initially derived clumps are still adequate. In this case, the classification derivation is not repeated. All that is required is to assign the new document to the appropriate classes, as determined by its keywords and the clumps to which each belongs, as initially derived. Thus, the document-term and document-clump matrices are updated by the addition of a row for the new document, thereby rendering it retrievable through the search strategy as originally described. The most practical approach may be a combination of the two approaches given above. The system is reinitialized
occasionally, e.g., whenever the document set size has increased by, say, 10% of the size it had at the last initialization. During the interim between initializations, updating is limited to the more modest approach. The premise of this updating strategy is that although the subject matter of a document set will vary with time, the changes will be gradual. Clearly, the complete elimination of system reinitialization would undercut the premise of automatic classification, viz., that a classification derived from analysis of the document set is potentially superior to an a priori classification schedule devised without reference to the particular document collection at hand.

2.3. FERRET: A Feedback Reference Retrieval System

The primary concern of this work is the automatic derivation of a multilevel document classification intended to serve two purposes. The first is to provide for an automatic sequential search procedure which is substantially more efficient than the serial search without degrading retrieval effectiveness. The second purpose is to provide for interactive search procedures. In view of the purposes of the classification, it is mandatory that certain other aspects of a mechanized reference retrieval system be specified, particularly the natures of the search procedures and the form of class
characterizations which these require. Accordingly, this section presents an overview of a complete reference retrieval system, FERRET. (This system, it should be noted, is similar in certain respects to those previously described. Specifically, Salton [3] has recognized the potential efficiency advantage of a multilevel classification over a simple classification and has discussed the basic search strategy associated with such a classification.) The processes required for the initialization of the system are indicated in Figure 2.1. The first step of initialization is the production of the representations of the documents by the analysis of the documents with respect to subject matter. The process of content analysis is not particularly pertinent to the classification problem and so will not be discussed further. What is of importance here is the form of the document representations resulting from that indexing operation: each document is represented, for subject searching and classification purposes, by a logical or numerical term vector (index), that is, an attribute vector whose components correspond to the index terms of the system. Thus, the data by means of which the classification derivation process produces a document classification are the document-term matrix. The classification derivation is developed in detail in the following chapters. Briefly, a measure of document-document similarity is defined in terms of the attribute vectors of the documents.
[Figure 2.1, FERRET Initialization: Documents → Content Analysis → Document Representations (term vectors) → Classification Derivation → Classification of the Documents → Document Class Representation Transformation → Sequential Search Tree → Information Store.]
A similarity threshold is applied to the document-document similarity matrix to produce a graph: the points of the graph are the documents; a pair of points are joined by a line in case the similarity of the pair of documents is not less than the similarity threshold. Graph theoretical techniques are applied to the graph in order to identify clusters of points. These clusters of points constitute the major document classes. The process is repeated on each class which is suitably large, and so on, until each class not subclassified is suitably small. Consequently, the resulting document classification may be regarded as a rooted tree of subsets of the document set: the root of the tree is the whole document set; the successors of the root are the major classes; the successors of any nonendpoint of the tree constitute the subclassification of that nonendpoint; an endpoint is a subset of the document set which is not subclassified. In order to realize the objective of efficient search, the classes of the classification must be represented in a form which is suitable for matching against queries. The class representation transformation step of initialization performs this task. Each class (except the root, i.e., the whole document set) is considered to be a document: the class representation is a numerical term vector (or aggregate index) computed from the term vectors of the documents belonging to the class; if the class is an
endpoint of the classification, then the class representation includes the identity of the class in addition to the aggregate index. The result of the class representation transformation is therefore a tree, the sequential search tree, isomorphic to the classification tree. The root represents (implicitly, not explicitly) the whole document set; any other point includes the aggregate index of the represented class, and the class itself, if it is an endpoint. Clearly, the identity of any class of the classification may be readily determined from the sequential search tree: the class is the union of the classes included in the endpoints of the subtree subtended by the representation of the class in question. The sequential search tree, which represents the document classification and the aggregate indexes of the classes, is retained in the information store. Also retained in the information store are two items associated with each document. The first is the document index, used principally during the last step of a search. The other is just the text which constitutes the reference given to the user when the document is included in the response. After initialization, the system is operational. As indicated in Figure 2.2, there are two general types of operational activity: system updating, and search and retrieval.
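A present-day sketch of a sequential search tree point might look like the following. The node fields follow the description above, but the componentwise sum used for the aggregate index is an assumption (the text leaves the computation of the aggregate index unspecified at this point), and all names and data are illustrative.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SearchTreeNode:
    aggregate_index: list                    # aggregate index of the class
    children: list = field(default_factory=list)
    documents: Optional[list] = None         # class identity, endpoints only

    def is_endpoint(self):
        return not self.children

def aggregate(term_vectors):
    """One plausible aggregate index: the componentwise sum of the
    term vectors of the member documents."""
    return [sum(col) for col in zip(*term_vectors)]

# A major class with two endpoint subclasses (toy term vectors).
leaf_a = SearchTreeNode(aggregate([[1, 0], [1, 1]]), documents=[0, 1])
leaf_b = SearchTreeNode(aggregate([[0, 2]]), documents=[2])
major = SearchTreeNode(aggregate([[1, 0], [1, 1], [0, 2]]),
                       children=[leaf_a, leaf_b])
```

A nonendpoint carries only its aggregate index and children; an endpoint additionally carries the identities of its member documents, matching the description of the tree above.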
[Figure 2.2, FERRET Updating and Retrieval: on the retrieval side, the user's query, class selections, and modified queries pass through the search procedures, which exchange class characterizations, fetch commands, and responses with the information store (search tree, document indexes, and document reference data); on the updating side, newly acquired documents pass through content analysis, document representation, and document classification to an updated sequential search tree.]
The search procedures are developed and discussed in detail in Chapter 6. There are two general types of search procedures: the basic search and the feedback search. The basic search procedure illustrates the underlying philosophy of the classification structure. Many query-document relevance computations for future queries are, in effect, done in advance, with the results stored implicitly in the retained classification representation, thereby reducing the number of relevance computations required for any given query search. A query is an assignment of numerical weights (positive, negative, or zero) to the terms of the system. The stored classification structure makes it unnecessary to perform a relevance computation for each document. Instead, a relevance computation is performed relative to each major class by reference to the respective aggregate indexes of the classes. The major class of greatest score is selected and becomes the decision node of the sequential search tree, replacing the root. The successors of the selected class are similarly scored with respect to relevance to the query; the subclass of greatest relevance becomes the next decision node. This continues until the selected decision node is an endpoint of the sequential search tree. The representations of the members of the terminal class are then used to compute the relevance of these individual documents to the query, and provide for their ordering by decreasing relevance. The
relevance-ordered list of references of the documents of the terminal class of the search constitutes the response to the query. It is clear that the total number of relevance computations for such a search may be substantially smaller than the total number of documents. Moreover, the sequential search tree permits a fuller exploitation of the user's understanding of his objective than would be possible within the framework of a serial search, i.e., the computation of a relevance value for each document. In the feedback search, the user participates in the search in two ways: he makes decisions, and he modifies his query. The basis for both user actions is the information provided him by the system concerning the nature of the alternatives at a given decision node. The alternatives are the subclasses of the class represented by the decision node of the sequential search tree; the system presents the user with a suitable characterization of these subclasses. Based upon this information, the user selects the next decision node (subclass). Moreover, these characterizations of the alternatives enable the user to specify his objective with better precision and in a manner which better matches his information needs with the stored information of the system. Consequently, the user may modify his query on each transaction, i.e., from decision node to decision node, from the root of the sequential search tree to the terminal class.
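The basic search descent can be sketched as follows. The dictionary node layout and the cosine relevance measure are illustrative assumptions (the text does not fix the relevance function here), and the toy tree has a single level of major classes, each of which is an endpoint.

```python
import math

def score(query, index):
    """Cosine relevance of a query weight vector to an aggregate
    or document index."""
    dot = sum(q * x for q, x in zip(query, index))
    nq = math.sqrt(sum(q * q for q in query))
    nx = math.sqrt(sum(x * x for x in index))
    return dot / (nq * nx) if nq and nx else 0.0

def basic_search(node, query, doc_index):
    """Descend from the root, taking the highest-scoring class at each
    level; then rank the terminal class by decreasing relevance."""
    while node["children"]:                      # not yet an endpoint
        node = max(node["children"],
                   key=lambda c: score(query, c["aggregate"]))
    return sorted(node["documents"],
                  key=lambda d: score(query, doc_index[d]),
                  reverse=True)

# Toy tree: a root with two major classes, each an endpoint.
doc_index = {0: [1, 0], 1: [1, 1], 2: [0, 2]}
tree = {"children": [
    {"aggregate": [2, 1], "children": [], "documents": [0, 1]},
    {"aggregate": [0, 2], "children": [], "documents": [2]},
]}
response = basic_search(tree, [1, 0], doc_index)   # relevance-ordered docs
```

Only the selected branch's subclasses are ever scored, which is the source of the efficiency claimed above: the number of relevance computations grows with the depth and fan-out of the tree rather than with the number of documents.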
The other major activity of the operational FERRET system is updating: as documents are acquired subsequent to system initialization, they must be assimilated. As indicated in Figure 2.2, the first step of updating is content analysis. The resulting document representation and the document reference for presentation to the user are retained in the information store. To make the new document retrievable, however, requires that it be included in an appropriate endpoint of the sequential search tree: the new document must be classified within the classification derived prior to the acquisition of the document. This is achieved by a process similar to the basic search procedure. The document representation is treated as a query in order to identify the classes at each level of classification to which the document belongs. The document identifier is included in the class identity of the terminal class of the search tree. The class representation of each class of the chain of classes to which the document is assigned, from the major class to the terminal class, is modified in accordance with the representation of the new document. Evidently, this method of updating, which is the assignment of documents to predefined classes, presupposes that the assimilation of a new document would not materially affect the classification, i.e., that the initial document set is adequately representative of all future documents relative to subject matter. To the extent that this
assumption is not valid, the classification is degraded and retrieval effectiveness deteriorates. This can be remedied by infrequent reinitialization. However, since this study is concerned primarily with the classification derivation and the search procedures associated with the classification, the updating problem is not treated in detail here.

2.4. Concluding Remarks

The organization of documents in document and reference retrieval systems for retrieval purposes has been discussed. Several efforts toward automatic classification for mechanized reference retrieval systems have been reviewed in illustration of specific approaches to the problem of automatic classification and of the different facets of the problem: classification derivation, assignment, and class characterization. FERRET, a feedback reference retrieval system, has been introduced to provide the necessary context for the specific concern of this study: the machine construction of a multilevel classification of a document set, based on a measure of similarity on the documents, and intended for more efficient query processing and the effective use of user feedback during the search.
CHAPTER 3
GRAPH THEORETICAL COVER GENERATION

3.1. Introduction

The problem with which this chapter and the following two chapters are concerned is that of deriving automatically a multilevel classification of a document set based on the given logical or numeric document indexes. The classification representation transformation and the search procedures are discussed in Chapter 6. The first step toward the solution of the problem is straightforward: the computation of the document-document similarity matrix by the application of a suitable measure of similarity to the pairs of document representations. The identification of the major classes from the similarity matrix requires two distinct activities: the generation of one or more covers of the document set, and the evaluation of these different covers for the selection of the best cover as the collection of major classes. Evidently, the subclassification of any class requires the same processes, viz., the generation of covers of the class and the selection of the best.
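The first step, the computation of the document-document similarity matrix and the application of a threshold to obtain a graph (as described in Section 2.3), can be sketched as follows. Cosine similarity and the particular threshold value are illustrative choices, not the measure fixed by this study.

```python
import numpy as np

def similarity_matrix(doc_term):
    """Cosine similarity between every pair of document term vectors."""
    unit = doc_term / np.linalg.norm(doc_term, axis=1, keepdims=True)
    return unit @ unit.T

def threshold_graph(sim, threshold):
    """Adjacency sets: join documents i and j (i != j) whenever their
    similarity is not less than the threshold."""
    n = sim.shape[0]
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if sim[i, j] >= threshold:
                adj[i].add(j)
                adj[j].add(i)
    return adj

# Toy document-term matrix (hypothetical indexes).
D = np.array([[1., 1., 0.],
              [1., 1., 1.],
              [0., 0., 1.]])
G = threshold_graph(similarity_matrix(D), 0.8)
```

With these data, documents 0 and 1 are joined (cosine 2/√6 ≈ 0.816) while document 2 is an isolated point; the cover generation techniques of this chapter then operate on graphs of this form.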
This chapter and Chapter 4 are concerned with the cover generation problem. Chapter 5 provides for cover evaluation and selection and unifies the cover generation and cover selection techniques into an algorithm for the derivation of a multilevel classification of the document set. The fundamental approach to the problem of generating covers of a given class of documents is a specific form of graph theoretical cluster analysis. A definition of clusters in a graph, developed below, is applied to the graph of the documents of the class induced by a particular similarity threshold. (The method of selecting a decreasing sequence of similarity thresholds for a given class of documents is discussed in Chapter 5.) The key notion in the definition of clusters in a graph is that of maximal complete subgraphs of the graph, or "cliques," as they are termed by Harary and Ross [20]; clusters are defined below to be the unions of certain collections of cliques of the graph. This approach is based on the premise that a class of documents is composed of an unknown number of clusters of documents with respect to subject matter, and that documents of similar subject matter have similar representations, i.e., indexes. In this event, a clique of the graph formed by the application of a threshold to the document-document similarities constitutes a maximal set of documents in which the similarity between each pair is at least as great as the threshold. The formation of clusters
from cliques produces a cover which is not necessarily a partition, i.e., the subclasses of the class are not necessarily pairwise disjoint. This is perfectly appropriate to the task at hand, however, since there is no justification for the restrictive assumption that the subject areas of discourse of the document set are nonoverlapping. This represents an advantage of this method for the present application over those which necessarily produce disjoint subclasses and are devised primarily for pattern recognition problems, e.g., the k-means method of J. MacQueen described by Nagy [21], and the minimum spanning tree technique of Zahn [22]. Another advantage of the present method is that the number of classes of the cover is not required to be input to the cover generation procedure, as is the case in the k-means method, for example. Unlike many pattern recognition problems [23], in the problem at hand one has no a priori knowledge of the number of subclasses of the class of objects under analysis. The remainder of this chapter is organized into two sections. The first presents the basic definitions required, particularly those from graph theory; the second is concerned with the definition of clusters in a graph.

3.2. Basic Definitions

A cover of a set S is a collection U of subsets of the set such that S = ∪U; U is an efficient cover in
case no member of U is properly contained in any other. The term clustering is also used, in the context of the problem at hand, to denote a cover resulting from a cluster analysis; the members of such a cover are termed clusters. A collection A of sets refines a collection B of sets in case A ∈ A implies the existence of a member B of B such that A ⊆ B. The terminology and graph theoretical definitions given below are essentially as given by Harary [24]. A graph G = (V(G), E(G)), or G = (V, E), consists of (1) a nonempty finite set V, whose elements are termed points; and (2) a set of lines, E ⊆ {X : X ⊆ V, |X| = 2}, i.e., a collection of unordered pairs of points. Equivalently, a graph is an irreflexive symmetric binary relation on a nonempty finite set. A line {u, v} ∈ E is denoted more briefly by uv. The distinct points u and v are adjacent points, joined by the line uv. Each of the endpoints u and v of the line uv is also said to be incident with uv. An isolated point of a graph is a point adjacent to no other point. If V is a singleton, then G = (V, E) is a trivial graph; otherwise, G is a nontrivial graph. G is a null graph (or totally disconnected graph) in case E = ∅. G is a complete graph in case E consists of all unordered pairs of points of G, i.e., in case each pair of points of G are joined by a line.
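These definitions translate directly into an adjacency-set representation; the following sketch is illustrative, using a small graph with an isolated point.

```python
def neighborhood(adj, u):
    """N(u): the points adjacent to u, together with u itself."""
    return adj[u] | {u}

def deleted_neighborhood(adj, u):
    """N0(u) = N(u) - {u}: the points adjacent to u."""
    return set(adj[u])

def generated_subgraph(adj, X):
    """G[X]: point set X, with all lines of G joining points of X."""
    return {u: adj[u] & X for u in X}

# A graph on five points, given by adjacency sets; point 5 is isolated.
adj = {1: {2}, 2: {1, 3, 4}, 3: {2, 4}, 4: {2, 3}, 5: set()}
```

Point and line removals are then generated subgraphs as well: the removal of point v is `generated_subgraph(adj, set(adj) - {v})`, and `generated_subgraph(adj, {2, 3, 4})` here yields a complete subgraph on three points.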
The neighborhood N(u) of a point u of a graph G = (V, E) is the set of all points adjacent to u, together with the point u: N(u) = {u} ∪ {v : uv ∈ E}. The deleted neighborhood N₀(u) of point u is N₀(u) = N(u) − {u}, i.e., the set of all points adjacent to point u. A subgraph of G = (V, E) is a graph G′ = (V′, E′) such that V′ ⊆ V and E′ ⊆ E. If X is a nonempty subset of V, the subgraph G[X] generated by X is that subgraph of G whose point set is X and whose line set consists of all the lines of G which join points in X. In particular, if v is a point of a nontrivial graph G = (V(G), E(G)), the removal of the point v from G is the subgraph G − v = G[V − {v}]; the point set of G − v is V(G − v) = V(G) − {v}; the line set of G − v consists of all those lines of G not incident with the point v. If G = (V, E) is a graph and uv ∈ E, the removal of the line uv from G is the subgraph G − uv = (V′, E′) of G with V′ = V and E′ = E − {uv}. The notions of neighborhoods of points, point removals, and line removals are of particular utility to the problem of clique detection. A complete subgraph of a graph G = (V, E) is a subgraph of G which is a complete graph. A clique G′ of G is a maximal complete subgraph of G, i.e., a complete subgraph of G such that if G″ is a complete subgraph of G and G′ is a subgraph of G″, then G′ = G″. Because a
clique is a complete graph, it suffices to specify only the point set of the clique in order to fully specify the clique. Consequently, it creates no confusion to use the term "clique" to refer to the point set of the clique, in the interest of economy of language. Indeed, one could define a clique of a graph to be a maximal (relative to set theoretic inclusion) subset X of the point set V of the graph, having the property that each pair of points of the subset are adjacent points of the graph. Suppose V = {v₁, v₂, ..., vₙ} is a nonempty finite collection of sets. Let E consist of those subsets of V of two elements which meet, i.e., if vᵢ, vⱼ ∈ V, then {vᵢ, vⱼ} ∈ E if and only if vᵢ ≠ vⱼ and vᵢ ∩ vⱼ ≠ ∅. The graph G = (V, E) is termed the intersection graph of the collection V. In particular, if V is the collection of the cliques of a graph G′, then G is the clique graph of the graph G′. The notion of cliques, that of a certain generalization of the notion of clique graphs of graphs, and that of components of a graph are basic to the definition of clusters in graphs which is developed in the following section. A component X of a graph G = (V, E) is a maximal subset of V such that if u, v ∈ X, then u and v are connected points. The connectivity relation on V is defined as follows: suppose u, v ∈ V; then u and v are connected if u = v, or if u and v are joined by a walk in G. A walk in G is a sequence of points
u₀, u₁, ..., uₙ with uᵢ₋₁uᵢ ∈ E for each i = 1, 2, ..., n; such a walk joins the initial and terminal points, u₀ and uₙ. A path is a walk on which no point has multiple occurrences. That the connectivity relation is an equivalence relation on V is apparent; indeed, the components of V are the equivalence classes induced by the connectivity relation, and so partition the point set V of the graph.

3.3. Clusters in Graphs

The objective of this section is to develop a means of generating clusterings of a given graph G = (V, E). A clustering of a graph is an efficient cover of the point set of the graph consisting of clusters, a cluster being a set of points which satisfies a specific definition of clusters in a graph. In view of the definitions of the preceding section, two definitions of clusters are immediately apparent: a cluster of G is a component of G, and a cluster of G is a clique of G. Since the component set K(G), that is, the collection of all the components of G, partitions the point set, K(G) is an efficient cover. The clustering K(G) of the graph G is termed the component clustering of G. The clique clustering of G is the clique set Q(G), that is, the collection of all the cliques of G. That Q(G) is a cover of V follows from the fact that if p ∈ V
then {p} is a complete subgraph of G and so is contained in a maximal complete subgraph of G; consequently, V ⊆ ∪Q(G). That Q(G) is an efficient cover is an immediate consequence of the definition of cliques: each member of Q(G) is a maximal complete subgraph. The component clustering and the clique clustering of a graph are illustrated in Figure 3.1, in which the cliques are enclosed by dotted lines. The limitations of these clusterings may be appreciated by reference to the figure. It is apparent from the figure that the component clustering and the clique clustering have complementary limitations. The 7-point component consists of two cliques of four points each, intersecting in one point. In the component clustering, of course, these two cliques are inseparable. In the clique clustering, each of the cliques of four points is a cluster; this result, two clusters overlapping in one point, seems more plausible than the single cluster of the component clustering. Consider, on the other hand, the 5-point component, which consists of two cliques of four points meeting in three points. In this case the component clustering produces the more plausible result, a single cluster. Thus, the component clustering unites a pair of highly overlapping cliques into one cluster, which the clique clustering does not; and the clique clustering discriminates a pair of cliques with a small intersection into two clusters, which the component clustering does not.
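The clique set Q(G) and the component set K(G) can be computed mechanically. The Bron-Kerbosch enumeration below is a standard later algorithm, shown here as a stand-in for the clique detection procedure of Harary and Ross [20]; the example data reproduce the 5-point component of two cliques meeting in three points discussed above.

```python
def cliques(adj):
    """All maximal complete subgraphs, as point sets (Bron-Kerbosch)."""
    out = []
    def expand(R, P, X):
        if not P and not X:
            out.append(R)          # R cannot be extended: a clique
            return
        for v in list(P):
            expand(R | {v}, P & adj[v], X & adj[v])
            P = P - {v}
            X = X | {v}
    expand(set(), set(adj), set())
    return out

def components(adj):
    """Equivalence classes of the connectivity relation."""
    seen, out = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = set(), [s]
        while stack:
            u = stack.pop()
            if u not in comp:
                comp.add(u)
                stack.extend(adj[u] - comp)
        seen |= comp
        out.append(comp)
    return out

# Two 4-point cliques meeting in three points: {1,2,3,4} and {2,3,4,5}.
adj = {1: {2, 3, 4}, 2: {1, 3, 4, 5}, 3: {1, 2, 4, 5},
       4: {1, 2, 3, 5}, 5: {2, 3, 4}}
Q = cliques(adj)      # clique clustering: two heavily overlapping clusters
K = components(adj)   # component clustering: a single cluster
```

On these data the clique clustering yields the two overlapping cliques and the component clustering a single 5-point cluster, exhibiting the complementary behavior described above.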
Cliques                                   Components

Figure 3.1 The Clique and Component Clusterings of a Graph
The fact that two distinct points belong to the same cluster of the component clustering does not imply that the corresponding pair of objects (e.g., documents) are actually similar. In the 4-point component of the figure, for example, points 1 and 3 are both adjacent to point 2 and so are in the same component. In terms of objects and their similarities, this means only that each of the pair is similar to a third, and so both are in the same cluster. More generally, a pair of objects are in the same cluster of the component clustering if there is a sequence of objects joining the pair in which each consecutive pair of objects is similar, i.e., the similarity of each consecutive pair of the sequence is not less than the similarity threshold by which the graph is defined. Indeed, a pair of objects of zero similarity may belong to the same component cluster. This "chaining" phenomenon is the consequence of the policy of liberal inference implicit in this clustering definition, viz., for purposes of classification, similarity is transitive. The component clustering of the graph of Figure 3.1, for example, would be unaffected by the addition to the graph of lines joining point 1 to each of points 3 and 4, thereby rendering the 4-point component complete. The clique clustering, on the other hand, embodies no such inference: a pair of points of a clique cluster are adjacent in the graph, i.e., are similar. Moreover, the clique clustering is necessarily changed by the deletion or addition of a line of the graph.
Thus, the clique clustering is the embodiment of the strictest construction possible of the given information (the graph) for the attainment of its objective in the context of this problem: the identification of subclasses of the class of which the graph is the threshold graph, relative to the numerical similarities of the elements of the class. In the same sense, the component clustering represents the loosest construction possible. If, on the contrary, the cluster analysis dictated that a pair of points from different components of the graph belonged to the same cluster, it could do so only arbitrarily, there being no information implicit in the graph from which such a condition could be inferred. It would also clearly be arbitrary, and therefore unjustified, to produce a cover such that a clique of the graph was not contained in any member of the cover. In view of the foregoing considerations, the concept of clustering may be made more specific: a clustering of a graph is an efficient cover of the point set of the graph which refines the component set of the graph and which is refined by the clique set of the graph, i.e., an efficient cover C satisfying Q(G) < C < K(G). (That Q(G) refines K(G) is an obvious and immediate consequence of the definitions of cliques and components.) It will now be shown that the clusterings of G under this definition form a lattice under the refinement relation. Specifically, this lattice is an interval of the lattice of all efficient covers of V.
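The refinement relation, and the meet construction used in the lattice argument that follows, can be sketched computationally. This is an illustrative sketch only (covers are represented as sets of frozensets, and both function names are choices made here): the meet is formed as the collection of maximal nonempty pairwise intersections of members of the two covers.

```python
def refines(A, B):
    """A < B : every member of A is contained in some member of B."""
    return all(any(a <= b for b in B) for a in A)

def meet(A, B):
    """Greatest lower bound of two efficient covers under refinement:
    the maximal members of {A' & B' : A' in A, B' in B, A' & B' nonempty}."""
    inters = {a & b for a in A for b in B if a & b}
    # keep only members not properly contained in another (efficiency)
    return {m for m in inters if not any(m < n for n in inters)}
```

For example, the meet of {{1,2,3},{3,4}} and {{1,2},{2,3,4}} is {{1,2},{2,3},{3,4}}, which refines both covers.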
The refinement relation < on the collection Ξ(V) of all efficient covers of V is a partial order. The reflexivity and transitivity of the relation are immediate consequences of its definition and of the reflexivity and transitivity of the inclusion relation. Suppose that A ∈ Ξ(V), B ∈ Ξ(V), A < B, and B < A ; let A ∈ A. Since A < B, there is a B ∈ B such that A ⊆ B, and since B < A, there is an A' ∈ A such that B ⊆ A'. Thus A ⊆ A', and the efficiency of A requires that A = A' = B, so that A ∈ B ; by symmetry, every member of B belongs to A as well. Hence A = B, and the relation is antisymmetric. The least upper bound A ∨ B of a pair of efficient covers is the collection of maximal members of A ∪ B. The greatest lower bound A ∧ B is the collection of maximal members of {A' ∩ B' : A' ∈ A, B' ∈ B}. Let M be such a maximal member, say M = A ∩ B with A ∈ A and B ∈ B.
Thus, M ⊆ A ∈ A and M ⊆ B ∈ B. Hence, A ∧ B refines each of A and B. Suppose now that D refines each of A and B, and let D ∈ D. Since D < A, there is an A ∈ A such that D ⊆ A, and since D < B, there exists a member B of B such that D ⊆ B. Thus, D ⊆ A ∩ B ∈ {A' ∩ B' : A' ∈ A, B' ∈ B}, so there exists a maximal member M of the collection such that A ∩ B ⊆ M. That is, D ⊆ A ∩ B ⊆ M ∈ A ∧ B, which establishes that D < A ∧ B. Therefore, A ∧ B is the greatest lower bound of A and B. Since each pair of elements of the poset (Ξ(V), <) has a greatest lower bound and a least upper bound, (Ξ(V), <) is a lattice, as claimed. Returning now to the more specific definition of clusterings of a graph G, let Ξ(G) denote the collection of all efficient covers of V which refine K(G) and are refined by Q(G). Then (Ξ(G), <), i.e., the interval [Q(G), K(G)], is a sublattice of (Ξ(V), <), with zero Q(G) and unit K(G). Figure 3.2 provides an illustration of such a lattice. That is, the figure exhibits every efficient cover of the point set of the graph G which refines K(G) and is refined by Q(G). The example of Figure 3.2 illustrates that it would be not only inefficient but definitely undesirable to generate all the members of Ξ(G). The cover {{1, 2, 3}, {2, 3, 4}, {5, 6, 7, 8, 9}} is one of four in which just one of the points 3 and 4 is included in the cluster containing the points 1 and 2, in spite of the symmetry
G : a graph on the points 1, 2, ..., 9

(Ξ(G), <) :

{1, 2, 3, 4} {5, 6, 7, 8, 9}                      [= K(G)]
{1, 2, 3} {2, 3, 4} {5, 6, 7, 8, 9}
{1, 2, 4} {2, 3, 4} {5, 6, 7, 8, 9}
{1, 2} {2, 3, 4} {5, 6, 7, 8, 9}
{1, 2, 3, 4} {5, 6, 7, 8} {6, 7, 8, 9}
{1, 2, 3} {2, 3, 4} {5, 6, 7, 8} {6, 7, 8, 9}
{1, 2, 4} {2, 3, 4} {5, 6, 7, 8} {6, 7, 8, 9}
{1, 2} {2, 3, 4} {5, 6, 7, 8} {6, 7, 8, 9}        [= Q(G)]

Figure 3.2 The Lattice of Clusterings of a Graph
of points 3 and 4 relative to points 1 and 2. Whatever basis exists for the inclusion of the point 3 (or 4) in the cluster containing points 1 and 2 applies as well to the point 4 (or 3). Consequently, any clustering generation procedure from which the arbitrary is absent will produce only those elements of the lattice of Figure 3.2 which are included in Figure 3.3. The arbitrary covers of the lattice of Figure 3.2 excluded from the subset of Figure 3.3 are disqualified as clusterings by the following more specific definition of clusterings of a graph. A clustering A of a graph G = (V, E) is an efficient cover of V which satisfies:

(1) A refines K(G), the component set of G ;
(2) A is refined by Q(G), the clique set of G ;
(3) if A ∈ A, then A = ∪P for some P ⊆ Q(G).

This last condition, which simply states that a cluster is a union of cliques, assures that if a pair of points belong to just the same cliques, then they belong to just the same clusters. Suppose that A is a clustering under this definition, that A ∈ A, and that p and q are points such that p ∈ A and q ∉ A. For some P ⊆ Q(G), A = ∪P ; since p ∈ ∪P, there is a clique M ∈ P such that p ∈ M. Since M ⊆ A and q ∉ A, q ∉ M. Thus, if a pair of points do not have membership in precisely the same members of A, then their clique memberships differ.
Figure 3.3 The Nonarbitrary Clusterings of the Lattice of Figure 3.2
The subset Γ(G) of Ξ(G) consisting of clusterings under this definition clearly includes K(G) and Q(G). Thus, as with Ξ(G), the refinement relation on Γ(G) establishes a lattice (Γ(G), <), a sublattice of (Ξ(G), <). The remaining issue is how one chooses a subcollection P of the clique set Q(G) from which to form a cluster A = ∪P. The identification of the clusterings K(G) and Q(G) presents no theoretical difficulty; the remainder of this section is concerned with the problem of generating other clusterings, i.e., those intermediate to Q(G) and K(G). Augustson and Minker [25] recognized explicitly that K(G) and Q(G) are the extremes of graph clusterings, and the potential value of intermediate clusterings has been widely appreciated [25, 26, 27, 28, 29]. An early effort in graph-theoretical cluster analysis, in which the concept of cliques of a graph is exploited, is that of Bonner [26]. In addition, there is given by Bonner [26] an algorithm for the identification of the cliques of a graph. However, the definition of clustering is given only implicitly, in the form of a procedure. Moreover, the generation of a single cover, presumed to be suitable, is inconsonant with the strategy of this work, which separates the task into two parts: the cover generation, which produces several covers; and the cover evaluation, which selects the most suitable.
Jardine and Sibson [27] provide a systematic method for the generation of intermediate clusterings of a graph from its cliques. A k-partition of V is defined, for a natural number k, to be an efficient cover of V in which each pair of subsets intersects in fewer than k points. The concept of a k-partition is a generalization of that of a partition, a 1-partition being a partition. The intermediate clusterings are defined to be the particular k-partitions L_k of V generated by the following procedure.

Step 1. Initialize L_k = Q(G), the clique set of G.
Step 2. If, for all A, B ∈ L_k, A ≠ B implies |A ∩ B| < k, then stop.
Step 3. Let A ∈ L_k and B ∈ L_k be a pair of distinct members of L_k such that k ≤ |A ∩ B| ; replace L_k by (L_k ∪ {A ∪ B}) - {A, B}, and go to Step 2.

If k is equal to or greater than the largest number m of points of any clique of G, Step 3 is never executed: the intersection of any pair of cliques has strictly fewer members than does either of the cliques. Hence, L_m = Q(G), the clique clustering. It is evident from the procedure that each member of L_k, for any k, is the union of a nonempty subcollection of Q(G), the clique set of G ; that is, if M ∈ L_k, then M = ∪S for some S ⊆ Q(G), S ≠ ∅. Thus, L_k is refined by Q(G).
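The three-step procedure may be rendered as a direct, if inefficient, sketch in Python (the quadratic rescan and the function name are choices made here, not part of the original):

```python
def k_partition(cliques, k):
    """Jardine-Sibson procedure: starting from the clique set, repeatedly
    merge two distinct members meeting in at least k points (Step 3),
    until no such pair remains (Step 2)."""
    L = [set(q) for q in cliques]          # Step 1
    merged = True
    while merged:
        merged = False
        for i in range(len(L)):
            for j in range(i + 1, len(L)):
                if len(L[i] & L[j]) >= k:  # Step 3: replace the pair by its union
                    L[i] |= L[j]
                    del L[j]
                    merged = True
                    break
            if merged:
                break
    return [frozenset(s) for s in L]
```

For the pair of cliques {1,2,3,4} and {3,4,5}, the procedure merges them for k = 2 (overlap 2) and leaves them separate for k = 3.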
If |S| = 1, then, since obviously any clique is a subset of some component, M is contained in a component of G. Suppose |S| = 2, and let K1 and K2 be components containing s1 and s2, the members of S. Since s1 ∩ s2 ≠ ∅ (the pair was merged in Step 3) and K(G) is a partition of V, K1 ∩ K2 ≠ ∅ ; hence, K1 = K2. That is, M = s1 ∪ s2 is contained in some component. As an inductive hypothesis, assume that for any natural number n < n0, where n0 > 1, if |S| = n then M = ∪S is contained in some component of G. Suppose |S| = n0 ; since 1 < n0, S is the union (formed in Step 3) of two subcollections S' and S'' of Q(G) such that (1) 1 ≤ |S'|, |S''| ≤ n0 - 1 ; and (2) (∪S') ∩ (∪S'') ≠ ∅. By the induction hypothesis and condition (1), ∪S' ⊆ K1 and ∪S'' ⊆ K2 for some K1 and K2 ∈ K(G). Because of condition (2), K1 ∩ K2 ≠ ∅ ; hence K1 = K2 and ∪S ⊆ K1 ∈ K(G). This proves that L_k < K(G) for any k. In particular, L_1 < K(G). It will now be proven that K(G) < L_1. Suppose M1, M2 ∈ L_1 and M1 ≠ M2. If M1 and M2 were not disjoint, then 1 ≤ |M1 ∩ M2| ; hence, because of Step 3 of the procedure, M1 ∪ M2 replaces M1 and M2, i.e., M1, M2 ∉ L_1, a contradiction. Thus, L_1 is a disjoint collection; since, moreover, L_1 is refined by Q(G), L_1 covers V ; hence L_1 partitions V. Let K be a component and M be a member of L_1 which meets K ; the existence of such an M is assured by K ⊆ V = ∪L_1, since L_1 covers V. Suppose that K ∩ (V - M) ≠ ∅.
Thus, K ∩ M ≠ ∅ ≠ K - M, so {K ∩ M, K - M} partitions the component K of G ; since K is connected, there exists a line uv in G with u ∈ K ∩ M and v ∈ K - M. Since {u, v} generates a complete subgraph of G, there exists a clique Q of G such that {u, v} ⊆ Q. Since Q(G) refines L_1, there is a member M' of L_1 such that Q ⊆ M'. Thus, since {u, v} ⊆ M', u ∈ M, and v ∉ M, M ∩ M' ≠ ∅ and M ≠ M', contradicting the fact that L_1 is a partition. Therefore, the supposition that a component meets more than one member of L_1 is false. That is, K(G) refines L_1. Thus, K(G) < L_1 and L_1 < K(G), with each of K(G) and L_1 a partition and, hence, an efficient cover of V. Since the refinement relation on the class of efficient covers of V is a partial ordering, L_1 = K(G). To summarize the Jardine-Sibson k-partitions: L_1 = K(G) ; Q(G) < L_k < K(G) for each k = 1, 2, ... ; and L_k = Q(G) for each k = m, m+1, ..., where m = max {|Q| : Q ∈ Q(G)}. Moreover, since any member of any L_k is a union of cliques, each k-partition L_k qualifies as a clustering, i.e., each is a member of Γ(G). On the other hand, it is certainly the case that not every member of Γ(G) is, in general, a k-partition. The character of the k-partitions is illustrated in Figure 3.4. In particular, A = {{1, 2, 3, 4, 5}, {6, 7, 8, 9}, {7, 8, 9, 10}} is a member of Γ(G) which is not a k-partition. However, A seems less plausible than
(Panels for k = 1, 2, 3, 4.)

Figure 3.4 The k-Partitions of Jardine and Sibson
each of the k-partitions. This suggests that the definition of clustering is still, after all, lacking in specificity, i.e., that it is not desirable to generate all members of Γ(G) ; this issue will be further pursued later in this section. An unfortunate aspect of the k-partitions is that the intermediate clusterings, i.e., those L_k with k = 2, 3, ..., m-1, are difficult to characterize, because of the procedural nature of the definition of k-partitions. An alternative definition is suggested by the recognition that the procedural definition applied to the case k = 1 produces clusters M ∈ L_1 having the special property that M = ∪S, where S is a component of the clique graph of G. The clique graph of G has Q(G) as point set, with a pair of cliques of G adjacent in case their intersection is nonempty. Indeed, it is clear that the union of all the points of a component of the clique graph of G is precisely a component of G, and that the collection of all such unions is K(G) = L_1. The clustering definition given below, as an alternative to the Jardine-Sibson procedural definition of k-partitions, is the first of three to be considered, all of which are based on generalizations of the concept of the clique graph of a graph. For k = 1, 2, ..., m, where m is the number of points of a largest clique of G, the type-1 k-clique graph G^1_k(G) = (V^1_k, E^1_k) of G is defined as follows: V^1_k = Q(G) ; if M1, M2 ∈ Q(G), then
M1M2 ∈ E^1_k if and only if M1 ≠ M2 and k ≤ |M1 ∩ M2|. A type-1 k-cluster is the union of a subcollection of Q(G) which constitutes a component of G^1_k(G). The type-1 k-clustering of G consists of all the maximal type-1 k-clusters of G. The type-1 k-clique graphs of the graph of Figure 3.4 are exhibited in Figure 3.5, along with the graph and its cliques; from this information one easily sees that the type-1 k-clusterings coincide precisely with the k-partitions of Jardine and Sibson.

Figure 3.5 Type-1 k-Clique Graphs of a Graph

However, a proof that each k-partition of any graph coincides with the type-1 k-clustering would be difficult indeed, because such is not, in fact, the case, although counterexamples, one of which is given in Figure 3.6, are quite rare.
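The type-1 construction can be sketched as follows (illustrative Python; a union-find over clique indices plays the role of finding the components of G^1_k(G), and the maximality filter discards any cluster contained in another):

```python
def type1_k_clustering(cliques, k):
    """Type-1 k-clustering: two distinct cliques are adjacent iff they
    share at least k points; a cluster is the union of the cliques of a
    component of that graph; only maximal clusters are kept."""
    n = len(cliques)
    parent = list(range(n))                # union-find over clique indices
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i
    for i in range(n):
        for j in range(i + 1, n):
            if len(cliques[i] & cliques[j]) >= k:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(n):                     # union the cliques of each component
        groups.setdefault(find(i), set()).update(cliques[i])
    clusters = list(groups.values())
    return [frozenset(c) for c in clusters
            if not any(c < d for d in clusters)]  # efficiency: keep maximal
```

On the cliques {1,2,3}, {2,3,4}, {1,4} with k = 2, the first two merge into {1,2,3,4} and the non-maximal cluster {1,4} is discarded, illustrating why the term "maximal" is needed in the definition.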
a. The graph G and its cliques
b. The type-1 2-clique graph
c. The type-1 2-clustering
d. The Jardine-Sibson 2-partition

Figure 3.6 A Graph Whose Type-1 2-Clustering Differs from the Jardine-Sibson 2-Partition
Before criticism of the type-1 clustering is begun, the necessity of the term maximal in the definitions of k-clusterings will be demonstrated. The graph of Figure 3.7 has 5 cliques of 3 points and one clique of 2 points. The type-1 2-clique graph has two components: one consists of all the 3-point cliques; the other is the isolated 2-point clique. Thus there are two 2-clusters of the graph; however, one contains the other. The requirement that the covers of V be efficient therefore requires the specification of maximality of the k-clusters, as in the definition. An immediate consequence of the definition of G^1_k(G) is that the point M ∈ Q(G) is an isolated point of G^1_k(G) for every k ≥ |M|. Therefore, {M} is a component of G^1_k(G), so that M = ∪{M} is a type-1 k-cluster of G. Although it is possible, as in the graph of Figure 3.7, that M be properly contained in another type-1 k-cluster, it is more likely that M be a maximal type-1 k-cluster, i.e., that the clique M of G be a member of the type-1 k-clustering. Thus, each clique of two points normally constitutes an entire cluster, except in the component clustering. Similarly, a clique of i points is generally a member of the type-1 k-clustering for all k ≥ i. The type-1 intermediate clusterings consequently tend to be rather highly fragmented covers, consisting of many small clusters and a few large clusters.
a. A graph G and its cliques
b. The type-1 2-clique graph of G
c. The type-1 2-clusters of G

Figure 3.7 A Graph in Which One Type-1 2-Cluster Contains Another
The type-2 k-clusterings of a graph G correct this shortcoming by taking into account the sizes of the cliques in determining the adjacencies in the generalized clique graph of G. The type-2 k-clique graph G^2_k(G) = (V^2_k, E^2_k) has point set V^2_k = Q(G) ; if k ≥ m, the maximum clique size of G, then G^2_k(G) is totally disconnected; otherwise, a pair of distinct cliques M1 and M2 of G are adjacent in G^2_k(G) in case neither is a singleton and |M1 ∩ M2| ≥ min(k, |M1| - 1, |M2| - 1). A type-2 k-cluster is the union of all the cliques of G of a component of G^2_k(G), and the type-2 k-clustering of G consists of all the maximal type-2 k-clusters of G. A pair of cliques which meet in the maximum number of points possible, viz., one less than the number of points of the smaller, are contained in the same type-2 k-cluster for every k < m, i.e., they are in separate clusters only in the clique clustering. The type-2 1-clustering and m-clustering are the component and clique clusterings, just as those of type-1. The intermediate clusterings, however, differ, as illustrated in Figure 3.8. It is clear from the definitions of E^1_k and E^2_k that G^1_k is a subgraph of G^2_k for each value of k : if a pair of cliques of G are adjacent in G^1_k, they meet in at least k points, and so are adjacent in G^2_k. The type-2 2-clustering of the graph of Figure 3.8 is apparently superior to the type-1 2-clustering, particularly viewed as the successor of the 1-clustering.
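With the type-2 adjacency condition read as |M1 ∩ M2| ≥ min(k, |M1| - 1, |M2| - 1) for non-singleton distinct cliques and k < m (this inequality is a reconstruction from Table 3.1, the typescript formula being illegible in the scan), the adjacency test is immediate. A hedged Python sketch:

```python
def type2_adjacent(M1, M2, k, m):
    """Type-2 adjacency (reconstructed rule): distinct, non-singleton
    cliques are adjacent in the type-2 k-clique graph iff k < m and
    |M1 & M2| >= min(k, |M1| - 1, |M2| - 1)."""
    if k >= m or M1 == M2 or min(len(M1), len(M2)) < 2:
        return False
    return len(M1 & M2) >= min(k, len(M1) - 1, len(M2) - 1)
```

Two 2-point cliques meeting in one point (the maximum possible for them) are thus adjacent for every k up to m - 1, as the discussion above requires, while a 4-clique and a 3-clique meeting in one point are adjacent only for k = 1.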
(Panels: the 1-clique graph and 1-clustering, common to type-1 and type-2; the type-1 and type-2 2-clique graphs and 2-clusterings; the type-1 and type-2 3-clique graphs and 3-clusterings; the 4-clique graph and 4-clustering, common to both types.)

Figure 3.8 Type-1 and Type-2 k-Clusterings of a Graph
However, considering the 3-clustering as the predecessor of the clique clustering, the type-1 3-clustering is apparently superior to the type-2 3-clustering. Indeed, although the type-2 clustering does remedy the noted defect of the type-1 clustering, it does so by means of a complementary defect of its own. This situation is most clearly exemplified by a pair of cliques M1 and M2 of G which meet in one of the two points of M1. The type-1 k-clusterings separate M1 and M2 in every case but the component clustering, while the type-2 k-clusterings separate M1 and M2 in no case but the clique clustering. Thus, the type-1 sequence of clusterings for k = 1, 2, ..., m has a large discontinuity from k = 1 to k = 2, whereas the type-2 clusterings have a large discontinuity from k = m - 1 to k = m. Expressed differently, the type-1 intermediate clusterings form a sequence which approaches the clique clustering, while the type-2 intermediate clusterings form a sequence which approaches the component clustering. Now the general approach of both definitions provides the component clustering, the clique clustering, and at most m - 2 intermediate clusterings, with each pair of the resulting clusterings related under the refinement relation. That is, the resulting sequence of clusterings forms a chain in the lattice of clusterings from the unit to the zero of the lattice, consisting of at most m clusterings, m being the size of a largest clique of the graph. Such a
chain is generally a subsequence of a maximal chain of length greater than m. For example, the chain of type-1 clusterings of Figure 3.8 is a proper subsequence of the chain: the 1-clustering, the type-2 2-clustering, the type-1 2-clustering, the type-1 3-clustering, and the 4-clustering. Similarly, the chain of type-2 clusterings is a proper subsequence of the chain: the 1-clustering, the type-2 2-clustering, the type-2 3-clustering, the type-1 3-clustering, the 4-clustering. The defects of the type-1 and type-2 clusterings may both be characterized as a failure to satisfy a continuity criterion: each successive pair of the sequence of m clusterings should be separated by about the same number of clusterings on the maximal chain containing the chain. That is, the chain of clusterings should constitute a sequence of m - 1 equal-sized steps from the component clustering to the clique clustering. Indeed, this criterion can easily be made quantitatively precise, e.g., a chain of m clusterings in the lattice from the unit to the zero is continuous in case the root-mean-square of the m - 1 distances between successive members of the chain is minimized. (A suitable metric is given in Chapter 5.) However, the application of such a criterion would require the generation of the entire lattice of clusterings, which is contrary to the present general strategy.
The final generalization of the clique graph of a graph for the generation of clusterings is motivated by the foregoing considerations. The type-3 k-clique graph G^3_k(G) of the graph G has point set V^3_k = Q(G) ; a pair of distinct cliques M1 and M2 of G are adjacent in G^3_k(G) in case k ≤ ⌈(1 + m) |M1 ∩ M2| / |M1 ∪ M2|⌉, where m is the size of a largest clique of G and ⌈x⌉ denotes the smallest integer not less than the real number x. The factor |M1 ∩ M2| / |M1 ∪ M2| is the Tanimoto similarity [9] of the sets M1 and M2, discussed in Chapter 2, in which it was denoted S1. The adjacencies in type-1 k-clique graphs take into account only the overlap of a pair of cliques of G ; the type-2 adjacencies take into account the overlap of a pair and the size of the smaller; the type-3 adjacencies take into account the overlap and the sizes of both cliques of a pair. (The use of clique-clique similarity for combining cliques into clusters was anticipated by Gotlieb and Kumar [28], as a method for the generation of one particular intermediate clustering: a similarity threshold is applied to the clique-clique similarity matrix to define a graph whose cliques are found, etc., until the number of cliques at some level is suitable; then the procedure backs out, taking the union of the cliques at each level, until, finally, one has a cover of the point set consisting of a suitable number of subsets of points.) Suppose M1, M2 ∈ Q(G) and M1 ∩ M2 = ∅. Then S1(M1, M2) = 0, so for no k = 1, 2, ... is M1M2 ∈ E^3_k(G) ; that is,
disjoint cliques are nonadjacent in each type-3 k-clique graph. Suppose now that M1 ∩ M2 ≠ ∅. Then 0 < (1 + m) S1(M1, M2), so that 1 ≤ ⌈(1 + m) S1(M1, M2)⌉, i.e., M1 and M2 are adjacent in the type-3 1-clique graph. Therefore, the type-3 1-clustering is again the component set of G. It is clear that the greatest possible similarity S1(M1, M2) of a pair of cliques is (m - 1) / (m + 1), corresponding to the overlap of a pair of largest cliques in m - 1 points. In this case, ⌈(1 + m) S1(M1, M2)⌉ = ⌈(1 + m)(m - 1) / (m + 1)⌉ = ⌈m - 1⌉ = m - 1 < m. Thus, the type-3 m-clustering is again the clique clustering. In spite of the fact that the chain of type-3 intermediate clusterings is, loosely speaking, intermediate to those of type-1 and type-2, it is not in general the case that for each k, G^2_k(G) is a subgraph of G^3_k(G), nor that G^3_k(G) is a subgraph of G^2_k(G). A counterexample to the first is a graph G of two cliques, M1 and M2, each of 9 points, meeting in 3 points. In this case, M1M2 ∈ E^2_3(G) but M1M2 ∉ E^3_3(G), since 3 > ⌈(9 + 1)(3) / (15)⌉ = ⌈2⌉ = 2. A counterexample to the latter is a graph G having a pair M1 and M2 of cliques of 3 points, meeting in one point, and a maximum clique size m = 5. Since 2 ≤ ⌈(5 + 1)(1) / (5)⌉ = ⌈6/5⌉ = 2, M1M2 ∈ E^3_2(G) ;
however, since |M1 ∩ M2| = 1 < 2 = min(k, |M1| - 1, |M2| - 1) for k = 2, M1M2 ∉ E^2_2(G). To illustrate the differences among the three types of clusterings, a graph G of thirteen cliques of two, three, and four points is given in Figure 3.9. For each of the three types of generalized clique graphs, it is obviously the case that each line of the (k+1)-clique graph is a line of the k-clique graph; i.e., for each k = 1, 2, ..., m-1, E^i_(k+1)(G) ⊆ E^i_k(G) for i = 1, 2, 3. Consequently, the generalized clique graphs may be conveniently specified by giving, for each pair of nondisjoint cliques, the maximum value of k for which the pair are adjacent in the k-clique graph. This manner of specification is used in Table 3.1 for the sequences of generalized clique graphs of the graph G of Figure 3.9, for each of the three definitions. Each row of the table corresponds to a pair of cliques of G having at least one point in common. The pair of cliques are identified according to their designations in Figure 3.9; the sizes of the cliques are given along with their names. Following is the overlap of the pair, i.e., the number of points in common. Finally, the three numbers k1, k2, and k3 indicate the greatest values of k for which the pair of cliques are adjacent in the three types of generalized clique graphs. In the first row, for example, one reads that cliques M1 and M2 are cliques of two points and that they have
Figure 3.9 A Graph for the Illustration of the Differences Among Type-1, Type-2, and Type-3 Clusterings
Table 3.1 Adjacencies in the Generalized Clique Graphs

                                             Largest k for which Mi and Mj are
  Clique Mi     Clique Mj     Overlap        adjacent in the k-clique graph
  i    |Mi|     j    |Mj|     |Mi ∩ Mj|      Type-1    Type-2    Type-3
  1     2       2     2          1              1         3         2
  2     2       3     3          1              1         3         2
  3     3       4     3          2              2         3         3
  4     3       5     3          1              1         1         1
  6     3       7     4          2              2         3         2
  7     4       8     2          1              1         3         1
  7     4       9     3          1              1         1         1
 10     4      11     4          2              2         2         2
 11     4      12     4          1              1         1         1
 12     4      13     4          3              3         3         3
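The three columns of the table can be recomputed from the definitions. The sketch below (illustrative Python, not part of the original; the type-2 rule used, |M1 ∩ M2| ≥ min(k, |M1| - 1, |M2| - 1) for non-singleton cliques, is itself an inference from the table) returns, for a pair of cliques, the largest k for which the pair is adjacent in each of the three generalized clique graphs.

```python
def largest_k(M1, M2, m):
    """Largest k for which M1 and M2 are adjacent in the type-1, type-2,
    and type-3 k-clique graphs (0 means never adjacent); m is the size
    of a largest clique of the graph."""
    inter = len(M1 & M2)
    k1 = inter                                  # type-1: overlap alone
    if min(len(M1), len(M2)) < 2:
        k2 = 0                                  # singletons never type-2 adjacent
    else:
        k2 = max([k for k in range(1, m)
                  if inter >= min(k, len(M1) - 1, len(M2) - 1)], default=0)
    # type-3: ceil((1 + m) * |M1 & M2| / |M1 | M2|), by integer arithmetic
    k3 = -((-(1 + m) * inter) // len(M1 | M2))
    return k1, k2, k3
```

With m = 4 this reproduces, for example, the first row of the table (two 2-cliques meeting in a point give 1, 3, 2) and the last (two 4-cliques meeting in three points give 3, 3, 3).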
one point in common. Since 1 = |M1 ∩ M2| < 2, M1 and M2 are adjacent only in the type-1 1-clique graph. Since |M1 ∩ M2| = |M1| - 1, M1 and M2 are adjacent in the type-2 k-clique graphs for k ≤ 3 = m - 1. Since (m + 1) S1(M1, M2) = 5 (1/3) and 1 < 5/3 < 2, M1 and M2 are adjacent in the type-3 k-clique graphs for k ≤ 2. The generalized clique graphs specified by the table are shown in Figure 3.10 for k = 2 and k = 3 = m - 1, the intermediate cases. As expected, the type-3 2-clique graph closely resembles the type-2 2-clique graph; the latter has one additional line. Similarly, the type-3 (m-1)-clique graph differs little from the type-1 (m-1)-clique graph; the former has one additional line. The intermediate clusterings under the three definitions are indicated in Figures 3.11, 3.12, and 3.13. The 1-clusterings and m-clusterings are just the component and clique clusterings for each type; since these are given in Figure 3.9, they are not repeated here. That the type-3 chain of clusterings is the smoothest sequence of the three from the component clustering to the clique clustering is apparent from the figures. A quantitative indication of this may be seen in Figure 3.14, which plots the numbers of k-clusters as a function of k for each of the three definitions. The number of type-3 clusters is more nearly a linear function of the parameter k than is either of the other two. Predictably, the number of type-1 clusters experiences a large jump between
(The type-1, type-2, and type-3 graphs are shown for k = 2 and k = 3 = m - 1.)

Figure 3.10 The Three Types of Intermediate Generalized Clique Graphs of the Graph of Figure 3.9
a. k = 2
b. k = 3

Figure 3.11 The Type-1 Intermediate Clusterings of the Graph of Figure 3.9
a. k = 2
b. k = 3

Figure 3.12 The Type-2 Intermediate Clusterings of the Graph of Figure 3.9
a. k = 2
b. k = 3

Figure 3.13 The Type-3 Intermediate Clusterings of the Graph of Figure 3.9
Figure 3.14 Numbers of Clusters of the Graph of Figure 3.9 Versus the Parameter k

k = 1 and k = 2 ; similarly, the number of type-2 clusters has a large gap between k = m - 1 and k = m. The superiority of the type-3 clusterings over those of type-1 and type-2 is due principally to the fact that the similarities of the pairs of cliques determine adjacency in the type-3 generalized clique graph, whereas the other types utilize less information for that purpose. Indeed, the definition of type-3 k-clusterings provides the key to an
even more specific, and final, definition of clusterings of a graph. Consider a pair M1 and M2 of cliques of a graph G, and suppose that M1 and M2 are adjacent in the type-3 k-clique graph of G. Then k ≤ ⌈(m + 1) S1(M1, M2)⌉, where m is the number of points of a largest clique of G. Thus, k - 1 < (m + 1) S1(M1, M2), or (k - 1) / (m + 1) < S1(M1, M2). For k = 1, 2, ..., m, define tk = (k - 1) / (m + 1) ; then 0 = t1 < t2 < ... < tm = (m - 1) / (m + 1), the greatest similarity possible between a pair of sets, each having no more than m points. Consequently, the type-3 k-clique graph may be defined alternatively as follows. The clique tk-graph H_tk(G) of a graph G has the clique set Q(G) as its point set, with a pair of distinct cliques of G adjacent in H_tk(G) in case the similarity of the pair exceeds tk. This generalizes immediately, by relaxing the range of choice for the threshold, to the clique t-graph H_t(G), where t is a number from the unit interval. There are, however, only finitely many distinct pairs of cliques of G ; therefore, the set T(G), consisting of zero and S1(M, M') for each pair of distinct cliques M and M' of G, is a nonempty finite set. Let the members of T(G) be ordered according to size: T(G) = {s0, s1, s2, ..., sr}, with s_(i-1) < s_i for each i = 1, 2, ..., r. Clearly,
for any s ≥ sr, H_s(G) is totally disconnected. More generally, if s_i ≤ s < s_(i+1), then H_s(G) = H_si(G). Therefore, H(G) = {H_s(G) : 0 ≤ s ≤ 1} = {H_s(G) : s = s0, s1, ..., sr}. For a given threshold s ∈ I = {x : 0 ≤ x ≤ 1}, an s-similarity cluster of G is the union of the cliques of G which constitute a component of the clique s-graph H_s(G) of G. The s-similarity clustering of G consists of the maximal s-similarity clusters of G. This last stipulation assures, as usual, that inefficient covers of V are excluded. The collection Δ(G) of all s-similarity clusterings of G is a finite collection, since A ∈ Δ(G) implies that A is the s-similarity clustering of G for some s = s_i ∈ T(G), a finite set. If s_i, s_j ∈ T(G) and i ≤ j, then H_sj(G) is a subgraph of H_si(G) ; consequently, the s_j-similarity clustering refines the s_i-similarity clustering. Thus, each pair of members of Δ(G) is related under the refinement relation, i.e., the refinement relation linearly orders the collection Δ(G). In particular, the 0-similarity clustering is the type-3 1-clustering, and thus is the component clustering. The sr-similarity clustering, where sr is the largest member of T(G), is the clique clustering, since H_sr(G) is totally disconnected. Consequently, the component and clique clusterings belong to Δ(G). Moreover, if 0 ≤ t ≤ sr, H_sr(G) is a subgraph of H_t(G), which is a subgraph of H_0(G) ; therefore, every
PAGE 98
member of Ω(G) refines the component clustering of G and is refined by the clique clustering of G. Thus, if A is in Ω(G), then A is an efficient cover; A is a union of cliques; and K(G) <= A <= M(G). Therefore, (Ω(G), <=) is a chain in the lattice (Γ(G), <=) joining the unit and the zero of the lattice, i.e., a chain of clusterings from the component clustering to the clique clustering.

It will be recalled that some members of Γ(G) are plainly unsatisfactory, as for example, the cover of Figure 3.15.

[Figure 3.15: a drawing of a graph and an unsatisfactory cover of its point set.]

Figure 3.15 An Unsatisfactory Cover Belonging to Γ(G)

The development of Ω(G), to which the cover of the figure does not belong, enables the precise identification of the reasons behind the defect: a pair of cliques of similarity 1/5 are grouped together, while a pair of cliques of greater similarity, 1/2, are not merged into one cluster. Such a grouping is clearly arbitrary and unjustified, in that it requires that certain available information, the clique-clique similarities, be ignored.
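The two defining operations, thresholding the clique graph and merging its components, are straightforward to sketch in code. The following Python sketch (function names are illustrative) computes the s-similarity clustering from a given clique set; it assumes that the similarity S1 of two cliques is the ratio of the sizes of their intersection and their union, an assumption consistent with the values quoted in this section (for instance, a greatest possible similarity of (m - 1)/(m + 1) between cliques of at most m points).

```python
from itertools import combinations

def similarity(a, b):
    """Clique-clique similarity; assumed here to be |A & B| / |A | B|."""
    return len(a & b) / len(a | b)

def s_similarity_clustering(cliques, s):
    """Union the cliques in each component of the clique s-graph Hs(G),
    then keep only the maximal unions (an efficient cover)."""
    cliques = [frozenset(m) for m in cliques]
    parent = list(range(len(cliques)))          # union-find over clique indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if similarity(cliques[i], cliques[j]) > s:   # adjacency in Hs(G)
            parent[find(i)] = find(j)

    comps = {}
    for i, m in enumerate(cliques):
        comps.setdefault(find(i), set()).update(m)   # union over a component
    clusters = list(comps.values())
    # retain only the maximal clusters, so that no cluster is wasted inside another
    return [c for c in clusters if not any(c < d for d in clusters)]
```

With threshold s = 0 this yields the component clustering, and with s at or above the largest clique-clique similarity it yields the clique clustering, in agreement with the chain described above.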
Referring back to Table 3.1 and Figure 3.10, one sees that the type-1 and type-2 k-clusterings both have this flaw. Cliques M10 and M11 are adjacent in the type-1 2-clique graph, but M1 and M2 are not, although S(M10, M11) = 1/3 = S(M1, M2). Similarly, M7 and M9 are adjacent in the type-2 3-clique graph, but M10 and M11 are not, although S(M10, M11) = 1/3 > 1/5 = S(M7, M9). Thus, just as the restriction from Ξ(G) to Γ(G) was based on the elimination of arbitrariness relative to the points of G, so the restriction from Γ(G) to Ω(G) is based on the elimination of arbitrariness relative to the cliques of G.

The foregoing considerations justify a more specific definition of the clusterings of a graph G, viz., the s-similarity clusterings for thresholds s in I. That is, the set of all clusterings of G is precisely Ω(G). As illustrated above, a type-1 or type-2 k-clustering may not qualify as a clustering under this more specific definition. It is contended that this definition is valid in the sense that it takes into account all the information available within the context of the problem and it imposes no arbitrary conditions. The context of the problem is that a graph G = (V, E) is given, from which it is required to generate efficient covers of V. As previously discussed, a cover A of V which does not refine K(G), or which is not refined by M(G), is not generated from the given information.
The necessity to generate covers only from the given information also requires the exclusion of nonmembers of Γ(G) from consideration. And, finally, it is that same requirement that necessitates the deletion from Γ(G) of nonmembers of Ω(G). Therefore, the degree of specificity of the definition is fully justified, i.e., each condition of the definition is logically required by the nature of the problem. On the other hand, a more specific definition would require an additional condition. Such a condition would eliminate members of Ω(G), which is a chain or tower of covers corresponding to different clique similarity thresholds. Evidently, such a condition is not to be inferred within the context of the problem. Rather, it must be derived from considerations external to the information implicit in G = (V, E); that is, any preference criterion must be determined according to the particular application of the cluster analysis of the graph.

The type-3 k-clusterings, it will be recalled, are precisely the tk-similarity clusterings, with tk = (k - 1)/(m + 1). Thus, each type-3 k-clustering qualifies as a clustering, i.e., belongs to Ω(G). Let Λ(G) denote the collection of all the type-3 k-clusterings of G. Since the type-3 1-clustering and m-clustering are the component and clique clusterings, (Λ(G), <=) is a subchain of (Ω(G), <=) including the greatest and least members of Ω(G). The sequence of thresholds defining the
subset Λ(G) of Ω(G) is 0, 1/(m+1), 2/(m+1), ..., (m-1)/(m+1) = tm, which partitions the interval [0, tm] into m - 1 equal intervals. Therefore, since m is generally much smaller than |T(G)|, and |Λ(G)| is correspondingly smaller than |Ω(G)|, (Λ(G), <=) may be regarded as an approximation to (Ω(G), <=).

Indeed, Λ(G) may be presumed to be the best approximation to Ω(G) by a subchain of Ω(G) having |Λ(G)| members and containing K(G) and M(G), in the sense of continuity which motivated the definition of the type-3 k-clusterings, because of the character of the sequence of thresholds defining Λ(G). That sequence partitions the interval [0, tm] into equal intervals, with tm = (m - 1)/(m + 1) the greatest possible similarity between a pair of cliques; consequently, tm is the smallest threshold t for which it can be assured that the clique t-graph Ht(G) is totally disconnected, and, therefore, that M(G) is in Λ(G).

It is the consideration of computational efficiency which motivates the interest in the possible use of Λ(G) rather than Ω(G). The number of clique t-graphs Ht(G), t in T(G), required for the production of Ω(G) is |T(G)|; this number can be much larger than m, the maximum clique size, which is the number of type-3 k-clique graphs required for the production of Λ(G). Two upper bounds may be given for |T(G)|. The first is n1 = |M(G)| (|M(G)| - 1)/2, the number of pairs of
distinct cliques of G. The second bound depends on the maximum clique size, m. If m = 2, T(G) is contained in {0, 1/3}; if m = 3, T(G) is contained in {0, 1/5, 1/4, 1/3, 1/2}. By such an enumeration, with repetitions deleted, the number n2(m) of distinct similarity values among cliques whose maximum size is m may be found: n2(4) = 9 and n2(5) = 14. These figures suggest that n2(m) satisfies the difference equation n2(m + 1) - n2(m) = m + 1, whose solution is n2(m) = (1/2)(m + 2)(m - 1). Combining these two bounds, one may say only that |T(G)| is roughly proportional to one-half the square of the minimum of the number of cliques and the maximum clique size. It is easy to see that if one has, for example, a graph with several hundred cliques ranging in size up to twenty or so points, then one might hesitate to generate the entire sequence of, say, two hundred clique t-graphs; the effort required might well be deemed a trifle extravagant.

Besides the possibly prohibitively large number of thresholds in T(G), there is a second major efficiency consideration relative to the alternatives of generating all of Ω(G) or only Λ(G). |Ω(G)| may be much smaller than |T(G)|, since the additional lines of Hsi(G) not in Hs(i+1)(G) may join pairs of cliques of G such that each clique of such a pair is in the same component of Hs(i+1)(G). That is, although Hs(i+1)(G) is a proper subgraph of Hsi(G), these two graphs may have the same component sets.
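The enumeration behind the second bound is easy to mechanize. The sketch below (Python; it again assumes the intersection-over-union form of the similarity, and uses the fact that distinct cliques are never nested, so that cliques of a and b points overlap in fewer than min(a, b) points) reproduces the figures quoted above.

```python
from fractions import Fraction

def n2(m):
    """Number of distinct similarity values (zero included) among
    cliques of at most m points, similarity taken as |A & B| / |A | B|."""
    values = {Fraction(0)}
    for a in range(1, m + 1):          # size of the smaller clique
        for b in range(a, m + 1):      # size of the larger clique
            for i in range(1, a):      # overlap; i < a since distinct cliques are not nested
                values.add(Fraction(i, a + b - i))
    return len(values)
```

For m = 2 through 5 this yields 2, 5, 9, 14, matching the values quoted in the text and the closed form (1/2)(m + 2)(m - 1).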
On the contrary, |Λ(G)| is likely to be equal to, or only slightly less than, m, the number of type-3 k-clique graphs. The reason is that the progression from the type-3 (k+1)-clique graph to the type-3 k-clique graph corresponds to a threshold change of 1/(m + 1) times the whole range of clique-clique similarities, which may include several members of T(G); consequently, the differential number of lines is generally larger, and the likelihood that the two graphs have the same component sets is correspondingly smaller. This second efficiency consideration may thus be stated as follows: |Ω(G)| / |T(G)| is generally less than, and may be much less than, |Λ(G)| / m; which means that the number of generalized clique graphs processed may be reduced substantially, while the number of clusterings generated is reduced only slightly, by generating Λ(G) rather than Ω(G).

3.4. Concluding Remarks

A definition of the clusterings of a graph, i.e., efficient covers of the point set of the graph consisting of clusters, has been developed from the following premises and principles. (1) The component set and the clique set of the graph constitute clusterings. (2) Any clustering must refine the component set and be refined by the clique set. (3) A clustering must be defined in terms of the given information, i.e., the adjacencies of the points and the
identities of the cliques, which is implicit in the adjacency information. (4) A clustering may not be formed by the evasion of any of the given information.

The clusterings under the resulting definition were seen to be linearly ordered by the refinement relation. It was demonstrated that any cover excluded by the definition contained an element of arbitrariness in its formulation, and that no clustering could be excluded on the basis of the given information and the premises of the definition. In the interest of computational efficiency, a more restrictive definition was suggested, the clusterings of which form a subchain of the chain of all clusterings. This particular subchain was shown to form a reasonably uniform sequence of transitions from the clique set to the component set, and thus to constitute a good approximation of the chain. The algorithmic implementation of this latter definition is given in Chapter 5.
CHAPTER 4
CLIQUE DETECTION ALGORITHMS

4.1. Introduction

The identification of the cliques of a graph, i.e., the point sets of the maximal complete subgraphs of the graph, is fundamental to the definition of clusterings of a graph given in Chapter 3. Since the present development of graph-theoretical cluster analysis is intended for application to large sets of objects, the efficiencies of techniques of clique identification are of primary concern. Consequently, this entire chapter is devoted to the issue of clique detection algorithms.

A clique detection algorithm is an algorithm which produces the collection M(G) of all the cliques of a graph G = (V, E), given certain information about G. Normally, the clique set M(G) of G is constructed from the adjacency information of the graph, i.e., the line set E. The next section reviews some of the algorithms which have been described in the literature. The following two sections present two new algorithms developed by the author. The first of these is a recursive algorithm whose input is just the adjacency information, as is usual. The second of these algorithms is intended for application to a particular
circumstance, as indicated by the input information required: the cliques of G = (V, E) are produced from the clique set of G' = (V, E'), with E' a subset of E, together with the line set difference E - E'.

4.2. Review of Selected Algorithms

The problem of devising an algorithm for the identification of the cliques of any given graph G = (V, E) presents no theoretical difficulty. For example, an algorithm can be constructed directly from the definition of cliques: each of the 2^p - 1 nonempty subsets of V, where p = |V|, is considered in turn. If any pair of the points of the subset are nonadjacent, the subset is rejected; otherwise it is retained. Finally, delete from the collection of those subsets retained each subset which is contained in some other retained subset.

An altogether different algorithm can be constructed from the following obvious equivalence: the point set M of a complete subgraph of G = (V, E) is a clique of G if and only if there is no point u in V - M such that, for all v in M, uv is in E. Isolated points are first set aside, and then one begins with the collection of all complete subgraphs of two points, i.e., the set E. Consider each set M of the collection in turn; form the intersection of the deleted neighborhoods of all the points of the set. If that intersection is empty, the set M is retained in the collection (it is a clique, since no nonmember is adjacent
to each member). Otherwise, the set is deleted from the collection, and, for each point u in the intersection, the set M ∪ {u} is retained in the next collection. If, after all sets of the collection have been processed, the next collection is nonempty, repeat the processing on each of the sets of the next collection. When the next collection is empty, the clique set is the union of all the generated collections.

These algorithms illustrate not only the facility with which clique detection algorithms may be devised, but also the practical insufficiency of such algorithms designed without consideration of algorithm efficiency. The inefficiency of the first algorithm is obvious: the number of operations required by it is on the order of 2^p. The first algorithm would thus execute something like 30,000 operations to find the cliques of a graph of only 15 points. The hopeless inefficiency of the second algorithm becomes apparent when one realizes that all complete subgraphs are generated during the process of finding the maximal complete subgraphs. The number of nontrivial complete proper subgraphs of a clique M of r points is the sum of the combinations of r things taken 2 at a time, 3 at a time, ..., r - 1 at a time, i.e., the sum of C(r, i) for i = 2, 3, ..., r - 1, which is 2^r - r - 2, and thus nearly 2^r for all but very small |M|. Thus, for graphs which have cliques of 15
or more points, the second algorithm requires tens of thousands of operations; consequently, this algorithm is also impractical for all graphs except those of quite modest size.

The impracticability of the two above algorithms is the consequence of the naive approach taken to the problem of designing those algorithms, viz., to implement the definition of cliques in a procedure. The approach of Harary and Ross [20] was a combination of matrix theory and certain theorems. The theorems were principally concerned with the existence and the identification of points belonging to only one clique, and the identification of such cliques. This algorithm is mentioned here, not because it is efficient, but because it contributes to the general demonstration of the widely differing approaches possible. It also has a particular virtue that no other algorithm can claim: it was the first solution to the clique detection problem.

The approach which has led to the most efficient algorithms is as follows. First, one relates the cliques of a nontrivial graph G to those of the subgraph G - v obtained by the removal of a point v from G; i.e., one provides a theorem which produces the cliques of G from those of G - v. The algorithm then amounts to repeated application of the theorem to a certain tower of subgraphs of the graph whose cliques are required: the first is a single point; each successive subgraph has one additional
point, and all the lines of the graph which join points in the subgraph; finally, the last subgraph is the graph itself. This approach, which may be termed the Point Removal method, was taken by Bednarek and Taulbee [30], and by E. Bierstone, whose algorithm is given by Augustson and Minker [29]. Experiments reported by Augustson and Minker [29] demonstrate that the Bierstone algorithm outperforms the Bonner algorithm [26], one of the earliest clique detection algorithms. (Indeed, it is stated in [29] that the Bierstone algorithm "appears to be the most efficient one presently available.") The Bonner algorithm is the result of a straightforward approach to the problem; it generates far too many nonmaximal complete subgraphs to be practical for large graphs.

The approaches taken in the next two sections, in which algorithms are developed, will be seen to contribute further evidence in support of the claim that many quite widely differing approaches to the problem of clique detection are possible. To resume the review, however, the Point Removal method will now be described in detail.

4.2.1. Point Removal Definitions

Let G = (V, E) be a nontrivial graph with p points, v be a particular point of G, and G' = (V', E') = G - v. For brevity, the neighborhood N(v) will be denoted R; that is, R consists of the point v and all points
adjacent to v. Let L and L' denote the clique sets of G and G', respectively. Let A = {M: M ∈ L' and M ⊂ R}; B = L' - A, i.e., B = {M: M ∈ L' and M ⊄ R}; C = {M ∪ {v}: M ∈ A}; and D = {(M ∩ R) ∪ {v}: M ∈ B}. Let F be the collection of those elements of D which are maximal with respect to set-theoretic inclusion, i.e., F = {M: M ∈ D, and if M ⊂ M' ∈ D then M = M'}; and H = {M: M ∈ F, and if M' ∈ C then M ⊄ M'}. Let X = C ∪ D; and let Y consist of the maximal members of X, i.e., Y = {M: M ∈ X, and if M ⊂ M' ∈ X then M = M'}.

4.2.2. Point Removal Theorems

A theorem is developed below which is a suitable basis for a clique detection algorithm. The proofs of the theorem and its supporting lemmas are given in Appendix A. The theorem is a consequence of three lemmas. The first lemma states that a clique of G is a clique of G' if and only if it does not contain the point v, and that if it does contain v, then it is the union of the singleton v with the intersection of some clique of G' and the neighborhood of v. (This proposition is Theorem 1 of Bednarek and Taulbee [30].) The second asserts that a clique of G' which is not contained in the neighborhood of v is a clique of G, and that if a clique of G' is contained in the neighborhood of v, then the union of the singleton v and that clique of G' is a clique of G. The third
lemma states that any clique of G other than those in B or C is a member of D.

LEMMA 4.1 Suppose M ∈ L; if v ∉ M, then M ∈ L'; if v ∈ M, then M ∉ L' and there exists an M' ∈ L' such that M = {v} ∪ (R ∩ M').

LEMMA 4.2 B ∪ C ⊂ L, and B ∩ C = ∅.

LEMMA 4.3 L - (B ∪ C) ⊂ D.

The following theorem completes the characterization of the cliques of G in terms of those of G' by specifying just which of the members of D are cliques of G, namely, those which are maximal members of D and which are not contained in any members of C.

THEOREM 4.1 L = B ∪ C ∪ H, a disjoint union.

The theorem establishes the sufficiency of deleting the nonmaximal members from D, and those members which are contained in members of C, in order to complete L from B ∪ C. The graphs of Figure 4.1 demonstrate the necessity of both operations.
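The sets of Theorem 4.1 are simple to compute. A minimal Python sketch follows (the function name and the set-of-frozensets representation are illustrative only); the check applies it to the graph G1 of Figure 4.1 below, for which L' = {{w, x}, {x, y}} and R = {v, x, y}, and recovers L = {{w, x}, {v, x, y}}.

```python
def theorem_4_1(L_prime, R, v):
    """Cliques L of G from the cliques L' of G - v and the
    closed neighborhood R = N(v), per Theorem 4.1: L = B u C u H."""
    L_prime = [frozenset(M) for M in L_prime]
    R = frozenset(R)
    B = [M for M in L_prime if not M <= R]               # cliques not inside R
    C = [M | {v} for M in L_prime if M <= R]             # cliques inside R, plus v
    D = {(M & R) | {v} for M in B}                       # candidates containing v
    F = [M for M in D if not any(M < M2 for M2 in D)]    # maximal members of D
    H = [M for M in F if not any(M <= M2 for M2 in C)]   # drop members inside C
    return B + C + H
```

On G1 the candidate {v, x} survives into F but is properly contained in {v, x, y} of C, so it is correctly discarded from H, as the discussion of the figure explains.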
[Figure 4.1: two graphs, G1 and G2, each consisting of the points w, x, y (and, for G2, z) together with the point v.]

Figure 4.1 Counterexamples to L = B ∪ C ∪ D

Consider first the graph G1 of the figure. L' = {{w, x}, {x, y}}; R = {v, x, y}; A = {{x, y}}; B = {{w, x}}; C = {{v, x, y}}; and F = D = {{v, x}}. But {v, x} ∉ L, although it is a maximal member of D, because it is properly contained in {v, x, y} ∈ C ⊂ L.

Consider now the graph G2 of the figure. L' = {{w, x}, {x, y, z}}; R = {v, x, y}; C = A = ∅; B = {{w, x}, {x, y, z}}; and D = {{x, v}, {y, x, v}}. M = {x, v} ∈ D, and M is contained in no member of C (now C = ∅); but M is not a clique of G2: M is properly contained in {y, x, v} ∈ H ⊂ L. Therefore, Theorem 4.1 can be strengthened neither by replacing "H" by the maximal members of D, nor by replacing it by those members of D not contained in members of C.
Theorem 4.1 is essentially Theorem 2 of Bednarek and Taulbee [30] (which is the basis of the algorithm given therein), which states that L = B ∪ C ∪ (D ∩ Y).

Theorem 4.1 is also the principal theorem underlying the Bierstone algorithm. The verification of this statement, however, is difficult for three reasons. The first is that the Bierstone algorithm is presented by Augustson and Minker [29] (apparently, the only publication of the algorithm) in a form which, although quite suitable for computer programming implementation in certain languages, obscures the premises of the algorithm. The second reason is that the theoretical base of the algorithm is not given by Augustson and Minker [29] (apparently, the basis for the algorithm remains unpublished). The third reason for the difficulty is that the Bierstone algorithm [29] is further complicated by its testing for, and special treatment of, certain special case conditions. The fact remains, however, that Theorem 4.1 is the fundamental premise of both of these Point Removal clique detection algorithms. The algorithm given below is the implementation of that theorem.

4.2.3. A Point Removal Algorithm

It is required to find the cliques of G = (V, E), with V = {v1, v2, ..., vp}. For each i = 1, 2, ..., p, let Ri = {vi} ∪ {v: v ∈ V, vvi ∈ E} and Gi = (Vi, Ei),
where Vi = {v1, v2, ..., vi} and Ei = {xy: x ∈ Vi, y ∈ Vi, and xy ∈ E}; and, finally, let Li denote the family of cliques of Gi.

ALGORITHM 4.1

Step 1. Initialize L1 = {{v1}}. Set k = 1, and proceed.

Step 2. Add one to k. If k > p, then halt; L = Lp is the class of cliques of G. Otherwise, proceed.

Step 3. Initialize B = C = D = ∅. For each M ∈ L(k-1): if M ⊄ Rk, then replace B by B ∪ {M} and D by D ∪ {(M ∩ Rk) ∪ {vk}}; otherwise, replace C by C ∪ {M ∪ {vk}}. That is, compute B = {M: M ∈ L(k-1), M ⊄ Rk}, C = {M ∪ {vk}: M ∈ L(k-1), M ⊂ Rk}, and D = {{vk} ∪ (M ∩ Rk): M ∈ L(k-1), M ⊄ Rk}.

Step 4. Derive F from D by deleting from D members properly contained in other members.

Step 5. Delete from F each member which is a subset of some member of C, resulting in H. Assign to Lk the union of H, B, and C. Go to Step 2.
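Algorithm 4.1 can be transcribed into Python almost line for line. In the sketch below (names illustrative), the closed neighborhoods may be supplied for the whole graph rather than for each Gk, since every M under consideration already lies within {v1, ..., v(k-1)}; the check runs the algorithm on the graph of Figure 4.2, whose closed neighborhoods Rk are those listed in Table 4.1, and recovers the clique set of the final row.

```python
def algorithm_4_1(vertices, edges):
    """Point Removal clique detection: build L_k for G_k = G[{v1..vk}]
    by one application of Theorem 4.1 per added point."""
    adj = {u: {u} for u in vertices}             # closed neighborhoods R_i
    for x, y in edges:
        adj[x].add(y)
        adj[y].add(x)
    L = [frozenset([vertices[0]])]               # Step 1: L_1 = {{v1}}
    for vk in vertices[1:]:                      # Steps 2 through 5
        Rk = adj[vk]
        B = [M for M in L if not M <= Rk]
        C = [M | {vk} for M in L if M <= Rk]
        D = {(M & Rk) | {vk} for M in B}
        F = [M for M in D if not any(M < M2 for M2 in D)]    # Step 4
        H = [M for M in F if not any(M <= M2 for M2 in C)]   # Step 5
        L = B + C + H
    return L
```

Each pass performs exactly the Step 3 partition of L(k-1) into B and C, forms the candidate set D, and prunes it to H; the loop variable plays the role of the counter k.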
To illustrate the action of Algorithm 4.1, it has been applied to the graph of Figure 4.2. The results are given in Table 4.1, each row of which gives the values of the computed entities on entry to Step 2 of the algorithm. (To conserve space in the table, subsets of the point set {1, 2, ..., 9} are denoted by character strings with brackets omitted.)

[Figure 4.2: a graph on the nine points 1, 2, ..., 9.]

Figure 4.2 A Graph for Clique Detection Algorithm Illustration
Table 4.1 Algorithm 4.1 Applied to the Graph of Figure 4.2

 k  Rk      B           C             D        F        H    Lk
 1  1239                                                     1
 2  123     ∅           12            ∅        ∅        ∅    12
 3  123     ∅           123           ∅        ∅        ∅    123
 4  468     123         ∅             4        4        4    123, 4
 5  589     123, 4      ∅             5        5        5    123, 4, 5
 6  46789   123, 5      46            6        6        ∅    123, 46, 5
 7  6789    123, 46, 5  ∅             7, 67    67       67   123, 46, 5, 67
 8  456789  123         468, 58, 678  8        8        ∅    123, 468, 58, 678
 9  156789  123, 468    589, 6789     19, 689  19, 689  19   123, 468, 589, 6789, 19
4.3. The Neighborhood Approach to Clique Detection

The approach to clique detection developed in this section is similar to the Point Removal method in that it begins with a theorem relating the cliques of G = (V, E) to those of G - v. This theorem is then generalized to relate the cliques of G to those of G[V - S(v)], the subgraph of G generated by V - S(v), where S(v) is a certain set of points including v and possibly others. The resulting algorithm differs fundamentally from the Point Removal algorithms, such as Algorithm 4.1, in that it is recursive. Also, the new algorithm proceeds from G downward through a tower of subgraphs of G, rather than upward to G through a tower of its subgraphs, as is the case with Algorithm 4.1. Because the key theorem may permit the removal of more than a single point in the progression from one subgraph of the tower to the next, the number of subgraphs of the tower may be less than |V|, which is the number of subgraphs of the tower of the Point Removal method.

4.3.1. Special Definitions

To facilitate the statement of the theorems, the definitions of some additional concepts and symbols are required. Let u be any point of the graph G = (V, E). The neighborhood N(u) of u is that subset of V consisting
of u and all points of G adjacent to u. The neighborhood subgraph H(u) of G at u is the subgraph of G generated by N(u), i.e., H(u) = G[N(u)]. The clique set of H(u) is denoted by N(u). The scope S(u) of u is that subset of V consisting of all those points w such that N(w) ⊂ N(u).

In the following subsection, G is assumed to be nontrivial, and v is some particular point of G. L, R, and N denote, respectively, the clique sets of G, G - v, and H(v); and S = {M: M ∈ R, and if M ⊂ M', then M' ∉ N}. If S(v) ≠ V, Q denotes the clique set of G[V - S(v)]; if S(v) = V, then Q = ∅. P denotes the set {M: M ∈ Q, M ⊄ N(v)}.

4.3.2. Neighborhood Clique Detection Theorems

These propositions build up to the main result, the second of two theorems, which serves as the basis of a new clique detection algorithm. The proofs of the two theorems and their supporting lemmas are given in Appendix B. Theorem 4.2 below relates the cliques of G to those of G - v; specifically, it states that the cliques of G consist of those of the neighborhood subgraph H(v) of G at v, together with those of G - v which are not subsets of those of H(v). Lemmas 4.4, 4.5, 4.6, and 4.7 state certain simple properties upon which the theorem depends: a clique of G containing v is a clique of H(v); each
clique of H(v) contains v; each clique of H(v) is a clique of G; and each clique of G is either a clique of H(v) or a clique of G - v.

LEMMA 4.4 If v ∈ M ∈ L, then M ⊂ N(v) and M ∈ N.

LEMMA 4.5 If M ∈ N, then v ∈ M.

LEMMA 4.6 N ⊂ L.

LEMMA 4.7 L ⊂ N ∪ R.

THEOREM 4.2 L = N ∪ S, a disjoint union.

In view of Lemmas 4.6 and 4.7, which assert that N ⊂ L ⊂ N ∪ R, one might wonder whether or not Theorem 4.2 could be strengthened to read "L = N ∪ R," thereby eliminating the effort required to compute S from N and R. That such is not the case is illustrated by the following counterexample: G is the complete graph on the three points u, v, and w; H(v) = G; L = N = {V} = {{u, v, w}}; R = {{u, w}}; S = ∅; thus, while L = N ∪ S, L ≠ N ∪ R.
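The counterexample is small enough to check mechanically. In the Python fragment below, S is computed from R and N exactly as defined above (a member of R survives into S only if no proper superset of it lies in N).

```python
# Triangle on u, v, w: H(v) = G, so N = {{u, v, w}} and R = {{u, w}}.
N = [{'u', 'v', 'w'}]
R = [{'u', 'w'}]
S = [M for M in R if not any(M < M2 for M2 in N)]   # members of R inside no member of N
L = N + S
assert L == [{'u', 'v', 'w'}]    # L = N u S, as Theorem 4.2 asserts
assert L != N + R                # but N u R would wrongly retain {u, w}
```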
Theorem 4.3, which constitutes the main result, relates the cliques of G to those of H(v) and those of a possibly proper subgraph of G - v. More specifically, if the neighborhood of each point of G is contained in that of v, then the cliques of G are just those of H(v). If the set X of points of G whose neighborhoods are not contained in that of v is nonempty, then the cliques of G[X] not contained in the point set of H(v), together with the cliques of H(v), constitute the cliques of G.

Lemmas 4.8, 4.9, 4.10, and 4.11 state additional properties required for the proof of Theorem 4.3: if the neighborhood of u is a subset of the neighborhood of v, then each clique of H(u) is contained in some clique of H(v); the cliques of G - v are subsets of cliques of H(v) if and only if the neighborhood of each point is contained in that of v; a clique of G which is not a clique of H(v) is a clique of the subgraph of G generated by the set of points whose neighborhoods are not contained in that of v, i.e., a clique of G[V - S(v)]; and a clique of G[V - S(v)] is a clique of G if and only if it is not a subset of the neighborhood of v.

LEMMA 4.8 If N(u) ⊂ N(v), then each member of N(u) is contained in some member of N.

LEMMA 4.9 S = ∅ if and only if N(u) ⊂ N(v) for each u in V.
LEMMA 4.10 L ⊂ N ∪ Q, a disjoint union.

As with Lemma 4.7, Lemma 4.10 cannot be improved to read "L = N ∪ Q," because Q may not be a subset of L. Figure 4.3 provides a counterexample.

[Figure 4.3: a graph on the points a, b, c, d, e, and f, together with v, with the sets S(v) and N(v) indicated by dashed curves.]

Figure 4.3 A Counterexample to L = N ∪ Q

Although M = {c, d} is a clique of G[V - S(v)], it is not a clique of G. The fact that M ⊂ N(v) is no accident, as proved in the following lemma.

LEMMA 4.11 If M ∈ Q, then M ∈ L if and only if M ⊄ N(v).

THEOREM 4.3 If S(v) = V, then L = N. If S(v) ≠ V, then L = N ∪ P, a disjoint union.
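Theorem 4.3 already yields a recursive procedure. The Python sketch below is an illustrative rendering, not the formal Algorithm 4.2 of the next subsection, though it follows the same plan: it represents a graph by its closed neighborhoods, chooses a neighborhood of maximum size (one convenient way to obtain a maximal one), recurs on H(v) - v and on G[V - S(v)], and discards from Q the members inside N(v), per Lemma 4.11. The check runs it on the graph of Figure 4.2.

```python
def cliques_by_neighborhoods(adj):
    """Cliques of G, per Theorem 4.3; adj maps each point u of G
    to its closed neighborhood N(u)."""
    V = set(adj)
    if all(adj[u] == V for u in V):                  # G complete (includes |V| = 1)
        return [frozenset(V)]
    v = max(V, key=lambda u: len(adj[u]))            # a point of maximal neighborhood
    Sv = {u for u in V if adj[u] <= adj[v]}          # the scope S(v)
    sub = lambda W: {u: adj[u] & W for u in W}       # closed nbhds of G[W]
    if len(adj[v]) == 1:
        N = [frozenset([v])]
    else:                                            # recursion on H(v) - v; restore v
        N = [M | {v} for M in cliques_by_neighborhoods(sub(adj[v] - {v}))]
    if Sv == V:
        return N
    Q = cliques_by_neighborhoods(sub(V - Sv))        # recursion on G[V - S(v)]
    P = [M for M in Q if not M <= adj[v]]            # Lemma 4.11: drop M inside N(v)
    return N + P
```

Both recursive calls act on strictly fewer points than G (v itself is excluded from each), so termination is assured, as the next subsection argues for the formal algorithm.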
4.3.3. The Neighborhood Clique Detection Algorithm

Theorem 4.3 is an improvement on Theorem 4.2 with respect to computational efficiency. Each requires the computation of the cliques N of the neighborhood subgraph H(v) of G at v. Each also requires the computation of the cliques of a second subgraph of G, and the deletion from these of certain members. The second subgraph of Theorem 4.2 is G - v, which is generally larger than G[V - S(v)], the subgraph of Theorem 4.3. The number of tests required by Theorem 4.2 for deletions, i.e., for the computation of S, is between |R| and |R| x |N|, whereas Theorem 4.3 requires only |Q| tests to compute P. Finally, an algorithm based on Theorem 4.3 requires fewer passes, or repeated applications of the theorem, than does one based on Theorem 4.2, because, whereas each pass of the latter reduces the number of points by one, each pass of the former deletes |S(v)| points. The algorithm given below is therefore based on the repeated application (recursively) of Theorem 4.3. Certain other choices implicit in the algorithm will now be discussed.

The first issue is how to choose a point v of the nontrivial graph G. In view of Lemma 4.8, the algorithm selects (Step 2) a point v of maximal neighborhood. Since S(v) must be computed for whatever point v is selected, this selection of v is not very expensive.
The simultaneous determination of v and S(v) proceeds as follows. Set v = v1 and S(v) = {v1}. For i = 2, 3, ..., p = |V|: if N(vi) ⊂ N(v) or N(v) ⊂ N(vi), set S(v) = S(v) ∪ {vi}; if N(v) ⊂ N(vi), set v = vi.

Since the cliques of two subgraphs, H(v) and G[V - S(v)], are generally required, one might suggest a more elaborate criterion for the selection of v, namely, that v be a point which minimizes max {|N(v)|, |V - S(v)|}. However, if |N(v)| < |V - S(v)| for every point v, as is often the case, then the selected point v is simply a point of largest scope. Consequently, this criterion could be approximated by the simpler selection of v as a point maximizing |S(v)|, i.e., a point of largest scope. Although experimental evidence is lacking, it is considered doubtful that the additional computational complexity of either of these two criteria is justified. Consequently, Step 2 of the algorithm specifies the selection of a point v of maximal neighborhood, and the simultaneous determination of S(v), as illustrated above.

The next issue concerns the termination of recursive plunging (Step 1): under what condition does the algorithm not invoke itself? Clearly, since Theorem 4.3 is based on the premise that G is nontrivial, the algorithm must discriminate the condition that G is trivial. In this case, of course, theorems are not required: one simply has L = {V}.
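The one-pass scan just described may be rendered as follows (Python; the names are illustrative). On the graph of Figure 4.2, scanning the points in the order 1, 2, ..., 9 yields v = 1 and S(v) = {1, 2, 3}, matching the illustration of Algorithm 4.2 given at the end of this section.

```python
def select_point_and_scope(adj, order):
    """Simultaneous determination of a point v of maximal neighborhood
    and of S(v); adj maps each point to its closed neighborhood."""
    v = order[0]
    S = {v}
    for u in order[1:]:
        if adj[u] <= adj[v] or adj[v] <= adj[u]:
            S.add(u)                   # u is comparable with the current v
        if adj[v] < adj[u]:            # strictly larger neighborhood: u replaces v
            v = u
    return v, S
```

Note that when v is replaced, the points already in S had neighborhoods contained in the old N(v), hence in the new one, so the accumulated set remains valid.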
Rather than test for |V| = 1, Step 1 tests for the more general condition that G is complete, in which case L = {V} whether or not |V| = 1. This test requires that one test neighborhoods until one is found which is properly contained in V, or until it has been established that each neighborhood is identical to V. In the former case, G is not complete; in the latter, it is. When G is not complete, the test for completeness entails a small time loss. When G is complete and nontrivial, the test saves time. If G is large, the time saved is substantial, for, otherwise, the algorithm would plunge point by point, executing Steps 1 through 4, until the subgraph was a singleton; it would then work its way back out, point by point, executing Steps 5 and 6. This test applies, of course, to any complete graph submitted to the algorithm, including the whole graph whose cliques are sought and any subgraph of it which is submitted recursively to the algorithm. Consequently, the test for completeness of Step 1 is presumed to be justified by the time saved through stopping the plunge at complete graphs, rather than at singletons.

The crucial issue, of course, is how to determine (1) the cliques N of H(v), and (2) the cliques Q of G[V - S(v)] whenever S(v) ≠ V. (If S(v) = V, then one has merely to compute N.) If S(v) ≠ V, then, since v ∈ S(v), G[V - S(v)] has strictly fewer points than does G; consequently, the cliques of G[V - S(v)] may be found
114 by recursion without any danger of the algorithm not terminating. Moreover, since G[V S(v)J has, in g eneral, no special properties, one might just as well compute Q from a recursive invocation of the al g orithm (Steps 6 and 7). On the contrary, H(v) does h a ve a special pr o p e rty: v is adjacent to each other point, if any, of H(v) Indeed, if one attempted to determine N by recursion on H(v) then the algorithm might not terminate. In particular, if N(v) = V, then a recursion on H(v) would be repeated indefinitely. Therefore, to assure termination and to take advantage of the special property of H(v) recursion is on H(v) v wherever I N (v) I I 1 (Steps 3 and 4). Since H(v) v has strictly fewer points than does G, even in case N(v) = V, termination is assured here (Step 4), as well as in Step 7. It remains now to complete the deter m in a tion of the cliques N of H(v) from the cliques N' of H(v) v Step 5 performs this computation on the basis of the following proposition. PROPOSITION 4.1 Let G be a nontrivial gr a ph having a point v adjacent to each other poi~t of G Denote by L and L' th e cliques of G a nd G V resp e ctively. Then L = {M u {v}: M E L'} Proof.Let M E L' Since M is a clique of G a nd V is a djacent to each point of G M u {v} is a V
complete subgraph of G. If M ∪ {v} is not a clique of G, then there is a point u ∉ M ∪ {v} which is adjacent to each point of M ∪ {v}. Since u ≠ v, M ∪ {u} is a complete subgraph of G − v properly containing M ∈ L', a contradiction. Hence {M ∪ {v}: M ∈ L'} ⊆ L.

Now suppose M ∈ L. By Lemma 4.5, v ∈ M. M − {v} is a complete subgraph of G − v. If M − {v} ∉ L', then there is a point u of G − v adjacent to each point of M − {v}. Since v is adjacent to u, M ∪ {u} is a complete subgraph of G properly containing M ∈ L, a contradiction; hence M − {v} ∈ L'. Since M = (M − {v}) ∪ {v} with M − {v} ∈ L', it has been shown that if M ∈ L, there is an M' ∈ L' such that M = M' ∪ {v}, i.e., L ⊆ {M ∪ {v}: M ∈ L'}. ||

ALGORITHM 4.2

Step 1. If G is complete, set L = {V}; RETURN.

Step 2. Find a point v of G whose neighborhood N(v) is maximal, simultaneously determining the set S(v) of those points u of G such that N(u) ⊆ N(v).

Step 3. If |N(v)| = 1, then set N = {{v}} and go to Step 6.
Step 4. (Recursion.) Compute the set N' of cliques of H(v) − v, the subgraph of G generated by N(v) − {v}.

Step 5. Set N = {M ∪ {v}: M ∈ N'}.

Step 6. If S(v) = V, then set L = N and RETURN.

Step 7. (Recursion.) Compute the set Q of cliques of G[V − S(v)], the subgraph of G generated by V − S(v).

Step 8. Set P = {M: M ∈ Q, M ⊄ N(v)}.

Step 9. Set L = N ∪ P; RETURN.

The graph of Figure 4.2 illustrates the action of the algorithm. Each step of the algorithm is listed as visited during its execution on the graph. The steps are indented and prefixed with asterisks in accordance with the level of recursion at which the step is being executed. For brevity, sets are indicated by character strings, without brackets.

Step 1. G is not complete. Proceed.
Step 2. v = 1, N(1) = 1239, S(1) = 123.
Step 3. |N(1)| ≠ 1. Proceed.
Step 4. Recursion on H(1) − 1 = G[239].
* Step 1. G[239] is not complete. Proceed.
* Step 2. v = 3, N(3) = 23, S(3) = 23.
* Step 3. |N(3)| ≠ 1. Proceed.
* Step 4. Recursion on G[N(3) − 3] = G[2].
** Step 1. G[2] is complete. L = 2. Return.
* Step 5. N = 2 ∪ 3 = 23.
* Step 6. S(3) = 23 ≠ 239. Proceed.
* Step 7. Recursion on G[239 − 23] = G[9].
** Step 1. G[9] is complete. L = 9. Return.
* Step 8. 9 ⊄ N(3) = 23, so P = Q = 9.
* Step 9. L = N ∪ P = 23, 9. Return.
Step 5. N = 123, 19.
Step 6. S(1) = 123 ≠ V. Proceed.
Step 7. Recursion on G[V − S(1)] = G[456789].
* Step 1. G[456789] is not complete. Proceed.
* Step 2. v = 8, N(8) = 456789, S(8) = 456789.
* Step 3. |N(8)| ≠ 1. Proceed.
* Step 4. Recursion on G[45679].
** Step 1. G[45679] is not complete. Proceed.
** Step 2. v = 6, N(6) = 4679, S(6) = 467.
** Step 3. |N(6)| ≠ 1. Proceed.
** Step 4. Recursion on G[479].
*** Step 1. G[479] is not complete. Proceed.
*** Step 2. v = 4, N(4) = 4, S(4) = 4.
*** Step 3. |N(4)| = 1. Set N = 4. Go to Step 6.
*** Step 6. S(4) = 4 ≠ 479. Proceed.
*** Step 7. Recursion on G[479 − 4] = G[79].
**** Step 1. G[79] is complete. L = 79. Return.
*** Step 8. Q = 79. 79 ⊄ 4 = N(4). P = 79.
*** Step 9. L = N ∪ P = 4, 79. Return.
** Step 5. N' = 4, 79. N = 46, 679.
** Step 6. S(6) = 467 ≠ 45679. Proceed.
** Step 7. Recursion on G[59].
*** Step 1. G[59] is complete. L = 59. Return.
** Step 8. Q = 59. 59 ⊄ N(6) = 4679. P = 59.
** Step 9. L = N ∪ P = 46, 59, 679. Return.
* Step 5. N' = 46, 59, 679. N = 468, 589, 6789.
* Step 6. S(8) = 456789, all the points of G[456789]. L = N. Return.
Step 8. Q = 468, 589, 6789. None is contained in N(1) = 1239, so P = Q.
Step 9. L = N ∪ P = 123, 19, 468, 589, 6789. Return.

4.4. The Line Removal Approach to Clique Detection

The approach described below may be briefly characterized as a Line Removal method. It begins with a theorem relating the cliques of G to those of G − xy, where x and y are a pair of adjacent points of G. The algorithm is then the repeated application of the theorem to a tower of subgraphs of G: the first graph of the tower has all the points of G but no lines; each successive subgraph has one additional line of G; finally, the last graph of the tower is G itself. The number of iterations is therefore q, the number of lines of G.

Assuming for simplicity that the effort per iteration is roughly the same for the Line Removal algorithm and a Point Removal algorithm, one expects a Point Removal algorithm to be faster on graphs having more lines than points. In graph theoretical cluster analysis the threshold is ordinarily chosen so that there are no isolated points in the graph; usually the number of lines exceeds the number of points, in which case Point Removal is preferable to Line Removal clique detection.
Suppose, however, that the analysis is to be applied to several graphs, corresponding to several similarity thresholds. The graphs G1, G2, ..., Gn, corresponding to thresholds t1, t2, ..., tn with t1 > t2 > ... > tn, have the following properties. For i = 1, 2, ..., n−1, the p points of Gi+1 are those of Gi, and each line of Gi is a line of Gi+1. The cluster analysis requires the cliques of each Gi. Knowing the cliques of G1 is of no use to the Point Removal method: to find the cliques of G2 requires p iterations. The Line Removal method, on the other hand, requires q2 − q1 iterations (qi being the number of lines of Gi): it begins with the cliques of G1 and proceeds to derive those of G2 by successively adding the lines of G2 not in G1. It is for such applications that the Line Removal clique detection algorithm is intended.

4.4.1. Line Removal Definitions

This subsection provides the definitions of the terms and symbols used in the statements and proofs of the propositions of this section. In addition, certain elementary consequences of the definitions are established here.

Let G = (V, E) be a graph which is not complete, and let x and y be a pair of nonadjacent points of G. Define G* = (V, E*) with E* = E ∪ {xy}. Denote by L and L*, respectively, the collections of cliques of G and G*. The objective is to find L*, given L.
Since xy ∉ E, no clique of G contains both x and y. On the other hand, each point of G belongs to at least one member of L. Therefore X ≠ ∅ ≠ Y, where X = {M: x ∈ M ∈ L} and Y = {M: y ∈ M ∈ L}. Defining L0 = L − (X ∪ Y), one has that {L0, X, Y} is a pairwise disjoint cover of L: each clique of G contains (1) neither x nor y, (2) x but not y, or (3) y but not x.

Let X1 = {X: X ∈ X, X − Y = {x} for some Y ∈ Y} and X2 = X − X1. Thus, a clique in X1 is a clique of G containing the point x which is contained, but for the point x, in some clique of G containing the point y. Similarly, let Y1 = {Y: Y ∈ Y, Y − X = {y} for some X ∈ X}, with Y2 = Y − Y1. Define L2 = X2 ∪ Y2 and L1 = {X ∪ {y}: X ∈ X1} ∪ {Y ∪ {x}: Y ∈ Y1}. Finally, let

L3'' = {(X ∩ Y) ∪ {x, y}: (X, Y) ∈ X2 × Y2},
L3' = {M: M ∈ L3'', if M ⊆ M' ∈ L3'' then M = M'},
L3 = {M: M ∈ L3', if M ⊆ M' then M' ∉ L1}.

L3 thus consists of the maximal members of L3'' which are not subsets of any members of L1.

This subsection is concluded with the observation that {L0, L1, L2, L3} is a pairwise disjoint collection. No member of L0 contains either x or y, whereas each member of L1, L2, and L3 contains at least one of x and y; consequently, L0 ∩ Li = ∅ for i = 1, 2, 3.
Each member of L2 contains exactly one of x and y, whereas each member of L1 and L3 contains both x and y; consequently, L1 ∩ L2 = ∅ = L2 ∩ L3. Finally, L1 and L3 are disjoint, for if M ∈ L1 ∩ L3, then, since L3 ⊆ L3', M ∈ L3' and M ⊆ M ∈ L1, so that M ∉ L3, contradicting M ∈ L3.

4.4.2. Line Removal Theorems

The propositions leading to a clique detection algorithm are presented; the proofs are given in Appendix C. Theorem 4.4 states that the collection L* of the cliques of G* is the union of the (obviously) pairwise disjoint sequence of three of its subcollections: that of the cliques containing neither x nor y, that of those containing just one of x and y, and that of those containing both x and y. It states, moreover, that these subcollections are, respectively, L0, L2, and the maximal members of {(X ∩ Y) ∪ {x, y}: (X, Y) ∈ X × Y}. This theorem is based on two lemmas, which state that each member of Li, i = 0, 1, 2, 3, is a clique of G*; and that a clique of G* is in L0 or L2 if it is also a clique of G, and is otherwise (X ∩ Y) ∪ {x, y} for some X ∈ X and Y ∈ Y.

LEMMA 4.12. L0 ∪ L1 ∪ L2 ∪ L3 ⊆ L*.
LEMMA 4.13. L* ∩ L = L0 ∪ L2, and L* − L ⊆ {(X ∩ Y) ∪ {x, y}: (X, Y) ∈ X × Y}.

THEOREM 4.4. L* is the union of L0 ∪ L2 and the collection of the maximal members of {(X ∩ Y) ∪ {x, y}: (X, Y) ∈ X × Y}.

Theorem 4.5 is an attempt to refine Theorem 4.4, in the sense of making the assertion somewhat more precise, in order to provide for a more efficient algorithm. It states that L* is the pairwise disjoint union of L0, L1, L2, and L3. The proof of Theorem 4.5 requires the additional lemma, Lemma 4.14, which states that a clique of G* which is not a clique of G is either in L1 or is (X ∩ Y) ∪ {x, y} for some X in X2 and Y in Y2.

LEMMA 4.14. L* − L ⊆ L1 ∪ {(X ∩ Y) ∪ {x, y}: (X, Y) ∈ X2 × Y2}.

THEOREM 4.5. L* = L0 ∪ L1 ∪ L2 ∪ L3.

There are two considerations which suggest that an algorithm based on Theorem 4.5 would be faster than one based on Theorem 4.4. The minor one is that the necessity
of computing L1 for Theorem 4.5 is not very expensive. To find L2 for either theorem requires the identification of X1 and Y1. To compute L1 from X1 and Y1, one begins with {X ∪ {y}: X ∈ X1}; then, for each member Y of Y1, one includes Y ∪ {x} if it was not already present (it may be that X ∪ {y} = Y ∪ {x} for some X in X1 and Y in Y1). Therefore, the computation of L1 requires only |X1| × |Y1| tests for equality.

The major reason to expect a faster algorithm based on Theorem 4.5 than one based on Theorem 4.4 is that the determination of L3 from L3'' may require substantially fewer tests than the identification of the maximal members of K = {(X ∩ Y) ∪ {x, y}: (X, Y) ∈ X × Y}, which includes L3''. Referring to the definitions, one sees that L3 is obtained from L3'' by two operations: nonmaximal members are discarded, and maximal members contained in members of L1 are discarded. That each of these operations is indispensable is illustrated by the graphs of Figure 4.4. G1 exhibits a maximal member M of L3'' which is a (proper) subset of a member of L1 and so is not in L*: L = {12x, 24x, 24y, 23y}, L0 = ∅, X1 = {24x}, X2 = {12x}, Y1 = {24y}, Y2 = {23y}, and L1 = {24xy}; M = (12x ∩ 23y) ∪ {x, y} = 2xy ⊂ 24xy ∈ L1 ⊆ L*. Thus, although M is a maximal member of L3'', M ∉ L*. G2 exhibits a nonmaximal member M of L3'' which is contained in no member of L1:
Figure 4.4 Counterexamples to L* = L0 ∪ L1 ∪ L2 ∪ L3''
[The figure shows the two graphs G1, on the points 1, 2, 3, 4, x, y, and G2, on the points 1, 2, 3, 4, 5, 6, x, y, z.]

L = {125x, 45x, 56y, 235y, zx, zy}, X1 = {xz}, X2 = {125x, 45x}, Y1 = {yz}, Y2 = {56y, 235y}, and L1 = {xyz}. L3'' = {5xy, 25xy}. M = 5xy is a subset of no member of L1, but is not in L* because it is properly contained in 25xy ∈ L*.

4.4.3. The Line Removal Clique Detection Algorithm

Since experiments (see Table 4.3) verify that an algorithm based on Theorem 4.5 is faster than one based on Theorem 4.4, only the former is given here.
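The single-line update of Theorem 4.5 can be sketched directly at the level of sets. The code below is a transcription of the decomposition L* = L0 ∪ L1 ∪ L2 ∪ L3 into Python, with cliques represented as frozensets; it is an illustrative sketch, not the author's PL/I implementation, and the name add_line is mine.

```python
# Sketch of Theorem 4.5: given the cliques L of G and a pair x, y of
# nonadjacent points, compute the cliques of G* = G + xy.
def add_line(L, x, y):
    Xs = [M for M in L if x in M]                    # cliques containing x
    Ys = [M for M in L if y in M]                    # cliques containing y
    L0 = [M for M in L if x not in M and y not in M]
    X1 = [X for X in Xs if any(X - Y == {x} for Y in Ys)]
    X2 = [X for X in Xs if X not in X1]
    Y1 = [Y for Y in Ys if any(Y - X == {y} for X in Xs)]
    Y2 = [Y for Y in Ys if Y not in Y1]
    L1 = {X | {y} for X in X1} | {Y | {x} for Y in Y1}
    L3pp = {(X & Y) | {x, y} for X in X2 for Y in Y2}          # L3''
    L3p = {M for M in L3pp if not any(M < Mp for Mp in L3pp)}  # maximal members
    L3 = {M for M in L3p if not any(M <= Mp for Mp in L1)}     # drop subsets of L1
    return set(L0) | L1 | set(X2) | set(Y2) | L3
```

Applied to the graph G1 of Figure 4.4, with x and y relabeled 5 and 6, the update of the cliques {1,2,5}, {2,4,5}, {2,4,6}, {2,3,6} by the line 56 yields {1,2,5}, {2,3,6}, and {2,4,5,6}, in agreement with the discussion above: the product set {2,5,6} is discarded because it is contained in the L1 member {2,4,5,6}.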
Let G0 = (V, E0) be a graph whose clique set M0 is known. Let G = (V, E) be a graph such that E0 is properly contained in E, and let E − E0 = {e1, e2, ..., en}. For each i = 1, 2, ..., n, let x(ei) and y(ei) be the points of G incident with ei. Let Gi = (V, Ei) with Ei = E0 ∪ {ej: 1 ≤ j ≤ i}; thus Gi is a subgraph of G having just one line that Gi−1 does not have. Let Mi denote the clique set of Gi. Since Gn = G, Mn is the clique set of G. The algorithm is an application of Theorem 4.5 to the pair (Gi−1, Gi) for each i = 1, 2, ..., n.

Each additional line is processed by the execution of Steps 3 through 9, in sequence. Step 3 produces L0, X, and Y. Step 4 identifies X1 ⊆ X and Y1 ⊆ Y. Step 5 produces L1 and, incidentally, X2 and Y2 from Step 3. From X2 and Y2, Step 6 produces L3''. Step 7 reduces L3'' to L3' by deleting nonmaximal members of L3''. Step 8 then determines L3 from L3' and L1, given by Steps 7 and 5. Finally, Step 9 combines L0, L1, L2 = X2 ∪ Y2, and L3 (Steps 3, 5, and 8) into the collection of cliques, as required by Theorem 4.5.

ALGORITHM 4.3

Step 1. Initialize M0. Set k = 0 and proceed.

Step 2. Add 1 to k. If n < k, then stop: Mk−1 = Mn is the clique set sought. Otherwise, proceed.
Step 3. Decompose Mk−1 into L0, X, and Y, a disjoint cover of Mk−1: if M ∈ Mk−1, then M ∈ X if x(ek) ∈ M, M ∈ Y if y(ek) ∈ M, and M ∈ L0 otherwise.

Step 4. For each (X, Y) in X × Y, compute X ∩ Y. If X − {x} ⊆ X ∩ Y, flag X as belonging to X1; if Y − {y} ⊆ X ∩ Y, flag Y as belonging to Y1.

Step 5. Initialize L1' = L1'' = X2 = Y2 = ∅. For each X in X: if X is flagged as a member of X1, then include X ∪ {y} in L1'; otherwise, include X in X2. For each Y in Y: if Y is not flagged as a member of Y1, then include Y in Y2; otherwise, if Y ∪ {x} is not already a member of L1', then include Y ∪ {x} in L1''. Finally, set L1 to the (disjoint) union of L1' and L1''.

Step 6. Initialize A = ∅. For each (X, Y) in X2 × Y2, include (X ∩ Y) ∪ {x, y} in A. (A = L3''.)

Step 7. Delete from A each member which is properly contained in any other member. (A = L3'.)

Step 8. Delete from A each member which is contained in any member of L1. (A = L3.)
Step 9. Set Mk = L0 ∪ L1 ∪ X2 ∪ Y2 ∪ A. Go to Step 2.

To illustrate the algorithm, the graph of Figure 4.2 is used again. The results are given in Table 4.2, with each row of the table giving the condition on entry to Step 2 of the algorithm. To provide an indication of the effort expended by this algorithm relative to that of the other two algorithms, the cliques are found from scratch, i.e., the Step 1 initialization of M0 is to the singletons of the point set, and the differential set of lines is all the lines of the graph.

4.5. Algorithm Timing Experiments

Algorithms 4.1, 4.2, and 4.3, and an algorithm based on Theorem 4.4, have been implemented in PL/I procedures for execution by an IBM 360/65. For each program, a graph is specified by means of the (reflexive, symmetric, binary) adjacency matrix, in which the ijth bit is 1 if and only if i = j or the ith and jth points are adjacent; the form of the matrix representation is a singly subscripted array of p bit strings of length p, where p is the number of points of the graph. Each procedure returns the cliques in a threaded list of bit strings of length p, one for each clique.

For purposes of efficiency comparisons, these programs have been executed on various graphs, the execution times
129 Table 4.2 Algorithm 4.3 Applied to the Graph of Figure 4.2 k ek ~o X y I !1 !1 !2 !2 ~l ~3 II ~3' I ~3 ~k 0 1,2, 3 ,4, 5, 6, 7, 8, 9 1 12 3,4, 1 2 1 2 0 0 12 0 0 0 12,3 5,6, 4, 5, 7,8, 6,7, 9 8,9 2 13 4,5, 12 3 0 3 1 2 0 13 0 0 0 12 6, 7, 13 8,9 4, 5, 6, 7, 8 ,9 3 19 4, 5, 12 9 0 9 1 2 0 19 0 0 0 12 6,7, 13 13 13 8 19 4, 5, 6,7, 8 4 23 1 9 12 13 1 2 13 0 0 123 0 0 0 123 4,5, 19 6,7, 4,5, 8 6, 7, 8 5 46 123 4 6 4 6 0 0 46 0 0 0 123 I 19 19 5, 7, 46 8 5, 7, 8 6 48 123 46 8 0 8 46 0 48 0 0 0 1 23 19 19 5,7 46 48 5,7
130 T a b le 4.2 (continued) k ek !:.o X y !1 !1 !2 !2 !:.1 L II !:_3' !:_3 3 7 58 123 5 48 5 0 0 4 8 58 0 0 0 123 19 19,46 46 48,58 7 7 8 59 123 58 19 0 0 58 1 9 0 59 59 59 123 46 19,46 7 48,58 48 59,7 9 67 123 46 7 0 7 46 0 67 0 0 0 123 19 19,46 48 48,58 58 59,67 59 1 0 68 123 46 48 46 48 67 58 468 68 68 0 123 19 67 58 19 59 468 58,59 67 11 69 123 468 19 0 0 468 19 0 69 69 69 123 58 67 59 67 59 19 468 58 ,59 67,69 12 78 123 67 468 67 0 0 468 678 0 0 0 123 19 5 8 58 19 59 468 69 58,59 678 69 13 7 9 1 2~ 678 19 0 69 678 19 679 79 79 0 123 46 59 59 19 58 69 468 58 ,59 678 6 7 9 14 89 123 468 19 58 59 468 19 5 8 9 8 9 8 9 0 123 58 59 678 679 6789 19 678 679 468 589 6789
being noted. The results are summarized below in Tables 4.3, 4.4, and 4.5.

The names of the graphs of the first two tables are as in Harary [24]. Kp denotes the complete graph on p points, and Pn denotes the path on n points. G' denotes the graph having the same point set as G and in which a pair of points are adjacent if and only if they are nonadjacent in G. If G1 = (V1, E1) and G2 = (V2, E2) are graphs with V1 ∩ V2 = ∅, then G1 + G2 is the graph G = (V, E) with V = V1 ∪ V2 and E = E1 ∪ E2 ∪ {xy: x ∈ V1, y ∈ V2}. K(p1, p2, ..., pn) denotes a graph of p = p1 + p2 + ... + pn points which may be partitioned into n sets X1, X2, ..., Xn such that (1) |Xi| = pi for each i, and (2) a pair of points are adjacent if and only if they do not belong to the same set of the partition. Such graphs are particularly suitable to the purpose at hand, since they have large numbers of lines and cliques for a given number of points. Indeed, Moon and Moser [31] proved that if pi = 3 for i = 1, 2, ..., n−1 and pn ∈ {2, 3, 4}, then the graph has the greatest number of cliques of any graph of p points. For example, K(3, 3, 3, 3), which has only 12 points, has 9 × (1 + 2 + 3) = 54 lines and 3^4 = 81 cliques, each consisting of 4 points, one from each part.

Table 4.3 confirms the expectation that, of the two Line Removal theorems, Theorem 4.5 provides for a faster algorithm than does Theorem 4.4; therefore, the additional complexity of the former over the latter is justified. (Algorithm 4.3 is based on Theorem 4.5.)
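The Moon and Moser count cited above is easy to check mechanically. The sketch below builds K(3, 3, 3, 3) and enumerates its cliques with a generic recursive enumerator (the Bron–Kerbosch scheme, used here only as a stand-in for Algorithms 4.1 through 4.3, which are not reproduced); the helper names are mine.

```python
def multipartite(parts):
    """Adjacency sets of K(p1, ..., pn): points in different parts are adjacent."""
    labels = [i for i, p in enumerate(parts) for _ in range(p)]
    n = len(labels)
    return {u: {v for v in range(n) if labels[v] != labels[u]} for u in range(n)}

def cliques(adj):
    """Enumerate the cliques (maximal complete subgraphs) of a graph."""
    out = []
    def extend(r, p, x):
        if not p and not x:
            out.append(r)          # r can be extended by no point: a clique
            return
        for v in list(p):
            extend(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}
    extend(frozenset(), set(adj), set())
    return out

adj = multipartite([3, 3, 3, 3])
lines = sum(len(nbrs) for nbrs in adj.values()) // 2
cs = cliques(adj)
# 9 * (1 + 2 + 3) = 54 lines and 3**4 = 81 cliques of 4 points each
assert lines == 54 and len(cs) == 81 and all(len(c) == 4 for c in cs)
```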
Table 4.3 Timing Comparison of Algorithms Based on Theorems 4.4 and 4.5

                        Algorithm Execution Times (seconds)
Graph             Theorem 4.4      Theorem 4.5 (Algorithm 4.3)
P9                0.09             0.06
K1 + K8'          0.18             0.08
K(3, 3, 3)        1.4              1.0
K9                0.4              0.2
K(4, 4, 4)        7.2              5.6
K(3, 3, 3, 3)     24.              14.
K(5, 5, 5)        29.              26.

Table 4.4 provides timing data for an efficiency comparison among the Point Removal, Neighborhood, and Line Removal clique detection algorithms, i.e., Algorithms 4.1, 4.2, and 4.3. The data of Table 4.4 indicate that the Line Removal method is much less efficient than the other methods, at least as applied to the problem of finding cliques of a graph G = (V, E) from the adjacency information alone. However, the Line Removal procedure is expressly intended
Table 4.4 Timing Comparison of Algorithms 4.1, 4.2, and 4.3

                        Algorithm Execution Times (seconds)
Graph               Algorithm 4.1      Algorithm 4.2     Algorithm 4.3
                    (Point Removal)    (Neighborhood)    (Line Removal)
P9                  0.18               0.12              0.06
K1 + K8'            0.16               0.14              0.08
K9'                 0.18               0.14              0.0
K(3, 3, 3)          0.47               0.42              1.0
K(4, 4, 4)          2.3                1.1               5.6
K(3, 3, 3, 3)       3.1                1.5               14.0
K(5, 5, 5)          8.3                2.4               26.0
K(3, 3, 3, 3, 3)    25.                4.9               ?
for a more specialized problem: find the cliques of G' = (V, E'), given the cliques of G = (V, E), where E ⊂ E', and given E' − E. Such a tower of two graphs is illustrated in Figure 4.5. The Neighborhood procedure was applied to each of G and G'; the Line Removal procedure was applied to G', the cliques of G being given; the execution times are given in Table 4.5.

Figure 4.5 Two Graphs, G = (V, E) and G' = (V, E') with E ⊂ E'
[E: solid lines; E': solid and broken lines.]

Cliques of G:            Cliques of G':
1, 5, 8, 10, 12          1, 5, 8, 10, 12
7, 8, 10, 12             1, 7, 8, 10, 12
4, 10                    7, 11, 12
9, 10                    8, 9, 10, 12
2, 6                     4, 8, 10, 12
6, 12                    6, 12
3, 12                    3, 12
11, 12                   9, 11, 12
                         2, 6
Table 4.5 Execution Times of Algorithms 4.2 and 4.3 on the Graphs of Figure 4.5

              Times (seconds)
Graph    Algorithm 4.2     Algorithm 4.3
         (Neighborhood)    (Line Removal)
G        0.3               -
G'       0.4               0.1

4.6. Conclusions

Concerning the general clique detection problem, the Neighborhood method is evidently substantially more efficient than is the Point Removal method; the Line Removal method is much less efficient than either. The data of Table 4.4 suggest that the Neighborhood procedure execution time per clique is roughly independent of the number of cliques, i.e., that the execution time is roughly proportional to the number of cliques.

Turning now to the more specialized problem of finding the cliques of a tower of graphs having the same point set, it has been shown that the Line Removal procedure can be used to definite advantage. The cliques of G and of G' of Figure 4.5, for example, can be most efficiently found as follows: use the Neighborhood procedure to find the cliques of G; then use the Line Removal procedure to find the cliques of G'. The two clique sets are thereby produced
in 0.4 seconds, while their production from the application of the Neighborhood procedure to each of G and G' requires 0.7 seconds. In this special clique detection problem, therefore, the most efficient course of action requires the availability of both the Neighborhood procedure and the Line Removal procedure, and a criterion for selecting the proper one according to the circumstances. Such a criterion would depend primarily on the number of points and the differential number of lines, and possibly on other available parameters, such as the number of lines of G' and the number of cliques of G, the predecessor of G' in the tower.
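The tower setting behind this conclusion can be sketched concretely. The similarity matrix and thresholds below are illustrative assumptions, not data from the dissertation; the sketch only exhibits the property the Line Removal method exploits: graphs produced by descending thresholds are nested, so each graph of the tower differs from its predecessor by a differential set of lines.

```python
# Sketch: a tower of threshold graphs G1, ..., Gn from a similarity matrix.
def threshold_graph(sim, t):
    """Line set of the graph joining pairs whose similarity is at least t."""
    n = len(sim)
    return {(i, j) for i in range(n) for j in range(i + 1, n) if sim[i][j] >= t}

sim = [                        # hypothetical symmetric similarity matrix
    [1.0, 0.9, 0.4, 0.1],
    [0.9, 1.0, 0.5, 0.2],
    [0.4, 0.5, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
thresholds = [0.9, 0.5, 0.2]   # t1 > t2 > t3
tower = [threshold_graph(sim, t) for t in thresholds]

# Each graph of the tower contains the lines of its predecessor, so a Line
# Removal pass need only process the differential lines e1, ..., en.
for e_prev, e_next in zip(tower, tower[1:]):
    assert e_prev <= e_next
```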
CHAPTER 5
AUTOMATIC CLASSIFICATION DERIVATION

5.1. Introduction

The problem of deriving a simple classification has been viewed from the perspective of two subproblems. The first is the problem of cover generation, with which Chapter 3 contends. The second component problem is that of selecting one particular cover from among the collection of covers provided by a cover generation procedure. The cover selection problem is treated in this chapter.

Two quite different approaches to the cover selection problem are explored. The first of these takes the view that the best cover is the most typical of the collection of covers. This approach is therefore primarily concerned with means of measuring distance between covers of a finite set. The second approach to cover selection utilizes the available measure of similarity on the objects of the finite set under analysis to define a numerical measure of the homogeneity of a subset of objects. From the homogeneities of the members of a cover, a numerical evaluation of the cover is defined, by means of which a selection is made. The modification of the cover evaluation to take cost considerations into account is also explored.
Following the treatment of the cover selection problem, a cover generation procedure and a cover selection procedure are united into a simple classification derivation, which is imbedded in an algorithm for the derivation of a multilevel classification.

5.2. Cover Evaluation by Typicality

The problem under consideration in this section is as follows: given a nonempty collection of efficient covers of a nonempty finite set X, identify the most typical member of the collection. The term "most typical" can be defined by means of a measure f of distance between such covers. For example, a most typical member of the collection is a member A which minimizes the root-mean-square distance between A and all members of the collection,

f(A) = [ Σ_B f(A, B)^2 ]^(1/2),

where B ranges over the whole collection.

A pseudometric d on a set S is a real-valued function d: S × S → R such that: (1) if x ∈ S, then d(x, x) = 0; and (2) if x ∈ S, y ∈ S, and z ∈ S, then d(x, y) ≤ d(y, z) + d(z, x), the triangle inequality. A metric d on S (or a distance function for S) is a pseudometric on S such that if x ∈ S, y ∈ S, and d(x, y) = 0, then x = y.
The familiar metric d: 2^S × 2^S → R for the collection of all subsets of a finite set S is defined as follows: if A ⊆ S and B ⊆ S, then d(A, B) = |A △ B|, the cardinality of the symmetric difference of A and B, where A △ B = (A − B) ∪ (B − A). This function is a particularly suitable example of a metric in the present context, because it provides a logical starting point for the development of a metric on the class of efficient covers of X.

Investigating the approximation of a given symmetric, irreflexive binary relation R on a finite set S by an equivalence relation E, Zahn [32] defined the distance between E and R to be the number of elements in (E − R) ∪ (R − E). Each of E and R, indeed any binary relation on S, is a subset of the finite set S × S. This definition of distance between symmetric relations on a finite set is, therefore, an instance of the metric d above.

Because of the essential equivalence between symmetric, irreflexive relations on the finite set S and graphs with point set S, the above application of the metric d may be reformulated as follows. The distance between graphs G = (S, E) and G' = (S, E') is defined to be d(E, E') = |E △ E'|.

Now a given efficient cover A of the finite set X induces a graph G(A) = (X, E(A)) in a natural way. Let x and y be distinct elements of X; then xy ∈ E(A) in case there exists an M ∈ A such that each of x and y is an element of M. Consequently, let Φ be the
function from the class of efficient covers of X into the class of graphs with point set X defined as above, i.e., if A is an efficient cover of X, then Φ(A) = G(A) = (X, E(A)). (That Φ is an onto function, incidentally, is immediate: if G = (X, E), the clique set K(G) is an efficient cover of X; hence, K(G) ∈ Φ⁻¹G, i.e., Φ(K(G)) = G.) The pseudometric D1 on the class of efficient covers of X may then be defined by D1(A, B) = d(E(A), E(B)), the distance between the induced graphs.
Figure 5.1 Two Efficient Covers Which Induce the Same Graph
[The figure shows the covers A and B of a three-point set and the common induced graph G(A) = G(B).]

A = {{1, 2}, {1, 3}, {2, 3}} and B = {{1, 2, 3}} are two different efficient covers of X = {1, 2, 3}; yet the graph induced by both is the complete graph of three points. Therefore, the pseudometric D1 on the class of efficient covers of X is not a metric. To restate this conclusion: if X is a finite set of three or more elements, there exist a greater number of efficient covers of X than the number of graphs with point set X.
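The failure of D1 just described is easy to reproduce. In the sketch below, induced_edges is a hypothetical helper (my name, not the dissertation's notation) computing E(A), the line set of the graph induced by a cover: two distinct points are joined whenever some member of the cover contains both.

```python
from itertools import combinations

def induced_edges(cover):
    """Line set E(A) of the graph G(A) induced by a cover A."""
    return {frozenset(pair) for M in cover for pair in combinations(sorted(M), 2)}

A = [{1, 2}, {1, 3}, {2, 3}]
B = [{1, 2, 3}]
assert induced_edges(A) == induced_edges(B)           # both induce K3
assert len(induced_edges(A) ^ induced_edges(B)) == 0  # D1(A, B) = 0, yet A != B
```

Since two distinct covers are at D1-distance zero, D1 is only a pseudometric on the full class of efficient covers, exactly as the text concludes.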
Consider the relation R on the class of efficient covers of X defined as follows: A R B in case Φ(A) = Φ(B). That R is an equivalence relation is obvious, i.e., it is reflexive, symmetric, and transitive. The partition induced by R consists of the classes Φ⁻¹G, one for each graph G = (X, E).

Let Σ be a set of representatives of the equivalence classes of R; that is, Σ consists of exactly one efficient cover of X from each equivalence class. Let Φ' be the restriction of Φ to Σ. Since Φ' establishes a one-to-one correspondence between Σ and the class of graphs with point set X, D1 is a metric on Σ. Moreover, D1 is not a metric on any class of efficient covers which properly contains Σ. Since the clique set K(G) of G belongs to the equivalence class Φ⁻¹G, the natural choice for a system of representatives of the equivalence classes is Σ = {K(G): G = (X, E)}.

To summarize, D1 is a metric on the collection Σ of all efficient covers A of X such that A = K(Φ(A)), i.e., those covers of X which are clique sets of graphs with point set X. Consequently, if the collection of covers of X from which a most typical one is to be selected is a subcollection not only of the efficient covers but of Σ, then the metric D1 on Σ is available for utilization in that selection.

The subset Σ coincides with the family of all classifications of X under the definition of Bednarek [33]. A classification of X is defined to be an
efficient cover A of X such that if A, B, C ∈ A, there is a member of A containing (A ∩ B) ∪ (A ∩ C) ∪ (B ∩ C). It is then established by a theorem that a cover A of X is a classification if and only if A is the clique set of the graph on X naturally induced by A. Because the term "classification" is being used throughout this work in a more general sense, the term B-classification will be used to refer to classifications in the technical sense of the definition above, i.e., members of Σ.

Thus, if the collection of covers of X consists of only B-classifications of X, i.e., if it is a subcollection of Σ, then the metric D1 on Σ may be used to select a most typical member of the collection. Unfortunately, not all clusterings as defined in Chapter 3 are B-classifications. Figure 5.2 provides an example of a clustering A of a set X of nine points such that the clique set K(Φ(A)) of the induced graph Φ(A) is not identical to A.

The fact that not every clustering of a graph G = (V, E) is a B-classification of V is not too surprising when the contexts of the definitions are considered. The definition of B-classifications of V is within the context of V alone. The definition of a clustering of G (Chapter 3) makes use of the additional information represented by E. Within the broader context of the problem of deriving a classification of V from a measure of similarity on V, E reflects the similarities of the elements of V. Since
Figure 5.2 A Clustering Which Is Not a B-Classification
[The figure shows a clustering A of a graph on nine points.]

there is no reason to exclude, a priori, the possibility of a set of objects having similarities as reflected in, for example, the graph covered by A in Figure 5.2, the restriction of the definition of clusterings to B-classifications is unjustified. Therefore, the metric on the family of B-classifications of a finite set is insufficient. A metric on a larger subset than Σ is required.
5.2.1. A Metric for the Class of Collections of Nonempty Subsets of a Finite Set

Let X be a nonempty finite set, S be the collection of all subsets of X, and S' = S − {∅}; let T denote the collection of all subsets of S'. Thus, T consists of the class of all collections of nonempty subsets of X. The function D: T × T → R is defined as follows. If U, V ∈ T, then |U| ≤ |V| or |V| ≤ |U|; without loss of generality, suppose |U| ≤ |V|. Then there exists a one-to-one function from U to V; let F(U, V) denote the nonempty finite collection of all one-to-one functions from U into V. For each f ∈ F(U, V), define

Δ(f) = Σ_{Z ∈ U} d(Z, fZ) + Σ_{Y ∈ V − fU} d(Y, ∅),

where d(A, B) = |A △ B| for subsets A, B of X. Finally,

D(U, V) = min {Δ(f): f ∈ F(U, V)}.

THEOREM 5.1. D, as defined above, is a metric for the class of all collections of nonempty subsets of the nonempty finite set X.

The proof of the theorem is given in Appendix D. To illustrate the metric D, four covers of the set X8 = {1, 2, ..., 8} are given in Figure 5.3. Since the covers of the figure are B-classifications, they also provide for a comparison of D and D1. There are two
Figure 5.3 Four Covers of a Set of Eight Points
[The figure shows the four covers A1, A2, A3, and A4 of X8.]
one-to-one functions from A1 = {A11} into A2 = {A21, A22}: f1(A11) = A21 and f2(A11) = A22. Δ(f1) = d(A11, A21) + d(A22, ∅) = 2 + 3 = 5; Δ(f2) = d(A11, A22) + d(A21, ∅) = 5 + 6 = 11. Thus, D(A1, A2) = min {5, 11} = 5. Since
Now the purpose of the metrics is to provide a distance function on a class of covers of a finite set. Considered relative to that purpose, the preceding example indicates that D gives more plausible results than does D1. This is a consequence of the same fundamental difference which requires that D1 be restricted to the family of B-classifications: D is defined in terms of the identities of the covers; D1 is defined in terms of entities derived from the covers, namely, the induced graphs.

The computational effort required for the evaluation of D(A, B), however, constitutes a definite disadvantage of D. If n = |A| ≤ |B| = m, the number of one-to-one functions from A to B is |F(A, B)| = m(m − 1) ··· [m − (n − 1)] = m! / (m − n)!. To evaluate D(A, B), each of these functions must be generated, and, for each f ∈ F(A, B), Δ(f) must be computed. The effort required to compute {Δ(f): f ∈ F(A, B)} can be minimized by computing, in advance, the sets {d(A, B): A ∈ A, B ∈ B} and {|B|: B ∈ B}, the preparation of which requires only mn + m calculations. In this case, the computation of each Δ(f) requires only the table lookup and addition of m numbers. The total number of such operations required for the computation of all the numbers {Δ(f): f ∈ F(A, B)} is therefore (m) m! / (m − n)!. Evidently, D is computationally prohibitive for collections of, say, six or more subsets of X.
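A brute-force sketch of D follows, enumerating all one-to-one functions from the smaller collection into the larger, exactly as in the cost analysis above; the function names are mine. The m!/(m − n)! growth is visible in the permutations call, which is why the text judges D impractical for six or more subsets.

```python
from itertools import permutations

def d(A, B):
    """d(A, B) = |A symmetric-difference B| for subsets A, B of X."""
    return len(A ^ B)

def D(U, V):
    """The metric of Theorem 5.1: minimize Delta(f) over injections f: U -> V."""
    U, V = (U, V) if len(U) <= len(V) else (V, U)   # without loss of generality
    U, V = list(U), list(V)
    best = None
    for image in permutations(V, len(U)):           # each one-to-one f: U -> V
        delta = sum(d(Z, fZ) for Z, fZ in zip(U, image))
        delta += sum(len(Y) for Y in V if Y not in image)  # d(Y, empty) = |Y|
        best = delta if best is None else min(best, delta)
    return best

U = [frozenset({1, 2})]
V = [frozenset({1, 2, 3}), frozenset({4, 5})]
# The two injections give 1 + 2 = 3 and 4 + 3 = 7, so D(U, V) = 3.
assert D(U, V) == 3 and D(V, U) == 3 and D(U, U) == 0
```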
To summarize, D1 is not computationally costly, but is restricted to the family of B-classifications of X. On the other hand, D is a metric on the class of all efficient covers, but is practically restricted to collections of five or fewer subsets of X. The computational limitation might be due to the uselessly large family of collections of subsets of X on which D is a metric, namely, the class of all collections of nonempty subsets of X, i.e., the class of all collections A such that ∅ ∉ A and ∪A ⊆ X. That is, there might well be a suitable metric D' on a smaller class, one which is either undefined, or not a metric, on the class of all collections of nonempty subsets of X, but which is substantially less costly to compute than is D.

In conclusion, although the limitations of D1 and D noted above by no means render them valueless, neither metric is fully adequate for purposes of cover selection in the full context of this work.
PAGE 161
defines the homogeneity of a set of documents, in terms of which a tentative numerical cover evaluation function is defined.

In addition to the similarity information, this cover evaluation method provides for the influencing of cover selection by external economic considerations. The desirability of introducing economics into the cover selection process may be understood by consideration of the result of system initialization: the sequential decision tree, representing the multilevel classification derived by repeated cover generation and selection. The extensiveness of the classification affects both the search time and the storage requirements of the system: a more extensive classification, i.e., a larger search tree of more decision levels, generally reduces the time required to process a query and increases the quantity of memory required for storing the tree. Thus, a tradeoff exists between search time and required storage capacity. The second subsection therefore develops a suitable cost function. Finally, the third subsection unites the similarity information and the cost considerations into a function which gives a numerical measure of the value of a cover, by means of which the cover of a collection having the highest value may be selected.

5.3.1. Cluster Homogeneity

Given a finite set X and a similarity function S on X, the similarity may be generalized from applicability
to subsets of X of two elements to all subsets of X, as follows. If X' ⊂ X, then the homogeneity H(X') of X' is defined to be the average similarity of the pairs of elements of X'. Specifically,

    H(X') = 0 if |X'| < 2, and otherwise

    H(X') = [2 / (n(n - 1))] Σ_{j=1}^{n-1} Σ_{k=j+1}^{n} S(x_j, x_k),

where n = |X'| and X' = {x_1, x_2, ..., x_n}.

Let A = {X_1, X_2, ..., X_m} be a cover of X, with n_i = |X_i| for each i. The value W(A) of A is defined to be the weighted average of the homogeneities of the members of A:

    W(A) = Σ_{i=1}^{m} (n_i / n) H(X_i),

i.e.,

    W(A) = Σ_{i=1}^{m} (n_i / n) [2 / (n_i(n_i - 1))] Σ_{j=1}^{n_i - 1} Σ_{k=j+1}^{n_i} S(x_j^i, x_k^i),

where X_i = {x_1^i, x_2^i, ..., x_{n_i}^i}.

5.3.2. An Idealized Cost Function

The purpose of the cost function is to introduce economics into the structure derivation problem, to assure that the solution be matched to available resources, i.e., the machinery for system implementation. The two principal items of cost in the operational system are the storage
quantity and the search time, the effort required to produce a response to a user query. The two main requirements for storage are the document representations (consisting of the references for presentation to the user and the document indexes) and the sequential search tree. The former is not subject to variation within the process of classification derivation, and so is excluded from the cost function. Further references to the storage requirement are therefore understood to refer to the storage requirement for the classification representation.

One form of a cost function which has been offered [34, 35] for problems of this nature is the product of the search time and the storage quantity. However, this type of function does not explicitly take into account the relative costs of storage and search time. The pertinent quantity is the cost per search, which is the product of the cost per unit time for the implementation machinery and the time per search. The time per search can be expressed as the product of the time per relevance computation and the number of relevance computations per search:

    cost/search = (cost/time) x (time/search)
                = (cost/time) x (time/computation) x (# computations/search)
                = A x f x (cost/time),

where A is the time per relevance computation and f is the number of computations per search. The cost rate of the implementation machinery may be expressed as a + b x (memory quantity), where a and b
are constants. For example, the University of Florida Computing Center charge for the IBM 360/65 has that form, where a is $300 per hour and b is $100 per hour per 128 K bytes of main memory. The memory quantity is the product of the memory per node, B, and the number g of nodes of the search tree. Thus,

    cost/search = A x f x (a + bBg).

The normalized cost function h is then given by

    h = f (1 + Cg),

where the cost parameter C = bB/a, f is the number of relevance computations per average search, and g is the number of nodes of the search tree. As a numerical example, if a and b are as in the Computing Center example above, and B = 384 bytes/node, then C = 10^-3/node and h = f (1 + (10^-3/node) g).

Each of f and g depends on the configuration of the classification structure and, for a given structure, could be computed. However, the cost function is to be utilized during the derivation of the classification, at which time the structure is unknown. All that is known is the cost parameter C and the number of documents of the class whose covers are under evaluation. Consequently, the functions f and g must be defined on the basis of certain idealizing assumptions. Each of f and g will be expressed in terms of the number n of documents and the number x of stages of computation per search in the search tree corresponding to the idealized classification tree. In the absence of even
a simple classification, x = 1, and the idealized tree consists of just the document set; the one stage of calculation for searching is the computation of the relevance of each document to the query. In the case of a simple classification, the idealized tree consists of the root (the document set) and a partition of the document set. In this case, x = 2: the first stage of calculation is the selection of a class, the second is the relevance computation per document of the selected class. The idealization is completed by the specification that, for the given numbers n of documents and x of stages of calculation, the idealized tree minimizes the total number of calculations required to isolate a single document. If x = 1, f(x) = n. It is proved in Appendix E that for any natural number x of stages of calculation: (1) the number of major classes of the idealized tree is n^(1/x), each having n/n^(1/x) documents; and (2) f(x) = x n^(1/x). The idealized tree, in short, consists of x sequential selections, each being a one-out-of-n^(1/x), to effect the one-out-of-n overall selection.

The function g(x) gives the number of nodes of this idealized classification tree, not counting the root, since it entails no explicit representation of a class. The number of level-2 classes is n^(1/x), each having n^((x-1)/x) documents. The number of members of a partition of any one of these is (n^((x-1)/x))^(1/(x-1)) = n^(1/x). Since there are n^(1/x) such
partitions of level-2 classes, the total number of level-3 classes is n^(2/x). Finally, the number of level-x classes is n^((x-1)/x), each having n^(1/x) documents. Hence,

    g(x) = n^(1/x) + n^(2/x) + ... + n^((x-1)/x),

a geometric series which may be expressed

    g(x) = (n - n^(1/x)) / (n^(1/x) - 1).

The normalized cost function h may now be given:

    h(x) = x n^(1/x) [1 + C (n - n^(1/x)) / (n^(1/x) - 1)].

The minimization of the cost function with respect to x affects the cover selection by specifying the optimum number n^(1/x_0) of members of a cover of the set of n documents, and their optimum sizes n^((x_0 - 1)/x_0). The optimum number x_0 of stages of calculation is the least natural number x such that h(x) ≤ h(x+1). The cost function is illustrated in Table 5.2, with n = 100 and C = 10^-3. In this illustration, x_0 = 4. Consequently, an economically optimum cover would consist (approximately) of a partition of three sets of thirty-three documents each.

5.3.3. An Evaluation Function

The tentative evaluation function W was defined to be a weighted sum of the homogeneities of the subsets or clusters of a given cover. The function of this subsection is to develop a cover evaluation which suitably utilizes the cost function as well as the cluster homogeneities.
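The idealized functions f and g, the normalized cost h, and the minimization yielding x_0 may be sketched as follows (illustrative Python; n > 1 is assumed, and the function names are not part of the system):

```python
def f_searches(x, n):
    """Idealized number of relevance computations per search: x * n**(1/x)."""
    return x * n ** (1.0 / x)

def g_nodes(x, n):
    """Idealized number of tree nodes, root excluded:
    (n - n**(1/x)) / (n**(1/x) - 1); zero when x = 1 since n**(1/1) = n."""
    r = n ** (1.0 / x)
    return (n - r) / (r - 1)

def h_cost(x, n, C):
    """Normalized cost h(x) = f(x) * (1 + C * g(x))."""
    return f_searches(x, n) * (1.0 + C * g_nodes(x, n))

def optimum_stages(n, C):
    """x0: the least natural number x with h(x) <= h(x + 1)."""
    x = 1
    while h_cost(x, n, C) > h_cost(x + 1, n, C):
        x += 1
    return x
```

With n = 100 and C = 10^-3 this reproduces the values of Table 5.2 and yields x_0 = 4.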
Table 5.2 Illustration of the Cost Function

     x      h(x)
     1     100.0
     2      20.2
     3      14.3
     4      13.2
     5      13.4
     6      14.0
     7      14.9
     8      16.0
     9      17.2
    10      18.5
    11      19.9
    12      21.3
    13      22.8
    14      24.4
    15      26.0
    16      27.6
    17      29.4
    18      31.1
    19      32.9
    20      34.8
Specifically, X is a set of n = |X| documents; 𝒳 = {X_1, X_2, ..., X_m} is a given cover (clustering) of X, consisting of m clusters, and H(X_i) is the homogeneity of X_i. The issue at hand is precisely how to modify a weighted sum Σ a_i H(X_i) of the cluster homogeneities to account for the cost information.

The basic information provided by the cost function h is the economically optimum number x_0 of levels of an idealized classification of n objects. The number of major classes of such a classification is n^(1/x_0). Now x_0 is defined to be the least natural number x such that h(x) ≤ h(x+1). If h is considered as a function defined on the positive real numbers, it attains a minimum in the interval [x_0, x_0 + 1). Consequently, the number m_0 of major classes is actually specified only to within the following precision:

    m_0 ∈ M = {m_1, m_1 + 1, ..., m_2 - 1, m_2}, where m_1 = ⌊n^(1/(1+x_0))⌋ and m_2 = ⌈n^(1/x_0)⌉.

Similarly, the optimum class size s_0 is bracketed as follows:

    s_0 ∈ S = {s_1, s_1 + 1, ..., s_2 - 1, s_2}, where s_1 = ⌊n/m_2⌋ and s_2 = ⌈n/m_1⌉.

Perhaps the most straightforward manner of modifying the tentative evaluation W(𝒳) with the cost information is to replace it by A_0(𝒳) W(𝒳), where

    A_0(𝒳) = 1 if m = |𝒳| ∈ M, and 0 otherwise.

However, the effect of such a function is to refuse to consider any cover whose number of members does not conform
to the a priori demands of the cost function, regardless of the similarity data. It would certainly be more reasonable to weight the cover with unity in case m ∈ M, but with the weight falling from unity toward zero as m decreases from m_1 or increases from m_2. Therefore, let k_1 and k_2 be positive real numbers with k_1 < k_2, and let k = (k_1 + k_2)/2. The function w: R → R is defined as follows:

    w(x; k_1, k_2) = w_1(x; k_1, k_2) if x < k_1,
    w(x; k_1, k_2) = 1                if k_1 ≤ x ≤ k_2, and
    w(x; k_1, k_2) = w_2(x; k_1, k_2) if k_2 < x,

where w_1 increases from zero to unity over [0, k_1) and w_2 decreases from unity toward zero beyond k_2. This interval weighting function is illustrated in Figure 5.4, with k_1 = 4 and k_2 = 8. The cover weighting A(|𝒳|) is defined to be w(|𝒳|; m_1, m_2), m_1 and m_2 being the optimum cover size bounds.

The issue of the weights a_i of the clusters remains to be resolved. The necessity of such weighting may be appreciated by considering a cover 𝒳 consisting of m ∈ M clusters, but with one of these very large and all the others very small. In such a case, the cover weighting is the maximum (unity). However, such a cover is inferior to one having the same number of clusters, but with the sizes of the clusters more nearly constant. Because of the availability of the cluster size bounds s_1 and s_2 above, the function w may also be used
[Figure 5.4 An Interval Weighting Function: w rises from 0 to 1 over [0, 4], holds at 1 on [4, 8], and falls toward 0 beyond 8.]

for the cluster weights: a_i(X_i) = w(|X_i|; s_1, s_2). The evaluation function V(𝒳) is therefore defined:

    V(𝒳) = w(m; m_1, m_2) Σ_{i=1}^{m} w(|X_i|; s_1, s_2) H(X_i).

This function includes the intraclass similarities of the classes of the cover, the optimum number of classes for a set X = ∪𝒳 of size |X| = n, and the optimum size of the classes.
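A sketch of the evaluation function (illustrative Python). The homogeneity H and the combination V follow the definitions above; the linear ramp shapes chosen here for the rising and falling pieces of the interval weighting function w are an assumption for illustration, since only the plateau on [k_1, k_2] and the decay toward zero outside it are fixed by the text:

```python
from itertools import combinations

def homogeneity(members, S):
    """H(X'): average similarity over unordered pairs; 0 if fewer than 2 elements."""
    n = len(members)
    if n < 2:
        return 0.0
    return 2.0 * sum(S(a, b) for a, b in combinations(members, 2)) / (n * (n - 1))

def interval_weight(x, k1, k2):
    """w(x; k1, k2): unity on [k1, k2], falling toward zero outside.
    The linear ramps below are illustrative assumptions, not the text's w1, w2."""
    k = (k1 + k2) / 2.0
    if x < k1:
        return x / k1                       # rises from 0 at x = 0 to 1 at k1
    if x <= k2:
        return 1.0
    return max(0.0, 1.0 - (x - k2) / k)     # falls from 1 at k2 toward 0

def cover_value(cover, m1, m2, s1, s2, S):
    """V: cover-size weight times the size-weighted sum of cluster homogeneities."""
    return interval_weight(len(cover), m1, m2) * sum(
        interval_weight(len(X), s1, s2) * homogeneity(X, S) for X in cover)
```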
5.4. The Classification Derivation Algorithm

This section provides an algorithm for the machine derivation of a multilevel classification of a collection of documents, given in the form of a set Z of document indexes on which a numerical similarity function S is given. The transformation from the classes of the resulting classification into their representations, i.e., the production of the sequential search tree from the tree of subclasses of Z, is discussed in the next chapter.

In briefest essence, the algorithm consists of the repeated application of (1) the generation of the graph-theoretical clusterings, as defined in Chapter 3; and (2) the selection of one of these covers, by means of the cover evaluation function developed in the preceding section of this chapter. This cover generation and selection procedure is first applied to Z; next it is applied to the members of the selected cover of Z; and so forth, until each class not subclassified is suitably small.

More specifically, let X be the subset of Z under consideration for subclassification, with n = |X|. The first action required is the decision of whether or not X is sufficiently large to justify subclassification. Therefore, the cost function h is used to determine, from n and the cost function parameter C, the optimum number x_0 of levels of classification from X downward, the bounds (m_1, m_2) of the number of subsets of X in an
optimum cover of X, and the bounds (s_1, s_2) of the optimum number of members of subsets of an optimum cover. In case x_0 = 1, X is marked to be an endpoint of the classification tree, and is not processed. Otherwise, covers of X are generated and evaluated; the evaluation function V(𝒳) applied to a cover 𝒳 of X makes use of the other outputs of the bounds procedure, (m_1, m_2) and (s_1, s_2), as well as the homogeneities of the members of 𝒳.

The generation of covers of X is the production of type-3 clusterings of graphs G_t(X) having point set X, with points x and y of X adjacent if and only if x ≠ y and t ≤ S(x, y). The choice to generate only A(G_t(X)), the type-3 clusterings, rather than the full family of clusterings of G_t(X), is based on the two considerations discussed in Chapter 3: the effort per distinct generated clustering is generally much smaller for A(G_t(X)) than for the full family; and the subchain (A(G_t(X)), <) of the chain of all clusterings is a good approximation of that chain, subject to the constraint of the number of elements of the subchain, in the sense of a continuous progression from the clique clustering to the component clustering of G_t(X).

The similarity threshold t is chosen as follows. The first threshold is the greatest similarity such that if x ∈ X, then there is an x' ∈ X such that x ≠ x' and t ≤ S(x, x'). That is, the initial threshold t_1 is the strictest threshold t such that G_t(X) has no isolated
points. If more than one graph G_t(X) is required, the successor t_{i+1} of threshold t_i is the greatest similarity of pairs of elements of X less than t_i. Consequently, G_{t_{i+1}}(X) is the minimal proper supergraph of G_{t_i}(X) obtainable by similarity thresholding.

The number of thresholds controls the effort devoted to the cover generation of X. Consequently, the number of thresholds is bounded by an input parameter P_t, whose purpose is to limit the effort expended on cover generation for any subset X of Z. The second effort-limiting parameter, P_c, is the minimum number of distinct clusterings desired, from among which to select the best. The effort expended on cover generation for X is controlled by these parameters, P_t and P_c. On completion of the cover generation by means of threshold t_i, the cover generation effort is terminated in case (1) the number of generated clusterings is not less than P_c, or (2) P_t = i, the number of threshold graphs of X which have been processed.

Concerning the limitation of effort, or more accurately, the limitation of effort to the production of useful results, the cover generation procedure takes into account the nature of the cover evaluation function as follows. No cover 𝒳 of X such that |𝒳| > 5m_2 is generated. The basis for this particular efficiency policy is that (1) one may reasonably expect to generate some covers with no more than five times the maximum of the number of subsets of a cover of optimum size; and (2) the value of a cover of more
than 5m_2 subsets will be less than that of the smaller covers.

ALGORITHM 5.1

Step 1. [Initialize.] Input the cost parameter C, the effort-limiting parameters P_t and P_c, and the set Z of document representations. Compute the similarities S(x, y) of the pairs of elements of Z. Initialize the classification tree to consist of just the root (Z), and the current tree node selector to the root.

Step 2. [Find first similarity threshold.] Determine the first similarity threshold t_1: the minimum, over elements y of X (the class indicated by the current tree node selector), of the maximums, over elements x of X, of S(x, y) with x ≠ y. Initialize the threshold counter N_t = 0.

Step 3. [Bounds procedure.] The cost function h(x; n, C), with n = |X|, is minimized with respect to natural numbers x to obtain (1) x_0, the optimum number of levels of subclassification of X; (2) the optimum cover size bounds (m_1, m_2); and (3) the optimum cover member size bounds (s_1, s_2).
If x_0 = 1, then mark X to be an endpoint of the classification tree and go to Step 9; otherwise, proceed.

Step 4. [Produce the threshold graph.] Increment N_t, the threshold counter. By reference to the similarities S(x, y) of the elements of X and the similarity threshold t, the graph G_t(X) is produced.

Step 5. [Find clusterings.] Generate all type-3 clusterings of G_t(X) having 5m_2 or fewer members.

Step 6. [Cover evaluation.] Apply the evaluation function V(𝒳) to each generated clustering 𝒳, where the parameters of the function are obtained from Step 3.

Step 7. [Effort limitation and next threshold.] If the number c of accumulated covers is at least as large as P_c, or if the number N_t of thresholds used is P_t, then go to Step 8. Otherwise, find the largest similarity of elements of X which is less than the current threshold, assign its value to the current threshold, and go to Step 4.
Step 8. [Cover selection.] If there is no accumulated cover, mark X to be an endpoint. Otherwise, attach to X the accumulated cover of highest value, and move the current node selector to the first cluster of the selected clustering.

Step 9. [Find the next class to be processed.]
(a) If the current node X is unprocessed (neither is marked as an endpoint nor has successors), then go to Step 2.
(b) If the current node is the root, then go to Step 10. Otherwise, if the current node is not the last cluster of a clustering, then move the node selector to an unprocessed cluster of the clustering, and go to Step 2.
(c) Move the node selector from X to its predecessor, and go to Step 9(b).

Step 10. [Exit.] Output the classification tree and stop.

In view of the discussion preceding Algorithm 5.1, the detailed discussions of the cost function h and the evaluation function V given above in this chapter, and the straightforwardness of their implementations, only Step 5 of the algorithm requires expansion. Algorithm 5.2 below therefore provides the details for Step 5 of Algorithm 5.1.
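The threshold succession of Steps 2 and 7 may be sketched as follows (illustrative Python; X is assumed to be a sequence of distinct elements and S a symmetric similarity function):

```python
def threshold_sequence(X, S):
    """Similarity thresholds t1 > t2 > ... as used in Steps 2 and 7.

    t1 is the minimum, over y in X, of the maximum of S(x, y) over x != y
    (the strictest threshold leaving no isolated point in G_t(X)); each
    successor is the greatest pairwise similarity strictly below the
    current threshold."""
    t = min(max(S(x, y) for x in X if x != y) for y in X)
    yield t
    for s in sorted({S(x, y) for i, x in enumerate(X) for y in X[i + 1:]},
                    reverse=True):
        if s < t:
            t = s
            yield t
```

Exhausting this sequence corresponds to processing every distinct threshold graph; Algorithm 5.1 stops earlier once P_c covers have accumulated or P_t thresholds have been used.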
Algorithm 5.2 generates the k-clique graphs in the order k = κ, ..., 3, 2, 1, where κ is the maximum clique size, because the line set of the (k-1)-clique graph contains all the lines of the k-clique graph. The lines of the 1-clique graph are partitioned into sets L_k, where a line AB ∈ L_k just in case the line is in the k-clique graph but not the (k+1)-clique graph. Therefore, the components of the k-clique graph can be efficiently found from L_k and the components of the (k+1)-clique graph as follows: if the endpoints of a line in L_k are in different sets, these two sets are merged into one (Step 5 of Algorithm 5.2).

ALGORITHM 5.2

Step 1. [Clique detection.] If t_i = t_1 (the first threshold), or if n ≤ 2 x |E_{t_i}(X) - E_{t_{i-1}}(X)|, then the clique set Q of G_{t_i}(X) = (X, E_{t_i}(X)) is found by the Neighborhood Clique Detection algorithm (Algorithm 4.2); otherwise, the cliques of G_{t_i}(X) are found from those of G_{t_{i-1}}(X) by the Line Removal Clique Detection algorithm (Algorithm 4.3). The number of cliques is ν and the maximum clique size is κ. If ν = 1, then go to Step 9. If ν ≤ 5m_2 (m_2 is the upper bound on the number of members of a cover of optimum size), then the clique clustering is included in the collection of accumulated covers of X.
Step 2. [Generate generalized clique graphs.] For each pair of cliques A and B which meet, find the maximum k such that A and B are adjacent in the type-3 k-clique graph, a function of |A ∩ B| and |A ∪ B| as defined in Chapter 3. Include the line AB in the collection L_k of all those lines of the k-clique graph not in the (k+1)-clique graph of G_{t_i}(X).

Step 3. [Initialize.] Initialize the component set 𝒦 of the generalized clique graph to consist of its singletons (the type-3 κ-clique graph is totally disconnected). Initialize k = κ and λ = ν (λ is the number of components of the k-clique graph).

Step 4. Set k = k - 1. If k = 0, then go to Step 9, since all k-clique graphs have been processed.

Step 5. [Next k-clique graph.] For each line AB ∈ L_k, find the components K_A and K_B in 𝒦 such that A ∈ K_A and B ∈ K_B. If K_A ≠ K_B, then replace 𝒦 by (𝒦 - {K_A, K_B}) ∪ {K_A ∪ K_B}.

Step 6. If |𝒦| = λ, then go to Step 4, since the k-clustering would be the same as the (k+1)-clustering. Otherwise replace λ by |𝒦|; if 5m_2 < λ, then go to Step 4.
Step 7. [Next k-clustering.] Compute the cover 𝒳 = {∪K: K ∈ 𝒦}, the collection of the unions of the cliques of the components of the k-clique graph. Delete from 𝒳 any member contained in any other member.

Step 8. If |𝒳| = 1, then go to Step 9, since all i-clusterings for i ≤ k are the trivial {X}. Otherwise, include 𝒳 in the collection of accumulated covers of X, and go to Step 4.

Step 9. Exit.
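The component maintenance of Steps 3 through 6 is naturally implemented with a disjoint-set (union-find) structure; the following sketch (illustrative Python) assumes the cliques are indexed 0, ..., ν - 1 and that the line sets L_k of Step 2 are given:

```python
def clusterings_from_lines(num_cliques, lines_by_k, max_k):
    """Components of the k-clique graphs for k = max_k - 1 down to 1.

    lines_by_k[k] holds the lines (pairs of clique indices) in the
    k-clique graph but not the (k+1)-clique graph.  Union-find replaces
    the explicit set replacement of Step 5; a value of k producing no
    merge is skipped, as in Step 6.  Returns {k: sorted components}."""
    parent = list(range(num_cliques))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path halving
            a = parent[a]
        return a

    results = {}
    for k in range(max_k - 1, 0, -1):
        merged = False
        for a, b in lines_by_k.get(k, ()):
            ra, rb = find(a), find(b)
            if ra != rb:                    # Step 5: merge the two components
                parent[ra] = rb
                merged = True
        if merged:
            comps = {}
            for i in range(num_cliques):
                comps.setdefault(find(i), set()).add(i)
            results[k] = sorted(map(sorted, comps.values()))
    return results
```

Step 7's unions of cliques and the absorption of contained members would then be applied to each recorded component set.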
CHAPTER 6
THE SEQUENTIAL SEARCH TREE

6.1. Introduction

As indicated in Chapter 2, the FERRET system initialization consists of a classification derivation followed by a transformation of the classes (subsets of the document set) of the classification into representations of these classes. It was also noted in Chapter 2 that the two principal functions of the operational system are updating and query processing. Chapter 5, supported by Chapters 3 and 4, has provided in detail for the classification derivation. This chapter completes the system specification. It is addressed to the problems of the completion of initialization, system updating, and search and retrieval. That is to say, this chapter is concerned with the production, updating, and utilization of the sequential search tree representing the given multilevel classification.

The experimental evaluation of the classification derivation procedure requires the transformation of the classification into a search tree and the specification of search procedures. This chapter completes the specification
of the FERRET system as required for the experiments reported in Chapter 7.

6.2. Class Representation Transformation

The classification derivation procedure produces a tree of subsets of a document set having the following form. The root of the tree is the set X(1) of all the documents. The n(1) members X(1, 1), X(1, 2), ..., X(1, n(1)) of the selected cover of X(1) are the successors of the root. If X(1, i) is subclassified, the successors of X(1, i) are the n(1, i) members X(1, i, 1), X(1, i, 2), ..., X(1, i, n(1, i)) of the cover of X(1, i). Similarly, the successors of any of these classes which is further subclassified, and therefore not an endpoint of the classification tree, are the members of the selected cover of that class. It is required to transform the classification tree into a sequential search tree, as suggested by the small example of Figure 6.1.

The members of a cover must be represented in a form suitable for query matching. Thus, each class X except the root is transformed into a representation R(X) by reference to the representations of the documents of X. A document representation is a term vector, that is, an Nt-tuple of logical or numeric values, one for each of the Nt terms. The representation R(X) of the class X of documents is defined to be the aggregate index of the class: that term vector which is the
X(1) = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}
    X(1, 1) = {4, 8, 9, 11}
    X(1, 2) = {1, 2, 3, 4, 5, 6, 7, 10}
        X(1, 2, 1) = {2, 3, 4, 5, 6, 7}
        X(1, 2, 2) = {1, 2, 10}

a. A small classification tree

(root)
    R(X(1, 1))  [endpoint: stores X(1, 1)]
    R(X(1, 2))
        R(X(1, 2, 1))  [endpoint: stores X(1, 2, 1)]
        R(X(1, 2, 2))  [endpoint: stores X(1, 2, 2)]

b. The corresponding search tree

Figure 6.1 A Small Classification Tree and the Corresponding Search Tree
sum of the term vectors of the documents belonging to X. In case the document term vectors are logical, the ith element of the class term vector is the number of members of the class to which the ith term applies.

As an illustration, suppose that the collection X(1) of eleven documents of Figure 6.1 is logically indexed by Nt = 16 terms, as in Table 6.1. The representation of class X(1, 1) = {4, 8, 9, 11} is given by the sum of rows 4, 8, 9, and 11 of the document-term matrix:

    R(X(1, 1)) = (0, 1, 1, 1, 0, 0, 1, 0, 0, 3, 0, 2, 0, 1, 2, 1).

Table 6.1 A Document-Term Matrix for Figure 6.1

          1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
     1    1  0  0  1  0  0  0  1  0  0  1  0  1  1  0  0
     2    1  1  1  0  0  0  0  1  1  0  1  0  0  0  0  1
     3    0  1  1  0  0  1  0  1  0  0  0  1  0  0  0  0
     4    0  1  1  0  0  0  0  0  0  1  0  0  0  0  0  0
     5    0  0  1  0  1  0  0  0  0  0  0  0  0  0  0  0
     6    0  0  1  0  0  0  0  0  1  0  0  0  0  0  0  0
     7    0  0  1  0  1  1  1  1  0  0  0  0  1  0  0  0
     8    0  0  0  1  0  0  0  0  0  1  0  0  0  0  1  0
     9    0  0  0  0  0  0  1  0  0  1  0  1  0  1  0  0
    10    0  0  0  0  0  0  0  1  0  0  1  0  0  0  0  0
    11    0  0  0  0  0  0  0  0  0  0  0  1  0  0  1  1
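The aggregate index computation may be sketched directly from Table 6.1 (illustrative Python; the matrix literal transcribes the table, and the function name is not part of the system):

```python
# Rows of the document-term matrix of Table 6.1 (documents 1-11, 16 terms).
DOC_TERM = {
    1:  (1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0),
    2:  (1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1),
    3:  (0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0),
    4:  (0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0),
    5:  (0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
    6:  (0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0),
    7:  (0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0),
    8:  (0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0),
    9:  (0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0),
    10: (0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0),
    11: (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1),
}

def aggregate_index(docs, matrix=DOC_TERM):
    """R(X): componentwise sum of the term vectors of the documents in X."""
    vectors = [matrix[d] for d in docs]
    return tuple(sum(col) for col in zip(*vectors))
```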
This form of class representation is particularly suitable for query matching by means of the cosine correlation coefficient S2, given in Chapter 2, Equation 2.2, where a query is an assignment of negative, zero, or positive weights to the terms of the system, i.e., an Nt-tuple of term weights.

6.3. Updating the Search Tree

An operational reference retrieval system is not a static entity: new documents are continually added to the data base. The issue at hand is how, precisely, a new document is to be assimilated within the sequential search tree representing a multilevel classification of the document set. The standard for an updating method is that the result of updating coincide with the search tree which would be produced by reinitialization. A general strategy for updating has been suggested in Chapter 2, viz., that system reinitialization actually be performed at long intervals, and that a more modest form of updating be applied to incoming documents in the interims between reinitializations.

The most straightforward approach to interim updating is to treat the problem as an instance of the assignment problem: a given document is to be assigned to appropriate predefined classes. The index of the document is matched against the representations of the major classes, i.e., the
members of the cover of the whole document set. The class X of highest score, i.e., the class best matching the document, is selected; the representation R(X) of the selected class is modified to include the index of the document. The document index is then matched against the successors of the selected class X, resulting in a selection and modification as before. This continues until the selected class has no successors in the search tree, i.e., is an endpoint. Besides the representation R(X) of the class X represented by the endpoint of the search tree, the class X itself is stored in the endpoint; the class X is modified by the inclusion of the document code or identifier.

This interim updating may be illustrated by means of the small example of a search tree given in Figure 6.1. Suppose that document 12 is acquired subsequent to initialization, and that its index is

    I_12 = (0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1).

The cosine correlations of I_12 with R(X(1, 1)) and R(X(1, 2)) are 0.73 and 0.15, respectively. Therefore, I_12 is assigned to the class represented by X(1, 1), and R(X(1, 1)) is changed from

    (0, 1, 1, 1, 0, 0, 1, 0, 0, 3, 0, 2, 0, 1, 2, 1) to
    (0, 1, 1, 1, 0, 0, 2, 0, 0, 4, 0, 2, 0, 1, 3, 2).

The preceding interim updating may be modified to permit a new document to be assigned to more than one class of a cover at a given level of assignment. The document is
assigned to each class whose matching score relative to the document is no less than, say, 80 per cent of the highest class score.

More elaborate forms of interim updating might be developed to lengthen the intervals between system reinitializations. For example, the search tree is traversed from the root toward endpoints, with the document assigned at each level to just one class, until the first node X is reached having the following property: there are two or more successors of X scoring at least, say, 0.8 times the highest score of the successors, i.e., more than one successor is indicated. The class X, updated to include the new document, is subjected to classification derivation; and the subtree of the search tree, from class X on down, is renewed. In short, the reinitialization process is applied to a subtree of the classification.

Since the updating problem is not the principal concern of this study, updating methods are not developed in greater detail here. Rather, this discussion is intended to demonstrate that, although the updating problem for FERRET is not trivial, it is tractable.

6.4. Search and Retrieval

A query is an assignment of negative, zero, and/or positive weights to each of the Nt terms of the system, i.e., an Nt-tuple of real numbers. The user specifies a
query in the form of a list of terms of the system which he judges to be pertinent to his objective, along with their nonzero weights.

The fundamental manner in which the sequential search tree is utilized for searching is as follows. The initial decision node of the tree is the root, representing implicitly the whole document set. The first decision is the selection of a successor to the root, i.e., the selection of a major class of the underlying classification. The selected successor becomes the next decision node, the decision at which is the selection of a successor of the decision node, that is, the selection of a member of the cover of the class represented by the decision node. This process continues until an endpoint is reached, thereby traversing a chain in the tree from the root to an endpoint, which represents a class of the classification which is not subclassified. The endpoint, unlike the nonendpoints of the tree, stores explicitly the identity of the class of documents which it represents, e.g., as a list of document codes or identifiers. The underlying class of the endpoint constitutes the set of retrieved documents. The members of the retrieved set are matched against the query and output in order of descending scores; this ordered list of document references and their scores constitutes the (final) response.

The mechanism for effecting a selection from the successors of a decision node depends upon whether or not the user participates in the decision.
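The matching and ranking of the retrieved set may be sketched as follows (illustrative Python; the cosine correlation of Chapter 2 is taken in its standard form, and the second test vector below is assumed to be the aggregate index of X(1, 2) of Figure 6.1, i.e., the sum of rows 1 through 7 and 10 of Table 6.1, which reproduces the updating scores 0.73 and 0.15 of Section 6.3):

```python
from math import sqrt

def cosine(u, v):
    """Cosine correlation of two term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def ranked_response(query, retrieved):
    """Order a retrieved class, given as {doc id: index vector},
    by descending cosine score against the query."""
    return sorted(retrieved, key=lambda d: cosine(query, retrieved[d]),
                  reverse=True)
```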
177 First, suppose not, i.e., suppose that the user supplies a query and receives a final response, taking no further action. At each successive decision node, the system matches the query against the successors, i.e., computes a score (e.g., the cosine correlation) for the aggregate index of each member of the cover of the class represented by the decision node; the successor of highest score is chosen to become the next decision node. Suppose now that the search is interactive with the user. At each decision node, the system again computes the scores of the successors relative to the query. But in this case, the system does not simply select the successor of highest score, and proceed. Instead, the successors are ordered according to scores, and characterizations of these alternatives are presented to the user in order of decreasing score. The user then selects the successor, and submits a modified query. A class characterization suitable for the user must be determined from the class representation, which is to say, the aggregate index R(X) of a given class X Clearly, the best characterization of X possible by a sin g le term is the specification of a term ti such that the ith component r l is exceeded by no other component. More concretely, if the document indexing is logical, R(X) gives the occurrence frequencies of terms over documents of the class X; terms of high occurrence frequency are more characteristic of the subject matter of the class than terms of low occurrence frequency.
For a given nonzero Nt-tuple x of nonnegative real numbers, let Y(x) = {y1, y2, ..., yn(x)} be the set of those numbers other than zero, ordered as follows: y1 > y2 > ... > yn(x). Ti(x) specifies all those components of x having value yi. Now let Ui(x) be the sum of those components of x of rank 1, 2, ..., or i. Then U1(x) is the sum of the components of x of maximum value; U2(x) is the sum over components of maximum value and of second greatest value; and, in particular, U(x) = Un(x)(x) is the sum of all the components of x. Finally, for a given positive number θ ≤ 1, let

    k(θ) = min {i : Ui(x) ≥ θ × U(x)} ;

plainly, k(θ) ≤ n(x), since Un(x)(x) = U(x) ≥ θ × U(x). The θ-characterization of a class X having representation x is defined to be the list of term-component pairs (ti, xi) for

    i ∈ T1(x) ∪ T2(x) ∪ ... ∪ Tk(θ)(x) ,

ordered by decreasing value of the component values xi.

The preceding formal definition of a class characterization may be clarified by an example. The representation R(X(1, 1)) of class X(1, 1) = {4, 8, 9, 11} of Figure 6.1 was seen, by reference to Table 6.1, to be:

    x = (0, 1, 1, 1, 0, 0, 1, 0, 0, 3, 0, 2, 0, 1, 2, 1) .

The set Y(x) of distinct values of the nonzero components of
x consists of y1 = 3, y2 = 2, and y3 = 1, with T1(x) = {10}, T2(x) = {12, 15}, and T3(x) = {2, 3, 4, 7, 14, 16}. Then

    U1(x) = |T1(x)| × y1 = (1)(3) = 3 ,
    U2(x) = U1(x) + |T2(x)| × y2 = 3 + (2)(2) = 7 ,
    U3(x) = U2(x) + |T3(x)| × y3 = 7 + (6)(1) = 13 .

Taking θ = 1/2, one sees that k(1/2) = 2, since U1(x) = 3 < (1/2)(13) = 6.5 = θ × U(x), while U2(x) = 7 ≥ U(x)/2. The resulting characterization is then

    (t10, 3), (t12, 2), (t15, 2) ,

where the ti denote the natural language index terms. Thus, although nine terms have nonzero occurrence frequencies over the documents of X(1, 1), the three terms of highest occurrence frequencies account for more than half the sum of the occurrence frequencies of all nine terms. The low frequency terms are thus regarded as accidental or inessential to the identity of the subject area of the class, whereas the high frequency terms are considered to be characteristic of that subject area. To provide the user with an additional clue concerning the nature of a class, the above class characterization is augmented by a sample of the class, in the form of the title, author, and index of a representative document of the class. A representative document of a class is a document of the class having maximum correlation with the aggregate index of the class; the selection of a representative document is most conveniently made during the process of classification representation transformation, with the document code retained with the class representation in the search tree.
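The θ-characterization just illustrated can be computed directly from the aggregate index. The following sketch (the function name is an assumption) reproduces the worked example:

```python
def theta_characterization(x, theta):
    """theta-characterization of a class with aggregate index x (a tuple of
    nonnegative term frequencies): the shortest frequency-ordered prefix of
    the nonzero components whose sum reaches theta times the total sum."""
    total = sum(x)
    # Distinct nonzero values, greatest first (y1 > y2 > ...).
    values = sorted({v for v in x if v > 0}, reverse=True)
    chosen, running = [], 0
    for y in values:
        # T_i(x): term numbers (1-based) whose component equals y.
        tier = [i + 1 for i, v in enumerate(x) if v == y]
        chosen.extend((t, y) for t in tier)
        running += y * len(tier)
        if running >= theta * total:   # rank k(theta) reached
            break
    return chosen

x = (0, 1, 1, 1, 0, 0, 1, 0, 0, 3, 0, 2, 0, 1, 2, 1)
print(theta_characterization(x, 0.5))   # [(10, 3), (12, 2), (15, 2)]
```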
Up to this point, the role of the sequential search tree in the searching process has been discussed, for searching both with and without user participation. Before specific search modes are defined, another user requirement must be mentioned, viz., the recall-precision requirement. The recall of a response to a query is the ratio of the number of relevant retrieved documents to the total number of documents which are relevant to the query; the precision is the ratio of the number of relevant retrieved documents to the number of retrieved documents. Since a larger retrieved set generally has a greater number of relevant documents but also a greater number of irrelevant documents, the recall increases with response size, and the precision decreases with response size. The user may therefore control the recall and precision of the response by specifying the maximum number Nmax of documents that he desires to receive. A larger Nmax is specified in case the user requires a higher recall and can tolerate a lower precision, i.e., a larger proportion of irrelevant documents. In case the user does not interact with the system during the search, he specifies Nmax along with his query. The search procedure described above for this mode is modified as follows to complete the specification of the Basic Mode search procedure. The chain of decision nodes from the tree root terminates when (1) an endpoint is reached, or (2) the number of documents of the class represented by the decision node is less than or equal to
Nmax. In the latter case, the class of documents is constructed by forming the union of the classes represented by endpoints of the subtree from the decision node. In Figure 6.1, for example, if R(X(1, 2)) is the second decision node and Nmax = 8, then the response consists of X(1, 2, 1) ∪ X(1, 2, 2) = {2, 3, 4, 5, 6, 7} ∪ {1, 10} = {1, 2, 3, 4, 5, 6, 7, 10}, ranked, as usual, according to their similarities to the query. (The number of documents of a class, along with the aggregate index and representative document code, is retained in the search tree node during the classification representation transformation.) In the interactive mode, the class characterization of each successor of the decision node includes the class size. In view of his knowledge of the sizes of the class of the decision node and its successors, the user may either (1) choose a successor and proceed, or (2) designate the class or one of its successors as the retrieved set. There are two corresponding modes of transaction between the user and the system: the Feedback Mode, in which the user specifies a node to be the next decision node; and the Recall Mode, in which the user specifies a node to be treated as the endpoint of the search, whether or not it is actually an endpoint of the search tree. (The Recall Mode is so named because it provides the interactive user with the means of obtaining a response of higher recall than that of a smaller subclass of the response, i.e., of an endpoint of the search tree.)
The three modes of searching (Basic, Feedback, and Recall) are integrated into a single algorithm, given below.

ALGORITHM 6.1

Step 1. [Query input.] Initialize the decision node selector to the root of the search tree. Input the query (terms and their weights) and the mode (Basic, Feedback, or Recall). If the mode is Basic, then input Nmax (the maximum number of documents desired), and go to Step 3.

Step 2. [Decision node input.] Input the information specifying the user-selected node, and set the decision node selector to that node. If the mode is Recall, then go to Step 6. Otherwise, the mode is Feedback; proceed to Step 3.

Step 3. If the decision node is an endpoint, go to Step 7. If the node is not an endpoint, the mode is Basic, and the node represents a class of Nmax or fewer documents, then go to Step 6.

Step 4. [Successors' scores.] Compute the matching scores of the successors of the node, relative to the query. If the mode is Basic, advance the decision node selector to the successor of highest score, and go to Step 3. Otherwise, the mode is Feedback; go to Step 5.
Step 5. [Output successors' characterizations.] In order of scores, output the characterization of each successor: the frequency-ordered list of high frequency terms of the aggregate index with their respective frequencies, the reference to a document representative of the class, and the number of documents of the class. Go to Step 1.

Step 6. [Reconstruct the class.] Construct the temporary pseudo-endpoint consisting of the union of the classes represented by the endpoints of the subtree from the decision node. Set the decision node selector to the pseudo-endpoint.

Step 7. [Response.] Compute the matching scores of the documents of the class represented by the endpoint indicated by the decision node selector. Output the document references of the class and their respective scores, in order of decreasing matching score. Go to Step 1.
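The Basic Mode path through Algorithm 6.1 (Steps 1, 3, 4, 6, and 7) can be sketched as follows. The node and document structures here are assumptions of this sketch, not the dissertation's data formats, and overlaps between sibling classes are ignored in the size bookkeeping.

```python
from collections import namedtuple

Doc = namedtuple('Doc', 'code index')   # index: {term_code: weight}

class Node:
    """Search tree node; `index` is the class's aggregate index, and
    endpoints carry their document lists explicitly."""
    def __init__(self, index, children=(), documents=()):
        self.index = index
        self.children = list(children)
        self.documents = list(documents)
        self.size = len(documents) if not children else sum(c.size for c in children)

    def gather_documents(self):
        """Union of the classes represented by endpoints below this node
        (Step 6's pseudo-endpoint)."""
        if not self.children:
            return list(self.documents)
        seen, docs = set(), []
        for child in self.children:
            for d in child.gather_documents():
                if d.code not in seen:
                    seen.add(d.code)
                    docs.append(d)
        return docs

def dot(q, v):
    return sum(w * v.get(t, 0.0) for t, w in q.items())

def basic_search(root, query, n_max, score=dot):
    """Descend by highest-scoring successor until an endpoint is reached
    or the class holds at most n_max documents, then rank that class's
    documents against the query (Step 7)."""
    node = root
    while node.children and node.size > n_max:
        node = max(node.children, key=lambda c: score(query, c.index))
    docs = node.gather_documents()
    return sorted(((score(query, d.index), d.code) for d in docs), reverse=True)
```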
CHAPTER 7
EXPERIMENTAL INQUIRY AND CONCLUSIONS

7.1. Introduction

This chapter provides a preliminary evaluation of the classification derivation technique developed in the preceding chapters. The evaluation is concerned with the retrieval effectiveness and with the search efficiency of the FERRET system, relative to the performance of a serial search system. The fundamental tools for the evaluation of retrieval effectiveness are recall and precision. Suppose that the response to a query is simply a subset of the document set: the retrieved documents for the query. The recall of the response is the ratio of the number of relevant documents retrieved to the total number of relevant documents. The precision is the ratio of the number of relevant documents retrieved to the total number of documents retrieved. Plainly, the recall and precision require a means to establish whether or not a given document is relevant to a given query. As discussed by Lancaster [4], the method of relevance assessment determines the import of the recall and precision. If experts or subject specialists form the
relevance assessments by reference to the queries and the contents (not the indexes) of the documents, then the recall and precision reflect the quality of the index language and of the content analysis or indexing process, as well as the efficacy of the search procedure. If the user himself provides the relevance judgments, then the query formulation process also enters into the evaluation, since documents are judged relative to the user's objective rather than to his query. Since the index language, the indexing operation, and the query language are not of interest here, the relevance of a document to a query is judged by reference to the index of the document. Indeed, since the specific purpose at hand is the evaluation of FERRET retrieval effectiveness relative to that of a serial search, the serial search response is used as the standard for the FERRET evaluation. Recall and precision as defined above do not take into account the relevance ordering of the documents of the response. Moreover, the recall and precision of a response to a query depend on the size of the response: including more documents in the response generally increases recall and decreases precision. Suppose that a search procedure identifies a tentative response of n documents, and ranks these according to computed relevance to the query. Then for each m = 1, 2, ..., n, the response consisting of the m most relevant documents has a recall R(m) and a precision P(m) depending on m.
Salton and Lesk [36] and Salton [3] present normalization methods for combining such response-size-dependent recalls and precisions into a single recall and precision, given the expert-provided relevance ranking of the documents relative to the query. The normalized recall Rnorm and the normalized precision Pnorm are defined as follows:

    Rnorm = 1 − [ Σ(i=1..n) r(di) − Σ(i=1..n) i ] / [ n (N − n) ]                      (7.1)

    Pnorm = 1 − [ Σ(i=1..n) log r(di) − Σ(i=1..n) log i ] / log ( N! / ((N − n)! n!) )   (7.2)

N represents the total number of documents, and n the number of documents judged (externally) relevant to the query. The external relevance judgment ranks the n relevant documents: d1, d2, ..., dn. That is, d1 is the most relevant, and dn is the least relevant, of the n relevant documents. The system ranking of the n relevant documents is r(d1), r(d2), ..., r(dn). Ideally, of course, r(d1) = 1, r(d2) = 2, ..., r(dn) = n, in which case Rnorm = Pnorm = 1. Typically, however, some irrelevant documents are ranked higher than some relevant documents. In this case, both the normalized recall and the normalized precision are less than unity. The worst case is the ranking of all the relevant documents below all the
irrelevant documents, in which case both the normalized recall and the normalized precision are zero. The average recall and average precision are defined as follows:

    Ravg = (1/N) Σ(i=1..N) Ri                                                          (7.3)

    Pavg = (1/N) Σ(i=1..N) Pi                                                          (7.4)

As before, N is the document set size; Ri and Pi are the conventional recall and precision corresponding to the response consisting of the i documents ranked 1, 2, 3, ..., i by the system. It has been shown [36, 3] that the average recall and precision are approximated by the normalized recall and precision, under certain conditions on the number n of relevant documents and the total number N of documents. Section 7.4 utilizes suitable adaptations of the above normalized and average recalls and precisions for the preliminary evaluation of the search effectiveness of the FERRET system.

7.2. The Experimental Document Set

The experimental document set consists of 189 articles published in legal journals, specifically the 84 articles of volumes 19 and 20 (1966-1968) of the University of Florida Law Review and the 105 articles of volumes 21 and 22 (1966-1968) of the University of Miami Law Quarterly.
These documents, among many others, have been logically indexed by members of the staff of the University of Florida Law Library. The index of a document consists of the list of descriptors applied to the document by the human indexer. The set of those descriptors applicable to at least one of the 189 documents consists of 375 descriptors or terms. The Document Reference file consists of one entry for each document. A document entry consists of the bibliographic data (title, author, etc.) and the list of applicable descriptors. Given the Document Reference file, the machine preparation of the Term Dictionary file and the Document Index file is straightforward.

7.3. The Classification Derivation

The classification derivation algorithm, Algorithm 5.1, has been implemented in a PL/I program. The input to the program consists of several parameters, followed by either the Document Index file or the (document-document) Similarity file. The input parameters are the number of documents and the number of terms; the cost parameter C reflecting the relative cost of search time and storage capacity for the cover evaluation procedure; the effort-limiting parameters Nt and Nc specifying the maximum number of similarity thresholds which the program may apply to any subset of documents and the minimum number of distinct covers desired
from among which to select the best; and a parameter indicating which of the two files (Document Index or Similarity) follows. In case the Document Index file is input, the program constructs from it the Similarity file, from which the Tanimoto similarity S(di, dj) of each pair of documents di, dj may be obtained. The Similarity file is output in a punch card deck for future runs, since the classification derivation may be repeated for several different cost parameters. Because the similarity matrix is rather sparse, i.e., has a substantial majority of zero elements, storage space is conserved by organizing the Similarity file as follows: there is associated with each document di a (possibly null) list of pairs of values, (j, S(di, dj)), for each document dj such that i < j and S(di, dj) ≠ 0. The Similarity file is used to choose similarity thresholds (Algorithm 5.1, Steps 2 and 7), to generate the corresponding threshold graph (Step 4), and to compute cluster homogeneities (Step 6) for cover evaluation. The program implementation of Step 2 (the determination of the first similarity threshold) disregards any document having no nonzero similarity with any other document. The bounds procedure implementation also takes into account the number of isolated documents. The cost function h(x; n, C) is minimized with respect to the natural numbers, x = 1, 2, ..., with the value of n being the
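For logically indexed documents, the standard Tanimoto coefficient is the ratio of the number of shared descriptors to the number of distinct descriptors applied to either document. A sketch of its computation, and of the sparse upper-triangular Similarity file layout just described, follows (the function names are assumptions):

```python
def tanimoto(a, b):
    """Standard Tanimoto similarity of two logically indexed documents,
    each given as a set of descriptor codes."""
    common = len(a & b)
    return common / (len(a) + len(b) - common) if (a or b) else 0.0

def similarity_file(indexes):
    """Sparse Similarity file: for each document i, a (possibly empty)
    list of pairs (j, S(i, j)) with i < j and S(i, j) != 0."""
    rows = {}
    for i in range(len(indexes)):
        row = []
        for j in range(i + 1, len(indexes)):
            s = tanimoto(indexes[i], indexes[j])
            if s != 0:
                row.append((j, s))
        rows[i] = row
    return rows
```

Only the nonzero upper-triangular entries are stored, mirroring the punch-card organization that conserves space when the similarity matrix is mostly zero.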
number of nonisolated documents, rather than the total number of documents. The corresponding cover size bounds are increased by the number of isolates, since each such document necessarily constitutes a cluster of one document. In the experimental document set, for example, there are three documents each of which has zero similarity with every other document. Using the value C = 0.2 and the value n = 189 − 3 = 186, the bounds procedure calculates (see Chapter 5)

    h(1; 186, 0.2) = 186 ,  h(2; 186, 0.2) = 101 ,  h(3; 186, 0.2) = 149 .

Thus, the optimum number of levels of classification is x0 = 2. The optimum cover size bounds (for 186 documents) are computed to be m1 = ⌊186^(1/(1+2))⌋ = 5 and m2 = ⌈186^(1/2)⌉ = 14. The optimum bounds on the sizes of the clusters of the 186 documents are found to be s1 = ⌊186 / 14⌋ = 13 and s2 = ⌈186 / 5⌉ = 38. Finally, the cover size bounds are adjusted to account for the three singletons: m1 = 3 + 5 = 8 and m2 = 3 + 14 = 17. Subject to the preceding minor qualifications, the program is exactly as prescribed by Algorithm 5.1, together with its supplementary algorithms: Algorithms 5.2, 4.2, and 4.3. The program was executed twice on the experimental document set, once with the cost parameter C = 0.4, the second time with C = 0.2. In both cases, the effort-limiting parameters were assigned the values Nt = 2 and Nc = 2, i.e., no more than two similarity thresholds were allowed for the generation of covers of any class, and two
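The bounds computation of the worked example can be reproduced as follows. This is a sketch: the exponents and the floor/ceiling placement are inferred from the numbers above, and the function name is an assumption.

```python
import math

def cover_bounds(n, x0, isolates=0):
    """Optimum cover and cluster size bounds for n nonisolated documents
    and x0 levels of classification; the cover size bounds are widened
    by the number of isolated documents, each a singleton cluster."""
    m1 = math.floor(n ** (1 / (1 + x0)))   # lower bound on cover size
    m2 = math.ceil(n ** (1 / x0))          # upper bound on cover size
    s1 = math.floor(n / m2)                # lower bound on cluster size
    s2 = math.ceil(n / m1)                 # upper bound on cluster size
    return m1 + isolates, m2 + isolates, s1, s2

print(cover_bounds(186, 2, isolates=3))   # (8, 17, 13, 38)
```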
distinct covers was deemed a sufficient number for making a selection. The (C=0.4)-classification (produced by the program in 10.8 minutes) consists of a tree of 48 subsets of the document set, including the tree root X(1), the whole document set. The maximum distance of any class from the tree root is 3, i.e., the maximum class level is 4. The endpoints range in size from 1 to 71 documents. The (C=0.4)-classification is summarized in Figure 7.1, in which each cell contains a class name (or its abbreviation) and the number of documents of the class. For example, X(1, 14, 7, 25) is a class of 20 documents; and it is one of the 25 subclasses of the 148-document class X(1, 14, 7). The (C=0.2)-classification (produced by the program in 11.8 minutes) consists of 123 subsets of the document set. The maximum class level is 9, and the endpoints range in size from 1 to only 11 documents. In fact, the (C=0.2)-classification is an extension of the (C=0.4)-classification. That is, the latter is a subtree of the former. The nodes of the tree of Figure 7.1 which are further subclassified in the (C=0.2)-classification are X(1, 10), X(1, 14, 7, 16), X(1, 14, 7, 17), X(1, 14, 7, 22), and X(1, 14, 7, 25), consisting of 13, 71, 12, 29, and 20 documents, respectively. These subclassifications are summarized in Figure 7.2. Thus, Figure 7.1 summarizes the (C=0.4)-classification; and Figure 7.1 together with Figure 7.2 summarizes the (C=0.2)-classification.
X(1), size 189

Level 2: X(1, 1) through X(1, 14), with sizes
    2, 1, 4, 2, 9, 3, 4, 2, 2, 13, 1, 2, 1, 161

Level 3 (subclasses of X(1, 14)): X(1, 14, 1) through X(1, 14, 8), with sizes
    2, 2, 2, 4, 4, 7, 148, 4

Level 4 (subclasses of X(1, 14, 7)): X(1, 14, 7, 1) through X(1, 14, 7, 25), with sizes
    4, 3, 2, 3, 4, 3, 4, 3, 4, 3, 3, 2, 4, 3, 8, 71, 12, 2, 7, 5, 7, 29, 6, 2, 20

Figure 7.1. Summary of the (C=0.4)-Classification
a. The subclassification of X(1, 10), size 13
b. The subclassification of X(1, 14, 7, 17), size 12
c. The subclassification of X(1, 14, 7, 22), size 29
d. The subclassification of X(1, 14, 7, 25), size 20, into the seven subclasses X(1, 14, 7, 25, 1) through X(1, 14, 7, 25, 7), of sizes 3, 5, 4, 3, 4, 6, and 11

Figure 7.2. Completion of the Summary of the (C=0.2)-Classification
e. The subclassification of X(1, 14, 7, 16), size 71, into the fifteen subclasses X(1, 14, 7, 16, 1) through X(1, 14, 7, 16, 15), of sizes 3, 17, 3, 2, 2, 7, 3, 3, 2, 2, 7, 2, 9, 2, 39, the larger of which are further subclassified
7.4. Basic Searches

The basic search mode of the FERRET system is to be judged relative to the standard provided by the serial search. The adaptations of the normalized and average recall and precision mentioned in the introduction of this chapter are based on the form of the available data. Consider the query consisting of the terms whose codes are 31, 53, 86, 203, 272, 286, and 364, each with a weight of unity (this coincides with the index of document number 162). The serial search produces 21 documents having ten distinct nonzero correlations (scores) with the query, ranked according to their scores. The basic search, using the (C=0.4)-classification search tree, first computes the correlation of the query with each of the level-1 alternatives: the representations of the major classes X(1, 1), X(1, 2), ..., X(1, 14). The first, seventh, tenth, and fourteenth of these scores are 0.1195, 0.0488, 0.3094, and 0.1275, respectively; all the others are zero. Therefore, the decision node becomes the node corresponding to the class X(1, 10), the alternative of highest score. Since this node is an endpoint of the (C=0.4)-classification, the basic search procedure then computes the scores of the thirteen member documents and outputs the rank-ordered documents of X(1, 10) having nonzero correlations with the query; there are ten such documents, having five distinct nonzero scores.
The basic search, using the (C=0.2)-classification search tree, also begins with the selection of X(1, 10) at the first decision level. However, since X(1, 10) is not an endpoint of the (C=0.2)-classification, the next step is the computation of the correlations of the successors of the decision node: the representations of the classes X(1, 10, 1), X(1, 10, 2), X(1, 10, 3), and X(1, 10, 4). The respective scores are 0.6546, 0.1683, 0.1779, and 0.1932. Therefore, the next decision node corresponds to X(1, 10, 1), an endpoint of the (C=0.2)-classification consisting of the three documents 26, 28, and 162. Each of these three documents has nonzero correlation with the query. The output in this case is the rank-ordered set of the three document references. The rank-ordered responses of each of these three searches are tabulated in Table 7.1. Their respective search times were: 0.684 seconds for the serial search; 0.205 seconds for the basic search with the (C=0.4)-classification; and 0.182 seconds for the basic search with the (C=0.2)-classification. Consider first the normalized precision and recall, given in Equations 7.1 and 7.2. The search procedure is presumed to rank each of the N documents, and the summation over i = 1 to n is over the set of relevant documents. Imperfect performance ranks some irrelevant documents before some relevant ones. Consequently,

    Σ(i=1..n) r(di) − Σ(i=1..n) i ≥ 0
Table 7.1. Serial and Basic Search Responses

    Document   Query         Serial   Basic rank   Basic rank
    number     correlation   rank     (C = 0.4)    (C = 0.2)

    162        1.000          1        1            1
    58         0.3419         2
    102        0.3381         3
    36         0.2520         4
    32         0.2279         5
    26         0.1890         6        2            2
    28         0.1890         7        3            3
    146        0.1890         8
    27         0.1690         9        4
    31         0.1690        10        5
    59         0.1690        11
    157        0.1690        12
    14         0.1543        13
    20         0.1543        14        6
    25         0.1543        15        7
    29         0.1543        16        8
    30         0.1543        17        9
    41         0.1336        18       10
    47         0.1260        19
    55         0.1260        20
    69         0.1260        21
with equality if and only if each relevant document is ranked prior to each irrelevant document. Since the FERRET basic search response ranks only some fraction of the document set, the summation must be restricted to that document subset. Moreover, the above inequality is reversed. For example, the basic response with the (C=0.2)-classification consists of three documents, ranked 1, 2, and 3 by the system. The ranking of these three by the serial search, however, is 1, 6, and 7. Therefore, (1 + 2 + 3) − (1 + 6 + 7) = −8 < 0. Thus, if the summation is taken over the retrieved set and the ranking r(di) is the ranking r'(di) by the serial search, one has

    R'norm = 1 − [ Σ(i=1..n) r'(di) − Σ(i=1..n) i ] / [ n (N − n) ]                     (7.5)

Since, moreover, the worst performance possible to the basic search is to retrieve the n documents of least nonzero correlation with the query, the normalizing constant N must be interpreted as the number of documents of the serial search response, i.e., the number of documents having nonzero correlation with the query. In the present example,

    R'norm = 1 − [ (1 + 6 + 7) − (1 + 2 + 3) ] / [ (3)(21 − 3) ] = 0.852 .

The normalized precision is similarly adapted:
    P'norm = 1 − [ Σ(i=1..n) log r'(di) − Σ(i=1..n) log i ] / log ( N! / ((N − n)! n!) )   (7.6)

In the present example,

    P'norm = 1 − log [ (1)(6)(7) / ((1)(2)(3)) ] / log [ (21)(20)(19) / ((1)(2)(3)) ] = 0.729 .

The adaptation of the average recall and precision (Equations 7.3 and 7.4) proceeds as follows. Based on the rank-ordered response of the serial search, the set of relevant documents may be considered to consist of all those documents having a correlation not less than some relevance threshold. In Table 7.1, for example, one sees that the maximum document-query score is 1.000. Taking this value for the relevance threshold, there is just one relevant document: document 162. The response of the basic search (C = 0.2) consists of the set {26, 28, 162}. Computing the standard recall and precision for this first threshold, R1 = 1 and P1 = 1/3. The second possible threshold is 0.3419, and the relevant set is {58, 162}; thus, R2 = 1/2 and P2 = 1/3. The third relevance threshold is 0.3381. The relevant documents are 162, 58, and 102; so R3 = 1/3 and P3 = 1/3. The fourth relevance threshold is 0.2520, with R4 = 1/4 and P4 = 1/3. The fifth is 0.2279, with R5 = 1/5 and P5 = 1/3. Finally, the sixth threshold is 0.1890, the minimum correlation of any retrieved document.
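Equations 7.5 and 7.6 can be checked against the worked example. The function names below are assumptions of this sketch; `math.lgamma` supplies the log-factorials of the denominator of Equation 7.6.

```python
import math

def adapted_norm_recall(serial_ranks, n_serial):
    """R'norm (Eq. 7.5): serial_ranks are the serial-search ranks of the
    n retrieved documents; n_serial is the size of the serial response
    (the documents with nonzero query correlation)."""
    n = len(serial_ranks)
    ideal = sum(range(1, n + 1))
    return 1 - (sum(serial_ranks) - ideal) / (n * (n_serial - n))

def adapted_norm_precision(serial_ranks, n_serial):
    """P'norm (Eq. 7.6), the logarithmic analogue of R'norm."""
    n = len(serial_ranks)
    num = (sum(math.log(r) for r in serial_ranks)
           - sum(math.log(i) for i in range(1, n + 1)))
    den = (math.lgamma(n_serial + 1) - math.lgamma(n_serial - n + 1)
           - math.lgamma(n + 1))          # log(N! / ((N - n)! n!))
    return 1 - num / den

# Worked example: retrieved set ranked 1, 6, 7 by the serial search of 21 documents.
print(round(adapted_norm_recall([1, 6, 7], 21), 3))     # 0.852
print(round(adapted_norm_precision([1, 6, 7], 21), 3))  # 0.729
```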
The relevant set is {162, 58, 102, 36, 32, 26, 28, 146}; R6 = 3/8 and P6 = 1. The adaptations of the average recall and precision are just the averages of these sequences of standard recalls and precisions. In the example, R'avg = (1 + 1/2 + 1/3 + 1/4 + 1/5 + 3/8) / 6 = 0.443 and P'avg = 0.444. In general, let m be the number of distinct document scores not greater than the greatest score, and not less than the least score, of the retrieved set; and let r1 > r2 > ... > rm be those m scores. For i = 1, 2, ..., m, let Ai be the set of all those documents scoring not less than ri, and Bi the set of all those retrieved documents scoring not less than ri; thus Bm is the retrieved set itself. The adaptations of average recall and precision are then defined:

    R'avg = (1/m) Σ(i=1..m) |Bi| / |Ai|                                                 (7.7)

    P'avg = (1/m) Σ(i=1..m) |Bi| / |Bm|                                                 (7.8)

The application of the above definitions to the basic search responses of Table 7.1 is summarized in Table 7.2. A sequence of 189 queries was processed by the serial search procedure, and by the FERRET basic search procedure, using both the (C=0.2)-classification and the less extensive (C=0.4)-classification search tree. The basic searches always proceeded to endpoints of the classification (the
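Equations 7.7 and 7.8, applied to the Table 7.1 data, reproduce the figures computed above. The function name and the dictionary layout of the serial scores are assumptions of this sketch.

```python
def adapted_averages(scores, retrieved):
    """R'avg and P'avg (Eqs. 7.7, 7.8). `scores` maps document number to
    serial-search correlation; `retrieved` is the basic search response.
    Thresholds are the distinct scores lying between the least and the
    greatest scores of the retrieved set."""
    top = max(scores[d] for d in retrieved)
    bottom = min(scores[d] for d in retrieved)
    thresholds = sorted({s for s in scores.values() if bottom <= s <= top},
                        reverse=True)
    r_sum = p_sum = 0.0
    for r in thresholds:
        a = [d for d in scores if scores[d] >= r]        # A_i: relevant set
        b = [d for d in retrieved if scores[d] >= r]     # B_i
        r_sum += len(b) / len(a)
        p_sum += len(b) / len(retrieved)                 # B_m = retrieved set
    m = len(thresholds)
    return r_sum / m, p_sum / m

# Serial correlations of Table 7.1 and the (C=0.2) basic response {26, 28, 162}.
scores = {162: 1.0, 58: 0.3419, 102: 0.3381, 36: 0.2520, 32: 0.2279,
          26: 0.1890, 28: 0.1890, 146: 0.1890, 27: 0.1690, 31: 0.1690,
          59: 0.1690, 157: 0.1690, 14: 0.1543, 20: 0.1543, 25: 0.1543,
          29: 0.1543, 30: 0.1543, 41: 0.1336, 47: 0.1260, 55: 0.1260,
          69: 0.1260}
r, p = adapted_averages(scores, [26, 28, 162])
print(round(r, 3), round(p, 3))   # 0.443 0.444
```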
maximum number of documents desired, Algorithm 6.1, Step 3, was always given the value Nmax = 1).

Table 7.2. Performance Figures for the Sample Query

    Response                              R'norm   P'norm   R'avg   P'avg

    X(1, 10), (C=0.4)-classification       0.473    0.454    0.462   0.356
    X(1, 10, 1), (C=0.2)-classification    0.852    0.729    0.443   0.444

The queries of the sequence were the indexes of the members of the experimental document set. That is, corresponding to each document, a query was prepared consisting of just those terms applied to the document, each with unit weight. The search times were noted. The average search times are given in Table 7.3.

Table 7.3. Serial and Basic Search Time Averages

    Serial search        Basic search (C = 0.4)     Basic search (C = 0.2)
    time average (sec)   time average   ratio       time average   ratio
    t0                   t1             t1/t0       t2             t2/t0

    0.629                0.291          0.459       0.265          0.431
The adaptations of the average and normalized recalls and precisions of the basic searches were computed for each query. The averages over all 189 queries are given in Table 7.4.

Table 7.4. Recall-Precision Summary

              (C=0.4)-classification   (C=0.2)-classification

    R'norm         0.812                    0.814
    P'norm         0.950                    0.737
    R'avg          0.616                    0.571
    P'avg          0.545                    0.560

Based on these measures of performance and the timing figures, the FERRET basic search performs about 70% as well as the serial search, but requires only about 45% of the time.

7.5. Conclusions

This work has been principally concerned with the automatic derivation of a multilevel nonhierarchical classification of a set of documents for utilization by a mechanized reference retrieval system. Given a set of documents in the form of their logical or numerical subject indexes, a quantitative measure of document-document similarities is obtained from the application of a suitable
definition of similarity to the pairs of document indexes. To convert this similarity information into a form amenable to a graph-theoretical treatment, a similarity threshold is chosen: a pair of documents is adjacent in the induced graph in case the similarity of the pair is not less than the threshold. The subclasses of the document set are identified by an analysis of the graph for clusters. This procedure is repeated on the subclasses, and so on, until unsubclassified classes are suitably small. The concept of clusters in graphs has been thoroughly analyzed (Chapter 3), culminating in a definition of clusterings of a graph, i.e., efficient covers of the point set of the graph consisting of clusters. The collection of all clusterings under this definition was shown to be linearly ordered by the refinement relation. In particular, the collection of the components of a graph and the collection of the cliques of a graph are each clusterings, the former refined by, the latter refining, every clustering. Thus, the collection of clusterings forms a chain from the clique set to the component set of a graph. It was argued, moreover, that any other efficient cover contained an element of arbitrariness in its formation, i.e., requires a violation of one of the criteria underlying the definition: a clustering must be defined in terms of the adjacencies of the points and the identities of the cliques (which is implicit in the adjacency information), using all and only this information; each cover refines, and is refined by,
respectively, the component set and the clique set; and every class of a cover is formed by the same rule. In the interest of computational efficiency, a more restrictive definition was given, the clusterings of which form a subchain of the chain of all clusterings. This particular subchain was shown to form a reasonably uniform sequence of transitions from the clique set to the component set, and thus to constitute a good approximation of the chain of all clusterings. Because cliques of a graph are fundamental to the definition and the production of the clusterings of the graph, the problem of the identification of cliques in a graph received thorough treatment (Chapter 4). Two new clique detection algorithms were presented. One of these, the Line Removal Clique Detection algorithm, is intended for application in special circumstances, in which it exhibits an efficiency advantage over other known clique detection algorithms: in addition to the adjacency matrix of the graph whose cliques are to be found, there is given the set of cliques of a specified subgraph on the same point set. The other algorithm, the Neighborhood Clique Detection algorithm, is applicable to the general clique detection problem, i.e., it identifies the cliques of a graph given only the point set and the line set of the graph. The timing experiments reported in Chapter 4 indicate that this algorithm is substantially faster than those previously available.
The cover generation technique, the graph-theoretical clustering method, generally provides several distinct covers of a given document set or subclass. The task of selecting the best generated clustering has therefore been treated as a separate problem. One selection criterion investigated was the extent to which a cover is typical of the set of generated covers, as quantitatively indicated by the root-mean-square distance between a cover and all covers of the collection. Two metrics or distance functions were considered: Zahn's metric [32], and a metric devised by the author. Zahn's metric, although not computationally costly, was shown to be inadequate for the present application because of its overly restricted domain. Specifically, the function in question is a metric only on the collection of B-classifications. The new metric, on the other hand, has a sufficient domain, which includes all efficient covers. However, it becomes computationally prohibitive as the number of members of a cover increases beyond five or six sets. It is conjectured that this computational limitation is a consequence of the uselessly large domain on which the function is a metric (the class of all collections of nonempty subsets of the given finite set). In particular, there possibly exists a metric on the class of all efficient covers of a finite set which is either undefined or not a metric beyond the class of efficient covers, e.g., on the class of all covers, but which is not computationally
prohibitive on the class of efficient covers. Until future research provides such a metric, however, the selection criterion of cover typicality must be regarded as impractical.

The other cover selection method considered was a real-valued cover evaluation function, depending on the intra-cluster similarities of documents and on external cost considerations. A function was developed to represent the projected cost per search, depending on the number of levels of an idealized multilevel classification of the given set and on the variation of the cost rate of the implementation machinery with the quantity of storage. The minimization of the cost function with respect to the depth of the classification tree provides the economically optimum numbers of clusters of a cover and of members of clusters. The cost function effects the cover evaluation through these quantities: the optimum cover size and the optimum cluster size. The cluster homogeneities are weighted with a quantity that decreases from unity toward zero as the cluster size departs from the optimum. This weighted sum of cluster homogeneities is itself weighted, with a quantity which similarly decreases from unity toward zero as the cover size departs from the optimum, to complete the cover evaluation.

This latter technique for cover selection, which presents no computational difficulties, was combined with the graph-theoretical cover generation scheme into an algorithm (and a computer program) for the multilevel
classification derivation. The major advantage of the present cost function is that it is responsive to the relative cost of storage capacity and search time of a specific implementation configuration. This makes the extensiveness of the classification subject to the control of the available implementation machinery, as illustrated above in the reported experiments, in which a document set was subjected to the classification derivation process twice, with two different values specified for the cost parameter.

Basic searches performed by means of the sequential search tree representing the derived classification provided data for a preliminary evaluation of the method, relative to the serial search. These data indicate that the method of classification derivation, representation, and utilization is fundamentally sound: the basic search procedure is substantially more efficient than the serial search, with a relatively small degradation of retrieval effectiveness; and the feedback search procedure provides a promising form of user feedback, i.e., of user-system interactive searching. An experimental investigation of the improvement of retrieval effectiveness by the use of feedback awaits the future, as, indeed, does a thorough evaluation of the complete system.
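The size-weighting scheme used in the cover evaluation can be sketched as follows. The specific weight function here, the ratio of the smaller to the larger of the actual and optimum sizes, is an illustrative assumption; the dissertation develops its own function with the same qualitative behavior, decreasing from unity toward zero as the size departs from the optimum.

```python
def size_weight(size, optimum):
    """Decreases from 1 toward 0 as `size` departs from `optimum` in
    either direction.  This particular form is an illustrative choice,
    not the dissertation's exact weight function."""
    return min(size, optimum) / max(size, optimum)

def cover_score(clusters, cluster_opt, cover_opt):
    """Evaluate a cover: a weighted sum of cluster homogeneities,
    itself weighted by how close the cover size is to its optimum.
    `clusters` is a list of (cluster size, homogeneity) pairs."""
    inner = sum(size_weight(s, cluster_opt) * h for s, h in clusters)
    return size_weight(len(clusters), cover_opt) * inner
```

A cover whose clusters all have the optimum size, in the optimum number, is scored by the unweighted sum of its homogeneities; departures in either quantity shrink the score toward zero.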
APPENDIX A

PROOFS OF THE POINT REMOVAL CLIQUE DETECTION THEOREMS

This appendix provides the proof of Theorem 4.1, and the proofs of the required Lemmas 4.1, 4.2, and 4.3. These propositions are repeated here; for the definitions of the terms and symbols see the section entitled "Point Removal Definitions" of Chapter 4.

LEMMA 4.1. Suppose M ∈ L. If v ∉ M then M ∈ L'. If v ∈ M then M ∉ L'; there exists an M' ∈ L' such that M = {v} ∪ (R ∩ M').

Proof.- Assume v ∉ M ∈ L; then M ⊆ V', so M is a complete subgraph of G'. Suppose M ∉ L'; then there is a clique M' of G' such that M ⊆ M' and M' - M ≠ ∅. Since M' ⊆ V' ⊆ V and E' ⊆ E, M' is a complete subgraph of G properly containing M ∈ L, a contradiction. Hence, M ∈ L'.

Assume now that v ∈ M ∈ L. Since v ∈ M, M ⊄ V'; consequently, M ∉ L'. If x ∈ M then either v = x or vx ∈ E; therefore, M ⊆ R. Let M″ = M - {v} ⊆ V' ∩ R. If x and y are two distinct members of M″, they are adjacent in G and are in V', and so xy ∈ E'. Hence, M″ is a complete subgraph of G', so there is an M' ∈ L' such that M″ ⊆ M'. Now M″ = M - {v} and v ∈ M, so M″ ∪ {v} ⊆ M' ∪ {v}. But M ⊆ R; thus, M = M ∩ R = (M″ ∪ {v}) ∩ R ⊆ (M' ∪ {v}) ∩ R = {v} ∪ (M' ∩ R), with M' ∈ L'. If x and y are distinct points in M' ∩ R, then, since M' ∈ L', xy ∈ E' ⊆ E; also, since x ∈ R and x ≠ v, vx ∈ E. Thus, {v} ∪ (M' ∩ R) is a complete subgraph of G containing M ∈ L. Therefore, because of the maximality of M, M = {v} ∪ (M' ∩ R). //

LEMMA 4.2. B ∪ C ⊆ L, B ∩ C = ∅, and |C| = |A|.

Proof.- Since each member of C contains v and no member of B contains v, B and C are disjoint. Moreover, since no member of A contains v, and no member of A properly contains another, if M and M' are distinct members of A then M ∪ {v} ≠ M' ∪ {v}; hence, |C| = |A|.

Assume M ∈ B; then M ∈ L' and M ⊄ R. Thus, there is a u ∈ M such that uv ∉ E; since M ⊆ V', v ∉ M, so u ≠ v. M is a complete subgraph of G', which is a subgraph of G, so M is a complete subgraph of G. Suppose M ∉ L; then there is a point x ∈ V - M such that each point of M is adjacent to x in G. Since M ⊄ R, x ≠ v, so x ∈ V'. Thus, M ∪ {x} is a complete subgraph of G' properly containing M ∈ L', a contradiction. Hence, M ∈ L, and so B ⊆ L.

Assume M ∈ C; then M = {v} ∪ N, where N ∈ L' and N ⊆ R. Since also v ∈ R, M ⊆ R, so any point x ∈ M different from v is adjacent to v. But any distinct x and y in M, both different from v, are in N, and so are adjacent in G' and hence in G. Hence, M is a complete subgraph of G. Suppose M ∉ L; then there is a point x ∈ V - M which is adjacent in G to each point of M; since v ∈ M, x ∈ V', so that N ∪ {x} is a complete subgraph of G' properly containing N ∈ L', a contradiction. Therefore, M ∈ L, and so C ⊆ L. //

LEMMA 4.3. L - (B ∪ C) ⊆ D.

Proof.- Let M ∈ L - (B ∪ C). Suppose v ∉ M. Then by Lemma 4.1, M ∈ L'. Since M ∉ B = L' - A and M ∈ L', M ∈ A. Consequently, {v} ∪ M ∈ C; but C ⊆ L by Lemma 4.2, so that M ∪ {v} ∈ L. Thus, M ∪ {v} and M are both in L and the former properly contains the latter, a contradiction. Hence, v ∈ M.

Since v ∈ M ∈ L, by Lemma 4.1 there is an M' ∈ L' such that M = {v} ∪ (R ∩ M'). Moreover, if M' ⊆ R then M = {v} ∪ M' ∈ C, contrary to hypothesis. Hence, M' ⊄ R, M' ∈ L', and M = {v} ∪ (R ∩ M') ∈ D. Therefore, L - (B ∪ C) ⊆ D. //

THEOREM 4.1. L = B ∪ C ∪ H, a disjoint union.

Proof.- By Lemma 4.2, B ∩ C = ∅. Each member of D contains v, while no member of B contains v; hence, D ∩ B = ∅. Suppose M ∈ D ∩ C. Then M = {v} ∪ M″ with M″ ∈ L' and M″ ⊆ R, and M = (M' ∩ R) ∪ {v} with M' ∈ L' and M' ⊄ R. Since v ∉ M' and v ∉ M″, M″ = M' ∩ R; hence, M″ ⊆ M', with both M″, M' ∈ L', so that M″ = M', contradicting M″ ⊆ R and M' ⊄ R. Therefore, D ∩ C = ∅, and B ∪ C ∪ H is a disjoint union, since H ⊆ F ⊆ D.

D = H ∪ (D - H), since H ⊆ D, and by Lemma 4.3, L ⊆ B ∪ C ∪ D. To prove that L ⊆ B ∪ C ∪ H, it is sufficient to prove that no member of D - H is in L. Let M ∈ D - H = (D - F) ∪ (F - H), since H ⊆ F ⊆ D. If M ∈ D - F, then M is properly contained in a member of D; since each member of D is clearly a complete subgraph of G, M ∉ L. Suppose M ∈ F - H; since F and C are disjoint, M ∉ C. But since M ∈ F - H, there is an N ∈ C such that M ⊆ N; since by Lemma 4.2 N ∈ L, M ⊆ N ∈ C, and M ∉ C, M is properly contained in a member of L, and so M ∉ L. Therefore, (D - H) ∩ L = ∅, and L ⊆ B ∪ C ∪ H.

Because of Lemma 4.2, to prove B ∪ C ∪ H ⊆ L requires only the proof that H ⊆ L. Let M ∈ H. Then M is a subset of no member of C, M is a proper subset of no member of D, and M = {v} ∪ (N ∩ R) with N ∈ L' and N ⊄ R. Suppose M ∉ L; since M = {v} ∪ (N ∩ R) is a complete subgraph of G, there is a P ∈ L properly containing M. Since v ∈ M ⊆ P ∈ L, by Lemma 4.1 there is an N' ∈ L' such that P = {v} ∪ (R ∩ N'). Since M ∈ H, M is a proper subset of no member of D, and M is properly contained in P, so P ∉ D. Therefore, N' ∉ B, for otherwise P = {v} ∪ (R ∩ N') ∈ D; hence, N' ∈ A = L' - B, so that N' ⊆ R. Thus, {v} ∪ N' = {v} ∪ (N' ∩ R) = P ∈ C, since N' ∈ A. But M ⊆ P ∈ C contradicts M ∈ H. Hence, M ∈ L; i.e., H ⊆ L, so that B ∪ C ∪ H ⊆ L. //
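The decomposition established by Theorem 4.1 can be read constructively as an incremental clique finder: starting from a one-point graph and adding one point at a time, the clique set L of G is rebuilt from the clique set L' of G - v as the disjoint union B ∪ C ∪ H. The sketch below exercises that decomposition directly; it is an illustration of the theorem, not the dissertation's optimized program.

```python
def add_point(L_prime, v, neighbors):
    """One step of the point-removal decomposition (Theorem 4.1):
    given the clique set L' of G - v, return the clique set of G,
    where R is the closed neighborhood {v} plus the neighbors of v."""
    R = frozenset(neighbors) | {v}
    A = [M for M in L_prime if M <= R]         # cliques of G - v inside R
    B = [M for M in L_prime if not M <= R]
    C = [M | {v} for M in A]
    D = [(M & R) | {v} for M in B]
    H = []
    for M in D:
        if any(M < Mp for Mp in D):            # keep only maximal members of D
            continue
        if any(M <= N for N in C):             # drop members absorbed by C
            continue
        if M not in H:                         # drop duplicates
            H.append(M)
    return B + C + H

def cliques(vertices, edges):
    """Cliques of a graph, built by adding one point at a time."""
    adj = {u: set() for u in vertices}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    L, seen = [], set()
    for v in vertices:
        L = add_point(L, v, adj[v] & seen) if L else [frozenset({v})]
        seen.add(v)
    return L
```

On a 4-cycle this yields the four line cliques; on a triangle, the single three-point clique.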
APPENDIX B

PROOFS OF THE NEIGHBORHOOD CLIQUE DETECTION THEOREMS

This appendix gives the proofs of the propositions leading to Algorithm 4.2. The statements of the propositions are repeated here; for the definitions of the terms of the propositions, see Chapter 4.

LEMMA 4.4. If v ∈ M ∈ L then M ⊆ N(v) and M ∈ N.

Proof.- Let x ∈ M. If x = v then v ∈ N(v), by definition of N(v). Suppose x ≠ v. Since x, v ∈ M ∈ L, vx ∈ E; hence, x ∈ N(v). Therefore, any point of M belongs to N(v).

Since M ⊆ N(v) and M ∈ L, M is a complete subgraph of H(v). Since M is properly contained in no complete subgraph of G, and H(v) is a subgraph of G, M is a maximal complete subgraph of H(v). //

LEMMA 4.5. If M ∈ N then v ∈ M.

Proof.- Suppose M is a complete subgraph of H(v) not containing v. Since M ⊆ N(v), v is adjacent to each member of M. Consequently, M ∪ {v} is a complete subgraph of H(v) properly containing M; that is, M ∉ N. //

LEMMA 4.6. N ⊆ L.

Proof.- Let M ∈ N. By Lemma 4.5, v ∈ M. Since M is a complete subgraph of G, there exists an M' ∈ L with M ⊆ M'. Then v ∈ M' ∈ L, so by Lemma 4.4, M' ∈ N. Thus, M ∈ N, M' ∈ N, and M ⊆ M'; since no member of N properly contains another, M = M'. But M' ∈ L; hence, M ∈ L. //

LEMMA 4.7. L ⊆ N ∪ R.

Proof.- Suppose M ∈ L - N. By Lemma 4.4, v ∉ M, since otherwise M ∈ N. Thus, each point in M is a point of G - v; consequently, M is a complete subgraph of G - v. Since G - v is a subgraph of G, and M is a maximal complete subgraph of G, M is properly contained in no complete subgraph of G - v; i.e., M ∈ R. //

THEOREM 4.2. L = N ∪ S, a disjoint union.

Proof.- Let M ∈ N and M' ∈ R. By Lemma 4.5, v ∈ M. Since v is not a point of G - v, v ∉ M'. Therefore, M ≠ M'; thus, R ∩ N = ∅. Since S ⊆ R, S ∩ N = ∅.

Let M ∈ S. Since S ⊆ R, M is a complete subgraph of G - v; thus, M is a complete subgraph of G. Suppose M ∉ L. Then there is an M' ∈ L which properly contains M. By Lemma 4.7, L ⊆ N ∪ R, so M' ∈ N or M' ∈ R. If M' ∈ R, then M and M' are cliques of G - v with M properly contained in M', a contradiction. Therefore, M' ∈ N. Thus, M ∈ R and M ⊆ M' ∈ N; hence, M ∉ S, a contradiction. Hence, M ∈ L; i.e., S ⊆ L. Since by Lemma 4.6, N ⊆ L, N ∪ S ⊆ L.

Let M ∈ R - S. Then M is a complete subgraph of G - v contained in some M' ∈ N. Since R ∩ N = ∅, M is properly contained in M' ∈ N ⊆ L. Hence, M ∉ L. Thus, L ∩ (R - S) = ∅. Since S ⊆ R, R = S ∪ (R - S). By Lemma 4.7, L - N ⊆ R = S ∪ (R - S); L - N ⊆ (L ∩ S) ∪ (L ∩ (R - S)) = L ∩ S. But it was shown above that S ⊆ L; thus, L ∩ S = S. Therefore, L - N ⊆ S; i.e., L ⊆ N ∪ S. Hence, L = N ∪ S. //

LEMMA 4.8. If N(u) ⊆ N(v), then each clique of H(u) is contained in a member of N.

Proof.- Let M be a clique of H(u); then M ⊆ N(u). By hypothesis, N(u) ⊆ N(v), so that M ⊆ N(v). Since M is a complete subgraph of G and M ⊆ N(v), M is a complete subgraph of H(v) = G[N(v)]. Therefore, there is an M' ∈ N such that M ⊆ M'. //

LEMMA 4.9. S = ∅ if and only if N(u) ⊆ N(v) for each u in V.

Proof.- Suppose first that N(u) ⊄ N(v), and let w ∈ N(u) - N(v). Since {w} is a complete subgraph of G, there is an M in L such that {w} ⊆ M. Since w ∉ N(v), w ≠ v and w is not adjacent to v; because w ∈ M and wv ∉ E, v ∉ M. Therefore, by Lemma 4.5, M ∉ N. But M ∈ L = N ∪ S; therefore, M ∈ S; i.e., S ≠ ∅.

Now suppose that N(u) ⊆ N(v) for each point u. Suppose M ∈ R, that is, that M is a clique of G - v. Since M ≠ ∅ and v ∉ M, there is a point u ≠ v in M. M is a complete subgraph of G, so let M' ∈ L with M ⊆ M'. By Lemma 4.4, since u ∈ M' ∈ L, M' is a clique of H(u). By assumption, N(u) ⊆ N(v), so by Lemma 4.8 there exists an M″ ∈ N such that M' ⊆ M″. Since M ⊆ M', M ⊆ M″ ∈ N, and therefore, M ∉ S. Therefore, S = ∅. //

LEMMA 4.10. L ⊆ N ∪ Q, a disjoint union.

Proof.- Let M ∈ L - N. Suppose u ∈ M ∩ S(v). Since u ∈ M ∈ L, M is a clique of H(u), by Lemma 4.4. Since u ∈ S(v), N(u) ⊆ N(v), so by Lemma 4.8 there is an M' ∈ N such that M ⊆ M'. But N ⊆ L by Lemma 4.6, so M' ∈ L. Thus, M ∈ L, M' ∈ L, and M ⊆ M'; hence, M = M', so M ∈ N, contrary to the assumption M ∈ L - N. Hence, M ∩ S(v) = ∅, and M ⊆ V - S(v). Since M ⊆ V - S(v) and M is a clique of G[V], M is a clique of G[V - S(v)]; i.e., M ∈ Q. That N and Q are disjoint follows from the facts that each member of N contains v and no member of Q contains v. //

LEMMA 4.11. If M ∈ Q, then M ∈ L if and only if M ⊄ N(v).

Proof.- Assume M ∈ Q. Since M ⊆ V - S(v), v ∉ M.

Suppose first that M ⊆ N(v). Since M ⊆ N(v), v is adjacent to each point of M. Since the points of M are pairwise adjacent, M ∪ {v} is a complete subgraph of G. Therefore, M is properly contained in a complete subgraph of G, and so M ∉ L.

Suppose now that M ∉ L, and let M' ∈ L be a clique of G properly containing M. Since M' ∈ L ⊆ N ∪ Q, M' ∈ N or M' ∈ Q. If M' ∈ Q, then M' and M are cliques of G[V - S(v)] with M properly contained in M', an impossibility. Therefore, M' ∈ N, so by Lemma 4.5, v ∈ M'. Since M ⊆ M' and v ∉ M, v is adjacent to each point of M; that is, M ⊆ N(v). //

THEOREM 4.3. If S(v) = V, then L = N. If S(v) ≠ V, then L = N ∪ P, a disjoint union.

Proof.- By Lemma 4.10, N ∩ Q = ∅, so, since P ⊆ Q, P ∩ N = ∅.

Assume S(v) = V. By Lemma 4.9, S = ∅, so by Theorem 4.2, L = N.

Assume now that M ∈ L - N. By Lemma 4.10, M ∈ Q. By Lemma 4.11, since M ∈ L ∩ Q, M ⊄ N(v). Since M ∈ Q and M ⊄ N(v), M ∈ P. Hence, L - N ⊆ P; i.e., L ⊆ N ∪ P. By Lemma 4.11, P ⊆ L, and by Lemma 4.6, N ⊆ L. Hence, L = N ∪ P. //
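Theorem 4.2 can be checked concretely by computing N (the cliques of H(v)) and S (the cliques of G - v contained in no member of N) by brute force and comparing their union with the clique set of G. The five-point graph below is a hypothetical example chosen for illustration.

```python
from itertools import combinations

def brute_cliques(V, E):
    """All cliques (maximal complete subgraphs), by exhaustive search;
    for illustration only -- exponential in the number of points."""
    adj = {u: set() for u in V}
    for a, b in E:
        adj[a].add(b)
        adj[b].add(a)
    complete = [frozenset(c)
                for k in range(1, len(V) + 1)
                for c in combinations(V, k)
                if all(q in adj[p] for p, q in combinations(c, 2))]
    return {c for c in complete if not any(c < other for other in complete)}

# A hypothetical 5-point example graph.
V = ['a', 'b', 'c', 'd', 'e']
E = [('a', 'b'), ('a', 'c'), ('b', 'c'), ('b', 'd'), ('c', 'd'), ('d', 'e')]
v = 'c'

# N: cliques of H(v), the subgraph induced on the closed neighborhood of v.
Nv = {v} | {p for edge in E for p in edge if v in edge and p != v}
N = brute_cliques(sorted(Nv), [e for e in E if e[0] in Nv and e[1] in Nv])

# R: cliques of G - v; S: members of R lying in no member of N.
R = brute_cliques([p for p in V if p != v], [e for e in E if v not in e])
S = {M for M in R if not any(M <= Mp for Mp in N)}

# Theorem 4.2: the clique set of G is the disjoint union N ∪ S.
assert brute_cliques(V, E) == N | S and not (N & S)
```

For this graph, N = {{a,b,c}, {b,c,d}} and S = {{d,e}}, and their union is exactly the clique set of G.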
APPENDIX C

PROOFS OF THE LINE REMOVAL CLIQUE DETECTION THEOREMS

This appendix gives the proofs of the propositions leading to Algorithm 4.3. The statements of the propositions are repeated here; for the definitions of the terms of the propositions, see Chapter 4.

LEMMA 4.12. L0 ∪ L1 ∪ L2 ∪ L3 ⊆ L*.

Proof.- Suppose M ∈ L0. Then M ∈ L with x ∉ M and y ∉ M. Since M is a complete subgraph of the subgraph G of G*, M is a complete subgraph of G*, so there is a clique M' of G* such that M ⊆ M'. Suppose z ∈ M' - M; then M″ = M ∪ {z} ⊆ M', so M″ is a complete subgraph of G*. But {x, y} ⊄ M″, since {x, y} ∩ M = ∅; as xy is the only line of G* not in G, M″ is therefore a complete subgraph of G, properly containing M ∈ L, a contradiction. Hence, M' - M = ∅, so that M' ⊆ M; since also M ⊆ M', M = M'. But M' ∈ L*. Therefore, L0 ⊆ L*.

Assume x ∈ X ∈ L, y ∈ Y ∈ L, and X - Y = {x}, and let M = X ∪ {y}. Let w ∈ X; if w = x then, since xy ∈ E*, wy ∈ E*; if w ≠ x then, since X - {x} = X ∩ Y, w ∈ Y, so that wy ∈ E ⊆ E*. Thus, y is adjacent in G* to each member of X. Since also each pair in X is adjacent in G, M is a complete subgraph of G*. Suppose that M' ∈ L*, M ⊆ M', and z ∈ M' - M. Then z ≠ y and z ∉ X. Since z ∈ M' and X ⊆ M ⊆ M', z is adjacent in G* to each member of X; since z ≠ y, each of these lines is a line of G, so that X ∪ {z} is a complete subgraph of G. Since z ∉ X, X ∪ {z} properly contains X ∈ L, a contradiction. Therefore, there is no M' ∈ L* properly containing M = X ∪ {y}, so that X ∪ {y} ∈ L*. By the same argument, if x ∈ X ∈ L, y ∈ Y ∈ L, and Y - X = {y}, then Y ∪ {x} ∈ L*. Hence, L1 ⊆ L*.

Let X ∈ X2 and suppose X ∉ L*. Since X is a complete subgraph of G*, there is an M ∈ L* such that X ⊆ M; since X ∉ L*, X ≠ M, so M ⊄ X. Let w ∈ M - X. Since X ⊆ M ∈ L* and w ∈ M, X ∪ {w} is a complete subgraph of G*; since w ∉ X, X ∪ {w} properly contains X. If w ≠ y, then X ∪ {w} is a complete subgraph of G, contradicting X ∈ L; hence, w = y; i.e., if w ∈ M - X then w = y. Since M - X ≠ ∅ and X ⊆ M, M = X ∪ {y}. Consider now M' = M - {x}. Since x ∉ M' ⊆ M ∈ L*, M' is a complete subgraph of G; let M″ ∈ L with M' ⊆ M″. Since y ∈ M' ⊆ M″ ∈ L, M″ ∈ Y, and x ∉ M″ (since y ∈ M″ and xy ∉ E). Now X - M″ ⊆ X - ((X ∪ {y}) - {x}) = {x}; since x ∈ X and x ∉ M″, X - M″ = {x}. But X ∈ X and M″ ∈ Y with X - M″ = {x} implies X ∈ X1, contradicting X ∈ X2 = X - X1. Hence, X ∈ L*. By the same argument, if Y ∈ Y2 then Y ∈ L*; therefore, L2 ⊆ L*.

Let M ∈ L3; that is, M is contained in no member of L1, and M ∈ L3'. Since M ∈ L3', M ∈ L3″ and M is properly contained in no member of L3″. Since M ∈ L3″, M = (X ∩ Y) ∪ {x, y} for some X ∈ X2 and Y ∈ Y2. One easily sees that M is a complete subgraph of G*. Suppose M ∉ L*. Then there is an M' ∈ L* such that M ⊆ M' and M' - M ≠ ∅. Since each of x and y is in M and M ⊆ M', {x, y} ⊆ M'. Thus, X' = M' - {y} and Y' = M' - {x} are complete subgraphs of G, containing x and y respectively. Let X″, Y″ ∈ L with X' ⊆ X″ and Y' ⊆ Y″. Suppose that X″ ∈ X2 and Y″ ∈ Y2. Then (X″ ∩ Y″) ∪ {x, y} ∈ L3″. Now M' - {y} = X' ⊆ X″ and M' - {x} = Y' ⊆ Y″, so that M' - {x, y} ⊆ X″ ∩ Y″; hence, M' ⊆ (X″ ∩ Y″) ∪ {x, y} ∈ L3″. But M is properly contained in M', and so in a member of L3″; this contradicts M ∈ L3'. Therefore, X″ ∉ X2 or Y″ ∉ Y2. Suppose X″ ∈ X1 = X - X2; then there is a W ∈ Y such that X″ - W = {x}, and X″ ∪ {y} ∈ L1. But M' - {y} = X' ⊆ X″, so that M' ⊆ X″ ∪ {y} ∈ L1. Since M ⊆ M', M is contained in a member of L1, contradicting M ∈ L3. By the same argument, if Y″ ∉ Y2 then M ∉ L3. In every case, then, the supposition that M ∉ L* implies that M ∉ L3, contrary to assumption. Therefore, M ∈ L*; i.e., L3 ⊆ L*. //

LEMMA 4.13. L* ∩ L = L0 ∪ L2, and L* - L ⊆ {(X ∩ Y) ∪ {x, y}: (X, Y) ∈ X × Y}.

Proof.- By Lemma 4.12, and since L0 and L2 are subsets of L, L0 ∪ L2 ⊆ L* ∩ L. Let M ∈ L* ∩ L. Since M ∈ L, {x, y} ⊄ M. If neither x nor y is in M, then M ∈ L0. If x ∈ M and y ∉ M, then M ∈ X; were M ∈ X1, M would be properly contained in the member M ∪ {y} of L1 ⊆ L* (Lemma 4.12), contradicting M ∈ L*; hence, M ∈ X2. Similarly, if y ∈ M and x ∉ M, then M ∈ Y2. Therefore, L* ∩ L ⊆ L0 ∪ X2 ∪ Y2 = L0 ∪ L2, and L* ∩ L = L0 ∪ L2.

Let M ∈ L* - L. Since M is a complete subgraph of G* but not of G, {x, y} ⊆ M. Then M - {y} and M - {x} are complete subgraphs of G; let X, Y ∈ L with M - {y} ⊆ X and M - {x} ⊆ Y, so that X ∈ X and Y ∈ Y. Since M - {x, y} ⊆ X ∩ Y, M ⊆ (X ∩ Y) ∪ {x, y}. Moreover, (X ∩ Y) ∪ {x, y} is a complete subgraph of G*: the points of X ∩ Y are pairwise adjacent in G, each point of X ∩ Y is adjacent to x (both being in X) and to y (both being in Y), and xy ∈ E*. Suppose there is an element z ∈ A = ((X ∩ Y) ∪ {x, y}) - M. Since {x, y} ⊆ M, z ∈ (X ∩ Y) - M, so that M is a proper subset of the complete subgraph (X ∩ Y) ∪ {x, y} of G*, contradicting M ∈ L*. Consequently, A = ∅, and hence, M = (X ∩ Y) ∪ {x, y}, with X ∈ X and Y ∈ Y. Therefore, L* - L ⊆ {(X ∩ Y) ∪ {x, y}: (X, Y) ∈ X × Y}. //

THEOREM 4.4. L* is the union of L0 ∪ L2 and the maximal members of W = {(X ∩ Y) ∪ {x, y}: (X, Y) ∈ X × Y}.

Proof.- L* = (L* ∩ L) ∪ (L* - L). By Lemma 4.13, L* ∩ L = L0 ∪ L2 and L* - L ⊆ {(X ∩ Y) ∪ {x, y}: (X, Y) ∈ X × Y} = W. Each member of W is clearly a complete subgraph of G*, is not a complete subgraph of G, and is a subset of no member of L* ∩ L = L0 ∪ L2. Thus, no maximal member of W is properly contained in any member of L* - L or of L* ∩ L, i.e., in any member of L*. Therefore, the maximal members of W are cliques of G* and are not complete subgraphs of G; i.e., maximals W ⊆ L* - L. Since no nonmaximal member of W is a clique of G*, L* - L = maximals W. //

LEMMA 4.14. L* - L ⊆ L1 ∪ {(X ∩ Y) ∪ {x, y}: (X, Y) ∈ X2 × Y2}.

Proof.- Let M ∈ L* - L. By Lemma 4.13, M = (X' ∩ Y') ∪ {x, y} for some X' ∈ X and Y' ∈ Y. Suppose that for no (X″, Y″) ∈ X2 × Y2 is M = (X″ ∩ Y″) ∪ {x, y}. Then X' ∈ X1 or Y' ∈ Y1. Suppose X' ∈ X1. Then there is a Y ∈ Y such that X' - Y = {x}, and M' = X' ∪ {y} ∈ L* by Lemma 4.12. Thus, X' ∩ Y = X' - {x}. For any Z ∈ Y, since x ∉ Z, X' ∩ Z ⊆ X' - {x}; therefore, X' ∩ Y' ⊆ X' - {x} = X' ∩ Y, so that M ⊆ (X' ∩ Y) ∪ {x, y} = X' ∪ {y} = M' ∈ L1 ⊆ L*. Since M, M' ∈ L* and M ⊆ M', M = M' = X' ∪ {y}; that is, M ∈ L1. Similarly, if Y' ∈ Y1, M ∈ L1. Thus, if M ∈ L* - L and M ∉ {(X ∩ Y) ∪ {x, y}: (X, Y) ∈ X2 × Y2}, then M ∈ L1. //

THEOREM 4.5. L* = L0 ∪ L1 ∪ L2 ∪ L3.

Proof.- Suppose M ∈ L3″ - L3'; then there is an M' ∈ L3″ such that M ⊆ M' and M' - M ≠ ∅. Thus, M is properly contained in another complete subgraph of G*, so that M ∉ L*. Hence, L* ∩ L3″ = L* ∩ L3'.

Suppose now that M ∈ L3' - L3. Then there is an N ∈ L1 such that M ⊆ N. If M ≠ N, then M is properly contained in N ∈ L1 ⊆ L* (Lemma 4.12), so that M ∉ L*. If M = N, then M ∈ L1 ⊆ L*. Thus, if M ∈ L3' - L3, then M ∈ L* if and only if M ∈ L1; i.e., L* ∩ (L3' - L3) ⊆ L1.

By Lemma 4.14, L* - L ⊆ L1 ∪ L3″; hence,
L* - L ⊆ (L* ∩ L1) ∪ (L* ∩ L3″) = (L* ∩ L1) ∪ (L* ∩ L3')
= (L* ∩ L1) ∪ (L* ∩ (L3' - L3)) ∪ (L* ∩ L3) ⊆ L1 ∪ L3.
Therefore, L* - L ⊆ L1 ∪ L3. But by Lemma 4.13, L* ∩ L = L0 ∪ L2, so that L* = (L* ∩ L) ∪ (L* - L) ⊆ L0 ∪ L1 ∪ L2 ∪ L3. Since L0 ∪ L1 ∪ L2 ∪ L3 ⊆ L* by Lemma 4.12, L* = L0 ∪ L1 ∪ L2 ∪ L3. //
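A consequence of Lemma 4.13, that every clique of G* = G + xy is either a clique of G or contains both x and y, can be verified concretely by brute force. The four-point graph below is a hypothetical example chosen for illustration.

```python
from itertools import combinations

def brute_cliques(V, E):
    """All cliques (maximal complete subgraphs), by exhaustive search;
    for illustration only -- exponential in the number of points."""
    adj = {u: set() for u in V}
    for a, b in E:
        adj[a].add(b)
        adj[b].add(a)
    complete = [frozenset(c)
                for k in range(1, len(V) + 1)
                for c in combinations(V, k)
                if all(q in adj[p] for p, q in combinations(c, 2))]
    return {c for c in complete if not any(c < other for other in complete)}

# G is a 4-cycle; G* = G plus the line xy (a hypothetical example).
V = [1, 2, 3, 4]
E = [(1, 2), (2, 3), (3, 4), (4, 1)]
x, y = 1, 3                          # the added line
L = brute_cliques(V, E)
L_star = brute_cliques(V, E + [(x, y)])

# Lemma 4.13: every clique of G* is a clique of G or contains both x and y.
assert all(M in L or {x, y} <= M for M in L_star)
```

Here the four line cliques of the cycle are replaced in G* by the two triangles {1,2,3} and {1,3,4}, both of which contain the added pair.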
APPENDIX D A METRIC ON THE CLASS OF COLLECTIONS OF NONEMPTY SUBSETS OF A FINITE SET Let (X, ~, be a measure space, i.e., X is a set, S is a ring of subsets of X, and is a measure defined on S Suppose further that is a finite measure, and if EE S and (E) = 0 then E = 0. Consider the function d: S X s + R defined as follows: if E, E' E s then d(E, E') = (E 6 EI) If E E s then E 6 E = 0 and (0) = 0 so d(E, E) = 0 If E, E' E s and d(E, EI) = 0 then (E 6 EI) = 0 by assumption that if E II E s and (E11) = 0 then E II = 0 E 6 E' = 0 Hence, E = E' if d (E, E') = 0 Let U, V, WES U A V = U A V A W A W = (UAW) A (VA W) (U A W) A (V A W) c (U A W) u (V A W) and both are in so by monotonicity of [ (U AW) t::, (V A W)] $ [ (U A W) u (V t::, W) J Since is subadditive, [(UAW) u (V t:. W)] $ (U A W) + (V t::, W) Hence, (U A V) $ (U A W) + (VA W) i.e., d(U, V) :::; d(U, W) + d(W, V) Therefore, dis a metric on S 226
PAGE 238
227 As a particularly pertinent example, let X be a finite set and S = P(X) the power set of X, and let : P(X) + R be simply the cardinality function, i.e., if A c X then (A) = IAI That P(X) is a ring and a finite measure, are easily verified. The distance d in this case is given by d(A, B) = (A 6 B)=IA 6 Bl for A, B E P(X) To continue, let S' = S {0} i.e., S' is the collection of nonempty subsets of X belonging to the ring ; and let 11 = P(_') th t f S' S'' i e power se o i.e., s the class of all collections of nonempty subsets of X belonging to S Let S'" = {U: U E _", I !2.1 < 00 } The function D: S '" x S '" + R is defined as follows. Let U V E S '" with _, IUI IYI so that there exists a onetoone function from u into V Define F(!l_, y) = {f: u + V I f is onetoone} For f E F ( !2_, y) define o ( f) = I d(Z, fZ) + I d(Y, 0) since ZEU YEVfU IUI, lyl < 00 and is a finite measure, o(f) ER Finally, D: S"'x S'"+ R is defined, D(Q_, y) = min{o(f): f E F(!l_, y)} for u VE S'" and lul~lvl. For brevity, let 0 = _"' the col lee tion of all finite collections of nonempty subsets of X which are in the ring s If U y E 0 and IQ.I !YI then since each of U and V is of finite cardinality, O < IF ( Q_, y) I tl D(u V) l s the minimum of a nonempty finite consequen y, _,
PAGE 239
228 set of nonnegative real numbers ( d is nonnegative). Hence, there is an f E F(Q, I) such that o(f) = D(Q, I). PROPOSITION D.l D(Q, I) = 0 if and only if U = V. Proof.The identity function i: U + U belongs to F(Q, Q); o(i) = l d(X, iX) + XEU I d(Y, 0) = YEUU I d(X, X) = XEU 0 Hence, D(~, V) O ; but D is nonnegative since d is; therefore, D(~, V) = 0 On the other hand, suppose D(Q, I) = 0 with IU I IVI I d(Z, fZ) + ZEU Let f E F(Q, I) and o(f) = O = I d(Y, 0), the sum of two nonnegative YEVfU numbers. Hence, each is zero. Since 0 i I, if YE V then d(Y, 0) > 0 ; therefore, V fU = 0, i.e., I c fQ. If for some Z EU, Z i fZ then d(Z, fZ) > 0 requiring o(f) > o Hence, fZ = Z for each Z EU. Thus, U = fU c V c fU = U, and therefore, U = V PROPOSITION D.2 Suppose D(Q, I)= o(f) where f E F(Q, I) Then D(Q, fQ) = o(g) where g: U + fU is defined as follows: gU = fU for each U EU Proof.Suppose h: U + fU E F(Q, fQ) Define II k: U + V by kU = hU for each U EU Since f is onetoone, g is onetoone and onto, and IUI = lfQI ; thus,
PAGE 240
229 h also is onetoone and onto, and kU = hU = g!:!_ = fU Thus, V kU cS (k) cS ( h) = l d(V, 0) = cS ( f) VEVkU cS ( k) cS ( f) = cS ( h) cS ( g) Since cS ( k) ::c: cS ( f) = D (Q, y_) hence, 0 $ cS(g) $ cS(h) for each h E F(Q, fQ) D(U, fQ) = cS(g) PROPOSITION D.3 k is onetoone, = V fU and cS ( g) hence, k E F (Q, y_) cS ( h) cS (g ) i.e. Therefore, Suppose I !:!.I = I Y..I and D (Q, y_) = cS ( f) where f E F (Q, y_). If U c U then 1 with where g: Q 1 + f!:!_ 1 is defined as follows: gU = fU for each U E Q 1 II Proof.Suppose h E F(Q 1 fQ 1 ) D e fine k: U + V as follows: if Z E Q 1 then kZ = hZ ; if XE U Q 1 then kZ = fZ Then cS(k) cS(h) = 2 d(V, 0) = cS(f) cS(g) V E'!_fQ_l since fU = V = kU and gQ. 1 = f!:!_ 1 = hQ. 1 so kU hU = y_ f!:!_ 1 = fU g!:!_, and k = h on Q. 1 and g = f on u 1 Since cS(f) = D(Q, y_) cS(f) $ cS(k) ; hence, cS(g) = cS(h) + cS(f) cS(k) $ cS(h) That is, cS(g) $ cS(h) for each h E F(Q 1 f!:!_1) Hence, D(Q 1 f!:!_ 1 ) = cS (g) II PROPOSITION D.4 Suppose D(Q, y_) = cS(f), f E F(Q, '!_) and !:!.i u !:!_ 2 = U with Q 1 n Q 2 = 0 Define f 1 : Q 1 + fQ 1
PAGE 241
230 by f 1 Z = fZ for each Z in ~ 1 Define f2: ~2 + V f~l by f 2 Z = fZ for each z in u ~l = ~2 Then D(~l' f~1) = o(f 1 ) D(~2, V f~l) = o(f 2 ) and Proof.By Proposition D.2, D(~, f~) = o(g) where g: u + fU is given by gZ = fZ for each z in u Since IUI = lfUI and D(~, f~) = o(g) and ~l C u by Proposition D.3, D(~1, g~l) = 0 ( g I ) where g': ~l + g~l is given by g'Z = gZ for each z in ~1Thus, g'U = g~l = f~l and g' = f on ~l hence g' = fl 1 and D(~l' f~l) = o(f 1 ) as stated. Define h: U+ V as follows: if z E ~l then hZ = fZ if z E ~2 then hZ = gZ Evidently, o(f) = o(f 1 ) + o(f 2 ) and o(h) = o(f 1 ) + o(g) since h E F(~, y_) o ( h) 2'. D(~, '!._) = o ( f) Therefore, o(f 2 ) 0 ( g) i e. o(f 2 ) is not greater than any member of F(~ 2 '!.. f~ 1 ) Therefore, D(~ 2 y_ f~ 1 ) = o(f 2 ) as required. Finally, 6(f) = 6(f 1 ) + o(f 2 ) so that D(~, '!._) = D(~l' f~ 1 ) + D(~ ~l, '!._ f~ 1 ) PROPOSITION D.5 Let l~I IV! ~l C U '!.. 1 C V and Then D (~, '!._) D(~l' '!.. 1) + D(~ ~l V '~1' = l'!.. 1 I '!..1) II
PAGE 242
231 Proof.Let ~2 = u u I2 = V v1 Let 1 f1 E: F(~l' Y.1) f2 E F(~2' Y.2) be such that D(~l' Y.1) = (fl) D(~2, Y.2) = (f2) Define f: u + V as follows: if z E: ~l then fZ = f 1 Z if z E: ~2 then fZ = f 2 Z Since fl and f2 are onetoone, and f1~1 n f 2 U 2 = 0, it is clear that f is onetoone from U into V Moreover 0 ( f ) = 0 ( f 1 ) + 0 ( f 2 ) Thus DC~, y) s o(f) = o(f 1 ) + o(f 2 ) = D(~ 1 y_ 1 ) + D(~~ 1 yy 1 ) PROPOSITION D.6 There exists a function g E F(~, y) such that D(~, y) = o(g) and if Z EU n V then gZ = Z Proof.Let Z E n y_, f E F(~, y) with o(f) = D(~, y_) and suppose fZ EV (~ n y) Suppose further that Z i fU Con&ider the function g: u + V defined: if u E: u {Z} then gU = fU and gZ = z Since z t fU g is onetoone, and so is in F(~, y_) 0 ( g) = I d(U, gU) + I d(V, 0) u EU VEYg~ g~ = (f~ {fZ}) u {Z} o(g) = II I d ( u, fU) + d ( z, g z) + I d ( v, 0) + d ( f z 0) d ( z, 0) UEU{Z} VEV~fu = I d(U, fU) d(Z, fZ) + d(Z, Z) d(Z, 0) + d(fZ, 0) + u E: u I d(V, 0) VEV~ru = o(f) + d(fZ, 0) d(Z, 0) d(Z, fZ) But d(fZ, 0) s d(0, Z) + d(Z, fZ) ; hence, o(g) s o(f) Since
PAGE 243
232 o(f) = D(Q_, ~) o(g) = D(Q_, y) Note that u n V n gQ_ = (Q_ n V n fQ_) u {z} with z fU thus, one member of u n V not mapped into itself by f is mapped into itself by g every member of u n V mapped into itself by f is also mapped int o it se lf by g and o(g) = D(Q_, ~) Since IU n VI < 00 this construction may be repeated only finitely many times until the resulting function h has the :property that u n V C hU and o(h) = D(U, y) Assume now that z E u n V and hZ i z since u n V C hU there is a y E u such that hY = z Since hZ i z y i z Define k: u + V as follows: if u E u { z, Y} then kU = hU kZ = z and kY = hZ Clearly, kU = hU and k E F(Q_, y) Also, o(k) o(h) = d(Z, kZ) d(Z, hZ) + d(Y, kY) d(Y, hY) = d(Z, Z) d(Z, hZ) + d(h1 Z, hZ) d(h1 z, Z) 0 si nce d(Z, Z) = 0 and d obeys the triangle inequalit y. Hence, o(k) = D (Q_, y) since o(k) o(h) = D(Q_, ~) o (k) Suppose W E u n V and hW = w if w = y = h1 z hY = and hY = Y, so Y = Z contrary to the hypothesis on Z z Thus, W i Y and W i Z so that hW =kW= W Therefore the number of members of Un V not mapped into themselves by k is strictly less than the number of those not mapped into themselves by h Since IQ_ n YI is finite, this construction is repeated only finitely many times, until the resulting function g: Q_+ V has the required propert ies: o(g) = D(Q_, ~) and gZ = z if z EU n V II
PAGE 244
233 PROPOSITION D.7 D(Q, y) = D(Q y, y Q) Proof.Let IQI s IVI If Un V = 0, U V = U, V Q = V, and the statement holds. Suppose Q n Yi 0 By Proposition D.6, there is a function f: Q + y E F(Q, y) such that o(f) = D(Q, y) and fZ = Z if Z EU n V By Proposition D.4, since f(Q n y) = Q n y, D(U, y) = D(Q n y, f(Q n y)) + D(Q (Q n y) y f(Q n y)) = D(U n V, Q n y) + D(Q y, y Q) By Proposition D.l, the first term is zero. Hence D(Q, y) = D(U V, V Q) o ( f) LEMMA D.l If IUI s IVI s IWI then D(Q, ~) s D(Q, y) + D(Y, ~) Proof.Let f E F(Q, y) g E F(Y, ~) = D(Q, V) and o(g) = D(y, ~) Define II h: u + w to be gf then h E F(Q, ~) since f and g are onetoone. o(h) = l d(U, hU) + l d(W, 0) For u E D UEU WEWhU d(U, hU) = d(U, g(fU)) s d(U, fU) + d(fU, g (fU)); hence, (a). Also, o (h) s I d (W, 0) + WEWhU I d (U' fU) + UEU I d(V, gV) Ve::fU
PAGE 245
(b). D ( u, Y) + D ( Y., w) = I d ( u, fU) + UeU [ L d(V, gV) + L d(V, gV)) + I d(V, 0) + VefU VeVfU VeVfU L d(W, 0) We 1!!_gy_ ( C) Combining (a) and (b): 6(h) s D(g, Y.) + D(y_, 1!!_) + L d(W, 0) WeWhU 234 L d(W, 0) We1!!_gV L [d(V, gV) + d(V, 0)] VeVfU Since hU C gy_, W g'!_ c W hg, and (1!!_ hg) (1!!_ g'!_) = g'!_ hU. Let We gV hg; then there is a Ve V such that gV = W If there were a U e U such that fU = V, then hU = W; so since W hg, Ve V fU On the other hand, if Ve V rg, W = gV e g'!_, and W i hU Hence, g'!_ hU = {gV: Ve Y. fU} so that (c) may be written: ( d) 6(h) s D(g, y_) + D(y_, 1!!_) + I [ d(gV, 0) d(V, gV) d(V, 0) J Since VeVfU d(gV, 0) s d(gV, V) + d(V, 0) each term of the summation is nonpositive, and so the sum is nonpositive. Therefore, 6(h) s D(g, y_) + D('!_, 1!!_) Finally, since h e F (g, 1!!_) D (g, ~) s 6 (h) Hence, D(g, 1!!_) s D(g, V) + b(y_, 1!!_) LEMMA D.2 If lgl s IY.I s 11!!_1 then D(g, y_) s D(g, 1!!_) + D(1!!_, V) II
PAGE 246
L d (W, 0) = WEg'!_ L d ( gV, 0) + VEVhU I d Cg C hU) 0) Uc:U 235 o(h) A~+ I d(V, 0) VEV.,;:.hU l d ( gV, 0) VEVhU I d (V, gV) VEVhU 2 l d(W, 0) WEZ Since d(V, 0) d(0, gV) + d(gV, V) for each V in V h~ cS ( h) A 2 I d(W, 0) 0 Hence, WEZ cS ( h) D(~, ~) + DC~, ~) for h E F(~, Y..) Since D(~, y_) cS(h)., D(~, y_) D(~, ~) + D (y_, W) This completes the case 1 proof. Case 2.T = fU n gy_ t0 Denote ~2 = w ~l ~l U1 = f1 w U2 = u ~l Y..1 = lw Y..2 = V y_l 1 g 1 Define fl: ~l + ~l f2: ~2+ ~2 gl: y_l + ~l and g2: y_2 + w as follows: if z E ~l f 1 Z = fZ if z E 2 f 2 Z = fZ if y E Y..1 glY = gY if By Proposition D.4, D(~, ~) = D(~l' ~l) D (y_, W) = D (y_l' w ) 1 + D CY..2, ~2) so that Also, by Proposition D.4, D(U 2 W 2 ) = cS(f 2 ) f 2 u 2 n g 2 y_ 2 = 0, and D(Y..2, ~2) = cS(g2) I ~2 I I y_2 I I ~2 I + y E Y..2 D (~2, ~2) Since because g2Y = and Q.2 gY 1~ 1 1 = ly_ 1 1 = 1~ 1 1 one has by case 1 above that (b). D(~ 2 y_ 2 ) D(~ 2 ~ 2 ) + D(y_ 2 ~ 2 ) By Lemma D.l, since IV1I = IW1I = 1~11 (c). D(~l' y_ 1 ) D(~ 1 !i) + D(y_ 1 ~ 1 ) Combining (a), (b) However,
PAGE 247
Proof.Let f E F(~, ~) g E F(~, ~) o (f) = D(~, ~) and o (g) = D(~, ~) 236 Case 1.f~ n g~ = 0 W = fU u g~ u(W fU g~) Then W Denoting is the disjoint union, Z = W fU gV W = fU u g~ u Z Let h be any onetoone function from U into V. Then W = Z u fU u g(h~) u g(~ hU) is a disjoint union. Let U EU d(U, hU) s d(U, g(hU)) + d(g(hU), hU) ; d ( u g ( h u ) ) $ d ( u f u ) + d ( f u g ( h u ) ) d(fU, g(hU)) s d(fU, 0) + d(g(hU), 0) Thus, l d(U, hU) s l [ d(fU, 0) + d(g(hU), 0) + d(g(hU), hU) + UEU UEU d(U, fU) J Since o(h) = l d(U, hU) + l d(V, 0) UEU VEVhU o (h) s I d (V, 0) + VEVhU I [ d(fU, 0) + d(g(hU), hU) + UEU d(g(hU), 0) + d(U, fU) ] For brevity, A= D(~, ~) + D(~, ~) ; A = l d (U, fU) + UEU l d(W, 0) + l d(V, gV) + WEWfU VEV L d(W, 0) WE~g~ = l d(U, fU) + ( l d(hU, UEU UEU g(hU)) + l d(V, gV)] + VEVhU I d(W, 0) + l d(W, 0) WEgV WEfU + 2 I d(W, 0) Thus, WEZ o(h) As { I d(V, 0) + I [ d(fu, 0) + d(g(hu), 0) + VEVhU UEU d(g(hU)' hU) + d(U' fU) J} { I [ d(U, fU) + d(hU, g(hU)) J UEU + l d(V, gV) + Y d(W, 0) + I d(W, 0) + 2 I d(W, 0)} VEVhU WEg~ WEf~ WEZ Since I d ( fU, 0) = UE~ and
PAGE 248
237 by Proposition D.5, D(~, Y..) $ D(~l' Y..1) + D(~2' Y..2) Hence, D(U, which Y..) $ D(U, ~) + D (y_, ~) completing together with case 1, establishes LEMMA D.3 If I U I $ IV I $ I w I then D(y_, ~) $ D(~, y_) + D(~, ~) the case 2 proof, the lemma. Proof.Let D(~, y_) = o(f) f E F(~, y_) and D(~, ~) = o(g), g E F(~,~) Denote y_ 1 = fU and Y..2 = V V1 ~l = g~ and ~2 = w ~l Define h1: Y..1 + ~l by h 1 Z =gr1 2 for each z E Y..1 Since I y_ 1 I = IW I I y_ 2 I $ IW I so there is a onetoone 1 2 function, h 2 : y_ 2 + ~ 2 Define h: V+ W as follows: if Z E y_ 1 then hZ = h 1 Z; if Z E y_ 2 then hZ = h 2 Z Let A= o(f) + o(g) = D(~, y_) + D(~, ~) II A = l d(U, fU) + ") d(V, 0) + l d(U, gU) + l d(W, 0) UEU VE!2 UEU WE~2 { (U' fU): u E ~} = {(f1 v, V): V E Y.1} { (U' gU): u E ~} = { c r1 v, g (f1 V)) :V E Y..1} and { ( w, 0): w E ~2} = {(h 2 V, 0): VE Y..2} u { (W' 0): W E W h":_} Hence, A = l [ d(V, r1 v) + a(f1 v, g(flv)) J VEY..1 + l [ d(0, V) + d(0, h 2 V) J + l d(W, 0) VEV2 WE~hY.. On the other hand, 0 (h) = l ct(w, 0) + I acv, hV), WEWhV VEY.. o(h) = I acw, 0) + I acv, g(f1 V)) + I acv, h 2 V) WE~hY.. VEY..1 VEY..2
Since $d$ obeys the triangle inequality, $\delta(h) \le A = D(\mathcal{U}, \mathcal{V}) + D(\mathcal{U}, \mathcal{W})$. Since $D(\mathcal{V}, \mathcal{W}) \le \delta(h)$, $D(\mathcal{V}, \mathcal{W}) \le D(\mathcal{U}, \mathcal{V}) + D(\mathcal{U}, \mathcal{W})$.

THEOREM 5.1. $D : \Omega \times \Omega \to \mathbb{R}$ is a metric.

Proof. By Proposition D.1, $D(\mathcal{U}, \mathcal{V}) = 0$ if and only if $\mathcal{U} = \mathcal{V}$, for $\mathcal{U}, \mathcal{V} \in \Omega$. By Lemmas D.1, D.2, and D.3, if $\mathcal{U}, \mathcal{V}, \mathcal{W} \in \Omega$, then $D(\mathcal{U}, \mathcal{V}) \le D(\mathcal{U}, \mathcal{W}) + D(\mathcal{W}, \mathcal{V})$.
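The metric of Theorem 5.1 can be exercised numerically on small families. The sketch below is an illustration under stated assumptions, not the dissertation's program: it takes the set distance $d(A, B)$ to be the cardinality of the symmetric difference, and computes $D$ by brute-force minimization of $\delta$ over one-to-one assignments of the smaller family into the larger, charging each unmatched set its distance to the empty set; the function names are hypothetical.

```python
from itertools import permutations

def d(a, b):
    # Distance between two sets: cardinality of the symmetric difference
    # (an assumed concrete form of the set metric d).
    return len(a ^ b)

def D(U, V):
    # Assumed form of the family distance: minimize, over one-to-one maps
    # of the smaller family into the larger, the sum of d(S, fS) plus
    # d(T, empty set) for every unmatched T in the larger family.
    U, V = (U, V) if len(U) <= len(V) else (V, U)
    best = None
    for assign in permutations(range(len(V)), len(U)):
        matched = set(assign)
        cost = sum(d(U[i], V[j]) for i, j in enumerate(assign))
        cost += sum(d(V[j], set()) for j in range(len(V)) if j not in matched)
        best = cost if best is None else min(best, cost)
    return best

# Spot-checks of the metric properties proved in Appendix D:
U = [{1, 2}, {3}]
V = [{1, 2, 4}, {3, 5}, {6}]
W = [{1}, {3, 5}, {6, 7}]
assert D(U, U) == 0
assert D(U, V) == D(V, U)
assert D(U, V) <= D(U, W) + D(W, V)
```

The factorial-time search is only a correctness oracle for tiny families; it is not a practical algorithm.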
APPENDIX E

AN IDEALIZED CLASSIFICATION TREE

Let X be a set of n elements, and let x be a given natural number. Consider a directed tree of subsets of X satisfying the following conditions: the root of the tree is X; the successors of each non-endpoint constitute a partition of that point of the tree, i.e., of that subset of X; and the maximum distance from the root to an endpoint is $x - 1$. It is required to identify the tree structure for which the maximum number of calculations required to isolate an element of X is minimized; it is also required to determine the function $f : N \to N$ such that $f(x)$ is that minimum number of calculations.

If $x = 1$, the only tree is the trivial tree whose sole point is X, and $f(1) = n$. If $x = 2$, X is partitioned into the endpoints of the tree. Suppose that the number of members of the partition is $a$ and that the numbers of elements in those subsets of X are $n_1, n_2, \ldots, n_a$. Then $n = n_1 + n_2 + \cdots + n_a$. The number of calculations required to select a class is $a$. Having selected the $i$th class, the number of calculations
required to isolate an element of this class is $n_i$. Thus, the maximum number of calculations is $a + \max\{ n_i : i = 1, 2, \ldots, a \}$. Suppose that each class has the same size, that is, each $n_i = n/a$; then $a + \max\{ n_i : i = 1, 2, \ldots, a \} = a + n/a$. On the other hand, suppose that some $n_j < n/a$. Then some $n_i > n/a$; for otherwise $\sum_{k \ne j} n_k \le (a - 1)\, n/a$, and $\sum_k n_k < n$, contradicting $\sum_k n_k = n$. Therefore, if $n_j < n/a$ for some $j$, then $\max\{ n_i : i = 1, 2, \ldots, a \} > n/a$. If $n/a < n_j$ for some class, then, obviously, $\max\{ n_i : i = 1, 2, \ldots, a \} > n/a$. Thus, if some $n_j \ne n/a$, then the maximum number of calculations exceeds $a + n/a$, i.e., exceeds the maximum number of calculations required if each $n_i = n/a$. Therefore, the best structure for $x = 2$ is the partition of X into $a$ classes of equal size. Minimizing $a + n/a$ with respect to $a$ requires that $a = n^{1/2}$; hence, $f(2) = n^{1/2} + n/n^{1/2} = 2\, n^{1/2}$.

As induction hypothesis, assume that for $x = k$, (1) the number of major classes (successors of X in the tree) is $n^{1/k}$, each having $n/n^{1/k}$ elements; and (2) $f(k) = k\, n^{1/k}$. Let $x = k + 1$, and let the $a$ major classes have $n_1, n_2, \ldots, n_a$ elements. The number of calculations required to select a major class is $a$. Suppose the $i$th class is selected; the remaining calculations are those for a tree of $k$ levels of selection, the root having $n_i$ elements. The total number of calculations is minimized if this subtree is optimally structured. By the induction hypothesis, the number of
calculations is therefore $a + k\, n_i^{1/k}$. As above, $n = n_1 + n_2 + \cdots + n_a$, so the maximum of $\{ a + k\, n_i^{1/k} : i = 1, 2, \ldots, a \}$ exceeds $a + k\, (n/a)^{1/k}$ if any $n_i \ne n/a$, while if each $n_i = n/a$ then that maximum is just $a + k\, (n/a)^{1/k}$. Hence, the best distribution of the elements of X over the $a$ classes is $n_i = n/a$ for each $i$. Minimizing $a + k\, (n/a)^{1/k}$ with respect to $a$ requires that $a = n^{1/(k+1)}$. The number of elements of each major class is $n/a = n^{k/(k+1)}$; and

$f(k + 1) = a + k\, [n^{k/(k+1)}]^{1/k} = n^{1/(k+1)} + k\, (n^{k/(k+1)})^{1/k} = (k + 1)\, n^{1/(k+1)}$ .

This completes the inductive proof that, for any natural number x of levels of calculation, the idealized classification tree has the following properties: (1) the number of major classes is $n^{1/x}$, each having $n/n^{1/x}$ elements; and (2) $f(x) = x\, n^{1/x}$, the number of calculations required to isolate a single element of X via the classification tree.
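The closed form derived above is easy to confirm numerically. The following sketch (illustrative Python, not part of the dissertation) checks both the two-level optimum, $a = n^{1/2}$ with $f(2) = 2\, n^{1/2}$, and the inductive step, that the cost $a + k\, (n/a)^{1/k}$ at $a = n^{1/(k+1)}$ collapses to $(k+1)\, n^{1/(k+1)}$:

```python
import math

def f(x, n):
    # Appendix E's closed form: minimum number of calculations needed to
    # isolate one of n elements through x levels of selection.
    return x * n ** (1.0 / x)

# Two-level case: exhaustive search over integer class counts a confirms
# that a + n/a is minimized at a = n^(1/2), giving f(2) = 2*n^(1/2).
n = 10_000
cost2 = lambda a: a + n / a
best_a = min(range(1, n + 1), key=cost2)
assert best_a == 100
assert cost2(best_a) == f(2, n) == 200

# Inductive step: at a = n^(1/(k+1)), the cost a + k*(n/a)^(1/k)
# equals f(k+1, n) = (k+1)*n^(1/(k+1)).
for k in range(1, 10):
    a = n ** (1.0 / (k + 1))
    assert math.isclose(a + k * (n / a) ** (1.0 / k), f(k + 1, n))
```

As an aside not claimed in the appendix: treating $x$ as continuous, $f(x) = x\, n^{1/x}$ is minimized near $x = \ln n$, which corresponds to a branching factor of about $e$ per level.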
BIOGRAPHICAL SKETCH

Robert Ernest Osteen, the son of Robert Truby Osteen and the former Jennie Louise Hasenbalg, was born in Saint Augustine, Florida, June 13, 1936. His childhood was lived in Saint Augustine, where he graduated from Ketterlinus High School in 1954. From September 1954 to September 1958, R. E. Osteen served as an aviation electronics technician in the United States Navy. In September 1958, he entered the University of Florida, majoring in electrical engineering. In June 1962, he was awarded the degree B.E.E., with honors.

From July 1962 to March 1966, R. E. Osteen was a member of the technical staff of the Electronic Switching Systems Development Division of Bell Telephone Laboratories at Holmdel, New Jersey. As a trainee in the Communications Development Training program, he was enrolled as a part-time graduate student in New York University, earning the degree M.E.E. in June 1964. From March 1966 to August 1967, Mr. Osteen was employed as a senior engineer by the Defense, Space, and Special Systems Group of the Burroughs Corporation, at the Great Valley Laboratory, Paoli, Pennsylvania. In September 1967
he returned to the University of Florida to continue his education.

R. E. Osteen is married to Darcy Meeker of Kootenai County, Idaho.
I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Dr. Julius T. Tou
Graduate Research Professor

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

J. R. O'Malley
Associate Professor of Electrical Engineering

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Dr. A. R. Bednarek
Professor of Mathematics

This dissertation was submitted to the Dean of the College of Engineering and to the Graduate Council, and was accepted as partial fulfillment of the requirements for the degree of Doctor of Philosophy.

August, 1972

Dean, Graduate School

