DOMAIN-SPECIFIC KNOWLEDGE-BASED INFORMATION RETRIEVAL MODEL USING KNOWLEDGE REDUCTION

By CHANGWOO YOON

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA 2005

Copyright 2005 by Changwoo Yoon

To my wife Jaesook, my daughter Jenny, my son Juhyung, and my families, in God with love

ACKNOWLEDGMENTS

I would like to thank my parents for their support. They have provided unconditional love and support. I greatly thank all my relatives for their loving concern and prayers. I would also like to thank William H. Donnelly for his support and care during my Ph.D. Without his support through a research assistantship, I would not have continued my graduate work. I would like to thank my supervisory committee chair Douglas D. Dankel for his guidance and excellent advice on research. Finally, and most of all, I express my gratitude to my beloved wife, Jaesook. Her love, support, and prayer have not wavered in this lengthy process. She has undoubtedly been the single most integral component to my success.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
  1.1 Background about Intelligent Information Retrieval
  1.2 Intelligent Information Retrieval Model
2 INFORMATION RETRIEVAL
  2.1 Classical Information Retrieval Models
    2.1.1 Boolean Model
    2.1.2 Vector Space Model
    2.1.3 Probabilistic Model
  2.2 Alternative Information Retrieval Models
    2.2.1 Latent Semantic Indexing (LSI)
    2.2.2 Lateral Thinking in Information Retrieval
  2.3 Information Retrieval Models Involving Reasoning
  2.4 Evaluating Information Retrieval Performance
  2.5 Useful Techniques
    2.5.1 Stopword Removal
    2.5.2 Stemming
    2.5.3 Passage Retrieval
    2.5.4 Query Expansion
    2.5.5 Using Phrase
  2.6 Enhancement of IR Through Given Knowledge
    2.6.1 Using WordNet
    2.6.2 Using UMLS, SNOMED
  2.7 Summary
3 KNOWLEDGE REPRESENTATION BY BAYESIAN NETWORK
  3.1 Semantic Networks
  3.2 Probability Principles and Calculus
  3.3 Bayesian network
  3.4 Noisy-OR: Bayesian network inference
  3.5 QMR-DT model
  3.6 Bayesian Classifiers
    3.6.1 Naive Bayes
    3.6.2 Selective Naive Bayes
    3.6.3 Semi-naive Bayes
    3.6.4 Tree Augmented Naive Bayes
    3.6.5 Finite Mixture (FM) model
  3.7 Summary
4 KNOWLEDGE-BASED INFORMATION RETRIEVAL MODEL ARCHITECTURE
  4.1 SNOMED
  4.2 Anatomic Pathology Database (APDB) Design and Development
    4.2.1 Metadata Set Definition
    4.2.2 Information Processing: Retrieval and Extraction
  4.3 Summary
5 KNOWLEDGE-BASE MANAGEMENT ENGINE
  5.1 Semantic Network Knowledge Base Model Representing SNOMED
  5.2 Classification of the Post-Coordinated Knowledge
    5.2.1 Statistics of Pathology Patient Report Documents Space
    5.2.2 Classification of Post-Coordinated Knowledge
  5.3 Statistical Model of the Post-Coordinated Knowledge
  5.4 Naive Bayes Model of Post-Coordinated Knowledge
  5.5 Summary
6 KNOWLEDGE CONVERSION ENGINE (KCE)
  6.1 Support Vector Machine Document Vector
  6.2 Conceptual Document Vector
  6.3 KCE: Knowledge Reduction
  6.4 KCE: Conversion of Pre-Coordinated Knowledge
  6.5 KCE: Generating the Conceptual Document Vector
  6.6 KCE: Conversion of the Post-Coordinated Knowledge
    6.6.1 Statistical Model of Post-Coordinated Knowledge
    6.6.2 Probabilistic Model of Post-Coordinated Knowledge
  6.6 SVM IR Engine: Document Retrieval
  6.7 Summary
7 PERFORMANCE EVALUATION
  7.1 Simulation Parameters
  7.2 Simulation Result
    7.2.1 Performance Evaluation with Pre-Coordinated Knowledge
    7.2.2 Performance Evaluation with Naive Bayes Post-Coordinated Knowledge
    7.2.3 Performance of Statistical Post-Coordinate Knowledge Model
  7.3 Summary
8 CONCLUSION
  8.1 Contributions
  8.2 Future Work

LIST OF TABLES

Table
5-1 Number of AP data each year from '83 to '94
5-2 Number of unique SNOMED axes equations
5-3 Relation statistics among axes
5-4 Statistics on post-coordinated knowledge
7-1 Relevancy check result of 261 simulation documents
7-2 Value of performance gain of pre-coordinated knowledge compared to VSM
7-3 Value of performance gain of post-coordinated knowledge
A Primary terms for APDB
B-1 Partial list of T code
B-2 Partial list of M code
B-3 Partial list of E code
B-4 Partial list of F code
B-5 Partial list of D code
B-6 Partial list of P code

LIST OF FIGURES

Figure
1-1 Knowledge-based information retrieval model
2-1 Vector Space Model example diagram
2-2 Recall rate and precision
2-3 Relationship between recall and precision
3-1 Example of the probability for combined evidence
3-2 Forward serial connection Bayesian network example
3-3 Diverging connection Bayesian network example
3-4 Converging connection Bayesian network example
3-5 Example of chain rule
3-6 Example of Noisy-OR
3-7 General architecture of noisy-OR model
4-1 Architecture of the knowledge-based information retrieval model
4-2 Architecture of the knowledge-based information retrieval model detailed in the example domain
4-3 The "Equation" of SNOMED disease axes
5-1 The three types of SNOMED term relation
5-2 SNOMED hierarchical term relationship
5-3 SNOMED synonyms relationship
5-4 SNOMED Multiaxial relationship
5-5 Classification of post-coordinated knowledge
5-6 An example of a four-axis-relation post-coordinated knowledge
5-7 Structure of the post-coordinated knowledge in a Bayesian network
5-8 PCKB component structure and probability estimation
6-1 Knowledge reductions
6-2 Attributes of the SNNKB hierarchical topology relation
6-3 Example of Domain-Specific Knowledge relations
6-4 Conversion of type-M relations
6-5 Examples of case
7-1 Performance evaluation metrics
7-2 Comparison of performance for query 1 on positive cases
7-3 Evaluation results of query 1 including the neutral cases
7-4 Evaluation results for query 2 for the positive cases
7-5 Evaluation results for query 2 including the neutral cases
7-6 Evaluation results of query 1 including post-coordinated knowledge
7-7 Evaluation results of query 2 including post-coordinated knowledge
7-8 Evaluation results of query 1 including statistical post-coordinated knowledge
8-1 Knowledge reduction to statistical model
8-2 Offline application of knowledge

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

DOMAIN-SPECIFIC KNOWLEDGE-BASED INFORMATION RETRIEVAL MODEL USING KNOWLEDGE REDUCTION

By Changwoo Yoon

August 2005

Chair: Douglas D. Dankel II
Major Department: Computer and Information Science and Engineering

Information is a meaningful collection of data. Information retrieval (IR) is an important tool for changing data into information. Of the three classical IR models (Boolean, Vector Space, and Probabilistic), the Vector Space Model (VSM) is the most widely used. But the classical VSM relies only on term frequency (tf) and inverse document frequency (idf), which do not convey sufficient relevancy between a query and documents to produce effective results reflecting knowledge. Knowledge is organized information imbued by intelligence. To augment the IR process with knowledge, several techniques have been proposed, including query expansion by using a thesaurus, a term relationship measurement like Latent Semantic Indexing (LSI), and a probabilistic inference engine using Bayesian Networks.

We created an information retrieval model that incorporates domain-specific knowledge to provide knowledgeable answers to users. We used a knowledge-based model to represent domain-specific knowledge. Unlike other knowledge-based IR models, our model converts domain-specific knowledge to a relationship of terms represented as quantitative values, which gives improved efficiency.

CHAPTER 1 INTRODUCTION

The objective of this thesis is to create an intelligent information retrieval model that produces effective results reflecting knowledge by using a computationally efficient method.
1.1 Background about Intelligent Information Retrieval

Conceptually, information retrieval (IR) is the process of changing data to information. More technically, information retrieval is the process of determining the relevant documents from a collection of documents, based on a query presented by the user. If we look at the World Wide Web (WWW) before any processing (e.g., search), each document or web page is a datum. These data are uninterpreted signals or raw observations that reach our senses. Providing meaning to these data allows them to become information that is more meaningful and useful to humans than the raw data. Information retrieval is the process that extracts information from data.

One of the well-known information retrieval models is Boolean search. In the Boolean search model, we specify a set of query words that is compared to the words in the documents to retrieve precisely those documents that contain the given set of query words. We can call the retrieved documents "information," but it is hard to call them "knowledge," because additional tasks such as browsing each document and selecting the more meaningful ones are required to transform the retrieved documents into some form of knowledge.

Knowledge is organized information. The classic vector information retrieval model is an attempt to infuse knowledge into information retrieval results using the frequency of the query terms that are found in the documents. Intelligent information retrieval, or semantic information retrieval, attempts to use some form of knowledge representation within the IR model to obtain more organized information (i.e., improved precision, which is defined in Section 2.4), that is, knowledge. But it is difficult to codify or regulate knowledge. An ontology, the specification of a conceptualization (Gruber, 1993), is an attempt to regulate knowledge.
In artificial intelligence research, researchers use an ontology, such as a knowledge representation or the Semantic Web (Berners-Lee et al., 2001), which is an abstract representation of data on the World Wide Web, in an attempt to make the semantics of a body of knowledge more explicit. We can classify an ontology as either general domain or closed domain. For example, WordNet (Miller, 1990) is a general ontology (consisting of a thesaurus and a taxonomy) that aims to represent general-domain documents written in natural language.

We can compare closed-domain data to general-domain data.
* The subject of a closed domain is confined. For example, a company offering tourists information about excursions and outings might maintain this information in a database. Such a database would consist exclusively of tour-related data.
* A closed domain typically has its own knowledge repository, such as a term dictionary and the relations that exist between terms. Good examples of such a repository are the medical field's Unified Medical Language System (UMLS) and Systematized Nomenclature of Medicine (SNOMED). We call these domain-specific knowledge.

The nature of closed-domain data allows us to use better semantics than that of general-domain data. Applying knowledge in the information retrieval process normally requires significant computation. This computation occurs when the intelligent information retrieval system searches the knowledge space during the retrieval process. From this, we can derive the following set of research questions for closed-domain IR using domain-specific knowledge:
* "How can we effectively express the domain-specific knowledge as an ontology?"
* "What is the relationship between explicit semantics, ontology, and information retrieval?"
* "How can we maximize the efficiency of IR using the given domain-specific knowledge (ontology)?"
1.2 Intelligent Information Retrieval Model

Our research aims to create an information retrieval model that incorporates domain-specific knowledge to provide knowledge-infused answers to users. The closed-domain data we used consists of pathology patient reports. Figure 1-1 is a conceptual model of the proposed domain-specific knowledge-based information retrieval model. Details of the model are given in Chapters 4, 5, and 6.

A classical vector space model (VSM) information retrieval system using term frequency and inverse document frequency creates the query vector (1) and the document vector (2). The knowledge base management engine (KME) creates the knowledge (5) from the existing document set (3) before system operation starts. The KME adds knowledge (5) from new documents (4) as they enter the database. The Knowledge Conversion Engine (KCE) applies the knowledge (semantics) of the Knowledge Base (7) to the Document Vector (6) to create the Conceptual Document Vector (8). The conventional VSM IR engine calculates the relevance between the query vector (9) and the conceptual document vector (10), resulting in a ranked document list (11).

Figure 1-1. Knowledge-based information retrieval model

Using this model results in the following contributions to information retrieval research:
* This information retrieval model is a knowledge-based IR model. Unlike other models that perform knowledge-level information retrieval tasks such as ontology comparison and ontological query expansion, this model reduces the knowledge level represented by the knowledge base to the information level, such as the vector space model's document vector.
* Unlike other knowledge-based IR models, which have a heavy computation requirement because they compare concepts between the IR model and the query when the user requests information, this model applies knowledge to the document vector offline, leaving only a similarity measurement calculation between the query and the documents at query time.
* When a new document arrives in the system, we modify the knowledge base with only the knowledge that can be obtained and augmented from that new document, not from the predefined knowledge base. We call this a dynamic feature of the knowledge base. The dynamic feature of the knowledge base can be mapped to a statistical feature by offline knowledge conversion. This means that we apply the changes to the document vector and the knowledge base at specified time intervals rather than at the moment they are introduced.
* This model can be applied to IR applications in the general domain if these applications have a domain-specific knowledge ontology.
* Unlike other models, which have difficulty applying a knowledge hierarchy to the IR model, the knowledge-based model uses a hierarchical term relevancy value to express the knowledge hierarchy.

The organization of this thesis is as follows. Chapter 2 surveys the current research efforts on information retrieval. Chapter 3 surveys the current research topics on knowledge representation and inference using probability, concentrating on Bayesian networks. Chapter 4 introduces the proposed information retrieval model for closed-domain data. Chapters 5 and 6 discuss the details of the model. Chapter 7 presents a performance evaluation of the model. The thesis concludes with Chapter 8, which describes the future research work to be completed.

CHAPTER 2 INFORMATION RETRIEVAL

2.1 Classical Information Retrieval Models

Information retrieval (IR) is a process that finds relevant documents (information) in a document collection given a user's request (generally a query).
In contrast to data retrieval, which consists of determining which documents of a collection contain the keywords in the user's query, an IR system is concerned with retrieving information about a subject represented by the user's query. There are three classic models in information retrieval: the Boolean, the vector, and the probabilistic models (Yates and Neto, 1999, p. 21). The Boolean model is set-theoretic because documents and queries are represented as sets of index terms. The vector model is algebraic because documents and queries are represented as vectors in a t-dimensional space, where t is the total number of index terms. In the probabilistic model, probability theory forms the framework for modeling document and query representations.

2.1.1 Boolean Model

The Boolean model is a simple retrieval model based on set theory and Boolean algebra (Yates and Neto, 1999, p. 25). In Boolean information retrieval, a query typically consists of a Boolean expression, such as "(cat OR dog) AND NOT goldfish," and each document is represented by the set of terms it contains. The execution of a query consists of obtaining, for each term in the query, the set of documents containing this term. These sets of retrieved documents are then combined using the usual set-theoretic union (for OR queries), intersection (for AND), or difference (for NOT) to obtain a final set of documents that match the query.

The Boolean model provides a framework that is easy for a common user of an IR system to understand. Furthermore, the queries are specified as Boolean expressions having precise semantics. But the Boolean model suffers from two major drawbacks. First, using the Boolean model requires skilled users who can formulate quality Boolean queries. When the only users of an IR system are librarians, for example, or computer scientists conversant in logic, and the information to be searched is in a known or restricted form (such as bibliographic records), a Boolean system is adequate.
However, in cases where the users are less skilled, or the information to be searched is less well-defined, a ranked strategy (vector space, probabilistic, etc.) may be more effective. The Boolean model's second drawback is that its retrieval strategy is based on a binary decision criterion (i.e., a document is predicted to be either relevant or nonrelevant) without any notion of a grading scale, which prevents good retrieval performance. Thus, the Boolean model is in reality much more a data retrieval model than an information retrieval model.

2.1.2 Vector Space Model

The vector space information retrieval model, first introduced by Salton et al. (1975), takes a geometrical approach. A vector, called the "document vector," represents each document. This vector is of identical length for all documents, with the length equaling the number of unique terms in the entire collection of documents. Salton et al. (1975) defined the "term weight" (also known as the importance weight) as the ability of a term to differentiate one document having the term from other documents having the same term.

A number of weighting schemes can be used in the vector space model. Salton uses two properties: the term frequency and the inverse document frequency. The term frequency (tf) is the intradocument importance, which is the frequency of the term occurring in a document. Term frequency measures how well that term describes the document content. A term with a higher term frequency is more important than a term with a lower frequency. The inverse document frequency (idf) is derived from the number of documents in the corpus in which the term occurs. The inverse document frequency of term i is calculated as

$$idf_i = \log\left(\frac{N}{n_i}\right)$$

where $N$ is the number of documents in the collection, and $n_i$ is the number of documents in which term i occurs. The inverse document frequency is the interdocument importance.
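As a minimal sketch of the two quantities just defined, the following Python fragment computes them for a small collection. The three tokenized documents and the function names are hypothetical and chosen only for illustration. The raw term count is used as the term frequency and the natural logarithm is used for idf; both choices are assumptions, since the text does not fix them.

```python
import math
from collections import Counter

# Hypothetical toy collection; each document is already tokenized.
docs = [
    ["liver", "biopsy", "liver", "carcinoma"],
    ["skin", "biopsy", "melanoma"],
    ["liver", "cirrhosis", "biopsy"],
]


def term_frequency(doc):
    """Raw count of each term in one document (intradocument importance)."""
    return Counter(doc)


def inverse_document_frequency(docs):
    """idf_i = log(N / n_i), with N documents and n_i documents containing term i.

    The natural logarithm is an assumption; the base is not specified in the text.
    """
    n_docs = len(docs)
    # Count each term once per document, regardless of how often it repeats there.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n_docs / n) for term, n in doc_freq.items()}


tf = [term_frequency(d) for d in docs]
idf = inverse_document_frequency(docs)
```

In this toy collection, "biopsy" occurs in all three documents, so its idf is log(3/3) = 0, whereas "melanoma" occurs in only one document and receives the largest idf, log(3).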
If a term is uniformly present across the entire collection, it is less capable of differentiating the documents, which means that it has less importance than a term that occurs in only a few documents. We can calculate the term weight $w_{i,j}$ of term i in document j as

$$w_{i,j} = tf_{i,j} \times idf_i$$

where $tf_{i,j}$ is the term frequency of term i in document j, and $idf_i$ is the inverse document frequency of term i in the entire set of documents.

After constructing the document and query vectors using the weighting scheme, we calculate the similarity coefficient. One of the best known similarity coefficients is the cosine measure (Salton, 1968), defined for the query vector $q = (q_1, q_2, \ldots, q_t)$ and the document vector $d_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$, where t is the number of terms:

$$sim(q, d_j) = \cos(q, d_j) = \frac{q \cdot d_j}{|q|\,|d_j|} = \frac{\sum_{i=1}^{t} q_i \times w_{i,j}}{\sqrt{\sum_{i=1}^{t} q_i^2}\,\sqrt{\sum_{i=1}^{t} w_{i,j}^2}}$$

The cosine similarity is the cosine of the angle between the query and document vectors in t-dimensional Euclidean space. Suppose that we have a query consisting of two terms and a set of documents that may or may not contain those terms. Figure 2-1 illustrates the vector model and its similarity measure between two documents, $d_1$ and $d_2$, and a query q, which contain those terms. The similarity between document 1 ($d_1$) and the query is $s_{1,q}$, while the similarity between document 2 ($d_2$) and the query is $s_{2,q}$.

Figure 2-1. Vector Space Model example diagram

2.1.3 Probabilistic Model

Probabilistic retrieval defines the degree of relevance of a document to a query in terms of the probability that the document is relevant to the query. Maron and Kuhns (1960) first introduced the concept of probabilistic indexing in the context of a library searching system. Robertson and Sparck Jones (1976) introduced what is now known as the binary independence retrieval (BIR) model, which is considered the standard model of probabilistic retrieval.
The fundamental idea of the probabilistic model is to estimate the probability that a document is relevant to a given user query q. Stated as an equation, we can define the similarity of the jth document, $d_j$, to a query q as the ratio

$$sim(d_j, q) = \frac{P(R \mid d_j)}{P(\bar{R} \mid d_j)} \qquad (2\text{-}1)$$

where $R$ is the set of documents known to be relevant, $\bar{R}$ is the set of nonrelevant documents, $P(R \mid d_j)$ is the probability that document $d_j$ is relevant to the query q, and $P(\bar{R} \mid d_j)$ is the probability that $d_j$ is nonrelevant to the query q. The problem with Equation 2-1 (one disadvantage of the probabilistic model) is that we must guess the initial value of the document relevancy. The first probabilistic model, the BIR model, also did not consider the term frequency, which is a basic element of the vector space model.

2.2 Alternative Information Retrieval Models

The classical information retrieval models do not consider the dependency among the index terms. For example, in the vector space model, all terms in the document vector are orthogonal. The Latent Semantic Indexing (LSI) model (Furnas et al., 1988) is one of the IR models that incorporates term dependency.

2.2.1 Latent Semantic Indexing (LSI)

The classical information retrieval models use index terms as querying tools. The selection of the index terms is based on the assumption that the terms represent the "user's need," that is, they represent the concept behind the user's query. But as search results show, index terms alone do not capture the concepts that the user has in mind. For example, if the user wants to search for "Major cities in Florida," the index terms used may be "Major," "city," and "Florida." The search engine may try to find documents containing these keywords. But if the search engine is intelligent and supports conceptual matching, it would also search for keywords such as "Tampa," "Orlando," and "Miami," in the same way as a human would.
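The gap between keyword matching and conceptual matching in the example above can be shown with a small sketch. The three documents, their identifiers, and the matching rule (a document is retrieved if it shares at least one literal term with the query) are hypothetical and serve only as an illustration; they are not part of the model proposed in this thesis.

```python
# Hypothetical three-document collection; texts are illustrative only.
docs = {
    "d1": "major city guide for florida visitors",
    "d2": "tampa orlando and miami population statistics",
    "d3": "major league baseball schedule",
}
query_terms = {"major", "city", "florida"}


def keyword_match(docs, query_terms):
    """Return ids of documents sharing at least one literal term with the query."""
    hits = []
    for doc_id, text in docs.items():
        # Literal matching: only exact query words in the document count.
        if query_terms & set(text.split()):
            hits.append(doc_id)
    return hits


matched = keyword_match(docs, query_terms)
```

Here the literal match retrieves d1 and d3. Document d2, which names three major Florida cities, is missed because it shares no word with the query, while d3 is retrieved only because it happens to contain the word "major."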
The main idea of Latent Semantic Indexing (LSI) comes from the fact that a document may contain words having similar concepts. So LSI considers documents that have many words in common to be semantically close, and vice versa (Furnas et al., 1988). In the example from the previous paragraph, if the words "major," "city," "Florida," "Tampa," "Orlando," and "Miami" appear together in enough documents, the LSI algorithm will conclude that those terms are semantically close and then return all documents containing the terms "Tampa," "Orlando," and "Miami," even though these latter terms are not part of the given index terms.

The most important point of the LSI algorithm is that all calculations are performed automatically by looking only at the document collection and index terms. As a result, the problems of "polysemy" and "synonymy" can be addressed efficiently without the aid of a thesaurus. Polysemy is the problem of a word having more than one meaning. Synonymy is the problem that there are many ways of describing the same object.

LSI generally uses a statistical method called Singular Value Decomposition (SVD) to uncover the word associations between documents. The effect of SVD is to move words and documents that are closely associated nearer to one another in the projected space. It is possible for an LSI-based system to locate and use terms that do not even appear in a document. Documents that are located in a similar part of the concept space are retrieved, rather than only those matching keywords.

2.2.2 Lateral Thinking in Information Retrieval

The human brain is divided into two halves: the left and right brain. The left brain excels at sequential thinking, where the desired outcome is achieved by following a logical sequence of actions. In contrast, the right brain is optimized for creativity, where the desired outcome may require a degree of nonlinear processing.
Most information retrieval activity is focused on the requirements of sequential thinking, which is most comfortable when searching with precision. An example of sequential thinking in information retrieval is a Boolean logic search. When searching for specific information, traditional techniques can be used to find documents containing the required keywords combined with Boolean logic. "Sequential thinking," which is a left-brain process, is analogous to "vertical thinking."

Sometimes we are looking for information about a particular topic, but the concept is nebulous and difficult to articulate precisely. With this type of query it is difficult to specify the search so that all of the best documents are found without too many irrelevant ones. These difficulties are compounded if there is uncertainty about the presence of documents, for example in searches designed to gather evidence of, or to prove the absence of, information about the selected topic. A successful outcome is likely to involve some right-brain activity as we iterate the process with carefully modified search criteria. This kind of brain activity is called "lateral thinking" (Bono, 1973). The lateral thinking process is concerned with insight and creativity. It is a probabilistic rather than a finite process.

In an information retrieval context, vertical thinking is used when we know precisely what we are looking for and selecting the finite set of relevant documents is relatively straightforward. In contrast, lateral thinking is applied where the requirements are less well defined and the process of locating relevant information involves some degree of trial and error. Unfortunately, traditional techniques, designed for searching with precision, do not provide much assistance with this type of problem, and the user is left to try query after query until all permutations are exhausted.
The ability to automatically identify multiword concepts is fundamental to providing some assistance to the right brain when searching unstructured information. Without this ability the system is simply analyzing individual word frequencies, which are unlikely to make much sense to a human when taken out of context. Several approaches (i.e., linguistics, artificial intelligence, and Bayesian networks) have attempted to imbue concepts into the information retrieval model without much success. The fact that 90% of data is unstructured presents difficulties for current statistical information retrieval methods. If the data are well structured, as in a relational database schema where a query is very specific, we can predict a precise result, which resembles vertical thinking. Unfortunately, many people expect to search unstructured information in the same way and are often disappointed when the documents they expect to find are not returned. The problem is that unstructured data are highly variable in layout, terminology, and style, while the queries tend to be more difficult to define.

Yann et al. (2003) suggested using feedback from user requests to retrieve "alternative" documents that may not be returned by more conventional search engines, in a way that recalls "lateral thinking," to address a heterogeneous, large-scale pharmaceutical database problem. The proposed solution replaces the query expansion phase with a query processing phase, where evolved modules are applied to the query with two major results (Yann et al. 2003, p. 215):

* Rewritten queries will preferably retrieve documents that match fields of interest of the user.
* Other documents related to previous and present queries will be retrieved, therefore bringing some "lateral thinking" abilities to the search engine.

The system employs evolutionary algorithms, used interactively, to evolve a "user profile" at each new query.
This profile is a set of "modules" that perform basic rewriting tasks on words of the query. The evaluation step is extremely simple: a list of documents corresponding to the processed query is presented to the user. The documents actually viewed by the user are considered interesting, and the modules that retrieved those documents are rewarded accordingly. Modules that rarely or never contribute to the retrieval of "interesting" documents are simply discarded and replaced by newly generated modules. The authors used a genetic programming technique to evolve the user profile modules automatically.

2.3 Information Retrieval Models Involving Reasoning

A Bayesian network is a directed acyclic graph whose nodes represent random variables and whose edges represent causal relationships between nodes. A causal relationship means that if two nodes are connected, the parent node (i.e., the node from which the edge comes) is considered to be a potential cause of the child node (i.e., the node to which the edge points). We can consider the causal relationship as a probabilistic dependency (Fung and Favero, 1995). Lee et al. (2002) also proposed a Bayesian network model for a medical language understanding system, which makes the system noise-tolerant and context-sensitive. They demonstrated relevant inference based on Bayesian network patterns. Information retrieval models that perform inference based on Bayesian networks are not yet at a mature stage, and significant research is still needed in this area. This method also suffers from the heavy computational requirements of performing the inference.

2.4 Evaluating Information Retrieval Performance

An evaluation of a computer system is usually performed before its release. Commonly, the measures of a system's performance are time and space. For example, in a data retrieval system like a database system, the response time and the space requirement are the most interesting metrics.
But in an information retrieval system, other metrics are also interesting (Yates and Neto, 1999). This results from the vagueness of a user's request to an information retrieval system. The retrieval results also include partial matches. The most common IR model, the vector space model, produces documents ranked according to their relevance to the query. So the evaluation of information retrieval should include a metric that evaluates how precise the answer of the IR system is.

The most commonly used metrics for relevancy evaluation of IR are recall and precision. Consider a database where there are 100 documents related to the general field of data extraction. A query on "text mining" may retrieve 400 documents. If only 40 of the retrieved documents are about data extraction, the recall rate of the tested engine is 40%, since the database contains 100 documents on data extraction (Schweitzer, 2003). Since only 40 documents among 400 matched the request of the user, the precision rate of the engine on this test is 10%. See Figure 2-2. If the desired set of returned documents (i.e., the target) is known, the recall rate is the proportion of returned documents that match the target with respect to the total size of the target. The precision is the proportion of relevant documents in the document set returned by the system.

$$\text{Recall} = \frac{|\text{Relevant} \cap \text{Retrieved}|}{|\text{Relevant}|}, \qquad \text{Precision} = \frac{|\text{Relevant} \cap \text{Retrieved}|}{|\text{Retrieved}|}$$

[Figure 2-2. Recall rate and precision. Among all documents, 100 are relevant and 400 are retrieved; 40 are both relevant and retrieved.]

Trivially, if an algorithm always retrieves all documents in a document base, it has one hundred percent recall. However, this retrieval has low precision because it is unlikely that all documents match the query. In this sense, precision and recall have an inverse relation, shown in Figure 2-3. In many evaluations, precision is measured at a fixed number of retrieved documents, e.g. "precision at 25," which gives a measure of how well an algorithm delivers at the top of the retrieved list. In others, recall and precision are plotted against each other: for example, precision at 50 percent recall indicates how many irrelevant documents readers must examine before they have found half of the relevant documents. In the Text REtrieval Conference (TREC) evaluations an "11-point" average measure is used, with precision measured at every 10 percent of recall: at 0 percent recall, at 10 percent recall, at 20 percent recall, and so forth to 100 percent recall, where all relevant documents are assumed to have been retrieved (Baeza and Neto, 1999, p. 76). The average precision over all those recall points is used as the total measure.

[Figure 2-3. Relationship between recall and precision. Precision plotted against recall.]

Several methods help to maximize recall rates, for example, query expansion using synonyms. Using this method, a search engine will also find documents on data extraction provided that its thesaurus contains "data" as a synonym for "text" and "extraction" as a synonym for "mining." Significant research is currently being performed on man-made thesauri to ensure that all documents that could match a query are actually found (Foskett, 1997).

2.5 Useful Techniques

Other than the core information retrieval algorithm, there are a number of techniques that are mandatory for IR processing, such as document preprocessing, stopword removal, and stemming. This section discusses several of these text processing techniques that might improve IR performance.

2.5.1 Stopword Removal

Stopwords are words that occur very frequently among documents in the collection. In general, stopwords do not carry any useful information. Articles, prepositions, and conjunctions such as "in," "of," "the," etc., are natural candidates for a list of stopwords.
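As a minimal sketch of the idea, the following Python fragment filters a token list against a stopword list. The list `STOPWORDS` and the function name `remove_stopwords` are assumptions made for illustration only; real IR engines use much longer, language-specific lists.

```python
# Hypothetical, deliberately tiny stopword list (articles, prepositions,
# conjunctions); it is an illustration, not a list from any real engine.
STOPWORDS = {"a", "an", "and", "in", "of", "or", "the", "to"}


def remove_stopwords(tokens):
    """Return the tokens that are not stopwords (case-insensitive match)."""
    return [t for t in tokens if t.lower() not in STOPWORDS]


query = "the major cities in Florida".split()
index_terms = remove_stopwords(query)
print(index_terms)
```

Only the content-bearing words of the query survive as index terms, which is also why the index structure becomes smaller.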
Stopword removal has often been shown to improve retrieval effectiveness, even though many term weighting approaches are designed to give a lower weight to terms appearing in many documents. It also reduces the size of the index term structure. Stopword removal is built into many IR engines. In some situations, stopword removal causes reduced recall. For example, if the user's query is "to be or not to be," the only index term left after stopword removal is "be." As a result, some search engines do not adopt stopword removal. They use full text indexing instead.

2.5.2 Stemming

Stemming is the process of removing affixes (i.e., prefixes and suffixes), which allows the retrieval of documents containing syntactic variations of query terms (Yates and Neto, 1999, p. 165). This can involve, for instance, removing the final "s" from plural nouns or converting verbs to their base form ("go" and "goes" both become "go," etc.). The most widely known stemming algorithm is the Porter algorithm (Porter, 1980), which is built into many information retrieval engines. The Porter algorithm uses a suffix list for suffix stripping. The algorithm has several rules applicable to the suffixes of words. For example, one rule converts plural forms into their singular forms by replacing the suffix letter "s" with nil.

2.5.3 Passage Retrieval

Passage retrieval is the process of retrieving text in smaller units than complete documents. The basic assumption of passage retrieval is that terms occurring together inside a meaningful unit such as a sentence carry more meaning than terms scattered across a document. Callan (1994) describes several approaches to passage identification, including paragraph recognition and window-based approaches, in which the position of the passage is determined by the positions in the document of the terms matching the query. In the classical information retrieval method, the order and distance of index terms in the documents and the query have no meaning.
If we use a word as the index term unit and multiple closely located words combine to form a specific phrase, the order and distance among the index terms can make a difference compared with treating the terms as unordered.

2.5.4 Query Expansion

Whenever a user wants to retrieve a set of documents, the user starts by constructing a concept of the topic of interest. Such a conceptualization is called the "information need." Given an "information need," the user must formulate a query that is adequate for the information retrieval system. Usually, the query is a collection of index terms, which might be erroneous or improper initially. In this case, the query should be reformulated to obtain the desired result. The reformulation process is called query expansion. One of the simplest techniques involves the use of a thesaurus to find synonyms for some or all of the terms in the query. These synonyms are added to the query to broaden the search. The thesaurus used can be manually generated for a specific domain, such as the medical domain. But for a general domain like the Web, it is hard to build a knowledge base such as a thesaurus because the documents from the general domain are comparatively new, large in number, and dynamically changing. Various algorithms have been suggested for generating thesauri automatically. For example, Crouch and Yang (2000) suggest a method based on clustering and term discrimination value theory.

Another widely used method of query expansion is the use of relevance feedback. This involves the user performing a preliminary search, then examining the documents returned and deciding which are relevant. Finally, terms from these documents are added to the query and the search is repeated. This obviously requires human intervention and, as a result, is inappropriate in many situations.
However, there is a similar approach, sometimes called pseudo-relevance feedback, in which the top few documents from an initial query are assumed relevant and are used for automatic feedback (Mitra et al. 1998).

2.5.5 Using Phrase

Many information retrieval systems are based on a vector space model (VSM) that represents a document as a vector of index terms. The classical VSM uses a word as an index term. To improve retrieval accuracy, it is natural to replace word stems with concepts. For example, if the document domain is medical, replacing word stems with Unified Medical Language System (UMLS) codes is a possible way to include concepts in information retrieval. However, previous research showed not only no improvement but a degradation in retrieval accuracy when concepts were used in document retrieval. Replacing word stems with multiple-word combinations was also studied. One study used a phrase as an indexing term (Mao and Chu, 2002). A phrase is a string of words used to represent a concept. The conceptual similarity and common word stems jointly determine the correspondence between two phrases, which yields an increase in retrieval accuracy when compared to the classical VSM. Separating the weighting of phrases and terms in the VSM has also been suggested (Shuang et al. 2004). Shuang et al. considered phrases to have more importance than individual terms in information retrieval. They used a tuple of two separate similarity measures between the document and the query, (phrase-sim, term-sim), where phrase-sim is the similarity obtained by matching the phrases of the query against the documents and term-sim is the usual similarity measure used in the VSM. Documents are ranked in descending order of (phrase-sim, term-sim), where phrase-sim has the higher priority.

2.6 Enhancement of IR Through Given Knowledge

2.6.1 Using WordNet

WordNet is an electronic lexical database developed at Princeton University beginning in 1985 (Miller, 1990).
WordNet 2.0 has over 130,000 word forms. It is widely used in natural language processing, artificial intelligence, and information technology applications such as information retrieval, document classification, question-answering systems, language generation, and machine translation. The basic building blocks of WordNet are synonym sets ("synsets"), which are unordered sets of distinct word forms and which correspond closely to what are called "concepts." Examples of synsets are {car, automobile} or {shut, close}. WordNet 2.0 contains some 115,000 synsets. There are two kinds of relations in WordNet: semantic and lexical relations. Examples of semantic relations are "is-a," "part-of," "cause," etc. An "is-a" semantic relation hierarchically organizes nouns and verbs from the top generic concepts to the bottom specific concepts. Examples of lexical relations are synonymy and antonymy.

There have been several attempts to use WordNet for information retrieval (Chai and Biermann, 1997). Query expansion is one method; it expands the query with terms of similar meaning using a thesaurus like WordNet. This technique increases the chances of retrieving more relevant documents. Several other research projects on query expansion using WordNet have been performed (Voorhees, 1994), but the results are not good: there is a small increase in recall but a degradation in precision. Rila et al. (1998) concluded that the degradation of IR performance using WordNet is caused by the poorly defined structure of WordNet. It is impossible to find relationships between terms with different parts of speech because words in WordNet are grouped by part of speech. Most of the relationships between two terms are not found in WordNet because WordNet handles general lexical knowledge.
Sanderson described most efforts in information retrieval using WordNet and noted that a simple dictionary-based (or thesaurus-based) word sense representation has not been shown to greatly improve retrieval effectiveness (Sanderson, 2000). A recent study on word sense disambiguation in information retrieval using WordNet (Kim et al. 2004) shows the possibility of improving IR performance using WordNet knowledge. The authors proposed a root sense tagging approach. They noticed that the traditional methods described in the previous paragraph used fine-grained disambiguation for IR tasks. For example, the word "stock" has 17 different senses in WordNet, all of which are candidates in fine-grained word sense disambiguation. Root senses, in contrast, are broad classifications that include "act," "animal," "artifact," "attribute," "body," etc. Using these classifications when performing word sense disambiguation, called coarse-grained disambiguation, showed an improvement in retrieval effectiveness.

2.6.2 Using UMLS, SNOMED

Medical language is extremely rich, varied, and difficult to comprehend and standardize, and it is vague and imprecise. As a result, there have been many efforts to build medical term dictionary structures such as the Unified Medical Language System (UMLS) and the Systematized Nomenclature of Medicine (SNOMED). SNOMED is a hierarchically organized and systematized multiaxial nomenclature of medical and scientific terms. We provide more detail on SNOMED in Chapter 3. The terms in SNOMED and UMLS often require expert knowledge, so nonexperts like patients and lawyers cannot recognize the terms used. This problem motivates efforts to combine WordNet and UMLS (Barry and Fellbaum 2004): since WordNet was not built for domain-specific applications, there is a need for a lexical database designed specifically for natural language processing in the medical domain. This approach expands the synonym thesaurus, which in turn enables query expansion in information retrieval.

There are many efforts to visualize the concept of information.
Sometimes a figure is worth a thousand words (Pfitzner et al. 2003), since a picture facilitates a user's understanding of the presented information. Keynets, developed by Kenneth (http://[...]/key/fast/fast.html), is one information visualization technique for representing information in a visual manner. To extract meaning from technical documents, ontologies such as UMLS and semantic frameworks like Keynets can be combined, which improves the accuracy and expressiveness of natural language processing.

2.7 Summary

We described three classical information retrieval models: Boolean, Vector, and Probabilistic. There have been several attempts to augment the information retrieval process with knowledge, such as query expansion and using a phrase as a search term. Our attempts to incorporate knowledge in IR involve using a knowledge source directly as a form of knowledge representation. Possible candidates for knowledge sources include UMLS and SNOMED. Our developed model uses knowledge in the form of a semantic network and a Bayesian network. The next chapter explains the background required to understand the knowledge base, especially the probabilistic Bayesian network model.

CHAPTER 3
KNOWLEDGE REPRESENTATION BY BAYESIAN NETWORK

As we will see, the knowledge in our experimental domain (pathology) consists of two types. The first is predefined knowledge that can be used in describing data (i.e., a patient's report). This type of knowledge can be expressed well using a semantic network. The second type of knowledge is obtained from data and is not predefined. Normally, experts describe this knowledge after analyzing the data. Errors may be introduced during the writing and analyzing process, which means there is uncertainty in the knowledge. This type of knowledge can be modeled well by a probability model, especially the Bayesian network.
This chapter presents a discussion on knowledge representation issues, concentrating on semantic networks and Bayesian networks, and surveys some of the relevant literature.

3.1 Semantic Networks

Semantic networks are often used as a form of knowledge representation. They were developed for representing knowledge within English sentences by representing human memory's structure of having a large number of connections and associations between the different pieces of information contained in it. Today, the term associative networks is more widely used to describe these networks since they are used to represent more than just semantic relations. They are widely used to represent physical and/or causal associations between various concepts or objects. A semantic network is a directed graph consisting of vertices that represent concepts and edges that represent semantic relations between the concepts. An important feature of any associative network is the associative links that connect the various nodes within the network. It is this feature that makes associative graphs different from simple directed graphs. Within knowledge-based systems, associative networks are most commonly used to represent semantic associations. In the more technically oriented applications, they can be used to express both the physical and causal structure of systems.

The important semantic relations often used within a semantic network are:

* Meronymy (A is part of B),
* Holonymy (B has A as a part of itself),
* Hyponymy (or troponymy) (A is subordinate of B; A is kind of B),
* Hypernymy (A is superordinate of B),
* Synonymy (A denotes the same as B), and
* Antonymy (A denotes the opposite of B).

An example of a semantic network is WordNet, a lexical database of English. A major problem of semantic networks is that although the name of this knowledge representation contains the word "semantic," there is no clear semantics of the various network representations.
By representing the knowledge explicitly within an associative network, a knowledge-based system obtains a higher level of understanding of the actions, causes, and events that occur within a domain. The higher level of understanding allows the system to reason more completely about problems that exist within the domain and to develop better explanations in response to user queries (Gonzalez and Dankel 1988, p. 167).

3.2 Probability Principles and Calculus

This section provides the core principles necessary to understand Bayesian calculus, which is the base model of the proposed knowledge base. This section starts with the basics of probability calculus. Then, it introduces the concepts of subjective probability and conditional probability.

Probability is a method for articulating uncertainty. It provides a quantitative method for encoding likelihood and thus a quantitative understanding of uncertainty. Probabilistic methods and models give us the ability to attach numbers to the likelihood of various results.

The standard view of probability is the frequentist view. This view says that probability is really a statement of frequency. You can obtain a probability by watching recurring events repeat over time. For example, the probability of a hurricane hitting Florida during hurricane season can be determined by examining the historical record of where hurricanes have struck the USA. In this view, probability is something that is inherent in the process.

An alternative view of probability that is very useful to artificial intelligence research is the subjective view, or Bayesian view. In the subjective view, probability is a model of your degree of belief in some event. A Bayesian probability is the value or belief of the person who assigns the probability (e.g., your degree of belief that a coin will land heads), whereas a classical probability is based on the physical properties of the world (e.g., the probability that a coin will land heads).
In light of these statements, a degree of belief in an event is referred to as a Bayesian or personal probability, while the classical probability is referred to as the true or physical probability of that event.

Probability is a logic and a language for talking about the likelihood of events. An event is a set of atomic events, which is a subset of the universe of all events. A probability distribution is a function that maps events into the range of values between 0 and 1. Probability satisfies the following properties:

$$P(\text{true}) = P(\text{Universe}) = 1,$$
$$P(\text{false}) = P(\emptyset) = 0, \text{ and}$$
$$P(A \cup B) = P(A) + P(B) - P(A \cap B).$$

A random variable describes a probability distribution in which the atomic events are the possible values that could be given to the variable. If we have multiple random variables, we can talk about their joint distribution, that is, the probability assignment to all combinations of the values of the random variables. In general, the joint distribution cannot be computed from the individual distributions. If we know all values of the joint distribution, we can answer any probability question. But if the domain is big, the complexity grows exponentially.

We can introduce the concept of conditional probability:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} \tag{3-1}$$

This is the probability of A given B; it states that we are restricting our consideration to just the part of the world in which B is true. We can derive Bayes' rule from the definition of conditional probability:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} \tag{3-2}$$

To make this more concrete, consider the medical domain, where we have diseases and the symptoms associated with each disease:

$$P(\text{disease} \mid \text{symptom}) = \frac{P(\text{symptom} \mid \text{disease})\,P(\text{disease})}{P(\text{symptom})}$$

The probability of a symptom given a disease is generally constant and does not change according to the particular situation or patient. So it is easier, more useful, and more generally applicable to learn these causal relationships. For this reason Bayes' rule is of practical importance when working with conditional probabilities.
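A small numeric sketch may make Equation 3-2 concrete. The function name `bayes_posterior` and all three input probabilities below are hypothetical values chosen only for illustration; the denominator P(symptom) is expanded over the two cases "disease" and "no disease."

```python
def bayes_posterior(p_symptom_given_disease, p_disease,
                    p_symptom_given_no_disease):
    """Return P(disease | symptom) using Bayes' rule.

    P(symptom) is obtained by summing over the two cases
    'disease present' and 'disease absent'.
    """
    p_no_disease = 1.0 - p_disease
    p_symptom = (p_symptom_given_disease * p_disease
                 + p_symptom_given_no_disease * p_no_disease)
    return p_symptom_given_disease * p_disease / p_symptom


# Made-up numbers: a rare disease (1%) whose symptom is common among
# patients who have it (90%) but also occurs without it (10%).
posterior = bayes_posterior(0.9, 0.01, 0.1)
print(round(posterior, 4))
```

Under these assumed numbers the posterior is 0.009 / 0.108, roughly 0.083: the symptom raises the belief in the disease well above its 1% prior, yet the disease remains unlikely because it is rare.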
We can use the conditioning rule to obtain P(A):

$$P(A) = P(A \mid B)\,P(B) + P(A \mid \neg B)\,P(\neg B) = P(A \cap B) + P(A \cap \neg B)$$

We say A and B are independent if and only if the probability that A and B are both true is the product of the individual probabilities of A and B being true:

$$P(A \cap B) = P(A)\,P(B), \qquad P(A \mid B) = P(A), \qquad P(B \mid A) = P(B)$$

Independence is essential for efficient probabilistic reasoning. There is a more general notion, which is called conditional independence. This states that A and B are conditionally independent given C if and only if the probability of A given B and C is equal to the probability of A given C:

$$P(A \mid B, C) = P(A \mid C), \qquad P(B \mid A, C) = P(B \mid C), \qquad P(A \cap B \mid C) = P(A \mid C)\,P(B \mid C)$$

We can solve for the Bayesian network probability distribution using Bayes' rule and conditional independence:

$$P(C \mid T, X) = \frac{P(T, X \mid C)\,P(C)}{P(T, X)}$$

Assume T and X are conditionally independent given C. Then

$$P(C \mid T, X) = \frac{P(T \mid C)\,P(X \mid C)\,P(C)}{P(T, X)}$$

Figure 3-1. Example of the probability for combined evidence

We can obtain P(T, X) from the following equations:

$$P(C \mid T, X) + P(\neg C \mid T, X) = 1$$
$$\frac{P(T \mid C)\,P(X \mid C)\,P(C)}{P(T, X)} + \frac{P(T \mid \neg C)\,P(X \mid \neg C)\,P(\neg C)}{P(T, X)} = 1$$
$$P(T \mid C)\,P(X \mid C)\,P(C) + P(T \mid \neg C)\,P(X \mid \neg C)\,P(\neg C) = P(T, X)$$

3.3 Bayesian network

A Bayesian network is an efficient factorization of the joint probability distribution over a set of variables. If we want to know everything in the domain, we need to know the joint probability distribution over all those variables. If the domain is complicated, with many different propositional variables, the solution is infeasible. For example, if you have N binary variables, then there are $2^N$ possible assignments, and the joint probability distribution requires a number for each one of those possible assignments. The intuition behind Bayesian networks is that there is almost always some separability between the variables (i.e., some independence), so that we do not actually have to know all of those $2^N$ numbers to know what is occurring in the world. Bayesian networks have two components.
The first component is called the "causal component." It describes the structure of the domain in terms of the dependencies between variables. The second component is the quantitative part, the actual numbers.

There are three connection types in Bayesian networks. The first is the forward serial connection shown in Figure 3-2. Evidence is transmitted from A to C through B unless B is instantiated (i.e., its truth value is known). Evidence also propagates backward through the serial links as long as the intermediate node is not instantiated. If the intermediate node is instantiated, then evidence does not propagate.

[Figure 3-2. Forward serial connection Bayesian network example: A → B → C.]

[Figure 3-3. Diverging connection Bayesian network example: A ← B → C.]

[Figure 3-4. Converging connection Bayesian network example: A → B ← C.]

The second connection type is the diverging connection shown in Figure 3-3. In a diverging connection, there are arrows going from B to A and from B to C. If B is not instantiated, the evidence of A propagates through to C. But if B is instantiated, the propagation is blocked.

The tricky case is the converging connection shown in Figure 3-4, where A points to B and C points to B. Let us first think about the case when neither B nor any of its descendants is instantiated. In that case, evidence does not propagate from A to C. For example, suppose B is "sore throat," A is "bacterial infection," and C is "viral infection." If we find that someone has a bacterial infection, it gives us information about whether they have a sore throat, but it does not affect the probability that they also have a viral infection. But when either node B or one of its descendants is instantiated, we know something about whether B is true. In that case, information does propagate from A to C.

If two variables are d-separated, then changing the uncertainty on one does not change the uncertainty on the other.
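The behavior of the converging connection can be checked numerically by brute-force enumeration of the joint distribution. The sketch below uses the sore-throat example (A is bacterial infection, B is sore throat, C is viral infection); every probability value and every name in it (`P_A`, `P_C`, `P_B_GIVEN`, `joint`, `prob_c`) is a hypothetical choice made for illustration.

```python
from itertools import product

P_A = 0.1  # hypothetical prior for bacterial infection (A)
P_C = 0.2  # hypothetical prior for viral infection (C)
# Hypothetical table P(B = true | A, C), keyed by (a, c), for sore throat (B).
P_B_GIVEN = {
    (True, True): 0.95, (True, False): 0.80,
    (False, True): 0.70, (False, False): 0.05,
}


def joint(a, b, c):
    """P(A=a, B=b, C=c) for the converging network A -> B <- C."""
    pa = P_A if a else 1 - P_A
    pc = P_C if c else 1 - P_C
    pb = P_B_GIVEN[(a, c)] if b else 1 - P_B_GIVEN[(a, c)]
    return pa * pb * pc


def prob_c(**evidence):
    """P(C = true | evidence), summing the joint over consistent worlds."""
    num = den = 0.0
    for a, b, c in product([True, False], repeat=3):
        world = {"a": a, "b": b, "c": c}
        if all(world[name] == value for name, value in evidence.items()):
            p = joint(a, b, c)
            den += p
            if c:
                num += p
    return num / den


print(prob_c())                # prior belief in C
print(prob_c(a=True))          # B unobserved: evidence on A leaves C unchanged
print(prob_c(b=True))          # B observed
print(prob_c(a=True, b=True))  # B observed: evidence on A now changes C
```

With these assumed numbers the first two values are both 0.2, so A carries no information about C while B is unobserved. Once the sore throat is observed, the belief in a viral infection is about 0.59, and it drops to about 0.23 when the bacterial infection is also known, an effect often called explaining away.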
Two variables A and B are "d-separated" if and only if, for every path between them, there is an intermediate variable V such that either the connection is serial or diverging and V is known, or the connection is converging and neither V nor any of its descendants has evidence. For example, if the connection A, B, C is serial, it is blocked when B is known and connected otherwise. When it is connected, information can flow from A to C or from C to A.

Bayesian networks are sometimes called belief networks or Bayesian belief networks. A Bayes net consists of three components: a finite set of variables, each of which has a finite domain; a set of directed arcs between the nodes, forming an acyclic graph; and, for every node A with parents $B_1$ through $B_n$, a specified conditional probability distribution $P(A \mid B_1, \ldots, B_n)$. The crucial theorem about Bayesian networks is that if A and B are d-separated given some evidence e, then A and B are conditionally independent given e; that is, $P(A \mid B, e) = P(A \mid e)$. We can exploit these conditional independence relationships to make inference efficient.

The chain rule results from the conditional independence relationships of Bayesian networks. Let us assume there are n Boolean variables $V_1, \ldots, V_n$. The joint probability distribution is the product of all the individual probability distributions that are stored in the nodes of the graph:

$$P(V_1 = v_1, V_2 = v_2, \ldots, V_n = v_n) = \prod_{i=1}^{n} P(V_i = v_i \mid \mathrm{parents}(V_i)) \tag{3-3}$$

[Figure 3-5. Example of chain rule. Network with arcs A → C, B → C, and C → D, and local tables P(A), P(B), P(C | A, B), and P(D | C).]

If we compute the probability that A, B, C, and D are all true, we can use conditioning to write

$$P(ABCD) = P(D \mid ABC)\,P(ABC)$$

We can simplify $P(D \mid ABC)$ to $P(D \mid C)$ because, given C, D is d-separated from A and B. We have $P(D \mid C)$ stored directly in a local probability table, so we are done with this term. Now we can use conditioning to write $P(ABC)$ as $P(C \mid AB)$ times $P(AB)$. The term $P(AB)$ can in turn be simplified by d-separation.
$$P(A, B, C) = P(C \mid A, B)\, P(A, B) = P(C \mid A, B)\, P(A)\, P(B)$$

For each variable, we just have to condition on its parents. Then we multiply the results together to obtain the joint probability distribution. This means that whenever the graph encodes some independence (that is, whenever it has fewer arrows than a fully connected graph), less work is needed to compute the joint distribution.

3.4 Noisy-OR: Bayesian network inference

Imagine that there are three possible causes for having a fever: flu, cold, and malaria. The network of Figure 3-6 encodes the fact that flu, cold, and malaria are mutually independent of one another.

Figure 3-6. Example of noisy-OR (Flu, Cold, and Malaria each point to Fever)

In general, the conditional probability table for fever has to specify the probability of fever for all possible combinations of values of flu, cold, and malaria. This is a large table, and it is hard to assess. Physicians, for example, probably do not think very well about combinations of diseases. It is more natural to ask them for individual conditional probabilities: what is the probability that someone has a fever if they have the flu? We are essentially ignoring the influence of cold and malaria while we think about the flu. The same goes for the other conditional probabilities: we can ask about $P(\text{fever} \mid \text{cold})$ and $P(\text{fever} \mid \text{malaria})$ separately.

We are assuming that the causes act independently, which reduces the set of numbers that we need to acquire. If the patient has flu, and the connection is on, then he will certainly have fever. Thus it is sufficient for one connection to be made into fever from any of its causes that is positive. If none of the causes is true, then the probability of fever is assumed to be zero (though it is always possible to add an extra cause that is always true, but which has a weak connection, to model the possibility of getting a fever "for no reason").

Here is the general formula for a noisy-OR. Assume we know $P(\text{effect} \mid \text{cause})$ for each possible cause.
We are also given a set, $C_T$, of causes that are true for a particular case. To compute the probability of E given C, we first compute the probability of not E given C:

$$P(E \mid C) = 1 - P(\neg E \mid C) \qquad (3\text{-}4)$$

That is equal to the probability of not E given just the causes that are true in this case, $C_T$. Because of the assumption that the causes operate independently (that is, whether one is in effect is independent of whether another is in effect), we can take the product over the true causes of the probability of the effect being absent given each cause.

Figure 3-7. General architecture of the noisy-OR model (causes C1, C2, C3 each point to Effect)

Finally, we can convert each probability of not E given $C_i$ into one minus the probability of E given $C_i$:

$$P(\neg E \mid C) = P(\neg E \mid C_T) = \prod_{C_i \in C_T} P(\neg E \mid C_i)$$
$$P(E \mid C) = 1 - \prod_{C_i \in C_T} \bigl(1 - P(E \mid C_i)\bigr) \qquad (3\text{-}5)$$

3.5 QMR-DT model

The QMR-DT model is a two-level, or bipartite, Bayesian network intended for use as a diagnostic aid in the domain of internal medicine. We provide a brief overview of the QMR-DT model here; for further details see Shwe and Cooper (1991).

The QMR-DT model is a bipartite graphical model in which the upper layer of nodes represents diseases and the lower layer of nodes represents symptoms. There are approximately 600 disease nodes and 4000 symptom nodes in the database proposed by Shwe and Cooper (1991).

The evidence is a set of observed symptoms, which are referred to as "findings." We use the symbol f to represent the vector of findings. The symbol d denotes the vector of diseases. All nodes are binary, thus the components $f_i$ and $d_j$ are binary random variables. The diseases and findings occupy the nodes on the two levels of the network, respectively, and the conditional probabilities specifying the dependencies between the levels are assumed to be noisy-OR gates (Pearl 1988). There are a number of simplifying assumptions in this model.
First, in the absence of findings, the diseases appear independent of each other with their respective prior probabilities (i.e., marginal independence), although some diseases probably do depend on other diseases. Second, the findings are conditionally independent given the diseases.

The probability model implied by the QMR-DT belief network can be written as the joint probability of diseases and findings:

$$P(f, d) = P(f \mid d)\, P(d) = \Bigl[\prod_i P(f_i \mid d)\Bigr] \Bigl[\prod_j P(d_j)\Bigr] \qquad (3\text{-}6)$$

where d and f are binary (1/0) vectors referring to the presence/absence states of the diseases and the positive/negative states or outcomes of the findings, respectively. The prior probabilities of the diseases, $P(d_j)$, were obtained by Shwe et al. from archival data. The conditional probabilities $P(f_i \mid d)$ for the findings given the states of the diseases were obtained from expert assessments and are assumed to be noisy-OR models:

$$P(f_i = 0 \mid d) = P(f_i = 0 \mid L) \prod_{j \in pa_i} P(f_i = 0 \mid d_j) \qquad (3\text{-}7)$$
$$= (1 - q_{i0}) \prod_{j \in pa_i} (1 - q_{ij})^{d_j} \qquad (3\text{-}8)$$

where $pa_i$ (parents of i) is the set of diseases pertaining to finding $f_i$; $q_{ij} = P(f_i = 1 \mid d_j = 1)$ is the probability that disease j, if present, could alone cause the finding to have a positive outcome; and $q_{i0} = P(f_i = 1 \mid L)$ is the "leak" probability, i.e., the probability that the finding is caused by means other than the diseases included in the belief network model. The effect of each additional disease, if present, is to contribute an additional factor of $(1 - q_{ij})$ to the probability that the ith finding is absent.

3.6 Bayesian Classifiers

In this section, we introduce some of the classifiers in the form of Bayesian networks that can be used in modeling medical diagnosis. We can define the classification problem as a function assigning labels to observations (Miquelez et al. 2004, p. 340). If there is a vector $x = (x_1, \ldots, x_n) \in \mathbb{R}^n$ and a class variable C, we can regard the classifier as a function $\gamma: (x_1, \ldots, x_n) \to \{1, 2, \ldots, |C|\}$ that assigns labels to observations.
This can be rewritten to select the class with the highest posterior probability, i.e.,

$$\gamma(x) = \arg\max_{c} \, p(c \mid x_1, \ldots, x_n)$$

We can use the Bayesian classifier in medical diagnostics to find the probable disease from the given symptoms. In the following chapters we use the notation O (outcome) for the class variable C and F (finding) for the observed variables. We use capital letters for variable names and small letters for their values.

3.6.1 Naive Bayes

The concept that combines Bayes' theorem and the conditional independence hypothesis has been proposed under several names: idiot Bayes (Ohmann et al. 1988), naive Bayes (Kononenko, 1990), simple Bayes (Gammerman and Thatcher 1991), or independent Bayes (Todd and Stamper 1994).

The naive Bayes (NB) approach (Minsky, 1961) is the simplest form of classifier based on Bayesian networks. The outcome variable O is defined as the common parent of the findings, $F = \{F_1, \ldots, F_n\}$, and each of the findings $F_i$ is a child of the outcome variable O.

The shape of the network is always the same: all variables $F_1, \ldots, F_n$ are considered to be conditionally independent given the value of the outcome variable O, which is the main assumption of NB. This is a conditional probability model. We can calculate the posterior probability using Bayes' rule and conditional independence:

$$P(O \mid F_1, \ldots, F_n) = \frac{P(O)\, P(F_1, \ldots, F_n \mid O)}{P(F_1, \ldots, F_n)} \propto P(O) \prod_{i=1}^{n} P(F_i \mid O)$$

The main advantage of this approach is that the structure is always fixed and simple to calculate, because the order of dependence to be found is fixed and reduces to two variables. Restricting the model to the conditional probability distributions $P(F_i \mid O)$ gives a considerable reduction in the number of parameters necessary. The naive Bayes model requires only 2n + 1 parameters, where n is the number of findings $F_i$, whereas the joint probability requires on the order of $2^n$ parameters. However, the model allows no relationships between findings, which is not realistic in the real world.
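The naive Bayes posterior described above can be illustrated with a minimal sketch, assuming binary findings and hand-picked probabilities. The function name `naive_bayes_posterior` and all numbers in the example are hypothetical and serve only to show the arithmetic; they are not taken from the dissertation's data.

```python
def naive_bayes_posterior(prior, likelihoods, findings):
    """Posterior P(O | F_1..F_n) under the naive Bayes assumption.

    prior:       dict mapping each outcome o to P(O = o)
    likelihoods: dict mapping each outcome o to a list of P(F_i = 1 | O = o)
    findings:    list of observed 0/1 values, one per finding
    """
    scores = {}
    for outcome, p_outcome in prior.items():
        score = p_outcome
        for p_positive, observed in zip(likelihoods[outcome], findings):
            # Multiply in P(F_i = f_i | O = o) for each finding independently.
            score *= p_positive if observed else (1.0 - p_positive)
        scores[outcome] = score

    total = sum(scores.values())  # P(F_1..F_n), the normalizing constant
    if total == 0:
        raise ValueError("the findings have zero probability under every outcome")
    return {outcome: score / total for outcome, score in scores.items()}


# Hypothetical two-outcome, two-finding example.
prior = {"flu": 0.3, "cold": 0.7}
likelihoods = {"flu": [0.9, 0.2], "cold": [0.4, 0.8]}
posterior = naive_bayes_posterior(prior, likelihoods, findings=[1, 0])
best = max(posterior, key=posterior.get)  # the arg max over outcomes
print(best, round(posterior[best], 4))    # flu 0.7941
```

Here the unnormalized scores are 0.3 × 0.9 × 0.8 = 0.216 for "flu" and 0.7 × 0.4 × 0.2 = 0.056 for "cold", so the normalized posterior of "flu" is 0.216 / 0.272.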
There is extensive literature showing that even these kinds of simple computational models can perform surprisingly well (Domingos and Pazzani 1997) and are able to obtain results comparable to other, more complex classifiers.

3.6.2 Selective Naive Bayes

The selective naive Bayes differs subtly from the naive Bayes in that it selects among the findings. In the selective naive model, not all variables have to be present in the final model (Kohavi and John 1997; Langley and Sage 1994). The naive Bayes model requires that all variables appear in the model, but for some types of classification problems some variables could be irrelevant or redundant for classification purposes. It is known (Liu and Motoda 1998; Inza et al. 2000) that the naive Bayes paradigm degrades in some of these cases, which is the motivation for removing variables in the selective naive Bayes (Miquelez et al. 2004, p. 340).

3.6.3 Semi-naive Bayes

The intuition in the semi-naive Bayes model is that we can combine variables (i.e., findings) together (Kononenko, 1991). It allows groups of variables to be considered as a single node in the Bayesian network, aiming to avoid the strict premises of the naive Bayes paradigm.

3.6.4 Tree Augmented Naive Bayes

In the tree augmented naive Bayes (Friedman et al. 1997), the dependencies between variables other than C are taken into account. The model represents the relationships between the variables $X_1, \ldots, X_n$, conditional on the class variable C, by using a tree structure. The tree augmented naive Bayes structure is built using a two-phase procedure. First, the dependencies between the different variables $X_1, \ldots, X_n$ are learned.
This algorithm uses a score based on information theory, and the weight of a branch $(X_i, X_j)$ on a given Bayesian network S is defined by the mutual information measure conditional on the class variable as

$$I(X_i, X_j \mid C) = \sum_{c} P(c)\, I(X_i, X_j \mid C = c) = \sum_{c} \sum_{x_i} \sum_{x_j} P(x_i, x_j, c) \log \frac{P(x_i, x_j \mid c)}{P(x_i \mid c)\, P(x_j \mid c)}$$

With these conditional mutual information values the algorithm builds a tree structure. In the second phase, the structure is augmented into the naive Bayes paradigm.

3.6.5 Finite Mixture (FM) model

The finite mixture (FM) model tries to relax the conditional independence assumption in the naive Bayes model (Cheeseman and Stutz 1996). In an FM model, all the dependencies between observed variables, both the findings and the outcome variable, are assumed to be modeled by a single discrete latent (i.e., unobserved) variable (Monti and Cooper 1998, p. 593). In an FM model the outcome variable is itself a child node, and the common parent is a latent variable.

3.7 Summary

We described two knowledge representation models: semantic networks and Bayesian networks. There have been attempts to model medical diagnosis using probabilistic Bayesian models. Shwe's QMR-DT model is a two-level noisy-OR model using disease and symptom nodes, where the nodes in the same layer are independent. The QMR-DT model uses several assumptions to reduce the complexity of the joint probability distribution calculation, but it still shows exponential time complexity when implemented as an algorithm.

There have been several attempts to use Bayesian classifiers in a medical diagnosis model: naive Bayes, selective naive Bayes, semi-naive Bayes, tree augmented naive Bayes, finite mixture model, and finite mixture augmented naive Bayes. Unlike the other models, which model dependencies among findings, naive Bayes assumes conditional independence among the findings. Despite the simplicity of its modeling, naive Bayes shows good performance when compared to other, more complex models.
The next chapter explains the overall architecture of the Knowledge-based Information Retrieval (KBIR) model, which uses semantic networks and naive Bayes as a knowledge model.

CHAPTER 4
KNOWLEDGE BASED INFORMATION RETRIEVAL MODEL ARCHITECTURE

This research developed a knowledge-based information retrieval model for a closed domain. Figure 4-1 is the architecture of the model.

Figure 4-1. Architecture of the knowledge-based information retrieval model

The overall operation of the model is as follows. A classical vector space model (VSM) information retrieval model using term frequency and inverse document frequency creates a query vector (1a) and a document vector (2a). The knowledge base management engine (KME) creates (3b) knowledge from the set of existing documents (3a) before the system operation starts. The KME processes and adds knowledge from any new documents (5b) added to the document space. The Knowledge Conversion Engine (KCE) applies the knowledge (semantics) of the Knowledge Base to the Document Vector (2b, 3d) to create the Conceptual Document Vector (4a). The conventional VSM IR engine calculates the relevance between the query vector and the conceptual document vector (1b, 1c), resulting in a ranked document list (1d).

As a proof of concept, we implemented this model in the domain of pathology. Figure 4-2 is a detailed architecture of the resulting model. The edges of this diagram represent procedures or actions taken in processing the nodes, which represent data or subsystems. Among the procedures shown by the edges, the bold edge processes (1a, 1b, 1c) are online processes, while edges shown with normal lines are offline processes completed before the start of any user's query processing.
For this domain the knowledge base is named the SNOMED Semantic Network Knowledge Base (SNNKB). The SNNKB is part of the KME and is developed from the offline processing (4a) of SNOMED. The documents used in the pathology domain are pathology reports called Anatomic Pathology (AP). Because we preprocessed the AP raw text data into a database, the actual data from the documents used in this system are contained in the Anatomic Pathology Database (APDB). The Document Vector is produced (2b) from the APDB, and the KME creates (2a) the dynamic parts of the SNNKB. When a new document is added (3a), the KME modifies (3b) the Document Vector and the SNNKB. The Knowledge Conversion Engine (KCE) initially makes the Conceptual Document Vector (5c) from the Document Vector and the KME's SNNKB (5a, 5b). To reduce the computational needs, the KCE updates the Conceptual Document Vector (CDV) periodically rather than every time a new document is added.

Figure 4-2. Architecture of the knowledge-based information retrieval model detailed in the example domain

Before we describe the Knowledge Base Management Engine, we describe SNOMED and the characteristics and preprocessing of the example data: the Anatomic Pathology Database (APDB).

4.1 SNOMED

Surgical pathology, cytology, and autopsy reports are highly structured documents describing specimens, their diagnoses, and retrieval and charge specification codes. The Systematized Nomenclature of Medicine (SNOMED), developed by the College of American Pathologists, is used as the retrieval code. It was developed in collaboration with multiple professional societies, scientists, physicians, and computer consultants [Systematized, 1979].
SNOMED II is a hierarchically organized and systematized multiaxial nomenclature of medical and scientific terms. There are six main axes based on the nature of man. These begin with a hierarchical listing of anatomical systems, known as the Topography (or T) axis. Any change in the form of topographic structures throughout life is characterized in the Morphology (or M) axis. Causes or etiologies for those changes are listed in the Etiology (or E) axis. All human functions, normal and abnormal, are listed in the Function (or F) axis. Combinations of Topography, Morphology, Etiology, and Function may constitute a disease entity or syndrome and are classified in the Disease (or D) axis. Using the T, M, E, F, and D axes it is possible to code nearly all anatomic and physiologic features of a disease process, as shown by the example in Figure 4-3.

T + M + E + F = D
Lung + Granuloma + M. tuberculosis + Fever = Tuberculosis

Figure 4-3. The "Equation" of SNOMED disease axes

There is another field that is not part of the disease equation: a Procedure field, classified in the Procedure (or P) axis, which allows identification of services or actions performed on behalf of the patient with the problem.

Pathology reports typically consist of useful, apt, and concrete terms in sentence or template format. The diagnostic terminology in reports and in SNOMED involves standard terms and acceptable synonyms, both of which have the same SNOMED code number (e.g., pneumonia and pneumonitis are both coded T28000 M40000, or lung + inflammation). Pathology reports usually contain a specific field for SNOMED codes. Certain anatomic pathology computer systems include SNOMED files that allow code selection, but automated encoding programs are uncommon. Precoded synoptic templates of diagnostic terms allow consistency for diagnostic encoding, but many diagnostic statements contain descriptive language, semantic forms, and linguistic nuances that make automated coding difficult.
There is a continual need for error checking.

4.2 Anatomic Pathology Database (APDB) Design and Development

Two important characteristics of the APDB patient records are their fixed data and closed domain. The system's target data are patient records from 1980 to the present, which we consider fixed or static, meaning that the dynamic features of the system are minimized. The nomenclature used in a patient report is restricted to the domain of anatomic pathology and related areas of medicine, making it a relatively closed domain. These features provide a good environment and structure for constructing a knowledge base.

Among the several forms of knowledge representation commonly used, the semantic network is widely used for representing simple hierarchical structures. Because SNOMED has a hierarchical architecture, we adopted the semantic network as the knowledge representation method.

4.2.1 Metadata Set Definition

Appendix A shows the metadata set definition used to parse the patient surgical pathology records. There are 25 terms that must be located and separated in the current patient record. These terms serve as attributes in the database table. Because some term names have changed through the years, several synonyms exist for some terms. For example, "SURGICAL PATH NO," "ACC#," and "CYTOLOGY NO" have the same meaning: the sequential number of the patient record in the set.

The parser, a batch program, processes the patient record and creates an output file containing separate patient record fields. The Database (DB) loader reads the output generated by the parser and then stores the results in the DB. The parser also generates an index file that holds proximity information among the words inside the gross description and diagnosis. This can be used in multiple-keyword information searches. The proximity information is needed to rank the relevant results.
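As a rough illustration of how synonymous term names could be folded into one database attribute, the following Python sketch maps the three labels quoted above onto a single canonical field. Everything beyond those three labels is an assumption made for the example: the `LABEL: value` line layout, the attribute name `record_number`, and the helper names `canonical_field` and `parse_record` are hypothetical and are not taken from Appendix A or from the actual parser.

```python
# Hypothetical synonym table: the three labels come from the text, the
# canonical attribute name "record_number" is an assumption.
FIELD_SYNONYMS = {
    "SURGICAL PATH NO": "record_number",
    "ACC#": "record_number",
    "CYTOLOGY NO": "record_number",
}


def canonical_field(label):
    """Map a raw term name to its canonical attribute, or None if unknown."""
    key = " ".join(label.upper().split())  # normalize case and whitespace
    return FIELD_SYNONYMS.get(key)


def parse_record(text):
    """Split an assumed 'LABEL: value' record into canonical fields."""
    fields = {}
    for line in text.splitlines():
        label, separator, value = line.partition(":")
        if not separator:
            continue  # not a 'LABEL: value' line
        name = canonical_field(label)
        if name is not None:
            fields[name] = value.strip()
    return fields


# Records written in different years use different labels for the same field.
print(parse_record("ACC#: A-100"))              # {'record_number': 'A-100'}
print(parse_record("Surgical  Path No: A-101")) # {'record_number': 'A-101'}
```

Keeping the synonyms in a single lookup table means that a newly discovered label variant only requires one new entry rather than a change to the parsing logic.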
4.2.2 Information Processing: Retrieval and Extraction

There are several distinct advantages in processing the pathology patient data. First, the patient record data from 1982 to the present are unique to the University of Florida. They reflect a unique character, both regionally and over that time period. Thus, once the parsing is finished, the analysis of the frequency of words and multiple-word terms has significant meaning. Second, because the patient reports are expressed in standard medical language (which varies slightly from physician to physician), the terms used are sometimes not an exact match to the SNOMED terms. This makes it useful to analyze the patient reports based on the SNOMED terms. Patient reports also have a <Codes> field that shows the matching SNOMED codes. An analysis of the SNOMED code frequency throughout the patient records can give valuable research insight to the pathologist.

These types of analysis can be done statically and can be reported all at once. While this static analysis is extremely useful, most information processing should be done dynamically, because we cannot anticipate all requests that might be made of this knowledge base. For information retrieval purposes, therefore, the terms in the documents were analyzed. This provided the relation between the documents and the terms in the form of a proximity value.

4.3 Summary

We showed the architecture of the developed knowledge-based information retrieval model. The model clearly separates online and offline calculation to make the calculations during the document retrieval process efficient. The knowledge reduction technology enables the offline adaptation of knowledge, which distinguishes this model from other knowledge-based models that incorporate knowledge processing in their retrieval process. We also discussed the experimental domain: pathology and SNOMED.

In the next chapters, we describe the details of this model.
First, we describe the Knowledge Base Management Engine (KME) and a knowledge base structure that contains the domain-specific knowledge in Chapter 5. Second, we provide details on the Knowledge Conversion Engine (KCE) in Chapter 6. There, we describe the query vector, the document vector, and the conceptual document vector. The VSM IR engine uses the same query-document relevancy calculation method as the conventional vector model.

CHAPTER 5
KNOWLEDGE BASE MANAGEMENT ENGINE

The knowledge base for this KBIR system is the Systematized Nomenclature of Medicine (SNOMED). In this chapter we discuss the SNOMED-based knowledge model, which consists of pre-coordinated knowledge and post-coordinated knowledge. The pre-coordinated knowledge is knowledge described in SNOMED that is coded by a pathologist. We can say that this knowledge is the expert knowledge that the pathologist used in writing and understanding a patient's report. The post-coordinated knowledge is a special form of knowledge that can be obtained from a patient's report. It is augmentable knowledge that can be found through the introduction of new data. The knowledge base uses the constructed model in the information retrieval process.

5.1 Semantic Network Knowledge Base Model Representing SNOMED

SNOMED is a detailed and specific coded vocabulary of names and descriptions used in healthcare. It is explicitly designed for use in computerized patient records. We can classify the term-to-term relationships, which are called the "pre-coordinated relationships" in SNOMED, as one of three types. See Figure 5-1.

Figure 5-1. The three types of SNOMED term relation: hierarchical topology (has-a), synonymy (is-a), and multiaxial relation

The first type is a hierarchical topology. The SNOMED terms are all arranged in a hierarchy, represented by an alphanumeric code where each digit represents a specific location in the hierarchy. Figure 5-2 illustrates the hierarchical structure of this knowledge modeled as a semantic network.
Arcs expressing the "part-of" or "has-a" relation connect the nodes of this network. Moving from a lower-level concept to a higher level is generalization, while moving in the opposite direction is specialization.

Figure 5-2. SNOMED hierarchical term relationship (T28000 Lung at the top; T28100 Right Lung and T28500 Left Lung below it; T28110 and T28120 below Right Lung; moving up is generalization, moving down is specialization)

SNOMED has controlled vocabulary characteristics. A controlled vocabulary allows individuals to record data in a patient's record using a variety of synonyms, where each references a primary concept. For example, in SNOMED, the following terms are classified as symptoms of increased body temperature: FEVER, PYREXIA, HYPERTHERMIA, and FEBRILE. Each carries the same term code. Figure 5-3 illustrates another example using the semantic network form. We call the relationship of synonyms an "is-a" relationship. The synonym relation holds explicitly between the terms themselves; there is no propagation among the nodes.

Figure 5-3. SNOMED synonyms relationship (example node: D0110 Bacterial sepsis)

Figure 5-4. SNOMED multiaxial relationship (E6921 Fava bean linked to D4094 Favism)

The third relationship of SNOMED terms is the multiaxial relation shown in Figure 5-4, which refers to the ability of the ordered set of names to express the meaning of a concept across several axes. We can find examples of this relationship over all axes, and it is most apparent in the disease axis. The SNOMED D code representing "Tuberculosis" has an information link to the T code representing "Lung." This relationship is precoded, mirroring the knowledge encoded at the time of SNOMED's standardization.

5.2 Classification of the Post-Coordinated Knowledge

Of the three types of SNOMED relations, the domain-specific knowledge of our model handles only the multiaxial relationships. This relationship is most apparent in the disease axis, with a series of codes from other axes of SNOMED comprising the essential characteristics of a disease.
As detailed in Section 4.1, SNOMED consists of six categories: Topography, Morphology, Etiology, Function, Disease, and Procedures. A patient report has <codes> terms showing matching SNOMED categories and numbers.

It is possible to code most of the anatomic and physiologic elements of a disease process, both normal and abnormal, with the combination of the five axes. These elements are often used to summarize a codable class of disease or a recognized syndrome, which is essentially the SNOMED equation shown in Figure 4-3.

Some of the relations are straightforward, but cases often have unique relationships based on the patient's report. It is possible to develop a unique knowledge base using these relationships. We first find statistics within the pathology document space that form the basis of the post-coordinated knowledge, and then we classify the extracted post-coordinated knowledge.

5.2.1 Statistics of Pathology Patient Report Documents Space

We examined Anatomic Pathology (AP) data sets from 1983 to 1994. There are a total of 290,346 data sets. Table 5-1 shows the number of data sets for each year. From the data set, we extracted the SNOMED codes from each document. The SNOMED codes represent the semantics of each document. Table 5-2 identifies the number of unique SNOMED axes. Appendix B is a partial list of unique SNOMED codes found in the patient reports.

Table 5-1. Number of AP data each year from '83 to '94
Year     Number of sets
1983     17,351
1984     23,186
1985     22,781
1986     22,928
1987     22,965
1988     26,663
1989     27,486
1990     27,814
1991     25,497
1992     23,755
1993     24,303
1994     25,635
Total    290,346

Table 5-2. Number of unique SNOMED axes equations
Axis     Number of unique occurrence     Total occurrence
T        3,759                           702,942
M        4,460                           594,870
E        315                             137,057
F        413                             44,278
D        771                             11,001
P        637                             348,716
Total    10,355                          1,838,864

Table 5-3 is the number of distinct relations between axes.
From the statistical data, we can calculate the base prior probability of the naive Bayes based post-coordinated knowledge structure that is explained in Section 5.3.

5.2.2 Classification of Post-Coordinated Knowledge

From the SNOMED codes of each document, we can extract post-coordinated knowledge. Because of the uncertainty of the world, the pathologist does not always know or describe the SNOMED equation exactly. This means there will be partial descriptions of knowledge. We count a description of SNOMED codes as post-coordinated knowledge only if it contains the "D" axis. If the pathologist described SNOMED codes including "D", there is acceptable certainty that a SNOMED equation exists. Figure 5-5 shows the four kinds of SNOMED equations found in the document space.

Table 5-3. Relation statistics among axes
Axis     Number of two-axis relations     Related axis     Number of unique relations
T        48170                            M                34354
                                          E                979
                                          F                972
                                          D                1515
                                          P                10299
M        75190                            T                34354
                                          E                1160
                                          F                1268
                                          D                1684
                                          P                12480
E        2999                             T                979
                                          M                1160
                                          F                57
                                          D                75
                                          P                527
F        3229                             T                972
                                          M                1268
                                          E                57
                                          D                190
                                          P                486
D        4706                             T                1515
                                          M                1684
                                          E                75
                                          F                190
                                          P                1067

Table 5-4 shows the amount of post-coordinated knowledge found in the document space. We use this knowledge to induce possible diseases from incomplete SNOMED equations (i.e., equations lacking a disease axis).

Figure 5-5. Classification of post-coordinated knowledge: (a) two-axis relation, (b) three-axis relationships, (c) four-axis relationships, (d) five-axis relationship

Table 5-4. Statistics on post-coordinated knowledge
Post-coordinated knowledge relations     Number of unique relations
DT                                       568
DTE                                      26
DTF                                      38
DTM                                      7,425
DTEF                                     3
DTME                                     305
DTMF                                     534
DTMEF                                    68

5.3 Statistical Model of the Post-Coordinated Knowledge

Figure 5-6 shows an example of a four-axis-relation post-coordinated knowledge. We define link frequency (lf) as the total number of links in the code-to-code relation context after parsing the current patient's report.
The link frequency shows the closeness of the relationship: the larger the value, the closer the relationship.

Figure 5-6. An example of a four-axis-relation post-coordinated knowledge (the disease axis term D: Tuberculosis is linked to the other axis terms T: Lung, M: Granuloma, E: M. tuberculosis, and F: Fever; each link carries a link frequency; the relations are labeled Type-D and Type-M)

We can obtain the post-coordinated knowledge by searching the documents for complete SNOMED equations, as described in the previous section. Then, following the induction of complete knowledge from the incomplete SNOMED equations, we can obtain the link frequency of each relation between two axes statistically. The link frequency is used, as discussed in Chapter 6, for the conversion of the statistical model of post-coordinated knowledge.

5.4 Naive Bayes Model of Post-Coordinated Knowledge

It is possible to create or learn a Bayesian network from the data. This is an instance of the problem known in the statistics literature as density estimation: we estimate the probability density (that is, a joint probability distribution) from the data. When we learn a Bayes network from the data, there are four different cases: the structure is known or unknown, and all variables are observable or some are unobservable. In our case, the structure is known and some variables are unobservable.

To model "post-coordinated knowledge," we make several assumptions:
* We consider only the knowledge consisting of a SNOMED equation. Figure 5-7 shows the basic architecture of a SNOMED equation expressed using a Bayesian network.
* We assume we have complete knowledge before processing a patient's report. The complete knowledge can be obtained by searching for complete SNOMED equations in the document space. We call this complete knowledge "post-coordinated knowledge."
* The "post-coordinated knowledge" consists of a combination of the five axes, with the disease axis being mandatory.
* Complete knowledge is unique.
* Each disease is independent.
* The four axes (T, M, E, and F) are independent of each other.
* T, M, E, and F each depend on the instantiation of D.

Figure 5-7. Structure of the post-coordinated knowledge in a Bayesian network

In our case, the structure of the Bayes network is fixed. It has one of the forms shown in Figure 5-5. We consider the knowledge complete only if there is a disease axis in the SNOMED equation (i.e., in the document). We use the following algorithm to extract the knowledge.
1. Look through the documents to find SNOMED equations having one of the complete post-coordinated knowledge forms shown in Figure 5-5.
2. Extract only the complete knowledge forms from the documents retrieved.
3. Use an expert to verify that the extracted knowledge is correct. Generally, we can consider the equation to be complete if it contains a "D" axis.
4. Add the extracted and verified knowledge to the system's knowledge base, the "Post-coordinated knowledge base" (PCKB).

It is possible that an individual document contains incomplete knowledge due to a lack of expert knowledge or an error. This means some variables of the PCKB are not observable in some documents. In that case, we must induce the values of the unobserved variables in the complete PCKB. To do this, we need to estimate the probability values of the variables in the PCKB structure.

It is easiest to start by estimating P(D). This is computed by counting how many times D is true (i.e., found positive) in the data set (documents) and dividing by n, the total number of documents. To obtain an estimate of the probability that T is true given that D is true, we count the number of cases in which T and D are both true and divide by the number of cases in which D is true. The probability of T given not D is obtained similarly, as shown below.
$$P(D) = \frac{\#(D = \text{true})}{n}$$

$$P(\neg D) = 1 - P(D)$$

$$P(T \mid D) = \frac{\#(T = \text{true} \wedge D = \text{true})}{\#(D = \text{true})}$$

$$P(T \mid \neg D) = \frac{\#(T = \text{true} \wedge D = \text{false})}{\#(D = \text{false})}$$

There is one problem with this approach. There will be situations where the number of "D is true," "D is false," or "T and D are true" cases is 0. In those situations the estimate is either 0 or undefined (a division by zero). Because we start from the "base knowledge" structure, the latter case should not occur, but it is possible for the number of "D is true" or "D is false" cases to be 0. To guard against this, we can apply a "Bayesian correction" to our estimates. This means, essentially, initializing our counts at 1 rather than at 0. So, we add 1 to the count in the numerator and a value m to the denominator, where m is the number of possible different values of the variable whose probability we are estimating. In our case, all variables are binary, so we add 2 to the denominator. The new formulas are the following.

$$P(T \mid D) = \frac{\#(T = \text{true} \wedge D = \text{true}) + 1}{\#(D = \text{true}) + 2}$$

$$P(T \mid \neg D) = \frac{\#(T = \text{true} \wedge D = \text{false}) + 1}{\#(D = \text{false}) + 2}$$

Processing documents to obtain the PCKB results in m components of the PCKB. Each PCKB component has the probability estimations shown in Figure 5-8.

5.5 Summary

We described the Knowledgebase Management Engine (KME), which models SNOMED pre-coordinated and post-coordinated knowledge. The pre-coordinated knowledge is modeled using a semantic network notation. It has synonym, multiaxial, and hierarchical relationships. The post-coordinated knowledge can be modeled either statistically or probabilistically. We created the statistical model using the concept of link frequency, which can be obtained by processing the document space. We used the naive Bayes network as a probabilistic model of the post-coordinated knowledge. Because of its independence assumption, the naive Bayes network has a simple structure, yet it gives acceptable results when calculating the joint probability distribution, that is, the posterior probability of a disease.
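As an illustration of the Bayesian correction described in Section 5.4, the following minimal sketch estimates P(D), P(T | D), and P(T | not D) from a list of per-document observations. The function name `smoothed_estimates` and the input layout (one `(d, t)` pair of booleans per document) are assumptions made for this example and do not come from the dissertation.

```python
def smoothed_estimates(observations):
    """Estimate P(D), P(T|D), and P(T|not D) from (d, t) boolean pairs.

    The conditional estimates use the add-one correction for binary
    variables: +1 in the numerator and +2 in the denominator.
    """
    n = len(observations)
    if n == 0:
        raise ValueError("at least one document is required")

    d_true = sum(1 for d, _ in observations if d)
    d_false = n - d_true
    t_and_d = sum(1 for d, t in observations if d and t)
    t_and_not_d = sum(1 for d, t in observations if (not d) and t)

    p_d = d_true / n
    p_t_given_d = (t_and_d + 1) / (d_true + 2)
    p_t_given_not_d = (t_and_not_d + 1) / (d_false + 2)
    return p_d, p_t_given_d, p_t_given_not_d


# Hypothetical data: D is found in all three documents, T in two of them.
p_d, p_t_d, p_t_not_d = smoothed_estimates(
    [(True, True), (True, False), (True, True)]
)
```

With no "D is false" documents at all, the uncorrected estimate of P(T | not D) would require a division by zero; the corrected estimate falls back to 1/2.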
We describe the Knowledge Conversion Engine (KCE) in the next chapter. The KCE handles the conversion of knowledge to quantitative values. We call the conversion process knowledge reduction.

[Figure 5-8: each PCKB component is annotated with the estimates P(t1|d1), P(m1|d1), P(e1|d1), and P(f1|d1).]

Figure 5-8. PCKB component structure and probability estimation.

CHAPTER 6
KNOWLEDGE CONVERSION ENGINE (KCE)

The Knowledge Conversion Engine (KCE) converts the Support Vector Machine (SVM) document vector to a conceptual document vector reflecting the knowledge of the SNOMED Semantic Network Knowledge Base (SNN-KB). We start our discussion of the process with a description of the SVM document vector.

6.1 Support Vector Machine Document Vector

The best-known model in information retrieval is the Vector Space Model (VSM) (Salton et al. 1989). In the VSM, documents and queries reside in a vector space. In this space, each document can be represented as a linear combination of term vectors. The definition of the vector space model follows:

Definition 6.1: A document vector for a document $d_j$ is $d_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})^T$, where $w_{i,j} \geq 0$ is a weight associated with the pair $(k_i, d_j)$, $k_i$ is an index term, $d_j$ is a document, and $t$ is the number of index terms in the whole system.

Definition 6.2: The set of all index terms $K$ is $K = \{k_1, \ldots, k_t\}$, where $t$ is the number of index terms in the whole system.

Normally the index terms are words contained in the document. The set is usually confined to only the significant words by eliminating common function words, called stopwords. The VSM uses the term frequency and the inverse document frequency as a weighting scheme associated with the document.

Definition 6.3: The weight $w_{i,j} \geq 0$ is $w_{i,j} = tf_{i,j} \times idf_i$, where $tf_{i,j}$ is the term frequency of term $i$ in document $j$ and $idf_i = \log \frac{N}{n_i}$ is the inverse document frequency, where $N$ is the number of documents in the collection and $n_i$ is the document frequency of term $i$. The document frequency is the number of documents in which the term occurs.
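The weighting scheme of Definition 6.3 can be sketched in a few lines. This is only an illustration under stated assumptions: whitespace tokenization, raw counts as the term frequency, and the natural logarithm are choices made here, and the function name `tf_idf_vectors` is hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(docs, stopwords=frozenset()):
    """Return (index_terms, vectors) with w_ij = tf_ij * log(N / n_i)."""
    # Assumption: lowercase whitespace tokenization with stopword removal.
    tokenized = [
        [w for w in doc.lower().split() if w not in stopwords] for doc in docs
    ]
    index_terms = sorted({w for tokens in tokenized for w in tokens})
    n_docs = len(tokenized)
    # Document frequency n_i: number of documents containing term i.
    df = Counter(w for tokens in tokenized for w in set(tokens))

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append(
            [tf[w] * math.log(n_docs / df[w]) for w in index_terms]
        )
    return index_terms, vectors


index_terms, vectors = tf_idf_vectors(["lupus nephropathy", "lupus fever"])
```

In this toy collection the term "lupus" occurs in every document, so its inverse document frequency, and therefore its weight, is zero in both vectors.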
6.2 Conceptual Document Vector

The SVM document vector relies on term frequency and inverse document frequency to carry conceptual information into the information retrieval model. There has been an attempt to use phrases as index terms instead of words (Mao and Chu, 2002), which brings the conceptual similarity of phrasal words into the retrieval model. They reported a 16% increase in retrieval accuracy compared to the stem-based model.

In the Vector Space Model, term vectors are pairwise orthogonal, meaning that terms are assumed to be independent. There was an attempt to incorporate term dependencies, which gives semantically richer retrieval results (Billhardt et al. 2004, p. 239). They used a term context vector to reflect the influence of terms on the conceptual description of other terms. The definition of a term context vector follows:

Definition 6.4: The set of term context vectors $T$ is

$$T = \begin{pmatrix} c_{11} & c_{21} & \cdots & c_{n1} \\ c_{12} & c_{22} & \cdots & c_{n2} \\ \vdots & \vdots & \ddots & \vdots \\ c_{1n} & c_{2n} & \cdots & c_{nn} \end{pmatrix}$$

where $n$ is the number of terms and $c_{ik}$ represents the influence of term $t_k$ on term $t_i$.

Definition 6.5: The term context vector $t_i$ is the $i$th column of matrix $T$, $t_i = (c_{i1}, c_{i2}, \ldots, c_{in})^T$, where $n$ is the number of terms and $c_{ik}$ represents the influence of term $t_k$ on term $t_i$.

The Knowledge Conversion Engine (KCE) converts relationships within the SNN-KB into term context vectors. In the following, we discuss how the elements of matrix $T$ can be obtained from the domain-specific knowledge base representation.

6.3 KCE: Knowledge Reduction

[Figure 6-1: a human-friendly but computationally complex graph representation is reduced to a computer-friendly, computationally efficient statistical model.]

Figure 6-1. Knowledge reductions

There are two types of knowledge to convert: pre-coordinated and post-coordinated knowledge. We reduce the dimension of the pre-coordinated knowledge to a conceptual document vector. Knowledge expressed by a graph (in our case, a semantic network) is in a human-friendly form. But it is computationally complex.
We convert that knowledge into a computer-friendly and efficient statistical form. The concept of knowledge reduction is shown in Figure 6-1.

6.4 KCE: Conversion of Pre-Coordinated Knowledge

Three types of relationships exist within the SNN-KB model representing SNOMED. In the first type, the hierarchical topology relationship, each node has attributes denoting its characteristics on the hierarchical tree.

[Figure 6-2: an example tree in which node i has L(i) = 0 and D(i) = 6, and node j has L(j) = 2 and D(j) = 0.]

Figure 6-2. Attributes of the SNN-KB hierarchical topology relation

L(i) is the level of term i in a knowledge tree. D(i) is the number of descendants of term node i in the tree. The term influence between i and j is inversely proportional to the distance, which is the difference of the levels. Having many descendants means that a node is a more general term than a node having a smaller number of descendants. So term influence is inversely proportional to the number of descendants. Thus we can calculate the SNN-KB hierarchical topology relationship between the two terms i and j as:

Definition 6.6: $c_{ij}$ from the SNN-KB hierarchical topology is

$$c_{ij} = C(Sht) \times \frac{1}{d(i,j)} \times \log \frac{1}{D(i) + D(j)}$$

where $C(Sht)$ is the coefficient for the SNOMED hierarchical topology relation and $d(i,j) = |L(i) - L(j)|$, where $L(i)$ is the level of node i and $L(j)$ is the level of node j, $D(i)$ is the number of descendants of node i, and $D(j)$ is the number of descendants of node j.

For the synonym relations:

Definition 6.7: $c_{ij}$ from the SNN-KB synonym relation is $c_{ij} = C(Ss)$, where $C(Ss)$ is the coefficient for the SNOMED synonym relationship.

For the multiaxial relations:

Definition 6.8: $c_{ij}$ from the SNN-KB multiaxial relation is $c_{ij} = C(Sm)$, where $C(Sm)$ is the coefficient for the SNOMED multiaxial relationship.

The values of $C(Sht)$, $C(Ss)$, and $C(Sm)$ should be optimized by simulation.
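The two node attributes used by the hierarchical topology relation, the level L(i) and the descendant count D(i), can be computed in a single traversal of the knowledge tree. The sketch below assumes the tree is given as a parent-to-children dictionary; the node names and the tree shape are hypothetical and were chosen only so that the values match the attributes annotated in Figure 6-2 (L(i) = 0, D(i) = 6, L(j) = 2, D(j) = 0).

```python
def tree_attributes(children, root):
    """Return (level, descendants) dictionaries for every node in the tree."""
    level = {}
    descendants = {}

    def visit(node, depth):
        level[node] = depth
        count = 0
        for child in children.get(node, []):
            # Each child contributes itself plus all of its own descendants.
            count += 1 + visit(child, depth + 1)
        descendants[node] = count
        return count

    visit(root, 0)
    return level, descendants


def level_distance(level, i, j):
    """Distance between two nodes, taken as the difference of their levels."""
    return abs(level[i] - level[j])


# Hypothetical tree: root "i" with two children, each having two leaves.
tree = {"i": ["a", "b"], "a": ["j", "k"], "b": ["l", "m"]}
level, descendants = tree_attributes(tree, "i")
distance_ij = level_distance(level, "i", "j")
```

These values are the inputs a coefficient such as the one in Definition 6.6 would consume; the coefficient itself is left out because its constants are tuned by simulation.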
6.5 KCE: Generating the Conceptual Document Vector

By converting the SNOMED knowledge and domain-specific knowledge to the term-relation matrix $T$ defined in Definition 6.4, we can transform each initial document vector $d_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$ into a conceptual document vector $cd_j = (c_{1,j}, c_{2,j}, \ldots, c_{t,j})$ using the equation in Definition 6.9 (Billhardt et al. 2004, p. 240).

Definition 6.9: $cd_j$ from $d_j$ (Definition 6.1) and $t_i$ (Definition 6.5) is

$$cd_j = \sum_{i=1}^{n} w_{i,j} \frac{t_i}{|t_i|}$$

where $t_i$ is the term context vector of term $t_i$ and $|t_i|$ is the length of vector $t_i$. The division of the elements in the term context vectors by the length of the vector is a normalization step.

6.6 KCE: Conversion of the Post-Coordinated Knowledge

Post-coordinated knowledge can be obtained from a user's document (i.e., a patient's report) after processing all documents in the system. This knowledge cannot be obtained from the predefined SNOMED knowledge base. It contains noise because coding ability, including the correctness of the coding of the patient report, varies from pathologist to pathologist. We can define two kinds of models: statistical and probabilistic.

6.6.1 Statistical Model of Post-Coordinated Knowledge

To compute the statistical model, we first introduce the link frequency (lf) to express the closeness of the relation between terms.

Definition 6.10: The link frequency $lf$ is the number of linkages accumulated from the domain-specific knowledge of all documents in the system.

The domain-specific knowledge in pathology consists of multiaxial relations in which the disease axis carries more importance. In the knowledge of multiaxial disease-centered relationships, relations between axis terms can be divided into two types. We call relations that include a disease D-type (disease-related type) relations and the other relations M-type (multiaxial-related type) relations. The reason for separating the relations is that the relations involving the disease axis have more meaning than the other relations.
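A link frequency table of the kind described in Definition 6.10 can be accumulated with a simple counter. The sketch below rests on an assumption that is not spelled out in the text: one linkage is counted each time two axis terms appear together in the same SNOMED equation. The input layout (one dictionary per document, mapping an axis letter to a term) and both function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def link_frequencies(equations):
    """Count how often each pair of axis terms is linked across documents."""
    lf = Counter()
    for equation in equations:
        # Sort so that a pair is always stored under the same key.
        axis_terms = sorted(equation.items())
        for first, second in combinations(axis_terms, 2):
            lf[(first, second)] += 1
    return lf


def relation_type(link):
    """Return "D" if the link involves the disease axis, otherwise "M"."""
    (axis_a, _), (axis_b, _) = link
    return "D" if "D" in (axis_a, axis_b) else "M"


# Hypothetical coded reports.
equations = [
    {"D": "Tuberculosis", "T": "Lung", "M": "Granuloma"},
    {"D": "Tuberculosis", "T": "Lung"},
]
lf = link_frequencies(equations)
d_link = (("D", "Tuberculosis"), ("T", "Lung"))
m_link = (("M", "Granuloma"), ("T", "Lung"))
```

Keeping the axis letter in each key makes the split into D-type and M-type relations a lookup rather than a second pass over the documents.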
Figure 6-3 shows an example of this domain-specific knowledge model with the newly defined attributes. Figure 6-3 describes the relations between disease i and the other axis terms j1, j2, j3, and j4. The relation between i and j1 was found 230 times, which is the link frequency from the start of the system until now (i.e., since the start of data in the database). Because the value of link (i, j1) is greater than that of the other D-relation links, it is more important than the other links.

[Figure 6-3: the disease axis term i is linked by Type-D relations to the other axis terms j1, j2, j3, and j4 with link frequencies 230 (for j1), 78, 140, and 54; Type-M relations among the other axis terms carry link frequencies 376, 1480, and 378.]

Figure 6-3. Example of Domain-Specific Knowledge relations

For the Type-M relations, we can define the term-to-term relation factor, $c_{ij}$, as follows.

Definition 6.11: $c_{ij}$ from the Domain-specific Type-M relation is

$$c_{ij} = lf_{ij} + C(DM) \times \frac{\sum_{c} lf_c}{n(n-1)/2}$$

where $C(DM)$ is the coefficient of the Domain-specific M-type relation, $lf_{ij}$ is the link frequency between i and j, and $lf_c$ is the link frequency of the relations other than the one between i and j.

Figure 6-4 shows the conversion concept for Type-M relations. The Type-M relation factor is the sum of the importance of the link itself and the averaged influence of the other links.

If we look at Type-D relations, one disease term has several relations with Type-M nodes. So we have to consider the influence of the other Type-D relations on a given Type-D relation. For example, if we calculate the relation factor between node i and j1, we must consider the influences of the other relations (i, j2), (i, j3), and (i, j4) on the relation (i, j1).

[Figure 6-4: Type-M relations among the other axis terms T, M, E, and F, with link frequencies 376, 1480, and 378.]

Figure 6-4. Conversion of Type-M relations

For the D-type relations:

Definition 6.12: $c_{ij}$ from the Domain-specific Type-D relation, where node i is a disease term and node j is another term, is

$$c_{ij} = C(DD) \times lf_{ij} + C(DDn) \times \frac{\sum_{c} lf_c}{N_a - 1}$$

where $C(DD)$ is the coefficient of the Domain-Specific D-type relation, $lf_{ij}$ is the link frequency between i and j, $C(DDn)$ is the coefficient of the Domain-Specific Disease Neighbor relation, $lf_c$ is the link frequency of the relations other than the one between i and j, and $N_a$ is the number of axes other than the disease axis in the knowledgebase.

The statistical model of the post-coordinated knowledge can be applied to the conceptual matrix. This means the knowledge is applied to the document vector "generally," regardless of each document's situation.

6.6.2 Probabilistic Model of Post-Coordinated Knowledge

We defined the naive Bayes network model of the post-coordinated knowledge in Section 5.4. After processing documents for post-coordinated knowledge (PCK), we have n documents and m PCKs. Each PCK has the specific form shown in Figure 5-6. The object of inference in the knowledge-based information retrieval model is to find a disease from the given findings (combinations of T, M, E, and F). A document normally does not contain complete PCKs. Because of a lack of expert knowledge, a patient's report often cannot state the PCKs in complete form. So we must estimate which disease is most likely from the findings given in the document. This is the key to improving the knowledge enhancement of the retrieval process.

We modeled the PCKs using naive Bayes in Section 5.4. We can define the posterior probability that we are attempting to calculate as $P(D \mid t, m, e, f)$, where D is the set of diseases that have a relationship with the given findings (t, m, e, and f) found by searching the PCKs. The posterior probability can be obtained from Bayes' theorem.
$$P(D \mid t, m, e, f) = \frac{P(D)\,P(t, m, e, f \mid D)}{P(t, m, e, f)}$$

In practice, we are only interested in the numerator of the above fraction: the denominator does not depend on D, and the values of t, m, e, and f are given, so the denominator is a constant, which we call Z. By the independence assumption, we can rewrite the fraction as:

$$P(D \mid t, m, e, f) = \frac{1}{Z}\,P(D) \prod_{i=1}^{n} P(F_i \mid D)$$

where $F_i$ is the set of findings.

The post-coordinated knowledge has specific relations with the individual documents. The individual knowledge is defined from the specific contents of each document, so we cannot use knowledge reduction in this case. Knowledge reduction handles general knowledge conversion cases. So we have to apply the post-coordinated knowledge to each document, more specifically to each individual document vector.

We can classify several cases for the conversion of post-coordinated knowledge. Refer to Figure 5-5 for the classification of post-coordinated knowledge. We use PCKBa for a one-axis relation, PCKBb for a two-axis relation, PCKBc for a three-axis relation, and PCKBd for a four-axis relation.

Case 1: The document contains all four axes, for example (t, m, e, and f). We must find the probability of the disease d based upon the existence of (t, m, e, f). This is performed by searching PCKBd. Searching PCKBa, PCKBb, or PCKBc is not necessary because those contain less information. We can obtain only one component of knowledge from the PCKB because, with five axes of information, the knowledge is complete and unique.

Case 2: The document contains three axes, none of which is d. Figure 6-5 shows an example of this case. Here, we must compute the probability of each possible disease, and then that of the remaining axis, i.e.,

$$P(d_1 \mid t, m, e) \quad (1)$$

$$P(d_2 \mid t, m, e) \quad (2)$$

after finding the possible post-coordinated knowledge from PCKBc and PCKBd. Searching PCKBd is required because PCKBc knowledge can be included in PCKBd knowledge.

Figure 6-5. Examples of case 2

We already know $P(d_1)$, $P(d_2)$, $P(t \mid d_1)$, $P(m \mid d_1)$, $P(e \mid d_1)$, $P(f_1 \mid d_1)$, $P(t \mid d_2)$, $P(m \mid d_2)$, $P(e \mid d_2)$, and $P(f_2 \mid d_2)$.
By the naive Bayes theorem, the posterior probabilities (1) and (2) can be calculated and compared by:

$$P(d_1 \mid t, m, e) = \frac{1}{Z}\,P(d_1) \prod_{i=1}^{3} P(F_i \mid d_1) = \alpha\,P(d_1)\,P(t \mid d_1)\,P(m \mid d_1)\,P(e \mid d_1)$$

$$P(d_2 \mid t, m, e) = \frac{1}{Z}\,P(d_2) \prod_{i=1}^{3} P(F_i \mid d_2) = \alpha\,P(d_2)\,P(t \mid d_2)\,P(m \mid d_2)\,P(e \mid d_2)$$

Then, we can augment the document vector according to the relative normalized values of $P(d_1 \mid t, m, e)$ and $P(d_2 \mid t, m, e)$, scaled by some coefficient. The complexity of this algorithm is O(mn), where n is the number of documents and m is the number of post-coordinated knowledge components.

Case 3 is the case in which a two-axis relation is found in the document, and case 4 is the case in which a one-axis relation is found in the document. The calculation is as straightforward as in case 2.

6.6 SVM IR Engine: Document Retrieval

After the document vectors have been converted to conceptual document vectors, the system can start accepting queries. A query is expressed in the same way as a document vector, with the query terms as the vector elements. The query vector $q$ is compared with the conceptual document vector $cd_j$ using the cosine similarity measure.

Definition 6.14: The similarity between $q$ and $cd_j$ is

$$\cos(q, cd_j) = \frac{q \cdot cd_j}{|q|\,|cd_j|} = \frac{\sum_{i=1}^{n} q_i\,c_{i,j}}{\sqrt{\sum_{i=1}^{n} q_i^2}\,\sqrt{\sum_{i=1}^{n} c_{i,j}^2}}$$

The similarity measure produces a ranked list of relevant documents related to the query.

6.7 Summary

We described the details of the Knowledge Conversion Engine (KCE). The KCE reduces knowledge expressed by a semantic network or a Bayesian network into quantitative values to provide efficiency in the retrieval process. Conversion of the knowledge is called knowledge reduction because the reduction process reduces the graphical knowledge into a two-dimensional value representing the number of relations between the two terms. The conversion of Bayesian network knowledge is done by directly applying the inferred certainty value to a document vector.
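The application of Bayesian network knowledge to a single document can be sketched as follows. This is a minimal illustration, not the dissertation's implementation: the dictionary-based data layout, the function names, the probability values, and the choice to add each normalized posterior to the document vector under the disease term are all assumptions made for the example.

```python
def rank_diseases(priors, likelihoods, findings):
    """Return normalized naive Bayes posteriors P(d | findings).

    priors:      {disease: P(d)}
    likelihoods: {disease: {finding: P(finding | d)}}
    findings:    findings observed in the document, e.g. ["t", "m", "e"]
    """
    scores = {}
    for disease, prior in priors.items():
        score = prior
        for finding in findings:
            score *= likelihoods[disease][finding]
        scores[disease] = score

    total = sum(scores.values())
    if total == 0:
        return {disease: 0.0 for disease in scores}
    # Dividing by the total plays the role of the constant 1/Z.
    return {disease: score / total for disease, score in scores.items()}


def augment_vector(doc_vector, posteriors, coefficient=1.0):
    """Add coefficient * P(d | findings) to the weight of each disease term."""
    augmented = dict(doc_vector)
    for disease, posterior in posteriors.items():
        augmented[disease] = augmented.get(disease, 0.0) + coefficient * posterior
    return augmented


# Hypothetical estimates for two candidate diseases.
priors = {"d1": 0.5, "d2": 0.5}
likelihoods = {
    "d1": {"t": 0.8, "m": 0.5, "e": 0.5},
    "d2": {"t": 0.2, "m": 0.5, "e": 0.5},
}
posteriors = rank_diseases(priors, likelihoods, ["t", "m", "e"])
augmented = augment_vector({"lupus": 1.0}, posteriors)
```

Because only the ratio between candidate diseases matters, the constant denominator never has to be computed explicitly.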
This process applies the knowledge to the individual documents, which is called personalized knowledge application, while the conversion of pre-coordinated knowledge is a general application of knowledge.

In the next chapter, we describe the results of the performance evaluation of the developed knowledge-based information retrieval model.

CHAPTER 7
PERFORMANCE EVALUATION

7.1 Simulation Parameters

In our experiment, we used the recall and precision metrics explained in Section 2.4 to evaluate performance. We can consider performance to have improved if the recall-precision curve moves toward the upper right, as shown in Figure 7-1, because in the ideal case the precision should be 100% when the recall is 100%.

[Figure 7-1: precision plotted against recall, with the directions of increasing and decreasing performance marked.]

Figure 7-1. Performance evaluation metrics

To calculate precision and recall, we must know the exact relationship between each document and the query. An expert must determine this, so it is impractical to evaluate the relevancy between documents and a query if the set is large. In our case, the total number of documents is nearly one half million. We selected 2000 case documents signed by a top expert, because those documents should have a low error rate in describing post-coordinated knowledge. Then we randomly selected 261 cases among the 2000 cases, because we needed to reduce the size of the set so that a human expert could examine relevancy. The selected 261 cases were examined for their relevancy to the queries "membranous nephropathy lupus" and "nephrotic syndrome." Our expert rated the relevancy between each document and the query as "Positive," "Neutral," or "Negative." In this chapter, we call the query "membranous nephropathy lupus" Q1 and "nephrotic syndrome" Q2. Table 7-1 shows the result of the evaluation for the 261 documents.

Table 7-1. Relevancy check result of 261 simulation documents

Query   # of positive   # of neutral   # of negative   Total relevant (positive+neutral)
Q1      24              95             142             119
Q2      23              90             148             113

7.2 Simulation Result

7.2.1 Performance Evaluation with Pre-Coordinated Knowledge

Figure 7-2 shows the result of the query "membranous nephropathy lupus" on the positive cases. This graph shows some degradation of performance for the knowledge-based information retrieval (KBIR) model compared with the support vector machine (SVM). We can think of the KBIR as having the same effect as query expansion, except that the KBIR expands the document vector instead of the query vector. If the knowledge has synonyms, the KBIR expands the document vector to include synonyms of the query "membranous nephropathy lupus." This causes an expansion to a somewhat broader range of knowledge. For example, "membranous" can be expanded to a more general term, so the degradation on the positive cases may be caused by a general expansion of the knowledge of KBIR. This becomes clearer when we look at the results of query 1 with the neutral cases included in the performance evaluation, as shown in Figure 7-3.

[Figures 7-2 through 7-5: precision-recall plots comparing KBIR and VSM with the individual relation types (synonym, cross reference, and some relation).]

Figure 7-2. Comparison of performance for query 1 on positive cases.

Figure 7-3. Evaluation results of query 1 including the neutral cases.

Figure 7-4. Evaluation results for query 2 for the positive cases

Figure 7-5. Evaluation results for query 2 including the neutral cases

These results show a large gain in performance, in contrast to the degradation that occurs with only the positive cases. If we look at the result more generally, that is, if we give importance to the neutral cases, the performance evaluation is promising. The gain can be explained by the expansion of knowledge in the document vector. With VSM, every retrieved document must contain at least one of the query terms: membranous, nephropathy, or lupus. But KBIR retrieves some documents that do not contain any query words, because the document vector was extended to contain terms related to the terms existing in these documents. This increases the recall rate. The precision results likewise start to make sense when we consider the results more generally.

Figure 7-4 is the result of query 2, "nephrotic syndrome," on just the positive cases. In contrast to the evaluation of query 1 on positive cases, the results show a performance gain. This can be explained by the characteristics of KBIR's knowledge management. Because the number of terms in query 2 is smaller than in query 1, the amount of expanded knowledge for query 2 is less than for query 1. This means that knowledge expansion for queries having fewer query terms tends to have smaller error rates than for queries having many terms.

The performance evaluation results of query 2 including the neutral cases, shown in Figure 7-5, show a lower performance gain than the results of query 1. This can also be explained by the small expansion of knowledge caused by the lower number of terms in the query.

If we look at the effects of each relationship on KBIR performance, we can say the KBIR performance is the sum of the contributions of each relation: synonym, cross reference, and some relation.
Normally, synonym relations do not show a significant contribution, but cross reference relations (i.e., relations between SNOMED axes) show a significant contribution to performance. This can be explained by the fact that each document's concept can be expressed by a SNOMED equation, so the relationship between concepts is more important than just the synonym relations between terms. Table 7-2 shows quantitative values of the performance gain for the pre-coordinated knowledge addition compared to the VSM method.

Table 7-2. Value of performance gain of pre-coordinated knowledge compared to VSM

Query     Performance gain (%)
Query 1   39.6
Query 2   20.6
Average   30.1

7.2.2 Performance Evaluation with Naive Bayes Post-Coordinated Knowledge

Figure 7-6 shows the performance gain when we use the naive Bayes post-coordinated knowledge for query 1, and Figure 7-7 shows it for query 2. Table 7-3 shows the quantitative value of the performance gain compared to VSM and to pre-coordinated knowledge.

Table 7-3. Value of performance gain of post-coordinated knowledge

Query     Performance gain (%) compared to pre-coordinated knowledge   Performance gain (%) compared to VSM
Query 1   7.0                                                          47.0
Query 2   8.2                                                          28.8
Average   7.6                                                          37.9

The results show nearly the same percentage of improvement compared to the pre-coordinated knowledge case and a different gain compared to the VSM case.

[Figures 7-6 and 7-7: precision-recall plots for the series With PK, Without PK, and VSM.]

Figure 7-6. Evaluation results of query 1 including post-coordinated knowledge

Figure 7-7. Evaluation results of query 2 including post-coordinated knowledge

The reason follows directly from the effects of knowledge application in our model explained in the previous section.
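Recall-precision points like those plotted in the figures of this chapter can be derived from a ranked result list and a set of expert judgments. The sketch below is an illustration only; the function name and the document identifiers are hypothetical, and it does not reproduce how the performance gain percentages in the tables were computed.

```python
def precision_recall_points(ranked_ids, relevant_ids):
    """Return (recall, precision) at each rank where a relevant document appears."""
    relevant = set(relevant_ids)
    points = []
    hits = 0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    return points


# Hypothetical ranking of four documents and two sets of judgments.
ranking = ["a", "b", "c", "d"]
positive_only = precision_recall_points(ranking, {"a", "c"})
with_neutral = precision_recall_points(ranking, {"a", "b", "c"})
```

Evaluating on the positive cases only or on the positive and neutral cases together is then just a matter of which identifiers are passed as `relevant_ids`.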
7.2.3 Performance of Statistical Post-Coordinated Knowledge Model

This model shows no significant performance improvement, as seen in Figure 7-8. We assumed that the statistical model of post-coordinated knowledge is general knowledge applicable to all documents regardless of the semantics of each individual document. The result shows that this assumption is incorrect.

[Figure 7-8: precision-recall plot for the series With PK, Without PK, and SVM.]

Figure 7-8. Evaluation results of query 1 including statistical post-coordinated knowledge

7.3 Summary

We presented the results of the performance evaluation for our knowledge-based information retrieval model, showing the separate effects of pre-coordinated knowledge and post-coordinated knowledge. Compared to VSM, the results show an average gain of 30.1% for pre-coordinated knowledge application and 37.9% for post-coordinated knowledge application. These gains come at a real-time processing speed comparable to that of VSM.

We applied the statistical model of post-coordinated knowledge to all documents evenly by inserting the computed relations into the term-context matrix. We assumed that the statistical post-coordinated knowledge is general knowledge that can be applied evenly. But from the simulation results of the statistical model, we can conclude that the post-coordinated knowledge is personalized knowledge that should be applied to each document separately. We applied the naive Bayes model based knowledge to each document's term vector separately.

The next chapter concludes our research, summarizing contributions and identifying future work.

CHAPTER 8
CONCLUSION

In this dissertation, we have shown significant progress towards developing an information retrieval model augmented by a knowledge base. We created a knowledge-based information retrieval (KBIR) model that shows a meaningful performance gain while providing the same speed in the retrieval process.
We summarize our contributions in Section 8.1 and discuss directions for future work in Section 8.2.

8.1 Contributions

The objective of this dissertation was to design an intelligent information retrieval model that produces knowledge-infused answers for users by incorporating a domain-specific ontology in the knowledgebase using a computationally efficient knowledge conversion method. The main contributions of the dissertation to information retrieval research are as follows:

Knowledge reduction to statistical model: The developed information retrieval model is a knowledge-based information retrieval model. Unlike other models, which perform an ontology-level information retrieval task such as an ontology comparison or an ontological query expansion, the proposed model reduces the knowledge represented by the knowledge base to a statistical model such as the vector space model's document vector, as shown in Figure 8-1. We used semantic networks for predefined knowledge and naive Bayes networks for post-coordinated knowledge. Those graphical knowledge representations are human friendly and easy for humans to understand, but computationally complex. The reduced statistical form of knowledge, such as a conceptual document vector, is not human friendly but is computer friendly and computationally efficient.

Figure 8-1. Knowledge reduction to statistical model

Figure 8-2. Offline application of knowledge

Offline application of knowledge: Using knowledge reduction enables the offline processing of the application (calculation) of knowledge to the information retrieval procedure, as shown in Figure 8-2. Only the conceptual document vector, which can be obtained from the document vector and the knowledge base, is involved in the online process of producing ranked results by comparing a user's query and the documents.
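To make the offline and online split concrete, the sketch below precomputes conceptual document vectors once and then ranks them against a query by cosine similarity. It assumes that the conceptual document vector is the weighted sum of length-normalized term context vectors (the reading of Definition 6.9 used here, without any further normalization); the three-term vocabulary, the influence values, and all function names are hypothetical.

```python
import math


def normalize(vector):
    """Divide a vector by its Euclidean length (zero vectors are returned as is)."""
    length = math.sqrt(sum(x * x for x in vector))
    return [x / length for x in vector] if length else list(vector)


def conceptual_vector(doc_vector, term_context_vectors):
    """Offline step: weighted sum of length-normalized term context vectors."""
    cd = [0.0] * len(doc_vector)
    for weight, context in zip(doc_vector, term_context_vectors):
        if weight == 0:
            continue
        for k, value in enumerate(normalize(context)):
            cd[k] += weight * value
    return cd


def cosine(q, d):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(q, d))
    nq = math.sqrt(sum(a * a for a in q))
    nd = math.sqrt(sum(b * b for b in d))
    return dot / (nq * nd) if nq and nd else 0.0


def rank(query, conceptual_vectors):
    """Online step: document indices ordered by decreasing similarity."""
    scored = [(cosine(query, cd), i) for i, cd in enumerate(conceptual_vectors)]
    return [i for _, i in sorted(scored, key=lambda p: (-p[0], p[1]))]


# Hypothetical vocabulary: ["lupus", "sle", "fever"]; "lupus" and "sle" are related.
term_context = [[1, 1, 0], [1, 1, 0], [0, 0, 1]]
docs = [[0, 1, 0], [0, 0, 1]]  # document 0 mentions only "sle"
conceptual = [conceptual_vector(d, term_context) for d in docs]  # offline
query = [1, 0, 0]  # "lupus"
ranking = rank(query, conceptual)  # online
```

Document 0 shares no term with the query, so its plain cosine similarity is zero, yet its conceptual vector is retrieved first; all knowledge-dependent work happens before any query arrives.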
Inverse query expansion: The result of our knowledge-based information retrieval model is very similar to that of query expansion or latent semantic indexing. Unlike those models, which calculate part of the knowledge during the retrieval process, our model does its processing offline, giving the same effect with a lower computational burden.

Applicability to general open domain: Although the proposed model uses domain-specific knowledge, it can be used in an open-domain application if suitable knowledge bases are available. One possible candidate for the open-domain knowledge base is WordNet, which has a thesaurus and relations from the natural language domain.

Flexibility on the knowledge representation: We defined some examples of knowledge reduction methods using a semantic network. The semantic network is one example of a knowledge representation, an area of artificial intelligence that deals with ontologies. Our model is flexible with respect to the type of knowledge representation, provided we can define a knowledge reduction scheme for the selected representation. In our model, we used a naive Bayes network for representing post-coordinated knowledge. It has classification ability with low computational complexity, and conditional independence is a reasonable approximation.

8.2 Future Work

One task that needs completing is the modeling of the hierarchical knowledge. To adequately model the hierarchy in the pathology domain, we must refine the hierarchical relationships by consulting the SNOMED book. The reason is that the database storing the SNOMED notations does not define the hierarchical relationships completely. We need to build complete sets of the hierarchy from the over 50,000 semantic relations in the SNOMED book to apply the hierarchy in our knowledgebase IR model. It is also possible to use the current version of SNOMED, SNOMED CT, which provides a more thorough and accurate set of relationships in the pathology domain.
This should be handled as a separate project because of the size and depth of the work. We can predict the effect of adding the hierarchical knowledge to our model by looking at the results of adding the other relations. The trend in those results is that the more general the added relations are, the greater the degradation of performance. We think that the hierarchical relationships will add a larger number of relations to the term matrix than the other relations, resulting in some degradation of precision, but with a gain in recall.

A second extension of this work is to apply our model to the open-domain information retrieval process. Using WordNet as a knowledge source, we can see if there is a performance gain in general-domain information retrieval. Extracting knowledge automatically from given documents to use as a knowledge source for the information retrieval process is a possible approach towards applying our model to the general open domain.

Finally, we used the naive Bayes network for modeling post-coordinated knowledge. The naive Bayes model assumes independence among findings. Several Bayesian network based models exist that model dependence among findings. Even though several papers report that the naive Bayes model shows acceptable performance in its simple form, it would be worthwhile to compare the performance of naive Bayes with that of other models that provide dependency relations between findings.

APPENDIX A
PRIMARY TERMS WHICH ARE THE BASIS FOR THE DB ATTRIBUTE

Table A-1. Primary terms for APDB
Terms Roles Etc Format:NNNNYYT This number also SURG PATH NO NNNN: Serial number distinct in one year, shown at the end of SURGICAL digit width may vary the line having PATHOLOGY NO# YY: year expressed in two digit format: ACC.
# T: Type = { C, S, O, G, M } YYTNNNNN###Y ACC# Type YMMDD CYTOLOGY NO "C" Consultation Rpt "S" inhouse surgical Rpt Patient name NAME Format: Last, First, Middle, Suffix TEST NO Test number SPECIMEN NO SPECIMEN Specimen number SPECIMEN MED REC NO 6 digit unique number of each hospital Medical Record # format: NNNNNN may vary ROOM WARD Room number WARD Patient location Age of patient Format: NN [YMD] AGE NN number Y represent year M month D day SEX* Sex of patient SEX Format: {MF} Service date DATE Service Date Format: Month Day, Year Example: JANUARY 07, 1981 PHYS PHYSICIAN rrin P ii Referring Physician or Surgeon Referring Physician Surgeon REPORT TYPE Example: S1 Surgical Date obtained SERVICE Date received Date Obtained Date obtained Date Received Date received Table Ai Continued Terms Roles Etc HISTORY CLTRICAL Clinical history CLINICAL Long text HISTORY Specimen(s) submitted/ Procedures ordered Specimen submitted * GROSS DESCRIPTION MICROSCOPIC Light Microscopy MICROSCOPIC DESCRPTIO* Immunofluorescence microscopy DESCRIPTION MICROSCOPTIC Electron microscopy DESCRIPTION Other tests: e.g. included cytogenetics, molecular biology, or flow cytometry data DIAGNOSIS Bone marrow, aspiration: No lymphoma DIAGNOSIS detected detected COMMENT * PATHOLOGIST * Diagnostic/Retrieval codes Modifier codes Transaction codes: JP/whd RETRIEVAL CODES Date of transcription: 03/23/99 Electronic signatures Date Electronically signed out APPENDIX B SNOMED STATISTICS Table B1. 
Partial list of T code

Name      Number    P(code/total disease)    P(code/documents)
T8X330    142850    0.203217335              0.491999201
T8X310     64010    0.091060144              0.220461105
TOOXXO     53701    0.076394639              0.184955191
T83000     33989    0.048352496              0.117063779
T8X210     22408    0.031877452              0.077176886
TOX000     16728    0.023797127              0.057614019
T83300     14706    0.020920645              0.050649914
T82        14232    0.020246336              0.049017379
T83320     13621    0.019377132              0.046912993
T74000     13125    0.018671526              0.045204687
T84000     12585    0.017903326              0.043344837
T8X        11307    0.016085253              0.038943192
T86800      8341    0.011865844              0.028727794
T2Y030      7327    0.010423335              0.025235409
T88100      7299    0.010383502              0.025138972
T06000      6825    0.009709194              0.023506437
T01000      6449    0.009174299              0.022211431
T8X320      5778    0.008219739              0.019900395
T7X100      5648    0.008034802              0.019452653
T56000      5495    0.007817146              0.018925696
T2Y414      5407    0.007691958              0.018622609
T67000      4980    0.007084511              0.017151950
T71000      4853    0.006903841              0.016714541
T77100      4597    0.006539658              0.015832834
T63000      4457    0.006340495              0.015350651
T32010      4185    0.005953550              0.014413837
T80100      4137    0.005885265              0.014248517
T6X940      3679    0.005233718              0.012671089
T86120      3609    0.005134136              0.012429997
T04030      3543    0.005040245              0.012202682
T86110      3541    0.005037400              0.012195794
T04020      3523    0.005011793              0.012133799
T57000      3514    0.004998990              0.012102801
TOXOO       3301    0.004695978              0.011369194
T08000      3207    0.004562254              0.011045442
T82900      2875    0.004089953              0.009901979
T81000      2863    0.004072882              0.009860649
T88960      2746    0.003906439              0.009457682
T66000      2726    0.003877987              0.009388798
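
The two probability columns of Table B1 appear to be relative frequencies: each code count divided by a fixed total. The following minimal Python sketch illustrates that reading. The totals are not stated in the table; they are an assumption, back-calculated here from the first row, and the helper names infer_total and relative_frequency are hypothetical.

```python
# Rows copied from Table B1:
# (code, number, P(code/total disease), P(code/documents)).
ROWS = [
    ("T8X330", 142850, 0.203217335, 0.491999201),
    ("T8X310", 64010, 0.091060144, 0.220461105),
    ("T83000", 33989, 0.048352496, 0.117063779),
]


def infer_total(count, probability):
    """Back-calculate the denominator implied by one (count, probability) pair.

    Assumption: probability = count / total, with total an integer.
    """
    return round(count / probability)


def relative_frequency(count, total):
    """Relative frequency of a code given an overall total."""
    return count / total


# Infer both totals from the first row only (an assumption, see lead-in).
_, first_count, first_p_code, first_p_doc = ROWS[0]
total_codes = infer_total(first_count, first_p_code)
total_documents = infer_total(first_count, first_p_doc)

# Largest gap between recomputed and tabulated values over the sample rows.
max_deviation = 0.0
for code, number, p_code, p_doc in ROWS:
    recomputed_code = relative_frequency(number, total_codes)
    recomputed_doc = relative_frequency(number, total_documents)
    max_deviation = max(
        max_deviation,
        abs(recomputed_code - p_code),
        abs(recomputed_doc - p_doc),
    )
    print(f"{code}: {recomputed_code:.9f} {recomputed_doc:.9f}")

print("inferred total disease codes:", total_codes)
print("inferred total documents:", total_documents)
```

If the inferred totals reproduce the other rows to within rounding, that supports reading the columns as count / total; it does not confirm how the totals were originally obtained.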