Citation
AN XML INFORMATION BASE AND EXPLORER FOR A NATURAL LANGUAGE QUESTION ANSWERING SYSTEM

Material Information

Title:
AN XML INFORMATION BASE AND EXPLORER FOR A NATURAL LANGUAGE QUESTION ANSWERING SYSTEM
Copyright Date:
2008

Subjects

Subjects / Keywords:
Algorithms ( jstor )
Colleges ( jstor )
Graduates ( jstor )
HTML ( jstor )
Intelligent interfaces ( jstor )
Keywords ( jstor )
Natural language ( jstor )
Web pages ( jstor )
Wrappers ( jstor )
XML ( jstor )

Record Information

Source Institution:
University of Florida
Holding Location:
University of Florida
Rights Management:
Copyright the author. Permission granted to the University of Florida to digitize, archive and distribute this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.
Embargo Date:
8/8/2002
Resource Identifier:
51556078 ( OCLC )

Full Text

AN XML INFORMATION BASE AND EXPLORER FOR A NATURAL LANGUAGE QUESTION ANSWERING SYSTEM

By

NATHANIEL NADEAU

A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE

UNIVERSITY OF FLORIDA

2002

Copyright 2002 by Nathaniel Nadeau

Dedicated to my Father, who has always inspired me to never stop learning, never stop trying to outdo myself, and never stop enjoying the ride.

ACKNOWLEDGMENTS

I would like to thank my parents for giving me constant support and inspiration, Dr. Douglas D. Dankel II for being the best professor and advisor a student could hope to have, and finally my fellow students who worked with me on this project: Nicholas Antonio, Wilasini Pridaphattharakun, and Eugenio Jarosiewicz. Special thanks are due to my friends Jim Keller, Hazen Mitchell, Evan Blake, and Ben Fletcher for proofreading and general support throughout the writing process.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
   1.1 A Maze of Information
   1.2 Current State of Web Searching
   1.3 An Ideal Web Searching Solution
   1.4 A Realistic Question Answering System
   1.5 The XML Information Base and Explorer
   1.6 Summary and Road Map
2 PREVIOUS WORK
   2.1 The Need for Question Answering
   2.2 Motivation for WebNL
   2.3 Inspiration for the Ideal QA System
   2.4 Inspiration for WebNL
   2.5 Knowledge and Information Representation
   2.6 Text Annotation and XML
   2.7 Summary
3 UNDERLYING TECHNOLOGY
   3.1 XML
      3.1.1 Motivation
      3.1.2 Extensibility
      3.1.3 Structure
      3.1.4 Validation
      3.1.5 Further Reading
   3.2 Java Servlets
   3.3 Summary
4 THE XMLIB – RESEARCH GOALS AND DESIGN PHILOSOPHY
   4.1 Goals and Requirements
   4.2 Design Philosophy
   4.3 Summary
5 THE XMLIB – IMPLEMENTATION
   5.1 Overview
   5.2 The Document Type Definition
      5.2.1 The Root Element
      5.2.2 Utility Elements
      5.2.3 The Directory
      5.2.4 Domain Elements
   5.3 The XML Files
      5.3.1 A Concrete Example
      5.3.2 General XMLIB Features
   5.4 More on Querying
   5.5 Summary
6 THE XMLIB – RESULTS
   6.1 Summary of the Construction Process
      6.1.1 Phase One: Constructing the DTD
      6.1.2 Phase Two: Constructing the XML Files
   6.2 WebNL Results
   6.3 Evaluating the XMLIB
7 THE XML EXPLORER – XMLEX
8 CONCLUSIONS AND FUTURE WORK
   8.1 Conclusions
   8.2 Future Work

APPENDIX

A WEB STRUCTURE SURVEY
B XMLIB DTD AND XML FILES

LIST OF REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

3.1: Symbols for Specifying Element Structure in XML DTD
5.1: The XMLIB Utility Elements
5.2: Wrapper Conditions in the DTD
6.1: Results of the Thirteen WebNL Test Questions
A.1: List of Schools Involved in Web Structure Survey

LIST OF FIGURES

1.1: An ideal QA system
1.2: A simpler QA system
2.1: A simple frame system
2.2: A simple semantic net
3.1: Two simple XML elements
3.2: XML elements representing computer science courses
3.3: Overlapping and non-overlapping elements
3.4: A DTD and a corresponding XML file
4.1: The WebNL system
4.2: Information organized using hyperlinks
4.3: Information organized using headings
5.1: The root element of the XMLIB:
5.2: The element
5.3: The domain element definition
5.4: The relationship between HTML headings and domain element structures
5.5: The directory listing for core_courses.xml
5.6: A portion of core_courses.xml
5.7: The core courses section of the graduate web pages
6.1: Phase One of the construction process
6.2: WebNL’s user interface
6.3: WebNL’s user interface displaying an answer
7.1: Exploring core_courses.xml with root and expanded
7.2: More elements expanded

Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Master of Science

AN XML INFORMATION BASE AND EXPLORER FOR A NATURAL LANGUAGE QUESTION ANSWERING SYSTEM

By

Nathaniel Nadeau

August 2002

Chair: Douglas D. Dankel II
Department: Computer and Information Science and Engineering

As the amount of information available online continues its exponential growth, the need for question answering systems to supplement current search engines is increasingly recognized. An ideal question answering system would automate the process of finding relevant online documents, compiling the information retrieved, and returning only the specific information that answers a user’s question.

This thesis describes an information base that serves as a first step in the attempt to automate the process of compiling information contained within online documents, particularly HTML web pages. This first step involves the manual creation of the information base to facilitate the study of the process and the formulation of general algorithms to automate it. The retrieval of a particular set of documents is outside the scope of this thesis.

The developed information base, implemented using XML, covers the restricted domain of web pages describing the graduate program of the University of Florida’s
Computer and Information Science and Engineering Department. The approach used to develop the information base involves finding the keywords in a given piece of text and inserting them into an XML structure that mimics the structure of the source HTML documents. This XML structure is a hierarchy of elements that divide and organize the information. It is asserted that a large percentage of web sites have a structure similar to that of the site modeled in this thesis and, therefore, since the restricted implementation is satisfactory, the method used is a viable candidate for automation and generalization.

Also discussed in this thesis is a Java servlet called XML Explorer. This servlet functions as a peripheral tool for interactively exploring the information base, as well as any other XML documents.

CHAPTER 1
INTRODUCTION

1.1 A Maze of Information

One significant problem resulting from the exponential growth of the computer industry is that there is simply too much data and information being created and made available; humans can no longer effectively handle this amount of data. The entire field of Data Mining, for example, arose in response to this problem in databases. The navigation and interpretation of the increasing amounts of data required the development of new algorithms and techniques. Databases, despite their regular structure, had become too large to be useful without some kind of layer between the data and the user that could make the information digestible.

The World Wide Web is now going through the same stages that the database world went through earlier. Like a database, the Web is a collection of information, and this collection has grown too large for humans to navigate and use efficiently. Unlike databases, however, the Web in its entirety does not enjoy a regular or well-defined structure. Gray Clayton, a lecturer in communication and technological education at the University of Waikato, New Zealand, gives a fitting comment on the irregular arrangement of the Web:

    A librarian described the World Wide Web as being like a huge library where all the books had been thrown in a pile on the floor. The Web itself is not organized. No one is in charge. No one catalogues it or arranges acquisitions. [1]

The Web truly is a maze of information. The creation of a layer between the Web surfer and the sea of online information has been and continues to be the goal of much
research. Unfortunately, building this layer poses different challenges than building the analogous layer in the database world.

1.2 Current State of Web Searching

The first successful attempts to create order from the online chaos were search engines. These programs catalogue hundreds of thousands of online documents, indexing them by keywords. When a user makes a request by providing a set of keywords, the search engine returns those documents from its catalogue whose keywords match. While these programs have been very successful commercially, they still leave the user with the unenviable task of manually searching through hundreds or thousands of returned documents. Additionally, many of these documents are completely irrelevant to the user, since the context of the keywords is unknown. These search engines successfully match the user’s words to those in the document, but are unable to understand the concept of the user’s request.

A second approach, which attempts to maximize the percentage of relevant documents returned, is the concept-based, or clustering, search engine [2]. In addition to cataloguing documents by keywords, these search engines also attempt to label the concept of each document by analyzing the keywords, or component words, in some way. The words that compose the user’s query are then categorized in the same way as the indexed documents. The goal is to return pages whose concepts match the concept of the user request. These types of search engines, while better than their predecessors, are still far from perfect. The problem of algorithmically determining the concept or topic of a piece of online text is complex and remains open.

A third, more recent genre of search engines attempts to directly answer users’ questions by keeping a database of common questions and linking them to web sites that
directly answer these questions. For example, if a user asks about the weather, the search engine returns a web site that gives local and national weather reports. If a question is not found in the database, or if the user request is not in the form of a question, then a list of web sites is returned using a combination of the previously discussed approaches (keyword search or concept search). The search engines in this category represent a step in the right direction because they attempt to directly answer a user’s question, if possible. Their current limitation is that if the request is not common or generic enough, the results will be the same as those of normal search engines: a sizable list of documents that must be sifted through by the user.

1.3 An Ideal Web Searching Solution

All of the approaches discussed in the previous section have the same purpose: to provide a mechanism by which the end user can easily and efficiently retrieve the online information desired. In essence, the goal is to create an abstraction layer that hides the vastness of the Web from the user, giving only a narrow, detailed view that is tailored to the user’s interests.

The latest research trying to achieve this level of abstraction includes question answering (QA) systems. These systems attempt to directly answer a user’s question, rather than returning a list of relevant documents that the user must search to find the answer. An ideal QA system would be able to take a natural language question as input and return a specific and correct natural language answer. Obviously, there are major problems to overcome in achieving this ideal, and the current research is still trying to define them, let alone solve them.

One version of an ideal QA system is illustrated in Figure 1.1. The process begins with the user posing a natural language question (a). The question is parsed by a Natural
Language Parser (b), which transforms it into a format understood by the Web Query Generator (c) and the Information Base Query Generator (f).

[Figure 1.1. An ideal QA system. Modules: User (a), Natural Language Parser (b), Web Query Generator (c), Web (d), Information Base Constructor (e), Information Base Query Generator (f), Information Base (g), Natural Language Generator (h). Arrow labels: natural language question, parsed question, Web query, relevant documents, query, answer, natural language answer.]

The Web Query Generator takes the new representation of the user’s question and searches the Web (d), returning a set of documents containing relevant information. Then the Information Base Constructor module (e) takes the retrieved documents and creates an information base (g).¹

¹ The term information base is used to keep the description of this module generic. The module could be in the form of a database, a knowledge base, or any other suitable structure. Chapter 5 discusses the use of this term further.

This information base is a repository for all the retrieved information that is relevant to the user’s question. The information is in a form that can be queried by the Information Base Query Generator (f), which generates a query that retrieves the exact
answer to the user’s original question. Next the answer is sent through the Natural Language Generator (h), which transforms it into natural language (or some other appropriate format), and finally, the user receives the answer.

1.4 A Realistic Question Answering System

This thesis works towards creating a system similar to the one outlined in Figure 1.1. It is necessary, however, to start with a simpler, more realistically achievable system, which acts as a first step in the progression towards the ideal. Figure 1.2 illustrates this prototype system.

[Figure 1.2. A simpler QA system. Modules: User (a), Natural Language Parser (b), Information Base Query Generator (c), Manually Created Information Base (d), Intelligent Interface (e). Arrow labels: natural language question, parsed question, query, relevant information, formatted answer.]

In this system, the user, (a), asks a natural language question, which is parsed by the Natural Language Parser, (b). The output of the Natural Language Parser is an XML (Extensible Markup Language) structure² labeling the parts of speech of the words in the question. The Information Base Query Generator, (c), takes this structure and generates a query, using XQL, an XML query language. This query is applied to the Information Base, (d), to return the desired information, which is also written in XML. Finally the retrieved, relevant information, which is encased in special XML tags, is formatted by the Intelligent Interface module, (e), and presented back to the user.

² For more information on XML, see section 3.1 of this thesis.

One important difference between the two systems of Figures 1.1 and 1.2 is that the domain of the second system has been reduced from the entire World Wide Web (the domain of the ideal system) to a much smaller set of web pages: those describing the Graduate Program of the University of Florida’s Computer and Information Science and Engineering (CISE) Department. The Web Query Generator, module (c) from Figure 1.1, is not present in Figure 1.2 because the prototype system substitutes the graduate web pages for the dynamic output of the ideal system’s web search.

The second significant difference is that module (e) of the ideal system, the Information Base Constructor, is missing in the prototype system. In the implementation of the prototype system, the information base is constructed manually, in an attempt to define and understand the building process so that it can eventually be automated as module (e) of the ideal system.

The two systems in Figures 1.1 and 1.2 are referred to as the Ideal and the WebNL systems, respectively, for the remainder of this thesis. The WebNL system, which stands for Web Natural Language, is composed of parts (b), (c), and (d) of Figure 1.2, and is an implemented first step towards the Ideal system.³

³ As of the writing of this thesis, modules (c), (d), and (e) of Figure 1.2 have been implemented. Module (b) has not yet been completed. The completed modules were tested by manually producing the output of (b), and sending it through the system. Currently, the output of each module must be manually passed to the next. The fully operational online system is expected to be complete in Spring 2002.
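
The query step of Figure 1.2 can be illustrated with a minimal sketch. It rests on several assumptions: XPath from the standard Java library stands in for XQL, the parsed question is reduced to a plain list of keywords, and the element names, attribute names, and placeholder text are hypothetical rather than taken from the XMLIB DTD. Keywords are assumed to contain no quote characters.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class InfoBaseQuerySketch {

    // Hypothetical information base fragment; the names and the placeholder
    // text are invented for illustration and are not the thesis's XMLIB.
    static final String INFO_BASE =
        "<infobase>"
        + "<section title='Core Courses'>"
        + "<item keywords='core courses list'>Placeholder: list of core courses.</item>"
        + "<item keywords='core courses grade'>Placeholder: grade rule for core courses.</item>"
        + "</section>"
        + "<section title='Admission'>"
        + "<item keywords='admission deadline'>Placeholder: admission deadline.</item>"
        + "</section>"
        + "</infobase>";

    // Query generator stand-in: one XPath predicate per keyword, so an item
    // matches only if its keywords attribute contains every keyword.
    static String buildQuery(List<String> keywords) {
        StringBuilder query = new StringBuilder("//item");
        for (String keyword : keywords) {
            query.append("[contains(@keywords, '").append(keyword).append("')]");
        }
        return query.toString();
    }

    // Applies the generated query to the information base and returns the
    // text of each matching element, in document order.
    static List<String> ask(List<String> keywords) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList hits = (NodeList) xpath.evaluate(
                buildQuery(keywords),
                new InputSource(new StringReader(INFO_BASE)),
                XPathConstants.NODESET);
            List<String> answers = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                answers.add(hits.item(i).getTextContent());
            }
            return answers;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("query failed", e);
        }
    }

    public static void main(String[] args) {
        for (String answer : ask(Arrays.asList("core", "grade"))) {
            System.out.println(answer);
        }
    }
}
```

The substring test used here is deliberately crude (a keyword such as "core" would also match "score"); it is only meant to show how a structured query narrows the XML to the relevant portion before any formatting takes place.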

The Information Base module of WebNL is the topic of Chapters 4, 5, and 6 of this thesis. Modules (b), (c), and (e) of WebNL have been implemented by other students in the research group, and they are discussed in corresponding theses.⁴

⁴ The Natural Language Parser is not yet completed. See (Pridaphattharakun 2001) [3] for the Query Generator, and (Antonio 2001) [4] for the Intelligent Interface.

1.5 The XML Information Base and Explorer

Module (d) of Figure 1.2 is referred to as the XML Information Base (XMLIB) henceforth. The XMLIB contains information from the CISE graduate web pages, encoded in XML. The XMLIB represents WebNL’s restricted domain. Its structure allows the query generator to retrieve only those portions of XML that are relevant to the user’s question, which are then decoded, processed, and displayed to the user by the Intelligent Interface module. Together the XMLIB and the other modules implement the QA system of Figure 1.2.

Also implemented for this thesis is a Java servlet⁵ called XML Explorer (XMLEX). This program can interactively explore XML files (i.e., open and close elements, view the children of an element, etc.). It provides an alternative way for a user to look through the files comprising the XMLIB (or any XML file). Rather than reading through the original web pages, or posing a question to the QA system, a user can simply “surf” his/her way to the desired information, using the XML element names as guides. Chapter 7 discusses XMLEX.

⁵ For more information on Java servlets, see section 3.2 of this thesis.

1.6 Summary and Road Map

This chapter discussed the problem of searching for information on the World Wide Web, which continues its exponential growth in both size and complexity.
Currently available solutions were discussed: keyword search engines, concept-based search engines, and first-generation QA systems. An ideal web searching solution was then described, and finally a description of a prototype QA system was given. This prototype acts as a first step toward the goal of the ideal QA system. This thesis in particular discusses the information base module, XMLIB, and a peripheral tool, XMLEX.

Chapter 2 discusses previous work that influenced the design of the XMLIB, and previous QA systems that inspired the design of WebNL. An overview of the technologies used in the creation of the XMLIB and XMLEX is given in Chapter 3. Chapters 4, 5, and 6 describe the design goals, implementation, and results of the XMLIB, respectively. Chapter 7 discusses XMLEX. Finally, Chapter 8 reports conclusions and proposed future work, including what steps should be taken next to get closer to the Ideal system.

CHAPTER 2
PREVIOUS WORK

2.1 The Need for Question Answering

Conferences such as the Text Retrieval Conference (TREC) show that the research community recognizes the potential of QA systems. The following quote from the QA Track Specifications¹ in the TREC community illustrates this point:

    Current information retrieval systems allow us to locate documents that might contain the pertinent information, but most of them leave it to the user to extract the useful information from a ranked list. This leaves the (often unwilling) user with a relatively large amount of text to consume. There is an urgent need for tools that would reduce the amount of text one might have to read in order to obtain the desired information. This track aims at doing exactly that for a special (and popular) class of information seeking behavior: QUESTION ANSWERING. People have questions and they need answers, not documents. Automatic question answering will definitely be a significant advance in the state-of-art information retrieval technology. [5]

¹ The QA Track Specifications is a document published by the TREC community that details the type of questions that should ideally be answerable by a QA system, and lists an expansive set of actual questions that systems should experiment with.

Many QA systems have been designed, each with its own approach and set of limitations. The research community has yet to produce a satisfactory general purpose QA system comparable to the Ideal system discussed in Chapter 1. Moreover, Figure 1.1 illustrates just one of many possible architectures for an ideal QA system, namely the one that the WebNL system is working towards. Each QA research project has its own version of an ideal system as its goal. The next three sections discuss the motivation for WebNL, as well as other QA systems that inspired it and the Ideal system.

2.2 Motivation for WebNL

The graduate coordinator and graduate office of the University of Florida’s Computer and Information Science and Engineering (CISE) department are constantly fielding questions about departmental rules, requirements, and so on from current and prospective students. The graduate brochure, which answers most of the commonly asked questions, is available on the CISE department web site as a collection of HTML (HyperText Markup Language) files. Although these web pages² are publicly available for perusal, students tend to ignore them and instead email questions directly to the graduate coordinator. These students are following the trend noted in the quote above: they want direct answers to their questions, and are not interested in reading through the graduate web pages. It became apparent that this situation could benefit greatly from a QA system, so the WebNL system was conceived, using the graduate web pages as its restricted domain.

² This thesis refers to the HTML files that make up the CISE graduate brochure as the graduate web pages henceforth, available at http://www.cise.ufl.edu/~ddd/grad [6].

2.3 Inspiration for the Ideal QA System

The Ideal system is partially inspired by the SMART IR system [7] of AT&T’s Shannon Laboratory. This system uses a combination of natural language processing (NLP) and information retrieval (IR), and focuses on returning a ranked list of direct answers to a user’s question, as opposed to a list of documents that may contain the answer. The SMART IR system shares the same goal as the Ideal system, and the two have certain steps in common, including the parsing of a natural language question, the retrieval of relevant documents, and the extraction of a direct answer from these documents. The SMART IR system does not build an information base from the retrieved
documents, however. Moreover, the correct answer to the query only appears in the top five returned answers 46% of the time, much lower than the desired 100% accuracy rate of the Ideal system.

2.4 Inspiration for WebNL

The most significant differences between WebNL and the Ideal system are as follows:

1. WebNL only covers the restricted domain of the graduate web pages.
2. Resulting from (1) above, WebNL does not dynamically retrieve documents from the Web. The graduate web pages are assumed to be the output of the document retrieval step.
3. The information base is constructed manually in WebNL rather than automatically.

In addition to influencing the design of WebNL, work by Claire Cardie et al. [8] at Cornell University heavily influenced the decision to implement a QA system on a restricted domain, using a static collection of text as the only source of possible answers to a user’s query. The Cornell QA system retrieves passages of documents that are deemed relevant, then processes these passages using special linguistic filters to provide an answer. The passages are retrieved based on a combination of shallow syntactic analysis and knowledge-based semantic analysis. WebNL uses similar techniques to help choose which portions of text to return.³ Again, the Cornell system is an encouraging start, but it returns a correct answer in the first five attempts only about 53% of the time, very similar to the performance of the SMART IR system described in the previous section.

³ Chapter 5 discusses the implementation further.

The PLANES system [9] from the University of Illinois and the BusTUC system [10] from the University of Trondheim are both satisfactory restricted-domain QA systems that also influenced WebNL’s design, particularly the natural language parser and the query generator. Both of these systems answer natural language questions related to their respective domains. PLANES deals with aircraft flight and maintenance data, and BusTUC deals with bus routes for the city of Trondheim, Norway. Although both of these systems get their information from databases rather than a collection of documents, the natural language parsing and query generating techniques are applicable to WebNL’s domain.

2.5 Knowledge and Information Representation

Although the topic of this thesis lies within the framework of the entire WebNL system, the focus is the information base, implemented as a set of XML files, called the XMLIB (XML Information Base). The XMLIB has several parallel goals, discussed in more detail in Chapter 4. The most important requirement for the XMLIB is that it must represent and encode the information contained in the graduate web pages. The representation must allow the information to be queried or searched so that the desired portions of text can be retrieved.

When designing any kind of repository for information or knowledge that must be processed, the primary question is how to encode or represent it. Many mechanisms are currently used for knowledge and information representation. Propositional logic uses variables to represent known facts in the domain. Logic inference rules can then be used to create more complex logic sentences that represent derived knowledge or facts. Predicate (first-order) logic is an extension of propositional logic that allows for generalizing and quantifying truth statements. Both systems are simple and easy to use, but they become cumbersome as the domain grows larger. A more in-depth discussion of logic is given in The Engineering of Knowledge-Based Systems by Gonzalez and Dankel [11].

Another branch of knowledge representation uses laws and theories from statistics to manipulate facts in the domain [12]. Unlike in logic, statements can be assigned truth values that are not just true or false (zero or one), but can be any real number, usually falling in the range from zero to one. These systems are able to model real-world situations more accurately by giving a confidence level to the derived facts, but they still rely on the same basic structures that logic systems use.

More influential to the design and implementation of the XMLIB are semantic nets and frames. These two closely related mechanisms use attribute-value pairs to describe the facts (or knowledge) in a domain. “A frame system consists of a collection of objects, each of which consists of slots and values for these slots . . . semantic nets are little more than another way of representing the information in a frame system” [12, pp. 255-257]. A frame system’s slots and slot-values (analogous to attribute-value pairs) can be directly translated to a semantic net. Each object in a frame system corresponds to a node in a semantic net, and each slot corresponds to an arc. Figure 2.1 illustrates a simple frame system, and Figure 2.2 shows the corresponding semantic net.
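The correspondence just described (objects become nodes, slots become arcs) can be sketched in a few lines of Python. This is only an illustration and not part of WebNL: the dictionary layout and the function names `frames_to_semantic_net` and `nodes_of` are assumptions made for this sketch, and the data is transcribed from Figure 2.1.

```python
# Frame system of Figure 2.1: each object maps its slots to slot values.
# The dictionary layout is an assumption made for this illustration.
FRAMES = {
    "COP 5555": {"instance-of": "Core Course", "topic": "Programming"},
    "COT 5405": {"instance-of": "Core Course", "topic": "Algorithms"},
    "Core Course": {"subclass-of": "Course", "description": "Mandatory"},
    "Course": {"purpose": "Instruction"},
}


def frames_to_semantic_net(frames):
    """Turn every (object, slot, value) into a (node, arc label, node) triple."""
    arcs = []
    for obj, slots in frames.items():
        for slot, value in slots.items():
            arcs.append((obj, slot, value))
    return arcs


def nodes_of(arcs):
    """Collect every node that appears at either end of an arc."""
    nodes = set()
    for source, _label, target in arcs:
        nodes.update((source, target))
    return nodes


ARCS = frames_to_semantic_net(FRAMES)
for source, label, target in ARCS:
    print(f"{source} --{label}--> {target}")
```

Each printed line is one labeled arc of the semantic net. Slot values such as "Mandatory" become nodes in their own right, which is why the net has more nodes than the frame system has objects.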

Figure 2.1. A simple frame system

    COP 5555
        Instance-of: Core Course
        Topic: Programming
    COT 5405
        Instance-of: Core Course
        Topic: Algorithms
    Core Course
        Subclass-of: Course
        Description: Mandatory
    Course
        Purpose: Instruction

Figure 2.2. A simple semantic net

    COP 5555 --instance-of--> Core Course
    COP 5555 --topic--> Prog. Lang.
    COT 5405 --instance-of--> Core Course
    COT 5405 --topic--> Algorithms
    Core Course --subclass-of--> Course
    Core Course --description--> Mandatory
    Course --purpose--> Instruction

Information contained in a frame system or a semantic net can just as easily be translated into XML, using tags as the objects and the values of the tags as the slot values. This is exactly what is done in the XMLIB: the graduate web pages are viewed as a frame system consisting of objects with attributes, and this frame system is encoded using XML syntax.

WordNet [13] is an important semantic net whose organization is influential in the design of the XMLIB. Containing a representation of the words of the English language, WordNet is involved in many natural language processing and information retrieval research projects.

    WordNet is an on-line lexical reference system whose design is inspired by current psycholinguistic theories of human lexical memory. English nouns, verbs, and adjectives are organized into synonym sets, each representing one underlying lexical concept. Different relations link the synonym sets. [13, p. 1]

Although the WebNL system does not currently use any part of the WordNet database, its overall importance in the field of natural language question answering merits mention here.

Related to WordNet is MindNet, a “lexical knowledge base constructed automatically from the definitions and example sentences in two machine-readable dictionaries (MRDs)” [14, p. 1098]. MindNet is relevant because it automatically builds its knowledge base from an online source. This is the task of the Information Base Constructor module of the Ideal system, but with web pages as the source of information rather than MRDs. The XMLIB, being manually created, serves the purpose of identifying the major steps that the Information Base Constructor would have to take.

A conceptual graph (CG) is another way to represent knowledge or information. The WebKB system of Griffith University [15] uses CGs to model the information in web-accessible documents. This system is working towards interpreting online documents from a different angle than WebNL: the designers argue against using XML and other XML-based languages like RDF (Resource Description Framework) to represent knowledge, and propose that CGs provide a simpler, more elegant solution.

2.6 Text Annotation and XML

The information stored in the XMLIB can be queried because it is annotated with special XML tags. WebNL’s query generator uses these tags to find the desired pieces of information and then return them to the user. One of the main purposes of the XMLIB is to provide this annotation.

The decision to structure WebNL’s information base as a body of annotated text was inspired by the Osirix system [16], which uses the idea of annotated text to build and search through corporate memories. Osirix uses special XML tags to search through the constructed corporate memories, a process analogous to how WebNL’s query generator searches through the XMLIB.

XML was chosen as the language used to implement WebNL’s information base because text annotation is required to perform any searching or querying. Numerous other information extraction and natural language processing projects have used XML successfully, primarily because it allows user-defined tags, permits arbitrary levels of nesting, and provides a natural object-oriented environment. Chapter 3 gives a more detailed discussion of XML and its advantages and abilities.

2.7 Summary

This chapter presented an overview of the state of QA systems and of previous work that influenced WebNL in general or the XMLIB specifically. The motivation for creating the WebNL system was also described. Several knowledge representation techniques were then discussed, along with how they relate to or influence the XMLIB. Finally, text annotation as a means for information representation and text retrieval was discussed, along with reasons for using XML as the annotation tool.

CHAPTER 3
UNDERLYING TECHNOLOGY

3.1 XML

3.1.1 Motivation

The World Wide Web is a collection of documents, files, and programs that are available on an interconnected network of computers: the Internet. The most common online documents are HTML (HyperText Markup Language) files. HTML allows documents to be uniformly displayed (at least in theory) by different web browsing programs. HTML defines how a file looks on the computer screen; it is display-oriented rather than content-oriented. As online documents become ever larger and more complex, however, the limitations of HTML become more apparent. Web content providers have begun to realize that HTML does not provide the “extensibility, structure, and data checking needed for large-scale commercial publishing” [17].

HTML uses a predefined set of tags, the majority of which identify how to display text. Users cannot define their own tags to logically or semantically structure their data, so this kind of extensibility is missing. HTML does not support the arbitrary levels of nesting that are needed to model complex data or hierarchies. Finally, HTML does not support validation: applications are not able to check data for structural validity.

The World Wide Web Consortium (W3C) [18], an organization responsible for developing and standardizing web-related technologies, designed XML (Extensible Markup Language) to address HTML’s limitations. Both HTML and XML are derived from a vastly more complex markup language called SGML (Standard Generalized Markup Language). SGML contains all the important qualities discussed above that are missing from HTML: extensibility, structure, and validation. Because HTML needed to be as simple as possible, it did not inherit these qualities, only borrowing SGML’s syntax. XML, in response to HTML’s shortcomings, does include these qualities, while excluding many other SGML constructs and options that are not relevant to web applications.

3.1.2 Extensibility

XML, HTML, and SGML are all markup languages. Markup languages utilize tags that wrap around basic text to allow sections to be labeled or annotated. A tag usually consists of an opening and closing pair surrounding a particular piece of text. The tag specifies how the enclosed text (or other data) is to be handled by the processing application. For example, the HTML <TITLE> tag tells a web browser to display the enclosed text as the title of the document. Tags can also have associated attributes, which are qualities of the tag that can be specified by the user. Some attributes have only a small set of valid values, while others can be assigned any value.

HTML’s predefined set of tags, as mentioned above, cannot be extended. In contrast, both SGML and XML allow users to define their own tags and attributes for those tags. This nature of XML, being an extensible markup language, allows users to annotate their text and create processing applications that recognize these tags and perform corresponding functions.

XML can be used in many domains where HTML is inappropriate. For example, an industry or commercial entity can define a set of XML tags and specify what type of data can be within each tag and how the data are to be treated. Protocols are easily developed using XML. Different views of the same data can be created by applications that filter and organize the tags in a particular fashion.
In this way, XML can be useful in data processing as well as document processing, opening up an entire new realm of possibilities not available with HTML.

An element in XML is an opening/closing pair of tags together with the data inside the tags. A tag defines the beginning or ending of an element. Although the two terms are sometimes used interchangeably, there is a subtle difference. Figure 3.1 illustrates simple XML syntax. The figure consists of two XML elements, the top representing a student and the bottom representing a course. These elements are in the form <tag_name>text</tag_name>. All XML tags are denoted using the “<” and “>” characters. The closing tag has the same name as the corresponding opening tag, preceded by a slash.

Figure 3.1. Two simple XML elements

    <STUDENT id="123" gpa="3.5">John Doe</STUDENT>
    <COURSE number="COT5405">Algorithms</COURSE>

Attributes are listed in the opening tag with their corresponding values as strings. After the opening tag is the actual text or data of the element and, finally, the closing tag. The first element in Figure 3.1 represents the student John Doe, whose student ID is 123 and whose grade point average is 3.5. The bottom element represents an Algorithms course, which corresponds to the course number COT5405.

3.1.3 Structure

XML elements can contain more than just simple text. Elements can be nested, meaning one element can contain other elements in addition to text or other data. If an element contains one or more other elements, it is called a parent of the enclosed child or children elements. These children elements can contain children of their own, and so on, down to any level. Two elements that have the same parent element are called siblings. A group of XML elements naturally forms a tree structure; hence the nomenclature borrowed from data structures.
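The parent, child, and sibling relationships described above can be explored with Python’s standard `xml.etree.ElementTree` module. The following is a minimal sketch rather than anything taken from WebNL: the <DEPARTMENT>, <COURSES>, and <STUDENTS> wrapper elements and the helper names `parent_map`, `siblings`, and `outline` are assumptions chosen for illustration, while the <STUDENT> and <COURSE> elements reuse the data of Figure 3.1.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the wrapper elements are invented for this sketch.
XML_TEXT = """<DEPARTMENT>
  <COURSES>
    <COURSE number="COT5405">Algorithms</COURSE>
    <COURSE number="COP5555">Programming Language Principles</COURSE>
  </COURSES>
  <STUDENTS>
    <STUDENT id="123" gpa="3.5">John Doe</STUDENT>
  </STUDENTS>
</DEPARTMENT>"""


def parent_map(root):
    """Map each element to its parent (ElementTree only stores child links)."""
    return {child: parent for parent in root.iter() for child in parent}


def siblings(element, parents):
    """Return the other children of the element's parent."""
    parent = parents.get(element)
    if parent is None:  # the root element has no parent and no siblings
        return []
    return [child for child in parent if child is not element]


def outline(element, depth=0):
    """List the tag names of the tree, indented by nesting depth."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(XML_TEXT)
parents = parent_map(root)
first_course = root.find("COURSES/COURSE")
student = root.find("STUDENTS/STUDENT")

print("\n".join(outline(root)))
print(student.text, student.attrib)
```

The indented outline makes the tree shape visible, and `siblings(first_course, parents)` returns the second <COURSE> element because both share the <COURSES> parent.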
The ability to create tag names, together with the ability to enclose elements within other elements, allows XML to model arbitrarily complex systems. The structure of XML lends itself particularly well to object-oriented design. Elements can be thought of as objects, and children elements can represent either attributes of the parent or more specific versions of the parent.

Figure 3.2 shows an XML element representing a set of university courses. The <COURSES> element¹ has three <COURSE> children elements. Each <COURSE> element has <NAME> and <NUMBER> children. None of the elements have attributes. Each <COURSE> element contains information specific to a particular course. Other container elements can be imagined: <STUDENTS> and <LECTURERS> could hold information concerning students and lecturers. All three container elements could have a parent element called <DEPARTMENT>. XML allows information to be organized into a hierarchy that naturally fits the given domain.

¹ XML elements are shown in capital letters surrounded by opening and closing brackets.

Figure 3.2. XML elements representing computer science courses

    <COURSES>
      <COURSE>
        <NAME>Analysis of Algorithms</NAME>
        <NUMBER>COT5405</NUMBER>
      </COURSE>
      <COURSE>
        <NAME>Programming Language Principles</NAME>
        <NUMBER>COP5555</NUMBER>
      </COURSE>
      <COURSE>
        <NAME>Computer Architecture Principles</NAME>
        <NUMBER>CDA5155</NUMBER>
      </COURSE>
    </COURSES>

3.1.4 Validation

All XML must be well-formed, and XML that is checked against a document definition must also be valid. A well-formed section of XML follows these syntactic rules:

1. Each opening tag must be matched with a closing tag.
2. Elements may not overlap. An element whose opening tag lies within another element is a child of that element. The child element’s closing tag must appear before the closing tag of its parent.

Both rules are optional in HTML but required in XML. The first rule is rather simple, but the second needs further explanation.
Part (a) of Figure 3.3 shows two elements overlapping each other (which is not allowed in XML), while part (b) shows two that are correctly nested and do not overlap. Together the two rules above define a well-formed XML document. They also simplify parsing, a task that any processing application must perform. An application knows that an element ends with a corresponding closing tag, with any children elements fully enclosed in the parent.

Figure 3.3. Overlapping and non-overlapping elements

    (a) <STUDENT><NAME>John Doe</STUDENT></NAME>
    (b) <STUDENT><NAME>John Doe</NAME></STUDENT>

If a closing tag is missing, or a parent’s closing tag appears before a child’s closing tag, the processing application can assume that the XML data are in error and can act appropriately.

All the XML shown so far, with the exception of Figure 3.3 part (a), is well-formed but not valid XML. Valid XML must conform to a user-defined set of rules specifying which tag names can be used, how elements can be nested, what attributes each element can have, and what data can be within each element. Currently, there are many languages that can be used to define this set of rules. Two widely used languages are XML Document Type Definition (DTD) and XML Schema. Both languages are recommended by the W3C. XML DTD was the first document definition language for XML to appear, being a subset of the SGML DTD language. XML Schema is meant to supplement and eventually replace XML DTD. The main difference between the two languages is that XML Schema uses XML syntax to describe valid structure, while XML DTD uses its own syntax and constructs. XML DTD is a far simpler language than XML Schema, but lacks the expressive power of XML Schema. XML Schema defines many more data types than XML DTD’s simple PCDATA (parsed character data) and lets users combine data types.
For a full comparison of XML DTD, XML Schema, and four other XML document definition languages, refer to the comparative analysis done by Lee and Chu [19].

Although XML Schema is a richer language, it is newer than XML DTD. At the beginning of the implementation of WebNL, particularly that of the XMLIB, XML Schema was still under development and testing. An official recommendation for XML Schema had not been given by the W3C, whereas XML DTD was fully supported, with many existing tools that could validate XML based on a specific DTD.² XML Schema only became officially recommended by the W3C in the middle of the XMLIB’s implementation. For these reasons, the XML files comprising the XMLIB use a DTD for validation, rather than a schema written using XML Schema.

² XML DTD is a language that defines valid XML structures. A particular set of definitions written using this language is referred to as a DTD (Document Type Definition).

A DTD is a set of statements that define XML elements. This definition includes the names of the elements, any attributes the elements may have, what kind of data the elements can hold, and which children may be present. The specification of children identifies which elements can be children, how many of each type can appear, and the order in which these children must appear in the parent element.

A DTD can be in a separate file or it can be a section at the top of an XML file. If a DTD appears at the top of an XML file, the XML that follows uses that DTD for validation. If a DTD is in a separate file, an XML file references it using a doctype statement having the following format:

    <!DOCTYPE root_element SYSTEM "DTD_file">

An XML file must have a single element at the highest or root level. All other elements in the file must be descendants of the root element. The doctype statement above specifies the root element of this particular XML file (root_element) and the DTD file (DTD_file).
Figure 3.4 illustrates a sample DTD and an XML file using it.

Figure 3.4. A DTD and a corresponding XML file

(a)

    <?xml version="1.0" encoding="UTF-8"?>
    <!ELEMENT COURSES (COURSE*)>
    <!ATTLIST COURSES department CDATA #REQUIRED>
    <!ELEMENT COURSE (NAME,NUMBER)>
    <!ELEMENT NAME (#PCDATA)>
    <!ELEMENT NUMBER (#PCDATA)>

(b)

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <!DOCTYPE COURSES SYSTEM "courses.dtd">
    <COURSES department="Computer Science">
      <COURSE>
        <NAME>Algorithms</NAME>
        <NUMBER>cot5405</NUMBER>
      </COURSE>
      <COURSE>
        <NAME>Operating Systems</NAME>
        <NUMBER>cop5615</NUMBER>
      </COURSE>
    </COURSES>

Part (a) of the figure shows the DTD defining the structure of a <COURSES> element. The first line of both the DTD and the XML file is an XML declaration line specifying the XML version and the character encoding scheme. The XML file’s declaration line also includes a standalone declaration specifying that the file references a DTD.

The <!ELEMENT> construct is used to define a new element. The second line of the DTD defines <COURSES> as an element containing zero or more <COURSE> elements. The <!ATTLIST> construct is used to list the attributes of a particular element. The third line of the DTD declares that <COURSES> has a required attribute called department that must have a value, specified as character data (CDATA). The next line of the DTD defines the <COURSE> element, which must contain a <NAME> followed by a <NUMBER>. The last two lines define <NAME> and <NUMBER> as simple elements containing only parsed character data (#PCDATA), the basic XML DTD data type.

Part (b) of Figure 3.4 illustrates an XML file referencing the DTD of part (a). In the declaration line, the value of the standalone attribute, “no,” indicates that this file uses an outside DTD for validation. The <!DOCTYPE> declaration specifies the root element of the file, and the local path and filename of the DTD being used for validation.
The second line of the XML file identifies that the root element is a <COURSES> element, and that the DTD being referenced is located in the file named “courses.dtd.” The next line opens the root element³ <COURSES>, specifying department to be “Computer Science.” The <COURSES> element has two <COURSE> children representing an Algorithms course and an Operating Systems course, respectively.

³ The first element in an XML file is the root element. There can be only one root element per file, with all other elements contained within it.

Notice that all children elements are fully enclosed within their parent element, and no closing tags are missing, so this XML file is well-formed. According to the DTD, <COURSES> must only contain zero or more <COURSE> elements, and <COURSE> elements must contain a <NAME> followed by a <NUMBER> and nothing else. The <NAME> and <NUMBER> elements can only have #PCDATA; they cannot have any children elements. The XML file follows all these rules, so it is also valid. If one of the <COURSE> elements were missing a <NUMBER> element, the XML file would still be well-formed, but not valid. An application processing this XML file expects a <NUMBER> element after the <NAME> element, and if it is missing, errors might occur. The purpose of XML’s validation capability is to enable applications to recognize when input XML is invalid or corrupted and to act appropriately to avoid problems.

The main building blocks of a DTD are the <!ELEMENT> and <!ATTLIST> statements. The <!ELEMENT> statement, where element structure is defined, has the following syntax:

    <!ELEMENT name structure>

The structure component is where child elements and data types are listed. In Figure 3.4 (a), parentheses, commas, and an asterisk are used to group children, specify the sequence of children, and require that zero or more child elements be present.
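The distinction between well-formed and valid XML can be made concrete with a short Python sketch. Python’s standard library does not validate against DTDs, so the code below is not a DTD validator: it hand-transcribes the content models of Figure 3.4 (a) into regular expressions over child tag names, which is an assumption of this illustration, as are the names `CONTENT_MODELS`, `is_well_formed`, and `validity_errors`.

```python
import re
import xml.etree.ElementTree as ET

# Content models transcribed by hand from the DTD of Figure 3.4 (a).
# Each is a regular expression over the element's child tag names,
# where every name is followed by one space.
CONTENT_MODELS = {
    "COURSES": r"(COURSE )*",   # (COURSE*)
    "COURSE": r"NAME NUMBER ",  # (NAME,NUMBER)
    "NAME": r"",                # (#PCDATA): no child elements
    "NUMBER": r"",              # (#PCDATA): no child elements
}
REQUIRED_ATTRIBUTES = {"COURSES": ["department"]}  # from the <!ATTLIST> line


def is_well_formed(text):
    """True if the parser accepts the text (tags matched and nested)."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


def validity_errors(element):
    """Return a list of violations of the transcribed rules."""
    errors = []
    model = CONTENT_MODELS.get(element.tag)
    child_names = "".join(child.tag + " " for child in element)
    if model is None:
        errors.append(f"<{element.tag}> is not declared")
    elif re.fullmatch(model, child_names) is None:
        errors.append(f"<{element.tag}> has children [{child_names.strip()}]")
    for name in REQUIRED_ATTRIBUTES.get(element.tag, []):
        if name not in element.attrib:
            errors.append(f"<{element.tag}> lacks attribute '{name}'")
    for child in element:
        errors.extend(validity_errors(child))
    return errors


VALID = ('<COURSES department="Computer Science"><COURSE>'
         '<NAME>Algorithms</NAME><NUMBER>cot5405</NUMBER></COURSE></COURSES>')
MISSING_NUMBER = ('<COURSES department="Computer Science"><COURSE>'
                  '<NAME>Algorithms</NAME></COURSE></COURSES>')
OVERLAPPING = "<STUDENT><NAME>John Doe</STUDENT></NAME>"

for sample in (VALID, MISSING_NUMBER, OVERLAPPING):
    if not is_well_formed(sample):
        print("not well-formed")
    else:
        print(validity_errors(ET.fromstring(sample)) or "valid")
```

The second sample parses without error but is reported as violating the <COURSE> content model, which mirrors the case of a missing <NUMBER> element discussed above. The third sample is the overlapping markup of Figure 3.3 (a) and is rejected by the parser before any validity check can run.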
Combining these and several other symbols allows the user to fully define the structure of each element. Table 3.1 lists the symbols that can be used to specify an element’s structure.

Table 3.1. Symbols for Specifying Element Structure in XML DTD

    Symbol   Symbol type     Description                                                    Example                        Example notes
    |        Vertical bar    Any element named may appear.                                  thisone | thatone              Either thisone or thatone must appear.
    ,        Comma           Requires appearance in specified sequence.                     thisone, thatone               thisone must appear, followed by thatone.
    ?        Question mark   Makes optional, but only one may appear.                       thisone?                       thisone may appear.
    (none)   No symbol       One, and only one, must appear.                                thisone                        thisone must appear.
    *        Asterisk        Allows any number to appear in sequence, even zero.            thisone*                       thisone may be present; multiple appearances (or zero appearances) of thisone are acceptable.
    +        Plus sign       Requires at least one to appear; more may appear in sequence.  thisone+                       thisone must be present; multiple thisone elements may appear.
    ( )      Parentheses     Groups elements.                                               (thisone | thatone), whichone  Either thisone or thatone may appear, followed by whichone.

Source: XML: A Primer, 2nd Edition [20, p. 130].

The <!ATTLIST> statement is a bit simpler, having the following syntax:

    <!ATTLIST element attribute1 type default ... attributeN type default>

Any number of attributes may be specified, each having a name, a data type, and a default. The most commonly used data type is a character string type called CDATA. The default lets the DTD author assign a default value for the attribute if one is not given. The default can also be assigned one of the keywords #REQUIRED, #IMPLIED, or #FIXED. These indicate that the attribute must be present, can be ignored if not present, or must have a specified fixed value if present, respectively.

3.1.5 Further Reading

Only the basics of XML and XML DTD have been described in Sections 3.1.2, 3.1.3, and 3.1.4.
Full coverage of these topics is outside the scope of this thesis. For more information consult XML: A Primer, 2nd Edition [20] and, for the most up-to-date information on XML and data definition languages, see the World Wide Web Consortium [18] at http://www.w3.org.

3.2 Java Servlets

In the late 1990s, Java started to become an extremely popular language for web server programming [21]. Many commercial tools were being marketed to make server-side Java development simpler and more efficient. These tools provided an infrastructure that acted as an abstraction layer for developers, providing interfaces that allowed programming to be done at the application level rather than the socket level. The main problem with these tools was that they were server-specific. Developers had to use different tools depending on the server with which they were working. In response to this problem, JavaSoft (now known as the Java Software division of Sun Microsystems) introduced Java servlets, consolidating the existing tools into a single, generic package. Servlets became a new way to develop modular server-side Java applications.

Servlets allow a developer to extend the functionality of a server. The process is rather elegant. The new functionality is coded using basic Java classes and methods. Then, the code is called via the doGet() or doPost() method of a class extending HttpServlet (protocol-independent servlets instead extend GenericServlet; both classes are part of the Java Servlet API in the javax.servlet packages). Writing a servlet is, for the most part, just like writing a stand-alone Java application that can then be plugged into the server; this is why Java servlets have seen such success and widespread use, particularly in the area of dynamic web content generation.

A complete survey of the capabilities of the Java language and Java servlets is beyond the scope of this section; almost anything that can be done in Java as a stand-alone application can be reworked into a servlet.
Chapters 7 and 8 of this thesis describe XMLEX, a servlet allowing a user to interactively view an XML file’s contents. It parses the input XML file and then dynamically displays it as a tree structure. The user is able to open and close elements as desired by clicking on buttons generated by the servlet. XMLEX is just one example of what can be done with relative ease using Java servlets. For more information on servlet programming, see Java Servlet Programming by Jason Hunter [21].

3.3 Summary

This chapter presented a high-level description of the technologies used within this thesis: XML and Java servlets. The information base of the WebNL system, XMLIB, was implemented with XML; the dynamic XML viewer, XMLEX, is a Java servlet.

XML was described in Sections 3.1.1 through 3.1.4. The first section discussed the motivation behind the emergence of XML. The next three sections discussed XML’s extensibility, structure, and validation capabilities, the three attributes of XML that have contributed to its success. A description of one of many possible XML data definition languages, XML DTD, was then given. The XML files that comprise the XMLIB use a DTD written in XML DTD for validation.

Section 3.2 provided a general introduction to Java servlets, describing their advantages and capabilities. More details of servlet implementation are discussed in Chapter 7 as they relate specifically to the XMLEX servlet.

CHAPTER 4
THE XMLIB – RESEARCH GOALS AND DESIGN PHILOSOPHY

As stated previously in Section 2.2, WebNL is a question answering system for the restricted domain of the web pages comprising the Graduate Brochure, which describes the Graduate Program of the Computer and Information Science and Engineering (CISE) Department at the University of Florida. WebNL retrieves specific pieces of information from these graduate web pages to provide an answer to a user’s natural language question.
Figure 1.2, illustrating the architecture of WebNL, is reprinted below as Figure 4.1 for convenience.

Figure 4.1. The WebNL system
[Figure: (a) User, (b) Natural Language Parser, (c) Information Base Query Generator, (d) XML Information Base (XMLIB), (e) Intelligent Interface. Labeled data flows: natural language question, parsed question, query, relevant information, formatted answer.]

Part (d) of the figure, the XML Information Base (XMLIB), is the repository for the information contained within the graduate web pages. This information is structured using XML so that part (c), the Query Generator, can search for and retrieve relevant pieces of text that, hopefully, answer the user’s question.

4.1 Goals and Requirements

Before the XMLIB is described in detail in Chapter 5, it is useful to identify its research goals and practical requirements. These goals and requirements provide a framework for understanding the XMLIB’s structure and are referenced often in later sections.

WebNL is a simplified version of the Ideal QA system illustrated in Figure 1.1. The difference is that WebNL uses a static set of web pages (the graduate web pages) as its information source, whereas the Ideal system retrieves source web pages dynamically based on the user’s question and builds an information base from those web pages. This naturally restricts WebNL’s domain to the graduate web pages, while a user of the Ideal system can ask a question concerning any subject. Additionally, the graduate web pages share a common HTML markup scheme, using the same tags for the same purposes uniformly, while retrieved web pages in the Ideal system may have different logical or markup formats. The graduate web pages have homogeneous structures, while the Ideal system must build its information base from possibly heterogeneous source documents.
The difficulty with the Ideal system is that the dynamic construction of an information base from a set of web pages gathered during runtime is an open problem. Gathering online documents is relatively trivial, since current search engines already provide this service. Gathering documents that are relevant to a user’s query is more difficult because the search engine must have some notion of the user’s topic, as opposed to simply matching keywords. Constructing from these relevant documents an information base that can be queried is an even more difficult problem. Doing so requires the ability to summarize text, to identify concepts, to group concepts from separate sources, and to structure the information in some intelligent way. So far, no system can perform all of these tasks satisfactorily.

The goal of defining an algorithmic process that can dynamically create a repository of information from multiple, heterogeneous source documents is one that, if it can be reached at all, must be reached step by step. A first necessary step is to understand the process and define what tasks need to be accomplished to complete it. The primary research goal of the XMLIB is to act as this first step, but with a set of source documents (the graduate web pages) that are more homogeneous than heterogeneous. The XMLIB’s manual creation provides insight into at least a portion of what the Information Base Constructor module of the Ideal system must eventually be able to do.

The purpose of WebNL is to provide a testing environment. If WebNL is a satisfactory QA system (albeit on a restricted domain of homogeneous sources), then the structure of the XMLIB is viable, and the steps taken to create it can be generalized and used as guideposts for developing the algorithm that runs the Information Base Constructor module of the Ideal system (the IBC algorithm).
To fully define the IBC algorithm, the question of how to combine information from heterogeneous sources must be answered, among others. The XMLIB’s scope is purposefully limited and does not attempt to answer this question.

A secondary research goal is to assess the feasibility of automating each of the steps involved in the XMLIB’s construction. In other words, if some of the steps can be identified as trivial, then future research can concentrate on the more difficult steps. Ideally, if the XMLIB can be constructed with a set of reasonable tasks, and the QA system of which it is a part is successful, then the IBC algorithm comes that much closer to being a reality.

In addition to these research goals, the XMLIB needs to fulfill certain practical requirements for WebNL to be a functional, satisfactory QA system. The first and most obvious requirement is that the XMLIB must accurately and completely represent the information contained in the graduate web pages. The second and third requirements arise from the fact that the XMLIB is not a stand-alone research project; rather it is just one module of a complete QA system. Therefore, in addition to satisfying the research goals described above and accurately representing the graduate web pages, the XMLIB must also work with WebNL’s Query Generator and Intelligent Interface modules (parts (c) and (e) of Figure 4.1). The XMLIB must be structured so that the Query Generator can search through and retrieve pieces of information. These pieces must be in a format that can be processed by the Intelligent Interface module, which orders the resultant information and presents it to the user in a concise and useful manner.

The effects that these requirements have on the design of the XMLIB are discussed in more detail in Chapter 5. Chapter 6 discusses the results of the research goals and explains how well the XMLIB fulfills its practical requirements.
4.2 Design Philosophy

In light of the research goals discussed above, the overriding design rule of the XMLIB is to keep the steps involved in its creation as simple as possible. A simple step is a task that can be accomplished algorithmically without the use of human knowledge or intelligence. For example, scanning a set of HTML documents for all <TITLE> tags and inserting the enclosed text into a data structure is a simple task. Completing it requires no general or world knowledge, whereas a complex task does. Grouping together all the passages from multiple source documents discussing automobile repair is an example of a complex task (see footnote 1), requiring general knowledge and the ability to label and group concepts. A passage explaining how an airfoil works must be distinguishable from a passage that describes how to replace a flat tire, for instance.

If the phases of the XMLIB’s construction are kept as simple as possible, and if the XMLIB proves to be a useful information base in the context of WebNL, then the IBC algorithm is that much closer to being defined. The strategy, or design philosophy, that guides the construction of the XMLIB therefore attempts to minimize the need for human intelligence and general world knowledge, while fulfilling the practical requirements and research goals discussed above.

The idea behind the XMLIB’s design, inspired by research on term description extraction done by Fujii and Ishikawa [22], is to use the structure inherent in a source document’s HTML markup as a model for organizing the information and data in the information base. The way in which a document author uses HTML tags can provide clues as to how the information contained within the document should be organized, or at least how the author intended it to be organized. For example, HTML’s various heading tags (<H1>, <H2> and so on) are commonly used to separate topics.
Underneath particular headings, separator tags like the paragraph break (<P>) or line break (<BR>) are often used to distinguish concepts or ideas that may differ slightly, but are related because they fall under the same heading tag. The different heading tags provide levels of categorization that can be directly translated into XML structures and inserted into the XMLIB.

Footnote 1: Algorithms exist that use word co-occurrence statistics to identify and group concepts, but the results are currently far from perfect. Someday this kind of task may be simple, but for now it is considered complex.

Heading and separator tags are mechanisms for organizing the information in a single HTML document (see footnote 2). Alternatively, HTML authors may utilize the anchor tag (<A>) as a structuring tool within a single document or among several to achieve a similar organizational effect. The anchor tag allows an HTML author to define hyperlinks connecting to another area of the same document or to an entirely separate one. A document may contain a list of hyperlinks pointing to sub-areas of the current topic, and each of these documents might use hyperlinks to further subdivide the information. In this situation, the anchor tags are playing the same role as heading tags; both are HTML constructs that provide inherent organizational and structural information about the text in which they are embedded.

Structural cues from the graduate web pages drive the construction of the XMLIB and dictate its organization. Fourteen HTML files comprise the graduate web pages. Each file is dedicated to a single topic and is linked to the other files via a navigation bar. Within each file, the text is generally subdivided with heading and separator tags. The separation of topics into several documents and the use of headings within each document are both examples of useful structural cues.
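The translation of heading levels into nested XML structures described above can be illustrated with a minimal Python sketch. It is not the procedure used to build the XMLIB, which was constructed manually. The root name DOCUMENT, the naming rule in tag_name, and the sample HTML are all hypothetical; only the TEXT element name is borrowed from the XMLIB. Each heading opens an element nested according to its level, and each block of text beneath it becomes a TEXT child, which is a simplification of the real XMLIB layout.

```python
from html.parser import HTMLParser
import re
import xml.etree.ElementTree as ET


def tag_name(heading_text):
    """Turn heading text into an XML element name (hypothetical naming rule)."""
    name = re.sub(r"[^A-Za-z0-9]+", "_", heading_text).strip("_").upper()
    # XML names may not start with a digit, so prefix those with an underscore.
    return "_" + name if name[:1].isdigit() else (name or "SECTION")


class HeadingOutline(HTMLParser):
    """Nest text under elements created from <H1>..<H6> headings."""

    def __init__(self):
        super().__init__()
        self.root = ET.Element("DOCUMENT")  # hypothetical root name
        self.stack = [(0, self.root)]       # (heading level, element)
        self.mode = "para"                  # kind of text currently buffered
        self.level = 0
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if re.fullmatch(r"h[1-6]", tag):
            self.flush()
            self.mode, self.level = "heading", int(tag[1])
        elif tag in ("p", "br") and self.mode != "heading":
            self.flush()  # a separator tag ends the previous chunk of text

    def handle_endtag(self, tag):
        if re.fullmatch(r"h[1-6]", tag) or tag == "p":
            self.flush()
            self.mode = "para"

    def handle_data(self, data):
        self.buffer.append(data)

    def flush(self):
        text = " ".join("".join(self.buffer).split())
        self.buffer = []
        if not text:
            return
        if self.mode == "heading":
            # Close every open section at the same or a deeper level.
            while self.stack[-1][0] >= self.level:
                self.stack.pop()
            element = ET.SubElement(self.stack[-1][1], tag_name(text))
            self.stack.append((self.level, element))
        else:
            ET.SubElement(self.stack[-1][1], "TEXT").text = text


def outline_to_xml(html):
    """Return an ElementTree root mirroring the heading structure of html."""
    parser = HeadingOutline()
    parser.feed(html)
    parser.close()
    parser.flush()  # emit any trailing text
    return parser.root


# Invented sample page; it is not taken from the graduate web pages.
SAMPLE = ("<h1>Admission</h1><p>Apply by the deadline.</p>"
          "<h2>Financial Aid</h2><p>Assistantships exist.<p>Fellowships too.</p>"
          "<h1>Faculty</h1>Listed here.")
print(ET.tostring(outline_to_xml(SAMPLE), encoding="unicode"))
```

Under these assumptions the step counts as a simple task in the sense of this section: it relies only on tag structure and needs no knowledge of what the headings mean.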
The XML elements that compose the XMLIB represent chunks of information, usually at the sentence or paragraph level. Some elements are children of others, meaning the information contained in the child is related to the information in the parent. Elements that have common parents (i.e., siblings) contain related information as well, since they are grouped together by the parent. The various heading and separator tags found in the graduate web pages define these inter-element relations. Chapter 5 discusses how the XMLIB models the graduate web pages in detail.

Footnote 2: Heading and separator tags are just two of several types of organizational HTML tags that may be present in a document, any of which can be potentially useful for gleaning structural cues from the source text.

As stated previously, the primary research goal of the XMLIB is to provide insight into the tasks that need to be accomplished by the IBC algorithm. However, the IBC algorithm must be able to construct an information base from web pages whose structure may differ greatly from that of the XMLIB’s source documents, the graduate web pages. Other online documents might have a completely different logical arrangement than the graduate web pages, or they might make use of images, animations, and embedded programs to convey their information, whereas the graduate web pages mainly employ basic text and headings. If the construction of the XMLIB is to be truly useful in its goal of mapping out the IBC algorithm, then a significant issue to be addressed is how closely the structure of the rest of the World Wide Web resembles the structure of the graduate web pages. In other words, the value of the methods used to construct the XMLIB depends on how applicable they are to other sets of web pages.
If a sufficient percentage of web pages and sites have structures similar enough to the graduate web pages that the XMLIB’s construction methods can be applied to them, then the XMLIB has useful and practical research value. Otherwise, the XMLIB is simply a functional information base for WebNL with no generalizable construction methods resulting for the IBC algorithm.

To determine the XMLIB’s applicability to the rest of the Web, an informal survey of web site structure was conducted. Because the Ideal QA system is envisioned as a portal for information gathering, as opposed to commercial consumption, the survey concentrated heavily on university, college, and technical school web sites. One hundred random academic web sites were examined to determine whether they contain some set of web pages whose logical organization and physical structure match or are similar to the graduate web pages. Also included in the survey were approximately 20 commercial sites, to give some comparison to the academic sites. Figures 4.2 and 4.3 illustrate the two types of structures that, if found somewhere on the web site within the first three minutes of looking, were recorded as matching the structure of the graduate web pages.

[Figure 4.2. Information organized using hyperlinks. Diagram: a top-level page whose hyperlinks lead to separate pages for Subtopic1, Subtopic2, ..., SubtopicN.]

[Figure 4.3. Information organized using headings. Diagram: a single page on which the headings Subtopic1, Subtopic2, ... are each followed by their text.]

Figure 4.2 shows a set of web pages whose information is organized via a list of hyperlinks to subtopics. Figure 4.3 shows an arrangement where all the information is on the same physical web page, and topics are differentiated by headings.
These two figures illustrate the two ways, discussed earlier in this section, in which an HTML author can use certain tags (heading, separator, and anchor tags) to structure text. The structure of the graduate web pages is a combination of the two figures; hyperlinks distinguish subtopics located in different pages, and on each individual page headings and separators further subdivide the topic.

The results of the survey reveal that out of the 100 academic web sites examined, 78 have sections whose structures match or are a combination of Figures 4.2 and 4.3, leaving 22 sites that have no matching sections. Of the commercial sites, a lower percentage of matches was observed, as expected (around 55%). While this survey is by no means meant to be exhaustive, and its data was collected through very simple examination of the sites, its encouraging results show that the layout of the graduate web pages is indeed a common one. Therefore, the generalized steps involved in the construction of the XMLIB should be applicable to a large percentage of web pages. Appendix A explains the methodology of the survey in more detail and lists the colleges and universities whose web sites were examined.

4.3 Summary

This chapter presented a description of the research goals and requirements that the XMLIB is attempting to fulfill and the basic strategy employed to construct it. Section 4.1 described the primary research goal of the XMLIB: to act as a first step in the realization of the Information Base Constructor (IBC) algorithm, the algorithm used by the Information Base Constructor module of the Ideal QA system introduced in Chapter 1. The steps taken to manually build the XMLIB help to define the tasks that the IBC algorithm must eventually be able to complete. A secondary goal of the XMLIB is to provide an assessment of the complexity of the methods used in its creation.
Section 4.1 also discussed the practical requirements the XMLIB must fulfill so that the WebNL system is successful. It must accurately and completely represent the information contained in the graduate web pages. WebNL’s Query Generator must be able to search through the XML elements and retrieve specific sections, which must in turn have a format that can be properly processed by WebNL’s Intelligent Interface module.

Section 4.2 described how the XMLIB’s research goals affect its design philosophy. Steps taken in its construction should be as simple to implement algorithmically as possible. The number of tasks that require human intelligence or general world knowledge to complete should be minimized. A design philosophy that attempts to simplify the XMLIB’s construction is to define elements in the XMLIB as chunks of information, usually at the sentence or paragraph level, whose relationships to each other are defined by structural cues found in the source web pages. These structural cues are in the form of certain commonly used HTML tags.

Finally, Section 4.2 addressed the question of how applicable the XMLIB’s construction methods might be if web pages other than the graduate web pages act as the information source. An informal survey shows that 78% of the academic web sites and about 55% of the commercial sites surveyed contain sections that are similarly structured to the graduate web pages, meaning that the XMLIB’s construction methods should be generalizable to a large percentage of web pages.

CHAPTER 5 THE XMLIB – IMPLEMENTATION

5.1 Overview

The XML Information Base (XMLIB) acts as the information repository for the WebNL QA system. It is referred to as an information base because it is neither a database nor a knowledge base in the proper sense of the terms. Databases contain data that must be interpreted via selection criteria in order to produce useful information.
Databases focus on minimizing search and retrieval times, and they generally have the ability to summarize and compare the retrieved data. The fundamental unit of the XMLIB is a sentence or paragraph that conveys useful information on its own, as opposed to a piece of stand-alone data. There are several design considerations in the XMLIB that attempt to reduce search and retrieval time, but this reduction is not a primary goal of the system. Additionally, the XMLIB does not have the summarization or computation capabilities of a database. It cannot count the number of core courses available in the department, for example, even though it contains elements representing each of the core courses.

A knowledge base contains logical representations of rules and can manipulate them in order to reason about its domain and draw conclusions based on given facts. A primary characteristic of a knowledge base is its ability to process rules and even derive new ones; it models human reasoning. Again, the XMLIB has no such capabilities.

The XMLIB is much more passive than either a database or a knowledge base, both of which are programs that can perform functions on the data they contain. The XMLIB is simply a set of XML files whose structures allow other programs, specifically WebNL’s Query Generator and Intelligent Interface modules, to perform their respective functions.

The XMLIB represents information at the sentence/paragraph level, rather than at the individual word level. Many semantic networks attempt to represent information about a domain at the word level, meaning that the fundamental units of the network (the nodes) represent words, and information is conveyed based upon how the words are related to each other. The XMLIB, in contrast to these semantic networks, functions at the sentence level for two reasons. First, individual ideas or pieces of information are naturally communicated in sentences or groups of sentences.
Second, it is simpler to define XML elements based solely on HTML organizational cues from the source web pages if those elements represent sentences or paragraphs. The latter reason illustrates the implementation of the design philosophy discussed in Section 4.2. Most of the HTML tags commonly used to separate discussion topics, especially in the graduate web pages, delineate paragraphs or sentences rather than individual words.

The XMLIB annotates source text with XML elements that delimit different concepts and allow them to be retrieved individually by the Query Generator in order to answer a user’s question. The process by which the Query Generator interfaces with the XMLIB is incrementally explained as the implementation of the XMLIB is described in the following sections.

The XMLIB consists of sixteen files: one XML file for each of the fourteen subsections of the graduate web pages (enumerated in Section 5.2), an XML file that acts as a directory or index for the Query Generator, and a Document Type Definition (DTD) that defines the XML elements appearing in the fifteen XML files. Section 5.2 discusses the DTD, and Section 5.3 describes the XML files that comprise the XMLIB. The steps taken in the XMLIB’s construction are identified throughout these two sections, and an assessment of the difficulty of each is given as they are introduced. The construction process is outlined and summarized in Chapter 6.

5.2 The Document Type Definition

The XMLIB uses a Document Type Definition (DTD) for validation. The elements found within the XMLIB must conform to a configuration defined in the DTD, which is listed in its entirety in Appendix B. The DTD characterizes how different elements of the XMLIB relate to each other, lists any attributes that particular elements may have, and specifies the types of data they can contain. There are two types of elements defined in the DTD: utility and domain.
Utility elements are those that are present as children in most of the other elements, and are not specific to any domain. These elements would appear in the XMLIB no matter what kind of information is being represented, and they would serve the same purpose. The utility elements play a significant role in allowing the Query Generator to search through the XMLIB. Note that the elements themselves are not domain-specific, but their contents generally are. Domain elements are those that represent and label topics, concepts, and objects from the domain, so they necessarily vary depending on the nature of the domain. These elements and their relationships to each other reflect the logical organization of the source information.

5.2.1 The Root Element

The domain element <GRAD_PAGES> is the first element defined in the DTD. This element is the root element of each of the XML files comprising the XMLIB; all other elements in a particular file are children or descendants of it. The <GRAD_PAGES> element may contain only a single child, but this child can be one of fifteen elements, corresponding to the fourteen subsections of the graduate web pages and the directory file. Figure 5.1 shows the definition of the <GRAD_PAGES> element.

Figure 5.1. The root element of the XMLIB: <GRAD_PAGES>

    <!ELEMENT GRAD_PAGES ( DIRECTORY | OVERVIEW | GEN_INFO | ADMISSION |
                           FINANCIAL | MASTERS | ENGINEER | PHD | CONTACTS |
                           UNDERGRAD_PREREQS | CORE_COURSES | FACULTY | LABS |
                           GRAD_COURSES | UNDERGRAD_COURSES )>

The <GRAD_PAGES> element identifies that the enclosed elements are part of the XMLIB, and it is named after the domain the XMLIB represents: the graduate web pages. The element names in the DTD consist of either single words (or informal abbreviations) or groups of words connected with an underscore (“_”) character.
The domain elements represent sections of information from the graduate web pages, and the names of the elements reflect the particular section. These names were chosen by the author to be simple to read by humans and yet still effectively summarize their respective areas (a complex task as defined in Section 4.2). For the most part, however, they could have easily been determined by copying the names of the hyperlinks and paragraph separators from the source HTML (a simple task). Exceptions to this are noted as they appear in the DTD description.

The fifteen possible children of the <GRAD_PAGES> element represent the directory, which acts like an index for the Query Generator, and the fourteen subsections of the graduate web pages: an overview of the graduate brochure, general information about the graduate program, application procedures, financial aid, the Master’s program, the Degree of Engineer program, the Ph.D. program, departmental contacts, undergraduate prerequisite courses, core courses, department faculty, available laboratories and computing resources, and finally, a list of the graduate and undergraduate courses available in the department.

Next the DTD defines a single required attribute for <GRAD_PAGES> that identifies the last revision date of the element: lastRevised. The XMLIB does not make heavy use of attributes. In most cases, appropriate child elements are substituted in place of attributes to simplify the tasks of the Query Generator and Intelligent Interface modules; both programs can concentrate on working only with nested elements, rather than nested elements and attribute lists.

5.2.2 Utility Elements

Definitions of the utility elements appearing in the XMLIB follow the <GRAD_PAGES> definition. These elements are always contained in domain elements; they never exist alone. The names of the utility elements reflect their respective purposes and do not depend on the domain being represented.
They are standard, defining pieces of the XMLIB; any information base built in a similar way and using the same representation techniques would include them. In other words, unlike the names of domain elements, utility element names are a given and do not have to be designated during the dynamic construction of the information base.

The utility elements are introduced in the following paragraphs, and are summarized in Table 5.1 at the end of this section. Concrete examples of these elements at work in the XMLIB are given in Section 5.3.1.

The first utility element is <CW>, standing for component words. This element cannot have any children; it may only contain text data. Every domain element besides the root has a <CW> element as its first child, meaning it always appears immediately after the opening tag of the parent. It contains a list of words that describe the information being represented by the parent. For the most part these component words are nouns in the singular form. There are two reasons for this. The first is that the Natural Language Parser module of WebNL converts the words from the user’s query into their root, singular senses, and the Query Generator uses these forms to build an XML query. Nouns in the original question usually determine the query. Keeping the component words in the <CW> elements as singular nouns therefore simplifies the Query Generator’s job. The second reason is that a group of nouns is generally sufficient to identify a concept or idea. Christiane Fellbaum, an architect of the WordNet system [13, 23], gives insight into this assertion.

Even though grammatical English sentences require a verb though not necessarily a noun, the language has far fewer verbs than nouns. For example, the Collins English Dictionary lists 43,636 different nouns and 14,190 different verbs. Verbs are more polysemous than nouns: the nouns in Collins have on the average 1.74 senses, whereas verbs average 2.11 senses.
The higher polysemy of verbs suggests that verb meanings are more flexible than noun meanings ... the meanings of nouns tend to be more stable in the presence of different verbs. [23, p. 40]

A group of nouns (along with the occasional verb) in a <CW> element satisfactorily identifies the concept that the parent represents. The Query Generator searches through the hierarchy of domain elements, looking for those whose <CW> contents most closely match the set of keywords that represent the user’s query. This is the first of three phases (see footnote 1) that the Query Generator goes through when performing a query on the XMLIB.

The next utility element defined in the DTD is the <CONTENT> element. This element is the second child of every domain element (except the root), behind the <CW> element. It also may only contain text data; no elements can be nested within it. The <CONTENT> element contains a natural language single-sentence description of the contents of its parent element. Its purpose is to be a human-readable guide for anyone perusing the actual XMLIB files. Additionally, while the Query Generator does not use the <CONTENT> element in any way, the Intelligent Interface module employs it to organize the retrieved information and present it to the user. This function is especially useful when the Query Generator returns multiple domain elements, either because the user’s question is vague, or the question can be answered from multiple places within the XMLIB. A list of possibly relevant topics (the multiple <CONTENT> elements) is presented, and the user may choose an individual topic to view by choosing the <CONTENT> element that is most suitable. Choosing a particular <CONTENT> element causes the rest of the information within the parent domain element to be presented.

The third utility element is the <TEXT> element. Again, this element may not have children elements, and contains only text data.
If it is present, it is the third child of the domain element, appearing directly after the <CW> and <CONTENT> children. The <TEXT> element holds the actual text from the particular section of the graduate web pages that its parent represents. Certain domain elements do not correspond to actual sections of source text, so they do not contain a <TEXT> child. The contents of a <TEXT> element are what the Intelligent Interface displays to the user as a final answer to the original query.

Footnote 1: The full querying process is described once all the significant utility elements have been introduced.

Information is represented in the XMLIB by wrapping sections of text from the source web pages in <TEXT> elements, which are in turn children of domain elements that label the topic or concept of the section. These domain elements are then grouped by higher-level domain elements that identify more general divisions of the information, thereby preserving the logical organization (as defined by the authors of the source text) of the domain.

The last significant utility element is the <ROOT_TEXT> element. Like the others discussed so far, it may not contain children, only text. Like the <TEXT> element, not all domain elements have a <ROOT_TEXT> child, but most do. If a <ROOT_TEXT> element is present, it is always the fourth child, corresponding to and directly following its sibling <TEXT> element. It contains the root forms of the significant words appearing in the corresponding <TEXT> element. Significant words are those that, for example, would be capitalized if found in a title. Unlike the <CW> element, <ROOT_TEXT> elements generally have an equal mix of nouns and verbs.

Although both <CW> and <ROOT_TEXT> elements contain lists of significant words concerning their parent’s topic, they serve different purposes.
Every domain element must have a <CW> child, which lists root forms of the words that would appear in a summary of the element’s contents. On the other hand, a <ROOT_TEXT> child, which is not present in every domain element, contains the root forms of each significant word appearing in the corresponding <TEXT> element. Some of these words may be found in multiple, unrelated domain elements, and may not necessarily have any direct relation to the particular domain element. A <ROOT_TEXT> element may contain superfluous words that do not help to specify the contents of its parent, whereas the sole purpose of the <CW> element is to list only those words that together succinctly identify the information contained in the parent domain element.

The four utility elements discussed so far are significant because they play integral roles in enabling the Query Generator and Intelligent Interface modules to do their respective jobs. The Intelligent Interface uses the <CONTENT> and <TEXT> children of the retrieved domain elements to display the results of the user’s query, as explained above. The Query Generator employs a scoring system to determine which domain elements to retrieve, and goes through up to three phases when querying the XMLIB.

In the first phase, <CW> elements are searched, first in the directory file, then in the individual XML files. If a group of <CW> elements score high enough, their parent domain elements are returned to the Intelligent Interface for display to the user, and the querying process ends.

If an acceptable match cannot be found via the <CW> search, then the Query Generator begins the next phase, which involves scanning names of the domain elements themselves. As stated previously, the names of the domain elements reflect the topics they represent, so sometimes this phase can yield useful results when the <CW> search fails. Again, if a suitable domain element is found, it is returned and the process ends.
If the domain tag name search does not produce a satisfactory score, then the third phase begins. This phase is where the Query Generator finally uses the <ROOT_TEXT> elements in the XMLIB. This phase is a last resort and basically boils down to a search among the <ROOT_TEXT> elements for words that match those of the user’s query. Without <ROOT_TEXT> elements, the Query Generator would have to dynamically translate the words in the <TEXT> elements to their root forms (so as to match the root words of the user’s query that the Natural Language Parser produces) and then continue with its scoring system to complete the third phase. During this process many non-informative words like articles and pronouns would be scored. The dynamic translation of words to their root senses and the processing of insignificant words from the <TEXT> elements would greatly increase the response time of a query. The introduction of static <ROOT_TEXT> elements reduces the run time of the system, while only marginally increasing the physical size of the XMLIB on disk.

There are two remaining utility elements that are not as important to the overall functioning of WebNL as the four discussed above, but still require explanation: the <LINK> and <EMAIL_LINK> elements. These elements represent HTTP hyperlinks and mailto protocol links, respectively, that are found throughout the graduate web pages. Both elements have the same structure. They may only contain two children: a <TEXT> element followed by the last utility element defined in the DTD, a <TARGET> element. The <TEXT> element holds the actual text of the link as displayed on the web page, and the <TARGET> element, defined as only containing text data, holds the target URL (Uniform Resource Locator) of the link. The <LINK> and <EMAIL_LINK> elements are defined separately despite having the same structure so the Intelligent Interface can handle them differently.
Otherwise, they are treated equally, being completely ignored by the Query Generator.

Only those domain elements whose corresponding sections of source text include hyperlinks or mailto links have <LINK> or <EMAIL_LINK> children. If they are present, they appear directly after all the other utility elements, but before any sibling domain elements. When the Query Generator returns a domain element containing <LINK> or <EMAIL_LINK> children, the Intelligent Interface displays these, along with the rest of the retrieved information. In this way, the WebNL user is still able to view and traverse all the links that exist in the relevant portions of the original web pages. Including these links in the presentation of the retrieved information makes WebNL more robust and effective as an information provider and question answerer.

Table 5.1 summarizes the seven utility elements discussed in this section. The first column lists the utility elements. The second column shows the structure of the elements, as defined in the DTD. The last two columns describe the elements and their functions. Section 5.2.3 discusses the XMLIB directory file, and Section 5.2.4 describes the definitions of the remaining XMLIB domain elements.

Table 5.1. The XMLIB Utility Elements

Utility Element | Contents | Description | Function
<CW> | (#PCDATA) | List of words that summarize the contents of the parent domain element. | Enables Query Generator to search through the XMLIB.
<CONTENT> | (#PCDATA) | Natural language single sentence description of the contents of the parent domain element. | Helps humans read the XMLIB and aids Intelligent Interface in presenting answers.
<TEXT> | (#PCDATA) | Actual section of text from source pages. | Fundamental unit of information in the XMLIB.
<ROOT_TEXT> | (#PCDATA) | Root forms of significant words found in corresponding <TEXT> element. | Provides alternative method for Query Generator to search the XMLIB.
<LINK> | (TEXT,TARGET) | <TEXT> element followed by <TARGET> element. | Represents HTTP hyperlinks.
<EMAIL_LINK> | (TEXT,TARGET) | <TEXT> element followed by <TARGET> element. | Represents mailto protocol links.
<TARGET> | (#PCDATA) | Target URL of mailto link or hyperlink. | Allows Intelligent Interface to display relevant hyperlinks.

5.2.3 The Directory

Figure 5.1 lists the <DIRECTORY> element as the first of fifteen possible choices for the single child of the root <GRAD_PAGES> domain element. The DTD defines the <DIRECTORY> element after the utility elements but before the rest of the domain elements because it fits into neither category. It is like a utility element because it is not specific to the domain of the graduate web pages. The <DIRECTORY> element would appear in the XMLIB no matter the domain, and would serve the same purpose. Unlike utility elements, it is not used throughout the XMLIB. Rather it is only found in one special file: the directory. The directory file acts as an index for the Query Generator, which examines the directory first to determine if it can narrow its search down to a single subsection of the XMLIB (corresponding to one of the other fourteen child elements listed in Figure 5.1). The directory file and the <DIRECTORY> element are not necessary for WebNL to function, but they reduce the time needed to query the XMLIB. Figure 5.2 shows the <DIRECTORY> definition.

Figure 5.2. The <DIRECTORY> element

    <!ELEMENT DIRECTORY (LISTING*)>
    <!ATTLIST DIRECTORY domain CDATA #REQUIRED>
    <!ELEMENT LISTING (CW,CONTENT)>
    <!ATTLIST LISTING file CDATA #REQUIRED>

The <DIRECTORY> element may contain any number of <LISTING> elements. It also has an attribute, domain, which identifies the set of web pages being represented: the graduate web pages at http://www.cise.ufl.edu/~ddd/grad [6] in this case. There is a <LISTING> element for each of the fourteen XML files that comprise the XMLIB.
Each <LISTING> element has a <CW> child followed by a <CONTENT> child, and a file attribute that identifies the system filename of the corresponding XML file. The <CONTENT> child simply describes for a human reader the sub-area that the corresponding XML file covers. The <CW> child of each <LISTING> element is what the Query Generator scans when searching through the directory.

The first phase of the XMLIB querying process, introduced in the last section, involves searching through the <CW> children of the domain elements. Without the directory, the Query Generator would have to search through and score all the domain elements in the XMLIB. The directory allows the Query Generator to search through just one or two of the XML files, as opposed to all fourteen. The <CW> child of each <LISTING> element in the directory reproduces the component words found in the <CW> children of every first-level and second-level domain element in the corresponding XML file (the top- or zero-level domain element being the <GRAD_PAGES> element, which has no <CW> children). The <CW> elements in the directory provide guidance for the Query Generator because they contain the component words from the top two levels of their respective XMLIB subsections. In other words, the <CW> elements in the directory provide a general picture of the information contained in the corresponding XML file without getting too specific. This allows the Query Generator to easily rule out certain XML files as it scans the directory’s <CW> elements, and narrow its exhaustive search to just a few. Once a particular XML file is chosen based on how well the contents of its directory <CW> element match the keywords of the user’s question, it can be examined in full to find the most specific set of domain elements that answer the user’s question.
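The directory lookup described above can be illustrated with a minimal Python sketch. It is not the WebNL implementation: the scoring rule (fraction of query keywords found in a listing's <CW> child), the 0.5 threshold, and the function names are assumptions made for illustration, and the labs.xml listing contents are invented. Only the core_courses.xml listing is taken from Figure 5.5. The fallback to all files mirrors the behavior the text describes when no listing matches satisfactorily.

```python
import xml.etree.ElementTree as ET

# First listing copied from Figure 5.5; the labs.xml listing is hypothetical.
DIRECTORY_XML = """
<GRAD_PAGES lastRevised="10/08/01">
  <DIRECTORY domain="www.cise.ufl.edu/~ddd/grad">
    <LISTING file="core_courses.xml">
      <CW>core course master master's degree ph.d. doctor philosophy phd ms m.s.</CW>
      <CONTENT>CISE Graduate Program core courses</CONTENT>
    </LISTING>
    <LISTING file="labs.xml">
      <CW>lab laboratory director research</CW>
      <CONTENT>CISE laboratories</CONTENT>
    </LISTING>
  </DIRECTORY>
</GRAD_PAGES>
"""


def score_listing(cw_text, keywords):
    """Assumed scoring rule: fraction of query keywords present in the <CW> text."""
    component_words = set(cw_text.lower().split())
    if not keywords:
        return 0.0
    hits = sum(1 for keyword in keywords if keyword.lower() in component_words)
    return hits / len(keywords)


def select_files(directory_xml, keywords, threshold=0.5):
    """Return the XML files worth searching in full for the given keywords."""
    root = ET.fromstring(directory_xml)
    scored = []
    for listing in root.findall("./DIRECTORY/LISTING"):
        cw_text = listing.findtext("CW", default="")
        scored.append((score_listing(cw_text, keywords), listing.get("file")))
    chosen = [name for score, name in scored if score >= threshold]
    # No satisfactory match: fall back to examining every file in turn.
    return chosen if chosen else [name for _, name in scored]


print(select_files(DIRECTORY_XML, ["core", "course"]))
print(select_files(DIRECTORY_XML, ["parking"]))
```

With the sample directory, the first query narrows the search to core_courses.xml, while the second matches nothing and therefore returns every listed file.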
If none of the directory’s <CW> elements satisfactorily match the keywords from the user’s query, then the Query Generator proceeds to examine all the XML files in turn, just as it would if there were no directory. Section 5.3.1 provides a concrete example of a <LISTING> element from the XMLIB directory file and its <CW> and <CONTENT> children.

5.2.4 Domain Elements

After the <DIRECTORY> element, the DTD begins defining all the domain elements that appear in the XMLIB. The domain elements resemble a tree structure with the root domain element, <GRAD_PAGES>, at the top. Figure 5.1 shows that a <GRAD_PAGES> element may only contain one of fifteen possible child elements. Fourteen of these elements are first-level domain elements, since they are at level one of the tree (the root being at level zero), the fifteenth being the <DIRECTORY> element. The first-level domain elements represent the first division or categorization of the information in the graduate web pages. Each of these first-level domain elements has a set of second-level domain element children, and so on, down to the leaf level of the tree, where the information does not get any more specific and the domain elements do not have any further domain children. Each level represents a further specification of the information. The tree is not necessarily uniform. Some first-level domain elements might have second-, third-, and fourth-level descendents, while another first-level domain element might be a leaf element itself, having no domain children. This models the fact that some subsections of the graduate web pages are more complex than others.

Naming and arranging the domain elements that make up this tree is a primary task in the construction of the XMLIB. A second major task is determining the contents of the children utility elements, whose names and arrangements are standard.
Figure 5.3 illustrates the definition of the first first-level domain element appearing in the DTD: the <CORE_COURSES> element, representing the section of the graduate web pages that discusses the CISE department’s core courses.

Figure 5.3. The <CORE_COURSES> domain element definition
<!ELEMENT CORE_COURSES (CW,CONTENT,TEXT?,ROOT_TEXT?,MASTERS_CORE,PHD_CORE)>
<!ELEMENT MASTERS_CORE (CW,CONTENT,TEXT?,ROOT_TEXT?,LINK?,COURSE*)>
<!ELEMENT PHD_CORE (CW,CONTENT,TEXT?,ROOT_TEXT?,COURSE*)>
<!ELEMENT COURSE (CW,CONTENT,TEXT,ROOT_TEXT?,LINK?,NUMBER?,DESCRIPTION?,PREREQ?)>
<!ELEMENT NUMBER (CW,CONTENT,TEXT,ROOT_TEXT?)>
<!ELEMENT DESCRIPTION (CW,CONTENT,TEXT,ROOT_TEXT)>
<!ELEMENT PREREQ (CW,CONTENT,TEXT,ROOT_TEXT,LINK*)>

The figure shows seven domain element definitions: the <CORE_COURSES> element and its domain children and descendents. As mentioned in Section 5.2.2, <CW> and <CONTENT> utility elements are the first two children of each of the domain elements. When the <TEXT> and <ROOT_TEXT> elements are listed with a question mark in the DTD, it generally signifies that there is no text at that level of specificity in the source web pages, so the element contains no <TEXT> children and is simply a container of multiple, more specific domain elements that do hold information in the form of <TEXT> children.

The <CORE_COURSES> definition in Figure 5.3 lists the utility children and then the domain children: <MASTERS_CORE> and <PHD_CORE>. The two domain children correspond to the two highest-level headings in the core course source web page. Underneath these headings, there are separate subheadings for each of the courses. These multiple course subheadings are represented in Figure 5.3 as the multiple <COURSE> domain children that can appear in the <MASTERS_CORE> and <PHD_CORE> elements.
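To make the occurrence markers in Figure 5.3 concrete, the following Python sketch translates a comma-separated DTD content model into a regular expression and checks whether a given sequence of child tags satisfies it. This is an illustrative helper, not part of WebNL and not a substitute for a validating XML parser; the function names are hypothetical, and the sketch handles only flat sequences with the ?, * and + markers (no choice groups or nesting), which is all that Figure 5.3 uses.

```python
import re


def model_to_regex(content_model):
    """Translate a flat DTD sequence such as '(CW,CONTENT,TEXT?,COURSE*)'
    into a regex over space-terminated tag names."""
    parts = []
    for item in content_model.strip("() ").split(","):
        item = item.strip()
        suffix = ""
        if item and item[-1] in "?*+":
            item, suffix = item[:-1], item[-1]
        # Each tag name is followed by one space so names cannot run together.
        parts.append("(?:%s )%s" % (re.escape(item), suffix))
    return re.compile("".join(parts))


def children_match(content_model, child_tags):
    """True if the ordered list of child tag names satisfies the content model."""
    sequence = "".join(tag + " " for tag in child_tags)
    return model_to_regex(content_model).fullmatch(sequence) is not None


MASTERS_CORE = "(CW,CONTENT,TEXT?,ROOT_TEXT?,LINK?,COURSE*)"
COURSE = "(CW,CONTENT,TEXT,ROOT_TEXT?,LINK?,NUMBER?,DESCRIPTION?,PREREQ?)"

# A container with no text of its own, only <COURSE> children.
print(children_match(MASTERS_CORE, ["CW", "CONTENT", "COURSE", "COURSE"]))
# Utility children out of order are rejected.
print(children_match(MASTERS_CORE, ["CONTENT", "CW"]))
# <TEXT> is required in <COURSE>, so omitting it fails.
print(children_match(COURSE, ["CW", "CONTENT", "NUMBER"]))
```

The first call succeeds because every optional child may be absent, which is the "container" case the text describes; the other two fail because the order and the required <TEXT> child are enforced.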
Next the <COURSE> element is defined in Figure 5.3 as having the standard utility children, followed by the optional <NUMBER>, <DESCRIPTION>, and <PREREQ> domain children, which represent the course number, the course description, and the set of prerequisites required for the course, respectively. These domain elements contain only utility children, so they represent leaf domain elements whose information is not further subdivided.

Figure 5.4 illustrates the relationship between the original core course web page configuration and some of the XML domain element structures of Figure 5.3. However, Figure 5.4 does not explain why, for example, the domain children of the <MASTERS_CORE> element are multiple <COURSE> elements rather than a sequence of four individual domain elements each representing a different course; nor is it clear from the figure how the three domain children of the <COURSE> element are designated, since they do not correspond to any obvious headings in the source web page. These issues touch upon the balance that must be struck between the use of simple construction tasks (the primary design philosophy of the XMLIB) and the inclusion of intelligent design features that allow information to be represented more effectively.

Figure 5.4. The relationship between HTML headings and domain element structures
[Figure annotations: Highest-level headings translate into (MASTERS_CORE, PHD_CORE) domain children. Multiple second-level headings translate into (COURSE*) domain children.]

The first issue, concerning why the courses are all <COURSE> elements rather than unique stand-alone elements, in reality is not a significant problem. The realization that the subheadings under the Master’s Degree Core Course heading in Figure 5.4 are related because they are all courses, rather than separate topics that the heading simply groups together, is a complex task requiring a certain amount of human intelligence.
An algorithm would be unable to recognize this through a simple scanning of the HTML heading tags. Yet the contents of the children of the <COURSE> elements sufficiently distinguish the separate courses (the children elements contain information specific to a particular course). So even though the courses share a common domain element, which basically serves as a wrapper element in this case, the <COURSE> elements can be differentiated, only at one level lower than they would be if they were all unique stand-alone domain elements. Wrapping all the courses in a standard <COURSE> element is a shortcut to decrease the number of unique domain elements that need to be defined by the DTD. Since the XMLIB is created manually, this is useful. However, the number of unique domain elements that need to be defined in an automatically generated DTD is of less consequence, so shortcuts like these would not be as necessary or helpful in the dynamic generation of an information base. Therefore the intelligence needed to execute a shortcut like this is not required in the IBC algorithm.

This discussion has highlighted the second issue, however: namely, how the domain children of the <COURSE> element are designated without any obvious clues from the source HTML. Unfortunately this issue cannot be rationalized away as easily as the first. The source core course web page does not provide any direct clues to the fact that courses have course numbers, descriptions, and prerequisites. This knowledge is the type of general world knowledge described in Section 4.2 that defines the essence of a complex task.
While the design philosophy of the XMLIB is to keep its construction tasks as simple as possible, certain complex tasks (like dividing the <COURSE> element into number, description, and prerequisite subsections) are required in order to increase its effectiveness in representing the graduate web pages, which is one of the practical requirements discussed in Section 4.1. Where complex tasks are used in the construction of the XMLIB, a trade-off has occurred between the research goal of building an information base from a set of simple tasks and the practical requirement of creating a satisfactory question answering system.

The two issues discussed in the previous paragraphs are referred to as the wrapper condition and the sensible-definition condition, respectively, for the remainder of this chapter. The wrapper condition is where a group of concepts under a heading are all defined as particular instances of the same wrapper element even though the source HTML provides no clues that they are the same type of object or idea, like the multiple courses example above. The sensible-definition condition is where an element is structured based on common sense or general world knowledge in order to better represent the information.

After the <CORE_COURSES> element, the DTD goes on to define the remaining thirteen first-level domain elements (listed as the children of the <GRAD_PAGES> element in Figure 5.1) corresponding to the thirteen remaining subsections of the graduate web pages. Each first-level domain element introduces new children and descendent domain elements in the same manner as the <CORE_COURSES> element, and these are defined in turn as Figure 5.3 illustrates. Some domain elements appear as children or descendents of multiple different first-level domain elements. These commonly used domain elements usually appear as a result of a wrapper condition.
For example, many sections of the graduate web pages list a set of requirements that must be fulfilled to achieve some goal. The leaf domain element <REQUIREMENT> represents a single requirement. The domain elements representing these sections of source text are scattered throughout the XMLIB and have multiple <REQUIREMENT> domain children.

As stated previously, the two major tasks to be completed when constructing the XMLIB are designating and arranging the domain elements relative to each other, and determining the contents of the utility children of these domain elements. The DTD shows the results of the former task; the results of the latter are illustrated in the actual XML files comprising the XMLIB, which are discussed in Section 5.3.

Figure 5.4 gives an example of the method used to designate and arrange the domain elements for the majority of the XMLIB: correlating domain elements with HTML heading and separator tags. Lower-level headings are modeled as children of higher-level headings. Paragraphs or sentences delineated by separator tags beneath the lowest-level headings are translated into leaf domain elements. Wherever the domain elements are placed, they all use the same utility children for identical purposes. This is the design philosophy introduced in Section 4.2 at work. It consists of relatively trivial tasks: scanning through the source HTML, assigning first-level domain elements to the highest-level headings, and so on down to correlating leaf elements with individual paragraphs or sentences. The resulting hierarchy or tree of domain elements, infused with the XMLIB utility elements, allows the Query Generator and Intelligent Interface to interact with the XMLIB so that question answering is achieved. If the entire XMLIB could be satisfactorily constructed using only the methods described above, then the IBC algorithm would be well on its way to being defined.
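The trivial tasks listed above (scanning the source HTML and nesting lower-level headings under higher-level ones) can be sketched in Python as follows. The sketch is only an illustration under stated assumptions: the sample HTML is a hypothetical simplification of the core courses page, the heading levels h1 to h3 are assumed rather than taken from the real source, and the dictionary-based tree and class name are invented for this example. Paragraphs are simply attached to the nearest heading above them.

```python
from html.parser import HTMLParser

HEADING_TAGS = ("h1", "h2", "h3", "h4", "h5", "h6")


class HeadingTreeBuilder(HTMLParser):
    """Nests headings by level; paragraphs attach to the closest heading above."""

    def __init__(self):
        super().__init__()
        # Level 0 stands in for the root <GRAD_PAGES> element.
        self.root = {"level": 0, "title": "GRAD_PAGES", "texts": [], "children": []}
        self.stack = [self.root]  # path from the root to the current heading
        self.open_tag = None      # heading or paragraph tag currently being read
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS or tag == "p":
            self.open_tag = tag
            self.buffer = []

    def handle_data(self, data):
        if self.open_tag is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self.open_tag:
            return
        text = " ".join("".join(self.buffer).split())
        self.open_tag = None
        if tag == "p":
            if text:
                self.stack[-1]["texts"].append(text)
            return
        level = int(tag[1])
        # Close every open heading at the same or a deeper level.
        while self.stack[-1]["level"] >= level:
            self.stack.pop()
        node = {"level": level, "title": text, "texts": [], "children": []}
        self.stack[-1]["children"].append(node)
        self.stack.append(node)


def build_tree(html):
    builder = HeadingTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


# Hypothetical, simplified stand-in for the core courses source page.
SAMPLE_HTML = """
<h1>Core Courses</h1>
<h2>Master's Degree Core Course</h2>
<h3>Analysis of Algorithms</h3>
<p>This course will introduce the student to two areas.</p>
<h2>Ph.D. Core Courses</h2>
<p>The Ph.D. core courses consist of all of the M.S. core courses plus COT6315.</p>
"""

tree = build_tree(SAMPLE_HTML)
core = tree["children"][0]
print([child["title"] for child in core["children"]])
```

The sketch covers only the mechanical part of the method. It has no way of recognizing that sibling subheadings are all courses, or that a course has a number and a description, so it yields neither wrapper elements nor sensible definitions.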
Unfortunately, the definitions of several XMLIB domain elements require the completion of certain complex tasks. The complexities involved with defining the <CORE_COURSES> element and its descendents have already been explained. The subsequent paragraphs detail the complexities involved with the definitions of several other domain elements.

There are ten total first-level domain elements (including the <CORE_COURSES> element) whose definitions, or those of their children or descendents, involve some complications. Since there are only fourteen first-level domain elements in the XMLIB, this may seem like the majority of the domain elements, but in actuality most of the complications arise from wrapper conditions involving usually just a few of the child or descendent domain elements. Table 5.2 summarizes the wrapper conditions arising in the DTD.

Table 5.2. Wrapper Conditions in the DTD
First-Level Domain Element | Involved Wrapper Element(s)
<CORE_COURSES> | <COURSE>
<ADMISSION> | <REQUIREMENT>
<MASTERS> | <COURSE>, <REQUIREMENT>, <ELECTIVE_AREA>
<PHD> | <COURSE>, <REQUIREMENT>
<UNDERGRAD_PREREQS> | <COURSE>
<FACULTY> | <FACULTY_MEMBER>
<LABS> | <LAB>
<GRAD_COURSES> | <COURSE>
<UNDERGRAD_COURSES> | <COURSE>

Out of 80 domain elements in the DTD, there are only six instances where the full definition of the element requires the completion of a complex task and does not include a wrapper condition. The <COURSE> element having <NUMBER>, <DESCRIPTION>, and <PREREQ> domain children is the first nontrivial definition in the DTD. This is a sensible-definition condition as explained above.

The definition of the <GEN_INFO> first-level domain element introduces the second and third nontrivial definitions. Two wrapper elements, <STUDY_AREA> and <RESOURCE>, are defined, but they are unlike any other wrapper elements. Paragraphs or sentences separated by HTML separator tags lying underneath a common heading usually cause a wrapper element to be defined.
In the case of the <STUDY_AREA> element, the source text is in the form of a numbered list. The source text for the <RESOURCE> element is not even separated by numbers. Rather, the text is simply several sentences that to the human reader seem to be obvious candidates for representation as separate entities in the XMLIB. The <STUDY_AREA> element resembles a wrapper condition, but with a different source text structure. The <RESOURCE> element illustrates a type of sensible-definition condition.

The fourth nontrivial definition appears as part of the <MASTERS> definition. The <ADMISSION_REQUIREMENTS> element, a domain child of the <MASTERS> element, has a domain child of its own called <BACKGROUND>. This element is only defined because of a sensible-definition condition where the last sentence of the previous paragraph in the source text refers to the next section of text as background. Otherwise this section of text would have been treated the same as the surrounding text segments, which are <REQUIREMENT> elements.

The last two nontrivial definitions are also instances of sensible-definition conditions. In the subsection of the graduate web pages corresponding to the <LABS> first-level domain element, laboratories in the CISE department are being described. The name of each laboratory is listed as a heading, but the names of the laboratory directors are listed as identical-level headings directly below the name headings. Obviously, a director is an attribute of a laboratory object, so a <DIRECTOR> domain child is defined in the <LAB> definition. If the simple design philosophy of Section 4.2 had been adhered to, the <LAB> elements would be empty, and the <DIRECTOR> elements would erroneously hold the laboratory information.

The final nontrivial definition comes from the <FACULTY> domain element.
In the faculty subsection of the graduate web pages, a faculty member’s name, year of graduation, alma mater, and research areas are all listed in a single paragraph. Although there is no indication from the source text that “research area” should be a subsection of each faculty member, the <FACULTY_MEMBER> domain element is defined to have a <RESEARCH_AREA> child.

The purpose of explaining the above complications is to illustrate that the process of defining the XML elements that represent the graduate web pages requires human judgment at certain key points. The DTD completes the first of the two primary tasks in constructing the XMLIB: designating and arranging the domain elements relative to each other in order to model the logical structure of the source information. Appendix B lists the DTD in its entirety. The next section describes the XML files that make up the XMLIB.

5.3 The XML Files

The second major task in constructing the XMLIB is to determine what information to place within the special-purpose utility elements scattered throughout the domain element hierarchy defined by the DTD. This task is analogous to instantiating a class in object-oriented programming. Now that <COURSE> and <FACULTY_MEMBER> elements have been defined, for example, their utility element children need to be filled with actual information from the graduate web pages.

Each of the fifteen XML files comprising the XMLIB is an instantiation of the <GRAD_PAGES> top-level domain element. An XML file identifies itself as part of the XMLIB if its top-level root element is <GRAD_PAGES>. The lastRevised attribute of the <GRAD_PAGES> element records the particular XMLIB file’s last revision date.

The information in the graduate web pages is divided into fourteen HTML files, each of which is devoted to a different topic. The separate sections reference each other, either through standard textual references or explicit hyperlinks.
The XMLIB is divided into fifteen files, corresponding to the fourteen subsections of the graduate web pages and the directory file. The XMLIB is comprised of multiple, smaller files rather than a single large file for two reasons. First, it is easier to manually produce and update multiple small files. Second, with the use of the directory file, the Query Generator may ideally search only those XML files that contain information relevant to the user’s question.

5.3.1 A Concrete Example

Figure 5.5 illustrates the portion of the directory file that links to core_courses.xml, the XML file containing information on core courses and corresponding to the <CORE_COURSES> first-level domain element. Figure 5.6 shows a section of core_courses.xml.

Figure 5.5. The directory listing for core_courses.xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE GRAD_PAGES SYSTEM "mainDTD.dtd">
<GRAD_PAGES lastRevised="10/08/01">
  <DIRECTORY domain="www.cise.ufl.edu/~ddd/grad">
    <LISTING file="core_courses.xml">
      <CW>core course master master's degree ph.d. doctor philosophy phd ms m.s.</CW>
      <CONTENT>CISE Graduate Program core courses</CONTENT>
    </LISTING>
    :
    :
  </DIRECTORY>
</GRAD_PAGES>

Figure 5.6. A portion of core_courses.xml
<GRAD_PAGES lastRevised="09/24/01">
  <CORE_COURSES>
    <CW>core course</CW>
    <CONTENT>CISE Graduate Program core courses</CONTENT>
    <MASTERS_CORE>
      <CW>master master's core course degree ms m.s.</CW>
      <CONTENT>The Master's Degree core courses</CONTENT>
      <COURSE>
        <CW>course analysis algorithm cot5405</CW>
        <CONTENT>Information for Analysis of Algorithms (COT5405)</CONTENT>
        <TEXT>Analysis of Algorithms</TEXT>
        <LINK>
          <TEXT>COT 5405</TEXT>
          <TARGET>http://www.cise.ufl.edu/~ddd/grad/grad_courses.html#COT5405</TARGET>
        </LINK>
        <NUMBER>
          <CW>number cot5405</CW>
          <CONTENT>The course number of Analysis of Algorithms (COT5405)</CONTENT>
          <TEXT>COT 5405</TEXT>
        </NUMBER>
        <DESCRIPTION>
          <CW>description analysis algorithm cot5405</CW>
          <CONTENT>The description of Analysis of Algorithms (COT5405)</CONTENT>
          <TEXT>This course will introduce the student to two areas. There will be a brief but intensive introduction to discrete mathematics followed by the study of algorithmic analysis</TEXT>
          <ROOT_TEXT>COURSE STUDENT INTENSIVE INTRODUCTION DISCRETE MATHEMATICS FOLLOW STUDY ALGORITHMIC ANALYSIS</ROOT_TEXT>
        </DESCRIPTION>
      </COURSE>
      <COURSE></COURSE>
      :
      :
    </MASTERS_CORE>
    <PHD_CORE>
      <CW>ph.d. doctor philosophy phd degree core course</CW>
      <CONTENT>The Ph.D. core courses</CONTENT>
      <TEXT>The Ph.D. core courses consist of all of the M.S. core courses plus COT6315.</TEXT>
      <ROOT_TEXT>PHD CORE COURSE CONSIST MS COT6315</ROOT_TEXT>
      <COURSE></COURSE>
      :
      :
    </PHD_CORE>
  </CORE_COURSES>
</GRAD_PAGES>

The first line of Figure 5.5 specifies the version of XML being used. The second line declares that the file uses the elements defined in the DTD stored in mainDTD.dtd. These two lines are found at the beginning of each of the XML files in the XMLIB. The third line opens the root <GRAD_PAGES> element. Next the <DIRECTORY> first-level element opens, and the <LISTING> element for core_courses.xml follows, as indicated by the value of the file attribute.
The <CONTENT> children in the directory always copy the first-level element’s <CONTENT> child from the corresponding XML file. In Figure 5.5, the <CONTENT> element is identical to the <CONTENT> element of <CORE_COURSES> in Figure 5.6.

The interesting element in Figure 5.5 is the <CW> element. The Query Generator uses its scoring system to decide which XML files to access based on the contents of the directory’s <CW> elements. A <CW> element in the directory combines all the component words from all the <CW> elements in the top two levels of the related XML file. Comparing Figures 5.5 and 5.6, the contents of the <CW> element in the directory are the union of the words appearing in the <CW> child of the first-level <CORE_COURSES> element and the <CW> children of the two second-level elements <MASTERS_CORE> and <PHD_CORE>. By combining all the component words from the first- and second-level elements of a target XML file, a <CW> element in the directory contains a general overview of the information within that file, without getting too detailed [footnote 2]. Each of the fourteen XMLIB files has a <LISTING> element in the directory similar to the one shown in Figure 5.5.

[Footnote 2] Experimentation with the Query Generator shows that including the words from the top two levels of <CW> elements from the target XML file provides a good indicator of whether the file contains relevant information. The top level alone is too general, and the third level (and beyond) <CW> elements tend to be too specific.

After scanning the directory file, the Query Generator must begin searching through the actual XML file(s) of interest. Figure 5.6 shows a portion of core_courses.xml, including the root <GRAD_PAGES>, the first-level element <CORE_COURSES>, and parts of its two second-level domain children <MASTERS_CORE> and <PHD_CORE>. Both of these second-level domain elements consist of <CW> and <CONTENT> utility children followed by multiple <COURSE> domain children. The <PHD_CORE> element also has a <TEXT> utility child and its corresponding <ROOT_TEXT> element (remember that <CW> and <CONTENT> children are required in every domain element, but <TEXT> and <ROOT_TEXT> are optional). Also included in full in the figure is the first of four <COURSE> elements underneath <MASTERS_CORE>. It is useful to refer to Figure 5.7, which shows the original web page that core_courses.xml represents, as details of Figure 5.6 are explained.

The first two lines of Figure 5.6 are the opening tags of the root and first-level elements. The third line is the required <CW> child of <CORE_COURSES>, containing the words core and course. These two words are the root forms of the words comprising the heading of the core courses section of the graduate web pages (see Figure 5.7). The <CONTENT> utility child is next, giving a description of the information within the <CORE_COURSES> domain element.

Next the <MASTERS_CORE> second-level element opens. Its <CW> child contains the words master’s, core, course, degree, master, m.s., and ms. The first four are the words that make up the first first-level heading in Figure 5.7 (“Core Courses” being the top-level heading), and the last three are terms that are commonly used in place of the master’s component word. Master is included in case the user types “masters” in their question (converts to root form “master”); ms and m.s. are common abbreviations that may be used.

Figure 5.7. The core courses section of the graduate web pages

In general, only exceedingly common synonyms or substitute terms for a component word are included in a <CW> element. Usually only the actual component words are listed.
While not implemented in WebNL, an ideal system would have some way to retrieve the synonyms of words in the user’s question if the original words are not found in the XMLIB.

The <CONTENT> child of <MASTERS_CORE> describes its parent’s contents: the Master’s degree core courses. The <MASTERS_CORE> element does not have a <TEXT> or <ROOT_TEXT> child since there is no text in the source page underneath the “Master’s Degree Core Course” heading; only a lower-level heading is present.

Next the courses are listed as separate <COURSE> elements, one of which is shown in Figure 5.6. The illustrated course is an algorithms course whose number is COT 5405. Again the <CW> element contains a list of component words that together summarize the contents of the parent <COURSE> element and can be derived from the heading for the class in Figure 5.7. The <CONTENT> child contains a natural language description of the parent element’s contents.

Next in Figure 5.6 is a <TEXT> element that holds the name of the course as it appears in the original web page. Normally this <TEXT> element would contain the paragraph following the heading for the class, but because of the sensible-definition condition for the <COURSE> element described in Section 5.2.4, the paragraph is instead contained in the <DESCRIPTION> child’s <TEXT> element. This is done for all <COURSE> elements throughout the XMLIB. The course name is held in the <TEXT> child of the <COURSE> element, and its description is contained in the <DESCRIPTION> child.

There is no <ROOT_TEXT> element for the <TEXT> child of the <COURSE> element in Figure 5.6 because the <TEXT> element is relatively short and simple. In the XMLIB, a <ROOT_TEXT> element only appears if a <TEXT> element contains a sentence or more of text. If a <TEXT> element contains only a phrase or a small group of words, it does not have a corresponding <ROOT_TEXT> sibling.
After the <TEXT> child of the <COURSE> element is a <LINK> child, which is present because in the source web page (see Figure 5.7) the course number is a hyperlink to another section of the graduate web pages. Abiding by the definition of the <LINK> element in Section 5.2.2, the <TEXT> child holds the actual text of the hyperlink, and the <TARGET> child holds the target URL.

The <LINK> element is the last utility child of the <COURSE> element. The remaining children are all domain elements. The <NUMBER> element’s <CW> child lists the keywords number and cot5405. Its <CONTENT> child identifies that it represents the course number for the algorithms class, and its <TEXT> child contains the course number.

The final child of the <COURSE> element in Figure 5.6 is the <DESCRIPTION> domain element. Its <CW> child indicates to the Query Generator through the component words that a description of the algorithms course is contained within, while the <CONTENT> child does the same for a human reader. The <TEXT> child contains the full paragraph from the source web page, and the <ROOT_TEXT> element holds the significant words from the contents of its <TEXT> sibling. The ellipses at the end of these two elements are not actually present in the XMLIB; they simply indicate that, to conserve space in the figure, the contents are not fully shown.

The rest of the <MASTERS_CORE> element consists of <COURSE> elements similar to the one in Figure 5.6 representing the remaining courses listed under the “Master’s Degree Core Course” heading in Figure 5.7. The <PHD_CORE> element is instantiated in essentially the same manner as its <MASTERS_CORE> sibling, except that <PHD_CORE> has a <TEXT> and <ROOT_TEXT> pair, whereas <MASTERS_CORE> does not. This is because there is text directly beneath the “Ph.D. Core Courses” heading, before the next lower-level heading. Since this text is referring to the Ph.D.
core courses section as a whole, it makes sense for it to be contained in a <TEXT> child of the <PHD_CORE> element.

The XML files representing the other thirteen subsections of the graduate web pages have structures similar to the core_courses.xml file illustrated in Figure 5.6. The next section further explains the XMLIB features introduced in this section and outlines how the utility elements are filled in, thus addressing the second major task of the XMLIB construction process: determining the contents of the utility elements.

5.3.2 General XMLIB Features

Throughout the XMLIB, the <CW> elements, no matter where they appear, are the most significant utility elements because they are the first and primary means by which the Query Generator determines the sections of XML to return as an answer to a user’s question. Generally, domain elements in the XMLIB correspond to headings and their <CW> children contain the terms that comprise the heading name. This is one of two methods for determining the contents of <CW> elements. When there is a single piece of text underneath a heading, the domain element representing that heading includes it within a <TEXT> child. When there are multiple paragraphs or sentences delineated by separator tags directly beneath a common heading, the element representing the heading usually contains multiple wrapper element children for each of the paragraphs, and determines its <CW> child using the method above. However, the <CW> children of the wrapper elements are not determined by the terms in a heading, because there is no corresponding heading, only a sequence of paragraphs. In the case of a wrapper element, the first words listed in the <CW> child describe what the wrapper element is (a course, a faculty member, a requirement, or a resource, for example). The rest of the words in the <CW> element form a summary of the particular paragraph being represented.
This is the second method for determining <CW> content. The tasks involved in the former method are trivial, but those in the latter are complex. Figure 5.7 shows that each heading in the core course subsection of the graduate web pages either has no text below it or just a single paragraph, so it would seem that the trivial method for determining <CW> content would apply to the elements in Figure 5.6. However, because of the sensible-definition condition concerning the <COURSE> element (see Figure 5.3), Figure 5.6 actually provides a good example of the second, complex method. The reason is that the definition of the <COURSE> domain element effectively makes wrapper elements out of the <NUMBER> and <DESCRIPTION> children, even though only one of each is allowed to be present (as opposed to normal wrapper conditions, where there are multiple wrapper element children). <br /> <br /> The <CW> children underneath the <COURSE>, <NUMBER>, and <DESCRIPTION> elements in Figure 5.6 contain the component words course, number, and description, respectively, despite the absence of these words in the corresponding source text of Figure 5.7. The remaining words in each of these <CW> children summarize the contents of the parent element. The same two methods determine the contents of <CW> elements elsewhere in the XMLIB as well. The first, trivial method is applied if the domain element in question represents a heading (the heading element) with no text or only a single section of text below it (see footnote 3), and it is not involved in a sensible-definition condition. The single piece of text is included as a <TEXT> child of the heading element. The second, complex method is applied if several paragraphs or text segments share a heading. The heading element still determines its <CW> child via the trivial method, but the wrapper children that model the individual paragraphs employ the complex method to instantiate their <CW> children. 
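The two methods for filling in a <CW> element can be illustrated with a minimal Python sketch. The function names are hypothetical and are not part of WebNL; the summary words for a wrapper element are passed in by hand, since choosing them is the complex step the text says is not easily automated.

```python
import string


def heading_component_words(heading):
    """Trivial method: the component words are the terms of the heading itself."""
    words = (token.strip(string.punctuation).lower() for token in heading.split())
    return [word for word in words if word]


def wrapper_component_words(kind, summary_words):
    """Complex method: the first word names what the wrapper element is;
    the rest summarize the paragraph (chosen by a person in this sketch)."""
    result = [kind.lower()]
    for word in summary_words:
        word = word.lower()
        if word not in result:  # keep each component word only once
            result.append(word)
    return result


print(heading_component_words("Core Courses"))
print(wrapper_component_words("number", ["COT5405"]))
```

The second call mirrors the <NUMBER> element of Figure 5.6, whose <CW> child lists number and cot5405.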
The most complex task in constructing the XMLIB is instantiating the <CONTENT> elements, which contain natural language summaries of their parent element’s contents. They are the most difficult to instantiate because their contents represent essentially the same information as the <CW> elements, translated into natural language; this extra translation step increases the difficulty of the task. Fortunately, <CONTENT> elements are not integral to the querying process and they are not a necessity for the Intelligent Interface, although their presence in the output improves the display of multiple retrieved elements. <br /> <br /> The <TEXT> and <ROOT_TEXT> elements throughout the XMLIB are trivial to instantiate. The difficult step is determining where a particular section of source text should be placed within the hierarchy of domain elements; this step is discussed above. Once the decision is made, the source text is simply copied into the <TEXT> element. The <ROOT_TEXT> element, when the length of its <TEXT> sibling makes one necessary, can be instantiated by inserting the root forms of the significant words in the <TEXT> element. This can be accomplished, for example, by ignoring the articles and pronouns altogether, and retrieving the root forms of the remaining words from an appropriate online dictionary. The <LINK> and <EMAIL_LINK> utility elements are also very straightforward to populate with information. These elements simply contain copies of a link and its target URL as they appear in the source text. Appendix B lists in full, along with the DTD, two files that are part of the XMLIB: labs.xml and gen_info.xml. These files represent the laboratories and general information subsections of the graduate web pages, respectively. The complete XMLIB is available online at http://www.cise.ufl.edu/~nnadeau/research/xmlib . 
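The <ROOT_TEXT> procedure described above (drop articles and pronouns, then look up root forms) can be sketched as follows. This is only an illustration under stated assumptions: the word lists are incomplete, and the small ROOT_FORMS table is a hypothetical stand-in for the online dictionary mentioned in the text.

```python
import string

# Illustrative, incomplete list of words to ignore (articles and pronouns).
ARTICLES_AND_PRONOUNS = {
    "a", "an", "the", "it", "its", "this", "that", "these", "those",
    "he", "she", "they", "we", "you", "i",
}

# Hypothetical stand-in for a dictionary lookup of root forms.
ROOT_FORMS = {
    "courses": "course",
    "covers": "cover",
    "algorithms": "algorithm",
}


def root_text(text, root_forms=ROOT_FORMS):
    """Return the root forms of the significant words in `text`."""
    roots = []
    for token in text.split():
        word = token.strip(string.punctuation).lower()
        if not word or word in ARTICLES_AND_PRONOUNS:
            continue  # skip empty tokens, articles, and pronouns
        roots.append(root_forms.get(word, word))  # unknown words pass through
    return " ".join(roots)


print(root_text("The course covers these algorithms."))
```

Words missing from the lookup table are kept unchanged, which keeps the sketch usable even with an incomplete dictionary.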
5.4 More on Querying <br /> <br /> The three phases of the XMLIB querying process have been described in the previous sections, as well as the interaction between the Query Generator and the directory file. The final topic concerning the querying process is how the structure of the domain elements, in conjunction with the <CW> children, allows the Query Generator to retrieve appropriately general or specific information, depending on the user’s question. <br /> <br /> Footnote 3: The “no text below the heading” situation refers to when a heading is followed directly by another, usually lower-level heading, with no text appearing in between. <br /> <br /> The <CW> elements become more and more specific the deeper down in the domain element hierarchy they are located. The more words a domain element has in common with the words of the user’s question, the higher that particular domain element scores in the Query Generator’s scoring system. The deeper <CW> elements are more specific because they list more words than their parents’ <CW> elements. Throughout the XMLIB, a domain element’s <CW> child is assumed to contain the component words of all the higher-level domain elements on the direct path from that domain element up to the root element of the XML file, in addition to the specific component words unique to that element. Since the Query Generator looks for domain elements whose <CW> children most closely match the user’s words, an element whose <CW> child matches each word in the user’s question but also contains additional terms not in the question scores lower than an element whose <CW> child exactly matches the user’s words. This is the case where the user asked a general question and most likely desires a general, high-level answer. Of course, the <CW> element that matches more words always scores higher than an element that matches fewer words. This is how the Query Generator attempts to retrieve information of appropriate specificity. The second phase of the querying process is less accurate. 
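The first-phase <CW> ranking rules stated above can be sketched in Python. The Query Generator's actual scoring formula is not given in this chapter, so the tuple ranking below is an assumption that merely satisfies the two stated rules: more matched words always wins, and among equal matches, fewer extra component words wins. The function names and the sample component words are hypothetical.

```python
def cw_score(question_words, component_words):
    """Rank key for one domain element: (matches, -extras)."""
    question = set(question_words)
    components = set(component_words)
    matched = len(question & components)   # words shared with the question
    extra = len(components - question)     # component words not in the question
    return (matched, -extra)


def best_element(question_words, elements):
    """Return the name of the element whose <CW> words score highest."""
    return max(elements, key=lambda name: cw_score(question_words, elements[name]))


# Illustrative component words only; each deeper element repeats its ancestors' words.
elements = {
    "CORE_COURSES": ["core", "course"],
    "MASTERS_CORE": ["core", "course", "master"],
    "COURSE": ["core", "course", "master", "algorithm", "cot5405"],
}

print(best_element(["core", "course"], elements))
print(best_element(["master", "core", "course"], elements))
```

With these sample words, a general question selects the high-level element and a more specific question selects the deeper one. The sketch covers only the first, <CW>-based phase; the less accurate second phase is described next.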
In this second phase, the Query Generator assumes that if a domain tag name matches a term in the user’s question, that domain element may contain useful information. The third phase is simply a keyword search, returning the element(s) whose <ROOT_TEXT> children contain the most matches. Once a domain element is chosen, an XQL query generated by the Query Generator physically retrieves it and all its children and descendant elements (both utility and domain) and sends the entire group to the Intelligent Interface. The process is analogous to severing an internal node of a tree structure from its parent. That internal node and all its children and descendants form a separate tree. The Intelligent Interface displays the returned “subtree” of the XMLIB, and allows the user to search through it, viewing the contents of the <TEXT> elements as desired. Since the user can examine more specific elements of an answer that is too general with the help of the Intelligent Interface, the Query Generator prefers to err on the side of generality. <br /> <br /> 5.5 Summary <br /> <br /> This chapter explained the implementation details of the XMLIB, starting with the reasoning behind referring to it as an information base, then describing the organization of the physical files comprising it. Section 5.2 discussed the DTD, which defines all the XML elements used in the XMLIB and how they relate to each other. The DTD defines two categories of elements: utility elements and domain elements. Domain elements are determined by the logical structure of the source web pages; the utility elements appear in standard locations within the domain element hierarchy to facilitate functions of the Query Generator and Intelligent Interface modules of WebNL. The role of each utility element was explained, along with how it interacts with either the Query Generator or the Intelligent Interface. Next the <DIRECTORY> element was introduced. 
This element corresponds to a directory file that acts like an index for the Query Generator. The directory holds component words from the top two levels of each XML file; this provides a general picture of what each file contains. The Query Generator uses the directory to limit its searching of the XMLIB. <br /> <br /> After the <DIRECTORY> element, the domain elements as defined in the DTD were discussed. The definitions of the domain elements drive the structure of the XMLIB, which attempts to model the logical structure of the source web pages. Domain elements are usually designated by the headings in the source HTML. In certain situations, the designation of a domain element or its children is more complex than simply reproducing the relationships of the headings in the source text. These situations arise from the conflicting goals of following the design philosophy of Section 4.2 (keep the construction process as simple as possible) and fulfilling the practical requirements of Section 4.1 (creating a functional, satisfactory question answering system). Every domain definition that required a complex, intelligence-requiring task to complete was identified to show that the construction of the XMLIB is not as simple as ideally hoped for in Chapter 4. The completion of the DTD completes the first major task in the construction of the XMLIB: the designation of the domain element hierarchy. Section 5.3 illustrated a large portion of the core_courses.xml file and its corresponding listing in the directory. This section explained how the utility elements’ contents are generated. Since it is not realistic to detail all the files comprising the XMLIB, the concrete <CORE_COURSES> example was supplemented with Section 5.3.2’s description of the general features and characteristics of the XMLIB. This section, together with the <CORE_COURSES> example and the XML files listed in Appendix B, should adequately demonstrate the XMLIB’s implementation. 
Section 5.3.2 also outlined the steps required to instantiate the utility elements in the domain hierarchy, which is the second major task in the XMLIB construction process. <br /> <br /> Section 5.4 completed the description of the querying process, detailing the role of the <CW> element in allowing the Query Generator to retrieve appropriately specific or general sections of information from the XMLIB. <br /> <br /> CHAPTER 6 THE XMLIB – RESULTS <br /> <br /> 6.1 Summary of the Construction Process <br /> <br /> The primary difference between the XMLIB’s construction process and the tasks that the IBC algorithm will eventually have to perform is that the former process operates on a set of web pages that are already grouped together, written by the same author using the same set of HTML tags, while the latter will have to construct a cohesive information base from multiple web pages that could have very different structures or logical organizations. In other words, the source web pages are homogeneous for WebNL, but may possibly be heterogeneous for the Ideal system. The construction procedure outlined in this section is the one employed to construct the XMLIB with the graduate web pages as the source documents. This procedure is generalized in the following discussion to a generic set of source documents having a structure similar to the graduate web pages. Figures 4.2 and 4.3 illustrate this general structure. No attempt is made to generalize the process to source documents whose organization is significantly different from the graduate web pages. For the purposes of this chapter the term heading, in addition to its normal meaning, may also refer to the types of hyperlinks that lead to a more specific subsection of the text. This type of hyperlink is illustrated in Figure 4.2. The construction process is broken into two phases, corresponding to the two major tasks described in Chapter 5. 
The first phase is the construction of the DTD, which defines the domain elements and also places the optional utility elements. This phase defines the domain element hierarchy: the first major construction task. The second phase produces the XML files and the directory. The XML files follow the pre-determined arrangement defined in the DTD, with the domain elements being instantiated through the insertion of data or text into their utility element children. This is the second major construction task. <br /> <br /> 6.1.1 Phase One: Constructing the DTD <br /> <br /> The first step in the construction of the DTD is to define the root element, which is <GRAD_PAGES> in the XMLIB. The definition involves choosing the root’s name and defining its first-level domain children. The root element does not contain any utility elements. The choice of a meaningful name for the root, which represents the domain of the information base, can be either a complex or trivial task, depending on whether an overall title for the source documents is available. The definitions of the first-level children are more involved, requiring the identification of the first (highest or most general) level of categorization or division in the source web pages. The highest-level HTML heading tags used in a document, or a list of hyperlinks on an index page to the subsections, or a list of links located on a navigational bar (as in the graduate web pages) can all be modeled as first-level domain elements. These headings are referred to as first-level headings. The root element should always have a <DIRECTORY> child, no matter the domain, since a directory can be used in any context. The other first-level children are dependent on the domain’s structure. The utility elements are also a standard part of the construction process, so they would be defined as described in Section 5.2.2 for any domain. 
<br /> <br /> Next the first-level domain elements (the children of the root) must be defined, one at a time. For each element, this includes identifying its domain children and selecting which of the optional utility elements (<TEXT>, <ROOT_TEXT>, <LINK>, and <EMAIL_LINK>) to include. The required utility elements <CW> and <CONTENT> are always defined as the first two children of any domain element. A first-level domain element will have one domain child for each heading underneath the first-level heading. There will also be a domain element child (usually a wrapper element declared in the DTD as appearing multiple times) if a set of separate paragraphs or text sections appear below the first-level heading, but not below any other headings or subheadings. In the case of subheadings or hyperlinks to subsections, the child domain element takes its name from the particular subheading or hyperlink. In the case of a wrapper element, the name must reflect the type of objects or concepts being represented (<COURSE>, for example, is a wrapper element for course objects). Naming the element in the former situation is a trivial task, whereas naming the wrapper element is a more complex task. The inclusion of optional utility children depends on whether there is a single paragraph or text section underneath the first-level heading. The existence of a single paragraph underneath a heading indicates that it is providing information pertaining to the entire subsection. If the subsection has lower-level headings within it, then the paragraph is probably an overview; if there are no lower-level headings, then the first-level element is a leaf and will contain the paragraph’s specific information. Either way, if a single paragraph is present, it will be assigned as the contents of the <TEXT> child of the first-level element in the second phase of construction. 
Therefore, in the DTD, the first-level element needs to have a <TEXT> child. The <ROOT_TEXT> element will also be included if the paragraph is long enough (in the XMLIB, a <ROOT_TEXT> element is included if the paragraph is more than a single sentence). The <LINK> and <EMAIL_LINK> elements are only included if the corresponding type of link is present in the paragraph. Once the domain children of the first-level element have been named and the appropriate utility elements have been included, the process described in the previous two paragraphs is repeated for each of the newly identified second-level domain elements. If during the course of a second-level definition, a third-level domain is introduced, it also goes through the same steps. This process continues, replacing occurrences of the word “first-level” in the previous two paragraphs with the appropriate term (second-level, third-level, and so on), to as deep a level as necessary, terminating when no domain elements introduce any further domain children. The exception to this is wrapper elements. Once a wrapper element is designated, there is no need to recursively search its section of text for more structure: it is already known that the section consists only of multiple paragraphs. The DTD can simply list the wrapper element as occurring in its parent multiple times, then define it to contain only the utility elements <CW>, <CONTENT>, <TEXT>, and <ROOT_TEXT> because it has no domain children; each instantiation will contain only a single paragraph: the fundamental unit of the information base. When all the domain elements introduced by the initial first-level domain element have been defined, the entire process is repeated for the remaining first-level elements, until eventually all the elements needed to represent the domain have been defined. 
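The recursive definition procedure above can be sketched in Python. This is a simplified illustration under several assumptions, not the procedure actually used to write the XMLIB's DTD: the Section class, the element_name rule, and the child ordering are hypothetical; wrapper names are supplied by hand because naming a wrapper is a complex task; <LINK>, <EMAIL_LINK>, and sensible-definition conditions are ignored; and the sentence count is a crude punctuation test.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Section:
    heading: str
    paragraphs: list = field(default_factory=list)   # text directly under the heading
    subsections: list = field(default_factory=list)  # lower-level headings
    wrapper_name: str = ""                           # chosen by a person


def element_name(heading):
    """Turn a heading into an element name, e.g. 'Core Courses' -> 'CORE_COURSES'."""
    return re.sub(r"[^A-Za-z0-9]+", "_", heading).strip("_").upper()


def declare(section, out):
    """Append DTD element declarations for `section` and its subsections to `out`."""
    children = ["CW", "CONTENT"]  # always the first two children
    if len(section.paragraphs) == 1:
        children.append("TEXT")
        # Crude sentence count: <ROOT_TEXT> only for more than one sentence.
        if len(re.findall(r"[.!?](?:\s|$)", section.paragraphs[0])) > 1:
            children.append("ROOT_TEXT")
    elif len(section.paragraphs) > 1:
        if not section.wrapper_name:
            raise ValueError("a wrapper name must be supplied for " + section.heading)
        children.append(section.wrapper_name + "+")  # wrapper occurs multiple times
    children += [element_name(sub.heading) for sub in section.subsections]
    out.append(f"<!ELEMENT {element_name(section.heading)} ({', '.join(children)})>")
    if len(section.paragraphs) > 1:
        # Wrapper elements hold only utility children; no further recursion is needed.
        out.append(f"<!ELEMENT {section.wrapper_name} (CW, CONTENT, TEXT, ROOT_TEXT)>")
    for sub in section.subsections:
        declare(sub, out)
    return out


root = Section("Core Courses", subsections=[
    Section("Masters Core", paragraphs=["First course.", "Second course."],
            wrapper_name="COURSE"),
    Section("PhD Core", paragraphs=["Overview text. More text."]),
])
for line in declare(root, []):
    print(line)
```

The sample headings are invented for illustration; the real <COURSE> element also has <NUMBER> and <DESCRIPTION> children because of a sensible-definition condition, which this sketch deliberately leaves out.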
<br /> <br /> The previous discussion ignores sensible-definition conditions, where an intelligent, sensible decision is made concerning how a domain element should be structured. The above process is just that: a process with well-defined steps (although the method for completing some of the steps may not be well-defined). The nature of a sensible-definition condition is that it cannot be characterized easily, and the issues involved may vary greatly depending on the source information. Therefore the summary of the construction process necessarily avoids the sensible-definition topic, only noting that once the domain elements resulting from the sensible-definition have been defined and the portions of source text to be associated with them have been designated, the construction process can generally proceed normally underneath them. Figure 6.1 is a flowchart of the first phase of the construction process, disregarding sensible-definition conditions. <br /> <br /> 6.1.2 Phase Two: Constructing the XML Files <br /> <br /> Once the DTD is completed, the second phase of construction begins. This phase accomplishes the second major construction task by instantiating or filling in the contents of the utility elements located throughout the domain element hierarchy. First the XML files are built, then the directory is derived from the completed files. The information in the source web pages is split into separate files in order to utilize the directory. There should be a separate file for each of the first-level domain elements defined in the DTD in Phase One. The root element of each of these files is the root domain element defined in the DTD. <br /> <br /> Figure 6.1. Phase One of the construction process. [Flowchart not reproduced. Recoverable box labels: START; Source web pages (Domain); Identify headings in domain and place in groups labeled H1 through Hn in order of highest to lowest level; Domain has a given name? (Y/N); Designate root element; First child of root element is <DIRECTORY>; Define a first-level element for each H1 heading, which represents the source text under the heading; L=1; For each level L element, scan its corresponding source text; Single paragraph?; Define <TEXT> utility child; Multiple paragraphs?; Designate and define wrapper element; Headings at level (L+1)? (Y/N); L=L+1; Define a level L domain child for each heading at level L, which represents the text under the heading; FINISH. Key: boxes are marked as complex or trivial tasks; L is the current domain element hierarchy level (equal to heading level).] <br /> <br /> Phase One keeps track of which sections of the source text correspond to which domain elements, and vice versa. With this information available, Phase Two consists of inserting values from the appropriate sections of source text into the utility elements of the corresponding domain element. The <CW> utility element is instantiated in two ways, depending on whether its parent is a wrapper element or a heading element (a domain element that corresponds to an actual heading in the source text). If the parent is a heading element, the <CW> child consists of the root forms of the words comprising the heading, a simple instantiation. If the parent is a wrapper element, then the <CW> child contains the root forms of the words that describe what the wrapper element represents (faculty member, course, laboratory, etc.), along with the root forms of the set of words that describe and summarize the particular contents of the wrapper element. Instantiating a <CW> child of a wrapper element is much more complex than the instantiation of a heading element’s <CW> child. 
It is simple enough to include the words that form the name of the wrapper element except that, in Phase One, the choice of the wrapper element name is complex. Including certain keywords from the text can provide a rough picture of the contents of the wrapper element, but choosing which to include is a complex choice itself. Following the <CW> element in all the domain elements of the XMLIB is the <CONTENT> utility element. It holds a natural language single-sentence description of the information contained in its parent domain element. Instantiating this element, no matter the type of domain parent, is a complex task that is not easily automated. The remaining utility elements are trivial to instantiate since the sections of text that correspond to a particular domain element are known. The <TEXT> child simply gets a copy of the appropriate original source text, and the <ROOT_TEXT> element takes all the significant words (everything but articles, pronouns, and certain other special-case terms) from the source text and translates them into their root forms. A domain element corresponding to a source paragraph that contains links holds the related information in the <LINK> and/or <EMAIL_LINK> children. <br /> <br /> 6.2 WebNL Results <br /> <br /> This section illustrates the results of some sample questions posed to the WebNL question answering system. As of the writing of this thesis, the Natural Language Parser module of WebNL is not fully implemented. For each test question, the functions that will eventually be executed by the Natural Language Parser must be performed manually, a tedious and time-consuming endeavor. For this reason, at the time of writing, there are only thirteen test questions that have been sent through the full WebNL system (minus the Natural Language Generator, whose functions were reproduced manually). The test questions attempt to access a wide range of the information stored in the XMLIB. 
Some questions ask for specific pieces of information, while others are more general in nature. Overall, the test questions form a sample from a pool of commonly asked questions concerning the CISE Graduate Program. Figure 6.2 shows WebNL’s user interface. <br /> <br /> Figure 6.2. WebNL’s user interface <br /> <br /> The text box at the top is where a user may type a natural language question. Currently, the system does not process a question typed in the text box because, as stated previously, the Natural Language Parser, the WebNL module whose input is the natural language question, is not yet complete. The bottom area of the figure is where the retrieved information is displayed. A user can access the FAQ at the right to choose a test question. This will cause the system to display the answer for that question. The interface as seen above is available online at http://www.cise.ufl.edu/~nantonio/research.htm . Table 6.1 lists the thirteen questions and indicates which were answered successfully. <br /> <br />
<table>
<caption>Table 6.1. Results of the Thirteen WebNL Test Questions</caption>
<tr><th>Question</th><th>Successfully Answered?</th><th>Comment</th></tr>
<tr><td>What are the core classes?</td><td>Yes</td><td>Since user did not specify, both Master’s and Ph.D. core courses are returned.</td></tr>
<tr><td>What is the description of COP5555?</td><td>Yes</td><td>Actually retrieves the entire course, rather than just the description.</td></tr>
<tr><td>Where is the CISE office?</td><td>Yes</td><td>Returns only the information asked for.</td></tr>
<tr><td>Give me a summary of the Graduate web pages.</td><td>Yes/No</td><td>Because of the user’s choice of “summary” rather than “overview” the system warns the user it could not retrieve an exact answer, but in reality the returned approximate answer is an exact answer.</td></tr>
<tr><td>Why must I have a committee?</td><td>No</td><td>The system is not designed to handle “why” questions.</td></tr>
<tr><td>What is the admission process?</td><td>Yes/No</td><td>The system returns more information than needed, but the user can find what is desired easily.</td></tr>
<tr><td>How can I contact the CISE department graduate program?</td><td>Yes</td><td>Returns all the alternative means of contacting the CISE department.</td></tr>
<tr><td>How can I get the Degree of Engineer?</td><td>Yes/No</td><td>Simply returns the section of the XMLIB concerning the Degree of Engineer, which happens to explain the process of earning the degree.</td></tr>
<tr><td>Show me the CISE faculty.</td><td>Yes</td><td>Returns all the faculty members and lets the user choose which to see more detailed information on.</td></tr>
<tr><td>What labs does the CISE department have?</td><td>Yes</td><td>Returns the three domain elements that represent the labs at the CISE department.</td></tr>
<tr><td>Show me information on the Master’s Program.</td><td>Yes</td><td>This is an easy question to answer. The system simply returns the domain element representing the Master’s program, and lets the user explore.</td></tr>
<tr><td>Show me information on the Ph.D. program.</td><td>Yes</td><td>Same comment here as for previous question.</td></tr>
<tr><td>What are the undergraduate prerequisites?</td><td>Yes</td><td>Not only are the specific undergraduate prerequisites retrieved, but also general information concerning them.</td></tr>
</table>
<br /> <br /> Table 6.1 mentions that when too much information is returned, the user is able to search through the intuitive and logical organization of the material to quickly and easily find what is desired. This is a direct advantage of copying the inherent structure of the source text: the retrieved answers also exhibit the structure. Even if the answer is not exactly what the user is looking for, the Intelligent Interface allows him or her to navigate through the retrieved material to find what is needed, as long as the answer was too general rather than too specific. Fortunately, the Query Generator follows the rule of erring on the side of returning too much data. The amount of searching the user has to perform on the returned information is still much less than the amount of searching to be expected with other more traditional systems like search engines. 
Figure 6.3 illustrates WebNL’s user interface as it is displaying an answer to a test question that returns a large amount of information: “Show me the CISE faculty.” <br /> <br /> Figure 6.3. WebNL’s user interface displaying an answer <br /> <br /> Each bullet in the figure represents a faculty member. The arrow in the bottom right-hand corner indicates that there are more faculty members than can fit on the current screen. If the user clicks on one of the bullets, the Intelligent Interface displays more detailed information about that faculty member. WebNL’s ability to let the user search further through the retrieved information, in order to locate the exact piece of information he or she is looking for, is a unique, beneficial characteristic that arises from a combination of the functions provided by the Query Generator, the XMLIB, and the Intelligent Interface. <br /> <br /> 6.3 Evaluating the XMLIB <br /> <br /> The research and practical goals of the XMLIB were discussed at length in Chapter 4. The first primary XMLIB research goal was to provide, through its manual construction, insight into the tasks that the IBC algorithm will need to eventually be able to complete. In this effort, the XMLIB has partially succeeded. The IBC algorithm must be able to handle heterogeneous source documents that may or may not have structures comparable to the graduate web pages. The XMLIB has a purposefully limited scope, and only attempts to provide insight into the process of building an information base from source web pages with a certain logical organization. Within this limited scope, however, a very well defined construction process has emerged, as evidenced by Figure 6.1. Most of the tasks are either trivial (copying source text into <TEXT> elements) or require abilities like keyword extraction and concept summarization (deciding on a wrapper element name or instantiating its <CW> children). 
The appearances of sensible-definition conditions are the only parts of the construction process that are not well-defined or easy to characterize. It can be argued that the XMLIB would still function satisfactorily if all the domain elements were defined using the procedure of Figure 6.1. Unfortunately, a comparison between the XMLIB with the sensible-definition conditions present and a version without them is not available. It is not clear how well the XMLIB would function without these definitions. <br /> <br /> The above discussion shows that the XMLIB was successful in its second research goal of being able to provide an estimation of the difficulty of each significant task involved in its construction. The XMLIB was successful in fulfilling its practical requirements, at least based on initial test question results shown in Table 6.1. Only one question was completely unanswerable: the “why” question. Most were directly and succinctly answered. The remaining questions (those listed as Yes/No in the table) were answered, just not directly. The user was required to search through the returned material somewhat to find the exact desired information. Since the XMLIB is part of a satisfactory question answering system, and was built from a set of web pages that employs a logical organization found in a large percentage of other information-providing sites, it is asserted that the methods used in its creation are worthy of further study and refinement in the ongoing attempt to achieve the IBC algorithm. <br /> <br /> CHAPTER 7 THE XML EXPLORER – XMLEX <br /> <br /> XML Explorer (XMLEX) is a Java servlet designed to let the user browse through the contents of an XML file much the same way that some popular operating systems allow users to browse through file systems. Servlets extend the functionality of the server they are installed on. They are programs that can process data from the user on the server-side and return useful results. 
In contrast, applets are small Java programs that are sent to a client as part of a web page and execute on the client’s computer. XMLEX is not directly related to the XMLIB or any other parts of WebNL. It is an auxiliary tool that allows interested users to interactively search through an XML file by opening and closing (or expanding and collapsing) elements. It is particularly helpful when viewing large XML structures. It allows the user to collapse the elements so that only the first few top levels are expanded, providing a good picture of the general arrangement of the XML elements. XMLEX can provide a useful service for the XMLIB by letting interested users directly view the XML files that comprise it. Since the files in the XMLIB are already arranged to match the structure of the source web pages, “surfing” through the files via XMLEX is a quick, alternative way to find information. XMLEX can explore any XML file, not just those that are a part of the XMLIB. Figures 7.1 and 7.2 illustrate XMLEX with a sample XML file. <br /> <br /> Figure 7.1. Exploring core_courses.xml with root and <CORE_COURSES> expanded <br /> <br /> Figure 7.2. More elements expanded <br /> <br /> As the figures show, XMLEX displays comments and special lines like the XML declaration line and document type declaration line in addition to normal elements. In Figure 7.1 only the <GRAD_PAGES> and <CORE_COURSES> elements are open or expanded. In Figure 7.2, the first <COURSE> child of <MASTERS_CORE> is the lowest level expanded element. <br /> <br /> The natural tree structure of XML elements can be seen easily when viewed in this manner. Expanded elements have a minus sign next to them, indicating that clicking on the sign will close the element. Unexpanded elements have a plus sign to their left, indicating that they will be expanded the next time they are clicked on. 
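The expand and collapse display can be illustrated with a short Python sketch. XMLEX itself is a Java servlet whose code is not shown here, so this is only an approximation of the display logic: the render function and the dotted position keys that identify expanded elements are hypothetical.

```python
import xml.etree.ElementTree as ET


def render(element, expanded, key="0", depth=0):
    """Return display lines for `element`; `expanded` is a set of position keys."""
    is_open = key in expanded
    marker = "-" if is_open else "+"  # minus: expanded, plus: collapsed
    lines = ["  " * depth + f"{marker} <{element.tag}>"]
    if is_open:
        # Show the children; each child shows its own children only if it is
        # itself expanded, so no further descendants appear by default.
        for index, child in enumerate(element):
            lines += render(child, expanded, f"{key}.{index}", depth + 1)
    return lines


document = ET.fromstring(
    "<GRAD_PAGES><CORE_COURSES><MASTERS_CORE/><PHD_CORE/></CORE_COURSES></GRAD_PAGES>"
)
for line in render(document, {"0", "0.0"}):
    print(line)
```

Because the set of expanded keys fully determines the output, the same function can redraw the tree from scratch on every request without remembering anything between calls.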
When an element is expanded, its children (but no further descendants) are displayed. For simplicity, XMLEX is implemented as a single-threaded servlet, meaning only one user can access it at a time. The decision to keep it single-threaded greatly simplified the servlet’s code. XMLEX performs a simple parse of the input XML file and builds an internal representation of the XML elements. Because it parses the file, it can check whether the XML is well-formed. If a section of XML is not well-formed, XMLEX reports this and also reports where the error is believed to have occurred. XMLEX does not perform any validation against a document type definition (DTD). Its main purpose is to let users browse completed XML files; it is not an XML development tool. XMLEX does not maintain any state information between calls. Each click on a plus or minus sign is a call to XMLEX with the same XML file as input, but with different expandkey parameters. The expandkey parameter allows XMLEX to expand or collapse the appropriate elements. While XMLEX is not an important part of the XMLIB or WebNL, it was written during the implementation of the XMLIB as a fallback mechanism: if, for whatever reason, WebNL failed to work properly, the XML files comprising the XMLIB could still be of some use because students could browse through them with XMLEX despite the malfunction of the question answering system. <br /> <br /> CHAPTER 8 CONCLUSIONS AND FUTURE WORK <br /> <br /> 8.1 Conclusions <br /> <br /> This thesis is primarily concerned with the XMLIB, an information base implemented in XML for a natural language question answering system. This information base serves the dual purposes of acting as a practical information repository for a functioning question answering system and serving as a theoretical testing platform. 
The XMLIB, being constructed manually, could have been built with the application of as much human intelligence and creativity as possible in order to maximize its effectiveness as an information repository. While there is nothing wrong with this approach, the purpose of the XMLIB is on the other side of the spectrum. It aims to be an information base whose construction process is as simple as possible, where the number of intelligence-requiring steps needed to produce it is minimized, yet which in the end is still able to satisfactorily represent information in the context of a question answering system. The simpler the XMLIB is to produce, the more likely it is that the steps in its construction can be generalized and incorporated into an algorithm (referred to as the Information Base Constructor algorithm) that automates the process of building an information base from a set of online documents. From a research perspective, the purpose of the XMLIB is to test how simple an information repository can be while remaining an effective part of a question answering system. The key is that the XMLIB and the question answering system with which it is affiliated, WebNL, need to be shown to be satisfactory before anything else can be meaningfully discussed. The results so far, as discussed in Section 6.2, are encouraging. The system is able to satisfactorily answer the majority of the test questions. It should be noted that the manner in which WebNL answers questions differs from that of the Ideal system. The Ideal system is described as being able to answer the question directly with a natural language answer. WebNL attempts to answer the question by returning portions of the original text that contain the information. It makes no attempt to generate its own version of the answer or to put the retrieved information into the form of a natural language response to the user. 
With the XMLIB’s usefulness as a functional information base for a satisfactory question answering system established (see Chapter 6), conclusions about its construction process can be drawn. During the creation of the domain element hierarchy, which is the first of two construction phases discussed in Section 6.1, three categories of tasks are completed. The first category is the trivial tasks. These are tasks that are mundane enough to be easily incorporated into a basic algorithm. The second category includes the tasks identified in Figure 6.1 as being complex. These types of tasks require text summarization and keyword extraction capabilities. While these problems are the subject of active research, they remain unsolved and so cannot be considered trivial. The third category corresponds to those points in the XMLIB construction process where a sensible-definition condition occurred. The condition is characterized by the realization that a certain object in the domain needs to be represented in a certain way, yet there are no clues from the source text that point to this. To have this realization requires general world knowledge, the type that could pass a Turing test. The tasks in this category, which appeared only a few times in the XMLIB’s construction, are the farthest from being computable or solved and are considered to be the most complex. Therefore, the XMLIB shows that a respectably performing question answering system can be built around an information base that is constructed by completing a few very difficult tasks, several complex tasks, and a majority of trivial tasks. The complex tasks require text summarization and keyword extraction techniques to be used. 
The extremely difficult (from an automation point of view) tasks are those that require abilities that have been and continue to be pursued by AI researchers around the world: the ability of a program to reason about things using a certain level of common sense and real-world intelligence. The XMLIB has therefore been successful in clarifying the types of research needed in order for question answering systems to move toward the Ideal system. Most importantly, it is emphasized that everything discussed above applies only when the source documents are arranged and organized in a fashion similar to the graduate web pages. The construction of the XMLIB provides no insight whatsoever into how the Ideal system should go about building an information base from source documents that have highly different structures. This is why the web structure survey discussed in Chapter 4 and Appendix A was conducted. The percentage of online documents on the World Wide Web that share the logical arrangement of the graduate web pages is an important indicator of just how useful the XMLIB and its construction process may be in forging the way for the IBC algorithm to be defined. On a more practical level, the XMLIB and its construction process could lead the way toward an automated system that compiles information from a set of input web pages chosen by a user. The compilation procedure would be interactive. When a complex task is recognized, the user would be alerted and asked how the particular element in question should be structured. The input web pages would need to have the correct structure, of course. The interesting point is that the user could quickly build specific information bases for those sets of pages that have an appropriate organization and that are frequently referenced by others. The perfect example of this type of web site can be found most often at universities. 
A professor’s set of pages describing an important course topic, for example, could be compiled by this hypothetical system. The resulting information base could then be accessed by a system like WebNL, which could be essentially the system that currently exists (with the Natural Language Parser functional, of course), to provide question answering capabilities to students or other readers. In conclusion, the construction of the XMLIB has helped to define some of the general tasks that the IBC algorithm must accomplish: the extraction of keywords from a paragraph that summarize the text and the recognition of different levels of headings in HTML code are two important examples. Other tasks the IBC algorithm must accomplish have not been addressed. The compilation of heterogeneous material is a significant problem that this thesis does not attempt to resolve. For sets of web pages that have a structure similar to the graduate web pages, the XMLIB’s construction process is a good candidate for automation and inclusion in a system like the one discussed in the previous paragraph. <br /> <br /> 8.2 Future Work <br /> <br /> In the immediate future, it would be advisable to employ XML Schema as the data definition language for the XMLIB, replacing XML DTD. XML Schema has emerged as the document type definition language of choice during the course of this project. Once the Natural Language Parser module becomes available, more testing needs to be performed. More experimentation with the XMLIB directory is warranted. The runtime of the querying process is a major concern. Including more levels of <CW> elements in the directory needs to be tested so that an optimal number of levels can be determined. One possible next phase of research involving the XMLIB is to attempt to automate its construction process, which was performed manually for this thesis. 
This would involve defining algorithms that read in the source HTML documents, recognize any and all of the useful HTML tags, build representations for the headings and their relationships to each other (this was hinted at in Figure 6.1), and then continue with the construction procedure. This thesis makes no attempt to define in detail all the HTML tags that may be useful in determining logical structure. Another possible phase of research could be to extend the construction process so that it can create an information base from more widely differing source document structures. The fact that the XMLIB and its construction process, as they stand now, are only truly informative and useful if the source web pages follow a particular structure leads to some interesting conjectures about where work further in the future may lead. Text summarization, information representation, and concept identification techniques may improve to the point where heterogeneous input documents do not pose the problems they do for the method described in this thesis. On the other hand, if gathering knowledge and information from heterogeneous source documents remains an open problem for too long, the Internet community might begin to push for standardization of document structure. In other words, if the IBC algorithm can only be implemented using web pages of a particular structure, then maybe all online documents should strive to follow that mold so they can be included in the results of the question answering systems of tomorrow. <br /> <br /> APPENDIX A WEB STRUCTURE SURVEY <br /> <br /> Table A.1 lists the actual university and college websites surveyed. The list was chosen at random. The columns correspond to the name of the school, whether a matching structure was found within the site, and, if a match was found, the specific URL where it was found. <br /> <br /> Table A.1. 
List of Schools Involved in Web Structure Survey <br />
School | Match? | URL of Match <br />
University of Florida | Yes | http://www.dso.ufl.edu/stg/ <br />
Stanford University | Yes | www.stanford.edu/home/students/index.html <br />
MIT | Yes | web.mit.edu/about-mit.html <br />
Athens State University | Yes | www.athens.edu <br />
Auburn University | Yes | www.auburn.edu/main/currentstudents.html <br />
Alaska Bible College | Yes | www.akbible.edu/divisionofministryeducation.html <br />
Arizona State University | Yes | www.asu.edu/apply/ <br />
John Brown University (Arkansas) | Yes | www.jbu.edu <br />
California Pacific University | Yes | www.cpu.edu/dgreepoo.htm <br />
Berkeley | Yes | www.uofb.com/degrees3.html <br />
Adams State College (Colorado) | No | <br />
Delaware State University | Yes | www.udel.edu/main/pros-students/ <br />
Bethune-Cookman College | Yes | www.cookman.edu/index.html <br />
University of Idaho | Yes | www.uidaho.edu/ <br />
DePaul | Yes | www.depaul.edu/academics/index.asp <br />
Ball State University | No | <br />
Southwestern College | No | <br />
Kentucky Wesleyan College | Yes | www.kwc.edu/admiss/policy.htmv <br />
Clark University | Yes | www.clark.edu <br />
College of Saint Mary | Yes | www.csm.edu <br />
Drew University | No | <br />
Western New Mexico University | No | <br />
Brown University | No | <br />
Christian Brothers University | No | <br />
Northwood University | Yes | admissions.northwood.edu/ <br />
Weber State University | Yes | www.weber.edu/ <br />
University of Wyoming | Yes | www.vwyo.edu <br />
Oklahoma State | No | <br />
College of the Ozarks | Yes | www.cofo.edu <br />
Saint Mary’s College | Yes | www.saintmarys.edu/welcome/ <br />
Delaware Valley College | Yes | devalcol.aspre.net/academics/programs_bach.asp <br />
Trinity College | No | <br />
University of Colorado Boulder | No | <br />
DeVry Institute of Technology | Yes | www.devrycols.edu <br />
University of Tennessee | No | <br />
Marquette University | Yes | www.mu.edu/library.html <br />
Holy Cross College | Yes | www.holycross.edu/departments/library/website <br />
Arkansas State | No | <br />
Michigan State | No | <br />
Northern State University | No | <br />
Seton Hall | No | <br />
University of Alaska | No | <br />
Southern Methodist University | Yes | www.smumustangs.com <br />
University of Miami | No | <br />
Loyola Marymount | Yes | <br />
North Dakota State University | Yes | www.ndsu.nodak.edu/ndsu.undergraduate <br />
University of North Alabama | Yes | www.una.edu/academic <br />
University of Oklahoma | No | <br />
Embry-Riddle | Yes | www.embryriddle.edu/development/thanks/thanks.html <br />
Weber State University (Utah) | Yes | weber.edu/ns.asp <br />
University of Arkansas | Yes | www.uark.edu/ <br />
Baker University | Yes | www.bakeru.edu/family&friends/index.htm <br />
Saint John's University | Yes | www.stjohns.edu/pls/portal30/sjudev.school.home <br />
Bowling Green State University | Yes | www.bgsu.edu/offices/admissions/choose/welcome.html <br />
Troy State University | Yes | www.troyst.edu/gradstudies/gradbulletin.html <br />
Marshall University | Yes | www.marshall.edu <br />
Wilmington College | Yes | www.wilmigngton.edu/ADMIT1.html <br />
Georgetown University | Yes | www.georgetown.edu/home/family.html <br />
University of Montana | Yes | www.umt.edu/homepage/homepage/faculty_staff/admission.htm <br />
Florida Institute of Technology | Yes | www.fit.edu/prospective/index.html <br />
Barclay College | Yes | www.barclaycollege.edu/ap.htm <br />
United States Air Force Academy | Yes | www.usaf.af.mil/noflash/index.html <br />
Kentucky State University | Yes | www.www.kysu.edu/academics.shtml <br />
University of Louisiana | Yes | www.ulm.edu/~english/index2.htm?http:www.ulm.edu~english/graduate/index.html <br />
University of Tulsa | No | <br />
Rice University | No | <br />
Colgate University | Yes | www.colgate.edu/academic <br />
Florida State University | Yes | www.fsu.edu/current/graduat/academics.shtml <br />
University of Vermont | Yes | www.uvm.edu/~global/la.htm <br />
Emory University | Yes | www.emory.edu/COLLEGE/scieceandsociety/robertson2.htm <br />
University of Virginia | Yes | www.law.virgina.edu/home2001/DegreeProgram.shtml <br />
University of Kentucky | Yes | www.uky.edu/Home/GeneralInfo/nationrank.html <br />
Alabama State University | Yes | www.alasu.edu/coe/coe_fclt.html <br />
Oakwood College | Yes | www.oakwood.edu/admissions <br />
Rocky Mountain College | Yes | www.rocky.edu/academic/international <br />
University of Montana | Yes | www.umt.edu/homepage/faculty_staff/campusinfo.html <br />
Idaho State University | Yes | www.isu.edu/athletics.html <br />
Canyon College | Yes | www.canyoncollege.edu/degree.htm <br />
Delaware State University | Yes | www.dsc.edu/student/studentlife.html <br />
Wesley College | Yes | www.wesley.edu/admissions/index-admissions.html <br />
Central College | Yes | www.central.edu/english.englishdepartment.htm <br />
Drake University | Yes | www.drake.edu/artsci/soange/soc/smaj.html <br />
Woffard College | Yes | www.woffard.edu/whatsnew/index.htm <br />
Columbia College | Yes | www.columbiacollege.edu/news.tml <br />
Providence College | Yes | www.providence.edu/alumni/ynews.htm <br />
Brown University | Yes | www.brown.edu/webmaster/public_service.html <br />
Bates College | Yes | abacus.bates.edu/admin/offices/health/services2.htm <br />
College of the Atlantic | Yes | www.coa.edu/ACADEMICPROGRAM <br />
Thomas College | Yes | www.thomas.edu/facilities <br />
York County Technical College | Yes | www.yctc.net/student_services/finaid/FAINDX.htm <br />
Babson College | No | <br />
Elms College | No | <br />
Lesley College | Yes | www.lesley.edu/student.html <br />
Olin College | Yes | www.olin.edu/admissions/index.html <br />
University of Central Florida | Yes | www.ucf.edu/prospective/index.html <br />
University of North Florida | Yes | www.unf.edu/mainpages/academics.html <br />
Boston University | Yes | www.bu.edu/cas/centers-institutes/ <br />
Rutgers University | Yes | Gradstudy.Rutgers.edu/grad-school.html <br />
Harvard University | Yes | www.harvard.edu/academics <br />
Rensselaer at Hartford | Yes | http://www.rh.edu/does/bio_certificate.html <br />
<br />
APPENDIX B XMLIB DTD AND XML FILES <br />
<br />
==============================DTD============================= <br />
<?xml version="1.0" encoding="UTF-8"?> <br />
<!-- This is the DTD for the XML Information Base (XMLIB). This DTD contains the definitions of all the elements used in the XMLIB. The XMLIB is structured as a tree. Each node in this tree is represented by a separate XML file (it is assumed that all XMLIB .xml files and the DTD are located in the same directory currently). --> <br />
<!-- ******************** Begin Main DTD ********************* --> <br />
<!ELEMENT GRAD_PAGES (DIRECTORY | OVERVIEW | GEN_INFO | ADMISSION | FINANCIAL | MASTERS | ENGINEER | PHD | CONTACTS | UNDERGRAD_PREREQS | CORE_COURSES | FACULTY | LABS | GRAD_COURSES | UNDERGRAD_COURSES)> <br />
<!-- Here we could place attributes that reflect the sources of the information (the source web pages), or version/tracking numbers for internal system needs, etc. Basically anything we find necessary can be included... don't want to put TOO much here, though. 
--> <br />
<!ATTLIST GRAD_PAGES lastRevised CDATA #REQUIRED> <br />
<!-- date of last revision mm/dd/yyyy --> <br />
<!-- ********************* meta-elements ********************* --> <br />
<!-- The following elements are meta-elements, meaning they are used to describe the contents of other elements. They don't contain any actual data concerning their parent elements. These elements can be used by the system to aid in getting good query results. --> <br />
<!-- The CW (component word) element will be used throughout the XMLIB in order to specify words that are part of the natural language description of the current element. --> <br />
<!ELEMENT CW (#PCDATA)> <br />
<!-- The CONTENT element will be used throughout the XMLIB to specify the contents of the current element. Basically this tells us what the current element has information on or what it talks about/contains. Useful for response portion of system. --> <br />
<!ELEMENT CONTENT (#PCDATA)> <br />
<!-- ************** general use elements ********************* --> <br />
<!-- These elements are used in multiple different places in the XMLIB. --> <br />
<!-- The TEXT element contains the actual text from the source web page for the area that the parent element represents. Where a TEXT element is included as TEXT (rather than TEXT?), that element will contain its actual information within the TEXT element. Where a TEXT element is included as TEXT+, this indicates that the element's information is in multiple TEXT elements, and each TEXT element is meant to be displayed on its own line by the output (Nick's) module --> <br />
<!ELEMENT TEXT (#PCDATA)> <br />
<!-- The ROOT_TEXT element is used to include root/important key words that are found in the corresponding TEXT element. There is a ROOT_TEXT element for every TEXT element, except the TEXT element found in the TARGET element, and certain other elements where keywords for each TEXT element are not needed (i.e. 
ADDRESS) --> <br />
<!-- The ROOT_TEXT element is used by the Query Generator, when keywords from the query cannot be found in either the CW elements or in the tag names themselves. Initially, a list of keywords for each TEXT element was built dynamically and then searched through. The static addition of keywords after each TEXT element greatly improves response time, while only increasing the file size marginally. --> <br />
<!ELEMENT ROOT_TEXT (#PCDATA)> <br />
<!-- The LINK element contains information about html (or other protocol) links that are contained in the text of the parent element. Currently, only elements that have actual html links in the source web pages have a LINK element included, but in the future, all elements can be allowed to have LINK elements, to facilitate less work in updating the XMLIB with more html link information. --> <br />
<!ELEMENT LINK (TEXT,TARGET)> <br />
<!ELEMENT TARGET (#PCDATA)> <br />
<!-- In a LINK element, the TEXT element lists the name of the link as seen on the source web page, and the TARGET element lists the actual target URL of the link --> <br />
<!-- The EMAIL_LINK element works the same structurally as the LINK element above. This element is needed to distinguish an email address link on the source web pages from a regular link. --> <br />
<!ELEMENT EMAIL_LINK (TEXT,TARGET)> <br />
<!-- ***************** DIRECTORY element ********************* --> <br />
<!ELEMENT DIRECTORY (LISTING*)> <br />
<!ATTLIST DIRECTORY domain CDATA #REQUIRED> <br />
<!-- set of web pages this XMLIB deals with --> <br />
<!ELEMENT LISTING (CW,CONTENT)> <br />
<!ATTLIST LISTING file CDATA #REQUIRED> <br />
<!-- the XML file that contains the info for this listing --> <br />
<!-- ************** CORE_COURSES element ********************* --> <br />
<!ELEMENT CORE_COURSES (CW,CONTENT,TEXT?,ROOT_TEXT?,MASTERS_CORE,PHD_CORE)> <br />
<!ELEMENT MASTERS_CORE (CW,CONTENT,TEXT?,ROOT_TEXT?,LINK?,COURSE*)> <br />
<!ELEMENT PHD_CORE (CW,CONTENT,TEXT?,ROOT_TEXT?,COURSE*)> <br />
<!ELEMENT COURSE (CW,CONTENT,TEXT,ROOT_TEXT?,LINK?,NUMBER?,DESCRIPTION?,PREREQ?)> <br />
<!ELEMENT NUMBER (CW,CONTENT,TEXT,ROOT_TEXT?)> <br />
<!ELEMENT DESCRIPTION (CW,CONTENT,TEXT,ROOT_TEXT)> <br />
<!ELEMENT PREREQ (CW,CONTENT,TEXT,ROOT_TEXT,LINK*)> <br />
<!-- *************** OVERVIEW element ************************ --> <br />
<!ELEMENT OVERVIEW (CW,CONTENT,TEXT,ROOT_TEXT,LINK*)> <br />
<!-- **************** GEN_INFO element *********************** --> <br />
<!ELEMENT GEN_INFO (CW,CONTENT,TEXT?,ROOT_TEXT?,DEGREES_OFFERED,STUDY_AREAS,COMPUTING_RESOURCES)> <br />
<!ELEMENT DEGREES_OFFERED (CW,CONTENT,TEXT?,ROOT_TEXT?,LINK*,DEGREE*)> <br />
<!ELEMENT DEGREE (CW,CONTENT,TEXT,ROOT_TEXT?)> <br />
<!ELEMENT STUDY_AREAS (CW,CONTENT,TEXT?,ROOT_TEXT?,STUDY_AREA*)> <br />
<!ELEMENT STUDY_AREA (CW,CONTENT,TEXT,ROOT_TEXT?,DESCRIPTION)> <br />
<!ELEMENT COMPUTING_RESOURCES (CW,CONTENT,TEXT?,ROOT_TEXT?,RESOURCE*)> <br />
<!ELEMENT RESOURCE (CW,CONTENT,TEXT,ROOT_TEXT)> <br />
<!-- ******************* ADMISSION element ******************* --> <br />
<!ELEMENT ADMISSION (CW,CONTENT,TEXT?,ROOT_TEXT?,APPLICATION_INFO,ADMISSION_MAIL,CISE_MAIL)> <br />
<!ELEMENT APPLICATION_INFO (CW,CONTENT,TEXT?,ROOT_TEXT?,REQUIREMENT*)> <br />
<!ELEMENT REQUIREMENT (CW,CONTENT,TEXT,ROOT_TEXT)> <br />
<!ELEMENT ADMISSION_MAIL (CW,CONTENT,TEXT?,ROOT_TEXT?,MATERIAL*,ADDRESS)> <br />
<!ELEMENT MATERIAL (CW,CONTENT,TEXT,ROOT_TEXT)> <br />
<!ELEMENT ADDRESS (CW,CONTENT,TEXT+)> <br />
<!ELEMENT 
CISE_MAIL (CW,CONTENT,TEXT?,ROOT_TEXT?,MATERIAL*,ADDRESS)> <br />
<!-- ****************** FINANCIAL element ******************** --> <br />
<!ELEMENT FINANCIAL (CW,CONTENT,TEXT?,ROOT_TEXT?,FINANCIAL_ASSISTANCE,TUITION,FINANCIAL_RESPONSIBILITY)> <br />
<!ELEMENT FINANCIAL_ASSISTANCE (CW,CONTENT,TEXT,ROOT_TEXT)> <br />
<!ELEMENT TUITION (CW,CONTENT,TEXT,ROOT_TEXT)> <br />
<!ELEMENT FINANCIAL_RESPONSIBILITY (CW,CONTENT,TEXT,ROOT_TEXT,LINK,ADDRESS)> <br />
<!-- ****************** MASTERS element ********************** --> <br />
<!ELEMENT MASTERS (CW,CONTENT,TEXT?,ROOT_TEXT?,ADMISSION_REQUIREMENTS,GENERAL_REQUIREMENTS,TRANSFER_CREDIT,SUPERVISION,MASTERS_CORE,ELECTIVE_AREAS,THESIS_OPTION,NONTHESIS_OPTION,MASTERS_EXAM,PROGRESS)> <br />
<!ELEMENT ADMISSION_REQUIREMENTS (CW,CONTENT,TEXT?,ROOT_TEXT?,LINK?,REQUIREMENT*,BACKGROUND?)> <br />
<!ELEMENT BACKGROUND (CW,CONTENT,TEXT,ROOT_TEXT,LINK,COURSE*)> <br />
<!ELEMENT GENERAL_REQUIREMENTS (CW,CONTENT,TEXT?,ROOT_TEXT?,LINK,REQUIREMENT*)> <br />
<!ELEMENT TRANSFER_CREDIT (CW,CONTENT,TEXT,ROOT_TEXT)> <br />
<!ELEMENT SUPERVISION (CW,CONTENT,(TEXT,ROOT_TEXT)+)> <br />
<!ELEMENT ELECTIVE_AREAS (CW,CONTENT,TEXT?,ROOT_TEXT?,ELECTIVE_AREA*)> <br />
<!ELEMENT ELECTIVE_AREA (CW,CONTENT,TEXT,ROOT_TEXT?)> <br />
<!ELEMENT THESIS_OPTION (CW,CONTENT,TEXT?,ROOT_TEXT?,LINK*,REQUIREMENT*,SUMMARY)> <br />
<!ELEMENT SUMMARY (CW,CONTENT,TEXT+,ROOT_TEXT?)> <br />
<!ELEMENT NONTHESIS_OPTION (CW,CONTENT,TEXT?,ROOT_TEXT?,LINK*,REQUIREMENT*,SUMMARY)> <br />
<!ELEMENT MASTERS_EXAM (CW,CONTENT,TEXT?,ROOT_TEXT?,THESIS_EXAM,NONTHESIS_EXAM)> <br />
<!ELEMENT THESIS_EXAM (CW,CONTENT,TEXT,ROOT_TEXT)> <br />
<!ELEMENT NONTHESIS_EXAM (CW,CONTENT,TEXT,ROOT_TEXT)> <br />
<!ELEMENT PROGRESS (CW,CONTENT,(TEXT,ROOT_TEXT)+)> <br />
<!-- ****************** ENGINEER element ********************** --> <br />
<!ELEMENT ENGINEER (CW,CONTENT,TEXT,ROOT_TEXT,LINK*)> <br />
<!-- ********************* PHD element ************************ --> <br />
<!ELEMENT PHD 
(CW,CONTENT,TEXT?,ROOT_TEXT?,ADMISSION_REQUIREMENTS,GENERAL_REQUIREMENTS,TRANSFER_CREDIT,SUPERVISION,COURSE_REQUIREMENT,PERFORMANCE,TIME_LIMIT,TRANSFER_CREDIT_GUIDELINES,COMP_EXAM,ORAL_QUALIFY_EXAM,TERMINATION,COMMUNICATION,DISSERTATION,DEFENSE,PROGRESS)> <br />
<!ELEMENT COURSE_REQUIREMENT (CW,CONTENT,TEXT?,ROOT_TEXT?,REQUIREMENT*,PHD_CORE,PHD_ELECTIVE)> <br />
<!ELEMENT PHD_ELECTIVE (CW,CONTENT,TEXT?,ROOT_TEXT?,COURSE*)> <br />
<!ELEMENT PERFORMANCE (CW,CONTENT,TEXT,ROOT_TEXT)> <br />
<!ELEMENT TIME_LIMIT (CW,CONTENT,TEXT,ROOT_TEXT)> <br />
<!ELEMENT TRANSFER_CREDIT_GUIDELINES (CW,CONTENT,TEXT?,ROOT_TEXT?,REQUIREMENT*)> <br />
<!ELEMENT COMP_EXAM (CW,CONTENT,TEXT?,ROOT_TEXT?,COMP_EXAM_WRITTEN,COMP_EXAM_ORAL)> <br />
<!ELEMENT COMP_EXAM_WRITTEN (CW,CONTENT,(TEXT,ROOT_TEXT)+)> <br />
<!ELEMENT COMP_EXAM_ORAL (CW,CONTENT,TEXT,ROOT_TEXT)> <br />
<!ELEMENT ORAL_QUALIFY_EXAM (CW,CONTENT,TEXT?,ROOT_TEXT?,REQUIREMENT*,PURPOSE,RESEARCH)> <br />
<!ELEMENT PURPOSE (CW,CONTENT,TEXT,ROOT_TEXT)> <br />
<!ELEMENT RESEARCH (CW,CONTENT,TEXT,ROOT_TEXT,LINK*)> <br />
<!ELEMENT TERMINATION (CW,CONTENT,TEXT,ROOT_TEXT)> <br />
<!ELEMENT COMMUNICATION (CW,CONTENT,TEXT,ROOT_TEXT)> <br />
<!ELEMENT DISSERTATION (CW,CONTENT,TEXT,ROOT_TEXT)> <br />
<!ELEMENT DEFENSE (CW,CONTENT,TEXT?,ROOT_TEXT?,LINK,REQUIREMENT*)> <br />
<!-- **************** CONTACTS element ************************ --> <br />
<!ELEMENT CONTACTS (CW,CONTENT,TEXT,ROOT_TEXT,EMAIL_LINK*,ADDRESS,WWW,FAX,TELEPHONE,EMAIL)> <br />
<!ELEMENT WWW (CW,CONTENT,TEXT,ROOT_TEXT,LINK)> <br />
<!ELEMENT FAX (CW,CONTENT,TEXT,ROOT_TEXT)> <br />
<!ELEMENT TELEPHONE (CW,CONTENT,(TEXT,ROOT_TEXT)+,EMAIL_LINK)> <br />
<!ELEMENT EMAIL (CW,CONTENT,(TEXT,ROOT_TEXT)+,EMAIL_LINK*)> <br />
<!-- ************** UNDERGRAD_PREREQS element ***************** --> <br />
<!ELEMENT UNDERGRAD_PREREQS (CW,CONTENT,TEXT?,ROOT_TEXT?,COURSE*)> <br />
<!-- ****************** FACULTY element *********************** --> <br />
<!ELEMENT FACULTY (CW,CONTENT,TEXT?,ROOT_TEXT?,FACULTY_MEMBER*)> <br />
<!ELEMENT FACULTY_MEMBER (CW,CONTENT,TEXT,ROOT_TEXT?,EMAIL_LINK*,RESEARCH_AREA?)> <br />
<!ELEMENT RESEARCH_AREA 
(CW,CONTENT,TEXT,ROOT_TEXT?)> <br />
<!-- ********************* LABS element *********************** --> <br />
<!ELEMENT LABS (CW,CONTENT,TEXT?,ROOT_TEXT?,LAB*)> <br />
<!ELEMENT LAB (CW,CONTENT,TEXT,ROOT_TEXT?,DIRECTOR,DESCRIPTION)> <br />
<!ELEMENT DIRECTOR (CW,CONTENT,TEXT,ROOT_TEXT?)> <br />
<!-- **************** GRAD_COURSES element ******************** --> <br />
<!ELEMENT GRAD_COURSES (CW,CONTENT,TEXT?,ROOT_TEXT?,APPLICATIONS,DESIGN,ENGINEERING,INFORMATION,PROGRAMMING,THEORY)> <br />
<!ELEMENT APPLICATIONS (CW,CONTENT,TEXT?,ROOT_TEXT?,COURSE*)> <br />
<!ELEMENT DESIGN (CW,CONTENT,TEXT?,ROOT_TEXT?,COURSE*)> <br />
<!ELEMENT ENGINEERING (CW,CONTENT,TEXT?,ROOT_TEXT?,COURSE*)> <br />
<!ELEMENT INFORMATION (CW,CONTENT,TEXT?,ROOT_TEXT?,COURSE*)> <br />
<!ELEMENT PROGRAMMING (CW,CONTENT,TEXT?,ROOT_TEXT?,COURSE*)> <br />
<!ELEMENT THEORY (CW,CONTENT,TEXT?,ROOT_TEXT?,COURSE*)> <br />
<!-- ************* UNDERGRAD_COURSES element ****************** --> <br />
<!ELEMENT UNDERGRAD_COURSES (CW,CONTENT,TEXT?,ROOT_TEXT?,APPLICATIONS,DESIGN,ENGINEERING,GENERAL,INFORMATION,PROGRAMMING,THEORY)> <br />
<!ELEMENT GENERAL (CW,CONTENT,TEXT?,ROOT_TEXT?,COURSE*)> <br />
<!-- End of DTD --> <br />
<br />
=======================GEN_INFO.XML========================= <br />
<?xml version='1.0' encoding="UTF-8" standalone="no"?> <br />
<!-- updated 09/24/01 --> <br />
<!-- CHANGES MADE: <br />
- 07/09/01: document created. <br />
- 07/11/01: <CW> elements now contain extra words that help differentiate/specify the current element. <br />
- 09/04/01: added <ROOT_TEXT> elements <br />
- 09/24/01: processed <ROOT_TEXT> elements --> <br />
<!-- The following declaration tells the XML processor that this file uses the "mainDTD.dtd" file as its DTD. 
--> <br />
<!DOCTYPE GRAD_PAGES SYSTEM "mainDTD.dtd"> <br />
<!-- *********************** Begin GEN_INFO *************************** --> <br />
<GRAD_PAGES lastRevised="09/24/01"> <br />
<GEN_INFO> <br />
<CW>general information</CW> <br />
<CONTENT>General information about the CISE graduate program</CONTENT> <br />
<DEGREES_OFFERED> <br />
<CW>graduate degree offer</CW> <br />
<CONTENT>The graduate degrees offered by the CISE department</CONTENT> <br />
<TEXT>The CISE Department offers the Master of Science degree through the College of Engineering (M.S. in Computer Engineering) and the College of Liberal Arts & Sciences (M.S. in Computer Science). The Master of Engineering (M.E. in Computer Engineering), the Engineer (Engineer in Computer Engineering), and the Ph.D. (Ph.D. in Computer Engineering) degrees are offered through the College of Engineering only. The Florida Engineering Education Delivery System (FEEDS) and National Technological University (NTU) make graduate instruction available to part-time students at participating remote locations via videotaped courses. Separate documents are available giving information specific to the FEEDS program and NTU courses.</TEXT> <br />
<ROOT_TEXT>CISE DEPARTMENT MASTER SCIENCE DEGREE COLLEGE ENGINEERING MS COMPUTER LIBERAL ARTS ENGINEER PHD FLORIDA EDUCATION DELIVERY SYSTEM FEED NATIONAL TECHNOLOGICAL UNIVERSITY NTU GRADUATE INSTRUCTION AVAILABLE PART-TIME STUDENT PARTICIPATE REMOTE LOCATION VIDEOTAPE COURSE SEPARATE DOCUMENT GIVING INFORMATION SPECIFIC PROGRAM</ROOT_TEXT> <br />
<LINK> <br />
<TEXT>The Florida Engineering Education Delivery System (FEEDS)</TEXT> <br />
<TARGET>http://www.eng.ufl.edu/home/oeep/</TARGET> <br />
</LINK> <br />
<LINK> <br />
<TEXT>National Technological University (NTU)</TEXT> <br />
<TARGET>http://www.ntu.edu/</TARGET> <br />
</LINK> <br />
<DEGREE> <br />
<CW>degree master master's science computer engineering</CW> <br />
<CONTENT>M.S. in Computer Engineering</CONTENT> <br />
<TEXT>Master of Science degree through the College of Engineering (M.S. 
in Computer Engineering)</TEXT> <br />
</DEGREE> <br />
<DEGREE> <br />
<CW>degree master master's science computer</CW> <br />
<CONTENT>M.S. in Computer Science</CONTENT> <br />
<TEXT>Master of Science degree through the College of Liberal Arts & Sciences (M.S. in Computer Science)</TEXT> <br />
</DEGREE> <br />
<DEGREE> <br />
<CW>degree master master's engineering computer</CW> <br />
<CONTENT>M.E. in Computer Engineering</CONTENT> <br />
<TEXT>Master of Engineering (M.E. in Computer Engineering) through the College of Engineering</TEXT> <br />
</DEGREE> <br />
<DEGREE> <br />
<CW>degree engineer computer engineering</CW> <br />
<CONTENT>Engineer in Computer Engineering</CONTENT> <br />
<TEXT>Engineer (Engineer in Computer Engineering) through the College of Engineering</TEXT> <br />
</DEGREE> <br />
<DEGREE> <br />
<CW>degree phd ph.d. computer engineering philosophy doctor</CW> <br />
<CONTENT>Ph.D. in Computer Engineering</CONTENT> <br />
<TEXT>Ph.D. (Ph.D. in Computer Engineering) through the College of Engineering</TEXT> <br />
</DEGREE> <br />
<DEGREE> <br />
<CW>degree remote graduate instruction</CW> <br />
<CONTENT>Remotely available graduate instruction</CONTENT> <br />
<TEXT>The Florida Engineering Education Delivery System (FEEDS) and National Technological University (NTU) make graduate instruction available to part-time students at participating remote locations via videotaped courses. 
Separate documents are available giving information specific to the FEEDS program and NTU courses.</TEXT> <br />
</DEGREE> <br />
</DEGREES_OFFERED> <br />
<STUDY_AREAS> <br />
<CW>study area specialization</CW> <br />
<CONTENT>Areas of study in the CISE graduate department</CONTENT> <br />
<TEXT>There are five areas of specialization in the Department</TEXT> <br />
<ROOT_TEXT>SPECIALIZATION DEPARTMENT</ROOT_TEXT> <br />
<STUDY_AREA> <br />
<CW>study area computer system architecture</CW> <br />
<CONTENT>Information on Computer Systems and Architecture study area</CONTENT> <br />
<TEXT>Computer systems and architecture</TEXT> <br />
<DESCRIPTION> <br />
<CW>description computer system architecture</CW> <br />
<CONTENT>Description of Computer Systems and Architecture study area</CONTENT> <br />
<TEXT>Computer architecture, distributed systems, fault-tolerant systems, computer simulation, computer networks and communication, operating systems, and performance evaluation</TEXT> <br />
<ROOT_TEXT>COMPUTER ARCHITECTURE DISTRIBUTE SYSTEM FAULT-TOLERANT SIMULATION NETWORK COMMUNICATION OPERATE PERFORMANCE EVALUATION</ROOT_TEXT> <br />
</DESCRIPTION> <br />
</STUDY_AREA> <br />
<STUDY_AREA> <br />
<CW>study area database system</CW> <br />
<CONTENT>Information on Database Systems study area</CONTENT> <br />
<TEXT>Database systems</TEXT> <br />
<DESCRIPTION> <br />
<CW>description database system</CW> <br />
<CONTENT>Description of Database Systems study area</CONTENT> <br />
<TEXT>Database management systems and applications, database design, database theory and implementation, database machines, distributed databases, and information retrieval</TEXT> <br />
<ROOT_TEXT>DATABASE MANAGEMENT SYSTEM APPLICATION DESIGN THEORY IMPLEMENTATION MACHINE DISTRIBUTE INFORMATION RETRIEVAL</ROOT_TEXT> <br />
</DESCRIPTION> <br />
</STUDY_AREA> <br />
<STUDY_AREA> <br />
<CW>study area software engineering</CW> <br />
<CONTENT>Information on Software Engineering study area</CONTENT> <br />
<TEXT>Software engineering</TEXT> <br />
<DESCRIPTION> <br />
<CW>description software engineering</CW> <br />
<CONTENT>Description of Software Engineering study area</CONTENT> <br />
<TEXT>Large-scale software 
design, software development and maintenance methodologies, software quality assurance, programming environments and languages, parallel and distributed systems, and real-time systems</TEXT> <ROOT_TEXT>LARGE SCALE SOFTWARE DESIGN DEVELOPMENT MAINTENANCE METHODOLOGY QUALITY ASSURANCE PROGRAMMING ENVIRONMENT LANGUAGE PARALLEL DISTRIBUTE SYSTEM REAL-TIME</ROOT_TEXT> </DESCRIPTION> </STUDY_AREA> <STUDY_AREA> <CW>study area intelligent system</CW> <CONTENT>Information on Intelligent Systems study area</CONTENT> <TEXT>Intelligent systems</TEXT> <DESCRIPTION> <CW>description intelligent system</CW> <CONTENT>Description of Intelligent Systems study area</CONTENT> <TEXT>Pattern recognition, image processing, computer vision, CAD/CAM, computer graphics, computer animation/simulation, robotics, expert systems, knowledge representation, machine learning, and artificial intelligence</TEXT> <ROOT_TEXT>PATTERN RECOGNITION IMAGE PROCESSING COMPUTER VISION CAD/CAM GRAPHICS ANIMATION SIMULATION ROBOTICS EXPERT SYSTEM KNOWLEDGE REPRESENTATION MACHINE LEARNING ARTIFICIAL INTELLIGENCE</ROOT_TEXT> </DESCRIPTION> </STUDY_AREA> <STUDY_AREA> <CW>study area algorithm high performance computing compute</CW> <CONTENT>Information on Algorithms and High Performance Computing study area</CONTENT> <TEXT>Algorithms and high performance computing</TEXT> <DESCRIPTION> <br /> <br /> PAGE 129<br /> <br /> 118 <CW>description algorithm high performance computing compute</CW> <CONTENT>Description of Algorithms and High Performance Computing study area</CONTENT> <TEXT>Parallel algorithm design, and shared and distributed memory multiprocessor systems</TEXT> <ROOT_TEXT>PARALLEL ALGORITHM DESIGN SHARE DISTRIBUTE MEMORY MULTIPROCESSOR SYSTEM</ROOT_TEXT> </DESCRIPTION> </STUDY_AREA> </STUDY_AREAS> <COMPUTING_RESOURCES> <CW>compute computing resource</CW> <CONTENT>Information on the computing resources available in the CISE department</CONTENT> <TEXT>The CISE Department houses a wide variety of computers. 
In addition to the numerous machines in public laboratories each faculty member, TA, or RA is provided with an office containing a workstation or PC. The department network has a switched all-fiber infrastructure supported by two CISCO Catalyst 5500 and one CISCO Catalys 5505 switch. Three fiber PoPs are provided in each office and six in each laboratory. This network supports a homogeneous user filesystem view across all the operating systems and architectures we support.</TEXT> <ROOT_TEXT>CISE DEPARTMENT HOUSE WIDE VARIETY COMPUTER ADDITION NUMEROUS MACHINE PUBLIC LABORATORY FACULTY MEMBER TA RA PROVIDE OFFICE CONTAIN WORKSTATION PC NETWORK HA SWITCH ALL-FIBER INFRASTRUCTURE SUPPORT CISCO CATALYST CATALYS FIBER POP HOMOGENEOUS USER FILESYSTEM VIEW ACROSS OPERATE SYSTEM ARCHITECTURE</ROOT_TEXT> <RESOURCE> <CW>resource server sun</CW> <CONTENT>Sun 3500 server</CONTENT> <TEXT>A high-availability cluster of two Sun 3500s with two A5000 disk arrays</TEXT> <ROOT_TEXT>HIGH-AVAILABILITY CLUSTER SUN 3500S A5000 DISK ARRAY</ROOT_TEXT> </RESOURCE> <RESOURCE> <CW>resource server sun enterprise</CW> <CONTENT>Sun Enterprise 4000</CONTENT> <TEXT>An eight-processor Sun Enterprise 4000</TEXT> <ROOT_TEXT>PROCESSOR SUN ENTERPRISE</ROOT_TEXT> </RESOURCE> <RESOURCE> <br /> <br /> PAGE 130<br /> <br /> 119 <CW>resource server IBM ibm</CW> <CONTENT>IBM server</CONTENT> <TEXT>A 14-Node IBM SP-2 providing service to the entire Engineering College</TEXT> <ROOT_TEXT>PROCESSOR SUN ENTERPRISE</ROOT_TEXT> </RESOURCE> <RESOURCE> <CW>resource server sun</CW> <CONTENT>Sun 450s</CONTENT> <TEXT>Several Sun 450s</TEXT> <ROOT_TEXT>SEVERAL SUN 450S</ROOT_TEXT> </RESOURCE> <RESOURCE> <CW>resource server file intel</CW> <CONTENT>Intel file servers</CONTENT> <TEXT>A variety of Intel file servers</TEXT> <ROOT_TEXT>VARIETY INTEL FILE SERVER</ROOT_TEXT> </RESOURCE> <RESOURCE> <CW>resource workstation sun</CW> <CONTENT>Sun Ultra-5's and Ultra-10's</CONTENT> <TEXT>Sun Microsystems Ultra-5s and Ultra-10s 
running Solaris</TEXT> <ROOT_TEXT>SUN MICROSYSTEMS ULTRA-5S ULTRA-10S RUNNING SOLARIS</ROOT_TEXT> </RESOURCE> <RESOURCE> <CW>resource pc dell</CW> <CONTENT>Dell Optiplex PCs</CONTENT> <TEXT>Dell Optiplexes running Windows NT and RedHat Linux</TEXT> <ROOT_TEXT>DELL OPTIPLEXES RUNNING WINDOW NT REDHAT LINUX</ROOT_TEXT> </RESOURCE> <RESOURCE> <CW>resource workstation SGI sgi</CW> <CONTENT>SGI workstations</CONTENT> <TEXT>SGI Indys, Indigo-IIs, and O2s running IRIX</TEXT> <ROOT_TEXT>INDYS INDIGO-IIS O2S RUNNING IRIX</ROOT_TEXT> </RESOURCE> </COMPUTING_RESOURCES> </GEN_INFO> <br /> <br /> PAGE 131<br /> <br /> 120 </GRAD_PAGES> ===========================LABS.XML===================== <?xml version='1.0' encoding="UTF-8" standalone="no"?> <!-- updated 9/24/01 --> <!-- CHANGES MADE: -08/29/01: document created -09/04/01: added <ROOT_TEXT> elements -09/24/01: processed <ROOT_TEXT> elements --> <!-- the following declaration tells the XML processor that this file uses the "mainDTD.dtd" file as its DTD. --> <!DOCTYPE GRAD_PAGES SYSTEM "mainDTD.dtd"> <!-- **************** Begin LABS *********************** --> <GRAD_PAGES lastRevised="09/24/01"> <LABS> <CW>lab laboratory research center</CW> <CONTENT>Information on CISE research centers and laboratories</CONTENT> <LAB> <CW>center computer vision visualization</CW> <CONTENT>Information on the Center for Computer Vision and Visualization</CONTENT> <TEXT>Center for Computer Vision and Visualization</TEXT> <DIRECTOR> <CW>director ritter</CW> <CONTENT>Director of the Center for Computer Vision and Visualization</CONTENT> <TEXT>Director: Dr. G.X. Ritter</TEXT> </DIRECTOR> <DESCRIPTION> <CW>description</CW> <CONTENT>Description of the Center for Computer Vision and Visualization</CONTENT> <TEXT>The Center for Computer Vision and Visualization (CCVV) covers basic and applied research in all aspects of computer vision, computer visualization, and closely related areas of research. 
Computer vision provides the analysis of real world image data, whereas visualization synthetically produces images based on dynamic models created from real world data. Core areas that support both vision and visualization include image algebra, pattern recognition methods, physically based modeling, computer simulation, computer graphics, and dynamical systems theory.</TEXT> <br /> <br /> PAGE 132<br /> <br /> 121 <ROOT_TEXT>CENTER COMPUTER VISION VISUALIZATION CCVV COVER BASIC APPLY RESEARCH ASPECT CLOSELY RELATE PROVIDE ANALYSIS REAL WORLD IMAGE DATA WHEREAS SYNTHETICALLY PRODUCE BASE DYNAMIC MODEL CREATE CORE SUPPORT INCLUDE ALGEBRA PATTERN RECOGNITION METHOD PHYSICALLY MODELING SIMULATION GRAPHICS DYNAMICAL SYSTEM THEORY</ROOT_TEXT> </DESCRIPTION> </LAB> <LAB> <CW>database system development</CW> <CONTENT>Information on the Database Systems Research and Development Center</CONTENT> <TEXT>Database Systems Research and Development Center</TEXT> <DIRECTOR> <CW>director su</CW> <CONTENT>Director of the Database Systems Research and Development Center</CONTENT> <TEXT>Director: Dr. S.Y.W. 
Su</TEXT> </DIRECTOR> <DESCRIPTION> <CW>description</CW> <CONTENT>Description of the Database Systems Research and Development Center</CONTENT> <TEXT>The Database Systems Research and Development Center deals with the following three categories of research and development activities: (1) the database-management aspects of information system processing, (2) the hardware aspects of information system design and development, and (3) the behavioral aspects of information transfer.</TEXT> <ROOT_TEXT>DATABASE SYSTEM RESEARCH DEVELOPMENT CENTER DEAL CATEGORY ACTIVITY DATABASE-MANAGEMENT ASPECT INFORMATION PROCESSING HARDWARE DESIGN BEHAVIORAL TRANSFER</ROOT_TEXT> </DESCRIPTION> </LAB> <LAB> <CW>software engineer engineering</CW> <CONTENT>Information on the Software Engineering and Research Center</CONTENT> <TEXT>Software Engineering and Research Center</TEXT> <DIRECTOR> <CW>director thebaut</CW> <CONTENT>Director of the Software Engineering and Research Center</CONTENT> <TEXT>Site Director: Dr. Stephen Thebaut</TEXT> <br /> <br /> PAGE 133<br /> <br /> 122 </DIRECTOR> <DESCRIPTION> <CW>description</CW> <CONTENT>Description of the Software Engineering and Research Center</CONTENT> <TEXT>The SERC is a joint center with Purdue University under the National Science Foundation, Industry/University Cooperative Research Center Program. It is supported by National Science Foundation and 12 industrial and governmental sponsors. 
It is involved in developing tools, environments, and metrics to assist the development and maintenance of reliable, efficient, reusable, and easily maintained software systems.</TEXT> <ROOT_TEXT>SERC JOINT CENTER PURDUE UNIVERSITY NATIONAL SCIENCE FOUNDATION INDUSTRY/UNIVERSITY COOPERATIVE RESEARCH PROGRAM SUPPORT INDUSTRIAL GOVERNMENTAL SPONSOR INVOLVE DEVELOPING TOOL ENVIRONMENT METRICS ASSIST DEVELOPMENT MAINTENANCE RELIABLE EFFICIENT REUSABLE MAINTAIN SOFTWARE SYSTEM</ROOT_TEXT> </DESCRIPTION> </LAB> </LABS> </GRAD_PAGES> <br /> <br /> PAGE 134<br /> <br /> 123 LIST OF REFERENCES [1] Clayton, Gray. Managing Information Retrieval in Social Studies Lessons. University of Waikato, New Zealand. Available at http://www.cssjournal.com/archives/clayton.html, last accessed on 01/26/2002. [2] Barlow, Linda 2001. A Helpful Guide to Web Search Engines. Monash Information Services. Available at http://www.monash.com/spidap4.html, last updated on 09/22/2001, last accessed on 01/26/2002. [3] Pridaphattharakun, Wilasini 2001. Information Retrieval and Answer Extraction for an XML Knowledge Base in WEBNL. Master’s Thesis, University of Florida, Gainesville, FL. [4] Antonio, Nicholas 2001. Intelligent Interface Design for a Question Answering System. Master’s Thesis, University of Florida, Gainesville, FL. Available at http://www.cise.ufl.edu/~nantonio/research.htm, last accessed on 1/29/2002. [5] Srihari, Rohini and Li, Wei 2000. A Question Answering System Supported by Information Extraction. Cymfony, Inc., Williamsville, NY. In Proceedings of the 6th Applied Natural Language Processing Conference, Seattle, WA p. 166. [6] Dankel, Douglas D. II 2000. Graduate Brochure. Computer and Information Science and Engineering, University of Florida. Available at http://www.cise.ufl.edu/~ddd/grad, last updated on 07/26/2000, last accessed on 02/01/2002. [7] Abney, Steven; Collins, Michael and Singhal, Amit 2000. Answer Extraction. AT&T Shannon Laboratory, Florham Park, NJ. 
In Proceedings of the 6th Applied Natural Language Processing Conference, Seattle, WA pp. 296-301. [8] Cardie, Claire; Ng, Vincent; Pierce, David and Buckley, Chris 2000. Examining the Role of Statistical and Linguistic Knowledge Sources in a General-Knowledge Question-Answering System. Cornell University, Ithaca, NY. In Proceedings of the 6th Applied Natural Language Processing Conference, Seattle, WA pp. 180-187. [9] Waltz, David 1978. An English Language Question Answering System for a Large Relational Database. Communications of the ACM, Vol. 21, No. 7, July 1978, pp. 526-539. <br /> <br /> PAGE 135<br /> <br /> 124 [10] Amble, Tore 2000. BusTUC – A Natural Language Bus Route Oracle. University of Trondheim, Norway. In Proceedings of the 6th Applied Natural Language Processing Conference, Seattle, WA pp. 1-6. [11] Gonzalez, Avelino and Dankel, Douglas 1993. The Engineering of Knowledge-Based Systems. Prentice Hall, Englewood Cliffs, NJ, pp. 47-85. [12] Ginsberg, Matt 1993. Essentials of Artificial Intelligence. Morgan Kaufmann Publishers, San Mateo, CA, pp. 228-247. [13] Miller, George; Beckwith, Richard; Fellbaum, Christiane; Gross, Derek and Miller, Katherine 1993. Introduction to WordNet: An On-line Lexical Database. Cognitive Science Laboratory, Princeton University. Available at ftp://ftp.cogsci.princeton.edu/pub/wordnet/5papers.pdf, last updated on 07/31/1997, last accessed on 03/02/2002. [14] Richardson, Stephen; Dolan, William and Vanderwende, Lucy 1998. MindNet: Acquiring and Structuring Semantic Information from Text. Microsoft Research, Redmond, WA. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and the 17th International Conference on Computational Linguistics, Montreal, Quebec, Canada p. 1098. [15] Martin, Philippe and Eklund, Peter 2000. Knowledge Retrieval and the World Wide Web. IEEE Intelligent Systems, Vol. 15, No. 3, May/June, pp. 18-25. 
[16] Rabarijaona, Auguste; Dieng, Rose; Corby, Olivier and Ouaddari, Rajae 2000. Building and Searching an XML-Based Corporate Memory. IEEE Intelligent Systems, Vol. 15, No. 3, May/June, pp. 56-62. [17] Bosak, John 1997. XML, Java, and the Future of the Web. Sun Microsystems. Available at http://www.ibiblio.org/pub/sun-info/standards/xml/why/xmlapps.htm, last updated on 03/10/1997, last accessed on 02/14/2002. [18] World Wide Web Consortium, available at http://www.w3.org, last updated on 02/14/2002, last accessed on 02/14/2002. [19] Lee, Dongwon and Chu, Wesley 2000. Comparative Analysis of Six XML Schema Languages. Computer Science, University of California, Los Angeles. Available at http://www.cobase.cs.ucla.edu/tech-docs/dongwon/ucla-200008.html, last accessed on 02/17/2002. [20] St. Laurent, Simon 1999. XML: A Primer, 2nd Edition. M&T Books, Foster City, CA. [21] Hunter, Jason and Crawford, William 1998. Java Servlet Programming. O’Reilly & Associates, Sebastopol, CA. <br /> <br /> PAGE 136<br /> <br /> 125 [22] Fujii, Atsushi and Ishikawa, Tetsuya 2000. Utilizing the World Wide Web as an Encyclopedia: Extracting Term Descriptions from Semi-Structured Texts. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics (ACL-2000), pp. 488-495. Available at http://arXiv.org/abs/cs/0011001, last updated on 11/02/2000, last accessed on 03/28/2002. [23] Fellbaum, Christiane 1993. English Verbs as a Semantic Net. Cognitive Science Laboratory, Princeton University. Available at ftp://ftp.cogsci.princeton.edu/pub/wordnet/5papers.pdf, last updated on 07/31/1997, last accessed on 03/02/2002. <br /> <br /> PAGE 137<br /> <br /> BIOGRAPHICAL SKETCH Nathaniel Nadeau was born in Lynchburg, Virginia, on May 6, 1977. He moved to West Palm Beach, Florida, in 1986 and remained there until enrolling at the University of Florida in 1995. He received a Bachelor of Science with highest honors in computer and information science in May 1999. 
That same year he was named a United States Achievement Academy All American Scholar and earned a National Collegiate Computer Science Award. He remained at the University of Florida and enrolled in the Computer and Information Science and Engineering Department Graduate Program, where he remains today. Through his seven years at the University of Florida, he has interned at Motorola three times, worked with several faculty members as a lab assistant, grader, and teaching assistant, and has been constantly involved as both a player and a coach with the University of Florida Club Volleyball team. 126 <br /> <br /><br /> </div> </td> </tr> </tr> </table> </section> <!-- Close the presentation table --> </td> </tr> </table> <!-- Hidden field is used for postbacks to indicate what to save and reset --> <input type="hidden" id="item_action" name="item_action" value="" /> <!-- Close microdata itemscope div --> </section> <script type="text/javascript" src="https://cdn.sobekdigital.com/includes/jquery-ui-draggable/1.10.3/jquery-ui-1.10.3.draggable.min.js"></script> </form> <script async src="https://www.googletagmanager.com/gtag/js?id=UA-272759-11"></script> <script> window.dataLayer = window.dataLayer || [] function gtag() { dataLayer.push(arguments) } gtag('js', new Date()) gtag('config', 'UA-272759-11', { custom_map: { dimension1: 'bib', dimension2: 'vid', dimension3: 'aggregation', dimension4: 'viewer', dimension5: 'tickler' } }) </script> <!-- Adding footer to html (Html_MainWriter.Display_Footer) --> <!-- Footer divisions complete the web page --> <footer id="ufdcfooter_item"> <nav> <p><a href="https://ufdc.ufl.edu/contact">Contact Us</a> | <a href="https://ufdc.ufl.edu/permissions">Permissions</a> | <a href="https://ufdc.ufl.edu/uftech">UF Technologies</a> | <a href="https://ufdc.ufl.edu/stats">Statistics</a> | <a href="https://ufdc.ufl.edu/internal">Internal</a> | <a href="http://www.uflib.ufl.edu/privacy.html">Privacy Policy</a> | <a 
href="https://ufdc.ufl.edu/rss">RSS</a> | <a href="http://cms.uflib.ufl.edu/accessibility/UFDC">ADA/Accessibility</a> </p> </nav> <div id="UfdcWordmark_item"> <a href="https://www.ufl.edu"><img src="https://cdn.sobekdigital.com/instances/ufdc/smallWordmark_333333.png" alt="University of Florida Home Page" title="University of Florida Home Page" style="border: none;" id="UfdcWordmarkImage" /></a> </div> <div id="UfdcCopyright_item"> <a href="http://cms.uflib.ufl.edu/InclusionAndIntellectualFreedom">Statement on Inclusion and Intellectual Freedom</a><br/> <a href="http://www.uflib.ufl.edu/rights.html"> © University of Florida George A. Smathers Libraries.<br />All rights reserved.</a> <br /> <a href="http://www.uflib.ufl.edu/accesspol.html">Terms of Use for Electronic Resources</a> and <a href="https://www.uflib.ufl.edu/copyright.html">Copyright Information</a> <br /> Powered by <a href="https://sobekrepository.org/sobekcm">SobekCM</a> </div> </footer> <script> (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){ (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o), m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m) })(window,document,'script','https://www.google-analytics.com/analytics.js','ga'); ga('create', 'UA-272759-9', 'auto'); ga('send', 'pageview'); </script> <!-- end of adding footer to html (Html_MainWriter.Display_Footer) --> </body> </html>