Group Title: toolkit for managing XML data with a relational database management system
Title: A Toolkit for managing XML data with a relational database management system
Permanent Link: http://ufdc.ufl.edu/UF00100860/00001
 Material Information
Title: A Toolkit for managing XML data with a relational database management system
Physical Description: Archival
Language: English
Creator: Ramani, Ramasubramanian, 1977- ( Dissertant )
Hammer, Joachim ( Thesis advisor )
Dankel, Douglas ( Reviewer )
Lam, Herman ( Reviewer )
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2001
Copyright Date: 2001
 Subjects
Subject: XML (Document markup language)   ( lcsh )
Relational databases   ( lcsh )
Computer and Information Science and Engineering thesis, M.S   ( lcsh )
Dissertations, Academic -- Computer and Information Science and Engineering -- UF   ( lcsh )
Genre: government publication (state, provincial, territorial, dependent)   ( marcgt )
bibliography   ( marcgt )
theses   ( marcgt )
non-fiction   ( marcgt )
 Notes
Summary: ABSTRACT: This thesis presents the underlying research, design and implementation of our XML Data Management Toolkit (XML toolkit), which provides the core functionality for storing, querying, and managing XML data using a relational database management system (RDBMS). The XML toolkit is an integral part of the Information Integration Wizard (IWiz) system that is currently under development in the Database Research and Development Center at the University of Florida. IWiz enables the querying of multiple semistructured information sources through one integrated view, thereby removing existing heterogeneities at the structural and semantic levels. IWiz uses a combined mediation/data warehousing approach to retrieve and manage information from the data sources which are represented as semistructured data in IWiz; the internal data model is based on XML and the document object model (DOM). The XML toolkit is part of the Data Warehouse Manager (WHM), which is responsible for caching the results of frequently accessed queries in the IWiz warehouse for faster response and increased efficiency. IWiz has two major phases of operation: A built-time phase during which the schema creator module of the XML toolkit creates the relational schema for the data warehouse using the DTD description of the global IWiz schema as input. This is followed by the run-time or query phase during which the warehouse accepts and processes XML-QL queries against the underlying relational database. Note the XML-QL to SQL conversion is part of another ongoing research project in the center.
Summary: ABSTRACT (cont.): During run-time, the Relational-to-XML-Engine component of the XML toolkit is used to convert relational results from the warehouse into an equivalent XML document that has the same structure as the global IWiz schema. The initial query may also be sent to the mediator in case the contents of the data warehouse are not up-to-date. The loader component of the XML toolkit is used to convert and store XML data from the sources via the mediator into the underlying relational format during warehouse maintenance. We have implemented a fully functional version of the XML toolkit, which uses Oracle 8i as the underlying relational data warehouse engine. The XML toolkit is integrated into the IWiz testbed and is currently undergoing extensive testing.
Thesis: Thesis (M.S.)--University of Florida, 2001.
Bibliography: Includes bibliographical references (p. 50-53).
System Details: System requirements: World Wide Web browser and PDF reader.
System Details: Mode of access: World Wide Web.
Statement of Responsibility: by Ramasubramanian Ramani.
General Note: Title from first page of PDF file.
General Note: Document formatted into pages; contains x, 54 p.; also contains graphics.
General Note: Vita.
 Record Information
Bibliographic ID: UF00100860
Volume ID: VID00001
Source Institution: University of Florida
Holding Location: University of Florida
Rights Management: All rights reserved by the source institution and holding location.
Resource Identifier: oclc - 49053172
alephbibnum - 002763287
notis - ANP1308

Full Text










A TOOLKIT FOR MANAGING XML DATA WITH A
RELATIONAL DATABASE MANAGEMENT SYSTEM

















By

RAMASUBRAMANIAN RAMANI


A THESIS PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF SCIENCE

UNIVERSITY OF FLORIDA


2001




























Copyright 2001

by

RAMASUBRAMANIAN RAMANI





















To my parents, Yamuna and Ramani, who have given me the best values in life.















ACKNOWLEDGMENTS

This thesis is a result of the motivation and support provided by many individuals.

Firstly, I would like to thank Dr. Joachim Hammer who has always remained a constant

source of inspiration and technical expertise. His enthusiasm for the subject has been a

driving force, channeling my efforts. I am also thankful to Dr. Douglas Dankel and Dr.

Herman Lam, who kindly agreed to participate in my supervisory committee. It has been

a great honor to be a part of the IWiz development team and to work with my colleagues

Anna Teterovskaya, Amit Shah, Charnyote Pluempitiwiriyawej and Rajesh Kanna. I

would like to thank Sharon Grant and Mathew Belcher, who deserve a special mention

for their support and help in the lab. Finally, I would like to acknowledge the support

given by my family members, back in India.
















TABLE OF CONTENTS

page

ACKNOWLEDGMENTS .................................................................. iv

LIST OF FIGURES .................................................................. vii

ABSTRACT ......................................................................... ix

CHAPTERS

1 INTRODUCTION .................................................................... 1

1.1. Using XML to Represent Semistructured Data .................................. 1
1.2. Goals of This Research ...................................................... 2
1.2.1. Challenges ................................................................ 3
1.2.2. Contributions ............................................................. 3

2 RELATED RESEARCH ................................................................ 5

2.1. XML ......................................................................... 5
2.1.1. Basics .................................................................... 6
2.1.2. DTDs ...................................................................... 7
2.1.3. APIs for Processing XML Documents ......................................... 9
2.2. XML Query Languages ........................................................ 10
2.3. Data Warehousing ........................................................... 12
2.4. Mapping DTDs into Relational Schemas ....................................... 13
2.5. Data Loading and Maintenance ............................................... 14
2.6. XML Management Systems ..................................................... 15
2.6.1. Oracle XSU ............................................................... 15
2.6.2. GMD-IPSI XQL Engine ...................................................... 16
2.6.3. LORE ..................................................................... 17

3 THE IWIZ PROJECT ............................................................... 18

4 XML TOOLKIT: ARCHITECTURE AND IMPLEMENTATION ................................... 22

4.1. Managing XML Data in IWiz .................................................. 22
4.2. Rationale for Using an RDBMS as Our Storage Manager ........................ 23
4.3. Functional Specifications .................................................. 24
4.4. Architecture Overview ...................................................... 25
4.5. Schema Creator Engine (SCE) ................................................ 28
4.6. XML Data Loader Engine (DLE) ............................................... 32
4.7. Relational-to-XML-Engine (RXE) ............................................. 33
4.8. Database Connection Engine (DBCE) .......................................... 36

5 PERFORMANCE EVALUATION ......................................................... 37

5.1. Experimental Setup ......................................................... 37
5.2. Test Cases ................................................................. 39
5.3. Analysis of the Results .................................................... 42

6 CONCLUSIONS .................................................................... 46

6.1. Summary .................................................................... 46
6.2. Contributions .............................................................. 46
6.3. Future Work ................................................................ 48

LIST OF REFERENCES ............................................................... 50

BIOGRAPHICAL SKETCH .............................................................. 54
















LIST OF FIGURES


Figure Page

2.1: Example of an XML document ................................................... 6

2.2: A sample DTD representing bibliographic information ..........................

2.3: An XML Schema representing the bibliographic information in the sample DTD .. 9

2.4: Generic warehousing architecture ............................................ 12

3.1: IWiz Architecture ........................................................... 18

3.2: WHM Architecture ............................................................ 19

4.1: Proposed architecture of XML data management in IWiz ........................ 22

4.2: Built-time architecture of the XML toolkit .................................. 25

4.3: Run-time architecture of the XML toolkit .................................... 26

4.4: Input DTD to the Schema Creator Engine (SCE) ................................ 27

4.5: Joinable Keys file format ................................................... 29

4.6: Tables created by the SCE for the input DTD in Figure 4.4 ................... 29

4.7: System tables created by the SCE ............................................ 30

4.8: Pseudo code of the SCE ...................................................... 30

4.9: A sample XML document conforming to the input DTD in Figure 4.4 ............. 31

4.10: Contents of the tables after loading the sample XML document in Figure 4.9 . 31

4.11: Pseudo code of the loader .................................................. 33

4.12: SQL query to retrieve books and articles from the data warehouse ........... 34

4.13: XML document generated by the Relational-to-XML-Engine (RXE) ............... 34

4.14: Pseudo code of the RXE ..................................................... 35

5.1: DTD describing the structure of a TV programs guide ......................... 38

5.2: Tables created by the SCE for the TV programs guide DTD ..................... 38

5.3: An example XML document conforming to the TV programs guide DTD ............. 39

5.4: An XML-QL query to retrieve information about a particular TV program ....... 40

5.5: XML-QL processor output in the form of an XML document ...................... 41

5.6: Equivalent SQL query to retrieve information about a particular TV program .. 41

5.7: Output of the RXE in the form of an XML document ............................ 42















Abstract of Thesis Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Master of Science

A TOOLKIT FOR MANAGING XML DATA WITH A
RELATIONAL DATABASE MANAGEMENT SYSTEM


By

Ramasubramanian Ramani

August 2001

Chairman: Joachim Hammer
Major Department: Computer and Information Science and Engineering

This thesis presents the underlying research, design and implementation of our

XML Data Management Toolkit (XML toolkit), which provides the core functionality for

storing, querying, and managing XML data using a relational database management

system (RDBMS). The XML toolkit is an integral part of the Information Integration

Wizard (IWiz) system that is currently under development in the Database Research and

Development Center at the University of Florida. IWiz enables the querying of multiple

semistructured information sources through one integrated view, thereby removing

existing heterogeneities at the structural and semantic levels. IWiz uses a combined

mediation/data warehousing approach to retrieve and manage information from the data

sources which are represented as semistructured data in IWiz; the internal data model is

based on XML and the document object model (DOM). The XML toolkit is part of the

Data Warehouse Manager (WHM), which is responsible for caching the results of









frequently accessed queries in the IWiz warehouse for faster response and increased

efficiency.

IWiz has two major phases of operation: A built-time phase during which the

schema creator module of the XML toolkit creates the relational schema for the data

warehouse using the DTD description of the global IWiz schema as input. This is

followed by the run-time or query phase during which the warehouse accepts and

processes XML-QL queries against the underlying relational database. Note the XML-

QL to SQL conversion is part of another ongoing research project in the center. During

run-time, the Relational-to-XML-Engine component of the XML toolkit is used to

convert relational results from the warehouse into an equivalent XML document that has

the same structure as the global IWiz schema. The initial query may also be sent to the

mediator in case the contents of the data warehouse are not up-to-date. The loader

component of the XML toolkit is used to convert and store XML data from the sources

via the mediator into the underlying relational format during warehouse maintenance.

We have implemented a fully functional version of the XML toolkit, which uses

Oracle 8i as the underlying relational data warehouse engine. The XML toolkit is

integrated into the IWiz testbed and is currently undergoing extensive testing.














CHAPTER 1
INTRODUCTION


1.1. Using XML to Represent Semistructured Data

The Web is a vast data store for information and is growing at a fast rate. This

information can originate from a variety of sources, such as email, HTML files,

unstructured text as well as structured databases. These sources make the Web a dynamic

and heterogeneous environment, in which interpretation of information is difficult and

error prone [1]. Much research has been undertaken to provide an integrated view of the

Web by using a computerized approach. However the identification, querying and

merging of data from heterogeneous sources is difficult.

A considerable amount of information available on the Web today is

semistructured [2]. Semistructured data can be defined as data that has structure that may

be irregular and incomplete and need not conform to a fixed schema. There has been a lot

of research in the past in developing data models, query languages and systems to

manage semistructured data. One such model is the Object Exchange Model (OEM) that

was explicitly defined to represent semistructured data in heterogeneous systems in the

Tsimmis system [3]. A variant of this data model has been used in the development of

Lore [4]. The recent emergence of the Extensible Markup Language (XML) from the

World Wide Web Consortium [5] has kindled a lot of interest in using it to model

semistructured data [6-7]. XML is well suited to model semistructured data because it

makes no restrictions on the tags and relationships used to represent the data. XML also









provides advanced features to model constraints on the data, using an XML schema or a

Document Type Definition (DTD). However, XML does have some differences with the

other semistructured data models: (1) XML has ordered collections while semistructured

data are unordered, (2) Attributes in XML can be unordered and (3) XML allows usage

of references to associate unique identifiers for elements; this is absent in most other data

models. Despite these differences, XML is a popular data model to represent

semistructured data, mainly due to the close relationship to HTML as well as the

emergence of standards and tools for creating and viewing XML. However, to the best of

our knowledge not much progress has been made in the development of techniques and

tools for storing and managing XML for rapid querying.


1.2. Goals of This Research

The goal of the thesis is to analyze the problems of XML data management and

implement a toolkit that can be used to provide a persistent storage, retrieval and query

component for XML data. We have developed such a toolkit as part of the Warehouse

Manager (WHM) component in the IWiz prototype system in the Database Research and

Development Center, University of Florida [8].

We rephrase the overall problem statement for this thesis as follows: Given the

need to manage semistructured data in general and XML data in particular, we need a

system for managing this data efficiently. There are a wide variety of management

systems, ranging from native XML databases to XML-enabled databases. Among the

alternatives, we found it very compelling to choose the relational DBMS because of its

widespread popularity, robustness and performance. Since relational databases are

already used to store information for most web sites and since XML is becoming the









standard to represent this information, it is of the utmost importance that these two

technologies be integrated [9]. So, in our system we have an underlying relational

database for storing XML data and an interface to transform XML data to relational and

vice-versa. Several major database vendors like Oracle are working on tools for

managing XML data. We have summarized the limitations of these products in the

related research section.

1.2.1. Challenges

To address the problem raised above, we have identified the following three

challenges. (1) Automatic creation of the underlying relational schema based on the

schema for the XML data that must be managed. This problem is further complicated

when using DTDs to specify the structure of XML data; DTDs provide only a loose

description of the structure of an XML document and do not contain any type

information. (2) The loading of a single XML document into an equivalent relational

schema may trigger the insertion of tuples into several tables. (3) Creation of a well-

structured XML document with nested tags requires additional input and processing [10].

Existing methods for converting relational results into equivalent XML documents use

simple techniques whereby the resulting document has tags derived from the metadata

and values from the relational results. XML is a constantly evolving data model. Thus the

solution to XML data management is not permanent and needs to be enhanced with the

progress made in related fields like new query languages, more persistent storage options

and new grammar definitions like XML Schema.

1.2.2. Contributions

Upon the conclusion of this research we will have contributed to the state-of-the-

art in XML data management in several important ways. (1) Automatic schema









generation: XML uses hierarchical representation of data. This native nesting in XML

has to be translated to the relational schema that is flat in structure. The schema created

has to preserve the relationships expressed in XML and map them to relational

constraints. (2) Loading of XML data into a relational data warehouse: The loading

operation will have to adhere to the constraints in the relational schema. The data in the

XML document could contain extraneous characters like quotation marks that need to be

removed before loading into the relational tables. (3) Automatic creation of nested XML

documents: A structured XML document has to be recreated from the relational data

obtained as a result of a SQL query. To achieve nesting in the created XML document

would involve additional processing.
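As a rough illustration of the kind of processing contribution (3) refers to, the following sketch groups flat relational rows into nested XML using the standard DOM and JDBC APIs. The table names (BOOK, AUTHOR), column names and element names are hypothetical and only show the grouping step; the actual algorithm used by the toolkit is described in Chapter 4.

import javax.xml.parsers.DocumentBuilderFactory;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class NestingSketch {
    // Groups flat (title, lastname) join rows into one <Book> element per title,
    // with one nested <Author> child per row -- the extra processing a plain
    // row-by-row translation of the result set would not perform.
    public static Document booksWithAuthors(Connection conn) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element root = doc.createElement("Bib");
        doc.appendChild(root);

        String sql = "SELECT b.title, a.lastname FROM BOOK b, AUTHOR a "
                   + "WHERE b.id = a.book_id ORDER BY b.title";   // hypothetical tables
        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery(sql);
        Element book = null;
        String currentTitle = null;
        while (rs.next()) {
            String title = rs.getString(1);
            if (!title.equals(currentTitle)) {          // new title: start a new <Book>
                book = doc.createElement("Book");
                Element t = doc.createElement("Title");
                t.appendChild(doc.createTextNode(title));
                book.appendChild(t);
                root.appendChild(book);
                currentTitle = title;
            }
            Element author = doc.createElement("Author");
            author.appendChild(doc.createTextNode(rs.getString(2)));
            book.appendChild(author);                   // repeated authors nest under the book
        }
        rs.close();
        stmt.close();
        return doc;
    }
}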

The rest of the thesis is organized as follows. Chapter 2 provides an overview of

XML and related technologies. Chapter 3 describes the IWiz architecture and in

particular the warehouse manager component. Chapter 4 concentrates on our

implementation of the XML toolkit and its integration in the IWiz system. Chapter 5

performs an analysis of the implementation, and Chapter 6 concludes the thesis with the

summary of our accomplishments and issues to be considered in future releases.














CHAPTER 2
RELATED RESEARCH


2.1. XML

Among the various representations to model semistructured data, XML has clearly

emerged as the frontrunner. XML started as a language to represent hierarchical

semantics of text data, but is now enriched with extensive APIs, tools such as parsers, and

presentation mechanisms, making it into an ideal data model for semistructured data.

XML consists of a set of tags and declarations, but rather than being concerned with

formatting information like HTML, it focuses on the data and its relations to other data.

Some important features of XML that are making it popular are the following [11]:

* XML is a plain ASCII text file making it platform independent.

* XML is self-describing: Each data element has a descriptive tag. Using these tags,
the document structure can be extracted without knowledge of the domain or a
document description.

* XML is extensible by allowing the creation of new tags. This supports new
customized applications such as MathML, ChemicalML, etc.

* XML can represent relationships between concepts and maintain them in a
hierarchical fashion.

* XML allows recursive definitions, as well as multiple occurrences of an element.

* The structure of an XML document can be described using DTD or XML schema.






























Figure 2.1: Example of an XML document.



2.1.1. Basics

The Extensible Markup Language (XML) is a subset of SGML [12]. XML is a

markup language. Markup tags can convey semantics of the data included between the

tags, special processing instructions for applications and references to other data elements

either internal or external; nested markup, in the form of tags, describes the structure of

an XML document.

The XML document in Figure 2.1 illustrates a set of bibliographic information

consisting of books and articles, each with its own specific structure. Tags define the

semantic information and the data is enclosed between them. For example, in Figure 2.1,

<year> represents the tag information and "2000" denotes the data value.
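To make the relationship between tags and data values concrete, the short sketch below builds a fragment of this kind programmatically with the standard DOM API (introduced in Section 2.1.3). The element and attribute names are taken from the DTD in Figure 2.2; the data values are invented, and the fragment omits the required author information, so it only illustrates the element/attribute/text distinction rather than a complete valid document.

import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class BibliographyFragmentSketch {
    public static Document buildFragment() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();

        Element bibliography = doc.createElement("bibliography");    // root element
        doc.appendChild(bibliography);

        Element article = doc.createElement("article");
        article.setAttribute("type", "XML");      // attribute: a name-value pair on the element
        bibliography.appendChild(article);

        Element title = doc.createElement("title");                  // the tag carries the semantics ...
        title.appendChild(doc.createTextNode("Managing XML Data"));  // ... the text node carries the value
        Element year = doc.createElement("year");
        year.appendChild(doc.createTextNode("2000"));
        article.appendChild(title);
        article.appendChild(year);
        return doc;
    }
}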

The fundamental structure composing an XML document is the element. A

document has a root element that can contain other elements. Elements can contain

character data and auxiliary structures or they can be empty. All XML data must be











contained within elements. Examples of elements in Figure 2.1 are <bibliography>, <book>,


and <lastname>. Attributes can be used to represent simple information about elements; they are name-value pairs attached to an element. Attributes are often used to store the element's metadata. Attributes cannot be nested; they can only be simple character strings. The element <article> in our example has an attribute "type" with an associated data value "XML."

2.1.2. DTDs

To specify the structure and permissible values in XML documents, a Document Type Definition (DTD) is used. Thus the DTD in XML is very similar to a schema in a relational database. It describes a formal grammar for the XML document. Elements are defined using the <!ELEMENT> tag, attributes are defined using the <!ATTLIST> tag.

<?xml version="1.0"?>
<!DOCTYPE bibliography [
<!ELEMENT bibliography (book | article)*>
<!ELEMENT book (title, author+, editor?, publisher?, year)>
<!ELEMENT article (author+, title, year, (shortversion | longversion)?)>
<!ATTLIST article type CDATA #REQUIRED
                  month CDATA #IMPLIED>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (firstname?, lastname)>
<!ELEMENT editor (#PCDATA)>
<!ELEMENT publisher (name, address?)>
<!ELEMENT year (#PCDATA)>
<!ELEMENT firstname (#PCDATA)>
<!ELEMENT lastname (#PCDATA)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT address (#PCDATA)>
<!ELEMENT shortversion (#PCDATA)>
<!ELEMENT longversion (#PCDATA)>
]>

Figure 2.2: A sample DTD representing bibliographic information

When a well-formed XML document conforms to a DTD, the document is called valid with respect to that DTD. Figure 2.2 presents a DTD that can be used to validate the XML document in Figure 2.1.

The DTD can also be used to specify the cardinality of the elements. The following explicit cardinality operators are available: "?" stands for "zero-or-one," "*" for "zero-or-more" and "+" for "one-or-more." The default cardinality of one is assumed when none of these operators are used. The operator "|" between elements is used to denote the appearance of one of the elements in the document. In our example, a book can contain one or more author child elements, must have a child element named title, and the publisher information can be missing. Order is an important consideration in XML documents; the child elements in the document must be present in the order specified in the DTD for this document. For example, a book element with a year child element as the first child will not be considered part of a valid XML document conforming to the DTD in Figure 2.2.

The entire DTD structure can be placed in the beginning of the associated XML document or in a separate location, in which case the document contains only a <!DOCTYPE> tag followed by the root element name and the location of the DTD file in the form of a URI.
Separation of a schema and data permits multiple XML documents to refer to the same DTD.

At the moment of writing, a DTD is the only officially approved mechanism to express and restrict the structure of XML documents. There are obvious drawbacks to DTDs. Their syntax is different from the XML syntax (this is one reason why most parsers do not provide programmatic access to the DTD structure). In addition, DTDs do not provide any inherent support for datatypes or inheritance. Finally, the format of cardinality declarations permits only coarse-grained specifications.

Figure 2.3: An XML Schema representing the bibliographic information in the sample DTD.

W3C has recognized these existing problems with DTDs and has been working on a new specification called XML Schema since 1999 [13-14]. In March 2001, XML Schema was advanced to proposed recommendation status. Eventually, this new data definition mechanism will have features like strong typing and support for data types. Proposed data types include types currently present in XML 1.0 and additional data types such as boolean, float, double, integer, URI and date types. In future systems, XML Schema will provide a better integration of XML and existing persistent storage data models.

2.1.3. APIs for Processing XML Documents

The two alternative ways to access the contents of an XML document from a program are the tree-based approach and the event-based approach. In the tree-based approach, an internal tree structure is created that contains the entire XML document in memory. An application program can then freely manipulate any part of the document. In the case of the event-based approach, an XML document is scanned, and the programmer is notified about any significant events, such as the start or end of a particular tag, that are encountered
during scanning. The realizations of these approaches that have gained widespread popularity are the Document Object Model (implementing the tree-based model) and the Simple API for XML (in case of the event-based model).

The Document Object Model (DOM) specifications are produced by the W3C, like most of the XML-related technologies. The DOM Level 1 Recommendation dates back to October 1, 1998 [15]. The W3C has also come up with a Level 2 Recommendation for the DOM model [16]. DOM is a language- and platform-neutral definition and specifies the APIs for the objects participating in the tree model.

The Simple API for XML (SAX) represents a different approach to parsing XML documents. A SAX parser does not create a data structure for the parsed XML file. Instead, a SAX parser gives the programmer the freedom to interpret the information from the parser as it becomes available. The parser notifies the program when a document starts and ends, when an element starts and ends, and when a text portion of a document starts. The programmer is free to build his/her own data structure for the information encountered or to process the information in some other way.

As we have seen, both approaches have their own benefits and drawbacks. The decision to use one or the other should be based on a thorough assessment of application and system requirements.

2.2. XML Query Languages

The W3 consortium is in the process of standardizing a query language for XML based on the XML query algebra. From the semistructured community, three languages have emerged for querying XML data: XML-QL [17], YATL [18] and Lorel [19]. The document processing community has developed XQL [20], which is more suitable for querying documents and searching for text. For the IWiz system, we use an implementation of XML-QL by AT&T Labs. The following section discusses the syntax and features provided by the XML-QL language.

XML-QL has several notable features [21]. It can extract data from existing XML documents and construct new documents. XML-QL is "relationally complete"; i.e., it can express joins. Also, database techniques for query optimization, cost estimation and query rewriting could be extended to XML-QL. Transformation of data from one DTD to a different DTD can be easily achieved. Finally, it can be used for integration of multiple XML data sources.

In XML-QL, all the conditions are specified using a <WHERE> clause and the format of the resulting document is obtained from the <CONSTRUCT> clause. The structure specified in the <WHERE> clause must conform to the structure of the XML document that is queried. Tag elements are bound using the "$" symbol to distinguish them from string literals and can be used in the <CONSTRUCT> clause or in conditional filters. Join conditions can be specified implicitly or explicitly. New tags can be created in the resulting document by using them in the <CONSTRUCT> clause. XML-QL uses element patterns to match data in an XML document, using the structure in the <WHERE> clause. There is a considerable amount of similarity between XML-QL and other query languages. In particular, considering SQL, one can notice that the "WHERE" clause specifying the condition in SQL has the same functionality as the <WHERE> clause in XML-QL. Just like "AS" can be used to rename results in SQL, the <CONSTRUCT> clause can be used to create new tags and rename results. The XML document specified using the "IN" clause in XML-QL is like the set of tables represented using the "FROM" clause in SQL.

2.3. Data Warehousing

Figure 2.4: Generic warehousing architecture

Another technology related to this research is data warehousing. A data warehouse is a repository of integrated information from distributed, autonomous and possibly heterogeneous sources.
In the case of data warehousing systems [22-23], the warehouse manager loads and maintains the data warehouse, which is usually a relational database, in advance using the metadata repository. Figure 2.4 shows the generic architecture of such a system. This is also commonly referred to as the eager or in-advance approach to data integration. Each source has a data extractor wrapped around it. This data warehouse is then queried and results are returned to the user. It represents a large volume of data that is stored in a single repository. The data warehouse can be optimized for storage depending on the transactions. Usually most of the loading into the data warehouse will involve appending of new information and fewer updates. Thus the data warehouse serves as a cache with high query performance. The toolkit uses the data warehousing approach. Some of the key issues in this approach are the schema creation of the data warehouse, data loading and maintenance.

2.4. Mapping DTDs into Relational Schemas

The input queries and maintenance of data influence the schema for the data warehouse. The schema creation plays a big role in the efficiency of the data warehouse. The stored data can then be mined for information. Some of the algorithms for schema creation are as follows:

Edge Approach [24]: The XML document is viewed as a graph with no distinction between attributes and subelements. A table "Edge" is created with the schema (source, ordinal, name, flag, target). This table stores the tag information of the XML document. A separate "Value" table is created that has the schema (target, value) to store the data contained in the XML document. This method proposes a simple scheme for translating an XML document to a relational table but is inefficient due to the redundancy in the schema creation.

Basic-Inlining [25]: Every element in the DTD is mapped to a relational table, and each such table inlines as many of the element's descendants as possible. In such a scheme, a particular element may be present in several tables. While loading a data element from an XML document, several tables may have to be loaded with the data value. Also, in this schema creation scheme, a simple query would require several join conditions. Due to these inefficiencies, this approach is not suitable.

Shared-Inlining [25]: This approach tries to solve the problems in the "Basic-Inlining" approach by sharing relational tables. The principal idea in this method is to create a DTD graph and create separate relation tables for nodes that have an in-degree either equal to zero or greater than one. Elements involved in a one-to-many relationship, which can be recognized by the presence of a "*" or a "+", are also mapped to a separate table. Thus in this scheme, a particular data item will be loaded only into a single relational table.
But this scheme may not be appropriate when you consider data maintenance, for the following reason. Since data elements have been inlined, when maintenance queries are generated, there is an overlap of concepts from the XML domain. Thus a table representing book information could possibly have a field for the author's name. Hence, this approach is not used.

Hybrid-Inlining [25]: This method is a slight modification of the "Shared-Inlining" approach. In this scheme, elements with degree greater than one are also inlined as relational attributes in the table created for the parent. This does reduce the join conditions but has maintenance problems similar to those of the "Shared-Inlining" scheme.

Our approach uses inlining of child nodes that do not have children or attributes, without considering the degree of the node. As relational tables contain only related fields, such a relational mapping provides more clarity to the system.

All the constraints expressed explicitly and derivable have to be translated to the data warehouse schema [26]. The data warehouse schema creation algorithm can have additional features to incorporate incremental changes to the schema of the underlying sources.

2.5. Data Loading and Maintenance

An important concern in using the warehousing approach is dealing with updates. The warehouse data has to be refreshed so that it can be consistent with the sources. Some simple techniques propose that the system go off-line so that the entire warehouse can be refreshed with new data. This is obviously very inefficient for large data warehouses. There are other algorithms proposed to detect changes in the sources and escalate them to the integrator, which reflects them in the data warehouse [27-28]. Some of the loading schemes are as follows:

In the load-append strategy, the input data is loaded into the various relational tables without checking if the same data actually exists in the data warehouse. This is a simple technique, but redundant tuples can be created. This is just like the "insert" operation in relational databases.

The load-merge strategy involves merging the input data with the existing data in the warehouse, minimizing the redundancy. This operation is comparable to the "update" operation in relational databases.

In the load-erase strategy, the content of the data warehouse is removed and then loaded with the incoming data. Thus the older content is totally removed and fresh data is loaded. Such an operation could be useless if the incoming data set is much smaller than the contents of the data warehouse. The operation is analogous to a "delete" operation followed by an "insert" operation.

2.6. XML Management Systems

There are several commercial products available to manipulate XML, such as XML parsers, XML editors and other tools.
We will briefly highlight the features of a few commercially available and research-oriented XML management systems, thus laying a foundation for the set of functions we want to provide in our toolkit.

2.6.1. Oracle XSU

The Oracle XML-SQL Utility (XSU) is an XML application that can be used for XML content and data management. The underlying persistent storage could be an object-relational or a relational database like Oracle 8i. XML data is stored as LOBs (large unstructured objects) in relational tables and XML documents are stored as CLOBs (Character Large Objects). XSU can also be accessed from a servlet. Some of the features are as follows [29]:

* Oracle XSU can generate an XML document from SQL results.

* It can store XML data in the database via inserts, updates and deletes.

* There are three different interfaces provided to access XSU: a command line front end, a Java API and a PL/SQL API.

However, there are a few shortcomings that need to be addressed. The database schema has to be defined manually. The data loading assumes that the elements and attributes in the document are columns in a single table. To load multiple tables, the input document has to be translated into several documents, using XSL or any other language, and individually loaded into the various tables. Also, the data in the document has to be stored in child nodes and not as attributes. In the XML document generated, tags are not automatically nested.

2.6.2. GMD-IPSI XQL Engine

This is a Java-based storage and query application that uses two major technologies: a persistent implementation of the DOM objects and the XQL language [30]. Some of the features are as follows:

* XML documents are parsed once and stored as a persistent DOM (PDOM).

* The implementation can swap DOM nodes to disk while handling large DOM trees, and hence main memory is not a limit to file size.

* There are built-in caching and garbage collection mechanisms.

* Multi-threaded access to the PDOM file is allowed.

However, the implementation has limitations. A PDOM structure is created for each incoming XML document. The XML document to be queried has to be specified. An update operation would increase the PDOM file size and requires a de-fragmentation operation to be initiated. Similarly, a delete operation creates wasted space in the file, which has to be reclaimed using the garbage collection operation.

2.6.3. LORE

LORE is a DBMS for managing semistructured data developed at Stanford University. It was initially developed for the OEM data model to manage semistructured data but later migrated to XML. A few of its notable features are as follows: A query language, LOREL, with a cost-based optimizer is used.
The prototype is complete with indexing techniques, multi-user support, logging and recovery. Dataguides, a structural summary of all database paths, is generated, thus allowing free-form input data.

As mentioned on the web site, LORE needs some more development in the areas of storage schemes and comparison operators. The LORE system currently does not use DTDs and does not encode sub-element ordering.

Two of the other prominent management systems that are available are Microsoft SQL Server [31] and the Strudel Web site management system [32].

CHAPTER 3
THE IWIZ PROJECT

The XML toolkit is an integral part of the IWiz system. IWiz has the following main modules, as shown in Figure 3.1: A Query Browsing Interface that serves as the user interface. The Warehouse Manager, as shown in the top tier of the architecture, maintains the data warehouse. A Mediator rewrites the user query to source-specific terms and resolves conflicts in the returned results. Each source is connected to IWiz via a wrapper (DRE) for querying the data and restructuring the results. The interactions between the various modules are as shown in Figure 3.1. A brief description of the modules in the IWiz prototype with their functionality and inputs is provided below.

Figure 3.1: IWiz Architecture

The Query Browsing Interface (QBI) presents an integrated view of the IWiz global schema. The QBI is used to generate a user query, which is then sent to the warehouse manager component.

The Warehouse Manager (WHM) component, as shown in Figure 3.2, maintains the IWiz Repository, which is an Oracle 8 database. The WHM has two major phases of operation: a built-time phase during which the schema creator module of the XML toolkit creates the script file to create the relational schema for the data warehouse using the DTD description of the global IWiz schema as input. The Database Connection Engine (DBCE) executes the script file to complete the schema generation process.
This is followed by the run-time or query phase during which the WHM accepts and processes XML-QL queries against the underlying relational database; the same query may also be sent to the mediator in case the contents of the data warehouse are not up-to-date.

Figure 3.2: WHM Architecture

The WHM server also provides the Ontology contents to the Mediator and the Wrapper. When the XML-QL query is provided to the WHM, the query is checked to see whether it can be satisfied from the contents of the data warehouse. The XML-QL is then translated to SQL and executed against the warehouse by the DBCE. The XML-QL to SQL conversion is part of another ongoing research project in the Database Research and Development Center at the University of Florida. The Relational-to-XML-Engine component of the toolkit could be used to translate the relational result set to an XML document. In case the user query cannot be satisfied from the contents of the data warehouse, the XML-QL query and a maintenance query are sent to the Mediator component. After the Data Merge Engine returns the resulting document generated by merging the information from the various sources, the WHM presents it to the QBI, which displays it to the user. The XML Loader (DLE) is invoked if the merged document is the result of a maintenance query. The DLE parses the document and generates the insert commands. The DBCE is invoked and the data is loaded into the warehouse. All the interactions with the data warehouse are interfaced through the DBCE.

The Mediator component has two modules: the Query Restructuring Engine (QRE) and the Data Merge Engine (DME). During the built-time phase, the QRE gets the knowledge about the data existing in the various data sources. The QRE, using this information, then splits the input query into source-specific queries in terms of the global schema terms, appending the source names to the query id. It also generates the query plan that is used by the DME to merge the data from the sources. The DME merges the results from the various sources, removing duplicates, and transforms them into a single XML document. The DME returns the merged document to the WHM.

Each source in the system has a Data Restructuring Engine (DRE) wrapped around it. The DRE is responsible for translating the input query in terms of the global schema terms to the source-specific terms and converting the source data returned from the sources to the global schema terms.
The DRE returns the results from each source to the DME.

CHAPTER 4
XML TOOLKIT: ARCHITECTURE AND IMPLEMENTATION

This chapter discusses the relational approach, its advantages and disadvantages, and the architecture and implementation details of the XML toolkit. It describes the algorithms implemented in the various modules of the toolkit. The toolkit is implemented using Java (SDK 1.3) from Sun Microsystems. Some of the other software tools and packages used in the implementation are the XML Parser from Oracle, version 2.0.2.9, Oracle 8i and the Oracle JDBC driver, version 2.

4.1. Managing XML Data in IWiz

The popularity of XML as a new standard for data representation and exchange on the Web necessitates the development of an XML management system. XML management systems can be broadly classified as XML document management systems and XML data management systems [33]. In the former case, the structure is very irregular and it is usually difficult for a machine to interpret the data. Some of the examples could be advertisements and HTML documents. Systems to handle XML documents are also known as content management systems [34]. The focus of this thesis, however, is on XML data management. In this data-centric approach, XML is used as the representation format. Many of the documents created in real-world applications, such as flight schedules and sales orders, are examples of this classification. The XML data management system should be able to provide a persistent store for XML data using a relational, object or object-relational database. The system should be able to provide an interface to transfer data between the database and XML. All the data can be stored in a single centralized repository, thus having controlled redundancy. The data can be queried and a merged view can be created.

Figure 4.1: Proposed architecture of XML data management in IWiz.

The conceptual architecture for our proposed XML data management system is shown in Figure 4.1. If a global schema exists, it can be used to create the schema for the persistent store. An additional feature of such a system will be to automate schema creation for the persistent storage. There is usually a user interface that is used to either load an XML document into the data store or query the stored data. The loader component stores the data contained in the incoming document in the underlying data store. The data maintenance is generally built as part of the loader.
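As a rough sketch of what such a loader involves, an incoming document can be parsed into a DOM tree and each element of interest turned into an INSERT statement. The element-to-table correspondence below (a <Book> element loaded into a BOOK(title, year) table) is invented for illustration and is far simpler than the actual loader described in Section 4.6; bind variables are used so that embedded quotation marks in the data do not break the generated SQL.

import javax.xml.parsers.DocumentBuilderFactory;
import java.io.File;
import java.sql.Connection;
import java.sql.PreparedStatement;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class LoaderSketch {
    // Loads every <Book> element of an incoming document into a hypothetical
    // BOOK(title, year) table, one tuple per element.
    public static void loadBooks(File xmlFile, Connection conn) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(xmlFile);
        NodeList books = doc.getElementsByTagName("Book");

        PreparedStatement ps =
            conn.prepareStatement("INSERT INTO BOOK (title, year) VALUES (?, ?)");
        for (int i = 0; i < books.getLength(); i++) {
            Element book = (Element) books.item(i);
            ps.setString(1, textOf(book, "Title"));
            ps.setString(2, textOf(book, "Year"));
            ps.executeUpdate();
        }
        ps.close();
    }

    // Returns the text content of the first child element with the given tag name.
    private static String textOf(Element parent, String tag) {
        NodeList list = parent.getElementsByTagName(tag);
        if (list.getLength() == 0) return null;
        Node text = list.item(0).getFirstChild();
        return (text == null) ? null : text.getNodeValue();
    }
}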
The XML document generator module accepts the user query, transforms it to the language of the underlying data store, executes it, and then formats the relational results as an XML document and presents it to the user.

4.2. Rationale for Using an RDBMS as Our Storage Manager

In this toolkit, the persistent storage is achieved using the relational Oracle 8i database. While there are systems that use other techniques, such as semistructured data stores, for example in Lore, the question of which of them is the best approach remains. The downside of using these other techniques is that they turn their back on several years of work invested in relational database technology. When semistructured data becomes more widely popular and is machine processed, the management systems will require efficient query processing and storage features. Relational systems are currently the best in providing these features. Specifically, they have the following advantages:

* Centralized merged data: The data warehouse is a single repository containing the data from several loaded documents and is not specific to any input XML document.

* Scalability: With an increase in contents, an RDBMS provides one of the best scaling behaviors.

* Standard query languages: Worldwide-accepted query languages with efficient querying capabilities can be used.

* Concurrency control, data recovery and management of secondary storage features.

On the other hand, using a relational database management system has the following drawbacks:

* In order to satisfy a query, the join sequences could involve an n-way join, leading to inefficient execution.

* An RDBMS requires a rigid schema definition.

4.3. Functional Specifications

Next we proceed to derive the set of functional specifications that the toolkit must satisfy. Firstly, the toolkit must assist in the relational schema creation. Ideally this process must be automated. The schema creation should be derived from a global schema that binds the incoming XML documents. For example, a DTD can be used to describe the global schema definition. The relational schema created must include the cardinality constraints specified in the DTD. There should be an interface to accept an XML document as input and load the data contained in it into the various relational tables. Similarly, the toolkit should have the capability to execute queries and wrap the results as XML documents. Nesting of tags and creating XML documents conforming to the global schema would provide additional features.

Figure 4.2: Built-time architecture of the XML toolkit

4.4. Architecture Overview

From the set of functional specifications above we can derive the architecture of the XML toolkit, which is divided into a built-time and a run-time phase.
The built-time phase performs the preliminary steps of setting up the schema and the server so that the WHM can accept and process user queries during the run-time phase.

Figure 4.3: Run-time architecture of the XML toolkit

During the built-time phase of the WHM, as shown in Figure 4.2, the Schema creator engine (SCE) and the database connection engine (DBCE) of the toolkit are used. The input DTD file representing the IWiz global schema is parsed and the script file to generate the relational tables is created. The script file is then passed on to the DBCE, which executes it, and the relational schema is set up. The Joinable Keys file, which is the other input to this module, is stored in a separate system table. The WHM is implemented as a server that can accept XML-QL user queries, translate them to SQL, run them against the warehouse and return the results as an XML document to the user. The WHM also serves as an Ontology server that provides the Ontology as a DOM object when invoked by the Mediator and the Wrapper. After parsing the DTD, the Ontology server notifies the Wrapper and the Mediator. After being notified, the wrapper and mediator request the ontology DOM object to begin their built-time phases.

In the run-time phase, as shown in Figure 4.3, the Relational-to-XML-engine (RXE), the XML data loader engine (DLE) and the DBCE components of the toolkit are used. The decision engine module of the WHM analyzes the input XML-QL user query. If the input query can be satisfied in the warehouse, the user query is translated into SQL by the XML-QL to SQL module of the WHM. The SQL query generated is then executed in the data warehouse using the DBCE of the toolkit. The relational results are then converted to an XML document object by the RXE and returned to the QBI interface. If the query cannot be satisfied in the data warehouse, due to the absence or staleness of data, then a maintenance query is generated and both the user query and the maintenance query are sent to the mediator. The maintenance query generates results that are used to load the data warehouse so that, in the future, similar queries can be satisfied directly from the warehouse.
The DLE component of the toolkit is used to load the data. The XML document generated for the user query is returned to the QBI interface, which presents it to the user.

<!ELEMENT Bib (Book,Article)>
<!ELEMENT Book (Author+,Title,Year,Editor*,ISBN)>
<!ELEMENT Article (Author*,Title,Year,Editor?)>
<!ELEMENT Author (firstname?,lastname,address)>
<!ELEMENT Editor (lastname)>
<!ELEMENT Title (#PCDATA)>
<!ELEMENT ISBN (#PCDATA)>
<!ELEMENT Year (#PCDATA)>
<!ELEMENT firstname (#PCDATA)>
<!ELEMENT lastname (#PCDATA)>
<!ELEMENT address (#PCDATA)>

Figure 4.4: Input DTD to the Schema creator engine (SCE)

4.5. Schema Creator Engine (SCE)

The inputs to this module are the DTD file and the Joinable Keys file. The algorithm for schema creation views the DTD as a graph in which nodes are distinguished based on the path from the parent node and not based on the tag names. For example, the two paths of the "Year" element in Figure 4.4, "Bib/Article/Year" and "Bib/Book/Year", are different. The DTD file is parsed by the Oracle XML parser and an n-ary tree is created. The module traverses this tree beginning at the root node, identifying the various 'CONCEPT' nodes. An element in the DTD is a 'CONCEPT' if it satisfies one or more of the following conditions:

* It has one or more attributes, or

* It has one or more children, or

* It is involved in a one-to-many relationship with some other element in the DTD, which can be inferred from the presence of the cardinality operators "*" or "+" following this node when it appears as a child node.

Every 'CONCEPT' in the DTD is mapped to a relational table, and all child elements that are not 'CONCEPT' nodes (i.e., leaves) are mapped to relational attributes in that table. A hash table containing type information about each element and attribute is created while processing the DTD. The type can be a 'CONCEPT', an 'Inlined-Attribute' or an 'Inlined-Child'. This information is also persistently stored in a system table, which is later retrieved and used by the loader module. The script to create a relational table with the same name as the 'CONCEPT' node is generated and appended to a global script file. Every table created has a primary key. The name of every table created is stored in a hash table, and a table is created only if it is absent from this hash table. This process is recursively performed on all the children of the element, passing the parent-element name. All the parent-child relationships between the elements in the DTD are mapped as a general m:n relation and a separate table is created to store this information. These relation tables contain two fields that reference the corresponding fields in the parent and child tables. The naming format of such a table is "<Parent-element>_<Child-element>". The foreign key references are appended to the global script file. This script file is finally executed, using the DBCE, creating the various tables and the foreign key constraints.
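As an illustration of this mapping, the generated script file for the DTD in Figure 4.4 might contain statements of the following form for the Book concept and its parent-child relation table. This is only a sketch: the column types, sizes, key names and constraint syntax shown here are assumptions made for the example and are not taken verbatim from the toolkit's output.

-- Sketch only: one 'CONCEPT' table and one parent-child relation table
CREATE TABLE Book (
    Book_PK_Id NUMBER PRIMARY KEY,
    Title      VARCHAR2(100),
    Year       VARCHAR2(100),
    ISBN       VARCHAR2(100)
);

CREATE TABLE Bib_Book (
    Bib_FK_Id  NUMBER,
    Book_FK_Id NUMBER
);

ALTER TABLE Bib_Book ADD FOREIGN KEY (Book_FK_Id) REFERENCES Book (Book_PK_Id);
ALTER TABLE Bib_Book ADD FOREIGN KEY (Bib_FK_Id) REFERENCES Bib (Bib_PK_Id);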
The other input to this module is the Joinable Keys file. This file is used by the QRE module of the mediator to detect join sequences for the various elements in the global Ontology DTD. The format of this file is as shown in Figure 4.5; it contains the path, rank and joinable key for the 'CONCEPT' elements in the DTD.

Bib.Book \t 1 \t ISBN
Bib.Book \t 2 \t Title
Bib.Article \t 1 \t Title

Figure 4.5: Joinable Keys file format

Bib (Bib_PK_Id)
Book (Book_PK_Id, Title, Year, ISBN)
Article (Article_PK_Id, Title, Year)
Author (Author_PK_Id, firstname, lastname, address)
Editor (Editor_PK_Id, lastname)
Bib_Book (Book_FK_Id, Bib_FK_Id)
Bib_Article (Article_FK_Id, Bib_FK_Id)
Book_Author (Author_FK_Id, Book_FK_Id)
Book_Editor (Editor_FK_Id, Book_FK_Id)
Article_Author (Author_FK_Id, Article_FK_Id)
Article_Editor (Editor_FK_Id, Article_FK_Id)

Figure 4.6: Tables created by the SCE for the input DTD in Figure 4.4.

Tag_Info (Tag Name, Parent Tag, Type)
Foreign_Key_Info (Tag Name, Source Field Name, Referenced Table, Referenced Field Name)
Timestamp_Info (Tag Name, Modified Timestamp)
Joinable_Keys_Info (Tag Name, Rank, Join Attribute)
Primary_Keys_Info (Table Name, Primary Key Value)
Ontology_Contents (DTD contents)

Figure 4.7: System tables created by the SCE.

Figure 4.6 gives the set of tables created for the DTD in Figure 4.4. The system tables, as shown in Figure 4.7, are also created by the SCE. The Ontology file contents are stored in the Ontology_Contents system table; they are later retrieved, parsed into a DTD object and provided to the QRE and the WRP modules. The Joinable Keys file is stored in the Joinable_Keys_Info system table.
The "Primary Keys Info" and the "Tag_Info" are<br /> <br /> <br /> used by the loader module.<br /> <br /> <br /> Figure 4.8: Pseudo code of the SCE<br /> <br /> <br /> 1 The mput DTD file is parsed and a DTD DOM object is created<br /> 2 Invoke method makeRelations (Root,"-") passing the root of the DTD and '-' to denote that it has no<br /> parent<br /> 3 A node is a CONCEPT if one of the following conditions holds<br /> 3 1 It has 1 or more attributes<br /> 3 2 It has 1 or more children nodes<br /> 3 3 It has a '*' or a '+' when it appears as a child of some other node<br /> 4 Method makeRelations(Node currentnode,Strng Parent)<br /> 4 1 Check if the current node in the tree is a CONCEPT<br /> 4 1 1 Check if table is already created for the current node m the hash table<br /> 41 1 1 If false<br /> 4 1 1 1 1 Generate the script file to create a table having with the current node name and make<br /> the attributes, and children of the current node which are not Concepts, fields m the table<br /> 4 1 1 1 2 Generate the script file to create the 'Par-Child' relational table using the Parent name<br /> 4 1 1 13 Store the table names for whom the script file are generated in a hash table<br /> 4 1 1 1 4 Store the foreign key constraints in a separate vector<br /> 4 2 Recursively invoke the makeRelations method passing the children of the current node<br /> 5 Append the foreign key constraints to the script file<br /> 6 Execute the script file to generate the 'CONCEPT' tables, the 'Par-Child' tables and the foreign key<br /> constraints<br /> <br /> <br /> <br /><br /> <br /> <br /> <br /> <br /> <br /> <br /> <br /> <br /> <br /> The decision engine module of the WHM uses the "ForeignKeys Info" and the<br /> <br /> <br /> "TimeStamp Info" system tables.<br /> <br /> <br /> <br /> <?xml version= "1.0"?><br /> <Bib><br /> <Book><br /> <Author<br /> <firstname> Jack <firstname><br /> lastname> James</lastname><br /> <address> #123, 8th Avenue<address><br /> <Author><br /> <Author><br /> lastname> Thomson</lastname><br /> <address> #149, 18th Avenue<address><br /> <Author><br /> <Title> XML Manangement Systems <Title><br /> <Year> 2001 <Year><br /> <ISBN> 3528463422 <ISBN><br /> <Book><br /> <Article><br /> <Title> XML Toolkits <Title><br /> <Year> 2001 <Year><br /> <Article><br /> <Bib><br /> <br /> <br /> Figure 4.9: A sample XML document conforming to the input DTD in Figure 4.4.<br /> <br /> <br /> Bib<br /> <br /> <br /> <br /> Bib Book Bib Article<br /> 0,0 0,0<br /> <br /> Book Author Article<br /> OXMLManagement Systems, _. Jack, James, #123, 8th Avenue 0,XML Toolkits, 2001<br /> 2001,3528463422<br /> S20NULL, Thomson, #149, 18th Avenue<br /> <br /> Book Author Book Editor Article Editor Article Author<br /> 0,0<br /> 1,0<br /> Editor<br /> <br /> <br /> <br /> <br /> <br /> Figure 4.10: Contents of the tables after loading the sample XML document in Figure<br /> 4.9.<br /> <br /> <br /> <br /> <br /> <br /> The pseudo-code of the algorithm is shown in Figure 4.8. The SCE also generates the<br /> <br /> <br /> script file for creating and dropping the database in the current directory. In case of any<br /> <br /> <br /> <br /><br /> <br /> <br /> <br /> <br /> <br /> network error while accessing the database, the script file can be executed from the<br /> <br /> Oracle SQL*Plus window directly to create the relational tables.<br /> <br /> <br /> 4.6. 
4.6. XML Data Loader Engine (DLE)

The loading operation is a translation from one data model (XML) to another (RDBMS); the data in the XML document is stored as relational tuples. The XML data loader module, also known as the loader, implements a load-append strategy. Figure 4.9 gives an example of an XML document conforming to the DTD shown in Figure 4.4. This module takes an XML document DOM object as input and generates the script file to load the data into the relational tables. The loader uses the Tag_Info and the Primary_Keys_Info system tables. The data from both these tables is loaded into two hash tables, Tag_Info and PK_Vals, so that the database is accessed only once initially. The loader parses the DOM object starting from the root until all the nodes in this n-ary tree are traversed. A node that is found to be a 'CONCEPT' triggers the loading of a tuple into the relational table with the same name. The schema information is obtained from the database and the tuple is initialized with null values. The attributes and children of this node are examined and the values are loaded into the tuple after matching the relational field names with the element names. The primary key value for this tuple is assigned from the PK_Vals hash table. The children nodes are recursively processed, passing the parent name so that the foreign key references in the Parent-Child table are set up; the foreign key values for this table are obtained from the PK_Vals hash table as well. The primary key value for the current table is updated in the PK_Vals hash table only after processing all its children. The script file and the PK_Vals hash table are passed on to the database connection engine. The DBCE executes the script file to load the data contained in the XML document DOM object into the database. The hash table is used to update the contents of the Primary_Keys_Info system table so that, if the system were to crash, the primary key values used when loading the next document would begin with the correct values.
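The following Java sketch outlines one way this traversal could be written. It is illustrative only: it ignores attributes and value escaping for brevity, and the tagInfo and pkVals maps are stand-ins (assumed names) for the Tag_Info and PK_Vals hash tables described above.

import java.util.*;
import org.w3c.dom.*;

// Illustrative sketch of the loader traversal in Figure 4.11; not the toolkit's actual code.
public class LoaderSketch {

    private final Map tagInfo;   // element name -> type ("CONCEPT", "Inlined-Child", ...)
    private final Map pkVals;    // table name -> next primary key value (Integer)
    private final StringBuffer script = new StringBuffer();

    LoaderSketch(Map tagInfo, Map pkVals) {
        this.tagInfo = tagInfo;
        this.pkVals = pkVals;
    }

    /** Mirrors makeTuple(current_node, Parent) from Figure 4.11. */
    void makeTuple(Element node, String parent) {
        String name = node.getTagName();
        NodeList children = node.getChildNodes();
        boolean concept = "CONCEPT".equals(tagInfo.get(name));
        if (concept) {
            int pk = ((Integer) pkVals.get(name)).intValue();
            StringBuffer cols = new StringBuffer(name + "_PK_Id");
            StringBuffer vals = new StringBuffer(String.valueOf(pk));
            for (int i = 0; i < children.getLength(); i++) {
                if (children.item(i) instanceof Element) {
                    Element child = (Element) children.item(i);
                    if (!"CONCEPT".equals(tagInfo.get(child.getTagName()))) {
                        cols.append(", ").append(child.getTagName());      // leaf child -> column
                        vals.append(", '").append(text(child)).append("'");
                    }
                }
            }
            script.append("INSERT INTO " + name + " (" + cols + ") VALUES (" + vals + ");\n");
            if (!"-".equals(parent)) {                                     // parent-child tuple
                Integer parentPk = (Integer) pkVals.get(parent);
                script.append("INSERT INTO " + parent + "_" + name + " VALUES ("
                        + parentPk + ", " + pk + ");\n");
            }
        }
        for (int i = 0; i < children.getLength(); i++) {                   // recurse into children
            if (children.item(i) instanceof Element) {
                makeTuple((Element) children.item(i), name);
            }
        }
        if (concept) {                                                     // bump the key only after the children
            int pk = ((Integer) pkVals.get(name)).intValue();
            pkVals.put(name, new Integer(pk + 1));
        }
    }

    /** First text child of an element, trimmed (helper for this sketch). */
    private static String text(Element e) {
        Node t = e.getFirstChild();
        return (t == null || t.getNodeValue() == null) ? "" : t.getNodeValue().trim();
    }
}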
1. Input is an XML document DOM object.
2. Create the Tag_Info and PK_Vals hash tables from the system tables. Make a new hash table, Table_Names.
3. Invoke the method makeTuple(Root, '-').
4. Method makeTuple(Node current_node, String Parent)
4.1. Check if the current node in the tree is a CONCEPT using the Tag_Info hash table.
4.1.1. If true
4.1.1.1. Match the leaf children names with field names in the relational table and generate the script to create a tuple using the PK_Vals hash table.
4.1.1.2. Generate the script to create a tuple for the 'Parent_Child' table.
4.2. Recursively invoke the makeTuple method for the children of the current node.
4.3. Update the primary key value for this table in the PK_Vals hash table.
5. Execute the script file to insert the data in the XML document into the relational tables and update the contents of the Primary_Keys_Info system table.

Figure 4.11: Pseudo code of the loader

The tuples generated by the loader and stored in the schema for the document in Figure 4.9 are shown in Figure 4.10. The pseudo code for the loader module is given in Figure 4.11.

4.7. Relational-to-XML-Engine (RXE)

The input to this module is a SQL query. The query is executed in the data warehouse and the results are wrapped into an XML document DOM object, grouping the results if the SQL query contains a "Group By" clause. The module obtains the metadata information from the database. The module can create an XML document conforming to the ontology specifications if the paths of each of the resulting tags are provided. Initially, the RXE looks for the "Group By" clause and stores the grouping attributes in a vector and a hash table, Grp_hash. The grouping attributes are located in the result set and the values in the tuples are compared with the values stored in Grp_hash; "GRP_By" tags are created for each of them, and the new values of the grouping attributes are updated in Grp_hash. Tags are created from the result set metadata and are appended to the XML document DOM object. If the SQL query has no "Group By" clause, the resulting document has the metadata name as the tag name and the tuple value as the data value. The important feature of this module is the ability to create a document grouping the results. Every tuple mapped into the XML document is wrapped in a "ROW" tag whose "ID" attribute takes the tuple number as its value.

SELECT Author.firstname, Author.lastname, Author.address, Book.Title, Book.Year, Book.ISBN, Article.Title, Article.Year
FROM Bib, Article, Book, Bib_Article, Bib_Book, Author, Book_Author
WHERE Bib.Bib_PK_ID = Bib_Article.Bib_FK_ID and Bib.Bib_PK_ID = Bib_Book.Bib_FK_ID and Author.Author_PK_ID = Book_Author.Author_FK_ID;

Figure 4.12: SQL query to retrieve books and articles from the data warehouse.

Figure 4.13: XML document generated by the Relational-to-XML-engine (RXE).

<Result>
<ROW ID="1">
<FIRSTNAME>Jack</FIRSTNAME>
<LASTNAME>James</LASTNAME>
<ADDRESS>#123, 8th Avenue</ADDRESS>
<TITLE>XML Management Systems</TITLE>
<YEAR>2001</YEAR>
<ISBN>3528463422</ISBN>
<TITLE>XML Toolkits</TITLE>
<YEAR>2001</YEAR>
</ROW>
<ROW ID="2">
<FIRSTNAME>NULL</FIRSTNAME>
<LASTNAME>Thomson</LASTNAME>
<ADDRESS>#149, 18th Avenue</ADDRESS>
<TITLE>XML Management Systems</TITLE>
<YEAR>2001</YEAR>
<ISBN>3528463422</ISBN>
<TITLE>XML Toolkits</TITLE>
<YEAR>2001</YEAR>
</ROW>
</Result>
The SQL query has to be generated based on the schema information that is available in the system tables. The table name is prefixed to the field name during retrieval so that conflicts are resolved when the same column exists in more than one table. A join between two concepts in the global schema can be achieved by involving the parent-child relational tables. For example, for the SQL query in Figure 4.12, a join has to be performed using the parent table Bib, the tables Book and Article, and the parent-child relational tables Bib_Article and Bib_Book. The query could include a "Group By" clause, but the rules of the RDBMS then require that the non-aggregated attributes present in the select clause also appear in the "Group By" clause. The document created could be nested if the path information about each tag created is presented to the RXE. The result of the sample SQL query in Figure 4.12, which extracts the document loaded in Figure 4.9, is shown in Figure 4.13.
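For illustration, a grouping variant over the schema of Figure 4.6 might look as follows; this query is a sketch written for this discussion, not one generated by the toolkit. Since no aggregate is computed, every selected attribute also appears in the "Group By" clause:

SELECT Author.lastname, Book.Title, Book.Year
FROM Author, Book, Book_Author
WHERE Author.Author_PK_Id = Book_Author.Author_FK_Id
  and Book.Book_PK_Id = Book_Author.Book_FK_Id
GROUP BY Author.lastname, Book.Title, Book.Year;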


1. Input is a SQL query.
2. Check if the query has a 'GROUP BY' clause.
2.1. If true
2.1.1. Check for all the alias names used for the grouping attributes and store the attribute names in a hash table, Grp_hash, and a vector, Grp_vec.
3. Execute the SQL query using the Database Connection Engine (DBCE).
3.1. Obtain the result set containing the results of the query.
3.2. Obtain the metadata details of the result set.
4. Method makeXML()
4.1. For all the tuples in the result set do
4.1.1. For all the grouping attributes in Grp_vec
4.1.1.1. Locate the grouping attribute in the result set.
4.1.1.2. Obtain the value of the attribute from the hash table.
4.1.1.3. Compare the value of the grouping attribute in the hash table and the tuple; if they differ
4.1.1.3.1. Create a 'GRP_By' tag and assign the value to it from the tuple.
4.1.1.3.2. Store the new value in the hash table.
4.1.1.3.3. Reset the values of the grouping attributes lower in priority than the current attribute in Grp_hash.
4.1.1.4. All the other values that are not part of the grouping attributes are stored as separate tags created from the metadata details and appended to the XML document DOM object.
5. Return the XML document DOM object created.


Figure 4.14: Pseudo code of the RXE.









The pseudo code for the algorithm is given in Figure 4.14.
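To complement the pseudo code, the following Java sketch shows one way the grouping logic could be realized. It is illustrative only and makes several assumptions: it builds the DOM with a JAXP DocumentBuilder (the toolkit itself uses the Oracle XML parser), the element names Result, GRP_By and ROW follow the conventions used above, and error handling is omitted.

import java.sql.*;
import java.util.*;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.*;

// Illustrative sketch of the RXE grouping step in Figure 4.14; not the toolkit's actual code.
public class RxeSketch {

    /**
     * Wraps a result set into an XML DOM document. groupCols lists the
     * "Group By" attributes in priority order; it may be empty.
     */
    public Document makeXML(ResultSet rs, List groupCols) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element result = doc.createElement("Result");
        doc.appendChild(result);

        Map grpHash = new HashMap();                  // grouping attribute -> last seen value
        ResultSetMetaData meta = rs.getMetaData();
        int rowNum = 0;
        while (rs.next()) {
            rowNum++;
            // Emit a GRP_By tag whenever a grouping attribute changes value.
            for (int g = 0; g < groupCols.size(); g++) {
                String col = (String) groupCols.get(g);
                String val = rs.getString(col);
                if (val != null && !val.equals(grpHash.get(col))) {
                    Element grp = doc.createElement("GRP_By");
                    grp.appendChild(doc.createTextNode(val));
                    result.appendChild(grp);
                    grpHash.put(col, val);
                    for (int k = g + 1; k < groupCols.size(); k++) {
                        grpHash.remove(groupCols.get(k));   // reset lower-priority attributes
                    }
                }
            }
            // All remaining columns become tags under a ROW element.
            Element row = doc.createElement("ROW");
            row.setAttribute("ID", String.valueOf(rowNum));
            for (int i = 1; i <= meta.getColumnCount(); i++) {
                String name = meta.getColumnName(i);
                if (groupCols.contains(name)) {
                    continue;                               // already reflected by a GRP_By tag
                }
                Element tag = doc.createElement(name);
                String val = rs.getString(i);
                tag.appendChild(doc.createTextNode(val == null ? "NULL" : val));
                row.appendChild(tag);
            }
            result.appendChild(row);
        }
        return doc;
    }
}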


4.8. Database Connection Engine (DBCE)

This module provides the database connectivity. It uses the Oracle JDBC driver version 2.0 to communicate with the database and provides a set of APIs that can be invoked by the other modules. There is only one instance of the DBCE in operation and the other modules hold a reference to it. The DBCE uses a configuration file to obtain the database details: hostname, database name, port number, user name and password. The DBCE uses the standard JDBC API classes "Statement" and "PreparedStatement" to interact with the Oracle 8i database engine. All insert operations are implemented as batch operations in order to increase the efficiency of the database engine. All the relational attributes have "varchar2" as their type because the DTD does not provide adequate type information; a default field size is also assumed beforehand.
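A minimal sketch of the kind of JDBC usage described above is given below. The configuration file name and property keys are assumptions made for the example, error handling is omitted, and the code is illustrative rather than the toolkit's actual implementation.

import java.io.FileInputStream;
import java.sql.*;
import java.util.Properties;

// Illustrative sketch of the DBCE's connection handling and batched inserts; not the toolkit's code.
public class DbceSketch {

    private Connection conn;

    /** Read the database details from a configuration file and open the single connection. */
    public void connect(String configFile) throws Exception {
        Properties p = new Properties();
        p.load(new FileInputStream(configFile));            // e.g. "dbce.properties" (assumed name)
        Class.forName("oracle.jdbc.driver.OracleDriver");   // Oracle JDBC driver version 2
        String url = "jdbc:oracle:thin:@" + p.getProperty("host") + ":"
                + p.getProperty("port") + ":" + p.getProperty("sid");
        conn = DriverManager.getConnection(url, p.getProperty("user"), p.getProperty("password"));
    }

    /** Execute a script of INSERT statements as a single JDBC batch. */
    public void executeInsertBatch(String[] inserts) throws SQLException {
        Statement stmt = conn.createStatement();
        for (int i = 0; i < inserts.length; i++) {
            stmt.addBatch(inserts[i]);                       // queue each INSERT statement
        }
        stmt.executeBatch();                                 // send the whole batch in one call
        stmt.close();
    }
}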














CHAPTER 5
PERFORMANCE EVALUATION

In this chapter, we explain a set of tests to evaluate the performance of our toolkit. The only XML-processing programs for which benchmarks exist are several XML parsers [35]; up to this point, there is very little material discussing the benchmarking of XML data management systems [36-37]. Among them, XMach-1 provides benchmarks based on web applications, which are not directly applicable to the toolkit. Hence, we intend to demonstrate in an informal way the validity, functionality and performance of the toolkit. The focus of test 1 is to analyze the capability of the SCE module to produce a syntactically valid script for generating a relational schema. The goals of test 2 are to study the correctness and efficiency of the entire toolkit. To illustrate the efficiency, we draw a comparison between the outputs generated by the XML-QL processor implemented by AT&T and by the RXE component. Thus, the various aspects of an XML data management system (schema creation, data loading and data extraction) are tested. Section 5.1 describes the hardware configuration and software packages used for testing. In Section 5.2, we briefly explain the tests that were performed. Section 5.3 discusses the results in detail, bringing out the limitations. The inputs and outputs of the various components are illustrated in Figures 5.1-5.7.


5.1. Experimental Setup

All the experiments were carried out on a Pentium II 233 MHz processor with 256 MB of main memory running Windows NT 4.0. The toolkit was implemented using Java (SDK 1.3) from Sun Microsystems. Some of the other software tools and packages used are the XML Parser from Oracle version 2.0.2.9, Oracle 8i, the Oracle JDBC driver version 2 and the XML-QL query processor implemented by AT&T. The DTDs and XML documents were created using XML Authority v1.2 and XML Instance v1.1, respectively. All the modules of the toolkit ran on the same machine as the database in order to avoid network delays.

Figure 5.1: DTD describing the structure of a TV programs guide




TVSCHEDULE (TVSCHEDULE_PK_ID, NAME)
CHANNEL (CHANNEL_PK_ID, CHAN, BANNER)
DAY (DAY_PK_ID)
HOLIDAY (HOLIDAY_PK_ID, HOLIDAY)
DATE1 (DATE1_PK_ID, DATE1)
PROGRAMSLOT (PROGRAMSLOT_PK_ID, VTR, PROG_TITLE, DESCRIPTION)
TIME (TIME_PK_ID, HRS, MINS)
TVSCHEDULE_CHANNEL (CHANNEL_FK_ID, TVSCHEDULE_FK_ID)
CHANNEL_DAY (DAY_FK_ID, CHANNEL_FK_ID)
DAY_DATE1 (DATE1_FK_ID, DAY_FK_ID)
DAY_HOLIDAY (HOLIDAY_FK_ID, DAY_FK_ID)
DAY_PROGRAMSLOT (PROGRAMSLOT_FK_ID, DAY_FK_ID)
PROGRAMSLOT_TIME (TIME_FK_ID, PROGRAMSLOT_FK_ID)


Figure 5.2: Tables created by the SCE for the TV programs guide DTD












5.2. Test Cases


Test 1 studies the ability of the SCE module to produce a valid relational schema. The input DTD, which describes the structure of a TV programs guide, is supplied to the SCE, which creates the relational schema. Figure 5.1 shows the input DTD and Figure 5.2 displays the relational schema that corresponds to it.





Figure 5.3: An example XML document conforming to the TV programs guide DTD.





Test 2 focuses on proving the correctness of the entire toolkit. Initially, a relational schema is created using the SCE component. Then, a sample XML document is queried using the XML-QL processor and an output XML document is created. The same XML document is loaded into the database and the input XML-QL query is translated to SQL. The correctness of the toolkit is proved when an equivalent XML document is created by the RXE. Even though efficiency is not the primary focus, this aspect of the toolkit can be illustrated by comparing the output of the XML-QL processor implemented by AT&T with the output of the RXE. To illustrate, the XML-QL processor is used to query an XML document; the document being queried is displayed in Figure 5.3. The XML-QL processor expects the input query to be stored in a file and then generates an output file containing the results. The XML-QL processor is invoked from the command line as follows:


xmlql -q <query file> -o <output file>.


Figure 5.4: An XML-QL query to retrieve information about a particular TV program.


The XML-QL query file is shown in Figure 5.4.

