Title: Source specific query rewriting and query plan generation for merging XML-based semistructured data in mediation systems
Permanent Link: http://ufdc.ufl.edu/UF00100799/00001
 Material Information
Title: Source specific query rewriting and query plan generation for merging XML-based semistructured data in mediation systems
Physical Description: Book
Language: English
Creator: Shah, Amit, 1976-
Publisher: State University System of Florida
Place of Publication: Florida
Publication Date: 2001
Copyright Date: 2001
 Subjects
Subject: Web databases   ( lcsh )
Database management   ( lcsh )
Computer and Information Science and Engineering thesis, M.S   ( lcsh )
Dissertations, Academic -- Computer and Information Science and Engineering -- UF   ( lcsh )
Genre: government publication (state, provincial, territorial, dependent)   ( marcgt )
bibliography   ( marcgt )
theses   ( marcgt )
non-fiction   ( marcgt )
 Notes
Summary: KEYWORDS: semistructured, warehousing, data, integration, mediation, mediator, XML, query, rewriting
Thesis: Thesis (M.S.)--University of Florida, 2001.
Bibliography: Includes bibliographical references (p. 109-112).
System Details: System requirements: World Wide Web browser and PDF reader.
System Details: Mode of access: World Wide Web.
Statement of Responsibility: by Amit Shah.
General Note: Title from first page of PDF file.
General Note: Document formatted into pages; contains xiii, 113 p.; also contains graphics.
General Note: Vita.
 Record Information
Bibliographic ID: UF00100799
Volume ID: VID00001
Source Institution: University of Florida
Holding Location: University of Florida
Rights Management: All rights reserved by the source institution and holding location.
Resource Identifier: oclc - 47890107
alephbibnum - 002729363
notis - ANK7127


Full Text











SOURCE SPECIFIC QUERY REWRITING AND QUERY PLAN GENERATION FOR
MERGING XML-BASED SEMISTRUCTURED DATA IN MEDIATION SYSTEMS

















By

AMIT SHAH


A THESIS PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF SCIENCE

UNIVERSITY OF FLORIDA


2001




























Copyright 2001

by

Amit Shah






























To my parents, who have always striven to give their children the best in life















ACKNOWLEDGMENTS

I express my sincere gratitude to my advisor, Dr. Joachim Hammer, for giving me

the opportunity to work on this challenging topic and for providing continuous guidance

and feedback during the course of this work and thesis writing. I am thankful to Dr. Sumi

Helal and Dr. Sanguthevar Rajasekaran for agreeing to be on my supervisory committee.

A special thanks goes to my colleague Rajesh Kanna, who assisted me in the

initial stages of this work. I am also grateful to all the other members of the IWiz research

group, Charnyote Pluempitiwiriyawej, Anna Teterovskaya and Ramasubramanian

Ramani. It was indeed a great experience to work with them.

I especially wish to thank my friends Vidyamani and Latha, for all their support

and help throughout my stay here at the University of Florida. I am also grateful to my

roommate Unnat, who helped me proofread my thesis document and give it a proper

shape.

I would like to acknowledge the efforts put in by Sharon Grant for making the

Database Center a truly great place to work. Special thanks to John Bowers and Nisi for

being there, always!

I would like to take this opportunity to thank my parents and my brother, for their

continued and encouraging support throughout my period of study here and especially in

this endeavor.
















TABLE OF CONTENTS

ACKNOWLEDGMENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

CHAPTERS

1 INTRODUCTION
1.1 Characteristics of Semistructured Data
1.2 The Data Integration Problem
1.3 Goal of the Thesis

2 THE USE OF XML AS THE UNDERLYING DATA MODEL
2.1 Why XML?
2.2 Advanced XML Features
2.3 XML Query Languages
2.4 Why We Chose XMLQL as Our Query Language
2.5 Categories of Queries
2.5.1 Category I: Simple Query with No Joins, No Filters and No Nesting
2.5.2 Category II: Simple Query with Filters and Without Joins and Nesting
2.5.3 Category III: Simple Query with an Implicit Join
2.5.4 Category IV: Simple Query with an Explicit Join
2.5.5 Category V: Nested Query
2.5.6 Category VI: Recursive Queries

3 OVERVIEW OF INTEGRATION APPROACHES AND PROTOTYPES
3.1 Different Approaches to Integration
3.1.1 The Data Warehousing Approach
3.1.2 The Mediation Approach
3.1.3 The Hybrid Approach
3.2 Integration System Prototypes
3.2.1 The TSIMMIS Project
3.2.2 The MIX Project
3.2.3 The TUKWILA Project
3.2.4 The FLORID Project
3.2.5 The MOMIS Project

4 THE IWIZ ARCHITECTURE
4.1 IWiz Overview
4.2 The Ontology Schema
4.3 IWiz Components
4.3.1 The Query Browsing Interface (QBI)
4.3.2 The Warehouse Manager (WHM)
4.3.3 The Query Rewriting Engine
4.3.4 The Data Restructuring Engine
4.3.5 The Data Merging Engine

5 THE JOIN SEQUENCING ALGORITHM AND FULL RESULT GENERATION PROCESS
5.1 The Query Rewriting Process Overview
5.2 The Concept of a Full Result
5.2.1 Case 1: Individual Source Results Are All Full Results
5.2.2 Case 2: Individual Source Results Are All Empty Results
5.2.3 Case 5: Individual Source Results Are Both Full and Empty
5.2.4 Case 3: Individual Source Results Are All Partial Results
5.2.5 Case 4: Individual Source Results Are Both Partial as well as Full
5.2.6 Case 6: Individual Source Results Are Both Partial as well as Empty
5.2.7 Case 7: Individual Source Results Are Both Partial as well as Empty
5.3 The Children Binding Rule
5.4 The Join Sequencing Algorithm

6 THE QRE ARCHITECTURE AND IMPLEMENTATION
6.1 The Build-Time Phase
6.1.1 Requirements
6.1.2 Analysis
6.1.3 Design and Implementation
6.2 Run-Time Phase
6.2.1 Requirements
6.2.2 Analysis
6.2.3 Design and Implementation
6.2.3.1 The Query Parse Tree Generator
6.2.3.2 The Join Sequences Generator
6.2.3.3 The Splitter and Query Plan Generator

7 EXPERIMENTAL PROTOTYPE

8 CONCLUSIONS
8.1 Contributions
8.2 Future Work

APPENDIX

REFERENCES

BIOGRAPHICAL SKETCH
















LIST OF TABLES

5.1: Scenario wherein two sources contain only one requested item

5.2: Scenario wherein two sources contain only one requested item with a joinable data item

5.3: Scenario wherein all the 3 sources together contain all the requested items but no joinable data items

5.4: Scenario wherein all the sources together contain all requested items along with a common joinable data item

5.5: Scenario wherein all the sources together contain all requested items with the joinable data items required for a join

5.6: Scenario wherein all the sources together contain all the requested items but do not contain overlapping joinable data items

5.7: Scenario wherein source 1 yields a full result and source 2 and source 3 yield a partial result

5.8: Scenario wherein source 3 yields a full result and source 1 and source 2 yield a partial result

5.9: Scenario wherein source 3 yields no result but provides for joinable data items

5.10: Scenario wherein source 1 and source 2 both yield partial results and source 3 yields an empty result















LIST OF FIGURES

2.1: An Example of an XML Document Describing a Bibliography Containing One Data Instance on Book and One on Article, Each with Their Sub-Structure

2.2: Sample DTD for the Document in Figure 2.1

2.3: An XML Document "bib.xml"

2.4: An XML DTD "bib.dtd" for the Document Shown in Figure 2.3

2.5: An XMLQL Query Requesting Author of Books Published by Addison-Wesley

2.6: Sample Query of Category I

2.7: Sample Query of Category II without Tag Variables

2.8: Sample Query of Category II with Tag Variables

2.9: Sample Query of Category III

2.10: Sample Query of Category IV

2.11: Sample Query of Category V

2.12: Sample Query of Category VI

3.1: An Integration System

3.2: The Data Warehousing Approach

3.3: The Mediation Approach

3.4: The Hybrid Approach

4.1: Information Integration Wizard (IWiz) Architecture

5.1: Sample XMLQL Query Requesting Book Title, Year Published and Author

5.2: An XMLQL Query Involving a Join on Titles of Books and Articles

5.3: An XMLQL Query Requesting Simultaneously for Book Titles and Article Titles

5.4: An XMLQL Query with its Source Scenario

5.5: An XMLQL Query with its Source Scenario

5.6: Pseudo-code of the Join Sequencing Algorithm

6.1: QRE Build-Time Phase Overview

6.2: Joinable Data Item Info Text File

6.3: Example of Restructuring Specification

6.4: QRE Build-Time Phase

6.5: The Class AOTNode

6.6: QRE Run-Time Phase Overview

6.7: QRE Run-Time Phase

6.8: An XMLQL Query Requesting All Books, the Title of Each of Which Is Also the Title of an Article

6.9: Parse Tree for Query Shown in Figure 6.8

6.10: An XMLQL Query Requesting for Books, the Title of Each of Which Is Also the Title of an Article and a Thesis

6.11: An XMLQL Query Requesting for Books, Each with Its Title, Year and Author

6.12: WHERE Clause of an XMLQL Query

6.13: Query Plan DTD

6.14: An XMLQL Query

6.15: Query Parse Tree with Location Information

6.16: Sample Query Plan

7.1: Hierarchical Structure of the XML Document "haptics_article.xml"

7.2: Location Information for the Concepts of the Document Shown in Figure 7.1

7.3: The Joinable Data Item Information Text File

7.4: Test XMLQL Query

7.5: Query to Source S1

7.6: Query to Source S2

7.7: Query Plan

7.8: Execution Tree Query















Abstract of Thesis Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Master of Science

SOURCE SPECIFIC QUERY REWRITING AND QUERY PLAN GENERATION
FOR MERGING XML-BASED SEMISTRUCTURED DATA
IN
MEDIATION SYSTEMS

By

Amit Shah

May 2001


Chairman: Joachim Hammer
Major Department: Computer and Information Science and Engineering

This thesis describes the underlying research, design, implementation and testing

of the Query Rewriting Engine (QRE), which is an integral part of the Information

Integration Wizard (IWiz) project that is currently ongoing in the Database Research and

Development Center at the University of Florida. IWiz focuses on building an integrated

system for querying structurally and semantically heterogeneous, semistructured

information sources. QRE is one of two sub-components of the IWiz middleware layer

(Mediator) which processes queries against multiple sources containing related or

overlapping information. Specifically, the task of QRE is to parse the incoming query,

identify appropriate sources to be queried from among the available sources, rewrite the

query into source-specific sub-queries, and generate the query plan for merging the

results that are returned to the mediator. The data merging is conducted by the









second sub-component, called Data Merge Engine (DME) which is the focus of a related

research effort.

There are two major phases in the query rewriting process: a build-time phase

during which QRE initializes its meta-data about the number and availability of sources as

well as location information for the queriable concepts in the global ontology. This is

followed by the run-time or query phase during which QRE accepts and processes

queries from the user interface layer of IWiz.

IWiz uses XML as its internal data model and supports XMLQL as its query

language. We have implemented a fully functional version of QRE, which is installed and

integrated into a sample mediator in the IWiz testbed and is undergoing extensive

testing.














CHAPTER 1
INTRODUCTION

The World Wide Web (Web) has become a vast information store whose content

is growing at a rapid rate. It has become a global data repository with virtually limitless

possibilities for data exchange and sharing. However, the contents of the Web cannot be

queried and manipulated in a general way. A large percentage of the information is stored

as static HTML pages that can only be viewed through a browser. Some sites provide

search engines, but their query capabilities are often limited. Most of them involve only

text-based searches with no particular emphasis on the semantics of the result. Also, new

formats for storing and representing data are constantly evolving [1], making the Web an

increasingly heterogeneous environment. Obviously, it cannot be constrained by a single

schema. Any database researcher would want to think of the Web as a huge database and

have database tools for querying and maintaining it. But since the Web does not conform

to any standard data model, there has been a growing need for a method to describe its

structure. A large body of research is dedicated to overcoming this heterogeneity and

creating systems that allow seamless integration of, and access to a multitude of data

sources. It has been noted in Florescu et al. [2] that web data retain some structure, but

not to the degree where conventional data management techniques can be effectively

used. Consequently, the term semistructured data emerged, and with it, new research

directions and opportunities.









1.1 Characteristics of Semistructured Data

Before the advent of the Web, problems associated with storing large amounts of

data were solved by using databases based on the relational or the OO model. These

databases require that all data conform to a predefined schema, which naturally limits

the variety of data items being stored, but allows for efficient processing of the stored data.

On the other hand, large quantities of data are still being stored as unstructured text files

residing in file systems. Minimal presence of constraints in unstructured data formats

allows for the representation of a wide range of information. However, automatic

interpretation of unstructured data is not an easy task.

Semistructured data usually exhibit some amount of structure, but this structure

may be irregular, incomplete, and much more flexible than what traditional databases

require. The information that is normally associated with a schema is contained within

the data, hence the term "self-describing", which is sometimes used in connection with

semistructured data. In some forms of semistructured data there is no separate schema; in

others it exists but places only loose constraints on the data. Semistructured data can

come into existence in many different ways. The data can be designed with a

semistructured format in mind, but more often the semistructured data format arises as a

result of the introduction of some degree of structure into unstructured text or as the

result of merging data from several heterogeneous sources. Data models and query

languages/access mechanisms designed for well-structured data are inappropriate in such

environments. This is because these data models require the data to adhere to some

specific data types and conform to several constraints.

There are several characteristics of semistructured data that require special

consideration when building an application for processing such data [3, 4, 5]:









* The structure is irregular. The same information can be structured differently in parts

of a document. Information can be incomplete or represented by different data types.

* The structure is partial. The degree of structure in a document can vary from almost

zero to almost 100%. Thus, we can consider unstructured and highly structured data

to be extreme cases of semistructured data.

* The structure is implicitly embedded into data, i.e. the semistructured data are self-

describing. The structure can be extracted directly from data using some

computational process, e.g., parsing.

* An a-priori schema can be used to constrain data. Data that do not conform to the

schema are rejected. A more relaxed approach is to detect a schema from the existing

data (recognizing that such a schema cannot possibly be complete) only in order to

simplify data management, not to constrain the document data.

* A schema that attempts to capture all present data constructs can be very large due to

the heterogeneous nature of the data.

* A schema can be ignored. Nothing prevents an application from simply browsing the

hierarchical data in search of a particular pattern with an unknown location, since the

data are self-describing and can exist independently of the schema.

* A schema can rapidly evolve. In general, a schema is embedded with the data and is

updated as easily as data values themselves.

* The distinction between schema and data is blurred. In standard database

applications, a basic principle is the distinction between the schema (that describes

the structure of the database) and data (the database instance). Many differences

between schema and data disappear in the context of semi-structured data: schema









updates are frequent, schema laws can be violated, the schema may be very large, the

same queries/updates may address both the data and schema.





1.2 The Data Integration Problem

The data integration process queries, extracts, converts and merges the required

data from different heterogeneous sources into a common format that conforms to a

global or a target schema. The most common causes for heterogeneities are different data

formats (e.g., a date being represented as Oct. 11 2000 vs. 10-11-2000 vs. 11-10-2000,

etc.), differences in the underlying data model (e.g., relational, object-oriented,

semistructured), and different schemas. Some aspects of the heterogeneity among data

sources are due to the use of different hardware and software platforms to manage

distributed databases [6]. The emergence of standard protocols and middleware

components, e.g. CORBA, DCOM, ODBC, JDBC, etc., has simplified remote access to

many standard source systems. Most of the research initiatives for integrating

heterogeneous data sources focus on overcoming the schematic and semantic

discrepancies that exist among cooperative data sources, assuming they can be reliably

and efficiently accessed by so-called integration systems using the above protocols.

Basically, there are three major tasks to be performed for integration of data:

First, the schemas of the heterogeneous sources are analyzed and compared to the target

schema one by one and the conflicts between the schemas are noted. Based on this

knowledge, a set of rules for data translation is created for each source schema. Applying

translation rules to the source information results in data instances fully conforming to

the target schema. Second, the data coming from different sources are reorganized and

'joined' so that the semantic completeness and correctness of each data tuple are

preserved. A relational database contains links (foreign key references) between pieces of

information so that all data remain accessible. Similarly, a semistructured data integration

system requires a layer on top of an irregular and less controlled collection of files, one

that keeps knowledge of the sources' schemas and knows how to join data that may be

overlapping, incomplete and complementary. Third, the data from the different sources

are merged and duplicates and redundancies are removed.
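As a simple illustration of these three tasks (the fragments below are invented for this sketch), two sources may describe the same book with different but overlapping content. After each source's data are translated to the target schema, the join and merge steps combine the complementary pieces and remove the duplicated title:

Source 1:
<book><title>Data on the Web</title><year>1999</year></book>

Source 2:
<book><title>Data on the Web</title><publisher>Morgan Kaufmann</publisher></book>

Merged result conforming to the target schema:
<book>
  <title>Data on the Web</title>
  <year>1999</year>
  <publisher>Morgan Kaufmann</publisher>
</book>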

The project IWiz [6], which is currently under development at the University of

Florida Database Research and Development Center, enables users to query a variety of

sources through one common interface. The focus of the project is to provide

integrated access to semistructured sources through query mediation while, at the same

time, warehousing frequently accessed data for faster retrieval.





1.3 Goal of the Thesis

In this thesis, we describe the underlying research and requirements, design,

implementation and testing experiments for one of the architectural components of

IWiz, namely the Query Rewriting Engine (QRE). As discussed earlier, in a

semistructured data integration system, there arises a need for a middleware layer that acts

as a mediator between the front-end and the disparate schematically heterogeneous data

sources. In IWiz, we call this middleware layer the 'Mediator'. It has information about the

sources' schemas and the knowledge to join and merge their data. QRE is one of the

two sub-components of the Mediator, which processes queries against multiple sources

that may contain related, complementary or overlapping, and incomplete information.









Specifically, the task of QRE is to parse the incoming query, identify appropriate sources

to be queried from among the available sources, rewrite the query into source-specific

sub-queries, and generate a query plan for merging the results that are returned to

the mediator. The data merging is conducted by the second sub-component, called Data

Merge Engine, which is the focus of a related research effort. There are two major phases

in the query rewriting process: a build-time phase during which QRE initializes its meta-

data about the number and availability of sources as well as location information for the

queriable concepts in the global ontology schema. This is followed by the run-time or

query phase during which QRE accepts and processes queries from the user interface

layer of IWiz.

At the end of this thesis, the reader can expect the following contributions from

this research. First, a complete categorization of XMLQL queries from an Integration

System perspective. Second, analysis of and solution to problems in joining data at

different levels in the XML document hierarchy. Third, a new and different approach to

Query Rewriting in Mediation Systems. Fourth, an algorithm to discover sources to be

queried for the concepts asked in a query. Fifth, a join sequencing algorithm to join the

results returned to the mediator. Sixth, a methodology to generate source-specific

sub-queries customized for each source. And finally, seventh, query plan generation

based on the join sequences.

The rest of the thesis is organized as follows. Chapter 2 gives an overview of why

we chose XML as our underlying data model and XMLQL as our query language. It also

gives a complete description of the categories of XMLQL queries that are supported by

IWiz. Chapter 3 is dedicated to an overview of related research on integration systems.








Chapter 4 describes the IWiz architecture and the significance of QRE in relation to other

components. Chapter 5 analyzes fundamental concepts of the Query Rewriting Process.

Chapter 6 focuses on our implementation of QRE. Chapter 7 describes the experimental

prototype and results. Finally, Chapter 8 concludes the thesis with a summary of our

accomplishments and issues to be considered in the future.














CHAPTER 2
THE USE OF XML AS THE UNDERLYING DATA MODEL





2.1 Why XML?

Semistructured data can be represented in different ways. Numerous research

projects have been using various representations and data models to manage collections

of irregular structured data [7, 8, 9]. The eXtensible Markup Language (XML) [10] has

emerged as one of the contenders and has quickly turned into the data exchange model of

choice. Initially, it started as a convenient format to delimit and represent hierarchical

semantics of text data, but was quickly enriched with extensive APIs, data definition

facilities, and presentation mechanisms, which turned it into a powerful data model for

semistructured data. The other data models known to model semistructured data are the

OEM (Object Exchange Model) data model developed at Stanford for the TSIMMIS

project [11], the Ozone data model [7], the YAT data model [12], ODMG's object model

used in Garlic at IBM Almaden [13], etc.

XML is the result of convergence of ideas from the document and database

communities. In order to represent data with loosely defined or irregular structure, the

semistructured data model has emerged as a dynamically typed data model that allows a

"schema-less" description format in which the data is less constrained than is usual in

database work. At the same time the document community has developed XML as a

format in which more structure is added to documents in order to simplify and









standardize the transmission of data via documents. It turns out that these two

representations are essentially identical. XML provides a foundation for creating

documents and document systems. XML operates on two main levels. First, it provides

syntax for document markup, and second, it provides syntax for declaring the structures

of documents.

XML is, after all, a meta-language, a set of rules that can be used to create sets of

rules for documents. By applying XML technology, one is essentially creating a new

markup language. In a certain sense, there's no such thing as an 'XML document': all the

documents that use XML-compliant syntax are really using applications of XML, with

tag sets chosen by their creators for that particular document. XML's facilities for

creating Document Type Definitions (DTDs) provide a set of tools for specifying what

document structures may or must appear in a document, making it easy to define sets of

structures. These structures can then be used with XML tools for authoring, parsing, and

processing, and used by applications as a guide to the data they should accept.
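For example, the following declarations (an invented sketch) define a minimal recipe vocabulary; any document built from these elements is, in effect, written in a new markup language:

<!ELEMENT recipe (name, ingredient+, step+)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT ingredient (#PCDATA)>
<!ELEMENT step (#PCDATA)>

A document such as <recipe><name>Tea</name><ingredient>Water</ingredient><step>Boil</step></recipe> is then an application of XML that authoring and parsing tools can check against these declarations.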

Following are some of the features that make its use favorable [14, 15, 16]:

* XML is self-describing. Each data element has a descriptive tag. Using these tags, the

document structure can be extracted without knowledge of the domain or a document

description.

* XML can be used not only to describe information but also to structure it, so it

can be thought of as a data description language. It can be used to describe data

components, records and other data structures--even complex data structures.

* XML is extensible. Unlike HTML, XML allows you to define countless sets of tags,

describing any imaginable domain.









* DTDs provide the Data Definition Language feature to XML and can be used to

create schemas. The well-formedness and structural validity of any XML document

against a DTD can be evaluated using a simple grammar.

* XML is able to capture hierarchical information and preserve parent-child

relationships between real-world concepts.

* XML is portable. It is designed to structure data so that it can be easily transferred

over a network and consistently processed by the receiver.

* XML has a flexible structure. New tags can be added anywhere or existing ones can

be removed anytime very easily.

* The tags can be nested and repeated. Recursive definitions of structures can

conveniently be introduced.

* Unlike HTML, the data in XML is separate from presentation.

* It is human-readable, which, though it sounds insignificant, is a very important factor

in its popularity.

* A shared DTD implies a shared data representation. It is compact and easy to print.

* Finally, there are already numerous tools now available for parsing, querying,

processing XML data, tools that map relational schemas to the XML data model, etc.,

just to name a few.

As has been pointed out earlier, the Extensible Markup Language (XML) is a

subset of SGML [17]. The World Wide Web Consortium took the initiative to develop

and standardize XML, and their recommendation of 10 February 1998 outlines the

essential features of XML 1.0 [18].






















<bibliography>
  <book>
    <title>...</title>
    <author>...</author>
    <publisher>
      <name>Wrox Press Ltd</name>
    </publisher>
    <year>2000</year>
  </book>
  <article type="XML">
    <author>
      <firstname>Sudarshan</firstname>
      <lastname>Chawathe</lastname>
    </author>
    <title>Describing and Manipulating XML Data</title>
    <year>1999</year>
    <shortversion>This paper presents a brief ... data
      management using the Extensible Markup
      Language (XML). It presents the basi...
    </shortversion>
  </article>
</bibliography>


Figure 2.1: An Example of an XML Document Describing a Bibliography Containing
One Data Instance on Book and One on Article, Each with Their Sub-Structure




XML is a markup language. Markup tags can convey semantics of the data


included between the tags, special processing instructions for applications, and references


to other data elements either internal or external. The XML document in Figure 2.1


illustrates a set of bibliographic information consisting of books and articles, each with its


own specific structure. Tags can be nested, with child entities placed between the parent's


opening and closing tags; no limits are placed on the depth of the nesting.


The fundamental structure composing an XML document is the element. An


element can contain other elements, character data, and auxiliary structures, or it can be


empty. All XML data must be contained within elements. Examples of elements in


Figure 2.1 are <bibliography>, <article>, and <lastname>. Simple information about elements can be stored in attributes, which are name-value pairs attached to an element. Attributes are often used to store the element's meta-data. Only simple character strings are allowed as attribute values, and no markup is allowed. The element <article> in our example has an attribute "type" with an associated data value "XML." The XML document in Figure 2.1 is an example of a well-formed XML document, i.e. an XML document conforming to all XML syntax rules.

2.2 Advanced XML Features

An XML grammar defines how to build a well-formed XML document, but it does not explain how to convey the rules by which a particular document is built. Other questions requiring answers are how to constrain the data values for a particular document, and how to reuse an XML vocabulary created by someone else. This section touches on XML-related standards and proposals that solve these and other problems. A Document Type Definition (DTD) is a mechanism to specify the structure and permissible values of XML documents. The schema of the document is described in a DTD using a formal grammar. The rules to construct a DTD are given in the XML 1.0 Recommendation. The main components of all XML documents are elements and attributes. Elements are defined in a DTD using the <!ELEMENT> tag; attributes are defined using the <!ATTLIST> tag. The declarations must start with a <!DOCTYPE> tag followed by the name of the root element of the document. The rest of the declarations can follow in an arbitrary order. Other markup declarations allowed in a DTD are <!ENTITY> and <!NOTATION>. <!ENTITY> declares reusable content, for example, a special character or a line of text repeated often throughout the document. An entity can refer to content defined inside or outside of the document. A <!NOTATION> tag associates data in formats other than XML with programs that can process the data. Figure 2.2 presents a DTD for the XML document in Figure 2.1. When a well-formed XML document conforms to a DTD, the document is called valid with respect to that DTD.
Next, we provide a detailed analysis of what can be a part of a DTD.

<?xml version="1.0"?>
<!DOCTYPE bibliography [
<!ELEMENT bibliography (book|article)*>
<!ELEMENT book (title, author+, editor?, publisher?, year)>
<!ELEMENT article (author+, title, year, (shortversion|longversion)?)>
<!ATTLIST article type CDATA #REQUIRED
                  month CDATA #IMPLIED>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (firstname?, lastname)>
<!ELEMENT editor (#PCDATA)>
<!ELEMENT publisher (name, address?)>
<!ELEMENT year (#PCDATA)>
<!ELEMENT firstname (#PCDATA)>
<!ELEMENT lastname (#PCDATA)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT address (#PCDATA)>
<!ELEMENT shortversion (#PCDATA)>
<!ELEMENT longversion (#PCDATA)>
]>

Figure 2.2: Sample DTD for the Document in Figure 2.1

Each element declaration consists of the element name and its contents. The contents of the element can be of four types: empty, element, mixed, or any. An empty element cannot have any child elements (but can contain attributes). An element whose content has been defined as any can have any number of different contents conforming to XML well-formed syntax. Element content refers to the situation in which an element can have only other elements as children. Mixed content allows combinations of element child nodes and parsed character data (#PCDATA), i.e. text. For example, in Figure 2.2, the bibliography element has element content, and the year element has mixed content.

The DTD also allows one to specify the cardinality of the elements. The following explicit cardinality operators are available: ?, which stands for "zero-or-one"; *, for "zero-or-more"; and +, for "one-or-more." In the case when no cardinality operator is used, the element can be present exactly once (i.e., the default cardinality is "one"). In our example in Figure 2.2, a book can contain one or more author child elements, must have a child element named title, and the publisher information can be missing. Order is an important consideration in XML documents; the child elements in the document must be present in the order specified in the DTD for this document. For example, a book element with a year child element as the first child will not be considered a part of a valid XML document conforming to the DTD in Figure 2.2. Attributes provide a mechanism to associate simple properties with XML elements. Each attribute declaration includes name, type, and default information. The attribute type can be one of the following: CDATA, ID, IDREF, IDREFS, ENTITY, ENTITIES, NOTATION, ENUMERATION, NMTOKEN, or NMTOKENS. CDATA attributes can contain character strings of any length, like the month attribute of the element article in our example. An element can have at most one attribute of type ID. This attribute must be assigned a value that is unique in the context of the given document. The ID value can be referenced by an attribute of type IDREF in the same document.
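As a minimal sketch of this mechanism (the id and writtenBy names below are invented for illustration), a document can declare an ID attribute on author and reference it from article:

<!ATTLIST author id ID #REQUIRED>
<!ATTLIST article writtenBy IDREF #IMPLIED>

<author id="a1"><lastname>Chawathe</lastname></author>
<article writtenBy="a1" type="XML">...</article>

Here writtenBy="a1" resolves to the single author element whose id attribute carries the unique value "a1".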
In a sense, the ID-IDREF pairs in XML play the same role as primary key-foreign key associations in the relational model. A value for an attribute of type IDREFS is a series of IDREF references of unspecified length. The other attribute types are not of particular significance to our study.

2.3 XML Query Languages

Data represented in XML can be utilized by many applications. However, XML data is useful only if the information can be effectively extracted from an XML document according to specified conditions. W3C is currently coordinating the process of creating a query language for XML. At IWiz, XML being our underlying data model, we obviously required a powerful query language to query our data sources, which are XML documents.

In the database community, there has been an evolution from relational databases through object-oriented databases to semistructured databases, but many of the principles have remained the same. From the semistructured community, three languages have emerged aimed at querying XML data: XMLQL [19], YATL [20, 12] and Lorel [21, 22]. The document processing community has developed models of structured text and search techniques such as region algebra [23]. From this community, one language that has emerged for processing XML data is XQL [24, 25]. The main points of the latest version of the requirements for an XML query language, as put down by the World Wide Web Consortium on August 15, 2000 [26], are as follows:

* The XML Query Language must support operations on all data types represented by the XML Query Data Model.

* The XML Query Language must be able to combine related information from different parts of a given document or from multiple documents.

* The XML Query Language must be able to sort query results.

* The relative hierarchy and sequence of input document structures must be preserved in query results.

* Queries must be able to transform XML structures and create new XML structures.

* Queries should provide access to the XML schema or DTD, if there is one.

* Queries must be able to perform simple operations on names, such as tests for equality in element names, attribute names, and processing instruction targets, and to perform simple operations on combinations of names and data.

2.4 Why We Chose XMLQL as Our Query Language

The simplest XMLQL queries extract data from an XML document. Our example XML input is in the document "bib.xml" shown in Figure 2.3, and we assume that it contains bibliography entries that conform to "bib.dtd", which is shown in Figure 2.4.

<!DOCTYPE bib SYSTEM "bib.dtd">
<bib>
  <book year="1995">
    <!-- A good introductory text -->
    <title>An Introduction to Database Systems</title>
    <author><lastname>Date</lastname></author>
    <publisher><name>Addison-Wesley</name></publisher>
  </book>
  <book year="...">
    <title>Foundations for Object/Relational Databases</title>
    <author><lastname>Date</lastname></author>
    <author><lastname>Darwen</lastname></author>
    <publisher><name>Addison-Wesley</name></publisher>
  </book>
  <book year="1999">
    <title>Data on the Web: from Relations to Semistructured Data</title>
    <author><firstname>Serge</firstname><lastname>Abiteboul</lastname></author>
    <author><firstname>Peter</firstname><lastname>Buneman</lastname></author>
    <author><firstname>Dan</firstname><lastname>Suciu</lastname></author>
    <publisher><name>Morgan-Kaufman</name></publisher>
  </book>
  <article year="1999" type="inproceedings" month="June">
    <author><lastname>Date</lastname></author>
    <author><firstname>Mary</firstname><lastname>Fernandez</lastname></author>
    <author><firstname>Alin</firstname><lastname>Deutsch</lastname></author>
    <author><firstname>Dan</firstname><lastname>Suciu</lastname></author>
    <title>Storing Semi-structured Data Using STORED</title>
    <booktitle>ACM SIGMOD</booktitle>
  </article>
  ...
</bib>

Figure 2.3: An XML Document "bib.xml"

<?xml encoding="US-ASCII"?>

<!ELEMENT bib (book|article)*>
<!ELEMENT book (title, author+, publisher, isbn?)>
<!ATTLIST book year CDATA #REQUIRED>
<!ELEMENT article (author+, title, booktitle?, (shortversion|longversion)?)>
<!ATTLIST article type CDATA #REQUIRED
                  year CDATA #REQUIRED
                  month CDATA #IMPLIED>
<!ELEMENT publisher (name, address?)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (firstname?, lastname)>
<!ELEMENT firstname (#PCDATA)>
<!ELEMENT lastname (#PCDATA)>
<!ELEMENT booktitle (#PCDATA)>

Figure 2.4: An XML DTD "bib.dtd" for the Document Shown in Figure 2.3

This DTD specifies that a book element contains one or more author elements, one title, and one publisher element and has a year attribute. An article is similar, but its year element is optional, it omits the publisher, and it contains one shortversion or longversion element. An article also contains a type attribute. A publisher contains name and address elements, and an author contains an optional firstname and one required lastname. We assume that name, address, firstname, and lastname are all CDATA, i.e., string values. XMLQL uses element patterns to match data in an XML document. The following example produces all authors of books whose publisher is Addison-Wesley in the XML document bib.xml.
Any URI (uniform resource identifier) that represents an XML-data source may appear on the right-hand side of IN.

WHERE
  <bib><book>
    <publisher><name>"Addison-Wesley"</></>
    <title>$t</>
    <author>$a</>
  </></> IN "bib.xml"
CONSTRUCT $a

Figure 2.5: An XMLQL Query Requesting Author of Books Published by Addison-Wesley

The query shown in Figure 2.5 matches every <book> element in the XML document "bib.xml" that has at least one <title> element, one <author> element, and one <publisher> element whose <name> element is equal to Addison-Wesley. For each such match, it binds the variables $t and $a to every title and author pair. Note that variable names are preceded by $ to distinguish them from string literals in the XML document (like Addison-Wesley). An initial draft of XMLQL has been submitted to W3C and kept as a note for further discussion [27]. We refer the reader to that paper for background on XMLQL and other query languages for semistructured data.

XMLQL takes a database view, as opposed to a document view, of XML and provides functionalities for integrating, transforming, cleaning, and aggregating XML data. Data extraction, conversion, transformation, and integration are all well-understood database problems. Their solutions rely on a query language, either relational (SQL) or object-oriented (OQL). Unlike relational or object-oriented data, XML is semistructured. XMLQL is a query language for XML and is suitable for performing the above tasks. After doing some case studies and examining the various options available to us, we selected XMLQL as the query language for IWiz. XMLQL has the following features: First, it can extract data from existing XML documents and construct new XML
We broadly classified all our queries into six categories, depending on whether they involve a join, whether they have filters, whether they have nesting, and whether they have recursive definitions. Various combinations of these categories are also possible, which results in numerous different types of queries. Please also note that some of the syntactic sugar that would otherwise aid a human in writing queries with ease was kept and some was removed. Since, in our case, the query generation process is automated by the Query Browsing Interface, we could do this and thereby narrow down the different types of queries to be handled. This put an upper bound on the magnitude of the problem. Following is a detailed description of each of the categories:


2.5.1 Category I: Simple Query with No Joins, No Filters and No Nesting

A sample category I query is shown in Figure 2.6. $t and $a are bound variables, bound respectively to the title and author elements of book.


WHERE
  <bibliography>
    <book>
      <title>$t</>
      <author>$a</>
    </>
  </> IN "bib.xml"
CONSTRUCT
  <result>
    <title>$t</>
    <author>$a</>
  </>


Figure 2.6: Sample Query of Category I




The above query in the English language would be "Extract all the book tuples

within bibliography and place each of the tuples' title and author elements within the

newly constructed tag."




2.5.2 Category II: Simple Query with Filters and Without Joins and Nesting

There are two sub-categories under this: one without tag variables and the other involving tag variables. The first query is shown in Figure 2.7. In this query, $y is a bound variable bound to the year attribute of book. There is an explicit condition, or explicit filter, on $y stating that all years should be greater than or equal to 1995. There is an implicit filter on title stating that only tuples with 'Database Systems' as the title should be picked out. The <PCDATA> tag within the <author> element tag signifies that there is no sub-structure within author, i.e., there are no child elements of author and it contains only Parsed Character Data (PCDATA), or textual content. Thus, $a is bound to the textual content within the <author> tag. The <PCDATA> tag is provided so that a filter can be applied directly to the content within the author tag.


WHERE
  <bibliography>
    <book year=$y>
      <title>"Database Systems"</>
      <author><PCDATA>$a</></>
      <publisher>$p</>
    </>
  </> IN "bib.xml",
  $y >= 1995,
  $a like "Mar*",
  text($p) like "*Wesley"
CONSTRUCT
  <result>
    <author>$a</>
    <title>"Database Systems"</>
  </>


Figure 2.7: Sample Query of Category II without Tag Variables




Here, the 'like' operator has been used to pull out only those tuples where the authors' names begin with 'Mar'. The alternative to using the <PCDATA> tag is to use the text() function. In the above query, text($p) means the PCDATA of $p, which is bound to the child element 'publisher' of book. The other operators are relational operators like '<', '>', '!=', '=' and logical operators like 'and', 'or', etc.

The above query in the English language would be "Extract all the book tuples within bibliography where the title of each is 'Database Systems', the year of publication is greater than or equal to 1995, the publisher's name ends with 'Wesley' and the authors' names begin with 'Mar', and place each of the tuples' title and author elements within the newly constructed <result> tag."


WHERE
  <bib>
    <$p>
      <title>$t</>
      <$e>$a</>
    </>
  </> IN "bib.xml",
  $e IN {author, editor},
  $a = "Lamb, G.D."
CONSTRUCT
  <result>
    <$p>
      <title>$t</>
      <$e>$a</>
    </>
  </>

Figure 2.8: Sample Query of Category II with Tag Variables


WHERE
  <bibliography>
    <book>
      <title>$t</>
      <author><PCDATA>$a</></>
    </>
    <article>
      <title>$t1</>
      <editor><PCDATA>$a</></>
    </>
  </> IN "bib.xml"
CONSTRUCT
  <result>
    <book>
      <title>$t</>
      <author>$a</>
    </>
    <article>
      <title>$t1</>
      <editor>$a</>
    </>
  </>

Figure 2.9: Sample Query of Category III



The second query is shown in Figure 2.8. $p and $e are tag variables. $e has a

filter stating that it can either be an author or an editor. The way $p has been placed, it











suggests that all child elements within bib that have a <title> and a <$e> tag, where $e itself is either an <author> or an <editor>, should be searched.

The above query in the English language would be "Extract all the tuples within bibliography where the author or editor of each is 'Lamb, G.D.' and place each of the tuples' title and author/editor elements within the newly constructed <result> tag."


2.5.3 Category III: Simple Query with an Implicit Join

A sample category III query is shown in Figure 2.9. If that query is looked at carefully, $a is bound to two elements, first to the author of book and second to the editor of article. This straightaway implies a join, what we call an implicit join.


WHERE
  <bibliography>
    <article>
      <editor><PCDATA>$e</></>
      <title>$t</>
    </>
    <book>
      <author><PCDATA>$a</></>
    </>
  </> IN "bib.xml",
  $e = $a
CONSTRUCT
  <result>
    <editor>$e</>
    <title>$t</>
  </>


Figure 2.10: Sample Query of Category IV




The above query in the English language would be "Extract all the tuples of

books as well as articles within bibliography respectively whose authors and editors are

same and place each of the tuples' title and author/editor elements within the new

constructed /
tag which in turn within the newly constructed

tag."











2.5.4 Category IV: Simple Query with an Explicit Join

A sample category IV query is shown in Figure 2.10. In this query, the two bound variables $e and $a are equated to each other, which implies that their parents are being joined on them.

The above query in the English language would be "Extract all the tuples of articles within bibliography at least one of whose editors has authored a book, and place each of the tuples' title and editor elements within the newly constructed <result> tag."


Figure 2.11: Sample Query of Category V




2.5.5 Category V: Nested Query

A sample category V query is shown in Figure 2.11. In this query, the WHERE clause is nested within the CONSTRUCT clause at two places. The idea is to achieve the same effect as that of the 'group by' clause in SQL. Also, note the three 'IN' clauses: one is "bib.xml" but the other two are $book, which means title and author are queried within the book tuple bound to $book. As with other queries, nested queries too can have filters and joins on their variables, thereby generating a spectrum of different queries that have to be handled.

The above query in the English language would be "Extract all the tuples of book within bibliography and group all the authors in each tuple. Place all the grouped authors within a newly constructed tag, place each book tuple with its title and all authors within another newly constructed tag, and finally place all the book tuples within a newly constructed root tag."


CONSTRUCT
  <result>
    WHERE
      <*>
        <part*>
          <name></> element_as $name
        </>
      </> IN "Parts.xml"
    CONSTRUCT $name
  </>


Figure 2.12: Sample Query of Category VI



2.5.6 Category VI: Recursive Queries

A sample category VI query is shown in Figure 2.12. In this case, 'part' is searched for at all levels in the XML document, and within 'part', name is searched for at any level below it. All such tuples are bound to $name; 'element_as' binds the entire tuple to $name.








The above query in the English language would be "Extract all the tuples where

'name' occurs at any level within 'part' which in turn occurs at any level in the document

hierarchy and place all of them within one root tag"
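To make the semantics of this recursive query concrete, the following minimal sketch computes the same answer imperatively over a DOM tree. It is illustrative only and not part of QRE, and the structure of "Parts.xml" is an assumption:

    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.*;
    import java.util.*;

    // Sketch: find every <name> anywhere below a <part> element, where the
    // <part> element may itself be nested at any depth in Parts.xml.
    public class RecursiveSearch {
        public static void main(String[] args) throws Exception {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse("Parts.xml");
            List<String> names = new ArrayList<String>();
            collect(doc.getDocumentElement(), false, names);
            for (String n : names) System.out.println(n);
        }

        // insidePart becomes true once we have passed through at least one <part>.
        static void collect(Node node, boolean insidePart, List<String> names) {
            if (node.getNodeType() == Node.ELEMENT_NODE) {
                String tag = node.getNodeName();
                if (insidePart && tag.equals("name")) {
                    names.add(node.getTextContent());
                }
                if (tag.equals("part")) insidePart = true;
            }
            for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
                collect(c, insidePart, names);
            }
        }
    }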
















CHAPTER 3
OVERVIEW OF INTEGRATION APPROACHES AND PROTOTYPES

Systems for the integration of information, and in particular of structurally and semantically heterogeneous information, continue to receive much attention from the research community the world over [28, 29, 30, 31, 32, 33].


Figure 3.1: An Integration System



The subject of this thesis, query rewriting in multi-source information systems, is realized by an architectural component of IWiz, an integration system for semistructured data. The goal of the IWiz system is to allow users to query a variety of sources through one common interface providing a naturally "integrated" global view of the application domain, to store the query results in a persistent warehouse for continued usage, and to provide on-demand querying for greater flexibility. Before introducing the proposed











architecture for IWiz, we present research aspects of integration systems relevant to our work and give an overview of similar research projects.




3.1 Different Approaches to Integration

Most information integration system architectures conform to one of two design approaches: the data warehousing approach or the mediation approach.


Figure 3.2: The Data Warehousing Approach



3.1.1 The Data Warehousing Approach

A sample system employing the data warehousing approach is shown in Figure

3.2. The data warehousing scheme assumes the presence of a single centralized data storage

facility, which physically holds a copy of data from multiple sources. The data in a

warehouse conform to a certain schema, usually called a global schema. When a new











source becomes available to the warehouse, the source data must be processed by the

wrapper component to conform to the global warehouse schema. The data is then

combined with the existing data in the warehouse by the data merging component. All

data requests are processed directly by the warehouse resulting in faster response times,

but creating the possibility of stale data.


Figure 3.3: The Mediation Approach



This approach is also known as the local-as-view (as opposed to global-as-view) approach, in which each local source is defined as a "view" of the global schema. Note that, using this approach, it is not possible to map a query against the global, integrated schema into one or more queries against the underlying sources; mappings are one-way mappings going from the source(s) into the global schema (bottom-up). It is an information push model, as shown in Figure 3.2, where the pre-defined information is "pushed" into the data warehouse at pre-defined times, e.g., when the source data has changed. Thus the name "bottom-up information push." It is also known as the eager approach, where data integration occurs in a separate materialization step, before the actual user queries. The Florid project [34] employs this approach.



3.1.2 The Mediation Approach

A sample system employing the mediation approach is shown in Figure 3.3.

Systems based on the mediation approach do not retrieve data from the sources until the

data is requested. The user query is decomposed by the mediator component-- a software

module responsible for creating a virtual integrated view of the data sources in the

system. The mediator determines which data sources contain relevant information and

queries those sources. The mediation approach guarantees that the retrieved data are

always up to date. However, accessing distributed sources and integrating results before

presenting them to the user can take considerably longer than accessing data in a

warehouse. This approach is also known as the lazy approach (or even on-demand or

virtual approach), i.e., the queries are unfolded and rewritten at runtime as they flow

downwards from the user to the sources. The TSIMMIS Project at Stanford [28, 35, 36,

37] and the MIX project at the University of California at San Diego [38] employ this

approach.

Both the mediation and warehousing architectures feature an integrated view of all system data, i.e., a "virtual relation" that incorporates concepts represented in the underlying sources. The integrated view can be constructed in a "bottom-up" or "top-down" fashion [39] using the warehousing or the mediation approach, respectively. Each approach has its own share of pros and cons. The objective of the warehousing approach is to build an exact union of information from the underlying source schemas. In the latter case, the integrated view attempts to encompass all information relevant to the given knowledge domain. Consequently, the integrated schema may represent only a subset of source information when the mediation approach is used. One major advantage of the warehousing approach is the short query response time, whereas in the mediation approach the response time is high because the sources are queried every time a query is received. On the other hand, the data received in the mediation approach are always fresh and the most up-to-date, since the sources are queried as and when queries are received, whereas in the warehousing approach, information is updated only from time to time.


Figure 3.4: The Hybrid Approach


The warehousing and mediator approaches have been successfully used in research integration systems [35, 40, 36] as well as in commercial applications for data integration. The comparative analysis of Hull [39] shows strengths and limitations of both methodologies and identifies future research challenges.



3.1.3 The Hybrid Approach

We, here at the University of Florida Database Research and Development

Center, envision IWiz as an integration system that provides uniform access to the

multiple heterogeneous sources using a hybrid architecture as shown in Figure 3.4. The

hybrid architecture employs the better features of both the approaches, the mediation as

well as the warehousing approach in an effort to achieve both flexibility and shorter

query response time. Query results are cached in the warehouse and can be retrieved

efficiently in the case of repeated or similar requests. The replacement policy guarantees

that current data always replaces the older information. Each data item is assigned a time

interval when the data can be regarded as valid, after this time the information has to be

retrieved directly from the source. The H20 project [33] was the first system to employ
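A minimal sketch of such a validity-interval cache entry follows; the class and method names are illustrative, not IWiz's actual implementation:

    import java.util.HashMap;
    import java.util.Map;

    // Sketch of the hybrid approach's cache policy: each cached result carries a
    // validity interval; once it expires, the data must be re-fetched from the sources.
    public class ResultCache {
        static class CachedResult {
            final String xmlResult;
            final long expiresAt;
            CachedResult(String xmlResult, long validMillis) {
                this.xmlResult = xmlResult;
                this.expiresAt = System.currentTimeMillis() + validMillis;
            }
            boolean isValid() { return System.currentTimeMillis() < expiresAt; }
        }

        private final Map<String, CachedResult> cache = new HashMap<String, CachedResult>();

        // Returns the cached XML result, or null if it is absent or stale, in which
        // case the caller must query the sources via the mediator.
        public String lookup(String queryKey) {
            CachedResult r = cache.get(queryKey);
            return (r != null && r.isValid()) ? r.xmlResult : null;
        }

        // Newer data always replaces older information for the same query.
        public void store(String queryKey, String xmlResult, long validMillis) {
            cache.put(queryKey, new CachedResult(xmlResult, validMillis));
        }
    }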

The detailed description of IWiz follows in the next chapter.





3.2 Integration System Prototypes

Many research efforts in recent years have been directed towards designing systems for the integration of heterogeneous data. In this section, we focus on the prototypes being developed around the world that are relevant to IWiz and in particular to the Mediator component and QRE. Since the concept of query rewriting assumes its role and significance only when the mediation approach is employed, we mainly discuss systems using the mediation approach.











3.2.1 The TSIMMIS Project

TSIMMIS stands for "The Stanford-IBM Manager of Multiple Information

Sources." This project has been implemented at the Computer Science Department,

Stanford University [37] and follows the Mediation or the lazy approach, i.e., queries are

unfolded and rewritten at runtime as they flow downwards from the user to the sources. It

offers a data model called the Object Exchange Model (OEM) and a common query

language LOREL that are designed to support the combining of information from many

different sources. Above each source, there is a translator that logically converts the

underlying data objects to the common information model. To do this logical translation,

the translator converts queries over information in OEM into requests that a source can

execute, and it converts the data returned by the source back to OEM. OEM is a self-

describing (or tagged) object model. The idea is that all objects, and their sub-objects,

have labels that describe their meaning. Above the translators lie the mediators that refine

the information from the sources. A mediator embeds in itself the knowledge that is

necessary for processing a specific type of information and takes upon itself the task of

query rewriting. The mediator also processes answers before forwarding them to the user

and exports an interface to the client that is identical to that of the translators.



3.2.2 The MIX Project

The MIX project (Mediation of Information using XML) [38, 41] is being

implemented at the University of California at San Diego Database Laboratory. As the name

suggests, it uses the Mediation approach and has XML as the underlying data model.

They have developed their in-house query language called XMAS and are developing









wrapping technologies that allow an information source (which may be a relational database, a collection of HTML pages, or even a legacy information system) to be viewed logically as a large XML source. The wrappers are able to translate XMAS queries into queries or

commands that the underlying source understands. They are also able to translate the

result of the source into XML. They call their mediator MIXm, which integrates the

information from multiple sources. XMAS is used as a view definition language. They

have a Blended Browsing and Querying component called BBQ, which is driven by the XML DTDs of the mediator view and guides the user in formulating complex queries.

The MIX mediator comprises several modules to accomplish the integration; its main

inputs are XMAS queries generated by BBQ, and the mediator view definition (which is

in XMAS) for the integrated view. The mediator view definition has to be provided by

the "mediation engineer" and prescribes how the integrated data combines the wrapper

views. The resolution module resolves the user query with the mediator view definition,

resulting in a set of unfolded XML queries that refer to the wrapper views. These queries

can be further simplified based on the underlying XML DTDs. The DTD inference

module can be used to automatically derive view DTDs from source DTDs and view

definitions, thereby supporting the integration task of the mediation engineer.



3.2.3 The TUKWILA Project

The Tukwila project [42] is being implemented at the Computer Science

Department, University of Washington, and it uses the mediation approach. A mediated

schema is created to represent a particular application domain and data sources are

mapped as views over the mediated schema. The user asks a query over the mediated

schema and the data integration system reformulates this into a query over the data









sources and executes it. The system has a highly efficient query reformulation algorithm,

MiniCon, which maps the input query from the mediated schema to the data sources.

Next, interleaved planning and execution with partial optimization are used to allow

Tukwila to process the reformulated plan, quickly recovering if decisions were based on

inaccurate estimates. During execution, Tukwila uses adaptive query operators such as

the double pipelined hash join, which produces answers quickly, and the dynamic

collector, which robustly and efficiently computes unions across overlapping data

sources. Since the system represents the data sources as views over the mediated schema,

this enables the addition of new data sources with very little human intervention. The

problem of translating the query into queries over the data sources (query reformulation)

is NP-Complete even for conjunctive queries. As an answer to this problem, the project

has a scalable algorithm for query reformulation (which is equivalent to the problem of

answering queries using views). Its feasibility has been experimentally proven even for a

large number of data sources.



3.2.4 The FLORID Project

The Florid project [34] is being implemented at the Institute of Computer Science, University of Freiburg, Germany. A predefined collection of terms and their relationships, in the form of an ontology, serves as the mediated integration view and makes the problem of integration easier. Their goal is to develop and implement an ontology-based information integration system which, in addition, is built upon standard LDAP technology. Using a simple, coherent and uniform LDAP model as a middleware data model promises to allow a seamless integration of source data, schema discrepancies, and semantic information under a common framework that is, by design, able to reconcile integration and data processing issues.



3.2.5 The MOMIS Project

The MOMIS project (Mediator envirOnment for Multiple Information Sources)

[43] is a mediator-based integration system for structured and semistructured data. It was developed as a joint collaboration between the University of Modena and Reggio Emilia and the Universities of Milano and Brescia. They have a common thesaurus that plays the role of a shared ontology; it is built by extracting terminological relationships from source schemas. The wrapper components translate heterogeneous source schemas into a common object-oriented model. The translation is based on relationships found in the common thesaurus. Source schemas are analyzed, and the mediator-integrated view is (semi-)automatically constructed based on relationships between the source schemas' concepts. The system uses extensive query optimization techniques in order to ensure effective data retrieval in a distributed environment.

This concludes the discussion on related research dedicated to data integration

systems and in particular, mediator-based systems. Next, we briefly describe the IWiz

architecture and roles of its major structural components.














CHAPTER 4
THE IWIZ ARCHITECTURE

The overall architecture of IWiz is shown in Figure 4.1 [33, 44, 45, 46].


Figure 4.1: Information Integration Wizard (IWiz) Architecture


4.1 IWiz Overview

IWiz aims at integrating structurally and semantically heterogeneous, overlapping or complementary, incomplete information from multiple disparate data sources. It uses the hybrid approach, which was discussed in Chapter 3. As shown in Figure 4.1, it









consists of five software components, namely the Query Browsing Interface (QBI), the Warehouse Manager (WHM), the Query Rewriting Engine (QRE), the Data Merge Engine (DME), and the Data Restructuring Engine (DRE). A short description of each of these components follows in Section 4.3. The QRE, the topic of this thesis, and the DME are both part of a larger component, the Mediator. Although the end user does not care what is happening behind the GUI he interacts with, from a high-level perspective of the system as a whole, it is easier for the people designing, building, and administering the system to visualize QRE and DME together as one single component, a component that splits the queries and merges the results. Thus, QRE splits the queries and generates a query plan that is used by the DME to merge the results.

Apart from these software modules, there is the Warehouse, which acts as the IWiz repository and beneath which runs the Oracle 8i Database Engine. Each of the following groups lies in the same address space:

* QBI and WHM with the Warehouse.

* The Mediator comprising of QRE and DME.

* Each source with its corresponding DRE.

The focus of the project is to provide integrated access to semistructured sources, a user-definable view of the integrated data, warehousing of frequently accessed data for faster retrieval, and on-demand querying for greater flexibility.












4.2 The Ontology Schema

The end-user is shown a global view of the application domain. The schema used to represent this global view is referred to as the 'Ontology schema', which consists of the 'concepts' describing the domain. It is created by the super-user, the administrator of the IWiz system, who has knowledge about the application domain as well as the participating data sources. The sources' schemas are mapped into the Ontology schema. The user has access only to this integrated view of all the data contained in the system. Using this schema, the user formulates queries over the system data and expresses filters and joins on different concepts that are present in a multitude of sources. Since we at IWiz chose XML as our underlying data model, the obvious choice for schema representation was XML DTDs. The Ontology schema is represented using a DTD. For our experimental and testing purposes we chose data sources containing bibliographic data and consequently came up with an ontology schema (DTD) comprising the concepts that describe this application domain.




4.3 IWiz Components

A short description of each software component follows:



4.3.1 The Query Browsing Interface (QBI)

The QBI is the only part of the system shown to the end user. It displays the global view of the application domain, commonly referred to as the 'Ontology schema', which consists of the 'concepts' describing the domain. The sources' schemas are mapped into the Ontology schema. The QBI allows the user to examine an integrated view of all the data contained in the system without being aware of the fact that the data are physically stored in different locations, in different formats, and under different schemas. The interface displays the global schema in a tree-like form. Since QBI is a graphical user interface, the user can select or 'click' on the concepts he requires data on, specify filters on them, specify joins between different concepts, etc. The user is not assumed to be familiar with any specific query language or to know any details about the number, format, or contents of the underlying sources participating in IWiz. The QBI then generates an XMLQL query and sends it to the WHM for further processing.



4.3.2 The Warehouse Manager (WHM)

Our goal at IWiz is to provide an architecture that stores the answers to frequently asked queries in the warehouse, which basically acts as a cache. It is the Warehouse Manager that decides whether a query should be sent to the warehouse or, alternatively, to the mediator that queries the sources, in case the desired information is not available in the warehouse. It maintains a query log that it updates every time a query is received, checking for any previous occurrence of a related query. Since the warehouse is an Oracle database, it is one of the WHM's tasks to convert the XMLQL query to an SQL query in order to query the database, and to put a hierarchical structure, conforming to the Ontology, on the results returned, so that they can be shown as XML documents. The WHM also runs some maintenance queries from time to time to update the information present in the relational tables in the warehouse. Finally, the WHM also receives the results from the Data Merge Engine, loads the required parts of the results into the warehouse, and sends the result to the QBI for display to the user. The browsing interface and the warehouse component are assumed to reside in the same address space and therefore, interaction between those modules is quite efficient.
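To illustrate the restructuring half of this task, the following minimal sketch wraps flat relational rows into a hierarchical XML structure. The table and column names (BOOK, title, year) are hypothetical, since the actual warehouse schema is derived from the ontology DTD:

    import java.sql.*;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.*;

    // Sketch: give flat relational rows a hierarchical structure conforming to
    // an ontology-like <bibliography><book>... shape.
    public class ResultStructurer {
        public static Document booksToXml(Connection conn) throws Exception {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element bib = doc.createElement("bibliography");
            doc.appendChild(bib);
            Statement st = conn.createStatement();
            ResultSet rs = st.executeQuery("SELECT title, year FROM BOOK");
            while (rs.next()) {
                Element book = doc.createElement("book");
                book.setAttribute("year", rs.getString("year"));
                Element title = doc.createElement("title");
                title.appendChild(doc.createTextNode(rs.getString("title")));
                book.appendChild(title);
                bib.appendChild(book);
            }
            rs.close();
            st.close();
            return doc;
        }
    }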



4.3.3 The Query Rewriting Engine

If the warehouse is unable to satisfy a query, the query is sent to the Mediator, wherein the Query Rewriting Engine (QRE) handles it. QRE parses the input query and generates a parse tree. Each node of this parse tree is decorated with the relevant information. QRE has two kinds of knowledge embedded in itself: first, which sources, amongst the multiple data sources, have information on each concept in the Ontology schema; second, which joinable items can be used to perform a join on two different concepts in the Ontology schema. Using its knowledge base, QRE rewrites the parse tree into several source-specific queries. Each source-specific query is sent to its corresponding wrapper, where the DRE handles it. At the core of the engine runs an algorithm that also decides all the permutations and combinations of the sources which, when queried together, will yield a Full Result. (Full Result is a concept coined by us and explained in detail in the next chapter; in short, it is a result that fully satisfies the query, i.e., what has been asked for by the user.) QRE simultaneously also generates a Query Plan, which provides information on how to join the results from each source and form all the full results. This query plan is sent to the Data Merge Engine. Since QRE is the main topic of this thesis, it is discussed in complete detail in the following chapters.
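A minimal sketch of such a decorated parse-tree node is given below; the field names are illustrative and do not reproduce QRE's actual class layout:

    import java.util.ArrayList;
    import java.util.List;

    // Sketch: each node corresponds to a concept referenced in the query and is
    // decorated with the sources that hold data for it and the joinable items
    // usable at its level.
    public class QueryTreeNode {
        String concept;                  // e.g., "Ontology.bib.book.title"
        String variable;                 // XMLQL variable bound here, e.g., "$t"
        List<String> sources = new ArrayList<String>();       // sources holding this concept
        List<String> joinableItems = new ArrayList<String>(); // e.g., "ISBN", "Book-id"
        List<QueryTreeNode> children = new ArrayList<QueryTreeNode>();

        QueryTreeNode(String concept, String variable) {
            this.concept = concept;
            this.variable = variable;
        }

        void addChild(QueryTreeNode child) { children.add(child); }
    }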



4.3.4 The Data Restructuring Engine

The DRE component serves as an interpreter between the sources and the rest of the system. It is the only component that has direct access to the source data. The DRE component is source-specific, which means that there is exactly one DRE associated with each source. The two major phases in the DRE operation are the run-time and build-time phases. The DRE determines the correspondence between concepts represented in the global schema and those in the source schema at build-time. Further, the mappings are augmented with translation procedures for each data element and verified by an expert user. The mappings are defined once and not changed during system operation unless the source or ontology schemas are altered. During run-time, QRE sends the source-specific queries to each DRE which, using the mappings, converts the received query to source schema terms. It then extracts the information from the source document, transforms it to the ontology terms, and finally sends it to the Data Merge Engine for the merging of all the results.



4.3.5 The Data Merging Engine

The query plan generated by QRE is sent to the Data Merge Engine (DME), which executes it to form full results. Once all the full results are collected, the DME, also a part of the Mediator, finds the closeness among concepts using distance functions, removes duplicates and redundancies among the tuples of data using heuristics, and then finally merges all the full results into one big result. It sends this result to the WHM.

All major components of IWiz are currently under development at the Database

Research and Development Center at the University of Florida. The integrated system

prototype is expected to become functional in 2001. Based on this understanding of

relationships between IWiz components, and especially the role of QRE in the process of

data integration, we now proceed to a detailed description of QRE functionalities and our








implementation by starting with the details of the Join Sequencing Algorithm and the

process of Result Generation.














CHAPTER 5
THE JOIN SEQUENCING ALGORITHM AND FULL RESULT GENERATION
PROCESS

As pointed out in the previous chapter, the Query Rewriting Engine is invoked only when the Warehouse Manager decides to send the query to the Mediator. It does this when the desired information is not available in the warehouse, when the available information has gone stale and fresh information directly from the sources is therefore required, or when the latest copy of the data is explicitly requested directly from the sources.




5.1 The Query Rewriting Process Overview

QRE queries multiple sources, each containing information that may be related, overlapping or complementary, incomplete, and not necessarily useful to the user just by itself. QRE's task is to completely satisfy a query. For this, QRE needs location information for each concept in the Ontology schema, which means that it needs to know all the sources where each concept of the Ontology occurs. It also needs information on how to join the different concepts in the ontology schema. Both the joinable data item information and the location information are collected before any query is sent to QRE. QRE then finds out which sources are to be queried for the items requested in the query (what the term 'attribute' is to relational terminology, the term 'item' is to IWiz terminology; the term 'item' for us encompasses both attributes and elements, both concepts stemming from the XML paradigm) and how to join the results coming in from each source. It tries to find as many such combinations as possible that satisfy a query fully. Once that is found out, QRE rewrites the input WHM query into several source-specific queries. While doing this, it also generates a query plan for the Data Merge Engine to join all the results from each source. In this chapter, we mainly focus on how the join sequencing algorithm works, what each source is queried for, and how results that can satisfy a query are generated.




5.2 The Concept of a Full Result

Simply put, a full result is one that completely satisfies a query, one that contains data pertaining to all the queried items in a single query. A full result source is a source that returns a full result. Since the sources' schemas are mapped into the Ontology schema, not all sources will contain data for all the items asked for in the query. From a single query's point of view, each source can return an empty result, a partial result, or a full result for that query. Thus, the Query Rewriting Engine's first task, as it gets a query, is to classify all the sources into the above-mentioned three categories. The sources that can return a full result can be sent the query outright. For example, if there are two sources, say source S1 and source S2, and both contain data on all the concepts requested in the query, then both these sources can be queried separately and individually. Two full results would be returned, one from each source. But if the sources return partial results, then they cannot be queried outright. For example, suppose there are two sources, say S3 and S4, and the user, using the ontology schema, chooses to query for all books with their titles, years of publication, and authors. Let us further assume that source S3 contains information only on titles, and source S4 contains information on authors and years. Now, even if the query is rewritten to query source S3 only for titles and S4 for authors and years, the results from these two sources cannot simply be unioned or fused together. The data would not make sense unless the tuples from one source are somehow joined to the tuples from the other source. Thus, some kind of information on joinable data items is required too. Extending our previous example, if sources S3 and S4 both also contain ISBN, then the results from both sources can be joined on ISBN, ISBN being a unique attribute (a joinable data item). Now, with that background, let us explore in full detail all the possibilities of when a full result can be formed and when it cannot.
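To make the S3/S4 example concrete, here is a minimal sketch of joining the two partial results on ISBN. The tuple representation (a map from item name to value) is an illustrative assumption, not QRE's actual representation:

    import java.util.*;

    // Sketch: S3 contributes titles, S4 contributes authors and years, and both
    // were additionally queried for ISBN, the joinable data item.
    public class JoinOnIsbn {
        public static List<Map<String, String>> join(
                List<Map<String, String>> fromS3,
                List<Map<String, String>> fromS4) {
            // Index S4 tuples by ISBN (a hash join).
            Map<String, Map<String, String>> byIsbn = new HashMap<String, Map<String, String>>();
            for (Map<String, String> t : fromS4) byIsbn.put(t.get("ISBN"), t);

            List<Map<String, String>> full = new ArrayList<Map<String, String>>();
            for (Map<String, String> t : fromS3) {
                Map<String, String> match = byIsbn.get(t.get("ISBN"));
                if (match != null) {                  // the tuples agree on ISBN
                    Map<String, String> merged = new HashMap<String, String>(t);
                    merged.putAll(match);             // title + author + year
                    full.add(merged);
                }
            }
            return full;
        }
    }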


WHERE
  <bibliography>
    <book>
      <title>$t</>
      <year>$y</>
      <author>$a</>
    </>
  </> IN Mediator
CONSTRUCT
  <result>
    <title>$t</>
    <year>$y</>
    <author>$a</>
  </>


Figure 5.1: Sample XMLQL Query Requesting Book Title, Year Published and Author



As concluded earlier, each source can return a full, a partial, or an empty result. Based on this, for a given query, the results from all the sources, when put together, can be:

1. Full

2. Empty

3. Partial

4. Full & Partial









5. Full & Empty

6. Partial & Empty

7. Full & Partial & Empty

The following case study is an exhaustive exercise presenting a variety of different source scenarios. We examined each of the above seven combinations and put forth our analysis and conclusions: analysis of how full results would be generated in each case and how the partial results would be joined, and conclusions as to how usable each case is and whether QRE uses it. Please note that, unless otherwise stated, we will use the query shown in Figure 5.1 for the analysis of all seven cases.
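The classification step itself reduces to simple set comparisons. The following minimal sketch, which assumes that queried items and source contents are represented as sets of item names (an illustrative choice), shows the idea:

    import java.util.*;

    // Sketch of QRE's first run-time task: classify each source as full, partial,
    // or empty with respect to the items requested in a query.
    public class SourceClassifier {
        enum Kind { FULL, PARTIAL, EMPTY }

        // queriedItems: items requested in the query, e.g., {title, year, author}.
        // sourceItems: for each source, the ontology items it has data on.
        static Map<String, Kind> classify(Set<String> queriedItems,
                                          Map<String, Set<String>> sourceItems) {
            Map<String, Kind> result = new HashMap<String, Kind>();
            for (Map.Entry<String, Set<String>> e : sourceItems.entrySet()) {
                Set<String> overlap = new HashSet<String>(e.getValue());
                overlap.retainAll(queriedItems);
                Kind kind = overlap.isEmpty() ? Kind.EMPTY
                          : overlap.containsAll(queriedItems) ? Kind.FULL
                          : Kind.PARTIAL;
                result.put(e.getKey(), kind);
            }
            return result;
        }
    }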



5.2.1 Case 1: Individual Source Results Are All Full Results

In this case, each source is queried individually. There is no need to perform any joins on the results; a union of the results is enough. The conclusion is that it is a usable case and the queries are sent to the sources as is. QRE does not have to rewrite the main query into source-specific sub-queries and does not have to figure out any joins.



5.2.2 Case 2: Individual Source Results Are All Empty Results

In this case, the sources are not queried at all. There is no need for any joining or unioning in the Mediator. The conclusion is that it is not a usable case. Since the sources are not queried, QRE does not have to do any rewriting, nor does it have to figure out any joins.









5.2.3 Case 5: Individual Source Results Are Both Full And Empty

In this case, only the sources returning a full result are queried. Since the empty

sources are ignored, this case becomes similar to Case 1. The conclusion is that it is a

usable case. Only the full result sources are queried. QRE does not have to rewrite the

main query into source-specific sub-queries and does not have to figure out any joins.



5.2.4 Case 3: Individual Source Results Are All Partial Results

There are two sub-cases under this case. In the first, all the sources put together do not contain all the items required to form a Full Result. In the second, all the sources together contain all the items required to form a Full Result.


Table 5.1: Scenario wherein two sources contain only one requested item

Source 1        Source 2
Title           Title


The first case further has two sub-cases of its own. In the first, none of the sources contain any joinable data item, as shown in Table 5.1. In the second, the sources do contain a joinable data item, say Book-id, as shown in Table 5.2.




Table 5.2: Scenario wherein two sources contain only one requested item along with a joinable data item

Source 1        Source 2
Title           Title
Book-id         Book-id












In both the above cases, regardless of whether any joinable data item is present, no full result can be generated, since all the sources even when put together cannot contribute all the items. In both cases, only title is present in the two sources, whereas the requested items are title, author and year for each book. So, it is not a usable case. None of the sources are queried. QRE does not have to rewrite the main query into source-specific sub-queries and does not have to figure out any joins.




Table 5.3: Scenario wherein all three sources together contain all the requested items but no joinable data items

Source 1        Source 2        Source 3
Year            Author          Title





The second case is when all the sources together contain all the items required to form a Full Result. This case has three sub-cases. In the first, none of the sources contain any joinable data item, as shown in Table 5.3. Since there is no way to join the tuples from all the sources, this case too is deemed not usable. Although all three sources together provide for a full result, since there is no joinable data item, a full result cannot be generated. None of the sources are queried. QRE does not have to rewrite the main query into source-specific sub-queries and does not have to figure out any joins.









Table 5.4: Scenario wherein all the sources together contain all requested items along with a common joinable data item

Source 1        Source 2        Source 3
Author          Title           Year
ISBN            ISBN            ISBN


In the second sub-case, the sources with partial results contain all the joinable data items required to form a Full Result. Two variations are possible here: one is shown in Table 5.4 and the other in Table 5.5.




Table 5.5: Scenario wherein all the sources together contain all requested items with the joinable data items required for a join

Source 1        Source 2        Source 3
Author          Title           Year
ISBN            ISBN            BOOK ID
                BOOK ID


The conclusion is that, since a full result can be generated using the joinable data items, it is a usable case. All the sources involved are queried. QRE has to figure out the joinable data items to be queried and the sequence in which the joins have to be carried out. It also has to rewrite the input query into source-specific sub-queries. The third sub-case is when the sources with partial results contain joinable data items, but there are no overlapping joinable data items, as shown in Table 5.6. The conclusion is that although the sources contain joinable data items, no full result can be generated. So, it is not a usable case. None of the sources are queried.









Table 5.6: Scenario wherein all the sources together contain all the requested items but do not contain overlapping joinable data items

Source 1        Source 2
Title           Author
                Year
BOOK ID         ISBN


QRE does not have to rewrite the main query into source-specific sub-queries and

does not have to figure out any joins.



5.2.5 Case 4: Individual Source Results Are Both Partial as well as Full

In this case, the query is sent to the full result sources as is, and the partial sources are dealt with in the way described in Section 5.2.4. Only the following sub-case brings out a slightly different scenario, so it is discussed here: partial sources, along with one or more full sources, contain all the joinable data items required to join the partial sources. This again has two possible scenarios.


Table 5.7: Scenario wherein source 1 yields a full result, and source 2 and source 3 yield partial results

Source 1        Source 2        Source 3
Author          Author          Title
Year            Year
Title
ISBN            ISBN
BOOK ID                         BOOK ID


In the first scenario, the partial sources by themselves do not contain any common joinable data items, as shown in Table 5.7. Source 2 and source 3 both contain joinable data items, but they are different. Thus, although source 1 yields a full result, it is also queried for ISBN and BOOK ID. This is because, in order to generate another full result involving source 2 and source 3, source 2 has to be joined to source 1 and source 3 also has to be joined to source 1, since the joinable data items of source 2 and source 3 overlap in source 1. The conclusion is that it is a usable case. The full result source is queried as is, and the joinable data items present in the full result source are used to join the sources yielding partial results.




Table 5.8: Scenario wherein source 3 yields a full result and source 1 and source 2 yield partial results

Source 1        Source 2        Source 3
Author          Title           Author
                                Year
                                Title
ISBN            BOOK ID         ISBN


In the second scenario, the full result source does contain a joinable data item, as shown in Table 5.8, but it does not overlap with one of the partial sources. The conclusion is that it is not a usable case. QRE does not have to rewrite the main query into source-specific sub-queries and does not have to figure out any joins.



5.2.6 Case 6: Individual Source Results Are Both Partial as well as Empty

In this case, the empty sources are not queried at all and the partial sources are dealt with in the way described in Section 5.2.4. Only the following sub-case brings out a slightly different scenario, so it is discussed here: either the sources yielding partial results or the source yielding an empty result may contain the joinable data items. There are two possible scenarios.




Table 5.9: Scenario wherein source 3 yields no result but provides the joinable data items

Source 1        Source 2        Source 3
Author          Title
Year
ISBN                            ISBN
                BOOK ID         BOOK ID


In the first scenario, the partial sources by themselves do not contain any common joinable data items, as shown in Table 5.9. Source 1 and source 2 both contain joinable data items, but they are different. Thus, although source 3 does not yield anything, it is still queried for ISBN and BOOK ID. This is because, in order to generate a full result involving source 1 and source 2, source 1 has to be joined to source 3 and source 2 also has to be joined to source 3, since the joinable data items of source 1 and source 2 overlap in source 3. The conclusion is that it is a usable case. The joinable data items present in the empty source are used to join the sources yielding partial results.




Table 5.10: Scenario wherein source 1 and source 2 both yield partial results and source 3 yields an empty result

Source 1        Source 2        Source 3
Author          Title
Year
ISBN
                BOOK ID         BOOK ID











So all the sources involved are queried. QRE has to figure out the joins and rewrite the input query into source-specific sub-queries. The second scenario is when the empty source has a joinable data item, as shown in Table 5.10, but it does not overlap with one of the partial sources. The conclusion is that it is not a usable case. None of the sources can be queried. QRE does not have to rewrite the main query into source-specific sub-queries and does not have to figure out any joins.


WHERE
  <bibliography>
    <book>
      <title><PCDATA>$t</></>
      <author>$a</>
    </>
    <article>
      <title><PCDATA>$t1</></>
      <editor>$e</>
    </>
  </> IN "bib.xml",
  $t = $t1
CONSTRUCT
  <bibliography>
    <book>
      <title>$t</>
      <author>$a</>
    </>
    <article>
      <title>$t1</>
      <editor>$e</>
    </>
  </>

Figure 5.2: An XMLQL Query Involving a Join on Titles of Books and Articles




5.2.7 Case 7: Individual Source Results Are Full, Partial, as well as Empty

This case is similar to the combination of Case 4 & Case 6. With this, we come to


the end of the case study.

So far, we have discussed all possible cases of full result formation only for those queries that ask for only one concept and its sub-items (attributes, so to say), for example, Books and their corresponding titles, years and authors. Let us now deal with some more complex queries, wherein the queried items may not be directly related to each other, may not share a common joinable item/concept, or may not have any unique key.


WHERE
  <bibliography>
    <book>
      <title><PCDATA>$t</></>
      <author>$a</>
    </>
    <article>
      <title><PCDATA>$t1</></>
      <editor>$e</>
    </>
  </> IN "bib.xml"
CONSTRUCT
  <bibliography>
    <book>
      <title>$t</>
      <author>$a</>
    </>
    <article>
      <title>$t1</>
      <editor>$e</>
    </>
  </>

Figure 5.3: An XMLQL Query Requesting Simultaneously for Book Titles and Article Titles


For a better explanation of the queries shown in Figure 5.2 and Figure 5.3, refer to Chapter 2, Section 2.5. The query shown in Figure 5.2 requests the title and author of all books, and the title and editor of all articles whose title is also the title of a book. It basically is a join on the titles of articles and books. In the query shown in Figure 5.3, the join condition, i.e., $t = $t1, is not present. There is no join. The user is simply asking for all books with their titles and authors and all articles with their titles and editors. Thus, under the concept <bibliography>, its two child elements (or rather concepts), namely <book> and <article>, have been queried in both queries: in the first case with a join on titles, and in the second case with no join. We call the joins that are already stated by the user in the input query 'explicit joins'. Now, since there is no explicitly stated join condition in the second case, and since there is no joinable item/concept common to or shared by books and articles, like ISBN in the previous examples, there is no way the tuples of books and articles can be associated with each other. Thus, if the second query is run against the XML document "bib.xml", the result returned would contain a cartesian product of all the book and article tuples! This is because, in the CONSTRUCT clause, one book tuple and one article tuple have to be placed within every newly constructed <bibliography> tag. Thus, every tuple of book is paired up with every tuple of article, resulting in a cartesian product. In the case of the first query, since there is a join on the titles, the two concepts within <bibliography> are related to each other, and thus only those tuples having common titles will be picked up.
Thus, there needs to be an explicit join, stated by the user in the input query, on concepts that are unrelated to each other and do not share any joinable data item.


5.3 The Children Binding Rule

With the background so far, we can now postulate the Children Binding Rule. For every concept requested in a query, if there are multiple data instances of that concept, i.e., if there is a '*' or '+' relationship between itself and its parent as delineated in the schema (Ontology DTD), then for all such concepts queried, its children or sub-concepts should either

1. all be present in one source, OR

2. be bound by joins explicitly stated by the user, OR

3. have a joinable data item common to them if they are present in different sources.

The Children Binding Rule forms the basis for the logic of the Join Sequencing Algorithm. It is explained in detail in the following section.


WHERE
  <Ontology>
    <Bib>
      <Book>
        <Title>$t</>          <- S1
        <Year>$y</>           <- S1
        <Author>
          <Firstname>$f</>    <- S2
          <Lastname>$l</>     <- S2
        </>
      </>
      <Article>
        <Author>
          <Lastname>$l1</>    <- S5
        </>
      </>
    </>
  </> IN Mediator,
  $l = $l1
CONSTRUCT
  <book>
    <title>$t</>
    <year>$y</>
    <author>$f $l</>
  </>

Figure 5.4: An XMLQL Query with its Source Scenario


5.4 The Join Sequencing Algorithm

Using the Children Binding Rule and the case study discussed in Section 5.2, we now delineate the Join Sequencing Algorithm. The join sequencing algorithm figures out the joins, their sequence, and the data items to be used for the joins, in two steps.

In the first step, the very first node encountered in the pre-order traversal of the tree that has more than one child, has multiple data instances in the XML document, and has not enough (or no) explicit joins binding its children, is picked up. Using the joinable item information for this node, the joins at this node's level are figured out.

The query shown in Figure 5.4 is taken as an example to elucidate step 1. The query simply requests all books, each with its title, year of publication, and author's firstname and lastname. There is a join on the book author's lastname with the article author's lastname. Titles and Years occur in source S1, Book Author Firstnames and Lastnames occur in source S2, and Article Author Lastnames occur in S5. ISBN is a joinable item that can be used to join the information occurring under the concept 'Book'; ISBN occurs in both sources S1 and S2 but has not been requested. As the query is traversed, <Bib> is the first node that has multiple children, but it has an explicit join binding its children book and article, and the XML document does not have multiple occurrences of Bib. The next node that has multiple children, has multiple data instances, and has not enough (or no) explicit joins is <Book>. So it is picked up and the joins are figured out for it.
Source S5 can be queried for Article Author Lastname. Since ISBN occurs in both of the other sources, source S1 can be queried for ISBN, Title and Year. Similarly, source S2 can be queried for ISBN, Author Firstname and Author Lastname. Thus, the results returned from the latter two sources can be joined for Books on ISBN, and the result thereof can then be joined with the result from source S5 for Bib on the Lastnames.

In the second step, the sources figured out in step one for each leaf node (title, year, author firstname and lastname for book, and author lastname for article in the above case) are separated out and are recursively sent to the lower levels of the tree to figure out whether these same sources can contribute to joins at the lower levels.

In the above example, one level below <Book> occurs the concept <Author>. This concept too has multiple children and may have multiple occurrences of itself. But since both Firstname and Lastname occur in the same source S2, there is no need to figure out any joins for them.

We take the example query shown in Figure 5.5 to explain the second step in a better way. The only change in this query with respect to the one in Figure 5.4 is that Book Author Firstname and Author Lastname now occur in two different sources, namely S2 and S3, respectively. Again, assume that ISBN occurs in sources S1, S2, and S3.


WHERE
  <Ontology>
    <Bib>
      <Book>
        <Title>$t</>          <- S1
        <Year>$y</>           <- S1
        <Author>
          <Firstname>$f</>    <- S2
          <Lastname>$l</>     <- S3
        </>
      </>
      <Article>
        <Author>
          <Lastname>$l1</>    <- S5
        </>
      </>
    </>
  </> IN Mediator,
  $l = $l1
CONSTRUCT
  <book>
    <title>$t</>
    <year>$y</>
    <author>$f $l</>
  </>

Figure 5.5: An XMLQL Query with its Source Scenario


Going through step 1 again, S1 is queried for title, year and ISBN, S2 is queried for author firstname and ISBN, S3 is queried for book author lastname and ISBN, and S5 for article author lastname. Thus the results from the first three sources can be joined on ISBN for the concept 'Book'. The question is: how are author firstname and lastname joined? They come from different sources. Thus, the source names S2 and S3, which are queried for author firstname and lastname respectively, are sent one level down and are checked as to whether they can contribute a join for the concept 'Author'.
Let us assume there is a joinable data item called 'Author-id' for the concept 'Author'. Now, if S2 and S3 both have Author-id, then a join at that level can also be generated. But if either of them does not have Author-id, S2 and S3 can NOT be used to query author firstname and lastname, even though they both have ISBNs. Thus, the joins that are figured out at each level cannot violate the joins figured out at the parent level, i.e., at the upper level of the recursion. Figure 5.6 shows the pseudo-code of the join sequencing algorithm.


For every node in the tree beginning with the root node
  If there are less than two children
    Then check if there is one child; if there is, recursively invoke the Join
         Sequencing Algorithm on that child
  Else
    If there are enough explicit joins
      Invoke the algorithm on all the children one by one
    Else
      If there are no joinable items available for this concept
        Conclude "No Full Result can be generated." Quit the rewriting process
      Else
        Figure out joins, 1-way or 2-way.
        If join sequences are not available below the current node
          Conclude "No Full Result can be generated." Quit the rewriting process
        Else
          Proceed with splitting and query plan generation.


Figure 5.6: Pseudo-code of the Join Sequencing Algorithm
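For illustration, the recursive skeleton of Figure 5.6 can be rendered in Java roughly as follows. This is a minimal sketch, not QRE's actual implementation: QueryTreeNode is the illustrative decorated parse-tree node sketched in Chapter 4, and the two helpers stand in for look-ups against QRE's meta-data:

    import java.util.List;

    // Sketch of the join sequencing recursion; returns false if no Full Result
    // can be generated at or below this node, mirroring Figure 5.6.
    public class JoinSequencer {

        static boolean sequence(QueryTreeNode node) {
            List<QueryTreeNode> children = node.children;
            if (children.size() < 2) {
                // Zero children: a leaf, nothing to join. One child: recurse on it.
                return children.isEmpty() || sequence(children.get(0));
            }
            if (hasEnoughExplicitJoins(node)) {
                for (QueryTreeNode child : children) {
                    if (!sequence(child)) return false;
                }
                return true;
            }
            if (node.joinableItems.isEmpty()) {
                return false; // no way to bind the children: quit the rewriting process
            }
            figureOutJoins(node); // decide 1-way or 2-way joins on the joinable items
            for (QueryTreeNode child : children) {
                if (!sequence(child)) return false; // a lower level may veto the plan
            }
            return true; // proceed with splitting and query plan generation
        }

        static boolean hasEnoughExplicitJoins(QueryTreeNode node) { /* meta-data lookup */ return false; }
        static void figureOutJoins(QueryTreeNode node) { /* choose joinable items and their sequence */ }
    }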
It has to be stored in some repository from which the information can be retrieved as and when required, i.e., as and when QRE receives queries. QRE thus achieves its entire functionality in two phases: the Build-Time Phase, when it collects the meta-data and stores it in a data structure, and the Run-Time Phase, when it receives a query, parses it, splits it, and figures out the joins needed to merge the results. A detailed description of both phases follows.


6.1 The Build-Time Phase

6.1.1 Requirements

The build-time phase collects the meta-data and stores it in such a way that the required information can be retrieved from it for effectively handling queries as and when QRE receives them.

6.1.2 Analysis

There are three inputs to and one output from the build-time phase. The three inputs are the ontology schema, the joinable item information, and the location information; the output is the data structure that holds QRE's meta-data. Each input is retrieved from the component that has knowledge about it. As mentioned earlier, the ontology schema is represented using a DTD. The Ontology DTD is needed so that a useful data structure of the concepts in the ontology can be constructed. This data structure stores, for each concept, whether it can be used as a joinable item and which sources have data on it. The joinable item information has to be provided by the super-user, the IWiz system administrator. The location information for each concept is provided by the Data Restructuring Engine (DRE). Since the DRE maps the schema of each source to the Ontology schema, it is capable of providing, for each individual concept, the information of all sources that have data on it. The output, which is the meta-data data structure, is discussed at length in the following design and implementation subsection.

Figure 6.1. QRE Build-Time Phase Overview

6.1.3 Design and Implementation

In this subsection, we discuss the design and implementation of the output and each input.

    Ontology.bib.book \t 2 \t Book-id
    Ontology.bib.book \t 1 \t ISBN
    Ontology.bib.article \t 1 \t title

Figure 6.2. Joinable Data Item Info Text File

The first input is the Ontology DTD. IWiz meta-data is stored in the warehouse and is managed by the WHM. As soon as the Ontology DTD is in place, an RMI client at the WHM's end notifies QRE's RMI server, as shown in Figure 6.1.
Then QRE's RMI client requests the WHM to send the Ontology DTD object.

    <?xml version = '1.0' encoding = 'UTF-8'?>
    <!--XML Restructuring (conversionspec.xml)-->
    <ConversionSpec>
      <Mapping>
        <SourcePath>/Book/publisher</SourcePath>
        <TargetPath>/Bibliography/Book/publisher/name</TargetPath>
      </Mapping>
      <Mapping>
        <SourcePath>/Book/author/first</SourcePath>
        <TargetPath>/Bibliography/Book/author/firstname</TargetPath>
      </Mapping>
      <Mapping>
        <SourcePath>/Book/author/last</SourcePath>
        <TargetPath>/Bibliography/Book/author/lastname</TargetPath>
      </Mapping>
    </ConversionSpec>

Figure 6.3. Example of Restructuring Specification

The second input is the joinable item information. As mentioned earlier, it is provided by the IWiz system administrator. It has to be provided in a form that can be deciphered easily; a simple text file is the solution. A sample file is shown in Figure 6.2, wherein the first column contains the concepts, the second column contains the rank of the joinable data item, and the third the joinable item itself. Thus, for example, the second line in the file suggests using ISBN as a joinable data item for Books, and ISBN takes precedence over Book-id (in the first line) since its rank is 1. This file is also kept as IWiz meta-data in the warehouse and is requested once the WHM sends its notification, as shown in Figure 6.1.

Figure 6.4. QRE Build-Time Phase

The third input is the location information. While generating the restructuring specification for each source, the DRE generates mappings between the terms in the source schemas and the Ontology schema. It stores these mappings in the form of an XML document; an example is shown in Figure 6.3. Mappings are generated only for terms that are present in the Ontology. So, if the example shown in Figure 6.3 is the restructuring specification for a source, say source S1, then on scanning the target paths it can be concluded that source S1 has information on

* names of publishers of books present in the bibliography

* firstnames of authors of books present in the bibliography

* lastnames of authors of books present in the bibliography

and so on and so forth.
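To illustrate such a scan, the following minimal sketch extracts the TargetPath strings from a ConversionSpec document like the one in Figure 6.3. The class is a hypothetical helper, written against the standard JAXP DOM API rather than the Oracle parser that QRE actually uses:

    import java.util.Vector;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;

    // Hypothetical helper: collects the TargetPath strings of a
    // ConversionSpec document. Each target path names a concept
    // on which the corresponding source has data.
    public class TargetPathExtractor {
        public static Vector extract(String conversionSpecFile) throws Exception {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse(conversionSpecFile);
            NodeList paths = doc.getElementsByTagName("TargetPath");
            Vector result = new Vector();
            for (int i = 0; i < paths.getLength(); i++) {
                // The text child of <TargetPath> holds the path string.
                result.addElement(paths.item(i).getFirstChild().getNodeValue());
            }
            return result;
        }
    }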
Thus, the DRE is requested to send a vector of target paths extracted from the restructuring specification of each source, as depicted in Figure 6.1.

The output of the QRE build-time phase is the meta-data data structure we call the Annotated Ontology Tree (AOT). Figure 6.4 shows the QRE build-time phase architecture. The Ontology Tree Builder sub-component parses the Ontology DTD into an n-ary Ontology Tree using Oracle's DTD parser, as shown in Figure 6.4. Each node of the tree, representing a concept of the Ontology, is an object of the class AOTNode (Figure 6.5). The primary data members, among others, are the node name, the node's children, an object of class KeyInfo that keeps the list of joinable items for this node, and an object of class LocInfo that keeps the list of sources where this node (or concept) occurs.

    class AOTNode
    {
        String name;
        AOTNode children[];
        KeyInfo keys;
        LocInfo locs;
        ...
    }

Figure 6.5. The Class AOTNode

Then the restructuring specification vectors from each source are scanned one by one by the Location Annotator (see Figure 6.4). The Ontology Tree is traversed, picking up one target path at a time. Once the required AOTNode is reached, the source name, say "S1", is added to the Vector "locsVector", a member of the class LocInfo, which in turn is a member of the class AOTNode. This is done for all the target paths from all the restructuring specifications.

Once the tree is annotated with all the location information, the text file containing the joinable item information is read line by line by the Key Annotator (see Figure 6.4). The first string is picked up until a tab is encountered, and the Ontology Tree is traversed using the path in that first string. Once the required node is reached,

* It is checked that the key is a leaf node, i.e., that it does not have children.

* The key is added to the "keyVector" member of the class KeyInfo, which in turn is a member of the class AOTNode, in the cell corresponding to the rank (ranks are in the second column of the file).

* The key is also recursively made the key of every node on this path, but ONLY if each child has a one-to-one correspondence with its parent, i.e., if there is no '*' or '+' mapping.

The Key Annotator continues this process of reading in the keys until the end-of-file marker of the text file is reached. Once the build-time phase is over, QRE is ready to enter the run-time phase.

Figure 6.6: QRE Run-Time Phase Overview

6.2 Run-Time Phase

During the run-time phase, an RMI server running at QRE's end listens and waits for queries to be sent out by the warehouse manager.
The communication of QRE with all the other components is depicted in Figure 6.6. QRE in turn processes the query and generates source-specific sub-queries. An RMI client then connects to a server listening at the Wrapper end and sends the sub-query to it. The Wrapper on the other end processes the query it receives and sends back the URL of the result to the client. QRE packs the information on the query plan as well as all the URLs into one object and sends it to the Data Merge Engine, which runs in the same address space. The Data Merge Engine does its processing and finally sends the URL of the result to the WHM client that sent out the query.

Figure 6.7: QRE Run-Time Phase

6.2.1 Requirements

In the run-time phase, the Query Rewriting Engine is required to query multiple sources, each containing information that may be related, overlapping, or incomplete, and that may not necessarily be useful to the user just by itself. QRE's task is to completely satisfy a query. QRE finds out which sources are to be queried for the items asked for in the query and how to join the results coming in from each source. QRE finds as many combinations of sources as possible that can satisfy a query fully. Once that is determined, QRE rewrites the input query into several source-specific sub-queries. While doing this, it also generates a query plan for the Data Merge Engine to join all the results from each source.

    WHERE
      <Ontology>
        <Bib>
          <Book>
            <Title><PCDATA>$t</></>
            <Author>$a</>
          </>
          <Article>
            <Title><PCDATA>$t1</></>
          </>
        </>
      </> IN "source.xml",
      $t = $t1
    CONSTRUCT
      <Book>
        <Title>$t</>
        <Author>$a</>
      </>

Figure 6.8: An XMLQL Query Requesting all Books, the Title of Each of Which Is Also the Title of an Article

6.2.2 Analysis

There are two inputs to the run-time phase, namely the query and the AOT, and two outputs, namely the source-specific sub-queries and the query plan together with the URLs of the results.

6.2.3 Design and Implementation

6.2.3.1 The Query Parse Tree Generator

The run-time phase architecture is shown in Figure 6.7. The query can be passed to QRE as plain text or as a Document Object Model (DOM) object constructed according to certain specifications. In the case of a text input, the incoming query is converted to a valid XML document and parsed into a DOM structure using Oracle's XML Parser.
This DOM object is then traversed and a Query Parse Tree is generated.

Figure 6.9: Parse Tree for Query Shown in Figure 6.8

For example, if the query shown in Figure 6.8, which asks for the title and author name of all books in the bibliography where each title is also the title of an article, is parsed, the parse tree that gets generated is as shown in Figure 6.9. On closer observation, one can see that the parse tree for a query is always a subset of the Annotated Ontology Tree in terms of structure. In the above figure, only the names of the nodes in the tree are shown, but each node contains a good deal of information apart from its name. The following are some of the members in each node:

* Name of the node

* Reference to the children nodes

* Reference to the parent node

* Names of all the sources that contain information on this concept (this information is cloned from the AOT)

* Names of the sources that will actually get queried (for this node) after the join sequences are figured out

* The joinable item information (which gets added later, after the join sequencing algorithm is run)

* The bound variable for the node, if any

* Information on whether the node is a leaf node

* Information on whether the node has any explicitly stated join with any other node

The QueryInfoTableGenerator class generates the parse tree and raises a MalformedQueryException if the query has any unacceptable syntax or any parse error. It has been designed to handle all the categories of queries discussed in Chapter 2. Also, if it is found during the parsing of the query that there is at least one item that does not occur in any source, it is concluded outright that a full result cannot be generated, and the query rewriting process stops. The class also maintains two hashtables that it populates while parsing the query; they keep track of all the bound variables in the query and of all the pairs of explicit joins already stated in the query by the user, respectively.

After the parse tree is generated, four specific tasks are performed. First, sources yielding a full result are found. These sources are sent the query as is, as per the discussion in the last chapter.

    WHERE
      <bib>
        <book>
          <title>$t</>
          <year>$y</>
        </>
        <article>
          <title>$t1</>
        </>
        <thesis>
          <title>$t2</>
        </>
      </> IN "source.xml",
      $t = $t1,
      $t1 = $t2
    CONSTRUCT
      <result>
        <title>$t</>
        <year>$y</>
      </>

Figure 6.10: An XMLQL Query Requesting for Books, the Title of Each of Which Is Also the Title of an Article and a Thesis

Second, for each node in the parse tree, the number of explicit joins that can bind its children is found. For example, the query shown in Figure 6.10 asks for all books where the title of each book is also the title of an article and of a thesis. There are three children of <bib>, namely <book>, <article>, and <thesis>, and there are two joins binding the tuples of books, articles, and theses.
That is how all three concepts are related; taken together they otherwise have no meaning. If there is only one explicit join, there has to be some other way to join the third concept; the Join Sequences Generator does that work, finding whether there are any joinable items that can be used to join the third concept with the other two. Thus, the number of explicit joins that can bind a node's children has to be found, which facilitates the join sequencing algorithm later.

    WHERE
      <bib>
        <book>
          <title>$t</>
          <year>$y</>
          <author>
            <firstname>$f</>
            <lastname>$l</>
          </>
        </>
      </>
    CONSTRUCT
      <book>
        <title>$t</>
        <year>$y</>
        <author>$f $l</>
      </>

Figure 6.11: An XMLQL Query Requesting for Books, Each with Its Title, Year and Author

The third task is to find the first node in the query, during a pre-order traversal, that has more than one child but not enough (or no) explicit joins that can bind its children. All its children are checked to see whether they occur all together in one single source. Take the query shown in Figure 6.11 as an example: if all four concepts that are queried occur in one single source, then no question of joining them arises. Only if some are brought from one source and the remaining from some other source does the question arise of looking for a joinable item, which has to be queried in both sources even if not asked for originally, in order to join the two results.

And finally, the fourth task is to apply the Children Binding Rule to all the nodes in the parse tree. All nodes that have multiple data instances are checked for three things:

* whether they have enough explicit joins to bind their children, or

* whether all their children occur in one source, or

* whether they have at least one joinable concept to join all their children.

If all three of the above turn out to be false, then a full result can NOT be generated; query rewriting fails and the system stops here. With this, the background for the join sequencing algorithm is ready, and the Join Sequences Generator is invoked.

    WHERE
      <A>
        <B>
          <C>$c</>
          <D>
            <E>$e</>
            <F>$f</>
            <G>$g</>
          </>
          <H>
            <I>$i</>
          </>
        </>
        <J>$j</>
        <K>$k</>
      </> IN "source.xml",
      $j = $k,
      $i = $j

Figure 6.12: WHERE Clause of an XMLQL Query

6.2.3.2 The Join Sequences Generator

The class JoinSequence implements the functionalities of the Join Sequences Generator.
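As a rough illustration, the Children Binding Rule just described might be coded as follows. The interface and member names (ParseTreeNode, explicitJoinCount, and so on) are hypothetical stand-ins, not QRE's actual API:

    import java.util.List;

    // Illustrative parse-tree node interface; the names are hypothetical.
    interface ParseTreeNode {
        int childCount();
        int explicitJoinCount();              // explicit joins binding this node's children
        String commonSourceForAllChildren();  // a single source covering all children, or null
        List joinableItems();                 // joinable concepts usable at this node
    }

    class ChildrenBindingRule {
        // n children need n-1 binding joins; failing that, the children must
        // all come from one source, or a joinable item must be available.
        static boolean childrenCanBeBound(ParseTreeNode n) {
            return n.explicitJoinCount() >= n.childCount() - 1
                || n.commonSourceForAllChildren() != null
                || !n.joinableItems().isEmpty();
        }
    }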
Figure 6.12 shows the WHERE clause of an XMLQL query that we shall use to outline the functionalities of the Join Sequences Generator. The Join Sequences Generator is based on the join sequencing algorithm described in complete detail in Chapter 5, and it proceeds using the Children Binding Rule. 'A' has three children, 'B', 'J', and 'K'. All its children are bound using explicit joins, viz. '$j = $k' and '$i = $j', so there is no need to figure out joins at that level. $i is bound to the concept 'I', which is not a direct child of 'A' but one of its descendants. The algorithm finds the common parent of the nodes involved in an explicit join; 'A' is the common parent of 'I' and 'J'. Thus, the entire 'B' tuple can be joined with the 'J' tuple using '$i = $j'. The focus then shifts to finding joins for 'B'. It has three children, 'C', 'D', and 'H'. Assuming there is some concept, say 'M', that can be used to join the children of 'B', the sources of 'C', 'E', 'F', 'G', and 'I' are searched for 'M'. If such a combination is available, then 'M' is also queried in all these sources. But before finalizing these sources, it is checked whether there are nodes below 'B' that may require a join. 'C' is a leaf node. 'H' has only one child, 'I', which in turn does not have any children. Only 'D' has three children, 'E', 'F', and 'G'; thus joins need to be figured out for 'D'. At this level, two things are checked. First, if all three of 'E', 'F', and 'G' occur in one single source, no question of figuring out joins at the level of 'D' arises. Second, if that is not the case, then 'D' should have joinable items, and one of them should occur in the sources specified for 'E', 'F', and 'G' at the level of 'B'. If neither of the two works, then some other combination of sources at the level of 'B' is found and the process is repeated all over again.

    <!ELEMENT QueryPlan (ExecutionTree*)>
    <!ATTLIST QueryPlan forElement CDATA #IMPLIED
                        uquID CDATA #IMPLIED >

    <!ELEMENT ExecutionTree EMPTY>
    <!ATTLIST ExecutionTree queryProcessor CDATA #REQUIRED
                            queryFileName CDATA #REQUIRED >

Figure 6.13: Query Plan DTD

6.2.3.3 The Splitter and Query Plan Generator

Information on all the sequences of joins, as well as the joinable items, has to be conveyed to the Data Merge Engine in some meaningful way so that it is able to join all the results that are returned to the mediator.

    WHERE
      <Ontology>
        <Bib>
          <Book>$book</Book>
          <Article>$article</Article>
        </Bib>
      </Ontology> IN Mediator,
      <Title>$t</> IN $book,
      <Year>$y</> IN $book,
      <Author>
        <Firstname>$f</>
        <Lastname>$l</>
      </> IN $book,
      <Author>
        <Lastname>$l1</>
      </> IN $article,
      $l = $l1
    CONSTRUCT
      <Book>
        <Title>$t</>
        <Year>$y</>
        <Authors>
          { WHERE <Author>$a</> IN $book
            CONSTRUCT <Author>$a</> }
        </>
      </>

Figure 6.14: An XMLQL Query

Since we at IWiz use XML as our underlying data model, in the same spirit we chose XML to model our data on joins, i.e., the query plan.
When the results are merged by the DME, it first joins all the results using the query plan. The query plan is nothing but a set of execution trees and a query-id. Each execution tree is an XMLQL query that generates a full result. The query is run against all the results (which are XML documents), which are then joined using the joins explicitly stated in the execution tree; the information on what to join and how to join it is laid out in the query. The XMLQL query for each execution tree is stored in a separate file. So, for example, if '0001' is the query-id and two execution trees are possible, the queries get stored in 0001.et1.xmlql and 0001.et2.xmlql respectively, and the information that there are two execution trees is put in the query plan file '0001.qpl', which is an XML document.
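A minimal Java sketch of this output step follows, under the assumption that each execution tree query is available as a string and that the queryProcessor attribute simply names the XMLQL engine; the class and method names are hypothetical:

    import java.io.FileWriter;
    import java.util.List;

    // Hypothetical sketch: write each execution tree's XMLQL text to its own
    // file and record one ExecutionTree entry per file in the .qpl plan.
    class QueryPlanWriter {
        static void write(String queryId, List executionTreeQueries) throws Exception {
            StringBuffer plan = new StringBuffer();
            plan.append("<QueryPlan uquID=\"" + queryId + "\">\n");
            for (int i = 0; i < executionTreeQueries.size(); i++) {
                String fileName = queryId + ".et" + (i + 1) + ".xmlql";
                FileWriter query = new FileWriter(fileName);
                query.write((String) executionTreeQueries.get(i));
                query.close();
                plan.append("  <ExecutionTree queryProcessor=\"XMLQL\""
                          + " queryFileName=\"" + fileName + "\"/>\n");
            }
            plan.append("</QueryPlan>\n");
            FileWriter out = new FileWriter(queryId + ".qpl");
            out.write(plan.toString());
            out.close();
        }
    }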


Figure 6.15: Query Parse Tree with Location Information


The query plan has to adhere to a DTD, "QueryPlan.dtd", which is shown in Figure 6.13. The SplitterAndQPGenerator class implements the functionalities of the Splitter and Query Plan Generator component. The class JoinSequence stores all the information regarding the joins (of each execution tree) required for each node (of the parse tree) in the node itself. A reference to the root node of the parse tree is then passed to the constructor of the SplitterAndQPGenerator class. Each node of the tree is traversed again to generate the execution tree XMLQL queries and the source-specific sub-queries, and simultaneously the query plan.

Consider one final example, the query shown in Figure 6.14. The query requests all books, each with its title, year, and author firstnames and lastnames, where each book author's lastname is also the lastname of an article author. Figure 6.15 gives an overview of the parse tree that gets generated, which also holds the location information for all the concepts queried. There are 12 sources that can potentially provide information. When this query is fed into QRE, QRE cranks out four execution trees, i.e., it is able to sequence the joins in such a way that four full results can be formed. A brief description of each of the four execution trees follows.

The entire example, with the actual execution tree queries, source-specific sub-queries, and the query plan, can be found in the Appendix. In the first execution tree, the query looks very similar to the actual query; only the word 'Mediator' is replaced with '0001.S8.xml'. Source S8 provides a full result all by itself, so the query is sent to it as is. The DME executes the second execution tree to get Title, Year, Book.Author.Firstname, and Book.Author.Lastname, all of them from the result obtained from source S4, and Article.Author.Lastname from the result obtained from source S1; it then joins the two results on the Lastnames. In the third execution tree, the DME constructs the book tuple using information from sources S2 and S1 and joins the results using ISBN: source S2 is queried for Title, Year, and ISBN, and source S1 is queried for Author Firstname and Lastname as well as ISBN. The Article Author Lastnames are obtained from source S3; then there is a join on the lastnames of the authors of books and articles to further filter out the tuples. In the fourth execution tree, the DME constructs the Author tuple from the results of two different sources, Author Firstnames from source S9 and Author Lastnames from S10, using Author-id as the joinable item. It then constructs the entire book tuple, with titles and years from source S5, using ISBN as the joinable item; so ISBN is queried from S5, S9, and S10. The Article Author Lastnames are picked up from source S12, and finally there is a join on the author Lastnames of books and articles.
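As an illustration, the second execution tree's query might look roughly like the following sketch. The actual queries appear in the Appendix; the result file names 0001.S4.xml and 0001.S1.xml, and the shape of the intermediate result documents, are assumptions by analogy with the '0001.S8.xml' naming above:

    WHERE
      <Book>
        <Title>$t</>
        <Year>$y</>
        <Author>
          <Firstname>$f</>
          <Lastname>$l</>
        </>
      </> IN "0001.S4.xml",
      <Author>
        <Lastname>$l1</>
      </> IN "0001.S1.xml",
      $l = $l1
    CONSTRUCT
      <Book>
        <Title>$t</>
        <Year>$y</>
        <Author>$f $l</>
      </>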











Figure 6.16: Sample Query Plan



The query plan for the above query, which has four execution trees, looks like the one shown in Figure 6.16. It is an XML document with the <QueryPlan> tag as the root of the document. It is written to a text file having the extension ".qpl". The URL of this XML file is sent to the Data Merge Engine. Since QRE and the DME both run in the same address space, just the filenames of the execution trees are present in the document, an example being "0001.et1.xmlql".
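An illustrative plan of this shape, conforming to the DTD of Figure 6.13, would be the following; the forElement and queryProcessor values are assumptions:

    <?xml version = '1.0' encoding = 'UTF-8'?>
    <QueryPlan forElement="Book" uquID="0001">
      <ExecutionTree queryProcessor="XMLQL" queryFileName="0001.et1.xmlql"/>
      <ExecutionTree queryProcessor="XMLQL" queryFileName="0001.et2.xmlql"/>
      <ExecutionTree queryProcessor="XMLQL" queryFileName="0001.et3.xmlql"/>
      <ExecutionTree queryProcessor="XMLQL" queryFileName="0001.et4.xmlql"/>
    </QueryPlan>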

This concludes the description of the QRE implementation. The next chapter is dedicated to the experimental prototype developed during the testing phase of QRE.














CHAPTER 7
EXPERIMENTAL PROTOTYPE

The functionality of the QRE architecture can be analyzed by its applicability to IWiz. Since QRE was designed in the context of IWiz, its functionality was defined based on the following requirements. First and foremost, QRE must be able to retrieve and store location information and joinable item information for each concept in the Ontology schema. Second, using the location information, QRE must be able to find out which sources are to be queried for the concepts requested in a query. Third, QRE must be able to find out how the results that are returned to the mediator are to be joined. Fourth, it must be able to generate source-specific sub-queries customized for each source, not querying for concepts that are absent from the source, and querying for joinable items not asked for in the query originally. Fifth, it must generate a query plan based on the join sequences and send it to the Data Merge Engine, enabling it to join the results returned from the sources. Sixth and last, QRE must be able to function as an integral part of the IWiz system and to communicate and interact with the rest of the components effectively and as required.

To test the efficacy of any system, several prerequisites are needed: in particular, sets of suitable data, sets of benchmark programs, and performance characteristics observed for comparable systems during testing. Practically all of these components are non-existent at this point in the XML world. Notwithstanding the development of all the XML-related technologies and products, large sets of diverse and "real" XML data are still difficult to come by. For the purpose of testing IWiz components, including QRE, we use several XML sources containing bibliographic data; at this point, we have 8 data sources. Currently, the IWiz testbed runs in the Windows NT environment. All the components communicate with each other using Java RMI. As pointed out in Chapter 4, the Mediator (the QRE and the DME) runs in one address space on one NT workstation; all the sources, with their corresponding DREs, reside on another NT machine, each with a dedicated communication port; and the WHM with the warehouse resides on a third machine. All three machines are Pentium IIs with 128 MB of memory. The external tools we use are Oracle's XML Parser version 2.0.2.9 and the AT&T Bell Labs' XMLQL processor version 0.9. The implementation language for the entire IWiz system is Java, which also gives it the potential to be ported to other OS environments.

We ran a few experiments to determine the correctness of QRE's operation. The testing was done in two phases. The first tests were to gauge the effectiveness and robustness of the communication mechanism. As pointed out earlier in Chapter 4, the WHM, the Mediator, and all the Wrappers run separately in different address spaces on different nodes. So, to test the communication mechanism, the WHM was set up in one address space, the Mediator in another, and the Wrappers in a third. Note that during these tests, all objects being sent across the system between the different components were dummy objects. The RMI servers at all three locations were started. In the build-time phase, QRE's server received notifications from the WHM regarding the readiness of the Ontology DTD object and of the Joinable Item Information text file as a string. Once notified, QRE, as desired, asked for both objects. The Wrappers too notified QRE about the readiness of the Conversion Spec vectors, and QRE in turn requested the vector from the corresponding wrapper. Once this was done, QRE entered the run-time phase.









In order to test this phase, the WHM invoked QRE's exported "processQuery" method and sent the 'Query' object as a parameter. QRE in turn invoked the Wrapper's exported "processMediatedQuery" method and sent it the rewritten query. An object of class 'QueryResult' was returned to QRE by the wrapper. This object, along with the query plan, was packed into an object of the class 'InfoQRE2DME' and sent to the DME. The DME in turn returned the 'QueryResult' object to the WHM. This is how the entire communication in IWiz was set up and tested.
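The remote interfaces involved in this exchange might be sketched as follows. The interface names and member fields are hypothetical stand-ins; only processQuery, processMediatedQuery, Query, QueryResult, and InfoQRE2DME are taken from the description above:

    import java.io.Serializable;
    import java.rmi.Remote;
    import java.rmi.RemoteException;

    // Placeholder payload types standing in for the thesis's classes.
    class Query implements Serializable { String xmlqlText; }
    class QueryResult implements Serializable { String resultUrl; }
    class InfoQRE2DME implements Serializable { String queryPlanUrl; QueryResult[] results; }

    // Hypothetical remote interface exported by the Mediator (QRE).
    interface MediatorServer extends Remote {
        QueryResult processQuery(Query query) throws RemoteException;
    }

    // Hypothetical remote interface exported by each Wrapper.
    interface WrapperServer extends Remote {
        QueryResult processMediatedQuery(Query subQuery) throws RemoteException;
    }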


Figure 7.1: Hierarchical Structure of the XML Document "haptics_article.xml"


In the second phase of testing, QRE was explicitly tested as a component. The correctness of the Query Rewriting Engine implementation was tested in terms of its functionality. The testing began with the build-time phase, followed by the testing of the run-time phase. We chose the following query and source scenario for the tests. We picked a real-world data source, "haptics_article.xml", whose hierarchical schematic structure is depicted graphically in the form of a tree in Figure 7.1.


Figure 7.2: Location Information for the Concepts of the Document Shown in Figure 7.1


Ontology.Bib.Article \t 1 \t Title
Ontology.Bib.Article.Journal \t 1 \t Title


Figure 7.3: The Joinable Data Item Information Text File
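This file follows the same tab-separated layout as Figure 6.2. A minimal sketch of how such lines can be read and split (the class and file name are hypothetical):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.StringTokenizer;

    // Hypothetical sketch: read the joinable-item file line by line and split
    // each line into concept path, rank, and joinable item (tab-separated).
    class JoinableItemReader {
        public static void main(String[] args) throws Exception {
            BufferedReader in = new BufferedReader(new FileReader("joinable_items.txt"));
            String line;
            while ((line = in.readLine()) != null) {
                StringTokenizer t = new StringTokenizer(line, "\t");
                String conceptPath = t.nextToken();   // e.g. Ontology.Bib.Article
                int rank = Integer.parseInt(t.nextToken().trim());
                String joinableItem = t.nextToken();  // e.g. Title
                System.out.println(conceptPath + " -> " + joinableItem + " (rank " + rank + ")");
            }
            in.close();
        }
    }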



We did a vertical partitioning of it to create two new sources, namely "S1.xml" and "S2.xml". These two sources had some information that was complementary and some that was overlapping, as shown in Figure 7.2. For example, data on Article Pages occurs only in source S1, while data on Article Journal Year occurs only in source S2.












The Ontology DTD object as well as the Joinable Item Information text file (Figure 7.3) were sent to QRE by the WHM; the Conversion Spec vectors for the sources were sent by the two wrappers.


    WHERE
      <Ontology>
        <Bib>
          <Article>
            <Title>$t</>
            <Year>$y</>
            <Pages>$p</>
            <Author>
              <Lastname>$l</>
            </>
            <Journal>
              <Title>$jt</>
              <Year>$jy</>
              <Vol>$jv</>
            </>
          </>
        </>
      </> IN Mediator
    CONSTRUCT
      <Article>
        <Title>$t</>
        <Year>$y</>
        <Pages>$p</>
        <Author>
          <Lastname>$l</>
        </>
        <Journal>
          <Title>$jt</>
          <Year>$jy</>
          <Vol>$jv</>
        </>
      </>

Figure 7.4: Test XMLQL Query




The query fed to QRE was the one shown in Figure 7.4. The query requested all the concepts present in "haptics_article.xml", i.e., everything that is in the original source; but the two sources that were actually queried were "S1.xml" and "S2.xml", which acted as partial sources and had incomplete and partial information, some of it complementary and some overlapping.




Figure 7.5: Query to Source S1



