Citation
An Algorithm and implementation for extracting schematic and semantic knowledge from relational database systems

Material Information

Title:
An Algorithm and implementation for extracting schematic and semantic knowledge from relational database systems
Creator:
Haldavnekar, Nikhil ( Dissertant )
Hammer, Joachim ( Thesis advisor )
Schmalz, Mark S. ( Reviewer )
Issa, R. Raymond ( Reviewer )
Place of Publication:
Gainesville, Fla.
Publisher:
University of Florida
Publication Date:
Copyright Date:
2002
Language:
English

Subjects

Subjects / Keywords:
Cardinality ( jstor )
Cognitive models ( jstor )
Database design ( jstor )
Databases ( jstor )
Extraction ( jstor )
Information attributes ( jstor )
Legacies ( jstor )
Relational databases ( jstor )
Reverse engineering ( jstor )
Wrappers ( jstor )
Computer and Information Science and Engineering thesis, M.S.
Relational databases ( lcsh )
Dissertations, Academic -- UF -- Computer and Information Science and Engineering
Rule based programming ( lcsh )

Notes

Abstract:
Due to the heterogeneities of the underlying legacy information systems of enterprises participating in large business networks (e.g. supply chains), existing information integration techniques fall short in enabling the automated sharing of data. This necessitates the development of automated solutions to enable scalable extraction of the knowledge resident in the legacy systems to support efficient sharing. Since the majority of existing information systems are based on relational database technology, I have focused on the process of knowledge extraction from relational databases. This thesis describes an automated approach for extracting schematic and semantic knowledge from relational databases. The extracted knowledge contains information about the underlying relational schema as well as the semantics in order to recreate the semantically rich model that was used to create the database. This knowledge enables schema mapping and mediator generation. The use of this approach can also be foreseen in enhancing existing schemas and extracting metadata needed to create the Semantic Web.
Subject:
database, databases, engineering, extraction, knowledge, relational, reverse, situational
General Note:
Title from title page of source document.
General Note:
Includes vita.
Thesis:
Thesis (M.S.)--University of Florida, 2002.
Bibliography:
Includes bibliographical references.
General Note:
Text (Electronic thesis) in PDF format.

Record Information

Source Institution:
University of Florida
Holding Location:
University of Florida
Rights Management:
Copyright Haldavnekar, Nikhil. Permission granted to the University of Florida to digitize, archive and distribute this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.
Embargo Date:
12/27/2005
Resource Identifier:
53309016 ( OCLC )


Full Text











AN ALGORITHM AND IMPLEMENTATION FOR EXTRACTING
SCHEMATIC AND SEMANTIC KNOWLEDGE FROM RELATIONAL DATABASE SYSTEMS













By

NIKHIL HALDAVNEKAR


A THESIS PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF SCIENCE

UNIVERSITY OF FLORIDA


2002



























Copyright 2002 by

Nikhil Haldavnekar





























To my parents, my sister and Seema















ACKNOWLEDGMENTS

I would like to acknowledge the National Science Foundation for supporting this research under grant numbers CMS-0075407 and CMS-0122193.

I express my sincere gratitude to my advisor, Dr. Joachim Hammer, for giving me the opportunity to work on this interesting topic. Without his continuous guidance and encouragement this thesis would not have been possible. I thank Dr. Mark S. Schmalz and Dr. R. Raymond Issa for being on my supervisory committee and for their invaluable suggestions throughout this project. I thank all my colleagues in SEEK, especially Sangeetha, Huanqing and Laura, who assisted me in this work. I wish to thank Sharon Grant for making the Database Center a great place to work.

There are a few people to whom I am grateful for multiple reasons: first, my parents who have always striven to give their children the best in life and my sister who is always with me in any situation; next, my closest ever friends--Seema, Naren, Akhil, Nandhini and Kaumudi--for being my family here in Gainesville and Mandar, Rakesh and Suyog for so many unforgettable memories.

Most importantly, I would like to thank God for always being there for me.
















TABLE OF CONTENTS

ACKNOWLEDGMENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

CHAPTER

1 INTRODUCTION
    1.1 Motivation
    1.2 Solution Approaches
    1.3 Challenges and Contributions
    1.4 Organization of Thesis

2 RELATED RESEARCH
    2.1 Database Reverse Engineering
    2.2 Data Mining
    2.3 Wrapper/Mediation Technology

3 THE SCHEMA EXTRACTION ALGORITHM
    3.1 Introduction
    3.2 Algorithm Design
    3.3 Related Issue - Semantic Analysis
    3.4 Interaction
    3.5 Knowledge Representation

4 IMPLEMENTATION
    4.1 Implementation Details
    4.2 Example Walkthrough of Prototype Functionality
    4.3 Configuration and User Intervention
    4.4 Integration
    4.5 Implementation Summary
        4.5.1 Features
        4.5.2 Advantages

5 EXPERIMENTAL EVALUATION
    5.1 Experimental Setup
    5.2 Experiments
        5.2.1 Evaluation of the Schema Extraction Algorithm
        5.2.2 Measuring the Complexity of a Database Schema
    5.3 Conclusive Reasoning
        5.3.1 Analysis of the Results
        5.3.2 Enhancing Accuracy

6 CONCLUSION
    6.1 Contributions
    6.2 Limitations
        6.2.1 Normal Form of the Input Database
        6.2.2 Meanings and Names for the Discovered Structures
        6.2.3 Adaptability to the Data Source
    6.3 Future Work
        6.3.1 Situational Knowledge Extraction
        6.3.2 Improvements in the Algorithm
        6.3.3 Schema Extraction from Other Data Sources
        6.3.4 Machine Learning

APPENDIX

A DTD DESCRIBING EXTRACTED KNOWLEDGE

B SNAPSHOTS OF "RESULTS.XML"

C SUBSET TEST FOR INCLUSION DEPENDENCY DETECTION

D EXAMPLES OF THE SITUATIONAL KNOWLEDGE EXTRACTION PROCESS

LIST OF REFERENCES

BIOGRAPHICAL SKETCH
















LIST OF TABLES

4-1 Example of the attribute classification from the MS-Project legacy source.
5-1 Experimental results of schema extraction on 9 sample databases.















LIST OF FIGURES

2-1 The Concept of Database Reverse Engineering.
3-1 The SEEK Architecture.
3-2 The Schema Extraction Procedure.
3-3 The Dictionary Extraction Process.
3-4 Inclusion Dependency Mining.
3-5 The Code Analysis Process.
3-6 DRE Integrated Architecture.
4-1 Schema Extraction Code Block Diagram.
4-2 The class structure for a relation.
4-3 The class structure for the inclusion dependencies.
4-4 The class structure for an attribute.
4-5 The class structure for a relationship.
4-6 The information in different types of relationship instances.
4-7 The screen snapshot describing the information about the relational schema.
4-8 The screen snapshot describing the information about the entities.
4-9 The screen snapshot describing the information about the relationships.
4-10 E/R diagram representing the extracted schema.
5-1 Results of experimental evaluation of the schema extraction algorithm: errors in detected inclusion dependencies (top), number of errors in extracted schema (bottom).
B-1 The main structure of the XML document conforming to the DTD in Appendix A.
B-2 The part of the XML document which lists business rules extracted from the code.
B-3 The part of the XML document which lists business rules extracted from the code.
B-4 The part of the XML document which describes the semantically rich E/R schema.
C-1 Two queries for the subset test.















Abstract of Thesis Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Master of Science

AN ALGORITHM AND IMPLEMENTATION FOR EXTRACTING SCHEMATIC AND SEMANTIC KNOWLEDGE FROM RELATIONAL DATABASE SYSTEMS

By

Nikhil Haldavnekar

December 2002


Chair: Dr. Joachim Hammer
Major Department: Computer and Information Science and Engineering

As the need for enterprises to participate in large business networks (e.g., supply chains) increases, the need to optimize these networks to ensure profitability becomes more urgent. However, due to the heterogeneities of the underlying legacy information systems, existing integration techniques fall short in enabling the automated sharing of data among the participating enterprises. Current techniques are manual and require significant programmatic set-up. This necessitates the development of more automated solutions to enable scalable extraction of the knowledge resident in the legacy systems of a business network to support efficient sharing. Given the fact that the majority of existing information systems are based on relational database technology, I have focused on the process of knowledge extraction from relational databases. In the future, the methodologies will be extended to cover other types of legacy information sources.

Despite the fact that much effort has been invested in researching approaches to knowledge extraction from databases, no comprehensive solution has existed before this work. In our research, we have developed an automated approach for extracting schematic and semantic knowledge from relational databases. This methodology, which is based on existing data reverse engineering techniques, improves the state-of-the-art in several ways, most importantly to reduce dependency on human input and to remove some of the other limitations.

The knowledge extracted from the legacy database contains information about the underlying relational schema as well as the corresponding semantics in order to recreate the semantically rich Entity-Relationship schema that was used to create the database initially. Once extracted, this knowledge enables schema mapping and wrapper generation. In addition, other applications of this extraction methodology are envisioned, for example, to enhance existing schemas or for documentation efforts. The use of this approach can also be foreseen in extracting metadata needed to create the Semantic Web.

In this thesis, an overview of our approach will be presented. Some empirical evidence of the usefulness and accuracy of this approach will also be provided using the prototype that has been developed and is running in a testbed in the Database Research Center at the University of Florida.














CHAPTER 1
INTRODUCTION

In the current era of E-Commerce, the availability of products (for consumers or for businesses) on the Internet strengthens existing competitive forces for increased customization, shorter product lifecycles, and rapid delivery. These market forces impose a highly variable demand due to daily orders that can also be customized, with limited ability to smoothen production because of the need for rapid delivery. This drives the need for production in a supply chain. Recent research has led to an increased understanding of the importance of coordination among subcontractors and suppliers in such supply chains [3, 37]. Hence, there is a role for decision or negotiation support tools to improve supply chain performance, particularly with regard to the user's ability to coordinate pre-planning and responses to changing conditions [47].

Deployment of these tools requires integration of data and knowledge across the supply chain. Due to the heterogeneity of legacy systems, current integration techniques are manual, requiring significant programmatic set-up with only a limited reusability of code. The time and investment needed to establish connections to sources have acted as a significant barrier to the adoption of sophisticated decision support tools and, more generally, as a barrier to information integration. By enabling (semi-)automatic connection to legacy sources, the SEEK (Scalable Extraction of Enterprise Knowledge) project that is currently under way at the University of Florida is directed at overcoming the problems of integrating legacy data and knowledge in the (construction) supply chain [22-24].









1.1 Motivation

A legacy source is defined as a complex stand-alone system with either poor or non-existent documentation about the data, code or other components of the system. When a large number of firms are involved in a project, it is likely that there will be a high degree of physical and semantic heterogeneity in their legacy systems, making it difficult to connect firms' data and systems with enterprise-level decision support tools. Also, as each firm in the large production network is generally an autonomous entity, there are many problems in overcoming this heterogeneity and allowing efficient knowledge sharing among firms.

The first problem is the difference among firms' internal data storage, retrieval and representation methods. Every firm uses its own format to store and represent data in the system. Some might use professional database management systems while others might use simple flat files. Also, some firms might use a standard query language such as SQL to retrieve or update data; others might prefer manual access, while some others might have their own query language. This physical heterogeneity imposes significant barriers to integrated access methods in co-operative systems. The effort to retrieve even similar information from every firm in the network is non-trivial, as this process involves extensive study of the data stored in every firm. Thus there is little ability to understand and share another firm's data, leading to overall inefficiency.

The second problem is the semantic heterogeneity among the firms. Although a production network generally consists of firms working in a similar application domain, there is a significant difference in the internal terminology or vocabulary used by the firms. For example, different firms working in the construction supply chain might use different terms such as Activity, Task or Work-item to mean the same thing, i.e., a small but independent part of an overall construction project. The definition or meaning of the terms might be similar but the actual names used are different. This heterogeneity is present at various levels in the legacy system including the conceptual database schema, graphical user interface, application code and business rules. This kind of diversity is often difficult to overcome.

Another difficulty in accessing the firm's data efficiently and accurately is safeguarding the data against loss and unauthorized usage. It is logical for the firm to restrict the sharing of strategic knowledge including sensitive data or business rules. No firm will be willing to give full access to other firms in the network. It is therefore important to develop third-party tools that assure the privacy of the concerned firm and still extract useful knowledge.

Last but not least, the frequent need for human intervention in the existing solutions is another major problem for efficient co-operation. Often, the extraction or conversion process is manual and involves little or no automation. This makes the process of knowledge extraction costly and inefficient. It is time consuming (if not impossible) for a firm to query all the firms that may be affected by some change in the network.

Thus, it is necessary to build scalable mediator software using reusable components, which can be quickly configured through high-level specifications and will be based on a highly automated knowledge extraction process. A solution to the problem of physical, schematic and semantic heterogeneity will be discussed in this thesis. The following section introduces various approaches that can be used to extract knowledge from legacy systems, in general.









1.2 Solution Approaches

The study of heterogeneous systems has been an active research area for the past decade. At the database level, schema integration approaches and the concept of federated databases [38] have been proposed to allow simultaneous access to different database systems. The wrapper technology [46] also plays an important role with the advent and popularity of co-operative autonomous systems. Various approaches to develop some kind of a mediator system have been discussed [2, 20, 46]. Data mining [18] is another relevant research area which proposes the use of a combination of machine learning, statistical analysis, modeling techniques and database technology to find patterns and subtle relationships in data and to infer rules that allow the prediction of future results.

A lot of research is being done in the above areas and it is pertinent to leverage the already existing knowledge whenever necessary. But the common input to all of the above methods includes detailed knowledge about the internal database schema, obvious rules and constraints, and selected semantic information.

Industrial legacy database applications (LDAs) often evolve over several generations of developers, have hundreds of thousands of lines of associated application code, and maintain vast amounts of data. As mentioned previously, the documentation may have become obsolete and the original developers have left the project. Also, the simplicity of the relational model does not support direct description of the underlying semantics, nor does it support inheritance, aggregation, n-ary relationships, or time dependencies including design modification history. However, relevant information about concepts and their meaning is distributed throughout an LDA. It is therefore important to use reverse engineering techniques to recover the conceptual structure of the LDA to gain semantic knowledge about the internal data. The term Data Reverse Engineering (DRE) refers to "the use of structured techniques to reconstitute the data assets of an existing system" [1, p. 4].

As the role of the SEEK system is to act as an intermediary between the legacy data and the decision support tool, it is crucial to develop methodologies and algorithms to facilitate discovery and extraction of knowledge from legacy sources.

In general, SEEK operates as a three-step process [23]:

1. SEEK generates a detailed description of the legacy source, including entities, relationships, application-specific meanings of the entities and relationships, business rules, data formatting and reporting constraints, etc. We collectively refer to this information as enterprise knowledge.

2. The semantically enhanced legacy source schema must be mapped onto the domain model (DM) used by the application(s) that want(s) to access the legacy source. This is done using a schema mapping process that produces the mapping rules between the legacy source schema and the application domain model.

3. The extracted legacy schema and the mapping rules provide the input to the wrapper generator, which produces the source wrapper.

This thesis mainly focuses on the process described in item 1 above. This thesis also discusses the issue of knowledge representation, which is important in the context of the schema mapping process discussed in the second item. In SEEK, there are two important objectives of Knowledge Extraction in general, and Data Reverse Engineering in particular. First, all the high-level semantic information (e.g., entities, associations, constraints) extracted or inferred from the legacy source can be used as an input to the schema mapping process. This knowledge will also help in verifying the domain ontology. Second, the source-specific information (e.g., relations, primary keys, datatypes, etc.) can be used to convert wrapper queries into actual source queries.









1.3 Challenges and Contributions

Formally, data reverse engineering is defined as the application of analytical techniques to one or more legacy data sources to elicit structural information (e.g., term definitions, schema definitions) from the legacy source(s) in order to improve the database design or to produce missing schema documentation [1]. There are numerous challenges in the process of extracting the conceptual structure from a database application with respect to the objectives of SEEK, which include the following:

- Due to the limited ability of the relational model to express semantics, many details of the initial conceptual design are lost when converted to relational database format. Also, often the knowledge is spread throughout the database system. Thus, the input to the reverse engineering process is not straightforwardly simple or fixed.

- The legacy database belonging to the firm typically cannot be changed in accordance with the requirements of our extraction approach and hence the algorithm must impose minimum restrictions on the input source.

- Human intervention in terms of user input or domain expert comments is typically necessary and, as Chiang et al. [9, 10] point out, the reverse engineering process cannot be fully automated. However, this approach is inefficient and not scalable and we attempt to reduce human input as much as possible.

- Due to maintenance activity, essential component(s) of the underlying databases are often modified or deleted so that it is difficult to infer the conceptual structure. The DRE algorithm needs to minimize this ambiguity by analyzing other sources.

- Traditionally, reverse engineering approaches concentrate on one specific component in the legacy system as the source. Some methods extensively study the application code [55] while others concentrate on the data dictionary [9]. The challenge is to develop an algorithm that investigates every component (such as the data dictionary, data instances, application code), extracting as much information as possible.

- Once developed, the DRE approach should be general enough to work with different relational databases with only minimum parameter configuration.

The most important contribution of this thesis will be the detailed discussion and comparison of the various database reverse engineering approaches, logically followed by the design of our Schema Extraction (SE) algorithm. The design tries to meet the majority of the challenges discussed above. Another contribution will be the implementation of the SE prototype including the experimental evaluation and feasibility study. Finally, this thesis also includes the discussion of suitable representations for the extracted enterprise knowledge and possible future enhancements.

1.4 Organization of Thesis

The remainder of this thesis is organized as follows. Chapter 2 presents an overview of the related research in the field of knowledge discovery in general and database reverse engineering in particular. Chapter 3 describes the SEEK-DRE architecture and our approach to schema extraction. It also gives the overall design of our algorithm. Chapter 4 is dedicated to the implementation details including some screen snapshots. Chapter 5 describes the experimental prototype and results. Finally, Chapter 6 concludes the thesis with the summary of our accomplishments and issues to be considered in the future.














CHAPTER 2
RELATED RESEARCH

Problems such as Y2K and European currency conversion have shown how little we understand the data in our computer systems. In our world of rapidly changing technology, there is a need to plan business strategies very early and with much information and anticipation. The basic requirement for strategic planning is the data in the system. Many organizations in the past have been successful at leveraging the use of the data. For example, the frequent flier program from American Airlines and the Friends-family program from MCI have been the trendsetters in their field and could only be realized because their parent organizations knew where the data was and how to extract information from it.

The process of extracting the data and knowledge from a system logically precedes the process of understanding it. As we have discussed in the previous chapter, this collection or extraction process is non-trivial and requires manual intervention. Generally the data is present at more than one location in the system and has lost much of its semantics. So the important task is to recover these semantics that provide vital information about the system and allow mapping between the system and the general domain model. The problem of extracting knowledge from the system and using it to overcome the heterogeneity between the systems is an important one. Major research areas that try to answer this problem include database reverse engineering, data mining, wrapper generation and data modeling. The following sections will summarize the state-of-the-art in each of these fields.









2.1 Database Reverse Engineering

Generally all the project knowledge in the firm or the legacy source trickles down to the database level where the actual data is present. Hence the main goal is to be able to mine schema information from these database files. Specifically, the field of Database Reverse Engineering (DRE) deals with the problem of comprehending existing database systems and recovering the semantics embodied within them [10].

The concept of database reverse engineering is shown in Figure 2-1. The original design or schema undergoes a series of semantic reductions while being converted into the relational model. We have already discussed the limited ability of the relational model to express semantics, and when regular maintenance activity is considered, a part of the important semantic information generally gets lost. The goal is to recover that knowledge and validate it with the domain experts to recover a high-level model.


Figure 2-1 The Concept of Database Reverse Engineering.


The DRE literature is divided into three areas: translation algorithms and methodologies, tools, and application-specific experiences. Translation algorithm development in early DRE efforts involved manual rearrangement or reformatting of data fields, which is inefficient and error-prone [12]. The relational data model provided theoretical support for research in automated discovery of relational dependencies [8]. In the early 1980s, focus shifted to recovering E/R diagrams from relations [40]. Given early successes with translation using the relational data model, DRE translation was applied to flat file databases [8, 13] in domains such as enterprise schemas [36]. Due to prior establishment of the E/R model as a conceptual tool, reengineering of legacy RDBMS to yield E/R models motivated DRE in the late 1980s [14]. Information content analysis was also applied to RDBMS, allowing a more effective gathering of high-level information from data [5].

DRE in the 1990s was enhanced by cross-fertilization with software engineering. In Chikofsky [11], a taxonomy for reverse engineering included DRE methodologies and also highlighted the available DRE tools. DRE formalisms were better defined, motivating increased DRE interaction with users [21]. The relational data model continued to support extraction of E/R models and schemas from RDBMS [39]. Application focus emphasized legacy systems, including DoD applications [44].

In the late 1990s, object-oriented DRE researched the discovery of objects in legacy systems using function-, data-, and object-driven objectification [59]. Applications of DRE increased, particularly in Y2K bug identification and remediation. Recent DRE focus is more applicative, e.g., mining large data repositories [15], analysis of legacy systems [31] or network databases [43], and extraction of business rules from legacy systems [54]. Current research focuses on developing more powerful DRE tools, refining heuristics to yield fewer missing constructs, and developing techniques for reengineering legacy systems into distributed applications.

Though a large body of researchers agree that database reverse engineering is useful for leveraging data assets, reducing maintenance costs, facilitating technology transition and increasing system reliability, the problem of choosing a method for the reverse engineering of a relational database is not trivial [33]. The input for these reverse engineering methods is one implementation issue. Database designers, even experts, occasionally violate rules of sound database design. In some cases, it is impossible to produce an accurate model because it never existed. Also, different methods have different input requirements and each legacy system has its particular characteristics that restrict information availability.

A wide range of Database Reverse Engineering methods is known, each of them exhibiting its own methodological characteristics, producing its own outputs and requiring specific inputs and assumptions. We now present an overview of the major approaches, each of which is described in terms of input requirements, methodology, output, major advantages and limitations. Although this overview is not completely exhaustive, it discusses the advantages and the limitations of current approaches and provides a solid base for defining the exact objectives of our DRE algorithm.

Chiang et al. [9, 10] suggest an approach that requires the data dictionary as an input. It requires all the relation names, attribute names, keys and data instances. The main assumptions include consistent naming of attributes, no errors in the values of key attributes and a 3NF format for the source schema. The first requirement is especially strict, as many of the current database systems do not maintain consistent naming of attributes. In this method, relations are first classified based upon the properties of their primary keys, i.e., the keys are compared with the keys of other relations. Then, the attributes are classified depending on whether they are the attributes of a relation's primary key, foreign key, or none. After this classification, all possible inclusion dependencies are identified by some heuristic rules and then entities and relationship types are identified based on dependencies. The main advantage of this method is a clear algorithm with a proper justification of each step. All stages requiring human input are stated clearly. But stringent requirements imposed on the input source, a high degree of user intervention and dismissal of the application code as an important source are the drawbacks of this method. Our SE algorithm discussed in the next chapter is able to impose less stringent requirements on the input source and also analyze the application code for vital clues and semantics.

Johansson [34] suggests a method to transform relational schemes into conceptual schemes using the data dictionary and the dependency information. The relational schema is assumed to be in 3NF and information about all the inclusion and functional dependencies is required as an input. The method first splits a relation that corresponds to more than one object and then adds extra relations to handle the occurrences of certain types of inclusion dependencies. Finally it collapses the relations that correspond to the same object type and maps them into one conceptual entity. The main advantage of this method is the detailed explanation about schema mapping procedures. It also explains the concept of hidden objects that is further utilized in Petit's method [51]. But this method requires all the keys and all the dependencies and thus is not realistic, as it is difficult to give this information at the start of the reverse engineering process. Markowitz et al. [39] also present a similar approach to identify the extended entity-relationship object structures in relational schemes. This method takes the data dictionary, the functional dependencies and the inclusion dependencies as inputs and transforms the relational schema into a form suitable to identify the EER object structures. If the dependencies satisfy all the rules then object interaction is determined for each inclusion dependency. Though this method presents a formalization of schema mapping concepts, it is very demanding on the user input, as it requires all the keys and dependencies.

The important insight obtained is the use of inclusion dependencies in the above methods. Both methods use the presence of inclusion dependencies as a strong indication of the existence of a relationship between entities. Our algorithm uses this important concept but it does not place the burden of specifying all inclusion dependencies on the user.

S. Navathe et al. [45] and Blaha et al. [52] emphasize the importance of user intervention. Both methods assume that the user has more than sufficient knowledge about the database. Very little automation is used to provide clues to the user.

Navathe's method [45] requires the data schema and all the candidate keys as inputs, and assumes coherency in attribute names, absence of ambiguities in foreign keys, and requires 3NF and BCNF normal form. Relations are processed and classified with human intervention and the classified relations are then mapped based on their classifications and key attributes. Special cases of non-classified relations are handled on a case-by-case basis. The drawbacks of this method include very high user intervention and strong assumptions. Comparatively, Blaha's method [52] is less stringent on the input requirements as it only needs the data dictionary and data sets. But the output is an OMT (Object Modeling Technique) model and is less relevant to our objective. This method also involves a high degree of user intervention to determine candidate keys and foreign key groups. The user, based on guidelines that include querying the data, progressively refines the OMT schema. Though the method depends heavily on domain knowledge and can be used in tricky or sensitive situations (where constant guidance is crucial for the success of the process), the amount of user participation makes it difficult to use in a general-purpose toolkit.

Another interesting approach is taken by Signore et al. [55]. The method searches for predefined code patterns to infer semantics. The idea of considering the application code as a vital source for clues and semantics is interesting to our effort. This approach depends heavily on the quality of the application code, as all the important concepts such as primary keys, foreign keys, and generalization hierarchies are finalized by these patterns found in the code. This suggests that it is more beneficial to use this method along with another reverse engineering method to verify the outcome. Our SE algorithm discussed in the next chapter attempts to implement this.

Finally, J. M. Petit et al. [51] suggest an approach that does not impose any restrictions on the input database. The method first finds inclusion dependencies from the equi-join queries in the application code and then discovers functional dependencies from the inclusion dependencies. The "restrict" algorithm is then used to convert the existing schema to 3NF using the set of dependencies and the hidden objects. Finally, the algorithm in Markowitz et al. [39] is used to convert the 3NF logical schema obtained in the last phase into an EER model. This paper presents a very sound and detailed algorithm supported by mathematical theory. The concept of using the equi-join queries in the application code to find inclusion dependencies is innovative and useful. But the main objective of this method is to improve the underlying de-normalized schema, which is not relevant to the knowledge extraction process. Furthermore, the two main drawbacks of this method are the lack of justification for some steps and the absence of a discussion about the practical implementation of the approach.

Relational database systems are typically designed using a consistent strategy. But generally, the mapping between the schemes and the conceptual model is not strictly one-to-one. This means that, while reverse engineering a database, an alternate interpretation of the structure and the data can yield different components [52]. Although in this manner multiple interpretations can yield plausible results, we have to minimize such unpredictability using the available resources. Every relational database employs a similar underlying model for organizing and querying the data, but existing systems differ in terms of the availability of information and the reliability of such information.

Therefore, it is fair to conclude that no single method can fulfill the entire range of requirements of relational database reverse engineering. The methods discussed above differ greatly in terms of their approaches, input requirements and assumptions and there is no clear preference. In practice, one must choose a combination of approaches to suit the database. Since all the methods have well-defined steps, each having a clear contribution to the overall conceptual schema, in most cases it is advisable to produce a combination of steps of different methods according to the information available [33].









In the SEEK toolkit, the effort required to generate a wrapper for different sources should be minimized, as it is not feasible to exhaustively explore different methods for different firms in the supply chain. The developed approach must be general, with a limited amount of source dependence. Some support modules can be added for different sources to use the redundant information to increase result confidence.

2.2 Data Mining

Considerable interest and work in the areas of data mining and knowledge discovery in databases (KDD) have led to several approaches, techniques and tools for the extraction of useful information from large data repositories.

The explosive growth of many business, government and scientific database systems in the last decade created the need for a new generation of technology to collect, extract, analyze and generate the data. The term knowledge discovery in databases was coined in 1989 to refer to the broad process of finding knowledge in data and to emphasize the high-level application of particular data mining methods [18]. Data mining is defined as an information extraction activity whose goal is to discover hidden facts contained in databases [18]. The basic view adopted by the research community is that data mining refers to a class of methods that are used in some of the steps comprising the overall KDD process.

The data mining and KDD literature is broadly divided into three subareas: finding patterns, rules and trends in the data; statistical data analysis; and discovery of integrated tools and applications. Early in the last decade of the 20th century, there was tremendous research on data analysis [18]. This research specifically included a human-centered approach to mine the data [6], semi-automated discovery of informative patterns, discovery of association rules [64], finding clusters in the data, extraction of generalized rules [35], etc. Many efforts then concentrated on developing integrated tools such as DBMINER [27], Darwin [48] and STORAJ [17]. Recently, focus has shifted towards application-specific algorithms. Typical application domains include healthcare and genetics, weather and astronomical surveys, and financial systems [18].

Researchers have argued that developing data mining algorithms or tools alone is insufficient for pragmatic problems [16]. Issues such as adequate computing support, strong interoperability and compatibility of the tools and, above all, the quality of the data are crucial.

2.3 Wrapper/Mediation Technology

SEEK follows established mediation/wrapper methodologies such as TSIMMIS [26] and InfoSleuth [4], and provides a middleware layer that bridges the gap between legacy information sources and decision makers/decision support applications. Generally the wrapper [49] accepts queries expressed in the legacy source language and schema and converts them into queries or requests understood by the source. One can identify several important commonalities among wrappers for different data sources, which make wrapper development more efficient and allow the data management architecture to be modular and highly scalable. These are important prerequisites for supporting numerous legacy sources, many of which have parameters or structure that could initially be unknown. Thus, the wrapper development process must be partially guided by human expertise, especially for non-relational legacy sources.

A naive approach involves hard-coding wrappers to effect a pre-wired configuration, thus optimizing code for these modules with respect to the specifics of the underlying source. However, this yields inefficient development with poor extensibility and maintainability. Instead, a toolkit such as Stanford University's TSIMMIS Wrapper Development Toolkit [26], based on translation templates written in a high-level specification language, is extremely relevant and useful. The TSIMMIS toolkit has been used to develop value-added wrappers for sources such as DBMS, online libraries, and the Web [25, 26]. Existing wrapper development technologies exploit the fact that wrappers share a basic set of source-independent functions that are provided by their toolkits. For example, in TSIMMIS, all wrappers share a parser for incoming queries, a query processor for post-processing of results, and a component for composing the result. Source-specific information is expressed as templates written in a high-level specification language. Templates are parameterized queries together with their translations, including a specification of the format of the result. Thus, the TSIMMIS researchers have isolated the only component of the wrapper that requires human development assistance, namely, the connection between the wrapper and the source, which is highly specialized and yet requires relatively little coding effort.

In addition to the TSIMMIS-based wrapper development, numerous other projects have been investigating tools for wrapper generation and content extraction, including researchers at the University of Maryland [20], USC/ISI [2], and the University of Pennsylvania [53]. Also, the artificial intelligence [58], machine learning, and natural language processing communities [7] have developed methodologies that can be applied in wrapper development toolkits to infer and learn structural information from legacy sources.

This chapter discussed the evolution of research in the fields related to knowledge extraction. The data stored in a typical organization is usually raw and needs considerable preprocessing before it can be mined or understood. Thus data mining or KDD somewhat logically follows reverse engineering, which works on extracting preliminary but very important aspects of the data. Many data mining methods [27, 28] require knowledge of the schema and hence reverse engineering methods are definitely useful. Also, the vast majority of wrapper technologies depend on information about the source to perform translation or conversion.

The next chapter will describe and discuss our database reverse engineering algorithm, which is the main topic of this thesis.














CHAPTER 3
THE SCHEMA EXTRACTION ALGORITHM

3.1 Introduction

A conceptual overview of the SEEK knowledge extraction architecture is shown in Figure 3-1 [22]. SEEK applies Data Reverse Engineering (DRE) and Schema Matching (SM) processes to legacy database(s) to produce a source wrapper for a legacy source. This source wrapper will be used by another component (not shown in Figure 3-1) to communicate and exchange information with the legacy source. It is assumed that the legacy source uses a database management system for storing and managing its enterprise data or knowledge.

First, SEEK generates a detailed description of the legacy source by extracting enterprise knowledge from it. The extracted enterprise knowledge forms a knowledge base that serves as the input for subsequent steps. In particular, the DRE module shown in Figure 3-1 connects to the underlying DBMS to extract schema information (most data sources support at least some form of Call-Level Interface such as JDBC). The schema information from the database is semantically enhanced using clues extracted by the semantic analyzer from available application code, business reports, and, in the future, perhaps other electronically available information that could encode business data such as e-mail correspondence, corporate memos, etc. It has been the experience, through visits with representatives from the construction and manufacturing domains, that such application code exists and can be made available electronically [23].
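As a concrete illustration of such a call-level interface, a configurable database interface module can be little more than a thin wrapper around a JDBC connection. The sketch below is a minimal, hypothetical version (the class name, constructor parameters and example URL are assumptions for illustration, not the SEEK implementation):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.Properties;

// Hypothetical sketch of a configurable database interface module. Centralizing
// the JDBC URL, credentials and driver options here keeps the rest of the DRE
// code source-independent; only this configuration changes for a new legacy source.
public class DatabaseInterfaceModule {
    private final String url;       // e.g., "jdbc:oracle:thin:@legacyhost:1521:orcl" (assumed)
    private final Properties props; // user, password, driver-specific settings

    public DatabaseInterfaceModule(String url, Properties props) {
        this.url = url;
        this.props = props;
    }

    /** Opens a call-level-interface (JDBC) connection to the legacy DBMS. */
    public Connection connect() throws SQLException {
        return DriverManager.getConnection(url, props);
    }
}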

Figure 3-1 The SEEK Architecture.

Second, the semantically enhanced legacy source schema must be mapped into the domain model (DM) used by the application(s) that want(s) to access the legacy source. This is done using a schema mapping process that produces the mapping rules between the legacy source schema and the application domain model. In addition to the domain model, the schema matcher also needs access to the domain ontology (DO) that describes the domain model. Finally, the extracted legacy schema and the mapping rules provide the input to the wrapper generator (not shown), which produces the source wrapper.

The three preceding steps can be formalized as follows [23]. At a high level, let a legacy source L be denoted by the tuple L = (DBL, SL, DL, QL), where DBL denotes the legacy database, SL denotes its schema, DL the data and QL a set of queries that can be answered by DBL. Note, the legacy database need not be a relational database, but can include text, flat file databases, or hierarchically formatted information. SL is expressed by the data model DML.

We also define an application via the tuple A = (SA, QA, DA), where SA denotes the schema used by the application and QA denotes a collection of queries written against that schema. The symbol DA denotes data that is expressed in the context of the application. We assume that the application schema is described by a domain model and its corresponding ontology (as shown in Figure 3-1). For simplicity, we further assume that the application query format is specific to a given application domain but invariant across legacy sources for that domain.

Let a legacy source wrapper W be comprised of a query transformation (Equation 1) and a data transformation (Equation 2):

fWQ: QA -> QL (1)

fWD: DL -> DA (2)

where the Q and D are constrained by the corresponding schemas.

The SEEK knowledge extraction process shown in Figure 3-1 can now be stated as follows. Given SA and QA for an application that attempts to access legacy database DBL whose schema SL is unknown, and assuming that we have access to the legacy database DBL as well as to application code CL that accesses DBL, we first infer SL by analyzing DBL and CL, then use SL to infer a set of mapping rules M between SL and SA, which are used by a wrapper generator WGen to produce (fWQ, fWD). In short:

DRE: (DBL, CL) -> SL (3-1)

SM: (SL, SA) -> M (3-2)

WGen: (QA, M) -> (fWQ, fWD) (3-3)
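For readability, the same formalization can be typeset with its subscripts restored. The block below is only a transcription of Equations 1, 2 and 3-1 through 3-3 above; the grouping of the wrapper subscripts (fWQ, fWD) follows the plain-text rendering and is an assumption about the intended notation.

\begin{align*}
L &= (DB_L,\; S_L,\; D_L,\; Q_L), \qquad A = (S_A,\; Q_A,\; D_A)\\
f_{WQ} &: Q_A \mapsto Q_L \tag{1}\\
f_{WD} &: D_L \mapsto D_A \tag{2}\\
\mathrm{DRE} &: (DB_L,\, C_L) \mapsto S_L \tag{3-1}\\
\mathrm{SM} &: (S_L,\, S_A) \mapsto M \tag{3-2}\\
\mathrm{WGen} &: (Q_A,\, M) \mapsto (f_{WQ},\, f_{WD}) \tag{3-3}
\end{align*}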









Thus, the DRE algorithm (Equation 3-1) is comprised of schema extraction (SE) and semantic analysis (SA). This thesis will concentrate on the schema extraction process, which extracts the schema SL by accessing DBL. The semantic analysis process supports the schema extraction process by providing vital clues for inferring SL by analyzing CL and is also crucial to the DRE algorithm. But its implementation and experimental evaluation are being carried out by my colleague in SEEK and hence will not be dealt with in detail in this thesis.

The following section focuses on the schema extraction algorithm. It also provides a brief description of the semantic analysis and code slicing research efforts, which are also being undertaken in SEEK, and presents issues regarding the integration of schema extraction and semantic analysis. Finally, the chapter concludes with a summary of the DRE algorithm.

3.2 Algorithm Design

Data reverse engineering is defined as the application of analytical techniques to one or more legacy data sources (DBL) to elicit structural information (e.g., term definitions, schema definitions) from the legacy source(s), in order to improve the database design or produce missing schema documentation. Thus far in SEEK, we are applying DRE to relational databases only. However, since the relational model has only limited ability to express semantics, in addition to the schema, our DRE algorithm generates an E/R-like representation of the entities and relationships that are not explicitly defined (but which exist implicitly) in the legacy schema SL.

More formally, DRE can be described as follows: Given a legacy database DBL defined as ({R1, R2, ..., Rn}, D), where Ri denotes the schema of the i-th relation with attributes A1, A2, ..., Am(i), keys K1, K2, ..., Km(i), and data D = {r1(R1), r2(R2), ..., rn(Rn)}, such that ri(Ri) denotes the data (extent) for schema Ri at time t. Furthermore, DBL has functional dependencies F = {F1, F2, ..., Fk} and inclusion dependencies I = {I1, I2, ..., Il} expressing relationships among the relations in DBL. The goal of DRE is to first extract {R1, R2, ..., Rn}, I, and F, and then use I, F, D, and CL to produce a semantically enhanced description of {R1, R2, ..., Rn} that includes all relationships among the relations in DBL (both explicit and implicit), semantic descriptions of the relations as well as business knowledge that is encoded in DBL and CL.

Our approach to data reverse engineering for relational sources is based on existing algorithms by Chiang et al. [9, 10] and Petit et al. [51]. However, we have improved these methodologies in several ways, most importantly to reduce the dependency on human input and to eliminate several limitations of their algorithms (e.g., assumptions of consistent naming of key attributes, legacy schema in 3NF). More details about the contributions can be found in Chapter 6.

Our DRE algorithm is divided into two parts: schema extraction and semantic analysis, which operate in interleaved fashion. An overview of the standalone schema extraction algorithm, which is comprised of six steps, is shown in Figure 3-2. In addition to the modules that execute each of the six steps, the architecture in Figure 3-2 includes three support components: the configurable Database Interface Module (upper left-hand corner) provides connectivity to the underlying legacy source; the Knowledge Encoder (lower right-hand corner) represents the extracted knowledge in the form of an XML document so that it can be shared with other components in the SEEK architecture (e.g., the semantic matcher). More details about these components can be found in Section 3.4.



Figure 3-2 The Schema Extraction Procedure.

We now describe each step of our six-step schema extraction algorithm in detail.

Step 1: Extracting Schema Information using the Dictionary Extractor











The goal of Step 1 is to obtain the relation and attribute names from the legacy source. This is done by querying the data dictionary, which is stored in the underlying database in the form of one or more system tables. The details of this step are outlined in Figure 3-3.








Figure 3-3 The Dictionary Extraction Process.

Step 1 also attempts to determine the primary key of each relation. If the data dictionary declares the primary keys, they are retrieved directly; if there is only one candidate key per entity, then that key is the primary key. Otherwise, primary key information cannot be retrieved directly from the data dictionary. In that case, the algorithm passes the set of candidate keys, along with predetermined "rule-out" patterns, to the semantic analyzer. The semantic analyzer operates on the AST of the application code to rule out certain candidate keys that cannot be the primary key. For a more detailed explanation and examples of rule-out patterns, the reader is referred to Section 3.4.
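For a relational legacy source reachable through JDBC, the dictionary queries of Step 1 map naturally onto the standard DatabaseMetaData interface. The following is a minimal sketch under that assumption (an illustration, not the prototype's code); it lists relations, their attributes and data types, declared primary keys, and any declared foreign keys, which seed the inclusion dependency step that follows.

import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.ResultSet;

// Hypothetical dictionary extractor built on standard JDBC metadata calls.
// Real legacy systems differ in which catalog information they expose; this
// sketch assumes only the java.sql.DatabaseMetaData interface.
public class DictionaryExtractor {

    /** Prints relation names, attributes, declared primary keys and foreign keys. */
    public static void extract(Connection conn) throws Exception {
        DatabaseMetaData md = conn.getMetaData();
        try (ResultSet tables = md.getTables(null, null, "%", new String[] {"TABLE"})) {
            while (tables.next()) {
                String table = tables.getString("TABLE_NAME");
                System.out.println("Relation: " + table);

                // Attribute names and data types from the data dictionary.
                try (ResultSet cols = md.getColumns(null, null, table, "%")) {
                    while (cols.next()) {
                        System.out.println("  attribute " + cols.getString("COLUMN_NAME")
                                + " : " + cols.getString("TYPE_NAME"));
                    }
                }
                // Primary keys, if the dictionary declares them.
                try (ResultSet pks = md.getPrimaryKeys(null, null, table)) {
                    while (pks.next()) {
                        System.out.println("  primary key " + pks.getString("COLUMN_NAME"));
                    }
                }
                // Declared foreign keys seed the inclusion dependency detection (Step 2).
                try (ResultSet fks = md.getImportedKeys(null, null, table)) {
                    while (fks.next()) {
                        System.out.println("  " + table + "." + fks.getString("FKCOLUMN_NAME")
                                + " references " + fks.getString("PKTABLE_NAME")
                                + "." + fks.getString("PKCOLUMN_NAME"));
                    }
                }
            }
        }
    }
}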

Step 2: Discovering Inclusion Dependencies









After extraction of the relational schema in Step 1, the schema extraction algorithm then identifies constraints to help classify the extracted relations, which represent both the real-world entities and the relationships among these entities. This is done using inclusion dependencies (INDs), which indicate the existence of interrelational constraints including class/subclass relationships.
Figure 3-4 Inclusion Dependency Mining.

Let A and B be two relations, and let X and Y be attributes or sets of attributes of A and B, respectively. An inclusion dependency A.X << B.Y denotes that the set of values appearing in A.X is a subset of the values in B.Y. Inclusion dependencies are discovered by examining all possible subset relationships between any two relations A and B in the legacy source.

As depicted in Figure 3-4, the inclusion dependency detection module obtains its input from two sources: one is the dictionary extractor (via the send/receive module), which provides the table name, column names, primary keys and foreign keys (if available) and the other is the equi-join query finder, which is a part of the code analyzer.









This module operates on the AST and provides pairs of relations and their corresponding attributes, which occur together in equi-join queries in the application code. The fact that two relations are used in a join operation is evidence of the existence of an inclusion dependency between them.

The inclusion dependency detection algorithm works as follows:

1. Create a set X of all possible pairs of relations from the set R = {R1, R2, ..., Rn};
   e.g., if we have relations P, Q, R, S then X = {(P,Q), (P,R), (P,S), (Q,R), (Q,S),
   (R,S)}. Intuitively, this set will contain pairs of relations for which inclusion
   dependencies have not been determined yet. In addition, we maintain two (initially
   empty) sets of possible (POSSIBLE) and final (FINAL) inclusion dependencies.

2. If foreign keys have been successfully extracted, do the following for each foreign
key constraint:

a. Identify the pair of participating relations, i.e., the relation to which the FK
belongs and the relation to which it is referring.

b. Eliminate the identified pair from set X.

c. Add the inclusion dependency involving this FK to the set FINAL.

3. If equi-join queries have been extracted from the code, do the following for each
equi-join query:

a) Identify the pair of participating relations.

b) Check the direction of the resulting inclusion dependency by querying the
   data. In order to check the direction of an inclusion dependency, we use a
   subset test described in Appendix B.

c) If the above test is conclusive, eliminate the identified pair from set X and
   add the resulting inclusion dependency to the set FINAL.

d) If the test in step b) is inconclusive (i.e., the direction cannot be finalized)
add both candidate inclusion dependencies to the set POSSIBLE.

4. For each pair p remaining in X, identify attributes or attribute combinations that
   have the same data type. Check whether the subset relationship exists by using the
   subset test described in Appendix B. If so, add the inclusion dependency to the set
   POSSIBLE. If, at the end of this step, no inclusion dependency has been added to
   the POSSIBLE set for p, delete p from X; otherwise, leave p in X for user
   verification.









5. For each inclusion dependency in the set POSSIBLE, do the following:

a) If the attribute names on both sides are equal, assign the rating "High".

b) If the attribute name on the left side of the inclusion dependency is related
   (based on common substrings) to the table name on the right-hand side,
   assign the rating "High".

c) If neither condition is satisfied, assign the rating "Low".

6. For each pair in X, present the inclusion dependencies and their ratings in the set
POSSIBLE to the user for final determination. Based on the user input, append
the valid inclusion dependencies to the set FINAL.

The worst-case complexity of this exhaustive search, given N tables and M attributes per table (NM total attributes), is O(N²M²). However, we have reduced the search space in those cases where we can identify equi-join queries in the application code. This allows us to limit our exhaustive searching to only those relations not mentioned in the extracted queries. As a result, the average-case complexity of the inclusion dependency finder is much smaller. For example, the detection of one foreign key constraint in the data dictionary or one equi-join query in the application code allows the algorithm to eliminate the corresponding relation pair(s) from the search space. Hence, if K foreign key constraints and L equi-join queries (involving pairs different from the pairs involved in foreign key constraints) are detected, the average complexity is O((N² - K - L)M²). In the best case, when K + L equals the number of all possible pairs of relations, the inclusion dependency detection can be performed in constant time, O(1).
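To make the exhaustive part of the search (Step 4 above) concrete, the following is a minimal sketch, written in modern Java over JDBC, of how the candidate-pair enumeration and the subset test could be realized. It is not the SEEK implementation: the NOT EXISTS query is only one possible phrasing of the subset test described in Appendix B, the relation and column names are supplied by the caller, and data-type prefiltering is assumed to have happened already.

```java
import java.sql.*;
import java.util.*;

/** Illustrative sketch of Step 4 of the inclusion-dependency search (not the SEEK code). */
public class IndSearchSketch {

    /** Subset test: true if every non-null value of a.x also appears in b.y. */
    static boolean isSubset(Connection con, String a, String x, String b, String y)
            throws SQLException {
        // One possible phrasing of the test; the actual query is given in Appendix B.
        String sql = "SELECT COUNT(*) FROM " + a + " s WHERE s." + x + " IS NOT NULL "
                   + "AND NOT EXISTS (SELECT 1 FROM " + b + " t WHERE t." + y + " = s." + x + ")";
        Statement st = con.createStatement();
        ResultSet rs = st.executeQuery(sql);
        rs.next();
        boolean subset = (rs.getInt(1) == 0);   // no counter-example found
        rs.close();
        st.close();
        return subset;
    }

    /** For a pair (A, B) still in X, propose inclusion dependencies between type-compatible columns. */
    static List<String> proposeInds(Connection con, String a, String[] aCols,
                                    String b, String[] bCols) throws SQLException {
        List<String> possible = new ArrayList<String>();
        for (String x : aCols) {
            for (String y : bCols) {
                // Data types and maximum lengths are assumed to have been matched beforehand,
                // which is what reduces the M*M factor to M*(M-T).
                if (isSubset(con, a, x, b, y)) {
                    possible.add(a + "." + x + " << " + b + "." + y);
                }
            }
        }
        return possible;   // entries would then be rated High/Low and presented to the user
    }
}
```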

Additionally, factors such as matching data types and matching maximum lengths of attributes (e.g., varchar(5)) are used to reduce the number of queries issued against the database (Step 4) to check the subset relationship between attributes. If the attributes in a pair of relations have T mutually different data types, then the M² factor reduces to M(M - T).









Finally, it is important to note that DRE is a build-time activity, and hence its computational complexity is not a crucial issue.

Step 3: Classification of the Relations

When reverse engineering a relational schema, it is important to understand that due to the limited ability of the relational model to express semantics, all real-world entities are represented as relations irrespective of their types and roles in the model. The goal of this step is to identify the different types of relations; some of these will correspond to actual real-world entities while others will represent relationships among the entities.

Identifying the different relations is done using the primary key information obtained in Step 1 and the inclusion dependencies obtained in Step 2. Specifically, if consistent naming is used, the primary key of each relation is compared with the primary keys of the other relations to identify strong or weak entity-relations and specific or regular relationship-relations. Otherwise, we use the inclusion dependencies, which provide vital clues.

Intuitively, a strong entity-relation represents a real-world entity whose members can be identified exclusively through its own properties. A weak entity-relation, on the other hand, represents an entity that has no properties of its own that can be used to uniquely identify its members. In the relational model, the primary keys of weak entity-relations usually contain primary key attributes from other (strong) entity-relations.

Intuitively, both regular and specific relations represent relationships between two entities in the real world rather than the entities themselves. However, there are instances when not all of the entities participating in an n-ary relationship are present in the









database schema (e.g., one or more of the relations were deleted as part of the normal database schema evolution process). While reverse engineering the database, we identify such relationships as specific relations. Specifically, the primary key of a specific relation is only partially formed by the primary keys of the participating (strong or weak) entity-relations, whereas the primary key of a regular relation is made up entirely of the primary keys of the participating entity-relations.

More formally, Chiang et al. [10] define the four relation types as follows:

A strong entity relation is a relation whose primary key (PK) does not properly
contain a key attribute of any other relation.

A weak entity relation p is a relation which satisfies the following three
conditions:

1. A proper subset of p's PK contains key attributes of other strong or weak
   entity relations;

2. The remaining attributes of p's PK do not contain key attributes of any other
relation; and

3. p has an identifying owner and properly contains the PK of its owner
relation. User input is required to confirm these relationships.

A regular relation has a PK that is formed by concatenating the PKs of other
(strong or weak) entity relations.

A specific relation r is a relation which satisfies the following two conditions:

1. A proper subset of r's PK contains key attributes of other strong or weak
   entity relations;

2. The remaining attributes of r's PK do not contain key attributes of any other
   relation.

Classification of relations proceeds as follows. Initially, strong and weak entity-relations are classified. For a weak entity-relation, the primary key must be composite and part of it must be the primary key of an already identified strong entity-relation; the remaining part of the key must not be a primary key of any other relation. Finally, regular









and specific relations are discovered. This is done by checking the primary keys of the remaining unclassified relations for complete or partial presence of primary keys of already identified entity-relations.
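Under the assumption of consistent key naming, these comparisons reduce to set operations on primary-key attribute names. The sketch below is a deliberately simplified illustration of that idea and is not the prototype's code: it ignores the inclusion-dependency refinement and the user confirmation of identifying owners, and it conflates weak entity-relations and specific relations into one undecided case.

```java
import java.util.*;

/** Simplified classification of relations by primary-key comparison (illustrative only). */
public class RelationClassifier {

    /** pks maps each relation name to the set of attribute names forming its primary key. */
    static Map<String, String> classify(Map<String, Set<String>> pks) {
        Map<String, String> type = new LinkedHashMap<String, String>();
        for (String r : pks.keySet()) {
            boolean sharesKeyAttribute = false;   // some PK attribute also keys another relation
            boolean pkCoveredByOthers = true;     // every PK attribute keys some other relation
            for (String attr : pks.get(r)) {
                boolean foundElsewhere = false;
                for (String other : pks.keySet()) {
                    if (!other.equals(r) && pks.get(other).contains(attr)) {
                        foundElsewhere = true;
                        break;
                    }
                }
                if (foundElsewhere) sharesKeyAttribute = true;
                else pkCoveredByOthers = false;
            }
            if (!sharesKeyAttribute) {
                type.put(r, "strong entity-relation");
            } else if (pkCoveredByOthers) {
                type.put(r, "regular relationship-relation");   // PK built entirely from other PKs
            } else {
                // PK only partially built from other PKs: a weak entity-relation if an identifying
                // owner exists (confirmed by the user or by inclusion dependencies), else specific.
                type.put(r, "weak entity-relation or specific relation");
            }
        }
        return type;
    }
}
```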

Step 4: Classification of the Attributes

In this step, the attributes of each relation are classified into one of four groups, depending on whether they can be used as keys for entities, weak entities, relationships, etc. Attribute classification is based on the type of the parent relation and on the presence of inclusion dependencies that involve these attributes (a small classification sketch follows the list):

* Primary key attributes (PKA) are attributes that uniquely identify the tuples in a
relation.

* Dangling key attributes (DKA) are attributes belonging to the primary key of a
weak entity-relation or a specific relation that do not appear as the primary key of
any other relations.

* Foreign key attributes (FKA) are attributes in R1 referencing R2 if

  1. these attributes of R1 have the same domains as the primary key attributes
     PK of R2; and

  2. for each t1 in r(R1) and t2 in r(R2), either t1[FK] = t2[PK] or t1[FK] is null.

* Non-key attributes (NKA) are those attributes that cannot be classified as PKA,
DKA, or FKA.
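As a compact illustration of the four groups, the sketch below assigns a single label to an attribute given three precomputed flags. It is illustrative only: in the prototype an attribute can play more than one role (e.g., a dangling key attribute is also part of its relation's primary key), whereas the sketch simply returns the most specific label.

```java
/** Illustrative attribute classification for Step 4; all three flags are assumed precomputed. */
public class AttributeClassifier {

    /**
     * partOfPrimaryKey  - the attribute belongs to the primary key of its relation
     * dangling          - it belongs to the PK of a weak entity-relation or specific relation
     *                     and does not appear as (part of) the primary key of any other relation
     * referencesOtherPk - its values reference the primary key of another relation
     *                     (same domain, containment verified through an inclusion dependency)
     */
    static String classify(boolean partOfPrimaryKey, boolean dangling, boolean referencesOtherPk) {
        if (partOfPrimaryKey && dangling) return "DKA";   // dangling key attribute
        if (partOfPrimaryKey)             return "PKA";   // primary key attribute
        if (referencesOtherPk)            return "FKA";   // foreign key attribute
        return "NKA";                                     // non-key attribute
    }
}
```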

Step 5: Identification of Entity Types

The schema extraction algorithm now begins to map relational concepts into corresponding E/R model concepts. Specifically, the strong and weak entity relations identified in Step 3 are classified as strong or weak entities, respectively. Furthermore, each weak entity is associated with its owner entity. The association, which includes the identification of proper keys, is done as follows:

* Each weak entity relation is converted into a weak entity type. The dangling key
attribute of the weak entity relation becomes the key attribute of the entity.









* Each strong entity relation is converted into a strong entity type.


Step 6: Identification of Relationship Types

The inclusion dependencies discovered in Step 2 form the basis for determining

the relationship types among the entities identified above. This is a two-step process:

1. Identify relationships present as relations in the relational database. The
   relationship-relations (regular and specific) obtained from the classification of
   relations (Step 3) are converted into relationships. The participating entity types
   are derived from the inclusion dependencies. For completeness of the extracted
   schema, we can decide to create a new entity when conceptualizing a specific
   relation. The cardinality of this type of relationship is M:N, i.e., many-to-many.

2. Identify relationships among the entity types (strong and weak) that were not
present as relations in the relational database, via the following classification.

* IS-A relationships can be identified using the PKAs of strong entity relations
  and the inclusion dependencies among PKAs. If there is an inclusion
  dependency in which the primary key of one strong entity-relation refers to
  the primary key of another strong entity-relation, then an IS-A relationship
  between those two entities is identified. The cardinality of the IS-A
  relationship between the corresponding strong entities is 1:1.

* Dependent relationships: For each weak entity type, the owner is determined
  by examining the inclusion dependencies involving the corresponding weak
  entity-relation as follows: we look for an inclusion dependency whose
  left-hand side contains part of the primary key of this weak entity-relation.
  When we find such an inclusion dependency, the owner entity can be easily
  identified by looking at the right-hand side of the inclusion dependency. As
  a result, a binary relationship between the owner (strong) entity type and the
  weak entity is created. The cardinality of the dependent relationship between
  the owner and the weak entity is 1:N.

* Aggregate relationships: If a foreign key in any of the regular or specific
  relations refers to the PKA of one of the strong entity relations, an aggregate
  relationship is identified. An inclusion dependency must exist from this
  (regular or specific) relation on the left-hand side, which refers to some
  strong entity-relation on the right-hand side. The aggregate relationship is
  between the relationship (which must previously have been conceptualized
  from a regular/specific relation) and the strong entity on the right-hand side.
  The cardinality of the aggregate relationship between the strong entity and
  the aggregate entity (an M:N relationship and its participating entities at the
  conceptual level) is as follows: if the foreign key contains unique values,
  then the cardinality is 1:1; otherwise, the cardinality is 1:N.









* Other binary relationships: Other binary relationships are identified from the
  FKAs not used in identifying the above relationships. When an FKA of a
  relation refers to the primary key of another relation, a binary relationship is
  identified. The cardinality of the binary relationship between the entities is
  as follows: if the foreign key contains unique values, then the cardinality is
  1:1; otherwise, the cardinality is 1:N (a query sketch for this uniqueness
  check follows the list).
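The uniqueness check that decides between 1:1 and 1:N can be issued as a single query over the foreign-key column; the sketch below shows one possible formulation, not the prototype's exact query, and a composite foreign key would need all of its columns in the GROUP BY clause.

```java
import java.sql.*;

/** Illustrative cardinality test: 1:1 if the foreign-key column holds no duplicate non-null value. */
public class CardinalitySketch {

    static String cardinality(Connection con, String table, String fkColumn) throws SQLException {
        String sql = "SELECT " + fkColumn + " FROM " + table
                   + " WHERE " + fkColumn + " IS NOT NULL"
                   + " GROUP BY " + fkColumn + " HAVING COUNT(*) > 1";
        Statement st = con.createStatement();
        ResultSet rs = st.executeQuery(sql);
        boolean duplicates = rs.next();   // any returned group means the FK value repeats
        rs.close();
        st.close();
        return duplicates ? "1:N" : "1:1";
    }
}
```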

At the end of Step 6, the schema extraction algorithm will have extracted the following schema information from the legacy database:

* Names and classification of all entities.

* Names of all attributes.

* Primary and foreign keys.

* Data types.

* Simple constraints (e.g., Null, Unique) and explicit assertions.

* Relationships and their cardinalities.

3.3 Related Issue - Semantic Analysis

The design and implementation of semantic analysis and code slicing are the subject of a companion thesis and hence will not be elaborated in detail here. Instead, the main concepts will be briefly outlined.

Generation of an Abstract Syntax Tree (AST) for the Application Code: Semantic Analysis begins with the generation of an abstract syntax tree (AST) for the legacy application code. The AST will be used by the semantic analyzer for code exploration during code slicing.

The AST generator for C code consists of two major components: the lexical analyzer and the parser. The lexical analyzer for application code written in C reads the source code line by line and breaks it up into tokens. The C parser reads in these tokens and builds an AST for the source code in accordance with the language grammar. The









above approach works well for procedural languages such as C, but when applied directly to object-oriented languages, it greatly increases the computational complexity of the problem.

In practice, most of the application code written for databases is written in Java, making it necessary to develop an algorithm to infer semantic information from Java application code. Unfortunately, the grammar of an object-oriented language tends to be complex compared with that of procedural languages such as C. Several tools, such as lex or yacc, can be employed to implement the parser. Our objective in AST generation is to be able to associate meaning with program variables. For example, format strings in input/output statements contain semantic information that can be associated with the variables in the input/output statement. Such a program variable may in turn be associated with a column of a table in the underlying legacy database. These and the other functions of the semantic analyzer are described in detail in Hammer et al. [23, 24].
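As a small, hypothetical illustration of the clue being exploited (the table and column names are borrowed from the MS-Project example in Chapter 4, and this fragment is not taken from any real legacy application), consider database application code in which an output statement labels a variable that was filled from a query result; the string literal carries the human-readable meaning that can then be attached to the underlying column.

```java
import java.sql.*;

/** Hypothetical fragment of legacy application code of the kind mined by the semantic analyzer. */
public class ReportSnippet {

    static void printTaskFinish(Connection con, int taskUid) throws SQLException {
        PreparedStatement ps =
            con.prepareStatement("SELECT TASK_FINISH_DATE FROM MSP_TASKS WHERE TASK_UID = ?");
        ps.setInt(1, taskUid);
        ResultSet rs = ps.executeQuery();
        if (rs.next()) {
            Date finish = rs.getDate(1);
            // The output string below ties the variable 'finish', and through the query the
            // column MSP_TASKS.TASK_FINISH_DATE, to the meaning "Task termination date".
            System.out.println("Task termination date: " + finish);
        }
        rs.close();
        ps.close();
    }
}
```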

Code Analysis: The objective of code analysis is threefold: (1) augment the entities extracted in the schema extraction step with domain semantics, (2) extract queries that help validate the existence of relationships among entities, and (3) identify business rules and constraints that are not explicitly stored in the database but may be important to the wrapper developer or to an application program accessing the legacy source. Our approach to code analysis is based on code mining, which includes slicing [32] and pattern matching [50].

The mining of semantic information from source code assumes that the application code contains output statements that support report generation or the display of query results. From the output message strings that usually describe a displayed variable v, semantic information about v can be obtained. This implies locating (tracing) the









statement s that assigns a value to v. Since s can be associated with the result set of a query q, we can associate v's semantics with a particular column of the table being accessed in q.

For each of the slicing variables identified by the pre-slicer, the code slicer and analyzer are applied to the AST. The code slicer traverses the AST in pre-order and retains only those nodes that contain the slicing variable in their sub-tree. The reduced AST constructed by the code slicer is then sent to the semantic analyzer, which extracts the data type, meaning, business rules, column name, and table name that can be associated with the slicing variable. The results of semantic analysis are appended to a result file and the slicing variable is stored in the metadata repository. Since code analysis is a build-time activity, accuracy of the results, rather than running time, is the more critical factor.


























Figure 3-5 The Code Analysis Process.



After the code slicer and the analyzer have been invoked on all slicing variables, the result generator examines the result file and replaces the variables in the extracted business rules with the semantics from their associated output statements, if possible. The results of code analysis up to this point are then presented to the user, who can view the results and decide whether further code analysis is needed. If further analysis is requested, the user is presented with a list of variables that occur in input, output and SQL statements, together with all the slicing variables from the previous passes.

Having described schema extraction and semantic analysis, we now focus on the interaction between the two processes. The next subsection provides insight into this integration, and the chapter concludes with the integrated system









design diagram and a description of its support components. For more detailed information about code analysis, the reader is referred to Hammer et al. [23, 24].

3.4 Interaction

There are five places in the execution of the integrated DRE algorithm where the schema extraction process (SE) and the semantic analyzer (SA) need to interact; they are as follows:

1. Initially, the SA generates the AST of the application code CL. After successful
   generation of the AST, execution control is transferred to the dictionary
   extractor module of SE.

2. If complete information about primary keys is not found in the database
   dictionary, then the dictionary extractor requests clues from the semantic
   analyzer. The algorithm passes the set of candidate keys along with predefined
   rule-out patterns to the code analyzer. The code analyzer searches for these
   patterns in the application code (i.e., in the AST) and eliminates from the
   candidate set those attribute combinations that occur in a rule-out pattern. The
   rule-out patterns, which are expressed as SQL queries, occur in the application
   code whenever the programmer expects to select a SET of tuples; by the
   definition of a primary key, this rules out the possibility that the attributes
   a1, ..., an form a primary key (a sketch of this pruning step appears after the
   list). Three sample rule-out patterns are:

a) SELECT DISTINCT ...
   FROM T
   WHERE a1 = <scalar expression 1> AND a2 = <scalar expression 2> AND ...
   AND an = <scalar expression n>

b) SELECT ...
   FROM T
   WHERE a1 = <scalar expression 1> AND a2 = <scalar expression 2> AND ...
   AND an = <scalar expression n>
   GROUP BY ...

c) SELECT ...
   FROM T
   WHERE a1 = <scalar expression 1> AND a2 = <scalar expression 2> AND ...
   AND an = <scalar expression n>
   ORDER BY ...

3. After the dictionary extraction, the execution control is transferred to the semantic
analyzer to carry out code slicing on all the possible SQL variables and other
input-output variables. Relation names and attribute names generated in the
schema extraction process can guide this step (e.g., the code slicer can concentrate
on SQL variables whose names in the database are already known).

4. Once the code slicing is completed within a pre-specified level of confidence,
control returns back to schema extraction where inclusion dependency detection is
invoked.

5. The inclusion dependency detector requests equi-join queries from the semantic
   analyzer, which searches the AST for typical SELECT-FROM-WHERE clauses that
   include one or more equality conditions on the attributes of two relations. After
   finding all the possible pairs of relations, the semantic analyzer returns each pair
   and the corresponding attributes to the inclusion dependency finder, which uses
   them as one source for detecting inclusion dependencies.
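As a simple illustration of interaction point 2, the sketch below prunes a candidate-key set using the attribute combinations found in detected rule-out queries. The data structures and method are hypothetical; in the actual system, the exchange happens between SE and the semantic analyzer through the AST and the arrays described in Chapter 4.

```java
import java.util.*;

/** Illustrative pruning of candidate keys using detected rule-out patterns (hypothetical API). */
public class RuleOutPruning {

    /**
     * candidateKeys: each entry is the attribute set of one candidate key of a relation.
     * ruleOutSets:   attribute sets found fixed by equality conditions in a detected
     *                rule-out query (SELECT DISTINCT / GROUP BY / ORDER BY).
     */
    static List<Set<String>> prune(List<Set<String>> candidateKeys, List<Set<String>> ruleOutSets) {
        List<Set<String>> surviving = new ArrayList<Set<String>>();
        for (Set<String> key : candidateKeys) {
            boolean eliminated = false;
            for (Set<String> ruledOut : ruleOutSets) {
                // The programmer expected a SET of tuples while fixing exactly these
                // attributes, so they cannot form a primary key.
                if (ruledOut.equals(key)) { eliminated = true; break; }
            }
            if (!eliminated) surviving.add(key);
        }
        return surviving;
    }
}
```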

After the execution of the integrated algorithm, the extracted information will contain business rules and the semantic meaning of some of the attributes, in addition to the SE output.
































Figure 3-6 DRE Integrated Architecture.

Figure 3-6 presents a schematic diagram of the integrated DRE architecture. The legacy source DBL consists of legacy data DL and legacy code CL.

The DRE process begins by generating the AST from CL. The dictionary extractor then accesses DL via the Database Interface Module and extracts preliminary information about the underlying relational schema. The configurable Database Interface Module (upper left-hand corner) is the only source-specific component in the architecture; in order to perform knowledge extraction from different sources, only the interface module needs to be modified. The code analysis module then performs slicing on the generated AST and stores information about the variables in the result file. Control is then transferred back to SE to execute the remaining steps. Finally, the Knowledge Encoder (lower right-hand corner) represents the extracted knowledge in the









form of an XML document so that it can be shared with other components in the SEEK architecture (e.g., the semantic matcher).

Additionally, the Metadata Repository is internal to DRE and is used to store intermediate run-time information needed by the algorithms, including user input parameters, the abstract syntax tree for the code (e.g., from a previous invocation), etc.

3.5 Knowledge Representation

The schema extraction and semantic analysis collectively generate information about the underlying legacy source DBL. After each step of the DRE algorithm, some knowledge is extracted from the source. At the end of the DRE process, the extracted knowledge can be classified into three types. First, there is detailed information about the underlying relational schema; the information about relation names, attribute names, data types, simple constraints, etc. is useful for query transformation at the wrapper fWQ (Equation 1 in Section 3.1). Second, information about the high-level conceptual schema inferred from the relational schema is available; this includes the entities, their identifiers, the relationships among the entities, their cardinalities, etc. Finally, some business rules and the high-level meaning of some attributes extracted by the SA are also available. This knowledge must be represented in a format that is not only computationally tractable and easy to manipulate, but that also supports intuitive human understanding.

The representation of knowledge in general, and of semantics in particular, has been an active research area for the past five years. With the advent of XML 1.0 as the universal format for structured documents and data in 1998 [60], various technologies such as XML Schema, RDF [61], the Semantic Web [62], MathML [63], and BRML [19] followed. Each of these technologies was developed for, and is preferred in, specific applications.









For example, RDF provides a lightweight ontology system to support the exchange of knowledge on the Web, while MathML is a low-level specification for describing mathematics as a basis for machine-to-machine communication. Our preliminary survey concludes that, considering the variety of knowledge that is being (and will be) extracted by DRE, no single one of these technologies is sufficient for representing the entire range. The choice is either to combine two or more standards or to devise our own format. The advantages of the former are the availability of proven technology and tools and compatibility with other SEEK-like systems, while the advantages of our own format would be efficiency and ease of encoding.

We do not rule out a different format in the future, but the best choice in the current scenario is XML, since it is a simple yet robust language for representing and manipulating data. Many of the technologies mentioned above use XML syntax.

The knowledge encoder takes an XML DTD as input and encodes the extracted information to produce an XML document. The entire XML DTD, along with the resulting XML document, is shown in the appendices. The DTD has a very intuitive tree-like structure. It consists of three parts: relational schema, conceptual schema, and business rules. The first part provides detailed information about each relation and its attributes. The second part provides information about entities and relationships. The business rules are presented in the third part.

Instead of encoding the extracted information after every step (which could result in inconsistencies, since the DRE algorithm refines some of its intermediate outcomes along the way), the encoding is performed as the final step, which also allows consistency checking.








In this chapter, we have presented a detailed description of the schema extraction algorithm with all the support processes and components. The next chapter describes the implementation of a working prototype of the DRE algorithm.














CHAPTER 4
IMPLEMENTATION

The entire schema extraction process and the overall DRE algorithm were delineated and discussed in detail in the previous chapter. We now describe how the SE prototype actually implements the algorithm and encodes the extracted information into an XML document, focusing on the data structures and the flow of execution. We also present an example with sample screen snapshots.

The SE prototype is implemented using the Java SDK 1.3 from Sun Microsystems. The other major software tool used in our implementation is the Oracle XML Parser. For testing and experimental evaluation, two different database management systems, Microsoft Access and Oracle, have been used.

4.1 Implementation Details

The SE working prototype takes a relational data source as input. The input requirements can be elaborated as follows:

1. The source is a relational data source and its schema is available.

2. A JDBC connection to the data source is possible. (This is not a strict requirement,
   since Sun's JDBC driver download page provides the latest drivers for almost all
   database systems, such as Oracle, Sybase, IBM DB2, Informix, Microsoft Access,
   Microsoft SQL Server, etc. [57])

3. The database can be queried using SQL.

In summary, the SE prototype is general enough to work with different relational databases with only minimal changes to the parameter configuration in the Adapter module shown in the next figure.















Figure 4-1 Schema Extraction Code Block Diagram.

Figure 4-1 shows the code block diagram of the SE prototype. The Adapter module connects to the database and is the only module that contains actual queries to the database. It is therefore the only module that has to be changed in order to connect the SE prototype to a different database system; details about these changes are discussed in the configuration section later. The Extractor module executes Step 1 of the SE algorithm; at the end of that step, all the necessary information has been extracted from the database. The Analysis module works on this information to perform Steps 2, 3 and 4 of the SE algorithm; it also interacts with the Semantic Analyzer module to obtain the equi-join queries. The Inference module identifies the entities and relationships (Steps 5 and 6 of SE). All of these modules store the resulting knowledge in a common data structure, which is a collection of object instances of predetermined classes. These classes not only store information about the underlying relational database











but also keep track of newly inferred conceptual information. We now highlight the implementation of the SE algorithm.

SE-1 Dictionary Extractor: This step accesses the data dictionary and tries to extract as much information as possible. The database schema is queried using the JDBC API to obtain all the relation names, attribute names, data types, simple constraints and key information. Every query in the Extractor module is a method invocation that ultimately executes primitive SQL queries in the Adapter module; thus, a general API is created for the Extractor module. The extracted information is stored in internal objects. For every relation, we create an object whose structure is consistent with the final XML representation. The representation makes it easy to identify whether an attribute is a primary key, what its data type is and what the corresponding relation names are; e.g., Figure 4-2 shows the class structure of a relation.
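To give a feel for the kind of dictionary access involved, the sketch below lists relations, attributes, primary keys and declared foreign keys through the standard JDBC metadata interface. It is illustrative rather than a reproduction of the prototype's Extractor/Adapter code; the JDBC URL, user and password are assumed to be supplied on the command line.

```java
import java.sql.*;

/** Illustrative dictionary extraction via the JDBC metadata API (not the prototype code). */
public class DictionarySketch {
    public static void main(String[] args) throws Exception {
        // In the prototype these connection details come from the Adapter configuration.
        Connection con = DriverManager.getConnection(args[0], args[1], args[2]);
        DatabaseMetaData md = con.getMetaData();

        ResultSet tables = md.getTables(null, null, "%", new String[] {"TABLE"});
        while (tables.next()) {
            String table = tables.getString("TABLE_NAME");
            System.out.println("Relation: " + table);

            ResultSet cols = md.getColumns(null, null, table, "%");
            while (cols.next()) {
                System.out.println("  attribute " + cols.getString("COLUMN_NAME")
                        + " : " + cols.getString("TYPE_NAME")
                        + " nullable=" + cols.getString("IS_NULLABLE"));
            }
            cols.close();

            ResultSet pks = md.getPrimaryKeys(null, null, table);   // primary-key information
            while (pks.next()) {
                System.out.println("  primary key attribute: " + pks.getString("COLUMN_NAME"));
            }
            pks.close();

            ResultSet fks = md.getImportedKeys(null, null, table);  // foreign keys, if declared
            while (fks.next()) {
                System.out.println("  foreign key " + fks.getString("FKCOLUMN_NAME") + " -> "
                        + fks.getString("PKTABLE_NAME") + "." + fks.getString("PKCOLUMN_NAME"));
            }
            fks.close();
        }
        tables.close();
        con.close();
    }
}
```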

A new instance of this class is created whenever the extractor extracts a new relation name. The extracted information is filled into these object instances according to the attributes of the class. Each instance contains information about the name and type (filled in after Step 3) of the relation, its primary key, its foreign key, the number of attributes, etc. Note that every relation object contains an array of attribute objects; the array size is equal to the number of attributes in the relation. The attribute class is described in Step 4.


























Figure 4-2 The class structure' for a relation.

After this step, we have an array of relation objects in the common data structure. In this way, not only can we identify all the relation names and their primary keys, but we can also examine each attribute of each relation by looking at its characteristics.

SE 2: Discover Inclusion Dependencies: The inputs for this step include the array of relation objects generated in the previous step and the data instances in the database. The actual data in the database is used to detect the direction of each inclusion dependency.

During this step, the algorithm needs to make several important decisions, which affect the outcome of the SE process. Due to the importance of this step, the choice of the data structure becomes crucial.








'A class represents an aggregation of attributes and operations. In our implementation, we have defined a set of classes whose instances hold the results of the DRE process. These classes define sets of attributes with their corresponding datatypes. We follow UML notation throughout this chapter to represent classes.


Relation
  name : string
  primary key : string
  foreign key : string
  attributes : array of att
  type : string
  pkcount : int
  fkcount : int
  attcount : int









As described in our algorithm in Chapter 3, we need two sets of inclusion dependencies at any time during this step: the set of possible inclusion dependencies and the set of final inclusion dependencies.

These inclusion dependencies could have been represented inside the relation objects, which would make it easy to associate them with relations and attributes. However, we decided to create a separate data structure, since adding this information to the relation objects would be a conceptual violation: inclusion dependencies hold between relations. The class structure for an inclusion dependency is illustrated schematically in Figure 4-3.



Inclusion Dependency
  lhsentity : string
  rhsentity : string
  lhsattset : string
  rhsattset : string
  lhsentitytype : string
  noofatt : int
  rating : string



Figure 4-3 The class structure for the inclusion dependencies.

The attribute "lhsentitytype" describes the type of the entity at the left hand side of the inclusion dependency. This helps in identifying the relationships in Step 6. For example if the type is "strong" entity then the inclusion dependency can suggest the binary relationship or IS-A relationship. For more details, the reader is referred to Step 6. Another attribute "noofatt" gives the number of attributes involved in the inclusion dependency. This helps in finalizing the foreign key attributes. Other attributes of the class are self-explanatory.









We keep two arrays of such objects: one for the FINAL set and the other for the POSSIBLE set. If foreign keys can be extracted from the data dictionary or equi-join queries are extracted from the application code, then we create a new instance in the FINAL set. Every non-final, or hypothesized, inclusion dependency is stored by creating a new instance in the POSSIBLE set. After the exhaustive search for inclusion dependencies, we remove unwanted inclusion dependencies (e.g., transitive dependencies) in a cleaning step.

Finally, if the POSSIBLE set is non-empty, all the instances are presented to the user. The inclusion dependencies rejected by the user are removed from the POSSIBLE set and the inclusion dependencies accepted by the user are copied to the FINAL set. After this step, only the FINAL set is used for future computations.

SE 3: Classify Relations: This step takes the array of relation objects and the array of inclusion dependency objects as input and classifies each relation as a strong-entity, weak-entity, regular-relationship or specific-relationship relation. First, the classification is performed assuming consistent naming of key attributes: all the relation names and the corresponding primary keys are read from the common data structures, and the primary key of every relation is compared with the primary keys of all other relations. Based on this analysis, the attribute "Type" is set in each relation object. This classification is then revised based on the existence of inclusion dependencies, so even if consistent naming is not employed, SE can still classify the relations successfully. The type information is also added to the inclusion dependency objects so that we can distinguish between entities and relationships.









The output of this step is the array of modified relation objects and the array of modified inclusion dependency objects (with the type information of the participating relations). This output is passed as input to the subsequent modules.

SE 4: Classify Attributes: As noted in Step 1, every relation object contains an array of attribute objects, one per attribute of the relation. The class structure for an attribute is shown in Figure 4-4.




Attribute
  name : string
  meaning : string
  tablename : string
  datatype : string
  isnull : string
  isunique : int
  type : string
  length : string




Figure 4-4 The class structure for an attribute.

This step can be executed easily because all the required information is available in the common data structures. Though this step is conceptually a separate step in the SE algorithm, its implementation is done in conjunction with the preceding three steps; e.g., whether an attribute is a primary key or not can already be decided in Step 1.

SE 5: Classify Entities: Every relation in the array of relation objects is accessed and, by checking its type, new entity objects are created. If the type of the









relation is "strong" then a strong entity is created and if the type of the relation is "weak" then a weak entity is created. Every entity object contains information about its name, its identifier and its type.

SE 6: Classify Relationships: The inputs to the last step of the SE algorithm include the array of relation objects and the array of inclusion dependency objects. This step analyzes each inclusion dependency and creates the appropriate relationship types.

The successful identification of a new relationship results in the creation of a new instance of the class described in Figure 4-5. The class structure mainly includes the name and type of the relationship, the participating entities and their corresponding cardinalities. Arrays of strings are used to accommodate a variable number of entities participating in a relationship. The participating entities are filled in from the entity-relations in the inclusion dependency, while the cardinality is discovered by actually querying the database. The other information is filled in according to Figure 4-6.



Relationships
  name : string
  type : string
  partentity : array of strings
  cardinality : array of strings
  partentcount : int




Figure 4-5 The class structure for a relationship.

The flow of execution is described as follows:

For every inclusion dependency whose left-hand side relation is an entity-relation, the SE does the following:









1. If it is a strong entity with the primary key in the inclusion dependency, then an
   "is-a" relationship between two strong entities is identified.

2. If it is a strong entity with a non-primary key in the inclusion dependency, then
"regular binary" relationship between two entities is identified.

3. If it is a weak entity with the primary key in an inclusion dependency, then a
   "dependent" or "has" relationship between the weak entity and its owner (strong)
   entity is identified.

4. If it is a weak entity with a non-primary key attribute in the inclusion dependency,
then a "regular binary" relationship between two entities is identified.

For every inclusion dependency whose left-hand side relation is a relationship-relation, the SE does the following:

1. We know which relations have been identified as regular and specific. We only
   have to identify the inclusion dependencies involving the primary keys (or subsets
   of the primary keys) of these relations on the left-hand side to find the
   participating entities. The n-ary relationships, where n > 2, are identified
   similarly.

2. If we have a regular or specific relation with non-primary-key attributes on the
   left-hand side, an "aggregate" relationship is identified.

Thus, all the new relationships are created by analyzing the array of inclusion dependencies. As a result, at the end of the schema extraction process, the output consists of the array of relation objects, the array of entity objects and the array of relationship objects.











Consider an inclusion dependency:
LHS-entity.LHS-attset << RHS-entity.RHS-attset

1. "IS-A" relationship:
Name: RHS-entity is-a LHS-entity
Type: IS-A
Cardinality: 1:1

2. Regular Binary relationship:
Name: RHS-entity relates-to LHS-entity
Type: Regular Binary
Cardinality: 1:1 / 1:N (can be easily finalized by checking for duplication)

3. "Dependent" or "has" relationship:
Name: RHS-entity has LHS-entity
Type: Dependent
Cardinality: 1:N (with N at the weak-entity side)

4. M:N relationship:
Name: name of the corresponding relation
Type: M:N
Cardinality: M:N

5. Aggregate relationship:
Name: Aggregated-LHS-entity relates-to RHS-entity
Type: Aggregate
Cardinality: 1:1 / 1:N (can be easily finalized by checking for duplication)


Figure 4-6 The information in different types of relationships instances.

Knowledge Encoder: When the schema extraction process is completed, the encoder module is called automatically. This module has access to the common data structure. Using the extracted information, the encoder module creates an XML file ("results.xml") with proper formatting that conforms to the predefined DTD. For more information about this DTD, please refer to Appendix A.
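A minimal skeleton of this step is shown below, purely to visualize the three-part layout (relational schema, conceptual schema, business rules). The element names are invented placeholders; the actual tags and nesting are dictated by the DTD in Appendix A, and the prototype uses the Oracle XML Parser rather than raw string output.

```java
import java.io.*;

/** Skeleton of the encoding step; element names are placeholders, the real ones are fixed by the DTD. */
public class EncoderSketch {

    static void write(String fileName) throws IOException {
        PrintWriter out = new PrintWriter(new FileWriter(fileName));
        out.println("<?xml version=\"1.0\"?>");
        out.println("<extractedKnowledge>");   // placeholder root element
        out.println("  <relationalSchema> ... relations, attributes, keys ... </relationalSchema>");
        out.println("  <conceptualSchema> ... entities, relationships, cardinalities ... </conceptualSchema>");
        out.println("  <businessRules> ... rules recovered from the application code ... </businessRules>");
        out.println("</extractedKnowledge>");
        out.close();
    }

    public static void main(String[] args) throws IOException {
        write("results.xml");   // the prototype writes a file with this name
    }
}
```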









4.2 Example Walkthrough of Prototype Functionality

In this section, we present an exhaustive example of the Schema Extraction process. We will also provide some screenshots of our working prototype. Project management is a key application in the construction industry; hence the legacy source for our example is based on a Microsoft Project application from our construction supply chain testbed. For simplicity, we assume without loss of generality or specificity that only the following relations exist in the MS-Project application, which will be discovered using SE (for a description of the entire schema refer to Microsoft Project Website [42]): MSP-Project, MSP-Availability, MSP-Resources, MSP-Tasks and MSP-Assignment.

Additionally, we assume that the dictionary makes the following information available to the SE algorithm, namely: relation and attribute names, all explicitly defined constraints, primary keys (PKs), all unique constraints, and data types.

SE Step 1: Extracting Schema Information:

This step extracts the relation and the attribute names from the legacy source. A decision step directs control to the semantic analyzer if the information about the primary keys cannot be obtained from the dictionary.

Result: In the example DRE application, the following relations were obtained from the MS-Project schema. All the attribute names, their data types, null constraints and unique constraints were also extracted but are not shown, to maintain clarity.

MSP-Project [PROJ_ID, ...]

MSP-Availability [PROJ_ID, AVAIL_UID, ...]

MSP-Resources [PROJ_ID, RES_UID, ...]

MSP-Tasks [PROJ_ID, TASK_UID, ...]

MSP-Assignment [PROJ_ID, ASSN_UID, ...]




Qtions Get Information 9P Ej TASK FINISH-DATE
B) Datatype DATETIME
[D Meaning: TaskTermination Date
) lsPK: No D) IsFK: No
B) NuIlable? Yes ) lsurique?: No
" TASK ACT START e-4TASK ACT FINISH 0- [j TASK BASE-START 0- L-1 TASK BASEFINISH
-ITASKCONSTRAINT DATE
-4TASK PRIORITY " TASK PCT COMP
0- [TASK PCT WORK COMP
0- Ij TASK TYPE
-I TASK FIXED_COST ACCRUAL
I TASK CREATIONDATE
[I TASK PRELEVELEDSTART -I TASK PRELEVELED FINISH
9P TASK EARLY START
B) Datatype DATETIME
D) Meaning: eari start date oftask
) lsPK No D) IsFK: No
B) NuIlable?: Yes F) IsUnique?: No
9-I TASK LATE FINISH
Figure 4-7 The screen snapshot describing the information about the relational schema.


Figure 4-7 presents a screen snapshot describing the information about the relational schema, including the relation names, attribute names, primary keys, simple constraints, etc. The reader can see a hierarchical structure describing a relational schema: a subtree of the corresponding attributes is created for every relation, and all information is displayed when the user clicks on a particular attribute. For example, the reader can get information about the attribute TASK_FINISH_DATE (top of the screen), including its datatype (datetime), the meaning extracted from the code (Task termination date), key information and simple constraints.

The hierarchical structure is chosen since it provides legibility, user-directed display (the user can click on a relation name or attribute name for detailed information), and ease of use. This structure is followed throughout the user interface.









SE Step 2: Discovering Inclusion Dependencies.

This part of the DRE algorithm has two decision steps. First, all of the possible inclusion dependencies are determined using SE. Then control is transferred to SA to determine if there are equi-join queries embedded in the application (using pattern matching against FROM and WHERE clauses). If so, the queries are extracted and returned to SE where they are used to rule out erroneous dependencies. The second decision determines if the set of inclusion dependencies is minimal. If so, control is transferred to SE Step 3. Otherwise, the transitive dependencies are removed and the minimal set of inclusion dependencies is finalized with the help of the user.

Result: The inclusion dependencies are listed as follows:

MSP_Assignment [Task_UID, Proj_ID] << MSP_Tasks [Task_UID, Proj_ID]

MSP_Assignment [Res_UID, Proj_ID] << MSP_Resources [Res_UID, Proj_ID]

MSP_Availability [Res_UID, Proj_ID] << MSP_Resources [Res_UID, Proj_ID]

MSP_Resources [Proj_ID] << MSP_Project [Proj_ID]

MSP_Tasks [Proj_ID] << MSP_Project [Proj_ID]

MSP_Assignment [Proj_ID] << MSP_Project [Proj_ID]

MSP_Availability [Proj_ID] << MSP_Project [Proj_ID]

The last two inclusion dependencies are removed on the basis of transitivity.

SE Step 3: Classification of the Relations.

Relations are classified into one of four types (strong, weak, specific, or regular) by analyzing the primary keys obtained in Step 1. If any unclassified relations remain, user input is requested for clarification. If we need to distinguish between weak and regular relations, user input is requested; otherwise, control is transferred to the next step.

Result:

Strong Entities: MSP_Project, MSP_Availability

Weak Entities: MSP_Resources, MSP_Tasks

Regular Relationship: MSP_Assignment

SE Step 4: Classification of the Attributes.

We classify attributes as (a) PK or FK, (b) Dangling or General, or (c) Non-Key (if none of the above). Control is transferred to SA if FKs need to be validated. Otherwise, control is transferred to SE Step 5.

Result: Table 4-1 illustrates the attributes obtained from the example legacy source.

Table 4-1 Example of the attribute classification from the MS-Project legacy source.

MS-Project:      PKA: ProjID;            DKA: -;        FKA: -;                                 NKA: all remaining attributes
MS-Resources:    PKA: ProjID + ResUID;   DKA: ResUID;   FKA: -;                                 NKA: all remaining attributes
MS-Tasks:        PKA: ProjID + TaskUID;  DKA: TaskUID;  FKA: -;                                 NKA: all remaining attributes
MS-Availability: PKA: ProjID + AvailUID; DKA: AvailUID; FKA: ResUID + ProjID;                   NKA: all remaining attributes
MS-Assignment:   PKA: ProjID + AssnUID;  DKA: AssnUID;  FKA: ResUID + ProjID, TaskUID + ProjID; NKA: all remaining attributes



SE Step 5: Identify Entity Types.

In SE-5, the strong (weak) entity relations obtained from SE-3 are directly converted into strong (respectively, weak) entities.

Result: The following entities were classified:









Strong entities: MSP_Project with Proj_ID as its key.

MSP_Availability with Avail_UID as its key.

Weak entities: MSP_Tasks with Task_UID as its key and MSP_Project as its owner.

MSP_Resources with Res_UID as its key and MSP_Project as its owner.




Figure 4-8 The screen snapshot describing the information about the entities.

Figure 4-8 presents a screen snapshot describing the identified entities. The description includes the name and the type of each entity and also the corresponding relation in the relational schema. For example, the reader can see the entity MSP_AVAILABILITY (top of the screen), its identifier (AVAIL_UID) and its type (strong entity). The corresponding relation MSP_AVAILABILITY and its attributes in the relational schema can also be seen in the interface.

SE Step 6: Identify Relationship Types.

Result: We discovered 1:N binary relationships between the following entity types:

Between MSP_Project and MSP_Tasks

Between MSP_Project and MSP_Resources

Between MSP_Resources and MSP_Availability

Since two inclusion dependencies involving MSP_Assignment exist (i.e., between Task and Assignment and between Resource and Assignment), there is no need to define a new entity. Thus, MSP_Assignment becomes an M:N relationship between MSP_Tasks and MSP_Resources.







Figure 4-9 The screen snapshot describing the information about the relationships.

Figure 4-9 shows the user interface in which the user can view the information about the identified relationships. After clicking on the name of a relationship, the user can view information such as its type, the participating entities and their respective cardinalities. The type of the relationship can be one of the types discussed in Step 6 of the DRE algorithm given in the previous chapter. If the relationship is of M:N type, the corresponding relation in the relational schema is also shown. For example, the reader can see the information about the relationship MSP_ASSIGNMENTS in Figure 4-9: its type (M:N regular), the participating entities (MSP_RESOURCES and MSP_TASKS) and their corresponding cardinalities (M and N). The reader can also see the corresponding relation MSP_ASSIGNMENTS.


Figure 4-10 E/R diagram representing the extracted schema.

The E/R diagram based on the extracted information shows four entities, their attributes and the relationships between the entities. Not all attributes are shown, for the sake of legibility. MSP_Projects is a strong entity with Proj_ID as its identifier. The entities MSP_Tasks and MSP_Resources are weak entities and depend on MSP_Projects. Both weak entities participate in an M:N binary relationship, MSP_Assignments. MSP_Availability is also a strong entity, participating in a regular binary relationship with MSP_Resources. For the XML view of this information, the reader is referred to Appendix B.









4.3 Configuration and User Intervention

As discussed earlier, the Adapter module needs to be modified for different databases. These changes mainly include the name of the driver used to connect to the database, the procedure for making the connection, the method of providing the username and password, etc. Also, certain methods in the JDBC API might not work for all relational databases, because such compatibility is generally vendor dependent. Instead of making all these changes every time, or keeping different files for different databases, we have provided a command-line input for specifying the database. Once the program gets the name of the database (e.g., Oracle), it automatically configures the Adapter module and continues execution.
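As an illustration only (the prototype's Adapter may organize this differently), the database-specific settings essentially reduce to a JDBC driver class and a URL pattern selected from the command-line database name. The driver classes and URL forms shown below are the standard ones for Oracle and for MS Access via the JDK 1.3 JDBC-ODBC bridge; the Oracle SID and the ODBC DSN are placeholders.

```java
import java.sql.*;

/** Illustrative Adapter configuration keyed on a command-line database name (simplified). */
public class AdapterConfigSketch {

    static Connection connect(String dbName, String host, String user, String pass) throws Exception {
        String driver, url;
        if (dbName.equalsIgnoreCase("oracle")) {
            driver = "oracle.jdbc.driver.OracleDriver";
            url = "jdbc:oracle:thin:@" + host + ":1521:ORCL";   // SID "ORCL" is illustrative
        } else if (dbName.equalsIgnoreCase("access")) {
            driver = "sun.jdbc.odbc.JdbcOdbcDriver";            // JDBC-ODBC bridge (JDK 1.3 era)
            url = "jdbc:odbc:" + host;                          // here 'host' names an ODBC DSN
        } else {
            throw new IllegalArgumentException("unsupported database: " + dbName);
        }
        Class.forName(driver);
        return DriverManager.getConnection(url, user, pass);
    }
}
```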

The next point of interaction with the user is before finalizing the inclusion dependencies. If the set of final inclusion dependencies cannot be determined beyond doubt, the candidate set is presented to the user along with the ratings discussed in the previous chapter. The user then selects the valid inclusion dependencies and rejects the others. Though the user is guided by the rating system, he is not bound to follow it and may select any inclusion dependency if he is assured of its correctness; however, the result of such irregular manual selection cannot be predicted beforehand.

After the entire process is completed, the user can view the results in two forms. The graphical user interface, automatically executed at the end of SE, shows complete information in an easy and intuitive manner. Java Swing has been extensively used to develop this interface. The other way to view the results is in the form of an XML document generated by the knowledge encoder.

The sample XML representation of the extracted knowledge and its corresponding DTD can be found in Appendix B and Appendix A respectively.









4.4 Integration

As discussed in the previous chapter, the integration of the Schema Extraction algorithm with the Semantic Analysis results in the overall DRE algorithm. The design and implementation of semantic analysis and code slicing is being done by another member of our research group and hence has not been discussed in detail. Though the main focus of this chapter is to present the implementation details of the schema extraction process, we now provide brief insights on how the implementation takes care of the interaction between SE and SA.

We have already listed the points where SE and SA interact. Initially, the integrated prototype begins with the AST generation step of SA. This step then calls the dictionary extractor. The notable change here is that the dictionary extractor no longer contains the main method; instead, SA calls it as a normal method invocation. Similarly, SA calls the Analysis module (shown in Figure 4-1) after its code analysis phase.

One important point of interaction is in the dictionary extraction step. If enough information about the primary keys is not found in the dictionary, SE passes a set of candidate keys to SA in the form of an array. SA already has access to the query templates that represent the predetermined patterns; it operates on the AST to match these patterns and eliminates candidates when a match is found. The reduced set is then sent back to SE as an array. We assume that we get strong clues about primary keys from the dictionary and hence this interaction will rarely take place.

Another vital point of communication is the equi-join query finder. While developing the integrated prototype, we have assumed that SE simply invokes the module and SA has the responsibility to find, shortlist and format these queries. SA will









send back inclusion dependencies in the inclusion dependency object format discussed earlier. Then SE takes over and completes the inclusion dependency detection process.

We have discussed the implementation issues in detail in this section and in the previous section. The next section will conclude the chapter by providing the implementation summary.

4.5 Implementation Summary

4.5.1 Features

Almost all the features of the SE algorithm discussed in Chapter 3 have been implemented. The prototype can:

1. connect to the relational database via JDBC;

2. extract information about the underlying relational model with the help of our
own small API built on the powerful JDBC API. This information includes
relation names, column names, simple constraints, data types etc;

3. store that information in a common, database-like data structure;

4. infer possible inclusion dependencies and finalize the set of inclusion
dependencies with the help of an expert user;

5. identify entities and the relationships among these entities;

6. present the extracted information to the user with an intuitive interface; as well as

7. encode and store the information in an XML document;

The prototype is built using the Java programming language. The JDBC API is used to communicate with the database. The Java Swing API is used to generate all the user interfaces. The choice of Java was motivated by its portability and robustness.

4.5.2 Advantages

The working prototype of the schema extraction algorithm also offers the following notable advantages:









1. It minimizes user intervention and requests user assistance only at the most
   important step (if required).

2. The prototype is easily configurable to work with different relational databases.

3. The user interface is easy to use and intuitive.

4. The final XML file can be kept as a simple yet powerful form of documentation
of the DRE process and can be used to guide the wrapper generation or any
decision making process.

Though these are significant advantages, the main concern about the prototype implementation of the SE algorithm is its correctness. If the prototype does not give accurate results, then having an elaborate interface or an easy configuration is of little value. The next chapter is dedicated to the experimental evaluation, where we present the results of testing this SE prototype.

Whether these advantages hold in practical scenarios can only be judged after experimental evaluation of the prototype. Hence, the next chapter outlines the tests performed, discusses the experimental results, and provides more insight into our working prototype.














CHAPTER 5
EXPERIMENTAL EVALUATION

Several parameters can be used to evaluate our prototype. The main criteria

include correctness or accuracy, performance, and ease of use. The schema extraction is primarily a build-time process and hence is not time critical. Thus, performance analysis based on the execution time is not an immediate issue for the SE prototype. The main parameter in the experimental evaluation of our prototype is the correctness of the information extracted by the prototype. If the extracted information is highly accurate in diverse input conditions (e.g., less than 10% error), then the SE algorithm can be considered useful. As SEEK attempts to be a highly automated tool for rapid wrapper reconfiguration, another important parameter is the amount of user intervention that is needed to complete the DRE process.

We have implemented a fully functional SE prototype system, which is currently installed and running in the Database Center at the University of Florida. We have run several experiments in an effort to test our algorithm. We shall first give the setup on which these experiments were conducted. In the next section, we shall explain the test cases and the results obtained. Finally, we shall provide conclusive reasoning of the results and summarize the experiments.

5.1 Experimental Setup

The DRE testbed resides on an Intel Pentium-IV PC with a 1.4 GHz processor, 256 MB of main memory and 512 KB of cache memory, running Windows NT. As discussed earlier, all components of this prototype were implemented using Java (SDK 1.3) from Sun









Microsystems. Other tools used are the XML Parser from Oracle (version 2.0.2.9), Microsoft Project scheduling software to prepare test data, the Oracle 8i RDBMS for storing test data, and JDBC drivers from Sun and Oracle.

The SE prototype has been tested with two types of database systems: an MS-Access/MS-Project database on the PC and an Oracle 8i database running on a departmental Sun Enterprise 450 Model 2250 machine with two 240 MHz processors. The server has 640 MB of main memory, 1.4 GB of virtual memory and 2 MB of cache memory. This testbed reflects the distributed nature of business networks, in which the legacy source and the SEEK adapter will most likely execute on different hardware.

The DRE prototype connects to the database, either locally (e.g., MS-ACCESS) or remotely (e.g., Oracle), using JDBC and extracts the required information and infers the conceptual associations between the entities.

5.2 Experiments

In an effort to test our schema extraction algorithm, we selected nine different test databases from different domains. The databases were created by graduate students as part of a database course at the University of Florida. Each of these test cases contains a relational schema and actual data. The average number of tables per schema is approximately 20, and the number of tuples ranges from 5 to over 50,000 per table. These applications were developed for varied domains such as online stock exchange and library management systems.

5.2.1 Evaluation of the Schema Extraction Algorithm

Each test database was used as an input to the schema extraction prototype. The

results were compared against the original design document to validate our approach. The results are captured in Table 5-1.









Table 5-1 Experimental results of schema extraction on 9 sample databases.

Project Name | Domain                | Phantom INDs | Missing INDs | Phantom E/R Components | Missing E/R Components | Complexity Level
P1           | Publisher             | 0            | 0            | 0+0                    | 0+0                    | Low
P9           | Library               | 0            | 1            | 0+0                    | 0+1                    | Low
P5           | Online Company        | 0            | 0            | 0+0                    | 0+0                    | Low
P3           | Catering              | 0            | 0            | 0+0                    | 0+0                    | Medium
P6           | Sports                | 0            | 0            | 0+0                    | 0+0                    | Medium
P4           | Movie Set             | 1            | 0            | 0+1                    | 0+0                    | Medium
P8           | Bank                  | 1            | 0            | 0+1                    | 0+0                    | Medium
P7           | Food                  | 3            | 1            | 0+3                    | 0+1                    | High
P2           | Financial Transaction | 5            | 0            | 0+5                    | 0+0                    | High


The final column of Table 5-1 specifies the level of complexity for each schema. At this point, all projects are classified into three levels based on our relational database design experience. Projects P1, P9 and P5 are given low complexity because their schemas exhibit meaningful and consistent naming systems, rich datatypes (11 different datatypes in the case of P9 and 17 in the case of P1), relatively few relations (ranging from 10 to 15) and only a few tuples per relation (on average 50-70 tuples per relation). A database with these characteristics is considered a good database design and is richer in terms of semantic information content. Hence, for the knowledge extraction process, these databases are more tractable and are rated with a low complexity level. Projects P3, P6, P4 and P8 are less tractable than P1, P9 and P5 due to a limited number of datatypes (only 7 in the case of project P3), more relations (on average 20 relations per schema) and more tuples per relation. Projects P7 and P2 have been rated most complex due to their naming systems and limited number of datatypes; for example, in project P2 the primary key attributes of almost all tables are simply named ID. The importance of the various factors









in complexity levels is better understood when the behavior and results of each schema are studied. The details are given in Section 5.2.2.

Since the main parameter for evaluating our prototype is the "correctness of the extracted information," Table 5-1 shows the errors detected in the experiments. These errors can essentially be of two types: a missing concept (i.e., the concept is clearly present but our SE algorithm did not extract it) or a phantom concept (i.e., the concept is extracted by the SE algorithm but is absent in the data source). As the first step of the SE algorithm merely extracts what is present in the dictionary, errors only start accumulating from the second step. The core part of the algorithm is inclusion dependency detection. Steps 3, 4, 5 and 6 use the final set of inclusion dependencies either to classify the relations or to infer high-level relationships. Hence, an error in this step almost always reflects as an error in the final result.

As previously discussed, when the decision about certain inclusion dependencies cannot be finalized, the possible set is presented to the user with the appropriate rating (low or high). In the experimental evaluation, we always assume that the user blindly keeps the inclusion dependencies with a high rating and rejects those with a low rating. This assumption reflects the exact behavior of the SE algorithm, though in real scenarios the accuracy can be increased by intelligent user decisions. More details about this are given in Section 5.3.2.

Thus, omissions and phantoms have a slightly different meaning with respect to our algorithm. Missing inclusion dependencies are those in the set POSSIBLE that are ranked low by our algorithm but do exist in reality; hence they are considered omissions. Phantom inclusion dependencies are those in the set POSSIBLE that are ranked high by our algorithm but are actually invalid; hence the term 'phantom', since they do not exist in reality.

5.2.2 Measuring the Complexity of a Database Schema

In order to evaluate the outcome of our schema extraction algorithm, we first describe our methodology for ranking the test databases based on their perceived complexity. Although our complexity measure is subjective, we used it to develop a formula, which rates each test case on a complexity scale between 0 (low) and 1 (high). The tractability/complexity of a schema was based on the following factors:

* Number of useless PK-to-identical-attribute-name matches: One of the most important factors taken into account was the total number of instances where a primary key name was identical to other attribute names that were not in any way relationally connected to the primary key. We define this type of match as a "useless match."

* Data types of all attributes, Primary key data types and maximum number of
datatypes: Each data type was distinguished by the data type name and also the length. For instance, char(20) was considered to be different than char(50). The
higher the total number of data types in a schema, the less complex (more tractable) it is because attributes are considered to be more distinguishable.

* Number of tables in the schema and maximum number of tables in the testcase: The more relations a schema has, the more complex it will be. Since there is no common factor with which to normalize the schema, the maximum number of relations across the testcases was taken as a common denominator in order to produce a normalized scale with which to rank all the schemas.

* Relationships in the E/R model and maximum number of relationships in the testcase: The more relationships a schema contains, the more complex it will be. Similar to the preceding factor, the common denominator in this case was the maximum number of relationships.

The following mathematical equation determines the tractability of a schema:


Trac_{schema} = W_1 \left(\frac{U_{schema}}{U_{max}}\right) + W_2 \left(1 - \frac{D_{schema}}{D_{max}}\right) + W_3 \left(1 - \frac{D_{PK}}{D_{max,PK}}\right) + W_4 \left(\frac{T_{schema}}{T_{max}}\right) + W_5 \left(\frac{R_{model}}{R_{max,model}}\right)










Where U_schema represents the useless name matches in a schema, U_max the maximum number of useless name matches in all testcases, D_schema the number of attribute data types in a schema, D_max the maximum number of attribute data types in all testcases, D_PK the number of primary key data types in a schema, D_max,PK the maximum number of primary key data types in all testcases, T_schema the number of tables in a schema, T_max the maximum number of tables in all testcases, R_model the number of relationships in the E/R model of a schema, and R_max,model the maximum number of relationships in the E/R models of all testcases. In addition, each factor was weighted (W_1 through W_5) to indicate the importance of the factor with respect to the complexity of the schema.
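To make the computation concrete, the following minimal sketch computes the score for one schema. The weights W1 through W5 are placeholders rather than the values actually used in our experiments, and the normalization of the datatype factors (more datatypes lowering the score) follows the description above; each maximum is assumed to be non-zero.

// Illustrative sketch of the complexity score: a weighted sum of normalized factors.
// The weights below are placeholders, not the values used in our experiments.
public class ComplexityScore {
    static double complexity(int uSchema, int uMax, int dSchema, int dMax,
                             int dPK, int dMaxPK, int tSchema, int tMax,
                             int rModel, int rMaxModel) {
        double w1 = 0.3, w2 = 0.2, w3 = 0.1, w4 = 0.2, w5 = 0.2;  // assumed weights
        return w1 * ((double) uSchema / uMax)           // useless name matches raise complexity
             + w2 * (1.0 - (double) dSchema / dMax)     // more datatypes lower complexity
             + w3 * (1.0 - (double) dPK / dMaxPK)       // more PK datatypes lower complexity
             + w4 * ((double) tSchema / tMax)           // more tables raise complexity
             + w5 * ((double) rModel / rMaxModel);      // more relationships raise complexity
    }
}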

5.3 Conclusive Reasoning

The complexity measurement formula described in the previous section was used to order the projects. Based on this ordering of the projects, we plotted the results of Table 5-1 as two graphs to better illustrate the outcome of the experiments. The graphs are shown in Figure 5-1 and depict the errors encountered during inclusion dependency detection and at the end of the overall process.















































[Figure 5-1 consists of two bar charts, "Inclusion Dependency Results" and "E-R Results". Each plots the number of missing and phantom items (INDs in the first chart, E-R components in the second) for the schemas p1, p9, p5, p3, p6, p4, p8, p7 and p2, ordered by ascending complexity.]

Figure 5-1 Results of experimental evaluation of the schema extraction algorithm: errors in detected inclusion dependencies (top), number of errors in extracted schema (bottom).


5.3.1 Analysis of the Results


Both graphs in Figure 5-1 appear similar, since a phantom in an inclusion dependency definitely results in a phantom relationship and a missing inclusion dependency almost always results in a missing relationship. This is due to the fact that every relationship is identified only from an inclusion dependency. After the set of











inclusion dependencies is finalized, every inclusion dependency in that set is used to identify associations between participating entities. So, the presence of a phantom inclusion dependency in the final set always indicates the presence of a phantom relationship in the final result.

Phantoms are generated because some pairs of attributes in different tables are not related even though they have name and datatype similarities and the subset relationship holds. For example, consider two relations "Company" and "Person". If both relations have an attribute called "ID" of integer type and the subset relationship holds (due to similar integer values or ranges), then the inclusion dependency "Company (ID) << Person (ID)" is definitely added to the POSSIBLE set. As there is a name similarity between the two attributes, the SE algorithm will give a "high" rating to this inclusion dependency while presenting it to the user. So if the user decides to keep it, a phantom relationship between the concerned entities (Company and Person) will be identified in the final result. Phantoms generally occur when the naming system in the database design is not good (as shown by our example). They can also occur due to limited data values or a lack of variety of datatypes in the schema. All of these contribute to the complexity of the schema, as discussed earlier.

Omissions occur because some pairs of attributes in different tables are actually related even though they have completely unrelated and different names. For example, consider two relations "Leaders" and "US-Presidents". If the relation "Leaders" has the attribute "Country#" and the relation "US-Presidents" has the attribute "US#", and both attributes have an integer datatype, then the subset relationship definitely holds. Since there is no name similarity between the relations and the attributes, the SE algorithm will









attach a "low" rating to this inclusion dependency while presenting it to the user. If the user rejects this possible inclusion dependency, a valid relationship between the concerned entities (Leaders and US-Presidents) will be missing in the final result. Such omissions generally occur when the naming system in the database design is inadequate (as shown by our example).

Both graphs in Figure 5-1 also suggest that there are comparatively fewer omissions than phantoms. Omissions occur very rarely in our algorithm, because our exhaustive algorithm will miss something (i.e., give a low rating) only when the tables and columns on both sides are completely unrelated in terms of names. As this is very rare in normal database design, our algorithm will rarely miss anything.

Another parameter for evaluation can be user intervention. The SE algorithm may consult the user only at the inclusion dependency detection step: if the algorithm cannot finalize the set, it will ask the user to decide. This is a significant improvement over many existing reverse engineering methods, although even this point of interaction might not be necessary for well-designed databases.

5.3.2 Enhancing Accuracy

The accuracy discussions in the previous subsection are based on worst-case scenarios. The main improvements can be done at the inclusion dependency detection step, as it is the first step where errors start creeping in. As discussed earlier, the errors in the final result are just the end-product of errors in this step. We shall now present simple experiments that can be done to enhance the accuracy of the SE algorithm.

In the intermediate user intervention step, if the domain-expert makes intelligent decisions about the existence of the possible inclusion dependencies ignoring the ratings,









the error rate can be decreased significantly. One additional experiment did exactly this, and the resulting error rate was almost zero for many of the above databases. Even if the user is not a domain expert, some obvious decisions definitely enhance the accuracy. For example, rejecting the inclusion dependency "Company (ID) << Person (ID)" even though the corresponding rating is "high" is a common-sense decision and will certainly reduce errors.

Another possible way to increase the correctness of the results is to use some kind of threshold. Sometimes the number of data values can be very different in two tables and the subset relationship holds just by chance. This is mostly true when the data values in the corresponding columns are integers or integer ranges. For example, the "ID" column in the relation "Task" may contain values 1 to 100, while the "ID" column in the relation "Project" may contain only the values 1 and 2. This will lead to an invalid possible inclusion dependency "Project (ID) << Task (ID)". To reject these kinds of inclusion dependencies, we can keep a dynamic threshold value: if the ratio of the number of distinct values in one column to the number of distinct values in the other column is less than the threshold, then we can reject the dependency beforehand. Though this is helpful in many cases, it can result in the rejection of some valid inclusion dependencies. The effect of this improvement is not completely defined and the experiments did not show any definite reduction of errors, but the procedure can be tweaked to obtain a reduced error rate in the majority of cases.
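As a rough illustration of this step (a minimal sketch, not the prototype's actual code; the relation and column names are supplied by the caller and the threshold value is arbitrary), the subset test and the cardinality-ratio filter could be combined as follows:

import java.sql.*;

// Illustrative sketch: test the candidate inclusion dependency R1.a << R2.b and
// apply the cardinality-ratio threshold described above.
public class InclusionDependencyCheck {
    static boolean plausible(Connection conn, String r1, String a,
                             String r2, String b, double threshold) throws SQLException {
        Statement st = conn.createStatement();
        try {
            // Subset test: any value of R1.a that is missing from R2.b rules the IND out.
            ResultSet rs = st.executeQuery(
                "SELECT COUNT(*) FROM " + r1 + " WHERE " + a + " IS NOT NULL AND " + a +
                " NOT IN (SELECT " + b + " FROM " + r2 + " WHERE " + b + " IS NOT NULL)");
            rs.next();
            if (rs.getInt(1) > 0) {
                return false;
            }
            // Cardinality ratio: if R1.a has far fewer distinct values than R2.b, the
            // subset relationship may hold only by chance, so reject the candidate.
            rs = st.executeQuery("SELECT COUNT(DISTINCT " + a + ") FROM " + r1);
            rs.next();
            double n1 = rs.getInt(1);
            rs = st.executeQuery("SELECT COUNT(DISTINCT " + b + ") FROM " + r2);
            rs.next();
            double n2 = rs.getInt(1);
            return n2 == 0 || n1 / n2 >= threshold;
        } finally {
            st.close();
        }
    }
}

For the example above, plausible(conn, "Project", "ID", "Task", "ID", 0.1) would reject the candidate, since Project contains only 2 distinct ID values against 100 in Task, giving a ratio below the threshold.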

In this section, we presented a discussion of the experimental results and the related analysis. The results are highly accurate and were obtained with minimal user intervention. Accuracy can be further enhanced by simple additional methods. The next








section will summarize the main contributions of the schema extraction algorithm and will provide valuable insights on the future enhancements at the algorithmic level. It will also provide an overall summary of this thesis.














CHAPTER 6
CONCLUSION

Schema extraction and knowledge discovery from various database systems have been an important and exciting topic of research for more than two decades now. Despite all these efforts, a truly comprehensive solution for the database reverse engineering problem is still elusive. Several proposals that approach this problem have been made under the assumption that data sources are well known and understood. Much of the substantial work on this problem also remains theoretical, with very few implemented systems. Also, many authors suggest semi-automatic methods to identify database contents, structures, relationships, and capabilities. However, there has been much less work in the area of fully automatic discovery of database properties.

The goal of this thesis is to provide a general solution for the database reverse engineering problem. Our algorithm studies the data source using a call-level interface and extracts information that is explicitly or implicitly present. This information is documented and can be used for various purposes such as wrapper generation, forward engineering, system documentation, etc. We have manually tested our approach for a number of scenarios and domains (including construction, manufacturing and health care) to validate our knowledge extraction algorithm and to estimate how much user input is required. The following section lists the contributions of this work and the last section discusses possible future enhancements.









6.1 Contributions

The most important contributions of this work are the following. First, a broad survey of existing database reverse engineering approaches was presented. This overview not only familiarizes us with the different approaches, but also provided significant guidance while developing our SE algorithm. The second major contribution is the design and implementation of a relational database reverse engineering algorithm, which places minimal restrictions on the source, is as general as possible, and extracts as much information as it can from all available resources with minimal external intervention.

Third, a new and different approach to propose and finalize the set of inclusion dependencies in the underlying database is presented. The fourth contribution is the idea to use every available source for the knowledge extraction process; giving importance to the application code and the data instances is vital. Finally, developing the formula for measuring the complexity of database schemas is also an important contribution. This formula, which is based on the experimental results generated by our prototype, can be utilized for similar purposes in various applications.

One of the more significant aspects of the prototype we have built is that it is highly automatic and does not require human intervention except in one phase, when the user might be asked to finalize the set of inclusion dependencies. The system is also easy to use and the results are well-documented.

Another vital feature is the choice of tools. The implementation is in Java, due to its popularity and portability. The prototype uses XML, which has become a widely adopted standard for data representation and exchange, as our main representation and documentation language. Finally, though we have tested our approach only on Oracle,









MS-Access and MS-Project data sources, the prototype is general enough to work with other relational data sources including Sybase, MS-SQL server and IBM DB2.

Though the experimental results of the SE prototype are highly encouraging and its development in the context of wrapper generation and the knowledge extraction module in SEEK is extremely valuable, there are some shortcomings of the current approach. The process of knowledge extraction from databases can be enhanced with some future work. The following subsection discusses some limitations of the current SE algorithm and Section 6.3 presents possible future enhancements.

6.2 Limitations

6.2.1 Normal Form of the Input Database

Currently the SE prototype does not put any restriction on the normal form of the input database. However, if the database is in first or second normal form, some of the implicit concepts might get extracted as composite objects. The SE algorithm does not fail on 2NF relations, but it does not explicitly discover all hidden relationships, although this information is implicitly present in the form of attribute names and values (e.g., cityname and citycode will be preserved as attributes of the Employee relation in the following example).

Consider the following example:

Employee (SSN, name, cityname, citycode)

Project (ProjID, ProjName)

Assignment (SSN, ProjID, startdate)

In the relation Employee, the attribute citycode depends on the attribute cityname, which is not a primary key. So there is a transitive dependency present in the relation, and hence the relation is only in 2NF. Now, if we run the SE algorithm over this schema, it first extracts the table names and attribute names as above. Then it finds the set of inclusion dependencies as follows:

Assignment.SSN << Employee.SSN

Assignment.ProjID << Project.ProjID

The SE algorithm classifies the relations (Employee and Project as strong relations and Assignment as a regular relation) and their attributes. Finally, it identifies Employee and Project as strong entities and Assignment as an M:N relationship between them. The dependency of citycode on cityname is not identified as a separate relationship.

To explicitly extract all the objects and relationships, the schema should ideally be in 3NF. This limitation can be removed by extracting functional dependencies (such as the dependency of citycode on cityname) from the schema and converting the schema into 3NF before starting the extraction process. However, any decision about the normal form of a legacy database is difficult to make; one cannot easily deduce whether the database is in 2NF or 3NF. Also, it may not be really useful for us to extract such extra relationships explicitly.
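One way to test a candidate functional dependency directly against the data is sketched below, assuming the data is accessible through JDBC. This is illustrative code, not part of the current prototype; it checks whether cityname determines citycode in the Employee relation of the example above.

import java.sql.*;

// Illustrative check for the functional dependency cityname -> citycode in Employee:
// the dependency holds if no cityname is associated with more than one distinct citycode.
public class FunctionalDependencyCheck {
    static boolean citynameDeterminesCitycode(Connection conn) throws SQLException {
        Statement st = conn.createStatement();
        try {
            ResultSet rs = st.executeQuery(
                "SELECT COUNT(*) FROM (SELECT cityname FROM Employee " +
                "GROUP BY cityname HAVING COUNT(DISTINCT citycode) > 1) v");
            rs.next();
            return rs.getInt(1) == 0;   // zero violating groups means the FD holds
        } finally {
            st.close();
        }
    }
}

Dependencies confirmed in this way could then be used to decompose the relation into 3NF before the extraction steps run.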

6.2.2 Meanings and Names for the Discovered Structures

Although the SE algorithm extracts all concepts (e.g., entities, relationships) modeled by the underlying relational database, it falls short in assigning proper semantic meanings to some of these concepts. Semantic Analysis may provide important clues in this regard, but how useful these clues are depends largely on the quantity and quality of the code. It is difficult to extract a semantic meaning for every concept.

Consider an example. If the SE algorithm identifies a regular binary relationship (1:N) between Resource and Availability, then it is difficult to provide a meaningful name for it. The SE algorithm gives it the name "relates-to" in this case, which is very general.









6.2.3 Adaptability to the Data Source

Ideally, the algorithm should adapt successfully to the input database. However, the accuracy of the results generated by the SE algorithm depends to some extent on the quality of the database design, which includes a proper naming system, rich datatypes and the size of the schema. The experimental results and the schema complexity measure discussed earlier confirm this.

Although it is very difficult to attain high accuracy levels for a broad range of databases, integrating the SE algorithm with machine learning approaches might help the extraction process achieve at least a minimal level of completeness and accuracy.

6.3 Future Work

6.3.1 Situational Knowledge Extraction

Many scheduling applications such as MS-Project or Primavera have a closed interface. Business rules, constraints or formulae are generally not found in the code written for these applications, since the application itself contains many rules. For example, in the case of the MS-Project software, we can successfully access the underlying database, which allows us to extract accurate but only limited information. Our current schema extraction process (part of DRE) extracts knowledge about the entities and relationships from the underlying database. Some additional but valuable information can be extracted by inspecting the data values that are stored in the tuples of each relation. This information can be used to influence decisions in the analysis module. For example, in the construction industry, the detection of a 'float' in the project schedule might prompt rescheduling or stalling of activities. In warfare, the detection of the current location and the number of infantry might prompt a change of plans on a particular front.









Since this knowledge is based on current data values, we call it situational knowledge (sk). Situational knowledge is different from business rules or factual knowledge because the deduction is based on current data that can change over time.

Some important points to consider:

* Extraction of sk has to be initiated and verified by a domain expert.

* An ontology or generalization of terms must be available to guide this process.

* The usefulness of this kind of knowledge is even greater in those cases where business rule mining from application source code is not possible.

It is also crucial to understand what sort of situational knowledge might be

extracted with respect to a particular application domain before thinking about the details of the discovery process. This provides insights into the possible ways in which a domain expert can query the discovery tool. It will also help the designer of the tool to classify these queries and finalize the responses. We classify sk into four broad categories:

1. Situational Knowledge Explicitly Stored in the Database - Simple Lookup Knowledge: This type of sk can be easily extracted by one or more simple lookups and involves no calculation or computation. Extraction can be done through database queries. e.g., Who is on that activity? (i.e., find the resource assigned to the activity) or what percentage of work on a particular activity has been completed?

2. Situational Knowledge Obtained through a Combination of Lookups and Computation: This type of sk extraction involves database lookups combined with computations (arithmetic operations). Some arithmetic operations (such as summation and average) can be predefined, with attribute names used as the parameters taking part in these operations. e.g., What is the sum of the durations of all activities in a project? Or what is the productivity of an activity as a function of the activity duration and the units of resource assigned? (A sketch of such a query is given after this list.)

3. Situational Knowledge Obtained through a Comparison between two inputs: The
request for a comparison between two terms can be made and the response will be in the form of two sets containing relevant information about these terms that can









be used to compare them. This may involve lookups and calculations. e.g.,
Compare the project duration and the sum of durations of all the activities in that
project or compare the skill levels of resources working on different activities.

4. Complex Situational Knowledge: This is the most complex type of sk extraction and can involve lookups, calculations and comparisons. For extracting this kind of
knowledge, one has to provide a definite procedure or an algorithm. e.g., Find a
float in the schedule, or find the overall project status and estimate the finish date.
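For instance, a category-2 query that combines a lookup with a predefined aggregate could be issued through JDBC as in the following sketch; the table and column names (MSP_TASKS, TASK_DURATION, PROJ_ID) are hypothetical and stand in for whatever names the DRE step actually discovered.

import java.sql.*;

// Hypothetical example of a category-2 query (lookup plus computation): the total
// duration of all activities in one project. Table and column names are illustrative.
public class SituationalKnowledgeQuery {
    static int totalProjectDuration(Connection conn, int projectId) throws SQLException {
        PreparedStatement ps = conn.prepareStatement(
            "SELECT SUM(TASK_DURATION) FROM MSP_TASKS WHERE PROJ_ID = ?");
        try {
            ps.setInt(1, projectId);
            ResultSet rs = ps.executeQuery();
            return rs.next() ? rs.getInt(1) : 0;
        } finally {
            ps.close();
        }
    }
}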

As discussed above, the situational knowledge discovery process will be initiated on a query from the domain expert, which assures a relatively constrained search on a specific subset of the data. But the domain expert may not be conversant with the exact nature of the underlying database including its schema or low-level primitive data. So, the discovery tool should essentially consist of:

* an intuitive GUI for an interactive and flexible discovery process;

* the query transformation process; and

* the response finalizing and formatting process.

A simple yet robust user interface is very important in order to specify the various parameters easily.

The process of query transformation essentially involves translation of high-level concepts provided by the user into low-level primitive concepts used by the database. The response finalizing process may involve answering the query in an intelligent way, i.e., by providing more information than was initially requested.

The architecture of the discovery system to extract situational knowledge from the database will consist of the relational database, concept hierarchies, generalized rules, a query transformation and re-writing system, and a response finalization system. The concept hierarchies can be prepared by organizing different levels of concepts into a taxonomy or ontology. A concept hierarchy is always related to a specific attribute and is partially ordered in general-to-specific order. The knowledge about these hierarchies









must be given by domain experts. Some researchers have also tried to generate or refine this hierarchy semi-automatically [30], but that is beyond the scope of this thesis. Generalization rules summarize the regularities of the data at a high level. As there is usually a large set of rules extracted from any interesting subset, it is unrealistic to store all of them. However, it is important to store some rules based on the frequency of inquiries.
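A concept hierarchy for a single attribute can be represented very simply. The following minimal sketch (illustrative only; the concept names in the usage note are hypothetical) maps each concept to its more general parent so that a low-level value can be generalized during query transformation.

import java.util.*;

// Minimal sketch of a concept hierarchy for one attribute: each concept maps to its
// more general parent, so a low-level value can be generalized step by step.
public class ConceptHierarchy {
    private final Map<String, String> parent = new HashMap<String, String>();

    public void addIsA(String specific, String general) {
        parent.put(specific, general);
    }

    // Returns the chain from the given concept up to the most general term known.
    public List<String> generalize(String concept) {
        List<String> chain = new ArrayList<String>();
        for (String c = concept; c != null; c = parent.get(c)) {
            chain.add(c);
        }
        return chain;
    }
}

For example, addIsA("bricklayer", "laborer") followed by addIsA("laborer", "resource") lets a query about "bricklayer" be generalized to "laborer" and then "resource".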

The incoming queries can be classified as high-level or low-level based on the names of their parameters. The queries can also be classified as data queries, which are used to find concrete data stored in the database, or knowledge queries, which are used to find rules or constraints. Furthermore, the response may include the exact answer, the addition of some related attributes, information about some similar tuples, etc.

Such a discovery tool should be based on data mining methods. Data mining was considered one of the most important research topics of the 1990s by both machine learning and database researchers [56]. Various techniques have been developed for knowledge discovery, including generalization, clustering, data summarization, rule discovery, query re-writing, deduction, associations, multi-layered databases, etc. [6, 18, 29]. One intuitive generalization-based approach for intelligent query answering in general, and for situational knowledge in particular, is based on the well-known data mining technique called attribute-oriented induction described in Han et al. [29]. This approach provides an efficient way to extract generalized data from the actual data by generalizing the attributes in the task-relevant data set and deducing certain situational conclusions depending on the data values in those attributes.

The situational knowledge in SEEK can be considered as an additional layer to the knowledge extracted by the DRE module. This kind of knowledge can be used to









guide the analysis module to take certain decisions; but it cannot be automatically extracted without a domain expert.

This system may be integrated in SEEK as follows:

1. Create a large set of queries that are used on a regular basis to find the status in every application domain.

2. Create the concept hierarchy for the relevant data set.

3. After the DRE process, execute these queries and record all the responses.

4. The queries and their corresponding responses can be represented as simple strings or can eventually be added to a general knowledge representation.

Appendix D describes a detailed example of the sk extraction process.

6.3.2 Improvements in the Algorithm

Currently, our schema extraction algorithm does not put any restriction on the normal form of the input database. However, if the database is in 1NF or 2NF, then some of the implicit concepts might get extracted as composite objects. To make schema extraction more efficient and accurate, the SE algorithm can extract and study functional dependencies. This will ensure that all the implicit structures can be extracted from the database no matter what normal form is used.

Another area of improvement is knowledge representation. It is important to leverage existing technology or to develop our own model to represent the variety of knowledge extracted in the process effectively. It will be especially interesting to study the representation of business rules, constraints and arithmetic formulae.

Finally, although DRE is a build-time process, it will be interesting to conduct performance analysis experiments especially for the large data sources and make the prototype more efficient.









6.3.3 Schema Extraction from Other Data Sources

A significant enhancement would extend the SE prototype to include knowledge extraction from multiple relational databases simultaneously or from completely non-relational data sources. Non-relational database systems include the traditional network database systems and the relatively newer object-oriented database systems. An interesting topic of research would be to explore the extent to which information can be extracted without human intervention from such data sources. Also, it would be useful to develop the extraction process for distributed database systems on top of the existing prototype.

6.3.4 Machine Learning

Finally, machine learning techniques can be employed to make the SE prototype more adaptive. After some initial executions, the prototype could adjust its parameters to extract more accurate information from a particular data source in a highly optimized and efficient way. The method can integrate the machine learning paradigm [41], especially learning-from-example techniques, to intelligently discover knowledge.














APPENDIX A
DTD DESCRIBING EXTRACTED KNOWLEDGE

<!ELEMENT Relation (Name, Type, Columns)>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT Type (#PCDATA)>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT DataType (#PCDATA)>
<!ELEMENT Meaning (#PCDATA)>
<!ELEMENT IsPK (#PCDATA)>
<!ELEMENT FK (IsFK, FKTable)>
<!ELEMENT IsFK (#PCDATA)>
<!ELEMENT FKTable (#PCDATA)>
<!ELEMENT NullOption (#PCDATA)>
<!ELEMENT IsUnique (#PCDATA)>
<!ELEMENT ConceptualSchema (Entity+, Relationship+)>
<!ELEMENT Entity (Name, Type, Identifier)>









<!ELEMENT Name (#PCDATA)>
<!ELEMENT Type (#PCDATA)>
<!ELEMENT Identifier (#PCDATA)>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT Type (#PCDATA)>
<!ELEMENT Participants (Participant+)>
<!ELEMENT Rule (#PCDATA)>
















APPENDIX B
SNAPSHOTS OF "RESULTS.XML"

The knowledge extracted from the database is encoded into an XML document. This resulting XML document contains information about every attribute of every relation in the relational schema in addition to the conceptual schema and semantic information. Therefore the resulting file is too long to be displayed completely in this thesis. Instead, we provide snapshots of the document, which highlight the important parts of the document.


Figure B-1 The main structure of the XML document conforming to the DTD in Appendix A.


if (Task Beginning Date < early start date of task) { Task Beginning Date = early start date of task; }
Project Cost = Project Cost + Task Cost;
if (Resource category == "brick") { Resource Cost = 2500; }




Figure B-2 The part of the XML document which lists business rules extracted from the code.

















[Figure B-3 shows, for example, the relation MSPTASKS (type: weakrelation) with attributes such as PROJ_ID (INTEGER, a foreign key referencing MSPPROJECTS, nullable), TASKSTARTDATE (DATETIME, meaning "Task Beginning Date", nullable) and TASKFINISH_DATE (DATETIME, meaning "Task Termination Date", nullable).]

Figure B-3 The part of the XML document which describes the relations and attributes of the relational schema.




haldavnekar_n_Page_011.QC.jpg
3a7ea27cfd08fdc8cb88e969ab16160f 110704
haldavnekar_n_Page_012.jpg
JPEG12.2 cce51b9898ba4fcf054160f062ce61bd 49432
haldavnekar_n_Page_012.QC.jpg
72c729fe5c1303560d713ab10aa78e75 124776
haldavnekar_n_Page_013.jpg
JPEG13.2 755dd813973caa3a71994ebf33647b0d 54272
haldavnekar_n_Page_013.QC.jpg
9e1d0d47dc13d168f6fb0edc6ff18b29 112787
haldavnekar_n_Page_014.jpg
JPEG14.2 f1500e42ac1f99a24e0c8c8ac437b43b 50421
haldavnekar_n_Page_014.QC.jpg
7f8d435e2a7ef193670cfbba5b4dbf11 124507
haldavnekar_n_Page_015.jpg
JPEG15.2 35284c9143395c88f8b708700f9c1c4c 54187
haldavnekar_n_Page_015.QC.jpg
cf56cbb3929017e12b3dbc534daaadde 124694
haldavnekar_n_Page_016.jpg
JPEG16.2 6e45b3c782ac8c0867f7f64ee1f5debd 52648
haldavnekar_n_Page_016.QC.jpg
a9a0850a77299870f2123d65f6ac5597 142457
haldavnekar_n_Page_017.jpg
JPEG17.2 f4acfa17da803b6eb7f3f3afb4d98382 54671
haldavnekar_n_Page_017.QC.jpg
a83a1f5e991f289f5f32b0a1f2a60f9a 77375
haldavnekar_n_Page_018.jpg
JPEG18.2 a4bdf05273f3d53fdd950acfc3f1e4c2 38108
haldavnekar_n_Page_018.QC.jpg
4bba02eb5331751ebfe9fd8740f74073 107890
haldavnekar_n_Page_019.jpg
JPEG19.2 b70c961e6a0325c1cd2df716aa114a04 48522
haldavnekar_n_Page_019.QC.jpg
b059832921e8b4cbb6976b7e2fa2a4a2 93942
haldavnekar_n_Page_020.jpg
JPEG20.2 9ded2e53f0df958a7cabed0ca27e987e 45750
haldavnekar_n_Page_020.QC.jpg
2fb8518eab947ad263803556ab81cd2d 122698
haldavnekar_n_Page_021.jpg
JPEG21.2 6621e62cf0070b1540b7713a260a8f38 53925
haldavnekar_n_Page_021.QC.jpg
1abefff487a5fc4fcf454904e0968b08 121949
haldavnekar_n_Page_022.jpg
JPEG22.2 392bcc83fca696689022140ff0c49f1f 53198
haldavnekar_n_Page_022.QC.jpg
6aa5667b8bbde2d1c90051122d8acea5 123597
haldavnekar_n_Page_023.jpg
JPEG23.2 6de1d84c60e6fe7eaa060bc9c5f22215 54497
haldavnekar_n_Page_023.QC.jpg
981bd230ceb1cee0ec7ea957e571ce00 118186
haldavnekar_n_Page_024.jpg
JPEG24.2 817578beb7747e1c7548acaa29099211 52379
haldavnekar_n_Page_024.QC.jpg
f7f263919764d7ee440a06a48e8edd9e 124526
haldavnekar_n_Page_025.jpg
JPEG25.2 aa30cc96ef2e7d56b3c23bd86f529083 54752
haldavnekar_n_Page_025.QC.jpg
cb291f04833c8fa0b7ded34aa908c7f1 121048
haldavnekar_n_Page_026.jpg
JPEG26.2 7051b7ad51185fac17c4b1b0b70af03e 52650
haldavnekar_n_Page_026.QC.jpg
71ed21b944905c886d4da900f1a29c71 121733
haldavnekar_n_Page_027.jpg
JPEG27.2 deac5bdfe8962d5c07b04274ef201c13 53487
haldavnekar_n_Page_027.QC.jpg
e5a2a2ec6a40c595b5cd750583d46c35 124584
haldavnekar_n_Page_028.jpg
JPEG28.2 3bc338d95ac36d7eb6e7b535e32225b8 54968
haldavnekar_n_Page_028.QC.jpg
97c61cd617723c7d8c98810b47765098 121122
haldavnekar_n_Page_029.jpg
JPEG29.2 12f15698c37414d34d0b03aded44c0a7 53271
haldavnekar_n_Page_029.QC.jpg
1c87b7e81ab635037bbd4a544d80c7fc 48688
haldavnekar_n_Page_030.jpg
JPEG30.2 3dc7bbdd5c788e6c8808235bc23b1a67 28677
haldavnekar_n_Page_030.QC.jpg
e77fc827debd3a14583889d6376518e3 106329
haldavnekar_n_Page_031.jpg
JPEG31.2 bb14014ab3f5f8d0dd89a9e963d8293b 47598
haldavnekar_n_Page_031.QC.jpg
431572e944ef5d0fb9ec06729de2d974 109517
haldavnekar_n_Page_032.jpg
JPEG32.2 cb84d9fa489b1a605948731e52e83f1b 50299
haldavnekar_n_Page_032.QC.jpg
4e48c8ab15e72f5b2e8dddd7734b51e5 102582
haldavnekar_n_Page_033.jpg
JPEG33.2 20447f4d6d25d61f47a0bdb72ea37f74 47365
haldavnekar_n_Page_033.QC.jpg
471f3903f6335b3ac7a51e2b6022da30 116790
haldavnekar_n_Page_034.jpg
JPEG34.2 ab75ed4a86013b5582e157e5f6c8cba7 52349
haldavnekar_n_Page_034.QC.jpg
998999dc1b444af449c2f6e502725457 119200
haldavnekar_n_Page_035.jpg
JPEG35.2 08a9ccf5f97449567d5ccbb7a0c07200 52821
haldavnekar_n_Page_035.QC.jpg
c459b8214ce7146eed17dc03b863b1bb 77813
haldavnekar_n_Page_036.jpg
JPEG36.2 1b38e58d4ccbe24552a7a49fa138c7e8 39450
haldavnekar_n_Page_036.QC.jpg
1f4f7fda8b119d2dc98a74e2afc7a4dc 97725
haldavnekar_n_Page_037.jpg
JPEG37.2 3eca8d1c132366b1d591537171a7fea8 46639
haldavnekar_n_Page_037.QC.jpg
a6b2c9472121695beac33506244187f3 95778
haldavnekar_n_Page_038.jpg
JPEG38.2 74e9ea7e5ec56b45b69c9bdf3612f9a2 45681
haldavnekar_n_Page_038.QC.jpg
ec0e7043b456c830f8a1638224dffb47 129547
haldavnekar_n_Page_039.jpg
JPEG39.2 552d36d9a697a121aae049ed72a46a88 53250
haldavnekar_n_Page_039.QC.jpg
bdca10be8c9bdcbca76372022d5bde0f 127964
haldavnekar_n_Page_040.jpg
JPEG40.2 225940c7f66d47dfcea31182b552e07f 54334
haldavnekar_n_Page_040.QC.jpg
a8f513f73c6632d8d8be4ef5c352c10d 114039
haldavnekar_n_Page_041.jpg
JPEG41.2 03f191a54c200a34445bed57443246fd 51150
haldavnekar_n_Page_041.QC.jpg
7f31dae180752fb13ff9cd1aab5f9f95 118558
haldavnekar_n_Page_042.jpg
JPEG42.2 915df6c09a98def30feb305962aa1f3d 52049
haldavnekar_n_Page_042.QC.jpg
ccb405bfb1b33c2d333ef5bc78728552 115592
haldavnekar_n_Page_043.jpg
JPEG43.2 7da036469e91fb13bdbeafd78d8fa6ae 50846
haldavnekar_n_Page_043.QC.jpg
4da30ebcf8a21ffcc94f2f7f1c7371a0 158396
haldavnekar_n_Page_044.jpg
JPEG44.2 9205e7c52f88e7820d84d91467e4d5f4 56532
haldavnekar_n_Page_044.QC.jpg
23353efe0cea8363a2a1c3108be953d6 106486
haldavnekar_n_Page_045.jpg
JPEG45.2 9bef5fc701145714410d4f54f5e55eb6 47312
haldavnekar_n_Page_045.QC.jpg
8f7c75696fbb117008bc9c671d374459 124186
haldavnekar_n_Page_046.jpg
JPEG46.2 31738006515afd902d592a1a9412b272 54292
haldavnekar_n_Page_046.QC.jpg
19cfd4cda91441c78831daa341b69a71 69557
haldavnekar_n_Page_047.jpg
JPEG47.2 9f1a8c66b4245268fb1dc1dff9e8445e 36124
haldavnekar_n_Page_047.QC.jpg
5e8bc4a9368f12e0440d3e28fa1ba5c2 92220
haldavnekar_n_Page_048.jpg
JPEG48.2 129d1fd8e56eb70ba2f75426b8a399aa 43490
haldavnekar_n_Page_048.QC.jpg
dfe8f5735f42bc6c3f3045880777f6bd 107115
haldavnekar_n_Page_049.jpg
JPEG49.2 98f393be7e79989fbe28a60a280c51a9 44761
haldavnekar_n_Page_049.QC.jpg
5b82b2d7963465dea1295227cedc55f1 90390
haldavnekar_n_Page_050.jpg
JPEG50.2 27f4dd9ec74593cdb41621e5656cfe97 39333
haldavnekar_n_Page_050.QC.jpg
0c05da56522e2ef92606e29dda81a86f 106955
haldavnekar_n_Page_051.jpg
JPEG51.2 e710168c3db4df8810bee522fbd904e2 49287
haldavnekar_n_Page_051.QC.jpg
548cba7af6da4fb2bc632dc205297dc8 121289
haldavnekar_n_Page_052.jpg
JPEG52.2 bd27c441195acbd50c8d7f85378ab19f 53514
haldavnekar_n_Page_052.QC.jpg
5254e72e9f8ef43f25ba1be8f860d784 116117
haldavnekar_n_Page_053.jpg
JPEG53.2 4fad51fa546d339addb9ad24df72334c 51567
haldavnekar_n_Page_053.QC.jpg
15e8f3f71cc20013cb0387f62cdf22bd 33940
haldavnekar_n_Page_054.jpg
JPEG54.2 5dc4113c408c31fc7f9d5d7ef0e22619 23788
haldavnekar_n_Page_054.QC.jpg
05555d7911e6eebe9eddc05b2a300bc1 99989
haldavnekar_n_Page_055.jpg
JPEG55.2 1d92a0a4b7244d1d3bda285bd0f3fd2e 44617
haldavnekar_n_Page_055.QC.jpg
4572f4642bf7990447f737f981f43718 95630
haldavnekar_n_Page_056.jpg
JPEG56.2 dff1db5803247041694f11d0c1524f81 45649
haldavnekar_n_Page_056.QC.jpg
f994663cd40abf5176f2c4cc472f04d8 100792
haldavnekar_n_Page_057.jpg
JPEG57.2 d1cc2830d05cda37e6373d0afaef555c 45958
haldavnekar_n_Page_057.QC.jpg
af2438d52d51cb8ca04aabc35fb8c60c 91347
haldavnekar_n_Page_058.jpg
JPEG58.2 5ff6af6de116a6bd2f1d455d311ce8a3 41844
haldavnekar_n_Page_058.QC.jpg
efec58d0118bde74073103d321de3cfe 101199
haldavnekar_n_Page_059.jpg
JPEG59.2 8eb7f521254dd7e30f69e851efa891ee 46519
haldavnekar_n_Page_059.QC.jpg
0a5011b52c4e756f6c3650b1ba5d7633 118701
haldavnekar_n_Page_060.jpg
JPEG60.2 10eec10c1517f2e4b70255055c422b76 52402
haldavnekar_n_Page_060.QC.jpg
97e93adffe4a5a8db3ac6be567b5ae26 93555
haldavnekar_n_Page_061.jpg
JPEG61.2 e9f04ce637ca8df96179b9543284a1bd 45464
haldavnekar_n_Page_061.QC.jpg
fe027ba2a4e969543ef6e816afc1fe24 101349
haldavnekar_n_Page_062.jpg
JPEG62.2 867db9f6e820549b16b39331d125bf88 46664
haldavnekar_n_Page_062.QC.jpg
fb19046d359b6caf53b10f040bd92cf6 93804
haldavnekar_n_Page_063.jpg
JPEG63.2 af57d15a55441cae7636ab19778dc353 41622
haldavnekar_n_Page_063.QC.jpg
62c7a3b3c1da4b3563d02758ffcd03b4 89910
haldavnekar_n_Page_064.jpg
JPEG64.2 df8d602b333a082145f4c87531fe9925 41983
haldavnekar_n_Page_064.QC.jpg
bb3c8a1507066f173414bceac7ce1ee1 112303
haldavnekar_n_Page_065.jpg
JPEG65.2 99ae2d682c1d8a925b248ae17cb994bb 50868
haldavnekar_n_Page_065.QC.jpg
95fdebc04b39d6c759dee0ae52ec334e 93866
haldavnekar_n_Page_066.jpg
JPEG66.2 3ff0b745160cd5793385ecc2817ed44d 42943
haldavnekar_n_Page_066.QC.jpg
c059979d42c0b9a05c507541a1d523ae 110502
haldavnekar_n_Page_067.jpg
JPEG67.2 b1b178faa58b11056de9635278609c33 49160
haldavnekar_n_Page_067.QC.jpg
b7969f8585684e4e013529af9e8b73dc 92750
haldavnekar_n_Page_068.jpg
JPEG68.2 cca8c54d221395c9aa9df42a8c8a53dc 45089
haldavnekar_n_Page_068.QC.jpg
3baf82ac3be9aa27e4af1f3a96db4eb5 86846
haldavnekar_n_Page_069.jpg
JPEG69.2 d7e2aec4e046d26c1045466b83e4d014 40485
haldavnekar_n_Page_069.QC.jpg
a3ff4c6a34c28d4abcdb490f125a5f27 97122
haldavnekar_n_Page_070.jpg
JPEG70.2 434476c4c9ca848a5b3c8d79105a106a 44422
haldavnekar_n_Page_070.QC.jpg
63f80375ca95787515d4dbc97eda58ce 105481
haldavnekar_n_Page_071.jpg
JPEG71.2 ec108c747f5bf18cd65ce53a3698f688 49085
haldavnekar_n_Page_071.QC.jpg
cd285e399f85368cd08aebbd692a4b80 122339
haldavnekar_n_Page_072.jpg
JPEG72.2 9debc1375933325cf358ec335ed1de3f 53704
haldavnekar_n_Page_072.QC.jpg
e0ad393536bcabdc248a4417bfe6e193 118012
haldavnekar_n_Page_073.jpg
JPEG73.2 001899eab94262d44229b52f4509c091 52463
haldavnekar_n_Page_073.QC.jpg
9c52e2d8f6fa886ddef3c28f70137323 105096
haldavnekar_n_Page_074.jpg
JPEG74.2 7779702ca34a1331226ffdb685b9b9e0 47192
haldavnekar_n_Page_074.QC.jpg
b7f408e4d6a32000661a6d4195f85c3b 81201
haldavnekar_n_Page_075.jpg
JPEG75.2 8d4003c034f3721c38d914823e7b5283 39195
haldavnekar_n_Page_075.QC.jpg
ed02b2fe2d6b2ba82bed9a3fd9872ab9 107107
haldavnekar_n_Page_076.jpg
JPEG76.2 0402f89e88af1dc90332c4b243ccfac5 47947
haldavnekar_n_Page_076.QC.jpg
eb577483d1c26c42c9a46e48e6600fa3 116785
haldavnekar_n_Page_077.jpg
JPEG77.2 c43af1d6b2c45895d3661b34c8ee3ad7 53231
haldavnekar_n_Page_077.QC.jpg
b55e37a0f8ae7380c5d8636081f902b3 120984
haldavnekar_n_Page_078.jpg
JPEG78.2 57f31b114447351d4dbab06c1804e86d 52511
haldavnekar_n_Page_078.QC.jpg
674aef55471e1289dc78a20d51d36c4b 119572
haldavnekar_n_Page_079.jpg
JPEG79.2 542985f638e9112d4f395d4a83fdbc13 52856
haldavnekar_n_Page_079.QC.jpg
49dad9b75a3ab6be935a755b50beb124 133524
haldavnekar_n_Page_080.jpg
JPEG80.2 a2eda7fd12b9bfd5d9fc5f7fd2529c0c 52432
haldavnekar_n_Page_080.QC.jpg
8a821787559fbd9398bd368fbdedd77d 85370
haldavnekar_n_Page_081.jpg
JPEG81.2 371c691954d2a72b4942d29231bd1be5 40199
haldavnekar_n_Page_081.QC.jpg
79c974ea3b9d68c2efe0219d7f890a71 80870
haldavnekar_n_Page_082.jpg
JPEG82.2 607d7a1881ee37f292043e68456d35f6 38651
haldavnekar_n_Page_082.QC.jpg
2bcb0f0a0deb7bab30e7f2c235702c3e 119962
haldavnekar_n_Page_083.jpg
JPEG83.2 4e3cbaaaa317cfb0485189861f34e5bc 53606
haldavnekar_n_Page_083.QC.jpg
7acd8dc59c1066eb8a90d84dd6e8fb2d 117631
haldavnekar_n_Page_084.jpg
JPEG84.2 893076c23fb0075817b12bfe7ce98215 52683
haldavnekar_n_Page_084.QC.jpg
4c8ed0a976513dd11cc1f27b6f5539bf 118882
haldavnekar_n_Page_085.jpg
JPEG85.2 2edf9ea18e46483c3d291d532ac916ac 52337
haldavnekar_n_Page_085.QC.jpg
2a5fa341c48954b7743c1a9d234fabda 33028
haldavnekar_n_Page_086.jpg
JPEG86.2 9e82a204a6b0e352858f75cba493a6e5 23507
haldavnekar_n_Page_086.QC.jpg
a7579399fa93e88779377d88aadd3fd7 101353
haldavnekar_n_Page_087.jpg
JPEG87.2 8ffa5ce40182155c3043548bfebdfbcb 46205
haldavnekar_n_Page_087.QC.jpg
bc5bb0524a126ebfe991a17def992956 120233
haldavnekar_n_Page_088.jpg
JPEG88.2 f39ee5b7d7a94c153f967e4f91e8a6fb 52812
haldavnekar_n_Page_088.QC.jpg
28fd9b356a3290de2a715fe642f0c9e4 109001
haldavnekar_n_Page_089.jpg
JPEG89.2 ddc2d2d74d1df028230be148ed5b39f0 50280
haldavnekar_n_Page_089.QC.jpg
99e444bc9f97579999da834ff3d8c04f 113737
haldavnekar_n_Page_090.jpg
JPEG90.2 d64239b114e99326f213ddc865b2581e 51301
haldavnekar_n_Page_090.QC.jpg
c93b543f6756461e1f68b4997eae287f 114826
haldavnekar_n_Page_091.jpg
JPEG91.2 d5c5538f6754487c3d945fb9091fcd75 51029
haldavnekar_n_Page_091.QC.jpg
0259a768e3710dc5f73d2a9b803c6cf9 136037
haldavnekar_n_Page_092.jpg
JPEG92.2 fd45e3abf1e52846240557a8afedcf09 55214
haldavnekar_n_Page_092.QC.jpg
303530e467e9e2e06dfdccd5c5b84b2a 129331
haldavnekar_n_Page_093.jpg
JPEG93.2 8b2ed6a898a0e7ceb24f8c1fe3deafa7 53873
haldavnekar_n_Page_093.QC.jpg
a554329efade953a3a64bdb5dafbb25d 124334
haldavnekar_n_Page_094.jpg
JPEG94.2 f69da93d302503531843d4688516078b 54324
haldavnekar_n_Page_094.QC.jpg
e36f0c60f6e4d2177a6158f6ef612f91 109703
haldavnekar_n_Page_095.jpg
JPEG95.2 9effba8a8505540de13697bc9346ea8a 49597
haldavnekar_n_Page_095.QC.jpg
9b75ac183905ff637b9fea2d72a09824 80467
haldavnekar_n_Page_096.jpg
JPEG96.2 ec672da4261d5b1c4499fcca525e417f 39206
haldavnekar_n_Page_096.QC.jpg
62e8f92381adaf5dd6970a44ab931fe2 75644
haldavnekar_n_Page_097.jpg
JPEG97.2 5ab0833a423a4773b26a5abc856e4413 37951
haldavnekar_n_Page_097.QC.jpg
1fcfc7ce69fab85d09823152a880a7f3 50677
haldavnekar_n_Page_098.jpg
JPEG98.2 d49e3d0e590aa0f8e1d39556a1d7f131 28943
haldavnekar_n_Page_098.QC.jpg
64f7787884f64f4334a82271e850ef4a 77809
haldavnekar_n_Page_099.jpg
JPEG99.2 cde68384c6e01fbc84f0d20baa3cf277 38234
haldavnekar_n_Page_099.QC.jpg
f299abeee1e6cb237816f02bb88e672b 75085
haldavnekar_n_Page_100.jpg
JPEG100.2 104622f3eb825679369ef832fe7c8de2 34536
haldavnekar_n_Page_100.QC.jpg
a9c153ae6058214a225a8eee933265b7 63657
haldavnekar_n_Page_101.jpg
JPEG101.2 775d140d10a6bf5f662fa509640a1d49 31955
haldavnekar_n_Page_101.QC.jpg
651afaf3b640f592d537695425f97fb6 79371
haldavnekar_n_Page_102.jpg
JPEG102.2 0727bda8bbbd8b01d5164c969c29033c 38690
haldavnekar_n_Page_102.QC.jpg
e83e763820a815da3b824b52021d2e0d 87365
haldavnekar_n_Page_103.jpg
JPEG103.2 f6c5de4cbd00be577abaa003534a7683 41865
haldavnekar_n_Page_103.QC.jpg
73d6b64f71294156f6434c92afc95201 79114
haldavnekar_n_Page_104.jpg
JPEG104.2 aa219a9cfe133032f6f715b4c86e872a 40063
haldavnekar_n_Page_104.QC.jpg
db0a99dcdc9fa03730d2f7517aea46c1 79017
haldavnekar_n_Page_105.jpg
JPEG105.2 6ec11e3bffabadbb2ee9d84f1607f0f0 39369
haldavnekar_n_Page_105.QC.jpg
403a4e3980c62f6f31f6dcaf3a80ad7a 77370
haldavnekar_n_Page_106.jpg
JPEG106.2 12aade1e026af10dacbfddfc1d6c6966 39278
haldavnekar_n_Page_106.QC.jpg
379c01095f803b770a15f8962918ee33 108195
haldavnekar_n_Page_107.jpg
JPEG107.2 158e017b2a5e1d7503aaa906ae265aee 45689
haldavnekar_n_Page_107.QC.jpg
fcded1d397c2689da782bcd5df05f696 103228
haldavnekar_n_Page_108.jpg
JPEG108.2 33778778532a3699625ec7374e7e50c7 45859
haldavnekar_n_Page_108.QC.jpg
3e8ab13004eb1bc2802e60e4b5f71e4a 32708
haldavnekar_n_Page_109.jpg
JPEG109.2 e7a008178cc9521d4459a992986dc186 22339
haldavnekar_n_Page_109.QC.jpg
2ff4f6a0a2df188ed5f95d3b2c2ec89a 129148
haldavnekar_n_Page_110.jpg
JPEG110.2 f9d2d51c86902de7b7cabc36c43ff98b 51130
haldavnekar_n_Page_110.QC.jpg
a0b8a54dd1ae4a03a7011c1bd353c073 140404
haldavnekar_n_Page_111.jpg
JPEG111.2 55c289860ab10f6505386299c2418d8d 55117
haldavnekar_n_Page_111.QC.jpg
233b0ee0725e9fa5686ca2ff56efd7cb 147343
haldavnekar_n_Page_112.jpg
JPEG112.2 668c5bbff1ac5e436206c6bbc82fb546 55709
haldavnekar_n_Page_112.QC.jpg
40e21a0c240d70abf7a882245209e95f 136711
haldavnekar_n_Page_113.jpg
JPEG113.2 979bea1f60b0c6e036a87ba5f7096076 53137
haldavnekar_n_Page_113.QC.jpg
59faa355670b6df4e19dbc3da1755b08 133604
haldavnekar_n_Page_114.jpg
JPEG114.2 d00f98effe34860625554ab52a093576 52646
haldavnekar_n_Page_114.QC.jpg
fc572bb36af49b78ec27980c3ab0ec48 117835
haldavnekar_n_Page_115.jpg
JPEG115.2 77f43a6be40060c0469327c5137c5e66 48047
haldavnekar_n_Page_115.QC.jpg
7c912c165fb3f6c4a19a990ed61e109c 69131
haldavnekar_n_Page_116.jpg
JPEG116.2 5905df81e3f9f498ab5133b2ba65a9d7 35268
haldavnekar_n_Page_116.QC.jpg
THUMB1 imagejpeg-thumbnails f9c4a83f449430487aba82afaa66e193 20556
haldavnekar_n_Page_001thm.jpg
THUMB2 60a50b87e4a77075a54343a06e671908 18253
haldavnekar_n_Page_002thm.jpg
THUMB3 4b67316211636e9df668fd0f36bc8f32 18213
haldavnekar_n_Page_003thm.jpg
THUMB4 db743ed20fabb7027ee01db161efc5fd 25203
haldavnekar_n_Page_004thm.jpg
THUMB5 c89af77f7a8ac1759b4de5dfdf5873fa 23236
haldavnekar_n_Page_005thm.jpg
THUMB6 289f51febcbb024a6f01af961f063f62 24827
haldavnekar_n_Page_006thm.jpg
THUMB7 8aeec6843a8296664ad87a8bce3b5c9e 19340
haldavnekar_n_Page_007thm.jpg
THUMB8 7279d04f4e2bc54d66401824b3a764c6 24585
haldavnekar_n_Page_008thm.jpg
THUMB9 6019669f984bdd2270f5b1b7d60204da 20076
haldavnekar_n_Page_009thm.jpg
THUMB10 7ffeebd7c8f2bc5af7636c26db466497 26957
haldavnekar_n_Page_010thm.jpg
THUMB11 cd55d1cfd2ca4103c34372dd837ad264 25651
haldavnekar_n_Page_011thm.jpg
THUMB12 b63cb7a2341634f5b71545d079a00260 27470
haldavnekar_n_Page_012thm.jpg
THUMB13 4edeb8ef8768eccb6936a3c3399266a0 29103
haldavnekar_n_Page_013thm.jpg
THUMB14 d1f480ef221f3b8d9a01f94becbb691e 28713
haldavnekar_n_Page_014thm.jpg
THUMB15 85e5a42a6ffe925508208038d47f7afe 28975
haldavnekar_n_Page_015thm.jpg
THUMB16 f1343e1f833a1e6cab16cf3cc6f3c703 28561
haldavnekar_n_Page_016thm.jpg
THUMB17 ed1917de65553ff7b8223d892f883b3e 29157
haldavnekar_n_Page_017thm.jpg
THUMB18 e86793fdca43a96fcaf2da68cac7f362 24235
haldavnekar_n_Page_018thm.jpg
THUMB19 5ccb2e4fed5dd6e9694d42e08ea178b8 27426
haldavnekar_n_Page_019thm.jpg
THUMB20 8cd14adeb9fe3ced3c3bae1fd2f70826 27079
haldavnekar_n_Page_020thm.jpg
THUMB21 fb8ed43ef8430ed8b2b16fbf99677e2f 29142
haldavnekar_n_Page_021thm.jpg
THUMB22 efbbc1ab3b018602c3d036c6e09c941c 28933
haldavnekar_n_Page_022thm.jpg
THUMB23 3311d41d2fcafb1876caf21f789be6a3 29108
haldavnekar_n_Page_023thm.jpg
THUMB24 715cee7f8654d5f598334c33ad189275 28721
haldavnekar_n_Page_024thm.jpg
THUMB25 cd0e30aefc53911b38832d8cac7299d8 29647
haldavnekar_n_Page_025thm.jpg
THUMB26 b0e78180fa4b28a0f0912dfc0af1ba82 28727
haldavnekar_n_Page_026thm.jpg
THUMB27 37d91ce7141d6bcee7b8523f144c2795 29042
haldavnekar_n_Page_027thm.jpg
THUMB28 8768a71cbbe8989505e6dcdd580f722a 29078
haldavnekar_n_Page_028thm.jpg
THUMB29 a9d7156f47c8cdccd390443691b18c68 28870
haldavnekar_n_Page_029thm.jpg
THUMB30 23aa852228d7f3cfb11588636c9a7b58 21215
haldavnekar_n_Page_030thm.jpg
THUMB31 7a318446f4ffb62858f9963b45b44e13 27032
haldavnekar_n_Page_031thm.jpg
THUMB32 30e826e749e4586e899393015aa024b2 29074
haldavnekar_n_Page_032thm.jpg
THUMB33 11369039f1823ce98a566ed730e249c5 27416
haldavnekar_n_Page_033thm.jpg
THUMB34 b3e7a84404dca0c9cee02f2718a0722a 28443
haldavnekar_n_Page_034thm.jpg
THUMB35 049ff97d3344207831f142b3eea2fdfa 28627
haldavnekar_n_Page_035thm.jpg
THUMB36 149aa34c2020366d9d545b2e872a4097 25796
haldavnekar_n_Page_036thm.jpg
THUMB37 4a1730fb2fb84c970b4b45ce95483434 27222
haldavnekar_n_Page_037thm.jpg
THUMB38 533afdb2feb0126928e3a259d47525ad 27162
haldavnekar_n_Page_038thm.jpg
THUMB39 f5fb28113bde5f8276e0346d1e702317 29186
haldavnekar_n_Page_039thm.jpg
THUMB40 a94388d0fa3fa857662d00c545c84b3c 28871
haldavnekar_n_Page_040thm.jpg
THUMB41 bb9ac1cfe763b210d485b189cfb13c83 28553
haldavnekar_n_Page_041thm.jpg
THUMB42 217bc4d4285780c9e4e88bc611572389 28842
haldavnekar_n_Page_042thm.jpg
THUMB43 f102c95082d0f2a06c6dc780202dd9a6 28588
haldavnekar_n_Page_043thm.jpg
THUMB44 40d3d73b385f0c06bd1c377fef8492fa 29216
haldavnekar_n_Page_044thm.jpg
THUMB45 ba15ed1301d1f838389d4a826082aa2d 27757
haldavnekar_n_Page_045thm.jpg
THUMB46 dddcf55f9b56df856b3c08826d24b87b 29301
haldavnekar_n_Page_046thm.jpg
THUMB47 e83c10d84423ae4622b49be99be8dd96 23387
haldavnekar_n_Page_047thm.jpg
THUMB48 da4d85cb369b24744f3308a445a08e3e 26491
haldavnekar_n_Page_048thm.jpg
THUMB49 8c84ba91bae43ac807c3a57eef1a17bb 26369
haldavnekar_n_Page_049thm.jpg
THUMB50 391a05f631abbf4e20d5c7392c2cd518 24099
haldavnekar_n_Page_050thm.jpg
THUMB51 ffd11d3b8f5d63c197d890c9c38577a3 28635
haldavnekar_n_Page_051thm.jpg
THUMB52 6bd1de7a6503bf01d887a8a1b9071e18 28890
haldavnekar_n_Page_052thm.jpg
THUMB53 8c84b0af8e1d7ca8128c9cd90bccf701 28380
haldavnekar_n_Page_053thm.jpg
THUMB54 883bec8e17a2b4297be19843b2cc490c 19588
haldavnekar_n_Page_054thm.jpg
THUMB55 034c822525d8b074b7d6366e4440e16d 26852
haldavnekar_n_Page_055thm.jpg
THUMB56 0aeb8437e7274f0a3970cd953a7b7163 27499
haldavnekar_n_Page_056thm.jpg
THUMB57 4b090f80d2bdc7cfb3fbc956879636c3 26531
haldavnekar_n_Page_057thm.jpg
THUMB58 fb4d9b7548a925335db8b18928e72e4b 26127
haldavnekar_n_Page_058thm.jpg
THUMB59 a58870083a60ed23f19d25a72d9f2239
haldavnekar_n_Page_059thm.jpg
THUMB60 8812e9b0d897e15b1c7906aefe9c1496 28640
haldavnekar_n_Page_060thm.jpg
THUMB61 21e399a8fd53750194bd8a99223dcdfd 26709
haldavnekar_n_Page_061thm.jpg
THUMB62 75a508f88daa08c5aa2d98a20a084d02 27309
haldavnekar_n_Page_062thm.jpg
THUMB63 bc066b08eb7939c90b604ac5b1c7c4fa 25526
haldavnekar_n_Page_063thm.jpg
THUMB64 061eb903533586f3a595fbfd3c939c2d 26022
haldavnekar_n_Page_064thm.jpg
THUMB65 4b1026fe1f5e888e195eda85a6b6c3d8 28027
haldavnekar_n_Page_065thm.jpg
THUMB66 e287ace0dfa852198967e2bab06ec740 26198
haldavnekar_n_Page_066thm.jpg
THUMB67 78617f271ca137d762b2653cfc948359 27756
haldavnekar_n_Page_067thm.jpg
THUMB68 72299c480d2ed8b10e28a51d7ccea2c5 27092
haldavnekar_n_Page_068thm.jpg
THUMB69 329f0afadbade1c7b71ebdeb3442b37e 25616
haldavnekar_n_Page_069thm.jpg
THUMB70 064f9a2bb32e7e25119b971f90443620 26311
haldavnekar_n_Page_070thm.jpg
THUMB71 c2bdd620a6214d9538e53466361d2a96 28341
haldavnekar_n_Page_071thm.jpg
THUMB72 21651add789767a7b478bf316173e7e3 29297
haldavnekar_n_Page_072thm.jpg
THUMB73 70c491c9d13f99e4d9fe073751703ef5 28706
haldavnekar_n_Page_073thm.jpg
THUMB74 9fe056a403c3dd809a791a6cf97b4c7a 27434
haldavnekar_n_Page_074thm.jpg
THUMB75 76ac2203708fb9a1d78feca20271ccfe 24430
haldavnekar_n_Page_075thm.jpg
THUMB76 6ea32149122399b566659149b63dbcf3 27577
haldavnekar_n_Page_076thm.jpg
THUMB77 cd61c9cb8d3578e3fbedb889a7f3c2ea 28546
haldavnekar_n_Page_077thm.jpg
THUMB78 07a4c2bb300a8ecbbfdc966ce0f212cb 28537
haldavnekar_n_Page_078thm.jpg
THUMB79 b201d713e0088a3a696ad8ddf532dd73 28940
haldavnekar_n_Page_079thm.jpg
THUMB80 fe05c0a125838e530cecc27c32c8ccf9 28708
haldavnekar_n_Page_080thm.jpg
THUMB81 5be47c71471e2aac9bf53710f2a39903 25424
haldavnekar_n_Page_081thm.jpg
THUMB82 0f3b6d071ed35929a71222a12000f6b8 25108
haldavnekar_n_Page_082thm.jpg
THUMB83 df9fabe531bb67bdab356553efbe7888 29364
haldavnekar_n_Page_083thm.jpg
THUMB84 d17dd057c9a02d266d692c3b2e234457 29212
haldavnekar_n_Page_084thm.jpg
THUMB85 31257bc75e4bbf72a9902dcd89d6b742 28892
haldavnekar_n_Page_085thm.jpg
THUMB86 37ec490bdb6a1d789c84a15f73389f5d 19481
haldavnekar_n_Page_086thm.jpg
THUMB87 c9abbb3de2d97493d433eaa7d0a67f5d 26903
haldavnekar_n_Page_087thm.jpg
THUMB88 e2aef18b2ef0c09f94c972945d88a156 28810
haldavnekar_n_Page_088thm.jpg
THUMB89 fd4344bcbb2e22b4a060c395e5fd2351 28066
haldavnekar_n_Page_089thm.jpg
THUMB90 ef5b21d3de71530c6645494d103ff2a8 28517
haldavnekar_n_Page_090thm.jpg
THUMB91 c7bbda36ca6988eed9aeb44175f6c968 28394
haldavnekar_n_Page_091thm.jpg
THUMB92 f8b1205c7daf36647a2495a2178107ca 30079
haldavnekar_n_Page_092thm.jpg
THUMB93 6b0348b0f33793d6a06677153452a51e 28897
haldavnekar_n_Page_093thm.jpg
THUMB94 5576451f607e52636405c58338cf700f 29207
haldavnekar_n_Page_094thm.jpg
THUMB95 b0012ce380bdc65c1863372938f88f9d 28013
haldavnekar_n_Page_095thm.jpg
THUMB96 02abab0b05a664e0dd8d136077e4fa15 24770
haldavnekar_n_Page_096thm.jpg
THUMB97 f8a35f778e636381acc0e038c2442333 24088
haldavnekar_n_Page_097thm.jpg
THUMB98 c9d85713b881c3654846a33400b210fa 21204
haldavnekar_n_Page_098thm.jpg
THUMB99 5c81c3a12bd3cd8f8ded7ff918815d3b 24846
haldavnekar_n_Page_099thm.jpg
THUMB100 f2b92e837729daa8d2b5448f63054e36 23413
haldavnekar_n_Page_100thm.jpg
THUMB101 1c055495e3b336754ad1f65d3128b1f2 22434
haldavnekar_n_Page_101thm.jpg
THUMB102 92e4ff98982f535a83b805bdf000da9d 25090
haldavnekar_n_Page_102thm.jpg
THUMB103 8d7a44f461270273f062b4e22bbf3dee 25231
haldavnekar_n_Page_103thm.jpg
THUMB104 4789cb4108a4e75665c9ef8c5e1477f6 25579
haldavnekar_n_Page_104thm.jpg
THUMB105 dabc7d1d61a6a95502211cb40cd56fed 24833
haldavnekar_n_Page_105thm.jpg
THUMB106 ea369d8cd7ce0f73182c0314f0d893e7 25159
haldavnekar_n_Page_106thm.jpg
THUMB107 3d03186b4f1fcb9c533f2bcf28150814 26879
haldavnekar_n_Page_107thm.jpg
THUMB108 53cd5ef61e0c0e1c31c0d0c116be0f67 26992
haldavnekar_n_Page_108thm.jpg
THUMB109 85fb565c1f9a85bd0811fa6b4850a012 19046
haldavnekar_n_Page_109thm.jpg
THUMB110 be5cb439b1c895ad429c0063817a5e8c 28570
haldavnekar_n_Page_110thm.jpg
THUMB111 1eb42a09be76054fbd3747c0d05481f3 29870
haldavnekar_n_Page_111thm.jpg
THUMB112 ede263100d3592835a39f41a4f8a9804 29861
haldavnekar_n_Page_112thm.jpg
THUMB113 2ac9963614cf55ce6703854e48816503 29241
haldavnekar_n_Page_113thm.jpg
THUMB114 fd81fd19cfad83b612387804e7f5d080 29088
haldavnekar_n_Page_114thm.jpg
THUMB115 c16ea226efc15d610ce8eaa4bc724f00 27409
haldavnekar_n_Page_115thm.jpg
THUMB116 68058a720e113078c4632611ed2c1c06 23651
haldavnekar_n_Page_116thm.jpg
TXT1 textplain fe924908aa4ddfa8a5b1b125c6e64ed0 464
haldavnekar_n_Page_001.txt
TXT2 b8438d47489d6a30c18cd3ea13d67315 93
haldavnekar_n_Page_002.txt
TXT3 108cbeca484304ebbca59c88965ff829 90
haldavnekar_n_Page_003.txt
TXT4 de78fd7d9636d90efe49403ac7265365 1272
haldavnekar_n_Page_004.txt
TXT5 23c4f9fa0a2e21eeb5f7d773cda40824 1757
haldavnekar_n_Page_005.txt
TXT6 7601be9d257de7da90c7c996dd812cf2 2095
haldavnekar_n_Page_006.txt
TXT7 cc26d289063b3652d6c6a64ce7948117 292
haldavnekar_n_Page_007.txt
TXT8 2d475352f58c5f3a9c7a49b26516161e 1609
haldavnekar_n_Page_008.txt
TXT9 9c8fe2ed8fc69b92c9bb4d1107969187 407
haldavnekar_n_Page_009.txt
TXT10 7bca20eb039d2d70b95bb49dcc1f9c92 1698
haldavnekar_n_Page_010.txt
TXT11 fe2a08bc0b192eb4f9c133fa469975f1 1292
haldavnekar_n_Page_011.txt
TXT12 1cbbb3054efdac23264a9dccb29992b2 1771
haldavnekar_n_Page_012.txt
TXT13 aee8a3efaa6611c1e8a0f72f5c8e2ca4 2021
haldavnekar_n_Page_013.txt
TXT14 a773183ec5fc92c42a4a51b836175736 1783
haldavnekar_n_Page_014.txt
TXT15 696295209a7d522a09c95d606051d276 1983
haldavnekar_n_Page_015.txt
TXT16 4e150480808505df98c1efe2bf2e7d51 2089
haldavnekar_n_Page_016.txt
TXT17 5a30e620974a8ea19063e95fd2e5d0d1 2608
haldavnekar_n_Page_017.txt
TXT18 3c28dd956a68d70e8a4593875d2e7baf 1094
haldavnekar_n_Page_018.txt
TXT19 1741713b81b55802adbbb2faa229006c 1721
haldavnekar_n_Page_019.txt
TXT20 f3a9a633784dfd0231262817fc87995c 1270
haldavnekar_n_Page_020.txt
TXT21 f2ce5d957b9650fbdea78f1dbbde9837 1906
haldavnekar_n_Page_021.txt
TXT22 208ca7622c65b29312ae5e6b63d8df9c 1918
haldavnekar_n_Page_022.txt
TXT23 17addf75f0e505fda2024bb13c6c7300 1967
haldavnekar_n_Page_023.txt
TXT24 b79ca434675c97b6e988c7fad988cd38 1875
haldavnekar_n_Page_024.txt
TXT25 7febb22b9a61a19693259c8e8785ad7f 1964
haldavnekar_n_Page_025.txt
TXT26 a10686d374a0cd5e0c12094ad45ed572 1904
haldavnekar_n_Page_026.txt
TXT27 c42b9fb5aa5a8e703c285e47ddbd489d 1975
haldavnekar_n_Page_027.txt
TXT28 1b7664c99820c29d7a2630bfd7d500a0 1944
haldavnekar_n_Page_028.txt
TXT29 c1d4528fb665f9c9ada2f275bcc088e4 1889
haldavnekar_n_Page_029.txt
TXT30 a15ebac4435d1e8a793b749a53f94cbb 564
haldavnekar_n_Page_030.txt
TXT31 be6b23d37b3a94dd91322ce8561b26d8 1643
haldavnekar_n_Page_031.txt
TXT32 0e117279eb6bc8a4d1510a55b5350c50 1048
haldavnekar_n_Page_032.txt
TXT33 c719da1f6befa2935fca2a3595febda3 1704
haldavnekar_n_Page_033.txt
TXT34 64456222c0356b7788583f231cbdb773 1837
haldavnekar_n_Page_034.txt
TXT35 cac23a2d4d4c22f0cc0c1da3146d7c13
haldavnekar_n_Page_035.txt
TXT36 00468d7cd5d7f4b0208fe6e06bd43e2c 399
haldavnekar_n_Page_036.txt
TXT37 29d730ddc5969b94843a034bdd93aafa 948
haldavnekar_n_Page_037.txt
TXT38 7e1c9d16e8d9bf7934691b309343a403 1215
haldavnekar_n_Page_038.txt
TXT39 f66d411a62e256fc86a9bcbb671f27e0 2452
haldavnekar_n_Page_039.txt
TXT40 4b62c1cf1161cdd23c8b244198fedb0b 2103
haldavnekar_n_Page_040.txt
TXT41 2ab10b5cb0acbaca92dc8c234a709f4c 1849
haldavnekar_n_Page_041.txt
TXT42 931b900160d51a33139152af9bd2164b 2063
haldavnekar_n_Page_042.txt
TXT43 2b3c8468d103149c7927cefd849ab9f6
haldavnekar_n_Page_043.txt
TXT44 71e7cf2811dbe6dac7dc9582c5883bd0 3192
haldavnekar_n_Page_044.txt
TXT45 9073e36a526ae3d74ef3020c37e9b71e 1808
haldavnekar_n_Page_045.txt
TXT46 68cefdc22139cfd29a522500cd7ee236 1959
haldavnekar_n_Page_046.txt
TXT47 c4e23ce91aae9e4b275142d6034ad4bc 930
haldavnekar_n_Page_047.txt
TXT48 277dae229f00ff62b1b2c4f75d5fa4a3 1322
haldavnekar_n_Page_048.txt
TXT49 b37bc2fe8af722c29b9d04d2f694e970 1878
haldavnekar_n_Page_049.txt
TXT50 0f646783e42fd3f6bf6b2b910ddb69a6 1497
haldavnekar_n_Page_050.txt
TXT51 1eeb0f9bd0ece76f0a3c8fb593f32a46 1026
haldavnekar_n_Page_051.txt
TXT52 dec3d2d7cb7667f5a44ef7a887603b0a 1961
haldavnekar_n_Page_052.txt
TXT53 f4fed1d98958848bc4a36a3563e86a74 1794
haldavnekar_n_Page_053.txt
TXT54 3a65b4793ac84715dd807c799d46c67e 296
haldavnekar_n_Page_054.txt
TXT55 009ec4cda2f086d0a48fbccc04b03702 1626
haldavnekar_n_Page_055.txt
TXT56 492458df953aae8271b018a8fdf18240 1212
haldavnekar_n_Page_056.txt
TXT57 93cc2b447d332a55054f5896606aef2d 1546
haldavnekar_n_Page_057.txt
TXT58 91a60d73c1e952b0b1bb9a922a5148ae 1348
haldavnekar_n_Page_058.txt
TXT59 e145b8c5f1cdefd1c942e4ec4ca99a97 1715
haldavnekar_n_Page_059.txt
TXT60 63a1f87f66386e941c96513a817dbe49 1850
haldavnekar_n_Page_060.txt
TXT61 cc2989ca39ff32a200d597946e957fea 1667
haldavnekar_n_Page_061.txt
TXT62 a55655e75b6e44cf291182d1c44127d1 1678
haldavnekar_n_Page_062.txt
TXT63 77522e9538594ceb099108ad7bc81ef9 1604
haldavnekar_n_Page_063.txt
TXT64 d2fc392ab0f32e814768a8caa34535cd 1484
haldavnekar_n_Page_064.txt
TXT65 c69870f2ed569c51732abd1c71958801 1741
haldavnekar_n_Page_065.txt
TXT66 9cdbe657f10902eb09b695f34021b71a 1886
haldavnekar_n_Page_066.txt
TXT67 9efaeada72648e550ce51d125b4febde 1660
haldavnekar_n_Page_067.txt
TXT68 51ef8706a8f2b33381347bc47a8f88c1 1612
haldavnekar_n_Page_068.txt
TXT69 8292e2dea7a7a3901db66b9c4373572f 1595
haldavnekar_n_Page_069.txt
TXT70 1f229066b2ee78e69fe8e13bab0cbb14 1770
haldavnekar_n_Page_070.txt
TXT71 62cd2d0dba172ae340739738ef097e3f 893
haldavnekar_n_Page_071.txt
TXT72 7ca1c3f4f3a78f9d39b0c13b3b92d234 1957
haldavnekar_n_Page_072.txt
TXT73 655ad77368bef1d3b931b3080eba7c85 1902
haldavnekar_n_Page_073.txt
TXT74 43e00e283a21a965e6b0db8e95cb17e0 1692
haldavnekar_n_Page_074.txt
TXT75 58c181bba5e31cc94fabb073c52e2317 1204
haldavnekar_n_Page_075.txt
TXT76 1123b02b549ee948e477fdeb3038a344 1720
haldavnekar_n_Page_076.txt
TXT77 f20c6d4b720f2c2c8500c130f747862d 1826
haldavnekar_n_Page_077.txt
TXT78 85858f6033417d47213dc617f22a795d 2217
haldavnekar_n_Page_078.txt
TXT79 21c56baeb897bce76af556b9b03da2c2 1921
haldavnekar_n_Page_079.txt
TXT80 5b114f1a2218b12a74ae7652e37d5d2e 2442
haldavnekar_n_Page_080.txt
TXT81 1911e3e0ed645841eda03f802f3553eb 1227
haldavnekar_n_Page_081.txt
TXT82 1e8b41e3b5475400beb60c01045706e2 1663
haldavnekar_n_Page_082.txt
TXT83 e363c19db04ff11de8d296dc68121c35 1924
haldavnekar_n_Page_083.txt
TXT84 1b8f781d52721ba94d006795a23ed8b0 1894
haldavnekar_n_Page_084.txt
TXT85 7934a41fe9d9f4cf4223cfd10c4560cf
haldavnekar_n_Page_085.txt
TXT86 e021a9f9f1d5fe9c61c1b5e41f9921c6 283
haldavnekar_n_Page_086.txt
TXT87 878431093df18264863717089707c2d4 1613
haldavnekar_n_Page_087.txt
TXT88 c3133fb03e4b4d76cab2e390ac08d583 1933
haldavnekar_n_Page_088.txt
TXT89 2dccf374952719ec4427f8abad80a5e1 1749
haldavnekar_n_Page_089.txt
TXT90 2327781767a0f3a247d273b57b80aba2 1813
haldavnekar_n_Page_090.txt
TXT91 c30ca53f787f1ca073cc039a2b5ef232 1832
haldavnekar_n_Page_091.txt
TXT92 6d6700c7573370db06cb0bf915d309fc 2404
haldavnekar_n_Page_092.txt
TXT93 b610b33a11eeb6e8ce24f5443597f813 2187
haldavnekar_n_Page_093.txt
TXT94 95fd6348968f0d61d7e8f11d5f423f2c 1988
haldavnekar_n_Page_094.txt
TXT95 59576404e7e4e67505e4d59923dd4bb1 1778
haldavnekar_n_Page_095.txt
TXT96 2dacc50ac4c63c023d6621dadb5158bd 1129
haldavnekar_n_Page_096.txt
TXT97 5716d43b5ee604a40b5351a0be408ca7 816
haldavnekar_n_Page_097.txt
TXT98 f363463b2c5afdf5a9c3810a79f405ee 426
haldavnekar_n_Page_098.txt
TXT99 ce60e7acdb78ee0e5c35a2c39049be67 1241
haldavnekar_n_Page_099.txt
TXT100 0660ac1d4f439c97a4baaa0a09cc79c7
haldavnekar_n_Page_100.txt
TXT101 492572054db9e1cb026b77a3302f5ff3 1110
haldavnekar_n_Page_101.txt
TXT102 afba2a219d74e1b37841d902f91ed57b 1078
haldavnekar_n_Page_102.txt
TXT103 62c02329b78ca0ba36740432340c3dc7 1378
haldavnekar_n_Page_103.txt
TXT104 eaebe1a0c41ab011425c0a0c977506f8 1076
haldavnekar_n_Page_104.txt
TXT105 985949a578cef6e2a0cdac811fefc7d8 1135
haldavnekar_n_Page_105.txt
TXT106 7625c327dd3886fa919bce3bfa489acb 1359
haldavnekar_n_Page_106.txt
TXT107 f76772e1cc9da5802d2aab40f5c1ba5b 1844
haldavnekar_n_Page_107.txt
TXT108 07a16ad1ac8ef1d6bac36f68c08dea5f 2032
haldavnekar_n_Page_108.txt
TXT109 e957b8c724ee720d305fdc39e1c2908c 288
haldavnekar_n_Page_109.txt
TXT110 942d2d445cbee54ec44f6ffe0246154a 2175
haldavnekar_n_Page_110.txt
TXT111 749714194b25ee969f49c66f1ec33ee5 2444
haldavnekar_n_Page_111.txt
TXT112 d48c0420e5c2fddf4126da51fb0bc881
haldavnekar_n_Page_112.txt
TXT113 bd5b2883fa3ceb12379c9f3eb400d8b7 2319
haldavnekar_n_Page_113.txt
TXT114 8a0679a82d0342d8e915ed45f88ff4f2 2233
haldavnekar_n_Page_114.txt
TXT115 3d76f0952ce2285508fa728595777acd 1908
haldavnekar_n_Page_115.txt
TXT116 e277aa5cf027a7cc160813a725307a66 917
haldavnekar_n_Page_116.txt
PDF1 applicationpdf ba140c58f7f4b98c4ef26789f7ec9535 1402182
haldavnekar_n.pdf
METS2 unknownx-mets 3cf16785233fd4c594e24d7de05e2dd6 126900
UFE0000541_00001.mets
METS:structMap STRUCT1 physical
METS:div DMDID ADMID ORDER 0 main
PDIV1 1 Main
PAGE1 Page i
METS:fptr FILEID
PAGE2 ii 2
PAGE3 iii 3
PAGE4 iv 4
PAGE5 v 5
PAGE6 vi 6
PAGE7 vii 7
PAGE8 viii 8
PAGE9 ix 9
PAGE10 x 10
PAGE11 xi 11
PAGE12 12
PAGE13 13
PAGE14 14
PAGE15 15
PAGE16 16
PAGE17 17
PAGE18 18
PAGE19 19
PAGE20 20
PAGE21 21
PAGE22 22
PAGE23 23
PAGE24 24
PAGE25 25
PAGE26 26
PAGE27 27
PAGE28 28
PAGE29 29
PAGE30 30
PAGE31 31
PAGE32 32
PAGE33 33
PAGE34 34
PAGE35 35
PAGE36 36
PAGE37 37
PAGE38 38
PAGE39 39
PAGE40 40
PAGE41 41
PAGE42 42
PAGE43 43
PAGE44 44
PAGE45 45
PAGE46 46
PAGE47 47
PAGE48 48
PAGE49 49
PAGE50 50
PAGE51 51
PAGE52 52
PAGE53 53
PAGE54 54
PAGE55 55
PAGE56 56
PAGE57 57
PAGE58 58
PAGE59 59
PAGE60 60
PAGE61 61
PAGE62 62
PAGE63 63
PAGE64 64
PAGE65 65
PAGE66 66
PAGE67 67
PAGE68 68
PAGE69 69
PAGE70 70
PAGE71 71
PAGE72 72
PAGE73 73
PAGE74 74
PAGE75 75
PAGE76 76
PAGE77 77
PAGE78 78
PAGE79 79
PAGE80 80
PAGE81 81
PAGE82 82
PAGE83 83
PAGE84 84
PAGE85 85
PAGE86 86
PAGE87 87
PAGE88 88
PAGE89 89
PAGE90
PAGE91 91
PAGE92 92
PAGE93
PAGE94 94
PAGE95 95
PAGE96 96
PAGE97 97
PAGE98 98
PAGE99 99
PAGE100 100
PAGE101 101
PAGE102 102
PAGE103 103
PAGE104 104
PAGE105 105
PAGE106 106
PAGE107 107
PAGE108 108
PAGE109 109
PAGE110 110
PAGE111 111
PAGE112 112
PAGE113 113
PAGE114 114
PAGE115 115
PAGE116 116
STRUCT2 other
ODIV1
FILES1
FILES2



PAGE 1

AN ALGORITHM AND IMPLEMENTATION FOR EXTRACTING SCHEMATIC AND SEMANTIC KNOWLEDGE FROM RELATIONAL DATABASE SYSTEMS By NIKHIL HALDAVNEKAR A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE UNIVERSITY OF FLORIDA 2002

PAGE 2

Copyright 2002 by Nikhil Haldavnekar

PAGE 3

To my parents, my sister and Seema

PAGE 4

ACKNOWLEDGMENTS

I would like to acknowledge the National Science Foundation for supporting this research under grant numbers CMS-0075407 and CMS-0122193. I express my sincere gratitude to my advisor, Dr. Joachim Hammer, for giving me the opportunity to work on this interesting topic. Without his continuous guidance and encouragement this thesis would not have been possible. I thank Dr. Mark S. Schmalz and Dr. R. Raymond Issa for being on my supervisory committee and for their invaluable suggestions throughout this project. I thank all my colleagues in SEEK, especially Sangeetha, Huanqing and Laura, who assisted me in this work. I wish to thank Sharon Grant for making the Database Center a great place to work.

There are a few people to whom I am grateful for multiple reasons: first, my parents, who have always striven to give their children the best in life, and my sister, who is always with me in any situation; next, my closest ever friends--Seema, Naren, Akhil Nandhini and Kaumudi--for being my family here in Gainesville, and Mandar, Rakesh and Suyog for so many unforgettable memories. Most importantly, I would like to thank God for always being there for me.

PAGE 5

TABLE OF CONTENTS

ACKNOWLEDGMENTS iv
LIST OF TABLES vii
LIST OF FIGURES viii
ABSTRACT x

CHAPTER

1 INTRODUCTION 1
1.1 Motivation 2
1.2 Solution Approaches 4
1.3 Challenges and Contributions 6
1.4 Organization of Thesis 7

2 RELATED RESEARCH 8
2.1 Database Reverse Engineering 9
2.2 Data Mining 16
2.3 Wrapper/Mediation Technology 17

3 THE SCHEMA EXTRACTION ALGORITHM 20
3.1 Introduction 20
3.2 Algorithm Design 23
3.3 Related Issue -- Semantic Analysis 34
3.4 Interaction 38
3.5 Knowledge Representation 41

4 IMPLEMENTATION 44
4.1 Implementation Details 44
4.2 Example Walkthrough of Prototype Functionality 54

PAGE 6

4.3 Configuration and User Intervention 61
4.4 Integration 62
4.5 Implementation Summary 63
4.5.1 Features 63
4.5.2 Advantages 63

5 EXPERIMENTAL EVALUATION 65
5.1 Experimental Setup 65
5.2 Experiments 66
5.2.1 Evaluation of the Schema Extraction Algorithm 66
5.2.2 Measuring the Complexity of a Database Schema 69
5.3 Conclusive Reasoning 70
5.3.1 Analysis of the Results 71
5.3.2 Enhancing Accuracy 73

6 CONCLUSION 76
6.1 Contributions 77
6.2 Limitations 78
6.2.1 Normal Form of the Input Database 78
6.2.2 Meanings and Names for the Discovered Structures 79
6.2.3 Adaptability to the Data Source 80
6.3 Future Work 80
6.3.1 Situational Knowledge Extraction 80
6.3.2 Improvements in the Algorithm 84
6.3.3 Schema extraction from Other Data Sources 85
6.3.4 Machine Learning 85

APPENDIX

A DTD DESCRIBING EXTRACTED KNOWLEDGE 86
B SNAPSHOTS OF RESULTS.XML 88
C SUBSET TEST FOR INCLUSION DEPENDENCY DETECTION 91
D EXAMPLES OF THE SITUATIONAL KNOWLEDGE EXTRACTION PROCESS 92

LIST OF REFERENCES 99
BIOGRAPHICAL SKETCH 105

PAGE 7

LIST OF TABLES

4-1 Example of the attribute classification from the MS-Project legacy source 57
5-1 Experimental results of schema extraction on 9 sample databases 67

PAGE 8

LIST OF FIGURES

2-1 The Concept of Database Reverse Engineering 9
3-1 The SEEK Architecture 21
3-2 The Schema Extraction Procedure 25
3-3 The Dictionary Extraction Process 26
3-4 Inclusion Dependency Mining 27
3-5 The Code Analysis Process 37
3-6 DRE Integrated Architecture 40
4-1 Schema Extraction Code Block Diagram 45
4-2 The class structure for a relation 47
4-3 The class structure for the inclusion dependencies 48
4-4 The class structure for an attribute 50
4-5 The class structure for a relationship 51
4-6 The information in different types of relationship instances 53
4-7 The screen snapshot describing the information about the relational schema 55
4-8 The screen snapshot describing the information about the entities 58
4-9 The screen snapshot describing the information about the relationships 59
4-10 E/R diagram representing the extracted schema 60
5-1 Results of experimental evaluation of the schema extraction algorithm: errors in detected inclusion dependencies (top), number of errors in extracted schema (bottom) 71

PAGE 9

B-1 The main structure of the XML document conforming to the DTD in Appendix A 88
B-2 The part of the XML document which lists business rules extracted from the code 88
B-3 The part of the XML document which lists business rules extracted from the code 89
B-4 The part of the XML document, which describes the semantically rich E/R schema 90
C-1 Two queries for the subset test 91

PAGE 10

Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Master of Science

AN ALGORITHM AND IMPLEMENTATION FOR EXTRACTING SCHEMATIC AND SEMANTIC KNOWLEDGE FROM RELATIONAL DATABASE SYSTEMS

By Nikhil Haldavnekar

December 2002

Chair: Dr. Joachim Hammer
Major Department: Computer and Information Science and Engineering

As the need for enterprises to participate in large business networks (e.g., supply chains) increases, the need to optimize these networks to ensure profitability becomes more urgent. However, due to the heterogeneities of the underlying legacy information systems, existing integration techniques fall short in enabling the automated sharing of data among the participating enterprises. Current techniques are manual and require significant programmatic set-up. This necessitates the development of more automated solutions to enable scalable extraction of the knowledge resident in the legacy systems of a business network to support efficient sharing. Given the fact that the majority of existing information systems are based on relational database technology, I have focused on the process of knowledge extraction from relational databases. In the future, the methodologies will be extended to cover other types of legacy information sources. Despite the fact that much effort has been invested in researching approaches to knowledge extraction from databases, no comprehensive solution has existed before this

PAGE 11

work. In our research, we have developed an automated approach for extracting schematic and semantic knowledge from relational databases. This methodology, which is based on existing data reverse engineering techniques, improves on the state of the art in several ways, most importantly by reducing the dependency on human input and removing several other limitations. The knowledge extracted from the legacy database contains information about the underlying relational schema as well as the corresponding semantics in order to recreate the semantically rich Entity-Relationship schema that was used to create the database initially. Once extracted, this knowledge enables schema mapping and wrapper generation. In addition, other applications of this extraction methodology are envisioned, for example, to enhance existing schemas or for documentation efforts. The use of this approach can also be foreseen in extracting metadata needed to create the Semantic Web.

In this thesis, an overview of our approach will be presented. Some empirical evidence of the usefulness and accuracy of this approach will also be provided using the prototype that has been developed and is running in a testbed in the Database Research Center at the University of Florida.

PAGE 12

CHAPTER 1
INTRODUCTION

In the current era of E-Commerce, the availability of products (for consumers or for businesses) on the Internet strengthens existing competitive forces for increased customization, shorter product lifecycles, and rapid delivery. These market forces impose a highly variable demand due to daily orders that can also be customized, with limited ability to smooth production because of the need for rapid delivery. This drives the need for production in a supply chain. Recent research has led to an increased understanding of the importance of coordination among subcontractors and suppliers in such supply chains [3, 37]. Hence, there is a role for decision or negotiation support tools to improve supply chain performance, particularly with regard to the users' ability to coordinate pre-planning and responses to changing conditions [47].

Deployment of these tools requires integration of data and knowledge across the supply chain. Due to the heterogeneity of legacy systems, current integration techniques are manual, requiring significant programmatic set-up with only limited reusability of code. The time and investment needed to establish connections to sources have acted as a significant barrier to the adoption of sophisticated decision support tools and, more generally, as a barrier to information integration. By enabling (semi-)automatic connection to legacy sources, the SEEK (Scalable Extraction of Enterprise Knowledge) project that is currently under way at the University of Florida is directed at overcoming the problems of integrating legacy data and knowledge in the (construction) supply chain [22-24].

PAGE 13

1.1 Motivation

A legacy source is defined as a complex stand-alone system with either poor or non-existent documentation about the data, code or the other components of the system. When a large number of firms are involved in a project, it is likely that there will be a high degree of physical and semantic heterogeneity in their legacy systems, making it difficult to connect firms' data and systems with enterprise-level decision support tools. Also, as each firm in the large production network is generally an autonomous entity, there are many problems in overcoming this heterogeneity and allowing efficient knowledge sharing among firms.

The first problem is the difference between the various internal data storage, retrieval and representation methods. Every firm uses its own format to store and represent data in the system. Some might use professional database management systems while others might use simple flat files. Also, some firms might use a standard query language such as SQL to retrieve or update data; others might prefer manual access, while some others might have their own query language. This physical heterogeneity imposes significant barriers to integrated access methods in co-operative systems. The effort to retrieve even similar information from every firm in the network is non-trivial, as this process involves an extensive study of the data stored in each firm. Thus there is little ability to understand and share other firms' data, leading to overall inefficiency.

The second problem is the semantic heterogeneity among the firms. Although a production network generally consists of firms working in a similar application domain, there is a significant difference in the internal terminology or vocabulary used by the firms. For example, different firms working in the construction supply chain might use different terms such as Activity, Task or Work-item to mean the same thing, i.e., a small

PAGE 14

but independent part of an overall construction project. The definition or meaning of the terms might be similar, but the actual names used are different. This heterogeneity is present at various levels in the legacy system, including the conceptual database schema, the graphical user interface, application code and business rules. This kind of diversity is often difficult to overcome.

Another difficulty in accessing a firm's data efficiently and accurately is safeguarding the data against loss and unauthorized usage. It is logical for a firm to restrict the sharing of strategic knowledge, including sensitive data or business rules. No firm will be willing to give full access to other firms in the network. It is therefore important to develop third-party tools that assure the privacy of the concerned firm and still extract useful knowledge.

Last but not least, the frequent need for human intervention in existing solutions is another major problem for efficient co-operation. Often, the extraction or conversion process is manual and involves little or no automation. This makes the process of knowledge extraction costly and inefficient. It is time consuming (if not impossible) for a firm to query all the firms that may be affected by some change in the network. Thus, it is necessary to build scalable mediator software using reusable components, which can be quickly configured through high-level specifications and will be based on a highly automated knowledge extraction process.

A solution to the problem of physical, schematic and semantic heterogeneity will be discussed in this thesis. The following section introduces various approaches that can be used to extract knowledge from legacy systems in general.

PAGE 15

1.2 Solution Approaches

The study of heterogeneous systems has been an active research area for the past decade. At the database level, schema integration approaches and the concept of federated databases [38] have been proposed to allow simultaneous access to different database systems. Wrapper technology [46] also plays an important role with the advent and popularity of co-operative autonomous systems. Various approaches to developing some kind of mediator system have been discussed [2, 20, 46]. Data mining [18] is another relevant research area; it proposes the use of a combination of machine learning, statistical analysis, modeling techniques and database technology to find patterns and subtle relationships in data and to infer rules that allow the prediction of future results. A lot of research is being done in the above areas, and it is pertinent to leverage the existing knowledge whenever necessary. However, the common input assumed by all of the above methods includes detailed knowledge about the internal database schema, obvious rules and constraints, and selected semantic information.

Industrial legacy database applications (LDAs) often evolve over several generations of developers, have hundreds of thousands of lines of associated application code, and maintain vast amounts of data. As mentioned previously, the documentation may have become obsolete and the original developers may have left the project. Also, the simplicity of the relational model does not support direct description of the underlying semantics, nor does it support inheritance, aggregation, n-ary relationships, or time dependencies including design modification history. However, relevant information about concepts and their meaning is distributed throughout an LDA. It is therefore important to use reverse engineering techniques to recover the conceptual structure of the LDA to


gain semantic knowledge about the internal data. The term Data Reverse Engineering (DRE) refers to the use of structured techniques to reconstitute the data assets of an existing system [1, p. 4]. Since the role of the SEEK system is to act as an intermediary between the legacy data and the decision support tool, it is crucial to develop methodologies and algorithms that facilitate discovery and extraction of knowledge from legacy sources. In general, SEEK operates as a three-step process [23]:
1. SEEK generates a detailed description of the legacy source, including entities, relationships, application-specific meanings of the entities and relationships, business rules, data formatting and reporting constraints, etc. We collectively refer to this information as enterprise knowledge.
2. The semantically enhanced legacy source schema must be mapped onto the domain model (DM) used by the application(s) that want(s) to access the legacy source. This is done using a schema mapping process that produces the mapping rules between the legacy source schema and the application domain model.
3. The extracted legacy schema and the mapping rules provide the input to the wrapper generator, which produces the source wrapper.
This thesis focuses mainly on the process described in step 1 above. It also discusses the issue of knowledge representation, which is important in the context of the schema mapping process of step 2. In SEEK, knowledge extraction in general, and data reverse engineering in particular, serve two important objectives. First, the high-level semantic information (e.g., entities, associations, constraints) extracted or inferred from the legacy source can be used as input to the schema mapping process; this knowledge also helps in verifying the domain ontology. Second, the source-specific information (e.g., relations, primary keys, datatypes) can be used to convert wrapper queries into actual source queries.


1.3 Challenges and Contributions Formally, data reverse engineering is defined as the application of analytical techniques to one or more legacy data sources to elicit structural information (e.g., term definitions, schema definitions) from the legacy source(s) in order to improve the database design or to produce missing schema documentation [1]. There are numerous challenges in extracting the conceptual structure from a database application with respect to the objectives of SEEK, including the following:
- Due to the limited ability of the relational model to express semantics, many details of the initial conceptual design are lost when it is converted to a relational schema. Moreover, the remaining knowledge is often spread throughout the database system. Thus, the input to the reverse engineering process is neither simple nor fixed.
- The legacy database belonging to the firm typically cannot be changed to suit the requirements of our extraction approach, and hence the algorithm must impose minimal restrictions on the input source.
- Human intervention in the form of user input or domain expert comments is typically necessary; as Chiang et al. [9, 10] point out, the reverse engineering process cannot be fully automated. However, heavy reliance on such intervention is inefficient and not scalable, so we attempt to reduce human input as much as possible.
- Due to maintenance activity, essential components of the underlying databases are often modified or deleted, making it difficult to infer the conceptual structure. The DRE algorithm needs to minimize this ambiguity by analyzing other sources.
- Traditionally, reverse engineering approaches concentrate on one specific component of the legacy system as the source. Some methods extensively study the application code [55] while others concentrate on the data dictionary [9]. The challenge is to develop an algorithm that investigates every component (such as the data dictionary, data instances and application code), extracting as much information as possible.
- Once developed, the DRE approach should be general enough to work with different relational databases with only minimal parameter configuration.
The most important contribution of this thesis is the detailed discussion and comparison of the various database reverse engineering approaches, logically followed


by the design of our Schema Extraction (SE) algorithm. The design tries to meet the majority of the challenges discussed above. Another contribution is the implementation of the SE prototype, including its experimental evaluation and a feasibility study. Finally, this thesis also discusses suitable representations for the extracted enterprise knowledge and possible future enhancements. 1.4 Organization of Thesis The remainder of this thesis is organized as follows. Chapter 2 presents an overview of related research in the field of knowledge discovery in general and database reverse engineering in particular. Chapter 3 describes the SEEK-DRE architecture and our approach to schema extraction, and gives the overall design of our algorithm. Chapter 4 is dedicated to the implementation details, including some screen snapshots. Chapter 5 describes the experimental prototype and results. Finally, Chapter 6 concludes the thesis with a summary of our accomplishments and issues to be considered in the future.


CHAPTER 2 RELATED RESEARCH Problems such as Y2K and the European currency conversion have shown how little we understand the data in our computer systems. In our world of rapidly changing technology, there is a need to plan business strategies very early and with much information and anticipation, and the basic requirement for strategic planning is the data in the system. Many organizations in the past have been successful at leveraging their data. For example, the frequent flier program from American Airlines and the Friends & Family program from MCI were trendsetters in their fields and could only be realized because the parent organizations knew where the data was and how to extract information from it. The process of extracting data and knowledge from a system logically precedes the process of understanding it. As discussed in the previous chapter, this collection or extraction process is non-trivial and requires manual intervention. Generally, the data is present at more than one location in the system and has lost much of its semantics, so the important task is to recover those semantics, which provide vital information about the system and allow mapping between the system and the general domain model. The problem of extracting knowledge from a system and using it to overcome the heterogeneity between systems is an important one. Major research areas that try to answer this problem include database reverse engineering, data mining, wrapper generation and data modeling. The following sections summarize the state of the art in each of these fields.


2.1 Database Reverse Engineering Generally, all the project knowledge in the firm or the legacy source trickles down to the database level, where the actual data resides. Hence, the main goal is to be able to mine schema information from these database files. Specifically, the field of Database Reverse Engineering (DRE) deals with the problem of comprehending existing database systems and recovering the semantics embodied within them [10]. The concept of database reverse engineering is shown in Figure 2-1. The original design or schema undergoes a series of semantic reductions while being converted into the relational model. We have already discussed the limited ability of the relational model to express semantics, and when regular maintenance activity is also considered, part of the important semantic information is generally lost. The goal is to recover that knowledge and validate it with domain experts in order to reconstruct a high-level model. Figure 2-1 The Concept of Database Reverse Engineering.


The DRE literature is divided into three areas: translation algorithms and methodologies, tools, and application-specific experiences. Translation algorithm development in early DRE efforts involved manual rearrangement or reformatting of data fields, which is inefficient and error-prone [12]. The relational data model provided theoretical support for research into automated discovery of relational dependencies [8]. In the early 1980s, focus shifted to recovering E/R diagrams from relations [40]. Given early successes with translation using the relational data model, DRE translation was applied to flat-file databases [8, 13] in domains such as enterprise schemas [36]. Because the E/R model was already established as a conceptual tool, reengineering of legacy RDBMS to yield E/R models motivated DRE in the late 1980s [14]. Information content analysis was also applied to RDBMS, allowing more effective gathering of high-level information from data [5]. DRE in the 1990s was enhanced by cross-fertilization with software engineering. Chikofsky [11] presented a taxonomy for reverse engineering that included DRE methodologies and also highlighted the available DRE tools. DRE formalisms were better defined, motivating increased DRE interaction with users [21]. The relational data model continued to support extraction of E/R models and schemas from RDBMS [39]. Application focus emphasized legacy systems, including DoD applications [44]. In the late 1990s, object-oriented DRE research addressed the discovery of objects in legacy systems using function-, data-, and object-driven objectification [59]. Applications of DRE increased, particularly in Y2K bug identification and remediation. Recent DRE work is more applicative, e.g., mining large data repositories [15], analysis of legacy systems [31] or network databases [43] and extraction of business rules from


legacy systems [54]. Current research focuses on developing more powerful DRE tools, refining heuristics to yield fewer missing constructs, and developing techniques for reengineering legacy systems into distributed applications. Though a large body of researchers agrees that database reverse engineering is useful for leveraging data assets, reducing maintenance costs, facilitating technology transition and increasing system reliability, the problem of choosing a method for the reverse engineering of a relational database is not trivial [33]. The input required by these reverse engineering methods is one implementation issue. Database designers, even experts, occasionally violate the rules of sound database design, and in some cases it is impossible to produce an accurate model because one never existed. Also, different methods have different input requirements, and each legacy system has particular characteristics that restrict the availability of information. A wide range of database reverse engineering methods is known, each exhibiting its own methodological characteristics, producing its own outputs and requiring specific inputs and assumptions. We now present an overview of the major approaches, each of which is described in terms of input requirements, methodology, output, major advantages and limitations. Although this overview is not completely exhaustive, it discusses the advantages and limitations of current approaches and provides a solid base for defining the exact objectives of our DRE algorithm. Chiang et al. [9, 10] suggest an approach that requires the data dictionary as an input: it needs all the relation names, attribute names, keys and data instances. The main assumptions include consistent naming of attributes, no errors in the values of key attributes and a 3NF format for the source schema. The first requirement is especially


strict, as many of the current database systems do not maintain consistent naming of attributes. In this method, relations are first classified based upon the properties of their primary keys, i.e., the keys are compared with the keys of other relations. Then, the attributes are classified depending on whether they belong to a relation's primary key, to a foreign key, or to neither. After this classification, all possible inclusion dependencies are identified by heuristic rules, and entities and relationship types are then identified based on these dependencies. The main advantage of this method is a clear algorithm with a proper justification for each step; all stages requiring human input are stated clearly. However, the stringent requirements imposed on the input source, the high degree of user intervention and the dismissal of the application code as an important source are the drawbacks of this method. Our SE algorithm, discussed in the next chapter, imposes less stringent requirements on the input source and also analyzes the application code for vital clues and semantics. Johansson [34] suggests a method to transform relational schemas into conceptual schemas using the data dictionary and dependency information. The relational schema is assumed to be in 3NF, and information about all inclusion and functional dependencies is required as input. The method first splits any relation that corresponds to more than one object and then adds extra relations to handle the occurrences of certain types of inclusion dependencies. Finally, it collapses the relations that correspond to the same object type and maps them into one conceptual entity. The main advantage of this method is its detailed explanation of schema mapping procedures. It also introduces the concept of hidden objects, which is further utilized in Petit's method [51]. But this method requires all the keys and all the dependencies and thus is


not realistic, as it is difficult to provide this information at the start of the reverse engineering process. Markowitz et al. [39] present a similar approach to identify extended entity-relationship object structures in relational schemas. This method takes the data dictionary, the functional dependencies and the inclusion dependencies as inputs and transforms the relational schema into a form suitable for identifying the EER object structures. If the dependencies satisfy all the rules, then object interaction is determined for each inclusion dependency. Though this method presents a formalization of schema mapping concepts, it is very demanding on the user input, as it requires all the keys and dependencies. The important insight obtained from the above methods is their use of inclusion dependencies: both methods treat the presence of an inclusion dependency as a strong indication of the existence of a relationship between entities. Our algorithm uses this important concept but does not place the burden of specifying all inclusion dependencies on the user. S. Navathe et al. [45] and Blaha et al. [52] stress the importance of user intervention. Both methods assume that the user has more than sufficient knowledge about the database, and very little automation is used to provide clues to the user. Navathe's method [45] requires the data schema and all the candidate keys as inputs, assumes coherent attribute names and the absence of ambiguities in foreign keys, and requires 3NF and BCNF normal forms. Relations are processed and classified with human intervention, and the classified relations are then mapped based on their classifications and key attributes. Special cases of non-classified relations are handled on a case-by-case basis. The drawbacks of this method include very high user intervention


and strong assumptions. By comparison, Blaha's method [52] is less stringent in its input requirements, as it only needs the data dictionary and data sets. However, its output is an OMT (Object Modeling Technique) model, which is less relevant to our objective. This method also involves a high degree of user intervention to determine candidate keys and foreign key groups: the user, following guidelines that include querying the data, progressively refines the OMT schema. Though the method draws heavily on domain knowledge and can be used in tricky or sensitive situations (where constant guidance is crucial for the success of the process), the amount of user participation makes it difficult to use in a general-purpose toolkit. Another interesting approach is taken by Signore et al. [55]. The method searches for predefined code patterns to infer semantics. The idea of treating the application code as a vital source of clues and semantics is interesting to our effort. This approach depends heavily on the quality of the application code, as all the important concepts such as primary keys, foreign keys, and generalization hierarchies are finalized by the patterns found in the code. This suggests that it is more beneficial to use this method along with another reverse engineering method to verify the outcome; our SE algorithm discussed in the next chapter attempts to do exactly this. Finally, J. M. Petit et al. [51] suggest an approach that does not impose any restrictions on the input database. The method first finds inclusion dependencies from the equi-join queries in the application code and then discovers functional dependencies from the inclusion dependencies. The restruct algorithm is then used to convert the existing schema to 3NF using the set of dependencies and the hidden objects. Finally, the algorithm of Markowitz et al. [39] is used to convert the 3NF logical schema obtained in


the last phase into an EER model. This paper presents a very sound and detailed algorithm that is supported by mathematical theory. The concept of using the equi-join queries in the application code to find inclusion dependencies is innovative and useful. However, the main objective of this method is to improve the underlying de-normalized schema, which is not relevant to the knowledge extraction process. Furthermore, the two main drawbacks of this method are the lack of justification for some steps and the absence of a discussion of the practical implementation of the approach. Relational database systems are typically designed using a consistent strategy, but in general the mapping between the schemas and the conceptual model is not strictly one-to-one. This means that, while reverse engineering a database, an alternate interpretation of the structure and the data can yield different components [52]. Although multiple interpretations can in this manner yield plausible results, we have to minimize such unpredictability using the available resources. Every relational database employs a similar underlying model for organizing and querying the data, but existing systems differ in the availability of information and the reliability of that information. Therefore, it is fair to conclude that no single method can fulfill the entire range of requirements of relational database reverse engineering. The methods discussed above differ greatly in their approaches, input requirements and assumptions, and there is no clear preference. In practice, one must choose a combination of approaches to suit the database. Since all the methods have well-defined steps, each making a clear contribution to the overall conceptual schema, in most cases it is advisable to combine steps of different methods according to the information available [33].


In the SEEK toolkit, the effort required to generate a wrapper for a new source should be minimized, since it is not feasible to exhaustively explore different methods for different firms in the supply chain. The developed approach must be general, with a limited amount of source dependence. Support modules can be added for particular sources to use redundant information to increase confidence in the results. 2.2 Data Mining Considerable interest and work in the areas of data mining and knowledge discovery in databases (KDD) have led to several approaches, techniques and tools for the extraction of useful information from large data repositories. The explosive growth of business, government and scientific database systems in the last decade created the need for a new generation of technology to collect, extract, analyze and generate data. The term knowledge discovery in databases was coined in 1989 to refer to the broad process of finding knowledge in data and to emphasize the high-level application of particular data mining methods [18]. Data mining is defined as an information extraction activity whose goal is to discover hidden facts contained in databases [18]. The basic view adopted by the research community is that data mining refers to a class of methods that are used in some of the steps comprising the overall KDD process. The data mining and KDD literature is broadly divided into three subareas: finding patterns, rules and trends in the data; statistical data analysis; and integrated tools and applications. The early 1990s saw tremendous research on data analysis [18]. This research included human-centered approaches to mining the data [6], semi-automated discovery of informative patterns, discovery of association rules [64], finding clusters in the data, extraction of generalized


rules [35], etc. Many efforts then concentrated on developing integrated tools such as DBMINER [27], Darwin [48] and STORM [17]. Recently, the focus has shifted towards application-specific algorithms; typical application domains include healthcare and genetics, weather and astronomical surveys, and financial systems [18]. Researchers have argued that developing data mining algorithms or tools alone is insufficient for pragmatic problems [16]; issues such as adequate computing support, strong interoperability and compatibility of the tools and, above all, the quality of the data are crucial. 2.3 Wrapper/Mediation Technology SEEK follows established mediation/wrapper methodologies such as TSIMMIS [26] and InfoSleuth [4] and provides a middleware layer that bridges the gap between legacy information sources and decision makers/decision support applications. Generally, the wrapper [49] accepts queries expressed in the legacy source language and schema and converts them into queries or requests understood by the source. One can identify several important commonalities among wrappers for different data sources, which make wrapper development more efficient and allow the data management architecture to be modular and highly scalable. These are important prerequisites for supporting numerous legacy sources, many of which have parameters or structure that may initially be unknown. Thus, the wrapper development process must be partially guided by human expertise, especially for non-relational legacy sources. A naive approach involves hard-coding wrappers to effect a pre-wired configuration, thus optimizing the code of these modules with respect to the specifics of the underlying source. However, this yields inefficient development with poor extensibility and maintainability. Instead, a toolkit such as Stanford University's TSIMMIS Wrapper


Development Toolkit [26], based on translation templates written in a high-level specification language, is extremely relevant and useful. The TSIMMIS toolkit has been used to develop value-added wrappers for sources such as DBMSs, online libraries, and the Web [25, 26]. Existing wrapper development technologies exploit the fact that wrappers share a basic set of source-independent functions that are provided by their toolkits. For example, in TSIMMIS, all wrappers share a parser for incoming queries, a query processor for post-processing of results, and a component for composing the result. Source-specific information is expressed as templates written in a high-level specification language; templates are parameterized queries together with their translations, including a specification of the format of the result. Thus, the TSIMMIS researchers have isolated the only component of the wrapper that requires human development assistance, namely the connection between the wrapper and the source, which is highly specialized and yet requires relatively little coding effort. In addition to the TSIMMIS-based wrapper development, numerous other projects have been investigating tools for wrapper generation and content extraction, including researchers at the University of Maryland [20], USC/ISI [2], and the University of Pennsylvania [53]. Also, the artificial intelligence [58], machine learning, and natural language processing communities [7] have developed methodologies that can be applied in wrapper development toolkits to infer and learn structural information from legacy sources. This chapter has discussed the evolution of research in the fields related to knowledge extraction. The data stored in a typical organization is usually raw and needs considerable preprocessing before it can be mined or understood. Thus, data mining or KDD somewhat


logically follows reverse engineering, which extracts preliminary but very important aspects of the data. Many data mining methods [27, 28] require knowledge of the schema, and hence reverse engineering methods are definitely useful. Also, the vast majority of wrapper technologies depend on information about the source to perform translation or conversion. The next chapter describes and discusses our database reverse engineering algorithm, which is the main topic of this thesis.


CHAPTER 3 THE SCHEMA EXTRACTION ALGORITHM 3.1 Introduction A conceptual overview of the SEEK knowledge extraction architecture is shown in Figure 3-1 [22]. SEEK applies Data Reverse Engineering (DRE) and Schema Matching (SM) processes to legacy database(s) to produce a source wrapper for a legacy source. This source wrapper will be used by another component (not shown in Figure 3-1) to communicate and exchange information with the legacy source. It is assumed that the legacy source uses a database management system for storing and managing its enterprise data or knowledge. First, SEEK generates a detailed description of the legacy source by extracting enterprise knowledge from it. The extracted enterprise knowledge forms a knowledge base that serves as the input for subsequent steps. In particular, the DRE module shown in Figure 3-1 connects to the underlying DBMS to extract schema information (most data sources support at least some form of Call-Level Interface such as JDBC). The schema information from the database is semantically enhanced using clues extracted by the semantic analyzer from available application code, business reports, and, in the future, perhaps other electronically available information that could encode business data such as e-mail correspondence, corporate memos, etc. Our experience, gained through visits with representatives from the construction and manufacturing domains, is that such application code exists and can be made available electronically [23].
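To make the call-level access concrete, the following is a minimal sketch of a configurable database interface, assuming a JDBC driver for the legacy DBMS is available on the classpath. The class name, connection URL and credentials are illustrative assumptions, not details of the actual SEEK implementation.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class DatabaseInterfaceModule {
    private final String url;      // e.g., "jdbc:oracle:thin:@legacyhost:1521:proj" (hypothetical)
    private final String user;
    private final String password;

    public DatabaseInterfaceModule(String url, String user, String password) {
        this.url = url;
        this.user = user;
        this.password = password;
    }

    /** Opens a connection to the legacy DBMS; later extraction steps reuse it. */
    public Connection connect() throws SQLException {
        return DriverManager.getConnection(url, user, password);
    }
}

Because only the connection parameters (and, if necessary, the driver) change from source to source, a component of this kind can remain the single source-specific piece of the architecture, which is the design choice described for SEEK in Section 3.4.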


Figure 3-1 The SEEK Architecture. Second, the semantically enhanced legacy source schema must be mapped into the domain model (DM) used by the application(s) that want(s) to access the legacy source. This is done using a schema mapping process that produces the mapping rules between the legacy source schema and the application domain model. In addition to the domain model, the schema matcher also needs access to the domain ontology (DO) that describes the domain model. Finally, the extracted legacy schema and the mapping rules provide the input to the wrapper generator (not shown), which produces the source wrapper. The three preceding steps can be formalized as follows [23]. At a high level, let a legacy source L be denoted by the tuple L = (DB_L, S_L, D_L, Q_L), where DB_L denotes the legacy database, S_L its schema, D_L the data, and Q_L a set of queries that can be answered by DB_L. Note that the legacy database need not be a relational database, but can


include text, flat-file databases, or hierarchically formatted information. S_L is expressed in the data model DM_L. We also define an application via the tuple A = (S_A, Q_A, D_A), where S_A denotes the schema used by the application and Q_A denotes a collection of queries written against that schema. The symbol D_A denotes data that is expressed in the context of the application. We assume that the application schema is described by a domain model and its corresponding ontology (as shown in Figure 3-1). For simplicity, we further assume that the application query format is specific to a given application domain but invariant across legacy sources for that domain. Let a legacy source wrapper W be comprised of a query transformation (Equation 1) and a data transformation (Equation 2):

f_W^Q : Q_A → Q_L    (1)
f_W^D : D_L → D_A    (2)

where the Q's and D's are constrained by the corresponding schemas. The SEEK knowledge extraction process shown in Figure 3-1 can now be stated as follows. Given S_A and Q_A for an application that attempts to access legacy database DB_L, whose schema S_L is unknown, and assuming that we have access to the legacy database DB_L as well as to application code C_L that accesses DB_L, we first infer S_L by analyzing DB_L and C_L, and then use S_L to infer a set of mapping rules M between S_L and S_A, which are used by a wrapper generator WGen to produce (f_W^Q, f_W^D). In short:

DRE: (DB_L, C_L) → S_L    (3-1)
SM: (S_L, S_A) → M    (3-2)
WGen: (Q_A, M) → (f_W^Q, f_W^D)    (3-3)


Thus, the DRE algorithm (Equation 3-1) is comprised of schema extraction (SE) and semantic analysis (SA). This thesis concentrates on the schema extraction process, which extracts the schema S_L by accessing DB_L. The semantic analysis process supports schema extraction by providing vital clues for inferring S_L from C_L and is also crucial to the DRE algorithm; however, its implementation and experimental evaluation are being carried out by my colleague in SEEK and hence are not dealt with in detail in this thesis. The following section focuses on the schema extraction algorithm. It also provides a brief description of the semantic analysis and code slicing research efforts, which are also being undertaken in SEEK, and presents issues regarding the integration of schema extraction and semantic analysis. Finally, the chapter concludes with a summary of the DRE algorithm. 3.2 Algorithm Design Data reverse engineering is defined as the application of analytical techniques to one or more legacy data sources (DB_L) to elicit structural information (e.g., term definitions, schema definitions) from the legacy source(s), in order to improve the database design or produce missing schema documentation. Thus far in SEEK, we apply DRE to relational databases only. However, since the relational model has only a limited ability to express semantics, our DRE algorithm generates, in addition to the schema, an E/R-like representation of the entities and relationships that are not explicitly defined (but which exist implicitly) in the legacy schema S_L. More formally, DRE can be described as follows. Given a legacy database DB_L defined as ({R_1, R_2, ..., R_n}, D), where R_i denotes the schema of the i-th relation with


attributes A_1, A_2, ..., A_m(i), keys K_1, K_2, ..., K_m(i), and data D = {r_1(R_1), r_2(R_2), ..., r_n(R_n)}, such that r_i(R_i) denotes the data (extent) for schema R_i at time t. Furthermore, DB_L has functional dependencies F = {F_1, F_2, ..., F_k(i)} and inclusion dependencies I = {I_1, I_2, ..., I_l(i)} expressing relationships among the relations in DB_L. The goal of DRE is to first extract {R_1, R_2, ..., R_n}, I, and F, and then use I, F, D, and C_L to produce a semantically enhanced description of {R_1, R_2, ..., R_n} that includes all relationships among the relations in DB_L (both explicit and implicit), semantic descriptions of the relations, as well as business knowledge that is encoded in DB_L and C_L. Our approach to data reverse engineering for relational sources is based on existing algorithms by Chiang et al. [9, 10] and Petit et al. [51]. However, we have improved these methodologies in several ways, most importantly to reduce the dependency on human input and to eliminate several limitations of their algorithms (e.g., the assumptions of consistent naming of key attributes and of a legacy schema in 3NF). More details about these contributions can be found in Chapter 6. Our DRE algorithm is divided into two parts, schema extraction and semantic analysis, which operate in interleaved fashion. An overview of the standalone schema extraction algorithm, which is comprised of six steps, is shown in Figure 3-2. In addition to the modules that execute each of the six steps, the architecture in Figure 3-2 includes three support components: the configurable Database Interface Module (upper left-hand corner) provides connectivity to the underlying legacy source, and the Knowledge Encoder (lower right-hand corner) represents the extracted knowledge in the form of an XML document so that it can be shared with other components in the SEEK architecture


(e.g., the semantic matcher). More details about these components can be found in Section 3.4. Figure 3-2 The Schema Extraction Procedure. We now describe each step of our six-step schema extraction algorithm in detail. Step 1: Extracting Schema Information using the Dictionary Extractor


The goal of Step 1 is to obtain the relation and attribute names from the legacy source. This is done by querying the data dictionary, which is stored in the underlying database in the form of one or more system tables. The details of this step are outlined in Figure 3-3. Figure 3-3 The Dictionary Extraction Process. In order to determine key attributes, the algorithm proceeds as follows. For each relation R_i, it first attempts to extract the primary key from the dictionary. If no such information is explicitly stored, the algorithm identifies the set of candidate key attributes, i.e., those whose values are restricted through NOT NULL and UNIQUE constraints. If there is only one candidate key per entity, then that key is the primary key. Otherwise, if primary key information cannot be retrieved directly from the data dictionary, the algorithm passes the set of candidate keys along with predefined rule-out patterns to the semantic analyzer. The semantic analyzer operates on the AST of the application code to rule out certain attributes as primary keys. For a more detailed explanation and examples of rule-out patterns, the reader is referred to Section 3.4.
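The kind of dictionary query performed in this step can be sketched using JDBC's standard DatabaseMetaData calls. This is an illustration only, not the SEEK dictionary extractor itself; the example simply prints relation names, attributes, nullability and any explicitly stored primary keys.

import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.ResultSet;
import java.sql.SQLException;

public class DictionaryExtractorSketch {

    /** Lists every table, its columns and (if recorded) its primary key. */
    public static void extract(Connection conn) throws SQLException {
        DatabaseMetaData md = conn.getMetaData();
        try (ResultSet tables = md.getTables(null, null, "%", new String[] {"TABLE"})) {
            while (tables.next()) {
                String table = tables.getString("TABLE_NAME");
                System.out.println("Relation: " + table);

                // Attribute names, data types and NOT NULL information.
                try (ResultSet cols = md.getColumns(null, null, table, "%")) {
                    while (cols.next()) {
                        System.out.println("  attribute " + cols.getString("COLUMN_NAME")
                                + " type=" + cols.getString("TYPE_NAME")
                                + " nullable=" + cols.getString("IS_NULLABLE"));
                    }
                }

                // Primary key columns, if the dictionary stores them explicitly.
                try (ResultSet pk = md.getPrimaryKeys(null, null, table)) {
                    while (pk.next()) {
                        System.out.println("  primary key column: " + pk.getString("COLUMN_NAME"));
                    }
                }
            }
        }
    }
}

When getPrimaryKeys returns no rows for a table, the candidate-key fallback and the rule-out patterns described above come into play.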


Step 2: Discovering Inclusion Dependencies
After extraction of the relational schema in Step 1, the schema extraction algorithm identifies constraints that help classify the extracted relations, which represent both the real-world entities and the relationships among these entities. This is done using inclusion dependencies (INDs), which indicate the existence of inter-relational constraints, including class/subclass relationships. Figure 3-4 Inclusion Dependency Mining. Let A and B be two relations, and let X and Y be an attribute or a set of attributes of A and B, respectively. An inclusion dependency I_i = A.X << B.Y denotes that the set of values appearing in A.X is a subset of the values in B.Y. Inclusion dependencies are discovered by examining all possible subset relationships between any two relations A and B in the legacy source. As depicted in Figure 3-4, the inclusion dependency detection module obtains its input from two sources: one is the dictionary extractor (via the send/receive module), which provides the table names, column names, primary keys and foreign keys (if available); the other is the equi-join query finder, which is a part of the code analyzer.


This module operates on the AST and provides the pairs of relations, and their corresponding attributes, that occur together in equi-join queries in the application code. The fact that two relations are used in a join operation is evidence of the existence of an inclusion dependency between them. The inclusion dependency detection algorithm works as follows:
1. Create a set X of all possible pairs of relations from the set R = {R_1, R_2, ..., R_n}; e.g., if we have relations P, Q, R, S then X = {(P,Q), (P,R), (P,S), (Q,R), (Q,S), (R,S)}. Intuitively, this set contains the pairs of relations for which inclusion dependencies have not yet been determined. In addition, we maintain two (initially empty) sets of possible (POSSIBLE) and final (FINAL) inclusion dependencies.
2. If foreign keys have been successfully extracted, do the following for each foreign key constraint:
a) Identify the pair of participating relations, i.e., the relation to which the FK belongs and the relation to which it refers.
b) Eliminate the identified pair from set X.
c) Add the inclusion dependency involving this FK to the set FINAL.
3. If equi-join queries have been extracted from the code, do the following for each equi-join query:
a) Identify the pair of participating relations.
b) Check the direction of the resulting inclusion dependency by querying the data. To check the direction, we use the subset test described in Appendix B (a sketch of one possible formulation is given below).
c) If the above test is conclusive, eliminate the identified pair from set X and add the resulting inclusion dependency to the set FINAL.
d) If the test in step b) is inconclusive (i.e., the direction cannot be finalized), add both candidate inclusion dependencies to the set POSSIBLE.
4. For each pair p remaining in X, identify attributes or attribute combinations that have the same data type. Check whether a subset relationship exists by using the subset test described in Appendix B. If so, add the inclusion dependency to the set POSSIBLE. If, at the end of Step 4, no inclusion dependency has been added to the POSSIBLE set for p, delete p from X; otherwise, leave p in X for user verification.


5. For each inclusion dependency in the set POSSIBLE, do the following:
a) If the attribute names on both sides are equal, assign the rating High.
b) If the attribute name on the left side of the inclusion dependency is related (based on common substrings) to the table name on the right-hand side, assign the rating High.
c) If neither condition is satisfied, assign the rating Low.
6. For each pair in X, present the inclusion dependencies and their ratings in the set POSSIBLE to the user for final determination. Based on the user input, append the valid inclusion dependencies to the set FINAL.
The worst-case complexity of this exhaustive search, given N tables and M attributes per table (NM total attributes), is O(N^2 M^2). However, we reduce the search space in those cases where we can identify equi-join queries in the application code, which allows us to limit the exhaustive search to only those relations not mentioned in the extracted queries. As a result, the average-case complexity of the inclusion dependency finder is much smaller. For example, the detection of one foreign key constraint in the data dictionary or one equi-join query in the application code allows the algorithm to eliminate the corresponding relation pair(s) from the search space. Hence, if K foreign key constraints and L equi-join queries (involving pairs different from the pairs involved in foreign key constraints) are detected, the average complexity is O((N^2 - K - L) M^2). In the best case, when K + L covers all possible pairs of relations, inclusion dependency detection can be performed in constant time, O(1). Additionally, factors such as matching datatypes and matching maximum attribute lengths (e.g., varchar(5)) are used to reduce the number of queries made against the database in Step 4 to check subset relationships between attributes: if the attributes in a pair of relations have T mutually different datatypes, then the M^2 term reduces to M(M - T).
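The subset test referenced in Steps 3 and 4 is defined in Appendix B; the sketch below shows one plausible formulation of it, under the assumption that a single counterexample query per candidate direction is acceptable. The table and attribute names come from the data dictionary, so no user-supplied input is concatenated into the SQL.

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class SubsetTestSketch {

    /**
     * Returns true if every non-null value of a.x also appears in b.y,
     * i.e., the data does not contradict the inclusion dependency a.x << b.y.
     */
    public static boolean subsetHolds(Connection conn, String a, String x,
                                      String b, String y) throws SQLException {
        String sql = "SELECT COUNT(*) FROM " + a +
                     " WHERE " + x + " IS NOT NULL AND NOT EXISTS (" +
                     "SELECT 1 FROM " + b + " WHERE " + b + "." + y + " = " + a + "." + x + ")";
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            rs.next();
            return rs.getLong(1) == 0;   // no counterexamples found
        }
    }
}

Running the test in both directions distinguishes A.X << B.Y from B.Y << A.X; if both directions pass, or neither does, the candidate dependencies go into the POSSIBLE set for user verification, as in Step 3 of the algorithm.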


Finally, it is important to note that DRE is always considered a build-time activity, and hence performance complexity is not a crucial issue. Step 3: Classification of the Relations When reverse engineering a relational schema, it is important to understand that, due to the limited ability of the relational model to express semantics, all real-world entities are represented as relations irrespective of their types and roles in the model. The goal of this step is to identify the different types of relations; some of these correspond to actual real-world entities while others represent relationships among the entities. Identifying the different relations is done using the primary key information obtained in Step 1 and the inclusion dependencies obtained in Step 2. Specifically, if consistent naming is used, the primary key of each relation is compared with the primary keys of other relations to identify strong or weak entity-relations and specific or regular relationship-relations; otherwise, we have to rely on inclusion dependencies for vital clues. Intuitively, a strong entity-relation represents a real-world entity whose members can be identified exclusively through its own properties. A weak entity-relation, on the other hand, represents an entity that has no properties of its own that can be used to uniquely identify its members. In the relational model, the primary keys of weak entity-relations usually contain primary key attributes from other (strong) entity-relations. Intuitively, both regular and specific relations represent relationships between two entities in the real world rather than the entities themselves. However, there are instances when not all of the entities participating in an n-ary relationship are present in the


database schema (e.g., one or more of the relations were deleted as part of the normal database schema evolution process). While reverse engineering the database, we identify such relationships as specific relations. Specifically, the primary key of a specific relation is only partially formed by the primary keys of the participating (strong or weak) entity-relations, whereas the primary key of a regular relation is made up entirely of the primary keys of the participating entity-relations. More formally, Chiang et al. [10] define the four relation types as follows:
1. A strong entity relation is a relation whose primary key (PK) does not properly contain a key attribute of any other relation.
2. A weak entity relation is a relation R which satisfies the following three conditions: (a) a proper subset of R's PK contains key attributes of other strong or weak entity relations; (b) the remaining attributes of R's PK do not contain key attributes of any other relation; and (c) R has an identifying owner and properly contains the PK of its owner relation. User input is required to confirm these relationships.
3. A regular relation has a PK that is formed by concatenating the PKs of other (strong or weak) entity relations.
4. A specific relation is a relation R which satisfies the following two conditions: (a) a proper subset of R's PK contains key attributes of other strong or weak entity relations; and (b) the remaining attributes of R's PK do not contain key attributes of any other relation.
Classification of relations proceeds as follows. Initially, strong and weak entity-relations are classified. For weak entity-relations, the primary key must be composite, and part of it must be the primary key of an already identified strong entity-relation; the remaining part of the key must not be a primary key of any other relation. Finally, regular


and specific relations are discovered. This is done by checking the primary keys of the remaining unclassified relations for the complete or partial presence of primary keys of already identified entity-relations. Step 4: Classification of the Attributes In this step, the attributes of each relation are classified into one of four groups, depending on whether they can be used as keys for entities, weak entities, relationships, etc. Attribute classification is based on the type of the parent relation and the presence of inclusion dependencies involving these attributes:
1. Primary key attributes (PKA) are attributes that uniquely identify the tuples in a relation.
2. Dangling key attributes (DKA) are attributes belonging to the primary key of a weak entity-relation or a specific relation that do not appear as the primary key of any other relation.
3. Foreign key attributes (FKA) are attributes in R1 referencing R2 such that these attributes of R1 have the same domains as the primary key attributes PK of R2, and for each t1 in r(R1) there is a t2 in r(R2) such that either t1[FK] = t2[PK] or t1[FK] is null.
4. Non-key attributes (NKA) are those attributes that cannot be classified as PKA, DKA, or FKA.
Step 5: Identification of Entity Types The schema extraction algorithm now begins to map relational concepts into corresponding E/R model concepts. Specifically, the strong and weak entity relations identified in Step 3 are classified as strong and weak entities, respectively. Furthermore, each weak entity is associated with its owner entity. The association, which includes the identification of proper keys, is done as follows: Each weak entity relation is converted into a weak entity type, and the dangling key attribute of the weak entity relation becomes the key attribute of the entity.


Each strong entity relation is converted into a strong entity type. Step 6: Identification of Relationship Types The inclusion dependencies discovered in Step 2 form the basis for determining the relationship types among the entities identified above. This is a two-step process:
1. Identify relationships present as relations in the relational database. The relationship-relations (regular and specific) obtained from the classification of relations (Step 3) are converted into relationships; the participating entity types are derived from the inclusion dependencies. For completeness of the extracted schema, we can decide to create a new entity when conceptualizing a specific relation. The cardinality of this type of relationship is M:N, or many-to-many.
2. Identify relationships among the entity types (strong and weak) that were not present as relations in the relational database, via the following classification.
IS-A relationships: These can be identified using the PKAs of strong entity relations and the inclusion dependencies among PKAs. If there is an inclusion dependency in which the primary key of one strong entity-relation refers to the primary key of another strong entity-relation, then an IS-A relationship between those two entities is identified. The cardinality of the IS-A relationship between the corresponding strong entities is 1:1.
Dependent relationships: For each weak entity type, the owner is determined by examining the inclusion dependencies involving the corresponding weak entity-relation as follows: we look for an inclusion dependency whose left-hand side contains part of the primary key of this weak entity-relation. When we find such an inclusion dependency, the owner entity can easily be identified by looking at its right-hand side. As a result, a binary relationship between the owner (strong) entity type and the weak entity is created. The cardinality of the dependent relationship between the owner and the weak entity is 1:N.
Aggregate relationships: If a foreign key in any of the regular or specific relations refers to the PKA of one of the strong entity relations, an aggregate relationship is identified. An inclusion dependency must exist with this (regular or specific) relation on the left-hand side referring to some strong entity-relation on the right-hand side. The aggregate relationship is between the relationship (which must previously have been conceptualized from a regular/specific relation) and the strong entity on the right-hand side. The cardinality of the aggregate relationship between the strong entity and the aggregate entity (an M:N relationship and its participating entities at the conceptual level) is determined as follows: if the foreign key contains unique values, the cardinality is 1:1; otherwise it is 1:N.


Other binary relationships: These are identified from the FKAs not used in identifying the above relationships. When an FKA of a relation refers to the primary key of another relation, a binary relationship is identified. The cardinality of the binary relationship between the entities is determined as follows: if the foreign key contains unique values, the cardinality is 1:1; otherwise it is 1:N.
At the end of Step 6, the schema extraction algorithm will have extracted the following schema information from the legacy database:
- names and classification of all entities,
- names of all attributes,
- primary and foreign keys,
- data types,
- simple constraints (e.g., NULL, UNIQUE) and explicit assertions, and
- relationships and their cardinalities.
3.3 Related Issue: Semantic Analysis The design and implementation of semantic analysis and code slicing are the subject of a companion thesis and hence are not elaborated in detail here; instead, the main concepts are briefly outlined. Generation of an Abstract Syntax Tree (AST) for the Application Code: Semantic analysis begins with the generation of an abstract syntax tree (AST) for the legacy application code. The AST is used by the semantic analyzer for code exploration during code slicing. The AST generator for C code consists of two major components, the lexical analyzer and the parser. The lexical analyzer for application code written in C reads the source code line by line and breaks it up into tokens. The C parser reads in these tokens and builds an AST for the source code in accordance with the language grammar. The


above approach works well for procedural languages such as C, but when applied directly to object-oriented languages it greatly increases the computational complexity of the problem. In practice, most of the application code written for databases is written in Java, making it necessary to develop an algorithm to infer semantic information from Java application code. Unfortunately, the grammar of an object-oriented language tends to be complex when compared with that of a procedural language such as C. Tools like lex or yacc can be employed to implement the parser. Our objective in AST generation is to be able to associate meaning with program variables. For example, format strings in input/output statements contain semantic information that can be associated with the variables in the input/output statement; such a program variable may in turn be associated with a column of a table in the underlying legacy database. These and the other functions of the semantic analyzer are described in detail in Hammer et al. [23, 24]. Code Analysis: The objective of code analysis is threefold: (1) augment the entities extracted in the schema extraction step with domain semantics, (2) extract queries that help validate the existence of relationships among entities, and (3) identify business rules and constraints that are not explicitly stored in the database but may be important to the wrapper developer or to an application program accessing the legacy source L. Our approach to code analysis is based on code mining, which includes slicing [32] and pattern matching [50]. The mining of semantic information from source code assumes that the application code contains output statements that support report generation or the display of query results. From the output message strings that usually describe a displayed variable v, semantic information about v can be obtained. This implies locating (tracing) the


statement s that assigns a value to v. Since s can be associated with the result set of a query q, we can associate v's semantics with a particular column of the table being accessed in q. For each of the slicing variables identified by the pre-slicer, the code slicer and the analyzer are applied to the AST. The code slicer traverses the AST in pre-order and retains only those nodes that contain the slicing variable in their sub-tree. The reduced AST constructed by the code slicer is then sent to the semantic analyzer, which extracts the data type, meaning, business rules, column name, and table name that can be associated with the slicing variable. The results of semantic analysis are appended to a result file and the slicing variable is stored in the metadata repository. Since code analysis is a build-time activity, the accuracy of the results, rather than speed, is the more critical factor.
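To illustrate the kind of clue the code analyzer mines, consider the following hypothetical Java fragment; the table Project, the column Name and the variable projName are invented for the example. The output string "Project Name:" supplies a human-readable meaning for projName, which slicing can trace back to the column referenced in the query.

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

public class ReportFragment {
    static void printProject(Connection conn, int projId) throws Exception {
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT Name FROM Project WHERE ProjId = " + projId)) {
            if (rs.next()) {
                String projName = rs.getString("Name");          // value comes from Project.Name
                System.out.println("Project Name: " + projName); // output string describes projName
            }
        }
    }
}

Slicing on projName links the descriptive text in the output statement to Project.Name, which is exactly the kind of association described above.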


Figure 3-5 The Code Analysis Process. After the code slicer and the analyzer have been invoked on all slicing variables, the result generator examines the result file and, where possible, replaces the variables in the extracted business rules with the semantics from their associated output statements. The results of code analysis up to this point are presented to the user, who can view them and decide whether further code analysis is needed. If further analysis is requested, the user is presented with a list of the variables that occur in input, output and SQL statements, together with all the slicing variables from the previous passes. Having described schema extraction and semantic analysis, it is important to focus on the interaction between the two processes. The next section provides insights into this integration, and the chapter concludes with the integrated system


design diagram and a description of its support components. For more detailed information about code analysis, the reader is referred to Hammer et al. [23, 24]. 3.4 Interaction There are five places in the execution of the integrated DRE algorithm where the schema extraction process (SE) and the semantic analyzer (SA) need to interact:
1. Initially, the SA generates the AST of the application code C_L. After successful generation of the AST, execution control is transferred to the dictionary extractor module of SE.
2. If complete information about primary keys is not found in the database dictionary, the dictionary extractor requests the semantic analyzer to provide clues. The algorithm passes the set of candidate keys, along with predefined rule-out patterns, to the code analyzer. The code analyzer searches for these patterns in the application code (i.e., in the AST) and eliminates from the candidate set those attributes that occur in a rule-out pattern. The rule-out patterns, which are expressed as SQL queries, occur in the application code whenever the programmer expects to select a SET of tuples; by the definition of a primary key, this rules out the possibility that the attributes a_1, ..., a_n form a primary key. Three sample rule-out patterns are:
a) SELECT DISTINCT <select_list> FROM T
   WHERE a_1 = <scalar_expression_1> AND a_2 = <scalar_expression_2> AND ... AND a_n = <scalar_expression_n>
b) SELECT <select_list> FROM T
   WHERE a_1 = <scalar_expression_1> AND a_2 = <scalar_expression_2> AND ... AND a_n = <scalar_expression_n>
   GROUP BY <grouping_list>
c) SELECT <select_list> FROM T


   WHERE a_1 = <scalar_expression_1> AND a_2 = <scalar_expression_2> AND ... AND a_n = <scalar_expression_n>
   ORDER BY <ordering_list>
3. After the dictionary extraction, execution control is transferred to the semantic analyzer to carry out code slicing on all possible SQL variables and other input/output variables. The relation names and attribute names generated in the schema extraction process can guide this step (e.g., the code slicer can concentrate on SQL variables whose names in the database are already known).
4. Once the code slicing is completed within a pre-specified level of confidence, control returns to schema extraction, where inclusion dependency detection is invoked. The inclusion dependency detector requests equi-join queries from the semantic analyzer, which searches the AST for typical SELECT-FROM-WHERE clauses that include one or more equality conditions on the attributes of two relations. After finding all the possible pairs of relations, the semantic analyzer returns each pair and the corresponding attributes to the inclusion dependency finder, which uses them as one source for the detection of inclusion dependencies.
5. After the execution of the integrated algorithm, the extracted information contains business rules and the semantic meaning of some of the attributes, in addition to the SE output.
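A highly simplified sketch of the candidate-key pruning performed in interaction point 2 is given below. It is an assumption-laden illustration: it treats each candidate attribute as a single-attribute candidate key and works on query strings rather than on the AST, whereas the real code analyzer matches the rule-out patterns against the parsed application code.

import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class CandidateKeyPruner {

    /**
     * Removes candidate key attributes of table t that are contradicted by a
     * rule-out pattern: a query over t that expects a set of tuples (DISTINCT,
     * GROUP BY or ORDER BY) while testing the attribute for equality.
     */
    public static List<String> prune(String t, List<String> candidates, List<String> queries) {
        List<String> remaining = new ArrayList<>(candidates);
        for (String q : queries) {
            String sql = q.toUpperCase(Locale.ROOT);
            boolean overT = sql.contains("FROM " + t.toUpperCase(Locale.ROOT));
            boolean expectsSet = sql.contains("SELECT DISTINCT")
                    || sql.contains("GROUP BY") || sql.contains("ORDER BY");
            if (!overT || !expectsSet) {
                continue;  // not a rule-out pattern for this table
            }
            // Drop any candidate attribute that is equality-tested in such a query.
            remaining.removeIf(attr -> sql.contains("WHERE " + attr.toUpperCase(Locale.ROOT) + " =")
                    || sql.contains("AND " + attr.toUpperCase(Locale.ROOT) + " ="));
        }
        return remaining;
    }
}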


Figure 3-6 DRE Integrated Architecture.

Figure 3-6 presents a schematic diagram of the integrated DRE architecture. The legacy source DB_L consists of legacy data D_L and legacy code C_L. The DRE process begins by generating the AST from C_L. The dictionary extractor then accesses D_L via the Database Interface Module and extracts preliminary information about the underlying relational schema. The configurable Database Interface Module (upper left-hand corner) is the only source-specific component in the architecture; in order to perform knowledge extraction from different sources, only the interface module needs to be modified. The code analysis module then performs slicing on the generated AST and stores information about the variables in the result file. Control is then transferred back to SE to execute the remaining steps. Finally, the Knowledge Encoder (lower right-hand corner) represents the extracted knowledge in the


form of an XML document so that it can be shared with other components in the SEEK architecture (e.g., the semantic matcher). Additionally, the Metadata Repository is internal to DRE and is used to store intermediate run-time information needed by the algorithms, including user input parameters, the abstract syntax tree for the code (e.g., from a previous invocation), etc.

3.5 Knowledge Representation

The schema extraction and semantic analysis collectively generate information about the underlying legacy source DB_L. After each step of the DRE algorithm, some knowledge is extracted from the source. At the end of the DRE process, the extracted knowledge can be classified into three types. First, detailed information about the underlying relational schema is present. The information about relation names, attribute names, data types, simple constraints, etc. is useful for query transformation at the wrapper f_WQ (Equation 1 in Section 3.1). Second, information about the high-level conceptual schema inferred from the relational schema is also available. This includes the entities, their identifiers, the relationships among the entities, their cardinalities, etc. Finally, some business rules and a high-level meaning of some of the attributes extracted by the SA are also available. This knowledge must be represented in a format that is not only computationally tractable and easy to manipulate, but that also supports intuitive human understanding. The representation of knowledge in general and semantics in particular has been an active research area for the past five years. With the advent of XML 1.0 as the universal format for structured documents and data in 1998 [60], various technologies such as XML Schema, RDF [61], the Semantic Web [62], MathML [63], and BRML [19] followed. Each technology is developed and preferred for specific applications.


For example, RDF provides a lightweight ontology system to support the exchange of knowledge on the Web, and MathML is a low-level specification for describing mathematics as a basis for machine-to-machine communication. Our preliminary survey concludes that, considering the variety of knowledge that is being (and will be) extracted by DRE, no single one of these is sufficient for representing the entire range. The choice is either to combine two or more standards or to devise our own format. The advantages of the former are the availability of proven technology and tools and compatibility with other SEEK-like systems, while the advantages of our own format would be efficiency and ease of encoding. We do not rule out a different format in the near future, but the best choice in the current scenario is XML, since it is a simple yet robust language for representing and manipulating data, and many of the technologies mentioned above use XML syntax. The knowledge encoder takes an XML DTD as input and encodes the extracted information to produce an XML document. The entire XML DTD along with the resulting XML document is shown in Appendix A. The DTD has a very intuitive tree-like structure. It consists of three parts: relational schema, conceptual schema, and business rules. The first part provides detailed information about each relation and its attributes, the second part provides information about entities and relationships, and the business rules are presented in the third part. Instead of encoding the extracted information after every step (which can result in inconsistencies, since the DRE algorithm refines some of its intermediate outcomes in the process), the encoding is done at the terminal step, which supports consistency checking.
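As an illustration of this three-part organization, a hypothetical fragment of such an encoded document might look as follows; the element and attribute names here are invented for exposition, and the authoritative DTD and output document are those listed in Appendix A.

<!-- Hypothetical excerpt; see Appendix A for the actual DTD and output. -->
<extractedKnowledge>
  <relationalSchema>
    <relation name="MSP_TASKS">
      <attribute name="TASK_UID" datatype="integer" key="primary"/>
      <attribute name="PROJ_ID"  datatype="integer" key="primary"/>
    </relation>
  </relationalSchema>
  <conceptualSchema>
    <entity name="MSP_TASKS" type="weak" identifier="TASK_UID"/>
    <relationship name="MSP_ASSIGNMENTS" type="M:N"
                  participants="MSP_TASKS MSP_RESOURCES" cardinality="M:N"/>
  </conceptualSchema>
  <businessRules>
    <rule description="illustrative rule extracted by the semantic analyzer"/>
  </businessRules>
</extractedKnowledge>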


In this chapter, we have presented a detailed description of the schema extraction algorithm with all the support processes and components. The next chapter describes the implementation of a working prototype of the DRE algorithm.


CHAPTER 4
IMPLEMENTATION

The entire Schema Extraction process and the overall DRE algorithm were delineated and discussed in detail in the previous chapter. We now describe how the SE prototype actually implements the algorithm and encodes the extracted information into an XML document, focusing on the data structures and execution flow. We shall also present an example with sample screen snapshots. The SE prototype is implemented using the Java SDK 1.3 from Sun Microsystems. The other major software tool used in our implementation is the Oracle XML Parser. Also, for testing and experimental evaluation, two different database management systems, Microsoft Access and Oracle, have been used.

4.1 Implementation Details

The SE working prototype takes a relational data source as input. The input requirements can be further elaborated as follows:

1. The source is a relational data source and its schema is available.

2. A JDBC connection to the data source is possible. (This is not a strict requirement, since Sun's JDBC driver download page provides the latest drivers for almost all database systems, such as Oracle, Sybase, IBM DB2, Informix, Microsoft Access, Microsoft SQL Server, etc. [57])

3. The database can be queried using SQL.

In summary, the SE prototype is general enough to work with different relational databases with only minimal changes to the parameter configuration in the Adapter module shown in the next figure.
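As a minimal illustration of the Adapter module's source-specific connection step, consider the following sketch; the driver class, connection URL, host, and credentials shown are placeholders rather than the prototype's actual configuration.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Illustrative sketch of the source-specific connection step performed by the
// Adapter module. Only this module would change for a different RDBMS.
public class AdapterSketch {

    // Placeholder driver class and URL for an Oracle source.
    private static final String DRIVER = "oracle.jdbc.driver.OracleDriver";
    private static final String URL = "jdbc:oracle:thin:@dbhost:1521:orcl";

    public static Connection connect(String user, String password)
            throws ClassNotFoundException, SQLException {
        Class.forName(DRIVER);                       // load the vendor driver
        return DriverManager.getConnection(URL, user, password);
    }
}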


Figure 4-1 Schema Extraction Code Block Diagram.

Figure 4-1 shows the code block diagram of the SE prototype. The Adapter module connects to the database and is the only module that contains actual queries to the database; it is therefore the only module that has to be changed in order to connect the SE prototype to a different database system. Details about these changes are discussed in the configuration section later. The Extractor module executes Step 1 of the SE algorithm; at the end of that step, all the necessary information has been extracted from the database. The Analysis module works on this information to process Steps 2, 3, and 4 of the SE algorithm, and it also interacts with the Semantic Analyzer module to obtain the equi-join queries. The Inference module identifies the entities and relationships (Steps 5 and 6 of SE). All these modules store the resulting knowledge in a common data structure, which is a collection of object instances of predetermined classes. These classes not only store information about the underlying relational database


but also keep track of newly inferred conceptual information. We now highlight the implementation of the SE algorithm.

SE-1 Dictionary Extractor: This step accesses the data dictionary and tries to extract as much information as possible. The database schema is queried using the JDBC API to get all the relation names, attribute names, data types, simple constraints, and key information. Every query in the extractor module is a method invocation that ultimately executes primitive SQL queries in the Adapter module; thus, a general API is created for the extractor module. This information is stored in internal objects: for every relation, we create an object whose structure is consistent with the final XML representation. The representation makes it easy to identify whether an attribute is a primary key, what its data type is, and what the corresponding relation names are; e.g., Figure 4-2 shows the class structure of a relation. A new instance of the class is created when the extractor extracts a new relation name, and the extracted information is filled into these object instances according to the attributes of the class. Each instance contains information about the name and type (filled after Step 3) of the relation, its primary key, its foreign key, the number of attributes, etc. Note that every relation object contains an array of attribute objects; the array size is equal to the number of attributes in the relation. The attribute class is defined in Step 4. A sketch of how such dictionary information can be obtained through JDBC metadata calls is shown below.
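The sketch below uses the standard java.sql.DatabaseMetaData calls (getTables, getColumns, getPrimaryKeys) to list relations, attributes, and primary keys. It is a simplified stand-in for the prototype's Extractor/Adapter API, and the method name dumpSchema is invented for this example.

import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.ResultSet;
import java.sql.SQLException;

// Simplified illustration of dictionary extraction via JDBC metadata.
public class DictionaryExtractorSketch {

    public static void dumpSchema(Connection conn) throws SQLException {
        DatabaseMetaData meta = conn.getMetaData();

        // Enumerate all user tables (relations).
        try (ResultSet tables = meta.getTables(null, null, "%", new String[] {"TABLE"})) {
            while (tables.next()) {
                String table = tables.getString("TABLE_NAME");
                System.out.println("Relation: " + table);

                // Attribute names, data types, and nullability.
                try (ResultSet cols = meta.getColumns(null, null, table, "%")) {
                    while (cols.next()) {
                        System.out.println("  attribute " + cols.getString("COLUMN_NAME")
                                + " : " + cols.getString("TYPE_NAME")
                                + " nullable=" + cols.getString("IS_NULLABLE"));
                    }
                }

                // Primary key columns, if declared in the dictionary.
                try (ResultSet pks = meta.getPrimaryKeys(null, null, table)) {
                    while (pks.next()) {
                        System.out.println("  primary key: " + pks.getString("COLUMN_NAME"));
                    }
                }
            }
        }
    }
}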


Figure 4-2 The class structure 1 for a relation.

Relation
  name : string
  primary_key : string
  foreign key : string
  attributes : array att
  type : string
  pkcount : int
  fkcount : int
  attcount : int

After this step, we have an array of relation objects in the common data structure. This way we can not only identify all the relation names and their primary keys, but can also examine each attribute of each relation by looking at its characteristics.

SE 2: Discover Inclusion Dependencies: The inputs for this step include the array of relation objects generated in the previous step and the data instances in the database. The actual data in the database is used for detecting the direction of an inclusion dependency. During this step, the algorithm needs to make several important decisions, which affect the outcome of the SE process. Due to the importance of this step, the choice of the data structure becomes crucial.

1 A class represents an aggregation of attributes and operations. In our implementation, we have defined a set of classes whose instances hold the results of the DRE process. These classes define a set of attributes with their corresponding datatypes. We follow UML notation throughout this chapter to represent classes.
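A direct Java rendering of the Relation class in Figure 4-2 might look like the following sketch; the field names mirror the figure, the nested Attribute stand-in anticipates the attribute class shown later in Figure 4-4, and the code is illustrative rather than the prototype's source.

// Illustrative Java rendering of the Relation class from Figure 4-2.
public class Relation {
    // Minimal stand-in for the Attribute class of Figure 4-4.
    public static class Attribute {
        String name;
        String datatype;
        String type;        // PKA / DKA / FKA / NKA classification (Step 4)
    }

    String name;            // relation name
    String primaryKey;      // primary key attribute(s)
    String foreignKey;      // foreign key attribute(s)
    Attribute[] attributes; // one entry per column of the relation
    String type;            // strong / weak / regular / specific (filled after Step 3)
    int pkCount;            // number of primary key attributes
    int fkCount;            // number of foreign key attributes
    int attCount;           // total number of attributes
}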


As described in our algorithm in Chapter 3, we need two sets of inclusion dependencies at any time during this step: the set of possible inclusion dependencies and the set of final inclusion dependencies. These inclusion dependencies could be represented inside the relation objects so that it is easy to associate them with relations and attributes. However, we decided to create a separate data structure, as adding this information to the relation object seems to be a conceptual violation: inclusion dependencies occur between relations. The class structure for an inclusion dependency is illustrated schematically in Figure 4-3.

Figure 4-3 The class structure for the inclusion dependencies.

InclusionDependency
  lhsentity : string
  rhsentity : string
  lhsattset : string
  rhsattset : string
  lhsentitytype : string
  noofatt : int
  rating : string

The attribute lhsentitytype describes the type of the entity on the left-hand side of the inclusion dependency. This helps in identifying the relationships in Step 6; for example, if the type is strong entity, then the inclusion dependency can suggest a binary relationship or an IS-A relationship. For more details, the reader is referred to Step 6. Another attribute, noofatt, gives the number of attributes involved in the inclusion dependency, which helps in finalizing the foreign key attributes. The other attributes of the class are self-explanatory.


We keep two arrays of such objects, one for the FINAL set and the other for the POSSIBLE set. If foreign keys can be extracted from the data dictionary, or equi-join queries are extracted from the application code, then we create a new instance in the FINAL set. Every non-final or hypothesized inclusion dependency is stored by creating a new instance in the POSSIBLE set. After the exhaustive search for the set of inclusion dependencies, we remove some of the unwanted inclusion dependencies (e.g., transitive dependencies) in the cleaning process. Finally, if the POSSIBLE set is non-empty, all of its instances are presented to the user: the inclusion dependencies rejected by the user are removed from the POSSIBLE set, and the inclusion dependencies accepted by the user are copied to the FINAL set. After this step, only the FINAL set is used for future computations.

SE 3: Classify Relations: This step takes the array of relation objects and the array of inclusion dependency objects as input and classifies each relation into a strong-entity, weak-entity, regular-relationship, or specific-relationship relation. First, the classification is performed assuming consistent naming of key attributes: all the relation names and the corresponding primary keys from the common data structures are accessed and analyzed, and the primary key of every relation is compared with the primary keys of all other relations. According to that analysis, the attribute Type is added to the object. This classification is then revised based on the existence of inclusion dependencies, so even if consistent naming is not employed, SE can still classify the relations successfully. Also, this type information is added to the inclusion dependency objects so that we can distinguish between entities and relationships.


The output of this step is the array of modified relation objects and the array of modified inclusion dependency objects (with the type information of the participating relations). This is passed as input to the subsequent modules.

SE 4: Classify Attributes: As noted in Step 1, each relation object contains an array of attribute objects whose size is equal to the number of attributes in the relation. Figure 4-4 shows the class structure for an attribute.

Figure 4-4 The class structure for an attribute.

Attribute
  name : string
  meaning : string
  tablename : string
  datatype : string
  isnull : string
  isunique : int
  type : string
  length : string

This step can be easily executed, as all the required information is available in the common data structures. Though this step is conceptually a separate step in the SE algorithm, its implementation is done in conjunction with the three steps above; e.g., whether an attribute is a primary key or not can be decided in Step 1.

SE 5: Classify Entities: Every relation from the array of relation objects is accessed and, by checking its type, new entity objects are created. If the type of the


relation is strong, then a strong entity is created, and if the type of the relation is weak, then a weak entity is created. Every entity object contains information about its name, its identifier, and its type.

SE 6: Classify Relationships: The inputs to the last step of the SE algorithm include the array of relation objects and the array of inclusion dependency objects. This step analyzes each inclusion dependency and creates the appropriate relationship types. The successful identification of a new relationship results in the creation of a new instance of the class described in Figure 4-5. The class structure mainly includes the name and type of the relationship, the participating entities, and their corresponding cardinalities. Arrays of strings are used to accommodate a variable number of entities participating in the relationship. The participating entities are filled in from the entity-relations in the inclusion dependency, while the cardinality is discovered by actually querying the database. The other information is filled in according to Figure 4-6.

Figure 4-5 The class structure for a relationship.

Relationships
  name : string
  type : string
  partentity : array of strings
  cardinality : array of strings
  partentcount : int

The flow of execution is described as follows. For every inclusion dependency whose left-hand side relation is an entity-relation, the SE does the following:


1. If it is a strong entity with the primary key in the inclusion dependency, then an IS-A relationship between two strong entities is identified.

2. If it is a strong entity with a non-primary key in the inclusion dependency, then a regular binary relationship between two entities is identified.

3. If it is a weak entity with the primary key in the inclusion dependency, then a dependent or has relationship between the two entities is identified.

4. If it is a weak entity with a non-primary-key attribute in the inclusion dependency, then a regular binary relationship between two entities is identified.

For every inclusion dependency whose left-hand side relation is a relationship-relation, the SE does the following:

1. We know which relations are identified as regular and specific. We only have to identify the inclusion dependencies involving the primary keys (or a subset of the primary keys) of these relations on the left-hand side to find out the participating entities. The n-ary relationships, where n > 2, are also identified similarly.

2. If we have a regular or specific relation with non-primary keys on the left-hand side, an aggregate relationship is identified.

Thus all the new relationships are created by analyzing the array of inclusion dependencies. As a result, at the end of the schema extraction process, the output consists of the array of relation objects, the array of entity objects, and the array of relationship objects. A condensed sketch of this classification logic is shown below.
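The following condensed Java sketch illustrates the decision logic just described. The class, field, and method names (Ind, classify, lhsAttSetIsPrimaryKey) are invented for this illustration and do not correspond to methods of the actual prototype.

// Condensed, illustrative version of the SE-6 relationship classification.
public class RelationshipClassifierSketch {

    // Minimal stand-in for the inclusion dependency class of Figure 4-3.
    static class Ind { String lhsEntity, lhsEntityType, lhsAttSet, rhsEntity; }

    static String classify(Ind ind, boolean lhsAttSetIsPrimaryKey) {
        if (ind.lhsEntityType.equals("strong")) {
            // Strong entity on the left-hand side.
            return lhsAttSetIsPrimaryKey ? "IS-A" : "Regular Binary";
        } else if (ind.lhsEntityType.equals("weak")) {
            // Weak entity on the left-hand side.
            return lhsAttSetIsPrimaryKey ? "Dependent (has)" : "Regular Binary";
        } else {
            // Relationship-relation on the left-hand side: M:N (possibly n-ary)
            // when primary key attributes are involved, aggregate otherwise.
            return lhsAttSetIsPrimaryKey ? "M:N" : "Aggregate";
        }
    }

    public static void main(String[] args) {
        Ind ind = new Ind();
        ind.lhsEntity = "MSP_TASKS";
        ind.lhsEntityType = "weak";
        ind.rhsEntity = "MSP_PROJECT";
        System.out.println(classify(ind, true)); // prints: Dependent (has)
    }
}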


Figure 4-6 The information in the different types of relationship instances.

Consider an inclusion dependency: LHS-entity.LHS-attset << RHS-entity.RHS-attset

1. IS-A relationship:
   Name: RHS-entity_is-a_LHS-entity
   Type: IS-A
   Cardinality: 1:1

2. Regular Binary relationship:
   Name: RHS-entity relates_to LHS-entity
   Type: Regular Binary
   Cardinality: 1:1/1:N (can be easily finalized by checking duplication)

3. Dependent or has relationship:
   Name: RHS-entity has LHS-entity
   Type: Dependent
   Cardinality: 1:N (with N at the weak entity side)

4. M:N relationship:
   Name: name of the corresponding relation
   Type: M:N
   Cardinality: M:N

5. Aggregate relationship:
   Name: Aggregated-LHS-entity relates_to RHS-entity
   Type: Aggregate
   Cardinality: 1:1/1:N (can be easily finalized by checking duplication)

Knowledge Encoder: When the schema extraction process is completed, the encoder module is automatically called. This module has access to the common data structure. Using the extracted information, the encoder module creates an XML file (results.xml) with proper formatting that conforms to the predefined DTD. For more information about this DTD, please refer to Appendix A.


4.2 Example Walkthrough of Prototype Functionality

In this section, we present an exhaustive example of the Schema Extraction process, along with some screenshots of our working prototype. Project management is a key application in the construction industry; hence the legacy source for our example is based on a Microsoft Project application from our construction supply chain testbed. For simplicity, we assume without loss of generality or specificity that only the following relations exist in the MS-Project application and will be discovered using SE (for a description of the entire schema, refer to the Microsoft Project Website [42]): MSP-Project, MSP-Availability, MSP-Resources, MSP-Tasks, and MSP-Assignment. Additionally, we assume that the dictionary makes the following information available to the SE algorithm: relation and attribute names, all explicitly defined constraints, primary keys (PKs), all unique constraints, and data types.

SE Step 1: Extracting Schema Information: This step extracts the relation and attribute names from the legacy source. A decision step directs control to the semantic analyzer if information about the primary keys cannot be obtained from the dictionary.

Result: In the example DRE application, the following relations were obtained from the MS-Project schema. All the attribute names, their data types, null constraints, and unique constraints were also extracted but are not shown, to maintain clarity.

MSP-Project [ PROJ_ID ...]
MSP-Availability [ PROJ_ID, AVAIL_UID ...]
MSP-Resources [ PROJ_ID, RES_UID ...]
MSP-Tasks [ PROJ_ID, TASK_UID ...]


MSP-Assignment [ PROJ_ID, ASSN_UID ...]

Figure 4-7 The screen snapshot describing the information about the relational schema.

Figure 4-7 presents a screen snapshot describing the information about the relational schema, including the relation names, attribute names, primary keys, simple constraints, etc. The reader can see a hierarchical structure describing the relational schema: a subtree of the corresponding attributes is created for every relation, and all information is displayed when the user clicks on a particular attribute. For example, the reader can get information about the attribute TASK_FINISH_DATE (top of the screen), including its datatype (datetime), the meaning extracted from the code (Task termination date), key information, and simple constraints. The hierarchical structure is chosen since it provides legibility, user-directed display (the user can click on a relation name or attribute name for detailed information), and ease of use. This structure is followed everywhere in the user interface.


SE Step 2: Discovering Inclusion Dependencies. This part of the DRE algorithm has two decision steps. First, all of the possible inclusion dependencies are determined using SE. Then control is transferred to SA to determine whether there are equi-join queries embedded in the application (using pattern matching against FROM and WHERE clauses). If so, the queries are extracted and returned to SE, where they are used to rule out erroneous dependencies. The second decision determines whether the set of inclusion dependencies is minimal. If so, control is transferred to SE Step 3. Otherwise, the transitive dependencies are removed and the minimal set of inclusion dependencies is finalized with the help of the user.

Result: The inclusion dependencies are as follows:

MSP_Assignment [Task_uid, Proj_ID] << MSP_Tasks [Task_uid, Proj_ID]
MSP_Assignment [Res_uid, Proj_ID] << MSP_Resources [Res_uid, Proj_ID]
MSP_Availability [Res_uid, Proj_ID] << MSP_Resources [Res_uid, Proj_ID]
MSP_Resources [Proj_ID] << MSP_Project [Proj_ID]
MSP_Tasks [Proj_ID] << MSP_Project [Proj_ID]
MSP_Assignment [Proj_ID] << MSP_Project [Proj_ID]
MSP_Availability [Proj_ID] << MSP_Project [Proj_ID]

The last two inclusion dependencies are removed on the basis of transitivity.

SE Step 3: Classification of the Relations. Relations are classified, by analyzing the primary keys obtained in Step 1, into one of four types: strong, weak, specific, or regular. If any unclassified relations remain, then user input is requested for clarification. If we need to make distinctions between


weak and regular relations, then user input is requested; otherwise control is transferred to the next step.

Result:
Strong entities: MSP_Project, MSP_Availability
Weak entities: MSP_Resources, MSP_Tasks
Regular relationship: MSP-Assignment

SE Step 4: Classification of the Attributes. We classify attributes as (a) PK or FK, (b) Dangling or General, or (c) Non-Key (if none of the above). Control is transferred to SA if FKs need to be validated. Otherwise, control is transferred to SE Step 5.

Result: Table 4-1 illustrates the attributes obtained from the example legacy source.

Table 4-1 Example of the attribute classification from the MS-Project legacy source.

Relation        | PKA                 | DKA       | FKA                                   | NKA
MS-Project      | Proj_ID             |           |                                       | all remaining attributes
MS-Resources    | Proj_ID + Res_UID   | Res_UID   |                                       | all remaining attributes
MS-Tasks        | Proj_ID + Task_UID  | Task_UID  |                                       | all remaining attributes
MS-Availability | Proj_ID + Avail_UID | Avail_UID | Res_UID + Proj_ID                     | all remaining attributes
MS-Assignment   | Proj_ID + Assn_UID  | Assn_UID  | Res_UID + Proj_ID, Task_UID + Proj_ID | all remaining attributes

SE Step 5: Identify Entity Types. In SE-5, strong (weak) entity relations obtained from SE-3 are directly converted into strong (respectively, weak) entities.

Result: The following entities were classified:


Strong entities:
MSP_Project with Proj_ID as its key.
MSP_Availability with Avail_uid as its key.

Weak entities:
MSP_Tasks with Task_uid as its key and MSP_Project as its owner.
MSP_Resources with Res_uid as its key and MSP_Project as its owner.

Figure 4-8 The screen snapshot describing the information about the entities.

Figure 4-8 presents a screen snapshot describing the identified entities. The description includes the name and the type of the entity as well as the corresponding relation in the relational schema. For example, the reader can see the entity MSP_AVAILABILITY (top of the screen), its identifier (AVAIL_UID), and its type (strong entity). The corresponding relation MSP_AVAILABILITY and its attributes in the relational schema can also be seen in the interface.

SE Step 6: Identify Relationship Types.

Result: We discovered 1:N binary relationships between the following entity types:


Between MSP_Project and MSP_Tasks
Between MSP_Project and MSP_Resources
Between MSP_Resources and MSP_Availability

Since two inclusion dependencies involving MSP_Assignment exist (i.e., between Task and Assignment and between Resource and Assignment), there is no need to define a new entity. Thus, MSP_Assignment becomes an M:N relationship between MSP_Tasks and MSP_Resources.

Figure 4-9 The screen snapshot describing the information about the relationships.

Figure 4-9 shows the user interface in which the user can view information about the identified relationships. After clicking on the name of a relationship, the user can view information such as its type, the participating entities, and the respective cardinalities. The type of the relationship can be one of the types discussed in Step 6 of the DRE algorithm given in the previous chapter. If the relationship is of M:N type, the corresponding relation in the relational schema is also shown. For example, the reader can see the information about the relationship MSP_ASSIGNMENTS in Figure 4-9. The


reader can see the type (M:N regular), the participating entities (MSP_RESOURCES and MSP_TASKS), and their corresponding cardinalities (M and N). The reader can also see the corresponding relation MSP_ASSIGNMENTS.

Figure 4-10 E/R diagram representing the extracted schema.

The E/R diagram based on the extracted information shows four entities, their attributes, and the relationships between the entities. Not all the attributes are shown, for the sake of legibility. MSP_Projects is a strong entity with Proj_ID as its identifier. The entities MSP_Tasks and MSP_Resources are weak entities and depend on MSP_Projects. Both weak entities participate in an M:N binary relationship, MSP_Assignments. MSP_Availability is also a strong entity, participating in a regular binary relationship with MSP_Resources. For the XML view of this information, the reader is referred to Appendix B.


4.3 Configuration and User Intervention

As discussed earlier, the Adapter module needs to be modified for different databases. These changes mainly include the name of the driver used to connect to the database, the procedure for making the connection, the method of providing the username and password, etc. Also, certain methods in the JDBC API might not work for all relational databases, because such compatibility is generally vendor dependent. Instead of making all these changes every time or keeping different files for different databases, we have provided a command line input for specifying the database. Once the program gets the name of the database (e.g., Oracle), it configures the Adapter module and continues execution automatically. The next point of interaction with the user is before finalizing the inclusion dependencies. If the set of final inclusion dependencies cannot be finalized without any doubt, then that set is presented to the user with the corresponding ratings, as discussed in the previous chapter. The user then selects the valid inclusion dependencies and rejects the others. Though the user is guided by the rating system, he is not bound to follow it and may select any inclusion dependency if he is assured of its correctness; however, the result of such an irregular manual selection cannot be predicted beforehand. After the entire process is completed, the user can view the results in two forms. The graphical user interface, automatically launched at the end of SE, shows the complete information in an easy and intuitive manner; Java Swing has been used extensively to develop this interface. The other way to view the results is in the form of the XML document generated by the knowledge encoder. The sample XML representation of the extracted knowledge and its corresponding DTD can be found in Appendix B and Appendix A, respectively.


4.4 Integration

As discussed in the previous chapter, the integration of the Schema Extraction algorithm with the Semantic Analysis results in the overall DRE algorithm. The design and implementation of semantic analysis and code slicing are being done by another member of our research group and hence are not discussed in detail here. Though the main focus of this chapter is to present the implementation details of the schema extraction process, we now provide brief insights on how the implementation handles the interaction between SE and SA. We have already listed the points where SE and SA interact. Initially, the integrated prototype begins with the AST generation step of SA. This step then calls the dictionary extractor. The notable change here is that the dictionary extractor no longer contains the main method; instead, SA calls it as a normal method invocation. Similarly, SA calls the Analysis module (shown in Figure 4-1) after its code analysis phase. One important point of interaction is in the dictionary extraction step. If enough information about the primary keys is not found in the dictionary, SE passes a set of candidate keys to SA in the form of an array. SA already has access to the query templates that represent the predetermined patterns; it operates on the AST to match these patterns and makes decisions if a match is successful. The reduced set is then sent back to SE as an array. We assume that we get strong clues about primary keys from the dictionary, and hence this interaction will rarely take place. Another vital point of communication is the equi-join query finder. While developing the integrated prototype, we have assumed that SE will simply invoke the module and that SA has the responsibility to find, shortlist, and format these queries. SA will


send back the inclusion dependencies in the inc_dependency object format discussed earlier. Then SE takes over and completes the inclusion dependency detection process. We have discussed the implementation issues in detail in this section and in the previous section. The next section concludes the chapter with an implementation summary.

4.5 Implementation Summary

4.5.1 Features

Almost all the features of the SE algorithm discussed in Chapter 3 have been implemented. The prototype can:

1. connect to the relational database via JDBC;

2. extract information about the underlying relational model with the help of our own small API built on the powerful JDBC API; this information includes relation names, column names, simple constraints, data types, etc.;

3. store that information in a common, database-like data structure;

4. infer possible inclusion dependencies and finalize the set of inclusion dependencies with the help of an expert user;

5. identify entities and the relationships among these entities;

6. present the extracted information to the user with an intuitive interface; as well as

7. encode and store the information in an XML document.

The prototype is built using the Java programming language. The JDBC API is used to communicate with the database, and the Java Swing API is used to generate all the user interfaces. The choice of Java was motivated by its portability and robustness.

4.5.2 Advantages

The working prototype of the Schema Extraction also presents the following notable advantages:


1. It minimizes user intervention and requests user assistance only at the most important step (if required).

2. The prototype is easily configurable to work with different relational databases.

3. The user interface is easy to use and intuitive.

4. The final XML file can be kept as a simple yet powerful form of documentation of the DRE process and can be used to guide wrapper generation or any decision-making process.

Though these are significant advantages, the main concern about the prototype implementation of the SE algorithm is its correctness. If the prototype does not give accurate results, then having an elaborate interface or an easy configuration is of no importance. We can conclude whether these advantages hold up in practical scenarios only after an experimental evaluation of the prototype. Hence the next chapter is dedicated to the experimental evaluation: it outlines the tests performed, discusses the experimental results, and provides more insights about our working prototype.


CHAPTER 5
EXPERIMENTAL EVALUATION

Several parameters can be used to evaluate our prototype. The main criteria include correctness or accuracy, performance, and ease of use. Schema extraction is primarily a build-time process and hence is not time critical; thus, performance analysis based on execution time is not an immediate issue for the SE prototype. The main parameter in the experimental evaluation of our prototype is the correctness of the information it extracts. If the extracted information is highly accurate under diverse input conditions (e.g., less than 10% error), then the SE algorithm can be considered useful. As SEEK attempts to be a highly automated tool for rapid wrapper re-configuration, another important parameter is the amount of user intervention needed to complete the DRE process. We have implemented a fully functional SE prototype system, which is currently installed and running in the Database Center at the University of Florida, and we have run several experiments to test our algorithm. We shall first give the setup on which these experiments were conducted. In the next section, we shall explain the test cases and the results obtained. Finally, we shall provide conclusive reasoning about the results and summarize the experiments.

5.1 Experimental Setup

The DRE testbed resides on an Intel Pentium-IV PC with a 1.4 GHz processor, 256 MB of main memory, and 512 KB of cache memory, running Windows NT. As discussed earlier, all components of this prototype were implemented using Java (SDK 1.3) from Sun


Microsystems. Other tools used are the XML Parser from Oracle (version 2.0.2.9), project scheduling software from Microsoft to prepare test data, the Oracle 8i RDBMS for storing test data, and JDBC drivers from Sun and Oracle. The SE prototype has been tested with two types of database systems: an MS-Access/MS-Project database on the PC and an Oracle 8i database running on the department's Sun Enterprise 450 Model 2250 machine with two 240 MHz processors. The server has 640 MB of main memory, 1.4 GB of virtual memory, and 2 MB of cache memory. This testbed reflects the distributed nature of business networks, in which the legacy source and the SEEK adapter will most likely execute on different hardware. The DRE prototype connects to the database, either locally (e.g., MS-Access) or remotely (e.g., Oracle), using JDBC, extracts the required information, and infers the conceptual associations between the entities.

5.2 Experiments

In an effort to test our schema extraction algorithm, we selected nine test databases from different domains. The databases were created by graduate students as part of a database course at the University of Florida. Each of these test cases contains a relational schema and actual data. The average number of tables per schema is approximately 20, and the number of tuples ranges from 5 to over 50,000 per table. These applications were developed for varied domains such as an online stock exchange and library management systems.

5.2.1 Evaluation of the Schema Extraction Algorithm

Each test database was used as an input to the schema extraction prototype. The results were compared against the original design documents to validate our approach. The results are captured in Table 5-1.


Table 5-1 Experimental results of schema extraction on 9 sample databases.

Project | Domain                | Phantom INDs | Missing INDs | Phantom E/R Components | Missing E/R Components | Complexity Level
P1      | Publisher             | 0            | 0            | 0+0                    | 0+0                    | Low
P9      | Library               | 0            | 1            | 0+0                    | 0+1                    | Low
P5      | Online Company        | 0            | 0            | 0+0                    | 0+0                    | Low
P3      | Catering              | 0            | 0            | 0+0                    | 0+0                    | Medium
P6      | Sports                | 0            | 0            | 0+0                    | 0+0                    | Medium
P4      | Movie Set             | 1            | 0            | 0+1                    | 0+0                    | Medium
P8      | Bank                  | 1            | 0            | 0+1                    | 0+0                    | Medium
P7      | Food                  | 3            | 1            | 0+3                    | 0+1                    | High
P2      | Financial Transaction | 5            | 0            | 0+5                    | 0+0                    | High

The final column of Table 5-1 specifies the level of complexity for each schema. At this point, all projects are classified into three levels based on our relational database design experience. Projects P1, P9, and P5 are given low complexity, as their schemas exhibit meaningful and consistent naming systems, rich datatypes (a total of 11 different datatypes in the case of P9 and 17 in the case of P1), and relatively few relations (ranging from 10 to 15) with only a few tuples per relation (on average 50-70 tuples per relation). A database having these characteristics is considered a good database design and is richer in terms of semantic information content; hence, for the knowledge extraction process, these databases are more tractable and are rated at a low complexity level. Projects P3, P6, P4, and P8 are less tractable than P1, P9, and P5 due to a limited number of datatypes (only 7 in the case of project P3), more relations (on average 20 relations per schema), and more tuples per relation than P1, P9, and P5. Projects P7 and P2 have been rated most complex due to their naming systems and limited number of datatypes; for example, in project P2 the primary key attribute of almost all tables is named simply ID. The importance of the various factors


in the complexity levels is better understood when the behavior and results for each schema are studied. The details are given in Section 5.2.2. Since the main parameter for evaluating our prototype is the correctness of the extracted information, Table 5-1 shows the errors detected in the experiments. These errors can essentially be of two types: a missing concept (i.e., the concept is clearly present but our SE algorithm did not extract it) or a phantom concept (i.e., the concept is extracted by the SE algorithm but is absent in the data source). As the first step of the SE algorithm merely extracts what is present in the dictionary, errors only start accumulating from the second step. The core part of the algorithm is inclusion dependency detection; Steps 3, 4, 5, and 6 use the final set of inclusion dependencies either to classify the relations or to infer high-level relationships. Hence, an error in this step is almost always reflected as an error in the final result. As previously discussed, when the decision about certain inclusion dependencies cannot be finalized, the possible set is presented to the user with an appropriate rating (low or high). In the experimental evaluation, we always assume that the user blindly decides to keep the inclusion dependencies with a high rating and reject those with a low rating. This assumption helps us reflect the exact behavior of the SE algorithm, though in real scenarios the accuracy can be increased by intelligent user decisions. More details about this are given in Section 5.3.2. Thus, omissions and phantoms have a slightly different meaning with respect to our algorithm. Missing inclusion dependencies are those in the set POSSIBLE that are ranked low by our algorithm but do exist in reality; hence they are considered omissions. Phantom inclusion dependencies are those in the set POSSIBLE that are ranked high


by our algorithm but are actually invalid; hence the term phantom, since they do not exist in reality.

5.2.2 Measuring the Complexity of a Database Schema

In order to evaluate the outcome of our schema extraction algorithm, we first describe our methodology for ranking the test databases based on their perceived complexity. Although our complexity measure is subjective, we used it to develop a formula which rates each test case on a complexity scale between 0 (low) and 1 (high). The tractability/complexity of a schema was based on the following factors:

Number of useless PK to identical attribute name matches: One of the most important factors taken into account was the total number of instances where a primary key name was identical to other attribute names that were not in any way relationally connected to the primary key. We define this type of match as a useless match.

Data types of all attributes, primary key data types, and maximum number of datatypes: Each data type was distinguished by the data type name and also the length; for instance, char(20) was considered to be different from char(50). The higher the total number of data types in a schema, the less complex (more tractable) it is, because attributes are considered to be more distinguishable.

Number of tables in the schema and maximum number of tables in the testcases: The more relations a schema has, the more complex it will be. Since there is no common factor with which to normalize the schema, a common denominator was determined by taking the maximum number of relations across the testcases in order to produce a normalized scale to rank all the schemas.

Relationships in the E/R model and maximum number of relationships in the testcases: The more relationships a schema contains, the more complex it will be. Similar to the preceding factor, the common denominator in this case is the maximum number of relationships.

The following equation determines the tractability of a schema:

Trac = W_1 (1 - U_schema / U_max) + W_2 (D_schema / D_max) + W_3 (D_PK / D_PKmax) + W_4 (1 - T_schema / T_max) + W_5 (1 - R_ERModel / R_ERModelmax)

where U_schema represents the useless name matches in a schema, U_max the maximum number of useless name matches in all testcases, D_schema the number of attribute data types in a schema, D_max the maximum number of attribute data types in all testcases, D_PK the number of primary key data types in a schema, D_PKmax the maximum number of primary key data types in all testcases, T_schema the number of tables in a schema, T_max the maximum number of tables in all testcases, R_ERModel the number of relationships in the E/R model of a schema, and R_ERModelmax the maximum number of relationships in the E/R models of all testcases. In addition, each factor is weighted (W_1 through W_5) to indicate its importance with respect to the complexity of the schema.
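A small sketch of this computation is given below; the weight values in the example are arbitrary placeholders (the weights are not fixed here), and the method name tractability is invented for illustration.

// Illustrative computation of the tractability measure described above.
// The weights w1..w5 are placeholders; their values are left open.
public class TractabilitySketch {

    public static double tractability(
            double uSchema, double uMax,         // useless name matches
            double dSchema, double dMax,         // attribute datatypes
            double dPk, double dPkMax,           // primary key datatypes
            double tSchema, double tMax,         // tables
            double rErModel, double rErModelMax, // E/R relationships
            double w1, double w2, double w3, double w4, double w5) {
        return w1 * (1.0 - uSchema / uMax)
             + w2 * (dSchema / dMax)
             + w3 * (dPk / dPkMax)
             + w4 * (1.0 - tSchema / tMax)
             + w5 * (1.0 - rErModel / rErModelMax);
    }

    public static void main(String[] args) {
        // Hypothetical values for one schema, with equal weights of 0.2.
        double trac = tractability(2, 10, 11, 17, 3, 5, 12, 25, 9, 30,
                                   0.2, 0.2, 0.2, 0.2, 0.2);
        System.out.println("Tractability = " + trac); // higher means more tractable
    }
}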


5.3 Conclusive Reasoning

The complexity measurement formula described in the previous section was used to order the projects. Based on this ordering of the projects, we plotted the results of Table 5-1 as two graphs to better illustrate the outcome of the experiments. The graphs are shown in Figure 5-1. They depict the errors encountered during inclusion dependency detection and at the end of the overall process.


Figure 5-1 Results of experimental evaluation of the schema extraction algorithm: errors in detected inclusion dependencies (top), number of errors in extracted schema (bottom). [Two bar charts, "Inclusion Dependency Results" and "E-R Results", plot the number of phantom and missing inclusion dependencies and E/R components for schemas P1, P9, P5, P3, P6, P4, P8, P7, P2 in ascending order of complexity.]

5.3.1 Analysis of the Results

Both graphs in Figure 5-1 appear similar, since a phantom in an inclusion dependency definitely results in a phantom relationship and a missing inclusion dependency almost always results in a missing relationship. This is due to the fact that every relationship is identified only from an inclusion dependency. After the set of


inclusion dependencies is finalized, every inclusion dependency in that set is used to identify associations between the participating entities. So the presence of a phantom inclusion dependency in the final set always indicates the presence of a phantom relationship in the final result. Phantoms are generated because some pairs of attributes in different tables are not related even though they have name and datatype similarities and the subset relationship holds. For example, consider two relations Company and Person. If both relations have an attribute called ID of integer type and the subset relationship holds (due to similar integer values or ranges), then the inclusion dependency Company (ID) << Person (ID) is definitely added to the POSSIBLE set. As there is a name similarity between the two attributes, the SE algorithm will give a high rating to this inclusion dependency when presenting it to the user. So if the user decides to keep it, a phantom relationship between the concerned entities (Company and Person) will be identified in the final result. Phantoms generally occur when the naming system in the database design is poor (as shown by our example). They can also occur due to limited data values or a lack of variety of datatypes in the schema. All of these contribute to the complexity of the schema, as discussed earlier. Omissions occur because some pairs of attributes in different tables are actually related even though they have completely unrelated and different names. For example, consider two relations Leaders and US-Presidents. If the relation Leaders has an attribute Country# and the relation US-Presidents has an attribute US#, and both attributes have integer datatype, then the subset relationship definitely holds. Since there is no name similarity between the relations and the attributes, the SE algorithm will


attach a low rating to this inclusion dependency when presenting it to the user. If the user rejects this possible inclusion dependency, a valid relationship between the concerned entities (Leaders and US-Presidents) will be missing from the final result. Such omissions generally occur when the naming system in the database design is inadequate (as shown by our example). Both graphs in Figure 5-1 also suggest that there are comparatively fewer omissions than phantoms. Omissions occur very rarely in our algorithm, due to the fact that our exhaustive algorithm will miss something (i.e., give a low rating) only when the tables and columns on both sides are completely unrelated in terms of names. As this is very rare in normal database design, our algorithm will rarely miss anything. Another parameter for evaluation is user intervention. The SE algorithm may consult the user only at the inclusion dependency detection step: if the algorithm cannot finalize the set, it will ask the user to decide. This is a significant improvement over many existing reverse engineering methods, although even this point of interaction might not be necessary for well-designed databases.

5.3.2 Enhancing Accuracy

The accuracy discussion in the previous subsection is based on worst-case scenarios. The main improvements can be made at the inclusion dependency detection step, as it is the first step where errors start creeping in. As discussed earlier, the errors in the final result are just the end product of errors in this step. We shall now present simple measures that can be taken to enhance the accuracy of the SE algorithm. In the intermediate user intervention step, if the domain expert makes intelligent decisions about the existence of the possible inclusion dependencies, ignoring the ratings,


the error rate can be decreased significantly. One additional experiment was to do exactly this, and the resulting error rate was almost zero in many of the above databases. Even if the user is not a domain expert, some obvious decisions definitely enhance the accuracy. For example, rejecting the inclusion dependency Company (ID) << Person (ID), even though the corresponding rating is high, is a common-sense decision and will certainly reduce errors. Another possible way to increase the correctness of the results is to use some kind of threshold. Sometimes the number of data values is very different in two tables and the subset relationship holds just by chance. This is mostly true when the data values in the corresponding columns are integers or integer ranges. For example, the ID column in the relation Task may contain values 1 to 100, while the ID column in the relation Project may contain only the values 1 and 2. This leads to an invalid possible inclusion dependency Project (ID) << Task (ID). To reject these kinds of inclusion dependencies, we can keep a dynamic threshold value: if the number of values in one column relative to the number of values in the other column is less than the threshold, then we can reject the dependency beforehand. Though this is helpful in many cases, it can result in the rejection of some valid inclusion dependencies. The effect of this improvement is not completely defined, and the experiments did not show any definite reduction of errors, but the procedure can be tweaked to obtain a reduced error rate in the majority of cases.
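As an illustration of this data-driven check, the sketch below tests whether the values of R.A form a subset of the values of S.B and applies a simple cardinality-ratio threshold before accepting the hypothesis; the SQL shape, method name, and threshold parameter are illustrative choices, not the prototype's exact implementation.

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Illustrative data-based check for a hypothesized inclusion dependency
// R.A << S.B, with a cardinality-ratio threshold to filter out dependencies
// that hold only by chance (e.g., tiny integer domains).
public class SubsetCheckSketch {

    public static boolean holds(Connection conn, String r, String a, String s, String b,
                                double minRatio) throws SQLException {
        try (Statement st = conn.createStatement()) {
            // 1. Subset test: no non-null value of R.A may be missing from S.B.
            ResultSet rs = st.executeQuery(
                "SELECT COUNT(*) FROM " + r + " WHERE " + a + " IS NOT NULL AND " + a +
                " NOT IN (SELECT " + b + " FROM " + s + " WHERE " + b + " IS NOT NULL)");
            rs.next();
            if (rs.getInt(1) > 0) {
                return false;
            }

            // 2. Threshold test: compare the number of distinct values on each side.
            rs = st.executeQuery("SELECT COUNT(DISTINCT " + a + ") FROM " + r);
            rs.next();
            int lhs = rs.getInt(1);
            rs = st.executeQuery("SELECT COUNT(DISTINCT " + b + ") FROM " + s);
            rs.next();
            int rhs = rs.getInt(1);

            // Reject dependencies where the left-hand side covers only a tiny
            // fraction of the right-hand side's values.
            return rhs > 0 && ((double) lhs / rhs) >= minRatio;
        }
    }
}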


In this section, we presented a discussion of the experimental results and the related analysis. The results are highly accurate and have been obtained with minimal user intervention, and accuracy can be further enhanced by simple additional methods. The next section summarizes the main contributions of the schema extraction algorithm and provides valuable insights on future enhancements at the algorithmic level. It also provides an overall summary of this thesis.


CHAPTER 6
CONCLUSION

Schema extraction and knowledge discovery from various database systems have been an important and exciting topic of research for more than two decades now. Despite all the efforts, a truly comprehensive solution for the database reverse engineering problem is still elusive. Several proposals that approach this problem have been made under the assumption that the data sources are well known and understood. Much of the substantial work on this problem also remains theoretical, with very few implemented systems. Also, many authors suggest semi-automatic methods to identify database contents, structures, relationships, and capabilities; however, there has been much less work in the area of fully automatic discovery of database properties. The goal of this thesis is to provide a general solution for the database reverse engineering problem. Our algorithm studies the data source using a call-level interface and extracts information that is explicitly or implicitly present. This information is documented and can be used for various purposes such as wrapper generation, forward engineering, system documentation efforts, etc. We have manually tested our approach for a number of scenarios and domains (including construction, manufacturing, and health care) to validate our knowledge extraction algorithm and to estimate how much user input is required. The following section lists the contributions of this work, and the last section discusses possible future enhancements.


6.1 Contributions

The most important contributions of this work are the following. First, a broad survey of existing database reverse engineering approaches was presented. This overview not only provides knowledge of the different approaches, but also provided significant guidance while developing our SE algorithm. The second major contribution is the design and implementation of a relational database reverse engineering algorithm that puts minimal restrictions on the source, is as general as possible, and extracts as much information as it can from all available resources with minimal external intervention. Third, a new and different approach to proposing and finalizing the set of inclusion dependencies in the underlying database is presented. The fourth contribution is the idea of using every available source for the knowledge extraction process; giving importance to the application code and the data instances is vital. Finally, developing the formula for measuring the complexity of database schemas is also an important contribution. This formula, which is based on the experimental results generated by our prototype, can be utilized for similar purposes in various applications. One of the more significant aspects of the prototype we have built is that it is highly automatic and does not require human intervention except in one phase, when the user might be asked to finalize the set of inclusion dependencies. The system is also easy to use, and the results are well documented. Another vital feature is the choice of tools. The implementation is in Java, due to its popularity and portability. The prototype uses XML (which has become the primary standard for data storage and manipulation) as our main representation and documentation language. Finally, though we have tested our approach only on Oracle,


MS-Access, and MS-Project data sources, the prototype is general enough to work with other relational data sources, including Sybase, MS-SQL Server, and IBM DB2. Though the experimental results of the SE prototype are highly encouraging and its development in the context of wrapper generation and the knowledge extraction module in SEEK is extremely valuable, there are some shortcomings in the current approach, and the process of knowledge extraction from databases can be enhanced with some future work. The following subsection discusses some limitations of the current SE algorithm, and Section 6.3 presents possible future enhancements.

6.2 Limitations

6.2.1 Normal Form of the Input Database

Currently, the SE prototype does not put any restriction on the normal form of the input database. However, if it is in first or second normal form, some of the implicit concepts might get extracted as composite objects. The SE algorithm does not fail on 2NF relations, but it does not explicitly discover all hidden relationships, although this information is implicitly present in the form of attribute names and values (e.g., cityname and citycode are preserved as attributes of the Employee relation in the following example). Consider the following example:

Employee (SSN, name, cityname, citycode)
Project (ProjID, ProjName)
Assignment (SSN, ProjID, startdate)

In the relation Employee, the attribute citycode depends on the attribute cityname, which is not a primary key. So there is a transitive dependency present in the relation, and hence it is in 2NF. Now if we run the SE algorithm over this schema, it first extracts table


names and attribute names as above. Then it finds the set of inclusion dependencies as follows:

Assignment.SSN << Employee.SSN
Assignment.ProjID << Project.ProjID

The SE algorithm classifies relations (Employee and Project as strong relations and Assignment as a regular relation) and attributes. Finally, it identifies Employee and Project as strong entities and Assignment as an M:N relationship between them. The dependency of citycode on cityname is not identified as a separate relationship. To explicitly extract all the objects and relationships, the schema should ideally be in 3NF. This limitation could be removed by extracting functional dependencies (such as citycode << cityname) from the schema and converting the schema into 3NF before starting the extraction process. However, any decision about the normal form of a legacy database is difficult to make; one cannot easily deduce whether the database is in 2NF or 3NF. Also, it may not be really useful for us to extract such extra relationships explicitly.

6.2.2 Meanings and Names for the Discovered Structures

Although the SE algorithm extracts all the concepts (e.g., entities, relationships) modeled by the underlying relational database, it falls short in assigning proper semantic meanings to some of the concepts. Semantic Analysis may provide important clues in this regard, but how useful they are depends largely on the quantity and quality of the code. It is difficult to extract a semantic meaning for every concept. Consider an example: if the SE algorithm identifies a regular binary relationship (1:N) between Resource and Availability, then it is difficult to provide a meaningful name for it. The SE algorithm gives the name relates_to in this case, which is very general.


6.2.3 Adaptability to the Data Source

Ideally, the algorithm should adapt successfully to any input database. However, the accuracy of the results generated by the SE algorithm depends to some extent on the quality of the database design, which includes a proper naming scheme, rich datatypes and the size of the schema. The experimental results and the schema complexity measure discussed earlier confirm this. Although it is very difficult to attain high accuracy levels for a broad range of databases, integrating the SE algorithm with machine learning approaches might help the extraction process achieve at least a minimal level of completeness and accuracy.

6.3 Future Work

6.3.1 Situational Knowledge Extraction

Many scheduling applications, such as MS-Project or Primavera, have a closed interface. Business rules, constraints or formulae are generally not found in the code written for these applications, since the application itself contains many rules. For example, in the case of the MS-Project software, we can successfully access the underlying database, which allows us to extract accurate but only limited information. Our current schema extraction process (part of DRE) extracts knowledge about the entities and relationships from the underlying database. Some additional but valuable information can be extracted by inspecting the data values stored in the tuples of each relation. This information can be used to influence decisions in the analysis module. For example, in the construction industry, the detection of a float in the project schedule might prompt rescheduling or stalling of activities. In warfare, the detection of the current location and the number of infantry might prompt a change of plans on a particular front.


Since this knowledge is based on current data values, we call it situational knowledge (sk). Situational knowledge is different from business rules or factual knowledge because the deduction is based on current data that can change over time. Some important points to consider:

1. Extraction of sk has to be initiated and verified by a domain expert. An ontology or generalization of terms must be available to guide this process.

2. The usefulness of this kind of knowledge is even greater in those cases where business rule mining from application source code is not possible.

3. It is also crucial to understand what sort of situational knowledge might be extracted with respect to a particular application domain before thinking about the details of the discovery process. This provides insights into the possible ways in which a domain expert can query the discovery tool and helps the designer of the tool to classify these queries and finalize the responses.

We classify sk into four broad categories:

1. Situational Knowledge Explicitly Stored in the Database (Simple Lookup Knowledge): This type of sk can be easily extracted by one or more simple lookups and involves no calculation or computation. Extraction can be done through database queries, e.g., who is on that activity? (i.e., find the resource assigned to the activity) or what percentage of work on a particular activity has been completed?

2. Situational Knowledge Obtained through a Combination of Lookups and Computation: This type of sk extraction involves database lookups combined with computations (arithmetic operations). Some arithmetic operations (such as summation and average) can be predefined, with the attribute names used as the parameters taking part in these operations, e.g., what is the sum of the durations of all activities in a project? Or what is the productivity of an activity as a function of the activity duration and the units of resource assigned?

3. Situational Knowledge Obtained through a Comparison between Two Inputs: A request for a comparison between two terms can be made; the response will be in the form of two sets containing relevant information about these terms that can


be used to compare them. This may involve lookups and calculations, e.g., compare the project duration with the sum of the durations of all the activities in that project, or compare the skill levels of resources working on different activities.

4. Complex Situational Knowledge: This is the most complex type of sk extraction and can involve lookups, calculations and comparisons. To extract this kind of knowledge, one has to provide a definite procedure or an algorithm, e.g., find a float in the schedule, or find the overall project status and estimate the finish date.

As discussed above, the situational knowledge discovery process is initiated by a query from the domain expert, which assures a relatively constrained search on a specific subset of the data. However, the domain expert may not be conversant with the exact nature of the underlying database, including its schema or low-level primitive data. The discovery tool should therefore consist of an intuitive GUI for an interactive and flexible discovery process, a query transformation process, and a response finalizing and formatting process. A simple yet robust user interface is very important in order to specify the various parameters easily. The query transformation process essentially involves translating the high-level concepts provided by the user into the low-level primitive concepts used by the database. The response finalizing process may involve answering the query in an intelligent way, i.e., by providing more information than was initially requested.

The architecture of the discovery system for extracting situational knowledge from the database consists of the relational database, a concept hierarchy, generalized rules, a query transformation and re-writing system, and a response finalization system. The concept hierarchies can be prepared by organizing different levels of concepts into a taxonomy or ontology. A concept hierarchy is always related to a specific attribute and is partially ordered from general to specific. The knowledge about these hierarchies


must be given by domain experts. Some researchers have also tried to generate or refine such hierarchies semi-automatically [30], but that is beyond the scope of this thesis. Generalization rules summarize the regularities of the data at a high level. As there is usually a large set of rules extracted from any interesting subset, it is unrealistic to store all of them; however, it is important to store some rules based on the frequency of inquiries. Incoming queries can be classified as high-level or low-level based on the names of their parameters. The queries can also be classified as data queries, which are used to find concrete data stored in the database, or knowledge queries, which are used to find rules or constraints. Furthermore, the response may include the exact answer, the addition of some related attributes, information about some similar tuples, etc.

Such a discovery tool should be based on data mining methods. Data mining was considered one of the most important research topics of the 1990s by both machine learning and database researchers [56]. Various techniques have been developed for knowledge discovery, including generalization, clustering, data summarization, rule discovery, query re-writing, deduction, associations, multi-layered databases, etc. [6, 18, 29]. One intuitive generalization-based approach for intelligent query answering in general, and for situational knowledge in particular, is based on the well-known data mining technique called attribute-oriented induction described in Han et al. [29]. This approach provides an efficient way to extract generalized data from the actual data by generalizing the attributes in the task-relevant data set and deducing certain situational conclusions depending on the data values in those attributes.

The situational knowledge in SEEK can be considered an additional layer on top of the knowledge extracted by the DRE module. This kind of knowledge can be used to


guide the analysis module in taking certain decisions, but it cannot be extracted automatically without a domain expert. This system may be integrated in SEEK as follows:

1. Create a large set of queries that are used on a regular basis to find the status in every application domain.

2. Create the concept hierarchy for the relevant data set.

3. After the DRE process, execute these queries and record all the responses.

4. The queries and their corresponding responses can be represented as simple strings or can eventually be added to a general knowledge representation.

Appendix D describes a detailed example of the sk extraction process.

6.3.2 Improvements in the Algorithm

Currently our schema extraction algorithm does not place any restriction on the normal form of the input database. However, if the database is in 1NF or 2NF, then some of the implicit concepts might be extracted as composite objects. To make schema extraction more efficient and accurate, the SE algorithm could extract and study functional dependencies; a minimal sketch of such a data-driven dependency check is given below. This would ensure that all the implicit structures can be extracted from the database regardless of its normal form. Another area of improvement is knowledge representation. It is important to leverage existing technology, or to develop our own model, to effectively represent the variety of knowledge extracted in the process. It will be especially interesting to study the representation of business rules, constraints and arithmetic formulae. Finally, although DRE is a build-time process, it will be interesting to conduct performance analysis experiments, especially for large data sources, and make the prototype more efficient.
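To make this direction concrete, the following query sketch, written in the same template style as the subset test of Appendix C, checks whether the candidate functional dependency cityname → citycode from the example in Section 6.2.1 holds in the current data. It is only an illustration of a possible check, not part of the current prototype; the derived-table alias violations is arbitrary.

F1 = SELECT count (*)
     FROM (SELECT cityname
           FROM Employee
           GROUP BY cityname
           HAVING count (DISTINCT citycode) > 1) violations;

If F1 is zero, no city name is associated with more than one city code, and the data thus supports the dependency cityname → citycode. As with the subset test, this is only instance-level evidence: a nonzero count refutes the dependency, whereas a zero count merely suggests it.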


6.3.3 Schema Extraction from Other Data Sources

A significant enhancement would be to extend the SE prototype to extract knowledge from multiple relational databases simultaneously or from completely non-relational data sources. Non-relational database systems include traditional network database systems and the relatively newer object-oriented database systems. An interesting topic of research would be to explore the extent to which information can be extracted from such data sources without human intervention. It would also be useful to develop the extraction process for distributed database systems on top of the existing prototype.

6.3.4 Machine Learning

Finally, machine learning techniques can be employed to make the SE prototype more adaptive. After some initial executions, the prototype could adjust its parameters to extract more accurate information from a particular data source in a highly optimized and efficient way. The method can integrate the machine learning paradigm [41], especially learning-from-examples techniques, to intelligently discover knowledge.


APPENDIX A
DTD DESCRIBING EXTRACTED KNOWLEDGE




APPENDIX B
SNAPSHOTS OF RESULTS.XML

The knowledge extracted from the database is encoded into an XML document. This resulting XML document contains information about every attribute of every relation in the relational schema, in addition to the conceptual schema and the semantic information. The resulting file is therefore too long to be displayed completely in this thesis. Instead, we provide snapshots of the document that highlight its important parts.

Figure B-1 The main structure of the XML document conforming to the DTD in Appendix A.

Figure B-2 The part of the XML document which lists business rules extracted from the code.


Figure B-3 The part of the XML document which lists business rules extracted from the code.


Figure B-4 The part of the XML document which describes the semantically rich E/R schema.


APPENDIX C
SUBSET TEST FOR INCLUSION DEPENDENCY DETECTION

We use the following subset test to determine whether there exists an inclusion dependency between attribute (or attribute set) U of relation R1 and attribute (or attribute set) V of R2. Note that U and V must have the same data type and must include the same number of attributes. Our test is based on the following SQL query templates, which are instantiated for the relations and their attributes and are run against the legacy source.

C1 = SELECT count (*) FROM R1 WHERE U not in (SELECT V FROM R2);
C2 = SELECT count (*) FROM R2 WHERE V not in (SELECT U FROM R1);

Figure C-1 Two queries for the subset test.

If C1 is zero, we can deduce that there may exist an inclusion dependency R1.U << R2.V; likewise, if C2 is zero, there may exist an inclusion dependency R2.V << R1.U. Note that it is possible for both C1 and C2 to be zero. In that case, we can conclude that the two sets of attributes U and V are equal.
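For illustration, instantiating these templates for the Assignment and Employee relations of the example in Section 6.2.1, with U = Assignment.SSN and V = Employee.SSN, yields the following queries; the relation and attribute names are taken from that example.

C1 = SELECT count (*) FROM Assignment WHERE SSN not in (SELECT SSN FROM Employee);
C2 = SELECT count (*) FROM Employee WHERE SSN not in (SELECT SSN FROM Assignment);

On a consistent database, C1 evaluates to zero, suggesting the inclusion dependency Assignment.SSN << Employee.SSN, whereas C2 is typically nonzero because there may be employees without any assignment.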


APPENDIX D
EXAMPLES OF THE SITUATIONAL KNOWLEDGE EXTRACTION PROCESS

Example 1:

Input: The following table depicts a task-relevant subset of the relation Assignment in a database.

Resource Name    Task Name             Duration (days)   Units   Cost ($)
Tim's crew       Demolish Old Roof     2                 1       3000
Tiles            Install New Roof      0                 1000    225
Brick            Install New Roof      0                 500     150
Barn's crew      Paint New Roof        10                1       5000
Cement           Repair Wall covering  0                 2000    500
Paint            Painting              0                 750     500
James's crew     Painting              2                 1       2500
Tim's crew       Install New Roof      4                 1       1500

Table No. 1

Assume we already have the following concept hierarchy:

{Demolish Old Roof, Install New Roof, Paint New Roof} → Roofing
{Repair Wall covering, Painting} → Renovation
{Roofing, Renovation} → ANY (Task)
{Tim's crew, Barn's crew, James's crew} → Labor
{Tiles, Bricks, Cement, Paint} → Material
{Labor, Material} → ANY (Resource)
{1-100} → small
{101-1000} → medium
{>1000} → high
{small, medium, high} → ANY (Cost)


NOTE: A → B indicates that B is a generalized concept of A.

Consider that a supervisor in this firm wants to check the resources assigned to the roofing activities and the cost involved:

retrieve Resource Name and Cost from Assignment where Task Name = 'Roofing'

This query can be translated using the concept hierarchy as follows:

select Resource Name, Cost from Assignment
where Task Name = 'Demolish Old Roof' or Task Name = 'Install New Roof' or Task Name = 'Paint New Roof'

The response can be simple, i.e., providing the names and the related costs of the resources assigned to the three tasks. The system may also provide an intelligent response. In this case, the intelligent response might contain the corresponding durations or a classification according to the type of resources. It might also contain information about the resources assigned to another task such as Renovation.

Another query might be used to compare two things. For example,

compare
    summation { retrieve Duration from Assignment where Task Name = 'Roofing' }


with
    retrieve Duration from Project where Project Name = 'House#11'

This query results in two queries to the database. The first query finds the duration for the repairs of house number 11. The second query finds the total duration of the roofing-related activities in the database, which involves summing the durations from each row of the result set. The response enables the user to make important decisions about the status of the roofing with respect to the entire project (a possible SQL rendering of these two queries is sketched after the note below).

NOTE: The general format for the high-level queries used above can be manipulated and finalized by taking into account all possible queries. For more information the reader is referred to [28]. The relation names and attribute names in a high-level query can be taken from the domain ontology. These names need not be the same as their names in the database. In the query translation process, these names can be substituted with the actual names using the mappings. Some of the mappings are found by the semantic analysis process, while others can be given by a domain expert.
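As an illustration only, the two database queries produced for this comparison might look as follows in SQL. The relation and attribute names (Assignment, Project, TaskName, ProjectName, Duration) mirror the high-level names used in this example and would in practice be replaced by the actual legacy names through the mappings described in the note above.

-- total duration of all roofing-related activities, expanded via the concept hierarchy
SELECT sum (Duration)
FROM Assignment
WHERE TaskName = 'Demolish Old Roof'
   OR TaskName = 'Install New Roof'
   OR TaskName = 'Paint New Roof';

-- overall duration recorded for the project being compared against
SELECT Duration
FROM Project
WHERE ProjectName = 'House#11';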


Example 2:

The process of attribute-oriented induction is used to find generalized rules (as opposed to fixed rules or constraints) in the database.

Input: The following table depicts a portion of a data relation called Resource in a database.

Name          Category   Availability delay (in mins)   Max Units   Cost/unit ($)
Brick         M          8                              1000        3
Tiles         M          10                             1000        5
Tim's crew    L          30                             1           25
Bulldozer     E          60                             5           500
Cement        M          10                             1000        5
Barn's crew   L          40                             1           50

Table No. 1

Assume we already have the following concept hierarchy:

{M, E, L} → ANY (type)
{0-15} → very fast
{16-45} → fast
{46-100} → slow
{>100} → very slow
{very fast, fast, slow, very slow} → ANY (availability)
{1-10} → small
{11-50} → medium
{>50} → high
{small, medium, high} → ANY (Cost/unit)

NOTE: A → B indicates that B is a generalized concept of A; the threshold value is set to 5.


The query can be:

describe generalized rule from Resource about Name and Category and Availability Time and Cost

General Methodology:

1. Generalize on the smallest decomposable components of a data relation. A special attribute, vote, is added to every table initially and its value is set to 1 for every tuple. The vote of a tuple t represents the number of tuples in the initial data relation that were generalized to tuple t of the current relation.

2. If there is a large set of distinct values for an attribute but there is no high-level concept for the attribute, then that attribute should be removed.

3. Generalize the concepts one level at a time. Removal of redundant tuples is carried out with addition of their votes.

4. If the number of tuples of the target class in the final generalized relation exceeds the threshold, further generalization should be performed.

5. Transform the rules to CNF and DNF. However, Step 6 should be performed to add quantitative information to the rule.

6. The vote value of a tuple should be carried to its generalized tuple, and the votes should be accumulated in the preserved tuples when other identical tuples are removed. The rule with the vote value is presented to the user. The vote value gives the percentage of total records or tuples in the database that represent the same rule and is typically used for presenting a quantitative rule.

Example:

1. We first remove the attribute Max Units, as the expected generalized rule does not take that attribute into account.

2. Since there is a large set of distinct values for the Name attribute but no high-level concept for it, the Name attribute is removed as well.


Category   Availability time   Cost/unit   Vote
M          8                   3           1
M          10                  5           1
L          30                  25          1
E          60                  500         1
M          10                  5           1
L          40                  50          1

Table No. 2

3. Generalize the concepts one level at a time. Doing this step multiple times gives the following table:

Category   Availability time   Cost/unit   Vote
M          Very fast           Small       1
M          Very fast           Small       1
L          Fast                Medium      1
E          Slow                High        1
M          Very fast           Small       1
L          Fast                Medium      1

Table No. 3

4. Removal of redundant tuples yields the following table:

Category   Availability time   Cost/unit   Vote
M          Very fast           Small       3
L          Fast                Medium      2
E          Slow                High        1

Table No. 4

5. If the number of tuples of the target class in the final generalized relation exceeds the threshold, further generalization should be performed. Removal of redundant tuples should result in fewer tuples than the threshold. We have set our threshold to 5, and there are no more than five tuples here, so no further generalization is required.

6. The vote value of a given tuple is carried to its generalized tuple, and the votes are accumulated in the preserved tuples when other identical tuples are removed. The initial vote values are shown in Table No. 2; after generalization and removal of redundant tuples, the final vote values are shown in Table No. 4.


7. Present the rule to the user. One of the final rules can be given in English as follows: "Among all the resources in the firm, 50% are materials whose availability time is very fast and whose cost per unit is small." A possible SQL rendering of the generalization and vote accumulation for this example is sketched below.
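As a sketch of how the generalization and vote accumulation of steps 3 through 6 could be expressed in a single query, the following SQL maps each tuple of Resource onto its concept-hierarchy level and counts the merged tuples as votes. The column names AvailabilityDelay and CostPerUnit are hypothetical stand-ins for the Availability delay and Cost/unit attributes of Table No. 1; this is an illustration of the idea, not part of the prototype.

SELECT Category,
       CASE WHEN AvailabilityDelay <= 15  THEN 'very fast'
            WHEN AvailabilityDelay <= 45  THEN 'fast'
            WHEN AvailabilityDelay <= 100 THEN 'slow'
            ELSE 'very slow' END AS Availability,
       CASE WHEN CostPerUnit <= 10 THEN 'small'
            WHEN CostPerUnit <= 50 THEN 'medium'
            ELSE 'high' END AS Cost,
       count (*) AS Vote            -- accumulated votes of the merged tuples
FROM Resource
GROUP BY Category,
         CASE WHEN AvailabilityDelay <= 15  THEN 'very fast'
              WHEN AvailabilityDelay <= 45  THEN 'fast'
              WHEN AvailabilityDelay <= 100 THEN 'slow'
              ELSE 'very slow' END,
         CASE WHEN CostPerUnit <= 10 THEN 'small'
              WHEN CostPerUnit <= 50 THEN 'medium'
              ELSE 'high' END;

Run against the sample data, such a query would directly produce the generalized relation of Table No. 4, with the Vote column holding the accumulated counts.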


LIST OF REFERENCES [1] P. Aiken, Data Reverse Engineering: Slaying the Legacy Dragon, McGraw-Hill, New York, NY, 1997. [2] N. Ashish, and C. Knoblock, Wrapper Generation for Semi-structured Internet Sources, Proc. Intl Workshop on Management of Semistructured Data, ACM Press, New York, NY, pp. 160-169, 1997. [3] G. Ballard and G. Howell, Shielding Production: An Essential Step in Production Control, Journal of Construction Engineering and Management, ASCE, vol. 124, no. 1, pp. 11-17, 1997. [4] R. Bayardo, W. Bohrer, R. Brice, A. Cichocki, G. Fowler, A. Helal, V. Kashyap, T. Ksiezyk, G. Martin, M. Nodine, M. Rashid, M. Rusinkiewicz, R. Shea, C. Unnikrishnan, A. Unruh, and D. Woelk, Semantic Integration of Information in Open and Dynamic Environments, Proc. SIGMOD Intl Conf., ACM Press, New York, pp. 195-206, 1997. [5] D. Boulanger and S. T. March, An Approach to Analyzing the Information Content of Existing Databases, Database, vol. 20, no. 2, pp. 1-8, 1989. [6] R. Brachman and T. Anand, The Process of KDD: A Human Centered Approach, AAAI/MIT Press, Menlo Park, CA, 1996. [7] M. E. Califf, and R. J. Mooney, Relational Learning of Pattern-match Rules for Information Extraction, Proc. AAAI Symposium on Applying Machine Learning to Discourse Processing, AAAI Press, Menlo Park, CA, pp. 9-15, 1998. [8] M. A. Casanova and J. E. A. d. Sa, Designing Entity-Relationship Schemas for Conventional Information Systems, Proc. Intl Conf. Entity-Relationship Approach, C. G. Davis, S. Jajodia, P. A. Ng, and R. T. Yeh, North-Holland, Anaheim, CA, pp. 265-277, 1983. [9] R. H. L. Chiang, A Knowledge-based System for Performing Reverse Engineering of Relational Database, Decision Support Systems, vol. 13, pp. 295-312, 1995. [10] R. H. L. Chiang, T. M. Barron, and V. C. Storey, Reverse engineering of Relational Databases: Extraction of an EER Model from a Relational Database, Data and Knowledge Engineering, vol. 12, pp. 107-142, 1994. 99


100 [11] E. J. Chikofsky, Reverse Engineering and Design Recovery: A Taxonomy, IEEE Software, vol. 7, pp. 13-17, 1990. [12] K. H. Davis and P. Aiken, Data Reverse Engineering: A Historical Survey, Proc. IEEE Working Conference on Reverse Engineering, IEEE CS Press, Brisbane, pp. 70-78, 2000. [13] K. H. Davis and A. K. Arora, Converting a Relational Database Model into an Entity-Relationship Model, Proc. Intl Conf. on Entity-Relationship Approach, S. T. March, North-Holland, New York, pp. 271-285, 1987. [14] K. H. Davis and A. K. Arora, Methodology for Translating a Conventional File System into an Entity-Relationship Model, Proc. Intl Conf. Entity-Relationship Approach, P. P. Chen, IEEE Computer Society and North-Holland, Chicago, IL, pp. 148-159, 1985. [15] H. Dayani-Fard and I. Jurisica, Reverse Engineering: A history Where we've been and what we've done, Proc. 5 th IEEE Working Conference on Reverse Engineering, IEEE CS Press, Honolulu, pp. 174-182, 1998. [16] B. Dunkel and N Soparker, System for KDD: From Concepts to Practice, Future Generation Computer System, vol. 13, 231-242, 1997. [17] Elseware SA Ltd., STORM data mining suite, http://www.storm-central.com 2000. Accessed July 22, 2002. [18] U. Fayyad, G. Piatesky-Shapiro, P. Smith, and R. Uthurasamy, Advances In Knowledge Discovery, AAAI Press/MIT Press, Menlo Park, CA, 1995. [19] B. Grosof, Y. Labrou, H.Chan, A Declarative Approach to Business Rules in Contracts Courteous Logic Programs in XML, Proc. ACM Conf. E-Commerce, ACM Press, New York, NY, pp. 68-77, 1999. [20] J. R. Gruser, L. Raschid, M. E. Vidal, and L. Bright, Wrapper Generation for Web Accessible Data Sources, Proc. 3 rd Intl Conf. Cooperative Information Systems, IEEE CS Press, New York, NY, pp. 14-23, 1998. [21] J. L. Hainaut, Database Reverse Engineering: Models, Techniques, and Strategies, Proc. Intl Conf. Entity-Relationship Approach, T. J. Teorey, ER Institute, San Mateo, CA, pp. 729-741, 1991. [22] J. Hammer, W. O'Brien, R. R. Issa, M. S. Schmalz, J. Geunes, and S. X. Bai, SEEK: Accomplishing Enterprise Information Integration Across Heterogeneous Sources, Journal of Information Technology in Construction, vol. 7, no. 2, pp. 101-123, 2002.


101 [23] J. Hammer, M. Schmalz, W. OBrien, S. Shekar, and N. Haldavnekar, SEEKing Knowledge in Legacy Information Systems to Support Interoperability, CISE Technical Report, CISE TR02-008, University of Florida, 2002. [24] J. Hammer, M. Schmalz, W. OBrien, S. Shekar, and N. Haldavnekar, SEEKing Knowledge in Legacy Information Systems to Support Interoperability, Proc. Intl Workshop on Ontologies and Semantic Interoperability, AAAI Press, Lyon, pp. 67 2002. [25] J. Hammer, H. Garcia-Molina, J. Cho, R. Aranha, and A. Crespo, Extracting Semistructured Information from the Web, Proc. Intl Workshop Management of Semistructured Data, ACM Press, New York, NY, pp. 18-25, 1997. [26] J. Hammer, H. Garcia-Molina, Y. Papakonstantinou, J. Ullman, and J. Widom, Integrating and Accessing Heterogeneous Information Sources in TSIMMIS, Proc. AAAI Symposium on Information Gathering, AAAI Press, Stanford, CA, pp. 61-64, 1995. [27] J. Han, J. Chiang, S. Chee, J. Chen, Q. Chen, S. Cheng, W. Gong, M. Kamber, K. Koperski, G. Liu, Y. Lu, N. Stefanovic, L. Winstone, B. Xia, O. R. Zaiane, S. Zhang, and H. Zhu, DBMiner: A System for Data Mining in Relational Databases and Data Warehouses, Proc. Int'l Conf. Data Mining and Knowledge Discovery, AAAI Press, Newport Beach, CA, pp. 250-255, 1997. [28] J. Han, Y. Huang, N. Cercone, and Y. Fu, Intelligent Query Answering by Knowledge Discovery Techniques, IEEE Transactions on Knowledge and Data Engineering, vol. 8, no. 3, pp. 373-390, 1996. [29] J. Han, Y. Cai and N. Cercone, Data-Driven Discovery of Quantitative Rules in Relational Databases, IEEE Transactions on Knowledge and Data Engineering, vol. 5, no. 1, pp. 29-40, 1993. [30] J. Han and Y. Fu, Dynamic Generation and Refinement of Concept Hierarchies for Knowledge Discovery in Databases, Proc. AAAI Workshop on Knowledge Discovery in Databases (KDD'94), AAAI Press, Seattle, WA, pp. 157-168, 1994. [31] J. Hensley and K. H. Davis, Gaining Domain Knowledge while Data Reverse Engineering: An Experience Report, Proc. Data Reverse Engineering Workshop, Euro Reengineering Forum, IEEE CS Press, Brisbane, pp. 100-105, 2000. [32] S. Horwitz and T. Reps, The Use of Program Dependence Graphs in Software Engineering, Proc. 14 th Intl Conf. Software Engineering, ACM Press, New York, NY, pp. 392-411, 1992.


102 [33] L. P. Jesus and P. Sousa, Selection of Reverse Engineering Methods for Relational Databases", Proc. Intl Conf. Software Maintenance and Reengineering, ACM Press, New York, NY, pp. 194-197, 1999. [34] P. Johannesson, A Method for Transforming Relational Schemas into Conceptual Schemas, Proc. IEEE Intl Conf. Data Engineering, ACM Press, New York, NY, pp. 190-201, 1994. [35] M. Kamber, L. Winstone, W. Gong, S. Cheng, and J. Han, Generalization and Decision Tree Induction: Efficient Classification in Data Mining, Proc. Intl Workshop Research Issues on Data Engineering (RIDE), IEEE CS Press, Birmingham, pp. 111-120, 1997. [36] C. Klug, Entity-Relationship Views over Uninterpreted Enterprise Schemas, Proc. Intl Conf. Entity-Relationship Approach, P. P. Chen, North-Holland, Los Angeles, CA, pp. 39-60, 1979. [37] L. Koskela and R. Vrijhoef, Roles of Supply Chain Management in Construction, Proc. 7 th Annual Intl Conf. Group for Lean Construction, U.C. Press, Berkley, CA, pp. 133-146, 1999. [38] J. Larson and A. Sheth, Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases, ACM Computing Surveys, vol. 22, no. 3, pp. 183-236, 1991. [39] V. M. Markowitz and J. A. Makowsky, Identifying Extended Entity-Relationship Object Structures in Relational Schemas, IEEE Transactions on Software Engineering, vol. 16, pp. 777-790, 1990. [40] M. A. Melkanoff and C. Zaniolo, Decomposition of Relations and Synthesis of Entity-Relationship Diagrams, Proc. Intl Conf. Entity-Relationship Approach, P. P. Chen, North-Holland, Los Angeles, CA, pp. 277-294, 1979. [41] R. S. Michalski, Machine Learning: An Artificial Intelligence Approach, Morgan Kaufmann Publishers, San Mateo, CA, 1986. [42] Microsoft Corp., Microsoft Project 2000 Database Design Diagram, http://www.microsoft.com/office/project/prk/2000/Download/VisioHTM/P9_dbd _frame.htm 2000. Accessed August 15, 2001. [43] C. H. Moh, E. P. Lim, and W. K. Ng, Re-engineering Structures from Web Documents, Proc. ACM Intl Conf. Digital Libraries, ACM Press, New York, NY, pp. 58-71, 2000.


103 [44] P. Muntz, P. Aiken, and R. Richards, DoD Legacy Systems: Reverse Engineering Data Requirements, Communications of the ACM, vol. 37, pp. 26-41, 1994. [45] S. B. Navathe, C. Batini, and S. Ceri, Conceptual Database Design An Entity Relationship Approach, Benjamin-Cummings Publishing Co., Redwood City, CA, 1992. [46] S. Nestorov, J. Hammer, M. Breunig, H. Garcia-Molina, V. Vassalos, and R. Yerneni, Template-Based Wrappers in the TSIMMIS System, ACM SIGMOD Intl. Conf. on Management of Data, ACM Press, New York, NY, pp. 532-535, 1997. [47] W. O'Brien, M. A. Fischer, and J. V. Jucker, An Economic View of Project Coordination, Journal of Construction Management and Economics, vol. 13, no. 5, pp. 393-400, 1995. [48] Oracle Corp., Data mining suite (formerly Darwin), http://technet.oracle.com/products/datamining/listing.htm Accessed July 22, 2002. [49] Y. Papakonstantinou, A. Gupta, H. Garcia-Molina, and J. Ullman, A Query Translation Scheme for Rapid Implementation of Wrappers, Proc. 4 th Intl Conf. Deductive and Object-Oriented Databases, T. W. Ling, A. O. Mendelzon, and L. Vieille, Lecture Notes in Computer Science, Springer Press, Singapore, pp. 55-62, 1995. [50] S. Paul and A. Prakash, A Framework for Source Code Search Using Program Patterns, Software Engineering Journal, vol. 20, pp. 463-475, 1994. [51] J. M. Petit, F. Toumani, J. F. Boulicaut, and J. Kouloumdjian, Towards the Reverse Engineering of Denormalized Relational Databases, Proc Intl Conf. Data Engineering, ACM Press, New York, NY, pp. 218-229, 1996. [52] W.J. Premerlani, M. Blaha, An Approach for Reverse Engineering of Relational Databases, CACM, vol. 37, no. 5, pp. 42-49, 1994. [53] Sahuguet, and F. Azavant, W4F: a WysiWyg Web Wrapper Factory, http://db.cis.upenn.edu/DL/wapi.pdf Penn Database Research Group Technical Report, University of Pennsylvania, 1998. Accessed September 21, 2001. [54] J. Shao and C. Pound, Reverse Engineering Business Rules from Legacy System, BT Journal, vol. 17, no. 4, pp. 179-186, 1999.


104 [55] O. Signore, M. Loffredo, M. Gregori, and M. Cima, Using Procedural Patterns in Abstracting Relational Schemata, Proc. IEEE 3 rd Workshop on Program Comprehension, IEEE CS Press, Washington D.C., pp. 169-176, 1994. [56] M. StoneBreaker, R.Agrawal, U.Dayal, E. Neuhold and A. Reuter, DBMS Research at Crossroads: The Vienna update, Proc. 19 th Intl Conf. Very Large Data Bases, R. Agrawal, S. Baker, and D. A. Bell, Morgan Kaufmann, Dublin, pp. 688-692, 1993. [57] Sun Microsystems Corp., JDBC Data Access API: Drivers, http://industry.java.sun.com/products/jdbc/drivers Accessed January 10, 2002. [58] D. S. Weld, N. Kushmerick, and R. B. Doorenbos, Wrapper Induction for Information Extraction, Proc. Intl Joint Conf. Artificial Intelligence (IJCAI), AAAI Press, Nagoya, vol. 1, pp. 729-737, 1997. [59] T. Wiggerts, H. Bosma, and E. Fielt, Scenarios for the Identification of Objects in Legacy Systems, Proc. 4 th IEEE Working Conference on Reverse Engineering, I. D. Baxter, A. Quilici, and C. Verhoef, IEEE CS Press, Amsterdam, pp. 24-32, 1997. [60] World Wide Web Consortium, eXtensible Markup Language (XML), http://www.w3c.org/XML/ 1997. Accessed March 19, 2001. [61] World Wide Web Consortium, Resource Description Framework, http://www.w3c.org/RDF/ 2000. Accessed March 21, 2002. [62] World Wide Web Consortium, Semantic Web, http://www.w3.org/2001/sw/ 2001. Accessed October 19, 2001. [63] World Wide Web Consortium, W3C Math Home, http://www.w3c.org/Math/ 2001. Accessed March 21, 2002. [64] M.J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li, New algorithms for Fast Discovery of Association Rules, Proc. Intl. Conf. Knowledge Discovery and Data Mining, AAAI Press, Newport Beach, CA, pp. 283-286, 1997.


BIOGRAPHICAL SKETCH

Nikhil Haldavnekar was born on January 2, 1979, in Mumbai (Bombay), India. He received his Bachelor of Engineering degree in computer science from the VES Institute of Technology, affiliated with Mumbai University, India, in August 2000. He joined the Department of Computer and Information Science and Engineering at the University of Florida in fall 2000, where he worked as a research assistant under Dr. Joachim Hammer and was a member of the Database Systems Research and Development Center. He received a Certificate of Achievement for Outstanding Academic Accomplishment from the University of Florida. He completed his Master of Science degree in computer engineering at the University of Florida, Gainesville, in December 2002. His research interests include database systems, Internet technologies and mobile computing.