Title Page
 Table of Contents

Group Title: Department of Computer and Information Science and Engineering Technical Reports
Title: Heterogeneous databases : inferring relationships for merging component schemas, and a query language
Full Citation
Permanent Link: http://ufdc.ufl.edu/UF00095163/00001
 Material Information
Title: Heterogeneous databases : inferring relationships for merging component schemas, and a query language
Series Title: Department of Computer and Information Science and Engineering Technical Report ; 92-048
Physical Description: Book
Language: English
Creator: Whang, W. K.
Chakravarthy, Sharma
Navathe, S. B.
Publisher: Department of Computer and Information Sciences, University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: December, 1992
Copyright Date: 1992
 Record Information
Bibliographic ID: UF00095163
Volume ID: VID00001
Source Institution: University of Florida
Holding Location: University of Florida
Rights Management: All rights reserved by the source institution and holding location.


This item has the following downloads:

199279 ( PDF )

Table of Contents
    Title Page
        Page i
    Table of Contents
        Page ii
        Page 1
        Page 2
        Page 3
        Page 4
        Page 5
        Page 6
        Page 7
        Page 8
        Page 9
        Page 10
        Page 11
        Page 12
        Page 13
        Page 14
        Page 15
        Page 16
        Page 17
        Page 18
        Page 19
        Page 20
        Page 21
        Page 22
        Page 23
        Page 24
Full Text

University of Florida
Computer and Information Sciences

... Department of Computer and Information Sciences
_ ^ Computer Science Engineering Building
SUniversity of Florida
Gainesville, Florida 32611

V ..

Heterogeneous Databases: Inferring
Relationships for Merging Component
Schemas, and a Query Language

W. K. Whang
S. Chakravarthy
S. B. Navathe
email: sharma@snapper.cis.ufl.edu

Tech. Report UF-CIS-TR-92-048
December 1992
(Submitted for publication)


1 Introduction 2

2 Schema Integration Process 3
2.1 Data Model Conversion .................. . . . . . .... 4
2.2 Clustering of Related Entities ............... . . . . . .... 4
2.3 Identification of Relationships ............... . . . . . .... 5
2.4 M erging . . . . . . . . . . . . . . . . . . . ... 5

3 Relationships Among the Elements in a Cluster 5
3.1 Algorithm .............. ..... ...... .......... ...... 8
3.2 A n Exam ple . . . . . . . . . . . . . . .. . .. . . . 10

4 Global Schema and An Extended SQL For Querying 13
4.1 The Global Relational View ............... . . . . . ...... 13
4.2 ESQL as a Query Language ............... . . . . . ...... 15

5 Implementation 19

6 Conclusions 22

Heterogeneous Databases: Inferring Relationships for Merging
Component Schemas, and a Query Language*

Whan-Kyu Whang

Artificial Intelligence Section
Computer Research Department
ETRI, P.O.Box 8, Daeduck Science Town
Daejeon, ..11.-I.ii, KOREA
email: wkwhang@hama.etri.re.kr

Sharma Chakravarthyt

Database Systems Research and Development Center
Computer and Information Sciences Department
University of Florida, Gainesville, FL 32611
email: sharma@snapper.cis.ufl.edu

Shamkant B. Navathe

College of Computing
Georgia Institute of Technology
Atlanta, Georgia 30332
email: sham@cc.gatech.edu

December 30, 1992


The need for accessing independently developed database management systems using a uni-
fied conceptual view has been well-recognized. View or schema integration is an important
component of this problem of designing heterogeneous or federated database management sys-
In this paper, we address the problems of: i) identification of relationships among relations
grouped into a cluster, ii) checking the consistency of specified relationships, and iii) exhaus-
tively deriving new relationships from the ones specified -useful for merging relations from
component databases into a global view relation. Specifically, five types of relationships (equal,
contains, contained-in, overlap and disjoint) are considered. Merging of component schemas in a
federated database environment can benefit substantially from the knowledge of the relationship
between relations in a cluster. We present an algorithm to check the consistency of asserted
relationships and to derive all possible relationships from the given set of known relationships.
To support the approach taken for schema integration in this paper, we extend the relational
query language SQL to support set operations on constituent relations. The set operations in
ESQL (extended SQL proposed in this paper) are expressed nonprocedurally in a single state-
ment, unlike conventional SQL which requires blocks of code and set operations separating the
Index Terms: Heterogeneous database management systems, Extended SQL, Relationship
between entities, Inferring new relationships, Schema integration, Prolog implementation.

*This work was supported by a grant from U.S. West Corporation.
tAll communication regarding this paper should be addressed to Prof. S. Chakravarthy.

1 Introduction

Schema integration, which is referred to as merging of schemas that have been developed indepen-
dently, occurs in two different contexts [BaLN86, ShLa90, .l... '-,]

1) View integration/reconciliation (in logical database design): During the design phase, several
views of a database are merged to form a conceptual schema for the entire database. Note
that the view here refers to an application's view (also termed an external view) of the
database. Reconciliation is generally required before a feasible conceptual schema that can
support several external views can be arrived at. The view reconciliation problem is formally
treated using database logic in -..1.. ".] Inconsistencies that arise in this view reconciliation
process can be resolved systematically and a sufficient condition for avoiding inconsistencies
is also given in -.l.I.. ",]

2) Schema integration (in federated databases): Schemas of pre-existing databases are merged to
form a global schema that provides a unified or integrated conceptual view of the underlying
databases. The integration process need to preserve the semantics of individual databases as
well. The global schema can then be used as an external interface to access and modify data
in one or more constituents of the federation.

The methodology used for the above two variations of schema integration are quite similar although
the metrics used are different. For example, in view integration, the choice of the underlying con-
ceptual schema may favor the most frequent user (by incurring the least overhead in the translations
between the external and conceptual levels). Whereas, the choice of the global schema for database
integration is likely to be based on preservation of the semantics of individual databases. Another
major difference between the two lies in the way user queries and transactions are processed after
integration. In view integration, user queries and transactions specified against each external view
are mapped to the requests on the conceptual schema. On the other hand, in schema integration
user queries against the global schema are transformed into requests on the underlying constituent
database schemas. The results of these requests are then assembled to produce the result of the
original query on the global/unified schema.
Our approach for schema integration consists of four steps: 1) data model conversion, 2) clus-
tering of related entities [NaGa82], 3) identification of relationships among relations in a cluster,
and 4) merging of relations into a unified/global schema. First, schemas from existing (possibly
different) data models are transformed into a common data model to facilitate the integration pro-
cess. Second, identical or similar real world entities represented in different schemas are grouped
together to be generalized into a generic concept. Third, the relationships among relations in the
same group are identified for defining a global schema. Finally, relations in each cluster are merged
into a global schema using the semantic of attributes and relationships obtained from Step 3. The
block diagram of the schema integration process is shown in Figure 1.1 which is self-explanatory.
In this paper, we propose solutions to steps 3) and 4) of the schema integration process. We
assume that steps 1) and 2) have already been applied. We also propose a query language (an
extension of SQL termed ESQL) that includes constructs for querying the federated database
elegantly. We demonstrate the simplicity and the expressiveness of ESQL and contrast it with

Local Schema 1

Step 1

Component Schema 1

Step 2

Cluster 1

Local Schema 2

Data Model Conversion

Component Schema 2

Clustering of Related Entities

Cluster 2

Local Schema n

Component Schema n

Cluster m

Step 3

Step 4

Identifying Relationships among Relations in a Cluster
(Collect Relation Relationships from Designer, and Apply
Algorithm to Check Consistency of Asserted Relationships and
to Derive New Relationships from Partially Known Relationships)

Merging of Relations in each Cluster
(Use Semantic Relationships among Attributes or Relations
to Define Global Schema from the Set of Relationships)

Global Schema

( Query is Expressed in ESQL

Figure 1.1: Steps for Creating a Federated Database Schema and Querying it

The remainder of this paper is structured as follows. Section 2 summarizes the schema integra-
tion process. Section 3 deals with the five types of relationships, presents an algorithm to check
consistency of the asserted relationships, and to derive new relationships from partially known ones.
In Section 4 we introduce an extended relational model to represent a global schema associated with
component schemas in a cluster. We also extend SQL by allowing set operations on component
relations to query instances of relations related to one another. Section 5 shows an implementation
of the algorithm presented in Section 2 using PROLOG. Section 6 presents our conclusions and
future work.

2 Schema Integration Process

The schema ',I. /l',, problem is as follows. Given a collection of schemas (possibly heterogeneous,
developed using different data models), how can one construct a global or unified schema that will
support all of the underlying schemas? The choice (or the construction) of the global schema should
not result in loss of information or the ability to query and/or update underlying databases either
individually or collectively. A generalization of this problem, which is not addressed in this paper,
is that of dealing with a number of underlying systems not all of which are necessarily databases.

2.1 Data Model Conversion

In many enterprises that require uniform access to independently developed, pre-existing databases,
it is the case that constituent database management systems that are being integrated have been
developed using different data models. Prior to integration, all constituent database schemas need
to be converted to a common data model. In this paper, we use the relational model as the
common data model1. We call the resulting database a relational federated database. We use the
term component schema to refer to a schema that is converted to the relational model from the
model used for the local schema. This step is not necessary for a local schema that is already
specified using the relational model. In some situations, it may even be necessary to develop the
schema of a database-in-use before performing the model conversion. This process is referred to
as reverse i ',.-i In fact, there are many instances of databases that have been developed
and currently being used for which the schema has not been developed in the first place. Reverse
engineering is an important first step that need to be addressed for realizing a federation or even
migration from the current system to a different one.

2.2 Clustering of Related Entities

In order to integrate component schemas into a unified schema, we need to group the relations
from each component schema which represent the same or overlapping real world entity. These
individual entities from different component schema's are then merged into a generalized relation.
For example, the two relations MAIN_FACULTY and BRANCH_FACULTY in the main campus
and branch campus databases, respectively, can be merged into a general concept of relation,
FACULTY. Each such group of relations is called a cluster and is integrated independently of other
groups. This process simplifies the integration problem.
In this process, determination, of relations that belong to a cluster which is based on the
semantics of the entities and the requirements of the federated database, is ultimately made by the
designer. As component databases are developed at different times by different groups of designers,
no assumptions can be made about the choice of names given to entities/relations. Also, if one
component schema has a relation SECRETARY and another component schema has a relation
ENGINEER, it may be useful to integrate them as an EMPLOYEE relation even though their real
world entities are known to be disjoint. However, this process of grouping cannot be automated
completely as it involves interpretation of entities in component databases and their usage. In
general, the semantics implied by the relations are not simply captured by the syntax, but some
heuristics can be used to help the designer decide clustering or even automate the process partially.
First, relations with the same name are examined to determine whether they indicate the same real
world entities. When they have similar names, the data dictionaries including thesaurus can be
used to find similarity of concepts. Second, the common key attributes of relations can be used to
broadly group the relations. For example, two relations EMPLOYEE and PROJECT having their
primary key as Soc_Sec_No and Proj_Name, respectively, cannot be clustered because they do not
have a common key. Third, the number of common attributes can also be used as an aid to find the
same real world entities. This grouping process, although ad hoc helps in reducing the complexity
of the relationship determination problem. In the final analysis, a user assisted tool seems to be a
good way of addressing this problem.
1This is not a general requirement. In [W1 ], Horn clauses have been used as a canonical representation into
which relational, hierarchical, and network models are converted.

2.3 Identification of Relationships

Once the relations from component databases are clustered by the designer, relationships among
the relations in a cluster need to be identified. Possible relationships between any two relations are:
EQUAL, CONTAINS, CONTAINED-IN, OVERLAP and DISJOINT. Determination of the type of
relationship that exists between relations in a cluster is essential for merging of component schemas.
Whether the merging process leads to a minimal set of relations depends upon the knowledge of
relationships between every pair of relations in the cluster and the heuristics used for merging.

2.4 Merging

After identifying the relationships, the designer needs to indicate the semantic relationships of
attributes of the relations in the same cluster in order to merge similar attributes in the global
schema [LaNE89, MaEf84, MoBu81]. Consider, for example, attributes GRADE and SCORE
taken from two different student databases. GRADE is defined as a character type denoting a
letter grade, while SCORE is defined as an integer type representing a value between 1 and 100.
Although the names and data types of two attributes are different, they are semantically equivalent
(i.e., they represent the same concept and there exists functions for converting one attribute to
the other and vice versa). This equivalence between attributes coming from component database
cannot be determined entirely using the syntax. The designer must ultimately determine the
semantic relationships among attributes as rules based on syntax cannot derive such relationships
and determine the existence of functions for conversion. Note that there is a difference between
merging of attributes and relations. Typically, attribute merging leads to the definition of a function
for conversion of values between the domains of the attributes whereas merging of relations requires
the use of the appropriate set operators (such as join, union, intersection) to map global queries
into queries on component databases.
After the semantic relationships are specified for attributes, relations in each cluster are merged
obtaining a single federated database schema. We use an extended relational model to define the
schema for the federated database which is defined in Section 4.1. We use the term Yi1nl,,, schema to
refer to a schema that is obtained by integrating component schemas using the extended relational
model. The component schemas themselves are assumed to be in the relational model.
Among the four steps, clustering of relations with detection and resolution of conflicts in naming,
structure and domain (steps 1 and 2) are not addressed in this paper. They have been addressed
in [BaLe84, LaNE89, MoBu81], and as pointed out in Navathe et al. [NaEL86] this process requires
designer intervention as it is subjective and depends on naming conventions used by the designers.
In this paper we present an algorithm and techniques for automating part of the merging process
(steps 3 and 4). For this purpose, we examine the relationship among relations in a cluster in terms
of key attributes in the following section.

3 Relationships Among the Elements in a Cluster

The relationship between two relations is determined by the values of key attributes from partici-
pating relations. We assume that a common key exists or can be defined between relations R1 and
R2. In the absence of this (or some other reasonable) assumption, there is no easy way to match
real world entities represented by the two relations. Depending upon the values of key attributes,

we define five types of relationships (shown below) on their domains. The domain of a relation is
the set of tuples in that relation. Recall that some of the known relationships are asserted by the
designer. The process of identifying all of the relationships cannot be completely automated for the
following reasons: 1) current data models cannot capture real world state information completely,
and 2) the semantics of the schema may differ even when the same data model is used for designing
the database depending on the intended use. Observe the similarity of the following definitions
with those in [E1Na84] where the Extended ER model was used to define entities and relationships
as objects and the domain of an object is the -, i of real world entities."

case 1: Identical domains (EQUAL)
R1 EQUAL R2 (R1 = R2) def Dom(R) = Dom(R2)

case 2: Containing domains (CONTAINS)
R1 CONTAINS R2 (R1 c R2) f Dom(Ri) D Dom(R2)

case 3: Contained domains (CONTAINED IN)
R1 CONTAINED IN R2 (R1 ci R2) df Dom(Ri) C Dom(R2)

case 4: Overlapping domains (OVERLAP)
R1 OVERLAP R2 (Ri o R2) def Dom (Ri) n Dom(R2) 5 0
A Dom(Ri) g Dom(R2)
A Dom(Ri) 2 Dom(R2)

case 5: Disjoint domains (DISJOINT)
R1 DISJOINT R2 (R1 d R2) def Dom(Ri) n Dom(R2)= 0

For each pair of relations that belong to the same cluster, a relationship between the two is
asserted. Without this relationship information, schema integration cannot be done because the
attributes of relations have different semantics depending on the relationships. In general, there
are nC2 = n*(n 1)/2 relationships for n relations. Even if the number of relations belonging to
the same cluster is not very large, the number of relationships to consider will still be large (for
example, for ten relations, the number of relationships to consider will be 45). Hence, if the number
of relations to be integrated are large and only some of the assertions are specified by a designer,
an algorithm is required to aid the designer to derive new assertions from partially known ones;
also there is a need for checking the consistency of assertions that are provided by the designer.
The algorithm presented in this paper is similar to the algorithm developed by Elmasri et al.
[E1LN86]. However, the algorithm developed in [E1LN86] is not complete in the sense that only four
of the five types of relationships were considered. The OVERLAP relationship, which is perhaps one
of the most commonly occurring relationship, was not considered at all. In addition, the algorithm
in [E1LN86] cannot deal with a set of relationships when a transitive rule is applied, but considers
only the case when the transitive rule results in a definite value of relationship. The algorithm
presented in this paper, in contrast, is complete in the sense that we consider all the five types of
relationships and the final result is more specifically given even in the case when the transitive rule
does not produce a definite value. Our algorithm uses two inputs, namely, a Relationship Assertion
(RA) matrix and a Transitive Rule (TR) table. In the RA matrix, the relationship assertions

among relations are placed in an n by n matrix, where n is the number of relations belonging to the
same cluster. The goal of the algorithm is to fill out the RA matrix with entries for each RA(i,j)
indicating possible relationships between relations i and j; i.e., each value RA(i,j) is a subset of
possible relationships from the following set:
{ =, c, ci, d, o }
The entries in the above set correspond to the relationship "equal", "contains", "contained in",
"dl-i..ii. ", and ..- i I.q.p ', respectively. If the relationship between two relations is not known, then
it is represented by the symbol u (for unknown).
Suppose we have three databases containing faculty information of a university. Database
1 (main campus) and database 2 branchi campus) both contain the relation FACULTY, while
database 3 branch2 campus) contains three relations, FACULTY, ENGINEERING_FACULTY
and ELECTRICAL_ENGINEERING_FACULTY. Their relationships are assumed to be given as


These assertions capture inter- as well as intra-schema semantics. The graph representation of
these assertions is shown in Figure 3.1. The tabular representation of asserted relationships in the
form of an RA matrix is shown in Figure 3.2. In the graph representation, a cluster is a connected
graph with relations as nodes and relationship assertions as edges. The label on the thick lines in
Figure 3.1 represents an explicitly stated assertion, while the thin line represents an unspecified
assertion to be derived.

1 Legend
1: branch2_elec_eng_faculty
d1 ci 5 2: branchlfaculty
3: main_faculty
/ 4: branch faculty
5: branch2_eng-aculty
=: equal
ci: contained in
S5 c 4 c: contains
d: disjoint
o: overlap
S"d u: unknown

Figure 3.1: Graph Representation of Relationship among Relations

Node 1 2 3 4 5
1 = d u u ci
2 d = o u u
3 u o = d u
4 u u d = c
5 c u u ci =

Figure 3.2: Relationship Assertion (RA) Matrix of Figure 3.1

Transitive Rule (TR) table 2 in Figure 3.3 is used to check consistency of the RA matrix as well
as to derive correct relationships from the RA matrix.

E2 = ci c d o
= = ci c d o
ci ci ci {=,ci,c,d,o} d {ci,d,o}
c c {=,ci,c,o} c {c,d,o} {c,o}
d d {ci,d,o} d {=,ci,c,d,o} {ci,d,o}
o o {ci,o} {c,d,o} {c,d,o} {=,ci,c,d,o}

Figure 3.3: Transitive Rule (TR) Table

Let El, E2 and E3 be relationship assertion symbols between relations R1 and R2, R2, and R3, and
R1 and R3, respectively. El and E2 are used to index rows and columns, respectively on the table
shown in Figure 3.3. If E3 is the symbol for TR(E1, E2) in the table, then for any three relations
R1, R2 and R3 the following transitive relationship holds:

(Ri El R2) and (R2 E2 R3) (RI E3 R3)

Each of El, E2 can be one of the elements from the set {=,ci,c,d,o} and E3 is either a single
relationship or a subset of relationships from the same set. When an entry in the TR table contains
more than one element, it indicates all possible relationships under the corresponding transitive
rule. For example, when the relationship El between R1 and R2 is ci and the relationship E2
between R2 and R3 is c, then the relationship E3 between R1 and R3 can be either =,ci,c,d, or
o; i.e., the set {=,ci,d,o} indicated by the entry TR(ci,c). The algorithm uses this rule table3 to
derive an unspecified assertion (thin edges in Figure 3.1). A derivation for an edge consists of two
transitive edges that have only one intermediate node. Thus, the set of derivations for an edge
includes all paths of length two. In general, there are n-2 derivations for any edge, where n is the
number of nodes (relations) in a cluster. For example, if there are 10 relations (nodes), then there
are 45 edges and for each edge there are 8 cases of transitive rules to consider.

3.1 Algorithm

Algorithm 3.1 is checks the consistency of the RA matrix specified by the designer and deduces
new assertions from the existing ones in the RA matrix.
2The relationship unknown denoted by u is purposely omitted from the table as it produces only u as a result of
the application of the transitive rule.
3The correctness of transitive rules can be easily verified using Venn diagrams and hence is not elaborated.

ALGORITHM 3.1: Algorithm for Checking Consistency of Specified Relationships and Deriv-
ing New ones

INPUT: Relationship Assertion (RA) matrix and Transitive Rule (TR) table.
OUTPUT: Checking consistency for the specified relationship assertions and deriving new assertions
from them.

METHOD: The first portion of the algorithm performs consistency checking for the labeled edges
by building transitive paths. If the relationship obtained from the rule in the TR table contradicts
with the specified assertion in the RA matrix, an inconsistency message along with the participating
relationships and relations is returned to the designer. For the unlabeled edges, the algorithm finds
all the transitive paths for each unlabeled edge and checks its consistency. If the intersection
of all the derived relationship assertions on the unlabeled edge is empty, it means that there is
inconsistency 4 among the specified assertions. If the intersection contains only one element, the
assertion is definite (and unique) and is inserted into the RA matrix. This information is used for
subsequent steps of processing. This process continues until no more assertions can be derived or
any inconsistencies detected. At the last step, the unlabeled edges for which relationships cannot
be derived are identified.

PROCEDURE Derivation_Of_NewAssertions_And_ConsistencyChecking
LE := all labeled edges in RA matrix;
NewUE := all unlabeled edges in RA matrix;

/* check consistency for labeled edges by building transitive path */
T := all transitive paths in LE;
FOR each transitive path t E T DO
IF assertion in RA is different from the transitive relationship in TR table
THEN inconsistent_state;

/* check consistency for unlabeled edges and deriving new relationships using TR table */
UE := NewUE;
FOR each unlabeled edge ei E UE DO
Ti := all transitive paths:
R := 0;
FOR each transitive path t E Ti DO
IF assertion is found in RA matrix
r := transitive relationship in TR table;
Ri := Ri n r
ELSE Ri := Unknown;
IF Ri = 0

4Mutual inconsistency of asserted relationships can also be detected using this algorithm. This is the case when
the relationship derived using the transition rule is different from the one specified for an edge.

THEN inconsistent_state
ELSE IF Ri contains one element
insert the element in Ri into RA matrix;
NewUE := UE e
UNTIL UE $ NewUE and NewUE # 0;

/* for those edges that cannot be derived, a designer feedback is required */
IF Ri contains more than one element or Ri contains unknown
THEN output Ri's;

3.2 An Example

We shall illustrate the above algorithm using the example using the problem depicted in Figures
3.1, 3.2, and 3.3. The list of the labeled edges in Figure 3.1 is [1-2, 2-3, 3-4, 4-5, 5-1]. There is no
transitive paths between the labeled edges having length two. Therefore, we don't need to consider
the consistency of relationships in the initial state, but consider the derivation of assertions of
unlabeled edges. The initial list of the unlabeled edges is [1-3, 1-4, 2-4, 2-5, 3-5]. For example, the
relationship of unlabeled edge 1-3 is obtained as follows:

1) edge 1-3

transitive paths possible relationships

d o
1-----2-----3 {ci,d,o}
u d
1-----4-----3 {u}
ci u
1-----5-----3 {u}

intersectionof possible relationships = {ci,d,o}

For the edge 1-3, there are three possible transitive paths, 1-2-3, 1-4-3 and 1-5-3. From Figure 3.1,
the relationships of edges 1-2, 2-3, 4-3 and 1-5, are initially given as d, o, d and ci, respectively.
The edges 1-4 and 5-3 are unlabeled, therefore it is marked as u denoting unknown. When all
the relationships in the edges of transitive paths are identified, the Transitive Rule table in Figure
3.3 is used to derive the relationship of transitive path. For example, for the transitive path 1-2-3
where the relationships of edges 1-2 and 2-3 are d and o, the relationships of edge 1-3 will be one of
{ci,d,o} that is found in the TR table in Figure 3.3. If one of the edges is unknown in the transitive
path as in 1-4-3 and 1-5-3, the relationship in transitive path results in "unknown." "Unknown"
relationship can be taken to be one of the elements, {=,ci,c,d,o}. When all the relationships in

transitive paths are identified, the intersection 5 of them will determine the relationship of that
edge with the relations corresponding to the nodes. Here the intersection of relationships {ci,d,o},
{u} and {u} results in {ci,d,o}. Derivation of relationship for other edges is shown below.

2) edge 1-4

transitive paths possible relationships

d u
1-----2-----4 {u}
u d
1----- 3---4 {u}
ci ci
1-----5-----4 {ci}

intersection of possible relationships = {ci}

3) edge 2-4

transitive paths possible relationships

d ci
2----- 1----4 {ci,d,o}
o d
2----- 3----4 {c,d,o}
u ci
2-----5-----4 {u}

intersection of possible relationships = {d,o}

4) edge 2-5

transitive paths possible relationships

d ci
2----- 1----5 {ci,d,o}
o u
2----- 3---5 {u}
u c
2-----4-----5 {u}

intersection of possible relationships = {ci,d,o}

5) edge 3-5

transitive paths possible relationships

5Note that the relationship unknown acts as the set of all possible relationships for the intersection operation.

u ci
3-----1-----5 {u}
o u
3-----2-----5 {u}
d c
3-----4-----5 {d}

intersection of possible relationships = {d}

After the first iteration, the relationship of edges 1-4 and 3-5 are unique: {ci} and {d}, respec-
tively. Their relationships are inserted in the RA matrix. These relationships are used to derive
relationships of unlabeled edges in the next iteration. The second iteration proceeds with the un-
labeled edges, [1-3, 2-4, 2-5], in the same manner as in the first iteration and the results obtained
are as follows:

For edge 1-3, relationship = {d}
For edge 2-4, relationship = {d,o}
For edge 2-5, relationship = {d,o}

After the second iteration, the relationship of the edge 1-3 is known, while that of the edge 2-4
has not changed from the first iteration and that of edge 2-5 is more precisely determined. The
third iteration proceeds with the unlabeled edges, [2-4, 2-5]. The third iteration does not yield any
relationship having a definite value. Therefore, the algorithm stops here and returns the result to
the designer. In conclusion, the relationships obtained from the algorithm are given below and the
corresponding RA matrix is depicted in Figure 3.4.


The first three cases give definite value for the relationship, while the last two give two possible

Node 1 2 3 4 5
1 = d d ci ci
2 d = o {d,o} {d,o}
3 d o = d d
4 c {d,o} d c
5 c {d,o} d ci

Figure 3.4: Relationship Assertion (RA) Matrix after applying Algorithm 3.1

The main contributions of this algorithm are: i) validating user specified relationships among
relations in a federated environment and ii) deriving new relationships from partially known rela-
tionships. The transitive rule table provides the inference rule; it is not provided by the designer and

need not be changed unless new relationships are added. Even when new relationships are added,
the algorithm does not change as it is driven by the TR table. Algorithm 3.1 is implemented in
PROLOG. We discuss the implementation in Section 5.

4 Global Schema and An Extended SQL For Querying

Earlier work on query languages for federated databases has concentrated on querying against the
global schema. For merging component relations, we take the approach of generating a general-
ization of a relation (entity) from the component relations (entities) in the cluster. Earlier work,
however, does not consider the relationship between occurrences of an entity type and occurrences
of its subentity types (e.g., information about mi. I section" occurrences between two overlapped
subentity types or "difference" occurrences by subtracting a particular subentity type from its en-
tity type). Suppose we integrate the databases of state universities described earlier and assume
that faculty are allowed to work on more than one campus. From the federated database, the
Regents may want to know the names of faculty working in more than one campus, for example;
or the names of faculty who are working in 3 specific campuses. To answer these queries, it is
necessary to provide set operations on the component relations, which are related to one another
in terms of relationships (derived in the previous section) among them.
Although conventional SQL allows set operations on the relations that are union-compatible, it
requires several subqueries and a procedural specification to relate those subqueries. In contrast,
the set operations proposed in ESQL are expressed nonprocedurally in a single statement and are
described below.
In the following two subsections 4.1 and 4.2, we propose a data model for schema integration
and ESQL for supporting set operations on component relations, respectively.

4.1 The Global Relational View

As described in Section 3, there are five possible of relationships among relations in a cluster.
Once these relationships are known, relations belonging to the same cluster can be integrated
into a generalized relation. The generalized relation plays the role of the gl. .1..I relation" using its
.. .nip. .. i" relations. We call a generalized relation a ili,,,l view relation and relations constituent
of a global view relation component relations. A global view relation can be a component relation at
a higher level of abstraction. The attributes that are common (by semantics, not just syntactically)
to component relations would be the attributes of a global view relation. The names of component
relations are preserved in the global view relation. By introducing the concept of a global view
relation into the relational model, we are able to perform set operations on the component relations
that are named in the global view relation. The global view relation is a meta-relation in a sense
that it encompasses the relations that participate in that global view relation. However, it is reduced
to a traditional relation unless the names of its component relations are explicitly specified.
The names of global view relation and component relations can be used to denote tuple variables
in the FROM clause of ESQL. The general FROM clause syntax can defined as follows. In this
notation, [] denotes one or zero occurrences and symbols enclosed in single quotes denote literals.

[ 'AS'
'(' ',' ')' ]

: |

The "AS" construct is used for a global view relation to provide various functions of its compo-
nent relations. Unless stated explicitly, the global view relation implies the union of its component
As an example of the data model for schema integration, consider two component relational
schemas shown in Figure 4.1 and Figure 4.2. They represent information related to a university
database, either on the main campus or on a branch campus. For simplicity, type declarations
are omitted. For FACULTY relations in both databases, the first one describes main campus fac-
ulty, who have a yearly salary and have an office on the main campus. The second one describes
branch faculty, who have a monthly salary, but do not have an office. The same person can be
both a main campus faculty and a branch campus faculty. The global schema shown in Figure 4.3
provides an integrated view of faculty, in which MAINFACULTY and BRANCH_FACULTY are
seen as component relations of a generalization hierarchy having FACULTY as a global view re-
lation. The relation FACULTY has three single valued attributes: Name, Phone and Salary. The
information about offices of faculty is assumed to be irrelevant for the global view. The relations
BRANCH_GRADUATE are integrated into the global view relations UNDERGRADUATE and
GRADUATE, respectively. The relations UNDERGRADUATE and GRADUATE are integrated
into the global view relation STUDENT.


Figure 4.1: Component Schema at Site 1


Figure 4.2: Component Schema at Site 2


I.. I.i_. i', ollment,Branch_enrollment])
COURSE(CourseName,Section#,Time. ['. 1. 111. >urse,Branch_course])

Figure 4.3: Global Schema

The global schema is shown in Figure 4.3. A list (shown in square brackets) specifies the names of
the component relations that participate in the global view relation. This is termed the "relation-
name" attribute of the global view relation. This schema representation is similar to GEM [Zani83].
The arithmetic comparison operations, such as =, <, <, >, > are not applicable to the "relation-
name" attribute, while set operations such as union, intersection and difference are allowed on this
There are two options for the visibility of integrated and component schemas in a heterogeneous
database depending upon whether component schemas are accessible to the user or not. One option
allows the user to access a global schema as well as component schemas. By allowing the user
to access component schemas, the user can query attributes of component schemas that are not
integrated in the global schema (e.g., Office# in Figure 4.1). However, when we join two relations
in both global and component schemas, sophisticated query processing is required. This option
is not recommended if all the attributes of component schemas are represented in global schema.
The other option allows the user to access only the attributes of the global schema. This option is
recommended for the novice users who want to avoid the complexity of many similar attributes in
global and component schemas. Whether component schemas should be visible to the user or not
is ultimately a matter of policy.

4.2 ESQL as a Query Language

ESQL (Extended SQL) is designed to be a generalization of SQL. Whenever the relation-name
attribute in the global view relation is not used, the global view relation is identical to a conventional
relation and the syntax of the query is reduced to that of SQL. For example, the following query
that retrieves those faculty whose salary is over $50,000 uses the FACULTY as a normal relation.

SELECT f.name
FROM faculty f
WHERE f.salary > 50,000

Figure 4.4: Find those faculty whose salary is over $50,000.

The global view relation FACULTY is the union of its component relations. Thus the same query
can also be generated using component relations as follows:

SELECT f.name
FROM f IS faculty AS UNION(main_faculty,branchfaculty)
WHERE f.salary > 50,000

Figure 4.5: Same as Figure 4.4

The "IS" construct is used to denote f as a tuple variable for faculty. In relational calculus the
above query is expressed as shown below. Note that for those faculty who are both at the main
and branch campuses, the annual salary is computed by adding the salary as main_faculty to the
salary as branchiaculty computed from monthly salary.
Relational calculus equivalent of the ESQL query shown in Figure 4.5:

{t.name (t E main_faculty A (A/ u) (u E branchiaculty
A (u.name = t.name))
A (t.salary > 50,000))
A (t E branch_faculty A (/3 u) (u E main_faculty
A (u.name = t.name))
A (t.salary*12 > 50,000))
A (3 u) (3 v) (u E main_faculty A v E branchfaculty
A (u.name = v.name)
A (t.name = u.name)
A (u.salary + v.salary*12 > 50,000)) }

The equivalent SQL query is given as follows:




mainfaculty m
FROM branchfaculty b
WHERE m.name = b.name)
m.salary > 50,000

branch_faculty b
FROM main_faculty m
WHERE b.name = m.name)
b.salary*12 > 50,000

main_faculty m
FROM branch_faculty b
WHERE m.name = b.name
AND m.salary + b.salary*12 > 50,000)

As we can observe from the above example, the query shown in Figure 4.5 when expressed in SQL
requires several subqueries and the user has to know the details of the integration. However, in
ESQL, the query is a single statement and expresses the intent in a succinct manner. When a query
is invoked, the integration represented internally is mapped to operations of component relations.
Query processing and optimization of ESQL queries and the translation of ESQL queries into

queries on component relations is discussed in [WhNC91, WhCN92, Wha92]. Note that a complete
knowledge of component relations is not required to express queries in ESQL; the relation-name
field captures sufficient information to help formulate a query without having to know the details.
If we wish to retrieve the salary of the main_faculty whose name is "Smith" in the global view,
we can write the query in ESQL as follows:

SELECT f.salary
FROM f IS faculty AS main_faculty
WHERE f.name="Smith"

Figure 4.6: Find the salary of the main-aculty whose name is "Smith" in the global view.

The above query can also be expressed in the following way without an explicit declaration of tuple

SELECT salary
FROM faculty AS main_faculty
WHERE name="Smith"

Figure 4.7: Same as Figure 4.6

The query "List the names of faculty that work at the main and branch campuses and earns more
than $50,000 a year" can be expressed as:

SELECT f.name
FROM f IS faculty AS INTS(main_faculty,branchfaculty)
WHERE f.salary > 50,000

Figure 4.8: Find the faculty who is both at the main and branch campuses and earn more than
$50,000 a year.

In relational calculus the above query is defined as follows:

{t.name (3 u) (3 v) (u E main-aculty A v E branchiaculty
A (u.name = v.name)
A (t.name = u.name)
A (u.salary + v.salary*12 > 50,000))}

For another example, to query the faculty who works only at the main_campus one can write:

SELECT f.name
FROM f IS faculty AS DIFF(main_faculty,branchfaculty)

Figure 4.9: Find the faculty who works only at the main campus.

We can define the query in relational calculus as follows:

{t.name (t E main_faculty A (/3 u) (u E branchiaculty
A (u.name = t.name))
A (t.salary > 50,000))}

Again, if the above queries were to be expressed in SQL, it will require several subqueries.
If we want to know what campus faculty "Smith" belongs to, the query can be written as follows:

SELECT faculty*
FROM faculty f
WHERE f.name = "Smith"

Figure 4.10: Find the campus in which faculty "Smith" works.

The operator specifies how the global view relation is composed of from its component relations
in the global schema. By allowing the component relations in the global schema, we can access
the attribute of component schema that cannot be integrated in the global schema (e.g., Office#
in Figure 4.1). However, even though both the attributes Salary in MAINFACULTY (Figure 4.1)
and FACULTY (Figure 4.3) are accessible to the user, the semantics is different. For example, if
we want to retrieve salary of main_faculty whose name is "Smith" in the component schema, it will

SELECT salary
FROM main_faculty
WHERE name = "Smith"

Figure 4.11: Find the salary of the main_faculty whose name is "Smith" in the component schema.

If "Smith" happens to work both at the main and branch campuses, Figure 4.11 returns the salary
received only at the main campus, while Figure 4.6 returns the salary computed at both campuses.

Set Comparison Operator

ESQL also provides set comparison operators to process the component relations belonging to the
same cluster. These operators are not applicable to relation tuples, but to component relations.
The results of these operations are obtained from the Relationship Assertion (RA) table given in
Section 3. The function name specification is extended to include the following operators.

not equal
D contains
C is contained in
is disjoint
++ overlaps

Aggregate Functions and Grouping

As we mentioned before, the integrated schema is reduced to the conventional relational model if
we do not use the relation-name attribute specifying the component relations participating in the
global view relation. However, using the relations-name attribute in the global view relation, one
can easily invoke a query on the component relations in a single statement. In SQL, the "GROUP
BY" clause is used to group the tuples that have the same value for some collection of attributes.
In addition to the facility provided by conventional SQL, the "GROUP BY" clause in ESQL is used
to group the component relations in the global view relation and functions such as AVG, SUM,
COUNT, MAX, and MIN can be applied to each component relation independently. For example,
the query for computing the average of GPA for undergraduate students in each campus, can be
formulated as follows:

FROM undergraduate

Figure 4.12: Find the average GPA for undergraduate students in each campus

If the global view relation UNDERGRADUATE is composed of the component relations MAIN-
UNDERGRADUATE and BRANCH-UNDERGRADUATE, the answer to the query will be in the
form as follows:

COMPONENT(undergraduate) AVG(GPA)

5 Implementation

The Algorithm (described in Section 3.1) for checking the consistency of relationships and deriving
new ones as well as ESQL to SQL translation has been implemented in PROLOG. In the follow-
ing we show the implementation of Algorithm given in Section 3.1. In PROLOG, one typically
distinguishes between facts and rules. Facts that are considered in the algorithm are component
relations in a cluster, relationships among component relations, and the transitive rule table. The
representation of relations in a cluster, their relationships, and the transition rule table in PROLOG
are described below:

(a) Relations in the same cluster are asserted as follows:

relations([main_f, branchl_f, branch2_f, branch2_engf, branch2_ele_engf]).

The predicate "relations" contains the name of the relation that belongs to the same cluster.

(b) Relationship between a pair of relations is asserted as follows:

relationship(main_f, branchlf, o).
relationship(main_f, branch2f, d).

relationship(branchl_f, branch2_ele_eng_f, d).
relationship(branch2_ele_eng_f, branch2_engf, ci).
relationship(branch2_engf, branch2f, ci).

The predicate "relationship" contains a pair of relations in the first and second arguments, and the
relationship between them in the third argument. For example, relationship(main_f, branch_f, o)
means that the relationship value between the pair of relations MAIN_FACULTY and BRANCH-
_FACULTY is ..-. il .,. "

(c) Before representing the Transitive Rule table shown in Figure 3.4 using Horn clauses, let us
consider again the meaning of the table. If for three relations R1, R2 and R3, El, E2 and E3
are relationships between R1 and R2, R2 and R3, and R1 and R3, respectively, then the following
transitive relationship holds:

(Ri El R2) and (R2 E2 R3)

(R1 E3 R3)

Each of El, E2 and E3 can be one of the following elements, {=, c, ci, d, o}. The predicate
I '..1ii ivesrule" represents the relationship relationship among El, E2 and E3 as follows:
transitiverule(El ,E2,E3).

For example, transitiverule(c,d,[c,d,o]) means that for the three relations R1, R2, and R3, if
the relationships between R1 and R2, and R2 and R3, are c (CONTAINS) and d (DISJOINT),
respectively, then the relationship between R1 and R3 should be one of the elements, {c,d,o}. The
Transitive Rule table in Figure 3.4 is asserted as follows:

transitive_rule(e, e, [e]).
transitive_rule(e, ci, [ci]).
transitive_rule(e, c, [c]).
transitive_rule(e, d, [d]).
transitive_rule(e, d, [ol).




, e, [ci]).
, ci, [ci]).
, c, [e,ci,c,d,o]).
, d, [d]).
, o, [ci,d,o]).

e, [c]).
ci, [e,ci,c,d,o]).
c, [c]).
d, [c,d,o]).
o, [c,o]).

e, [d]).
ci, [ci,d,o]).
c, [d]).
d, [e,ci,c,d,o]).

transitive_rule(d, o, [ci,d,o]).

transitive_rule(o, e, [o]).
transitive_rule(o, ci, [ci,o]).
transitive_rule(o, c, [c,d,o]).
transitive_rule(o, d, [c,d,o]).
transitive_rule(o, o, [e,ci,c,d,o]).

With the I .,,,-ii iverule" predicate, we can now represent relationship by transitive path in the
form of Horn clause as follows:


Clauses of (a) and (b) constitute the input to the algorithm and are asserted by the system dy-
namically, whereas clauses of (c) state the static information that is used to derive relationship by
transitive rules and therefore asserted once.
With the above inputs, the algorithm is realized in PROLOG. Figure 5.1 shows the top-level of
PROLOG clauses. First it reads relations in a cluster and relationship between a pair of relations.
Then it checks the consistency of asserted relationships. If the asserted relationships are consistent,
then new relationships are derived from the asserted relationships; otherwise, the process stops and
returns the inconsistent relationship to the designer. The built-in predicate .--- I ta" is used to
accommodate dynamically changed relationships during derivation of new relationships. Finally,
the built-in predicate "retract" is used to remove from the dynamic databases all clauses whose
head matches the relationship.

top :-

check_consistency_of_asserted_equiv([PairL]) :-
Pair = [R1,R2],


check_equiv_from_transitive_path(Eq,[TransitivePathIL]) -
TransitivePath = [R1,R2,R3], !,

writeln('consistency error')),
check_equiv_from_transitive_path(Eq,[TransitivePathIL]) -

((PairList = NewPairList),


derive_new_relationship([PairlL]) :-
Pair = [R1,R2],
(length(Eq,l), !, Eq = [SingleEq],
pairs(PairsList), delete(Pair,PairsList,NewPairList),


Figure 5.1: Top-level of PROLOG clauses

6 Conclusions

We have presented an algorithm to check the consistency of asserted relationships among component
relations in a cluster. The algorithm also derives new ones from partially known relationships. Our
work extends the work of [E1LN86] by considering more types of relationships and by giving results
explicitly as a set when the derived relationship is not unique. The algorithm described in this paper
is also implemented in PROLOG (actually KB-prolog), which is suitable for this purpose because
the inputs (i.e., relationships and transitive rule table) are asserted as facts and new relationships
are derived using rules.
We have extended the relational model to represent a global view relation from component
relations. With simple extensions to the relational query language SQL, ESQL can query the
information on the component relations using a set of operators as part of the query. Set comparison

operator and "GROUP BY" clauses are applicable to the component relations, while set operations
(i.e., union, difference and intersection) are applicable to the tuples of the component relations.
For details on query processing and optimization of ESQL queries refer to [Wha92, WhCN92].


[BaLe84] Batini, C., M. Lenzerini, '.1. I l...l. I..gy for Data Schema Integration in the Entity-
Relationship Model," IEEE Transactions on Software E,.-',,, i ,,i Vol. 10, No. 6, pp.
I,.11-1,I,.; November 1'' L.

[BaLN86] Batini, C., M. Lenzerini, and S.B. Navathe, "A Comparative Analysis of Methodologies
for Database Schema Integration," AC I[ Computing Surveys, Vol. 18, No. 4, pp. 323-
.;I, I, December 1' -',

[BrOT86] Breitbart, Y., P.L. Olson., and G.R. Thompson, "Database Integration in a Distributed
Heterogeneous Database System," In Proceedings of the .',,, International Conference
on Data E,.-',-.. i ',,. pp. 301-310, 1'1.i

[CaVi83] Casanova, M., and M. Vidal, "Towards a Sound View Integration Methodology," In
Proceedings of the .',Al C If SIGACT/SIC( I[OD Conference on Principles of Database
Systems (Atlanta, Ga.), AC .I New York, pp. .:;,- 17, March 1','.;

[E1LN86] Elmasri, R., J. Larson, and S.B. Navathe, "Schema Integration Algorithms for Fed-
erated Databases and Logical Database Design," Tech. Rep. No. CSC-86-9: 8212,
Honeywell Corporate Systems Development Division, Camden, Minn.

[E1Na84] Elmasri, R., and S.B. Navathe, "Object Integration in Logical Database Design," In
Proceedings of the 1st International Conference on Data E,i,',,, -i ',i pp. 418- 1."
l',- 1.

[E1Na89] Elmasri, R., and S.B. Navathe, Fundamentals of Database Systems, The Ben-
jamin/Cummings Publishing Company, Inc, 1','-

.l. ] Jacobs, B. E., Applied Database Logic: Fundamental Issues (Volume I), Prentice-Hall,
Inc., E .,1.,,.1 Cliffs, 1'l,

[LaNE89] Larson, P., S.B. Navathe, and R. Elmasri, "A Theory of Attribute Equivalence in
Databases with Application to Schema Integration," IEEE Transactions on Software
E,., ', i '.i Vol. 15. No. 4, pp. 44',- I,.; April 1',l-

.1..I 1i,4] Mannino, M.V, and W. Effelsberg, .l.I.. Iliig Techniques in Global Schema Design,"
In Proceedings of the 1st International Conference on Data E,.-i', ,i pp. 418- 1."
l',- 1.

[MoBu81] Motro, A., P. Buneman, "Constructing Superviews," In Proceedings of the AC If SIG-
MOD International Conference on 1!,, ..i. i, of Data, pp. 54-64, May 1981.

[Mol I ;] Motro, A., "Superviews: A Virtual Integration of Multiple Databases," IEEE Trans-
actions on Software Ei,., Vol. 13, No. 7., pp. 7 "'-7'', July 1'l -.

[NaEL86] Navathe, S.B., R. Elmasri, and J.A. Larson., I '. grating User Views in Database
Design," IEEE Computer, Vol. 19, No. 1, pp. 50-62, January 1',i.

[NaGa82] Navathe, S.B., and S.G. Gadgil, "A Methodology for View Integration in Logical
Database Design," In Proceedings of the 'i1 International Conference on Very Large
Data Bases, pp. 142-164, 1',-'

[ShLa90] Sheth, A.P., and J.A. Larson, 1..1. '.Ii. 4 Database Systems for Managing Distributed,
Heterogeneous, and Autonomous Databases," AC I[ Computing Surveys, Vol. 22, No.
3, pp. 183-236, September 1990.

[SSGN91] Savasere, A., A.P. Sheth, S. Gala, S.B. Navathe, and H. Marcus, "On Applying Classifi-
cation to Schema Integration," In the First International Workshop on Interoperability
in Mutidatabase Systems, Kyoto, Japan, pp. 258-261, 1991.

[WhCN92] Whang, W.K., S. Chakravarthy, and S.B. Navathe, "Query Processing using Logic-
Based Queries in Federated Databases," in preparation, 1992.

[Wha92] Whang, W.K., "A Logic-Based Approach to Federated D.i.,1..i-. Ph.D thesis, Elec-
trical Engineering Department, University of Florida, Gainesville, January 1992.

[WhNC91] Whang, W.K., S.B. Navathe, and S. Chakravarthy, "Logic-Based Approach to Real-
izing a Federated Information System", In Proc. of the First International Workshop
on Interoperability in Multidatabase Systems, Kyoto, 1991.

[Zani83] Zaniolo, C., "The Database Language GEM," In Proceedings of the AC I1 SI(. I[OD
International Conference on wM,.i ,i, ol Data, pp. 207-218, 1l',.;

University of Florida Home Page
© 2004 - 2010 University of Florida George A. Smathers Libraries.
All rights reserved.

Acceptable Use, Copyright, and Disclaimer Statement
Last updated October 10, 2010 - - mvs