 Material Information
Title: Knowledge extraction in the SEEK project part I : data reverse engineering
Series Title: Department of Computer and Information Science and Engineering Technical Reports ; TR-02-008
Physical Description: Book
Language: English
Creator: Hammer, Joachim
Schmalz, M.
O'Brien, W.
Shekar, S.
Haldavnekar, N.
Publisher: Department of Computer and Information Science and Engineering, University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: July, 2002
Copyright Date: 2002
 Record Information
Bibliographic ID: UF00095584
Volume ID: VID00001
Source Institution: University of Florida
Holding Location: University of Florida
Rights Management: All rights reserved by the source institution and holding location.



Dept. of CISE, University of Florida, Gainesville, FL 32611-6120, Technical Report TR-02-008, July 2002.

Knowledge Extraction in the SEEK Project Part I:

Data Reverse Engineering

Technical Report

J. Hammer, M. Schmalz, W. O'Brien, S. Shekar and N. Haldavnekar
Department of Computer & Information Science & Engineering
University of Florida
Gainesville, FL 32611-6120

In this report we describe our methodology for knowledge extraction in the SEEK (Scalable
Extraction of Enterprise Knowledge) project and highlight the underlying technologies
supporting SEEK. In particular, we discuss our use of data reverse engineering and code
mining techniques to infer, as automatically as possible, the schema and semantics of a
legacy information system. We have a fully functional prototype implementation, which we
use to illustrate the approach with an example from our construction supply chain testbed.
We also provide empirical evidence of the usefulness and accuracy of our methodology.

1 SEEK Approach to Knowledge Extraction
The goal of SEEK is to semiautomatically discover and extract enterprise knowledge from
legacy sources in order to support the generation of source wrappers. Wrappers, which are "... a type
of software 'glueware' that is used to attach together other software components ..." [26], are essential
in enabling the exchange of data between heterogeneous systems. Such systems employ different data
models, representations, and query languages to manage their data: a (source) wrapper converts a
query that is formulated in the language of one system into one or more queries understood by a
second system. Analogously, it converts the resulting data, which is expressed in the model of the
second system, into the format used by the first. Existing wrapper development tools (see, for
example, [11, 13, 21, 24]) require significant programmatic set-up with limited reusability of code.
Given the diversity and number of information sources available today, the time and investment
needed to establish connections to legacy sources has imposed severe limitations on the scalability and
maintainability of such systems. These requirements have generally acted as a significant barrier to
information integration in enterprise development (e.g., supply chain automation). Efforts are now
under way to develop languages and tools for describing resources in a way that can be processed by
computers (e.g., Semantic Web). However, they do not address the problem of how to collect this
enterprise knowledge, or how to maintain it efficiently for the continuously increasing number of
legacy sources.
A high-level view of the SEEK knowledge extraction architecture is shown in Figure 1. SEEK
applies Data Reverse Engineering (DRE) and Schema Matching (SM) processes to legacy database(s),
to produce a source wrapper for a legacy source. The source wrapper will be used by another
component (not shown in Figure 1) wishing to communicate and exchange information with the
legacy system. We assume that the legacy source uses a database management system for storing and
managing its enterprise data.

Rinker School of Building Construction, University of Florida, Gainesville, FL 32634-6134

First, SEEK generates a detailed description of the legacy source, including entities,
relationships, application-specific meanings of the entities and relationships, business rules, data
formatting and reporting constraints, etc. We collectively refer to this information as enterprise
knowledge. The extracted enterprise knowledge forms a knowledgebase that serves as input for
subsequent steps outlined below. In order to extract this enterprise knowledge, the DRE module shown
on the left of Fig. 1 connects to the underlying DBMS to extract schema information (most data
sources support at least some form of Call-Level Interface such as JDBC). The schema information
from the database is semantically enhanced using clues extracted by the semantic analyzer from
available application code, business reports, and, in the future, perhaps other electronically available
information that may encode business data such as e-mail correspondence, corporate memos, etc. It
has been our experience (through visits with representatives from the construction and manufacturing
domains) that such application code exists and can be made available electronically.

[Figure 1 residue omitted. The figure shows the Data Reverse Engineering (DRE) module, comprising the Schema Extractor (SE) and Semantic Analyzer (SA), operating on the legacy source (legacy DB, application code, and reports); the extracted source schema, business rules, and mapping information flow to the Wrapper Generator (WGen), while a domain model and domain ontology are revised, trained, and validated alongside.]

Figure 1: Schematic diagram of the conceptual architecture of SEEK's knowledge extraction algorithm.

Second, the semantically enhanced legacy source schema must be mapped into the domain
model (DM) used by the application(s) that want(s) to access the legacy source. This is done using a
schema matching process that produces the mapping rules between the legacy source schema and the
application domain model. In addition to the domain model, the schema matching module also needs
access to the domain ontology (DO) describing the model.
Finally, the extracted legacy schema and the mapping rules provide the input to the wrapper
generator (not shown), which produces the source wrapper.
The three preceding steps can be formalized as follows. At a high level, let a legacy source L be
denoted by the tuple L = (DBL, SL, DL, QL), where DBL denotes the legacy database, SL denotes its
schema, DL the data, and QL a set of queries that can be answered by DBL. Note that the legacy database
need not be a relational database, but can include text, flat file databases, and hierarchically formatted
information. SL is expressed by the data model DML.
We also define an application via the tuple A = (SA, QA, DA), where SA denotes the schema used
by the application and QA denotes a collection of queries written against that schema. The symbol DA
denotes data that is expressed in the context of the application. We assume that the application schema

is described by a domain model and its corresponding ontology (as shown in Figure 1). For simplicity,
we further assume that the application query format is specific to a given application domain but
invariant across legacy sources for that domain.
Let a legacy source wrapper W be comprised of a query transformation

fwQ: QA → QL (1)

and a data transformation

fwD: DL → DA, (2)

where the Q's and D's are constrained by the corresponding schemas.
The SEEK knowledge extraction process shown in Figure 1 can now be stated as follows. Suppose we are given
SA and QA for an application wishing to access legacy database DBL whose schema SL is unknown.
Assuming that we have access to the legacy database DBL as well as to application code CL accessing
DBL, we first infer SL by analyzing DBL and CL, then use SL to infer a set of mapping rules M between
SL and SA, which are used by a wrapper generator WGen to produce (fwQ, fwD). In short:
DRE: (DBL, CL) → SL (3.1)
SM: (SL, SA) → M (3.2)
WGen: (QA, M) → (fwQ, fwD) (3.3)
In this report, we provide a description of our DRE algorithm (equation 3.1), which comprises
schema extraction and semantic analysis. We also provide a description of the DRE prototype that
is installed and running in our testbed and summarize the experimental results validating our approach.
We report on SM and WGen in subsequent publications.
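The formalization above can be illustrated with a minimal sketch. The toy query and row representations, the relation names, and the renaming logic below are illustrative assumptions, not SEEK's actual interfaces; the sketch only shows how a wrapper pairs the two transformations fwQ (equation 1) and fwD (equation 2).

```python
# Minimal sketch of the wrapper formalization: W = (f_wq, f_wd).
# All names and the dictionary-based query/row encodings are assumptions
# made for illustration only.
from dataclasses import dataclass
from typing import Callable, Dict, List

Query = Dict[str, str]     # toy stand-in for a query in Q_A or Q_L
Row = Dict[str, object]    # toy stand-in for a record in D_L or D_A

@dataclass
class Wrapper:
    f_wq: Callable[[Query], List[Query]]    # equation (1): maps Q_A to one or more Q_L
    f_wd: Callable[[List[Row]], List[Row]]  # equation (2): maps D_L to D_A

# Example: the application calls the entity "Task"; the legacy schema calls it "ACTIVITY".
def f_wq(q: Query) -> List[Query]:
    table = "ACTIVITY" if q["table"] == "Task" else q["table"]
    return [{"table": table, "select": q["select"]}]

def f_wd(rows: List[Row]) -> List[Row]:
    return [{"name": r["act_name"], "start": r["act_start"]} for r in rows]

w = Wrapper(f_wq, f_wd)
legacy_queries = w.f_wq({"table": "Task", "select": "*"})
app_rows = w.f_wd([{"act_name": "Pour slab", "act_start": "2002-07-01"}])
```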

2 Data Reverse Engineering (DRE)
Data reverse engineering is defined as the application of analytical techniques to one or more
legacy data sources to elicit structural information (e.g., term definitions, schema definitions) from the
legacy source(s), in order to improve the database design or produce missing schema documentation.
So far in SEEK, we are applying DRE to relational databases only. However, since the relational
model has only limited semantic expressiveness, our DRE algorithm generates, in addition to the
schema, an E/R-like representation of the entities and relationships that are not explicitly defined in
the legacy schema (but which exist implicitly).
More formally, DRE can be described as follows. Given a legacy database DBL defined as ({R1,
R2, ..., Rn}, D), where Ri denotes the schema of the i-th relation with attributes A1, A2, ..., Am(i), keys
K1, K2, ..., Kk(i), and data D = {r1(R1), r2(R2), ..., rn(Rn)}, such that rt(Ri) denotes the data (extent) for
schema Ri at time t. Furthermore, DBL has functional dependencies F = {F1, F2, ..., Fk} and inclusion
dependencies I = {I1, I2, ..., Il} expressing relationships among the relations in DBL. The goal of
DRE is to first extract {R1, R2, ..., Rn}, I, and F, and then use I, F, D, and CL to produce a semantically
enhanced description of {R1, R2, ..., Rn} that includes all relationships among the relations in DBL
(including those that are implicit), semantic descriptions of the relations, as well as the business knowledge
that is encoded in DBL and CL.
Our approach to data reverse engineering for relational sources is based on existing algorithms
by Chiang [4, 5] and Petit [23]. However, we have improved their approaches in several ways,
most importantly to reduce the dependency on human input, to eliminate some of the limitations
of their algorithms (e.g., consistent naming of key attributes and the requirement that the legacy
schema be modeled in 3NF), and to produce a semantically richer schema description at the end.
Our DRE algorithm is divided into two parts: schema extraction and semantic analysis, which
operate in interleaved fashion. An overview of the two algorithms, which are comprised of eight steps,

is shown in Figure 2. In addition to the modules that execute each of the eight steps, the architecture in
Figure 2 includes three support components. The configurable Database Interface Module (upper
right-hand corner) provides connectivity to the underlying legacy source. Note that this component is
the ONLY source-specific component in the architecture: in order to perform knowledge extraction
from different sources, only the interface module needs to be changed. The Knowledge Encoder
(lower right-hand corner) represents the extracted knowledge in the form of an XML document so that
it can be shared with other components in the SEEK architecture (e.g., the semantic matcher). The
Metadata Repository is internal to DRE and is used to store intermediate run-time information needed by
the algorithms, including user input parameters, the abstract syntax tree for the code (e.g., from a
previous invocation), etc.

Figure 2: High-level overview of the DRE algorithm.

2.1 Eight-Step Data Reverse Engineering Algorithm
We now describe each of the eight steps and related activities outlined in the DRE algorithm
diagram of Figure 2.

Step 1: Generation of an Abstract Syntax Tree (AST) for the Application Code
The DRE process begins with the generation of an abstract syntax tree (AST) for the legacy
application code. The AST will be used by the semantic analyzer for code exploration during Step 3.
The following discussion references Figure 3, which is an expansion of the AST Generator
representation shown in Figure 2. The process flow on the left side is for building ASTs for C code;
the one on the right side is for building ASTs for Java code.
The AST generator for C code consists of two major components: the lexical analyzer and the
parser. The lexical analyzer for application code written in C reads the source code line-by-line and
breaks it up into tokens. The C parser reads in these tokens and builds an AST for the source code in
accordance with language grammar (see Appendix A for a listing of the grammar for the C code that is
accepted by the semantic analyzer). The above approach works nicely for procedural languages such
as C. However, when applied directly to object-oriented languages (e.g., Java), it greatly increases the
complexity of the problem, due to the required resolution of issues such as ambiguity induced by multiple
inheritance, diversity resulting from specialization of classes and objects, etc.

[Figure 3 residue omitted. The figure shows two flows from the application code to an AST: on the left, a C and Pro*C lexical analyzer feeding a C parser; on the right, a pattern matcher that emits modified application code to a Java lexical analyzer and Java parser. The resulting AST is passed to dictionary extraction, Step 2.]

Figure 3: Generation of an AST for either C or Java code.

Unfortunately, most application code written for currently available databases is written in Java,
making it necessary to develop an algorithm to infer semantic information from Java code. As
previously implied, the grammar of an object-oriented language is complex compared with that of
procedural languages like C. Building a Java lexical analyzer and parser would require
the parser to look ahead multiple tokens before applying the appropriate production rule. Thus,
building a Java parser from scratch does not seem a feasible solution. Instead, tools like lex or yacc
can be employed to do the parsing. These tools generate N-ary ASTs. N-ary trees, unlike binary trees,
are difficult to navigate using standard tree traversal algorithms. Our objective in AST generation is to
be able to extract and associate the meaning of selected partitions of application code with program
variables. For example, format strings in input/output statements contain semantic information that can
be associated with the variables in the input/output statement. Such a program variable may in turn be
associated with a column of a table in the underlying legacy database. The standard Java language
grammar does not put the format string information on the AST, which defeats the purpose of
generating ASTs for the application code.
The reasons stated above justify the need for an alternate approach to analyzing Java code.
Our Java AST builder (depicted on the right-hand side of Figure 3) has three major components, the
first of which is a pattern matching process. The objective of the pattern matching is twofold. First, it
reduces the size of the application code being analyzed. Second, while generating the reduced
application code file, it performs selected text replacements that facilitate easier parsing of the
modified source code that is eventually emitted from the pattern matcher. The pattern matcher works
as follows: it scans the source code line by line, looking for patterns such as System.out.println
that indicate output statements, or ResultSet that indicate JDBC statements. Upon finding such a
pattern, it replaces the pattern with an appropriate pre-designated string. For example, in the
following line of code:
System.out.println("Task Start Date" + aValue);
the pattern System.out.println is replaced with printf, and the following line is generated in a
reduced source code file:
printf("Task Start Date" + aValue);

After one pass of the application code, the pattern matcher generates a reduced source code file
that contains only JDBC and output statements, which more closely resemble a procedural language.
In writing a lexical analyzer and parser for this reduced source code, we can re-use most of our C
lexical analyzer and parser. The lexical analyzer reads the reduced source code line by line and
supplies tokens to the parser that builds an AST in accordance with the Java language grammar.
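As a rough sketch of the one-pass pattern matcher described above (not the actual SEEK implementation; the tiny pattern table and the decision to keep only output and JDBC lines are simplifications):

```python
# Sketch of the one-pass pattern matcher: rewrite known Java patterns, then
# keep only the lines that resemble output or JDBC statements. The pattern
# list is deliberately tiny; the real matcher handles more cases.
import re

REPLACEMENTS = [
    (re.compile(r"System\s*\.\s*out\s*\.\s*println"), "printf"),
]
KEEP = re.compile(r"printf|ResultSet")   # output and JDBC statements only

def reduce_source(java_lines):
    reduced = []
    for line in java_lines:
        for pattern, replacement in REPLACEMENTS:
            line = pattern.sub(replacement, line)
        if KEEP.search(line):
            reduced.append(line)
    return reduced

code = [
    'int x = 5;',
    'System.out.println("Task Start Date" + aValue);',
    'ResultSet rs = stmt.executeQuery(query);',
]
reduced = reduce_source(code)   # two lines survive, println rewritten to printf
```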

Step 2: Extracting Schema Information using the Dictionary Extractor.

The goal of Step 2 is to obtain the relation and attribute names from the legacy source. This is done by
querying the data dictionary, stored in the legacy database DBL in the form of one or more system
tables. The details of this step are outlined in Figure 4.

[Figure 4 residue omitted. The figure shows the Dictionary Extractor querying the legacy database through a Send/Receive Module driven by a configuration file, storing query templates and the AST in the Metadata Repository, and invoking the Pattern Matcher when a primary key is not found; the output is passed to the code analyzer, Step 3.]

Figure 4: Implementation detail for dictionary extraction.

In order to determine key attributes, the algorithm proceeds as follows. For each relation, it first
identifies the set of candidate key attributes, which are those attributes whose values are restricted
through NON-NULL and UNIQUE constraints. If there is only one candidate key per entity, then that
key is the primary key. Otherwise, if primary key information cannot be retrieved directly from the
data dictionary, the algorithm passes the set of candidate keys, along with predefined rule-out patterns,
to the code analyzer. The code analyzer searches for these patterns in the application code and
eliminates from the candidate set those attributes which occur in a rule-out pattern. The rule-out
patterns, which are expressed as SQL queries, occur in the application code whenever the programmer
expects to select a SET of tuples; by the definition of a primary key, this rules out the possibility that the
attributes a1 ... an form a primary key. The rule-out patterns have the form:

WHERE a1 = <scalar expression> AND a2 = <scalar expression> AND ... AND
an = <scalar expression>

Following code analysis, if a primary key cannot be identified, the reduced set of candidate keys
is presented to the user for final primary key selection.
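The key-determination logic of Step 2 can be summarized in a short sketch. The set-based representation of candidate keys and the ask_user callback are illustrative assumptions, not SEEK's interfaces:

```python
# Sketch of Step 2's primary-key determination: one candidate wins outright,
# otherwise rule-out patterns found by the code analyzer eliminate candidates,
# and any remaining ambiguity is resolved by the user.
def determine_primary_key(candidate_keys, ruled_out, ask_user=None):
    # candidate_keys: set of attribute tuples (NON-NULL and UNIQUE attributes)
    # ruled_out: attribute tuples matched by rule-out patterns in the code
    if len(candidate_keys) == 1:
        return next(iter(candidate_keys))
    remaining = candidate_keys - ruled_out
    if len(remaining) == 1:
        return next(iter(remaining))
    return ask_user(remaining) if ask_user else None   # final user selection

pk = determine_primary_key({("emp_id",), ("name", "dob")}, {("name", "dob")})
```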

Step 3: Code Analysis.

The objective of the application code analysis that comprises Step 3 is threefold: (1) augment the entities
extracted in Step 2 with domain semantics, (2) extract queries that help validate the existence of
relationships among the entities of Step 2, and (3) identify business rules and constraints that are not explicitly
stored in the database, but which may be important to the wrapper generator or the application program
accessing legacy source L. Our approach to code analysis is based on code mining, which includes
code slicing [15] and pattern matching [22].
The mining of semantic information from source code assumes that the application code contains
output statements that support report generation or the display of query results. From the output message
string(s) that usually describe a displayed variable v, semantic information about v can be obtained.
This requires locating (tracing) the statement s that assigns a value to v. Since s can be associated
with the result set of a query q, we can associate v's semantics with a particular column of the table
being accessed in q.

[Figure 5 residue omitted. The figure shows the code analysis pipeline: the Pre-Slicer extracts slicing variables, the Code Slicer produces a reduced AST, and the Result Generator emits semantic information and business rules to a result report; the user decides whether more semantic analysis is needed and may enter new slicing variables (those used in previous runs are stored in the Metadata Repository) before control passes to inclusion dependency mining, Step 4.]

Figure 5: Implementation detail for code analysis.

The details of the code analyzer are shown in Figure 5. The first step is the pre-slicer, which
extracts slicing variables from the input, output, and SQL statements in the AST. For each of the slicing
variables identified by the pre-slicer, the code slicer and analyzer are invoked on the AST. The code
slicer traverses the AST in pre-order and retains only those nodes that contain the slicing variable in
their subtree. The reduced AST constructed by the code slicer is then sent to the semantic analyzer,
which extracts the data type, meaning, business rules, column name, and table name that can be
associated with the slicing variable. The results of semantic analysis are appended to a result file, and
the slicing variable is stored in the metadata repository. Note that the code slicer and analyzer could
have been invoked once with all the slicing variables generated by the pre-slicer, rather than iteratively
for each slicing variable. The second approach of invoking the code slicer and analyzer
separately for each slicing variable (which is the approach used here) yields a smaller reduced AST
and, therefore, more focused results. Since the code analyzer is part of a build-time activity, accuracy
of the results rather than time is the critical factor.

After the code slicer and analyzer have been invoked on all slicing variables, the result generator
examines the result file and replaces the variables in the extracted business rules with the semantics
extracted from their associated output statements (where possible). The results of code analysis
obtained thus far are presented to the user, who has a chance to view the results and decide if further
code analysis is needed. Should the user indicate that further analysis is needed, a list of variables is
displayed that occur in input, output, and SQL statements as well as the slicing variables used in
previous passes. In addition, the user can enter a new slicing variable until the user determines that no
new knowledge can be obtained from the code, or until there are no further candidate slicing variables.
The results of this user-controlled code analysis are appended to the result file.
From the AST created in Step 1, the code analyzer identifies all nodes corresponding to input,
output, and embedded SQL statements. The node number, statement node name, and identifier list are
appended to a symbol table during pre-order traversal of the AST.
In practice, the slicing variables are the program identifiers (variables) identified by the pre-
slicer and are classified based on whether they appear in input, output, or SQL statements. The code
slicer module takes each of these slicing variables, traverses the source AST, and retains only nodes
that satisfy the slicing criterion; here, satisfying the slicing criterion means occurrence of the slicing
variable in a subtree rooted at the node in question. A reduced AST is produced, as follows:
1. Locate the AST start/end node for slicing. Forward slicing completely traverses the AST in
pre-order from the start node. Backward slicing traverses the AST in pre-order, stopping
when the pre-specified end node is reached.
2. Traverse the AST in pre-order starting at the root. Each time a declaration (dcln), statement,
or assign node is encountered, search the node's subtree for occurrences of user-specified
slicing variable(s). If such variable(s) is(are) found, then push the subtree onto a stack.
3. If an assign statement with the slicing variable on its right-hand side is found, then add the
variable on the left-hand side (i.e., the identifier name) to the list of slicing variables.
4. Repeat Steps 2 and 3, above, until the pre-specified end node is reached.
5. Pop all elements off the stack, building a tree using the left-child/right-sibling representation
of N-ary trees.
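The traversal above can be sketched as follows, using a toy N-ary AST of nested tuples (node label first, then children; strings are leaves). The sketch simplifies the original procedure: it collects matching subtrees directly instead of using a stack and the left-child/right-sibling representation, and it covers forward slicing only.

```python
# Toy pre-order slicing over an N-ary AST encoded as nested tuples.
def contains(node, var):
    """True if the slicing variable occurs anywhere in the subtree."""
    if isinstance(node, str):
        return node == var
    return any(contains(child, var) for child in node[1:])

def slice_ast(root, slicing_vars):
    vars_ = set(slicing_vars)
    kept = []
    for node in root[1:]:                       # pre-order over top-level statements
        label = node[0]
        if label == "assign" and any(contains(node[2], v) for v in vars_):
            vars_.add(node[1])                  # step 3: propagate to the LHS identifier
        if label in ("dcln", "statement", "assign") and any(
                contains(node, v) for v in vars_):
            kept.append(node)                   # step 2: retain matching subtrees
    return ("reduced",) + tuple(kept)

ast = ("program",
       ("dcln", "int", "start_date"),
       ("assign", "msg", ("expr", "start_date")),
       ("statement", ("print", "msg")),
       ("dcln", "int", "unrelated"))
reduced = slice_ast(ast, ["start_date"])        # keeps 3 of the 4 statements
```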
The result of this step is a reduced AST. Program semantics are determined from pre-order
traversal of the reduced AST, as follows:
- If a dcln node is encountered, the data type of the identifier can be discovered.
- embSQL nodes contain mapping information from an identifier name to the corresponding column
name and table name in the database.
- print nodes contain mapping information from the text string to the identifier. In other
words, one can extract the 'meaning' of the identifier from the text string.
In addition to the preceding program semantics, we identify other enterprise knowledge by
matching templates against code fragments in the AST. So far, we have developed patterns for
discovering business rules, which are encoded in loop structures and/or conditional statements, and
mathematical formulae, which are encoded in loop structures and/or assignment statements. Note,
however, that the occurrence of an assignment statement does not necessarily indicate the presence of
a mathematical formula; the likelihood increases significantly (for example) if the statement
contains one of the slicing variables. We illustrate the extraction of simple business rules in an
example in Section 3.

Step 4: Discovering Inclusion Dependencies.
After extraction of the relational schema in Step 2, the goal of Step 4 is to identify constraints to
help classify the extracted relations, which represent both the real-world entities and the relationships
among them. This is done using inclusion dependencies (INDs), which indicate the existence of inter-
relational constraints including class/subclass relationships.

Inclusion dependencies are defined as follows. Let A and B be two relations, and X and Y be
attributes or a set of attributes of A and B respectively. An inclusion dependency A.X << B.Y denotes
that a set of values appearing in A.X is a subset of B.Y.

[Figure 6 residue omitted. The figure shows the inclusion dependency mining module receiving schema information from the legacy database via the Send/Receive Module (driven by a configuration file) and emitting the discovered inclusion dependencies through an output module to relation classification, Step 5.]

Figure 6: Implementation detail for inclusion dependency mining.

As depicted in Figure 6, the inclusion dependency detection module obtains its input from two
sources: (1) the dictionary extraction module (via the send/receive module) provides the table name,
column names, primary keys and foreign keys (if available); and (2) the equi-join query finder is a part
of the code analyzer that operates on the AST and provides pairs of relations and corresponding
attributes, which occur together in equi-join queries in the application code. The fact that two relations
are used in a join operation is evidence for the existence of an inclusion dependency between them.
The inclusion dependency detection algorithm operates as follows:

1. Create a set X of pairs of relations: e.g., if we have relations P, Q, R, and S, then X = {(P,Q),
(P,R), (P,S), (Q,R), (Q,S), (R,S)}. Intuitively, this set contains the pairs of relations for which
inclusion dependencies have not yet been determined. In addition, we maintain two (initially
empty) sets of possible (POSSIBLE) and final (FINAL) inclusion dependencies.
2. If foreign keys have been successfully extracted, do the following for each foreign key:
a) Identify the pair of participating relations, i.e., the relation to which the FK belongs and the
relation to which it is referring.
b) Eliminate the identified pair from set X.
c) Add the inclusion dependency involving this FK to the set FINAL.
3. If equi-join queries have been extracted from the code, do the following for each equi-join:
a) Identify the pair of participating relations.
b) Check the direction of the resulting inclusion dependency by querying the data. In order to
check the direction of an inclusion dependency, we use a subset test described in
Appendix B.
c) If the above test is conclusive, eliminate the identified pair from set X and add the
resulting inclusion dependency to the set FINAL.
d) If the test in step b) is inconclusive (i.e., the direction cannot be finalized), add both
candidate inclusion dependencies to the set POSSIBLE.
4. For each pair p remaining in X, identify attributes or attribute combinations that have the same
data type. Check whether the subset relationship exists by using the subset test described in
Appendix B. If so, add the inclusion dependency to the set POSSIBLE. If, at the end of Step 4,
no inclusion dependency is added to the possible set, delete p from X; otherwise, leave p in X
for user verification.
5. For each inclusion dependency in the set POSSIBLE, do the following:
a) If the attribute names on both sides are equal, then assign the rating value high.
b) If the attribute name on the left side of the inclusion dependency is related (based on common
substrings) to the table name on the right-hand side, then assign the rating value high.
c) If neither condition is satisfied, then assign the rating value low.
6. For each pair in X, present the inclusion dependencies in the set POSSIBLE, together with their
ratings, to the user for final determination. Based on the user input, append the valid inclusion
dependencies to the set FINAL.
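Step 5's rating heuristic can be sketched compactly. The related() substring test below is an illustrative stand-in for the "common substrings" check, and all names are assumptions rather than SEEK's implementation:

```python
# Sketch of the POSSIBLE-set rating heuristic (step 5 above).
def related(a, b, min_len=4):
    """Crude common-substring test: share a substring of at least min_len chars."""
    a, b = a.lower(), b.lower()
    return any(a[i:i + min_len] in b for i in range(len(a) - min_len + 1))

def rate_ind(lhs_attr, rhs_table, rhs_attr):
    if lhs_attr.lower() == rhs_attr.lower():
        return "high"                       # 5a: equal attribute names
    if related(lhs_attr, rhs_table):
        return "high"                       # 5b: LHS attribute relates to RHS table name
    return "low"                            # 5c: neither condition holds
```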
The worst-case complexity of this exhaustive search, given N tables and M attributes per table
(NM total attributes), is O(N²M²). However, we have reduced the search space in those cases where
we can identify equi-join queries in the application code. This allows us to limit our exhaustive
searching to only those relations not mentioned in the extracted queries. As a result, the average-case
complexity of the inclusion dependency finder is much smaller: the detection of one foreign key
constraint in the data dictionary, or of one equi-join query in the application code, allows the algorithm to
eliminate the corresponding relation(s) from the search space. Hence, if K foreign key constraints and
L equi-join queries (involving pairs different from the pairs that are part of the foreign key constraints)
are detected, the average complexity is O((N² − K − L)M²). In the best-case scenario, when K + L
equals the number of all possible pairs of relations, the inclusion dependency detection can be done in constant
time, i.e., O(1).
In addition, factors such as matching data types and matching maximum lengths of attributes
(e.g., varchar(5)) are used to reduce the number of queries made to the database (Step 4) to
check the subset relationship between attributes. If the attributes in a pair of relations have T mutually
different data types, then the N² term reduces to N(N−T). Finally, it is important to note that the DRE
activity is always considered a build-time activity, and hence performance complexity is not a
crucial issue.

Step 5: Classification of the Relations.
When reverse engineering a relational schema, it is important to understand that, due to the limited
expressiveness of the relational model, all real-world entities are represented as relations irrespective of
their types and roles in the model. The goal of this step is to identify the different types of relations,
some of which correspond to real-world entities while others represent relationships among such
entities or among relations.
Classifying the relations is performed using the primary key information obtained in Step 2
and the inclusion dependencies from Step 4. Specifically, if consistent naming is used, the primary
key of each relation is compared with the primary keys of the other relations to identify strong or weak entity-
relations and specific or regular relations. Otherwise, we have to use inclusion dependencies to provide
vital clues. Intuitively, a strong entity-relation represents a real-world entity whose members can be
identified exclusively through its own properties. A weak entity-relation, on the other hand, represents
an entity that has no properties of its own that can be used to identify its members. In the relational
model, the primary keys of weak entity-relations usually contain primary key attributes from other
(strong) entity-relations.
Intuitively, both regular and specific relations are relations that represent relationships between
two entities in the real world (rather than the entities themselves). However, there are instances when
not all of the entities participating in an (n-ary) relationship are present in the database schema (e.g.,
one or more of the relations were deleted as part of the normal database schema evolution process).
While reverse engineering the database, we identify such relationships as specific relations.
Specifically, the primary key of a specific relation is only partially formed by the primary keys of the
participating (strong or weak) entity-relations, whereas the key of a regular relation is made up entirely
of the primary keys of the participating entity-relations.
More formally, Chiang [4] defines the four relation types as follows:
A strong entity relation is a relation whose primary key does not properly contain a key
attribute of any other relation.
A weak entity relation p is a relation which satisfies the following three conditions:
1. A proper subset of p's PK contains key attributes of other strong or weak entity relations;
2. The remaining attributes of p's PK do not contain key attributes of any other relation;
3. p has an identifying owner and properly contains the PK of its owner relation. User
input is required to confirm these relationships.
A regular relation has a PK that is formed by concatenating the PKs of other (strong or
weak) entity relations.
A specific relation r is a relation which satisfies the following two conditions:
1. A proper subset of r's PK contains key attributes of other strong or weak entity relations;
2. The remaining attributes of r's PK do not contain key attributes of any other relation.
Classification proceeds as follows. First, strong and weak entity-relations are classified. For
weak entity-relations, the primary key must be composite, and part of it must be the primary key of an
already identified strong entity-relation; the remaining part of the key must not be a primary key of
any other relation. Finally, regular and specific relations are discovered. This is done by checking the
primary keys of the remaining unclassified relations for full or partial presence of the primary keys of
already identified entity-relations.
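The classification logic described above can be sketched as follows. This is a simplified illustration that assumes consistent attribute naming; distinguishing weak entity-relations from specific relations additionally requires the identifying-owner check, which SEEK confirms with the user:

```python
def classify(pks):
    """Toy version of Step 5.  pks maps relation name -> tuple of
    primary-key attributes.  A relation whose PK is fully covered by
    the PKs of smaller relations is regular; partial coverage gives a
    weak entity-relation or a specific relation (separating these two
    needs the identifying-owner check); no coverage gives a strong
    entity-relation."""
    out = {}
    for r, pk in pks.items():
        # attributes of r's PK borrowed from the full PK of another relation
        covered = set()
        for r2, pk2 in pks.items():
            if r2 != r and set(pk2) < set(pk):
                covered |= set(pk2)
        if not covered:
            out[r] = "strong entity-relation"
        elif covered == set(pk):
            out[r] = "regular relation"
        else:
            out[r] = "weak entity-relation or specific relation"
    return out

# Hypothetical keys, named consistently across relations
result = classify({
    "Proj": ("Proj_ID",),
    "T":    ("Proj_ID", "T_UID"),
    "Res":  ("Proj_ID", "Res_UID"),
    "Assn": ("Proj_ID", "T_UID", "Res_UID"),
})
print(result)
```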

Step 6: Classification of the Attributes.
Each relation must have its attributes classified into one of four groups, depending on whether
they can be used as keys for entities, weak entities, relationships, etc. Attribute classification is based
on the type of the relation and the inclusion dependencies to which the attributes belong:
Primary key attributes (PKA) are attributes that uniquely identify the tuples in a relation.
Dangling key attributes (DKA) are attributes belonging to the primary key of a weak entity-
relation or specific relation that do not appear as the primary key of other relations.
Foreign key attributes (FKA) are attributes in R1 referencing R2 if
o the attributes of FK have the same domains as the primary key attributes PK of R2, and
o for each t1 in r(R1) and t2 in r(R2), either t1[FK] = t2[PK], or t1[FK] is null.
Non-key attributes (NKA) are those attributes that cannot be classified as PKA, DKA, or FKA.
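As a sketch, the grouping above might be implemented as follows (a hypothetical helper; the PKs, FKs, and the relation type are assumed to be available from the earlier steps):

```python
def classify_attributes(attrs, pk, fks, other_pk_attrs, rel_type):
    """Assign each attribute of one relation to PKA / DKA / FKA / NKA.
    `fks` is the set of foreign-key attributes found in Step 4;
    `other_pk_attrs` is the union of PK attributes of all other
    relations.  DKAs are by definition also PK attributes."""
    groups = {"PKA": set(), "DKA": set(), "FKA": set(), "NKA": set()}
    for a in attrs:
        if a in pk:
            groups["PKA"].add(a)
            # dangling: PK attribute of a weak/specific relation that
            # does not appear as a primary key elsewhere
            if rel_type in ("weak", "specific") and a not in other_pk_attrs:
                groups["DKA"].add(a)
        elif a in fks:
            groups["FKA"].add(a)
        else:
            groups["NKA"].add(a)
    return groups

# Hypothetical weak entity-relation T with key Proj_ID + T_UID
g = classify_attributes(
    attrs={"Proj_ID", "T_UID", "T_NAME"},
    pk={"Proj_ID", "T_UID"},
    fks=set(),
    other_pk_attrs={"Proj_ID", "Res_UID"},
    rel_type="weak")
print(g["DKA"], g["NKA"])  # {'T_UID'} {'T_NAME'}
```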

Step 7: Identification of Entity Types.
Starting with this step, we begin to map relational concepts into corresponding E/R model
concepts. Specifically, the strong and weak entity relations identified in Step 5 are classified as
strong and weak entities, respectively. Furthermore, for each weak entity we associate its owner entity.
The mapping, which includes the identification of proper keys, is done as follows:

Each weak entity relation is converted into a weak entity type. The dangling key attribute of
the weak entity relation becomes the key attribute of the entity.
Each strong entity relation is converted into a strong entity type.

Step 8: Identification of Relationship Types.
The inclusion dependencies discovered in step 4 form the basis for determining the relationship
types among the entities identified above. This is a two-step process:
1. Identify relationships present as relations in the relational database. The relation types
(regular and specific) obtained from the classification of relations (Step 5) are converted into
relationships. The participating entity types are derived from the inclusion dependencies. For
completeness of the extracted schema, we may decide to create a new entity when
conceptualizing a specific relation.
The cardinality between the entities is M:N.

2. Identify relationships among the entity types (strong and weak) that were not present as
relations in the relational database, via the following classification.
IS-A relationships can be identified using the PKAs of strong entity relations and the
inclusion dependencies among PKAs. If there is an inclusion dependency in which the
primary key of one strong entity-relation refers to the primary key of another strong
entity-relation then an IS-A relationship between those two entities is identified. The
cardinality of the IS-A relationship between the corresponding strong entities is 1:1.
Dependent relationship: For each weak entity type, the owner is determined by
examining the inclusion dependencies involving the corresponding weak entity-relation
as follows: we look for an inclusion dependency whose left hand side contains the part
of the primary key of this weak entity-relation. When we find such an inclusion
dependency, the owner entity can be easily identified by looking at the right hand side
of the inclusion dependency. As a result, a binary relationship between the owner
(strong) entity type and weak entity is created. The cardinality of the dependent
relationship between the owner and the weak entity is 1:N.
Aggregate relationships: If the foreign key in any of the regular or specific relations
refers to the PKA of one of the strong entity relations, an aggregate relationship is
identified. An inclusion dependency must exist from this (regular or specific)
relation on the left-hand side that refers to some strong entity-relation on the
right-hand side. The aggregate relationship is between the relationship (which must
already be conceptualized in Substep 8.1 from a regular/specific relation) and the strong
entity on right hand side.
The cardinality of the aggregate relationship between the strong entity and aggregate
entity (a M:N relationship and its participating entities at the conceptual level) is
as follows: If the foreign key contains unique values, then the cardinality is 1:1, else the
cardinality is 1:N.
Other binary relationships: Other binary relationships are identified from the FKAs not
used in identifying the above relationships. When an FKA of a relation refers to a
primary key of another relation, then a binary relationship is identified. The cardinality
of the binary relationship between entities is as follows: If the foreign key contains
unique values, the cardinality is 1:1, else the cardinality is 1:N.
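The dependency-driven identification of IS-A and dependent relationships can be sketched as follows (a toy illustration over name-only inclusion dependencies; aggregate and other binary relationships are omitted for brevity):

```python
def identify_relationships(incl_deps, kind):
    """Sketch of Substep 8.2: derive binary relationships from
    inclusion dependencies whose both sides involve primary keys.
    `kind` maps relation name -> 'strong' or 'weak' (from Step 7)."""
    found = []
    for lhs, rhs in incl_deps:
        if kind[lhs] == "strong" and kind[rhs] == "strong":
            found.append((lhs, "IS-A", rhs, "1:1"))        # PK << PK, both strong
        elif kind[lhs] == "weak" and kind[rhs] == "strong":
            found.append((rhs, "identifies", lhs, "1:N"))  # dependent relationship
    return found

# Hypothetical dependencies: T[Proj_ID] << Proj[P_ID], etc.
rels = identify_relationships(
    [("T", "Proj"), ("Res", "Proj")],
    {"T": "weak", "Res": "weak", "Proj": "strong"})
print(rels)
```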

2.2 Knowledge Representation

At the end of Step 8, DRE has extracted the following schema information from the legacy database:

Names and classification of all entities;
Names of all attributes;
Primary and foreign keys;
Data types;
Simple constraints (e.g., Null, Unique) and explicit assertions;
Relationships and their cardinalities; and
Business rules.
The schema extraction and semantic analysis collectively generate information about the
underlying legacy source. In each step of the DRE algorithm, knowledge is extracted from the source.
This knowledge has to be represented in a format that is not only computationally tractable and easy
to manipulate, but also easy and intuitive for humans to understand. The best choice in the current
scenario is XML, as it is a simple yet robust language for representing and manipulating data. The
knowledge encoder takes an XML DTD as input (shown in Appendix C) and encodes the extracted
information to produce an XML document. Instead of encoding the extracted information after every
step, which may result in inconsistencies because the DRE algorithm refines some of its intermediate
outcomes in the process, the encoding is done at the end, which also serves as a consistency check. We
next exemplify the DRE process using an example.
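For illustration, encoding a fragment of the extracted knowledge might look as follows. The element and attribute names below are hypothetical stand-ins; the actual document structure is dictated by the DTD in Appendix C:

```python
import xml.etree.ElementTree as ET

# Build a small XML document for one extracted entity and one
# relationship (illustrative names, not the Appendix C DTD).
root = ET.Element("extractedKnowledge")
ent = ET.SubElement(root, "entity", {"name": "Proj", "kind": "strong"})
ET.SubElement(ent, "attribute", {"name": "P_ID", "key": "primary"})
ET.SubElement(root, "relationship",
              {"name": "belongs_to", "from": "T", "to": "Proj",
               "cardinality": "1:N"})
xml_doc = ET.tostring(root, encoding="unicode")
print(xml_doc)
```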

3 Illustrative Example

Project management is a key application in building construction, hence the sample legacy source is
based on a Microsoft Project application from our construction supply chain testbed. The underlying
legacy database DBL is managed by a relational database management system. For simplicity, we
assume, without loss of generality, that only the following relations exist in DBL, whose
schema will be discovered using DRE (for a description of the entire schema refer to [19]):

Proj [P_ID, P_NAME, DES_S, DES_F, A_S, A_F, ...]

In order to illustrate the code analysis and how it enhances the schema extraction, we refer the
reader to the following C code fragment representing a simple, hypothetical interaction with a legacy
database (only those statements are shown that are relevant to the example).
char *aValue;
char *cValue;
int bValue = 0;
/* more code */
WHERE T_PRITY = :bValue;
/* more code */
int flag = 0;
if (cValue < aValue)
flag = 1;
/* more code */
printf("Task Start Date %s ", aValue);
printf("Task Finish Date %s ", cValue);

Step 1: Creation of AST
We start by creating an Abstract Syntax Tree shown in Figure 7.

Figure 7: Application-specific code analysis via AST decomposition and code slicing. The direction of slicing is
backwards (forward) if the variable in question is in an output (resp. input or declaration) statement.

Step 2. Dictionary Extraction.
Step 2 extracts table and attribute names from the legacy source. A decision step directs control to the
pattern matcher if PKs cannot be obtained. Using the AST from Step 1, the pattern matcher examines
embedded SQL queries (e.g., SELECT and WHERE clauses), yielding a reduced set of keys to be
presented to the user for selection of PKs. The code analyzer then searches for these patterns in the
application code and rules out selected candidate patterns. At the end of this step, all the primary keys
are finalized.
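A toy version of the pattern matcher might scan embedded SQL for key candidates as follows. Regular expressions are used here purely for illustration; the actual matcher operates on the AST from Step 1:

```python
import re

# Collect the columns named in SELECT and WHERE clauses of embedded
# SQL as primary-key candidates to present to the user (illustrative
# code fragment; simplified regexes).
code = """
EXEC SQL SELECT T_UID, PROJ_ID FROM T WHERE PROJ_ID = :pid;
"""
candidates = set()
for select_cols, where_expr in re.findall(
        r"SELECT\s+(.*?)\s+FROM\s+\w+\s+WHERE\s+(.*?);", code, re.S | re.I):
    candidates.update(c.strip() for c in select_cols.split(","))
    candidates.update(re.findall(r"\b([A-Z_]+)\s*=", where_expr))
print(sorted(candidates))  # ['PROJ_ID', 'T_UID']
```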

Result. In the example DRE application, the following relations and their attributes were obtained
from the legacy database:
Proj [P_ID, P_NAME, DES_S, DES_F, A_S, A_F, ...]

Step 3. Code Analysis.
As described above and shown in Figure 5, code analysis is conducted in five substeps. The first
substep is the pre-slicing. From the Abstract Syntax Tree (AST) of the application code, the pre-slicer
identifies all the nodes corresponding to input, output and embedded SQL statements. It appends the
statement node name, and identifier list to an array as the AST is traversed in pre-order. For example,
for the AST in Figure 7, the array contains the following information depicted in Table 1. The
identifiers that occur in this data structure maintained by the pre-slicer form the set of slicing variables.
Table 1: Information maintained by the pre-slicer

Node number   Statement                 Text String              Identifiers       Direction of Slicing
                                        (only for print nodes)
2             embSQL (embedded SQL      ---                      aValue, cValue    Backwards
              statement node)

The code slicer and analyzer, which comprise Substeps 2 and 3 respectively, are executed once
for each slicing variable identified by the pre-slicer. In the above example, the slicing variables that
occur in SQL and output statements are aValue and cValue. The direction of slicing is fixed as
backwards or forwards depending on whether the variable in question is part of an output (backwards)
or input (forwards) statement. The slicing criterion is the exact statement (SQL or input or output)
node that corresponds to the slicing variable.
During the code slicing sub-step we traverse the entire AST for the source code and retain only
those nodes that have an occurrence of the slicing variable in the corresponding subtree. The output of
this substep is a reduced AST, which is produced as follows:
1. Locate the appropriate node in the tree where the slicing should begin or end. For forward
slicing, we start at a certain node (identified by the slicing criterion) and walk the tree nodes
in pre-order until the end of the tree. For backward slicing, we start at the root node and walk
the tree in pre-order, stopping when we arrive at the node corresponding to the slicing criterion.
2. Do a pre-order traversal of the entire AST starting at the root. Each time a declaration
(dcln), statement, or assign node is encountered, search the sub-tree for occurrences of the
slicing variable that we are looking for. If present, push the sub-tree with root dcln or
statement or assign node onto a stack.
3. If there is an assign statement with the slicing variable on the right hand side, add the left
hand side variable (identifier name) also to the list of slicing variables.
4. Repeat Steps 2 and 3 until the appropriate end node is reached (as defined in Step 1).
5. Pop out all the elements from the stack and build a tree of them using the left child-right
sibling representation of N-ary trees.
For the preceding example, the reduced AST is depicted in Figure 8.

Figure 8: Reduced AST.
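The slicing procedure above can be sketched over a toy AST as follows. Nested tuples stand in for real AST nodes; this is an illustration of the retain-and-propagate logic, not the SEEK code slicer:

```python
# Minimal slicing sketch (Substeps 2-3 of code analysis).  A node is a
# (kind, payload) tuple where payload is either a token string or a
# list of child nodes.

def mentions(node, var):
    """True if the subtree rooted at node contains the variable."""
    kind, payload = node
    if isinstance(payload, str):
        return payload == var
    return any(mentions(c, var) for c in payload)

def slice_program(ast, var):
    """Walk top-level statements in order, keeping each dcln/assign/
    statement subtree that mentions a slicing variable; assignments
    propagate slicing to their left-hand side (rule 3 above)."""
    vars_, kept = {var}, []
    for stmt in ast:
        if any(mentions(stmt, v) for v in vars_):
            kept.append(stmt)
            if stmt[0] == "assign":          # lhs becomes a slicing variable
                lhs = stmt[1][0]
                vars_.add(lhs[1])
    return kept

prog = [
    ("dcln",   [("id", "aValue")]),
    ("dcln",   [("id", "bValue")]),
    ("assign", [("id", "flag"), ("id", "aValue")]),  # flag derived from aValue
    ("print",  [("id", "flag")]),
]
sliced = slice_program(prog, "aValue")
print([s[0] for s in sliced])  # ['dcln', 'assign', 'print']
```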

During the analysis sub-step, our algorithm extracts the information shown in Table 2 while
traversing the reduced AST in pre-order:
1. If a dcln node is encountered, the data type of the identifier can be learned.
2. embSQL nodes contain the mapping information from identifier names to the corresponding
column and table names in the database.
3. printf/scanf nodes contain the mapping information from the text string to the identifier. In
other words, we can extract the contextual meaning of the identifier from the text string.

Table 2: Information inferred during the analysis sub-step.

Identifier   Possible Meaning    Business Rule           Data type        Corresponding   Corresponding
Name                                                                      Column Name     Table Name
aValue       Task Start Date     if (cValue < aValue)    Char => string   T_ST_D          T
cValue       Task Finish Date    if (cValue < aValue)    Char => string   T_FIN_D         T
The results of the analysis sub-step are appended to a report file. After the code slicer and
analyzer have been invoked on every slicing variable identified by the pre-slicer, the report file is
presented to the user at the end of the DRE phase.
The analysis program provides the user with intermediate output, which is used to validate the
results. For example, during code analysis, output about the extracted business rules is available. A
partial list of business rules extracted from the sample code presented earlier is shown in Figure 9.

Rule 1: if (Task Beginning Date < early start date of task)
            Beginning Date = early start date of task;

Rule 2: Project Cost = Project Cost + Task Cost;

Rule 3: if (Resource category == "brick")
            Cost = 2500;
Figure 9: Sample business rule output after SA is complete.

Using output screens such as the one in Figure 9, the user can decide whether or not to perform
further analysis. This is exemplified in Figure 10. In this screen, the user can enter a new slicing
variable and display the slicing variables used in previous passes as well as identifiers that occur in
input, output and SQL statements. The code slicer and analyzer are invoked based on the slicing
variable(s) indicated by the user. If the user decides not to perform further analysis, code analysis
passes control to the inclusion dependency detection module.

Figure 10: Screen snapshot of user interface during code analysis.

Step 4. Discovering Inclusion Dependencies.

If foreign keys are available, we determine the associated inclusion dependencies by examining the
relation(s) to which they belong or refer. Control is then transferred to the equi-join detection module
to determine if equi-join queries are embedded in the application, via pattern matching applied to
FROM and WHERE clauses. Detected equi-join queries are extracted and returned to the inclusion
dependency finder, where they support confirmation or augmentation of the existing set of inclusion
dependencies. The inclusion dependency finder then tests possible inclusion dependencies between
still un-related pairs of relations in an exhaustive manner as described in substeps 4 and 5 of the
inclusion dependency finder algorithm. After this exhaustive search, a final set of inclusion
dependencies is validated with the help of a domain expert.

Result: Inclusion dependencies are as follows:
Assn [T_UID, Proj_ID] << T [T_UID, Proj_ID]
Assn [Res_UID, Proj_ID] << Res [Res_UID, Proj_ID]
Avail [Res_UID, Proj_ID] << Res [Res_UID, Proj_ID]
Res [Proj_ID] << Proj [P_ID]
T [Proj_ID] << Proj [P_ID]
Assn [Proj_ID] << Proj [P_ID]
Avail [Proj_ID] << Proj [P_ID]

If a minimal set is desired, for example, the second-to-last inclusion dependency can be removed,
since it is implicitly contained in the second and fourth inclusion dependencies by transitivity.
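Computing such a minimal set amounts to removing dependencies implied by transitivity, which can be sketched as follows (attribute correspondences are omitted for brevity; only relation names are tracked):

```python
def minimal_cover(deps):
    """Remove inclusion dependencies implied by transitivity: if
    A << B and B << C are both present, then A << C is redundant."""
    dep_set = set(deps)
    redundant = {(a, c) for (a, b1) in dep_set for (b2, c) in dep_set
                 if b1 == b2 and (a, c) in dep_set and a != b1 and b2 != c}
    return dep_set - redundant

# The dependencies found above, reduced to relation names
deps = {("Assn", "T"), ("Assn", "Res"), ("Avail", "Res"),
        ("Res", "Proj"), ("T", "Proj"), ("Assn", "Proj"), ("Avail", "Proj")}
cover = minimal_cover(deps)
print(sorted(cover))
```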

Step 5. Classification of the Relations.

As discussed earlier in the paper, in this step all the relations in the database are classified into one of
four types: strong, regular, weak, or specific. This classification is done using the primary key
information obtained in Step 2 and the inclusion dependencies from Step 4. We use the test described
earlier, by which we check for the presence of primary key attributes of one relation in the primary key of
other relations. For example, in relation T, T_UID is not a primary key of any other relation; on the
other hand, Proj_ID is a primary key of PROJ (where it is called P_ID). Hence, T is a weak entity
relation and PROJ is a strong entity relation. Inclusion dependencies involving primary key attributes
of two relations on both sides also give similar clues.

Strong Entities: PROJ, AVAIL
Weak Entities: RES, T
Regular Relationship: ASSN

Step 6. Classification of the Attributes.
We classify attributes as (a) PK or FK, (b) Dangling or General, or (c) Non-Key (rest).

Result: Table 3 illustrates attributes obtained from the example legacy source.
Table 3. Example of the attribute classification from the MS-Project legacy source.
Relation   Primary Key            Dangling Key Attribute   Foreign Key Attributes   Non-Key Attributes
RES        Proj_ID + Res_UID      Res_UID                                           remaining attributes
T          Proj_ID + T_UID        T_UID                                             remaining attributes
AVAIL      Proj_ID + Avail_UID    Avail_UID                Res_UID + Proj_ID        remaining attributes
ASSN       T_UID + Proj_ID

Step 7. Identify Entity Types.
Strong (weak) entity relations obtained from Step 5 are directly converted into strong (resp. weak) entity types.

Result: The following entities were classified:
Strong entities: PROJ with P_ID as its key.
AVAIL with Avail_UID as its key.
Weak entities: T with T_UID as key and PROJ as its owner.
RES with Res_UID as key and PROJ as its owner.

Step 8. Identify Relationship Types.
No decision or analysis is required.

The following 1:N binary relationships involving weak entity types were discovered:
between PROJ and T, and between PROJ and RES.
In addition, a 1:N regular binary relationship was discovered between the following entities:
between RES and AVAIL.
Since two inclusion dependencies involving ASSN exist (i.e., between T and ASSN and between
RES and ASSN), there is no need to define a new entity. Thus, ASSN becomes an M:N relationship
between T and RES.
At the end of the DRE process, the user has the option to review the extracted knowledge for
correctness. If modifications are necessary, the extraction and analysis process can be repeated. Some
of the screen snapshots providing system output are shown in Figures 11 and 12.

Figure 12: DRE screen snapshot displaying partial list of extracted relationships and their metadata.

A conceptual overview of the schema is represented by the entity-relationship diagram1 shown
in Figure 13.

Figure 13: E/R diagram representing the extracted schema.

4 Related Research
Research related to this work includes data reverse engineering and program analysis. The
state-of-the-art is summarized in the following paragraphs.
Data reverse engineering (DRE) refers to the inference of structure and meaning (e.g., schema,
relations, and semantics) from databases [7]. The DRE literature is divided into three areas: translation
algorithms and methodologies, tools, and application-specific experiences. Translation algorithm
development in early DRE efforts involved manual rearrangement or reformatting of data fields,
which is inefficient and error-prone [7]. The relational data model provided theoretical support for
research in automated discovery of relational dependencies [3]. In the early 1980s, focus shifted to
translating relations to E/R diagrams [18]. Given early successes with translation using the relational
data model, DRE translation was applied to flat file databases [3, 8] in domains such as enterprise
schemas [16].
Due to previous establishment of the E/R model as a conceptual tool, reengineering of legacy
RDBMS to yield E/R models motivated DRE in the late 1980s [9]. Information content analysis was
also applied to RDBMS, allowing more effective gathering of high-level information from data [2].
DRE in the 1990s was enhanced by cross-fertilization with software engineering. In [6] a
taxonomy for reverse engineering included DRE methodologies and also highlighted available DRE
tools. DRE formalisms were better defined, motivating increased DRE interaction with users [12].
The relational data model continued to support extraction of E/R and schema from RDBMS [5, 17].
Application focus emphasized legacy systems, including DoD applications [1].

1 The XML representation of the extracted knowledge in its entirety can be downloaded from
http://...

In the late 1990s, object-oriented DRE research addressed the discovery of objects in legacy systems
using function-, data-, and object-driven objectification [27]. Applications of DRE increased,
particularly in Y2K bug identification and remediation. Recent DRE focus is more applicative, e.g.,
mining large data repositories [10], analysis of legacy systems [14] or network databases [20], and
extraction of business rules from legacy systems [25]. Current research focuses on developing more
powerful DRE tools, refining heuristics to yield fewer missing constructs, and developing techniques
for reengineering legacy systems into distributed applications.
An important trend in knowledge discovery research is program analysis or program
comprehension. Originally, programmers sought to understand program functionality and impact of
proposed program modifications, leading to a close association with reverse engineering. SEEK uses
knowledge extraction to discover enterprise knowledge represented in application code. Several
automated approaches have been proposed for program comprehension. The most important
techniques include program slicing [15], cliche recognition [28], and pattern matching [22], together
with the more conventional approaches of lexical/syntactic analysis. Slicing identifies program
statements that directly affect variables of interest to the knowledge extraction process (e.g., input and
output variables and their dependents).
Slicing, a popular derivative of data flow analysis, can extract semantics from application code
when coupled with program dependence analysis and a language-independent program representation.
A program dependence graph (PDG) is a DAG whose vertices are assignment statements or
predicates of if-then-else or while constructs. Different edges represent control and data flow
dependences. Control flow edges, labeled true or false, lead to the entry point of the then or
else block, respectively. During program execution, if the value of the predicate matches the edge label,
then the node pointed to by the control flow edge represents the next executable statement.
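A toy PDG for a fragment like the earlier example, together with slice computation as backward reachability over its edges, might look like this (node labels and the graph encoding are illustrative):

```python
# Toy PDG for:  if (c < a) flag = 1;  print(flag);
# Control edges carry a truth label; data edges record def-use
# dependence between a definition and its use.
pdg = {
    "control": [("p1: c < a", "s1: flag = 1", True)],   # s1 runs when p1 is true
    "data":    [("s1: flag = 1", "s2: print(flag)")],   # s2 uses flag set at s1
}

def backward_slice(pdg, target):
    """A slice w.r.t. a statement is everything that can reach it
    along control or data dependence edges."""
    edges = [(u, v) for (u, v, *_) in pdg["control"]] + pdg["data"]
    sl, work = {target}, [target]
    while work:
        v = work.pop()
        for u, w in edges:
            if w == v and u not in sl:
                sl.add(u)
                work.append(u)
    return sl

print(sorted(backward_slice(pdg, "s2: print(flag)")))
```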
Pattern matching identifies interesting code patterns and their dependencies. For example,
conditional control structures such as if-then-else or case statements may encode business
rules, whereas type declarations and class/structure definitions can provide information about the names,
data types, and structure of concepts represented in a legacy source. Pattern matching is implemented
by transforming the source code and templates constructed from preselected patterns into syntax trees.
Coupled with analysis of program dependence graphs, a language-independent program
representation, slicing, and cliche recognition, pattern matching is a valuable tool for extracting
semantic information from application code.
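As a minimal illustration of the idea, conditional statements can be flagged as business-rule candidates. Here regular expressions stand in for the syntax-tree-based matcher, and the source snippet is hypothetical:

```python
import re

# Flag if-statements in C-like source as business-rule candidates
# (the real matcher compares syntax trees, not text).
src = """
if (cValue < aValue)
    flag = 1;
if (category == "brick")
    cost = 2500;
"""
rules = re.findall(r"if\s*\(([^)]*)\)\s*\n\s*([^;]+);", src)
for cond, action in rules:
    print(f"RULE: if ({cond}) then {action}")
```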

5 Summary and Discussion
We have manually tested a database reverse engineering approach for a number of scenarios and
domains (including construction, manufacturing and health care) to validate our knowledge extraction
algorithm and to estimate how much user input is required. In addition, we have conducted
experiments using nine different database applications that were created by graduate students during
course projects. The experimental results so far are encouraging: the DRE algorithm was able to
reverse engineer all of the sample legacy sources encountered so far. When coupled with semantic
analysis, human input is reduced compared to existing methods. The approach presented herein
provides the user with clues and guidelines that lead to the augmentation of the schema with additional
semantic knowledge.
The SEEK prototype is being extended using sample data from a large building construction
project on the University of Florida campus, in cooperation with the construction manager (Centex
Rooney Inc.) and several subcontractors or suppliers. This data testbed will support more rigorous
testing of the SEEK toolkit. Other plans for the SEEK toolkit are:
Develop a formal representation for the extracted knowledge.

Develop a matching tool capable of producing mappings between two semantically related
yet structurally different schemas. Currently, schema matching is performed manually,
which is a tedious, error-prone, and expensive process.
Integrate SEEK with a wrapper development toolkit to determine if the extracted knowledge
is sufficiently rich semantically to support compilation of legacy source wrappers for our
construction testbed.
We plan to extend the semantic matching tool to provide machine assistance to domain experts
for developing cross-application ontologies that are expected to support further automation of the DRE
process. The eventual system concept is that of a large, nearly-automatic system that can (1) acquire
large amounts of knowledge from multiple legacy systems, (2) extend and enhance its on-board
knowledge representation and characterization capabilities through ontology-based learning, and (3)
thus make each successive acquisition of knowledge from a legacy system easier and more accessible
to the SEEK user community.

Acknowledgements

This material is based upon work supported by the National Science Foundation under grant
numbers CMS-0075407 and CMS-0122193.

References

[1] P. Aiken, A. Muntz, and R. Richards, "DoD legacy systems: Reverse engineering data
requirements," Communications of the ACM, vol. 37, pp. 26-41, 1994.
[2] D. Boulanger and S. T. March, "An approach to analyzing the information content of existing
databases," Database, pp. 1-8, 1989.
[3] M. A. Casanova and J. E. A. d. Sa, "Designing entity-relationship schemas for conventional
information systems," presented at Third International Conference on Entity-Relationship
Approach, 1983.
[4] R. H. Chiang, "A knowledge-based system for performing reverse engineering of relational
database," Decision Support Systems, vol. 13, pp. 295-312, 1995.
[5] R. H. L. Chiang, T. M. Barron, and V. C. Storey, "Reverse engineering of relational databases:
Extraction of an EER model from a relational database," Data and Knowledge Engineering, vol.
12, pp. 107-142, 1994.
[6] E. J. Chikofsky and J. H. Cross, "Reverse engineering and design recovery: A taxonomy," IEEE
Software, vol. 7, pp. 13-17, 1990.
[7] K. H. Davis and P. Aiken, "Data reverse engineering: A historical survey," presented at IEEE
Seventh Working Conference on Reverse Engineering, 2000.
[8] K. H. Davis and A. K. Arora, "Methodology for translating a conventional file system into an
entity-relationship model," presented at Fourth International Conference on Entity-Relationship
Approach, 1985.
[9] K. H. Davis and A. K. Arora, "Converting a relational database model into an entity-relationship
model," presented at Sixth International Conference on Entity-Relationship Approach, 1987.
[10] H. Dayani-Fard and I. Jurisica, "Reverse engineering: A history, where we've been and what
we've done," presented at IEEE Fifth Working Conference on Reverse Engineering, 1998.
[11] J.-R. Gruser, L. Raschid, M. E. Vidal, and L. Bright, "Wrapper Generation for Web Accessible
Data Sources," presented at 3rd IFCIS International Conference on Cooperative Information
Systems, New York City, New York, USA, 1998.

[12] J.-L. Hainaut, "Database reverse engineering: Models, techniques, and strategies," presented at
10th International Conference on Entity-Relationship Approach, 1991.
[13] J. Hammer, H. Garcia-Molina, S. Nestorov, R. Yerneni, M. Breunig, and V. Vassalos,
"Template-Based Wrappers in the TSIMMIS System," SIGMOD Record (ACM Special Interest
Group on Management of Data), vol. 26, pp. 532-535, 1997.
[14] J. Hensley and K. H. Davis, "Gaining domain knowledge while data reverse engineering: An
experience report," presented at Data Reverse Engineering Workshop, EuroRef Seventh
Reengineering Forum, 2000.
[15] S. Horwitz and T. Reps, "The use of program dependence graphs in software engineering,"
presented at Fourteenth International Conference on Software Engineering, Melbourne,
Australia, 1992.
[16] A. C. Klug, "Entity-relationship views over uninterpreted enterprise schemas," presented at First
International Conference on the Entity-Relationship Approach, 1980.
[17] V. M. Markowitz and J. A. Makowsky, "Identifying extended entity-relationship object
structures in relational schemas," IEEE Transactions on Software Engineering, vol. 16, pp. 777-
790, 1990.
[18] M. A. Melkanoff and C. Zaniolo, "Decomposition of relations and synthesis of entity-
relationship diagrams," presented at First International Conference on the Entity-Relationship
Approach, 1980.
[19] Microsoft Corp., "Microsoft Project 2000 Database Design Diagram," 2000, dbdframe.htm.
[20] C.-H. Moh, E-P. Lim, and W-K. Ng., "Re-engineering structures from Web documents,"
presented at ACM International Conference on Digital Libraries 2000, 2000.
[21] Y. Papakonstantinou, A. Gupta, H. Garcia-Molina, and J. Ullman, "A Query Translation
Scheme for Rapid Implementation of Wrappers," presented at Fourth International Conference
on Deductive and Object-Oriented Databases, Singapore, 1995.
[22] S. Paul and A. Prakash, "A Framework for Source Code Search Using Program Patterns,"
Software Engineering, vol. 20, pp. 463-475, 1994.
[23] J.-M. Petit, F. Toumani, J.-F. Boulicaut, and J. Kouloumdjian, "Towards the Reverse
Engineering of Denormalized Relational Databases," presented at Twelfth International
Conference on Data Engineering (ICDE), New Orleans, LA, 1996.
[24] L. Rashid, "University of Maryland Wrapper Generation Project," 1997,
[25] J. Shao and C. Pound, "Reverse engineering business rules from legacy system," BT Technology
Journal, vol. 17, 1999.
[26] D. Wells, "Wrappers," 1996,
[27] T. Wiggerts, H. Bosma, and E. Fielt, "Scenarios for the identification of objects in legacy
systems," presented at IEEE Fourth Working Conference on Reverse Engineering, 1997.
[28] L. M. Wills, "Using attributed flow graph parsing to recognize cliches in programs," presented
at International Workshop on Graph Grammars and Their Application to Computer Science,

Appendix A. Grammar for C code used in Semantic Analysis Algorithm

[The grammar listing did not survive the scan of the original report and is largely illegible. The
recoverable fragments show productions for sequences of C statements (Statement+) and for SQL
statements embedded in C code, including: SELECT with a columnlist or hostvariablelist, FROM a
tablelist (or tablelistmod), and an optional WHERE SQLExpression; 'EXISTS' SQLExpression subqueries;
'GROUP' 'BY' and 'ORDER' 'BY' clauses; an INSERT form with 'VALUES' (hostvariablelist); a DELETE
form with 'WHERE' SQLExpression; an UPDATE form with 'SET' (SQLAssignment) list and 'WHERE'
SQLExpression; a 'CALL' Name (Expression list ',') production; and 'FOUND' and 'BY' Name clauses.]

Appendix B. Subset Test for Inclusion Dependency Detection.

We use the following subset test to determine whether an inclusion dependency exists
between attribute (or attribute set) U of relation R1 and attribute (or attribute set) V of R2.
Note that U and V must have the same data type and must include the same number of attributes. Our
test is based on the following SQL query templates, which are instantiated for the relations and
their attributes and are run against the legacy source.

C1 =
SELECT count (*)
FROM R1
WHERE U NOT IN (SELECT V FROM R2)

C2 =
SELECT count (*)
FROM R2
WHERE V NOT IN (SELECT U FROM R1)

If C1 is zero, we can deduce that there may exist an inclusion dependency R1.U << R2.V;
likewise, if C2 is zero, there may exist an inclusion dependency R2.V << R1.U. Note that it is possible
for both C1 and C2 to be zero. In that case, we can conclude that the two sets of attributes U and V are
identical, i.e., the inclusion dependency holds in both directions.
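As an illustration (not part of the original report), the subset test can be sketched in Python against an in-memory SQLite database. The relation and attribute names (Task, Project, ProjID) are hypothetical stand-ins for a legacy source.

```python
import sqlite3

def subset_test(conn, r1, u, r2, v):
    """Instantiate the C1/C2 count templates for attribute u of r1 and v of r2."""
    c1 = conn.execute(
        f"SELECT count(*) FROM {r1} WHERE {u} NOT IN (SELECT {v} FROM {r2})"
    ).fetchone()[0]
    c2 = conn.execute(
        f"SELECT count(*) FROM {r2} WHERE {v} NOT IN (SELECT {u} FROM {r1})"
    ).fetchone()[0]
    # c1 == 0 suggests R1.U << R2.V; c2 == 0 suggests R2.V << R1.U
    return c1 == 0, c2 == 0

# Hypothetical legacy relations: every Task references an existing Project,
# but one Project (ProjID = 3) has no Task yet.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Project (ProjID INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE Task (ProjID INTEGER, Name TEXT)")
conn.executemany("INSERT INTO Project VALUES (?)", [(1,), (2,), (3,)])
conn.executemany("INSERT INTO Task VALUES (?, ?)", [(1, "dig"), (2, "pour")])

print(subset_test(conn, "Task", "ProjID", "Project", "ProjID"))  # prints (True, False)
```

Here C1 is zero (Task.ProjID << Project.ProjID may hold) while C2 is nonzero, so only one direction of the candidate inclusion dependency survives the test.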

Appendix C. DTD Describing Extracted Knowledge.

<!ELEMENT ... (..., BusinessRules)>
<!ELEMENT ... (..., NullOption, IsUnique)>
<!ELEMENT ConceptualSchema (Entity+, Relationship+)>
<!ELEMENT Entity (Name, Type, Identifier)>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT Type (#PCDATA)>
<!ELEMENT Identifier (#PCDATA)>
<!ELEMENT Relationship (Name, Type, Participants)>
<!ELEMENT Participants (Participant+)>
<!ELEMENT Participant (ParticipantEntity, Cardinality)>
<!ELEMENT ParticipantEntity (#PCDATA)>
<!ELEMENT Cardinality (#PCDATA)>
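To make the DTD concrete, the following Python sketch (illustrative only; the Project/Task entities and the 1:N relationship are hypothetical examples, not taken from the report) builds a conforming ConceptualSchema fragment with the standard library:

```python
import xml.etree.ElementTree as ET

# Build a ConceptualSchema instance matching the element structure above.
schema = ET.Element("ConceptualSchema")

# Two hypothetical entities extracted from a legacy source.
for name, ident in [("Project", "ProjID"), ("Task", "TaskID")]:
    e = ET.SubElement(schema, "Entity")
    ET.SubElement(e, "Name").text = name
    ET.SubElement(e, "Type").text = "strong"
    ET.SubElement(e, "Identifier").text = ident

# One hypothetical 1:N relationship between them.
rel = ET.SubElement(schema, "Relationship")
ET.SubElement(rel, "Name").text = "has"
ET.SubElement(rel, "Type").text = "1:N"
parts = ET.SubElement(rel, "Participants")
for entity, card in [("Project", "1"), ("Task", "N")]:
    p = ET.SubElement(parts, "Participant")
    ET.SubElement(p, "ParticipantEntity").text = entity
    ET.SubElement(p, "Cardinality").text = card

xml_text = ET.tostring(schema, encoding="unicode")
print(xml_text)
```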
