Semantic Integration through Application Analysis

Material Information

TOPSAKAL, OGUZHAN ( Author, Primary )


Subjects / Keywords:
Data integration ( jstor )
Databases ( jstor )
Domain ontologies ( jstor )
Heuristics ( jstor )
Java ( jstor )
Legacies ( jstor )
OWL ( jstor )
Reverse engineering ( jstor )
Test data ( jstor )
Universities ( jstor )

Record Information

Source Institution:
University of Florida
Holding Location:
University of Florida
Rights Management:
Copyright Oguzhan Topsakal. Permission granted to University of Florida to digitize and display this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.
Resource Identifier:
660162351 ( OCLC )


Full Text







© 2007 Oguzhan Topsakal

To all who are [...], just and loving to others regardless of time, location
and status


I thought quite a lot about the time when I would finish my dissertation and write
this acknowledgment section. Finally, the time has come. Here is the place where I can
remember all the good memories and thank everyone who helped along the way. However,
I [...] words are not enough to show my gratitude to those who were there with me all
along the road to my Ph.D.

First of all, I give thanks to God for giving me the patience, strength and commitment
to come all this way.

I would like to give my sincere thanks to my dissertation advisor, Dr. Joachim
Hammer, who has been so kind and supportive to me. He was the perfect [...] for
me to work with. I also would like to thank my other committee members: Dr. Tuba
Yavuz-Kahveci, Dr. Christopher M. Jermaine, Dr. Herman Lam, and Dr. Raymond Issa
for serving on my committee.

Thanks to Umut Sargut, Z[...] Sargut, Can Ozturk, Fatih Buyukserin and Fatih
Gordu for making Gainesville a better place to live.

I am grateful to my parents, H[...] Topsakal and Sabahaddin Topsakal; to my
brother, Metehan Topsakal; and to my sister-in-law, Sibel Topsakal. They were always
there for me when I needed them, and they have always supported me in whatever I do.

My wife, E[...], and I joined our lives during the most hectic times of my Ph.D. studies,
and she supported me in every aspect. She is my treasure.



ACKNOWLEDGMENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT


1 INTRODUCTION

1.1 Problem Definition
1.2 Overview of the Approach
1.3 Contributions
1.4 Organization of the Dissertation


2.1 Legacy Systems
2.2 Data, Information, Semantics
2.3 Semantic Extraction
2.4 Reverse Engineering
2.5 Program Understanding Techniques
2.5.1 Textual Analysis
2.5.2 Syntactic Analysis
2.5.3 Program Slicing
2.5.4 Program Representation Techniques
2.5.5 Call Graph Analysis
2.5.6 Data Flow Analysis
2.5.7 Variable Dependency Graph
2.5.8 System Dependence Graph
2.5.9 Dynamic Analysis
2.6 Visitor Design Patterns
2.7 Ontology
2.8 Web Ontology Language (OWL)
2.9 WordNet
2.10 Similarity
2.11 Semantic Similarity Measures of Words
2.11.1 Resnik Similarity Measure
2.11.2 Jiang-Conrath Similarity Measure
2.11.3 Lin Similarity Measure
2.11.4 Intrinsic IC Measure in WordNet
2.11.5 Leacock-Chodorow Similarity Measure
2.11.6 Hirst-St.Onge Similarity Measure
2.11.7 Wu and Palmer Similarity Measure
2.11.8 Lesk Similarity Measure
2.11.9 Extended Gloss Overlaps Similarity Measure
2.12 Evaluation of WordNet-Based Similarity Measures
2.13 Similarity Measures for Text Data
2.14 Similarity Measures for Ontologies
2.15 Evaluation Methods for Similarity Measures
2.16 Schema Matching
2.16.1 Schema Matching Surveys
2.16.2 Evaluations of Schema Matching Approaches
2.16.3 Examples of Schema Matching Approaches
2.17 Ontology Mapping
2.18 Schema Matching vs. Ontology Mapping

3 APPROACH

3.1 Semantic Analysis
3.1.1 Illustrative Examples
3.1.2 Conceptual Architecture of Semantic Analyzer
Abstract syntax tree generator (ASTG)
Report template parser (RTP)
Information extractor (IEx)
Report ontology writer (ROW)
3.1.3 Extensibility and Flexibility of Semantic Analyzer
3.1.4 Application of Program Understanding Techniques in SA
3.1.5 Heuristics Used for Information Extraction
3.2 Schema Matching
3.2.1 Motivating Example
3.2.2 Schema Matching Approach
3.2.3 Creating an Instance of a Report Ontology
3.2.4 Computing Similarity Scores
3.2.5 Forming a Similarity Matrix
3.2.6 From Matching Ontologies to Schemas
3.2.7 Merging Results


4.1 Semantic Analyzer (SA) Prototype
4.1.1 Using JavaCC to generate parsers
4.1.2 Execution steps of the information extractor
4.2 Schema Matching by Analyzing ReporTs (SMART) Prototype


5.1 THALIA Website and Downloadable Test Package
5.2 DataExtractor (HTMLtoXML) Opensource Package
5.3 Classification of Heterogeneities
5.4 Web Interface to Upload and Compare Scores
5.5 Usage of THALIA

6 EVALUATION

6.1 Test Data
6.1.1 Test Data Set from THALIA Testbed
6.1.2 Test Data Set from University of Florida
6.2 Determining Weights
6.3 Experimental Evaluation
6.3.1 Running Experiments with THALIA Data
6.3.2 Running Experiments with ITF Data

7 CONCLUSION

7.1 Contributions
7.2 Future Directions

REFERENCES

BIOGRAPHICAL SKETCH


Table

2-1 List of relations used to connect senses in WordNet.

2-2 Absolute values of the coefficients of correlation between human ratings of similarity
and the five computational measures.

3-1 Semantic Analyzer can transfer information from one method to another through
variables and can use this information to discover semantics of a schema element.

3-2 Output string gives clues about the semantics of the variable following it.

3-3 Output string and the variable may not be in the same statement.

3-4 Output strings before the slicing variable should be concatenated.

3-5 Tracing back the output text and associating it with the corresponding column
of table.

3-6 Associating the output text with the corresponding column in the where-clause.

3-7 Column header describes the data in that column.

3-8 Column on the left describes the data items listed to its immediate right.

3-9 Column on the left and the header immediately above describe the same set of
data items.

3-10 Set of data items can be described by two different headers.

3-11 Header can be processed before being associated with the data on a column.

4-1 Subpackage in the sa package and their functionality.

6-1 The 10 university catalogs selected for evaluation and size of their schemas.

6-2 Portion of a table description from the College of Engineering, the Bridges Project
and the Business School schemas.

6-3 Names of tables in the College of Engineering, the Bridges Office, and the Business
School schemas and number of schema elements that each table has.

6-4 Weights found by analytical method for different similarity functions with THALIA
test data.

6-5 Confusion matrix.


Figure

1-1 Scalable Extraction of Enterprise Knowledge (SEEK) Architecture.

3-1 Scalable Extraction of Enterprise Knowledge (SEEK) Architecture.

3-2 Schema used by an application.

3-3 Schema used by a report.

3-4 Conceptual view of the Data Reverse Engineering (DRE) module of the Scalable
Extraction of Enterprise Knowledge (SEEK) prototype.

3-5 Conceptual view of Semantic Analyzer (SA) component.

3-6 Report design template example.

3-7 Report generated when the above template was run.

3-8 Java Servlet generated HTML report showing course listings of CALTECH.

3-9 Annotated HTML page generated by analyzing a Java Servlet.

3-10 Inter-procedural call graph of a program source code.

3-11 Schemas of two data sources that collaborate for a new online degree program.

3-12 Reports from two sample universities listing courses.

3-13 Reports from two sample universities listing instructor offices.

3-14 Similarity scores of schema elements of two data sources.

3-15 Five steps of Schema Matching by Analyzing ReporTs (SMART) algorithm.

3-16 Unified Modeling Language (UML) diagram of the Schema Matching by Analyzing
ReporTs (SMART) report ontology.

3-17 Example for a similarity matrix.

3-18 Similarity scores after matching report pairs about course listings.

3-19 Similarity scores after matching report pairs about instructor offices.

4-1 Java Code size distribution of (Semantic Analyzer) SA and (Schema Matching
by Analyzing ReporTs) SMART packages.

4-2 Using JavaCC to generate parsers.

5-1 Snapshot of Test Harness for the Assessment of Legacy information Integration
Approaches (THALIA) website.

5-2 Snapshot of the computer science course catalog of Boston University.

5-3 Extensible Markup Language (XML) representation of Boston University's course
catalog and corresponding schema file.

5-4 Scores uploaded to Test Harness for the Assessment of Legacy information Integration
Approaches (THALIA) benchmark for Integration Wizard (IWiz) Project at
the University of Florida.

6-1 Report design practice where all the descriptive texts are headers of the data.

6-2 Report design practice where all the descriptive texts are on the left hand side
of the data.

6-3 Architecture of the databases in the College of Engineering.

6-4 Results of the SMART with Jiang-Conrath (JCN), Lin and Levenshtein metrics.

6-5 Results of COmbination of MAtching algorithms (COMA++) with All Context
and Filtered Context combined matchers and comparison of SMART and COMA++
results.

6-6 Receiver Operating Characteristics (ROC) curves of SMART and COMA++
for THALIA test data.

6-7 Results of the SMART with different report pair similarity thresholds for ITF
test data.

6-8 F-Measure results of SMART and COMA++ for ITF test data when report pair
similarity is set to 0.7.

6-9 Receiver Operating Characteristics (ROC) curves of the SMART for ITF test
data.

6-10 Comparison of the ROC curves of the SMART and COMA++ for ITF test data.

Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy



Oguzhan Topsakal

[...] 2007

Chair: Joachim Hammer
Major: Computer Engineering

Organizations increasingly need to participate in rapid collaborations with other
organizations to be successful [...] data in such collaborations. One of the problems that
needs to be solved when integrating [...] data sources is finding semantic correspondences
between elements of schemas of disparate data sources (a.k.a. schema matching). Schemas,
even those from the same domain, show many semantic heterogeneities. Resolving these
heterogeneities is [...] done manually, which is tedious, time consuming, and expensive.
Current approaches to automating the process mainly use the schemas and the data as [...] to
semantic heterogeneities. However, the schemas and the data are not sufficient
sources of semantics. In contrast, we analyze a valuable source of semantics, namely
application source code and report design templates, to improve schema matching for
information integration. Specifically, we analyze application source code that generates
reports to present the data of the organization in a user friendly way. We trace the
descriptive information on a report back to the corresponding schema element(s) through
reverse engineering of the application source code or report design templates and store
the descriptive text, data, and the corresponding schema elements in a report ontology
instance. We utilize the information we have discovered for schema matching. Our
experiments using a fully functional prototype system show that our approach produces
more accurate results than current techniques.


1.1 Problem Definition

The success of many organizations largely depends on their ability to participate in

rapid, flexible, limited-time collaborations. The need to collaborate is not just limited

to business but also applies to government and non-profit organizations such as military,

emergency management, health-care, rescue, etc. The success of a business organization

depends on its ability to rapidly customize its products, adapt to continuously changing

demands, and reduce costs as much as possible. Government organizations, such as the

Department of Homeland Security, need to collaborate and exchange intelligence to

maintain the security of its borders or to protect critical infrastructure, such as energy

supply and telecommunications. Non-profit organizations, such as the American Red

Cross, need to collaborate on matters related to public health in catastrophic events, such

as hurricanes. The collaboration of organizations produces a synergy to achieve a common

goal that would not be possible otherwise.

Organizations participating in a rapid, flexible collaboration environment need to

share and exchange data. In order to share and exchange data, organizations need to

integrate their information systems and resolve heterogeneities among their data sources.

The heterogeneities exist at different levels. There exist physical heterogeneities at the

system level because of differences between various internal data storage, retrieval, and

representation methods. For example, some organizations might use professional database

management systems while others might use simple flat files to store and represent their

data. In addition, there exist structural (syntax)-level heterogeneities because of the

differences at the schema level. Finally, there exist semantic level heterogeneities because

of the differences in the use of the data which correspond to the same real-world objects

[47]. We face a broad range of semantic heterogeneities in information systems because of

different viewpoints of designers of these information systems. Semantic heterogeneity is

simply a consequence of the independent creation of the information systems [44].

To resolve semantic heterogeneities, organizations must first identify the semantics of

their data elements in their data sources. Discovering the semantics of data automatically

has been an important area of research in the database community [22, 36]. However, the

process of resolving semantic heterogeneity of data sources is still mostly done manually.

Resolving heterogeneities manually is a tedious, error-prone, time-consuming, non-scalable

and expensive task. The time and investment needed to integrate data sources become a

significant barrier to information integration of collaborating organizations.

In this research, we are developing an integrated novel approach that automates the

process of semantic discovery in data sources to overcome this barrier and to help rapid,

flexible collaboration among organizations. As mentioned above, we are aware that there

exist physical heterogeneities among information sources but to keep the dissertation

focused, we assume data storage, retrieval and representation methods are the same

among the information systems to be integrated. In our experience as a software

developer in the information technology departments of several banks and software

companies, application source code that generates reports encapsulates valuable

information about the semantics of the data to be integrated. Reports present data

from the data source in a way that is easily comprehensible to the user and can be a rich

source of semantics. We analyze application source code to discover semantics to facilitate

integration of information systems. We outline the approach in Section 1.2 below and

provide a more detailed explanation in Sections 3.1 and 3.2. The research described in

this dissertation is a part of the NSF-funded SEEK (Scalable Extraction of Enterprise

Knowledge) project, which also serves as a testbed.

1 The SEEK project is supported by the National Science Foundation under grant
numbers CMS-0075407 and CMS-0122193.

1.2 Overview of the Approach

The results described in this dissertation are based on the work we have done on the

SEEK project. The SEEK project is directed at overcoming the problems of integrating

legacy data and knowledge across the participants of a collaboration network [45]. The

goal of the SEEK project is to develop methods and theory to enable rapid integration

of legacy sources for the purpose of data sharing. We apply these methodologies in the

SEEK toolkit, which allows users to develop SEEK wrappers. A wrapper translates queries

from an application to the data source schema at run-time. SEEK wrappers act as an

intermediary between the legacy source and decision support tools which require access to

the organization's knowledge.

Figure 1-1: Scalable Extraction of Enterprise Knowledge (SEEK) Architecture. [Diagram: for organizations A and B, Data Reverse Engineering (DRE) with a Schema Extractor (SE) and a Semantic Analyzer (SA) produces each knowledgebase; the Schema Matcher (SM) outputs mappings and the Wrapper Generator (WG) outputs wrappers.]

In general, SEEK [45, 46] works in three steps: Data Reverse Engineering (DRE),

Schema Matching (SM), and Wrapper Generation (WG). In the first step, the Data Reverse

Engineering (DRE) component of SEEK generates a detailed description of the legacy

source. DRE has two sub-components, Schema Extractor (SE) and Semantic Analyzer

(SA). SE extracts the conceptual schema of the data source. SA analyzes electronically

available information sources such as application code and discovers the semantics of

schema elements of the data source. In other words, SA discovers mappings between data

items stored in an information system and the real-world objects they represent by using

the pieces of evidence that it extracts from the application code. SA enhances the schema

of the data source with the discovered semantics, and we refer to the semantically enhanced

schema as the knowledgebase of the organization. In the second step, the Schema Matching (SM)

component maps the knowledgebase of one organization to the knowledgebase of another

organization. In the third step, the extracted legacy schema and the mapping rules

provide the input to the Wrapper Generator (WG), which produces the source wrapper.

These three steps of SEEK are build-time processes. At run-time, the source wrapper

translates queries from the application domain model to the legacy source schema. A

high-level schematic view outlining the SEEK components and their interactions is shown

in Figure 1-1.
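
The three build-time steps and the run-time role of a wrapper can be pictured with a small sketch. It is an illustration only, under stated assumptions: the function names, the dictionary-based knowledgebase, the sample schema elements, and the exact-match rule used for matching are all hypothetical and are not the SEEK implementation.

```python
"""Toy illustration of the three SEEK build-time steps (DRE, SM, WG).

All names and data structures here are hypothetical stand-ins; the real
SEEK components are far richer than these few lines.
"""


def data_reverse_engineering(schema, semantics):
    """DRE: attach discovered semantics (SA output) to the extracted schema (SE output)."""
    return {element: semantics.get(element, "") for element in schema}


def schema_matching(kb_a, kb_b):
    """SM: map elements whose descriptions are equal (a stand-in for similarity scoring)."""
    return {
        a: b
        for a, desc_a in kb_a.items()
        for b, desc_b in kb_b.items()
        if desc_a and desc_a.lower() == desc_b.lower()
    }


def wrapper_generation(mappings):
    """WG: build a run-time wrapper that rewrites element names in a query."""
    def wrapper(query_elements):
        return [mappings.get(element, element) for element in query_elements]
    return wrapper


# Build time: one knowledgebase per organization, then mappings, then a wrapper.
kb_a = data_reverse_engineering(
    ["CRS_NM", "INSTR"], {"CRS_NM": "Course Title", "INSTR": "Instructor"}
)
kb_b = data_reverse_engineering(
    ["title", "room"], {"title": "course title", "room": "Office"}
)
mappings = schema_matching(kb_a, kb_b)
translate = wrapper_generation(mappings)

# Run time: the wrapper translates element names used in a query.
print(translate(["CRS_NM", "INSTR"]))
```

Only `CRS_NM` is rewritten here, because it is the only element whose description has a counterpart in the other knowledgebase; unmatched elements pass through unchanged.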

In this research, our focus is on the Semantic Analysis (SA) and Schema Matching

(SM) methodology. We first describe how SA extracts semantically rich outputs from the

application source code and then relates them with the schema knowledge extracted by

the Schema Extractor (SE). We show that we can gather significant semantic information

from the application source code by the methodology we have developed. We then focus

on our Schema Matching (SM) methodology. We describe how we utilize the semantic

information that we have discovered by SA to find mappings between two data sources.

The extracted semantic information and the mappings can then be used by the subsequent

wrapper generation step to facilitate the development of legacy source translators and

other tools during information integration, which is not the focus of this dissertation.
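
The idea of relating report output to schema elements can be sketched in a few lines. This is a deliberately simplified illustration: the three Java-like statements, the column name `CRS_NM`, and the two regular expressions are assumptions made for this example. The Semantic Analyzer described in this dissertation works on parsed source code, not on regular expressions.

```python
"""Toy sketch: link the text printed before a variable to the column it was read from.

The statement format and the regular expressions are assumptions made
for this illustration only.
"""
import re

# Hypothetical report-generating code, reduced to three Java-like statements.
SOURCE = [
    'ResultSet rs = stmt.executeQuery("SELECT CRS_NM FROM COURSE");',
    'String v = rs.getString("CRS_NM");',
    'out.println("Course Title: " + v);',
]

ASSIGN = re.compile(r'String (\w+) = rs\.getString\("(\w+)"\);')
OUTPUT = re.compile(r'out\.println\("([^"]*)" \+ (\w+)\);')


def trace_labels(statements):
    """Return {column name: descriptive text printed next to its value}."""
    column_of = {}   # variable name -> column it was read from
    labels = {}      # column name -> output text describing it
    for statement in statements:
        assigned = ASSIGN.search(statement)
        if assigned:
            variable, column = assigned.groups()
            column_of[variable] = column
        printed = OUTPUT.search(statement)
        if printed:
            text, variable = printed.groups()
            if variable in column_of:
                # Drop the trailing colon and spaces of the printed label.
                labels[column_of[variable]] = text.strip(" :")
    return labels


print(trace_labels(SOURCE))
```

The sketch follows the variable `v` from the output statement back to the column it was read from, so the label printed on the report becomes a description of that schema element.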

1.3 Contributions

In this research, we introduce novel approaches for semantic analysis of application

source code and for matching of related but disparate schemas. In this section, we list the

contributions of this work. We describe these contributions in detail in Chapter 7 while

concluding the dissertation.

External information sources such as corpora of schemas and past matches have been

used for schema matching, but application source code has not yet been used as an external

information source [25, 28, 78]. In this research, we focus on this well-known but not

yet addressed challenge of analyzing application source code for the purpose of semantic

extraction for schema matching. The accuracy of the current schema matching approaches

is not sufficient for fully automating the process of schema matching [26]. The approach

we present in this dissertation provides better accuracy for the purpose of automatic

schema matching.

Schema matching approaches so far have mostly used lexical similarity

functions or look-up tables to determine the similarity of two schema element properties

(for example, the names and types of schema elements). There have been suggestions

to utilize semantic similarity measures between words [7], but they have not been realized.

We utilize state-of-the-art semantic similarity measures between words to determine

similarities and show their effect on the results.
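
The limitation of purely lexical similarity can be seen in a short sketch. It uses a normalized Levenshtein (edit) distance as the lexical measure; the normalization by the longer string and the example words are assumptions chosen for illustration, not the functions used in our prototype.

```python
"""Lexical similarity via normalized Levenshtein distance (illustration only)."""


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def lexical_similarity(a, b):
    """Map edit distance to a score in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a.lower(), b.lower()) / longest


# Near-identical spellings score high ...
print(lexical_similarity("instructor", "instructors"))
# ... but synonyms with different spellings score low.
print(lexical_similarity("instructor", "teacher"))
```

A lexical function rates "instructor" and "teacher" as dissimilar even though they describe the same concept, which is the gap that word-level semantic similarity measures are meant to close.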

Another important contribution is the introduction of a generic similarity function for

matching classes of ontologies. We have also described how we determine the weights of

our similarity function. Our similarity function along with the methodology to determine

the weights of the function can be applied on many domains to determine similarities

between different entities.
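
The exact form of the similarity function is not given in this section. As a rough sketch only, assume that the overall score is a weighted average of per-property scores. The property names, the weights, and the exact-match scorer below are hypothetical placeholders.

```python
"""Sketch of a weighted similarity function over properties of two entities.

Assumption: the overall score is a weighted average of per-property
scores. Property names, weights and the scorer are hypothetical.
"""


def exact_match(a, b):
    """Simplest per-property scorer: 1.0 if equal ignoring case, else 0.0."""
    return 1.0 if a.lower() == b.lower() else 0.0


def weighted_similarity(entity_a, entity_b, weights, scorer=exact_match):
    """Combine per-property scores, normalizing by the sum of the weights."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must have a positive sum")
    score = 0.0
    for prop, weight in weights.items():
        score += weight * scorer(entity_a.get(prop, ""), entity_b.get(prop, ""))
    return score / total_weight


a = {"description": "Course Title", "type": "string", "name": "CRS_NM"}
b = {"description": "course title", "type": "String", "name": "title"}
weights = {"description": 0.6, "type": 0.2, "name": 0.2}
print(weighted_similarity(a, b, weights))
```

Because the scorer is passed in as a parameter, the same combination step can be reused with a lexical or a semantic word-similarity function in place of `exact_match`.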

Integration based on user reports eases the communication between business and

information technology (IT) specialists. Business and IT specialists often have difficulty

understanding each other. They can discuss the data presented in

reports rather than incomprehensible database schemas. Analyzing reports

for data integration and sharing helps business and IT specialists communicate better.

Another contribution is the functional extensibility of our semantic analysis

methodology. Our information extraction framework lets researchers add new functionality

as they develop new heuristics and algorithms on the source code being analyzed. Our

current information extraction techniques provide improved performance because they require fewer

passes over the source code, and improved accuracy because they eliminate unused code

fragments (i.e., methods, procedures).

While conducting the research, we saw a need for available test data of

sufficient richness and volume to allow meaningful and fair evaluations between different

information integration approaches. To address this need, we developed the THALIA (Test

Harness for the Assessment of Legacy information Integration Approaches) benchmark,

which provides researchers with a collection of over 40 downloadable data sources

representing university course catalogs, a set of twelve benchmark queries, as well as

a scoring function for ranking the performance of an integration system [47, 48].

1.4 Organization of the Dissertation

The rest of the dissertation is organized as follows. We introduce important

concepts of the work and summarize research in Chapter 2. Chapter 3 describes our

semantic analysis approach and schema matching approach. Chapter 4 describes the

implementation details of our prototype. Before we describe the experimental evaluation

of our approach in Chapter 6, we describe the THALIA test bed in Chapter 5. Chapter 7

concludes the dissertation and summarizes the contributions of this work.

2 THALIA website: http://www.cise.ufl.edu/project/thalia.html


In the context of this work, we have explored a broad range of research areas.

These research areas include but are not limited to data semantics, semantic discovery,

semantic extraction, legacy system understanding, reverse engineering of application

code, information extraction from application code, semantic similarity measures, schema

matching, ontology extraction and ontology mapping, etc. While developing our approach,

we leverage these research areas.

In this chapter, we introduce important concepts and related research that are

essential for understanding the contributions of this work. Whenever necessary, we provide

our interpretations of definitions and commonly accepted standards and conventions in

this field of study. We also present the state-of-the-art in the related research areas.

We first introduce what a legacy system is. Then we state the difference between

frequently used terms data, information and semantics in Section 2.2. We point out some

of the research in semantic extraction in Section 2.3. Since we extract semantics through

reverse engineering of application source code, we provide the definitions of reverse

engineering of source code and database reverse engineering in Section 2.4 and describe

the techniques for program understanding in Section 2.5. We represent the extracted

information from application source code of different legacy systems in ontologies and

utilize these ontologies to find out semantic similarities between them. For this reason,

semantic similarity measures are also important for us. We have explored the research

on semantic similarity measures and present this work in Section 2.11 after giving

the definition of similarity in Section 2.10. We aim to leverage the research on assessing

similarity scores between texts and ontologies. We present these techniques in Sections

2.13 and 2.14. We then present the ontology concept and the Web

Ontology Language (OWL). Finally, we present ontology mapping and schema matching

and conclude the chapter by presenting some outstanding techniques of schema matching

in the literature.

2.1 Legacy Systems

Our approaches for semantic analysis of application source code and schema

matching have been developed as a part of the SEEK project. The SEEK project aims to

help in understanding legacy systems. We analyze the application source code of a legacy

system to understand its semantics and apply the gained knowledge to solve the schema

matching problem of data integration. In this section, we first give a broad definition of a

legacy system and highlight its importance and then provide its definition in the context

of this work.

Legacy systems are generally known as inflexible, nonextensible, undocumented, old

and large software systems which are essential for the organization's business [12, 14, 75].

They significantly resist modifications and changes. Legacy systems are very valuable

because they are the repository of corporate knowledge collected over a long time and they

also encapsulate the logic of the organization's business processes [49].

A legacy system is generally developed and maintained by many different people with

many different programming styles. Mostly, the original programmers have left, and the

existing team is not expert in all aspects of the system [49]. Even though documentation

about the design and specification of the legacy system once existed, the

original software specification and design have changed, and the documentation was

not updated throughout the years of development and maintenance. Thus, understanding

is lost, and the only reliable documentation of the system is the application source code

running on the legacy system [75].

In the context of this work, we define legacy systems as any information system with

poor or nonexistent documentation about the underlying data or the application code

that is using the data. Despite the fact that legacy systems are often interpreted as old

systems, for us, an information system is not required to be old in order to be considered

as legacy.

2.2 Data, Information, Semantics

In this section, we give definitions of data, information and semantics before we

explore some research on semantic extraction in the following section.

According to a simplistic definition, data is the raw, unprocessed input to an

information system that produces information as an output. A commonly accepted

definition states that data is a representation of facts, concepts or instructions in a

formalized manner suitable for communication, interpretation, or processing by humans

or by automatic means [2, 18]. Data mostly consists of disconnected numbers, words,

symbols, etc. and results from measurable events or objects.

Data has a value when it is processed, changed into a usable form and placed in a

context [2]. When data has a context and has been interpreted, it becomes information.

Then it can be used purposefully as information [1].

Semantics is the meaning and the use of data. Semantics can be viewed as a mapping

between an object stored in an information system and the real-world object it represents.


2.3 Semantic Extraction

In this section, we first state the importance of semantic extraction and application

source code as a rich source for semantic extraction and then point out several related

research efforts in this research area.

Sheth et al. [87] stated that data semantics does not seem to have a purely

mathematical or formal model and cannot be discovered completely and fully automatically.

Therefore, the process of semantic discovery requires human involvement. Besides being

human-dependent, semantic extraction is a time-consuming and hence expensive task

[36]. Although it cannot be fully automated, discovering even a limited

amount of useful semantics can tremendously reduce the cost of understanding a system.

Semantics can be found from knowledge representation schemas, communication protocols,

and applications that use the data [87].

Throughout the discussions and research on semantic extraction, application source

code has been proposed as a rich source of information [30, 36, 87]. Besides, researchers

have agreed that the extraction of semantics from application source code is essential for

identification and resolution of semantic heterogeneity.

We use the discovered semantics from application source code to find correspondence

between schemas of disparate data sources automatically. In this context, discovering

semantics means gathering information about the data, so that a computer can identify

mappings (paths) between corresponding schema elements in different data sources.

Jim Ning et al. worked on extracting semantics from application source code but with

a slightly different aim. They developed an approach to identify and recover reusable code

components [67]. They investigated conditional statements as we do to find out business

rules. They stated that conditional statements are potential business rules. They also gave

importance to input and output statements for highlighting semantics inside the code,

and stated that meaningful business functions normally process input values and produce

results. Jim Ning et al. referred to the investigation of input variables as forward slicing and to

the investigation of output statements as backward slicing. The drawback of their approach was

that it was very language-specific (COBOL) [67].

N. Ashish et al. worked on extracting semantics from internet information sources to

enable semi-automatic wrapper generation [5]. They used several heuristics to identify

important tokens and structures of HTML pages in order to create the specification for

a parser. Similar to our approach, they benefited from parser generation tools, namely

YACC [53] and LEX [59], for semantic extraction.

There are several related works in information extraction from text that deal with

tables and ontology extraction from tables. The most relevant work about information

extraction from HTML pages by the help of heuristics was done by Wang and Lochovsky

[94]. They aimed to form the schema of the data extracted from an HTML page by using

labels of a table on an HTML page. The heuristic that they use to relate labels to the

data and to separate data found in a table cell into several attributes is very similar to

our heuristics. For example, they assume that if several attributes are encoded into one

text string, then there should be some special symbol(s) in the string as the separator

to visually support users to distinguish the attributes. They also use heuristics to relate

labels to the data from an HTML page that are similar to our heuristics. Buttler et al.

[17] and Embley et al. [32] also developed heuristic-based approaches for information

extraction from HTML pages. However, their aim was to identify boundaries of data on

an HTML page. Embley et al. [33] also worked on table recognition from documents and

suggested a table ontology which is very similar to our report ontology. In a related work,

Tijerino et al. [90] introduced an information extracting system called TANGO which

recognizes tables based on a set of heuristics, forms mini-ontologies and then merges these

ontologies to form a larger application ontology.

2.4 Reverse Engineering

Without the understanding of the system, in other words without the accurate

documentation of the system, it is not possible to maintain, extend, and integrate the

system with other systems [76, 89, 95]. The methodology to reconstruct this missing

documentation is reverse engineering. In this section, we first give the definition of reverse

engineering in general and then give definitions of program reverse engineering and

database reverse engineering. We also state the importance of these tasks.

Reverse engineering is the process of analyzing a technology to learn how it was

designed or how it works. Chikofsky and Cross [19] defined reverse engineering as the

process of analyzing a subject system to identify the system's components and their

interrelationships and as the process of creating representations of the system in another

form or at a higher level of abstraction. Reverse engineering is an action to understand

the subject system and does not include the modification of it. The opposite of reverse

engineering is forward engineering. Forward engineering is the traditional process of

moving from high-level abstractions and logical, implementation-independent designs to

the physical implementation of a system [19]. While reverse engineering starts from the

subject system and aims to identify the high-level abstraction of the system, forward

engineering starts from the specification and aims to implement the subject system.

Program (software) reverse engineering is recovering the specifications of the software

from source code [49]. The recovered specifications can be represented in forms such as

data flow diagrams, flow charts, specifications, hierarchy charts, call graphs, etc. [75]. The

purpose of program reverse engineering is to enhance our understanding of the software of

the system to reengineer, restructure, maintain, extend or integrate the system [49, 75].

Database Reverse Engineering (DBRE) is defined as identifying the possible

specification of a database implementation [22]. It mainly deals with schema extraction,

analysis and transformation [49]. Chikofsky and Cross [19] defined DBRE as a process

that aims to determine the structure, function and meaning of the data of an organization.

Hainaut [41] defined DBRE as the process of recovering the schema(s) of the database

of an application from data dictionary and program source code that uses the data. The

objective of DBRE is to recover the technical and conceptual descriptions of the database.

It is a prerequisite for several activities such as maintenance, reengineering, extension,

migration, and integration. DBRE can produce an almost complete abstract specification

of an operational database while program reverse engineering can only produce partial

abstractions that can help better understand a program [22, 42].

Many data structures and constraints are embedded inside the source code of

data-oriented applications. If a construct or a constraint has not been declared explicitly

in the database schema, it is implemented in the source code of the application that

updates or queries the database. The data in the database is a result of the execution of

the applications of the organization [49]. Even though the data satisfies the constraints of

the database, it is verified by the validation mechanisms inside the source code before it

is written to the database to ensure that it does not violate the constraints. We

can discover some constraints, such as referential constraints, by analyzing the application

source code, even if the application program only queries the data but does not modify

it. For instance, if there exists a referential constraint (foreign key relation) between the

entity named E1 and entity named E2, this constraint is used to join the data of these two

entities with a query. We can discover this referential constraint by analyzing the query

[50]. Since program source code is a very useful source of information in which we can

discover a lot of implicit constructs and constraints, we use it as an information source for DBRE.
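The idea of recovering a referential constraint from a query can be illustrated with a minimal sketch. It assumes the embedded SQL is available as a string and that join conditions are written as `table.column = table.column`; the function name `candidate_foreign_keys` and the column names `id` and `e1_id` are hypothetical, and a regular expression stands in for a real SQL parser.

```python
import re

# Assumption: join predicates appear as "table.column = table.column".
JOIN_PREDICATE = re.compile(r"(\w+)\.(\w+)\s*=\s*(\w+)\.(\w+)")

def candidate_foreign_keys(sql):
    """Return (table, column, table, column) tuples for cross-table equalities."""
    candidates = []
    for t1, c1, t2, c2 in JOIN_PREDICATE.findall(sql):
        if t1 != t2:  # an equality inside one table is not a join condition
            candidates.append((t1, c1, t2, c2))
    return candidates

# Hypothetical query joining the entities E1 and E2 from the text.
query = ("SELECT E1.name, E2.total FROM E1, E2 "
         "WHERE E1.id = E2.e1_id AND E2.total > 100")
print(candidate_foreign_keys(query))
```

Each tuple is only a candidate: whether `E2.e1_id` really references `E1.id` would still have to be confirmed against the data or by an expert.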


It is well known that the analysis of program source code is a complex and tedious

task. However, we do not need to recover the complete specification of the program

for DBRE. We are looking for information to enhance the schema and to find the

undeclared constraints of the database. In this process, we benefit from several program

understanding techniques to extract information effectively. We provide the definitions of

the program understanding and its techniques in the following section.

2.5 Program Understanding Techniques

In this section, we introduce the concept of program understanding and its techniques.

We have implemented these techniques to analyze application source code to extract

semantic information effectively.

Program understanding (a.k.a. program comprehension) is the process of acquiring

knowledge about an existing, generally undocumented, computer program. The knowledge

acquired about the business processes through the analysis of the source code is accurate

and up-to-date because the source code is used to generate the application that the

organization uses.

Basic actions that can be taken to understand a program are to read its documentation,

to ask its users for assistance, to read its source code, or to run

the program to see what it outputs for specific inputs [50]. Besides these actions, there

are several techniques that we can apply to understand a program. These techniques

help the analyst to extract high-level information from low-level code to come to a better

understanding of the program. These techniques are mostly performed manually. However,

we apply these techniques in our semantic analyzer module to automatically extract

information from data-oriented applications. We show how we apply these techniques

in our semantic analyzer in Section 3.1.5. We describe the main program understanding

techniques in the following subsections.

2.5.1 Textual Analysis

One simple way to analyze a program is to search for a specific string in the program

source code. This searched string can be a pattern or a cliché. The program understanding

technique that searches for a pattern or a cliché is called pattern matching or cliché

recognition. A pattern can include wildcards and character ranges and can be based on other

defined patterns. A cliché is a commonly used programming pattern. Examples of clichés

are algorithmic computations, such as list enumeration and binary search, and common

data structures, such as priority queue and hash table [49, 97].
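Pattern matching over source text can be sketched with a regular expression. The example below is only an illustration: the Java-like snippet in `SOURCE` and the choice of "string literal that starts with an SQL keyword" as the pattern are assumptions, not the patterns used by the semantic analyzer.

```python
import re

# Hypothetical Java-like source text to be searched.
SOURCE = '''\
String q = "SELECT name FROM employee WHERE id = ?";
System.out.println("Employee Name: " + name);
int total = count + 1;
'''

# Pattern built from alternatives and a character class: any string
# literal whose content begins with an SQL keyword.
SQL_LITERAL = re.compile(r'"\s*(SELECT|INSERT|UPDATE|DELETE)\b[^"]*"', re.IGNORECASE)

def find_sql_literals(source):
    """Return (line_number, matched_text) for every embedded SQL string."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        for match in SQL_LITERAL.finditer(line):
            hits.append((number, match.group(0)))
    return hits

for number, text in find_sql_literals(SOURCE):
    print(number, text)
```

Textual analysis of this kind is cheap but blind to program structure, which is why it is usually combined with the syntactic techniques described next.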

2.5.2 Syntactic Analysis

Syntactic analysis is performed by a parser that decomposes a program into

expressions and statements. The result of the parser is stored in a structure called abstract

syntax tree (AST). An AST is a type of representation of source code that facilitates

the usage of tree traversal algorithms and it is the basis of most sophisticated program

analysis tools [49].
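As a small illustration of an AST and its traversal, the sketch below uses Python's standard `ast` module as a stand-in for a parser generated for the analyzed language. The three-line program in `CODE` and the helper names are assumptions made for the example.

```python
import ast

# Hypothetical program to be parsed.
CODE = "total = price * quantity\nif total > 100:\n    discount = 0.1\n"

tree = ast.parse(CODE)  # the parser returns the root of the AST

def node_types(tree):
    """List the statement node types at the top level of the module."""
    return [type(statement).__name__ for statement in tree.body]

def assigned_names(tree):
    """Traverse the whole tree and collect every assignment target."""
    names = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name):
                    names.append(target.id)
    return names

print(node_types(tree))
print(sorted(assigned_names(tree)))
```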

2.5.3 Program Slicing

Program slicing is a technique to extract the statements from a program relevant to a

particular computation, specific behavior or interest such as a business rule [75]. The slice

of a program with respect to program point p and variable V consists of all statements

and predicates of the program that might affect the value of V at point p [96]. Program

slicing is used to reduce the scope of program analysis [49, 83]. The slice that affects the

value of V at point p is computed by gathering statements and control predicates by

way of a backward traversal of the program, starting at the point p. This kind of slicing

is also known as backward slicing. When we retrieve statements that can potentially be

affected by the variable V starting from a point p, we call it forward slicing. Forward and

backward slicing are both types of static slicing because they use only statically available

information (source code) for computing.
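Backward and forward slicing can be sketched for the simplest case, a straight-line program without branches. The representation of a statement as (line, defined variable, used variables) and the sample program are assumptions; control predicates, which a full slicer must also follow, are not modelled here.

```python
# Each statement: (line, variable defined, variables used).
PROGRAM = [
    (1, "price", set()),                 # price = read()
    (2, "qty", set()),                   # qty = read()
    (3, "tax", set()),                   # tax = 0.07
    (4, "subtotal", {"price", "qty"}),   # subtotal = price * qty
    (5, "label", set()),                 # label = "Total"
    (6, "total", {"subtotal", "tax"}),   # total = subtotal * (1 + tax)
]

def backward_slice(program, variable, point):
    """Lines that might affect the value of `variable` at line `point`."""
    relevant = {variable}
    lines = []
    for line, defined, used in reversed(program):
        if line > point:
            continue
        if defined in relevant:
            lines.append(line)
            relevant.discard(defined)  # this definition supplies the value
            relevant |= used           # now its inputs become relevant
    return sorted(lines)

def forward_slice(program, variable, point):
    """Lines that might be affected by `variable` as defined at line `point`."""
    affected = {variable}
    lines = []
    for line, defined, used in program:
        if line < point:
            continue
        if used & affected:
            lines.append(line)
            affected.add(defined)
        elif defined in affected and line > point:
            affected.discard(defined)  # overwritten by an unrelated value
    return lines

print(backward_slice(PROGRAM, "total", 6))
print(forward_slice(PROGRAM, "price", 1))
```

Line 5 appears in neither slice, which shows how slicing narrows the code an analyst has to read.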

2.5.4 Program Representation Techniques

Program source code, even reduced through program slicing, often is too difficult

to understand because the program can be huge, poorly structured, and based on poor

naming conventions. It is useful to represent the program in different abstract views such

as the call graph, data flow graph, etc. [49]. Most of the program reverse engineering tools

provide these kinds of visualization facilities. In the following sections, we present several

program representation techniques.

2.5.5 Call Graph Analysis

Call graph analysis is the analysis of the execution order of the program units or

statements. If it determines the order of the statements within a program then it is called

intra-procedural analysis. If it determines the calling relationship among the program

units, it is called inter-procedural analysis [49, 83].
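The inter-procedural case can be sketched by collecting, for each function, the names it calls directly. Python's `ast` module is used again as a stand-in parser, and the functions in `SOURCE` are hypothetical; calls made through attributes (such as `text.split()`) are deliberately ignored to keep the sketch short.

```python
import ast

# Hypothetical program units.
SOURCE = '''
def load(path):
    return parse(read(path))

def read(path):
    return open(path).read()

def parse(text):
    return text.split()

def main():
    print(load("data.txt"))
'''

def call_graph(source):
    """Map each top-level function to the set of names it calls directly."""
    graph = {}
    for unit in ast.parse(source).body:
        if isinstance(unit, ast.FunctionDef):
            calls = set()
            for node in ast.walk(unit):
                if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                    calls.add(node.func.id)
            graph[unit.name] = calls
    return graph

graph = call_graph(SOURCE)
for caller in sorted(graph):
    print(caller, "->", sorted(graph[caller]))
```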

2.5.6 Data Flow Analysis

Data flow analysis is the analysis of the flow of the values from variables to variables

between the instructions of a program. The variables defined and the variables referenced

by each instruction, such as declaration, assignment and conditional, are analyzed to

compute the data flow [49, 83].

2.5.7 Variable Dependency Graph

Variable dependency graph is a type of data flow graph where a node represents a

variable and an arc represents a relation (assignment, comparison, etc.) between two

variables. If there is a path from variable v1 to variable v2 in the graph, then there is a

sequence of statements such that the value of v1 is in relation with the value of v2. If the

relation is an assignment statement then the arc in the diagram is directed. If the relation

is a comparison statement then the arc is not directed [49, 83].
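A variable dependency graph of this kind can be sketched as two edge sets: directed arcs for assignments and undirected arcs for comparisons. The sample program and the helper names are assumptions, and Python's `ast` module again stands in for the parser of the analyzed language.

```python
import ast
from itertools import combinations

# Hypothetical program fragment.
SOURCE = (
    "total = price * qty\n"
    "limit = budget\n"
    "if total > limit:\n"
    "    status = total\n"
)

def names_in(node):
    """Return the identifiers that occur anywhere inside an AST node."""
    return {n.id for n in ast.walk(node) if isinstance(n, ast.Name)}

def dependency_graph(source):
    """Return (directed arcs, undirected arcs) between variables."""
    directed, undirected = set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Assign):
            targets = set()
            for target in node.targets:
                targets |= names_in(target)
            for src in names_in(node.value):
                for dst in targets:
                    directed.add((src, dst))  # value flows from src to dst
        elif isinstance(node, ast.Compare):
            for a, b in combinations(sorted(names_in(node)), 2):
                undirected.add(frozenset((a, b)))  # comparison has no direction
    return directed, undirected

directed, undirected = dependency_graph(SOURCE)
print(sorted(directed))
print([sorted(pair) for pair in undirected])
```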

2.5.8 System Dependence Graph

System dependence graph is a type of data flow graph that also handles procedures

and procedure calls. A system dependence graph represents the passing of values between

procedures. When procedure P calls procedure Q, values of parameters are transferred

from P to Q and when Q returns, the return value is transferred back to P [49].

2.5.9 Dynamic Analysis

The program understanding techniques described so far are performed on the source

code of the program and are static analysis. Dynamic analysis is the process of gaining

increased understanding of a program by systematically executing it [83].

2.6 Visitor Design Patterns

We applied the above program understanding techniques in our semantic analyzer

program. We implemented our semantic analyzer by using visitor patterns. In this section,

we explain what a visitor pattern is and the rationale for using it.

A Visitor Design Pattern is a behavioral design pattern [38], which is used to

encapsulate the functionality that we desire to perform on the elements of a data

structure. It gives the flexibility to change the operation being performed on a structure

without the need to change the classes of the elements on which the operation is

performed. Our goal is to build semantic information extraction techniques that can

be applied to any source code and can be extended with new algorithms. The visitor

design pattern is the key object-oriented technique to reach this goal. New

operations over the object structure can be defined simply by adding a new visitor. Visitor

classes localize related behavior in the same visitor and unrelated sets of behavior are

partitioned in their own visitor subclasses. If the classes defining the object structure, in

our case the grammar production rules of the programming language, rarely change, but

new operations over the structure are often defined, a visitor design pattern is the perfect

choice [13, 71].
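The following minimal sketch shows the mechanics of the pattern. The node classes (`Program`, `Assignment`, `IfStatement`) and both visitors are hypothetical stand-ins for grammar production rules and analysis operations; they are not the classes of the semantic analyzer itself.

```python
class Node:
    def accept(self, visitor):
        # Double dispatch: pick the visitor method named after the node class.
        return getattr(visitor, "visit_" + type(self).__name__)(self)

class Program(Node):
    def __init__(self, statements):
        self.statements = statements

class Assignment(Node):
    def __init__(self, target, expression):
        self.target, self.expression = target, expression

class IfStatement(Node):
    def __init__(self, condition, body):
        self.condition, self.body = condition, body

class ConditionCollector:
    """Visitor that records conditions, which may encode business rules."""
    def __init__(self):
        self.conditions = []
    def visit_Program(self, node):
        for statement in node.statements:
            statement.accept(self)
    def visit_Assignment(self, node):
        pass  # assignments carry no condition
    def visit_IfStatement(self, node):
        self.conditions.append(node.condition)
        for statement in node.body:
            statement.accept(self)

class AssignedVariableCollector:
    """A second operation, added without touching the node classes."""
    def __init__(self):
        self.targets = []
    def visit_Program(self, node):
        for statement in node.statements:
            statement.accept(self)
    def visit_Assignment(self, node):
        self.targets.append(node.target)
    def visit_IfStatement(self, node):
        for statement in node.body:
            statement.accept(self)

program = Program([
    Assignment("salary", "base + bonus"),
    IfStatement("salary > 50000", [Assignment("grade", "'senior'")]),
])
rules = ConditionCollector()
program.accept(rules)
targets = AssignedVariableCollector()
program.accept(targets)
print(rules.conditions, targets.targets)
```

Adding a third analysis means writing one more visitor class; `Program`, `Assignment` and `IfStatement` stay unchanged.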

2.7 Ontology

An ontology represents a common vocabulary describing the concepts and relationships

for researchers who need to share information in a domain [40, 69]. It includes machine

interpretable definitions of basic concepts in the domain and relations among them.

Ontologies enable the definition and sharing of domain-specific vocabularies. They

are developed to share common understanding of the structure of information among

people or software agents, to enable reuse of domain knowledge, and to analyze domain

knowledge [69].

According to a commonly quoted definition, an ontology is a formal, explicit

specification of a shared conceptualization [40]. For a better understanding, Michael

Uschold et al. define the terms in this definition as follows [92]: A conceptualization is an

abstract model of how people think about things in the world. An explicit specification

means the concepts and relations in the abstract model are given explicit names and

definitions. Formal means that the meaning specification is encoded in a language whose

formal properties are well understood. Shared means that the main purpose of an ontology

is generally to be used and reused across different applications.

2.8 Web Ontology Language (OWL)

The Web Ontology Language (OWL) is a semantic markup language for publishing

and sharing ontologies on the World Wide Web [64]. OWL is derived from the DAML+OIL

Web Ontology Language. DAML+OIL was developed as a joint effort of researchers who

initially developed DAML (DARPA Agent Markup Language) and OIL (Ontology

Inference Layer or Ontology Interchange Language) separately.

OWL is designed for processing and reasoning about information by computers

instead of just presenting it on the Web. OWL supports more machine interpretability

than XML (Extensible Markup Language), RDF (the Resource Description Framework),

and RDF-S (RDF Schema) by providing additional vocabulary along with a formal semantics.


Formal semantics allows us to reason about the knowledge. We may reason about

class membership, equivalence of classes, and consistency of the ontology for unintended

relationships between classes and classify the instances in classes. RDF and RDF-S

can be used to represent ontological knowledge. However, it is not possible to use all

reasoning mechanisms by using RDF and RDF-S because of some missing features such

as disjointness of classes, boolean combinations of classes, cardinality restrictions, etc.

[4]. When all these features are added to RDF and RDF-S to form an ontology language,

the language becomes very expressive. However, reasoning with it becomes inefficient. For this

reason, OWL comes in three different flavors: OWL-Lite, OWL-DL, and OWL Full.

The entire language is called OWL Full, and uses all the OWL language primitives.

It also allows these primitives to be combined in arbitrary ways with RDF and RDF-S. The price

of this expressiveness is that reasoning in OWL Full can be undecidable. OWL DL (OWL

Description Logic) is a sublanguage of OWL Full. It includes all OWL language constructs

but restricts the ways in which these constructors from OWL and RDF can be used. This makes

the computations in OWL-DL complete (all conclusions are guaranteed to be computable)

and decidable (all computations will finish in finite time). Therefore, OWL-DL supports

efficient reasoning. OWL Lite limits OWL-DL to a subset of constructors (for example

OWL Lite excludes enumerated classes, disjointness statements and arbitrary cardinality)

making it less expressive. However, it may be a good choice for hierarchies needing simple

constraints [4, 64].

OWL provides an infrastructure that allows a machine to make the same sorts of

simple inferences that human beings do. A set of OWL statements by itself (and the

OWL spec) can allow you to conclude another OWL statement whereas a set of XML

statements, by itself (and the XML spec) does not allow you to conclude any other

XML statements. For example, given the statements (motherOf subProperty parentOf) and (Nedret

motherOf Oguzhan), OWL allows you to conclude (Nedret parentOf

Oguzhan) based on the logical definition of subProperty as given in the OWL spec.

Another advantage of using OWL ontologies is the availability of tools such as Racer, Fact

and Pellet that can reason about them. A reasoner can also help us to understand if we

could accurately extract data and description elements from the report. For instance, we

can define a rule such as 'No data or description elements can overlap' and check the OWL

ontology by a reasoner to make sure if this rule is satisfied or not.
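The subProperty inference mentioned above can be imitated in a few lines to show what kind of conclusion a reasoner draws. This is only a toy sketch over plain (subject, predicate, object) tuples; it is not how Racer, Fact or Pellet work internally, and the function name `infer_subproperty` is an assumption.

```python
def infer_subproperty(triples):
    """Apply (p subProperty q) and (s p o) => (s q o) until nothing new is derived."""
    facts = set(triples)
    changed = True
    while changed:
        changed = False
        hierarchy = [(s, o) for s, p, o in facts if p == "subProperty"]
        for sub, sup in hierarchy:
            for s, p, o in list(facts):
                if p == sub and (s, sup, o) not in facts:
                    facts.add((s, sup, o))
                    changed = True
    return facts

statements = [
    ("motherOf", "subProperty", "parentOf"),
    ("Nedret", "motherOf", "Oguzhan"),
]
inferred = infer_subproperty(statements) - set(statements)
print(inferred)
```

Repeating the rule until a fixpoint is reached also covers chains of properties, for example a hypothetical (parentOf subProperty ancestorOf).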

2.9 WordNet

WordNet is an online database which aims to model the lexical knowledge of a

native speaker of English.1 It is designed to be used by computer programs. WordNet

links nouns, verbs, adjectives, and adverbs to sets of synonyms [66]. A set of synonyms

represents the same concept and is known as a synset in WordNet terminology. For

example, the concept of a 'child' may be represented by the set of words: 'kid', 'youngster',

'tiddler', 'tike'. A synset also has a short definition or description of the real world concept

known as a 'gloss' and has semantic pointers that describe relationships between the

current synset and other synsets. The semantic pointers can be a number of different

types including hyponym / hypernym (is-a / has a), meronym / holonym (part-of /

has-part), etc. A list of semantic pointers is given in Table 2-1.2 WordNet can also be

seen as a large graph or semantic network. Each node of the graph represents a synset and

each edge of the graph represents a relation between synsets. Many of the approaches for

measuring similarity of words use the graph structure of WordNet [15, 72, 79, 80].

Since the development of WordNet for English by the researchers of Princeton

University, many WordNets for other languages have been developed, such as Danish

(DanNet), Persian (PersiaNet), Italian (ItalWordNet), etc. There has also been research to

1 WordNet 2.1 defines 155,327 words of English

2 Table is adapted from [72]

Table 2-1. List of relations used to connect senses in WordNet.
Relation        Description                        Example
Hypernym        is a generalization of             furniture is a hypernym of chair
Hyponym         is a kind of                       chair is a hyponym of furniture
Troponym        is a way to                        amble is a troponym of walk
Meronym         is part / substance / member of    wheel is a (part) meronym of a bicycle
Holonym         contains part                      bicycle is a holonym of a wheel
Antonym         opposite of                        ascend is an antonym of descend
Attribute       attribute of                       heavy is an attribute of weight
Entailment      entails                            ploughing entails digging
Cause           cause to                           to offend causes to resent
Also see        related verb                       to lodge is related to reside
Similar to      similar to                         dead is similar to assassinated
Participle of   is participle of                   stored (adj) is the participle of to store
Pertainym of    pertains to                        radial pertains to radius

align WordNets of different languages. For instance, EuroWordNet [93] is a multilingual

lexical knowledge base that links WordNets of different languages (e.g., Dutch, Italian,

Spanish, German, French, Czech and Estonian). In EuroWordNet, the WordNets are

linked to an Inter-Lingual-Index which interconnects the languages so that we can go from

the synsets in one language to corresponding synsets in other languages.

While WordNet is a database which aims to model a person's knowledge about a

language, another research effort, Cyc [57] (derived from En-cyc-lopedia), aims to model

a person's everyday common sense. Cyc formalizes common sense knowledge (e.g., 'You

cannot remember events that have not happened yet', 'You have to be awake to eat', etc.)

in the form of a massive database of axioms.

2.10 Similarity

Similarity is an important subject in many fields such as philosophy, psychology, and

artificial intelligence. Measures of similarity or relatedness are used in various applications

such as word sense disambiguation, text summarization and annotation, information

extraction and retrieval, automatic correction of word errors in text, and text classification

[15, 21]. Understanding how humans assess similarity is important to solve many of the

problems of cognitive science such as problem solving, categorization, memory retrieval,

inductive reasoning, etc. [39].

Similarity of two concepts refers to how many features they have in common and

how many features distinguish them. Lin [60] provides an information theoretic definition

of similarity by clarifying the intuitions and assumptions about it. According to Lin,

the similarity between A and B is related to their commonality and their difference.

Lin assumes that the commonality between A and B can be measured according to

the information they contain in common, $I(\mathrm{common}(A, B))$. In information theory,

the information contained in a statement is measured by the negative logarithm of the

probability of the statement, $I(\mathrm{common}(A, B)) = -\log P(A \cap B)$. Lin also assumes

that if we know the description of A and B, we can measure the difference by subtracting

the commonality of A and B from the description of A and B. Hence, Lin states that

the similarity between A and B, sim(A, B), is a function of their commonalities and

descriptions. That is, $\mathrm{sim}(A, B) = f(I(\mathrm{common}(A, B)), I(\mathrm{description}(A, B)))$.

We also come across the term 'semantic relatedness' while dealing with similarity.

Semantic relatedness is a more general concept than similarity and refers to the degree

to which two concepts are related [72]. Similarity is one aspect of semantic relatedness.

Two concepts are similar if they are related in terms of their likeness (e.g., child and kid).

However, two concepts can be related in terms of functionality or frequent association even

though they are not similar (e.g., instructor and student, Christmas and gift).

2.11 Semantic Similarity Measures of Words

In this section, we provide a review of semantic similarity measures of words in the

literature. This review is not meant to be a complete list of the similarity measures but

provides most of the outstanding ones in the literature. Most of the measures below use

the hierarchical structure of WordNet.

2.11.1 Resnik Similarity Measure

Resnik [79] provided a similarity measure based on the is-a hierarchy of the WordNet

and the statistical information gathered from a large corpus of text. Resnik used the

statistical information from the corpus to measure the information content.

According to information theory, the information content of a concept c can be

quantified as -log P(c), where P(c) is the probability of encountering concept c. This

formula tells us that as probability increases, informativeness decreases; so the more

abstract a concept, the lower its information content. In order to calculate the probability

of a concept, Resnik first computed the frequency of occurrence of each concept in a large

corpus of text. Every occurrence of a concept in the corpus adds to the frequency of the

concept and to the frequency of every concept subsuming the concept encountered. Based

on this computation, the formula for the information content is:

$$P(c) = \frac{\mathrm{freq}(c)}{\mathrm{freq}(r)}$$

$$ic(c) = -\log P(c)$$

$$ic(c) = -\log\left(\frac{\mathrm{freq}(c)}{\mathrm{freq}(r)}\right)$$

where r is the root node of the taxonomy and c is the concept.

According to Resnik, the more information two concepts have in common, the more

similar they are. The information shared by two concepts is indicated by the information

content of the concepts that subsume them in the taxonomy. The formula of the Resnik

similarity measure is:

$$\mathrm{sim}_{RES}(c_1, c_2) = \max_{c}\left[-\log P(c)\right]$$

where c is a concept that subsumes both c1 and c2.

One of the drawbacks of the Resnik measure is that it completely depends on

the information content of the concept that subsumes the two concepts whose similarity

we measure. It does not take the two concepts themselves into account. For this reason, similarity

measures of different pairs of concepts that have the same subsumer have the same

similarity values.
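The Resnik computation can be traced on a toy example. The six-concept taxonomy and the corpus counts below are invented for illustration (they are not WordNet data); the frequency of a concept includes the occurrences of every concept it subsumes, as described above.

```python
import math

# Hypothetical is-a taxonomy (concept -> parent) and raw corpus counts.
PARENT = {"entity": None, "animal": "entity", "artifact": "entity",
          "dog": "animal", "cat": "animal", "chair": "artifact"}
COUNT = {"entity": 0, "animal": 10, "artifact": 5,
         "dog": 40, "cat": 30, "chair": 15}

def ancestors(c):
    """Return c followed by every concept that subsumes it, up to the root."""
    chain = []
    while c is not None:
        chain.append(c)
        c = PARENT[c]
    return chain

def freq(c):
    """Occurrences of c plus occurrences of all concepts subsumed by c."""
    return sum(n for concept, n in COUNT.items() if c in ancestors(concept))

def ic(c):
    """Information content: -log P(c) with P(c) = freq(c) / freq(root)."""
    return -math.log(freq(c) / freq("entity"))

def sim_res(c1, c2):
    """Largest information content among the common subsumers of c1 and c2."""
    common = set(ancestors(c1)) & set(ancestors(c2))
    return max(ic(c) for c in common)

print(sim_res("dog", "cat"), sim_res("dog", "animal"), sim_res("dog", "chair"))
```

The pairs (dog, cat) and (dog, animal) receive the same score because both are subsumed by 'animal', which is exactly the drawback noted above.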

2.11.2 Jiang-Conrath Similarity Measure

Jiang and Conrath [52] address the limitations of the Resnik measure. Their measure uses

the information content of the two concepts, along with the information content of their

lowest common subsumer, to compute the similarity of two concepts. The measure is a

distance measure that specifies the extent of unrelatedness of two concepts. The formula

of the Jiang and Conrath measure is:

$$\mathrm{distance}_{JCN}(c_1, c_2) = ic(c_1) + ic(c_2) - 2 \cdot ic(LCS(c_1, c_2))$$

where ic determines the information content of a concept, and LCS determines the lowest

common subsuming concept of two given concepts. However, this measure works only with

WordNet nouns.

2.11.3 Lin Similarity Measure

Lin [60] introduced a similarity measure between concepts based on his theory of

similarity between arbitrary objects. To measure the similarity, Lin uses the information

content of the two concepts that are being measured and the information content of

their lowest common subsumer. The formula of the Lin measure is:

$$\mathrm{sim}_{LIN}(c_1, c_2) = \frac{2 \cdot \log P(c_0)}{\log P(c_1) + \log P(c_2)}$$

where c0 is the lowest common concept that subsumes both c1 and c2.
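Both the Jiang-Conrath distance, ic(c1) + ic(c2) - 2 ic(LCS(c1, c2)), and the Lin similarity, 2 log P(c0) / (log P(c1) + log P(c2)), can be computed from concept probabilities alone. The sketch below assumes a four-concept hierarchy with made-up probabilities; none of the numbers come from WordNet or a real corpus.

```python
import math

# Hypothetical is-a hierarchy and concept probabilities P(c).
PARENT = {"entity": None, "animal": "entity", "dog": "animal", "cat": "animal"}
P = {"entity": 1.0, "animal": 0.8, "dog": 0.4, "cat": 0.3}

def ic(c):
    """Information content of a concept."""
    return -math.log(P[c])

def lcs(c1, c2):
    """Lowest common subsumer: first ancestor of c2 that also subsumes c1."""
    seen = set()
    while c1 is not None:
        seen.add(c1)
        c1 = PARENT[c1]
    while c2 not in seen:
        c2 = PARENT[c2]
    return c2

def distance_jcn(c1, c2):
    """Jiang-Conrath distance: 0 for identical concepts, larger when unrelated."""
    return ic(c1) + ic(c2) - 2 * ic(lcs(c1, c2))

def sim_lin(c1, c2):
    """Lin similarity: 1 for identical concepts, 0 when the LCS is the root."""
    return 2 * math.log(P[lcs(c1, c2)]) / (math.log(P[c1]) + math.log(P[c2]))

print(distance_jcn("dog", "cat"), sim_lin("dog", "cat"))
```

Unlike the Resnik measure, both values change when one of the two concepts is replaced by another concept under the same subsumer, because the information content of the concepts themselves enters the formulas.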

2.11.4 Intrinsic IC Measure in WordNet

Seco et al. [85] advocate that WordNet can also be used as a statistical resource with

no need for external corpora to compute the information content of a concept.

They assume that the taxonomic structure of WordNet is organized in a meaningful

and principled way, where concepts with many hyponyms3 convey less information than

concepts that are leaves. They provide the formula for information content as follows:

$$ic_{WN}(c) = \frac{\log\left(\frac{hypo(c) + 1}{max_{wn}}\right)}{\log\left(\frac{1}{max_{wn}}\right)} = 1 - \frac{\log(hypo(c) + 1)}{\log(max_{wn})}$$

In this formula, the function hypo returns the number of hyponyms of a given concept

and maxwn is the maximum number of concepts that exist in the taxonomy.
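The intrinsic information content, in the form 1 - log(hypo(c) + 1) / log(maxwn), needs only the taxonomy itself. The sketch below assumes a hypothetical six-concept taxonomy, so `MAX_WN` is 6 rather than the number of WordNet concepts.

```python
import math

# Hypothetical taxonomy: concept -> direct hyponyms.
CHILDREN = {"entity": ["animal", "artifact"], "animal": ["dog", "cat"],
            "artifact": ["chair"], "dog": [], "cat": [], "chair": []}
MAX_WN = len(CHILDREN)  # number of concepts in this toy taxonomy

def hypo(c):
    """Number of hyponyms (direct and indirect) of concept c."""
    return sum(1 + hypo(child) for child in CHILDREN[c])

def ic_wn(c):
    """Intrinsic information content: 1 for leaves, 0 for the root."""
    return 1 - math.log(hypo(c) + 1) / math.log(MAX_WN)

for concept in ("entity", "animal", "artifact", "dog"):
    print(concept, round(ic_wn(concept), 3))
```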

2.11.5 Leacock-Chodorow Similarity Measure

Rada et al. [77] were the first to measure semantic relatedness based on the length

of the path between two concepts in a taxonomy. Rada et al. measured semantic relatedness of

medical terms, using a medical taxonomy called MeSH. According to this measurement,

given a tree-like structure of a taxonomy, the number of links between two concepts is

counted and they are considered more related if the length of the path between them is shorter.


Leacock-Chodorow [56] applied this approach to measure semantic relatedness of two

concepts using WordNet. The measure counts the shortest path between two concepts in

the taxonomy, and scales it by the depth of the taxonomy:

$$\mathrm{related}_{LCH}(c_1, c_2) = -\log\left(\frac{shortestpath(c_1, c_2)}{2 \cdot D}\right)$$

In the formula, c1 and c2 represent the two concepts, and D is the maximum depth of the taxonomy.4

One weakness of the measure is that it assumes the size or weight of every link to be equal.

However, lower down in the hierarchy, concept pairs a single link apart are more related

3 hyponym: a word that is more specific than a given word.

4 For WordNet 1.7.1, the value of D is 19.

than such pairs higher up in the hierarchy. Another limitation of the measure is that it

limits its attention to is-a links and considers only noun hierarchies.
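A sketch of the Leacock-Chodorow computation, -log(shortestpath / (2 D)), on a toy taxonomy follows. Two assumptions are made here: the path length is counted in nodes rather than links, so that identical concepts have length 1 and the logarithm stays defined, and D is the depth of this six-concept taxonomy rather than of WordNet.

```python
import math

# Hypothetical is-a taxonomy (concept -> parent).
PARENT = {"entity": None, "animal": "entity", "artifact": "entity",
          "dog": "animal", "cat": "animal", "chair": "artifact"}

def ancestors(c):
    """Return c followed by every concept that subsumes it, up to the root."""
    chain = []
    while c is not None:
        chain.append(c)
        c = PARENT[c]
    return chain

# Maximum depth of the taxonomy, counted in nodes from the root.
D = max(len(ancestors(c)) for c in PARENT)

def shortest_path(c1, c2):
    """Number of nodes on the path from c1 to c2 through their lowest common subsumer."""
    a1, a2 = ancestors(c1), ancestors(c2)
    for steps_up, concept in enumerate(a1):
        if concept in a2:
            return steps_up + a2.index(concept) + 1
    raise ValueError("concepts share no subsumer")

def related_lch(c1, c2):
    return -math.log(shortest_path(c1, c2) / (2 * D))

print(related_lch("dog", "cat"), related_lch("dog", "chair"))
```

Every link contributes the same amount to the path length here, which is the equal-weight assumption criticized above.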

2.11.6 Hirst-St.Onge Similarity Measure

Hirst and St.Onge's [51] measure of semantic relatedness is based on the idea that two

concepts are semantically close if their WordNet synsets are connected by a path that is

not too long and that does not change direction too often [15, 72].

The Hirst-St.Onge measure considers all the relations defined in WordNet. All links in

WordNet are classified as Upward (e.g., part-of), Downward (e.g., subclass) or Horizontal

(e.g., opposite-meaning). They also describe three types of relations between words:

extra-strong, strong and medium-strong.

The strength of the relationship is given by:

$$\mathrm{rel}_{HS}(c_1, c_2) = C - pathlength(c_1, c_2) - k \cdot d$$

where d is the number of changes of direction in the path, and C and k are constants;

if no such path exists, the strength of the relationship is zero and the concepts are

considered unrelated.

2.11.7 Wu and Palmer Similarity Measure

The Wu and Palmer measure [98] computes similarity in terms of the depth of the two

concepts in WordNet and the depth of their lowest common subsumer (LCS):

$$\mathrm{sim}_{WUP}(c_1, c_2) = \frac{2 \cdot depth(LCS(c_1, c_2))}{depth(c_1) + depth(c_2)}$$
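The Wu and Palmer measure, 2 depth(LCS) / (depth(c1) + depth(c2)), can be traced on a toy taxonomy. The taxonomy is hypothetical, and depth is assumed to be counted in nodes with the root at depth 1, so that the measure never divides by zero.

```python
# Hypothetical is-a taxonomy (concept -> parent).
PARENT = {"entity": None, "animal": "entity", "artifact": "entity",
          "dog": "animal", "cat": "animal", "chair": "artifact"}

def ancestors(c):
    """Return c followed by every concept that subsumes it, up to the root."""
    chain = []
    while c is not None:
        chain.append(c)
        c = PARENT[c]
    return chain

def depth(c):
    """Depth counted in nodes; the root has depth 1 (an assumption of this sketch)."""
    return len(ancestors(c))

def lcs(c1, c2):
    """Lowest common subsumer of c1 and c2."""
    subsumers_of_c2 = set(ancestors(c2))
    for concept in ancestors(c1):
        if concept in subsumers_of_c2:
            return concept

def sim_wup(c1, c2):
    return 2 * depth(lcs(c1, c2)) / (depth(c1) + depth(c2))

print(sim_wup("dog", "cat"), sim_wup("dog", "chair"))
```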

2.11.8 Lesk Similarity Measure

Lesk [58] defines relatedness as a function of dictionary definition overlaps of concepts.

He describes an algorithm that disambiguates words based on the extent of overlaps of

their dictionary definitions with those of words in the context. The sense of the target

word with the maximum overlaps is selected as the assigned sense of the word.

Table 2-2. Absolute values of the coefficients of correlation between human ratings of
similarity and the five computational measures.

Measure                 Miller & Charles   Rubenstein & Goodenough
Hirst and St-Onge       .744               .786
Jiang and Conrath       .850               .781
Leacock and Chodorow    .816               .838
Lin                     .829               .819
Resnik                  .774               .779

2.11.9 Extended Gloss Overlaps Similarity Measure

Banerjee and Pedersen [9, 72] provided a measure by adapting Lesk's measure

to WordNet. Their measure is called 'the extended gloss overlaps measure' and takes into

account not only the two concepts being measured but also the concepts related

to them through WordNet relations. An extended gloss of a concept c1 is

prepared by adding the glosses of the concepts that are related to c1 through a WordNet

relation r. The relatedness of two concepts c1 and c2 is then computed from the

overlaps of their extended glosses.
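The idea of extending glosses before counting overlaps can be sketched as follows. The glosses, the relation table, and the stop-word list are invented for illustration, and the score is a plain count of shared words; the published measure scores overlapping phrases differently, so this is a simplification rather than the authors' scoring function.

```python
# Hypothetical glosses and relations; a real system would read them from WordNet.
GLOSS = {
    "car": "a motor vehicle with four wheels",
    "bus": "a large motor vehicle carrying passengers",
    "vehicle": "a conveyance that transports people or objects",
}
RELATED = {"car": ["vehicle"], "bus": ["vehicle"]}  # e.g., hypernym links
STOP_WORDS = {"a", "the", "of", "or", "that", "with"}


def gloss_words(concept):
    """Content words of the concept's own gloss."""
    return {w for w in GLOSS[concept].split() if w not in STOP_WORDS}


def extended_gloss(concept):
    """Own gloss words plus the gloss words of directly related concepts."""
    words = set(gloss_words(concept))
    for other in RELATED.get(concept, []):
        words |= gloss_words(other)
    return words


def plain_overlap(c1, c2):
    return len(gloss_words(c1) & gloss_words(c2))


def extended_overlap(c1, c2):
    return len(extended_gloss(c1) & extended_gloss(c2))


print(plain_overlap("car", "bus"))     # overlap of the two glosses alone
print(extended_overlap("car", "bus"))  # overlap after adding related glosses
```

Adding the glosses of related concepts gives the comparison more text to work with, which is the motivation for the extension.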

2.12 Evaluation of WordNet-Based Similarity Measures

Budanitsky and Hirst [16] evaluated six different metrics using WordNet and listed

the coefficients of correlation between the metrics and human ratings according to the

experiments conducted by Miller & Charles [65] and Rubenstein & Goodenough [82]. We

present the results of Budanitsky & Hirst's experiments in Table 2-2. According to this

evaluation, the Jiang and Conrath metric [52] and the Lin metric [60] are among

the best measures. As a result, we use the Jiang and Conrath measure as well as the Lin

semantic similarity measure to assign similarity scores between text strings.

2.13 Similarity Measures for Text Data

Several approaches have been used to assess a similarity score between texts. One

of the simplest methods is to assess a similarity score based on the number of lexical

units that occur in both text segments. Several techniques, such as stemming, stop-word

removal, longest subsequence matching, and weighting factors, can be applied to this method

for improvement. However, these lexical matching methods are not enough to identify the

semantic similarity of texts. One attempt to identify semantic similarity between

texts is the latent semantic analysis (LSA) method⁵ [55], which aims to measure similarity

between texts by including additional related words. LSA is successful to some extent but

has not been used on a large scale, due to the complexity and computational cost of its

algorithm.


Corley and Mihalcea [21] introduced a metric for text-to-text semantic similarity by

combining word-to-word similarity metrics. To assess a similarity score for a text pair,

they first create separate sets for nouns, verbs, adjectives, adverbs, and cardinals for each

text. Then they determine pairs of similar words across the sets in the two text segments.

For nouns and verbs, they use a semantic similarity metric based on WordNet, and for other

word classes they use lexical matching techniques. Finally, they sum up the similarity

scores of similar word pairs. This bag-of-words approach improves significantly over the

traditional lexical matching metrics. However, as they acknowledge, a metric of text

semantic similarity should take into account the relations between words in a text.

In another approach to measuring semantic similarity between documents, Aslam

and Frost [6] assume that a text is composed of a set of independent term features and

employ Lin's [60] metric for measuring the similarity of objects that can be described by a

set of independent features. The similarity of two documents in a pile of documents can be

calculated by the following formula:

$$\mathrm{Sim}_{IT}(a, b) = \frac{2 \cdot \sum_{t} \min(P_{a:t}, P_{b:t}) \log P(t)}{\sum_{t} P_{a:t} \log P(t) + \sum_{t} P_{b:t} \log P(t)}$$

where probability $P(t)$ is the fraction of corpus documents containing term t, $P_{b:t}$ is

the fractional occurrence of term t in document b ($\sum_{t} P_{b:t} = 1$), and two documents a

and b share $\min(P_{a:t}, P_{b:t})$ amount of term t in common, while they contain $P_{a:t}$ and

$P_{b:t}$ amount of term t individually.

5 URL of LSA:
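The formula can be turned into a few lines of code. The sketch below assumes that documents are given as lists of terms and that both documents are part of the corpus (so that P(t) is never zero); the function name and the sample documents are made up for illustration.

```python
import math


def sim_it(doc_a, doc_b, corpus):
    """Information-theoretic document similarity (illustrative sketch).

    doc_a, doc_b: lists of terms; corpus: list of documents (lists of terms).
    Assumes doc_a and doc_b are members of corpus, so P(t) > 0 for their terms.
    """
    def fractions(doc):
        # Fractional occurrence of each term in the document (sums to 1).
        return {t: doc.count(t) / len(doc) for t in set(doc)}

    def p(term):
        # Fraction of corpus documents that contain the term.
        return sum(1 for d in corpus if term in d) / len(corpus)

    pa, pb = fractions(doc_a), fractions(doc_b)
    shared = sum(min(pa[t], pb[t]) * math.log(p(t)) for t in pa.keys() & pb.keys())
    total = (sum(pa[t] * math.log(p(t)) for t in pa)
             + sum(pb[t] * math.log(p(t)) for t in pb))
    # If every term occurs in every document, all logs are zero; return 0 then.
    return 2 * shared / total if total else 0.0


a = ["schema", "matching", "schema"]
b = ["schema", "mapping"]
c = ["ontology", "mapping"]
corpus = [a, b, c]
print(sim_it(a, b, corpus))
```

Terms that occur in few documents have a large |log P(t)| and therefore dominate the score, while terms that occur everywhere contribute nothing.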

Another approach, by Oleshchuk and Pedersen [70], uses ontologies as a filter before

assigning similarity scores to texts. They interpret a text based on an ontology and find

out how many of the terms (concepts) of the ontology exist in the text. They assign a

similarity score to text t1 and text t2 after comparing the ontology O1 extracted from

t1 based on the ontology O and the ontology O2 extracted from t2 based on the same

ontology O. The base ontology acts as a context filter for the texts, and depending on the base

ontology used, texts may or may not be similar.

2.14 Similarity Measures for Ontologies

Rodriguez and Egenhofer [81] suggest assessing semantic similarity among entity

classes from different ontologies based on a matching process that uses information about

common and different characteristic features of ontologies based on their specifications.

The similarity score of two entities from different ontologies is the weighted sum of

similarity scores of components of compared entities. Similarity scores are independently

measured for three components of an entity. These components are 'set of synonyms',

'set of semantic relations', and 'set of distinguishing features' of the entity. They further

suggest classifying the distinguishing features into 'functions', 'parts', and 'attributes',

where 'functions' represents what is done to or with an instance of that entity, 'parts' are

structural elements of an entity such as leg or head of a human body, and 'attributes' are

additional characteristics of an entity such as age or hair color of a person.

Rodriguez and Egenhofer point out that if compared entities are related to the same

entities, they may be semantically similar. Thus, they interpret comparing semantic

relations as comparing semantic neighborhoods of entities.⁶ The formula for the overall

similarity between entity a of ontology p and entity b of ontology q is as follows:

$$S(a^p, b^q) = w_w \cdot S_w(a^p, b^q) + w_u \cdot S_u(a^p, b^q) + w_n \cdot S_n(a^p, b^q)$$

where $S_w$, $S_u$, and $S_n$ are the similarities between synonym sets, features, and semantic

neighborhoods, and $w_w$, $w_u$, and $w_n$ are the respective weights, which add up to 1.0.

When calculating a similarity score for each component of an entity, they also take

non-common characteristics into account. The similarity of a component is measured by

the following formula:

$$S(a, b) = \frac{|A \cap B|}{|A \cap B| + \alpha(a, b)\,|A \setminus B| + (1 - \alpha(a, b))\,|B \setminus A|}$$

where A and B are the description sets of a and b for that component, and $\alpha$ is a function

that defines the relative importance of the non-common

characteristics. They calculate $\alpha$ in terms of the depth of the entities in their ontologies.
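The component formula and the weighted sum can be sketched as follows. Here the value of alpha is passed in directly instead of being derived from ontology depths, and the feature sets and weights are invented for the example; both choices are assumptions made to keep the sketch short.

```python
def component_similarity(A, B, alpha):
    """Ratio-model similarity of two description sets.

    A, B: sets of characteristics (e.g., synonyms or parts) of entities a and b.
    alpha: relative importance of the characteristics found only in A (0..1).
    """
    common = len(A & B)
    denom = common + alpha * len(A - B) + (1 - alpha) * len(B - A)
    return common / denom if denom else 0.0


def overall_similarity(s_w, s_u, s_n, w_w, w_u, w_n):
    """Weighted sum of synonym-set, feature, and neighborhood similarities."""
    if abs(w_w + w_u + w_n - 1.0) > 1e-9:
        raise ValueError("weights must add up to 1.0")
    return w_w * s_w + w_u * s_u + w_n * s_n


parts_a = {"leg", "head"}
parts_b = {"leg", "head", "tail", "wing"}
print(component_similarity(parts_a, parts_b, alpha=0.0))  # extras of b count fully
print(component_similarity(parts_a, parts_b, alpha=1.0))  # extras of b are ignored
```

Because alpha weights the two difference sets unequally, the measure is asymmetric: comparing a to b can give a different score than comparing b to a.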

Maedche and Staab [63] suggest measuring the similarity of ontologies at two levels:

lexical and conceptual. At the lexical level, they use an edit-distance measure to find the

similarity between the two sets of terms (concepts or relations) that form the ontologies.

When measuring similarity at the conceptual level, they take all super- and sub-concepts

of the two concepts from the two different ontologies into account.

According to Ehrig et al. [31], comparing ontologies should go far beyond comparing

the representation of the entities of the ontologies and should take their relation to

real-world entities into account. For this, Ehrig et al. suggest a general framework

for measuring similarities of ontologies, which consists of four layers: the data, ontology,

context, and domain knowledge layers. In the data layer, they compare data values by

using generic similarity functions such as edit distance for strings. In the ontology layer,

they consider semantic relations by using the graph structure of the ontology. In the

context layer, they compare the usage patterns of entities in ontology-based applications.

According to Ehrig et al., if two entities are used in the same (or a related) context, then they

are similar. They also propose integrating the domain knowledge layer into any of the other

three layers as needed. Finally, they arrive at an overall similarity function that incorporates

all layers of similarity.

6 The semantic neighborhood of an entity class is the set of entity classes whose
distance to the entity class is less than or equal to a non-negative integer.

Euzenat and Valtchev [34, 35] proposed a similarity measure for OWL-Lite ontologies.

Before measuring similarity, they first transform an OWL-Lite ontology into an OL-graph

structure. Then, they define similarity between nodes of the OL-graphs depending on the

category and the features (e.g., relations) of the nodes. They combine the similarities of

features by a weighted sum approach.

A similar work by Bach and Dieng-Kuntz [8] proposes a measure for comparing

OWL-DL ontologies. Unlike Euzenat and Valtchev's work, Bach and Dieng-Kuntz

adjust the manually assigned feature weights of an OWL-DL entity dynamically in case

the features do not exist in the definition of the entity.

2.15 Evaluation Methods for Similarity Measures

There are three kinds of approaches for evaluating similarity measures [15]. These

are evaluation by theoretical examination (e.g., Lin [60]), evaluation by comparison with human

judgments, and evaluation by calculating the performance within a particular application.

Evaluation by comparison with human judgments has been used by many

researchers, such as Resnik [79] and Jiang and Conrath [52]. Most researchers refer

to the same experiment on human judgment to evaluate their performance due to the

expense and difficulty of arranging such an experiment. This experiment was conducted

by Rubenstein and Goodenough [82], and a later replication of it was done by Miller

and Charles [65]. Rubenstein and Goodenough had human subjects assign degrees of

synonymy, on a scale from 0 to 4, to 65 pairs of carefully chosen words. Miller and Charles

repeated the experiment on a subset of 30 word pairs of the 65 pairs used by Rubenstein

and Goodenough. Rubenstein and Goodenough used 51 subjects for scoring the word pairs

and the average of these scores was reported. Miller and Charles used 38 subjects in their

experiment.


Rodriguez and Egenhofer also used human judgments to evaluate the quality of

their similarity measure for comparing different ontologies [81]. They used Spatial Data

Transfer Standard (SDTS) ontology, WordNet ontology, WS ontology (created from the

combination of WordNet and SDTS) and subsets of these ontologies. They conducted two

experiments. In the first experiment, they compare different combinations of ontologies to

have a diverse grade of similarity between ontologies. These combinations include identical

ontologies (WordNet to WordNet), ontology and sub-ontology (WordNet to WordNet's

subset), overlapping ontologies (WordNet to WS), and different ontologies (WordNet

to SDTS). In the second experiment, they asked human subjects to rank similarity of

an entity to other selected entities based on the definitions in WS ontology. Then, they

compared average of human rankings with the rankings based on their similarity measure

using different combinations of ontologies.

Evaluation by calculating the performance within a particular application is another

approach for the evaluation of similarity measurement metrics. Budanitsky and Hirst [15]

used this approach to evaluate the performance of their metric within an NLP application,

malapropisms.⁷ Patwardhan [72] also used this approach to evaluate his metric within the

word sense disambiguation⁸ application.

7 Malapropisms: The unintentional misuse of a word by confusion with one that sounds

similar.

8 Word Sense Disambiguation: The problem of selecting the most appropriate
meaning or sense of a word, based on the context in which it occurs.

2.16 Schema Matching

Schema matching produces a mapping between the elements of two schemas that

correspond to each other [78]. When we match two schemas S and T, we decide whether any

element or elements of S refer to the same real-world concept as any element or elements

of T [28]. The match operation over two schemas produces a mapping. A mapping is a

set of mapping elements. Each mapping element indicates that certain element(s) in S are

mapped to certain element(s) in T. A mapping element can have a mapping expression

that specifies how the schema elements are related. A mapping element can be defined as

a 5-tuple (id, e, e', n, R), where id is a unique identifier, e and e' are schema elements

of the matching schemas, n is the confidence measure (usually in the [0,1] range) between the

schema elements e and e', and R is a relation (e.g., equivalence, mismatch, overlapping) [88].
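The 5-tuple can be written down directly as a small record type. The sketch below is one possible representation; the class name, the field names, and the sample schema element names are hypothetical and chosen only to mirror the tuple (id, e, e', n, R).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MappingElement:
    """One mapping element (id, e, e', n, R); all names here are illustrative."""
    id: str          # unique identifier of the mapping element
    e: str           # schema element of the first schema S
    e_prime: str     # schema element of the second schema T
    n: float         # confidence measure, expected in [0, 1]
    R: str           # relation, e.g. "equivalence", "mismatch", "overlapping"

    def __post_init__(self):
        if not 0.0 <= self.n <= 1.0:
            raise ValueError("confidence n must lie in [0, 1]")


# A mapping is simply a collection of mapping elements.
mapping = [
    MappingElement("m1", "S.CourseInst.Name", "T.Course.Instructor", 0.9, "equivalence"),
    MappingElement("m2", "S.CourseInst.Loc", "T.Course.Room", 0.6, "overlapping"),
]
confident = [m.id for m in mapping if m.n >= 0.8]
print(confident)
```

Keeping the confidence and the relation on each element lets later steps filter or rank candidate correspondences without recomputing them.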

Schema matching has many application areas, such as data integration, data

warehousing, semantic query processing, agent communication, web services integration,

catalog matching, and P2P databases [78, 88]. The match operation is mostly done

manually. Manually generating the mapping is a tedious, time-consuming, error-prone,

and expensive process. There is a need to automate the match operation. This would be

possible if we can discover the semantics of schemas, make the implicit semantics explicit

and represent them in a machine processable way.

2.16.1 Schema Matching Surveys

Schema matching is a very well-researched topic in the database community. Erhard

Rahm and Philip Bernstein provide an excellent survey on schema matching approaches

by reviewing previous works in the context of schema translation and integration,

knowledge representation, machine learning and information retrieval [78]. In their survey,

they clarify terms such as match operation, mapping, mapping element, and mapping

expression in the context of schema matching. They also introduce application areas of

schema matching such as schema integration, data warehouses, message translation, and

query processing.

The most significant contribution of their survey is the classification of schema

matching approaches, which helps in understanding the schema matching problem. They

consider a wide range of classification criteria such as instance-level vs schema-level,

element vs structure, linguistic-based vs constraint-based, matching cardinality, using

auxiliary data (e.g., dictionaries, previous mappings, etc.), and combining different

matchers (e.g., hybrid, composite). However, it is very rare that one approach falls under

only one leaf of the classification tree presented in that survey. A schema matching

approach needs to exploit all the possible inputs to achieve the best possible result, and

needs to combine matchers either in a hybrid way or in a composite way. For this reason,

most approaches use more than one technique and fall under more than one leaf

of the classification tree. For example, our approach uses auxiliary data (i.e., application

source code) and applies linguistic similarity techniques (e.g., name and description) as well as

constraint-based techniques (e.g., type of the related schema element) to the data.

A recent survey by Anhai Doan and Alon Halevy [28] classifies matching techniques

under two main groups: rule-based and learning-based solutions. Our approach falls under

the rule-based group, which is relatively inexpensive and does not require training. Anhai

Doan and Alon Halevy also describe challenges of schema matching. They point out that,

since data sources become legacy (poorly documented), schema elements are typically

matched based on schema and data. However, the clues gathered by processing the schema

and data are often unreliable, incomplete and not sufficient to determine the relationships

among schema elements. Our approach aims to overcome this fundamental challenge by

analyzing reports for more reliable, complete and sufficient clues.

Anhai Doan and Alon Halevy also state that schema matching becomes more

challenging because matching approaches must consider all the possible matching

combinations between schemas to make sure there is no better mapping. Considering

all the possible combinations increases the cost of the matching process. Our approach

helps overcome this challenge by focusing on the subset of schema elements that are

used in a report pair.

Another challenge that Anhai Doan and Alon Halevy state is the subjectivity of

the matching. This means the mapping depends on the application and may change in

different applications even though the underlying schemas are the same. By analyzing

report generating application source code, we believe we produce more objective results.

Anhai Doan and Alon Halevy's survey also adds two more application areas of schema

matching to those mentioned in Rahm and Bernstein's survey. These

application areas are peer data management and model management.

A more recent survey by Pavel Shvaiko and Jérôme Euzenat [88] points out new

application areas of schema matching, such as agent communication, web service

integration, and catalog matching. In their survey, Pavel Shvaiko and Jérôme Euzenat

consider only schema-based approaches, not instance-based approaches, and provide

a new classification by building on the previous work of Erhard Rahm and Philip

Bernstein. They interpret the classification of Erhard Rahm and Philip Bernstein and

provide two new classification trees, based on granularity and kinds of input, with nodes

added to the original classification tree of Erhard Rahm and Philip Bernstein. Finally,

Hong-Hai Do summarizes recent advances in the field in his dissertation [25].

2.16.2 Evaluations of Schema Matching Approaches

Approaches to the schema matching problem evaluate their systems by

using a variety of methodologies, metrics, and data, which are usually not publicly available.

This makes it hard to compare these approaches. However, there have been works to

benchmark the effectiveness of a set of schema matching approaches [26, 99].

Hong Hai Do et al. [26] specify four comparison criteria. These criteria are kind

of input (e.g., schema information, data instances, dictionaries, and mapping rules),

match results (e.g., matching between schema elements, nodes, or paths), quality measures

(metrics such as recall, precision, and f-measure), and effort (e.g., pre- and post-match

efforts for training of learners, dictionary preparation, and correction). Mikalai Yatskevich,

in his work [99], compares the approaches based on the criteria stated in [26] and adds time

measures as the fifth criterion.
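The quality measures named above can be computed by comparing a proposed mapping against a manually prepared reference mapping. The following sketch uses the standard definitions of precision, recall, and f-measure; the function name and the sample correspondences are hypothetical.

```python
def match_quality(proposed, reference):
    """Precision, recall, and f-measure of a proposed set of correspondences.

    proposed, reference: iterables of (source_element, target_element) pairs.
    """
    proposed, reference = set(proposed), set(reference)
    true_positives = len(proposed & reference)
    precision = true_positives / len(proposed) if proposed else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical correspondences between schemas S and T.
proposed = {("S.Name", "T.Instructor"), ("S.Loc", "T.Room"), ("S.Num1", "T.Credits")}
reference = {("S.Name", "T.Instructor"), ("S.Loc", "T.Room"),
             ("S.Num2", "T.Credits"), ("S.Num1", "T.CourseNo")}
precision, recall, f_measure = match_quality(proposed, reference)
print(precision, recall, f_measure)
```

Precision penalizes wrong correspondences that a user must remove after matching, while recall penalizes missing ones that must be added by hand, so the two relate directly to the post-match effort criterion.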

Hong Hai Do et al. only use the information available in the publications describing

the approaches and their evaluation. In contrast, Mikalai Yatskevich provides real-time

evaluations of matching prototypes, rather than reviewing the results presented in the

papers. Mikalai Yatskevich compares only three approaches (COMA [24], Cupid [62], and

Similarity Flooding (SF) [86]) and concludes that COMA performs best on large

schemas and Cupid is best for small schemas. Hong Hai Do et al. provide a broader

comparison by reviewing six approaches (Automatch [10], COMA [24], Cupid [62], LSD

[27], Similarity Flooding (SF) [86], and SemInt).

2.16.3 Examples of Schema Matching Approaches

In the rest of this section, we review some of the significant approaches for schema

matching and describe their similarities to and differences from our approach. We review the LSD,

Corpus-based, COMA, and Cupid approaches below.

The LSD (Learning Source Descriptions) approach [27] uses machine-learning

techniques to match data sources to a global schema. The idea of LSD is that after

a training period of determining mappings between data sources and global schema

manually, the system should learn from previous mappings and successfully propose

mappings for new data sources. The LSD system is a composite matcher, meaning that it

combines the results of several independently executed matchers. LSD consists of

several learners (matchers). Each learner can exploit different types of characteristics

of the input data, such as name similarities, format, and frequencies. Then the predictions

of the different learners are combined. The LSD system is extensible since it has independently

working learners (matchers). When new learners are developed, they can be added to the

system to enhance the accuracy. The extensibility of the LSD system is similar to the

extensibility of our system because we can also add new visitor patterns to our system to

extract more information to enhance the accuracy. The LSD approach is also similar to our

approach in that it comes to a final decision by combining several results

coming from different learners. We likewise combine several results that come from matching

the ontologies of report pairs to reach a final decision. LSD is a learning-based

solution and requires training, which makes it relatively expensive because of the initial

manual effort. Our approach, however, needs no initial effort other than collecting the relevant

report-generating source code.

One of the distinguished approaches that use external evidence is the Corpus-based

Schema Matching approach [43, 61]. Our approach is similar to Corpus-based Schema

Matching in the sense that we also utilize external data rather than solely depending

on matching schemas and their data. The Corpus-based schema matching approach

constructs a knowledge base by gathering relevant knowledge from a large corpus of

database schemas and previously validated mappings. This approach identifies interesting

concepts and patterns in a corpus of schemas and uses this information to match

two unseen schemas. However, learning from the corpus and extracting patterns is a

challenging task. This approach also requires initial effort to create a corpus of interest

and then requires tuning effort to eliminate useless schemas and to add useful schemas.

The COMA (COmbination of MAtching algorithms) approach [24] is a composite

schema matching approach. It develops a platform to combine multiple matchers in a

flexible way. It provides an extensible library of matching algorithms and a framework

to combine obtained results. The COMA approach has been superior to other systems

in the evaluations [26, 99]. The COMA++ approach [7] improves on the COMA approach

by supporting schemas and ontologies written in different languages (i.e., SQL, W3C

XSD, and OWL) and by introducing new match strategies such as fragment-based matching

and reuse-oriented matching. The fragment-based approach follows the divide-and-conquer

idea: it decomposes a large schema into smaller subsets, aiming to achieve better match

quality and execution time with the reduced problem size, and then merges the results of

matching fragments into a global match result. Our approach also considers matching

small subsets of a schema that are covered by reports and then merging these match

results into a global match result, as described in Chapter 3.

The Cupid approach [62] combines linguistic and structural matchers in a hybrid way.

It is both element- and structure-based. It also uses dictionaries as auxiliary data. It aims

to provide a generic solution across data models and uses XML and relational examples.

The structural matcher of Cupid transforms the input into a tree structure and assesses

a similarity value for a node based on the node's linguistic similarity value and its leaves'

similarity values.

2.17 Ontology Mapping

Ontology mapping is determining which concepts and properties of two ontologies

represent similar notions [68]. Several other terms are relevant to ontology mapping

and are sometimes used interchangeably with the term mapping. These are alignment,

merging, articulation, fusion, and integration [54]. The result of ontology mapping is used

in application domains similar to those of schema matching, such as data transformation, query

answering, and web services integration [68].

2.18 Schema Matching vs. Ontology Mapping

Schema matching and ontology mapping are similar problems [29]. However, ontology

mapping generally aims to match richer structures. Generally, ontologies have more

constraints on their concepts and have more relations among these concepts. Another

difference is that a schema often does not provide explicit semantics for its data, while

an ontology is a system that itself contains semantics either intuitively or formally [88].

The database community deals with the schema matching problem and the AI community

deals with the ontology mapping problem. We can perhaps fill the gap between these

similar yet separately studied subjects.


In Chapter 1, we stated the need for rapid, flexible, limited-time collaborations among

organizations. We also underlined that organizations need to integrate their information

sources to exchange data in order to collaborate effectively. However, integrating

information sources is currently a labor-intensive activity because of non-existent or

outdated machine-processable documentation of the data sources. We defined legacy

systems as information systems with poor or nonexistent documentation in Section

2.1. Integrating legacy systems is tedious, time-consuming, and expensive because the

process is mostly manual. To automate the process, we need to develop methodologies to

automatically discover semantics from electronically available information sources of the

underlying legacy systems.

In this chapter, we state our approach for extracting semantics from legacy systems

and for using these semantics in the schema matching process of information source

integration. We developed our approach in the context of the SEEK (Scalable Extraction

of Enterprise Knowledge) project. As we show in Figure 3-1, the Semantic Analyzer

(SA) takes the output of the Schema Extractor (SE), i.e., the schema of the data source, and the

application source code or report templates as input. After the semantic analysis process,

SA stores its output, the extracted semantic information, in a repository that we call the

knowledgebase of the organization. Then, the Schema Matcher (SM) uses this knowledgebase

as input and produces mapping rules as output. Finally, these mapping rules are

input to the Wrapper Generator (WG), which produces source wrappers. In Section 3.1, we

first state our approach for semantic extraction using SA. Then, in Section 3.2, we show

how we utilize the semantics discovered by SA in the subsequent schema matching phase.

The schema matching phase is followed by the wrapper generation phase, which is not

described in this dissertation.

[Figure 3-1 diagram. Components shown: Data Source of A; Source Code of A; Schema Extraction (SE) and Semantic Analysis (SA) inside Data Reverse Engineering (DRE); Schemas; Knowledgebase of Organization A, B, C, and D; Schema Matching (SM); Wrapper Generator (WG).]

Figure 3-1. Scalable Extraction of Enterprise Knowledge (SEEK) Architecture.

3.1 Semantic Analysis

Our approach to semantic analysis is based on the observation that application source

code can be a rich source of semantic information about the data source it is accessing.

Specifically, semantic knowledge extracted from application source code frequently

contains information about the domain-specific meanings of the data or the underlying

schema elements. For example, application code usually

has embedded queries, and the data retrieved or manipulated by queries is stored in

variables and displayed to the end user in output statements. Many of these output



statements contain additional semantic information usually in the form of descriptive

text or markup [36, 84, 87]. These output statements become semantically valuable

when they are used to communicate with the end-user in a formatted way. One way of

communicating with the end-user is producing reports. Reports and other user-oriented

output, which are typically generated by report generators or application source code,

do not use the names of schema elements directly but rather provide more descriptive

names for the data to make the output more comprehensible to the users. We claim that

these descriptive names together with their formatting instructions can be extracted

from the application code generating the report and can be related to the underlying

schema elements in the data source. We can trace the variables used in output statements

throughout the application code and relate the output with the query that retrieves data

from the data source and indirectly with the schema elements. The descriptive text

and formatting instructions are valuable information that helps discover the semantics of

the schema elements. In the next subsection, we explain this idea using illustrative

examples.


3.1.1 Illustrative Examples

In this section, we illustrate our idea of semantic extraction with two simple examples.

On the left-hand side of Figure 3-2, we see a relation and its attributes from a relational

database schema. By looking at the names of the relation and its attributes, it is hard to

understand what kind of information this relation and its attributes store. For example,

this relation can be used for storing information about 'courses' or 'instructors'. The

attribute Name can hold information about 'course names' or 'instructor names'. Without

any further knowledge of the schema, we would probably not be able to understand the

full meaning of these schema items in the relation 'CourseInst'. However, we can gather

information about the semantics of these schema items by analyzing the application source

code that uses these schema items.

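[Figure 3-2 diagram: left, the relation CourseInst with attributes Num1, Name, Num2, and Loc; right, a search screen with the fields 'Instructor name' and 'Course code' and a 'Search' button.]

Figure 3-2. Schema used by an application.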

Let us assume we have access to the application source code that outputs the search

screen shown on the right-hand side of Figure 3-2. Upon investigation of the code, the

semantic analyzer (SA) encounters output statements of the form 'Instructor Name'

and 'Course Code'. SA also encounters input statements that expect input from the

user next to the output texts. Using program understanding techniques, SA finds out

that the inputs are used with certain schema elements in a 'where clause' to form a query

that returns the desired tuples from the database. SA first relates the output statements

containing descriptive text (e.g., 'Instructor Name') with the input statements located

next to the output statements on the search screen shown in Figure 3-2. SA then traces

the input statements back to the 'where clause' and finds their corresponding schema elements

in the database. Hence, SA relates the descriptive text with the schema elements. For

example, if SA relates the output statement 'Instructor Name' to the 'Name' schema element

of the relation 'CourseInst', then we can conclude that the 'Name' schema element of the relation

'CourseInst' stores information about the 'Instructor Name'.
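The two tracing steps in this example can be sketched with plain data structures. The sketch assumes that an earlier analysis pass has already produced two lists of facts: which descriptive text sits next to which input variable, and which input variable is compared with which schema element in the where clause. The variable names and the pairing of 'Course Code' with Num1 are hypothetical; this is not the SA implementation.

```python
# Hypothetical facts extracted from the search-screen code.
# (descriptive output text, input variable placed next to it on the screen)
label_to_input = [
    ("Instructor Name", "instName"),
    ("Course Code", "courseCode"),
]
# (input variable, schema element it is compared with in the where clause);
# which column 'courseCode' maps to is an assumption made for this sketch.
input_to_schema = [
    ("instName", "CourseInst.Name"),
    ("courseCode", "CourseInst.Num1"),
]


def relate_text_to_schema(label_to_input, input_to_schema):
    """Join the two fact lists on the input variable.

    Returns a dict: schema element -> descriptive text shown to the user.
    """
    variable_to_element = dict(input_to_schema)
    return {
        variable_to_element[variable]: text
        for text, variable in label_to_input
        if variable in variable_to_element
    }


semantics = relate_text_to_schema(label_to_input, input_to_schema)
print(semantics["CourseInst.Name"])
```

The join on the input variable is what lets a non-descriptive attribute name such as Name inherit the descriptive text that the application shows to its users.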

Let us look at another example. Figure 3-3 shows a report R1 using the schema

elements from the schema S1. Let us assume that we have access to the application source

code that generates the report shown in Figure 3-3. The schema element names in S1 are

non-descriptive. However, our semantic analyzer can gather valuable semantic information

by analyzing the source code. SA first traces the descriptive column header texts back

to the schema elements that fill in the data of that column. Then, SA relates the descriptive

column header texts with the schema elements (red arrows in Figure 3-3). After that, we can

draw conclusions about the semantics of the schema elements. For example, we can conclude that

the Name schema element of the relation CourseInst stores information about 'Instructors'.

[Figure 3-3 diagram: the relations Schedule and CourseInst of schema S1, with arrows to the columns of the report 'Course Listings' (Course, Title, Instructor, Time, Prerequisite); the sample rows list the courses CIS 105 and CIS 201.]

Figure 3-3. Schema used by a report.

3.1.2 Conceptual Architecture of Semantic Analyzer

SA is embedded in the Data Reverse Engineering (DRE) module of the SEEK prototype together with the Schema Extractor (SE) component. As Figure 3-4 illustrates, the SE component in the DRE connects to the data source with a call-level interface (e.g., JDBC) and extracts the schema of the data source. The SA component enhances this schema with the pieces of evidence found about the semantics of the schema elements from the application source code or from the report design templates.

Figure 3-4. Conceptual view of the Data Reverse Engineering (DRE) module of the Scalable Extraction of Enterprise Knowledge (SEEK) prototype.

Abstract syntax tree generator (ASTG)

We show the components of the Semantic Analyzer (SA) in Figure 3-5. The Abstract Syntax Tree Generator (ASTG) accepts the application source code to be analyzed, parses

it and produces the abstract syntax tree of the source code. An Abstract Syntax Tree

(AST) is an alternative representation of the source code for more efficient processing.

Currently, the ASTG is configured to parse application source code written in Java. The

ASTG can also parse SQL statements embedded in the Java source code and HTML

code extracted from the Java Servlet source code. However, we aim to parse and extract

semantic information from source code written in any programming language. To reach

this aim, we use a state-of-the-art parser generation tool, JavaCC, to build the ASTG.

We explain how we build the ASTG so that it becomes extensible to other programming

languages in Section 3.1.3.


Figure 3-5. Conceptual view of Semantic Analyzer (SA) component.
Report template parser (RTP)

We also extract semantic information from another electronically available information source, namely report design templates. A report design template includes information about the design of a report and is typically represented in XML. When a report generation tool, such as Eclipse BIRT or JasperReports, runs a report design template, it retrieves data from the data source and presents it to the end user according to the specification in the report design template. When parsed, report design templates yield valuable semantic information about the schema elements. The Report Template Parser (RTP) component of SA is used to parse report design templates. Our current semantic analyzer is configured to parse report templates designed with Eclipse BIRT.1 We show an example of a report design template in Figure 3-6 and the report produced by running this template in Figure 3-7.
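The idea of reading header-to-field bindings out of an XML template can be sketched as follows. This is only an illustration under stated assumptions: the `<column header="..." field="..."/>` layout, the class name `TemplateSketchParser`, and the sample bindings are invented for the sketch and are not the Eclipse BIRT template format or the RTP implementation. Only the standard Java DOM API is used.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class TemplateSketchParser {
    // Maps each column header to the data field bound to that column.
    // The <column header="..." field="..."/> layout is hypothetical and is
    // NOT the real BIRT report design format.
    static Map<String, String> headerToField(String templateXml) {
        try {
            NodeList columns = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(templateXml)))
                .getElementsByTagName("column");
            Map<String, String> result = new LinkedHashMap<>();
            for (int i = 0; i < columns.getLength(); i++) {
                Element column = (Element) columns.item(i);
                result.put(column.getAttribute("header"), column.getAttribute("field"));
            }
            return result;
        } catch (Exception e) {
            // wrap checked parser exceptions so callers need no throws clause
            throw new IllegalStateException("template could not be parsed", e);
        }
    }
}
```

Calling `TemplateSketchParser.headerToField` on a template string returns the header-to-field pairs in document order; each pair is one piece of evidence relating a descriptive text to a data field.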

Figure 3-6. Report design template example.

Information extractor (IEx)

The outputs of the ASTG and the RTP are the inputs for the Information Extractor (IEx) component of SA. The IEx, shown in Figure 3-5, is the component where we apply several heuristics to relate descriptive text in application source code with the schema elements in the database by using program understanding techniques. Specifically, the IEx first identifies the output statements. Then, it identifies the texts in the output statements and the variables related to these output texts. The IEx relates the output text with the variables with the help of several heuristics described in Section 3.1.5. The IEx then traces the variables related to the output text back to the schema elements from which they retrieve data.

Figure 3-7. Report generated when the above template was run.

1 http://www.eclipse.org/birt/

The IEx can extract information from Java application source code that communicates with the user through the console. The IEx can also extract information from Java Servlet application source code. A Servlet is a Java application that runs on the Web server and responds to client requests by generating HTML pages dynamically. A Servlet generates an HTML page through the output statements embedded inside the Java code. After the IEx analyzes the Java Servlet, it identifies the output statements that output HTML code. It also identifies the schema elements from which the data on the HTML page is retrieved. As an intermediate step, the IEx produces the HTML page that the Servlet would produce, with the schema element names in place of the data. An example of the output HTML page generated by the IEx after analyzing a Java Servlet is shown in Figure 3-9. The Java Servlet output that was analyzed by the IEx is shown in Figure 3-8. This example is taken from the THALIA integration benchmark and shows the course offerings of the Computer Science department of the California Institute of Technology (CALTECH). Note that the data on the report in Figure 3-8 is replaced in Figure 3-9 with the names of the schema elements from which the data is retrieved. Next, the IEx analyzes the annotated HTML page shown in Figure 3-9 and extracts semantic information from it.

Figure 3-8. Java Servlet generated HTML report showing course listings of CALTECH.

Figure 3-9. Annotated HTML page generated by analyzing a Java Servlet. (The data cells are replaced by the schema element names Schedule.Code, Schedule.Name, CourseInst.Name, Schedule.Time, and Schedule.Loc.)

The IEx has been implemented using visitor design pattern classes. We explain the benefits of using visitor design patterns in Section 3.1.3. The IEx applies several program understanding techniques such as program slicing, data flow analysis and call graph analysis [49] in visitor design pattern classes. We describe these techniques in Section


The IEx also extracts semantic information from report design templates. The IEx uses heuristics seven to eleven described in Section 3.1.5 while analyzing the report design templates. Extracting information from report design templates is easier than extracting information from application source code because report design templates are represented in XML and are more structured.

Report ontology writer (ROW)

The Report Ontology Writer (ROW) component of SA writes the gathered semantic information into report ontology instances represented in OWL. We explain the design details of the report ontology in Section 3.2.3. These report ontology instances form the knowledgebase of the data source being analyzed.

3.1.3 Extensibility and Flexibility of Semantic Analyzer

Our current semantic analyzer is configured to extract information from application source code written in Java. We chose the Java programming language because it is one of the dominant programming languages in enterprise information systems. However, we want our semantic analyzer to be able to process source code written in any programming language to extract semantic information about the data of the legacy system. For this reason, we need to develop our semantic analyzer in a way that is easily extensible to other programming languages. To reach this aim, we leverage state-of-the-art techniques and recent research on code reverse engineering, abstract syntax tree generation and object oriented programming to develop a novel approach for semantic extraction from source code. We describe our extensible semantic analysis approach in detail in this section.

To analyze application source code, we need a parser for the grammar of the programming language of the source code. This parser is used to generate the Abstract Syntax Tree (AST) of the source code. An AST is a representation of source code that facilitates the use of tree traversal algorithms. For programmers, writing a parser for the grammar of a programming language has always been a complex, time-consuming and error-prone task. Writing a parser becomes more complex as the number of production rules of the grammar increases. It is not easy to write a robust parser for Java, which has many production rules [91].2 Our focus is on extracting semantic information from a legacy system's source code, not on writing a parser. For this reason, we chose a state-of-the-art parser generation tool to produce our Java parser. We use JavaCC3 to automatically generate a parser from the specification files in the JavaCC repository.4 JavaCC can also generate parsers for other grammars, provided a grammar specification is available. We also utilize JavaCC to generate a parser for SQL statements that are embedded inside the Java source code and for HTML code that is embedded inside the Java Servlet code. By using JavaCC, we can extend SA to make it capable of parsing other programming languages with little effort.

The Information Extractor (IEx) component of SA is composed of several visitor design pattern classes. The visitor design pattern gives the flexibility to change the operation being performed on a structure without the need to change the classes of the elements on which the operation is performed [38]. Our goal is to build semantic information extraction techniques that can be applied to any source code and can be extended with new algorithms. By using the visitor design pattern [71], we do not embed the functionality of the information extraction inside the classes of the Abstract Syntax Tree Generator (ASTG). This separation lets us focus on the information extraction algorithms. We can maintain the operations being performed whenever necessary. Moreover, new operations over the data structure can be defined simply by adding a new visitor [13].
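The separation between node classes and analyses can be illustrated with a minimal sketch. The node types `Block`, `OutputStatement`, and `Assignment` and the visitor `OutputTextCollector` are hypothetical and heavily simplified; the node classes actually produced for the ASTG from the JavaCC grammar are not shown in the text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified AST node types (not the real ASTG classes).
interface AstNode {
    <R> R accept(AstVisitor<R> visitor);
}

interface AstVisitor<R> {
    R visitBlock(Block block);
    R visitOutput(OutputStatement output);
    R visitAssignment(Assignment assignment);
}

final class Block implements AstNode {
    final List<AstNode> children;
    Block(List<AstNode> children) { this.children = children; }
    public <R> R accept(AstVisitor<R> visitor) { return visitor.visitBlock(this); }
}

final class OutputStatement implements AstNode {
    final String text;
    OutputStatement(String text) { this.text = text; }
    public <R> R accept(AstVisitor<R> visitor) { return visitor.visitOutput(this); }
}

final class Assignment implements AstNode {
    final String target;
    final String source;
    Assignment(String target, String source) { this.target = target; this.source = source; }
    public <R> R accept(AstVisitor<R> visitor) { return visitor.visitAssignment(this); }
}

// One analysis is one visitor. A further analysis is added as another visitor
// class; Block, OutputStatement, and Assignment stay unchanged.
final class OutputTextCollector implements AstVisitor<List<String>> {
    public List<String> visitBlock(Block block) {
        List<String> texts = new ArrayList<>();
        for (AstNode child : block.children) {
            texts.addAll(child.accept(this));
        }
        return texts;
    }
    public List<String> visitOutput(OutputStatement output) {
        List<String> texts = new ArrayList<>();
        texts.add(output.text);
        return texts;
    }
    public List<String> visitAssignment(Assignment assignment) {
        return new ArrayList<>(); // assignments carry no output text
    }
}
```

A different analysis, for example one that records assignments for data flow, would be written as a second `AstVisitor` implementation and run over the same tree.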

2 There are over 80 production rules in the Java language according to the Java Grammar that we obtained from the JavaCC Repository.

3 JavaCC: https://javacc.dev.java.net/

4 JavaCC repository: http://www.cobase. cs avac c/

3.1.4 Application of Program Understanding Techniques in SA

We have introduced program understanding techniques in Section 2.5. In this section,

we present how we apply these techniques in SA. SA has two components as shown in

Figure 3-5. The input of the Information Extractor (IEx) component is an abstract syntax

tree (AST). The AST is the output of our Abstract Syntax Tree Generator (ASTG) which

is actually a parser. As mentioned in Section 2.5, processing the source code by a parser

to produce an AST is one of the program understanding techniques known as Syntactic

Analysis [49]. We perform the rest of the program understanding techniques on the AST

by using the visitor design pattern classes of the IEx.

One of the program understanding techniques we apply is Pattern Matching [49]. We wrote a visitor class that looks for certain patterns inside the source code. These patterns, such as input/output statements, are stored in a class structure, and new patterns can simply be added to this class structure as needed. The visitor class that searches for these patterns identifies the variables in the input/output statements as slicing variables. For instance, the variable V in Table 3-5 is identified as a slicing variable since it is used in an output statement. Program Slicing [75] is another program understanding technique mentioned in Section 2.5. We analyze all the statements affecting a variable that is used in an output statement. This technique is also known as backward slicing.
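Backward slicing can be sketched on a straight-line list of statements. This is a simplification under stated assumptions: the `Stmt` model (one defined variable, a set of used variables) and the class `BackwardSlicer` are hypothetical, branches and loops are ignored, and the sample query string in the usage note is invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical statement model: "defined = f(used...)".
final class Stmt {
    final String defined;
    final Set<String> used;
    final String text;
    Stmt(String defined, String text, String... used) {
        this.defined = defined;
        this.text = text;
        this.used = new HashSet<>(Arrays.asList(used));
    }
}

final class BackwardSlicer {
    // Returns, in program order, the statements that can affect `variable`
    // at the end of a straight-line statement list.
    static List<String> slice(List<Stmt> program, String variable) {
        Set<String> relevant = new HashSet<>();
        relevant.add(variable);
        List<String> slice = new ArrayList<>();
        for (int i = program.size() - 1; i >= 0; i--) {
            Stmt s = program.get(i);
            if (relevant.contains(s.defined)) {
                relevant.remove(s.defined); // this definition hides earlier ones
                relevant.addAll(s.used);    // its inputs become relevant instead
                slice.add(s.text);
            }
        }
        Collections.reverse(slice);
        return slice;
    }
}
```

For a program consisting of `Q = "SELECT C FROM T"`, `X = 42`, `R = S.executeQuery(Q)`, and `V = R.getString(1)`, slicing on V keeps the statements defining Q, R, and V and drops the unrelated assignment to X.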

SA also applies the Call Graph Analysis technique [83]. SA produces the inter-procedural call graph of the source code and analyzes only the methods that exist in this graph. Starting from a specific method (e.g., the main method of a Java stand-alone class or the doGet method of a Java Servlet), SA traverses all methods that can be executed at run-time. In this way, SA avoids analyzing unused methods. Such methods can reflect old functionality of the system, and analyzing them can lead to incorrect, misleading information. An example of an inter-procedural call graph of a program source code is shown in Figure 3-10. SA does not analyze method of Class1, method of Class2, and method3 of Class3 since they are never called from inside other methods.



Figure 3-10. Inter-procedural call graph of a program source code.
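Restricting the analysis to methods reachable from an entry point amounts to a graph traversal. The following is a minimal sketch, assuming the call graph is already available as a map from a method name to the names of the methods it calls; the class `CallGraph` and the method names in the usage note are hypothetical and do not reproduce Figure 3-10.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class CallGraph {
    // Returns every method reachable from `entry` by following call edges.
    // Methods that are not in the result are never analyzed.
    static Set<String> reachable(Map<String, List<String>> calls, String entry) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> worklist = new ArrayDeque<>();
        worklist.push(entry);
        while (!worklist.isEmpty()) {
            String method = worklist.pop();
            if (visited.add(method)) { // first visit only, so cycles terminate
                for (String callee : calls.getOrDefault(method, Collections.<String>emptyList())) {
                    worklist.push(callee);
                }
            }
        }
        return visited;
    }
}
```

If `Class1.main` calls `Class1.method1` and `Class2.method2`, and only the uncalled `Class3.method3` calls some other method, then `Class3.method3` is absent from the result and would be skipped.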

The Data Flow Analysis technique [83] is another program understanding technique

that we implemented in the IEx by visitor design patterns. As mentioned in Section

2.5, Data Flow Analysis is the analysis of the flow of the values of variables to variables.

SA analyzes the data flow in the variable dependency graphs (i.e., flow of data between

variables). SA analyzes assignment statements and makes necessary changes in the values

stored in the symbol table of the class being analyzed.

SA also analyzes the data flow in the system dependency graphs (i.e., flow of data between methods). SA analyzes method calls, initializes the values of method variables with the actual parameters in the method call, and transfers back the value of the return variable at the end of the method. SA can transfer information from one method to another through variables and can use this information to discover the semantics of a schema element. The code fragment in Table 3-1 is given as an example of this capability of SA. Inside the method, the value of variable query is transferred to variable rs. At the end of the method, the value of variable rs is transferred to variable rsList. The value of the fourth field of the query result is then stored in a variable and printed out. When we relate the text in the output statement with the fourth field of the query, we can conclude that the Pl field of table Course corresponds to 'Class is held in room number'.

Table 3-1. Semantic Analyzer can transfer information from one method to another through variables and can use this information to discover semantics of a schema element.

public ResultSet returnList() {
    ResultSet rs = null;
    try {
        String query = "SELECT Code, Time, Day, Pl, Inst FROM Course";
        rs = sqlStatement.executeQuery(query);
    } catch (Exception ex) {
        researchErr = ex.getMessage();
    }
    return rs;
}

ResultSet rsList = returnList();
String dataOut = "";
while (rsList.next()) {
    dataOut = rsList.getString(4);
    System.out.println("Class is held in room number:" + dataOut);
}
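The last step of this trace, going from the column index used in getString(4) to a schema element, can be sketched as follows. This is an illustration only: the class `SelectListResolver` is hypothetical, and the regular expression handles just simple single-table queries with an explicit select list (no '*', aliases, or joins), whereas SA parses the embedded SQL with a generated parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SelectListResolver {
    private static final Pattern SELECT =
        Pattern.compile("^\\s*SELECT\\s+(.+?)\\s+FROM\\s+(\\w+)", Pattern.CASE_INSENSITIVE);

    // Resolves a 1-based ResultSet column index to "Table.Column".
    static String resolve(String query, int columnIndex) {
        Matcher m = SELECT.matcher(query);
        if (!m.find()) {
            throw new IllegalArgumentException("unsupported query: " + query);
        }
        String[] columns = m.group(1).split("\\s*,\\s*");
        if (columnIndex < 1 || columnIndex > columns.length) {
            throw new IllegalArgumentException("no column " + columnIndex);
        }
        return m.group(2) + "." + columns[columnIndex - 1].trim();
    }
}
```

With the query of Table 3-1, `SelectListResolver.resolve(query, 4)` identifies the Pl column of table Course, which is the schema element that the output text is then attached to.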

3.1.5 Heuristics Used for Information Extraction

A heuristic is any method found through observation which produces correct or

sufficiently exact results when applied in commonly occurring conditions. We have

developed several guidelines (heuristics) through observations to extract semantics

from the application source code and report design templates. These heuristics relate

semantically rich descriptive texts to schema elements. They are based mainly on the layout

and format (e.g., font size, face, color, and type) of data and description texts that are

used to communicate with users through console with input/output statements or through

a report.

We introduce these heuristics below. The first six heuristics shown in this section are

developed to extract information from source code of applications that communicate with

users through console with input/output statements. Please note that the code fragments

in the first six heuristics contain Java-specific input, output, and database-related

statements that use syntax based on the Java API. We parameterized these statements in

our SA prototype. Therefore it is theoretically straightforward to add new input, output,

and database-related statement names or to switch to another language if necessary.

We developed the rest of the heuristics to extract semantics from reports. We use

these heuristics to extract semantic information either from reports generated by Java

Servlets or from report design templates.

Heuristic 1. Application code generally has input-output statements that display

the results of queries executed on the underlying database. Typically, output statements

display one or more variables and/or contain one or more format strings. Table 3-2 shows a format string '\n Course code:\t' followed by a variable V.

Table 3-2. Output string gives clues about the semantics of the variable following it.

System.out.println("\n Course code:\t" + V);

Heuristic 2. The format string in an input-output statement describes the displayed slicing variable that comes after this format string. The format string '\n Course code:\t' describes the variable V in Table 3-2.

Heuristic 3. The format string that contains semantic information and the variable may not be in the same statement and may be separated by an arbitrary number of statements as shown in Table 3-3.

Table 3-3. Output string and the variable may not be in the same statement.

System.out.println("\n Course code:");

System.out.print(V);

Heuristic 4. There may be an arbitrary number of format strings in different statements that carry semantic information, and they may be separated by an arbitrary number of statements, before we encounter the output of a slicing variable. The concatenation of the format strings before the slicing variable gives more clues about the semantics of the variable. An example is shown in Table 3-4.

Table 3-4. Output strings before the slicing variable should be concatenated.

System.out.print("\n Course ");
System.out.println("\t code:");
System.out.print(V);
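Heuristics 3 and 4 can be sketched as a small accumulator that collects the printed literals until a slicing variable is output. The class `OutputLabeler` and its callback names are hypothetical, and the cleanup of the concatenated text (collapsing whitespace, dropping a trailing colon) is an assumption of this sketch rather than a rule stated in the text.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class OutputLabeler {
    private final StringBuilder pending = new StringBuilder();
    private final Map<String, String> labels = new LinkedHashMap<>();

    // Called for each string literal printed by an output statement.
    void onLiteral(String literal) {
        pending.append(literal);
    }

    // Called when an output statement prints a slicing variable.
    void onVariable(String name) {
        // collapse whitespace produced by \n and \t, then drop a trailing colon
        String label = pending.toString().replaceAll("\\s+", " ").trim();
        if (label.endsWith(":")) {
            label = label.substring(0, label.length() - 1).trim();
        }
        if (!label.isEmpty()) {
            labels.put(name, label);
        }
        pending.setLength(0); // the next variable starts with a fresh label
    }

    Map<String, String> labels() {
        return labels;
    }
}
```

Feeding the two literals of Table 3-4 followed by the variable V yields the single label 'Course code' for V, regardless of how many statements separated the literals.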

Heuristic 5. An output text in an output statement and a following variable in the same or following output statements are semantically related. The output text can be considered as the variable's possible semantics. We can trace back the variable through backward slicing and identify the schema element in the data source that assigns a value to it. We can conclude that this schema element and the variable are related. We can then relate the output text with the schema element. The Java code sample with an embedded SQL query in Table 3-5 illustrates our point.

Table 3-5. Tracing back the output text and associating it with the corresponding column of a table.

R = S.executeQuery(Q);
V = R.getString(1);
System.out.println("Course code: " + V);

In Table 3-5, the variable V is associated with the text 'Course code'. It is also associated with the first column of the query result in R, which is called C. Hence the column C can be associated with the text 'Course code'.

Heuristic 6. If the variable V is used with column C of table T in a compare statement in the where-clause of the query Q, and if one can associate a text string from an input/output statement denoting the meaning of variable V, then we can associate this meaning of V with column C of table T. The Java code sample with an embedded SQL query in Table 3-6 illustrates our point.

Table 3-6. Associating the output text with the corresponding column in the where-clause.

R = S.executeQuery(Q);
System.out.println("Course code: " + V);

In Table 3-6, the variable input is associated with the text 'Course code:'. It is also associated with the column C of table T. Hence the schema element C can be associated with the text 'Course code'.
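One way to find the column that a variable is compared with is to inspect the string literal concatenated immediately before the variable when the query is built. The sketch below assumes a query assembled as a literal prefix plus the variable, for example "SELECT Title FROM T WHERE C = '" + V + "'"; this query text, the class `WhereClauseLinker`, and the regular expressions are hypothetical and stand in for the parser-based analysis in SA.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WhereClauseLinker {
    private static final Pattern TABLE =
        Pattern.compile("\\bFROM\\s+(\\w+)", Pattern.CASE_INSENSITIVE);
    // a column name, a comparison operator, and an optional opening quote,
    // all at the very end of the literal
    private static final Pattern COMPARED_COLUMN =
        Pattern.compile("(\\w+)\\s*(?:=|<>|<=|>=|<|>|LIKE)\\s*'?\\s*$", Pattern.CASE_INSENSITIVE);

    // `prefix` is the literal concatenated immediately before the variable.
    // Returns "Table.Column", or null if the prefix does not end in a comparison.
    static String columnComparedWith(String prefix) {
        Matcher table = TABLE.matcher(prefix);
        Matcher column = COMPARED_COLUMN.matcher(prefix);
        if (table.find() && column.find()) {
            return table.group(1) + "." + column.group(1);
        }
        return null;
    }
}
```

Once the compared column is known, the text attached to the variable (for instance 'Course code') can be attached to that column as well.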

Table 3-7. Column header describes the data in that column.

College   Course   Title             Instructor
CAS       CS101    Intro Comp.       Dr. Long
GRS       CS640    Artificial Int.   Dr. Betke

Heuristic 7. A header H of a column (i.e., description text) of a table on a report describes the value of a data item D (i.e., data element) in that column. We can associate the header H with the data D presented in the same column. For example, the header "Instructor" in the fourth column describes the value "Dr. Long" in Table 3-7.

Table 3-8. Column on the left describes the data items listed to its immediate right.

Course        CSE103 Introduction to Databases
Credits       3
Description   Core concepts in databases

Heuristic 8. A descriptive text T on a row of a table on a report describes the value of a data item D to its right on the same row of the table. We can associate the text T with the data D presented on the same row. For example, the text "Description" on the third row describes the value "Core concepts in databases" in Table 3-8.

Table 3-9. Column on the left and the header immediately above describe the same set of data items.

Core Courses
Course        CSE103 Introduction to Databases
Credits       3
Description   Core concepts in databases
Elective Courses
Course        CSE131 Problem Solving
Credits       3
Description   Use of Comp. for problem solving

Heuristic 9. Heuristics 7 and 8 can be combined. Both the header above a data item in the same column and the text to its left on the same row describe the data. For example, both the text "Course" on the left hand side and the header "Elective Courses" describe the data "CSE131 Problem Solving" in Table 3-9.

Table 3-10. Set of data items can be described by two different headers.

Course              Instructor
Code      Room      Name           Room
CIS4301   E221      Dr. Hammer     E452
COP6726   E112      Dr. Jermaine   E456

Heuristic 10. If more than one header describes a data item on a report, all the headers corresponding to the data describe the data. For example, both the header "Instructor" and the header "Room" describe the value "E452" in Table 3-10.
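Collecting all headers that apply to a data cell can be sketched as a lookup over the header rows of the table. The representation is an assumption of this sketch: a header that spans several columns is repeated in every column it covers, and the class `ReportTable` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class ReportTable {
    // headerRows[r][c] is the header text covering column c in header row r.
    // A header spanning several columns is repeated in each column it spans.
    static List<String> headersFor(String[][] headerRows, int column) {
        List<String> headers = new ArrayList<>();
        for (String[] row : headerRows) {
            String header = row[column];
            if (header != null && !header.isEmpty()) {
                headers.add(header); // every covering header describes the cell
            }
        }
        return headers;
    }
}
```

For Table 3-10, the header rows are {"Course", "Course", "Instructor", "Instructor"} and {"Code", "Room", "Name", "Room"}, so the value "E452" in the last column is described by both "Instructor" and "Room".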

Table 3-11. Header can be processed before being associated with the data on a column.

Course   Title (Credits)            Instructor
CS105    Comp. Concepts ( 3.0 )     Dr. Krentel
CS110    Java Intro Prog. ( 4.0 )   Dr. Bolker

Heuristic 11. The data value presented in a column can be retrieved from more than one data item in the schema. In that case, the format of the header of the column gives clues about how we need to parse the header and associate it with the data items. For example, the data of the second column in Table 3-11 is retrieved from two data items in the data source. The format of the header "Title (Credits)" tells us that we need to consider the parentheses while parsing the header and associating the data items in the column with the header.
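Splitting a composite header and its data values along the parentheses can be sketched with one regular expression. The class `CompositeHeader` is hypothetical, and only the single "outer (inner)" format of Table 3-11 is handled; other header formats would need their own patterns.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CompositeHeader {
    // "outer ( inner )" with optional spaces around the parentheses
    private static final Pattern OUTER_INNER =
        Pattern.compile("^\\s*(.*?)\\s*\\(\\s*(.*?)\\s*\\)\\s*$");

    // Splits "Title (Credits)" into [Title, Credits]; text without
    // parentheses is returned as a single trimmed part.
    static List<String> split(String text) {
        List<String> parts = new ArrayList<>();
        Matcher m = OUTER_INNER.matcher(text);
        if (m.matches()) {
            parts.add(m.group(1));
            parts.add(m.group(2));
        } else {
            parts.add(text.trim());
        }
        return parts;
    }
}
```

Applying the same split to the header and to each value of the column pairs the parts by position: "Title" with "Comp. Concepts" and "Credits" with "3.0".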

In this section, we have introduced the Semantic Analyzer (SA). SA extracts information about the semantics of schema elements from the application source code. This information is an essential input for the Schema Matching (SM) component. In the following section, we introduce our schema matching approach and how we use SA to discover semantics for schema matching.


3.2 Schema Matching

Schema matching aims at discovering semantic correspondences between schema elements of disparate but related data sources. To match schemas, we need to identify the semantics of schema elements. When done manually, this is a tedious, time-consuming, and error-prone task. Much research has been carried out to automate this task to aid schema matching; see, for example, [25, 28, 78]. However, despite the ongoing efforts, current schema matching approaches, which use the schemas themselves as the main input for their algorithms, still rely heavily on manual input [26]. This dependence on human involvement is due to the well-known fact that schemas represent semantics poorly. Hence, we believe that improving current schema matching approaches requires improving the way we discover semantics.

Discovering semantics means gathering information about the data so that, after processing the data, a computer can decide how to use the data in the way a person would. In the context of schema matching, we are interested in finding information that leads us to a path from schema elements in one data source to the corresponding schema elements in the others. Therefore, we define discovering semantics for schema matching as discovering paths between corresponding schema elements in different data sources.


We reduce the level of difficulty of the schema matching problem by abstracting it to the matching of automatically generated documents, such as reports, that are semantically richer than the schemas to which they correspond. Reports and other user-oriented output, which are typically generated by report generators, do not use the names of schema elements directly but rather provide more descriptive names to make the output more comprehensible to the users. These descriptions, together with their formatting instructions and relationships to the underlying schema elements in the data source, can be extracted from the application code generating the report. These semantically rich descriptions, which can be linked to the schema elements in the source, can be used to discover relationships between data sources and hence between the underlying schemas. Moreover, reports use more domain terminology than schemas. Therefore, using domain dictionaries is particularly helpful, as opposed to their use in schema matching algorithms.

One can argue that the reports of an information system may not cover the entire schema and hence by this approach we may not find matches for all schema elements. It is important to note that we do not have to match all the schema elements of two data sources in order to have two organizations collaborate. We believe the reports typically present the most important data of the information system, which is also likely to be the set of elements that are important for the ensuing data integration scenario. Thus, starting the schema matching process from reports can help focus on the important data, eliminating any effort on matching unnecessary schema elements.

3.2.1 Motivating Example

We present a motivating example to show how analyzing report-generating application source code and report design templates can help us understand the semantics of schema elements better. We choose the reports for our motivating example from the university domain because the university domain is well known and easy to understand. To create our motivating example, we use the THALIA5 testbed and benchmark, which provides a collection of over 40 downloadable data sources representing university course catalogs from computer science departments worldwide [47].

Figure 3-11. Schemas of two data sources that collaborate for a new online degree program. (Schema S1 contains the relations Schedule(Code, Name, Time, Prereq) and CourseInst(Num1, Name, Num2, Office); schema S2 contains the relations Offerings, Faculty, and ClassTimes.)

We motivate the need for schema matching across the data sources of two computer science departments with a scenario. Let us assume that the computer science departments of universities A and B start to collaborate on a new online degree program. Unless one is content to query each report separately, one can imagine the existence of a course schedule mediator capable of providing integrated access to the different course sites. The mediator enables us to query the data of both universities and presents the results in a uniform way. Such a mediator requires finding relationships across the source schemas S1 and S2 of universities A and B shown in Figure 3-11. This is a challenging task when limited to the information provided by the data sources alone. By just using the schema element names in the figure, one can match schema elements of the two schemas in various ways. For instance, one can match the schema element Name in relation Offerings of schema S2 with schema element Name in relation Schedule of schema S1 or with schema element Name in relation CourseInst of schema S1. Both mappings seem reasonable when we only consider the available schema information.

5 THALIA Website: http://www.cise.ufl.edu/project/thalia.html

However, when we consider the reports generated by application source code using

these schemas of data sources, we can decide on the mappings of schemas more accurately.

Information system applications that generate reports retrieve data from the data source,

format the data and present it to users. To make the data more comprehensible to the user,

these applications generally do not use the names of schema elements but invent more

descriptive names (i.e., titles) for the data by using domain specific terms when applicable.

Figure 3-12. Reports from two sample universities listing courses. (Report R1, 'Course Listings', has the column headers Course, Title, Instructor, Time, and Prerequisite; report R2, 'Course Schedules', has the column headers Course, Title, Lecturer, and Time.)

Figure 3-13. Reports from two sample universities listing instructor offices.

For our motivating example, university A has reports R1 and R3 and university B has

R2 and R4 presenting data from their data sources. Reports R1 and R2 present course

listings and reports R3 and R4 present instructor office information from corresponding

universities. We show these simplified sample reports (R1, R2, R3, and R4) and the

schemas (S1 and S2) in Figures 3-12 and 3-13. The reader can easily match the column

headers (blue dotted arrows in Figures 3-12 and 3-13). On the other hand, it is hard to

match the schema elements of data sources correctly by only considering their names.

However, it again becomes straightforward to identify semantically related schema

elements if we know the links between column headers and schema elements (red arrows in

Figures 3-12 and 3-13).


Our idea is to find mappings between descriptive texts on reports (blue dotted arrows) by using semantic similarity functions and to find the links between these texts and schema elements (red arrows) by analyzing the application source code and report design templates. For this purpose, we first analyze the application source code or the report design template generating each report. For each report, we store our findings, such as descriptive texts (e.g., column headers), schema elements, and relations between the descriptive texts and the schema elements, in an instance of the report ontology. We give the details of the report ontology in Section 3.2.3. We pair report ontology instances, one from the first data source and one from the second data source. We then compute the similarities between all possible report ontology instance pairs. For our example, the four possible report pairs when we select one report from DS1 and the other from DS2 are [R1-R2], [R1-R4], [R2-R3] and [R3-R4]. We calculate the similarity scores between descriptive texts on reports for each report pair by using semantic similarity functions based on WordNet, which we describe in Section 3.2.4. We then transfer the similarity scores between descriptive texts of reports to scores between schema elements of schemas by using the previously discovered relations between descriptive texts and schema elements. Last, we merge the similarity scores of schema elements computed for each report pair and form a final matrix holding the similarity scores between elements of schemas that are higher than a threshold. We address the details of each step of our approach in Section 3.2.2.

When we apply our schema matching approach to the example schemas and reports described above, we obtain a precision value of 0.86 and a recall value of 1.00. We show the similarity scores between schema elements of data sources DS1 and DS2 which are greater than the threshold (0.5) in Figure 3-14. These results are better than the results found by matching the above schemas with the COMA++ (COmbination of MAtching algorithms) framework.6 COMA++ [7] is a well known and well respected schema matching framework providing a downloadable prototype. This example suggests that our approach promises better accuracy for schema matching than existing approaches. We provide a detailed evaluation of the approach in Chapter 6. In the next section, we describe the steps of our schema matching approach.

6 We use the default COMA++ All Context combined matcher.

[Matrix of similarity scores above the threshold (0.5) between the attributes of schema S1 (Schedule: Code, Name, PreReq, Time; CourseInst: Num1, Name, Num2, Office) and the attributes of schema S2.]

Figure 3-14. Similarity scores of schema elements of two data sources.

3.2.2 Schema Matching Approach

The main idea behind our approach is that user-oriented outputs such as reports encapsulate valuable information about the semantics of data, which can be used to facilitate schema matching. Applying well-known program understanding techniques as described in Section 3.1.4, we can extract semantically rich textual descriptions and relate these to the data presented on reports using the heuristics described in Section 3.1.5. We can trace the data back to the corresponding schema elements in the data source and match the corresponding schema elements in the two data sources. Below, we outline the steps of our schema matching approach, which we call Schema Matching by Analyzing ReporTs (SMART). In the next sections, we provide a detailed description of these steps, which are shown in Figure 3-15.

1. Creating an Instance of a Report Ontology
2. Computing Similarity Scores
3. Forming a Similarity Matrix
4. From Matching Ontologies to Schemas
5. Merging Results

Figure 3-15. Five steps of the Schema Matching by Analyzing ReporTs (SMART) algorithm.

3.2.3 Creating an Instance of a Report Ontology

In the first step, we analyze the application source code that generates a report. We described the details of the semantic analysis process in Section 3.1. The semantic information extracted from source code or from a report design template is stored in an instance of the report ontology.

We have developed an ontology for reports after analyzing some of the most widely used open source report generation tools, such as Eclipse BIRT,7 JasperReports8 and DataVision.9 We designed the report ontology using the Protege Ontology Editor10 and represented it in OWL (Web Ontology Language). The UML diagram of the report ontology depicted in Figure 3-16 shows the concepts, their properties and their relations with other concepts.

We store information about the descriptive texts on a report (e.g., column headers) and information about the source of data (i.e., schema elements) presented on a report in an instance of the report ontology. The descriptive text and schema element properties are stored in the description element and data element concepts of the report ontology, respectively. The data element concept has properties such as attribute, table (the table of the attribute in the relational database) and type (the type of the data stored in the attribute). We identify the relation between a description element concept and a data element concept with the help of a set of heuristics based on the location and format information described in Section 3.1.5, and we store this information in the hasDescription relation property of the description element concept.

7 Eclipse BIRT: http://www.eclipse.org/birt/

8 JasperReports: http://jasperreports.sourceforge.net/

9 DataVision: http://datavision.sourceforge.net/

10 Protege tool: http://protege.stanford.edu/

Figure 3-16. Unified Modeling Language (UML) diagram of the Schema Matching by Analyzing ReporTs (SMART) report ontology.

The design of the report ontology does not change from one report to another, but the information stored in an instance of the report ontology changes based on the report being analyzed. We placed the data element concept in the center of the report ontology, as shown in Figure 3-16. This design is appropriate for the calculation of similarity scores between data element concepts according to the formula described in Section 3.2.4.

3.2.4 Computing Similarity Scores

We compute similarity scores between all possible data element concept pairs consisting of a data element concept from an instance of the report ontology of the first data source and another data element concept from an instance of the report ontology of the second data source. This means that if there are $m$ reports having $n$ data element concepts on average for data source DS1 and $k$ reports having $l$ data element concepts on average for data source DS2, we compute similarity scores for $m \cdot n \cdot k \cdot l$ pairs of data element concepts.


However, computing similarity scores for all possible report ontology instance pairs may be unnecessary. For example, unrelated report pairs, such as a report describing payments of employees paired with another describing the grades of students at a university, may not have semantically related schema elements, and therefore we may not find any semantic correspondence by computing similarity scores of concepts of unrelated report ontology instance pairs. To save computation time, we filter out report pairs that have semantically unrelated reports. To determine whether a report pair is semantically related, we first extract the texts (i.e., titles, footers and data headers) on the two reports of the pair and calculate similarity scores of these texts. If the similarity score between the texts of a report pair is below a predetermined threshold, we assume that the report pair presents semantically unrelated data, and we do not compute similarity scores for the data element pairs of that report pair.

The similarity of two objects depends on the similarities of the components that

form the objects. An ontology concept is formed by the properties and the relations it

has. Each relation of an ontology concept connects the concept to its neighbor concept.

Therefore, the similarity of two concepts depends on the similarities of the properties of

the concepts and the similarities of the neighbor concepts. For example, the similarity of

two data element concepts from different instances of the report ontology depends on the

similarity of their properties attribute, table, and type and the similarities of their neighbor

concepts DescriptionElement, Header, Footer, etc.

Our similarity function between concepts of instances of an ontology is similar to

the function proposed by Rodriguez and Egenhofer [81]. Rodriguez and Egenhofer also

consider sets of features (properties) and semantic relations (neighbors) among concepts

while assessing similarity scores among entity classes from different ontologies. While their similarity function aims to find similarity scores between concepts from different ontologies, our similarity function finds similarity scores between the instances of an ontology.


We formulate the similarity of two concepts in different instances of an ontology as follows:

\[ \mathrm{sim}_c(c_1, c_2) = w_p \cdot \mathrm{sim}_p(c_1, c_2) + w_n \cdot \mathrm{sim}_n(c_1, c_2) \tag{3-1} \]

where $c_1$ is a concept in an instance of the ontology, $c_2$ is the same type of concept in another instance of the ontology, $w_p$ is the weight of the total similarity of the properties of that concept, and $w_n$ is the weight of the total similarity of the neighbor concepts that can be reached from that concept by a relation. $\mathrm{sim}_p(c_1, c_2)$ and $\mathrm{sim}_n(c_1, c_2)$ are the formulas to calculate the similarities of the properties and the neighbors. We can formulate $\mathrm{sim}_p(c_1, c_2)$ as follows:

\[ \mathrm{sim}_p(c_1, c_2) = \sum_{i=1}^{k} w_{p_i} \cdot \mathrm{SimFunc}(c_{1p_i}, c_{2p_i}) \tag{3-2} \]

where $k$ is the number of properties of that concept, $w_{p_i}$ is the weight of the $i$th property, $c_{1p_i}$ is the $i$th property of the concept in the first report ontology instance, and $c_{2p_i}$ is the same type of property of the other concept in the second report ontology instance.

SimFunc is the function that we use to assess a similarity score between the values

of the properties of two concepts. For description elements, the SimFunc is a semantic

similarity function between texts which is similar to the text-to-text similarity function of

Corley and Mihalcea [21]. To calculate the similarity score between two text strings T1

and T2, we first eliminate stop words (e.g., a, and, but, to, by). We then find the word

having the maximum similarity score in text T2 for each word in text T1. The similarity

score between two words, one from text T1 and the other from T2, is obtained from a WordNet-based semantic similarity function such as the Jiang and Conrath metric [52].

We sum up the maximum scores and divide the sum by the word count of the text T1.

The result is the measure of similarity between text T1 and the text T2 for the direction

from T1 to T2. We repeat the process for the reverse direction (i.e., from T2 to T1) and

then compute the average of the two scores for a bidirectional similarity score.
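The bidirectional text-to-text procedure described above can be sketched as follows. This is a minimal illustration, not the prototype's code: the function names, the tiny stop-word list, and the toy word-level similarity table are assumptions made for the example, and the toy table merely stands in for a WordNet-based metric such as Jiang and Conrath. The sketch also assumes that the word count used for normalization is taken after stop-word removal.

```python
from typing import Callable, List

# Stop words taken from the examples listed in the text; a real list would be longer.
STOP_WORDS = {"a", "and", "but", "to", "by"}


def tokens(text: str) -> List[str]:
    """Lower-case the text, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def directed_similarity(src: List[str], dst: List[str],
                        word_sim: Callable[[str, str], float]) -> float:
    """For each word in src take its best match in dst, then average over src."""
    if not src or not dst:
        return 0.0
    return sum(max(word_sim(w, v) for v in dst) for w in src) / len(src)


def text_similarity(t1: str, t2: str,
                    word_sim: Callable[[str, str], float]) -> float:
    """Average of the T1-to-T2 and T2-to-T1 directed scores."""
    a, b = tokens(t1), tokens(t2)
    return (directed_similarity(a, b, word_sim)
            + directed_similarity(b, a, word_sim)) / 2


# Hypothetical word-level scores standing in for a WordNet-based metric.
TOY_SCORES = {frozenset(("instructor", "lecturer")): 0.9}


def toy_word_sim(w: str, v: str) -> float:
    if w == v:
        return 1.0
    return TOY_SCORES.get(frozenset((w, v)), 0.0)


score = text_similarity("Instructor Name", "Name by Lecturer", toy_word_sim)
```

Passing the word-level metric in as a parameter keeps the text-level logic independent of the particular WordNet measure that is chosen.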

We use different similarity functions for different properties. If the property that we are comparing holds text data, such as the description property, we use one of the word semantic similarity functions that we introduced in Section 2.11. By using a semantic similarity measure instead of a lexical similarity measure such as edit distance, we can detect the similarities of words that are lexically far but semantically close, such as 'lecturer' and 'instructor', and we can also eliminate the words that are lexically close but semantically far, such as 'tower' and 'power'. Besides the description property of the description element concept, we also use semantic similarity measures to compute similarity scores between the footernote property of the footer concept, the headernote property of the header concept and the title property of the report concept. If the property that we are comparing is the attribute or table property of the data element concept, we assess a similarity score based on the Levenshtein edit similarity measure. Besides these properties of the data element concept, we also use edit similarity measures to compute similarity scores between the query property of the report concept.
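As an illustration of the lexical measure, the sketch below computes the Levenshtein distance and turns it into a similarity in [0,1]. The normalization by the length of the longer string and the case folding are assumptions made for this example; the text does not state how the prototype normalizes the edit distance.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """1 minus the distance divided by the longer length (assumed normalization)."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


# 'tower' and 'power' differ by one substitution, so they score high lexically
# even though they are semantically far apart.
tower_power = edit_similarity("tower", "power")
```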

In the following formula, which calculates the similarity between the neighbors of two concepts, $l$ is the number of relations of the concepts we are comparing, $w_{n_i}$ is the weight of the $i$th relation, and $c_{1n_i}$ ($c_{2n_i}$) is the neighbor concept of the first (second) concept that we reach by following the $i$th relation:

\[ \mathrm{sim}_n(c_1, c_2) = \sum_{i=1}^{l} w_{n_i} \cdot \mathrm{sim}_c(c_{1n_i}, c_{2n_i}) \tag{3-3} \]

Note that our similarity function is generic and can be used to calculate similarity

scores between concepts of instances of any ontologies. Even though the formulas in

Equations 3-1, 3-2 and 3-3 are recursive in nature, when we apply the formulas to

compute similarity scores between data elements of report ontologies, we do not encounter

recursive behavior. That is because there is no path back to the data element concept through relations from the neighbors of the data element concept. In other words, the neighbor concepts of the data element concept do not have the data element concept as a neighbor.

We apply the above formulas to calculate similarity scores between data element

concepts of two different report ontologies. The data element concept has properties

attribute, table, and type and neighbor concepts description element, report, header,

and footer concepts. The similarity score between two data element concepts can be

formulated as follows:

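\[
\begin{aligned}
\mathrm{sim}_{DataElement}(DataElement_1, DataElement_2) ={} & w_1 \cdot \mathrm{SimFunc}(Attribute_1, Attribute_2) \\
& + w_2 \cdot \mathrm{SimFunc}(Table_1, Table_2) \\
& + w_3 \cdot \mathrm{SimFunc}(Type_1, Type_2) \\
& + w_4 \cdot \mathrm{sim}_{DescriptionElement}(DescriptionElement_1, DescriptionElement_2) \\
& + w_5 \cdot \mathrm{sim}_{Report}(Report_1, Report_2) \\
& + w_6 \cdot \mathrm{sim}_{Header}(Header_1, Header_2) \\
& + w_7 \cdot \mathrm{sim}_{Footer}(Footer_1, Footer_2)
\end{aligned}
\tag{3-4}
\]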

We explain how we determine the weights $w_1$ to $w_7$ in Section 6.2. The similarity scores between two description element, report, header and footer concepts can be computed by the following formulas:

\[ \mathrm{sim}_{DescriptionElement}(DescriptionElement_1, DescriptionElement_2) = \mathrm{SimFunc}(Description_1, Description_2) \tag{3-5} \]

\[ \mathrm{sim}_{Report}(Report_1, Report_2) = \mathrm{SimFunc}(Query_1, Query_2) + \mathrm{SimFunc}(Title_1, Title_2) \tag{3-6} \]

\[ \mathrm{sim}_{Header}(Header_1, Header_2) = \mathrm{SimFunc}(HeaderNote_1, HeaderNote_2) \tag{3-7} \]

\[ \mathrm{sim}_{Footer}(Footer_1, Footer_2) = \mathrm{SimFunc}(FooterNote_1, FooterNote_2) \tag{3-8} \]
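The weighted combination for data elements can be sketched as below. This is only an illustration under stated assumptions: the flat DataElement record, the weight values, and the exact-match stand-in used for every SimFunc are hypothetical. In the approach described above, the lexical function would be an edit similarity and the semantic function a WordNet-based text similarity; the function used for the type property is not specified in the text, so the sketch compares types with the lexical function.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

Sim = Callable[[str, str], float]


@dataclass
class DataElement:
    """A data element flattened together with its neighbor concepts (hypothetical layout)."""
    attribute: str
    table: str
    type: str
    description: str   # description property of the DescriptionElement neighbor
    query: str         # query property of the Report neighbor
    title: str         # title property of the Report neighbor
    header_note: str   # headernote property of the Header neighbor
    footer_note: str   # footernote property of the Footer neighbor


def sim_data_element(d1: DataElement, d2: DataElement, w: Sequence[float],
                     lexical: Sim, semantic: Sim) -> float:
    """Weighted sum of the seven terms of the data element similarity."""
    sim_report = lexical(d1.query, d2.query) + semantic(d1.title, d2.title)
    terms = [
        lexical(d1.attribute, d2.attribute),
        lexical(d1.table, d2.table),
        lexical(d1.type, d2.type),                 # assumption: lexical comparison of types
        semantic(d1.description, d2.description),  # description element similarity
        sim_report,                                # report similarity (query plus title)
        semantic(d1.header_note, d2.header_note),  # header similarity
        semantic(d1.footer_note, d2.footer_note),  # footer similarity
    ]
    return sum(wi * t for wi, t in zip(w, terms))


def exact(a: str, b: str) -> float:
    """Stand-in SimFunc: 1.0 for identical strings, 0.0 otherwise."""
    return 1.0 if a == b else 0.0


e1 = DataElement("Name", "CourseInst", "string", "Instructor",
                 "SELECT 1", "Offices", "h", "f")
e2 = DataElement("InsName", "Instructor", "string", "Instructor",
                 "SELECT 1", "Offices", "h", "f")
weights = [0.2, 0.1, 0.1, 0.3, 0.1, 0.1, 0.1]  # hypothetical values for w1..w7
result = sim_data_element(e1, e2, weights, exact, exact)
```

Because the report term is a sum of two SimFunc values, its weight has to be chosen with that in mind if the final score is to stay in [0,1].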

3.2.5 Forming a Similarity Matrix

To form a similarity matrix, we connect to the underlying data sources using a

call-level interface (e.g., JDBC) and extract the schemas of two data sources to be

integrated. A similarity matrix is a table storing similarity scores for two schemas such

that elements of the first schema form the column headers and elements of the second

schema form the row headers. The similarity scores are in the range [0,1]. The similarity matrix given as an example in Figure 3-17 has schema elements from the motivating example in Section 3.2.1; the similarity scores between schema elements are fictitious.

                        Schema S1
Entity         Schedule                        CourseInst
Attribute      Code   Name   PreReq  Time     Num1   Name   Num2   Office
No             0.9    0.4    0.2     0.25     0.4    0.3    0.3    0.3
Name           0.25   0.95   0.15    0.2      0.3    0.4    0.3    0.25
TIDam          0.25   0.25   0.15    0.5      0.2    0.2    0.2    0.35
InsNo          0.3    0.2    0.1     0.2      0.3    0.15   0.2    0.2
Code           0.5    0.4    0.2     0.5      0.15   0.3    0.4    0.3
Day            0.2    0.25   0.05    0.7      0.25   0.2    0.15   0.2
Hour           0.2    0.2    0.1     0.7      0.2    0.2    0.15   0.25
No             0.4    0.3    0.1     0.2      0.8    0.25   0.3    0.2
Name           0.45   0.5    0.1     0.2      0.4    0.95   0.3    0.3
Room           0.2    0.3    0.05    0.2      0.2    0.2    0.2    0.2
Title          0.2    0.4    0.1     0.1      0.2    0.3    0.2    0.1

Figure 3-17. Example for a similarity matrix.

3.2.6 From Matching Ontologies to Schemas

In the first step, we traced each data element to its corresponding schema element(s). We use this information to convert inter-ontology matching scores into scores between schema elements. Using the converted scores, we then fill in a similarity matrix for each report pair.


Note that we find similarity scores only for the subset of the schemas used in the reports. We believe that reports typically present the most important data of the information system, which is likely to be the set of elements that is important for the ensuing data integration scenario. Even though the reports of an information system may not cover the entire schema, our approach can help focus on the important data, thus eliminating efforts to match unnecessary schema elements. Note that each similarity matrix can be sparse, having only a small subset of its cells filled in, as shown in Figures 3-18 and 3-19.

[Sparse similarity matrix between schema S1 and schema S2: scores from reports about course listings.]

Figure 3-18. Similarity scores after matching report pairs about course listings.

[Sparse similarity matrix between schema S1 and schema S2: scores from reports about instructor offices.]

Figure 3-19. Similarity scores after matching report pairs about instructor offices.
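The conversion from data element scores to a sparse schema-level matrix can be pictured with the following sketch. The dictionary-based representation, the identifiers, and the rule of keeping the highest score when several data elements trace back to the same schema element pair are assumptions made for this illustration.

```python
from typing import Dict, Tuple

SchemaElement = Tuple[str, str]  # (table, attribute)


def to_schema_matrix(
    data_element_scores: Dict[Tuple[str, str], float],
    schema_of_1: Dict[str, SchemaElement],
    schema_of_2: Dict[str, SchemaElement],
) -> Dict[Tuple[SchemaElement, SchemaElement], float]:
    """Re-key scores between data elements by the schema elements they trace back to."""
    matrix: Dict[Tuple[SchemaElement, SchemaElement], float] = {}
    for (d1, d2), score in data_element_scores.items():
        key = (schema_of_1[d1], schema_of_2[d2])
        # Assumption: keep the highest score if a schema element pair occurs twice.
        matrix[key] = max(score, matrix.get(key, 0.0))
    return matrix


# Hypothetical data elements of one report pair and their traced schema elements.
scores = {("r1.col1", "r2.col1"): 0.9, ("r1.col2", "r2.col1"): 0.2}
trace_1 = {"r1.col1": ("CourseInst", "Name"), "r1.col2": ("CourseInst", "Office")}
trace_2 = {"r2.col1": ("Instructor", "Name")}
matrix = to_schema_matrix(scores, trace_1, trace_2)
```

Only the cells for schema elements that actually appear on the two reports are present, which mirrors the sparsity noted above.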

3.2.7 Merging Results

After generating a similarity matrix for each report pair, we need to merge them into a final similarity matrix. If we have more than one score for a schema element pair, we need to merge the scores. In Section 3.2.4, we described how we compute similarity scores for report pairs to avoid unnecessary computations between unrelated report pairs. We use these overall similarity scores between report pairs while merging similarity scores. We multiply the similarity score of a schema element pair by the overall similarity score of the report pair and sum up the resulting scores. Then, we divide the final score by the number of report pairs. For instance, if the similarity score between schema elements A and B is 0.9 in the first report pair having an overall similarity score of 0.7 and is 0.5 in the second report pair having an overall similarity score of 0.6, then we conclude that the similarity score between schema elements A and B is (0.9 * 0.7 + 0.5 * 0.6)/(2) = 0.465. Finally, we eliminate the combined scores which fall below a (user-defined) threshold.
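The merging step can be sketched as follows, reproducing the worked example for schema elements A and B. The data layout and names are hypothetical. The sketch also assumes that the divisor is the number of report pairs that contribute a score for the given schema element pair, which coincides with the example above but is not spelled out in the text.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def merge_scores(report_pair_results: List[Tuple[float, Dict[Pair, float]]],
                 threshold: float) -> Dict[Pair, float]:
    """Weight each score by its report pair's overall similarity, average, then filter."""
    totals: Dict[Pair, float] = defaultdict(float)
    counts: Dict[Pair, int] = defaultdict(int)
    for overall, matrix in report_pair_results:
        for pair, score in matrix.items():
            totals[pair] += overall * score
            counts[pair] += 1
    merged = {pair: totals[pair] / counts[pair] for pair in totals}
    # Keep only the combined scores that reach the user-defined threshold.
    return {pair: s for pair, s in merged.items() if s >= threshold}


results = [
    (0.7, {("A", "B"): 0.9}),                    # first report pair
    (0.6, {("A", "B"): 0.5, ("A", "C"): 0.2}),   # second report pair
]
final = merge_scores(results, threshold=0.4)
```

Here the pair (A, B) ends up with (0.9 * 0.7 + 0.5 * 0.6) / 2, while the hypothetical pair (A, C) is dropped because its combined score is below the threshold.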


We implemented both the semantic analyzer (SA) component of SEEK and the SMART schema matcher using the Java programming language. As shown in Figure 4-1, we have written 1,350 KB of Java code (approximately 27,000 lines of code) for our prototype implementation. In addition, we have utilized 1,150 KB of Java code (approximately 23,000 lines of code) which was automatically generated by JavaCC. In the following sections, we first explain the SA prototype and then the SMART prototype.

[Pie chart: SA hand-coded 850 KB, SA JavaCC-generated 1,150 KB, SMART 500 KB.]

Figure 4-1. Java code size distribution of the Semantic Analyzer (SA) and Schema Matching by Analyzing ReporTs (SMART) packages.

4.1 Semantic Analyzer (SA) Prototype

We have implemented the SA prototype using the Java language. The SEEK prototype source code is placed in the seek package. The functionality of the SEEK prototype is divided into several packages. The source code of the semantic analyzer (SA) component resides in the sa package. Java classes in the sa package are further divided into subpackages according to their functionality. The subpackages of the sa package are listed in Table 4-1.

4.1.1 Using JavaCC to generate parsers

The classes inside the packages syntaxtree, visitor, and parsers are automatically created by the JavaCC tool. JavaCC is a tool that reads a grammar specification and converts it to a Java program that can recognize matches to the grammar according to that specification. As shown in Figure 4-2, JavaCC processes a grammar specification file and outputs the Java files that contain the code of the parser. The parser can process the languages that conform to the grammar in the specification file. The parsers generated in this way form the ASTG component of the SA. Grammar specification files for some grammars such as Java, C++, C, SQL, XML, HTML, Visual Basic, and XQuery can be found at the JavaCC grammar repository Web site.1 These specification files have been tested and corrected by many JavaCC implementers. This implies that parsers produced by using these specifications must be reasonably effective in the correct production of ASTs. The classes generated by JavaCC form the abstract syntax tree generator (ASTG) of the SA, which was described earlier.

Table 4-1. Subpackages in the sa package and their functionality.
package name     classes in the package
visitor          default visitor classes.
parsers          classes to parse application source code written in grammars Java, HTML and SQL.
seekstructures   supplementary classes to analyze application source code.
seekvisitors     visitor classes to analyze source code written in grammars Java,

For the SA component of the SEEK prototype, we created parsers for three different grammars: Java, SQL and HTML. We placed these parsers, the related syntax tree classes and the generic visitor classes into the parsers, syntaxtree and visitor packages, respectively. Each Java class inside the syntaxtree package has an accept method to be used by visitors. Visitor classes have visit methods, each of which corresponds to a Java class inside the syntaxtree package. The syntaxtree, visitor, and parsers packages have 142, 15 and 14 classes, respectively. The classes inside these packages remain the same as long as the Java, SQL and HTML grammars do not change.
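The accept/visit arrangement described above is the classic visitor pattern. The sketch below shows the idea in Python for brevity; the node classes, the visit_<ClassName> naming scheme, and the collector visitor are hypothetical and are not the generated classes of the prototype, which are written in Java.

```python
class Node:
    """Base class of the syntax tree; every node exposes an accept method."""

    def __init__(self, *children):
        self.children = list(children)

    def accept(self, visitor):
        # Dispatch to visit_<ClassName> if the visitor defines it, else to the default.
        method = getattr(visitor, "visit_" + type(self).__name__, visitor.visit_default)
        return method(self)


class CompilationUnit(Node):
    pass


class MethodCall(Node):
    def __init__(self, name, *args):
        super().__init__(*args)
        self.name = name


class StringLiteral(Node):
    def __init__(self, value):
        super().__init__()
        self.value = value


class DepthFirstVisitor:
    """Generic visitor: walks the whole tree and does nothing else."""

    def visit_default(self, node):
        for child in node.children:
            child.accept(self)


class MethodCallCollector(DepthFirstVisitor):
    """Task-specific visitor: records the names of all method calls it meets."""

    def __init__(self):
        self.calls = []

    def visit_MethodCall(self, node):
        self.calls.append(node.name)
        self.visit_default(node)  # keep descending into the arguments


tree = CompilationUnit(
    MethodCall("executeQuery", StringLiteral("SELECT Code FROM Schedule")),
    MethodCall("println", MethodCall("getString", StringLiteral("Name"))),
)
collector = MethodCallCollector()
tree.accept(collector)
```

The tree classes stay unchanged when a new analysis is needed; only a new visitor subclass is added, which matches the split between generated and hand-written packages.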

1 JavaCC repository: http://www.cobase. cs avac c/

Figure 4-2. Using JavaCC to generate parsers.

The classes inside the packages seekstructures and seekvisitors are written to fulfill the goals of the SA. The seekstructures and seekvisitors packages have 25 and ten classes, respectively, and are subject to change as we add new functionality to the SA module. The classes inside these packages form the Information Extractor (IEx) of the SA, which was described earlier. The IEx consists of several visitor classes that follow the visitor design pattern. The execution steps of the IEx and the functionality of some selected visitor classes are described in the next section.

4.1.2 Execution steps of the information extractor

The semantic analysis process has two main steps. In the first step, the SA makes the preparations necessary for analyzing the source code and forms the control flow graph. The SA driver accepts the name of the stand-alone Java class file (with the main method) as an argument. Starting from this Java class file, the SA finds all the user-defined Java classes to be analyzed in the class path and forms the control flow graph. Next, the SA parses all the Java classes in the control flow graph and produces the ASTs of these Java classes. Then, the visitor class ObjectSymbolTable gathers variable declaration information for each class to be analyzed and stores this information in the SymbolTable classes. The SymbolTable classes are passed to each visitor class and are filled with new information as the SA process continues.

In the second step, the SA identifies the variables used in input, output and SQL statements. The SA uses the ObjectSlicingVars visitor class to identify slicing variables. The list of all input, output, and database-related statements, which are language (Java, JDBC) specific, is stored in the InputOutputStatements and SqlStatements classes. To analyze additional statements, or to switch to another language, all we need to do is add new statement names to these classes or update the existing ones. When a visitor class encounters a method statement while traversing the AST, it checks this list to find out whether the method is an input, output, or database-related statement.

The SA finds and parses SQL strings embedded inside the source code. The SA uses the ObjectSQLStatement visitor class to find and parse SQL statements. While the visitor traverses the AST, it constructs the values of variables that are of String type. When a variable of type String or a string literal is passed as a parameter to an SQL execute method (e.g., executeQuery(queryStr)), this visitor class parses the string and constructs the AST of the SQL string. It then uses the visitor class named ObjectSQLParse to extract information from that SQL statement and stores this information in a class named ResultsetSQL. The information gathered from SQL statements and input/output methods is used to construct relations between a database schema element and the text denoting the possible meaning of the schema element.
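The idea of constructing String values during traversal and resolving them at an SQL execute call can be illustrated with the simplified sketch below. It works on a hypothetical, already flattened list of statements rather than on a real Java AST, ignores control flow, and uses made-up names; it is meant only to show the bookkeeping, not the prototype's visitor classes.

```python
from typing import Dict, List, Optional, Tuple

Part = Tuple[str, str]    # ("lit", text) or ("var", variable_name)
Statement = Tuple         # ("assign", name, parts) or ("call", method, parts)


def resolve(parts: List[Part], env: Dict[str, str]) -> Optional[str]:
    """Concatenate literals and known variable values; None if a value is unknown."""
    out = []
    for kind, value in parts:
        if kind == "lit":
            out.append(value)
        elif value in env:
            out.append(env[value])
        else:
            return None
    return "".join(out)


def find_sql_strings(statements: List[Statement],
                     sql_methods=("executeQuery",)) -> List[str]:
    """Track String assignments and collect the strings passed to SQL execute methods."""
    env: Dict[str, str] = {}
    found: List[str] = []
    for stmt in statements:
        if stmt[0] == "assign":
            _, name, parts = stmt
            value = resolve(parts, env)
            if value is None:
                env.pop(name, None)   # value can no longer be determined
            else:
                env[name] = value
        elif stmt[0] == "call":
            _, method, parts = stmt
            if method in sql_methods:
                value = resolve(parts, env)
                if value is not None:
                    found.append(value)
    return found


program = [
    ("assign", "cols", [("lit", "Code, Name")]),
    ("assign", "queryStr", [("lit", "SELECT "), ("var", "cols"),
                            ("lit", " FROM Schedule")]),
    ("call", "println", [("var", "queryStr")]),
    ("call", "executeQuery", [("var", "queryStr")]),
]
sql_strings = find_sql_strings(program)
```

The recovered string would then be handed to an SQL parser to obtain the tables and columns that the query touches.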

Besides analyzing application source code written in Java, the SA can also analyze report design templates represented in XML. The Report Template Parser (RTP) component of the SA uses the Simple API for XML2 (SAX) to parse report templates.

The outcome of the IEx is written into report ontology instances represented in OWL. The Report Ontology Writer (ROW) uses the OWL API3 to write the semantic information into OWL ontologies.

2 Simple XML API:

3 OWL API: http: // .shtml

4.2 Schema Matching by Analyzing ReporTs (SMART) Prototype

We have implemented the SMART schema matcher prototype using the Java language. There are 46 classes in five different packages. The total size of the Java classes is 500 KB (approximately 10,000 lines).

We also wrote a Perl program to find similarity scores between word pairs by using the WordNet similarity library4 [73]. To assess similarity scores between texts, we first eliminate stop words (e.g., a, and, but, to, by) and convert plural words to singular words. We convert plural words to singular words5 because WordNet similarity functions return similarity scores between singular words.

The SMART prototype also uses the Simple API for XML (SAX) library to parse XML files and the OWL API to read OWL report ontology instances into an internal Java representation.


The COMA++ framework allows external matchers to be plugged in through an interface. We have integrated our SMART matcher into the COMA++ framework as an external matcher.

4 WordNet Semantic Similarity Library: http://search.cpan.org/dist/WordNet-Similarity/

5 We are using the PlingStemmer library written by Fabian M. Suchanek:


Information integration refers to the unification of related, heterogeneous data from

disparate sources, for example, to enable collaboration across domains and enterprises.

Information integration has been an active area of research since the early 80s and

produced a plethora of techniques and approaches to integrate heterogeneous information.

Determining the quality and applicability of an information integration technique has

been a challenging task because of the lack of available test data of sufficient richness and

volume to allow meaningful and fair evaluations. Researchers generally use their own test

data and evaluation techniques, which are tailored to the strengths of the approach and

often hide any existing weaknesses.

5.1 THALIA Website and Downloadable Test Package

While working on this research, we saw the need for a testbed and benchmark providing test data of sufficient richness and volume to allow meaningful and fair evaluations of information integration approaches. To answer this need, we developed the THALIA1 (Test Harness for the Assessment of Legacy information Integration Approaches) benchmark. We show a snapshot of the THALIA website in Figure 5-1. THALIA provides researchers with a collection of over 40 downloadable data sources representing University course catalogs, a set of twelve benchmark queries, as well as a scoring function for ranking the performance of an integration system [47, 48].

The THALIA website also hosts cached web pages of University course catalogs. The downloadable packages have data extracted from these websites. Figure 5-2 shows an example cached course catalog of Boston University hosted on the THALIA website. On the THALIA website, we also provide the ability to navigate between extracted data and corresponding schema files that are in the downloadable packages. Figure 5-3 shows the XML representation of Boston University's course catalog and the corresponding schema file. Downloadable University course catalogs are represented using well-formed and valid XML according to the extracted schema for each course catalog. Extraction and translation from the original representation were done using a source-specific wrapper which preserves structural and semantic heterogeneities that exist among the different course catalogs.

1 URL of the THALIA website: ect/thalia. html

Figure 5-1. Snapshot of the Test Harness for the Assessment of Legacy information Integration Approaches (THALIA) website.


Figure 5-2. Snapshot of the computer science course catalog of Boston University.

5.2 DataExtractor (HTMLtoXML) Opensource Package

To extract the source data provided in the THALIA benchmark, we enhanced and used the Telegraph Screen Scraper (TESS)2 source wrapper developed at UC Berkeley. The enhanced version of TESS, DataExtractor (HTMLtoXML), can be obtained from the SourceForge website3 along with the 46 examples used to extract the data provided in THALIA. The DataExtractor (HTMLtoXML) tool provides added functionality over the TESS wrapper, including the capability of extracting data from nested structures. It extracts data from an HTML page according to a configuration file and puts the data into an XML file according to a specified structure.

2 TESS: http://telegraph.cs

3 URL of DataExtractor (HTMLtoXML): http://sourceforge.net/projects/dataextractor

Figure 5-3. Extensible Markup Language (XML) representation of Boston University's course catalog and corresponding schema file.

5.3 Classification of Heterogeneities

Our benchmark focuses on syntactic and semantic heterogeneities since we believe

they pose the greatest technical challenges to the research community. We have chosen

course information as our domain of discourse because it is well known and easy to

understand. Furthermore, there is an abundance of data sources publicly available that

allowed us to develop a testbed exhibiting all of the syntactic and semantic heterogeneities

that we have identified in our classification. We list our classification of heterogeneities

below. We have also listed these classifications in [48] and in the downloadable package on

the THALIA website along with examples from THALIA benchmark and corresponding



1. Synonyms: Attributes with different names that convey the same meaning. For
example, 'instructor' vs. 'lecturer'.

2. Simple Mapping: Related attributes in different schemas differ by a mathematical
transformation of their values. For example, time values using a 24 hour vs. 12 hour
clock.

3. Union Types: Attributes in different schemas use different data types to represent
the same information. For example, course description as a single string vs. a complex
data type composed of strings and links (URLs) to external data.

4. Complex Mappings: Related attributes differ by a complex transformation of their
values. The transformation may not always be computable from first principles. For
example, the attribute 'Units' represents the number of lectures per week vs. a textual
description of the expected work load in the field 'credits'.

5. Language Expression: Names or values of identical attributes are expressed in
different languages. For example, the English term 'database' is called 'Datenbank'
in the German language.

6. Nulls: The attribute (value) does not exist. For example, some courses do not have
a textbook field or the value for the textbook field is empty.

7. Virtual Columns: Information that is explicitly provided in one schema is only
implicitly available in the other and must be inferred from one or more values. For
example, course prerequisites are provided as an attribute in one schema but exist
only in comment form as part of a different attribute in another schema.

8. Semantic incompatibility: A real-world concept that is modeled by an attribute
does not exist in the other schema. For example, the concept of student classification
('freshman', 'sophomore', etc.) at American universities does not exist at German
universities.

9. Same attribute in different structure: The same or related attribute may be
located in different positions in different schemas. For example, the attribute Room
is an attribute of Course in one schema while it is an attribute of Section, which in
turn is an attribute of Course, in another schema.

10. Handling sets: A set of values is represented using a single, set-valued attribute
in one schema vs. a collection of single-valued attributes organized in a hierarchy in
another schema. For example, a course with multiple instructors can have a single
attribute instructors or multiple section-instructor attribute pairs.

11. Attribute name does not define semantics: The name of the attribute does
not adequately describe the meaning of the value that is stored there.

12. Attribute composition: The same information can be represented either by
a single attribute (e.g., as a composite value) or by a set of attributes, possibly
organized in a hierarchical manner.
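As a small illustration of the first two categories, the sketch below resolves a synonym and a simple mapping by hand. The synonym table and the function names (`SYNONYMS`, `canonical_name`, `to_24_hour`) are hypothetical and are not part of THALIA; an integration system under test would have to discover such correspondences itself.

```python
# Hypothetical synonym table mapping source attribute names to a shared name.
SYNONYMS = {"lecturer": "instructor", "instructor": "instructor"}


def canonical_name(attribute):
    """Resolve a synonym heterogeneity by mapping to a shared attribute name."""
    key = attribute.strip().lower()
    return SYNONYMS.get(key, key)


def to_24_hour(value):
    """Resolve a simple mapping: convert '1:30 PM' style times to 'HH:MM'."""
    time_part, meridiem = value.strip().split()
    hour, minute = (int(part) for part in time_part.split(":"))
    meridiem = meridiem.upper()
    if meridiem not in ("AM", "PM"):
        raise ValueError("expected AM or PM, got " + meridiem)
    # 12 AM maps to hour 0 and 12 PM stays at hour 12.
    hour = hour % 12 + (12 if meridiem == "PM" else 0)
    return f"{hour:02d}:{minute:02d}"
```

The later categories (for example, complex mappings or semantic incompatibility) cannot be handled by a fixed formula like this, which is why the classification treats them separately.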

5.4 Web Interface to Upload and Compare Scores

The THALIA website offers a web interface for researchers to upload their results for each heterogeneity listed above. The web interface accepts several measures, such as the size of the specification, the number of mouse clicks, and the size of the program code, to evaluate the effort the approach spends to resolve the heterogeneity. The uploaded scores can be viewed by anybody visiting the website of the THALIA benchmark, which helps researchers compare their approach with others. Figure 5-4 shows the scores uploaded to the THALIA benchmark for the Integration Wizard (IWiz) Project at the University of Florida.


Figure 5-4. Scores uploaded to Test Harness for the Assessment of Legacy information
Integration Approaches (THALIA) benchmark for Integration Wizard (IWiz)
Project at the University of Florida.

While THALIA is not the only data integration benchmark,4 what distinguishes THALIA is that it combines rich test data with a set of benchmark queries and an associated scoring function to enable the objective evaluation and comparison of integration systems.

5.5 Usage of THALIA

We believe that THALIA not only simplifies the evaluation of existing integration technologies but also helps researchers improve the accuracy and quality of future approaches by enabling more thorough and more focused testing. We have used THALIA test data for the evaluation of our SM approach as described in Section 6.1.1. We are also happy to see that it is being used as a source of test data and as a benchmark by researchers [11, 74, 100] and in graduate courses.5

4 A list of Data Integration Benchmarks and Test Suites can be found at
http://mars.

5 URL of the graduate course at the University of Toronto using THALIA is
http://www.cs.toronto.edu/~miller/cs2525


CHAPTER 6
EVALUATION

We evaluate our approach using the prototype described in Chapter 4. In the following sections, we first describe our test data sets and our experiments. We then compare our results with other techniques and present a discussion of the results.

6.1 Test Data

The test data sets have two main components: the schema of the data source and the reports presenting the data from the data source. We used two test data sets. The first test data set is from the THALIA data integration testbed. This data set has 10 schemas. Each schema of the THALIA test data set has one report, and the report covers all schema elements of the corresponding schema. The second test data set is from the University of Florida registrar office. This data set has three schemas. Each schema of the UF registrar test data set has 10 reports, and the reports do not cover all schema elements of the corresponding schema. The first test data set, from THALIA, is used to see how the SMART approach performs when the entire schema is covered by reports, and the second test data set, from UF, is used to see how the SMART approach performs when the entire schema is not covered by reports. The test data set from UF also enables us to see the effect of having multiple reports for one schema. In the following subsections, we give detailed descriptions of the schemas and reports of these test data sets.

6.1.1 Test Data Set from THALIA testbed

The first test data set is from the THALIA testbed [48]. THALIA offers 44+ different university course catalogs from computer science departments worldwide. Each catalog page is represented in HTML. THALIA also offers the data and schema of each catalog page. We explained the details of the THALIA testbed in Chapter 5.

For the scope of this evaluation, we treat each catalog page (in HTML) as a sample report from the corresponding university. We selected 10 university catalogs (reports) from THALIA that represent different report design practices. Figures 6-1 and 6-2 give two examples of these report design practices. Figure 6-1 shows the course scheduling report of Boston University and Figure 6-2 shows the course scheduling report of Michigan State University.

Figure 6-1. Report design practice where all the descriptive texts are headers of the data.

Course: CSE101 Computing Concepts and Competencies
Semester: Fall of every year. Spring of every year. Summer of every year.
Credits: Total Credits: 3 Lecture/Recitation/Discussion Hours: 2 Lab Hours: 2 3(2-2)
Description: Core concepts in computing including information storage, retrieval,
management, and representation. Applications from specific disciplines.
Applying core concepts to design and implement solutions to various focal
problems, using hardware, multimedia software, communication and networks.
Semester Alias: CPS 100, CPS 130
Course: CSE103 Introduction to Databases in Information Technology
Semester: Fall of every year. Spring of every year. Summer of every year.

Figure 6-2. Report design practice where all the descriptive texts are on the left hand side
of the data.

The sizes of the schemas in the THALIA test data set vary from 5 to 13 elements, as listed in Table 6-1. We stored the data and schemas for each selected university in a MySQL 4.1 database. When we pair the 10 schemas, we have 45 different pairs of schemas to match. These 45 schema pairs have 2576 possible combinations of schema elements. We manually determined that 215 of these possible combinations are real matches. We use these manual mappings to evaluate our results.

Table 6-1. The 10 university catalogs selected for evaluation and size of their schemas.

University Name                          # of Schema Elements
University of Arizona                    5
Brown University                         7
Boston University                        7
California Institute of Technology       5
Carnegie Mellon University               9
Florida State University                 13
Michigan State University                8
New York University                      7
University of Massachusetts Boston       8
University of New South Wales, Sydney    7
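The pair and combination counts quoted for this data set follow directly from the schema sizes in Table 6-1. The short sketch below recomputes them; the function and variable names are illustrative and not taken from the SMART prototype.

```python
from itertools import combinations

# Schema sizes as listed in Table 6-1.
SCHEMA_SIZES = {
    "University of Arizona": 5,
    "Brown University": 7,
    "Boston University": 7,
    "California Institute of Technology": 5,
    "Carnegie Mellon University": 9,
    "Florida State University": 13,
    "Michigan State University": 8,
    "New York University": 7,
    "University of Massachusetts Boston": 8,
    "University of New South Wales, Sydney": 7,
}


def count_pairs_and_combinations(sizes):
    """Return (number of schema pairs, number of candidate element pairs)."""
    pairs = list(combinations(sizes.values(), 2))
    # Each pair of schemas contributes size_a * size_b candidate element pairs.
    return len(pairs), sum(a * b for a, b in pairs)


schema_pairs, element_combinations = count_pairs_and_combinations(SCHEMA_SIZES)
```

With these sizes the computation gives 45 schema pairs and 2576 candidate element combinations, so the 215 manually confirmed matches make up roughly 8% of the candidates.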

We recreated each report (catalog page) from THALIA using two methods. One method uses Java Servlets and the other uses the Eclipse Business Intelligence and Reporting Tool (BIRT).1 The Java Servlet application corresponding to a course catalog fetches the relevant data from the repository and produces the HTML report. Report templates designed with the BIRT tool likewise fetch the relevant data from the repository and produce the HTML report. When the SMART prototype is run, it analyzes the Java Servlet code and the report templates to extract semantic information.

6.1.2 Test Data Set from University of Florida

The second test data set contains student registry information from the University of Florida. We contacted several offices at the University of Florida to obtain test data sets.2 We first contacted the College of Engineering. After several meetings and discussions, the College of Engineering agreed to give us the schemas and the report design templates without any data. In fact, we were not after the data because our approach works without the data. The College of Engineering forms and uses the data set that we obtained after several months as follows. The College of Engineering runs a batch program on the first day of every week and downloads data from the legacy DB2 database of the Registrar office. The DB2 database of the Registrar office is a hierarchical database. The College of Engineering stores the data in relational MS Access databases. The College of Engineering extracts a subset of the database of the Registrar office and uses the same attribute and table names in the MS Access database as in the database of the Registrar office. The College of Engineering creates subsets of this MS Access database and runs its reports on these MS Access databases. Figure 6-3 shows the conceptual view of the architecture of the databases in the College of Engineering.3

1 http://www.eclipse.org/birt/

2 I would like to thank Dr. Joachim Hammer for his extensive efforts in reaching out to
several departments and organizing meetings with staff to gather test data sets.



Figure 6-3. Architecture of the databases in the College of Engineering.

We also contacted the UF Bridges office. Bridges is a project to replace the university business computer systems, called legacy systems, with new web-based, integrated systems that provide real-time information and improve university business processes.4 The UF Bridges project also redesigned the legacy DB2 database of the Registrar office for MS SQL Server. We obtained the schemas but again could not obtain the associated data because of privacy issues.5

3 I would like to thank James Ogles from the College of Engineering for his time to
prepare the test data and for answering our questions regarding the data set.

Finally, we reached the Business School.6 The Business School stores its data in MS SQL Server databases. Its schema is based on the Bridges office schema; however, it uses different naming conventions. They add new structures into the schemas when


Table 6-2. Portion of a table description from the College of Engineering, the Bridges
Project and the Business School schemas.

The College of Eng.    The Bridges Office            The Business School
Sect  VARCHAR(4)       UF_TERM_CD    VARCHAR(5)      CourseType  varchar(1)
CT    CHAR             UF_TYPE_DESC  VARCHAR(40)     Section     varchar(4)

The schemas from the College of Engineering, the Bridges Office and the Business School are semantically related; however, they exhibit different syntactical features. The naming conventions and sizes of the schemas are different. The College of Engineering uses the same names for schema elements as in the Registrar's database. The schema element names often contain abbreviations, most of which are not possible to guess. The Bridges office uses a more descriptive naming convention for schema elements. The schema elements (i.e., column names) in the schema of the Business School have the most descriptive names. However, the table names in the schema of the Business School use

4 http://www.bridges.ufl.edu/about/overview.html

5 I also would like to acknowledge the help of Mr. Warren Curry from the Bridges office
for his help obtaining the schemas.

6 I also would like to acknowledge the help of Mr. John C. Holmes from the Business
School for his help obtaining the schemas.

Full Text




© 2007 Oguzhan Topsakal


To all who are patient, supportive, just and loving to others regardless of time, location and status


ACKNOWLEDGMENTS

I thought quite a lot about the time when I would finish my dissertation and write this acknowledgment section. Finally, the time has come. Here is the place where I can remember all the good memories and thank everyone who helped along the way. However, I feel words are not enough to show my gratitude to those who were there with me all along the road to my Ph.D.

First of all, I give thanks to God for giving me the patience, strength and commitment to come all this way.

I would like to give my sincere thanks to my dissertation advisor, Dr. Joachim Hammer, who has been so kind and supportive to me. He was the perfect person for me to work with. I also would like to thank my other committee members: Dr. Tuba Yavuz-Kahveci, Dr. Christopher M. Jermaine, Dr. Herman Lam, and Dr. Raymond Issa for serving on my committee.

Thanks to Umut Sargut, Zeynep Sargut, Can Ozturk, Fatih Buyukserin and Fatih Gordu for making Gainesville a better place to live.

I am grateful to my parents, H. Nedret Topsakal and Sabahatdin Topsakal; to my brother, Metehan Topsakal; and to my sister-in-law, Sibel Topsakal. They were always there for me when I needed them, and they have always supported me in whatever I do.

My wife, Elif, and I joined our lives during the most hectic times of my Ph.D. studies, and she supported me in every aspect. She is my treasure.


TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
  1.1 Problem Definition
  1.2 Overview of the Approach
  1.3 Contributions
  1.4 Organization of the Dissertation

2 RELATED CONCEPTS AND RESEARCH
  2.1 Legacy Systems
  2.2 Data, Information, Semantics
  2.3 Semantic Extraction
  2.4 Reverse Engineering
  2.5 Program Understanding Techniques
    2.5.1 Textual Analysis
    2.5.2 Syntactic Analysis
    2.5.3 Program Slicing
    2.5.4 Program Representation Techniques
    2.5.5 Call Graph Analysis
    2.5.6 Data Flow Analysis
    2.5.7 Variable Dependency Graph
    2.5.8 System Dependence Graph
    2.5.9 Dynamic Analysis
  2.6 Visitor Design Patterns
  2.7 Ontology
  2.8 Web Ontology Language (OWL)
  2.9 WordNet
  2.10 Similarity
  2.11 Semantic Similarity Measures of Words
    2.11.1 Resnik Similarity Measure
    2.11.2 Jiang-Conrath Similarity Measure
    2.11.3 Lin Similarity Measure
    2.11.4 Intrinsic IC Measure in WordNet
    2.11.5 Leacock-Chodorow Similarity Measure
    2.11.6 Hirst-St.Onge Similarity Measure
    2.11.7 Wu and Palmer Similarity Measure
    2.11.8 Lesk Similarity Measure
    2.11.9 Extended Gloss Overlaps Similarity Measure
  2.12 Evaluation of WordNet-Based Similarity Measures
  2.13 Similarity Measures for Text Data
  2.14 Similarity Measures for Ontologies
  2.15 Evaluation Methods for Similarity Measures
  2.16 Schema Matching
    2.16.1 Schema Matching Surveys
    2.16.2 Evaluations of Schema Matching Approaches
    2.16.3 Examples of Schema Matching Approaches
  2.17 Ontology Mapping
  2.18 Schema Matching vs. Ontology Mapping

3 APPROACH
  3.1 Semantic Analysis
    3.1.1 Illustrative Examples
    3.1.2 Conceptual Architecture of Semantic Analyzer
    3.1.3 Extensibility and Flexibility of Semantic Analyzer
    3.1.4 Application of Program Understanding Techniques in SA
    3.1.5 Heuristics Used for Information Extraction
  3.2 Schema Matching
    3.2.1 Motivating Example
    3.2.2 Schema Matching Approach
    3.2.3 Creating an Instance of a Report Ontology
    3.2.4 Computing Similarity Scores
    3.2.5 Forming a Similarity Matrix
    3.2.6 From Matching Ontologies to Schemas
    3.2.7 Merging Results

4 PROTOTYPE IMPLEMENTATION
  4.1 Semantic Analyzer (SA) Prototype
    4.1.1 Using JavaCC to generate parsers
    4.1.2 Execution steps of the information extractor
  4.2 Schema Matching by Analyzing ReporTs (SMART) Prototype

5 TEST HARNESS FOR THE ASSESSMENT OF LEGACY INFORMATION INTEGRATION APPROACHES (THALIA)
  5.1 THALIA Website and Downloadable Test Package
  5.2 DataExtractor (HTMLtoXML) Opensource Package
  5.3 Classification of Heterogeneities
  5.4 Web Interface to Upload and Compare Scores
  5.5 Usage of THALIA

6 EVALUATION
  6.1 Test Data
    6.1.1 Test Data Set from THALIA testbed
    6.1.2 Test Data Set from University of Florida
  6.2 Determining Weights
  6.3 Experimental Evaluation
    6.3.1 Running Experiments with THALIA Data
    6.3.2 Running Experiments with UF Data

7 CONCLUSION
  7.1 Contributions
  7.2 Future Directions

REFERENCES
BIOGRAPHICAL SKETCH


LIST OF TABLES

2-1 List of relations used to connect senses in WordNet.
2-2 Absolute values of the coefficients of correlation between human ratings of similarity and the five computational measures.
3-1 Semantic Analyzer can transfer information from one method to another through variables and can use this information to discover semantics of a schema element.
3-2 Output string gives clues about the semantics of the variable following it.
3-3 Output string and the variable may not be in the same statement.
3-4 Output strings before the slicing variable should be concatenated.
3-5 Tracing back the output text and associating it with the corresponding column of a table.
3-6 Associating the output text with the corresponding column in the where-clause.
3-7 Column header describes the data in that column.
3-8 Column on the left describes the data items listed to its immediate right.
3-9 Column on the left and the header immediately above describe the same set of data items.
3-10 Set of data items can be described by two different headers.
3-11 Header can be processed before being associated with the data on a column.
4-1 Subpackage in the sa package and their functionality.
6-1 The 10 university catalogs selected for evaluation and size of their schemas.
6-2 Portion of a table description from the College of Engineering, the Bridges Project and the Business School schemas.
6-3 Names of tables in the College of Engineering, the Bridges Office, and the Business School schemas and number of schema elements that each table has.
6-4 Weights found by analytical method for different similarity functions with THALIA test data.
6-5 Confusion matrix.


LIST OF FIGURES

1-1 Scalable Extraction of Enterprise Knowledge (SEEK) Architecture.
3-1 Scalable Extraction of Enterprise Knowledge (SEEK) Architecture.
3-2 Schema used by an application.
3-3 Schema used by a report.
3-4 Conceptual view of the Data Reverse Engineering (DRE) module of the Scalable Extraction of Enterprise Knowledge (SEEK) prototype.
3-5 Conceptual view of Semantic Analyzer (SA) component.
3-6 Report design template example.
3-7 Report generated when the above template was run.
3-8 Java Servlet generated HTML report showing course listings of CALTECH.
3-9 Annotated HTML page generated by analyzing a Java Servlet.
3-10 Inter-procedural call graph of a program source code.
3-11 Schemas of two data sources that collaborates for a new online degree program.
3-12 Reports from two sample universities listing courses.
3-13 Reports from two sample universities listing instructor offices.
3-14 Similarity scores of schema elements of two data sources.
3-15 Five steps of Schema Matching by Analyzing ReporTs (SMART) algorithm.
3-16 Unified Modeling Language (UML) diagram of the Schema Matching by Analyzing ReporTs (SMART) report ontology.
3-17 Example for a similarity matrix.
3-18 Similarity scores after matching report pairs about course listings.
3-19 Similarity scores after matching report pairs about instructor offices.
4-1 Java Code size distribution of (Semantic Analyzer) SA and (Schema Matching by Analyzing ReporTs) SMART packages.
4-2 Using JavaCC to generate parsers.
5-1 Snapshot of Test Harness for the Assessment of Legacy information Integration Approaches (THALIA) website.
5-2 Snapshot of the computer science course catalog of Boston University.
5-3 Extensible Markup Language (XML) representation of Boston University's course catalog and corresponding schema file.
5-4 Scores uploaded to Test Harness for the Assessment of Legacy information Integration Approaches (THALIA) benchmark for Integration Wizard (IWiz) Project at the University of Florida.
6-1 Report design practice where all the descriptive texts are headers of the data.
6-2 Report design practice where all the descriptive texts are on the left hand side of the data.
6-3 Architecture of the databases in the College of Engineering.
6-4 Results of the SMART with Jiang-Conrath (JCN), Lin and Levenstein metrics.
6-5 Results of COmbination of MAtching algorithms (COMA++) with All Context and Filtered Context combined matchers and comparison of SMART and COMA++ results.
6-6 Receiver Operating Characteristics (ROC) curves of SMART and COMA++ for THALIA test data.
6-7 Results of the SMART with different report pair similarity thresholds for UF test data.
6-8 F-Measure results of SMART and COMA++ for UF test data when report pair similarity is set to 0.7.
6-9 Receiver Operating Characteristics (ROC) curves of the SMART for UF test data.
6-10 Comparison of the ROC curves of the SMART and COMA++ for UF test data.


Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

SEMANTIC INTEGRATION THROUGH APPLICATION ANALYSIS

By

Oguzhan Topsakal

May 2007

Chair: Joachim Hammer
Major: Computer Engineering

Organizations increasingly need to participate in rapid collaborations with other organizations to be successful and they need to integrate their data sources to share and exchange data in such collaborations. One of the problems that needs to be solved when integrating different data sources is finding semantic correspondences between elements of schemas of disparate data sources (a.k.a. schema matching). Schemas, even those from the same domain, show many semantic heterogeneities. Resolving these heterogeneities is mostly done manually; which is tedious, time consuming, and expensive. Current approaches to automating the process mainly use the schemas and the data as input to discover semantic heterogeneities. However, the schemas and the data are not sufficient sources of semantics. In contrast, we analyze a valuable source of semantics, namely application source code and report design templates, to improve schema matching for information integration. Specifically, we analyze application source code that generate reports to present the data of the organization in a user friendly way. We trace the descriptive information on a report back to the corresponding schema element(s) through reverse engineering of the application source code or report design templates and store the descriptive text, data, and the corresponding schema elements in a report ontology instance. We utilize the information we have discovered for schema matching. Our experiments using a fully functional prototype system show that our approach produces more accurate results than current techniques.


CHAPTER1 INTRODUCTION 1.1ProblemDenition Thesuccessofmanyorganizationslargelydependsontheira bilitytoparticipatein rapid,rexible,limited-timecollaborations.Theneedtoc ollaborateisnotjustlimited tobusinessbutalsoappliestogovernmentandnon-protorg anizationssuchasmilitary, emergencymanagement,health-care,rescue,etc.Thesucce ssofabusinessorganization dependsonitsabilitytorapidlycustomizeitsproducts,ad apttocontinuouslychanging demands,andreducecostsasmuchaspossible.Governmentor ganizations,suchasthe DepartmentofHomelandSecurity,needtocollaborateandex changeintelligenceto maintainthesecurityofitsbordersortoprotectcriticali nfrastructure,suchasenergy supplyandtelecommunications.Non-protorganizations, suchastheAmericanRed Cross,needtocollaborateonmattersrelatedtopublicheal thincatastrophicevents,such ashurricanes.Thecollaborationoforganizationsproduce sasynergytoachieveacommon goalthatwouldnotbepossibleotherwise. Organizationsparticipatinginarapid,rexiblecollabora tionenvironmentneedto shareandexchangedata.Inordertoshareandexchangedata, organizationsneedto integratetheirinformationsystemsandresolveheterogen eitiesamongtheirdatasources. Theheterogeneitiesexistatdierentlevels.Thereexistp hysicalheterogeneitiesatthe systemlevelbecauseofdierencesbetweenvariousinterna ldatastorage,retrieval,and representationmethods.Forexample,someorganizationsm ightuseprofessionaldatabase managementsystemswhileothersmightusesimpleratlesto storeandrepresenttheir data.Inaddition,thereexiststructural(syntax)-levelh eterogeneitiesbecauseofthe dierencesattheschemalevel.Finally,thereexistsemant iclevelheterogeneitiesbecause ofthedierencesintheuseofthedatawhichcorrespondtoth esamereal-worldobjects [ 47 ].Wefaceabroadrangeofsemanticheterogeneitiesininfor mationsystemsbecauseof 12


dierentviewpointsofdesignersoftheseinformationsyst ems.Semanticheterogeneityis simplyaconsequenceoftheindependentcreationoftheinfo rmationsystems[ 44 ]. Toresolvesemanticheterogeneities,organizationsmust rstidentifythesemanticsof theirdataelementsintheirdatasources.Discoveringthes emanticsofdataautomatically hasbeenanimportantareaofresearchinthedatabasecommun ity[ 22 36 ].However,the processofresolvingsemanticheterogeneityofdatasource sisstillmostlydonemanually. Resolvingheterogeneitiesmanuallyisatedious,error-pr one,time-consuming,non-scalable andexpensivetask.Thetimeandinvestmentneededtointegr atedatasourcesbecomea signicantbarriertoinformationintegrationofcollabor atingorganizations. Inthisresearch,wearedevelopinganintegratednovelappr oachthatautomatesthe processofsemanticdiscoveryindatasourcestoovercometh isbarrierandtohelprapid, rexiblecollaborationamongorganizations.Asmentioneda bove,weareawarethatthere existphysicalheterogeneitiesamonginformationsources buttokeepthedissertation focused,weassumedatastorage,retrievalandrepresentat ionmethodsarethesame amongtheinformationsystemstobeintegrated.Accordingt oourexperiencesgained asasoftwaredeveloperforinformationtechnologiesdepar tmentofseveralbanksand softwarecompanies,applicationsourcecodegeneratingre portsencapsulatevaluable informationaboutthesemanticsofthedatatobeintegrated .Reportspresentdata fromthedatasourceinawaythatiseasilycomprehensibleby theuserandcanberich sourceofsemantics.Weanalyzeapplicationsourcecodetod iscoversemanticstofacilitate integrationofinformationsystems.Weoutlinetheapproac hinSection 1.2 belowand providemoredetailedexplanationinSections 3.1 and 3.2 .Theresearchdescribedin thisdissertationisapartoftheNSF-funded 1 SEEK(ScalableExtractionofEnterprise Knowledge)projectwhichalsoservesasatestbed. 1 TheSEEKprojectissupportedbytheNationalScienceFounda tionundergrant numbersCMS-0075407andCMS-0122193. 13


1.2OverviewoftheApproach Theresultsdescribedinthisdissertationarebasedonthew orkwehavedoneonthe SEEKproject.TheSEEKprojectisdirectedatovercomingthe problemsofintegrating legacydataandknowledgeacrosstheparticipantsofacolla borationnetwork[ 45 ].The goaloftheSEEKprojectistodevelopmethodsandtheorytoen ablerapidintegration oflegacysourcesforthepurposeofdatasharing.Weapplyth esemethodologiesinthe SEEKtoolkitwhichallowsuserstodevelopSEEKwrappers.Aw rappertranslatesqueries fromanapplicationtothedatasourceschemaatrun-time.SE EKwrappersactasan intermediarybetweenthelegacysourceanddecisionsuppor ttoolswhichrequireaccessto theorganization'sknowledge. Figure1-1:ScalableExtractionofEnterpriseKnowledge(S EEK)Architecture. 14


Ingeneral,SEEK[ 45 46 ]worksinthreesteps:DataReverseEngineering(DRE), SchemaMatching(SM),andWrapperGeneration(WG).Inther ststep,DataReverse Engineering(DRE)componentofSEEKgeneratesadetailedde scriptionofthelegacy source.DREhastwosub-components,SchemaExtractor(SE)a ndSemanticAnalyzer (SA).SEextractstheconceptualschemaofthedatasource.S Aanalyzeselectronically availableinformationsourcessuchasapplicationcodeand discoversthesemanticsof schemaelementsofthedatasource.Inotherwords,SAdiscov ersmappingsbetweendata itemsstoredinaninformationsystemandthereal-worldobj ectstheyrepresentbyusing thepiecesofevidencethatitextractsfromtheapplication code.SAenhancestheschema ofthedatasourcebythediscoveredsemanticsandwereferto thesemanticallyenhanced schemaknowledgebaseoftheorganization.Inthesecondste p,theSchemaMatching(SM) componentmapstheknowledgebaseofanorganizationwithth eknowledgebaseofanother organization.Inthethirdstep,theextractedlegacyschem aandthemappingrules providetheinputtotheWrapperGenerator(WG),whichprodu cesthesourcewrapper. ThesethreestepsofSEEKarebuild-timeprocesses.Atrun-t ime,thesourcewrapper translatesqueriesfromtheapplicationdomainmodeltothe legacysourceschema.A high-levelschematicviewoutliningtheSEEKcomponentsan dtheirinteractionsisshown inFigure 1-1 Inthisresearch,ourfocusisontheSemanticAnalysis(SA)a ndSchemaMatching (SM)methodology.WerstdescribehowSAextractssemantic allyrichoutputsfromthe applicationsourcecodeandthenrelatesthemwiththeschem aknowledgeextractedby theSchemaExtractor(SE).Weshowthatwecangathersignic antsemanticinformation fromtheapplicationsourcecodebythemethodologywehaved eveloped.Wethenfocus onourSchemaMatching(SM)methodology.Wedescribehowweu tilizethesemantic informationthatwehavediscoveredbySAtondmappingsbet weentwodatasources. Theextractedsemanticinformationandthemappingscanthe nbeusedbythesubsequent 15


wrapper generation step to facilitate the development of legacy source translators and other tools during information integration, which is not the focus of this dissertation.

1.3 Contributions

In this research, we introduce novel approaches for semantic analysis of application source code and for matching of related but disparate schemas. In this section, we list the contributions of this work. We describe these contributions in detail in Chapter 7 while concluding the dissertation.

External information sources such as corpora of schemas and past matches have been used for schema matching, but application source code has not been used as an external information source yet [25, 28, 78]. In this research, we focus on this well-known but not yet addressed challenge of analyzing application source code for the purpose of semantic extraction for schema matching. The accuracy of the current schema matching approaches is not sufficient for fully automating the process of schema matching [26]. The approach we present in this dissertation provides better accuracy for the purpose of automatic schema matching.

The schema matching approaches so far have mostly used lexical similarity functions or look-up tables to determine the similarities of two schema element properties (for example, the names and types of schema elements). There have been suggestions to utilize semantic similarity measures between words [7], but they have not been realized. We utilize state-of-the-art semantic similarity measures between words to determine similarities and show their effect on the results.

Another important contribution is the introduction of a generic similarity function for matching classes of ontologies. We have also described how we determine the weights of our similarity function. Our similarity function, along with the methodology to determine the weights of the function, can be applied in many domains to determine similarities between different entities.


Integration based on user reports eases the communication between business and information technology (IT) specialists. Business and IT specialists often have difficulty understanding each other. They can discuss data presented on reports rather than incomprehensible database schemas. Analyzing reports for data integration and sharing helps business and IT specialists communicate better.

Another contribution is the functional extensibility of our semantic analysis methodology. Our information extraction framework lets researchers add new functionality as they develop new heuristics and algorithms on the source code being analyzed. Our current information extraction techniques provide improved performance because they require fewer passes over the source code, and improved accuracy because they eliminate unused code fragments (i.e., methods, procedures).

While conducting the research, we saw that there is a need for available test data of sufficient richness and volume to allow meaningful and fair evaluations between different information integration approaches. To address this need, we developed the THALIA² (Test Harness for the Assessment of Legacy information Integration Approaches) benchmark, which provides researchers with a collection of over 40 downloadable data sources representing University course catalogs, a set of twelve benchmark queries, as well as a scoring function for ranking the performance of an integration system [47, 48].

1.4 Organization of the Dissertation

The rest of the dissertation is organized as follows. We introduce important concepts of the work and summarize research in Chapter 2. Chapter 3 describes our semantic analysis approach and schema matching approach. Chapter 4 describes the implementation details of our prototype. Before we describe the experimental evaluation of our approach in Chapter 6, we describe the THALIA test bed in Chapter 5. Chapter 7 concludes the dissertation and summarizes the contributions of this work.

² THALIA website:


CHAPTER 2
RELATED CONCEPTS AND RESEARCH

In the context of this work, we have explored a broad range of research areas. These research areas include but are not limited to data semantics, semantic discovery, semantic extraction, legacy system understanding, reverse engineering of application code, information extraction from application code, semantic similarity measures, schema matching, ontology extraction and ontology mapping. While developing our approach, we leverage these research areas.

In this chapter, we introduce important concepts and related research that are essential for understanding the contributions of this work. Whenever necessary, we provide our interpretations of definitions and commonly accepted standards and conventions in this field of study. We also present the state of the art in the related research areas.

We first introduce what a legacy system is. Then we state the difference between the frequently used terms data, information and semantics in Section 2.2. We point out some of the research in semantic extraction in Section 2.3. Since we extract semantics through reverse engineering of application source code, we provide the definitions of reverse engineering of source code and database reverse engineering in Section 2.4 and also present the techniques for program understanding in Section 2.5. We represent the extracted information from application source code of different legacy systems in ontologies and utilize these ontologies to find out semantic similarities between them. For this reason, semantic similarity measures are also important for us. We have explored the research on semantic similarity measures and present these works in Section 2.11 after giving the definition of similarity in Section 2.10. We aim to leverage the research on assessing similarity scores between texts and ontologies. We present these techniques in Sections 2.13 and 2.14. We then present the ontology concept and the Web Ontology Language (OWL). Finally, we present ontology mapping and schema mapping


and conclude the chapter by presenting some outstanding techniques of schema matching in the literature.

2.1 Legacy Systems

Our approaches for semantic analysis of application source code and schema matching have been developed as a part of the SEEK project. The SEEK project aims to help the understanding of legacy systems. We analyze the application source code of a legacy system to understand its semantics and apply the gained knowledge to solve the schema matching problem of data integration. In this section, we first give a broad definition of a legacy system and highlight its importance, and then provide its definition in the context of this work.

Legacy systems are generally known as inflexible, nonextensible, undocumented, old and large software systems which are essential for the organization's business [12, 14, 75]. They significantly resist modifications and changes. Legacy systems are very valuable because they are the repository of corporate knowledge collected over a long time, and they also encapsulate the logic of the organization's business processes [49].

A legacy system is generally developed and maintained by many different people with many different programming styles. Mostly, the original programmers have left, and the existing team is not expert in all aspects of the system [49]. Even though there was once documentation about the design and specification of the legacy system, the original software specification and design have been changed, but the documentation was not updated throughout the years of development and maintenance. Thus, understanding is lost, and the only reliable documentation of the system is the application source code running on the legacy system [75].

In the context of this work, we define a legacy system as any information system with poor or nonexistent documentation about the underlying data or the application code that is using the data. Despite the fact that legacy systems are often interpreted as old


systems, for us, an information system is not required to be old in order to be considered legacy.

2.2 Data, Information, Semantics

In this section, we give definitions of data, information and semantics before we explore some research on semantic extraction in the following section.

According to a simplistic definition, data is the raw, unprocessed input to an information system that produces information as an output. A commonly accepted definition states that data is a representation of facts, concepts or instructions in a formalized manner suitable for communication, interpretation, or processing by humans or by automatic means [2, 18]. Data mostly consists of disconnected numbers, words, symbols, etc. and results from measurable events or objects.

Data has value when it is processed, changed into a usable form and placed in a context [2]. When data has a context and has been interpreted, it becomes information. Then it can be used purposefully as information [1].

Semantics is the meaning and the use of data. Semantics can be viewed as a mapping between an object stored in an information system and the real-world object it represents [87].

2.3 Semantic Extraction

In this section, we first state the importance of semantic extraction and of application source code as a rich source for semantic extraction, and then point out several related research efforts in this research area.

Sheth et al. [87] stated that data semantics does not seem to have a purely mathematical or formal model and cannot be discovered completely and fully automatically. Therefore, the process of semantic discovery requires human involvement. Besides being human-dependent, semantic extraction is a time-consuming and hence expensive task [36]. Although it cannot be fully automated, discovering even a limited amount of useful semantics can tremendously reduce the cost of understanding a system.


Semantics can be found from knowledge representation schemas, communication protocols, and applications that use the data [87].

Throughout the discussions and research on semantic extraction, application source code has been proposed as a rich source of information [30, 36, 87]. Besides, researchers have agreed that the extraction of semantics from application source code is essential for identification and resolution of semantic heterogeneity.

We use the discovered semantics from application source code to find correspondences between schemas of disparate data sources automatically. In this context, discovering semantics means gathering information about the data, so that a computer can identify mappings (paths) between corresponding schema elements in different data sources.

Jim Ning et al. worked on extracting semantics from application source code but with a slightly different aim. They developed an approach to identify and recover reusable code components [67]. They investigated conditional statements, as we do, to find out business rules. They stated that conditional statements are potential business rules. They also gave importance to input and output statements for highlighting semantics inside the code, and stated that meaningful business functions normally process input values and produce results. Jim Ning et al. called investigating input variables forward slicing and called investigating output statements backward slicing. The drawback of their approach was being very language-specific (Cobol) [67].

N. Ashish et al. worked on extracting semantics from internet information sources to enable semi-automatic wrapper generation [5]. They used several heuristics to identify important tokens and structures of HTML pages in order to create the specification for a parser. Similar to our approach, they benefited from parser generation tools, namely YACC [53] and LEX [59], for semantic extraction.

There are several related works in information extraction from text that deal with tables and ontology extraction from tables. The most relevant work about information extraction from HTML pages by the help of heuristics was done by Wang and Lochovsky


[94]. They aimed to form the schema of the data extracted from an HTML page by using labels of a table on an HTML page. The heuristic that they use to relate labels to the data and to separate data found in a table cell into several attributes is very similar to our heuristics. For example, they assume that if several attributes are encoded into one text string, then there should be some special symbol(s) in the string as the separator to visually support users in distinguishing the attributes. They also use heuristics to relate labels to the data from an HTML page that are similar to our heuristics. Buttler et al. [17] and Embley et al. [32] also developed heuristic-based approaches for information extraction from HTML pages. However, their aim was to identify boundaries of data on an HTML page. Embley et al. [33] also worked on table recognition from documents and suggested a table ontology which is very similar to our report ontology. In a related work, Tijerino et al. [90] introduced an information extraction system called TANGO which recognizes tables based on a set of heuristics, forms mini-ontologies and then merges these ontologies to form a larger application ontology.

2.4 Reverse Engineering

Without an understanding of the system, in other words without accurate documentation of the system, it is not possible to maintain, extend, and integrate the system with other systems [76, 89, 95]. The methodology to reconstruct this missing documentation is reverse engineering. In this section, we first give the definition of reverse engineering in general and then give definitions of program reverse engineering and database reverse engineering. We also state the importance of these tasks.

Reverse engineering is the process of analyzing a technology to learn how it was designed or how it works. Chikofsky and Cross [19] defined reverse engineering as the process of analyzing a subject system to identify the system's components and their interrelationships, and as the process of creating representations of the system in another form or at a higher level of abstraction. Reverse engineering is an action to understand the subject system and does not include the modification of it. The reverse of reverse


engineering is forward engineering. Forward engineering is the traditional process of moving from high-level abstractions and logical, implementation-independent designs to the physical implementation of a system [19]. While reverse engineering starts from the subject system and aims to identify the high-level abstraction of the system, forward engineering starts from the specification and aims to implement the subject system.

Program (software) reverse engineering is recovering the specifications of the software from source code [49]. The recovered specifications can be represented in forms such as data flow diagrams, flow charts, specifications, hierarchy charts, call graphs, etc. [75]. The purpose of program reverse engineering is to enhance our understanding of the software of the system in order to reengineer, restructure, maintain, extend or integrate the system [49, 75].

Database Reverse Engineering (DBRE) is defined as identifying the possible specification of a database implementation [22]. It mainly deals with schema extraction, analysis and transformation [49]. Chikofsky and Cross [19] defined DBRE as a process that aims to determine the structure, function and meaning of the data of an organization. Hainaut [41] defined DBRE as the process of recovering the schema(s) of the database of an application from the data dictionary and the program source code that uses the data. The objective of DBRE is to recover the technical and conceptual descriptions of the database. It is a prerequisite for several activities such as maintenance, reengineering, extension, migration and integration. DBRE can produce an almost complete abstract specification of an operational database, while program reverse engineering can only produce partial abstractions that can help better understand a program [22, 42].

Many data structures and constraints are embedded inside the source code of data-oriented applications. If a construct or a constraint has not been declared explicitly in the database schema, it is implemented in the source code of the application that updates or queries the database. The data in the database is a result of the execution of the applications of the organization [49]. Even though the data satisfies the constraints of the database, it is verified by the validation mechanisms inside the source code before it


is being written to the database to ensure that it does not violate the constraints. We can discover some constraints, such as referential constraints, by analyzing the application source code, even if the application program only queries the data but does not modify it. For instance, if there exists a referential constraint (foreign key relation) between the entity named E1 and the entity named E2, this constraint is used to join the data of these two entities with a query. We can discover this referential constraint by analyzing the query [50]. Since program source code is a very useful source of information in which we can discover many implicit constructs and constraints, we use it as an information source for DBRE.

It is well known that the analysis of program source code is a complex and tedious task. However, we do not need to recover the complete specification of the program for DBRE. We are looking for information to enhance the schema and to find the undeclared constraints of the database. In this process, we benefit from several program understanding techniques to extract information effectively. We provide the definitions of program understanding and its techniques in the following section.

2.5 Program Understanding Techniques

In this section, we introduce the concept of program understanding and its techniques. We have implemented these techniques to analyze application source code to extract semantic information effectively.

Program understanding (a.k.a. program comprehension) is the process of acquiring knowledge about an existing, generally undocumented, computer program. The knowledge acquired about the business processes through the analysis of the source code is accurate and up-to-date because the source code is used to generate the application that the organization uses.

Basic actions that can be taken to understand a program are to read the documentation about it, to ask for assistance from its users, to read its source code, or to run the program to see what it outputs for specific inputs [50]. Besides these actions, there


are several techniques that we can apply to understand a program. These techniques help the analyst to extract high-level information from low-level code to come to a better understanding of the program. These techniques are mostly performed manually. However, we apply these techniques in our semantic analyzer module to automatically extract information from data-oriented applications. We show how we apply these techniques in our semantic analyzer in Section 3.1.5. We describe the main program understanding techniques in the following subsections.

2.5.1 Textual Analysis

One simple way to analyze a program is to search for a specific string in the program source code. This searched string can be a pattern or a cliché. The program understanding technique that searches for a pattern or a cliché is called pattern matching or cliché recognition. A pattern can include wildcards and character ranges, and can be based on other defined patterns. A cliché is a commonly used programming pattern. Examples of clichés are algorithmic computations, such as list enumeration and binary search, and common data structures, such as priority queue and hash table [49, 97].

2.5.2 Syntactic Analysis

Syntactic analysis is performed by a parser that decomposes a program into expressions and statements. The result of the parser is stored in a structure called an abstract syntax tree (AST). An AST is a representation of source code that facilitates the usage of tree traversal algorithms, and it is the basis of most sophisticated program analysis tools [49].

2.5.3 Program Slicing

Program slicing is a technique to extract the statements from a program relevant to a particular computation, specific behavior or interest such as a business rule [75]. The slice of a program with respect to program point p and variable V consists of all statements and predicates of the program that might affect the value of V at point p [96]. Program slicing is used to reduce the scope of program analysis [49, 83]. The slice that affects the


value of V at point p is computed by gathering statements and control predicates by way of a backward traversal of the program, starting at the point p. This kind of slice is also known as backward slicing. When we retrieve statements that can potentially be affected by the variable V starting from a point p, we call it forward slicing. Forward and backward slicing are both types of static slicing because they use only statically available information (source code) for computing.

2.5.4 Program Representation Techniques

Program source code, even reduced through program slicing, is often too difficult to understand because the program can be huge, poorly structured, and based on poor naming conventions. It is useful to represent the program in different abstract views such as the call graph, data flow graph, etc. [49]. Most program reverse engineering tools provide these kinds of visualization facilities. In the following sections, we present several program representation techniques.

2.5.5 Call Graph Analysis

Call graph analysis is the analysis of the execution order of the program units or statements. If it determines the order of the statements within a program, it is called intra-procedural analysis. If it determines the calling relationship among the program units, it is called inter-procedural analysis [49, 83].

2.5.6 Data Flow Analysis

Data flow analysis is the analysis of the flow of values from variables to variables between the instructions of a program. The variables defined and the variables referenced by each instruction, such as declaration, assignment and conditional, are analyzed to compute the data flow [49, 83].

2.5.7 Variable Dependency Graph

A variable dependency graph is a type of data flow graph where a node represents a variable and an arc represents a relation (assignment, comparison, etc.) between two variables. If there is a path from variable v1 to variable v2 in the graph, then there is a


sequence of statements such that the value of v1 is in relation with the value of v2. If the relation is an assignment statement, then the arc in the diagram is directed. If the relation is a comparison statement, then the arc is not directed [49, 83].

2.5.8 System Dependence Graph

A system dependence graph is a type of data flow graph that also handles procedures and procedure calls. A system dependence graph represents the passing of values between procedures. When procedure P calls procedure Q, values of parameters are transferred from P to Q, and when Q returns, the return value is transferred back to P [49].

2.5.9 Dynamic Analysis

The program understanding techniques described so far are performed on the source code of the program and are static analyses. Dynamic analysis is the process of gaining increased understanding of a program by systematically executing it [83].

2.6 Visitor Design Patterns

We applied the above program understanding techniques in our semantic analyzer program. We implemented our semantic analyzer by using visitor patterns. In this section, we explain what a visitor pattern is and the rationale for using it.

A Visitor Design Pattern is a behavioral design pattern [38], which is used to encapsulate the functionality that we desire to perform on the elements of a data structure. It gives the flexibility to change the operation being performed on a structure without the need to change the classes of the elements on which the operation is performed. Our goal is to build semantic information extraction techniques that can be applied to any source code and can be extended with new algorithms. The visitor design pattern is the key object-oriented technique to reach this goal. New operations over the object structure can be defined simply by adding a new visitor. Visitor classes localize related behavior in the same visitor, and unrelated sets of behavior are partitioned in their own visitor subclasses. If the classes defining the object structure, in our case the grammar production rules of the programming language, rarely change, but


new operations over the structure are often defined, a visitor design pattern is the perfect choice [13, 71].

2.7 Ontology

An ontology represents a common vocabulary describing the concepts and relationships for researchers who need to share information in a domain [40, 69]. It includes machine-interpretable definitions of basic concepts in the domain and relations among them. Ontologies enable the definition and sharing of domain-specific vocabularies. They are developed to share a common understanding of the structure of information among people or software agents, to enable reuse of domain knowledge, and to analyze domain knowledge [69].

According to a commonly quoted definition, an ontology is a formal, explicit specification of a shared conceptualization [40]. For a better understanding, Michael Uschold et al. define the terms in this definition as follows [92]: A conceptualization is an abstract model of how people think about things in the world. An explicit specification means the concepts and relations in the abstract model are given explicit names and definitions. Formal means that the meaning specification is encoded in a language whose formal properties are well understood. Shared means that the main purpose of an ontology is generally to be used and reused across different applications.

2.8 Web Ontology Language (OWL)

The Web Ontology Language (OWL) is a semantic markup language for publishing and sharing ontologies on the World Wide Web [64]. OWL is derived from the DAML+OIL Web Ontology Language. DAML+OIL was developed as a joint effort of researchers who initially developed DAML (DARPA Agent Markup Language) and OIL (Ontology Inference Layer or Ontology Interchange Language) separately.

OWL is designed for processing and reasoning about information by computers instead of just presenting it on the Web. OWL supports more machine interpretability than XML (Extensible Markup Language), RDF (the Resource Description Framework),


and RDF-S (RDF Schema) by providing additional vocabulary along with a formal semantics.

Formal semantics allows us to reason about the knowledge. We may reason about class membership, equivalence of classes, and consistency of the ontology for unintended relationships between classes, and classify the instances in classes. RDF and RDF-S can be used to represent ontological knowledge. However, it is not possible to use all reasoning mechanisms with RDF and RDF-S because of some missing features such as disjointness of classes, boolean combinations of classes, cardinality restrictions, etc. [4]. When all these features are added to RDF and RDF-S to form an ontology language, the language becomes very expressive. However, it becomes inefficient to reason with. For this reason, OWL comes in three different flavors: OWL Lite, OWL DL, and OWL Full.

The entire language is called OWL Full, and uses all the OWL language primitives. It also allows combining these primitives in arbitrary ways with RDF and RDF-S. Despite its expressiveness, OWL Full's computations can be undecidable. OWL DL (OWL Description Logic) is a sublanguage of OWL Full. It includes all OWL language constructs but restricts the ways in which the constructors from OWL and RDF can be used. This makes the computations in OWL DL complete (all conclusions are guaranteed to be computable) and decidable (all computations will finish in finite time). Therefore, OWL DL supports efficient reasoning. OWL Lite limits OWL DL to a subset of constructors (for example, OWL Lite excludes enumerated classes, disjointness statements and arbitrary cardinality), making it less expressive. However, it may be a good choice for hierarchies needing simple constraints [4, 64].

OWL provides an infrastructure that allows a machine to make the same sorts of simple inferences that human beings do. A set of OWL statements by itself (and the OWL spec) can allow you to conclude another OWL statement, whereas a set of XML statements by itself (and the XML spec) does not allow you to conclude any other XML statements. Given the statements (motherOf subProperty parentOf) and (Nedret


motherOf Oguzhan) stated in OWL, you can conclude (Nedret parentOf Oguzhan) based on the logical definition of subProperty as given in the OWL spec. Another advantage of using OWL ontologies is the availability of tools such as Racer, FaCT and Pellet that can reason about them. A reasoner can also help us to understand if we could accurately extract data and description elements from the report. For instance, we can define a rule such as `No data or description elements can overlap' and check the OWL ontology with a reasoner to make sure this rule is satisfied.

2.9 WordNet

WordNet is an online database which aims to model the lexical knowledge of a native speaker of English.¹ It is designed to be used by computer programs. WordNet links nouns, verbs, adjectives, and adverbs to sets of synonyms [66]. A set of synonyms represents the same concept and is known as a synset in WordNet terminology. For example, the concept of a `child' may be represented by the set of words: `kid', `youngster', `tiddler', `tike'. A synset also has a short definition or description of the real-world concept, known as a `gloss', and has semantic pointers that describe relationships between the current synset and other synsets. The semantic pointers can be of a number of different types, including hyponym/hypernym (is-a / has-a), meronym/holonym (part-of / has-part), etc. A list of semantic pointers is given in Table 2-1.² WordNet can also be seen as a large graph or semantic network. Each node of the graph represents a synset and each edge of the graph represents a relation between synsets. Many of the approaches for measuring similarity of words use the graph structure of WordNet [15, 72, 79, 80].

Since the development of WordNet for English by the researchers of Princeton University, many WordNets for other languages have been developed, such as Danish (Dannet), Persian (PersiaNet), Italian (ItalWordnet), etc. There has also been research to

¹ WordNet 2.1 defines 155,327 words of English.
² Table is adapted from [72].


Table 2-1. List of relations used to connect senses in WordNet.

Relation        Description                    Example
Hypernym        is a generalization of         furniture is a hypernym of chair
Hyponym         is a kind of                   chair is a hyponym of furniture
Troponym        is a way to                    amble is a troponym of walk
Meronym         is part/substance/member of    wheel is a (part) meronym of a bicycle
Holonym         contains part                  bicycle is a holonym of a wheel
Antonym         opposite of                    ascend is an antonym of descend
Attribute       attribute of                   heavy is an attribute of weight
Entailment      entails                        ploughing entails digging
Cause           cause to                       to offend causes to resent
Also see        related verb                   to lodge is related to reside
Similar to      similar to                     dead is similar to assassinated
Participle of   is participle of               stored (adj) is the participle of to store
Pertainym of    pertains to                    radial pertains to radius

align WordNets of different languages. For instance, EuroWordNet [93] is a multilingual lexical knowledge base that links WordNets of different languages (e.g., Dutch, Italian, Spanish, German, French, Czech and Estonian). In EuroWordNet, the WordNets are linked to an Inter-Lingual-Index which interconnects the languages so that we can go from the synsets in one language to corresponding synsets in other languages.

While WordNet is a database which aims to model a person's knowledge about a language, another research effort, Cyc [57] (derived from Encyclopedia), aims to model a person's everyday common sense. Cyc formalizes common sense knowledge (e.g., `You cannot remember events that have not happened yet', `You have to be awake to eat', etc.) in the form of a massive database of axioms.

2.10 Similarity

Similarity is an important subject in many fields such as philosophy, psychology, and artificial intelligence. Measures of similarity or relatedness are used in various applications such as word sense disambiguation, text summarization and annotation, information extraction and retrieval, automatic correction of word errors in text, and text classification [15, 21]. Understanding how humans assess similarity is important for solving many of the problems of cognitive science such as problem solving, categorization, memory retrieval, inductive reasoning, etc. [39].


Similarity of two concepts refers to how many features they have in common and how many features distinguish them. Lin [60] provides an information-theoretic definition of similarity by clarifying the intuitions and assumptions about it. According to Lin, the similarity between A and B is related to their commonality and their difference. Lin assumes that the commonality between A and B can be measured according to the information they contain in common, I(common(A, B)). In information theory, the information contained in a statement is measured by the negative logarithm of the probability of the statement:

\[ I(\mathrm{common}(A,B)) = -\log P(A \cap B) \]

Lin also assumes that if we know the description of A and B, we can measure the difference by subtracting the commonality of A and B from the description of A and B. Hence, Lin states that the similarity between A and B, sim(A, B), is a function of their commonalities and descriptions. That is,

\[ \mathrm{sim}(A,B) = f\bigl(I(\mathrm{common}(A,B)),\; I(\mathrm{description}(A,B))\bigr) \]

We also come across the term `semantic relatedness' while dealing with similarity. Semantic relatedness is a more general concept than similarity and refers to the degree to which two concepts are related [72]. Similarity is one aspect of semantic relatedness. Two concepts are similar if they are related in terms of their likeness (e.g., child and kid). However, two concepts can be related in terms of functionality or frequent association even though they are not similar (e.g., instructor and student, Christmas and gift).

2.11 Semantic Similarity Measures of Words

In this section, we provide a review of semantic similarity measures of words in the literature. This review is not meant to be a complete list of the similarity measures but covers most of the outstanding ones in the literature. Most of the measures below use the hierarchical structure of WordNet.

2.11.1 Resnik Similarity Measure

Resnik [79] provided a similarity measure based on the is-a hierarchy of WordNet and the statistical information gathered from a large corpus of text. Resnik used the statistical information from the large corpus to measure the information content.


According to information theory, the information content of a concept c can be quantified as -log P(c), where P(c) is the probability of encountering concept c. This formula tells us that as probability increases, informativeness decreases; so the more abstract a concept, the lower its information content. In order to calculate the probability of a concept, Resnik first computed the frequency of occurrence of every concept in a large corpus of text. Every occurrence of a concept in the corpus adds to the frequency of the concept and to the frequency of every concept subsuming the concept encountered. Based on this computation, the formula for the information content is:

\[ P(c) = \frac{\mathrm{freq}(c)}{\mathrm{freq}(r)} \]
\[ \mathrm{ic}(c) = -\log P(c) = -\log\frac{\mathrm{freq}(c)}{\mathrm{freq}(r)} \]

where r is the root node of the taxonomy and c is the concept.

According to Resnik, the more information two concepts have in common, the more similar they are. The information shared by two concepts is indicated by the information content of the concepts that subsume them in the taxonomy. The formula of the Resnik similarity measure is:

\[ \mathrm{sim}_{RES}(c_1, c_2) = \max_{c}\,[-\log P(c)] \]

where c is a concept that subsumes both c1 and c2.

One of the drawbacks of the Resnik measure is that it depends completely upon the information content of the concept that subsumes the two concepts whose similarity we measure. It does not take the two concepts themselves into account. For this reason, different pairs of concepts that have the same subsumer have the same similarity value.
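The Resnik computation described above can be illustrated with a minimal sketch. The taxonomy, the concept names and the corpus counts below are hypothetical toy values chosen for illustration only; they are not taken from WordNet or from any corpus used in this work, and the function names are likewise assumptions.

```python
import math

# Hypothetical toy is-a taxonomy (child -> parent); not taken from WordNet.
PARENT = {
    "entity": None,
    "vehicle": "entity",
    "furniture": "entity",
    "car": "vehicle",
    "bicycle": "vehicle",
    "chair": "furniture",
}

# Hypothetical counts of direct occurrences of each concept in a corpus.
DIRECT_COUNT = {"entity": 0, "vehicle": 2, "furniture": 1,
                "car": 10, "bicycle": 4, "chair": 3}


def subsumers(concept):
    """Return the concept followed by every concept subsuming it, up to the root."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain


def cumulative_freq():
    """Each occurrence counts for the concept and for every concept subsuming it."""
    freq = {c: 0 for c in PARENT}
    for concept, count in DIRECT_COUNT.items():
        for node in subsumers(concept):
            freq[node] += count
    return freq


FREQ = cumulative_freq()
ROOT_FREQ = FREQ["entity"]


def ic(concept):
    """Information content: ic(c) = -log(freq(c) / freq(r))."""
    return -math.log(FREQ[concept] / ROOT_FREQ)


def sim_res(c1, c2):
    """Resnik similarity: the largest ic among the concepts subsuming both c1 and c2."""
    common = set(subsumers(c1)) & set(subsumers(c2))
    return max(ic(c) for c in common)
```

With these toy counts, `sim_res("car", "bicycle")` and `sim_res("car", "vehicle")` return the same value because both pairs share the subsumer `vehicle`, which is the drawback noted above.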


2.11.2 Jiang-Conrath Similarity Measure

Jiang and Conrath [52] address the limitations of the Resnik measure. Their measure uses the information content of the two concepts, along with the information content of their lowest common subsumer, to compute the similarity of two concepts. The measure is a distance measure that specifies the extent of unrelatedness of two concepts. The formula of the Jiang and Conrath measure is:

$$distance_{JCN}(c_1, c_2) = ic(c_1) + ic(c_2) - 2 \cdot ic(LCS(c_1, c_2))$$

where ic determines the information content of a concept, and LCS determines the lowest common subsuming concept of two given concepts. However, this measure works only with WordNet nouns.

2.11.3 Lin Similarity Measure

Lin [60] introduced a similarity measure between concepts based on his theory of similarity between arbitrary objects. To measure the similarity, Lin uses the information content of the two concepts that are being compared and the information content of their lowest common subsumer. The formula of the Lin measure is:

$$sim_{LIN}(c_1, c_2) = \frac{2 \cdot \log P(c_0)}{\log P(c_1) + \log P(c_2)}$$

where c0 is the lowest common concept that subsumes both c1 and c2.

2.11.4 Intrinsic IC Measure in WordNet

Seco et al. [85] advocate that WordNet can also be used as a statistical resource, with no need for external corpora to compute the information content of a concept.


They assume that the taxonomic structure of WordNet is organized in a meaningful and principled way, where concepts with many hyponyms³ convey less information than concepts that are leaves. They provide the formula for information content as follows:

$$ic_{WN}(c) = \frac{\log\left(\frac{hypo(c) + 1}{max_{wn}}\right)}{\log\left(\frac{1}{max_{wn}}\right)} = 1 - \frac{\log(hypo(c) + 1)}{\log(max_{wn})}$$

In this formula, the function hypo returns the number of hyponyms of a given concept and max_wn is the maximum number of concepts that exist in the taxonomy.

2.11.5 Leacock-Chodorow Similarity Measure

Rada et al. [77] were the first to measure semantic relatedness based on the length of the path between two concepts in a taxonomy. Rada et al. measured the semantic relatedness of medical terms, using a medical taxonomy called MeSH. According to this measurement, given a tree-like structure of a taxonomy, the number of links between two concepts is counted, and the concepts are considered more related if the path between them is shorter.

Leacock and Chodorow [56] applied this approach to measure the semantic relatedness of two concepts using WordNet. The measure counts the shortest path between two concepts in the taxonomy and scales it by the depth of the taxonomy:

$$related_{LCH}(c_1, c_2) = -\log\left(\frac{shortestpath(c_1, c_2)}{2D}\right)$$

In the formula, c1 and c2 represent the two concepts and D is the maximum depth of the taxonomy.⁴

³ hyponym: a word that is more specific than a given word.
⁴ For WordNet 1.7.1, the value of D is 19.

One weakness of the measure is that it assumes the size or weight of every link to be equal. However, lower down in the hierarchy, concept pairs that are a single link away are more related


than such pairs higher up in the hierarchy. Another limitation of the measure is that it limits its attention to is-a links and considers only noun hierarchies.

2.11.6 Hirst-St. Onge Similarity Measure

Hirst and St. Onge's [51] measure of semantic relatedness is based on the idea that two concepts are semantically close if their WordNet synsets are connected by a path that is not too long and that does not change direction too often [15, 72].

The Hirst-St. Onge measure considers all the relations defined in WordNet. All links in WordNet are classified as Upward (e.g., part-of), Downward (e.g., subclass) or Horizontal (e.g., opposite-meaning). They also describe three types of relations between words: extra-strong, strong and medium-strong.

The strength of the relationship is given by:

$$rel_{HS}(c_1, c_2) = C - pathlength - k \cdot d$$

where d is the number of changes of direction in the path, and C and k are constants; if no such path exists, the strength of the relationship is zero and the concepts are considered unrelated.

2.11.7 Wu and Palmer Similarity Measure

Wu and Palmer [98] measure similarity in terms of the depth of the two concepts in the WordNet taxonomy and the depth of their lowest common subsumer (LCS):

$$sim_{WUP}(c_1, c_2) = \frac{2 \cdot depth(LCS)}{depth(c_1) + depth(c_2)}$$

2.11.8 Lesk Similarity Measure

Lesk [58] defines relatedness as a function of the overlap between the dictionary definitions of concepts. He describes an algorithm that disambiguates words based on the extent to which their dictionary definitions overlap with those of words in the context. The sense of the target word with the maximum overlap is selected as the assigned sense of the word.
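To make the taxonomy-based formulas of Sections 2.11.2 through 2.11.7 more concrete, the following Python sketch evaluates them on a small hypothetical taxonomy. Several choices here are assumptions of the sketch rather than part of the cited measures: the taxonomy itself, counting depth and path length in nodes (so the root has depth 1 and a concept is at path length 1 from itself), and plugging the intrinsic information content of Section 2.11.4 into the Jiang-Conrath and Lin formulas in place of corpus-based information content.

```python
import math

# Hypothetical toy is-a taxonomy (child -> parent); not taken from WordNet.
PARENT = {
    "entity": None,
    "animal": "entity",
    "artifact": "entity",
    "dog": "animal",
    "cat": "animal",
    "car": "artifact",
}


def path_to_root(concept):
    """Return the concept followed by every concept that subsumes it."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain


def depth(concept):
    """Depth counted in nodes, so the root has depth 1 (an assumption)."""
    return len(path_to_root(concept))


def lcs(c1, c2):
    """Lowest common subsumer: first ancestor of c1 that also subsumes c2."""
    ancestors_of_c2 = set(path_to_root(c2))
    for node in path_to_root(c1):
        if node in ancestors_of_c2:
            return node
    raise ValueError("concepts do not share a root")


def hypo(concept):
    """Number of hyponyms (descendants) of a concept."""
    return sum(1 for other in PARENT
               if other != concept and concept in path_to_root(other))


def ic_wn(concept):
    """Intrinsic IC: 1 - log(hypo(c) + 1) / log(max_wn)."""
    return 1.0 - math.log(hypo(concept) + 1) / math.log(len(PARENT))


def distance_jcn(c1, c2):
    """Jiang-Conrath distance: ic(c1) + ic(c2) - 2 * ic(LCS(c1, c2))."""
    return ic_wn(c1) + ic_wn(c2) - 2.0 * ic_wn(lcs(c1, c2))


def sim_lin(c1, c2):
    """Lin similarity, using ic = -log P; undefined if both concepts are the root."""
    return 2.0 * ic_wn(lcs(c1, c2)) / (ic_wn(c1) + ic_wn(c2))


MAX_DEPTH = max(depth(c) for c in PARENT)


def related_lch(c1, c2):
    """Leacock-Chodorow: -log(shortestpath / (2 * D)), path counted in nodes."""
    nodes_on_path = depth(c1) + depth(c2) - 2 * depth(lcs(c1, c2)) + 1
    return -math.log(nodes_on_path / (2.0 * MAX_DEPTH))


def sim_wup(c1, c2):
    """Wu-Palmer: 2 * depth(LCS) / (depth(c1) + depth(c2))."""
    return 2.0 * depth(lcs(c1, c2)) / (depth(c1) + depth(c2))
```

On this toy taxonomy every measure ranks the pair (dog, cat) as closer than the pair (dog, car), since the first pair shares the more specific subsumer `animal`.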


Table 2-2. Absolute values of the coefficients of correlation between human ratings of similarity and the five computational measures.

Measure                 Miller & Charles    Rubenstein & Goodenough
Hirst and St-Onge       .744                .786
Jiang and Conrath       .850                .781
Leacock and Chodorow    .816                .838
Lin                     .829                .819
Resnik                  .774                .779

2.11.9 Extended Gloss Overlaps Similarity Measure

Banerjee and Pedersen [9, 72] provided a measure by adapting Lesk's measure to WordNet. Their measure is called 'the extended gloss overlaps measure' and takes into account not only the two concepts that are being measured but also the concepts related to them through WordNet relations. An extended gloss of a concept c1 is prepared by adding the glosses of the concepts that are related to c1 through a WordNet relation r. The measurement for two concepts c1 and c2 is calculated from the overlaps of the extended glosses of the two concepts.

2.12 Evaluation of WordNet-Based Similarity Measures

Budanitsky and Hirst [16] evaluated five different metrics using WordNet and listed the coefficients of correlation between the metrics and human ratings according to the experiments conducted by Miller & Charles [65] and Rubenstein & Goodenough [82]. We present the results of Budanitsky & Hirst's experiments in Table 2-2. According to this evaluation, the Jiang and Conrath metric [52] and the Lin metric [60] are among the best measures. As a result, we use the Jiang and Conrath as well as the Lin semantic similarity measure to assign similarity scores between text strings.

2.13 Similarity Measures for Text Data

Several approaches have been used to assess a similarity score between texts. One of the simplest methods is to assess a similarity score based on the number of lexical units that occur in both text segments. Several processes such as stemming, stop-word removal, longest subsequence matching, and weighting factors can be applied to this method


for improvement. However, these lexical matching methods are not enough to identify the semantic similarity of texts. One of the attempts to identify semantic similarity between texts is the latent semantic analysis method (LSA)⁵ [55], which aims to measure similarity between texts by including additional related words. LSA is successful to some extent but has not been used on a large scale, due to the complexity and computational cost of its algorithm.

⁵ URL of LSA:

Corley and Mihalcea [21] introduced a metric for text-to-text semantic similarity by combining word-to-word similarity metrics. To assess a similarity score for a text pair, they first create separate sets for nouns, verbs, adjectives, adverbs, and cardinals for each text. Then they determine pairs of similar words across the sets in the two text segments. For nouns and verbs, they use a semantic similarity metric based on WordNet, and for other word classes they use lexical matching techniques. Finally, they sum up the similarity scores of similar word pairs. This bag-of-words approach improves significantly over the traditional lexical matching metrics. However, as they acknowledge, a metric of text semantic similarity should take into account the relations between words in a text.

In another approach to measuring semantic similarity between documents, Aslam and Frost [6] assume that a text is composed of a set of independent term features and employ Lin's [60] metric for measuring the similarity of objects that can be described by a set of independent features. The similarity of two documents in a pile of documents can be calculated by the following formula:

$$Sim_{IT}(a, b) = \frac{2 \sum_t \min(P_{a:t}, P_{b:t}) \log P(t)}{\sum_t P_{a:t} \log P(t) + \sum_t P_{b:t} \log P(t)}$$

where the probability $P(t)$ is the fraction of corpus documents containing term $t$, $P_{b:t}$ is the fractional occurrence of term $t$ in document $b$ ($\sum_t P_{b:t} = 1$), and two documents $a$


and $b$ share $\min(P_{a:t}, P_{b:t})$ amount of term $t$ in common, while they contain $P_{a:t}$ and $P_{b:t}$ amount of term $t$ individually.

Another approach by Oleshchuk and Pedersen [70] uses ontologies as a filter before assessing similarity scores for texts. They interpret a text based on an ontology and find out how many of the terms (concepts) of the ontology exist in the text. They assign a similarity score for text t1 and text t2 after comparing the ontology o1 extracted from t1 based on the ontology O and the ontology o2 extracted from t2 based on the same ontology O. The base ontology acts as a context filter for the texts, and depending on the base ontology used, texts may or may not be similar.

2.14 Similarity Measures for Ontologies

Rodriguez and Egenhofer [81] suggested assessing semantic similarity among entity classes from different ontologies based on a matching process that uses information about common and different characteristic features of ontologies based on their specifications. The similarity score of two entities from different ontologies is the weighted sum of the similarity scores of the components of the compared entities. Similarity scores are independently measured for three components of an entity. These components are the 'set of synonyms', the 'set of semantic relations', and the 'set of distinguishing features' of the entity. They further suggest classifying the distinguishing features into 'functions', 'parts', and 'attributes', where 'functions' represent what is done to or with an instance of that entity, 'parts' are structural elements of an entity such as the leg or head of a human body, and 'attributes' are additional characteristics of an entity such as the age or hair color of a person.

Rodriguez and Egenhofer point out that if compared entities are related to the same entities, they may be semantically similar. Thus, they interpret comparing semantic


relations as comparing semantic neighborhoods of entities.⁶

⁶ The semantic neighborhood of an entity class is the set of entity classes whose distance to the entity class is less than or equal to a non-negative integer.

The formula of overall similarity between entity a of ontology p and entity b of ontology q is as follows:

$$S(a^p, b^q) = w_w \cdot S_w(a^p, b^q) + w_u \cdot S_u(a^p, b^q) + w_n \cdot S_n(a^p, b^q)$$

where $S_w$, $S_u$, and $S_n$ are the similarities between synonym sets, features, and semantic neighborhoods, and $w_w$, $w_u$, and $w_n$ are the respective weights, which add up to 1.0.

While calculating a similarity score for each component of an entity, they also take non-common characteristics into account. The similarity of a component is measured by the following formula:

$$S(a, b) = \frac{|A \cap B|}{|A \cap B| + \alpha(a, b)\,|A \setminus B| + (1 - \alpha(a, b))\,|B \setminus A|}$$

where $\alpha$ is a function that defines the relative importance of the non-common characteristics. They calculate $\alpha$ in terms of the depth of the entities in their ontologies.

Maedche and Staab [63] suggest measuring the similarity of ontologies at two levels: lexical and conceptual. At the lexical level, they use an edit distance measure to find the similarity between two sets of terms (concepts or relations) that form the ontologies. While measuring similarity at the conceptual level, they take all the super- and sub-concepts of two concepts from two different ontologies into account.

According to Ehrig et al. [31], comparing ontologies should go far beyond comparing the representation of the entities of the ontologies and should take their relation to real-world entities into account. For this, Ehrig et al. suggested a general framework for measuring similarities of ontologies which consists of four layers: the data, ontology, context, and domain knowledge layers. In the data layer, they compare data values by


using generic similarity functions such as edit distance for strings. In the ontology layer, they consider semantic relations by using the graph structure of the ontology. In the context layer, they compare the usage patterns of entities in ontology-based applications. According to Ehrig et al., if two entities are used in the same (related) context then they are similar. They also propose to integrate the domain knowledge layer into any of the three layers as needed. Finally, they arrive at an overall similarity function which incorporates all layers of similarity.

Euzenat and Valtchev [34, 35] proposed a similarity measure for OWL-Lite ontologies. Before measuring similarity, they first transform an OWL-Lite ontology into an OL-graph structure. Then, they define similarity between nodes of the OL-graphs depending on the category and the features (e.g., relations) of the nodes. They combine the similarities of features using a weighted sum approach.

A similar work by Bach and Dieng-Kuntz [8] proposes a measure for comparing OWL-DL ontologies. Unlike Euzenat and Valtchev's work, Bach and Dieng-Kuntz adjust the manually assigned feature weights of an OWL-DL entity dynamically in case they do not exist in the definition of the entity.

2.15 Evaluation Methods for Similarity Measures

There are three kinds of approaches for evaluating similarity measures [15]. These are evaluation by theoretical examination (e.g., Lin [60]), evaluation by comparison with human judgments, and evaluation by calculating the performance within a particular application.

Evaluation by comparison with human judgments has been used by many researchers such as Resnik [79], and Jiang and Conrath [52]. Most researchers refer to the same experiment on human judgment to evaluate their performance, due to the expense and difficulty of arranging such an experiment. This experiment was conducted by Rubenstein and Goodenough [82], and a later replication of it was done by Miller and Charles [65]. Rubenstein and Goodenough had human subjects assign degrees of synonymy, on a scale from 0 to 4, to 65 pairs of carefully chosen words. Miller and Charles


repeated the experiment on a subset of 30 word pairs of the 65 pairs used by Rubenstein and Goodenough. Rubenstein and Goodenough used 15 subjects for scoring the word pairs, and the average of these scores was reported. Miller and Charles used 38 subjects in their experiments.

Rodriguez and Egenhofer also used human judgments to evaluate the quality of their similarity measure for comparing different ontologies [81]. They used the Spatial Data Transfer Standard (SDTS) ontology, the WordNet ontology, the WS ontology (created from the combination of WordNet and SDTS) and subsets of these ontologies. They conducted two experiments. In the first experiment, they compared different combinations of ontologies to obtain a diverse grade of similarity between ontologies. These combinations include identical ontologies (WordNet to WordNet), ontology and sub-ontology (WordNet to WordNet's subset), overlapping ontologies (WordNet to WS), and different ontologies (WordNet to SDTS). In the second experiment, they asked human subjects to rank the similarity of an entity to other selected entities based on the definitions in the WS ontology. Then, they compared the average of the human rankings with the rankings based on their similarity measure using different combinations of ontologies.

Evaluation by calculating the performance within a particular application is another approach for the evaluation of similarity metrics. Budanitsky and Hirst [15] used this approach to evaluate the performance of their metric within an NLP application, the detection of malapropisms.⁷ Patwardhan [72] also used this approach to evaluate his metric within the word sense disambiguation⁸ application.

⁷ Malapropism: the unintentional misuse of a word by confusion with one that sounds similar.
⁸ Word sense disambiguation: the problem of selecting the most appropriate meaning or sense of a word, based on the context in which it occurs.
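Evaluation against human judgments, as described above, typically reduces to computing a correlation coefficient between the averaged human ratings and the scores a measure assigns to the same word pairs. The sketch below computes the Pearson coefficient in plain Python. The two score lists are hypothetical illustration values, not data from Rubenstein and Goodenough or Miller and Charles, and the names `pearson`, `HUMAN`, and `MEASURE` are introduced only for this example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Hypothetical averaged human ratings (scale 0 to 4) for five word pairs.
HUMAN = [3.9, 3.5, 2.8, 1.2, 0.4]

# Hypothetical scores that some similarity measure gives to the same pairs.
MEASURE = [0.95, 0.80, 0.70, 0.30, 0.05]

correlation = pearson(HUMAN, MEASURE)
```

For a distance measure such as Jiang-Conrath the coefficient is negative when the measure agrees with human ratings, which is why a comparison across measures would report absolute values.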


2.16 Schema Matching

Schema matching produces a mapping between the elements of two schemas that correspond to each other [78]. When we match two schemas S and T, we decide whether any element or elements of S refer to the same real-world concept as any element or elements of T [28]. The match operation over two schemas produces a mapping. A mapping is a set of mapping elements. Each mapping element indicates that certain element(s) in S are mapped to certain element(s) in T. A mapping element can have a mapping expression which specifies how the schema elements are related. A mapping element can be defined as a 5-tuple (id, e, e', n, R), where id is the unique identifier, e and e' are schema elements of the matching schemas, n is the confidence measure (usually in the [0,1] range) between the schema elements e and e', and R is a relation (e.g., equivalence, mismatch, overlapping) [88].

Schema matching has many application areas, such as data integration, data warehousing, semantic query processing, agent communication, web services integration, catalog matching, and P2P databases [78, 88]. The match operation is mostly done manually. Manually generating the mapping is a tedious, time-consuming, error-prone, and expensive process. There is a need to automate the match operation. This would be possible if we could discover the semantics of schemas, make the implicit semantics explicit and represent them in a machine-processable way.

2.16.1 Schema Matching Surveys

Schema matching is a very well-researched topic in the database community. Erhard Rahm and Philip Bernstein provide an excellent survey of schema matching approaches by reviewing previous works in the context of schema translation and integration, knowledge representation, machine learning and information retrieval [78]. In their survey, they clarify terms such as match operation, mapping, mapping element, and mapping expression in the context of schema matching. They also introduce application areas of schema matching such as schema integration, data warehouses, message translation, and query processing.
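The 5-tuple definition of a mapping element given in Section 2.16 can be written down directly as a small data structure. The Python sketch below is one possible encoding; the class and field names, the strict enforcement of the [0,1] range, and the `confident_matches` helper are assumptions made for illustration and are not prescribed by the cited surveys.

```python
from dataclasses import dataclass
from enum import Enum


class Relation(Enum):
    """Relations R named in the text; the set is not meant to be exhaustive."""
    EQUIVALENCE = "equivalence"
    MISMATCH = "mismatch"
    OVERLAPPING = "overlapping"


@dataclass(frozen=True)
class MappingElement:
    """A mapping element (id, e, e', n, R) between two schema elements."""
    id: str
    e: str            # element of schema S
    e_prime: str      # element of schema T
    n: float          # confidence measure
    r: Relation

    def __post_init__(self):
        # Assumption of this sketch: confidence is always kept within [0, 1].
        if not 0.0 <= self.n <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")


def confident_matches(mapping, threshold):
    """Keep equivalence mapping elements whose confidence reaches the threshold."""
    return [m for m in mapping
            if m.r is Relation.EQUIVALENCE and m.n >= threshold]


# A mapping is a set of mapping elements; the element names are hypothetical.
mapping = [
    MappingElement("m1", "S.CourseInst.Name", "T.Instructor.FullName",
                   0.9, Relation.EQUIVALENCE),
    MappingElement("m2", "S.CourseInst.Code", "T.Instructor.Office",
                   0.2, Relation.MISMATCH),
]
```

A call such as `confident_matches(mapping, 0.5)` would then return only the first element.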


The most significant contribution of their survey is the classification of schema matching approaches, which helps in understanding the schema matching problem. They consider a wide range of classification criteria such as instance-level vs. schema-level, element vs. structure, linguistic-based vs. constraint-based, matching cardinality, using auxiliary data (e.g., dictionaries, previous mappings, etc.), and combining different matchers (e.g., hybrid, composite). However, it is very rare that an approach falls under only one leaf of the classification tree presented in that survey. A schema matching approach needs to exploit all the possible inputs to achieve the best possible result, and needs to combine matchers either in a hybrid way or in a composite way. For this reason, most approaches use more than one technique and fall under more than one leaf of the classification tree. For example, our approach uses auxiliary data (i.e., application source code) and also applies linguistic similarity techniques (e.g., name and description) and constraint-based techniques (e.g., type of the related schema element) to the data.

A recent survey by AnHai Doan and Alon Halevy [28] classifies matching techniques under two main groups: rule-based and learning-based solutions. Our approach falls under the rule-based group, which is relatively inexpensive and does not require training. AnHai Doan and Alon Halevy also describe challenges of schema matching. They point out that since data sources become legacy (poorly documented), schema elements are typically matched based on schema and data. However, the clues gathered by processing the schema and data are often unreliable, incomplete and not sufficient to determine the relationships among schema elements. Our approach aims to overcome this fundamental challenge by analyzing reports for more reliable, complete and sufficient clues.

AnHai Doan and Alon Halevy also state that schema matching becomes more challenging because matching approaches must consider all the possible matching combinations between schemas to make sure there is no better mapping. Considering all the possible combinations increases the cost of the matching process. Our approach


helps us overcome this challenge by focusing on a subset of schema elements that are used on a report pair.

Another challenge that AnHai Doan and Alon Halevy state is the subjectivity of the matching. This means the mapping depends on the application and may change in different applications even though the underlying schemas are the same. By analyzing report-generating application source code, we believe we produce more objective results. AnHai Doan and Alon Halevy's survey also adds two more application areas of schema matching to the application areas mentioned in Rahm and Bernstein's survey. These application areas are peer data management and model management.

A more recent survey by Pavel Shvaiko and Jérôme Euzenat [88] points out new application areas of schema matching such as agent communication, web service integration and catalog matching. In their survey, Pavel Shvaiko and Jérôme Euzenat consider only schema-based approaches, not instance-based approaches, and provide a new classification by building on the previous work of Erhard Rahm and Philip Bernstein. They interpret the classification of Erhard Rahm and Philip Bernstein and provide two new classification trees, based on granularity and kinds of input, with nodes added to the original classification tree of Erhard Rahm and Philip Bernstein. Finally, Hong-Hai Do summarizes recent advances in the field in his dissertation [25].

2.16.2 Evaluations of Schema Matching Approaches

Schema matching approaches evaluate their systems using a variety of methodologies, metrics and data, which are usually not publicly available. This makes it hard to compare these approaches. However, there have been works to benchmark the effectiveness of a set of schema matching approaches [26, 99].

Hong-Hai Do et al. [26] specify four comparison criteria. These criteria are kind of input (e.g., schema information, data instances, dictionaries, and mapping rules), match results (e.g., matching between schema elements, nodes or paths), quality measures (metrics such as recall, precision and f-measure) and effort (e.g., pre- and post-match


efforts for training of learners, dictionary preparation and correction). Mikalai Yatskevich in his work [99] compares the approaches based on the criteria stated in [26] and adds time measures as the fifth criterion.

Hong-Hai Do et al. only use the information available in the publications describing the approaches and their evaluation. In contrast, Mikalai Yatskevich provides real-time evaluations of matching prototypes, rather than reviewing the results presented in the papers. Mikalai Yatskevich compares only three approaches (COMA [24], Cupid [62] and Similarity Flooding (SF) [86]) and concludes that COMA performs best on large schemas and Cupid is best for small schemas. Hong-Hai Do et al. provide a broader comparison by reviewing six approaches (Automatch [10], COMA [24], Cupid [62], LSD [27], Similarity Flooding (SF) [86], SemInt).

2.16.3 Examples of Schema Matching Approaches

In the rest of this section, we review some of the significant approaches for schema matching and describe their similarities to and differences from our approach. We review the LSD, Corpus-based, COMA and Cupid approaches below.

The LSD (Learning Source Descriptions) approach [27] uses machine-learning techniques to match data sources to a global schema. The idea of LSD is that after a training period of manually determining mappings between data sources and the global schema, the system should learn from previous mappings and successfully propose mappings for new data sources. The LSD system is a composite matcher, meaning that it combines the results of several independently executed matchers. LSD consists of several learners (matchers). Each learner can exploit different types of characteristics of the input data such as name similarities, format, and frequencies. The predictions of the different learners are then combined. The LSD system is extensible since it has independently working learners (matchers). When new learners are developed, they can be added to the system to enhance the accuracy. The extensibility of the LSD system is similar to the extensibility of our system because we can also add new visitor patterns to our system to


extract more information to enhance the accuracy. The LSD approach is similar to our approach in that it also comes to a final decision by combining several results coming from different learners. We also combine several results that come from matching the ontologies of report pairs to reach a final decision. The LSD approach is a learner-based solution and requires training, which makes it relatively expensive because of the initial manual effort. In contrast, our approach needs no initial effort other than collecting the relevant report-generating source code.

One of the distinguished approaches that uses external evidence is the Corpus-based Schema Matching approach [43, 61]. Our approach is similar to Corpus-based Schema Matching in the sense that we also utilize external data rather than solely depending on the matching schemas and their data. The Corpus-based schema matching approach constructs a knowledge base by gathering relevant knowledge from a large corpus of database schemas and previously validated mappings. This approach identifies interesting concepts and patterns in a corpus of schemas and uses this information to match two unseen schemas. However, learning from the corpus and extracting patterns is a challenging task. This approach also requires initial effort to create a corpus of interest and then requires tuning effort to eliminate useless schemas and to add useful schemas.

The COMA (COmbination of MAtching algorithms) approach [24] is a composite schema matching approach. It develops a platform to combine multiple matchers in a flexible way. It provides an extensible library of matching algorithms and a framework to combine the obtained results. The COMA approach has been superior to other systems in the evaluations [26, 99]. The COMA++ [7] approach improves the COMA approach by supporting schemas and ontologies written in different languages (i.e., SQL, W3C XSD and OWL) and by bringing new match strategies such as fragment-based matching and reuse-oriented matching. The fragment-based approach follows the divide-and-conquer idea: it decomposes a large schema into smaller subsets, aiming to achieve better match quality and execution time with the reduced problem size, and then merges the results of


matching fragments into a global match result. Our approach also considers matching small subsets of a schema that are covered by reports and then merging these match results into a global match result, as described in Chapter 3.

The Cupid approach [62] combines linguistic and structural matchers in a hybrid way. It is both element-based and structure-based. It also uses dictionaries as auxiliary data. It aims to provide a generic solution across data models and uses XML and relational examples. The structural matcher of Cupid transforms the input into a tree structure and assesses a similarity value for a node based on the node's linguistic similarity value and the similarity values of its leaves.

2.17 Ontology Mapping

Ontology mapping is determining which concepts and properties of two ontologies represent similar notions [68]. There are several other terms that are relevant to ontology mapping and are sometimes used interchangeably with the term mapping. These are alignment, merging, articulation, fusion, and integration [54]. The result of ontology mapping is used in application domains similar to those of schema matching, such as data transformation, query answering, and web services integration [68].

2.18 Schema Matching vs. Ontology Mapping

Schema matching and ontology mapping are similar problems [29]. However, ontology mapping generally aims to match richer structures. Generally, ontologies have more constraints on their concepts and have more relations among these concepts. Another difference is that a schema often does not provide explicit semantics for its data, while an ontology is a system that itself contains semantics either intuitively or formally [88]. The database community deals with the schema matching problem and the AI community deals with the ontology mapping problem. We can perhaps fill the gap between these similar yet separately studied subjects.


CHAPTER 3
APPROACH

In Chapter 1, we stated the need for rapid, flexible, limited-time collaborations among organizations. We also underlined that organizations need to integrate their information sources to exchange data in order to collaborate effectively. However, integrating information sources is currently a labor-intensive activity because of non-existing or outdated machine-processable documentation of the data source. We defined legacy systems as information systems with poor or nonexistent documentation in Section 2.1. Integrating legacy systems is tedious, time-consuming and expensive because the process is mostly manual. To automate the process, we need to develop methodologies to automatically discover semantics from electronically available information sources of the underlying legacy systems.

In this chapter, we state our approach for extracting semantics from legacy systems and for using these semantics for the schema matching process of information source integration. We developed our approach in the context of the SEEK (Scalable Extraction of Enterprise Knowledge) project. As we show in Figure 3-1, the Semantic Analyzer (SA) takes as input the output of the Schema Extractor (SE), which is the schema of the data source, together with the application source code or report templates. After the semantic analysis process, SA stores its output, the extracted semantic information, in a repository which we call the knowledge base of the organization. Then, the Schema Matcher (SM) uses this knowledge base as an input and produces mapping rules as an output. Finally, these mapping rules are an input to the Wrapper Generator (WG), which produces source wrappers. In Section 3.1, we first state our approach for semantic extraction using SA. Then, in Section 3.2, we show how we utilize the semantics discovered by SA in the subsequent schema matching phase. The schema matching phase is followed by the wrapper generation phase, which is not described in this dissertation.


Figure 3-1. Scalable Extraction of Enterprise Knowledge (SEEK) Architecture.

3.1 Semantic Analysis

Our approach to semantic analysis is based on the observation that application source code can be a rich source of semantic information about the data source it is accessing. Specifically, semantic knowledge extracted from application source code frequently contains information about the domain-specific meanings of the data or the underlying schema elements. For example, application code usually has embedded queries, and the data retrieved or manipulated by queries is stored in variables and displayed to the end user in output statements. Many of these output


statements contain additional semantic information, usually in the form of descriptive text or markup [36, 84, 87]. These output statements become semantically valuable when they are used to communicate with the end user in a formatted way. One way of communicating with the end user is producing reports. Reports and other user-oriented output, which are typically generated by report generators or application source code, do not use the names of schema elements directly but rather provide more descriptive names for the data to make the output more comprehensible to the users. We claim that these descriptive names, together with their formatting instructions, can be extracted from the application code generating the report and can be related to the underlying schema elements in the data source. We can trace the variables used in output statements throughout the application code and relate the output to the query that retrieves data from the data source, and indirectly to the schema elements. These descriptive texts and formatting instructions are valuable information that helps discover the semantics of the schema elements. In the next subsection, we explain this idea using illustrative examples.

3.1.1 Illustrative Examples

In this section, we illustrate our idea of semantic extraction on two simple examples. On the left-hand side of Figure 3-2, we see a relation and its attributes from a relational database schema. By looking at the names of the relation and its attributes, it is hard to understand what kind of information this relation and its attributes store. For example, this relation can be used for storing information about 'courses' or 'instructors'. The attribute Name can hold information about 'course names' or 'instructor names'. Without any further knowledge of the schema, we would probably not be able to understand the full meaning of these schema items in the relation 'CourseInst'. However, we can gather information about the semantics of these schema items by analyzing the application source code that uses these schema items.


Figure 3-2. Schema used by an application.

Let us assume we have access to the application source code that outputs the search screen shown on the right-hand side of Figure 3-2. Upon investigation of the code, the semantic analyzer (SA) encounters output statements of the form 'Instructor Name' and 'Course Code'. SA also encounters input statements that expect input from the user next to the output texts. Using program understanding techniques, SA finds out that the inputs are used with certain schema elements in a 'where clause' to form a query that returns the desired tuples from the database. SA first relates the output statements containing descriptive text (e.g., 'Instructor Name') to the input statements located next to the output statements on the search screen shown in Figure 3-2. SA then traces the input statements back to the 'where clause' and finds their corresponding schema elements in the database. Hence, SA relates the descriptive text to the schema elements. For example, if SA relates the output statement 'Instructor Name' to the 'Name' schema element of relation 'CourseInst', then we can conclude that the 'Name' schema element of the relation 'CourseInst' stores information about 'Instructor Names'.

Let us look at another example. Figure 3-3 shows a report R1 using the schema elements from the schema S1. Let us assume that we have access to the application source code that generates the report shown in Figure 3-3. The schema element names in S1 are non-descriptive. However, our semantic analyzer can gather valuable semantic information by analyzing the source code. SA first traces the descriptive column header texts back to the schema elements that fill in the data of that column. Then, SA relates descriptive


column header texts to the schema elements (red arrows). After that, we can draw conclusions about the semantics of the schema elements. For example, we can conclude that the Name schema element of the relation CourseInst stores information about 'Instructors'.

Figure 3-3. Schema used by a report.

3.1.2 Conceptual Architecture of Semantic Analyzer

SA is embedded in the Data Reverse Engineering (DRE) module of the SEEK prototype together with the Schema Extractor (SE) component. As Figure 3-4 illustrates, the SE component in the DRE connects to the data source with a call-level interface (e.g., JDBC) and extracts the schema of the data source. The SA component enhances this schema with the pieces of evidence about the semantics of the schema elements found in the application source code or in the report design templates.

We show the components of the Semantic Analyzer (SA) in Figure 3-5. The Abstract Syntax Tree Generator (ASTG) accepts application source code to be analyzed, parses


it and produces the abstract syntax tree of the source code. An Abstract Syntax Tree (AST) is an alternative representation of the source code for more efficient processing. Currently, the ASTG is configured to parse application source code written in Java. The ASTG can also parse SQL statements embedded in the Java source code and HTML code extracted from the Java Servlet source code. However, we aim to parse and extract semantic information from source code written in any programming language. To reach this aim, we use a state-of-the-art parser generation tool, JavaCC, to build the ASTG. We explain how we build the ASTG so that it becomes extensible to other programming languages in Section 3.1.3.

Figure 3-4. Conceptual view of the Data Reverse Engineering (DRE) module of the Scalable Extraction of Enterprise Knowledge (SEEK) prototype.

Figure 3-5. Conceptual view of Semantic Analyzer (SA) component.
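The idea of parsing source code into an AST and then relating descriptive output text to schema elements can be illustrated with a short sketch. It is not the SEEK implementation: it uses Python's standard `ast` module on a Python snippet as a stand-in for the JavaCC-generated Java parser, it requires Python 3.9 or later, and the analyzed snippet, the `fetch_one` helper it mentions, and the class and function names are all hypothetical.

```python
import ast

# Hypothetical report-generating code to analyze (Python standing in for Java).
# It is only parsed, never executed, so fetch_one does not need to exist.
SOURCE = '''
row = fetch_one("SELECT Name, Code FROM CourseInst")
name = row["Name"]
code = row["Code"]
print("Instructor Name:", name)
print("Course Code:", code)
'''


class LabelTracer(ast.NodeVisitor):
    """Collects variable-to-column assignments and labelled output statements."""

    def __init__(self):
        self.column_of = {}   # variable name -> column read from the result row
        self.labels = []      # (descriptive text, variable name) pairs

    def visit_Assign(self, node):
        # Matches assignments of the form  var = row["Column"]
        target, value = node.targets[0], node.value
        if (isinstance(target, ast.Name)
                and isinstance(value, ast.Subscript)
                and isinstance(value.slice, ast.Constant)):
            self.column_of[target.id] = value.slice.value
        self.generic_visit(node)

    def visit_Call(self, node):
        # Matches output statements of the form  print("text", var)
        if isinstance(node.func, ast.Name) and node.func.id == "print":
            texts = [a.value for a in node.args
                     if isinstance(a, ast.Constant) and isinstance(a.value, str)]
            names = [a.id for a in node.args if isinstance(a, ast.Name)]
            if texts and names:
                self.labels.append((texts[0], names[0]))
        self.generic_visit(node)


def relate_labels_to_columns(source):
    """Parse the source into an AST and map each output label to a column."""
    tracer = LabelTracer()
    tracer.visit(ast.parse(source))
    return {text: tracer.column_of[var]
            for text, var in tracer.labels if var in tracer.column_of}


relations = relate_labels_to_columns(SOURCE)
```

Here `relations` maps the label 'Instructor Name:' to the column `Name` and 'Course Code:' to `Code`. A real analyzer would additionally have to follow variables through method calls and intermediate assignments, which this sketch deliberately leaves out.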

We also extract semantic information from another electronically available information source, namely from report design templates. A report design template includes information about the design of a report and is typically represented in XML. When a report generation tool, such as Eclipse BIRT or JasperReport, runs a report design template, it retrieves data from the data source and presents it to the end user according to the specification in the report design template. When parsed, valuable semantic information about the schema elements can be gathered from report design templates. The Report Template Parser (RTP) component of SA is used to parse report design templates. Our current semantic analyzer is configured to parse report templates designed with Eclipse BIRT. We show an example of a report design template in Figure 3-6 and a resulting report when this template was run in Figure 3-7.

Figure 3-6. Report design template example.

Figure 3-7. Report generated when the above template was run.

The outputs of ASTG and RTP are the inputs for the Information Extractor (IEx) component of SA. The IEx, shown in Figure 3-5, is the component where we apply several heuristics to relate descriptive text in application source code with the schema elements in the database by using program understanding techniques. Specifically, the IEx first identifies the output statements. Then, it identifies texts in the output statements and variables related with these output texts. The IEx relates the output text with the variables with the help of several heuristics described in Section 3.1.5. The IEx traces the variables related with the output text to the schema elements from which it retrieves data.

Figure 3-8. Java Servlet generated HTML report showing course listings of CALTECH.

The IEx can extract information from Java application source code that communicates with the user through the console. The IEx can also extract information from Java Servlet application source code. A Servlet is a Java application that runs on the Web server and responds to client requests by generating HTML pages dynamically. A Servlet generates an HTML page through the output statements embedded inside the Java code. After the IEx analyzes the Java Servlet, it identifies the output statements that output HTML code. It also identifies the schema elements from which the data on the HTML page is retrieved. As an intermediate step, the IEx produces the HTML page that the Servlet would produce, with the schema element names instead of the data. An example of the output HTML page generated by the IEx after analyzing a Java Servlet is shown in Figure 3-9. The Java Servlet output that was analyzed by the IEx is shown in Figure 3-8. This example is taken from the THALIA integration benchmark and shows course offerings in the Computer Science department of the California Institute of Technology (CALTECH). The reader can notice that the data on the report in Figure 3-8 is replaced in Figure 3-9 with the names of the schema elements from which the data is retrieved. Next, the IEx analyzes this annotated HTML page shown in Figure 3-9 and extracts semantic information from this page.

Figure 3-9. Annotated HTML page generated by analyzing a Java Servlet.

The IEx has been implemented using visitor design pattern classes. We explain the benefits of using visitor design patterns in Section 3.1.3. The IEx applies several program understanding techniques such as program slicing, data flow analysis and call graph analysis [49] in visitor design pattern classes. We describe these techniques in Section 3.1.4.

The IEx also extracts semantic information from report design templates. The IEx uses heuristics seven to eleven described in Section 3.1.5 while analyzing the report design templates. Extracting information from report design templates is relatively easier than extracting information from application source code because the report design templates are represented in XML and are more structured.

The Report Ontology Writer (ROW) component of SA writes the semantic information gathered into report ontology instances represented in the OWL language. We explain the design details of the report ontology in Section 3.2.3. These report ontology instances form the knowledge base of the data source being analyzed.

3.1.3 Extensibility and Flexibility of Semantic Analyzer

Our current semantic analyzer is configured to extract information from application source code written in Java. We chose the Java programming language because it is one of the dominant programming languages in enterprise information systems. However, we want our semantic analyzer to be able to process source code written in any programming language to extract semantic information about the data of the legacy system. For this reason, we need to develop our semantic analyzer in a way that is easily extensible to other programming languages. To reach this aim, we leverage state-of-the-art techniques and recent research on code reverse engineering, abstract syntax tree generation and object-oriented programming to develop a novel approach for semantic extraction from source code. We describe our extensible semantic analysis approach in detail in this section.

To analyze application source code, we need a parser for the grammar of the programming language of the source code. This parser is used to generate the Abstract Syntax Tree (AST) of the source code. An AST is a representation of source code that facilitates the use of tree traversal algorithms. For programmers, writing a parser for the grammar of a programming language has always been a complex, time-consuming, and error-prone task. Writing a parser becomes more complex as the number of production rules of the grammar increases. It is not easy to write a robust parser for Java, which has many production rules [91] (footnote 2). We focus on extracting semantic information from the legacy system's source code, not on writing a parser. For this reason, we chose a state-of-the-art parser generation tool to produce our Java parser. We use JavaCC (footnote 3) to automatically generate a parser by using the specification files from the JavaCC repository (footnote 4). JavaCC can be used to generate parsers for any grammar. We also utilize JavaCC to generate a parser for SQL statements that are embedded inside the Java source code and for HTML code that is embedded inside the Java Servlet code. By using JavaCC, we can extend SA to make it capable of parsing other programming languages with little effort.

The Information Extractor (IEx) component of SA is composed of several visitor design pattern classes. Visitor design patterns give the flexibility to change the operation being performed on a structure without the need to change the classes of the elements on which the operation is performed [38]. Our goal is to build semantic information extraction techniques that can be applied to any source code and can be extended with new algorithms. By using visitor design patterns [71], we do not embed the functionality of the information extraction inside the classes of the Abstract Syntax Tree Generator (ASTG). This separation lets us focus on the information extraction algorithms. We can maintain the operations being performed whenever necessary. Moreover, new operations over the data structure can be defined simply by adding a new visitor [13].

Footnote 2: There are over 80 production rules in the Java language according to the Java grammar that we obtained from the JavaCC repository.
Footnote 3: JavaCC:
Footnote 4: JavaCC repository:
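The separation between the tree classes and the extraction operations can be sketched as follows. This is only a minimal illustration under stated assumptions: the node types (StringLiteral, VariableRef, OutputStatement) and the two visitors are hypothetical names invented for this sketch; they are not the classes generated by JavaCC or used in the SA prototype.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified AST node types (assumption for illustration only;
// these are NOT the classes that JavaCC generates).
interface Node {
    <R> R accept(Visitor<R> visitor);
}

final class StringLiteral implements Node {
    final String value;
    StringLiteral(String value) { this.value = value; }
    @Override public <R> R accept(Visitor<R> visitor) { return visitor.visit(this); }
}

final class VariableRef implements Node {
    final String name;
    VariableRef(String name) { this.name = name; }
    @Override public <R> R accept(Visitor<R> visitor) { return visitor.visit(this); }
}

final class OutputStatement implements Node {
    final List<Node> arguments;
    OutputStatement(List<Node> arguments) { this.arguments = arguments; }
    @Override public <R> R accept(Visitor<R> visitor) { return visitor.visit(this); }
}

// One visit method per node type; each operation is a separate implementation.
interface Visitor<R> {
    R visit(StringLiteral node);
    R visit(VariableRef node);
    R visit(OutputStatement node);
}

// Operation 1: collect the variables that appear in output statements.
final class SlicingVariableCollector implements Visitor<List<String>> {
    @Override public List<String> visit(StringLiteral node) { return new ArrayList<>(); }
    @Override public List<String> visit(VariableRef node) {
        List<String> names = new ArrayList<>();
        names.add(node.name);
        return names;
    }
    @Override public List<String> visit(OutputStatement node) {
        List<String> names = new ArrayList<>();
        for (Node argument : node.arguments) names.addAll(argument.accept(this));
        return names;
    }
}

// Operation 2: collect the descriptive text. It is added without touching
// any of the node classes above.
final class OutputTextCollector implements Visitor<String> {
    @Override public String visit(StringLiteral node) { return node.value; }
    @Override public String visit(VariableRef node) { return ""; }
    @Override public String visit(OutputStatement node) {
        StringBuilder text = new StringBuilder();
        for (Node argument : node.arguments) text.append(argument.accept(this));
        return text.toString();
    }
}

final class VisitorDemo {
    public static void main(String[] args) {
        // Models System.out.println("Course code: " + V);
        Node statement = new OutputStatement(
                Arrays.asList(new StringLiteral("Course code: "), new VariableRef("V")));
        System.out.println(statement.accept(new SlicingVariableCollector()));
        System.out.println(statement.accept(new OutputTextCollector()));
    }
}
```

In this sketch, adding a third operation would mean writing one more Visitor implementation, while the node classes stay unchanged. That is the property the text relies on.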


3.1.4 Application of Program Understanding Techniques in SA

We have introduced program understanding techniques in Section 2.5. In this section, we present how we apply these techniques in SA. SA has two components, as shown in Figure 3-5. The input of the Information Extractor (IEx) component is an abstract syntax tree (AST). The AST is the output of our Abstract Syntax Tree Generator (ASTG), which is actually a parser. As mentioned in Section 2.5, processing the source code by a parser to produce an AST is one of the program understanding techniques, known as Syntactic Analysis [49]. We perform the rest of the program understanding techniques on the AST by using the visitor design pattern classes of the IEx.

One of the program understanding techniques we apply is Pattern Matching [49]. We wrote a visitor class that looks for certain patterns inside the source code. These patterns, such as input/output statements, are stored in a class structure, and new patterns can simply be added to this class structure as needed. The visitor class that searches for these patterns identifies the variables in the input/output statements as slicing variables. For instance, the variable V in Table 3-5 is identified as a slicing variable since it is used in an output statement. Program Slicing [75] is another program understanding technique mentioned in Section 2.5. We analyze all the statements affecting a variable that is used in an output statement. This technique is also known as backward slicing.

SA also applies the Call Graph Analysis technique [83]. SA produces the inter-procedural call graph of the source code and analyzes only the methods that exist in this graph. Starting from a specific method (e.g., the main method of a Java stand-alone class or the doGet method of a Java Servlet), SA traverses all methods that can possibly be executed at run time. In this way, SA avoids analyzing unused methods. These methods can reflect old functionality of the system, and analyzing them can lead to incorrect, misleading information. An example of an inter-procedural call graph of a program source code is shown in Figure 3-10. SA does not analyze method1 of Class1, method1 of Class2, and method3 of Class3 since they are never called from inside other methods.
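The reachability step of the call graph analysis can be sketched as a simple graph traversal. The sketch below is an illustration under assumptions, not the SA implementation: the call graph is assumed to be already available as a map from a method name to the methods it calls, and the method names in the demo are hypothetical (they do not reproduce Figure 3-10).

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class CallGraphReachability {

    /**
     * Returns every method that can be reached from the entry method by
     * following call edges. Methods outside this set are never executed
     * when the program starts at the entry method, so they can be skipped.
     */
    static Set<String> reachableFrom(Map<String, List<String>> callGraph, String entry) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> worklist = new ArrayDeque<>();
        worklist.push(entry);
        while (!worklist.isEmpty()) {
            String method = worklist.pop();
            if (!visited.add(method)) {
                continue; // already processed; also protects against recursion cycles
            }
            for (String callee : callGraph.getOrDefault(method, Collections.<String>emptyList())) {
                worklist.push(callee);
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // Hypothetical call graph: caller -> callees.
        Map<String, List<String>> graph = new HashMap<>();
        graph.put("Servlet.doGet", Arrays.asList("Class1.method2", "Class2.method2"));
        graph.put("Class1.method2", Arrays.asList("Class3.method1"));
        graph.put("Class1.method1", Arrays.asList("Class3.method1")); // never called by anyone
        System.out.println(reachableFrom(graph, "Servlet.doGet"));
    }
}
```

In the demo graph, Class1.method1 has outgoing calls but no caller on any path from the entry point, so it is not in the returned set and would not be analyzed.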


Figure 3-10. Inter-procedural call graph of a program source code.

The Data Flow Analysis technique [83] is another program understanding technique that we implemented in the IEx by visitor design patterns. As mentioned in Section 2.5, Data Flow Analysis is the analysis of the flow of values from variables to variables. SA analyzes the data flow in the variable dependency graphs (i.e., the flow of data between variables). SA analyzes assignment statements and makes the necessary changes to the values stored in the symbol table of the class being analyzed.

SA also analyzes the data flow in the system dependency graphs (i.e., the flow of data between methods). SA analyzes method calls, initializes the values of the method's variables with the actual parameters in the method call, and transfers back the value of the return variable at the end of the method. SA can transfer information from one method to another through variables and can use this information to discover the semantics of a schema element. The code fragment in Table 3-1 is given as an example of this capability of SA. Inside the method, the value of the variable query is transferred to the variable rs. At the end of the method, the value of the variable rs is transferred to the variable rsList. The value of the fourth field of the query from the result set is then stored in a variable and printed out. When we relate the text in the output statement with the fourth field of the query, we can conclude that the Pl field of the table Course corresponds to 'Class is held in room number'.

Table 3-1. Semantic Analyzer can transfer information from one method to another through variables and can use this information to discover semantics of a schema element.

    public ResultSet returnList() {
        ResultSet rs = null;
        try {
            String query = "SELECT Code, Time, Day, Pl, Inst FROM Course";
            rs = sqlStatement.executeQuery(query);
        } catch (Exception ex) {
            researchErr = ex.getMessage();
        }
        return rs;
    }

    ResultSet rsList = returnList();
    String dataOut = "";
    while (rsList.next()) {
        dataOut = rsList.getString(4);
        ...
        System.out.println("Class is held in room number: " + dataOut);

3.1.5 Heuristics Used for Information Extraction

A heuristic is any method found through observation which produces correct or sufficiently exact results when applied in commonly occurring conditions. We have developed several guidelines (heuristics) through observations to extract semantics from the application source code and report design templates. These heuristics relate semantically rich descriptive texts to schema elements. They are based mainly on the layout and format (e.g., font size, face, color, and type) of the data and description texts that are used to communicate with users through the console with input/output statements or through a report.

We introduce these heuristics below. The first six heuristics shown in this section were developed to extract information from the source code of applications that communicate with users through the console with input/output statements. Please note that the code fragments in the first six heuristics contain Java-specific input, output, and database-related statements that use syntax based on the Java API. We parameterized these statements in our SA prototype. Therefore, it is theoretically straightforward to add new input, output, and database-related statement names or to switch to another language if necessary.

We developed the rest of the heuristics to extract semantics from reports. We use these heuristics to extract semantic information either from reports generated by Java Servlets or from report design templates.

Heuristic 1. Application code generally has input-output statements that display the results of queries executed on the underlying database. Typically, output statements display one or more variables and/or contain one or more format strings. Table 3-2 shows a format string "\n\nCourse code:\n\t" followed by a variable V.

Table 3-2. Output string gives clues about the semantics of the variable following it.

    System.out.println("\n\nCourse code:\n\t" + V);

Heuristic 2. The format string in an input-output statement describes the displayed slicing variable that comes after this format string. The format string "\n\nCourse code:\n\t" describes the variable V in Table 3-2.

Heuristic 3. The format string that contains semantic information and the variable may not be in the same statement and may be separated by an arbitrary number of statements, as shown in Table 3-3.

Table 3-3. Output string and the variable may not be in the same statement.

    System.out.println("\n\nCourse code:");
    ...
    ...
    System.out.print(V);

Heuristic 4. There may be an arbitrary number of format strings in different statements that inherit semantics, and they may be separated by an arbitrary number of statements before we encounter an output of the slicing variable. Concatenating the format strings that precede the slicing variable gives more clues about the semantics of the variable. An example is shown in Table 3-4.

Table 3-4. Output strings before the slicing variable should be concatenated.

    System.out.print("\n\nCourse");
    System.out.println("\n\tcode:");
    System.out.print(V);

Heuristic 5. An output text in an output statement and a following variable in the same or following output statements are semantically related. The output text can be considered as the variable's possible semantics. We can trace back the variable through backward slicing and identify the schema element in the data source that assigns a value to it. We can conclude that this schema element and the variable are related. We can then relate the output text with the schema element. The Java code sample with an embedded SQL query in Table 3-5 illustrates our point.

Table 3-5. Tracing back the output text and associating it with the corresponding column of a table.

    Q = "SELECT C FROM T";
    R = S.executeQuery(Q);
    V = R.getString(1);
    System.out.println("Course code: " + V);
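Heuristics 2 through 5 can be illustrated together with a short sketch. It is a simplification under stated assumptions, not the IEx code: the sequence of output items is assumed to have been extracted from the AST already, and the mapping from a variable to its schema element (the result of backward slicing) is assumed to be given as a plain map. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class OutputTextLinker {

    /** One item written by an output statement: either literal text or a variable. */
    static final class Item {
        final String text;      // non-null for literal text
        final String variable;  // non-null for a variable
        private Item(String text, String variable) { this.text = text; this.variable = variable; }
        static Item text(String text) { return new Item(text, null); }
        static Item variable(String name) { return new Item(null, name); }
    }

    /**
     * Concatenates the format strings that precede each slicing variable
     * (Heuristics 3 and 4) and relates the resulting text to the schema
     * element that supplies the variable's value (Heuristic 5).
     *
     * @param outputs          output items in program order
     * @param variableToColumn assumed result of backward slicing: variable -> schema element
     * @return schema element -> descriptive text
     */
    static Map<String, String> link(List<Item> outputs, Map<String, String> variableToColumn) {
        Map<String, String> result = new LinkedHashMap<>();
        StringBuilder pending = new StringBuilder();
        for (Item item : outputs) {
            if (item.text != null) {
                pending.append(item.text);
                continue;
            }
            // Layout characters such as \n and \t carry no meaning; collapse them.
            String description = pending.toString().replaceAll("\\s+", " ").trim();
            String column = variableToColumn.get(item.variable);
            if (column != null && !description.isEmpty()) {
                result.put(column, description);
            }
            pending.setLength(0); // text before this variable is consumed
        }
        return result;
    }

    public static void main(String[] args) {
        // The three output statements of Table 3-4, with V traced to column C of table T.
        List<Item> outputs = Arrays.asList(
                Item.text("\n\nCourse"), Item.text("\n\tcode:"), Item.variable("V"));
        System.out.println(link(outputs, Collections.singletonMap("V", "T.C")));
    }
}
```

The sketch resets the pending text after every variable, which is one possible design choice; a real analyzer would also have to decide how far back a description may reach.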


In Table 3-5, the variable V is associated with the text 'Course code'. It is also associated with the first column of the query result in R, which is called C. Hence, the column C can be associated with the text 'Course code'.

Heuristic 6. If the variable V is used with column C of table T in a comparison in the where-clause of the query Q, and if one can associate a text string from an input/output statement denoting the meaning of variable V, then we can associate this meaning of V with column C of table T. The Java code sample with an embedded SQL query in Table 3-6 illustrates our point.

Table 3-6. Associating the output text with the corresponding column in the where-clause.

    Q = "SELECT * FROM T WHERE C = '" + V + "'";
    R = S.executeQuery(Q);
    System.out.println("Course code: " + V);

In Table 3-6, the variable V is associated with the text 'Course code:'. It is also associated with the column C of table T. Hence, the schema element C can be associated with the text 'Course code'.

Table 3-7. Column header describes the data in that column.

    College   Course   Title             Instructor
    CAS       CS101    Intro Comp.       Dr. Long
    GRS       CS640    Artificial Int.   Dr. Betke

Heuristic 7. A column header H (i.e., description text) of a table on a report describes the value of a data item D (i.e., data element) in that column. We can associate the header H with the data D presented in the same column. For example, the header "Instructor" in the fourth column describes the value "Dr. Long" in Table 3-7.

Table 3-8. Column on the left describes the data items listed to its immediate right.

    Course        CSE103 Introduction to Databases
    Credits       3
    Description   Core concepts in databases


Heuristic 8. A descriptive text T in a row of a table on a report describes the value of a data item D on its right-hand side in the same row of the table. We can associate the text T with the data D presented in the same row. For example, the text "Description" in the third row describes the value "Core concepts in databases" in Table 3-8.

Table 3-9. Column on the left and the header immediately above describe the same set of data items.

    Core Courses
      Course        CSE103 Introduction to Databases
      Credits       3
      Description   Core concepts in databases
    Elective Courses
      Course        CSE131 Problem Solving
      Credits       3
      Description   Use of Comp. for problem solving

Heuristic 9. Heuristics 7 and 8 can be combined. Both the header of a data item in the same column and the text on the left-hand side in the same row describe the data. For example, both the text "Course" on the left-hand side and the header "Elective Courses" of the data "CSE131 Problem Solving" describe the data in Table 3-9.

Table 3-10. Set of data items can be described by two different headers.

    Course              Instructor
    Code      Room      Name           Room
    CIS4301   E221      Dr. Hammer     E452
    COP6726   E112      Dr. Jermaine   E456

Heuristic 10. If more than one header describes a data item on a report, all the headers corresponding to the data describe the data. For example, both the header "Instructor" and the header "Room" describe the value "E452" in Table 3-10.

Table 3-11. Header can be processed before being associated with the data on a column.

    Course   Title (Credits)          Instructor
    CS105    Comp. Concepts (3.0)     Dr. Krentel
    CS110    Java Intro Prog. (4.0)   Dr. Bolker
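The table-based heuristics 7, 8, and 10 can be illustrated with a small sketch over a report that has already been read into a grid of cells. This is only an illustration under assumptions: the grid representation, the repetition of a spanning header in every column it covers, and the class and method names are choices made for this sketch and are not taken from the SA prototype.

```java
import java.util.ArrayList;
import java.util.List;

final class ReportHeaderLinker {

    /**
     * Heuristics 7 and 10: every header cell above a data cell describes it.
     * The first headerRows rows of the grid are header rows; a header that
     * spans several columns is assumed to be repeated in each of them.
     */
    static List<String> columnHeadersFor(String[][] grid, int headerRows, int column) {
        List<String> headers = new ArrayList<>();
        for (int row = 0; row < headerRows; row++) {
            String header = grid[row][column];
            if (header != null && !header.isEmpty()) {
                headers.add(header);
            }
        }
        return headers;
    }

    /** Heuristic 8: the text in the leftmost cell of a row describes the data to its right. */
    static String rowLabelFor(String[][] grid, int row) {
        return grid[row][0];
    }

    public static void main(String[] args) {
        // The content of Table 3-10, with spanning headers repeated per column.
        String[][] table310 = {
            {"Course",  "Course", "Instructor",   "Instructor"},
            {"Code",    "Room",   "Name",         "Room"},
            {"CIS4301", "E221",   "Dr. Hammer",   "E452"},
            {"COP6726", "E112",   "Dr. Jermaine", "E456"},
        };
        // Headers describing the value in row 2, column 3 ("E452").
        System.out.println(table310[2][3] + " is described by " + columnHeadersFor(table310, 2, 3));

        // The content of Table 3-8: labels on the left, data on the right.
        String[][] table38 = {
            {"Course",      "CSE103 Introduction to Databases"},
            {"Credits",     "3"},
            {"Description", "Core concepts in databases"},
        };
        System.out.println(table38[2][1] + " is described by " + rowLabelFor(table38, 2));
    }
}
```

Heuristic 9 corresponds to using both methods for the same data cell and keeping both results.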


Heuristic 11. The data value presented in a column can be retrieved from more than one data item in the schema. In that case, the format of the column header gives clues about how we need to parse the header and associate it with the data items. For example, the data of the second column in Table 3-11 is retrieved from two data items in the data source. The format of the header "Title (Credits)" tells us that we need to consider the parentheses while parsing the header and associating the data items in the column with the header.

In this section, we have introduced the Semantic Analyzer (SA). SA extracts information about the semantics of schema elements from the application source code. This information is an essential input for the Schema Matching (SM) component. In the following section, we introduce our schema matching approach and how we use SA to discover semantics for SM.

3.2 Schema Matching

Schema matching aims at discovering semantic correspondences between schema elements of disparate but related data sources. To match schemas, we need to identify the semantics of schema elements. When done manually, this is a tedious, time-consuming, and error-prone task. Much research has been carried out to automate this task to aid schema matching; see, for example, [25, 28, 78]. However, despite the ongoing efforts, current schema matching approaches, which use the schemas themselves as the main input for their algorithms, still rely heavily on manual input [26]. This dependence on human involvement is due to the well-known fact that schemas represent semantics poorly. Hence, we believe that improving current schema matching approaches requires improving the way we discover semantics.

Discovering semantics means gathering information about the data so that, after processing the data, a computer can decide how to use the data in the way a person would. In the context of schema matching, we are interested in finding information that leads us to a path from schema elements in one data source to the corresponding schema elements in the other. Therefore, we define discovering semantics for schema matching as discovering paths between corresponding schema elements in different data sources.

We reduce the level of difficulty of the schema matching problem by abstracting it to the matching of automatically generated documents, such as reports, that are semantically richer than the schemas to which they correspond. Reports and other user-oriented output, which are typically generated by report generators, do not use the names of schema elements directly but rather provide more descriptive names to make the output more comprehensible to the users. These descriptions, together with their formatting instructions and their relationships to the underlying schema elements in the data source, can be extracted from the application code generating the report. These semantically rich descriptions, which can be linked to the schema elements in the source, can be used to discover relationships between data sources and hence between the underlying schemas. Moreover, reports use more domain terminology than schemas. Therefore, domain dictionaries are particularly helpful when matching reports, more so than when they are used in schema matching algorithms that work on the schemas alone.

One can argue that the reports of an information system may not cover the entire schema and hence, with this approach, we may not find matches for all schema elements. It is important to note that we do not have to match all the schema elements of two data sources in order to have two organizations collaborate. We believe the reports typically present the most important data of the information system, which is also likely to be the set of elements that are important for the ensuing data integration scenario. Thus, starting the schema matching process from reports can help focus on the important data, eliminating any effort spent on matching unnecessary schema elements.

3.2.1 Motivating Example

We present a motivating example to show how analyzing report-generating application source code and report design templates can help us understand the semantics of schema elements better. We choose our motivating example reports from the university domain because the university domain is well known and easy to understand. To create our motivating example, we use the THALIA (footnote 5) testbed and benchmark, which provides a collection of over 40 downloadable data sources representing university course catalogs from computer science departments worldwide [47].

Footnote 5: THALIA Website:

Figure 3-11. Schemas of two data sources that collaborate for a new online degree program.

We motivate the need for schema matching across the two data sources of computer science departments with a scenario. Let us assume that two computer science departments of universities A and B start to collaborate on a new online degree program. Unless one is content to query each report separately, one can imagine the existence of a course schedule mediator capable of providing integrated access to the different course sites. The mediator enables us to query the data of both universities and presents the results in a uniform way. Such a mediator requires finding relationships across the source schemas S1 and S2 of universities A and B shown in Figure 3-11. This is a challenging task when limited to the information provided by the data source alone. By using only the schema names in the figure, one can match the schema elements of the two different schemas in various ways. For instance, one can match the schema element Name in relation Offerings of schema S2 with the schema element Name in relation Schedule of schema S1 or with the schema element Name in relation CourseInst of schema S1. Both mappings seem reasonable when we only consider the available schema information.

However, when we consider the reports generated by application source code using these schemas of data sources, we can decide on the mappings of schemas more accurately. Information system applications that generate reports retrieve data from the data source, format the data and present it to users. To make the data more comprehensible to the user, these applications generally do not use the names of schema elements but assign more descriptive names (i.e., titles) to the data, using domain-specific terms when applicable.

Figure 3-12. Reports from two sample universities listing courses.


For our motivating example, university A has reports R1 and R3, and university B has reports R2 and R4, presenting data from their data sources. Reports R1 and R2 present course listings, and reports R3 and R4 present instructor office information from the corresponding universities. We show these simplified sample reports (R1, R2, R3, and R4) and the schemas (S1 and S2) in Figures 3-12 and 3-13. The reader can easily match the column headers (blue dotted arrows in Figures 3-12 and 3-13). On the other hand, it is hard to match the schema elements of the data sources correctly by only considering their names. However, it again becomes straightforward to identify semantically related schema elements if we know the links between column headers and schema elements (red arrows in Figures 3-12 and 3-13).

Figure 3-13. Reports from two sample universities listing instructor offices.


Our idea is to find mappings between descriptive texts on reports (blue dotted arrows) by using semantic similarity functions and to find the links between these texts and schema elements (red arrows) by analyzing the application source code and report design templates. For this purpose, we first analyze the application source code or the report design template generating each report. For each report, we store our findings, such as descriptive texts (e.g., column headers), schema elements and relations between the descriptive texts and the schema elements, in an instance of the report ontology. We give the details of the report ontology in Section 3.2.3. We pair report ontology instances, one from the first data source and one from the second data source. We then compute the similarities between all possible report ontology instance pairs. For our example, the four possible report pairs when we select one report from DS1 and the other from DS2 are [R1-R2], [R1-R4], [R2-R3] and [R3-R4]. We calculate the similarity scores between descriptive texts on reports for each report pair by using semantic similarity functions based on WordNet, which we describe in Section 3.2.4. We then transfer the similarity scores between descriptive texts of reports to scores between schema elements of schemas by using the previously discovered relations between descriptive texts and schema elements. Last, we merge the similarity scores of schema elements computed for each report pair and form a final matrix holding the similarity scores between elements of schemas that are higher than a threshold. We address the details of each step of our approach in Section 3.2.2.

When we apply our schema matching approach to the example schemas and reports described above, we obtain a precision value of 0.86 and a recall value of 1.00. We show the similarity scores between schema elements of data sources DS1 and DS2 which are greater than the threshold (0.5) in Figure 3-14. These results are better than the results found by matching the above schemas with the COMA++ (COmbination of MAtching algorithms) framework (footnote 6). COMA++ [7] is a well-known and well-respected schema matching framework providing a downloadable prototype. This example suggests that our approach promises better accuracy for schema matching than existing approaches. We provide a detailed evaluation of the approach in Chapter 6. In the next section, we describe the steps of our schema matching approach.

Footnote 6: We use the default COMA++ AllContext combined matcher.

Figure 3-14. Similarity scores of schema elements of two data sources.

3.2.2 Schema Matching Approach

The main idea behind our approach is that user-oriented outputs, such as reports, encapsulate valuable information about the semantics of data, which can be used to facilitate schema matching. Applying well-known program understanding techniques as described in Section 3.1.4, we can extract semantically rich textual descriptions and relate these to the data presented on reports using the heuristics described in Section 3.1.5. We can trace the data back to the corresponding schema elements in the data source and match the corresponding schema elements in the two data sources. Below, we outline the steps of our schema matching approach, which we call Schema Matching by Analyzing ReporTs (SMART). In the next sections, we provide a detailed description of these steps, which are shown in Figure 3-15.

1. Creating an Instance of a Report Ontology
2. Computing Similarity Scores
3. Forming a Similarity Matrix
4. From Matching Ontologies to Schemas
5. Merging Results


Figure 3-15. Five steps of the Schema Matching by Analyzing ReporTs (SMART) algorithm.


3.2.3 Creating an Instance of a Report Ontology

In the first step, we analyze the application source code that generates a report. We described the details of the semantic analysis process in Section 3.1. The semantic information extracted from source code or from a report design template is stored in an instance of the report ontology.

We have developed an ontology for reports after analyzing some of the most widely used open source report generation tools, such as Eclipse BIRT (footnote 7), JasperReport (footnote 8) and DataVision (footnote 9). We designed the report ontology using the Protege Ontology Editor (footnote 10) and represented this report ontology in OWL (Web Ontology Language). The UML diagram of the report ontology depicted in Figure 3-16 shows the concepts, their properties and their relations with other concepts.

We store information about the descriptive texts on a report (e.g., column headers) and information about the source of the data (i.e., schema elements) presented on a report in an instance of the report ontology. The descriptive text and schema element properties are stored in the description element and data element concepts of the report ontology, respectively. The data element concept has properties such as attribute, table (the table of the attribute in the relational database) and type (the type of the data stored in the attribute). We identify the relation between a description element concept and a data element concept with the help of a set of heuristics based on the location and format information described in Section 3.1.5, and we store this information in the hasDescription relation property of the description element concept.

Footnote 7: Eclipse BIRT:
Footnote 8: JasperReport:
Footnote 9: DataVision:
Footnote 10: Protege tool:


Figure3-16.UniedModelingLanguage(UML)diagramoftheS chemaMatchingby AnalyzingReporTs(SMART)reportontology. Thedesignofthereportontologydoesnotchangefromonerep orttoanotherbut theinformationstoredinaninstanceofthereportontology changesbasedonthereport beinganalyzed.Weplacedthedataelementconceptinthecen terofthereportontology asshowninFigure 3-16 .Thisdesignisappropriateforthecalculationofsimilari tyscores betweendataelementconceptsaccordingtotheformuladesc ribedinSection 3.2.4 3.2.4ComputingSimilarityScores Wecomputesimilarityscoresbetweenallpossibledataelem entconceptpairs consistingofadataelementconceptfromaninstanceofther eportontologyofthe rstdatasourceandanotherdataelementconceptfromanins tanceofreportontologyof theseconddatasource.Thismeansiftherearemreportshavi ngndataelementsconcepts onaveragefor DS 1datasourceandkreportshavingldataelementsconceptson average 76


for DS 2datasource,wecomputesimilarityscoresfor( m n k l )pairsofdataelements concepts. However,computingsimilarityscoresforallpossiblerepo rtontologyinstancepairs maybeunnecessary.Forexample,unrelatedreportpairs,su chasareportdescribing paymentsofemployeeswithanotherdescribingthegradesof studentsatauniversity, maynothavesemanticallyrelatedschemaelementsandthere forewemaynotndany semanticalcorrespondencebycomputingsimilarityscores ofconceptsofunrelatedreport ontologyinstancepairs.Tosavecomputationtime,welter outreportpairsthathave semanticallyunrelatedreports.Todeterminewhichreport pairsaresemanticallyrelated ornot,werstextracttexts(i.e.,titles,footersanddata headers)ontworeportpairsand calculatesimilarityscoresofthesetexts.Ifthesimilari tyscorebetweenthesetextsofa reportpairisbelowapredeterminedthreshold,weassumeth atthereportpairpresents semanticallyunrelateddataandwedonotcomputesimilarit yscoresofdataelementpairs ofreportpairshavinglowsimilarityscoresforthetextson them. Thesimilarityoftwoobjectsdependsonthesimilaritiesof thecomponentsthat formtheobjects.Anontologyconceptisformedbytheproper tiesandtherelationsit has.Eachrelationofanontologyconceptconnectstheconce pttoitsneighborconcept. Therefore,thesimilarityoftwoconceptsdependsonthesim ilaritiesofthepropertiesof theconceptsandthesimilaritiesoftheneighborconcepts. Forexample,thesimilarityof twodataelementconceptsfromdierentinstancesoftherep ortontologydependsonthe similarityoftheirpropertiesattribute,table,andtypea ndthesimilaritiesofitsneighbor conceptsDescriptionElement,Header,Footer,etc. Oursimilarityfunctionbetweenconceptsofinstancesofan ontologyissimilarto thefunctionproposedbyRodriguezandEgenhofer[ 81 ].RodriguezandEgenhoferalso considersetsoffeatures(properties)andsemanticrelati ons(neighbors)amongconcepts whileassessingsimilarityscoresamongentityclassesfro mdierentontologies.While theirsimilarityfunctionaimstondsimilarityscoresbet weenconceptsfromdierent 77


ontologies,oursimilarityisforndingsimilarityscores betweentheinstancesofan ontology. Weformulatethesimilarityoftwoconceptsindierentinst ancesofanontologyas follows: sim c ( c 1 ;c 2 )= w p sim p ( c 1 ;c 2 )+ w n sim n ( c 1 ;c 2 )(3{1) where c 1 isaconceptinaninstanceoftheontology, c 2 isthesametypeofconcept inanotherinstanceoftheontology, w p istheweightoftotalsimilarityofpropertiesof thatconceptand w n istheweightoftotalsimilarityoftheneighborconceptsth atcanbe reachedfromthatconceptbyarelation. sim p ( c 1 ;c 2 )and sim n ( c 1 ;c 2 )aretheformulasto calculatesimilaritiesofthepropertiesandtheneighbors .Wecanformulate sim p ( c 1 ;c 2 )as follows: sim p ( c 1 ;c 2 )= X i =1 ::k w pi SimFunc ( c 1 p i ;c 2 p i )(3{2) where k isthenumberofpropertiesofthatconcept, w pi istheweightofthe i th property, c 1 p i isthe i thpropertyoftheconceptintherstreportontologyinstan ce, c 2 p i isthesametypeofpropertyoftheotherconceptinthesecond reportontologyinstance. SimFuncisthefunctionthatweusetoassessasimilaritysco rebetweenthevalues ofthepropertiesoftwoconcepts.Fordescriptionelements ,theSimFuncisasemantic similarityfunctionbetweentextswhichissimilartothete xt-to-textsimilarityfunctionof CorleyandMihalcea[ 21 ].Tocalculatethesimilarityscorebetweentwotextstring sT1 andT2,wersteliminatestopwords(e.g.,a,and,but,to,by ).Wethenndtheword havingthemaximumsimilarityscoreintextT2foreachwordi ntextT1.Thesimilarity scorebetweentwowords,onefromtextT1andtheotherfromT2 ,isobtainedfromathe Word-NetbasedsemanticsimilarityfunctionsuchastheJia ngandConrathmetric[ 52 ]. Wesumupthemaximumscoresanddividethesumbythewordcoun tofthetextT1. TheresultisthemeasureofsimilaritybetweentextT1andth etextT2forthedirection 78


fromT1toT2.Werepeattheprocessforthereversedirection (i.e.,fromT2toT1)and thencomputetheaverageofthetwoscoresforabidirectiona lsimilarityscore. Weusedierentsimilarityfunctionsfordierentproperti es.Ifthepropertythat wearecomparinghastextdatasuchaspropertydescription, weuseoneoftheword semanticsimilarityfunctionsthatwehaveintroducedinSe ction 2.11 .Byusingasemantic similaritymeasureinsteadoflexicalsimilaritymeasures uchaseditdistance,wecan detectthesimilaritiesofwordsthatarelexicallyfarbuts emanticallyclosesuchas lecturerandinstructorandwecanalsoeliminatethewordst hatarelexicallyclosebut semanticallyfarsuchas`tower'and`power'.Besidesdescr iptionpropertyofdescription elementconcept,wealsousesemanticsimilaritymeasurest ocomputesimilarityscores betweenfooternotepropertyofthefooterconcept,headern otepropertyoftheheader conceptandtitlepropertyofthereportconcept.Iftheprop ertythatwearecomparingis theattributeortablepropertyofdataelementconcept,wea ssessasimilarityscorebased ontheLevensteineditsimilaritymeasure.Besidesattribu tepropertyofdataelement concept,wealsouseeditsimilaritymeasurestocomputesim ilarityscoresbetweenquery propertyofthereportconcept. Inthefollowingformula,whichcalculatesthesimilarityb etweentheneighborsoftwo concepts, l isthenumberofrelationsoftheconceptswearecomparing, w ni istheweight ofthe i threlation, c 1 n i ( c 2 n i )istheneighborconceptoftherst(second)conceptthatwe reachbyfollowingthe k threlation. sim n ( c 1 ;c 2 )= X i =1 ::l w ni sim c ( c 1 n i ;c 2 n i )(3{3) Notethatoursimilarityfunctionisgenericandcanbeusedt ocalculatesimilarity scoresbetweenconceptsofinstancesofanyontologies.Eve nthoughtheformulasin Equations 3{1 3{2 and 3{3 arerecursiveinnature,whenweapplytheformulasto computesimilarityscoresbetweendataelementsofreporto ntologies,wedonotencounter recursivebehavior.Thatisbecausethereisnopathbacktod ataelementconceptthrough 79


relations from neighbors of the data element concept. In other words, the neighbor concepts of the data element concept do not have the data element concept as a neighbor.

We apply the above formulas to calculate similarity scores between data element concepts of two different report ontologies. The data element concept has the properties attribute, table, and type, and the neighbor concepts description element, report, header, and footer. The similarity score between two data element concepts can be formulated as follows:

$$
\begin{aligned}
sim_{DataElement}(DataElement_1, DataElement_2) ={} & w_1 \, SimFunc(Attribute_1, Attribute_2) \\
& + w_2 \, SimFunc(Table_1, Table_2) \\
& + w_3 \, SimFunc(Type_1, Type_2) \\
& + w_4 \, sim_{DescriptionElement}(DescriptionElement_1, DescriptionElement_2) \\
& + w_5 \, sim_{Report}(Report_1, Report_2) \\
& + w_6 \, sim_{Header}(Header_1, Header_2) \\
& + w_7 \, sim_{Footer}(Footer_1, Footer_2)
\end{aligned}
\tag{3-4}
$$

We explain how we determine the weights $w_1$ to $w_7$ in Section 6.2. The similarity scores between two description element, report, header, and footer concepts can be computed by the following formulas:

$$sim_{DescriptionElement}(DescriptionElement_1, DescriptionElement_2) = SimFunc(Description_1, Description_2) \tag{3-5}$$

$$sim_{Report}(Report_1, Report_2) = SimFunc(Query_1, Query_2) + SimFunc(Title_1, Title_2) \tag{3-6}$$

$$sim_{Header}(Header_1, Header_2) = SimFunc(HeaderNote_1, HeaderNote_2) \tag{3-7}$$

$$sim_{Footer}(Footer_1, Footer_2) = SimFunc(FooterNote_1, FooterNote_2) \tag{3-8}$$
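Equations 3-4 through 3-8 amount to a weighted sum of property-level scores. The following Python sketch illustrates this under stated assumptions; it is not the Java code of the prototype. The dictionary keys, the function names, and the toy `exact_match` function are hypothetical, and a single pluggable `sim_func` is used for every property, whereas the approach described above uses semantic measures for text properties and edit similarity for attribute, table, and query properties.

```python
from typing import Callable, Dict, Sequence

SimFunc = Callable[[str, str], float]


def exact_match(a: str, b: str) -> float:
    """Hypothetical stand-in for SimFunc: 1.0 for equal strings, else 0.0."""
    return 1.0 if a.lower() == b.lower() else 0.0


def data_element_similarity(
    e1: Dict[str, str],
    e2: Dict[str, str],
    weights: Sequence[float],
    sim_func: SimFunc = exact_match,
) -> float:
    """Weighted sum of Equation 3-4 for two data element concepts.

    The dictionary keys are assumptions made for this sketch; each data
    element is flattened together with its neighbor concepts.
    """
    if len(weights) != 7:
        raise ValueError("Equation 3-4 uses seven weights, w1 to w7")
    terms = [
        sim_func(e1["attribute"], e2["attribute"]),      # w1
        sim_func(e1["table"], e2["table"]),              # w2
        sim_func(e1["type"], e2["type"]),                # w3
        sim_func(e1["description"], e2["description"]),  # w4, Eq. 3-5
        sim_func(e1["query"], e2["query"])               # w5, Eq. 3-6
        + sim_func(e1["title"], e2["title"]),
        sim_func(e1["header_note"], e2["header_note"]),  # w6, Eq. 3-7
        sim_func(e1["footer_note"], e2["footer_note"]),  # w7, Eq. 3-8
    ]
    return sum(w * t for w, t in zip(weights, terms))
```

Because the neighbor concepts never lead back to the data element concept, the sketch needs no recursion: each neighbor contributes only its own property scores.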


3.2.5 Forming a Similarity Matrix

To form a similarity matrix, we connect to the underlying data sources using a call-level interface (e.g., JDBC) and extract the schemas of the two data sources to be integrated. A similarity matrix is a table storing similarity scores for two schemas such that the elements of the first schema form the column headers and the elements of the second schema form the row headers. The similarity scores are in the range [0,1]. The similarity matrix given as an example in Figure 3-17 has schema elements from the motivating example in Section 3.2.1, and the similarity scores between schema elements are fictitious.

Figure 3-17. Example for a similarity matrix.

3.2.6 From Matching Ontologies to Schemas

In the first step, we traced a data element to its corresponding schema element(s). We use this information to convert inter-ontology matching scores into scores between schema elements. Using the converted scores, we then fill in a similarity matrix for each report pair.

Note that we find similarity scores only for the subset of the schemas that is used in the reports. We believe the reports typically present the most important data of the information system, which is likely to be the set of elements that is important for the ensuing data integration scenario. Even though the reports of an information system may not cover the entire schema, our approach can help focus on the important data, thus eliminating efforts


to match unnecessary schema elements. Note that each similarity matrix can be sparse, having only a small subset of its cells filled in, as shown in Figures 3-18 and 3-19.

Figure 3-18. Similarity scores after matching report pairs about course listings.

Figure 3-19. Similarity scores after matching report pairs about instructor offices.

3.2.7 Merging Results

After generating a similarity matrix for each report pair, we need to merge them into a final similarity matrix. If we have more than one score for a schema element pair in the similarity matrix, we need to merge the scores. In Section 3.2.4, we described how we compute similarity scores for report pairs to avoid unnecessary computations between unrelated report pairs. We use these overall similarity scores between report pairs while merging similarity scores. We multiply the similarity score of a schema element pair with the overall similarity score of the report pair and sum the resulting scores up.


Then we divide the final score by the number of reports. For instance, if the similarity score between schema elements A and B is 0.9 in the first report pair, which has an overall similarity score of 0.7, and is 0.5 in the second report pair, which has an overall similarity score of 0.6, then we conclude that the similarity score between schema elements A and B is $(0.9 \times 0.7 + 0.5 \times 0.6) / 2 = 0.465$. Finally, we eliminate the combined scores which fall below a (user-defined) threshold.
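The merging step can be sketched in a few lines of Python. This is an illustration under assumptions rather than the prototype's implementation: the function name and the data layout are hypothetical, and the sketch divides by the number of report pairs that actually contributed a score for a given schema element pair, which is one possible reading of "the number of reports".

```python
from typing import Dict, List, Tuple

ElementPair = Tuple[str, str]
# One entry per report pair: (overall report-pair similarity, sparse matrix).
ReportPairResult = Tuple[float, Dict[ElementPair, float]]


def merge_similarity_matrices(
    results: List[ReportPairResult], threshold: float = 0.0
) -> Dict[ElementPair, float]:
    """Merge per-report-pair scores into one final sparse similarity matrix."""
    totals: Dict[ElementPair, float] = {}
    counts: Dict[ElementPair, int] = {}
    for overall, matrix in results:
        for pair, score in matrix.items():
            # Weight each element-pair score by the report-pair similarity.
            totals[pair] = totals.get(pair, 0.0) + score * overall
            counts[pair] = counts.get(pair, 0) + 1
    # Assumption: average over the report pairs that scored this element pair.
    merged = {pair: totals[pair] / counts[pair] for pair in totals}
    # Drop combined scores that fall below the user-defined threshold.
    return {pair: s for pair, s in merged.items() if s >= threshold}


# The worked example from the text: scores 0.9 and 0.5, weights 0.7 and 0.6.
example = [(0.7, {("A", "B"): 0.9}), (0.6, {("A", "B"): 0.5})]
final = merge_similarity_matrices(example, threshold=0.1)
```

With these inputs, the merged score for the pair (A, B) is (0.9 × 0.7 + 0.5 × 0.6) / 2, matching the value derived above up to floating-point rounding.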


CHAPTER 4
PROTOTYPE IMPLEMENTATION

We implemented both the semantic analyzer (SA) component of SEEK and the SMART schema matcher in the Java programming language. As shown in Figure 4-1, we have written 1,350 KB of Java code (approximately 27,000 lines of code) for our prototype implementation. In addition, we have utilized 1,150 KB of Java code (approximately 23,000 lines of code) which was automatically generated by JavaCC. In the following sections, we first explain the SA prototype and then the SMART prototype.

Figure 4-1. Java code size distribution of the Semantic Analyzer (SA) and Schema Matching by Analyzing ReporTs (SMART) packages.

4.1 Semantic Analyzer (SA) Prototype

We have implemented the SA semantic analyzer prototype in Java. The SEEK prototype source code is placed in the seek package. The functionality of the SEEK prototype is divided into several packages. The source code of the semantic analyzer (SA) component resides in the sa package. Java classes in the sa package are further divided into subpackages according to their functionality. The subpackages of the sa package are listed in Table 4-1.

4.1.1 Using JavaCC to generate parsers

The classes inside the packages syntaxtree, visitor, and parsers are automatically created by the JavaCC tool. JavaCC is a tool that reads a grammar specification and converts it to a Java program that can recognize matches to the grammar according to that


specification. As shown in Figure 4-2, JavaCC processes a grammar specification file and outputs the Java files that contain the code of the parser. The parser can process the languages that conform to the grammar in the specification file. The parsers generated in this way form the ASTG component of the SA. Grammar specification files for some grammars such as Java, C++, C, SQL, XML, HTML, Visual Basic, and XQuery can be found at the JavaCC grammar repository Web site. 1 These specification files have been tested and corrected by many JavaCC implementers. This implies that parsers produced by using these specifications must be reasonably effective in the correct production of ASTs. The classes generated by JavaCC form the abstract syntax tree generator (ASTG) of the SA, which was described in Section .

1 JavaCC repository:

Table 4-1. Subpackages in the sa package and their functionality.

Package name      Classes in the package
visitor           Default visitor classes.
parsers           Classes to parse application source code written in the grammars Java, HTML and SQL.
seekstructures    Supplementary classes to analyze application source code.
seekvisitors      Visitor classes to analyze source code written in the grammars Java, HTML

For the SA component of the SEEK prototype, we created parsers for three different grammars: Java, SQL and HTML. We placed these parsers, the related syntax tree classes, and the generic visitor classes into the parsers, syntaxtree, and visitor packages, respectively. Each Java class inside the syntaxtree package has an accept method to be used by visitors. Visitor classes have visit methods, each of which corresponds to a Java class inside the syntaxtree package. The syntaxtree, visitor, and parsers packages have 142, 15 and 14 classes, respectively. The classes inside these packages remain the same as long as the Java, SQL and HTML grammars do not change.


Figure 4-2. Using JavaCC to generate parsers.

The classes inside the packages seekstructures and seekvisitors are written to fulfill the goals of the SA. The seekstructures and seekvisitors packages have 25 and ten classes, respectively, and are subject to change as we add new functionality to the SA module. The classes inside these packages form the Information Extractor (IEx) of the SA, which was described in Section . IEx consists of several visitor classes that follow the visitor design pattern. The execution steps of the IEx and the functionality of some selected visitor classes are described in the next section.

4.1.2 Execution steps of the information extractor

The semantic analysis process has two main steps. In the first step, SA makes the preparations necessary for analyzing the source code and forms the control flow graph. The SA driver accepts the name of the stand-alone Java class file (with the main method) as an argument. Starting from this Java class file, SA finds all the user-defined Java classes to be analyzed in the class path and forms the control flow graph. Next, SA parses all the Java classes in the control flow graph and produces the ASTs of these Java classes. Then, the visitor class ObjectSymbolTable gathers variable declaration information for each class to be analyzed and stores this information in the SymbolTable classes. The SymbolTable classes are passed to each visitor class and are filled with new information as the SA process continues.

In the second step, SA identifies the variables used in input, output and SQL statements. SA uses the ObjectSlicingVars visitor class to identify slicing variables. The list of all input, output, and database-related statements, which are language (Java,


JDBC) specific, is stored in the InputOutputStatements and SqlStatements classes. To analyze additional statements, or to switch to another language, all we need to do is to add or update statement names in these classes. When a visitor class encounters a method statement while traversing the AST, it checks this list to find out if this method is an input, output, or database-related statement.

SA finds and parses SQL strings embedded inside the source code. SA uses the ObjectSQLStatement visitor class to find and parse SQL statements. While the visitor traverses the AST, it constructs the values of variables that are of String type. When a variable of type String or a string literal is passed as a parameter to an SQL execute method (e.g., executeQuery(queryStr)), this visitor class parses the string and constructs the AST of this SQL string. The visitor class ObjectSQLStatement then uses the visitor class ObjectSQLParse to extract information about the SQL string and stores this information in a class named ResultsetSQL. The information gathered from SQL statements and input/output methods is used to construct relations between a database schema element and the text denoting the possible meaning of the schema element.

Besides analyzing application source code written in Java, SA can also analyze report design templates represented in XML. The Report Template Parser (RTP) component of the SA uses the Simple API for XML 2 (SAX) to parse report templates.

The outcome of the IEx is written into report ontology instances represented in OWL. The Report Ontology Writer (ROW) uses the OWL API 3 to write the semantic information into OWL ontologies.

2 Simple XML API:
3 OWL API:


4.2 Schema Matching by Analyzing ReporTs (SMART) Prototype

We have implemented the SMART schema matcher prototype in Java. There are 46 classes in five different packages. The total size of the Java classes is 500 KB (approximately 10,000 lines).

We also wrote a Perl program to find similarity scores between word pairs by using the WordNet similarity library 4 [73]. To assess similarity scores between texts, we first eliminate stop words (e.g., a, and, but, to, by) and convert plural words to singular words. We convert plural words to singular words 5 because the WordNet Similarity functions return similarity scores between singular words.

The SMART prototype also uses the Simple API for XML (SAX) library to parse XML files and the OWL API to read OWL report ontology instances into internal Java structures.

The COMA++ framework enables external matchers to be included through an interface. We have integrated our SMART matcher into the COMA++ framework as an external matcher.

4 WordNet Semantic Similarity Library:
5 We are using the PlingStemmer library written by Fabian M. Suchanek:
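The text-to-text scoring used by SMART (stop word removal, best-match word scores in each direction, then averaging, as described in Section 3.2.4) can be sketched as follows. This is a minimal Python illustration, not the Perl or Java code of the prototype. The word-level function `toy_word_similarity` and its score table are hypothetical placeholders for a WordNet-based measure such as Jiang-Conrath, and plural-to-singular conversion is omitted.

```python
import re
from typing import Callable, List

# Stop words taken from the examples given in the text.
STOP_WORDS = {"a", "and", "but", "to", "by"}

# Hypothetical word-pair scores standing in for a WordNet-based measure.
_TOY_SCORES = {frozenset(("lecturer", "instructor")): 0.9}


def toy_word_similarity(w1: str, w2: str) -> float:
    """Placeholder word similarity in [0, 1]; identical words score 1.0."""
    if w1 == w2:
        return 1.0
    return _TOY_SCORES.get(frozenset((w1, w2)), 0.0)


def tokenize(text: str) -> List[str]:
    """Lower-case the text, split it into words, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def directional_similarity(
    source: List[str], target: List[str], word_sim: Callable[[str, str], float]
) -> float:
    """Average, over the source words, of the best score against the target."""
    if not source or not target:
        return 0.0
    best = [max(word_sim(s, t) for t in target) for s in source]
    return sum(best) / len(source)


def text_similarity(
    t1: str, t2: str, word_sim: Callable[[str, str], float] = toy_word_similarity
) -> float:
    """Bidirectional score: mean of the T1-to-T2 and T2-to-T1 directions."""
    w1, w2 = tokenize(t1), tokenize(t2)
    forward = directional_similarity(w1, w2, word_sim)
    backward = directional_similarity(w2, w1, word_sim)
    return (forward + backward) / 2
```

For example, under the toy scores above, `text_similarity("course title", "title")` averages a forward score of 0.5 and a backward score of 1.0.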


CHAPTER 5
TEST HARNESS FOR THE ASSESSMENT OF LEGACY INFORMATION INTEGRATION APPROACHES (THALIA)

Information integration refers to the unification of related, heterogeneous data from disparate sources, for example, to enable collaboration across domains and enterprises. Information integration has been an active area of research since the early 80s and has produced a plethora of techniques and approaches to integrate heterogeneous information. Determining the quality and applicability of an information integration technique has been a challenging task because of the lack of available test data of sufficient richness and volume to allow meaningful and fair evaluations. Researchers generally use their own test data and evaluation techniques, which are tailored to the strengths of the approach and often hide any existing weaknesses.

5.1 THALIA Website and Downloadable Test Package

While working on this research, we saw the need for a testbed and benchmark providing test data of sufficient richness and volume to allow meaningful and fair evaluations of information integration approaches. To answer this need, we developed the THALIA 1 (Test Harness for the Assessment of Legacy information Integration Approaches) benchmark. We show a snapshot of the THALIA website in Figure 5-1. THALIA provides researchers with a collection of over 40 downloadable data sources representing University course catalogs, a set of twelve benchmark queries, as well as a scoring function for ranking the performance of an integration system [47, 48].

1 URL of the THALIA website:

The THALIA website also hosts cached web pages of University course catalogs. The downloadable packages have data extracted from these websites. Figure 5-2 shows an example cached course catalog of Boston University hosted on the THALIA website. On the THALIA website, we also provide the ability to navigate between extracted data and


corresponding schema files that are in the downloadable packages. Figure 5-3 shows the XML representation of Boston University's course catalog and the corresponding schema file. Downloadable University course catalogs are represented using well-formed and valid XML according to the extracted schema for each course catalog. Extraction and translation from the original representation was done using a source-specific wrapper which preserves structural and semantic heterogeneities that exist among the different course catalogs.

Figure 5-1. Snapshot of Test Harness for the Assessment of Legacy information Integration Approaches (THALIA) website.


Figure 5-2. Snapshot of the computer science course catalog of Boston University.

5.2 DataExtractor (HTMLtoXML) Opensource Package

To extract the source data provided in the THALIA benchmark, we enhanced and used the Telegraph Screen Scraper (TESS) 2 source wrapper developed at UC Berkeley. The enhanced version of TESS, DataExtractor (HTMLtoXML), can be obtained from the SourceForge website 3 along with the 46 examples used to extract the data provided in THALIA. The DataExtractor (HTMLtoXML) tool provides added functionality over the TESS wrapper, including the capability of extracting data from nested structures. It extracts data from an HTML page according to a configuration file and puts the data into an XML file according to a specified structure.

2 TESS:
3 URL of DataExtractor (HTMLtoXML) is


Figure 5-3. Extensible Markup Language (XML) representation of Boston University's course catalog and corresponding schema file.

5.3 Classification of Heterogeneities

Our benchmark focuses on syntactic and semantic heterogeneities since we believe they pose the greatest technical challenges to the research community. We have chosen course information as our domain of discourse because it is well known and easy to understand. Furthermore, there is an abundance of publicly available data sources that allowed us to develop a testbed exhibiting all of the syntactic and semantic heterogeneities that we have identified in our classification. We list our classification of heterogeneities below. We have also listed these classifications in [48] and in the downloadable package on the THALIA website along with examples from the THALIA benchmark and corresponding queries.


1. Synonyms: Attributes with different names that convey the same meaning. For example, `instructor' vs. `lecturer'.
2. Simple Mapping: Related attributes in different schemas differ by a mathematical transformation of their values. For example, time values using a 24 hour vs. 12 hour clock.
3. Union Types: Attributes in different schemas use different data types to represent the same information. For example, course description as a single string vs. a complex data type composed of strings and links (URLs) to external data.
4. Complex Mappings: Related attributes differ by a complex transformation of their values. The transformation may not always be computable from first principles. For example, the attribute `Units' represents the number of lectures per week vs. a textual description of the expected workload in the field `credits'.
5. Language Expression: Names or values of identical attributes are expressed in different languages. For example, the English term `database' is called `Datenbank' in the German language.
6. Nulls: The attribute (value) does not exist. For example, some courses do not have a textbook field, or the value for the textbook field is empty.
7. Virtual Columns: Information that is explicitly provided in one schema is only implicitly available in the other and must be inferred from one or more values. For example, course prerequisites are provided as an attribute in one schema but exist only in comment form as part of a different attribute in another schema.
8. Semantic incompatibility: A real-world concept that is modeled by an attribute does not exist in the other schema. For example, the concept of student classification (`freshman', `sophomore', etc.) at American Universities does not exist in German Universities.
9. Same attribute in different structure: The same or related attribute may be located in different positions in different schemas. For example, the attribute Room is an attribute of Course in one schema while it is an attribute of Section, which in turn is an attribute of Course, in another schema.
10. Handling sets: A set of values is represented using a single, set-valued attribute in one schema vs. a collection of single-valued attributes organized in a hierarchy in another schema. For example, a course with multiple instructors can have a single attribute instructors or multiple section-instructor attribute pairs.
11. Attribute name does not define semantics: The name of the attribute does not adequately describe the meaning of the value that is stored there.


12. Attribute composition: The same information can be represented either by a single attribute (e.g., as a composite value) or by a set of attributes, possibly organized in a hierarchical manner.

5.4 Web Interface to Upload and Compare Scores

The THALIA website offers a web interface for researchers to upload their results for each heterogeneity listed above. The web interface accepts data on several aspects, such as the size of the specification, the number of mouse clicks, and the size of the program code, to evaluate the effort spent by the approach to resolve the heterogeneity. The uploaded scores can be viewed by anybody visiting the website of the THALIA benchmark. This helps other researchers compare their approach with others. Figure 5-4 shows the scores uploaded to the THALIA benchmark for the Integration Wizard (IWiz) Project at the University of Florida.

Figure 5-4. Scores uploaded to Test Harness for the Assessment of Legacy information Integration Approaches (THALIA) benchmark for Integration Wizard (IWiz) Project at the University of Florida.


While THALIA is not the only data integration benchmark, 4 what distinguishes THALIA is the fact that it combines rich test data with a set of benchmark queries and an associated scoring function to enable the objective evaluation and comparison of integration systems.

5.5 Usage of THALIA

We believe that THALIA not only simplifies the evaluation of existing integration technologies but also helps researchers improve the accuracy and quality of future approaches by enabling more thorough and more focused testing. We have used THALIA test data for the evaluation of our SM approach as described in Section 6.1.1. We are also happy to see that it is being used as a source of test data and as a benchmark by researchers [11, 74, 100] and in graduate courses. 5

4 A list of Data Integration Benchmarks and Test Suites can be found at
5 URL of the graduate course at the University of Toronto using THALIA is {}miller/cs2525/


CHAPTER 6
EVALUATION

We evaluate our approach using the prototype described in Chapter 4. In the following sections, we first describe our test data sets and our experiments. We then compare our results with other techniques and present a discussion of the results.

6.1 Test Data

The test data sets have two main components: the schema of the data source and the reports presenting the data from the data source. We used two test data sets. The first test data set is from the THALIA data integration testbed. This data set has 10 schemas. Each schema of the THALIA test data set has one report, and the report covers all schema elements of the corresponding schema. The second test data set is from the University of Florida registrar office. This data set has three schemas. Each schema of the UF registrar test data set has 10 reports, and the reports do not cover all schema elements of the corresponding schema. The first test data set from THALIA is used to see how the SMART approach performs when the entire schema is covered by reports, and the second test data set from UF is used to see how the SMART approach performs when the entire schema is not covered by reports. The test data set from UF also enables us to see the effect of having multiple reports for one schema. In the following subsections, we give detailed descriptions of the schemas and reports of these test data sets.

6.1.1 Test Data Set from THALIA testbed

The first test data set is from the THALIA testbed [48]. THALIA offers 44+ different University course catalogs from computer science departments worldwide. Each catalog page is represented in HTML. THALIA also offers the data and schema of each catalog page. We explained the details of the THALIA testbed in Chapter 5.

For the scope of this evaluation, we treat each catalog page (in HTML) as a sample report from the corresponding University. We selected 10 university catalogs (reports) from THALIA that represent different report design practices. For example,


we give two examples of these report design practices in Figures 6-1 and 6-2. Figure 6-1 shows the course scheduling report of Boston University and Figure 6-2 shows the course scheduling report of Michigan State University.

Figure 6-1. Report design practice where all the descriptive texts are headers of the data.

Figure 6-2. Report design practice where all the descriptive texts are on the left hand side of the data.

Sizes of the schemas in the THALIA test data set vary between 5 and 13, as listed in Table 6-1. We stored the data and schemas for each selected university in a MySQL 4.1 database. When we pair 10 schemas, we have 45 different pairs of schemas to match. The 45 different schema pairs have 2576 possible combinations of schema elements. We manually determined that 215 of these possible combinations are real matches. We use these manual mappings to evaluate our results.
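The counts of 45 schema pairs and 2576 candidate element combinations follow directly from the schema sizes listed in Table 6-1. The short Python check below reproduces the arithmetic; the dictionary and function names are illustrative only.

```python
from itertools import combinations
from typing import Dict, Tuple

# Number of schema elements per selected university, as listed in Table 6-1.
SCHEMA_SIZES: Dict[str, int] = {
    "University of Arizona": 5,
    "Brown University": 7,
    "Boston University": 7,
    "California Institute of Technology": 5,
    "Carnegie Mellon University": 9,
    "Florida State University": 13,
    "Michigan State University": 8,
    "New York University": 7,
    "University of Massachusetts Boston": 8,
    "University of New South Wales, Sydney": 7,
}


def count_candidates(sizes: Dict[str, int]) -> Tuple[int, int]:
    """Return (number of schema pairs, number of element combinations)."""
    pairs = list(combinations(sizes.values(), 2))
    # Each schema pair contributes |schema 1| x |schema 2| element pairs.
    return len(pairs), sum(a * b for a, b in pairs)


schema_pairs, element_pairs = count_candidates(SCHEMA_SIZES)
print(schema_pairs, element_pairs)  # 45 2576
```

Only 215 of the 2576 candidates are true matches, so a matcher that accepts every candidate would reach a precision of about 8%.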


Table 6-1. The 10 university catalogs selected for evaluation and size of their schemas.

University Name                           # of Schema Elements
University of Arizona                     5
Brown University                          7
Boston University                         7
California Institute of Technology        5
Carnegie Mellon University                9
Florida State University                  13
Michigan State University                 8
New York University                       7
University of Massachusetts Boston        8
University of New South Wales, Sydney     7

We recreated each report (catalog page) from THALIA by using two methods. One method uses Java Servlets and the other uses the Eclipse Business Intelligence and Reporting Tool (BIRT). The Java Servlet application corresponding to a course catalog fetches the relevant data from the repository and produces the HTML report. The report templates designed with the BIRT tool fetch the relevant data from the repository and produce the HTML report as well. When the SMART prototype is run, it analyzes the Java Servlet code and the report templates to extract semantic information.

6.1.2 Test Data Set from University of Florida

The second test data set contains student registry information and comes from the University of Florida. We contacted several offices at the University of Florida to obtain test data sets. 2

2 I would like to thank Dr. Joachim Hammer for his extensive efforts in reaching out to several departments and organizing meetings with staff to gather test data sets.

We first contacted the College of Engineering. After several meetings and discussions, the College of Engineering agreed to give us the schemas and the report design templates without any data. In fact, we were not after the data because our approach works without the need for the data. The College of Engineering forms and uses the data set that we obtained after several months as follows. The College of Engineering runs a batch program


every first day of the week and downloads data from the legacy DB2 database of the Registrar office. The DB2 database of the Registrar office is a hierarchical database. The College of Engineering stores the data in relational MS Access databases. The College of Engineering extracts a subset of the database of the registrar office and uses the same attribute and table names in the MS Access database as they are in the database of the registrar office. The College of Engineering creates subsets of this MS Access database and runs their reports on these MS Access databases. Figure 6-3 shows the conceptual view of the architecture of the databases in the College of Engineering. 3

3 I would like to thank James Ogles from the College of Engineering for his time to prepare the test data and for answering our questions regarding the data set.

Figure 6-3. Architecture of the databases in the College of Engineering.

We also contacted the UF Bridges office. Bridges is a project to replace the university's business computer systems, called legacy systems, with new web based, integrated


systems that provide real time information and improve university business processes. The UF Bridges project also redesigned the legacy DB2 database of the registrar office for MS SQL Server. We obtained the schemas and again could not reach the associated data because of privacy issues. 5

Finally, we reached the Business School. 6 The Business School stores its data in MS SQL Server databases. Their schema is based on the Bridges office schema; however, they use different naming conventions. They add new structures to the schemas when needed.

5 I also would like to acknowledge the help of Mr. Warren Curry from the Bridges office for his help obtaining the schemas.
6 I also would like to acknowledge the help of Mr. John C. Holmes from the Business School for his help obtaining the schemas.

Table 6-2. Portion of a table description from the College of Engineering, the Bridges Project and the Business School schemas.

The College of Eng.     The Bridges Office           The Business School
trans2                  PS_UF_CREC_COURSE            t_CREC
UUID VARCHAR(9)         UF_UUID VARCHAR(9)           UFID varchar(9)
CNum VARCHAR(4)         UF_AUTO_INDEX INTEGER        Term varchar(6)
Sect VARCHAR(4)         UF_TERM_CD VARCHAR(5)        CourseType varchar(1)
CT CHAR                 UF_TYPE_DESC VARCHAR(40)     Section varchar(4)

The schemas from the College of Engineering, the Bridges Office and the Business School are semantically related; however, they exhibit different syntactical features. The naming conventions and sizes of the schemas are different. The College of Engineering uses the same names for schema elements as they are in the Registrar's database. The schema element names often contain abbreviations which are mostly not possible to guess. The Bridges office uses a more descriptive naming convention for schema elements. The schema elements (i.e., column names) in the schema of the Business School have the most descriptive names. However, the table names in the schema of the Business School use


non-descriptive names similar to the names in the registrar database. To give an example of the different naming conventions in the schemas, we present a portion of a table description from the College of Engineering, the Bridges Office and the Business School schemas in Table 6-2.

The College of Engineering, the Bridges Office, and the Business School schemas have a total of 135, 175, and 114 attributes, respectively, in six tables. We present the table names in these three schemas and the number of schema elements that each table has in Table 6-3. In a data integration scenario, we pair the schemas that are to be integrated. When we pair the three schemas from the College of Engineering (COE), the Bridges Office (BO) and the Business School (BS), shown in Table 6-3, we have three different schema pairs to match: (COE-BO), (COE-BS), and (BO-BS). We manually determined that the (COE-BO) pair has 88, the (COE-BS) pair has 91, and the (BO-BS) pair has 110 mappings, and we use these manual mappings to evaluate the results of the SMART (Schema Matching by Analyzing ReporTs) and COMA++ (COmbination of MAtching algorithms) approaches as described in Section 6.3.2. We recreated the corresponding reports by using the Eclipse BIRT tool. We have 10 reports for each schema.

Table 6-3. Names of tables in the College of Engineering, the Bridges Office, and the Business School schemas and number of schema elements that each table has.

The College of Engineering    Bridges                   The Business School
colleges, 8                   ps_uf_colleges, 12        t_coll, 5
deptx1, 14                    ps_uf_departments, 23     t_dept, 8
ce, 32                        ps_uf_ce, 33              t_ce, 30
honors, 56                    ps_uf_honors, 56          t_honors, 40
majors, 10                    ps_uf_majors, 17          t_majo, 9
trans2, 15                    ps_uf_crec_course, 34     t_crec, 22
total: 135                    total: 175                total: 114

6.2 Determining Weights

We explained the formulas to compute similarity scores between concepts of two ontologies and showed how we apply these formulas to compute similarity scores between data element concepts of two report ontology instances in Section 3.2.4. In this section, we


show how we determine the weights in the formulas. The correct selection of the weights of the similarity function is very important. The weights directly affect the similarity score and hence directly affect the results and accuracy of the SMART approach. We show the formula for computing the similarity scores between data elements below. Our goal is to determine the weights $w_1$ to $w_8$.

$$
\begin{aligned}
sim_{DataElement}(DataElement_1, DataElement_2) ={} & w_1 \, SimFunc(Attribute_1, Attribute_2) \\
& + w_2 \, SimFunc(Table_1, Table_2) \\
& + w_3 \, SimFunc(Type_1, Type_2) \\
& + w_4 \, SimFunc(Description_1, Description_2) \\
& + w_5 \, SimFunc(Query_1, Query_2) \\
& + w_6 \, SimFunc(Title_1, Title_2) \\
& + w_7 \, SimFunc(HeaderNote_1, HeaderNote_2) \\
& + w_8 \, SimFunc(FooterNote_1, FooterNote_2)
\end{aligned}
\tag{6-1}
$$

We use the multiple linear regression method to determine the weights of the similarity function. In multiple linear regression [3], some of the variables are considered to be explanatory variables, and the remaining ones are considered to be dependent variables. In our problem, the explanatory variables are the similarities of the properties of the concepts. For example, the explanatory variables in the formula to compute similarity scores between data element concepts are the similarity scores of the Attribute, Table, Type, Description, Title, Query, HeaderNote and FooterNote properties. The dependent variable is the overall similarity of two data element concepts. Linear regression attempts to model the relationship between the explanatory variables and the dependent variable by fitting a linear equation to observed data. Observed data refers to a set of sample vectors for the explanatory variables and the desired value of the dependent variable corresponding to


the sample vectors. Our prototype computes the sample vectors for the explanatory variables according to the SimFunc functions that compute similarity scores between properties of concepts. We manually enter the desired value of the dependent variable (i.e., the similarity score for two data element concepts) corresponding to the sample vectors. Let us denote the explanatory variables as a column vector (called the feature vector) by:

$$\mathbf{x}(n) = [x_1(n), x_2(n), \ldots, x_N(n)]^T \tag{6-2}$$

where $T$ denotes the transpose operator and $n$ is the index of the sample data. The observed data contains a dependent variable (desired data) called

$$\mathbf{d}(n) = [d_1(n), d_2(n), \ldots, d_L(n)]^T \tag{6-3}$$

corresponding to the feature vector $\mathbf{x}(n)$. Note that $L$ is 1 for our problem. We can combine the feature vectors as an $N \times P$ matrix

$$\mathbf{x} = [\mathbf{x}(1), \mathbf{x}(2), \ldots, \mathbf{x}(P)] \tag{6-4}$$

where $P$ is the number of data points in our observed data. Similarly, the desired data can be combined in an $L \times P$ matrix

$$\mathbf{d} = [\mathbf{d}(1), \mathbf{d}(2), \ldots, \mathbf{d}(P)] \tag{6-5}$$

In linear regression, the goal is to model $\mathbf{d}(n)$ as a linear function of $\mathbf{x}(n)$, i.e.,

$$\mathbf{d}(n) = \mathbf{w}^T \mathbf{x}(n) \tag{6-6}$$

where

$$\mathbf{w} = [w_1, w_2, \ldots, w_N]^T \tag{6-7}$$

is called the weight matrix. The most common approach for finding $\mathbf{w}$ is the method of least squares. This method calculates the optimal $\mathbf{w}$ for the observed data by minimizing


a cost function which is the mean of the squared errors (MSE), i.e.,

$$MSE = \sum_{n=1}^{P} \left( \mathbf{d}(n) - \mathbf{w}^T \mathbf{x}(n) \right)^2 \tag{6-8}$$

Using the MSE, the weight matrix $\mathbf{w}$ can be found analytically or in an iterative fashion. To find the analytical solution, we calculate the minimum value of the MSE with respect to $\mathbf{w}$. To find the minimum value of the MSE, we take the derivative of the MSE with respect to $\mathbf{w}$ and equate it to 0. The resulting equation for the optimal value of $\mathbf{w}$, denoted by $\mathbf{w}^*$, is given by

$$\mathbf{w}^* = \left( \frac{1}{P} \sum_{n=1}^{P} \mathbf{x}(n) \, \mathbf{x}(n)^T \right)^{-1} \left( \frac{1}{P} \sum_{n=1}^{P} \mathbf{x}(n) \, \mathbf{d}(n) \right) \tag{6-9}$$

We have determined the weights by using our test data from the THALIA testbed. The THALIA test data has 10 schemas and one report for each schema. We extracted and created an instance of the report ontology that corresponds to each report. Each report ontology instance has 5 to 9 data element concepts. In total, we have 2576 data concept pairs in 45 report instance combinations. We use 1500 of these 2576 data concept pairs as the training data set for determining the weights of the similarity function. The eight weights of the similarity function that we found are shown in Table 6-4. We ran our experiments to determine the weights with three different similarity measures, jcn [52], lin [60] and Levenshtein [20], to determine similarity scores between texts.

Table 6-4. Weights found by analytical method for different similarity functions with THALIA test data.

SimFunc        Attribute  Table  Type  Description  Query  Title  Header  Footer
JCN            0.30-
LIN            0.32-
Levenshtein    0.32-

Alternatively, $\mathbf{w}$ can be found in an iterative fashion using the update equation

$$\mathbf{w}(n+1) = \mathbf{w}(n) + \eta \, e(n) \, \mathbf{x}(n) \tag{6-10}$$

where $\eta$ is called the step size and $e(n)$ is the error value given by $d(n) - y(n)$, with $y(n) = \mathbf{w}(n)^T \mathbf{x}(n)$ denoting the output of the current model.
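Both ways of finding the weights can be illustrated with a small, dependency-free Python sketch. It is not the code used in our experiments: the function names, the step size, the number of epochs, and the two-feature training data in the usage note are all made up for illustration, and the linear solver assumes that the correlation matrix of Equation 6-9 is nonsingular.

```python
from typing import List


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a.w = b by Gauss-Jordan elimination with partial pivoting.

    Assumes the matrix a is square and nonsingular.
    """
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def least_squares_weights(xs: List[List[float]], ds: List[float]) -> List[float]:
    """Analytical solution of Equation 6-9 for scalar desired values (L = 1)."""
    p, n = len(xs), len(xs[0])
    # (1/P) * sum of x(n) x(n)^T and (1/P) * sum of x(n) d(n).
    corr = [[sum(x[i] * x[j] for x in xs) / p for j in range(n)] for i in range(n)]
    cross = [sum(x[i] * d for x, d in zip(xs, ds)) / p for i in range(n)]
    return solve(corr, cross)


def lms_weights(
    xs: List[List[float]], ds: List[float], step: float = 0.1, epochs: int = 2000
) -> List[float]:
    """Iterative solution using the update rule of Equation 6-10."""
    w = [0.0] * len(xs[0])
    for _ in range(epochs):
        for x, d in zip(xs, ds):
            error = d - sum(wi * xi for wi, xi in zip(w, x))  # e(n) = d(n) - y(n)
            w = [wi + step * error * xi for wi, xi in zip(w, x)]
    return w
```

As a usage example, with the made-up feature vectors `[[1, 0], [0, 1], [1, 1], [0.5, 0.2]]` and desired scores `[0.3, 0.7, 1.0, 0.29]`, both functions recover weights close to 0.3 and 0.7. The iterative variant needs a sufficiently small step size to converge.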


6.3 Experimental Evaluation

In the following subsections, we explain the results gathered by running the SMART prototype on the test data sets explained in Section 6.1. We use the f-measure metric and Receiver Operating Characteristic (ROC) curves to evaluate the accuracy of our results. We present descriptions of the f-measure metric and ROC curves below.

F-measure has been the most widely used metric for evaluating schema matching approaches [26]. F-Measure is the harmonic mean of Precision (P) and Recall (R). Precision specifies the percentage of correct results among all found results, and Recall specifies the percentage of correct results among all real results. Table 6-5 shows the confusion matrix. Each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class. According to Table 6-5, we can formulate Precision (P) as $TP / (TP + FP)$ and Recall (R) as $TP / (TP + FN)$. We do not use the P or R measures alone because neither P nor R alone can accurately assess the match quality [25]. R can easily be maximized at the expense of a poor P by returning as many correspondences as possible, for example, the cross product of two input schemas. On the other hand, a high P can be achieved at the expense of a poor R by returning only few but correct correspondences. We calculated the F-Measure with the following formula by giving the Recall (R) and Precision (P) metrics equal weights.

$$F\text{-}Measure = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall} \tag{6-11}$$

Table 6-5. Confusion matrix.

                     Predicted Positive      Predicted Negative
Positive Examples    True Positives (TP)     False Negatives (FN)
Negative Examples    False Positives (FP)    True Negatives (TN)

Receiver Operating Characteristic (ROC) analysis originated from signal detection theory. ROC analysis has also been widely used in medical data analysis to study the


effect of varying the threshold on the numerical outcome of a diagnostic test. It has been introduced to machine learning and data mining relatively recently.

ROC analysis shows the performance of a classifier as a tradeoff between detection rate and false alarm rate. To analyze the tradeoff between the two rates, a ROC curve is plotted. A ROC curve is a graphical plot of the true positive rate (a.k.a. hit or detection rate) versus the false positive rate (a.k.a. false alarm rate) as a binary classifier system's threshold parameter is varied. According to Table 6-5, we formulate the true positive rate (TPR) as TP / (TP + FN) and the false positive rate (FPR) as 1 - TN / (TN + FP).

The ROC curve always goes through two points, (0,0) and (1,1). The point (0,0) is where the classifier detects no alarms. In this case it always gets the negative cases right but it gets all positive cases wrong. The second point is (1,1), where everything is classified as positive. So the classifier gets all positive cases right but it gets all negative cases wrong. The best possible prediction method would yield a point in the upper left corner (0,1), meaning that all true positives are found and no false positives are found. The closer the curve follows a line from (0,0) to (0,1) and then from (0,1) to (1,1), the more accurate the classifier.

The area under the ROC curve is a convenient way of comparing classifiers. A random classifier has an area of 0.5, while an ideal one has an area of 1. The larger the area under the ROC curve, the better the performance of the classifier. However, in some cases, the area under the ROC curve may be misleading. At a chosen threshold, the classifier with the larger area may not be the one with the better performance. The best place to operate the classifier (the best threshold) is the point on its ROC curve which lies on a 45 degree line closest to the upper left corner (0,1) of the ROC plot (footnote 7).

We run our experiments with different similarity measures (e.g., Lin, JCN and lexical) and then compare them with the results of the COMA++ (COmbination of

Footnote 7: We assume that the costs of detection and false alarm are equal.


Figure 6-4. Results of SMART with Jiang-Conrath (JCN), Lin and Levenshtein metrics.

MAtching algorithms) [7] schema matcher framework. We chose to compare our results with the results of COMA++ because COMA++ has performed the best in experiments evaluating the existing schema matching approaches [26, 99]. Besides, it provides a downloadable prototype which enables us to create reproducible results. COMA++ also enables combining different schema matching algorithms. We used the AllContext and FilteredContext combined matchers in the COMA++ framework. AllContext and FilteredContext are combinations of five different matchers: name, path, leaves, parents and siblings.

6.3.1 Running Experiments with THALIA Data

To evaluate the SMART approach, we use data sources and cached HTML pages from the THALIA data integration benchmark [48]. THALIA offers 44+ different university course catalogs from computer science departments worldwide. University course catalogs and their schemas can be downloaded from the THALIA website (footnote 8). For the scope of this evaluation, we consider each catalog page (in HTML) to be a sample report from the corresponding university. We selected 10 university catalogs (reports) from THALIA that represent different report design practices and paired their reports, resulting in 45

Footnote 8:


Figure 6-5. Results of COmbination of MAtching algorithms (COMA++) with the AllContext and FilteredContext combined matchers, and comparison of SMART and COMA++ results.

different pairs of reports to match. We tacitly assume that course information is stored in a database and that each report is produced by the Eclipse BIRT tool, which fetches the relevant data from the repository and produces the HTML report. The data presented on the reports are stored in a MySQL 4.1 database.

The prototype of the SMART approach, written in Java, extracts information from report design templates and stores the extracted information in instances of the report ontology. Then it computes similarity scores using the weights described in Section 6.2. The SMART prototype uses three different similarity measures to find similarity scores between texts. These are the JCN and LIN semantic similarity measures and the Levenshtein edit similarity measure.

Figure 6-4 shows the f-measure results of SMART when the JCN and LIN semantic similarity measures and the Levenshtein lexical similarity measure are used. We use semantic similarity measures to compute similarity scores between descriptive texts such as column headers, report headers and footers. Figure 6-4 shows the change in the precision, recall and f-measure metrics as the threshold changes. The reader can notice that the JCN and LIN semantic similarity measures perform better than the Levenshtein lexical similarity measure.
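The precision, recall and f-measure values plotted against the threshold follow directly from the definitions in Section 6.3. The sketch below shows one way such a sweep could be computed; the function names, the element pairs and their scores are hypothetical and are not taken from the THALIA data.

```python
def matches_above(scores, threshold):
    """Element pairs whose similarity score reaches the threshold."""
    return {pair for pair, score in scores.items() if score >= threshold}


def precision_recall_f(found, real):
    """Precision, recall and f-measure of found mappings against real ones."""
    found, real = set(found), set(real)
    true_positives = len(found & real)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(real) if real else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical similarity scores between schema element pairs.
scores = {
    ("instructor", "lecturer"): 0.9,
    ("title", "course"): 0.6,
    ("credits", "units"): 0.4,
    ("room", "time"): 0.2,
}
# Hypothetical manually determined mappings.
real = {("instructor", "lecturer"), ("title", "course"), ("credits", "units")}

low = precision_recall_f(matches_above(scores, 0.1), real)
high = precision_recall_f(matches_above(scores, 0.5), real)
```

In this toy case the low threshold returns every pair (perfect recall, lower precision) while the higher threshold returns fewer but only correct pairs, which mirrors the tradeoff described above.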


Figure 6-6. Receiver Operating Characteristics (ROC) curves of SMART and COMA++ for THALIA test data.

This was expected, because semantic similarity measures help us identify similarities between words that are semantically close but lexically far.

In Figure 6-5, we show COMA++ results for the THALIA test data with the AllContext and FilteredContext combined matchers. Similar to the other figures, the results start with low precision but high recall values for lower thresholds. Precision increases and recall decreases as the threshold increases. On the right hand side of Figure 6-5 we present the comparison between SMART and COMA++ results. The reader can notice that SMART performs better at all thresholds with the JCN semantic similarity measure. The second best result in the figure is also achieved by SMART, when the Levenshtein (EDIT) lexical similarity measure is used. Even the results with the lexical similarity measure are better than the COMA++ results. That is because SMART finds similarity scores between schema elements using the descriptive texts extracted from reports. These texts tend to be lexically closer than the schema element names that the COMA++ approach uses to find similarity scores between schema elements.

In Figure 6-6, we show the ROC curves of the SMART and COMA++ approaches for the THALIA test data. As stated before, the closer the curve follows the left-hand border and then the top border of the ROC space, the more accurate the approach. When the reader analyzes the ROC curves, the reader can notice that the results of SMART


are much more accurate than the results of COMA++. The best threshold to run the matchers can be found with the help of the ROC curves. The threshold that generates the point closest to the upper left corner (0,1) of the ROC plot and lying on a 45 degree line is the best threshold (footnote 9). For example, the point closest to the upper left corner (0,1) on the ROC curve for SMART with the LIN similarity measure is (0.05, 0.8). SMART produces a 0.05 false alarm rate and a 0.8 detection rate when operated with a threshold of 0.3 (footnote 10). The point closest to the upper left corner (0,1) on the ROC curve for COMA++ with the AllContext matcher is (0.25, 0.55). COMA++ produces a 0.25 false alarm rate and a 0.55 detection rate when operated with a threshold of 0.4. This shows that SMART and COMA++ achieve their best performance at different thresholds. Since the ROC curve of SMART is always closer to the upper left corner (0,1), SMART performs better than COMA++ at any threshold. This can also be seen in Figure 6-5, where the f-measure results of SMART and COMA++ are presented for different thresholds.

6.3.2 Running Experiments with UF Data

In this section, we present our experimental results with our second test data set. The second data set has three schemas and each schema has 10 reports. We obtained the three schemas from the College of Engineering, the Business School and the Bridges Office. We described the details of the second data set in Section 6.1.2.

The difference between the first and the second test data set is that the second data set has more reports per schema and that the reports of the second test data set do not cover the entire schema. The schemas from the College of Engineering (COE), the Business School (BS), and the Bridges Office (BO) have 135, 175 and 114 schema

Footnote 9: The assumption here is that the costs of a false alarm and a detection are equal.
Footnote 10: The thresholds for different detection/false alarm rate combinations are not shown in the figure.


Figure 6-7. Results of SMART with different report pair similarity thresholds for UF test data.

elements (i.e., attributes) respectively. We manually determined that the (COE-BO) pair has 88, the (COE-BS) pair has 91 and the (BO-BS) pair has 110 mappings. However, the reports cover 90% of these mappings. This means that SMART can have a recall of at most 0.9, even if it determines all the mappings covered by the reports. Our experiments show that even with this disadvantage, SMART performs better than COMA++.

Since we have more than one report per schema, we need to merge the results from the report pair combinations into a final similarity matrix. In Section 3.2.7, we described how we merge the scores into a final similarity matrix when we have more than one score for a schema element pair. In short, we compute the weighted average of the similarity scores between schema element pairs. We use the similarity scores between report pairs as the weights for this computation. We described how we compute similarity scores between report pairs in Section 3.2.4. The similarity scores between report pairs are in the range [0,1]. To eliminate unrelated report pairs, we set a threshold for report pair similarity scores and consider only schema element similarity scores that come from report pairs whose similarity score is higher than this threshold.

We show the accuracy results of SMART with the UF data set according to the f-measure metric in Figure 6-7. On the left hand side of Figure 6-7, we show the results of SMART for the Business-Bridges schema pair when the report pair similarity


threshold is set to 0.6, 0.7 and 0.8. The report pairs cover 90% of the actual mappings, therefore the recall value cannot exceed 0.9. There are 19, 13 and 10 report pairs that have a similarity score higher than 0.6, 0.7 and 0.8 respectively. The reader can notice that the accuracy of the results is better when the report similarity threshold is set to 0.7 or 0.8. That is because the report pairs with a similarity score higher than 0.7 are more similar to each other, which leads to more accurate results.

In the middle of Figure 6-7, we show the results of SMART for the Business-College of Engineering schema pair when the report pair similarity threshold is set to 0.6, 0.7 and 0.8. There are 16, 12 and 8 report pairs that have a similarity score higher than 0.6, 0.7 and 0.8 respectively. The reader can notice that the accuracy of the results is slightly better when the report similarity threshold is set to 0.6 or 0.7. That is because when the report pair similarity threshold is set to 0.8, we eliminate some very similar report pairs (footnote 11).

On the right hand side of Figure 6-7, we show the results of SMART for the College of Engineering-Bridges schema pair when the report pair similarity threshold is set to 0.6, 0.7 and 0.8. There are 21, 13 and 8 report pairs that have a similarity score higher than 0.6, 0.7 and 0.8 respectively. The reader can notice that the accuracy of the results is slightly better when the report similarity threshold is set to 0.7. That is because when the report pair similarity threshold is set to 0.6, we include some unrelated report pairs in the computation, which affects the accuracy of the results negatively. Also, when the report pair similarity threshold is set to 0.8, we eliminate some very similar report pairs. Therefore, the results are better when the threshold is set to 0.7.

The changes in the accuracy of the results with different report pair similarity score threshold settings show that correctly determining the report similarity score threshold is important for the SMART approach. Figure 6-7 suggests that we need to select the report pair similarity score threshold carefully. The chosen report

Footnote 11: The schemas have 10 very similar report pairs.


Figure 6-8. F-measure results of SMART and COMA++ for UF test data when the report pair similarity threshold is set to 0.7.

similarity threshold should not eliminate the similar reports but should eliminate the unrelated reports.

In Figure 6-8, we compare the performance of SMART with COMA++ based on the f-measure metric. The SMART results were prepared with the JCN semantic similarity measure when the report threshold was set to 0.7. The COMA++ results were prepared with the AllContext and FilteredContext combined matchers. The reader can notice that SMART produces higher f-measure accuracy results than COMA++. However, SMART and COMA++ achieve their best results at different thresholds. The results of COMA++ are very close to the results of SMART for the Business School-Bridges Project schema pair. That is because this schema pair has very similar naming conventions, as described in Section 6.1.2. When similar naming conventions are used, lexical similarity measures perform better. COMA++ matchers are based on lexical similarity measures, hence COMA++ performs better for the Business School-Bridges Project schema pair compared to its performance for the other schema pairs. On the other hand, SMART does not only use lexical similarity measures. It combines lexical and semantic similarity measures and does not depend on the lexical closeness of schema element names. It utilizes more descriptive texts extracted from reports. Therefore, the results of SMART are not affected by changes in the descriptiveness or lexical closeness


Figure 6-9. Receiver Operating Characteristics (ROC) curves of SMART for UF test data.

of the schema element names. The reader can also notice that SMART and COMA++ achieve their best results at different schema element similarity score thresholds. SMART achieves its best results around a threshold of 0.25 and COMA++ around a threshold of 0.5 for the UF test data sets.

On the left hand side of Figure 6-9, we present the ROC curves of SMART for the UF Business-Bridges schema pair. Each schema has 10 reports, which makes 100 report pairs. Ten of these report pairs are very similar. Each report pair has a similarity score in the range [0,1]. There are 31, 19, 13, 10 and 7 report pairs that have a report similarity score higher than the 0.5, 0.6, 0.7, 0.8 and 0.9 thresholds respectively. When the report pair similarity score threshold is set to 0.9, some of the very similar report pairs are eliminated. Therefore the performance of SMART is low when the report similarity score threshold is set to 0.9.

On the right hand side of Figure 6-9, we present the ROC curves of SMART for the UF Business-Engineering schema pair. Again, ten of the possible 100 report pairs are very similar. There are 28, 16, 12, 8 and 4 report pairs that have a report similarity score higher than the 0.5, 0.6, 0.7, 0.8 and 0.9 thresholds respectively. When the report pair similarity threshold is set to 0.8 and 0.9, some of the very similar report pairs are eliminated. Therefore the performance of SMART decreases for


Figure 6-10. Comparison of the ROC curves of SMART and COMA++ for UF test data.

these thresholds. As we lower the report pair similarity threshold, the performance of SMART slightly increases. However, as we lower the report similarity score threshold, more report pairs pass the threshold, which requires extra processing time. After some point, the performance gained by decreasing the report pair similarity threshold, and hence increasing the number of reports and the amount of computation, is negligible. Therefore, we do not consider the report pairs below the 0.5 similarity score in Figure 6-9.

In Figure 6-10, we compare the performance of SMART with COMA++ based on the ROC curves. For the Business-Bridges schema pair, COMA++ performs better than SMART. As stated before, the naming conventions of the schemas are very close, which helps COMA++ to perform better. Moreover, SMART has the disadvantage that not all the mappings are covered by the available reports. For the Business-Engineering schema pair, SMART performs very close to COMA++ even though not all the mappings are covered by reports.
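The merging step used in this section, a weighted average of schema element scores with report pair similarities as weights and a report pair threshold as filter, can be sketched as follows. This is an illustrative reading of the description in Section 6.3.2 only; the full procedure is defined in Section 3.2.7, and the function name, data layout and numbers here are assumptions.

```python
def merge_scores(report_pair_results, report_threshold):
    """Weighted average of element-pair scores over the retained report pairs.

    report_pair_results: list of (report_pair_similarity, {element_pair: score}).
    Only report pairs whose similarity is higher than report_threshold count.
    """
    weighted_sum = {}
    weight_total = {}
    for report_similarity, element_scores in report_pair_results:
        if report_similarity <= report_threshold:
            continue  # unrelated report pair, eliminated by the threshold
        for pair, score in element_scores.items():
            weighted_sum[pair] = weighted_sum.get(pair, 0.0) + report_similarity * score
            weight_total[pair] = weight_total.get(pair, 0.0) + report_similarity
    return {pair: weighted_sum[pair] / weight_total[pair] for pair in weighted_sum}


# Hypothetical results from three report pairs of one schema pair.
results = [
    (0.9, {("a", "x"): 0.8, ("b", "y"): 0.2}),
    (0.6, {("a", "x"): 0.5}),
    (0.3, {("a", "x"): 0.0, ("c", "z"): 1.0}),  # dropped at threshold 0.5
]
merged = merge_scores(results, report_threshold=0.5)
```

Lowering `report_threshold` lets more report pairs contribute, which is the extra processing cost mentioned above; raising it too far discards report pairs that carry useful evidence.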

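The ROC-based choice of an operating threshold discussed in Section 6.3 can also be illustrated with a short sketch. It computes one ROC point per threshold from the confusion matrix counts and picks the threshold whose point is nearest to the upper left corner (0,1). Using the Euclidean distance for "closest" is an assumption of this sketch, as are the function names and the toy scores; it corresponds to treating false alarms and missed detections as equally costly.

```python
import math


def roc_point(scores, actual, threshold):
    """Return (false positive rate, true positive rate) at one threshold."""
    tp = fp = fn = tn = 0
    for pair, score in scores.items():
        predicted = score >= threshold
        is_match = pair in actual
        if predicted and is_match:
            tp += 1
        elif predicted:
            fp += 1
        elif is_match:
            fn += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if tp + fn else 0.0
    # FP / (FP + TN) is the same quantity as 1 - TN / (TN + FP).
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return fpr, tpr


def best_threshold(scores, actual, thresholds):
    """Threshold whose ROC point is closest to the ideal corner (0, 1)."""
    def distance_to_corner(threshold):
        fpr, tpr = roc_point(scores, actual, threshold)
        return math.hypot(fpr, 1.0 - tpr)
    return min(thresholds, key=distance_to_corner)


# Hypothetical candidate pairs with similarity scores; three are real matches.
scores = {"p1": 0.9, "p2": 0.7, "p3": 0.6, "p4": 0.4, "p5": 0.3, "p6": 0.1}
actual = {"p1", "p2", "p4"}
chosen = best_threshold(scores, actual, [0.2, 0.5, 0.8])
```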

CHAPTER 7
CONCLUSION

Schema matching is a fundamental problem that occurs when information systems share, exchange or integrate data for the purpose of data warehousing, query processing, message translation, etc. Despite extensive efforts, solutions for schema matching are still mostly manual and depend on significant human input, which makes schema matching a time consuming and error-prone task. Schema elements are typically matched based on schema and data. However, the clues gathered by processing the schema and data are often unreliable, incomplete and not sufficient to determine the relationships among schema elements [28]. Moreover, the mapping depends on the application and may change from one application to another even though the underlying schemas remain the same. Several automatic approaches exist but their accuracy depends heavily on the descriptiveness of the schemas alone.

We have developed a new approach called Schema Matching by Analyzing ReporTs (SMART) which extracts important semantic information about the schemas and their relationships from report generating application source code and report design templates. Specifically, in SMART we reverse engineer the application source code and report templates associated with the schemas that are to be matched. From the source code and report templates which use the schema and produce reports or other user-friendly output, we extract semantically rich descriptive texts. We identify relationships of these descriptive texts with the data presented on the report with the help of a set of heuristics. We trace the data on the report back to the corresponding schema elements in the data source. We store all the information gathered from a report, including the descriptive texts (e.g., column headers and report title) and properties of the data presented (e.g., schema element name and type of data), in an instance of the report ontology. We compute similarity scores between instances of the report ontology. We then convert inter-ontology matching scores into scores between schema elements.


Our experimental results show that SMART provides more reliable and accurate results than current approaches that rely on the information contained in the schemas and data instances alone. For example, the highest accuracy (based on the f-measure metric) of SMART for our first test data set, in which reports cover all schema elements, is 0.73, while the highest accuracy of COMA++ (the best schema matching approach according to the evaluations [26]) is 0.5. The results of SMART are also better than or very close to the COMA++ results for our second test data set, in which reports cover 90% of the mappings. The highest accuracies (based on the f-measure metric) of SMART for our second test data set are 0.55, 0.68 and 0.57, while the highest accuracies of COMA++ are 0.5, 0.5 and 0.4 for the different schema pairs respectively. We also analyzed our results with receiver operating characteristics (ROC) curves. We saw that SMART's performance is better at every threshold for our first test data set and that SMART's performance is very close to COMA++'s performance for the second data set.

Our approach shows that valuable semantic information can be extracted from report generating application source code. Reverse engineering source code to extract semantic information is a very challenging task. To ease the process of semantic extraction, we introduced a methodology and framework which utilizes state-of-the-art tools and design patterns. Besides, our approach shows that report templates (represented in XML) are also a valuable source of semantics and that the semantic information can be easily extracted from report templates. Moreover, we show how the information extracted from database schemas and report application source code can be stored in ontologies. We also explained in detail how we apply the multi-linear regression method to determine the weights of different pieces of information in order to reach the most accurate results.

We believe our approach represents an important step towards more accurate and reliable tools for schema matching. More and more solutions for automatic schema matching help us save effort, time and investment. The decreased cost of schema matching, and hence of data integration, enables more and more organizations to


collaborate. The synergy gained from effective, rapid, and flexible collaborations among organizations boosts the economy and thus enhances the quality of our daily life.

7.1 Contributions

Researchers advocate that the gain from capturing even a limited amount of useful semantics can be tremendous [87], and they encourage utilizing any kind of information source to improve our understanding of data. Researchers also point out that application source code encapsulates important semantic information about its application domain and can be used for the purpose of schema matching for data integration [78]. External information sources such as corpora of schemas and past matches have been used for schema matching, but application source code has not yet been used as an external information source [25, 28, 78]. In this research, we focus on this well-known but not yet addressed challenge of analyzing application source code for the purpose of semantic extraction for schema matching. We present a novel approach for schema matching that utilizes semantically rich texts extracted from application source code. We show that the approach we present in this dissertation provides better accuracy for the purpose of automatic schema matching.

During semantic analysis of application source code, we create an instance of the report ontology from each report generated by the application source code and use this ontology instance for the purpose of schema matching. While (semi)automatic extraction of ontologies (a.k.a. ontology learning) from text, relational schemata and knowledge bases is well studied in the literature [23, 37], to the best of our knowledge there has been no study aimed at extracting an ontology from application source code.

Another important contribution is the introduction of the generic function for computing similarity scores between concepts of ontologies. We also described how we determine the weights of the similarity function. The similarity function, along with the methodology to determine the weights of the function, can be applied to any domain to determine similarities between different concepts of ontologies.
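The generic similarity function builds on a text similarity measure, and one of the three measures used in the experiments is Levenshtein edit similarity. The sketch below gives a standard dynamic-programming edit distance and turns it into a score in [0,1]. Normalizing by the length of the longer string is an assumption of this sketch and may differ from the normalization used in the prototype; the word pairs are the ones this dissertation uses to contrast lexical and semantic closeness.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,                  # deletion
                               current[j - 1] + 1,               # insertion
                               previous[j - 1] + substitution_cost))
        previous = current
    return previous[-1]


def edit_similarity(a, b):
    """Edit distance mapped to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


lexically_close = edit_similarity("tower", "power")
semantically_close = edit_similarity("lecturer", "instructor")
```

A purely lexical measure scores 'tower' and 'power' as much more alike than 'lecturer' and 'instructor', which is the kind of case where a semantic measure is needed.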


Schema matching approaches so far have used lexical similarity functions or look-up tables to determine the similarity scores of two schema elements. There have been suggestions to utilize semantic similarity measures between words [7], but they have not been realized. Names of schema elements are mostly abbreviations and concatenations of words. These names cannot be found in the dictionaries that semantic similarity measures use to compute similarity scores between two words. Therefore, utilizing semantic similarity measures between words was not possible. We extract descriptive texts from reports and relate them to the schema elements. Therefore, we can utilize state-of-the-art semantic similarity measures to determine similarities. By using a semantic similarity measure instead of a lexical similarity measure such as edit distance, we can detect the similarities of words that are lexically far but semantically close, such as 'lecturer' and 'instructor', and we can also eliminate words that are lexically close but semantically far, such as 'tower' and 'power'.

One important contribution is that integration based on user reports eases the communication between business and Information Technology (IT) specialists. Business and IT specialists often have difficulty understanding each other. With our approach, they can discuss the data presented on reports rather than database schemas. Business specialists can request that the data seen on specific reports be integrated or shared. Analyzing reports for data integration and sharing helps business and IT specialists communicate better.

While conducting the research, we saw that there is a need for available test data of sufficient richness and volume to allow meaningful and fair evaluations between different information integration approaches. To address this need, we developed the THALIA (footnote 1) (Test Harness for the Assessment of Legacy information Integration Approaches) benchmark, which provides researchers with a collection of over 40 downloadable data

Footnote 1: THALIA website:


sources representing university course catalogs, a set of twelve benchmark queries, as well as a scoring function for ranking the performance of an integration system [47, 48]. We are happy to see that it is being used as a source of test data and as a benchmark by researchers [11, 74, 100] and in graduate courses (footnote 2).

In the semantic analysis part of our work, we introduce a new extensible and flexible methodology for semantic extraction from application source code. We integrate and utilize state-of-the-art techniques in object oriented programming and parser generation, and build on the research in code reverse engineering and program understanding. One of the main contributions of our semantic analysis methodology is its functional extensibility. Our information extraction framework lets researchers add new functionality as they develop new heuristics and algorithms on the source code being analyzed. Our current information extraction technique provides improved accuracy as it eliminates unused code fragments (i.e., methods, procedures).

7.2 Future Directions

This research can be continued in the following directions:

Extending the semantic analyzer (SA). SA can be extended to extract information from web query interfaces. Web query interfaces have potentially valuable semantic information for matching schema elements. The information gathered from query interfaces can facilitate better results for schema matching. SA can also be extended to extract other possibly important information (e.g., format and location) about data on a report. New heuristics can also be added to relate data and descriptive texts on a report.

Enhancing the report ontology. Our report ontology was represented in OWL (Web Ontology Language). We can benefit from the capabilities of OWL to relate data and description elements in the ontology. In OWL, a set of OWL statements can allow us

Footnote 2: The graduate course at the University of Toronto using THALIA is 'Research Topics in Data Management':


to conclude another OWL statement. For example, the statements (motherOf subProperty parentOf) and (Nedret motherOf Oguzhan), when stated in OWL, allow us to conclude (Nedret parentOf Oguzhan) based on the logical definition of subProperty as given in the OWL specification. Similarly, we can define an isDescriptionOf relation between the data element concept and the description element concept, so that OWL can conclude the isDescriptionOf relation by looking at the location information of both data element and description element concepts on a report. Another advantage of using OWL ontologies is the availability of tools such as Racer, FaCT and Pellet that can reason about them. A reasoner can also help us understand whether we can accurately extract data and description elements from the report. For instance, we can define a rule such as "No data or description elements can overlap" and check the OWL ontology with a reasoner to make sure that this rule is satisfied.

Extending the schema matcher (SMART). We evaluate the similarity scores produced by SMART to determine 1-to-1 mappings. We can work on the results of SMART to figure out how to interpret them to determine 1-n and m-n mappings as well.

Continuing research on similarity. Assessing similarity scores between objects is an important research topic. We introduced a generic similarity function to determine similarities between concepts of ontologies. We also explained how we determine the weights of this generic similarity function. We applied this similarity function to report ontology instances. Real world objects can be modeled using ontologies and our similarity function can be used to find similarities between them. For example, our similarity function is appropriate for finding semantic similarity scores between two web pages and between two sentences. To determine the similarity scores between two sentences, current approaches do not consider the place of a word in a sentence or the relations between the words in a sentence. We can model an ontology specifying the relations of the words in a sentence and use our similarity function to assess similarity scores between sentences.
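The subProperty example given under "Enhancing the report ontology" can be mimicked with a few lines of forward chaining over triples. This is only a toy illustration of that single inference rule; it is not an OWL reasoner such as the tools named above, and the triple representation, the predicate spelling `subPropertyOf` and the function name are assumptions of the sketch.

```python
def infer_sub_properties(triples):
    """Apply the rule (p subPropertyOf q) and (s p o) => (s q o) to a fixpoint."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        hierarchy = [(s, o) for s, p, o in inferred if p == "subPropertyOf"]
        for sub_property, super_property in hierarchy:
            for s, p, o in list(inferred):
                new_triple = (s, super_property, o)
                if p == sub_property and new_triple not in inferred:
                    inferred.add(new_triple)
                    changed = True
    return inferred


# The statements from the text, written as (subject, predicate, object).
stated = {
    ("motherOf", "subPropertyOf", "parentOf"),
    ("Nedret", "motherOf", "Oguzhan"),
}
closure = infer_sub_properties(stated)
```

An isDescriptionOf relation derived from location information would need richer rules than this one; the sketch only shows how stated facts plus a property hierarchy yield a new fact.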


REFERENCES

[1] A. Aamodt, M. Nygard, Different roles and mutual dependencies of data, information, and knowledge: An AI perspective on their integration, Data Knowl. Eng. 16 (3) (1995) 191-222.
[2] P. M. Alexander, Towards reconstructing meaning when text is communicated electronically, Ph.D. thesis, University of Pretoria, South Africa (2002).
[3] M. P. Allen, Understanding Regression Analysis, Plenum Press, New York, 1997.
[4] G. Antoniou, F. van Harmelen, Web ontology language: OWL, in: S. Staab, R. Studer (eds.), Handbook on Ontologies, International Handbooks on Information Systems, Springer, 2004, pp. 67-92.
[5] N. Ashish, C. A. Knoblock, Semi-automatic wrapper generation for internet information sources, in: COOPIS '97: Proceedings of the Second IFCIS International Conference on Cooperative Information Systems, IEEE Computer Society, Washington, DC, USA, 1997.
[6] J. A. Aslam, M. Frost, An information-theoretic measure for document similarity, in: SIGIR '03: Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, ACM Press, New York, NY, USA, 2003.
[7] D. Aumueller, H.-H. Do, S. Massmann, E. Rahm, Schema and ontology matching with COMA++, in: Proceedings of SIGMOD 2005 (Software Demonstration), Baltimore, 2005.
[8] T.-L. Bach, R. Dieng-Kuntz, Measuring similarity of elements in OWL DL ontologies, in: Context and Ontologies: Theory, Practice and Applications, Pittsburgh, Pennsylvania, USA, 2005.
[9] S. Banerjee, T. Pedersen, An adapted Lesk algorithm for word sense disambiguation using WordNet, in: Proceedings of the Third International Conference on Intelligent Text Processing and Computational Linguistics, Mexico City, 2002.
[10] J. Berlin, A. Motro, Database schema matching using machine learning with feature selection, in: CAiSE '02: Proceedings of the 14th International Conference on Advanced Information Systems Engineering, Springer-Verlag, London, UK, 2002.
[11] A. Bilke, J. Bleiholder, F. Naumann, C. Bohm, K. Draba, M. Weis, Automatic data fusion with HumMer, in: VLDB '05: Proceedings of the 31st international conference on Very large data bases, VLDB Endowment, 2005.
[12] J. Bisbal, D. Lawless, B. Wu, J. Grimson, Legacy information systems: Issues and directions, IEEE Softw. 16 (5) (1999) 103-111.


[13] M. Bravenboer, E. Visser, Guiding visitors: Separating navigation from computation, Tech. Rep. UU-CS-2001-42, Institute of Information and Computing Sciences, Utrecht University, P.O. Box 80.089, 3508 TB Utrecht, The Netherlands (November 2001).
[14] M. L. Brodie, The promise of distributed computing and the challenges of legacy information systems, in: Proceedings of the IFIP WG 2.6 Database Semantics Conference on Interoperable Database Systems (DS-5), North-Holland, 1993.
[15] A. Budanitsky, G. Hirst, Semantic distance in WordNet: An experimental, application-oriented evaluation of five measures, in: NAACL 2001 WordNet and Other Lexical Resources Workshop, Pittsburgh, 2001.
[16] A. Budanitsky, G. Hirst, Evaluating WordNet-based measures of semantic distance, Computational Linguistics 32 (1) (2006) 13-47.
[17] D. Buttler, L. Liu, C. Pu, A fully automated object extraction system for the world wide web, in: ICDCS, 2001.
[18] P. Checkland, S. Holwell, Information, Systems and Information Systems - making sense of the field, John Wiley and Sons, Inc., Hoboken, NJ, USA, 1998.
[19] E. J. Chikofsky, J. H. Cross II, Reverse engineering and design recovery: A taxonomy, IEEE Softw. 7 (1) (1990) 13-17.
[20] W. W. Cohen, P. Ravikumar, S. E. Fienberg, A comparison of string distance metrics for name-matching tasks, in: S. Kambhampati, C. A. Knoblock (eds.), IIWeb, 2003.
[21] C. Corley, R. Mihalcea, Measuring the semantic similarity of texts, in: Proceedings of the ACL Workshop on Empirical Modeling of Semantic Equivalence and Entailment, Ann Arbor, Michigan, 2005.
[22] K. H. Davis, P. H. Aiken, Data reverse engineering: A historical survey, in: Working Conference on Reverse Engineering (WCRE), 2000.
[23] Y. Ding, S. Foo, Ontology research and development. Part I - a review of ontology generation, Journal of Information Science 28 (2) (2002) 123-136.
[24] H.-H. Do, E. Rahm, COMA - a system for flexible combination of schema matching approaches, in: Proc. 28th Intl. Conference on Very Large Databases (VLDB), Hongkong, Aug. 2002.
[25] H.-H. Do, Schema matching and mapping-based data integration, Dissertation, Department of Computer Science, Universität Leipzig, Germany (January 2006).


[26] H. H. Do, S. Melnik, E. Rahm, Comparison of schema matching evaluations, in: Revised Papers from the NODe 2002 Web and Database-Related Workshops on Web, Web-Services, and Database Systems, Springer-Verlag, London, UK, 2003.
[27] A. Doan, P. Domingos, A. Y. Levy, Learning source description for data integration, in: WebDB (Informal Proceedings), 2000.
[28] A. Doan, A. Halevy, Semantic integration research in the database community: A brief survey, AI Magazine, Special Issue on Semantic Integration 26 (1) (2005) 83-94.
[29] A. Doan, N. F. Noy, A. Y. Halevy, Introduction to the special issue on semantic integration, SIGMOD Record 33 (4) (2004) 11-13.
[30] P. Drew, R. King, D. McLeod, M. Rusinkiewicz, A. Silberschatz, Report of the workshop on semantic heterogeneity and interpolation in multidatabase systems, SIGMOD Rec. 22 (3) (1993) 47-56.
[31] M. Ehrig, P. Haase, N. Stojanovic, M. Hefke, Similarity for ontologies - a comprehensive framework, in: 13th European Conference on Information Systems, Regensburg, 2005, ISBN: 3937195092.
[32] D. W. Embley, Y. S. Jiang, Y.-K. Ng, Record-boundary discovery in web documents, in: A. Delis, C. Faloutsos, S. Ghandeharizadeh (eds.), SIGMOD Conference, ACM Press, 1999.
[33] D. W. Embley, D. P. Lopresti, G. Nagy, Notes on contemporary table recognition, in: H. Bunke, A. L. Spitz (eds.), Document Analysis Systems, vol. 3872 of Lecture Notes in Computer Science, Springer, 2006.
[34] J. Euzenat, P. Valtchev, An integrative proximity measure for ontology alignment, in: ISWC-2003 workshop on semantic information integration, Sanibel Island (FL US), 2003.
[35] J. Euzenat, P. Valtchev, Similarity-based ontology alignment in OWL-Lite, in: 15th European Conference on Artificial Intelligence (ECAI), Valencia, 2004.
[36] W. J. Frawley, G. Piatetsky-Shapiro, C. J. Matheus, Knowledge discovery in databases: An overview, AI Magazine 13 (3) (1992) 57-70.
[37] A. Gal, G. A. Modica, H. M. Jamil, OntoBuilder: Fully automatic extraction and consolidation of ontologies from web sources, in: International Conference on Data Engineering (ICDE), IEEE Computer Society, 2004.
[38] E. Gamma, R. Helm, R. E. Johnson, J. M. Vlissides, Design patterns: Abstraction and reuse of object-oriented design, in: ECOOP '93: Proceedings of the 7th European Conference on Object-Oriented Programming, Springer-Verlag, London, UK, 1993.


[39] R. L. Goldstone, Similarity, in: R. Wilson, F. C. Keil (eds.), MIT encyclopedia of the cognitive sciences, MIT Press, Cambridge, MA, 1999, pp. 763-765.
[40] T. R. Gruber, A translation approach to portable ontology specifications, Knowl. Acquis. 5 (2) (1993) 199-220.
[41] J.-L. Hainaut, M. Chandelon, C. Tonneau, M. Joris, Contribution to a theory of database reverse engineering, in: WCRE, 1993.
[42] J.-L. Hainaut, J. Henrard, A general meta-model for data-centered application reengineering, in: Dagstuhl workshop on Interoperability of Reengineering Tools, 2001.
[43] A. Y. Halevy, J. Madhavan, P. A. Bernstein, Discovering structure in a corpus of schemas, IEEE Data Eng. Bull. 26 (3) (2003) 26-33.
[44] J. Hammer, Resolving semantic heterogeneity in a federation of autonomous, heterogeneous database systems, Ph.D. thesis, University of Southern California (August 1994).
[45] J. Hammer, W. O'Brien, M. S. Schmalz, Scalable knowledge extraction from legacy sources with SEEK, in: H. Chen, R. Miranda, D. D. Zeng, C. C. Demchak, J. Schroeder, T. Madhusudan (eds.), Intelligence and Security Informatics (ISI), vol. 2665 of Lecture Notes in Computer Science, Springer, 2003.
[46] J. Hammer, M. Schmalz, W. O'Brien, S. Shekar, N. Haldavnekar, SEEKing knowledge in legacy information systems to support interoperability, in: ECAI-02 Workshop on Ontologies and Semantic Interoperability, 2002.
[47] J. Hammer, M. Stonebraker, O. Topsakal, THALIA: Test harness for the assessment of legacy information integration approaches, in: International Conference on Data Engineering (ICDE), IEEE Computer Society, 2005.
[48] J. Hammer, M. Stonebraker, O. Topsakal, THALIA: Test harness for the assessment of legacy information integration approaches, Tech. Rep. tr05-001, University of Florida, Computer Science and Information and Engineering (2005).
[49] J. Henrard, Program understanding in database reverse engineering, Ph.D. thesis, University of Notre-Dame (2003).
[50] J. Henrard, V. Englebert, J.-M. Hick, D. Roland, J.-L. Hainaut, Program understanding in databases reverse engineering, in: G. Quirchmayr, E. Schweighofer, T. J. M. Bench-Capon (eds.), DEXA, vol. 1460 of Lecture Notes in Computer Science, Springer, 1998.
[51] G. Hirst, D. S. Onge, Lexical chains as representations of context for the detection and correction of malapropisms, in: C. Fellbaum (ed.), WordNet: An electronic lexical database, MIT Press, 1998.


[52] J. J. Jiang, D. W. Conrath, Semantic similarity based on corpus statistics and lexical taxonomy, in: Proceedings of ROCLING X, Taiwan, 1997.
[53] S. C. Johnson, YACC: Yet another compiler compiler, Tech. Rep. CSTR 32, AT&T Bell Laboratories (1978).
[54] Y. Kalfoglou, M. Schorlemmer, Ontology mapping: The state of the art, The Knowledge Engineering Review Journal 18 (1) (2003) 1–31.
[55] T. K. Landauer, P. W. Foltz, D. Laham, Introduction to latent semantic analysis, Discourse Processes 25 (1998) 259–284.
[56] C. Leacock, M. Chodorow, Combining local context and WordNet similarity for word sense identification, in: C. Fellbaum (ed.), WordNet: An electronic lexical database, MIT Press, 1998.
[57] D. B. Lenat, Cyc: a large-scale investment in knowledge infrastructure, Communications of the ACM 38 (11) (1995) 33–38.
[58] M. Lesk, Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone, in: SIGDOC '86, 1986.
[59] M. E. Lesk, Lex - a lexical analyzer generator, Tech. Rep. CSTR 39, AT&T Bell Laboratories, New Jersey (1975).
[60] D. Lin, An information-theoretic definition of similarity, in: ICML '98: Proceedings of the Fifteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1998.
[61] J. Madhavan, P. A. Bernstein, A. Doan, A. Halevy, Corpus-based schema matching, in: ICDE '05: Proceedings of the 21st International Conference on Data Engineering (ICDE '05), IEEE Computer Society, Washington, DC, USA, 2005.
[62] J. Madhavan, P. A. Bernstein, E. Rahm, Generic schema matching with Cupid, in: VLDB '01: Proceedings of the 27th International Conference on Very Large Data Bases, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2001.
[63] A. Maedche, S. Staab, Measuring similarity between ontologies, in: EKAW '02: Proceedings of the 13th International Conference on Knowledge Engineering and Knowledge Management. Ontologies and the Semantic Web, Springer-Verlag, London, UK, 2002.
[64] D. L. McGuinness, F. van Harmelen, OWL Web Ontology Language overview, W3C Recommendation (February 2004).
[65] G. Miller, W. Charles, Contextual correlates of semantic similarity, Language and Cognitive Processes 6 (1) (1991) 1–28.


[66] G. A. Miller, WordNet: a lexical database for English, Commun. ACM 38 (11) (1995) 39–41.
[67] J. Q. Ning, A. Engberts, W. V. Kozaczynski, Automated support for legacy code understanding, Commun. ACM 37 (5) (1994) 50–57.
[68] N. F. Noy, Semantic integration: A survey of ontology-based approaches, SIGMOD Record 33 (4) (2004) 65–70.
[69] N. F. Noy, D. L. McGuinness, Ontology development 101: A guide to creating your first ontology, Technical Report KSL-01-05, Stanford Knowledge Systems Laboratory (2003).
[70] V. Oleshchuk, A. Pedersen, Ontology based semantic similarity comparison of documents, in: 14th International Workshop on Database and Expert Systems Applications (DEXA 03), IEEE, 2003.
[71] J. Palsberg, C. B. Jay, The essence of the visitor pattern, in: Computer Software and Applications Conference (COMPSAC), IEEE Computer Society, 1998.
[72] S. Patwardhan, Incorporating dictionary and corpus information into a context vector measure of semantic relatedness, Master's thesis, University of Minnesota (2003).
[73] T. Pedersen, S. Patwardhan, J. Michelizzi, WordNet::Similarity - measuring the relatedness of concepts, in: Proceedings of the Nineteenth National Conference on Artificial Intelligence (AAAI-04), San Jose, CA, 2004.
[74] D. H. Peter Bailey, A. Krumpholz, Toward meaningful test collections for information integration benchmarking, in: IIWeb 2006, Workshop on Information Integration on the Web in conjunction with WWW 2006, Edinburgh, Scotland, 2006.
[75] M. Postema, H. W. Schmidt, Reverse engineering and abstraction of legacy systems, Informatica: International Journal of Computing and Informatics 22 (3).
[76] A. Quilici, Reverse engineering of legacy systems: A path toward success, in: ICSE, 1995.
[77] R. Rada, H. Mili, E. Bicknell, M. Blettner, Development and application of a metric on semantic nets, IEEE Transactions on Systems, Man, and Cybernetics 19 (1) (1989) 17–30.
[78] E. Rahm, P. A. Bernstein, A survey of approaches to automatic schema matching, Very Large Data Bases (VLDB) J. 10 (4) (2001) 334–350.
[79] P. Resnik, Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language, J. Artif. Intell. Res. (JAIR) 11 (1999) 95–130.


[80] R. Richardson, A. F. Smeaton, Using WordNet as a knowledge base for measuring semantic similarity between words, Tech. Rep. CA-1294, School of Computer Applications, Dublin City University, Dublin, Ireland (1994).
[81] M. A. Rodriguez, M. J. Egenhofer, Determining semantic similarity among entity classes from different ontologies, IEEE Transactions on Knowledge and Data Engineering 15 (2) (2003) 442–456.
[82] H. Rubenstein, J. Goodenough, Contextual correlates of synonymy, Computational Linguistics 8 (1965) 627–633.
[83] S. Rugaber, Program comprehension, Encyclopedia of Computer Science and Technology 35 (20) (1995) 341–368, Marcel Dekker, Inc: New York.
[84] S. Sangeetha, J. Hammer, M. Schmalz, O. Topsakal, Extracting meaning from legacy code through pattern matching, Technical Report TR-03-003, University of Florida, Gainesville (2003).
[85] N. Seco, T. Veale, J. Hayes, An intrinsic information content metric for semantic similarity in WordNet, in: Proceedings of ECAI 2004, the 16th European Conference on Artificial Intelligence, Valencia, Spain, 2004.
[86] E. R. Sergey Melnik, Hector Garcia-Molina, Similarity flooding: A versatile graph matching algorithm and its application to schema matching, in: ICDE '02: Proceedings of the 18th International Conference on Data Engineering (ICDE '02), IEEE Computer Society, Washington, DC, USA, 2002.
[87] A. P. Sheth, Data semantics: What, where and how, in: Proceedings of the 6th IFIP Working Conference on Data Semantics, 1995.
[88] P. Shvaiko, J. Euzenat, A survey of schema-based matching approaches, Journal on Data Semantics (JoDS) IV (2005) 146–171.
[89] E. Stroulia, M. El-Ramly, L. Kong, P. G. Sorenson, B. Matichuk, Reverse engineering legacy interfaces: An interaction-driven approach, in: 6th Working Conference on Reverse Engineering (WCRE '99), 1999.
[90] Y. A. Tijerino, D. W. Embley, D. W. Lonsdale, Y. Ding, G. Nagy, Towards ontology generation from tables, World Wide Web 8 (3) (2005) 261–285.
[91] O. Topsakal, Extracting semantics from legacy sources using reverse engineering of Java code with the help of visitor patterns, Master's thesis, Department of Computer and Information Science and Engineering, University of Florida (2003).
[92] M. Uschold, M. Gruninger, Ontologies and semantics for seamless connectivity, SIGMOD Rec. 33 (4) (2004) 58–64.


[93] P. Vossen, EuroWordNet: a multilingual database for information retrieval, in: Proceedings of the DELOS workshop on Cross-language Information Retrieval, Zurich, 1997.
[94] J. Wang, F. H. Lochovsky, Data extraction and label assignment for web databases, in: WWW '03: Proceedings of the 12th international conference on World Wide Web, ACM Press, New York, NY, USA, 2003.
[95] B. W. Weide, W. D. Heym, J. E. Hollingsworth, Reverse engineering of legacy code exposed, in: ICSE, 1995.
[96] M. Weiser, Program slicing, in: ICSE '81: Proceedings of the 5th international conference on Software engineering, IEEE Press, Piscataway, NJ, USA, 1981.
[97] L. M. Wills, Using attributed flow graph parsing to recognize clichés in programs, in: Selected papers from the 5th International Workshop on Graph Grammars and Their Application to Computer Science, Springer-Verlag, London, UK, 1996.
[98] Z. Wu, M. Palmer, Verbs semantics and lexical selection, in: Proceedings of the 32nd annual meeting on Association for Computational Linguistics, Association for Computational Linguistics, Morristown, NJ, USA, 1994.
[99] M. Yatskevich, Preliminary evaluation of schema matching systems, Tech. Rep. DIT-03-028, University of Trento (2003).
[100] B. Yu, L. Liu, B. C. Ooi, K. L. Tan, Keyword join: Realizing keyword search for information integration, in: Computer Science (CS), DSpace at MIT, 2006.


BIOGRAPHICAL SKETCH

Oguzhan Topsakal, a native of Turkey, received his Bachelor of Science degree from the Computer and Control Engineering Department of Istanbul Technical University in June 1996. He worked in the information technology departments of several companies before pursuing a graduate degree in the U.S.A. After he received his Master of Science degree in computer engineering at the University of Florida in August 2003, he continued, he was a study-abroad student at the University of Bremen, Germany. During his Ph.D. studies, he worked as a teaching assistant for programming language and database related courses at the University of Florida and for a data warehousing and data mining course at the University of Hong Kong. He also worked as a research assistant in the Scalable Extraction of Enterprise Knowledge (SEEK) and Test Harness for the Assessment of Legacy Information Integration Approaches (THALIA) projects. His research interests include semantic analysis, machine learning, natural language processing, knowledge management, information retrieval and data integration. He believes in continued learning and education to better understand and to contribute to society.