Citation
Mapping Specifications for Ranked Hierarchical Trees in Data Integration Systems

Material Information

Title:
Mapping Specifications for Ranked Hierarchical Trees in Data Integration Systems
Creator:
Soomro, Sarfaraz A
Place of Publication:
[Gainesville, Fla.]
Florida
Publisher:
University of Florida
Publication Date:
2014
Language:
english
Physical Description:
1 online resource (105 p.)

Thesis/Dissertation Information

Degree:
Master's (M.S.)
Degree Grantor:
University of Florida
Degree Disciplines:
Computer Engineering
Computer and Information Science and Engineering
Committee Chair:
FORTES,JOSE A
Committee Co-Chair:
FIGUEIREDO,RENATO JANSEN
Committee Members:
LI,XIAOLIN
Graduation Date:
8/9/2014

Subjects

Subjects / Keywords:
Cities ( jstor )
Counties ( jstor )
Countries ( jstor )
Data integration ( jstor )
Databases ( jstor )
Datasets ( jstor )
Geodetic position ( jstor )
Identifiers ( jstor )
SQL ( jstor )
Taxa ( jstor )
Computer and Information Science and Engineering -- Dissertations, Academic -- UF
data-integration -- data-transformation -- etl -- mapping-language -- self-referential-schema
City of Gainesville ( local )
Genre:
bibliography ( marcgt )
theses ( marcgt )
government publication (state, provincial, territorial, dependent) ( marcgt )
born-digital ( sobekcm )
Electronic Thesis or Dissertation
Computer Engineering thesis, M.S.

Notes

Abstract:
A popular approach to deal with data integration of heterogeneous data sources is to Extract, Transform and Load (ETL) data from disparate sources into a consolidated data store while addressing integration challenges including, but not limited to: structural differences in the source and target schemas, semantic differences in their vocabularies, and data encoding. This work focuses on the integration of tree-like hierarchical data or information that when modeled as a relational schema can take the shape of a flat schema, a self-referential schema or a hybrid schema. Examples include evolutionary taxonomies, geological time scales, and organizational charts. Given the observed complexity in developing ETL processes for this particular but common type of data, our work focuses on reducing the time and effort required to map and transform this data. Our research automates and simplifies the transformation from ranked self-referential to flat representations (and vice-versa), by: (a) proposing MSL+, an extension to IBM's Mapping Specification Language (MSL), to succinctly express the mapping between schemas while hiding the actual transformation implementation complexity from the user, and (b) implementing a transformation component for the Talend open source ETL platform, called Tree Transformer (TT). We evaluated MSL+ and TT, in the context of biodiversity data integration, where this class of transformations is a recurring pattern. We demonstrate the effectiveness of MSL+ with respect to development time savings as well as a 2 to 25-fold performance improvement in transformation time achieved by TT when compared to existing implementations and to Talend built-in components. ( en )
General Note:
In the series University of Florida Digital Collections.
General Note:
Includes vita.
Bibliography:
Includes bibliographical references.
Source of Description:
Description based on online resource; title from PDF title page.
Source of Description:
This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Thesis:
Thesis (M.S.)--University of Florida, 2014.
Local:
Adviser: FORTES,JOSE A.
Local:
Co-adviser: FIGUEIREDO,RENATO JANSEN.
Statement of Responsibility:
by Sarfaraz A Soomro.

Record Information

Source Institution:
UFRGP
Rights Management:
Copyright Soomro, Sarfaraz A. Permission granted to the University of Florida to digitize, archive and distribute this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.
Classification:
LD1780 2014 ( lcc )

Full Text

MAPPING SPECIFICATIONS FOR RANKED HIERARCHICAL TREES IN DATA INTEGRATION SYSTEMS

By

SARFARAZ AHMED SOOMRO

A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE

UNIVERSITY OF FLORIDA

2014

© 2014 Sarfaraz Ahmed Soomro

To my parents

ACKNOWLEDGMENTS

I would like to thank Dr. José A.B. Fortes for providing me the opportunity to work in the ACIS lab and for his critical feedback, patience and motivation. I would also like to thank Dr. Andréa Matsunaga for guiding me throughout the course of my research and providing in-depth feedback and insight. Furthermore, I would like to thank Dr. Renato J. Figueiredo and Dr. Xiaolin Li for their help and support as committee members. I am deeply grateful for the financial assistance I received from the National Science Foundation (NSF) through my advisor for the iDigBio project (grant No. EF-1115210). I also appreciate the help and support I received from my peers at the ACIS lab and the iDigBio project. A special thanks to the folks at the Kansas University Biodiversity Institute for providing data sets and help with the Specify software. Finally, I would like to thank my family, and especially my parents, for motivating me and supporting me throughout the way.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
ABSTRACT

CHAPTER

1 INTRODUCTION

2 STORING RANKED HIERARCHICAL TREES
    Flat Schema
    Ranked Self Referential Schema
    Hybrid Schema

3 MOTIVATING EXAMPLE
    Ranked Tree Hierarchies
    Flat Schema Representation of Ranked Tree Hierarchies
    Ranked Self Referential Schema Representation of Ranked Tree Hierarchies

4 CHALLENGES
    Rank Jumps
    Rank Mismatch
        Number of Ranks in Source = Number of Ranks in Target
        Number of Ranks in Source > Number of Ranks in Target
        Number of Ranks in Source < Number of Ranks in Target
    Lack of Identifiers

5 RELATED WORK
    Logical Framework for a Data Integration System
        Conjunctive Queries
        Sample Schema
        Local as View (LAV) Mapping
        Global as View (GAV) Mapping
        Global and Local as View (GLAV) Mapping
    The CLIO Project
    MSL Evaluation and Shortcomings for Ranked Tree Hierarchies
        Dealing With Rank Jumps
        Dealing With Rank Mismatches
        Dealing With Lack of Identifiers
    Nested Schema Mappings

6 CONTRIBUTIONS
    Transformation Algorithms
        Ranked Self Referential to Flat Transformation Algorithm
        Flat to Ranked Self Referential Transformation Algorithm
    MSL+
        Ranked Self Referential to Flat Schema Mapping
        Flat to Ranked Self Referential Schema Mapping
    Tree Transformer

7 EXPERIMENTAL SETUP
    Experimental Datasets
    Machine Configuration
    Experiment Scenarios
        S2F Transformation Experiment
        F2S Transformation Experiment
        iDigBio Complete Schema Transformation Experiment

8 EXPERIMENTAL EVALUATION
    Ranked Self Referential to Flat Transformation
        Execution Times
        User Effort
    Flat to Ranked Self Referential Transformation
        Execution Times
        User Effort
    A Complete Schema Transformation Example
        Execution Time
        User Effort

9 CONCLUSIONS AND FUTURE WORK

APPENDIX

A GENERATING TEST DATA
B USING SPECIFY TO OUTPUT FLAT SCHEMA
C USING SPECIFY TO OUTPUT DWC FILE
D SPECIFY SIMULATOR
E TALEND OPEN STUDIO
    Tree Transformer Components
        TT S2F Component
        TT F2S Component
    Talend Jobs
        Ranked Self Referential to Flat Schema Transformation Using TT S2F Component
        Ranked Self Referential to Flat Schema Transformation Using TOS Built-In Components Only
        Flat to Ranked Self Referential Schema Transformation Using TT F2S Component
        Flat to Ranked Self Referential Schema Transformation Using TOS Built-In Components Only
        Creating a DwC File Using TT Components

LIST OF REFERENCES

BIOGRAPHICAL SKETCH

LIST OF TABLES

3-1 Flat format representation of the tree hierarchy in Figure 3-1. All ancestral information for each location is present within the row and replicated across rows.
3-2 Self referential format representation of the tree hierarchy in Figure 3-1. Each node has an identifier (GeographyID) associated with it and is referred to by its children through the ParentID column.
6-1 A flat schema similar to the one in Table 3-1, but with no identifier column.
6-2 A flat schema representing Specimen taxonomy information. Here the identifier SpecimenID identifies the specimen object and not the location information, which is also present in this schema.
6-3 Distinct paths at each rank in the sample flatGeography schema data.
A-1 Taxon ranks in the KUIP database. The name of each taxon rank and the number of records at each rank are also shown.

LIST OF FIGURES

3-1 An example of a ranked tree hierarchy using geographic data. Nodes represent geographic locations, edges represent containment of a region within a larger one.
5-1 General form of an MSL mapping statement.
5-2 MSL mapping statement for a simple mapping between Agent and Person relations.
5-3 MSL statement to map the complete global schema to the source schema.
5-4 Initial incomplete schema mapping between ranked self referential and flat schemas using MSL, without yielding correct transformation.
5-5 Schema mapping between ranked self referential and flat schemas using MSL, limited to map only State nodes.
5-6 Schema mapping between ranked self referential and flat schema, assuming if conditional constructs are available in MSL.
5-7 Schema mapping between ranked self referential and flat schema, assuming if conditional constructs are available in MSL.
6-1 The transformation algorithm for ranked self referential schema to flat transformation.
6-2 Schema mapping between ranked self referential and flat schemas using the new constructs offered by MSL+.
6-3 Schema mapping for flat to ranked self referential schema transformation using the new constructs added to MSL+.
7-1 Distribution of 8183 Taxon nodes in the KUIP database. Test data were generated based on these statistics.
7-2 TOS job for transformation from ranked self referential schema to flat schema. All tMySQLInput components feed into the tMap component, which joins all of them together.
7-3 A partial screen shot of a TOS job using only TOS built-in components to achieve flat to ranked self referential schema transformation.
7-4 TOS job showing different KUIP database tables being combined to produce the Darwin Core file output. The components with a circular arrow pointing toward them are TT S2F components.
8-1 Comparing the execution times for different methods of schema transformation from ranked self referential to flat schema.
8-2 A screen shot of a TOS job for ranked self referential schema to flat schema transformation.
8-3 Inputs required for the TT S2F component. See Appendix E for a description of the input fields.
8-4 A screenshot of the tMap component used in the TOS job for ranked self referential schema to flat schema transformation, using only TOS built-in components.
8-5 A Java expression defined in the tMap component to correctly identify rank and define output. Such expressions are defined for each rank, hence making the transformation process long and cumbersome.
8-6 Comparing execution times for flat to ranked self referential schema transformation. TT F2S shows slightly more expensive execution times compared to TOS built-in components.
8-7 A screenshot of the TOS job for flat schema to ranked self referential schema transformation, using the TT F2S component. Shown on the left, this component performs the entire flat to ranked self referential schema transformation.
8-8 Inputs required by the TT F2S component to perform the schema transformation from flat to ranked self referential schema. See Appendix E for details about the input fields.
8-9 A sample of the custom Java code required to be written for the tJavaRow component used in the TOS job for flat to ranked self referential schema transformation using only TOS built-in components.
B-1 The Taxon relation selected in Specify and its attributes/properties listed.
B-2 The query for displaying all taxa in a flat format. All ranks in the taxon hierarchy have been selected for this query.
E-1 TT components shown in the component palette.

LIST OF ABBREVIATIONS

CSV      Comma Separated Value
DMR      Developer Milestone Release
DwC      Darwin Core
ETL      Extract Transform Load
F2S      Flat to ranked Self referential
GAV      Global As View
GLAV     Global and Local As View
GUI      Graphical User Interface
IBM      International Business Machines
iDigBio  integrated Digitized Biocollections
KUIP     Kansas University Invert Paleo database
LAV      Local As View
MSL      Mapping Specification Language
MSL+     Mapping Specification Language Plus
S2F      ranked Self referential to Flat
SQL      Structured Query Language
TGD      Tuple Generating Dependency
TOS      Talend Open Studio
TSIMMIS  The Stanford-IBM Manager of Multiple Information Sources
TT       Tree Transformer
XML      eXtensible Markup Language
XSD      XML Schema Definition

Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Master of Science

MAPPING SPECIFICATIONS FOR RANKED HIERARCHICAL TREES IN DATA INTEGRATION SYSTEMS

By

Sarfaraz Ahmed Soomro

August 2014

Chair: José A.B. Fortes
Major: Computer Engineering

A popular approach to deal with data integration of heterogeneous data sources is to Extract, Transform and Load (ETL) data from disparate sources into a consolidated data store while addressing integration challenges including, but not limited to: structural differences in the source and target schemas, semantic differences in their vocabularies, and data encoding. This work focuses on the integration of tree-like hierarchical data or information that, when modeled as a relational schema, can take the shape of a flat schema, a self referential schema or a hybrid schema. Examples include evolutionary taxonomies, geological time scales, and organizational charts. Given the observed complexity in developing ETL processes for this particular but common type of data, our work focuses on reducing the time and effort required to map and transform this data. Our research automates and simplifies the transformation from ranked self referential to flat representations (and vice versa) by: (a) proposing MSL+, an extension to IBM's Mapping Specification Language (MSL), to succinctly express the mapping between schemas while hiding the actual transformation implementation complexity from the user, and (b) implementing a transformation component for the Talend open source ETL platform, called Tree Transformer (TT). We evaluated MSL+ and TT in the context of biodiversity data integration, where this class of transformations is a recurring pattern. We demonstrate the effectiveness of MSL+ with respect to development time savings, as well as a 2 to 25-fold performance improvement in transformation time achieved by TT when compared to existing implementations and to Talend built-in components.

CHAPTER 1
INTRODUCTION

Data integration is the problem of combining data residing at different sources and providing the user with a unified view of these data [2]. Data integration systems have become more important with the prevalence of the internet, as different services and aggregators try to provide users with information about a subject of interest in one place, although this information is extracted from different sources spread all across the internet. Data integration systems involve a global schema and source schemas, with the data from the source schemas molded into the global schema. Therefore, designing the relationship between the global schema and the sources is essential to enable correct schema transformation and data exchange.

Data integration involves structural schema transformations as well as data transformations. Structural transformation involves changing the shape of the schema that stores the information. For example, a source schema can be a star schema whereas the target schema could be a normalized snowflake schema. Data transformations change the presentation of the data itself: reordering a date stored as day, month and year in the source to year, month, and day in the target schema, or changing a measurement from one unit to another, are examples of data transformations. [8] refers to the process of defining mappings between the source and target schemas as schema mapping. Two architectures dominate data integration: the mediator approach and the data warehousing approach. In the mediator approach, the data remains at the data sources and the sources are queried at run time. In the data warehousing approach, data is extracted from the sources, transformed and loaded into a warehouse ahead of query time. The mediator approach is used where the preference is to have up-to-date information available, whereas the data warehousing approach is used for more complex tasks like data mining. Figure 1-1 illustrates both of these approaches to data integration.

An example of a data integration system using the mediator approach is that of flight search websites, which provide information about flight schedules and rates from different airlines, all consolidated in a uniform view at one place. Even though every airline has a different database system with different schemas and data formats, the data integration system allows querying and retrieval of relevant information from these heterogeneous systems in a consolidated form to be presented to the end user. The main idea in the mediator approach is to have up-to-date information available, as this approach is used mostly in cases where critical transactions are performed and the nature of previous transactions affects the way new transactions are carried out. An example of the data warehousing approach is a system which extracts information about flights, persons travelled, source and destination locations, etc., from different airlines and stores them in a separate database to run analytical queries on this data, such as finding the most travelled-to destinations. Here the focus is to run statistical queries to aid in better decisions for operational and marketing reasons. Data warehousing systems do not depend on real-time information; therefore they can process data in daily or weekly batches as per the operational requirements of the organization.

Figure 1-1. The two approaches to data integration. In mediation, data is extracted in real time from the sources. In warehousing, data is extracted, processed and stored ahead of time, before the queries are posed to the system.

The main challenge in data integration comes from the heterogeneity of the data sources. Each source may have a different schema, field names and data formats to describe the same entity. There is a global schema or mediated schema [2] to which the queries are posed by the end users. The challenge is to define semantic mappings between the schemas of the data sources and the global schema. The query posed to the data integration system is in terms of the global schema. In the case of a data integration system using the mediation approach, this query is transformed into various local queries for each data source with the help of the respective semantic mappings. The results from these local queries are then consolidated to be presented to the user. For a data integration system using the data warehousing approach, the global schema is already materialized, i.e., the data is physically present in terms of the global schema, so the query results are obtained directly from the materialized global schema. The global schema is materialized before the system is exposed to accept queries. M. M. Kwakye in [8] breaks down the task of designing a data integration system into three steps: schema matching, schema mapping and schema merging. In the schema matching phase, a correspondence is established between the elements and entities in the source schema and the global schema; these entities may contain multiple fields or elements. An example of this is a Person table in one source schema corresponding to an Employee table in the global schema. Schema matching is usually a semi-automatic operation aided by human verification and correction. Schema mapping takes as input the results of schema matching and establishes the relationship between the elements in the source schema and the global schema. Various formalisms are used to define the mapping between the source schema and the global schema. Schema merging is the phase where the physical data transformation takes place. It takes as input the schema mappings and translates them into physical queries to be run on the target system.

Ranked self referential schemas, which are tree-like hierarchies, are used extensively in relational database systems to model real world data such as geographic data and taxonomic hierarchies. These schemas are encountered very frequently in data integration systems for schema transformation into a flat schema. Other schema transformations frequently encountered are for flat schemas to be converted to ranked self referential schemas. We describe ranked self referential schemas, flat schemas and tree hierarchies in detail in the next sections. For now, let it suffice to say that these transformations pose unique challenges and are not trivial to attain. It is for this reason that we choose to deal with these transformations in the scope of data integration systems, to simplify the task of data integration involving transformations to and from ranked self referential schemas.

CHAPTER 2
STORING RANKED HIERARCHICAL TREES

Hierarchical trees are conceptual structures that capture the hierarchical (parent-child) relationship among their nodes. Nodes have attributes and values, a parent node (except for the root), and a set of children nodes (except for the leaves). In ranked trees, nodes are organized in ranks and sibling nodes may belong to different ranks (as illustrated in Figure 3-1). Tree hierarchies are stored and maintained using a variety of technologies (e.g., relational databases, object databases, XML documents, and spreadsheets), each allowing the data modeler to physically structure these hierarchies in a flat format, a ranked self referential format or a hybrid format (mixing the two other formats).

Flat Schema

The flat format or flat schema stores hierarchies with all the ancestral information of a node contained within one row of a relational table or one element of an XML document. This format introduces redundancy, since the ancestral information is replicated into each row or object.

Ranked Self Referential Schema

In a self referential format/schema, all the ancestral data pertaining to a node are not present in just one row/object. Instead each node uses a link/pointer to point to another row/object (its parent) to build a hierarchy, and each node only has information pertaining to itself and its rank in the hierarchy. An example of such links between nodes is a foreign key constraint in relational databases.

Hybrid Schema

A hybrid format combines flat and self referential formats such that certain ranks of the tree are stored in a flat format whereas other ranks are stored in a ranked self referential manner.

The choice of data format is made by the data model designer, often as a trade-off between replication of information, which reduces the time spent locating ancestry information, and the absence of ancestry replication, which minimizes the effort required to update replicas and maintain consistency. Hierarchical trees have been used to represent real world concepts, including biological taxonomy trees, geological time scale trees and geographical location data, to name a few. Transformations between the different physical schemas (flat and ranked self referential) for tree hierarchies have proven to be time consuming, error prone and non-trivial in the context of the iDigBio project, even when using state-of-the-art Extract, Transform and Load (ETL) tools.
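As a concrete reference for the formats above, the following is a minimal SQL sketch of the flat and ranked self referential layouts, using the flatGeography and selfrefGeography names adopted in the next chapter. The column types and sizes are illustrative assumptions (and Rank may need quoting in dialects where it is a reserved word):

    -- Flat format: one row per node, full ancestry replicated in rank columns.
    CREATE TABLE flatGeography (
        GeographyID INTEGER PRIMARY KEY,
        Continent   VARCHAR(255),
        Country     VARCHAR(255),
        State       VARCHAR(255),
        County      VARCHAR(255),
        City        VARCHAR(255)
    );

    -- Ranked self referential format: one row per node with only its own
    -- name, a rank label, and a foreign key link to its parent row.
    CREATE TABLE selfrefGeography (
        GeographyID INTEGER PRIMARY KEY,
        Name        VARCHAR(255) NOT NULL,
        ParentID    INTEGER REFERENCES selfrefGeography (GeographyID),
        Rank        VARCHAR(64) NOT NULL
    );

A hybrid layout would keep some rank columns inline, as in flatGeography, while still carrying a ParentID link for the remaining ranks.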

CHAPTER 3
MOTIVATING EXAMPLE

Ranked Tree Hierarchies

A tree hierarchy of geographic information (Figure 3-1) is used to exemplify how data is physically structured in a flat or self referential schema, while pointing to details that turn data transformation into a complex step. Each node of this tree represents a geographical location, which can contain a number of attributes (e.g., location name, abbreviation for the locality, and bounding box of geospatial coordinates for the locality). For simplicity, only the location names (e.g., North America, United States, and Florida) are shown for the nodes. The hierarchy captures containment of a location (child) within a larger region (parent), while the ranks (e.g., country, state, and city) allow siblings to be categorized differently (e.g., the Gainesville city and the Alachua County are siblings at the same level, but of different ranks).

Figure 3-1. An example of a ranked tree hierarchy using geographic data. Nodes represent geographic locations, edges represent containment of a region within a larger one, and ranks are constructs used to further organize sibling nodes.
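To make the running example concrete, the tree of Figure 3-1 can be loaded into the self referential table sketched in Chapter 2; the identifiers below are the ones used in Tables 3-1 and 3-2:

    -- Figure 3-1 as rows of selfrefGeography; a NULL ParentID marks the root.
    INSERT INTO selfrefGeography (GeographyID, Name, ParentID, Rank) VALUES
        (5,    'North America',  NULL, 'Continent'),
        (198,  'United States',  5,    'Country'),
        (1237, 'Florida',        198,  'State'),
        (1250, 'Georgia',        198,  'State'),
        (4371, 'Alachua County', 1237, 'County'),
        (4380, 'Hall County',    1250, 'County'),
        (4600, 'Gainesville',    1237, 'City'),  -- rank jump: parent is a State
        (4670, 'Gainesville',    4380, 'City');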

Flat Schema Representation of Ranked Tree Hierarchies

Table 3-1 shows flatGeography, the flat format representation of the data in Figure 3-1. Each row represents a geographical location (a node) with all ancestral information replicated in the row, with different columns indicating the rank of the node and without references to any other row. For example, the row with GeographyID 1237 represents the State Florida together with its ancestry, i.e., Florida is contained in the Country United States, which is contained in the Continent North America. As the table shows, the path North America, United States is replicated multiple times (as many times as the number of nodes related to this path). This replication may lead to excessive storage requirements or slow updates if the tree is large and nodes contain many attributes. Transformations from flat to hierarchical schemas require extracting the unique paths at each rank in the hierarchy. The problem of finding the unique paths at a particular rank is that of extracting the unique combination of nodes representing ancestry from the flat representation of data at that rank up to the root rank.

Ranked Self Referential Schema Representation of Ranked Tree Hierarchies

Table 3-2 shows selfrefGeography, the self referential representation of the data in Figure 3-1. Each row also represents a geographical location (a node), but the ParentID column is used to make references to the parent node based on the identifier stored in the GeographyID column. Thus, each row only contains the information for that particular location, its rank in the hierarchy and a link to its parent using the ParentID column. Identifiers (numeric ones in the example) for each node are preferred as they can be used to distinguish between two nodes with the same sequence of characters in the name but at different ranks, or between two nodes with the same sequence of characters in the name and at the same rank but on different paths in the tree (e.g., the two cities named Gainesville in Figure 3-1).

Table 3-1. Flat format representation of the tree hierarchy in Figure 3-1. All ancestral information for each location is present within the row and replicated across rows.

GeographyID  Continent      Country        State    County          City
5            North America
198          North America  United States
1237         North America  United States  Florida
1250         North America  United States  Georgia
4371         North America  United States  Florida  Alachua County
4380         North America  United States  Georgia  Hall County
4600         North America  United States  Florida                  Gainesville
4670         North America  United States  Georgia  Hall County     Gainesville

Table 3-2. Self referential format representation of the tree hierarchy in Figure 3-1. Each node has an identifier (GeographyID) associated with it and is referred to by its children through the ParentID column.

GeographyID  Name            ParentID  Rank
5            North America             Continent
198          United States   5         Country
1237         Florida         198       State
1250         Georgia         198       State
4371         Alachua County  1237      County
4380         Hall County     1250      County
4600         Gainesville     1237      City
4670         Gainesville     4380      City

The Integrated Digitized BioCollections (iDigBio) project involves such structural schema transformations, along with data transformations, for the purpose of consolidating data into a global schema from a variety of sources such as natural history museums, academic collections and online aggregators. Biodiversity information contains different data that is representative of tree hierarchies. This information is widely stored either in flat schemas or in ranked self referential schemas [1].
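Both directions of the transformation can be sketched in SQL over these two tables. For the flat-to-hierarchical direction, the unique paths at a given rank are a simple DISTINCT projection, assuming absent ranks are stored as NULL:

    -- Unique ancestral paths ending at the County rank: each distinct
    -- (Continent, Country, State, County) combination is one County node.
    SELECT DISTINCT Continent, Country, State, County
    FROM flatGeography
    WHERE County IS NOT NULL;

For the opposite direction, the ancestry in selfrefGeography can be rebuilt and pivoted into the flat layout with a recursive query. This is only a sketch of the idea, not the implementation evaluated later: it assumes a dialect with recursive common table expressions and hard-codes the five ranks of the example.

    -- Walk from the root down, carrying the names already resolved for each
    -- rank; a node fills in its own rank's column and inherits the rest.
    WITH RECURSIVE ancestry (GeographyID, Continent, Country, State, County, City) AS (
        SELECT GeographyID,
               CASE WHEN Rank = 'Continent' THEN Name END,
               CASE WHEN Rank = 'Country'   THEN Name END,
               CASE WHEN Rank = 'State'     THEN Name END,
               CASE WHEN Rank = 'County'    THEN Name END,
               CASE WHEN Rank = 'City'      THEN Name END
        FROM selfrefGeography
        WHERE ParentID IS NULL
        UNION ALL
        SELECT c.GeographyID,
               COALESCE(a.Continent, CASE WHEN c.Rank = 'Continent' THEN c.Name END),
               COALESCE(a.Country,   CASE WHEN c.Rank = 'Country'   THEN c.Name END),
               COALESCE(a.State,     CASE WHEN c.Rank = 'State'     THEN c.Name END),
               COALESCE(a.County,    CASE WHEN c.Rank = 'County'    THEN c.Name END),
               COALESCE(a.City,      CASE WHEN c.Rank = 'City'      THEN c.Name END)
        FROM selfrefGeography c
        JOIN ancestry a ON c.ParentID = a.GeographyID
    )
    SELECT * FROM ancestry;

Note how a rank jump needs no special handling in this direction: the row for Gainesville, Florida inherits a NULL County from its State parent, reproducing the blank cell in Table 3-1.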

Some sources use a flat schema to store these hierarchies (e.g., locations data), whereas others use a ranked self referential schema, and a data integration system must be able to convert between the two while maintaining data integrity. This structural transformation poses a complex challenge in cases where several entities/tables in the source need to be combined (joined) under certain constraints and conditions.

Data integration systems such as Talend Open Studio for Data Integration and IBM InfoSphere Data Architect provide tools that help achieve structural schema transformations of this type. Our experiments using these tools revealed that although they are generic enough to achieve the desired schema transformations, expert knowledge of the database schema and coding skills are required to ingrain the specific logic these transformations need. The complexity of designing these transformations with such tools increases with the number of levels in the hierarchy, compounded by issues such as rank jumps between elements of the hierarchy. For an end user such as a collections manager, this type of transformation is not easy to perform, yet in iDigBio these transformations are a necessity. Given the complexity required for these schema transformations and how frequently they need to be performed, we have developed tools that take out most of the manual work required to do them. We have also developed custom components for Talend Open Studio for Data Integration (TOS) that simplify the task of converting a ranked self referential schema to a flat schema, or a flat schema to a ranked self referential schema. These custom components require minimum input from the user to perform the transformation. We approach the problem of transformation between ranked self referential schemas and flat schemas in a purely generic way that decouples it from any domain-specific issues. Hence our solution has a much wider impact than the scope of the iDigBio project.

CHAPTER 4
CHALLENGES

Having described the nature of ranked hierarchical tree data and the popular ways to store this data physically, i.e., flat schemas, ranked self referential schemas and hybrid schemas, we now enumerate the challenges involved in performing transformations between ranked self referential and flat schemas.

Rank Jumps

Lack of data or equivalency between data sources can lead to rank jumps (e.g., some countries do not have a county division). In Figure 3-1, the City Gainesville in Florida is linked directly to the State Florida, skipping the County rank. Transforming from self referential to flat schema requires a blank entry for the County column to be produced, whereas transforming from flat to self referential requires the reference to the parent to be found at the next higher rank.

Rank Mismatch

Mismatches in the definitions of ranks among several data sources and target schemas are inevitable. While target self referential schemas do not require additional changes to accommodate new ranks, target flat schemas require updates. If this update cannot be accommodated, an alternative solution is for the transformation process to skip the additional ranks not present in the target. The rank mismatch anomaly manifests in three different ways, described below.

Number of Ranks in Source = Number of Ranks in Target

If the ranks in the source and target schemas are the same, then this does not pose a challenge. But if the source and target schemas have different ranks, provided they are not mutually exclusive, then only the information from the ranks which are present in both the source and target schemas is going to be ported to the target schema. An example of such a scenario is as follows:

sourceGeography(Planet, Continent, Country, State, County, City)
targetGeography(Country, County, District, City, Municipality, Locality)

The above two schemas both have six ranks, but the target schema has District, Municipality and Locality ranks which are not present in the source schema.

Number of Ranks in Source > Number of Ranks in Target

If the two sets of ranks are not mutually exclusive, then the data of the ranks present in both the source and target schemas is ported to the target schema. It may be the case that the ranks in the target schema form a proper subset of the ranks in the source schema, such as:

sourceGeography(Planet, Continent, Country, State, County, District, City, Municipality, Locality)
targetGeography(Country, County, District, City, Municipality, Locality)

Number of Ranks in Source < Number of Ranks in Target

In this case, again, if the source and target sets of ranks are not mutually exclusive, then only the data for the common ranks from the source schema is ported into the target schema. The new ranks in the target schema are either kept empty, due to lack of data in the source schema, or filled with values as determined by the data administrator. An example of source and target schemas in this case is:

sourceGeography(Country, County, District, City, Municipality, Locality)
targetGeography(Planet, Continent, Country, State, County, District, City, Municipality, Locality)
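As a sketch of the second case above (source ranks a proper superset of the target ranks), a SQL statement between the two flat example schemas would simply project the shared ranks and drop the rest; the table names mirror the example, and both tables are assumed to exist:

    -- Port only the ranks common to source and target; Planet, Continent
    -- and State exist only in the source and are dropped by the projection.
    INSERT INTO targetGeography (Country, County, District, City, Municipality, Locality)
    SELECT DISTINCT Country, County, District, City, Municipality, Locality
    FROM sourceGeography;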

Lack of Identifiers

Flat schemas may capture only leaf nodes, lacking identifiers for intermediary nodes. Moreover, it is not uncommon for flat representations to lack identifiers altogether. Thus, when transforming from a flat to a self referential schema, generation of new identifiers for each node is required to establish the relationships required by the self referential format. Distinguishing two nodes with the same sequence of characters in the name (e.g., the two cities named Gainesville) is not possible without identifiers. An example is a specimen table, which primarily stores specimen information, along with taxonomic hierarchy and location data stored flat in the same table. The identifier in this table identifies the specimen object and not the taxonomic determination or the geographic location associated with the specimen object. Another example would be a simple flat schema table that stores geographic locations but has no identifier column: the location data is present, but no identifier is associated with each location. Flat schema cases such as these, when converted to a ranked self referential schema, need new identifiers to be generated for the entities stored in the flat schema.
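A minimal sketch of identifier generation for such a table, assuming a hypothetical identifier-less flat table flatLocations (the layout of Table 6-1) and a dialect with window functions: each distinct ancestral path at a rank becomes one node and receives a surrogate key, so the two cities named Gainesville get different identifiers because their paths differ.

    -- Generate surrogate identifiers for City-rank nodes: one identifier
    -- per distinct ancestral path, not per distinct city name.
    SELECT ROW_NUMBER() OVER (ORDER BY Continent, Country, State, County, City) AS NodeID,
           Continent, Country, State, County, City
    FROM (SELECT DISTINCT Continent, Country, State, County, City
          FROM flatLocations
          WHERE City IS NOT NULL) AS paths;

Repeating this per rank, and resolving each node's ParentID against the identifiers generated for the rank above it, yields the self referential form.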

CHAPTER 5
RELATED WORK

The type of data integration systems we look at in this work is based on an architecture consisting of a global schema and several local or source schemas. The source schemas are where the actual data resides, whereas the global schema is a reconciled view capable of representing all the data and relations in the source schemas. The global schema is sometimes also referred to as the mediated schema. Since in a data integration system the query is always posed to the global schema, this query needs to be transformed into one or more sub-queries to be executed at the sources in order to get the results for the actual query posed. Therefore, the modeling of the relationship between the source schemas and the global schema is an important aspect of data integration systems, as it partly facilitates query processing. This relationship between the global and source schemas is called schema mapping and, as we discuss further, several formalisms have been developed to define schema mappings. We will look at each of them, with examples and a discussion of how query processing is done in each case.

Logical Framework for a Data Integration System

Here we explain a logical framework for data integration systems as described in [2]. A data integration system I is formally described as a triple (G, S, M), where

- G is a global schema or mediated schema,
- S is a source schema or local schema,
- M is the mapping between G and S, given as a set of assertions of the forms

  a) qS ⇝ qG
  b) qG ⇝ qS

where qS and qG are queries of the same arity over the source and global schemas, respectively. These assertions define the connection between the elements of the source schemas and the elements of the global schema.

Queries posed to the data integration system I provide a specification of the data to be extracted from the schema represented by the global schema G. Queries to the data integration system I are posed to the global schema G, and the mapping M dictates how data from a source schema S are transformed into G. Previous research has developed formalisms for expressing these mappings, with the main idea of using logical views to specify the needed query translation or data transformation. Views are a way of supporting independence between the way data is stored (physical view) and the way data is presented (logical views) [4]. Logical views enable keeping a consistent view of data while its physical storage scheme is changed; conversely, while the physical view stays the same, several logical views can be constructed as well. Here we discuss three mapping formalisms: Local as View (LAV), Global as View (GAV) and Global and Local as View (GLAV).

Conjunctive Queries

Conjunctive queries form the basis of the schema mapping formalisms discussed later in this chapter. A conjunctive query [3, 22] is an expression of the form

  q(x1, ..., xn) :- A1(u1), ..., Ak(uk)    (3-1)

where Ai is a relation and ui is a vector of constants and variables. Each xi should occur in some ui. q(x1, ..., xn) is called the head, and A1(u1), ..., Ak(uk) is called the body of the query. The variables xi are distinguished variables, whereas any variables in the vectors ui are existential variables. If R is an instance of the relations appearing in the body of the query, an answer is a tuple (v(x1), ..., v(xn)), where v is some value of the distinguished variables such that for each i, Ai(v(ui)) holds in R. q(R) denotes the set of answers.

Containment of conjunctive queries: a query q1 is contained in a query q2, denoted by q1 ⊆ q2, if for each R, q1(R) ⊆ q2(R). The idea of query containment plays an important role in data integration systems.

Sample Schema

Consider the following local and global schemas; we will use them in the related examples. The source schema consists of the following relations; the sub-items show the compact form of each relation schema with attribute abbreviations:

- Occurrence(TaxonName, Family, Class, Species, LocationName, Country, State, City, AgentName, InstitutionName)
  - Occurrence(TN, F, Cl, S, LN, C, St, Ct, AN, IN)
- Agent(AgentName, InstitutionName, InstitutionCode, LocationName, Country, State, City)
  - Agent(AN, IN, IC, LN, C, St, Ct)

The Occurrence relation contains all specimen records, i.e., their taxonomic determination, location, the agent who recorded or collected the specimen, and the institution which owns the specimen. The Agent table contains records of persons, the institutions they are affiliated with and the locations of those institutions. Important to note in this schema is that not all the agents listed in the Agent table are bound to be present in the Occurrence table.

The following are the relations in the global schema, along with their compact forms:

- Taxon(TaxonName, Family, Class, Species)
  - Taxon(TN, F, Cl, S)
- Location(LocationName, Country, State, City)
  - Location(LN, C, St, Ct)
- Person(PersonName, InstitutionName, InstitutionCode)
  - Person(PN, IN, IC)
- Institution(InstitutionName, InstitutionCode, LocationName)
  - Institution(IN, IC, LN)
- Specimen(TaxonName, LocationName, PersonName)
  - Specimen(TN, LN, PN)

The Taxon table has records for all taxonomic determinations. The Location table contains all locations. The Person table records all persons along with their institutional affiliations. The Institution table contains institution and location information. The Specimen table ties it all together by associating a taxonomic determination with a location and an agent.

Local as View (LAV) Mapping

In the LAV [3] approach, the mapping M associates each relation in the source schema S with a view over the global schema G. That is to say, a LAV mapping is an assertion of the form s ⇝ q, where q is a conjunctive query of the form q(x1, ..., xn) :- A1(u1), ..., Ak(uk) over the global relations. Given the sample schema, we define the relations in the local schema as views over the relations in the global schema:

Occurrence(TN, F, Cl, S, LN, C, St, Ct, AN, IN) ⇝ Specimen(TN, LN, PN), Taxon(TN, F, Cl, S), Location(LN, C, St, Ct), Person(PN, IN, IC)
Agent(AN, IN, IC, LN, C, St, Ct) ⇝ Person(PN, IN, IC), Institution(IN, IC, LN), Location(LN, C, St, Ct)

The above mappings show that LAV mappings are more suitable in cases where the global schema is stable and not vulnerable to frequent changes, as any change in one of the relations in the global schema would require changes to all the mappings involving that relation. If a new relation is added to the source schema, it only requires a new mapping to be defined for it [2, 8, 22]. Consider that a new relation Geography is added to the source schema:

Geography(GeographyName, Country, State, City)

This is essentially the same as the Location relation in the global schema, so in addition to the current LAV mappings this requires just one additional mapping, as follows:

Occurrence(TN, F, Cl, S, LN, C, St, Ct, AN, IN) ⇝ Specimen(TN, LN, PN), Taxon(TN, F, Cl, S), Location(LN, C, St, Ct), Person(PN, IN, IC)
Agent(AN, IN, IC, LN, C, St, Ct) ⇝ Person(PN, IN, IC), Institution(IN, IC, LN), Location(LN, C, St, Ct)
Geography(LN, C, St, Ct) ⇝ Location(LN, C, St, Ct)

Now suppose that there is a change in the global schema and the Location relation is split as follows:

Location1(LocationName, Country)
Location2(LocationName, State, City)

Now there are two relations containing the location information, and all the mappings that previously relied on the Location relation need to be adjusted as follows:

Occurrence(TN, F, Cl, S, LN, C, St, Ct, AN, IN) ⇝ Specimen(TN, LN, PN), Taxon(TN, F, Cl, S), Location1(LN, C), Location2(LN, St, Ct), Person(PN, IN, IC)
Agent(AN, IN, IC, LN, C, St, Ct) ⇝ Person(PN, IN, IC), Institution(IN, IC, LN), Location1(LN, C), Location2(LN, St, Ct)
Geography(LN, C, St, Ct) ⇝ Location1(LN, C), Location2(LN, St, Ct)

Answering queries with LAV mappings is based on incomplete information in the global schema, a consequence of the partial views contributed by the source schemas. These partial sources make query answering in LAV mappings a difficult task [8]. LAV mappings are discussed in detail in [2], where two approaches to view-based query processing are described, namely view-based query rewriting and view-based query answering. Deducing queries out of LAV mappings is based on algorithms described in [2] and [13]. We do not discuss these algorithms here, as that would be out of the scope of this document.

Global as View (GAV) Mapping

GAV [4] is an alternative approach to defining schema mappings, in which the mapping M associates with each relation in the global schema G a query (view) over the source schema S. A GAV mapping is an assertion of the form g ⇝ q, where q is a conjunctive query of the form q(x1, ..., xn) :- A1(u1), ..., Ak(uk) over the source relations. Given the sample schema we described earlier, we define GAV mappings as assertions describing each relation in the global schema as a view over the relations in the local schema:

Taxon(TN, F, Cl, S) ⇝ Occurrence(TN, F, Cl, S, LN, C, St, Ct, AN, IN)
Location(LN, C, St, Ct) ⇝ Occurrence(TN, F, Cl, S, LN, C, St, Ct, AN, IN), Agent(AN, IN, IC, LN, C, St, Ct)
Person(PN, IN, IC) ⇝ Agent(PN, IN, IC, LN, C, St, Ct)
Institution(IN, IC, LN) ⇝ Agent(AN, IN, IC, LN, C, St, Ct)
Specimen(TN, LN, PN) ⇝ Occurrence(TN, F, Cl, S, LN, C, St, Ct, PN, IN)

GAV mappings are more suitable in cases where the source schema is stable and not vulnerable to frequent changes. The drawback to this approach is the addition of new relations to the source schema: if the source schema is unstable, then with the addition of every new source, all the existing mappings need to be re-evaluated and re-adjusted to accommodate the new source. The addition of new relations to the global schema does not pose a challenge in GAV mappings, as only one new mapping needs to be specified, particular to the new addition in the global schema. Consider that a new relation PersonLocation is added to the global schema:

PersonLocation(PersonName, LocationName)

The location of the person in this case is the same as the location of the institution the person is associated with. This only requires adding a new mapping to the set of mappings we already have; thus a minimal change is required:

PersonLocation(PN, LN) ⇝ Agent(PN, IN, IC, LN, C, St, Ct)

Now consider that the source schema is unstable and there is a change that requires the location information to be stored in a separate relation:

Occurrence(TaxonName, Family, Class, Species, LocationName, AgentName, InstitutionName)
Agent(AgentName, InstitutionName, InstitutionCode, LocationName)
Location(LocationName, Country, State, City)

This change to the source schema now requires all the mappings to be altered to cater to the change, as follows:

Taxon(TN, F, Cl, S) ⇝ Occurrence(TN, F, Cl, S, LN, AN, IN)
Location(LN, C, St, Ct) ⇝ Location(LN, C, St, Ct)
Person(PN, IN, IC) ⇝ Agent(PN, IN, IC, LN)
Institution(IN, IC, LN) ⇝ Agent(AN, IN, IC, LN)
Specimen(TN, LN, PN) ⇝ Occurrence(TN, F, Cl, S, LN, PN, IN)

Answering queries with GAV mappings is easier than with LAV mappings, because GAV mappings procedurally describe how the data is to be extracted from the source. The process called view unfolding is used to process queries when GAV mappings are in place: the atoms in the query are rewritten in terms of the view definitions of the GAV mapping. Consider the following query for fetching persons along with their institutions and the institutions' locations:

Q(PN, IN, LN) :- Person(PN, IN, IC), Institution(IN, IC, LN)

The two atoms in this query can be substituted by their view definitions:

Q(PN, IN, LN) :- Agent(PN, IN, IC, LN, C, St, Ct), Agent(PN, IN, IC, LN, C, St, Ct)

and simplified as:

Q(PN, IN, LN) :- Agent(PN, IN, IC, LN, C, St, Ct)

Although the above query is posed directly to the global schema, the procedural nature of GAV mappings makes getting the data from the sources easy, thanks to the straightforward unfolding process.

Global and Local as View (GLAV) Mapping

Combining the power of both LAV and GAV mappings [7] led to the introduction of Global and Local As View (GLAV) mappings. GLAV mappings use both LAV and GAV to establish a mapping between the source schema and the global schema. A GLAV mapping assertion is based on queries on both sides: every mapping assertion has a query over the source schema and a corresponding query over the global schema. GLAV combines the expressive power of both LAV and GAV and allows for flexible schema definitions independent of the specific details of the sources. Consider again the query Q(PN, IN, LN). Since GLAV mappings use both LAV and GAV mappings, the following is a LAV mapping for this query:

Q(PN, IN, LN) :- Agent(PN, IN, IC, LN, C, St, Ct)

PAGE 37

37 And a corresponding GAV mapping for this query is followi ng; Q(PN, IN, LN): Person(PN, IN, IC), Institution(IN, IC, LN) , Then a GLAV mapping is a mapping of the form; Agent( PN, IN, IC, LN, C, St, Ct) Person(PN, IN, IC), Institution(IN, IC, LN) For query processing, GLAV uses both view unfolding and view based query re writing algorithms [12] . [8] summarizes the pros and cons of all three approaches to mapping specification. The GLAV approach is now the most favored and used approach for schema mappings and many academic and industry p rojects have adopted it to model schema mappings in their tools. TSIMMIS [5] and The Information Manifold [16] are examples of GAV and LAV systems, respecti vely. In the next subsection we discuss the CLIO Project which uses GLAV based mappings . The CLIO Project The CLIO Projec t [9, 10 , 22, 23, 24, 25, 26, 27 ] , which was collaboration between IBM Research and University of Toronto, is a prime example of a data integration tool that uses GLAV assertions for schema mappings. CLIO pioneered the use of schema mappings for automatically generating views by querying the local schemas to sati sfy a query posed against a global schema as is the case in data integration o r in the case of data exchange to transform data from one representation to another representation . CLIO deals primarily with relational and XML schemas using a nested relational model [27] to model both types of schemas. To express correspondence between source and target relations, CLIO uses inter schema inclusion dependencies or more generally

source-to-target tuple generating dependencies (TGDs) [17]. The mappings are sound, i.e., the result of the source query is contained in the result of the target query.

Mapping Specification Language: Section 3 in [10] describes a verbose form of mapping language developed for CLIO, based on source-to-target TGDs extended to support nested relations in a nested relational model. This Mapping Specification Language (MSL) expresses mappings for relational as well as semi-structured documents such as XML. MSL is a declarative language for specifying schema mappings, clearly and succinctly describing correspondences between attributes of the source and target schemas. [10] has examples detailing the use of MSL, none of which deals with mappings involving tree hierarchies, either as ranked self-referential or as flat schemas. We have chosen MSL as the basis for our work for the following reasons:

1. MSL is declarative and works with a nested relational model, a good choice to model relational and XML schemas.

2. MSL is implemented in CLIO, a complete data integration tool that is part of IBM's Data Architect tool [20], with mapping discovery, query generation and data exchange; this makes our contributions available to be incorporated into an existing tool with a broad audience.

3. CLIO uses sound GLAV mappings [2]. In sound mappings, the relation between the queries is a containment relation, i.e., the result of the source query is contained in the result of the target query. As a consequence of sound mappings, there is no restriction on what data can be in the global schema, since multiple source schemas can be mapped into one global schema. This is a common case with data integration systems.

We first go through basic MSL syntax to familiarize the reader with the basic constructs of the language. In a later section we provide specific examples dealing with

ranked tree hierarchies and the transformation between ranked self-referential schemas and flat schemas.

An MSL mapping is a statement of the form [8]:

M ::= foreach x1 in g1, ..., xn in gn
      where B1
      exists y1 in g'1, ..., ym in g'm
      where B2
      with e1 = e'1, ..., ek = e'k

Figure 5-1. The general form of an MSL mapping statement.

The mapping is a source-to-target constraint of the form Q_S ⊆ Q_T, where Q_S is a query over the source schema and Q_T is a query over the target schema. The first foreach-where clause forms Q_S and the second exists-where clause formulates Q_T. The mapping induces an inter-schema inclusion dependency [17, 28], or containment assertion: for each tuple returned by Q_S, there must exist a corresponding tuple in Q_T. The with clause explicitly establishes the correspondences between the elements of Q_S and Q_T. B1 and B2 establish associations within the elements of the same schema, such as joining two relations in a source schema.

Consider the Person relation in the global schema and the Agent relation in the source schema described earlier. An MSL expression for a mapping from Person to Agent can be written as follows:

foreach a in Agent
exists p in Person
with p.PersonName=a.AgentName
     p.InstitutionName=a.InstitutionName
     p.InstitutionCode=a.InstitutionCode

Figure 5-2. MSL mapping statement for a simple mapping between the Agent and Person relations.

The mapping in Figure 5-2 is a very simple one, which maps elements from the Agent relation to the elements of the Person relation,

where only a single relation from each schema is considered. The following is a more involved example, involving multiple relations in the source and target schemas:

foreach t in Taxon, l in Location, ll in Location, p in Person, pp in Person,
        i in Institution, ii in Institution, s in Specimen
where s.TaxonName=t.TaxonName
      s.LocationName=l.LocationName
      s.PersonName=p.PersonName
      p.InstitutionName=i.InstitutionName
      p.InstitutionCode=i.InstitutionCode
      i.LocationName=l.LocationName
      pp.InstitutionName=ii.InstitutionName
      pp.InstitutionCode=ii.InstitutionCode
      ii.LocationName=ll.LocationName
exists o in Occurrence, a in Agent
where a.AgentName=o.AgentName
      a.InstitutionName=o.InstitutionName
with o.TaxonName=t.TaxonName
     o.Family=t.Family
     o.Class=t.Class
     o.Species=t.Species
     o.LocationName=s.LocationName
     o.Country=l.Country
     o.State=l.State
     o.City=l.City
     o.AgentName=p.PersonName
     o.InstitutionName=p.InstitutionName
     a.AgentName=pp.PersonName
     a.InstitutionName=pp.InstitutionName
     a.InstitutionCode=pp.InstitutionCode
     a.LocationName=ll.LocationName
     a.Country=ll.Country
     a.State=ll.State
     a.City=ll.City

Figure 5-3. MSL statement mapping the complete global schema to the source schema.

The mapping shown in Figure 5-3 is a very involved mapping that shows the expressive power of MSL. Since this is a mapping from the global schema to the source schema, we will refer to the elements in the first part of the mapping as the source schema, although they are technically elements of the global schema. The mapping is comprised of three major parts, and we shall inspect each of them to completely understand the mapping. The first foreach-where clause deals with the source schema and establishes associations within its elements. The second exists-where clause deals with the elements of the target (actually the source schema) and establishes the associations between its elements. Finally, the with clause establishes the

correspondences between the elements of the source and target schemas. There are subtle points to take away from this mapping. The first foreach-where clause, which deals with the source schema, uses two instances each of the Location, Person and Institution relations. This is because the Specimen relation joins with these relations, while, on the other hand, the Person relation joins with Institution, which in turn joins with the Location relation. Since in the source schema a Person can exist independently, without being referred to by the Specimen relation, this distinction is important to make in order to achieve a correct schema mapping. This example shows how complex schema mappings can be clearly and easily defined using MSL.

MSL Evaluation and Shortcomings for Ranked Tree Hierarchies

We evaluate MSL's ability to express schema mappings between a ranked self-referential schema and a flat schema. We point out inherent difficulties in expressing such mappings and the need for special constructs that are not provided by MSL. We also provide a preview of what a mapping would look like had such constructs been provided, and point out the problems associated with mappings that would use them.

Dealing With Rank Jumps

Consider the ranked self-referential schema in Table 3-2 as the source, and the flat representation of the same data in Table 3-1 as the target. To express the mapping between these two schemas with MSL, we start by joining the self-referential relation with itself, using the ParentId as the join condition. This is similar to a join in a relational database, where the join is made on a value present in both tables or relations. To keep our example concise and easier to understand, we consider mapping only the Continent, Country and State ranks, but with data present in all five ranks.

Consider the following mapping:

foreach s in selfrefGeography, t in selfrefGeography, u in selfrefGeography
where s.ParentId=t.GeographyId
      t.ParentId=u.GeographyId
exists f in flatGeography
with f.GeographyId=s.GeographyId
     f.State=s.Name
     f.Country=t.Name
     f.Continent=u.Name

Figure 5-4. Initial, incomplete schema mapping between ranked self-referential and flat schemas using MSL, which does not yield a correct transformation.

For the mapping shown in Figure 5-4, we assume that the foreach-where clause would return elements with no parents (which might as well be the root of the tree) as well as elements with a parent and a grandparent, so this mapping may return elements linked up to three levels in the hierarchy. Inspection of the mapping in Figure 5-4 reveals several transformation issues. First, the mapping assumes ranks are static for the relations being mapped, i.e., it assumes s at the State rank, t at Country and u at Continent, without enforcing these assumptions through conditions; it would therefore erroneously map data at the County and City ranks. Second, the possibility of rank skips is not taken into account: a City that connects directly to a State, bypassing the County rank, would have its ancestry placed in the wrong rank columns. If it is known that selfrefGeography has only these three levels, that there is no desire to represent higher-rank nodes, and that there are no jumps in the hierarchy, then the mapping in Figure 5-4 is correct. These are unreasonable assumptions in practice, as our experience with real-world data in the iDigBio project has demonstrated: the data can come in any form and variation, and these anomalies are most likely to be present in real-world data. To deal with the first problem, the above mapping can be modified to explicitly enforce s at the State rank.
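As a point of reference, the naive mapping of Figure 5-4 corresponds roughly to the following SQL over the sample selfrefGeography schema. This is a sketch for illustration only, and it inherits exactly the problems just described:

SELECT s.GeographyId,
       u.Name AS Continent,
       t.Name AS Country,
       s.Name AS State
FROM selfrefGeography s
JOIN selfrefGeography t ON s.ParentId = t.GeographyId
JOIN selfrefGeography u ON t.ParentId = u.GeographyId;
-- Nothing constrains s, t and u to the State, Country and Continent
-- ranks, so County and City rows also match, and a rank skip shifts
-- ancestor names into the wrong columns.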

Consider the following mapping:

foreach s in selfrefGeography, t in selfrefGeography, u in selfrefGeography
where s.Rank='State'
      s.ParentId=t.GeographyId
      t.ParentId=u.GeographyId
exists f in flatGeography
with f.GeographyId=s.GeographyId
     f.State=s.Name
     f.Country=t.Name
     f.Continent=u.Name

Figure 5-5. Schema mapping between ranked self-referential and flat schemas using MSL, limited to mapping only State nodes.

The mapping in Figure 5-5 clearly states its intention to map only nodes at the State level, but it still leaves us with two problems: dealing with rank skips, and allowing higher-level nodes to be mapped. To allow mapping of higher-level nodes, the mappings need to be established conditionally, based on the rank of the node. In Figure 5-6 we show a mapping that would allow higher-level nodes to be mapped correctly, given that MSL is extended to support an if construct.

foreach s in selfrefGeography, t in selfrefGeography, u in selfrefGeography
where s.ParentId=t.GeographyId
      t.ParentId=u.GeographyId
      [s.ParentId=NULL]
exists f in flatGeography
with f.GeographyId=s.GeographyId
if (s.Rank == 'State') [
    f.State=s.Name
    f.Country=t.Name
    f.Continent=u.Name
]
if (s.Rank == 'Country') [
    f.Country=s.Name
    f.Continent=t.Name
]
if (s.Rank == 'Continent') [
    f.Continent=s.Name
]

Figure 5-6. Schema mapping between ranked self-referential and flat schema, assuming if conditional constructs are available in MSL.

In the mapping in Figure 5-6 we have gotten rid of the explicit condition to map only elements at the State rank. Also, we have set up conditions to deal with the possibility of s being a State, a Country or a Continent. The mapping in Figure 5-6 still leaves us with the

problem of rank jumps. For example, if s is a State, the mapping assumes that t is a Country and u is a Continent. It may very well be possible that s connects directly to a Continent, skipping the Country rank, but this mapping does not account for that case. One solution to deal with rank jumps would be to account for every possible combination of jumps and create a conditional statement for each, but writing a mapping expression then turns into a tedious task requiring careful placement of possibly nested conditions, as shown in Figure 5-7 (imagine the number of conditions as the number of ranks and the combinations of rank skips grow).

foreach s in selfrefGeography, t in selfrefGeography, u in selfrefGeography
where s.ParentId=t.GeographyId
      t.ParentId=u.GeographyId
      [s.ParentId=NULL]
exists f in flatGeography
with f.GeographyId=s.GeographyId
if (s.Rank == 'State') [
    f.State=s.Name
    if (t.Rank == 'Country') { f.Country=t.Name }
    if (t.Rank == 'Continent') { f.Continent=t.Name }
    if (u.Rank == 'Continent') { f.Continent=u.Name }
]
if (s.Rank == 'Country') [
    f.Country=s.Name
    if (t.Rank == 'Continent') { f.Continent=t.Name }
]
if (s.Rank == 'Continent') [
    f.Continent=s.Name
]

Figure 5-7. Schema mapping between ranked self-referential and flat schema, assuming if conditional constructs are available in MSL, showing how the nested conditions used to deal with rank skips complicate the schema mapping.

As stated earlier, the idea of schema mappings is to provide a succinct, declarative way of expressing the mappings between the source and target schemas. The presumed addition of conditions has drifted the mapping away from

a declarative form to a more procedural form. This example has so far only shown challenges encountered while transforming from a ranked self-referential schema to a flat schema. We encounter similar challenges in the opposite direction: a mapping from flat to ranked self-referential schema would also have to specify special conditions to deal with cases of rank jumps in that direction, such as a City connecting directly to a State. It should be noted that this evaluation only addressed the challenges created by rank jumps. To deal with rank mismatches, yet additional conditions need to be specified, and identifier generation functions need to be created to deal with a potential lack of identifiers.

Dealing With Rank Mismatches

Consider the same selfrefGeography and flatGeography given in Figures 3-2 and 3-3, respectively. Now assume that flatGeography only has columns for Country, State and City, i.e., no elements at the Continent or County ranks are represented in this variation of the schema:

flatGeography2(GeographyId, Country, State, City)

Rank mismatches complicate the mapping procedure in a number of ways. The simplest case of a rank mismatch is that of a missing rank at either the root or the leaf ranks of the hierarchy. These two cases can be dealt with easily by simply ignoring those ranks from the source schema. In the above schema (flatGeography2) we notice the absence of the Continent rank, which is the root rank in selfRefGeography; hence a case of a missing rank at the root level. The other case of rank mismatch is that of a missing rank at an intermediate level. In flatGeography2, the missing

County rank is an example of this case. A correct schema mapping would attempt to establish the correct hierarchical link by finding the next available parent whenever the current node's parent is at a rank that is not present in the target schema. Consider the case of the City 'Gainesville' with GeographyId 4670 in Table 3-2. Its ParentId points to the County 'Hall County' (GeographyId: 4380) in the hierarchy. But since our target schema flatGeography2 does not accommodate the County rank, the mapping should instead link the City to the next available ancestor, the State 'Georgia' (GeographyId: 1250), in the target schema. A schema transformation could pose one of these challenges or a combination of them. It is evident from Figure 5-6 that handling the case of rank mismatches in addition to rank jumps would require several more conditions to be added to the schema mapping, making it even more complicated.

Dealing With Lack of Identifiers

[10] documents the use of Skolem functions to generate new values in the target schema. One-to-one Skolem functions can be used to generate distinct values for unmapped fields in the target schema that must not be null, either key elements or fields referred to by another relation using a key (foreign key). The flatGeography schema shown in Table 3-1 already has a GeographyId field; therefore, we did not need such functions in the previous examples, as we directly mapped GeographyId from selfrefGeography to this field in the target flat schema. Consider a variation of flatGeography as follows:

flatGeography3(Continent, Country, State, County, City)

This variation has no GeographyId field. If flatGeography3 is to be mapped to selfrefGeography, then we have a case of missing identifiers in the source schema. A Skolem function can be used to generate distinct identifiers in the target schema; its input would be all the elements along the distinct paths in the flatGeography3 schema.

Nested Schema Mappings

[27] introduced the concept of nested mappings, which are based on arbitrary nesting of mappings inside other mappings. Especially for nested or hierarchical data models, nested mappings provide a more natural programming paradigm. Nested mappings overcome the disadvantages of conventional MSL mappings by eliminating the need for several mapping statements and replacing them with fewer mapping statements that capture the nested or hierarchical nature of the data. This automatically results in less data redundancy during data transformation. As a consequence of expressions that naturally capture hierarchy, nested mappings have a built-in grouping behavior that follows the grouping pattern in the source data. As demonstrated in [27], nested mappings work in a nested relational model, which can model both relational data sources as well as hierarchical (e.g., XML) data sources. Since the ranked self-referential schemas we are interested in are recursive in nature, our evaluation concludes that nested schema mappings, although well suited to non-recursive nested data, are not enough to encapsulate the proper semantics required to express mappings involving ranked self-referential schemas.

CHAPTER 6
CONTRIBUTIONS

In this section we present our approach to solving the problem of transformations between ranked self-referential and flat schemas. We take a generic approach to the challenges this transformation poses, taking into account the major problems highlighted in the previous sections, i.e., rank jumps, rank mismatches and lack of identifiers:

Transformation Algorithms: a description of the transformation algorithms that achieve the transformations from ranked self-referential schemas to flat schemas and from flat to ranked self-referential schemas.

MSL+: a set of extensions to MSL to abstract and simplify the definition of mappings between ranked self-referential and flat schemas.

Tree Transformer (TT): a custom component that implements MSL+ for the open source Talend ETL tool.

Transformation Algorithms

Before we go into the details of MSL+ and the TT components, we present an explanation of the algorithms that these solutions are designed to encapsulate. A good understanding of how these algorithms work helps in appreciating the level of abstraction that MSL+ and the TT components provide to achieve these transformations with simple declarative assertions and user-friendly graphical user interface (GUI) based components. We first explain the algorithm for the transformation from ranked self-referential schema to flat schema, and later the algorithm for the transformation from flat schema to ranked self-referential schema.

Ranked Self-Referential to Flat Transformation Algorithm

For a ranked self-referential schema as shown in Figure 3-3 to be transformed to a flat schema like the one shown in Figure 3-2, the following algorithm is used:

1. Iterate through all rows in the ranked self-referential schema.
   i) For each row, based on its rank, populate the appropriate rank column in the target flat schema with the Name and ID properties from the source ranked self-referential schema.*
   ii) Upper hierarchy: populate the higher-hierarchy columns in the target flat schema with the Name* of the element pointed to by the ParentId column, until a NULL or no value (empty) is encountered in the ParentId column.

* The Name and ID columns used here are in contextual reference to our sample schemas. These could be differently labelled columns in a different schema, but the underlying concept stays the same.

Figure 6-1 shows this algorithm in action, going through the steps of populating a single row from the source schema into the target flat schema. This process is repeated for all rows.

Flat to Ranked Self-Referential Transformation Algorithm

Flat schemas usually come in two different flavors. In one, as shown in Figure 3-2, an identifier is attached to the element being stored in the flat schema. In the other, there is no identifier identifying the element stored in the flat schema. This happens either because there was no identifier at all in the flat schema (Table 6-1), or because the identifier identifies some other element in the flat schema (Table 6-2). Table 6-2, for example, stores specimen object occurrences: it is a flat table storing the taxonomic identity and geographic location of each specimen object in flat format. The identifier in this table only identifies the specimen object; it identifies neither the taxonomic object nor the geographic location.

Figure 6-1. The transformation algorithm for the ranked self-referential schema to flat schema transformation.
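To illustrate the ancestor walk of step ii) concretely, the sketch below renders it in SQL for the five-rank geography example. This is an illustrative query only, not the one TT generates (TT additionally parameterizes table, column and rank names): placing each ancestor by its own Rank value, rather than by its join depth, is what tolerates rank jumps.

SELECT n.GeographyId,
       COALESCE(CASE WHEN n.Rank  = 'Continent' THEN n.Name  END,
                CASE WHEN p1.Rank = 'Continent' THEN p1.Name END,
                CASE WHEN p2.Rank = 'Continent' THEN p2.Name END,
                CASE WHEN p3.Rank = 'Continent' THEN p3.Name END,
                CASE WHEN p4.Rank = 'Continent' THEN p4.Name END) AS Continent,
       COALESCE(CASE WHEN n.Rank  = 'Country' THEN n.Name  END,
                CASE WHEN p1.Rank = 'Country' THEN p1.Name END,
                CASE WHEN p2.Rank = 'Country' THEN p2.Name END,
                CASE WHEN p3.Rank = 'Country' THEN p3.Name END) AS Country,
       COALESCE(CASE WHEN n.Rank  = 'State' THEN n.Name  END,
                CASE WHEN p1.Rank = 'State' THEN p1.Name END,
                CASE WHEN p2.Rank = 'State' THEN p2.Name END) AS State,
       COALESCE(CASE WHEN n.Rank  = 'County' THEN n.Name  END,
                CASE WHEN p1.Rank = 'County' THEN p1.Name END) AS County,
       CASE WHEN n.Rank = 'City' THEN n.Name END AS City
FROM selfrefGeography n
LEFT JOIN selfrefGeography p1 ON n.ParentId  = p1.GeographyId
LEFT JOIN selfrefGeography p2 ON p1.ParentId = p2.GeographyId
LEFT JOIN selfrefGeography p3 ON p2.ParentId = p3.GeographyId
LEFT JOIN selfrefGeography p4 ON p3.ParentId = p4.GeographyId;
-- Each source row yields one flat row; the LEFT JOINs keep rows whose
-- ancestry is shorter than four levels.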

Table 6-1. A flat schema similar to the one in Table 3-1, but with no identifier column.

Continent       Country         State     County           City
North America
North America   United States
North America   United States   Florida
North America   United States   Georgia
North America   United States   Florida   Alachua County
North America   United States   Georgia   Hall County
North America   United States   Florida                    Gainesville
North America   United States   Georgia   Hall County      Gainesville

Table 6-2. A flat schema representing specimen taxonomy information. Here the identifier SpecimenID identifies the specimen object and not the location information, which is also present in this schema.

SpecimenID   TaxonName     Continent       Country         State     County        City
1001         Sylvilagus    North America   United States   Florida                 Gainesville
1002         Perognathus   North America   United States   Georgia   Hall County   Gainesville

The solutions we propose later in this section can perform a flat to ranked self-referential schema transformation from both variations of flat schema, i.e., the one with an identifier or the one without. We achieve this in MSL+ by introducing constructs that help build identifiers if none are present. Our TT components also have the capability to create new identifiers if none are present in the flat schema. The need for identifiers is emphasized in the flat to ranked self-referential schema transformation, because the self-referential schema uses these identifiers to create the links that build the hierarchy. Therefore, dealing with the problem of non-existing identifiers is paramount for the functionality and correctness of these transformations. The algorithm below for the flat to ranked self-referential schema transformation deals with both variations of flat schema.

1. Repeat the following for each rank in the flat schema:

   a) Get all the distinct paths starting from all non-null elements at this rank all the way up to the root rank.
      i) For each distinct path:
         (1) Create an entry in the target ranked self-referential schema for the element at this level.
         (2) For Id:
             (a) If the source flat schema has an identifier associated with this element, use it to populate the ID column in the target schema.
             (b) If the source flat schema has no identifier for this element, create an identifier based on the entire distinct path (use a Skolem function).
         (3) For ParentId:
             (a) If the source flat schema has identifiers associated, then look up the identifier of the next upper-rank non-null element in this distinct path, and use it to populate the ParentId column in the target schema. If unable to find a next upper-rank non-null element in this distinct path, leave the ParentId empty in the target schema.
             (b) If the source flat schema does not have identifiers, then create an identifier in a similar way as in step (2)(b), but using only the portion of the distinct path starting from the next upper-rank non-null element in this distinct path. If unable to find a next upper-rank non-null element in this distinct path, leave the ParentId empty in the target schema.
         (4) For Rank: use the current rank from step 1 to populate the Rank field in the target schema.

Table 6-3 shows how the above algorithm performs a flat to ranked self-referential schema transformation by labelling the distinct paths at each rank. There is a single distinct path at each of the Continent and Country ranks, labelled [A] and [B] respectively, while there are two distinct paths at each of the State, County and City ranks, labelled [C], [D] and [E] respectively. Distinct paths greatly simplify the schema transformation from flat schema to ranked self-referential schema. Each distinct path

represents a corresponding element in the target self-referential schema. Compared to the schema transformation in the other direction, this transformation is a more involved one.

Table 6-3. Distinct paths at each rank in the sample flatGeography schema data.

GeographyID        Continent [A]   Country [B]     State [C]   County [D]       City [E]
5     [A]          North America
198   [B]          North America   United States
1237  [C]          North America   United States   Florida
1250  [C]          North America   United States   Georgia
4371  [D]          North America   United States   Florida     Alachua County
4380  [D]          North America   United States   Georgia     Hall County
4600  [E]          North America   United States   Florida                      Gainesville
4670  [E]          North America   United States   Georgia     Hall County      Gainesville

MSL+

MSL+ is a set of functions extending IBM's MSL to express mappings specifically between ranked self-referential and flat schemas, hiding the complexity of dealing with mappings that change dynamically with the data and recovering the declarative nature of MSL. Each of these functions is described next.

GetRankName(rankID): a function that outputs the name of a rank, given the numeric ID of the rank. This function is useful in scenarios where a separate relation stores information about the ranks in the hierarchy, associated with the main relation through numeric IDs. Mappings are more user-friendly with the translation of IDs into names, but using this translation is not a requirement.

GetParents(n): given a self-referential node n, returns all the ancestry of n as a relation, excluding the input n itself. As an example, consider the node 'Florida' (GeographyId: 1237) in Table 3-2. GetParents would return a relation with two nodes, 'United States' and 'North America', at the Country and Continent ranks,

respectively. The ability of the GetParents function to return a relation enables the result to be iterated over inside a mapping expression.

GetDistinctPaths(rel, rankID): returns a relation with all the distinct paths for nodes at rankID all the way up to the hierarchy root, given a relation rel of a flat schema and a rankID in that schema. The function assumes that the rankID is numeric and that its name can be resolved through the GetRankName function. Each tuple in the output relation represents a unique path. The schema of the output tuple varies with the input rank, having as many columns as the number of ancestors that the rank has. All output columns are non-NULL.

ParentId(node): returns the ID assigned to the parent of the given node, or generates one (e.g., using a Skolem function [11]), where node includes the node's entire ancestry.

Ranked Self-Referential to Flat Schema Mapping

Consider the same example of mapping selfrefGeography to flatGeography. The new constructs from MSL+, namely GetParents and GetRankName, can be used to succinctly express the mapping as shown in Figure 6-2. As in MSL, it maps GeographyId from the source to the target schema. Instead of building a flat hierarchy by self-joining selfrefGeography multiple times, the hierarchy is built by dynamically mapping each ancestor returned by GetParents into the corresponding column in the flat representation. The GetRankName function helps determine the name of the rank of the ancestor (if one exists) and hence the proper column in the flat schema. Comparing the mappings in Figure 5-7 and Figure 6-2, the mapping in Figure 6-2 is not only compact but also accurate, eliminating the problems identified in the previous section.

foreach s in selfrefGeography
exists f in flatGeography
with f.GeographyId = s.GeographyId
[ foreach sp in GetParents(s)
  f.[GetRankName(sp.Rank)] = sp.Name ]

Figure 6-2. Schema mapping between ranked self-referential and flat schemas using the new constructs offered by MSL+.

Flat to Ranked Self-Referential Schema Mapping

Using the new GetDistinctPaths construct to extract all distinct elements at a certain rank from flatGeography, an inverse mapping from flatGeography to selfrefGeography can be succinctly defined as in Figure 6-3, iterating over all ranks. The ParentId function is responsible for dealing with the potential lack of identifiers.

foreach r in ranks
[ foreach f in GetDistinctPaths(flatGeography, r)
  exists s in selfrefGeography
  with s.Name = f.[GetRankName(r)]
       s.ParentId = ParentId(f) ]

Figure 6-3. Schema mapping for the flat to ranked self-referential schema transformation using the new constructs added in MSL+.

Tree Transformer

While mappings between ranked self-referential and flat schemas can be expressed through MSL+ in a user-friendly manner, in this section we present Tree Transformer (TT), which performs the actual data transformation based on MSL+. Given the proprietary nature of CLIO, TT is implemented as a custom component in the open source Talend Open Studio (TOS) for Data Integration. TOS offers a graphical user interface (GUI) through which the user can compose and configure a complex workflow of data transformations using a set of built-in components. Hence, the GUI offered by TT can be viewed as a graphical representation of MSL+.
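Before turning to the components, the following SQL sketch makes the semantics of GetDistinctPaths and ParentId concrete for one iteration of Figure 6-3: the State rank of flatGeography3, which has no identifiers. It is an illustration of the idea rather than the query TT generates; MD5 over the distinct path plays the role of the Skolem function, and string-valued identifiers are assumed (any injective identifier scheme would do):

-- One such INSERT is issued per rank, from root to leaf.
INSERT INTO selfrefGeography (GeographyId, Name, Rank, ParentId)
SELECT DISTINCT
       MD5(CONCAT_WS('|', Continent, Country, State)) AS GeographyId,
       State                                          AS Name,
       'State'                                        AS Rank,
       CASE  -- link to the next non-null upper rank, skipping gaps
         WHEN Country   IS NOT NULL THEN MD5(CONCAT_WS('|', Continent, Country))
         WHEN Continent IS NOT NULL THEN MD5(CONCAT_WS('|', Continent))
         ELSE NULL                   -- no upper-rank element: a root node
       END                                            AS ParentId
FROM flatGeography3
WHERE State IS NOT NULL;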

Two TT components were developed: one to transform from a flat schema to a ranked self-referential schema (F2S), and another for the reverse direction (S2F). Both components have their core transformation implemented in the form of MySQL queries that are generated using Java, given that TOS uses Java and JavaJet for component development. The procedural complexities presented in Chapter 4 for dealing with rank jumps, lack of identifiers and rank mismatches are handled by the custom-generated SQL query, while only the MSL+ inputs are expected from the user. The user inputs identify the ID, ParentID and RankID columns in the source self-referential table. The component also requires the names and IDs of all ranks in the hierarchy, along with mapping information. Transformation jobs in TOS can be exported as a simple Jar file, making it easy to share complex transformations as a job that can be run on any machine with just Java installed. Even though the current implementation focuses on MySQL, it would be trivial to extend it to other relational databases.

CHAPTER 7
EXPERIMENTAL SETUP

In this section we take a very pragmatic approach to evaluating our MSL+ mappings and TT components. We make use of real transformation scenarios required by the iDigBio project when aggregating biodiversity data from a variety of data providers. Two noteworthy database management systems heavily used by this community are Specify [19] and Symbiota [21], respectively storing tree hierarchies in ranked self-referential and flat schemas. In addition, the community has defined the Darwin Core (DwC) standard [18] as a bag of terms that implicitly forces a flat representation of the data. Therefore, in our evaluation we consider transformations in both directions (F2S and S2F), as well as a complete transformation job between the two popular databases. The datasets used in the evaluation also reflect the distribution of data extracted from real databases, as further detailed in this section.

Experimental Datasets

A snapshot of the KUIP collection's Specify database was used as the sample dataset. The ranked self-referential taxon table in this database contains a hierarchy of biological taxonomic names and their properties. In this database instance, the taxon table contains 8183 records. For the purpose of testing the scalability of our solution, 100 random ranked self-referential datasets varying in size were created (the datasets range from 10,000 to 1 million records in increments of 10,000 records). These datasets were generated taking into account the statistical distribution of the KUIP dataset, to capture the idiosyncrasies of real data. The statistics considered were (a) the percentage of records at each level in the hierarchy, and (b)

the percentage of jumps from each level to other levels higher in the hierarchy. This distribution of records and jumps to other levels is shown in Figure 7-1.

Figure 7-1. Distribution of 8183 Taxon nodes in the KUIP database. Different colors indicate the higher ranks to which nodes at each rank link; the synthetic datasets were generated based on these statistics.

Machine Configuration

All the experiments were conducted on an IBM blade server with the following configuration: two Intel Xeon 5140 dual-core processors with 8 GB of memory running CentOS 6.4, MySQL version 5.7 Developer Milestone Release (DMR), Java SE Runtime Environment 7 (1.7.0_21), Talend Open Studio for Data Integration 5.2.3 and Specify 6.5. All MySQL tables use the InnoDB storage engine, with the innodb_buffer_pool_size parameter increased to 4 GB. All experiments show the average

and standard deviation of 10 transformations. For experiments with TT, the execution time was measured using the tChronometerStart and tChronometerStop components. The tLoop and tRunJob components were used to run each job multiple times and loop through all the sample datasets.

Experiment Scenarios

S2F Transformation Experiment

In addition to observing the scalability of the S2F TT component, we compared its performance with two other solutions: one as implemented by Specify, and the other using only TOS built-in components. The query interface in Specify has the limitation of displaying only 20,000 records at a time, restricting the number of experiments that could be performed using Specify itself. We therefore built a simulator that reproduces Specify's S2F transformation, based on the MySQL query log captured during such a transformation. For a fair comparison, we verified that our Specify simulator presented the same performance as Specify for datasets with fewer than 20,000 records.

For the transformation using only TOS built-in components, the tMySQLInput and tMap components were used. The tMap component is at the heart of this transformation, as it joins various tables and defines expressions to filter and direct the flow of data. For this particular transformation, with the data distribution in Figure 7-1, 21 instances of tMySQLInput were used, one for each level in the hierarchy. Each of these inputs is joined with the taxontreedefitem table, which contains the rank information. Thus a total of 42 tMySQLInput components are used to perform this transformation. All tMySQLInput components feed into the tMap component, where data are joined together and directed to the appropriate columns in the target flat schema using conditional

expressions. Figure 7-2 shows a screenshot of a complete TOS job for the ranked self-referential to flat transformation, using only built-in components.

Figure 7-2. TOS job for transformation from ranked self-referential schema to flat schema. All tMySQLInput components feed into the tMap component, which joins all of them together.

F2S Transformation Experiment

We compared the F2S TT component with transformations that use only TOS built-in components. The transformations using only built-in components use tMySQLInput, tUniqRow and tJavaRow components, one of each for each rank in the tree hierarchy. A tUnite component is used to merge the results from the other components into a single output. The

tJavaRow component is used to specify custom code for performing the transformation. Figure 7-3 shows a partial screenshot of the TOS job for achieving this transformation.

Figure 7-3. A partial screenshot of the TOS job using only TOS built-in components to achieve the flat to ranked self-referential schema transformation.

iDigBio Complete Schema Transformation Experiment

TT components can be used alongside other TOS built-in components to design larger schema transformation jobs. We consider the example of exporting specimen occurrence data from the KUIP Specify database to a Darwin Core [18] file (a comma-separated file containing specimen data characterized by terms in the Darwin Core vocabulary). This target flat file contains biological specimen records transformed from data in ranked self-referential format in Specify. Figure 7-4 shows how the job combines four TT

S2F components (for taxonomy, geography, geologic time, and lithostratigraphy) with the built-in components tMySQLInput (for CollectionObject, Determination, Collection, CollectingEvent, Locality, GeoCoordDetail, PaleoContext, Discipline, Division, and Institution), tMap and tFileOutputDelimited to create the transformation to the Darwin Core flat file.

Figure 7-4. TOS job showing different KUIP database tables being combined to produce the Darwin Core file output. The components with a circular arrow pointing toward them are TT S2F components; their inputs first undergo an S2F transformation leading to a flat output.

CHAPTER 8
EXPERIMENTAL EVALUATION

Ranked Self-Referential to Flat Transformation

Execution Times

The S2F transformation times for all 100 datasets, using the simulated Specify transformation, Talend built-in components and TT, are shown in Figure 8-1. Simulated Specify takes the most time. For the 1 million record dataset, the transformation takes an average of 20 minutes for Specify, 109 seconds for TOS built-in components and only 49 seconds for TT S2F. This is due to Specify issuing several MySQL queries per node to traverse the hierarchy. The TOS built-in components provide faster execution times at the cost of higher memory consumption: the Java process running this TOS job had to be allocated 6 GB of memory to prevent it from running out of memory during the experiment, and its maximum memory usage during these experiments was 4.6 GB. The TT S2F component leverages the power of the database engine by generating a MySQL query over the source table. Whereas the built-in components rely heavily on performing the transformation programmatically (in Java, in this case), the TT S2F component leverages the power of SQL queries. The execution times for the TT S2F component are half the execution times of the built-in components. Even though the InnoDB buffer pool was allocated 4 GB of memory, the maximum memory usage of the MySQL process was 700 MB.

Figure 8-1. Comparing the execution times (in seconds, log scale) of the different methods of schema transformation from ranked self-referential to flat schema, over datasets of 10,000 to 1,000,000 records: Specify simulator S2F, Talend built-in component S2F, and Tree Transformer S2F.

TT S2F User Effort

The SQL query generated for this transformation by the TT S2F component is a complex query containing 25,539 characters, with 441 if conditions and 20 joins. Writing a similar query manually for hierarchies with a large number of ranks is an error-prone and time-consuming task. The TT S2F component (Figure 8-2) reduces this task by requiring input that is only 322 characters long (Figure 8-3), 79 times smaller than the SQL query. The TOS job requires only the TT S2F component to perform the actual transformation. A TOS job that uses built-in components requires many more components to achieve the same transformation: in this case, 43 built-in components were used, which also required writing Java code snippets (Figure 8-5) for the tMap component and establishing join links in the tMap component (Figure

8-4) manually. These code snippets and expressions are non-trivial for an end user not accustomed to programming.

Figure 8-2. A screenshot of a TOS job for the ranked self-referential to flat schema transformation. The tH2F component on the left does the entire transformation, greatly reducing the effort required to perform it. The tMap component here is used only to filter out fields not required in the output.

Figure 8-3. Inputs required for the tH2F component. See Appendix E for a description of the input fields.

Figure 8-4. A screenshot of the tMap component used in the TOS job for the ranked self-referential to flat schema transformation, using only TOS built-in

components. The left pane shows all inputs from the 42 tMySQLInput components used in this transformation job. The user has to establish proper joins between all these inputs. The middle pane is used to define special conditions on incoming data; in this case we define complicated nested conditions (Figure 8-5) to correctly identify rank information and define the output for each rank.

Figure 8-5. A Java expression defined in the tMap component to correctly identify the rank and define the output. Such expressions are defined for each rank, making the transformation process long and cumbersome; the more ranks there are, the longer and more complicated these expressions become.

Flat to Ranked Self-Referential Transformation

Execution Times

F2S transformation times for TOS built-in components and F2S TT are shown in Figure 8-6. Jobs using only TOS built-in components perform better than the jobs using the TT F2S component by a small margin: the 1 million record dataset runs in 102.5 seconds for the TT F2S job, whereas the same dataset runs in 94.8 seconds using TOS built-in components. Additional experiments conducted

to isolate the different steps of the transformation revealed that the generated MySQL queries perform slower than the tUnite component, which is used in the transformation with TOS built-in components.

Figure 8-6. Comparing execution times (in seconds) for the flat to ranked self-referential schema transformation, over datasets of 10,000 to 1,000,000 records: Talend built-in component F2S versus Tree Transformer F2S. TT F2S shows slightly higher execution times compared to the jobs using built-in components.

User Effort

The SQL query generated by the TT F2S component for this experiment contains 90,010 characters, with 1750 if conditions. Again, writing such a query by hand is a long, error-prone process. The TT F2S component (Figure 8-7) simplifies this task by requiring a small amount of input, only 714 characters long (Figure 8-8). To achieve the same transformation using TOS built-in components, 65 built-in components were used (Figure 7-3). This also required custom programming components (tJavaRow) and specialized Java code (Figure 8-9) to be written for each incoming rank.

Figure 8-7. A screenshot of the TOS job for the flat schema to ranked self-referential schema transformation, using the TT F2S component. Shown on the left, this component performs the entire flat to ranked self-referential schema transformation.

Figure 8-8. Inputs required by the TT F2S component to perform the schema transformation from flat to ranked self-referential schema. See Appendix E for details about the input fields.

Figure 8-9. A sample of the custom Java code required for the tJavaRow component used in the TOS job for the flat to ranked self-referential schema transformation using only TOS built-in components. The code deals with generating new identifiers for the target schema.

A Complete Schema Transformation Example

Execution Time

Experiments with Specify's Data Exporter tool show that it takes a large amount of time to generate the output Darwin Core file. For the KUIP database, set to extract around 110,000 specimen records, it took the Specify Data Exporter tool 12.6 minutes to produce the output file. The same transformation in TOS, using TT components along with other built-in TOS components, took just 15 seconds, showing significant savings in computation time compared to the Specify Data Exporter tool.

User Effort

Specify provides a schema mapper tool to map Specify data to arbitrary schemas like Darwin Core, with an easy-to-use and intuitive interface. TOS, on the other hand, is a graphical programming and workflow interface. The effort required to create a TOS job using TT and other TOS built-in components is significantly more involved and requires the user to understand the Specify relational database schema (Figure 7-4). Designing this transformation job is, however, a one-time task, and the faster execution time of the job reaps more benefits in the long run. This TOS job uses 4 TT S2F components; designing a similar TOS job using just the built-in components would have required at least 88 more components and, as shown in our previous results, it would have been slower. Using the Tree Transformer components allows us to achieve the same transformation with just 16 components. This TOS job using TT components can be exported and reused on any machine able to run Java.

CHAPTER 9
CONCLUSIONS AND FUTURE WORK

Given the complexity and challenges faced by a real biodiversity data integration project when dealing with hierarchical data, we have proposed MSL+, an extension of MSL to succinctly express schema mappings involving tree-hierarchical data in either ranked self-referential or flat schemas. As a proof of concept we have provided a sample implementation of MSL+ as a GUI tool for TOS called Tree Transformer (TT). Two TT components for TOS were developed for transformations between flat and ranked self-referential schemas. Our experiments show that our TT components perform exceedingly well in terms of execution time compared to Specify and TOS built-in components, with a 2 to 25-fold decrease in execution times. TT components are also user-friendly and make it easy for the end user to achieve such transformations without requiring any database or programming knowledge. Finally, we showed that our sample implementation can be used to aid in larger schema transformations.

The issues addressed in this thesis serve as a stepping stone for further work on automating such complex schema transformations, especially in the realm of big data. Integration of large amounts of data for migration or consolidation purposes is a problem currently being studied in depth. Big data solutions focus on data processing, but data preparation is also an essential part of the entire process. Data scientists increasingly depend on tools that help with the data migration and preparation process: the less time they spend on bringing in and preparing data, the more time they can spend on solving the actual problem. We intend to extend our work into the big data world, where our solution could perform schema transformations on clusters with high throughput.

Products like HPCC Systems and their programmable interfaces are open source systems that can be used as foundations to build upon.

APPENDIX A
GENERATING TEST DATA

TT components can be used in an isolated setting to convert a ranked self-referential database table to a flat schema representation and vice versa. The KUIP Specify database has 8183 taxonomic determinations. The Taxon table contains this information and is ranked self-referential in nature. The Taxon table refers to the RankId column in the TaxonTreeDefItem table, which defines the different ranks in the taxonomic hierarchy for the KUIP specimen collection. Since 8183 records is a really small number, a larger dataset was needed to run our experiments. Since the generated data should represent a real-world distribution of taxa, we used statistics from the KUIP Taxon table to generate new data. The following statistics were considered:

1. The number of taxonomic determinations at each rank, i.e., the number of taxon records identified at a certain rank in the hierarchy.

2. The percentage of links from elements at a certain rank to parent taxa at different ranks.

These two statistics are shown in graphical form in Figure 7-1. The horizontal bar shows the number of taxa at each rank, and the different colors inside the horizontal bar show the number of records at that rank that connect to their parent at another, higher rank. Table A-1 shows the distribution of records at each taxon rank. A small Java program was written to take these statistics into account and generate new datasets based on them. A total of 100 datasets were generated, from 10,000 records to 1 million records with a step size of 10,000 records. We used the percentage of records at each rank in the original dataset to generate the same percentage of records in the new datasets. Each generated dataset

has a schema similar to the Taxon table in the Specify database, which is ranked self-referential in nature. These datasets were used to test the TT S2F component. We used the flat schema output from the TT S2F experiments as the input for the TT F2S experiments; hence we did not need to generate new datasets for the flat schema.

Table A-1. Taxon ranks in the KUIP database, with the name of each taxon rank and the number of records at each rank.

Rank Name     RankId  Number of Records
Kingdom       10      6
Subkingdom    20      0
Phylum        30      25
Subphylum     40      9
Superclass    50      2
Class         60      67
Subclass      70      27
Infraclass    80      0
Superorder    90      5
Order         100     125
Suborder      110     86
Infraorder    120     0
Superfamily   130     229
Family        140     629
Subfamily     150     142
Tribe         160     1
Subtribe      170     0
Genus         180     2158
Subgenus      190     43
Species       220     4582
Subspecies    230     46
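For reference, statistics of this kind can be computed directly from the source table with queries along the following lines. This is a sketch: the join on RankId follows the description above, but the exact Specify column names used here (TaxonID, ParentID, Name) are assumptions for illustration:

-- Number of records at each rank (cf. Table A-1).
SELECT d.Name AS RankName, t.RankID, COUNT(*) AS NumRecords
FROM taxon t
JOIN taxontreedefitem d ON t.RankID = d.RankID
GROUP BY d.Name, t.RankID
ORDER BY t.RankID;

-- Percentage of links from each rank to each parent rank (cf. Figure 7-1).
SELECT c.RankID AS ChildRank,
       p.RankID AS ParentRank,
       100.0 * COUNT(*) /
         (SELECT COUNT(*) FROM taxon t2 WHERE t2.RankID = c.RankID) AS Pct
FROM taxon c
JOIN taxon p ON c.ParentID = p.TaxonID
GROUP BY c.RankID, p.RankID
ORDER BY c.RankID, p.RankID;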

APPENDIX B
USING SPECIFY TO OUTPUT A FLAT SCHEMA

Specify can be used to output the taxonomic data in the Taxon table in flat format. This equates to a ranked self-referential to flat schema transformation using Specify. To do so, follow the steps below:

1. Launch Specify.
2. Go to the Query tab.
3. Select Taxon from the list in the left pane. The Taxon relation, along with its attributes, will appear in the upper pane (Figure B-1).
4. Select the Taxon ranks to be part of the flat schema from the Taxon relation. After selecting a rank, click the small green plus button next to the list to add it to the output schema.
5. Click the Search button to execute the query.
6. The results are displayed in a tabular format.

Figure B-1. The Taxon relation selected in Specify, with its attributes/properties listed.

Figure B-2 shows the query for getting all the taxa in the database, with all the taxon ranks selected to be part of the output.

Figure B-2. The query for displaying all taxa in a flat format. All ranks in the taxon hierarchy have been selected for this query.

APPENDIX C
USING SPECIFY TO OUTPUT A DWC FILE

A DwC file is a comma-separated value (CSV) file that contains specimen data described by a bag of terms. The KUIP database snapshot we have contains 202,656 specimen records in the CollectionObject table. One can use the query interface described in Appendix B to output the specimen data from Specify, and later use the export feature in the results pane to export the results of the query to a text file. There are, however, the following limitations:

1. If the database has more than 20,000 records, as is the case with our KUIP instance, then separate queries in batches of 20,000 would need to be executed, and the exported files would later need to be combined.

2. The vocabulary used to address the same property in Specify may be different from the DwC standard; for example, several DwC terms are labelled differently in our KUIP Specify instance.

The first limitation requires that we use the Specify DataExporter.exe tool to export the data if the number of records exceeds 20,000. The second limitation requires that we create a mapping from terms in the DwC schema to the terms in the Specify schema. In this appendix we first show how to create a schema mapping, and then how to use the DataExporter.exe tool to export the data to a text file.

1. Open up the Specify Schema Mapper by going to the System menu.
2. There are two ways to create a schema mapping:

   a) with no predefined schema;
   b) with a predefined schema, which is imported separately.
3. For this example we are going to use the DwC predefined schema. In order to use the DwC schema to map Specify elements, it must first be imported. In the left pane, select the option to import a schema.
4. Choose an existing XML Schema Definition (XSD) file that contains the definition of the schema we would like to import; in this case, the Darwin Core schema file, which can be found in the config folder of the Specify installation. Select this file.

5. Give the imported schema a name and version.
6. Next, click the Create New Mapping button. A popup appears, showing the schema imported in the previous step.
7. Specify will try to automatically map the concepts in the imported schema to the concepts in its own schema. Inspect the mapping for correctness and make any

adjustments necessary.
8. Once satisfied with the mapping, save the schema mapping with a name; in this example we saved it under a name of our choice.

To export the data with the DataExporter tool:

1. Launch DataExporter.exe from the Specify installation folder.
2. Click the Build/Update button and wait for the operation to complete.
3. Save the output file with a name and an extension of your choice. This saves the file in a CSV format.

APPENDIX D
SPECIFY SIMULATOR

Because of Specify's limitation of displaying only 20,000 records at a time, we could not run our experiments for the ranked self-referential to flat schema transformation using Specify itself. As a workaround, we inspected the way Specify does the transformation and built a simulator for Specify that runs the transformation on datasets with more than 20,000 records. We used MySQL query logs to inspect the sequence of queries issued by Specify to perform the transformation, and then built a Java program that simulates the way Specify performs the transformation given a source table. We used this Specify simulator to run our experiments for the ranked self-referential to flat schema transformation. As noted earlier, the simulator takes less time than performing the same operation in Specify, because in Specify the user interface elements take time to populate. Thus it is safe to use the simulator for experimental purposes.
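The pattern the simulator reproduces is, in essence, one lookup per ancestor per node. Schematically, shown here against the sample selfrefGeography data of Table 6-3 (the actual logged queries differ in table and column names), it looks as follows:

-- Flattening one node (Gainesville, 4670) walks its ancestry one
-- query at a time:
SELECT Name, Rank, ParentId FROM selfrefGeography WHERE GeographyId = 4670;
SELECT Name, Rank, ParentId FROM selfrefGeography WHERE GeographyId = 4380;
SELECT Name, Rank, ParentId FROM selfrefGeography WHERE GeographyId = 1250;
-- ...and so on until ParentId is NULL, then again for the next node.
-- One round trip per ancestor per node is what makes this approach slow
-- on large datasets (Chapter 8).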

APPENDIX E
TALEND OPEN STUDIO

In this appendix we are going to:

1. introduce the TT S2F and F2S components and describe their required input fields;
2. design jobs to achieve:
   a) the ranked self-referential to flat schema transformation using the TT S2F component;
   b) the ranked self-referential to flat schema transformation using TOS built-in components only;
   c) the flat to ranked self-referential schema transformation using the TT F2S component;
   d) the flat to ranked self-referential schema transformation using TOS built-in components only;
3. furthermore, design a TOS job to export a DwC file using TT components.

Tree Transformer Components

TT components provide a visual interface for the MSL+ constructs described in Chapter 6. We developed the TT components to work with the MySQL database. As mentioned earlier, it is easy to develop such components to work with other databases as well. As is the case with most TOS components, our components also output a stream of results, which can be used with a combination of built-in TOS components.

TT S2F Component

The TT S2F component is a custom-developed TOS component. It takes several parameters as input to perform a correct transformation from a source ranked self-referential table to a flat output. The TT S2F component can be found under the Tree family in the component palette (Figure E-1).

Figure E-1. TT components shown in the component palette.

Figure 8-3 shows the input fields required by the TT S2F component. The following is a description of each field:

1. Source Table Name: the name of the source ranked self-referential table in the database.
2. ID Element: the identifier column in this source table.
3. Parent ID Element: the name of the column that points to the parent of the element in the current row.
4. Rank ID Element: the name of the column containing the rank identifier.
5. Ranks: a comma-separated list of the ranks in the ranked tree hierarchy represented by the source table. Each rank is described by a RankID and a rank name, separated by a pipe (|) character. For an example, see Figure 8-3.
6. Mapping: this field lists all the fields (comma-separated) that need to be mapped one-to-one from the source schema.
7. Column to Map: this field contains the name of the column in the source ranked self-referential table that should be used to populate the rank columns in the flat schema.
8. Ranks to Skip: a comma-separated list of ranks to be skipped in this transformation. An empty value means no ranks are to be skipped.

TT F2S Component

Figure 8-8 shows the input fields required by TT F2S to perform a transformation. The description of these input fields is given below:

1. Source Table Name: the name of the source flat schema table in the database.
2. ID Element: the name of the identifier column in the output schema.

3. Parent ID Element: the name of the column that points to the parent of the element in the current row in the output schema.
4. Rank ID Element: the name of the column containing the rank identifier in the output schema.
5. Ranks: a comma-separated list of the ranks in the ranked tree hierarchy represented by the source table. Each rank is described by a RankID and a rank name, separated by a pipe (|) character; for an example, see Figure 8-8. The RankID here is associated with the ranks in the target ranked self-referential schema.
6. Mapping: this field provides a mapping of the columns in the source flat schema to columns in the target schema, as a comma-separated string in which each source column is paired with a target column. As an example, consider the string in Figure 8-8, where each column of the source schema is mapped to a column of the target schema.
7. Ranks to Skip: a comma-separated list of ranks to be skipped in this transformation. An empty value means no ranks are to be skipped.
8. Preserve IDs: if the source flat schema has identifiers, use them.

Talend Jobs

Ranked Self-Referential to Flat Schema Transformation Using the TT S2F Component

This job performs a schema transformation from a ranked self-referential schema to a flat schema. The job design is fairly simple; follow the steps below to design the job.

1. In the left Repository pane, right-click on Job Designs and select the option to create a job.
2. Name the job appropriately. A new tab opens in the job design pane.
3. Drag the TT S2F component from the component palette to the job design window.
4. Double-clicking the component opens the component properties tab at the bottom.
5. Fill in the database connection parameters, followed by the input fields specific to the schema transformation, as described earlier.


6. Once all the input fields are populated, the component is ready to be executed to perform the transformation. The component sends out a stream of output rows, which can be consumed by built-in components that accept input connections. For this example we will use the tLogRow component.
7. Drag a tLogRow component onto the job design pane, adjacent to the TT S2F component.
8. Right-click the TT S2F component and select its main row output connector.
9. The mouse pointer is now connected to the component by a string; move the mouse over the tLogRow component and click it. There is now a connection between the TT S2F component and the tLogRow component.


10. The job is now ready to execute. In the toolbar at the top, click the green execute-job button.
11. The tLogRow component receives the output from the TT S2F component and displays it. The connection from the TT S2F component also displays the output throughput.

Ranked Self-Referential to Flat Schema Transformation Using TOS Built-In Components Only

This exercise uses tMySQLInput and tMap components, along with custom expressions, to achieve the desired transformation. We use the taxon table from the KUIP database for this example. The number of tMySQLInput components required for this example is twice the number of ranks, because in this particular case the taxon table is joined with the taxonTreeDefItem table.


It is possible that in other circumstances there is no need for the additional tMySQLInput components.
1. Drag a tMySQLInput component from the Databases family in the component palette. Fill in the database connection parameters; for the Query input field, we used a very simple select query.
2. Next, drag another tMySQLInput component, this time for the TaxonTreeDefItem table. Based on the source schema this may not be required. Again, for this table we used a very simple query.
3. Next, drag a tMap component from the Processing family in the component palette. The tMap component allows joining data sources, equivalent to a SQL join.
4. Connect the Main output from the Taxon tMySQLInput component to the tMap component.
5. Connect the Main output from the TaxonTreeDefItem tMySQLInput component to the tMap component.
6. Double-click the tMap component to bring up the tMap interface. In the left pane, both inputs are shown. Select the RankId field from the row1 input and drag and drop it over the RankId field on row2. This creates a link from row1 to row2 using RankId as a foreign key. Close the tMap editor.


7. Next, copy the pair of tMySQLInput components for each level in the tree hierarchy. In our case there are a total of 21 pairs, i.e., 42 tMySQLInput components in total. The SQL query for each of these components stays the same. Each pair feeds into the tMap component.
8. Next, in the tMap editor, connect each pair to the adjacent pair, using the ParentId as a foreign key to the TaxonId, between the two instances of the Taxon tMySQLInput component.


9. Once all the pairs are connected, we turn our attention to the right pane in the tMap editor. This is where most of the work is done: the output schema of the tMap component is defined here. We define a flat taxon output schema, based on the tree hierarchy in the source, as below.


10. Next, inside the tMap editor, select the TaxonID input from row1 in the left pane and drag it onto the TaxonID defined in the output schema in the right pane.
11. Next, in the middle pane, click the plus button to add a new variable. Name the variable and open the expression editor. Here we have to write a conditional expression that determines whether the current row belongs to the rank in question and, if so, assigns the row's value to the variable just created. For this case we wrote a conditional expression of this kind (see the sketch following this list).


12. After entering the expression, close the expression editor. You will note that the corresponding inputs from the left pane are now shown connected to the variable in the middle pane.
13. Drag the variable onto the corresponding output field in the right pane.
14. In the same way as described in step 11, create a variable, with a similar expression, for each rank in the hierarchy. Also connect those variables to the corresponding output fields in the right pane. Once done, the tMap editor looks like this.


15. Close the tMap editor.
16. Next, add a tFileOutputDelimited component to the job designer. Connect the output of the tMap component to the tFileOutputDelimited component.
17. The job design is complete and ready to execute. The tFileOutputDelimited component saves the file at the desired location.
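To make step 11 concrete, the following is a minimal sketch of the kind of conditional expression a tMap variable can hold. tMap expressions are written in Java; the input names (row1 for the Taxon input, row2 for the joined TaxonTreeDefItem input), the column names, and the rank name "Kingdom" are assumptions for illustration, not the thesis's exact expression.

    // Hypothetical tMap variable expression for the Kingdom rank: if the joined
    // rank-definition row identifies this taxon as a Kingdom, take its name;
    // otherwise the variable stays null.
    row2.Name != null && row2.Name.equals("Kingdom") ? row1.Name : null

One such variable is defined per rank (Kingdom, Phylum, and so on down the hierarchy), and each is connected to the matching column of the flat output schema; this is how the self-referential rows are pivoted into flat columns.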


Flat to Ranked Self-Referential Schema Transformation Using the TT F2S Component

In this job we will use the TT F2S component to perform a flat to ranked self-referential schema transformation. The steps are given below.
1. Create a new job.
2. Drag the TT F2S component from the component palette on the right onto the job design area. The TT F2S component is found under the Tree Transformer family.
3. Double-clicking the component opens up the component properties window in the lower pane.
4. Fill in the database connection parameters, followed by the input fields required specifically for the flat to ranked self-referential schema transformation, described earlier in this appendix.
5. Once all the input fields are filled in, the component is ready to execute. Like the TT S2F component, this component outputs a stream of records, which can be captured using a tLogRow component. Drag a tLogRow component adjacent to the TT F2S component and connect the two using the main output connector.
6. Click the run-job button to execute the job.


Flat to Ranked Self-Referential Schema Transformation Using TOS Built-In Components Only

In this section, we will design a TOS job to perform a flat to ranked self-referential schema transformation using TOS built-in components only. The components used are tMySQLInput, tUniqRow, tJavaRow and tUnite. Follow the steps below to design this job.
1. Place a tMySQLInput component in the job design window. In this job there is a tMySQLInput component for each rank, so for our tree hierarchy of 21 ranks there are 21 tMySQLInput components. Enter the database connection parameters and a query over the flat schema table; this query is for the root rank of the hierarchy.
2. Next, place a tUniqRow component onto the design panel. Connect the tMySQLInput component's main output to the tUniqRow component.
3. Double-click the tUniqRow component. Then click the check box under the Key attribute column for the rank's field.


4. Next, place a tJavaRow component and connect the tUniqRow unique output connection to it.
5. In the component properties, define the output schema to match the target ranked self-referential schema.
6. In the properties, add code that generates an identifier for this element in the target schema and assigns the id and the rank's values to the output schema (a sketch of such code is given after this list).
7. As described in steps 1 through 6, add a combination of tMySQLInput, tUniqRow and tJavaRow components for each rank in the hierarchy, adjusting the query and the code for each rank.


8. Once you have a similar combination for each rank, the job will look like this.
9. The next step is to combine the output from all these blocks. Drag a tUnite component onto the job design window. Connect the output of all tJavaRow components to the input of the tUnite component in order of increasing rank in the hierarchy, so as to make sure the merge order is correct.


10. The tUnite component outputs the ranked self-referential schema. In this example we direct the output to a file.
11. The job is complete and ready to execute.
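The following is a minimal sketch of the per-rank tJavaRow code referred to in step 6, written under stated assumptions: the output column names (id, parentid, rankid, name), the flat-schema column Kingdom, the RankID value 10, and the example query in the comment are all hypothetical; Numeric.sequence is a standard Talend system routine for generating sequential values.

    // Hypothetical tJavaRow body for the root rank, assuming the upstream
    // tMySQLInput query selected the distinct root-rank values, e.g.:
    //   SELECT DISTINCT Kingdom FROM flat_taxon
    output_row.id = Numeric.sequence("taxon_id", 1, 1); // generate a unique identifier
    output_row.parentid = null;          // the root rank has no parent
    output_row.rankid = 10;              // RankID of this rank in the target schema
    output_row.name = input_row.Kingdom; // the rank's value from the flat schema
    // Lower ranks would additionally derive parentid from the id generated
    // for the corresponding parent-rank row.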


Creating a DwC File Using TT Components

This job design is based on the Specify 6.5 schema. Designing this job requires the use of TT S2F components in combination with other TOS built-in components. The output schema for this job is shown below.

The four TT S2F components used in this job are for the geographic, taxonomic, chronostrat and lithostrat trees. The rest of the components are tMySQLInput components, which just select the required columns from their tables. The following figure shows which tables are involved in designing this job. These tables are all connected together in the tMap component and the output is saved to a text file.




BIOGRAPHICAL SKETCH

Sarfaraz Ahmed Soomro was born in Karachi, Pakistan. He received his Bachelor of Engineering in computer and information systems engineering from NEDUET, Karachi, in December 2007. He worked as a software engineer before joining the University of Florida in fall 2011 for a master's in computer science. He joined the Advanced Computing and Information Systems (ACIS) lab in November 2011. Sarfaraz has been associated with the iDigBio project since May 2012, with a research focus on automating schema transformation patterns, particularly those involving tree hierarchies. His other research interests include big data analytics, data integration and large-scale databases.