Group Title: Department of Computer and Information Science and Engineering Technical Reports
Title: Genomics algebra : a new integrating data model, language, and tool for processing and querying genomic information
Permanent Link: http://ufdc.ufl.edu/UF00095585/00001
 Material Information
Title: Genomics algebra : a new integrating data model, language, and tool for processing and querying genomic information
Series Title: Department of Computer and Information Science and Engineering Technical Reports ; TR-02-009
Physical Description: Book
Language: English
Creator: Hammer, Joachim
Schneider, Markus
Publisher: Department of Computer and Information Science and Engineering, University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: October, 2002
Copyright Date: 2002
 Record Information
Bibliographic ID: UF00095585
Volume ID: VID00001
Source Institution: University of Florida
Holding Location: University of Florida
Rights Management: All rights reserved by the source institution and holding location.

Department of CISE, University of Florida, Gainesville, FL 32611-6120, Technical Report TR-02-009, October 2002.





Genomics Algebra: A New, Integrating Data Model, Language,
and Tool for Processing and Querying Genomic Information


Joachim Hammer and Markus Schneider

Department of Computer & Information Science & Engineering
University of Florida
Gainesville, FL 32611-6120
U.S.A.
{jhammer,mschneid}@cise.ufl.edu


Abstract
The dramatic increase of mostly semi-structured
genomic data, their heterogeneity and high
variety, and the increasing complexity of
biological applications and methods mean that
many of the most important challenges in biology
are now challenges in computing, and
especially in databases. In contrast to the many
query-driven approaches advocated in the
literature, we propose a new integrating approach
that is based on two fundamental pillars. The
Genomics Algebra provides an extensible set of
high-level genomic data types (GDTs) (e.g.,
genome, gene, chromosome, protein, nucleotide)
together with a comprehensive collection of
appropriate genomic functions (e.g., translate,
transcribe, decode). The Unifying Database
allows us to manage the semi-structured or,
ideally, structured contents of publicly available
genomic repositories and to transfer these data
into GDT values. These values then serve as
arguments of Genomics Algebra operations,
which can be embedded into a DBMS query
language.

1. Introduction
In the past decade, the rapid progress of genome projects
has led to a revolution in the life sciences causing a large
and exponentially increasing accumulation of information
in molecular biology and an emergence of new and
challenging applications. The flood of genomic data, their
high variety and heterogeneity, their semi-structured
nature as well as the increasing complexity of biological
applications and methods mean that many of the most
important challenges in biology are now challenges in
computing, and especially in databases. This
statement is underpinned by the fact that millions of
nucleic acid sequences with billions of bases have been
deposited in the well-known persistent genomic
repositories EMBL [47], GenBank [6], and DDBJ [50].
Both SwissProt [3] and PIR [4] form the basis of
annotated protein sequence repositories together with
TrEMBL [17] and GenPept [55], which contain
computer-translated sequence entries from EMBL and
GenBank. In addition, hundreds of specialized
repositories have been derived from the above primary
sequence repositories. Information from them can only be
retrieved by computational means.
The indispensable and inherently integrative discipline
of bioinformatics has established itself as the application
of computing and mathematics to the management,
analysis, and understanding of the rapidly expanding
amount of biological information to solve biological
questions. Consequently, research projects in this area
must have and indeed have a highly interdisciplinary
character. Biologists provide their expertise in the
different genomic application areas and serve as domain
experts for input and validation. Computer scientists
contribute their knowledge about the management of huge
data volumes and about sophisticated data structures and
algorithms. Mathematicians provide specialized analysis
methods based, e.g., on statistical concepts.
We have deliberately avoided the term 'genomic
database' and replaced it by the term 'genomic repository'
since many of the so-called genomic 'databases' are
simply collections of flat files or accumulations of Web
pages and do not have the beneficial features of real
databases in the computer science sense. Attempts to
combine these heterogeneous and largely semi-structured
repositories have been predominantly based on federated
or query-driven approaches leading to complex
middleware tiers between the end user application and the
genomic repositories.
This technical report promotes the increased and
integrative use of current database technology as
well as appropriate innovations for the treatment of
non-standard data to cope with the large amounts of genomic
data. In a sense, we advocate a "back to the roots"
strategy of database technology for bioinformatics. This
means that general database functionality should remain
inside the DBMS and not be shifted into the middleware.
The concepts presented in this report aim at
overcoming the following fundamental challenges: The
deliberate independence, heterogeneity, and limited
interoperability among multiple genomic repositories, the
enforced low-level treatment of biological data imposed
by the genomic repositories, the lack of expressiveness
and limited functionality of current query languages and
proprietary user interfaces, the different formats and the
lack of structure of biological data representations, and
the inability to incorporate one's own, self-generated data.
Our integrating approach, which to our knowledge is
new in bioinformatics and differs substantially from the
integration approaches that can be found in the literature
(see Section 3), rests on two fundamental pillars:
1. Genomics Algebra. This extensible algebra is based
on the conceptual design, implementation, and
database integration of a new, formal data model,
query language, and software tool for representing,
storing, retrieving, querying, and manipulating
genomic information. It provides a set of high-level
genomic data types (GDTs) (e.g., genome, gene,
chromosome, protein, nucleotide) together with a
comprehensive collection of appropriate genomic
operations or functions (e.g., translate, transcribe,
decode). Thus, it can be considered a resource for
biological computation.
2. Unifying Database. Based on the latest database
technology, the construction of a unifying and
integrating database allows us to manage the semi-
structured or, in the best case, structured contents of
genomic repositories and to transfer these data into
high-level, structured, and object-based GDT values.
These values then serve as arguments of Genomics
Algebra operations. In its most advanced extension,
the Unifying Database will develop into a global
database comprising the most important or, as a
currently rather unrealistic vision, even all publicly
available genomic repositories.
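To make the first pillar more tangible, the following minimal Python sketch models three GDTs and two of the operations named above. It is an illustration only: the class names Gene, MRna, and Protein, the coding-strand convention used by transcribe, and the restriction to the standard genetic code are assumptions of this sketch and not the definitions adopted by the Genomics Algebra.

```python
from dataclasses import dataclass

# Standard genetic code in the conventional U, C, A, G ordering.
# Assumption of this sketch: alternative codon tables are ignored.
_BASES = "UCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: _AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(_BASES)
    for j, b2 in enumerate(_BASES)
    for k, b3 in enumerate(_BASES)
}


@dataclass(frozen=True)
class Gene:
    """Hypothetical GDT: coding-strand DNA of a gene, exons only."""
    dna: str


@dataclass(frozen=True)
class MRna:
    """Hypothetical GDT: messenger RNA sequence."""
    rna: str


@dataclass(frozen=True)
class Protein:
    """Hypothetical GDT: amino acid sequence in one-letter codes."""
    residues: str


def transcribe(gene: Gene) -> MRna:
    """gene -> mRNA; for a coding-strand input this replaces T by U."""
    return MRna(gene.dna.upper().replace("T", "U"))


def translate(mrna: MRna) -> Protein:
    """mRNA -> protein; reads codons from position 0 up to a stop codon."""
    residues = []
    for pos in range(0, len(mrna.rna) - 2, 3):
        amino = CODON_TABLE[mrna.rna[pos:pos + 3]]
        if amino == "*":  # stop codon ends translation
            break
        residues.append(amino)
    return Protein("".join(residues))


# Operations compose because each result is again a GDT value.
protein = translate(transcribe(Gene("ATGGCCTAA")))
```

In this sketch the composed call yields a Protein with residues "MA": AUG encodes methionine, GCC encodes alanine, and UAA is a stop codon. The point is that the user works with gene and protein values rather than with character strings.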
The main benefits resulting from this approach for the
biologist can be summarized as follows: Instead of a
currently low-level treatment of data in genomic
repositories, the biologist can now express a problem and
obtain query results in biological terms (using high-level
concepts) with the aid of genomic data types and
operations. In addition, the biologist is provided with a
powerful, general, and extensible high-level biological
query language and user interface adapted to his/her
needs. In a long-term view, the biologist is not confronted
any more with hundreds of different genomic repositories
but is equipped with an integrated and consistent working
environment and user interface based on a unifying or
ultimately global database. This Unifying Database allows
the biologist to combine, integrate, and process data from
originally different genomic resources in an easy and
elegant way. Finally, it enables the integration and
processing of self-generated data and their combination
with data stemming from public resources.
Section 2 identifies the main data management
problems that biologists face today, and "translates" them
into computer science centric problems and requirements.
The current trend to overcome these problems is based on
the query-driven integration approach. Section 3 discusses
benefits and drawbacks of this approach as an alternative,
integrative concept for dealing with genomic data and
motivates our rationale for choosing a different strategy.
Sections 4 and 5 describe the technical details of our
approach starting with the Genomics Algebra followed by
the Unifying Database. Since this technical report
describes work in progress, we emphasize research
challenges and our contributions to the state-of-the-art
wherever appropriate. Section 6 outlines the interaction
between the Algebra and the warehouse and Section 7
concludes the report.

2. Requirements of Genomic Data Management
Adequate employment of database technology requires a
deep understanding of the problems of the application
domain. These domain-specific requirements then make it
possible to derive computer science and database relevant
requirements for which solutions have to be found.
Discussions and cooperation with biologists have revealed
the following main problems regarding the information
available in the genomic repositories:
B1. Apprehension that essential information will be
overlooked. Proliferation of specialized databases
coupled with continuous expansion of established
repositories creates missed opportunities.
B2. Two or more databases may hold additive or
conflicting information. All available information
resources should be queried and data retrieved for
evaluation and analysis.
B3. A familiar data resource will disappear or morph to
a different site.
B4. Query results are unmanageable unless organized
into a customized, project-specific database.
B5. Data records become obsolete. Retrieved data
records are current only at the time the search is
completed. If the underlying database record is
updated after a search result is retrieved, the record
retrieved earlier becomes obsolete and possibly
misleading.
B6. The portal to each data site is a unique interface.
This forces scientists to develop a customized
method of queuing large-scale searches and


CISE TR-02-009
Hammer & Schneider









retrieving results in a format suitable for later
analysis.
B7. Database search functions are limited by the
interface. The database schema and data types are
unknown to the user making custom SQL queries
impossible. Worse, biologists do not understand
SQL, don't want to understand database schemas,
and would prefer constructing queries using familiar
biological terms and operations.
These seven information-related problems (B1-B7)
identified from a biologist's perspective lead to the
following computer science centric problems (C1-C15).
The identifiers in parentheses serve as cross-references
into the list above.
C1. Multitude and heterogeneity of available genomic
repositories (B1, B2). Finding all appropriate sites
from the more than 300 genomic repositories
available on the Internet for answering a question is
difficult. Many repositories contain related genomic
data but differ with respect to contents, detail,
completeness, data format, and functionality.
C2. Missing standards for genomic data representation
(B1, B2, B6). There is no commonly accepted way
for representing genomic data as evident in the large
number of different formats and representations in
use today.
C3. Multitude of user interfaces (B6). The multitude of
genomic repositories implies a multitude of user
interfaces and ontologies a biologist is forced to
learn and to comprehend.
C4. Quality of user interfaces (B4, B6, B7). In order to
utilize existing user interfaces effectively, the
biologist frequently needs detailed knowledge about
computing and data management since they are
often too system-oriented and not user-friendly
enough.
C5. Quality of query languages (B4, B7). SQL is
tailored to answer questions about alphanumerical
data but unsuited for biologists asking biological
questions. Consequently, the biologist should have
access to a biological query language.
C6. Limited functionality of genomic repositories (B2,
B7). The possible interactions of the biologist with a
genomic repository are limited to the functions
available in the user interface of that repository.
This implies a lack of flexibility and the ability to
ask new types of queries.
C7. Format of query results (B4, B5). The result of a
query against a genomic repository is often
written to the computer screen or to a text file and
cannot be used for further computation. It is then left
to the biologist to analyze the results manually.
C8. Incorrectness due to inconsistency and
incompatibility of data (B1, B2, B5). The existence
of different genomic repositories with respect to the
same kind of biological data leads to the question
whether and where similar or overlapping
repositories agree and disagree with one another.
C9. Uncertainty of data (B2, B5). A very important but
extremely difficult question refers to the correctness
of data stored in genomic repositories. Due to vague
or even lacking biological knowledge and due to
experimental errors, erroneous data in genomic
repositories cannot be excluded. Frequently, it
cannot be decided from two inconsistent pieces of
data, which one is correct and which one is wrong.
In this case, access to both alternatives should be
given.
C10. Combination of data from different genomic
repositories (B2, B7). Currently, data sets from
different, independent genomic repositories cannot
be combined or merged in an easy and meaningful
manner.
C11. Extraction of hidden and creation of new knowledge
(B1, B2, B7). The nature of stored genomic data,
e.g., in flat files or semi-structured records, makes it
difficult to extract hidden information and to create
new biological knowledge. The extraction of
relevant data from query results and their analysis
has to be performed manually without much
computational support.
C12. Low-level treatment of data (B1, B2, B4, B7).
Genomic data representations and query results are
more or less collections of textual strings and
numerical values and are not expressed in biological
terms such as genes, proteins, and nucleotide
sequences. Operations on these high-level entities
do not exist.
C13. Integration of self-generated data and extensibility
(B4, B5). A biologist generates new biological data
from their own research or experimental work. It is
not possible to store and retrieve this data, to
perform computations with generally known or even
own methods, and to match the data against the
genomic databases. This requires an extensible
database management system, query language, and
user interface.
C14. Integration of new specialty evaluation functions
(B4, B7). The possibility to evaluate data originally
stemming from genomic repositories as well as self-
generated data with publicly available methods is
insufficient. Thus, it must be possible to create, use,
and integrate user-defined functions that are capable
of operating on both kinds of data.
C15. Loss of existing repositories (B3). Due to the high
competition in the bioinformatics market, many
companies disappear and with them the genomic
repositories that were maintained by them. The
company's valuable knowledge should be preserved.
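Requirement C9 asks that both alternatives remain accessible when two repositories disagree. One way to picture this is a value that keeps every reported alternative together with its origin instead of silently choosing one. The following Python sketch is hypothetical: the names Alternative and Uncertain, the repository labels, and the example sequences are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Alternative:
    value: str   # e.g., a sequence as reported by one repository
    source: str  # hypothetical repository label


@dataclass
class Uncertain:
    """Keeps all reported values instead of choosing a winner."""
    alternatives: list = field(default_factory=list)

    def report(self, value: str, source: str) -> None:
        self.alternatives.append(Alternative(value, source))

    def distinct_values(self) -> set:
        return {alt.value for alt in self.alternatives}

    def is_consistent(self) -> bool:
        # Consistent means that every source reported the same value.
        return len(self.distinct_values()) <= 1


seq = Uncertain()
seq.report("ATGGCC", "repository_a")
seq.report("ATGGCT", "repository_b")
```

Here seq.is_consistent() is false, and both candidate sequences stay available together with the repository that reported them, so the decision can be deferred to the biologist.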











This detailed problem analysis shows the enormous
complexity of the information-related challenges
biologists and computer scientists are confronted with. It
is our conviction, and we will motivate and explain this in
the following sections, that the combination of Genomics
Algebra and Unifying Database is a solution for all these
problems, even though it raises a number of new,
complicated, and hence challenging issues.

3. Biological Database Integration
Much research has been conducted to reduce the burden
on the biologist when attempting to access related data
from multiple genomic repositories. One can distinguish
two commonly accepted approaches: (1) query-driven
integration or mediation and (2) data warehousing. In
both approaches, users access the underlying sources
indirectly through an integrated, global schema (view),
which has been constructed either from the local schemas
of the sources or from general knowledge of the domain.
Conceptually, the two approaches differ in where the
data is stored and when integration takes place. In the
query-driven approach, no data is stored in the
integration system and all queries against the global
schema must be restructured and submitted to the sources
to fetch the individual results, which are then integrated
on the fly. In the data warehousing approach, the
integration system is also the repository containing the
desired source data, which has been fetched, integrated
and reconciled prior to receiving queries. Deciding on
which approach to use depends on several factors,
including required performance and supported query
capabilities at the integration system, ownership of data,
dynamics of the sources, to name a few. See [14] for an
in-depth discussion of the pros and cons of these two
integration approaches.
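The difference in where data is stored and when integration takes place can be sketched in a few lines of Python. The classes Source, Mediator, and Warehouse and the record key "P1" are hypothetical stand-ins chosen for this illustration; real integration systems additionally perform schema mapping and reconciliation, which this sketch omits.

```python
class Source:
    """Stand-in for an autonomous genomic repository (hypothetical)."""

    def __init__(self, records):
        self.records = dict(records)

    def fetch(self, key):
        return self.records.get(key)


class Mediator:
    """Query-driven: stores nothing, contacts every source per query."""

    def __init__(self, sources):
        self.sources = sources

    def query(self, key):
        hits = (src.fetch(key) for src in self.sources)
        return [hit for hit in hits if hit is not None]


class Warehouse:
    """Warehousing: integrates source data ahead of time, answers locally."""

    def __init__(self, sources):
        self.sources = sources
        self.store = {}
        self.refresh()

    def refresh(self):
        # Maintenance step: re-materialize the integrated view.
        self.store = {}
        for src in self.sources:
            for key, value in src.records.items():
                self.store.setdefault(key, []).append(value)

    def query(self, key):
        return list(self.store.get(key, []))


a = Source({"P1": "annotation v1"})
b = Source({"P1": "annotation from b"})
mediator, warehouse = Mediator([a, b]), Warehouse([a, b])
a.records["P1"] = "annotation v2"  # a source changes after loading
fresh = mediator.query("P1")       # contacts the sources, sees the change
stale = warehouse.query("P1")      # answered locally, unchanged until refresh()
```

The mediator pays the cost of contacting all sources on every query but always sees current data. The warehouse answers from its own store and therefore needs a maintenance procedure such as refresh() to follow changes at the sources.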




[Figure 1: Generic integration architecture using the query-driven approach. Layers shown: client-level interface, middleware integration system, source level.]


In the biological domain, most integration systems are
currently based on the query-driven approach (see footnote 1). SRS [16],
BioNavigator [9], K2/Kleisli [13], TAMBIS [38] and
DiscoveryLink [25] are representatives of this class.
Although they differ greatly in the capabilities they offer,
they can be considered middleware systems [7], in which
the bulk of the query and result processing takes place in a
different location from where the data is stored. For
example, K2/Kleisli, DiscoveryLink, and TAMBIS use
source-specific data drivers (wrappers) for extracting the
data from underlying data sources (e.g., GenBank,
dbEST, SWISS-PROT) including application programs
(e.g., the BLAST family of similarity search programs
[2]). The extracted data is then shipped to the integration
system, where it is represented and processed using the
data model and query language of the integration system
(e.g., the relational model and SQL in DiscoveryLink, the
object-oriented model and OQL in K2/Kleisli). Biologists
access the integration system through a client interface,
which hides many of the source-specific details and
heterogeneities. For example, SRS and BioNavigator
provide a Web-based interface, which allows biologists to
access sources through hyperlinks and formulate queries
by filling out HTML forms. The query-driven approach to
accessing multiple sources is depicted in Figure 1.
The generic data warehousing architecture (not shown)
looks similar to the one depicted in Figure 1, except for
the addition of a repository (warehouse) in the
middleware layer. This warehouse is used by the
integration system to store (materialize) the integrated
views over the underlying data sources. Queries are
answered from the data in the warehouse rather than at
the sources. This greatly improves performance but requires
complex maintenance procedures to update the warehouse
in light of changes to the sources. Among the integration
systems for biological databases, the only representative
of the data warehousing approach known to us is GUS
(Genomics Unified Schema) [13]. GUS describes a
relational data warehouse in support of organism and
tissue-specific research projects at the University of
Pennsylvania. It has been designed to unify data extracted
from GenBank/EMBL/DDBJ, dbEST and SWISS-PROT.
GUS shares some of its goals and requirements with the
system proposed in this technical report and we will come
back to it in Section 5.
Despite the continuous advancements in biological
database systems research, we argue that current systems
present biologists with only an incomplete solution to the
growing data management problem they are facing. More
importantly, we share the belief of the authors in [13] that


1 Historically, sharing architectures based on the query-driven
approach have also been termed federated databases and it was
common to distinguish between loosely- and tightly-coupled
federations based on the degree of autonomy that was retained
by the underlying source systems after they entered into the
federation.











in those situations where close control over query
performance as well as accuracy and consistency of the
results are important (problem C8 in Section 2), the
query-driven approach is not an option. However, query
performance and correctness are only two aspects of
biological data management. As can be seen in our list of
requirements in Section 2, a suitable representation of
biological data and a powerful and extensible biological
query language capable of dealing with the inherent
uncertainty of the correctness of biological data are
equally important (C2, C4, C6, C9, C12, C14). To our
knowledge, none of the existing systems currently
addresses these requirements.
To illustrate our position, we first highlight a few
specific problems and shortcomings with current systems.
Table 1 at the end of this section summarizes how the
integration systems mentioned above address each of the
computer science issues C1-C15 listed in Section 2.
While query-driven data access and integration is an
indispensable tool for many applications, relying on it to
develop high-performance biological database systems
has several serious drawbacks:
• Shipping large amounts of data back and forth
between sources and the integration system is
resource- and time-consuming. As a result, query
performance in query-driven systems is governed
by the traffic on the network and at the sources.
• By moving part of the query processing outside
of the database system, implementers cannot take
full advantage of the existing functionality of the
database, which has been optimized over time.
As a result, the external code becomes more
complicated and less efficient than if it had been
integrated with the database system.
• Query optimization across multiple genomic
repositories is a complex problem and remains
largely unsolved. This reduces performance of
the query-driven integration systems further.
• Being able to annotate query results and to make
these annotations part of the original data is an
important part of scientific exploration.
However, query-driven integration systems are
read-only and provide no mechanism for storing
annotations persistently with the underlying data
or for combining them with subsequent query
results.
Independent of the integration approach used, current
systems lack adequate support for biologists, forcing them
to adapt their research methods to fit those imposed by
the data management tools instead of the other way
around:
• The representation of data in current integration
systems (including the underlying genomic
repositories) is not well suited for biological
analysis since it forces biologists to abandon
their domain specific viewpoint (e.g., decoding a
sequence of nucleotides) and to adopt a computer
science centric view (e.g., performing a pattern
matching operation on a sequence of characters).
• Although most systems attempt to shield the
biologists from learning the intricacies of the
different repositories, more often than not, the
biologist still needs to act as human query
processor. This is the case, for example, when
the output of one source/application cannot be
directly used as input to another, e.g., due to
incompatible formats or some intermediate result
processing.
In the next sections, we outline our proposal for a
genomic data warehouse and a powerful analysis
component. We believe the combination of the two
greatly enhances the way biologists analyse and process
information including data stored in the existing genomic
repositories.
To put the previous discussion in perspective and
validate our solution, consider the analogy with online
analytical processing (OLAP). Business analysts would
never consider running their OLAP queries, which are of
a magnitude and complexity similar to BLAST queries,
for example, against a query-driven integration system.
Instead, they use a data warehouse in which the extracted
transactional data is pre-fetched, integrated and
restructured. The warehouse is integrated with a powerful
OLAP front-end which presents the data in a format
understood by the analysts (data cube) instead of
relational tables and which natively supports the complex
business operations (e.g., roll-up, drill-down) in a high-
level business language instead of SQL.


Table 1: Analysis of data management capabilities of existing integration systems with respect to the requirements outlined in Sec. 2.

Req. | SRS | BioNavigator | K2/Kleisli | DiscoveryLink | TAMBIS | GUS
-----|-----|--------------|------------|---------------|--------|----
C1 | User shielded from source details | User shielded from source details | User shielded from source details | User shielded from source details | User shielded from source details | User shielded from source details
C2 | HTML | HTML | Global schema using object-oriented model | Global schema using relational model | Global schema using description logic | GUS schema based on relational model, OO views
C3 | Single-access point | Single-access point | Single-access point | Single-access point | Single-access point | Single-access point
C4 | Simple to use visual interface | Simple to use visual interface | Not a user-level interface | Requires knowledge of SQL | Simple to use visual interface | Requires knowledge of SQL
C5 | Limited query capability | Not query oriented | Comprehensive query capability | Comprehensive query capability | Comprehensive query capability | Comprehensive query capability
C6 | No new operations | No new operations | New operations on integrated view data | New operations on integrated view data | New operations on integrated view data | New operations defined on warehouse data
C7 | No re-organization of source data | No re-organization of source data | Reorganization of result possible | Reorganization of result possible | Reorganization of result possible | Reorganization of result possible
C8 | No reconciliation of results | No reconciliation of results | No reconciliation of results | No reconciliation of results | Result reconciliation supported | Data in warehouse is reconciled and cleansed
C9 | No provision for dealing with uncertainty in data | No provision for dealing with uncertainty in data | No provision for dealing with uncertainty in data | No provision for dealing with uncertainty in data | No provision for dealing with uncertainty in data | No provision for dealing with uncertainty in data
C10 | Results not integrated, sources must be Web-enabled | Results not integrated, sources must be Web-enabled | Results integrated using global schema, source wrapper needed | Results integrated using global schema, source wrapper needed | Results integrated using global schema, source wrapper needed | Query results are integrated
C11 | Not supported | Not supported | Not supported | Not supported | Not supported | Annotations supported
C12 | Not supported | Not supported | Not supported | Not supported | Not supported | Not supported
C13 | Not supported | Not supported | Not supported | Not supported | Not supported | Supported
C14 | Not supported | Not supported | Not supported | Not supported | Not supported | Not supported
C15 | No archival functionality | No archival functionality | No archival functionality | No archival functionality | No archival functionality | Archiving of data supported

4. Genomics Algebra

Based on the observations and conclusions made in
Section 3, we pursue an alternative, integrative approach,
which heavily focuses on current database and data
warehouse technologies. The Genomics Algebra (GenAlg)
is the first of two pillars of our approach. It incorporates a
sophisticated, self-contained, and high-level type system
for genomic data together with a comprehensive set of
operations.

4.1 An Ontology for Molecular Biology and
Bioinformatics

The first step and precondition for a successful
construction of our Genomics Algebra is the design of an
ontology for molecular biology and bioinformatics. By
ontology, we are referring to "a specification of a
conceptualization" [19]. That is, in general, an ontology is
a description of the concepts and relationships that define
an application domain.
Applied to bioinformatics, an ontology is a "controlled
vocabulary for the description of the molecular functions,
biological processes and cellular components of gene
products" [37]. An obstacle to its unique definition is that
the multitude of heterogeneous and autonomous genomic
repositories has induced terminological differences
(synonyms, aliases, formulae), syntactic differences (file
structure, separators, spelling) and semantic differences
(intra- and interdisciplinary homonyms). The
consequence is that data integration is impeded by
different meanings of identically named categories,
overlapping meanings of different categories, and
conflicting meanings of different categories. Naming
conventions of data objects, object identifier codes, and
record labels differ between databases and do not follow a
unified scheme. Even the meaning of important high-level


concepts (e.g., the notion of gene or protein function) that
are fundamental to molecular biology is ambiguous.
If the user queries a database with such an ambiguous
term, until now (s)he has full responsibility to verify the
semantic congruence between what (s)he asked for and
what the database returned. An ontology helps here to
establish a standardised, formally and coherently defined
nomenclature in molecular biology. Each technical term
has to be associated with a unique semantics that should
be accepted by the biological community. If this is not
possible, because different meanings or interpretations are
attached to the same term but in different biological
contexts, then the only solution is to coin a new,
appropriate, and unique term for each context. Uniqueness
of a term is an essential requirement to be able to map
concepts into the Genomics Algebra.
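The uniqueness requirement can be illustrated with a small controlled vocabulary that refuses to attach a second meaning to an existing term. This Python sketch is hypothetical: the class Vocabulary, the context-qualified term names, and the definition strings are placeholders invented for illustration and are not taken from any existing ontology.

```python
class Vocabulary:
    """Controlled vocabulary in which each term has exactly one definition."""

    def __init__(self):
        self._definitions = {}

    def define(self, term, definition):
        existing = self._definitions.get(term)
        if existing is not None and existing != definition:
            # A second, different meaning would make the term ambiguous.
            raise ValueError(f"ambiguous term: {term!r}")
        self._definitions[term] = definition

    def lookup(self, term):
        return self._definitions[term]


vocab = Vocabulary()
# Two contexts use the word "gene" differently, so each context
# receives its own unique term (placeholder definitions).
vocab.define("gene (transcription unit)", "placeholder definition A")
vocab.define("gene (unit of inheritance)", "placeholder definition B")
```

Attempting to define "gene (transcription unit)" again with a different definition raises a ValueError in this sketch, which mirrors the rule that conflicting interpretations must be resolved by coining a separate term for each context.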
Consequently, one of our current research efforts and
challenges is to develop a comprehensive ontology which
defines the terminology, data objects and operations
including their semantics that underlie genome
sequencing. Since there has been much effort in defining
ontologies for various bioinformatics projects [34], for
example, EcoCyc [27], Pasta [44], Gene Ontology
Consortium [18], we are about to study and compare these
and other existing contributions in this field when
defining our ontology. Therefore, besides an important
contribution in itself, a comprehensive ontology forms the
starting point for the development of our Genomics
Algebra. In total, this goal especially contributes to a
solution of the problems C1, C2, C3, C5, C8, C9, C11,
and C12. Besides posing such a genomic ontology, a main
challenge is to find or even devise an appropriate
formalism for its unique specification.











4.2 The Algebra
In a sense, the Genomics Algebra as the second step is the
derived, formal, and executable instantiation of the
resulting genomic ontology. Entity types and functions in
the ontology are represented directly using the appropriate
data types and operations supported by our Genomics
Algebra. This algebra has to satisfy two main tasks. First,
it has to serve as interface between biologists, who use
this interface, and computer scientists, who implement it.
An essential feature of the algebra is that it incorporates
high-level biological terminology and concepts. Hence, it
is not based on the low-level concepts provided by
database technology. Second, as a result, this high-level,
domain-specific algebra will greatly facilitate the
interactions of biologists with genomic information stored
in our Unifying Database (see Section 5) and
incorporating at least the knowledge of the genome
repositories. This is much like the invention of the 3-tier
architecture and how the resulting data independence
simplified database operations in relational databases. To
our knowledge, no such algebra currently exists in the
field of bioinformatics. The main impact of this goal is in
solving the problems C2 to C4 and C6 to C14. This
requires close coordination between domain experts from
biology, who have to identify and select useful data types,
relevant operations, and their semantics, and computer
scientists, whose task it is to formalize and implement the
algebra.
In order to explain the notion of algebra, we start with
the concept of a many-sorted signature which consists of
two sets of symbols called sorts (or types) and operators.
For an in-depth treatment of many-sorted algebras the
reader is invited to refer to [15]. Operators are annotated
with strings of sorts. For instance, the symbols string,
integer, and char may be sorts, and concat (annotated with
string string string) and getchar (annotated with string
integer char) two operators. The annotation with
sorts defines the functionality of the operator, which in a
more conventional way is usually written as concat :
string x string -> string and getchar : string x integer ->
char. To assign semantics to a signature, one must assign
a (carrier) set to each sort and a function to each operator.
Each function has domains and a codomain according to
the string of sorts of the operator. Such a collection of sets
and functions forms a many-sorted algebra. Hence, a
signature describes the syntactic aspect of an algebra by
associating with each sort the name of a set of the algebra
and with each operator the name of a function. A
signature especially defines a set of terms such as
getchar(concat("Genomics ", "Algebra"), 10). The sort of
a term is the result sort of its outermost operator, which is
char in our example.
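
The notions of signature, term, and sort can be made concrete with a small sketch. The Python below is only an illustration under stated assumptions: the names SIGNATURE, FUNCTIONS, sort_of, and evaluate are hypothetical, terms are encoded as nested tuples, and getchar is assumed to count positions from 1, which the text does not specify.

```python
# Illustrative sketch only: all names here are hypothetical, and the 1-based
# position used by getchar is an assumption.

# Signature: each operator is annotated with its argument sorts and result sort.
SIGNATURE = {
    "concat": (("string", "string"), "string"),
    "getchar": (("string", "integer"), "char"),
}

# Algebra: one function per operator; Python's str and int act as carrier sets.
FUNCTIONS = {
    "concat": lambda a, b: a + b,
    "getchar": lambda s, i: s[i - 1],  # assumed 1-based position
}


def sort_of(term):
    """Return the sort of a term, checking arguments against the signature."""
    if isinstance(term, str):
        return "string"
    if isinstance(term, int):
        return "integer"
    op, *args = term  # a compound term is (operator, argument, ...)
    arg_sorts, result_sort = SIGNATURE[op]
    actual = tuple(sort_of(a) for a in args)
    if actual != arg_sorts:
        raise TypeError(f"{op} expects {arg_sorts}, got {actual}")
    return result_sort


def evaluate(term):
    """Evaluate a term by applying the function assigned to each operator."""
    if not isinstance(term, tuple):
        return term
    op, *args = term
    return FUNCTIONS[op](*(evaluate(a) for a in args))


term = ("getchar", ("concat", "Genomics ", "Algebra"), 10)
print(sort_of(term), evaluate(term))
```

The sort of the term is computed from the signature alone, without evaluating anything, which mirrors the distinction between the syntactic and the semantic side of an algebra.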
Our Genomics Algebra is a domain-specific, many-
sorted algebra incorporating a type system for biological
data. Its sorts, operators, carrier sets, and functions will be
derived from the genomic ontology developed in the first
step. The sorts are called genomic data types (GDTs) and
the operators genomic operations. To illustrate the
concept, we assume the following, very simplified
signature, which is part of our algebra:

sorts
    gene, primaryTranscript, mRNA, protein

ops
    transcribe: gene -> primaryTranscript
    splice:     primaryTranscript -> mRNA
    translate:  mRNA -> protein


This "mini algebra" contains four sorts or genomic data
types for genes, primary transcript, messenger RNA, and
protein as well as three operators transcribe, which for a
given gene returns its primary transcript, splice, which for
a given primary transcript identifies its messenger RNA,
and translate, which for a given messenger RNA
determines the corresponding protein. We can assume that
these sorts and operators have been derived from our
genomic ontology. Hence, the high-level nomenclature of
our genomic ontology is directly reflected in our algebra.
The algebra now allows us to (at least) syntactically
combine different operations by (function) composition.
For instance, given a gene g, we can syntactically
construct the term translate(splice(transcribe(g))), which
yields the protein determined by g. For the semantic
problems of this term, see below.
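
To make the composition of operations tangible, the following Python sketch mirrors the mini algebra with one class per sort and one function per operator. It is a toy under strong assumptions and not a definition of the genomic operations: the class layouts, the exon coordinates supplied with the gene, and the function bodies (in particular splice, whose real mechanism is not known) are all hypothetical.

```python
# Toy sketch of the mini algebra. The data layout and the function bodies are
# illustrative assumptions, not definitions of the real genomic operations.
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Gene:
    dna: str                            # coding-strand DNA
    exons: Tuple[Tuple[int, int], ...]  # assumed known, half-open [start, end)


@dataclass(frozen=True)
class PrimaryTranscript:
    rna: str
    exons: Tuple[Tuple[int, int], ...]


@dataclass(frozen=True)
class MRNA:
    rna: str


@dataclass(frozen=True)
class Protein:
    residues: str


BASES = "UCAG"
# Standard genetic code with codons ordered U, C, A, G at each position.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"


def transcribe(g: Gene) -> PrimaryTranscript:
    # Coding-strand convention: the transcript equals the DNA with T read as U.
    return PrimaryTranscript(g.dna.replace("T", "U"), g.exons)


def splice(t: PrimaryTranscript) -> MRNA:
    # Assumes the exon boundaries are given; keeps exons, drops introns.
    return MRNA("".join(t.rna[a:b] for a, b in t.exons))


def translate(m: MRNA) -> Protein:
    residues = []
    for i in range(0, len(m.rna) - 2, 3):
        c = m.rna[i:i + 3]
        aa = AMINO[16 * BASES.index(c[0]) + 4 * BASES.index(c[1]) + BASES.index(c[2])]
        if aa == "*":  # stop codon ends translation
            break
        residues.append(aa)
    return Protein("".join(residues))


# Two exons (ATGGCC, TTTTAA) separated by a six-base intron (GTAAGT).
g = Gene("ATGGCCGTAAGTTTTTAA", ((0, 6), (12, 18)))
print(translate(splice(transcribe(g))))
```

Because each function's parameter and result types follow the signature, only compositions such as translate(splice(transcribe(g))) are well-typed; applying translate directly to a Gene would be flagged by a static type checker.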
Obviously, our mini algebra is incomplete. It is our
conviction that finding a "complete" set of GDTs and
genomic operations (what does "completeness" mean in
this context?) is impossible, since new biological
applications can induce new data types or new operations
for already existing data types. Therefore, we pursue an
extensible approach, i.e., if required, the Genomics
Algebra can be extended by new sorts and operations. In
particular, we can combine new sorts with sorts already
present in the algebra, which leads to new operations. In
other words, we can combine information stemming
originally from different genomic repositories. Our hope
is to be able to identify new, powerful, and fundamental
genomic operations that nobody has considered so far.
From a software point of view, the Genomics Algebra
is an extensible, self-contained software package and tool
providing a collection of genomic data types and
operations for biological computation. It is principally
independent of a database system and can be used as a
software library by a stand-alone application program.
Thus, we also denote it as kernel algebra.
This kernel algebra develops its full expressiveness
and usability if it is designed and integrated as a
collection of abstract data types (ADTs) into the type
system and query language of a database system (Section
6) [48, 49]. ADTs encapsulate their implementation and
thus hide it from the user or another software component
like the DBMS. From a modeling perspective, the DBMS
data model and the application-specific algebra or type
system are separated. This enables the software developer
to focus on the application-specific aspects embedded in
the algebra. Consequently, this procedure supports
modularity and conceptual clarity and also permits the
reusability of an algebra for different DBMS data models.
It requires extensibility mechanisms at the type system
level in particular and at all levels of the architecture of a
DBMS in general, starting from user interface extensions
down to new, external representation and index structures.
From an implementation point of view, ADTs support
modularity, information hiding, and the exchange of
implementations. Simple and inefficient implementation
parts can then be replaced by more sophisticated ones
without changing the interface, that is, the signature of
algebra operations.

4.3 Research Challenges
We have already addressed two main research challenges,
namely the design of the genomic ontology and the
derivation of the signature of the Genomics Algebra from
it. The authors have already gained some experience in
other research areas with respect to algebra design and its
integration into databases, e.g., in the spatial domain [42],
the spatio-temporal domain [20], or the fuzzy domain
[43]. However, in the genomics domain the requirements
and challenges are far greater.
This leads us to the third main challenge, which is to
give a formal definition of the genomic data types and
operations, i.e., to specify their semantics, in terms that
can be transferred to computer science and especially to
database technology. A serious obstacle to the
construction of the Genomics Algebra is the biologists'
vague or even lacking knowledge about genomic
processes. Biological results are never guaranteed (in
contrast to the results of the application domains
mentioned above) but always carry some degree of
uncertainty. For instance, it
is known that the splice operation takes a primary
transcript and produces a messenger RNA, i.e., the effect
of splicing (the "what") is clear since the cell
demonstrates this observable biological function all the
time. But it is unknown how the cell performs
("computes") this function. Transferred to our Genomics
Algebra, this means that the signature of the splice
operation is known with domain and codomain as shown
in Section 4.2. We can even define the semantics of the
operation in a denotational way. However, we cannot
determine its operational semantics in the form of an
algorithm and thus not implement it directly. A way out of
this "dilemma" can be to map the procedure that
biologists use in their everyday work to elude the problem
or to compute an approximated solution for the problem.
This inherent feature of uncertainty due to lacking
knowledge must be appropriately reflected in the
Genomics Algebra so that results which are actually vague
or error-prone are not presented as correct. The challenging
issue is how this can best be done.
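
One conceivable way to keep uncertainty visible, offered here purely as an illustration and not as the approach chosen for the Genomics Algebra, is to let every operation return its result together with a confidence value and to combine confidences under composition. In the Python sketch below, the class Uncertain, the method then, the function predicted_splice, the fixed confidence 0.8, and the multiplication rule (which assumes independent error sources) are all hypothetical.

```python
# Hypothetical sketch: results carry a confidence that is propagated through
# composition. The multiplication rule assumes independent sources of error.
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

A = TypeVar("A")
B = TypeVar("B")


@dataclass(frozen=True)
class Uncertain(Generic[A]):
    value: A
    confidence: float  # assumed to lie in [0, 1]

    def then(self, op: Callable[[A], "Uncertain[B]"]) -> "Uncertain[B]":
        """Apply an uncertain operation and combine both confidences."""
        result = op(self.value)
        return Uncertain(result.value, self.confidence * result.confidence)


def predicted_splice(transcript: str) -> Uncertain[str]:
    # Stand-in for an approximate splicing procedure: it removes one fixed,
    # made-up intron pattern and reports a made-up confidence.
    return Uncertain(transcript.replace("GUAAGU", ""), 0.8)


observed = Uncertain("AUGGUAAGUUAA", 0.9)
mrna = observed.then(predicted_splice)
print(mrna.value, round(mrna.confidence, 2))
```

The point of the design is that a caller can never obtain the value without also receiving the confidence attached to it.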


The fourth main challenge is to implement the
Genomics Algebra. This includes the design of
sophisticated data structures for the genomic data types
and efficient algorithms for the genomic operations. We
discuss two important aspects here. A first aspect is that
algorithms for different operations processing the same
kind of data usually prefer different internal data
representations in order to be as efficient as possible. In
contrast to traditional work on algorithms, the focus is
here not on finding the most efficient algorithm for each
single problem (operation) together with a corresponding
sophisticated data structure, but rather on considering the
Genomics Algebra as a whole and on reconciling the
various requirements posed by different algorithms within
a single data structure for each genomic data type.
Otherwise, the consequence would be enormous
conversion costs between different data structures in main
memory for the same data type. A second aspect is that
the implementation is intended for use in a database
system. Consequently, representations for genomic data
types should not employ pointer data structures in main
memory but be embedded into compact storage areas
which can be efficiently transferred between main
memory and disk. This avoids unnecessary and high costs
for packing main memory data and unpacking external
data.
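
The idea of a compact, pointer-free representation can be illustrated as follows. The Python sketch stores a nucleotide sequence in a single contiguous byte string that could be written to disk and read back unchanged. The layout (a 4-byte little-endian length followed by 2 bits per base) and the names pack and unpack are assumptions made for this example only.

```python
# Illustrative sketch: a flat, pointer-free byte layout for a DNA sequence.
# Layout (assumed): 4-byte little-endian length, then 2 bits per base.
import struct

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE = "ACGT"


def pack(seq: str) -> bytes:
    """Encode a sequence over A, C, G, T into one contiguous byte string."""
    buf = bytearray(struct.pack("<I", len(seq)))
    byte = 0
    for i, ch in enumerate(seq):
        byte |= CODE[ch] << (2 * (i % 4))
        if i % 4 == 3:  # four bases fill one byte
            buf.append(byte)
            byte = 0
    if len(seq) % 4:    # flush a partially filled last byte
        buf.append(byte)
    return bytes(buf)


def unpack(blob: bytes) -> str:
    """Decode the byte string produced by pack."""
    (n,) = struct.unpack_from("<I", blob, 0)
    return "".join(BASE[(blob[4 + i // 4] >> (2 * (i % 4))) & 3] for i in range(n))


blob = pack("ATTGCCATA")
print(len(blob), unpack(blob))
```

Since the encoded value contains no references to other memory locations, the same bytes serve as both the main-memory and the on-disk representation, so no packing or unpacking step is needed when a value crosses that boundary.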

5. Unifying Database
The Unifying Database is the second pillar of our
integrating approach. By Unifying Database, we are
referring to a data warehouse, which integrates data from
multiple genomic repositories. We have chosen the data
warehousing approach to take advantage of the many
benefits it provides, including superior query processing
performance in multi-source environments, the ability to
maintain and annotate extracted source data after it has
been cleansed, reconciled and corrected, and the option to
preserve historical data from those repositories that do not
archive their contents. Equally important, the Unifying
Database is also the persistent storage manager for the
Genomics Algebra.
We start by providing a brief overview of the
components that make up the Unifying Database in
Section 5.1, followed by a description of the challenges
that must be overcome during its implementation in
Section 5.2.

5.1 Component Overview
The component most visible to the user is the integrated
schema. We distinguish between the portions of the
schema that house the restructured and integrated external
data (i.e., the entities that store the genomic data brought
in from the sources), which are publicly available to
every user, and those that contain the user data (i.e., the
entities that store user-created data including annotations),
which may be private. The schema containing the external
data is read-only to facilitate maintenance of the
warehouse; user-owned entities are updatable by their
owners. Separating user and public space
provides privacy but does not exclude sharing of data
between users, which can be controlled via the standard
database access control mechanism. Since all information
is integrated in one database using the same formats and
representation, cross-referencing, linking, and querying
can be done using the declarative database language
provided by the underlying database management system
(DBMS), which has been extended by powerful
operations specific to the characteristics of the genomic
data (see Section 6.3). However, users do not interact
directly with the database language; instead, they use the
commands and operations provided by the Genomics
Algebra, which may be embedded in a graphical user
interface.
Conceptually, the Unifying Database may be
implemented using any DBMS as long as it is extensible.
By extensible we are referring to the capability to extend
the type system and query language of the database with
user-defined data types. For example, all of the object-
relational and most object-based database management
systems are extensible. We have more to say on the
integration of the Genomics Algebra with the DBMS
hosting the Unifying Database in Section 6. We believe
our integration of the Genomics Algebra with the
Unifying Database represents a dramatic improvement
over current technologies (e.g., a query-driven integration
system connected to BLAST sources) and will cause a
fundamental change in the way biologists will conduct
sequence analysis.
Conceptually, the component responsible for loading
the Unifying Database and making sure its contents are
up-to-date is referred to as ETL (Extract-Transform-
Load). In our system architecture, ETL comprises four
separate activities:
1. Monitoring the data sources and detecting
changes to their contents. This is done by the
source monitors.
2. Extracting relevant new or changed data from the
sources and restructuring the data into the
corresponding types provided by the Genomics
Algebra. This is done by the source wrappers.
3. Merging related data items and removing
inconsistencies before the data is loaded into the
Unifying Database. This is done by the
warehouse integrator.
4. Loading the cleaned and integrated data into the
Unifying Database. This is done by the loader.
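
The four activities can be pictured as a simple pipeline. The Python sketch below is a deliberately minimal illustration: the function names, the in-memory dictionaries standing in for sources and warehouse, and the raw record format "id|sequence" are all hypothetical.

```python
# Minimal sketch of the four ETL activities. Data formats and names are
# hypothetical; dictionaries stand in for the sources and the warehouse.

def monitor(previous: dict, current: dict) -> list:
    """Source monitor: keys whose raw content is new or has changed."""
    return [k for k, v in current.items() if previous.get(k) != v]


def wrap(record: str) -> dict:
    """Source wrapper: restructure a raw 'id|sequence' record."""
    ident, seq = record.split("|")
    return {"id": ident, "sequence": seq.upper()}


def integrate(items: list) -> list:
    """Warehouse integrator: merge items with the same id, last one wins."""
    merged = {}
    for item in items:
        merged[item["id"]] = item
    return list(merged.values())


def load(warehouse: dict, items: list) -> None:
    """Loader: write the cleaned items into the warehouse."""
    for item in items:
        warehouse[item["id"]] = item


previous = {"g1": "g1|atg"}
current = {"g1": "g1|atgc", "g2": "g2|ttt"}
warehouse = {"g1": {"id": "g1", "sequence": "ATG"}}

changed = monitor(previous, current)
load(warehouse, integrate([wrap(current[k]) for k in changed]))
print(warehouse)
```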
A conceptual overview of the Unifying Database is
depicted in Figure 3 in Section 6. As we can see from the
figure, the ETL component interfaces with a DBMS-
specific adapter instead of the DBMS directly. This
adapter, which implements the interface between database
engine and Genomics Algebra, is the only component that
has knowledge about the types and operations of the
Genomics Algebra as well as how they are implemented
and stored in the DBMS. The adapter is discussed in more
detail in the next section.
To reduce complexity and help validate our new
integrating approach, our initial focus is on loading data
from a single genomic repository first. We have chosen
the GenBank database as our starting point, mainly
because its contents are of the greatest use to our
collaborators and domain experts in the Bioinformatics
Center at the University of Florida.
Although much has been written and documented
about building data warehouses for different applications
[28] including the GUS warehouse for biological data at
the University of Pennsylvania [13], we briefly highlight
the challenges that we face during the development of the
Unifying Database. While similar in nature to some of the
challenges documented in [56], we believe they provide a
useful perspective and insight into the complexity of our
project.

5.2 Research Challenges
We have identified the following challenges, which are
directly related to implementing the components
described above:
1. How do we best design the integrated schema so
that it can accommodate data from a variety of
genomic repositories?
2. How do we automate the detection of changes in
the data sources?
3. How do we integrate related data from multiple
sources in the Unifying Database?
4. How do we automate the maintenance of the
Unifying Database?

Design of the Integrated Schema
There are two seemingly contradictory goals when
designing the schema that defines the Unifying Database.
On one hand, the schema should reflect the informational
needs of the biologists, and should therefore be defined in
terms of a global, biologist-centric view of the data (top-
down design). On the other hand, the schema should also
be capable of representing the union of the entities stored
in the underlying sources (bottom-up design). We use a
bottom-up approach by designing an integrated schema
for the Unifying Database that contains the most important
entities from all of the underlying repositories; which
entities are considered important will be determined in
discussions with the collaborating biologists. However,
using a bottom-up approach does not imply a direct
mapping between source objects and target schema.
Given the wealth of data objects in the genomic
repositories, a one-to-one mapping would result in a
warehouse schema that is as unmanageable and inefficient
as the source schemas it is trying to replace (e.g., GUS
contains over 180 tables to store data from five
repositories). Instead, we aim for a schema that combines
and restructures the original data to obtain the best
possible query performance while providing its users with
an easy-to-use view of the data. If desired, each user can
further customize the schema to his or her individual needs by
creating views.
Schema design will likely be an iterative process,
aiming to first create a schema that contains all of the
nucleotide data, which will later be extended by new
tables storing protein data, and so forth. This iterative
process is possible since there is little overlap among the
repositories containing different types of genomic data;
furthermore, this type of schema evolution will mainly
result in new entities being added instead of existing ones
being removed or updated.

Change Detection
The type of change detection algorithm used by the source
monitor depends largely on the information source
capability and the data representation. Figure 2 classifies
the types of change detection for common sources and
data representations, where the abscissa denotes four
different source types (explained below), and different
data representations occur along the ordinate. A third
dimension (degree of cooperation of the underlying
source) is omitted for simplicity since source capability
and degree of cooperation are related.

Data            Source type
representation  Active            Logged       Queryable              Non-queryable

Hierarchical    Program Trigger   Inspect Log  Edit Sequence          Edit Sequence
Flat file       N/A               Inspect Log  N/A                    LCS
Relational      Database Trigger  Inspect Log  Snapshot Differential

Figure 2. Classification of data sources using data
representation and capability of the source management system
as the defining characteristics [56]. The grid identifies several
proposed approaches to change detection. Shaded squares
denote areas of particular interest to our project.
In Figure 2, relational refers to the familiar row/column
representation used by the relational data model, flat file
refers to any kind of unstructured information (e.g., text
document), and hierarchical refers to a data
representation that exhibits nesting or layering of
elements such as the tree and graph structures or data
typically represented in an object model. Since most of
the genomic data sources use either a flat file format (e.g.,
GenBank, EMBL, DDBJ) or a hierarchical format (e.g.,
AceDB), we focus our investigation mainly on the upper
two rows in the graph.
Active data sources provide active capabilities such
that notifications of interesting changes can be
programmed to occur automatically. Active capabilities
are primarily found in relational systems (e.g., in the form
of database triggers). However, some of the non-
relational genomic data sources (e.g., SWISS-PROT) are
now beginning to offer push capabilities, which will
notify requesting users when relevant sequence entries
have been made.
Where logged sources maintain a log that can be
queried or inspected, changes can be extracted for
hierarchical, flat file, or relational data. Queryable
sources allow the database monitor to query information
at the source, so periodic polling can be used to detect
changes of interest. Two approaches are possible:
detecting edit sequences for successive snapshots
containing hierarchical data using the algorithm described
in [12], or computing snapshot differentials for relational
data using the algorithm described in [30].
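
To illustrate what a snapshot differential computes, the following Python sketch compares two snapshots of a relation keyed by primary key and reports inserted, deleted, and updated rows. It is a naive in-memory comparison written for this illustration and is not the algorithm of [30]; the function name and the dictionary-based snapshot format are assumptions.

```python
# Naive illustration of a snapshot differential (not the algorithm of [30]).
# Each snapshot is assumed to be a dict mapping primary key -> row tuple.

def snapshot_differential(old: dict, new: dict):
    """Return (inserts, deletes, updates) between two keyed snapshots."""
    inserts = {k: new[k] for k in new.keys() - old.keys()}
    deletes = {k: old[k] for k in old.keys() - new.keys()}
    updates = {
        k: (old[k], new[k])
        for k in old.keys() & new.keys()
        if old[k] != new[k]
    }
    return inserts, deletes, updates


old = {1: ("geneA", "ATG"), 2: ("geneB", "TTT"), 3: ("geneC", "CCC")}
new = {1: ("geneA", "ATG"), 2: ("geneB", "TTA"), 4: ("geneD", "GGG")}
inserts, deletes, updates = snapshot_differential(old, new)
print(inserts, deletes, updates)
```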
Finally, non-queryable sources do not provide
triggers, logs, or queries. Instead, periodic data dumps
(snapshots) are provided off-line, and changes are
detected by comparing successive snapshots. In the case
of flat files, one can use the "longest common
subsequence" approach, which is used in the UNIX diff
command. For hierarchical data, various diff algorithms
for ordered trees exist (e.g., [57, 58]). In the case of the
AceDB databases, the "acediff" utility will compute minimal
changes between different snapshots. For data sources
that export snapshots in XML, IBM's XMLTreeDiff (see
footnote 2) can be used.
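
For flat files, the longest-common-subsequence idea can be sketched in a few lines. The Python function below is a textbook dynamic-programming formulation over lines, written for illustration; it is not the implementation used by the UNIX diff command, and the function name lcs_diff and the sample record are made up.

```python
# Textbook LCS-based line diff, for illustration only.

def lcs_diff(old: list, new: list) -> list:
    """Return ('-', line) / ('+', line) operations turning old into new."""
    n, m = len(old), len(new)
    # table[i][j] holds the LCS length of old[i:] and new[j:].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if old[i] == new[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])

    ops, i, j = [], 0, 0
    while i < n and j < m:
        if old[i] == new[j]:          # line belongs to the common subsequence
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            ops.append(("-", old[i]))  # line only in the old snapshot
            i += 1
        else:
            ops.append(("+", new[j]))  # line only in the new snapshot
            j += 1
    ops.extend(("-", line) for line in old[i:])
    ops.extend(("+", line) for line in new[j:])
    return ops


old_snapshot = ["LOCUS X", "ORIGIN", "atg", "//"]
new_snapshot = ["LOCUS X", "ORIGIN", "atgc", "//"]
print(lcs_diff(old_snapshot, new_snapshot))
```

The table needs time and space proportional to the product of the two snapshot lengths, which is one reason why change detection on large flat-file dumps is costly.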
Despite this prior work, change
detection remains challenging, especially for the shaded
regions in Figure 2. For example, in queryable sources,
performance and semantic issues are associated with the
polling frequency (PF). If the PF is too high, performance
can degrade. Conversely, if the PF is too low, important
changes may not be detected in a timely manner. A related
problem, independent of the change detection algorithm, involves
development of appropriate representations for deltas
during transmission and in the warehouse. At the very
least, each delta must be uniquely identifiable and contain
(a) information about the data item to which it belongs
and (b) the a priori and a posteriori data and the time
stamp for when the update became effective or was
detected.
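
The minimal requirements on a delta listed above translate directly into a record type. The Python sketch below is one possible shape, given only as an illustration: the class name, the field names, and the use of a random UUID as the unique identifier are assumptions.

```python
# One possible shape for a delta record; all names are hypothetical.
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Delta:
    item_id: str            # (a) the data item the delta belongs to
    before: Optional[str]   # (b) a priori data, None for an insertion
    after: Optional[str]    # (b) a posteriori data, None for a deletion
    effective_at: datetime  # when the update became effective or was detected
    delta_id: str = field(default_factory=lambda: uuid.uuid4().hex)

    @property
    def kind(self) -> str:
        if self.before is None:
            return "insert"
        if self.after is None:
            return "delete"
        return "update"


stamp = datetime(2002, 10, 1, tzinfo=timezone.utc)
d1 = Delta("entry-1", None, "atg", stamp)
d2 = Delta("entry-1", "atg", "atgc", stamp)
print(d1.kind, d2.kind, d1.delta_id != d2.delta_id)
```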

Data Integration
Before the extracted data from the sources can be loaded
into the Unifying Database, one must establish the
relationships among the data items in the source(s) with


2 See www.alphaWorks.ibm.com/formula/xmltreediff


the existing data in the Unifying Database to ensure
proper loading. In addition, in case data from more than
one source is loaded, related data items from different
sources must first be identified so that duplicates can be
removed and inconsistencies among related values can be
resolved. This last step is referred to as reconciliation.
The fundamental problem is as follows: How do we
automatically detect relationships among similar entities,
which are represented differently in terms of structure or
terminology? This problem is commonly referred to as the
semantic heterogeneity problem. Being able to find an
efficient solution will allow us to answer the following
important questions that arise during data integration:
• Which source object is related to which entity in
  the Unifying Database and how?
• Which data values can be merged (for example,
  because they contain complementary or duplicate
  information)?
• Which items are inconsistent?
There has been a wealth of research on finding automated
solutions to the semantic heterogeneity problem. For a
general overview and theoretical perspective on managing
semantic heterogeneities see [24]. One common approach
is to reason about the meaning and resemblance of
heterogeneous entities in terms of their structural
representation [31]. Other promising methodologies that
were developed include heuristics to determine the
similarity of objects based on the percentage of
occurrences of common attributes (e.g., [23, 35, 45]).
More accurate techniques use classification and clustering
for matching related data elements and determining
possible relationships between entities (e.g., [32, 41]). In
addition, various systems based on Description Logic
(DL) exist for reasoning about interschema relationships
(e.g., CLASSIC [11], LOOM [33]). However, DL-based
approaches are typically "fragile" and users must
carefully trade off the expressiveness of the
language against the decidability of the solution.
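
As a simple illustration of the attribute-overlap heuristics mentioned above, the following Python sketch scores two entities by the fraction of attribute names they share (the Jaccard coefficient). It is a generic measure chosen for this example and is not taken from any of the cited systems; the attribute sets are made up.

```python
# Generic attribute-overlap score (Jaccard coefficient); illustration only.

def attribute_similarity(a: set, b: set) -> float:
    """Fraction of attribute names common to both entities."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Made-up attribute sets for two entry types from different repositories.
entry_a = {"accession", "organism", "sequence", "definition"}
entry_b = {"accession", "organism", "sequence", "description"}
print(attribute_similarity(entry_a, entry_b))
```

Note that such a purely syntactic score treats "definition" and "description" as unrelated, which is exactly the kind of terminological mismatch that makes semantic heterogeneity hard.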
We have developed a semi-automatic resolution
algorithm for computing object relationships in previous
projects [21, 22, 39] and plan on leveraging this
technology to implement the ETL component for the
Unifying Database. We are also experimenting with
various methods based on machine learning and reasoning
with uncertainty techniques. Given the fact that schema
modeling and integration is highly subjective, we feel that
machine learning (with minimal user input while training)
is very promising.

Unifying Database Maintenance
Since the Unifying Database contains information from
existing genomic repositories, its contents must be
refreshed whenever the underlying sources are updated.
Ideally, the Unifying Database should always be perfectly
synchronized with respect to the external sources.
However, given the frequency of updates to most
repositories, this is not realistic. On the other hand, there
is evidence that a less synchronized warehouse is still
useful. For example, SWISS-PROT, which is a curated
version of the protein component of GenBank, is updated
on a quarterly basis; yet it is extensively used due to the
high quality of its data. A similar experience has been
documented by the authors in [13], whose GUS
warehouse contents lag those of the data sources by a few
months. As an important variation to existing refresh
schemes, namely to automatically maintain certain pre-
defined consistency levels between sources and
warehouse, we plan to offer a manual refresh option. This
allows the biologist to defer or advance updates
depending on the situation.
Independent of the update frequency, refreshing the
contents of the Unifying Database in an automatic and
efficient manner remains a challenge. Since a warehouse
can be regarded as an integrated "view" over the
underlying sources, updating a warehouse has been
documented in the database literature as the view
maintenance problem [10]. In general, one can always
update the warehouse by reloading the entire contents,
i.e., by re-executing the integration query(s) that produced
the warehouse view. However, this is very expensive, so
the problem is to find a new load procedure (or view
query) that takes as input the updates that have occurred
at the sources and possibly the original source instances or
the existing warehouse contents and updates the
warehouse to produce the new state. When the load
procedure can be formulated without requiring the
original source instances, the warehouse is said to be self-
maintainable.
View maintenance has been studied extensively in the
context of relational databases (see, for example, [60-62]).
However, fewer results are known in the context of
object-oriented databases (e.g., [29]) or semistructured
databases (e.g., [1, 59]). To our knowledge, no work has
been done so far on recomputing annotations or
corrections that need to be applied to existing data in the
warehouse.

6. Interaction Between Genomics Algebra
and Unifying Database
Both pillars of our approach develop their full power if
they are integrated into a common system architecture. In
the following, we will discuss the system architecture, the
requirements with respect to DBMSs, and appropriate
integration mechanisms.

6.1 System Architecture
A conceptual overview of the high-level architecture that
integrates the Genomics Algebra with the Unifying
Database is shown in Figure 3. The Unifying Database is
managed by the DBMS and contains the genomic data,
which comes either from the external sources or is user
generated. The link between the Genomics Algebra and
the Unifying Database is established through the DBMS-
specific adapter. Extracting and integrating data from the
external sources is the job of the extract-transform-load
(ETL) tool shown on the right-hand side of Figure 3.
User-friendly access to the functionality of the Genomics
Algebra is provided by the GUI component depicted in
the top center. In the following, we describe the remaining
components of the architecture and some further aspects
in more detail.


[Figure 3 components: GUI; Genomics Algebra; DBMS-specific
Adapter; ETL; Extensible DBMS; Unifying Database (public
space, ...); External Repositories (e.g., GenBank, NCBI, ...)]

Figure 3: Integration of the Genomics Algebra with the
Unifying Database through a DBMS-specific adapter.

6.2 Adapters and User-Defined Data Types

Databases must be inherently extensible to be able to
efficiently handle various rich, application-domain-
specific complex data types. The adapter provides a
DBMS-specific coupling mechanism between the ADTs
together with their operations in the Genomics Algebra
and the DBMS managing the Unifying Database. The
ADTs are plugged into the adapter by using the user-
defined data type (UDT) mechanism of the DBMS. UDTs
provide the ability to efficiently define and use new data
types in a database context without having to re-architect
the DBMS. The adapter is registered with the database
management system at which point the UDTs become
add-ons to the type system of the underlying database.
Two kinds of UDTs can be distinguished, namely
object types, whose structure is fully known to the DBMS,
and opaque types whose structure is not. Object types can
only be constructed by means of types the DBMS
provides (e.g., native SQL data types, other object types,
large objects, reference types). They turn out to be too
limited. For our purpose, opaque types are much more
appropriate. They allow us to create new fundamental
data types in the database whose internal and mostly
complex structure is unknown to the DBMS. The database
provides storage for the type instances. User-defined
operators (see also Section 6.3) that access the internal
structure are linked as external methods or external
functions. They as well as the types are implemented, e.g.,
in C, C++, or Java. The benefit of opaque types arises in
cases where there is an external data model and behavior
available like in the case of our Genomics Algebra.
All major database vendors support UDTs and
external functions and provide mechanisms to package
them up for easy installation (e.g., cartridges, extenders,
datablades). A very important feature of the Genomics
Algebra is that it is completely independent of the
software that is used to provide persistence. That is, it can
be integrated with any DBMS (relational, object-
relational, object-oriented), as long as the DBMS is
extensible. The reason is that the genomic data types are
included into the database schema as attribute data types
(like the standard data types real, integer, boolean, etc.).
Tuples in a relational setting or objects in an object-
oriented environment then only serve as containers for
storing genomic values.
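
The role of an opaque attribute type can be imitated on a small scale with Python's standard sqlite3 module, whose adapter and converter hooks let the application define how a value is stored while the database merely keeps the bytes. This is an analogy chosen for illustration, not the UDT mechanism of any particular commercial DBMS; the class DnaFragment, the declared column type dnafragment, and the ASCII encoding are assumptions.

```python
# Analogy only: sqlite3 adapters/converters stand in for an opaque UDT whose
# internal structure is unknown to the database.
import sqlite3


class DnaFragment:
    """Hypothetical genomic data type used as an attribute type."""

    def __init__(self, bases: str):
        self.bases = bases


def adapt_fragment(fragment: DnaFragment) -> bytes:
    # Application-defined storage format; the database only sees bytes.
    return fragment.bases.encode("ascii")


def convert_fragment(blob: bytes) -> DnaFragment:
    return DnaFragment(blob.decode("ascii"))


sqlite3.register_adapter(DnaFragment, adapt_fragment)
sqlite3.register_converter("dnafragment", convert_fragment)

# PARSE_DECLTYPES makes sqlite3 pick the converter from the declared type.
conn = sqlite3.connect(":memory:", detect_types=sqlite3.PARSE_DECLTYPES)
conn.execute(
    "CREATE TABLE DNAFragments (id INTEGER PRIMARY KEY, fragment dnafragment)"
)
conn.execute(
    "INSERT INTO DNAFragments VALUES (?, ?)", (1, DnaFragment("ATTGCCATA"))
)
row = conn.execute("SELECT fragment FROM DNAFragments").fetchone()
print(type(row[0]).__name__, row[0].bases)
```

The tuple in the table serves only as a container: the database stores and returns the value, while its interpretation stays entirely with the code that defines the type.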

6.3 Integration of User-Defined Operations into SQL
Typically, databases provide a set of pre-defined
operators to operate on built-in data types. Operators can
be related to arithmetic (e.g., +, -, *, /), comparison (e.g.,
=, <, >), Boolean logic (e.g., not, and, or), etc. From
Section 6.2 we know that the UDT mechanism also
allows us to specify and include user-defined operators as
external functions. For example, it is possible to define a
resembles operator for comparing nucleotide sequences.
User-defined operators can be invoked anywhere
built-in operators can be used, i.e., wherever expressions
may occur. In particular, this means that they can be
included in SQL statements. They can be used in the
argument list of a SELECT clause, the condition of a
WHERE clause, the GROUP BY clause, and the ORDER
BY clause. This ability allows us to integrate all the
powerful operations and predicates of the Genomics
Algebra into the DBMS query language, which, by the
way, need not necessarily be SQL, and to extend the
semantics of the query language in a domain-specific
way. Let us assume the very simplified example of a
predicate contains which takes as input a decoded DNA
fragment and a particular sequence and which returns true
if the fragment contains the specified sequence. Then we
can write an SQL query as
SELECT id
FROM DNAFragments
WHERE contains(fragment, 'ATTGCCATA')
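
A query of this form can be tried out with Python's standard sqlite3 module, which allows an application function to be registered as an SQL function. The sketch below is an illustration under stated assumptions: SQLite stands in for the extensible DBMS, fragments are stored as plain text rather than as a genomic data type, the predicate is a simple substring test, and the table contents are made up.

```python
# Illustration: registering a user-defined predicate and using it in SQL.
# SQLite and the plain-text fragment column are stand-ins for this example.
import sqlite3


def contains(fragment: str, pattern: str) -> int:
    """Return 1 if the fragment contains the given sequence, else 0."""
    return int(pattern in fragment)


conn = sqlite3.connect(":memory:")
conn.create_function("contains", 2, contains)

conn.execute("CREATE TABLE DNAFragments (id INTEGER PRIMARY KEY, fragment TEXT)")
conn.executemany(
    "INSERT INTO DNAFragments VALUES (?, ?)",
    [(1, "GGATTGCCATAGG"), (2, "CCCCGGGG")],
)

ids = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM DNAFragments WHERE contains(fragment, 'ATTGCCATA')"
    )
]
print(ids)
```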

6.4 Graphical User Interface

For the biologist the quality of the user interface plays an
important role, because it represents the communication
mechanism between him/her on the one side and the
Genomics Algebra and the Unifying Database on the
other side. Currently, the biologist has to cope with the
multitude, heterogeneity, fixed functionality, and
simplicity of the user interfaces provided by genomic
repositories. Hence, uniformity, flexibility, extensibility,
and the possibility of graphical visualization are important
requirements of a well-designed user interface. Based on
an analysis and comparison of the currently available and
most relevant user interfaces (e.g., MuSeqBox [8],
GeneMachine [51], BEAUTY [5], BioWidgets [53],
WebBLAST [52], PhyloBLAST [36]), our goal is to
construct such a graphical user interface (GUI) for our
Genomics Algebra. The aforementioned GUIs suffer from
at least two main problems. First, with the exception of
PhyloBLAST, they do not provide database and query
facilities, which is a serious drawback. Second, their
output formats are limited to HTML, ASN.1 (a data
description notation also used for bioinformatics data), or
graphical output. Our GUI is to
comprise the following main elements: (1) a biological
query language combined with a graphical output
description language, (2) a visual language for the
graphical specification of queries, and (3) the
development of an XML application as a standardized
input/output facility for genomic data.
The extended SQL query language enriched by the
operations of the Genomics Algebra (see Section 6.3) is
not necessarily the appropriate end user query language
for the biologist. Biologists frequently dislike SQL due to
its complexity. For them, SQL is a database query
language and is not well suited as a biological query
language. Thus, the issue here is to design such a
biological query language based on the biologists' needs.
A query formulated in this query language will then be
mapped to the extended SQL of the Unifying Database. A
related, general question is how the query result should be
displayed. To enable high flexibility of the graphical
output, the idea is to devise a graphical output description
language whose commands can be combined with
expressions of the biological query language.
The textual formulation of a query is frequently
troublesome and only possible for the computationally
experienced biologist. A visual language can help to
provide support for the graphical specification of a query.
The graphical specification is then evaluated and
translated into a textual SQL representation, which in turn
is executed by the Unifying Database. The design of such
a visual language and the translation process are the
challenging issues here.
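
To make the translation step concrete, the following sketch turns a very small in-memory form of a graphically specified query into SQL text with bound parameters. The classes VisualQuery and PredicateNode and the function to_sql are hypothetical names introduced here; the design of the actual visual language is still open.

```python
from dataclasses import dataclass, field


@dataclass
class PredicateNode:
    """One predicate box in the (hypothetical) visual query."""
    operator: str   # name of a Genomics Algebra predicate, e.g. "contains"
    attribute: str  # attribute the predicate is applied to
    argument: str   # constant supplied by the user


@dataclass
class VisualQuery:
    """Hypothetical in-memory form of a graphically specified query."""
    table: str
    outputs: list[str]
    predicates: list[PredicateNode] = field(default_factory=list)


def to_sql(query: VisualQuery) -> tuple[str, list[str]]:
    """Translate the visual query into SQL text plus bound parameters.

    Identifiers are inserted as given; a real translator would have to
    validate them against the database schema first.
    """
    sql = f"SELECT {', '.join(query.outputs)} FROM {query.table}"
    conditions = [f"{p.operator}({p.attribute}, ?)" for p in query.predicates]
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql, [p.argument for p in query.predicates]


query = VisualQuery("DNAFragments", ["id"],
                    [PredicateNode("contains", "fragment", "ATTGCCATA")])
sql_text, params = to_sql(query)
```

User-supplied constants are passed as parameters rather than spliced into the SQL text, so the generated statement can be handed to the database unchanged.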
Meanwhile, a number of XML applications (e.g.,
GEML [40], RiboML [46], phyloML [54], Gb2xml [26])
exist for genomic data. Unfortunately, these are
inappropriate for a representation of the high-level objects
of the Genomics Algebra. Hence, we plan to design our
own XML application, which we name GenAlgXML.
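
What such an input/output facility might look like can be sketched with Python's standard xml.etree.ElementTree module. The element name dnaFragment, the child element sequence, and the id attribute are purely hypothetical placeholders; the vocabulary of the planned XML application has not been defined yet.

```python
import xml.etree.ElementTree as ET


def fragment_to_xml(fragment_id: int, bases: str) -> str:
    """Serialize one DNA fragment; element and attribute names are hypothetical."""
    root = ET.Element("dnaFragment", {"id": str(fragment_id)})
    ET.SubElement(root, "sequence").text = bases
    return ET.tostring(root, encoding="unicode")


def fragment_from_xml(document: str) -> tuple[int, str]:
    """Parse a document produced by fragment_to_xml back into its parts."""
    root = ET.fromstring(document)
    return int(root.attrib["id"]), root.findtext("sequence")


xml_text = fragment_to_xml(1, "ATTGCCATA")
round_trip = fragment_from_xml(xml_text)
```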

6.5 Genomic Index Structures and Genomic Data
Optimization
We briefly mention two important research topics which
are currently not the focus of our considerations but which
will become relevant in the future, since they enhance the
performance of the Unifying Database and the Genomics
Algebra. These topics relate to the construction of
genomic index structures and to the design of
optimization techniques for genomic data.
As we add the ability to store genomic data, a need
arises for indexing these data by using domain-specific,
i.e., genomic, indexing techniques. These should support,
e.g., similarity or substructure search on nucleotide
sequences, or 3D structural comparison of tertiary protein
structures. The DBMS must then offer a mechanism to
integrate these user-defined index structures.
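
One simple candidate for a substructure index on nucleotide sequences is an inverted index over k-mers (substrings of fixed length k). The sketch below is only an illustration of the filter-and-refine idea under this assumption; it is not the index structure of the Genomics Algebra, and the value of K and the sample fragments are arbitrary.

```python
from collections import defaultdict

K = 4  # k-mer length; an arbitrary choice for this sketch


def build_kmer_index(fragments: dict[int, str], k: int = K) -> dict[str, set[int]]:
    """Map every k-mer to the ids of the fragments in which it occurs."""
    index: dict[str, set[int]] = defaultdict(set)
    for frag_id, bases in fragments.items():
        for i in range(len(bases) - k + 1):
            index[bases[i:i + k]].add(frag_id)
    return index


def search(pattern: str, fragments: dict[int, str],
           index: dict[str, set[int]], k: int = K) -> list[int]:
    """Return the ids of fragments containing pattern (needs len(pattern) >= k)."""
    if len(pattern) < k:
        raise ValueError("pattern shorter than k; fall back to a full scan")
    # Filter step: a matching fragment must contain every k-mer of the pattern.
    kmers = [pattern[i:i + k] for i in range(len(pattern) - k + 1)]
    candidates = set.intersection(*(index.get(kmer, set()) for kmer in kmers))
    # Refinement step: verify candidates, since k-mer co-occurrence is only
    # a necessary condition for a substring match.
    return sorted(fid for fid in candidates if pattern in fragments[fid])


fragments = {1: "GGATTGCCATAC", 2: "TTTTCCCCGGGG", 3: "CCATATTGCATT"}
index = build_kmer_index(fragments)
hits = search("ATTGCCATA", fragments, index)
```

Only the candidates that survive the filter step are compared against the full pattern, which is what makes such an index attractive for large collections.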
The development of optimization techniques for non-
standard data (e.g., spatial, spatio-temporal, fuzzy data)
must currently be regarded as immature due to the
complexity of these data. Nevertheless, optimization rules
for genomic data, information about the selectivity of
genomic predicates, and cost estimation of access plans
containing genomic operators would considerably improve
the performance of query execution.
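
To illustrate how selectivity information could be used, the following sketch estimates the selectivity of a predicate from a random sample and evaluates the most selective conjunct first. This is a generic textbook heuristic shown under simplifying assumptions (equal evaluation cost for all predicates, a non-empty list of values); the function names and sample data are hypothetical.

```python
import random


def estimate_selectivity(predicate, values, sample_size=100, seed=0):
    """Estimate the fraction of values that satisfy predicate from a random sample."""
    rng = random.Random(seed)
    sample = values if len(values) <= sample_size else rng.sample(values, sample_size)
    return sum(1 for v in sample if predicate(v)) / len(sample)


def order_predicates(predicates, values):
    """Order conjunctive predicates so that the most selective one comes first."""
    return sorted(predicates, key=lambda named: estimate_selectivity(named[1], values))


fragments = ["GGATTGCCATAC", "TTTTCCCCGGGG", "CCATATTGCATT", "ATATATATATAT"]
predicates = [
    ("contains_AT", lambda s: "AT" in s),    # satisfied by 3 of the 4 fragments
    ("contains_GCC", lambda s: "GCC" in s),  # satisfied by 1 of the 4 fragments
]
plan = [name for name, _ in order_predicates(predicates, fragments)]
```

A realistic optimizer would also weigh the evaluation cost of each genomic operator, which can differ by orders of magnitude.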

7. Vision
We believe our project will cause a fundamental change
in the way biologists analyze genomic data. No longer
will biologists be forced to interact with hundreds of
independent data repositories each with their own
interface. Instead, biologists will work with a unified
database through a single user interface specifically
designed for biologists. Our high-level Genomics Algebra
allows biologists to pose questions using biological terms,
not SQL statements. Managing user data will also become
much simpler for biologists, since their data can also be
stored in the Unifying Database and they will no longer
have to prepare a custom database for each data
collection. Biologists should, and indeed want to, invest
their time in being biologists, not computer scientists.
A major impact of our approach on the biological
community as a whole is enhancing the opportunities for
biologists who are not part of a large multidisciplinary
research center. For molecular and cellular biologists, even
moderate-sized gene sequencing and gene expression
projects create unfamiliar information management
demands. As more and more scientists undertake
genomics projects, they discover that in the absence of a
well-developed institutional informatics infrastructure,
they must identify and implement the means on their own.
For biologists this is often a bewildering task, requiring
resolution of disjointed data analysis functions, choosing
between a confusing variety of commercial and public
domain software tools, and solving data type and file
structure incompatibilities. The Genomics Algebra
approach empowers biologists to perform complex
analyses on large-scale collections of data without
needing a computer scientist at their side. This is
especially beneficial to biologists who work alone or in a
small group since it "levels the playing field," permitting
any biologist with a good idea to pursue it fully and not be
discouraged by the lack of local infrastructure and support
personnel.
Our approach offers another benefit to the biological
community in that it fosters collaborations among
biologists by providing both a convenient means for them
to share data, and a powerful suite of functions to analyze
their data. With our approach, users can choose whether to
make their data accessible to other users and specify
which users, if any, receive this privilege. Accessing
custom user data in our unified database, and analyzing it,
is indistinguishable from performing those tasks on the
available public data. Currently, to use another biologist's
data, especially a large quantity of data that's stored in a
custom-designed database, requires one to link to the
machine containing the database and write custom
functions to perform the analyses.
From a computer science perspective, the main
implication is extended knowledge about the design and
implementation of new, sophisticated data structures and
efficient algorithms in the non-standard application field
of biology and bioinformatics. The Genomics Algebra,
comprising all these data structures and algorithms, will be
made publicly
available so that other groups in the community can study,
improve, and extend it.
From a database perspective, our project leverages
and extends the benefits and possibilities of current
database technology. In particular, we demonstrate the
elegance and expressive power of modeling and
integrating non-standard and extremely complex data by
the concept of abstract data types into databases and query
languages. In addition, our approach is independent of a
specific underlying DBMS data model. That is, the
Genomics Algebra can be embedded in a relational,
object-relational, or object-oriented DBMS as long as it is
equipped with the appropriate extensibility mechanisms.

Acknowledgements

We wish to thank our collaborators Drs. William Farmerie
and Li Liu as well as Kevin Holland from the
Interdisciplinary Center for Biotechnology Research
(ICBR) at the University of Florida for their valuable
input to this research. They have been instrumental in
helping us identify, understand, and formalize the
genomic data management requirements that became the
starting point for our work on the Genomics Algebra and
Unifying Database.

References

[1] S. Abiteboul, J. McHugh, M. Rys, V. Vassalos, and J.
Wiener, "Incremental Maintenance for Materialized
Views over Semistructured Data," in Proceedings of the
24th Annual International Conference on Very Large Data
Bases, New York City, NY, pp. 38-49, 1998.
[2] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J.
Lipman, "Basic local alignment search tool," Journal of
Molecular Biology, 215:2, pp. 403-410, 1990.
[3] A. Bairoch and R. Apweiler, "The SWISS-PROT protein
sequence data bank and its supplement in TrEMBL,"
Nucleic Acids Research, 26:1, pp. 38-42, 1998.
[4] W. C. Barker, J. S. Garavelli, D. H. Haft, L. T. Hunt, C. R.
Marzec, B. C. Orcutt, G. Y. Srinivasarao, L.-S. L. Yeh, R.
S. Ledley, H. W. Mewes, F. Pfeifer, and A. Tsugita, "The
PIR-International Protein Sequence Database," Nucleic
Acids Research, 26:1, pp. 27-32, 1998.
[5] Baylor College of Medicine, "BEAUTY: An Enhanced
BLAST-based Search Tool that Integrates Multiple
Biological Information Resources into Sequence
Similarity Search Results", Web Site,
http://www.cshl.org/genomere/supplement/worley/.
[6] D. A. Benson, M. S. Boguski, D. J. Lipman, J. Ostell, and
B. F. F. Ouellette, "GenBank," Nucleic Acids Research,
26:1, pp. 1-7, 1998.
[7] P. Bernstein, "Middleware: A Model for Distributed
System Services," in Communications of the ACM
(CACM), 1996.
[8] Bioinformatics Group at Iowa State University,
"MuSeqBox: a program for Multi-query Sequence Blast
output examination", Web Site,
http://bioinformatics.iastate.edu/bioinformatics2go/mb/MuSeqBox.html.
[9] I. Bionavigator, "BioNavigator", Web Site,
http://www.bionavigator.com.
[10] J. A. Blakeley, P.-A. Larson, and F. W. Tompa,
"Efficiently Updating Materialized Views," in
Proceedings of the ACM SIGMOD International
Conference on Management of Data, Washington, D.C.,
pp. 61-71, 1986.
[11] A. Borgida, R. Brachman, D. McGuinness, and L.
Resnick, "CLASSIC: A structural data model for objects,"
in Proceedings of the ACM SIGMOD International
Conference on Management of Data, 1989.
[12] S. S. Chawathe, A. Rajaraman, H. Garcia-Molina, and J.
Widom, "Change Detection in Hierarchically Structured
Information," in Proceedings of the ACM SIGMOD
International Conference on Management of Data,
Montreal, Canada, pp. 493-504, 1996.
[13] S. Davidson, J. Crabtree, B. Brunk, J. Schug, V. Tannen,
C. Overton, and C. Stoeckert, "K2/Kleisli and GUS:
Experiments in integrated access to genomic data
sources," IBM Systems Journal, 40:2, pp. 512-531, 2001.
[14] S. B. Davidson, C. Overton, and P. Buneman, "Challenges
in integrating biological data sources," Journal of
Computational Biology, 2:4, pp. 557-572, 1995.
[15] H. Ehrig and B. Mahr, Fundamentals of Algebraic
Specification I, Springer-Verlag, 1985.
[16] T. Etzold, A. Ulyanov, and P. Argos, "SRS: information
retrieval system for molecular biology data banks,"
Methods Enzymol, 266:1, pp. 114-128, 1996.
[17] European Bioinformatics Institute (EBI), "Translated
EMBL (TrEMBL)", Web Site,
http://www.ebi.ac.uk/trembl/.
[18] Gene Ontology Consortium, "Gene Ontology: tool for the
unification of biology," Nature Genetics, 25:1, pp. 25-29,
2000.
[19] T. R. Gruber, "A Translation Approach to Portable
Ontologies," Knowledge Acquisition, 5:2, pp. 199-220,
1993.
[20] R. H. Güting, M. H. Böhlen, M. Erwig, C. S. Jensen, M.
Schneider, N. A. Lorentzos, and M. Vazirgiannis, "A
Foundation for Representing and Querying Moving
Objects," ACM Transactions on Database Systems
(TODS), 25:1, pp. 1-42, 2000.
[21] J. Hammer and D. McLeod, "An Approach to Resolving
Semantic Heterogeneity in a Federation of Autonomous,
Heterogeneous Database Systems," International Journal
of Intelligent & Cooperative Information Systems, 2:1, pp.
51-83, 1993.
[22] J. Hammer and D. McLeod, "On the Resolution of
Representational Diversity in Multidatabase Systems," in
Management of Heterogeneous and Autonomous
Database Systems, A. K. Elmagarmid, M. Rusinkiewicz,
and A. P. Sheth, Eds. San Francisco, CA: Morgan
Kaufmann, 1998, pp. 91-118.
[23] S. Hayne and S. Ram, "Multi-User View Integration
System (MUVIS): An Expert System for View
Integration," in Proceedings of the 6th International
Conference on Data Engineering, 1990.
[24] R. Hull, "Managing Semantic Heterogeneity in Databases:
A Theoretical Perspective," in Proceedings of the
Sixteenth ACM SIGACT-SIGMOD-SIGART Symposium
on Principles of Database Systems, Tucson, Arizona, pp.
51-61, 1997.


[25] IBM Corp., "DiscoveryLink", Web Site,
http://www.ibm.com/discoverylink.
[26] P. Institute, "Genbank to XML conversion tool", Web
Site,
http://bioweb.pasteur.fr/seqanal/interfaces/gb2xml.html.
[27] P. Karp, "EcoCyc: Encyclopedia of Escherichia coli
Genes and Metabolism", Web Site,
http://ecocyc.org/.
[28] R. Kimball and M. Ross, The Data Warehouse Toolkit:
The Complete Guide to Dimensional Modeling, 2nd
ed., John Wiley & Sons, 2002.
[29] H. A. Kuno and E. A. Rundensteiner, "Incremental
maintenance of materialized object-oriented views in
multiview: Strategies and performance evaluation," IEEE
Transactions on Knowledge and Data Engineering, 10:5,
pp., 1998.
[30] W. J. Labio and H. Garcia-Molina, "Efficient Snapshot
Differential Algorithms for Data Warehousing," in
Proceedings of the International Conference on Very
Large Databases, Bombay, India, pp. 63-74, 1996.
[31] J. Larson, S. B. Navathe, and R. Elmasri, "A Theory of
Attribute Equivalence and its Applications to Schema
Integration," IEEE Transactions on Software Engineering,
15:4, pp. 449-463, 1989.
[32] C. Li, E. Chang, H. Garcia-Molina, and G. Wiederhold,
"Clustering for Approximate Similarity Search in High-
Dimensional Spaces," IEEE Transactions on Knowledge
and Data Engineering, 14:4, pp. 792-808, 2002.
[33] R. MacGregor and R. Bates, "The Loom Knowledge
Representation Language," Technical Report ISIRS-87-
188, May 1987.
[34] R. McEntire, P. Karp, N. Abernethy, D. Benton, G. Helt,
M. DeJongh, R. Kent, A. Kosky, S. Lewis, D. Hodnett, E.
Neumann, F. Olken, D. Pathak, P. Tarczy-Hornoch, L.
Toldo, and T. Topaloglou, "An Evaluation of Ontology
Exchange Languages for Bioinformatics," in Proceedings
of the 2000 Conference on Intelligent Systems for
Molecular Biology, 2000.
[35] S. B. Navathe, R. ElMasri, and J. Larson, "Integrating
User Views in Database Design," IEEE Computer,
pp. 50-62, 1986.
[36] Pathogenomics Project, "PhyloBLAST", Web Site,
http://www.pathogenomics.bc.ca/phyloBLAST/.
[37] N. Paton and C. Goble, "Information Management for
Genome Level Bioinformatics," in Proceedings of the
International Conference on Very Large Databases
(VLDB), Rome, Italy, 2001,
http://www.cs.man.ac.uk/~norm/VLDBTutorial.ppt.
[38] N. W. Paton, R. Stevens, P. G. Baker, C. A. Goble, S.
Bechhofer, and A. Brass, "Query processing in the
TAMBIS bioinformatics source integration system," in
Proceedings of the 11th International Conference on
Scientific and Statistical Databases, pp. 138-147, 1999.
[39] C. Pluempitiwiriyawej and J. Hammer, "A Hierarchical
Clustering Model to Support Automatic Reconciliation of
Semistructured Data," University of Florida, Gainesville,
FL, Technical Report TR01-015, December 2001,
ftp://ftp.dbcenter.cise.ufl.edu/Pub/publications/tr99-019.pdf.
[40] Rosetta Biosoftware, "The Gene Expression Markup
Language (GEML)", Web Site,
http://www.rosettabio.com/products/conductor/geml/default.htm.
[41] A. Savasere, A. Sheth, S. Gala, S. Navathe, and H.
Marcus, "On Applying Classification to Schema
Integration," in Proceedings of the First International
Workshop on Interoperability in Multidatabase Systems,
pp. 258-261, 1991.
[42] M. Schneider, "Spatial Data Types for Database Systems -
Finite Resolution Geometry for Geographic Information
Systems," in LNCS, 1288: Springer-Verlag, 1997.
[43] M. Schneider, "Uncertainty Management for Spatial Data
in Databases: Fuzzy Spatial Data Types," in Proceedings
of the 6th International Symposium on Advances in
Spatial Databases (SSD), pp. 330-351, 1999.
[44] Sheffield University, "Protein Active Site Template
Acquisition (PASTA)", Web Site,
http://www.dcs.shef.ac.uk/research/groups/nlp/pasta/.
[45] A. Sheth, J. Larson, A. Cornelio, and S. B. Navathe, "A
Tool for Integrating Conceptual Schemata and User
Views," in Proceedings of the Fourth International
Conference on Data Engineering, pp. 176-183, 1988.
[46] Stanford Medical Informatics, "RiboML", Web Site,
http://www.smi.stanford.edu/projects/helix/riboml/.
[47] G. Stoesser, M. A. Moseley, J. Sleep, M. McGowran, M.
Garcia-Pastor, and P. Sterk, "The EMBL Nucleotide
Sequence Database," Nucleic Acids Research, 26:1, pp. 8-
15, 1998.
[48] M. Stonebraker, "Inclusion of New Types in Relational
Data Base Systems," in Proceedings of the 2nd
International Conference On Data Engineering, pp. 262-
269, 1986.
[49] M. Stonebraker, B. Rubenstein, and A. Guttman,
"Application of Abstract Data Types and Abstract Indices
to CAD Databases," in Proceedings of the ACM/IEEE
Conference on Engineering Design Applications, pp. 107-
113, 1983.
[50] Y. Tateno, H. Fukami-Kobayashi, S. Miyazaki, H.
Sugawara, and T. Gojobori, "DNA Data Bank of Japan at
work on genome sequence data," Nucleic Acids Research,
26:1, pp. 16-20, 1998.
[51] The National Human Genome Research Institute,
"GeneMachine", Web Site,
http://genome.nhgri.nih.gov/genemachine


[52] The National Human Genome Research Institute,
"WebBlast", Web Site,
http://genome.nhgri.nih.gov/webblast/.
[53] University of Pennsylvania, "CBIL bioWidgets for Java",
Web Site,
http://www.cbil.upenn.edu/bioWidgets/.
[54] Washington and Lee University, "phyloML", Web Site,
http://cs.wlu.edu/~roycet/phyloML/.
[55] Weizmann Institute of Science Department of Biological
Services, "GenPept", Web Site,
http://inn.weizmann.ac.il/databanks/genpept.html.
[56] J. Widom, "Research Problems in Data Warehousing," in
Proceedings of the Fourth International Conference on
Information and Knowledge Management, Baltimore,
Maryland, pp. 25-30, 1995.
[57] K. Zhang and D. Shasha, "Simple fast algorithms for the
editing distance between trees and related problems,"
SIAM Journal on Computing, 18:6, pp. 1245-1262, 1989.
[58] K. Zhang, R. Statman, and D. Shasha, "On the editing
distance between unordered labeled trees," Information
Processing Letters, 42:1, pp. 133-139, 1992.
[59] Y. Zhuge and H. Garcia-Molina, "Graph structured views
and their incremental maintenance," in Proceedings of the
14th International Conference on Data Engineering
(ICDE), Orlando, Florida, pp. 116-125, 1998.


[60] Y. Zhuge, H. Garcia-Molina, J. Hammer, and J. Widom,
"View Maintenance in a Warehousing Environment," in
Materialized Views, A. Gupta and I. S. Mumick, Eds.
Boston, MA: MIT Press, 1999, pp. 616.
[61] Y. Zhuge, H. Garcia-Molina, and J. L. Wiener, "The
Strobe Algorithms for Multi-Source Warehouse
Consistency," in Proceedings of the Conference on
Parallel and Distributed Information Systems, Miami
Beach, FL, 1996.
[62] Y. Zhuge, H. Garcia-Molina, and J. L. Wiener,
"Consistency Algorithms for Multi-Source Warehouse
View Maintenance," Distributed and Parallel Databases,
6:1, pp. 7-40, 1998.

