Group Title: BMC Bioinformatics
Title: Genephony : a knowledge management tool for genome-wide research
Permanent Link: http://ufdc.ufl.edu/UF00099928/00001
 Material Information
Title: Genephony : a knowledge management tool for genome-wide research
Physical Description: Book
Language: English
Creator: Nuzzo, Angelo
Riva, Alberto
Publisher: BMC Bioinformatics
Publication Date: 2009
 Notes
Abstract: BACKGROUND: One of the consequences of the rapid and widespread adoption of high-throughput experimental technologies is an exponential increase of the amount of data produced by genome-wide experiments. Researchers increasingly need to handle very large volumes of heterogeneous data, including both the data generated by their own experiments and the data retrieved from publicly available repositories of genomic knowledge. Integration, exploration, manipulation and interpretation of data and information therefore need to become as automated as possible, since their scale and breadth are, in general, beyond the limits of what individual researchers and the basic data management tools in normal use can handle. This paper describes Genephony, a tool we are developing to address these challenges. RESULTS: We describe how Genephony can be used to manage large datasets of genomic information, integrating them with existing knowledge repositories. We illustrate its functionalities with an example of a complex annotation task, in which a set of SNPs coming from a genotyping experiment is annotated with genes known to be associated to a phenotype of interest. We show how, thanks to the modular architecture of Genephony and its user-friendly interface, this task can be performed in a few simple steps. CONCLUSION: Genephony is an online tool for the manipulation of large datasets of genomic information. It can be used as a browser for genomic data, as a high-throughput annotation tool, and as a knowledge discovery tool. It is designed to be easy to use, flexible and extensible. Its knowledge management engine provides fine-grained control over individual data elements, as well as efficient operations on large datasets.
General Note: Periodical Abbreviation:BMC Bioinformatics
General Note: Start page 278
General Note: M3: 10.1186/1471-2105-10-278
 Record Information
Bibliographic ID: UF00099928
Volume ID: VID00001
Source Institution: University of Florida
Holding Location: University of Florida
Rights Management: Open Access: http://www.biomedcentral.com/info/about/openaccess/
Resource Identifier: issn - 1471-2105
http://www.biomedcentral.com/1471-2105/10/278




BMC Bioinformatics




Software

Genephony: a knowledge management tool for genome-wide research
Angelo Nuzzo1 and Alberto Riva*2,3


Address: 1Centre for Tissue Engineering, University of Pavia, via Ferrata 1, I-27100, Pavia, Italy, 2Department of Molecular Genetics and
Microbiology, University of Florida, Gainesville, Florida 32610, USA and 3University of Florida Genetics Institute, University of Florida,
Gainesville, Florida 32610, USA
Email: Angelo Nuzzo angelo.nuzzo@unipv.it; Alberto Riva* ariva@ufl.edu
* Corresponding author


Published: 3 September 2009
BMC Bioinformatics 2009, 10:278 doi:10.1186/1471-2105-10-278


Received: 4 March 2009
Accepted: 3 September 2009


This article is available from: http://www.biomedcentral.com/1471-2105/10/278
© 2009 Nuzzo and Riva; licensee BioMed Central Ltd.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0),
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.



Abstract
Background: One of the consequences of the rapid and widespread adoption of high-throughput
experimental technologies is an exponential increase of the amount of data produced by genome-
wide experiments. Researchers increasingly need to handle very large volumes of heterogeneous
data, including both the data generated by their own experiments and the data retrieved from
publicly available repositories of genomic knowledge. Integration, exploration, manipulation and
interpretation of data and information therefore need to become as automated as possible, since
their scale and breadth are, in general, beyond the limits of what individual researchers and the
basic data management tools in normal use can handle. This paper describes Genephony, a tool we
are developing to address these challenges.
Results: We describe how Genephony can be used to manage large datasets of genomic
information, integrating them with existing knowledge repositories. We illustrate its functionalities
with an example of a complex annotation task, in which a set of SNPs coming from a genotyping
experiment is annotated with genes known to be associated to a phenotype of interest. We show
how, thanks to the modular architecture of Genephony and its user-friendly interface, this task can
be performed in a few simple steps.
Conclusion: Genephony is an online tool for the manipulation of large datasets of genomic
information. It can be used as a browser for genomic data, as a high-throughput annotation tool,
and as a knowledge discovery tool. It is designed to be easy to use, flexible and extensible. Its
knowledge management engine provides fine-grained control over individual data elements, as well
as efficient operations on large datasets.


Background
Modern biomedical research is an increasingly knowledge-intensive
endeavor. New experimental technologies and high-throughput
analysis methods produce vast quantities of data with each
experiment. Systems biology approaches investigate biological
processes on a large scale, relying on the measurement and
analysis of thousands of variables in order to elucidate the
structure and behavior of complex biological systems. Online
databases store an exponentially increasing amount of information,
from raw DNA sequences to high-level observations on
genotype/phenotype correlations. The shift from hypothesis-based
to hypothesis-free research that is made possible by
these technological and methodological advances opens
up unprecedented new opportunities for studying biolog-
ical systems on a large scale, at a low cost, and with a
holistic perspective that promises to expand our under-
standing of biological processes and of their connections
with clinically relevant outcomes.

In order to take advantage of this paradigm-changing evo-
lution, researchers will increasingly need effective, practi-
cal tools to handle very large volumes of heterogeneous
data, both generated by their own experiments and
retrieved from publicly available repositories of genomic
knowledge [1]. Integration, exploration, manipulation
and interpretation of such data therefore need to become
as automated as possible, since the "traditional" data
inspection and analysis methods are quickly becoming
inadequate in a scenario in which an investigator can sam-
ple hundreds of thousands of variables in parallel, and an
entire new genome can be sequenced and annotated in a
matter of days.

While a large amount of work is under way to develop ad-
hoc analysis methods, able to address the well-known
problems related with the statistical significance of results
based on a very large number of observations, it is appar-
ent that all phases of the scientific discovery process
(hypothesis generation and testing, background knowl-
edge gathering, experiment design, interpretation of
results, generation of new knowledge) will have to be
adapted to this new reality. The post-genomic era will
increasingly require methods and tools able to automati-
cally link new observations and findings to preexisting
knowledge.

Finally, new data storage and retrieval systems will need to
be developed and adopted in order to handle the unprec-
edented volumes of data and information being generated
in an efficient and productive way. Knowledge and data
are represented using nomenclatures, classification
schemes and annotation formats that are constantly
evolving and often incompatible with each other. Creat-
ing, storing and manipulating datasets consisting of hun-
dreds of thousands of records, integrating knowledge
from multiple heterogeneous sources, combining and
mining data in novel ways for exploratory research, are all
tasks that can represent a significant bottleneck for an
average researcher who is not an expert in database usage
or programming [2].

The ability to effectively address the challenges outlined
above will have a direct, dramatic impact on the speed,
accuracy and effectiveness of scientific progress in all areas
of the life sciences. We are therefore working on developing
tools to facilitate the discovery process in high-throughput
biomedical research, by providing high usability and
effective automation of complex tasks through
an easily accessible and intuitive interface. This paper
describes Genephony, a powerful online tool designed to
assist the non-technical user in creating and manipulating
large datasets of genomic information.

Implementation
Genephony is a Web-based application whose main pur-
pose is to allow the user to easily build sets of biological
objects. Sets can be created by providing identifiers or
query terms, or can be derived from other sets through
appropriate transformations. The system automatically
keeps track of the relationships among sets, and allows the
user to freely navigate through them via a simple, consist-
ent and intuitive user interface. Genephony is designed to
be highly interoperable with other online tools: it accepts
a wide variety of common formats in input, it provides
extensive data export capabilities, and it features a SOAP
server interface [3] that allows other software tools to pro-
grammatically interact with it.

The knowledge base
Genephony is able to handle a wide variety of object
types, including genomic entities (chromosome regions,
genes, transcripts, SNPs, miRNAs, CNVs), classifications
and taxonomies (GeneOntology, HomoloGene, path-
ways), experimental identifiers (probesets for common
gene expression and genotyping microarrays), computa-
tional predictions (e.g. transcription factor binding sites),
and high-level genetic and phenotypic data (e.g. SNP fre-
quencies from HapMap, entries from OMIM and GAD
[4]).

The system relies on a local, integrated database of
genomic information that includes information about
most of the object types mentioned above and, when
practical, on real-time access to online resources. It is
important to note that Genephony does not try to repro-
duce exactly the entire contents of all the source databases
it uses: doing so would be extremely impractical and ulti-
mately not very useful. Genephony's local knowledge
base, instead, represents a selection of the most com-
monly used object types and data elements, a selection
that reflects the needs and requirements of an "average"
genomic study. Since the system is based on a modular
and general architecture, the default knowledge base
described here can easily be replaced with alternative ones
that are focused on alternative domains, by defining new
object types and new relationships among them.

The choice of maintaining a local database implies an
effort to ensure its contents are up to date, through scripts
that periodically check the source databases for new data
releases. On the other hand, the alternative solution of
retrieving the data from the source databases in real time
is not practical for a variety of reasons: to start, most
online resources enforce a limit on the number and frequency
of queries that they accept from a client, making it
impossible to work on large volumes of data; not all
resources provide an interface to access their data in an
efficient and machine-friendly way; and finally, accessing
very large datasets over the Internet is usually too slow for
practical uses.

Data and object representation
Biological objects are internally represented as data struc-
tures composed of several slots, each of which contains a
single element of information. For example, SNPs (Single
Nucleotide Polymorphisms) may be represented by an
object containing slots for the SNP identifier (NCBI "rs"
number), its genomic location (chromosome and posi-
tion), its alleles, and its validation status. Each object pos-
sesses a unique identifier. Usually this will be the
"natural" identifier of the entity being described, when
available (e.g., HGNC names for human genes, NCBI "rs"
identifiers for SNPs); otherwise one will be internally gen-
erated by the system.
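
The slot-based representation described above can be sketched as follows. This is a minimal illustration in Python, not Genephony's actual implementation: the class names (SNP, ObjectSet), the field names and the identifier "rs0000001" are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass(frozen=True)
class SNP:
    """One biological object: each attribute plays the role of a slot."""
    rs_id: str                 # unique identifier (NCBI "rs" number)
    chrom: str                 # chromosome, e.g. "chr1"
    position: int              # position on the chromosome
    alleles: Tuple[str, str]   # observed alleles, e.g. ("G", "T")
    validated: bool            # validation status


@dataclass
class ObjectSet:
    """A set: a collection of objects of one type, keyed by identifier."""
    object_type: str
    members: Dict[str, SNP] = field(default_factory=dict)

    def add(self, obj: SNP) -> None:
        # Keying by the unique identifier stores each object only once.
        self.members[obj.rs_id] = obj

    def __len__(self) -> int:
        return len(self.members)


# "rs0000001" is a made-up identifier used only for illustration.
snps = ObjectSet("SNP")
snps.add(SNP("rs0000001", "chr1", 65671908, ("G", "T"), False))
snps.add(SNP("rs0000001", "chr1", 65671908, ("G", "T"), False))  # duplicate
```

Keying the collection by the unique identifier is one simple way to guarantee that the same entity never appears twice in a set.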

A set is a collection of objects of the same type. Sets are created
by the user by entering query terms, by uploading
files, or by performing operations on existing sets. There is
no a priori limit on the number of objects that a set can
contain, or on the number of sets that can be created, and
the system is optimized to handle sets containing a very
large number of objects. Sets can then be browsed, fil-
tered, annotated and exported in a variety of ways. The
next section provides detailed information on all the data-
set operations available in Genephony.

Results
To start working with Genephony, the user creates a ses-
sion, giving it a unique identifier. No password is cur-
rently required, although one may be optionally used to
protect data privacy. Once a session has been established,
the user can populate it by creating new sets in one of the
following ways:

1) By manually entering one or more identifiers. The system
is able to automatically recognize a large number
of common identifiers; this is accomplished by a set of
autodetect procedures that examine the supplied iden-
tifiers and determine their possible meanings (Table 1
displays a list of the currently recognized identifiers).
When multiple identifiers are entered, the system will
select the autodetector that applies to the majority of
them; although the user has the option of overriding
autodetection by manually specifying how to interpret
identifiers, this is rarely necessary. After decoding all
supplied identifiers, the system creates a set containing
the corresponding objects.





2) By uploading a file containing identifiers. The sys-
tem accepts delimited text files and Excel spreadsheets,
and handles both ZIP and gzip compression. The user
needs only specify the column that contains the iden-
tifiers of interest; the identifiers are then parsed and
translated into objects using the same procedure
described in 1).

3) By deriving them from an existing set, or combining
two existing sets. In a Derive operation, a new set is
generated using the data from a single existing one.
For example, given a set of genomic regions, it is pos-
sible to generate the set of all SNPs belonging to them.
In a Combine operation, the data contained in two
existing sets is used to generate a new one.
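
The autodetection step described in 1) can be sketched as a majority vote among pattern-based detectors. The sketch below is only an assumption about how such a procedure might look: the DETECTORS table, its regular expressions and the function names are hypothetical, and cover just a few of the identifier types listed in Table 1.

```python
import re
from collections import Counter

# Hypothetical detectors: each identifier type is paired with a pattern.
DETECTORS = {
    "dbSNP": re.compile(r"^rs\d+$"),
    "ENSEMBL gene": re.compile(r"^ENSG\d{11}$"),
    "GeneOntology": re.compile(r"^GO:\d{7}$"),
    "Genomic region": re.compile(r"^chr[0-9XYM]+:[\d,]+-[\d,]+$"),
    "Entrez GeneID": re.compile(r"^\d+$"),
}


def autodetect(identifiers):
    """Return the identifier type matching the most inputs, or None."""
    votes = Counter()
    for ident in identifiers:
        for name, pattern in DETECTORS.items():
            if pattern.match(ident.strip()):
                votes[name] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


def decode(identifiers, forced_type=None):
    """Keep the identifiers matching the detected (or user-forced) type."""
    kind = forced_type or autodetect(identifiers)
    if kind is None:
        return None, []
    pattern = DETECTORS[kind]
    return kind, [i.strip() for i in identifiers if pattern.match(i.strip())]
```

For example, decode(["rs36126692", "rs705300", "3456"]) selects the dbSNP detector because it applies to two of the three inputs, while passing forced_type="Entrez GeneID" corresponds to the user overriding autodetection.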

When a set is created, it is initially populated only with the
identifiers of the objects it should contain; the objects
themselves are actually created the first time they are
accessed, for efficiency (in some cases a set is only used to
derive other sets from it, and the object identifiers may be
sufficient for that purpose). The data used to create the
objects are usually retrieved from the local database
through highly optimized queries, although in general
they could come from other sources as well (e.g., from
real-time access to remote resources).
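
The deferred creation of objects can be illustrated with a small sketch. It assumes a loader function that stands in for a query against the local database; LazySet, load_snp and the identifiers are hypothetical names introduced for the example.

```python
class LazySet:
    """Holds identifiers; builds each object only on first access."""

    def __init__(self, identifiers, loader):
        self.identifiers = list(identifiers)
        self._loader = loader      # function: identifier -> object
        self._cache = {}           # identifier -> already-built object

    def get(self, identifier):
        if identifier not in self._cache:
            self._cache[identifier] = self._loader(identifier)
        return self._cache[identifier]

    def __len__(self):
        return len(self.identifiers)


# Stand-in for a database query; records which identifiers were loaded.
calls = []


def load_snp(rs_id):
    calls.append(rs_id)
    return {"id": rs_id, "chrom": "chr1"}


snps = LazySet(["rs1", "rs2", "rs3"], load_snp)
first = snps.get("rs2")
again = snps.get("rs2")   # served from the cache, no second load
```

In this sketch the size of the set is known from the identifiers alone, and only the objects that are actually inspected are ever built.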

All generated datasets are permanently stored in the cur-
rent session. The system keeps track of how each set was
generated, and of which other sets were generated by it. It
is therefore always possible to reconstruct the path
through which any individual dataset was produced, and
the user has the option of navigating back to previously-
generated sets at any time, in order to examine the data
they contain or to generate new sets by following alterna-
tive Derive or Combine paths. Moreover, the system
records the relationships among individual objects in
datasets that are derived from each other. For example,
when a set of SNPs is derived from a set of genomic
regions as described above, the system will create a table
associating each SNP with the genomic region (or
regions) it belongs to, and each region with the SNPs it
contains. In general, these will be many-to-many relation-
ships, and will allow the user to determine how an indi-
vidual object was produced or how many derived objects
were produced by an object in the starting set. These data
structures are also used by the Annotate command as
described below.
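
A Derive operation that records the many-to-many relationships between parent and child objects might look like the following sketch. The function name, the data layout and the coordinates are assumptions made for illustration; the real system performs this step with database queries.

```python
from collections import defaultdict


def derive_snps(regions, snps):
    """Derive the SNPs that fall inside any region, recording the links.

    regions: {region_id: (chrom, start, end)}
    snps:    {snp_id: (chrom, position)}
    Returns (derived_ids, snp_to_regions, region_to_snps).
    """
    snp_to_regions = defaultdict(set)
    region_to_snps = defaultdict(set)
    for snp_id, (chrom, pos) in snps.items():
        for region_id, (r_chrom, start, end) in regions.items():
            if chrom == r_chrom and start <= pos <= end:
                snp_to_regions[snp_id].add(region_id)
                region_to_snps[region_id].add(snp_id)
    return sorted(snp_to_regions), snp_to_regions, region_to_snps


# Made-up coordinates; the two regions overlap on purpose.
regions = {"R1": ("chr3", 100, 200), "R2": ("chr3", 150, 300)}
snps = {"rsA": ("chr3", 120), "rsB": ("chr3", 180), "rsC": ("chr5", 180)}
derived, parents, children = derive_snps(regions, snps)
```

Here rsB lies in both overlapping regions and therefore has two parents, which is why the relationship has to be stored in both directions rather than as a single parent pointer.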

User interface
Genephony's main interface window, shown in Figure 1,
is divided into three panels. On the left is the Workspace,
which lists all the sets in the current session. The user can
"focus" any one of the sets in the session by clicking on its
name in the Workspace panel; the currently focused set is
shown in bold face. The top right panel displays information
about the currently focused set: its name, a description,
the number and type of object it contains, how it was
generated, and how many other sets it is a parent of (the
set's name and description are editable and can be
changed by the user at any time). This panel also contains
the buttons through which the user can perform all available
operations. Finally, the bottom panel is used to display
information about the contents of a set, or about its
relationships with other sets; it is also used to get input
from the user when running certain commands. For example,
when the user clicks on the button for the Derive
command, the bottom panel will display the list of derive
operations available for the current set.

Table 1: Identifiers automatically recognized by the system

Object                            Source                        Examples
Genomic regions                   UCSC Genome Browser (hg18)    chr3:120,000,000-150,000,000
Cytogenetic bands                 UCSC Genome Browser (hg18)    chr3:q13.11, chr3:q13
dbSNP identifiers                 dbSNP (build 130)             rs36126692
Entrez GeneID identifiers         NCBI                          3456
HGNC gene symbols                 NCBI                          IFNB1
Genbank mRNA accession numbers    NCBI                          NM_002176
SWISSPROT protein identifiers     SWISSPROT                     P01574
ENSEMBL gene identifiers                                        ENSG00000171855
GeneOntology classes                                            GO:0051990
OMIM entries                      NCBI                          MIM:178600, MIM:hypertension
STS markers                       UCSC Genome Browser (hg18)    AFM344WE9, GDB:199719
MicroRNA identifiers              Sanger                        hsa-mir-942
MicroRNA accession numbers        Sanger                        MI0005767
Microarray probesets              Affymetrix, Illumina          208173_at
SNP microarray probesets          Affymetrix                    SNP_A-1507458
CNVs                              Affymetrix                    Variation_0008
GAD entries                                                     GAD:retinopathy

The New command opens the initial page and allows for
the creation of a new set. The Info command displays
additional information about the current set that would
not fit in the top panel, such as the complete list of sets
that were derived from it. The Derive and Combine commands
are used to generate new datasets from the current
one as described above.

The Browse command displays the contents of the current
set as a table in which each row represents an object in the
set and each column contains one of the fields of the
objects. Several commands are available while browsing:
the user may hide or show any column in the table, and
sort the set contents by the value of any field by clicking
on the header of the corresponding column. Clicking on
a table row brings up a page containing detailed informa-
tion about the object it contains, including the set of "par-
ent" objects that generated it. For instance, considering
again the example described above, the page for an indi-
vidual SNP object would contain the complete list of
fields with their values, and the list of genomic regions
that contain it (there may be more than one "parent"
region for a SNP because genomic regions, in general, may
overlap each other). The user may then choose to restrict
the display to a manually-selected subset of the rows; the
remaining objects in the set are effectively filtered out, as
described below.

[Figure 1: screenshot of the Genephony main window, with set Set2112 (2,761 SNPs, created by file upload) browsed as a table listing the chromosome, position and alleles of each SNP.]

Figure 1
Genephony main window. Genephony's user interface is based on a single window divided into three panels. The Workspace
lists all the sets in the current session. The Current Dataset panel displays information about the currently focused set and
contains the buttons through which the user can perform all available operations. The third panel displays information about
the contents of a set and its relationships with other sets, and is used to receive input from the user. This figure shows the
contents of the initial dataset, generated by uploading a text file containing 2,761 SNP identifiers.

The Filter command can be used to hide the objects in a
set that do not meet a specified condition. Once a dataset
is filtered, all subsequent operations only apply to its vis-
ible elements. For example, a set of regions may be filtered
to display only the ones belonging to a specified chromo-
some. If the user then applies the Derive command to
derive the set of SNPs they contain, the operation will be
applied only to the visible regions, and the resulting set of
SNPs will contain only the ones belonging to the regions
on the specified chromosome. Filters can be inverted, to
select only objects that do not meet the filter condition,
and multiple filters can be applied together, in order
to select the objects that meet all specified conditions
(see Figure 2). Finally, filters can be removed, bringing
the set back to its initial state, with all objects visible.
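
The filtering semantics (non-destructive, invertible, combined with a logical AND) can be sketched as follows. The Filter class, its active and invert flags and the sample records are hypothetical; they mirror the Active? and Invert? checkboxes only loosely.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Filter:
    predicate: Callable[[Dict], bool]
    active: bool = True
    invert: bool = False

    def accepts(self, obj: Dict) -> bool:
        result = self.predicate(obj)
        return (not result) if self.invert else result


def visible(objects: List[Dict], filters: List[Filter]) -> List[Dict]:
    """Objects passing every active filter; the input list is untouched."""
    active = [f for f in filters if f.active]
    return [o for o in objects if all(f.accepts(o) for f in active)]


# Made-up records used only to exercise the filters.
snps = [
    {"id": "rs1", "chrom": "chr1", "validated": True},
    {"id": "rs2", "chrom": "chr1", "validated": False},
    {"id": "rs3", "chrom": "chr5", "validated": True},
]
on_chr1 = Filter(lambda s: s["chrom"] == "chr1")
validated = Filter(lambda s: s["validated"])
both = visible(snps, [on_chr1, validated])
not_chr1 = visible(snps, [Filter(on_chr1.predicate, invert=True)])
```

Because filtering only computes a view, removing all filters (an empty filter list in this sketch) returns the full set unchanged.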

A very powerful feature offered by Genephony is the Annotate
command, which allows the contents of a dataset to be
added to any one of its "parent" or "child" datasets. For
example, a set of SNPs can be used to annotate the set of
genomic regions it was derived from: the system will keep
track of the relationship between each SNP and the region
it belongs to, so that when browsing or exporting the set
of regions, the system will automatically associate each
region with the set of SNPs it contains and display the
contents of both datasets side by side (see the example of
use of the system in the Results section). It is important to
note that this feature works across any number of Derive
steps, in both directions: a set can be used to annotate its
parent, its parent's parent, its child, its child's child and so
on. Combined with filtering and with the ability to select
individual fields of the objects, the Annotate feature can
be used to create richly annotated datasets in a few simple
steps.
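
Annotation across several Derive steps amounts to chaining the relationship tables recorded at each step. The following sketch shows one way to do this under that assumption; the compose and annotate functions and the link tables are hypothetical and are not taken from the Genephony code.

```python
def compose(first, second):
    """Chain two link tables: a -> b and b -> c give a -> c."""
    result = {}
    for a, bs in first.items():
        reached = set()
        for b in bs:
            reached |= second.get(b, set())
        result[a] = reached
    return result


def annotate(objects, links, label):
    """Attach to each object the sorted identifiers linked to it."""
    return [
        {**obj, label: sorted(links.get(obj["id"], set()))}
        for obj in objects
    ]


# Hypothetical link tables recorded by two successive Derive steps.
region_to_transcripts = {"R1": {"T1", "T2"}, "R2": {"T2"}}
transcript_to_snps = {"T1": {"rsA"}, "T2": {"rsB", "rsC"}}

region_to_snps = compose(region_to_transcripts, transcript_to_snps)
rows = annotate([{"id": "R1"}, {"id": "R2"}], region_to_snps, "snps")
```

Annotating in the opposite direction (a child set annotated with its ancestors) would use the same composition on the inverted tables.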





[Figure 2: screenshot of the Filter panel for Set2060 (78,717 SNPs, derived from Set2059). Available filters: SNPs that belong to a given chromosome; validated SNPs; a random subset of SNPs; SNPs that also appear in Set2056 (2,761 SNPs); SNPs with genotype data from HGDP.]

Figure 2
Dataset filtering. Datasets can be filtered to display only a subset of the objects they contain. In this example, we apply two
filters to a dataset containing 78,717 SNPs: the first one selects validated SNPs only, while the second one selects SNPs that
also belong to a different set of SNPs. The Active? checkboxes are used to indicate which filters to apply, while the Invert?
checkboxes cause the selected filters to be reversed (i.e., only objects that do not satisfy the filter condition will be displayed).


The Export command allows the user to retrieve the con-
tents of a dataset in a variety of different formats, includ-
ing tab- or comma-delimited text files, Excel spreadsheets,
and HTML tables. The files can be directly downloaded or
received by email, with optional ZIP or gzip compression,
or submitted to the Galaxy online tool [5]. Datasets con-
taining objects that represent chromosome regions can
also be automatically uploaded to the UCSC Genome
Browser and displayed through its well-known interface.
Finally, the corresponding DNA sequences (or their trans-
lation into amino acid sequences in any of the six possible
frames) can be exported in FastA, Genbank or EMBL for-
mat.
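
As an illustration of the simplest export path, the sketch below serializes a dataset to tab- or comma-delimited text with optional gzip compression, using only the Python standard library. The function name and the record layout are assumptions; the Excel, HTML, sequence and Galaxy exports mentioned above are not covered.

```python
import csv
import gzip
import io


def export_rows(rows, columns, delimiter="\t", compress=False):
    """Serialize rows (dicts) to delimited text, optionally gzipped."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerow(columns)                       # header line
    for row in rows:
        writer.writerow([row.get(col, "") for col in columns])
    data = buffer.getvalue().encode("utf-8")
    return gzip.compress(data) if compress else data


# Made-up record; only the selected columns are written.
rows = [{"id": "rs1", "chrom": "chr1", "position": 100}]
plain = export_rows(rows, ["id", "chrom"])
packed = export_rows(rows, ["id", "chrom"], delimiter=",", compress=True)
```

Passing an explicit column list corresponds to exporting only the fields the user has chosen to display.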

Interoperability
In order to facilitate the exchange of data with other appli-
cations, Genephony is designed to accept and to generate
datasets encoded in the most common formats, including
comma- and tab-delimited text files, Excel spreadsheets,
and HTML tables. In addition, Genephony provides a
SOAP server interface allowing external programs to use
its capabilities independently of the Web interface. The
SOAP interface is self-documenting and is fully described
in the system's Help pages. Its WSDL definition is also
provided to enable the automatic generation of client pro-
grams.

Example of use
In this section we present a detailed example of how
Genephony can be used to perform a complex data inte-
gration and annotation task. Let us imagine we have per-
formed a genotyping experiment on a large set of SNPs,
and that statistical analysis of the results has identified a
subset of SNPs that are significantly associated with the
presence of a phenotype of interest. In order to better
characterize our results, we would now like to determine
which of these SNPs are located in genes that are known to be
related to the phenotype. In the example described here, we
used a dataset of 2,761 SNPs, and we chose Insulin-
Dependent Diabetes Mellitus (IDDM) as the disease
under study (for those readers wishing to walk through
this example, the input file containing the SNP identifiers
is available in the "Tutorial" section of the program's Help
pages, along with step-by-step instructions).





Our strategy will be to query a database of genotype-phe-
notype correlations, such as OMIM or GAD, for genes
contained in regions known to be associated with the dis-
ease, to extract the SNPs contained in their transcripts, and
to intersect this set of SNPs with the original set. To start,
we create a new session and upload the input file using the
"Create Dataset" form, specifying that the identifiers are in
column 1. The system automatically parses the "rs" iden-
tifiers contained in the file and creates an initial set of
2,761 SNPs, that can be displayed using the Browse com-
mand (see Figure 1). Next, we turn to the problem of iden-
tifying genomic regions associated with diabetes. One
possible way to do this is by querying the OMIM database:
we return to the "Create Dataset" screen and enter the
query term "MIM: diabetes" in the "Enter region or identi-
fier(s)" field (the MIM: prefix is used to indicate that the
following term should be interpreted as part of an OMIM
entry title). This results in a set containing the 89 OMIM
entries containing the word "diabetes" in their title. We
then use the Browse command to display the contents of
this dataset, locate the row containing OMIM entry
222100, "Diabetes Mellitus, Insulin-Dependent, IDDM",
and select it by shift-clicking on it. Clicking on the "click
to filter" command appearing at the top of the browse
window, we filter the OMIM set restricting it to this single
entry. We can now exploit the information on genomic
regions associated with phenotypes provided by OMIM to
create a set of regions, using the appropriate Derive command;
the result in this case consists of six genomic
regions. With a further Derive operation we create the set
of all 773 transcripts contained in these regions and, in
turn, of the 78,717 SNPs contained in them.

In order to answer our original question, we just need to
find the SNPs that appear both in this set and in the one
uploaded at the beginning of this session. This is accom-
plished using the Filter command, since one of the avail-
able filters restricts the set to the SNPs that also appear in
another set. We apply this filter together with a second
one that only displays validated SNPs, as shown in Figure
2, and the result is the set of 12 SNPs shown in Figure 3.

To conclude, we would like to annotate the resulting set of
SNPs with information about the genes they belong to.
We start by creating the set of transcripts containing the
SNPs and the set of genes producing these transcripts,
using two more Derive operations (it is important to note
that the Genephony knowledge base treats genes and tran-
scripts as distinct objects, since the same gene may pro-
duce multiple transcripts having a different layout on the
chromosome). To simplify the display, we use the
"Choose columns" menu to select just the GenelD, Gene
symbol, and Gene name columns. Then, using the "Anno-
tate" command we annotate these genes with the set of
SNPs they were derived from. To view the resulting annotated
dataset, we select the set of 12 SNPs from the workspace
window, browse it, and select the set of genes from
the "Annotations" menu (see Figure 4).

To summarize, using Genephony we were able to quickly
identify a set of SNPs belonging to genes that are known
to be involved in IDDM and for which we have genotype
data in our dataset, a task that would have otherwise
required accessing at least three different databases and
performing complex data integration steps on large data-
sets.

Conclusion
As the life sciences increasingly become knowledge-inten-
sive disciplines, every effort aimed at facilitating the pro-
duction, organization and dissemination of new
knowledge is bound to have a profound effect on the
speed, accuracy and effectiveness of scientific research,
and of genome-wide, hypothesis-free research in particular.
Data and information production in this new era is
measured on extraordinarily large scales: just in the field
of sequencing, massively parallel DNA sequencing sys-
tems have increased our sequencing capacity to hundreds
of millions of base-pairs per process run. Microarray technology
for gene expression or genotype analysis is undergoing
a similar evolution, with modern platforms now
reaching one million simultaneous measurements. Paral-
lel advances are taking place in proteomics, transcriptom-
ics and metabolomics. This is having a profound effect on
genomics-based research throughout the full range of bio-
logical science: whole-genome studies that were once
unfeasible are now within the possibilities of any
medium-sized laboratory, the distinction between model
and non-model organisms has been blurred, and it is now
possible to directly sequence entire collections of
microbes, and viruses.

Genephony is an online tool aimed at researchers who
need an easy, practical way to annotate, integrate and
explore genomic knowledge and data resulting from large-
scale experiments. The system is robust, efficient and
extremely easy to use: it automatically determines which
operations are applicable to each dataset, and presents
them to the user in a detailed, readable form. Identifiers
are automatically recognized and converted in order to
establish relationships between different datasets. Interval
operations are available for all objects that represent
regions on chromosomes (e.g. transcripts, binding sites).
Very complex sequences of data manipulations can thus
be performed in just a few steps, and no knowledge of the
structure of the underlying database is required.
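
As an illustration of what an interval operation involves, the sketch below selects the point features (for example SNPs) that fall inside a set of chromosomal regions. It is a minimal sketch under stated assumptions, not a description of how Genephony performs the operation: the `Region` type, the `features_in_regions` function and all coordinates are hypothetical, and regions are assumed to be closed intervals on 1-based positions.

```python
from bisect import bisect_right
from collections import defaultdict
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int  # inclusive
    end: int    # inclusive


def features_in_regions(features, regions):
    """Return the names of point features (name, chrom, pos) that lie
    inside at least one region, preserving the input order."""
    by_chrom = defaultdict(list)
    for region in regions:
        by_chrom[region.chrom].append((region.start, region.end))

    # Merge overlapping regions per chromosome so that one binary
    # search per feature is enough to decide membership.
    index = {}
    for chrom, spans in by_chrom.items():
        spans.sort()
        starts, ends = [spans[0][0]], [spans[0][1]]
        for start, end in spans[1:]:
            if start <= ends[-1]:
                ends[-1] = max(ends[-1], end)
            else:
                starts.append(start)
                ends.append(end)
        index[chrom] = (starts, ends)

    hits = []
    for name, chrom, pos in features:
        starts, ends = index.get(chrom, ([], []))
        i = bisect_right(starts, pos) - 1  # last region starting at or before pos
        if i >= 0 and pos <= ends[i]:
            hits.append(name)
    return hits


regions = [Region("chr1", 100, 200), Region("chr1", 150, 300), Region("chr7", 10, 20)]
features = [("snpA", "chr1", 250), ("snpB", "chr1", 301),
            ("snpC", "chr7", 10), ("snpD", "chr2", 15)]
print(features_in_regions(features, regions))
```

Sorting and merging the regions once keeps each lookup logarithmic in the number of regions, which matters when the feature set is large.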

Compared to similar systems such as Galaxy [5], DAVID
[6], or BioMart [7], Genephony offers a more explicit and
general representation of biomedical object types and of
the relationships among them (as opposed to Galaxy's
flat-file model or DAVID's gene-set centric view), a flexible
workflow model that does not constrain the user to a
predefined analysis or annotation path, leaving them
free to generate and combine datasets in an exploratory
way, and powerful data reuse and interoperability
features. Moreover, Genephony does not enforce a limit on
the size of the datasets the user can use, thus making it
possible to operate on the entire contents of a set at once
regardless of its dimensions.

[Figure 3 screenshot: the Genephony workspace listing Set2056 (2761 SNPs), a set containing 1 OMIM entry, Set2058 (6 regions), Set2059 (773 transcripts) and Set2060 (12 SNPs), with Set2060 open in the dataset browser; columns include SNP name, Chromosome, Position and Alleles.]

Figure 3
Result dataset. The dataset of 12 SNPs resulting from the data integration example described in the text. These SNPs are
members of the initial set of 2,761 SNPs generated from the uploaded file, are validated according to dbSNP, and belong to
transcripts in regions known to be associated with IDDM (according to the OMIM database).


Genephony does not currently offer graphical output
capabilities, since its main focus is on knowledge and
information management, but it provides flexible ways of
exporting the contents of its datasets in standard formats
for use in external visualization and data manipulation
tools such as the UCSC Genome Browser and Galaxy.
Although it is not an analysis tool, its rich knowledge base
makes it suitable for scenarios ranging from basic
genomic data annotation to translational research appli-
cations aimed at establishing links between the genomic
level and medically relevant phenotypes.
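
To make the export step concrete, the following sketch writes a small set of SNP positions in BED format, one of the track formats accepted by the UCSC Genome Browser. It is only an illustration: the `to_bed` function, the track name and the example records are hypothetical and do not reproduce Genephony's export code, and the input positions are assumed to be 1-based.

```python
def to_bed(snps, track_name="example_snps"):
    """Format (name, chrom, 1-based position) records as BED text,
    sorted by chromosome name and position."""
    lines = [f'track name="{track_name}"']
    for name, chrom, pos in sorted(snps, key=lambda s: (s[1], s[2])):
        # BED uses 0-based, half-open intervals: a single base at
        # 1-based position p spans [p - 1, p).
        lines.append(f"{chrom}\t{pos - 1}\t{pos}\t{name}")
    return "\n".join(lines) + "\n"


# Hypothetical records, not taken from the paper's dataset.
example = [("snpB", "chr7", 20), ("snpA", "chr1", 100)]
print(to_bed(example), end="")
```

The coordinate shift is the detail most easily overlooked when moving data between tools, since many annotation sources report 1-based positions while BED does not.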


The manipulation and interpretation of very large datasets
represent a significant bottleneck for researchers who are
not experts in database technology and programming. By
providing them with effective tools to perform these
increasingly common tasks, Genephony has the potential
to accelerate the process of turning experimental data into
verifiable hypotheses and biomedically relevant findings.
Genephony could also be used as a platform for the
dissemination of domain-specific knowledge, since its
modular nature facilitates the creation of customized
knowledge bases. It can therefore be helpful in making
biomedical information available and accessible outside
the boundaries of the research community, resulting in an
added benefit for the general public.

BMC Bioinformatics 2009, 10:278
http://www.biomedcentral.com/1471-2105/10/278

[Figure 4 screenshot: Set2060 (12 SNPs) open in the dataset browser, with the gene annotations from Set2062 (12 genes) displayed alongside each SNP. The gene columns that are legible in this copy are listed below; the twelfth row is illegible.]

GeneID   Gene symbol  Gene name
55356    SLC22A15     solute carrier family 22, member 15
963      CD53         CD53 molecule
29957    SLC25A24     solute carrier family 25 (mitochondrial carrier, phosphate carrier), member 24
57535    KIAA1324     KIAA1324
84900    TMEM118      hypothetical protein LOC84900
6301     SARS         seryl-tRNA synthetase
113263   GLCCI1       glucocorticoid induced transcript 1
9734     HDAC9        histone deacetylase 9
27159    CHIA         chitinase, acidic
148281   SYT6         synaptotagmin VI
283455   KSR2         kinase suppressor of ras 2

Figure 4
Annotated results. Using the Annotate command, the final set of SNPs can be annotated with information about the genes
the SNPs belong to (found in Set2062, that is derived from the set of SNPs through an intermediate set of transcripts). The
contents of the two datasets can thus be displayed side by side and exported as a single table.

Availability and requirements
Project name: Genephony

Project home page: http://genome.ufl.edu/gp/

Access policy: the system is freely available to anybody.
Users are asked to enter a session identifier to
start using the system. Using an e-mail address as the
identifier is preferable, but is not required.


Operating system(s): Platform independent


Programming language: Common Lisp, Java


Any restrictions to use by non-academics: freely
available.


Authors' contributions
AR designed the Genephony system and was responsible
for its implementation. AN contributed to the develop-
ment of the SOAP interface and other interoperability fea-
tures. Both authors read and approved the final
manuscript.


Acknowledgements
The authors wish to acknowledge Dr. Isaac S. Kohane (Harvard Medical
School) for providing the initial inspiration and support for this work, and
Brandon Walts for technical assistance. This project was partially
supported by NIH grant R01 HL87681-01, "Genome-Wide Association Studies
in Sickle Cell Anemia and in Centenarians", by NIH grant 5U54LM008748-02
(National Centers for Biomedical Computing), and by the "Bioinformatics
for Tissue Engineering: Creation of an International Research Group"
project, funded by Fondazione Cariplo.


References
1. Philippi S, Kohler J: Addressing the problems with life-science
databases for traditional uses and systems biology. Nat Rev
Genet 2006, 7(6):482-488.
2. Stein L: Genome annotation: from sequence to biology. Nat
Rev Genet 2001, 2(7):493-503.
3. The World Wide Web Consortium SOAP Specification
Version 1.2 [http://www.w3.org/TR/soap12-part1]
4. Becker KG, Barnes KC, Bright TJ, Wang SA: The genetic
association database. Nat Genet 2004, 36(5):431-432.
5. Giardine B, Riemer C, Hardison RC, Burhans R, Elnitski L, Shah P,
Zhang Y, Blankenberg D, Albert I, Taylor J, Miller W, Kent WJ,
Nekrutenko A: Galaxy: a platform for interactive large-scale
genome analysis. Genome Res 2005, 15(10):1451-1455.
6. Dennis G Jr, Sherman BT, Hosack DA, Yang J, Gao W, Lane HC,
Lempicki RA: DAVID: Database for Annotation, Visualization, and
Integrated Discovery. Genome Biol 2003, 4(5):P3.
7. Kasprzyk A, Keefe D, Smedley D, London D, Spooner W, Melsopp D,
Hammond M, Rocca-Serra P, Cox T, Birney E: EnsMart: a generic
system for fast and flexible access to biological data. Genome
Res 2004, 14(1):160-169.









