Citation
BAXQL_BLAST: AN ENHANCED BLAST BIOINFORMATICS HOMOLOGY SEARCH TOOL WITH BATCH AND STRUCTURED QUERY SUPPORT

Material Information

Title:
BAXQL_BLAST: AN ENHANCED BLAST BIOINFORMATICS HOMOLOGY SEARCH TOOL WITH BATCH AND STRUCTURED QUERY SUPPORT
Copyright Date:
2008

Subjects

Subjects / Keywords:
Bioinformatics ( jstor )
Blasts ( jstor )
Data integration ( jstor )
Databases ( jstor )
HTML ( jstor )
Information search ( jstor )
Java ( jstor )
Relational databases ( jstor )
SQL ( jstor )
XML ( jstor )

Record Information

Source Institution:
University of Florida
Holding Location:
University of Florida
Rights Management:
Copyright the author. Permission granted to the University of Florida to digitize, archive and distribute this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.
Embargo Date:
8/8/2002
Resource Identifier:
51514966 ( OCLC )


Full Text

BAXQL_BLAST: AN ENHANCED BLAST BIOINFORMATICS HOMOLOGY SEARCH TOOL WITH BATCH AND STRUCTURED QUERY SUPPORT

By

TSUNG-LU LEE

A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE

UNIVERSITY OF FLORIDA

2002

Copyright 2002 by Tsung-Lu Lee

To God I serve.

ACKNOWLEDGMENTS

First, I am grateful to have an excellent advisor, Dr. Li-Min Fu, who has always been patient, caring, and full of encouragement for me and my research. I would like to dedicate to him all the success I have achieved, and I wish him the best of luck in his career and life. Also, I have been very fortunate to have two outstanding professors, Dr. Jonathan C. Liu and Dr. Joachim Hammer, as my committee members. I would like to express my sincere appreciation to them for spending their valuable time and energy to assist me throughout the thesis process. I think the CISE department at the University of Florida is an excellent academic institution for graduate studies in computer science.

Finally, I wish to thank my parents, Chin-Cheng Lee and Li-Yuan Chen, for all the effort, energy, money, and love they have provided to bring me to this stage. I truly thank them and love them. Most importantly, I thank God.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
1.1 Internet and Biological Data
1.1.1 The Web
1.1.2 Problem with Large Data
1.2 BAXQL_BLAST
1.2.1 Homology Sequence Search
1.2.2 Motivation

2 BLAST AND BIOINFORMATICS BACKGROUND
2.1 Biological Sequence Comparison and Database Search
2.1.1 What Is NCBI?
2.1.2 What Is BLAST?
2.1.3 More on BLAST
2.2 Related Research Based on BLAST
2.2.1 Automated Multi-Sequence Query Approach
2.2.2 Visualization and Identification Approach
2.2.3 Web Approach

3 BATCH_BLAST
3.1 Motivation and Architecture of BAXQL_BLAST
3.1.1 Motivation
3.1.2 Architecture of BAXQL_BLAST
3.2 Why XML?
3.2.1 What Is XML?
3.2.2 XML Versus ASN.1
3.2.3 Why XML Benefits Our Batch BLAST Project?
3.3 Implementation of Batch_BLAST
3.3.1 Java Multithreading
3.3.2 Java Swing Components
3.3.3 How Batch_BLAST Works?

4 SQL_BLAST
4.1 Not Enough XML
4.1.1 XML Is Easier to Parse
4.1.2 Well-supported XML
4.1.3 XML and Biological Databases
4.2 Why Relational Databases?
4.3 Architecture and Implementation
4.3.1 Motivation of Sql_BLAST
4.3.2 BLAST XML Parser and Relational Tables
4.3.3 Entrez
4.3.4 The Java Database Connectivity (JDBC): A Bridge to Databases
4.3.5 JavaServer Pages (JSP): A Web Front Desk to Databases

5 SUMMARY AND FUTURE WORK
5.1 What Do We Achieve?
5.2 Where Do We Improve?
5.3 Future Work: Bioinformatics Data Integration

6 CONCLUSIONS
6.1 The XML Solution
6.2 Biological Intelligence

REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

Table
2-1: NCBI Major Databases and Tools
4-1: XML-enabled databases
4-2: Choosing the Right Database for Your XML Data
5-1: Achievements of BAXQL_BLAST

LIST OF FIGURES

Figure
2-1: NCBI web-based search interface
2-2: NCBI BLAST web site
2-3: BLAST input interface
2-4: BLAST returns a request ID in the status page
2-5: BLAST result in HTML format
2-6: Interface of program MuSeqBox
2-7: Graphic output interface of program BioWidgets
2-8: Output results of program Saturated BLAST
2-9: Web interface of program PhyloBLAST
3-1: The growth rate of GenBank
3-2: Architecture of BAXQL_BLAST
3-3: Example of XML BLAST results
3-4: Architecture of Swing, AWT, and JFC
3-5: Swing component hierarchy
3-6: Batch_BLAST: user input page
3-7: Batch_BLAST: parameter setting page
3-8: Batch_BLAST: result status page
4-1: BLAST HTML results
4-2: BLAST text results
4-3: Storing XML data
4-4: Architecture of Sql_BLAST
4-5: Sql_BLAST relational data tables and schema
4-6: Databases involved in Entrez
4-7: Example of Entrez results
4-8: Example of JSP
5-1: Biological Data Integration by using BAXQL_BLAST

Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Master of Science

BAXQL_BLAST: AN ENHANCED BLAST BIOINFORMATICS HOMOLOGY SEARCH TOOL WITH BATCH AND STRUCTURED QUERY SUPPORT

By

Tsung-Lu Lee

August 2002

Chair: Li-Min Fu
Major Department: Computer and Information Sciences and Engineering

The Basic Local Alignment Search Tool (BLAST), maintained by the National Center for Biotechnology Information (NCBI), is the most popular and widely used biological sequence similarity search tool. The NCBI servers perform an average of 70,000 BLAST searches daily. BAXQL_BLAST is an enhanced computational tool that eases the overhead of searching multiple sequences simultaneously. Additionally, the search results are saved locally as tables in relational database systems for further retrieval with the Structured Query Language (SQL).

BAXQL_BLAST contains two main components: Batch_BLAST and Sql_BLAST. Batch_BLAST automatically saves BLAST results in Extensible Markup Language (XML) format to the local file system and simultaneously triggers Sql_BLAST to transform the XML results into relational tables.

The two major contributions of BAXQL_BLAST to the homology sequence search are automated batch sequence searching and structured query support. Applying structured queries to BLAST results allows users to find their target sequences efficiently and reduces false-positive results effectively.

CHAPTER 1
INTRODUCTION

The importance of searching for homology sequences with the assistance of computational tools in bioinformatics has increased rapidly due to the human genome project and related research involving biology, genetics, medicine, pharmacy, chemistry, computer science, engineering, and many other areas. It is an interesting and challenging task that requires effort from both the biological and computational fields. As the number of newly discovered genes and sequences has increased very quickly, the demand for biological database systems and bioinformatics analysis tools is much higher today. Many existing genomic and biological databases provide users with an easily accessed Web interface for submitting sequences over the Internet [1], and these easy-to-use sequence submission tools have pushed the number of nucleotide sequences in these databases to a new high. One of the largest nucleic acid sequence databases, GenBank (http://www.ncbi.nlm.nih.gov/Database/), contained more than 14 million sequences and 15 billion base pairs of DNA in early 2002 [2].

1.1 Internet and Biological Data

Behind the current growth of genomic databases is a big hand pushing computational biology and bioinformatics into a new era, and that big hand is the Internet. The continuous development of network programming languages and technologies has made Internet resources more useful and more accessible to biological and genomic researchers, whose work relies heavily on the most complete and up-to-date genome databases throughout the world.

For example, the following three major nucleic acid sequence databases are located at three different sites, and all are widely used not only in their own regions but also in many other regions of the world.

1. GenBank http://www.ncbi.nlm.nih.gov/ U.S.A.
2. EMBL http://www.ebi.ac.uk/ Europe
3. DDBJ http://www.ddbj.nig.ac.jp/ Japan

1.1.1 The Web

Among all the Internet components and resources, such as electronic mail, file transfer protocol, telnet, gopher, and the Web, the World Wide Web (WWW), which began in the early 1990s, is the newest, most powerful, and most popular one [3]. With the rich content support of Web browsers, biological data such as DNA sequence text files, protein structure images, and even medical education videos are available to remote users over the Internet. The Web has great presentation advantages over other Internet components, such as e-mail, telnet, and gopher.

The Web is not only an excellent component for presentation but also an outstanding data collection tool that is available throughout the world via the Internet. Data collection is a very important step in bioinformatics research, and there are currently three major genome databanks collecting and updating nucleotide sequences daily via the WWW. The three major databases all provide a WWW submission site to simplify the process of sequence submission. Their Web sites include the following:

1. GenBank http://www.ncbi.nlm.nih.gov/BankIt/
2. EMBL http://www.ebi.ac.uk/Submissions/
3. DDBJ http://www.ddbj.nig.ac.jp/sub-e.html

1.1.2 Problem with Large Data

The Web submission forms make genomic data collection easier and more efficient. However, the large volumes of newly discovered sequences have created significant bottlenecks for users around the world who would like to access the databases. Biological data are still growing at exponential rates in many databases, and the continuous progress in genetics and molecular biology, combined with new algorithms and tools in computer technology, makes the growth of genetic and biological databases even faster [4, 5]. As soon as the human genome project is complete, 3 billion base pairs will join the already crowded genome databases. Suddenly, the weight of bioinformatics has been shifted to the shoulders of computer scientists, whose mission in bioinformatics resides in the following three major areas:

1. Computational tools.
2. Biological database management.
3. Data and knowledge integration.

1.2 BAXQL_BLAST

In recent years, many computational tools and technologies have focused on each of the three major research areas of bioinformatics mentioned above. Implementing useful computational tools to assist molecular biologists in DNA sequencing, homology sequence searching, and protein functionality prediction are the most popular and most developed areas for computer scientists and programmers. One reason is that these areas are directly related to human genome projects and are also the areas that need the most assistance from computer hardware and software.

Among the computer tools for homology sequence searching, the Basic Local Alignment Search Tool (BLAST) [6], which is maintained by the National Center for Biotechnology Information (NCBI) at the National Institutes of Health (NIH), is the most popular and widely used bioinformatics program.

1.2.1 Homology Sequence Search

The BLAST program at NCBI is a sequence similarity tool that aligns the target sequences against other sequences in the databases to find the ones that receive high similarity scores. The main reason scientists need to search for homology sequences in the genome databases is that doing so helps predict and determine the functionalities or associated organisms that the sequences control or relate to; therefore, this is a very important step in sequence analysis [7].

In order to search the target sequences against a large number of sequences in the databases, many genome databanks provide useful and fast computer algorithms and tools to speed up the process of homology sequence search. Fortunately, the existing bioinformatics tools are reasonably fast and easy to use via Web-based interfaces. Among all the computational tools associated with sequence similarity search, BLAST is the most popular and frequently used program.

1.2.2 Motivation

Although BLAST is a very fast program compared to other sequence similarity search tools, such as FASTA [8] and Smith-Waterman [9], its sensitivity is lower because its algorithm is designed primarily for speed. Also, the results are sent to users in either e-mail format or Hyper Text Markup Language (HTML) format. Neither format is designed for storage: users can view their HTML results in a Web browser for only 24 hours; otherwise, they must save the files in HTML format on local drives, which is not an organized way to manage biological data.
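To make the notion of a similarity score from Section 1.2.1 more concrete, the following is a minimal sketch of the Smith-Waterman local alignment score mentioned above. It is not BLAST's heuristic and not part of BAXQL_BLAST; the class name and the scoring values (match +2, mismatch -1, linear gap -1) are assumptions chosen only for illustration.

```java
// Minimal Smith-Waterman local alignment score (no traceback).
// The scoring values are illustrative assumptions, not BLAST's parameters.
class LocalAlignmentSketch {
    static final int MATCH = 2;      // assumed reward for identical residues
    static final int MISMATCH = -1;  // assumed penalty for differing residues
    static final int GAP = -1;       // assumed linear gap penalty

    /** Returns the best local alignment score between sequences a and b. */
    static int score(String a, String b) {
        int[][] h = new int[a.length() + 1][b.length() + 1];
        int best = 0;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int diag = h[i - 1][j - 1]
                        + (a.charAt(i - 1) == b.charAt(j - 1) ? MATCH : MISMATCH);
                int up = h[i - 1][j] + GAP;     // gap in b
                int left = h[i][j - 1] + GAP;   // gap in a
                // A local alignment may restart anywhere, so a cell never drops below 0.
                h[i][j] = Math.max(0, Math.max(diag, Math.max(up, left)));
                best = Math.max(best, h[i][j]);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(score("ACGT", "ACGT"));
        System.out.println(score("TTACGTTT", "GGACGTGG"));
    }
}
```

Under these assumed values, two identical four-residue sequences score 8. The full dynamic-programming table is what makes this exhaustive approach more sensitive but slower than a heuristic search such as BLAST.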

Depending on several factors, such as Internet speed, the length of the query sequence, the speed of the CPU, and so forth, the results of BLAST searches may not be returned to users in a short time. Usually each sequence search takes several minutes, but it can take as long as an hour or as little as a few seconds. Even in the average case, a few minutes for each sequence can add up to a long wait for users who are searching for a batch of sequences.

In order to solve these problems and enhance the usability and flexibility of the BLAST search program, we implement a Java program called BAXQL_BLAST. The BAXQL_BLAST program provides multi-sequence BLAST searches and real-time monitoring of the entire search process, which is not supported by other Web-based or e-mail-based BLAST programs. The program automatically tracks and receives the results in the XML format provided by NCBI, and saves the results in tables of relational database systems.

In this thesis, we discuss the BAXQL_BLAST program in more detail, and we organize the chapters to help readers better understand how the BAXQL_BLAST program is designed to contribute to the three major research areas of bioinformatics today.

In Chapter 2, we provide background on bioinformatics and the NCBI BLAST program. Later in the chapter, we compare some BLAST-related programs and show their advantages and disadvantages.

In Chapter 3, we introduce the first part of the BAXQL_BLAST program, Batch_BLAST, which is a Java real-time program that runs BLAST searches on more than one sequence (nucleotide or protein) at the same time in an automated fashion. Users can upload their batch sequence files into the program and simply click one button to watch all the BLAST search results be retrieved and saved in a short period of time. This chapter contributes mainly to the bioinformatics research area of Computational Tools.

In Chapter 4, we continue by introducing the second part of the BAXQL_BLAST program, Sql_BLAST, which provides both a Java application interface and a JavaServer Pages (JSP) interface to support SQL queries on the BLAST results, which are saved as relational tables in the local database systems. Users can perform further analysis on the results locally without running the BLAST search every time they need the data. This chapter contributes to the bioinformatics research area of Biological Database Management.

In Chapter 5, we discuss some contributions of our BAXQL_BLAST program to bioinformatics research and the areas in which we can improve. Also, we look at the popular concept in biotechnology called biological data integration, and consider the future aspects of our research.
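The batch idea outlined for Chapter 3 can be sketched roughly as follows: split an uploaded multi-sequence file into individual FASTA records, then run one search task per record on a thread pool. This is only an illustrative sketch, not the Batch_BLAST implementation; BatchSketch, splitFasta, and runBatch are hypothetical names, and the search itself is passed in as a function so that no real NCBI call is assumed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Sketch of a batch search: split multi-FASTA text, then run one task per sequence.
// All names here are hypothetical; the search function is supplied by the caller.
class BatchSketch {

    /** Splits multi-FASTA text into records; each record starts with a '>' header line. */
    static List<String> splitFasta(String text) {
        List<String> records = new ArrayList<>();
        StringBuilder current = null;
        for (String line : text.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) continue;           // skip blank lines
            if (trimmed.startsWith(">")) {             // a header opens a new record
                if (current != null) records.add(current.toString());
                current = new StringBuilder(trimmed);
            } else if (current != null) {              // sequence line of the open record
                current.append('\n').append(trimmed);
            }
        }
        if (current != null) records.add(current.toString());
        return records;
    }

    /** Runs the search on every record concurrently and returns results in input order. */
    static List<String> runBatch(List<String> records, Function<String, String> search, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (String record : records) {
                Callable<String> task = () -> search.apply(record);
                pending.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : pending) {
                try {
                    results.add(f.get());              // waits for that one search to finish
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("batch interrupted", e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException("search failed", e.getCause());
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

With a thread pool, the total waiting time is bounded by the slowest searches rather than by the sum of all of them, which is the saving that motivates a batch tool; the pool size would in practice be limited by what the remote server tolerates.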

CHAPTER 2
BLAST AND BIOINFORMATICS BACKGROUND

Bioinformatics is the area of study that combines the power of computational programming and algorithms with existing biological research and data systems. Recently, research in bioinformatics has not only answered some previously unsolved questions in medical and biological areas but has also produced efficient and accurate tools to help researchers, scientists, doctors, and engineers solve new problems more easily.

2.1 Biological Sequence Comparison and Database Search

“Comparison of homologous sequence is an essential step for many studies related to molecular biology and evolution: to predict the function of a new gene, to identify important regions in genomic sequences, to study evolution at the molecular level or to determine the phylogeny of species” [10:21]. This points out the importance of comparative sequence analysis in the human genome projects and other related genetic projects as well. The large and rapidly increasing number of homologous sequences in a public database not only makes sequence analysis more difficult but also makes database searches more expensive.

2.1.1 What Is NCBI?

In 1988, the federal government established the National Center for Biotechnology Information (NCBI), one of the country’s biggest biological database resource centers, as a national resource for molecular biology information. The NCBI provides useful software and analysis tools for searching, comparing, and retrieving nucleotides, proteins, genomes, and many biological databases. Users worldwide can easily submit their newly discovered sequences and results to GenBank (at NCBI) [11], the NIH genetic sequence database, on the Web or through e-mail. The efficient daily update system also makes GenBank the largest growing genetic sequence database available publicly.

Figure 2-1: NCBI web-based search interface.

Figure 2-1 shows that NCBI supports many biomedical databases and tools through its online services. We summarize some of the major databases and tools in the following table.

Table 2-1: NCBI Major Databases and Tools [12]

Databases / Tools    Contains
PubMed               Biomedical literature
Entrez               Several linked databases
BLAST                Homology search tool
OMIM                 Online Mendelian Inheritance in Man
Books                Online books
TaxBrowser           Organisms in GenBank
Structure            3-D macromolecular structures

We now introduce the most developed and most popular sequence similarity search tool at NCBI, BLAST.

2.1.2 What Is BLAST?

The Basic Local Alignment Search Tool (BLAST) is one of the most important biological sequence analysis tools offered and maintained by NCBI. Since its release in 1990, BLAST [13] has become the most widely used standard sequence similarity search tool.

The BLAST program serves as the front door to sequence comparison and analysis. Suppose you are a hard-working biological scientist in one of the thousands of research labs around the world, and one day you discover a possibly new nucleotide sequence in a human organism. Before announcing your discovery, you will first make sure the sequence has not already been discovered by other scientists. In other words, you would like to compare your newly discovered sequence with the sequences in the most complete nucleotide sequence database in the world, which is GenBank, and, of course, the computational tool you will most likely use is BLAST.

Figure 2-2: NCBI BLAST web site.

However, being the most popular and most used sequence search tool does not mean the job is easy. Often, the most popular computational search tool ends up as the most abused program. The BLAST programs by NCBI are the most frequently used sequence similarity search tools in the world today: according to one report, NCBI servers perform an average of 70,000 BLAST searches daily [14], not including the BLAST searches at local servers.

2.1.3 More on BLAST

Despite the large amount of remote access to the NCBI BLAST sequence databases, BLAST remains one of the fastest and most sensitive database searching programs. The Web-based BLAST at NCBI searches against the most up-to-date sequence database and makes the BLAST program available through convenient Web interfaces. We now show the steps and interfaces of the BLAST search program in the following three figures.

First, Figure 2-3 shows the BLAST Web page that accepts the user’s input sequence (one at a time) in FASTA format. It also accepts other formats, such as accession numbers and identifiers [15]. Parameters can be set on the page as well. Users can choose the database they want to search and the type of filter they want to use. The interface is user-friendly and easy to use, but the search currently supports only one sequence at a time.

Figure 2-3: BLAST input interface.

After the sequence is submitted, the Web page directs users to a new page (see Figure 2-4), which is a status page that tells users how long the process will take and gives a request identification (ID) for the current search. The ID is valid for the next 24 hours, and the result can be retrieved by submitting the valid request ID number to the server. While waiting for the process to conclude, users can choose the result format on this page. This is a redundant and confusing step for users who have just selected the format and parameters on the previous page (the BLAST input page, see Figure 2-3). The status page does not automatically show the result page when the result is ready.

Figure 2-4: BLAST returns a request ID in the status page.

If users click the “Format” button before the result is ready for them, the Web page shows the new estimated time, and the page is updated automatically. Finally, when the result is ready, it is shown in HTML format, as in Figure 2-5.
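The manual status-page routine described above is the part a batch tool would automate. A minimal sketch of such a polling loop is given below; it is not NCBI's interface and not the thesis code. PollSketch, the Status values, and the statusOf function are hypothetical stand-ins, and the status lookup is injected so that no real server call is assumed.

```java
import java.util.function.Function;

// Sketch of automating the status page: poll until a request ID reports READY.
// The Status values and the statusOf function are hypothetical, not NCBI's API.
class PollSketch {
    enum Status { WAITING, READY, EXPIRED }

    /**
     * Polls statusOf(requestId) up to maxAttempts times, sleeping between attempts.
     * Returns the number of polls made when READY is seen, or -1 if the request
     * expired, the thread was interrupted, or the attempts ran out.
     */
    static int waitUntilReady(String requestId, Function<String, Status> statusOf,
                              int maxAttempts, long sleepMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Status s = statusOf.apply(requestId);
            if (s == Status.READY) return attempt;   // result can now be fetched
            if (s == Status.EXPIRED) return -1;      // e.g. the ID's validity window has passed
            try {
                Thread.sleep(sleepMillis);           // wait before asking again
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return -1;
            }
        }
        return -1;
    }
}
```

A caller would supply a statusOf function that actually queries the server for the request ID; bounding the number of attempts keeps a batch run from hanging on a single request.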

Figure 2-5: BLAST result in HTML format.

By using the current version of the BLAST search program on the NCBI Web site, users can run BLAST searches through a user-friendly interface that can be accessed from any type of Web browser. This is advantageous to BLAST users; however, the intermediate status page and the long waiting time might be confusing.

In recent years, research and computer tools related to the BLAST search program have attracted public attention. We would like to take a look at some of these computational programs, which have been published in the journal Bioinformatics, to see what kinds of technologies they use to improve the popular homology sequence search tool, BLAST.

2.2 Related Research Based on BLAST

For post-BLAST analysis, many researchers need computational tools that help them analyze result sequences efficiently and accurately.

PAGE 25

14 Since BLAST was developed in 1990 and is maintained by NCBI, its success has made it one of the most popular programs for searching biosequences against biological databases [16]. While biological scientists are focusing on the accuracy part of the sequence analysis research, computer scientists are trying to implement efficient software to help biological scientists get the work done in a faster and more efficient way. Combining the areas of biological and computational science is a great challenge, and those challenges need to be resolved before major achievement can be made in the field of sequence analysis [17]. Since NCBI is the largest Web-based BLAST server site, statistics have shown that the size of GenBank has doubled every 15 months. The growing databases make the job of BLAST even more difficult. During all these years, we have to admit that the BLAST group at NCBI has performed well and overcome a huge challenge of retrieving a large amount of nucleotide and protein sequences in databases. Their continuing upgrades and research make the well-designed BLAST programs not only good for speed but also for the sensitivity of the result sequences. In recent years, more sequence analysis tools have been invented in order to make the most well-known sequence similarity searching program, BLAST, more powerful and efficient. We can divide this research into two areas (pre-processing of input query sequences and post-processing of BLAST results.) Depending on their main functionalities and contributions, we group those post-processing BLAST-related tools into the following three approaches: 1. Automated Multi-Sequence Query Approach: (MuSeqBox, GeneMachine, BLAST Search Updater)

PAGE 26

15 2. Visualization and Identification Approach: (BEAUTY-X, bioWidgets, Saturated BLAST) 3. Web Approach: (WebBLAST, PhyloBLAST) 2.2.1 Automated Multi-Sequence Query Approach The MuSeqBox [18] is written in C++, and it supports a multi-sequence automated query against BLAST databases. It also contains filter and parser to parse the BLAST output in order to store them into the tables in HTML format. The automated multi-sequence query algorithm and easy-access Web interface make it easy to use and see the results. The disadvantage is that the process is not very efficient, and the table it returns is in HTML format, which is not very flexible to store or transfer. Besides, there are too many parameters that need to be inserted by users as well. Figure 2-6: Interface of program MuSeqBox. GeneMachine [19] is a Perl program that allows users to run multiple gene predication programs automatically and simultaneously and return output in ASN.1 format as results. It performs a series of BLAST searches to run the homology searching,

PAGE 27

and the output is returned in ASN.1 format automatically. GeneMachine can be run from the UNIX command line as well as through a Web interface. The main contribution of GeneMachine is that it combines several major gene/exon prediction programs and executes them in an automated fashion, which removes the overhead of users executing each program manually. The well-designed architecture also utilizes several different algorithms to perform the gene identification. The result is in ASN.1 format, which usually cannot be stored and transferred easily, and this limits the extensibility of the result files for further analysis. All results can be viewed with NCBI’s Sequin program, but input sequences need to be entered one at a time, which reduces the efficiency of the program.

The BLAST Search Updater [20] is designed to run a large number of BLAST searches, and it also screens matches against previously obtained results in order to improve the searching score. This automated multi-sequence BLAST search program returns files in HTML format, which can be viewed easily in a Web browser. The program is written in Perl and utilizes the BLASTcl3 client to run the search at the NCBI site. Because of Internet traffic and the busy NCBI BLAST server, the result cannot be returned in a short period of time. Also, there is no Web front end for this Perl program. The result can be viewed only in HTML format and is returned to users by e-mail after some time. HTML is not a good format to store on a local drive or in a database either, so further analysis and querying are limited.

2.2.2 Visualization and Identification Approach

BEAUTY (BLAST Enhanced Alignment Utility) [21] improves the search report of standard BLAST searches by pointing out the most informative results. Resources such as PROSITE, BLOCKS, PRINTS, or Entrez are used to put the BLAST

PAGE 28

result in a better format for analysis and identification. The program is written in Perl and C, and no Web interface is available for this enhanced BLAST program. That could be a disadvantage, since users are forced to download the tool and run it on UNIX machines. Only a single sequence can be input at a time, and the lack of automation limits the efficiency of this gene identification tool as well.

BioWidgets [22] is the only tool among the programs we reviewed that is written in Java. It combines the graphic interface capabilities of the Java language with the flexibility of JavaBean components to create a toolkit containing three components: AnnotView, BlastView, and AlignView. BlastView is designed to overcome the limitations of earlier BLAST viewers, such as VisualBLAST, MacBOB, MView, and NCBI’s HTML output. VisualBLAST can be run only on the Windows 95/NT platform, while MacBOB is only for Macintosh computers. Both MView and NCBI’s BLAST output are in HTML format, which makes updating slow. BioWidgets, in contrast, is platform independent, and its output is viewed in the JavaStudio interface, which makes updating easier and quicker. Although it lacks a Web front end and an ideal result format to store and query, its JavaBean implementation enhances the functionality of connecting to and accessing database servers. We believe that merging database servers with data transfer will be the main architecture of genomic data searching tools.

PAGE 29

Figure 2-7: Graphic output interface of program BioWidgets.

Saturated BLAST [23], developed in Perl/Perl Tk and tested on a UNIX platform, is designed mainly to identify relations among BLAST alignments and to improve the detection of distant homology. It includes a built-in BLAST parser that filters the BLAST results, and a database manager that stores the significant information from the redundant BLAST search results. Saturated BLAST is an automated program and supports a restart function, since all the parameters are stored in advance in a restart file. This is an excellent idea, and it makes the program more reliable. The graphic interface is quite friendly and useful, and the output formats are FASTA, HTML, tables, and plain text. Unfortunately, it does not support automatic multi-sequence searching, and there is no database or data storage designed into the architecture, except that the tables can be edited in Microsoft Excel and Access.

PAGE 30

Figure 2-8: Output results of program Saturated BLAST.

2.2.3 Web Approach

WebBLAST [24] is written in Perl and provides a Web front end for organizing BLAST sequencing data. Java applets are used in the graphic user interface to make the results easier to view and analyze. The result data are in HTML format and stored on the UNIX Web server, which is not an ideal storage system for holding BLAST search results on a large scale. The lack of query functionality on a Web server creates another problem: searching within the BLAST HTML results stored on the UNIX Web server.

PhyloBLAST [25] is also a Web-based application, written in Perl along with CGI. It performs searches by running BLAST2 against the SwissProt/TREMBL database. The application supports multi-sequence input, and automation is one of the standard features of the architecture. The output is a two-dimensional tree structure and a great presentation, but

PAGE 31

no built-in database is provided for storing the JPEG graphics or the ASCII text-based tree graphics.

Figure 2-9: Web interface of program PhyloBLAST.

PAGE 32

CHAPTER 3 BATCH_BLAST

In Chapter 2, we discussed the bioinformatics background, the importance of the BLAST program at NCBI, and several BLAST-related search tools. Although the research focus and main functionality of these BLAST-related tools vary, they share similar fundamental goals: implementing efficient, effective, automated, user-friendly visualization tools to enhance the usability of the BLAST program in the area of sequence searching and analysis.

In this chapter and in Chapter 4, we introduce our implementation of the enhanced BLAST tool called BAXQL_BLAST, a reliable, efficient, user-friendly Java application that enhances the existing NCBI BLAST program with automated multi-sequence support. We integrate several technologies and languages that are ideal for managing and retrieving biological data: Extensible Markup Language (XML), JavaServer Pages (JSP), Java Database Connectivity (JDBC), Structured Query Language (SQL), and Java.

3.1 Motivation and Architecture of BAXQL_BLAST

3.1.1 Motivation

Until recently, most scientists and users of biological software needed to run their homology sequence searches from the command line on UNIX. In 1994, NCBI launched its Web site with the programs BLAST, Entrez, dbEST, and dbSTS, and started focusing its development on the new medium, the Web interface [26]. Since then, the

PAGE 33

homology sequence search, in which a query sequence is aligned with each of the sequences in a database, has been the most fundamental stage of sequence analysis, and the Web-based BLAST program at NCBI has become the most well-known and widely used homology sequence search tool on the Web [27]. In 1995, NCBI created a simple Web-based submission interface that allows research groups worldwide to share their DNA sequences easily by submitting them on the Web [28].

Because of both easy Internet access and excellent integration of biotechnology, the NCBI GenBank database has grown dramatically. The most current release (129.0) of GenBank reports that the GenBank database at NCBI contains 16,769,983 loci, 19,072,679,701 bases, and 16,769,983 reported sequences [29]. Many newly discovered nucleotide and protein sequences are being submitted and added to the GenBank database. As of May 2002, the GenBank databases are more than 10,000 Mb in size, and the largest biological databank is still growing at a phenomenal rate (see Figure 3-1).

Figure 3-1: The growth rate of GenBank [30].

PAGE 34

Because of the enormous size of the databases and their frequent updates, it is not convenient for each user to store a GenBank BLAST database locally on his or her own servers. Therefore, most BLAST users still choose the Web-accessed BLAST program to do their BLAST searches. Unfortunately, searching such a large database for homologous sequences is usually time consuming. Moreover, users can input only one sequence (nucleotide or protein) at a time manually, which delays the entire process in the submit-and-wait cycle (see Chapter 2). The situation becomes worse if the search is done during peak hours, when there is heavy traffic either on the Internet or at the NCBI BLAST server. Whatever the case, long waiting times for submitted BLAST searches have been a nightmare for most BLAST users: they have to wait several minutes for the first result to come back before they can submit the second one.

Most BLAST-related programs also either run on local servers, which require a password from the administrator (and so are not available to users outside the community), or require users to enter an e-mail address because the result will not be ready in a short period of time. Because of these restrictions, users either have no access to the program or have to wait a long time for their results to come back by e-mail. A considerable disadvantage of sending batch BLAST results by e-mail is the lack of space and organization, since most BLAST search results are not small files for an e-mail system. Consider submitting 100 sequences to a batch BLAST program and then receiving 100 BLAST results in your mailbox over a period of time. The mailbox would be crowded, and the results would not be easy to find and view.

PAGE 35

3.1.2 Architecture of BAXQL_BLAST

To solve the problems mentioned above efficiently, we divide our program into two parts. The first part, Batch_BLAST, is discussed later in this chapter. It is a Java application that simultaneously sends multiple sequences (submitted by users) to the remote BLAST server at NCBI and returns the results in XML format (saved locally for users). The second part is a database query interface called Sql_BLAST, which parses the BLAST XML results (from Batch_BLAST) into small elements and saves them as tables in a relational database. This process is triggered by the Batch_BLAST program and runs continuously in an automated fashion. In other words, we conceptually divide the BAXQL_BLAST program into Batch_BLAST and Sql_BLAST, but the two parts interact with each other through the BLAST results in XML format, which is ideal for data exchange [31]. We combine the key words (BATCH, XML, SQL) of the three major parts of the project to name our program BAXQL_BLAST.

PAGE 36

Figure 3-2: Architecture of BAXQL_BLAST. (Diagram labels: Sequence File, Batch_BLAST, BLAST requests, NCBI BLAST server, BLAST DB, BLAST results, BLAST XML results, Sql_BLAST, XML parser, relational table for BLAST results, Oracle DB, SQL, JSP.)

Figure 3-2 gives a more concrete view of the architecture of BAXQL_BLAST and shows the directions of data flow with arrows. Starting from the left-hand corner of the figure, BAXQL_BLAST users submit sequence files containing more than one nucleotide or protein sequence to the Batch_BLAST program. The program begins by sending a request for each sequence to the BLAST server at the remote site. Once the BLAST server receives a request, it performs the BLAST homology sequence search by aligning the target sequence against the BLAST databanks. The longer the target sequence, the longer this alignment process takes [32]. Once a result is available, the BLAST server sends it back in XML format, as we requested. These BLAST results contain information on the matching sequences (called hit sequences), which are sorted by a statistical score (called the E-value). These files will be

PAGE 37

saved to the local file system as they are received. No parsing or editing is done to these XML results at this point. As soon as the XML results have been saved to the local file system, the second part of the project, Sql_BLAST, is triggered to parse all the BLAST XML results. In the meantime, another process, which is not shown in the figure, retrieves profile data for the hit sequences from another data retrieval system called Entrez. The program parses these XML data into small elements and saves them as corresponding attributes in relational tables of a database system such as Oracle. The biological data from both the BLAST search and the Entrez search are available to users through a query interface, such as the Java Swing API or JavaServer Pages, that supports the JDBC API [33] and SQL (Structured Query Language) [34].

3.2 Why XML?

XML is one of the popular key words in the computer science vocabulary. In August 2000, two years after W3C’s XML 1.0 specification was released, NCBI BLAST began to support BLAST output in XML format, which was a fairly new language at that time [35]. This is an important message to biological databank users, because the leading biotechnology database search center not only provides XML as another data format but also encourages its users to consider XML as a standard biological data format for interchange, presentation, storage, and many other purposes.

PAGE 38

3.2.1 What Is XML?

It is not a bad guess to think that XML and HTML (Hypertext Markup Language) are related to each other. In fact, both XML and HTML belong to the same family of markup languages, SGML (Standard Generalized Markup Language) [36]. However, the major difference between them is that HTML uses a predefined, fixed set of tags to describe how the layout and content look in browsers, while XML uses user-defined tags to describe document structure and attributes. Separating content construction from layout representation benefits the reusability and flexibility of this semi-structured, self-describing markup language. Users can easily change the way they look at a document without modifying the document itself. By taking advantage of platform-independent, reusable, Web-centric XML, a company or a hospital can easily modify its online order menu or patient record query page without modifying the content of the product profiles or patient records.

Although HTML is easier to build and publish on the Web at the beginning, XML saves time and space in the long run from the document management point of view. We are not saying that we should replace HTML with XML or that XML is better than HTML. They both do their jobs on different levels, and perhaps sometime in the future HTML will migrate into a new language that combines the bases of XML and HTML. It is difficult to predict when that will happen, but there is good reason to think it is possible.

According to the article [37] by JDM Systems Consultants, XML holds several advantages for Electronic Data Interchange (EDI):

XML can be mapped to object models

PAGE 39

XML can be mapped to database schemas
XML document types are extensible
XML has a robust, nonbinary format
XML is independent from transport layer
XML has wide acceptance
XML has good tool support

Given all the advantages that XML provides to the field of electronic data interchange, it is not difficult to predict that XML can also provide these advantages to biotechnology research related to the NCBI databanks, which exchange large amounts of biological data on the Web on a daily basis.

3.2.2 XML Versus ASN.1

Before XML was introduced at NCBI as one of the file formats for representing biological and genomic data or documents, Abstract Syntax Notation One (ASN.1) [38] was used to describe the structure of data and to achieve interoperability between platforms. ASN.1 is an International Organization for Standardization (ISO) standard data representation format developed to support data interchange between applications. The ASN.1 format has been used for storing, retrieving, and passing data such as nucleotide and protein sequences, genome data, 3-D structures, and MEDLINE records.

Although both ASN.1 and XML are used by NCBI for data definition and transmission, the purpose and philosophy behind the two technologies are different. ASN.1 is designed to describe the structure of data and not its content, and it is widely used in communication protocols, such as Z39.50 and LDAP

PAGE 40

[39]. XML is a text-oriented language and is more flexible and more readable for humans than ASN.1. ASN.1 and XML are both machine and application independent, and ASN.1 has a downstreaming process that avoids ambiguity when information is transmitted between systems [40]. XML uses XSL and XSLT to remove the ambiguity problem as well. ASN.1 is an excellent candidate for ontology-exchange languages, but the significant advantages of XML are its flexibility, reusability, and suitability for data storage, data exchange, and the Internet. Significant hardware and software support has put XML in front of all the other data structure languages in biological data exchange systems (see Table 3-1).

Table 3-1: Comparison of ASN.1 and XML

Property | ASN.1 | XML
Goal | Data structure definition and transmission between heterogeneous parties | Representation of information
Form | Binary | Text
Readability | Low | High
Downstreaming | Yes | By XSL or XSLT

3.2.3 Why XML Benefits Our Batch BLAST Project?

Since we are retrieving highly structured data (BLAST XML results) located at remote databases (NCBI databanks) through the Internet, we would like to find a data

PAGE 41

format that is compact, structured, able to specify the semantics of the data, and ideal for interchange on the Internet. We found that XML is the perfect fit for this kind of task.

Figure 3-3: Example of XML BLAST results.

Imagine that the Internet is as crowded as a Boeing 747 airplane with 400 people on board. You, as a passenger (data), need to transfer between airports (programs) or to destinations (databases), and you would transfer faster if you carried the least amount possible without being naked (empty data). You do not need to look good while you are traveling on the airplane. Putting your outfits (presentation style) in the luggage (XSLT) and separating them from you (content) seems to be the most efficient way to travel (transfer) on the airplane (Internet).
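The separation of content from presentation described above can be illustrated with a small sketch. It is not taken from BAXQL_BLAST: the element names (hits, hit, id, evalue), the class name StylesheetSketch, and both stylesheets are hypothetical and much simpler than real BLAST XML. It only shows, under those assumptions, how the same XML content can be rendered in two different ways by swapping the stylesheet, using the standard javax.xml.transform API.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class StylesheetSketch {

    private static final String XSL_OPEN =
            "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>";

    /** Hypothetical, simplified content; real BLAST XML has many more elements. */
    public static final String DATA =
            "<hits><hit><id>A1</id><evalue>1e-20</evalue></hit>"
            + "<hit><id>B2</id><evalue>3e-05</evalue></hit></hits>";

    /** First view: one plain-text line per hit. */
    public static final String AS_TEXT = XSL_OPEN
            + "<xsl:output method='text'/>"
            + "<xsl:template match='/hits'><xsl:for-each select='hit'>"
            + "<xsl:value-of select='id'/><xsl:text> </xsl:text>"
            + "<xsl:value-of select='evalue'/><xsl:text>&#10;</xsl:text>"
            + "</xsl:for-each></xsl:template></xsl:stylesheet>";

    /** Second view: a list that shows only the hit identifiers. */
    public static final String AS_LIST = XSL_OPEN
            + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
            + "<xsl:template match='/hits'><ul><xsl:for-each select='hit'>"
            + "<li><xsl:value-of select='id'/></li>"
            + "</xsl:for-each></ul></xsl:template></xsl:stylesheet>";

    /** Applies the given stylesheet to the given XML and returns the rendered text. */
    public static String render(String xml, String xslt) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }

    public static void main(String[] args) {
        // The content (DATA) is never edited; only the stylesheet changes.
        System.out.println(render(DATA, AS_TEXT));
        System.out.println(render(DATA, AS_LIST));
    }
}
```

In this sketch the data string plays the role of the passenger and the two stylesheets play the role of the luggage: a new view needs only a new stylesheet.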

PAGE 42

We thank NCBI for providing XML as one of the BLAST output formats starting in August 2000 [41]. We take this as an encouragement to make XML the default data format for our BAXQL_BLAST program. Since BAXQL_BLAST is a computational tool that enhances the NCBI BLAST method by retrieving the BLAST results more efficiently, we would like to make sure that XML meets the needs of our project. As listed in Table 3-2, XML provides several advantages that match the needs of BAXQL_BLAST. XML is semi-structured, portable, self-describing, and platform independent; therefore, it is a good language to use in BAXQL_BLAST.

PAGE 43

Table 3-2: The Functional Match Between XML and Our BAXQL_BLAST Program

No. | BAXQL_BLAST requires data | XML provides
1 | Easy to organize | Structure (hierarchical)
2 | Transferred from server to client | Portability
3 | Does not take extra space to store | Compresses
4 | Provides multiple views to users from single data | Multiple views by using XML stylesheet (Stylibility)
5 | Representing the semantics of the data itself | Self-describing
6 | Has no hardware or software barriers | Platform independent
7 | Easy for user to understand and edit | Human readable

3.3 Implementation of Batch_BLAST

The Batch_BLAST program is a Java application that supports multi-sequence BLAST similarity searches in an automated fashion. Batch_BLAST is not the only program that performs this kind of task. However, by taking advantage of the secure, reliable, and powerful Java programming language, we are able to perform efficient and effective real-time monitoring of the entire automatic BLAST search, which has not been done in similar programs and algorithms. The reason we can perform complex and expensive real-time monitoring and also show every single process in the Java Swing table is the combination of Java

PAGE 44

Multithreading [42] and Java Lightweight Swing Components [43]. This combination is key to the success of our Batch_BLAST application, and it is more difficult to obtain the same function in other languages such as Perl, which has been chosen for many bioinformatics tools. Perl, along with a Web browser, can give users only frame-by-frame (not real-time) status. Since all the work has to be handled completely by a Web server, the overhead is too high to provide up-to-date status to users.

3.3.1 Java Multithreading

The concept of multithreading is an important part of the Java programming language, especially for concurrent programming. The main characteristic of Java multithreading is the ability to perform more than one task at the same time. This kind of multitasking not only speeds up the process but also extends the reusability of the program [44]. In Batch_BLAST, we have several tasks to perform on each target sequence submitted by users: sending the request to the NCBI BLAST server, tracking the availability of BLAST results, and storing the result files in the local file system. Since there is no resource sharing or dependency among these threads, synchronization is not a problem for us. Running these threads (tasks) concurrently allows our users to monitor the status of each BLAST search in real time. Instead of writing 100 pieces of code for 100 sequences, we write only one piece of code and make it a thread that we can apply to all the sequences simultaneously, as many times as we want. This is the advantage of Java multithreading, and it helps our program deliver fast, synchronized results to our Batch_BLAST users.
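The idea of writing one task and applying it to every sequence can be sketched as follows. This is an illustration only, not the Batch_BLAST source: the class name BatchSearchSketch, the status strings, the pool size of four, and the result file names are hypothetical, the network call is replaced by a short sleep, and the sketch uses the java.util.concurrent executor API (a later addition to Java) in place of hand-managed Thread objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BatchSearchSketch {

    /** Latest status per query; a status table could read from this map. */
    public static final Map<String, String> STATUS = new ConcurrentHashMap<>();

    /** Placeholder for the real work: send the request, wait for the result, save the file. */
    static String runSearch(String queryId) throws InterruptedException {
        STATUS.put(queryId, "SUBMITTED");
        Thread.sleep(50); // stands in for waiting on the remote server
        STATUS.put(queryId, "DONE");
        return queryId + ".xml"; // hypothetical name of the saved result file
    }

    /** Runs the same task once per query on a small thread pool and collects the file names. */
    public static List<String> runAll(List<String> queryIds) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (String id : queryIds) {
                Callable<String> task = () -> runSearch(id);
                pending.add(pool.submit(task));
            }
            List<String> saved = new ArrayList<>();
            for (Future<String> result : pending) {
                saved.add(result.get()); // blocks until that query has finished
            }
            return saved;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("batch did not complete", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(runAll(Arrays.asList("seq1", "seq2", "seq3")));
    }
}
```

The only shared object in this sketch is the status map, which is a concurrent map so that the worker threads need no explicit locking.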

PAGE 45

3.3.2 Java Swing Components

The Java Swing components are part of the Java Foundation Classes (JFC), a collection designed for building graphic user interfaces (GUI) [45]. Swing components are considered lightweight compared to the heavyweight Abstract Window Toolkit (AWT), which is designed for user interfaces as well [46]. Swing components are also more flexible than AWT components, because Swing uses a “Pluggable Look & Feel” (PLAF) module that makes it platform independent, whereas AWT is platform dependent [47]. As shown in Figure 3-4, Swing sits on top of many AWT components, including Java 2D, Drag and Drop, and the Accessibility component. Swing is not a replacement for AWT but a very handy set of components that enhances the usability and flexibility of graphic interfaces.

Figure 3-4: Architecture of Swing, AWT, and JFC [48].

Inside the Swing package, there are many user interface components (see Figure 3-5), including JButton, JCheckBox, JComboBox, JFrame, JList, JMenuBar, and JTable.

PAGE 46

We implement our graphic interfaces with many of the Swing components to provide easily accessed buttons and checkboxes to users. Adding listeners to these Swing components gives each component its functionality. In the next section, we show what these graphic user interfaces look like in the Batch_BLAST program and how they work together to perform multi-sequence BLAST searches via the Internet.

Figure 3-5: Swing component hierarchy [49].
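As a rough illustration of how a Swing table and its listeners can carry status information, the sketch below builds the data model behind a status table and reacts to changes through a listener. It is a sketch under assumptions, not the Batch_BLAST interface: the class name StatusTableSketch, the column names, and the status strings are hypothetical, and no window is created so that the example stays small.

```java
import javax.swing.table.DefaultTableModel;

public class StatusTableSketch {

    /** Builds a table model with one row per query and an initial status. */
    public static DefaultTableModel newModel(String... queryIds) {
        DefaultTableModel model =
                new DefaultTableModel(new Object[] {"Query", "Status", "Result file"}, 0);
        for (String id : queryIds) {
            model.addRow(new Object[] {id, "WAITING", ""});
        }
        return model;
    }

    /** Records a new status and result file for one row of the model. */
    public static void update(DefaultTableModel model, int row, String status, String file) {
        model.setValueAt(status, row, 1);
        model.setValueAt(file, row, 2);
    }

    public static void main(String[] args) {
        DefaultTableModel model = newModel("seq1", "seq2");
        // A JTable attached to this model registers a similar listener to repaint changed cells.
        model.addTableModelListener(e ->
                System.out.println("row " + e.getFirstRow() + " changed"));
        update(model, 0, "DONE", "seq1.xml");
    }
}
```

When such a model is displayed in a JTable and updated from worker threads, the usual Swing practice is to pass each update to the event dispatch thread with SwingUtilities.invokeLater.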

PAGE 47

3.3.3 How Batch_BLAST Works?

The best way to show how the Batch_BLAST program works is to see the interface in a picture instead of imagining it. In Chapter 2, we showed the three steps and interfaces of running BLAST searches on the NCBI Web site. We now show the corresponding three steps in our program and how it solves the problems of the NCBI BLAST program.

First, the biggest benefit of the Batch_BLAST program is its multi-sequence support for BLAST searches. The main goal of this program is to run BLAST searches on more than one nucleotide or protein sequence at the same time. Figure 3-6 shows the sequence input page, where users can either copy and paste the target sequences into the text area or upload a sequence file by clicking the “open file” icon shown in the figure. Users can easily modify, add, delete, and save sequences to existing files or new files.

Figure 3-6: Batch_BLAST: user input page.
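Before a batch can be submitted, the text pasted or uploaded on the input page has to be split into individual sequences. The sketch below shows one way to do this, assuming the input is in FASTA format (a header line starting with ">" followed by one or more sequence lines). The thesis text does not state the input format, so both that assumption and the class name FastaSplitter are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class FastaSplitter {

    /** One query: the header line (without '>') and the concatenated residues. */
    public static final class Record {
        public final String header;
        public final String sequence;

        Record(String header, String sequence) {
            this.header = header;
            this.sequence = sequence;
        }
    }

    /** Splits multi-sequence FASTA text into one Record per '>' header line. */
    public static List<Record> split(String text) {
        List<Record> records = new ArrayList<>();
        String header = null;
        StringBuilder residues = new StringBuilder();
        for (String raw : text.split("\\r?\\n")) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // ignore blank lines between records
            }
            if (line.startsWith(">")) {
                if (header != null) {
                    records.add(new Record(header, residues.toString()));
                }
                header = line.substring(1).trim();
                residues.setLength(0);
            } else if (header != null) {
                residues.append(line); // a sequence may wrap over several lines
            }
            // Sequence lines that appear before any header are skipped.
        }
        if (header != null) {
            records.add(new Record(header, residues.toString()));
        }
        return records;
    }

    public static void main(String[] args) {
        String input = ">seq1 example\nACGTAC\nGGT\n\n>seq2\nTTGACA\n";
        for (Record r : split(input)) {
            System.out.println(r.header + " -> " + r.sequence.length() + " residues");
        }
    }
}
```

Each Record returned by split could then be handed to a separate search task.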

PAGE 48

Second, the parameter setting page is next to the sequence input panel on the menu bar, as shown in Figure 3-7. It offers most of the major parameters supported by the NCBI BLAST program, including databases, number of results, result format, and so forth. Our example shows the parameter setting page for nucleotide sequences. After setting the parameters, users can submit the search, and the process begins automatically. The system then shows a result status page, which allows users to monitor the progress of the BLAST searches, and the program retrieves all the results to the local file system.

Figure 3-7: Batch_BLAST: parameter setting page.

As shown in Figure 3-8, while the BLAST search is in progress, the graphic interface shows the real-time status of each sequence that users submitted. The process is automated and efficient, which removes much of the overhead of submitting and tracking BLAST searches. Users can also keep the program

PAGE 49

running unattended, and once the processes are completed, the path to the saved BLAST results is shown in the table.

Figure 3-8: Batch_BLAST: result status page.

PAGE 50

CHAPTER 4 SQL_BLAST

In Chapter 3, we explained the motivation for and implementation of the first part of the project, Batch_BLAST, a Java application that supports multi-sequence BLAST homology searches in an automated fashion. In this chapter, we focus on the second part of the project, Sql_BLAST, which stores and retrieves the Batch_BLAST results that the Java application has received and saved locally.

4.1 Not Enough XML

In Chapter 3, we discussed the benefits of transferring biological BLAST results in XML format in our Batch_BLAST program. In fact, because of its compact and structured format, many database systems have developed ways to store and support XML data and documents. In May 1996, XML became part of a project at W3C to be the standard representation format on the Internet, and ever since, XML has been used widely for data representation and transfer [50]. The first version of the recommendation for XML, the W3C Recommendation: Extensible Markup Language (1.0), was released in February 1998 [51].

The biggest advantage of XML is its separation of content from presentation. Unlike HTML, XML does not need to carry extra tags to show how the content appears; therefore, it is easier and less expensive (in processing time and space) for database systems to store. Also, XML is self-describing, machine-readable, and platform independent. It does not require schemas or definitions to be set.

PAGE 51

In recent years, because of the rapid growth of bioinformatics and genomic data, finding a good way to store and query large amounts of data has become more important and more challenging for bioinformatics programmers. Fortunately, the rapid growth of XML-related projects and developments points out the great advantages of combining XML with databases, and the following discussion demonstrates why XML data or documents fit well in database systems.

4.1.1 XML Is Easier to Parse

If we look at the BLAST XML results (see Figure 3-3), we can see that they are very easy to read and that each tag represents the meaning of its content very well. That makes the job of parsing XML into relational attributes and tables much easier for relational database system users. In traditional flat files (HTML or text), we are not able to distinguish the content very well. The entire file contains a big block of informational text, and it is not very organized. For example, the BLAST HTML files, as shown in Figure 4-1, contain only a fixed set of HTML formatting tags. Unlike XML tags, these tags provide very little information about the content, and it is very difficult to design a parser for them. An HTML file follows the principle of “what you see is what you get” and is mainly meant for presenting a fixed data file. It is not easy to integrate related HTML files, and it is hard to parse them into database systems. Without a doubt, HTML’s simplicity and user-friendliness have earned it an excellent reputation on the World Wide Web; however, it offers no advantage in the areas of data storage and data integration.

PAGE 52

Figure 4-1: BLAST HTML results.

Besides HTML, simple text is not a good choice for parsing either. The BLAST text results (see Figure 4-2) show that the text file is simple enough to read and edit, but it is not simple to parse. In our Sql_BLAST program, a major reason for making sure our BLAST results are in a format that is easy to parse is that we want to extract important key words or values from the sequence results in order to index or group related data for further analysis or query.
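To make the contrast with HTML and plain text concrete, the following sketch extracts two values per hit from a small XML fragment with the standard DOM parser. It is only an illustration and not the Sql_BLAST parser: the class name HitParserSketch is hypothetical, the element names (Hit, Hit_id, Hsp_evalue) are modeled on NCBI's BLAST XML output and should be checked against an actual result file, and the identifiers in the sample are made up.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class HitParserSketch {

    /** Returns one "id<TAB>evalue" string for each Hit element in the XML text. */
    public static List<String> parseHits(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList hits = doc.getElementsByTagName("Hit");
            List<String> rows = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                Element hit = (Element) hits.item(i);
                rows.add(text(hit, "Hit_id") + "\t" + text(hit, "Hsp_evalue"));
            }
            return rows;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse the XML text", e);
        }
    }

    /** Text of the first descendant with the given tag name, or "" if there is none. */
    private static String text(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        // Made-up fragment; a real result file wraps the hits in further elements.
        String xml = "<Iteration_hits>"
                + "<Hit><Hit_id>example|1</Hit_id><Hit_hsps><Hsp>"
                + "<Hsp_evalue>1e-30</Hsp_evalue></Hsp></Hit_hsps></Hit>"
                + "<Hit><Hit_id>example|2</Hit_id><Hit_hsps><Hsp>"
                + "<Hsp_evalue>0.002</Hsp_evalue></Hsp></Hit_hsps></Hit>"
                + "</Iteration_hits>";
        for (String row : parseHits(xml)) {
            System.out.println(row);
        }
    }
}
```

Because every value is reached by its tag name, the same few lines work for any result file that uses the same set of tags, which is much harder to achieve with HTML or free text.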

PAGE 53

Figure 4-2: BLAST text results.

Among all the formats of NCBI BLAST results, XML provides the best structure for parsing, and all BLAST XML results share the same set of XML tags, which makes the parsing more efficient and correct.

4.1.2 Well-supported XML

More importantly, XML is well supported by many well-known commercial database vendors, such as Oracle, Microsoft, and IBM (see Table 4-1). A large number of software packages and tools also focus on XML, which gives us important advantages in storing, transferring, and accessing our XML files in different systems. This gives our XML BLAST results great extensibility and portability.

PAGE 54

Table 4-1: XML-enabled databases [52]

Product | Developer | License | DB Type
Access 2002 | Microsoft | Commercial | Relational
DB2 XML Extender, DB2 Text Extender | IBM | Commercial | Relational
FileMaker | FileMaker | Commercial | FileMaker
FoxPro | Microsoft | Commercial | Relational
Informix | IBM | Commercial | Relational
Objectivity/DB | Objectivity | Commercial | Object-oriented
Oracle 8i, 9i | Oracle | Commercial | Relational
SQL Server 2000 | Microsoft | Commercial | Relational

4.1.3 XML and Biological Databases

Although combining XML with current technologies in bioinformatics is still at a developing stage, many developers and researchers agree that XML has a bright future in areas such as biological data integration and genomic sequencing and analysis. Because bioinformatics and XML are both fairly new technologies, many biological tools take XML into consideration when new systems are implemented. XML development therefore came at the perfect time to solve the data management problem in bioinformatics. Supporting XML is a popular and needed function in most bioinformatics databases, applications, and tools today, which makes data interchange and data integration more feasible than before. Here are some of the bioinformatics applications that embed XML in their systems [53]:

AnaML (Anatomical Markup Language): An XML-based language for anatomy.

PAGE 55

Bioinformatic Sequence Markup Language (BSML): A public domain XML application for bioinformatics data.
BioXML.org: A center for development of open-source biological DTDs.
CML: A Chemical Markup Language.
GAME DTD (Genome Annotation Markup Elements): A syntax for exchange of genomic annotation.
Gb2xml (Genbank to XML conversion tool): For converting between XML and Genetic Sequence Data Bank.
phyloML (Phylogenetic Markup Language): An XML application for working with phylogenetic trees.
RiboML (Ribonucleic Acid Markup Language): An XML application for ribosomal science.
GEML (Gene Expression Markup Language): A file format for storing DNA microarray and gene expression data.

The list of current standards and work on XML in bioinformatics could continue for many pages. Since we use XML as our primary data exchange format in the Batch_BLAST program, it is also advantageous for us to extend the flexibility and reusability of XML by using it as the primary data storage format for our relational database tables in the Sql_BLAST program. Here, we list some major advantages of storing XML files in databases, together with their benefits for the database systems.

a. Easy to parse – reduce complexity.
b. Easy to transfer – reduce transfer time.


c. Easy to integrate: reduces integration overhead.
d. Content only: reduces storage space.

XML plays a very important role as the data transfer format from the NCBI BLAST server to users' local machines. We can inherit these benefits by storing the BLAST XML results in our relational databases. Next, we explain why we chose relational databases to store our BLAST XML results and list the advantages.

4.2 Why Relational Databases?

There are many ways to store XML information, depending on what users need to do with the XML data or documents (see Figure 4-3). Since biological and genomic data are highly regular and structured, they are well suited to storage in a relational database system. Although it may be somewhat more expensive to parse and store them in relational tables than to use other kinds of databases, doing so scales better for queries, because XML files do not provide efficient, scalable facilities such as sequence indexing or clustering, which are important for biological and genomic data [54].


Figure 4-3: Storing XML data.

In the article Storing XML Data [55], the author organizes the ways of storing XML data or documents according to users' needs. Users can pick the most suitable data storage system for their architecture by answering the following questions (see Table 4-2):


Table 4-2: Choosing the Right Database for Your XML Data

Q1: Do you intend to run queries against the stored data?
Q2: Do you need to retrieve the XML in exactly the same form as it was stored?
Q3: Do you need to present your data in many different forms?

Storage Types            Q1      Q2   Q3
BLOB                     No      Yes  No
Tables in Relational DB  Yes     No   Yes
Object DB                Little  Yes  No
Native XML DB            Yes     Yes  No

There are several advantages to storing XML data in a relational database, which is a mature and stable database technology first described by Codd in an IBM research report in 1969 [56]. He later published a research paper titled "A Relational Model of Data for Large Shared Data Banks" in 1970 [57]. Since then, continuous development has earned relational database systems a strong reputation in commercial and noncommercial database systems and in research. Many built-in functions and tools make it much easier to organize and transfer data. Since most biological data are highly structured and related, they fit well into an E/R type of database system. Each XML tag or element can easily be mapped to an attribute in a relational table. The indexing data structure [58] also makes the system more efficient to access and query. This is important to our Sql_BLAST program because the main functionality we need is to search a large number of sequences by their similarity scores. An index on the sequence table makes such searches more time-efficient.
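The benefit of indexing by similarity score can be illustrated with a minimal Java sketch. It is not part of Sql_BLAST: a sorted in-memory map merely stands in for the ordered index that a relational database would maintain on a score column, and the class name, method names, and sample hit identifiers are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Illustrative only: a sorted map standing in for a database index on a score column. */
final class ScoreIndexSketch {
    // Key: bit score; value: identifiers of all hits with that score.
    private final NavigableMap<Double, List<String>> byScore = new TreeMap<>();

    void add(String hitId, double bitScore) {
        byScore.computeIfAbsent(bitScore, k -> new ArrayList<>()).add(hitId);
    }

    /** Returns the hits whose score is at least minScore, lowest score first. */
    List<String> atLeast(double minScore) {
        List<String> result = new ArrayList<>();
        // tailMap jumps to the first qualifying key instead of scanning every entry.
        for (List<String> ids : byScore.tailMap(minScore, true).values()) {
            result.addAll(ids);
        }
        return result;
    }

    public static void main(String[] args) {
        ScoreIndexSketch index = new ScoreIndexSketch();
        index.add("hitA", 52.0);
        index.add("hitB", 310.5);
        index.add("hitC", 128.0);
        System.out.println(index.atLeast(100.0)); // prints [hitC, hitB]
    }
}
```

Without such an ordered structure, every stored hit would have to be examined for each query, which is the cost an index on the sequence table is meant to avoid.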


Besides, the relational database is still the most popular and most widely supported kind of database system today. Many applications do not understand XML but work well with relational databases. This is important because we want to make certain that our data can be transferred into and out of other database systems without a problem.

4.3 Architecture and Implementation

4.3.1 Motivation of Sql_BLAST

In Batch_BLAST, we save all the BLAST results in XML format to the local file system, and users can open each file to view, edit, and transfer it. However, a simple file system is not the best choice for managing these XML results; when a large number of sequence results accumulate, it becomes very difficult for users to find sequences they submitted just a few days earlier. To solve this problem, we designed and created the Sql_BLAST program to help users find the specific sequence or data they are looking for. Earlier in this chapter, we pointed out several advantages of storing XML data as tables in relational databases; combining JDBC and SQL with a relational database system allows users to search their results easily and quickly.


Figure 4-4: Architecture of Sql_BLAST.

We now review the steps involved in our Sql_BLAST program, as shown in Figure 4-4:

1. Upload files from either the local file system or a JSP page to the Java XML parser in the Sql_BLAST program.
2. Parse the BLAST XML results into relational tables, such as QuerySequence and HitSequence.
3. Store the parsed BLAST data in the tables of the relational database by JDBC.
4. Open a connection to the NCBI server and retrieve data by Entrez search.


5. Submit an SQL query from the Java-based SQL interface or from a Web-based JSP page.

Since the Batch_BLAST program supports an automated BLAST search on more than one nucleotide or protein sequence, the results of a BLAST batch search are returned to the local file system in the XML format created by the NCBI BLAST server. At run time, Batch_BLAST and Sql_BLAST work together in a continuous fashion. As soon as the XML results from Batch_BLAST are saved to the local file system, Sql_BLAST starts its processing automatically, and the synchronization of the two programs improves the performance of the BLAST searches. While the Batch_BLAST program is waiting for unfinished searches to come back, Sql_BLAST works in parallel to optimize the entire process of the BAXQL_BLAST program. Once Batch_BLAST has finished all its searches, all the BLAST results are saved in the local file system in XML format and in relational database tables as attributes. While users monitor their Batch_BLAST process, the Batch_BLAST results are saved in the local database system, where users can reach them by using JDBC and SQL. Also, to enhance the usability of the Sql_BLAST program, we implement a Web-based JSP interface that lets users at remote sites upload BLAST XML results. Having viewed the big picture of the Sql_BLAST program, we now discuss in more detail some key components and the roles they play in our implementation.
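The hand-off between the two programs can be sketched as a producer and a consumer sharing a queue. The following Java sketch is only an illustration under stated assumptions: one thread stands in for Batch_BLAST announcing saved result files, another stands in for Sql_BLAST picking them up, and the class name, sentinel value, and file names are hypothetical rather than taken from the actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Illustrative only: two threads handing off finished results through a queue. */
final class PipelineSketch {
    // Sentinel telling the consumer that no more results will arrive.
    private static final String DONE = "<done>";

    /** Runs a producer and a consumer concurrently; returns the names the consumer processed. */
    static List<String> run(List<String> resultFiles) {
        BlockingQueue<String> saved = new LinkedBlockingQueue<>();
        List<String> loaded = Collections.synchronizedList(new ArrayList<String>());

        // Stand-in for Batch_BLAST: announces each result file as soon as it is "saved".
        Thread producer = new Thread(() -> {
            for (String file : resultFiles) {
                saved.add(file);
            }
            saved.add(DONE);
        });

        // Stand-in for Sql_BLAST: handles each file while later searches may still be pending.
        Thread consumer = new Thread(() -> {
            try {
                for (String file = saved.take(); !file.equals(DONE); file = saved.take()) {
                    loaded.add(file); // a real loader would parse and store the file here
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        consumer.start();
        try {
            producer.join();
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for the pipeline", e);
        }
        return new ArrayList<>(loaded);
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("seq1.xml", "seq2.xml", "seq3.xml")));
    }
}
```

With a single producer and a single consumer, the queue preserves the order in which results were saved, so the loader never waits for the whole batch before it starts working.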


4.3.2 BLAST XML Parser and Relational Tables

The Java XML parser is a small but very important part of the Sql_BLAST program. It is designed to read an NCBI BLAST result in XML format and scan through the entire file to map each tag to the corresponding attribute in the relational tables. Since the BLAST result is highly structured and regular, parsing the files is much easier. We divide the BLAST result into four major parts and create four arrays to hold the data and elements so that they can be saved into the relational tables efficiently by JDBC. We also create a similar parser for the Entrez search, together with a table called Entrez in our relational database that provides a more detailed profile of the query sequences and hit sequences. The schema shown in Figure 4-5 contains four major tables from the BLAST query result and one table from the NCBI Entrez retrieval system, which provides biological information about the sequences in both the QuerySequence table and the HitSequence table. This information integration and interchange gives better query results to Sql_BLAST users, who can perform structured queries on the BLAST results returned by the Batch_BLAST program.
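As a rough illustration of the tag-to-attribute mapping described above, the following Java sketch reads the Hit elements of a BLAST XML report with the standard DOM API and prepares one row per hit. It covers only one of the four parts, and several details are assumptions rather than the actual Sql_BLAST design: the class and method names are hypothetical, the column names of HitSequence are invented for the example, and the element names (Hit, Hit_accession, Hit_def, Hit_len) follow the NCBI BLAST XML output as we understand it.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Illustrative only: maps Hit elements of a BLAST XML report to rows of a hit table. */
final class HitParserSketch {

    /** Returns one row per Hit element: {accession, definition, length}. */
    static List<String[]> parseHits(String blastXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(blastXml.getBytes(StandardCharsets.UTF_8)));
            NodeList hits = doc.getElementsByTagName("Hit");
            List<String[]> rows = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                Element hit = (Element) hits.item(i);
                rows.add(new String[] {
                    text(hit, "Hit_accession"), text(hit, "Hit_def"), text(hit, "Hit_len")
                });
            }
            return rows;
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a readable BLAST XML document", e);
        }
    }

    // Text of the first descendant with the given tag, or null if the tag is absent.
    private static String text(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent();
    }

    /** Stores the rows in one JDBC batch; the column names are assumptions. */
    static void store(Connection connection, List<String[]> rows) throws SQLException {
        String sql = "INSERT INTO HitSequence (accession, definition, length) VALUES (?, ?, ?)";
        try (PreparedStatement insert = connection.prepareStatement(sql)) {
            for (String[] row : rows) {
                insert.setString(1, row[0]);
                insert.setString(2, row[1]);
                insert.setInt(3, Integer.parseInt(row[2]));
                insert.addBatch();
            }
            insert.executeBatch();
        }
    }

    public static void main(String[] args) {
        String xml = "<BlastOutput><Hit><Hit_accession>X1</Hit_accession>"
                + "<Hit_def>sample hit</Hit_def><Hit_len>120</Hit_len></Hit></BlastOutput>";
        for (String[] row : parseHits(xml)) {
            System.out.println(String.join(" | ", row)); // prints X1 | sample hit | 120
        }
    }
}
```

The store method needs an open JDBC connection and a matching table, so only parseHits is exercised in main. Real reports may also carry a DOCTYPE declaration that the parser tries to resolve, which a production parser would have to handle.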


Figure 4-5: Sql_BLAST relational data tables and schema.

4.3.3 Entrez

Entrez [59] is also maintained by the National Center for Biotechnology Information (NCBI). It is a retrieval system that can search several linked databases, such as GenBank, PubMed, 3-D structures, and so forth (see Figure 4-6) [60].


Figure 4-6: Databases involved in Entrez.

We add the results of the Entrez search because they give users more information about the nucleotide or protein sequences they submitted to the Batch_BLAST program. Entrez provides the source, organism, author, and many other important biological details about the sequences (see Figure 4-7). Adding this information to our relational tables makes queries more specific than is possible with regular BLAST results, which provide little or no information about the source or organism of the query and result (hit) sequences. This small amount of information can make a big difference in the search for homologous sequences and can reduce many false-positive results by narrowing the query to a specific source or organism. We use a strategy similar to that of the BLAST XML parser to parse the Entrez XML results, and the process is performed right after the sequences (both target and result) have been saved to the databases (see Figure 4-4). The project is currently extending the biological warehouse (databases) by integrating more biological data into our existing relational tables, as shown in Figure 4-5. The Entrez-related databases shown in Figure 4-6 are very good candidates for our extension project, and we are also looking at other biological and genetic databanks


which provide useful information to assist our BAXQL_BLAST project and enhance the homology sequence search.

Figure 4-7: Example of Entrez results.

4.3.4 The Java Database Connectivity (JDBC): A Bridge to Databases

The first version of JDBC was released by Sun in the summer of 1996. JDBC is a reliable bridge that links a Java application with databases. Programmers can query and update their databases by using the Structured Query Language (SQL), and the biggest advantage of the Java programming language and JDBC is platform and vendor independence [61]. With JDBC, our BAXQL_BLAST program can communicate with any database system for which a JDBC driver is available, and it is always an advantage to be able to work with several different persistence layers (databases) from different vendors (platforms). In the Sql_BLAST program, we solve the deployment issues by combining JDBC and SQL on top of our relational databases. Since JDBC provides a platform-independent and secure


connection to different database systems, it is advantageous to have JDBC in our implementation along with the core Java applications.

4.3.5 JavaServer Pages (JSP): A Web Front Desk to Databases

A JSP page is a program that runs on a Web server and performs the following tasks [62]:

1. Reads any data sent by the user.
2. Looks up any other information about the request that is embedded in the HTTP request.
3. Generates the results.
4. Formats the results inside a document.
5. Sets the appropriate HTTP response parameters.
6. Sends the document back to the client.

JSP technology is an extension of the Java Servlet technology. It is designed to generate dynamic Web content by using XML-like tags and scriptlets written in the Java programming language (see Figure 4-8) [63].

    JSP Example
    Date and Time
    <%
      java.util.Date today = new java.util.Date();
      out.println("Today's date is: " + today);
    %>

Figure 4-8: Example of JSP.


JSP has several advantages that appeal to programmers.

Portability: JSP is written in Java and is portable to many operating systems and Web servers. Active Server Pages from Microsoft work only on Windows.
Security: JSP is more secure than Common Gateway Interface (CGI) programs.
Efficiency: JSP runs on lightweight Java threads instead of heavyweight operating system processes.
Scalability: JSP has access to all the standard J2EE services, including the JDBC API.

These are some of the major reasons we chose to implement our dynamic Web front end by using JSP. Since the main purpose of our JSP interface is to perform SQL on our databases, we need to do the following tasks [64]:

Load the appropriate JDBC driver classes.

    <% Class.forName("JDBC.Driver"); %>

Import the SQL classes and open a connection (the connection arguments are not shown).

    <%@ page import="java.sql.*" %>
    <% Connection connection = DriverManager.getConnection(); %>

Execute a query and display the results.

    <%
      Statement statement = connection.createStatement();
      ResultSet resultSet = statement.executeQuery("select ");
    %>
    The results are:
    <% while (resultSet.next()) { %>
      • <%= resultSet.getString(1) %>
    <% } %>



Also, XML works well together with JSP, and there are many techniques, such as JavaBeans, XMLEntryList, DOM, and XPath, for presenting XML data in JSP pages [65]. JSP allows our program to present the dynamic biological results in XML format.


CHAPTER 5
SUMMARY AND FUTURE WORK

In this chapter, we summarize the key contributions of the BAXQL_BLAST program to research in bioinformatics and computer science. We then point out some areas in which we can improve our program in the near future. The main functionality of the BAXQL_BLAST program is to enhance the NCBI BLAST homology search by searching multiple nucleotide or protein sequences simultaneously and to support queries in the Structured Query Language (SQL) on top of our BLAST XML warehouse, which is stored in relational tables. It is more important, however, to lead our audience to a bigger picture of bioinformatics research: biological data integration [66, 67, 68, 69]. In Section 5.3, we further address this popular and exciting research topic and discuss what we can do to upgrade our program into a better biological data integration tool.

5.1 What Do We Achieve?

There are three major achievements, one for each part of the project and one for the project as a whole (see Table 5-1). BAXQL_BLAST is one of the existing computational tools in bioinformatics that successfully applies several recent technologies together, including the Java programming language, the XML data exchange and representation format, relational database systems with JDBC and SQL support, and JSP, which provides a Web front end to remote users.


Table 5-1: Achievements of BAXQL_BLAST

Batch_BLAST (Ch 3)
  Achievements: Users receive real-time monitoring of BLAST sequence homology searches on multiple sequences at the same time.
  Before: Users have to submit their sequences one at a time to the NCBI BLAST server. No real-time monitoring is available (delayed results).

Sql_BLAST (Ch 4)
  Achievements: A small knowledge and data warehouse that supports SQL on BLAST XML results to help users narrow down their target sequences.
  Before: The BLAST results are gone after 24 hours. No database support. No SQL support. No easy access to data.

BAXQL_BLAST (Ch 3 + Ch 4)
  Achievements: Bioinformatics data integration of heterogeneous XML data from a single source.
  Before: There is no relation or integration between BLAST and Entrez data from NCBI.

Not only do our programs combine up-to-date technologies to manage biological XML data in a real-world Java application, they also provide many useful functions that were not previously available to users. First, Batch_BLAST uses Java-based presentation layers along with Java multithreading and the lightweight Java Swing API, which are richer and more responsive than Web-based presentation layers. Batch_BLAST gives its users real-time monitoring of the entire set of BLAST sequence searches instead of delayed results a few hours later. Users can also cancel or save their current process and continue the search at a later time. This user-friendly function is not provided by similar search tools. Most tools support only a one-time submission, and after users submit their searches, they have no control over the process, which is neither a reliable nor an efficient way to run a BLAST batch search.


Second, most regular BLAST users face the problem that all BLAST search results from NCBI are gone after 24 hours unless they are saved to the local drive. Saving them is slow and troublesome because users have to rename the file of each BLAST search result to avoid name conflicts. Additionally, NCBI does not guarantee how long a BLAST search takes to finish; only an approximate waiting time is shown and updated in the browser at run time. To solve this problem effectively, our program not only saves the results to the local file system automatically but also parses the BLAST XML results into relational tables. Users no longer need to struggle with submitting searches and renaming BLAST result files. And since all the BLAST results have been saved into the relational database by the Sql_BLAST program, BAXQL_BLAST users can save much time in finding their target sequences. They can also apply restrictions or keywords to their queries to narrow down the sequences. This makes the data mining of BLAST results more efficient and effective: researchers do not need to search their file system to find their target sequences.

Finally, we integrate both BLAST and Entrez results (in XML format) into our system and save them as relational tables in our database. This allows us to query by joining tables between BLAST and Entrez, and combining the knowledge from both databases makes the BLAST search results more useful and accurate. To our knowledge, this functionality has not been implemented by any other biotechnology tool, and we are still working on collecting more useful knowledge to add to the existing intelligence warehouse for the BLAST homology sequence search [70].
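A joined, restricted query of the kind described above might be assembled as in the following Java sketch. It is a hedged illustration only: the column names (accession, evalue, organism) and the join key are assumptions, since the actual schema appears only in Figure 4-5, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: builds a parameterized join over hypothetical hit and Entrez columns. */
final class HitQuerySketch {
    final String sql;
    final List<Object> parameters;

    private HitQuerySketch(String sql, List<Object> parameters) {
        this.sql = sql;
        this.parameters = parameters;
    }

    /** Either restriction may be null, meaning "do not filter on this attribute". */
    static HitQuerySketch build(String organism, Double maxEvalue) {
        StringBuilder sql = new StringBuilder(
                "SELECT h.accession, h.evalue, e.organism"
                + " FROM HitSequence h JOIN Entrez e ON e.accession = h.accession");
        List<Object> parameters = new ArrayList<>();
        if (organism != null) {
            sql.append(parameters.isEmpty() ? " WHERE " : " AND ").append("e.organism = ?");
            parameters.add(organism);
        }
        if (maxEvalue != null) {
            sql.append(parameters.isEmpty() ? " WHERE " : " AND ").append("h.evalue <= ?");
            parameters.add(maxEvalue);
        }
        sql.append(" ORDER BY h.evalue");
        return new HitQuerySketch(sql.toString(), parameters);
    }

    public static void main(String[] args) {
        HitQuerySketch query = build("Homo sapiens", 1e-10);
        System.out.println(query.sql);
        System.out.println(query.parameters);
    }
}
```

The resulting text would be passed to Connection.prepareStatement and each parameter bound in order with PreparedStatement.setObject, so that user-supplied keywords are never concatenated into the SQL itself.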


5.2 Where Do We Improve?

It is always a challenge to bring new technologies to a system that works on extremely dynamic data and information on the Web. Even though we have achieved unique goals in the BAXQL_BLAST program, there are still many areas in which we can improve, and that motivates us to take our programs to the next level. In our programs, we integrate only biological XML files from different tools or databases (BLAST and Entrez) that come from a single source, NCBI. We can extend the program to collect XML data from other bioinformatics databases and achieve a stronger integration of the output results (see Figure 5-1). The more information we collect for our users, the more information they can search to find their target homologous sequences, and the lower the chance of false-positive information in our output. Since the current version of BAXQL_BLAST does not support many data resources, we do not have enough information to apply artificial intelligence techniques or machine learning algorithms to our data to support an "intelligent" data mining process. We simply rely on the Structured Query Language and join operations to combine the attributes and match the keywords that users enter for each attribute. However, we can certainly extend the SQL strategy to collect less ambiguous data for better biological data clustering.


Figure 5-1: Biological Data Integration by using BAXQL_BLAST.

5.3 Future Work: Bioinformatics Data Integration

Information or data integration has been a popular topic in the past few years. Researchers in different areas, such as business, biology, statistics, computer science, and medicine, are all looking for solutions to these exciting and challenging issues. In bioinformatics, interest in data integration has grown dramatically because of the Human Genome Project and the rapid growth of biological data. To provide complete and efficient access to this large amount of data (growing by terabytes per day) [71], a number of public and private biological databases are available on the Web, providing useful data management and querying tools to users all over the world.


Most of the integration tools are based on specific biotechnology tools with a single data model [72]. The intention and motivation are to develop computational tools that interconnect heterogeneous and distributed biological databases to enhance post-genome analysis and research. Our BAXQL_BLAST research follows this path and tries to integrate biological XML data not only from NCBI databases but also from other biological, medical, and genomic databases to support better decision-making and data mining programs for biological analysis. According to the conference report titled "Integrated Genome Resources at NCBI" (Plant, Animal & Microbe Genomes X Conference, San Diego, California, January 12-16, 2002), NCBI is working on integrating various primary organism-specific resources [73]. This suggests that the BAXQL_BLAST project is headed in the right direction in building future computational tools for better biological database management systems. Additionally, the BAXQL_BLAST system integrates heterogeneous biological XML data and generates promising targets in homology sequence searching.


CHAPTER 6
CONCLUSIONS

As both the variety of data types and the data volume increase, it becomes more difficult and challenging to integrate heterogeneous data. In bioinformatics, there is a persistent shortage of database management systems and computational analysis tools. Among the existing database systems and analysis tools, the lack of integration in the data pool degrades the quality and quantity of their biological results. We are not the first to notice this problem, and many strategies and implementations have been designed and developed primarily to solve the data integration problem. Unfortunately, there is no simple solution to this problem, and many solutions are too costly for public users.

6.1 The XML Solution

While researchers were struggling to find solutions, the first version of the W3C Recommendation Extensible Markup Language (XML) 1.0 was released in February 1998 (http://www.w3.org/TR/REC-xml). Initially, its goal was to ease implementation on the Web and to enhance the functionality and interoperability of the Web. In less than five years, XML has shown its potential in almost every aspect of the computer information communities, including database management


systems. Because of the great potential and popularity of XML, biological databanks such as NCBI BLAST, PIR, and SWISS-PROT have started to put effort into providing biological data in XML. The current development of biological XML data promises a bright future for biological data integration. Our current version of the BAXQL_BLAST program has successfully integrated XML results from heterogeneous sources (the dynamic BLAST homology sequence search engine and the static Entrez query system) in relational tables that provide stable and scalable indexing of the biological data. Although this is a very small step toward a successful biological data integration and decision-making tool, we have struck a balance between new and old technologies in bioinformatics. It is also important not to be overwhelmed by the rapidly growing number of XML-related tools and technologies. The best strategy is not always the newest one; however, we should not be too conservative in adopting new technologies.

6.2 Biological Intelligence

After being exposed to the large amount of biological data and tools on the Internet, we sense a lack of intelligence in these systems, whether the process involved is data collection or decision-making. Since most biological or genomic data are related to life science, the data are more dynamic than regular knowledge data. For instance, simply changing the parameters of an NCBI BLAST query might produce different data. To react to dynamic data more efficiently, we need a more intelligent system to handle the expected enormous flow of biological data. Some technologies, such as artificial intelligence, machine learning, and ontology, might


be involved to some degree in this kind of biological intelligence system to provide more reliable results to users. In the meantime, as bioinformatics programmers, we focus on implementing biological tools that use as many "lightweight" strategies as possible, such as a lightweight data interchange format (XML), lightweight programming techniques (Java multithreading), and a lightweight interface API (Java Swing), to handle the "heavyweight" biological data on the Internet.


REFERENCES

[1] Gary Williams. "Nucleic Acid and Protein Sequence Databases." Genetics Databases. Ed. Martin J. Bishop. London UK: Ac, 1999. 15,31.
[2] NCBI. "GenBank Growth." Growth of GenBank. 12 Mar. 2002. 19 June 2002.
[3] Leonard F. Peruski, Jr., Anne Harwood Peruski. The Internet and the New Biology. Washington, D.C.: American Society for Microbiology, 1997.
[4] N. M. Luscombe, D. Greenbaum, and M. Gerstein. "What Is Bioinformatics? An Introduction and Overview." Yearbook of Medical Informatics 2001. New Haven, CT: Yale University, 2001.
[5] Pethuru Raj. "Homology Searching for Biological Sequences Databases." Homology Searching for Biological Sequences Databases. 6 June 2002.
[6] S. F. Altschul, W. Gish, W. Miller. "Basic Local Alignment Search Tool." Journal of Molecular Biology 215 (1990): 403-10.
[7] Peter M. Woollard. "Bioinformatics." Molecular Biology and Biotechnology: Fourth Edition. Ed. J.M. Walker and R. Rapley. Cambridge UK: The Royal Society of Chemistry. 406.
[8] D. Lipman and W. Pearson. "Rapid and Sensitive Protein Similarity Searches." Science 227 (Mar. 1985): 1435-41.
[9] T. F. Smith and M.S. Waterman. "Identification of Common Molecular Sequences." J. Mol. Biol. 147 (1981): 195-97.
[10] Laurent Duret, Guy Perriere and Manolo Gouy. "HOVERGEN: Comparative Analysis of Homologous Vertebrate Genes." Bioinformatics: Databases and Systems. Ed. Stanley I. Letovsky. Norwell, MA: Kluwer Academic Publishers, 1999. 21.
[11] Benson, D.A., Boguski. "GenBank." Nucleic Acids Res. 25 (1997): 1-6.
[12] NCBI. "Entrez Home." Entrez is a Retrieval System for Searching Several Linked Databases. 18 June 2002. 25 June 2002.


[13] S.F. Altschul, W. Miller, E.W. Myers, and D.J. Lipman. "Basic Local Alignment Search Tool." Journal of Molecular Biology 215 (1990): 403-10.
[14] Eugene Russo and Steve Bunk. "Hot Papers in Bioinformatics." The Scientist 13.8. 15. 12 Apr. 1999. 25 June 2002.
[15] NCBI. "BLAST Query Input Format." Search Format. 21 Jan. 2000. 25 June 2002.
[16] Keith Robison. "BLAST." BLAST. 25 June 2002.
[17] David P. Yee, Judith Bayard Cushing, Tim Hunkapiller. "Overview: A System for Tracking and Managing the Results from Sequence Comparison Programs." Pattern Discovery in Biomolecular Data: Tools, Techniques, and Applications. Ed. Bruce A. Shapiro, Jason T.L. Wang, Dennis Shasha. New York: Oxford, 1999. 165.
[18] Liqun Xing and Volker Brendel. "Multi-Query Sequence BLAST Output Examination with MuSeqBox." BIOINFORMATICS 17.8 (6 Apr. 2001): 744-45.
[19] Izabela Makalowska, Joseph F. Ryan and Andreas D. Baxevanis. "GeneMachine: Gene Prediction and Sequence Annotation." BIOINFORMATICS 17.9 (30 May 2001): 843-44.
[20] Matthew Boone and Chris Upton. "BLAST Search Updater: A Notification System for New Database Matches." BIOINFORMATICS 16.11 (15 June 2000): 1054-55.
[21] Kim C. Worley, Pamela A. Culpepper, Brent A. Wiese and Randall F. Smith. "BEAUTY-X: Enhanced BLAST Searches for DNA Queries." BIOINFORMATICS 14.10 (15 Oct. 1998): 890-91.
[22] S. Fischer, J. Crabtree, B. Brunk. "BioWidgets: Data Integration Components for Genomics." BIOINFORMATICS 15.10 (23 Apr. 1999): 837-46.
[23] Weizhong Li, Frederic Pio, Krzystof Pawtowski and Adam Godzik. "Saturated BLAST: An Automated Multiple Intermediate Sequence Search Used to Detect Distant Homology." BIOINFORMATICS 16.12 (2000): 1105-10.
[24] Erik S. Ferlanti, Joseph F. Ryan, Izabela Makalowska and Andreas D. Baxevanis. "WebBLAST 2.0: An Integrated Solution for Organizing and Analyzing Sequence Data." BIOINFORMATICS 15.5 (3 Feb. 1999): 422-23.
[25] Fiona S. L. Brinkman, Ivan Wan, Robert E. W. Hancock. "PhyloBLAST: Facilitating Phylogenetic Analysis of BLAST Results." BIOINFORMATICS 15.5 (31 Dec. 2001): 385-87.


[26] NCBI. "A Decade of Data at NCBI." NCBI News. Summer 1999. 25 June 2002.
[27] Joao Setubal and Joao Meidanis. Introduction to Computational Molecular Biology. Grove CA: Brook/Cole Publishing Company, 1997.
[28] NCBI. "A Decade of Data at NCBI." NCBI News. Summer 1999. 25 June 2002.
[29] NCBI. "NCBI GenBank Flat File Release 129.0." 15 Apr. 2002. 25 June 2002.
[30] NCBI. "GenBank Growth." Growth of GenBank. 12 Mar. 2002. 19 June 2002.
[31] Nicholas Chase. XML and Java from Scratch. Indianapolis, IN: QUE, 2001.
[32] Luke Alphey. DNA SEQUENCING: From Experimental Methods to Bioinformatics. New York: Springer-Verlag New York Inc., 1997.
[33] Sun Microsystems. "JavaServer Pages(TM) Technology." JavaServer Pages. 25 June 2002.
[34] M. M. Astrahan and D.D. Chamberlin. "Implementation of a Structured English Query Language." Communications of the ACM 18.10 (October 1975): 580-87.
[35] BLAST help desk. "Introduction to the Standalone WWW Blast Server." Introduction to the Standalone WWW Blast Server. 25 June 2002.
[36] C. F. Goldfarb. The SGML Handbook. Oxford, UK: Oxford University Press, 1990.
[37] JDM Systems Consultants. "XML for Data Exchange." XML for Data Exchange. 25 June 2002.
[38] M.T. Rose. The Open Book: A Practical Perspective on OSI. Englewood Cliffs, NJ: Prentice Hall, 1990.
[39] Takeshi IMAMURA and Hiroshi MARUYAMA. "Symposium on Applications and the Internet." IEEE (2001): 57-64.
[40] JoanMa Mas Ribes, Xavier Orri. "ASN.1 Vs XML." 15 Jan. 2002. 25 June 2002.
[41] BLAST. "Introduction to the Standalone WWW Blast Server." Introduction to the Standalone WWW Blast Server. 25 June 2002.


[42] Cay S. Horstmann, Gary Cornell. Core JAVA Volume II - Advanced Features. Palo Alto, CA: Sun Microsystems, Inc., 2000.
[43] David M. Geary. Graphic JAVA VOLUME II Swing: Mastering the JFC. Upper Saddle River, NJ: Prentice Hall PTR, 1999.
[44] Cay S. Horstmann, Gary Cornell. Core JAVA Volume II - Advanced Features. Palo Alto, CA: Sun Microsystems, Inc., 2000.
[45] David M. Geary. Graphic JAVA VOLUME II Swing: Mastering the JFC. Upper Saddle River, NJ: Prentice Hall PTR, 1999.
[46] David M. Geary. Graphic JAVA VOLUME II Swing: Mastering the JFC. Upper Saddle River, NJ: Prentice Hall PTR, 1999.
[47] IBM. "IBM Accessibility Center: Java Foundation Classes (JFC) - A Foundation for Accessibility." 3.0 Java Foundation Classes (JFC) - A Foundation for Accessibility. 25 June 2002.
[48] IBM. "IBM Accessibility Center: Java Foundation Classes (JFC) - A Foundation for Accessibility." 3.0 Java Foundation Classes (JFC) - A Foundation for Accessibility. 25 June 2002.
[49] Macmillan Computer Publishing. "Introducing Swing." The Swing Component Hierarchy. 25 June 2002.
[50] Eric Snow. "Fact Sheet: W3C Issues XML 1.0 as a Recommendation." The World Wide Web Consortium Issues XML 1.0 as a W3C Recommendation. 2002. 25 June 2002.
[51] W3C. "Extensible Markup Language (XML)." Timeline: Events and Publications. 2002. 25 June 2002.
[52] Ronald Bourret. "XML Database Products." XML-Enabled Databases. 6 June 2002. 25 June 2002.
[53] O'Reilly & Associates, Inc. "XML.Com Bioinformatics [May 20, 2001]." Bioinformatics. 2002. 24 June 2002.
[54] Hailong Zhang and Xiangyun Wang. "XML Application in Bioinformatics and an Example of Using XML for Protein Motif Data Presentation." Iowa State University. 24 June 2002.
[55] Todd Sundsted. "Dot Com Builder: Storing XML Data." Storing XML Data. 7 Jan. 2002. 19 May 2002.
[56] E. F. Codd. Derivability, Redundancy and Consistency of Relations Stored in Large Data Banks. IBM Research Report no. RJ599. San Jose, CA, 1969.


[57] E. F. Codd. "A Relational Model of Data for Large Shared Data Banks." Communications of the ACM June 1970: 377-87.
[58] Mark Graves. Designing XML Databases. Upper Saddle River, NJ: Prentice Hall PTR, 2001.
[59] J.A. Epstein, J.A. Kans, and G.D. Schuler. WWW Entrez: A Hypertext Retrieval Tool for Molecular Biology. 2nd Ann. Int. WWW Conf., in press. 1994.
[60] NCBI. "Entrez Home." Entrez. 2002. 25 June 2002.
[61] Cay S. Horstmann, Gary Cornell. Core JAVA Volume II - Advanced Features. Palo Alto, CA: Sun Microsystems, Inc., 2000.
[62] Marty Hall. Core Servlets and JavaServer Pages. Upper Saddle River, NJ: Prentice Hall PTR, 2001.
[63] Sun Microsystems. "JavaServer Pages(TM) Technology." JavaServer Pages. 25 June 2002.
[64] Joe Mocker. "Using JDBC with JavaServer Pages." Question of the Week No. 73. 15 May 2002.
[65] Alex Chaffee. "Using XML and JSP Together." Using XML and JSP Together. 2002. 25 June 2002.
[66] Barbara A. Eckman, Zoe Lacroix, Louiqa Raschid. "University of Maryland Computer Science Dept. Technical Reports." Optimized Seamless Integration of Biomolecular Data. 24 June 2002.
[67] Barbara Eckman, Julia Rice, and William Swope. "Heterogeneous Data and Algorithm Integration in Bioinformatics." ISMB 2002 Tutorial Proposal 15 Mar. 2002.
[68] Noboru Matoba, Junko Tanoue, Masatoshi Yoshikawa. "A System for Integration of Heterogeneous Biological XML Data." Genome Informatics 12 (2001): 473-74.
[69] 3rd Millennium, Inc. "Practical Data Integration in Biopharmaceutical R&D: Strategies and Technologies." May 2002. 3rd Millennium. 25 June 2002.
[70] Paul Gray and Hugh J. Watson. Decision Support in the Data Warehouse. Upper Saddle River, NJ: Prentice Hall PTR, 1998.


[71] Eric Gombocz and Robert Stanley. "Informatics: Program and Solutions in the Handling of Massive Amounts of Disparate Data." PharmaGenomics March/April 2002: 30-40.
[72] Sergio Lifschitz, Luiz Fernando Bessa Seibel, and Elvira Maria Antunes Uchoa. A Framework for Molecular Biology Data Integration. 25 June 2002.
[73] Tatiana A. Tatusova. "PAG X: INTEGRATED GENOME RESOURCES AT NCBI." INTEGRATED GENOME RESOURCES AT NCBI. 2002. 25 June 2002.


BIOGRAPHICAL SKETCH

Tsung-Lu Lee is currently a graduate student in the Department of Computer and Information Sciences and Engineering at the University of Florida. He was born in Taiwan on April 7, 1976. He graduated from Tainan First High School in 1994 and entered Iowa State University in 1995. In 1996, he transferred to the University of Iowa and studied biochemistry and pre-medicine. In May 2000, he received a Bachelor of Science degree in computer science and a Bachelor of Arts degree in biochemistry. In August 2000, he enrolled at the University of Florida and studied under Dr. Li-Min Fu, who is a Professor in the Department of Computer and Information Sciences and Engineering. Tsung-Lu's current research interests are bioinformatics, biological digital libraries, and medical informatics.