
State-of-the-Art Protein Secondary-Structure Prediction Using a Novel Two-Stage Alignment and Machine-Learning Method

Permanent Link: http://ufdc.ufl.edu/UFE0023862/00001

Material Information

Title: State-of-the-Art Protein Secondary-Structure Prediction Using a Novel Two-Stage Alignment and Machine-Learning Method
Physical Description: 1 online resource (113 p.)
Language: english
Creator: Gates, Ami
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2008

Subjects

Subjects / Keywords: machine, prediction, protein, secondary, structure, support, vector
Computer and Information Science and Engineering -- Dissertations, Academic -- UF
Genre: Computer Engineering thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract: While the complexity of biological systems often appears intractable, living organisms possess an underlying correlation derived from their hierarchical association. This notion enables methods such as machine learning techniques, Bayesian statistics, nearest neighbor, and known sequence-to-structure exploration, to discover and predict biological patterns. As proteins are the direct expression of DNA, they are the center of all biological activity. Thousands of new protein sequences are discovered each year, and knowledge of their biological importance relies on the determination of their folded or tertiary structure. Secondary structure prediction plays an important role in protein tertiary prediction, as well as in the characterization of general protein structure and function. The protein secondary structure prediction problem is defined as a three-state classification problem. Given any linear sequence of one-letter coded amino acids, the goal is to predict the secondary structure membership of each amino acid. Machine-learning based techniques are commonly and increasingly used for secondary structure prediction. For the past few decades, several algorithms and their variations have been used to predict protein secondary structure, including multi-layered neural networks and ensembles of support vector machines. DARWIN is a new protein secondary structure prediction server that utilizes a novel two-stage system that is unlike any current state-of-the-art method. DARWIN specifically responds to the issue of accuracy decline due to a lack of known homologous sequences, by balancing and maximizing PSI-BLAST information, by using a new method termed fixed-size fragment analysis (FFA), and by filling in gaps, ends, and missing information with an ensemble of support vector machines. DARWIN comprises a unique combination of homology consensus modeling, fragment consensus modeling, and support vector machine learning. DARWIN has been tested against several leading prediction servers and results show that DARWIN exceeds current state-of-the-art accuracy for all explored test sets.
General Note: In the series University of Florida Digital Collections.
General Note: Includes vita.
Bibliography: Includes bibliographical references.
Source of Description: Description based on online resource; title from PDF title page.
Source of Description: This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility: by Ami Gates.
Thesis: Thesis (Ph.D.)--University of Florida, 2008.
Local: Adviser: Banerjee, Arunava.
Electronic Access: RESTRICTED TO UF STUDENTS, STAFF, FACULTY, AND ON-CAMPUS USE UNTIL 2010-12-31

Record Information

Source Institution: UFRGP
Rights Management: Applicable rights reserved.
Classification: lcc - LD1780 2008
System ID: UFE0023862:00001



Full Text

STATE-OF-THE-ART PROTEIN SECONDARY STRUCTURE PREDICTION USING A NOVEL TWO-STAGE ALIGNMENT AND MACHINE LEARNING METHOD

By

AMI M. GATES

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2008

2008 Ami M. Gates

To My Family and Friends

ACKNOWLEDGMENTS

I would like to dedicate this overwhelming moment to my loving and supportive family, and to my wonderful friends. I would like to thank my parents, Eileen and Myke, who always supported my goals and listened endlessly; my brother Josh, who offered continuous encouragement; and my late brother Chad, whose last words to me were "PhD." I would like to thank my dear friends Amos, Karina Jesse, Neko, and Nathan for standing by me, and I would like to thank my committee chair, Arunava Banerjee, who always believed in me.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
    Introduction
    Proteins
    Protein Secondary Structure
    Machine Learning and Protein Secondary Structure Prediction
    Protein Secondary Structure Prediction Methods
    Dynamic Alignment Based Protein Window SVM Integrated Prediction for Three-State Protein Secondary Structure
    Overview

2 REVIEW OF THE BIOLOGY OF PROTEINS
    Brief Biology of Proteins
    From DNA to Protein
    Protein and Amino Acids
    Protein Folding
    Secondary Structure
    Protein Evolution and Sequence Conservation

3 LITERATURE REVIEW
    Problem of Secondary Structure Prediction
    Literature Review of Secondary Structure Prediction
        Methods Preceding 1993
        Methods Proceeding 1993
            Neural network methods from 1993-2007
            Summary of neural network based methods
        Support Vector Machine Methods from 2001-2007
        Summary of SVM-Based Methods
        Combined or Meta Methods
        Direct Homology-Based Methods

4 MATERIALS AND METHODS
    Introduction
    Protein Data and Databanks
    Datasets
    Protein Identity, Similarity, and Homology
    Multiple Sequence Alignment and PSI-BLAST
        Basic Local Alignment Search Tool (BLAST) Algorithm
            BLAST: step 1
            BLAST: step 2
            BLAST: step 3
        Position-Specific Iterative BLAST (PSI-BLAST) Algorithm
            Creating the PSSM
            Summary of PSI-BLAST
    Input Vectors and Sliding Windows
    Accuracy Measures
    Machine Learning Techniques
        Support Vector Machines
        Using SVMs in Secondary Structure Prediction
    Neural Networks
    Information Theory and Prediction

5 NEW SECONDARY STRUCTURE PREDICTION METHOD DARWIN
    Dynamic Alignment Based Protein Window SVM Integrated Prediction for Three-State Protein Secondary Structure: A New Prediction Server
    Introduction and Motivation of DARWIN
    Methods and Algorithms Used in DARWIN
        Phases of DARWIN: Stage 1
            Phase 1
            Phase 2a: If at least one viable template is found
            Phase 2b: If no viable template is found
            Phase 3
        Phases of DARWIN: Stage 2: Fixed-Size Fragment Analysis
            Fragment size selection
            Step 1
            Step 2
            Step 3
    Ensemble of Support Vector Machines in DARWIN
        The SVM Kernel and Equation
        Training the SVM and Using PSI-BLAST Profiles
    Datasets and Measures of Accuracy for DARWIN
    Experiments, Measures, and Results
    Conclusions on DARWIN

6 DARWIN WEB SERVER
    Introduction
    Using the Server
    Design of the DARWIN Web Service

7 DISCUSSION AND CONCLUSION
    Introduction
    Protein Secondary Structure Prediction Progress
    Strength of DARWIN
    Future Work and Improvements

LIST OF REFERENCES

BIOGRAPHICAL SKETCH

LIST OF TABLES

5-1 Detailed average prediction results for DARWIN
5-2 Average prediction results for dataset EVA5 for DARWIN compared to top published indirect homology method results
5-3 Average prediction results for dataset EVA6 for DARWIN compared to top published indirect homology method results

LIST OF FIGURES

2-1 Simplification of the processes of transcription and translation
2-2 Once a polypeptide is created through the process of translation, it is released into the cytosol and is known as the primary or linear sequence
2-3 The 20 known amino acids. Adapted from Voet and Voet, 2005
2-4 Torsion angles phi and psi that offer rotational flexibility between amino acid peptide bonds. Adapted from Voet and Voet, 2005
2-5 Ramachandran plot for a set of three alanine amino acids joined as a tripeptide
2-6 Example of a helical protein secondary structure. The hydrogen bonds are denoted with dashed lines
2-7 Sheet protein secondary structure, with hydrogen bonds noted with dashed lines. Adapted from Voet and Voet, 2005
3-1 Example of a linear sequence of amino acids, each accompanied by a secondary structure label of H, C, or E
4-1 Protein Data Bank (PDB) website. This area is a repository for known protein structures and related protein information
4-2 Matrix known as BLOSUM62, a similarity matrix derived from small local blocks of aligned sequences that share at least 62% identity
4-3 Example of a PSI-BLAST generated alignment between a query protein and a subject protein
4-4 Example of a PSI-BLAST generated position-specific scoring matrix (PSSM)
4-5 Example of the BLAST algorithm. A given query protein is analyzed by looking at all three amino acid word sets
4-6 Visual example of the production of input vectors that can be used to train and test machine learning constructs
4-7 Visual example of decision boundary between two classes and the margin that is maximized
5-1 The PSI-BLAST example alignment portion. Several areas in a given alignment can result in missing information
5-2 Histogram for each dataset, EVA5 and EVA6, displaying the percentage of proteins predicted by DARWIN with given accuracy
6-1 Image of the DARWIN Web page that allows an Internet-based graphical user interface with the DARWIN service

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

STATE-OF-THE-ART PROTEIN SECONDARY STRUCTURE PREDICTION USING A NOVEL TWO-STAGE ALIGNMENT AND MACHINE LEARNING METHOD

By

Ami M. Gates

December 2008

Chair: Arunava Banerjee
Major: Computer Engineering

While the complexity of biological systems often appears intractable, living organisms possess an underlying correlation derived from their hierarchical association. This notion enables methods such as machine learning techniques, Bayesian statistics, nearest neighbor, and known sequence-to-structure exploration to discover and predict biological patterns. As proteins are the direct expression of DNA, they are the center of all biological activity. Thousands of new protein sequences are discovered each year, and knowledge of their biological importance relies on the determination of their folded or tertiary structure. Secondary structure prediction plays an important role in protein tertiary prediction, as well as in the characterization of general protein structure and function.

The protein secondary structure prediction problem is defined as a three-state classification problem. Given any linear sequence of one-letter coded amino acids, the goal is to predict the secondary structure membership of each amino acid. Machine-learning based techniques are commonly and increasingly used for secondary structure prediction. For the past few decades, several algorithms and their variations have been used to predict protein secondary structure, including multi-layered neural networks and ensembles of support vector machines.

DARWIN is a new protein secondary structure prediction server that utilizes a novel two-stage system that is unlike any current state-of-the-art method. DARWIN specifically responds to the issue of accuracy decline due to a lack of known homologous sequences, by balancing and maximizing PSI-BLAST information, by using a new method termed fixed-size fragment analysis (FFA), and by filling in gaps, ends, and missing information with an ensemble of support vector machines. DARWIN comprises a unique combination of homology consensus modeling, fragment consensus modeling, and support vector machine learning. DARWIN has been tested against several leading prediction servers, and results show that DARWIN exceeds current state-of-the-art accuracy for all explored test sets.

CHAPTER 1
INTRODUCTION

Introduction

While the complexity of biological systems often appears intractable, living organisms possess an underlying correlation derived from their hierarchical association. It is this notion that enables methods such as machine learning techniques, Bayesian statistics, nearest neighbor, and known sequence-to-structure exploration to discover and predict biological patterns.

Proteins

Living organisms are based on a morphological unit called a cell, where each cell contains a complete set of genetic information encoded in the base sequences of DNA molecules. For DNA to enact its function, it engages in a process known as transcription, during which RNA is used to make a copy of the section of DNA to be expressed. This assures the safety and encapsulation of the DNA, and allows the information contained in the DNA to be utilized in another area of the cell. Next, inside the endoplasmic reticulum, the RNA, through a process known as translation, takes part in the production of a linear chain of amino acids, each bound by a peptide bond (a polypeptide). In the final stage of protein production, the polypeptide is released into the cytosol, where it quickly folds into a localized secondary structure, and then a tertiary, or biologically functional, protein structure.

As proteins are the direct expression of DNA, they are the center of all biological activity. Proteins act as enzymes to assist with chemical reactions, as chemical messengers or hormones to maintain internal communication, and as transportation mechanisms, such as oxygen transport in the blood. Further, proteins are involved in the storage and acquisition of information, such as that collected by the retina; the construction of complex structures, such as bone and collagen; and the maintenance of systems, such as the immune system.

Proteins are composed of unique monomeric units, or amino acids. There are 20 distinct amino acids that comprise all known proteins. Each amino acid consists of a central carbon atom (Cα), an attached carboxyl group (COOH), an attached amino group (NH2), and a side chain or R group. It is the individual R group that makes each amino acid unique, and in possession of different chemical properties.

Protein Secondary Structure

Thousands of new protein sequences are discovered each year, and knowledge of their biological importance relies on the determination of their folded, or tertiary, structure. Protein structures can be determined experimentally through nuclear magnetic resonance (NMR) spectroscopy as well as by x-ray crystallography. However, both methods have unique challenges and can be time and resource consuming. Therefore, the problem of predicting the full three-dimensional structure of a protein given its linear sequence of amino acids has remained both ubiquitous and unsolved. Secondary structure prediction plays an important role in protein tertiary prediction, as well as in the characterization of general protein structure and function. Over the last several decades, protein secondary structure prediction has continued to advance while simultaneously benefiting from the growth and increased availability of protein databanks.

From a linear sequence of amino acids, a protein sequence folds rapidly into secondary or local arrangements, and then into a tertiary or three-dimensional structure. Because the secondary structure of a protein provides a first step toward native or tertiary structure prediction, secondary structure information is utilized in the majority of protein folding prediction algorithms (Liu and Rost, 2001; McGuffin et al., 2001; Meller and Baker, 2003; Hung and Samudrala, 2003). Similarly, protein secondary structure information is routinely used in a variety of scientific areas, including proteome and gene annotation (Myers and Oas, 2001; Gardy et al., 2003; VanDomselaar et al., 2005; Mewes et al., 2006), the determination of protein flexibility (Wishart and Case, 2001), the subcloning of protein fragments for expression, and the assessment of evolutionary trends among organisms (Liu and Rost, 2001).

The protein secondary structure prediction problem is defined as a three-state classification problem. Given any linear sequence of one-letter coded amino acids, the goal is to predict the secondary structure membership by labeling each amino acid as H (for helix type), E (for sheet type), or C (for loop, coil, and other). As each amino acid can be labeled as one of three states, there are 3^n possible solutions for a sequence of n amino acids.

Machine Learning and Protein Secondary Structure Prediction

Machine learning based techniques are commonly and increasingly used for secondary structure prediction. While unique in specifics, each follows a set of prediction steps that can be loosely summarized as the following. A machine learning algorithm is selected. This is usually a neural network ensemble (Bondugula and Xu, 2007; Pollastri and McLysaght, 2004; Cuff and Barton, 1999; Jones, 1999; Rost and Sander, 1993) or a support vector machine ensemble (Ward et al., 2003; Kim and Park, 2003; Hu et al., 2005; Wang et al., 2004). Next, a mutually non-homologous set of proteins, such as rs126 (Rost and Sander, 1993), cb513 (Cuff and Barton, 1999), or EVA Common (Koh et al., 2003), is selected and used to construct training and testing vector sets. During input vector construction, both the training and testing sets are profiled using a multiple sequence alignment (MSA) method that can discover and align similar or homologous proteins. The most widely used algorithm for this purpose is PSI-BLAST (Altschul et al., 1997), which generates a position-specific scoring matrix (PSSM), or profile, containing the log likelihood of the occurrence of each of the 20 amino acids at each position of the query protein sequence. The utilization of PSI-BLAST, MSA, and profiling falls under the category of prediction through the use of evolutionary information. A small sketch of this profiling step follows.
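As an illustration of the profiling step just described, the short Python sketch below (hypothetical code, not that of any cited server) converts a PSSM into fixed-length per-residue input vectors using a sliding window. The window half-width of 7 (15 positions, as in PSIPRED) and the logistic squashing of the log-odds scores are illustrative choices.

    import math

    # Minimal sketch: turn an L x 20 PSSM (one row of log-odds scores per
    # residue) into one flattened input vector per residue position.
    def pssm_to_vectors(pssm, half_window=7):
        n_feats = 20                              # one score per amino acid type
        vectors = []
        for i in range(len(pssm)):
            vec = []
            for offset in range(-half_window, half_window + 1):
                j = i + offset
                if 0 <= j < len(pssm):
                    # squash raw scores into (0, 1); a common practical choice
                    vec.extend(1.0 / (1.0 + math.exp(-s)) for s in pssm[j])
                else:
                    vec.extend([0.0] * n_feats)   # pad positions past either end
            vectors.append(vec)                   # length = (2*7 + 1) * 20 = 300
        return vectors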

Protein Secondary Structure Prediction Methods

For the past few decades, several algorithms and their variations have been used to predict protein secondary structure. Early techniques, including single residue statistics (Chou and Fasman, 1974), Bayesian statistics, and information theory (Garnier et al., 1978, 1996), opened the door to more advanced methods. Following and building on early ideas, an explosion of techniques using PSI-BLAST profiles emerged, as pioneered in the method PHD (Rost and Sander, 1993), which broke the 70% prediction barrier. Current methods include machine learning constructs, such as multi-layered neural networks, as in SSpro (Pollastri and McLysaght, 2004), PROFsec (Rost and Eyrich, 2001), PHDpsi (Przybylski and Rost, 2001), and PSIPRED (Jones, 1999); ensembles of support vector machines, such as SVMpsi (Kim and Park, 2003) and YASSPP (Karypis, 2006); nearest neighbor methods, such as PREDATOR (Frishman and Argos, 1996); and a plethora of combined or meta methods (Cuff and Barton, 1999; Albrecht et al., 2003).

All current machine learning techniques rely on homological and alignment data for training and testing, generally in the form of PSI-BLAST (Altschul et al., 1997) profiles, and can therefore be referred to as indirect homology based methods. Recently, modern techniques have more directly utilized homology and evolutionary information by including template and fragment modeling in the prediction process (Pollastri, 2007; Montgomerie et al., 2006; Cheng, 2005).

A commonality among all methods that exceed an average three-state prediction accuracy of 70% is the use of evolutionary information in the form of single and multiple sequence alignments (Pollastri and McLysaght, 2004; Bondugula and Xu, 2007; Przybylski and Rost, 2002; Cuff and Barton, 1999; Jones, 1999; Ward et al., 2003; Kim and Park, 2003; Hu et al., 2005; Wang et al., 2004), most commonly through the use of PSI-BLAST (Altschul et al., 1997). As homologous protein sequences have a higher propensity for exhibiting similar secondary structure, and proteins can exchange as many as 70% of their residues without altering their basic folding pattern (Przybylski and Rost, 2001; Benner and Gerloff, 1991), a common challenge of protein secondary structure prediction is the maintenance of high prediction accuracy in the absence of detectable known homologous sequences.

Dynamic Alignment Based Protein Window SVM Integrated Prediction for Three-State Protein Secondary Structure

Dynamic Alignment Based Protein Window SVM Integrated Prediction for Three-State Protein Secondary Structure (DARWIN), a new secondary structure prediction server which offers a novel and accurate two-stage prediction method, is presented. DARWIN incorporates a balance of PSI-BLAST derived homological data with a fragment-based technique, termed fixed-size fragment analysis (FFA), to respond to query proteins for which no homologous proteins can be found. In both stages, an ensemble of Gaussian kernel based support vector machines (SVMs) is employed to compensate for any lack of alignment information, gaps in alignment information, or skewed or incomplete alignment information. DARWIN has been tested against several leading prediction servers, including PSIPRED, PROFsec, and PHDpsi, and on common and comparative data sets, including EVA Common 5 (EVA5), EVA Common 6 (EVA6), and rs126. Results show that DARWIN exceeds current state-of-the-art accuracy for all explored test sets and methods.

Overview

Chapter 2 will discuss protein biology and the process of how proteins are manufactured. The biology of amino acids, protein secondary structure, and protein folding will be explained, as well as the role of evolution in protein prediction. Chapter 3 will describe the protein secondary structure prediction problem and will offer an extensive literature review on the topic. Chapter 4 will offer a description of all materials and methods used by the majority of current prediction methods, as well as by the novel method presented. Chapter 4 will include discussion of datasets, databanks, protein identity, similarity, and homology, multiple sequence alignment, similarity matrices, input vectors and sliding windows, accuracy measures, and machine learning techniques, including support vector machines, neural networks, and information theory. Chapter 5 will introduce a novel method of protein secondary structure prediction, DARWIN, with a full description of the methods, algorithms, and results. Chapter 6 will include discussion of the creation and use of the DARWIN web server, as well as an overview of the code flow behind DARWIN. Chapter 7 will include a discussion, conclusions, and the consideration of future work in the area.

CHAPTER 2
REVIEW OF THE BIOLOGY OF PROTEINS

Brief Biology of Proteins

All life is based on a morphological unit called a cell (Schleiden and Schwann, 1838). Cells are classified as either prokaryotes or eukaryotes, and both cell forms contain DNA (deoxyribonucleic acid). Unlike prokaryotes, eukaryotes encapsulate this DNA in a membrane-enclosed nucleus, are considerably larger, and possess membrane-enclosed organelles, each with its own individual purpose. As such, eukaryotic cells have a more complex function and organization than do prokaryotic cells. For the remainder of this paper, reference to a cell will imply reference to a eukaryotic cell, though in general, most attributes and processes are shared by both cell types.

Each cell nucleus contains a complete set of genetic information, which is encoded in the base sequences of DNA molecules. These DNA molecules together form the discrete number of chromosomes characteristic to each species. As an example, each human cell has 46 chromosomes and contains over 700 megabytes of information (Voet and Voet, 2005). Interestingly, the soybean has 40 chromosomes and the camel has 70. To control, utilize, and realize this mass of information, DNA is indirectly expressed as protein. As such, proteins carry out the tasks and maintain the environment that DNA encodes.

From DNA to Protein

Both DNA and protein are considered macromolecules, as they consist of small finite sets of monomeric units. DNA, for example, consists of four distinct nucleotides, and proteins are formed from a finite set of 20 amino acids. Thus, it is permutation that holds the vast information of life.

For DNA to enact its function, it engages in a process known as transcription (Figure 2-1). During the process of transcription, RNA is first used to make a copy of the section of DNA to be expressed, following which the RNA exits the nucleus. This assures the safety and encapsulation of the DNA and allows the information contained in the DNA to be utilized in another area of the cell. Next, inside the endoplasmic reticulum, RNA, through a process known as translation (Figure 2-1), takes part in the creation of a polypeptide chain. A polypeptide is a linear sequence of amino acids, each connected by peptide bonds. In the final stage of protein production, the polypeptide is released into the cytosol, where it quickly folds into local secondary formations, and then a tertiary protein structure (Figure 2-2).

Protein and Amino Acids

As proteins are the direct expression of DNA, they are the center of all biological activity. Proteins act as enzymes, to assist with chemical reactions. They act as chemical messengers, or hormones, to maintain internal communication. They engage in transportation, including oxygen transport in the blood. They are involved in the storage and acquisition of information, such as that collected by the retina. They are involved in construction, such as collagen, and are actively involved in immune system function.

Proteins are composed of their own unique monomeric units, or amino acids. There are 20 distinct amino acids (Figure 2-3) that are used to build all proteins. The genetic information encoded in DNA and delivered by RNA is contained in the permutations of four DNA nucleotides. To represent DNA information, each of the 20 amino acids can be matched to a set of three RNA nucleotides. This triplet code, or codon (Figure 2-1), is known to be non-overlapping and degenerate (Voet and Voet, 2005). As such, in many cases, two or more codons encode the same amino acid. This in itself has many implications, including lessening the cascading effect of a point mutation in the DNA code.

Each of the twenty amino acids consists of a central carbon atom (Cα), an attached carboxyl group (COOH), an attached amino group (NH2), and a side chain or R group. It is the individual R group that creates uniqueness among amino acids, and displays differing and unique chemical properties. The side chain of an amino acid (Figure 2-3) can affect characteristics such as mass, acidity, polarity, hydrophobicity, and electron charge. These characteristics, the sequence and order of the amino acids in the polypeptide, combined with the cellular environment, directly determine the final folded structure of a protein.

Protein Folding

During the process of translation, a linear sequence of amino acids (a polypeptide) is created. Upon completion, the polypeptide is released into the cell cytoplasm, where it folds into secondary and then tertiary structure (Figure 2-2). The question of how secondary and tertiary structure can be predicted given a linear sequence of amino acids has been elusive and persistent. Therefore, investigation of the properties and prediction of protein folding, as well as related problems, such as secondary structure prediction, continues to engage the fields of biology, physics, and computer science.

While the peptide bonds that connect each amino acid of a protein polypeptide are known to have a planar and rigid structure (Voet and Voet, 2005), the torsion angles between Cα-N (φ) and Cα-C (ψ) (Figure 2-4) each offer a set of combined conformational ranges, limited by steric constraints. It should be noted that while φ and ψ represent major degrees of freedom, every R group contains several atoms, each with internal and relative external variation. As a simple example case, consider the set of physically permissible values of φ and ψ for a set of three amino acids (a tripeptide). The sterically permissible φ and ψ angles can be calculated by measuring the distance between all neighboring atoms for all possible angle values. The Ramachandran diagram (Ramachandran and Sasisekharan, 1968) (Figure 2-5) reveals all possible and permissible angle combinations for a sequence of three consecutive alanine amino acids. The diagram illustrates that all eight secondary structures fall within the permissible steric range, and conversely, that the permissible steric ranges closely encompass the eight known secondary structures.

To consider the intractable nature of protein folding prediction, consider a reduced example in which the 2n torsion angles contained in a protein sequence of size n each have only three stable relative conformations. This would yield 3^(2n) ≈ 10^n combinations. Even for small values of n, two conclusions can be noted. First, the biologically managed protein folding process does not have time to explore all possibilities, implying the existence of an underlying folding process. Similarly, prediction algorithms do not have the time or the resources to consider all possibilities. Therefore, the folding prediction problem, and related problems such as secondary structure prediction, depend on discovering propensities implied from known folded proteins.
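To make the estimate above concrete:

    3^(2n) = 9^n = 10^(n log10 9) ≈ 10^(0.95 n)

so even a modest chain of n = 100 residues implies on the order of 10^95 candidate conformations, far more than either a folding chain or a prediction algorithm could ever enumerate.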

Secondary Structure

Protein secondary structure is defined as the local conformation of a polypeptide backbone (Voet and Voet, 2005), where a polypeptide is a linear sequence of amino acids, each connected end to end by a rigid peptide bond. There are eight possible secondary structures that can occur in a given protein. These include α helices (H), 3₁₀ helices (G), π helices (I), β sheets (E), β bridges (B), bends (S), hydrogen bond turns (T), and either loops, coils, or no structure (C). The significance of secondary structure, with respect to protein structure and function, is that secondary structure is maintained even in the tertiary or biologically active state. Therefore, secondary structure offers a first step toward tertiary prediction, as well as information about protein relation and function.

While there are eight possible secondary structures that are known to form, these eight states are generally categorized into three basic groups. The first group, called helices (Figure 2-6), includes the α helix, the 3₁₀ helix, and the π helix. Helices are formed due to a twisting of the polypeptide chain, and are stabilized by hydrogen bonds that form between every pth amino acid, where p is known as the pitch. The most stable and common helical structure has a pitch of four, and is known as the α helix. The 3₁₀ helix and π helix are less stable and therefore rarer. The second group, called sheets (Figure 2-7), includes the β sheet and the β bridge. Sheet structures are also stabilized by hydrogen bonds. The third group includes the remaining structures, such as bends, turns, loops, coils, and regions that contain no order. It is hypothesized that regions of little or no order offer flexibility for external interaction.

Protein Evolution and Sequence Conservation

As proteins evolve and diverge over time, the conservation of sequence tends to be localized to specific functional regions. Evolutionarily conserved regions are generally both functionally and structurally more important (Sitbon and Pietrokovski, 2007), so that any amino acid mutation in such regions would result in non-viable proteins. Because only functionally viable proteins can persist through time, proteins with similar sequences tend to adopt similar structure (Chothia, 1986; Doolittle, 1981). This assertion supports and motivates the use of evolutionary information and multiple sequence alignment in protein prediction methods.

While it has been shown that as many as 70% of amino acids in a protein can be altered or mutated without affecting the overall protein structure or the secondary structure integrity (Rost, 1999), it is also important to note that changes that destabilize proteins are not conserved through evolution. Therefore, amino acid exchanges or mutations that result in conserved structure and function, while statistically rare, are highly likely due to evolution (Rost, 2003). There are also cases in which a single amino acid mutation can severely alter protein structure and function, but still sustain through evolution. Sickle cell anemia, a genetic disorder that results in a sickle-shaped red blood cell, is caused by a single amino acid mutation. However, the sickle cell mutation was able to survive evolution, as it only partially impairs function and is known to protect against malaria.

Figure 2-1. Simplification of the processes of transcription and translation. Transcription takes place in the nucleus and makes a copy of the section of DNA to be expressed. Translation occurs in the endoplasmic reticulum and uses the sequence of RNA to create a polypeptide. Adapted from Voet, D., and Voet, J. (2005) Biochemistry, Third Edition, Wiley Higher Education.

Figure 2-2. Once a polypeptide is created through the process of translation, it is released into the cytosol and is known as the primary or linear sequence. The primary sequence then folds into local secondary structure and then collapses into tertiary structure. Adapted from Voet, D., and Voet, J. (2005) Biochemistry, Third Edition, Wiley Higher Education.

Figure 2-3. The 20 known amino acids. Adapted from Voet, D., and Voet, J. (2005) Biochemistry, Third Edition, Wiley Higher Education.

Figure 2-4. Torsion angles phi and psi that offer rotational flexibility between amino acid peptide bonds. Adapted from Voet, D., and Voet, J. (2005) Biochemistry, Third Edition, Wiley Higher Education.

Figure 2-5. Ramachandran plot for a set of three alanine amino acids joined as a tripeptide. Relatively permissible phi and psi angles are noted in light blue, and the angles that occur in sheet and helical formation are noted in navy blue. Adapted from Voet, D., and Voet, J. (2005) Biochemistry, Third Edition, Wiley Higher Education.

Figure 2-6. Example of a helical protein secondary structure. The hydrogen bonds are denoted with dashed lines and form between oxygen and hydrogen atoms of non-neighboring amino acids. Adapted from Voet, D., and Voet, J. (2005) Biochemistry, Third Edition, Wiley Higher Education.

Figure 2-7. Sheet protein secondary structure, with hydrogen bonds noted with dashed lines. Adapted from Voet, D., and Voet, J. (2005) Biochemistry, Third Edition, Wiley Higher Education.

CHAPTER 3
LITERATURE REVIEW

Problem of Secondary Structure Prediction

The secondary structure of a protein provides a significant first step toward tertiary structure prediction, as well as offering information about protein activity, relationship, and function. Protein folding, or the prediction of tertiary structure from linear sequence, is an unsolved and ubiquitous problem that invites research from many fields of study, including computer science, molecular biology, biochemistry, and physics. Secondary structure information is utilized in the majority of protein folding prediction algorithms (Liu and Rost, 2001; McGuffin et al., 2001; Meller and Baker, 2003; Hung and Samudrala, 2003). Protein secondary structure is also used in a variety of scientific areas, including proteome and gene annotation (Myers and Oas, 2001; Gardy et al., 2003; VanDomselaar et al., 2005; Mewes et al., 2006), the determination of protein flexibility (Wishart and Case, 2001), the subcloning of protein fragments for expression, and the assessment of evolutionary trends among organisms (Liu and Rost, 2001). Therefore, protein secondary structure prediction remains an active area of research, and an integral part of protein analysis.

Protein secondary structure prediction can be described as a three-state classification problem that begins with a linear sequence of amino acids and results in the labeling of each amino acid as H, E, or C (Figure 3-1). The label H represents a helical secondary structure formation, the label E represents a sheet secondary structure formation, and the label C represents a coil or loop structure formation, or alternatively, no secondary structure. As noted, secondary structure prediction methods follow a reduced definition of secondary structure that consolidates the eight known secondary conformations into three basic states, namely: H (helix) = {H, G, (I)}, E (sheet) = {B, E}, C (other) = {C, S, T, (I)}. It should be noted that π helices (I) are rare and unstable, and are often categorized in the C state. A short sketch of this reduction, and of the per-residue accuracy it implies, follows.
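The Python sketch below is illustrative rather than drawn from any cited method; the mapping follows the grouping above (with the rare π helix state I grouped into C), and q3 computes the standard fraction of residues whose three-state label is predicted correctly.

    # Map an eight-state secondary structure string to the three-state alphabet.
    EIGHT_TO_THREE = {
        "H": "H", "G": "H",                      # helix types
        "E": "E", "B": "E",                      # sheet and bridge
        "C": "C", "S": "C", "T": "C", "I": "C",  # coil/other; I treated as C
    }

    def reduce_to_three_state(states):
        return "".join(EIGHT_TO_THREE.get(s, "C") for s in states)

    def q3(predicted, observed):
        """Fraction of residues whose H/E/C label is predicted correctly."""
        assert len(predicted) == len(observed)
        return sum(p == o for p, o in zip(predicted, observed)) / len(observed)

    # Example: q3("CCHHHHCC", reduce_to_three_state("CSHGGHTC")) == 1.0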

Literature Review of Secondary Structure Prediction

Many algorithms and their variations have been investigated to predict protein secondary structure. Methods include early techniques, such as single residue statistics and residue propensities (Chou and Fasman, 1974), as well as Bayesian statistics and information theory, seen in early GOR methods (Garnier et al., 1978, 1996). These pioneering methods were then followed by an explosion of techniques using PSI-BLAST profiles, as first proposed in the method PHD (Rost and Sander, 1993), which broke the 70% prediction barrier. Current methods include machine learning constructs, such as multi-layered neural networks, as in SSpro (Pollastri and McLysaght, 2004), PROFsec (Rost and Eyrich, 2001), PHDpsi (Przybylski and Rost, 2001), and PSIPRED (Jones, 1999); ensembles of support vector machines, such as SVMpsi (Kim and Park, 2003) and YASSPP (Karypis, 2006); nearest neighbor methods, such as PREDATOR (Frishman and Argos, 1996); and a plethora of combined or meta methods (Cuff and Barton, 1999; Albrecht et al., 2003).

Methods Preceding 1993

One of the earliest protein secondary structure prediction methods was published in 1974, and is known as the Chou-Fasman method (Chou and Fasman, 1974). The Chou-Fasman method utilized a table of amino acid conformational propensities, and aimed at predicting the initiation and termination of helical and sheet regions. The conformational propensity table, generated using 19 available known proteins, offered the calculated probability that each amino acid would appear in a given secondary state (Equation 3-1):

    P(R|S) / P(R)    (3-1)

The value P(R|S) is the probability of amino acid residue R occurring, given that the observed state is S. The value P(R) is the probability of amino acid residue R occurring, given a set of amino acids. Once the propensities were calculated for each amino acid, seven types of propensity measures were considered. These measures were categorized as helix former (having a high relative probability of being found in a helix), helix indifferent (having a neutral relative probability of being found in a helix), and helix breaker (having a low relative probability of being found in a helix). These same three categories were also created for sheets: sheet former, sheet indifferent, and sheet breaker. The seventh propensity measure was the probability of being in a coil, or rather, not in a sheet or helix.

Given propensities, an input sequence was first searched for nucleation sites, or areas of likely secondary structure formation. These areas would contain either high numbers of consecutive helix formers or high numbers of consecutive sheet formers. The heuristic to determine nucleation generally evaluated six consecutive amino acids at a time. Once a nucleation site was located, it was extended in both directions until breakers were discovered. A simplified sketch of this scan appears below.

While the overall prediction accuracy of the Chou-Fasman method was quoted between 70-80%, later research (Nishikawa, 1983) determined that accuracies were below 55%, due to the use of the same small protein set for both training and testing.
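The following Python is a simplified illustration of the nucleation-and-extension scan, not the published procedure: the propensity argument stands in for the Chou-Fasman table, and the four-of-six nucleation count and the extension cutoff are representative values rather than the exact published ones.

    # Simplified Chou-Fasman style helix scan: find a nucleation window,
    # then extend it outward until weak (breaker) residues are reached.
    def find_helix_regions(seq, propensity, nucleus=6):
        regions, i = [], 0
        while i + nucleus <= len(seq):
            window = seq[i:i + nucleus]
            formers = sum(propensity[aa] > 1.0 for aa in window)
            if formers >= 4:                      # nucleation site found
                start, end = i, i + nucleus
                while start > 0 and propensity[seq[start - 1]] > 0.7:
                    start -= 1                    # extend left
                while end < len(seq) and propensity[seq[end]] > 0.7:
                    end += 1                      # extend right
                regions.append((start, end))      # 0-based, half-open span
                i = end
            else:
                i += 1
        return regions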

Although prediction accuracies were low, several valuable ideas were introduced by the Chou-Fasman method that would later pave the way for more advanced algorithms. First, the notion of using local information in the prediction process was suggested, namely that short-range and medium-range amino acid interactions play a predominant role in the prediction of secondary structure. While this suggestion is still the subject of some debate, the majority of current prediction methods use a local sliding window (discussed in Chapter 4) to create prediction input. Next, the use of a conformational propensity table would be one of the first implications that known protein data, and later protein evolutionary information, could be utilized in the determination of structure.

Published in 1978, the GOR method (Garnier et al., 1978) expanded on the ideas presented by the Chou-Fasman method. The GOR method used a variation of the conformational propensity table, an extension of residue probability evaluation through information theory, and the introduction of the sliding window. The GOR method measured information difference, or the difference between the likelihood of a conformational state as compared to all other conformational states, for a given amino acid. Four information formulas were evaluated, one for each of the four GOR-defined states, namely helix (H), sheet (E), coil (C), and reverse turn (T). For each state, an information formula (Equation 3-2) was evaluated using approximation, where S_j is the state of the jth amino acid, S is one of the four secondary states, and R_k is the amino acid in the kth position:

    I(ΔS_j; R_1, ..., R_n)    (3-2)

To achieve an approximation, several assumptions were made, including the notion that information offered by distant amino acids is near zero, and that a window of eight amino acids on either side of the query amino acid is sufficient. Therefore, the information formula was approximated (Equation 3-3) using the following reduction, which neglects both multiple relative residue effects and information from residues more than eight positions away in either direction:

    I(ΔS_j; R_1, ..., R_n) ≈ Σ_{m=-8..8} I(ΔS_j; R_{j+m})    (3-3)

Given the resulting set of four information scores, one for each possible state, the maximum within a fixed window of size 17 was used to predict the final structure of each central amino acid; a sketch of this decision rule follows below. While results were reported at better than 64%, later research (Nishikawa, 1983) determined that results were near 55%, and that the inaccurate measurement was due to small datasets (25 proteins were used) and non-disjoint training and testing sets.
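Equation 3-3 translates directly into a windowed scoring loop. In the Python sketch below, info is a hypothetical pre-computed table indexed as info[S][aa][m], holding the single-residue information value contributed by amino acid aa at offset m (from -8 to 8) toward state S; the argmax over the four GOR states mirrors the decision rule just described.

    # GOR-style prediction: sum per-offset information scores over a
    # +/-8 window and pick the highest-scoring state at each position.
    def gor_predict(seq, info, states=("H", "E", "C", "T"), half_window=8):
        prediction = []
        for j in range(len(seq)):
            scores = {}
            for S in states:
                total = 0.0
                for m in range(-half_window, half_window + 1):
                    k = j + m
                    if 0 <= k < len(seq):         # ignore offsets past the ends
                        total += info[S][seq[k]][m]
                scores[S] = total
            prediction.append(max(scores, key=scores.get))
        return "".join(prediction)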

Through the following years, the GOR method, now GOR V (Sen et al., 2005; Kloczkowski et al., 2002), has continued to improve, due to an increased number of known and available structures as well as the incorporation of evolutionary information in the form of PSI-BLAST PSSM profiles. Accuracies have reached as high as 74.2%, depending on the dataset evaluated. The GOR methods are unique in their use of information theory and Bayesian statistics (discussed in detail in Chapter 4), and GOR V has added the use of triplet statistics within each given window, by calculating the statistics of single, pair, and triplet joint probability to determine the likelihood of a particular state.

The mid-eighties brought the use of more complicated data evaluation methods: machine learning techniques that further utilized information from known protein sequences to learn and predict structure. The use of neural networks (NNs) for protein structure prediction was in part due to the work of Sejnowski and Rosenberg (Sejnowski and Rosenberg, 1987), who utilized a back-propagation feedforward neural network (BFNN) to predict speech synthesis. This idea was followed independently by both Qian and Sejnowski (Qian and Sejnowski, 1988) and Holley and Karplus (Holley and Karplus, 1989).

In the Qian method, 106 proteins were used, and training and testing sets were separated. Efforts were made to create disjointness (mutual non-homology) between the training and testing sets by not placing homologous proteins in both training and testing groups. A back-propagation feedforward neural network was trained and then tested on input vectors generated using both a sliding window and a binary amino acid representation. To train and test the NN, an input vector was created to represent each amino acid in a given query sequence.
Methods from 1993 Onward

In 1993, Rost and Sander created PHD, a secondary structure prediction algorithm that broke the 70% prediction barrier by incorporating evolutionary information, in the form of multiple sequence alignments of similar (homologous) proteins, into a two-layer feedforward neural network. Like preceding methods, PHD used a neural network and a sliding window of size 13. However, a larger dataset of 126 mutually non-homologous proteins was used, and a novel, improved input vector of size 21 × 13 was employed. The new input vector was the first to shift from a binary representation of each amino acid to the inclusion of evolutionary or homologue information in the form of multiple sequence alignments (described in Chapter 4). Each amino acid in the query sequence was described with 21 values. The first 20 values were the relative frequencies of each of the 20 amino acids occurring at that specific position, as calculated from a set of known similar proteins aligned to the query protein. The twenty-first position was used to represent a space, and was set to either zero or one. The move away from a binary representation of each amino acid to a relative-frequency representation delivered considerable additional information to the prediction process. Because frequencies were gathered using the alignment and comparison of similar proteins, a higher weight on positions that are particularly conserved was implicitly generated. The use of known protein information in the form of alignments was a first step in realizing the notion that similar sequence produces similar structure. The prediction was further filtered by recognizing that helices have a minimum length of four consecutive residues in nature; thus, all helix predictions of length fewer than three were converted back to the default coil (C). The steps taken in the PHD method to improve the prediction process and overall accuracy were followed by an explosion of techniques that considered better sequence alignment methods, different window sizes, and more elaborate machine-learning techniques and architectures.
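The helix-length filter described above amounts to a simple post-processing pass. The sketch below implements one plausible reading of the rule, with the minimum run length exposed as an adjustable assumption.

```python
def filter_short_helices(pred, min_len=3):
    """Convert helix (H) runs shorter than min_len back to coil (C).

    'pred' is a per-residue prediction string over {H, E, C}."""
    out = list(pred)
    i = 0
    while i < len(out):
        if out[i] == "H":
            j = i
            while j < len(out) and out[j] == "H":
                j += 1
            if j - i < min_len:           # run of H too short to be a real helix
                out[i:j] = ["C"] * (j - i)
            i = j
        else:
            i += 1
    return "".join(out)

print(filter_short_helices("CCHHCCCHHHHEC"))  # -> 'CCCCCCCHHHHEC'
```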
Neural Network Methods from 1993-2007

In 1999, David Jones published PSIPRED (Jones, 1999), a two-stage neural network that is currently accessible online and, to date, maintains an average competitive accuracy of just under 80%. The first stage of the neural network represents the sequence-to-structure stage, and takes as input a vector of size 15 × 21. As with the majority of predictors that appeared after PHD, PSIPRED uses multiple sequence alignment information to create input vectors. PSIPRED also uses the increasingly popular application PSI-BLAST (Altschul et al., 1997) to create profiles, or position-specific scoring matrices (PSSM), that list the log likelihood of each amino acid occurring at each position of the query protein. Therefore, PSI-BLAST profiles are used to create training and testing input vectors. PSI-BLAST (discussed in Chapter 4) is an advanced alignment algorithm that detects and aligns homologous proteins, creating a query-based multiple sequence alignment and a corresponding position-specific scoring matrix (PSSM), or profile. Once input vectors are created, they are delivered to the NN. During the first stage of the NN, the input vector is reduced to 15 × 4 hidden units (three states and space), which are then sent into the second stage of the NN, the structure-to-structure stage, which outputs the final structure prediction. PSIPRED was trained with access to over 3000 proteins and uses a window of size 15.
In 2000, James Cuff and Geoffrey Barton published Jpred, a currently active secondary structure prediction method that is available online. Jpred's uniqueness is its utilization of several different types of multiple sequence alignment information during the training and testing of a feedforward NN. Jpred's NN incorporates information from several multiple sequence alignment algorithms, including the PSI-BLAST PSSM log-likelihood profiles, the PSI-BLAST frequency profile, the HMMer2 MSA profile (Eddy, 1998), and an independently constructed multiple sequence alignment derived from both AMPS (Barton, 1990) and CLUSTALW (Thompson et al., 1994). All types of alignment information were fed into the NN as input vectors, and results were compared to attain the best prediction. Input vector construction was based on a window of size 17, generating vectors of size 21 × 17. Jpred offers accuracies between 70.5% and 76.5%, depending on the dataset evaluated, and was trained on 480 mutually non-homologous proteins. In 2008, Jpred3 was published (Cole et al., 2008) and was updated on the publicly available online server Jnet. Jpred3 uses the Jpred algorithm, offers batch processing, and was retrained on a significantly larger dataset.
In 2002, SSpro was introduced (Pollastri et al., 2002) as a new technique that used a bidirectional recurrent neural network (BRNN), PSI-BLAST profiles, and a large training set of 1180 proteins. Unlike the feedforward NN, the BRNN (discussed further in Chapter 4) creates a classification or prediction based on three components. The first, or central, component is associated with a local window representing a portion of the protein query sequence and a specific central amino acid, as with a feedforward NN. The two additional components handle the information to the left and right of the central amino acid. Therefore, a uniqueness of SSpro is the use of a semi-window, allowing a difference between left and right contexts in the training of the NN. SSpro is available online and claims an average accuracy of 78%, depending on the dataset evaluated. In 2004, Porter, an evolution of SSpro, was published (Pollastri and McLysaght, 2004) as a new system using a bidirectional recurrent NN with shortcut connections, filtering, and a larger training set of 2171 proteins. Probabilities of secondary structures are used to filter the results, and five individual two-stage bidirectional recurrent NNs were independently trained and their results averaged. Porter reports accuracies between 76.8% and 79%, depending on the dataset. In 2007, Porter_H was added to the server (Pollastri et al., 2007). Porter_H builds on Porter and adds the use of direct homological evolutionary information in the form of alignments made to the query protein. When homologous proteins are found, template information collected from the homologues is directly added to the ensemble of recursive NNs. When direct homology is used, Porter_H can reach accuracies of 90% for a given protein. Porter_H was trained on 2171 proteins.
In 2005, YASPIN (Lin et al., 2005), a single neural network server, was published. YASPIN's uniqueness is its utilization of a hidden Markov model (HMM) to optimize and filter the output of the single-layer NN. The forward and backward algorithms of the HMM are used in the assignment of prediction reliability scores, or confidence, for each prediction result. With similarities to the Chou and Fasman method, the HMM in YASPIN identifies seven states, namely helix start, helix, helix end, sheet start, sheet, sheet end, and coil. These are used in conjunction with the NN results to make the final state prediction. YASPIN was trained on 3553 non-redundant proteins, uses PSI-BLAST to profile all sequences and to create input vectors, and uses a window of size 15. YASPIN publishes competitive accuracies near 78%, depending on the dataset.

Summary of Neural Network Based Methods

For the past 15 years, NN methods have been employed to predict protein secondary structure. Improvements to NN-based methods have generally included the addition of prediction layers, more advanced NN architectures, the addition of post-processing methods such as hidden Markov models, an increase in the number of proteins used in training, and the indirect and direct use of evolutionary information in the form of multiple sequence alignment and homology detection.
Support Vector Machine Methods from 2001-2007

In 2001, Sujun Hua and Zhirong Sun were the first to publish a protein secondary structure prediction method based on support vector machines (SVM) (Vapnik, 1995, 1998) (discussed further in Chapter 4). At the time, SVM-based methods had already been successfully used in areas of pattern recognition, including text (Drucker et al., 1999) and speech recognition (Schmidt and Grish, 1996). In many cases, SVM methods were noted to offer better performance than other machine-learning techniques (Hua and Sun, 2001). Because an SVM is a binary classifier, three to six SVM classifiers are generally used to offer three-state prediction, with six SVMs used by Hua and Sun. The radial basis kernel was employed, input vectors were derived from PSI-BLAST PSSM profiles, and a window of size 13 was used. Hua and Sun's method offered accuracies as high as 73.5%, depending on the dataset evaluated, and would be followed by several techniques, each using a variation of the SVM-based prediction methodology.

In 2003, SVMpsi (Kim and Park, 2003) was published, with accuracies as high as 78.5%, depending on the dataset evaluated. SVMpsi is an SVM-based predictor that offers improvements in the areas of tertiary classifiers and jury decisions. SVMpsi uses six SVM classifiers with a jury-style final decision process. SVMpsi also employs PSI-BLAST PSSM profiles in the creation of training and testing input vectors, a window of size 15, and a radial basis kernel.

In 2005, Cheng's group (Cheng et al., 2005) published an SVM method that further and more directly utilized evolutionary information. Normalized scores for each amino acid were collected by aligning and analyzing matching segments derived from BLAST alignments and calculating normalized scores for each amino acid based on known similar structures for that position. The normalized scores were then entered into an SVM as part of the final decision process. Accuracies ranged from 65% to 73%, depending on the availability of known similar sequences and the dataset tested.
In 2006, Brizele and Kramer (Brizele and Kramer, 2006) published an SVM-based method that used frequent amino acid patterns (subsequences) combined with PSI-BLAST alignment information. The frequencies of patterns of consecutive amino acids of any length were discovered by searching a protein database and compared with the query protein. Next, the query protein was used to find homologous alignments through PSI-BLAST. Finally, feature vectors to train and test the SVM ensemble were created using a combination of PSI-BLAST alignment information and frequent-pattern information. Results were reported as high as 77%, depending on the dataset evaluated, using three binary SVM classifiers and the radial basis kernel.

That same year, George Karypis published YASSPP (Karypis, 2006), a currently available SVM-based prediction method. YASSPP employs two levels of SVM-based models to create a final prediction. YASSPP creates input vectors by using a combination of PSI-BLAST PSSM profile data and BLOSUM62 (Henikoff and Henikoff, 1992) information (discussed in Chapter 4). The use of BLOSUM62 affords information when sequences homologous to the query sequence are not available. Each SVM ensemble contains three one-versus-rest SVM classifiers, using a constructed kernel that combines a normalized second-order kernel with an exponential function. Results are reported as high as 79.34%, depending on the dataset.

Summary of SVM Based Methods

SVM methods were made popular by Hua and Sun in 2001, and continue to be used in protein secondary structure prediction. The use of SVMs for prediction was due to earlier success with similar problems such as text and speech recognition, an improved ability to avoid overfitting, the ease of handling large, high-dimensional datasets, and the ability to discover a global rather than local minimum. The results for SVM-based methods and NN-based methods are comparable, and both method types rely on evolutionary information and known sequence structure to improve accuracy.
Combined or Meta Methods

Combinations of methods, or meta-methods, have been investigated in many cases. Combination methods either apply a jury-based decision algorithm to the outcomes collected from a set of high-accuracy known methods (Ward et al., 2003; Rost et al., 2002; Cuff and Barton, 1999), or combine a known method with an add-on method to increase accuracy. The method HYPROSP (Wu et al., 2004) uses a knowledge base that contains a set of protein fragments with known secondary structure. If the query protein has a measured match rate to one or more known fragments with more than 80% identity, the known secondary structure information from the fragment(s) is used. If it does not, PSIPRED, a well-known and accurate prediction method, is used. This add-on method reports accuracies in excess of PSIPRED whenever the match rate exceeds 80%. Similarly, GOR-V was combined with fragment database mining (Cheng et al., 2007) to increase overall accuracy. Results range from 67.5% to 93.2%, depending upon known-protein availability. Fragment database mining, like knowledge-base utilization, relies on the discovery of homologous proteins with known secondary structure. The query protein is then aligned with the discovered homologues to create a prediction. If no homologues are available, the GOR-V method is used.

Direct Homology Based Methods

All current machine-learning techniques rely on homological and alignment data for training and testing, generally in the form of PSI-BLAST (Altschul et al., 1997) profiles, and can therefore be referred to as indirect homology based methods. Recently, modern techniques have more directly utilized homology and evolutionary information by including template and fragment modeling in the prediction process (Pollastri, 2007; Montgomerie et al., 2006; Cheng, 2005). The method Porter_H (Pollastri, 2007) uses direct homology by combining a set of query-derived homologous templates from the Protein Data Bank (PDB) (Berman et al., 2000) with both the original query sequence and the corresponding PSI-BLAST profile to train a complex neural network ensemble. Similarly, PROTEUS (Montgomerie et al., 2006) uses direct homology modeling when homologues are available, and a jury of machine-learning expert techniques, including PSIPRED, JNET, and TRANSSEC, when homologues are not available.
Although direct homology based methods collect the same information used by pure machine-learning style indirect homology methods, namely PSI-BLAST alignments, indirect homology methods use this data in the form of log-likelihood measures to train machine-learning constructs, while direct homology methods use this information directly (often in the form of a template) as some portion of the prediction process.

For over 20 years, protein secondary structure prediction has incrementally improved through the advancement of alignment algorithms, the increased availability of known and homologous protein structures and databanks, and the maximal utilization of evolutionary information and machine-learning techniques. One of the main sources of recent prediction improvement has been PDB-derived structural information and its direct use in prediction (Pollastri, 2007).

Figure 3-1. Example of a linear sequence of amino acids, each accompanied by a secondary structure label of H, C, or E.
CHAPTER 4
MATERIALS AND METHODS

Introduction

This chapter describes all commonly used materials and methods in the area of protein secondary structure prediction. Topics include protein data, databanks, and datasets; multiple sequence alignment and PSI-BLAST; measures of protein identity, similarity, and homology; similarity matrices and BLOSUM62; input vector generation and the sliding window construct; measures of prediction accuracy; and machine-learning techniques, including support vector machines, neural networks, and information theory.

Protein Data and Databanks

Protein secondary structure prediction depends on protein data, access to protein databanks, and access to secondary structure information for known sequences. Proteins and their corresponding structures are slowly but continuously discovered through the use of exclusion chromatography, mass spectroscopy, and nuclear magnetic resonance spectroscopy (Moy et al., 2001). As proteins are discovered, they are entered into protein databanks such as the Research Collaboratory for Structural Bioinformatics (RCSB) Protein Data Bank (PDB) (Berman et al., 2000, www.pdb.org). The PDB (Figure 4-1), established in 1971, is a free and open worldwide repository for protein information, including 3D structure. The PDB website is located at http://www.rcsb.org/pdb/home/home.do. The study of protein secondary structure also requires specific and accurate information about the secondary structure of known proteins. The Database of Secondary Structure Assignments (DSSP) (Kabsch and Sander, 1983) can be accessed online, and contains the experimentally determined secondary structure for all proteins contained in the PDB.
The secondary structure of known proteins is determined by using the 3D coordinate information available in the PDB. This coordinate information describes the location of each amino acid atom as a 3D (x, y, z) coordinate, each relative to the central carbon atom (Cα). While the DSSP program cannot predict protein tertiary structure, given atom locations it will define secondary structure, geometrical features, and solvent exposure.

Datasets

Datasets are used to train and test secondary structure prediction methods, and to measure and determine their relative accuracy. To maintain fairness, to encourage meaningful accuracy comparison between algorithms, and to maintain disjointness between training and testing sets, several datasets have been published. Because even a single protein can affect the accuracy measure of a technique, fair comparison between techniques is best accomplished through the use of identical datasets. To consider this point further, several early published methods were later reevaluated using different datasets; upon reevaluation, each dropped considerably in accuracy, well below published claims (Kabsch and Sander, 1983). In most common use are the datasets rs126 (Rost and Sander, 1993), cb513 (Cuff and Barton, 1999), and the EVA Common Sets (http://cubic.bioc.columbia.edu/eva/doc/intro_sec.html). The dataset rs126 contains 126 proteins collected from the PDB, with the claim that no two proteins in the set share greater than 25% amino acid identity. When rs126 was constructed, a measure of homology was defined by Rost and Sander as two sequences sharing more than 25% sequence identity over a length of at least 80 amino acids. By 1999, Cuff and Barton (Cuff and Barton, 1999) added to the definition of homology by claiming that simple percentage identity is insufficient to determine homology (Brenner et al., 1996), and that SD scores are a better measure. An SD score is created by first aligning two proteins using a dynamic programming alignment method, such as the Needleman and Wunsch method (Needleman and Wunsch, 1970), and measuring an alignment score V.
Next, the order of the amino acids in each protein is randomized, and the alignment is performed and measured again, creating another alignment score. This process is repeated 100 times to create a sample average and sample standard deviation. Finally, the z-score, or SD, is calculated by taking the difference between the original alignment score and the sample mean, and dividing by the standard deviation. This SD score measure was used to create the dataset cb513, a set of 513 non-homologous proteins. The EVA Common Sets are part of the EVA web server located at http://cubic.bioc.columbia.edu/eva. EVA continuously and automatically analyzes protein structure prediction servers in real time and is managed by a large team of contributors (Koh et al., 2003). To compare prediction servers, EVA creates common protein sets that contain relatively new and non-homologous proteins (Rost, 1999). Further, and unlike rs126 and cb513, many proteins found in the EVA Common datasets do not have homologues that can be found in the PDB. For example, over 50% of the 211 proteins in EVA Common 6 do not match any known PDB homologues when run through PSI-BLAST, whereas rs126 offers PSI-BLAST homologues for all 126 proteins. Therefore, the use of an EVA Common dataset may offer a more challenging evaluation of a prediction method. The PDB25 dataset is a relatively new dataset of over 4200 mutually non-homologous proteins grouped together by ASTRAL (Brenner et al., 2000). The PDB25 dataset was designed so that no two proteins possess more than 25% sequence identity. DARWIN uses a new dataset of 800 test proteins, termed PDB25_800, randomly derived from PDB25. The absence or availability of homologous proteins can significantly affect the accuracy of protein prediction algorithms. The assertion that proteins with known homologues are easier to predict, because similar sequence begets similar structure, is well known (Pollastri et al., 2007). Therefore, a dataset with fewer known homologues might prove more challenging for a prediction server than would a dataset such as rs126, for which all proteins match several known homologues.
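The SD (z-score) computation described earlier in this section is compact to express in code. The sketch below takes any pairwise `alignment_score` function as a parameter (for instance, a Needleman-Wunsch scorer like the one sketched in the next section) and follows the 100-shuffle protocol described above.

```python
import random
import statistics

def sd_score(seq_a, seq_b, alignment_score, n_shuffles=100, seed=0):
    """Z-score (SD score) of an alignment against a shuffled background.

    alignment_score(a, b) -> float is any pairwise alignment scorer."""
    rng = random.Random(seed)
    original = alignment_score(seq_a, seq_b)
    background = []
    for _ in range(n_shuffles):
        a = list(seq_a)
        b = list(seq_b)
        rng.shuffle(a)              # randomize residue order in both proteins
        rng.shuffle(b)
        background.append(alignment_score("".join(a), "".join(b)))
    mean = statistics.mean(background)
    stdev = statistics.stdev(background)
    return (original - mean) / stdev
```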
Protein Identity, Similarity, and Homology

With respect to protein sequences, the terms identity, similarity, and homology are commonly, and often interchangeably, used. However, the exact definitions of these terms can be rather elusive and, in many cases, author-defined or implied. True homology between two proteins can only be determined if a complete ancestral history and a full exploration of intermediate proteins is completed (Koonin and Galperin, 2003). Therefore, the more common determination of homology is generally based on some defined or implied measure of both similarity and identity, with an underlying intimation of familial relation. The greater the sequence identity between two proteins, the lower the probability that the two proteins evolved from independent origins (Koonin and Galperin, 2003). However, even measures of similarity and identity depend on methods of sequence alignment. To determine the percentage identity between two proteins, the two proteins must first be aligned. The alignment can be based on maximizing a global measure of alignment (Needleman and Wunsch, 1970) or maximizing a local measure of alignment (Smith and Waterman, 1981). Both alignment methods are based on dynamic programming algorithms, and assign scores to insertions, deletions, and replacements (edit distance). By determining the least costly alignment, dynamic programming methods seek to minimize evolutionary distance, or maximize similarity. However, these algorithms quickly become intractable for multiple sequences.
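For illustration, a minimal global-alignment scorer in the Needleman-Wunsch style is sketched below. The uniform match, mismatch, and gap scores stand in for a real similarity matrix such as BLOSUM62, and only the optimal score, not the alignment itself, is returned.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score via dynamic programming (Needleman-Wunsch)."""
    n, m = len(a), len(b)
    # dp[i][j] = best score aligning a[:i] with b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap            # a[:i] aligned against all gaps
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = dp[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            dp[i][j] = max(diag,
                           dp[i-1][j] + gap,   # gap in sequence b
                           dp[i][j-1] + gap)   # gap in sequence a
    return dp[n][m]

print(needleman_wunsch_score("HEAGAWGHEE", "PAWHEAE"))
```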
The notion of similarity stems from the idea that certain amino acids are more similar because they are more likely to replace each other in nature without disrupting the structure of the protein. By clustering and aligning similar protein subsequences, common amino acid substitutions (replacements) can be discovered. This idea led to the creation of similarity matrices such as BLOSUM and PAM (discussed below). Alignment algorithms can then use a measure of similarity, such as BLOSUM62 (the default in PSI-BLAST), to calculate the similarity between two proteins. The process of measuring protein similarity involves aligning proteins and then counting both the identical and the similar amino acids that occur. Protein disjointness, often loosely termed protein non-homology, is therefore measured by aligning a query protein to a subject protein and using a combination of protein identity and similarity to determine whether the two proteins can be considered dissimilar. Disjointness is principally important when training and testing machine-learning based algorithms, as well as with any method that uses cross-validation techniques. It is well known that reported accuracies are inflated and unreliable when machine-learning methods are trained and tested on datasets that contain overlapping information. With respect to proteins, overlapping information, or non-disjointness, is often termed homology. Therefore, the goal of dataset creation is to generate a dataset of mutually non-homologous proteins. The definition of homology has evolved over the years and remains ambiguous. However, it is commonly accepted that proteins with less than 25% post-alignment sequence identity can be considered disjoint for purposes of machine-learning training and testing. As methods for sequence alignment generally utilize a measure of similarity to evaluate amino acid exchanges, several matrices containing similarity scores for all possible amino acid exchanges have been created. Amino acids that are more likely to be exchanged in nature have higher similarity scores. Of these similarity matrices, the two most commonly used are the PAM (Percentage of Acceptable Mutations per 10^8 years) matrices (Dayhoff et al., 1978), which use global alignments of related proteins to measure exchange likelihood, and the BLOSUM (Blocks Substitution Matrix) matrices (Henikoff and Henikoff, 1992), which use local alignments of similar proteins to measure exchange likelihood.
The key to similarity is based on the theory of evolution, and the notion of distance between two amino acids is based on their likelihood of replacing each other over evolutionary time. Both the PAM and BLOSUM matrices use a measure of log odds (Equation 4-1),

$S_{ij} = \log \frac{q_{ij}}{p_i p_j}$   (4-1)

where $S_{ij}$ is the log-odds ratio, i and j are two amino acids, $q_{ij}$ is the frequency with which amino acids i and j are observed to align in related sequences, and $p_i$ and $p_j$ are the frequencies of occurrence of amino acids i and j in the set of sequences. To create the PAM similarity matrices, mutation probabilities are created that measure the chance of amino acid i mutating to amino acid j over a particular number of years. Therefore, the PAM matrices are specific to evolutionary distance. PAM matrices are developed using global sequence alignment and evaluation. BLOSUM similarity matrices are based on locally aligned, gap-free subsequences, or blocks. BLOSUM62 (Figure 4-2), the default matrix used in both BLAST and PSI-BLAST, is created from a comparison of subsequences with at least 62% amino acid identity. In the same way, BLOSUM80 is made from clusters of 80% sequence identity.
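Equation 4-1 can be demonstrated numerically. The sketch below derives log-odds scores from a toy table of aligned-pair counts; the counts and the way background frequencies are tallied are illustrative assumptions, not the actual PAM or BLOSUM construction pipelines.

```python
import math

# Toy aligned-pair counts from hypothetical alignment blocks (not real BLOSUM data).
pair_counts = {("A", "A"): 60, ("A", "S"): 25, ("S", "S"): 10, ("A", "W"): 5}

total = sum(pair_counts.values())
q = {pair: n / total for pair, n in pair_counts.items()}   # q_ij of Equation 4-1

# Background frequencies p_i: each aligned pair contributes one residue to i and one to j.
p = {}
for (i, j), freq in q.items():
    p[i] = p.get(i, 0.0) + freq / 2
    p[j] = p.get(j, 0.0) + freq / 2

def log_odds(i, j):
    """S_ij = log(q_ij / (p_i * p_j)), following Equation 4-1."""
    return math.log(q[(i, j)] / (p[i] * p[j]))

for i, j in pair_counts:
    print(f"S({i},{j}) = {log_odds(i, j):+.2f}")
```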
Multiple Sequence Alignment and PSI-BLAST

There are several single and multiple sequence alignment heuristic-style applications in common use, including PSI-BLAST, CLUSTALW (Thompson et al., 1994), and T-COFFEE (Notredame et al., 2000). The use of multiple sequence alignment (MSA) is ubiquitous in the prediction of protein secondary structure and can be found in the majority of prediction methods (Rost and Sander, 1993; Cuff and Barton, 1999; Hua and Sun, 2001; Pollastri et al., 2002; Kloczkowski et al., 2002; Karypis, 2005; Montgomerie et al., 2006). Perhaps the most commonly employed sequence alignment algorithm for protein secondary structure prediction, used in the vast majority of prediction techniques after 1993 (Pollastri and McLysaght, 2004; Bondugula and Xu, 2007; Przybylski and Rost, 2001; Cuff and Barton, 1999; Jones, 1999; Ward et al., 2003; Kim and Park, 2003; Hu et al., 2005; Wang et al., 2004), as well as in all methods and experiments discussed in this paper, is PSI-BLAST.

PSI-BLAST (Position-Specific Iterative BLAST) is a derivative of BLAST (Basic Local Alignment Search Tool) (Altschul et al., 1990) whose main purpose is to rapidly align proteins or protein segments in order to directly approximate a measure of local similarity. While there have been several early approaches to measuring sequence similarity through sequence alignment and corresponding edit-distance measurement (Needleman and Wunsch, 1970; Sankoff and Kruskal, 1983), these global alignment methods, due to their dynamic programming algorithms, near intractability when expanded to the analysis of multiple proteins (Gotoh, 1982). The next generation of alignment measures employed heuristics, used local alignment comparison, and implicitly defined a measure of similarity (Smith and Waterman, 1981; Wilbur and Lipman, 1973; Pearson and Lipman, 1988). Following this notion, BLAST detects biologically significant sequence similarities by directly approximating the dynamic programming results for an explicitly defined mutation-score measure. Given a query protein, PSI-BLAST will produce both a set of discovered similar proteins and a profile matrix, or position-specific scoring matrix (PSSM), containing the log likelihood of each of the 20 amino acids occurring at each query amino acid location with respect to the discovered and pseudo-aligned homologous sequences. PSI-BLAST generates alignments between a query protein and a subject protein (Figure 4-3). Using the discovered subject proteins and alignments, PSI-BLAST generates a position-specific scoring matrix (PSSM), or profile, that contains the log likelihood of the occurrence of each of the 20 amino acids as determined through the discovered PSI-BLAST alignments (Figure 4-4).
While PSI-BLAST offers many user-controlled parameters, this paper uses the statistical significance threshold option E, the inclusion option h, and the similarity matrix M. The statistical significance threshold affects the proteins found by the first BLAST run. The default PSI-BLAST E value is 10, such that 10 matches are expected to be found merely by chance, according to the stochastic model of Karlin and Altschul (1990). However, the inclusion value determines which of the discovered proteins are included in the production of the PSSM. In this paper, E = h for all experiments. The similarity matrix affects the measure of similarity PSI-BLAST uses when comparing and aligning proteins. This paper uses BLOSUM62 (Figure 4-2). All other parameters are PSI-BLAST defaults unless otherwise noted.

Basic Local Alignment Search Tool (BLAST) Algorithm

The discovery and utilization of homology, identity, and similarity measures among proteins is fundamental to the vast majority of protein evaluation and prediction procedures. While dynamic programming algorithms (Needleman and Wunsch, 1970; Waterman, 1984) can successfully align, and therefore compare, two protein sequences, they quickly become impractical when multiple sequences need to be aligned. Therefore, many heuristics have been applied that offer a measure of similarity but do not explicitly define similarity as the cost of a set of mutations (Lipman and Pearson, 1985, 1988). The BLAST algorithm (Altschul et al., 1990) uses a measure based on well-defined mutation scores and directly approximates the results that would be obtained by a dynamic programming algorithm. BLAST also offers both speed and the option to detect weak but biologically significant sequence similarities. The BLAST algorithm employs a basic three-step process that finds maximal segment pairs (MSP) and uses a local rather than global measure of similarity.
Global similarity algorithms attempt to optimize the overall alignment between two sequences, and can therefore result in large stretches of low identity and similarity. Local alignment algorithms seek to discover conserved regions, or subsequences of high identity or similarity, so that a single sequence may show several conserved subsequence locations that are then combined to create an overall alignment. The local similarity alignment method is preferred, as it is surmised that unconserved subsequence regions do not contribute to a measure of similarity in an evolutionary sense (Smith and Waterman, 1981; Goad and Kanehisa, 1982). When comparing subsequences, a measure of similarity is required; therefore, a similarity matrix such as PAM or BLOSUM (Figure 4-2) is employed. Define a sequence segment as a contiguous stretch of amino acids of any length, and define a similarity score as the sum of the similarity values, as noted in the selected similarity matrix, for any pair of aligned residues. Given a query sequence, BLAST searches for a maximal segment pair (MSP), defined as the highest-scoring pair of identical-length segments chosen from two sequences. As the boundaries of the MSP are selected to maximize the similarity score, an MSP can have any length. The MSP score provides a measure of local similarity and is calculated by BLAST using the following heuristic-based steps.

BLAST: Step 1

In the first stage of the BLAST heuristic, given a query sequence and a database of known subject sequences, a list of high-scoring words is compiled from the query sequence (Figure 4-5). A word is a small, fixed-length subsequence; BLAST uses words of size 3 amino acids for protein analysis. Starting from the beginning of the query sequence and shifting one amino acid at a time, a set of words is generated. Using the selected similarity matrix, each word can be associated with a score, called T. Words with score T above a given threshold are retained, as they are more likely to occur and will have a greater chance of discovering matches.
Next, for each word in the list scoring at least T, all similar words that also score at least T, as evaluated by the similarity matrix, are added to the list. Step 1 ends with a set of all words derived from the query sequence, plus their similar words, each having a similarity score T greater than a given threshold.

BLAST: Step 2

In the second step of BLAST, the database of subject sequences is scanned using each word in the list. The scanning is implemented using a deterministic finite automaton (Hopcroft and Ullman, 1979); implementation details were not included in the publication. For each word in the list, if a subject sequence is located that also contains that word, step 3 is initiated.

BLAST: Step 3

The final step of the BLAST algorithm is the extension stage. Once a word match is discovered in both the query list of words and a subject protein, the two full proteins are aligned at the location of the word match and then extended in both directions to determine whether a local alignment can be created with a score above a threshold S. A BLAST search will discover and combine all significant local alignments to create a final alignment, and will do so for all similar subject proteins in a given database.
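The word-generation stage (step 1) is straightforward to sketch. The code below compiles the 3-residue word list for a query and expands it with all neighborhood words scoring at least T; the stand-in scoring function and the threshold value are illustrative assumptions in place of BLOSUM62.

```python
from itertools import product

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
WORD = 3  # BLAST uses 3-residue words for proteins

# Stand-in similarity function: a real implementation would look up BLOSUM62.
def sim(a, b):
    return 4 if a == b else -1

def word_score(w1, w2):
    """Score of aligning two equal-length words, position by position."""
    return sum(sim(a, b) for a, b in zip(w1, w2))

def high_scoring_words(query, threshold=7):
    """BLAST step 1: query words plus all neighborhood words scoring >= T."""
    words = set()
    for i in range(len(query) - WORD + 1):
        qword = query[i:i + WORD]
        for cand in map("".join, product(ALPHABET, repeat=WORD)):
            if word_score(qword, cand) >= threshold:
                words.add(cand)
    return words

hits = high_scoring_words("MKVLHE")
print(len(hits), sorted(hits)[:5])
```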
Position-Specific Iterative BLAST (PSI-BLAST) Algorithm

PSI-BLAST (Altschul et al., 1997) is an extension of the BLAST algorithm that takes all statistically significant alignments discovered in a BLAST search and produces a position-specific scoring matrix (PSSM) (Figure 4-4) that can be used to re-search the database for more distantly related proteins. PSI-BLAST is iterative, allowing the PSSM to be updated upon each search and then used again to re-search the database.

Creating the PSSM

Database searches that use PSSMs, or profiles, can often detect weak but significant protein relationships. Given a set of query-aligned proteins generated from an initial BLAST run, all alignments within a given threshold (E) are collected. The query sequence is used as a template, and a multiple alignment of all other discovered proteins, with respect to the query protein, is created. Discovered subject proteins that are identical to the query are purged, and those with greater than 98% identity are consolidated into one copy. Alignments involve gaps in both the query protein and the subject proteins, with internal and end gaps added as needed to create an even alignment. No attempt is made to render a true multiple sequence alignment by comparing subject proteins to each other; subject proteins are aligned only to the query template. The final alignment set of the query protein and associated subject proteins is then pruned so that all rows and columns contain either a residue or an internal gap character. The pruning of each column involves the removal of sequences that do not contribute to that column; only the sequences that contribute a residue or gap are included in each column. Empty spaces are removed, leaving a multiple alignment with every row and column containing either an amino acid residue or a gap character. Note that column lengths will differ. Once the reduced alignment is prepared, a scoring matrix, the PSSM, is generated. As closely related sequences carry little more information than just one such sequence, sequence weighting is used to maximize information: smaller weights are assigned to sequences with many close relatives, and all frequency measures are based on weighted counts. Further, in the construction of PSSM scores, many factors must be considered, such as the number of independent observations per column, the number of different amino acids per column, and prior information about amino acid relationships.
Given a multiple alignment, each column can be evaluated, and a score can be generated for each of the 20 amino acids using the basic log-odds formula $\log(Q_i / P_i)$, where $Q_i$ is the estimated probability of residue i being found in that column, and $P_i$ is the overall probability of amino acid i occurring. $Q_i$ is estimated using the data-dependent pseudocount method (Tatusov et al., 1994), which employs prior knowledge of amino acid relationships as contained in a selected substitution matrix $s_{ij}$. Because the values in the PSSM must directly correspond to the values in the selected similarity matrix (usually BLOSUM or PAM), a constant u is used for normalization. Therefore, the PSSM score for amino acid i in a given column is determined using Equation 4-2:

$\text{PSSM score}_i = \log(Q_i / P_i) / u$   (4-2)

The similarity matrix scores are determined using the $s_{ij}$ formula (Equation 4-3):

$s_{ij} = \log\left(q_{ij} / (P_i P_j)\right) / u$   (4-3)

Next, to estimate $Q_i$, a pseudocount value $g_i$ is calculated; it incorporates the prior amino acid relationship information as well as the observed amino acid frequencies $f_i$. Thus, specifically for each column, pseudocount frequencies are constructed using the $g_i$ formula (Equation 4-4),

$g_i = \sum_j \frac{f_j}{P_j} \, q_{ij}$   (4-4)

where the $q_{ij}$ are the target frequencies given by the $q_{ij}$ formula (Equation 4-5):

$q_{ij} = P_i P_j e^{u s_{ij}}$   (4-5)

Then $Q_i$ is estimated (Equation 4-6) using the relative weights $\alpha$ and $\beta$ given to observed and pseudocount residue frequencies:

$Q_i = \frac{\alpha f_i + \beta g_i}{\alpha + \beta}$   (4-6)

It is important to note that the constructed scores reduce to $s_{ij}$, the similarity matrix scores, in columns where nothing has been aligned to the query sequence. To ensure this reduction, PSI-BLAST assigns $\alpha = N_C - 1$, where $N_C$ is the relative number of independent observations in a given column of the alignment set. The constant $\beta$ is left as an alterable parameter, such that the greater its value, the more emphasis is given to prior knowledge of residue relationships. The default value of $\beta$ is 10.
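A minimal numerical sketch of Equations 4-2 through 4-6 for a single alignment column is shown below. The three-letter alphabet, background frequencies, target frequencies, and the value of u are toy assumptions chosen to be self-consistent; a faithful implementation would also apply PSI-BLAST's sequence weighting.

```python
import math

AA = "ARN"  # a tiny three-letter alphabet purely for illustration
P = {"A": 0.5, "R": 0.3, "N": 0.2}  # toy background frequencies P_i

# Toy target frequencies q_ij: a joint distribution whose marginals equal P.
q = {("A","A"): 0.40, ("A","R"): 0.07, ("A","N"): 0.03,
     ("R","A"): 0.07, ("R","R"): 0.20, ("R","N"): 0.03,
     ("N","A"): 0.03, ("N","R"): 0.03, ("N","N"): 0.14}

u = 0.5  # normalizing constant tying PSSM scores to the similarity matrix scale

# Substitution scores consistent with q and P (Equations 4-3 and 4-5).
s = {(i, j): math.log(q[(i, j)] / (P[i] * P[j])) / u for i in AA for j in AA}

def column_scores(f, n_obs, beta=10.0):
    """PSSM scores for one alignment column (Equations 4-2, 4-4, 4-6).

    f: observed (weighted) residue frequencies in the column.
    n_obs: relative number of independent observations N_C; alpha = N_C - 1."""
    alpha = n_obs - 1
    scores = {}
    for i in AA:
        g_i = sum((f[j] / P[j]) * q[(i, j)] for j in AA)    # pseudocounts (Eq. 4-4)
        Q_i = (alpha * f[i] + beta * g_i) / (alpha + beta)  # blended estimate (Eq. 4-6)
        scores[i] = math.log(Q_i / P[i]) / u                # PSSM score (Eq. 4-2)
    return scores

# A column dominated by A, with five independent observations.
print(column_scores({"A": 0.8, "R": 0.1, "N": 0.1}, n_obs=5))
```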
Summary of PSI-BLAST

PSI-BLAST starts by running a BLAST database search on a given query protein and collecting all significant alignments. Next, a pseudo multiple sequence alignment with gaps is generated, with the query protein as the template. A PSSM is then formed, which defaults to the selected similarity matrix when no alignment information is discovered. The PSSM can be used to iteratively re-search the database for further, more distantly related proteins. PSI-BLAST offers both a pseudo multiple sequence alignment and a PSSM, or profile, as output.

Input Vectors and Sliding Windows

To date, the vast majority of machine-learning secondary structure prediction methods make use of the sliding window construct in combination with PSI-BLAST PSSM information to create input vectors (Qian and Sejnowski, 1988; Rost and Sander, 1993; Cuff and Barton, 1999; Jones, 1999; Pan et al., 1999; Ward et al., 2003; Kim and Park, 2003; Hu et al., 2005; Wang et al., 2004; Pollastri and McLysaght, 2004; Bondugula and Xu, 2007). To represent each amino acid in a protein sequence with a feature vector (Figure 4-6), a finite window of size w encircles a neighborhood of w amino acids of the query protein, with the represented amino acid as the central amino acid. This same window of w amino acids is also referenced in the corresponding PSI-BLAST PSSM to generate a unique matrix of size w × 20, which is subsequently flattened to an input vector of size 1 × (w × 20). Each time the window slides one position, the next central amino acid becomes represented, and a new input vector is generated.
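The sketch below extracts such windowed PSSM vectors, zero-padding window rows that fall off either end of the protein; the zero-padding convention is an assumption, as published methods differ in how terminal positions are encoded.

```python
import numpy as np

def window_vectors(pssm, w=15):
    """Flattened w x 20 input vectors, one per residue, from an L x 20 PSSM.

    Rows of the window that fall off the ends of the protein are zero-padded."""
    length, n_cols = pssm.shape
    half = w // 2
    padded = np.vstack([np.zeros((half, n_cols)),   # pad before the N-terminus
                        pssm,
                        np.zeros((half, n_cols))])  # pad after the C-terminus
    return np.stack([padded[i:i + w].ravel() for i in range(length)])

pssm = np.random.randn(70, 20)        # stand-in for a real PSI-BLAST profile
X = window_vectors(pssm, w=15)
print(X.shape)                        # (70, 300): one 15*20 vector per residue
```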
Accuracy Measures

To determine the accuracy of a given prediction method, the following measures are commonly used. Let $A_{ij}$ represent the number of amino acids predicted to be in state j and observed to be in state i. Let $N_r$ represent the total number of amino acids in the dataset, and let $N_p$ represent the total number of proteins in the dataset (Rost and Sander, 1993; Koh et al., 2003).

The measure known as $Q_3$ reports the fraction of correctly predicted amino acids in a given dataset. The formula for $Q_3$ (Equation 4-7) is used throughout the protein structure prediction literature:

$Q_3 = \frac{\sum_{i \in \{H,E,C\}} A_{ii}}{N_r}$   (4-7)

The measure $Q_3^{pp}$ is very similar to $Q_3$, but rather than offering a per-amino-acid accuracy, it offers a per-protein accuracy. This difference is significant for several reasons. Measuring a per-amino-acid accuracy for a given set of proteins offers a more balanced, amino-acid based measure that is not affected by protein size and is not significantly decreased by a single poorly predicted protein. Alternatively, $Q_3^{pp}$ (Equation 4-8) offers a per-protein measure that is affected by protein size as well as by outliers:

$Q_3^{pp(avg)} = \frac{1}{N_p} \sum_{p=1}^{N_p} Q_3^{p}, \quad \text{where } Q_3^{p} \text{ is the per-protein } Q_3$   (4-8)

The measure $Q_i^{obs}$ (Equation 4-9) is the total number of amino acids both observed and predicted to be in the same state, divided by the total number of amino acids observed to be in that state. Similarly, $Q_i^{obs,pp}$ (Equation 4-10) is a per-protein average of the same notion:

$Q_i^{obs} = \frac{A_{ii}}{\sum_j A_{ij}}, \quad i \in \{H, E, C\}$   (4-9)

$Q_i^{obs,pp} = \frac{\sum_p Q_i^{obs,p}}{N_p}, \quad \text{where } Q_i^{obs,p} \text{ is per protein}$   (4-10)
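These measures are straightforward to compute; the sketch below implements Q3, per-protein Q3, and the per-state observed measure for predictions given as strings over {H, E, C}, reporting percentages by convention.

```python
def q3(observed, predicted):
    """Per-residue Q3 (%): correctly predicted residues over all residues."""
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)

def q3_per_protein(pairs):
    """Average of per-protein Q3 over a dataset of (observed, predicted) pairs."""
    return sum(q3(o, p) for o, p in pairs) / len(pairs)

def q_obs(observed, predicted, state):
    """Q_i_obs (%): correct residues in 'state' over residues observed in 'state'."""
    observed_in_state = sum(o == state for o in observed)
    correct = sum(o == p == state for o, p in zip(observed, predicted))
    return 100.0 * correct / observed_in_state if observed_in_state else 0.0

obs, pred = "HHHHCCEEEC", "HHHCCCEEEE"
print(q3(obs, pred), q_obs(obs, pred, "H"))  # 80.0 75.0
```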
The measure $Q_i^{prd}$ (Equation 4-11) is the total number of amino acids both observed and predicted to be in the same state, divided by the total number of amino acids predicted to be in that state. Similarly, $Q_i^{prd,pp}$ (Equation 4-12) is a per-protein average of the same notion:

$Q_i^{prd} = \frac{A_{ii}}{\sum_j A_{ji}}, \quad i \in \{H, E, C\}$   (4-11)

$Q_i^{prd,pp(avg)} = \frac{\sum_p Q_i^{prd,p}}{N_p}, \quad \text{where } Q_i^{prd,p} \text{ is per protein}$   (4-12)

The errsig (Equation 4-13) is the standard deviation divided by the square root of the number of proteins; it is used only in cases of per-protein accuracy measures:

$\text{errsig}(x) = \frac{\text{stdev}(x)}{\sqrt{N_x}}$   (4-13)

As these measures are included in the vast majority of publications, they are used to make direct comparisons between prediction servers when identical datasets are tested.

Machine Learning Techniques

Support Vector Machines

Support vector machines (SVM) (Vapnik, 1995; Joachims, 1998, 2002) have been used in a variety of applications since their inception, including handwriting recognition (Cortes and Vapnik, 1995; Scholkopf et al., 1999), object recognition (Blanz et al., 1996), speaker identification (Schmidt and Grish, 1996), and text categorization (Joachims, 1998). In most of these cases, SVM generalization performance, including error rates on test sets, either matches or exceeds that of competing machine-learning methods (Burges, 1998). In general, an SVM, through the use of training examples, learns to differentiate between two separable classes defined by multidimensional data points, or vectors. SVMs are binary classifiers that use a separating hyperplane (decision boundary) to distinguish between two given classes. Because the decision boundary should be as far from both classes as possible, one of the goals of the SVM is to maximize the margin between the two classes (Figure 4-7).
Given a separating hyperplane (Equation 4-14) and two parallel hyperplanes (Equations 4-15 and 4-16), the margin m can be defined as $m = 2/\|w\|$:

$w^T x + b = 0$   (4-14)

$w^T x + b = 1$   (4-15)

$w^T x + b = -1$   (4-16)

Next, let $\{x_1, \ldots, x_n\}$ represent a training dataset, and let $t_i \in \{-1, +1\}$ represent the class label of each vector in the training dataset. Then the requirement that the decision boundary classify all points correctly can be written as $t_i(w^T x_i + b) \ge 1$ for all vectors $x_i$ (Equation 4-18). The decision boundary can be found by solving the following optimization problem (Equations 4-17 and 4-18):

$\text{minimize } \frac{1}{2}\|w\|^2$   (4-17)

$\text{subject to } t_i(w^T x_i + b) \ge 1 \text{ for all } x_i$   (4-18)

To extend this idea to a nonlinear decision boundary, the $x_i$ values, or training input vectors, can be transformed to a fixed higher-dimensional or infinite-dimensional space, known as a feature space, via $\phi(x_i)$. The goal of the transformation is to attain a nonlinear decision boundary in the current space by utilizing a linear decision boundary in the feature space. Using a feature-space transformation creates a two-class classification problem (Equation 4-19), where b is the bias parameter:

$y(x) = w^T \phi(x) + b$   (4-19)

Because the perpendicular distance of a point to the hyperplane defined by $y(x) = 0$ is given by $|y(x)| / \|w\|$, and because only correctly classified data points are of interest, such that $t_n y(x_n) > 0$ for all n, the distance between $x_n$ and the decision boundary is given by Equation 4-20:

$\frac{t_n y(x_n)}{\|w\|} = \frac{t_n (w^T \phi(x_n) + b)}{\|w\|}$   (4-20)
Therefore, the maximum margin can be found by optimizing over the parameters w and b (Equation 4-21):

$\arg\max_{w,b} \left\{ \frac{1}{\|w\|} \min_n \left[ t_n (w^T \phi(x_n) + b) \right] \right\}$   (4-21)

To ease calculation, this representation can be converted by rescaling both w and b by the same constant, thereby leaving the ratio $t_n y(x_n)/\|w\|$ unchanged. Using this rescaling, the point closest to the decision plane can be represented by Equation 4-22:

$t_n (w^T \phi(x_n) + b) = 1$   (4-22)

In this case, all data points satisfy Equation 4-23:

$t_n (w^T \phi(x_n) + b) \ge 1 \text{ for all } n$   (4-23)

This is known as the canonical representation of the decision hyperplane (Bishop, 2006). Therefore, to maximize $\|w\|^{-1}$, the equivalent quadratic programming (QP) problem (Equations 4-24 and 4-25) is solved:

$\arg\min_{w,b} \frac{1}{2}\|w\|^2$   (4-24)

$\text{subject to } t_n (w^T \phi(x_n) + b) \ge 1 \text{ for all } n$   (4-25)

Using the Lagrange multipliers $a_n \ge 0$, the constrained QP problem (Equations 4-24 and 4-25) can be written as Equation 4-26:

$L(w, b, a) = \frac{1}{2}\|w\|^2 - \sum_{n=1}^{N} a_n \left\{ t_n (w^T \phi(x_n) + b) - 1 \right\}$   (4-26)

By setting the partial derivatives with respect to w and b, respectively, to zero and solving, two conditions emerge (Equations 4-27 and 4-28):

$w = \sum_{n=1}^{N} a_n t_n \phi(x_n)$   (4-27)

$\sum_{n=1}^{N} a_n t_n = 0$   (4-28)
Using both conditions (Equations 4-27 and 4-28), the dual representation of the maximum-margin problem can be formulated as Equation 4-29, subject to Equation 4-30:

$\text{maximize } \tilde{L}(a) = \sum_{n=1}^{N} a_n - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} a_n a_m t_n t_m \, \phi(x_n)^T \phi(x_m)$   (4-29)

$\text{subject to } a_n \ge 0, \; n = 1, \ldots, N, \;\; \text{and} \;\; \sum_{n=1}^{N} a_n t_n = 0$   (4-30)

While computation in higher- or infinite-dimensional space can be limiting, a kernel function can be used to represent the inner product of the feature vectors in a selected higher-dimensional space. This representation is often referred to as the kernel trick. Because the QP optimization problem contains only the inner product of the feature vectors, namely $\phi(x_n)^T \phi(x_m)$, a kernel need only define the inner product of the feature space as it affects the inputs, namely $K(x_n, x_m) = \phi(x_n)^T \phi(x_m)$. The explicit transformation of the feature vectors into the higher-dimensional feature space is not necessary, nor is the specification of the function $\phi$. There are several well-known and commonly used kernels, including the polynomial kernel, the radial basis kernel, and the sigmoidal kernel.

Each SVM used in the application DARWIN (discussed in Chapter 5) is a soft-margin binary classifier, and seeks to satisfy the following quadratic optimization problem (Equations 4-31 and 4-32), where $x_i$ is an input vector, $t_i$ is the corresponding class label, $\phi$ is the kernel mapping, w is the vector perpendicular to the decision boundary (separating hyperplane), $\xi_i$ is the slack variable, b is the offset of the hyperplane, and C is the tradeoff parameter between the error in predicting a class and the margin between the classes:

$\min_{w, b, \xi} \; \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i$   (4-31)

$\text{subject to } t_i (w^T \phi(x_i) + b) \ge 1 - \xi_i, \quad \xi_i \ge 0$   (4-32)
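For illustration, the sketch below trains a soft-margin SVM with a radial basis kernel on synthetic two-class data using scikit-learn (an assumed dependency; the C and gamma settings are arbitrary). The same classifier, replicated one-versus-rest, is the usual route to the three-state ensembles discussed in the next section.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-ins for windowed PSSM feature vectors of two classes.
X = np.vstack([rng.normal(-1.0, 1.0, size=(100, 20)),
               rng.normal(+1.0, 1.0, size=(100, 20))])
t = np.array([-1] * 100 + [+1] * 100)

# Soft-margin SVM (Equations 4-31 and 4-32) with a radial basis kernel.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, t)

print(clf.predict(rng.normal(-1.0, 1.0, size=(1, 20))))  # likely [-1]
print(len(clf.support_))  # number of support vectors retained
```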
Using SVMs in Secondary Structure Prediction

Because SVMs are binary classifiers, an ensemble of SVMs is needed to handle the three-state protein secondary structure prediction problem. To date, most methods use between three and six separate SVMs, followed by a jury-style result selection (Kim and Park, 2003; Karypis, 2006). In general, each SVM might determine whether a given amino acid is in a certain secondary state, such as helix (H), or not (~H). This type of SVM is notated H/~H and is referred to as a one-versus-rest classifier. Each SVM is trained on a set of input vectors, where each input vector (described above) represents an amino acid and its secondary state.

Neural Networks

Artificial neural networks (NN) have been used in many areas, including speech synthesis, medicine, finance, vision, and many other problems that can be categorized as pattern recognition problems. The first neural network model was developed in 1943 (McCulloch and Pitts, 1943) and was followed by the perceptron model (Rosenblatt, 1962). The notion of a neural network is loosely based on brain neurons, where each artificial neuron accepts a finite number of inputs and creates a single output. A single-unit NN accepts a vector of n real numbers, $X = \{x_1, x_2, \ldots, x_n\}$, where each element of the input vector has an associated weight $w_i$ that describes its strength with respect to the final decision. The set of weights is often described as a weight vector $W = \{w_1, w_2, \ldots, w_n\}$. The single unit evaluates the summation in Equation 4-33 and uses the value S to reach an output:

$S = \sum_{i=1}^{n} x_i w_i$   (4-33)

A multiple-unit neural network is composed of a set of single units with weighted, unidirectional connections between them, such that the output of one unit can be used as the input to another. Multiple-unit NNs can possess a variety of network architectures, including simple linear networks, layered networks, and so on.
Networks are trained using input feature vectors with known classifications. The individual weights associated with each vector element are altered until the NN produces the expected classification output for a given known input. For multilayered networks, the back-propagation training algorithm (Rumelhart et al., 1988) is commonly used: the output vector is compared to the expected output; the error, if any, is calculated using the delta rule (McClelland et al., 1986) and is then back-propagated through the network. Weights are adjusted to minimize the difference between the generated and expected outputs. The delta rule changes the weight vector in such a way as to minimize the error, or the difference between the expected and actual outputs. The delta rule is defined in Equation 4-34, where r is the learning rate, $t_j$ is the target or expected output, and $y_j$ is the actual output:

$\Delta w_{ij} = r x_i (t_j - y_j)$   (4-34)

Single and multilayered NNs are used in many protein secondary structure prediction methods, with the following generalized scheme. The first layer of the network takes as input a vector of values that represents a specific amino acid in a given protein sequence. The representation of an amino acid is very often based on a finite window of values contained in PSI-BLAST generated PSSM profiles, as described above. The first layer then returns a state prediction for the amino acid represented by the input vector. This layer is often called the sequence-to-structure network. A second layer, the structure-to-structure network, can filter the outputs from the first layer and produce a refined final result. Additional layers, and the incorporation of other methods such as internal hidden Markov models (Lin et al., 2005), can complement the process.
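The delta rule of Equation 4-34 can be demonstrated on a single unit. The sketch below trains one linear threshold unit on a toy two-class problem; the learning rate, epoch count, and synthetic data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy two-class data: 2-dimensional points labeled 0 or 1.
X = np.vstack([rng.normal(-1, 0.5, (50, 2)), rng.normal(+1, 0.5, (50, 2))])
T = np.array([0] * 50 + [1] * 50)

w = np.zeros(2)
bias = 0.0
r = 0.1  # learning rate

for epoch in range(20):
    for x, t in zip(X, T):
        y = 1.0 if x @ w + bias > 0 else 0.0   # unit output from S = sum(x_i * w_i)
        w += r * x * (t - y)                   # delta rule: dw_i = r * x_i * (t - y)
        bias += r * (t - y)

preds = (X @ w + bias > 0).astype(float)
print("training accuracy:", (preds == T).mean())
```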
Information Theory and Prediction

Information theory dates back to the mathematical theory of communication (Shannon, 1948) and has found application in a variety of areas, including modeling (Burnham and Anderson, 2002), neurobiology (Rieke et al., 1997), physics (Jaynes, 1957), and data analysis (Morvai and Weiss, 2008). The basic information formula is given in Equation 4-35, where S describes the secondary structure state, such as H, E, or C; R is one of the 20 possible amino acids; and P(S) is the probability that state S will occur:

$I(S; R) = \log \left[ P(S \mid R) / P(S) \right]$   (4-35)

Because it is commonly accepted that the state of a given amino acid is affected by both the amino acid type and the types of neighboring residues (Kloczkowski et al., 2002), a sliding window of amino acids is often used in each evaluation. Because a window of amino acids is evaluated, the information function must be decomposed; that is, the information function of a complex event can be reduced to the sum of the information of simpler events. Equation 4-36 displays the decomposition:

$I(S; R_1, R_2, \ldots, R_n) = I(S; R_1) + I(S; R_2 \mid R_1) + I(S; R_3 \mid R_1, R_2) + \cdots + I(S; R_n \mid R_1, \ldots, R_{n-1})$   (4-36)

The information difference is defined in Equation 4-37, where nS denotes all states other than S:

$I(\Delta S; R_1, \ldots, R_n) = I(S; R_1, \ldots, R_n) - I(nS; R_1, \ldots, R_n)$   (4-37)

The GOR methods (Sen et al., 2005) use information theory in prediction.
Figure 4-1. The Protein Data Bank (PDB) website, a repository for known protein structures and related protein information.
Figure 4-2. The BLOSUM62 matrix, a similarity matrix derived from small local blocks of aligned sequences that share at least 62% identity. For each alignment block, the log likelihood of the occurrence of each amino acid is calculated to derive an estimate of the similarity between amino acids; these similarity values are contained in the matrix.

Figure 4-3. Example of a PSI-BLAST generated alignment between a query protein and a subject protein. A normal PSI-BLAST output contains a set of discovered subjects (if any) that align to the query protein within a threshold limit. Each query-subject alignment is then displayed individually, as in this image.
Figure 4-4. Example of a PSI-BLAST generated position-specific scoring matrix (PSSM). These matrices, also known as profiles, give the log likelihood of each of the 20 amino acids occurring at each position of the query protein. The profile is created with reference to all subject proteins discovered during a PSI-BLAST analysis of the query protein.
Figure 4-5. Example of the BLAST algorithm. A given query protein is analyzed by examining all three-amino-acid word sets. Each word set is used to search the database for matches in other subject proteins. When a match is discovered, the word set is extended in both directions, and the query and subject proteins are compared. If the comparison leads to a sequence match between the query and subject, the subject is considered similar and is retained. The BLOSUM62 matrix is the default for measuring similarity between two amino acids. Image adapted from O'Reilly BLAST Programming, 2002.


Figure 4-6. Visual example of the production of input vectors that can be used to train and test machine learning constructs. A window of size 15 is used to scan the query sequence, and a unique vector is generated with each amino acid having the opportunity to be central. The position-specific scoring matrix shown is used to generate the vector as noted. With a window of size 15, the vector will be of size 300.


Figure 4-7. Visual example of the decision boundary between two classes and the margin that is maximized.


CHAPTER 5
NEW SECONDARY STRUCTURE PREDICTION METHOD: DARWIN

Dynamic Alignment-Based Protein Window-SVM Integrated Prediction for Three-State Protein Secondary Structure: A New Prediction Server

A new protein secondary structure prediction server, DARWIN, is discussed and evaluated. DARWIN utilizes a novel two-stage system that is unlike any current state-of-the-art method. DARWIN specifically responds to the issue of accuracy decline due to a lack of known homologous sequences, by balancing and maximizing PSI-BLAST information, by using a new method termed fixed-size fragment analysis (FFA), and by filling in gaps, ends, and missing information with an ensemble of support vector machines. DARWIN comprises a unique combination of homology consensus modeling, fragment consensus modeling, and support vector machine learning. DARWIN is directly compared with several published methods, including well known indirect homology based techniques (those that use PSI-BLAST profiling), as well as more recent direct homology based combination techniques (those that include the use of PSI-BLAST templates). Corresponding results using both EVA and PDB25 derived datasets are reported and evaluated. DARWIN is further evaluated for varying homology availability, and specifically for proteins for which no known PSI-BLAST (PDB) discovered homologues are detectable. DARWIN offers highly competitive, and in many cases superior, prediction accuracy for protein secondary structure, as well as a publicly available online service. DARWIN accuracy ranges from 65% to well over 95%, depending upon the nature of the dataset analyzed. The DARWIN server is available over the Internet at URL http://www.cise.ufl.edu/research/compbio2/cgi-bin/amg/phd/WebMainPage.cgi


Introduction and Motivation of DARWIN

While the complexity of biological systems often appears intractable, living organisms possess an underlying correlation derived from their hierarchical association. It is this notion that enables methods such as machine learning techniques, Bayesian statistics, nearest neighbor, and known sequence-to-structure exploration to discover and predict biological patterns. As the number of known protein structures increases, so do the accuracies of prediction methods. Because the secondary structure of a protein provides a first step toward native or tertiary structure prediction, secondary structure information is utilized in the majority of protein folding prediction algorithms (Liu and Rost, 2001; McGuffin et al., 2001; Meller and Baker, 2003; Hung and Samudrala, 2003). Similarly, protein secondary structure information is routinely used in a variety of scientific areas, including proteome and gene annotation (Myers and Oas, 2001; Gardy et al., 2003; VanDomselaar et al., 2005; Mewes et al., 2006), the determination of protein flexibility (Wishart and Case, 2001), the subcloning of protein fragments for expression, and the assessment of evolutionary trends among organisms (Liu and Rost, 2001).

For the past few decades, several algorithms and their variations have been used to predict protein secondary structure. These include early techniques, such as single-residue statistics (Chou and Fasman, 1974), Bayesian statistics, and information theory (Garnier et al., 1978, 1996). These were then followed by an explosion of techniques using evolutionary information and homological protein alignment data, as pioneered in the method PHD (Rost and Sander, 1993), which broke the 70% prediction barrier. Current methods include machine learning constructs, such as the multilayered neural networks in SSpro (Pollastri and McLysaght, 2004), PROFsec (Rost and Eyrich, 2001), PHDpsi (Przybylski and Rost, 2001), and PSIPRED (Jones, 1999); ensembles of support vector machines, such as SVMpsi (Kim and Park, 2003) and YASSPP (Karypis, 2006); nearest neighbor methods, such as PREDATOR (Frishman and Argos, 1996); and a plethora of combined or meta-methods (Cuff and Barton, 1999; Albrecht et al., 2003).


All current machine learning techniques rely on homological and alignment data for training and testing, generally in the form of PSI-BLAST (Altschul et al., 1997) profiles, and can therefore be referred to as indirect homology based methods. Recently, modern techniques have more directly utilized homology and evolutionary information by including template and fragment modeling in the prediction process (Pollastri, 2007; Montgomerie et al., 2006; Cheng, 2005). The method Porter_H (Pollastri, 2007) uses direct homology by combining a set of query-derived homologous templates from the PDB with both the original query sequence and the corresponding PSI-BLAST profile to train a complex neural network ensemble. Similarly, PROTEUS (Montgomerie et al., 2006) uses direct homology modeling when homologues are available, and a jury of machine learning techniques, including PSIPRED, when homologues are not available. A commonality among all methods that exceed an average three-state prediction accuracy of 70%, whether indirect homology based (PSI-BLAST profile use) or direct homology based (template modeling), is the absolute reliance on evolutionary information, generally in the form of query-based multiple sequence alignments derived from PSI-BLAST (Pollastri and McLysaght, 2004; Bondugula and Xu, 2007; Przybylski and Rost, 2002; Cuff and Barton, 1999; Jones, 1999; Ward et al., 2003; Kim and Park, 2003; Hu et al., 2005; Wang et al., 2004; Pollastri et al., 2007). As homologous protein sequences have a higher propensity of exhibiting similar secondary structure, and proteins can exchange as many as 70% of their residues without altering their basic folding pattern (Przybylski and Rost, 2001; Benner and Gerloff, 1991), a persistent challenge of protein secondary structure prediction is the maintenance of high accuracy in the absence of detectable homologous sequences. It is also for this reason that different protein datasets can elicit significantly differing results, and separate accuracy reports for proteins without known homologues are rare.


Presented is a new secondary structure prediction server, DARWIN, which offers a novel two-stage prediction method. DARWIN incorporates a balance of PSI-BLAST derived homological data by using a form of weighted consensus homology modeling when homologues are available, and by using a novel fragment-based consensus method, termed fixed-size fragment analysis (FFA), otherwise. In both stages, DARWIN employs an ensemble of Gaussian kernel based support vector machines (SVMs) to compensate for any lack of information or gaps. In this way, DARWIN offers the user maximal accuracy when homologues are available, and a competitive alternative to pure machine learning otherwise. DARWIN has been tested against several leading indirect homology based, pure machine learning prediction servers, as well as against methods that use direct homology modeling. All training and testing datasets are rigorously derived from the well known EVA and PDB25 sets. All contained algorithms and methods use the Protein Data Bank (PDB) only, and do not access any other databank. In addition, special focus is given to proteins having no PDB-detectable homologues as determined by PSI-BLAST (with given parameters). Extensive detailed results and comparisons, as well as complete parameter descriptions, are included in the following sections.

Methods and Algorithms used in DARWIN

One of the main sources of prediction improvement has been PDB-derived structural information, and over half of all new proteins have some degree of similarity to known structures (Pollastri et al., 2007). Therefore, to offer the user the maximum possible accuracy of prediction, DARWIN begins with weighted direct homology consensus modeling. If even one viable homologue (discussed below) is detected, homology modeling combined with SVM support offers a complete and highly accurate prediction. If no viable homologue is discovered, DARWIN applies a form of fragment mining and consensus analysis, fixed-size fragment analysis (FFA), with information gaps resolved by an ensemble of SVMs (discussed in the next section).


Phases of DARWIN: Stage 1

For clarity, let the query protein represent the given protein whose secondary structure is in question, and let any alignment discovered by PSI-BLAST be known as a subject protein. Let PID represent the percentage identity between the query and subject protein fragment as listed in a given PSI-BLAST output. Let QL represent the length of the query protein, and let SL represent the true length of a subject protein, where SL is always greater than or equal to the subject fragment that is aligned to the query protein. Let ER represent the effective ratio between the query length and the true subject length. Then, the effective percentage identity (EP) can be calculated as in Equation 5-1, where ER (Equation 5-2) is the ratio between the lengths of the query and full subject protein and is always at most 1:

EP = PID \times ER    (5-1)

ER = QL / SL \ \text{if} \ QL \leq SL; \quad ER = SL / QL \ \text{otherwise}    (5-2)

Therefore, in cases for which the query and subject proteins have the same length, EP = PID. Otherwise, EP < PID, and will depend on the size difference between the query and full subject protein. The EP offers a more accurate identity measure between two alignments and is used in DARWIN's consensus algorithms.

Phase 1

In the first stage of DARWIN, the query sequence is run against three iterations of PSI-BLAST on the PDB database, using both an inclusion and expected value of .01 and the default similarity matrix BLOSUM62 (Henikoff and Henikoff, 1992, 1996). Because in all cases the PSI-BLAST search will return the query protein itself, as well as a possible collection of alternatively named yet identical proteins, all subjects with PID > 95% are removed.
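A minimal sketch of Equations 5-1 and 5-2 follows; the function and variable names are illustrative assumptions, not DARWIN's actual code.

    def effective_percentage_identity(pid, ql, sl):
        # Equation 5-2: ER = min(QL, SL) / max(QL, SL), always at most 1.
        er = min(ql, sl) / max(ql, sl)
        # Equation 5-1: EP = PID * ER.
        return pid * er

    # A 200-residue query aligned at 80% identity to a 400-residue subject
    # yields ER = 0.5 and therefore EP = 40.0, penalizing the length mismatch.
    print(effective_percentage_identity(80.0, 200, 400))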


This step maintains fairness and assures that neither the query protein itself, nor any alternatively named yet effectively identical proteins, are used in the method. Remaining subject proteins are considered viable templates.

Phase 2a: If at least one viable template is found

If one or more subject proteins are collected, the secondary structure information for each subject protein is gathered using the DSSP (Kabsch and Sander, 1983). Note that while each of the subject proteins will have a PSI-BLAST defined alignment to the query protein, each given alignment may, and often does, possess corresponding gaps relative to the query protein, gaps in the subject protein, alignments that are skewed and therefore do not begin at the starting point of one or both proteins, and incomplete alignments that do not offer information about the full query protein (Figure 5-1). As many options exist for predicting secondary structure using homologous templates, we investigated several issues, including the number of alignment templates to use, as well as the best option for creating an accurate consensus. Using only the highest-EP viable template (the top matching alignment) proved less effective than using a consensus. This conclusion was reached through the independent testing of ten small 100-protein datasets randomly generated from the PDB25 dataset. In addition, to avoid arbitrarily selecting a fixed number of templates to permit in the consensus, an exponentially weighted consensus measure of all available viable templates (with PSI-BLAST expected value = .01) was utilized to arrive at the final conclusion. Specifically, given a set of viable PSI-BLAST derived alignments, each alignment with respect to the query protein is filled in with spaces to create a complete matrix. Given a complete alignment matrix with the number of rows equal to the number of alignments (templates), and the number of columns equal to the length of the query protein, the following consensus method is applied.


As each of the templates is directly aligned to the query sequence, all matrix cells contain either one of the 20 amino acids or a space signifying a lack of information (a gap) at that location in the alignment. Using the DSSP, a secondary structure label is then assigned to each amino acid. The label L is used for all spaces. Each column can then be evaluated independently to predict the secondary structure of each amino acid in the query protein. Consensus evaluation for each column is calculated using a normalized exponential measure based on the effective percentage identity (EP) of the given row. For each column j, let the normalized weight associated with the secondary structure label in row i be notated as W_ij (Equation 5-3), and let exp be the selected exponent:

W_{ij} = \frac{EP_i^{exp}}{\sum_{i \in j} EP_i^{exp}}    (5-3)

Several different exponents were evaluated on ten independent datasets of 100 proteins each, randomly generated from the PDB25 dataset. Results suggested that exp = 2.5 offers the most reliable accuracy. Once weights are assigned, a sum is calculated for each of the three possible secondary structures: H (α-helices, 3-10 helices, and π-helices), E (β-sheets and β-bridges), and C (all other). The maximum sum is used to select the final secondary structure, and to label each amino acid in the query protein as H, E, C, or L. Again, note that L is not a secondary structure, and is used when a column contains only spaces.

Phase 2b: If no viable template is found

If, after eliminating all invalid subject proteins, none remain, it is assumed that the query protein possesses no viable homologues in the PDB. It should be noted that the term homologue is ambiguous; for this paper, unless otherwise noted, it will refer to proteins discovered using three iterations of PSI-BLAST on the PDB only, with BLOSUM 62 and both an inclusion and expected value of .01. While many direct homology methods resort to a machine learning technique once the option of direct homological template alignment is eliminated, DARWIN uses a new fragment consensus based method to discover further localized information before combining with an SVM. This method, FFA, is described in detail in a separate section below.
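The sketch below illustrates the column consensus of Equation 5-3 under the stated exponent of 2.5; the data layout and names are hypothetical. Because every label in a column is divided by the same normalizing sum, the normalization is omitted here, as it cannot change which label wins.

    def column_consensus(labels, eps, exponent=2.5):
        # One alignment column: a DSSP label ('H', 'E', 'C', or 'L' for a
        # space) and an EP value per template row (Equation 5-3).
        sums = {'H': 0.0, 'E': 0.0, 'C': 0.0}
        for label, ep in zip(labels, eps):
            if label in sums:            # 'L' rows carry no information
                sums[label] += ep ** exponent
        if all(s == 0.0 for s in sums.values()):
            return 'L'                   # column contained only spaces
        return max(sums, key=sums.get)

    # Three templates vote on one column; the high-EP helix rows win.
    print(column_consensus(['H', 'H', 'E'], [90.0, 60.0, 40.0]))  # 'H'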


Phase 3

Once each amino acid of the query protein is labeled, an ensemble of Gaussian kernel based support vector machines is used to fill in any position labeled with an L. Because the SVM is used only to fill in areas for which information was not available, DARWIN offers the user high accuracy for the majority of queries (Pollastri et al., 2007), balanced with competitive or maximal accuracy for all other cases.

Phases of DARWIN, Stage 2: Fixed-Size Fragment Analysis

One option, and perhaps the most common, when attempting to predict the structure of a protein that possesses no known homologues as determined by PSI-BLAST, is to use a pre-trained machine learning construct in combination with PSI-BLAST profiling of the query sequence. Profiling a query protein for which no homologues will be discovered by PSI-BLAST is equivalent to using the underlying similarity matrix (usually BLOSUM 62) to generate input vectors and to evaluate the protein. While this is certainly a valid method that offers fair results, DARWIN endeavors to add information by first directly mining the PDB for similar protein fragments, and then combining this information with that learned by an SVM. The fixed-size fragment analysis (FFA) method offers a novel alternative to current techniques when homologues are not available and secondary structure is therefore more challenging to determine. The theory behind the FFA algorithm is that the relative local neighborhood surrounding each amino acid may contain valuable information about the secondary structure of that amino acid. Given a query protein, the FFA method follows four steps.


Fragment size selection

To determine the best fixed-size window, and therefore the fragment size, ten independent mutually non-homologous protein sets containing 100 proteins each were randomly derived from the dataset PDB25. Each of the ten datasets was evaluated on an array of window (fragment) sizes ranging from 3 to 31. It was determined that window sizes 15 and 17 both offer the highest and most consistent accuracy. Therefore, we selected a window of size 15.

Step 1

Each amino acid in the query protein is represented by a query fragment of size 15, containing the local neighborhood about the amino acid with 7 neighbors to each side. An amino acid near the end of the query protein will be mapped to a fragment that contains spaces. Each fragment is then run on PSI-BLAST using the PDB database (iterations 3, inclusion 600, threshold 600, matrix BLOSUM62). To avoid permitting the evaluation to include the query protein itself or near-identical proteins, all PSI-BLAST discovered subject alignments were screened, and those with EP > 95% were removed. Remaining viable alignments were each aligned to the query fragment according to PSI-BLAST. Gaps and size differences were filled in with spaces. Once the query fragment is aligned with all viable subjects, and spaces are used to fill gaps and to even sizes, a matrix of amino acids results with exactly 15 columns and the number of rows equal to the number of discovered subject alignments.
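A minimal sketch of the Step 1 fragmentation (window of size 15, 7 neighbors per side, spaces padding the ends); the names and example sequence are illustrative only.

    def query_fragments(sequence, window=15):
        # Map each residue to the size-15 fragment in which it is central
        # (Step 1). Residues near either end receive space padding.
        half = window // 2
        padded = ' ' * half + sequence + ' ' * half
        return [padded[i:i + window] for i in range(len(sequence))]

    frags = query_fragments("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
    print(repr(frags[0]))   # '       MKTAYIAK' -- 7 leading spaces
    print(len(frags))       # one fragment per residue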


Step 2

For the matrix of amino acids generated in Step 1, the DSSP is used to label each amino acid with its corresponding secondary structure. Note that the only column of interest is the center column, containing the query amino acid. Given a column of secondary structures, a consensus is performed to predict the query central amino acid structure. Let the normalized weight of each secondary structure of row i and of the center column c be represented by W_ic (Equation 5-4), let exp represent the exponent, and let PID represent the percent identity between query and subject as noted by PSI-BLAST:

W_{ic} = \frac{PID_i^{exp}}{\sum_{i \in c} PID_i^{exp}}    (5-4)

To determine the most accurate exponent, the same ten testing sets noted above, derived from the PDB25 dataset, were tested for exponents ranging in value from .5 to 3.5 in steps of .5. Results showed that an exponent parameter set to 1 offered the most consistent accuracy, suggesting that the weight offered by PSI-BLAST was sufficient for producing a reliable consensus measure. In addition, note that PID is used rather than EP, as fragment alignments will generally have very small EP values that will not accurately represent the identity between the fragments themselves. For the central column of interest, three weight sums were calculated, one for each of the three secondary structures, H, E, and C. The maximum sum, corresponding to the highest weight, was selected and used to assign the associated secondary structure. If the central column contained only spaces, an L was assigned to show a lack of information for that location.

Step 3

Steps 1 and 2 are repeated for all amino acids in the query protein. Once completed, each amino acid in the query protein is labeled with H, E, C, or L. Finally, the ensemble of SVMs is used to predict any amino acid labeled with L. Once complete, a full prediction for the query protein is released.

The strength of the FFA method is that it offers a new option for determining structure when no homologues are available. The FFA method follows the underlying idea of database fragment detection, alignment, and consensus (Wu, 2004; Cheng, 2005), but further seeks to narrow the database fragment search to a localized neighborhood about a given central amino acid from the query protein. The FFA method can improve pure machine learning methods by several percentage points by adding additional information to the underlying similarity matrix used by PSI-BLAST when homologues are not available to construct the PSSM profile.


Ensemble of Support Vector Machines in DARWIN

DARWIN uses an ensemble of three Gaussian kernel based SVMs to classify any amino acids in a given query protein that are labeled with an L, as described above. The L label is generally due to a lack of alignment information, gaps in alignment information, or skewed or incomplete alignment information. Each SVM is a derivative of SVM_light (Vapnik, 1995; Joachims, 1998, 2002), and is a soft margin binary classifier. Within the ensemble, the first SVM predicts H vs. not H (H/~H), the second (E/~E), and the third (C/~C). The outputs of each SVM are compared, and the highest is selected to represent the secondary structure. Note that if all outputs are less than .6, the default prediction is C.

The SVM Kernel and Equation

Each SVM is a soft margin binary classifier, and seeks to satisfy the following quadratic optimization problem (Equations 5-5 and 5-6):

\min_{\omega, b, \xi} \; \frac{1}{2} \omega^T \omega + C \sum_{i=1}^{l} \xi_i    (5-5)

\text{subject to} \; y_i (\omega^T \phi(x_i) + b) \geq 1 - \xi_i, \quad \xi_i \geq 0    (5-6)

where x_i is an input vector, y_i is the corresponding class label, \phi is the kernel mapping, \omega is the vector perpendicular to the decision boundary (the separating hyperplane), \xi_i is the slack variable, b is the offset of the hyperplane, and C is the tradeoff parameter between the error in predicting a class and the margin between the classes. It is well known that the above constrained optimization problem (Equations 5-5 and 5-6) can be solved if a kernel function (Equation 5-7) is specified:

K(x_i, x_j) = \phi(x_i)^T \phi(x_j)    (5-7)


After the analysis of several kernels, the RBF (radial basis, or Gaussian, kernel) (Equation 5-8) was found to offer the best average accuracy:

K(x_i, x_j) = \exp(-\gamma \| x_i - x_j \|^2), \quad \gamma = 1/(2\sigma^2) > 0    (5-8)

After testing several parameter options, \gamma was set to .1 and C was set to 1.
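To show the shape of such a one-vs-rest ensemble with the stated RBF parameters (gamma = .1, C = 1), the sketch below uses scikit-learn purely as a stand-in; DARWIN itself is built on SVM_light, and the training data here is synthetic.

    import numpy as np
    from sklearn.svm import SVC

    # Synthetic stand-in data: 300-dimensional window vectors labeled H/E/C.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(90, 300))
    y = np.array(['H'] * 30 + ['E'] * 30 + ['C'] * 30)

    # One soft-margin binary RBF classifier per state (H/~H, E/~E, C/~C).
    ensemble = {}
    for state in ('H', 'E', 'C'):
        clf = SVC(kernel='rbf', gamma=0.1, C=1.0)
        clf.fit(X, (y == state).astype(int))
        ensemble[state] = clf

    def predict_state(x, threshold=0.6):
        # Take the largest decision value; default to 'C' when no
        # classifier clears the threshold, as described in the text.
        scores = {s: clf.decision_function(x.reshape(1, -1)).item()
                  for s, clf in ensemble.items()}
        best = max(scores, key=scores.get)
        return best if scores[best] >= threshold else 'C'

    print(predict_state(X[0]))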


Training the SVM and Using PSI-BLAST Profiles

To fairly measure the accuracy of any machine learning technique, the training set must be disjoint from the testing set. Disjointness between two protein sequences can be defined as a lack of detectable homology between the two proteins as discovered by PSI-BLAST, and has also been defined as having less than 25% sequence identity between the two proteins (Rost and Sander, 1993). The ensemble of SVMs used in DARWIN was trained on 1000 randomly selected proteins collected from a subset of the PDB25 dataset. To assure mutual non-homology between the 1000 training proteins and both EVA5 and EVA6, the entire PDB25 dataset was run on three iterations of PSI-BLAST with similarity matrix BLOSUM 62 and an expected value of .01. Next, all detected homologues, and all original proteins from the PDB25 dataset itself, were compared to all proteins in both the EVA5 and EVA6 datasets. It was discovered that 88 proteins in PDB25 were homologous to EVA5, EVA6, or both. These 88 proteins were removed before the 1000 proteins were randomly selected to train the SVM.

The common PSI-BLAST profile sliding window method (Figure 4-6) was used to train DARWIN (Qian and Sejnowski, 1988; Rost and Sander, 1993; Cuff and Barton, 1999; Pollastri and McLysaght, 2004). Given 1000 training proteins from PDB25, a PSI-BLAST profile, or position-specific scoring matrix (PSSM) (Gribskov et al., 1987), was generated for each protein. Profiles are built using the set of alignment proteins discovered by PSI-BLAST for the query protein. These discovered alignments can be considered homologues to the query, and together they are used to construct a corresponding PSSM profile matrix. Each profile matrix offers the log likelihood of the occurrence of each of the 20 possible amino acids at each given position of the query protein. In this way, a profile is a matrix of available homologous information, in the form of log likelihood data, with respect to the query protein. When no homologues (alignments) are discovered, PSI-BLAST uses the selected similarity matrix (usually BLOSUM 62) to create the profile. Next, for each protein in the training set, and given a corresponding profile, a sliding window is used to scan each profile and create a unique input vector to represent each amino acid in the query protein. The window was selected to have a width of 15, thereby generating an input vector of size 300 and representing 7 neighboring amino acids on either side of the central amino acid. Using the DSSP, each amino acid in the training set was associated with a secondary structure label. Therefore, each amino acid in the training set is represented by a vector of size 300 containing log likelihood information about its surrounding local neighborhood, and is finally defined with a class label designated by its DSSP-notated secondary structure. The SVM accuracy alone, with no post-processing, offers a range of Q3 per-protein accuracies from 69% to 76%, depending on the dataset analyzed. Intuitively, machine learning techniques will always offer higher accuracies when a greater number of homologues are detectable.
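A minimal sketch of the sliding-window encoding (windows of 15 PSSM rows, flattened to vectors of 300 values); the array layout and names are illustrative assumptions.

    import numpy as np

    def window_vectors(pssm, window=15):
        # pssm: (sequence_length, 20) log-likelihood scores. Each residue is
        # encoded as its window of 15 PSSM rows, flattened to 300 values;
        # rows beyond either end of the sequence are zero-padded.
        half = window // 2
        pad = np.zeros((half, pssm.shape[1]))
        padded = np.vstack([pad, pssm, pad])
        return [padded[i:i + window].ravel() for i in range(len(pssm))]

    toy_pssm = np.random.default_rng(1).normal(size=(50, 20))
    vectors = window_vectors(toy_pssm)
    print(len(vectors), vectors[0].shape)  # 50 vectors, each of shape (300,)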


Datasets and Measures of Accuracy for DARWIN

Creating a fair comparison between protein secondary structure prediction methods can be a challenge, as even a single protein can greatly alter an overall percentage accuracy result, and proteins with no detectable homologues will always offer lower accuracy. In addition, special care must be taken to avoid using mutually homologous proteins in both the training and testing sets. Contamination of training and testing sets can become a further issue, as PSI-BLAST profiling of both the training and testing sets can severely blur original homological disjointness.

Three well known and robust datasets were employed to offer a fair and balanced analysis of DARWIN, and a direct comparison with other listed and published results. The datasets used were the EVA sets (http://cubic.bioc.columbia.edu/eva/cafasp/index.html) (Koh et al., 2003), specifically EVA Common 5 (EVA5) and EVA Common 6 (EVA6), and a new dataset of 800 test proteins, termed PDB25_800, randomly derived from PDB25, a dataset grouped together by ASTRAL (Brenner et al., 2000). The PDB25 dataset was designed so that no two proteins possess more than 25% sequence identity. The EVA5 set contains 178 mutually non-homologous proteins, and the EVA6 set contains 211 mutually non-homologous proteins. EVA datasets are available online and are used to make direct comparisons with other prediction methods that have been tested on those exact sets. The PDB25 is a large dataset of over 4200 proteins that are mutually non-homologous. When the PDB25 dataset was downloaded for experimentation by the authors, it contained exactly 3881 proteins and 553016 amino acid residues. However, 88 proteins were removed, as they were shown to share PSI-BLAST derivable homologues with the EVA datasets. A further 7 proteins were removed, as they did not have viable DSSP counterparts. Next, 1000 of the remaining 3786 proteins were randomly selected to train the DARWIN SVM ensemble. After separating the 1000 training proteins from the set, the remaining 2786 proteins were used to randomly generate the new testing set containing 800 proteins, PDB25_800.

To determine the accuracy of DARWIN, several measures are used. Let A_ij represent the number of residues predicted to be in state j and observed to be in state i. Let N_r represent the total number of residues in the dataset. Let N_p represent the total number of proteins in the dataset (Rost and Sander, 1993; Koh et al., 2003). Then, Equations 5-9 through 5-15 are used to measure accuracies for DARWIN. These measures are also noted and explained in detail in Chapter 4.


Q_3 = 100 \times \frac{\sum_{i=1}^{3} A_{ii}}{N_r}    (5-9)

Q_{3,avg} = \frac{1}{N_p} \sum_{p=1}^{N_p} Q_3^p, \ \text{where} \ Q_3^p \ \text{is} \ Q_3 \ \text{per protein}    (5-10)

Q_i^{obs} = 100 \times \frac{A_{ii}}{\sum_{j} A_{ij}}, \quad i \in \{H, E, C\}    (5-11)

Q_i^{prd} = 100 \times \frac{A_{ii}}{\sum_{j} A_{ji}}, \quad i \in \{H, E, C\}    (5-12)

Q_{i,avg}^{obs} = \frac{1}{N_p} \sum_{p=1}^{N_p} Q_i^{obs,p}, \ \text{where} \ Q_i^{obs,p} \ \text{is} \ Q_i^{obs} \ \text{per protein}    (5-13)

Q_{i,avg}^{prd} = \frac{1}{N_p} \sum_{p=1}^{N_p} Q_i^{prd,p}, \ \text{where} \ Q_i^{prd,p} \ \text{is} \ Q_i^{prd} \ \text{per protein}    (5-14)

errsig_x = stdev(x) / \sqrt{N_x}    (5-15)

As these measures (Equations 5-9 through 5-15) are included in the vast majority of publications, they are used to make direct comparisons between DARWIN and other prediction servers. Comparisons are only made between methods when exact dataset results are available.
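As an illustration of Equations 5-9, 5-11, and 5-12, the sketch below computes Q3 and the per-state measures from an observed and a predicted structure string; the strings and function names are hypothetical.

    def q3(observed, predicted):
        # Equation 5-9: percentage of residues predicted in the correct state.
        correct = sum(o == p for o, p in zip(observed, predicted))
        return 100.0 * correct / len(observed)

    def q_obs(observed, predicted, state):
        # Equation 5-11: of residues observed in `state`, the percentage
        # also predicted as `state`.
        rows = [p for o, p in zip(observed, predicted) if o == state]
        return 100.0 * sum(p == state for p in rows) / len(rows) if rows else 0.0

    def q_prd(observed, predicted, state):
        # Equation 5-12: of residues predicted as `state`, the percentage
        # actually observed as `state`.
        rows = [o for o, p in zip(observed, predicted) if p == state]
        return 100.0 * sum(o == state for o in rows) / len(rows) if rows else 0.0

    obs, prd = "HHHHEEEECCCC", "HHHEEEEECCCC"
    print(q3(obs, prd))          # about 91.7 (11 of 12 residues correct)
    print(q_obs(obs, prd, 'H'))  # 75.0
    print(q_prd(obs, prd, 'E'))  # 80.0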


Experiments, Measures, and Results

While it is most common to evaluate the accuracy of a protein secondary structure prediction method through the use of n-fold cross-validation of one or more mutually non-homologous datasets, this approach can offer a limited measurement of a method's behavior for proteins of varying levels of detectable homology. All well known datasets, including rs126 (Rost and Sander, 1993), cb513 (Cuff and Barton, 1997), EVA(4-6), and PDB25, possess a significant number of proteins for which PSI-BLAST can detect one or more homologues. The older sets rs126 and cb513 contain >98% proteins having known homologues in the PDB, and just over 50% of the proteins in EVA5, EVA6, and PDB25 produce partial or full homologues in the PDB, depending on the expected value selected when running PSI-BLAST. Therefore, results published for these balanced datasets may reveal overall average accuracy, but will not reveal specific accuracy for proteins that generate zero PDB PSI-BLAST homologues. Next, comparing prediction methods on non-identical datasets lends little true comparative information, as even a single protein can affect prediction accuracy. Finally, current prediction methods can be loosely categorized into those that use indirect homology and evolutionary information via machine learning and PSI-BLAST profiles (such as PSIPRED and PHDpsi), and those that incorporate direct homology (via template modeling) into the prediction method (such as Porter_H and PROTEUS). To accommodate these important issues and to offer a balanced analysis of DARWIN, several different evaluations were performed, with details and results following.

For overall accuracy assessment and later direct comparison, DARWIN was first evaluated on datasets EVA5 and EVA6 (Table 5-1 and Figure 5-2). As EVA datasets are well known and available for download over the Internet, they offer a fair baseline for average method accuracy, as well as a means to compare methods tested on these exact sets. To test DARWIN on the EVA sets, the DARWIN SVM was trained on 1000 proteins collected from the PDB25 (as described in the dataset section) that were assessed using three iterations of PSI-BLAST on the PDB, with BLOSUM 62 and an expected value of .01, to assure mutual non-homology with the EVA testing sets. As DARWIN accesses only the PDB, all references to testing, training, and homology assume PDB proteins only. DARWIN was then compared to several indirect homology counterparts using datasets EVA5 and EVA6. It is important to note that DARWIN uses direct homology modeling as part of the prediction process, while the methods compared below do not. However, all considered methods rely directly on the same PSI-BLAST derived data, though through differing avenues. Results for DARWIN as compared with leading indirect homology methods, for which results have been specifically published for datasets EVA5 and EVA6 (Table 5-2 and Table 5-3), show that DARWIN exceeds all noted indirect homology methods. Next, DARWIN was compared with methods that use direct homology as part of the prediction process. Both Porter_H (Pollastri et al., 2007) and PROTEUS (Montgomerie et al., 2006) are leading direct homology methods for which current accuracies have been published.


Several notes must first be considered. Porter_H accuracies were measured and reported using a large subset of the PDB25 dataset, while PROTEUS accuracies were measured using a large set of EVA proteins. Therefore, no true direct comparison can be made. However, some idea can be offered about potential accuracies for differing levels of detectable homology. In addition, neither method published separate accuracy results for proteins that possess no detectable homologues. Therefore, this aspect cannot be compared, but it will be reported for DARWIN. Porter_H reports prediction accuracies of 90% when templates with greater than 50% similarity (defined as percent identity over the query sequence) are available, accuracies of nearly 87% when templates with at least 30-50% similarity are found, 82% accuracy when templates of between 20%-30% similarity are found, and <79% accuracy for templates of less than 20% similarity. PROTEUS reports an average accuracy of 85.5% when homology is available and 77.6% accuracy when the use of direct homology is turned off. Both the Porter_H and PROTEUS methods are supported by an underlying machine learning technique. Porter_H uses an ensemble of neural networks, and PROTEUS uses a jury of expert predictors (PSIPRED, JNET, and TRANSSEC) that are all based on neural network ensembles.

To measure DARWIN accuracies for varying degrees of detectable homology, the dataset PDB25_800 was employed. PDB25_800 is a set of 800 randomly selected proteins derived from the PDB25 dataset. PDB25_800 is specifically disjoint from the 1000 PDB25 proteins used to train the DARWIN SVM. The 800 proteins in PDB25_800 were each run against three iterations of PSI-BLAST using the PDB only, the similarity matrix BLOSUM 62, an expected value of 1, and an inclusion value of .01. The use of an expected value of 1, rather than .01, was to generate potential matches in the range of 10%-30% identity.


Of the 800 proteins, 434 proteins (54%) found homologues with better than 50% query sequence identity, 553 (69%) of the proteins found homologues with at least 30% query sequence identity, 100 (12%) of the proteins found homologues between 10% and 30% query sequence identity, and 133 (17%) of the proteins found no detectable homologues. For proteins with at least one template having better than 30% identity, the average prediction was over 92%, but again this depends on the homologues discovered and is certainly protein dependent. For proteins that find homologues between 20-30%, prediction still exceeds average pure machine learning methods, with averages near 83%. For proteins with homologues found below 20%, accuracies can vary considerably depending upon the information offered by the homologue, and range between 65%-79% on average. For proteins with no homologues, DARWIN uses FFA combined with an ensemble of SVMs to offer improved results.

To measure DARWIN's performance on proteins for which no homologues are detected, a baseline was first set. A collection of 133 proteins from the PDB25_800 dataset was used to analyze performance. These 133 proteins were those noted above for which no detectable homologues were discovered using three iterations of PSI-BLAST on the PDB with similarity matrix BLOSUM62, expected value 1, and inclusion value .01. To create a baseline, these 133 proteins were tested using only the DARWIN SVM, and without the use of the fragment consensus method (FFA). It is important to note that when no homologues are detectable, PSI-BLAST profiles default to the underlying similarity matrix used, in this case BLOSUM62. To evaluate each protein, a sliding window of size 15 was used to create input vectors of size 300 with which to test the DARWIN SVM. Each amino acid in a given query protein was represented by a vector of size 300 and was subsequently predicted via the DARWIN SVM. The average result for the 133 proteins was 62.1%. Next, to measure the effects and advantage of including the fixed-size fragment consensus method in the prediction process, the same 133 proteins were run through the full DARWIN method.


Therefore, each protein of size n was first broken into a set of n fragments, such that each amino acid was central to exactly one fragment, and each fragment was exactly 15 amino acids in length. Each fragment therefore contained a local neighborhood of amino acids about the central amino acid to be predicted. Each of the n fragments was run against PSI-BLAST on the PDB with similarity matrix BLOSUM 62, an expected value of 600, and an inclusion value of 600. Note that the expected value is large to allow PSI-BLAST to detect matches for fragments of small size. Once the fragment alignments were analyzed (details can be found in the methods and algorithms section above), the final query prediction was made with a combination of fragment and SVM information. The accuracy for the set of 133 proteins increased by 3.2 percentage points, to 65.3%. These results suggest that the use of local information through fragment consensus analysis can add accuracy to an underlying machine learning technique.

Conclusions on DARWIN

DARWIN is a unique method that comprises a combination of homology consensus modeling, fixed-size fragment consensus modeling, and support vector machine learning. Each component of DARWIN offers the user the maximal available accuracy for a given query protein. When homologues are detectable, DARWIN's accuracies are significantly higher than those of indirect homology (pure machine learning) methods, and when homologues are not available, DARWIN employs a novel fragment consensus process that adds information to the underlying support vector machine. DARWIN offers an elegant and constrained method of prediction, with only one tunable parameter, exp, for the full protein alignment evaluation, and only two tunable parameters, exp and fragment size, for the FFA portion. DARWIN results are highly competitive with, or better than, more complicated techniques that can suffer from overtraining.


Comparatively, DARWIN exceeds its indirect homology counterpart methods as measured on the common EVA5 and EVA6 sets, while also offering maximal accuracy for proteins with detectable homologues. DARWIN was also compared to methods that use direct homology as part of their prediction algorithm, with highly competitive accuracy. DARWIN is a new server, and several additions are currently being investigated. The first addition underway is the upgrade of DARWIN to offer an eight-state prediction option. The second is the training of the DARWIN SVM on a more extensive protein set. Finally, we suspect that considerable information can be added to the prediction process through the use of fragment mining of the PDB, and therefore further investigation and improvement of DARWIN's FFA is underway.


Table 5-1. Detailed average prediction results for DARWIN.

                  EVA5    EVA6
    Q3            80.3    80.5
    Q3 avg. pp    78.7    78.9
    errsigQ3       1.1     1.0
    QHobs         83.6    84.0
    QHobs avg. pp 82.0    82.3
    QHprd         83.4    83.3
    QHprd avg. pp 78.3    78.2
    errsigQHobs    1.7     1.4
    errsigQHprd    2.1     1.9
    QEobs         78.2    77.6
    QEobs avg. pp 70.4    69.7
    QEprd         75.6    76.0
    QEprd avg. pp 58.5    58.0
    errsigQEobs    2.7     2.4
    errsigQEprd    3.2     3.0
    QCobs         77.7    78.1
    QCobs avg. pp 77.3    77.9
    QCprd         79.3    79.5
    QCprd avg. pp 78.7    79.0
    errsigQCobs    1.3     1.1
    errsigQCprd    1.3     1.2

Note: All errsig values are +/- and pp stands for per protein.

Table 5-2. Average prediction results for dataset EVA5 for DARWIN compared to top published indirect homology method results.

    EVA5        Q3     errsig(+/-)  QHobs   QEobs   QCobs
    PROFsec     76.4   0.8          78      70      77
    SAMT99sec   77.1   0.8          86      63      74
    PSIpred     77.3   0.9          86      66      75
    PHDpsi      74.3   0.9          80      65      72
    DARWIN      80.3                83.6    78.2    77.7
    DARWINpp    78.7   1.1          82      70.4    77.3

Note: Per-protein average is represented by pp.

Table 5-3. Average prediction results for dataset EVA6 for DARWIN compared to top published indirect homology method results.

    EVA6        Q3     errsig(+/-)  QHobs   QEobs   QCobs
    PROFsec     76.6   0.8          78      70      77
    PSIpred     77.8   0.8          86      66      75
    PHDpsi      74.9   0.8          80      65      72
    DARWIN      80.5                84      77.6    78.1
    DARWINpp    78.9   1.0          82.3    69.7    77.9

Note: Per-protein average is represented by pp.


Figure 5-1. An example portion of a PSI-BLAST alignment. Several areas in a given alignment can result in missing information. Here, the similarity between the query and subject protein is skewed, as the fourth amino acid of the query is aligned to the 17th amino acid of the subject. Therefore, the first 3 amino acids of the query are not aligned to anything in this case. Next, both the query and subject proteins contain a gap, designated with dashes. These gaps also signify missing information.

Figure 5-2. Histograms for each dataset, EVA5 and EVA6, displaying the percentage of proteins predicted by DARWIN with a given accuracy. The horizontal axis describes the decimal accuracy and the vertical axis describes the number of proteins.


CHAPTER 6
DARWIN WEB SERVER

Introduction

The goal and intention of the DARWIN method is to offer novel, superior, and reliable protein secondary structure prediction. To that end, the DARWIN Web Server was created to offer a graphical Internet-based user interface to the DARWIN system. The DARWIN web server is available over the Internet at URL http://www.cise.ufl.edu/research/compbio2/cgi-bin/amg/phd/WebMainPage.cgi

Using the Server

The publicly available DARWIN Internet Service (Figure 6-1) is intended for academic use only, and requires the entry of a university-affiliated email address to permit request submission. On the DARWIN website, the user is asked to enter several pieces of information. These include the user's name, the user's email address to which results will be sent, and a name that describes the sequence of amino acids for which a secondary structure prediction will be generated. Note that the amino acid sequence name entered by the user does not have to be an actual protein name. Instead, it is simply a name that the user can associate with their specific request. DARWIN does not base any portion of the prediction process on a protein name. Next, the user enters a single sequence of amino acids, with each amino acid represented using a single-letter amino acid code. The amino acid sequence entered has no need for internal spaces, returns, or tabs. Each amino acid one-letter code should occur directly following the previous one. There is no sequence size limit. The DARWIN website also offers the user four other testing alternatives, namely the expected value with which to run PSI-BLAST (E-value), the similarity matrix to use with PSI-BLAST, the kernel type to use when running the DARWIN SVM, and the corresponding gap penalties that are associated with the different similarity matrix choices.


Each of these options is preset and automatically defaults to the DARWIN-selected ideal options. The default DARWIN options, chosen to attain the best accuracies, include an expected value of .01, the similarity matrix BLOSUM62, a corresponding gap penalty of 11, and a Gaussian kernel with gamma at .1 and C at 1. However, the user, to compare or attain different results, may select an SVM kernel of linear, quadratic, or cubic. The user may also select the similarity matrices BLOSUM80, BLOSUM45, and PAM30; corresponding gap penalties that range from 9 to 13 and are noted for each similarity matrix choice; and finally the expected value to use during the PSI-BLAST run. The expected value options range from .000001 to 10000. The inclusion value is automatically set to whichever value the expected value is set to. The inclusion value dictates which of the discovered proteins will be used to create the PSSM profile. When the expected value and the inclusion value are the same, all discovered proteins are used. Altering any parameters away from the DARWIN defaults may result in lower accuracies. Although DARWIN has the ability to manage several protein sequences at a time (batch processing), the current DARWIN web service is designed to accept only one amino acid sequence per submission request. Once a submission request is made by clicking the submit button on the site, the request, along with all related information, is entered into a queue and will be processed in the order in which it was received. The result of the request will include the original amino acid sequence, the name given to the sequence, and the DARWIN-generated three-state prediction for the secondary structure of each amino acid, namely H, E, or C. Prediction generally requires between 5 and 15 minutes, but can require more time depending upon the nature and length of the input sequence, as well as the number of requests already in the queue.


Design of the DARWIN Web Service

The underlying programming code that runs DARWIN is written in Perl. DARWIN uses two external applications, SVMlight and PSI-BLAST. DARWIN also accesses two databases, the Protein Data Bank (PDB) and the DSSP. For speed, a version of SVMlight was downloaded to the server and then integrated into the DARWIN code. Similarly, a version of PSI-BLAST (blastpgp) was downloaded onto the server, installed, and then integrated into the DARWIN code via a call request. Both databases, the PDB and the DSSP, contain many forms of information. As PSI-BLAST requires access to the PDB database, a specially formatted dataset containing all PDB proteins (at the time of download) was downloaded and stored in a location on the server. All PSI-BLAST requests access only the PDB and no other databases. At certain stages, the DARWIN program requires secondary structure information about known proteins. When DARWIN uses PSI-BLAST, subject proteins that align to the query sequence are discovered, and their secondary structures are then required. To access secondary structure information, a downloaded version of the latest DSSP set of .dssp files is located on the server. As both the PDB and the DSSP are ever growing, they are re-uploaded to the DARWIN server area every 6 months.

The DARWIN web site, or Internet-based graphical user interface, is written with a combination of Perl, CGI, JavaScript, Cascading Style Sheets, and Dynamic HTML. Information is collected from the user on the client side of the DARWIN web server by a CGI/Perl/JavaScript/DHTML webpage. The information is first reviewed for validity using client-side JavaScript. If the user-submitted information is acceptable, the user request is collected and entered into a queue on the server for processing. If the information is not correct, the JavaScript will inform the user to correct the information and resubmit. Once the submission is successful and the request is entered into the server request queue, the website returns a message of success to the user and informs the user to expect the results via email.


DARWIN was primarily written in Perl and uses two external application packages that are integrated into the code. The first package is the support vector machine (SVMlight), and the second package is PSI-BLAST. DARWIN begins by using the query protein sequence to generate a set of files that are properly formatted for interaction with the applications that DARWIN associates with. For each sequence submitted, DARWIN generates a FASTA file that has a specific format and offers a linear description of the sequence of amino acids in the query sequence. This FASTA file is used to run PSI-BLAST on the query protein. PSI-BLAST accesses the PDB and discovers known protein sequences that align to the query sequence within a given set of parameters. For full sequence alignment, DARWIN uses a PSI-BLAST expected value of .01, an inclusion value of .01, and the similarity matrix BLOSUM62. PSI-BLAST generates two separate outputs that can be captured as files and appropriately named. The first output, or MSA file (Figure 4-3), is the full PSI-BLAST alignment output. This output includes a considerable quantity of information, including all proteins discovered to be similar to the query protein (given the parameters), their alignments to the query, their lengths, the expected value of each alignment, and the identity between the query and each alignment. Each alignment protein discovered is referred to as a subject protein. The alignment discovered can be a subsequence of a larger protein. Considerable parsing and data processing is required to collect and utilize each piece of information for the various stages of the DARWIN process. The second file produced by PSI-BLAST, the PSSM file (Figure 4-4), is a matrix that describes the log likelihood of each of the 20 known amino acids occurring at each of the positions of the query sequence.
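For illustration, a full-sequence PSI-BLAST run with the parameters stated above might be invoked as follows. The flags are those of the legacy NCBI blastpgp binary; the file and database names are hypothetical, and the actual DARWIN server issues the equivalent call from Perl.

    import subprocess

    cmd = [
        "blastpgp",
        "-i", "query.fasta",  # FASTA file generated for the submitted sequence
        "-d", "pdb",          # locally formatted PDB sequence database
        "-j", "3",            # three PSI-BLAST iterations
        "-e", "0.01",         # expected value threshold
        "-h", "0.01",         # inclusion value for profile construction
        "-M", "BLOSUM62",     # similarity matrix
        "-o", "query.msa",    # full alignment (MSA) output
        "-Q", "query.pssm",   # position-specific scoring matrix output
    ]
    subprocess.run(cmd, check=True)

The two output files written by this call correspond to the MSA and PSSM files described above.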


Once the files are collected, the MSA file is parsed and processed to determine whether any viable PSI-BLAST homologues were discovered. The PSSM file is parsed and processed to create testing vectors for the DARWIN SVM ensemble. A description of the creation of SVM input vectors can be found in Chapter 4. If at least one viable homologue is discovered from the MSA file, then all viable homologues are collected through a complicated parsing process that takes the PSI-BLAST generated alignments and creates a set of complete alignments to the query. PSI-BLAST alignments (subjects) can be skewed, gapped, and distinct in length with respect to the query protein. The query protein itself can also have gaps with respect to any given subject protein. Each subject protein will have a different alignment to the query protein, and subjects are not aligned to each other. Considerable processing is needed to create a complete and even alignment between all subjects and the query. Gaps and skews are filled with spaces, and gaps in the query with respect to the subject require removal of the corresponding subject amino acids. Once all subjects are evenly and completely aligned to the query, the DSSP is accessed to find the secondary structure of all subjects. Again, because subjects are incomplete, and because amino acids are sometimes removed, considerable processing and parsing is required to properly align the secondary structure defined in the DSSP with each subject protein. It should also be noted that there are sometimes differences between the DSSP proteins and the PDB proteins, such as the exact set of amino acids, and a plethora of other small yet important subtleties that must be handled. Once all subjects are evenly aligned to the query, and the corresponding subject protein secondary structure labels are in place, each column of the alignment set is processed independently to reach a consensus and to label that column as H, E, C, or L. Note that L is not a secondary structure, and represents a lack of information (all spaces) in that given column.


Spaces can occur from gapped and skewed alignments. To reach an independent consensus for each column, the PSI-BLAST generated identity scores are used to calculate the effective percentage identity between the subject and the query, as well as a weight associated with that subject (Chapter 5 has details and formulas). Weighted values for each of the labels H, E, and C are collated, and the maximum sum generates the prediction. If a column has only spaces, an L is generated. However, if no viable homologue is discovered in the MSA file, the code calls the FFA function and engages the fixed-size fragment analysis portion of DARWIN. In using FFA, the query protein is broken into a set of n fragments, where each fragment is the size of a selected window. In the case of DARWIN, the window size is 15. Next, each fragment is used to predict the secondary structure of its central amino acid. To do this, PSI-BLAST is again called for each fragment, and the MSA file associated with each call is collected, parsed, and processed into a complete alignment set with respect to the query fragment. The DSSP files are then accessed to assign corresponding secondary structures to all discovered and aligned subjects. This portion of the code is algorithmically involved, as only the fragment of the DSSP information is needed, but it must be exact and correct. Once secondary structure information is assigned to all known subject fragments, a consensus is reached for the central column of the fragment alignment, and a label of H, E, C, or L is assigned. A weighted measure based on the identity between the subject fragment and the query fragment is used to generate sums for H, E, and C. The maximum sum is used for prediction. If the central column contains only spaces, the L is used. In the last stage of DARWIN, the ensemble of SVMs built from SVMlight is used to create a prediction for the query protein. The SVM ensemble is a collection of three binary SVMs, including one trained for helices, one trained for sheets, and one trained for other.


A query protein can be broken into a collection of test vectors, one for each amino acid, and tested against the three-SVM ensemble. The SVM returning the largest value determines the prediction. If all returned SVM values are less than .6 for H, E, and C, the returned secondary structure is C. Only those amino acids in the query labeled with an L from above are filled in with the SVM prediction. Therefore, the SVM is used in cases where no alignment, either full or fragment, is available. In this way, DARWIN offers maximal and consistent accuracy. Once DARWIN generates the final secondary structure prediction, it is sent to the user's email address. DARWIN details, algorithms, methods, and formulas can be found in Chapter 5.


Figure 6-1. Image of the DARWIN web page that provides the Internet-based graphical user interface to the DARWIN service.


CHAPTER 7
DISCUSSION AND CONCLUSION

Introduction

Protein secondary structure prediction is ubiquitous in protein science and has encouraged research from many fields of study, including computer science, molecular biology, and biophysics. The secondary structure of a protein provides a first step toward native or tertiary structure prediction, and is therefore utilized in the majority of protein folding prediction algorithms. Similarly, protein secondary structure information is routinely used in a variety of scientific areas, including proteome and gene annotation, the determination of protein flexibility, the subcloning of protein fragments for expression, and the assessment of evolutionary trends among organisms.

Protein Secondary Structure Prediction Progress

From a computer science and machine learning standpoint, protein secondary structure prediction can be viewed as a three-state classification problem. The goal of protein secondary structure prediction is to accept a sequence of amino acids in one-letter code, and to predict or classify each amino acid as one of three secondary structure states, namely H for helix, E for sheet, or C for other. The progress of secondary structure prediction has been slow but steady over several decades, with strides resulting from a combination of larger known-structure protein databanks and the use of increasingly sophisticated pattern recognition techniques, such as neural networks and support vector machines. While the initial definition of the protein secondary structure prediction problem was to predict the structure of a sequence of amino acids in the absence of viable homologues, this definition was later blurred and distorted, as methods of prediction began to give way to the goal of prediction, and computational learning algorithms that accessed known protein structures (homologues) began to prove more reliable than both biological modeling and methods that relied on the query sequence alone.


Even in the 1970s, prediction techniques were making use of small sets of available known protein structures, on which simple Bayesian statistics were performed and frequency characteristics were measured. These initial methods sought to find and predict locations in a given protein sequence in which helices or sheets might be more likely to form. However, these single amino acid based statistical methods were not able to offer accuracies above 60%. By the early 1990s, the number of known protein structures had increased considerably, and a plethora of machine learning indirect homology based methods emerged. Pioneered by Rost and Sander in 1993, the representation of each amino acid as a set of 20 binary values progressed into the representation of each amino acid as a set of 20 log likelihood values (see Chapter 4 for details). This exploitation of structural information through the use of alignment-derived log likelihood data would explode into 15 years of machine learning, multiple sequence alignment based methods that used varying machine learning architectures. These methods were further improved after 1997 by the use of a fast alignment heuristic method, PSI-BLAST. PSI-BLAST automatically finds and aligns all similar proteins to a given query protein, then automatically calculates the log likelihood of each of the twenty amino acids at each position of the query protein, based on the discovered similar alignments. Finally, PSI-BLAST produces a position-specific scoring matrix (PSSM), or profile, that contains a matrix of all log likelihood values with respect to the query protein. Therefore, PSI-BLAST produces a matrix of homology data that is ready to use in the creation of training and testing vectors for any machine learning technique. To date, all prediction techniques that exceed 70% prediction accuracy use PSI-BLAST profiles combined with a machine learning ensemble.
Recently, it has been suggested that rather than using PSI-BLAST information only indirectly, through profiles and machine learning, the same information can also be used directly, through methods such as template modeling. This notion has only begun to enter the area of protein secondary structure prediction, and can be found in a few modern methods published after 2004 (Pollastri et al., 2007; Montgomerie et al., 2006). While it is well known that direct homology utilization can increase prediction accuracy, often considerably, its use has been cautious. Because the original goal of secondary structure prediction was to predict the structure of a protein in the absence of homological templates, the use of direct homology may seem in opposition to that goal. On closer observation, however, it is immediately apparent that the use of homology has permeated the study and prediction of secondary structure, and that all information derived from PSI-BLAST profiles (used to train and test all current machine learning techniques) is homology based. In short, without the use of homology, current structure prediction accuracy would still be below 70%. The problem of secondary structure prediction in the absence of homology need not, however, be abandoned or threatened by the use of homology. In fact, a more sensible and broadly beneficial approach is to balance prediction by maximally utilizing all available information: homology when it is available, and more sophisticated prediction methods when it is absent.
Strength of DARWIN

Protein secondary structure prediction can be viewed from many angles, each with its own overall goal. From a user standpoint, the goal of secondary structure prediction is simply to offer the most accurate prediction in the least amount of time. From a machine learning and information standpoint, the key to secondary structure prediction is the detection, differentiation, and maximal utilization of attainable information to offer the highest accuracy in all cases. From a scientific standpoint, and to address the original goal of secondary structure prediction, prediction in the absence of detectable homologues should be given priority. To this end, DARWIN uses three complementary and integrated methods that offer maximal prediction accuracy balanced with a focus on improved prediction in the absence of homologues. DARWIN uses a novel combination of full-sequence weighted homology modeling, fragment analysis and homology modeling, and machine learning through an ensemble of support vector machines. The DARWIN technique exceeds all currently published indirect-homology protein secondary structure prediction methods and servers as tested on the common datasets EVA5 and EVA6. Tests also suggest that DARWIN is highly competitive with more modern methods that use direct homology modeling as part of the prediction process. DARWIN offers an elegant and constrained method of prediction, with only one tunable parameter for the full protein alignment evaluation, namely the exponent (exp), and only two tunable parameters, the exponent (exp) and the fragment size, for the FFA portion (a minimal sketch of this exponent weighting appears at the end of this section). DARWIN results either clearly exceed or are highly competitive with alternative and more complicated techniques that can suffer from over-training.

Future Work and Improvements

As DARWIN is a new server, several additions are currently being investigated. The first addition underway is the upgrade of DARWIN to offer an eight-state prediction option. The second is the training of the DARWIN SVM on a more extensive protein set. Finally, we suspect that considerable information can be added to the prediction process through fragment mining of the PDB, and therefore further investigation and improvement of DARWIN's FFA is underway.
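As a closing illustration of the weighting idea referenced above, the sketch below shows one way an exponent parameter can sharpen a homology consensus: each aligned template votes for H, E, or C at every query position it covers, weighted by its alignment similarity raised to the power exp. The similarity values, the vote tallying, and all names here are illustrative assumptions rather than DARWIN's actual scoring function.

from collections import defaultdict

def weighted_consensus(alignments, query_length, exp=3.0):
    """Combine per-position structure votes from aligned templates.
    Each alignment is (similarity, states), where similarity lies in
    (0, 1] and states maps query positions to 'H', 'E', or 'C'.
    Raising the similarity to exp sharpens the influence of close
    homologues relative to marginal ones."""
    votes = [defaultdict(float) for _ in range(query_length)]
    for similarity, states in alignments:
        weight = similarity ** exp
        for pos, state in states.items():
            votes[pos][state] += weight
    # Positions with no template coverage are left as None so that a
    # downstream predictor (e.g., an SVM ensemble) can fill them in.
    return [max(v, key=v.get) if v else None for v in votes]

# Hypothetical usage: two templates covering parts of a 10-residue query.
templates = [
    (0.90, {i: 'H' for i in range(0, 6)}),   # strong helix template
    (0.40, {i: 'E' for i in range(4, 9)}),   # weaker sheet template
]
print(weighted_consensus(templates, 10, exp=3.0))
# Positions 4-5 stay 'H' because 0.9**3 outweighs 0.4**3; position 9
# is None and would be handed to the machine learning stage.

Raising similarities to a power greater than one is a simple way to let near-identical homologues dominate the consensus while still allowing weak homologues to contribute at positions the strong ones do not cover; uncovered positions fall through to the SVM ensemble.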
LIST OF REFERENCES

Albrecht,M., Tosatto,S.C.E., Lengauer,T., and Valle,G. (2003) Simple consensus procedures are effective and sufficient in secondary structure prediction. PEDS, 16(7), 459-462.

Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W., and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389-3402.

Barton,G.J. (1990) Protein multiple sequence alignment and flexible pattern matching. Methods Enzymol., 183, 403-428.

Benner,S.A., and Gerloff,D. (1991) Patterns of divergence in homologous proteins as indicators of secondary and tertiary structure: a prediction of the structure of the catalytic domain of protein kinases. Adv. Enzyme Regul., 31, 121-181.

Berman,H.M., Westbrook,J., Feng,Z., Gilliland,G., Bhat,T.N., Weissig,H., Shindyalov,I.N., and Bourne,P.E. (2000) The Protein Data Bank. Nucleic Acids Res., 28, 235-242.

Birzele,F., and Kramer,S. (2006) A new representation for protein secondary structure prediction based on frequent patterns. Bioinformatics, 22(21), 2628-2634.

Bishop,C.M. (2006) Pattern Recognition and Machine Learning. Springer, New York.

Bondugula,R., and Xu,D. (2007) MUPRED: A tool for bridging the gap between template-based methods and sequence profile-based methods for secondary structure prediction. Proteins: Struct., Funct., and Bioinformatics, 66, 664-670.

Brenner,S.E., Chothia,C., Hubbard,T.J., and Murzin,A.G. (1996) Understanding protein structure: using SCOP for fold interpretation. Methods Enzymol., 266, 635-643.

Brenner,S.E., Koehl,P., and Levitt,M. (2000) The ASTRAL compendium for sequence and structure analysis. Nucleic Acids Res., 28, 254-256.

Burges,C. (1998) A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2, 121-167.

Burnham,K.P., and Anderson,D.R. (2002) Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach, Second Edition. Springer Science, New York.

Cheng,H., Sen,T.Z., Kloczkowski,A., Margaritis,D., and Jernigan,R.L. (2005) Prediction of protein secondary structure by mining structural fragment database. Polymer, 46, 4314-4321.

Cheng,H., Sen,T.Z., Jernigan,R.L., and Kloczkowski,A. (2007) Consensus Data Mining (CDM) protein secondary structure prediction server: combining GOR V and Fragment Database Mining (FDM). Bioinformatics, 23(19), 2628-2630.

Chothia,C., and Lesk,A.M. (1986) The relation between the divergence of sequence and structure in proteins. EMBO J., 5, 823-826.

Chou,P.Y., and Fasman,G.D. (1974) Prediction of protein conformation. Biochemistry, 13(2), 222-245.

Cole,C., Barber,J.D., and Barton,G.J. (2008) The Jpred 3 secondary structure prediction server. Nucleic Acids Res., 36, W197-W201.

Cuff,J.A., and Barton,G.J. (1999) Evaluation and improvement of multiple sequence methods for protein secondary structure prediction. Proteins, 34, 508-519.

Dayhoff,M.O. (1978) Atlas of Protein Sequence and Structure. Natl. Biomed. Res. Found., Washington, 5(3), 345-352.

Doolittle,R.F. (1981) Similar amino acid sequences: chance or common ancestry? Science, 214, 149-159.

Drucker,H., Wu,D., and Vapnik,V.N. (1999) Support vector machines for spam categorization. IEEE Transactions on Neural Networks, 10(5), 1048-1054.

Duda,R.O., and Hart,P.E. (1973) Pattern Classification and Scene Analysis. John Wiley & Sons, New York.

Eddy,S.R. (1998) Profile hidden Markov models. Bioinformatics, 14, 755-763.

Frishman,D., and Argos,P. (1996) Incorporation of non-local interactions in protein secondary structure prediction from the amino acid sequence. Protein Engineering, 9(2), 133-142.

Gardy,J.L., Spencer,C., Wang,K., Ester,M., Tusnady,G.E., Simon,I., Hua,S., deFays,K., Lambert,C., Nakai,K., and Brinkman,F.S. (2003) PSORT-B: improving protein subcellular localization prediction for Gram-negative bacteria. Nucleic Acids Res., 31, 3613-3617.

Garnier,J., Gibrat,J.F., and Robson,B. (1996) GOR method for predicting protein secondary structure from amino acid sequences. Methods Enzymol., 266, 541-553.

Garnier,J., Osguthorpe,D.J., and Robson,B. (1978) Analysis of the accuracy and implications of simple methods for predicting the secondary structure of globular proteins. J. Mol. Biol., 120(1), 97-120.

Goad,W.B., and Kanehisa,M.I. (1982) Pattern recognition in nucleic acid sequences. A general method for finding local homologies and symmetries. Nucleic Acids Research, 10(1), 247-263.

Gotoh,O. (1982) An improved algorithm for matching biological sequences. J. Mol. Biol., 162, 705-708.

Gribskov,M., McLachlan,A.D., and Eisenberg,D. (1987) Profile analysis: detection of distantly related proteins. Proc. Natl. Acad. Sci. USA, 84(13), 4355-4358.

Henikoff,S., and Henikoff,J.G. (1992) Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA, 89, 10915-10919.

Henikoff,J.G., and Henikoff,S. (1996) Using substitution probabilities to improve position-specific scoring matrices. Comput. Appl. Biosci., 12, 135-143.

Holley,L.H., and Karplus,M. (1989) Protein secondary structure prediction with a neural network. Proc. Natl. Acad. Sci. USA, 86, 152-156.

Hu,H., Phang,C., He,J., Harrison,R., and Pan,Y. (2005) Protein secondary structure prediction using support vector machine with a PSSM profile and an advanced tertiary classifier. Proceedings of the 2005 IEEE Computational Systems Bioinformatics Conference Workshops (CSBW'05).

Hua,S., and Sun,Z. (2001) A novel method of protein secondary structure prediction with high segment overlap measure: support vector machine approach. J. Mol. Biol., 308, 397-407.

Hung,L.H., and Samudrala,R. (2003) PROTINFO: secondary and tertiary protein structure prediction. Nucleic Acids Res., 31(13), 3296-3299.

Jaynes,E.T. (1957) Information theory and statistical mechanics. Physical Review, 106, 620.

Joachims,T. (1998) Text categorization with support vector machines: learning with many relevant features. Proceedings of the European Conference on Machine Learning, Springer, New York.

Joachims,T. (2002) Optimizing search engines using clickthrough data. Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD), ACM.

Jones,D. (1999) Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol., 292, 195-202.

Kabsch,W., and Sander,C. (1983) Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 22(12), 2577-2637.

Karypis,G. (2006) YASSPP: better kernels and coding schemes lead to improvements in SVM-based secondary structure prediction. Proteins, 64, 575-586.

Kim,H., and Park,H. (2003) Protein secondary structure prediction based on an improved support vector machines approach. Protein Eng., 16(8), 553-560.

Kloczkowski,A., Ting,K.L., Jernigan,R.L., and Garnier,J. (2002) Combining the GOR V algorithm with evolutionary information for protein secondary structure prediction from amino acid sequence. Proteins, 49, 154-166.

Koh,I., Eyrich,V.A., Marti-Renom,M., Przybylski,D., Madhusudhan,M., Eswar,N., Grana,O., Pazos,F., Valencia,A., Sali,A., and Rost,B. (2003) EVA: evaluation of protein structure prediction servers. Nucleic Acids Research, 31(13), 3311-3315.

Koonin,E.V., and Galperin,M.Y. (2003) Sequence, Evolution, Function: Computational Approaches in Comparative Genomics. Kluwer Academic Publishers.

Lin,K., Simossis,V.A., Taylor,W.R., and Heringa,J. (2005) A simple and fast secondary structure prediction method using hidden neural networks. Bioinformatics, 21(2), 152-159.

Lipman,D.J., and Pearson,W.R. (1985) Rapid and sensitive protein similarity searches. Science, 227, 1435-1441.

Liu,J., and Rost,B. (2001) Comparing function and structure between entire proteomes. Protein Science, 10, 1970-1979.

McClelland,J.L., Rumelhart,D.E., and Hinton,G.E. (1986) The appeal of parallel distributed processing. In Parallel Distributed Processing, MIT Press, Cambridge, MA, 1, 3-44.

McCulloch,W., and Pitts,W. (1943) A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5, 115-133.

McGuffin,L.J., Bryson,K., and Jones,D.T. (2001) What are the baselines for protein fold recognition? Bioinformatics, 17, 63-72.

Meiler,J., and Baker,D. (2003) Coupled prediction of protein secondary and tertiary structure. PNAS, 100(21), 12105-12110.

Mewes,H.W., Frishman,D., Mayer,K.F., Munsterkotter,M., Noubibou,O., Pagel,P., Rattei,T., Oesterheld,M., Ruepp,A., and Stumpflen,V. (2006) MIPS: analysis and annotation of proteins from whole genomes in 2005. Nucleic Acids Res., 34, 169-172.

Montgomerie,S., Sundararaj,S., Gallin,W.J., and Wishart,D.S. (2006) Improving the accuracy of protein secondary structure prediction using structural alignment. BMC Bioinformatics, 7, 301.

Morvai,G., and Weiss,B. (2008) Estimating the lengths of memory words. IEEE Trans. on Inf. Theory, 54(8), 3804-3807.

Myers,J., and Oas,T. (2001) Preorganized secondary structure as an important determinant of fast protein folding. Nature Structural Biology, 8, 552-558.

Needleman,S.B., and Wunsch,C.D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol., 48, 443-453.

Nishikawa,K. (1983) Assessment of secondary structure prediction of proteins: comparison of the computerized Chou-Fasman method with others. Biochimica et Biophysica Acta, 748, 285-299.

Notredame,C., Higgins,D.G., and Heringa,J. (2000) T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol., 302, 205-217.

Pearson,W.R., and Lipman,D.J. (1988) Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. USA, 85(8), 2444-2448.

Pollastri,G., Przybylski,D., Rost,B., and Baldi,P. (2002) Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles. Proteins, 47, 228-235.

Pollastri,G., and McLysaght,A. (2004) Porter: a new, accurate server for protein secondary structure prediction. Bioinformatics, 21(8), 1719-1720.

Pollastri,G., Martin,A.J.M., Mooney,C., and Vullo,A. (2007) Accurate prediction of protein secondary structure and solvent accessibility by consensus combiners of sequence and structure information. BMC Bioinformatics, 8, 201.

Przybylski,D., and Rost,B. (2001) Alignments grow, secondary structure prediction improves. Proteins: Structure, Function, and Genetics, 46, 195-205.

Qian,N., and Sejnowski,T. (1988) Predicting the secondary structure of globular proteins using neural network models. J. Mol. Biol., 202, 865-884.

Ramachandran,G.N., and Sasisekharan,V. (1968) Conformation of polypeptides and proteins. Adv. Protein Chem., 23, 283-438.

Rieke,F., Warland,D., van Steveninck,R., and Bialek,W. (1997) Spikes: Exploring the Neural Code. MIT Press, Cambridge, MA.

Rosenblatt,F. (1962) Principles of Neurodynamics. Spartan Books, New York.

Rost,B. (1999) Twilight zone of protein sequence alignments. Protein Engineering, 12, 85-94.

Rost,B. (2003) Prediction in 1D: secondary structure, membrane helices, and accessibility. Methods Biochem. Anal., 44, 559-587.

Rost,B., and Eyrich,V. (2001) EVA: large-scale analysis of secondary structure predictions. Proteins, 45, 192-199.

Rost,B., and Sander,C. (1993) Prediction of protein secondary structure at better than 70% accuracy. J. Mol. Biol., 232, 584-599.

Rumelhart,D., Hinton,G., and Williams,R. (1988) Learning internal representations by error propagation. In Neurocomputing, MIT Press, Cambridge, MA, 675-695.

Sankoff,D., and Kruskal,J.B. (1983) Time Warps, String Edits and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley.

Schmidt,M., and Gish,H. (1996) Speaker identification via support vector classifiers. Proceedings of the International Conference on Acoustics, Speech and Signal Processing, IEEE Press, 105-108.

Scholkopf,B., Burges,C., and Smola,A. (1999) Making large-scale SVM learning practical. In Advances in Kernel Methods: Support Vector Learning, MIT Press, Cambridge, MA.

Sen,T.Z., Jernigan,R.L., Garnier,J., and Kloczkowski,A. (2005) GOR V server for protein secondary structure prediction. Bioinformatics, 21(11), 2787-2788.

Shannon,C.E. (1948) A mathematical theory of communication. Bell System Technical Journal, 27, 379-423 and 623-656.

Sitbon,E., and Pietrokovski,S. (2007) Occurrence of protein structure elements in conserved sequence regions. BMC Structural Biology, 7, 3.

Smith,T.F., and Waterman,M.S. (1981) Identification of common molecular subsequences. J. Mol. Biol., 147, 195-197.

Tatusov,R.L., Altschul,S.F., and Koonin,E.V. (1994) Proc. Natl. Acad. Sci. USA, 91, 12091-12095.

Thompson,J.D., Higgins,D.G., and Gibson,T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22(22), 4673-4680.

Van Domselaar,G.H., Stothard,P., Shrivastava,S., Cruz,J.A., Guo,A., Dong,X., Lu,P., Szafron,D., and Greiner,R. (2005) BASys: a web server for automated bacterial genome annotation. Nucleic Acids Res., 33, 455-459.

Vapnik,V. (1995) The Nature of Statistical Learning Theory. Springer, New York.

Vapnik,V. (1998) Statistical Learning Theory. Wiley-Interscience, New York.

Voet,D., and Voet,J. (2005) Biochemistry, Third Edition. Wiley Higher Education.

Wang,L., Liu,J., Li,Y., and Zhou,H. (2004) Predicting protein secondary structure by a support vector machine based on a new coding scheme. Genome Informatics, 15(2), 181-190.

Ward,J., McGuffin,L.J., Buxton,B., and Jones,D. (2003) Secondary structure prediction with support vector machines. Bioinformatics, 19(13), 1650-1655.

Wilbur,W.J., and Lipman,D.J. (1983) Rapid similarity searches of nucleic acid and protein data banks. Proc. Natl. Acad. Sci. USA, 80, 726-730.

Wishart,D.S., and Case,D.A. (2001) Use of chemical shifts in macromolecular structure determination. Methods Enzymol., 338, 3-34.

Wu,K., Lin,H., Chang,J., Sung,T., and Hsu,W. (2004) HYPROSP: a hybrid protein secondary structure prediction algorithm, a knowledge-based approach. Nucleic Acids Research, 32(17), 5059-5065.
BIOGRAPHICAL SKETCH

Ami Gates left high school before the age of 16, soon after entering community college by attaining a GED. Ami completed an Associate of Arts degree in mathematics by age 17 and was awarded both the Math and Science Achievement Awards for that year. Ami then transferred to Florida Atlantic University, where at 19 she became the youngest student to graduate with a Bachelor of Arts degree in mathematics. After graduation, Ami traveled the United States for several months and then settled in Gainesville, Florida. While in Gainesville, Ami engaged in considerable volunteer efforts, including the Children's Cancer Foundation and the Welfare to Work Foundation. Ami also volunteered at the downtown community college, where she tutored underprivileged students in math and reading. In addition to volunteer activities, Ami has been an educator for over 17 years, and still continues with both the development and instruction of mathematics and computer science courses. Ami has won seven awards for dedication to student learning and education from four different universities. In 1997, Ami completed a Master of Science in math education, and in 2002, Ami completed a second Master of Science in computer science and engineering. Shortly after, Ami began work toward a doctorate in computer science, which she will complete in 2008. Ami's favorite hobby is health research, and for leisure, Ami is an artist and a ballroom dancer.