Nucleotide Sequence Compositional Contrasts And Computational Gene Prediction

MISSING IMAGE

Material Information

Title:
Nucleotide Sequence Compositional Contrasts And Computational Gene Prediction
Physical Description:
1 online resource (138 p.)
Language:
english
Creator:
Oden, Steven M
Publisher:
University of Florida
Place of Publication:
Gainesville, Fla.
Publication Date:

Thesis/Dissertation Information

Degree:
Doctorate ( Ph.D.)
Degree Grantor:
University of Florida
Degree Disciplines:
Medical Sciences, Genetics (IDP)
Committee Chair:
Brocchieri, Luciano
Committee Members:
Zhou, Lei
Jin, Shouguang
Wayne, Marta L

Subjects

Subjects / Keywords:
annotation -- computational -- gene -- genome -- n-pact -- prediction
Genetics (IDP) -- Dissertations, Academic -- UF
Genre:
Medical Sciences thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, terriorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract:
Gene coding sequences exhibit contrasting global compositional properties in the three codon positions, depending on the overall base composition of the sequence. General rules on the base content at the three codon positions as a function of the overall base content can be identified and exploited to score sequences for their coding potential. More generally, the period-three structure of coding regions imposes compositional periodicity to the sequence that, irrespective of the specific type of contrasts that we may expect to see, result in a significantly non-random distribution of bases. We develop novel computational algorithms to identify and characterize protein coding sequences. The methods we develop will be useful for finding genes in anonymous fragments of DNA or to annotate entire genomes. Our approach will deviate from previous approaches by identifying coding sequences through detecting periodic contrasts in the nucleotide composition of codon positions that arise in response to mutation and selection. This will allow our method to predict genes free of any particular gene model or any preconceived assumptions on codon usage. Our approach complements available algorithms that base their predictions on complex compositional rules and dependencies empirically learned from a set of known genes. Although these approaches have reached considerable success, their predictions for any particular input are difficult to interpret and depend on one or a few gene models learned for each organism. Therefore these methods cannot identify genes on very short, anonymous input sequences, and may miss genes of anomalous composition. By identifying simpler compositional contrasts predicted to arise in coding sequences through random mutation and different modes of selection, we can identify, in any DNA sequence, potential coding regions independently from detailed compositional models and we can explicitly characterize the features that suggest each prediction. In addition, we can express compositional characteristics of coding regions as a function of the local compositional properties of the genome, accounting for the well-known dependence of gene composition on mutational biases.
General Note:
In the series University of Florida Digital Collections.
General Note:
Includes vita.
Bibliography:
Includes bibliographical references.
Source of Description:
Description based on online resource; title from PDF title page.
Source of Description:
This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility:
by Steven M Oden.
Thesis:
Thesis (Ph.D.)--University of Florida, 2012.
Local:
Adviser: Brocchieri, Luciano.
Electronic Access:
RESTRICTED TO UF STUDENTS, STAFF, FACULTY, AND ON-CAMPUS USE UNTIL 2013-06-30

Record Information

Source Institution:
UFRGP
Rights Management:
Applicable rights reserved.
Classification:
lcc - LD1780 2012
System ID:
UFE0044963:00001


This item is only available as the following downloads:


Full Text

PAGE 1

1 NUCLEOTIDE SEQUENCE COMPOSITIONAL CONTRASTS AND COMPUTATIONAL GENE PREDICTION By STEVE ODEN A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY UNIVERSITY OF FLORIDA 2012

PAGE 2

2 2012 Steve Oden

PAGE 3

3 To my Mom

PAGE 4

4 ACKNOWLEDGMENTS I thank the chair and members of my supervisory committee for their mentoring. I thank my family for their loving encouragement, which motivated me to complete my study and my friends for the welcomed respite they provided.

PAGE 5

5 TABLE OF CONTENTS page ACKNOWLEDGMENTS .................................................................................................. 4 LIST OF FIGURES .......................................................................................................... 7 LIST OF ABBREVIATIONS ........................................................................................... 11 ABSTRACT ................................................................................................................... 12 CHAPTER 1 BACKGROUND ...................................................................................................... 14 Probabilistic Gene Predicting Algorithms ................................................................ 14 GeneMark ............................................................................................................... 16 GeneMark.hmm ...................................................................................................... 21 EasyGene ............................................................................................................... 22 Glimmer .................................................................................................................. 23 Glimmer 2.0 ............................................................................................................ 25 Glimmer 3.0 ............................................................................................................ 27 Prodigal ................................................................................................................... 28 Limitations of Probabilistic Gene Prediction Methods ............................................. 33 Gene Prediction Methods Based on Global Compositional Properties ................... 34 Frame Analysis ....................................................................................................... 35 Determining Potential Coding Regions ................................................................... 38 2 FRAME ANALYSIS OF SPROFILES AND APPLICATION TO ANNOTATION IMPROVEMENT ..................................................................................................... 44 S profile Asymmetries Identify New Potential Gene Coding Regions and Errors in Annotation ........................................................................................................ 45 Assessing Contrasts Between Codon Positions ..................................................... 47 Application and Extension of Frame Analysis to Annotation Improvement ............. 48 3 RESEARCH DESIGN AND METHODS .................................................................. 55 ORF Structure and Local Compositional Properties. .............................................. 58 Codon Scores ......................................................................................................... 58 High Scoring Seg ments .......................................................................................... 60 Significance of Scores ............................................................................................ 62 Simulations and Generation of Stop Less Random Sequences ............................. 64 Pseudocoding Sequences: Random Sequences with Codon Position Specific Nucleotide Frequencies. ...................................................................................... 66 The G test and Nucleotide Associations with Codon Positions. .............................. 67 Shadow Genes ....................................................................................................... 67

PAGE 6

6 Identification of Generalized Associations of Base Types with Codon Position ...... 69 Sequence Complexity ............................................................................................. 71 4 IMPLEMENTATION ................................................................................................ 81 From High Sc oring Segments to Gene Prediction and Characterization ................ 81 5 RESULTS ............................................................................................................... 87 Compositional Contrasts i n Annotated Prokaryote Genes ...................................... 87 Newly Predicted Coding Sequences in Prokaryote Genomes ................................ 90 Number of Predicted ORFs as a Function of Genome GC Content ........................ 92 ORF Conservation .................................................................................................. 92 Pseudogenes .......................................................................................................... 92 Concordance Between New Gene Predictions and Other Predictors ..................... 94 Application to Pseudomonas aeruginosa ................................................................ 96 Experimental Verification of Computationally Predicted Protein Coding Regions 100 Application to t he Human Genome ....................................................................... 101 Processing Time ................................................................................................... 103 6 DISCUSSION ....................................................................................................... 128 LIST OF REFERENCES ............................................................................................. 134 BIOGRAPHICAL SKETCH .......................................................................................... 138

PAGE 7

7 LIST OF FIGURES Figure page 1 1 Distribution of GC bases in first, second and third codon positions as a function of the overall GC content of the coding region ...................................... 41 1 2 Distribution of A+G bases in first, second and third codon positions as a function of the overall A +G content of the coding region .................................... 41 1 3 A sequence partitioned into three frames with every third nucleotide included in the fra me ......................................................................................................... 42 1 4 S profiles for positions 1 10K of the genome sequence of Pseudomonas aeruginosa PAO1 ............................................................................................... 42 1 5 Total GC plotted for a given window size remains largely constant and hovers around the mean value for the given genome ......................................... 43 1 6 Graphical depiction of the approach used to determine coding strand and frame from resulting S profile graphs ................................................................. 43 2 1 An example where frame analysis suggests a published annotation should be modified ......................................................................................................... 51 2 2 An example of a published annotated gene (0322) that is not supported ........... 51 2 3 An example of an annotated gene (Adeh_3257) that is not supported ............... 52 2 4 An example of published annotations (1426, 1427) that are contradicted by the S profile signature ........................................................................................ 52 2 5 Another example of annotated gene predictions that are contradicted by the S profile signal .................................................................................................... 53 2 6 An example from Anaeromyxobacter dehalogenans showing a case where the annotated gene prediction is contradicted by the S profile signal ................. 53 2 7 A region on the chromosome of the high GC organism Anaeromyxobacter dehalogenans ..................................................................................................... 54 2 8 Difference in GC content bet ween codon positions three and two in relation to overall GC content .......................................................................................... 54 3 1 Log odds ratio scores ......................................................................................... 73 3 2 Base composition at the three codon positions of genes in relation to the overall base composition of the coding sequence. ............................................. 73

PAGE 8

8 3 3 Scores as a function of GC content at the three codon positions and scores for sixtyone trinucleotides as a function of GC content ...................................... 74 3 4 Cumulating trinucleotide scores from the 3 to 5 will result in scores growing positive ............................................................................................................... 74 3 5 Threshold scores at the indicated significance levels ......................................... 75 3 6 Threshold scores at significance level p = 0.01 obtained from simulations compared to upper bounds of scores from highscoring segment theory. .......... 75 3 7 Threshold scores at significance level p = 0.001 obtained from simulations compared to upper bounds of scores from highscoring segment theory. .......... 76 3 8 Expected nucleotide type frequencies in codon positions after removing stop codons ................................................................................................................ 76 3 9 Expected nucleotide type frequencies in codon positions from marginal values ................................................................................................................. 77 3 10 Odds ratios ......................................................................................................... 77 3 11 Association between nucleotide types and codon posi tions ............................... 77 3 12 Typical compositional contrasts observed in a GC rich sequence ...................... 78 3 13 Contingency table of basetype in three frames of a sequence of length n ........ 78 3 14 The G test procedure is here applied to a known ORF from P. aeruginosa PAO1 .................................................................................................................. 79 3 15 Estimates of G score threshold scores at significance level p = 0.001 as a function of sequence length and composition ..................................................... 80 3 16 Estimates of G score threshold scores at significance level p = 0.01 as a function of sequence length and composition ..................................................... 80 4 1 The N PACT (N Profile Analysis Computational Tool) web interface ................. 85 4 2 Flow chart showing program execution ............................................................... 85 4 3 N PACT output showing variation of GC in three phases along a genome ........ 8 6 5 1 Total num ber of genomes in each GC class ..................................................... 104 5 2 Ratio of nucleotide contribution from each GC class normalized by the nucleotide total of the complete collection of genomes .................................... 104 5 3 Density of annotated genes per kilobase pair ................................................... 105

PAGE 9

9 5 4 Summary of statistics for each class of published annotations and their total hit types at two significance levels .................................................................... 105 5 5 H type hits in published CDSs as a function of length ...................................... 106 5 6 G type hits in publish ed CDSs as a function of length ...................................... 106 5 7 The union of G type hits and H type hits in published CDSs as a function of length ................................................................................................................ 107 5 8 H ty pe hits in published C DSs as a function of GC content .............................. 107 5 9 G type hits in published C DSs as a function of GC content ............................. 108 5 10 The union of G type hits and H type hits in published C DSs as a function of GC content ....................................................................................................... 108 5 11 Frequency as a function of length and GC content of characterized g enes with H type or G type hits ................................................................................. 109 5 12 Classes of gene predictions from our method as compared to the published annotati ons of 1,083 prokaryote genomes ....................................................... 110 5 13 Total number of new gene predictions (per meg abase pair) as a function of GC .................................................................................................................... 111 5 14 Conservation of newly predicted ORFs showing over 30% of newly predicted genes have homologous sequences ................................................................ 111 5 15 Pseudogenes and newly predicted genes in 65 prokaryote genomes .............. 112 5 16 Length distribution of newly predicted genes compared to known homologs found in t he non redundant BLAST database .................................................. 112 5 17 Newly predicted genes, 200,178 in total, confirmed by n other gene predicting methods ........................................................................................... 113 5 18 Newly predicted genes confirmed by other specified gene predicting algorithm s ......................................................................................................... 113 5 19 Top quality newly predicted genes, 10,700 in total, confirmed by n other gene predicting methods ........................................................................................... 114 5 20 Top quality newly predicted genes confirmed by specified gene predicting method ............................................................................................................. 114 5 21 Characterization of coding sequences also predicted by four gene detection methods investigated ........................................................................................ 115

PAGE 10

10 5 22 Characterization of coding sequences uniquely predicted by our method ........ 116 5 23 Length conservation of all newly predicted coding regions ............................... 117 5 24 Length conservation of topquality newly predicted proteincoding re gions ...... 117 5 25 Total genes in the published annotations of four strains of Pseudomonas aeruginosa along with new genes predicted by the described methods ........... 118 5 26 Length distribution of new gene predictions in Pseudomonas aeruginosa strain PAO1 as found by the described method ............................................... 118 5 27 Comparing predicted genes of P. aeruginosa PAO1 with annotated or newly predicted genes in other strains of P. aeruginosa ............................................. 119 5 28 S profile of high confidence newly predicted genes in Pseudomonas aeruginosa PAO1 ............................................................................................. 120 5 29 S profile of high confidence newly predicted genes in Pseudomonas aeruginosa PAO 1 ............................................................................................. 121 5 30 S profile of high confidence newly predicted genes in Pseudomonas aeruginosa PAO1 ............................................................................................. 122 5 31 S profile of high confidence newly predicted genes in Pseudomonas aeruginosa PAO1 ............................................................................................. 123 5 32 S profile of high confidence newly predicted genes in Pseudomonas aeruginosa PAO1 ............................................................................................. 124 5 33 Number of new sequence segments with significant coding potential in the human genom e ................................................................................................. 125 5 34 Distribution of scores among new hits obtained from the original genome sequence and ran dom hits from shuffled sequence ......................................... 125 5 35 Length distribution of hits from the original genome sequ ence and after randomization ................................................................................................... 126 5 36 Distribution of Shannons entropi es among new hits and published exons ...... 126 5 37 Processing time of the algorithm as a function of sequence length .................. 127

PAGE 11

11 LIST OF ABBREVIATION S AT Adenine + Thymine bp base pair cDNA complementary Deoxyribonucleic Acid DNA Deoxyribonucleic Acid GC Guanine + Cytosine HSS High Scoring Segment IFAS Institute of Food and Agricultural Sciences IMM Interpolated Markov Model ICM Interpolated Context Model mRNA messenger Ribonucleic Acid NCBI National Center for Biotechnology Information NR N onR edundant Database of P rot ein S equences ORF Open Reading Frame RBS Ribosome Binding Site RF Reading Frame SD ShineDalgarno URL Universal Resource Locator

PAGE 12

12 Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy NUCLEOTIDE SEQUENCE COMPOSITIONAL CONTRASTS AND COMPUTATIONAL GENE PREDICTION By Steve Oden December 2012 Chair: Luciano Brocchieri Major: Medical Sciences Gene coding sequences exhi bit contrasting global compositional properties in the three codon positions, depending on the overall base composition of the sequence. General rules on the base content at the three codon positions as a function of the overall base content can be identif ied and exploited to score sequences for their coding potential. More generally, the periodthree structure of coding regions imposes compositional periodicity to the sequence that, irrespective of the specific type of contrasts that we may expect to see, result in a significantly nonrandom distribution of bases. We develop novel computational algorithms to identify and characterize proteincoding sequences. The methods developed will be useful for finding genes in anonymous fragments of DNA or to annotate entire genomes. The proposed approach will deviate from previous approaches by identifying coding sequences through detecting periodic contrasts in the nucleotide composition of codon positions that arise in response to mutation and selection. This will allow gene prediction free of any particular gene model or any preconceived assumptions on codon usage. This approach will complement available algorithms that base their predictions on complex

PAGE 13

13 compositional rules and dependencies empirically learned from a set of known genes. Although these approaches have reached considerable success, their predictions for any particular input are difficult to interpret and depend on one or a few gene models learned for each organism. Therefore these methods cannot identif y genes on very short, anonymous input sequences, and may miss genes of anomalous composition. By identifying simpler compositional contrasts predicted to arise in coding sequences through random mutation and different modes of selection, we can identify, in any DNA sequence, potential coding regions independently from detailed compositional models and we can explicitly characterize the features that suggest each prediction. In addition, we can express compositional characteristics of coding regions as a function of the local compositional properties of the genome, accounting for the well known dependence of gene composition on mutational biases.

PAGE 14

14 CHAPTER 1 BACKGROUND Probabilistic Gene Predicting Algorithms Detecting proteins is a necessity for biologists studying organisms and how organisms respond to their environment. When faced with the task of detecting proteins biologists can take several approaches. There are a variety of techniques one can apply to try to isolate and detect proteins from biological samples (Bollag et al. 1996). These methods are suitable when working with a small number of proteins or for very well characterized proteins. However, sometimes researchers will be working only with sequence information or the goal may be to achieve an understanding of an organisms complete proteome. In these situations researchers must turn to computational methods of gene predictions. There are two basic classes of computational gene prediction consisting of sequence similarity searches (extrinsic methods) and ab initio gene prediction (intrinsic methods). Sequence similarity approaches are based on the assumption that functionally important regions of genomes are evolutionarily conserved and this conservation will be reflected in sequence identity. If t he sequence under investigation has a high degree of similarity to a sequence that has previously been shown to code for a protein, it is reasonable to infer that due to this similarity the sequence in question is also protein coding and will have a similar function inside the cell. Sequence similarity approaches have limited efficacy due to the limited number of sequenced proteins in the collection of known proteins, further, the homology may be restricted to a small segment of the protein. This can happen if the active site or functionally important part of the protein occupies a short stretch of the sequence. Mutations in less functionally important parts of the protein may accumulate and lead to

PAGE 15

15 loss of significant homology. This can result in two truly homologous proteins being misidentified as not homologous to each other. Despite these limitations, very sophisticated extrinsic approaches to gene detection have achieved considerable levels of success, BLAST (Altschul et al. 1990) is an archetypal exampl e of an extrinsic approach to gene detection. The second approach to computational gene prediction, ab initio gene prediction relies solely on the information contained in the DNA sequence. More specifically, two forms of sequence information: signal sens ors and content sensors. Signal sensors refer to sequence motifs such as start codons, stop codons, splice sites, polypyrimidine tracts, etc. specific to a particular gene. Content sensors refer to patterns of nucleotide usage (compositional bias, codon us age, etc. ) in the sequence that can be detected by statistical detection algorithms to distinguish coding sequences from noncoding sequences. Even simply calculating the autocorrelation function of a sequence can reveal a characteristic periodicity of protein coding sequences (Fickett 1982). Among the large variety of coding pattern usage that has been investigated, hexamer usage was shown to be the most discriminative variable (Fickett 1992). Many algorithmic approaches have been developed to detect these coding sequence patterns, some examples include: dynamic programming, neural network (Snyder 1995), linear discriminant analysis (Solovyev 1997), Markov model (Borodovsky 1993), Hidden Markov model (Burge 1997) and several variations of Markov model and H idden Markov model approaches (Salzberg et al. 1998; Lukashin 1998). The most widely used gene prediction algorithms are based on hidden Markov m odel s (Guigo et al. 2000). These methods utilize a Markov chain m odel framework to learn conditional

PAGE 16

16 probabilit ies of nucleotide usage along with Bayesian decision making processes to score a sequence for coding probability. The underlying formula from Markov chain theory calculates the probability that a given sequence of states (in this case a nucleotide sequence) i1, i2, ... ik can be observed in the realization of the Markov chain process starting from a given position in a sequence as P ( i1, i2, ... ik) = P0 ( i1) P ( i2| i1) P ( i3| i2)... P ( ik| ik 1). In h idden Markov m odel, transitions between submodels co rresponding to particular gene features are modeled as unobserved Markov processes, which determines the probability of observing particular nucleotides. Despite great advances of computational gene prediction methods, it is well known that genome annotations contain erroneous predictions and that the annotation errors have a tendency to propagate (Doerks et al. 1998; Galperin 1998). The following sections introduce and explain, in a general sense, some of the more recent and successful methods that have been advanced for computational gene prediction. GeneMark GeneMark (Borodovsky 1993) is a widely popular and successful gene prediction method based on probabilistic analysis of sequence content. GeneMark was used to annotate the first completely sequenced bacterium, Haemophilus influenza, and the first completel y sequenced archaeon, Methanococcus jannaschii The principle GeneMark employs to make gene predictions is founded on Markov chain models. A Markov chain is a sequence of random variables Xi, where the probability distribution for each Xi depends only on t he preceding k variables Xi 1, ..., Xi k, for a constant k. In a kth order Markov chain, the distribution of a random variable Xt, t = time (or nucleotides in the case of Markov models for DNA), depends only on the k variables immediately preceding it. In a first order Markov chain, for example, the distribution of Xt depends

PAGE 17

17 only on Xt 1. Thus in a first order Markov chain model, dinucleotide dependencies are being modeled. Following that logic, a 0th order Markov chain models simply the frequency of eac h nucleotide. In the 0th order Markov model, the probability of observing Xt does not depend on other variables. To calculate posterior probabilities of the observed sequence under an inhomogeneous Markov model of the n th order for coding sequences we nee d to estimate the codonposition dependent probability of observing a nucleotide type in a codon position conditional on observing the previous n nucleotides. We also need to estimate similar probabilities for sequences complementary to the coding sequence, and for noncoding sequences. To make accurate gene predictions the GeneMark algorithm builds species specific inhomogeneous Markov chain models of protein coding DNA sequences as well as a homogeneous Markov chain model of non coding sequences. The inho mogeneity of the Markov models results from assigning different dependencies to nucleotide probabilities at the three codon positions, that is, the probabilities of observing nucleotide type Xi following the preceding k nucleotides are estimated separately for codon position 1, 2 and 3, resulting in three distinct conditional probability matrices Pm ( m = 1, 2, 3), one for each codon position. The parameters of the first order model for non coding are derived from the counts of monoand dinucleotides N ( i ) a nd N ( ij) i j = 1, 2, 3, 4 calculated from the training set of noncoding sequences. Parameters of the coding models are estimated from the training set of coding sequences. Large training sets are required and are obtained by using either segments of the genome which are uninterrupted by stop codons over a predetermined minimum length (e.g. > 500 nt), or by translating ORFs and using only those translated ORFs that share a high degree of sequence

PAGE 18

18 homology with sequences known to encode functional protein. Once a traini ng set has been compiled, counts of monoand dinucleotides are tabulated into three vectors Nm( i ) and Nm( ij) depending on i s position in the codon, m = 1, 2, 3. This tabulation procedure can be generalized to storing ktuples and k + 1 tuples and split into three subsets depending on the position of the first nucleotide in the ktuple therefore obtaining the parameters for the inhomogeneous coding Markov model can be abstracted to account for any model order k. In order to change the raw counts of observ ations N ( i ) and N ( ij) into probabilities it is necessary to apply a normalization by dividing each j th member of the i th row by the i th marginal. For example, consider an inhomogeneous first order Markov chain model. We need to estimate the probabilities of observing any particular nucleotide type Y in codon position ( i + 1) % 3 given nucleotide X in codon position i % 3. Suppose for example that we want to know the probability of observing a C in second codon position [( i + 1) % 3 = 2] following nucleot ide A in first codon position ( i % 3 = 1). In our training set of coding sequences we look for all occurrences of nucleotide A in the first codon position. Say there are N1(A) such positions. The number of times these positions are followed by C is then counted. Say the count is N1(C|A). The estimate of P1(C|A) is given by the frequency of A following a C: N1(C|A)/N1(A). A 4x4 table of probabilities similarly estimated for all XY dinucleotide types in codon positions (1, 2) can be generated. Similar tables can be generated for dinucleotides in codon positions (2, 3 and (3, 1)). Three more tables are generated for sequences

PAGE 19

19 complementary to a coding sequence, where complementary dinucleotides follow the order (3, 2), (2, 1), or (1, 3). Another table is generated to describe a homogeneous Markov model for noncoding sequences. After computing the conditional probabilities of nucleotide usages for the noncoding state and six coding frame states posterior probabilities of a sequence fragment can be computed for each of the seven states as: To see how the posterior probability for an input sequence is determined consider a nucleotide fragment F = f1, f2,..., fn. The probability that F appears in a noncoding region (likelihood of non coding given sequence F ) is calculated by the first order Markov chain as The order of the model can be increased, the effect of increasing model order is to consider more prior observations influencing the probability statistic governing the current observation. In higher order Markov models, instead of looking for positions following certain mononucleotide types, (first order Markov model), we must look at positions following certain dinucleotides (second order Markov model), trinucleotides (third order Markov model), etc. to calculate the frequency of the nucleotide types following those. The probability of F occurring in a coding region is divided into three mutually exclusive outcomes depending on the position of the first nucleotide in the codon. These three probabilities are calculated using an inhomogeneous Markov model, again using a first order model as an example.

PAGE 20

20 The three remaining probabilities which can occur if F occurs in the shadow of a real coding region can be calculated in a similar fashion to what is shown above using the corresponding conditional pr obability matrices Qm. After all of these probabilities (likelihoods) are determined they are used to calculate the posterior probabilities which characterize the coding potential of F in all six reading frames. Three components m = 1, 2, 3 are determined using Bayes formula P ( CODm) represents the prior probability of the event CODm, m = 1, 2, 3, that F falls into a coding region and its first nucleotide is located in a codon position defined by the index m The same application of Bayes formula defines the probabilities of F occurring in different shadow frames. Q ( CODm) represents the prior probability of the event CODm, m = 1, 2, 3, that F falls into a coding shadow region and its first nucleotide is complementary to a codon position defined by the index m The function P ( NON) is the prior probability of the event NON that is, the probability with which F occurs in a noncoding region. The assumption

PAGE 21

21 made in GeneMark is that P ( NON) = 1/2 and that P ( CODm) = Q ( CODm) = 1/12 for m = 1, 2, 3. The last two formula above are used to calculate six coding inframe posterior probabilities for any given input fragment F The posterior probability of the event that F belongs to a noncoding region is determined as follows GeneMark employ s a sliding window approach where the values P ( CODm | F ) and Q ( CODm | F ) are calculated for every window and assigned to the fragments middle point. The values determined for each frame can be visualized graphically or the values can be compared to each other and the highest value can be used to determine which functional model best characterizes F GeneMark.hmm GeneMark.hmm (Lukashin 1998) was designed to improve the quality of gene predictions made by GeneMark. In particular the improvement centered aro und increasing GeneMarks ability to find exact gene boundaries. To effect this improvement GeneMark models of nucleotide usage are integrated into a hidden Markov model framework. A hidden Markov model is a graph of connected states, each state having the ability to emit a series of observations. The observations progress through a dimension which is often time, in the case of hidden Markov models for gene detection the dimension is the position in the sequence. The hidden Markov model is parameterized wit h probabilities governing the state at t + 1, with the previous state known. The Markov process is used to minimize the requirement to only having to know the previous state in order to make a probabilistic calculation for observing the next

PAGE 22

22 state. As the process evolves through the states, each state can potentially emit observations, these are the observations that can be seen (the DNA sequence). Considering a DNA sequence S = {b1, b2, ..., bn}, where the bi is interpreted as the nucleotide A, C, G or T and n is the sequence length, a hidden Markov model can assign a sequential functional role A = {a1, a2, ..., an} to each nucleotide. Here each ai takes an integer value 0 if nucleotide bi is non coding, 1 if bi is in a gene encoded on the direct strand and 2 if bi is located in a gene encoded by the complement strand. The GeneMark.hmm computes the posterior probability that a particular functional sequence A underlies a given sequence S P ( A |S ) = P ( a1, a2, ..., an | b1, b2, ..., bn). Given a particular set of parameterized models, two questions can be answered in the hidden Markov framework; 1) for a given observed sequence, which model is the most likely to explain the data and 2) for a given sequence and a given model, what is the most likely reconstruction of the path through the states. In order to reconstruct the most likely path from all potential paths, a forwardbackward decoding algorithm such as the Viterbi algorithm (Viterbi 1967) is often employed. Further, models can be trained from the data, in this scenario parameters are estimated by expectationmaximization techniques. Hidden Markov models and Bayesian statistical methods are used to calculate an a posteriori probability (the probability of the model, given the observ ed sequence). The GeneMark.hmm algorithm computes the P ( A |S ) value and ultimately defines the functional sequence A* having the largest value P ( A*|S ) among all possible A The output of the program A* is the most likely annotation of the DNA sequence S Ea syGene EasyGene (Larsen 2003) is a gene prediction method based on HMM. The HMM introduces states for distinctive sequence features occurring around the start and stop

PAGE 23

23 codons of a gene. Prior to the model of start codons, there is a null model followed by a ribosome binding site model. The null model describes the general composition of the genome and also the coding regions on the complementary strand, while the ribosome binding site model contains seven hidden states and a spacer with a variable number o f states. To account for distinctive patterns of codon frequencies near the gene start and end, codon models specific to these regions are incorporated in the HMM. There are three models for the sequence of internal codons, each model representing three di stinct gene classes. The model order is variable, with the highest fourth order being used for modeling the coding regions. In addition to predicting coding regions, EasyGene assigns a measure of significance to the predictions it makes, based on ORF score and the length of the ORF. The significance score for an ORF is the expected number of ORFs in one megabase of random sequence at the same significance or higher, where the random sequence has the same statistics as the genome in the context of a third or der Markov chain. Glimmer Glimmer (Salzberg et al. 1998) is a program that uses a v ersion of Markov models called interpolated Markov m odels (IMMs) to delimit coding regions in microbial sequences. In principle IMMs have more discriminative power than Markov chains by combining probabilities from oligomers of varying lengths and using only those contexts which occur frequently enough to provide reliable estimates. The predominant model for microbial sequence analysis and the model used in GeneMark is a fixedorder Markov chain. A fixed order Markov model predicts each base of a DNA sequence using a fixed number of preceding bases in the sequence. GeneMark for example, uses a 5thorder model, using the five previous bases to evaluate the probability of observing the next

PAGE 24

24 base under the alternative models. If insufficient training data is available, estimates for the probability of each base occurring after every possible combination of five preceding bases will be very inaccurate. To overcome this difficulty, Glimmer computes the probability of each base A, C, G, T, following all kk kmer it computes a weight t o use in combining the predictions of different order models. The weights are determined by counting the number of times each nucleotide in every possible context is observed. Once the weights are computed, Glimmer evaluates new sequences by computing the probability, P( S |M ), that the model M generated the sequence S Th is probability is computed as: where Sx is the oligomer ending at position x, and n is the length of the sequence, IMM8( Sx), the 8thorder interpolated Markov model score, is computed as: k( Sx 1) is the numeric weight associated with the k mer ending at position x1 in the sequence S and Pk( Sx) is the estimate obtained from the training data of the probability of the base located at x in the kthorder model. Thus, the 8thorder IMM score of an oligomer is a linear combination of the predictions made by the 8th, 7th, 6th, and lesser order models all the way down to the 0thorder model, which is the prior probabilities of A, C, G, T. In the average microbial genome some pentamers will occur too infrequently to give reliable estimates of the probability of the next base, while some octamers will occur frequently enough to give reliable estimates. In principle, using longer oligomers to make predictions is a superior strategy than using s horter oligomers, but only if there is sufficient data available in the training set of genes to

PAGE 25

25 produce good probability estimates and if the longer oligomer affects the probability of observing the next base. An IMM therefore uses a longer context to mak e a prediction whenever possible, taking advantage of the greater accuracy potentially provided by higher order Markov models. Where the statistics on longer oligomers are insufficient to produce good estimates, an IMM can rely on shorter oligomers to make its predictions. An IMM can emulate a fixed kth order chain by setting all oligomer weights to zero except for the weights associated with k. Glimmer does not use a sliding window, in contrast to GeneMark. Instead Glimmer identifies all ORFs longer than a specified threshold in all six possible reading frames and scores each one. Those ORFs that score higher than a designated threshold are selected for further processing. This includes resolving any potential overlapped predictions. The final output of the program is a list of putative gene coordinates in the genome, along with notations for each prediction that may have had a suspicious overlap with another gene candidate. Glimmer 2.0 Glimmer 2.0 extends the idea of IMM into something the authors call interpolated context models (ICM) (Delcher et al. 1999). In the standard IMM, for a given context C=b1b2...bk of length k a probability distribution is computed for bk +1 using as many of the bases immediately preceding bk +1 as the training data supports. The ICM can select any of the bases in C to calculate the probability of bk +1. This feature gives the ICM more flexibility when compared to the IMM. The criterion employed by the ICM to choose which bases of a context to use is mutual information. The mutual information between a given pair of discrete random variables X and Y is defined as:

PAGE 26

26 where xi and yj are the values taken by random variables X and Y respectively, and P( xi, yj) is the joint probability of xi and yj together. To construct an ICM with context length k from a training set T of DNA sequences, first consider all windows (oligomers) of length k+1 that occur in T Let random variable X1 be the distribution of bases in the first position of those windows; X2 be the distribution of bases in the second position; and so on through Xk+ 1. Next calculate the mutual inform ation values I ( X1; Xk +1), I ( X2; Xk +1) ,..., I ( Xk; Xk +1), and select the maximum. Suppose the maximum is I ( Xj; Xk +1). Next partition the set of windows into four subsets based on the nucleotide that occurs in position j in the window. The same procedure can now be performed again for each of the other four sets of windows. Within each set, the position that has the highest mutual information with the base at position k+1 is chosen. The four nucleotide values at that posi tion induce a further partitioning of the current set of windows into four subsets. This process can be viewed as constructing a tree of positions within context strings. The construction is terminated when the tree depth reaches a predetermined limit, or when the size of a set of windows becomes too small to be informative toward estimating the probability of the last base position. Each node in the ICM decomposition tree represents a set of windows that provide a probability distribution for the final bas e position. The root node, which includes all possible windows, represents a 0thorder Markov model. All other nodes give a probability distribution for the final base position, conditional on a specific set of bases occurring at the positions indicated on the path to the root from that node. The IMM of Glimmer 1.0 is a special case of the ICM of Glimmer 2.0, more specifically it is the case where the base chosen at each level of the

PAGE 27

27 tree is the last available base in the particular context window. The interpolation mechanism used in the ICM is identical to that used in Glimmer 1.0. Glimmer 2.0 takes a weighted sum of two probability distributions, where the weights are determined by the number of observations in the training set used to construct the distri bution and its 2 test. The only difference is that the ICM interpolation is viewed as interpolating between the distribution at a parent and child node in the same tree, while the IMM interpolation is always between distributions obtained using different numbers of bases at the end of the context window. Additionally, Glimmer 2.0 adds improved heuristics to resolve overlapped gene predictions as compared to Glimmer 1.0. Glimmer 3.0 Glimmer 3.0 (Delcher et al. 2007) introduced several major changes to the Glimmer gene detecting approach, intended to improve both coding region detection as well as start codon detection. Glimmer 3.0 introduces a new algorithm for scanning coding regions as well as a new module for start site detection and a new architecture that integrates all gene predictions across the entire genome. In addition, a new automated training program for Glimmer 3.0 produces substantially improved training sets of genes, particularly on genomes of high GC content. Similar to Glimmer 2.0, the Glimmer 3.0 algorithm is based on an IMM framework that computes the log likelihood of a given interval on a DNA sequence, based on a model of coding versus noncoding DNA. Glimmer 3.0 scores all open reading frames starting from the stop codon toward the start codon. The probability of each base is conditioned on a context window on its 3 side and the score of the ORF is the log likelihood sum of the bases contained in the ORF. The score is computed incrementally as a cumulative sum at every codon position

PAGE 28

28 in a given ORF. The advantage of scanning ORFs from the 3 end is that the context window of the IMM is always contained within the coding portion of the gene, in contrast to a context window on the opposite side, who se context would intersect a noncoding region for nucleotides very close to the start site. It is expected that the peak value of the cumulative score will coincide with the correct start site. Prodigal Prodigal (Hyatt et al. 2010) or PROkaryotic DYnamic programming Genefinding ALgorithm was designed with three main goals in mind: 1) to improve gene prediction structure, 2) to improve translation initiation site detection, 3) reduce false positive gene predictions. To achieve these aims Prodigal first needs to learn the necessary properties of the input organism, including start codon usage, ribosomal binding site motif usage, GC frame plot bias, hexamer coding statistics in order to build a complete training set of genes. Prodigal takes a unique approach when building a learning set of genes from the input organism instead of merely taking long ORFs as being putative genes in a first pass. The program begins by traversing the entire input sequence and examining the GC bias in each of the three codon posit ions in each ORF. The codon position with the highest GC is considered the winner and a running total for the codon position is incremented. After all ORFs have been processed, the sums give an approximate measure of the preference of each codon position for G and C. The values for each codon position are next normalized around 1 and divided by 1/3. With the GC bias information Prodigal constructs preliminary coding scores for each gene in the genome by multiplying the relative codon bias for each of the three positions by the number of codons in the putative gene in which that codon position is the maximal GC

PAGE 29

29 frame in a 120 bp window centered on that position. The score S for a given gene starting at location n 1 and ending at location n 2 is given by: wh ere B ( i ) is the bias score for codon position i and l ( i ) is the number of bases in the gene where the 120 bp maximal window at that position corresponds to codon position i After the preliminary coding score measure based on GC codon position statistics is obtained, Prodigal scores every start stop pair longer than 90 bp in the entire genome. Prodigal next performs dynamic programming across the whole sequence to identify a maximal tiling path of genes to train on. The design of the dynamic programming al gorithm forces the method to choose between heavily overlapping ORFs in the same genomic context. Prodigal uses the same dynamic programming algorithm for the preliminary training phase as well as for the final gene calling phase. After the initial dynamic programming algorithm has completed, Prodigal next gathers statistics from the putative genes and constructs a more rigorous coding scorer. Prodigal does this by looking at in frame hexamer coding frequencies for a gene relative to the background frequenc ies. A table of 4096 values is created, one for each hexamer, where the value of a given word w is: where C is the coding score, G is the percentage occurrence of that word within the gene training set, and B is the percentage occur rence of that word across the entire sequence (irrespective of frame). The final coding score for a gene beginning at position n 1 and ending at position n 2 is computed as:

PAGE 30

30 where S is the sum of the coding scores (C) for the inframe hexamers (the set of words w) in the gene. Prodigal modifies the coding score based on information from sequences near the gene prediction to help select genes that are more likely to be real genes when there is significant overlap or when a gene structure is more likely, Prodigal also tweaks the coding score to boost scores of longer genes (based on GC content) to be slightly positive is the preliminary score is negative. After Prodigal has calculated coding potential scores for every start stop pair the genome, it goes on to create a translation initiation site scoring system from the training set of genes. The program constructs a background of ATG, GTG, TTG frequencies of all start nodes in the genome. The program also builds a background of RBS motifs based on the ShineDa lgarno sequence. Prodigal begins with the assumption that the ShineDalgarno motif will be utilized by the organism. If this turns out to not be the case, it runs a more rigorous motif finder. For RBS motifs, Prodigal utilizes the concept of bins, each bin corresponds to a set of RBS motifs and spacer distances with different priorities assigned to each bin. In the initial background, the motif in a higher numbered bin takes priority over one in a lower numbered bin if both bins are found upstream of a star t site. Prodigal examines the initial coding peak in every ORF with a coding score higher than an arbitrarily determined coding score of 35.0. From these selected coding peaks, it builds a log likelihood model similar to the coding score model namely: where S is the score, R is the observer percentage in the training set, B is the percentage observed in the background. This method is used for both start codon usage and for the ShineDalgarno bin motif. The scores are summed and multiplied by an empirically determined constant then added to the coding score of the

PAGE 31

31 sequence under investigation. Prodigal proceeds through every start stop combination and performs that calculation while modifying the coding score by the quality of its star t codon information. This process leads to a new set of peaks for the set of training ORFs. Once the new set of peaks has been determined, Prodigal reconstructs the background for both ShineDalgarno motif and start codon usage. In this and subsequent iter ations, Prodigal no longer assumes a higher numbered bin is better for RBS motifs and it instead relies on the log likelihoods calculated in the previous iteration to find the best upstream motif for a given start site. Prodigal performs several iterations of this process, moving the peaks around based on subsequent information until they no longer move, it determines the final set of weights based on statistics gathered from this final set of putative real start codons. The end result is a set of log likelihood weights for ATG/GTG/TTG information and for each of the RBS bins. If the zero bin for RBS motifs, which corresponds to no ShineDalgarno motif, is positive, or if the zero bin is above 0.5 and the 4base motif bins are less than 1.0, then Prodigal determines the organism under investigation does not use the ShineDalgarno motif strongly, at this point Prodigal will run a more rigorous motif finder. The alternative motif finder consists of Prodigal searching exhaustively for alternative motifs by look ing at the occurrence of all 3 mer motifs in the initial set of peaks and locating all 3mers that occur in at least 20% of the high scoring gene models. From these motifs Prodigal performs an iterative algorithm similar to the one described above with the bins instead corresponding to every word of size 36 bp with many different choices for spacer length. All words 36 bp that do not occur frequently enough are combined in the no RBS motif bin. In this way Prodigal arrives at a similar set of weights for no RBS motif, as well as for each 36 bp

PAGE 32

32 motif that contains the commonly occurring 3 bp motif as a subset. Finally Prodigal adds a scoring system to capture information in the regions outside those examined by the RBS scorer (1 2 bp and 1545 bp upstrea m from the translation start site). This scoring system builds a position weight matrix on the whole region it examines. This generic upstream scoring system is not part of the iterative algorithm instead the data is gathered from the final iteration of the start site training. Once Prodigal has start score weights for both start codon type and RBS motif/spacer distance it next scores every start node in the entire sequence. The final score for a start node is given as: where S is th e final assigned score, R is the RBS motif score, T is the start type score, U is the upstream score, C is the coding score. For the RBS weight, Prodigal uses the ShineDalgarno motif score if it determines the organism uses ShineDalgarno, the secondary R BS motif score if it finds a secondary motif, and the maximum of the two systems if neither system located a strong RBS motif. The 4.25 and 0.4 constants were arrived at by experimenting with different values and observing the change in results across test sets of genomes. These values were chosen to maximize performance without introducing a lot of bias and with an effort to keep false positive gene predictions to a minimum. Once the scores have been calculated, the dynamic programming is performed a secon d time using more detailed node scores for gene connections and introducing procedures to handle overlapping gene predictions in a way that makes biological sense. Prodigal then makes one final pass through the models and removes genes with negative scores and also makes one final improvement to genestart calls. The final output of Prodigal consists of a complete list of gene coordinates and optionally provides protein translations and/or

PAGE 33

33 detailed information about each potential start in the genome. Prodi gal can be run either in two steps with a training phase and a gene prediction phase, or in a single step where the training step is hidden from the user and only the final gene predictions are presented. Limitations of Probabilistic Gene Prediction Methods While probabilistic approaches to prokaryotic gene prediction are very successful, they also have known limitations (Skovgaard et al. 2001). Many ab initio gene prediction methods that work by learning compositional rules from a selected set of ge nes use a single model for protein coding regions. This makes the assumption that all protein coding regions wi thi n a particular genome follow the same model, which is generally not the case. Local compositional fluctuations in the genome, or compositional differences related to gene functional class, gene size and level of expression are some of the factors that influence the compositional properties of genes and proteins (Elton 1974, Sueoka 1962, Fi ckett et al. 1992a, Konu 2002). The problem of a single g ene model is emphasized within genomes that show a greater heterogeneity in compositional properties and mutational bias. Although the desirability of multiple models has been recognized (Borodovsky et al. 1995), in practice the limited size of training sets makes it difficult to develop multiple gene models within a single genome. In general, gene finding algorithms based on complex gene models may suffer from some or all of the following limitations: (i) parameter estimation is subject to sampling error; (ii) the limited size of the learning set allows development of only one or few gene models; (iii) the complexity of the models is not transparent to biological interpretation; (iv) the reliability of individual predictions is difficult to evaluate; (v) posterior probability

PAGE 34

34 evaluations are affected by decisions on final probabilities and are conditional on a set of predefined alternative hypotheses. It is thus desirable to compare the results from different prediction methods to evaluate the reliability of a hypothetical predicted gene. Gene predictions are most often characterized by posterior probability, rather than by measures of statistical significance. Gene Prediction Methods Based on Global Compositional Properties Protein coding regions of prokary otic genomes possess different compositional properties as compared to non coding regions. These differences between protein coding and non coding regions of DNA can be exploited by gene detecting algorithms tuned to look for the underlying differences. Prokaryotic genomes are gene dense with an estimated >90% of the genome being comprised of coding regions, leaving ver y little sequence space for no n coding regions. Furthermore, coding regions of DNA tend to be GC rich while noncoding regions are rich in AT. It has been speculated that this compositional difference between coding and non coding regions arises due to the requirement that non coding r egions must be more readily separated to allow cellular machinery access to the DNA in order to effect processes such as transcription and replication and related functions. This increased access is achieved by the decreased numbers of hydrogen bonding occ urring between AT base pairs allowing the two DNA strands to denature and anneal with less energy being expended by the cell. Protein coding stretches of DNA are further characterized by signature usages of DNA; as a byproduct of being protein coding the c odon structure is introduced. The codon consists of three consecutive nucleotides each with a different capacity to carry information. This differential information storage aspect leads to dissimilar forces acting upon each codon

PAGE 35

35 position. Mutations in DNA arise with a certain frequency and are retained or removed depending on selective forces acting on the change. Selection applies differential pressure against mutation as a function of codon position. As a consequence, in coding sequences, base usage among codon positions is significantly different. Figure 11 represents the GC (=S) content in codon positions 1, 2 and 3 among 1087 genomes of bacteria, where each point represents S usage in one of the three codon positions, averaged over all genes annotated in a single genome. Codonposition specificity in base usage generates a periodthree non randomness in usage of strong and weak bases that depends on the total GC of the gene. In genomes with a high GC content the contrast in usage of GC between codon positions is maximized and generates a strong signal that has been used for the identification of coding regions (Bibb et al. 1984). In addition to contrasts in GC content, other types of asymmetric base usage can be identified that can help provide informat ion useful for identifying coding regions (Shepherd 1981). For example, genes of any GC composition typically contain an excess of purines in the first codon position compared to the second and third codon positions (Figure 12). Frame Analysis The compo sitional contrasts between codon base positions have been used for visual detection of protein coding regions in an approach called frameanalysis (Bib b et al. 1984, Ishikawa 1999). In frame analysis, compositional curves called S profiles (Brocchieri et al. 2005) that characterize periodic GC content distribution are built along a genome sequence by calculating three measures of frame specific GC usage. By this method, the genome sequence is first partitioned into three subsequences, each containing every third nucleotide in three different phases. The first subsequence

PAGE 36

36 contains genome positions 1, 4, 7, ... (phase 1). The second subsequence contains genome positions 2, 5, 8, ... (phase 2). The third subsequence contains genome positions 3, 6, 9, ... (phase 3). Along each sub sequence the GC content is independently evaluated within a moving window of fixed length W (Figure 13). The size of W is a user defined parameter chosen to be small enough to represent meaningful local variation in GC content whil e at the same time being large enough to prevent excessive random fluctuations in the frequencies of the sampl ed nucleotides. A window length of around 200 nucleotides is a reasonable compromise. In any case, W is chosen to be a multiple of three to includ e an equal number of positions in the three sub sequences. Within the window, three separate measures of GC content, corresponding to each subsequence, is recorded and assigned to the position central to the window. By moving the window along the genome, three independent curves of GC content, referred to as S profiles, are obtained. A qualitative assessment of the presence of proteincoding genes, and of their coding strand and frame, is obtained in frame analysis by visual examination of the S profile. A procedure to obtain quantitative assessments of the same contrasts based on the bias in framespecific GC distribution and a related measure of coding potentials, calculated based on expected global compositional biases in relation to GC content, has been proposed (Brocchieri et al. 2005). In contrast to other methods for gene detection, frame analysis and its quantitative development does not assume a specific codon composition of the gene but rather relies on expectations of how GC bases are differenti ally distributed among the three codon positions given the overall GC content of the sequence.

PAGE 37

37 An example of S profile plots obtained for coding sequences in genomes of high GC content is shown in Figure 14, which represents S profiles obtained for the f irst 10,000 base pairs of the P. aeruginosa PAO1 genome, computed using a moving window of length W = 201 nt. The colored curves indicate the GC content inside the window measured every third nucleotide in phase with genome position 1 (red), 2 (green), 3 ( blue). Figure 14 panel A represents the S profile curve obtained for phase 1 nucleotides ( i % 3 = 1), where i is sequence position. Panels B and C show analogous curves obtained for nucleotides in phase 2 ( i % 3 = 2) and in phase 3 ( i % 3 = 0), respectively. These curves show within each phase the GC content dramatically fluctuates from region to region as the window transverses different codon positions of coding regions in different phases and strands. When the three S profile curves are superimposed as in Figure 14 panel D it can be seen that while the total GC content of this region remains approximately constant (Figure 15), the large GC nucleotide fluctuations in the different phases generate characteristic contrasting patterns among them. In high G C genes the highest GC frequency occurs in third codon positions and the lowest GC frequency occurs in second codon positions (Figure 1 1) When annotated genes are placed above the S profile curves and are depicted as colored arrows (Figure 1 4 panel D ) t he pattern of GC usage is consistent with the GC distributions observed among codon positions shown in Figure 11 and the presence and coding frame of the gene can be discerned based on the relative position of the three S profile curves as seen in Figure 1 4 panel D For example, when considering specifically the gene named dnaA in Figure 1 4 panel D we can see that the green curve is in phase with the frame containing the highest GC and that the red curve is in

PAGE 38

38 phase with the frame with the lowest GC whil e the frame depicted by the blue curve is intermediate between the other two frames. From the pattern of red, green, blue shown in Figure 16 and knowing that nucleotides in the codon structure must be sequential we can know that the relation of these three particular curves in this conformation is consistent with a gene being encoded on the direct strand with the third codon position in the second register. Using this approach, any DNA sequence showing consistent and differential GC usage as a function of codon position can be visualized and the frame and the encoding strand can be discerned as is shown in Figure 14 panel D for the remainder of the annotated genes displayed along the top of the S profile curve. Determining Potential Coding Regions Genomes of high GC content are often described as difficult to annotate due to the relatively rare occurrence of randomly generated stop codons, which results in a large number of long ORFs that ar e not protein coding (Nielsen 2005). These false positive ORFs have the potential to confuse gene prediction methods that build training sets composed of long ORFs encountered in the genome. In contrast to methods that learn gene nucleotide statistics from a set of genome long ORFs, in f rame analysis, a genome with high G C content naturally generates the strongest compositional contrasts among codon positions resulting in a corresponding increased predictive power in the identification of coding regions independent of any learning set of genes. The novel approach to gene f inding that is described in this dissertation is well suited to the discovery of genes in high GC genomes similar to f rame analysis. However in its application the method will be enhanced to detect contrasts that arise amongst nucleotides and codon positio ns above and beyond those that simply occur between

PAGE 39

39 GC and nonGC, as the previously mentioned abundance of purines in first codon position irrespective of overall GC content (Figure 12). Another difficulty faced by gene detection algorithms when determin ing coding potential is that of detecting very short genes. Recently mini genes (< 100 amino acids in length) have gained prominence as more is understood about the role they play as regulators in coordinating gene expression in cells (Kastenmayer et al. 2006). Mini proteins may play significant roles in various biological processes. Mini proteins usually contain a single domain. In prokaryotes, well known mini proteins include chaperonin Hsp10, translation initiation factor IF 1, ribosomal proteins and oth ers. In eukaryotes, miniproteins include signaling molecules such as peptide hormones, cytokines and corepressors and coactivators. The smallest known functional geneproduct, in E. coli, is a 29 amino acid peptide involved in K+ transport (KdpF) (Gasse l et al. 1999). Small ORFs with as few as 14 amino acids have been predicted in E. coli and ORFs with as few as 28 amino acids have been predicted in Saccharomyces cerevisiae and many short peptides have been isolated from the cellular milieu from other bacterial species. Artificial gene constructs encoding just five amino acids have even been created and were able to transcribe and result in functional geneproducts affecting intracellular signaling in Bacillus subtilis (Lazazzera et al. 1999). Short codi ng regions are typically difficult to predict computationally. Due to their short sequence, the coding signal corresponding to these short genes is less likely to reach the levels of significance necessitated by common genefinding algorithms. Also, small genes may not follow the same codon composition properties of longer genes, further more the corresponding short peptides may not have the amino acid composition typical of average length

PAGE 40

40 globular proteins. Extremely short proteins are unlikely to be predicted by computational methods unless they conserve a very well characterized motif. Current prediction methods routinely predict coding sequences of < 100 codons base d on compositional properties. The approach described here is based on simple global features, therefore the statistical problems stemming from the size of small genes will be minimized. Small genes are likely, because of their size, to have different compositional properties than the longer genes from which the gene compositional rules ar e learned in conventional gene finding methods. These features may confuse predictions based on detailed compositional models but the approach described in this manuscript does not appear to be significantly affected by the unusual composition of small genes. In fact, since small genes as well as longer genes are expected to be subject to some kind of constraining selection at nonsynonymous sites, they are still expected to show the compositional contrasts that distinguish synonymous from nonsynonymous codon positions in compositionally biased genomes (Bibb et al. 1984).

PAGE 41

41 Figure 11. Distribution of GC bases in first, second and third codon positions as a function of the overall GC content of the coding regi on. Individual points represent the average base content of all coding sequences listed in 1087 prokaryote genomes. Figure 12. Distribution of A+G bases in first, second and third codon positions as a function of the overall A+G content of the coding region. Individual points are the average base content of all coding sequences listed in 1087 prokaryote genomes %GC in coding regions %GC in codon position %A+G in codon position %A+G in coding regions

PAGE 42

42 Figur e 13. A sequence partitioned into three frames with every third nucleotide included in the frame. The GC is calculated for each frame in a window of size = W. Figure 14. S profiles for positions 110K of the genome sequence of Pseudomonas aeruginosa P AO1 with a GC = 66.6%. A to C : S profiles indicate the GC content within windows spanning 201 nt measured every third nucleotide in phase with genome position 1 (red=R), 2 (green=G) or 3 (blue=B). D : The three curves are superimposed, genes from the publis hed annotation are represented as arrows pointing in the 5 to 3 direction, colored according to the phase of their third codon positions. As shown in Figure 11 the highest GC is in the third codon positions and the lowest is in the second codon positions.

PAGE 43

43 Figure 15. Total GC plotted for a given window size remains largely constant and hovers around the mean value for the given genome. Figure 16. Graphical depiction of the approach used to determine coding strand and frame from resulting S profile gr aphs. The colored circles repres ent nucleotides and the same color is repeated every third nucleotide thereby maintaining frame. The strand and frame is determined from the relation of GC concentration amongst the three frames in an S profile graph and knowing that nucleotides participating in a codon must be sequential. Tota direct strand complementary strand

PAGE 44

44 CHAPTER 2 FRAME ANALYSIS OF SPROFILES AND APPLICA TION TO ANNOTATION IMPROVEMENT Frame analysis (Bibb et al. 1984) provides a powerful method to visually identify proteincoding genes in sequences of high GC content exploiting the sharp global compositional contrasts between codon positions observed in genes of high GC content (Figure 1 1). In frame analysis, the frame specific contrast in GC content is utilized to infer the presence of a coding region, the three S profiles assume characteristic contrasts that can be used to visually identify presence and coding frame of a gene. When frame analysis is contrasts among S profiles are an obvious indication of the presence of coding regions (Figure 1 6) Comparisons of S profiles with published annotations of coding regions provides in a preponderanc e of cases strong support for the genes reported in published annotations (Figure 14 panel D ). Such visual confirmation is particularly useful for example in the case of predictions of hypothetical non conserved genes (ORFans) of short length, whi ch a re most prone to mis annotation. Furthermore, frame analysis provides evidence of the existence of coding regions that are not included in the published annotations of high GC genomes, or suggests modifications in annotated start of translation of certain coding regions. Some specific examples are provided in the sections that follow. T hese cases are evidence that, for sequences of high GC content, frame analysis is a useful genefinding technique c apable of c omplementing sophisticated probabilistic methods Frame analysis has been implemented and made available to the general user in the FramePlot software, a webbased tool allowing visualization of S profiles, positions of stop codons and potential start of translation codons of a sequence (Ishikawa 1999). Visualization of frame specific S profiles is

PAGE 45

45 implemented in the Artemis genome browser and annotation tool (Rutherford et al. 2000), which allows graphical representation of framespecific GC content within windows of different size, together with ORF st ructure of the sequence and annotation features. S profile Asymmetries Identify New Potential Gene Coding Regions and Errors in Annotation Visual frame analysis of S profile plots of annotated prokaryote genomes of high GC content generally show the correspondence of annotated coding regions with strong compositional contrasts. However, it also reveals the presence of occasional discrepancies between annotation (or lack of) and S profile contrasts. These discrepancies strongly suggest that current annotations of prokaryotic genomes can be improved and that quantitative frame analysis is a useful tool to achieve this improvement. The following section provides examples showing instances of disagreement between visual S profile signals, and publishe d annotations. There are four classes of disagreements (i) published predictions modified by S profile analysis; (ii) published predictions not supported by S profile analysis; (iii) published predictions contradicted by S profile analysis; (iv) newly predicted genes not found in the published annotation. 1) Modified Annotated Coding Sequence Predicting the translation start is possibly the most difficult problem in prokaryotic gene annotation (Mathe et al. 2002). The start of translation is typically signaled by the presence of an ATG codon, in which case the problem consists of identifying the correct ATG among multiple possible choices. However the problem is complicated in prokaryotes by the known complexity that translation is initiated by alternative codons,

PAGE 46

46 namely GTG, but also occasionally, TTG, CTG or ATT. An example of how quantitative frame analysis can help in deciding the correct translation start is illustrated in Figure 21, where the published annotation of the predicted coding region Adeh_05901 of the high GC organism, Anaeromyxobacter dehalogenans 2CP C is matched to its corresponding S profile. This is a specific example of a whole class of modified annotations that can be identified. In this example a strong S profile signal extends 5 of the predicted start of translation, through a region devoi d of stop codons. 2) Predictions Not Supported By S profile Analysis Some annotated genes in the collection of prokaryotic genes lack the signature feature of containing compositional contrasts of GC amongst codon positions. These annotated genes may simply be incorrectly predicted as being protein coding. Other explanations are the contrasts may exist among other nucleotide bases and not necessarily in GC or the contrasts may be present but may not extend over a long enough sequence space to rise to be di scernible (or statistically significant) when visually examining S profiles. Some genes with disordered ends may also experience this effect where the S profile signal is corrupted by relaxed constraints on the sequence. These predictions may ultimately pr ove to be correct despite their lack of S profile signal and it will be useful to record their unexpected compositional properties in terms of GC contrast and to further characterize sequence properties facilitating their coding potential. These gene predi ctions are classified as not supported and examples are given in Figure 22 and Figure 23. 3) Predictions Contradicted By S profile Analysis Another class of predictions in conflict with S profile results are termed contradictions. These correspond to a significant S profile signal suggesting an

PAGE 47

47 alternative coding sequence located in a different frame or strand then the published annotation. Specific examples are shown in Figure 24 Figure25 and Figure 26 4) Newly Predicted Genes The final class of c onflicts that can occur between annotated genes and S profile signal are newly predicted genes. These are regions of the genome that contain clear compositional contrasts amongst codon positions yet the region is not included in the published annotation. A n example of a newly predicted gene is presented in Figure 27. Assessing Contrasts Between Codon Positions As shown in Figure 11 and Figure 14, in coding regions of high GC content the strongest and most robust compositional contrast involve the second codon position, which is AT rich, and the third codon position, which has the highest GC content. The difference in GC content between second and third codon position (S3 S2) were used to evaluate all annotated genes in three strains of P. aeruginosa P AO1, PA7 and PA14. Individual genes are represented in Figure 28 by their total GC and by their difference in GC between the third and second position normalized by expectations : so that for all GC levels the value (S3 S2)* = 0 corresponds to the expected difference. With this metric, genes in which S3 = S2 lay on the sloping line represented in Figure 28. Points above this line indicate genes with S3 > S2 and points below it represent genes where S3 < S2. In this analysis functionally characterized genes (those identified by a descriptive name) were distinguished from putative uncharacterized genes. It can be seen from Figure 28 that characterized genes (red dots) cluster around the expected S3 S2 difference (S3 S2)* = 0. The majority of putativ e genes (blue dots)

PAGE 48

48 are also distributed around the expected compositional contrasts. In P. aeruginosa strain PA7 however, many putative high GC genes do not show the expected difference between S2 and S3, and in some cases even S2 > S3. The nonrandom, period three distribution of GC bases in the majority of predicted genes of P. aeruginosa strains identifies a compositional character of these sequences that strongly supports their classification as coding. For sequences included in the published annotation that did not show these contrasts, frame analysis does not corroborate their classification as coding sequences. Application and Extension of Frame Analysis to Annotation Improvement Use of frame analysis is in practice limited to sequences of high GC content. Furthermore, frame analysis is limited by its qualitative approach to gene finding. While in many cases S profile contrasts unequivocally signal coding regions, in other cases it is difficult to assess with confidence whether observed contrasts are significantly different from random fluctuations. The window based approach of frame analysis does not allow for precise identification of the boundaries of regions of high compositional contrast. Finally, frame analysis does not use nor provide information on the ORF structure (position of start and stop codons) of the underlying sequence and must be supported by other tools to verify correspondence of the S profile signal with an underlying ORF structure. The procedures and computational tools developed in this manuscript address these limitations, and implement and extend the principle of frame analysis, providing quantitative measures of 3periodic compositional contrasts applicable to sequence segments. The quantitative approach to frame analysis allow s: (i) analysis of sequences of any length, without the need of a sufficiently large dataset of coding sequence to

PAGE 49

49 construct a (known or inferred) reliable learning set; (ii) identifying sequence segments of any length characterized by compositional contrasts, replacing the predefinedlength window approach used to generate compositional plots in frame analysis. These sequence segments will correspond to continuous sequence regions with the expected frame specific compositional contrasts visually represented in frame analysis; (iii) assigning a measure of the statistical significance of the identified contrasts. Short coding sequence predictions, which are typically enriched in false positives, will be characterized for their exhibiting statistically signifi cant compositional contrasts in comparison to a random model; (iv) to precisely delineate the boundaries of the predicted coding region, in contrast to visual frame analysis, where the boundaries of the regions of compositional contrast are, by method, not clearly resolved; and (v) extension of frame analysis to sequences of any composition, considering compositional contrasts in any of the four nucleotides (Bese mer 1999). The quantitative methods developed here (quantitative frame analysis) capture the qualitative visual information provided by frame analysis, i.e., compositional contrasts between codon base positions. As in visual frame analysis, in quantitative frame analysis coding regions are identified by means of the periodic compositional contrast s observed in the three codon positions over sequence segments of any length. The method is based on the definition of a scoring scheme and associated cumulative scores of known probability distribution, which, in contrast to visual frame analysis, allow precise determination of the boundaries of segments of compositional contrast and can be associated with measures of statistical significance. Furthermore, with quantitative frame analysis the principle of visual frame analysis is extended from measuring frame

PAGE 50

50 contrasts in GC content to measuring compositional contrasts of individual nucleotide types. With this generalization, frame analysis can be extended to sequences of any composition, in each case utilizing the most informative contrasts. For quantitative frame analysis two different scoring schemes were developed. The first scoring system is based on the expected composition of each codon position given the overall local composition of the seq uence (Besemer 1999). The second scoring system is based on a procedure to detect any significant compositional contrast of period three w ithin a sequence segment, regardless of the length of the segment and of the expectations on the nature of those contrasts.

PAGE 51

51 Figure 21. An example where frame analysis suggests a published annotation should be modified, extending it to match the strong c ompositional contrasts 5 of the annotated gene Adeh_0501. Figure 22. An example of a published annotated gene (0322) that is not supported as evidenced by the lack of any compositional contrast amongst codon positions in the S profile.

PAGE 52

52 Figure 23. A n example of an annotated gene (Adeh_3257) that is not supported by the expected GC usage as visualized by the S profile. Figure 24. An example of published annotations (1426, 1427) that are contradicted by the S profile signature. A pair of hypothetical genes annotated on the direct strand in the genomes of P. aeruginosa PAO1 and PA14 correspond to a strong S profile signal consistent with the presence of a coding region on the complementary strand. The newly predicted gene is found conserved among the genes published in the annotation of two other P. aeruginosa strains namely 2192 and C3719. In the case of strain PAO1 and PA14 the published annotated sequences may have been preferred because they maintain the same orientation of the flanking genes.

PAGE 53

53 Figure 25. Another example of annotated gene predictions that are contradicted by the Sprofile signal. In this example an annotated gene (Adeh_0560) from the high GC organism Anaeromyxobacter dehalogenans i s in conflict with the underlying S profile signal suggesting there are two genes encoded in this area, one gene is encoded on the direct strand and the other is encoded on the complementary strand. The alternative gene predictions are supported by the pr esence of inframe translation initiation codons in close proximity to the high scoring regions. Figure 26. An example from Anaeromyxobacter dehalogenans showing a case where the annotated gene prediction is contradicted by the S profile signal. The gen e Adeh_0513 is not consistent with GC usage for a gene in that frame and direction, the S profile supports a gene predicted on the complementary DNA strand that contains a start of translation initiator codon distally located to it but in the correct fram e.

PAGE 54

54 Figure 27. A region on the chromosome of the high GC organism Anaeromyxobacter dehalogenans that shows GC usage that is consistent with coding for a protein however there is no published annotated gene occupying this region. The new gene prediction (H 1529*G) is further supported by the presence of a translation initiating codon (GTG) located in close proximity and in frame to the highscoring segment. Figure 28. Difference in GC content between codon positions three and two in relation to overall GC content, normalized to expectations calculated from several bacterial genomes. The distribution in the two classes of predictions in P. aeruginosa are represented by the difference in GC content between their third and second codon positions (S3S2), which is expected to be positive in high GC genomes. In PAO1and PA14 the distributions of characterized and putative genes largely overlap. In PA7 we see many putative genes in which the difference S3 S2 is much smaller than expected. PAO1 PA7 PA14

PAGE 55

55 CHAPTER 3 RESEARCH DESIGN AND METHODS Our procedures are designed to identify sequence regions of any length characterized by periodthree compositional contrasts that signal an underlying codon structure. In frame analysis, these compositional contrasts are visually identified by graphing three framespecific S profiles of GC content, computed within running windows of predefined width. The predicted underlying frame and strand of coding can be determined based on the relation between the three S profiles. Although visual frame analysis is in practice applied only to GC rich sequences, its principle is applicable to sequences of any composition, taking into consideration that sequence composition affects the type of contrasts expected in coding regions. Plots of the frequency of strong bases (S = G or C) in the three codon positions as a function of the overall GC content of the coding sequence, show the dependencies of GC content in codonpositions with the overall GC content (Figure 11). The relations between the three distributions explain why frame analysis is mostly effective in GC rich sequences, when the distribution of GC content in the three codon positions is most diver gent, whereas in genes of intermediate GC content their distributions are overlapping. The frequencies of strong bases at the three codon positions ( s1, s2, s3) are almost linearly correlated with the overall S content ( s ) of the sequence, and their expect ations can be estimated by linear r egression, as shown in Figure 11 or by non linear functions such as, for example, the logistic, or a thirdorder polynomial. The relative contribution of nucleotide frequencies in generating usage contrasts can be quant ified comparing expected nucleotide frequency in a codon position to nucleotide overall frequency with log odds ratio scores:

PAGE 56

56 where si and wi are the expected frequencies of, respectively, GC (S) and AT (W) in codon position i and s and w are corresponding overall frequencies. Plots of the scores as a function of sequence GC content illustrate the main determinants of the expected contrast in sequences of different GC content. For example, in GC rich sequences avoidance of AT bases in thir d codon position and their over representation in second codon position make up for the strongest contrasts, in AT rich sequences GC is strongly avoided in third codon position and over represented in first codon position, and in sequences of intermediate GC content (~45%) weak contrasts are entirely generated by over representation of GC in first codon position and of AT in third codon position (Figure 3 1 ). Contrasts in strong vs. weak nucleotide usage are well suited for visual frame analysis of GC rich sequences but are less useful for sequences of GC content lower than ~55%. Other types of compositional contrasts, however, could in principle be used for frame analysis of sequences of lower GC content. For example, it has been known for a long time that in coding sequences, first codon positions tend to be particularly rich in purines (R) compared to second and third positions, irrespective of overall GC content (Figure 12 ) (Shepherd 1981). Information on presence of coding regions in theory could be obtained from visual analysis of R profiles. It has been previousl y noted (Besemer 1999) that in coding regions, the frequencies of all four DNA bases can be described by a linear function of the overall base composition and position in the codon (Figure 32 ). These dependencies have been used together with expected amino

PAGE 57

57 acid frequencies to infer heterogeneous Markov models that can be used for gene prediction (Besemer 1999). To extend the principle of frame analysis to sequences of any composition, it is convenient to utilize information on expected compositional contrasts of individual nucleotide types. In principle, profiles of codonpositionspecific frequencies can be plotted for individual nucleotide types. A straightforward generalization of the scor ing scheme in the W/S alphabet to any nucleotide X in codon position i is given by the scores: where is the expected frequency of nucleotide type x in codon position i as a function of the overall nucleotide type frequency from linear regression. In GC rich sequences this scoring scheme is expected to capture the dominance of contrasts in GC content, whereas in other sequences other types of contrasts will predominate (e.g. purine/pyrimidine usages). Scores based on the four let ter nucleotide bases (Figure 3 3 ) in first codon position reflect over representation of G and under representation of T in sequences of low and intermediate GC content. In second codon position emphasize overall under representation o f G, over representation of AT in highGC sequences, and over representation of C in low GC sequences. In third codon position scores are dominated by under representation of GC in low GC sequences, and by under representation of AT in highGC sequences. I n general, in sequences with intermediate GC content scores have the smallest absolute values, reflecting the weakest compositional contrasts observed in these sequences.

PAGE 58

58 ORF Structure and Local C ompositional P roperties In scoring a long nucleotide sequence (e.g., a genome sequence) for its codonpositionspecific compositional contrasts, it is desirable to take into consideration local compositional fluctuations of the sequence, so that appropriate scores can be applied to different sequence regions as its local composition changes. It is also desirable to consider the reading frame structure of the sequence, so that only regions that do not include stop codons can be evaluated for their coding potential. To accomplish this we independently consider, for each reading frame on the direct and complementary strand, sequence segments included between consecutive stop codons. The analysis is then performed independently for each segment, and scores are constructed based on the nucleotide composition of each segment. In the case of prokaryote nucleotide sequences and with the only possible exception of terminal fragments, if a coding region is present in a sequence segment, it must be completely represented within the segment, and it must terminate at the 3 end of the segment. At this stage of the procedure we evaluate coding potential within a sequence segment irrespective of the presence of potential start codons. Codon S cores By construction, the length 3L of each sequence segment subject to scoring is a multiple of three, and can be represented as a sequence of L trinucleotides that are potential codons of a coding sequence. In the context of frame analysis, we score these in frame trinucleotides for their pot ential to generate the compositional contrasts expected in coding regions, given its overall composition Thus, the scores will not reflect expectations on the frequency with which the amino acid types that they encode are represented in the corresponding protein. Log odds ratio scores for codons should

PAGE 59

59 reflect the probability of observing a particular trinucleotide given its codonposition specific expected frequencies, versus its probability in a randomly generated sequence. Codon scores equal to the sum of the scores of the corresponding nucleotides, i.e. : do not take into consideration the exclusion of stop codons in evaluating expected codon frequencies generated under both null and alternative models. We accomplish this adopting a model by which a r andom sequence is generated drawing nucleotides independently with some probabilities ( x = A, C, G, T) such that the expected nucleotide frequencies in sequence segments between consecutive stop codons equal the frequencies observed in the scored sequence. Similarly, codonposition specific nucleotide probabilities with i = 1, 2, 3, can be calculated for generating a pseudocoding random sequence such that expected codonpositionspecific nucleotide frequencies in sequence segments between stop codons equal the frequencies expected given the original sequence composition. The corresponding score for codon XYZ is: ( 3 1) where and ( c = 0 in the case of Mycoplasmas, c = 1 otherwise) are, respectively, the probabilities of generating stop codons in the ps eudocoding and null models. These codon scores account for nucleotide type associations generated by the process of selecting

PAGE 60

60 sequence segments. Although numerical evaluation of the probabilities involved in expression ( 3 1) is straightforward, their calculation at run time over a large number of sequence segments of different composition is prohibitively time consuming. A dramatic improvement in performance can be achieved either by pretabulating probabilities for all pos sible nucleotide compositions (in practice, for a large number of compositional states on a threedimensional grid), or by approximating scores based on the observed nucleotide frequencies and on expected codonposition specific freq uencies : ( 3 2) with corresponding and probabilities of stop codons. We found that scores obtained based on observed and expected frequencies as in ( 3 2) closely approximate those obtained calculating probabilities as in ( 3 1) Thus, in our implementation of the procedure we used expression ( 3 2) to score trinucleotides, substantially improving run time performance. Probabilities were instead calculated to generate random sequences used to sample cumulative score distributions, as described below. High Scoring Se gments To identify from a potential reading frame significant compositional contrasts within sequence segments of any length, as seen in visual frame analysis, we utilized a highscoring segment ap proach (Karlin 1990), based on cumulating along the sequence the log odds ratio scores defined in the previous sections for codons. Utilizing t he highscoring segment theory overcomes problems associated with the moving window approach, i.e., choosing and justifying an optimal window size, fuzzy boundaries

PAGE 61

61 resulting from hybrid windows at the interface between coding and noncoding regions, and identifying, from overlapping windows of fixed size, continuous regions of variable size. Cumulative scores have been used in a genefinding context also by the procedure implemented in Glimmer3.0 (Delcher et al. 2007), where scores, derived as log odds ratios from interpolated Markov models for coding and noncoding s equences, are cumulated from the 3 end of an ORF, and are used to predict the start of translation of the coding region (Delcher et al. 2007). Our cumulative scores will reflect how well the sequence conforms to the overall compositional contrasts expected in a coding region, rather than to its expected oligonucleotide usage. Furthermore, in our implementation of cumulative scores we will utilize the approach of the highscoring segment theory to evaluate the statistical significance of the observed cumulative score. Finally, we will use cumulative scores to identify highscoring intervals within the search sequence (highscoring segments) rather than just its start of translation signal. With our method, scores of trinucleotides (codons) can be equivalently cumulated starting from the 5 or 3 end of the sequence, with results independent from the orientation of the cumulation process. The proc edure is illustrated in Figure 34 which illustrates an example of scores cumulated in the orientation. In traversing coding regions in the correct frame, codonscores have positive expectations and the cumulative score tends to grow positive, whereas it is reset to 0.0 whenever it becomes negative. After scoring the sequence endto end, the sequence position xM a ttaining the maximum score is identified, and the statistical significance of the maximum score is evaluated. From the position of maximum score, the sequence of codons is retraced in the orientation until a position such that or the

PAGE 62

62 end of the sequence is encountered. The nucleotide sequence segment from the first codon position of the codon corresponding to maximum score to the third codon position of the last codon with positive cumulative score, identifies a highscoring segment with associated score M A high scoring segment induces two flanking subsequences, which are recursively analyzed until their length reaches a minimum threshold, arbitrarily set to 36 nucleotides Significance of Scores Our scoring scheme enjoys the properties required for applying the highscoring segment theory (Karlin 1990), i.e., at least one score is positive and the expected score in a random sequence of the given nucleotide composition is negative. Thus, when scores are cumulated and reset to zero as described along a random sequence of trinucleotides (codons), the cumulative score is expected to gravitate towards zero, with random positive fluctuations. An expression for an upper bound estimate of the probability of the maximum among such positive fluctuations, corresponding to a HighScoring Segment (HSS), is av a ilable (Karlin 1994): ( 3 3) where L is the length of the sequence (in our application, the number of codons), S is the highest observed score, K and are parameters that can be evaluated numerically, and x is the threshold score c orresponding to a desired level of significance Numerical algorithms are available for estimating from equation ( 3 3) parameter values and threshold scores at any level of significance given sequence length and probability of elementary scores. However, although estimating threshold scores for one sequence requires only a few seconds of computer time, finding

PAGE 63

63 numerical solutions of expression ( 3 3) at run time for a large number of sequence segments of variable composition becomes impractical. Furthermore, formula ( 3 3) only estimates an upper bound for thresholds of significance, providing conservative estimates of sig nificance values. Thus, we followed the alternative strategy of pretabulating estimates of threshold score values at various significance levels from computer simulations. Threshold scores corresponding to significance levels and were estimated from the distribution of maximum scores obtained from sets of 10 0,000 sequences randomly generated for different compositional states and sequence lengths (Figure 3 5) In all, distributions for 2 3,426 different compositional states were generated obtained by scanning the [0.0, 1.0]4 compositional space of nucleotide frequencies ( i = 1, 2, 3, 4) with resolution (subject to the obvious constraints and ), and for seven sequence lengths, doubled from 150 n ucleotides to 9,600 n ucleotides Threshold scores obtained from our simulations at each significance level were as expected lower than the upper bounds calc ulated with formula ( 3 3), as exemplified in Figure 36 and Figure 37 We found that the relation between estimated threshold scores and sequence length is closely described by a linear fit to the log log le ngth of the sequence ( Figure 36 and Figure 37 ) in contrast to the log length linear relation predicted by relation (3 3 ) for the upper bound. This relation allowed us to tabulate threshold scores corresponding to the three significance levels for each compositional state k, by the two parameters of the linear regression describing the relation of estimated scores with the log log length of the sequence. Thus, given a sequence segment of compositional state k and length L

PAGE 64

64 the corresponding threshold score values S* ( L k) for each sequence segment can by calculated at run time with the simple transformation: Simulations and Generation of Stop Less Random Sequences With the described procedure sequence segments included between consecutive stop codons are extracted from the input sequence in all six frames and each segment is independently scored. To evaluate its significance, we compare the original score with the score distribution obtained from randomized sequences of the same length and composition. To generate random sequences of the same length of the original segment and devoid of stop codons (TAA, TAG, and T GA in nonMycoplasma) a sequence of nucleotide bases are generated with given probabilities and stop codons are removed as they form, until the desired length is reached. A score is calculated and the procedure is repeated to generate a large sample of the distribution of scores, to which compare the original score for significant assessment. In this procedure, selection against stop codons has the effect of generating ex pected nucleotide frequencies for the resulting sequence that are generally different from the probabilities with which nucleotides are generated. Precisely, if ( X = A, C, G, T) are the probabilities with which nucleotide types X are sampled, after selection against stop codons the sequence will have expected nucleotide frequencies :

PAGE 65

65 Sequences with expected nucleotide frequencies equal to the original sequence are generated sampl ing nucleotides with probabilities equal to the solutions of the system of inverted equations Numerical solutions are obtained using a steepest ascent hill climbing procedure with starting conditions For example, to generate a random sequence of any length with expected equal frequencies of nucleotide types solving the relations (1) we obtain nucleotide probabilities: pA = 0.2599, pC = 0.2375, pG = 0.2424, and pT = 0.2542. Expected frequencies and derive from probabilities: pA = 0.3640, pC = 0.1374, pG = 0.1501, and pT = 0.3486. Finally, expected frequencies and result from probabilities: pA = 0.1549, pC = 3429, pG = 0.3484, and pT = 0.1537. The expected frequency of trinucleotide XYZ generated following this randomization model will be: where with k = 0 in Mycoplasmas and k = 1 otherwise. Populations of randomized sequences were generated for all discretized compositions of nucleotide frequencies in the interval [0,1] with increments 0.02 (51 compositional levels of each nucleotide type). Considering the constraints this

PAGE 66

66 gives rise to 23,426 distinct compositional states, for each of which score distributions were estimated based on samples of randomly generated sequences of given lengths. Pseudocoding Sequences: R andom S equen ces with Codon Position S pecific N ucleotide F requencies. We define logodds ratio scores for each codon type, based on the expected codonpositionspecific nucleotide frequencies at the three codon positions i = 1,2,3. These frequencies are in this application estimated by linear fitting to data. We can generate a pseudocoding random sequence with expected codonpositionspecific nucleotide frequencies following the ideas described above, generating a random sequence with the appropriate codonpositionspecific probabilities and removing stop codons as needed. As before, we can obtain these probabilities by numerically finding the solutions (X = A, C, G, T ; i = 1, 2, 3) from the analogous relations: , where is the probability of stop codons ( k = 0 for Mycoplasmas, otherwise k = 1), and, from these relations, derive, for i = 1, 2, 3:

PAGE 67

67 In general: the nucleotide probabilities obtained under the random model. A natural log odds ratio score associated with compositional contrasts for codons is: where only depends on sequence composition The G test and N ucleotide A ssociations with C odon P ositions. Selection against stop codons in randomly generated sequences induces association of nucleotide types with codon positions. For example, Figure 38 shows the expected nucleotide dis tribution among codon positions in the case of a random sequence generated as described in the previous sections to obtain a sequence with equal expected frequencies of nucleotide types Comparison with the expected distribution calc ulated from marginal values as in a standard Chi square test (Figure 39), produces the odds ratios shown in Figure 310 and a corresponding G statistic value G = 3.8826E 3 L where L is the length of the sequence. Thus, selection against stop codons res ults in an association between nucleotide types and codon positions that becomes significant when the sequence is sufficiently long (Figure 311) Shadow G enes It is well known that the nonrandom distribution of base types induced by a true coding sequenc e in a particular reading frame, casts weaker genelike compositional properties also on alternative, superimposed reading frames, known as shadow genes. In the context of our methods, although shadow genes are expected to

PAGE 68

68 generate lower scores than the corresponding true gene, still they can generate significant scores. To distinguish true genes from shadow genes, each highscoring segment is rescored relative to other coding frames with: for each of five alternative permutations p of the identity permutation (1,2,3), corresponding to coding in the five alternative frames (two alternative coding frames on the same strand and three alternative frames on the opposite strand) without resetting negative cumulative scores to zero. A highscoring segment generating a negative cumulative score when compared to any of the other reading frames, is discarded, otherwise it is retained. In our application, each highscoring segment will correspond to a sequence region with significant contrasts in codon position specific nucleotide composition, hence providing at once a quantitative definition of contrast, a measure of statistical significance of the contrast, a precise definition of the boundaries of contrasting segments, and a means of identifying s egments of any length. We identify this procedure as the H t est and the corresponding highscoring segments as H type hits or H hits. Examples of the relation of highscoring segments with compositional profiles are shown in Figure 312 for a coding sequence of high GC content, compared to S profiles, and for a sequence with lower GC content, where the most obvious contrast involves higher usage of purines in first codon position.

PAGE 69

69 Identification of Generalized Associations of B ase T ypes with C odon Pos ition Coding regions whose compositional properties deviate sufficiently from expectations may not be recognized based on the previous approach. We devised an alternative approach to identify coding regions with the only assumption that the codon structure generates 3periodic compositional contrasts, without any assumption on the specific nature of these contrasts. We based our test on the G statistics, which is commonly used as an alternative to the Chi square statistics for testing associations in cont in gency tables (Sokal 1994). By using a variant of the G test we can identify, within a sequence, segments of any length associated with significant values of the G statistics, irrespective of the type of nucleotide association with codon position. Given a sequence segment of 3L nucleotides (i.e., a reading frame L codons long) compositional contrasts can be determined identifying significant association of nucleotide types with codon base position ( i mod 3), where i is the sequence position, with a test of a ssociation (e.g., the Chi square or the G test) on the corresponding 3 x 4 contingency table of codonposition specific nucleotide counts (Figure 3 13). To identify, within longer sequences, segments of any length displaying significant compositional contr asts, we applied the following procedure (Figure 3 14). First, a reading frame sequence is identified and partitioned into trinucleotides (= codons) as in the highscoring segment approach. A standard 3 x 4 contingency table T counting nucleotidetypes ( j = 1, 2, 3, 4) observed in the three codon positions ( i = 1, 2, 3), is initialized to T = {0 }. Starting from the 3 end of a sequence of L codons, nucleotide types are progressively added into the contingency table according to their codon position. For

PAGE 70

70 each codonposition x ( ) read into the table, a G statistics is calculated for the sequence interval [ x, L ]: where is the total number of nucleotides of type j observed in codon positions i cumulated into the contingency table from the 3 end of the sequence down to position x, and is the corresponding total count of nucleotides of type j is calculated and recorded for each codon position x, gen erating a sequence of values (G curve) from which the position x = m of maximum value is identified (if more than one such position exists, the 3 most position is selected). The contingency table is reinitialized to T = {0} and the procedure is repeated in the opposite (5 to 3) orientation starting from position m A position M ( M > m ) at which the statistics attains maximum value exists and is identified. The procedure identifies a sequence interval [ m M ] of variable length with an associated score Given a threshold score value of significantly low probability, the sequence segment is identified as a highscoring segment if This procedure defines a statistics. We refer to this test as the Gtest, and we identify the corresponding high scoring segments as G type hits or G hits. In contrast to the classical Chi sq uare or the G statistics, the random distribution of the statistics does not approximate a distribution. This can be readily appreciated considering that, given a sequence of length L the value of cannot be less than the value of the G statistics calculated over the whole length of the

PAGE 71

71 sequence. Furthermore, the distribution of increasingly favors greater v alues compared to the distribution with increasing sequence length, which is intuitively explained by the corresponding combinatorial proliferation of segment choices that could generate Contrary to the G statistics, the distribution of is also a function of sequence composition. In our specific application, a more subtle effect contributing to a positive correlation between and random sequence length is a weak association between nucleotide types and codon positions generated by selecting sequence intervals devoid of stop codons. To determine the threshold values of at different significance levels and for sequences of any length and composition, we rel ied on computer simulations, as described for the case of H type hits. Estimates of G score significance threshold scores at two different significance values are presented in Figures 315 and Figure 316 Because it only assumes that a coding region generates periodic compositional contrasts in the sequence without any assumption on the nature of such contrasts, the G test has the potential of identifying coding regions with composition deviating from expec tations. However, it does not provide information on the strand and frame in which the sequence is encoded. To decide strand and frame of the potential coding region, once a sequence segment with significant was identified in a speci fic frame, we scored it against all other frames as for H test (see above), retaining it only if it achieved positive scores in comparisons with all other reading frames. Sequence C omplexity Our procedures, and in particular the G test, are prone to iden tify any sequence with compositional periodicity of 3n where n is a small enough positive integer (i.e., 3n

PAGE 72

72 = 3, 6, 9, ...). Although we expect such periodicity to be in most cases indicative of a protein coding sequence, significant periodic contrasts can also arise due to other events. Among these, repetitive (low complexity) elements not necessarily associated with coding sequences, although more frequent in eukaryotic genomes, are also occasionally observed in prokaryote sequences ( Troyanskaya et al. 2 002) and can generate periodic compositional contrasts. To separate repetitive elements from coding sequences, we characterized each hit for codon complexity by Shannons entropy (Shannon 1948) normalized to the interval [0,1] by measuring it in cod units, with 1 cod = ln (61). Analysis of the distribution of H from our set of characterized genes (data not shown) suggested to apply a threshold H* = 0.40 to separate repetitive elements ( H H > 0 .40).

PAGE 73

73 Figure 31. Log odds ratio scores for strong (S = C or G, thick lines) and weak (W = A or T, thin lines) nucleotides in codon positions one (red), two (green) and three (blue) in relation to the overall GC content of the sequence. Figure 32. Base composition at the three codon positions of genes in relation to the overall base composition of the coding sequence. Each point represents composition averaged over all coding regions in one of 813 prokaryote genomes. %nt in codon position %A in coding regions %C in coding regions %G in coding regions %T in coding regions

PAGE 74

74 Figure 33. Scores as a function of GC content at the three codon positions and scores for sixtyone trinucleotides as a function of GC content. The highest or lowest scoring trinucleotides types are indicated at different GC levels. Figure 34. Cumulating trinucleotide scores from the 3 to 5 will result in scores growing positive with the maximum corresponding to the start of the predicted gene and the last zero corresponding to the end of the predicted gene.

PAGE 75

75 Figure 35. Threshold scores at the indicated significance levels for different sequence lengths (range 150 9600 nt) and balanced nucleotide composition ( p = 0.25) based on sets of 100,000 randomly generated sequences. Scores are described by a log log function of length. The regression line shown corresponds to composition GC = 0.50%. Figure 36. Threshold scores at significance level p = 0.01 obtained from simulations compared to upper bounds of scores from high scor ing segment theory

PAGE 76

76 Figure 37. Threshold scores at significance level p = 0.001 obtained from simulations compared to upper bounds of scores from highscoring segment theory. A C G T Tot 1 0.0912 0.0833 0.0872 0.0716 0.3333 2 0.0794 0.0833 0.0814 0.0892 0.3333 3 0.0794 0.0833 0.0814 0.0892 0.3333 Tot 0.2500 0.2500 0.2500 0.2500 1.0000 Figure 38. Expected nucleotide type frequencies in codon positions after removing stop codons.

PAGE 77

77 A C G T Tot 1 0.0833 0.0833 0.0833 0.0833 0.3333 2 0.0833 0.0833 0.0833 0.0833 0.3333 3 0.0833 0.0833 0.0833 0.0833 0.3333 Tot 0.2500 0.2500 0.2500 0.2500 1.0000 Figure 39. Expected nucleotide type frequencies in codon positions from marginal values. A C G T 1 1.0943 1.0000 1.0461 0.8597 2 0.9529 1.0000 0.9770 1.0702 3 0.9529 1.0000 0.9770 1.0702 Figure 310. Odds ratios A / B (GC = 50%). Figure 311. A ssociation between nucleotide types and codon positions

PAGE 78

78 Figure 312. Typical compositional contrasts observed in a GC rich sequence, exemplified by a sequence region of 10,000 b ase p airs from the genome of Pseudomonas aeruginosa PAO1. Hits indicate regions of compositional contrast identified by our methods on the direct (top line) and complementary (bottom line) strands. Hits of the H type are identified by thick lines, those of the G type by thin lines. Hit regions match visually to S profiles obtained with a moving window of size 201 b ase pairs Annotation indicates coding regions published in the genome file (NC_00251 6), represented as arrows pointing in the 5 to 3 orientation. Between annotated coding regions PA0032 and PA0033, there is a region of significant compositional contrasts predicting a coding region in frame with a corresponding ORF (unnamed red arrow). T he colors assigned to S profiles correspond to the three different positional phases ( i mod 3), red= 0, green= 1, blue= 2. Predicted coding regions are assigned the phase color corresponding to their third codon position. A C G T Total 1 n 11 n 12 n 13 n 14 n 1. 2 n 21 n 22 n 23 n 24 n 2. 3 n 31 n 32 n 33 n 34 n 3. Totals n .1 n .2 n .3 n .4 n .. Figure 313. Contingency table of basetype in three frames of a sequence of length n ...

PAGE 79

79 Figure 314. The G test procedure is here applied to a known ORF from P. aeruginosa PAO1. Starting from the 3 end of the sequence trinucleotides are read and counts of nucleotides are progressively added to the rows of a contingency table initialized to zero depending on the position of the nucleotide in the triplet (Figure 13). After r ecording the content of each trinucleotide the value of the G statistics is recalculated and recorded (1). Initially the G values will not be significant because too few data points have accumulated into the contingency table. As more and more observations cumulate into the contingency table moving in the 3 to 5 direction, presence of a coding region will result in association of nucleotidetypes with codon positions that will become progressively more significant and correspond to increasing G values. Moving outside the coding region the added data will lack the association with codon positions corrupting the signal and consequently reducing the value of G. This will identify the position of maximum value of the G statistics, from which the process can be repeated in the opposite direction (2). A second maximum will be found identifying a region between the positions of the two maxima (3) predicted as a coding segment.

PAGE 80

80 Figure 31 5 Estimates of Gscore threshold scores at significance level p = 0.001 as a function of sequence length and composition. Each simulation is based on 100,000 randomly generated sequences, with lengths ranging from 15 n ucleotides to 12,288 nucleotides Regression line corresponds to points at GC = 0.50%. Figure 316. Estimates of G score threshold scores at significance level p = 0.01 as a function of sequence length and composition. Each simulation is based on 100,000 randomly generated sequences, with lengths ranging from 15 nt to 12,288 nt. Regression line corresponds to points at GC = 0.50%.

PAGE 81

81 CHAPTER 4 IMPLEMENTATION From High S coring Segments to Gene Prediction and C haracterization The concepts developed over the previous sections were employed to engineer a gene predicting algorithm that can be used to score sequences, of any length, up to entire genomes, for their potential to be protein coding based on periodic contrasts contained in the sequences. This information quantifies the visual signal that is displayed in frame analysis and can precisely identify the boundaries o f the region of contrast. The outlined method also has the capacity to assign a statistical significance to the observed contrasts and extends frame a nalysis to discover contrasts in the natural four letter nucleotide alphabet. The algorithm attaches other factors, such as ORF structure and local sequence signals to each of the predictions to provide support for predictions. The sequence is analyzed from the 5 end of the direct strand, identifying all sequence segments included between consecutive stop codons, referred to as the reading frame (RF), in six alternative frames (three on the direct strand and three on the independently analyzed looking first for high scoring segments, and subsequently any compositional contrast. The individual RF is scored based on its nucleotide composition, taking into consideration compositional heterogeneity of the input sequence, and a corresponding coding region is predicted based on the position of potential translation start codons and stop codon. If previous gene annotations are provided with the input file, these are compared with the new predictions and both sets are characterized according to their agreement or disagreement. The proce dure is implemented through the N PACT (N Profile Analysis Computational Tool) website, which produces text

PAGE 82

82 outputs for the newly predicted coding sequences, and a graphical representation for frame analysis of sequence compositional features, position and frame of all identified significant hits, and corresponding positions of predicted coding regions. The general procedure follows the following general scheme, illustrated in Figure 43 as a series of function calls: 1. A sequence file in a minimum GenB ank format, optionally including annotated coding regions (CDS), is uploaded. 2. The input sequence is simultaneously read in all six frames and reading frames (RF), i.e., sequence segments between two consecutive stop codons, are extracted from the sequence, and i ndependently analyzed if the sequence length satisfies a minimal ucleotides ). 3. The nucleotide composition of each RF is measured and compositiondependent log odds ratio scores are built for each codon type based on this composition. Sequence segments of significantly high cumulative score are detected and recorded using the previously described procedure. 4. High scoring sequence segments (hits) are ordered by position and all sequence segments between consecutive hits are analyzed evaluating their reading frames for presence of significant contrasts using the second algorithm. Regions of significant contrasts are scored according to their expected composition (scoring scheme of the Highscoring segment method). 5. E va lues associated with all significant hits are evaluated based on the overall size of the analyzed sequence. From simulations, we identify the sequence length corresponding to probability p = 0.01 of observing the same or higher scores in random sequences o f the same composition. The length of the search space is initialized by the overall length of the analyzed sequence is the number of hits of the given or higher score expected in sequences of length (E value). If is a threshold value (e.g., ) and the corresponding hit is marked as globally significant, and the length LORF of the corresponding predicted ORF is subtracted fr om the overal l sequence length The process is repeated for hits of lower score, until the first non globally significant hit is identified. 6. Each RF including highscoring segment(s) (from either algorithm) is characterized by its composition, overall score, Shannons entropy, and presence of putative start codons in proximity to the 5 end of the high scoring segment. In its current version, the algorithm looks for an ATG or GTG codon, first within the nucleotide interval [ 45, +12] and then < 45 n ucleotides from the 5 end of the high scoring

PAGE 83

83 segment. If no ATG or GTG codons are identified in these intervals, alternative start codons in the order TTG, CTG, and ATT are looked for within the same intervals. Further developments will include other supporting evidence of translation initiation, such as ShineDalgarno sequences. 7. Predicted ORFs are ranked by score and characterized by their positional relations with other predicted ORFs and with previously published predictions. The output includes the following text files with information about the predicted genes: a) [File_name].verbose Complete list of all high scoring segments, characterized by their composition, type of significant score (H = highscoring segment or G = G test), sequence position, reading frame coordinates, entropy, levels of significance, score, E value, relation with other predictions. b) [File_name].profiles List of positions of high scoring segments. c) [File_name].newcds List of predicted new genes and their sequence position. Na mes (e.g., H 1934*A, G 342 t) indicate the type of hit (H or G ) supporting the prediction, a hit number as assigned during sequence analysis, global significance (*), type of start of translation codon (atg=[A,a], gtg=[G,g], ttg=[T,t], distal in lowercas e, proximal in uppercase, no start= [empty]). d) [File_name].altcds List of predicted new genes superimposed to other predicted genes. These predictions, although supported by highscoring segments, are most likely to correspond to frameshifts in larger genes, to fragments of pseudogenes, or to unusual compositional properties within genes. e) [File_name].confirmed List of genes appearing in the published annotation and confirmed by our analysis. f) [File_name].modified List of genes appearing in the published annotation with a suggested different start of translation (MO, modified), not supported (NS) by presence of a highscoring segment, or contradicted (CO) by presence of a highscoring segment in a different coding frame. g) [File_name].repetitive List of high scoring segment regions with Shannons entropy below a specified threshold (by default: H < 0.40). Furthermore, N PACT results are graphically represented in a multi page printable output that includes a genomewide representation of framespecific p rofiles of GC content (S profiles), of high scoring segments, and of previously and newly predicted coding regions (Figure 43). The graphical output allows validation of computational predictions in the N PACT (N Profile Analysis Computational Tool) gene

PAGE 84

84 prediction software. N PACT can be accessed at the web site: http://genome.ufl.edu/npact/

PAGE 85

85 Figure 41. The N PACT (N Profile Analysis Computational Tool) web inter face allows users to upload GenB ank style sequence files for visualization of sequence compositional features and for gene prediction based on compositional contrasts among codon positions. Figure 42. Flowchart showing program execution.

PAGE 86

86 Figure 43. N PACT output showing variation of GC in three phases along a genome. Published annotations are plotted above the curves as well as new gene predictions uncovered by the proposed algorithm.

PAGE 87

87 CHAPTER 5 RESULTS Compositional Contrasts i n Annotated Prokaryote Genes To test the power of these protein coding detection approaches we scored all genes in the collection of prokaryotic genomes found in the NCBI RefSeq database. This collection is comprised of 1,083 species of bacteria represented by a mixture of 2,041 chrom osomes and plasmids, the shortest contig is 1,286 nucleotides and the longest contig is 13,033,779 nucleotides. There are 3,550,316 annotated genes over 3,913,805,583 nucleotides. The GC content of the contigs ranges from 17% to 75%, the number of genomes in each GC class is presented in Figure 51 showing an unequal distribution in the number of genomes belonging to each GC class. The normalized (to the total number of nucleotides in the collection) distribution of nucleotides in each GC class is shown in Figure 52 and shows more erratic fluctuations amongst the GC classes. However the number of annotated genes per kilobase pair is relatively flat amongst all GC classes and shows that gene density is uniform and largely independent of GC content (Figure 53). The entire collection of genes were divided into three categories 1) Characterized (2,451,937) genes are selected ex cluding all sequences with corresponding proteins described as hypothetical in corresponding collections of predicted genes 2) Hypothetical (1,093,956) genes are the predictions with protein products described as hypothetical and 3) Other (5,026) genes includes all sequences described as hypothetically characterized proteins (e.g., hypothetical protease, hypothetical transcription fac tor, etc.). Characterized genes are predictions that are more certain to be true protein coding sequences based on homology to other known proteins or they are predictions that have experimental verification. Genes in the

PAGE 88

88 Hypothetical class and in the Other class are of more dubious distinction. The GC content of the annotated coding sequences varied between 4.2% and 88.0% with an average of 52.5% the lengths varied between 6 and 36,806 codons with an average of 316 codons. This range of GC and length prov ides a wide spectrum of lengths and compositional properties to test the method. Coding sequences from all three sets were analyzed using both the score based method and the G test method at significance thresholds of p = 102 and p = 103. Around 85% of the sequences in the Characterized set have regions of significant contrast ( p = 102, 78% at p = 103) when measured with either the H test or the G test. 79.6% (71.4% at the more stringent significance) of all Characterized sequences resulted in both H h its and G hits (Figure 54 The fraction of Characterized sequences with H hits or G hits exceeded 90% at p = 102 and 84.5% at p = 103 (Figure 54 column H G). In comparison, a significantly lower proportion of sequences from the Hypothetic al set showed significant contrasts of H hit or G hit (72.0% at significant level p = 102, and 61.3% at p = 103) whereas intermediate results were obtained from the dataset comprised of Other sequences (Figure 54 ). To investigate how the power of the tests is affected by sequence length, we partitioned the three categories into classes of lengths 50i l < 50( i + 1) codons ( i = 0, 1, 2, ...). The fraction of significant hits increases with sequence length within all three categories of prediction type ( Figure 55, Figure 56, Figure 57 ). In particular, the vast majority (> 99%) of sequences from the Characterized set of length > 650 codons were identified as H hits and G hits, whereas power > 90% was attained for sequence lengths exceeding 250 codons. S ignificant hits were also recorded for the majority (60.6%) of the sequences in the length range 50100 codons, whereas almost onethird

PAGE 89

89 (31.9%) of the sequences shorter than 50 codons showed significant contrasts. Again, significantly higher frequency of hits were found in the Characterized dataset compared to the Hypothetical (Figure 57 ) and Other (not shown) sets, demonstrating that the previously observed overall difference between sets was not a consequence of differences in length between Characteriz ed and Hypothetical genes. Coding sequences were also par ti tioned into classes of GC content to investigate the effect sequence composition would have on the power of the test to identify protein coding regions. The power of the test may be decreased in s equences of intermediate GC where the compositional contrasts are not as pronounced. We determined that there was no significant correlation between gene length and GC content in the characterized set (Pearsons correlation coefficient R = 0.0726; N = 2,45 2,754). The relation of the power of the methods with GC content in Characterized coding regions is sh own in Figure 58 and Figure 59 at significance level p = 102. As expected, both gene detection methods achieved highest overall power (9899%) with seq uences of 75%) was obtained by both tests for sequences of intermediate GC content (in the range 4045%). Unexpectedly, the power was also lowest among sequences of very low (<25%) GC content. Combining H t ype and G type hits (Figure 5 10) the power increased to a minimum of 80% in low or intermediate GC content sequences, reaching 99.8% in sequences of more than 70% GC content. The frequency of Hypothetical genes with significant hits was again significant ly lower than among genes of the Characterized set, particularly for sequences of lower GC content (< 45% GC).

PAGE 90

90 The effects of sequence length and composition on the significance of the compositional contrasts can best be visualized by partitioning the set of sequences according to both length and GC content, as seen in Figure 511 This representation shows that the power of the test is consistently high (>90%) for sequence lengths exceeding 400450 codons, independently of the sequence GC content. In shor ter sequences of low or intermediate GC, the power of the tests rapidly decreases as the sequence length shortens. In sequences with GC content > 60%, the power of the test remains high (>90%) even for sequences of short length in the range of 50100 codons. These results suggest that our approach should be able to identify with a high success rate compositional contrasts induced by coding sequences of high GC content (>60%) even when sequences are as short as 50100 nucleotides. We also expect to identify most coding sequences of lower GC content provided that they are sufficiently long, and to be able to detect, albeit with reduced power, shorter coding regions of any composition. Newly Predicted Coding Sequences in Prokaryote Genomes We applied the des cribed approach to annotate every pr okaryote genome in the NCBI GenB ank collection. In preliminary analyses of randomization we found that p = 0.001 was an appropriate level of significance to use for de novo gene detection, to control for false positives due to multiple testing. For this application we tested all reading frames longer than 36 nucleotides between consecutive stop codons looking first for reading frames containing high scoring segments and then searching for G type hits in all remaining regions. We next identified hits corresponding to coding regions previously described in the NCBI genome annotations and screened the remaining significant hits according to their relative positions and sequence features. We identified

PAGE 91

91 the boundaries of potential coding regions containing each hit identifying the position of the first 3 stop codon in frame and potential start of translation codon appearing proximally to the 5 start of the hit, excluding all sequences for which no canonical (ATG, GTG) or alternative (TTG, CTG, ATT) translation initiation codons were found. We then characterized the predicted coding region by its entropy value and excluded all low complexity sequences. Finally all potential new coding regions that overlap more than a predetermined fraction (50%) to other higher scoring predicted coding regions or to previously annotated genes not contradicted by our tests were excluded. After selecting the highest scoring among our overlapping hits and eliminating hits of low complexity (H < 0.4) The hits confirming to previously annotated genes were excluded resulting in a collection of a final set of potential new coding regions, the statistical features of the set are summarized in Figure 5 12 F rom 1087 genomes we identified 139,561 new potential coding sequences from the H test and 61,219 from the G test method for a total of 200,531 potential new genes (for an average of 184.5 new genes per genome). In most of the sequences with H type hits (70.4%) and in many of the sequences with G type hits (56.4%) a canonical ATG or GTG putative start of translation codon was identified proximal to the 5 end of the high scoring segment. Among sequences identified by the H test, 20.9% had scores above the multiple testing threshold of significance and of these, for 84.3% a canonical start codon could be identified 5 of the hit region. Among potential coding regions predicted by the G test, a lower fraction (14.0%) reached global significance and among these the majority (62.5%) contained a canonical start codon.

PAGE 92

92 Number of P redicted ORFs as a F unction of G enome GC C ontent As for the analysis of characterized coding sequences, we expected our method to be variably successful in newly identifying coding regions depending on the composition of the sequence, and mostly on its GC content. Figure 513 shows the density (number per 100,000 nt) of newly predicted ORFs in relation to genome GC content. From Figure 513, we can see that biased genomes have more newly predicted protein coding sequences than less biased genomes. ORF Conservation Conservation across species (homology) is strong evidence supporting predicted genes. We queried (with BLAST) the NCBI non redundant (nr) database of protein sequences with transl ations of the 200,532 newly predicted proteincoding regions. Applying an E value threshold of 103, we identified in the data base target sequences for more than one third (34.2%) of all queries, most with much lower E values (e.g., about 20% of all seque nces resulted in BLAST hits with E value < 1030), supporting their classification as coding regions (Figure 514). Considering the type of prediction (H or G) and start codon (ATG/GTG or other), as expected a larger fraction of predictions of the H type w ith canonical start codons were associated with conserved sequences in the database, whereas the least conserved class was G type with non canonical start codons. Pseudogenes A possible source of periodic sequence compositional contrasts is given by the pr esence of pseudogenes that are young enough to preserve genelike compositional features. We compared the collection of pseudogenes reported in the Pseudogene Database ( www.pseudogene.org ) for 65 chrom osomes of prokaryotic organisms to our

PAGE 93

93 newly predicted sequences for the same chromosomes (Figure 5 15) Of all coding regions newly predicted in these chromosomes, 16.7% matched sequences predicted as pseudogenes with a small preference for H type vs. G type hits. The highest proportion of pseudogenes (37.5%) was found among the most significant H type hits (* = global significance), despite the fact that downstream of 83% (408 of 491) of these hits a canonical potential start of translation codon was identified. Overall, among the 65 tested genomes, more than 80% of the newly predicted coding regions with canonical start of translation codon were not characterized as pseudogenes. A distinguishing characteristic of pseudogenes is progressive sequence fragmentation of the ancestral gene due to the accumulation of mutations generating frame shifts and inframe stop codons over evolutionary time. With this in mind, we characterized our predictions comparing their length with that of homologous sequences identified in the nr database. For each query we identified in the set of homologs identified by BLAST the sequence of most similar length, the length ratio between query and target was recorded, this ratio is referred to as (Figure 5 1 6 ) It stands to r eason that predicted ORFs of similar length to known genes are more likely to be functional genes, whereas ORFs matching only a fragment of known genes may cor respond to pseudogenes. Figure 5 1 6 shows the distribution of ratios among all predicted ORFs with identified homolog s, with a sharp peak at ratio 1:1 (even when shown on a logarithmic scale), signifying that the vast majority of our predictions matched a database gene sequence with very similar length. The right skew of the distribution, however, also indicates that whi le relatively few predicted ORFs (queries) are

PAGE 94

94 significantly longer than known genes (targets), quite a few more are significantly shorter, suggesting the possibility that these regions m ight correspond to pseudogenes. Concordance Between New Gene Predicti ons and Other Predictors A case by case analysis of S profile signal, ORF structure, and conservation, provides convincing support for many of the newly predicted genes (Figure 5 12) T he 200,178 predictions not included in publi c annotations were compared with predictions automatically obtained using four other predic tion methods, GeneMark2.5, GeneM arkHMM, Glimmer3 .0 and Prodigal, available at the NCBI microbial genome database site. Matching predictions were identified as those ending at the same stop co don, irrespective of the predicted start of translation site. From these comparisons, we found that more than 20% (41,140) of the genes predicted in our set were also consistently predicted by all other methods we tested, and about one third were predicted by at east one other method ( Figure 51 7 ). A clear trend was observed for longer genes to be more consistently predicted across methods. Considered individually, each prediction method supported from about 23% (GeneMark2.5) to about 29% (GeneMarkHMM, Glim mer3.0) of the predictions in our set ( Figure 51 8 ). Since genes predicted by our s and other methods could very well have been excluded from annotations after expert analysis, the previous test was repeated on a subset of 10,700 topquality new predictions, characterized by: (i) globally significant H type hits. i.e., strong genelike compositional signal; (ii) presence of an ATG or GT G start codon proximal to the region of significant contrast (no more than 15 codons 5 of the start of the hit), i.e., pr esence of a canonical start of translation signal associated with significant compositional contrasts; (iii) regions of significant contrast extending across no less than 80% of the predicted coding sequence; (iv) high s equence

PAGE 95

95 comparing this restricted set of topquality predictions to coding regions predicted by other methods, more than 87% (9,336) of our predicted genes were also predicted by all other methods tested, and more than 93% were predicted by at least one other method ( Figure 51 9 ), again with longer genes more consistently predicted across methods. Individual prediction methods supported from about 90% to 92% of the predictions in our set ( Figure 520). We further qualified each top quality prediction for: (v) presence of known conserved homologs (from the NCBI nonredundant database) (vi) presence of at least one known homologue of similar (20%) length, arbitrarily setting the threshold length ratio to 1.2, as described before. Of 10,700 topquality predictions, 92.6% showed homology with other proteins, and of these 86.2% matched protein targets of similar length. It seems unlikely that other computationally determined features could be used to exclude thes e predictions from annotations. Summary statisti cs are presented in Figure 521 for the 200,178 new genes predicted From the total newly predicted genes, 33% were also predicted by at least one other method. Twenty one percent of the total newly predicted genes are agreed upon by all gene detection met hods investigated. Field entries in bold are those most likely to be real genes. There is a tendency for these predictions to have homologous sequences and to utilize canonical start codons. Of those predictions without BLAST hits and not overlapped to published annotations, the majority of them A similar analysis is presented in Figure 522 for the 132,499 genes that were uniquely predic ted by our methods From this figure we can see that the majority of the predictions not overlapped to published annotations (11%) do not have homologous sequences and

PAGE 96

96 furthermore there is less of a preference for these genes to utilize canonical start co dons Figure 523 shows a stacked bar graph of all our 200,178 coding predictions with the length conservation and number of gene predictors in agreement with our designation for these sequences to be coding. From this representation we can see that the majority of our predictions are unique and do not share homology any with known protein. Figure 5 24 is a similar graph for the 10,700 top quality predictions. Here we can see that most gene predicting algorithms predict these sequences to be genes, of these sequences, 85% have a homologous sequence of similar length. Application to Pseu domonas a eruginosa Pseudomonas aeruginosa is a member of the Gamma Proteobacteria class of bacteria. It is a Gram negative, aerobic rod measuring 0.5 to 0.8 m belonging to the bacterial family Pseudomonadaceae. P. aeruginosa is of medical importance as an emerging threat to human health for its ability to cause infections in certain classes of patients (i.e. burn victims, cystic fibrosis patients and general nosocomial infec tions) (Richards et al. 1999). Many traits of P. aeruginosa make this organism well suited to an opportunistic pathogen type of existence. These traits include, but are not limited to, an ability to subsist on a wide range of metabolites, a very plastic genome and an ability to acquire, through horizontal gene transfer or mutation, resistance to a whole host of antibiotics and the ability to develop genotypic and morphological variants under stress (Klockgether et al. 2003; Kresse et al. 2003). Complete gen ome sequences for many P. aeruginosa pathovars are available. These pathovars range in their pathogenicity and provide a rich source of information for comparative genome analyses. Genome annotations predict considerable variations in gene content among pathovars, ranging

PAGE 97

97 from 5,566 genes predicted in PAO1 to 6,286 genes predicted in PA7. However, many of the annotated genes (up to 2/3 for some strains ) are listed as hypothetical and the majority of all predicted genes have not been experimentally verified. Having accurate annotations of P. aeruginosa strains is desirable because the differential gene complement amongst these close phylogenetic strains may provide insight into genes influencing the observed differences in pathogenicity. A qualification of predicted genes will also provide better assessments on the difference in gene content between different strains and identify promising targets for ex perimental verification. The source of these differences while interesting in and of themselves may also be potential targets for treatments for patients infected with P. aeruginosa. In preliminary visual and quantitative analyses we have unambiguously identified in the high GC genome of P. aeruginosa a significant number of new candidate genes, most of which corresponded to mini genes. Recently mini genes (< 100 amino acids in length) have gained prominence as more is understood about the role they play as regulators in coordinating gene expression in cells (Kastenmayer et al. 2006). Mini proteins may play significant roles in various biological processes of all annotated genes in published prokaryote genomes, more than 10% are predicted to be mini genes. Short coding regions are typically difficult to predict computationally. Due to their short sequence length, the coding signal corresponding to these short genes is less likely to reach the levels of significance necessitated by common genefinding algorit hms. Also, small genes may not follow the same codon composition properties of longer genes, further the corresponding short peptides may not have the amino acid composition typical of average length globular proteins. Extremely short proteins are unlikely to be predicted by computational

PAGE 98

98 methods unless they conserve a very well characterized motif. Current prediction methods routinely predict coding sequences of < 100 codons based on compositional properties. The approach described in this manuscript is ba sed on simple global features, therefore the statistical problems stemming from the size of sm all genes is minimized. Pseudomonas represents an ideal system for application and experimental verification of our gene detection methods for the following reasons: (i) the genome of P. aeruginosa is very rich in GC bases (66.7%), a compositional bias corresponding to strong expected compositional contrasts among codon positions; (ii) an extensive body of literature on P. aeruginosa both experimental and computati onal, provides a rich background of information for interpretation of the results; (iii) the recent sequencing of P. aeruginosa genomes from different species provides a rich source of data for comparative analyses; (iv) P. aeruginosa is among the most medically relevant and theoretically interesting subjects for studying adaptive processes in host pathogen interactions. O ur method of gene detection performs well in any high GC prokaryotic genome the method was used to characterize the gene content in P. aeruginosa, strains PAO1, PA7, PA14, PACS2 and LESB58, with special attention paid to the identification and characterization of mini genes. The new predictions were compared to the published annotation. The results can be seen in Figure 52 5 the majority of annotated genes in each strain are confirmed by the analysis with new gene predictions made in eac h strain There are three major classes of new predictions that resulted: (i) new prediction not superimposed to any annotated gene; (ii) new predictions superimposed to an annotated gene but in different frame or orientation; (iii) new predictions corresponding

PAGE 99

99 to annotated genes but modifying its boundaries. Correspondingly, genes in the published annotation were characterized through our reannotation as : (i) confirmed; (ii) modified; ( i ii) not supported; (if no coding region is predicted) ; (iv ) contradicted (if a coding region in a different frame/orientation is instead predicted) (Figure 52 5 ) The length distribution of newly predicted genes in Pseudomonas aeruginosa PAO1 shows that ~65 % of new predictions can be classified as mini genes (< 300 nucleotides in length) (Figure 52 6 ). Similar fractions of new gene predictions in the other strains are represented by mini genes (Figure 52 5 ). Conserv ation of newly predicted genes across different strains of P. aeruginosa and amongst all bacterial species were characterized (Figure 52 7 ) using the S ignificant S egment Pair A lignment method ( Brocchieri 1998) Comparing the 180 newly predicted protein coding sequences from Pseudomonas aeruginosa PAO1 to other Pseudomonas strains it can be seen in Pseudomonas aeruginosa PA14, 81% are either included in the published annotation (Published row) or are detected by our method (New CDS row) Of the 180 newly predicted genes in Pseudomonas aeruginosa PAO1 68% are also found either in the published annotation or are discovered when our new gene detection method is performed in Pseudomonas aeruginosa PA7 Eighty six percent of the newly discovered Pseudomonas aerugi nosa PAO1 genes are found in strain LESB58. And finally 95% of the newly discovered Pseudomonas aeruginosa PAO1 genes are found in all Pseudomonas aeruginosa strains in the NCBI database or predicted in Pseudomonas species by our method. When the same collection of 180 newly predicted Pseudomonas aeruginosa PAO1 genes are supplied as BLAST input, 62% are found to be homologous in Pseudomonas aeruginosa strains 48% are homologous

PAGE 100

100 to genes found in Pseudomonas species, and 37% have homologues in other bacterial species found in the nonredundant database. These results give us confidence that the majority of the newly predicted Pseudomonas aeruginosa PAO1 genes are sound predictions. Experimental Verification of Computationally Predicted Protein Coding Regions To experimentally validate our computationally predicted new genes in Pseudomonas aeruginosa PAO1, we verified expression of the corresponding sequences using RNA seq data. Two collections of predictions with the highest confidence of being true protein coding regions were compiled. The first collection consisted of 33 newly predicted coding sequences that had global significance, a translation start signal (RTG) in close proximity to the region with a high scoring segment and a homologous protein in the nr database of similar length. The second collection consisted of 60 newly predicted proteincoding sequences, the requirements for membership in this group are more relaxed in that a translation signal in close proximity to the start of high scoring can consist of NTG. Sequences in the second collection need not be of global significant nor does membership require there to be a homologous sequence present in the nr database. RNA was collected from populations of Pseudomonas aeruginosa PAO 1 in exponential growth phase in normal conditions RNA was reversed transcribed to cDNA and prepared for deep sequencing. After reads were acquired, the reads were mapped to the Pseudomonas aeruginosa PAO1 genome. Figure 528 and Figure 529 show the S pro file for the thirty genes in the first collection. The S profile shows the contrast in GC as has been previously described and adds information to a track entitled Hits above the S profile signal showing the region spanning the length of the highscoring segment in the color of the frame consistent with

PAGE 101

101 the third codon position. Above the Hits track is a track that shows an arr ow indicating the frame and direction of the published gene s and above that is a track titled New which shows the newly predicted gene along with the name that is automatically assigned to each new prediction. The direction of the arrow indicates the str and on which the gene resides while the color indicates the frame of the gene, with the color corresponding to the phase in the third codon position. Next to each new gene prediction is a C, c or N. The C corresponds to gene predictions that have at least one homologous sequence of similar length (< 20%). The c corresponds to gene predictions that have at least one homologous sequence of dissimilar length (> 20%). T he N corresponds to genes that do not have homologous sequences in the database. The black bars in the center of each S profile plot show the read depth as determined by RNA seq. The height of each bar show the level of reads that overlap to that region of the Pseudomonas aeruginosa PAO1 chromosome. Black bars that deflect up from th e origin indicate reads that align in the direct reading frame while bars that deflect down from the origin indicate reads that align in the complementary reading frame. Figure 530, Figure 531 and Figure 53 2 show the same information for the 60 gene pre dictions of lower confidence. These results show that the majority of the computationally predicted Pseudomonas aeruginosa PAO1 genes are supported by in vivo evidence showing that the region of the Pseudomonas aeruginosa PAO1 that we predict to encode a g ene is actively undergoing transcription, in the predicted direction, leading to a mRNA that was captured and amplified by the RNA seq methodology. Application to t he Human Genome The compositional contrasts amongst codon positions that are observed in pr okaryotic sequences also apply to coding sequences in eukaryotic sequences. The

PAGE 102

102 genetic landscape in eukaryotic genes is complicated by the multi exonic structure of eukaryote genes but highscoring segments corresponding to genes can be found without attempting to join exons nor construct fulllength genes. We applied our method to the human genome in an attempt to discover genes that are not included in the published human genome. At a lower significance level ( p = 103) we identify many significant hits not superimposed to published annotated genes, but we find more hits in the r andomized sequences (Figure 53 3 ). At higher significance ( p = 104) we found an excess of hits (+10,071) compared to the randomized genome. After applying a multiple test correction, we found in shuffled sequences only 125 significant hits but in the original sequence we found 8,344 genome regions with nonrepetitive com positional contrasts consistent with being protein coding. We compared the genome positions of these highly significant hits with the positions of 17,014 human pseudogenes identified in the database Pseudogenes.org. From this comparison, we identified among our hits 945 positions corresponding to sequences identified as pseudogenes in Pseudogenes.org. Although at the significance level p 3 we found in the genome sequence as many hits as in the r andomized sequence (Figure 53 3 ), the distribution of scores from the genome sequence shows a conspicuous tail of high scores (as high as 379.9) much higher than the highest score (17.5) observed in the corresponding randomized sequence (Figure 53 4 ). So while there are more predicted new genes in the shuffled g enome, the shuffling has the effect of disrupting sustained high scoring. Similarly to what is observed for scores, the distribution of lengths of significant ( p 3) regions identified in the original genome sequence also shows a tail of longer segment s than i n

PAGE 103

103 random sequences (Figure 53 5 ). The distribution of entropies of these regions is lower but superimposed to the distribution of entropies of published exons (Figure 536). Processing Time We tested performance of our implementation of the method on a MacPro Quad 2.4 MHz personal desktop computer with a large number of complete prokaryote genome sequences and eukaryote genome contigs of various lengths. The results of these tests demonstrate that in the range of any practical sequence length, execution time is dominated by sequence scoring and run time is linearly related to sequence length (Figure 537). With our system the processing speed was approximately 5sec / Mbp.

PAGE 104

104 Figure 51. Total number of genomes in each GC class. Figure 52. Ratio of nucleotide contribution from each GC class normalized by the nucleotide total of the complete collection of genomes.

PAGE 105

105 Figure 53. Density of annotated genes per kilobase pair. Set1 Number of seqs % GC2 len / nt2 % H type3 % G type3 % H G 3 % HG3 Ch 2,451,937 53.3 1,067 84.71 77.95 85.45 77.96 90.61 84.50 79.55 71.41 Hy 1,093,956 50.9 683 62.37 52.22 64.55 54.12 71.96 61.28 54.96 45.06 Ot 5,026 51.3 868 65.80 56.67 71.63 62.77 76.58 67.29 60.84 52.15 Tot 3,550,919 52.5 948 77.80 70.00 79.00 70.59 84.84 77.32 71.95 63.27 Figure 54. Summary of statistics for each class of published annotations and their total hit types at two significance levels. 1 Ch: Characterized; Hy: Hypothetical; Ot: Other. 2 Average percent GC content in dataset.3 Percent frequency of coding regions with signif icant H type and G type hits at p = 102 (rom an) and p = 10 3 (italics). The H G fraction results from the union of the sequences with H type or G type hits. The intersection H G indicates the fraction of sequences with hits of both types.

PAGE 106

106 Figure 55. Htype hits in published CDSs as a function of length. Figure 56. Gtype hits in published CDSs as a function of length.

PAGE 107

107 Figure 57. The union of G type hits and H type hits in published CDSs as a function of length. Figure 58. H type hits in published CDSs as a function of GC content.

PAGE 108

108 Figure 59. Gtype hits in published CDSs as a function of GC content. Figure 510. The union of G type hits and H type hits in published CDSs as a function of GC content.

PAGE 109

109 Figure 511. Frequency as a function of length and GC content of characterized genes with H type or G type hits

PAGE 110

110 Category Number Chromosomes 2,03 8 Nucleotides 3,904,477,373 Published genes 3,541,071 Segments with significant contrasts 3,799,066 Low complexity 3,466 Published contradicted 142,614 (4.04%) Published not supported 457,337 (12.9%) Newly predicted genes (p < 10 -3) 200, 178 With ATG/GTG start codon 132,561 (66.1%) With TTG/CTG/ATT start codon 67,971 H type hits 139,313 With ATG/GTG 98,018 (70.4%) With global significance 29,109 (20.9%) With ATG/GTG 24,543 (84.3%) G type hits 61,219 With ATG/GTG 34,543 (56.4%) With global significance 8,601 (14.0%) With ATG/GTG 5,377 (62.5%) Figure 512. Classes of gene predictions from our method as compared to the published annotations of 1,083 prokaryote genomes.

PAGE 111

111 Figure 513. Total number of new gene predictions (per megabase pair) as a function of GC. Figure 514. Conservation of newly predicted ORFs showing over 30% of newly predicted genes have homologous sequences.

PAGE 112

112 Total new genes in 65 genomes (% of All) Annotated as pseudogenes (% of corresponding class) Type All AUG/GUG AUG/GUG All AUG/GUG AUG/GUG* H 2891 491 (17.0) 1816 (62.8) 408 (14.1) 527 (18.2) 184 (37.5) 359 (19.8) 154 (37.7) G 1165 135 (11.6) 499 (42.8) 67 (5.8) 150 (12.9) 21 (15.6) 65 (13.0) 9 (13.4) Total 4056 626 (15.4) 2315 (57.1) 475 (11.7) 677 (16.7) 205 (32.7) 424 (18.3) 163 (34.3) Figure 515. Pseudogenes and newly predicted genes in 65 prokaryote genomes. Figure 516. Length distribution of newly predicted genes compared to known homologs found in the nonredundant BLAST database.

PAGE 113

113 n 0 1 2 3 4 num ber 132,499 10,753 6,913 8,873 41,140 % 66.19 5.37 3.45 4.43 20.55 len/aa 66.9 93.8 104.8 131.1 226.0 Figure 517. Newly predicted genes, 200,178 in total, confirmed by n other gene predicting methods Method Total CDS1 Total (%) among newly predicted 2 GenBank Annotation 3,544,892 0 (0.00) GeneMark2.5 3,322,073 47,889 (23.92) GeneMarkHMM 3,726,046 58,588 (29.27) Glimmer3 3,868,542 56,565 (28.26) Prodigal 3,624,014 52,717 (26.34) Figure 51 8 Newly predicted genes confirmed by other specified gene predicting algorithms 1Total number of coding regions predicted by the indicated method in the set of chromosomes. 2Total (percent) of coding regions predicted by our methods and not included in GenBank annotations that are also predicted by the indicated method.

PAGE 114

114 n 0 1 2 3 4 number 704 98 11 9 44 3 9,336 % 6.58 0.92 1.11 4. 1 4 87.25 len/aa 187.6 223.6 214.1 258.2 307.9 Figure 519. Top quality newly predicted genes, 10,700 in total, confirmed by n other gene predicting methods. Method Total (%) GenB ank Annotation 0 (0.00) GeneMark2.5 9,641 (90.10) GeneMarkHMM 9,857 (92.12) Glimmer3 9,678 (90.45) Prodigal 9,833 (91.90) Figure 520. Top quality newly predicted genes confirmed by specified gene predicting method. Over 90% of the predictions are in agreement with the four other gene prediction algorithms investigated.

PAGE 115

115 Type of prediction Start 2 Hit not overlapped to published 3 Hit overlapped to published3 Total 1 With BLAST hits RT G Rare 38,116 (29,481) 6,118 (4,295) 8,520 (4,591) 1,109 (450) 46,636 (34,072) 7,227 (4,745) With BLAST RT G Rare 28,855 (23,197) 4,400 (3,466) 6,604 (3,936) 698 (354) 35,459 (27,133) 5,098 (3,820) No BLAST hits RT G Rare 7,830 (1791) 889 (85) 4,216 (427) 881 (20) 12,046 (2,218) 1,770 (105) No BLAST hits, len > 100 cod RT G Rare 946 (301) 133 (33) 1116 (149) 268 (11) 2,062 (450) 401 (44) No BLAST 100 cod RT G Rare 6,884 (1,490) 756 (52) 3,100 (278) 613 (9) 9,984 (1,768) 1,369 (61) No BLAST 60% RT G Rare 4,716 (1,291) 391 (40) 2,273 (311) 241 (8) 6,989 (1,602) 632 (48) No BLAST 60% cod RT G Rare 4,197 (1,079) 326 (23) 1,613 (192) 160 (3) 5,810 (1,271) 486 (26) Figure 521. Characterization of coding sequences also predicted by four gene detection methods investigated. 1Number of nonpublished ORFs identified in this study and predicted by at least one of four other tested methods. Values in parentheses indicate those predicted by all four other methods. 2ORF is started by an A T G or GTG codon (RTG) or by a TTG, CTG, or ATT codon (Rare). 3A newly predicted coding region is considered not to overlap one or

PAGE 116

116 Type of prediction Start 2 Hit not overlapped to published 3 Hit overlapped to published3 Total 1 BLAST hits RT G Rare 3,310 2,177 6,145 2,906 9,455 5,083 BLAST hits, RT G Rare 1888 795 3,022 1150 4,910 1,945 No BLAST hits RT G Rare 15,555 12,213 48,679 41,514 64,234 53,727 No BLAST hits, len > 100 cod RT G Rare 1,771 836 9,444 5,515 11,215 6,351 No BLAST 100 cod RT G Rare 13,784 11,377 39,235 35,999 53,019 47,376 No BLAST 60% RT G Rare 5,542 1,975 11,472 4,287 17,014 6,262 No BLAST 60% cod RT G Rare 4,613 1,732 8,128 3,142 12,741 4,874 Figure 522. Characterization of coding sequences uniquely predicted by our method. 1Number of nonpublished ORFs identified in this study and predicted by at least one of four other tested methods. Values in parentheses indicate those predicted by all four other methods. 2ORF is started by an ATG or GTG codon (RTG) or by a TTG, CTG, or AT T codon (Rare). 3A newly predicted coding overlap published CDSs.

PAGE 117

117 Figure 523. Length conservation of all newly predicted coding regions. Most predictions are uniquely predicted by our method and do not have homologous sequences in the nr database. Figure 524. Length conservation of topquality newly predicted proteincoding regions. Most predictions ar e confirmed by the other gene predicting algorithms and have homologous sequences with similar length.

PAGE 118

118 PAO1 PA7 PA14 PACS2 LESB58 confirmed 5359 5640 5531 5347 5526 modified 160 145 170 277 188 not supported 103 381 151 70 159 contradicted 23 86 34 22 36 TOTAL 5645 6252 5886 5716 5909 new 180 142 99 294 217 mini genes 116 81 59 196 151 Figure 525. Total genes in the published annotations of four strains of Pseudomonas aeruginosa along with new genes predicted by the described methods, (mini genes) is the number of new predictions found that are under 100 amino acids in length. Figur e 526. Length distribution of new gene predictions in Pseudomonas aeruginosa strain PAO1 as found by the described method.

PAGE 119

119 SSPA BLAST PA14 PA7 LESB58 P. a. P. a. Pseudomonas Other Bacteria Published 68 20 107 126 112 87 67 New CDS 87 105 59 116 Total 145 123 155 171 Figure 527. Comparing predicted genes of P. aeruginosa PAO1 with annotated or newly predicted genes in other strains of P. aeruginosa (PA14, PA7 and LESB58) using SSPA pairwise alignment and comparing them with NCBI nonredundant database of proteins using BLAST.

PAGE 120

120 Figure 528. S profile of high confidence newly predicted genes in Pseudomonas aeruginosa PAO1 showing RNA seq reads aligning to the region of the chromosome predicted to be coding for protein. Many of the newly predicted genes are supported by RNA seq data.

PAGE 121

121 Figure 529. S profi le of high confidence newly predicted genes in Pseudomonas aeruginosa PAO1 showing RNA seq reads aligning to the region of the chromosome predicted to be coding for protein. Many of the newly predicted genes are supported by RNA seq data.

PAGE 122

122 Figure 530. S profile of high confidence newly predicted genes in Pseudomonas aeruginosa PAO1 showing RNA seq reads aligning to the region of the chromosome predicted to be coding for protein. Many of the newly predicted genes are supported by RNA seq data.

PAGE 123

123 Figure 531. S profile of high confidence newly predicted genes in Pseudomonas aeruginosa PAO1 showing RNA seq reads aligning to the region of the chromosome predicted to be coding for protein. Many of the newly predicted genes are supported by RNA seq data.

PAGE 124

124 Fig ure 532. S profile of high confidence newly predicted genes in Pseudomonas aeruginosa PAO1 showing RNA seq reads aligning to the region of the chromosome predicted to be coding for protein. Many of the newly predicted genes are supported by RNA seq data.

PAGE 125

125 Collection Hit type p 3 p 4 Multiple test correction p 2 Genome Random Genome Random Genome Random New CDS H 142,193 179,707 65,622 71,798 1698 2 G 64,346 64,045 27,366 11,119 6646 123 Total 206,539 245,752 92,988 82,917 8344 125 Repetitive (H<0.4) H 4,869 102 2,646 37 537 0 G 13,262 464 11,724 156 5836 0 Total 18,131 566 14,370 193 6373 0 Figure 5 33. Number of new sequence segments with significant coding potential in the human genome. Hits from highscoring segments (H) and from G tests (G) identified at different levels of significance in the original human genome sequence (Genome) and after randomization (Random). Potential new CDSs are distinguished from repetitive elements based on Shannons entropy (H) d istribution calculated in published CDSs. Multiple testing was taken into 2 in a randomized sequence the length of the genome. Figure 53 4 Distribution of scores among new hits obtained from the original genome sequence and random hits from shuffled sequence.

PAGE 126

126 Figure 53 5 Length distribution of hits from the original genome sequence and after randomization. Figure 53 6 Dist ribution of Shannons entropies among new hits and published exons. Entropy of codon states is expressed in cods, i.e., normalized by ln(61), the logarithm of the number of codon types. Sequences

PAGE 127

127 Figure 53 7 Processing time of the algorithm as a function of sequence length. Run time is linear with respect to sequence length.

PAGE 128

128 CHAPTER 6 DISCUSSION This dissertation introduced two quantitative procedures to identify sequence segments of any length containing statistically significant, threeperiodic compositional contrasts. The first procedure identifies sequence segments that are expected to arise in coding sequences (H hits) of a given composition. The second procedure identifies threeperiodic compositional contrasts of any type (G hits). These procedures quanti fy and extend the principle utilized in frame analysis for gene identification, which is based on visual identification of contrasts in GC content usage between the three codonpositions. Our approach, which we called Quantitative Frame Analysis, is simi larly based on the observation that a coding region generates, in coding DNA or corresponding mRNA sequences, significant compositional contrasts of period three (3periodic contrasts). However, the nature of these contrasts depends on the local compositio n of the sequence. Our analysis being based on contrasts of any base type, not solely on codon position specific compositional contrasts in GC, transcends the original application of frame analysis to only high GC content sequences and can be utilized for sequences of any composition, albeit with different success expectations. Furthermore, our methods provide precise delineation of regions of contrast and supply statistical support for the observed compositional contrasts. The dependency of the score of each codon to the overall content s of the sequence provides information on the major components of the resulting contrasts in sequences of different composition. For example, in GC rich coding sequences the dominant contrast involves the second codon positi on, which tends to be AT rich, and the third codon position, which has the highest frequency of GC bases. In AT rich

PAGE 129

129 sequences the dominant contrasts are instead between the first codon position, which tends to be GC rich, and the AT rich third codon posit ion. Thus, in both GC rich and AT rich sequences the dominant compositional contrasts arise predominantly between second (in GC rich sequences) or first (in AT rich sequences) codon position, and third codon position, and result from the contrasting effect s of selection for amino acid usage (mainly first and second codon position) and a combination of selection for optimal codon usage and genomewide GC content, and of nucleotide mutation biases. Sequences of intermediate GC content (~45%) show the weakest compositional contrasts, which mostly involve first codon position, where strong bases are prevalent, and second codon position, which are AT rich. Thus, in coding sequences of intermediate GC content, the presence and nature of the observed contrasts most ly depend on and reflects the amino acid usages of the corresponding proteins. With the first method, which we call the H test, we detect potential coding regions based on expectations about codon position dependent contrasts, learned from sets of known g enes, and related cumulated scores, using a highscoring segment approach. With the second method, the G test, we detect sequence segments with any compositional 3periodic contrast, applying a special implementation of a Chi square test to cumulated conti ngency tables. By defining compositiondependent log odds ratio scores for trinucleotides (codons), and by generating, using computer simulations, large numbers of randomized sequences of varying lengths and compositions, we have been able to quantify us ing our methods the observed compositional contrasts, and to qualify statistically the significance of the observed contrasts. A further feature in our implementation is that the scoring system is modulated depending on the local

PAGE 130

130 composition of the sequenc e, thus taking into consideration the fact that the expected composition of coding regions is specific to the local overall composition. Thus our method takes into consideration the possible mosaic structure of a genome, generated for example by the pres ence of pathogenicity islands or integrated viruses, or any other factor causing fluctuations of the overall composition of the sequence. Two identify coding segments of any length we adopted the approach of cumulative scores, implemented in two tests, the H test and the G test. The H test is inspired by the highscoring segment theory approach (Karlin 1994), which provides a probabilistic framework to assess the significance of the calculated cumulative scores. The G test, although also based on statistic al evaluation of cumulative scores, has no direct relation to previously published algorithms in the literature of computational gene prediction. Both procedures are implemented in the geneidentification bioinformatics tool N PACT (N Profi le Analysis Comp utational Tool), available for public use at the URL http://genome.ufl.edu/npact/ N PACT automatically produces gene predictions and characterizations, and a graphical output representing sequence compositional features and the position of sequence regions of contrast and corresponding gene predictions The predictor implemented in N PACT is designed to identify coding regions in prokaryotic organisms (that is, it assumes singleexon genes). However, identificati on of regions of contrasts is independent from gene structure and can provide useful information also for detection of coding regions in sequences from eukaryotic organisms. Applying the methods to more than 1000 complete genome sequences of prokaryotic or ganisms, we identified several thousands of sequences, not included in

PAGE 131

131 published annotations, with highly significant compositional contrasts and putative start of translation signals, spanning lengths between 50 and 6000 codons, many of which were corroborated by homology to known genes. Among all regions of contrast identified in the 1000 genomes, it is likely that a fraction corresponds to non functional gene fragments (pseudogenes). This is also suggested by pseudogene predictions available for a subset of 65 genomes, where about onesixth of the newly identified ORFs were in fact listed in the Pseudogene.org database. Consistently, of 68,502 ORFs for which we were able to identify one or more homologous sequences in the NCBI protein database, we found t hat some were notably shorter than their homologous sequence of most similar length (e.g., 11.2% were shorter than half length; 17.0% were shorter than twothird length). Per contra, 69.0% of the conserved predicted sequences, i.e. 51,296 sequences, where predicted to be of length similar to homologous sequences deposited in public databases, suggesting that they encode functional genes. Besides for de novo gene finding, our methods can also be used to statistically evaluate predictions by other methods. In fact, the H and G tests provide statistical support for hundreds of thousands of hypothetical gene predictions previously included in publish annotations, thus providing further evidence for these otherwise uncharacterized predictions. At the same time, our methods can help in identifying potential mis annotations, either reporting lack of significance in the corresponding compositional contrasts, or identifying ORFs in alternative coding frames consistent with the expected compositional properties of the sequence. The results demonstrate that the methods are useful for the process of sequence annotation and for supporting

PAGE 132

132 results from other gene predictors. The methods are also useful for identifying coding regions in short, possibly anonymous, sequences, such as for the analysis of RNA seq or metagenomic data. Perhaps the most surprising findings from our tests, was that many of the coding regions predicted by our methods and that have not been included in published annotations, were instead listed in predictions automatically generated by well established gene predictors (GeneMark2.5, GeneMark.hmm, Glimmer 3.0, Prodigal) and available at the NCBI Microbial Genome Database. The vast majority of ~10,000 of the most robust gene predictions (our topquality subset) were confirmed by all methods we tested, and by conservation analysis. Among all ~200,000 gene predicted by our method and not included in published annotations, ~40,000 were also agreed upon by all other predictors and confirmed by homologous sequences recognized in databases. This result suggests that the exclusion from published annotations of these most likely true genes, rather than resulting from limitations of genefinding algorithms, is the effect of how automatically generated predictions are handled and selected through manual intervention. We apply our methods to the analysis of the genome sequences of strains of Pseudomonas aeruginosa, an ideal candidate for its high GC content (67%) and for its careful annotation. Unusual regulatory genes that have escaped annotation due to their small size and unusual composition have been recently identified in P. aeruginosa. Many other genes of similar properties may still wait to be discovered. Many coding regions of small size are included in the cur rent annotations of strains of P. aeruginosa with the potential of encoding mini proteins of regulatory functions. The procedures

PAGE 133

133 introduced here provide a characterization of these hypothetical genes, identifying those that have the most significant c oding potential. Our methods provided significant evidence for many highly significant, short coding sequences, newly identified or selected from previously annotated hypothetical genes, providing focused targets for experimental verification. Expression data obtained with RNA seq deepsequencing analysis in Pseudomonas aeruginosa PAO1, showed that many of the putative genes detected by our methods are supported by mRNA expression. Computational prediction of prokaryote genes presents fewer challenges than prediction of often multi exonic and alternatively spliced eukaryotic genes. However, the results presented here substantiate the fact that codondependent nucleotide compositional contrasts exist in eukaryotic genes and our method has the power to identif y those regions. While we made no attempt to completely describe gene structures for eukaryotic genes, our methods identified also in the human genome, genelike regions of significant compositional contrast not superimposed to annotated genes or pseudogenes. These findings suggest that our method can also be useful for de novo gene identification of eukaryotic gene sequences, whose structure can be further refined and characterized by other methods, or in future implementations of Quantitative Frame Analys is for eukaryotic gene prediction.

PAGE 134

134 LIST OF REFERENCES Almagor, H. (1 9 85) Nucleotide distribution and the recognition of coding regions in DNA sequences: an information theory approach. J. Theor. Biol 117 127136. Altschul S Gish W Miller W Myers E and Lipman,D. (1990) Basic local a lign ment search t ool. J. Mol. Biol. 215(3) 403 410. Besemer,J. and Borodovsky ,M. (1999) Heuristic approach to deriving models for gene f inding. Nucleic Acid s Res 27 3911 3920. Besemer,J., Lomsadze, A. and Borodovsky ,M. (2001) GeneMarkS: a self training method for prediction of g ene starts in m icrobial genomes. Implications for finding sequence motifs in r egulatory r egions. Nucleic Acid s Res 29(12) 26072618. Bibb,M. J., Findlay, P.R. and Johnson ,M.W. (19 84) The relationship between base composition and codon usage in bacterial g enes a nd its use for the s imple and r eliable i dentification of proteinc oding sequences. Gene 30 157 166. Bollag,D.M., Rozycki,M.D. and Edelstein ,S.J (1996) Protein Methods 2n d Edition. Borodovsky,M. and McIninch,J .D (1993) GenMark: p arallel g ene r ecognition for both DNA s trands. Computers Chem 17, 123133. Borodovsky, M., McIninch,J.D., Koonin, E.V., Rudd, K.E., Medigue, C. and Danchin ,A (1995) Detection of new g enes in a b acterial g enome u sing Markov m odels for three gene c lasses. Nucleic Acid s Res 23(17) 3554 3562. Brocchieri,L. and Karlin,S. (1998) A symmetric iterated multiple alignment of protein sequences. J. Mol. Biol., 276 249 264. Brocchieri, L., Kledal, T.N., Karlin, S. and Mocarski ,E.S. (2005) Predicting coding potential from g enome sequence: a pplication to b etaherpesviruses i nfecting r ats and m ice. J. Virol 79(12) 75707596. Burge,C. and Karlin ,S (19 97) Prediction of complete gene structures in human g en omic DNA. J. Mol. Biol. 268, 7894. Chang,B., Halgamuge,S. and Tang ,S.L (2006) Analysis of SD sequences in c ompleted m icrobial genomes: non SD led genes are as common as SD led g enes. Gene 373 9099. Delcher,A.L., Bratke,K.A., Powers,E.C. and Salzberg,S.L. (2007) Identifying bacterial genes and endosymbiont DNA with Glimmer. Bioinformatics 23(6) 673679. Delcher,A.L., Harmon, D., Kasif, S., White, O. and Salzberg ,S.L (1999) Improved m icrobial g ene i dentification with G limmer. Nucleic Acids Res ., 27 46364641.

PAGE 135

135 Doerks, T., Bairoch, A. and Bork,P. (1998) Protein a nnotation: d etective w ork for f unction p rediction. Trends Genet 14, 248 250. Elton,R.A. (1974) Theoretical models for h eterogeneit y of base c omposition in DNA. J. Theor. Biol. 45 533553. Fickett,J.W. (1982) Recognition of protein coding regions in DNA se quences. Nucleic Acids Res ., 10, 53035318. Fickett,J.W., Torney,D.C. and Wolf,D.R. (1992 ) Base compositional structure of genomes. Genomics 13, 10561064. Fic kett,J.W. and Tung ,C.S (1992a ) Assessment of protein coding measures. Nucleic Acid s Res 20, 64416450. Galperin,M.Y. and Koonin,E.V. (1998) Sources of systematic error in functional annotation of genomes: domai n rearrangement, non orthologous gene displacement, and operon disruption. In Silico Biol., 1 55 67. Gassel,M., Mollenkamp, T., Puppe,W. and Altendorf,K. (1999) The KdpF subunit is part of the K+translocating Kdp c omplex of the Escherichia coli and is responsible for sta biliza tion of the c omplex in Vitro J. Biol. Chem 274(53) 37901 37907. Guigo,R., Agarwal, P., Abril, J.F., Burset,M. and Fickett,J.W. (2000) An assessment of gene prediction accuracy in large DNA s equences. Genome Res 10 16311642. Hyatt,D., Chen, G.L ., LoCascio, P.F., Land, M.L., Larimer, F.W. and Hauser ,L.J (2010) Prodigal: prokaryotic gene recognition and translation initiation site i dentification. BMC Bioinformatics 11 119. Ishikawa,J. and Hotta,K. (1999) FramePlot: a new implementation of the f rame analysis for predicting proteincoding regions in bacterial DNA with a high G+C c ontent. FEMS Microbiology Letter 174 251 253. Karlin,S. (1994) Statistical studies of biomolecular sequences: scorebased methods. Philo. Trans.: Biol. Sci ., 344(1310) 391 402. Karlin S and Altschul S F (1990) Methods for a ssessing the statistical significance of m olecular sequence features by using g eneral scoring s chemes. Proc. Natl. Acad. Sci. USA 87, 22642268. Kastenmayer,J. P., Ni, L., Chu, A., Kitchen, L.E., Au, W., Yang, H., Carter, C.D., Wheeler, D., Davis, R.W., Boeke, J.D., Snyder, M.A. and Basrai ,M.A (2006) Functional genomics of genes with s mall o pen r eading f rames (sORFs) in S. cerevisiae. Genome Res 16, 365373.

PAGE 136

136 Klockgether, J., Wurdemann, D., Reva, O., Wiehlmann, L. and Tummler ,B (2007) Diversity of the abundant pKLC102/PAGI 2 family of genomic i slands in Pseudomonas aeruginosa. J. Bact. 189(6) 24432459. Konu ,O. and Li ,M.D (2002) Correlations between mRNA expression levels and GC contents of coding and untranslated r egions of g enes in r odents. J. Mol. Evol 54, 3541. Kresse,A. U., Dinesh, S.D., Larbig, K. and Romling ,U (2003) Impact of l arge chromosomal i nversions on the a daptation and e volution of Pseudomonas aeruginosa chronically colonizing cyst ic fibrosis lungs. Molec. Microb 47(1) 145 158. Larsen,T.S. and Krogh,A (2003) EasyGene a prokaryotic gene finder that ranks ORFs by statistical s ignificance. BMC Bioinformatics http://www.biomedcentral.com/14712105/4/21 Lazazzera,B. A., Kurtser,I.G., McQuade, R.S. and Grossman ,A.D (1999) An a utoregulatory circuit a ffecting p eptide signaling in Bacillus subtilis J. Bacteriol 181(17) 51935200. Lukashin,A.V. and Borodovsky,M. (1998) GeneMark.hmm: n ew solutions for g ene f inding. Nucleic Acid s Res ., 26(4) 11071115. Mathe,C., Sagot, M., Schiex, T. and Rouze,P. (2002) Current m ethods of g ene p rediction, t heir strengths and w eaknesses. Nucleic Acid s Res 30(19) 41034117. Nielsen,P. and Krogh ,A. (2005) Largescale p rokaryotic g ene p rediction and comparison to g enome a nnotation. Bioinformatics 21(24) 43224329. Richards,M.J., Edwards, J.R., Culver, D.H. and Gaynes ,R.P (1999) Nosocomial infections i n medical intensive care u nits in the United States. National nosocomial infections surveillance s ystem. Crit. Care Med 27 887 892. Rutherford, K., Parkhill,J. Crook, J., Horsnell, T., Rice, P., Rajandream, M.A. and Barrell,B. (2000) Artemis: sequence visualization and a nnotation. Bioinformatics 16, 944 945. Salzberg,S. L., Delcher, A.L., Kasif, S. and White ,O. (1998) Microbial gene identification using interpolated Markov m odels. Nucleic Acid s Res 26(2) 544548. Shannon,C.E. (1948) A mathematical theory of c ommunication. Bell System Technical Journal 27, 379 423.

PAGE 137

137 Shepherd,J.C.W. (1981) Method to determine the reading frame of a protein from the purine/pyrimidine genome sequence and its possible evolutionary j ustification. Proc. Natl. Sci. 78, 15961600. Shine,J. and Dalgarno ,L. (1975) Determinant of cistron specificity in bacterial r ibosomes. Nature, 254 34 38. Skovgaard, M., Jensen, L.K., Brunak, S., Ussery, D. and Krogh ,A. (2001) On the t otal n umber of g enes and t heir l ength d istribution in complete m icrobial g enomes. Trends in Gene. 17(8) 425 428. Snyder,E.E. and Sto rmo,G.D. (1995) Identification of protein coding regions in g enomic DNA. J. Mol. Biol. 248, 1 18. S okal,R.R. and Rohlf,F.J. (1994) Biometry: the principles and practices of s tatistics in b iolog ical r esearch. Third (3rd) Edition. Freeman and Co. New York, NY. Solovyev, V. and Salamov. A. (1997) The genefinder computer tools for analysis of human and model organisms genome s equences. Proc. Int. Conf. Intell. Syst. Mol. Biol., 5 294 302. Sueoka,N. (1962) On the genetic basis of variation and heterogeneity of DNA base composition. Proc. Natl. Sci 48 582 592. Troyanskaya,O. G., Arbell, O., Koren, Y., Landau, G.M. and Bolshoy ,A (2002) Sequence complexity p rofiles of p rokaryotic g enomic seque nces: a fast a lgorithm for calculating l inguistic complexity. Bioinformatic s, 18, 679 688. Ventre, I., Goodman, A.L., Vallet Gely, I., Vasseur, P., Soscia, C., Molin, S., Bleves, S., Lazdunski, A., Lory, S. and Filloux ,A. (2006) Multiple sensors control r eciprocal e xpression of Pseudomonas aeruginosa r egulatory RNA and v irulence g enes. Proc. Natl. Sci. 103(1) 171 176. Viterbi,A.J. (1967) Error bounds for convolutional codes and an asymptotically optimum d ecoding a lgorithm. IEEE Trans. Informat. Theory, V, IT 13, 260 269. Wu,W. and Jin ,S. (2005) PtrB of Pseudomonas aeruginosa suppresses the Type III Secretion System under the stress of DNA d amage. J. Bact 187(17 ), 60586068.

PAGE 138

138 BIOGRAPHICAL SKETCH Steve Oden was born in Charleston, South Carolina. Shortly thereafter he moved with his family to Sacramento, California where he attended elementary school at James Marshall Elementary School until the sixth grade. At this point he attended Albert Einstein Middle School. After successfull y graduating middle school he and his family relocated to Palm Coast Florida where he attended Flagler Palm Coast High School. Steve next attended Daytona Beach Community College where he graduated with an Associate in Science degree. Subsequently he matr iculated to Florida State University in Tallahassee, Florida, where he received a Bachelor of Science degree in biology with dual minors of chemistry and physics. Steve went on to work for several years at the North Florida Research and Education Center in Quincy, Florida for the University of Florida IFAS. He received his Ph.D. from the College of Medicine at the University of Florida in the f all of 2012.