Citation
A Novel Methodology for Identifying Conserved Regulatory Modules at the Binding Site Level

Material Information

Title:
A Novel Methodology for Identifying Conserved Regulatory Modules at the Binding Site Level
Creator:
MENG, HAILONG ( Author, Primary )
Copyright Date:
2008

Subjects

Subjects / Keywords:
Binding sites ( jstor )
Databases ( jstor )
Datasets ( jstor )
DNA ( jstor )
False positive errors ( jstor )
Genomes ( jstor )
Matrices ( jstor )
Nucleic acids ( jstor )
Nucleotide sequences ( jstor )
P values ( jstor )

Record Information

Source Institution:
University of Florida
Holding Location:
University of Florida
Rights Management:
Copyright Hailong Meng. Permission granted to University of Florida to digitize and display this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.
Embargo Date:
8/31/2006
Resource Identifier:
649810168 ( OCLC )

Downloads

This item has the following downloads:

meng_h ( .pdf )

meng_h_Page_017.txt

meng_h_pdf.txt

meng_h_Page_018.txt

meng_h_Page_015.txt

meng_h_Page_102.txt

meng_h_Page_071.txt

meng_h_Page_023.txt

meng_h_Page_012.txt

meng_h_Page_026.txt

meng_h_Page_056.txt

meng_h_Page_051.txt

meng_h_Page_006.txt

meng_h_Page_075.txt

meng_h_Page_031.txt

meng_h_Page_104.txt

meng_h_Page_054.txt

meng_h_Page_042.txt

meng_h_Page_030.txt

meng_h_Page_033.txt

meng_h_Page_082.txt

meng_h_Page_041.txt

meng_h_Page_025.txt

meng_h_Page_058.txt

meng_h_Page_032.txt

meng_h_Page_074.txt

meng_h_Page_066.txt

meng_h_Page_096.txt

meng_h_Page_076.txt

meng_h_Page_060.txt

meng_h_Page_069.txt

meng_h_Page_034.txt

meng_h_Page_037.txt

meng_h_Page_084.txt

meng_h_Page_027.txt

meng_h_Page_049.txt

meng_h_Page_028.txt

meng_h_Page_009.txt

meng_h_Page_002.txt

meng_h_Page_091.txt

meng_h_Page_083.txt

meng_h_Page_045.txt

meng_h_Page_063.txt

meng_h_Page_008.txt

meng_h_Page_035.txt

meng_h_Page_062.txt

meng_h_Page_013.txt

meng_h_Page_007.txt

meng_h_Page_089.txt

meng_h_Page_001.txt

meng_h_Page_004.txt

meng_h_Page_050.txt

meng_h_Page_003.txt

meng_h_Page_085.txt

meng_h_Page_068.txt

meng_h_Page_094.txt

meng_h_Page_073.txt

meng_h_Page_070.txt

meng_h_Page_092.txt

meng_h_Page_036.txt

meng_h_Page_010.txt

meng_h_Page_061.txt

meng_h_Page_065.txt

meng_h_Page_098.txt

meng_h_Page_088.txt

meng_h_Page_053.txt

meng_h_Page_047.txt

meng_h_Page_081.txt

meng_h_Page_097.txt

meng_h_Page_057.txt

meng_h_Page_087.txt

meng_h_Page_040.txt

meng_h_Page_086.txt

meng_h_Page_093.txt

meng_h_Page_079.txt

meng_h_Page_078.txt

meng_h_Page_046.txt

meng_h_Page_024.txt

meng_h_Page_043.txt

meng_h_Page_072.txt

meng_h_Page_095.txt

meng_h_Page_103.txt

meng_h_Page_020.txt

meng_h_Page_067.txt

meng_h_Page_077.txt

meng_h_Page_052.txt

meng_h_Page_064.txt

meng_h_Page_059.txt

meng_h_Page_044.txt

meng_h_Page_099.txt

meng_h_Page_048.txt

meng_h_Page_055.txt

meng_h_Page_029.txt

meng_h_Page_011.txt

meng_h_Page_016.txt

meng_h_Page_039.txt

meng_h_Page_021.txt

meng_h_Page_100.txt

meng_h_Page_080.txt

meng_h_Page_038.txt

meng_h_Page_005.txt

meng_h_Page_019.txt

meng_h_Page_090.txt

meng_h_Page_014.txt

meng_h_Page_022.txt

meng_h_Page_101.txt


Full Text











A NOVEL METHODOLOGY FOR IDENITFYING CONSERVED REGULATORY
MODULES AT THE BINDING SITE LEVEL














By

HAILONG MENG


A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA


2006




























Copyright 2006

by

Hailong Meng
















To my dearly loved wife, Haoxian, with whom enj oyable and difficult times were shared
throughout this journey, and to my parents, Zhanfu and Xioumei, for their love and
support through my life.
















ACKNOWLEDGMENTS

I would like to express my sincere thanks and appreciation to my advisors, Dr.

Arunava Banerj ee and Dr. Lei Zhou, for their invaluable guidance, support, and

encouragement throughout my graduate research and study at the University of Florida.

My accomplishments could not have been achieved without their support.

I would also like to thank my committee members, Drs. Anand Rangaraj an, Sam

Wu and Tamer Kahveci, for their insightful comments and helpful suggestions.

Finally, my appreciation goes to all my friends at the University of Florida,

especially those at Dr. Banerj ee and Dr. Zhou' s lab, for their helpful suggestions,

discussions and friendships.



















TABLE OF CONTENTS


page

ACKNOWLEDGMENT S .............. .................... iv

LI ST OF T ABLE S ................. ................. viii............

LIST OF FIGURES .............. .................... ix

AB STRAC T ................ .............. xi

CHAPTER

1 INTRODUCTION ................. ...............1.......... ......

2 LITERATURE REVIEW ................. ...............4................


Computational Representation of TFB Ss ................. ...............4............ ...
TFB Ss Databases ................. .. ............. ........ ... ..................6
Computational Identification and Statistical Evaluation of TFB Ss ................... ...........8
Computational Identification of TFBSs .............. ...............8.....
Stati stical Evaluation of TFBS s.................... ............... .......... .......... .....1
Cis-Regulatory Modules (CRMs) and Computational Prediction of CRMs ..............12
Cis-Regulatory Modules (CRMs) ................. ...............12................
Computational Identification of CRMs .................. ...... ............. .....................14
Cross-species Conservation of Cis-Regulatory Modules (CRMs) ................... ...15
Computational Prediction of Cis-Regulatory Modules (CRMs) by Cross-
species Genome Comparison ................. ...............17........... ....

3 THE FEASIBILITY STUDY OF CROSS-GENOME COMPARISON AT THE
BINDING SITE LEVEL .............. ...............21....


Background ................. ...............21.................
Re sults................ ... .......... .............. .. ...... ... ...........2
Simulating Sequence Pairs Harboring a Conserved Module of Binding Sites ...25
Identifying a Conserved Module by Comparison at the Binding Site Level ......28
Distribution and Statistical Evaluation of the BLISS score .............. ................31
Identifying a Conserved Regulatory Module in Distantly Related Species........34
Implementation of BLISS as a Web-based Service .............. .....................3
Discussion ................. ...............40.................
Conclusions............... ..............4











Methods .............. .......... ..............4
Generating Simulated Sequences .............. ...............43....
Binding Site Identification .............. ...............44....
P Value for M score ................. ...............45........... ...
Gaussian Smoothing .................. ... ......... ......... .............4
Searching the Conserved Module in the Long Sequence ................. ................46
Large-scale Search of the Simulated Sequences, Statistical Analysis ................47
Searching for the eve2 Module in D. virilis and D. moj avanis Sequences .........47
W ebsite Construction .............. ...............48....

4 BINDING-SITE LEVEL IDENTIFICATION OF SHARED REGULATORY
MODULE S IN TWO ORTHOLOGOUS SEQUENCE S ................ ............... .....49


Back ground ................. ...............49.......... ......
Re sults.................. ........... ............. ... ... ... ..........5
Simulating Sequence Pairs with Varying Complexity of Conserved Binding
Site Patterns ............... ......... .. .. ... .............5
Identifying Conserved Regulatory Complexes by Comparison at the Binding-
site Level ................... ........... ......... .............5
Testing BLISS2 in Real-world Examples .............. ...............56....
Website Implementation of BLISS2 .............. ...............60....
Discussion ................. ...............61.................
Conclusions............... ..............6
Methods .............. .......... ..............6
Generating Simulated Sequences .............. .......... ...................6
Identifying Conserved Regulatory Complexes by Comparison at the Binding-
site Level .................. .. ........... ... ......... .............6
Performance Testing Using Simulated Sequences ................ .. ... ......... ........65
Searching for the eve2 Regulatory Complex in D. virilis and D. moj avanis ......66
W ebsite Construction .............. ...............66....

5 BLISS 2.0: THE WEB-BASED TOOL FOR PREDICTING CONSERVED
REGULATORY MODULES IN DISTANTLY-RELATED ORTHOLOGOUS
SEQUENCE S .............. ...............67....

6 CONCLUSIONS AND SUGGESTIONS FOR FUTURE STUDY ................... ........74


APPENDIX

A PARAMETER OPTIMIZATIONS FOR GAUSSIAN SMOOTHING METHOD ...76


B A DYNAMIC PROGRAMMING ALGORITHM FOR IDENTIFYING
CONSERVED REGULATORY MODULES ................. ............... ......... ...77


C AN IMPROVED ALGORITHM FOR BLISS2 .............. ...............79....

LI ST OF REFERENCE S ................. ...............8.. 2......... ....












BIOGRAPHICAL SKETCH .............. ...............92....



















LIST OF TABLES


Table pg


2-1. IUPAC consensus characters............... ...............

2-2. Statistics of TRANSFAC ................. ...............6................

2-3. Matrices in TRANSFAC Professional r9.1. ............. ...............7.....

4-1. Simulated data sets .............. ...............52....

4-2. Performance test of BLIS S2 for M score cutoff=0.75. .........___ ..... ......_........54

4-3. Performance test of BLISS2 for M score cutoff=0.8. ................... ...............5


C-1. Performance test of the improved BLIS S2 for M~score cutoff=0.75 ........._............79

C-2. Performance test of the improved BLIS S2 for M~score cutoff=0.8 ........._..............79

C-3. Performance comparison of two BLISS2 algorithms for second data set when
M score cutoff is 0.75................ ...............80.


















LIST OF FIGURES


Figure pg

1-1. Transcription and translation. .............. ...............1.....

1-2. Transcription is controlled by transcription factors ................. .........................2

2-1. Frequency matrix of transcription factor "v-Myb" from TRANSFAC. ......................5

2-2. The known TFBSs in S2E. ............. ...............15.....

2-3. S2E of D. Mel and D. Pse alignment using CLUSTALW. ................... ...............16

3-1. Simulating conserved regulatory modules in a pair of random sequences............._...27

3-2. Gaussian smoothing of the TFBS profile. ............. ...............29.....

3-3. Identifying a conserved module at the binding site level. ................... ...............3

3-4. Identifying the eve S2E module in distantly related species................... ...............35

3-5. Inter-TFBSs distances are very well conserved within each clause of the S2E
m odule. .............. ...............37....

3-6. Searching for conserved individual clauses / element modules using BLIS S...........3 9

3-7. BLISS output of the contributing TFBSs .............. ...............40....

4-1. The effects of length of analyzed sequences on the performance of BLISS2...........56

4-2. Regulatory complexes in S2E of D.mela and D.pseu............... ...............57.

4-3. Identification S2E in 6 kb D.viri sequence using the S2E of D.pseu. .................. .....58

4-4. Regulatory complexes of S2E in D.viri and D.moja .................... ...............5

5-1. Web interface of BLISS 2.0. This is the homepage of BLISS 2.0.. ..........................69

5-2. The color plot of BLISS _score .......___ ......... ___ .........__ ...... 7

5-3. Statistical analysis and distribution of BLIS S_score. ....._____ .... .. ...___...........71











5-4. Output of contributing TFBSs. ........... ...............72......

A-1. Parameter optimizations for Gaussian smoothing method used in BLISS.............. .76

C-1. Distribution of BLISS_score for the improved BLISS2 algorithm with the
M score cut off value of 0.75. ............. ...............8 1....

C-2. Distribution of BLISS_score for the improved BLISS2 algorithm with the
M score cut off value of 0.8. ............ .............8 1......
















Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy

A NOVEL METHODOLOGY FOR IDENITFYING CONSERVED REGULATORY
MODULES AT THE BINDING SITE LEVEL

By

Hailong Meng

August 2006

Chair: Arunava Banerjee
Cochair: Lei Zhou
Major Department: Computer Information and Science and Engineering

Regulatory modules are segments of the DNA that control particular aspects of

gene expression. Their identification is therefore of great importance to the Hield of

molecular genetics. Each module is composed of a distinct set of binding sites for

specific transcription factors. Since experimental identification of regulatory modules is

an arduous process, accurate computational techniques that supplement this process can

be very beneficial. Functional modules are under selective pressure to be evolutionarily

conserved. Most current approaches therefore attempt to detect conserved regulatory

modules through similarity comparisons at the DNA sequence level. However, some

regulatory modules, despite the conservation of their responsible binding sites, are

embedded in sequences that have little overall similarity.

We hypothesize that it will be more efficient and make more biological sense to

perform the cross-genome comparison at the binding site level since the evolutionary

pressure is on the binding site patterns that determine the gene regulation. In this study,









we developed a novel methodology that detects conserved regulatory modules via

comparisons at the binding site level. The methodology compares the binding site

profiles of orthologs and identifies those segments that have similar (not necessarily

identical) profiles. The similarity measure is based on transformed profiles, which takes

into consideration the p values of binding sites as well as the potential shift of binding

site positions. Furthermore, statistical analysis was performed to evaluate the accuracy of

the similarity measure. We used simulated sequence pairs for the optimization of the

methodology and the methodology was then verified on real word sequences.

Our research demonstrates that, for sequences with little overall similarity at the

DNA sequence level, it is possible to identify conserved regulatory modules based solely

on binding site profiles. This methodology has been implemented as a web-based tool,

BLISS, for the research community.















CHAPTER 1
INTRODUCTION

One of the central processes in the life of a cell is the expression of the information

encoded in its DNA. The initial stage of this expression is the synthesis of messenger

RNA (mRNA) from the DNA template, a process referred to as transcription (Figure 1-




Transcription Translation
DNA mRNA PROTEIN




Figure 1-1. Transcription and translation.

One of the maj or challenges in molecular genetics is to understand the precise

mechanism via which gene expression is regulated. The regulation of gene expression at

the level of transcription is a very complex process and is controlled by DNA binding

proteins called transcription factors (TFs). As shown in Figure 1-2, the initiation of

transcription begins with the binding of RNA polymerase II complex to the promoter, a

region right in front of the first exon of the gene. The level (i.e., rate) of transcription is

controlled by bindings between TFs and corresponding short DNA sequences,

Transcription Factor Binding Sites (TFB Ss). The size of a typical binding site ranges

between 5 and 25 base pairs (bps) (Boulikas 1994; Hernandez-Munain and Krangel

1994). It is possible that the binding of a single TF to TFBS can affect the rate of

transcription, however, in higher eukaryotes, it often takes multiple TFB Ss to group

together as a regulatory module to determine the precise temporal and spatial expression









of the genes. Despite the importance of regulatory modules, our abilities to identify and

characterize them are still very limited (Qiu 2003).











mRNA



m Exon Intron

Promoter region

t A Transcription Factor RNA Polymerase II



Figure 1-2. Transcription is controlled by transcription factors.

While laboratory identification of regulatory modules is feasible (Galas and

Schmitz 1978; Fried and Crothers 1981; Garner and Revzin 1981), the process is in

general expensive, time-consuming and arduous. Accurate computational approaches

therefore promise to be very beneficial as a supplement to the traditional laboratory

experiment since they can lead biologists to a more efficient determination of the

regulatory elements.

Functional regulatory modules are under selective pressure and tend to be

evolutionarily conserved. Most current approaches therefore attempt to detect conserved

regulatory modules through similarity comparisons at the DNA sequence level. However,

some regulatory modules, despite the conservation of their responsible binding sites, are

embedded in sequences that have little overall similarity. It therefore makes biological









sense to perform the cross-genome comparison at the binding site level since the

evolutionary pressure is directed at the binding site level.

The obj ectives of this thesis are

1. To test the feasibility of cross-genome comparison at the binding site level.

2. To develop a novel methodology that detects conserved regulatory modules
via comparisons at the binding site level.

3. To implement the methodology as a web-based tool that can be publicly
accessed by the research community.

The remaining chapters are organized as follows:

Chapter 2 conducts a literature review, in which background material and related

research in the field are presented.

In chapter 3,we demonstrate that it is feasible to perform the cross-species

comparison at the binding site level and that conserved regulatory modules can be

predicted in diverged sequences without detectable DNA sequence similarity. However,

this comes with the limitation that we know that a conserved module exists in the shorter

of the two sequences being compared.

In chapter 4, based on the research results in chapter 3, we extend the methodology

to a more general case, i.e. two sequences without the prior knowledge about the

locations of the regulatory modules.

In chapter 5, we focus on BLISS, the web implementation of this methodology.

Chapter 6 presents conclusions and suggestions for future research.















CHAPTER 2
LITERATURE REVIEW

Computational Representation of TFBSs

Most eukaryotic TFs can bind to multiple binding sites and tolerate considerable

sequence variation in their binding sites. The computational analysis of regulatory

regions in genome sequences is based on the prediction of potential TFBSs, which

depends on the quality of the models that represent the binding specificity of the

transcription factors. These models are based upon a statistical representation of

experimentally determined binding sites for a particular TF.

IUPAC consensus code was first used to represent this degenerate feature of TFBSs

by Day and McMorris (1992). They used a string such as RCCY to represent the

combination of string ACCT, ACCC, GCCT, and GCCT. IUPAC consensus characters

are listed in Table 2-1.

Table 2-1. IUPAC consensus characters.
IUPAC consensus character in the motif Matchin base in the DNA seuence
A A
T T
G G
C C
R A or G
Y T or C
W A or T
S G or C
M A or C
K G or T
B T or G or C
H Aor Tor C
V A or G or C
D A or G or T
N A or T or G or C










However, the disadvantage of IUPAC consensus sequence is that it does not take

into account the frequency at which each nucleotide occurs at each position of a TFB S

and therefore cannot represent a TFB S quantitatively. It disregards too much information

originally present in the set of TFBSs (Quandt et al. 1995).

Compared to IUPAC, matrix based representations (including frequency matrices

and Position Weight Matrices PWMs) retain more information and are quantitative

representations of TFBSs. In this case, the binding specificity of each position is

described by the elements of the matrix. Comparative analysis (Osada et al. 2004) has

shown that matrices are a much better way to represent TFB Ss. Each matrix is of size 4 X

k, where the rows correspond to each of the nucleotides and the columns represent the

base preference for each position of the binding site. For a frequency matrix, each column

gives the nucleotide distribution at that position. For instance, Figure 2-1 is the frequency

matrix of transcription factor "v-Myb" from TRANSFAC Professional r9. 1 (Matys et al.

2003), which was obtained from 24 experimentally determined binding sites. Each

element in the matrix represents how many times a specific nucleotide was observed at a

certain position. The last column in the matrix is the IUPAC consensus sequence for the

transcription factor.

Position A C G T
01 13 4 4 3 A
02 14 3 5 2 A
03 2 7 1 14 T
04 23 0 0 1 A
05 24 0 0 0 A
06 1 23 0 0 C
07 0 1 17 6 G
08 1 1 21 1 G
09 11 5 4 4 N
10 12 5 2 5 A



Figure 2-1. Frequency matrix of transcription factor "v-Myb" from TRANSFAC.









The PWM is derived from this frequency matrix and then the base preference at

each position is estimated based on the nucleotide distribution (Bucher 1990; Stormo

2000). The quality of the matrix is dependent upon experimentally determined target

sites. As of now, most matrices have been built based on binding sites through in-vitro

target-site detection assays. But, it was recently reported that in vivo functional TFBSs

can be detected by ChlP-Chip chromatinn immunoprecipitation coupled with microarray

chip hybridization) technique (Lee et al. 2002), which will greatly enhance both the

quality as well as the coverage of the noted matrices.

TFBSs Databases

TRANSFAC, TRANSCOMPEL, TRRD and JASPAR are databases that store

information about transcriptional regulation in eukaryotic cells.

TRANSFAC (Knuppel et al. 1994; Wingender et al. 1996; Wingender et al. 2000;

Matys et al. 2003) is the leading database that includes Transcription Factors, their target

genes and Transcription Factor Binding Sites, matrices of TFB Ss and other information

about transcriptional regulation. TRANSFAC represents the most comprehensive

collection of TFBSs and summarizes them as matrices. All information collected in

TRANSFAC is generated from published literature. Its content has been extending and

improving over the last ten years as shown in Table 2-2.

Table 2-2. Statistics of TRANSFAC.
TRANSFAC TRANSFAC TRANSFAC6.0 TRANSFACr9.1
1996 2000 2003 2005
Matrix 169 356 336 753
FACTOR 1544 2765 4219 6017
SITE 4304 8390 6627 15244
GENE -1302 1755 4649
REFERENCE 3130 6570 -11126











The newly released TRANFACT Professional r9. 1 has a total of 753 matrices for

bacteria, fungi, insects, nematodes, plants and vertebrates, as indicated in Table 2-3.

Table 2-3. Matrices in TRANSFAC Professional r9. 1.
Matrix Number Matrix Number Without
Duplication
Bacteria 1 1
Fungi 56 44
Insects 50 40
Nematodes 7 5
Plants 84 66
Vertebrates 555 379
Total 753 535

TRANSFAC contains redundant sets of matrices of diverse quality for some

TFB Ss. It is a commercial resource and only an older version of TRANSFAC can be

accessed by the public for free.

JASPAR (Sandelin et al. 2004) is an open-access database and its data can be

unrestrictedly accessed for free. All matrices in JASPAR were derived from published

collections of experimentally determined TFB Ss. JASPAR claims that it has a non-

redundant collection of reliable matrices.

TRANSCOMPEL (Kel-Margoulis et al. 2000; Kel-Margoulis et al. 2002) is a sister

database of TRANSFAC and includes information about composite regulatory elements.

Each composite regulatory element contains two closely situated binding sites of distinct

transcription factors. There are two types of composite regulatory elements: synergistic

and antagonistic. A synergistic composite regulatory element results in transcriptional

activation, while in an antagonistic composite regulatory element, the two factors

interfere with each other and repress transcription. However, the effectiveness of

TRANSCompel is limited by it small size.










Transcription Regulatory Regions Database (TRRD) (Kel et al. 1997; Heinemeyer

et al. 1998; Kolchanov et al. 1999; Kolchanov et al. 2000) describes the structure of

eukaryotic transcription regulatory regions. Each TRRD entry corresponds to an entire

gene and binding sites are listed at the lowest level of the hierarchy. However, this

database is seldom referred to by current publications.

Computational Identification and Statistical Evaluation of TFBSs

Computational Identification of TFBSs

Computational identification of TFB Ss is very important to decode the regulation

network of genes. Although some earlier methods of predicting TFBSs were based on

consensus sequences (Kondrakhin et al. 1995; Prestridge 1995), most recently developed

methods are based on matrices (Frech et al. 1997).

Match is a weight-based tool to search for potential transcription factor binding

sites in DNA sequences. Match is associated with TRANSFAC and it uses the matrix

library collected in TRANSFAC to search for TFBSs. Because TFBSs may be detectable

on either the forward or the backward strand, Match searches both strands of input DNA

sequence for TFBSs. The algorithm uses two scores: the matrix similarity score (MSS)

and the core similarity score (CSS). Both scores measure the similarity between the

segments of the sequences under consideration and the TFBS matrices. MSS is calculated

using all positions of the matrix while CSS is calculated using the five core (most

conserved) positions in the matrices only. Both MSS and CSS range from 0 to 1 where 1

means an exact match. MSS is calculated as follows (Kel et al. 2003):

MS=Current M~in
MaxS Mi










c*r~rent = Il(i)fi,b,



Min = f I(i) fmm






I(i) = Cff In(4J;,) i1 ,..
Bt{A
where /;,b, iS the frequency of nucleotide B occurring at position i of the matrix

(B{AT,,G).f mm" is the freque~ncy of~~ thepn~ nuloide that is the lowetinpositi no;~n i


and J max is the frequency of the nucleotide that is the highest in position i of the matrix.

The information vector I(i) reflects the extent of the conservation of i position in the

matrix. The more conserved the position i is, the greater is the information vector value

of that position. In this manner, for a highly conserved position, a mismatch will receive a

large penalty while a match will score high. On the other hand, less conserved positions

contribute less than conserved positions whether they be a match or a mismatch. This

leads to better recognition of TFBSs compared to methods that do not use an information

vector.

Stormo (1998, 2000) and Stormo and Fields (1998) pointed out that logarithms of

the base frequencies should be proportional to the binding energy and the information

vector is therefore related to the average binding energy between transcription factors and

binding sites. Hence, Match score is not just a statistical score, but is in direct relation to

the protein-DNA binding energy.









Match requires users to choose a score threshold, which is based on a balance

between specificity and sensitivity. When a high threshold is used, noise is reduced from

the output and the false positives decrease. However, at the same time, some functional

binding sites with low scores are ignored. Increased specificity causes decreased

sensitivity. On the other hand, when a low threshold is used, both false matches and true

binding sites are reported. Increased sensitivity causes decreased specificity. Hence, an

ideal threshold should restrict false positives without losing too many true positives.

One assumption of methods based on matrices is that positions in the binding site

contribute additively and independently to the total activity. However, recent biological

experiments suggest that dependence exists among positions in TFBSs (Benos et al.

2002; Bulyk et al. 2002). More complicated models (Stormo et al. 1986; Lawrence and

Reilly 1990; Lawrence et al. 1993; Zhang and Marr 1993; Zhou and Liu 2004;

Gershenzon et al. 2005) are being considered. For instance, di-nucleotides are chosen as

the elements of the matrix instead of single nucleotides. But, the limitation of these

methods is that they need more prior information and more training binding sites, which

are currently not easy to obtain. So, these approaches are rarely used.

Other available software to identify TFBSs include MatInspector (Quandt et al.

1995), ConsInspector (Frech et al. 1993), FUNSITE (Kel et al. 1995), TFBIND (Tsunoda

and Takagi 1999), SIGNAL SCAN (Prestridge 1996), MATRIX SEARCH (Chen et al.

1995), SeqVISTA (Hu et al. 2003; Hu et al. 2004), RSAT (van Helden 2003), BEST (Che

et al. 2005) and P-Match (Chekmenev et al. 2005). The primary problem with existing

methods for predicting TFB Ss is that there tend to generate unacceptably large number of

false positives and only a very small fraction of the predicted TFBSs are functional ones.









Because of the large number of false positives, these methods are of little practical use for

predicting TFB Ss in vivo.

Statistical Evaluation of TFBSs

A fundamental challenge for TFB Ss-searching methods is how to reduce false

positives and improve prediction accuracy. One of the reasons for false positives is that

short sequences (when trying to identify short TFB Ss) occur a lot by chance. Estimating

the statistical significance of an identified match is valuable and may increase the

performance of TFBSs-searching methods. To calculate statistical significance, a

background model is necessary. Based on the background model, we will see how likely

it is to see a score that is at least as good by chance. Formally, the p-value of a score is

defined as the probability of obtaining such a score or higher from the background model.

A simple way of estimating the p-value of a score is via native approximation. A

large pool of subsequences is sampled from the background distribution, and the score of

each subsequence is calculated based on the matrix. The p value is estimated by

calculating the fraction of samples that have scores as high as that score. The primary

drawback of native approximation is that it needs huge number of samples to accurately

estimate the p value. Normally, the number of samples should be two orders of

magnitude larger than the inverse of the actual p value. If there are not enough samples,

small p values will be poorly estimated. Available samples and time complexity are

maj or drawbacks of the native approximation method.

Independent and identically distributed (IID) background sequence model (Staden

1989) can be used to evaluate p value. The position p value of MAST (Motif Alignment

and Search Tool) (Bailey and Gribskov 1998; Bailey and Gribskov 2000) is defined as

the probability of observing such a match score or higher at a single, randomly chosen









position in a random sequence, and it is obtained by calculating the cumulative density

function based on the null hypothesis that each position in the sequence is independent

and identically distributed (IID).

It has been reported that neighboring nucleotide compositions can affect the

binding between a transcription factor and its binding site. Local Markov Method (LMM)

(Huang et al. 2004) incorporated the local genomic context into the p-value-based scoring

method and modeled the background sequences as a Markov chain. The p value for a

candidate site in this method reflected simultaneously both the similarity to its matrix and

its difference from the local genomic context. LMM claimed that, compared to other

current methods, it could reduce false positive errors by more than 50% without

compromising sensitivity. Similarly, Thij s et al. (Thij s et al. 2001) use a higher-order

background model to enhance the performance of their motif-finding algorithm.

Although statistical analysis has been integrated into predictions of TFBSs, large

amounts of false positives still exist and accurate identification of TFBSs is hard to

achieve.

Cis-Regulatory Modules (CRMs) and Computational Prediction of CRMs

Cis-Regulatory Modules (CRMs)

Genes are rarely regulated by a single TF. Abundant evidence shows that TFBSs

occur in clusters rather than in isolation (Maniatis et al. 1987; Johnson and McKnight

1989; Shaw 1992; Wang et al. 1993; Tjian and Maniatis 1994). Each gene has a unique

set of TFs that are responsible for its spatial and temporal expression. Transcription is

initialized by the binding of RNA Polymerase II to the promoter region, but the

expression level of the gene is controlled by the specific set of TFs. The binding of a TF









to its TFBS may enhance or repress the binding of other TFs to their TFBSs (Weintraub

et al. 1990; Fickett 1996; Muhlethaler-Mottet et al. 1998).

In higher eukaryotes, the regulatory information is organized into modular units

which contain multiple TFBSs over a few hundred base pairs of DNA sequence. Such a

functional unit is often referred to as Cis-Regulatory Modules (CRMs, we also use

regulatory modules to refer to CRMs alternately). Various CRMs work together to offer

the combinatorial regulation of gene transcription in response to different conditions

(Yuh et al. 1998). Genes are typically controlled by several CRMs, which are generally

located within a few hundred base pairs from the promoter region, but may occur

thousands or even tens of thousands of base pairs either downstream or upstream from

the transcription start site.

TFB Ss in a CRM are co-localized with specific distances apart on the genome. The

distance between the binding sites within the CRM may overlap (Lewis et al. 1994), lie

adj acent to each other (Whitley et al. 1994) or may be dozens of base pairs apart

(Lemaigree et al. 1990). Considering the complexity of the identification of CRMs, some

groups (Hannenhalli and Levy 2002) have started to work on TF pairs called composite

regulatory elements. Kel et al. (Kel et al. 1995) investigated 101 composite regulatory

elements and found that the distance between these binding sites vary between complete

overlap and 80 bps. The mean distance was found to be around 17 bps. Qiu et al. (Qiu et

al. 2002) analyzed all the entries of composite elements in the TransCompel database

(version 3.0) and they found that approximately 87% of the composite elements were

within 50 bps distance from each other and about 65% were within 20 bps distance.









Transcriptional regulation is mediated by CRMs involving both protein-DNA and

protein-protein interaction. The immense false positive of current TFB S-prediction

methods is most likely a result of attempting to predict TFB Ss based only on protein-

DNA interactions between TFs and TFBSs and ignoring their context, interactions

between/among TFs, etc.

Computational Identification of CRMs

It has been reported that high local density of TFBSs account for the specificity of

their activities and is required for functional CRMs, which has been used as the basis for

identifying novel CRMs (Frith et al. 2001; Berman et al. 2002; Halfon and Michelson

2002; Rajewsky et al. 2002; Rebeiz et al. 2002; Johansson et al. 2003; Berman et al.

2004). These methods define a cluster as a certain window size DNA sequence with

certain number of known TFBSs, and then search for such clusters in the given DNA

sequence to predict regulatory regions. These methods, however, need prior knowledge

of TFB Ss, which is not available in general.

With the development of microarray technology, genes can be classified into

different clusters based on their expression pattern. Genes in the same cluster are

assumed to be co-regulated. Co-regulated genes are hypothesized to share common Cis-

Regulatory Modules (CRMs), since they share the same regulatory response. It has been

observed that functional TFBSs often appear several times in the regulatory region so that

occurrences of relevant TFBSs are significantly more than expected by chance. Hence,

methods based on co-regulated genes comparison are to search for over-represented

motifs in the collection of regulatory regions (Bailey and Elkan 1995; Workman and

Stormo 2000; Liu et al. 2001; Thij s et al. 2002; Elkon et al. 2003; Zheng et al. 2003; Cora

et al. 2004; Grad et al. 2004). High frequency of false positives is also a maj or problem





" .. ..


with these methods. Furthermore, these methods require a reliable set of co-regulated

genes. But, it is also possible that the expression of one gene is triggered by the

expression of another gene in the same cluster and they are not really co-regulated genes.

Some groups have tried to conquer this problem by annotating genes to specific Gene

Ontology (GO) terms (Cora et al. 2004).

Cross-species Conservation of Cis-Regulatory Modules (CRMs)

Functional sequences including coding and functional noncoding sequences tend to

evolve at a slower rate than non-functional sequences. Noncoding sequences that play an

important role in regulation are assumed to survive natural selection for longer periods of

time. It has been shown that CRMs are conserved across species (Frazer et al. 2003;

Ludwig et al. 2005).


me~lc


yak



ere


pse we i' I


aKr Hb o Bed ag Gt 4 Slp


Figure 2-2. The known TFBSs in S2E: bicoid (circle), hunchback (ovals), kruppel
(squares), giant (rectangles), and sloppy-paired (triangle). Symbols
representing sites 100% conserved compared to D. mel are green, while
those diverged are shaded orange.








16



Early Drosophila embryo is a paradigmatic model for studying transcriptional


control of development (Arnosti 2002). Exhaustive genetic analysis has been applied to


this model. Even-skipped (eve), an important developmental gene in Drosophila,


produces seven transverse stripes in the blastoderm embryo. The stripe 2 enhancer (S2E)


is the best studied one and includes multiple TFBSs of five TFs, the activators bicoid


(bcd) and hunchback (hb), and the repressors giant (gt), Kruppel (kr), and sloppy-paired


(slp) (Small et al. 1992; Arnosti et al. 1996; Andrioli et al. 2002). As shown in Figure 2-


2, experimental investigations suggest that S2E is evolutionarily conserved among D.


melanogaster, D. yakuba, D. erecta and D. pseudoobscura. D. yakuba and D. erecta are


separated by approximately 10-12 million years from D. melanogaster, and D.


pseudoobscura split from D. melanogaster about 40-60 million years ago (Ludwig et al.


1998; Ludwig 2002; Ludwig et al. 2005). There is no detectable difference in the spatial


or temporal control of gene expression among these four species.

Mel AATATAACCCAATAATTTGAAGTAACT------------ GGCAGGAGCGAGGTAT--- -- 43
PSE AATATAACCCAATAATTTTAACTAACTCGCAATGGACAGGCTAGATGGAG 60
****************** ** ***** ***** www **

Mel --CCTTCCTGGT---------TACCCGGTACTG-----CATAAGA--ACG 82
PSE AGCATTGCAGGAAGGATGCAATACTCGGGAATGGAATGCATAAGGCGACG 120
**ww *w www *** ***wwww *w ********

Mel AA--CCGTAAC- TGGGACAGA----TCGA- ------- -AAAGCTGGCCTGGTTTCTCGC 125
PSE GGTTCCGTTTCGCGAGATAAGGTTCTTTAGTCTTGACGGTTCCTTGACGTC 180
**** www www ** *w ***

Mel TGTGTGTGCC------- GTGTTAATCCGTTTGCCATCAGCGAGATTATTAGTCAATTGC-177
PSE TGTGTGCTCTCTGCTCTGTGTTAATCCGTTTGCCATCAGCATATGCATTC 240
****** *********************** ****************

Mel -------------AGTTGCAGC----GTTTCGCTTTCGTCC---GTCAT- 213
PSE ATATTTCCAGTCGAGTCGCAGTTTTGGTTTCACTTTCCTCCTGATCTCTG 300
*** **** ***** ***** **** * **

Mel -------------------------------------------------CG 216
PSE CCTCATGTGGATGCCGATGCCGATCGGCTTGCCGTTGCCGTTGCCGACCG 360


Mel GTTAGACTTTATTGCAGCATCTTGAACAATCGTC -GCAGTTTGGTAACACGCTG ------ 269
PSE GTTAGATTTTATTGCAGCATCTTGAACAATCAACTGGAATTGACTCGGGC 420


Mel -------------- TGCCATACTTTCATTTAGACGGAATCGAGGGACCCTGGACTAT 315
PSE CTAACCCTGGAGATTGCTCTACTTTCGCCTCAATTGAATCGTAGGAGCGC 480

Figure 2-3. S2E of D. Mel and D. Pse alignment using CLUSTALW.








17



Mel CGCACAACGAGACC-- GGGTTGC---- GAAGTCAGGGCATTCCGCCGATCTAGCCATCG 368
PSE GGACCCTTGCGACCAAGGGTTGTCTCCTGGCCTCAGGAGTTCAGCAGTTG 540

Mel CCATCTTCTGCGGGCGTTTGTTTGTTTGTTTGCTGGGATTAC-GGTGCTG 427
PSE CTGGTTTGTTTATTTGTTTGTTTGTTT- -TAGCCAGGATTAGCCCGAGGGCTTGACTTGG 598

Mel AATCCAATC---------- ------------------- ------ CTG-ATCCCTAGCCCG 451
PSE AACCCGACCAAAGCCAAGGGCTTTAGGGCATGCTCAAGAGACTTTCTTCT 658

Mel ATCCCAATCCCAATCCCAATCCC- -TTGTCCTTTTCATTAGAAAGTCATAAAA-ACACAT 508
PSE GTCGCGATCCCTAAACCGATCCCATTTGGCAATTTCATTAGATCAACCCT 718

Mel AATAATGA-- TGTCGAAGGGATTAGG--- -- GGCGCGCAGGTCCAGGCAACGCAAT ---T 558
PSE AATAATGAGATGTCGAAGGGATTAAGATTAAGACAAAAGGCAGCAGACAT 778

Mel AACGGACTAGCGAACTGGGTTATTTTTTTGCGCCGA- ---------- CTTAGCCCTGATC 607
PSE AACGGACTAACGGAATCGGGTTATTTTTTGCGCTCAACCGGCACTAGTAC 838

Mel CGCGAGCTTAACCCGTTTTGAGCCGGGCAGCAGGTAG-TGGTGACAGCTT 666
PSE CACAAGCTTAACCCCTTGTGGGCCG- -CGGCAGGTAAATCGATGACCGATGTCGCTTGCG 896

Mel TTTTGGCCGAACCTCCAATCTAACTTGCGCAAGTGGCAAGTGCTTCGCCA 726
PSE CAAGGCCCCTACTACTCCCCTCCCCTCC- CATATGACAACCCACTAACC- -CTGCCCCCG 953

Mel AAGAGGAGGCACTATCCCGGTCCTG- --GTACAGTTGG- TACGCTGGGAATGATTATATC 782
PSE CCCTCTC-- CACCACCACTGTACTGTATGTACAGTTGCCTCCTCTCTGGATATTT 1011

Mel ATCATAATAAATGTTT 798
PSE ATCATAATAAATGTTT 1027


Figure 2-3. Continued


Although similar at the TFBS level and functionally conserved, as shown in figure


2-3, S2E in D. Mel and D. Pse are substantially diverged at the sequence level. One can


observe large insertion or deletion in spacers between TFBSs, nucleotide substitutions in


TFBSs, variation of length of enhancers, etc. Only 3 of 17 known TFBSs in D.


melanogaster are completely conserved among all four species.


Computational Prediction of Cis-Regulatory Modules (CRMs) by Cross-species
Genome Comparison


In contrast, cross-species genome comparison (Phylogenetic Footprinting) is more


reliable and is capable of identifying CRMs from a single gene of different species, as


long as they are conserved across species. Cross-species genome comparison is based on


the assumption that functional sequences, like regulatory regions, are more conserved









evolutionarily than nonfunctional regions, due to selective evolutionary pressure. Like a

filter, cross-species genome comparison eliminates non-conserved regulatory regions and

thereby increases the selectivity and specificity of CRMs prediction.

Evolutionarily conserved noncoding regions across species are a potentially rich

source for the identification of regulatory information of genes and can be found by

sequence alignment tools such as BLAST (Altschul et al. 1990), PIPMaker (Schwartz et

al. 2000), MultiPipMaker (Schwartz et al. 2003), CLUSTAL W (Thompson et al. 1994),

AVID (Bray et al. 2003) and VISTA (Mayor et al. 2000). A large number of methods

have been developed recently to predict CRMs based on cross-species genome

comparison. With complementary data provided by sequence comparison, these methods

have improved the prediction of TFB Ss.

ConSite (Lenhard et al. 2003; Sandelin et al. 2004) is a web-based tool for the

detection of transcription factor binding sites, which integrates both cross-species

comparison and binding site prediction generated with Matrices. ConSite reports TFBSs

that are in conserved regions and at the same time located as pairs of sites in equivalent

positions in alignments between two orthologous sequences. By incorporating

evolutionary constraints, selectivity is increased compared to approaches purely based on

profile search of single genomic sequence. ConSite claims that it reduces the noise level

by ~85% while retaining high sensitivity compared to single sequence analysis.

Similar to ConSite, r~rista (Loots et al. 2002; Loots and Oveharenko 2004)

combines sequence comparison and TFBS profile based prediction, and attempts to

identify aligned TFBSs from noncoding regions that are evolutionarily conserved. r~rista

considers noncoding regions with low DNA similarity as least likely to be biologically









relevant and it discards these regions for TFBS searches. r~rista claims to take an

effective way to reduce false positives.

TraFAC (Jegga et al. 2002) is a web-based application to detect conserved TFB Ss

from a pair of DNA sequences. It defines hits as the density of shared TFBSs within a

200-bp window in evolutionarily conserved non-coding regions. The hit count is plotted

as a function of position.

PhyME (Sinha et al. 2004) is an algorithm based on cross-species comparison too.

Furthermore, it integrates another important feature of a TFB S, overrepresentation, to the

algorithm. Overrepresentation depends on the number of occurrences of the TFBS in

each species. By searching conserved and statistically over-represented TFBSs, PhyME

claims to increase both sensitivity and specificity.

TOUCAN (Aerts et al. 2003; Aerts et al. 2005) is a web-based Java application for

the prediction of cis-regulatory elements from sets of orthologs or co-regulated genes.

Orthologs are aligned and conserved regions are selected as candidate search regions.

Toucan has two TFBS prediction algorithms. One is MotifScanner, which is used to

search known TFBSs based on their matrices. Instead of using a predefined threshold that

is independent of the sequence context, MotifScanner uses a probabilistic model to

estimate the significance of TFB Ss. The other is MotifSampler, which is used to search

for new TFBSs based on Gibbs sampling.

Another algorithm (Cora et al. 2004) tries to predict over-represented TFBSs from

sets of orthologs. It focuses on short conserved regulatory regions in mice and humans so

that a suitable signal to noise ratio can be achieved. Statistically over-represented TFBSs

in those regions are searched and their annotations to Gene Ontology (GO) terms are










analyzed. They conj ecture that functional TFBSs will be ones that are statistically over-

represented and share the same specific Gene Ontology terms.

CREME (Sharan et al. 2003; Sharan et al. 2004) is another web-based tool for

identifying and visualizing CRMs from a set of potentially co-regulated genes. It uses

r~rista 2.0 to find TFBSs that are conserved between genomes. Subsequently, it

enumerates all combinations of these TFBSs that occur within a user-defined window.

These combinations are statistically evaluated and significant combinations are reported.

Compared to other methods, CREME considers the combination of TFBSs and also

evaluates the statistical significance of the predicted CRMs.

Venkatesh and Yap (Venkatesh and Yap 2005) searched for cis-regulatory elements

from non-coding sequences conserved between fugu and mammals and they believe that

the cis-regulatory elements they found are likely to be shared by all vertebrates

considering the relatively long evolutionary distance between fugu and mammals.

In general, methods based on cross-species genome comparison have improved

prediction of TFBSs by searching TFB Ss only in sequence-conserved regions across

species. However, such methods fail to detect CRMs that have retained their functions

but have evolved in their sequence. The performances of these methods depend on the

evolutionary distances of species, alignment algorithms, and profile models of TFBSs

These days, more and more evidence (Thanos and Maniatis 1995; Erives and

Levine 2004; Senger et al. 2004) have shown that CRMs might be highly structured and

depend on a number of organizational constraints, such as distance between TFBSs,

orders and orientations of TFB Ss, which may be used as further constraints to find

functional CRMs from a noisy background.















CHAPTER 3
THE FEASIBLITY STUDY OF CROSS-GENOME COMPARISON AT THE
BINDING-SITE LEVEL

Background

Transcription is the fundamental biological process in which a selected region of

DNA is transcribed into RNA by a molecular machinery, a core component of which is

RNA polymerase. For most protein-coding genes, transcription is the intermediate step

via which the information coded in their DNA is "expressed" into functioning proteins.

For others, such as RNA genes, the product of the transcription itself may have biological

function. Even though each cell has the complete set of genes in its chromosomal DNA,

only a portion of the genes are transcribed (expressed) in any particular cell depending on

tissue/cell type, developmental stage, etc (Zhang et al 1997). The transcriptome, that is

all of the genes that are selectively transcribed in a cell, determines the function and

morphology of the cell. In addition, the level (i.e., rate) of transcription is often regulated

in response to intra-cellular and extra-cellular environmental factors to achieve cellular

homeostasis. Normal transcriptional regulation, i.e., the right genes being transcribed at

the right times, in the right cell, and at the appropriate rates, is central to almost all

physiological processes. Abnormal regulation of transcription often results in disruption

of development and/or pathological states. For example, ectopic (i.e., abnormally high)

expression of oncogenes leads to hyperplasia and cancer.

A basic element of transcriptional regulation is the interaction of transcription

factors (trans factors) and their corresponding transcription factor binding sites (TFBSs,









also referred to as cis factors) on the DNA. Transcriptional regulation of a gene (e.g.

restricted transcription in a particular cell type, or elevated transcription, in response to

UV light) is often mediated through the functional/physical interactions among multiple

transcription factors, each recruited to the proximity of the DNA in part by their selective

affinity to their corresponding binding sites. For example, the even-skipped(eve) gene is

transiently expressed in 7 alternative stripes on the longitudinal axis in the developing

Drosophila melan2oga~ster embryo at the blastoderm stage. Each of the seven stripes is

regulated by a distinct set of transcription factors binding to their corresponding binding

sites located in a DNA segment flanking the even-skipped gene. The most well

investigated of these segments is the stripe 2 regulatory region, which has identified

binding sites for 5 different transcription factors in a 700bp (base pair)-1kb (kilo base

pair) DNA region in front of the transcription initiation site of the eve gene. Evolutionary

comparison of this transcription regulatory module in different Drosophila species has

revealed that most of the binding sites are highly conserved and functional, even though

the underlying DNA sequence has undergone considerable change (Ludwig et al.1998).

A useful analogy to understanding the composition of DNA regulatory modules is

to consider DNA as a sequence of "Letters" and individual binding sites as "Words".

Then, a functional module of closely associated binding sites can be conceived of as the

"Sentence" command embedded in the DNA sequence that guides transcription. The

problems associated with identifying the "Sentence" commands in a DNA sequence are

two fold. First, the binding sites are degenerate in nature, that is, the same "Word" may

have different letters in certain positions. Second, the "Words" are themselves

interspersed by varying lengths of meaningless "Letters".









One approach to identifying DNA regulatory modules is through cross-genome

comparison. The assumption underlying this approach is that DNA sequences encoding

functional TFB Ss are under selective pressure to be conserved during evolution, whereas

non-functional DNA mutate/change more rapidly. Thus, if DNA sequences flanking

orthologs in two related species were to be compared for sequence-level similarity, DNA

regulatory modules would appear as conserved "islands" in a sea of otherwise not-

conserved DNA sequences. Approaches in this category include rVista2.0, ConSite,

PhyME, TOUCAN, CREME, TraFAC, etc (Sharan et al. 2003; Sandelin et al. 2004;

Loots, G.G. and I. Oveharenko 2004; Sinha et al. 2004; Aerts et al 2005; Schwartz et al.

2003; Cora et al. 2004; Venkatesh and Yap 2005). For instances, based on the sequence

level conservation between human and mouse, Cora et al. predicted functional TFBSs

that are statistically over-represented and share the same specific Gene Ontology (GO)

terms (Cora et al. 2004). This kind of cross-genome comparison approaches has

successfully led to the discovery of regulatory modules that were subsequently verified

by functional characterization (Santini et al. 2003).

The disadvantage of the sequence-based approach is that it is dependent on the

overall conservation of the DNA region harboring the regulatory module. As we

mentioned earlier, TFB S sequences are degenerate in nature and are interspersed by non-

functional sequences which are not under conservation pressure. Depending on the ratio

of functional versus non-functional base-pairs in the DNA region, it is entirely possible

that the overall sequence level conservation of the region be undetectable from the

background level, while the actual TFB Ss making up the functional module still be

conserved. In other words, it is possible that despite the conservation of the Sentence"









command at the binding site level, the overall conservation of the DNA backbone at the

sequence level be minimal or non-detectable. This situation is aggravated if the

participating binding sites are highly degenerate (i.e., tolerate many variations at the

sequence level) and the spaces between the binding sites are long. In fact, it has been

observed by researchers in many instances that while the regulatory region has no

detectable overall similarity, the transcription regulatory function is preserved (Ludwig et

al. 2005). Sequence-based approaches, or approaches requiring filtration of sequences

based on DNA level similarity, are not helpful for identifying the responsible TFB Ss in

this scenario.

Since the conservation pressure is at the binding site level, i.e., the sequence must

be able to maintain binding affinity to the transcription factorss, it makes biological

sense to perform comparisons at the binding site level rather than at the sequence level.

This, however, is currently hindered by two factors. The first limitation is the

effectiveness of identifying transcriptional binding sites in a given DNA sequence. The

set of TFBSs for a given TF can be quantitatively represented using a frequency matrix

that describes the binding specifieity of the TF at each of its positions. The quality of the

matrix used to identify potential TFBSs is determined by the number and quality of

known binding site sequences used to construct the matrix. As a result of the

development of new technologies such as Chip (Yan et al. 2004) and ChlP-chip (Lee et

al. 2002), it is anticipated that binding site instances will be identified at a unprecedented

rate which will undoubtedly greatly enhance both the quality as well as the coverage of

binding site matrices in the near future (Matys et al. 2003).









The second limitation is that we currently do not understand the grammar

governing how binding sites (Words) make up the regulatory modules (Sentences).

Based on our understanding of transcriptional regulation, such a grammar should have at

least three components: (1) the composition of the binding sites, (2) the sequence of the

binding sites, and (3) the spaces between/among the binding sites. Currently, the number

of regulatory modules that have been thoroughly characterized is far fewer than what is

required to decode this grammar.

A maj or obstacle for biologists working on transcriptional regulation is to locate

and identify potential TFBSs responsible for a particular regulatory module, especially in

sequences that do not have significant conserved islands. In this paper, we describe a

novel methodology for binding site level identification of conserved regulatory modules

in such sequences.

Results

Simulating Sequence Pairs Harboring a Conserved Module of Binding Sites

Since the number of well-studied regulatory modules is currently rather sparse, we

chose to simulate sequence sets (pairs) representing the domain of our interest, i.e.,

conserved binding site patterns in a pair of sequences which nonetheless have little or no

similarity at the sequence level. In many cases, experimental investigation in a model

organism has narrowed down the location of the regulatory module(s) for a particular

gene to a relatively short region (e.g. within 1 kb), whereas for the ortholog in a less-

studied organism (reference organism), information about the localization of the module

is absent (except that it is generally in the proximity of the gene). In view of this, in the

first (current) stage of the development of BLISS, we considered the identification of a

conserved module present in both a short sequence of about 100-500 bps, representing










the model organism, and a longer sequence (5-6 kb), representing the reference organism.

Although this simplification limits the applicability of the current methodology, it does

highlight the promise of our approach.

For each sequence pair, the backbones for both the short sequence and the long

sequence were generated randomly and thus had no sequence similarity. A hypothetical

module involving binding sites for 4-8 distinct transcription factors was first introduced

into the short sequence. The binding site sequences were randomly selected from the

instances recorded in the TRANSFAC 9.1 database (Matys et al. 2003). The rules

governing the formation of the hypothetical module were as follows:

1. A module contains binding sites for 4-8 distinct transcription factors.

2. For each transcription factor, there may be more than one binding site in the
module.

3. The distance between consecutive binding sites is "d,", where in 65% of the cases
d, lies within 5-20 bps, in 22% of the cases d, lies within 21-50 bps and in the
remaining 13% of the cases d, lies within 51-60 bps (Figure 3-1).

The range of values for d, was based on background knowledge as well as a

statistical analysis of the distances between pairs of TFBSs in TransCompel database by

Qiu et al. which revealed the above distribution of distances between pairs of TFBSs (Qiu

et al. 2002).

The hypothetical module was first simulated according to the above rules in the

short sequence. Subsequently, a "conserved" module was formulated and inserted into

the longer sequence at a random location. The rules governing the formation of the

"conserved" module were as follows:





Seq. #1


Conserved module


Seq. #2


Figure 3-1.


Simulating conserved regulatory modules in a pair of random sequences.
Each pair has a short (100-500 bp) and a long (5-6 kb) sequence. For both
sequences the backbones were generated randomly. The hypothetical
module was formulated according to rules described in the text and inserted
into the short sequence. A "conserved" module with binding sites
corresponding to the same transcription factor, but with different sequences
was inserted into the long sequence. The distances between consecutive
binding sites were also different between the hypothetical module and the
conserved module.


1. It is comprised of TFBSs that correspond to the same transcription factors as
present in the hypothetical module.
2. The sequence for each TFB S is randomly chosen from the recorded instances in
TRANSFAC 9.1, with the caveat that it cannot be the same instances) that was
(were) used to construct the hypothetical module in the short sequence.
3. The respective order of TFBSs is the same as in the hypothetical module in the
short sequence.


Simulation of conserved regulatory

modules in pairs of DNA sequences



Hypothetical module


Transcription Factor
binding sites









4. The distance between consecutive binding sites in the conserved module is d,; d, is
a function of 4 in that d, lies in the range (d, -Ad, d, Add) (Figure 3-1).

Ad is the "perturbation factor"--the variation of distance between corresponding

binding sites in the hypothetical module and the conserved module. In this study, we

used Ad=4 (See Discussion). A total of 10,000 pairs of sequences were generated

according to above rules, and were used to test and evaluate various algorithms.

Identifying a Conserved Module by Comparison at the Binding Site Level

As stated above, the obj ective of our methodology is to identify conserved

regulatory modules within highly divergent sequences. The sequence pairs in our

simulated data-set had little overall sequence similarity. Of the 10,000 pairs, 73.32%

have no similarity detectable by BLAST analysis (E=0.01, Blast2seq). This indicated

that the conservation of binding sites in the hypothetical module and the conserved

module was not sufficient to allow detection at the sequence level. Of those that did have

a significant match, the output alignments were shorter than 30 bps, which was far shorter

than the length of the inserted module.

M_score. To identify the conserved module at the binding site level, we first generated

the potential TFBS profies for each of the simulated sequence pairs. A matrix scoring

method similar to that used in Match (Kel et al. 2003) was implemented (see Methods),

which allowed us to score each sequence against the frequency TFB S matrices recorded

in TRANSFAC 9. 1 (M~score, Figure 3-2a ). When a cut-off value of 0.75 (onM2~score)

was applied, on average there were about 3000 identified potential TFBSs in every

thousand base pairs of simulated sequences. This is similar to what we observed when

using genomic DNA sequences randomly extracted from model organism databases (data

not shown).












(a) M-scores before cutting off (b) M-scores affer cutting off


2 .: : : : 2 -
1- ~1











32 '' C~~nJ 00 32 200
1 C 100 1 100
TFBSs OO0 Locations TFBSs OO0 Locations



scors (Mscore) o he Fs ln hr DNA seune.b)Mscores
profileafter pplyin a cutoff of0.75. .) Proi ofPsor fe
inoroatn th au fbnigst ace.d)Poieo h 2cr
afe Gassa smohig The colors rersn the ifrn F :Rd
En; Gre,: Croc Ble Lun-1
To idniytehpteia oueebdddi h eunew re eea
difren aloihsta opae h idn it rflso h sotadteln










forallTFBS mtics Thu it doeas ntdfentiateshr an reaivl Lctosipemtcs









that match DNA sequences with a high frequency from those long and stringent matrices

that match DNA sequences only rarely. For example, the binding site for En (I$ED_06)

has 7 positions, and on average there are 320 matches (with M~score>0.75) on any

10,000 bp random sequence. In contrast, the binding site for Bel-1 (V$BEL1 B) has 13

positions, and the average number of matches with M~score>0.75 in a 10,000 bp random

sequence is 3. It is clear that a match involving the binding site V$BEL1_B is far more

significant than a match with I$ED_06. To differentiate this, we introduced the p value

of the M~score, which was estimated by calculating the fraction of randomly generated

sequences that have scores equal to or higher than that M~score. We then calculated the

P~score (see Methods) as the product of -log (p value ofM~score>cut~off) and the

M~score (Figure 3-2, b).

G_score. To account for the variation in the distances between/among binding sites, we

performed a Gaussian smoothing of the P~score (see Methods). Through empirical

testing (data not shown), we found that a variance of G2=9 gave the best performance.

We denote the Gaussian smoothed score profile as the G~score profile of the sequence

(Figure 3-2 d).

BLISS_score. G~score profiles were generated for both the short and the long sequences.

To identify a maximum match at the binding site profile level, the short G~score profile

was slid along the long G score profile. At each position, the match between the short

profile and its corresponding region of equal length (length of the window) in the long

profile was evaluated using an inner-product as the BLISS~score (see Methods).

Note that since the "length of the window" appears in the denominator, the

BLISS~score is independent of the length of the short profile (or the length of the









window). Figure 3-3a shows the distribution of BLISS~scores as the shorter G~score

profile was slid along the longer G~score profile. The peak of the BLISS~score indicates

the maximum match. In this case, the abrupt surge in the BLISS~score is due to the

match between the embedded hypothetical module in the short sequence and the

conserved module in the long sequence. When this methodology was tested on all of the

10,000 simulated sequence pairs, about 80% of the highest peaks for each pair contained

the correct match between the embedded hypothetical module and the conserved module.

Distribution and Statistical Evaluation of the BLISS score

To be able to evaluate the match at the binding site profile level, we analyzed the

distribution of BLISS~scores using the simulated sequence pairs. For each pair of

sequences, BLISS~scores were calculated at each position as the short profile slid along

the longer profile. The peak matches (corresponding to the peaks in the score profile)

between each pair of sequences were evaluated to see whether it aligned the embedded

modules. If the match did align the modules, it was designated a "true" match. All other

BLISS~scores were considered as background.

Figure 3-3b shows the distribution of the background and the "true" match

BLISS~scores for the 10,000 simulated pairs of sequences. This distribution varies

slightly depending upon the cutoff threshold set for M~score (Figure 3-3b & c). This is

not surprising, since a lower cutoff threshold will lead to more identified potential

binding sites and thus a slightly higher background score.





































(b) Statistical Analysis of BLISS_scores. M_score cutoff = 0 76

Module matches with random sequence
SModule matches with modula




1i


32





(a) Search the conserved regulatory module


long DNA sequence


111


a
~?n
e
g
111?


In~


I


c I:


BLISS score


(c) Statistical Analysis of BLISS_scoras, M_score cutoff= DB8

SModule matches with random sequence
: ..... -.. Module matches with modlule


0.12

01


(li


BLISS score


Figure 3-3. Identifying a conserved module at the binding site level. a.) The peak in the
BLISS_Score profile represents the maximum match at the binding site

level. b.) and c.) show the distribution of BLISS_score in true matches vs.

background with a cutoff value of 0.75 and 0.80, respectively.









The distributions allow us to evaluate any particular BLISS~score. It is informative

in helping set a threshold for reporting significant matches at the binding site level. Given

a BLISS score x, the distributions allow us to decide whether that BLISS score

corresponds to a true alignment of modules or whether it corresponds to the module

aligning with a random DNA segment. Let C1 denote the event where the modules

embedded in the short and the long sequences are aligned, and C2 denote the event where

either module is aligned with a random DNA segment. Based on Bayes formula, the

posterior probabilities can be calculated as follows:

p(x | Cl) p(C1)
p(C1 | x) =
p(x | Cl)p(C1)+ p(x | C2)p(C2)

p(x |C2)p(C2)
p(C2 | x) =
p(x | Cl)p(C1)+ p(x | C2)p(C2)

Where p(C1|x) is the conditional probability of Cl given a BLISS~score x and

p(C2|x) is the conditional probability of C2 given a BLISS~score x; p(C1) is the prior

probability of Cl and p(C2) is the prior probability of C2; p(x| CI) is the conditional

probability of observing x given Cl and p(x|C2) is the conditional probability of

observing x given C2. p(x|C1) and p(x|C2) can be read off directly from the distributions

generated.

It is difficult to find the means to calculate the prior probabilities p(C1) and p(C2).

In this study, we assumed p(C1) = p(C2), although we suspect that p(C1) might be

smaller than p(C2). This assumption allowed us to calculate the posterior probabilities to

evaluate a BLISS~score x. In practice, we decided that x was a significant matching score

if p(C2|x) was less than a threshold of, e.g. 0.01 or 0.001.









Identifying a Conserved Regulatory Module in Distantly Related Species

To test the efficacy of the BLISS methodology in real sequences, we undertook the

task of identifying the Even-skipped (eve) stripe 2 enhancer (S2E) in distantly related

Drosophila species. Even-skipped, an important development regulatory gene in

Drosophila nzelan2oga~ster (D.nzela), is specifically expressed in seven transverse stripes

in the embryo during the blastoderm stage. The stripe 2 enhancer is the best studied and

includes TFBSs for five TFs, bicoid (Bcd), hunchback (Hb), giant (gt), Kruppel (Kr), and

sloppy-paired (slp) (Small et al. 1992; Arnosti et al. 1996; Andrioli et al. 2002).

Unfortunately, TRANSFAC 9. 1 has Position Weight Matrices for only three of the five

TFs, i.e., Hb, Kr, and Bcd. Our search was therefore limited in the sense that some of the

participating TFBSs could not be predicted and used for the match comparison.

Previous experimental investigations have shown that S2E is evolutionarily

conserved among D. yakuba (D.yaku), D. erecta, and D. pseudoobscura (D.pseu)

(Ludwig et al. 1998; Ledwig et al. 2002; Ledwig et al. 2005). All of these species are in

the same sub genus (Sophophora) as D.nzela, 0 ithr D.pseu having the most distant

relationship with D.nela (separated at about 40 million years ago). BLISS did identify

the eve S2E modules among these four species. In particular, a significant peak was

reported by BLISS when we searched the S2E module extracted from D. pseu against the

14kb D. nzela genomic sequence flanking the eve coding region.











(a) Search S;2E in 14kb D.~ mroja


(tb) S~ea -rch 2E 14kb D. mnsaj compl~emRent ary stra nd


Figure 3-4. Identifying the eve S2E module in distantly related species. a.) Using the
D.pseu S2E module (1027 bp), a peak (red circle) in BLISS_score was
identified in a 14 kb D.moja genomic sequence surrounding the eve coding
region. b.) No significant match was identified in the reverse strand (bottom
panel) or an unrelated sequence (data not shown).


1 4kb~ D. maoja


14kb D. maeja complemnentary srand









In contrast, no detailed information has been published about potential S2E in more

distantly related species, such as D. zojavensis (D.nroja) or D. virilis (D.viri), both from

a separate sub genus (Drosophila) separated from D.nzela at about 60 million years ago

(Powell 1997). To identify S2E in these two distantly related species, we extracted the

14kb genomic sequence flanking the eve coding region from D.nroja and D.viri genomic

sequences. Blast analysis using the D.nzela or D.pseu sequence harboring the eve S2E

module did not identify any significant match longer than 41 bp (using bl2seq with

default gap penalty values). Using the BLISS methodology however, a significant peak

in the BLISS~score was observed (Figure 3-4a, p(C2|x)<0.05). Verification of this match

indicated that it contained the TFBSs composing the S2E module. Similar results were

obtained when corresponding sequence pairs involving 1.) D.mela and D.moj a, and 2.)

D.mela and D.viri, were analyzed. In contrast, no significant match was identified in the

reverse-complemented sequence (Figure 3-4b) or in other 14 kb sequences unrelated to

eve, indicating the specificity of the search.

A detailed inspection of the make up of the S2E modules in distantly related

species showed that S2E can be viewed as a complex module made of element modules.

To make an analogy, S2E is a complex sentence that has several "clauses" (Figure 3-5).

The evolution of the whole module indicates that the distances between some TFBSs

have changed dramatically. However, the distances among the TFBSs within

corresponding "clauses" have remained relatively stable. For example, in Clause 1 the

distances among participating TFBSs have remained constant over the long evolutionary

period. Specifically, the distance between the first bcd (overlapping with the first kr) and

the second bcd is invariably 20 in all of the four species. In addition, the distances










among TFBSs in Clause 3 have also remained relatively stable, i.e., within the variation

we have factored into our simulation.


4 159 327 403 49 52% 7

mel
139139 311 580 615664
Clause Clause2 ClauseS

4 2211 494 573 7373674 79:
pse
01 2011 802 8.

4 246 489 625 736 758s

810
26226 p

4 198 4 6 523 3 2 60
viri
7140 E755
17 78725 743


S Bod g Kr gH


Figure 3-5. Inter-TFBSs distances are very well conserved within each clause of the S2E
module. Comparison of the evolution of S2E modules across distantly
related species revealed that while the sequence length of the module has
changed significantly, the distance among TFBSs in Clauses 1 and 3 have
remained stable. The numbers near the TFBS indicate the positions relative
to the first Kr site in this module.

Since our methodology is really based on the assumption of limited distance

variations between TFB Ss, it should be much more sensitive at identifying individual

"Clauses" or simple modules. When the corresponding TFBS profie covering Clause 1

or Clause 3 were used to search against the genome sequence from D.moja, very

significant peaks in BLISS~score were observed (Figure 3-6a & b, p(C2|x)<0.001 for

both). The peaks corresponded to the match of Clause 1 and Clause 3 on the D.moja










sequence, respectively. BLAST analysis using the sequences covering Clause 1 or

Clause 3 searched against the D.moja genomic sequence failed to identify significant

matches that spanned the whole module. r~rista 2.0 did predict Clause 1 because it

succeeded in detecting the DNA similarity between the sequence covering Clause 1 and

the D.moja sequence. However, r~rista 2.0 failed to identify Clause 3 since no similarity

was detected between the sequence covering Clause 3 and the D.moja genomic sequence.

Implementation of BLISS as a Web-based Service

The BLISS methodology has been implemented as a web-based tool for the

research community. The web application embodies the Gaussian Smoothing Method for

identifying cis-regulatory modules at the binding site level, and outputs all potential

TFB Ss in the predicted module. The module finding process consists of several steps:

To begin, the user inputs two DNA sequences. For example, a short sequence from

a model organism that harbors a regulatory module, and a longer sequence surrounding

the ortholog of a different species. An M~score threshold of 0.75 or 0.8 is then chosen by

the user for the generation of the TFB S profiles for both sequences. Next, a plot of

BLISS~scores comparing successive alignments of the short profile against the long

profile is returned to the user. On the very same page, the distributions described earlier

(Figure 3-3b & c) are displayed so that the user may choose a BLISS~score threshold.

Once the BLISS~score threshold is chosen, BLISS outputs all of the matches with a

BLISS~score higher than that threshold (limited up to 5). For each match, a table of

contributing TFB Ss are listed based on the product of the p-values of the matching

TFB Ss on both sequences (Figure 3-7). Alternatively, it can be listed based on their

numeric contribution to the BLISS~score, or by the location of the TFB Ss.












(a) Search S2E Clause 1 in 14kb D maja


ecr



L sco

8
u,
CR to

m

LO


(b) Search S2E Clause 3 in 14kh D. mnoja


9LO






uJ
uJ

m


Figure 3-6. Searching for conserved individual clauses / element modules using BLISS.
Profiles covering clauses 1 (a) or 3 (b) of S2E were used to search against a
14 kb D.moja genomic region. The BLISS_score peaks are significant
(P(C2|x)=0.0 for a, = 0.0003 for b).


14kb D. maja


14kb D. maja











Currently, the limits for the short and long input sequences are set at 1200 bps and

15k bps, respectively.

TFBS Location1 M scorel PValuel LocationZ M score PValue2 Contribution PiralueProduci
Ahd.F~f.3 8 8 9499904f a 733F-F; fi7 8 9R199914 1 4AR0801'F-F;' 1 R?19RE 1704A077F-


1.025b60Uhb-
0.9424420813 646

0.7552634 '1.03974E-4 755
0.9447829 5.706195E-4 646
0.9414161 5.706195E-4 646


0.93008316 7.5675E-5 2.0242994 7.760926E-9

0.75241125 1.4380351E-4 '1.4902003 1.4961826E-8
0.9724083 5.93735E-5 '1.0032555 3.3879676E-8
0.9724083 5.93735E-5 0.60433334 3.3879676E-8


HNF-1(+ 105 0.8874515 1.844625E-4 709 0.84781164 7.54511E-4 1.7975879 1.3917898E-7
CHX10(+) 79 0.7580955 0.0048756273 685 0.9947151 2.92595E-5 '1.8798841 1.4265842E-7


CIEBPdella
12 0.89993819 0.00'1164967 697 0.94612706 1.774225'1E-4 '1.802457 2.0669137E-7




Figure 3-7. BLISS output of the contributing TFBSs. Our S2E search was used as the
example. The list of TFBSs can be output either based on sequence
position, product of p value (Figure 3-7), or contribution to BLISS_score.
The TFB Ss that belongs to S2E were highlighted in green.

Discussion

In this study, we have presented a first step towards identifying regulatory modules

via comparisons at the binding site level. The advantage of such an approach is that it

allows the detection of conserved regulatory modules in highly divergent sequences, as

we have demonstrated both with simulated sequences as well as with real world


examples. This method is thus complementary to many existing methods that are based

on sequence similarity comparisons (Altschul et al. 1990) or use sequence similarity for


pre-analysis selection (Sandelin et al. 2004; Loots and Oveharenko 2004; Aerts et al.


PITX2(-) 42
Lentiviral 13
TATA(+
Crx(-) 42
Cr(1 48










2005; Sharan et al. 2004). It should also be complementary to applications such as

MEME and CompareProspector, which are widely used for the identification of

conserved sequence motifs (binding sites) in the regulatory region of co-expressed genes

(Liu et al. 2004; Liu et al. 2004).

There are limitations to our approach. Some of the maj or limitations, such as the

coverage and quality of the TFBS matrices, are expected to improve rapidly in the near

future as new high throughput techniques are applied to identify binding sites in genome

scale. Our current algorithm is developed based on the assumption that the inter-TFBS

distance variation is within a +/-4 base pair range. This allows the identification of

modules/clauses with relatively small inter-TFB S distance variation, such as the

individual clauses in the S2E module. It will likely miss modules/clauses that have much

larger distance variations between TFBSs. In the case of S2E, the identification of the

module was based on the fact that the third clause had low inter-TFBS distance

variations, which was sufficient to generate a significant BLISS~score (Figure 3-4a). As

indicated by Ludwig et al, if S2E as a whole were to be considered, many inter-TFBS

distances have changed dramatically during evolution (Ludwig et al. 2005). However, a

closer look at the distribution of TFB Ss in S2E in the two distantly related species also

indicated that the S2E module may be sub-divided into clauses (Figure 3-5). While the

inter-clause distances have varied dramatically, the inter-TFBS distances within each

clause have remained largely stable (Figure 3-5). This is very possibly a reflection of the

spacing restriction on important transcription factor interactions.

In addition to the S2E module, we also tested our methodology on other regulatory

modules such as the DME (Distal Muscle Enhancer) module in front of the paramyosin










gene (Marco-Ferreres et al. 2005). Using a 200 bp sequence harboring the DME in

D.mela, we were able to detect the corresponding module in D. viri (data not shown).

Currently, the number of well characterized, evolutionarily conserved modules is limited.

The goal of BLISS is to facilitate the discovery of multiple TFB Ss modules by

identifying the conserved pattern of the TFBSs. We also applied BLISS to a regulatory

region that is responsible for mediating UV induced expression of hac-1 (Zhou and

Steller 2003). There is no existing information on the composition of the UV-responsive

module in this region, which has very low sequence level conservation between the

corresponding segments in D.mela and D.pseu. Yet genetic experiments have indicated

that the responsiveness is highly conserved. The potential module identified by BLISS is

currently being tested experimentally.

BLISS, with some adaptation, can potentially be used to identify the conserved

regulatory modules in co-expressed genes. Another advantage of BLISS is that the

methodology can also be applied to identify patterns that involve not only TFB Ss, but

also other sequence features such as complex response elements (Ringrose et al. 2003),

insulator sequences, CpG islands, etc. Functionally, these sequence features (their related

modifications and binding partners) interact with transcription factors. However, these

features, such as CpG islands, cannot be detected by simple sequence similarity based

searches.

Conclusions

In this study, we addressed the feasibility of identifying conserved regulatory

modules at the binding site level. Our results indicate that it is feasible to identify

conserved regulatory modules in simulated random sequences harboring a regulatory

module made of 4-8 distinct binding sites. Using real sequences, we demonstrated that









our approach outperforms regular sequence level comparisons when the orthologous

DNA sequences are highly diverged. In addition, the BLISS program outputs directly the

candidate binding sites that are shared between the two regulatory sequences, which can

greatly facilitate the evaluation of the candidate module as well as the design of the

experimental verification strategy by biomedical scientists. Future development of the

proj ect will include identifying better algorithms for complex modules and modules with

higher inter-TFBS distance variations.

Methods

Generating Simulated Sequences

10000 simulated sequence pairs were generated for developing the methodology.

Each set included a short DNA sequence (100-500 bps) harboring a hypothetical TFB S

module and a long DNA sequence (5-6 kb) harboring a conserved TFB S module. First,

the hypothetical TFBS module was generated in the following manner: 4-8 binding sites

were randomly chosen from TRANSFAC 9.1 (Matys et al. 2003) database and then

random DNA subsequences were inserted between them. Qiu et al. (Qiu et al.

2002)analyzed all the entries of composite elements in the TransCompel database

(version 3.0) and they found that about 87% of the composite elements are within 50 bp

distance and about 65% are within 20 bp distance of one another. We therefore chose

lengths of DNA subsequences inserted between binding sites based on this result. Next,

we created the conserved TFBS module, which included binding sites for the same sets of

transcription factors in the same order as in the shorter sequence. However, the binding

site sequences had to be different and they were randomly chosen from TRANSFAC 9.1.

Furthermore, compared to the hypothetical module, distances between binding sites were

set to vary slightly and we allowed each binding site to shift up to 4 bps either to the right










or to the left. Finally, we inserted the conserved module into a 5000 bp randomly

generated DNA sequence to generate the longer sequence.

Binding Site Identification

Potential TFBSs both in the short DNA sequence (including the hypothetical

module) and the long DNA sequence (including the conserved module) were searched

based on frequency matrices collected by TRANSFAC 9.1i. Because TFB Ss may be

detectable on either the forward or the backward strand, we searched both strands of

sequences. The M score profile for each sequence is a M*L matrix, where M is twice of

the number of matrices applied and L is the length of the sequence. The top half of the

M~score matrix is the score profile for the forward strand and the bottom half is that for

the complementary strand. TheM2~score of the ith TFB S at position j of the sequence was

calculated by first aligning the frequency matrix for the ith TFB S with the sequence at

position j and then computing:

f. 1. M score:.

Score[i, j] Score,,, [i, j]
M~ score[i, j] =
Scorey,~[i, j] Score,,, [i, j]

K-1
Score[ i, j] = [ (k) fkn,77+



Score,,,,[i]= [ I(k) fknm


K-1
Scored [i]= CI(k) fknax


Ilk) = fkAIn4fa
Nt(A,T,C,G)










Kis the length of the TFBS. Izyk E (ATCG is th nulotd ocurin in the,1,~~ ,,,,,


sequence at position j+k. fk~ni+k iS the frequency of nucleotide n, y, at position k in the


frequency matrix (of the ith TF). fkmm is the lowest frequency and fkmax is highest

frequency across all nucleotides at position k in the frequency matrix (of the ith TF). Ilk)

is the information vector for the frequency matrix, which reflects the degree of

conservation at position k of the matrix. Finally, M~score~i, j] is the normalized

Score~i,j]. Stormo (Stormo 1998; Stormo 2002; Stormo and Fields 2002) observed that

logarithms of the base frequencies ought to be proportional to the binding energy and the

information vector reflects this average binding energy between the transcription factor

and the binding site.

A score cutoff at 0.75 was applied to the M~score profiles of both the short and the

long sequence as follows:

M~ score[i, j] = M~ score[i, j] if M~ score[i, j]>2 0.75

M~ score[i, j] = 0 ifM~ score[i, j] <0.75

P Value for M score.

To calculate the p value, a background model is required. Here, we chose the

background model to be a random DNA sequence where each position is drawn

independently. The ratio among A, T, C, and G is 30% : 30% : 20 % : 20%. For each

frequency matrix, 300 million subsequences were sampled from the background model,

and the M~score of each sub sequence was calculated to build the M~score distribution.

The p value of aM2~score for each TFBS was estimated by calculating the fraction of

samples that had scores equal to or higher than that M~score. Then, the P~score profiles

were calculated as follows:












f.2 P score


P score[i, j] = C, *M2 score[i, j]



C, is the negative natural logarithm of the p value ofM~score >= 0.75 for the ith

TFB S.

Gaussian Smoothing

To account for the change in the distances between/among binding sites, a

Gaussian smoothing was applied to the P~score profiles with a variance of 9. Formally,

the G~score profiles were calculated as follows:

f.3 G score:


G score[i, j] =
1 -k2/2U2


k=-7 -22/2U2




where a = 3 and k ranges from -7 to 7. In effect, a P~score can spread 7 positions to

both the right and the left due to the Gaussian smoothing. Smoothed P~scores beyond 7

positions were ignored due to their small values.

Searching the Conserved Module in the Long Sequence

To identify a maximum match at the binding site profile level, the short G~score

profile was slid along the long G~score profile. BLISS~score at position n is the matching

score between the short profile and its corresponding region of equal length (length of the

short sequence) in the long profile at position n:










f.4 BLISS score:


BLISS score[n] =
M-1L-1
(ff G c_ scorel[ i, j]* G score2[ i, j + n]) / LenzgthO/ShortSequence
I=0 ]=0


where G~scorel is the G~score profile for the short sequence and G~score2 is the

G~score profile for the long sequence; L is the length of the short sequence; n is the

current location where the short sequence is aligned to the long sequence.

Large-scale Search of the Simulated Sequences, Statistical Analysis

We used 10000 simulated sequence pairs generated by the above method to

calculate two BLISS score distributions. The first is the BLISS score distribution when

the hypothetical module in the short sequence is aligned with the conserved module in the

long sequence. The second is the BLISS~score distribution when the module is aligned

with a non-module segment of the longer sequence. For each pair of sequences,

BLISS~scores were calculated at each position as the short profile slid along the longer

profile. The peak matches (corresponding to the peaks in the score profile) between each

pair of sequences were evaluated to see whether it aligned the embedded modules. If the

match did include the alignment of the modules, it was designated a "true" match, and

this BLISS~score was used to calculate the distribution for the modules matching. All of

the other BLISS~scores were used to calculate the distribution for the module matching

with the background sequence.

Searching for the eve2 Module in D. virilis and D. mojavanis Sequences

The GenBank (http://www.ncbi .nlm.nih.gov/) accession numbers for the S2E

sequences are AFO42712 (D. pseudoobscura) and AFO42709 (D. melan2oga~ster). We










used BLISS to search these two enhancers in 13kb D. virilis and 14kb D. moj avanis

sequences, in which S2E is hypothesized to be located, but the specific location is

unknown.

Ludwig et al. indicated that distances between TFB Ss in two clauses in S2E (region

134-275 and region 484-684 for D. melan2oga~ster, region 196-376 and region 692-866 for

D. pseudoobscura) were substantially conserved. We removed those regions and

searched for the modules in 13kb D. virilis and 14kb D. mojavanis sequences using

BLISS.

Website Construction

BLISS was implemented using HTML/JSP/JavaBean and is supported by an

Apache Tomcat 5.5 server. It is publicly available at:

http://genel .ufscc.ufl.edu:8080/blissWeb/index.html. The M~score profiles of TFBSs

were calculated based on the frequency matrix library collected by TRANSFAC 9.1.

BLISS used DISLIN (http://www.mps.mpg. de/dislin/), a plotting library for displaying

data, to draw the match score plot in run time.















CHAPTER 4
BINDING-SITE LEVEL IDENTIFICATION OF SHARED REGULATORY
MODULES INT TWO ORTHOLOGOUS SEQUENCES

Background

Determination of the bindings between Transcription Factors (TFs) and

Transcription Factor Binding Sites (TFB Ss) is critical to understanding the mystery of

gene regulation. In higher organisms, a particular aspect of gene expression is rarely

controlled by one single TF, rather by a combination of TFBSs called regulatory module

(Maniatis et al. 1987; Johnson and McKnight 1989; Shaw 1992; Wang et al. 1993; Tjian

and Maniatis 1994). Genes are typically regulated by various regulatory modules in

response to different conditions (Yuh et al. 1998). Identification of such regulatory

modules is notoriously difficult due to the inherent biological complexity. The traditional

way is through biological experimentation (Galas and Schmitz 1978; Fried and Crothers

1981; Garner and Revzin 1981), which is in general costly and time-consuming. With

fully sequenced genomes, computational approaches have emerged as powerful tools to

supplement full-scale laboratory experiments in narrowing down candidate regulatory

modules and therefore dramatically reduced the effort in determining the regulatory

elements (Qiu 2003).

A number of software tools have been developed to search for regulatory modules

by cross-genome comparison (Sharan et al. 2003; Sandelin et al. 2004; Loots, G.G. and I.

Oveharenko 2004; Sinha et al. 2004; Aerts et al 2005; Schwartz et al. 2003; Cora et al.

2004; Venkatesh and Yap 2005). The theoretical assumption underlying these tools is









that DNA sequences harboring TFBSs are conserved during evolution due to selective

pressure. Thus conserved regulatory modules should be found in the regions with high

DNA similarity. However, most of these tools are based directly on the measurement of

similarity at the DNA sequence level. It is known that TFBS sequences are degenerate in

nature and the DNA sequences between TFBSs are not under the pressure of selection

and therefore mutate rapidly. Therefore, it is highly probable that the DNA similarity is

not detectable even though the actual regulatory modules are conserved (Ludwig et al.

2005). Since the conservation pressure is at the binding site level, in our previous

research, we developed a novel methodology named BLISS to perform the cross-genome

comparison on binding site profies (Meng et al. 2006). This is first time that the

feasibility of comparison at the binding site level has been validated. As a complementary

tool in the Hield, BLISS demonstrates the ability to detect conserved regulatory modules

from diverged orthologous sequences where DNA similarity is undetectable.

It has been observed that the distances among a certain cluster of TFB Ss in the

regulatory module are highly conserved during evolution, which may reflect the space

restriction essential for the protein-protein interactions among TFs required for

combinatorial transcription regulation (Ludwig et al. 2005). Due to biological

complexity, previous work in this direction has been limited to identifying composite

elements, pairs of transcription factors, whose binding sites tend to co-exist.

TransCompel (Kel-Margoulis et al. 2002) is such a database with collections of

composite elements. In this study, we extend the concept of "composite element" to

"regulatory complex". Regulatory complex is a cluster of TFB Ss, where the type

(identity), number, and order of TFBSs are highly conserved during evolution and the









variation of inter-TFBS distances is within a certain range (less than 10 bps). A

regulatory complex is not necessarily conserved at the sequence level during the

evolution. It could, by itself, be a simple regulatory module or be a part of a complex

regulatory module. In previous research (Meng et al. 2006), BLISS showed that it was

more advantageous to Eind regulatory complexes rather than the regulatory module,

especially when the regulatory complex is not the only regulatory elements in the

module. We therefore focuses on the detection of conserved regulatory complexes in this

study and the regulatory modules could be predicted based on the identification of

conserved regulatory complexes within them.

The first release of BLISS (BLISS1) was limited in application to a short sequence

and a long sequence and it made the assumption that the short sequence harbored only

one regulatory module/complex (Meng et al. 2006). However, in most cases, biologists

have no prior knowledge about the locations of the conserved regulatory modules in the

sequences to be analyzed. In the second release of BLISS (BLISS2), we extended

BLISS1 to a general tool without the limitation of the assumption made in BLISS1 about

the analyzed sequences. In this study, we show that BLISS2 could predict conserved

regulatory modules by detecting all potential shared regulatory complexes.

Results

Simulating Sequence Pairs with Varying Complexity of Conserved Binding Site
Patterns

Simulated sequences were generated to test the performance of our algorithm since

very few well-investigated regulatory modules are available currently. Two testing data

sets were generated. The backbones of the sequences in the first data set were 1000 bp










(base pair) random DNA sequences and the backbones of the sequences in the second

data set were 5000 bps.

Table 4-1. Simulated data sets
Sequence Length of Inserted complexes in each sequence pair
pairs in each Backbone Groupl1 Group2 Group3 Group4
group
Data set 1 500 1000 0 1 2 3
Data set 2 500 5000 0 1 2 3


As shown in Table 4-1, each testing data set consisted of four groups of sequence

pairs and each group had 500 sequence pairs. No regulatory complex was inserted into

the first group of sequence pairs. In the second group, each sequence pair had one

conserved regulatory complex. Two and three conserved regulatory complexes were

inserted into each sequence pair respectively in the third and fourth group.

Each regulatory complex was composed of 4-8 binding sites, which were extracted

from TRANSFAC 9.1 database (Matys et al. 2003). Qiu et al. (Qiu et al. 2002) performed

a statistical analysis for the distances between TFB Ss in the composite elements, which

had been experimentally confirmed and collected in the TransCompel database (version

3.0). Their analysis demonstrated that 87% of the distances between TFB Ss in composite

elements are within 50 bps, and 65% are within 20 bps. The length of the inserted random

DNA segments between binding sites was based on this statistical analysis when we

formulated the regulatory complexes (see Methods for details).

Identifying Conserved Regulatory Complexes by Comparison at the Binding-site
Level

In our previous research (Meng et al. 2006), BLISS1 has demonstrated the

feasibility of identifying conserved regulatory complexes in diverged DNA sequences at

the binding-site level. However, it was limited to predicting conserved regulatory










complexes from a short sequence and a long sequence and it required that the regulatory

complex existed in the short sequence. During the comparison, the short sequence was

slid along the long sequence to Eind the segment in the long sequence that had a

significant matching score with the short sequence at binding site level. In this study,

however, our focus is to extend BLISS1 to identify potentially conserved regulatory

segments from two orthologous sequences without prior knowledge about the locations

of regulatory complexes in the sequences.

Like BLISS1, BLISS2 calculated BLISS_score to represent the degree of similarity

at the binding site level for two sequence segments. The first three steps of BLISS2 were

to calculate M~scores, P_scores and G~scores respectively for two genomic sequences.

M_score is the binding site profie calculated using a matrix scoring method based on the

binding site frequency matrices collected in TRANSFAC9.1 database. To differentiate

simple matrices that match DNA sequences with a high frequency from those matrices

that match DNA sequences rarely, P_score was generated as the product of -log(p value

of M_score > cutoff) and the M_score. To account for the variation of distances

between/among binding sites in the conserved regulatory complexes, Gaussian smoothing

was applied to P_score to get G~score. The first three steps in BLISS2 were in essence

exactly identical to those in BLISS 1. The difference is that BLISS2 deals with two long

sequences and the locations and lengths of the regulatory complexes in the sequences are

unknown. The task of BLISS2 is to seek the matched segments from two sequences

whose binding site profiles are similar (not necessarily identical) and whose

BLISS_scores, the matching score at the binding site level, are statistically significant.

The statistical evaluation of BLISS_score in our previous study allowed us to evaluate a










particular BLISS_score and set a threshold for reporting significant matches. In this

study, BLISS2 took a heuristic approach. The general strategy of BLISS2 was to seek

seeds with a certain window size from the G~score profiles where their BLISS_scores are

greater than a certain BLISS_score cutoff. Then all the seeds were extended and the

shared regulatory regions were decided when the BLISS_score dropped below the

BLISS_score cutoff. (See Methods)

At the beginning, BLAST (E=0.01, Blast2seq) was applied to the sequence pairs in

the first data set in which the backbone of each sequence was 1000 bp long and there

were a total of 3000 conserved regulatory complexes. Of the 2000 simulated sequence

pairs, only 32.4% of them were reported to have detectable similarity. Even in those

positive cases, the average length of output alignments was 16 bps and the maximum

length of output alignments was 32 bps. Clearly, these output alignments were far shorter

than the length of the inserted regulatory complexes, which was not sufficient to allow

for the detection of the regulatory complexes. This analysis indicated that traditional tools

based on sequence level comparison are not enough to predict the conservation of

regulatory complex in diverged DNA sequences.

Table 4-2. Performance test of BLISS2 for Mlscore cutoff=0.75. False Positives (FP)
and True Positives (TP) are listed under different window size and BLISS
score cutoff.
Window BLISS score cutoff
Type
size 2.9 2.8 2.7 2.6 2.5 2.4 2.3 2.2 2.1
FP 1795 2565 3748 5272 7180 9262
100
TP 904 1045 1174 1307 1434 1537
FP 143 221 401 741 1362 2402 3940 5781
150
TP 732 897 1079 1269 1445 1616 1775 1900
FP 14 18 44 80 205 473 1106 2195 3836
200
TP 382 526 682 879 1083 1345 1558 1764 1899
FP 1 1 6 14 34 88 261 727 1707
250
TP 130 213 343 504 695 918 1189 1468 1663









Table 4-3. Performance test of BLISS2 for M~score cutoff=0.8. False Positives (FP) and
True Positives (TP) are listed under different window size and BLISS score
cutoff.
Window BLISS score cutoff
Type
size 1.9 1.8 1.7 1.6 1.5 1.4 1.3 1.2
FP 1907 2905 443 5 6480 8991
100
TP 1168 1328 1488 1633 1759
FP 193 330 629 1193 2366 4306 6625
150
TP 1021 1229 1432 1664 1892 2049 2154
FP 26 45 81 195 523 1351 2975 5195
200
TP 617 818 1069 1346 1631 1880 2054 2137
FP 2 4 20 42 130 373 1213 2982
250
TP 298 464 662 932 1227 1539 1822 2006


In comparison, the testing results of BLISS2 under different 1)window size and 2)

BLISS score cutoff are listed in table 4-2 and 4-3.

True positive refers to the regulatory complexes that were inserted in the simulated

sequence pair and identified by BLISS2, while false positive refers to the regulatory

complexes that were reported by BLISS2 but did not really exist. Both table 4-2 and 4-3

indicate a balance between true positive and false positive under different BLISS_score

cutoffs. While the number of true positives improved when we chose a lower

BLISS_score cutoff, the false positives increased as well; on the other hand, when a

higher BLISS_score cutoff was applied, false positives decreased while true positives

were compromised.

Based on the testing results listed in table 4-2 and 4-3, a window size 200 was

selected in the implementation of BLISS2 since in relation to other window sizes, it had a

better balance between false positives and true positives both for M~score cutoff 0.75 and

0.8. In the first testing data set, the length of each sequence was 1000 bps. The

performance test of BLISS2 proceeded under window size 200 using the second testing

data set, where the length of backbone of each sequence was 5000 bps.











1200

m, 1000 4

800 --FP-2
-ro -u ~FP- 1
0 600 'i


200-


2.5 2.6 2.7 2.8 2.9 3 3.1
BLISS score cutoff


Figure 4-1. The effects of length of analyzed sequences on the performance of BLISS2.
Window size 200 and M_score cutoff 0.75 were applied in this comparison
test. FP-1 represents the false positive for the first data set; FP-2 represents
the false positive for the second data set; TP-1 represents the true positive
for the first data set; TP-2 represents the true positive for the second data set.

The results displayed in Figure 4-1 shows that the true positives were almost the

same for both the testing data sets, although the true positives for the second data set is

slightly less than the true positives for the first data set. The major difference lies in the

false positive. When a higher BLISS_score cutoff was applied, false positives were not

significantly affected by the lengths of the sequences. However, the false positives

increased dramatically for the second testing data set when the BLISS_score cutoff was

decreased. Thus, for practical purpose, a high BLISS_score cutoff is suggested to avoid

large amount of false positive for the longer sequences, although the sensitivity of

BLISS2 will be sacrificed at the same time.

Testing BLISS2 in Real-world Examples

Even-skipped gene (eve) is expressed in seven stripes in the embryo of Drosophila

melanogaster (D.mela) during the blastoderm stage. The enhancer for the second stripe









(S2E) is the best studied regulatory module and includes binding sites for five

transcription factors, biocoid (Bcd), giant (gt), hunchback (Hb), Kruppel (Kr) and sloppy-

paired (slp) (Small et al. 1992; Arnosti et al. 1996; Andrioli et al. 2002). D.mela, D.

yakuba (D.yaku), D. erecta, and D. pseudoobscura (D.pseu) belong to same subgenus

(Sophophora) and experimental investigations on S2E suggests that S2E is conserved

during evolution among these four closely related species. A further detailed inspection

has revealed that S2E includes two regulatory complexes, where the binding sites and the

distances among binding sites are highly conserved (Ludwig et al. 2005). The make-ups

of S2Es in D.mela and D.pseu have been displayed in Figure 4-2. However, no detailed

information has been published about S2E in two species: D. moj avensis (D.moj a) and D.

virilis (D.viri), which are distantly related to the above four species and belong to a

separate subgenus (Drosophila). We extracted 6k bp sequences before the transcription

start region both from the D.moja and D.viri genomic sequences and applied BLISS2 to

identify S2E in them.


Regulatory complexes in S2E


4 159 327 403 495 5 72
mel
139 139 311 580 615 664
Colmplex1 Cormplex2
4 221 494 573 703 734 79


21 201 802 6
Complex Complex2


Figure 4-2. Regulatory complexes in S2E of D.mela and D.pseu. The two complexes are
highly conserved during the evolution.











In the first experiment, BLISS demonstrated its ability to predict the locations of

S2E both in D.viri and D.moja using S2E sequences of D.mela and D.pseu. For instance,

three regulatory complexes were reported (Figure 4-3) when BLISS was applied to the

S2E sequence of D.pseu and the 6k bp sequence from D.viri (M_score cutoff=0.75 and

BLISS_score cutoff=2.9). By investigating the matched TFBSs listed by BLISS2, the

second and third complexes were those two S2E regulatory complexes in D.mela and


D.pseu (Figure 4-4). Thus the position of S2E in D.viri could be located based on the

positions of those two complexes. Identifying S2E in D.moj a was carried out in the

similar way.


I~nsrvrd I rwsnonr In
iii-Is ton
r $6r bs2
3 61 )11


Figure 4-3. Identification S2E in 6 kb D.viri sequence using the S2E of D.pseu. Three
complexes are predicted and the second and third one are those two in S2E.










Regulatory complex 1 in S2E
4221
viri
4201

3970

maja
3950



Regulatory complex 2 in S2E

4668
4644 4671 4716 4729
t9 I 1
viri
47374749 4778
4766
4482
4460 4484 4545


45534561 4592 4602


L Bed g Kr g Hb





b) TFBSs in complex 2.

In previous research (Meng et al. 2006), BLISS1 was applied to the same sequence

pair and the S2E in D.viri could be detected due to the match of one complex in S2E.

Another complex could not be matched at the same time because of the distance

difference between the two complexes in those two sequences. The BLISS_score

therefore was inevitably lower because the BLISS_score in BLISS was the average over









the whole range of the S2E sequence. In that investigation, it indicated that the

BLISS_score would be more sensitive and significant when either complex sequence was

used as the short sequence. In fact, we don't have this kind of prior knowledge in most

cases. BLISS2 displayed its advantage over BLISS 1 in that it was able to predict

regulatory complexes by detecting local maximum regions with similar binding site

profiles without requiring prior knowledge about the complexes in the orthologous

sequences.

BLISS2 further demonstrated its advantage over BLISS 1 for identifying conserved

regulatory complexes in two long genomic sequences, which could not be carried out by

BLISS1 at all. For instance, when BLISS2 was applied to two 6k bp sequences before the

transcription start region of eve gene from D.pseu and D.viri (M_score cutoff=0.75 and

BLISS_score cutoff=2.9), 15 potential regulatory complexes were reported by BLISS2.

By inspection of the make up of these complexes, it turned out that the 9th and 14th

complexes were those two regulatory complexes in S2E. In contrast, rVista2.0 could only

identify the first two binding sites in complex of S2E in the region with detectable DNA

similarity when the same binding site score cutoff (0.75) was applied. It is difficult for us

to verify the other 13 complexes identified by BLISS2 except for the two known S2E

complexes due to the lack of detailed regulatory information for those regions. However,

it is well known that those two 6k bp sequences are rich in regulatory information. At the

very least, it should harbor enhancers for the expression of the eve gene in some of the 7

stripes in the embryo.

Website Implementation of BLISS2

BLISS2 was implemented as a web-based software that can be freely accessed by

the public. To start, BLISS2 requires the users to input two DNA sequences that










potentially harbor shared regulatory information. An M~score (binding site score)

threshold could be chosen by the user on the very same page. Then a color plot indicating

conserved regulatory regions is returned to the user with the statistical analysis of

BLISS score under that M score threshold. The user then chooses a BLISS score cutoff

to output all conserved complexes with maximum BLISS_score above the cutoff. For

each complex, a table of contributing binding sites could be listed based on the product of

the p-value of the matching binding sites, or the location of the binding sites, or their

contributions to the BLISS_score. Currently, the length of each input sequence for

BLISS2 is limited to 8000 bps.

Discussion

In our previous investigation (Meng et al. 2006), we proved that it is feasible to

identify conserved regulatory modules from highly diverged orthologous genomic

sequences at the binding site level. In this study, we extended our previous method to the

two long sequence case and the regulatory modules could be detected by identifying

conserved regulatory complexes. The performance of BLISS2 was tested using both

simulated sequences and real world examples.

There are several limitations to BLISS2. First, BLISS2 is restricted to detect

conserved binding sites groups (regulatory complexes) with small inter-TFB S distance

variations. If the spaces between/among binding sites are not strictly required for the

interactions between/among transcription factors and the variation of inter-TFBS distance

is high, BLISS2 may miss those regulatory modules. The coverage and quality of the

frequency matrices for TFBSs is another limitation since the analysis of BLISS2 is totally

dependent on the accuracy of the binding site profiles. However, there is no doubt that









the coverage and quality of frequency matrices are being/will be improved dramatically

and the performance of BLIS S2 is expected to be enhanced along with it.

Although further improvement is required, BLISS2 demonstrates the ability to

detect biologically significant similarities in those sequences where DNA sequence

similarity is undetectable. The basic idea underlying BLISS2 is novel, but

straightforward. And therefore it potentially could be implemented in a number of ways

to improve performance for the different applications.

Conclusions

In summary, using simulated sequences and real world examples, BLISS2

demonstrated that regulatory complexes, the clusters of binding sites with small variation

ofinter-TFBS distance during evolution, could be identified in highly diverged

orthologous sequences by comparison at the binding site level. By detecting those

conserved regulatory complexes and exploring matched TFBSs in them, BLISS2

facilitates the localization of regulatory regions and therefore can assist biomedical

scientists in deciphering the mystery of gene regulation. Future direction of the proj ect

includes developing new algorithm to improve the selectivity and sensitivity of the

current algorithm, detecting regulatory complexes from multiple orthologous sequences,

and identifying regulatory modules with higher inter-TFBS distance variations.

Methods

Generating Simulated Sequences

Two testing data sets were generated for the development of the methodology.

The first data set had 2000 sequence pairs, which were divided into four groups

with each group having 500 sequence pairs. The backbone for each sequence was









constructed as 1000 bp random DNA sequence with the ratio among A, T, C and G being

30% : 30% : 20% : 20% since the non-transcribed genomic regions are generally AT rich.

Each regulatory complex contained binding sites for 4-8 transcription factors,

which are randomly extracted from TRANSFAC 9. 1 database. The range of distance (di)

between consecutive TFBSs was based on the statistical analysis by Qiu et al. (Qiu et al.

2002) on the TransCompel database: 65% of the distance between consecutive binding

sites was within 5-20 bps, 22% was within 21-50 bps and the rest 13% was within 51-60

bps. The conserved regulatory complex for the second sequence was formulated based on

the following rule: First, it consisted of binding sites for the same transcription factors in

the corresponding regulatory complex in the first sequence. The binding sites of two

corresponding transcription factors were randomly extracted from the binding site

instances collected in TRANSFAC 9.1 database and therefore they could have the same

or different sequences; second, the respective order of binding sites in those two

corresponding regulatory complex was kept the same; finally, dj, the distance between

consecutive binding sites in the conserved regulatory complex in the second sequence,

was randomly chosen to have a value from range (di-Ad,di+Ad). Ad is the variation of

distance between corresponding binding sites in those two regulatory complexes and we

chose Ad = 4 in this study.

Each sequence in the sequence pairs in the first group was a 1000 bp random DNA

sequence and had no introduced complex. For the second group, one regulatory complex

was inserted at a random location into the first sequence of each sequence pair; and then

the corresponding conserved regulatory complex was inserted at a random location into

the second sequence. In the similar manner, two and three regulatory complexes were









inserted into each sequence pairs, respectively, for the third and the fourth group. This

composition of simulated sequences reflected the fact that the analyzed orthologous

sequences may share 1) no of regulatory complex, 2) one regulatory complex, or 3) more

than one regulatory complexes.

All the above held true for the second testing data set except that the length of the

backbone of each sequence was 5000 bps.

Identifying Conserved Regulatory Complexes by Comparison at the Binding-site
Level

M scores, P scores and G scores. M scores, P scores and G scores for both DNA

sequences were calculated in the exact same manner as was described in chapter 3 (see

Methods part) based on frequency matrices collected by TRANSFAC 9.1. The G~score

profile for each sequence is a M*L matrix, where M is twice of the number of TFBSs in

the frequency matrix library and L is the length of the sequence. The top half of G~score

matrix is the score profile for the forward strand and the bottom half of the G~score

matrix is for the complementary strand.

Pre_BLISS_score. Calculation of PreBLISS_score is an intermediate step for

computing the BLISS_score from the G~score profiles. PreBLISS_score was calculated

as follows:


Pre _BLISS scor[ii j= G _scorel[m,i]*G scor2[m, j]


where G~scorel is the G~score profile for the first sequence and G~score2 is that

for the second sequence. M is twice of the number of TFB Ss in the frequency matrix

library and m is the index of the TFBS. PreBLISS_score is a two-dimensional L1*"L2










matrix where L1 and L2 correspond to the lengths of the two DNA sequences

respectively.

BLISS_score. BLIS S_score is the average of PreBLIS S_scores over a window size w:

w/2
BLISS score[i, j] = Pre BLISS score[i + k, j + k] /w
k=-w/2

Same as Pre BLISS score, BLISS score is also a two-dimensional L1*L2 matrix where

L1 and L2 are the lengths of the two DNA sequences.

Reporting Conserved Regulatory Complexes. Conserved regulatory complexes would

be reported according to the BLISS_score matrix. To start, the maximum point

BLISS_score[x,y] was found in the matrix. If BLISS_score[x,y] was less than the

BLISS_score cutoff, we would report that there is no conserved regulatory region for

these two sequences; otherwise, we extended this maximum point along the diagonal in

both directions until the BLISS_scores dropped below the BLISS_score cutoff. Suppose

these two end points were BLISS_score[xl,yl] and BLISS_score[x2,y2], then we would

report that we detected one shared regulatory complex, which started from x1-w/2 and

ended at x2+w/2 in the first and covered from yl-w/2 to at y2+w/2 in the second

sequence. The rest of the BLISS_score matrix was searched recursively in the same way

to Eind other shared regulatory complexes.

Performance Testing Using Simulated Sequences

Using our simulated data sets, the performance of BLISS2 was tested under

different BLISS_score cutoff and window sizes. The true positive and false positive for

each test under different condition were recorded.

Suppose a shared regulatory complex identified by BLISS starts from xl and ends

at x2 in the first sequence and covers range yl to y2 in the second sequence, while the










truly inserted regulatory complex is from xl' to x2' in the first sequence and from yl' to

y2' in the second sequence. We decided that it was a correct prediction if:

1. Identified complexes in both sequences covered more than 60% of the region of the
truly inserted complexes.

2. Two lines were created in two-dimensional space: one starting from point (xl, yl)
and ending at point (x2, y2); the other starting from point (xl', yl') and ending at
point (x2', y2'), and the distance between these two lines was found to be less than
10.

The regulatory complex, which was predicted by BLISS2 and truly existed in

simulated sequences, would be counted as a true positive; while the regulatory complex,

which was predicted by BLISS2 but did not exist in the simulated sequences, would be

counted as false positive.

Searching for the eve2 Regulatory Complex in D. virilis and D. mojavanis

The GenBank (http://www.ncbi .nlm.nih.gov/) accession numbers for S2E

sequences are AFO42709 (D. melanogaster) and AFO42712 (D. pseudoobscura). The two

6k bp seqeunces of D.viri and Dmoj a are the regions right in front of the transcription

initiation site of the eve gene in the genomic sequences of D.viri and D.moj a. M~score

cutoff 0.75 and BLISS_score cutoff 2.9 (default Mlscore and BLISS_score cutoff of

BLISS2) were used in both experiments.

Website Construction

BLISS2 is hosted by a Apache Tomcat 5.5 server and available at:

http://genel .ufscc.ufl.edu: 8080/bliss2/index.html. The website was implemented by

JavaBean/JSP/HTML and the color display of BLIS S_score was plotted by tools

provided by DISLIN (http://www.mps.mpg.de/dislin/).















CHAPTER 5
BLISS 2.0: THE WEB-BASED TOOL FOR PREDICTING CONSERVED
REGULATORY MODULES IN DISTANTLY-RELATED ORTHOLOGOUS
SEQUENCES

Identifying functional Transcription Factor Binding Sites (TFB Ss) in the regulatory

region of DNA sequence is essential to understand gene regulation at transcription level.

In eukaryotes, distinct TFBSs are often grouped together into regulatory modules to

control a particular aspect of gene expression. The composition of a particular regulatory

module can be identified by experimental approaches; for example, through enhancer

region dissection, DNase hypersensitivity assay, DNA foot printing, etc. However, most

of these approaches are time-consuming, laborious and costly. More importantly, many

of these approaches, such as the DNase hypersensitivity assay, can only be applied to a

short DNA sequence (a few hundred base pairs). It is almost impossible if the relative

location of the module cannot be narrowed to an experimentally testable region.

With the emergence and application of bioinformatics, computational approaches

have been developed to predict regulatory module and help the design and verification of

laboratory experiments. A number of such computational approaches (Aerts et al 2005;

Sandelin et al. 2004; Loots, G.G. and I. Oveharenko 2004; Sharan et al. 2003; Sinha et al.

2004) are through cross-genome comparison. The assumption underlying these

approaches is that DNA sequences encoding functional TFB Ss are conserved during

evolution due to selective pressure, whereas non-functional DNA sequences evolve

(mutate) much faster. Therefore, it is likely that the conserved regulatory modules could

be identified at the regions with high DNA similarity in two orthologs.












While this type of approaches has been proven very helpful in some cases, there is

also limitation. The success of these approaches depends on identifying significant

sequence level similarity between (among) the DNA regions that harbor the regulatory

modules. However, TFBS sequences are degenerative in nature and that they are

interspersed by non-functional sequences, which are not under conservation pressure. It

is therefore entirely possible that although the regulatory modules are conserved, the

overall similarity of the sequences harboring regulatory modules is only marginal and

cannot be distinguished from background. Applications based on cross-genome

comparison at DNA sequence level will fail when the orthologous DNA sequences are

highly diverged. Therefore, there is a need for a complementary tool that could allow

users to detect conserved regulatory modules from diverged DNA sequences.

We developed a novel methodology, BLISS 2.0 (Binding-site Level Identification

of Shared Signal-module 2.0 version), for identifying evolutionarily conserved regulatory

modules in pairs of orthologs based on the comparison at binding site level. Considering

the conservation pressure is at the binding site level rather than the DNA sequence level.

By integrating the Gaussian smoothing and statistical analysis, we demonstrated that our

approach outperforms regular sequence level comparisons when the orthologous DNA

sequences are highly diverged. BLISS 2.0 is now implemented as a web-based tool and

can be access by the research community freely at:

http://genel .ufscc.ufl.edu: 8080/bliss2/index.html.















Flie Ed~t Ylew Favantes Tools Help


Address / http//gene l.ukec.uR.edu:8080/blss2 v Goa Lmnks~


r*


Figure 5-1. Web interface of BLISS 2.0. This is the homepage of BLISS 2.0. To start,
users are required to input two orthologous DNA sequences to start the
search.


Given two orthologous DNA sequences, BLISS 2.0is able to output all potentially


conserved regulatory modules between those two sequences. BLISS 2.0 analysis



proceeds in three maj or steps:


To begin, users are required to input two raw DNA sequences (Figure 5-1). BLISS



2.0 generates binding site profiles for each sequence based on matrices collected by


Transfac 9.1 (Matys et al. 2003). And on the same page, users are allowed to choose a



binding site score cutoff. Either 0.75 or 0.8 can be selected as the binding score cutoff to


determine if a specific binding site is found in a certain location in the DNA sequence.



This cutoff is based on a balance between specificity and sensitivity. Higher score cutoff


increases the specificity, but decreases the sensitivity at the same time.


p qqll
AATTAACCATAATTTACTACTGCATGACAG Al
:GGCATAGAGCAG TA~GA~GTAGAGCATTGCAG
GAAGAGCAAT.CTCGGGAPTGPATGGAAATMGCAACAG
GCAGGCCGGTTCGTTCGCGAGATAA vi


Cis-Regulatory Modules (C~us) search
in twvo sequences


Cis-5egurlatoryr Modules {CRMa) Search intwro sequences helps you to identify conserved CRMs harboured in two input sequences.

1. SEQUENCE 1
a. Please input a name for the first DNAI sequence:
D. mel
b. Please paste the first DNA sequence (raw sequence) to the following text field:
GTTCITCCTGGTTACCCGGIA~CTIGCATAA 1
CAATGGAACCCGAACCGTAACTGGGACAGATCGAJJAGC
TGGCCTGGITICTCCIGGCTGGTGCCGTGTT

2. SEQUENCE 2
a. Please input a name for thle second DHA sequence:
D. pse
b Please asteP the second DNA sellr uence (raw se unce) t the follo~inn text Field:


g _


O Please check here if you want to search the complementary strand of the second sequence

3. Please choose a binding site store cutoff o.5Y



D~one


8 lnteret















I:lenel~f c itedu:8080/biss2/resullt2D,3sp v Go Links
1. CRM Search Results BLISS scores













D. me


6 B.d~ ~


A


File Edit View Favorlies Tools Help


Figure 5-2. The color plot of BLISS_score. BLISS_score indicates the degree of the
conservation at the binding site level between two sequences.

Second, by integrating Gaussian smoothing and statistical analysis on the binding


site profiles comparison, BLISS 2.0 indicates the degree of the conservation at the


binding site level between two sequences as BLISS_scores, which are displayed and


visualized as a two-dimensional color plot (Figure 5-2). The vertical color bar on the


right side of the color plot shows the value of BLIS S_score that the color represents.

Continuous high BLISS_scores along diagonal direction indicates a potential match of


conserved regulatory modules between two sequences. To be able to evaluate a particular


BLISS_score, we analyzed the distribution of BLISS_scores using simulated sequence


pairs as shown in Figure 5-3. There are two BLISS_score distributions. The left one is the

distribution when a random sequence matches with a random sequence or a sequence


~d rs





























































0 1 2 3 4 5 6
BLISS score

3. BLISS will output all matched binding sites in the matches regions with the average BLISS score greater than a cutoff.
Please input the BLISS_score cutoff based on BLISS_score Distribution: (
Rank output TFBSs based on: Productof p value .v


Figure 5-3. Statistical analysis and distribution of BLISS_score. Statistical analysis of
BLISS_score helps to evaluate a BLISS_score.


Finally, all the matching regulatory modules with BLISS_score greater than the


BLISS_score cutoff that users choose are displayed on the third page. Contributing


TFB Ss are listed in a separated table for each matched region. BLISS 2.0 provides the


v/~Go Link;"


harboring a regulatory module and the right is the distribution when two conserved


regulatory modules matches. Based on this statistical analysis, users can choose a


BLISS_score cutoff to output all matched regions with the BLIS S_score greater than this


cutoff to be conserved regulatory modules. Users also have the option to determine how


to rank the shared TFBSs in reported regulatory modules. They can be ranked by


locations, by numeric contributions to BLIS S_score or by product of p-values of the


matching TFBSs on both sequences.


File Edit View Favorites Tools Help


BLISS score Distribution (Binding site score cutoff-0.75)


i
T


01



0 06

O 046



O 02


2.


~ddress I~ ht~:ll~enel.uiscc.u





























L....alionl .n
rquel... I
383 -626
53 -288
299 -522
675 98
136 -353
463 -656


LI. rlium...
rqurl.-- 2
593 -836
115 -350
469 -692
904 -1027
280 -497
682 -875


GI~o Links *
A


Lenagl.M xis....
243 5.395915
235 4.9885244
223 4.459804
123 4.0222187
217 3.9538407
193 3.5002155


details
details
details
details
details


4 lone B Internet



Figure 5-4. Output of contributing TFBSs. Contributing TFBSs are listed in a separated

table for each matched CRM. Users can choose to highlight the TFBS they

are interested in as green color.


It has been experienced by many researchers that the laboratory experiments in a



model organism has narrowed down the location of the regulatory module for a particular



gene to a relatively short region, whereas for the ortholog in a less-studied organism,


information about the localization of the module is absent. To be able to deal with this



special instance, BLISS 2.0 provide a complementary tool Single Cis-Regulatory


Module Search, to locate the position of the conserved regulatory module in the ortholog



of the less-studied organism. We suggest users to use this tool especially in the case that



the length of one of the orthologous DNA sequences is less than 200 bps.


blalr*s~r*or~n~R8 crr.~Fr.nr~vr~*~m~I


Transcription Factor Binding Sites (TFBSs) output ranked by product of p values

Matched Regulatory Region l: sequence 1: 382 -625 Sequence 2: 592 -835 BLISS_score: 5.395915
TFBS i... .miolniy M More FV.=vI~ I L...:3io..2M Sc.>..? PIIalu!? LU nu.nulialn PLraluePII..duct

IEEF1D1584 0.9581334 4.32085E-5 804 0.~958133 4.32085E-5 0.014564217 1.8669744E-9

2F-4:DP-1(+ 584 0.9570317 4.3279E 5 804 0.9570317 4.3279E 5 0.014302874 1.873072E 9
FOXP1() 491 0.8072664 2.4131E-5 701 0.7661525 1.009465E- 0.10607633 2.4359401E-9

AbdB() 491 0.953505341.05372E- 700 0.9622206 6.732E-5 0.10029537 7.0936426E-9
PITX2(-) 522 09221458 1 2505-734 0 9244208310565E 0 08801021 1 0517734E-8
LFA1() 437 0.980438053.44865E-5 636 0.9059996 3.45118E-4 0.019752372 1.19019115E-8
E2F(+ 584 0 9617707 1 2874551E-4804 0 9617707 1 2874551E- 0 031881593 1 6575406E-8
AP-2alphaA(-) 407 0.88831943 1.67842E 4 608 0.9091B31 9.9237004E 5 0.034530856 1.6666136E 8
ZF5(+ 529 0.935855 3.20295E-5 748 0.8440624 5.243665E- 0.040263414 1.6795196E48
AP-2g~amma(-) 410 0.98323405 1.329145E 4 611 0.98323405 1.329145E 4 0.04111515 1.7666265E 8

E2F-1:DP-2(+ 584 0 91215223 146338E-4~ 804 0 912152231 46338E-4 0 015871227 2 1414811E-8


a I


72




option to let users highlight the TFBS they are interested in as green color as shown in



Figure 5-4.


Fle Edit View Favortes


Tools Help



...ser- p ~
... I's r legion
1 ~lr









The advantage of BLISS 2.0 is that it allows the detection of conserved regulatory

modules in highly diverged sequences, on which BLISS 2.0 outperforms the existing

methods that are based on sequence similarity comparison. We have successfully applied

BLISS 2.0 to identify the Even-skipped (eve) stripe 2 enhancer (S2E) in D. moj avenis

and D. virilis, which no detailed information has been published about and cannot be

detected by existing tools like BLAST, r~rista 2.0 and etc.















CHAPTER 6
CONCLUSIONS AND SUGGESTIONS FOR FUTURE STUDY

In this study, we developed a novel methodology to identify conserved regulatory

modules at the binding site level and implemented it as a freely accessed web-based

application named BLISS 2.0. To our knowledge, this is the first time that the feasibility

of comparison at binding site level is confirmed. In the first release of BLISS, we

assumed biologists have the sequence for a regulatory module and BLISS could identify

the conserved module from its orthologous sequence, where the DNA similarity could

not be detected and thus the methods based on sequence similarity failed. After the

success of BLISS 1, we further extended this methodology to be applicable for a more

general scenario. In BLISS2, potential regulatory modules can be predicted between two

orthologous sequences by detecting conserved regulatory complexes.

The advantage of this methodology is that it allows the detection of conserved

regulatory modules in highly divergent sequences as we have demonstrated both with

simulated sequences as well as with real world examples. BLISS is thus complementary

to many exiting methods based on nucleotide sequence similarity. It outperforms these

sequence similarity -based methods when the orthologous DNA sequences are highly

diverged. BLISS therefore serves as a valuable tool to facilitate biomedical scientists to

identify functional regulatory modules and design experimental verification strategies.

This study investigated a novel cross-genome comparison strategy and therefore

opened a new field for the future research, which includes developing new algorithm to

improve the performance of the methodology, multiple sequence comparison at the







75


binding site level and identifying regulatory modules with higher inter-TFBS distance

variations.















APPENDIX A
PARAMETER OPTIMIZATIONS FOR GAUSSIAN SMOOTHINTG METHOD

Using simulated sequences, the performance of Gaussian smoothing method was

tested under different parameters: 1) M_score cutoff, 2) variance for Gaussian smoothing,

and 3) integrating p value of M_score OR without integrating the p value of M~score

(Figure A-1).



Gaussian Smooth Method Tests Using Simulated
Sequences

100
I'90 -1 --- Var=0
O 80 --VP



5 7 ---Var 0(P value)
50 -e-Var=l(P value)
O40 I Var=2(P value)
0.5 0.6 0.7 0.8 0.9 1 --- Var=3(P value)
Score Cutoff


Figure A-1. Parameter optimizations for Gaussian smoothing method used in BLISS.

The results indicated that

1. Integrating the p value of M_score to Gaussian smoothing method had greatly
increased the performance of the method.

2. Bigger variance gave better performance.

3. BLISS got the best predictions when binding site score cutoff is between 0.75 and
0.85. Considering the binding sites from TRANSFAC are identified binding sites
and they intended to have higher binding site score, so 0.75 and 0.8 were chosen as
the default cutoff for binding site score in the implementation of BLISS.















APPENDIX B
A DYNAMIC PROGRAMMING ALGORITHM FOR IDENTIFYING CONSERVED
REGULATORY MODULES

Dynamic programming algorithms have been applied extensively in computational

sequence analysis. In this study, efforts were performed to apply dynamic programming

on the binding site profies to identify conserved regulatory modules in orthologous

sequences.

The idea underlying this dynamic programming algorithm is to look for the best

local alignments between the binding site profies of two sequences and the algorithm

consists of three steps:

1. Calculations of P_score profies for two sequences, which are exactly same as
described in the Method part of Chapter 3.

2. A matrix F was constructed based on the P_score profies of two sequences, where
F(i, j) is the score of the best alignment between the initial segment of position 1 to
i of first sequence and the initial segment of 1 to j of second sequence. F(i, j) was
computed recursively as below:



F~ij) =max F(i-1, j-1) + s(seqli, seq2j)
F(i-1, j) +d
F(i, j-1) +d

F(i, j) could be calculated from F(i-1, j-1), F(i-1, j) and F(i, j-1). F(i, 0) and F(0, j)
were set to 0 for all i and j.

F(i, j) = 0. When F(i, j) has a negative score in some points, we will give it value 0
to start a new alignment.

F(i, j) = F(i-1, j-1) + s(seqli, seq2j). When position i in sequence 1 and position j in
sequence 2 at least have one shared binding site, s(seqli, seq2j) will be the
maximum of P_score l(k, i) and P_score(k, j) for all k, where k is the index of the
binding site; when there is no shared binding site at position i in sequence 1 and
position j in sequence 2, but there is at least one binding site either at position i in










sequence 1 or position j in sequence 2, then s(seqli, seq2j) would be set to a penalty
called "TF_penalty"; when there is no any binding site either at position i in
sequence 1 or position j in sequence 2, s(seqli, seq2j) was set to zero.

F(i, j) = F(i, j-1) +d. This is the case when seq2j was aligned to a gap. When there
is at least one binding site at position j of sequence 2, d was set to "TF_penalty";
when there is no any binding site at position j of sequence 2, d was set to
"N~penalty".

F(i, j) = F(i-1, j) +d. This is the case when seqli, was aligned to a gap. When there
is at least one binding site at position i of sequence 1, d was set to "TF_penalty";
when there is no any binding site in position i of sequence 1, d was set to
"N~penalty".

3. After the construction of matrix F, find the maximum F(i, j), which corresponds to
the end points of conserved regulatory module. Then trace back until to a point
with value 0, which corresponds to the start points of conserved module.


This dynamic programming algorithm was applied to 500 simulated sequence pairs

and each sequence pair has only one conserved module. This sequence group corresponds

to the second group simulated sequence in the first data set used for BLISS2 performance

test in chapter 4. The best results were achieved when TF_penalty equals to -9 and

N~penalty equals to -1. When 0.8 was chosen as the M~score cutoff, the conserved

regulatory modules in 168 sequence pairs could be detected and the rest of 332 local

maximum reported by this dynamic programming algorithm was false positives.

As a conclusion, this study addressed the feasibility of identifying conserved

regulatory modules in orthologous sequences by applying the dynamic programming

algorithm on the binding site profiles. Even though, Gaussian smoothing method

developed in chapter 3 and chapter 4 achieved a better performance than this dynamic

programming algorithm. So, BLISS took the Gaussian smoothing method, instead of this

dynamic programming algorithm, for the web implementation.















APPENDIX C
AN IMPROVED ALGORITHM FOR BLISS2

In the current algorithm for BLISS2, we chose to take the sum of G~scores over all

TFs at each matched position when BLISS_score was calculated. The performance of

BLISS2 was improved (Table C-1 and C-2) when was instead took the maximum of

G scores at each matched location.

Table C-1. Performance test of the improved BLISS2 for M~score cutoff=0.75. False
Positives (FP) and True Positives (TP) are listed under different window size
and BLISS score cutoff.
Window BLISS score cutoff
size Tye 0.27 0.26 0.25 0.24 0.23 0.22 0.21
FP 2565 3748 5272 7180
100
TP 1045 1174 1307 1434
FP 262 663 2412 6496
150
TP 1185 1614 1956 2157
FP 40 92 416 2157 6096
200
TP 724 1165 1686 2062 2167
FP 6 23 75 617 3277
250
TP 352 699 1180 1736 2000



Table C-2. Performance test of the improved BLISS2 for M~score cutoff=0.8. False
Positives (FP) and True Positives (TP) are listed under different window size
and BLISS score cutoff.
Window BLISS score cutoff
size Tye 0.28 0.27 0.26 0.25 0.24 0.23 0.22
FP 1779 2968 5489
100
TP 1373 1602 1794
FP -288 575 1412 3572 6805
150
TP -1526 1822 2078 2263 2340
FP -46 68 185 686 2395
200
TP -1034 1374 1734 2063 2259
FP 16 36 142 684 2527
250
TP 851 1222 1622 1967 2190









The method to calculate BLISS_score for the improved algorithm is the same as the

algorithm of BLIS S2 except that the PreBLIS S_score was calculated as below:

Pr e BLISS score[i, j] = MaxinaunOf(G scorel[i,nz]*"G score[ j,nz]) for m

= 0, 1, 2 ..., where m is the index of the TFBS.

The improvement was more dramatic for the second data set that is used in chapter

4, which had longer backbone sequences. Table C-3 displays the performance

comparison between the two algorithms under the condition where the window size is

200 and the binding score cutoff is 0.75.

Table C-3. Performance comparison of two BLISS2 algorithms for second data set when
M~score cutoff is 0.75. False Positives (FP) and True Positives (TP) are listed
under different BLISS score cutoff.
Algorithm Type BLISS scores
2.9 2.8 2.7 2.6 2.5
BLISS2 FP 33 111 359 1014 2778
TP 340 458 611 804 1029
0.25 0.245 0.24 0.235 0.23
Improved
FP 27 79 217 861 2953
BLISS2
TP 748 937 1198 1431 1653

This improvement may reflect the fact that only one TF can exclusively bind to a

location in the real biological environment and the performance of the BLISS2 algorithm

could be compromised due to the repeated contributions of overlapped TFB Ss to the

BLISS_score. However, the results included more false positives when this improved

algorithm was applied to S2E example. Considering S2E is the only real world example

we have so far, more researches need to be continued and more real world examples are

required for the further testing of this improved algorithm for BLISS2.

The BLISS_score distribution for the improved algorithm was calculated using the

same method as described in chapter 3 and are shown in Figure C-1 and Figure C-2.











0.14

0.12

0.1



. 0.08


0.04 C


O 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45
BLISS score


Figure C-1. Distribution of BLISS~score for the improved BLISS2 algorithm with the
M score cut off value of 0.75.


0.12
-Module matches with random sequence
Module matches with module

0.1 -














U 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5
BLISS score


Figure C-2. Distribution of BLISS~score for the improved BLISS2 algorithm with the
M score cut off value of 0.8.


-Module matches with random sequence
- .. Module matches with module




- '

- ,
















LIST OF REFERENCES


Aerts S, Thij s G, Coessens B, Staes M, Moreau Y, De Moor B: Toucan: deciphering the
cis-regulatory logic of coregulated genes. Nucleic Acids Res 2003, 31(6): 1753-
1764.

Aerts S, Van Loo P, Thij s G, Mayer H, de Martin R, Moreau Y, De Moor B: TOUCAN
2: the all-inclusive open source workbench for regulatory sequence analysis.
Nucleic Acids Res 2005, 33(Web Server issue):W393-396.

Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search
tool. J2~olBiol 1990, 215:403-410.

Andrioli LP, Vasisht V, Theodosopoulou E, Oberstein A, Small S: Anterior repression
of a Drosophila stripe enhancer requires three position-specific mechanisms.
Development 2002, 129(21):4931-4940.

Arnosti DN: Design and function of transcriptional switches in Drosophila. Insect
Biochem Mol1 Biol 2002, 32(10):1257-1273.

Arnosti DN, Barolo S, Levine M, Small S: The eve stripe 2 enhancer employs multiple
modes of transcriptional synergy. Development 1996, 122(1):205-214.

Bailey TL, Elkan C: The value of prior knowledge in discovering motifs with MEME.
Proc Int Conflntell Syst Mol1 Biol 1995, 3:21-29.

Bailey TL, Gribskov M: Methods and statistics for combining motif match scores. J
Comput Biol 1998, 5(2):21 1-221.

Bailey TL, Gribskov M: Concerning the accuracy of MAST E-values. Bioinformatics
2000, 16(5):488-489.

Benos PV, Lapedes AS, Stormo GD: Probabilistic code for DNA recognition by
proteins of the EGR family. J2~ol Biol 2002, 323(4):701-727.

Berman BP, Nibu Y, Pfeiffer BD, Tomancak P, Celniker SE, Levine M, Rubin GM,
Eisen MB: Exploiting transcription factor binding site clustering to identify
cis-regulatory modules involved in pattern formation in the Drosophila
genome. Proc Natl AcadSci USA 2002, 99(2):757-762.









Berman BP, Pfeiffer BD, Laverty TR, Salzberg SL, Rubin GM, Eisen MB, Celniker SE:
Computational identification of developmental enhancers: conservation and
function of transcription factor binding-site clusters in Drosophila
melanogaster and Drosophila pseudoobscura. Genome Biol 2004, 5(9):R61.

Boulikas T: A compilation and classification of DNA binding sites for protein
transcription factors from vertebrates. Crit Rev Eukaryot Gene Expr 1994, 4(2-
3):1 17-321.

Bray N, Dubchak I, Pachter L: AVID: A global alignment program. Genome Res 2003,
13(1):97-102.

Bucher P: Weight matrix descriptions of four eukaryotic RNA polymerase II
promoter elements derived from 502 unrelated promoter sequences. J2~olBiol
1990, 212(4):563-578.

Bulyk ML, Johnson PL, Church GM: Nucleotides of transcription factor binding sites
exert interdependent effects on the binding affinities of transcription factors.
Nucleic Acids Res 2002, 30(5): 1255-1261.

Che D, Jensen S, Cai L, Liu JS: BEST: binding-site estimation suite of tools.
Bioinformatics 2005, 21(12):2909-2911.

Chekmenev DS, Haid C, Kel AE: P-Match: transcription factor binding site search by
combining patterns and weight matrices. Nucleic Acids Res 2005, 33(Web
Server issue):W432-437.

Chen QK, Hertz GZ, Stormo GD: MATRIX SEARCH 1.0: a computer program that
scans DNA sequences for transcriptional elements using a database of weight
matrices. Comput ApplBiosci 1995, 11(5):563-566.

Cora D, Di Cunto F, Provero P, Silengo L, Caselle M: Computational identification of
transcription factor binding sites by functional analysis of sets of genes sharing
overrepresented upstream motifs. BM~C Bioinformatics 2004, 5:57.

Day WH, McMorris FR: Critical comparison of consensus methods for molecular
sequences. Nucleic Acids Res 1992, 20(5): 1093-1099.

Elkon R, Linhart C, Sharan R, Shamir R, Shiloh Y: Genome-wide in silico
identification of transcriptional regulators controlling the cell cycle in human
cells. Genome Res 2003, 13(5):773-780.

Erives A, Levine M: Coordinate enhancers share common organizational features in
the Drosophila genome. Proc NatlAcad Sci US A 2004, 101(1 1):3 85 1-3 856.

Fickett JW: Coordinate positioning of MEF2 and myogenin binding sites. Gene 1996,
172(1):GC19-32.









Frazer KA, Elnitski L, Church DM, Dubchak I, Hardison RC: Cross-species sequence
comparisons: a review of methods and available resources. Genome Res 2003,


Frech K, Herrmann G, Werner T: Computer-assisted prediction, classification, and
delimitation of protein binding sites in nucleic acids. Nucleic Acids Res 1993,
21(7):1655-1664.

Frech K, Quandt K, Werner T: Finding protein-binding sites in DNA sequences: the
next generation. Trends Biochem Sci 1997, 22(3):103-104.

Fried M, Crothers DM: Equilibria and kinetics of lac repressor-operator interactions
by polyacrylamide gel electrophoresis. Nucleic Acids Res 1981, 9(23):6505-6525.

Frith MC, Hansen U, Weng Z: Detection of cis-element clusters in higher eukaryotic
DNA. Bioinformatics 2001, 17(10):878-889.

Galas DJ, Schmitz A: DNAse footprinting: a simple method for the detection of
protein-DNA binding specificity. Nucleic Acids Res 1978, 5(9):3157-3170.

Garner MM, Revzin A: A gel electrophoresis method for quantifying the binding of
proteins to specific DNA regions: application to components of the Escherichia
coli lactose operon regulatory system. Nucleic Acids Res 1981, 9(13):3047-3060.

Gershenzon NI, Stormo GD, loshikhes IP: Computational technique for improvement of
the position-weight matrices for the DNA/protein binding sites. Nucleic Acids Res
2005, 33(7):2290-2301.

Grad YH, Roth FP, Halfon MS, Church GM: Prediction of similarly acting cis-
regulatory modules by subsequence profiling and comparative genomics in
Drosophila melanogaster and D.pseudoobscura. Bioinformatics 2004,
20(16):2738-2750.

Halfon MS, Michelson AM: Exploring genetic regulatory networks in metazoan
development: methods and models. Physiol Genomics 2002, 10(3):131-143.

Hannenhalli S, Levy S: Predicting transcription factor synergism. Nucleic Acids Res
2002, 30(19):4278-4284.

Heinemeyer T, Wingender E, Reuter I, Hermj akob H, Kel AE, Kel OV, Ignatieva EV,
Ananko EA, Podkolodnaya OA, Kolpakov FA et al: Databases on transcriptional
regulation: TRANSFAC, TRRD and COMPEL. Nucleic Acids Res 1998,
26(1):362-367.

Hernandez-Munain C, Krangel MS: Regulation of the T-cell receptor delta enhancer
by functional cooperation between e-Myb and core-binding factors. Mol1 Cell
Biol 1994, 14(1):473-483.










Hu Z, Frith M, Niu T, Weng Z: SeqVISTA: a graphical tool for sequence feature
visualization and comparison. BM~C Bioinformatics 2003, 4: 1.

Hu Z, Fu Y, Halees AS, Kielbasa SM, Weng Z: SeqVISTA: a new module of
integrated computational tools for studying transcriptional regulation. Nucleic
Acids Res 2004, 32(Web Server issue):W23 5-241.

Huang H, Kao MC, Zhou X, Liu JS, Wong WH: Determination of local statistical
significance of patterns in Markov sequences with application to promoter
element identification. J Comput Biol 2004, 11(1): 1-14.

Jegga AG, Sherwood SP, Carman JW, Pinski AT, Phillips JL, Pestian JP, Aronow BJ:
Detection and visualization of compositionally similar cis-regulatory element
clusters in orthologous and coordinately controlled genes. Genome Res 2002,
12(9):1408-1417.

Johansson O, Alkema W, Wasserman WW, Lagergren J: Identification of functional
clusters of transcription factor binding motifs in genome sequences: the
MSCAN algorithm. Bioinformatics 2003, 19 Suppl 1:il69-176.

Johnson PF, McKnight SL: Eukaryotic transcriptional regulatory proteins. Annu Rev
Biochem 1989, 58:799-839.

Kel AE, Gossling E, Reuter I, Cheremushkin E, Kel-Margoulis OV, Wingender E:
MATCH: A tool for searching transcription factor binding sites in DNA
sequences. Nucleic Acids Res 2003, 31(13):3 576-3 579.

Kel AE, Kolchanov NA, Kel OV, Romashchenko AG, Anan'ko EA, Ignat'eva EV,
Merkulova TI, Podkolodnaia OA, Stepanenko 1L, Kochetov AV et al: [TRRD: a
database of transcription regulatory regions in eukaryotic genes]. Mol1Biol
(M~osk) 1997, 31(4):626-636.

Kel AE, Kondrakhin YV, Kolpakov Ph A, Kel OV, Romashenko AG, Wingender E,
Milanesi L, Kolchanov NA: Computer tool FUNSITE for analysis of eukaryotic
regulatory genomic sequences. Proc Int Conflntell Syst Mol1 Biol 1995, 3: 197-
205.

Kel OV, Romaschenko AG, Kel AE, Wingender E, Kolchanov NA: A compilation of
composite regulatory elements affecting gene transcription in vertebrates.
Nucleic Acids Res 1995, 23(20):4097-4103.

Kel-Margoulis OV, Kel AE, Reuter I, Deineko IV, Wingender E: TRANSCompel: a
database on composite regulatory elements in eukaryotic genes. Nucleic Acids
Res 2002, 30(1):332-334.










Kel-Margoulis OV, Romashchenko AG, Kolchanov NA, Wingender E, Kel AE:
COMPEL: a database on composite regulatory elements providing
combinatorial transcriptional regulation. Mecleic Acids Res 2000, 28(1):311-
315.

Knuppel R, Dietze P, Lehnberg W, Frech K, Wingender E: TRANSFAC retrieval
program: a network model database of eukaryotic transcription regulating
sequences and proteins. J Comput Biol l994, 1(3): 191-198.

Kolchanov NA, Ananko EA, Podkolodnaya OA, Ignatieva EV, Stepanenko IL, Kel-
Margoulis OV, Kel AE, Merkulova TI, Goryachkovskaya TN, Busygina TV et al:
Transcription Regulatory Regions Database (TRRD):its status in 1999. Micleic
Acids Res 1999, 27(1):3 03-306.

Kolchanov NA, Podkolodnaya OA, Ananko EA, Ignatieva EV, Stepanenko 1L, Kel-
Margoulis OV, Kel AE, Merkulova TI, Goryachkovskaya TN, Busygina TV et al:
Transcription regulatory regions database (TRRD): its status in 2000. Micleic
Acids Res 2000, 28(1):298-301.

Kondrakhin YV, Kel AE, Kolchanov NA, Romashchenko AG, Milanesi L: Eukaryotic
promoter recognition by binding sites for transcription factors. Comput Appl
Biosci 1995, 11(5):477-488.

Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF, Wootton JC: Detecting
subtle sequence signals: a Gibbs sampling strategy for multiple alignment.
Science 1993, 262(5131):208-214.

Lawrence CE, Reilly AA: An expectation maximization (EM) algorithm for the
identification and characterization of common sites in unaligned biopolymer
sequences. Proteins 1990, 7(1):41-51.

Lee TI, Rinaldi NJ, Robert F, Odom DT, Bar-Joseph Z, Gerber GK, Hannett NM,
Harbison CT, Thompson CM, Simon I et al: Transcriptional regulatory
networks in Saccharomyces cerevisiae. Science 2002, 298(5594):799-804.

Lemaigre FP, Lafontaine DA, Courtois SJ, Durviaux SM, Rousseau GG: Spl can
displace GHF-1 from its distal binding site and stimulate transcription from
the growth hormone gene promoter. M01 Cell Biol 1990, 10(4): 1811-1814.

Lenhard B, Sandelin A, Mendoza L, Engstrom P, Jareborg N, Wasserman WW:
Identification of conserved regulatory elements by comparative genome
analysis. JBiol 2003, 2(2): 13.

Lewis H, Kaszubska W, DeLamarter JF, Whelan J: Cooperativity between two NF-
kappa B complexes, mediated by high-mobility-group protein I(Y), is essential
for cytokine-induced expression of the E-selectin promoter. M>01 Cell Biol 1994,
14(9):5701-5709.









Liu X, Brutlag DL, Liu JS: BioProspector: discovering conserved DNA motifs in
upstream regulatory regions of co-expressed genes. Pac Symp Biocomput
2001:127-138.

Liu Y, Liu XS, Wei L, Altman RB, Batzoglou S: Eukaryotic regulatory element
conservation analysis and identification using comparative genomics. Genome
Res 2004, 14(3):451-458.

Liu Y, Wei L, Batzoglou S, Brutlag DL, Liu JS, Liu XS: A suite of web-based
programs to search for transcriptional regulatory motifs. Nucleic Acids Res
2004, 32(Web Server issue):W204-207.

Loots GG, Oveharenko I: rVISTA 2.0: evolutionary analysis of transcription factor
binding sites. Nucleic Acids Res 2004, 32(Web Server issue):W217-221.

Loots GG, Oveharenko I, Pachter L, Dubchak I, Rubin EM: r~rista for comparative
sequence-based discovery of functional transcription factor binding sites.
Genome Res 2002, 12(5):832-839.

Ludwig MZ: Functional evolution of noncoding DNA. Curr Opin Genet Dev 2002,
12(6):634-639.

Ludwig MZ, Palsson A, Alekseeva E, Bergman CM, Nathan J, Kreitman M: Functional
evolution of a cis-regulatory module. PLoS Biol 2005, 3(4):e93.

Ludwig MZ, Patel NH, Kreitman M: Functional analysis of eve stripe 2 enhancer
evolution in Drosophila: rules governing conservation and change.
Development 1998, 125(5):949-958.

Maniatis T, Goodbourn S, Fischer JA: Regulation of inducible and tissue-specific gene
expression. Science 1987, 236(4806):1237-1245.

Marco-Ferreres R, Vivar J, Arredondo JJ, Portillo F, Cervera M: Co-operation between
enhancers modulates quantitative expression from the Drosophila
Paramyosin/miniparamyosin gene in different muscle types. M~ech Dev 2005,
122(5):681-694.

Matys V, Fricke E, Geffers R, Gossling E, Haubrock M, Hehl R, Hornischer K, Karas D,
Kel AE, Kel-Margoulis OV et al: TRANSFAC: transcriptional regulation, from
patterns to profiles. Nucleic Acids Res 2003, 31(1):374-378.

Mayor C, Brudno M, Schwartz JR, Poliakov A, Rubin EM, Frazer KA, Pachter LS,
Dubchak I: VISTA : visualizing global DNA sequence alignments of arbitrary
length. Bioinformatics 2000, 16(11):1046-1047.

Meng H., Banerjee A, Zhou L: BLISS: biding site level identification of shared
signal-modules in DNA regulatory sequences. BMC Bioinformatics 2006,7: 287.









Muhlethaler-Mottet A, Di Berardino W, Otten LA, Mach B: Activation of the MHC
class II transactivator CIITA by interferon-gamma requires cooperative
interaction between Stat1 and USF-1. Inanunity 1998, 8(2):157-166.

Osada R, Zaslaysky E, Singh M: Comparative analysis of methods for representing
and searching for transcription factor binding sites. Bioinfornzatics 2004,
20(18):3516-3525.

Powell J: Progress and prospects in evolutionary biology: The Drosophila model.
Oxford: Oxford University Press; 1997.

Prestridge DS: Predicting Pol II promoter sequences using transcription factor
binding sites. J2~olBiol 1995, 249(5):923-932.

Prestridge DS: SIGNAL SCAN 4.0: additional databases and sequence formats.
Comput ApplBiosci 1996, 12(2):157-160.

Qiu P: Recent advances in computational promoter analysis in understanding the
transcriptional regulatory network. Biochent Biophys Res Conanun 2003,
309(3):495-501.

Qiu P, Ding W, Jiang Y, Greene JR, Wang L: Computational analysis of composite
regulatory elements. 2Mann Genome 2002, 13(6):327-332.

Quandt K, Frech K, Karas H, Wingender E, Werner T: MatInd and MatInspector: new
fast and versatile tools for detection of consensus matches in nucleotide
sequence data. Mecleic Acids Res 1995, 23(23):4878-4884.

Rajewsky N, Vergassola M, Gaul U, Siggia ED: Computational detection of genomic
cis-regulatory modules applied to body patterning in the early Drosophila
embryo. BM~C Bioinfornzatics 2002, 3:30.

Rebeiz M, Reeves NL, Posakony JW: SCORE: a computational approach to the
identification of cis-regulatory modules and target genes in whole-genome
sequence data. Site clustering over random expectation. Proc Natl AcadSci US
A 2002, 99(15):9888-9893.

Ringrose L, Rehmsmeier M, Dura JM, Paro R: Genome-wide prediction of
Polycomb/Trithorax response elements in Drosophila melanogaster. Dev Cell
2003, 5(5):759-771.

Sandelin A, Alkema W, Engstrom P, Wasserman WW, Lenhard B: JASPAR: an open-
access database for eukaryotic transcription factor binding profiles. Micleic
Acids Res 2004, 32(Database issue):D91-94.

Sandelin A, Wasserman WW, Lenhard B: ConSite: web-based prediction of regulatory
elements using cross-species comparison. Mucleic Acids Res 2004, 32(Web
Server issue):W249-252.




Full Text

PAGE 1

A NOVEL METHODOLOGY FOR IDENIT FYING CONSERVED REGULATORY MODULES AT THE BINDING SITE LEVEL By HAILONG MENG A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLOR IDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY UNIVERSITY OF FLORIDA 2006

PAGE 2

Copyright 2006 by Hailong Meng

PAGE 3

To my dearly loved wife, Haoxian, with whom enjoyable and difficu lt times were shared throughout this journey, and to my parents, Zhanfu and Xioumei, for their love and support through my life.

PAGE 4

iv ACKNOWLEDGMENTS I would like to express my sincere thanks and apprecia tion to my advisors, Dr. Arunava Banerjee and Dr. Lei Zhou, for th eir invaluable guidance, support, and encouragement throughout my graduate research and study at the University of Florida. My accomplishments could not have been achieved without their support. I would also like to than k my committee members, Drs. Anand Rangarajan, Sam Wu and Tamer Kahveci, for their insightful comments and helpful suggestions. Finally, my appreciation goes to all my friends at the University of Florida, especially those at Dr. Banerjee and Dr. Zhous lab, for their helpful suggestions, discussions and friendships.

PAGE 5

v TABLE OF CONTENTS page ACKNOWLEDGMENTS.................................................................................................iv LIST OF TABLES...........................................................................................................viii LIST OF FIGURES...........................................................................................................ix ABSTRACT.......................................................................................................................xi CHAPTER 1 INTRODUCTION........................................................................................................1 2 LITERATURE REVIEW.............................................................................................4 Computational Representation of TFBSs.....................................................................4 TFBSs Databases..........................................................................................................6 Computational Identification and St atistical Evaluation of TFBSs..............................8 Computational Identif ication of TFBSs................................................................8 Statistical Evaluation of TFBSs...........................................................................11 Cis-Regulatory Modules (CRMs) and Computational Prediction of CRMs..............12 Cis-Regulatory Modules (CRMs)........................................................................12 Computational Identif ication of CRMs...............................................................14 Cross-species Conservation of Cis-Regulatory Modules (CRMs)......................15 Computational Prediction of CisRegulatory Modules (CRMs) by Crossspecies Genome Comparison...........................................................................17 3 THE FEASIBILITY STUDY OF CROSS-GENOME COMPARISON AT THE BINDING SITE LEVEL............................................................................................21 Background.................................................................................................................21 Results........................................................................................................................ .25 Simulating Sequence Pairs Harboring a Conserved Module of Binding Sites...25 Identifying a Conserved Module by Comp arison at the Binding Site Level......28 Distribution and Statistical Ev aluation of the BLISS_score...............................31 Identifying a Conserved Regulatory M odule in Distantly Related Species........34 Implementation of BLISS as a Web-based Service............................................38 Discussion...................................................................................................................40 Conclusions.................................................................................................................42

PAGE 6

vi Methods......................................................................................................................43 Generating Simulated Sequences........................................................................43 Binding Site Identification..................................................................................44 P Value for M_score............................................................................................45 Gaussian Smoothing............................................................................................46 Searching the Conserved M odule in the Long Sequence....................................46 Large-scale Search of the Simulated Sequences, Statistical Analysis................47 Searching for the eve2 Module in D. virilis and D. mojavanis Sequences.........47 Website Construction..........................................................................................48 4 BINDING-SITE LEVEL IDENTIFICA TION OF SHARED REGULATORY MODULES IN TWO ORTHOLOGOUS SEQUENCES...........................................49 Background.................................................................................................................49 Results........................................................................................................................ .51 Simulating Sequence Pairs with Varyi ng Complexity of Conserved Binding Site Patterns.....................................................................................................51 Identifying Conserved Regulatory Co mplexes by Comparison at the Bindingsite Level..........................................................................................................52 Testing BLISS2 in Real-world Examples...........................................................56 Website Implementation of BLISS2...................................................................60 Discussion...................................................................................................................61 Conclusions.................................................................................................................62 Methods......................................................................................................................62 Generating Simulated Sequences........................................................................62 Identifying Conserved Regulatory Co mplexes by Comparison at the Bindingsite Level..........................................................................................................64 Performance Testing Using Simulated Sequences..............................................65 Searching for the eve2 Regulatory Comple x in D. virilis and D. mojavanis......66 Website Construction..........................................................................................66 5 BLISS 2.0: THE WEB-BASED TOOL FOR PREDICTING CONSERVED REGULATORY MODULES IN DISTANTLY-RELATED ORTHOLOGOUS SEQUENCES.............................................................................................................67 6 CONCLUSIONS AND SUGGESTIONS FOR FUTURE STUDY...........................74 APPENDIX A PARAMETER OPTIMIZATIONS FOR GAUSSIAN SMOOTHING METHOD...76 B A DYNAMIC PROGRAMMING ALGORITHM FOR IDENTIFYING CONSERVED REGUL ATORY MODULES............................................................77 C AN IMPROVED ALGORITHM FOR BLISS2.........................................................79 LIST OF REFERENCES...................................................................................................82

PAGE 7

vii BIOGRAPHICAL SKETCH.............................................................................................92

PAGE 8

viii LIST OF TABLES Table page 2-1. IUPAC consensus characters.......................................................................................4 2-2. Statistics of TRANSFAC.............................................................................................6 2-3. Matrices in TRAN SFAC Professional r9.1.................................................................7 4-1. Simulated data sets.....................................................................................................52 4-2. Performance test of BLISS2 for M_score cutoff=0.75.............................................54 4-3. Performance test of BLISS2 for M_score cutoff=0.8...............................................55 C-1. Performance test of the impr oved BLISS2 for M_score cutoff=0.75.......................79 C-2. Performance test of the impr oved BLISS2 for M_score cutoff=0.8.........................79 C-3. Performance comparison of two BLISS2 algorithms for second data set when M_score cutoff is 0.75..............................................................................................80

PAGE 9

ix LIST OF FIGURES Figure page 1-1. Transcription and translation.......................................................................................1 1-2. Transcription is controll ed by transcription factors.....................................................2 2-1. Frequency matrix of transcrip tion factor v-Myb from TRANSFAC.......................5 2-2. The known TFBSs in S2E.........................................................................................15 2-3. S2E of D. Mel and D. Pse alignment using CLUSTALW........................................16 3-1. Simulating conserved regulatory modul es in a pair of random sequences................27 3-2. Gaussian smoothing of the TFBS profile..................................................................29 3-3. Identifying a conserved m odule at the binding site level..........................................32 3-4. Identifying the eve S2E module in distantly related species.....................................35 3-5. Inter-TFBSs distances are very well conserved within each clause of the S2E module......................................................................................................................37 3-6. Searching for conserved individual clauses / element modules using BLISS...........39 3-7. BLISS output of the contributing TFBSs..................................................................40 4-1. The effects of length of analyzed sequences on the performance of BLISS2...........56 4-2. Regulatory complexes in S2E of D.mela and D.pseu................................................57 4-3. Identification S2E in 6 kb D.viri sequence using the S2E of D.pseu........................58 4-4. Regulatory complexes of S2E in D.viri and D.moja.................................................59 5-1. Web interface of BLISS 2.0. This is the homepage of BLISS 2.0............................69 5-2. The color plot of BLISS_score..................................................................................70 5-3. Statistical analysis and distribution of BLISS_score.................................................71

PAGE 10

x 5-4. Output of contributing TFBSs. .................................................................................72 A-1. Parameter optimizations for Gaussian smoothing method used in BLISS...............76 C-1. Distribution of BLISS_score for th e improved BLISS2 algorithm with the M_score cut_off value of 0.75.................................................................................81 C-2. Distribution of BLISS_score for th e improved BLISS2 algorithm with the M_score cut_off value of 0.8...................................................................................81

PAGE 11

xi Abstract of Dissertation Pres ented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy A NOVEL METHODOLOGY FOR IDENIT FYING CONSERVED REGULATORY MODULES AT THE BINDING SITE LEVEL By Hailong Meng August 2006 Chair: Arunava Banerjee Cochair: Lei Zhou Major Department: Computer Inform ation and Science and Engineering Regulatory modules are segments of the DNA that control particular aspects of gene expression. Their identification is ther efore of great importa nce to the field of molecular genetics. Each module is composed of a distinct set of binding sites for specific transcription factors. Since experimental identificat ion of regulatory modules is an arduous process, accurate computational techniques that supplem ent this process can be very beneficial. Functional modules are unde r selective pressure to be evolutionarily conserved. Most current approaches theref ore attempt to detect conserved regulatory modules through similarity comparisons at the DNA sequence level. However, some regulatory modules, despite th e conservation of their res ponsible binding sites, are embedded in sequences that have little overall similarity. We hypothesize that it will be more effici ent and make more biological sense to perform the cross-genome comparison at the bi nding site level since the evolutionary pressure is on the binding site patterns that determine the gene re gulation. In this study,

PAGE 12

xii we developed a novel methodology that det ects conserved regulatory modules via comparisons at the binding site level. Th e methodology compares the binding site profiles of orthologs and identifies those se gments that have similar (not necessarily identical) profiles. The similarity measure is based on transformed profiles, which takes into consideration the p values of binding site s as well as the potential shift of binding site positions. Furthermore, statistical analys is was performed to ev aluate the accuracy of the similarity measure. We used simulate d sequence pairs for the optimization of the methodology and the methodology was then verified on real word sequences. Our research demonstrates that, for sequences with little overall similarity at the DNA sequence level, it is possibl e to identify conserved regulatory modules based solely on binding site profiles. This methodology has been implemented as a web-based tool, BLISS, for the research community.

PAGE 13

1 CHAPTER 1 INTRODUCTION One of the central processes in the life of a cell is the expression of the information encoded in its DNA. The initial stage of th is expression is the synthesis of messenger RNA (mRNA) from the DNA template, a proces s referred to as tran scription (Figure 11). Figure 1-1. Transcript ion and translation. One of the major challenges in molecula r genetics is to understand the precise mechanism via which gene expression is regu lated. The regulation of gene expression at the level of transcription is a very comp lex process and is controlled by DNA binding proteins called transcripti on factors (TFs). As shown in Figure 1-2, the initiation of transcription begins with th e binding of RNA polymerase II complex to the promoter, a region right in front of the firs t exon of the gene. The level (i .e., rate) of transcription is controlled by bindings between TFs and corresponding short DNA sequences, Transcription Factor Binding Sites (TFBSs). The size of a typical binding site ranges between 5 and 25 base pairs (bps) (Boulik as 1994; Hernandez-Munain and Krangel 1994). It is possible that the binding of a single TF to TFBS can affect the rate of transcription, however, in higher eukaryotes, it often takes multiple TFBSs to group together as a regulatory module to determine the precise temporal and spatial expression DNA mRNA PROTEIN Transcri p tion Translation

PAGE 14

2 of the genes. Despite the im portance of regulatory modules, our abilities to identify and characterize them are stil l very limited (Qiu 2003). Figure 1-2. Transcription is cont rolled by transcription factors. While laboratory identification of regulat ory modules is feasible (Galas and Schmitz 1978; Fried and Crothers 1981; Garn er and Revzin 1981), the process is in general expensive, time-consuming and ar duous. Accurate computational approaches therefore promise to be very beneficial as a supplement to the traditional laboratory experiment since they can lead biologists to a more efficient determination of the regulatory elements. Functional regulatory modules are under selective pressure and tend to be evolutionarily conserved. Most current appr oaches therefore attemp t to detect conserved regulatory modules through similarity compar isons at the DNA sequence level. However, some regulatory modules, despite the conserva tion of their responsible binding sites, are embedded in sequences that have little overa ll similarity. It theref ore makes biological

PAGE 15

3 sense to perform the cross-genome comparis on at the binding site level since the evolutionary pressure is direct ed at the binding site level. The objectives of this thesis are 1. To test the feasibility of cross-genome comparison at the binding site level. 2. To develop a novel methodology that det ects conserved regulatory modules via comparisons at the binding site level. 3. To implement the methodology as a webbased tool that can be publicly accessed by the research community. The remaining chapters are organized as follows: Chapter 2 conducts a literature review, in which background material and related research in the field are presented. In chapter 3,we demonstrate that it is feasible to perfor m the cross-species comparison at the binding site level and that conserved regulatory modules can be predicted in diverged sequences without de tectable DNA sequence similarity. However, this comes with the limitation that we know th at a conserved module ex ists in the shorter of the two sequences being compared. In chapter 4, based on the research results in chapte r 3, we extend the methodology to a more general case, i. e. two sequences without th e prior knowledge about the locations of the re gulatory modules. In chapter 5, we focus on BLISS, the web implementation of this methodology. Chapter 6 presents conclusions and suggestions for future research.

PAGE 16

4 CHAPTER 2 LITERATURE REVIEW Computational Representation of TFBSs Most eukaryotic TFs can bind to multiple binding sites and tolerate considerable sequence variation in their binding sites. The computational analysis of regulatory regions in genome sequences is based on th e prediction of potential TFBSs, which depends on the quality of the models that represent the binding specificity of the transcription factors. These models are based upon a statistical representation of experimentally determined binding sites for a particular TF. IUPAC consensus code was first used to re present this degenerate feature of TFBSs by Day and McMorris (1992). They used a string such as RCCY to represent the combination of string ACCT, ACCC, GCCT, and GCCT. IUPAC consensus characters are listed in Table 2-1. Table 2-1. IUPAC consensus characters. IUPAC consensus character in the motif Matching base in the DNA sequence A A T T G G C C R A or G Y T or C W A or T S G or C M A or C K G or T B T or G or C H A or T or C V A or G or C D A or G or T N A or T or G or C

PAGE 17

5 However, the disadvantage of IUPAC cons ensus sequence is that it does not take into account the frequency at which each nuc leotide occurs at each position of a TFBS and therefore cannot represent a TFBS quantitatively. It disr egards too much information originally present in the set of TFBSs (Quandt et al. 1995). Compared to IUPAC, matrix based repres entations (including frequency matrices and Position Weight Matrices PWMs) reta in more information and are quantitative representations of TFBSs. In this case, the binding specificity of each position is described by the elements of the matrix. Co mparative analysis (O sada et al. 2004) has shown that matrices are a much better way to re present TFBSs. Each matrix is of size 4 X k, where the rows correspond to each of the nucleotides and the columns represent the base preference for each position of the binding site. For a frequency matrix, each column gives the nucleotide distributi on at that position. For instan ce, Figure 2-1 is the frequency matrix of transcription fact or v-Myb from TRANSFAC Prof essional r9.1 (Matys et al. 2003), which was obtained from 24 experiment ally determined binding sites. Each element in the matrix represents how many tim es a specific nucleot ide was observed at a certain position. The last column in the matr ix is the IUPAC consensus sequence for the transcription factor. Position A C G T 01 13 4 4 3 A 02 14 3 5 2 A 03 2 7 1 14 T 04 23 0 0 1 A 05 24 0 0 0 A 06 1 23 0 0 C 07 0 1 17 6 G 08 1 1 21 1 G 09 11 5 4 4 N 10 12 5 2 5 A Figure 2-1. Frequency matrix of transc ription factor v-M yb from TRANSFAC.

PAGE 18

6 The PWM is derived from this frequency matrix and then the base preference at each position is estimated based on the nucleotide distribution (Bucher 1990; Stormo 2000). The quality of the matrix is dependent upon experimentally determined target sites. As of now, most matrices have been built based on binding sites through in-vitro target-site detection assays. But, it was rece ntly reported that in vivo functional TFBSs can be detected by ChIP-Chip (chromatin i mmunoprecipitation coupled with microarray chip hybridization) technique (Lee et al. 2002), which will greatly enhance both the quality as well as the coverage of the noted matrices. TFBSs Databases TRANSFAC, TRANSCOMPEL, TRRD and JASPAR are databases that store information about transcriptional re gulation in eukaryotic cells. TRANSFAC (Knuppel et al. 1994; Wingender et al. 1996; Wingender et al. 2000; Matys et al. 2003) is the leadi ng database that includes Transc ription Factors, their target genes and Transcription Factor Binding Sites, matrices of TFBSs and other information about transcriptional regulation. TRANSFA C represents the most comprehensive collection of TFBSs and summarizes them as matrices. All information collected in TRANSFAC is generated from published litera ture. Its content ha s been extending and improving over the last ten year s as shown in Table 2-2. Table 2-2. Statistics of TRANSFAC. TRANSFAC 1996 TRANSFAC 2000 TRANSFAC6.0 2003 TRANSFACr9.1 2005 Matrix FACTOR SITE GENE REFERENCE

PAGE 19

7 The newly released TRANFACT Professiona l r9.1 has a total of 753 matrices for bacteria, fungi, insects, nematodes, plants and vertebrates, as indicated in Table 2-3. Table 2-3. Matrices in TRANSFAC Professional r9.1. Matrix Number Matrix Number Without Duplication Bacteria 1 1 Fungi 56 44 Insects 50 40 Nematodes 7 5 Plants 84 66 Vertebrates 555 379 Total 753 535 TRANSFAC contains redundant sets of matrices of di verse quality for some TFBSs. It is a commercial resource and only an older version of TRANSFAC can be accessed by the public for free. JASPAR (Sandelin et al. 2004) is an openaccess database and its data can be unrestrictedly accessed for free. All matrices in JASPAR were derived from published collections of experimentally determined TFBSs. JASPAR claims that it has a nonredundant collection of reliable matrices. TRANSCOMPEL (Kel-Margoulis et al. 2000; Ke l-Margoulis et al. 2002) is a sister database of TRANSFAC and includes informa tion about composite regulatory elements. Each composite regulatory elemen t contains two closely situated binding sites of distinct transcription factors. There are two types of composite regulatory elements: synergistic and antagonistic. A synergistic composite re gulatory element result s in transcriptional activation, while in an antagonistic compos ite regulatory element, the two factors interfere with each other and repress transc ription. However, the effectiveness of TRANSCompel is limited by it small size.

PAGE 20

8 Transcription Regulatory Re gions Database (TRRD) (Kel et al. 1997; Heinemeyer et al. 1998; Kolchanov et al. 1999; Kolchanov et al. 2000) describes the structure of eukaryotic transcription regul atory regions. Each TRRD entry corresponds to an entire gene and binding sites are listed at the lowe st level of the hierarchy. However, this database is seldom referred to by current publications. Computational Identification and Statistical Evaluation of TFBSs Computational Identification of TFBSs Computational identification of TFBSs is very important to decode the regulation network of genes. Although some earlier methods of predicting TFBSs were based on consensus sequences (Kondrakhin et al. 1995; Prestridge 1995), most recently developed methods are based on matrices (Frech et al. 1997). Match is a weight-based tool to search for potential transcription factor binding sites in DNA sequences. Match is associated with TRANSFAC and it uses the matrix library collected in TRANSFAC to search fo r TFBSs. Because TFBSs may be detectable on either the forward or the backward stra nd, Match searches both strands of input DNA sequence for TFBSs. The algorithm uses two sc ores: the matrix similarity score (MSS) and the core similarity score (CSS). Both scores measure the similarity between the segments of the sequences under considerati on and the TFBS matric es. MSS is calculated using all positions of the matrix while CSS is calculated using the five core (most conserved) positions in the matrices only. Bo th MSS and CSS range from 0 to 1 where 1 means an exact match. MSS is calculated as follows (Kel et al. 2003): Min Max Min Current MSS

PAGE 21

9 L i b iif i I Current1 ,) ( L i if i I Min1 min) ( L i if i I Max1 max) ( } { ,) 4 ln( ) (G C T A B B i B if f i I i=1, 2, L where ib if, is the frequency of nucleotide B o ccurring at position i of the matrix (B{A,T,C,G}). min if is the frequency of the nucleotide that is the lowest in position i and max if is the frequency of the nucleotide that is the highest in position i of the matrix. The information vector ) ( i I reflects the extent of the conservation of i position in the matrix. The more conserved the position i is, the greater is the information vector value of that position. In this manner, for a highl y conserved position, a mismatch will receive a large penalty while a match will score high. On the other hand, less conserved positions contribute less than conserved positions whet her they be a match or a mismatch. This leads to better recognition of TFBSs compared to methods that do not use an information vector. Stormo (1998, 2000) and Stormo and Fields (1998) pointed out that logarithms of the base frequencies should be proportiona l to the binding energy and the information vector is therefore related to the average binding energy betw een transcription factors and binding sites. Hence, Match score is not just a statistical score, but is in direct relation to the protein-DNA binding energy.

PAGE 22

10 Match requires users to choose a score th reshold, which is based on a balance between specificity and sensitivity. When a high threshold is used, noise is reduced from the output and the false positives decrease. However, at the same time, some functional binding sites with low scores are ignored. Increased specificity causes decreased sensitivity. On the other hand, when a low th reshold is used, both false matches and true binding sites are reported. Increased sensitivit y causes decreased specificity. Hence, an ideal threshold should restrict false positives without losing too many true positives. One assumption of methods based on matrices is that positions in the binding site contribute additively and independently to th e total activity. However, recent biological experiments suggest that dependence exists among positions in TFBSs (Benos et al. 2002; Bulyk et al. 2002). More complicated models (Stormo et al. 1986; Lawrence and Reilly 1990; Lawrence et al. 1993; Zhang and Marr 1993; Zhou and Liu 2004; Gershenzon et al. 2005) are being considered. For instance, di -nucleotides are chosen as the elements of the matrix instead of single nucleotides. But, the limitation of these methods is that they need more prior info rmation and more training binding sites, which are currently not easy to obtain. So, these approaches are rarely used. Other available softwares to identify TFBS s include MatInspector (Quandt et al. 1995), ConsInspector (Frech et al. 1993), FUNSITE (Kel et al 1995), TFBIND (Tsunoda and Takagi 1999), SIGNAL SCAN (Prestridge 1996), MATRIX SEARCH (Chen et al. 1995), SeqVISTA (Hu et al. 2003; Hu et al. 2004), RSAT (van Helden 2003), BEST (Che et al. 2005) and P-Match (Chekmenev et al 2005). The primary pr oblem with existing methods for predicting TFBSs is that there tend to generate unacceptably large number of false positives and only a very small fraction of the predicted TFBSs are functional ones.

PAGE 23

11 Because of the large number of false positives, these methods are of li ttle practical use for predicting TFBSs in vivo. Statistical Evaluation of TFBSs A fundamental challenge for TFBSs-sear ching methods is how to reduce false positives and improve prediction accuracy. One of the reasons for false positives is that short sequences (when trying to identify short TFBSs) occur a lot by chance. Estimating the statistical significance of an identifi ed match is valuable and may increase the performance of TFBSs-searching methods. To calculate statistic al significance, a background model is necessary. Based on the background model, we will see how likely it is to see a score that is at least as good by chance. Fo rmally, the p-value of a score is defined as the probability of obtaining such a score or higher from the background model. A simple way of estimating the p-value of a score is via nave approximation. A large pool of subsequences is sampled from the background distribution, and the score of each subsequence is calculated based on the matrix. The p value is estimated by calculating the fraction of samples that have scores as high as that score. The primary drawback of nave approximation is that it needs huge number of samples to accurately estimate the p value. Normally, the number of samples should be two orders of magnitude larger than the inve rse of the actual p value. If there are not enough samples, small p values will be poorly estimated. Available samples and time complexity are major drawbacks of the nave approximation method. Independent and identically distributed (IID) background sequence model (Staden 1989) can be used to evaluate p value. The position p value of MAST (Motif Alignment and Search Tool) (Bailey and Gribskov 1998; Bailey and Gribskov 2000) is defined as the probability of observing such a match scor e or higher at a single, randomly chosen

PAGE 24

12 position in a random sequence, and it is obtai ned by calculating the cumulative density function based on the null hypothesis that each position in the sequence is independent and identically distributed (IID). It has been reported that neighboring nucleotide compositions can affect the binding between a transcripti on factor and its binding site. Local Markov Method (LMM) (Huang et al. 2004) incorporated the local genom ic context into the p-value-based scoring method and modeled the background sequences as a Markov chain. The p value for a candidate site in this method reflected simulta neously both the similari ty to its matrix and its difference from the local genomic context. LMM claimed that, compared to other current methods, it could reduce false positive errors by more than 50% without compromising sensitivity. Similarly, Thijs et al. (Thijs et al. 2001) use a higher-order background model to enhance the performance of their motif-finding algorithm. Although statistical anal ysis has been integrated into predictions of TFBSs, large amounts of false positives sti ll exist and accurate identifica tion of TFBSs is hard to achieve. Cis-Regulatory Modules (CRMs) and Computational Prediction of CRMs Cis-Regulatory Modules (CRMs) Genes are rarely regulated by a single TF. Abundant evidence shows that TFBSs occur in clusters rather th an in isolation (Maniatis et al. 1987; Johnson and McKnight 1989; Shaw 1992; Wang et al. 1993; Tjian and Maniatis 1994). Each gene has a unique set of TFs that are responsible for its spatia l and temporal expressi on. Transcription is initialized by the binding of RNA Polymera se II to the promoter region, but the expression level of the gene is controlled by the specific set of TFs. The binding of a TF

PAGE 25

13 to its TFBS may enhance or repress the bind ing of other TFs to their TFBSs (Weintraub et al. 1990; Fickett 1996; Muhl ethaler-Mottet et al. 1998). In higher eukaryotes, the re gulatory information is orga nized into modular units which contain multiple TFBSs over a few hundred base pairs of DNA sequence. Such a functional unit is often referred to as Ci s-Regulatory Modules (CRMs, we also use regulatory modules to refer to CRMs alternatel y). Various CRMs work together to offer the combinatorial regulation of gene transc ription in response to different conditions (Yuh et al. 1998). Genes are ty pically controlled by severa l CRMs, which are generally located within a few hundred base pairs fr om the promoter region, but may occur thousands or even tens of thousands of ba se pairs either downstream or upstream from the transcription start site. TFBSs in a CRM are co-localized with sp ecific distances apart on the genome. The distance between the binding sites within th e CRM may overlap (Lewis et al. 1994), lie adjacent to each other (Whitley et al. 1994) or may be dozens of base pairs apart (Lemaigree et al. 1990). Considering the comple xity of the identification of CRMs, some groups (Hannenhalli and Levy 2002) have starte d to work on TF pairs called composite regulatory elements. Kel et al. (Kel et al 1995) investigated 1 01 composite regulatory elements and found that the distance between these binding sites vary between complete overlap and 80 bps. The mean distance was found to be around 17 bps. Qiu et al. (Qiu et al. 2002) analyzed all the entries of composite elements in the TransCompel database (version 3.0) and they found that approximate ly 87% of the composite elements were within 50 bps distance from each other and about 65% were within 20 bps distance.

PAGE 26

14 Transcriptional regulation is mediated by CRMs involving both protein-DNA and protein-protein interaction. The immense false positive of current TFBS-prediction methods is most likely a result of attemp ting to predict TFBSs based only on proteinDNA interactions between TFs and TFBSs a nd ignoring their cont ext, interactions between/among TFs, etc. Computational Identification of CRMs It has been reported that high local density of TFBSs account for the specificity of their activities and is required for functional CRMs, which has been used as the basis for identifying novel CRMs (Frith et al. 2001; Berman et al. 2002; Halfon and Michelson 2002; Rajewsky et al. 2002; Rebeiz et al. 2002; Johansson et al. 2003; Berman et al. 2004). These methods define a cluster as a certain window size DNA sequence with certain number of known TFBSs, and then se arch for such clusters in the given DNA sequence to predict regulato ry regions. These methods, however, need prior knowledge of TFBSs, which is not available in general. With the development of microarray tec hnology, genes can be classified into different clusters based on their expressi on pattern. Genes in the same cluster are assumed to be co-regulated. Co-regulated genes are hypothesized to share common CisRegulatory Modules (CRMs), since they share the same regulatory response. It has been observed that functional TFBSs of ten appear several times in the regulatory region so that occurrences of relevant TFBSs are significan tly more than expected by chance. Hence, methods based on co-regulated genes comparison are to search for over-represented motifs in the collection of regulatory re gions (Bailey and Elkan 1995; Workman and Stormo 2000; Liu et al. 2001; Thijs et al. 2002; Elkon et al. 2003; Zh eng et al. 2003; Cora et al. 2004; Grad et al. 2004). High frequency of false positiv es is also a major problem

PAGE 27

15 with these methods. Furthermore, these met hods require a reliable set of co-regulated genes. But, it is also possible that the expression of one gene is triggered by the expression of another gene in the same cluste r and they are not rea lly co-regulated genes. Some groups have tried to conquer this pr oblem by annotating genes to specific Gene Ontology (GO) terms (Cora et al. 2004). Cross-species Conservation of Cis-Regulatory Modules (CRMs) Functional sequences including coding and functional noncoding sequences tend to evolve at a slower rate than non-functional sequences. Noncoding sequences that play an important role in regulation are assumed to su rvive natural selection for longer periods of time. It has been shown that CRMs are c onserved across species (Frazer et al. 2003; Ludwig et al. 2005). Figure 2-2. The known TFBSs in S2E: bi coid (circle), hunchback (ovals), kruppel (squares), giant (rectangles), and sloppy-paired (triangle). Symbols representing sites 100% conserved co mpared to D. mel are green, while those diverged are shaded orange.

PAGE 28

16 Early Drosophila embryo is a paradigma tic model for studying transcriptional control of development (Arnosti 2002). Exhaus tive genetic analysis has been applied to this model. Even-skipped (eve), an impor tant developmental gene in Drosophila, produces seven transverse stripe s in the blastoderm embryo. The stripe 2 enhancer (S2E) is the best studied one and includes multiple TFBSs of five TFs, the activators bicoid (bcd) and hunchback (hb), and the repressors giant (gt), Kruppel (kr), and sloppy-paired (slp) (Small et al. 1992; Arnosti et al. 1996; A ndrioli et al. 2002). As shown in Figure 22, experimental investigati ons suggest that S2E is evol utionarily conserved among D. melanogaster, D. yakuba, D. erecta and D. pseudoobscura. D. yakuba and D. erecta are separated by approximately 10-12 milli on years from D. melanogaster, and D. pseudoobscura split from D. melanogaster about 40-60 million years ago (Ludwig et al. 1998; Ludwig 2002; Ludwig et al. 2005). There is no detectable difference in the spatial or temporal control of gene e xpression among these four species. Mel AATATAACCCAATAATTTGAAGTAACT------------GGCAGGAGCGAGGTAT----43 PSE AATATAACCCAATAATTTTAACTAACTCGCAATGGACAGGGCAGTAGAGCAGTAGAGTAG 60 ****************** ** ***** ***** ** *** Mel --CCTTCCTGGT---------TACCCGGTACTG-----CATAACAATGGA-----ACCCG 82 PSE AGCATTGCAGGAAGGATGCAATACTCGGGAATGGAATGCATAACAATGGGCAAGGACCAG 120 ** ** *** *** ** *********** *** Mel AA--CCGTAAC-TGGGACAGA-----TCGA---------AAAGCTGGCCTGGTTTCTCGC 125 PSE GGTTCCGTTTCGCGAGATAAGGTTCTTTGACGGTTCCTTGACGGTTCCTTGACGGTTCCC 180 **** * ** * ** * * ** ** Mel TGTGTGTGCC-------GTGTTAATCCGTTTGCCATCAGCGAGATTATTAGTCAATTGC177 PSE TGTGTGCTCTCTGCTCTGTGTTAATCCGTTTGCCATCAGCAAGATTATTAGTCAATTTTC 240 ****** *********************** **************** Mel -------------AGTTGCAGC----GTTTCGCTTTCGTCCT------CGTTTCACTTT213 PSE ATATTTCCAGTCGAGTCGCAGTTTTGGTTTCACTTTCCTCCTTTGCCACTTCTTGCCTTG 300 *** **** ***** ***** **** * * ** Mel ---------------------------------------------------------CGA 216 PSE CCTCATGTGGATGCCGATGCCGATGCCGTTGCCGTTGCCGTTGCCGTTGCCGACCGACGA 360 *** Mel GTTAGACTTTATTGCAGCATCTTGAACAATCGTC-GCAGTTTGGTAACACGCTG-----269 PSE GTTAGATTTTATTGCAGCATCTTGAACAATCAACTGGAATTTGGTAACATGCTGCGCGGC 420 ****** ************************ * ********** **** Mel --------------TGCCATACTTTCATTTAGACGGAATCGAGGGACCCTGGACTATAAT 315 PSE CTAACCCTGGAGATTGCTCTACTTTCGCCTCAATTGAATCGGAGTTAGGCGGAAGACGGC 480 *** ******* * ****** *** Figure 2-3. S2E of D. Mel and D. Pse alignment using CLUSTALW.

PAGE 29

17 Mel CGCACAACGAGACC--GGGTTGC-----GAAGTCAGGGCATTCCGCCGATCTAGCCATCG 368 PSE GGACCCTTGCGACCAAGGGTTGTCTCCTGGCCTCAGGAGTTTCCACAGTCAACGCTTTCG 540 * **** ****** ***** **** * ** *** Mel CCATCTTCTGCGGGCGTTTGTTTGTTTGTTTGCTGGGATTAGCC-AAGGGCTTGACTTGG 427 PSE CTGGTTTGTTTATTTGTTTGTTTGTTT--TAGCCAGGATTAGCCCGAGGGCTTGACTTGG 598 ** ************ ** ********* ************** Mel AATCCAATC-----------------------------------CTG-ATCCCTAGCCCG 451 PSE AACCCGACCAAAGCCAAGGGCTTTAGGGCATGCTCAAGAGATCCCTATATCCCTATCCCT 658 ** ** * ** ******* *** Mel ATCCCAATCCCAATCCCAATCCC--TTGTCCTTTTCATTAGAAAGTCATAAAA-ACACAT 508 PSE GTCGCGATCCCTAAACCGATCCCATTTGGCAATTTCATTAGAAAGTCATAAAACACACAT 718 ** ***** ** ***** *** ********************* ****** Mel AATAATGA--TGTCGAAGGGATTAGG-----GGCGCGCAGGTCCAGGCAACGCAAT---T 558 PSE AATAATGAGATGTCGAAGGGATTAAGATTAAGGGACGCACACACAGGCAGCAGGATCATT 778 ******** ************** ** **** ****** ** Mel AACGGACTAGCGAACTGGGTTATTTTTTTGCGCCGA-----------CTTAGCCCTGATC 607 PSE AACGGACTAACGGAATCGGGTTATTTTTTGCGCTCAACCGGACGGAGCTTAACGCTAAGC 838 ********* ** * ** ********** **** ** * Mel CGCGAGCTTAACCCGTTTTGAGCCGGGCAGCAGGTAG-TTGTGGGTGGACCCCACGACTT 666 PSE CACAAGCTTAACCCCTTGTGGGCCG--CGGCAGGTAAATCGATGACCGATGTCGCTTGCG 896 * ********** ** ** **** ******* * ** * Mel TTTTGGCCGAACCTCCAATCTAACTTGCGCAAGTGGCAAGTGGCCGGTTTGCTGGCCCAA 726 PSE CAAGGCCCCTACTACTCCCCTCCCCTCC-CATATGACAACCCACTAACC--CTGCCCCCG 953 ** ** ** * ** ** *** *** *** Mel AAGAGGAGGCACTATCCCGGTCCTG---GTACAGTTGG-TACGCTGGGAATGATTATATC 782 PSE CCCTCTC--CACCACCACTGTACTGTATGTACAGTTGCCTCCTCTCTGGATGATTATATC 1011 *** * ** *** ********* * ** *********** Mel ATCATAATAAATGTTT 798 PSE ATCATAATAAATGTTT 1027 **************** Figure 2-3. Continued Although similar at the TFBS level and func tionally conserved, as shown in figure 2-3, S2E in D. Mel and D. Pse are substantia lly diverged at the sequence level. One can observe large insertion or deletion in spacers between TFBSs, nucleotide substitutions in TFBSs, variation of length of enhancers, etc. Only 3 of 17 known TFBSs in D. melanogaster are completely conserved among all four species. Computational Prediction of Cis-Regula tory Modules (CRMs) by Cross-species Genome Comparison In contrast, cross-species genome compar ison (Phylogenetic Footprinting) is more reliable and is capable of identifying CRMs fr om a single gene of different species, as long as they are conserved acr oss species. Cross-species ge nome comparison is based on the assumption that functional sequences, like regulatory regions, are more conserved

PAGE 30

18 evolutionarily than nonfunctional regions, due to selective evolutionary pressure. Like a filter, cross-species genome comparison elim inates non-conserved regulatory regions and thereby increases the selectivity a nd specificity of CRMs prediction. Evolutionarily conserved noncoding regions across species are a potentially rich source for the identification of regulatory information of genes and can be found by sequence alignment tools such as BLAST (Alts chul et al. 1990), PIPMaker (Schwartz et al. 2000), MultiPipMaker (Schwartz et al. 2003 ), CLUSTAL W (Thompson et al. 1994), AVID (Bray et al. 2003) and VISTA (Mayor et al. 2000). A large number of methods have been developed recently to pred ict CRMs based on cross-species genome comparison. With complementary data provided by sequence comparison, these methods have improved the prediction of TFBSs. ConSite (Lenhard et al. 2003; Sandelin et al. 2004) is a web-based tool for the detection of transcription f actor binding sites, which in tegrates both cross-species comparison and binding site prediction genera ted with Matrices. ConSite reports TFBSs that are in conserved regions a nd at the same time located as pairs of sites in equivalent positions in alignments between two or thologous sequences. By incorporating evolutionary constraints, sele ctivity is increased compared to approaches purely based on profile search of single genomic sequence. C onSite claims that it reduces the noise level by ~85% while retaining hi gh sensitivity compared to single sequence analysis. Similar to ConSite, rVista (Loots et al. 2002; Loots and Ovcharenko 2004) combines sequence comparison and TFBS prof ile based prediction, and attempts to identify aligned TFBSs from noncoding regions th at are evolutionarily conserved. rVista considers noncoding regions with low DNA sim ilarity as least likely to be biologically

PAGE 31

19 relevant and it discards these regions for TFBS searches. rVista claims to take an effective way to reduce false positives. TraFAC (Jegga et al. 2002) is a web-base d application to det ect conserved TFBSs from a pair of DNA sequences. It defines hits as the density of shared TFBSs within a 200-bp window in evolutionarily conserved noncoding regions. The hit count is plotted as a function of position. PhyME (Sinha et al. 2004) is an algor ithm based on cross-species comparison too. Furthermore, it integrates another important feature of a TFBS, overrepresentation, to the algorithm. Overrepresentation depends on the number of occurrences of the TFBS in each species. By searching conserved and st atistically over-represented TFBSs, PhyME claims to increase both sensitivity and specificity. TOUCAN (Aerts et al. 2003; Aerts et al. 2005) is a web-based Java application for the prediction of cis-regulatory elements from sets of orthologs or co-regulated genes. Orthologs are aligned and cons erved regions are selected as candidate search regions. Toucan has two TFBS prediction algorithms. One is MotifScanner, which is used to search known TFBSs based on their matrices. Inst ead of using a predefined threshold that is independent of the sequence context, MotifScanner uses a pr obabilistic model to estimate the significance of TFBS s. The other is MotifSampler, which is used to search for new TFBSs based on Gibbs sampling. Another algorithm (Cora et al. 2004) tries to predict over-represented TFBSs from sets of orthologs. It focuses on short conser ved regulatory regions in mice and humans so that a suitable signal to noise ratio can be achieved. Statistically over-represented TFBSs in those regions are searched and their a nnotations to Gene Ontology (GO) terms are

PAGE 32

20 analyzed. They conjecture that functional TF BSs will be ones that are statistically overrepresented and share the same specific Gene Ontology terms. CRME (Sharan et al. 2003; Sharan et al. 2004) is another web-based tool for identifying and visualizing CRMs from a set of potentially co-regulated genes. It uses rVista 2.0 to find TFBSs that are cons erved between genomes. Subsequently, it enumerates all combinations of these TFBS s that occur within a user-defined window. These combinations are statis tically evaluated and significan t combinations are reported. Compared to other methods, CRME consid ers the combination of TFBSs and also evaluates the statistical significance of the predicted CRMs. Venkatesh and Yap (Venkatesh and Yap 2005) searched for cis-regulatory elements from non-coding sequences conserved between fugu and mammals and they believe that the cis-regulatory elements they found are likely to be shared by all vertebrates considering the relatively long evolutionary distan ce between fugu and mammals. In general, methods based on cross-sp ecies genome comparison have improved prediction of TFBSs by searching TFBSs onl y in sequence-conserved regions across species. However, such methods fail to detect CRMs that have retained their functions but have evolved in their sequence. The pe rformances of these methods depend on the evolutionary distances of species, alignmen t algorithms, and profile models of TFBSs These days, more and more evidence (T hanos and Maniatis 1995; Erives and Levine 2004; Senger et al. 2004) have shown th at CRMs might be high ly structured and depend on a number of organizational constr aints, such as distance betweens TFBSs, orders and orientations of TFBSs, which ma y be used as further constraints to find functional CRMs from a noisy background.

PAGE 33

21 CHAPTER 3 THE FEASIBILITY STUDY OF CROSS-GENOME COMPARISON AT THE BINDING-SITE LEVEL Background Transcription is the fundam ental biological process in which a selected region of DNA is transcribed into RNA by a molecular m achinery, a core component of which is RNA polymerase. For most prot ein-coding genes, transcripti on is the intermediate step via which the information coded in their DNA is expressed into functioning proteins. For others, such as RNA genes, the product of the transcription itself may have biological function. Even though each cell has the comple te set of genes in its chromosomal DNA, only a portion of the genes are transcribed (expressed) in a ny particular cell depending on tissue/cell type, developmental stage, etc (Zhang et al 1997). The transcriptome, that is all of the genes that are selectively transc ribed in a cell, determines the function and morphology of the cell. In addition, the level (i .e., rate) of transcrip tion is often regulated in response to intra-cellular and extra-cellular environmenta l factors to ach ieve cellular homeostasis. Normal transcri ptional regulation, i.e., the ri ght genes being transcribed at the right times, in the right cell, and at the appropriate rates, is central to almost all physiological processes. Abnor mal regulation of transcripti on often results in disruption of development and/or patholog ical states. For example, ectopic (i.e., abnormally high) expression of oncogenes leads to hyperplasia and cancer. A basic element of transcriptional regulat ion is the interaction of transcription factors (trans factors) and their corresponding t ranscription f actor b inding s ites (TFBSs,

PAGE 34

22 also referred to as cis factors) on the DNA. Transcripti onal regulation of a gene (e.g. restricted transcription in a pa rticular cell type, or elevated transcription, in response to UV light) is often mediated through the func tional/physical intera ctions among multiple transcription factors, each recruited to the pr oximity of the DNA in part by their selective affinity to their corre sponding binding sites. For example, the even-skipped(eve) gene is transiently expressed in 7 a lternative stripes on the longitudinal axis in the developing Drosophila melanogaster embryo at the blastoderm stage. Each of the seven stripes is regulated by a distinct set of transcription factors binding to their corresp onding binding sites located in a DNA segment flanking the even-skipped gene. The most well investigated of these segments is the stripe 2 regulatory region, which has identified binding sites for 5 different transcription factors in a 700bp (base pair)-1kb (kilo base pair) DNA region in front of the transcription in itiation site of the eve gene. Evolutionary comparison of this transcription regulatory module in different Drosophila species has revealed that most of the binding sites are highly conserved and f unctional, even though the underlying DNA sequence has undergone consid erable change (Ludwig et al.1998). A useful analogy to understanding the co mposition of DNA regulatory modules is to consider DNA as a sequence of Letters and individual binding sites as Words. Then, a functional module of closely associated binding sites can be conceived of as the Sentence command embedded in the DNA seque nce that guides transcription. The problems associated with identifying the Sentence commands in a DNA sequence are two fold. First, the binding s ites are degenerate in nature, that is, the same Word may have different letters in certain positi ons. Second, the Words are themselves interspersed by varying lengths of meaningless Letters.

PAGE 35

23 One approach to identifying DNA regulat ory modules is through cross-genome comparison. The assumption underlying this approach is that DNA sequences encoding functional TFBSs are under selective pressure to be conserved duri ng evolution, whereas non-functional DNA mutate/change more rapidl y. Thus, if DNA sequences flanking orthologs in two related speci es were to be compared fo r sequence-level similarity, DNA regulatory modules would appe ar as conserved islands in a sea of otherwise notconserved DNA sequences. Approaches in this category include rVista2.0, ConSite, PhyME, TOUCAN, CREME, TraFAC, etc (Sha ran et al. 2003; Sandelin et al. 2004; Loots, G.G. and I. Ovcharenko 2004; Sinha et al. 2004; Aerts et al 2005; Schwartz et al. 2003; Cora et al. 2004; Venkatesh and Yap 2005). For instances, based on the sequence level conservation between human and mouse, Cora et al. predicted functional TFBSs that are statistically over-represented and share the same specific Gene Ontology (GO) terms (Cora et al. 2004). This kind of cross-genome comparison approaches has successfully led to the discovery of regulator y modules that were s ubsequently verified by functional characterizati on (Santini et al. 2003). The disadvantage of the sequence-based appr oach is that it is dependent on the overall conservation of the DNA region har boring the regulatory module. As we mentioned earlier, TFBS sequences are degenera te in nature and ar e interspersed by nonfunctional sequences which are not under cons ervation pressure. Depending on the ratio of functional versus non-functi onal base-pairs in the DNA re gion, it is entirely possible that the overall sequence level conservation of the region be undetectable from the background level, while the actual TFBSs ma king up the functional module still be conserved. In other words, it is possible th at despite the conservation of the Sentence

PAGE 36

24 command at the binding site level, the overall conservati on of the DNA backbone at the sequence level be minimal or non-detectable. This situation is aggravated if the participating binding sites are highly degenerate (i.e., tole rate many variations at the sequence level) and the spaces between the binding sites are long. In fact, it has been observed by researchers in many instances that while the regulatory region has no detectable overall similarity, the transcripti on regulatory function is preserved (Ludwig et al. 2005). Sequence-based approaches, or appr oaches requiring filtration of sequences based on DNA level similarity, are not helpfu l for identifying the responsible TFBSs in this scenario. Since the conservation pressure is at the bi nding site level, i.e., the sequence must be able to maintain binding affinity to th e transcription factor(s), it makes biological sense to perform comparisons at the binding site level rather than at the sequence level. This, however, is currently hindered by two factors. The first limitation is the effectiveness of identifying tr anscriptional binding sites in a given DNA sequence. The set of TFBSs for a given TF can be quantitat ively represented using a frequency matrix that describes the binding specificity of the TF at each of its positions. The quality of the matrix used to identify potential TFBSs is determined by the number and quality of known binding site sequences used to constr uct the matrix. As a result of the development of new technologies such as Chip (Yan et al. 2004) and ChIP-chip (Lee et al. 2002), it is anticipated that binding site instances will be identified at a unprecedented rate which will undoubtedly greatly enhance both the quality as well as the coverage of binding site matrices in the n ear future (Matys et al. 2003).

PAGE 37

25 The second limitation is that we cu rrently do not understand the grammar governing how binding sites (Words) make up the regulatory modules (Sentences). Based on our understanding of transcriptional regulation, such a grammar should have at least three components: (1) the composition of the binding sites, (2) the sequence of the binding sites, and (3) the spaces between/am ong the binding sites. Currently, the number of regulatory modules that have been thoroughly characterized is far fewer than what is required to decode this grammar. A major obstacle for biologists working on transcriptional regulation is to locate and identify potential TFBSs responsible for a particular regulatory module, especially in sequences that do not have signi ficant conserved islands. In this paper, we describe a novel methodology for binding site level identifi cation of conserved regulatory modules in such sequences. Results Simulating Sequence Pairs Harboring a Conserved Module of Binding Sites Since the number of well-studied regulatory modules is currently rather sparse, we chose to simulate sequence sets (pairs) re presenting the domain of our interest, i.e., conserved binding site patterns in a pair of sequences which nonethele ss have little or no similarity at the sequence level. In many cases, experimental investigation in a model organism has narrowed down the location of the regulatory module(s) for a particular gene to a relatively short region (e.g. within 1 kb), whereas for the ortholog in a lessstudied organism (reference organism), inform ation about the localiz ation of the module is absent (except that it is generally in the proximity of the gene). In view of this, in the first (current) stage of the development of BL ISS, we considered the identification of a conserved module present in both a short sequence of about 100-500 bps, representing

PAGE 38

26 the model organism, and a longer sequence (5-6 kb), representing the reference organism. Although this simplification limits the appli cability of the current methodology, it does highlight the promise of our approach. For each sequence pair, the backbones for both the short sequence and the long sequence were generated randomly and thus had no sequence similarity. A hypothetical module involving binding sites for 4-8 distinct transcripti on factors was first introduced into the short sequence. Th e binding site sequences were randomly selected from the instances recorded in the TRANSFAC 9.1 database (Matys et al. 2003). The rules governing the formation of the hypothe tical module were as follows: 1. A module contains binding sites for 4-8 distinct transcription factors. 2. For each transcription factor, there may be more than one binding site in the module. 3. The distance between cons ecutive binding sites is di, where in 65% of the cases di lies within 5-20 bps, in 22% of the cases di lies within 21-50 bps and in the remaining 13% of the cases di lies within 51-60 bps (Figure 3-1). The range of values for di was based on background knowledge as well as a statistical analysis of the di stances between pairs of TFBS s in TransCompel database by Qiu et al. which revealed the above distribution of distances between pairs of TFBSs (Qiu et al. 2002). The hypothetical module was first simulate d according to the above rules in the short sequence. Subsequently, a conserved module was formulated and inserted into the longer sequence at a random location. The rules governing the formation of the conserved module were as follows:

PAGE 39

27 Figure 3-1. Simulating cons erved regulatory modules in a pair of random sequences. Each pair has a short (100-500 bp) an d a long (5-6 kb) sequence. For both sequences the backbones were ge nerated randomly. The hypothetical module was formulated according to rule s described in the text and inserted into the short sequence. A cons erved module with binding sites corresponding to the same transcription factor, but with different sequences was inserted into the long sequence. The distances between consecutive binding sites were also different be tween the hypothetical module and the conserved module. 1. It is comprised of TFBSs that correspond to the same transcription factors as present in the hypothetical module. 2. The sequence for each TFBS is randomly chosen from the recorded instances in TRANSFAC 9.1, with the caveat that it ca nnot be the same instance(s) that was (were) used to construct the hypothetical module in the short sequence. 3. The respective order of TFBSs is the sa me as in the hypothetical module in the short sequence.

PAGE 40

28 4. The distance between consecutive bindi ng sites in the conserved module is dj; dj is a function of di in that dj lies in the range (di -d, di +d) (Figure 3-1). d is the perturbation f actorthe variation of di stance between corresponding binding sites in the hypothetical module and the conserved module. In this study, we used d=4 (See Discussion). A total of 10,000 pairs of sequences were generated according to above rules, and were used to test and evaluate various algorithms. Identifying a Conserved Module by Comp arison at the Binding Site Level As stated above, the objective of our methodology is to identify conserved regulatory modules within hi ghly divergent sequences. The sequence pairs in our simulated data-set had little overall sequen ce similarity. Of the 10,000 pairs, 73.32% have no similarity detectable by BLAST analys is (E=0.01, Blast2seq). This indicated that the conservation of bi nding sites in the hypothetical module and the conserved module was not sufficient to allow detection at the sequence level. Of those that did have a significant match, the output alignments we re shorter than 30 bps, which was far shorter than the length of the inserted module. M_score. To identify the conserved module at th e binding site level, we first generated the potential TFBS profiles for each of the si mulated sequence pairs. A matrix scoring method similar to that used in Match (Kel et al. 2003) was implemented (see Methods), which allowed us to score each sequence against the frequency TFBS matrices recorded in TRANSFAC 9.1 (M_score, Figure 3-2a ). When a cut-off value of 0.75 (on M_score) was applied, on average there were about 3000 identified poten tial TFBSs in every thousand base pairs of simulated sequences. This is similar to what we observed when using genomic DNA sequences randomly extrac ted from model organism databases (data not shown).

PAGE 41

29 Figure 3-2. Gaussian smoothing of the TFBS profile. a.) Profile of matrix matching scores (M_score) for three TFBSs al ong a short DNA sequence. b.) M_score profile after applying a cut_off of 0. 75. c.) Profile of P_score after incorporating the p value of binding site matches. d.) Profile of the G_score after Gaussian smoothing. The colors re present three different TFBSs: Red, En; Green, Croc; Blue, Lun-1. To identify the hypothetical module embedde d in the sequences, we tried several different algorithms that compared the bi nding site profiles of the short and the long sequences. Of those tested, a scoring method (using inner-products) ba sed on a statistical evaluation of binding site matches after a Gaus sian smoothing of the binding site profiles gave reliable and promising re sults. The performance test of this method was listed in Appendix A. P_score. The matrix score (M_score), by virtue of its definition (f.1), ranges from 0 to 1 for all TFBS matrices. Thus it does not differentiate short and relatively simple matrices

PAGE 42

30 that match DNA sequences with a high frequenc y from those long and stringent matrices that match DNA sequences only rarely. For example, the binding site for En (I$ED_06) has 7 positions, and on average there are 320 matches (with M_score>0.75) on any 10,000 bp random sequence. In contrast, the binding site for Bel-1 (V$BEL1_B) has 13 positions, and the average number of matches with M_score>0.75 in a 10,000 bp random sequence is 3. It is clear that a match involving the binding site V$BEL1_B is far more significant than a match with I$ED_06. To di fferentiate this, we introduced the p value of the M_score, which was estimated by calculating th e fraction of randomly generated sequences that have scores e qual to or higher than that M_score. We then calculated the P_score (see Methods) as the produ ct of -log (p value of M_score>cut_off) and the M_score (Figure 3-2, b). G_score. To account for the variation in the di stances between/among binding sites, we performed a Gaussian smoothing of the P_score (see Methods). Through empirical testing (data not shown), we found that a variance of 2=9 gave the best performance. We denote the Gaussian smoothed score profile as the G_score profile of the sequence (Figure 3-2 d). BLISS_score. G_score profiles were generated for both the short and the long sequences. To identify a maximum match at the bi nding site profile level, the short G_score profile was slid along the long G_score profile. At each position, the match between the short profile and its corresponding re gion of equal length (length of the window) in the long profile was evaluated usi ng an inner-product as the BLISS_score (see Methods). Note that since the length of the wi ndow appears in the denominator, the BLISS_score is independent of the length of the short profile (or the length of the

PAGE 43

31 window). Figure 3-3a shows the distribution of BLISS_scores as the shorter G_score profile was slid along the longer G_score profile. The peak of the BLISS_score indicates the maximum match. In this case, the abrupt surge in the BLISS_score is due to the match between the embedded hypothetical module in the short sequence and the conserved module in the long sequence. When this methodology was tested on all of the 10,000 simulated sequence pairs, about 80% of the highest peaks for each pair contained the correct match between the embedded hypot hetical module and the conserved module. Distribution and Statistical E valuation of the BLISS_score To be able to evaluate the match at the binding site profile level, we analyzed the distribution of BLISS_scores using the simulated sequence pairs. For each pair of sequences, BLISS_scores were calculated at each position as the short profile slid along the longer profile. The peak matches (corres ponding to the peaks in the score profile) between each pair of sequences were evalua ted to see whether it aligned the embedded modules. If the match did align the modules, it was designated a true match. All other BLISS_scores were considered as background. Figure 3-3b shows the distribution of the background and the true match BLISS_scores for the 10,000 simulated pairs of sequences. This distribution varies slightly depending upon the cut_off threshold set for M_score (Figure 3-3b & c). This is not surprising, since a lower cut_off thresh old will lead to more identified potential binding sites and thus a slight ly higher background score.

PAGE 44

32 Figure 3-3. Identifying a cons erved module at the binding site level. a.) The peak in the BLISS_Score profile represents the maximum match at the binding site level. b.) and c.) show th e distribution of BLISS_sc ore in true matches vs. background with a cut_off value of 0.75 and 0.80, respectively.

PAGE 45

33 The distributions allow us to evaluate any particular BLISS_score. It is informative in helping set a threshold for reporting signifi cant matches at the bindi ng site level. Given a BLISS_score x, the distributions allow us to decide whether that BLISS_score corresponds to a true alignm ent of modules or whether it corresponds to the module aligning with a random DNA segment. Let C1 denote the event where the modules embedded in the short and the l ong sequences are aligned, and C2 denote the event where either module is aligned with a random DN A segment. Based on Bayes formula, the posterior probabilities can be calculated as follows: ) 2 ( ) 2 | ( ) 1 ( ) 1 | ( ) 1 ( ) 1 | ( ) | 1 (C p C x p C p C x p C p C x p x C p ) 2 ( ) 2 | ( ) 1 ( ) 1 | ( ) 2 ( ) 2 | ( ) | 2 (C p C x p C p C x p C p C x p x C p Where p(C1|x) is the conditional probability of C1 given a BLISS_score x and p(C2|x) is the conditiona l probability of C2 given a BLISS_score x; p(C1) is the prior probability of C1 and p(C2) is the prior probability of C2; p(x|C1) is the conditional probability of observing x given C1 and p(x|C2) is the conditional probability of observing x given C2. p(x|C1) and p(x|C2) can be read off directly from the distributions generated. It is difficult to find the means to calculate the prior probabilities p(C1) and p(C2). In this study, we assumed p(C1) = p(C2), although we suspect that p(C1) might be smaller than p(C2). This assumption allowed us to calcu late the posterior probabilities to evaluate a BLISS_score x. In practice, we decided that x was a significant matching score if p(C2|x) was less than a threshold of, e.g. 0.01 or 0.001.

PAGE 46

34 Identifying a Conserved Regulatory Module in Distantly Related Species To test the efficacy of the BLISS met hodology in real sequences, we undertook the task of identifying the Even-skipped (eve) st ripe 2 enhancer (S2E) in distantly related Drosophila species. Even-skipped, an important development regulatory gene in Drosophila melanogaster (D.mela), is specifically expressed in seven transverse stripes in the embryo during the blastoderm stage. The stripe 2 enhancer is the best studied and includes TFBSs for five TFs, bicoid (Bcd), hunchback (Hb), giant (gt), Kruppel (Kr), and sloppy-paired (slp) (Small et al. 1992; Arnos ti et al. 1996; Andrio li et al. 2002). Unfortunately, TRANSFAC 9.1 has Position Wei ght Matrices for only three of the five TFs, i.e., Hb, Kr, and Bcd. Our search was therefore limited in the sense that some of the participating TFBSs could not be predic ted and used for the match comparison. Previous experimental investigations ha ve shown that S2E is evolutionarily conserved among D. yakuba (D.yaku), D. erecta, and D. pseudoobscura (D.pseu) (Ludwig et al. 1998; Ledwig et al. 2002; Ledwig et al. 2005). All of these species are in the same subgenus (Sophophora) as D.mela, with D.pseu having the most distant relationship with D.mela (separated at about 40 milli on years ago). BLISS did identify the eve S2E modules among these four species In particular, a significant peak was reported by BLISS when we searched the S2E module extracted from D. pseu against the 14kb D. mela genomic sequence flanking the eve coding region.

PAGE 47

35 Figure 3-4. Identifying the eve S2E module in distantly related species. a.) Using the D.pseu S2E module (1027 bp), a peak (red circle) in BLISS_score was identified in a 14 kb D.moja genomic sequence surrounding the eve coding region. b.) No significant match was id entified in the reve rse strand (bottom panel) or an unrelated sequence (data not shown).

PAGE 48

36 In contrast, no detailed information has b een published about pot ential S2E in more distantly related species, such as D. mojavensis (D.moja) or D. virilis (D.viri), both from a separate subgenus (Drosophila) separated from D.mela at about 60 million years ago (Powell 1997). To identify S2E in these two distantly related species, we extracted the 14kb genomic sequence flanking the eve coding region from D.moja and D.viri genomic sequences. Blast analysis using the D.mela or D.pseu sequence harboring the eve S2E module did not identify any significant matc h longer than 41 bp (using bl2seq with default gap penalty values). Using the BLISS methodology however a significant peak in the BLISS_score was observed (Figure 3-4a, p(C2|x)<0.05). Verification of this match indicated that it contained the TFBSs composing the S2E module. Similar results were obtained when corresponding sequence pairs in volving 1.) D.mela and D.moja, and 2.) D.mela and D.viri, were analy zed. In contrast, no significan t match was identified in the reverse-complemented sequence (Figure 3-4b) or in other 14 kb sequences unrelated to eve, indicating the specificity of the search. A detailed inspection of the make up of the S2E modules in distantly related species showed that S2E can be viewed as a complex module made of element modules. To make an analogy, S2E is a complex sentence that has several clauses (Figure 3-5). The evolution of the whole module indicates that the distances between some TFBSs have changed dramatically. However, the distances among the TFBSs within corresponding clauses have remained relativel y stable. For example, in Clause 1 the distances among participating TFBSs have rema ined constant over th e long evolutionary period. Specifically, the distance between the first bcd (overlapping w ith the first kr) and the second bcd is invariably 20 in all of th e four species. In a ddition, the distances

PAGE 49

37 among TFBSs in Clause 3 have also remained relatively stable, i.e., within the variation we have factored in to our simulation. Figure 3-5. Inter-TFBSs distances are very we ll conserved within each clause of the S2E module. Comparison of the evolutio n of S2E modules across distantly related species revealed that while the sequence length of the module has changed significantly, the distance among TFBSs in Clauses 1 and 3 have remained stable. The numbers near th e TFBS indicate the positions relative to the first Kr site in this module. Since our methodology is really based on the assumption of limited distance variations between TFBSs, it should be much more sensitive at id entifying individual Clauses or simple modules. When the corresponding TFBS prof ile covering Clause 1 or Clause 3 were used to search against the genome sequence from D.moja, very significant peaks in BLISS_score were observed (Figure 3-6a & b, p(C2|x)<0.001 for both). The peaks corresponded to the matc h of Clause 1 and Clause 3 on the D.moja

PAGE 50

38 sequence, respectively. BLAST analysis using the sequences c overing Clause 1 or Clause 3 searched against the D.moja genomic sequence failed to identify significant matches that spanned the whole module. rV ista 2.0 did predict Clause 1 because it succeeded in detecting the DNA similarity be tween the sequence covering Clause 1 and the D.moja sequence. However, rVista 2.0 failed to identify Clause 3 since no similarity was detected between the seque nce covering Clause 3 and the D.moja genomic sequence. Implementation of BLISS as a Web-based Service The BLISS methodology has been implemen ted as a web-based tool for the research community. The web application em bodies the Gaussian Smoothing Method for identifying cis-regulatory modules at th e binding site level, and outputs all potential TFBSs in the predicted module. The module fi nding process consists of several steps: To begin, the user inputs two DNA sequen ces. For example, a short sequence from a model organism that harbors a regulato ry module, and a longe r sequence surrounding the ortholog of a different species. An M_score threshold of 0.75 or 0.8 is then chosen by the user for the generation of the TFBS profiles for both sequences. Next, a plot of BLISS_scores comparing successive alignments of the short profile against the long profile is returned to the user. On the very same page, the distribut ions described earlier (Figure 3-3b & c) are displayed so that the user may choose a BLISS_score threshold. Once the BLISS_score threshold is chosen, BLISS out puts all of the matches with a BLISS_score higher than that threshold (limited up to 5). For each match, a table of contributing TFBSs are listed based on the pr oduct of the p-values of the matching TFBSs on both sequences (Figure 3-7). Altern atively, it can be listed based on their numeric contribution to the BLISS_score, or by the location of the TFBSs.

PAGE 51

39 Figure 3-6. Searching for c onserved individual clauses / element modules using BLISS. Profiles covering clauses 1 (a) or 3 (b) of S2E were used to search against a 14 kb D.moja genomic region. The BLISS_score peaks are significant (P(C2|x)=0.0 for a, = 0.0003 for b).

PAGE 52

40 Currently, the limits for the short and long input sequences are set at 1200 bps and 15k bps, respectively. Figure 3-7. BLISS output of the contributing TFBSs. Our S2E search was used as the example. The list of TFBSs can be output either based on sequence position, product of p value (Figure 3-7) or contribution to BLISS_score. The TFBSs that belongs to S2E were highlighted in green. Discussion In this study, we have presented a first step towards identify ing regulatory modules via comparisons at the binding site level. The advantage of such an approach is that it allows the detection of conserved regulatory modules in highly divergent sequences, as we have demonstrated both with simulate d sequences as well as with real world examples. This method is thus complement ary to many existing methods that are based on sequence similarity comparis ons (Altschul et al. 1990) or use sequence similarity for pre-analysis selection (Sa ndelin et al. 2004; Loots and Ovcharenko 2004; Aerts et al.

PAGE 53

41 2005; Sharan et al. 2004). It should also be complementary to applications such as MEME and CompareProspector, which are wi dely used for the identification of conserved sequence motifs (binding sites) in the regulatory region of co-expressed genes (Liu et al. 2004; Liu et al. 2004). There are limitations to our approach. Some of the major limitations, such as the coverage and quality of the TFBS matrices, ar e expected to improve rapidly in the near future as new high throughput techniques are applied to identify binding sites in genome scale. Our current algorithm is develope d based on the assumption that the inter-TFBS distance variation is within a +/-4 base pair range. This allows the identification of modules/clauses with relatively small inte r-TFBS distance variation, such as the individual clauses in the S2E m odule. It will likely miss mo dules/clauses that have much larger distance variations between TFBSs. In the case of S2E, the identification of the module was based on the fact that the th ird clause had low inter-TFBS distance variations, which was sufficien t to generate a significant BLISS_score (Figure 3-4a). As indicated by Ludwig et al, if S2E as a w hole were to be considered, many inter-TFBS distances have changed drama tically during evolution (Ludwig et al. 2005). However, a closer look at the distribution of TFBSs in S2E in the two di stantly related species also indicated that the S2E module may be sub-divi ded into clauses (Figure 3-5). While the inter-clause distances have varied dramatically, the inter-TFBS distances within each clause have remained largely stable (Figure 35). This is very possi bly a reflection of the spacing restriction on important tran scription factor interactions. In addition to the S2E module, we also tested our methodology on other regulatory modules such as the DME (Distal Muscle E nhancer) module in front of the paramyosin

PAGE 54

42 gene (Marco-Ferreres et al. 2005). Usi ng a 200 bp sequence harboring the DME in D.mela, we were able to detect the corresponding module in D.viri (data not shown). Currently, the number of well ch aracterized, evolutionarily co nserved modules is limited. The goal of BLISS is to facilitate th e discovery of multiple TFBSs modules by identifying the conserved pattern of the TFBS s. We also applied BLISS to a regulatory region that is responsible for me diating UV induced expression of hac-1 (Zhou and Steller 2003). There is no existing information on the composition of the UV-responsive module in this region, which has very lo w sequence level conservation between the corresponding segments in D.mela and D.pseu. Yet genetic experiments have indicated that the responsiveness is highly conserved. The potential module identified by BLISS is currently being test ed experimentally. BLISS, with some adaptation, can potent ially be used to id entify the conserved regulatory modules in co-expressed genes. Another advantage of BLISS is that the methodology can also be applied to identify pa tterns that involve not only TFBSs, but also other sequence features such as comple x response elements (Ringrose et al. 2003), insulator sequences, CpG islands, etc. Functiona lly, these sequence feat ures (their related modifications and binding partne rs) interact with transcription factors. However, these features, such as CpG islands, cannot be de tected by simple sequence similarity based searches. Conclusions In this study, we addressed the feasibili ty of identifying conserved regulatory modules at the binding site level. Our results indicate that it is feasible to identify conserved regulatory modules in simulate d random sequences ha rboring a regulatory module made of 4-8 distinct binding sites. Using real sequences, we demonstrated that

PAGE 55

43 our approach outperforms regular sequen ce level comparisons when the orthologous DNA sequences are highly diverged. In additi on, the BLISS program outputs directly the candidate binding sites that are shared betw een the two regulatory sequences, which can greatly facilitate the evaluation of the candi date module as well as the design of the experimental verification strategy by biomedi cal scientists. Future development of the project will include identifying better algor ithms for complex modules and modules with higher inter-TFBS di stance variations. Methods Generating Simulated Sequences 10000 simulated sequence pairs were gene rated for developing the methodology. Each set included a short DNA sequence (100500 bps) harboring a hypothetical TFBS module and a long DNA sequence (5-6 kb) harb oring a conserved TFBS module. First, the hypothetical TFBS module was generated in the following manner: 4-8 binding sites were randomly chosen from TRANSFAC 9.1 (Matys et al. 2003) database and then random DNA subsequences were inserted between them. Qiu et al. (Qiu et al. 2002)analyzed all the entries of composite elements in the TransCompel database (version 3.0) and they found that about 87% of the composite elements are within 50 bp distance and about 65% are within 20 bp dist ance of one another. We therefore chose lengths of DNA subsequences inserted between binding sites based on this result. Next, we created the conserved TFBS module, which included binding sites for the same sets of transcription factors in the same order as in the shorter sequence. However, the binding site sequences had to be different and they were randomly chosen from TRANSFAC 9.1. Furthermore, compared to the hypothetical m odule, distances between binding sites were set to vary slightly and we allowed each binding s ite to shift up to 4 bps either to the right

PAGE 56

44 or to the left. Finally, we inserted the conserved module into a 5000 bp randomly generated DNA sequence to generate the longer sequence. Binding Site Identification Potential TFBSs both in the short DNA sequence (including the hypothetical module) and the long DNA sequence (includi ng the conserved module) were searched based on frequency matrices collected by TRANSFAC 9.1. Because TFBSs may be detectable on either the forward or the backward strand, we searched both strands of sequences. The M_score profile for each sequence is a M*L matrix, where M is twice of the number of matrices applied and L is the le ngth of the sequence. The top half of the M_score matrix is the score profile for the forw ard strand and the bottom half is that for the complementary strand. The M_score of the ith TFBS at position j of the sequence was calculated by first aligning the frequency matrix for the ith TFBS with the sequence at position j and then computing: f.1. M_score: ] [ ] [ ] [ ] [ ] [ j i Score j i Score j i Score j i Score j i score MMin Max Min 1 0 ,) ( ] [K k n kk jf k I j i Score 1 0 min) ( ] [K k k Minf k I i Score 1 0 max) ( ] [K k k Maxf k I i Score } , { ,) 4 ln( ) (G C T A N N k N kf f k I

PAGE 57

45 K is the length of the TFBS. k jn {A,T,C,G} is the nucleotide occurring in the sequence at position j+k k jn kf, is the frequency of nucleotide k jn at position k in the frequency matrix (of the i th TF). min kf is the lowest frequency and max kf is highest frequency across all nucleotides at position k in the frequency matrix (of the i th TF). I(k) is the information vector for the frequency matrix, which reflects the degree of conservation at position k of the matrix. Finally, M_score[i,j] is the normalized Score[i,j] Stormo (Stormo 1998; Stormo 2002; St ormo and Fields 2002) observed that logarithms of the base frequencies ought to be proportional to the binding energy and the information vector reflects this average bi nding energy between the transcription factor and the binding site. A score cutoff at 0.75 was applied to the M_score profiles of both the short and the long sequence as follows: ] [ ] [ j i score M j i score M if 75 0 ] [ _j i score M 0 ] [ _j i score M if75 0 ] [ j i score M P Value for M_score. To calculate the p value, a background model is required. Here, we chose the background model to be a random DNA sequence where each position is drawn independently. The ratio among A, T, C, and G is 30% : 30% : 20 % : 20%. For each frequency matrix, 300 million subsequences were sampled from the background model, and the M_score of each subsequence was calculated to build the M_score distribution. The p value of a M_score for each TFBS was estimated by calculating the fraction of samples that had scores equal to or higher than that M_score Then, the P_score profiles were calculated as follows:

PAGE 58

46 f.2 P_score Ci is the negative natural logarithm of the p value of M_score >= 0.75 for the i th TFBS. Gaussian Smoothing To account for the change in the distances between/among binding sites, a Gaussian smoothing was applied to the P_score profiles with a variance of 9. Formally, the G_score profiles were calculated as follows: f.3 G_score: where = 3 and k ranges from -7 to 7. In effect, a P_score can spread 7 positions to both the right and the left due to the Gaussian smoothing. Smoothed P_scores beyond 7 positions were ignored due to their small values. Searching the Conserved Module in the Long Sequence To identify a maximum match at the bi nding site profile level, the short G_score profile was slid along the long G_score profile. BLISS_score at position n is the matching score between the short profile and its corresponding region of equal length (length of the short sequence) in the long profile at position n : 7 7 7 7 2 / 2 /2 2 2 22 1 2 1 ] [ ] [ _k i i ke e k j i score P j i score G ] [ ] [ j i score M C j i score Pi

PAGE 59

47 f.4 BLISS_score: where G_score1 is the G_score profile for the short sequence and G_score2 is the G_score profile for the long sequence; L is the length of the short sequence; n is the current location where the short sequence is aligned to the long sequence. Large-scale Search of the Simulate d Sequences, Statistical Analysis We used 10000 simulated sequence pairs generated by the above method to calculate two BLISS_score distributions. The first is the BLISS_score distribution when the hypothetical module in the short sequence is aligned with the conserved module in the long sequence. The second is the BLISS_score distribution when the module is aligned with a non-module segment of the longer sequence. For each pair of sequences, BLISS_scores were calculated at each position as the short profile slid along the longer profile. The peak matches (corresponding to the peaks in the score profile) between each pair of sequences were evaluated to see whet her it aligned the embedded modules. If the match did include the alignment of the modul es, it was designated a true match, and this BLISS_score was used to calculate the distributi on for the modules matching. All of the other BLISS_scores were used to calculate the di stribution for the module matching with the background sequence. Searching for the eve2 Module in D. virilis and D. mojavanis Sequences The GenBank (http://www.ncbi.nlm.nih.gov/ ) accession numbers for the S2E sequences are AF042712 (D. pseudoobscura) and AF042709 (D. melanogaster). We e ortSequenc LengthOfSh n j i score G j i score G n score B LIS S M i L j/ ) ] [ 2 ] [ 1 ( ] [ _1 0 1 0

PAGE 60

48 used BLISS to search these two enhancers in 13kb D. virilis and 14kb D. mojavanis sequences, in which S2E is hypothesized to be located, but the specific location is unknown. Ludwig et al. indicated that distances be tween TFBSs in two clauses in S2E (region 134-275 and region 484-684 for D. melanogaster region 196-376 and region 692-866 for D. pseudoobscura ) were substantially conserved. We removed those regions and searched for the modules in 13kb D. virilis and 14kb D. mojavanis sequences using BLISS. Website Construction BLISS was implemented using HTML/JSP/JavaBean and is supported by an Apache Tomcat 5.5 server. It is publicly available at: http://gene1.ufscc.ufl.e du:8080/blissWeb/index.html The M_score profiles of TFBSs were calculated based on the frequency ma trix library collected by TRANSFAC 9.1. BLISS used DISLIN (http://www.mps.mpg.de/d islin/), a plotting library for displaying data, to draw the match score plot in run time.

PAGE 61

49 CHAPTER 4 BINDING-SITE LEVEL IDENTIFICATI ON OF SHARED REGULATORY MODULES IN TWO ORTHOLOGOUS SEQUENCES Background Determination of the bindings betw een Transcription Factors (TFs) and Transcription Factor Binding Sites (TFBSs) is critical to understanding the mystery of gene regulation. In higher organisms, a partic ular aspect of gene expression is rarely controlled by one single TF, rather by a comb ination of TFBSs called regulatory module (Maniatis et al. 1987; Johnson and McKnight 1989; Shaw 1992; Wang et al. 1993; Tjian and Maniatis 1994). Genes are typically regulated by vari ous regulatory modules in response to different conditions (Yuh et al 1998). Identification of such regulatory modules is notoriously difficult due to the i nherent biological complexity. The traditional way is through biological experimentation (G alas and Schmitz 1978; Fried and Crothers 1981; Garner and Revzin 1981), which is in general costly and time-consuming. With fully sequenced genomes, computational approa ches have emerged as powerful tools to supplement full-scale laboratory experiments in narrowing down ca ndidate regulatory modules and therefore dramatically reduced the effort in determining the regulatory elements (Qiu 2003). A number of software tools have been de veloped to search fo r regulatory modules by cross-genome comparison (Sha ran et al. 2003; Sandelin et al 2004; Loots, G.G. and I. Ovcharenko 2004; Sinha et al. 2004 ; Aerts et al 2005; Schwartz et al. 2003; Cora et al. 2004; Venkatesh and Yap 2005). The theoretica l assumption underlying these tools is

PAGE 62

50 that DNA sequences harboring TFBSs are cons erved during evoluti on due to selective pressure. Thus conserved regul atory modules should be found in the regions with high DNA similarity. However, most of these tool s are based directly on the measurement of similarity at the DNA sequence level. It is known that TFBS sequences are degenerate in nature and the DNA sequences between TFBSs are not under the pres sure of selection and therefore mutate rapidly. Therefore, it is highly probable that the DNA similarity is not detectable even though th e actual regulatory modules ar e conserved (Ludwig et al. 2005). Since the conservation pressure is at the binding site level, in our previous research, we developed a novel methodology na med BLISS to perfor m the cross-genome comparison on binding site profiles (Meng et al. 2006). This is first time that the feasibility of comparison at the binding site le vel has been validated. As a complementary tool in the field, BLISS demonstrates the ab ility to detect conserved regulatory modules from diverged orthologous sequences where DNA similarity is undetectable. It has been observed that the distance s among a certain cluster of TFBSs in the regulatory module are highly c onserved during evolution, wh ich may reflect the space restriction essential for the protein-prot ein interactions among TFs required for combinatorial transcription regulation (L udwig et al. 2005). Due to biological complexity, previous work in this direction has been limited to identifying composite elements, pairs of transcription factors, whose binding sites tend to co-exist. TransCompel (Kel-Margoulis et al. 2002) is such a database with collections of composite elements. In this study, we exte nd the concept of composite element to regulatory complex. Regulatory complex is a cluster of TFBSs, where the type (identity), number, and order of TFBSs are highly conserved during evolution and the

PAGE 63

51 variation of inter-TFBS dist ances is within a certain range (less than 10 bps). A regulatory complex is not necessarily c onserved at the sequence level during the evolution. It could, by itself, be a simple regulatory module or be a part of a complex regulatory module. In previous research (Meng et al. 2006), BLISS showed that it was more advantageous to find regulatory comp lexes rather than the regulatory module, especially when the regulatory complex is not the only regulatory elements in the module. We therefore focuses on the detection of conserved regulatory complexes in this study and the regulatory modules could be predicted based on th e identification of conserved regulatory complexes within them. The first release of BLISS (BLISS1) was limited in application to a short sequence and a long sequence and it made the assumpti on that the short sequence harbored only one regulatory module/complex (Meng et al. 200 6). However, in most cases, biologists have no prior knowledge about th e locations of the conserve d regulatory modules in the sequences to be analyzed. In the second release of BLISS (BLISS2), we extended BLISS1 to a general tool w ithout the limitation of the assumption made in BLISS1 about the analyzed sequences. In this study, we show that BLISS2 could predict conserved regulatory modules by detec ting all potential shared regulatory complexes. Results Simulating Sequence Pairs with Varying Complexity of Conserved Binding Site Patterns Simulated sequences were generated to te st the performance of our algorithm since very few well-investigated regulatory modules are available currently. Two testing data sets were generated. The backbones of the sequences in the first data set were 1000 bp

PAGE 64

52 (base pair) random DNA sequences and the ba ckbones of the sequences in the second data set were 5000 bps. Table 4-1. Simulated data sets Inserted complexes in each sequence pair Sequence pairs in each group Length of Backbone Group1 Group2 Group3 Group4 Data set 1 500 1000 0 1 2 3 Data set 2 500 5000 0 1 2 3 As shown in Table 4-1, each testing data set consisted of four groups of sequence pairs and each group had 500 sequence pairs. No regulatory complex was inserted into the first group of sequence pairs. In th e second group, each sequence pair had one conserved regulatory complex. Two and thre e conserved regulatory complexes were inserted into each sequence pair respectively in the third and fourth group. Each regulatory complex was composed of 4-8 binding sites, wh ich were extracted from TRANSFAC 9.1 database (Mat ys et al. 2003). Qi u et al. (Qiu et al. 2002) performed a statistical analysis for the distances betw een TFBSs in the composite elements, which had been experimentally confirmed and collect ed in the TransCompel database (version 3.0). Their analysis demonstrated that 87% of the distances between TFBSs in composite elements are within 50 bps, and 65% are within 20 bps. The length of the inserted random DNA segments between binding sites was based on this statistical analysis when we formulated the regulatory complexes (see Methods for details). Identifying Conserved Regulatory Comple xes by Comparison at the Binding-site Level In our previous research (Meng et al. 2006), BLISS1 has demonstrated the feasibility of identifying conserved regulat ory complexes in diverged DNA sequences at the binding-site level. However, it was limited to predicting conserved regulatory

PAGE 65

53 complexes from a short sequence and a long sequ ence and it required that the regulatory complex existed in the short sequence. Du ring the comparison, the short sequence was slid along the long sequence to find the se gment in the long sequence that had a significant matching score with the short sequence at binding site level. In this study, however, our focus is to extend BLISS1 to identify potentially conserved regulatory segments from two orthologous sequences w ithout prior knowledge about the locations of regulatory complexes in the sequences. Like BLISS1, BLISS2 calculated BLISS_scor e to represent the de gree of similarity at the binding site level for two sequence segm ents. The first three steps of BLISS2 were to calculate M_scores, P_scores and G_scor es respectively for two genomic sequences. M_score is the binding site profile calculated using a matrix scoring method based on the binding site frequency matrices collected in TRANSFAC9.1 database. To differentiate simple matrices that match DNA sequences w ith a high frequency from those matrices that match DNA sequences rarely, P_score wa s generated as the product of log(p value of M_score > cutoff) and the M_score. To account for the variation of distances between/among binding sites in the conserved regulatory complexes, Gaussian smoothing was applied to P_score to get G_score. The fi rst three steps in BLISS2 were in essence exactly identical to those in BLISS1. The diffe rence is that BLISS2 deals with two long sequences and the locations a nd lengths of the regulatory complexes in the sequences are unknown. The task of BLISS2 is to seek the matched segments from two sequences whose binding site profiles are similar (not necessarily id entical) and whose BLISS_scores, the matching score at the bindi ng site level, are statistically significant. The statistical evaluation of BLISS_score in our previous study allo wed us to evaluate a

PAGE 66

54 particular BLISS_score and set a threshold for reporting significant matches. In this study, BLISS2 took a heuristic approach. The general strategy of BLISS2 was to seek seeds with a certain window size from the G_ score profiles where their BLISS_scores are greater than a certain BLISS_score cutoff. Then all the seeds were extended and the shared regulatory regions were decided when the BLISS_score dropped below the BLISS_score cutoff. (See Methods) At the beginning, BLAST (E=0.01, Blast2se q) was applied to the sequence pairs in the first data set in which the backbon e of each sequence was 1000 bp long and there were a total of 3000 conserved regulatory complexes. Of the 2000 simulated sequence pairs, only 32.4% of them we re reported to have detectable similarity. Even in those positive cases, the average length of output alignments was 16 bps and the maximum length of output alignments was 32 bps. Clearly these output alignments were far shorter than the length of the inserted regulatory complexes, which was not sufficient to allow for the detection of the regulatory complexes. Th is analysis indicated that traditional tools based on sequence level comparison are not enough to predict the conservation of regulatory complex in diverged DNA sequences. Table 4-2. Performance test of BLISS2 for M_score cutoff=0.75. False Positives (FP) and True Positives (TP) are listed under different window size and BLISS score cutoff. BLISS_score cutoff Window size Type 2.9 2.8 2.7 2.6 2.5 2.4 2.3 2.2 2.1 FP 1795 2565 3748 5272 7180 9262 100 TP 904 1045 1174 1307 1434 1537 FP 143 221 401 741 1362 2402 3940 5781 150 TP 732 897 1079 1269 1445 1616 1775 1900 FP 14 18 44 80 205 473 1106 2195 3836 200 TP 382 526 682 879 1083 1345 1558 1764 1899 FP 1 1 6 14 34 88 261 727 1707 250 TP 130 213 343 504 695 918 1189 1468 1663

PAGE 67

55 Table 4-3. Performance test of BLISS2 for M_score cutoff=0.8. False Positives (FP) and True Positives (TP) are listed under different window size and BLISS score cutoff. BLISS_score cutoff Window size Type 1.9 1.8 1.7 1.6 1.5 1.4 1.3 1.2 FP 1907 2905 4435 6480 8991 100 TP 1168 1328 1488 1633 1759 FP 193 330 629 1193 2366 4306 6625 150 TP 1021 1229 1432 1664 1892 2049 2154 FP 26 45 81 195 523 1351 2975 5195 200 TP 617 818 1069 1346 1631 1880 2054 2137 FP 2 4 20 42 130 373 1213 2982 250 TP 298 464 662 932 1227 1539 1822 2006 In comparison, the testing results of BL ISS2 under different 1)window size and 2) BLISS_score cutoff are listed in table 4-2 and 4-3. True positive refers to the regulatory complexes that were inserted in the simulated sequence pair and identified by BLISS2, while false positive refers to the regulatory complexes that were reported by BLISS2 but di d not really exist. Both table 4-2 and 4-3 indicate a balance between true positive and false positive under different BLISS_score cutoffs. While the number of true positives improved when we chose a lower BLISS_score cutoff, the false positives incr eased as well; on the other hand, when a higher BLISS_score cutoff was applied, false positives decreased while true positives were compromised. Based on the testing results listed in ta ble 4-2 and 4-3, a window size 200 was selected in the implementation of BLISS2 sin ce in relation to other window sizes, it had a better balance between false positives and true positives both for M_score cutoff 0.75 and 0.8. In the first testing data set, the length of each sequence was 1000 bps. The performance test of BLISS2 proceeded under window size 200 using the second testing data set, where the length of backbone of each sequence was 5000 bps.

PAGE 68

56 0 200 400 600 800 1000 1200 2.52.62.72.82.933.1 BLISS_score cutoffPredicted complexes FP-2 FP-1 TP-2 TP-1 Figure 4-1. The effects of length of analy zed sequences on the performance of BLISS2. Window size 200 and M_score cutoff 0.75 we re applied in this comparison test. FP-1 represents the false positive for the first data set; FP-2 represents the false positive for the second data se t; TP-1 represents the true positive for the first data set; TP-2 represents the true positiv e for the second data set. The results displayed in Figure 4-1 shows th at the true positives were almost the same for both the testing data sets, although th e true positives for th e second data set is slightly less than the true positives for the firs t data set. The major difference lies in the false positive. When a higher BLISS_score cu toff was applied, false positives were not significantly affected by the lengths of the sequences. However, the false positives increased dramatically for the second testi ng data set when the BLISS_score cutoff was decreased. Thus, for practical purpose, a hi gh BLISS_score cutoff is suggested to avoid large amount of false positive for the longe r sequences, although the sensitivity of BLISS2 will be sacrificed at the same time. Testing BLISS2 in Real-world Examples Even-skipped gene (eve) is expressed in seven stripes in the embryo of Drosophila melanogaster (D.mela) during the blastoderm stage. The enhancer for the second stripe

PAGE 69

57 (S2E) is the best studied regulatory module and include s binding sites for five transcription factors, biocoid (Bcd), gian t (gt), hunchback (Hb) Kruppel (Kr) and sloppypaired (slp) (Small et al. 1992; Arnosti et al. 1996; Andrioli et al. 2002). D.mela, D. yakuba (D.yaku), D. erecta, and D. pseudoobs cura (D.pseu) belong to same subgenus (Sophophora) and experimental investigations on S2E suggests that S2E is conserved during evolution among these four closely rela ted species. A further detailed inspection has revealed that S2E include s two regulatory complexes, wh ere the binding sites and the distances among binding sites are highly conser ved (Ludwig et al. 2005). The make-ups of S2Es in D.mela and D.pseu have been di splayed in Figure 4-2. However, no detailed information has been published about S2E in two species: D. mojavensis (D.moja) and D. virilis (D.viri), which are di stantly related to the above four species and belong to a separate subgenus (Drosophila). We extracte d 6k bp sequences befo re the transcription start region both from the D.moja and D.viri genomic sequences and applied BLISS2 to identify S2E in them. Figure 4-2. Regulatory comple xes in S2E of D.mela and D.pseu. The two complexes are highly conserved during the evolution.

PAGE 70

58 In the first experiment, BLISS demonstrated its ability to predict the locations of S2E both in D.viri and D.moja using S2E seque nces of D.mela and D.pseu. For instance, three regulatory complexes were reported (F igure 4-3) when BLISS was applied to the S2E sequence of D.pseu and the 6k bp sequenc e from D.viri (M_score cutoff=0.75 and BLISS_score cutoff=2.9). By investigati ng the matched TFBSs listed by BLISS2, the second and third complexes were those two S2E regulatory complexes in D.mela and D.pseu (Figure 4-4). Thus the position of S2 E in D.viri could be located based on the positions of those two complexes. Identifying S2E in D.moja was carried out in the similar way. Figure 4-3. Identification S2E in 6 kb D.vi ri sequence using the S2E of D.pseu. Three complexes are predicted and the second and third one are those two in S2E.

PAGE 71

59 Figure 4-4. Regulatory comple xes of S2E in D.viri and D.moja. a) TFBSs in complex 1 b) TFBSs in complex 2. In previous research (Meng et al. 2006), BLISS1 was applied to the same sequence pair and the S2E in D.viri c ould be detected due to the match of one complex in S2E. Another complex could not be matched at the same time because of the distance difference between the two complexes in those two sequences. The BLISS_score therefore was inevitably lower because the BLISS_score in BLISS was the average over

PAGE 72

60 the whole range of the S2E sequence. In th at investigation, it indicated that the BLISS_score would be more sensitive and si gnificant when either complex sequence was used as the short sequence. In fact, we don t have this kind of prior knowledge in most cases. BLISS2 displayed its advantage over BLISS1 in that it was able to predict regulatory complexes by detecting local ma ximum regions with similar binding site profiles without requ iring prior knowledge about th e complexes in the orthologous sequences. BLISS2 further demonstrated its advantag e over BLISS1 for identifying conserved regulatory complexes in two long genomic sequences, which could not be carried out by BLISS1 at all. For instance, when BLISS2 was applied to two 6k bp sequences before the transcription start region of eve gene from D.pseu and D.viri (M_score cutoff=0.75 and BLISS_score cutoff=2.9), 15 potential regula tory complexes were reported by BLISS2. By inspection of the make up of these complexes, it turned out that the 9th and 14th complexes were those two regulatory complexes in S2E. In contrast, rVista2.0 could only identify the first two binding sites in comple x1 of S2E in the region with detectable DNA similarity when the same binding site score cuto ff (0.75) was applied. It is difficult for us to verify the other 13 complexes identified by BLISS2 except for the two known S2E complexes due to the lack of detailed regulat ory information for those regions. However, it is well known that those two 6k bp sequences are rich in regulatory information. At the very least, it should harbor enhancers for the expression of the eve gene in some of the 7 stripes in the embryo. Website Implementation of BLISS2 BLISS2 was implemented as a web-based software that can be freely accessed by the public. To start, BLISS2 requires th e users to input two DNA sequences that

PAGE 73

61 potentially harbor shared re gulatory information. An M_ score (binding site score) threshold could be chosen by the user on the ve ry same page. Then a color plot indicating conserved regulatory regions is returned to the user with the statistical analysis of BLISS_score under that M_score threshold. Th e user then chooses a BLISS_score cutoff to output all conserved comp lexes with maximum BLISS_score above the cutoff. For each complex, a table of contributing binding s ites could be listed ba sed on the product of the p-value of the matching binding sites, or the location of the binding sites, or their contributions to the BLISS_score. Currently, the length of each input sequence for BLISS2 is limited to 8000 bps. Discussion In our previous investigati on (Meng et al. 2006), we prove d that it is feasible to identify conserved regulatory modules fr om highly diverged orthologous genomic sequences at the binding site level. In this study, we extended our previous method to the two long sequence case and the regulatory modules could be detected by identifying conserved regulatory complexes. The perfor mance of BLISS2 was tested using both simulated sequences and real world examples. There are several limitations to BLISS2. Fi rst, BLISS2 is restricted to detect conserved binding sites groups (regulatory co mplexes) with small inter-TFBS distance variations. If the spaces between/among bindi ng sites are not stric tly required for the interactions between/among tran scription factors and the vari ation of inter-TFBS distance is high, BLISS2 may miss those regulatory modules. The coverage and quality of the frequency matrices for TFBSs is another limitation since the analysis of BLISS2 is totally dependent on the accuracy of the binding site profiles. However, there is no doubt that

PAGE 74

62 the coverage and quality of frequency matric es are being/will be improved dramatically and the performance of BLISS2 is exp ected to be enhanced along with it. Although further improvement is required, BLISS2 demonstrates the ability to detect biologically significant similari ties in those sequenc es where DNA sequence similarity is undetectable. The basi c idea underlying BLISS2 is novel, but straightforward. And therefore it potentially could be implemented in a number of ways to improve performance for the different applications. Conclusions In summary, using simulated sequences and real world examples, BLISS2 demonstrated that regulatory complexes, the cl usters of binding site s with small variation of inter-TFBS distance during evolution, could be identified in highly diverged orthologous sequences by comparison at the binding site level. By detecting those conserved regulatory complexes and expl oring matched TFBSs in them, BLISS2 facilitates the localization of regulatory re gions and therefore can assist biomedical scientists in deciphering the my stery of gene regulation. Futu re direction of the project includes developing new algorithm to improve the selectivity and sensitivity of the current algorithm, detecting regulatory complexes from multiple orthologous sequences, and identifying regulatory modules with higher inter-TFBS di stance variations. Methods Generating Simulated Sequences Two testing data sets were generated for the development of the methodology. The first data set had 2000 sequence pairs, which were divided into four groups with each group having 500 sequence pairs. The backbone for each sequence was

PAGE 75

63 constructed as 1000 bp random DNA sequence w ith the ratio among A, T, C and G being 30% : 30% : 20% : 20% since the non-transcribed genomic regions are generally AT rich. Each regulatory complex contained binding sites for 4-8 transcription factors, which are randomly extracted fr om TRANSFAC 9.1 database. Th e range of distance (di) between consecutive TFBSs was based on the sta tistical analysis by Qiu et al. (Qiu et al. 2002) on the TransCompel database: 65% of the distance between consecutive binding sites was within 5-20 bps, 22% was within 21 -50 bps and the rest 13% was within 51-60 bps. The conserved regulatory complex for the second sequence was formulated based on the following rule: First, it consisted of binding sites for the same transcription factors in the corresponding regulatory complex in th e first sequence. The binding sites of two corresponding transcription f actors were randomly extracted from the binding site instances collected in TRANSFAC 9.1 database and therefore they could have the same or different sequences; second, the respect ive order of binding sites in those two corresponding regulatory complex was kept th e same; finally, dj, the distance between consecutive binding sites in the conserved re gulatory complex in the second sequence, was randomly chosen to have a value from range (did,di+ d). d is the variation of distance between corresponding binding sites in those two regulatory complexes and we chose d = 4 in this study. Each sequence in the sequence pairs in the first group was a 1000 bp random DNA sequence and had no introduced complex. For the second group, one regulatory complex was inserted at a random location into the fi rst sequence of each se quence pair; and then the corresponding conserved regulatory comple x was inserted at a random location into the second sequence. In the similar manner, two and three regulatory complexes were

PAGE 76

64 inserted into each sequence pairs, respectivel y, for the third and the fourth group. This composition of simulated sequences reflected the fact that the analyzed orthologous sequences may share 1) no of regulatory complex, 2) one regulatory complex, or 3) more than one regulatory complexes. All the above held true for the second test ing data set except that the length of the backbone of each sequence was 5000 bps. Identifying Conserved Regulatory Comple xes by Comparison at the Binding-site Level M_scores, P_scores and G_scores. M_scores, P_scores and G_scores for both DNA sequences were calculated in the exact same manner as was described in chapter 3 (see Methods part) based on frequency matrices collected by TRANSFAC 9.1. The G_score profile for each sequence is a M*L matrix, where M is twice of the number of TFBSs in the frequency matrix library and L is the leng th of the sequence. The top half of G_score matrix is the score profile for the forwar d strand and the bottom half of the G_score matrix is for the complementary strand. Pre_BLISS_score. Calculation of Pre_BLISS_score is an intermediate step for computing the BLISS_score from the G_score profiles. Pre_BLISS_score was calculated as follows: 1 0] [ 2 ] [ 1 ] [ _ PrM mj m score G i m score G j i score BLISS e where G_score1 is the G_score profile for the first sequence and G_score2 is that for the second sequence. M is twice of the number of TFBSs in the frequency matrix library and m is the index of the TFBS. Pr e_BLISS_score is a two-dimensional L1*L2

PAGE 77

65 matrix where L1 and L2 correspond to the lengths of the two DNA sequences respectively. BLISS_score. BLISS_score is the average of Pr e_BLISS_scores over a window size w: w k j k i score BLISS e j i score BLISSw w k/ ] [ _ Pr ] [ _2 / 2 / Same as Pre_BLISS_score, BLISS_score is also a two-dimensional L1*L2 matrix where L1 and L2 are the lengths of the two DNA sequences. Reporting Conserved Regulatory Complexes. Conserved regulatory complexes would be reported according to the BLISS_score matrix. To start, the maximum point BLISS_score[x,y] was found in the matrix. If BLISS_score[x,y] was less than the BLISS_score cutoff, we would report that there is no conserved regulatory region for these two sequences; otherwise, we extende d this maximum point along the diagonal in both directions until the BLISS_scores droppe d below the BLISS_score cutoff. Suppose these two end points were BLISS_score[x1,y1] and BLISS_score[x2,y2], then we would report that we detected one shared regulat ory complex, which started from x1-w/2 and ended at x2+w/2 in the first and covered from y1-w/2 to at y2+w/2 in the second sequence. The rest of the BLISS_score matrix was searched recursively in the same way to find other shared regulatory complexes. Performance Testing Using Simulated Sequences Using our simulated data sets, the perf ormance of BLISS2 was tested under different BLISS_score cutoff and window sizes The true positive and false positive for each test under different condition were recorded. Suppose a shared regulatory complex identifi ed by BLISS starts from x1 and ends at x2 in the first sequence and covers range y1 to y2 in the second sequence, while the

PAGE 78

66 truly inserted regulatory complex is from x1 to x2 in the first sequence and from y1 to y2 in the second sequence. We decide d that it was a corr ect prediction if: 1. Identified complexes in both sequences c overed more than 60% of the region of the truly inserted complexes. 2. Two lines were created in two-dimensional space: one starting from point (x1, y1) and ending at point (x2, y2) ; the other starting from poi nt (x1, y1) and ending at point (x2, y2), and the distance between these two lines was found to be less than 10. The regulatory complex, which was pred icted by BLISS2 and truly existed in simulated sequences, would be counted as a tr ue positive; while the regulatory complex, which was predicted by BLISS2 but did not ex ist in the simulated sequences, would be counted as false positive. Searching for the eve2 Regulatory Comp lex in D. virilis and D. mojavanis The GenBank (http://www.ncbi.nlm.nih.gov/ ) accession numbers for S2E sequences are AF042709 (D. melanogaster) and AF042712 (D. pseudoobscura). The two 6k bp seqeunces of D.viri and Dmoja are the re gions right in front of the transcription initiation site of the eve gene in the genomic sequences of D.viri and D.moja. M_score cutoff 0.75 and BLISS_score cutoff 2.9 (def ault M_score and BLISS_score cutoff of BLISS2) were used in both experiments. Website Construction BLISS2 is hosted by a Apache Tomcat 5.5 server and available at: http://gene1.ufscc.ufl.edu:8080/bliss2/index.html The website was implemented by JavaBean/JSP/HTML and the color display of BLISS_score was plotted by tools provided by DISLIN (http://www.mps.mpg.de/dislin/ ).

PAGE 79

67 CHAPTER 5 BLISS 2.0: THE WEB-BASED TO OL FOR PREDICTING CONSERVED REGULATORY MODULES IN DISTANTLY-RELATED ORTHOLOGOUS SEQUENCES Identifying functional Transc ription Factor Binding Site s (TFBSs) in the regulatory region of DNA sequence is essential to understa nd gene regulation at transcription level. In eukaryotes, distinct TFBSs are often gr ouped together into regulatory modules to control a particular aspect of gene expression. The composition of a particular regulatory module can be identified by e xperimental approaches; for example, through enhancer region dissection, DNase hypersensitivity assay, DNA foot printing, et c. However, most of these approaches are time-consuming, la borious and costly. More importantly, many of these approaches, such as the DNase hypersensitivity assay, can only be applied to a short DNA sequence (a few hundred base pairs). It is almost impossible if the relative location of the module cannot be narrowed to an experimentally testable region. With the emergence and application of bi oinformatics, computational approaches have been developed to predict regulatory module and help the desi gn and verification of laboratory experiments. A number of such co mputational approaches (Aerts et al 2005; Sandelin et al. 2004; Loots, G.G. and I. Ovch arenko 2004; Sharan et al. 2003; Sinha et al. 2004) are through cross-genome comparis on. The assumption underlying these approaches is that DNA sequences encodi ng functional TFBSs are conserved during evolution due to selectiv e pressure, whereas non-func tional DNA sequences evolve (mutate) much faster. Therefore, it is lik ely that the conserved regulatory modules could be identified at the regions with hi gh DNA similarity in two orthologs.

PAGE 80

68 While this type of approaches has been pr oven very helpful in some cases, there is also limitation. The success of these a pproaches depends on identifying significant sequence level similarity between (among) the DNA regions that harbor the regulatory modules. However, TFBS sequences are dege nerative in nature and that they are interspersed by non-functional se quences, which are not under conservation pressure. It is therefore entirely possibl e that although the regulatory modules are conserved, the overall similarity of the sequences harbori ng regulatory modules is only marginal and cannot be distinguished from background. Applications based on cross-genome comparison at DNA sequence level will fail when the orthologous DNA sequences are highly diverged. Therefore, there is a need for a complementary tool that could allow users to detect conserved regulatory modules from diverged DNA sequences. We developed a novel methodology, BLISS 2. 0 (Binding-site Level Identification of Shared Signal-module 2.0 version), for iden tifying evolutionarily conserved regulatory modules in pairs of orthologs based on the comparison at bind ing site level. Considering the conservation pressure is at the binding si te level rather than the DNA sequence level. By integrating the Gaussian smoothing and sta tistical analysis, we demonstrated that our approach outperforms regular sequence le vel comparisons when the orthologous DNA sequences are highly diverged. BLISS 2.0 is now implemented as a web-based tool and can be access by the resear ch community freely at: http://gene1.ufscc.ufl.edu:8080/bliss2/index.html.

PAGE 81

69 Figure 5-1. Web interface of BLISS 2.0. This is the homepage of BLISS 2.0. To start, users are required to input two or thologous DNA sequences to start the search. Given two orthologous DNA seque nces, BLISS 2.0is able to output all potentially conserved regulatory modules between t hose two sequences. BLISS 2.0 analysis proceeds in three major steps: To begin, users are required to input two raw DNA sequences (Figure 5-1). BLISS 2.0 generates binding site profiles for each sequence based on matrices collected by Transfac 9.1 (Matys et al. 2003). And on the same page, users are allowed to choose a binding site score cutoff. Either 0.75 or 0.8 can be selected as the binding score cutoff to determine if a specific binding site is found in a certain location in the DNA sequence. This cutoff is based on a balance between specificity and sensitivity. Higher score cutoff increases the specificity, but decreases the sensitivity at the same time.

PAGE 82

70 Figure 5-2. The color plot of BLISS_scor e. BLISS_score indicates the degree of the conservation at the binding site level between two sequences. Second, by integrating Gaussian smoothing and statistical anal ysis on the binding site profiles comparison, BLISS 2.0 indicat es the degree of the conservation at the binding site level between tw o sequences as BLISS_scores, which are displayed and visualized as a two-dimensional color plot (Figure 5-2). The vert ical color bar on the right side of the color plot shows the value of BLISS_scor e that the color represents. Continuous high BLISS_scores along diagonal direction indi cates a potential match of conserved regulatory modules betw een two sequences. To be able to evaluate a particular BLISS_score, we analyzed the distribution of BLISS_scores using simulated sequence pairs as shown in Figure 5-3. There are two BL ISS_score distributions. The left one is the distribution when a random sequence matc hes with a random sequence or a sequence

PAGE 83

71 harboring a regulatory module and the right is the distribution when two conserved regulatory modules matches. Based on this statistical analysis, users can choose a BLISS_score cutoff to output all matched regions with the BLISS_score greater than this cutoff to be conserved regulatory modules. Us ers also have the option to determine how to rank the shared TFBSs in reported regulatory modules. They can be ranked by locations, by numeric contribu tions to BLISS_score or by product of p-values of the matching TFBSs on both sequences. Figure 5-3. Statistical analys is and distribution of BLISS_sc ore. Statistical analysis of BLISS_score helps to evaluate a BLISS_score. Finally, all the matching regulatory modules with BLISS_score greater than the BLISS_score cutoff that users choose are displayed on the thir d page. Contributing TFBSs are listed in a separated table for each matched region. BLISS 2.0 provides the

PAGE 84

72 option to let users highlight th e TFBS they are interested in as green color as shown in Figure 5-4. Figure 5-4. Output of contri buting TFBSs. Contributing TFBS s are listed in a separated table for each matched CRM. Users can choose to highlight the TFBS they are interested in as green color. It has been experienced by many researcher s that the laboratory experiments in a model organism has narrowed down the location of the regulatory module for a particular gene to a relatively short region, whereas for the ortholog in a less-studied organism, information about the localizati on of the module is absent. To be able to deal with this special instance, BLISS 2.0 provide a co mplementary tool Single Cis-Regulatory Module Search, to locate the position of the conserved regulatory module in the ortholog of the less-studied organism. We suggest users to use this tool especially in the case that the length of one of the orthologous DNA sequences is less than 200 bps.

PAGE 85

73 The advantage of BLISS 2.0 is that it allo ws the detection of conserved regulatory modules in highly diverged sequences, on which BLISS 2.0 outperforms the existing methods that are based on sequence similarity comparison. We have successfully applied BLISS 2.0 to identify the Even-skipped (eve ) stripe 2 enhancer (S2E) in D. mojavenis and D. virilis, which no detailed informa tion has been published about and cannot be detected by existing tools like BLAST, rVista 2.0 and etc.

PAGE 86

74 CHAPTER 6 CONCLUSIONS AND SUGGESTIONS FOR FUTURE STUDY In this study, we developed a novel me thodology to identify conserved regulatory modules at the binding site level and impl emented it as a freely accessed web-based application named BLISS 2.0. To our knowledge, this is the first time that the feasibility of comparison at binding site level is confirmed. In the first release of BLISS, we assumed biologists have the sequence for a regulatory module and BLISS could identify the conserved module from its orthologous sequence, where the DNA similarity could not be detected and thus the methods base d on sequence similarity failed. After the success of BLISS1, we further extended this methodology to be applicable for a more general scenario. In BLISS2, potential regulatory modules can be predicted between two orthologous sequences by detecting conserved regulatory complexes. The advantage of this methodology is that it allows the det ection of conserved regulatory modules in highly di vergent sequences as we have demonstrated both with simulated sequences as well as with real wo rld examples. BLISS is thus complementary to many exiting methods based on nucleotide sequence similarity. It outperforms these sequence similarity -based methods when the orthologous DNA sequences are highly diverged. BLISS therefore serves as a valuable tool to facilitate biom edical scientists to identify functional regulatory modules and de sign experimental ve rification strategies. This study investigated a novel cross-ge nome comparison strategy and therefore opened a new field for the future research, which includes developi ng new algorithm to improve the performance of the methodology, multiple sequence comparison at the

PAGE 87

75 binding site level and identifying regulatory modules with higher inter-TFBS distance variations.

PAGE 88

76 APPENDIX A PARAMETER OPTIMIZATIONS FO R GAUSSIAN SMOOTHING METHOD Using simulated sequences, the performa nce of Gaussian smoothing method was tested under different parameters: 1) M_score cutoff, 2) variance for Gaussian smoothing, and 3) integrating p value of M_score OR without integra ting the p value of M_score (Figure A-1). Figure A-1. Parameter optimizations for Ga ussian smoothing method used in BLISS. The results indicated that 1. Integrating the p value of M_score to Gaussian smoothing method had greatly increased the performance of the method. 2. Bigger variance gave better performance. 3. BLISS got the best predictions when bindi ng site score cutoff is between 0.75 and 0.85. Considering the binding sites fro m TRANSFAC are identified binding sites and they intended to have higher binding site score, so 0.75 and 0.8 were chosen as the default cutoff for binding site score in the implementation of BLISS. Gaussian Smooth Method Tests Using Simulated Sequences40 50 60 70 80 90 100 0.50.60.70.80.91 Score CutoffCorrect Predictions/100 Var=0 Var=1 Var=2 Var=3 Var=0(P value) Var=1(P value) Var=2(P value) Var=3(P value)

PAGE 89

77 APPENDIX B A DYNAMIC PROGRAMMING ALGORITHM FOR IDENTIFYING CONSERVED REGULATORY MODULES Dynamic programming algorithms have been applied extensively in computational sequence analysis. In this study, efforts were performed to apply dynamic programming on the binding site profiles to identify conserved regulatory m odules in orthologous sequences. The idea underlying this dynamic programmi ng algorithm is to look for the best local alignments between the binding site profiles of two sequen ces and the algorithm consists of three steps: 1. Calculations of P_score profiles for tw o sequences, which are exactly same as described in the Method part of Chapter 3. 2. A matrix F was constructed based on the P_score profiles of two sequences, where F(i, j) is the score of the best alignment between the initial segment of position 1 to i of first sequence and the initial segment of 1 to j of second sequence. F(i, j) was computed recursively as below: F(i, j) = max F(i, j) could be calculated from F(i-1, j-1), F(i-1, j) and F( i, j-1). F(i, 0) and F(0, j) were set to 0 for all i and j. F(i, j) = 0. When F(i, j) has a negative score in some points, we will give it value 0 to start a new alignment. F(i, j) = F(i-1, j-1) + s(seq1i, seq2j). When position i in sequ ence 1 and position j in sequence 2 at least have one shared binding site, s(seq1i, seq2j) will be the maximum of P_score1(k, i) and P_score(k, j) for all k, where k is the index of the binding site; when there is no shared bindi ng site at position i in sequence 1 and position j in sequence 2, but th ere is at least one binding site either at position i in 0, F(i-1, j-1) + s(seq1i, seq2j) F(i-1, j) + d F(i, j-1) + d

PAGE 90

78 sequence 1 or position j in sequence 2, then s(seq1i, seq2j) would be set to a penalty called TF_penalty; when there is no a ny binding site either at position i in sequence 1 or position j in sequence 2, s(seq1i, seq2j) was set to zero. F(i, j) = F(i, j-1) +d. This is the case when seq2j was aligned to a gap. When there is at least one binding site at position j of sequence 2, d was set to TF_penalty; when there is no any binding site at position j of sequence 2, d was set to N_penalty. F(i, j) = F(i-1, j) +d. This is the case when seq1i, was aligned to a gap. When there is at least one binding site at position i of sequence 1, d was set to TF_penalty; when there is no any binding site in position i of sequence 1, d was set to N_penalty. 3. After the construction of ma trix F, find the maximum F(i, j), which corresponds to the end points of conserved regulatory modul e. Then trace back until to a point with value 0, which corresponds to the start points of conserved module. This dynamic programming algorithm was a pplied to 500 simulated sequence pairs and each sequence pair has only one conser ved module. This sequence group corresponds to the second group simulated se quence in the first data set used for BLISS2 performance test in chapter 4. The best results were achieved when TF_penalty equals to and N_penalty equals to When 0.8 was chos en as the M_score cutoff, the conserved regulatory modules in 168 seque nce pairs could be detected and the rest of 332 local maximum reported by this dynamic program ming algorithm was false positives. As a conclusion, this study addressed the feasibility of identifying conserved regulatory modules in ort hologous sequences by applyi ng the dynamic programming algorithm on the binding s ite profiles. Even though, Gaussian smoothing method developed in chapter 3 and chapter 4 achieved a better pe rformance than this dynamic programming algorithm. So, BLISS took the Ga ussian smoothing method, instead of this dynamic programming algorithm, for the web implementation.

PAGE 91

79 APPENDIX C AN IMPROVED ALGORITHM FOR BLISS2 In the current algorithm for BLISS2, we chose to take the sum of G_scores over all TFs at each matched position when BLISS_sc ore was calculated. The performance of BLISS2 was improved (Table C-1 and C-2) when was instead took the maximum of G_scores at each matched location. Table C-1. Performance test of the impr oved BLISS2 for M_score cutoff=0.75. False Positives (FP) and True Positives (TP) are listed under different window size and BLISS score cutoff. BLISS_score cutoff Window size Type 0.27 0.26 0.25 0.24 0.23 0.22 0.21 FP 2565 3748 5272 7180 100 TP 1045 1174 1307 1434 FP 262 663 2412 6496 150 TP 1185 1614 1956 2157 FP 40 92 416 2157 6096 200 TP 724 1165 1686 2062 2167 FP 6 23 75 617 3277 250 TP 352 699 1180 1736 2000 Table C-2. Performance test of the impr oved BLISS2 for M_score cutoff=0.8. False Positives (FP) and True Positives (TP) are listed under different window size and BLISS score cutoff. BLISS_score cutoff Window size Type 0.28 0.27 0.26 0.25 0.24 0.23 0.22 FP 1779 2968 5489 100 TP 1373 1602 1794 FP 288 575 1412 3572 6805 150 TP 1526 1822 2078 2263 2340 FP 46 68 185 686 2395 200 TP 1034 1374 1734 2063 2259 FP 16 36 142 684 2527 250 TP 851 1222 1622 1967 2190

PAGE 92

80 The method to calculate BLISS_score for th e improved algorithm is the same as the algorithm of BLISS2 except that the Pre_ BLISS_score was calculated as below: ]) [ 2 ] [ 1 ( ] [ _ Pr m j score G m i score G MaximumOf j i score BLISS e for m = 0, 1, 2 where m is the index of the TFBS. The improvement was more dramatic for the second data set that is used in chapter 4, which had longer backbone sequences. Table C-3 displays the performance comparison between the two algorithms unde r the condition where the window size is 200 and the binding score cutoff is 0.75. Table C-3. Performance comparison of two BLISS2 algorithms for second data set when M_score cutoff is 0.75. False Positives ( FP) and True Positives (TP) are listed under different BLISS score cutoff. Algorithm Type BLISS_scores 2.9 2.8 2.7 2.6 2.5 FP 33 111 359 1014 2778 BLISS2 TP 340 458 611 804 1029 0.25 0.245 0.24 0.235 0.23 FP 27 79 217 861 2953 Improved BLISS2 TP 748 937 1198 1431 1653 This improvement may reflect the fact th at only one TF can exclusively bind to a location in the real biological environment and the performance of the BLISS2 algorithm could be compromised due to the repeated contributions of overlapped TFBSs to the BLISS_score. However, the results included more false positives when this improved algorithm was applied to S2E example. Consid ering S2E is the only real world example we have so far, more researches need to be continued and more real world examples are required for the further testing of this improved algorithm for BLISS2. The BLISS_score distribution for the im proved algorithm was calculated using the same method as described in chapter 3 a nd are shown in Figure C-1 and Figure C-2.

PAGE 93

81 Figure C-1. Distribution of BLISS_score for the improved BLISS2 algorithm with the M_score cut_off value of 0.75. Figure C-2. Distribution of BLISS_score for the improved BLISS2 algorithm with the M_score cut_off value of 0.8.

PAGE 94

82 LIST OF REFERENCES Aerts S, Thijs G, Coessens B, St aes M, Moreau Y, De Moor B: Toucan: deciphering the cis-regulatory logic of coregulated genes. Nucleic Acids Res 2003, 31(6):17531764. Aerts S, Van Loo P, Thijs G, Mayer H, de Martin R, Moreau Y, De Moor B: TOUCAN 2: the all-inclusive open source work bench for regulatory sequence analysis. Nucleic Acids Res 2005, 33(Web Server issue):W393-396. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215:403-410. Andrioli LP, Vasisht V, Theodosopoul ou E, Oberstein A, Small S: Anterior repression of a Drosophila stripe enhancer requi res three position-specific mechanisms. Development 2002, 129(21):4931-4940. Arnosti DN: Design and function of transcriptional switches in Drosophila. Insect Biochem Mol Biol 2002, 32(10):1257-1273. Arnosti DN, Barolo S, Levine M, Small S: The eve stripe 2 enhancer employs multiple modes of transcriptional synergy. Development 1996, 122(1):205-214. Bailey TL, Elkan C: The value of prior knowledge in discovering motifs with MEME. Proc Int Conf Intell Syst Mol Biol 1995, 3:21-29. Bailey TL, Gribskov M: Methods and statistics for combining motif match scores. J Comput Biol 1998, 5(2):211-221. Bailey TL, Gribskov M: Concerning the accuracy of MAST E-values. Bioinformatics 2000, 16(5):488-489. Benos PV, Lapedes AS, Stormo GD: Probabilistic code for DNA recognition by proteins of the EGR family. J Mol Biol 2002, 323(4):701-727. Berman BP, Nibu Y, Pfeiffer BD, Tomancak P, Celniker SE, Levine M, Rubin GM, Eisen MB: Exploiting transcription factor bind ing site clustering to identify cis-regulatory modules involved in pa ttern formation in the Drosophila genome. Proc Natl Acad Sci U S A 2002, 99(2):757-762.

PAGE 95

83 Berman BP, Pfeiffer BD, Laverty TR, Salzbe rg SL, Rubin GM, Eisen MB, Celniker SE: Computational identification of deve lopmental enhancers: conservation and function of transcription factor bi nding-site clusters in Drosophila melanogaster and Drosophila pseudoobscura. Genome Biol 2004, 5(9):R61. Boulikas T: A compilation and classification of DNA binding si tes for protein transcription factors from vertebrates. Crit Rev Eukaryot Gene Expr 1994, 4(23):117-321. Bray N, Dubchak I, Pachter L: AVID: A global alignment program. Genome Res 2003, 13(1):97-102. Bucher P: Weight matrix descriptions of four eukaryotic RNA polymerase II promoter elements derived from 502 unrelated promoter sequences. J Mol Biol 1990, 212(4):563-578. Bulyk ML, Johnson PL, Church GM: Nucleotides of transcription factor binding sites exert interdependent effects on the bindi ng affinities of transcription factors. Nucleic Acids Res 2002, 30(5):1255-1261. Che D, Jensen S, Cai L, Liu JS: BEST: binding-site estima tion suite of tools. Bioinformatics 2005, 21(12):2909-2911. Chekmenev DS, Haid C, Kel AE: P-Match: transcription factor binding site search by combining patterns and weight matrices. Nucleic Acids Res 2005, 33(Web Server issue):W432-437. Chen QK, Hertz GZ, Stormo GD: MATRIX SEARCH 1.0: a computer program that scans DNA sequences for transcriptional el ements using a database of weight matrices. Comput Appl Biosci 1995, 11(5):563-566. Cora D, Di Cunto F, Provero P, Silengo L, Caselle M: Computational identification of transcription factor binding sites by functi onal analysis of sets of genes sharing overrepresented upstream motifs. BMC Bioinformatics 2004, 5:57. Day WH, McMorris FR: Critical comparison of consensus methods for molecular sequences. Nucleic Acids Res 1992, 20(5):1093-1099. Elkon R, Linhart C, Sharan R, Shamir R, Shiloh Y: Genome-wide in silico identification of transcriptional regula tors controlling the cell cycle in human cells. Genome Res 2003, 13(5):773-780. Erives A, Levine M: Coordinate enhancers share common organizational features in the Drosophila genome. Proc Natl Acad Sci U S A 2004, 101(11):3851-3856. Fickett JW: Coordinate positioning of ME F2 and myogenin binding sites. Gene 1996, 172(1):GC19-32.

PAGE 96

84 Frazer KA, Elnitski L, Church DM, Dubchak I, Hardison RC: Cross-species sequence comparisons: a review of methods and available resources. Genome Res 2003, 13(1):1-12. Frech K, Herrmann G, Werner T: Computer-assisted prediction, classification, and delimitation of protein bindi ng sites in nucleic acids. Nucleic Acids Res 1993, 21(7):1655-1664. Frech K, Quandt K, Werner T: Finding protein-binding sites in DNA sequences: the next generation. Trends Biochem Sci 1997, 22(3):103-104. Fried M, Crothers DM: Equilibria and kinetics of lac repressor-operator interactions by polyacrylamide gel electrophoresis. Nucleic Acids Res 1981, 9(23):6505-6525. Frith MC, Hansen U, Weng Z: Detection of cis-element clus ters in higher eukaryotic DNA. Bioinformatics 2001, 17(10):878-889. Galas DJ, Schmitz A: DNAse footprinting: a simple method for the detection of protein-DNA binding specificity. Nucleic Acids Res 1978, 5(9):3157-3170. Garner MM, Revzin A: A gel electrophoresis method fo r quantifying the binding of proteins to specific DNA regions: applicat ion to components of the Escherichia coli lactose operon regulatory system. Nucleic Acids Res 1981, 9(13):3047-3060. Gershenzon NI, Stormo GD, Ioshikhes IP: Co mputational technique for improvement of the position-weight matrices fo r the DNA/protein binding sites. Nucleic Acids Res 2005, 33(7):2290-2301. Grad YH, Roth FP, Halfon MS, Church GM: Prediction of similarly acting cisregulatory modules by subsequence prof iling and comparative genomics in Drosophila melanogaster and D.pseudoobscura. Bioinformatics 2004, 20(16):2738-2750. Halfon MS, Michelson AM: Exploring genetic regulatory networks in metazoan development: methods and models. Physiol Genomics 2002, 10(3):131-143. Hannenhalli S, Levy S: Predicting transcription factor synergism. Nucleic Acids Res 2002, 30(19):4278-4284. Heinemeyer T, Wingender E, Reuter I, He rmjakob H, Kel AE, Kel OV, Ignatieva EV, Ananko EA, Podkolodnaya OA, Kolpakov FA et al : Databases on transcriptional regulation: TRANSFAC, TRRD and COMPEL. Nucleic Acids Res 1998, 26(1):362-367. Hernandez-Munain C, Krangel MS: Regulation of the T-cell receptor delta enhancer by functional cooperation between c-Myb and core-binding factors. Mol Cell Biol 1994, 14(1):473-483.

PAGE 97

85 Hu Z, Frith M, Niu T, Weng Z: SeqVISTA: a graphical t ool for sequence feature visualization and comparison. BMC Bioinformatics 2003, 4:1. Hu Z, Fu Y, Halees AS Kielbasa SM, Weng Z: SeqVISTA: a new module of integrated computational tools for studying transcriptional regulation. Nucleic Acids Res 2004, 32(Web Server issue):W235-241. Huang H, Kao MC, Zhou X, Liu JS, Wong WH: Determination of local statistical significance of patterns in Markov seque nces with application to promoter element identification. J Comput Biol 2004, 11(1):1-14. Jegga AG, Sherwood SP, Carman JW, Pinski AT, Phillips JL, Pestian JP, Aronow BJ: Detection and visualization of composit ionally similar cis-regulatory element clusters in orthologous and coordinately controlled genes. Genome Res 2002, 12(9):1408-1417. Johansson O, Alkema W, Wa sserman WW, Lagergren J: Identification of functional clusters of transcription factor bind ing motifs in genome sequences: the MSCAN algorithm. Bioinformatics 2003, 19 Suppl 1:i169-176. Johnson PF, McKnight SL: Eukaryotic transcriptional regulatory proteins. Annu Rev Biochem 1989, 58:799-839. Kel AE, Gossling E, Reuter I, Cheremus hkin E, Kel-Margoulis OV, Wingender E: MATCH: A tool for searching trans cription factor binding sites in DNA sequences. Nucleic Acids Res 2003, 31(13):3576-3579. Kel AE, Kolchanov NA, Kel OV, Romashch enko AG, Anan'ko EA, Ignat'eva EV, Merkulova TI, Podkolodnaia OA, Stepanenko IL, Kochetov AV et al : [TRRD: a database of transcription regulat ory regions in eukaryotic genes]. Mol Biol (Mosk) 1997, 31(4):626-636. Kel AE, Kondrakhin YV, Kolpakov Ph A, Kel OV, Romashenko AG, Wingender E, Milanesi L, Kolchanov NA: Computer tool FUNSITE for analysis of eukaryotic regulatory genomic sequences. Proc Int Conf Inte ll Syst Mol Biol 1995, 3:197205. Kel OV, Romaschenko AG, Kel AE Wingender E, Kolchanov NA: A compilation of composite regulatory elements affectin g gene transcription in vertebrates. Nucleic Acids Res 1995, 23(20):4097-4103. Kel-Margoulis OV, Kel AE, Reuter I, Deineko IV, Wingender E: TRANSCompel: a database on composite regulatory elements in eukaryotic genes. Nucleic Acids Res 2002, 30(1):332-334.

PAGE 98

86 Kel-Margoulis OV, Romashchenko AG, Kolchanov NA, Wingender E, Kel AE: COMPEL: a database on composit e regulatory elements providing combinatorial transcriptional regulation. Nucleic Acids Res 2000, 28(1):311315. Knuppel R, Dietze P, Lehnberg W, Frech K, Wingender E: TRANSFAC retrieval program: a network model database of eukaryotic transcription regulating sequences and proteins. J Comput Biol 1994, 1(3):191-198. Kolchanov NA, Ananko EA, Podkolodnaya OA, Ignatieva EV, Stepanenko IL, KelMargoulis OV, Kel AE, Merkulova TI Goryachkovskaya TN, Busygina TV et al : Transcription Regulatory Regions Da tabase (TRRD):its status in 1999. Nucleic Acids Res 1999, 27(1):303-306. Kolchanov NA, Podkolodnaya OA, Ananko EA, Ignatieva EV, Stepanenko IL, KelMargoulis OV, Kel AE, Merkulova TI Goryachkovskaya TN, Busygina TV et al : Transcription regulatory regions database (TRRD): its status in 2000. Nucleic Acids Res 2000, 28(1):298-301. Kondrakhin YV, Kel AE, Kolchanov NA, Romashchenko AG, Milanesi L: Eukaryotic promoter recognition by binding sites for transcription factors. Comput Appl Biosci 1995, 11(5):477-488. Lawrence CE, Altschul SF, Boguski MS Liu JS, Neuwald AF, Wootton JC: Detecting subtle sequence signals: a Gibbs samp ling strategy for multiple alignment. Science 1993, 262(5131):208-214. Lawrence CE, Reilly AA: An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins 1990, 7(1):41-51. Lee TI, Rinaldi NJ, Robert F, Odom DT Bar-Joseph Z, Gerber GK, Hannett NM, Harbison CT, Thompson CM, Simon I et al : Transcriptional regulatory networks in Saccharomyces cerevisiae. Science 2002, 298(5594):799-804. Lemaigre FP, Lafontaine DA, Courtois SJ, Durviaux SM, Rousseau GG: Sp1 can displace GHF-1 from its distal binding site and stimulate transcription from the growth hormone gene promoter. Mol Cell Biol 1990, 10(4):1811-1814. Lenhard B, Sandelin A, Mendoza L, Engs trom P, Jareborg N, Wasserman WW: Identification of conserved regulatory elements by comparative genome analysis. J Biol 2003, 2(2):13. Lewis H, Kaszubska W, De Lamarter JF, Whelan J: Cooperativity between two NFkappa B complexes, mediated by high-mo bility-group protein I(Y), is essential for cytokine-induced expression of the E-selectin promoter. Mol Cell Biol 1994, 14(9):5701-5709.

PAGE 99

87 Liu X, Brutlag DL, Liu JS: BioProspector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. Pac Symp Biocomput 2001:127-138. Liu Y, Liu XS, Wei L, Altman RB, Batzoglou S: Eukaryotic regulatory element conservation analysis and identifi cation using comparative genomics. Genome Res 2004, 14(3):451-458. Liu Y, Wei L, Batzoglou S, Brutlag DL, Liu JS, Liu XS: A suite of web-based programs to search for transcriptional regulatory motifs. Nucleic Acids Res 2004, 32(Web Server issue):W204-207. Loots GG, Ovcharenko I: rVISTA 2.0: evolutionary analys is of transcription factor binding sites. Nucleic Acids Res 2004, 32(Web Server issue):W217-221. Loots GG, Ovcharenko I, Pachter L, Dubchak I, Rubin EM: rVista for comparative sequence-based discovery of functional transcription factor binding sites. Genome Res 2002, 12(5):832-839. Ludwig MZ: Functional evolution of noncoding DNA. Curr Opin Genet Dev 2002, 12(6):634-639. Ludwig MZ, Palsson A, Alekseeva E, Bergman CM, Nathan J, Kreitman M: Functional evolution of a cis-regulatory module. PLoS Biol 2005, 3(4):e93. Ludwig MZ, Patel NH, Kreitman M: Functional analysis of eve stripe 2 enhancer evolution in Drosophila: rules governing conservation and change. Development 1998, 125(5):949-958. Maniatis T, Goodbourn S, Fischer JA: Regulation of inducible and tissue-specific gene expression. Science 1987, 236(4806):1237-1245. Marco-Ferreres R, Vivar J, Arredond o JJ, Portillo F, Cervera M: Co-operation between enhancers modulates quantitative expression from the Drosophila Paramyosin/miniparamyosin gene in different muscle types. Mech Dev 2005, 122(5):681-694. Matys V, Fricke E, Geffers R, Gossling E, Ha ubrock M, Hehl R, Hornischer K, Karas D, Kel AE, Kel-Margoulis OV et al : TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids Res 2003, 31(1):374-378. Mayor C, Brudno M, Schwartz JR, Poliakov A, Rubin EM, Frazer KA, Pachter LS, Dubchak I: VISTA : visualizing global DNA sequence alignments of arbitrary length. Bioinformatics 2000, 16(11):1046-1047. Meng H., Banerjee A, Zhou L: BLISS: biding site level id entification of shared signal-modules in DNA regulatory sequences. BMC Bioinformatics 2006,7: 287.

PAGE 100

88 Muhlethaler-Mottet A, Di Berardino W, Otten LA, Mach B: Activation of the MHC class II transactivator CIITA by in terferon-gamma requires cooperative interaction between Stat1 and USF-1. Immunity 1998, 8(2):157-166. Osada R, Zaslavsky E, Singh M: Comparative analysis of methods for representing and searching for transcription factor binding sites. Bioinformatics 2004, 20(18):3516-3525. Powell J: Progress and prospects in evolutionary biology: The Drosophila model. Oxford: Oxford University Press; 1997. Prestridge DS: Predicting Pol II promoter sequences using transcription factor binding sites. J Mol Biol 1995, 249(5):923-932. Prestridge DS: SIGNAL SCAN 4.0: additional databases and sequence formats. Comput Appl Biosci 1996, 12(2):157-160. Qiu P: Recent advances in computational prom oter analysis in understanding the transcriptional regulatory network. Biochem Biophys Res Commun 2003, 309(3):495-501. Qiu P, Ding W, Jiang Y, Greene JR, Wang L: Computational analysis of composite regulatory elements. Mamm Genome 2002, 13(6):327-332. Quandt K, Frech K, Karas H, Wingender E, Werner T: MatInd and MatInspector: new fast and versatile tools for detectio n of consensus matches in nucleotide sequence data. Nucleic Acids Res 1995, 23(23):4878-4884. Rajewsky N, Vergassola M, Gaul U, Siggia ED: Computational detection of genomic cis-regulatory modules applied to body patterning in the early Drosophila embryo. BMC Bioinformatics 2002, 3:30. Rebeiz M, Reeves NL, Posakony JW: SCORE: a computational approach to the identification of cis-regulatory modules and target genes in whole-genome sequence data. Site clustering over random expectation. Proc Natl Acad Sci U S A 2002, 99(15):9888-9893. Ringrose L, Rehmsmeier M, Dura JM, Paro R: Genome-wide prediction of Polycomb/Trithorax response elemen ts in Drosophila melanogaster. Dev Cell 2003, 5(5):759-771. Sandelin A, Alkema W, Engstrom P, Wasserman WW, Lenhard B: JASPAR: an openaccess database for eukaryotic tran scription factor binding profiles. Nucleic Acids Res 2004, 32(Database issue):D91-94. Sandelin A, Wasserman WW, Lenhard B: ConSite: web-based prediction of regulatory elements using cross-species comparison. Nucleic Acids Res 2004, 32(Web Server issue):W249-252.

PAGE 101

89 Santini S, Boore JL, Meyer A: Evolutionary conservation of regulatory elements in vertebrate Hox gene clusters. Genome Res 2003, 13(6A):1111-1122. Schwartz S, Elnitski L, Li M, Weirauch M, Riemer C, Smit A, Green ED, Hardison RC, Miller W: MultiPipMaker and supporting tools: alignments and analysis of multiple genomic DNA sequences. Nucleic Acids Res 2003, 31(13):3518-3524. Schwartz S, Zhang Z, Frazer KA, Smit A, Ri emer C, Bouck J, Gibbs R, Hardison R, Miller W: PipMaker--a web server for alig ning two genomic DNA sequences. Genome Res 2000, 10(4):577-586. Senger K, Armstrong GW, Rowell WJ, Kwa n JM, Markstein M, Levine M: Immunity regulatory DNAs share common organizational features in Drosophila. Mol Cell 2004, 13(1):19-32. Sharan R, Ben-Hur A, Loots GG, Ovcharenko I: CREME: Cis-Regulatory Module Explorer for the human genome. Nucleic Acids Res 2004, 32(Web Server issue):W253-256. Sharan R, Ovcharenko I, Ben-Hur A, Karp RM: CREME: a framework for identifying cis-regulatory modules in human-mouse conserved segments. Bioinformatics 2003, 19 Suppl 1:i283-291. Shaw PE: Ternary complex formation over the c-fos serum response element: p62TCF exhibits dual component specifi city with contacts to DNA and an extended structure in the DNA-binding domain of p67SRF. Embo J 1992, 11(8):3011-3019. Sinha S, Blanchette M, Tompa M: PhyME: a probabilistic algorithm for finding motifs in sets of orthologous sequences. BMC Bioinformatics 2004, 5:170. Small S, Blair A, Levine M: Regulation of even-skipped st ripe 2 in the Drosophila embryo. Embo J 1992, 11(11):4047-4057. Staden R: Methods for calculating the probabilities of finding patterns in sequences. Comput Appl Biosci 1989, 5(2):89-96. Stormo GD: Information content and free energy in DNA--protein interactions. J Theor Biol 1998, 195(1):135-137. Stormo GD: DNA binding sites: repres entation and discovery. Bioinformatics 2000, 16(1):16-23. Stormo GD, Fields DS: Specificity, free energy and info rmation content in proteinDNA interactions. Trends Biochem Sci 1998, 23(3):109-113.

PAGE 102

90 Stormo GD, Schneider TD, Gold L: Quantitative analysis of the relationship between nucleotide sequence and functional activity. Nucleic Acids Res 1986, 14(16):6661-6679. Thanos D, Maniatis T: Virus induction of human IFN be ta gene expression requires the assembly of an enhanceosome. Cell 1995, 83(7):1091-1100. Thijs G, Lescot M, Marchal K, Rombauts S, De Moor B, Rouze P, Moreau Y: A higherorder background model improves th e detection of promoter regulatory elements by Gibbs sampling. Bioinformatics 2001, 17(12):1113-1122. Thijs G, Marchal K, Lescot M, Rombauts S, De Moor B, Rouze P, Moreau Y: A Gibbs sampling method to detect overrepresente d motifs in the upstream regions of coexpressed genes. J Comput Biol 2002, 9(2):447-464. Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alig nment through sequence weighting, position-specific gap penaltie s and weight matrix choice. Nucleic Acids Res 1994, 22(22):4673-4680. Tjian R, Maniatis T: Transcriptional activation: a complex puzzle with few easy pieces. Cell 1994, 77(1):5-8. Tsunoda T, Takagi T: Estimating transcription factor bindability on DNA. Bioinformatics 1999, 15(7-8):622-630. van Helden J: Regulatory sequence analysis tools. Nucleic Acids Res 2003, 31(13):3593-3596. Venkatesh B, Yap WH: Comparative genomics using fugu: a tool for the identification of conserved vertebrate cis-regulatory elements. Bioessays 2005, 27(1):100-107. Wang CY, Petryniak B, Thompson CB, Kaelin WG, Leiden JM: Regulation of the Etsrelated transcription factor Elf-1 by binding to the retinoblastoma protein. Science 1993, 260(5112):1330-1335. Weintraub H, Davis R, Lockshon D, Lassar A: MyoD binds cooperatively to two sites in a target enhancer sequence: occupancy of two sites is required for activation. Proc Natl Acad Sci U S A 1990, 87(15):5623-5627. Whitley MZ, Thanos D, Read MA, Maniatis T, Collins T: A striking similarity in the organization of the E-selectin and beta interferon gene promoters. Mol Cell Biol 1994, 14(10):6464-6475. Wingender E, Chen X, Hehl R, Karas H, Liebich I, Matys V, Meinhardt T, Pruss M, Reuter I, Schacherer F: TRANSFAC: an integrated sy stem for gene expression regulation. Nucleic Acids Res 2000, 28(1):316-319.

PAGE 103

91 Wingender E, Dietze P, Karas H, Knuppel R: TRANSFAC: a database on transcription factors and their DNA binding sites. Nucleic Acids Res 1996, 24(1):238-241. Workman CT, Stormo GD: ANN-Spec: a method for discovering transcription factor binding sites with improved specificity. Pac Symp Biocomput 2000:467-478. Yan Y, Chen H, Costa M: Chromatin Immunoprecipitation Assays. In: Epigenetics Protocols. Edited by Tollefsbol TO. Totowa, NJ: Humana Press; 2004. Yuh CH, Bolouri H, Davidson EH: Genomic cis-regulatory l ogic: experimental and computational analysis of a sea urchin gene. Science 1998, 279(5358):18961902. Zhang L, Zhou W, Velculescu VE, Kern SE Hruban RH, Hamilton SR, Vogelstein B, Kinzler KW: Gene expression profiles in normal and cancer cells. Science 1997, 276(5316):1268-1272. Zhang MQ, Marr TG: A weight array method fo r splicing signal analysis. Comput Appl Biosci 1993, 9(5):499-509. Zheng J, Wu J, Sun Z: An approach to identify overrepresented cis-elements in related sequences. Nucleic Acids Res 2003, 31(7):1995-2005. Zhou L, Steller H: Distinct pathways mediate UV-induced apoptosis in Drosophila embryos. Dev Cell 2003, 4(4):599-605. Zhou Q, Liu JS: Modeling within-motif dependence fo r transcription factor binding site predictions. Bioinformatics 2004, 20(6):909-916.

PAGE 104

92 BIOGRAPHICAL SKETCH Hailong Meng was born in Peoples Republic of China in 1973. He is currently a Ph.D. candidate at the University of Florid a, majoring in computer information science and engineering. He received a B.S. in biochemical engineer ing from Zhejiang University in 1996 and M.S. in biochemical engineering from the graduate school of the Chinese Academy of Science in 1999. His current intere sts lies in bioinformatics field, especially in novel algorithm development and biological database management.