A Novel Methodology for Identifying Conserved Regulatory Modules at the Binding Site Level

Ingest record: IEID E20101219_AAAAEA; ingest time 2010-12-19T21:49:25Z; package UFE0015612_00001; account UF; project UFDC.
4cfb80bcf30c5c372fd6ffde33bd61b0cb2a93a9
96972 F20101219_AACENL meng_h_Page_069.jpg
2ca12760cf122dbe14f5d088acf2cc5f
c7f796359bd9bc7162b7cd766e66704be07728af
101342 F20101219_AACEMX meng_h_Page_054.jpg
0c81f6ef26ed80f0aa730d6bc011ea71
abbb31a4ac05724e3ac68893b1d0d5a6a6af15f8
29338 F20101219_AACEOA meng_h_Page_085.jpg
4c3dfc39dd0da9d2c22e5961ef28a933
c652240a21c49fcbc0e653d330136ff94aeb96b6
88959 F20101219_AACENM meng_h_Page_070.jpg
9b779b513fd2c8b3ae3053b17b0edc8d
2deb4b3999e9beba2b77274bf41c44c42c566457
102493 F20101219_AACEMY meng_h_Page_055.jpg
e3f632d672482ffa7927993f3241b830
6894cf70296805384c14ac127839357976184090
96721 F20101219_AACEOB meng_h_Page_086.jpg
3dd1e2d82fb0fb35cf12b263cdea58cb
f6c505859a583c5b4b1f848996cd1f4cf372626a
57529 F20101219_AACENN meng_h_Page_071.jpg
5204e65d2309f47ea30cc2f26e5eb396
5b058314a7ee419c84aace00050ebd2fda7d463f
68541 F20101219_AACEMZ meng_h_Page_056.jpg
91495b093a8d22547685a5c9f8ef494e
a06b7ecc14eadfc868ae1610bc05210e4f7423c0
83973 F20101219_AACEOC meng_h_Page_088.jpg
333566cc20da7e8b3f01ef6fdc77f53b
310610d5ae1b65cab14487c221c94ee01c529a28
103133 F20101219_AACENO meng_h_Page_072.jpg
b546ce1ca678e3191e1ddc1b31830e1e
feb3342dc103969196b4f2e7d3f5a30ce083c54e
109215 F20101219_AACEOD meng_h_Page_090.jpg
2f5f5ee2770c8fe5014b643240f34f93
e35ef1188025c27ad9a620d3eea64297e6877e46
98406 F20101219_AACENP meng_h_Page_073.jpg
f133aac2895fd77561a25b330f29d770
117cb8fa3fa508d72331e4cc8980c864bf22a5a8
82457 F20101219_AACEOE meng_h_Page_091.jpg
137dbcaebb3ec44527976f5e365e321b
c13d48e33090502ac8c38a9f90202520875fa76a
92101 F20101219_AACENQ meng_h_Page_074.jpg
2ff5e46a0f1479f5a057e8b443c8623e
7dd2589ffa63479b3497ae32f7c5cd9375da5d20
100316 F20101219_AACEOF meng_h_Page_092.jpg
34d8f890309bd17c674ed6500b5b9ed8
d37b7e16a44bd7c1ebc5644fef51cc520a6fbadc
F20101219_AACENR meng_h_Page_075.jpg
934d10650210d825ea26d30c3d83975e
9720c289d2f1f1284f1af389fbf74f16ffd404f0
69486 F20101219_AACEOG meng_h_Page_093.jpg
cf298570f47dc54d02bd5c41aac4d492
dad4bb3545fb1cdc5c478d80d7534723ccbf821d
84785 F20101219_AACENS meng_h_Page_076.jpg
f4652ac653a2cd1375f1fee8541f827a
2823eeaa5d6f6af323cc57e10e1ef52a164bd72e
96797 F20101219_AACEOH meng_h_Page_094.jpg
0c8616f44e09654b709adefd8ee2ac17
1685a9f6025d1f25c1baa5446ab01322cdcfffbd
94692 F20101219_AACENT meng_h_Page_077.jpg
a1638ebfff92dd1d4951afe9400801c5
31d86c829ca70b875eb466b3518c8c98884bdda9
125197 F20101219_AACEOI meng_h_Page_095.jpg
0b0aef2b6d63652bb06b67fd000f5011
9bd1ef94418db0f7a35ec4424ff427407deb3e03
94587 F20101219_AACENU meng_h_Page_078.jpg
ee6e8582f53581a44ac3d5f253155996
8336d7e3b01f0581eda7e1c2d7d42a42f331fadb
130808 F20101219_AACEOJ meng_h_Page_096.jpg
7d4b987a39908d9c4402962daf55ff4b
49c2f6020cbe22e94712b71f01f6a54198ba4c68
98945 F20101219_AACENV meng_h_Page_079.jpg
77e4e870864e73045a024cc597dce85b
508a1c3be17780f87bd5818b8f4462ae65acfdf5
122857 F20101219_AACEOK meng_h_Page_097.jpg
42ade2d35143b92c0f0386e8a6f17021
2a8b1bca0e2c659239f5db157d72d884cc7a585c
89773 F20101219_AACENW meng_h_Page_080.jpg
30c92d4bcb9d65d3d063694f15e6a3a0
afd1746077532166e19fc81c12a01da528ddb170
133933 F20101219_AACEOL meng_h_Page_098.jpg
cea6c20de3893dac4086c4e6f7c70869
354ecd20acec197506b57b1ad449444c24c22c8f
91262 F20101219_AACENX meng_h_Page_081.jpg
b97cf1cca6c3d3123c4190b398fdcb5d
c7739efabbd02f37a5e8e14da47c8423ef3025fc
1051942 F20101219_AACEPA meng_h_Page_009.jp2
53bbb7e62482e4f08538411562070a06
90a7bce8ede841beb223a7c986f1bbf7e6006feb
128960 F20101219_AACEOM meng_h_Page_099.jpg
b4fcc03b59121f03452bd4c4371f86f5
e7ead39a27d4e0527d457b4c6a816eb634c30757
545115 F20101219_AACEPB meng_h_Page_010.jp2
22e21a37b2f34f054e3ac5024ed07725
c30b48a17991d22adaf44861ab4b34da061a9635
132242 F20101219_AACEON meng_h_Page_100.jpg
5e521da452058c298a9ed45d13e9124d
0c884a047cb07763dface3af544172d888ed5601
100842 F20101219_AACENY meng_h_Page_082.jpg
996d0effeb0d1a08bcbe136232188d0b
e8b100728044896b9194dc71af6846e07241e152
88131 F20101219_AACEPC meng_h_Page_011.jp2
e969c38f6e7af6de74e566a91c8b08e9
d33a1a95fc9fb45040f47c384ae0dd2349867603
118084 F20101219_AACEOO meng_h_Page_101.jpg
c5e5842fc53864f14ced14393f335013
2c030cb60b7b54ec21ba66f969033ea726b3ac5f
111717 F20101219_AACENZ meng_h_Page_084.jpg
e735123a87dc1f05c269681a0f1e7080
4b9d528ebb40978a93925d9f67d26fdd4e7ed0cd
58453 F20101219_AACEPD meng_h_Page_012.jp2
a69fcb3cd923c23f891a46c735a37bd8
2046eb472e2211d4e25a2994640c7df6aeeb427e
128234 F20101219_AACEOP meng_h_Page_102.jpg
b0d2f10e1392c3210cc30393b80e0a13
decf4a2b17d97ff04b140ca75876f4d48d90df11
81729 F20101219_AACEPE meng_h_Page_013.jp2
bef8e719c9ad324268bc8286388b82df
595c818aac52c1abf4ce36234e5efeb447d70750
80024 F20101219_AACEOQ meng_h_Page_103.jpg
7e4cc01dfc1c78f8d0d39fcce1380577
c2acb5a395b5ec0ac063b38bb89d8f2e4492580e
80381 F20101219_AACEPF meng_h_Page_015.jp2
000639d32a4676e92c6e8474192a3f7f
c71641e0e7612b6c9cb88c3d3fb6d201f4f7095a
32398 F20101219_AACEOR meng_h_Page_104.jpg
9cd8306175d3db57d671b47163628b9d
713238ddc224e0f8ce897092f9e5f6e7627e02db
64362 F20101219_AACEPG meng_h_Page_016.jp2
45a91c8c911e38211ce8a3bc851d0cb9
5903655ccd734d7d614321a45b261cb9036eaa3a
24615 F20101219_AACEOS meng_h_Page_001.jp2
4beffc16626879739f176bf45e766e17
738670eef382b45b4650aa2e9ba3bbdce1f308f2
91363 F20101219_AACEPH meng_h_Page_017.jp2
c39f97c8dccc1d6f6c4e50f4f963a052
c23b3d6789dbbececac69326a9b447b680faeb1c
5601 F20101219_AACEOT meng_h_Page_002.jp2
9af11cb73cae6528db479406b2217dbc
3d5a8a27485318d5e7df64284ad68d2a65a07ba5
97286 F20101219_AACEPI meng_h_Page_018.jp2
6c9955043e7ab60f731034926b0dba24
9d5b2cb3d8d695e0588a41b0fa56e0f874ee183f
14206 F20101219_AACEOU meng_h_Page_003.jp2
19ce53f39962ce870e739d82d12edc22
22e402ddacb4f7db14ee08ee5d52a9e0caa05f58
98864 F20101219_AACEPJ meng_h_Page_020.jp2
2b6a9747141bda686cf42f896976d3c6
40f7ad9d88624308900e4d4ce59d1448d1f3fd34
42790 F20101219_AACEOV meng_h_Page_004.jp2
98991afdb943b1db23160031f3ab02fd
60aa04c42072567dca16e1c609fdfbb7b49dc9eb
113052 F20101219_AACEPK meng_h_Page_022.jp2
ceb9c81aef77e5f76fa11975f57b52df
e73b6d8dd8b540ee9923c60240fd618d6c6ec0b5
1051985 F20101219_AACEOW meng_h_Page_005.jp2
d3c1e838caf481a5b1808129ef3cbbe3
c2593db0460f2e902bef0245946bad00cd5ebc34
102304 F20101219_AACEPL meng_h_Page_024.jp2
946f407418cd167baaaae5151e2dcc36
87c13e65f4de26117cd0d647899c13a764715c49
1051971 F20101219_AACEOX meng_h_Page_006.jp2
d4a881eb9346b1aed23f757177501582
e6f9e4888af333d5d305d07d64e036c6857194c9
104386 F20101219_AACEPM meng_h_Page_025.jp2
00b4b53144ab39915615d26772b75a66
1b6823065175ffc5a38f800339f84991b26d32b0
87695 F20101219_AACEOY meng_h_Page_007.jp2
a34b4bc2abc37d33524dc1ff3b91ea03
5520d690eb6e6f47440add93354a953533ab7a45
1051980 F20101219_AACEQA meng_h_Page_041.jp2
fcafd89cacd7ce4875f4eabfeed9e78e
2d0012f9c47a2511263921c4f448c10238659848
110066 F20101219_AACEPN meng_h_Page_026.jp2
ead18a655ed8fa5f4e93627569ffffcd
5663e3e91416976ec420f19f5f4e66b00355833f
1051961 F20101219_AACEQB meng_h_Page_042.jp2
0fbb0dca99c921e81a02d160b7853674
b6833be490b84ba5a08cb2328519b8d87f45cdd6
891067 F20101219_AACEPO meng_h_Page_027.jp2
bea1fabf825a7fa81da2a4eecc4f3ffe
a5ad04134089dcac595ca18de0f4fa6cc81f76a4
910538 F20101219_AACEOZ meng_h_Page_008.jp2
1ebabc4081881d9e5d283b5d8771e9fd
2630b268954e18ad1e5b861e0ab3578c88e82bd8
94089 F20101219_AACEQC meng_h_Page_043.jp2
1fb116750455e6e8998eefaccd872e90
b7f1c676b6485046d2729ee86b13323b66b5ec57
117827 F20101219_AACEPP meng_h_Page_028.jp2
bbb6c7c057264c828e49278e70eb8030
183c6ffbf09ecb4cadf643595f274b5c9e942f4a
891730 F20101219_AACEQD meng_h_Page_044.jp2
80c553fc61881a8efc132e950368fd23
3ee80b3544caec6891831d5168ffc949168fa935
117139 F20101219_AACEPQ meng_h_Page_029.jp2
68233b1c520648d159a45a6b3c4a36bd
4e04df4d41720c3fb21ca402a168158bfcb76453
94541 F20101219_AACEQE meng_h_Page_045.jp2
f9231ae3344b50289a3fcc2c96d788a0
e67fbfbf1f6fa06a10916bae192fe30255ea33b1
104876 F20101219_AACEPR meng_h_Page_031.jp2
54a6ee389b28e4b3fd2aa3dd8fcfa82c
85d8b31af04a3b9d4e62fb0fd26018a5445b0dff
93330 F20101219_AACEQF meng_h_Page_046.jp2
243d922a3bc4adfea2b738124446bf5d
34c8b6b3bc24dbdbe454c22e43f051100a674afe
107691 F20101219_AACEPS meng_h_Page_032.jp2
7eabfc134effcfa94d2166ce7b61efa7
b6fd0c9ebbce068e13c6aecc9cc5b5890e8d5c66
932327 F20101219_AACEQG meng_h_Page_047.jp2
3bf80cffc140aeeb4ead55071564751a
ee70a27d61caf54036a70fe1f62ff63c3e85cddf
95772 F20101219_AACEPT meng_h_Page_033.jp2
db26c542eb3eaaec73cebc15a6a639f2
3698e746278e89b5d1cf4047b05ccbf3d1c207ac
112853 F20101219_AACEQH meng_h_Page_048.jp2
25288ecd8490f3149e32546e67cc2c53
3c4b6b084094da83a7702c1f0dc2e728392e608e
113665 F20101219_AACEPU meng_h_Page_034.jp2
5be3e2240b496fef8e08ae9d552ff289
a7d7ccfbee990239d2f915db631259998f42e07c
F20101219_AACEQI meng_h_Page_049.jp2
e2c58bac04acbff8a9d08ba82fa2b6b8
735b07045738eef308bafa3556770141ae9fb7dd
103208 F20101219_AACEPV meng_h_Page_036.jp2
1be9b1d7015821c881d55b02c5631537
c8826979eb146e0e527065c4c881d38b78435893
110812 F20101219_AACEQJ meng_h_Page_050.jp2
03d4196c8d13d88774f041bfad51650a
9d1211bc0d673052c6ac67b6465ef5c32b4d4c4a
107892 F20101219_AACEPW meng_h_Page_037.jp2
52f8d49a46fce3a81ca9017a7e7ac1df
f618b8eccb7c35375bb6731d50bf607aa31868a2
1051969 F20101219_AACEQK meng_h_Page_051.jp2
cb018ea8fa464a7add0e2688277b5967
81a5d018b1bbfc7bf1450045e2ba219216ba4071
96281 F20101219_AACEPX meng_h_Page_038.jp2
9e227c1a4445d18bd484c667329e799b
e90a2545efa925b0fccd12a6f8af6aaeaa0199d9
1051928 F20101219_AACEQL meng_h_Page_052.jp2
629100f519e87d4acbb67fa20f3e24c6
f588972c6a2f61ba7e88458f464b91e9c8d4b27d
979625 F20101219_AACEPY meng_h_Page_039.jp2
f43cdccf26bada49cfb76e4b038a4342
e31f2a062767d3bf448a2da6662dbd60481e4f99
980259 F20101219_AACERA meng_h_Page_068.jp2
04a34ad89e65a69a2fef1adbcfdc1ce2
8ca7e2b8e143ce91ab4e17b16f91b0545ef9ed4b
108787 F20101219_AACEQM meng_h_Page_053.jp2
9f7133b93a6decaf8b74f3dada28476b
b72bb8cfcb9b8c23b6def8f26c108e5f08c40c97
109667 F20101219_AACEPZ meng_h_Page_040.jp2
bbed799d84354f6a5f9fab7fb95604b7
6be48beb1b9e476cd4642a10309feec7fbb5fa46
1051921 F20101219_AACERB meng_h_Page_069.jp2
69e169f43fb1992dd69d76baa8f29d6e
e623858f7b2f9ba8b843faac47ba4d72fa7c4ade
107931 F20101219_AACEQN meng_h_Page_054.jp2
b8b3723e95d7e04e4b492b8f377d5d04
ba037bd7fa410c48960029c3e682ae03b629e16a
1051890 F20101219_AACERC meng_h_Page_070.jp2
983d9eaa471ccf31ca5421a5cd52a76a
30a8668c97cf46ad4f655be8fc4d349883836f9d
111030 F20101219_AACEQO meng_h_Page_055.jp2
9f4544e941ccc6db3a967fb0015fa8a2
84b909280bf8314f6f5ec7b63f1a87ae10f0bfe4
678563 F20101219_AACERD meng_h_Page_071.jp2
87bbdb6c03b58d48e05c2585d8c9c5a5
039fc503c32f4c3ef9fe9f4f00e70fcaf6552410
717297 F20101219_AACEQP meng_h_Page_056.jp2
a2819e7b8a116e710a790b313d6444a3
a3044c0b5f9fd5dd0ad0e210fd75453d5f199943
108305 F20101219_AACERE meng_h_Page_072.jp2
6d7dc6973e32b4acbbb4b7d7a8e088ee
6583ffb89468f08592ee25284df735c6cd14c311
95190 F20101219_AACEQQ meng_h_Page_057.jp2
ab833fdf2dcba0479788a25d2654dcf5
c01c0b211518e15bc39ae98988ed68d8f731c849
104638 F20101219_AACERF meng_h_Page_073.jp2
d10f4cd3131879f2d1c1cece2d5c904c
0929cf351defa14ee57ce5b4bbbb1db21b1ff6ed
645653 F20101219_AACEQR meng_h_Page_058.jp2
984a129ffddedfcc2fac29288ca5afb9
8ba67a097d000a483f12d76f00e9ed380ca11c26
1051955 F20101219_AACERG meng_h_Page_078.jp2
459942f2f9db0b6fb5589b6fa21f5d8f
2385424bd6cc7e5687ce67b8b01c70356feb79c3
1024982 F20101219_AACEQS meng_h_Page_059.jp2
29a7b552afa6eed5b0484078b4f57114
1672f6c4d1a8b4fdfdbc9bf7b440858d48303132
103661 F20101219_AACERH meng_h_Page_079.jp2
8425a082bf9f81180ed9f78a97fadb39
026fef7b2e96f71499c2a73df3d10fdb807d7af9
643201 F20101219_AACEQT meng_h_Page_060.jp2
04f15bf2213965d0a1168338bacb292c
a18e9bcd8fd56b20f6702708db390d061c4e1ffe
96410 F20101219_AACERI meng_h_Page_080.jp2
80b70c7b4690f094f94d8192a809705f
25107a2a9bf1d62a8acb6206e424c0efd62f92b2
95116 F20101219_AACEQU meng_h_Page_061.jp2
fdbe39a0b898a8fdada61e3fc6eb0999
c8a530e6fa0df905bdecf8a889bf10c90abc9a5b
1051970 F20101219_AACERJ meng_h_Page_081.jp2
db13bd394241d5b6ad230f514e46a679
0a45530026dbd741e5dd20f36c7d6e71477a96c6
111066 F20101219_AACEQV meng_h_Page_062.jp2
10c1eeaaa389bdb88351031f5e0e8277
092a47280ccf239f892aecbc23536b62c1cd2050
F20101219_AACERK meng_h_Page_082.jp2
cc73ccc7680446119f027d62168a9d28
71845e4e6b1ddb6283bc9b7c090ed5c01c74a633
102317 F20101219_AACEQW meng_h_Page_063.jp2
30b00123fda61a23f04b3712a7224dcd
9f67c2b6e1603f0a84bfe430aae959f103870d3a
1051986 F20101219_AACERL meng_h_Page_083.jp2
ec0c36c92fd98ca4e83faf391335c7d7
ce2e557237ef56aa8921a0360c258a12627adcac
103275 F20101219_AACEQX meng_h_Page_064.jp2
867f73d1853d42fde8bb87370a402c51
6262b57ea265b9f707c34d3f2ee4ba04dd6541e1
1051943 F20101219_AACERM meng_h_Page_084.jp2
a0c5b01a84e7d12463644f562fe278ca
465ff499cc1c98d32f4719d77d53139fd6c3dbf1
111915 F20101219_AACEQY meng_h_Page_065.jp2
5b7b02ba6095d1fe61ffadebba9a2581
2fa9a64bb6a0250d908ec38343e20998cf898a2e
135585 F20101219_AACESA meng_h_Page_098.jp2
20f7bb226ed9a4d2c4a860a01803b403
d63cf4dddbbfe1562c8384a38a98c7ba8050ca02
31335 F20101219_AACERN meng_h_Page_085.jp2
a1a84208b57b4757db9a57069d3a457a
0934645bae98b43ae7eeee45c07a1627f65f3ae4
110470 F20101219_AACEQZ meng_h_Page_066.jp2
078e2996b01a120e6e294a5322fbd690
e5d212e9dba2e0412ff143be7eee1c33c4579ca8
127828 F20101219_AACESB meng_h_Page_099.jp2
1b48d35ecb74947d623b1c2abf9191b2
17ce5ed479598b8707ff50cbf3dc0cfa8e00dca5
102949 F20101219_AACERO meng_h_Page_086.jp2
614d0f2aeab0727be7aa0b308244adf4
135aea73b8a98e1373719cfae840db68e34fd0b1
130479 F20101219_AACESC meng_h_Page_100.jp2
4d6c08360924dd3e274cfb51058d3598
7965cbabd69bb59fe588e5e7d5591d5f13ced144
9577 F20101219_AACERP meng_h_Page_087.jp2
0fcc9740c0cbed9d3913f98523929c86
0f6d6e91c87eb108156618e937245f3dbea9e10f
118238 F20101219_AACESD meng_h_Page_101.jp2
30933745be3fc8cd38034d6e8f696a0a
61548faea31e8f540acaea86821191a8f4c38fc4
894231 F20101219_AACERQ meng_h_Page_088.jp2
a52a458c74f75714220100b8ccceb859
d6b8af8a827993a56f09aff812d5a4031cf19ec7
127735 F20101219_AACESE meng_h_Page_102.jp2
6e0855ce2fe98c640bb1a989b5a11732
0ee2d5d16d921a9d796baba567e6e69e46018e9b
94125 F20101219_AACERR meng_h_Page_089.jp2
815920b6c3011cb03ccc0fd3182edf5d
838cf4ac505924462689efcc2b8283f2edf46ab0
77254 F20101219_AACESF meng_h_Page_103.jp2
532e55a7965e5b2cb935ddd80112cec5
b8e3f5a993bf05e5da031e249c6a347e8c303eed
113745 F20101219_AACERS meng_h_Page_090.jp2
a6704495b3449d5c90591f1d7e033217
2ca8a83816076981ae4a0899882d1fc8d2f9e7a6
33911 F20101219_AACESG meng_h_Page_104.jp2
f24e1ab16f3416034948fb4d471207b9
516fbb96c3c9c9cc46976523944b96a0d6e10e93
80905 F20101219_AACERT meng_h_Page_091.jp2
a1102e056536de08480992f3daf855a0
142691bbc9e749cbc2ab5c81ef19952ffa718b60
F20101219_AACESH meng_h_Page_001.tif
19195056388f9e6265a1e2216d1a35e4
9c1fea9b8d1659f5f63b40f9971c45676dc21b74
101513 F20101219_AACERU meng_h_Page_092.jp2
d55882a493c42aeaeb63a661f66e904c
80f6bd9a329cc2fea2ea3c89915cbeba24b256ba
F20101219_AACESI meng_h_Page_002.tif
967f4d40148135986ac2a6b6a11d2f8e
bc4c8a117c360fc9fecd397ca72fd765bc9ae5ec
934281 F20101219_AACERV meng_h_Page_093.jp2
e848c3867e3cbff24c0f84b2bd8d65f5
66fae7bbf45c3e6541748b1b291f3949e3e9c266
F20101219_AACESJ meng_h_Page_003.tif
38df1d45029db54dbfb7b3600682cfee
c1f5cddd19b78329bc7957bd8496ec2f339421f8
102452 F20101219_AACERW meng_h_Page_094.jp2
bcbc1c4b7b045359377fc90b60dd9032
73cdf12dd9086ad23dd1920bbbbdab7b935570b7
F20101219_AACESK meng_h_Page_004.tif
35cae24675f58ee91447c507887cffa2
b024485a48ec42fe4e1988f4c10b36a92fea9a86
127574 F20101219_AACERX meng_h_Page_095.jp2
6d998a8d9fe41924166627e07b3eeea5
db1049c7626ae228b76ec497699625f5a097a634
F20101219_AACESL meng_h_Page_005.tif
70a77c702b5376eb9bae9647525319d7
070f2f2029c7991e3a54a7399f0c475aa8ed5cf8
128912 F20101219_AACERY meng_h_Page_096.jp2
727fb4de90e267cf8632d2d8952e52bf
84cb103084a1848f1b3f836ca5d7ffcebe912bb5
F20101219_AACETA meng_h_Page_023.tif
c5df9a7046a3c16580e163a30d78dfa2
29037070f0b2afe8a3309c684c71bd616c84afe5
F20101219_AACESM meng_h_Page_006.tif
a8966cd30d9a087678f01fc2e23990e0
347fe5d49edfc2b053c1561bcffddd60ba1660e6
124904 F20101219_AACERZ meng_h_Page_097.jp2
2362c1e659bb1c5978dee2a1805861b0
603aa0ac932e0eabe139f74a771c8845d58d48d2
F20101219_AACETB meng_h_Page_024.tif
00f8ddf9ebeb27785d586840189ce4d6
c0bd038f91fa4ae930d167cbd0503cb3277096ad
F20101219_AACESN meng_h_Page_007.tif
827affe0add7f7583d384f84270f313a
68b78d3232b7a6e6cd111c10300295222261f64e
F20101219_AACETC meng_h_Page_026.tif
e315145b84f5d9d17d02a868872c6b5f
3ca1a6ee68b671f6965006d658b5e7a84e0153ce
F20101219_AACESO meng_h_Page_009.tif
c74e8ac6ac0b28847995b20af7c51beb
c4a57401f55173fd574212a3ce9193ece81bb91d
F20101219_AACETD meng_h_Page_028.tif
fed7f5bac68009638670d32a483642fb
ce0959b36bba0bd7b955abe1e7248044a1ac774a
F20101219_AACESP meng_h_Page_010.tif
898764aabd258a24aa937d2d709ca866
396c293d1bba036d172f9293d73fc4f3656fb738
F20101219_AACETE meng_h_Page_029.tif
4dfadf212749e87e075a3bb01ca8317e
5932cf24d987691f4d82ee3763a5f17e141e408b
F20101219_AACESQ meng_h_Page_011.tif
718e0f5b5edbd69dcbe11c8b05cab9f3
ac09fc25bc08093f00fc91d171c5469f5c25be9a
F20101219_AACETF meng_h_Page_030.tif
358fa3de12ba846732293a04d20fada4
96037670eab2eb6f8de4c86a6d8901988e8135ef
F20101219_AACESR meng_h_Page_012.tif
79f63d03e4b2d45ee5f8edcd84eadad8
fa1e59b65d0dc9f5414f46916124af4ac7dbf247
F20101219_AACETG meng_h_Page_031.tif
cbec58a2f3cd1d3b1083ca1b0207b683
a8fff8d9860da407e63f89c23fff0eb0ee9f5277
F20101219_AACESS meng_h_Page_013.tif
fc717b72c0d4b924e0866a8cff51a922
c313447800fd8d2da46beafe9197c373e675d4b1
F20101219_AACETH meng_h_Page_033.tif
e68d50151e56568a563680dd5ca0499f
e2824d7125502e80f6e06816fe65001639720b2c
F20101219_AACEST meng_h_Page_014.tif
799a5001f6e6d3288283c24307fa1e85
82ccb4ecdc4f393a5d0fa45eee0117aa5f348184
F20101219_AACETI meng_h_Page_034.tif
223ebe3b6614d6f4456aa4be568edd22
36c11da2d929707b0c7ac38c3a10798f25757c18
F20101219_AACESU meng_h_Page_016.tif
9699619223ce54102de9095a0f6dba39
7380091193d8cc959649e3d8b1d6951746b9a7dd
F20101219_AACETJ meng_h_Page_035.tif
4bb65613b862a894b9fce3fd62b4c2d2
8a8b320f29a5c073e1bb834401e91bb6f20a8a21
F20101219_AACESV meng_h_Page_017.tif
0b24fe9a72bfafd4280cb1e601760e05
de127c4d0969d49ee71b9903c4cc24da25eb3054
F20101219_AACETK meng_h_Page_037.tif
6725c164cbb6441146daae00b322a778
6d26919ef6f6e763c8554a8f3fbca683b1a25d98
F20101219_AACESW meng_h_Page_018.tif
fa6f22d55d3f04b0101d0cdb8cc80213
9f64369a73aba59c6c7010195c1d72da7629cc0c
F20101219_AACETL meng_h_Page_038.tif
93e9525942786b48ff00d7575beecc6f
d394195c66c2bea199350d6d495cf5c1319fd783
F20101219_AACESX meng_h_Page_019.tif
b2e2c7d37a93b61a3766ba4589c1c8ed
25eb3b733aeb8736489131e1791c34059c45da8f
F20101219_AACEUA meng_h_Page_055.tif
362113ed1c5cde38cce249cb6a7be962
3a5fd4e70be4c2b67ca9955ccc052f4057c3054d
F20101219_AACETM meng_h_Page_039.tif
b545e633ca74b1cf6c6e3dde5480ec0d
c00a9c498f6b59c168744cfc00aef25998965daf
F20101219_AACESY meng_h_Page_021.tif
5cbbd6ed8795de9323ed73d3ba5f88b3
1970e23899be5bdbdee34c74d158da873cf24f6b
F20101219_AACEUB meng_h_Page_056.tif
e3bb22f18bb4d27444f115737da59e59
4a3956523794047136ec2315db9deb062378f40e
F20101219_AACETN meng_h_Page_040.tif
a904e4b90277d420c03bd3a1a34f6091
05cb262cbbdbc6f20962f6104a9cd8d2e7226be1
F20101219_AACESZ meng_h_Page_022.tif
cba805b9a5988c9cf3c984ab2aebeb59
d93b0ee460554c42db29bf6892ed442fcf75b637
F20101219_AACEUC meng_h_Page_058.tif
d096a087051da77e9bb2712189b349d0
7d28d8d5ecebbc617e4759d062e29cf6bfd21ece
F20101219_AACETO meng_h_Page_041.tif
6670b414d6bfa7108e4493c294da0f70
27a579921faa3e9ed35c4024317f3af98c2e9c85
F20101219_AACEUD meng_h_Page_059.tif
cc2fda7c0870e5a4e2a6efd48e337de5
2e87cc40dc78b2d4161eaa8c4e7727734ae947b8
F20101219_AACETP meng_h_Page_042.tif
434cf0e82cb5f96d7b75df98e3e68f9e
73dd33500c41bd901365ff842ac136781ae7201e
F20101219_AACEUE meng_h_Page_060.tif
b2b3d33386388500e6efb63bc8530c3d
82449144ad2d1893cb9264cb678e41de2ed30dc1
F20101219_AACETQ meng_h_Page_043.tif
8326c42e165d24973e34d478819e79f7
319f55ad0e0df7f047c80471607868651aa06b09
F20101219_AACEUF meng_h_Page_061.tif
d4e6f3fad08d7b61b6ad1c733f393761
95e9893fe66d52699164e7347b1e8225e10f6c63
F20101219_AACETR meng_h_Page_044.tif
275dde973a638fae14db454dd27411a4
f4d4368fe7c5c959afbf50ec9667d0c421fa12dc
2346 F20101219_AACFAA meng_h_Page_029.txt
ae1c9816535c6771fb5e75a07c7ae594
8e9d690a6fa4c70c9193bd69293d7262fd44e27a
F20101219_AACEUG meng_h_Page_062.tif
7416d46767084a2f77ce8bddb2dff1f3
3ebfc71f75dd62cda6cd46fc08647cf637426b14
F20101219_AACETS meng_h_Page_045.tif
bbf7f4e5e3e02482897921fb841be51f
c7a2a4838a34d60fcc728b145ea38e828eaa8c45
1969 F20101219_AACFAB meng_h_Page_030.txt
30e5e9badfd6a91dbe2a4deef8a783ab
3e8d3e75a4bb2fe4d2e8ba27abf2c15ba6d0c6d6
F20101219_AACEUH meng_h_Page_063.tif
4ae24aae0dfc6f5b6c190ad02af67b52
a80fc55d83131f0a6358cc310f5ddf5091958a5f
F20101219_AACETT meng_h_Page_046.tif
551a046e723a9ffbe4390c07dffd72f3
33f30776d61c03fe5533e8d78a94cdf86e03ce24
1889 F20101219_AACFAC meng_h_Page_031.txt
983fbaf461a2744b82eed4753e0025ed
7bca896938a0e43557a0665a2d9ef65859efc81c
F20101219_AACEUI meng_h_Page_064.tif
371bcf111f7516072902799aceb05482
6b9e240b568c626ace8550cc2e12a0f2898246df
F20101219_AACETU meng_h_Page_047.tif
d669614ef9b354a6eacd1c7fff91ff31
cddb6a8f77cf99eae6c2e846138dacf7b704e360
1914 F20101219_AACFAD meng_h_Page_032.txt
6661608132df955d2ae52dd7f4106bd6
cd99e4d3cd9de4ed61eeaab9ef567c4661a35ae1
F20101219_AACEUJ meng_h_Page_065.tif
ed9732cc54baa59e97b7069baa223832
a4c7442ef648b1e330a64a3109f139cb2d103fdc
F20101219_AACETV meng_h_Page_049.tif
30e279b2ad1b5c1ec94e6f1d0dd3db67
c0567afee8cf990c8f0f5793e2e3b2d250166daf
F20101219_AACFAE meng_h_Page_033.txt
6c15bed9952adccf17eb4e4d4d23880a
d9f34832b0fada60143f4537bedbff9daaad8a12
F20101219_AACEUK meng_h_Page_066.tif
3f9a9ebce9d4d29e5c0850b7c9056162
8912cb70256e08c1482d6b78123725eb7f70c832
F20101219_AACETW meng_h_Page_050.tif
45bd6f77a0c66c331cba76547f11b373
1baf60fc7e3fcb0debe2f21920c8dabd91026563
2058 F20101219_AACFAF meng_h_Page_034.txt
219b1d13dcb71fca4605a43258d81cee
1f3e2776f6a0cfe78d6905841e34f13f80790b8d
F20101219_AACEUL meng_h_Page_067.tif
8d122e2cfdae9784931f819d7ba1962a
496d21603ec53b489e4d789ed8fd5bffadee09a9
F20101219_AACETX meng_h_Page_051.tif
e0c9ed4b15c84441f29ffff97d2a945b
ac3d5230bd4de6de03fd0da516f65726ca6f1bfd
1994 F20101219_AACFAG meng_h_Page_035.txt
2d35e01296314a5381a5d731ba710ac3
cc34457834af1f541879d34d653e3097cb64c37b
F20101219_AACEUM meng_h_Page_068.tif
aa13c2001dd82f718ffe7d4e1e1d7454
f30769b0b0e2f5b26f22e1105c27bb7d7d308e1c
F20101219_AACETY meng_h_Page_053.tif
6e370c791777a64e8b37337d0ece52d2
7cf0850d67bb7c86c9f2d423fc08c0c505617db7
F20101219_AACFAH meng_h_Page_037.txt
a71cac65cf700d09e94c27346eb08ea5
0bd62f0bd6a9e8119450db7503697d38f9b8e77e
F20101219_AACEVA meng_h_Page_084.tif
2b49cfd39b9749d9da137aef3da2bac2
e1599db57648abb53fb9ef61a190ca206df2ba53
F20101219_AACEUN meng_h_Page_069.tif
dbf41d4db7f4cdeeba4ceed94190abb0
29b7be714ce63d59cc21f84f9b1af0b7dc828942
F20101219_AACETZ meng_h_Page_054.tif
0e5434b3be98c28e0481b75e7ad13512
54ae29364cc455c39d80d45955de4199e718261f
1763 F20101219_AACFAI meng_h_Page_038.txt
c0b94790bc46c80bacff418cf542f017
7f7c8a700176aa4d2055b44307b44f6d1d343840
F20101219_AACEVB meng_h_Page_085.tif
50587a5dcd51531fb95f33f2bbea89eb
8fb9d7ffaf0847ec03f7a73b781b09f55c3941eb
F20101219_AACEUO meng_h_Page_070.tif
2bf98e32c3bda5e50597762d674dbb83
243d79c0b67e27bcd45e82ac8dc1f5ee101049d3
1306 F20101219_AACFAJ meng_h_Page_039.txt
ec147e45db89d20577295876dafff034
adbd265750a109ac46ee8ce30fc251e7b5072817
F20101219_AACEVC meng_h_Page_086.tif
a8b90a0572fa96e97e1c70c1316173b2
eb6e8a79c1032ba27b7192c8662fd0ae45381087
F20101219_AACEUP meng_h_Page_071.tif
c260dfe2fc248057ffc16e8b9b95df51
93359a26c3abe018d7597505871647b61ef9929d
F20101219_AACEVD meng_h_Page_087.tif
3069cb586e3fcbecfbab7a437044006b
1b2c546c8b15df9edaa44166337a3d8dd6dd83ef
F20101219_AACEUQ meng_h_Page_072.tif
3b0887d2bb4f92831a04bf2d3dcf0602
f2e0a429042882a57efe33e1cb096543da40cec0
1987 F20101219_AACFAK meng_h_Page_040.txt
b9e54168283e288089b608b1bfde3de9
cd78b332d701b4e233ec947aad5e5bffc9062000
F20101219_AACEVE meng_h_Page_088.tif
eb65d772a265fb4951f084040dbb0df5
3fccd917082dab964a60f728be037b0ebca9cacf
F20101219_AACEUR meng_h_Page_073.tif
a64cab57cb751e9997759d1975fed352
a6bd9846db666068028b355cb25939508dbfdc51
1761 F20101219_AACFBA meng_h_Page_057.txt
6a97fe981347c614d2edb815df10bc7b
bad2d18b667151e1087581c3e585fd9226d7f8bc
801 F20101219_AACFAL meng_h_Page_041.txt
d4b6eb100303808d24658fbeac13f1ad
79a294146f66cbef1eef0062a0a2ca9bf35f78f9
F20101219_AACEVF meng_h_Page_089.tif
bdf75fb043d1bb7b490f2286cc9b0fa0
7e33169bbc3b99471bc50aefcd258f246551821c
F20101219_AACEUS meng_h_Page_074.tif
94e347bb1888e97209e95cd72d6944da
395e84fc6b753921bfcc105a509d405fd439eb35
1201 F20101219_AACFBB meng_h_Page_058.txt
a44cc270ed165f52d6a88c5eb7e1b7ec
f7203fa09e7c395eddd991087912cf96ed4a976f
1685 F20101219_AACFAM meng_h_Page_043.txt
6429e8f073c47ef7e2fc299f961ebaa9
5ab9727887f6894bf5d5f2a94fedd10563372469
F20101219_AACEVG meng_h_Page_090.tif
5d3af2dbb016ef2c8695a5ba6855a6b1
d6d800251fd29146a254b7cc1a02cda07a67e7a4
F20101219_AACEUT meng_h_Page_075.tif
ccf0382631b27277a6453ed656d779c7
b61cbacbac02bc70b88d1e620916df5ca4b8c882
F20101219_AACFBC meng_h_Page_059.txt
12609fec6ce7ac23fe8e4c89635cf3e2
d10ce5f74f7eb14e08d0add240aafb2a04edbe96
1020 F20101219_AACFAN meng_h_Page_044.txt
86cb654710bf85b4de7d4a9f89768c4a
bdf2db455d92050f1b45dea80eb71d851fc01643
F20101219_AACEVH meng_h_Page_091.tif
abe8404b722bdfc3b42dcc46790d07ed
c9533ba0c9032f90bd00876dffe7cee60f989ffa
F20101219_AACEUU meng_h_Page_076.tif
fddadcc144cbdceb151747767d69ca3c
16f9fd3481fc32d19c7efd92be428aefe0b6fce4
1030 F20101219_AACFBD meng_h_Page_060.txt
eef58f5cba5aa29a1ebc7fa040ffc3a8
eff67a91124b232cce21475c7399cf44bf6fa550
1904 F20101219_AACFAO meng_h_Page_045.txt
2387bbf6b1030b263835c80d66259cb4
2b5427e0aaae125d304160de16dff64bf85204e8
F20101219_AACEVI meng_h_Page_092.tif
fd102e9e29626754895244ee5f14fc94
348a543e564204febb0a086331362339f4a0f9ed
F20101219_AACEUV meng_h_Page_078.tif
ed47272e819433c9763ab9bc165c4dde
e36c9c09ad414abc7fc6098de6a633cf0dde7004
1760 F20101219_AACFBE meng_h_Page_061.txt
a67e043b7c55958d582716c601bf1a87
5b9f6d02fce44ac009ce7509f90b476fe97ca044
1652 F20101219_AACFAP meng_h_Page_046.txt
44743bae1712b789e77a67fc9cc46a85
e9938fded55f465dd79018270cf4e586766d71ea
F20101219_AACEVJ meng_h_Page_093.tif
051dce217d13b9197dde2b4d8c25a093
67e7819d9636401cd025f526d2d35a61a463b95f
F20101219_AACEUW meng_h_Page_079.tif
afd7e61c7985bbd6c1a4935f87d6eb63
5990fee0ecfa1239eb19a3f06bab77a549fcbe55
2014 F20101219_AACFBF meng_h_Page_062.txt
3ef1690b32a8da6167f5e533424c1713
6ecfaa0f26e9afa7f69b8ab0c520a4bf39c93cae
586 F20101219_AACFAQ meng_h_Page_047.txt
6ba863f0806b9ef9673cd4e113214783
8fcc01777762537cddac316ef02296c65c110644
F20101219_AACEVK meng_h_Page_094.tif
01ee83981948c6debdbf8bc896803ff6
a6c835a899041085f5e9dbd5c5d4607439c50316
F20101219_AACEUX meng_h_Page_080.tif
32e66b40026bedf771fa11cea3a404a9
dc8b4d52fcb708e4dd0473473ac33d8d1116ac77
1861 F20101219_AACFBG meng_h_Page_063.txt
7a63cfe7150dd083aaa66e863046265f
8b23fd3e5eb63b9e9ff408de9667354f7eb17e3a
2619 F20101219_AACEWA meng_h_Page_007.pro
69e2ab12c28a4f333782a0e85df8715f
38ba136c9a2d39660a31e53712944080c77c703e
2018 F20101219_AACFAR meng_h_Page_048.txt
12c8fa46fc76ded953d4eb96cd663962
157b867951766305c9236fe5386602e66a3df7f2
F20101219_AACEVL meng_h_Page_095.tif
88f6595a940a0249e7fd4f9bb7a03c26
b9bb70dd465314bccdcc77207b0a37b52511b097
1857 F20101219_AACFBH meng_h_Page_064.txt
c25b68c76c7bc1b62a2d0da8741ac51f
65c4d497a7cde0a02d795ca85ddaee73356d58c8
1881 F20101219_AACFAS meng_h_Page_049.txt
b827b64fdf9c19a763f0192e78b598fc
1df56c768c85abdc389d220dab4c9084b942ce19
F20101219_AACEVM meng_h_Page_096.tif
a9569c0ec11c8e92d1b29ebe43fc9a6f
775acc018b4f5178121892b5da140f3ac1dbd581
F20101219_AACEUY meng_h_Page_081.tif
80972b38148b104ecdff8ce8d423aee0
0b73481fb50096521318261a855c52cb8889296e
2009 F20101219_AACFBI meng_h_Page_065.txt
4d2ac3a6b5698e7a9c738d8275938db7
b08587ba2f45326da2043d1c6853755197295830
23954 F20101219_AACEWB meng_h_Page_008.pro
5d649cb0c1bee0ea5147ba5aae1972bb
d1ae3cb746c196d8987bdbf8f013ef52560227f0
1993 F20101219_AACFAT meng_h_Page_050.txt
d6ec6dbbd6a9e015ac9af1107643a435



PAGE 1

A NOVEL METHODOLOGY FOR IDENTIFYING CONSERVED REGULATORY MODULES AT THE BINDING SITE LEVEL

By

HAILONG MENG

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2006

PAGE 2

Copyright 2006 by Hailong Meng

PAGE 3

To my dearly loved wife, Haoxian, with whom enjoyable and difficult times were shared throughout this journey, and to my parents, Zhanfu and Xioumei, for their love and support through my life.

PAGE 4

ACKNOWLEDGMENTS

I would like to express my sincere thanks and appreciation to my advisors, Dr. Arunava Banerjee and Dr. Lei Zhou, for their invaluable guidance, support, and encouragement throughout my graduate research and study at the University of Florida. My accomplishments could not have been achieved without their support. I would also like to thank my committee members, Drs. Anand Rangarajan, Sam Wu and Tamer Kahveci, for their insightful comments and helpful suggestions. Finally, my appreciation goes to all my friends at the University of Florida, especially those in Dr. Banerjee's and Dr. Zhou's labs, for their helpful suggestions, discussions and friendship.

PAGE 5

TABLE OF CONTENTS

page

ACKNOWLEDGMENTS ... iv
LIST OF TABLES ... viii
LIST OF FIGURES ... ix
ABSTRACT ... xi

CHAPTER

1 INTRODUCTION ... 1

2 LITERATURE REVIEW ... 4
    Computational Representation of TFBSs ... 4
    TFBSs Databases ... 6
    Computational Identification and Statistical Evaluation of TFBSs ... 8
        Computational Identification of TFBSs ... 8
        Statistical Evaluation of TFBSs ... 11
    Cis-Regulatory Modules (CRMs) and Computational Prediction of CRMs ... 12
        Cis-Regulatory Modules (CRMs) ... 12
        Computational Identification of CRMs ... 14
        Cross-species Conservation of Cis-Regulatory Modules (CRMs) ... 15
        Computational Prediction of Cis-Regulatory Modules (CRMs) by Cross-species Genome Comparison ... 17

3 THE FEASIBILITY STUDY OF CROSS-GENOME COMPARISON AT THE BINDING SITE LEVEL ... 21
    Background ... 21
    Results ... 25
        Simulating Sequence Pairs Harboring a Conserved Module of Binding Sites ... 25
        Identifying a Conserved Module by Comparison at the Binding Site Level ... 28
        Distribution and Statistical Evaluation of the BLISS_score ... 31
        Identifying a Conserved Regulatory Module in Distantly Related Species ... 34
        Implementation of BLISS as a Web-based Service ... 38
    Discussion ... 40
    Conclusions ... 42

PAGE 6

    Methods ... 43
        Generating Simulated Sequences ... 43
        Binding Site Identification ... 44
        P Value for M_score ... 45
        Gaussian Smoothing ... 46
        Searching the Conserved Module in the Long Sequence ... 46
        Large-scale Search of the Simulated Sequences, Statistical Analysis ... 47
        Searching for the eve2 Module in D. virilis and D. mojavensis Sequences ... 47
        Website Construction ... 48

4 BINDING-SITE LEVEL IDENTIFICATION OF SHARED REGULATORY MODULES IN TWO ORTHOLOGOUS SEQUENCES ... 49
    Background ... 49
    Results ... 51
        Simulating Sequence Pairs with Varying Complexity of Conserved Binding Site Patterns ... 51
        Identifying Conserved Regulatory Complexes by Comparison at the Binding-site Level ... 52
        Testing BLISS2 in Real-world Examples ... 56
        Website Implementation of BLISS2 ... 60
    Discussion ... 61
    Conclusions ... 62
    Methods ... 62
        Generating Simulated Sequences ... 62
        Identifying Conserved Regulatory Complexes by Comparison at the Binding-site Level ... 64
        Performance Testing Using Simulated Sequences ... 65
        Searching for the eve2 Regulatory Complex in D. virilis and D. mojavensis ... 66
        Website Construction ... 66

5 BLISS 2.0: THE WEB-BASED TOOL FOR PREDICTING CONSERVED REGULATORY MODULES IN DISTANTLY-RELATED ORTHOLOGOUS SEQUENCES ... 67

6 CONCLUSIONS AND SUGGESTIONS FOR FUTURE STUDY ... 74

APPENDIX

A PARAMETER OPTIMIZATIONS FOR GAUSSIAN SMOOTHING METHOD ... 76

B A DYNAMIC PROGRAMMING ALGORITHM FOR IDENTIFYING CONSERVED REGULATORY MODULES ... 77

C AN IMPROVED ALGORITHM FOR BLISS2 ... 79

LIST OF REFERENCES ... 82

PAGE 7

BIOGRAPHICAL SKETCH ... 92

PAGE 8

LIST OF TABLES

Table ... page

2-1. IUPAC consensus characters ... 4
2-2. Statistics of TRANSFAC ... 6
2-3. Matrices in TRANSFAC Professional r9.1 ... 7
4-1. Simulated data sets ... 52
4-2. Performance test of BLISS2 for M_score cutoff=0.75 ... 54
4-3. Performance test of BLISS2 for M_score cutoff=0.8 ... 55
C-1. Performance test of the improved BLISS2 for M_score cutoff=0.75 ... 79
C-2. Performance test of the improved BLISS2 for M_score cutoff=0.8 ... 79
C-3. Performance comparison of two BLISS2 algorithms for second data set when M_score cutoff is 0.75 ... 80

PAGE 9

LIST OF FIGURES

Figure ... page

1-1. Transcription and translation ... 1
1-2. Transcription is controlled by transcription factors ... 2
2-1. Frequency matrix of transcription factor v-Myb from TRANSFAC ... 5
2-2. The known TFBSs in S2E ... 15
2-3. S2E of D. Mel and D. Pse alignment using CLUSTALW ... 16
3-1. Simulating conserved regulatory modules in a pair of random sequences ... 27
3-2. Gaussian smoothing of the TFBS profile ... 29
3-3. Identifying a conserved module at the binding site level ... 32
3-4. Identifying the eve S2E module in distantly related species ... 35
3-5. Inter-TFBSs distances are very well conserved within each clause of the S2E module ... 37
3-6. Searching for conserved individual clauses / element modules using BLISS ... 39
3-7. BLISS output of the contributing TFBSs ... 40
4-1. The effects of length of analyzed sequences on the performance of BLISS2 ... 56
4-2. Regulatory complexes in S2E of D.mela and D.pseu ... 57
4-3. Identification S2E in 6 kb D.viri sequence using the S2E of D.pseu ... 58
4-4. Regulatory complexes of S2E in D.viri and D.moja ... 59
5-1. Web interface of BLISS 2.0. This is the homepage of BLISS 2.0 ... 69
5-2. The color plot of BLISS_score ... 70
5-3. Statistical analysis and distribution of BLISS_score ... 71

PAGE 10

x 5-4. Output of contributing TFBSs. .................................................................................72 A-1. Parameter optimizations for Gaussian smoothing method used in BLISS...............76 C-1. Distribution of BLISS_score for th e improved BLISS2 algorithm with the M_score cut_off value of 0.75.................................................................................81 C-2. Distribution of BLISS_score for th e improved BLISS2 algorithm with the M_score cut_off value of 0.8...................................................................................81


Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

A NOVEL METHODOLOGY FOR IDENTIFYING CONSERVED REGULATORY MODULES AT THE BINDING SITE LEVEL

By

Hailong Meng

August 2006

Chair: Arunava Banerjee
Cochair: Lei Zhou
Major Department: Computer and Information Science and Engineering

Regulatory modules are segments of the DNA that control particular aspects of gene expression. Their identification is therefore of great importance to the field of molecular genetics. Each module is composed of a distinct set of binding sites for specific transcription factors. Since experimental identification of regulatory modules is an arduous process, accurate computational techniques that supplement this process can be very beneficial. Functional modules are under selective pressure to be evolutionarily conserved. Most current approaches therefore attempt to detect conserved regulatory modules through similarity comparisons at the DNA sequence level. However, some regulatory modules, despite the conservation of their responsible binding sites, are embedded in sequences that have little overall similarity.

We hypothesize that it will be more efficient and make more biological sense to perform the cross-genome comparison at the binding site level, since the evolutionary pressure is on the binding site patterns that determine the gene regulation. In this study, we developed a novel methodology that detects conserved regulatory modules via comparisons at the binding site level. The methodology compares the binding site profiles of orthologs and identifies those segments that have similar (not necessarily identical) profiles. The similarity measure is based on transformed profiles, which take into consideration the p values of binding sites as well as the potential shift of binding site positions. Furthermore, statistical analysis was performed to evaluate the accuracy of the similarity measure. We used simulated sequence pairs to optimize the methodology, which was then verified on real-world sequences. Our research demonstrates that, for sequences with little overall similarity at the DNA sequence level, it is possible to identify conserved regulatory modules based solely on binding site profiles. This methodology has been implemented as a web-based tool, BLISS, for the research community.


CHAPTER 1
INTRODUCTION

One of the central processes in the life of a cell is the expression of the information encoded in its DNA. The initial stage of this expression is the synthesis of messenger RNA (mRNA) from the DNA template, a process referred to as transcription (Figure 1-1).

Figure 1-1. Transcription and translation (DNA to mRNA by transcription, mRNA to protein by translation).

One of the major challenges in molecular genetics is to understand the precise mechanism by which gene expression is regulated. The regulation of gene expression at the level of transcription is a very complex process and is controlled by DNA binding proteins called transcription factors (TFs). As shown in Figure 1-2, the initiation of transcription begins with the binding of the RNA polymerase II complex to the promoter, a region immediately upstream of the first exon of the gene. The level (i.e., rate) of transcription is controlled by binding between TFs and corresponding short DNA sequences, the Transcription Factor Binding Sites (TFBSs). The size of a typical binding site ranges between 5 and 25 base pairs (bps) (Boulikas 1994; Hernandez-Munain and Krangel 1994). The binding of a single TF to its TFBS can affect the rate of transcription; in higher eukaryotes, however, multiple TFBSs often group together as a regulatory module to determine the precise temporal and spatial expression of the genes. Despite the importance of regulatory modules, our ability to identify and characterize them is still very limited (Qiu 2003).

Figure 1-2. Transcription is controlled by transcription factors.

While laboratory identification of regulatory modules is feasible (Galas and Schmitz 1978; Fried and Crothers 1981; Garner and Revzin 1981), the process is in general expensive, time-consuming and arduous. Accurate computational approaches therefore promise to be very beneficial as a supplement to the traditional laboratory experiment, since they can lead biologists to a more efficient determination of the regulatory elements.

Functional regulatory modules are under selective pressure and tend to be evolutionarily conserved. Most current approaches therefore attempt to detect conserved regulatory modules through similarity comparisons at the DNA sequence level. However, some regulatory modules, despite the conservation of their responsible binding sites, are embedded in sequences that have little overall similarity. It therefore makes biological sense to perform the cross-genome comparison at the binding site level, since the evolutionary pressure is directed at the binding site level.

The objectives of this thesis are

1. To test the feasibility of cross-genome comparison at the binding site level.
2. To develop a novel methodology that detects conserved regulatory modules via comparisons at the binding site level.
3. To implement the methodology as a web-based tool that can be publicly accessed by the research community.

The remaining chapters are organized as follows:

Chapter 2 conducts a literature review, in which background material and related research in the field are presented.

In chapter 3, we demonstrate that it is feasible to perform the cross-species comparison at the binding site level and that conserved regulatory modules can be predicted in diverged sequences without detectable DNA sequence similarity. However, this comes with the limitation that a conserved module must be known to exist in the shorter of the two sequences being compared.

In chapter 4, based on the research results in chapter 3, we extend the methodology to a more general case, i.e., two sequences without prior knowledge about the locations of the regulatory modules.

In chapter 5, we focus on BLISS, the web implementation of this methodology.

Chapter 6 presents conclusions and suggestions for future research.


CHAPTER 2
LITERATURE REVIEW

Computational Representation of TFBSs

Most eukaryotic TFs can bind to multiple binding sites and tolerate considerable sequence variation in their binding sites. The computational analysis of regulatory regions in genome sequences is based on the prediction of potential TFBSs, which depends on the quality of the models that represent the binding specificity of the transcription factors. These models are based upon a statistical representation of experimentally determined binding sites for a particular TF.

The IUPAC consensus code was first used to represent this degenerate feature of TFBSs by Day and McMorris (1992). They used a string such as RCCY to represent the combination of the strings ACCT, ACCC, GCCT, and GCCC. IUPAC consensus characters are listed in Table 2-1.

Table 2-1. IUPAC consensus characters.

IUPAC consensus character in the motif    Matching base in the DNA sequence
A                                         A
T                                         T
G                                         G
C                                         C
R                                         A or G
Y                                         T or C
W                                         A or T
S                                         G or C
M                                         A or C
K                                         G or T
B                                         T or G or C
H                                         A or T or C
V                                         A or G or C
D                                         A or G or T
N                                         A or T or G or C
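The mapping in Table 2-1 can be sketched in a few lines of Python. This is only an illustration of how a consensus string such as RCCY stands for a set of concrete sequences; the function names (`consensus_to_regex`, `expand`, `find_sites`) and the example DNA string are hypothetical and are not taken from any of the tools discussed in this chapter.

```python
import re

# IUPAC consensus characters from Table 2-1, mapped to the bases they match.
IUPAC = {
    "A": "A", "T": "T", "G": "G", "C": "C",
    "R": "AG", "Y": "TC", "W": "AT", "S": "GC", "M": "AC", "K": "GT",
    "B": "TGC", "H": "ATC", "V": "AGC", "D": "AGT", "N": "ATGC",
}


def consensus_to_regex(consensus):
    """Translate an IUPAC consensus string into a compiled regular expression."""
    parts = []
    for ch in consensus.upper():
        bases = IUPAC[ch]
        # A single base is written as-is; a degenerate position becomes a character class.
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return re.compile("".join(parts))


def expand(consensus):
    """List every concrete sequence that a consensus string stands for."""
    seqs = [""]
    for ch in consensus.upper():
        seqs = [s + b for s in seqs for b in IUPAC[ch]]
    return sorted(seqs)


def find_sites(consensus, dna):
    """Return the 0-based start positions of all matches, including overlapping ones."""
    pattern = consensus_to_regex(consensus)
    width = len(consensus)
    return [i for i in range(len(dna) - width + 1)
            if pattern.fullmatch(dna, i, i + width)]


print(expand("RCCY"))                      # the four strings represented by RCCY
print(find_sites("RCCY", "TTACCTGGCCCA"))  # made-up DNA string, for illustration only
```

Every window is tested separately so that overlapping matches are reported; a consensus match is all-or-nothing, which is exactly the lack of quantitative information discussed next.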


However, the disadvantage of the IUPAC consensus sequence is that it does not take into account the frequency at which each nucleotide occurs at each position of a TFBS and therefore cannot represent a TFBS quantitatively. It disregards too much of the information originally present in the set of TFBSs (Quandt et al. 1995).

Compared to IUPAC, matrix-based representations (including frequency matrices and Position Weight Matrices, PWMs) retain more information and are quantitative representations of TFBSs. In this case, the binding specificity of each position is described by the elements of the matrix. Comparative analysis (Osada et al. 2004) has shown that matrices are a much better way to represent TFBSs. Each matrix is of size 4 × k, where the rows correspond to the four nucleotides and the columns represent the base preference for each position of the binding site. For a frequency matrix, each column gives the nucleotide distribution at that position. For instance, Figure 2-1 is the frequency matrix of transcription factor v-Myb from TRANSFAC Professional r9.1 (Matys et al. 2003), which was obtained from 24 experimentally determined binding sites. Each element in the matrix represents how many times a specific nucleotide was observed at a certain position. The last column in the matrix is the IUPAC consensus sequence for the transcription factor.

Position   A    C    G    T    Consensus
01         13   4    4    3    A
02         14   3    5    2    A
03         2    7    1    14   T
04         23   0    0    1    A
05         24   0    0    0    A
06         1    23   0    0    C
07         0    1    17   6    G
08         1    1    21   1    G
09         11   5    4    4    N
10         12   5    2    5    A

Figure 2-1. Frequency matrix of transcription factor v-Myb from TRANSFAC.
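As a rough illustration of the matrix representation, the counts of Figure 2-1 can be held as a list of per-position rows and converted to relative frequencies or to a log-odds weight matrix. The sketch below is written under stated assumptions: the add-one pseudocount and the uniform background of 0.25 are common conventions chosen here for illustration, not necessarily the derivation used by TRANSFAC, Bucher (1990) or Stormo (2000), and all function names are hypothetical.

```python
import math

BASES = "ACGT"

# Counts from Figure 2-1 (v-Myb, 24 binding sites); one row per position, columns A, C, G, T.
V_MYB_COUNTS = [
    (13, 4, 4, 3), (14, 3, 5, 2), (2, 7, 1, 14), (23, 0, 0, 1), (24, 0, 0, 0),
    (1, 23, 0, 0), (0, 1, 17, 6), (1, 1, 21, 1), (11, 5, 4, 4), (12, 5, 2, 5),
]


def to_frequencies(counts):
    """Convert raw counts to relative frequencies, position by position."""
    return [tuple(c / sum(row) for c in row) for row in counts]


def most_frequent_bases(counts):
    """Pick the most frequent base at each position (ties resolved in A, C, G, T order)."""
    return "".join(BASES[row.index(max(row))] for row in counts)


def log_odds_pwm(counts, pseudocount=1.0, background=0.25):
    """Log2-odds weights; the pseudocount and uniform background are assumptions of this sketch."""
    pwm = []
    for row in counts:
        total = sum(row) + 4 * pseudocount
        pwm.append(tuple(math.log2(((c + pseudocount) / total) / background)
                         for c in row))
    return pwm


print(most_frequent_bases(V_MYB_COUNTS))
for position, weights in enumerate(log_odds_pwm(V_MYB_COUNTS), start=1):
    print(position, [round(w, 2) for w in weights])
```

The most frequent base agrees with the consensus column of Figure 2-1 everywhere except position 09, where the consensus is N because no base clearly dominates.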


The PWM is derived from this frequency matrix by estimating the base preference at each position from the nucleotide distribution (Bucher 1990; Stormo 2000). The quality of the matrix depends on the experimentally determined target sites. To date, most matrices have been built from binding sites obtained through in vitro target-site detection assays. However, it was recently reported that in vivo functional TFBSs can be detected by the ChIP-Chip (chromatin immunoprecipitation coupled with microarray chip hybridization) technique (Lee et al. 2002), which will greatly enhance both the quality and the coverage of these matrices.

TFBSs Databases

TRANSFAC, TRANSCOMPEL, TRRD and JASPAR are databases that store information about transcriptional regulation in eukaryotic cells.

TRANSFAC (Knuppel et al. 1994; Wingender et al. 1996; Wingender et al. 2000; Matys et al. 2003) is the leading database; it includes transcription factors, their target genes and Transcription Factor Binding Sites, matrices of TFBSs, and other information about transcriptional regulation. TRANSFAC represents the most comprehensive collection of TFBSs and summarizes them as matrices. All information collected in TRANSFAC is generated from published literature. Its content has been extended and improved over the last ten years, as shown in Table 2-2.

Table 2-2. Statistics of TRANSFAC.
Columns: TRANSFAC 1996; TRANSFAC 2000; TRANSFAC 6.0 2003; TRANSFAC r9.1 2005
Rows: Matrix; FACTOR; SITE; GENE; REFERENCE


The newly released TRANSFAC Professional r9.1 has a total of 753 matrices for bacteria, fungi, insects, nematodes, plants and vertebrates, as indicated in Table 2-3.

Table 2-3. Matrices in TRANSFAC Professional r9.1.

              Matrix Number   Matrix Number Without Duplication
Bacteria      1               1
Fungi         56              44
Insects       50              40
Nematodes     7               5
Plants        84              66
Vertebrates   555             379
Total         753             535

TRANSFAC contains redundant sets of matrices of diverse quality for some TFBSs. It is a commercial resource, and only an older version of TRANSFAC can be accessed by the public for free.

JASPAR (Sandelin et al. 2004) is an open-access database whose data are freely accessible without restriction. All matrices in JASPAR were derived from published collections of experimentally determined TFBSs. JASPAR claims to have a non-redundant collection of reliable matrices.

TRANSCOMPEL (Kel-Margoulis et al. 2000; Kel-Margoulis et al. 2002) is a sister database of TRANSFAC and includes information about composite regulatory elements. Each composite regulatory element contains two closely situated binding sites of distinct transcription factors. There are two types of composite regulatory elements: synergistic and antagonistic. A synergistic composite regulatory element results in transcriptional activation, while in an antagonistic composite regulatory element the two factors interfere with each other and repress transcription. However, the effectiveness of TRANSCompel is limited by its small size.


Transcription Regulatory Regions Database (TRRD) (Kel et al. 1997; Heinemeyer et al. 1998; Kolchanov et al. 1999; Kolchanov et al. 2000) describes the structure of eukaryotic transcription regulatory regions. Each TRRD entry corresponds to an entire gene, and binding sites are listed at the lowest level of the hierarchy. However, this database is seldom referred to in current publications.

Computational Identification and Statistical Evaluation of TFBSs

Computational Identification of TFBSs

Computational identification of TFBSs is very important for decoding the regulatory network of genes. Although some earlier methods of predicting TFBSs were based on consensus sequences (Kondrakhin et al. 1995; Prestridge 1995), most recently developed methods are based on matrices (Frech et al. 1997).

Match is a weight-based tool that searches for potential transcription factor binding sites in DNA sequences. Match is associated with TRANSFAC and uses the matrix library collected in TRANSFAC to search for TFBSs. Because TFBSs may be detectable on either the forward or the backward strand, Match searches both strands of the input DNA sequence. The algorithm uses two scores: the matrix similarity score (MSS) and the core similarity score (CSS). Both scores measure the similarity between the segments of the sequences under consideration and the TFBS matrices. MSS is calculated using all positions of the matrix, while CSS is calculated using only the five core (most conserved) positions of the matrix. Both MSS and CSS range from 0 to 1, where 1 means an exact match. MSS is calculated as follows (Kel et al. 2003):

$$\mathrm{MSS} = \frac{\mathrm{Current} - \mathrm{Min}}{\mathrm{Max} - \mathrm{Min}}$$

$$\mathrm{Current} = \sum_{i=1}^{L} I(i)\, f_{i,b_i}$$

$$\mathrm{Min} = \sum_{i=1}^{L} I(i)\, f_i^{\min}$$

$$\mathrm{Max} = \sum_{i=1}^{L} I(i)\, f_i^{\max}$$

$$I(i) = \sum_{B \in \{A,T,G,C\}} f_{i,B} \ln(4 f_{i,B}), \quad i = 1, 2, \ldots, L$$

where $f_{i,B}$ is the frequency of nucleotide $B$ occurring at position $i$ of the matrix ($B \in \{A, T, C, G\}$), $b_i$ is the nucleotide of the examined sequence at position $i$, $f_i^{\min}$ is the frequency of the nucleotide that is the lowest in position $i$, and $f_i^{\max}$ is the frequency of the nucleotide that is the highest in position $i$ of the matrix.

The information vector $I(i)$ reflects the extent of conservation of position $i$ in the matrix. The more conserved position $i$ is, the greater the information vector value of that position. In this manner, for a highly conserved position, a mismatch receives a large penalty while a match scores high. Less conserved positions, on the other hand, contribute less than conserved positions, whether they are a match or a mismatch. This leads to better recognition of TFBSs compared to methods that do not use an information vector.

Stormo (1998, 2000) and Stormo and Fields (1998) pointed out that logarithms of the base frequencies should be proportional to the binding energy, and the information vector is therefore related to the average binding energy between transcription factors and binding sites. Hence, the Match score is not just a statistical score but is directly related to the protein-DNA binding energy.
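The MSS definition above can be turned into a short Python sketch. It is not the Match implementation: it assumes that $f_{i,B}$ is the relative frequency obtained by normalizing each position of a count matrix, treats a zero-frequency term of $I(i)$ as contributing zero, and reuses the v-Myb counts of Figure 2-1 purely as example input. The function names are hypothetical.

```python
import math

BASES = "ACGT"

# Counts from Figure 2-1 (v-Myb); one row per position, columns A, C, G, T.
COUNTS = [
    (13, 4, 4, 3), (14, 3, 5, 2), (2, 7, 1, 14), (23, 0, 0, 1), (24, 0, 0, 0),
    (1, 23, 0, 0), (0, 1, 17, 6), (1, 1, 21, 1), (11, 5, 4, 4), (12, 5, 2, 5),
]


def frequencies(counts):
    """Assumption of this sketch: f_{i,B} is the count divided by the position total."""
    return [tuple(c / sum(row) for c in row) for row in counts]


def information_vector(freqs):
    """I(i) = sum over B of f_{i,B} * ln(4 * f_{i,B}); zero-frequency terms contribute 0."""
    return [sum(f * math.log(4 * f) for f in row if f > 0) for row in freqs]


def matrix_similarity_score(site, freqs):
    """MSS = (Current - Min) / (Max - Min) for a candidate site as long as the matrix."""
    if len(site) != len(freqs):
        raise ValueError("site length must equal the number of matrix positions")
    info = information_vector(freqs)
    current = sum(w * row[BASES.index(b)]
                  for w, row, b in zip(info, freqs, site.upper()))
    lowest = sum(w * min(row) for w, row in zip(info, freqs))
    highest = sum(w * max(row) for w, row in zip(info, freqs))
    return (current - lowest) / (highest - lowest)


FREQS = frequencies(COUNTS)
print(matrix_similarity_score("AATAACGGAA", FREQS))  # most frequent base at every position
print(matrix_similarity_score("AATAACGGCA", FREQS))  # one mismatch at weakly conserved position 09
print(matrix_similarity_score("AATACCGGAA", FREQS))  # one mismatch at fully conserved position 05
```

Comparing the last two calls shows the effect of the information vector: a mismatch at the fully conserved position 05 lowers the score more than a mismatch at the weakly conserved position 09.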


Match requires users to choose a score threshold, which is based on a balance between specificity and sensitivity. When a high threshold is used, noise is reduced in the output and the number of false positives decreases. At the same time, however, some functional binding sites with low scores are ignored: increased specificity causes decreased sensitivity. On the other hand, when a low threshold is used, both false matches and true binding sites are reported: increased sensitivity causes decreased specificity. Hence, an ideal threshold should restrict false positives without losing too many true positives.

One assumption of matrix-based methods is that positions in the binding site contribute additively and independently to the total activity. However, recent biological experiments suggest that dependence exists among positions in TFBSs (Benos et al. 2002; Bulyk et al. 2002). More complicated models (Stormo et al. 1986; Lawrence and Reilly 1990; Lawrence et al. 1993; Zhang and Marr 1993; Zhou and Liu 2004; Gershenzon et al. 2005) are being considered. For instance, di-nucleotides are chosen as the elements of the matrix instead of single nucleotides. The limitation of these methods is that they need more prior information and more training binding sites, which are currently not easy to obtain, so these approaches are rarely used.

Other available software tools for identifying TFBSs include MatInspector (Quandt et al. 1995), ConsInspector (Frech et al. 1993), FUNSITE (Kel et al. 1995), TFBIND (Tsunoda and Takagi 1999), SIGNAL SCAN (Prestridge 1996), MATRIX SEARCH (Chen et al. 1995), SeqVISTA (Hu et al. 2003; Hu et al. 2004), RSAT (van Helden 2003), BEST (Che et al. 2005) and P-Match (Chekmenev et al. 2005). The primary problem with existing methods for predicting TFBSs is that they tend to generate an unacceptably large number of false positives, and only a very small fraction of the predicted TFBSs are functional ones.
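The threshold trade-off described at the start of this discussion can be made concrete with a small sketch. The scores and functional labels below are made up for illustration only and do not come from Match or any data set; the function name `confusion_at_threshold` is likewise hypothetical.

```python
def confusion_at_threshold(scored_sites, threshold):
    """Count outcomes when every site scoring at or above the threshold is reported.

    scored_sites is a list of (score, is_functional) pairs.
    """
    tp = sum(1 for s, functional in scored_sites if s >= threshold and functional)
    fp = sum(1 for s, functional in scored_sites if s >= threshold and not functional)
    fn = sum(1 for s, functional in scored_sites if s < threshold and functional)
    tn = sum(1 for s, functional in scored_sites if s < threshold and not functional)
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }


# Made-up (score, is_functional) pairs, for illustration only.
TOY_SITES = [
    (0.95, True), (0.90, False), (0.85, True), (0.80, False),
    (0.78, False), (0.72, True), (0.70, False), (0.65, False),
]

for cutoff in (0.9, 0.8, 0.7):
    print(cutoff, confusion_at_threshold(TOY_SITES, cutoff))
```

On this toy input, lowering the cutoff recovers more of the functional sites but reports more non-functional ones, which is the balance a user of a threshold-based tool has to strike.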


Because of the large number of false positives, these methods are of little practical use for predicting TFBSs in vivo.

Statistical Evaluation of TFBSs

A fundamental challenge for TFBS-searching methods is how to reduce false positives and improve prediction accuracy. One of the reasons for false positives is that short sequences (such as those matched when searching for short TFBSs) occur frequently by chance. Estimating the statistical significance of an identified match is valuable and may increase the performance of TFBS-searching methods. To calculate statistical significance, a background model is necessary. The background model indicates how likely it is to see a score at least as good by chance. Formally, the p value of a score is defined as the probability of obtaining such a score or higher from the background model.

A simple way of estimating the p value of a score is via naive approximation. A large pool of subsequences is sampled from the background distribution, and the score of each subsequence is calculated based on the matrix. The p value is estimated as the fraction of samples that have scores at least as high as that score. The primary drawback of naive approximation is that it needs a huge number of samples to estimate the p value accurately. Normally, the number of samples should be two orders of magnitude larger than the inverse of the actual p value. If there are not enough samples, small p values will be poorly estimated. The number of available samples and the time complexity are the major drawbacks of the naive approximation method.

An independent and identically distributed (IID) background sequence model (Staden 1989) can also be used to evaluate the p value. The position p value of MAST (Motif Alignment and Search Tool) (Bailey and Gribskov 1998; Bailey and Gribskov 2000) is defined as the probability of observing such a match score or higher at a single, randomly chosen position in a random sequence, and it is obtained by calculating the cumulative density function based on the null hypothesis that each position in the sequence is independent and identically distributed (IID).

It has been reported that neighboring nucleotide compositions can affect the binding between a transcription factor and its binding site. The Local Markov Method (LMM) (Huang et al. 2004) incorporated the local genomic context into the p-value-based scoring method and modeled the background sequences as a Markov chain. The p value for a candidate site in this method reflects simultaneously both its similarity to the matrix and its difference from the local genomic context. LMM claimed that, compared to other current methods, it could reduce false positive errors by more than 50% without compromising sensitivity. Similarly, Thijs et al. (2001) use a higher-order background model to enhance the performance of their motif-finding algorithm.

Although statistical analysis has been integrated into predictions of TFBSs, large numbers of false positives still exist and accurate identification of TFBSs is hard to achieve.

Cis-Regulatory Modules (CRMs) and Computational Prediction of CRMs

Cis-Regulatory Modules (CRMs)

Genes are rarely regulated by a single TF. Abundant evidence shows that TFBSs occur in clusters rather than in isolation (Maniatis et al. 1987; Johnson and McKnight 1989; Shaw 1992; Wang et al. 1993; Tjian and Maniatis 1994). Each gene has a unique set of TFs that are responsible for its spatial and temporal expression. Transcription is initiated by the binding of RNA Polymerase II to the promoter region, but the expression level of the gene is controlled by the specific set of TFs. The binding of a TF to its TFBS may enhance or repress the binding of other TFs to their TFBSs (Weintraub et al. 1990; Fickett 1996; Muhlethaler-Mottet et al. 1998).

In higher eukaryotes, the regulatory information is organized into modular units which contain multiple TFBSs over a few hundred base pairs of DNA sequence. Such a functional unit is often referred to as a Cis-Regulatory Module (CRM; we also use "regulatory module" interchangeably). Various CRMs work together to offer the combinatorial regulation of gene transcription in response to different conditions (Yuh et al. 1998). Genes are typically controlled by several CRMs, which are generally located within a few hundred base pairs of the promoter region, but may occur thousands or even tens of thousands of base pairs either downstream or upstream from the transcription start site.

TFBSs in a CRM are co-localized at specific distances from each other on the genome. The binding sites within a CRM may overlap (Lewis et al. 1994), lie adjacent to each other (Whitley et al. 1994) or be dozens of base pairs apart (Lemaigree et al. 1990). Considering the complexity of the identification of CRMs, some groups (Hannenhalli and Levy 2002) have started to work on TF pairs called composite regulatory elements. Kel et al. (1995) investigated 101 composite regulatory elements and found that the distance between the binding sites varies between complete overlap and 80 bps. The mean distance was found to be around 17 bps. Qiu et al. (2002) analyzed all the entries of composite elements in the TransCompel database (version 3.0) and found that approximately 87% of the composite elements were within 50 bps of each other and about 65% were within 20 bps.


Transcriptional regulation is mediated by CRMs involving both protein-DNA and protein-protein interactions. The very high false-positive rate of current TFBS-prediction methods is most likely a result of attempting to predict TFBSs based only on protein-DNA interactions between TFs and TFBSs while ignoring their context, the interactions among TFs, etc.

Computational Identification of CRMs

It has been reported that a high local density of TFBSs accounts for the specificity of their activities and is required for functional CRMs, which has been used as the basis for identifying novel CRMs (Frith et al. 2001; Berman et al. 2002; Halfon and Michelson 2002; Rajewsky et al. 2002; Rebeiz et al. 2002; Johansson et al. 2003; Berman et al. 2004). These methods define a cluster as a DNA window of a certain size containing a certain number of known TFBSs, and then search for such clusters in the given DNA sequence to predict regulatory regions. These methods, however, need prior knowledge of TFBSs, which is not available in general.

With the development of microarray technology, genes can be classified into different clusters based on their expression pattern. Genes in the same cluster are assumed to be co-regulated. Co-regulated genes are hypothesized to share common Cis-Regulatory Modules (CRMs), since they share the same regulatory response. It has been observed that functional TFBSs often appear several times in the regulatory region, so that occurrences of relevant TFBSs are significantly more frequent than expected by chance. Hence, methods based on the comparison of co-regulated genes search for over-represented motifs in the collection of regulatory regions (Bailey and Elkan 1995; Workman and Stormo 2000; Liu et al. 2001; Thijs et al. 2002; Elkon et al. 2003; Zheng et al. 2003; Cora et al. 2004; Grad et al. 2004). A high frequency of false positives is also a major problem with these methods. Furthermore, these methods require a reliable set of co-regulated genes. It is also possible, however, that the expression of one gene is triggered by the expression of another gene in the same cluster, so that the two are not really co-regulated. Some groups have tried to overcome this problem by annotating genes to specific Gene Ontology (GO) terms (Cora et al. 2004).

Cross-species Conservation of Cis-Regulatory Modules (CRMs)

Functional sequences, including coding and functional noncoding sequences, tend to evolve at a slower rate than non-functional sequences. Noncoding sequences that play an important role in regulation are assumed to survive natural selection for longer periods of time. It has been shown that CRMs are conserved across species (Frazer et al. 2003; Ludwig et al. 2005).

Figure 2-2. The known TFBSs in S2E: bicoid (circle), hunchback (ovals), kruppel (squares), giant (rectangles), and sloppy-paired (triangle). Symbols representing sites 100% conserved compared to D. mel are green, while those diverged are shaded orange.


The early Drosophila embryo is a paradigmatic model for studying transcriptional control of development (Arnosti 2002). Exhaustive genetic analysis has been applied to this model. Even-skipped (eve), an important developmental gene in Drosophila, produces seven transverse stripes in the blastoderm embryo. The stripe 2 enhancer (S2E) is the best studied eve enhancer and includes multiple TFBSs of five TFs: the activators bicoid (bcd) and hunchback (hb), and the repressors giant (gt), Kruppel (kr), and sloppy-paired (slp) (Small et al. 1992; Arnosti et al. 1996; Andrioli et al. 2002). As shown in Figure 2-2, experimental investigations suggest that S2E is evolutionarily conserved among D. melanogaster, D. yakuba, D. erecta and D. pseudoobscura. D. yakuba and D. erecta are separated from D. melanogaster by approximately 10-12 million years, and D. pseudoobscura split from D. melanogaster about 40-60 million years ago (Ludwig et al. 1998; Ludwig 2002; Ludwig et al. 2005). There is no detectable difference in the spatial or temporal control of gene expression among these four species.

Mel AATATAACCCAATAATTTGAAGTAACT------------GGCAGGAGCGAGGTAT---- 43
PSE AATATAACCCAATAATTTTAACTAACTCGCAATGGACAGGGCAGTAGAGCAGTAGAGTAG 60

Mel --CCTTCCTGGT---------TACCCGGTACTG-----CATAACAATGGA-----ACCCG 82
PSE AGCATTGCAGGAAGGATGCAATACTCGGGAATGGAATGCATAACAATGGGCAAGGACCAG 120

Mel AA--CCGTAAC-TGGGACAGA-----TCGA---------AAAGCTGGCCTGGTTTCTCGC 125
PSE GGTTCCGTTTCGCGAGATAAGGTTCTTTGACGGTTCCTTGACGGTTCCTTGACGGTTCCC 180

Mel TGTGTGTGCC-------GTGTTAATCCGTTTGCCATCAGCGAGATTATTAGTCAATTGC 177
PSE TGTGTGCTCTCTGCTCTGTGTTAATCCGTTTGCCATCAGCAAGATTATTAGTCAATTTTC 240

Mel -------------AGTTGCAGC----GTTTCGCTTTCGTCCT------CGTTTCACTTT 213
PSE ATATTTCCAGTCGAGTCGCAGTTTTGGTTTCACTTTCCTCCTTTGCCACTTCTTGCCTTG 300

Mel ---------------------------------------------------------CGA 216
PSE CCTCATGTGGATGCCGATGCCGATGCCGTTGCCGTTGCCGTTGCCGTTGCCGACCGACGA 360

Mel GTTAGACTTTATTGCAGCATCTTGAACAATCGTC-GCAGTTTGGTAACACGCTG----- 269
PSE GTTAGATTTTATTGCAGCATCTTGAACAATCAACTGGAATTTGGTAACATGCTGCGCGGC 420

Mel --------------TGCCATACTTTCATTTAGACGGAATCGAGGGACCCTGGACTATAAT 315
PSE CTAACCCTGGAGATTGCTCTACTTTCGCCTCAATTGAATCGGAGTTAGGCGGAAGACGGC 480

Figure 2-3. S2E of D. Mel and D. Pse alignment using CLUSTALW.


Mel CGCACAACGAGACC--GGGTTGC-----GAAGTCAGGGCATTCCGCCGATCTAGCCATCG 368
PSE GGACCCTTGCGACCAAGGGTTGTCTCCTGGCCTCAGGAGTTTCCACAGTCAACGCTTTCG 540

Mel CCATCTTCTGCGGGCGTTTGTTTGTTTGTTTGCTGGGATTAGCC-AAGGGCTTGACTTGG 427
PSE CTGGTTTGTTTATTTGTTTGTTTGTTT--TAGCCAGGATTAGCCCGAGGGCTTGACTTGG 598

Mel AATCCAATC-----------------------------------CTG-ATCCCTAGCCCG 451
PSE AACCCGACCAAAGCCAAGGGCTTTAGGGCATGCTCAAGAGATCCCTATATCCCTATCCCT 658

Mel ATCCCAATCCCAATCCCAATCCC--TTGTCCTTTTCATTAGAAAGTCATAAAA-ACACAT 508
PSE GTCGCGATCCCTAAACCGATCCCATTTGGCAATTTCATTAGAAAGTCATAAAACACACAT 718

Mel AATAATGA--TGTCGAAGGGATTAGG-----GGCGCGCAGGTCCAGGCAACGCAAT---T 558
PSE AATAATGAGATGTCGAAGGGATTAAGATTAAGGGACGCACACACAGGCAGCAGGATCATT 778

Mel AACGGACTAGCGAACTGGGTTATTTTTTTGCGCCGA-----------CTTAGCCCTGATC 607
PSE AACGGACTAACGGAATCGGGTTATTTTTTGCGCTCAACCGGACGGAGCTTAACGCTAAGC 838

Mel CGCGAGCTTAACCCGTTTTGAGCCGGGCAGCAGGTAG-TTGTGGGTGGACCCCACGACTT 666
PSE CACAAGCTTAACCCCTTGTGGGCCG--CGGCAGGTAAATCGATGACCGATGTCGCTTGCG 896

Mel TTTTGGCCGAACCTCCAATCTAACTTGCGCAAGTGGCAAGTGGCCGGTTTGCTGGCCCAA 726
PSE CAAGGCCCCTACTACTCCCCTCCCCTCC-CATATGACAACCCACTAACC--CTGCCCCCG 953

Mel AAGAGGAGGCACTATCCCGGTCCTG---GTACAGTTGG-TACGCTGGGAATGATTATATC 782
PSE CCCTCTC--CACCACCACTGTACTGTATGTACAGTTGCCTCCTCTCTGGATGATTATATC 1011

Mel ATCATAATAAATGTTT 798
PSE ATCATAATAAATGTTT 1027

Figure 2-3. Continued

Although similar at the TFBS level and functionally conserved, S2E in D. Mel and D. Pse is substantially diverged at the sequence level, as shown in Figure 2-3. One can observe large insertions or deletions in the spacers between TFBSs, nucleotide substitutions within TFBSs, variation in enhancer length, etc.
Only 3 of 17 known TFBSs in D. melanogaster are completely conserved among all four species. Computational Prediction of Cis-Regula tory Modules (CRMs) by Cross-species Genome Comparison In contrast, cross-species genome compar ison (Phylogenetic Footprinting) is more reliable and is capable of identifying CRMs fr om a single gene of different species, as long as they are conserved acr oss species. Cross-species ge nome comparison is based on the assumption that functional sequences, like regulatory regions, are more conserved

PAGE 30

evolutionarily than nonfunctional regions, due to selective evolutionary pressure. Like a filter, cross-species genome comparison eliminates non-conserved regulatory regions and thereby increases the selectivity and specificity of CRM prediction. Evolutionarily conserved noncoding regions across species are a potentially rich source for the identification of regulatory information of genes and can be found by sequence alignment tools such as BLAST (Altschul et al. 1990), PIPMaker (Schwartz et al. 2000), MultiPipMaker (Schwartz et al. 2003), CLUSTAL W (Thompson et al. 1994), AVID (Bray et al. 2003) and VISTA (Mayor et al. 2000). A large number of methods have been developed recently to predict CRMs based on cross-species genome comparison. With complementary data provided by sequence comparison, these methods have improved the prediction of TFBSs.

ConSite (Lenhard et al. 2003; Sandelin et al. 2004) is a web-based tool for the detection of transcription factor binding sites, which integrates both cross-species comparison and binding site prediction generated with matrices. ConSite reports TFBSs that are in conserved regions and at the same time located as pairs of sites in equivalent positions in alignments between two orthologous sequences. By incorporating evolutionary constraints, selectivity is increased compared to approaches based purely on a profile search of a single genomic sequence. ConSite claims that it reduces the noise level by ~85% while retaining high sensitivity compared to single sequence analysis.

Similar to ConSite, rVista (Loots et al. 2002; Loots and Ovcharenko 2004) combines sequence comparison and TFBS profile based prediction, and attempts to identify aligned TFBSs from noncoding regions that are evolutionarily conserved. rVista considers noncoding regions with low DNA similarity as least likely to be biologically
relevant and it discards these regions for TFBS searches. rVista claims that this is an effective way to reduce false positives.

TraFAC (Jegga et al. 2002) is a web-based application to detect conserved TFBSs from a pair of DNA sequences. It defines hits as the density of shared TFBSs within a 200-bp window in evolutionarily conserved noncoding regions. The hit count is plotted as a function of position.

PhyME (Sinha et al. 2004) is also an algorithm based on cross-species comparison. In addition, it integrates another important feature of a TFBS, overrepresentation, into the algorithm. Overrepresentation depends on the number of occurrences of the TFBS in each species. By searching for conserved and statistically over-represented TFBSs, PhyME claims to increase both sensitivity and specificity.

TOUCAN (Aerts et al. 2003; Aerts et al. 2005) is a web-based Java application for the prediction of cis-regulatory elements from sets of orthologs or co-regulated genes. Orthologs are aligned and conserved regions are selected as candidate search regions. TOUCAN has two TFBS prediction algorithms. One is MotifScanner, which is used to search for known TFBSs based on their matrices. Instead of using a predefined threshold that is independent of the sequence context, MotifScanner uses a probabilistic model to estimate the significance of TFBSs. The other is MotifSampler, which is used to search for new TFBSs based on Gibbs sampling.

Another algorithm (Cora et al. 2004) tries to predict over-represented TFBSs from sets of orthologs. It focuses on short conserved regulatory regions in mice and humans so that a suitable signal-to-noise ratio can be achieved. Statistically over-represented TFBSs in those regions are searched and their annotations to Gene Ontology (GO) terms are
analyzed. They conjecture that functional TFBSs will be ones that are statistically overrepresented and share the same specific Gene Ontology terms.

CREME (Sharan et al. 2003; Sharan et al. 2004) is another web-based tool for identifying and visualizing CRMs from a set of potentially co-regulated genes. It uses rVista 2.0 to find TFBSs that are conserved between genomes. Subsequently, it enumerates all combinations of these TFBSs that occur within a user-defined window. These combinations are statistically evaluated and significant combinations are reported. Compared to other methods, CREME considers the combination of TFBSs and also evaluates the statistical significance of the predicted CRMs.

Venkatesh and Yap (2005) searched for cis-regulatory elements in non-coding sequences conserved between fugu and mammals. They believe that the cis-regulatory elements they found are likely to be shared by all vertebrates, considering the relatively long evolutionary distance between fugu and mammals.

In general, methods based on cross-species genome comparison have improved the prediction of TFBSs by searching for TFBSs only in regions whose sequence is conserved across species. However, such methods fail to detect CRMs that have retained their functions but have evolved in their sequence. The performance of these methods depends on the evolutionary distances between species, the alignment algorithms, and the profile models of TFBSs.

These days, more and more evidence (Thanos and Maniatis 1995; Erives and Levine 2004; Senger et al. 2004) has shown that CRMs might be highly structured and depend on a number of organizational constraints, such as the distances between TFBSs and the orders and orientations of TFBSs, which may be used as further constraints to find functional CRMs in a noisy background.

CHAPTER 3
THE FEASIBILITY STUDY OF CROSS-GENOME COMPARISON AT THE BINDING-SITE LEVEL

Background

Transcription is the fundamental biological process in which a selected region of DNA is transcribed into RNA by a molecular machinery, a core component of which is RNA polymerase. For most protein-coding genes, transcription is the intermediate step via which the information coded in their DNA is expressed into functioning proteins. For others, such as RNA genes, the product of the transcription itself may have biological function. Even though each cell has the complete set of genes in its chromosomal DNA, only a portion of the genes are transcribed (expressed) in any particular cell, depending on tissue/cell type, developmental stage, etc. (Zhang et al. 1997). The transcriptome, that is, all of the genes that are selectively transcribed in a cell, determines the function and morphology of the cell. In addition, the level (i.e., rate) of transcription is often regulated in response to intra-cellular and extra-cellular environmental factors to achieve cellular homeostasis. Normal transcriptional regulation, i.e., the right genes being transcribed at the right times, in the right cell, and at the appropriate rates, is central to almost all physiological processes. Abnormal regulation of transcription often results in disruption of development and/or pathological states. For example, ectopic (i.e., abnormally high) expression of oncogenes leads to hyperplasia and cancer.

A basic element of transcriptional regulation is the interaction of transcription factors (trans factors) and their corresponding transcription factor binding sites (TFBSs,
also referred to as cis factors) on the DNA. Transcriptional regulation of a gene (e.g., restricted transcription in a particular cell type, or elevated transcription in response to UV light) is often mediated through the functional/physical interactions among multiple transcription factors, each recruited to the proximity of the DNA in part by its selective affinity to its corresponding binding sites. For example, the even-skipped (eve) gene is transiently expressed in 7 alternating stripes on the longitudinal axis in the developing Drosophila melanogaster embryo at the blastoderm stage. Each of the seven stripes is regulated by a distinct set of transcription factors binding to their corresponding binding sites located in a DNA segment flanking the even-skipped gene. The best investigated of these segments is the stripe 2 regulatory region, which contains identified binding sites for 5 different transcription factors in a 700 bp (base pair) to 1 kb (kilobase pair) DNA region upstream of the transcription initiation site of the eve gene. Evolutionary comparison of this transcription regulatory module in different Drosophila species has revealed that most of the binding sites are highly conserved and functional, even though the underlying DNA sequence has undergone considerable change (Ludwig et al. 1998).

A useful analogy for understanding the composition of DNA regulatory modules is to consider DNA as a sequence of Letters and individual binding sites as Words. Then, a functional module of closely associated binding sites can be conceived of as the Sentence command embedded in the DNA sequence that guides transcription. The problems associated with identifying the Sentence commands in a DNA sequence are twofold. First, the binding sites are degenerate in nature, that is, the same Word may have different letters in certain positions. Second, the Words are themselves interspersed with varying lengths of meaningless Letters.

One approach to identifying DNA regulatory modules is through cross-genome comparison. The assumption underlying this approach is that DNA sequences encoding functional TFBSs are under selective pressure to be conserved during evolution, whereas non-functional DNA mutates/changes more rapidly. Thus, if DNA sequences flanking orthologs in two related species were to be compared for sequence-level similarity, DNA regulatory modules would appear as conserved islands in a sea of otherwise non-conserved DNA sequences. Approaches in this category include rVista 2.0, ConSite, PhyME, TOUCAN, CREME, TraFAC, etc. (Sharan et al. 2003; Sandelin et al. 2004; Loots and Ovcharenko 2004; Sinha et al. 2004; Aerts et al. 2005; Schwartz et al. 2003; Cora et al. 2004; Venkatesh and Yap 2005). For instance, based on the sequence level conservation between human and mouse, Cora et al. predicted functional TFBSs that are statistically over-represented and share the same specific Gene Ontology (GO) terms (Cora et al. 2004). This kind of cross-genome comparison approach has successfully led to the discovery of regulatory modules that were subsequently verified by functional characterization (Santini et al. 2003).

The disadvantage of the sequence-based approach is that it is dependent on the overall conservation of the DNA region harboring the regulatory module. As we mentioned earlier, TFBS sequences are degenerate in nature and are interspersed with nonfunctional sequences which are not under conservation pressure. Depending on the ratio of functional versus non-functional base pairs in the DNA region, it is entirely possible that the overall sequence level conservation of the region is indistinguishable from the background level, while the actual TFBSs making up the functional module are still conserved. In other words, it is possible that despite the conservation of the Sentence
command at the binding site level, the overall conservation of the DNA backbone at the sequence level is minimal or non-detectable. This situation is aggravated if the participating binding sites are highly degenerate (i.e., tolerate many variations at the sequence level) and the spaces between the binding sites are long. In fact, researchers have observed in many instances that while the regulatory region has no detectable overall similarity, the transcription regulatory function is preserved (Ludwig et al. 2005). Sequence-based approaches, or approaches requiring filtration of sequences based on DNA level similarity, are not helpful for identifying the responsible TFBSs in this scenario.

Since the conservation pressure is at the binding site level, i.e., the sequence must be able to maintain binding affinity to the transcription factor(s), it makes biological sense to perform comparisons at the binding site level rather than at the sequence level. This, however, is currently hindered by two factors.

The first limitation is the effectiveness of identifying transcriptional binding sites in a given DNA sequence. The set of TFBSs for a given TF can be quantitatively represented using a frequency matrix that describes the binding specificity of the TF at each of its positions. The quality of the matrix used to identify potential TFBSs is determined by the number and quality of known binding site sequences used to construct the matrix. As a result of the development of new technologies such as ChIP (Yan et al. 2004) and ChIP-chip (Lee et al. 2002), it is anticipated that binding site instances will be identified at an unprecedented rate, which should greatly enhance both the quality and the coverage of binding site matrices in the near future (Matys et al. 2003).
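
As a minimal sketch of the frequency matrix idea described above, the following Python fragment counts the bases observed at each position of a set of aligned binding site instances and converts the counts to frequencies. The function name and the example sites are hypothetical and chosen only for illustration; they are not taken from TRANSFAC, and real matrices are usually built with pseudocounts and curated alignments.

```python
from collections import Counter

def frequency_matrix(sites, alphabet="ACGT"):
    """Return per-position base frequencies for equal-length aligned sites."""
    if not sites:
        raise ValueError("need at least one site")
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("sites must be aligned to the same length")
    matrix = []
    for pos in range(length):
        # Count how often each base occurs at this position across all sites.
        counts = Counter(s[pos].upper() for s in sites)
        matrix.append({base: counts[base] / len(sites) for base in alphabet})
    return matrix

# Hypothetical aligned binding site instances (made up for illustration).
example_sites = ["TAATTG", "TAATCG", "TAATTA", "CAATTG"]
pfm = frequency_matrix(example_sites)
```

With more known instances per factor, each column of such a matrix is estimated from more observations, which is why the text expects matrix quality to improve as more binding sites are identified.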

The second limitation is that we currently do not understand the grammar governing how binding sites (Words) make up the regulatory modules (Sentences). Based on our understanding of transcriptional regulation, such a grammar should have at least three components: (1) the composition of the binding sites, (2) the sequence of the binding sites, and (3) the spaces between/among the binding sites. Currently, the number of regulatory modules that have been thoroughly characterized is far smaller than what is required to decode this grammar.

A major challenge for biologists working on transcriptional regulation is locating and identifying potential TFBSs responsible for a particular regulatory module, especially in sequences that do not have significant conserved islands. In this paper, we describe a novel methodology for binding site level identification of conserved regulatory modules in such sequences.

Results

Simulating Sequence Pairs Harboring a Conserved Module of Binding Sites

Since well-studied regulatory modules are currently rather sparse, we chose to simulate sequence sets (pairs) representing the domain of our interest, i.e., conserved binding site patterns in a pair of sequences which nonetheless have little or no similarity at the sequence level. In many cases, experimental investigation in a model organism has narrowed down the location of the regulatory module(s) for a particular gene to a relatively short region (e.g., within 1 kb), whereas for the ortholog in a less-studied organism (reference organism), information about the localization of the module is absent (except that it is generally in the proximity of the gene). In view of this, in the first (current) stage of the development of BLISS, we considered the identification of a conserved module present in both a short sequence of about 100-500 bp, representing
the model organism, and a longer sequence (5-6 kb), representing the reference organism. Although this simplification limits the applicability of the current methodology, it does highlight the promise of our approach.

For each sequence pair, the backbones for both the short sequence and the long sequence were generated randomly and thus had no sequence similarity. A hypothetical module involving binding sites for 4-8 distinct transcription factors was first introduced into the short sequence. The binding site sequences were randomly selected from the instances recorded in the TRANSFAC 9.1 database (Matys et al. 2003). The rules governing the formation of the hypothetical module were as follows:

1. A module contains binding sites for 4-8 distinct transcription factors.
2. For each transcription factor, there may be more than one binding site in the module.
3. The distance between consecutive binding sites is di, where in 65% of the cases di lies within 5-20 bp, in 22% of the cases di lies within 21-50 bp, and in the remaining 13% of the cases di lies within 51-60 bp (Figure 3-1).

The range of values for di was based on background knowledge as well as on a statistical analysis by Qiu et al. of the distances between pairs of TFBSs in the TransCompel database, which revealed the above distribution of distances (Qiu et al. 2002).

The hypothetical module was first simulated according to the above rules in the short sequence. Subsequently, a conserved module was formulated and inserted into the longer sequence at a random location. The rules governing the formation of the conserved module were as follows:

Figure 3-1. Simulating conserved regulatory modules in a pair of random sequences. Each pair has a short (100-500 bp) and a long (5-6 kb) sequence. For both sequences the backbones were generated randomly. The hypothetical module was formulated according to rules described in the text and inserted into the short sequence. A conserved module with binding sites corresponding to the same transcription factors, but with different sequences, was inserted into the long sequence. The distances between consecutive binding sites were also different between the hypothetical module and the conserved module.

1. It comprises TFBSs that correspond to the same transcription factors as are present in the hypothetical module.
2. The sequence for each TFBS is randomly chosen from the recorded instances in TRANSFAC 9.1, with the caveat that it cannot be the same instance(s) that was (were) used to construct the hypothetical module in the short sequence.
3. The respective order of TFBSs is the same as in the hypothetical module in the short sequence.
4. The distance between consecutive binding sites in the conserved module is dj; dj is a function of di in that dj lies in the range (di - d, di + d) (Figure 3-1). d is the perturbation factor, i.e., the variation of distance between corresponding binding sites in the hypothetical module and the conserved module. In this study, we used d = 4 (see Discussion).

A total of 10,000 pairs of sequences were generated according to the above rules, and were used to test and evaluate various algorithms.

Identifying a Conserved Module by Comparison at the Binding Site Level

As stated above, the objective of our methodology is to identify conserved regulatory modules within highly divergent sequences. The sequence pairs in our simulated data set had little overall sequence similarity. Of the 10,000 pairs, 73.32% had no similarity detectable by BLAST analysis (E = 0.01, Blast2seq). This indicated that the conservation of binding sites in the hypothetical module and the conserved module was not sufficient to allow detection at the sequence level. For the pairs that did have a significant match, the output alignments were shorter than 30 bp, which was far shorter than the length of the inserted module.

M_score. To identify the conserved module at the binding site level, we first generated the potential TFBS profiles for each of the simulated sequence pairs. A matrix scoring method similar to that used in Match (Kel et al. 2003) was implemented (see Methods), which allowed us to score each sequence against the TFBS frequency matrices recorded in TRANSFAC 9.1 (M_score, Figure 3-2a). When a cut-off value of 0.75 (on M_score) was applied, on average there were about 3000 identified potential TFBSs in every thousand base pairs of simulated sequences. This is similar to what we observed when using genomic DNA sequences randomly extracted from model organism databases (data not shown).
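
The spacing rules used to build the simulated pairs (rule 3 for the hypothetical module and rule 4 for the conserved module) can be sketched in Python as below. This is an illustration under stated assumptions, not the code used in the study: the function names are hypothetical, distances are assumed to be drawn uniformly within each range, and the range (di - d, di + d) is treated as inclusive of its end points.

```python
import random

# Spacing bins from rule 3: (probability, low bp, high bp), bounds inclusive.
SPACING_BINS = [(0.65, 5, 20), (0.22, 21, 50), (0.13, 51, 60)]

def sample_di(rng):
    """Draw one inter-site distance d_i from the three-bin distribution.

    Assumption: distances are uniform within each bin.
    """
    r = rng.random()
    cumulative = 0.0
    for prob, low, high in SPACING_BINS:
        cumulative += prob
        if r < cumulative:
            return rng.randint(low, high)
    # Guard against floating-point round-off in the cumulative sum.
    return rng.randint(SPACING_BINS[-1][1], SPACING_BINS[-1][2])

def perturb(di, d, rng):
    """Draw d_j within d bp of d_i (assumption: end points included)."""
    return rng.randint(di - d, di + d)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
hypothetical_spacings = [sample_di(rng) for _ in range(5)]
conserved_spacings = [perturb(di, 4, rng) for di in hypothetical_spacings]
```

Each entry of `conserved_spacings` then differs from the corresponding entry of `hypothetical_spacings` by at most d = 4 bp, which is the amount of spacing variation the simulated conserved module is allowed to carry.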

Figure 3-2. Gaussian smoothing of the TFBS profile. a.) Profile of matrix matching scores (M_score) for three TFBSs along a short DNA sequence. b.) M_score profile after applying a cut_off of 0.75. c.) Profile of P_score after incorporating the p value of binding site matches. d.) Profile of the G_score after Gaussian smoothing. The colors represent three different TFBSs: Red, En; Green, Croc; Blue, Lun-1.

To identify the hypothetical module embedded in the sequences, we tried several different algorithms that compared the binding site profiles of the short and the long sequences. Of those tested, a scoring method (using inner products) based on a statistical evaluation of binding site matches after a Gaussian smoothing of the binding site profiles gave reliable and promising results. The performance test of this method is listed in Appendix A.

P_score. The matrix score (M_score), by virtue of its definition (f.1), ranges from 0 to 1 for all TFBS matrices. Thus it does not differentiate short and relatively simple matrices
that match DNA sequences with a high frequency from long and stringent matrices that match DNA sequences only rarely. For example, the binding site for En (I$ED_06) has 7 positions, and on average there are 320 matches (with M_score>0.75) in any 10,000 bp random sequence. In contrast, the binding site for Bel-1 (V$BEL1_B) has 13 positions, and the average number of matches with M_score>0.75 in a 10,000 bp random sequence is 3. It is clear that a match involving the binding site V$BEL1_B is far more significant than a match with I$ED_06. To capture this difference, we introduced the p value of the M_score, which was estimated by calculating the fraction of randomly generated sequences that have scores equal to or higher than that M_score. We then calculated the P_score (see Methods) as the product of -log (p value of M_score>cut_off) and the M_score (Figure 3-2c).

G_score. To account for the variation in the distances between/among binding sites, we performed a Gaussian smoothing of the P_score (see Methods). Through empirical testing (data not shown), we found that a variance of σ² = 9 gave the best performance. We denote the Gaussian smoothed score profile as the G_score profile of the sequence (Figure 3-2d).

BLISS_score. G_score profiles were generated for both the short and the long sequences. To identify a maximum match at the binding site profile level, the short G_score profile was slid along the long G_score profile. At each position, the match between the short profile and its corresponding region of equal length (length of the window) in the long profile was evaluated using an inner product as the BLISS_score (see Methods). Note that since the length of the window appears in the denominator, the BLISS_score is independent of the length of the short profile (or the length of the
window). Figure 3-3a shows the distribution of BLISS_scores as the shorter G_score profile was slid along the longer G_score profile. The peak of the BLISS_score indicates the maximum match. In this case, the abrupt surge in the BLISS_score is due to the match between the embedded hypothetical module in the short sequence and the conserved module in the long sequence. When this methodology was tested on all of the 10,000 simulated sequence pairs, the highest peak contained the correct match between the embedded hypothetical module and the conserved module for about 80% of the pairs.

Distribution and Statistical Evaluation of the BLISS_score

To be able to evaluate the match at the binding site profile level, we analyzed the distribution of BLISS_scores using the simulated sequence pairs. For each pair of sequences, BLISS_scores were calculated at each position as the short profile was slid along the longer profile. Each peak match (corresponding to a peak in the score profile) between a pair of sequences was evaluated to see whether it aligned the embedded modules. If the match did align the modules, it was designated a true match. All other BLISS_scores were considered as background.

Figure 3-3b shows the distribution of the background and the true match BLISS_scores for the 10,000 simulated pairs of sequences. This distribution varies slightly depending upon the cut_off threshold set for M_score (Figure 3-3b & c). This is not surprising, since a lower cut_off threshold will lead to more identified potential binding sites and thus a slightly higher background score.
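
The chain from P_score through G_score to the sliding BLISS_score can be sketched as follows. The exact formulas are given in the Methods and are not reproduced in this section, so this is only an illustration under explicit assumptions: the logarithm is taken to base 10, the Gaussian kernel is truncated at three standard deviations and normalized, profile ends are zero padded, and the BLISS_score is taken to be the sum over transcription factors of the inner product of the two G_score tracks divided by the window length. All function names and the toy sites are hypothetical.

```python
import math

def p_score(m_score, p_value):
    """P_score = -log(p value) * M_score (assumption: log base 10)."""
    return -math.log10(p_value) * m_score

def gaussian_kernel(variance=9.0):
    """Normalized Gaussian weights, truncated at 3 standard deviations."""
    radius = int(3 * math.sqrt(variance))
    weights = [math.exp(-(k * k) / (2.0 * variance))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def smooth(track, kernel):
    """Spread each P_score over neighboring positions (zero padded ends)."""
    radius = len(kernel) // 2
    out = [0.0] * len(track)
    for i, value in enumerate(track):
        if value == 0.0:
            continue
        for k, w in enumerate(kernel):
            j = i + k - radius
            if 0 <= j < len(out):
                out[j] += value * w
    return out

def make_profile(length, sites):
    """Build G_score tracks from (tf, position, p_score) tuples."""
    kernel = gaussian_kernel()
    tracks = {}
    for tf, pos, score in sites:
        tracks.setdefault(tf, [0.0] * length)[pos] = score
    return {tf: smooth(track, kernel) for tf, track in tracks.items()}

def bliss_scores(short_g, long_g):
    """Slide the short profile along the long one; one score per offset."""
    window = len(next(iter(short_g.values())))
    long_len = len(next(iter(long_g.values())))
    scores = []
    for offset in range(long_len - window + 1):
        total = 0.0
        for tf, s_track in short_g.items():
            l_track = long_g.get(tf)
            if l_track is None:
                continue  # factor has no predicted site in the long sequence
            total += sum(s * l_track[offset + i]
                         for i, s in enumerate(s_track) if s)
        scores.append(total / window)  # window length in the denominator
    return scores

# Toy example: two sites spaced 15 bp apart in the short profile and
# 17 bp apart in the long profile, plus one unrelated "A" site.
site = p_score(0.9, 1e-3)
short_profile = make_profile(40, [("A", 10, site), ("B", 25, site)])
long_profile = make_profile(120, [("A", 50, site), ("B", 67, site), ("A", 90, site)])
scores = bliss_scores(short_profile, long_profile)
best_offset = max(range(len(scores)), key=scores.__getitem__)
```

In this toy case the spacing between the two sites differs by 2 bp between the profiles, and the highest score falls at offset 41, midway between the two offsets that would align either site exactly. This illustrates how the smoothing lets a match tolerate small changes in inter-site distance.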

Figure 3-3. Identifying a conserved module at the binding site level. a.) The peak in the BLISS_score profile represents the maximum match at the binding site level. b.) and c.) show the distribution of BLISS_score in true matches vs. background with a cut_off value of 0.75 and 0.80, respectively.

The distributions allow us to evaluate any particular BLISS_score and help in setting a threshold for reporting significant matches at the binding site level. Given a BLISS_score x, the distributions allow us to decide whether that BLISS_score corresponds to a true alignment of modules or to a module aligned with a random DNA segment. Let C1 denote the event where the modules embedded in the short and the long sequences are aligned, and C2 denote the event where either module is aligned with a random DNA segment. Based on Bayes' formula, the posterior probabilities can be calculated as follows:

\[
p(C_1 \mid x) = \frac{p(x \mid C_1)\, p(C_1)}{p(x \mid C_1)\, p(C_1) + p(x \mid C_2)\, p(C_2)}
\]

\[
p(C_2 \mid x) = \frac{p(x \mid C_2)\, p(C_2)}{p(x \mid C_1)\, p(C_1) + p(x \mid C_2)\, p(C_2)}
\]

where p(C1|x) is the conditional probability of C1 given a BLISS_score x and p(C2|x) is the conditional probability of C2 given a BLISS_score x; p(C1) is the prior probability of C1 and p(C2) is the prior probability of C2; p(x|C1) is the conditional probability of observing x given C1 and p(x|C2) is the conditional probability of observing x given C2.

p(x|C1) and p(x|C2) can be read off directly from the distributions generated. The prior probabilities p(C1) and p(C2) are difficult to calculate. In this study, we assumed p(C1) = p(C2), although we suspect that p(C1) might be smaller than p(C2). This assumption allowed us to calculate the posterior probabilities to evaluate a BLISS_score x. In practice, we decided that x was a significant matching score if p(C2|x) was less than a threshold of, e.g., 0.01 or 0.001.
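
A small Python sketch of this posterior calculation is given below. The function name is hypothetical, and the two likelihood values in the example are made-up numbers standing in for values that would be read off the true-match and background distributions; equal priors are used, as assumed in the text.

```python
def posterior_background(px_c1, px_c2, prior_c1=0.5, prior_c2=0.5):
    """Return p(C2|x): the probability that score x is a background match.

    px_c1 and px_c2 are p(x|C1) and p(x|C2), read from the two
    BLISS_score distributions; the priors default to p(C1) = p(C2).
    """
    evidence = px_c1 * prior_c1 + px_c2 * prior_c2
    if evidence == 0.0:
        raise ValueError("x has zero likelihood under both distributions")
    return px_c2 * prior_c2 / evidence

# Hypothetical likelihoods for one observed BLISS_score x.
p_c2_given_x = posterior_background(px_c1=0.02, px_c2=0.0001)
is_significant = p_c2_given_x < 0.01
```

With these illustrative numbers p(C2|x) is about 0.005, so the score would pass a threshold of 0.01 but not one of 0.001. If p(C1) is in fact smaller than p(C2), as the text suspects, passing a smaller `prior_c1` raises p(C2|x) and makes the test more conservative.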

Identifying a Conserved Regulatory Module in Distantly Related Species

To test the efficacy of the BLISS methodology in real sequences, we undertook the task of identifying the even-skipped (eve) stripe 2 enhancer (S2E) in distantly related Drosophila species. Even-skipped, an important developmental regulatory gene in Drosophila melanogaster (D.mela), is specifically expressed in seven transverse stripes in the embryo during the blastoderm stage. The stripe 2 enhancer is the best studied and includes TFBSs for five TFs: bicoid (Bcd), hunchback (Hb), giant (gt), Kruppel (Kr), and sloppy-paired (slp) (Small et al. 1992; Arnosti et al. 1996; Andrioli et al. 2002). Unfortunately, TRANSFAC 9.1 has Position Weight Matrices for only three of the five TFs, i.e., Hb, Kr, and Bcd. Our search was therefore limited in the sense that some of the participating TFBSs could not be predicted and used for the match comparison.

Previous experimental investigations have shown that S2E is evolutionarily conserved among D. yakuba (D.yaku), D. erecta, and D. pseudoobscura (D.pseu) (Ludwig et al. 1998; Ludwig et al. 2002; Ludwig et al. 2005). All of these species are in the same subgenus (Sophophora) as D.mela, with D.pseu having the most distant relationship with D.mela (separated about 40 million years ago). BLISS did identify the eve S2E modules among these four species. In particular, a significant peak was reported by BLISS when we searched the S2E module extracted from D.pseu against the 14 kb D.mela genomic sequence flanking the eve coding region.

Figure 3-4. Identifying the eve S2E module in distantly related species. a.) Using the D.pseu S2E module (1027 bp), a peak (red circle) in BLISS_score was identified in a 14 kb D.moja genomic sequence surrounding the eve coding region. b.) No significant match was identified in the reverse strand (bottom panel) or an unrelated sequence (data not shown).

In contrast, no detailed information has been published about potential S2E in more distantly related species, such as D. mojavensis (D.moja) or D. virilis (D.viri), both from a separate subgenus (Drosophila) that separated from D.mela about 60 million years ago (Powell 1997). To identify S2E in these two distantly related species, we extracted the 14 kb genomic sequence flanking the eve coding region from the D.moja and D.viri genomic sequences. BLAST analysis using the D.mela or D.pseu sequence harboring the eve S2E module did not identify any significant match longer than 41 bp (using bl2seq with default gap penalty values). Using the BLISS methodology, however, a significant peak in the BLISS_score was observed (Figure 3-4a, p(C2|x)<0.05). Verification of this match indicated that it contained the TFBSs composing the S2E module. Similar results were obtained when corresponding sequence pairs involving (1) D.mela and D.moja, and (2) D.mela and D.viri, were analyzed. In contrast, no significant match was identified in the reverse-complemented sequence (Figure 3-4b) or in other 14 kb sequences unrelated to eve, indicating the specificity of the search.

A detailed inspection of the makeup of the S2E modules in distantly related species showed that S2E can be viewed as a complex module made of element modules. To make an analogy, S2E is a complex sentence that has several clauses (Figure 3-5). The evolution of the whole module indicates that the distances between some TFBSs have changed dramatically. However, the distances among the TFBSs within corresponding clauses have remained relatively stable. For example, in Clause 1 the distances among participating TFBSs have remained constant over the long evolutionary period. Specifically, the distance between the first bcd (overlapping with the first kr) and the second bcd is invariably 20 in all of the four species. In addition, the distances
among TFBSs in Clause 3 have also remained relatively stable, i.e., within the variation we have factored into our simulation.

Figure 3-5. Inter-TFBS distances are very well conserved within each clause of the S2E module. Comparison of the evolution of S2E modules across distantly related species revealed that while the sequence length of the module has changed significantly, the distances among TFBSs in Clauses 1 and 3 have remained stable. The numbers near the TFBSs indicate the positions relative to the first Kr site in this module.

Since our methodology is based on the assumption of limited distance variations between TFBSs, it should be much more sensitive at identifying individual Clauses or simple modules. When the corresponding TFBS profile covering Clause 1 or Clause 3 was used to search against the genome sequence from D.moja, very significant peaks in BLISS_score were observed (Figure 3-6a & b, p(C2|x)<0.001 for both). The peaks corresponded to the match of Clause 1 and Clause 3 on the D.moja
sequence, respectively. BLAST analysis using the sequences covering Clause 1 or Clause 3 searched against the D.moja genomic sequence failed to identify significant matches that spanned the whole module. rVista 2.0 did predict Clause 1 because it succeeded in detecting the DNA similarity between the sequence covering Clause 1 and the D.moja sequence. However, rVista 2.0 failed to identify Clause 3 since no similarity was detected between the sequence covering Clause 3 and the D.moja genomic sequence.

Implementation of BLISS as a Web-based Service

The BLISS methodology has been implemented as a web-based tool for the research community. The web application embodies the Gaussian Smoothing Method for identifying cis-regulatory modules at the binding site level, and outputs all potential TFBSs in the predicted module. The module finding process consists of several steps. To begin, the user inputs two DNA sequences, for example, a short sequence from a model organism that harbors a regulatory module, and a longer sequence surrounding the ortholog in a different species. An M_score threshold of 0.75 or 0.8 is then chosen by the user for the generation of the TFBS profiles for both sequences. Next, a plot of BLISS_scores comparing successive alignments of the short profile against the long profile is returned to the user. On the same page, the distributions described earlier (Figure 3-3b & c) are displayed so that the user may choose a BLISS_score threshold. Once the BLISS_score threshold is chosen, BLISS outputs all of the matches with a BLISS_score higher than that threshold (up to a limit of 5). For each match, a table of contributing TFBSs is listed, ordered by the product of the p values of the matching TFBSs on both sequences (Figure 3-7). Alternatively, the TFBSs can be listed by their numeric contribution to the BLISS_score, or by their location.
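
The reporting step, in which matches above the chosen BLISS_score threshold are returned up to a limit of 5, could look like the sketch below. How the web service actually defines a match is not stated in this section; here a match is assumed to be a local peak in the score profile, and the function name and example scores are hypothetical.

```python
def top_matches(scores, threshold, limit=5):
    """Return (offset, score) for local peaks above threshold, best first.

    Assumption: a "match" is a local maximum of the BLISS_score profile.
    """
    peaks = []
    for i, s in enumerate(scores):
        left = scores[i - 1] if i > 0 else float("-inf")
        right = scores[i + 1] if i + 1 < len(scores) else float("-inf")
        if s > threshold and s >= left and s > right:
            peaks.append((i, s))
    # Highest-scoring peaks first, truncated to the reporting limit.
    peaks.sort(key=lambda item: item[1], reverse=True)
    return peaks[:limit]

# Made-up BLISS_score profile for illustration.
example_scores = [0.1, 0.5, 0.2, 0.9, 0.3, 0.05, 0.4, 0.1]
reported = top_matches(example_scores, threshold=0.3)
```

For this toy profile the call reports the peaks at offsets 3, 1, and 6, in that order; the value 0.3 at offset 4 is excluded because it does not exceed the threshold.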

Figure 3-6. Searching for conserved individual clauses / element modules using BLISS. Profiles covering clauses 1 (a) or 3 (b) of S2E were used to search against a 14 kb D.moja genomic region. The BLISS_score peaks are significant (P(C2|x) = 0.0 for a, = 0.0003 for b).

Currently, the limits for the short and long input sequences are set at 1200 bp and 15 kb, respectively.

Figure 3-7. BLISS output of the contributing TFBSs. Our S2E search was used as the example. The list of TFBSs can be output based on sequence position, product of p values (Figure 3-7), or contribution to BLISS_score. The TFBSs that belong to S2E are highlighted in green.

Discussion

In this study, we have presented a first step towards identifying regulatory modules via comparisons at the binding site level. The advantage of such an approach is that it allows the detection of conserved regulatory modules in highly divergent sequences, as we have demonstrated both with simulated sequences and with real-world examples. This method is thus complementary to many existing methods that are based on sequence similarity comparisons (Altschul et al. 1990) or use sequence similarity for pre-analysis selection (Sandelin et al. 2004; Loots and Ovcharenko 2004; Aerts et al.

2005; Sharan et al. 2004). It should also be complementary to applications such as MEME and CompareProspector, which are widely used for the identification of conserved sequence motifs (binding sites) in the regulatory region of co-expressed genes (Liu et al. 2004; Liu et al. 2004).

There are limitations to our approach. Some of the major limitations, such as the coverage and quality of the TFBS matrices, are expected to improve rapidly in the near future as new high throughput techniques are applied to identify binding sites at genome scale. Our current algorithm was developed on the assumption that the inter-TFBS distance variation is within a +/-4 base pair range. This allows the identification of modules/clauses with relatively small inter-TFBS distance variation, such as the individual clauses in the S2E module. It will likely miss modules/clauses that have much larger distance variations between TFBSs. In the case of S2E, the identification of the module was based on the fact that the third clause had low inter-TFBS distance variations, which was sufficient to generate a significant BLISS_score (Figure 3-4a). As indicated by Ludwig et al., if S2E as a whole were to be considered, many inter-TFBS distances have changed dramatically during evolution (Ludwig et al. 2005). However, a closer look at the distribution of TFBSs in S2E in the two distantly related species also indicated that the S2E module may be sub-divided into clauses (Figure 3-5). While the inter-clause distances have varied dramatically, the inter-TFBS distances within each clause have remained largely stable (Figure 3-5). This very possibly reflects the spacing restriction on important transcription factor interactions.

In addition to the S2E module, we also tested our methodology on other regulatory modules such as the DME (Distal Muscle Enhancer) module in front of the paramyosin


gene (Marco-Ferreres et al. 2005). Using a 200 bp sequence harboring the DME in D.mela, we were able to detect the corresponding module in D.viri (data not shown). Currently, the number of well characterized, evolutionarily conserved modules is limited. The goal of BLISS is to facilitate the discovery of multi-TFBS modules by identifying the conserved pattern of the TFBSs. We also applied BLISS to a regulatory region that is responsible for mediating UV induced expression of hac-1 (Zhou and Steller 2003). There is no existing information on the composition of the UV-responsive module in this region, which has very low sequence level conservation between the corresponding segments in D.mela and D.pseu. Yet genetic experiments have indicated that the responsiveness is highly conserved. The potential module identified by BLISS is currently being tested experimentally.

BLISS, with some adaptation, can potentially be used to identify the conserved regulatory modules in co-expressed genes. Another advantage of BLISS is that the methodology can also be applied to identify patterns that involve not only TFBSs, but also other sequence features such as complex response elements (Ringrose et al. 2003), insulator sequences, CpG islands, etc. Functionally, these sequence features (and their related modifications and binding partners) interact with transcription factors. However, features such as CpG islands cannot be detected by simple sequence similarity based searches.

Conclusions

In this study, we addressed the feasibility of identifying conserved regulatory modules at the binding site level. Our results indicate that it is feasible to identify conserved regulatory modules in simulated random sequences harboring a regulatory module made of 4-8 distinct binding sites. Using real sequences, we demonstrated that


our approach outperforms regular sequence level comparisons when the orthologous DNA sequences are highly diverged. In addition, the BLISS program directly outputs the candidate binding sites that are shared between the two regulatory sequences, which can greatly facilitate the evaluation of the candidate module as well as the design of the experimental verification strategy by biomedical scientists. Future development of the project will include identifying better algorithms for complex modules and modules with higher inter-TFBS distance variations.

Methods

Generating Simulated Sequences

10000 simulated sequence pairs were generated for developing the methodology. Each pair included a short DNA sequence (100-500 bps) harboring a hypothetical TFBS module and a long DNA sequence (5-6 kb) harboring a conserved TFBS module. First, the hypothetical TFBS module was generated in the following manner: 4-8 binding sites were randomly chosen from the TRANSFAC 9.1 database (Matys et al. 2003) and then random DNA subsequences were inserted between them. Qiu et al. (Qiu et al. 2002) analyzed all the entries of composite elements in the TransCompel database (version 3.0) and found that about 87% of the composite elements are within 50 bp and about 65% are within 20 bp of one another. We therefore chose the lengths of the DNA subsequences inserted between binding sites based on this result. Next, we created the conserved TFBS module, which included binding sites for the same set of transcription factors in the same order as in the shorter sequence. However, the binding site sequences had to be different and were randomly chosen from TRANSFAC 9.1. Furthermore, compared to the hypothetical module, distances between binding sites were set to vary slightly: we allowed each binding site to shift up to 4 bps either to the right


or to the left. Finally, we inserted the conserved module into a 5000 bp randomly generated DNA sequence to generate the longer sequence.

Binding Site Identification

Potential TFBSs both in the short DNA sequence (including the hypothetical module) and the long DNA sequence (including the conserved module) were searched based on the frequency matrices collected in TRANSFAC 9.1. Because TFBSs may be detectable on either the forward or the backward strand, we searched both strands of the sequences. The M_score profile for each sequence is an M*L matrix, where M is twice the number of matrices applied and L is the length of the sequence. The top half of the M_score matrix is the score profile for the forward strand and the bottom half is that for the complementary strand. The M_score of the ith TFBS at position j of the sequence was calculated by first aligning the frequency matrix for the ith TFBS with the sequence at position j and then computing:

f.1. M_score:

$$\mathrm{M\_score}[i,j] = \frac{\mathrm{Score}[i,j] - \mathrm{Score}_{Min}[i]}{\mathrm{Score}_{Max}[i] - \mathrm{Score}_{Min}[i]}$$

$$\mathrm{Score}[i,j] = \sum_{k=0}^{K-1} I(k)\, f_{k,\,n_{j+k}}$$

$$\mathrm{Score}_{Min}[i] = \sum_{k=0}^{K-1} I(k)\, f_{k}^{min}$$

$$\mathrm{Score}_{Max}[i] = \sum_{k=0}^{K-1} I(k)\, f_{k}^{max}$$

$$I(k) = \sum_{N \in \{A,T,C,G\}} f_{k,N} \ln(4 f_{k,N})$$


$K$ is the length of the TFBS. $n_{j+k} \in \{A,T,C,G\}$ is the nucleotide occurring in the sequence at position $j+k$. $f_{k,\,n_{j+k}}$ is the frequency of nucleotide $n_{j+k}$ at position $k$ in the frequency matrix (of the ith TF). $f_{k}^{min}$ is the lowest frequency and $f_{k}^{max}$ is the highest frequency across all nucleotides at position $k$ in the frequency matrix (of the ith TF). $I(k)$ is the information vector for the frequency matrix, which reflects the degree of conservation at position $k$ of the matrix. Finally, M_score[i,j] is the normalized Score[i,j]. Stormo (Stormo 1998; Stormo 2002; Stormo and Fields 2002) observed that logarithms of the base frequencies ought to be proportional to the binding energy; the information vector reflects this average binding energy between the transcription factor and the binding site. A score cutoff of 0.75 was applied to the M_score profiles of both the short and the long sequence as follows:

$$\mathrm{M\_score}[i,j] = \begin{cases} \mathrm{M\_score}[i,j] & \text{if } \mathrm{M\_score}[i,j] \ge 0.75 \\ 0 & \text{if } \mathrm{M\_score}[i,j] < 0.75 \end{cases}$$

P Value for M_score. To calculate the p value, a background model is required. Here, we chose the background model to be a random DNA sequence where each position is drawn independently. The ratio among A, T, C, and G is 30% : 30% : 20% : 20%. For each frequency matrix, 300 million subsequences were sampled from the background model, and the M_score of each subsequence was calculated to build the M_score distribution. The p value of an M_score for each TFBS was estimated by calculating the fraction of samples that had scores equal to or higher than that M_score. Then, the P_score profiles were calculated as follows:


f.2 P_score:

$$\mathrm{P\_score}[i,j] = C_i \times \mathrm{M\_score}[i,j]$$

$C_i$ is the negative natural logarithm of the p value of M_score >= 0.75 for the ith TFBS.

Gaussian Smoothing

To account for the change in the distances between/among binding sites, a Gaussian smoothing was applied to the P_score profiles with a variance of 9. Formally, the G_score profiles were calculated as follows:

f.3 G_score:

$$\mathrm{G\_score}[i,j] = \sum_{k=-7}^{7} \mathrm{P\_score}[i,j+k]\; \frac{\frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-k^{2}/2\sigma^{2}}}{\sum_{m=-7}^{7} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-m^{2}/2\sigma^{2}}}$$

where $\sigma = 3$ and $k$ ranges from -7 to 7. In effect, a P_score can spread 7 positions to both the right and the left due to the Gaussian smoothing. Smoothed P_scores beyond 7 positions were ignored due to their small values.

Searching the Conserved Module in the Long Sequence

To identify a maximum match at the binding site profile level, the short G_score profile was slid along the long G_score profile. BLISS_score at position n is the matching score between the short profile and its corresponding region of equal length (length of the short sequence) in the long profile at position n:


f.4 BLISS_score:

$$\mathrm{BLISS\_score}[n] = \frac{\sum_{i=0}^{M-1} \sum_{j=0}^{L-1} \mathrm{G\_score1}[i,j] \times \mathrm{G\_score2}[i,j+n]}{\mathrm{LengthOfShortSequence}}$$

where G_score1 is the G_score profile for the short sequence and G_score2 is the G_score profile for the long sequence; L is the length of the short sequence; n is the current location where the short sequence is aligned to the long sequence.

Large-scale Search of the Simulated Sequences, Statistical Analysis

We used 10000 simulated sequence pairs generated by the above method to calculate two BLISS_score distributions. The first is the BLISS_score distribution when the hypothetical module in the short sequence is aligned with the conserved module in the long sequence. The second is the BLISS_score distribution when the module is aligned with a non-module segment of the longer sequence. For each pair of sequences, BLISS_scores were calculated at each position as the short profile slid along the longer profile. The peak matches (corresponding to the peaks in the score profile) between each pair of sequences were evaluated to see whether they aligned the embedded modules. If a match did include the alignment of the modules, it was designated a true match, and its BLISS_score was used to calculate the distribution for module-to-module matching. All of the other BLISS_scores were used to calculate the distribution for the module matching the background sequence.

Searching for the eve2 Module in D. virilis and D. mojavensis Sequences

The GenBank (http://www.ncbi.nlm.nih.gov/) accession numbers for the S2E sequences are AF042712 (D. pseudoobscura) and AF042709 (D. melanogaster). We


used BLISS to search for these two enhancers in 13kb D. virilis and 14kb D. mojavensis sequences, in which S2E is hypothesized to be located, but the specific location is unknown.

Ludwig et al. indicated that distances between TFBSs in two clauses in S2E (region 134-275 and region 484-684 for D. melanogaster; region 196-376 and region 692-866 for D. pseudoobscura) were substantially conserved. We removed those regions and searched for the modules in 13kb D. virilis and 14kb D. mojavensis sequences using BLISS.

Website Construction

BLISS was implemented using HTML/JSP/JavaBean and is supported by an Apache Tomcat 5.5 server. It is publicly available at: http://gene1.ufscc.ufl.edu:8080/blissWeb/index.html

The M_score profiles of TFBSs were calculated based on the frequency matrix library collected in TRANSFAC 9.1. BLISS used DISLIN (http://www.mps.mpg.de/dislin/), a plotting library for displaying data, to draw the match score plot at run time.
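The scoring steps given in the Methods above (f.1 to f.4) can be illustrated with a short sketch. It is not the BLISS implementation: the function names are hypothetical, a frequency matrix is assumed to be a list of per-position dictionaries of base fractions, only the forward strand is scored, `c_i` is supplied by the caller rather than estimated from a background sample, and the Gaussian weights are assumed to be normalized so that they sum to one over the 15 offsets.

```python
import math


def information_vector(matrix):
    """I(k) = sum over N of f[k][N] * ln(4 * f[k][N]); terms with f = 0 count as 0."""
    return [sum(f * math.log(4 * f) for f in column.values() if f > 0)
            for column in matrix]


def m_score(matrix, seq, j):
    """f.1: normalized score of a frequency matrix aligned at position j of seq."""
    info = information_vector(matrix)
    score = sum(i_k * col[seq[j + k]]
                for k, (i_k, col) in enumerate(zip(info, matrix)))
    low = sum(i_k * min(col.values()) for i_k, col in zip(info, matrix))
    high = sum(i_k * max(col.values()) for i_k, col in zip(info, matrix))
    # A matrix with no information at any position would give high == low.
    return (score - low) / (high - low)


def p_score_profile(matrix, seq, c_i, cutoff=0.75):
    """f.2 with the cutoff applied: C_i * M_score where M_score >= cutoff, else 0."""
    profile = [0.0] * len(seq)
    for j in range(len(seq) - len(matrix) + 1):
        m = m_score(matrix, seq, j)
        if m >= cutoff:
            profile[j] = c_i * m
    return profile


def g_score_profile(p_profile, sigma=3.0, reach=7):
    """f.3: Gaussian smoothing over offsets -reach..reach.

    The constant 1/(sqrt(2*pi)*sigma) cancels once the weights are normalized.
    """
    offsets = range(-reach, reach + 1)
    weights = [math.exp(-(k * k) / (2 * sigma * sigma)) for k in offsets]
    total = sum(weights)
    smoothed = []
    for j in range(len(p_profile)):
        acc = 0.0
        for k, w in zip(offsets, weights):
            if 0 <= j + k < len(p_profile):
                acc += p_profile[j + k] * w
        smoothed.append(acc / total)
    return smoothed


def bliss_scores(short_rows, long_rows):
    """f.4: slide the short G_score profile (one row per TFBS) along the long one."""
    length = len(short_rows[0])
    positions = len(long_rows[0]) - length + 1
    return [
        sum(s[j] * l[j + n]
            for s, l in zip(short_rows, long_rows)
            for j in range(length)) / length
        for n in range(positions)
    ]
```

Because the smoothing spreads each P_score over neighbouring positions, two sites whose spacing differs by a few base pairs between the sequences still overlap in the product of f.4, which is what lets the sliding comparison tolerate small inter-TFBS shifts.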


CHAPTER 4
BINDING-SITE LEVEL IDENTIFICATION OF SHARED REGULATORY MODULES IN TWO ORTHOLOGOUS SEQUENCES

Background

Determining the binding between Transcription Factors (TFs) and Transcription Factor Binding Sites (TFBSs) is critical to understanding the mystery of gene regulation. In higher organisms, a particular aspect of gene expression is rarely controlled by a single TF, but rather by a combination of TFBSs called a regulatory module (Maniatis et al. 1987; Johnson and McKnight 1989; Shaw 1992; Wang et al. 1993; Tjian and Maniatis 1994). Genes are typically regulated by various regulatory modules in response to different conditions (Yuh et al. 1998). Identification of such regulatory modules is notoriously difficult due to the inherent biological complexity. The traditional way is through biological experimentation (Galas and Schmitz 1978; Fried and Crothers 1981; Garner and Revzin 1981), which is in general costly and time-consuming. With fully sequenced genomes, computational approaches have emerged as powerful tools to supplement full-scale laboratory experiments by narrowing down candidate regulatory modules, thereby dramatically reducing the effort needed to determine the regulatory elements (Qiu 2003).

A number of software tools have been developed to search for regulatory modules by cross-genome comparison (Sharan et al. 2003; Sandelin et al. 2004; Loots and Ovcharenko 2004; Sinha et al. 2004; Aerts et al. 2005; Schwartz et al. 2003; Cora et al. 2004; Venkatesh and Yap 2005). The theoretical assumption underlying these tools is


that DNA sequences harboring TFBSs are conserved during evolution due to selective pressure. Thus conserved regulatory modules should be found in the regions with high DNA similarity. However, most of these tools are based directly on the measurement of similarity at the DNA sequence level. It is known that TFBS sequences are degenerate in nature and that the DNA sequences between TFBSs are not under selective pressure and therefore mutate rapidly. It is therefore highly probable that the DNA similarity is not detectable even though the actual regulatory modules are conserved (Ludwig et al. 2005). Since the conservation pressure is at the binding site level, in our previous research we developed a novel methodology named BLISS to perform the cross-genome comparison on binding site profiles (Meng et al. 2006). This was the first time that the feasibility of comparison at the binding site level had been validated. As a complementary tool in the field, BLISS demonstrates the ability to detect conserved regulatory modules from diverged orthologous sequences where DNA similarity is undetectable.

It has been observed that the distances among a certain cluster of TFBSs in the regulatory module are highly conserved during evolution, which may reflect the spacing restriction essential for the protein-protein interactions among TFs required for combinatorial transcription regulation (Ludwig et al. 2005). Due to biological complexity, previous work in this direction has been limited to identifying composite elements, that is, pairs of transcription factors whose binding sites tend to co-occur. TransCompel (Kel-Margoulis et al. 2002) is a database that collects such composite elements. In this study, we extend the concept of the composite element to the regulatory complex. A regulatory complex is a cluster of TFBSs, where the type (identity), number, and order of TFBSs are highly conserved during evolution and the


variation of inter-TFBS distances is within a certain range (less than 10 bps). A regulatory complex is not necessarily conserved at the sequence level during evolution. It could, by itself, be a simple regulatory module or be a part of a complex regulatory module. In previous research (Meng et al. 2006), BLISS showed that it was more advantageous to find regulatory complexes rather than the whole regulatory module, especially when the regulatory complex is not the only regulatory element in the module. We therefore focus on the detection of conserved regulatory complexes in this study; regulatory modules can then be predicted based on the identification of the conserved regulatory complexes within them.

The first release of BLISS (BLISS1) was limited in application to a short sequence and a long sequence, and it assumed that the short sequence harbored only one regulatory module/complex (Meng et al. 2006). However, in most cases, biologists have no prior knowledge about the locations of the conserved regulatory modules in the sequences to be analyzed. In the second release of BLISS (BLISS2), we extended BLISS1 to a general tool that does not depend on the assumption made in BLISS1 about the analyzed sequences. In this study, we show that BLISS2 can predict conserved regulatory modules by detecting all potential shared regulatory complexes.

Results

Simulating Sequence Pairs with Varying Complexity of Conserved Binding Site Patterns

Simulated sequences were generated to test the performance of our algorithm since very few well-investigated regulatory modules are currently available. Two testing data sets were generated. The backbones of the sequences in the first data set were 1000 bp


(base pair) random DNA sequences and the backbones of the sequences in the second data set were 5000 bps.

Table 4-1. Simulated data sets

            Sequence pairs   Length of        Inserted complexes in each sequence pair
            in each group    backbone (bp)    Group1   Group2   Group3   Group4
Data set 1  500              1000             0        1        2        3
Data set 2  500              5000             0        1        2        3

As shown in Table 4-1, each testing data set consisted of four groups of sequence pairs and each group had 500 sequence pairs. No regulatory complex was inserted into the first group of sequence pairs. In the second group, each sequence pair had one conserved regulatory complex. Two and three conserved regulatory complexes were inserted into each sequence pair in the third and fourth groups, respectively.

Each regulatory complex was composed of 4-8 binding sites, which were extracted from the TRANSFAC 9.1 database (Matys et al. 2003). Qiu et al. (Qiu et al. 2002) performed a statistical analysis of the distances between TFBSs in the composite elements that had been experimentally confirmed and collected in the TransCompel database (version 3.0). Their analysis demonstrated that 87% of the distances between TFBSs in composite elements are within 50 bps, and 65% are within 20 bps. The length of the random DNA segments inserted between binding sites was based on this statistical analysis when we formulated the regulatory complexes (see Methods for details).

Identifying Conserved Regulatory Complexes by Comparison at the Binding-site Level

In our previous research (Meng et al. 2006), BLISS1 demonstrated the feasibility of identifying conserved regulatory complexes in diverged DNA sequences at the binding-site level. However, it was limited to predicting conserved regulatory


complexes from a short sequence and a long sequence, and it required that the regulatory complex existed in the short sequence. During the comparison, the short sequence was slid along the long sequence to find the segment in the long sequence that had a significant matching score with the short sequence at the binding site level. In this study, however, our focus is to extend BLISS1 to identify potentially conserved regulatory segments from two orthologous sequences without prior knowledge about the locations of regulatory complexes in the sequences.

Like BLISS1, BLISS2 calculated a BLISS_score to represent the degree of similarity at the binding site level for two sequence segments. The first three steps of BLISS2 were to calculate M_scores, P_scores and G_scores, respectively, for the two genomic sequences. M_score is the binding site profile calculated using a matrix scoring method based on the binding site frequency matrices collected in the TRANSFAC 9.1 database. To differentiate simple matrices that match DNA sequences with a high frequency from those matrices that match DNA sequences rarely, P_score was generated as the product of the negative logarithm of the p value of M_score >= cutoff and the M_score. To account for the variation of distances between/among binding sites in the conserved regulatory complexes, Gaussian smoothing was applied to P_score to get G_score. The first three steps in BLISS2 were essentially identical to those in BLISS1. The difference is that BLISS2 deals with two long sequences, and the locations and lengths of the regulatory complexes in the sequences are unknown. The task of BLISS2 is to seek the matched segments from two sequences whose binding site profiles are similar (not necessarily identical) and whose BLISS_scores, the matching scores at the binding site level, are statistically significant. The statistical evaluation of BLISS_score in our previous study allowed us to evaluate a


particular BLISS_score and set a threshold for reporting significant matches. In this study, BLISS2 took a heuristic approach. The general strategy of BLISS2 was to seek seeds of a certain window size in the G_score profiles whose BLISS_scores are greater than a certain BLISS_score cutoff. All the seeds were then extended, and the boundaries of the shared regulatory regions were set where the BLISS_score dropped below the BLISS_score cutoff (see Methods).

As a baseline, BLAST (Blast2seq, E = 0.01) was first applied to the sequence pairs in the first data set, in which the backbone of each sequence was 1000 bp long and there were a total of 3000 conserved regulatory complexes. Of the 2000 simulated sequence pairs, only 32.4% were reported to have detectable similarity. Even in those positive cases, the average length of the output alignments was 16 bps and the maximum length was 32 bps. These output alignments were clearly far shorter than the inserted regulatory complexes, and so were not sufficient to allow the detection of the regulatory complexes. This analysis indicated that traditional tools based on sequence level comparison are not enough to predict the conservation of regulatory complexes in diverged DNA sequences.

Table 4-2. Performance test of BLISS2 for M_score cutoff=0.75. False Positives (FP) and True Positives (TP) are listed under different window sizes and BLISS_score cutoffs.

Window         BLISS_score cutoff
size    Type   2.9    2.8    2.7    2.6    2.5    2.4    2.3    2.2    2.1
100     FP     1795   2565   3748   5272   7180   9262   -      -      -
100     TP     904    1045   1174   1307   1434   1537   -      -      -
150     FP     143    221    401    741    1362   2402   3940   5781   -
150     TP     732    897    1079   1269   1445   1616   1775   1900   -
200     FP     14     18     44     80     205    473    1106   2195   3836
200     TP     382    526    682    879    1083   1345   1558   1764   1899
250     FP     1      1      6      14     34     88     261    727    1707
250     TP     130    213    343    504    695    918    1189   1468   1663


Table 4-3. Performance test of BLISS2 for M_score cutoff=0.8. False Positives (FP) and True Positives (TP) are listed under different window sizes and BLISS_score cutoffs.

Window         BLISS_score cutoff
size    Type   1.9    1.8    1.7    1.6    1.5    1.4    1.3    1.2
100     FP     1907   2905   4435   6480   8991   -      -      -
100     TP     1168   1328   1488   1633   1759   -      -      -
150     FP     193    330    629    1193   2366   4306   6625   -
150     TP     1021   1229   1432   1664   1892   2049   2154   -
200     FP     26     45     81     195    523    1351   2975   5195
200     TP     617    818    1069   1346   1631   1880   2054   2137
250     FP     2      4      20     42     130    373    1213   2982
250     TP     298    464    662    932    1227   1539   1822   2006

In comparison, the testing results of BLISS2 under different 1) window sizes and 2) BLISS_score cutoffs are listed in Tables 4-2 and 4-3. True positives are the regulatory complexes that were inserted in the simulated sequence pair and identified by BLISS2, while false positives are the regulatory complexes that were reported by BLISS2 but did not really exist.

Both Table 4-2 and Table 4-3 show a trade-off between true positives and false positives under different BLISS_score cutoffs. While the number of true positives improved when we chose a lower BLISS_score cutoff, the false positives increased as well; on the other hand, when a higher BLISS_score cutoff was applied, false positives decreased while true positives were compromised.

Based on the testing results listed in Tables 4-2 and 4-3, a window size of 200 was selected in the implementation of BLISS2 since, in relation to the other window sizes, it had a better balance between false positives and true positives for both M_score cutoffs, 0.75 and 0.8.

In the first testing data set, the length of each sequence was 1000 bps. The performance test of BLISS2 then proceeded under window size 200 using the second testing data set, where the length of the backbone of each sequence was 5000 bps.
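The trade-off can be made concrete with a small worked calculation on the window size 200 rows of Table 4-2. Sensitivity and precision are derived quantities introduced here for illustration only and are not reported in the study; the total of 3000 inserted complexes in the first data set is taken from the text.

```python
# Window size 200, M_score cutoff 0.75 (values copied from Table 4-2).
CUTOFFS = [2.9, 2.8, 2.7, 2.6, 2.5, 2.4, 2.3, 2.2, 2.1]
FP = [14, 18, 44, 80, 205, 473, 1106, 2195, 3836]
TP = [382, 526, 682, 879, 1083, 1345, 1558, 1764, 1899]
TOTAL_INSERTED = 3000  # complexes inserted into the first data set


def tradeoff(tp, fp, total):
    """Return (sensitivity, precision) for one cutoff."""
    sensitivity = tp / total    # fraction of inserted complexes recovered
    precision = tp / (tp + fp)  # fraction of reported complexes that are real
    return sensitivity, precision


rows = [(cutoff, *tradeoff(tp, fp, TOTAL_INSERTED))
        for cutoff, tp, fp in zip(CUTOFFS, TP, FP)]

for cutoff, sensitivity, precision in rows:
    print(f"{cutoff:.1f}  sensitivity={sensitivity:.3f}  precision={precision:.3f}")
```

On these numbers, lowering the cutoff from 2.9 to 2.1 raises sensitivity from about 0.13 to about 0.63, while precision falls from about 0.96 to about 0.33.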


[Figure 4-1 plot: number of predicted complexes (0 to 1200) against BLISS_score cutoff (2.5 to 3.1); series FP-1, FP-2, TP-1, TP-2.]

Figure 4-1. The effects of the length of the analyzed sequences on the performance of BLISS2. Window size 200 and M_score cutoff 0.75 were applied in this comparison test. FP-1 represents the false positives for the first data set; FP-2 represents the false positives for the second data set; TP-1 represents the true positives for the first data set; TP-2 represents the true positives for the second data set.

The results displayed in Figure 4-1 show that the true positives were almost the same for both testing data sets, although the true positives for the second data set were slightly fewer than those for the first data set. The major difference lies in the false positives. When a higher BLISS_score cutoff was applied, false positives were not significantly affected by the lengths of the sequences. However, the false positives increased dramatically for the second testing data set when the BLISS_score cutoff was decreased. Thus, for practical purposes, a high BLISS_score cutoff is suggested to avoid a large number of false positives for longer sequences, although the sensitivity of BLISS2 will be sacrificed at the same time.

Testing BLISS2 in Real-world Examples

The even-skipped gene (eve) is expressed in seven stripes in the embryo of Drosophila melanogaster (D.mela) during the blastoderm stage. The enhancer for the second stripe


(S2E) is the best studied regulatory module and includes binding sites for five transcription factors: bicoid (Bcd), giant (gt), hunchback (Hb), Kruppel (Kr) and sloppy-paired (slp) (Small et al. 1992; Arnosti et al. 1996; Andrioli et al. 2002). D.mela, D. yakuba (D.yaku), D. erecta, and D. pseudoobscura (D.pseu) belong to the same subgenus (Sophophora), and experimental investigations on S2E suggest that S2E is conserved during evolution among these four closely related species. A further detailed inspection has revealed that S2E includes two regulatory complexes, where the binding sites and the distances among binding sites are highly conserved (Ludwig et al. 2005). The make-up of the S2Es in D.mela and D.pseu is displayed in Figure 4-2. However, no detailed information has been published about S2E in two species, D. mojavensis (D.moja) and D. virilis (D.viri), which are distantly related to the above four species and belong to a separate subgenus (Drosophila). We extracted 6k bp sequences before the transcription start region from both the D.moja and D.viri genomic sequences and applied BLISS2 to identify S2E in them.

Figure 4-2. Regulatory complexes in S2E of D.mela and D.pseu. The two complexes are highly conserved during evolution.


In the first experiment, BLISS demonstrated its ability to predict the locations of S2E in both D.viri and D.moja using the S2E sequences of D.mela and D.pseu. For instance, three regulatory complexes were reported (Figure 4-3) when BLISS was applied to the S2E sequence of D.pseu and the 6k bp sequence from D.viri (M_score cutoff=0.75 and BLISS_score cutoff=2.9). Inspection of the matched TFBSs listed by BLISS2 showed that the second and third complexes were the two S2E regulatory complexes found in D.mela and D.pseu (Figure 4-4). Thus the position of S2E in D.viri could be located based on the positions of those two complexes. S2E in D.moja was identified in a similar way.

Figure 4-3. Identification of S2E in the 6 kb D.viri sequence using the S2E of D.pseu. Three complexes are predicted; the second and third are the two in S2E.


Figure 4-4. Regulatory complexes of S2E in D.viri and D.moja. a) TFBSs in complex 1. b) TFBSs in complex 2.

In previous research (Meng et al. 2006), BLISS1 was applied to the same sequence pair, and the S2E in D.viri could be detected due to the match of one complex in S2E. The other complex could not be matched at the same time because the distance between the two complexes differs between the two sequences. The BLISS_score was therefore inevitably lower because the BLISS_score in BLISS was the average over


the whole range of the S2E sequence. That investigation indicated that the BLISS_score would be more sensitive and significant when either complex sequence was used as the short sequence. In practice, we do not have this kind of prior knowledge in most cases. BLISS2 displayed its advantage over BLISS1 in that it was able to predict regulatory complexes by detecting local maximum regions with similar binding site profiles, without requiring prior knowledge about the complexes in the orthologous sequences.

BLISS2 further demonstrated its advantage over BLISS1 in identifying conserved regulatory complexes in two long genomic sequences, which could not be carried out by BLISS1 at all. For instance, when BLISS2 was applied to two 6k bp sequences before the transcription start region of the eve gene from D.pseu and D.viri (M_score cutoff=0.75 and BLISS_score cutoff=2.9), 15 potential regulatory complexes were reported. Inspection of the make-up of these complexes showed that the 9th and 14th complexes were the two regulatory complexes in S2E. In contrast, rVista 2.0 could only identify the first two binding sites in complex 1 of S2E in the region with detectable DNA similarity when the same binding site score cutoff (0.75) was applied. It is difficult for us to verify the other 13 complexes identified by BLISS2 due to the lack of detailed regulatory information for those regions. However, it is well known that those two 6k bp sequences are rich in regulatory information. At the very least, they should harbor enhancers for the expression of the eve gene in some of the 7 stripes in the embryo.

Website Implementation of BLISS2

BLISS2 was implemented as web-based software that can be freely accessed by the public. To start, BLISS2 requires the user to input two DNA sequences that


potentially harbor shared regulatory information. An M_score (binding site score) threshold can be chosen by the user on the same page. A color plot indicating conserved regulatory regions is then returned to the user, together with the statistical analysis of BLISS_score under that M_score threshold. The user then chooses a BLISS_score cutoff to output all conserved complexes with a maximum BLISS_score above the cutoff. For each complex, a table of contributing binding sites can be listed, ordered by the product of the p-values of the matching binding sites, by the location of the binding sites, or by their contributions to the BLISS_score. Currently, the length of each input sequence for BLISS2 is limited to 8000 bps.

Discussion

In our previous investigation (Meng et al. 2006), we showed that it is feasible to identify conserved regulatory modules in highly diverged orthologous genomic sequences at the binding site level. In this study, we extended our previous method to the case of two long sequences, in which regulatory modules can be detected by identifying conserved regulatory complexes. The performance of BLISS2 was tested using both simulated sequences and real world examples.

There are several limitations to BLISS2. First, BLISS2 is restricted to detecting conserved groups of binding sites (regulatory complexes) with small inter-TFBS distance variations. If the spacing between/among binding sites is not strictly required for the interactions between/among transcription factors and the variation of inter-TFBS distance is high, BLISS2 may miss those regulatory modules. The coverage and quality of the frequency matrices for TFBSs is another limitation since the analysis of BLISS2 is totally dependent on the accuracy of the binding site profiles. However, there is no doubt that


the coverage and quality of frequency matrices are being, and will continue to be, improved dramatically, and the performance of BLISS2 is expected to improve along with them.

Although further improvement is required, BLISS2 demonstrates the ability to detect biologically significant similarities in sequences where DNA sequence similarity is undetectable. The basic idea underlying BLISS2 is novel but straightforward, and it could therefore be implemented in a number of ways to improve performance for different applications.

Conclusions

In summary, using simulated sequences and real world examples, BLISS2 demonstrated that regulatory complexes, the clusters of binding sites with small variation of inter-TFBS distance during evolution, could be identified in highly diverged orthologous sequences by comparison at the binding site level. By detecting those conserved regulatory complexes and exploring matched TFBSs in them, BLISS2 facilitates the localization of regulatory regions and can therefore assist biomedical scientists in deciphering the mystery of gene regulation. Future directions of the project include developing new algorithms to improve the selectivity and sensitivity of the current algorithm, detecting regulatory complexes from multiple orthologous sequences, and identifying regulatory modules with higher inter-TFBS distance variations.

Methods

Generating Simulated Sequences

Two testing data sets were generated for the development of the methodology. The first data set had 2000 sequence pairs, which were divided into four groups with each group having 500 sequence pairs. The backbone for each sequence was


constructed as a 1000 bp random DNA sequence with the ratio among A, T, C and G being 30% : 30% : 20% : 20%, since non-transcribed genomic regions are generally AT rich. Each regulatory complex contained binding sites for 4-8 transcription factors, which were randomly extracted from the TRANSFAC 9.1 database. The range of distance (di) between consecutive TFBSs was based on the statistical analysis by Qiu et al. (Qiu et al. 2002) on the TransCompel database: 65% of the distances between consecutive binding sites were within 5-20 bps, 22% were within 21-50 bps and the remaining 13% were within 51-60 bps.

The conserved regulatory complex for the second sequence was formulated based on the following rules. First, it consisted of binding sites for the same transcription factors as the corresponding regulatory complex in the first sequence. The binding sites of two corresponding transcription factors were randomly extracted from the binding site instances collected in the TRANSFAC 9.1 database and therefore could have the same or different sequences. Second, the respective order of binding sites in the two corresponding regulatory complexes was kept the same. Finally, dj, the distance between consecutive binding sites in the conserved regulatory complex in the second sequence, was randomly chosen from the range (di - d, di + d), where d is the variation of distance between corresponding binding sites in the two regulatory complexes; we chose d = 4 in this study.

Each sequence in the sequence pairs in the first group was a 1000 bp random DNA sequence and had no introduced complex. For the second group, one regulatory complex was inserted at a random location into the first sequence of each sequence pair, and then the corresponding conserved regulatory complex was inserted at a random location into the second sequence. In a similar manner, two and three regulatory complexes were


inserted into each sequence pair for the third and the fourth groups, respectively. This composition of simulated sequences reflected the fact that the analyzed orthologous sequences may share 1) no regulatory complex, 2) one regulatory complex, or 3) more than one regulatory complex.

All of the above held true for the second testing data set, except that the length of the backbone of each sequence was 5000 bps.

Identifying Conserved Regulatory Complexes by Comparison at the Binding-site Level

M_scores, P_scores and G_scores. M_scores, P_scores and G_scores for both DNA sequences were calculated in exactly the same manner as described in chapter 3 (see the Methods part), based on frequency matrices collected in TRANSFAC 9.1. The G_score profile for each sequence is an M*L matrix, where M is twice the number of TFBSs in the frequency matrix library and L is the length of the sequence. The top half of the G_score matrix is the score profile for the forward strand and the bottom half is the score profile for the complementary strand.

Pre_BLISS_score. Calculation of the Pre_BLISS_score is an intermediate step for computing the BLISS_score from the G_score profiles. The Pre_BLISS_score was calculated as follows:

\[ \mathrm{Pre\_BLISS\_score}[i][j] = \sum_{m=0}^{M-1} \mathrm{G\_score1}[m][i] \cdot \mathrm{G\_score2}[m][j] \]

where G_score1 is the G_score profile for the first sequence and G_score2 is that for the second sequence. M is twice the number of TFBSs in the frequency matrix library and m is the index of the TFBS. Pre_BLISS_score is a two-dimensional L1*L2


matrix where L1 and L2 correspond to the lengths of the two DNA sequences respectively.

BLISS_score. The BLISS_score is the average of Pre_BLISS_scores over a window of size w:

\[ \mathrm{BLISS\_score}[i][j] = \sum_{k=-w/2}^{w/2} \mathrm{Pre\_BLISS\_score}[i+k][j+k] \, / \, w \]

Like the Pre_BLISS_score, the BLISS_score is a two-dimensional L1*L2 matrix where L1 and L2 are the lengths of the two DNA sequences.

Reporting Conserved Regulatory Complexes. Conserved regulatory complexes were reported according to the BLISS_score matrix. To start, the maximum point BLISS_score[x,y] was found in the matrix. If BLISS_score[x,y] was less than the BLISS_score cutoff, we reported that there was no conserved regulatory region for these two sequences; otherwise, we extended this maximum point along the diagonal in both directions until the BLISS_scores dropped below the BLISS_score cutoff. Suppose the two end points were BLISS_score[x1,y1] and BLISS_score[x2,y2]; we then reported one shared regulatory complex, which started from x1-w/2 and ended at x2+w/2 in the first sequence and covered y1-w/2 to y2+w/2 in the second sequence. The rest of the BLISS_score matrix was searched recursively in the same way to find other shared regulatory complexes.

Performance Testing Using Simulated Sequences

Using our simulated data sets, the performance of BLISS2 was tested under different BLISS_score cutoffs and window sizes. The numbers of true positives and false positives for each test under each condition were recorded.

Suppose a shared regulatory complex identified by BLISS starts from x1 and ends at x2 in the first sequence and covers range y1 to y2 in the second sequence, while the


truly inserted regulatory complex is from x1' to x2' in the first sequence and from y1' to y2' in the second sequence. We decided that it was a correct prediction if:

1. The identified complexes in both sequences covered more than 60% of the region of the truly inserted complexes.
2. Two lines were created in two-dimensional space, one starting from point (x1, y1) and ending at point (x2, y2), the other starting from point (x1', y1') and ending at point (x2', y2'), and the distance between these two lines was found to be less than 10.

A regulatory complex that was predicted by BLISS2 and truly existed in the simulated sequences was counted as a true positive, while a regulatory complex that was predicted by BLISS2 but did not exist in the simulated sequences was counted as a false positive.

Searching for the eve2 Regulatory Complex in D. virilis and D. mojavensis

The GenBank (http://www.ncbi.nlm.nih.gov/) accession numbers for the S2E sequences are AF042709 (D. melanogaster) and AF042712 (D. pseudoobscura). The two 6k bp sequences of D.viri and D.moja are the regions immediately upstream of the transcription initiation site of the eve gene in the genomic sequences of D.viri and D.moja. M_score cutoff 0.75 and BLISS_score cutoff 2.9 (the default M_score and BLISS_score cutoffs of BLISS2) were used in both experiments.

Website Construction

BLISS2 is hosted by an Apache Tomcat 5.5 server and is available at: http://gene1.ufscc.ufl.edu:8080/bliss2/index.html

The website was implemented with JavaBean/JSP/HTML, and the color display of the BLISS_score was plotted with tools provided by DISLIN (http://www.mps.mpg.de/dislin/).
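The scoring and reporting steps described in the Methods of this chapter can be illustrated with a minimal Python sketch. It is not the BLISS2 implementation: the function names are hypothetical, the Pre_BLISS_score is assumed to be the per-TFBS product of the two G_score profiles summed over m, positions outside the matrix are assumed to contribute zero to the window, and only the single highest-scoring complex is reported rather than the recursive search over the rest of the matrix.

```python
import numpy as np

def pre_bliss_score(g1, g2):
    """Pre_BLISS_score[i, j] = sum over m of G_score1[m, i] * G_score2[m, j].

    g1 has shape (M, L1) and g2 has shape (M, L2); the product form is an
    assumption made for this sketch.
    """
    return g1.T @ g2

def bliss_score(pre, w):
    """Sum Pre_BLISS_score along the diagonal for k = -w/2..w/2, divided by w."""
    l1, l2 = pre.shape
    half = w // 2
    out = np.zeros((l1, l2), dtype=float)
    for k in range(-half, half + 1):
        # Cells (i, j) for which (i + k, j + k) is still inside the matrix.
        i0, i1 = max(0, -k), min(l1, l1 - k)
        j0, j1 = max(0, -k), min(l2, l2 - k)
        out[i0:i1, j0:j1] += pre[i0 + k:i1 + k, j0 + k:j1 + k]
    return out / w

def top_complex(score, cutoff, w):
    """Report the complex around the global maximum, or None if below cutoff."""
    x, y = np.unravel_index(np.argmax(score), score.shape)
    if score[x, y] < cutoff:
        return None
    l1, l2 = score.shape
    x1, y1 = x, y
    # Extend up-left along the diagonal while the score stays at or above cutoff.
    while x1 > 0 and y1 > 0 and score[x1 - 1, y1 - 1] >= cutoff:
        x1, y1 = x1 - 1, y1 - 1
    x2, y2 = x, y
    # Extend down-right along the diagonal in the same way.
    while x2 + 1 < l1 and y2 + 1 < l2 and score[x2 + 1, y2 + 1] >= cutoff:
        x2, y2 = x2 + 1, y2 + 1
    half = w // 2
    first = (int(max(0, x1 - half)), int(min(l1 - 1, x2 + half)))
    second = (int(max(0, y1 - half)), int(min(l2 - 1, y2 + half)))
    return first, second

# Toy profiles: one TFBS type matching positions 5-9 and 10-14 respectively.
g1 = np.zeros((2, 20)); g1[0, 5:10] = 1.0
g2 = np.zeros((2, 20)); g2[0, 10:15] = 1.0
score = bliss_score(pre_bliss_score(g1, g2), w=4)
print(top_complex(score, cutoff=1.0, w=4))
```

In the toy example the matched block lies on a single diagonal, so the reported ranges bracket positions 5-9 of the first sequence and 10-14 of the second, widened by w/2 on each side.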


CHAPTER 5
BLISS 2.0: THE WEB-BASED TOOL FOR PREDICTING CONSERVED REGULATORY MODULES IN DISTANTLY-RELATED ORTHOLOGOUS SEQUENCES

Identifying functional Transcription Factor Binding Sites (TFBSs) in the regulatory regions of DNA sequences is essential to understanding gene regulation at the transcription level. In eukaryotes, distinct TFBSs are often grouped together into regulatory modules to control a particular aspect of gene expression. The composition of a particular regulatory module can be identified by experimental approaches, for example through enhancer region dissection, DNase hypersensitivity assays, or DNA footprinting. However, most of these approaches are time-consuming, laborious and costly. More importantly, many of these approaches, such as the DNase hypersensitivity assay, can only be applied to a short DNA sequence (a few hundred base pairs). Applying them is almost impossible if the relative location of the module cannot be narrowed to an experimentally testable region.

With the emergence and application of bioinformatics, computational approaches have been developed to predict regulatory modules and to help the design and verification of laboratory experiments. A number of such computational approaches (Aerts et al. 2005; Sandelin et al. 2004; Loots and Ovcharenko 2004; Sharan et al. 2003; Sinha et al. 2004) rely on cross-genome comparison. The assumption underlying these approaches is that DNA sequences encoding functional TFBSs are conserved during evolution due to selective pressure, whereas non-functional DNA sequences evolve (mutate) much faster. Therefore, it is likely that conserved regulatory modules can be identified in regions with high DNA similarity between two orthologs.


While this type of approach has proven very helpful in some cases, it also has limitations. The success of these approaches depends on identifying significant sequence level similarity between (among) the DNA regions that harbor the regulatory modules. However, TFBS sequences are degenerate in nature and are interspersed with non-functional sequences, which are not under conservation pressure. It is therefore entirely possible that, although the regulatory modules are conserved, the overall similarity of the sequences harboring the regulatory modules is only marginal and cannot be distinguished from background. Applications based on cross-genome comparison at the DNA sequence level will fail when the orthologous DNA sequences are highly diverged. Therefore, there is a need for a complementary tool that allows users to detect conserved regulatory modules in diverged DNA sequences.

We developed a novel methodology, BLISS 2.0 (Binding-site Level Identification of Shared Signal-module, version 2.0), for identifying evolutionarily conserved regulatory modules in pairs of orthologs based on comparison at the binding site level, considering that the conservation pressure acts at the binding site level rather than the DNA sequence level. By integrating Gaussian smoothing and statistical analysis, we demonstrated that our approach outperforms regular sequence level comparisons when the orthologous DNA sequences are highly diverged. BLISS 2.0 is now implemented as a web-based tool and can be accessed freely by the research community at: http://gene1.ufscc.ufl.edu:8080/bliss2/index.html.


Figure 5-1. Web interface of BLISS 2.0. This is the homepage of BLISS 2.0. Users are required to input two orthologous DNA sequences to start the search.

Given two orthologous DNA sequences, BLISS 2.0 is able to output all potentially conserved regulatory modules between those two sequences. BLISS 2.0 analysis proceeds in three major steps.

To begin, users are required to input two raw DNA sequences (Figure 5-1). BLISS 2.0 generates binding site profiles for each sequence based on matrices collected in TRANSFAC 9.1 (Matys et al. 2003). On the same page, users can choose a binding site score cutoff. Either 0.75 or 0.8 can be selected as the binding score cutoff to determine whether a specific binding site is found at a certain location in the DNA sequence. This cutoff is a balance between specificity and sensitivity: a higher score cutoff increases the specificity but decreases the sensitivity at the same time.


Figure 5-2. The color plot of BLISS_score. The BLISS_score indicates the degree of conservation at the binding site level between two sequences.

Second, by integrating Gaussian smoothing and statistical analysis into the comparison of binding site profiles, BLISS 2.0 expresses the degree of conservation at the binding site level between two sequences as BLISS_scores, which are displayed as a two-dimensional color plot (Figure 5-2). The vertical color bar on the right side of the color plot shows the BLISS_score value that each color represents. Continuous high BLISS_scores along the diagonal direction indicate a potential match of conserved regulatory modules between two sequences. To be able to evaluate a particular BLISS_score, we analyzed the distribution of BLISS_scores using simulated sequence pairs, as shown in Figure 5-3. There are two BLISS_score distributions. The left one is the distribution when a random sequence matches with a random sequence or a sequence


harboring a regulatory module, and the right one is the distribution when two conserved regulatory modules match. Based on this statistical analysis, users can choose a BLISS_score cutoff so that all matched regions with a BLISS_score greater than this cutoff are output as conserved regulatory modules. Users also have the option to determine how to rank the shared TFBSs in reported regulatory modules. They can be ranked by location, by numeric contribution to the BLISS_score, or by the product of the p-values of the matching TFBSs on both sequences.

Figure 5-3. Statistical analysis and distribution of BLISS_score. Statistical analysis of the BLISS_score helps to evaluate a BLISS_score.

Finally, all the matching regulatory modules with a BLISS_score greater than the BLISS_score cutoff chosen by the user are displayed on the third page. Contributing TFBSs are listed in a separate table for each matched region. BLISS 2.0 provides the


72 option to let users highlight th e TFBS they are interested in as green color as shown in Figure 5-4. Figure 5-4. Output of contri buting TFBSs. Contributing TFBS s are listed in a separated table for each matched CRM. Users can choose to highlight the TFBS they are interested in as green color. It has been experienced by many researcher s that the laboratory experiments in a model organism has narrowed down the location of the regulatory module for a particular gene to a relatively short region, whereas for the ortholog in a less-studied organism, information about the localizati on of the module is absent. To be able to deal with this special instance, BLISS 2.0 provide a co mplementary tool Single Cis-Regulatory Module Search, to locate the position of the conserved regulatory module in the ortholog of the less-studied organism. We suggest users to use this tool especially in the case that the length of one of the orthologous DNA sequences is less than 200 bps.


The advantage of BLISS 2.0 is that it allows the detection of conserved regulatory modules in highly diverged sequences, on which BLISS 2.0 outperforms the existing methods that are based on sequence similarity comparison. We have successfully applied BLISS 2.0 to identify the Even-skipped (eve) stripe 2 enhancer (S2E) in D. mojavensis and D. virilis, for which no detailed information has been published and which cannot be detected by existing tools such as BLAST and rVista 2.0.


CHAPTER 6
CONCLUSIONS AND SUGGESTIONS FOR FUTURE STUDY

In this study, we developed a novel methodology to identify conserved regulatory modules at the binding site level and implemented it as a freely accessible web-based application named BLISS 2.0. To our knowledge, this is the first time that the feasibility of comparison at the binding site level has been confirmed. In the first release of BLISS, we assumed that biologists have the sequence for a regulatory module, and BLISS could identify the conserved module in its orthologous sequence, where the DNA similarity could not be detected and the methods based on sequence similarity therefore failed. After the success of BLISS1, we extended this methodology to a more general scenario. In BLISS2, potential regulatory modules can be predicted between two orthologous sequences by detecting conserved regulatory complexes.

The advantage of this methodology is that it allows the detection of conserved regulatory modules in highly divergent sequences, as we have demonstrated both with simulated sequences and with real world examples. BLISS is thus complementary to many existing methods based on nucleotide sequence similarity. It outperforms these sequence similarity-based methods when the orthologous DNA sequences are highly diverged. BLISS therefore serves as a valuable tool to help biomedical scientists identify functional regulatory modules and design experimental verification strategies.

This study investigated a novel cross-genome comparison strategy and therefore opened a new field for future research, which includes developing new algorithms to improve the performance of the methodology, multiple sequence comparison at the


binding site level and identifying regulatory modules with higher inter-TFBS distance variations.


APPENDIX A
PARAMETER OPTIMIZATIONS FOR GAUSSIAN SMOOTHING METHOD

Using simulated sequences, the performance of the Gaussian smoothing method was tested under different parameters: 1) M_score cutoff, 2) variance for Gaussian smoothing, and 3) with or without integrating the p value of the M_score (Figure A-1).

The results indicated that:

1. Integrating the p value of the M_score into the Gaussian smoothing method greatly increased the performance of the method.
2. Bigger variance gave better performance.
3. BLISS gave the best predictions when the binding site score cutoff was between 0.75 and 0.85. Considering that the binding sites from TRANSFAC are identified binding sites and tend to have higher binding site scores, 0.75 and 0.8 were chosen as the default cutoffs for the binding site score in the implementation of BLISS.

Figure A-1. Parameter optimizations for Gaussian smoothing method used in BLISS. [Plot: Gaussian Smooth Method Tests Using Simulated Sequences. X-axis: Score Cutoff (0.5 to 1). Y-axis: Correct Predictions/100 (40 to 100). Series: Var=0, Var=1, Var=2, Var=3, each without and with P value.]


APPENDIX B
A DYNAMIC PROGRAMMING ALGORITHM FOR IDENTIFYING CONSERVED REGULATORY MODULES

Dynamic programming algorithms have been applied extensively in computational sequence analysis. In this study, we attempted to apply dynamic programming to the binding site profiles to identify conserved regulatory modules in orthologous sequences.

The idea underlying this dynamic programming algorithm is to look for the best local alignments between the binding site profiles of two sequences. The algorithm consists of three steps:

1. Calculation of the P_score profiles for the two sequences, exactly as described in the Methods part of Chapter 3.

2. A matrix F was constructed based on the P_score profiles of the two sequences, where F(i, j) is the score of the best alignment between the initial segment from position 1 to i of the first sequence and the initial segment from position 1 to j of the second sequence. F(i, j) was computed recursively as below:

\[ F(i, j) = \max \begin{cases} 0 \\ F(i-1, j-1) + s(seq1_i, seq2_j) \\ F(i-1, j) + d \\ F(i, j-1) + d \end{cases} \]

F(i, j) could be calculated from F(i-1, j-1), F(i-1, j) and F(i, j-1). F(i, 0) and F(0, j) were set to 0 for all i and j.

F(i, j) = 0. When F(i, j) would have a negative score at some point, we give it the value 0 to start a new alignment.

F(i, j) = F(i-1, j-1) + s(seq1i, seq2j). When position i in sequence 1 and position j in sequence 2 have at least one shared binding site, s(seq1i, seq2j) will be the maximum of P_score1(k, i) and P_score2(k, j) for all k, where k is the index of the binding site; when there is no shared binding site at position i in sequence 1 and position j in sequence 2, but there is at least one binding site either at position i in


sequence 1 or position j in sequence 2, then s(seq1i, seq2j) is set to a penalty called TF_penalty; when there is no binding site at either position i in sequence 1 or position j in sequence 2, s(seq1i, seq2j) is set to zero.

F(i, j) = F(i, j-1) + d. This is the case when seq2j is aligned to a gap. When there is at least one binding site at position j of sequence 2, d is set to TF_penalty; when there is no binding site at position j of sequence 2, d is set to N_penalty.

F(i, j) = F(i-1, j) + d. This is the case when seq1i is aligned to a gap. When there is at least one binding site at position i of sequence 1, d is set to TF_penalty; when there is no binding site at position i of sequence 1, d is set to N_penalty.

3. After the construction of matrix F, find the maximum F(i, j), which corresponds to the end points of the conserved regulatory module. Then trace back to a point with value 0, which corresponds to the start points of the conserved module.

This dynamic programming algorithm was applied to 500 simulated sequence pairs, each of which has only one conserved module. This sequence group corresponds to the second group of simulated sequences in the first data set used for the BLISS2 performance test in chapter 4. The best results were achieved when TF_penalty equals [...] and N_penalty equals [...]. When 0.8 was chosen as the M_score cutoff, the conserved regulatory modules in 168 sequence pairs could be detected, and the remaining 332 local maxima reported by this dynamic programming algorithm were false positives.

In conclusion, this study addressed the feasibility of identifying conserved regulatory modules in orthologous sequences by applying the dynamic programming algorithm to the binding site profiles. Even so, the Gaussian smoothing method developed in chapter 3 and chapter 4 achieved better performance than this dynamic programming algorithm, so BLISS adopted the Gaussian smoothing method, instead of this dynamic programming algorithm, for the web implementation.
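The three steps of this appendix can be sketched in Python as follows. This is an illustration under stated assumptions, not the code used in the study: the function name is hypothetical, a binding site is taken to be present wherever a P_score entry is greater than zero, the match score is read as the largest P_score among the shared binding sites, and the penalty values are placeholders because the values used in the study are not given in the text.

```python
import numpy as np

def align_profiles(p1, p2, tf_penalty=-1.0, n_penalty=-0.1):
    """Local alignment of two P_score profiles of shape (K, L1) and (K, L2).

    The penalty defaults are placeholders chosen for this sketch only.
    Returns (score, (start1, end1), (start2, end2)) with 0-based inclusive
    positions, or None when no positive-scoring alignment exists.
    """
    has1 = (p1 > 0).any(axis=0)          # any binding site at each position
    has2 = (p2 > 0).any(axis=0)
    l1, l2 = p1.shape[1], p2.shape[1]
    F = np.zeros((l1 + 1, l2 + 1))       # F(i, 0) = F(0, j) = 0
    move = np.zeros((l1 + 1, l2 + 1), dtype=int)  # 0 stop, 1 diag, 2 up, 3 left
    for i in range(1, l1 + 1):
        for j in range(1, l2 + 1):
            shared = (p1[:, i - 1] > 0) & (p2[:, j - 1] > 0)
            if shared.any():
                s = max(p1[shared, i - 1].max(), p2[shared, j - 1].max())
            elif has1[i - 1] or has2[j - 1]:
                s = tf_penalty
            else:
                s = 0.0
            d_up = tf_penalty if has1[i - 1] else n_penalty    # seq1_i vs gap
            d_left = tf_penalty if has2[j - 1] else n_penalty  # seq2_j vs gap
            options = [0.0,
                       F[i - 1, j - 1] + s,
                       F[i - 1, j] + d_up,
                       F[i, j - 1] + d_left]
            move[i, j] = int(np.argmax(options))
            F[i, j] = options[move[i, j]]
    i, j = (int(v) for v in np.unravel_index(np.argmax(F), F.shape))
    best = float(F[i, j])
    if best <= 0:
        return None
    end1, end2 = i, j
    while move[i, j] != 0:               # trace back to a cell with value 0
        if move[i, j] == 1:
            i, j = i - 1, j - 1
        elif move[i, j] == 2:
            i -= 1
        else:
            j -= 1
    return best, (i, end1 - 1), (j, end2 - 1)

# Toy profiles: two TFBS types with the same spacing in both sequences.
p1 = np.zeros((2, 12)); p1[0, 3] = 0.9; p1[1, 6] = 0.8
p2 = np.zeros((2, 12)); p2[0, 5] = 0.7; p2[1, 8] = 0.9
print(align_profiles(p1, p2))
```

In the toy example the two shared binding sites are joined by diagonal moves through site-free positions, so the reported module spans both sites in each sequence.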


APPENDIX C
AN IMPROVED ALGORITHM FOR BLISS2

In the current algorithm for BLISS2, we chose to take the sum of G_scores over all TFs at each matched position when the BLISS_score was calculated. The performance of BLISS2 improved (Tables C-1 and C-2) when we instead took the maximum of G_scores at each matched location.

Table C-1. Performance test of the improved BLISS2 for M_score cutoff=0.75. False Positives (FP) and True Positives (TP) are listed under different window sizes and BLISS_score cutoffs.

BLISS_score cutoffs: 0.27 0.26 0.25 0.24 0.23 0.22 0.21

Window size   Type   Values
100           FP     2565 3748 5272 7180
100           TP     1045 1174 1307 1434
150           FP     262 663 2412 6496
150           TP     1185 1614 1956 2157
200           FP     40 92 416 2157 6096
200           TP     724 1165 1686 2062 2167
250           FP     6 23 75 617 3277
250           TP     352 699 1180 1736 2000

Table C-2. Performance test of the improved BLISS2 for M_score cutoff=0.8. False Positives (FP) and True Positives (TP) are listed under different window sizes and BLISS_score cutoffs.

BLISS_score cutoffs: 0.28 0.27 0.26 0.25 0.24 0.23 0.22

Window size   Type   Values
100           FP     1779 2968 5489
100           TP     1373 1602 1794
150           FP     288 575 1412 3572 6805
150           TP     1526 1822 2078 2263 2340
200           FP     46 68 185 686 2395
200           TP     1034 1374 1734 2063 2259
250           FP     16 36 142 684 2527
250           TP     851 1222 1622 1967 2190


The method for calculating the BLISS_score in the improved algorithm is the same as in the BLISS2 algorithm, except that the Pre_BLISS_score was calculated as below:

\[ \mathrm{Pre\_BLISS\_score}[i][j] = \max_{m} \left( \mathrm{G\_score1}[m][i] \cdot \mathrm{G\_score2}[m][j] \right), \quad m = 0, 1, 2, \ldots \]

where m is the index of the TFBS.

The improvement was more dramatic for the second data set used in chapter 4, which had longer backbone sequences. Table C-3 displays the performance comparison between the two algorithms under the condition where the window size is 200 and the binding score cutoff is 0.75.

Table C-3. Performance comparison of the two BLISS2 algorithms for the second data set when the M_score cutoff is 0.75. False Positives (FP) and True Positives (TP) are listed under different BLISS_score cutoffs.

Algorithm         Type               Values
BLISS2            BLISS_score cutoff 2.9   2.8   2.7   2.6   2.5
BLISS2            FP                 33    111   359   1014  2778
BLISS2            TP                 340   458   611   804   1029
Improved BLISS2   BLISS_score cutoff 0.25  0.245 0.24  0.235 0.23
Improved BLISS2   FP                 27    79    217   861   2953
Improved BLISS2   TP                 748   937   1198  1431  1653

This improvement may reflect the fact that only one TF can bind to a given location at a time in the real biological environment, so the performance of the BLISS2 algorithm could be compromised by the repeated contributions of overlapping TFBSs to the BLISS_score. However, the results included more false positives when this improved algorithm was applied to the S2E example. Considering that S2E is the only real world example we have so far, further research and more real world examples are required to test this improved algorithm for BLISS2.

The BLISS_score distributions for the improved algorithm were calculated using the same method as described in chapter 3 and are shown in Figure C-1 and Figure C-2.
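The difference between the two Pre_BLISS_score definitions can be made concrete with a short Python sketch. The function names are hypothetical, and the per-TFBS term is assumed to be the product of the two G_scores, as in the sum-based definition of chapter 4.

```python
import numpy as np

def pre_bliss_sum(g1, g2):
    """BLISS2 definition: sum the per-TFBS products over all m."""
    return g1.T @ g2

def pre_bliss_max(g1, g2):
    """Improved definition: keep only the largest per-TFBS product."""
    out = np.full((g1.shape[1], g2.shape[1]), -np.inf)
    for m in range(g1.shape[0]):
        # Outer product gives G_score1[m, i] * G_score2[m, j] for all i, j.
        np.maximum(out, np.outer(g1[m], g2[m]), out=out)
    return out

# Two overlapping TFBS matrices both match the same pair of positions.
g1 = np.array([[0.5], [0.4]])
g2 = np.array([[0.5], [0.5]])
print(pre_bliss_sum(g1, g2)[0, 0], pre_bliss_max(g1, g2)[0, 0])
```

With the sum, both overlapping TFBSs contribute to the score at the same position pair; with the maximum, only the stronger one does, which is the effect the paragraph above attributes to overlapped TFBSs.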


Figure C-1. Distribution of BLISS_score for the improved BLISS2 algorithm with the M_score cutoff value of 0.75.

Figure C-2. Distribution of BLISS_score for the improved BLISS2 algorithm with the M_score cutoff value of 0.8.


LIST OF REFERENCES

Aerts S, Thijs G, Coessens B, Staes M, Moreau Y, De Moor B: Toucan: deciphering the cis-regulatory logic of coregulated genes. Nucleic Acids Res 2003, 31(6):1753-1764.

Aerts S, Van Loo P, Thijs G, Mayer H, de Martin R, Moreau Y, De Moor B: TOUCAN 2: the all-inclusive open source workbench for regulatory sequence analysis. Nucleic Acids Res 2005, 33(Web Server issue):W393-396.

Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215:403-410.

Andrioli LP, Vasisht V, Theodosopoulou E, Oberstein A, Small S: Anterior repression of a Drosophila stripe enhancer requires three position-specific mechanisms. Development 2002, 129(21):4931-4940.

Arnosti DN: Design and function of transcriptional switches in Drosophila. Insect Biochem Mol Biol 2002, 32(10):1257-1273.

Arnosti DN, Barolo S, Levine M, Small S: The eve stripe 2 enhancer employs multiple modes of transcriptional synergy. Development 1996, 122(1):205-214.

Bailey TL, Elkan C: The value of prior knowledge in discovering motifs with MEME. Proc Int Conf Intell Syst Mol Biol 1995, 3:21-29.

Bailey TL, Gribskov M: Methods and statistics for combining motif match scores. J Comput Biol 1998, 5(2):211-221.

Bailey TL, Gribskov M: Concerning the accuracy of MAST E-values. Bioinformatics 2000, 16(5):488-489.

Benos PV, Lapedes AS, Stormo GD: Probabilistic code for DNA recognition by proteins of the EGR family. J Mol Biol 2002, 323(4):701-727.

Berman BP, Nibu Y, Pfeiffer BD, Tomancak P, Celniker SE, Levine M, Rubin GM, Eisen MB: Exploiting transcription factor binding site clustering to identify cis-regulatory modules involved in pattern formation in the Drosophila genome. Proc Natl Acad Sci U S A 2002, 99(2):757-762.


Berman BP, Pfeiffer BD, Laverty TR, Salzberg SL, Rubin GM, Eisen MB, Celniker SE: Computational identification of developmental enhancers: conservation and function of transcription factor binding-site clusters in Drosophila melanogaster and Drosophila pseudoobscura. Genome Biol 2004, 5(9):R61.

Boulikas T: A compilation and classification of DNA binding sites for protein transcription factors from vertebrates. Crit Rev Eukaryot Gene Expr 1994, 4(2-3):117-321.

Bray N, Dubchak I, Pachter L: AVID: A global alignment program. Genome Res 2003, 13(1):97-102.

Bucher P: Weight matrix descriptions of four eukaryotic RNA polymerase II promoter elements derived from 502 unrelated promoter sequences. J Mol Biol 1990, 212(4):563-578.

Bulyk ML, Johnson PL, Church GM: Nucleotides of transcription factor binding sites exert interdependent effects on the binding affinities of transcription factors. Nucleic Acids Res 2002, 30(5):1255-1261.

Che D, Jensen S, Cai L, Liu JS: BEST: binding-site estimation suite of tools. Bioinformatics 2005, 21(12):2909-2911.

Chekmenev DS, Haid C, Kel AE: P-Match: transcription factor binding site search by combining patterns and weight matrices. Nucleic Acids Res 2005, 33(Web Server issue):W432-437.

Chen QK, Hertz GZ, Stormo GD: MATRIX SEARCH 1.0: a computer program that scans DNA sequences for transcriptional elements using a database of weight matrices. Comput Appl Biosci 1995, 11(5):563-566.

Cora D, Di Cunto F, Provero P, Silengo L, Caselle M: Computational identification of transcription factor binding sites by functional analysis of sets of genes sharing overrepresented upstream motifs. BMC Bioinformatics 2004, 5:57.

Day WH, McMorris FR: Critical comparison of consensus methods for molecular sequences. Nucleic Acids Res 1992, 20(5):1093-1099.

Elkon R, Linhart C, Sharan R, Shamir R, Shiloh Y: Genome-wide in silico identification of transcriptional regulators controlling the cell cycle in human cells. Genome Res 2003, 13(5):773-780.

Erives A, Levine M: Coordinate enhancers share common organizational features in the Drosophila genome. Proc Natl Acad Sci U S A 2004, 101(11):3851-3856.

Fickett JW: Coordinate positioning of MEF2 and myogenin binding sites. Gene 1996, 172(1):GC19-32.


Frazer KA, Elnitski L, Church DM, Dubchak I, Hardison RC: Cross-species sequence comparisons: a review of methods and available resources. Genome Res 2003, 13(1):1-12.

Frech K, Herrmann G, Werner T: Computer-assisted prediction, classification, and delimitation of protein binding sites in nucleic acids. Nucleic Acids Res 1993, 21(7):1655-1664.

Frech K, Quandt K, Werner T: Finding protein-binding sites in DNA sequences: the next generation. Trends Biochem Sci 1997, 22(3):103-104.

Fried M, Crothers DM: Equilibria and kinetics of lac repressor-operator interactions by polyacrylamide gel electrophoresis. Nucleic Acids Res 1981, 9(23):6505-6525.

Frith MC, Hansen U, Weng Z: Detection of cis-element clusters in higher eukaryotic DNA. Bioinformatics 2001, 17(10):878-889.

Galas DJ, Schmitz A: DNAse footprinting: a simple method for the detection of protein-DNA binding specificity. Nucleic Acids Res 1978, 5(9):3157-3170.

Garner MM, Revzin A: A gel electrophoresis method for quantifying the binding of proteins to specific DNA regions: application to components of the Escherichia coli lactose operon regulatory system. Nucleic Acids Res 1981, 9(13):3047-3060.

Gershenzon NI, Stormo GD, Ioshikhes IP: Computational technique for improvement of the position-weight matrices for the DNA/protein binding sites. Nucleic Acids Res 2005, 33(7):2290-2301.

Grad YH, Roth FP, Halfon MS, Church GM: Prediction of similarly acting cis-regulatory modules by subsequence profiling and comparative genomics in Drosophila melanogaster and D.pseudoobscura. Bioinformatics 2004, 20(16):2738-2750.

Halfon MS, Michelson AM: Exploring genetic regulatory networks in metazoan development: methods and models. Physiol Genomics 2002, 10(3):131-143.

Hannenhalli S, Levy S: Predicting transcription factor synergism. Nucleic Acids Res 2002, 30(19):4278-4284.

Heinemeyer T, Wingender E, Reuter I, Hermjakob H, Kel AE, Kel OV, Ignatieva EV, Ananko EA, Podkolodnaya OA, Kolpakov FA et al: Databases on transcriptional regulation: TRANSFAC, TRRD and COMPEL. Nucleic Acids Res 1998, 26(1):362-367.

Hernandez-Munain C, Krangel MS: Regulation of the T-cell receptor delta enhancer by functional cooperation between c-Myb and core-binding factors. Mol Cell Biol 1994, 14(1):473-483.

Hu Z, Frith M, Niu T, Weng Z: SeqVISTA: a graphical tool for sequence feature visualization and comparison. BMC Bioinformatics 2003, 4:1.

Hu Z, Fu Y, Halees AS, Kielbasa SM, Weng Z: SeqVISTA: a new module of integrated computational tools for studying transcriptional regulation. Nucleic Acids Res 2004, 32(Web Server issue):W235-241.

Huang H, Kao MC, Zhou X, Liu JS, Wong WH: Determination of local statistical significance of patterns in Markov sequences with application to promoter element identification. J Comput Biol 2004, 11(1):1-14.

Jegga AG, Sherwood SP, Carman JW, Pinski AT, Phillips JL, Pestian JP, Aronow BJ: Detection and visualization of compositionally similar cis-regulatory element clusters in orthologous and coordinately controlled genes. Genome Res 2002, 12(9):1408-1417.

Johansson O, Alkema W, Wasserman WW, Lagergren J: Identification of functional clusters of transcription factor binding motifs in genome sequences: the MSCAN algorithm. Bioinformatics 2003, 19 Suppl 1:i169-176.

Johnson PF, McKnight SL: Eukaryotic transcriptional regulatory proteins. Annu Rev Biochem 1989, 58:799-839.

Kel AE, Gossling E, Reuter I, Cheremushkin E, Kel-Margoulis OV, Wingender E: MATCH: A tool for searching transcription factor binding sites in DNA sequences. Nucleic Acids Res 2003, 31(13):3576-3579.

Kel AE, Kolchanov NA, Kel OV, Romashchenko AG, Anan'ko EA, Ignat'eva EV, Merkulova TI, Podkolodnaia OA, Stepanenko IL, Kochetov AV et al: [TRRD: a database of transcription regulatory regions in eukaryotic genes]. Mol Biol (Mosk) 1997, 31(4):626-636.

Kel AE, Kondrakhin YV, Kolpakov Ph A, Kel OV, Romashenko AG, Wingender E, Milanesi L, Kolchanov NA: Computer tool FUNSITE for analysis of eukaryotic regulatory genomic sequences. Proc Int Conf Intell Syst Mol Biol 1995, 3:197-205.

Kel OV, Romaschenko AG, Kel AE, Wingender E, Kolchanov NA: A compilation of composite regulatory elements affecting gene transcription in vertebrates. Nucleic Acids Res 1995, 23(20):4097-4103.

Kel-Margoulis OV, Kel AE, Reuter I, Deineko IV, Wingender E: TRANSCompel: a database on composite regulatory elements in eukaryotic genes. Nucleic Acids Res 2002, 30(1):332-334.

Kel-Margoulis OV, Romashchenko AG, Kolchanov NA, Wingender E, Kel AE: COMPEL: a database on composite regulatory elements providing combinatorial transcriptional regulation. Nucleic Acids Res 2000, 28(1):311-315.

Knuppel R, Dietze P, Lehnberg W, Frech K, Wingender E: TRANSFAC retrieval program: a network model database of eukaryotic transcription regulating sequences and proteins. J Comput Biol 1994, 1(3):191-198.

Kolchanov NA, Ananko EA, Podkolodnaya OA, Ignatieva EV, Stepanenko IL, Kel-Margoulis OV, Kel AE, Merkulova TI, Goryachkovskaya TN, Busygina TV et al: Transcription Regulatory Regions Database (TRRD): its status in 1999. Nucleic Acids Res 1999, 27(1):303-306.

Kolchanov NA, Podkolodnaya OA, Ananko EA, Ignatieva EV, Stepanenko IL, Kel-Margoulis OV, Kel AE, Merkulova TI, Goryachkovskaya TN, Busygina TV et al: Transcription regulatory regions database (TRRD): its status in 2000. Nucleic Acids Res 2000, 28(1):298-301.

Kondrakhin YV, Kel AE, Kolchanov NA, Romashchenko AG, Milanesi L: Eukaryotic promoter recognition by binding sites for transcription factors. Comput Appl Biosci 1995, 11(5):477-488.

Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF, Wootton JC: Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science 1993, 262(5131):208-214.

Lawrence CE, Reilly AA: An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins 1990, 7(1):41-51.

Lee TI, Rinaldi NJ, Robert F, Odom DT, Bar-Joseph Z, Gerber GK, Hannett NM, Harbison CT, Thompson CM, Simon I et al: Transcriptional regulatory networks in Saccharomyces cerevisiae. Science 2002, 298(5594):799-804.

Lemaigre FP, Lafontaine DA, Courtois SJ, Durviaux SM, Rousseau GG: Sp1 can displace GHF-1 from its distal binding site and stimulate transcription from the growth hormone gene promoter. Mol Cell Biol 1990, 10(4):1811-1814.

Lenhard B, Sandelin A, Mendoza L, Engstrom P, Jareborg N, Wasserman WW: Identification of conserved regulatory elements by comparative genome analysis. J Biol 2003, 2(2):13.

Lewis H, Kaszubska W, DeLamarter JF, Whelan J: Cooperativity between two NF-kappa B complexes, mediated by high-mobility-group protein I(Y), is essential for cytokine-induced expression of the E-selectin promoter. Mol Cell Biol 1994, 14(9):5701-5709.

Liu X, Brutlag DL, Liu JS: BioProspector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. Pac Symp Biocomput 2001:127-138.

Liu Y, Liu XS, Wei L, Altman RB, Batzoglou S: Eukaryotic regulatory element conservation analysis and identification using comparative genomics. Genome Res 2004, 14(3):451-458.

Liu Y, Wei L, Batzoglou S, Brutlag DL, Liu JS, Liu XS: A suite of web-based programs to search for transcriptional regulatory motifs. Nucleic Acids Res 2004, 32(Web Server issue):W204-207.

Loots GG, Ovcharenko I: rVISTA 2.0: evolutionary analysis of transcription factor binding sites. Nucleic Acids Res 2004, 32(Web Server issue):W217-221.

Loots GG, Ovcharenko I, Pachter L, Dubchak I, Rubin EM: rVista for comparative sequence-based discovery of functional transcription factor binding sites. Genome Res 2002, 12(5):832-839.

Ludwig MZ: Functional evolution of noncoding DNA. Curr Opin Genet Dev 2002, 12(6):634-639.

Ludwig MZ, Palsson A, Alekseeva E, Bergman CM, Nathan J, Kreitman M: Functional evolution of a cis-regulatory module. PLoS Biol 2005, 3(4):e93.

Ludwig MZ, Patel NH, Kreitman M: Functional analysis of eve stripe 2 enhancer evolution in Drosophila: rules governing conservation and change. Development 1998, 125(5):949-958.

Maniatis T, Goodbourn S, Fischer JA: Regulation of inducible and tissue-specific gene expression. Science 1987, 236(4806):1237-1245.

Marco-Ferreres R, Vivar J, Arredondo JJ, Portillo F, Cervera M: Co-operation between enhancers modulates quantitative expression from the Drosophila Paramyosin/miniparamyosin gene in different muscle types. Mech Dev 2005, 122(5):681-694.

Matys V, Fricke E, Geffers R, Gossling E, Haubrock M, Hehl R, Hornischer K, Karas D, Kel AE, Kel-Margoulis OV et al: TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids Res 2003, 31(1):374-378.

Mayor C, Brudno M, Schwartz JR, Poliakov A, Rubin EM, Frazer KA, Pachter LS, Dubchak I: VISTA: visualizing global DNA sequence alignments of arbitrary length. Bioinformatics 2000, 16(11):1046-1047.

Meng H, Banerjee A, Zhou L: BLISS: binding site level identification of shared signal-modules in DNA regulatory sequences. BMC Bioinformatics 2006, 7:287.

Muhlethaler-Mottet A, Di Berardino W, Otten LA, Mach B: Activation of the MHC class II transactivator CIITA by interferon-gamma requires cooperative interaction between Stat1 and USF-1. Immunity 1998, 8(2):157-166.

Osada R, Zaslavsky E, Singh M: Comparative analysis of methods for representing and searching for transcription factor binding sites. Bioinformatics 2004, 20(18):3516-3525.

Powell J: Progress and prospects in evolutionary biology: The Drosophila model. Oxford: Oxford University Press; 1997.

Prestridge DS: Predicting Pol II promoter sequences using transcription factor binding sites. J Mol Biol 1995, 249(5):923-932.

Prestridge DS: SIGNAL SCAN 4.0: additional databases and sequence formats. Comput Appl Biosci 1996, 12(2):157-160.

Qiu P: Recent advances in computational promoter analysis in understanding the transcriptional regulatory network. Biochem Biophys Res Commun 2003, 309(3):495-501.

Qiu P, Ding W, Jiang Y, Greene JR, Wang L: Computational analysis of composite regulatory elements. Mamm Genome 2002, 13(6):327-332.

Quandt K, Frech K, Karas H, Wingender E, Werner T: MatInd and MatInspector: new fast and versatile tools for detection of consensus matches in nucleotide sequence data. Nucleic Acids Res 1995, 23(23):4878-4884.

Rajewsky N, Vergassola M, Gaul U, Siggia ED: Computational detection of genomic cis-regulatory modules applied to body patterning in the early Drosophila embryo. BMC Bioinformatics 2002, 3:30.

Rebeiz M, Reeves NL, Posakony JW: SCORE: a computational approach to the identification of cis-regulatory modules and target genes in whole-genome sequence data. Site clustering over random expectation. Proc Natl Acad Sci U S A 2002, 99(15):9888-9893.

Ringrose L, Rehmsmeier M, Dura JM, Paro R: Genome-wide prediction of Polycomb/Trithorax response elements in Drosophila melanogaster. Dev Cell 2003, 5(5):759-771.

Sandelin A, Alkema W, Engstrom P, Wasserman WW, Lenhard B: JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Res 2004, 32(Database issue):D91-94.

Sandelin A, Wasserman WW, Lenhard B: ConSite: web-based prediction of regulatory elements using cross-species comparison. Nucleic Acids Res 2004, 32(Web Server issue):W249-252.

Santini S, Boore JL, Meyer A: Evolutionary conservation of regulatory elements in vertebrate Hox gene clusters. Genome Res 2003, 13(6A):1111-1122.

Schwartz S, Elnitski L, Li M, Weirauch M, Riemer C, Smit A, Green ED, Hardison RC, Miller W: MultiPipMaker and supporting tools: alignments and analysis of multiple genomic DNA sequences. Nucleic Acids Res 2003, 31(13):3518-3524.

Schwartz S, Zhang Z, Frazer KA, Smit A, Riemer C, Bouck J, Gibbs R, Hardison R, Miller W: PipMaker--a web server for aligning two genomic DNA sequences. Genome Res 2000, 10(4):577-586.

Senger K, Armstrong GW, Rowell WJ, Kwan JM, Markstein M, Levine M: Immunity regulatory DNAs share common organizational features in Drosophila. Mol Cell 2004, 13(1):19-32.

Sharan R, Ben-Hur A, Loots GG, Ovcharenko I: CREME: Cis-Regulatory Module Explorer for the human genome. Nucleic Acids Res 2004, 32(Web Server issue):W253-256.

Sharan R, Ovcharenko I, Ben-Hur A, Karp RM: CREME: a framework for identifying cis-regulatory modules in human-mouse conserved segments. Bioinformatics 2003, 19 Suppl 1:i283-291.

Shaw PE: Ternary complex formation over the c-fos serum response element: p62TCF exhibits dual component specificity with contacts to DNA and an extended structure in the DNA-binding domain of p67SRF. Embo J 1992, 11(8):3011-3019.

Sinha S, Blanchette M, Tompa M: PhyME: a probabilistic algorithm for finding motifs in sets of orthologous sequences. BMC Bioinformatics 2004, 5:170.

Small S, Blair A, Levine M: Regulation of even-skipped stripe 2 in the Drosophila embryo. Embo J 1992, 11(11):4047-4057.

Staden R: Methods for calculating the probabilities of finding patterns in sequences. Comput Appl Biosci 1989, 5(2):89-96.

Stormo GD: Information content and free energy in DNA--protein interactions. J Theor Biol 1998, 195(1):135-137.

Stormo GD: DNA binding sites: representation and discovery. Bioinformatics 2000, 16(1):16-23.

Stormo GD, Fields DS: Specificity, free energy and information content in protein-DNA interactions. Trends Biochem Sci 1998, 23(3):109-113.

Stormo GD, Schneider TD, Gold L: Quantitative analysis of the relationship between nucleotide sequence and functional activity. Nucleic Acids Res 1986, 14(16):6661-6679.

Thanos D, Maniatis T: Virus induction of human IFN beta gene expression requires the assembly of an enhanceosome. Cell 1995, 83(7):1091-1100.

Thijs G, Lescot M, Marchal K, Rombauts S, De Moor B, Rouze P, Moreau Y: A higher-order background model improves the detection of promoter regulatory elements by Gibbs sampling. Bioinformatics 2001, 17(12):1113-1122.

Thijs G, Marchal K, Lescot M, Rombauts S, De Moor B, Rouze P, Moreau Y: A Gibbs sampling method to detect overrepresented motifs in the upstream regions of coexpressed genes. J Comput Biol 2002, 9(2):447-464.

Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 1994, 22(22):4673-4680.

Tjian R, Maniatis T: Transcriptional activation: a complex puzzle with few easy pieces. Cell 1994, 77(1):5-8.

Tsunoda T, Takagi T: Estimating transcription factor bindability on DNA. Bioinformatics 1999, 15(7-8):622-630.

van Helden J: Regulatory sequence analysis tools. Nucleic Acids Res 2003, 31(13):3593-3596.

Venkatesh B, Yap WH: Comparative genomics using fugu: a tool for the identification of conserved vertebrate cis-regulatory elements. Bioessays 2005, 27(1):100-107.

Wang CY, Petryniak B, Thompson CB, Kaelin WG, Leiden JM: Regulation of the Ets-related transcription factor Elf-1 by binding to the retinoblastoma protein. Science 1993, 260(5112):1330-1335.

Weintraub H, Davis R, Lockshon D, Lassar A: MyoD binds cooperatively to two sites in a target enhancer sequence: occupancy of two sites is required for activation. Proc Natl Acad Sci U S A 1990, 87(15):5623-5627.

Whitley MZ, Thanos D, Read MA, Maniatis T, Collins T: A striking similarity in the organization of the E-selectin and beta interferon gene promoters. Mol Cell Biol 1994, 14(10):6464-6475.

Wingender E, Chen X, Hehl R, Karas H, Liebich I, Matys V, Meinhardt T, Pruss M, Reuter I, Schacherer F: TRANSFAC: an integrated system for gene expression regulation. Nucleic Acids Res 2000, 28(1):316-319.

Wingender E, Dietze P, Karas H, Knuppel R: TRANSFAC: a database on transcription factors and their DNA binding sites. Nucleic Acids Res 1996, 24(1):238-241.

Workman CT, Stormo GD: ANN-Spec: a method for discovering transcription factor binding sites with improved specificity. Pac Symp Biocomput 2000:467-478.

Yan Y, Chen H, Costa M: Chromatin Immunoprecipitation Assays. In: Epigenetics Protocols. Edited by Tollefsbol TO. Totowa, NJ: Humana Press; 2004.

Yuh CH, Bolouri H, Davidson EH: Genomic cis-regulatory logic: experimental and computational analysis of a sea urchin gene. Science 1998, 279(5358):1896-1902.

Zhang L, Zhou W, Velculescu VE, Kern SE, Hruban RH, Hamilton SR, Vogelstein B, Kinzler KW: Gene expression profiles in normal and cancer cells. Science 1997, 276(5316):1268-1272.

Zhang MQ, Marr TG: A weight array method for splicing signal analysis. Comput Appl Biosci 1993, 9(5):499-509.

Zheng J, Wu J, Sun Z: An approach to identify overrepresented cis-elements in related sequences. Nucleic Acids Res 2003, 31(7):1995-2005.

Zhou L, Steller H: Distinct pathways mediate UV-induced apoptosis in Drosophila embryos. Dev Cell 2003, 4(4):599-605.

Zhou Q, Liu JS: Modeling within-motif dependence for transcription factor binding site predictions. Bioinformatics 2004, 20(6):909-916.

BIOGRAPHICAL SKETCH

Hailong Meng was born in the People's Republic of China in 1973. He is currently a Ph.D. candidate at the University of Florida, majoring in computer and information science and engineering. He received a B.S. in biochemical engineering from Zhejiang University in 1996 and an M.S. in biochemical engineering from the Graduate School of the Chinese Academy of Sciences in 1999. His current interests lie in the field of bioinformatics, especially in novel algorithm development and biological database management.


Permanent Link: http://ufdc.ufl.edu/UFE0015612/00001

Material Information

Title: A Novel Methodology for Identifying Conserved Regulatory Modules at the Binding Site Level
Physical Description: Mixed Material
Copyright Date: 2008

Record Information

Source Institution: University of Florida
Holding Location: University of Florida
Rights Management: All rights reserved by the source institution and holding location.
System ID: UFE0015612:00001












A NOVEL METHODOLOGY FOR IDENTIFYING CONSERVED REGULATORY
MODULES AT THE BINDING SITE LEVEL














By

HAILONG MENG


A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA


2006




























Copyright 2006

by

Hailong Meng
















To my dearly loved wife, Haoxian, with whom enjoyable and difficult times were shared
throughout this journey, and to my parents, Zhanfu and Xioumei, for their love and
support through my life.
















ACKNOWLEDGMENTS

I would like to express my sincere thanks and appreciation to my advisors, Dr. Arunava Banerjee and Dr. Lei Zhou, for their invaluable guidance, support, and encouragement throughout my graduate research and study at the University of Florida. My accomplishments could not have been achieved without their support.

I would also like to thank my committee members, Drs. Anand Rangarajan, Sam Wu and Tamer Kahveci, for their insightful comments and helpful suggestions.

Finally, my appreciation goes to all my friends at the University of Florida, especially those in Dr. Banerjee's and Dr. Zhou's labs, for their helpful suggestions, discussions and friendship.



















TABLE OF CONTENTS

ACKNOWLEDGMENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

CHAPTER

1 INTRODUCTION

2 LITERATURE REVIEW
    Computational Representation of TFBSs
    TFBSs Databases
    Computational Identification and Statistical Evaluation of TFBSs
        Computational Identification of TFBSs
        Statistical Evaluation of TFBSs
    Cis-Regulatory Modules (CRMs) and Computational Prediction of CRMs
        Cis-Regulatory Modules (CRMs)
        Computational Identification of CRMs
        Cross-species Conservation of Cis-Regulatory Modules (CRMs)
        Computational Prediction of Cis-Regulatory Modules (CRMs) by Cross-species Genome Comparison

3 THE FEASIBILITY STUDY OF CROSS-GENOME COMPARISON AT THE BINDING SITE LEVEL
    Background
    Results
        Simulating Sequence Pairs Harboring a Conserved Module of Binding Sites
        Identifying a Conserved Module by Comparison at the Binding Site Level
        Distribution and Statistical Evaluation of the BLISS score
        Identifying a Conserved Regulatory Module in Distantly Related Species
        Implementation of BLISS as a Web-based Service
    Discussion
    Conclusions
    Methods
        Generating Simulated Sequences
        Binding Site Identification
        P Value for M_score
        Gaussian Smoothing
        Searching the Conserved Module in the Long Sequence
        Large-scale Search of the Simulated Sequences, Statistical Analysis
        Searching for the eve2 Module in D. virilis and D. mojavensis Sequences
        Website Construction

4 BINDING-SITE LEVEL IDENTIFICATION OF SHARED REGULATORY MODULES IN TWO ORTHOLOGOUS SEQUENCES
    Background
    Results
        Simulating Sequence Pairs with Varying Complexity of Conserved Binding Site Patterns
        Identifying Conserved Regulatory Complexes by Comparison at the Binding-site Level
        Testing BLISS2 in Real-world Examples
        Website Implementation of BLISS2
    Discussion
    Conclusions
    Methods
        Generating Simulated Sequences
        Identifying Conserved Regulatory Complexes by Comparison at the Binding-site Level
        Performance Testing Using Simulated Sequences
        Searching for the eve2 Regulatory Complex in D. virilis and D. mojavensis
        Website Construction

5 BLISS 2.0: THE WEB-BASED TOOL FOR PREDICTING CONSERVED REGULATORY MODULES IN DISTANTLY-RELATED ORTHOLOGOUS SEQUENCES

6 CONCLUSIONS AND SUGGESTIONS FOR FUTURE STUDY

APPENDIX

A PARAMETER OPTIMIZATIONS FOR GAUSSIAN SMOOTHING METHOD

B A DYNAMIC PROGRAMMING ALGORITHM FOR IDENTIFYING CONSERVED REGULATORY MODULES

C AN IMPROVED ALGORITHM FOR BLISS2

LIST OF REFERENCES

BIOGRAPHICAL SKETCH



















LIST OF TABLES

2-1. IUPAC consensus characters.

2-2. Statistics of TRANSFAC.

2-3. Matrices in TRANSFAC Professional r9.1.

4-1. Simulated data sets.

4-2. Performance test of BLISS2 for M_score cutoff = 0.75.

4-3. Performance test of BLISS2 for M_score cutoff = 0.8.

C-1. Performance test of the improved BLISS2 for M_score cutoff = 0.75.

C-2. Performance test of the improved BLISS2 for M_score cutoff = 0.8.

C-3. Performance comparison of two BLISS2 algorithms for second data set when M_score cutoff is 0.75.


















LIST OF FIGURES

1-1. Transcription and translation.

1-2. Transcription is controlled by transcription factors.

2-1. Frequency matrix of transcription factor "v-Myb" from TRANSFAC.

2-2. The known TFBSs in S2E.

2-3. S2E of D. Mel and D. Pse alignment using CLUSTALW.

3-1. Simulating conserved regulatory modules in a pair of random sequences.

3-2. Gaussian smoothing of the TFBS profile.

3-3. Identifying a conserved module at the binding site level.

3-4. Identifying the eve S2E module in distantly related species.

3-5. Inter-TFBSs distances are very well conserved within each clause of the S2E module.

3-6. Searching for conserved individual clauses / element modules using BLISS.

3-7. BLISS output of the contributing TFBSs.

4-1. The effects of length of analyzed sequences on the performance of BLISS2.

4-2. Regulatory complexes in S2E of D.mela and D.pseu.

4-3. Identification S2E in 6 kb D.viri sequence using the S2E of D.pseu.

4-4. Regulatory complexes of S2E in D.viri and D.moja.

5-1. Web interface of BLISS 2.0. This is the homepage of BLISS 2.0.

5-2. The color plot of BLISS_score.

5-3. Statistical analysis and distribution of BLISS_score.

5-4. Output of contributing TFBSs.

A-1. Parameter optimizations for Gaussian smoothing method used in BLISS.

C-1. Distribution of BLISS_score for the improved BLISS2 algorithm with the M_score cutoff value of 0.75.

C-2. Distribution of BLISS_score for the improved BLISS2 algorithm with the M_score cutoff value of 0.8.
















Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy

A NOVEL METHODOLOGY FOR IDENTIFYING CONSERVED REGULATORY
MODULES AT THE BINDING SITE LEVEL

By

Hailong Meng

August 2006

Chair: Arunava Banerjee
Cochair: Lei Zhou
Major Department: Computer and Information Science and Engineering

Regulatory modules are segments of DNA that control particular aspects of gene expression. Their identification is therefore of great importance to the field of molecular genetics. Each module is composed of a distinct set of binding sites for specific transcription factors. Since experimental identification of regulatory modules is an arduous process, accurate computational techniques that supplement this process can be very beneficial. Functional modules are under selective pressure to be evolutionarily conserved. Most current approaches therefore attempt to detect conserved regulatory modules through similarity comparisons at the DNA sequence level. However, some regulatory modules, despite the conservation of their responsible binding sites, are embedded in sequences that have little overall similarity.

We hypothesize that it will be more efficient and make more biological sense to perform the cross-genome comparison at the binding site level, since the evolutionary pressure is on the binding site patterns that determine gene regulation. In this study, we developed a novel methodology that detects conserved regulatory modules via comparisons at the binding site level. The methodology compares the binding site profiles of orthologs and identifies those segments that have similar (not necessarily identical) profiles. The similarity measure is based on transformed profiles, which take into consideration the p values of binding sites as well as the potential shift of binding site positions. Furthermore, statistical analysis was performed to evaluate the accuracy of the similarity measure. We used simulated sequence pairs to optimize the methodology, which was then verified on real-world sequences.

Our research demonstrates that, for sequences with little overall similarity at the DNA sequence level, it is possible to identify conserved regulatory modules based solely on binding site profiles. This methodology has been implemented as a web-based tool, BLISS, for the research community.















CHAPTER 1
INTRODUCTION

One of the central processes in the life of a cell is the expression of the information encoded in its DNA. The initial stage of this expression is the synthesis of messenger RNA (mRNA) from the DNA template, a process referred to as transcription (Figure 1-1).

[Figure: DNA -> (transcription) -> mRNA -> (translation) -> protein]

Figure 1-1. Transcription and translation.

One of the major challenges in molecular genetics is to understand the precise mechanism by which gene expression is regulated. The regulation of gene expression at the level of transcription is a very complex process and is controlled by DNA-binding proteins called transcription factors (TFs). As shown in Figure 1-2, the initiation of transcription begins with the binding of the RNA polymerase II complex to the promoter, a region immediately upstream of the first exon of the gene. The level (i.e., rate) of transcription is controlled by binding between TFs and their corresponding short DNA sequences, the transcription factor binding sites (TFBSs). The size of a typical binding site ranges between 5 and 25 base pairs (bp) (Boulikas 1994; Hernandez-Munain and Krangel 1994). The binding of a single TF to a TFBS can affect the rate of transcription; in higher eukaryotes, however, it often takes multiple TFBSs grouped together as a regulatory module to determine the precise temporal and spatial expression of the genes. Despite the importance of regulatory modules, our ability to identify and characterize them is still very limited (Qiu 2003).

[Figure labels: mRNA; exon; intron; promoter region; transcription factor; RNA polymerase II]

Figure 1-2. Transcription is controlled by transcription factors.

While laboratory identification of regulatory modules is feasible (Galas and Schmitz 1978; Fried and Crothers 1981; Garner and Revzin 1981), the process is in general expensive, time-consuming and arduous. Accurate computational approaches therefore promise to be very beneficial as a supplement to traditional laboratory experiments, since they can lead biologists to a more efficient determination of the regulatory elements.

Functional regulatory modules are under selective pressure and tend to be evolutionarily conserved. Most current approaches therefore attempt to detect conserved regulatory modules through similarity comparisons at the DNA sequence level. However, some regulatory modules, despite the conservation of their responsible binding sites, are embedded in sequences that have little overall similarity. It therefore makes biological sense to perform the cross-genome comparison at the binding site level, since the evolutionary pressure is directed at the binding site level.

The objectives of this thesis are:

1. To test the feasibility of cross-genome comparison at the binding site level.

2. To develop a novel methodology that detects conserved regulatory modules via comparisons at the binding site level.

3. To implement the methodology as a web-based tool that can be publicly accessed by the research community.

The remaining chapters are organized as follows:

Chapter 2 conducts a literature review, in which background material and related

research in the field are presented.

In chapter 3,we demonstrate that it is feasible to perform the cross-species

comparison at the binding site level and that conserved regulatory modules can be

predicted in diverged sequences without detectable DNA sequence similarity. However,

this comes with the limitation that we know that a conserved module exists in the shorter

of the two sequences being compared.

In Chapter 4, building on the results of Chapter 3, we extend the methodology

to a more general case, i.e., two sequences with no prior knowledge of the

locations of the regulatory modules.

In Chapter 5, we focus on BLISS, the web implementation of this methodology.

Chapter 6 presents conclusions and suggestions for future research.















CHAPTER 2
LITERATURE REVIEW

Computational Representation of TFBSs

Most eukaryotic TFs can bind to multiple binding sites and tolerate considerable

sequence variation in their binding sites. The computational analysis of regulatory

regions in genome sequences is based on the prediction of potential TFBSs, which

depends on the quality of the models that represent the binding specificity of the

transcription factors. These models are based upon a statistical representation of

experimentally determined binding sites for a particular TF.

The IUPAC consensus code was first used to represent this degenerate feature of TFBSs

by Day and McMorris (1992). They used a string such as RCCY to represent the

set of strings ACCT, ACCC, GCCT, and GCCC. IUPAC consensus characters

are listed in Table 2-1.

Table 2-1. IUPAC consensus characters.
IUPAC consensus character in the motif    Matching base in the DNA sequence
A                                         A
T                                         T
G                                         G
C                                         C
R                                         A or G
Y                                         T or C
W                                         A or T
S                                         G or C
M                                         A or C
K                                         G or T
B                                         T or G or C
H                                         A or T or C
V                                         A or G or C
D                                         A or G or T
N                                         A or T or G or C
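
As an illustration of how a consensus string stands for a set of concrete sites, the following minimal Python sketch encodes Table 2-1 as a lookup table. The function names `expand` and `matches` are hypothetical and chosen only for this example.

```python
from itertools import product

# Table 2-1 as a lookup: IUPAC character -> bases it matches.
IUPAC = {
    "A": "A", "T": "T", "G": "G", "C": "C",
    "R": "AG", "Y": "TC", "W": "AT", "S": "GC", "M": "AC", "K": "GT",
    "B": "TGC", "H": "ATC", "V": "AGC", "D": "AGT", "N": "ATGC",
}


def expand(consensus):
    """Return every concrete DNA string represented by an IUPAC consensus."""
    return ["".join(p) for p in product(*(IUPAC[c] for c in consensus))]


def matches(consensus, site):
    """True if `site` is one of the strings the consensus represents."""
    return len(site) == len(consensus) and all(
        base in IUPAC[char] for char, base in zip(consensus, site)
    )


print(expand("RCCY"))  # the four strings represented by RCCY
```

Note that `matches` returns only yes or no: a consensus cannot say that one of the represented strings is observed more often than another.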










However, the disadvantage of an IUPAC consensus sequence is that it does not take

into account the frequency at which each nucleotide occurs at each position of a TFBS

and therefore cannot represent a TFBS quantitatively. It disregards too much information

originally present in the set of TFBSs (Quandt et al. 1995).

Compared to IUPAC, matrix based representations (including frequency matrices

and Position Weight Matrices, PWMs) retain more information and are quantitative

representations of TFBSs. In this case, the binding specificity of each position is

described by the elements of the matrix. Comparative analysis (Osada et al. 2004) has

shown that matrices are a much better way to represent TFBSs. Each matrix is of size 4 ×

k, where the rows correspond to each of the nucleotides and the columns represent the

base preference for each position of the binding site. For a frequency matrix, each column

gives the nucleotide distribution at that position. For instance, Figure 2-1 is the frequency

matrix of transcription factor "v-Myb" from TRANSFAC Professional r9.1 (Matys et al.

2003), which was obtained from 24 experimentally determined binding sites. Each

element in the matrix represents how many times a specific nucleotide was observed at a

certain position. The figure prints the matrix transposed, with one row per position; its last

column gives the IUPAC consensus sequence for the transcription factor.

Position   A    C    G    T    Consensus
01         13   4    4    3    A
02         14   3    5    2    A
03         2    7    1    14   T
04         23   0    0    1    A
05         24   0    0    0    A
06         1    23   0    0    C
07         0    1    17   6    G
08         1    1    21   1    G
09         11   5    4    4    N
10         12   5    2    5    A



Figure 2-1. Frequency matrix of transcription factor "v-Myb" from TRANSFAC.









The PWM is derived from this frequency matrix by estimating the base preference at

each position from the nucleotide distribution (Bucher 1990; Stormo

2000). The quality of the matrix depends on the experimentally determined target

sites. To date, most matrices have been built from binding sites found through in vitro

target-site detection assays. However, it was recently reported that in vivo functional TFBSs

can be detected by the ChIP-chip (chromatin immunoprecipitation coupled with microarray

chip hybridization) technique (Lee et al. 2002), which will greatly enhance both the

quality and the coverage of these matrices.
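
One common way to turn a frequency matrix into a PWM is a log-odds transform. The sketch below applies it to the v-Myb counts of Figure 2-1. The pseudocount of 1 and the uniform background of 0.25 are assumptions made for this illustration, and the function names are hypothetical; the cited derivations may differ in these details.

```python
import math

BASES = "ACGT"

# Counts from Figure 2-1 (v-Myb): one row per position, columns A, C, G, T.
VMYB_COUNTS = [
    [13, 4, 4, 3], [14, 3, 5, 2], [2, 7, 1, 14], [23, 0, 0, 1], [24, 0, 0, 0],
    [1, 23, 0, 0], [0, 1, 17, 6], [1, 1, 21, 1], [11, 5, 4, 4], [12, 5, 2, 5],
]


def log_odds_pwm(counts, pseudocount=1.0, background=0.25):
    """Convert counts to log2(observed frequency / background frequency).

    The pseudocount (an assumption of this sketch) keeps zero counts finite.
    """
    pwm = []
    for row in counts:
        total = sum(row) + 4 * pseudocount
        pwm.append(
            [math.log2((c + pseudocount) / total / background) for c in row]
        )
    return pwm


def score(pwm, site):
    """Sum the per-position weights of the bases found in `site`."""
    return sum(pwm[i][BASES.index(base)] for i, base in enumerate(site))


pwm = log_odds_pwm(VMYB_COUNTS)
print(score(pwm, "AATAACGGAA"), score(pwm, "CCCCCCCCCC"))
```

A site built from the most frequent base at every position receives a high positive score, while a site of rarely observed bases scores negative.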

TFBSs Databases

TRANSFAC, TRANSCOMPEL, TRRD and JASPAR are databases that store

information about transcriptional regulation in eukaryotic cells.

TRANSFAC (Knuppel et al. 1994; Wingender et al. 1996; Wingender et al. 2000;

Matys et al. 2003) is the leading database that includes Transcription Factors, their target

genes and Transcription Factor Binding Sites, matrices of TFBSs and other information

about transcriptional regulation. TRANSFAC represents the most comprehensive

collection of TFBSs and summarizes them as matrices. All information collected in

TRANSFAC is generated from published literature. Its content has been expanding and

improving over the last ten years as shown in Table 2-2.

Table 2-2. Statistics of TRANSFAC.
            TRANSFAC    TRANSFAC    TRANSFAC 6.0    TRANSFAC r9.1
            1996        2000        2003            2005
Matrix      169         356         336             753
FACTOR      1544        2765        4219            6017
SITE        4304        8390        6627            15244
GENE        -           1302        1755            4649
REFERENCE   3130        6570        -               11126











The newly released TRANSFAC Professional r9.1 has a total of 753 matrices for

bacteria, fungi, insects, nematodes, plants and vertebrates, as indicated in Table 2-3.

Table 2-3. Matrices in TRANSFAC Professional r9.1.
              Matrix Number    Matrix Number Without Duplication
Bacteria      1                1
Fungi         56               44
Insects       50               40
Nematodes     7                5
Plants        84               66
Vertebrates   555              379
Total         753              535

TRANSFAC contains redundant sets of matrices of diverse quality for some

TFBSs. It is a commercial resource and only an older version of TRANSFAC can be

accessed by the public for free.

JASPAR (Sandelin et al. 2004) is an open-access database whose data can be

used without restriction. All matrices in JASPAR were derived from published

collections of experimentally determined TFBSs. JASPAR claims to have a

non-redundant collection of reliable matrices.

TRANSCOMPEL (Kel-Margoulis et al. 2000; Kel-Margoulis et al. 2002) is a sister

database of TRANSFAC and includes information about composite regulatory elements.

Each composite regulatory element contains two closely situated binding sites of distinct

transcription factors. There are two types of composite regulatory elements: synergistic

and antagonistic. A synergistic composite regulatory element results in transcriptional

activation, while in an antagonistic composite regulatory element, the two factors

interfere with each other and repress transcription. However, the usefulness of

TRANSCOMPEL is limited by its small size.










Transcription Regulatory Regions Database (TRRD) (Kel et al. 1997; Heinemeyer

et al. 1998; Kolchanov et al. 1999; Kolchanov et al. 2000) describes the structure of

eukaryotic transcription regulatory regions. Each TRRD entry corresponds to an entire

gene and binding sites are listed at the lowest level of the hierarchy. However, this

database is seldom referred to by current publications.

Computational Identification and Statistical Evaluation of TFBSs

Computational Identification of TFBSs

Computational identification of TFBSs is essential for decoding gene regulatory

networks. Although some earlier methods of predicting TFBSs were based on

consensus sequences (Kondrakhin et al. 1995; Prestridge 1995), most recently developed

methods are based on matrices (Frech et al. 1997).

Match is a weight-based tool to search for potential transcription factor binding

sites in DNA sequences. Match is associated with TRANSFAC and it uses the matrix

library collected in TRANSFAC to search for TFBSs. Because TFBSs may be detectable

on either the forward or the reverse strand, Match searches both strands of the input DNA

sequence for TFBSs. The algorithm uses two scores: the matrix similarity score (MSS)

and the core similarity score (CSS). Both scores measure the similarity between the

segments of the sequences under consideration and the TFBS matrices. MSS is calculated

using all positions of the matrix while CSS is calculated using the five core (most

conserved) positions in the matrices only. Both MSS and CSS range from 0 to 1 where 1

means an exact match. MSS is calculated as follows (Kel et al. 2003):

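$$\mathrm{MSS} = \frac{\mathrm{Current} - \mathrm{Min}}{\mathrm{Max} - \mathrm{Min}}$$

$$\mathrm{Current} = \sum_{i=1}^{L} I(i)\, f_{i,b_i}$$

$$\mathrm{Min} = \sum_{i=1}^{L} I(i)\, f_i^{\min}$$

$$\mathrm{Max} = \sum_{i=1}^{L} I(i)\, f_i^{\max}$$

$$I(i) = \sum_{B \in \{A,T,G,C\}} f_{i,B} \ln\left(4 f_{i,B}\right), \quad i = 1, \ldots, L$$

where $f_{i,B}$ is the frequency of nucleotide B occurring at position i of the matrix

($B \in \{A, T, G, C\}$), $b_i$ is the nucleotide at position i of the sequence segment being scored,

$L$ is the number of positions in the matrix, $f_i^{\min}$ is the frequency of the nucleotide that is the lowest in position i,

and $f_i^{\max}$ is the frequency of the nucleotide that is the highest in position i of the matrix.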

The information vector I(i) reflects the extent of the conservation of position i in the

matrix. The more conserved the position i is, the greater is the information vector value

of that position. In this manner, for a highly conserved position, a mismatch will receive a

large penalty while a match will score high. On the other hand, less conserved positions

contribute less than conserved positions whether they be a match or a mismatch. This

leads to better recognition of TFBSs compared to methods that do not use an information

vector.
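
The MSS definition above can be sketched in a few lines of Python. This is an illustrative reading of the formulas applied to the v-Myb counts of Figure 2-1, not the Match implementation itself; the function names and the threshold used in the example are hypothetical.

```python
import math

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Counts from Figure 2-1 (v-Myb): one row per position, columns A, C, G, T.
VMYB_COUNTS = [
    [13, 4, 4, 3], [14, 3, 5, 2], [2, 7, 1, 14], [23, 0, 0, 1], [24, 0, 0, 0],
    [1, 23, 0, 0], [0, 1, 17, 6], [1, 1, 21, 1], [11, 5, 4, 4], [12, 5, 2, 5],
]


def to_freqs(counts):
    """Normalize each position's counts to frequencies."""
    return [[c / sum(row) for c in row] for row in counts]


def information_vector(freqs):
    """I(i) = sum over B of f_iB * ln(4 * f_iB); zero frequencies add 0."""
    return [sum(f * math.log(4 * f) for f in row if f > 0) for row in freqs]


def mss(freqs, site):
    """Matrix similarity score: (Current - Min) / (Max - Min)."""
    info = information_vector(freqs)
    current = sum(
        w * row[BASES.index(b)] for w, row, b in zip(info, freqs, site)
    )
    lowest = sum(w * min(row) for w, row in zip(info, freqs))
    highest = sum(w * max(row) for w, row in zip(info, freqs))
    return (current - lowest) / (highest - lowest)


def scan(freqs, seq, threshold):
    """Score every window on both strands; keep hits at or above threshold."""
    k = len(freqs)
    hits = []
    for start in range(len(seq) - k + 1):
        window = seq[start:start + k]
        reverse = window.translate(COMPLEMENT)[::-1]
        for strand, site in (("+", window), ("-", reverse)):
            s = mss(freqs, site)
            if s >= threshold:
                hits.append((start, strand, round(s, 3)))
    return hits


freqs = to_freqs(VMYB_COUNTS)
print(scan(freqs, "GGGAATAACGGAACCC", threshold=0.99))
```

Because each term is weighted by I(i), a mismatch at the nearly invariant positions 04 to 06 lowers the score far more than a mismatch at position 09, where the counts are close to uniform.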

Stormo (1998, 2000) and Stormo and Fields (1998) pointed out that logarithms of

the base frequencies should be proportional to the binding energy and the information

vector is therefore related to the average binding energy between transcription factors and

binding sites. Hence, the Match score is not just a statistical score but is directly related to

the protein-DNA binding energy.









Match requires users to choose a score threshold, which is based on a balance

between specificity and sensitivity. When a high threshold is used, noise is reduced from

the output and the false positives decrease. However, at the same time, some functional

binding sites with low scores are ignored. Increased specificity causes decreased

sensitivity. On the other hand, when a low threshold is used, both false matches and true

binding sites are reported. Increased sensitivity causes decreased specificity. Hence, an

ideal threshold should restrict false positives without losing too many true positives.

One assumption of methods based on matrices is that positions in the binding site

contribute additively and independently to the total activity. However, recent biological

experiments suggest that dependence exists among positions in TFBSs (Benos et al.

2002; Bulyk et al. 2002). More complicated models (Stormo et al. 1986; Lawrence and

Reilly 1990; Lawrence et al. 1993; Zhang and Marr 1993; Zhou and Liu 2004;

Gershenzon et al. 2005) are being considered. For instance, dinucleotides can be used as

the elements of the matrix instead of single nucleotides. The limitation of these

methods, however, is that they need more prior information and more training binding sites, which

are currently not easy to obtain. These approaches are therefore rarely used.

Other available software to identify TFBSs includes MatInspector (Quandt et al.

1995), ConsInspector (Frech et al. 1993), FUNSITE (Kel et al. 1995), TFBIND (Tsunoda

and Takagi 1999), SIGNAL SCAN (Prestridge 1996), MATRIX SEARCH (Chen et al.

1995), SeqVISTA (Hu et al. 2003; Hu et al. 2004), RSAT (van Helden 2003), BEST (Che

et al. 2005) and P-Match (Chekmenev et al. 2005). The primary problem with existing

methods for predicting TFBSs is that they tend to generate an unacceptably large number of

false positives and only a very small fraction of the predicted TFBSs are functional ones.









Because of the large number of false positives, these methods are of little practical use for

predicting TFBSs in vivo.

Statistical Evaluation of TFBSs

A fundamental challenge for TFBS-searching methods is how to reduce false

positives and improve prediction accuracy. One reason for false positives is that

short sequences, such as those matching short TFBSs, occur frequently by chance. Estimating

the statistical significance of an identified match is valuable and may improve the

performance of TFBS-searching methods. To calculate statistical significance, a

background model is necessary. The background model tells us how likely

it is to see a score that is at least as good by chance. Formally, the p-value of a score is

defined as the probability of obtaining such a score or higher from the background model.

A simple way of estimating the p-value of a score is via naive approximation. A

large pool of subsequences is sampled from the background distribution, and the score of

each subsequence is calculated based on the matrix. The p-value is estimated by

calculating the fraction of samples whose scores are at least as high as that score. The primary

drawback of naive approximation is that it needs a huge number of samples to accurately

estimate the p-value. Normally, the number of samples should be two orders of

magnitude larger than the inverse of the actual p-value. If there are not enough samples,

small p-values will be poorly estimated. The number of available samples and the time complexity are the

major drawbacks of the naive approximation method.
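
The sampling procedure just described can be sketched as follows. This is a minimal illustration under stated assumptions: the background is an IID base distribution (uniform by default), and the scoring function is a toy stand-in that counts matches to a target string. The names `naive_p_value` and `matches_to` are hypothetical.

```python
import random


def naive_p_value(score_fn, observed, length, n_samples=100_000,
                  probs=(0.25, 0.25, 0.25, 0.25), seed=0):
    """Estimate P(score >= observed) by sampling random background sites."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        site = "".join(rng.choices("ACGT", weights=probs, k=length))
        if score_fn(site) >= observed:
            hits += 1
    return hits / n_samples


def matches_to(target):
    """Toy scoring function: number of positions identical to `target`."""
    return lambda site: sum(a == b for a, b in zip(site, target))


# The exact p-value of a perfect 4-base match is 1/256 (about 0.0039).
print(naive_p_value(matches_to("ACGT"), observed=4, length=4))
```

For a 10-position matrix the smallest possible p-value under a uniform background is 4^-10, roughly 1e-6, so by the rule of thumb above about 1e8 samples would be needed to estimate it reliably. That cost is the drawback the text refers to.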

An independent and identically distributed (IID) background sequence model (Staden

1989) can be used to evaluate the p-value. The position p-value of MAST (Motif Alignment

and Search Tool) (Bailey and Gribskov 1998; Bailey and Gribskov 2000) is defined as

the probability of observing such a match score or higher at a single, randomly chosen









position in a random sequence, and it is obtained by calculating the cumulative distribution

function based on the null hypothesis that each position in the sequence is independent

and identically distributed (IID).
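
Under the IID assumption the score distribution can be computed exactly rather than sampled, because the scores of the individual positions are independent and can be combined one position at a time. The sketch below assumes integer-valued scores (here the raw counts of Figure 2-1 serve as scores) and is not the MAST implementation; the function names are hypothetical.

```python
from collections import defaultdict

# Counts from Figure 2-1 (v-Myb), used here directly as integer scores.
VMYB_COUNTS = [
    [13, 4, 4, 3], [14, 3, 5, 2], [2, 7, 1, 14], [23, 0, 0, 1], [24, 0, 0, 0],
    [1, 23, 0, 0], [0, 1, 17, 6], [1, 1, 21, 1], [11, 5, 4, 4], [12, 5, 2, 5],
]


def score_distribution(int_matrix, probs=(0.25, 0.25, 0.25, 0.25)):
    """Exact distribution of the total score of a random IID site."""
    dist = {0: 1.0}
    for row in int_matrix:  # add one matrix position at a time
        nxt = defaultdict(float)
        for total, p in dist.items():
            for s, q in zip(row, probs):
                nxt[total + s] += p * q
        dist = dict(nxt)
    return dist


def p_value(int_matrix, observed, probs=(0.25, 0.25, 0.25, 0.25)):
    """P(score >= observed) at one random position of an IID sequence."""
    dist = score_distribution(int_matrix, probs)
    return sum(p for s, p in dist.items() if s >= observed)


best = sum(max(row) for row in VMYB_COUNTS)
print(best, p_value(VMYB_COUNTS, best))
```

The work grows with the number of positions times the number of distinct score totals, so very small p-values are obtained without the sampling cost of the naive approach.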

It has been reported that neighboring nucleotide compositions can affect the

binding between a transcription factor and its binding site. The Local Markov Method (LMM)

(Huang et al. 2004) incorporated the local genomic context into the p-value-based scoring

method and modeled the background sequences as a Markov chain. The p-value for a

candidate site in this method simultaneously reflected both its similarity to the matrix and

its difference from the local genomic context. The authors of LMM claimed that, compared to other

current methods, it could reduce false positive errors by more than 50% without

compromising sensitivity. Similarly, Thijs et al. (2001) use a higher-order

background model to enhance the performance of their motif-finding algorithm.
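
To make the idea of a Markov-chain background concrete, the sketch below estimates a first-order chain from a stretch of local sequence and evaluates how probable a candidate site is under it. It is a generic illustration, not the LMM algorithm or the model of Thijs et al.; the pseudocount, the uniform probability of the first base, and the function names are assumptions of this example.

```python
import math
from collections import Counter


def train_markov1(seq, pseudocount=1.0):
    """Estimate first-order transition probabilities P(next | current)."""
    pairs = Counter(zip(seq, seq[1:]))
    trans = {}
    for a in "ACGT":
        total = sum(pairs[(a, b)] for b in "ACGT") + 4 * pseudocount
        trans[a] = {b: (pairs[(a, b)] + pseudocount) / total for b in "ACGT"}
    return trans


def log_prob(site, trans, first=0.25):
    """Log-probability of `site` under the chain (first base uniform)."""
    lp = math.log(first)
    for a, b in zip(site, site[1:]):
        lp += math.log(trans[a][b])
    return lp


# In an AT-rich context an AT-rich candidate is unsurprising background,
# while a GC-rich candidate stands out from its surroundings.
context = "AT" * 50
model = train_markov1(context)
print(log_prob("ATAT", model), log_prob("GCGC", model))
```

A candidate that scores well against the matrix but is also highly probable under the local background is less convincing than one that differs from its surroundings, which is the intuition behind context-aware p-values.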

Although statistical analysis has been integrated into predictions of TFBSs, large

numbers of false positives still exist and accurate identification of TFBSs is hard to

achieve.

Cis-Regulatory Modules (CRMs) and Computational Prediction of CRMs

Cis-Regulatory Modules (CRMs)

Genes are rarely regulated by a single TF. Abundant evidence shows that TFBSs

occur in clusters rather than in isolation (Maniatis et al. 1987; Johnson and McKnight

1989; Shaw 1992; Wang et al. 1993; Tjian and Maniatis 1994). Each gene has a unique

set of TFs that are responsible for its spatial and temporal expression. Transcription is

initiated by the binding of RNA Polymerase II to the promoter region, but the

expression level of the gene is controlled by the specific set of TFs. The binding of a TF









to its TFBS may enhance or repress the binding of other TFs to their TFBSs (Weintraub

et al. 1990; Fickett 1996; Muhlethaler-Mottet et al. 1998).

In higher eukaryotes, the regulatory information is organized into modular units

which contain multiple TFBSs over a few hundred base pairs of DNA sequence. Such a

functional unit is often referred to as a Cis-Regulatory Module (CRM; we also use the term

regulatory module interchangeably). Various CRMs work together to provide

the combinatorial regulation of gene transcription in response to different conditions

(Yuh et al. 1998). Genes are typically controlled by several CRMs, which are generally

located within a few hundred base pairs from the promoter region, but may occur

thousands or even tens of thousands of base pairs either downstream or upstream from

the transcription start site.

TFBSs in a CRM are co-localized at specific distances from one another on the genome. The

binding sites within a CRM may overlap (Lewis et al. 1994), lie

adjacent to each other (Whitley et al. 1994) or be dozens of base pairs apart

(Lemaigree et al. 1990). Considering the complexity of the identification of CRMs, some

groups (Hannenhalli and Levy 2002) have started to work on TF pairs called composite

regulatory elements. Kel et al. (1995) investigated 101 composite regulatory

elements and found that the distance between the binding sites varies between complete

overlap and 80 bp. The mean distance was found to be around 17 bp. Qiu et al.

(2002) analyzed all the entries of composite elements in the TRANSCOMPEL database

(version 3.0) and found that in approximately 87% of the composite elements the two sites were

within 50 bp of each other and in about 65% they were within 20 bp.









Transcriptional regulation is mediated by CRMs involving both protein-DNA and

protein-protein interactions. The very high false-positive rate of current TFBS-prediction

methods is most likely a result of attempting to predict TFBSs based only on protein-

DNA interactions between TFs and TFBSs while ignoring their context, such as interactions

among TFs.

Computational Identification of CRMs

It has been reported that a high local density of TFBSs accounts for the specificity of

their activities and is required for functional CRMs; this observation has been used as the basis for

identifying novel CRMs (Frith et al. 2001; Berman et al. 2002; Halfon and Michelson

2002; Rajewsky et al. 2002; Rebeiz et al. 2002; Johansson et al. 2003; Berman et al.

2004). These methods define a cluster as a DNA window of a given size that contains at least a

given number of known TFBSs, and then search for such clusters in the given DNA

sequence to predict regulatory regions. These methods, however, need prior knowledge

of TFBSs, which is not available in general.
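
The window-based cluster definition can be illustrated with a simple sliding-window pass over the predicted site positions. This is a generic sketch, not the algorithm of any of the cited tools; the function name `find_clusters`, the 200-bp window, and the minimum of three sites are hypothetical choices for the example.

```python
def find_clusters(site_positions, window=200, min_sites=3):
    """Return (first_pos, last_pos, count) for each run of sites that fits
    inside `window` base pairs and holds at least `min_sites` sites."""
    positions = sorted(site_positions)
    clusters = []
    left = 0
    for right, pos in enumerate(positions):
        # Shrink from the left until the span is shorter than the window.
        while pos - positions[left] >= window:
            left += 1
        count = right - left + 1
        if count >= min_sites:
            clusters.append((positions[left], pos, count))
    return clusters


sites = [10, 50, 120, 900, 2000, 2050, 2100, 2150]
print(find_clusters(sites))
```

The sketch reports every qualifying window ending at a site, so overlapping windows over the same dense region appear more than once; a real tool would typically merge them into one predicted region.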

With the development of microarray technology, genes can be classified into

different clusters based on their expression pattern. Genes in the same cluster are

assumed to be co-regulated. Co-regulated genes are hypothesized to share common Cis-

Regulatory Modules (CRMs), since they share the same regulatory response. It has been

observed that functional TFBSs often appear several times in the regulatory region so that

occurrences of relevant TFBSs are significantly more than expected by chance. Hence,

methods based on the comparison of co-regulated genes search for over-represented

motifs in the collection of regulatory regions (Bailey and Elkan 1995; Workman and

Stormo 2000; Liu et al. 2001; Thijs et al. 2002; Elkon et al. 2003; Zheng et al. 2003; Cora

et al. 2004; Grad et al. 2004). A high frequency of false positives is also a major problem

with these methods. Furthermore, these methods require a reliable set of co-regulated

genes, yet it is possible that the expression of one gene is triggered by the

expression of another gene in the same cluster, so that the two are not really co-regulated.

Some groups have tried to overcome this problem by annotating genes to specific Gene

Ontology (GO) terms (Cora et al. 2004).

Cross-species Conservation of Cis-Regulatory Modules (CRMs)

Functional sequences including coding and functional noncoding sequences tend to

evolve at a slower rate than non-functional sequences. Noncoding sequences that play an

important role in regulation are assumed to survive natural selection for longer periods of

time. It has been shown that CRMs are conserved across species (Frazer et al. 2003;

Ludwig et al. 2005).


[Figure 2-2: schematic of the S2E in four species, one row each (mel, yak, ere, pse), with a symbol legend for Kr, Hb, Bcd, Gt, and Slp sites.]

Figure 2-2. The known TFBSs in S2E: bicoid (circle), hunchback (ovals), Kruppel
(squares), giant (rectangles), and sloppy-paired (triangle). Symbols
representing sites 100% conserved compared to D. mel are green, while
those diverged are shaded orange.











The early Drosophila embryo is a paradigmatic model for studying transcriptional


control of development (Arnosti 2002). Exhaustive genetic analysis has been applied to


this model. Even-skipped (eve), an important developmental gene in Drosophila,


produces seven transverse stripes in the blastoderm embryo. The stripe 2 enhancer (S2E)


is the best studied one and includes multiple TFBSs of five TFs, the activators bicoid


(bcd) and hunchback (hb), and the repressors giant (gt), Kruppel (kr), and sloppy-paired


(slp) (Small et al. 1992; Arnosti et al. 1996; Andrioli et al. 2002). As shown in Figure 2-2,


experimental investigations suggest that S2E is evolutionarily conserved among D.


melanogaster, D. yakuba, D. erecta and D. pseudoobscura. D. yakuba and D. erecta are


separated by approximately 10-12 million years from D. melanogaster, and D.


pseudoobscura split from D. melanogaster about 40-60 million years ago (Ludwig et al.


1998; Ludwig 2002; Ludwig et al. 2005). There is no detectable difference in the spatial


or temporal control of gene expression among these four species.

Mel AATATAACCCAATAATTTGAAGTAACT------------ GGCAGGAGCGAGGTAT--- -- 43
PSE AATATAACCCAATAATTTTAACTAACTCGCAATGGACAGGCTAGATGGAG 60
****************** ** ***** ***** www **

Mel --CCTTCCTGGT---------TACCCGGTACTG-----CATAAGA--ACG 82
PSE AGCATTGCAGGAAGGATGCAATACTCGGGAATGGAATGCATAAGGCGACG 120
**ww *w www *** ***wwww *w ********

Mel AA--CCGTAAC- TGGGACAGA----TCGA- ------- -AAAGCTGGCCTGGTTTCTCGC 125
PSE GGTTCCGTTTCGCGAGATAAGGTTCTTTAGTCTTGACGGTTCCTTGACGTC 180
**** www www ** *w ***

Mel TGTGTGTGCC------- GTGTTAATCCGTTTGCCATCAGCGAGATTATTAGTCAATTGC-177
PSE TGTGTGCTCTCTGCTCTGTGTTAATCCGTTTGCCATCAGCATATGCATTC 240
****** *********************** ****************

Mel -------------AGTTGCAGC----GTTTCGCTTTCGTCC---GTCAT- 213
PSE ATATTTCCAGTCGAGTCGCAGTTTTGGTTTCACTTTCCTCCTGATCTCTG 300
*** **** ***** ***** **** * **

Mel -------------------------------------------------CG 216
PSE CCTCATGTGGATGCCGATGCCGATCGGCTTGCCGTTGCCGTTGCCGACCG 360


Mel GTTAGACTTTATTGCAGCATCTTGAACAATCGTC -GCAGTTTGGTAACACGCTG ------ 269
PSE GTTAGATTTTATTGCAGCATCTTGAACAATCAACTGGAATTGACTCGGGC 420


Mel -------------- TGCCATACTTTCATTTAGACGGAATCGAGGGACCCTGGACTAT 315
PSE CTAACCCTGGAGATTGCTCTACTTTCGCCTCAATTGAATCGTAGGAGCGC 480

Figure 2-3. S2E of D. Mel and D. Pse alignment using CLUSTALW.











Mel CGCACAACGAGACC-- GGGTTGC---- GAAGTCAGGGCATTCCGCCGATCTAGCCATCG 368
PSE GGACCCTTGCGACCAAGGGTTGTCTCCTGGCCTCAGGAGTTCAGCAGTTG 540

Mel CCATCTTCTGCGGGCGTTTGTTTGTTTGTTTGCTGGGATTAC-GGTGCTG 427
PSE CTGGTTTGTTTATTTGTTTGTTTGTTT- -TAGCCAGGATTAGCCCGAGGGCTTGACTTGG 598

Mel AATCCAATC---------- ------------------- ------ CTG-ATCCCTAGCCCG 451
PSE AACCCGACCAAAGCCAAGGGCTTTAGGGCATGCTCAAGAGACTTTCTTCT 658

Mel ATCCCAATCCCAATCCCAATCCC- -TTGTCCTTTTCATTAGAAAGTCATAAAA-ACACAT 508
PSE GTCGCGATCCCTAAACCGATCCCATTTGGCAATTTCATTAGATCAACCCT 718

Mel AATAATGA-- TGTCGAAGGGATTAGG--- -- GGCGCGCAGGTCCAGGCAACGCAAT ---T 558
PSE AATAATGAGATGTCGAAGGGATTAAGATTAAGACAAAAGGCAGCAGACAT 778

Mel AACGGACTAGCGAACTGGGTTATTTTTTTGCGCCGA- ---------- CTTAGCCCTGATC 607
PSE AACGGACTAACGGAATCGGGTTATTTTTTGCGCTCAACCGGCACTAGTAC 838

Mel CGCGAGCTTAACCCGTTTTGAGCCGGGCAGCAGGTAG-TGGTGACAGCTT 666
PSE CACAAGCTTAACCCCTTGTGGGCCG- -CGGCAGGTAAATCGATGACCGATGTCGCTTGCG 896

Mel TTTTGGCCGAACCTCCAATCTAACTTGCGCAAGTGGCAAGTGCTTCGCCA 726
PSE CAAGGCCCCTACTACTCCCCTCCCCTCC- CATATGACAACCCACTAACC- -CTGCCCCCG 953

Mel AAGAGGAGGCACTATCCCGGTCCTG- --GTACAGTTGG- TACGCTGGGAATGATTATATC 782
PSE CCCTCTC-- CACCACCACTGTACTGTATGTACAGTTGCCTCCTCTCTGGATATTT 1011

Mel ATCATAATAAATGTTT 798
PSE ATCATAATAAATGTTT 1027


Figure 2-3. Continued


Although similar at the TFBS level and functionally conserved, the S2E sequences of D. Mel


and D. Pse are substantially diverged at the sequence level, as shown in Figure 2-3. One can


observe large insertions or deletions in spacers between TFBSs, nucleotide substitutions in


TFBSs, variation of length of enhancers, etc. Only 3 of 17 known TFBSs in D.


melanogaster are completely conserved among all four species.


Computational Prediction of Cis-Regulatory Modules (CRMs) by Cross-species
Genome Comparison


In contrast to methods based on co-regulated genes, cross-species genome comparison (phylogenetic footprinting) is more


reliable and is capable of identifying CRMs from a single gene of different species, as


long as they are conserved across species. Cross-species genome comparison is based on


the assumption that functional sequences, like regulatory regions, are more conserved









evolutionarily than nonfunctional regions, due to selective evolutionary pressure. Like a

filter, cross-species genome comparison eliminates non-conserved regulatory regions and

thereby increases the selectivity and specificity of CRM prediction.

Evolutionarily conserved noncoding regions across species are a potentially rich

source for the identification of regulatory information of genes and can be found by

sequence alignment tools such as BLAST (Altschul et al. 1990), PIPMaker (Schwartz et

al. 2000), MultiPipMaker (Schwartz et al. 2003), CLUSTAL W (Thompson et al. 1994),

AVID (Bray et al. 2003) and VISTA (Mayor et al. 2000). A large number of methods

have been developed recently to predict CRMs based on cross-species genome

comparison. With complementary data provided by sequence comparison, these methods

have improved the prediction of TFBSs.

ConSite (Lenhard et al. 2003; Sandelin et al. 2004) is a web-based tool for the

detection of transcription factor binding sites, which integrates both cross-species

comparison and binding site prediction generated with matrices. ConSite reports TFBSs

that are in conserved regions and at the same time located as pairs of sites in equivalent

positions in alignments between two orthologous sequences. By incorporating

evolutionary constraints, selectivity is increased compared to approaches based purely on

profile searches of a single genomic sequence. ConSite claims that it reduces the noise level

by ~85% while retaining high sensitivity compared to single sequence analysis.
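
The notion of sites "in equivalent positions" of an alignment can be sketched as follows: a predicted site in one sequence is kept only if a site for the same factor is predicted in the other sequence at the same alignment column. This is a simplified illustration and not the ConSite implementation; the function names and the toy alignment are hypothetical.

```python
def column_index(aligned):
    """Map each ungapped sequence position to its alignment column."""
    return [col for col, ch in enumerate(aligned) if ch != "-"]


def shared_sites(aln_a, aln_b, sites_a, sites_b):
    """Keep sites of sequence A whose factor also has a predicted site in
    sequence B starting at the same alignment column.

    Sites are (factor_name, start) pairs with starts in ungapped coordinates.
    """
    cols_a, cols_b = column_index(aln_a), column_index(aln_b)
    in_b = {(tf, cols_b[pos]) for tf, pos in sites_b}
    return [(tf, pos) for tf, pos in sites_a if (tf, cols_a[pos]) in in_b]


# Toy pairwise alignment: sequence A has a two-base gap.
aln_a = "AC--GT"
aln_b = "ACTTGT"
sites_a = [("bcd", 2), ("hb", 1)]   # ungapped starts in sequence A
sites_b = [("bcd", 4), ("hb", 0)]   # ungapped starts in sequence B
print(shared_sites(aln_a, aln_b, sites_a, sites_b))
```

This filter depends entirely on the alignment: if the two sequences cannot be aligned reliably, conserved sites fall in different columns and are discarded, which is the limitation that motivates comparison at the binding site level.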

Similar to ConSite, rVista (Loots et al. 2002; Loots and Ovcharenko 2004)

combines sequence comparison and TFBS profile based prediction, and attempts to

identify aligned TFBSs from noncoding regions that are evolutionarily conserved. rVista

considers noncoding regions with low DNA similarity as least likely to be biologically









relevant and discards these regions from TFBS searches, which rVista claims is an

effective way to reduce false positives.

TraFAC (Jegga et al. 2002) is a web-based application to detect conserved TFBSs

from a pair of DNA sequences. It defines hits as the density of shared TFBSs within a

200-bp window in evolutionarily conserved non-coding regions. The hit count is plotted

as a function of position.

PhyME (Sinha et al. 2004) is another algorithm based on cross-species comparison.

In addition, it integrates another important feature of a TFBS, overrepresentation, into the

algorithm. Overrepresentation depends on the number of occurrences of the TFBS in

each species. By searching conserved and statistically over-represented TFBSs, PhyME

claims to increase both sensitivity and specificity.

TOUCAN (Aerts et al. 2003; Aerts et al. 2005) is a web-based Java application for

the prediction of cis-regulatory elements from sets of orthologs or co-regulated genes.

Orthologs are aligned and conserved regions are selected as candidate search regions.

Toucan has two TFBS prediction algorithms. One is MotifScanner, which is used to

search known TFBSs based on their matrices. Instead of using a predefined threshold that

is independent of the sequence context, MotifScanner uses a probabilistic model to

estimate the significance of TFBSs. The other is MotifSampler, which is used to search

for new TFBSs based on Gibbs sampling.

Another algorithm (Cora et al. 2004) tries to predict over-represented TFBSs from

sets of orthologs. It focuses on short conserved regulatory regions in mice and humans so

that a suitable signal to noise ratio can be achieved. Statistically over-represented TFBSs

in those regions are searched and their annotations to Gene Ontology (GO) terms are










analyzed. They conjecture that functional TFBSs will be ones that are statistically

over-represented and share the same specific Gene Ontology terms.

CREME (Sharan et al. 2003; Sharan et al. 2004) is another web-based tool for

identifying and visualizing CRMs from a set of potentially co-regulated genes. It uses

rVista 2.0 to find TFBSs that are conserved between genomes. Subsequently, it

enumerates all combinations of these TFBSs that occur within a user-defined window.

These combinations are statistically evaluated and significant combinations are reported.

Compared to other methods, CREME considers the combination of TFBSs and also

evaluates the statistical significance of the predicted CRMs.

Venkatesh and Yap (2005) searched for cis-regulatory elements

from non-coding sequences conserved between fugu and mammals and they believe that

the cis-regulatory elements they found are likely to be shared by all vertebrates

considering the relatively long evolutionary distance between fugu and mammals.

In general, methods based on cross-species genome comparison have improved

prediction of TFBSs by searching for TFBSs only in sequence-conserved regions across

species. However, such methods fail to detect CRMs that have retained their functions

but have diverged in their sequence. The performance of these methods depends on the

evolutionary distances of species, alignment algorithms, and profile models of TFBSs.

Recently, a growing body of evidence (Thanos and Maniatis 1995; Erives and

Levine 2004; Senger et al. 2004) has shown that CRMs might be highly structured and

depend on a number of organizational constraints, such as the distances between TFBSs and

the order and orientation of TFBSs, which may be used as further constraints to find

functional CRMs from a noisy background.















CHAPTER 3
THE FEASIBILITY STUDY OF CROSS-GENOME COMPARISON AT THE
BINDING-SITE LEVEL

Background

Transcription is the fundamental biological process in which a selected region of

DNA is transcribed into RNA by a molecular machinery, a core component of which is

RNA polymerase. For most protein-coding genes, transcription is the intermediate step

via which the information coded in their DNA is "expressed" into functioning proteins.

For others, such as RNA genes, the product of the transcription itself may have biological

function. Even though each cell has the complete set of genes in its chromosomal DNA,

only a portion of the genes are transcribed (expressed) in any particular cell depending on

tissue/cell type, developmental stage, etc. (Zhang et al. 1997). The transcriptome, that is

all of the genes that are selectively transcribed in a cell, determines the function and

morphology of the cell. In addition, the level (i.e., rate) of transcription is often regulated

in response to intra-cellular and extra-cellular environmental factors to achieve cellular

homeostasis. Normal transcriptional regulation, i.e., the right genes being transcribed at

the right times, in the right cell, and at the appropriate rates, is central to almost all

physiological processes. Abnormal regulation of transcription often results in disruption

of development and/or pathological states. For example, aberrant (e.g., abnormally high)

expression of oncogenes leads to hyperplasia and cancer.

A basic element of transcriptional regulation is the interaction of transcription

factors (trans factors) and their corresponding transcription factor binding sites (TFBSs,









also referred to as cis factors) on the DNA. Transcriptional regulation of a gene (e.g.

restricted transcription in a particular cell type, or elevated transcription in response to

UV light) is often mediated through the functional/physical interactions among multiple

transcription factors, each recruited to the proximity of the DNA in part by their selective

affinity to their corresponding binding sites. For example, the even-skipped (eve) gene is

transiently expressed in 7 alternating stripes along the longitudinal axis in the developing

Drosophila melanogaster embryo at the blastoderm stage. Each of the seven stripes is

regulated by a distinct set of transcription factors binding to their corresponding binding

sites located in a DNA segment flanking the even-skipped gene. The most thoroughly

investigated of these segments is the stripe 2 regulatory region, which contains identified

binding sites for 5 different transcription factors in a 700 bp (base pair) to 1 kb (kilobase

pair) DNA region upstream of the transcription initiation site of the eve gene. Evolutionary

comparison of this transcription regulatory module in different Drosophila species has

revealed that most of the binding sites are highly conserved and functional, even though

the underlying DNA sequence has undergone considerable change (Ludwig et al. 1998).

A useful analogy for understanding the composition of DNA regulatory modules is

to consider DNA as a sequence of "Letters" and individual binding sites as "Words".

Then, a functional module of closely associated binding sites can be conceived of as the

"Sentence" command embedded in the DNA sequence that guides transcription. The

problems associated with identifying the "Sentence" commands in a DNA sequence are

twofold. First, the binding sites are degenerate in nature, that is, the same "Word" may

have different letters in certain positions. Second, the "Words" are themselves

interspersed with varying lengths of meaningless "Letters".









One approach to identifying DNA regulatory modules is through cross-genome

comparison. The assumption underlying this approach is that DNA sequences encoding

functional TFBSs are under selective pressure to be conserved during evolution, whereas

non-functional DNA mutates more rapidly. Thus, if DNA sequences flanking

orthologs in two related species were to be compared for sequence-level similarity, DNA

regulatory modules would appear as conserved "islands" in a sea of otherwise

non-conserved DNA sequences. Approaches in this category include rVista 2.0, ConSite,

PhyME, TOUCAN, CREME, TraFAC, etc. (Sharan et al. 2003; Sandelin et al. 2004;

Loots and Ovcharenko 2004; Sinha et al. 2004; Aerts et al. 2005; Schwartz et al.

2003; Cora et al. 2004; Venkatesh and Yap 2005). For instance, based on the sequence

level conservation between human and mouse, Cora et al. predicted functional TFBSs

that are statistically over-represented and share the same specific Gene Ontology (GO)

terms (Cora et al. 2004). This kind of cross-genome comparison approach has

successfully led to the discovery of regulatory modules that were subsequently verified

by functional characterization (Santini et al. 2003).

The disadvantage of the sequence-based approach is that it is dependent on the

overall conservation of the DNA region harboring the regulatory module. As we

mentioned earlier, TFBS sequences are degenerate in nature and are interspersed with

non-functional sequences which are not under conservation pressure. Depending on the ratio

of functional versus non-functional base pairs in the DNA region, it is entirely possible

that the overall sequence level conservation of the region is indistinguishable from the

background level, while the actual TFBSs making up the functional module are still

conserved. In other words, it is possible that despite the conservation of the "Sentence"









command at the binding site level, the overall conservation of the DNA backbone at the

sequence level is minimal or undetectable. This situation is aggravated if the

participating binding sites are highly degenerate (i.e., tolerate many variations at the

sequence level) and the spaces between the binding sites are long. In fact, it has been

observed by researchers in many instances that while the regulatory region has no

detectable overall similarity, the transcription regulatory function is preserved (Ludwig et

al. 2005). Sequence-based approaches, or approaches requiring filtration of sequences

based on DNA level similarity, are not helpful for identifying the responsible TFBSs in

this scenario.

Since the conservation pressure is at the binding site level, i.e., the sequence must

be able to maintain binding affinity to the transcription factor(s), it makes biological

sense to perform comparisons at the binding site level rather than at the sequence level.

This, however, is currently hindered by two factors. The first limitation is the

effectiveness of identifying transcriptional binding sites in a given DNA sequence. The

set of TFBSs for a given TF can be quantitatively represented using a frequency matrix

that describes the binding specificity of the TF at each of its positions. The quality of the

matrix used to identify potential TFBSs is determined by the number and quality of

known binding site sequences used to construct the matrix. As a result of the

development of new technologies such as ChIP (Yan et al. 2004) and ChIP-chip (Lee et

al. 2002), it is anticipated that binding site instances will be identified at an unprecedented

rate, which will greatly enhance both the quality and the coverage of

binding site matrices in the near future (Matys et al. 2003).









The second limitation is that we currently do not understand the grammar

governing how binding sites (Words) make up the regulatory modules (Sentences).

Based on our understanding of transcriptional regulation, such a grammar should have at

least three components: (1) the composition of the binding sites, (2) the sequence of the

binding sites, and (3) the spaces between/among the binding sites. Currently, the number

of regulatory modules that have been thoroughly characterized is far fewer than what is

required to decode this grammar.

A major obstacle for biologists working on transcriptional regulation is to locate

and identify potential TFBSs responsible for a particular regulatory module, especially in

sequences that do not have significant conserved islands. In this paper, we describe a

novel methodology for binding site level identification of conserved regulatory modules

in such sequences.

Results

Simulating Sequence Pairs Harboring a Conserved Module of Binding Sites

Since the number of well-studied regulatory modules is currently rather sparse, we

chose to simulate sequence sets (pairs) representing the domain of our interest, i.e.,

conserved binding site patterns in a pair of sequences which nonetheless have little or no

similarity at the sequence level. In many cases, experimental investigation in a model

organism has narrowed down the location of the regulatory module(s) for a particular

gene to a relatively short region (e.g. within 1 kb), whereas for the ortholog in a less-

studied organism (reference organism), information about the localization of the module

is absent (except that it is generally in the proximity of the gene). In view of this, in the

first (current) stage of the development of BLISS, we considered the identification of a

conserved module present in both a short sequence of about 100-500 bps, representing










the model organism, and a longer sequence (5-6 kb), representing the reference organism.

Although this simplification limits the applicability of the current methodology, it does

highlight the promise of our approach.

For each sequence pair, the backbones for both the short sequence and the long

sequence were generated randomly and thus had no sequence similarity. A hypothetical

module involving binding sites for 4-8 distinct transcription factors was first introduced

into the short sequence. The binding site sequences were randomly selected from the

instances recorded in the TRANSFAC 9.1 database (Matys et al. 2003). The rules

governing the formation of the hypothetical module were as follows:

1. A module contains binding sites for 4-8 distinct transcription factors.

2. For each transcription factor, there may be more than one binding site in the
module.

3. The distance between consecutive binding sites is d1, where in 65% of the cases
d1 lies within 5-20 bps, in 22% of the cases d1 lies within 21-50 bps, and in the
remaining 13% of the cases d1 lies within 51-60 bps (Figure 3-1).

The range of values for d1 was based on background knowledge as well as a

statistical analysis by Qiu et al. of the distances between pairs of TFBSs in the

TransCompel database, which revealed the above distribution of distances (Qiu

et al. 2002).

The hypothetical module was first simulated according to the above rules in the

short sequence. Subsequently, a "conserved" module was formulated and inserted into

the longer sequence at a random location. The rules governing the formation of the

"conserved" module were as follows:





1. It is composed of TFBSs that correspond to the same transcription factors as
present in the hypothetical module.
2. The sequence for each TFBS is randomly chosen from the recorded instances in
TRANSFAC 9.1, with the caveat that it cannot be the same instance(s) that was
(were) used to construct the hypothetical module in the short sequence.
3. The respective order of TFBSs is the same as in the hypothetical module in the
short sequence.
4. The distance between consecutive binding sites in the conserved module is d2; d2 is
a function of d1 in that d2 lies in the range (d1 - Δd, d1 + Δd) (Figure 3-1).

Δd is the "perturbation factor", i.e., the variation of distance between corresponding

binding sites in the hypothetical module and the conserved module. In this study, we

used Δd = 4 (see Discussion). A total of 10,000 pairs of sequences were generated

according to the above rules and were used to test and evaluate various algorithms.

Figure 3-1. Simulating conserved regulatory modules in a pair of random sequences.
Each pair has a short (100-500 bp) and a long (5-6 kb) sequence. For both
sequences the backbones were generated randomly. The hypothetical
module was formulated according to rules described in the text and inserted
into the short sequence. A "conserved" module with binding sites
corresponding to the same transcription factors, but with different sequences,
was inserted into the long sequence. The distances between consecutive
binding sites were also different between the hypothetical module and the
conserved module.
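The simulation rules above can be illustrated with a minimal Python sketch. It is not the code used in this study: the function names, the uniform random backbone, the inclusive treatment of the Δd range, and the toy site strings (which are made up and are not TRANSFAC instances) are all assumptions, and for brevity each transcription factor contributes exactly one site per module.

```python
import random

BASES = "ACGT"

def random_dna(length, rng):
    """Random backbone: each base drawn uniformly (an assumption)."""
    return "".join(rng.choice(BASES) for _ in range(length))

def sample_d1(rng):
    """Gap between consecutive sites: 65% in 5-20, 22% in 21-50, 13% in 51-60 bp."""
    u = rng.random()
    if u < 0.65:
        return rng.randint(5, 20)
    if u < 0.87:
        return rng.randint(21, 50)
    return rng.randint(51, 60)

def build_module(sites, gaps, rng):
    """Concatenate the sites, separated by random spacers of the given lengths."""
    parts = [sites[0]]
    for gap, site in zip(gaps, sites[1:]):
        parts.append(random_dna(gap, rng))
        parts.append(site)
    return "".join(parts)

def embed(module, total_length, rng):
    """Insert the module at a random offset inside a random backbone."""
    total_length = max(total_length, len(module))
    start = rng.randint(0, total_length - len(module))
    tail = total_length - len(module) - start
    return random_dna(start, rng) + module + random_dna(tail, rng)

def simulate_pair(site_instances, rng, delta_d=4):
    """site_instances maps a TF name to at least two known site strings."""
    n_tf = rng.randint(4, 8)
    tfs = rng.sample(sorted(site_instances), n_tf)
    short_sites, long_sites = [], []
    for tf in tfs:
        # Two different recorded instances: one per module.
        first, second = rng.sample(site_instances[tf], 2)
        short_sites.append(first)
        long_sites.append(second)
    d1 = [sample_d1(rng) for _ in range(n_tf - 1)]
    # Perturb each gap by at most delta_d (inclusive range is an assumption).
    d2 = [d + rng.randint(-delta_d, delta_d) for d in d1]
    short_seq = embed(build_module(short_sites, d1, rng), rng.randint(100, 500), rng)
    long_seq = embed(build_module(long_sites, d2, rng), rng.randint(5000, 6000), rng)
    return {"tfs": tfs, "d1": d1, "d2": d2, "short": short_seq, "long": long_seq}

# Toy, made-up site instances (three per hypothetical TF), for illustration only.
TOY_SITES = {f"TF{i}": [random_dna(8, random.Random(10 * i + j)) for j in range(3)]
             for i in range(8)}
pair = simulate_pair(TOY_SITES, random.Random(1))
print(pair["tfs"], len(pair["short"]), len(pair["long"]))
```

In this sketch the short sequence is stretched when a drawn module is longer than the drawn backbone length, which is one simple way to reconcile modules of up to 8 sites with a 100-500 bp target.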

Identifying a Conserved Module by Comparison at the Binding Site Level

As stated above, the objective of our methodology is to identify conserved

regulatory modules within highly divergent sequences. The sequence pairs in our

simulated data-set had little overall sequence similarity. Of the 10,000 pairs, 73.32%

had no similarity detectable by BLAST analysis (E = 0.01, bl2seq). This indicated

that the conservation of binding sites in the hypothetical module and the conserved

module was not sufficient to allow detection at the sequence level. Of those that did have

a significant match, the output alignments were shorter than 30 bps, which was far shorter

than the length of the inserted module.

M_score. To identify the conserved module at the binding site level, we first generated

the potential TFBS profiles for each of the simulated sequence pairs. A matrix scoring

method similar to that used in Match (Kel et al. 2003) was implemented (see Methods),

which allowed us to score each sequence against the TFBS frequency matrices recorded

in TRANSFAC 9.1 (M_score, Figure 3-2a). When a cut-off value of 0.75 (on M_score)

was applied, on average there were about 3000 identified potential TFBSs in every

thousand base pairs of simulated sequences. This is similar to what we observed when

using genomic DNA sequences randomly extracted from model organism databases (data

not shown).
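As an illustration of matrix scoring with a cut-off, the following Python sketch scores every window of a sequence against a single frequency matrix. It assumes an information-weighted, min-max normalized score of the kind described for Match; the exact M_score definition is given in Methods and may differ, and the three-position matrix here is a made-up toy, not a TRANSFAC entry.

```python
import math

def information_vector(matrix):
    """Per-position weight I(i) = sum_b f(i,b) * ln(4 * f(i,b)), with 0*ln(0) = 0."""
    return [sum(f * math.log(4.0 * f) for f in column.values() if f > 0.0)
            for column in matrix]

def m_score(matrix, site):
    """Normalized score in [0, 1]: 1 for the best possible site, 0 for the worst."""
    weights = information_vector(matrix)
    current = sum(w * column[base] for w, column, base in zip(weights, matrix, site))
    lowest = sum(w * min(column.values()) for w, column in zip(weights, matrix))
    highest = sum(w * max(column.values()) for w, column in zip(weights, matrix))
    if highest <= lowest:
        return 0.0  # completely uninformative matrix
    return (current - lowest) / (highest - lowest)

def scan(matrix, sequence, cutoff=0.75):
    """Return (position, score) for every window scoring at or above the cut-off."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        score = m_score(matrix, sequence[start:start + width])
        if score >= cutoff:
            hits.append((start, score))
    return hits

# Toy relative-frequency matrix (hypothetical), one dict per position.
TOY_MATRIX = [
    {"A": 0.80, "C": 0.10, "G": 0.05, "T": 0.05},
    {"A": 0.10, "C": 0.70, "G": 0.10, "T": 0.10},
    {"A": 0.03, "C": 0.03, "G": 0.90, "T": 0.04},
]
print(scan(TOY_MATRIX, "TTACGTT"))
```

Running `scan` with every matrix in a library and collecting the hits per transcription factor yields the kind of potential-TFBS profile referred to in the text.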












Figure 3-2. TFBS score profiles along a DNA sequence (axes: TFBSs, Locations).
a) Scores (M_score) of the TFs along the DNA sequence, before applying the cutoff.
b) M_score profile after applying a cutoff of 0.75. c) Profile of the P_score after
incorporating the p value of binding site matches. d) Profile of the G_score
after Gaussian smoothing. The colors represent the different TFs: Red,
En; Green, Croc; Blue, Lun-1.

To identify the hypothetical module embedded in the sequence, we tried several

different algorithms that compare the binding site profiles of the short and the long [...]

P_score. [...] for all TFBS matrices. Thus it does not differentiate short and relatively simple matrices

that match DNA sequences with a high frequency from those long and stringent matrices

that match DNA sequences only rarely. For example, the binding site for En (I$ED_06)

has 7 positions, and on average there are 320 matches (with M_score>0.75) on any

10,000 bp random sequence. In contrast, the binding site for Bel-1 (V$BEL1_B) has 13

positions, and the average number of matches with M_score>0.75 in a 10,000 bp random

sequence is 3. It is clear that a match involving the binding site V$BEL1_B is far more

significant than a match with I$ED_06. To differentiate this, we introduced the p value

of the M_score, which was estimated by calculating the fraction of randomly generated

sequences that have scores equal to or higher than that M_score. We then calculated the

P_score (see Methods) as the product of -log (p value of M_score>cut_off) and the

M_score (Figure 3-2c).
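The P_score idea can be sketched in a few lines of Python. This is an illustrative sketch only: the base of the logarithm (base 10 here), the add-one correction that keeps the estimated p value above zero, and the toy scoring function are assumptions that the text does not specify.

```python
import math
import random

def empirical_p_value(score_fn, width, observed, n_random=10000, rng=None):
    """Fraction of random sequences of the given width scoring >= observed."""
    rng = rng or random.Random(0)
    hits = 0
    for _ in range(n_random):
        candidate = "".join(rng.choice("ACGT") for _ in range(width))
        if score_fn(candidate) >= observed:
            hits += 1
    # Add-one correction (an assumption) so that log(p) is always defined.
    return (hits + 1) / (n_random + 1)

def p_score(m_score, p_value):
    """P_score = -log10(p value) * M_score (base 10 is an assumption)."""
    return -math.log10(p_value) * m_score

def toy_score(site, consensus="ACGT"):
    """Hypothetical stand-in for an M_score: fraction of positions matching a consensus."""
    return sum(a == b for a, b in zip(site, consensus)) / len(consensus)

p = empirical_p_value(toy_score, width=4, observed=1.0)
print(p, p_score(1.0, p))
```

With this weighting, two matches with the same M_score are ranked differently when one matrix matches random DNA often (large p value) and the other only rarely (small p value), which is the distinction drawn between the short En matrix and the longer Bel-1 matrix.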

G_score. To account for the variation in the distances between/among binding sites, we

performed a Gaussian smoothing of the P_score (see Methods). Through empirical

testing (data not shown), we found that a variance of σ² = 9 gave the best performance.

We denote the Gaussian smoothed score profile as the G_score profile of the sequence

(Figure 3-2d).
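A minimal sketch of the smoothing step is shown below, assuming a discrete Gaussian kernel with variance 9 that is truncated at three standard deviations and normalized to sum to 1; the truncation radius and the normalization are assumptions, and the actual procedure is described in Methods.

```python
import math

def gaussian_kernel(variance=9.0, radius=None):
    """Discrete Gaussian weights, truncated at 3 standard deviations by default."""
    if radius is None:
        radius = int(3 * math.sqrt(variance))
    weights = [math.exp(-(k * k) / (2.0 * variance))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def smooth(profile, variance=9.0):
    """Spread each P_score over neighboring positions (one profile per TF)."""
    kernel = gaussian_kernel(variance)
    radius = len(kernel) // 2
    smoothed = [0.0] * len(profile)
    for position, value in enumerate(profile):
        if value == 0.0:
            continue  # most positions carry no predicted site
        for k, weight in enumerate(kernel):
            target = position + k - radius
            if 0 <= target < len(smoothed):
                smoothed[target] += value * weight
    return smoothed

# A single predicted site at position 20 with P_score 2.0.
profile = [0.0] * 41
profile[20] = 2.0
g_profile = smooth(profile)
print(round(g_profile[20], 4), round(g_profile[24], 4))
```

After smoothing, a site that has shifted by a few base pairs in the other sequence still overlaps the smoothed peak, so the profile comparison tolerates small changes in inter-site distance.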

BLISS_score. G_score profiles were generated for both the short and the long sequences.

To identify a maximum match at the binding site profile level, the short G_score profile

was slid along the long G_score profile. At each position, the match between the short

profile and its corresponding region of equal length (length of the window) in the long

profile was evaluated using an inner product as the BLISS_score (see Methods).

Note that since the "length of the window" appears in the denominator, the

BLISS_score is independent of the length of the short profile (or the length of the

window). Figure 3-3a shows the distribution of BLISS_scores as the shorter G_score

profile was slid along the longer G_score profile. The peak of the BLISS_score indicates

the maximum match. In this case, the abrupt surge in the BLISS_score is due to the

match between the embedded hypothetical module in the short sequence and the

conserved module in the long sequence. When this methodology was tested on all of the

10,000 simulated sequence pairs, about 80% of the highest peaks for each pair contained

the correct match between the embedded hypothetical module and the conserved module.
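The sliding comparison can be sketched as follows. The text states only that the score is an inner product with the window length in the denominator; the specific form used here (sum over transcription factors and window positions, divided by the window length) and the dictionary layout of the profiles are assumptions.

```python
def bliss_scores(short_profile, long_profile):
    """Slide the short profile along the long one.

    Each profile maps a TF name to a list of G_scores, one per position.
    The score at an offset is the inner product over all shared TFs and
    window positions, divided by the window length.
    """
    shared_tfs = sorted(set(short_profile) & set(long_profile))
    window = len(next(iter(short_profile.values())))
    long_length = len(next(iter(long_profile.values())))
    scores = []
    for offset in range(long_length - window + 1):
        total = 0.0
        for tf in shared_tfs:
            short_row = short_profile[tf]
            long_row = long_profile[tf]
            total += sum(short_row[i] * long_row[offset + i] for i in range(window))
        scores.append(total / window)
    return scores

def best_match(scores):
    """Offset and value of the highest peak."""
    offset = max(range(len(scores)), key=scores.__getitem__)
    return offset, scores[offset]

# Toy profiles with two hypothetical TFs; the long profile also holds a decoy site.
short = {"A": [0.0, 1.0, 0.0, 0.0, 0.0], "B": [0.0, 0.0, 0.0, 1.0, 0.0]}
long_ = {"A": [0.0, 1.0] + [0.0] * 4 + [1.0] + [0.0] * 5,
         "B": [0.0] * 8 + [1.0] + [0.0] * 3}
print(best_match(bliss_scores(short, long_)))
```

In the toy example the decoy site for factor A alone produces a lower score than the offset at which both sites line up, which mirrors the peak described for the embedded modules.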

Distribution and Statistical Evaluation of the BLISS_score

To be able to evaluate the match at the binding site profile level, we analyzed the

distribution of BLISS_scores using the simulated sequence pairs. For each pair of

sequences, BLISS_scores were calculated at each position as the short profile slid along

the longer profile. The peak matches (corresponding to the peaks in the score profile)

between each pair of sequences were evaluated to see whether they aligned the embedded

modules. If a match did align the modules, it was designated a "true" match. All other

BLISS_scores were considered as background.

Figure 3-3b shows the distribution of the background and the "true" match

BLISS_scores for the 10,000 simulated pairs of sequences. This distribution varies

slightly depending upon the cutoff threshold set for M_score (Figure 3-3b & c). This is

not surprising, since a lower cutoff threshold will lead to more identified potential

binding sites and thus a slightly higher background score.





































Figure 3-3. Identifying a conserved module at the binding site level. a.) Search for the
conserved regulatory module: the peak in the BLISS_score profile along the
long DNA sequence represents the maximum match at the binding site
level. b.) and c.) show the distribution of BLISS_score in true matches
(module matches with module) vs. background (module matches with random
sequence) with an M_score cutoff value of 0.75 and 0.80, respectively.









The distributions allow us to evaluate any particular BLISS_score. They are informative

in helping set a threshold for reporting significant matches at the binding site level. Given

a BLISS_score x, the distributions allow us to decide whether that BLISS_score

corresponds to a true alignment of modules or whether it corresponds to the module

aligning with a random DNA segment. Let C1 denote the event where the modules

embedded in the short and the long sequences are aligned, and C2 denote the event where

either module is aligned with a random DNA segment. Based on Bayes' formula, the

posterior probabilities can be calculated as follows:

\[
p(C_1 \mid x) = \frac{p(x \mid C_1)\, p(C_1)}{p(x \mid C_1)\, p(C_1) + p(x \mid C_2)\, p(C_2)}
\]

\[
p(C_2 \mid x) = \frac{p(x \mid C_2)\, p(C_2)}{p(x \mid C_1)\, p(C_1) + p(x \mid C_2)\, p(C_2)}
\]

where p(C1|x) is the conditional probability of C1 given a BLISS_score x and

p(C2|x) is the conditional probability of C2 given a BLISS_score x; p(C1) is the prior

probability of C1 and p(C2) is the prior probability of C2; p(x|C1) is the conditional

probability of observing x given C1 and p(x|C2) is the conditional probability of

observing x given C2. p(x|C1) and p(x|C2) can be read off directly from the distributions

generated.

It is difficult to find the means to calculate the prior probabilities p(C1) and p(C2).

In this study, we assumed p(C1) = p(C2), although we suspect that p(C1) might be

smaller than p(C2). This assumption allowed us to calculate the posterior probabilities to

evaluate a BLISS_score x. In practice, we decided that x was a significant matching score

if p(C2|x) was less than a threshold of, e.g. 0.01 or 0.001.
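The posterior calculation can be written as a short Python sketch. The two likelihoods are assumed to have been read off the empirical score distributions (for example, from normalized histograms); the function names and the example values are hypothetical. With equal priors the priors cancel, so p(C2|x) reduces to p(x|C2) / (p(x|C1) + p(x|C2)).

```python
def posteriors(px_c1, px_c2, prior_c1=0.5):
    """Return (p(C1|x), p(C2|x)) from the two likelihoods via Bayes' formula."""
    prior_c2 = 1.0 - prior_c1
    evidence = px_c1 * prior_c1 + px_c2 * prior_c2
    if evidence == 0.0:
        raise ValueError("x has zero density under both distributions")
    return px_c1 * prior_c1 / evidence, px_c2 * prior_c2 / evidence

def is_significant(px_c1, px_c2, threshold=0.01, prior_c1=0.5):
    """A score is reported when the background posterior p(C2|x) is below the threshold."""
    return posteriors(px_c1, px_c2, prior_c1)[1] < threshold

# Hypothetical likelihoods for one BLISS_score x.
print(posteriors(0.09, 0.01))
print(is_significant(0.2, 0.0001))
```

Lowering `prior_c1` below 0.5, in line with the suspicion that p(C1) is smaller than p(C2), raises p(C2|x) for the same score and therefore makes the test more conservative.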









Identifying a Conserved Regulatory Module in Distantly Related Species

To test the efficacy of the BLISS methodology in real sequences, we undertook the

task of identifying the Even-skipped (eve) stripe 2 enhancer (S2E) in distantly related

Drosophila species. Even-skipped, an important development regulatory gene in

Drosophila melanogaster (D.mela), is specifically expressed in seven transverse stripes

in the embryo during the blastoderm stage. The stripe 2 enhancer is the best studied and

includes TFBSs for five TFs, bicoid (Bcd), hunchback (Hb), giant (gt), Kruppel (Kr), and

sloppy-paired (slp) (Small et al. 1992; Arnosti et al. 1996; Andrioli et al. 2002).

Unfortunately, TRANSFAC 9.1 has Position Weight Matrices for only three of the five

TFs, i.e., Hb, Kr, and Bcd. Our search was therefore limited in the sense that some of the

participating TFBSs could not be predicted and used for the match comparison.

Previous experimental investigations have shown that S2E is evolutionarily

conserved among D. yakuba (D.yaku), D. erecta, and D. pseudoobscura (D.pseu)

(Ludwig et al. 1998; Ludwig et al. 2002; Ludwig et al. 2005). All of these species are in

the same subgenus (Sophophora) as D.mela, with D.pseu having the most distant

relationship with D.mela (separated about 40 million years ago). BLISS did identify

the eve S2E modules among these four species. In particular, a significant peak was

reported by BLISS when we searched the S2E module extracted from D. pseu against the

14 kb D.mela genomic sequence flanking the eve coding region.











Figure 3-4. Identifying the eve S2E module in distantly related species. a.) Using the
D.pseu S2E module (1027 bp), a peak (red circle) in BLISS_score was
identified in a 14 kb D.moja genomic sequence surrounding the eve coding
region. b.) No significant match was identified in the reverse strand (bottom
panel) or an unrelated sequence (data not shown).









In contrast, no detailed information has been published about potential S2E in more

distantly related species, such as D. mojavensis (D.moja) or D. virilis (D.viri), both from

a separate subgenus (Drosophila) that separated from D.mela about 60 million years ago

(Powell 1997). To identify S2E in these two distantly related species, we extracted the

14 kb genomic sequence flanking the eve coding region from D.moja and D.viri genomic

sequences. BLAST analysis using the D.mela or D.pseu sequence harboring the eve S2E

module did not identify any significant match longer than 41 bp (using bl2seq with

default gap penalty values). Using the BLISS methodology, however, a significant peak

in the BLISS_score was observed (Figure 3-4a, p(C2|x)<0.05). Verification of this match

indicated that it contained the TFBSs composing the S2E module. Similar results were

obtained when corresponding sequence pairs involving 1.) D.mela and D.moja, and 2.)

D.mela and D.viri, were analyzed. In contrast, no significant match was identified in the

reverse-complemented sequence (Figure 3-4b) or in other 14 kb sequences unrelated to

eve, indicating the specificity of the search.

A detailed inspection of the makeup of the S2E modules in distantly related

species showed that S2E can be viewed as a complex module made of element modules.

To make an analogy, S2E is a complex sentence that has several "clauses" (Figure 3-5).

The evolution of the whole module indicates that the distances between some TFBSs

have changed dramatically. However, the distances among the TFBSs within

corresponding "clauses" have remained relatively stable. For example, in Clause 1 the

distances among participating TFBSs have remained constant over the long evolutionary

period. Specifically, the distance between the first bcd (overlapping with the first kr) and

the second bcd is invariably 20 in all of the four species. In addition, the distances










among TFBSs in Clause 3 have also remained relatively stable, i.e., within the variation

we have factored into our simulation.


Figure 3-5. Inter-TFBS distances are very well conserved within each clause of the S2E
module. Comparison of the evolution of S2E modules across distantly
related species revealed that while the sequence length of the module has
changed significantly, the distances among TFBSs in Clauses 1 and 3 have
remained stable. The numbers near the TFBSs indicate the positions relative
to the first Kr site in this module.

Since our methodology is based on the assumption of limited distance

variations between TFBSs, it should be much more sensitive at identifying individual

"Clauses" or simple modules. When the corresponding TFBS profiles covering Clause 1

or Clause 3 were used to search against the genome sequence from D.moja, very

significant peaks in BLISS_score were observed (Figure 3-6a & b, p(C2|x)<0.001 for

both). The peaks corresponded to the match of Clause 1 and Clause 3 on the D.moja










sequence, respectively. BLAST analysis using the sequences covering Clause 1 or

Clause 3 searched against the D.moja genomic sequence failed to identify significant

matches that spanned the whole module. rVista 2.0 did predict Clause 1 because it

succeeded in detecting the DNA similarity between the sequence covering Clause 1 and

the D.moja sequence. However, rVista 2.0 failed to identify Clause 3 since no similarity

was detected between the sequence covering Clause 3 and the D.moja genomic sequence.

Implementation of BLISS as a Web-based Service

The BLISS methodology has been implemented as a web-based tool for the

research community. The web application embodies the Gaussian Smoothing Method for

identifying cis-regulatory modules at the binding site level, and outputs all potential

TFBSs in the predicted module. The module finding process consists of several steps.

To begin, the user inputs two DNA sequences: for example, a short sequence from

a model organism that harbors a regulatory module, and a longer sequence surrounding

the ortholog in a different species. An M_score threshold of 0.75 or 0.8 is then chosen by

the user for the generation of the TFBS profiles for both sequences. Next, a plot of

BLISS_scores comparing successive alignments of the short profile against the long

profile is returned to the user. On the same page, the distributions described earlier

(Figure 3-3b & c) are displayed so that the user may choose a BLISS_score threshold.

Once the BLISS_score threshold is chosen, BLISS outputs the matches with a

BLISS_score higher than that threshold (up to a maximum of 5). For each match, a table of

contributing TFBSs is listed based on the product of the p-values of the matching

TFBSs on both sequences (Figure 3-7). Alternatively, the TFBSs can be listed based on their

numeric contribution to the BLISS_score, or by their location.












Figure 3-6. Searching for conserved individual clauses / element modules using BLISS.
Profiles covering clauses 1 (a) or 3 (b) of S2E were used to search against a
14 kb D.moja genomic region. The BLISS_score peaks are significant
(p(C2|x) = 0.0 for a, and 0.0003 for b).











Currently, the limits for the short and long input sequences are set at 1200 bps and

15k bps, respectively.

TFBS        Location1  M_score1    PValue1       Location2  M_score2    PValue2       Contribution  PValueProduct
...
HNF-1(+)    105        0.8874515   1.844625E-4   709        0.84781164  7.54511E-4    1.7975879     1.3917898E-7
CHX10(+)    79         0.7580955   0.0048756273  685        0.9947151   2.92595E-5    1.8798841     1.4265842E-7
C/EBPdelta  12         0.89993819  0.001164967   697        0.94612706  1.7742251E-4  1.802457      2.0669137E-7

Figure 3-7. BLISS output of the contributing TFBSs. Our S2E search was used as the
example. The list of TFBSs can be output either based on sequence
position, product of p value (as shown here), or contribution to BLISS_score.
The TFBSs that belong to S2E are highlighted in green.

Discussion

In this study, we have presented a first step towards identifying regulatory modules

via comparisons at the binding site level. The advantage of such an approach is that it

allows the detection of conserved regulatory modules in highly divergent sequences, as

we have demonstrated both with simulated sequences as well as with real world


examples. This method is thus complementary to many existing methods that are based

on sequence similarity comparisons (Altschul et al. 1990) or use sequence similarity for


pre-analysis selection (Sandelin et al. 2004; Loots and Ovcharenko 2004; Aerts et al.












2005; Sharan et al. 2004). It should also be complementary to applications such as

MEME and CompareProspector, which are widely used for the identification of

conserved sequence motifs (binding sites) in the regulatory region of co-expressed genes

(Liu et al. 2004; Liu et al. 2004).

There are limitations to our approach. Some of the major limitations, such as the

coverage and quality of the TFBS matrices, are expected to improve rapidly in the near

future as new high throughput techniques are applied to identify binding sites at genome

scale. Our current algorithm is developed based on the assumption that the inter-TFBS

distance variation is within a +/-4 base pair range. This allows the identification of

modules/clauses with relatively small inter-TFBS distance variation, such as the

individual clauses in the S2E module. It will likely miss modules/clauses that have much

larger distance variations between TFBSs. In the case of S2E, the identification of the

module was based on the fact that the third clause had low inter-TFBS distance

variations, which was sufficient to generate a significant BLISS_score (Figure 3-4a). As

indicated by Ludwig et al., if S2E as a whole were to be considered, many inter-TFBS

distances have changed dramatically during evolution (Ludwig et al. 2005). However, a

closer look at the distribution of TFBSs in S2E in the two distantly related species also

indicated that the S2E module may be sub-divided into clauses (Figure 3-5). While the

inter-clause distances have varied dramatically, the inter-TFBS distances within each

clause have remained largely stable (Figure 3-5). This is very possibly a reflection of the

spacing restriction on important transcription factor interactions.

In addition to the S2E module, we also tested our methodology on other regulatory

modules such as the DME (Distal Muscle Enhancer) module in front of the paramyosin










gene (Marco-Ferreres et al. 2005). Using a 200 bp sequence harboring the DME in

D.mela, we were able to detect the corresponding module in D. viri (data not shown).

Currently, the number of well characterized, evolutionarily conserved modules is limited.

The goal of BLISS is to facilitate the discovery of multi-TFBS modules by

identifying the conserved pattern of the TFBSs. We also applied BLISS to a regulatory

region that is responsible for mediating UV induced expression of hac-1 (Zhou and

Steller 2003). There is no existing information on the composition of the UV-responsive

module in this region, which has very low sequence level conservation between the

corresponding segments in D.mela and D.pseu. Yet genetic experiments have indicated

that the responsiveness is highly conserved. The potential module identified by BLISS is

currently being tested experimentally.

BLISS, with some adaptation, can potentially be used to identify the conserved

regulatory modules in co-expressed genes. Another advantage of BLISS is that the

methodology can also be applied to identify patterns that involve not only TFBSs, but

also other sequence features such as complex response elements (Ringrose et al. 2003),

insulator sequences, CpG islands, etc. Functionally, these sequence features (their related

modifications and binding partners) interact with transcription factors. However, these

features, such as CpG islands, cannot be detected by simple sequence similarity based

searches.

Conclusions

In this study, we addressed the feasibility of identifying conserved regulatory

modules at the binding site level. Our results indicate that it is feasible to identify

conserved regulatory modules in simulated random sequences harboring a regulatory

module made of 4-8 distinct binding sites. Using real sequences, we demonstrated that









our approach outperforms regular sequence level comparisons when the orthologous

DNA sequences are highly diverged. In addition, the BLISS program outputs directly the

candidate binding sites that are shared between the two regulatory sequences, which can

greatly facilitate the evaluation of the candidate module as well as the design of the

experimental verification strategy by biomedical scientists. Future development of the

project will include identifying better algorithms for complex modules and modules with

higher inter-TFBS distance variations.

Methods

Generating Simulated Sequences

10000 simulated sequence pairs were generated for developing the methodology.

Each pair included a short DNA sequence (100-500 bp) harboring a hypothetical TFBS

module and a long DNA sequence (5-6 kb) harboring a conserved TFBS module. First,

the hypothetical TFBS module was generated in the following manner: 4-8 binding sites

were randomly chosen from TRANSFAC 9.1 (Matys et al. 2003) database and then

random DNA subsequences were inserted between them. Qiu et al. (Qiu et al.

2002) analyzed all the entries of composite elements in the TransCompel database

(version 3.0) and they found that about 87% of the composite elements are within 50 bp

distance and about 65% are within 20 bp distance of one another. We therefore chose

lengths of DNA subsequences inserted between binding sites based on this result. Next,

we created the conserved TFBS module, which included binding sites for the same sets of

transcription factors in the same order as in the shorter sequence. However, the binding

site sequences had to be different and they were randomly chosen from TRANSFAC 9.1.

Furthermore, compared to the hypothetical module, distances between binding sites were

set to vary slightly and we allowed each binding site to shift up to 4 bps either to the right










or to the left. Finally, we inserted the conserved module into a 5000 bp randomly

generated DNA sequence to generate the longer sequence.
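The construction described above can be illustrated with a minimal sketch. It is not the code used in this study: `TOY_SITES` is a hypothetical stand-in for the TRANSFAC 9.1 site instances, the short sequence is reduced to the module itself, the background is uniform random DNA, and the allowed shift of each binding site is approximated by changing each spacer length by up to 4 bp.

```python
import random

# Hypothetical stand-in for TRANSFAC 9.1 site instances: factor name -> example sites.
TOY_SITES = {
    "TF_A": ["TTATCC", "TAATCC"],
    "TF_B": ["GGGCGG", "GGGAGG"],
    "TF_C": ["TGACTCA", "TGAGTCA"],
    "TF_D": ["CACGTG", "CATGTG"],
    "TF_E": ["GATAAG", "GATAGG"],
    "TF_F": ["CCAATC", "CCAATG"],
}


def random_dna(length, rng):
    """Uniform random DNA (the base composition is an assumption of this sketch)."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def build_module(sites, spacers, rng):
    """Concatenate binding sites with random spacer DNA between consecutive sites."""
    parts = []
    for idx, site in enumerate(sites):
        parts.append(site)
        if idx < len(spacers):
            parts.append(random_dna(spacers[idx], rng))
    return "".join(parts)


def simulate_pair(rng, n_sites=4, backbone_len=5000, max_shift=4):
    """Return (short_seq, long_seq, module_start, conserved_module, factors)."""
    factors = rng.sample(sorted(TOY_SITES), n_sites)
    sites_short = [rng.choice(TOY_SITES[f]) for f in factors]
    spacers_short = [rng.randint(5, 20) for _ in range(n_sites - 1)]
    # Same factors in the same order, but a different site instance for each factor.
    sites_long = [
        rng.choice([s for s in TOY_SITES[f] if s != used])
        for f, used in zip(factors, sites_short)
    ]
    # Each spacer may change by up to max_shift bp in either direction.
    spacers_long = [max(0, d + rng.randint(-max_shift, max_shift)) for d in spacers_short]
    short_seq = build_module(sites_short, spacers_short, rng)
    conserved = build_module(sites_long, spacers_long, rng)
    backbone = random_dna(backbone_len, rng)
    start = rng.randint(0, backbone_len)
    long_seq = backbone[:start] + conserved + backbone[start:]
    return short_seq, long_seq, start, conserved, factors
```

Calling `simulate_pair(random.Random(1))` yields one pair together with the insertion position of the conserved module, which is what a later evaluation step needs in order to decide whether a match is "true".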

Binding Site Identification

Potential TFBSs both in the short DNA sequence (including the hypothetical

module) and the long DNA sequence (including the conserved module) were searched

based on frequency matrices collected by TRANSFAC 9.1. Because TFBSs may be

detectable on either the forward or the backward strand, we searched both strands of

sequences. The M score profile for each sequence is a M*L matrix, where M is twice of

the number of matrices applied and L is the length of the sequence. The top half of the

M~score matrix is the score profile for the forward strand and the bottom half is that for

the complementary strand. TheM2~score of the ith TFB S at position j of the sequence was

calculated by first aligning the frequency matrix for the ith TFB S with the sequence at

position j and then computing:

f.1 M_score:

$$\mathrm{M\_score}[i,j] = \frac{\mathrm{Score}[i,j] - \mathrm{Score}_{min}[i]}{\mathrm{Score}_{max}[i] - \mathrm{Score}_{min}[i]}$$

$$\mathrm{Score}[i,j] = \sum_{k=0}^{K-1} I(k)\, f_{k,n_{j+k}}$$

$$\mathrm{Score}_{min}[i] = \sum_{k=0}^{K-1} I(k)\, f_{k,min}$$

$$\mathrm{Score}_{max}[i] = \sum_{k=0}^{K-1} I(k)\, f_{k,max}$$

$$I(k) = \sum_{n \in \{A,T,C,G\}} f_{k,n} \ln(4 f_{k,n})$$

K is the length of the TFBS. $n_{j+k} \in \{A,T,C,G\}$ is the nucleotide occurring in the

sequence at position j+k. $f_{k,n_{j+k}}$ is the frequency of nucleotide $n_{j+k}$ at position k in the

frequency matrix (of the ith TF). $f_{k,min}$ is the lowest frequency and $f_{k,max}$ is the highest

frequency across all nucleotides at position k in the frequency matrix (of the ith TF). I(k)

is the information vector for the frequency matrix, which reflects the degree of

conservation at position k of the matrix. Finally, M_score[i, j] is the normalized

Score[i, j]. Stormo (Stormo 1998; Stormo 2002; Stormo and Fields 2002) observed that

logarithms of the base frequencies ought to be proportional to the binding energy and the

information vector reflects this average binding energy between the transcription factor

and the binding site.
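As an illustration of the normalization described above, the following sketch computes the score for a single matrix at a single position. The representation of a frequency matrix as a list of per-position dictionaries and the matrix `TOY_MATRIX` are assumptions made for the example and do not correspond to any TRANSFAC entry; a matrix whose positions are all uniform would give a zero denominator and is not handled here.

```python
import math


def information_vector(matrix):
    """I(k) = sum over n of f[k][n] * ln(4 * f[k][n]); zero frequencies add nothing."""
    return [sum(f * math.log(4 * f) for f in column.values() if f > 0) for column in matrix]


def m_score(matrix, seq, j):
    """Normalized matrix score of the site starting at position j of seq."""
    info = information_vector(matrix)
    window = seq[j:j + len(matrix)]
    if len(window) < len(matrix):
        raise ValueError("site runs past the end of the sequence")
    score = sum(i_k * column[base] for i_k, column, base in zip(info, matrix, window))
    score_min = sum(i_k * min(column.values()) for i_k, column in zip(info, matrix))
    score_max = sum(i_k * max(column.values()) for i_k, column in zip(info, matrix))
    return (score - score_min) / (score_max - score_min)


# Toy 3-column frequency matrix (hypothetical); its consensus is "TAG".
TOY_MATRIX = [
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.8, "C": 0.1, "G": 0.05, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
]
```

With this matrix the consensus site scores 1, a site built from the least frequent base at every position scores 0, and all other sites fall in between. Scanning the reverse complement of the sequence with the same function would give the bottom half of the profile.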

A score cutoff of 0.75 was applied to the M_score profiles of both the short and the

long sequence as follows:

M_score[i, j] = M_score[i, j] if M_score[i, j] >= 0.75

M_score[i, j] = 0 if M_score[i, j] < 0.75

P Value for M_score.

To calculate the p value, a background model is required. Here, we chose the

background model to be a random DNA sequence where each position is drawn

independently. The ratio among A, T, C, and G is 30% : 30% : 20 % : 20%. For each

frequency matrix, 300 million subsequences were sampled from the background model,

and the M_score of each subsequence was calculated to build the M_score distribution.

The p value of an M_score for each TFBS was estimated by calculating the fraction of

samples that had scores equal to or higher than that M_score. Then, the P_score profiles

were calculated as follows:












f.2 P_score:

$$\mathrm{P\_score}[i,j] = C_i \cdot \mathrm{M\_score}[i,j]$$

$C_i$ is the negative natural logarithm of the p value of M_score >= 0.75 for the ith

TFBS.
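The sampling procedure for the weight of each matrix can be sketched as follows. This is only an illustration under stated assumptions: `score_fn` is a hypothetical callable that returns the M_score of a candidate site, the sample size is far smaller than the 300 million subsequences used in the study, and the floor of one hit (to avoid taking the logarithm of zero) is a choice made for this example.

```python
import math
import random


def estimate_weight(score_fn, site_len, cutoff=0.75, n_samples=20000, seed=0):
    """Estimate C_i = -ln(p), where p is the background fraction scoring >= cutoff."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        # Background model from the text: independent positions, A:T:C:G = 30:30:20:20.
        kmer = "".join(rng.choices("ATCG", weights=[30, 30, 20, 20], k=site_len))
        if score_fn(kmer) >= cutoff:
            hits += 1
    # Assumption of this sketch: floor at one hit so that ln(0) never occurs.
    p_value = max(hits, 1) / n_samples
    return -math.log(p_value)


def p_score(weight, m_score_value, cutoff=0.75):
    """P_score = C_i * M_score, with sub-cutoff M_scores treated as zero."""
    return weight * m_score_value if m_score_value >= cutoff else 0.0
```

A matrix that matches background DNA rarely receives a large weight, while a matrix that matches almost anywhere receives a weight close to zero, which is the intended down-weighting of unspecific matrices.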

Gaussian Smoothing

To account for the change in the distances between/among binding sites, a

Gaussian smoothing was applied to the P_score profiles with a variance of 9. Formally,

the G_score profiles were calculated as follows:

f.3 G_score:

$$\mathrm{G\_score}[i,j] = \sum_{k=-7}^{7} \mathrm{P\_score}[i,j+k] \cdot \frac{1}{\sigma\sqrt{2\pi}}\, e^{-k^{2}/2\sigma^{2}}$$

where σ = 3 and k ranges from -7 to 7. In effect, a P_score can spread 7 positions to

both the right and the left due to the Gaussian smoothing. Smoothed P_scores beyond 7

positions were ignored due to their small values.
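A minimal sketch of the smoothing step for one row of a profile is given below. The kernel uses a standard deviation of 3 (variance 9) and a radius of 7 positions, as stated in the text; the normalization constant of the kernel and the treatment of positions near the sequence ends (contributions outside the sequence are simply dropped) are assumptions of this example.

```python
import math


def gaussian_smooth(p_row, sigma=3.0, radius=7):
    """Smooth one row of a P_score profile with a truncated Gaussian kernel."""
    weights = [
        math.exp(-(k * k) / (2 * sigma * sigma)) / (sigma * math.sqrt(2 * math.pi))
        for k in range(-radius, radius + 1)
    ]
    n = len(p_row)
    smoothed = [0.0] * n
    for j in range(n):
        total = 0.0
        for k in range(-radius, radius + 1):
            if 0 <= j + k < n:  # ignore contributions from outside the sequence
                total += p_row[j + k] * weights[k + radius]
        smoothed[j] = total
    return smoothed
```

A single non-zero P_score at position 10 is spread symmetrically over positions 3 to 17 and contributes nothing beyond them, so two binding sites whose spacing differs by a few base pairs between species still produce overlapping signal.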

Searching the Conserved Module in the Long Sequence

To identify a maximum match at the binding site profile level, the short G_score

profile was slid along the long G_score profile. BLISS_score at position n is the matching

score between the short profile and its corresponding region of equal length (length of the

short sequence) in the long profile at position n:










f.4 BLISS_score:

$$\mathrm{BLISS\_score}[n] = \frac{\sum_{i=0}^{M-1} \sum_{j=0}^{L-1} \mathrm{G\_score1}[i,j] \cdot \mathrm{G\_score2}[i,j+n]}{\mathrm{LengthOfShortSequence}}$$

where G_score1 is the G_score profile for the short sequence and G_score2 is the

G_score profile for the long sequence; L is the length of the short sequence; n is the

current location where the short sequence is aligned to the long sequence.
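The sliding comparison can be written compactly. The sketch below assumes that both profiles are given as lists of rows (one row per matrix and strand) with the same number of rows; the function name is illustrative.

```python
def bliss_scores(g_short, g_long):
    """BLISS_score[n] for every offset n at which the short profile fits in the long one."""
    n_rows = len(g_short)
    short_len = len(g_short[0])
    long_len = len(g_long[0])
    scores = []
    for n in range(long_len - short_len + 1):
        total = 0.0
        for i in range(n_rows):
            row_short, row_long = g_short[i], g_long[i]
            for j in range(short_len):
                total += row_short[j] * row_long[j + n]
        # Normalize by the length of the short sequence.
        scores.append(total / short_len)
    return scores
```

Because a term contributes only when the same matrix has signal at corresponding positions in both profiles, the score rewards shared binding sites in the same order and at similar spacing, independent of the intervening DNA sequence.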

Large-scale Search of the Simulated Sequences, Statistical Analysis

We used 10000 simulated sequence pairs generated by the above method to

calculate two BLISS score distributions. The first is the BLISS score distribution when

the hypothetical module in the short sequence is aligned with the conserved module in the

long sequence. The second is the BLISS_score distribution when the module is aligned

with a non-module segment of the longer sequence. For each pair of sequences,

BLISS_scores were calculated at each position as the short profile slid along the longer

profile. The peak matches (corresponding to the peaks in the score profile) between each

pair of sequences were evaluated to see whether they aligned the embedded modules. If the

match did include the alignment of the modules, it was designated a "true" match, and

this BLISS_score was used to calculate the distribution for matched modules. All of

the other BLISS_scores were used to calculate the distribution for the module matching

with the background sequence.
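The separation of peak scores into the two distributions can be sketched as follows. The criterion used here, that a peak is "true" when its offset lies within a small tolerance of the known insertion offset of the conserved module, is an assumption of this example; the text does not specify the exact criterion that was applied.

```python
def find_peaks(scores):
    """Return indices whose score is strictly greater than both neighbours."""
    peaks = []
    for n in range(len(scores)):
        left = scores[n - 1] if n > 0 else float("-inf")
        right = scores[n + 1] if n + 1 < len(scores) else float("-inf")
        if scores[n] > left and scores[n] > right:
            peaks.append(n)
    return peaks


def split_peaks(scores, expected_offset, tolerance=4):
    """Split peak scores into (module matches, background matches)."""
    true_scores, background_scores = [], []
    for n in find_peaks(scores):
        if abs(n - expected_offset) <= tolerance:
            true_scores.append(scores[n])
        else:
            background_scores.append(scores[n])
    return true_scores, background_scores
```

Pooling the two returned lists over all simulated pairs gives the empirical score distributions for module matches and for background matches.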

Searching for the eve2 Module in D. virilis and D. mojavensis Sequences

The GenBank (http://www.ncbi.nlm.nih.gov/) accession numbers for the S2E

sequences are AF042712 (D. pseudoobscura) and AF042709 (D. melanogaster). We










used BLISS to search for these two enhancers in 13 kb D. virilis and 14 kb D. mojavensis

sequences, in which S2E is hypothesized to be located, but the specific location is

unknown.

Ludwig et al. indicated that distances between TFBSs in two clusters in S2E (region

134-275 and region 484-684 for D. melanogaster, region 196-376 and region 692-866 for

D. pseudoobscura) were substantially conserved. We removed those regions and

searched for the modules in 13 kb D. virilis and 14 kb D. mojavensis sequences using

BLISS.

Website Construction

BLISS was implemented using HTML/JSP/JavaBean and is supported by an

Apache Tomcat 5.5 server. It is publicly available at:

http://genel.ufscc.ufl.edu:8080/blissWeb/index.html. The M_score profiles of TFBSs

were calculated based on the frequency matrix library collected by TRANSFAC 9.1.

BLISS used DISLIN (http://www.mps.mpg.de/dislin/), a plotting library for displaying

data, to draw the match score plot at run time.















CHAPTER 4
BINDING-SITE LEVEL IDENTIFICATION OF SHARED REGULATORY
MODULES IN TWO ORTHOLOGOUS SEQUENCES

Background

Determination of the bindings between Transcription Factors (TFs) and

Transcription Factor Binding Sites (TFBSs) is critical to understanding the mystery of

gene regulation. In higher organisms, a particular aspect of gene expression is rarely

controlled by a single TF, but rather by a combination of TFBSs called a regulatory module

(Maniatis et al. 1987; Johnson and McKnight 1989; Shaw 1992; Wang et al. 1993; Tjian

and Maniatis 1994). Genes are typically regulated by various regulatory modules in

response to different conditions (Yuh et al. 1998). Identification of such regulatory

modules is notoriously difficult due to the inherent biological complexity. The traditional

way is through biological experimentation (Galas and Schmitz 1978; Fried and Crothers

1981; Garner and Revzin 1981), which is in general costly and time-consuming. With

fully sequenced genomes, computational approaches have emerged as powerful tools to

supplement full-scale laboratory experiments in narrowing down candidate regulatory

modules and therefore dramatically reduce the effort in determining the regulatory

elements (Qiu 2003).

A number of software tools have been developed to search for regulatory modules

by cross-genome comparison (Sharan et al. 2003; Sandelin et al. 2004; Loots and

Ovcharenko 2004; Sinha et al. 2004; Aerts et al. 2005; Schwartz et al. 2003; Cora et al.

2004; Venkatesh and Yap 2005). The theoretical assumption underlying these tools is









that DNA sequences harboring TFBSs are conserved during evolution due to selective

pressure. Thus conserved regulatory modules should be found in the regions with high

DNA similarity. However, most of these tools are based directly on the measurement of

similarity at the DNA sequence level. It is known that TFBS sequences are degenerate in

nature and the DNA sequences between TFBSs are not under the pressure of selection

and therefore mutate rapidly. Therefore, it is highly probable that the DNA similarity is

not detectable even though the actual regulatory modules are conserved (Ludwig et al.

2005). Since the conservation pressure is at the binding site level, in our previous

research, we developed a novel methodology named BLISS to perform the cross-genome

comparison on binding site profiles (Meng et al. 2006). This was the first time that the

feasibility of comparison at the binding site level had been validated. As a complementary

tool in the field, BLISS demonstrates the ability to detect conserved regulatory modules

from diverged orthologous sequences where DNA similarity is undetectable.

It has been observed that the distances among a certain cluster of TFBSs in the

regulatory module are highly conserved during evolution, which may reflect the space

restriction essential for the protein-protein interactions among TFs required for

combinatorial transcription regulation (Ludwig et al. 2005). Due to biological

complexity, previous work in this direction has been limited to identifying composite

elements, pairs of transcription factors, whose binding sites tend to co-exist.

TransCompel (Kel-Margoulis et al. 2002) is such a database with collections of

composite elements. In this study, we extend the concept of "composite element" to

"regulatory complex". A regulatory complex is a cluster of TFBSs, where the type

(identity), number, and order of TFBSs are highly conserved during evolution and the









variation of inter-TFBS distances is within a certain range (less than 10 bps). A

regulatory complex is not necessarily conserved at the sequence level during the

evolution. It could, by itself, be a simple regulatory module or be a part of a complex

regulatory module. In previous research (Meng et al. 2006), BLISS showed that it was

more advantageous to find regulatory complexes rather than the regulatory module,

especially when the regulatory complex is not the only regulatory element in the

module. We therefore focus on the detection of conserved regulatory complexes in this

study and the regulatory modules could be predicted based on the identification of

conserved regulatory complexes within them.

The first release of BLISS (BLISS1) was limited in application to a short sequence

and a long sequence and it made the assumption that the short sequence harbored only

one regulatory module/complex (Meng et al. 2006). However, in most cases, biologists

have no prior knowledge about the locations of the conserved regulatory modules in the

sequences to be analyzed. In the second release of BLISS (BLISS2), we extended

BLISS1 to a general tool without the limitation of the assumption made in BLISS1 about

the analyzed sequences. In this study, we show that BLISS2 could predict conserved

regulatory modules by detecting all potential shared regulatory complexes.

Results

Simulating Sequence Pairs with Varying Complexity of Conserved Binding Site
Patterns

Simulated sequences were generated to test the performance of our algorithm since

very few well-investigated regulatory modules are available currently. Two testing data

sets were generated. The backbones of the sequences in the first data set were 1000 bp










(base pair) random DNA sequences and the backbones of the sequences in the second

data set were 5000 bps.

Table 4-1. Simulated data sets
             Sequence pairs   Length of       Inserted complexes in each sequence pair
             in each group    backbone (bp)   Group1   Group2   Group3   Group4
Data set 1   500              1000            0        1        2        3
Data set 2   500              5000            0        1        2        3


As shown in Table 4-1, each testing data set consisted of four groups of sequence

pairs and each group had 500 sequence pairs. No regulatory complex was inserted into

the first group of sequence pairs. In the second group, each sequence pair had one

conserved regulatory complex. Two and three conserved regulatory complexes were

inserted into each sequence pair respectively in the third and fourth group.

Each regulatory complex was composed of 4-8 binding sites, which were extracted

from TRANSFAC 9.1 database (Matys et al. 2003). Qiu et al. (Qiu et al. 2002) performed

a statistical analysis of the distances between TFBSs in the composite elements, which

had been experimentally confirmed and collected in the TransCompel database (version

3.0). Their analysis demonstrated that 87% of the distances between TFBSs in composite

elements are within 50 bps, and 65% are within 20 bps. The length of the inserted random

DNA segments between binding sites was based on this statistical analysis when we

formulated the regulatory complexes (see Methods for details).

Identifying Conserved Regulatory Complexes by Comparison at the Binding-site
Level

In our previous research (Meng et al. 2006), BLISS1 demonstrated the

feasibility of identifying conserved regulatory complexes in diverged DNA sequences at

the binding-site level. However, it was limited to predicting conserved regulatory










complexes from a short sequence and a long sequence and it required that the regulatory

complex existed in the short sequence. During the comparison, the short sequence was

slid along the long sequence to find the segment in the long sequence that had a

significant matching score with the short sequence at binding site level. In this study,

however, our focus is to extend BLISS1 to identify potentially conserved regulatory

segments from two orthologous sequences without prior knowledge about the locations

of regulatory complexes in the sequences.

Like BLISS1, BLISS2 calculated BLISS_score to represent the degree of similarity

at the binding site level for two sequence segments. The first three steps of BLISS2 were

to calculate M_scores, P_scores and G_scores respectively for two genomic sequences.

M_score is the binding site profile calculated using a matrix scoring method based on the

binding site frequency matrices collected in the TRANSFAC 9.1 database. To differentiate

simple matrices that match DNA sequences with a high frequency from those matrices

that match DNA sequences rarely, P_score was generated as the product of -log(p value

of M_score > cutoff) and the M_score. To account for the variation of distances

between/among binding sites in the conserved regulatory complexes, Gaussian smoothing

was applied to P_score to get G_score. The first three steps in BLISS2 were in essence

identical to those in BLISS1. The difference is that BLISS2 deals with two long

sequences and the locations and lengths of the regulatory complexes in the sequences are

unknown. The task of BLISS2 is to seek the matched segments from two sequences

whose binding site profiles are similar (not necessarily identical) and whose

BLISS_scores, the matching score at the binding site level, are statistically significant.

The statistical evaluation of BLISS_score in our previous study allowed us to evaluate a










particular BLISS_score and set a threshold for reporting significant matches. In this

study, BLISS2 took a heuristic approach. The general strategy of BLISS2 was to seek

seeds with a certain window size from the G_score profiles where their BLISS_scores are

greater than a certain BLISS_score cutoff. Then all the seeds were extended and the

shared regulatory regions were decided when the BLISS_score dropped below the

BLISS_score cutoff. (See Methods)
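The seed-and-extend strategy outlined above can be illustrated with a small sketch. It is not the BLISS2 implementation: the window score is taken here as the average, along a diagonal, of the position-by-position product of the two G_score profiles, and the extension rule (grow rightwards while the running average stays at or above the cutoff) is an assumption made for the example. All function names are illustrative.

```python
def pre_bliss(g1, g2):
    """pre[i][j] = sum over rows m of g1[m][i] * g2[m][j]."""
    n1, n2 = len(g1[0]), len(g2[0])
    return [
        [sum(r1[i] * r2[j] for r1, r2 in zip(g1, g2)) for j in range(n2)]
        for i in range(n1)
    ]


def window_score(pre, i, j, window):
    """Average pairwise score along the diagonal starting at (i, j)."""
    return sum(pre[i + t][j + t] for t in range(window)) / window


def find_seeds(pre, window, cutoff):
    """All window start pairs (i, j) whose window score reaches the cutoff."""
    n1, n2 = len(pre), len(pre[0])
    return [
        (i, j)
        for i in range(n1 - window + 1)
        for j in range(n2 - window + 1)
        if window_score(pre, i, j, window) >= cutoff
    ]


def extend_seed(pre, i, j, window, cutoff):
    """Extend a seed rightwards along its diagonal; return the final segment length."""
    n1, n2 = len(pre), len(pre[0])
    length = window
    total = sum(pre[i + t][j + t] for t in range(window))
    while i + length < n1 and j + length < n2:
        candidate = total + pre[i + length][j + length]
        if candidate / (length + 1) < cutoff:
            break
        total, length = candidate, length + 1
    return length
```

The exhaustive seed search shown here is quadratic in the sequence lengths and is only meant to make the idea concrete on short toy profiles.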

At the beginning, BLAST (E=0.01, Blast2seq) was applied to the sequence pairs in

the first data set in which the backbone of each sequence was 1000 bp long and there

were a total of 3000 conserved regulatory complexes. Of the 2000 simulated sequence

pairs, only 32.4% of them were reported to have detectable similarity. Even in those

positive cases, the average length of output alignments was 16 bps and the maximum

length of output alignments was 32 bps. Clearly, these output alignments were far shorter

than the length of the inserted regulatory complexes, which was not sufficient to allow

for the detection of the regulatory complexes. This analysis indicated that traditional tools

based on sequence level comparison are not enough to predict the conservation of

regulatory complex in diverged DNA sequences.

Table 4-2. Performance test of BLISS2 for M_score cutoff = 0.75. False Positives (FP)
and True Positives (TP) are listed under different window sizes and BLISS_score
cutoffs.
Window                         BLISS_score cutoff
size    Type    2.9    2.8    2.7    2.6    2.5    2.4    2.3    2.2    2.1
100     FP     1795   2565   3748   5272   7180   9262      -      -      -
100     TP      904   1045   1174   1307   1434   1537      -      -      -
150     FP      143    221    401    741   1362   2402   3940   5781      -
150     TP      732    897   1079   1269   1445   1616   1775   1900      -
200     FP       14     18     44     80    205    473   1106   2195   3836
200     TP      382    526    682    879   1083   1345   1558   1764   1899
250     FP        1      1      6     14     34     88    261    727   1707
250     TP      130    213    343    504    695    918   1189   1468   1663









Table 4-3. Performance test of BLISS2 for M_score cutoff = 0.8. False Positives (FP) and
True Positives (TP) are listed under different window sizes and BLISS_score
cutoffs.
Window                      BLISS_score cutoff
size    Type    1.9    1.8    1.7    1.6    1.5    1.4    1.3    1.2
100     FP     1907   2905   4435   6480   8991      -      -      -
100     TP     1168   1328   1488   1633   1759      -      -      -
150     FP      193    330    629   1193   2366   4306   6625      -
150     TP     1021   1229   1432   1664   1892   2049   2154      -
200     FP       26     45     81    195    523   1351   2975   5195
200     TP      617    818   1069   1346   1631   1880   2054   2137
250     FP        2      4     20     42    130    373   1213   2982
250     TP      298    464    662    932   1227   1539   1822   2006


In comparison, the testing results of BLISS2 under different 1) window sizes and 2)

BLISS_score cutoffs are listed in Tables 4-2 and 4-3.

True positive refers to the regulatory complexes that were inserted in the simulated

sequence pair and identified by BLISS2, while false positive refers to the regulatory

complexes that were reported by BLISS2 but did not really exist. Both Tables 4-2 and 4-3

indicate a balance between true positive and false positive under different BLISS_score

cutoffs. While the number of true positives improved when we chose a lower

BLISS_score cutoff, the false positives increased as well; on the other hand, when a

higher BLISS_score cutoff was applied, false positives decreased while true positives

were compromised.
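The trade-off can be made concrete with a short calculation on the window size 200 row of Table 4-2 (M_score cutoff 0.75). The sketch below computes, for each BLISS_score cutoff, the sensitivity relative to the 3000 inserted complexes and the fraction of reported complexes that are true (precision). These two summary measures are introduced here for illustration only and are not reported in the tables.

```python
# Window size 200, M_score cutoff 0.75 (values copied from Table 4-2).
CUTOFFS = [2.9, 2.8, 2.7, 2.6, 2.5, 2.4, 2.3, 2.2, 2.1]
FP = [14, 18, 44, 80, 205, 473, 1106, 2195, 3836]
TP = [382, 526, 682, 879, 1083, 1345, 1558, 1764, 1899]
TOTAL_COMPLEXES = 3000  # complexes inserted into the first data set


def summarize(cutoffs, tp, fp, total):
    """Return (cutoff, sensitivity, precision) for each cutoff."""
    rows = []
    for cutoff, t, f in zip(cutoffs, tp, fp):
        rows.append((cutoff, t / total, t / (t + f)))
    return rows


if __name__ == "__main__":
    for cutoff, sensitivity, precision in summarize(CUTOFFS, TP, FP, TOTAL_COMPLEXES):
        print(f"cutoff {cutoff:.1f}: sensitivity {sensitivity:.3f}, precision {precision:.3f}")
```

At the strictest cutoff most reported complexes are true but only a small share of the inserted complexes is recovered; at the most permissive cutoff the recovery is much higher but false positives outnumber true positives.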

Based on the testing results listed in Tables 4-2 and 4-3, a window size of 200 was

selected in the implementation of BLISS2 since, relative to other window sizes, it had a

better balance between false positives and true positives for both M_score cutoffs, 0.75 and

0.8. In the first testing data set, the length of each sequence was 1000 bps. The

performance test of BLISS2 proceeded under window size 200 using the second testing

data set, where the length of backbone of each sequence was 5000 bps.











[Figure: counts of false positives (FP-1, FP-2) and true positives (TP-1, TP-2) plotted against BLISS score cutoff (2.5 to 3.1).]


Figure 4-1. The effects of length of analyzed sequences on the performance of BLISS2.
Window size 200 and M_score cutoff 0.75 were applied in this comparison
test. FP-1 represents the false positive for the first data set; FP-2 represents
the false positive for the second data set; TP-1 represents the true positive
for the first data set; TP-2 represents the true positive for the second data set.

The results displayed in Figure 4-1 show that the true positives were almost the

same for both testing data sets, although the true positives for the second data set were

slightly fewer than those for the first data set. The major difference lies in the

false positives. When a higher BLISS_score cutoff was applied, false positives were not

significantly affected by the lengths of the sequences. However, the false positives

increased dramatically for the second testing data set when the BLISS_score cutoff was

decreased. Thus, for practical purposes, a high BLISS_score cutoff is suggested to avoid

a large number of false positives for longer sequences, although the sensitivity of

BLISS2 will be sacrificed at the same time.

Testing BLISS2 in Real-world Examples

Even-skipped gene (eve) is expressed in seven stripes in the embryo of Drosophila

melanogaster (D.mela) during the blastoderm stage. The enhancer for the second stripe









(S2E) is the best studied regulatory module and includes binding sites for five

transcription factors, bicoid (Bcd), giant (gt), hunchback (Hb), Kruppel (Kr) and sloppy-

paired (slp) (Small et al. 1992; Arnosti et al. 1996; Andrioli et al. 2002). D.mela, D.

yakuba (D.yaku), D. erecta, and D. pseudoobscura (D.pseu) belong to the same subgenus

(Sophophora) and experimental investigations on S2E suggest that S2E is conserved

during evolution among these four closely related species. A further detailed inspection

has revealed that S2E includes two regulatory complexes, where the binding sites and the

distances among binding sites are highly conserved (Ludwig et al. 2005). The make-ups

of S2Es in D.mela and D.pseu are displayed in Figure 4-2. However, no detailed

information has been published about S2E in two species: D. mojavensis (D.moja) and D.

virilis (D.viri), which are distantly related to the above four species and belong to a

separate subgenus (Drosophila). We extracted 6 kb sequences before the transcription

start region both from the D.moja and D.viri genomic sequences and applied BLISS2 to

identify S2E in them.


[Figure: regulatory complexes in S2E, showing Complex1 and Complex2 in D.mela and D.pseu.]

Figure 4-2. Regulatory complexes in S2E of D.mela and D.pseu. The two complexes are
highly conserved during the evolution.











In the first experiment, BLISS demonstrated its ability to predict the locations of

S2E both in D.viri and D.moja using S2E sequences of D.mela and D.pseu. For instance,

three regulatory complexes were reported (Figure 4-3) when BLISS was applied to the

S2E sequence of D.pseu and the 6k bp sequence from D.viri (M_score cutoff=0.75 and

BLISS_score cutoff=2.9). Inspection of the matched TFBSs listed by BLISS2 showed that the

second and third complexes were the two S2E regulatory complexes in D.mela and


D.pseu (Figure 4-4). Thus the position of S2E in D.viri could be located based on the

positions of those two complexes. Identifying S2E in D.moja was carried out in a

similar way.



Figure 4-3. Identification of S2E in the 6 kb D.viri sequence using the S2E of D.pseu. Three
complexes are predicted, and the second and third are the two complexes in S2E.










[Figure: regulatory complex 1 and regulatory complex 2 in S2E as located in the D.viri and D.moja sequences; legend: Bcd, Kr, Hb.]

b) TFBSs in complex 2.

In previous research (Meng et al. 2006), BLISS1 was applied to the same sequence

pair and the S2E in D.viri could be detected due to the match of one complex in S2E.

Another complex could not be matched at the same time because of the distance

difference between the two complexes in those two sequences. The BLISS_score

therefore was inevitably lower because the BLISS_score in BLISS was the average over









the whole range of the S2E sequence. That investigation indicated that the

BLISS_score would be more sensitive and significant when either complex sequence was

used as the short sequence. In practice, we do not have this kind of prior knowledge in most

cases. BLISS2 displayed its advantage over BLISS1 in that it was able to predict

regulatory complexes by detecting local maximum regions with similar binding site

profiles without requiring prior knowledge about the complexes in the orthologous

sequences.

BLISS2 further demonstrated its advantage over BLISS1 for identifying conserved

regulatory complexes in two long genomic sequences, which could not be carried out by

BLISS1 at all. For instance, when BLISS2 was applied to two 6 kb sequences before the

transcription start region of eve gene from D.pseu and D.viri (M_score cutoff=0.75 and

BLISS_score cutoff=2.9), 15 potential regulatory complexes were reported by BLISS2.

By inspection of the makeup of these complexes, it turned out that the 9th and 14th

complexes were those two regulatory complexes in S2E. In contrast, rVista2.0 could only

identify the first two binding sites in complex of S2E in the region with detectable DNA

similarity when the same binding site score cutoff (0.75) was applied. It is difficult for us

to verify the other 13 complexes identified by BLISS2 except for the two known S2E

complexes due to the lack of detailed regulatory information for those regions. However,

it is well known that those two 6 kb sequences are rich in regulatory information. At the

very least, they should harbor enhancers for the expression of the eve gene in some of the 7

stripes in the embryo.

Website Implementation of BLISS2

BLISS2 was implemented as web-based software that can be freely accessed by

the public. To start, BLISS2 requires the users to input two DNA sequences that










potentially harbor shared regulatory information. An M_score (binding site score)

threshold could be chosen by the user on the very same page. Then a color plot indicating

conserved regulatory regions is returned to the user with the statistical analysis of

BLISS_score under that M_score threshold. The user then chooses a BLISS_score cutoff

to output all conserved complexes with maximum BLISS_score above the cutoff. For

each complex, a table of contributing binding sites could be listed based on the product of

the p-value of the matching binding sites, or the location of the binding sites, or their

contributions to the BLISS_score. Currently, the length of each input sequence for

BLISS2 is limited to 8000 bps.

Discussion

In our previous investigation (Meng et al. 2006), we proved that it is feasible to

identify conserved regulatory modules from highly diverged orthologous genomic

sequences at the binding site level. In this study, we extended our previous method to the

two long sequence case and the regulatory modules could be detected by identifying

conserved regulatory complexes. The performance of BLISS2 was tested using both

simulated sequences and real world examples.

There are several limitations to BLISS2. First, BLISS2 is restricted to detecting

conserved binding site groups (regulatory complexes) with small inter-TFBS distance

variations. If the spaces between/among binding sites are not strictly required for the

interactions between/among transcription factors and the variation of inter-TFBS distance

is high, BLISS2 may miss those regulatory modules. The coverage and quality of the

frequency matrices for TFBSs is another limitation since the analysis of BLISS2 is totally

dependent on the accuracy of the binding site profiles. However, there is no doubt that









the coverage and quality of frequency matrices are being/will be improved dramatically

and the performance of BLISS2 is expected to be enhanced along with it.

Although further improvement is required, BLISS2 demonstrates the ability to

detect biologically significant similarities in those sequences where DNA sequence

similarity is undetectable. The basic idea underlying BLISS2 is novel but

straightforward, and it could therefore be implemented in a number of ways

to improve performance for different applications.

Conclusions

In summary, using simulated sequences and real world examples, BLISS2

demonstrated that regulatory complexes, the clusters of binding sites with small variation

of inter-TFBS distance during evolution, could be identified in highly diverged

orthologous sequences by comparison at the binding site level. By detecting those

conserved regulatory complexes and exploring matched TFBSs in them, BLISS2

facilitates the localization of regulatory regions and therefore can assist biomedical

scientists in deciphering the mystery of gene regulation. Future directions of the project

include developing new algorithms to improve the selectivity and sensitivity of the

current algorithm, detecting regulatory complexes from multiple orthologous sequences,

and identifying regulatory modules with higher inter-TFBS distance variations.

Methods

Generating Simulated Sequences

Two testing data sets were generated for the development of the methodology.

The first data set had 2000 sequence pairs, which were divided into four groups

with each group having 500 sequence pairs. The backbone for each sequence was









constructed as a 1000 bp random DNA sequence with the ratio among A, T, C and G being

30% : 30% : 20% : 20% since the non-transcribed genomic regions are generally AT rich.

Each regulatory complex contained binding sites for 4-8 transcription factors,

which were randomly extracted from the TRANSFAC 9.1 database. The range of distance (di)

between consecutive TFBSs was based on the statistical analysis by Qiu et al. (Qiu et al.

2002) on the TransCompel database: 65% of the distance between consecutive binding

sites was within 5-20 bps, 22% was within 21-50 bps and the remaining 13% was within 51-60

bps. The conserved regulatory complex for the second sequence was formulated based on

the following rule: First, it consisted of binding sites for the same transcription factors in

the corresponding regulatory complex in the first sequence. The binding sites of two

corresponding transcription factors were randomly extracted from the binding site

instances collected in TRANSFAC 9.1 database and therefore they could have the same

or different sequences; second, the respective order of binding sites in those two

corresponding regulatory complexes was kept the same; finally, dj, the distance between

consecutive binding sites in the conserved regulatory complex in the second sequence,

was randomly chosen from the range (di-Δd, di+Δd). Δd is the variation of

distance between corresponding binding sites in those two regulatory complexes and we

chose Δd = 4 in this study.

Each sequence in the sequence pairs of the first group was a 1000 bp random DNA
sequence with no inserted complex. For the second group, one regulatory complex was
inserted at a random location into the first sequence of each pair, and the corresponding
conserved regulatory complex was inserted at a random location into the second sequence.
In the same manner, two and three regulatory complexes were inserted into each sequence
pair for the third and fourth groups, respectively. This composition of simulated
sequences reflects the fact that the analyzed orthologous sequences may share 1) no
regulatory complex, 2) one regulatory complex, or 3) more than one regulatory complex.

The second test data set was generated in the same way, except that the backbone of
each sequence was 5000 bp long.
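The sampling rules in this subsection can be sketched in Python as follows. This is a minimal illustration, not the code used in the study: the function names are hypothetical, spacings are assumed to be drawn uniformly within each distance bin, and the range (d_i - Δd, d_i + Δd) is assumed to be an open interval over integers. Extraction of binding site instances from TRANSFAC is not shown.

```python
import random

def random_backbone(length, rng):
    """Random DNA backbone with A:T:C:G = 30:30:20:20."""
    return "".join(rng.choices("ATCG", weights=[30, 30, 20, 20], k=length))

# Distance bins (bp) and their frequencies, as reported for TransCompel.
SPACING_BINS = [(5, 20), (21, 50), (51, 60)]
SPACING_WEIGHTS = [65, 22, 13]

def sample_spacing(rng):
    """Draw d_i: pick a bin by frequency, then a distance uniformly in it (assumption)."""
    low, high = rng.choices(SPACING_BINS, weights=SPACING_WEIGHTS, k=1)[0]
    return rng.randint(low, high)

def conserved_spacing(d_i, rng, delta=4):
    """Draw d_j from the open integer interval (d_i - delta, d_i + delta) (assumption)."""
    return rng.randint(d_i - delta + 1, d_i + delta - 1)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
backbone = random_backbone(1000, rng)
n_sites = rng.randint(4, 8)  # 4-8 transcription factors per complex
spacings_1 = [sample_spacing(rng) for _ in range(n_sites - 1)]
spacings_2 = [conserved_spacing(d, rng) for d in spacings_1]
```

Passing `5000` instead of `1000` to `random_backbone` gives a backbone of the length used in the second data set.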

Identifying Conserved Regulatory Complexes by Comparison at the Binding-site
Level

M scores, P scores and G scores. M_scores, P_scores and G_scores for both DNA
sequences were calculated exactly as described in Chapter 3 (see the Methods section),
based on the frequency matrices collected in TRANSFAC 9.1. The G_score profile for each
sequence is an M × L matrix, where M is twice the number of TFBSs in the frequency matrix
library and L is the length of the sequence. The top half of the G_score matrix is the
score profile for the forward strand and the bottom half is the profile for the
complementary strand.

Pre_BLISS_score. The Pre_BLISS_score is an intermediate step in computing the
BLISS_score from the G_score profiles. It was calculated as follows:

\[
\mathrm{Pre\_BLISS\_score}[i, j] = \sum_{m} \mathrm{G\_score1}[m, i] \cdot \mathrm{G\_score2}[m, j]
\]

where G_score1 is the G_score profile of the first sequence, G_score2 is that of the
second sequence, and the sum runs over all M rows of the profiles. M is twice the number
of TFBSs in the frequency matrix library and m is the index of the TFBS. Pre_BLISS_score
is a two-dimensional L1 × L2 matrix, where L1 and L2 are the lengths of the two DNA
sequences, respectively.

BLISS_score. The BLISS_score is the average of the Pre_BLISS_scores along the
diagonal over a window of size w:

\[
\mathrm{BLISS\_score}[i, j] = \frac{1}{w} \sum_{k=-w/2}^{w/2} \mathrm{Pre\_BLISS\_score}[i + k, j + k]
\]

Like the Pre_BLISS_score, the BLISS_score is a two-dimensional L1 × L2 matrix, where
L1 and L2 are the lengths of the two DNA sequences.
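The two score definitions above can be illustrated with a short pure-Python sketch. It assumes the G_score profiles are given as lists of M rows of equal length, takes the Pre_BLISS_score as the sum over rows of the products of the two profiles, and treats diagonal positions that fall outside the matrix as contributing zero (boundary handling is not specified in the text). The function names are hypothetical.

```python
def pre_bliss_score(g1, g2):
    """Return the L1 x L2 matrix of sum_m g1[m][i] * g2[m][j].

    g1 is an M x L1 profile and g2 an M x L2 profile (lists of rows).
    """
    n_rows = len(g1)
    len1, len2 = len(g1[0]), len(g2[0])
    return [[sum(g1[m][i] * g2[m][j] for m in range(n_rows))
             for j in range(len2)]
            for i in range(len1)]

def bliss_score(pre, w):
    """Average Pre_BLISS_scores along the diagonal for k = -w/2 .. w/2.

    The sum is divided by w, as in the formula; out-of-range cells add 0
    (assumption).
    """
    half = w // 2
    len1, len2 = len(pre), len(pre[0])
    out = [[0.0] * len2 for _ in range(len1)]
    for i in range(len1):
        for j in range(len2):
            total = 0.0
            for k in range(-half, half + 1):
                a, b = i + k, j + k
                if 0 <= a < len1 and 0 <= b < len2:
                    total += pre[a][b]
            out[i][j] = total / w
    return out

# Tiny example: M = 2 rows, L1 = 3, L2 = 2.
g_score1 = [[1.0, 0.0, 2.0],
            [0.0, 1.0, 0.0]]
g_score2 = [[1.0, 2.0],
            [3.0, 0.0]]
pre = pre_bliss_score(g_score1, g_score2)
scores = bliss_score(pre, w=2)
```

Because the window runs along the diagonal, a run of matching binding sites at a roughly constant offset between the two sequences accumulates into one high-scoring diagonal stretch.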

Reporting Conserved Regulatory Complexes. Conserved regulatory complexes were
reported according to the BLISS_score matrix. First, the maximum point
BLISS_score[x, y] was found in the matrix. If BLISS_score[x, y] was less than the
BLISS_score cutoff, we reported that there was no conserved regulatory region in the
two sequences; otherwise, we extended this maximum point along the diagonal in both
directions until the BLISS_scores dropped below the BLISS_score cutoff. If the two end
points were BLISS_score[x1, y1] and BLISS_score[x2, y2], we reported one shared
regulatory complex, which spanned x1 - w/2 to x2 + w/2 in the first sequence and
y1 - w/2 to y2 + w/2 in the second sequence. The rest of the BLISS_score matrix was then
searched recursively in the same way to find other shared regulatory complexes.
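The reporting procedure can be sketched as a simple loop. This is only an illustration under stated assumptions: the text does not define "the rest of the matrix", so the sketch excludes the rows and columns already covered by a reported diagonal, and it clamps the reported coordinates to the sequence bounds. The function name and the zero-based indexing are hypothetical.

```python
def report_complexes(score, cutoff, w):
    """Report diagonal stretches of the score matrix that stay at or above cutoff.

    Returns a list of ((start1, end1), (start2, end2)) pairs, zero-based.
    """
    len1, len2 = len(score), len(score[0])
    half = w // 2
    used_rows, used_cols = set(), set()
    found = []

    def usable(a, b):
        # A cell can extend the diagonal if it is in range, not yet used,
        # and its score has not dropped below the cutoff.
        return (0 <= a < len1 and 0 <= b < len2
                and a not in used_rows and b not in used_cols
                and score[a][b] >= cutoff)

    while True:
        free = [(score[x][y], x, y)
                for x in range(len1) if x not in used_rows
                for y in range(len2) if y not in used_cols]
        if not free:
            return found
        best, x, y = max(free)
        if best < cutoff:
            return found
        x1, y1 = x, y
        while usable(x1 - 1, y1 - 1):   # extend toward the upper left
            x1, y1 = x1 - 1, y1 - 1
        x2, y2 = x, y
        while usable(x2 + 1, y2 + 1):   # extend toward the lower right
            x2, y2 = x2 + 1, y2 + 1
        found.append(((max(0, x1 - half), min(len1 - 1, x2 + half)),
                      (max(0, y1 - half), min(len2 - 1, y2 + half))))
        used_rows.update(range(x1, x2 + 1))   # assumption: mask covered rows
        used_cols.update(range(y1, y2 + 1))   # and columns before searching again

# Toy 6 x 6 matrix with one diagonal run and one isolated peak.
toy = [[0.0] * 6 for _ in range(6)]
toy[1][2], toy[2][3], toy[3][4] = 3.0, 5.0, 3.0
toy[5][0] = 4.0
regions = report_complexes(toy, cutoff=2.9, w=2)
```

In the toy example the diagonal run is reported first because it contains the global maximum, and the isolated peak is reported on the second pass.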

Performance Testing Using Simulated Sequences

Using our simulated data sets, the performance of BLISS2 was tested under
different BLISS_score cutoffs and window sizes. The numbers of true positives and
false positives under each condition were recorded.

Suppose a shared regulatory complex identified by BLISS spans x1 to x2 in the first
sequence and y1 to y2 in the second sequence, while the truly inserted regulatory
complex spans x1' to x2' in the first sequence and y1' to y2' in the second sequence.
We considered the prediction correct if both of the following held:

1. The identified complexes in both sequences covered more than 60% of the region of the
truly inserted complexes.

2. Two lines were drawn in two-dimensional space, one from point (x1, y1) to point
(x2, y2) and the other from point (x1', y1') to point (x2', y2'), and the distance
between these two lines was less than 10.

A regulatory complex that was predicted by BLISS2 and truly existed in the simulated
sequences was counted as a true positive, while a regulatory complex that was predicted
by BLISS2 but did not exist in the simulated sequences was counted as a false positive.
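The two correctness criteria can be expressed as a small check. The text does not say how the distance between the two line segments was measured; the sketch below assumes it is the perpendicular distance from the midpoint of the predicted segment to the line through the true segment. All function names are hypothetical.

```python
import math

def covered_fraction(pred, true):
    """Fraction of the true interval covered by the predicted interval."""
    (p1, p2), (t1, t2) = pred, true
    overlap = min(p2, t2) - max(p1, t1)
    return max(0, overlap) / (t2 - t1)

def point_to_line(px, py, a, b):
    """Perpendicular distance from (px, py) to the line through points a and b."""
    (ax, ay), (bx, by) = a, b
    numerator = abs((by - ay) * px - (bx - ax) * py + bx * ay - by * ax)
    return numerator / math.hypot(bx - ax, by - ay)

def is_correct(pred_x, pred_y, true_x, true_y, min_cover=0.6, max_dist=10):
    """Apply both criteria: >60% coverage in each sequence and line distance < 10."""
    covered = (covered_fraction(pred_x, true_x) > min_cover
               and covered_fraction(pred_y, true_y) > min_cover)
    mid_x = (pred_x[0] + pred_x[1]) / 2
    mid_y = (pred_y[0] + pred_y[1]) / 2
    dist = point_to_line(mid_x, mid_y,
                         (true_x[0], true_y[0]), (true_x[1], true_y[1]))
    return covered and dist < max_dist

# A prediction close to the true diagonal, and one shifted off it.
hit = is_correct((100, 300), (400, 600), (120, 320), (425, 625))
miss = is_correct((100, 300), (450, 650), (120, 320), (425, 625))
```

The second criterion matters because two regions can overlap well in each sequence separately and still pair the wrong positions with each other; the shifted prediction above passes the coverage test but fails the distance test.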

Searching for the eve2 Regulatory Complex in D. virilis and D. mojavensis

The GenBank (http://www.ncbi.nlm.nih.gov/) accession numbers for the S2E
sequences are AF042709 (D. melanogaster) and AF042712 (D. pseudoobscura). The two
6 kb sequences of D. virilis and D. mojavensis are the regions immediately upstream of
the transcription initiation site of the eve gene in the genomic sequences of D. virilis
and D. mojavensis. An M_score cutoff of 0.75 and a BLISS_score cutoff of 2.9 (the default
M_score and BLISS_score cutoffs of BLISS2) were used in both experiments.

Website Construction

BLISS2 is hosted on an Apache Tomcat 5.5 server and is available at
http://genel.ufscc.ufl.edu:8080/bliss2/index.html. The website was implemented with
JavaBean/JSP/HTML, and the color display of the BLISS_score was plotted with tools
provided by DISLIN (http://www.mps.mpg.de/dislin/).















CHAPTER 5
BLISS 2.0: THE WEB-BASED TOOL FOR PREDICTING CONSERVED
REGULATORY MODULES IN DISTANTLY-RELATED ORTHOLOGOUS
SEQUENCES

Identifying functional transcription factor binding sites (TFBSs) in the regulatory
regions of DNA sequences is essential to understanding gene regulation at the
transcriptional level. In eukaryotes, distinct TFBSs are often grouped together into
regulatory modules that control a particular aspect of gene expression. The composition
of a particular regulatory module can be identified by experimental approaches, for
example enhancer region dissection, DNase hypersensitivity assays, and DNA footprinting.
However, most of these approaches are time-consuming, laborious and costly. More
importantly, many of them, such as the DNase hypersensitivity assay, can only be applied
to a short DNA sequence (a few hundred base pairs). They become almost impossible to
apply if the location of the module cannot first be narrowed down to an experimentally
testable region.

With the emergence of bioinformatics, computational approaches have been developed
to predict regulatory modules and to help with the design and verification of
laboratory experiments. A number of these computational approaches (Aerts et al. 2005;
Sandelin et al. 2004; Loots and Ovcharenko 2004; Sharan et al. 2003; Sinha et al. 2004)
rely on cross-genome comparison. The assumption underlying these approaches is that DNA
sequences encoding functional TFBSs are conserved during evolution due to selective
pressure, whereas non-functional DNA sequences evolve (mutate) much faster. Therefore,
conserved regulatory modules are likely to be found in regions with high DNA similarity
between two orthologs.












While this type of approach has proven very helpful in some cases, it also has
limitations. Its success depends on identifying significant sequence-level similarity
between (or among) the DNA regions that harbor the regulatory modules. However, TFBS
sequences are degenerate in nature and are interspersed with non-functional sequences,
which are not under conservation pressure. It is therefore entirely possible that,
although the regulatory modules are conserved, the overall similarity of the sequences
harboring them is only marginal and cannot be distinguished from background.
Applications based on cross-genome comparison at the DNA sequence level will fail when
the orthologous DNA sequences are highly diverged. Therefore, there is a need for a
complementary tool that allows users to detect conserved regulatory modules in diverged
DNA sequences.

We developed a novel methodology, BLISS 2.0 (Binding-site Level Identification
of Shared Signal-module, version 2.0), for identifying evolutionarily conserved
regulatory modules in pairs of orthologs based on comparison at the binding site level,
reflecting the assumption that conservation pressure acts at the binding site level
rather than the DNA sequence level. By integrating Gaussian smoothing and statistical
analysis, we demonstrated that our approach outperforms regular sequence-level
comparisons when the orthologous DNA sequences are highly diverged. BLISS 2.0 is now
implemented as a web-based tool and can be accessed freely by the research community at:
http://genel.ufscc.ufl.edu:8080/bliss2/index.html.

















Figure 5-1. Web interface of BLISS 2.0. This is the homepage of BLISS 2.0. To start the
search, users are required to input two orthologous DNA sequences.


Given two orthologous DNA sequences, BLISS 2.0 is able to output all potentially
conserved regulatory modules between those two sequences. BLISS 2.0 analysis
proceeds in three major steps:


To begin, users are required to input two raw DNA sequences (Figure 5-1). BLISS
2.0 generates binding site profiles for each sequence based on the matrices collected in
TRANSFAC 9.1 (Matys et al. 2003). On the same page, users can choose a binding site
score cutoff. Either 0.75 or 0.8 can be selected as the cutoff that determines whether a
specific binding site is found at a given location in the DNA sequence. This cutoff is a
balance between specificity and sensitivity: a higher score cutoff increases the
specificity but decreases the sensitivity.



















Figure 5-2. The color plot of BLISS_score. BLISS_score indicates the degree of the
conservation at the binding site level between two sequences.

Second, by integrating Gaussian smoothing and statistical analysis into the comparison
of the binding site profiles, BLISS 2.0 expresses the degree of conservation at the
binding site level between the two sequences as BLISS_scores, which are visualized as a
two-dimensional color plot (Figure 5-2). The vertical color bar on the right side of the
color plot shows the BLISS_score value that each color represents. Continuous high
BLISS_scores along the diagonal direction indicate a potential match of conserved
regulatory modules between the two sequences. To make it possible to evaluate a
particular BLISS_score, we analyzed the distribution of BLISS_scores using simulated
sequence pairs, as shown in Figure 5-3. There are two BLISS_score distributions. The
left one is the distribution obtained when a random sequence is matched with a random
sequence or with a sequence harboring a regulatory module, and the right one is the
distribution obtained when two conserved regulatory modules are matched. Based on this
statistical analysis, users can choose a BLISS_score cutoff so that all matched regions
with a BLISS_score greater than this cutoff are reported as conserved regulatory
modules. Users can also choose how to rank the shared TFBSs in the reported regulatory
modules: by location, by numeric contribution to the BLISS_score, or by the product of
the p-values of the matching TFBSs on both sequences.

Figure 5-3. Statistical analysis and distribution of BLISS_score. Statistical analysis of
BLISS_score helps to evaluate a BLISS_score. [Plot: BLISS score distribution (binding
site score cutoff = 0.75); x-axis: BLISS score.]

Finally, all the matching regulatory modules with a BLISS_score greater than the
cutoff chosen by the user are displayed on the third page. Contributing TFBSs are listed
in a separate table for each matched region. BLISS 2.0 provides the option to let users
highlight the TFBSs they are interested in in green, as shown in Figure 5-4.

Figure 5-4. Output of contributing TFBSs. Contributing TFBSs are listed in a separate
table for each matched CRM. Users can choose to highlight the TFBSs they are interested
in in green.

Many researchers have encountered the situation in which laboratory experiments in a
model organism have narrowed down the location of the regulatory module for a particular
gene to a relatively short region, whereas for the ortholog in a less-studied organism,
information about the location of the module is absent. To deal with this special case,
BLISS 2.0 provides a complementary tool, Single Cis-Regulatory Module Search, to locate
the conserved regulatory module in the ortholog of the less-studied organism. We suggest
that users use this tool especially when one of the orthologous DNA sequences is shorter
than 200 bp.








The advantage of BLISS 2.0 is that it allows the detection of conserved regulatory
modules in highly diverged sequences, where it outperforms existing methods based on
sequence similarity comparison. We have successfully applied BLISS 2.0 to identify the
even-skipped (eve) stripe 2 enhancer (S2E) in D. mojavensis and D. virilis, about which
no detailed information has been published and which cannot be detected by existing
tools such as BLAST and rVista 2.0.















CHAPTER 6
CONCLUSIONS AND SUGGESTIONS FOR FUTURE STUDY

In this study, we developed a novel methodology to identify conserved regulatory
modules at the binding site level and implemented it as a freely accessible web-based
application named BLISS 2.0. To our knowledge, this is the first time that the
feasibility of comparison at the binding site level has been confirmed. In the first
release of BLISS, we assumed that biologists have the sequence of a regulatory module,
and BLISS could identify the conserved module in an orthologous sequence where DNA
similarity could not be detected and methods based on sequence similarity therefore
failed. After the success of BLISS 1, we extended the methodology to a more general
scenario. In BLISS2, potential regulatory modules can be predicted between two
orthologous sequences by detecting conserved regulatory complexes.

The advantage of this methodology is that it allows the detection of conserved
regulatory modules in highly divergent sequences, as we have demonstrated both with
simulated sequences and with real-world examples. BLISS is thus complementary to many
existing methods based on nucleotide sequence similarity. It outperforms these sequence
similarity-based methods when the orthologous DNA sequences are highly diverged. BLISS
therefore serves as a valuable tool to help biomedical scientists identify functional
regulatory modules and design experimental verification strategies.

This study investigated a novel cross-genome comparison strategy and thereby opened
a new area for future research, which includes developing new algorithms to improve the
performance of the methodology, comparing multiple sequences at the binding site level,
and identifying regulatory modules with higher inter-TFBS distance variations.















APPENDIX A
PARAMETER OPTIMIZATIONS FOR GAUSSIAN SMOOTHING METHOD

Using simulated sequences, the performance of the Gaussian smoothing method was
tested under different parameters: 1) the M_score cutoff, 2) the variance used for
Gaussian smoothing, and 3) whether or not the p value of the M_score was integrated
(Figure A-1).



[Plot: Gaussian Smooth Method Tests Using Simulated Sequences; x-axis: Score Cutoff,
0.5 to 1; legend entries include Var=0 and Var=0 to Var=3 (P value).]


Figure A-1. Parameter optimizations for Gaussian smoothing method used in BLISS.

The results indicated that:

1. Integrating the p value of the M_score into the Gaussian smoothing method greatly
increased the performance of the method.

2. A larger variance gave better performance.

3. BLISS made the best predictions when the binding site score cutoff was between 0.75
and 0.85. Because the binding sites in TRANSFAC are identified binding sites and tend to
have higher binding site scores, 0.75 and 0.8 were chosen as the default cutoffs for the
binding site score in the implementation of BLISS.















APPENDIX B
A DYNAMIC PROGRAMMING ALGORITHM FOR IDENTIFYING CONSERVED
REGULATORY MODULES

Dynamic programming algorithms have been applied extensively in computational
sequence analysis. In this study, we attempted to apply dynamic programming to the
binding site profiles to identify conserved regulatory modules in orthologous
sequences.

The idea underlying this dynamic programming algorithm is to look for the best
local alignments between the binding site profiles of two sequences. The algorithm
consists of three steps:

1. The P_score profiles of the two sequences were calculated exactly as described in the
Methods section of Chapter 3.

2. A matrix F was constructed from the P_score profiles of the two sequences, where
F(i, j) is the score of the best alignment between the initial segment of positions 1 to
i of the first sequence and the initial segment of positions 1 to j of the second
sequence. F(i, j) was computed recursively from F(i-1, j-1), F(i-1, j) and F(i, j-1):

\[
F(i, j) = \max
\begin{cases}
0 \\
F(i-1, j-1) + s(\mathrm{seq1}_i, \mathrm{seq2}_j) \\
F(i-1, j) + d \\
F(i, j-1) + d
\end{cases}
\]

F(i, 0) and F(0, j) were set to 0 for all i and j. The four cases are as follows.

F(i, j) = 0. When F(i, j) would otherwise have a negative score, it is given the value 0
to start a new alignment.

F(i, j) = F(i-1, j-1) + s(seq1_i, seq2_j). When position i in sequence 1 and position j
in sequence 2 have at least one shared binding site, s(seq1_i, seq2_j) is the maximum of
P_score1(k, i) and P_score2(k, j) over all k, where k is the index of the binding site.
When there is no shared binding site at position i in sequence 1 and position j in
sequence 2, but there is at least one binding site at either position, s(seq1_i, seq2_j)
is set to a penalty called "TF_penalty". When there is no binding site at either
position, s(seq1_i, seq2_j) is set to zero.

F(i, j) = F(i, j-1) + d. This is the case when seq2_j is aligned to a gap. When there
is at least one binding site at position j of sequence 2, d is set to "TF_penalty";
when there is no binding site at position j of sequence 2, d is set to "N_penalty".

F(i, j) = F(i-1, j) + d. This is the case when seq1_i is aligned to a gap. When there
is at least one binding site at position i of sequence 1, d is set to "TF_penalty";
when there is no binding site at position i of sequence 1, d is set to "N_penalty".

3. After the construction of matrix F, the maximum F(i, j) was located; it corresponds
to the end points of the conserved regulatory module. The alignment was then traced back
to a point with value 0, which corresponds to the start points of the conserved module.
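The three steps above follow the pattern of a Smith-Waterman-style local alignment, which can be sketched as below. This is an illustration under assumptions, not the implementation used in the study: each profile is represented as a list with one dictionary per position mapping a binding site index to its P_score, the match score is taken as the largest P_score among the shared binding sites, and the function names are hypothetical. The penalty values are those reported as best in this appendix.

```python
TF_PENALTY = -9.0   # mismatch or gap at a position that has a binding site
N_PENALTY = -1.0    # gap at a position without any binding site

def match_score(sites1, sites2):
    """s(seq1_i, seq2_j) for two positions given as {site index: P_score} dicts."""
    shared = sites1.keys() & sites2.keys()
    if shared:
        # Assumption: take the largest P_score among the shared binding sites.
        return max(max(sites1[k], sites2[k]) for k in shared)
    if sites1 or sites2:
        return TF_PENALTY
    return 0.0

def gap_score(sites):
    """d for aligning one position to a gap."""
    return TF_PENALTY if sites else N_PENALTY

def local_align(profile1, profile2):
    """Return (best score, (start1, start2), (end1, end2)) with 1-based positions,
    or None when no positive-scoring alignment exists."""
    n, m = len(profile1), len(profile2)
    F = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best, best_cell = 0.0, None
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = [
                (0.0, None),  # start a new alignment
                (F[i-1][j-1] + match_score(profile1[i-1], profile2[j-1]), (i-1, j-1)),
                (F[i-1][j] + gap_score(profile1[i-1]), (i-1, j)),
                (F[i][j-1] + gap_score(profile2[j-1]), (i, j-1)),
            ]
            F[i][j], back[i][j] = max(candidates, key=lambda c: c[0])
            if F[i][j] > best:
                best, best_cell = F[i][j], (i, j)
    if best_cell is None:
        return None
    i, j = best_cell
    while back[i][j] is not None:   # trace back to a cell with value 0
        i, j = back[i][j]
    return best, (i + 1, j + 1), best_cell

# Two short profiles sharing sites "A" and "B" at a shifted offset.
profile_a = [{}, {"A": 5.0}, {}, {"B": 4.0}, {}, {}]
profile_b = [{}, {}, {"A": 5.0}, {}, {"B": 4.0}, {}]
result = local_align(profile_a, profile_b)
```

In the example the alignment pairs positions 2 to 4 of the first profile with positions 3 to 5 of the second, and the score is the sum of the two shared-site scores.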


This dynamic programming algorithm was applied to 500 simulated sequence pairs,
each of which had only one conserved module. This sequence group corresponds to the
second group of simulated sequences in the first data set used for the BLISS2
performance test in Chapter 4. The best results were achieved when TF_penalty was -9
and N_penalty was -1. When 0.8 was chosen as the M_score cutoff, the conserved
regulatory modules in 168 sequence pairs were detected, and the remaining 332 local
maxima reported by the dynamic programming algorithm were false positives.

In conclusion, this study addressed the feasibility of identifying conserved
regulatory modules in orthologous sequences by applying a dynamic programming
algorithm to the binding site profiles. However, the Gaussian smoothing method
developed in Chapter 3 and Chapter 4 achieved better performance than this dynamic
programming algorithm. BLISS therefore uses the Gaussian smoothing method, rather than
the dynamic programming algorithm, in its web implementation.















APPENDIX C
AN IMPROVED ALGORITHM FOR BLISS2

In the current algorithm for BLISS2, we take the sum of G_scores over all TFs at each
matched position when the BLISS_score is calculated. The performance of BLISS2
improved (Tables C-1 and C-2) when we instead took the maximum of the G_scores at each
matched location.

Table C-1. Performance test of the improved BLISS2 for M_score cutoff = 0.75. False
Positives (FP) and True Positives (TP) are listed under different window sizes
and BLISS_score cutoffs.
BLISS_score cutoff: 0.27, 0.26, 0.25, 0.24, 0.23, 0.22, 0.21
Window size 100  FP: 2565, 3748, 5272, 7180
Window size 100  TP: 1045, 1174, 1307, 1434
Window size 150  FP: 262, 663, 2412, 6496
Window size 150  TP: 1185, 1614, 1956, 2157
Window size 200  FP: 40, 92, 416, 2157, 6096
Window size 200  TP: 724, 1165, 1686, 2062, 2167
Window size 250  FP: 6, 23, 75, 617, 3277
Window size 250  TP: 352, 699, 1180, 1736, 2000



Table C-2. Performance test of the improved BLISS2 for M_score cutoff = 0.8. False
Positives (FP) and True Positives (TP) are listed under different window sizes
and BLISS_score cutoffs.
BLISS_score cutoff: 0.28, 0.27, 0.26, 0.25, 0.24, 0.23, 0.22
Window size 100  FP: 1779, 2968, 5489
Window size 100  TP: 1373, 1602, 1794
Window size 150  FP: -, 288, 575, 1412, 3572, 6805
Window size 150  TP: -, 1526, 1822, 2078, 2263, 2340
Window size 200  FP: -, 46, 68, 185, 686, 2395
Window size 200  TP: -, 1034, 1374, 1734, 2063, 2259
Window size 250  FP: 16, 36, 142, 684, 2527
Window size 250  TP: 851, 1222, 1622, 1967, 2190









The BLISS_score for the improved algorithm is calculated in the same way as in the
BLISS2 algorithm, except that the Pre_BLISS_score is calculated as:

\[
\mathrm{Pre\_BLISS\_score}[i, j] = \max_{m} \left( \mathrm{G\_score1}[m, i] \cdot \mathrm{G\_score2}[m, j] \right), \quad m = 0, 1, 2, \ldots
\]

where m is the index of the TFBS.
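The difference between the two Pre_BLISS_score variants can be seen in a small pure-Python sketch. It assumes the G_score profiles are lists of M rows (one per TFBS and strand), as defined in Chapter 4; the function names are hypothetical.

```python
def pre_bliss_sum(g1, g2):
    """Original variant: sum the per-TFBS products at each pair of positions."""
    rows = range(len(g1))
    return [[sum(g1[m][i] * g2[m][j] for m in rows)
             for j in range(len(g2[0]))]
            for i in range(len(g1[0]))]

def pre_bliss_max(g1, g2):
    """Improved variant: keep only the largest per-TFBS product."""
    rows = range(len(g1))
    return [[max(g1[m][i] * g2[m][j] for m in rows)
             for j in range(len(g2[0]))]
            for i in range(len(g1[0]))]

# Two TFBS rows that overlap at the first position of sequence 1.
g_score1 = [[1.0, 0.0],
            [2.0, 1.0]]
g_score2 = [[3.0],
            [1.0]]
summed = pre_bliss_sum(g_score1, g_score2)
largest = pre_bliss_max(g_score1, g_score2)
```

Where two TFBSs overlap at the same pair of positions, the sum counts both contributions while the maximum counts only the stronger one; where a single TFBS matches, the two variants agree.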

The improvement was more pronounced for the second data set used in Chapter 4,
which had longer backbone sequences. Table C-3 compares the performance of the two
algorithms for a window size of 200 and a binding site score cutoff of 0.75.

Table C-3. Performance comparison of the two BLISS2 algorithms for the second data set
when the M_score cutoff is 0.75. False Positives (FP) and True Positives (TP) are listed
under different BLISS_score cutoffs.
Algorithm         Type     BLISS_score cutoffs and counts
BLISS2            Cutoff   2.9    2.8    2.7    2.6    2.5
                  FP       33     111    359    1014   2778
                  TP       340    458    611    804    1029
Improved BLISS2   Cutoff   0.25   0.245  0.24   0.235  0.23
                  FP       27     79     217    861    2953
                  TP       748    937    1198   1431   1653

This improvement may reflect the fact that only one TF can bind to a given location
at a time in the real biological environment, so the performance of the original BLISS2
algorithm could be compromised by the repeated contributions of overlapping TFBSs to
the BLISS_score. However, the results included more false positives when the improved
algorithm was applied to the S2E example. Since S2E is the only real-world example we
have so far, more research and more real-world examples are required to further test
this improved algorithm for BLISS2.

The BLISS_score distributions for the improved algorithm were calculated using the
same method as described in Chapter 3 and are shown in Figure C-1 and Figure C-2.











[Plot: distribution of BLISS score; x-axis: BLISS score, 0 to 0.45.]

Figure C-1. Distribution of BLISS_score for the improved BLISS2 algorithm with an
M_score cutoff of 0.75.


[Plot: distribution of BLISS score; x-axis: BLISS score, 0 to 0.5; legend: Module
matches with random sequence, Module matches with module.]

Figure C-2. Distribution of BLISS_score for the improved BLISS2 algorithm with an
M_score cutoff of 0.8.


















LIST OF REFERENCES


Aerts S, Thijs G, Coessens B, Staes M, Moreau Y, De Moor B: Toucan: deciphering the
cis-regulatory logic of coregulated genes. Nucleic Acids Res 2003, 31(6):1753-1764.

Aerts S, Van Loo P, Thijs G, Mayer H, de Martin R, Moreau Y, De Moor B: TOUCAN
2: the all-inclusive open source workbench for regulatory sequence analysis.
Nucleic Acids Res 2005, 33(Web Server issue):W393-396.

Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search
tool. J Mol Biol 1990, 215:403-410.

Andrioli LP, Vasisht V, Theodosopoulou E, Oberstein A, Small S: Anterior repression
of a Drosophila stripe enhancer requires three position-specific mechanisms.
Development 2002, 129(21):4931-4940.

Arnosti DN: Design and function of transcriptional switches in Drosophila. Insect
Biochem Mol Biol 2002, 32(10):1257-1273.

Arnosti DN, Barolo S, Levine M, Small S: The eve stripe 2 enhancer employs multiple
modes of transcriptional synergy. Development 1996, 122(1):205-214.

Bailey TL, Elkan C: The value of prior knowledge in discovering motifs with MEME.
Proc Int Conf Intell Syst Mol Biol 1995, 3:21-29.

Bailey TL, Gribskov M: Methods and statistics for combining motif match scores. J
Comput Biol 1998, 5(2):211-221.

Bailey TL, Gribskov M: Concerning the accuracy of MAST E-values. Bioinformatics
2000, 16(5):488-489.

Benos PV, Lapedes AS, Stormo GD: Probabilistic code for DNA recognition by
proteins of the EGR family. J Mol Biol 2002, 323(4):701-727.

Berman BP, Nibu Y, Pfeiffer BD, Tomancak P, Celniker SE, Levine M, Rubin GM,
Eisen MB: Exploiting transcription factor binding site clustering to identify
cis-regulatory modules involved in pattern formation in the Drosophila
genome. Proc Natl Acad Sci USA 2002, 99(2):757-762.









Berman BP, Pfeiffer BD, Laverty TR, Salzberg SL, Rubin GM, Eisen MB, Celniker SE:
Computational identification of developmental enhancers: conservation and
function of transcription factor binding-site clusters in Drosophila
melanogaster and Drosophila pseudoobscura. Genome Biol 2004, 5(9):R61.

Boulikas T: A compilation and classification of DNA binding sites for protein
transcription factors from vertebrates. Crit Rev Eukaryot Gene Expr 1994,
4(2-3):117-321.

Bray N, Dubchak I, Pachter L: AVID: A global alignment program. Genome Res 2003,
13(1):97-102.

Bucher P: Weight matrix descriptions of four eukaryotic RNA polymerase II
promoter elements derived from 502 unrelated promoter sequences. J Mol Biol
1990, 212(4):563-578.

Bulyk ML, Johnson PL, Church GM: Nucleotides of transcription factor binding sites
exert interdependent effects on the binding affinities of transcription factors.
Nucleic Acids Res 2002, 30(5):1255-1261.

Che D, Jensen S, Cai L, Liu JS: BEST: binding-site estimation suite of tools.
Bioinformatics 2005, 21(12):2909-2911.

Chekmenev DS, Haid C, Kel AE: P-Match: transcription factor binding site search by
combining patterns and weight matrices. Nucleic Acids Res 2005, 33(Web
Server issue):W432-437.

Chen QK, Hertz GZ, Stormo GD: MATRIX SEARCH 1.0: a computer program that
scans DNA sequences for transcriptional elements using a database of weight
matrices. Comput Appl Biosci 1995, 11(5):563-566.

Cora D, Di Cunto F, Provero P, Silengo L, Caselle M: Computational identification of
transcription factor binding sites by functional analysis of sets of genes sharing
overrepresented upstream motifs. BMC Bioinformatics 2004, 5:57.

Day WH, McMorris FR: Critical comparison of consensus methods for molecular
sequences. Nucleic Acids Res 1992, 20(5):1093-1099.

Elkon R, Linhart C, Sharan R, Shamir R, Shiloh Y: Genome-wide in silico
identification of transcriptional regulators controlling the cell cycle in human
cells. Genome Res 2003, 13(5):773-780.

Erives A, Levine M: Coordinate enhancers share common organizational features in
the Drosophila genome. Proc Natl Acad Sci USA 2004, 101(11):3851-3856.

Fickett JW: Coordinate positioning of MEF2 and myogenin binding sites. Gene 1996,
172(1):GC19-32.









Frazer KA, Elnitski L, Church DM, Dubchak I, Hardison RC: Cross-species sequence
comparisons: a review of methods and available resources. Genome Res 2003,


Frech K, Herrmann G, Werner T: Computer-assisted prediction, classification, and
delimitation of protein binding sites in nucleic acids. Nucleic Acids Res 1993,
21(7):1655-1664.

Frech K, Quandt K, Werner T: Finding protein-binding sites in DNA sequences: the
next generation. Trends Biochem Sci 1997, 22(3):103-104.

Fried M, Crothers DM: Equilibria and kinetics of lac repressor-operator interactions
by polyacrylamide gel electrophoresis. Nucleic Acids Res 1981, 9(23):6505-6525.

Frith MC, Hansen U, Weng Z: Detection of cis-element clusters in higher eukaryotic
DNA. Bioinformatics 2001, 17(10):878-889.

Galas DJ, Schmitz A: DNAse footprinting: a simple method for the detection of
protein-DNA binding specificity. Nucleic Acids Res 1978, 5(9):3157-3170.

Garner MM, Revzin A: A gel electrophoresis method for quantifying the binding of
proteins to specific DNA regions: application to components of the Escherichia
coli lactose operon regulatory system. Nucleic Acids Res 1981, 9(13):3047-3060.

Gershenzon NI, Stormo GD, Ioshikhes IP: Computational technique for improvement of
the position-weight matrices for the DNA/protein binding sites. Nucleic Acids Res
2005, 33(7):2290-2301.

Grad YH, Roth FP, Halfon MS, Church GM: Prediction of similarly acting cis-
regulatory modules by subsequence profiling and comparative genomics in
Drosophila melanogaster and D. pseudoobscura. Bioinformatics 2004,
20(16):2738-2750.

Halfon MS, Michelson AM: Exploring genetic regulatory networks in metazoan
development: methods and models. Physiol Genomics 2002, 10(3):131-143.

Hannenhalli S, Levy S: Predicting transcription factor synergism. Nucleic Acids Res
2002, 30(19):4278-4284.

Heinemeyer T, Wingender E, Reuter I, Hermjakob H, Kel AE, Kel OV, Ignatieva EV,
Ananko EA, Podkolodnaya OA, Kolpakov FA et al: Databases on transcriptional
regulation: TRANSFAC, TRRD and COMPEL. Nucleic Acids Res 1998,
26(1):362-367.

Hernandez-Munain C, Krangel MS: Regulation of the T-cell receptor delta enhancer
by functional cooperation between c-Myb and core-binding factors. Mol Cell
Biol 1994, 14(1):473-483.










Hu Z, Frith M, Niu T, Weng Z: SeqVISTA: a graphical tool for sequence feature
visualization and comparison. BMC Bioinformatics 2003, 4:1.

Hu Z, Fu Y, Halees AS, Kielbasa SM, Weng Z: SeqVISTA: a new module of
integrated computational tools for studying transcriptional regulation. Nucleic
Acids Res 2004, 32(Web Server issue):W235-241.

Huang H, Kao MC, Zhou X, Liu JS, Wong WH: Determination of local statistical
significance of patterns in Markov sequences with application to promoter
element identification. J Comput Biol 2004, 11(1):1-14.

Jegga AG, Sherwood SP, Carman JW, Pinski AT, Phillips JL, Pestian JP, Aronow BJ:
Detection and visualization of compositionally similar cis-regulatory element
clusters in orthologous and coordinately controlled genes. Genome Res 2002,
12(9):1408-1417.

Johansson O, Alkema W, Wasserman WW, Lagergren J: Identification of functional
clusters of transcription factor binding motifs in genome sequences: the
MSCAN algorithm. Bioinformatics 2003, 19 Suppl 1:i169-176.

Johnson PF, McKnight SL: Eukaryotic transcriptional regulatory proteins. Annu Rev
Biochem 1989, 58:799-839.

Kel AE, Gossling E, Reuter I, Cheremushkin E, Kel-Margoulis OV, Wingender E:
MATCH: A tool for searching transcription factor binding sites in DNA
sequences. Nucleic Acids Res 2003, 31(13):3576-3579.

Kel AE, Kolchanov NA, Kel OV, Romashchenko AG, Anan'ko EA, Ignat'eva EV,
Merkulova TI, Podkolodnaia OA, Stepanenko IL, Kochetov AV et al: [TRRD: a
database of transcription regulatory regions in eukaryotic genes]. Mol Biol
(Mosk) 1997, 31(4):626-636.

Kel AE, Kondrakhin YV, Kolpakov Ph A, Kel OV, Romashenko AG, Wingender E,
Milanesi L, Kolchanov NA: Computer tool FUNSITE for analysis of eukaryotic
regulatory genomic sequences. Proc Int Conf Intell Syst Mol Biol 1995, 3:197-205.

Kel OV, Romaschenko AG, Kel AE, Wingender E, Kolchanov NA: A compilation of
composite regulatory elements affecting gene transcription in vertebrates.
Nucleic Acids Res 1995, 23(20):4097-4103.

Kel-Margoulis OV, Kel AE, Reuter I, Deineko IV, Wingender E: TRANSCompel: a
database on composite regulatory elements in eukaryotic genes. Nucleic Acids
Res 2002, 30(1):332-334.










Kel-Margoulis OV, Romashchenko AG, Kolchanov NA, Wingender E, Kel AE:
COMPEL: a database on composite regulatory elements providing
combinatorial transcriptional regulation. Nucleic Acids Res 2000, 28(1):311-315.

Knuppel R, Dietze P, Lehnberg W, Frech K, Wingender E: TRANSFAC retrieval
program: a network model database of eukaryotic transcription regulating
sequences and proteins. J Comput Biol 1994, 1(3):191-198.

Kolchanov NA, Ananko EA, Podkolodnaya OA, Ignatieva EV, Stepanenko IL, Kel-
Margoulis OV, Kel AE, Merkulova TI, Goryachkovskaya TN, Busygina TV et al:
Transcription Regulatory Regions Database (TRRD): its status in 1999. Nucleic
Acids Res 1999, 27(1):303-306.

Kolchanov NA, Podkolodnaya OA, Ananko EA, Ignatieva EV, Stepanenko IL, Kel-
Margoulis OV, Kel AE, Merkulova TI, Goryachkovskaya TN, Busygina TV et al:
Transcription regulatory regions database (TRRD): its status in 2000. Nucleic
Acids Res 2000, 28(1):298-301.

Kondrakhin YV, Kel AE, Kolchanov NA, Romashchenko AG, Milanesi L: Eukaryotic
promoter recognition by binding sites for transcription factors. Comput Appl
Biosci 1995, 11(5):477-488.

Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF, Wootton JC: Detecting
subtle sequence signals: a Gibbs sampling strategy for multiple alignment.
Science 1993, 262(5131):208-214.

Lawrence CE, Reilly AA: An expectation maximization (EM) algorithm for the
identification and characterization of common sites in unaligned biopolymer
sequences. Proteins 1990, 7(1):41-51.

Lee TI, Rinaldi NJ, Robert F, Odom DT, Bar-Joseph Z, Gerber GK, Hannett NM,
Harbison CT, Thompson CM, Simon I et al: Transcriptional regulatory
networks in Saccharomyces cerevisiae. Science 2002, 298(5594):799-804.

Lemaigre FP, Lafontaine DA, Courtois SJ, Durviaux SM, Rousseau GG: Sp1 can
displace GHF-1 from its distal binding site and stimulate transcription from
the growth hormone gene promoter. Mol Cell Biol 1990, 10(4):1811-1814.

Lenhard B, Sandelin A, Mendoza L, Engstrom P, Jareborg N, Wasserman WW:
Identification of conserved regulatory elements by comparative genome
analysis. J Biol 2003, 2(2):13.

Lewis H, Kaszubska W, DeLamarter JF, Whelan J: Cooperativity between two NF-
kappa B complexes, mediated by high-mobility-group protein I(Y), is essential
for cytokine-induced expression of the E-selectin promoter. Mol Cell Biol 1994,
14(9):5701-5709.









Liu X, Brutlag DL, Liu JS: BioProspector: discovering conserved DNA motifs in
upstream regulatory regions of co-expressed genes. Pac Symp Biocomput
2001:127-138.

Liu Y, Liu XS, Wei L, Altman RB, Batzoglou S: Eukaryotic regulatory element
conservation analysis and identification using comparative genomics. Genome
Res 2004, 14(3):451-458.

Liu Y, Wei L, Batzoglou S, Brutlag DL, Liu JS, Liu XS: A suite of web-based
programs to search for transcriptional regulatory motifs. Nucleic Acids Res
2004, 32(Web Server issue):W204-207.

Loots GG, Ovcharenko I: rVISTA 2.0: evolutionary analysis of transcription factor
binding sites. Nucleic Acids Res 2004, 32(Web Server issue):W217-221.

Loots GG, Ovcharenko I, Pachter L, Dubchak I, Rubin EM: rVista for comparative
sequence-based discovery of functional transcription factor binding sites.
Genome Res 2002, 12(5):832-839.

Ludwig MZ: Functional evolution of noncoding DNA. Curr Opin Genet Dev 2002,
12(6):634-639.

Ludwig MZ, Palsson A, Alekseeva E, Bergman CM, Nathan J, Kreitman M: Functional
evolution of a cis-regulatory module. PLoS Biol 2005, 3(4):e93.

Ludwig MZ, Patel NH, Kreitman M: Functional analysis of eve stripe 2 enhancer
evolution in Drosophila: rules governing conservation and change.
Development 1998, 125(5):949-958.

Maniatis T, Goodbourn S, Fischer JA: Regulation of inducible and tissue-specific gene
expression. Science 1987, 236(4806):1237-1245.

Marco-Ferreres R, Vivar J, Arredondo JJ, Portillo F, Cervera M: Co-operation between
enhancers modulates quantitative expression from the Drosophila
Paramyosin/miniparamyosin gene in different muscle types. Mech Dev 2005,
122(5):681-694.

Matys V, Fricke E, Geffers R, Gossling E, Haubrock M, Hehl R, Hornischer K, Karas D,
Kel AE, Kel-Margoulis OV et al: TRANSFAC: transcriptional regulation, from
patterns to profiles. Nucleic Acids Res 2003, 31(1):374-378.

Mayor C, Brudno M, Schwartz JR, Poliakov A, Rubin EM, Frazer KA, Pachter LS,
Dubchak I: VISTA: visualizing global DNA sequence alignments of arbitrary
length. Bioinformatics 2000, 16(11):1046-1047.

Meng H, Banerjee A, Zhou L: BLISS: binding site level identification of shared
signal-modules in DNA regulatory sequences. BMC Bioinformatics 2006, 7:287.









Muhlethaler-Mottet A, Di Berardino W, Otten LA, Mach B: Activation of the MHC
class II transactivator CIITA by interferon-gamma requires cooperative
interaction between Stat1 and USF-1. Immunity 1998, 8(2):157-166.

Osada R, Zaslavsky E, Singh M: Comparative analysis of methods for representing
and searching for transcription factor binding sites. Bioinformatics 2004,
20(18):3516-3525.

Powell J: Progress and prospects in evolutionary biology: The Drosophila model.
Oxford: Oxford University Press; 1997.

Prestridge DS: Predicting Pol II promoter sequences using transcription factor
binding sites. J Mol Biol 1995, 249(5):923-932.

Prestridge DS: SIGNAL SCAN 4.0: additional databases and sequence formats.
Comput Appl Biosci 1996, 12(2):157-160.

Qiu P: Recent advances in computational promoter analysis in understanding the
transcriptional regulatory network. Biochem Biophys Res Commun 2003,
309(3):495-501.

Qiu P, Ding W, Jiang Y, Greene JR, Wang L: Computational analysis of composite
regulatory elements. Mamm Genome 2002, 13(6):327-332.

Quandt K, Frech K, Karas H, Wingender E, Werner T: MatInd and MatInspector: new
fast and versatile tools for detection of consensus matches in nucleotide
sequence data. Nucleic Acids Res 1995, 23(23):4878-4884.

Rajewsky N, Vergassola M, Gaul U, Siggia ED: Computational detection of genomic
cis-regulatory modules applied to body patterning in the early Drosophila
embryo. BMC Bioinformatics 2002, 3:30.

Rebeiz M, Reeves NL, Posakony JW: SCORE: a computational approach to the
identification of cis-regulatory modules and target genes in whole-genome
sequence data. Site clustering over random expectation. Proc Natl Acad Sci U S
A 2002, 99(15):9888-9893.

Ringrose L, Rehmsmeier M, Dura JM, Paro R: Genome-wide prediction of
Polycomb/Trithorax response elements in Drosophila melanogaster. Dev Cell
2003, 5(5):759-771.

Sandelin A, Alkema W, Engstrom P, Wasserman WW, Lenhard B: JASPAR: an open-
access database for eukaryotic transcription factor binding profiles. Nucleic
Acids Res 2004, 32(Database issue):D91-94.

Sandelin A, Wasserman WW, Lenhard B: ConSite: web-based prediction of regulatory
elements using cross-species comparison. Nucleic Acids Res 2004, 32(Web
Server issue):W249-252.