
Identification and Application of Repetitive Biological Sequences

Permanent Link: http://ufdc.ufl.edu/UFE0021421/00001

Material Information

Title: Identification and Application of Repetitive Biological Sequences
Physical Description: 1 online resource (100 p.)
Language: english
Creator: Li, Xuehui
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2007

Subjects

Subjects / Keywords: blast, fitness, lcrs, low, repeats, the, transposons
Computer and Information Science and Engineering -- Dissertations, Academic -- UF
Genre: Computer Engineering thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract: Biological sequences are rich in repeats. For example, more than 50% of the human genome consists of repeats, and approximately one-quarter of the amino acids are in repeats. Repeats are subsequences of biased composition. They vary in size from fewer than a hundred bases to tens of kilobases. They are found either as tandem arrays or dispersed throughout the genome. Repeats can generate insertions, deletions, and unequal crossing-over within genomes and affect protein functions. Hence, repeats play important roles in genome evolution. Repeat identification is normally the first step in studying repeats and a critical part of sequence analysis. For protein sequences, some repeats are commonly referred to as low complexity regions (LCRs). Although some computational tools have been developed to identify genomic repeats or LCRs, they are all geared toward specific situations and suffer from different problems. We develop novel methods to identify genomic repeats and LCRs, respectively. Genomic repeats and LCRs present difficulties in genome annotation and analysis. Local alignments between repeats cause many false positives in sequence similarity searches. These false positives can cause misassembly of genome sequences or misidentification of repeats as gene/protein sequences. Existing sequence similarity search algorithms either ignore the existence of these repeats or completely remove them. The first strategy produces false positives. The second strategy is not desirable, since no LCR-identification tool is 100% accurate. We develop new algorithms that use LCR information wisely to improve the accuracy and efficiency of sequence search.
General Note: In the series University of Florida Digital Collections.
General Note: Includes vita.
Bibliography: Includes bibliographical references.
Source of Description: Description based on online resource; title from PDF title page.
Source of Description: This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility: by Xuehui Li.
Thesis: Thesis (Ph.D.)--University of Florida, 2007.
Local: Adviser: Kahveci, Tamer.

Record Information

Source Institution: UFRGP
Rights Management: Applicable rights reserved.
Classification: lcc - LD1780 2007
System ID: UFE0021421:00001



86377c9585cd7c40c3b865be2c59f3b48ab57df2
85073 F20101206_AABXKE li_x_Page_096.jpg
b0ad546d8127754878a61cbc98685414
b6e24a71235a2d152d7f3e2795d6130540139eed
69753 F20101206_AABXJR li_x_Page_080.jpg
7c8bd5986d999d7620bb020056f6b90f
32dea5e24a79bf31770ad87b23bd763758ce677e
84305 F20101206_AABXKF li_x_Page_097.jpg
7ac461776912772036255c1d83f2b94c
593e6785a5585c90651cba7193e4af79d45d4782
92469 F20101206_AABXJS li_x_Page_081.jpg
de327f73080dbddc08b955e0e8011bc9
47638ee984e2a903a0edf3dc9b7cb1c9a64539a8
83994 F20101206_AABXKG li_x_Page_098.jpg
e07c1580b594fe496dbd89bbf00a8baa
82bd68f452b309dc651577a403dfe1e29d0162cd
66116 F20101206_AABXJT li_x_Page_083.jpg
37eeb995830d231e6d0350bc4d14d412
cad7a258063b4119df2cb5eef001fde0a0585acf
50602 F20101206_AABXKH li_x_Page_099.jpg
48625bb4d12d8605336e9b0d7217a26d
704457c96c50380e4341ffc5f47c564b0ef4769e
81318 F20101206_AABXJU li_x_Page_085.jpg
baab33a3b813bd30fa6657ef335abb2b
0fe6b4aca8c1e3f6b50918914f5fefea3ddd16d1
25038 F20101206_AABXKI li_x_Page_100.jpg
10a25ba82dff866ba472bc815d69fda5
d9ccb4ed1897ecafe6756f9cc48e3f83bb11321c
52577 F20101206_AABXJV li_x_Page_086.jpg
6bc0dfdafa8b3d0bf4b94be5867c39bf
d4bcbb23f05bdec7483d3b2580b855715f70d56e
22720 F20101206_AABXKJ li_x_Page_001.jp2
818434bfafa46915d30e4c5f3a1ec012
dd7c23ec4a03fc643a00b787c55e2fc022e512b9
81405 F20101206_AABXJW li_x_Page_087.jpg
6797142a7eee6916efea7514bd9d48bd
c89335652ecffe2ab6e6731beba31277e394cb82
4655 F20101206_AABXKK li_x_Page_002.jp2
d504aacbe4a0879eea2bfa4eda4e5934
5f69568628685be18347154de6c66832e04ff97a
68669 F20101206_AABXJX li_x_Page_088.jpg
36c8ade9c73a227646d0e207df9dc933
4e8ae30004a94f94d99ecbb8b8545201f6b1038e
8154 F20101206_AABXKL li_x_Page_003.jp2
8b08bc303fc25ab652e6a3db27f548bd
1ab5c9e5f4131010ee3af3531d5ffb168e850086
91001 F20101206_AABXJY li_x_Page_089.jpg
167f7244d010cbb4c8fbbba20bde9ee9
ff4cab31a75fa1678a5972fa7e4ff4b9c4e83bcf
1051929 F20101206_AABXLA li_x_Page_022.jp2
6536ebda44e71301a338c83cf3f7851f
38d508fdc7e1e2454538e3e72c0a356dbf83d1ac
50786 F20101206_AABXKM li_x_Page_004.jp2
fef6e320adc759a4c9822b677a5efe4c
0d36d9aaddcbb1d5026260d909dc6f8476f98e7f
92831 F20101206_AABXJZ li_x_Page_090.jpg
c9d2fe1fcfb1fc85dd2fe90179518052
e42859feb115c3fb0da8a27cb5d33c34cff11ec2
931765 F20101206_AABXLB li_x_Page_023.jp2
733aedc4693a8f5a0921acd17a645842
4c9d30d5aaac34f55e679c3e0635d1fd8293868b
1051983 F20101206_AABXKN li_x_Page_005.jp2
713970174d013d3171672580c14495ec
edccc35857aef65ce2e66c3a4ee4ef66b93b4787
F20101206_AABXLC li_x_Page_024.jp2
919e13a5dcd0eaa153d936a4d76bef1b
8fc190e3fcfff69b50f8af9b5a142a2d2ccc5084
1051942 F20101206_AABXKO li_x_Page_007.jp2
61d6714119d32b42de5163e4cc0bcfd7
6660c4c13774eee330a3e90275cda7baf0e4d841
F20101206_AABXLD li_x_Page_025.jp2
37e9e8ddb928e2295b49e22d8fbb423e
fea16c8dfc0ca32db40d8e091f85af313423e4ad
1051984 F20101206_AABXKP li_x_Page_008.jp2
e0a122103dcffdc56888c8a1579be7a5
5178abe1842363186bcb87cbe453c3f8c38425e9
1051976 F20101206_AABXLE li_x_Page_027.jp2
b3920da58fda4b19066409b488bd54e1
37bc7af8e96fa18eff22c9c934b507fa17e826a5
F20101206_AABXKQ li_x_Page_009.jp2
6e3af8d28ab1786247773d086c9e14d5
ae335576610445ce90ff0d46ba6ae079a1ba94bd
F20101206_AABXLF li_x_Page_028.jp2
335604f1d9052313e09b6a1370c6b563
eae29d6fe30e3be58f61eae454ca39a0193e477c
106166 F20101206_AABXKR li_x_Page_010.jp2
0a31b802889e544ddf8a676ea401063c
a92fd2aa93369ff497498c462b1a5777fc74ae13
F20101206_AABXLG li_x_Page_029.jp2
2e08bceab0943e05bfaf832337686352
9057646684bc498c6a5be5d7b14bc918694f9fce
10461 F20101206_AABXKS li_x_Page_011.jp2
cadae88683e339ddf12c20f320003ffe
3d5cd52ba4cff45ac3bbbb931ac064dcdf0c07a7
1051914 F20101206_AABXLH li_x_Page_031.jp2
23e6e184e42f8dfaa8a5b3a163a1d9c2
82ebe88307a2efb595a36695bd12ce58caff48f8
F20101206_AABXKT li_x_Page_012.jp2
e7371899aa47fc171bf3dab96fceb4bb
659db8405d6a06af2d7a4362d98f2bf2bb3ad4ce
1051974 F20101206_AABXLI li_x_Page_033.jp2
d2fcd937f07b43ee7f6b9b14f811c899
8e4109c97aab2bdf03706b575a695e4abae097e2
800436 F20101206_AABXKU li_x_Page_013.jp2
7290678573171487b293917a596d1494
87783c57dd7754b121d5ee67fe5c0499f40f16dd
1051957 F20101206_AABXLJ li_x_Page_034.jp2
8471cb87f434ad1ec6578d3481337044
ab293a17ed44e3d4a904d69a6b13eda862795b53
F20101206_AABXKV li_x_Page_016.jp2
57b6287cf5ca8747ed8750661755d1f9
90bea93a0e2f0afc45478898603eca6e298a1b6f
1036332 F20101206_AABXLK li_x_Page_036.jp2
37f7cdf4e1f7249451e126f902e49014
1ae97fc15428380d673c660c6ff1fd75f78120c3
1051935 F20101206_AABXKW li_x_Page_017.jp2
2cbba850b8a23f8a8de84870c3a1afd4
38873289d6eecfa233d1efdc5c54a012bffc65c8
927975 F20101206_AABXMA li_x_Page_055.jp2
a1fa830d5ab7af987a1105a244cad0b3
ee0d63c367d1cc61845d2a99c4d020d258ffdebc
965339 F20101206_AABXLL li_x_Page_037.jp2
379f094323e239c0486fee5636c86a83
5b589f552112eb6b26157c873df8ca94c406d73e
F20101206_AABXKX li_x_Page_018.jp2
b32d68d0c77136e54478107c336037fe
7a2c6629eb7e92cc90f17076fe86702149cf2374
1051960 F20101206_AABXLM li_x_Page_038.jp2
cee11dbfc09a795b92d472f0b9170551
de3a878121da2ecf0a562fa3dd6941153175c607
F20101206_AABXKY li_x_Page_019.jp2
bbe3349700f847586b315878184bf499
2cc70bb1b59f756848f17efbe82cca2d5e5070c4
F20101206_AABXLN li_x_Page_039.jp2
29223216536a7e8f1bb1a23d763a4fe7
3ab97c78623a8e5882b300f76f218309db69bb60
111878 F20101206_AABXKZ li_x_Page_021.jp2
4b0e9ec238cfb4f2a197993efa6a316b
2128cb6dc0f277391e0f35837a9fe4e0fe98eb80
1051964 F20101206_AABXMB li_x_Page_057.jp2
8d246de1bafbd03f05dc2945209929be
490cd5ce37be2157a48db5d6c6309f81e7a71899
F20101206_AABXLO li_x_Page_040.jp2
4d7528ce4bea037363e39e0a078a477f
86e5bbabbd9c754d475a26f6a3644590a2a169ff
81485 F20101206_AABXMC li_x_Page_058.jp2
dc70babd9e6a0bcc093479424ec434db
0884c916f1f6f43236fbe9690c01a3314ca5ac9d
1051954 F20101206_AABXLP li_x_Page_041.jp2
0d3484020ba740716fd6505db87d11e9
5ba119535f9406ddadeb0022c18c2d47dfee317d
1051934 F20101206_AABXMD li_x_Page_059.jp2
6fbe018db60ea4a34075c59e261772ba
fc2b71dc7e7fd7444c1f7825ef7a643a334739c8
F20101206_AABXLQ li_x_Page_042.jp2
25da5c41c220890f9dec05514b594de6
cf91fe9a96748ffb3ff5618c159e19eda3c8bb8d
F20101206_AABXME li_x_Page_060.jp2
dc2846c8ec8bfe6e6c43b8922872e2a1
26f72feccee7aa1fe5a7ca46e65a1785b99ef518
1051965 F20101206_AABXLR li_x_Page_043.jp2
799966d8c5ba1b83ea772f547b67c264
399ee27517b88907b1b02cacce3d311bcf94108b
10630 F20101206_AABXMF li_x_Page_061.jp2
c84f99254cd1930c89a0c808558eeece
f9397bf7102a3bb808fdf41a55c6459f5bbc5fa8
99192 F20101206_AABXLS li_x_Page_044.jp2
f4c62f6ffa6c82e086fd3ccacd4ba14e
74dd171aa80409835f01d7f597ce4754c428e68e
F20101206_AABXMG li_x_Page_063.jp2
33bbe25e46d81d3dee3f1007f15c0f49
a79e9ec3b3238e30b50be999ba8169c05a83cb2d
87706 F20101206_AABXLT li_x_Page_045.jp2
37f1ce2df21a2a000d92dae9507cd14f
20b6b95821954cc08c7e36a72674c1a305bd8b2e
1051946 F20101206_AABXMH li_x_Page_064.jp2
80d2d3b79c564eca33e531d510f379c3
90d39164ed616730c527e73caf1761ec95d8e58b
115078 F20101206_AABXLU li_x_Page_046.jp2
78dc4f22abad2a21fd53bb987c6ed539
2c522c4480a94c8b7d8dd7a50ec0ed2e690d0156
F20101206_AABXMI li_x_Page_066.jp2
d0e59edd1ee69e86412e7e37e3ed151a
65de924d7531ca3fcf81bcc028c9fbf22dd74593
1051917 F20101206_AABXLV li_x_Page_047.jp2
a9560c7eb50783e6e63e8ec94f82be79
1f30267443a964693807089d14f0914d9b6015e1
964177 F20101206_AABXMJ li_x_Page_069.jp2
9649afab63e1d8c2d31f3e195833a5c9
a49438b8c49efbdc90080e0be9fa601015b57485
996944 F20101206_AABXLW li_x_Page_048.jp2
d5c12dbdaf162478c9b49e8c915a9a83
4fcd9299ce527d41362d8fc8713ff00f48a1b38b
968835 F20101206_AABXMK li_x_Page_070.jp2
783d0aa9b79498f92fd03d43ee1cd9e0
ef20d15cda053f8be2c4b81b7b9401ebad73c794
F20101206_AABXLX li_x_Page_049.jp2
73b969719aa62c1456b095f5828042a4
864e75628e34f187b40695b87166f2e9601fe0a0
F20101206_AABXNA li_x_Page_087.jp2
58d8d7f5a44102c5d2223ba9a928b0de
03362966110fa0d73b6205d747ab6e8c98f02fba
1051897 F20101206_AABXML li_x_Page_071.jp2
04be77e57be5d0c3c7d49e1cb246dc27
2fa1bc6a113d818e542c4f8620011554bce14616
F20101206_AABXLY li_x_Page_050.jp2
31aa190c79708eeeb5e265565b7d8d12
5ed6ea0199a43137fd08b466c8803fb30fab8091
1051968 F20101206_AABXNB li_x_Page_089.jp2
ad691cc42ff1ebef44ae341eae3d254d
c56e965f8fb799738c226269a2cb1382c1398a55
1033966 F20101206_AABXMM li_x_Page_073.jp2
8ad5c4b19f8becf1f1bdec7465d44193
5c9bdf161c3efe168b93d0d2ca934507381ceb67
1051958 F20101206_AABXLZ li_x_Page_054.jp2
46fa1b7c0a47e458a08eb52696f08656
e0a03a78a903b4a8eb47eafe845040c8f2d2b554
F20101206_AABXMN li_x_Page_074.jp2
1d155978d04b7ce99b7b3dff872d2589
7984ad43ea091b961063325c65d55a410c9d9f29
F20101206_AABXNC li_x_Page_090.jp2
5ab57692439b93d1b7f1a7118c779b56
0621e8ecd1615e1b0d9e67749d6ad42345e370c3
1051951 F20101206_AABXMO li_x_Page_075.jp2
9f6d4a089c183751848a095946aab914
2ebf4871189355f8b984da863bb255accf80d3a3
919662 F20101206_AABXND li_x_Page_091.jp2
012bde70b59cef410fcee7600b8606a3
2291da22aad0a75b7341a41ae434271ab73e3af4
F20101206_AABXMP li_x_Page_076.jp2
e71cfb2b980db86fb6ac81c4c5a024a9
1efef1f0d6a6e8544b228d572c8a1f8121c40994
45190 F20101206_AABXNE li_x_Page_092.jp2
93d8a4e74124f4295cecc5bc8ff5eeca
c9d24a4dd9a5ee1cb467d00630702238dedc854b
70925 F20101206_AABXMQ li_x_Page_077.jp2
317522bf496e926fedac16f57bcb8140
ae23eb346b5a7593784d9e75607e7e4df70dc1eb
860620 F20101206_AABXNF li_x_Page_093.jp2
705580925babc8bb724efe4cbc88ceef
6303e7b225f78461801b8c5c8ecb14c54cbd5424
879162 F20101206_AABXMR li_x_Page_078.jp2
e55e612fe199810abe6dc03f0655ecda
f7b5a2408ad5a22d5100f6d7156671ebcccc89b6
132283 F20101206_AABXNG li_x_Page_094.jp2
8076adcb8ef36e44a4e17983a39cc998
f4413f9ad3e3e005a26e135fe5b07f6c07235bd7
1051961 F20101206_AABXMS li_x_Page_079.jp2
fdbf8b100437cfff5783d9f344178ac4
bbb1cb747dfc7da4bcb996479f84cc0d95499e29
132021 F20101206_AABXNH li_x_Page_096.jp2
7e2904f665525645757552a8a1742c44
58f369065f005400d4bfac2897660b3d28efdb86
1002187 F20101206_AABXMT li_x_Page_080.jp2
50454f914074095802ad2f1141a365a6
288b8443ab4f4008ca9e396851533046ea0fdf79
134147 F20101206_AABXNI li_x_Page_097.jp2
c2f083515bd9b3100e96da2e821d98f7
22b061dc1806ba4290a8dece010ceb596084e6c4
F20101206_AABXMU li_x_Page_081.jp2
95d36fe3cf875bf24ef4021063f52ac3
51dc72595c23b580b112f6494c1f526c7c01b4e3
131190 F20101206_AABXNJ li_x_Page_098.jp2
eefc7677fa5c8ba73674e11e755a490d
ad89e1dc0e7f0ab682152bfa6451d91a5d3d4f00
F20101206_AABXMV li_x_Page_082.jp2
aef523e687ff9d439badd30e702d8af7
25159a875e5319e1cb40590759e3fcb014f22c5c
76781 F20101206_AABXNK li_x_Page_099.jp2
174eb73e84ddf49219b830dc15c749d6
03fcc84b48d68e0106db0ae48fb936f2e9c6ae58
881343 F20101206_AABXMW li_x_Page_083.jp2
cd78ec8be2905e5be346c7dd8a182813
857961d90d086ced3de946f48dbae2fe86a990cd
F20101206_AABXOA li_x_Page_021.tif
23a165e0615f45d332ea9a75ca09fffe
8d1fb7cd0b8009f188d963df018704a2794fcefd
31869 F20101206_AABXNL li_x_Page_100.jp2
8944f2d59c1d7d2f5b12476c04028369
714353bbc74bd278a23ce652a28c8292f516cfb8
791679 F20101206_AABXMX li_x_Page_084.jp2
7f4fee70d01487334a14ef0d30eb2343
88bc6eb99026334823564c0e7bc1c69c11ec379d
F20101206_AABXOB li_x_Page_022.tif
0045f2ab19710b2ee4591d37490ce15c
069ebc0204b9b68edd483bceaf92ceccb7806604
F20101206_AABXNM li_x_Page_001.tif
4ac6768035187cbd364c83f844ccb13b
8f0b1a8cf38c749f1396db0544ce286d98f12b8a
1051948 F20101206_AABXMY li_x_Page_085.jp2
5c691de46e73bb575cc36ab2ef50ac68
a58d1b8a1a0d1f5f41377c79f357742c32e8a49b
F20101206_AABXOC li_x_Page_023.tif
da046f3081b8068903890dbaef3224b8
381d3fb682a6454bd6147bc0a64608fa796c2277
F20101206_AABXNN li_x_Page_003.tif
9a49774303b35e3e93abc018a031d036
21281c400986394e2106d0c0a19dc087bc27e41b
702417 F20101206_AABXMZ li_x_Page_086.jp2
2c14efc73b4aa99b7efe91df31a2cdf9
e0b6440372e3b20ee774c03401608a52003c2171
F20101206_AABXNO li_x_Page_004.tif
b73cb6455d64187bacbea6b8b51e3339
ab829997d9ef25d650299b83a94cbdfa22eecf74
F20101206_AABXOD li_x_Page_024.tif
b05cfcdbf4617703fc075957c9731c2c
abddbf86496aa0a3d033b5e603f6738d63ca3609
F20101206_AABXNP li_x_Page_006.tif
74a13f77af5c71068d0cd4793332b04a
1cc4bab26e8ad94aebf76c374d716f8f5e803c4d
F20101206_AABXOE li_x_Page_026.tif
598ac5f9a138f24e1012433f6fae691a
9bf8c35f527f0d0a8396bba4007c93c928cea7cd
F20101206_AABXNQ li_x_Page_009.tif
397c4c2f9936903d780ecff1d78040e5
954729a3900fda1bc86fd197dba6de5e77bae466
F20101206_AABXOF li_x_Page_027.tif
4e7f882f1e2526e9b954f6be79fd69ca
303ae3d22bba6b41d80021e082d7a6888b177fd3
F20101206_AABXNR li_x_Page_011.tif
a855bb60c53036950c73c8a1b4724c23
4fa118c664bde4549bdfa1a205f6b25edd8dcee1
F20101206_AABXOG li_x_Page_028.tif
2d26ad635a97fca261099f5a4e4ca935
4727e44ba2654e39433470eb154c0e84b50f228b
F20101206_AABXNS li_x_Page_012.tif
f1c7bd2215d261e59c84a3c5c4302197
74c329630733240ff2bfcb96617f5b4bb962e272
F20101206_AABXOH li_x_Page_029.tif
086b8522b51198e45f95aef6f8d320ef
d1d299f1e640647e0044158a351fdb03aaadd00d
F20101206_AABXNT li_x_Page_014.tif
5d07f48c378e0d20e0b9d832d8b19536
c072f8a58044650879b89b0e5bf9bf180c456b21
F20101206_AABXOI li_x_Page_030.tif
2f3599cacd67733fa1af516836791eea
353e6ea83454008856414853996771afb8b6f560
F20101206_AABXNU li_x_Page_015.tif
7a222075eb7ec5d99ee021e0c719df4d
6e290da51df3b37542a9533324e1a90389c9ef6d
F20101206_AABXOJ li_x_Page_031.tif
a65d7ade9f071d42312f886f5a20f5c6
8303ad3263e69fa1271bb497e58aaa9c409d8585
F20101206_AABXNV li_x_Page_016.tif
2be108c64e3dbe01e54256a9972b8374
8d18f72c5d10aacd6d0fb8289ef6277ff263643c
F20101206_AABXOK li_x_Page_032.tif
32edcbe672ec0821e1e4556eb97e3b60
e891f18e5a2048b0019d46237256ad391aff54c0
F20101206_AABXNW li_x_Page_017.tif
9ff309fb9b4b7bb57ac7af973d52df4b
7119a9a4678acc2603019f046649994847bf49d5
F20101206_AABXOL li_x_Page_033.tif
ebd7a7aa3fa90201e6c3ec7ba3c0233d
22d10d64896dfd68529b3d4e7b5594061307d57f
F20101206_AABXNX li_x_Page_018.tif
dde1007fca07bbe414e8e6bcd2912999
1e9abfc34fec8b751a07ad079eea4f6662a1ef23
F20101206_AABXPA li_x_Page_049.tif
20139f81f78605dacd83b59728a3eb46
b7cf4776e1cc3cb97b8395a55f72e150e1740b77
F20101206_AABXOM li_x_Page_034.tif
abce2ffd8ce0f0aaa0f0b7b1b5a01cde
b15f1aadef3bc15540ee0eaba02c17d7afbf865f
F20101206_AABXNY li_x_Page_019.tif
7f9bb51adbb2b6c2fc3eb03df923ed34
a0d9d5d0ae7d344e00c4850470367b477fc0e0fd
F20101206_AABXPB li_x_Page_050.tif
6d2ba605885971cd04e5b6e17cacbaed
b288476232725210f17d45c94b0a72e39e672d80
F20101206_AABXON li_x_Page_035.tif
460487969417e0beced3b3f436ee9e7e
f99c86e90076cadbdd61162c1bdb2942cfec5078
F20101206_AABXNZ li_x_Page_020.tif
59e6dc89c4dcb86e7579039e94490eae
f2d058e8c8dad2ebdf2416e4bbdae4a44d0fd1bc
F20101206_AABXPC li_x_Page_051.tif
43a955230f3c3f6c90990e0f27b8b9ce
02965455cba1ff151223427ea503ac68ac29893c
F20101206_AABXOO li_x_Page_036.tif
8f1b80db3eb044d10442a6c7ee7548de
763a22cde247a79efe8ac9c2374a991521ca0ebd
F20101206_AABXPD li_x_Page_052.tif
1d9152b6a30c6061de33d6e05dca6dd5
ad8814b904aab38d2cb6e86d563c3b6c66289462
F20101206_AABXOP li_x_Page_037.tif
62207be1866e7beec33d84a14bd2ab67
0ccd550f3b9a35f901e1e9d0546ad91759381ea4
F20101206_AABXOQ li_x_Page_039.tif
b766797a0713b2f63dee0d85cdec06f9
92b20da40e2b17177cf520a16d451c636dfde8e9
F20101206_AABXPE li_x_Page_053.tif
43846ef7847d7d7d90d80bb682b1d8c9
85990ddb74ed56570cbd8d6d6a23ddb2ca659cf2
F20101206_AABXOR li_x_Page_040.tif
24b447d2aef35fcf6ddbf656abf70baf
1c3a3cea9834a84d727aa5a557a9cf81f0d76a9b
F20101206_AABXPF li_x_Page_054.tif
e37f6540cd3dca58cc767ac54aef8430
071c7baf2f95e7dfa0e7f7507425b2161d015d06
F20101206_AABXOS li_x_Page_041.tif
0df1b8f0708e7a99229a5cc2fb103c4b
1bc8e772b01a536cb022221f87324d509cf79698
F20101206_AABXPG li_x_Page_056.tif
2c9821f87abc6a31b15aec0d973d262c
987207c937670bc4cf043551666914742056c89c
F20101206_AABXOT li_x_Page_042.tif
4ce3fb18eb0709be65e3e31774e8135b
b3863ad3694a3ab0290d955347941d18b64f5b56
F20101206_AABXPH li_x_Page_057.tif
cedd8074d43a529f0b8f590ab1d3bf6b
d7bd0cfe139795952b5957b42a988624d5532c7a
F20101206_AABXOU li_x_Page_043.tif
dd67c68aa245c65d9446c3c26b24ce42
74f99c14f358dccfdb8d9c006760fc81e8032e27
F20101206_AABXPI li_x_Page_058.tif
e26935fdb6be9bbc9932937821084eff
50855ce911ed33ea253010ac9e19ad8cf90f68a4
F20101206_AABXOV li_x_Page_044.tif
00c05c8821b1e157dda344704d02e5f8
9c94bccc39256f79e4587e8037758fc2edd87efe
F20101206_AABXPJ li_x_Page_060.tif
c3495f7fe8b4923b84d4b18bfafa34e6
a5b2d8f4b1ca4cda7be8f65f3117daec8ce5a4ae
F20101206_AABXOW li_x_Page_045.tif
c526f7338988e63c806055c98bd9dd1e
b802371be4e77a7aeab79e50c9bd3e3c859bb083
F20101206_AABXPK li_x_Page_061.tif
d2f5cdbba81cdc2020df3f07b48afd24
7efbaf05bc218f70986c4bab03571375359b7463
F20101206_AABXOX li_x_Page_046.tif
88b01f4453da294bb2eaf81beb3d99fe
9cab3c1c86cf49bc634a76d37d3439501f316add
F20101206_AABXQA li_x_Page_081.tif
d80ab1d917f41a7eda7a6fc560753443
e6bcb2b6872c60c4561246cfd5a6a01a532d6e7b
F20101206_AABXPL li_x_Page_062.tif
a373998fcb42f8deb9ca7fc8e7e9117c
88d79075800094a61420f013fa9f8d889ccb2ffe
F20101206_AABXOY li_x_Page_047.tif
7b47f5a9732207507a0c28e9a1c4daaa
71d5b8c2395c9c52519db3e2c821d6121a2a6727
F20101206_AABXQB li_x_Page_084.tif
1d399f8f361daf7f50b0ea87ec5b3726
9e0f28238e4b5fbcbef7e71aa922ddd6d78f59b9
F20101206_AABXPM li_x_Page_063.tif
89f4203b0f431d5d1ced2eba0c30e6bf
0673277d5a00249e559d608483100906afa371ab
F20101206_AABXQC li_x_Page_085.tif
13d9bfff5b108446160a194c4d1a767a
6846ae6c28f8b4f8359e70e42753027aceec2a37
F20101206_AABXPN li_x_Page_065.tif
bf6e024286fa4b44817b9a1ecc7b32dd
b32e36d5df2a9c1f5a7d8f0fa472738bafef7aa0
F20101206_AABXOZ li_x_Page_048.tif
efb84fd3cab96af4b0959e2a48ded6dc
6790967db9dbdcbac76a7b18f3f45e14ee8ba923
F20101206_AABXQD li_x_Page_086.tif
1e2c4780e927e0fafbc84db0ef4ebef5
cc843566e2f1e9a521e7002aa62cd701ad021d4f
F20101206_AABXPO li_x_Page_067.tif
1550e8807108ca26e405072a58f466a9
968c33f205dd02a99458d644eb76b420e7996fcd
F20101206_AABXQE li_x_Page_087.tif
d620ee47edf4041a116912c5c8d76364
4cb519134ebe762044c17bb02905089afd342c80
F20101206_AABXPP li_x_Page_068.tif
372f66cd19cc0493720f88b6cd65ecd8
3e1a1cdebb784bd87da4e4121d5155c849c8ca88
F20101206_AABXPQ li_x_Page_069.tif
30c841ad4df8ae4e9bb3cad26da4509e
c5c12b94426f80d8b1b474679ee2d715f8c08abe
F20101206_AABXQF li_x_Page_088.tif
5ee1e40d1e041a0c43942c536771b24e
5db2a56e7cedf794b0bdaa528fbe378d5a766242
F20101206_AABXPR li_x_Page_070.tif
861b2782d92e29b46a4c56bbfe9668d9
40ac6b01048510918b90620cfb8b0f2e45907aff
F20101206_AABXQG li_x_Page_089.tif
f4f79e6496e6acac25b5db902f12a1b4
d75db1d2f83abc0483c3dc630bae43c7d20e7417
F20101206_AABXPS li_x_Page_072.tif
0a1ef603c473a3593d44a23297a4b3e1
1913001ec5b2837c441f4473dd30235d161dcc73
F20101206_AABXQH li_x_Page_090.tif
996dcb679fda9efc30c902fe96310dfa
7ca285d58c1fef4fef9d96e98775f2287c53634d
F20101206_AABXPT li_x_Page_073.tif
5ff969d36209ccc3deccabcb9c1f8d7a
ececa6b30d3124c3307c108bd574da9fdf51cf0c
F20101206_AABXQI li_x_Page_093.tif
38c064cb7fbc95231e6493aa0c27f253
54f50931693306f21c3927b4059f28f427624e71
F20101206_AABXPU li_x_Page_074.tif
8473c8a5289c10ceadcf45d83189a671
598de9d100bd854b1931a2ee67cd544f9eb9e3fa
F20101206_AABXQJ li_x_Page_094.tif
ca69103a3dcc99f568b53adcde674eb2
8d8a408d9ff19001854902ee0da166d296d50e69
F20101206_AABXPV li_x_Page_076.tif
8e9193141808dfd79c200de33c658d54
4a238e9e4ed2d33eef458a26f637647da7f7910d
F20101206_AABXQK li_x_Page_095.tif
bcf77c9874d107bda29088cef6f6d085
07d2a0774eb4ca9898304211bc32e651e9d4264e
F20101206_AABXPW li_x_Page_077.tif
8d94e4eebb7bb2c7ed940d9940abe411
dab3b59856faa23d8edbc87ef55321c2f0f64392
F20101206_AABXQL li_x_Page_096.tif
eecbd53a29b2e9db6c251fce276107ab
69a958bc68d40dd44b5cefbcb72e3f960461f2b1
F20101206_AABXPX li_x_Page_078.tif
c6b86c605d897b4f50e82c84bebb421d
63df49f21d0877f549cc1a01125214c7a6ceff17
52565 F20101206_AABXRA li_x_Page_016.pro
1318ed9f2b7c1076db7c11777acecce0
536bc9f765f6acac6b9768b0b8be708c4b3a0b34
F20101206_AABXQM li_x_Page_099.tif
e7b616bf4b343b8cb3933ed3b0bb9c67
ce2d966caed4ec7a6f454fa15da0ff65e2a26b6f
F20101206_AABXPY li_x_Page_079.tif
9d8311201cf68c9a797eed41102d9769
92cd5ac714bee837a03297e7f4d69c31e3e5db38
58054 F20101206_AABXRB li_x_Page_017.pro
6df0bfde9469da9e5417b0ec8aac79c0
54ca8e9e934aee5f2d9601177548358d02871a2e
7714 F20101206_AABXQN li_x_Page_001.pro
2c7df8006aa1328033f7edd749687492
b5c67edafffef0b9fdfa3d98d168533cdb6850cd
F20101206_AABXPZ li_x_Page_080.tif
7a17b9617532732d04efe11f632710bf
d21e881298698c99a32e9a57c470a9f2d01b37da
59362 F20101206_AABXRC li_x_Page_020.pro
ffda8eb516abfee1af89d3969fedf502
be12e62e17d74909386f5d70d5d4ce310e6c496d
670 F20101206_AABXQO li_x_Page_002.pro
6609f145ef879d92fe9bd85c8422e84b
b4dc0c8daaa05baab25454955b2a23d5bb718cb3
54698 F20101206_AABXRD li_x_Page_021.pro
3f332ba89ce22d398358ee3680e7a4d5
7d5efcdedebf278026652fdcc7361f4f5bd5e6fd
2074 F20101206_AABXQP li_x_Page_003.pro
9deb57f0547b1bb9ce381018e8ad6ca8
2aef28cffe3bd2ed2fea283c8d5f82ad03176912
52578 F20101206_AABXRE li_x_Page_022.pro
2ee7ce04e471e8e98c70a6998d21724c
563bc7ac8cf6cff52e04095ee4b41a938eb3c7ce
23019 F20101206_AABXQQ li_x_Page_004.pro
4499f6ea38e5f50edddf851d8700c261
a12bcdd5337df59af6b1d54ab37bbffc744b057f
44130 F20101206_AABXRF li_x_Page_023.pro
c6dc9004b6c3cc3a5ee6f7d48d9c8130
e909f65e1106b86c3bccb6338b8fa0ea1a5a3a0f
28616 F20101206_AABXQR li_x_Page_005.pro
bb775450293488e9f56bef3428195161
dc3564d6ff648e6b61518a1017f1787e7133d835
31382 F20101206_AABXQS li_x_Page_006.pro
ec9719767a9af7cdbb27657e377d304a
4d55a266c7936f3fb1091e423926959a98cebe34
52824 F20101206_AABXRG li_x_Page_024.pro
5c592e3f74d26124c55ac024f1ea83f7
b3155325555b46b67b7020600016cf3f2004428a
67139 F20101206_AABXQT li_x_Page_008.pro
12b30045d95dc9f6db459537bed4c709
80afa2c3642592a02841467133814f2410f432e0
55144 F20101206_AABXRH li_x_Page_025.pro
33bf6c3134cdd7351241d9803535617a
cc584d67f4946770fc54f13291f1c96e05be76e2
69048 F20101206_AABXQU li_x_Page_009.pro
841cdeffb52ce0ebaea62a142464f864
aa073b137216c96bc56eb061a0bba99214f77ab0
42598 F20101206_AABXRI li_x_Page_026.pro
1e1fbac64a6eded085b795f33a2a7715
8460bb5d21e1d06d0bb57371f68e0548dfdce727
49775 F20101206_AABXQV li_x_Page_010.pro
e479c6727330358c16e0a5be27ce9a8b
41db3b7411f408a3ac705c1bd9ee8efbe330094c
48196 F20101206_AABXRJ li_x_Page_027.pro
f41f1d837503913543397476da187411
33c9549eca5e8d6d7662bd207ea21e96a1a575f9
3233 F20101206_AABXQW li_x_Page_011.pro
25d115a6b359f095d62584deecd86767
c3c071a007b053ca878ccfa3922cd4cc5387ab3a
51932 F20101206_AABXRK li_x_Page_028.pro
505ca4814c579a5c53d1efa4eb6d455c
0df599584c4984557ceb39690213ee7491094046
55280 F20101206_AABXQX li_x_Page_012.pro
0f95eec2e9535a22d0f33da93c5f28ad
004f7046834404b85fca55f84f02f5a7b89f4d7f
45604 F20101206_AABXSA li_x_Page_048.pro
f173c8ff6810da8af828b8bf0d838f77
c51c4ab4ad80d42f0087030ac5f429a083876c0f
56532 F20101206_AABXRL li_x_Page_029.pro
ab0df1848fbf97eb2076440e6e8ae3f3
af296e78cc99fcfe3743576c4883da31d38d2f4e
35554 F20101206_AABXQY li_x_Page_013.pro
6f560c051c013bc78f8fb30394e40e88
be94ea670824c9f21f93aa0d2f66a6b19c648aec
56499 F20101206_AABXSB li_x_Page_049.pro
f0a4d66382169a15ed8c4b9e40042778
fdb5724398ad371cf9ca914e2b84212dd39b184d
69411 F20101206_AABXRM li_x_Page_030.pro
b0d23ced72b0dcb4b795e2424bc4e36c
ecef6950b1ce98c9e35558df7d95f49e6eb2908b
53224 F20101206_AABXQZ li_x_Page_015.pro
cb09ab344e7d44f83c4694a1a5abca84
fa1476a6ca24fd0b1a04d49645263f53773336be
56976 F20101206_AABXSC li_x_Page_050.pro
147b4a7b779e052266b68cc67db1425f
5a42b06dce248e2ace1e6b87dd709aad71e7ce6a
52234 F20101206_AABXRN li_x_Page_031.pro
e1fb115bfaa1a3675f2e7de8d83fe386
5dd9384f14e94f21c56d45f54de2caa6f5975912
50375 F20101206_AABXSD li_x_Page_051.pro
ac70ae535f2985be563a1f426e76bb1e
854a516e8c5017b530d1fc98d299d519674a8a0c
58042 F20101206_AABXRO li_x_Page_033.pro
008841c494446963a36cebda0dce983e
c91d591c2ebf0ae2708e5b9884e48bbd4cb92d01
58118 F20101206_AABXSE li_x_Page_053.pro
45ddc32952ccf82a5e293694946e8520
1c523992fc8d94411c8844f227a6874c92434d80
56278 F20101206_AABXRP li_x_Page_034.pro
533f58f94acb113657015c429b1c6ded
32a6630a58daf9f6c961cce8e1edc39cb8b5166b
55931 F20101206_AABXSF li_x_Page_054.pro
c8ee4128eb8145cd458f59938a59c33e
3caacd5c558ccf5581138211a4ae8832ca918fd2
58374 F20101206_AABXRQ li_x_Page_035.pro
c51fcf1823f3356d693d8ae80f1270e5
206d5b5576bb21f895a5f4897cf8cb6c35b4ba4d
38017 F20101206_AABXSG li_x_Page_056.pro
b09bf4d258e09bbd8756fc9a58cacae5
cf244b502009aa08b627bb4cd4d77a5f18671cdb
42513 F20101206_AABXRR li_x_Page_037.pro
8d8912ab709257b4cec2de41691a7b36
3268b34a6d6e002fee3e28eb36c545f0b2fbf1cb
56341 F20101206_AABXRS li_x_Page_038.pro
2a891a2f5844cd3af050d3f3f355aeed
d57a212c178f8815dd513be9dd4bd444997d0202
36117 F20101206_AABXSH li_x_Page_057.pro
9d65da1d6cf64306f1f14e058c9b67f5
2e9b70d805e2e4420d8adc11286bd7a1387b02e3
54684 F20101206_AABXRT li_x_Page_040.pro
07f8f15d4275de2f7a6fe08900e6aab5
7cf377358995d00f84208d8cda12e511264abd1c
37244 F20101206_AABXSI li_x_Page_059.pro
4f457dab4602234b3332734366f2607f
3ccf143c843e72048fd752282f528f13bb4c5402
58584 F20101206_AABXRU li_x_Page_041.pro
04caa95d9699a0306473805b4176e439
4ffc610d913e6d78fb5b1995d42c9a3cc11478a1
53781 F20101206_AABXSJ li_x_Page_060.pro
e626bd57e4e5960596e1daafed52b346
ef05ab2a6bbf4d44dca375ecf6177ef7d1f5592e
53340 F20101206_AABXRV li_x_Page_042.pro
f7adc8222b44d56c219604e15070215a
f1c605c3cc7e4bf9b962a08e6199cd17f393797a
3545 F20101206_AABXSK li_x_Page_061.pro
e3e3654bffa05db94927d32093271d0e
46d4d4000a8f218e829fd663d62068cc6e8bee95
46918 F20101206_AABXRW li_x_Page_044.pro
72cfb852a3c6aae6b8002f713135a19e
6ffd4a58385bfc2950da09f01682e721ce4014d5
49172 F20101206_AABXTA li_x_Page_082.pro
64533af43d258d48b4c2205864e75bdb
7de9323c72e00bbeb4a8f5e0459e28ea1d74e63d
54263 F20101206_AABXSL li_x_Page_063.pro
49f5e6e656c31ce32c182ec1549a59e4
f0c70e168dd5b4b728ee1ab7acdf4d653f350ff4
39643 F20101206_AABXRX li_x_Page_045.pro
f75a98002127e634245884f23f1eb859
958386874bd8480aa7ae9e187fa96ba159538642
33897 F20101206_AABXTB li_x_Page_083.pro
5e05c7d56c4bf9e3e6f5eb56bce4ac4b
2b12ad156db3844b1cb589d414405a858f687803
40066 F20101206_AABXSM li_x_Page_065.pro
041de3fd23c7ea1901511d0ecf62b3bb
e3ed70d46bef7a80ebcd0b22132bd1d8827d8876
53054 F20101206_AABXRY li_x_Page_046.pro
944b4cb7012c0dbcf0fbf60db415c5dc
e463af4dcf815754a1cc84fcf44ebb653e8554e8
30831 F20101206_AABXTC li_x_Page_084.pro
f5005f48fcdd3f3d0cf6ff05e213875c
e0c845fcdb8122b7ed71b5b7806068cee679c269
58677 F20101206_AABXSN li_x_Page_066.pro
c7f88b39837d9a97fc4a9dc0046524ec
984c23e2c8e8a52f367d404b70b02bdac52a2953
51279 F20101206_AABXRZ li_x_Page_047.pro
0a51ee832d28c3f55e5028bbb149a725
e92045dd5235fee644283050b0076e9e0c5fb9e8
49228 F20101206_AABXTD li_x_Page_085.pro
1d55a002d7f50bdceba4a93c04b1e074
6ee9b808b1d73b3f184d994a6b319ef68a0b0d34
53615 F20101206_AABXSO li_x_Page_068.pro
4e7468f41eb5cbdc7405aff9b529f9f1
93e94222420585a1b16f9aee1a9a718e8c31a4fd
26369 F20101206_AABXTE li_x_Page_086.pro
ac5c9544e35b48195805b6086316e671
549fa2b7562bff80a9e2a1d649bca89abada241a
44708 F20101206_AABXSP li_x_Page_069.pro
e1d3185ac4951504add56bbe12214198
6e2bbcd0733eb7e45747b4c36e31ae5c08219800
49413 F20101206_AABXTF li_x_Page_087.pro
022f342a314ac371ca587c376bcff118
5cae6f760395490aa773b0b146131024dfb4a3fb
43954 F20101206_AABXSQ li_x_Page_070.pro
66d58b5f008b8ff4e9001102bd0d858d
715a32cd360e16d8e2ec931d960478471891d06d
41544 F20101206_AABXTG li_x_Page_088.pro
db74f68ea82603a9ac540495406bb486
01219e30e9e4ed7da090df02ad3a28ce8d9ca808
54643 F20101206_AABXSR li_x_Page_071.pro
69c8cda15e6be3545c73ec01408022a7
9e22c8eeaba87ae5927480b8d2b3961697be551f
60478 F20101206_AABXTH li_x_Page_090.pro
bcfa9ec94fb59159c490fb59a7cfd955
303f66c5ecb41fe4faa77f0d5e8f92bd94ccb368
50161 F20101206_AABXSS li_x_Page_072.pro
471722895a98ced0829340143e0c6f71
8ab30fd7fd347907da526a6972704f5e8eeb79d2
48021 F20101206_AABXST li_x_Page_073.pro
f5f22e03644230614910c50f311d9977
f9aba956719b2bf5da595410224cc130dcded541
20156 F20101206_AABXTI li_x_Page_092.pro
6f910331975b234758657c9516a6c36b
200b2cca20419d6783e2caaf61a47e2a0ca86951
50394 F20101206_AABXSU li_x_Page_074.pro
4bbd30628d1a55f3627863811cedea2b
ee45c7381f723ebd8340a6639a391de6c5aec24d
37500 F20101206_AABXTJ li_x_Page_093.pro
812f24f448ad681bfaf7d5755ac0daa0
2ca50ca567e8ebf970c28873544c6ea4c52e6c11
51665 F20101206_AABXSV li_x_Page_075.pro
1c63bd5f7ca7a70ad7d6f1e83216a6fb
038755c687771c38ab8a6035b9a681cad29ced8d
62386 F20101206_AABXTK li_x_Page_095.pro
10b1154d3b9e519c85097ed7a0b7def9
2f93ec1581bb864e5a3db621fa4df329b21bf38f
32313 F20101206_AABXSW li_x_Page_077.pro
d5d3f02549ae050ec6329f7667bcb4f5
9e592f31219ccb044b0e1c7b18e09b443ab61cd3
64094 F20101206_AABXTL li_x_Page_096.pro
fe6a0e305779979b7ef12b3bca5a226f







IDENTIFICATION AND APPLICATION OF REPETITIVE BIOLOGICAL
SEQUENCES


















By
XUEHUI LI


A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2007

































© 2007 Xuehui Li



































To everyone who helped me in one way or the other during my Ph.D time










ACKNOWLEDGMENTS

This dissertation would not have been possible without the guidance of my committee chair, Dr. Tamer Kahveci. I am thankful for his research guidance and partial financial support. I am very grateful to Dr. A. Mark Settles; I am extremely lucky to have received a tremendous amount of generous and selfless guidance from him. I am also very grateful to everyone else on my committee for their service and their invaluable professional suggestions: Dr. Alin Dobra, Dr. Alper Ungor, Dr. Henry Baker, and Dr. Joachim Hammer.

I am very grateful to my family and my friends in the States who have supported me in one way or another. My Ph.D. life would have been harder without them.

Finally, I would like to thank the University of Florida Alumni Association; I cannot imagine my Ph.D. life without the association's sponsorship of my graduate fellowship.









TABLE OF CONTENTS

ACKNOWLEDGMENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

CHAPTER

1 INTRODUCTION

2 A NOVEL GENOME-SCALE REPEAT FINDER GEARED TOWARDS TRANSPOSONS

2.1 Motivation and Problem Definition
2.2 Related Work
2.3 Description of the Algorithm
2.3.1 Graph Construction
2.3.2 Graph Traversal
2.3.3 Further Improvement to Greedier
2.4 Experimental Evaluation
2.4.1 Evaluation of Accuracy
2.4.2 Fragmented Repeats and Potential Small Repeats
2.4.3 Significance Analyses
2.4.4 Evaluation of Possible Optimization Strategies
2.5 Discussion

3 A NOVEL ALGORITHM FOR IDENTIFYING LOW-COMPLEXITY REGIONS IN A PROTEIN SEQUENCE

3.1 Motivation and Problem Definition
3.2 Related Work
3.3 New Complexity Measures
3.4 The Graph-based Algorithm (GBA)
3.4.1 Constructing a Graph
3.4.2 Finding the Longest Path
3.4.3 Extending Longest-path Intervals
3.4.4 Post-processing Extended Intervals
3.5 Experimental Evaluation
3.5.1 Quality Comparison Results
3.5.2 Performance Comparison Results
3.6 Conclusion

4 QUALITY-BASED SIMILARITY SEARCH FOR BIOLOGICAL SEQUENCE DATABASES

4.1 Motivation and Problem Definition
4.2 Background
4.3 Quality Value Assignment (QVA)
4.3.1 QVA Based on One LCR-identification Tool
4.3.2 QVA Based on Multiple LCR-identification Tools
4.4 Quality-based Similarity Search
4.4.1 Using Quality Values in DPS
4.4.2 Using Quality Values in HTS
4.4.2.1 The algorithm
4.4.2.2 I/O and CPU computations
4.4.2.3 Cost analysis of QHTS
4.4.2.4 Memory allocation
4.5 Experimental Evaluation
4.5.1 Evaluation of Accuracy
4.5.2 Performance Comparison
4.6 Conclusion

5 CONCLUSION

REFERENCES

BIOGRAPHICAL SKETCH










LIST OF TABLES


Table

2-1 Comparison of Greedier, RepeatMasker, cross_match, and WindowMasker in
terms of bases masked in different regions of the Arabidopsis genome, consisting
of the five chromosomes: chromosomes, transposons (TP), and other exons (FP).
The TP/FP rates of Greedier, RepeatMasker, cross_match, and WindowMasker are
2.85, 2.42, 0.37, and 0.10 respectively. Letter K represents 1000.

2-2 Comparison of Greedier, RepeatMasker, cross_match, and WindowMasker in
terms of bases masked in different regions of the 10th rice chromosome: chromosome,
G.M.s, transposons (TP), other G.M.s, and other exons (FP). The TP/FP rates
of Greedier, RepeatMasker, cross_match, and WindowMasker are 15.54, 12.1,
1.28, and 0.77 respectively.

4-1 Average total number of database sequences returned and precision (in percentage)
for NDPS, BDPS-SEG, QDPS-GBA, QDPS-SC, and QDPS-SCG on the Swissprot
dataset.

4-2 Average total number of database sequences returned and precision (in percentage)
for NDPS, BDPS-SEG, QDPS-GBA, QDPS-SC, and QDPS-SCG on the PDB
dataset.

4-3 Average total number of database sequences returned and precision (in percentage)
for NHTS, BHTS-SEG, QHTS-GBA-Read, QHTS-SC-Read, and QHTS-SCG-Read
on the Swissprot dataset.

4-4 Average recall and precision (in percentage) of QHTS-GBA-Reconstruct,
QHTS-GBA-Read, QHTS-SC-Reconstruct, QHTS-SC-Read, QHTS-SCG-Reconstruct,
and QHTS-SCG-Read on the Swissprot dataset.

4-5 Relative information loss regarding the length of k-grams (in percentage).

4-6 CPU times spent in HTC, SP, and AP, and the number of k-grams stored in
the hash tables for (I) NHTS, (II) BHTS-SEG, (III) QHTS-GBA-Reconstruct,
(IV) QHTS-SC-Reconstruct, and (V) QHTS-SCG-Reconstruct.










LIST OF FIGURES


Figure

2-1 A sequence before and after a transposon insertion. The narrow bars represent
the same repeat in both versions of the sequence.

2-2 The pseudocode of our algorithm, Greedier.

2-3 An example of two alignments, both with Plus/Plus orientation, that do not overlap.
Two bars connected by a non-horizontal line represent fragments participating
in an alignment. The two numbers in a pair of parentheses are the starting and ending
coordinates of the corresponding fragment. The two numbers in a pair of square
brackets are the number of identical letters in an alignment and the length of
the alignment.

2-4 The pseudocode for traversing a graph.

2-5 Two extreme cases in terms of gap lengths. Each pair of bars connected by a
non-horizontal line represents fragments participating in an alignment.

2-6 An example of nested transposons in the fourth Arabidopsis chromosome identified
by Greedier at different iterations. Bars represent transposon fragments. Numbers
outside and inside the bars are the corresponding chromosome coordinates and
iteration numbers respectively.

2-7 The cumulative percentages of masked transposons and exons that do not belong
to transposons (i.e., other exons) in the Arabidopsis genome by each iteration
of Greedier (the two curves) and the percentages of the same kinds of masked
regions by cross_match (the two thick horizontal lines) and WindowMasker (the
two thin horizontal lines).

3-1 Four steps of GBA on a sequence with 3 approximate repetitions of AYTV.
Underlined letters indicate repeats. Rectangles denote regions identified as candidate
LCRs by GBA at different steps.

3-2 Contribution of the repeat region TPSTT to R. c0 denotes the forget rate.

3-3 Comparison between Shannon Entropy and our 2-gram complexity measure.
The x-axis represents ratios from repeats. The y-axis represents ratios from non-repeats.

3-4 Average recalls of GBA, SE-GBA, 0j.py, CARD, and SEG on four datasets.

3-5 Average precisions of GBA, SE-GBA, 0j.py, CARD, and SEG on four datasets.

3-6 Relationship between precision and recall of GBA, 0j.py, and CARD.

3-7 Average Jaccard coefficients of GBA, SE-GBA, 0j.py, CARD, and SEG on four
datasets.










4-1 Two sequences that have the same LCR indicated by the underlined letters.
Here, the LCRs are composed of a repeating pattern of AAT.

4-2 A sequence with an underlined LCR and a masked version of it. Masked letters
are replaced by x.

4-3 A sequence with an underlined LCR and its masked versions by two different
LCR-identification algorithms. Letters masked by each algorithm are replaced
by x.

4-4 An example of a probabilistic hash table and reconstruction. The solid lines show
the three 3-grams of the sequence GARAQAQAQKL stored in the hash table. The
resulting sequence after reconstruction is at the bottom. Letter X denotes an
unknown letter.

4-5 Memory adaption scheme. qs, db, and HT represent the query set, the database,
and the hash table respectively. For both reconstruction case and reading case,
Mr and M.1 are the amount of memory allocated to the hash table and the database
sequences respectively. Mi1/C is the amount of the query set whose k-grams
can be held in a hash table of size My probabilistically. For each My1/C pages of
the query set, a hash table is built and the whole database is read into memory
once to find seeds. Mi Only applies to reading case. It represents the amount
of memory allocated to the query set for seed extension. In the worst case, for
each Ml. pages of the database, the entire query set is read into memory once in
chunks of M.l.

4-6 The minimum I/O cost varies, depending on whether to use reconstruction or
read strategy, whether to use quality values, and the relationship between the
available amount of memory M and the size of the database DB.

4-7 Result set size versus recall (in percentage) for NDPS, BDPS and QDPS on the
Swissprot dataset.

4-8 Result set size versus recall (in percentage) for NDPS, BDPS and QDPS on the
PDB dataset.

4-9 Result set size versus recall (in percentage) for NHTS, BHTS and QHTS on the
Swissprot dataset.

4-10 Result set size versus recall (in percentage) for QHTS and BLAST with the LCR-
filter off (BLAST-w/oFilter) and on (BLAST-w/Filter) for querying the protein
sequence, APOA4_MOUSE, against our Swissprot database.

4-11 BLAST CPU time and QHTS I/O times for datasets of increasing size.










Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy

IDENTIFICATION AND APPLICATION OF REPETITIVE BIOLOGICAL
SEQUENCES

By

Xuehui Li

December 2007

Chair: Tamer Kahveci
Major: Computer Engineering

Biological sequences are rich in repeats. For example, more than 50% of the human

genome consists of repeats and approximately one-quarter of the amino acids are in

repeats. Repeats are subsequences of biased composition. They vary in size from less

than a hundred bases to tens of kilobases. They are found as either tandem arrays

or dispersed throughout the genome. Repeats can generate insertions, deletions, and

unequal crossing-over within genomes and affect protein functions. Hence, repeats play

important roles in genome evolution. Repeat identification is normally the first step

of studying repeats and a critical part of sequence analysis. For protein sequences,

some repeats are popularly referred to as low complexity regions (LCRs). Although some

computational tools have been developed to identify genomic repeats or LCRs, they all

are geared towards specific situations and suffer from different problems. We develop

novel methods to identify genomic repeats and LCRs, respectively. Genomic repeats and

LCRs present difficulties in genome annotation and analyses. Local alignments between

repeats cause many false positives in sequence similarity search. These false positives can

cause misassembly of genome sequences or misidentification of repeats as gene/protein

sequences. Existing sequence similarity search algorithms either ignore the existence of

these repeats or completely remove them. The first strategy produces false positives. The

second strategy is not desirable, since no LCR-identification tool is 100% accurate. We

develop new algorithms that use LCR information wisely to improve the accuracy and

efficiency of sequence search.










CHAPTER 1
INTRODUCTION

Biological sequences are very different from random sequences. They contain

subsequences of strong compositional bias, i.e., repeats. Biological sequences are rich

in repeats. For example, more than 50% of the human genome consists of repeats and

approximately one-quarter of the amino acids are in repeats. Repeats vary in size from

less than a hundred bases to tens of kilobases. They are found as either tandem arrays or

dispersed throughout the genome. Transposons are mobile genetic elements and comprise

the most common class of repeats. Transposable elements often make up a substantial

fraction of the host genomes. Many transposons have a tendency to target new insertions

within existing copies of transposons [49, 63]. This characteristic causes repeats to be

nested within one another. In other words, many repeats within a genome sequence are

split by transposon insertions, resulting in individual repeat units being fragmented. Some

repeats in protein sequences are popularly referred to as low complexity regions (LCRs).

Statistical analyses of protein sequences have shown that more than one-half of the

proteins have at least one LCR. There is no universal complexity function that would work

for all LCRs. LCRs are important. They attract purifying selection, and they become deleterious

and lead to human diseases when the number of copies of a repeat inside them exceeds a certain number.

Despite their abundance and importance, their compositional and structural properties are

poorly understood. So are their functions and evolution.

Repeat identification is normally the first step in studying repeats. It is a critical

part of sequence analysis. First, repeats are believed to play important roles in the course

of genome evolution and may have effects on protein functions. Second, repeats cause

problems for many biological applications, e.g., genome sequence assembly and

sequence similarity search. Genomic repeats and LCRs cause many false positives in

sequence similarity search. For example, BLAST returns over 1,000 statistically significant

sequences for Thermus thermophilus seryl tRNA synthetase, which has only 31 true

positive homologs [75]. These false positives confuse genome annotation and analyses.

Therefore, it is critical to be able to identify these repeats accurately. Although some

computational tools have been developed to identify genomic repeats or LCRs, they all are

geared towards specific situations and suffer from different problems. Existing sequence

similarity search algorithms either ignore the existence of these LCRs or completely

remove them. Ignoring LCRs results in false positives. Removing LCRs is not desirable,

since no LCR-identification tool is 100% accurate.

This dissertation addresses the following three problems:

1. Identification of repeats in genomes.

2. Identification of LCRs in a protein sequence.

3. Searching biological sequence databases using LCRs.

For each of these problems, we develop new methods and compare them with existing

methods experimentally. The rest of the dissertation is organized as follows. Chapter 2

presents our method to identify repeats in genomes (problem 1). Chapter 3 introduces

our novel method to identify LCRs in a protein sequence (problem 2). Chapter 4 presents

our work on using LCRs in similarity search over biological sequence databases (problem 3).

Chapter 5 concludes the dissertation.









CHAPTER 2
A NOVEL GENOME-SCALE REPEAT FINDER GEARED TOWARDS
TRANSPOSONS

In this chapter, we consider the problem of identifying repeats in genomes. We

introduce our primitive algorithm that solves this problem. We present our experimental

results.

2.1 Motivation and Problem Definition

Genomes typically are rich in repeat sequences. For example, it is estimated that

50-80% of the maize genome [63] and more than 50% of the human genome [20] consist

of repetitive sequences. Repeats vary in size from less than a hundred bases to tens of

kilobases. They are found as either tandem arrays or dispersed throughout the genome.

Repeats can generate insertions, deletions, and unequal crossing-over within genomes.

Hence, repeats play important roles in genome evolution [14] by causing mutations

and rearrangements [49] that lead to altered gene functions [16]. Repeats also present

difficulties in genome annotation and analyses. Local alignments between repeats

produce false positives in comparisons of DNA sequences. These false positives can cause

misassembly of genome sequences or misidentification of repeats as gene sequences [10].

For these reasons, it is critical to identify repetitive sequences accurately.

Transposons are mobile genetic elements and comprise the most common class of

repeats. Transposable elements often make up a substantial fraction of the host genomes,

especially the majority of many plant genomes [1, 26, 62, 74]. For example, 45% of the

human genome [67], more than 90% of the wheat genome [51], and 12.4% of the rice

genome consist of transposable elements [50, 72] (http://www.tigr.org/tdb/e2k1/osa1/).

Transposons can be divided into diverse families [43] and individual types of

transposons can have hundreds or thousands of copies in complex genomes.

Transposons accumulate multiple copies within a genome based on a variety of

replicative mechanisms. DNA elements utilize either the DNA replication or DNA repair

machinery within the cell to increase in copy number, while retrotransposons utilize RNA

polymerase and reverse transcriptase to increase in copy number. In either case, the

elements need to insert themselves into the genome, which requires either a transposase or

integrase enzyme activity (http://home.comcast.net/~john.kimball1/BiologyPages/T/Transposons.html).

Interestingly, many transposons have a tendency to target new

insertions within existing copies of transposons [49, 63, 76]. This characteristic causes

repeats to be nested within one another. In other words, many repeats within a genome

sequence are split by transposon insertions, resulting in individual repeat units being

fragmented. Figure 2-1 shows a schematic of a nested transposon insertion. The inserted

transposon splits an existing repeat into two fragments. There is strong evidence for the

existence of nested and split repeats in the maize, rice, and wheat genomes [49, 63, 76].

Fragmented repeats are hard to identify using existing tools, such as Cross_match

(www.phrap.org/phredphrapconsed.html), for two reasons. First, the fragments may

be too short to be considered as significant matches to a repeat library. Second, the

fragments are likely to have a high level of sequence divergence from known repeat

structures.


Figure 2-1. A sequence before and after a transposon insertion. The narrow bars represent
the same repeat in both versions of the sequence. (a) The sequence before the
insertion. (b) The new sequence after the insertion. The thick bar represents
the inserted transposon.


Repeat finders need to maximize the identification of true positives (bona fide

repeats) while limiting false positives. False positives are sequences that are flagged as

repeats, but are actually sequences that biologists would consider to be non-repeat.

The precise definition of a false positive can be debated, but in this chapter we take

a biological perspective towards the identification of repeats. Our goal is to separate

transposable elements and tandem repeat arrays from exon sequences that are more

likely to have a functional role in an organism's phenotype. We contend that many

biologists would like to retain related gene families other than transposons when masking

or identifying repeats within a genome.

We consider the problem of finding repeats in a genome given a repeat library.

A repeat library is a library that contains known repeat units. A repeat unit is a

complete repeat structure. We develop an iterative algorithm that takes into account

the fragmentation of repeats as well as the high likelihood that the fragmented repeats will

be more diverged from the reference repeat library. Our algorithm, called Greedier, uses

local similarities between the genome and the repeat library to build graphs that connect

fragmented repeats. The identified repeats are then removed for additional iterations

that have progressively lower similarity cutoffs. Our results with identifying repeats from

the Arabidopsis and rice genomes suggest that Greedier shows significant improvements

over one of the most recently developed de novo repeat finders, WindowMasker, and the

standard algorithm, Cross_match, which is implemented in most repeat finders that use an

annotated repeat library. In addition to masking repeats, Greedier also reports potential

nested transposon structures.

The rest of the chapter is organized as follows. Section 2.2 discusses the related

work. Section 2.3 describes our algorithm, Greedier. Section 2.4 presents the experimental

results. Section 2.5 concludes the chapter.

2.2 Related Work

Repeats exist in both DNA and protein sequences. A number of algorithms have been

developed to identify repeats in proteins, such as GBA [45] and SEG [80]. Here we focus

on DNA sequences.

Current repeat finding algorithms follow two strategies: de novo repeat finders and

repeat finders that use existing repeat libraries. The former identify repeats without

relying on prior knowledge of the repeat sequences that exist within a genome. The latter

require annotated collections of known repeats.

Some de novo algorithms, such as TRF [11], LTR_STRUC [52], and PILER [25],

identify specific structures within the DNA sequence to identify repeats. The main utility

of these de novo repeat finders is to identify novel, recently-evolved repeats.

TRF identifies tandem repeats that have a variable unit length and/or are disrupted

by insertions and deletions. It models tandem repeats by percent identity and frequency

of indels between adjacent pattern copies. It uses statistically based recognition criteria.

For example, it models alignment of two tandem copies of a pattern by a sequence of

independent Bernoulli trials.

LTR_STRUC searches for long terminal repeat transposons. It seeks certain generic

structural features of such elements. It first seeks the LTR pairs present at the ends of a

putative element. It then searches for additional characteristic retrotransposon features to

confirm the hit. For example, it checks alignment of regions flanking the pairs of matches.

However, LTR_STRUC is limited to finding sequences with well-conserved structural

features of a retrotransposon.

PILER exploits only characteristic patterns of local alignments in the sequence.

Each alignment forms a pattern which is typical of a class of repeat. It identifies likely

functional transposable elements, tandem arrays, dispersed families, pseudosatellites, and

terminal repeats that show high levels of sequence identity.

Other de novo repeat finders identify fragmented or more divergent repeats based

on pairwise or multiple similarities within a genome. REPuter [40, 41] first uses suffix

trees to locate exact repeats. It then extends them to degenerate repeats by attacking

two sub-problems. The first one is called the mismatches repeat problem. It uses the

Hamming distance model to find mismatches between two maximal repeats. The second

one is called the differences repeat problem. It uses the edit distance model to allow for

insertions and deletions in two repeats. REPuter assesses the significance of these two

kinds of degenerate repeats by calculating their E-values [39]. However, it is unable to find

long and dispersed repeats.

To overcome the computationally intensive nature of generating pairwise similarity

scores, WindowMasker [53] and RAP [17] use word counting to identify short sequences

that are over-represented within an input genome sequence. Word length is a critical

parameter for word counting methods. Short words are found too frequently and are

not indicative of a significant repeat. They result in low specificity and high sensitivity.

On the other hand, long words are quite significant, but they are unsuitable to detect

degenerated repeats. They result in high specificity and low sensitivity. Word counting

methods cannot improve one aspect without deteriorating the other. It is hard to find the

best compromise between short words and long words.
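
The word-length trade-off described above can be illustrated with a minimal word-counting sketch in Python. This is not the WindowMasker or RAP implementation; the function names, the masking character x, and the count threshold are assumptions made only for illustration.

```python
from collections import Counter


def overrepresented_words(genome: str, k: int, min_count: int) -> dict:
    """Return the k-letter words that occur at least min_count times."""
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    return {word: n for word, n in counts.items() if n >= min_count}


def mask_frequent_words(genome: str, k: int, min_count: int) -> str:
    """Replace every occurrence of an over-represented word by x letters."""
    frequent = overrepresented_words(genome, k, min_count)
    masked = list(genome)
    for i in range(len(genome) - k + 1):
        if genome[i:i + k] in frequent:
            masked[i:i + k] = "x" * k
    return "".join(masked)


# Toy sequence (hypothetical): three tandem copies of ACG followed by TTTGCA.
toy = "ACGACGACGTTTGCA"
print(mask_frequent_words(toy, k=3, min_count=3))  # xxxxxxxxxTTTGCA
```

In this toy input only the word ACG reaches the threshold, so only the tandem copies are masked. Choosing a smaller k or a lower threshold flags more words, which mirrors the sensitivity versus specificity tension discussed in the text.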

WindowMasker is a two-pass algorithm. In the first pass, it calculates the size N

of the N-mers to consider, N-mer frequency counts for the genome, threshold scores for the

algorithm, and a score function that is based on the frequencies and threshold scores.

In the second pass, it calculates masked regions for the genome using the previously

generated score function and thresholds. RAP indexes symmetrical gapped words. It

calculates an index for each word starting at each position of the input genome sequence.

Both word counting and pairwise similarity algorithms have high false positive

rates, i.e., they have the potential of masking desired gene families other than biological

repeats. Most of the similarity-based algorithms have not been benchmarked against

genome annotations to determine if a significant fraction of the masked sequences are false

positives.

Some de novo repeat finders identify fragmented or more divergent repeats based

on multiple alignments between related genomes. The method of [18] is a comparative approach that

identifies transposable elements. Standard comparative genomic principles dictate

that conserved regions in alignments highlight functional elements [12, 13]. Lack of

conservation is equally useful: inserted sequences that have little or no alignment to

other genomes lead to signatures in multiple alignments that can be used to identify

transposable elements. The method of [18] searches for disrupted conservation patterns in whole genome

alignments. However, poorly characterized genome families are not suitable for such

comparative approaches. These approaches rely on well-assembled genomes and the selection

of appropriate evolutionary models.

In contrast to de novo repeat finders, repeat finders that use annotated repeat

libraries usually produce fewer false positives. Like de novo algorithms, these repeat

finders use pairwise similarity. However, the similarity test is against a known collection

of repeats. Gene families that a biologist would consider to be non-repeat can be

retained by excluding these sequences from the repeat library. Such repeat finders

include Cross_match, RepeatMasker (www.repeatmasker.org/), MaskerAid [8], and

CENSOR [36]. Cross_match is an efficient implementation of the Gotoh algorithm [28]

for comparing two sets of sequences. It uses banded searches to find matches with score

no less than a cutoff. RepeatMasker uses Cross_match as the default search engine for

sequence comparisons. MaskerAid is an enhancement to RepeatMasker that increases the

speed of masking. It uses WU-BLAST (http://blast.wustl.edu) as its search engine.

CENSOR uses the Smith-Waterman algorithm [68] for sequence comparisons. It then

evaluates the local alignments and masks homologous sequences (called 'censoring').

As the number of sequenced genomes increases, there is increased interest in

defining repeat sequences and generating repeat databases for multiple species. There

are several repeat database resources available online. One of the largest collections of

repeat databases is maintained by TIGR (www.tigr.org). TIGR focuses on plant repeat

databases. As of March 2007, 16 plant repeat databases are publicly available from TIGR.

Four databases contain repeats for entire plant families; the other 12 are for plant genera.

These databases contain repeats that can be classified into super-classes, classes, and

subclasses based on structure and sequence composition. For instance, at the super-class

level, there exist transposable elements, centromere-related repeats, telomere-related repeats,

rDNAs, and unclassified repeats. The length of each such repeat varies depending on both the

type of the repeat and the organism. One common observation is that transposons are one

of the longest kinds of repeats. For instance, the average length of rice retrotransposons is

1837 bp. Each repeat from these 16 databases was either extracted by TIGR or provided

by other public databases.

Library-based repeat finders depend heavily on the library. As mentioned in

Section 2.1, they cannot identify fragmented and divergent repeats. Eukaryotic genomes

contain large amounts of transposon relics, i.e., ancient, highly degenerated transposable

elements. The use of pairwise sequence comparisons to the repeat library is likely to fail to

detect these 'distant homologs' of known transposable element families [35]. Also, as shown

in Section 2.4.1, this kind of repeat finder does not have a satisfactory accuracy.

There are tools that deal with the identification of repeat families in genomes [7,

56, 59, 73]. These tools take a set of repeats in the genome as input. Their purposes are

either to classify them into families or to extend the exact repeats to find families. This is

different from the problem we consider in this chapter.

2.3 Description of the Algorithm

Our algorithm, Greedier, takes a target sequence, such as a chromosome, and a repeat

library as input. Greedier masks the target sequence based on sequence similarity to the

repeat units in the repeat library. It uses multiple iterations to identify divergent repeats.

Figure 2-2 shows the pseudocode for Greedier. Each iteration consists of two passes.

In the first pass (Steps 2-4), it identifies the local similarities between the chromosome

and the repeat sequences (Step 2). This can be done using any off-the-shelf sequence

comparison algorithm. We used BLAST [70] for this purpose. It then processes these local

similarities and builds graphs (Steps 3-4). A vertex in a graph denotes one similarity.

An edge in a graph denotes two similar subsequences that can be attached to form a

longer match. A path in a graph represents a longer match consisting of multiple edges,

i.e., two or more sequence similarities that can be connected to form a longer match.









Algorithm Greedier
Input: a chromosome c and a repeat library RDB
Output: reported repeats
1.  Repeat
2.      Run a sequence comparison algorithm with c as the query
        sequence and RDB as the database   /* first pass starts */
3.      If there are hits found by the comparison algorithm Then
4.          Build graphs from the comparison algorithm output
5.          Repeat   /* second pass starts */
6.              Traverse the graphs to report repeats with fitness >= e
7.              If there are reported repeats Then
8.                  Modify the graphs based on the reported repeats
9.          Until no repeats are reported
10.         If there are reported repeats Then
11.             Modify c based on the reported repeats
12.         Reduce e
13. Until no hits are found by the comparison algorithm
14. Report potential nested transposon structures


Figure 2-2. The pseudocode of our algorithm, Greedier.


Each path is associated with a fitness value between zero and one. This value reflects

the percent identity and the length of the match between the target subsequences on

the path and the repeat unit. In the second pass (Steps 5-12), Greedier traverses

the graphs greedily and forms a pool of repeat candidates. Each candidate in the pool

corresponds to a path with a fitness value no less than a cutoff, e. Greedier reports the

subsequences whose corresponding path has the highest fitness value as repeats (Step 6).

It then modifies all the graphs accordingly (Steps 7-8). It repeats this process until it

cannot find more repeats (Step 9). It then removes these reported repeats from the

chromosome and stitches the rest of the chromosome together (Steps 10-11). This allows

repeats to be split and nested within one another. Next, it relaxes the fitness constraint

for the next iteration (Step 12). Note that e decreases to allow divergent repeats to be

identified after the most recent insertions have been removed from the genome. Greedier

iterates until BLAST returns no hits. Finally, Greedier parses the masked regions from

different iterations and reports the potential nested transposon structures (Step 14). These

structures are identified either as individual blocks at higher level iterations or as a result

of stitching at lower level iterations.
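
The removal and stitching of Steps 10-11 can be sketched in Python as follows. This is only an illustrative sketch, not the Greedier implementation: the function name, the half-open 0-based interval convention, and the toy sequence are all assumptions.

```python
from typing import List, Tuple


def remove_and_stitch(chromosome: str, repeats: List[Tuple[int, int]]) -> str:
    """Cut out reported repeats and join the remaining pieces.

    Each repeat is a half-open interval [start, end) in 0-based
    chromosome coordinates (a convention assumed for this sketch).
    """
    kept = []
    position = 0
    for start, end in sorted(repeats):
        if start > position:
            kept.append(chromosome[position:start])
        position = max(position, end)  # also handles overlapping intervals
    kept.append(chromosome[position:])
    return "".join(kept)


# Toy example: the repeat ACGTACGT was split by an inserted element TTTTT.
left, inserted, right = "GGACGT", "TTTTT", "ACGTCC"
chromosome = left + inserted + right
stitched = remove_and_stitch(chromosome, [(len(left), len(left) + len(inserted))])
print(stitched)  # GGACGTACGTCC
```

After the inserted element is removed, the two fragments of the older repeat become adjacent, so a later iteration with a lower cutoff can match them as a single unit.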

Section 2.3.1 describes how to construct graphs from BLAST output (the first pass).

Section 2.3.2 illustrates how to traverse a graph (the second pass). Section 2.3.3 presents

one improvement to Greedier.

2.3.1 Graph Construction

For each repeat in the repeat library, we build an acyclic and directed graph if

BLAST reports similarities between that repeat and the target sequence. Each alignment

has an alignment orientation, denoted by o. When both positive strands or both negative strands of

the chromosome and the repeat participate in this alignment, we say that the alignment

has Plus/Plus orientation. Otherwise, we say that it has Plus/Minus orientation. Let sr

and er be the starting and the ending coordinates (i.e., positions) of the subsequence of

the repeat in the alignment respectively. Let sc and ec be the starting and the ending

coordinates of the subsequence of the target sequence in the alignment respectively.

Let a and b denote the number of identical letters in the alignment and the length

of the alignment respectively. For each alignment, we build a vertex, denoted by

(sr, er, sc, ec, a, b, o). In Figure 2-3, the two vertices corresponding to the two alignments

are (400, 889, 1200, 1700, 450, 500, Plus/Plus) and (922, 1545, 1750, 2370, 550, 650,

Plus/Plus). Let l_r denote the length of the remaining repeat subsequence (i.e., excluding

the subsequence corresponding to the alignment). Let x denote the average identity

between two random sequences. All letters in the nucleotide alphabet appear at each

position of a random sequence with the same probability of 0.25. Hence, x = 0.25. For

each vertex, we calculate a fitness value as

$$\frac{a + x \cdot l_r}{b + l_r}$$

The numerator is simply the total number of observed and expected identities. The

denominator is the total observed and expected alignment length. Thus, the fitness value

shows the expected identity rate when the entire repeat sequence is aligned with the target

sequence so that the local alignment for that vertex is preserved.


[Figure 2-3 labels: repeat unit fragments (400, 889) and (922, 1545); chromosome fragments (1200, 1700) and (1750, 2370)]

Figure 2-3. An example of two alignments both with Plus/Plus orientation that do not
overlap. Two bars connected by a non-horizontal line represent fragments
participating in an alignment. Two numbers in a pair of parentheses are the
starting and ending coordinates of the corresponding fragment. Two numbers
in a pair of square brackets are the number of identical letters in an alignment
and the length of the alignment.

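
The vertex fitness value can be computed directly from the definitions above. The following Python sketch uses the first alignment of Figure 2-3. The total repeat length of 2000 is a hypothetical value, since the text does not state it, and the inclusive-coordinate convention used to derive the remaining length is also an assumption.

```python
def vertex_fitness(a: int, b: int, l_r: int, x: float = 0.25) -> float:
    """Fitness of a single alignment (vertex).

    a   : number of identical letters in the alignment
    b   : length of the alignment
    l_r : length of the repeat not covered by the alignment
    x   : expected identity between two random nucleotide sequences
    """
    return (a + x * l_r) / (b + l_r)


# First alignment of Figure 2-3: repeat positions 400..889, a = 450, b = 500.
REPEAT_LENGTH = 2000                      # hypothetical, not given in the text
aligned_repeat_letters = 889 - 400 + 1    # inclusive coordinates (assumption)
remaining = REPEAT_LENGTH - aligned_repeat_letters
fitness = vertex_fitness(450, 500, remaining)
print(round(fitness, 4))
```

When the alignment covers the whole repeat, l_r is zero and the value reduces to the plain identity a / b. The longer the uncovered part of the repeat, the closer the value is pulled towards x.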

We say that two vertices from the same repeat, $(sr_1, er_1, sc_1, ec_1, a_1, b_1, o_1)$ and

$(sr_2, er_2, sc_2, ec_2, a_2, b_2, o_2)$, do not conflict if any of the following six sets of conditions is

satisfied:

1. $o_1 = o_2 = \text{Plus/Plus}$, $ec_1 \le sc_2$, and $er_1 < sr_2$;

2. $o_1 = o_2 = \text{Plus/Plus}$, $sc_1 \le sc_2 \le ec_1$, $er_1 < sr_2$, and $(ec_1 - sc_2) < (1 - e) \cdot \min(a_1, a_2)$;

3. $o_1 = o_2 = \text{Plus/Plus}$, $sr_1 \le sr_2 \le er_1$, $ec_1 \le sc_2$, and $(er_1 - sr_2) < (1 - e) \cdot \min(a_1, a_2)$;

4. $o_1 = o_2 = \text{Plus/Minus}$, $ec_1 \le sc_2$, and $er_1 > sr_2$;

5. $o_1 = o_2 = \text{Plus/Minus}$, $sc_1 \le sc_2 \le ec_1$, $er_1 > sr_2$, and $(ec_1 - sc_2) < (1 - e) \cdot \min(a_1, a_2)$;

6. $o_1 = o_2 = \text{Plus/Minus}$, $sr_1 \ge sr_2 \ge er_1$, $ec_1 \le sc_2$, and $(sr_2 - er_1) < (1 - e) \cdot \min(a_1, a_2)$.



The first three sets of conditions correspond to pairs of alignments of Plus / Plus

orientations. The last three correspond to pairs of alignments of Plus / Minus orientations.

Each pair of alignments in the first and the fourth sets of conditions do not overlap.

Figure 2-3 shows an example of two alignments from the first set of conditions. For such

pairs of alignments, our fitness value formula of a path puts constraints on the lengths of

the gaps on the repeat and the chromosome between the alignments automatically. We

will discuss these constraints in more detail in Section 2.3.3. Alignments in the other four

sets overlap. The last condition in each of the four sets puts an upper bound on the length

of the overlap. e is the fitness value cutoff introduced at the beginning of Section 2.3.

We construct an edge between two vertices if they do not conflict. Each edge is a

directed edge that goes from a vertex with a smaller starting coordinate to another vertex

with a bigger starting coordinate. It connects a pair of alignments of the same orientation.

Each such pair of alignments forms a longer alignment than the two alignments between

the repeat and the chromosome. In Figure 2-3, for example, the repeat subsequence with

coordinates 400 and 1545 and the target subsequence with coordinates 1200 and 2370 form

an alignment.
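
Edge construction can be sketched in Python as a pairwise check over the vertices of one repeat. This is a simplified sketch, not the authors' implementation: it only links strictly non-overlapping Plus/Plus alignments (a conservative subset of the first set of conditions) and ignores the overlapping and Plus/Minus cases. The names Vertex, compatible, and build_edges are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Vertex(NamedTuple):
    sr: int  # start on the repeat
    er: int  # end on the repeat
    sc: int  # start on the chromosome
    ec: int  # end on the chromosome
    a: int   # number of identical letters
    b: int   # alignment length
    o: str   # orientation, "Plus/Plus" or "Plus/Minus"


def compatible(v1: Vertex, v2: Vertex) -> bool:
    """Simplified check: Plus/Plus alignments that do not overlap at all."""
    return (v1.o == v2.o == "Plus/Plus"
            and v1.ec < v2.sc
            and v1.er < v2.sr)


def build_edges(vertices: List[Vertex]) -> List[Tuple[int, int]]:
    """Directed edges (i, j) from the vertex with the smaller chromosome
    start to the one with the bigger start; O(t^2) pairs are examined."""
    order = sorted(range(len(vertices)), key=lambda i: vertices[i].sc)
    edges = []
    for position, i in enumerate(order):
        for j in order[position + 1:]:
            if compatible(vertices[i], vertices[j]):
                edges.append((i, j))
    return edges


# The two alignments of Figure 2-3.
v1 = Vertex(400, 889, 1200, 1700, 450, 500, "Plus/Plus")
v2 = Vertex(922, 1545, 1750, 2370, 550, 650, "Plus/Plus")
print(build_edges([v1, v2]))  # [(0, 1)]
```

Because every edge points from a smaller to a bigger chromosome start, the resulting graph cannot contain a cycle.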

Let r denote the number of repeats in the repeat library. Assume each repeat has t

alignments with the chromosome on average. Thus, the average number of vertices in each

graph is t. Vertex construction for each graph takes O(t) time. The number of edges in

each graph is O(t^2) in the worst case. This means that graph construction for each repeat

takes O(t + t^2) time. The number of graphs is the same as the number of repeats in the

library. Hence, graph construction for all repeats at each iteration takes O(rt^2) time in the

worst case.

2.3.2 Graph Traversal

A path in a graph consists of vertices connected by edges. Suppose all vertices on a

path are ordered according to their starting coordinates on the chromosome. For a path

consisting of k vertices, we calculate a fitness value as

$$\frac{\cdots}{\sum_{i=1}^{k} b_i + \sum_{i=1}^{k-1} \max(g_{r_i}, g_{c_i}) + l_r}$$

g_ri and g_ci denote the interval lengths immediately following the ith alignment, i.e.,

the gap lengths between the ith and (i + 1)th vertices, on the repeat and the chromosome







Algorithm Traverse
Input: a graph G
Output: a repeat candidate
1. Find the vertex, vmax, with the highest fitness value in G
2. Find the set of the outgoing edges, Nout, of vmax
3. p = (vmax), where p denotes a path
   /* extending right */
4. While (Nout is not empty)
5.     Choose the vertex v ∈ Nout to join p, such that p has the highest fitness value
6.     Modify Nout to be the set of the outgoing edges of v
7. Find the longest subpath, pr, of p with fitness value no less than e
8. Extend pr to the left, similar to the extension to the right
9. Return pr


Figure 2-4. The pseudocode of traversing a graph.


respectively. y_i denotes the identity between the two gaps. l_r denotes the length of

the remaining letters (excluding all k alignments) in the repeat. In Figure 2-3, a_1, a_2,

b_1, and b_2 are 450, 550, 500, and 650 respectively. g_r1 and g_c1 are 33 (922-889) and 50

(1750-1700) respectively. There are two ways to calculate y_i. The first way is to assume

y_i as the average identity of two random sequences. This is computationally cheap, but

can be inaccurate. The second way is to calculate the actual identity between the two

gaps using a dynamic programming alignment method, such as the Needleman-Wunsch

algorithm [55]. This is an accurate, but computationally expensive choice. As will be

discussed in Section 2.4, the results using each of the choices are very similar in practice.

Therefore, we chose the former as it is computationally cheaper. The fitness value of a

path shows how identical the repeat unit and the target subsequences together are. A

higher fitness value indicates a higher similarity. Thus, the goal is to find a path that

maximizes the fitness value. Existing methods such as Cross_match use fixed score cutoffs

and are equivalent to identifying matches of fixed lengths. Such fixed-length window-based

methods ignore the fact that repeat units are of variable lengths. Our fitness formula

circumvents this problem. It makes Greedier able to identify matches of variable lengths

and equivalent to a variable-length window-based method.
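The second way of computing the gap identity can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the scoring scheme (match +1, mismatch -1, gap -1), the function name `nw_identity`, and the definition of identity as matching columns divided by alignment columns are choices made here and are not taken from the Greedier implementation.

```python
def nw_identity(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1) -> float:
    """Fraction of identical columns in one optimal Needleman-Wunsch alignment.

    Assumption: identity = matching columns / total alignment columns.
    """
    n, m = len(a), len(b)
    if n == 0 and m == 0:
        return 1.0  # two empty gaps are treated as identical
    # score[i][j] = best global alignment score of a[:i] and b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # Trace back one optimal alignment, counting columns and matches.
    i, j, matches, columns = n, m, 0, 0
    while i > 0 or j > 0:
        columns += 1
        if i > 0 and j > 0:
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            if score[i][j] == diag:
                matches += a[i - 1] == b[j - 1]
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1  # letter of a aligned to a gap
        else:
            j -= 1  # letter of b aligned to a gap
    return matches / columns


# The cheap alternative: for uniform base frequencies, two random DNA letters
# match with probability 4 * (1/4)**2 = 0.25.
RANDOM_DNA_IDENTITY = 0.25
```

The dynamic program takes time proportional to the product of the two gap lengths, which is why the constant estimate is the cheaper choice for long gaps.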










[Figure 2-5: two panels, (a) and (b), each showing a repeat unit drawn above the chromosome.]

Figure 2-5. Two extreme cases in terms of gap lengths. Each pair of bars connected by a
non-horizontal line represents fragments participating in an alignment. (a) The
extreme case when there is no gap on the repeat and the gap on the
chromosome ($g_c$) is extremely long. (b) The extreme case when there is no gap
on the chromosome and the gap on the repeat ($g_r$) is extremely long.


For each graph, we adopt a greedy traversal strategy. We start from the vertex with

the maximum fitness value in the graph. We extend this vertex to a path in the left and

right directions. When extending in one direction, we include a neighbor vertex of the

most recently chosen vertex into the path each time. The neighbor vertex is the one which

makes the currently extended path have the highest fitness value. We keep extending the

path in one direction until the path cannot be extended further. After extending the path

in both directions, we find the longest subpath of the extended path with fitness value no

less than $\epsilon$. We report this subpath as the repeat candidate to be joined into the pool.

Note that when extending the path in the right direction, we consider outgoing edges.

When extending the path in the left direction, we consider the incoming edges. Figure 2-4

shows the pseudocode of the traversal.
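The greedy traversal can be sketched in Python as follows. This is a minimal sketch, not the Greedier implementation: the adjacency dictionaries `out_edges` and `in_edges`, the `fitness` callable (standing in for the path fitness formula of Section 2.3.2), and all function names are assumptions introduced for illustration. The sketch follows the prose description, extending in both directions before trimming to the longest subpath that meets the cutoff.

```python
def extend(path, edges, fitness, to_right):
    """Greedily extend `path` in one direction until no neighbor is left."""
    path = list(path)
    while True:
        end = path[-1] if to_right else path[0]
        # Skip vertices already on the path so the loop always terminates.
        candidates = [v for v in edges.get(end, ()) if v not in path]
        if not candidates:
            return path

        def grown(v):
            return path + [v] if to_right else [v] + path

        # Pick the neighbor that gives the extended path the highest fitness.
        best = max(candidates, key=lambda v: fitness(grown(v)))
        path = grown(best)


def longest_fit_subpath(path, fitness, cutoff):
    """Longest contiguous subpath whose fitness is no less than `cutoff`."""
    n = len(path)
    for length in range(n, 0, -1):
        for start in range(n - length + 1):
            sub = path[start:start + length]
            if fitness(sub) >= cutoff:
                return sub
    return []


def traverse(vertices, out_edges, in_edges, fitness, cutoff):
    """Return one repeat candidate (a list of vertices) from a graph."""
    start = max(vertices, key=lambda v: fitness([v]))
    path = extend([start], out_edges, fitness, to_right=True)   # outgoing edges
    path = extend(path, in_edges, fitness, to_right=False)      # incoming edges
    return longest_fit_subpath(path, fitness, cutoff)
```

As a toy usage example, with per-vertex scores `{"a": 0.9, "b": 0.8, "c": 0.1}`, edges a→b→c, and the mean score as a stand-in fitness, a cutoff of 0.8 trims the extended path a, b, c back to a, b.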

Suppose that Greedier iterates $i$ times before BLAST returns no hits. At each

iteration, the worst-case time spent traversing graphs is linear in the number of edges, $O(rt^2)$.

Hence, the total time complexity of Greedier is $O((rt^2 + r^2)i)$, i.e., $O(rit^2)$.










2.3.3 Further Improvement to Greedier

In this section, we analyze the gap length constraints that our fitness value formula
for a path enforces automatically. These constraints reduce the number of constructed edges
and hence the graph size. As a result, they reduce the space and time complexities
of Greedier. Recall that a gap associated with an edge is the interval between
the two subsequences that define the alignments for the two vertices connected by that edge
(Section 2.3.2). There are two kinds of gaps involved in an edge: a gap on the repeat with
length $g_r$ and a gap on the target sequence with length $g_c$. Let $L$ be the repeat length. We
consider two extreme cases of an edge. These cases provide upper bounds for the length
difference between the two kinds of gaps.

One extreme case is when there is no gap on the repeat (i.e., $g_r = 0$), but there is a
gap on the target sequence. Figure 2-5(a) shows this case. The fitness value, $f$, of the edge
satisfies

$$\frac{L}{L + g_c} \ge f.$$

Equality holds when the entire repeat is identical to the corresponding subsequences
in the chromosome. According to Section 2.3.2, a path containing this edge will be
considered during graph traversal only if $f \ge \epsilon$. Thus, we have

$$\frac{L}{L + g_c} \ge \epsilon \implies g_c \le \frac{(1 - \epsilon)L}{\epsilon}.$$

This inequality gives the upper bound on the length of the gap on the chromosome when
there is no gap on the repeat. A similar analysis shows $g_c - g_r \le (1 - \epsilon)L/\epsilon$ when $g_r > 0$.

The other extreme case is when there is no gap on the chromosome (i.e., $g_c = 0$), but
there is a gap on the repeat. Figure 2-5(b) illustrates this case. The fitness value, $f$, of the
edge satisfies

$$\frac{L - g_r}{L} \ge f.$$

Equality holds when the rest of the repeat is identical to the corresponding
subsequences in the chromosome. According to Section 2.3.2, a path containing this edge
will be considered during graph traversal only if its fitness value is no less than the fitness
value cutoff $\epsilon$. Thus, we have

$$\frac{L - g_r}{L} \ge \epsilon \implies g_r \le (1 - \epsilon)L.$$

This inequality gives the upper bound on the length of the gap on the repeat when there is
no gap on the chromosome. Similarly, $g_r - g_c \le (1 - \epsilon)L$ when $g_c > 0$.

During the graph construction, we include an edge between two vertices only if the
gaps $g_r$ and $g_c$ for that edge satisfy these constraints. For $\epsilon \in [0.25, 0.95]$, in the two
extreme cases the upper bound for $g_r$ varies between $0.75L$ and $0.05L$, and the upper
bound for $g_c$ varies between $3L$ and $0.05L$. This indicates that only a limited number of
edges are drawn from each vertex. Thus, this filter significantly reduces the graph size.
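The two gap constraints can be checked with a few lines of Python. This is a sketch under stated assumptions: the function names `gap_bounds` and `keep_edge` are hypothetical, and the filter simply applies the two inequalities derived in this section, namely that the chromosome gap may exceed the repeat gap by at most (1 - ε)L/ε and the repeat gap may exceed the chromosome gap by at most (1 - ε)L.

```python
def gap_bounds(repeat_len: float, cutoff: float):
    """Return (max allowed g_c - g_r, max allowed g_r - g_c) for cutoff epsilon."""
    if not 0 < cutoff <= 1:
        raise ValueError("the fitness cutoff must lie in (0, 1]")
    chrom_excess = (1 - cutoff) * repeat_len / cutoff   # bound on g_c - g_r
    repeat_excess = (1 - cutoff) * repeat_len           # bound on g_r - g_c
    return chrom_excess, repeat_excess


def keep_edge(g_r: float, g_c: float, repeat_len: float, cutoff: float) -> bool:
    """True if an edge with gaps g_r (repeat) and g_c (chromosome) passes the filter."""
    chrom_excess, repeat_excess = gap_bounds(repeat_len, cutoff)
    return (g_c - g_r) <= chrom_excess and (g_r - g_c) <= repeat_excess


# For a repeat of length 100 and the lowest cutoff 0.25, the chromosome gap may
# exceed the repeat gap by up to 3L = 300, and the repeat gap may exceed the
# chromosome gap by up to 0.75L = 75.
print(gap_bounds(100, 0.25))
```

Because both bounds shrink as the cutoff grows, higher cutoffs prune more candidate edges before any traversal takes place.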

2.4 Experimental Evaluation

We evaluated the accuracy of Greedier by comparing its ability to identify repeats
with that of cross_match and WindowMasker. cross_match is the core algorithm of
RepeatMasker, MaskerAid, and CENSOR. WindowMasker
(ftp://ftp.ncbi.nih.gov/pub/agarwala/windowmasker/) is a de novo repeat finder that was
developed specifically to mask large sequence databases.

Dataset description: We used the Arabidopsis genome and rice chromosome 10
as benchmarks to compare the algorithms. Both of these genomes have been sequenced
to near completion, have extensive annotations of both genes and likely transposons,
and have existing repeat libraries. We obtained the Arabidopsis genome data, including
the five chromosomes assembled into pseudomolecules
(ftp://ftp.arabidopsis.org/home/tair/home/tair/Sequences/whole_chromosomes/), the gene
model function annotations
(ftp://ftp.arabidopsis.org/home/tair/home/tair/Genes/TAIR_sequenced_genes), and the gene
model coordinates
(ftp://ftp.arabidopsis.org/home/tair/home/tair/Maps/seqviewer_data/sv_gene_feature.data),
from TAIR (http://www.arabidopsis.org/). Similarly, we obtained the rice genome and
annotation data from TIGR
(ftp://ftp.tigr.org/pub/data/Eukaryotic_Projects/o_sativa/annotation_dbs/pseudomolecules/version_4.0/).
We obtained the Arabidopsis and rice repeat libraries from TIGR
(ftp://ftp.tigr.org/pub/data/TIGR_Plant_Repeats/). These repeat libraries contain
annotated transposons and other repeat classes such as tandem arrays.

Autonomous transposons encode proteins and share structural features with

non-repeat genes. Within the genome annotation, these transposons have been identified

as gene models, but they are also annotated as transposons. A masked letter is considered

to be a true positive (TP) if it belongs to a gene model annotated as a transposon.

Repeat sequences can also be found outside of gene models or within non-coding segments

of non-transposon genes. Letters in these regions of the genome were ignored, because

they were not assigned as repeat or non-repeat sequences in the genome annotation. The

remaining exon sequences were considered to be non-repeat sequences that should be

retained after masking. Letters masked in these exons are false positives (FPs).
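The TP and FP definitions above amount to intersecting the masked positions with two sets of annotated positions. The following Python sketch illustrates the bookkeeping; the interval representation (half-open `(start, end)` pairs) and the function names are assumptions made for this example, and a per-position set is used only for clarity, since a genome-scale implementation would more likely work with sorted intervals.

```python
def covered(intervals):
    """Set of positions covered by half-open [start, end) intervals."""
    positions = set()
    for start, end in intervals:
        positions.update(range(start, end))
    return positions


def count_tp_fp(masked, transposon_models, other_exons):
    """Count masked letters that fall in transposon gene models (TP) and in
    exons of non-transposon genes (FP). All other masked letters are ignored."""
    masked_pos = covered(masked)
    tp = len(masked_pos & covered(transposon_models))
    fp = len(masked_pos & covered(other_exons))
    return tp, fp


# Toy example: two masked regions, one transposon gene model, one other exon.
tp, fp = count_tp_fp(masked=[(0, 10), (20, 30)],
                     transposon_models=[(5, 25)],
                     other_exons=[(28, 40)])
print(tp, fp)  # 10 2
```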

2.4.1 Evaluation of Accuracy

Tables 2-1 and 2-2 show the relative accuracies of Greedier, cross_match, and

WindowMasker when run with default parameters. For Greedier, we set the average

identity between two random DNA samples, y = 0.25. We also computed the actual

identity using the Needleman-Wunsch algorithm and obtained similar results (data

not shown). Greedier has a much higher rate of identifying annotated transposons

and a much lower false positive rate than both cross_match and WindowMasker. The

percentages of bases masked by the three programs correspond to the relative levels of repeat

sequences found on the chromosome arms. However, in all chromosomes tested, we found a

significant increase in the number of transposon bases masked by Greedier.

For the Arabidopsis genome (Table 2-1), Greedier masked 2.4 and 2.8 times as many

transposon bases as cross_match and WindowMasker, respectively. At the same time,

Greedier masked 0.3 and 0.1 times as many false positives, respectively. Thus,












Table 2-1. Comparison of Greedier, RepeatMasker (RM), cross_match (CM), and
WindowMasker (WM) in terms of bases masked in different regions of the
Arabidopsis genome (gen), consisting of the five chromosomes (chrs):
chromosomes, transposons (TP), and other exons (FP). The TP/FP rates of
Greedier, RepeatMasker, cross_match, and WM are 2.85, 2.42, 0.37, and 0.10
respectively. Letter K represents 1000.

                          # of bases masked                        % of bases masked
Region     # annotated    Greedier  RM       CM       WM           Greedier  RM     CM     WM
Chr1 chr   30,433K        921K      2,025K   920K     5,991K       3.0       6.6    3.0    19.7
Chr1 TP    1,109K         214K      444K     55K      72K          19.3      40.0   5.0    6.5
Chr1 FP    11,110K        53K       174K     307K     [illegible]  0.5       1.6    2.8    7.6
Chr2 chr   19,705K        518K      1,378K   506K     [illegible]  2.6       7.0    2.5    18.5
Chr2 TP    1,198K         154K      402K     90K      68K          12.9      33.6   7.5    5.6
Chr2 FP    6,768K         62K       153K     124K     540K         0.9       2.3    1.8    8.0
Chr3 chr   23,471K        762K      1,685K   744K     4,209K       3.2       7.2    3.2    17.9
Chr3 TP    1,320K         173K      443K     47K      82,973       13.1      33.6   3.6    6.3
Chr3 FP    8,624K         52K       166K     214K     646K         0.6       1.9    2.5    7.5
Chr4 chr   18,585K        714K      1,548K   700K     3,275K       3.8       8.3    3.8    17.6
Chr4 TP    1,033K         208K      425K     133K     56K          20.1      41.3   12.8   5.5
Chr4 FP    6,584K         67K       163K     165K     476K         1.0       2.5    2.5    7.2
Chr5 chr   26,992K        915K      1,871K   855K     5,066K       3.3       6.9    3.2    18.8
Chr5 TP    1,246K         175K      372K     61K      65K          14.1      29.9   4.9    5.2
Chr5 FP    9,820K         90K       203K     219K     715K         0.9       2.1    2.2    7.3
gen        119,186K       3,831K    8,507K   3,725K   22,177K      3.2       7.1    3.1    18.5
gen TP     5,906K         924K      2,087K   386K     344K         16.0      35.7   6.8    5.8
gen FP     42,900K        325K      860K     1,028K   3,230K       0.7       2.0    2.4    7.5


Table 2-2. Comparison of Greedier, RepeatMasker (RM), cross_match (CM), and
WindowMasker (WM) in terms of bases masked in different regions of the 10th
rice chromosome: chromosome, G.M.s, transposons (TP), other G.M.s, and
other exons (FP). Refer to Table 2-1 for the entry meanings. The TP/FP rates
of Greedier, RepeatMasker, cross_match, and WindowMasker are 15.54, 12.1,
1.28, and 0.77 respectively.

                           # of bases masked                                % of bases masked
Region      # annotated    Greedier    RM          CM          WM           Greedier  RM     CM     WM
rice #10    22,876,596     3,973,477   6,839,111   4,594,861   4,315,506    17.4      29.9   20.0   18.9
TP          3,072,087      1,481,468   2,051,697   641,277     461,174      48.2      66.8   20.9   15.0
FP          3,297,203      101,616     181,082     535,830     293,923      3.1       5.5    16.3   8.9

Greedier could be considered to be roughly 8 and 28 times more accurate (2.4 / 0.3 =

8 and 2.8 / 0.1 = 28, the ratio of true positives to false positives) than cross_match and

WindowMasker, respectively. RepeatMasker, which uses cross_match, masked 2.3 times

as many letters as cross_match and Greedier. The TP rate of RepeatMasker is greater









[Figure 2-6: bars along chromosome 4 labeled with iteration numbers; recoverable
coordinate labels: 3,841,924; 3,844,712; 3,845,381; 3,846,847; 3,847,232; 3,850,389.]

Figure 2-6. An example of nested transposons in the fourth Arabidopsis chromosome
identified by Greedier at different iterations. Bars represent transposon
fragments. Numbers outside and inside the bars are the corresponding
chromosome coordinates and iteration numbers respectively.


than that of Greedier. This is mainly because it masks more letters. The TP/FP rate of
Greedier, however, is better than that of RepeatMasker. To understand why Greedier
incurred false negatives, we aligned the false negatives of Greedier with the repeat library
using BLAST. We did not get any significant matches. This implies that either the repeat
library is incomplete or BLAST is not accurate enough to find the false negatives.
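The genome-wide TP/FP rates quoted for Arabidopsis can be recomputed from the genome-wide (gen) rows of Table 2-1. The short Python sketch below does this from the counts as printed in the table, which are rounded to thousands of bases, so the results agree with the values in the table caption only to within that rounding. The dictionary layout and names are assumptions made for this example.

```python
# Genome-wide ("gen") masked-base counts from Table 2-1, in thousands of bases.
tp_masked = {"Greedier": 924, "RepeatMasker": 2087, "cross_match": 386, "WindowMasker": 344}
fp_masked = {"Greedier": 325, "RepeatMasker": 860, "cross_match": 1028, "WindowMasker": 3230}


def tp_fp_ratio(tool: str) -> float:
    """Masked transposon bases divided by masked bases in other exons."""
    return tp_masked[tool] / fp_masked[tool]


ratios = {tool: tp_fp_ratio(tool) for tool in tp_masked}
for tool, ratio in ratios.items():
    print(f"{tool:13s} TP/FP = {ratio:.2f}")
```

A higher ratio means that more of what a tool masks lies in annotated transposons rather than in other exons, which is the sense in which RepeatMasker has a higher TP rate but a lower TP/FP rate than Greedier.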

Greedier also showed higher accuracy than cross_match and WindowMasker for rice
chromosome 10 (Table 2-2). Greedier masked 1.5 and 2 times as many transposon bases
as cross_match and WindowMasker, respectively. At the same time, Greedier masked 0.5
and 0.9 times as many false positives, respectively. However, the differences between
the number of true positives and false positives recognized by the three programs were
not as pronounced as with Arabidopsis. Potentially, this is due to incorrect annotation
of transposons as gene sequences [10], leading to higher false positive rates with all three
algorithms.

We also measured the number of fragments that are masked by each of the methods.
On the Arabidopsis genome, Greedier, RepeatMasker, cross_match, and WindowMasker
masked 7,364, 15,421, 6,750, and 760,243 fragments, respectively. On the rice genome,
the same methods masked 7,738, 20,827, 7,690, and 116,614 fragments, respectively. This
demonstrates that RepeatMasker and especially WindowMasker mask a large number
of small discontinuous regions, whereas Greedier and cross_match mask relatively longer
and contiguous regions.










2.4.2 Fragmented Repeats and Potential Nested Repeats

Greedier was developed based on the hypothesis that identifying nested or fragmented
repeats would improve the accuracy of library-based repeat masking. Our genome-wide
results suggest that Greedier has a higher accuracy. Does this result from the identification
of nested repeats, or is it simply due to the implementation of a graph-based fitness
value that allows for fragmented pairwise matches? To answer this question, we
performed the following experiment to extract potential nested insertions. We defined a
match A as a potential insertion if 1) it is similar to a retrotransposon or transposon (i.e.,
the fitness value is high), and 2) after A is removed and the rest of the sequence is stitched
together, the subsequent iteration of Greedier finds a new match to another repeat in the
repeat library that did not exist before.

Figure 2-6 shows an example of a retrotransposon that has fragmented identity
to the repeat library. This retrotransposon, At4g06656, corresponds to coordinates
3,841,924 to 3,850,389 of Arabidopsis chromosome 4. Greedier masked the entire
annotated transposon region, while cross_match missed 43% of the letters. At iteration
one, the subsequence with coordinates 3,846,848 to 3,847,232 is identified as a putative
Ty3-gypsy-like retrotransposon sequence within the library. This sequence is masked due
to a high fitness value, because the genome sequence matches 96% of the repeat over the
entire unit length of the repeat. At iteration seven, the subsequence with coordinates
3,844,713 to 3,845,381 is identified. This is a 77% match to a conserved centromere
sequence over the entire length of the repeat unit in the repeat library. At iteration
eight, the remaining sequence is masked due to identity to two retrotransposons within
the repeat library. The middle segment, coordinates 3,845,382 to 3,846,847, shows no
similarity to the repeat library, which reduces the fitness value of the overall match.
However, the end segments show high levels of identity with both retrotransposons in the
repeat library and the entire segment is masked. This example illustrates how Greedier
can compensate for incomplete and redundant entries within the repeat library yet still
output accurate masking results.

The example above also suggests that Greedier could have a higher accuracy due to the
graph-based fitness value rather than the identification of nested repeats. If this were the
case, we would expect that applying the fitness value at a single, low cutoff would yield
the same extent of masking as completing 15 iterations with Greedier. To test this
prediction, we ran a single iteration of Greedier on the Arabidopsis genome using the
lowest fitness cutoff (i.e., 25%). With this single iteration, we masked only 10.8% of
the annotated transposon bases, compared with 16% of the transposon bases when
all iterations are completed. This experiment also gave a slightly lower false positive
rate of 0.52%, versus 0.78% for Greedier. Interestingly, the ratio of true
positives to false positives in the single iteration versus Greedier is about the same, 19.2
versus 20.5, respectively. These results suggest that stitching the genome together after
removing repeats gives a significant gain in the total number of transposon bases masked,
while retaining a low rate of masking true genes. Moreover, these results show that the
graph-based approach of Greedier results in a larger total number of transposon bases
masked with a higher accuracy than cross_match and WindowMasker even using a single
iteration (Table 2-1).

In addition to masking repeats, Greedier also reports potential nested transposon
structures. On the Arabidopsis genome level, Greedier finds a total of 587 potential nested
structures, with 92, 92, 106, 89, and 208 structures from chromosomes 1, 2, 3, 4, and 5,
respectively. These structures can be used by biologists as the candidate set to identify
nested transposon structures.

It would be interesting to see the percentage of the transposons that are nested
insertions. To calculate this number, we analyze the Greedier results and the
TAIR annotations. Greedier reports which fragments are potential nested insertions.
However, Greedier can generate false positives. TAIR annotates repeats, but it does
not distinguish insertions and it can have false negatives. To count the nested
insertions correctly, we take the following measurements. We count the number of bases that
belong to potential nested insertions identified by Greedier and annotated as repeats by
TAIR. Let X denote this number. We also count the number of bases identified as repeats
(nested or not) by both Greedier and TAIR. Let Y denote this number. We compute the
percentage of bases in nested insertions as X × 100/Y. This percentage for chromosomes
1, 2, 3, 4, and 5 of the Arabidopsis genome is 24.6%, 36.1%, 24.5%, 36.5%, and 41.4%,
respectively, with an average of 32.6%.

RepeatMasker identifies 74%, 81.9%, 75%, and 75% of these nested repeats
for chromosomes 1, 2, 3, 4, and 5, respectively, with an average of 79%. Thus, although
RepeatMasker masks more than twice as many bases as Greedier, it misses a significant
portion of the nested insertions. This means that Greedier identifies potential nested
repeats more accurately than RepeatMasker. Furthermore, RepeatMasker does not
distinguish a nested insertion from a repeat that is not a nested insertion.

2.4.3 Significance Analyses

Only vertices on traversed paths correspond to BLAST matches. Greedier, however,
masks the regions between these vertices. Could these regions be
purely random? To test this hypothesis, we performed two probability analyses. As a
by-product of the single-iteration experiment in Section 2.4.2, we calculated the TP/FP
(True Positive/False Positive) rate of all regions between the vertices. This rate is 23.52.
As shown in Table 2-1, the total numbers of annotated transposon bases and bases of exons
that do not belong to transposons are 5,905,785 and 42,900,000, respectively. These two
numbers yield an expected TP/FP rate of 0.14, which is far less than 23.52. As a second
analysis, we calculated the probability (the p-value) of finding as many TPs as
appear in these regions under a uniform background distribution of TPs as given in Table
2-1. This p-value was zero when calculated using the incomplete beta function. In other
words, it was less than the machine precision. Both probability analyses indicate that the
regions between vertices masked by Greedier are not random.
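Both analyses can be illustrated with a short Python sketch. The expected TP/FP rate uses the two annotated totals quoted above. For the p-value, the sketch assumes a binomial model in which each masked base is independently a TP with the background probability; the upper tail of a binomial distribution equals a regularized incomplete beta function, and here it is evaluated in log space with only the standard library. The counts `n_bases` and `n_tp` are hypothetical, chosen only so that their TP/FP ratio is close to 23.52, because the actual counts for the regions between vertices are not given in the text.

```python
import math

ANNOTATED_TP = 5_905_785    # annotated transposon bases
ANNOTATED_FP = 42_900_000   # exon bases that do not belong to transposons

expected_tp_fp = ANNOTATED_TP / ANNOTATED_FP                  # about 0.14
background_p = ANNOTATED_TP / (ANNOTATED_TP + ANNOTATED_FP)   # P(random base is TP)


def log_binom_tail(n: int, k: int, p: float) -> float:
    """Natural log of P(X >= k) for X ~ Binomial(n, p), with 0 < p < 1."""
    logs = [
        math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
        + i * math.log(p) + (n - i) * math.log(1 - p)
        for i in range(k, n + 1)
    ]
    top = max(logs)  # log-sum-exp keeps the sum from underflowing
    return top + math.log(sum(math.exp(x - top) for x in logs))


# Hypothetical counts (assumption): 10,000 masked bases, 9,592 of them TPs,
# i.e. a TP/FP ratio of 9,592 / 408, roughly 23.5.
n_bases, n_tp = 10_000, 9_592
log_p = log_binom_tail(n_bases, n_tp, background_p)
p_value = math.exp(log_p)  # underflows to 0.0 in double precision
print(round(expected_tp_fp, 2), log_p, p_value)
```

Working with the logarithm shows how far below machine precision such a p-value lies, even though the p-value itself evaluates to zero.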

2.4.4 Evaluation of Possible Optimization Strategies

Tables 2-1 and 2-2 show the number of bases masked by Greedier after the final
iteration of the program, once no more BLAST hits can be found. It is possible that the
last few iterations contribute the bulk of the false positive masking and that the algorithm
could be optimized by retaining a higher stringency for the pairwise matches. Figure 2-7
shows the cumulative percentages of masked bases with each iteration for the Arabidopsis
genome. Greedier completes the bulk of the repeat masking in the last few iterations but
retains a nearly constant rate of false positive masking. These data suggest that increasing
the stringency of Greedier will not significantly reduce false positives but will significantly
reduce the number of true positive bases masked. Figure 2-7 also shows that Greedier
identifies more TPs than cross_match and WindowMasker after the 12th and 11th iterations,
respectively, while identifying fewer false positives at every iteration.

Greedier, cross_match, and WindowMasker all masked only a small fraction of the
transposons that were annotated in the Arabidopsis and rice genomic sequences we tested.
It is possible that significant BLAST hits were missed by Greedier, because they had
insufficient fitness values to be masked. This hypothesis predicts that the masked genome
should contain transposons that still have significant matches to the repeat library.
We extracted the annotated transposons missed by Greedier from the Arabidopsis
genome. We ran BLAST with these transposons as the query set and the repeat library as
the library. BLAST did not report any matches. Thus, annotated transposons missed by
Greedier are primarily due to the absence of these sequences in the repeat library.

2.5 Discussion

Repeat identification is a challenging problem. Conventionally, repeats are identified

through a mix of structure-based and similarity-based searches that are annotated

manually. As more genomes have been sequenced, it has become important to identify











[Figure 2-7: line plot. x-axis: iteration (0 to 14). Legend: transposons-TP,
other exons-FP, cross_match TP, cross_match FP, WindowMasker TP, WindowMasker FP.]

Figure 2-7. The cumulative percentages of masked transposons and exons that do not
belong to transposons (i.e., other exons) in the Arabidopsis genome by each
iteration of Greedier (the two curves) and the percentages of the same kinds of
masked regions by cross_match (the two thick horizontal lines) and
WindowMasker (the two thin horizontal lines).


repeats using automated approaches. However, the biological characteristics of repeat

sequences create challenges for automated repeat masking. Transposons share many

sequence features with true genes, which confuses gene-finding algorithms and leads to
the annotation of transposons as genes [10]. Transposons and other repeat sequences also
evolve rapidly, making it difficult to recognize repeats through pairwise comparisons.
Similarly, transposons tend to insert into existing repeats, creating repeat
fragments that are more difficult to identify using similarity scores. Greedier addresses
many of these challenges and shows a clear improvement over the standard repeat masking
algorithm, cross_match (www.phrap.org/phredphrapconsed.html). It also has an

additional feature of reporting potential nested transposon structures, which can help

biologists annotate complex repeat structures in genomic sequences. However, Greedier

requires a library of known repeat units to identify the remaining repeats within a target

sequence. Hence, Greedier is limited by the accuracy and completeness of the repeat

library.










Word counting algorithms such as WindowMasker [53] circumvent the need for

a repeat library. These algorithms mask short sequences that are over-represented in

the target genome. In our experiments, we found WindowMasker to have a low level

of accuracy even though it can mask more total bases than Greedier. Word counting

methods have lower accuracies, because word counting does not take into account

biological characteristics of both repeat and gene sequences. First, repeats do not always

exist in high copies [9]. Repeats can diverge rapidly leading to sequences that will not

be over-represented relative to the whole genome. Second, genes can contain sequences

that would be considered high-copy. Genes can evolve through amplification events and

be found in closely-related families. Also, sequence motifs within genes are conserved and

found in many genes. Such conservation is likely to cause segments of genes to be

detected as over-represented. In contrast to word counting algorithms, structure-based

de novo repeat finders can find low-copy repeats. However, these algorithms tend to be

limited to finding repeats that have well-conserved structures and are unlikely to find

divergent and fragmented repeats. Potentially, generating a library of highly-conserved

repeat sequences using algorithms like TRF [11], LTR_STRUC [52], and PILER [25] would

serve to generate a more complete representation of the highly-conserved transposons and

other repeat units within a genome. Greedier could then be used to identify divergent and

fragmented repeats based on the computationally-generated libraries.










CHAPTER 3
A NOVEL ALGORITHM FOR IDENTIFYING LOW-COMPLEXITY REGIONS IN A
PROTEIN SEQUENCE

In this chapter, we first introduce the problem of identifying LCRs in a protein

sequence. We then describe our algorithm that solves this problem. We finally present our

experimental results.

3.1 Motivation and Problem Definition

Low complexity regions (LCRs) in a protein sequence are subsequences of biased
composition. Three main sources of LCRs are cryptic, tandem, and interspersed
repeats [2, 60, 66, 75, 77, 78, 80]. Let $\Gamma$ be the alphabet for amino acids. We say that
two letters from $\Gamma$ are similar if their similarity score is above a cutoff according to a
scoring matrix, say BLOSUM62 [32]. We say that two sequences are similar if their
alignment score is greater than a cutoff. Let $\Gamma^*$ be an arbitrary sequence over $\Gamma$. Let
$x = s_1 \Gamma^* s_2 \Gamma^* \cdots \Gamma^* s_k$ be a subsequence of a protein sequence. We call the subsequences
$s_1, s_2, \ldots, s_k$ repeats of one another if the following four conditions hold: 1) $s_1, s_2, \ldots, s_k$
are similar sequences, 2) each $s_i$ is longer than a cutoff, 3) each $\Gamma^*$ is shorter than a cutoff,
and 4) there is no supersequence of $x$ that satisfies the previous three conditions.

Depending on $\Gamma^*$, repeats can be classified into two categories: (1) Tandem repeats.
In this case, $\Gamma^* = \emptyset$ for every $\Gamma^*$, i.e., tandem repeats are an array of consecutive similar
sequences such as KTPKTPKTPKTP. (2) Interspersed repeats. In this case, $\exists\, \Gamma^*$ with $\Gamma^* \neq \emptyset$, i.e.,
at least two repeats, one of which follows the other as the closest repeat, are not adjacent.
An example of interspersed repeats is KTPAKTPKTPKTP. Cryptic repeats are a special case
of repeats. In this kind of repeat, $s_1, s_2, \ldots, s_k$ are not only similar sequences, but the
letters contained in them are also all similar to one another, such as KKKAKKK. We call repeats
$s_1, s_2, \ldots, s_k$ inexact if $\exists\, i, j$ such that $s_i \neq s_j$. Repeats $s_1, s_2, \ldots, s_k$ are considered
an LCR if their complexity is less than a cutoff based on a complexity function. One
commonly used complexity function is the Shannon entropy [65]. Note that there is no
known complexity function or complexity cutoff that works for all manually annotated
LCRs. The correct complexity formulation is an open problem.
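As a concrete illustration of a complexity function, the Shannon entropy of a sequence's letter composition can be computed in a few lines of Python. This sketch is only an example of the standard entropy formula applied to letter frequencies; the function name is an assumption, and it is not one of the complexity measures proposed in this chapter.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Shannon entropy, in bits per letter, of the letter composition of seq."""
    counts = Counter(seq)
    n = len(seq)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


# A homopolymer has zero entropy; the tandem repeat example from the text uses
# three letters equally often, giving log2(3) bits per letter.
for s in ("KKKKKKK", "KTPKTPKTPKTP", "KTPAKTPKTPKTP"):
    print(s, shannon_entropy(s))
```

Because this measure depends only on letter counts, KTPKTPKTPKTP and a rearrangement such as KKKKTTTTPPPP receive the same value, and similar but non-identical letters are treated as unrelated. Both observations are consistent with the remark above that no single known complexity function works for all annotated LCRs.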

Experiments have shown that LCRs have an effect on protein functions [42]. Certain
types of LCRs are usually found in proteins of particular functional classes, especially
transcription factors and protein kinases [31]. All of this means that LCRs may indicate
protein functions [37, 58, 64], contribute to the evolution of new proteins, and thus
contribute to cellular signalling pathways. Some LCRs attract purifying selection, become
deleterious, and therefore lead to human diseases when the number of copies of a repeat
inside them exceeds a threshold [23]. LCRs cause many false positives in local similarity
searches in a sequence database. BLAST [4], a popular local alignment program, uses the
maximal segment pair (MSP) score to find optimal alignments. The theory of MSP can assure
that statistically significant high-scoring alignments are found. However, biological
sequences are very different from random sequences. Statistically significant high-scoring
matches due to LCRs are not biologically significant. Hence, they are false positives.

Statistical analyses of protein sequences have shown that approximately one-quarter
of the amino acids are in LCRs and more than one-half of proteins have at least one
LCR [78]. Despite their importance and abundance, their compositional and structural
properties are poorly understood, as are their functions and evolution. Identifying LCRs
can be the first step in studying them and can help detect the functions of a new protein.
Computing the complexities of all possible subsequence sets is impractical even for a
single sequence, since the number of such sets is exponential in the sequence length.
Several heuristic algorithms have been developed to quickly identify LCRs in a protein
sequence. However, they all suffer from different limitations. Details of these limitations
are discussed in Section 3.2.

In this chapter, we consider the problem of identifying LCRs in a protein sequence.
We propose new complexity measures that take the amino acid similarity and order, and
the sequence length, into account. We introduce a novel graph-based algorithm, called
GBA, for identifying LCRs. GBA constructs a graph for the sequence. In this graph,
vertices correspond to similar letter pairs and edges connect possible repeats. A path on
this graph corresponds to a set of intervals (i.e., subsequences). These intervals contain
seeds of cryptic, tandem, or interspersed repeats. GBA finds such small intervals as
LCR candidates by traversing this graph. It then extends them to find longer intervals
containing full repeats with low complexities. Extended intervals are then post-processed
to refine repeats to LCRs (i.e., eliminate false positives). Our experiments on real data
show that GBA has significantly higher recall compared to existing methods, including
0j.py, CARD, and SEG.

The rest of the chapter is organized as follows. Section 3.2 discusses the related work.

Section 3.3 introduces new complexity measures. Section 3.4 presents our algorithm, GBA.

Section 3.5 shows quality and performance results. Section 3.6 presents a brief discussion.

3.2 Related Work

LCRs exist not only in protein sequences but also in DNA sequences. A number
of algorithms have been developed to identify LCRs in a DNA sequence, such as
EULER [57], REPuter [41], and TRF [11]. Here, we focus on algorithms that identify
LCRs in a protein sequence.

Most algorithms that identify LCRs in a protein sequence use a sliding window,
including SEG [80], DSR [75], P-SIMPLE [2], and the method of Nandi et al. [54]. Some
algorithms are alignment based, such as XNU [19] and CAST [60]. Some algorithms are
encoding based, such as 0j.py [77]. Some algorithms are complexity based, such as
CARD [66]. An algorithm can belong to more than one of these categories, as SEG does.
We describe each algorithm in detail.

SEG is a two-pass algorithm. The first pass identifies approximate raw segments of
low complexity, determined by a sliding window length and two compositional complexity
([79] and [61]) cutoffs. Each segment consists of sliding windows with complexity less
than a second cutoff. The second pass reduces each raw segment to a single optimal
low-complexity segment, which is reported as an LCR. This is achieved by detecting the
leftmost and longest subsequences of these raw segments with minimal probability of
occurrence. DSR is similar to SEG. The main difference is that a reciprocal complexity
with a scoring matrix is used as the complexity measure.

P-SIMPLE estimates the overall amount of simple sequence in a protein molecule
using the algorithm developed by Tautz et al. [71] and extended by Hancock and
Armstrong [30]. It carries out three main types of analyses: (1) it estimates the amount
of simple sequence content in a protein molecule; (2) it determines which short sequence
motifs are significantly clustered; and (3) it finds the location of simple sequences with
significantly clustered motifs in the sequence. It first calculates a simplicity score awarded
to the central amino acid of each window in the sequence. It then calculates a relative
simplicity factor (RSF) for the sequence. Whether the RSF is greater than 1 determines the
existence of LCRs. However, the maximum length of a detectable repeat is only four.
Another drawback of P-SIMPLE is that it can only identify cryptic repeats.

Nandi et al. [54] use a linguistic complexity measure based on dimer counts. This
complexity measure is the observed fraction of the distinct dimers possible for the
sequence, adjusted by a skew representing the compositional bias of the sequence.
All subsequences of a fixed window size with complexity less than a cutoff are considered
LCRs. However, the parameters were tuned using only four sequences and the results
from SEG. In addition, the complexity measure cannot identify inexact repeats since it
ignores similarities between different letters.

Sliding-window-based methods such as SEG, DSR, P-SIMPLE, and [54] suffer from a
limitation caused by the sliding window: a window size needs to be specified. It is difficult
to specify a window size since repeats can be of any length. Repeats whose size is either
less or greater than the window size may be missed.

XNU [19] identifies LCRs by self-comparison. It scores local alignments with a PAM

matrix. It estimates the significance of these alignments according to the statistical










analysis of MSP scores. Best local alignments appear as off-diagonal segments. XNU

identifies those alignments very close to the main diagonal as LCRs. It requires a

repeat length to be specified first. Thus, it shares the same limitation as the above

sliding-window algorithms. In addition, identified regions are mainly limited to statistically

significant tandem repeats.

CAST [60] compares the sequence with an artificial database of twenty homopolymers

with a distinct amino acid each. It finds local alignments by using a simplified version of

the Smith-Waterman algorithm [68]. It reports significant hits with alignment score above

a threshold as LCRs. However, only repeats of a single residue type can be identified.

Oj.py [77] encodes protein sequences using regular expressions. This approach
is similar to that used by Allison et al. [3]. The encoding of a sequence maximizes the
compression score of the sequence. All patches that fulfill a certain score threshold are
reported. Oj.py cannot identify inexact repeats since it ignores similarities between
different letters.

CARD [66] targets only the regions of the sequence that are delimited by a pair of
identical subsequences. If these subsequences are positioned in tandem or overlap,
the region containing the two identical subsequences is marked as an LCR. Otherwise,
it iteratively computes the complexity of the repeat concatenated with each segment of
the same length as the repeat. This iteration continues until it either reaches the
right repeating subsequence and masks the subsequences as an LCR, or detects that the
computed complexity is greater than that of the left repeating subsequence. However,
LCRs are not necessarily indicated by a pair of identical repeats. Furthermore, the use of
a suffix tree to find repeats requires extensive memory.

SEG, DSR, and CARD use a complexity measure either based on or analogous to

Shannon Entropy. However, Shannon Entropy is not a good complexity measure for

protein sequences.










3.3 New Complexity Measures

A protein sequence is a sequence over a 20-letter alphabet (each letter corresponds to an
amino acid) $\Gamma = \{\gamma_1, \gamma_2, \ldots, \gamma_{20}\}$, where $\gamma_i$ is the
letter of type $i$ ($1 \le i \le 20$). Let $s$ be a sequence over $\Gamma$ of length $L$, with
each $\gamma_i$ having a fractional composition $p_i$. The vector
$[p_1, p_2, \ldots, p_{20}]$ is called the frequency vector of $s$. One of the most popularly
used complexity measures for sequences, Shannon entropy [65], is defined as

$$-\sum_{i=1}^{20} p_i \log p_i.$$

Although this formulation is effective for many applications, it has several problems when

applied to protein sequences:

(1) Shannon entropy does not consider the characteristics of amino acids. Therefore,

it is unable to distinguish similar letters from dissimilar ones. For instance, the Shannon

entropies of RQK and RGI are the same. However, letters R, Q, and K are all similar whereas

R, G, and I are all dissimilar, according to BLOSUM62.

(2) Shannon entropy only considers the number of each different letter in a sequence.

Thus, it is unable to distinguish two sequences composed of the same letters but with

different permutations. For example, the Shannon entropies of RGIRGI and RGIIRG are the

same.

(3) Shannon entropy is unable to distinguish a small number of copies of a pattern

from a large number of copies of the same pattern. For instance, the Shannon entropies of

RGI and RGIRGIRGI are the same.
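
The three problems can be checked numerically. The following is a minimal sketch, not part of the original implementation; the helper name `shannon_entropy` and the choice of a base-2 logarithm are assumptions made only for illustration. It computes the entropy of the letter composition for each pair of example sequences above.

```python
import math
from collections import Counter

def shannon_entropy(seq: str) -> float:
    """Shannon entropy (base 2, an assumed choice) of the letter composition of seq."""
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in Counter(seq).values())

# Example pairs taken from problems (1), (2), and (3).
pairs = [("RQK", "RGI"), ("RGIRGI", "RGIIRG"), ("RGI", "RGIRGIRGI")]
for a, b in pairs:
    # Each pair has the same letter proportions, so the two values coincide.
    print(a, b, shannon_entropy(a), shannon_entropy(b))
```

In every pair the two sequences share the same letter proportions, which is the only information the formula uses.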

We sampled 474 sequences that contain repeats from Swissprot. The repeats in
418 (i.e., 88%) of these sequences are inexact. In other words, for 88% of the sampled
sequences, Shannon entropy will have at least one of the first two problems above.

Next, we develop new complexity measures that overcome problems (1) to (3). As will

be seen later in Section 3.5, the new complexity measures do overcome the problems and










are better than Shannon Entropy. We first introduce a primary complexity measure and

then two more measures that can be applied to this primary one.

Primary complexity measure: To overcome the first problem, a scoring matrix is
incorporated. Given a scoring matrix $S = (s_{i,j})_{20 \times 20}$, e.g., BLOSUM62, we
compute a matrix $M = (m_{i,j})_{20 \times 20}$, where each $m_{i,j}$ is an approximation to
the probability that $\gamma_i$ substitutes for $\gamma_j$. Formally, each $m_{i,j}$ is defined
as $2^{s_{i,j}}$ divided by a normalizing denominator:

$$m_{i,j} \propto 2^{s_{i,j}}.$$



During the calculation of a BLOSUM matrix, the substitution score $s_{i,j}$ for letters
$\gamma_i$ and $\gamma_j$ is proportional to the base-2 logarithm of the ratio of two values,
where the first value is the observed substitution frequency of $\gamma_i$ and $\gamma_j$, and
the second value is the expected substitution frequency of $\gamma_i$ and $\gamma_j$. Thus,
$2^{s_{i,j}}$ is proportional to this ratio. However, our formulation of $m_{i,j}$ is only an
approximation to the probability that $\gamma_i$ substitutes for $\gamma_j$, since the
observed substitution frequency cannot be computed without knowing the expected
substitution frequencies. Instead, our formulation assumes that the expected substitution
frequency is the same for all pairs of letters. The denominator is used for normalization.

Three important properties of $m_{i,j}$ are:

(1) $0 \le m_{i,j} \le 1$


(3) If $\gamma_i$ is similar to $\gamma_j$, then $m_{i,j}$ is large. It is small otherwise.

Let $s$ be a protein sequence. A similarity vector $[p'_1, p'_2, \ldots, p'_{20}]$ is computed as

$$p'_i = \sum_{j=1}^{20} m_{i,j}\, p_j.$$

Each $p'_i$ denotes the probability that letter $\gamma_i$ substitutes for a randomly selected
letter in $s$ by a mutation. The more letters similar to $\gamma_i$ in $s$, the higher $p'_i$. This
is because of the third property of $m_{i,j}$. In other words, when there are more letters
similar to $\gamma_i$ in $s$, the chance for $\gamma_i$ to substitute for such a similar letter will
be higher. Since $\gamma_i$ and this letter are similar, the chance for $\gamma_i$ to substitute
for a letter in $s$ will be higher.

The similarity vector is then normalized to $[p''_1, p''_2, \ldots, p''_{20}]$ as follows:

$$p''_i = \frac{p'_i}{\sum_{j=1}^{20} p'_j}.$$

Similarly, the more letters similar to $\gamma_i$ in $s$, the higher $p''_i$. One can show that
$\sum_{i=1}^{20} p''_i = 1$. The primary complexity measure of $s$ is then defined as

$$-\sum_{i=1}^{20} p''_i \log p''_i.$$

$p''_i$ is similar to $p_i$ in the Shannon Entropy formula. Like $p_i$, $p''_i$ considers the
frequency of letter $\gamma_i$. More importantly, unlike $p_i$, $p''_i$ incorporates the
similarity of $\gamma_i$ to the letters in $s$.

Consider the specific case where $M$ is the identity matrix. In this case, $p'_i = p_i$. Thus,
$p''_i = p_i$. Therefore, the primary complexity would be the same as Shannon Entropy.

Hence, Shannon Entropy is a special case of our primary complexity measure where letter

similarities are disregarded.
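
The computation can be sketched as follows. This is an illustrative sketch under stated assumptions, not the original implementation: the three-letter alphabet and its scores are made-up stand-ins for BLOSUM62, the function names are hypothetical, the logarithm is taken base 2, and the normalization of `2**score` is assumed here to run over each row of the matrix.

```python
import math
from collections import Counter

def substitution_matrix(scores):
    """m[a][b] = 2**scores[a][b], normalized so each row sums to 1 (assumed)."""
    m = {}
    for a, row in scores.items():
        total = sum(2.0 ** v for v in row.values())
        m[a] = {b: (2.0 ** v) / total for b, v in row.items()}
    return m

def primary_complexity(seq, m):
    """Entropy of the normalized similarity vector p'' of seq under matrix m."""
    alphabet = list(m)
    counts = Counter(seq)
    p = {a: counts.get(a, 0) / len(seq) for a in alphabet}            # frequency vector
    p1 = {a: sum(m[a][b] * p[b] for b in alphabet) for a in alphabet}  # similarity vector p'
    z = sum(p1.values())
    p2 = {a: p1[a] / z for a in alphabet}                              # normalized vector p''
    return -sum(v * math.log2(v) for v in p2.values() if v > 0)

# Hypothetical scores: R and K are similar to each other, G is dissimilar to both.
toy_scores = {
    "R": {"R": 5, "K": 2, "G": -2},
    "K": {"R": 2, "K": 5, "G": -2},
    "G": {"R": -2, "K": -2, "G": 6},
}
toy_m = substitution_matrix(toy_scores)
identity = {a: {b: 1.0 if a == b else 0.0 for b in "RKG"} for a in "RKG"}

print(primary_complexity("RKRK", toy_m), primary_complexity("RGRG", toy_m))
print(primary_complexity("RRKG", identity))  # identity matrix: plain Shannon entropy
```

With the identity matrix the function reduces to the Shannon entropy of the letter composition, matching the special case described above. With the toy scores, the sequence built from two similar letters receives a lower value than the one built from two dissimilar letters.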

k-gram complexity measure: To overcome the second problem, we extend the primary
complexity measure to incorporate k-grams (a k-gram is an ordered string of k letters).
When computing the k-gram complexity measure, the whole sequence is considered as
a sequence of k-grams. Everything that applied to a single letter in the primary measure
now applies to a k-gram. Hence, the k-gram measure is computed by making two changes
to the primary measure.

(1) The alphabet $\Gamma$ is replaced with an alphabet that consists of all k-grams formed
from $\Gamma$.

(2) The similarity score between two k-grams is computed as the average of the

similarity scores between their corresponding letters.
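
A minimal sketch of the two changes is given below. It is illustrative only: the function names are hypothetical, the k-grams are assumed to be taken with overlap, the letter scores are made up, and the plain Shannon entropy of the k-gram composition stands in for the full primary measure.

```python
import math
from collections import Counter

def kgrams(seq, k):
    """All overlapping k-grams of seq (overlap is an assumption of this sketch)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def kgram_score(g1, g2, letter_score):
    """Similarity of two k-grams: average score of their corresponding letters."""
    return sum(letter_score(a, b) for a, b in zip(g1, g2)) / len(g1)

def kgram_entropy(seq, k):
    """Shannon entropy (base 2) of the k-gram composition of seq."""
    grams = kgrams(seq, k)
    n = len(grams)
    return -sum((c / n) * math.log2(c / n) for c in Counter(grams).values())

def toy_letter_score(a, b):
    # Hypothetical scores: 5 for identical letters, -2 otherwise.
    return 5 if a == b else -2

print(kgrams("RGIRGI", 2), kgrams("RGIIRG", 2))
print(kgram_entropy("RGIRGI", 2), kgram_entropy("RGIIRG", 2))
print(kgram_score("RK", "RG", toy_letter_score))
```

The two permutations from problem (2) have identical letter compositions but different 2-gram compositions, so a measure defined over 2-grams can tell them apart.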









Normalized complexity measure: To overcome the third problem, we normalize the

underlying complexity measure by dividing it by the length of the sequence. The more

copies of a repeat there are, the lower the complexity of these copies.
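
The effect of the normalization can be illustrated with a short sketch. Here the plain Shannon entropy (base 2) is used as the underlying measure purely as a stand-in, and the function names are hypothetical.

```python
import math
from collections import Counter

def shannon_entropy(seq):
    """Shannon entropy (base 2) of the letter composition of seq."""
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in Counter(seq).values())

def normalized_complexity(seq, measure=shannon_entropy):
    """Underlying complexity divided by the sequence length."""
    return measure(seq) / len(seq)

# One copy versus three copies of the same pattern, as in problem (3).
print(normalized_complexity("RGI"), normalized_complexity("RGIRGIRGI"))
```

The unnormalized values of the two sequences are equal, so after dividing by the length the three-copy sequence gets one third of the value of the single copy.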

3.4 The Graph-based Algorithm (GBA)

GBA applies to a single protein sequence. It consists of four main steps: (1)

constructing a graph, (2) finding the longest path in every connected subgraph, (3)

extending the longest-path intervals, and (4) post-processing extended intervals. Each step

is presented in more detail next.

3.4.1 Constructing a Graph

For every protein sequence, a directed, acyclic, and unweighted graph is constructed.

Graph construction includes vertex construction and edge construction. During the

constructions, some distance thresholds are used to simulate possible repeat patterns.

Let $s$ be a protein sequence over the alphabet $\Gamma$ with length $L$. We denote by $s(i)$
the letter at position $i$, for all $i$, $1 \le i \le L$. We say two letters $\gamma_i$ and
$\gamma_j \in \Gamma$ are similar, denoted by $\gamma_i \approx \gamma_j$, if their similarity
score is above a cutoff according to a scoring matrix. In GBA, we choose BLOSUM62 as the
scoring matrix. We use a value of 1 as the cutoff.

We start by placing a sliding window $w$ of length $t_1$ at the beginning of $s$. The
window size specifies the maximum difference between the positions of two consecutive
repeats. We then move it in single-letter steps along $s$ and construct vertices at each step.
Let $f(\gamma_i)$ denote the frequency of letter $\gamma_i$ in $w$. For every pair of positions
$i$ and $j$ in $w$ with $i < j$, we construct a vertex $(i, j)$ if all the following three
conditions are satisfied:





The second condition filters out the vertices constructed for similar letter pairs that are
in the same window by chance (i.e., false positives). $R$ and $NR$ are $20 \times 20$ matrices
that show statistical information on frequencies of pairs of letters in repeat and non-repeat
regions, respectively. We will discuss them later. In short, each vertex corresponds









to a letter that repeats at least once, possibly with some error. Figure 3-1(a) shows

the sequence GAYTSVAYTVPQAWTVW. For simplicity only the subgraph corresponding to

the subsequence from the 7th to the 16th letters is drawn. In this figure, the vertex

(7,13) is constructed because the letter A appears at positions 7 and 13. Vertex (8,14) is
constructed because letters Y and W are similar.
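
Vertex construction can be sketched as follows. This is a simplified illustration, not the original Java implementation: only the letter-similarity test is applied, the statistical filtering with the repeat and non-repeat matrices is omitted, and the `similar` predicate is a hypothetical stand-in for the BLOSUM62 cutoff test that treats identical letters and the pair Y/W as similar. Positions are 1-based, as in Figure 3-1.

```python
def build_vertices(seq, t1, similar):
    """Return vertices (i, j), i < j, for similar letters that fit in one window of length t1."""
    vertices = []
    n = len(seq)
    for i in range(1, n + 1):
        # j may be at most t1 - 1 positions to the right of i to share a window with i.
        for j in range(i + 1, min(n, i + t1 - 1) + 1):
            if similar(seq[i - 1], seq[j - 1]):
                vertices.append((i, j))
    return vertices

def similar(a, b):
    # Hypothetical similarity test: identical letters, or the pair Y/W.
    return a == b or {a, b} == {"Y", "W"}

vertices = build_vertices("GAYTSVAYTVPQAWTVW", 15, similar)
print((7, 13) in vertices, (8, 14) in vertices)
```

Enumerating pairs by their left position visits every pair that shares at least one window, so the sliding of the window does not need to be simulated explicitly.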

An edge is inserted between two vertices $(i, j)$ and $(k, m)$ if $s(i)s(k)$ and $s(j)s(m)$ are
repetitions of each other. This property is enforced by introducing distance thresholds $t_2$
and $t_3$. A directed edge from $(i, j)$ to $(k, m)$ is added if all the following three conditions
are satisfied:



(2) $i < k \le j < m$;

(3) $|j - i| < t_3$ and $|m - k| < t_3$, if $j = k$.

The first condition specifies the maximum number of insertions and deletions between
similar repeats. The second one guarantees that the positions of $s(i)s(k)$ and $s(j)s(m)$ do
not conflict with each other. The third condition specifies the maximum distance between
letters in cryptic repeats. Typically, we choose $t_1 = 15$, $t_2 = 3$, and $t_3 = 5$ as they give
the best recall (Section 3.5.1). For example, in Figure 3-1(a) the edge between vertices (7,13)
and (8,14) shows that AY and AW are repetitions of each other. Note that AY and AW are
inexact repeats, since Y and W have a high substitution score. A graph built from a sequence
is not necessarily connected, i.e., it can consist of more than one connected subgraph.

Our sliding window does not carry the disadvantages of the sliding windows in SEG,

DSR, and P-SIMPLE. This is because (1) short repeats can be detected by traversing

the graph inside the window and (2) long repeats can be found by following the edges of

overlapping windows. Theoretically, the size of our sliding window can be as large as the

sequence length. Our purpose of introducing such a window is to control the graph size,

hence the time and space complexity of GBA.










Figure 3-1. Four steps of GBA on a sequence with 3 approximate repetitions of AYTV
(the sequence GAYTSVAYTVPQAWTVW, positions 1 to 17).
Underlined letters indicate repeats. Rectangles denote regions identified as
candidate LCRs by GBA at different steps. (a) Graph constructed on the
sequence. For simplicity only the subgraph corresponding to the subsequence
from the 7th to the 16th letters is drawn. Bold edges indicate the longest path
in the subgraph. (b) Candidate intervals found by using the longest path
(Section 3.4.2). (c) Intervals after extending candidate intervals
(Section 3.4.3). (d) Final LCRs after post-processing (Section 3.4.4).


Figure 3-2. Contribution of the repeat region TPSTT to $R$. $\alpha$ denotes the forget rate.
(a) The contribution of each letter pair. (b) The overall contribution of TPSTT.


Next, we discuss the third condition of vertex construction. The values
$R(\gamma_i, \gamma_j)$ and $NR(\gamma_i, \gamma_j)$ represent average probabilities that
$\gamma_i$ and $\gamma_j$ appear together in a repeat region and a non-repeat region,
respectively. We compute these statistics from five real datasets, flybase, mgd, mim, sgd,
and viruses, extracted from Swissprot. These five datasets correspond to the species
Drosophila melanogaster (fruit fly), Mus musculus (mouse), Homo sapiens (human),
Saccharomyces cerevisiae (baker's yeast), and viruses and phages, respectively. There are
474 proteins in the datasets with annotated repeat regions. These repeats do not have any
known function. Note that we also tried using a smaller dataset, flybase, to calculate the
statistics for matrices $R$ and $NR$. This dataset contains 68 proteins. The two results were
very close (results not shown). This indicates that a small sample of proteins reflects the
repeat and non-repeat statistics of a much larger dataset.

We first initialize the entries of $R$ and $NR$ to zero. We examine each sequence from the
beginning to the end. When a repeat region is met, we consider every letter pair
$(\gamma_i, \gamma_j)$ in that region. We increase $R(\gamma_i, \gamma_j)$ and
$R(\gamma_j, \gamma_i)$ based on the distance between $\gamma_i$ and $\gamma_j$ in the
sequence as follows. Assume that the positions of $\gamma_i$ and $\gamma_j$ differ by $k$
letters. We increase the corresponding entries in $R$ by $\alpha^k$, where $\alpha$ is a forget
rate and $0 < \alpha < 1$. A forget rate is commonly used in various domains to capture the
correlation between two objects that are members of an ordered list, based on their spatial
distance in that list [27]. We use it to measure the chance of two letters being part of
the same LCR. Thus, as letters get farther apart in the sequence, their contribution drops
exponentially. Figure 3-2(a) shows the individual $\alpha^k$ values for each letter pair in the
repeat region TPSTT and Figure 3-2(b) shows the corresponding change of $R$. Note that both
$R$ and $NR$ are symmetric. To ease the readability of Figure 3-2(b), we only show the top
right part of $R$. Finally, when all sequences are processed, $R(\gamma_i, \gamma_j)$ is
normalized as





In the case of a non-repeat region, we adjust $NR(\gamma_i, \gamma_j)$ in the same way. We
tried different values of $\alpha$, including 0.95, 0.90, and 0.80. It turned out that for pairs
of similar letters most entries in $R$ are greater than the corresponding entries in $NR$, which
verifies that the use of such statistical information is meaningful. We got the best result
with $\alpha = 0.95$, i.e., the largest number of entries in $R$ greater than those in $NR$.
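
The accumulation step for one region can be sketched as follows. This is an illustrative sketch with hypothetical names; it leaves the final normalization out, and it assumes that a pair of identical letters updates its single matrix entry once.

```python
from collections import defaultdict
from itertools import combinations

def add_region(matrix, region, alpha):
    """Add alpha**k to the entries of every letter pair whose positions differ by k."""
    for (i, a), (j, b) in combinations(enumerate(region), 2):
        weight = alpha ** (j - i)
        matrix[(a, b)] += weight
        if a != b:
            # Keep the matrix symmetric for pairs of different letters.
            matrix[(b, a)] += weight

R = defaultdict(float)
add_region(R, "TPSTT", alpha=0.5)   # alpha = 0.5 only keeps the arithmetic easy to follow
print(R[("T", "P")], R[("P", "S")], R[("T", "T")])
```

For the region TPSTT, the pair (T, P) occurs at distances 1, 2, and 3, so its entry grows by the sum of the first three powers of the forget rate. The same function applied to non-repeat regions would fill the second matrix.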










3.4.2 Finding the Longest Path

The vertices connected by edges represent short repeats that can be combined to
make longer repeats. For example, in Figure 3-1(a) the edge between vertices (7,13)
and (8,14) shows that AY and AW are potential repeats. We find the longest path in every
connected subgraph to get the longest repeating patterns. Repeats of the sequence in
Figure 3-1(a) are AYTSV, AYTV, and AWTV. The path represented by bold edges is the longest
path. It captures the repeat patterns AYTV and AWTV perfectly. Note that, in Figure 3-1(a),
for simplicity only the subgraph corresponding to the subsequence from the 7th to the 16th
letters is drawn. Figure 3-1(b) shows the potential LCRs for the whole sequence after
finding the longest path.

There are many existing algorithms that find the shortest path in a graph. They
can be easily modified to find the longest path in a graph. Our implementation of finding

the longest path in a graph is based on Dijkstra's Algorithm [22]. The complexity of our

implementation is linear in the size of the graph.
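
One way to obtain a longest path in time linear in the graph size is sketched below. It is not the Dijkstra-based implementation mentioned above; it swaps in dynamic programming over a topological order, which is valid only under the assumption that the graph has no cycles. The function name and the example vertices and edges are hypothetical.

```python
from collections import defaultdict, deque

def longest_path(vertices, edges):
    """Longest path (as a list of vertices) in a directed acyclic graph."""
    if not vertices:
        return []
    succ = defaultdict(list)
    indegree = {v: 0 for v in vertices}
    for u, v in edges:
        succ[u].append(v)
        indegree[v] += 1
    best = {v: 1 for v in vertices}      # vertices on the longest path ending at v
    prev = {v: None for v in vertices}   # predecessor of v on that path
    queue = deque(v for v in vertices if indegree[v] == 0)
    while queue:                         # Kahn's topological traversal
        u = queue.popleft()
        for v in succ[u]:
            if best[u] + 1 > best[v]:
                best[v], prev[v] = best[u] + 1, u
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    end = max(vertices, key=lambda v: best[v])
    path = []
    while end is not None:               # walk the predecessors back to the start
        path.append(end)
        end = prev[end]
    return path[::-1]

example_vertices = [(7, 13), (8, 14), (9, 15), (10, 16), (8, 17)]
example_edges = [((7, 13), (8, 14)), ((8, 14), (9, 15)),
                 ((9, 15), (10, 16)), ((7, 13), (8, 17))]
print(longest_path(example_vertices, example_edges))
```

Each vertex and edge is handled once, so the running time is proportional to the number of vertices plus the number of edges.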

3.4.3 Extending Longest-path Intervals

Paths found in Section 3.4.2 correspond to a set of intervals (i.e., subsequences) on

the input sequence. We discard short intervals. In our implementation we set this length

cutoff as three. Remaining ones are considered as repeat seeds. They are extended in left

and right directions to find longer intervals containing full repeats with low complexities.

We stop extending an interval when one of the following two conditions is satisfied:

(1) It overlaps with another extended interval, or the end of the sequence is reached.

(2) The complexity starts increasing after extending it by $t_1$ letters ($t_1$ is the upper
bound for the repeat length, as discussed in Section 3.4.1).

Once an interval is extended, we find its largest subinterval for which the complexity

is less than a given cutoff value. In order to find a reasonable cutoff value, we randomly

sampled sequences from Swissprot that contain repeat regions. We increase our sample set

size by one each time. Let $\mu_t$ and $\sigma_t$ denote the mean and the standard deviation
of the complexities of the repeats, respectively, after sampling $t$ sequences. We repeat
sampling until the error rate of the estimated mean is at most C (C is a given threshold), i.e.,

when the following condition is satisfied [44]:





We set C as 0.05. We use the resulting mean $\mu_t$ as our cutoff. Figure 3-1(c) shows the potential

LCRs after the extension. We can see that the letter S at position five is included into the

potential LCRs. This example illustrates how GBA can detect repeats with indels.

Note that all complexities in this subsection are calculated using the 2-gram complexity
measure.

3.4.4 Post-processing Extended Intervals

Extended intervals may contain non-LCRs. This is because although the complexity

of an extended interval is low, the actual LCR may be a subsequence of that interval. We

develop a post-processing strategy to filter out the intervals that have higher complexities
than the majority of the extended intervals.

We randomly sampled 128 sequences from Swissprot among the sequences that

contain LCRs. We then created 16 groups according to their lengths as follows. The

first group contains the eight shortest sequences. The second group contains the

next eight shortest sequences and so on. We calculate the repeat percentage of each

sampled sequence as the percentage of the letters in that sequence that are annotated as

repeat. For each group, we computed the mean and the standard deviation of the repeat

percentages in that group.

Given an input sequence s, we find the group that s belongs to in our sample

according to its length. We compute a cutoff value, c, from this group as the sum of

the mean and the standard deviation of the repeat percentages for that group. We then

compute the complexities of the extended intervals of s. Assume that there are totally

k extended intervals. We rank the extended intervals in ascending order of complexity.










We mark the intervals whose ranks are greater than $c \cdot k$ as potential outliers. This is
justified as follows. The majority of the sequences in a group have repeat percentages
within a small neighborhood of the mean of the group. This neighborhood is defined
by the standard deviation. Thus, we consider intervals whose complexities rank above
the sum as potential outliers. We compare each potential outlier with its left adjacent
interval, using the Smith-Waterman algorithm [68]. If the similarity of the aligned parts of
the two intervals is high enough, e.g., more than 4 letters, we keep the aligned part
of the potential outlier interval as an LCR. Otherwise, we repeat the comparison on its
right adjacent interval. If no satisfactory aligned part exists in either comparison, the
interval is discarded. The implementation of the Smith-Waterman algorithm is borrowed
from JAligner (http://jaligner.sourceforge.net). In Figure 3-1(d), the letter W at
position 17 is removed. This is because the complexity of AWTVW is not low enough and the
Smith-Waterman alignment does not include this letter.
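
The ranking step that marks potential outliers can be sketched as follows. The sketch is illustrative only: the function names and the numbers are hypothetical, repeat percentages are expressed as fractions between 0 and 1, and the population standard deviation is assumed because the text does not say which estimator is used. The subsequent Smith-Waterman comparison is not shown.

```python
import statistics

def group_cutoff(repeat_fractions):
    """Cutoff c for a length group: mean plus standard deviation of its repeat fractions."""
    return statistics.mean(repeat_fractions) + statistics.pstdev(repeat_fractions)

def potential_outliers(complexities, c):
    """Indices of intervals whose rank (ascending complexity, starting at 1) exceeds c * k."""
    k = len(complexities)
    order = sorted(range(k), key=lambda idx: complexities[idx])
    return [idx for rank, idx in enumerate(order, start=1) if rank > c * k]

# Four hypothetical extended intervals and a hypothetical cutoff c = 0.5.
print(potential_outliers([0.2, 0.9, 0.4, 0.7], 0.5))
print(group_cutoff([0.2, 0.4]))
```

With four intervals and a cutoff of 0.5, the two intervals with the highest complexities are flagged and would then be compared with their adjacent intervals.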

Note, all complexities in this sub-section are calculated by using the normalized

2-gram complexity measure.

3.5 Experimental Evaluation

We used six datasets and their repeat information from Swissprot [6] and Uniprot as
our test data. These annotated repeats do not have any known function. Therefore, we
used them as true LCRs. The first five datasets were constructed by extracting sequences
with repeats from flybase, mgd, mim, sgd, and viruses
(ftp://ftp.ebi.ac.uk/pub/databases/swissprot/special_selections), respectively. They
contain 68, 133, 166, 45, and 62 sequences, respectively (474 in total). 418 of these
sequences (i.e., 88%) contain inexact repeats. The last dataset, denoted by mis., was
constructed similarly from Uniprot
(ftp://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase/complete).
It contains 1137 sequences from various organisms. These 1137 sequences are all the
sequences in Uniprot that have annotated repeats without any known function, excluding
those contained in the first four datasets. These repeats are inexact. Thus, 96% of the
sequences from the six datasets contain inexact repeats. Sequences from the six datasets
are from many different structural and functional groups. The longest one has 5412 letters,
the shortest one has 50 letters, and the average sequence length is 606. The datasets used
in our experiments are available at www.cise.ufl.edu/~xli/research/lcr.

We downloaded the Oj.py, CARD, and SEG programs. They are coded in C, C++, and C,
respectively. We implemented GBA and SE-GBA in Java. SE-GBA is the same as GBA
except that in SE-GBA we use Shannon Entropy to measure complexities. For GBA and
SE-GBA, we tried different input parameters with $10 \le t_1 \le 50$, $3 \le t_2 \le 5$, and
$5 \le t_3 \le 7$. The recall of GBA was the highest for $t_1 = 15$, $t_2 = 3$, and $t_3 = 5$. Therefore, we use

these parameters in all of our experiments. We ran all the competing programs using their

default parameters. All experiments were performed on a Linux machine.

Section 3.5.1 evaluates the qualities of the proposed complexity measures and GBA.
Section 3.5.2 compares the performances of GBA, Oj.py, CARD, and SEG, including time

and space complexities.

3.5.1 Quality Comparison Results

Evaluation of the new complexity measures: We compare our new complexity

measures to Shannon Entropy independent of the underlying LCR-identification

algorithm. The goal is to concentrate on the complexity measures alone. We calculated

the complexities of the annotated repeats and the non-repeats in each sequence from the

first five datasets using Shannon Entropy. Thus each sequence produced two complexity

values. We normalized these complexities to [0, 1] interval by dividing them by the largest

observed complexity. We computed their Cumulative Distribution Functions (CDFs) for

repeats and non-repeats separately. For each observed complexity value in [0, 1] interval

we computed the values of the CDFs for repeats and non-repeats. These two values denote

the ratios of the repeats and the non-repeats that have complexities no greater than that

observed complexity. In other words, for a given complexity cutoff, these values denote

the ratios of truly identified repeats and falsely identified repeats (based on Swissprot
annotations), respectively. The larger the difference between these two values, the better

the complexity measure. We repeated the same process using our primary complexity

measure and our k-gram complexity function with k = 2.
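
The comparison described above can be sketched as follows. This is an illustrative sketch: the function names are hypothetical and the complexity values are made up, standing in for the per-sequence repeat and non-repeat complexities.

```python
def normalize(values):
    """Scale complexities to the [0, 1] interval by dividing by the largest observed value."""
    largest = max(values)
    return [v / largest for v in values]

def cdf_at(values, cutoff):
    """Empirical CDF: fraction of values that are no greater than the cutoff."""
    return sum(v <= cutoff for v in values) / len(values)

def separation(repeat_cx, nonrepeat_cx, cutoff):
    """Ratio of repeats under the cutoff minus ratio of non-repeats under the cutoff."""
    return cdf_at(repeat_cx, cutoff) - cdf_at(nonrepeat_cx, cutoff)

# Hypothetical normalized complexities of repeats and non-repeats.
repeat_cx = [0.2, 0.4, 0.5, 0.9]
nonrepeat_cx = [0.6, 0.8, 0.9, 1.0]
print(cdf_at(repeat_cx, 0.5), cdf_at(nonrepeat_cx, 0.5), separation(repeat_cx, nonrepeat_cx, 0.5))
```

Evaluating the two CDF values at every observed complexity gives one point per cutoff, and a measure is better when the first value is large while the second stays small.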

Figure 3-3 shows the resulting plots for Shannon Entropy and the 2-gram complexity
measure. When the ratios from repeats are lower than 0.84, there is not much difference
between Shannon Entropy and the 2-gram complexity measure, since the two curves
representing the two complexity measures twist around each other. However, when the
ratios from repeats are no less than 0.84, there is a clear difference between the two
complexity measures. The Shannon Entropy curve is always above the 2-gram complexity
measure curve. This means that our complexity measure distinguishes repeats from
non-repeats better. In particular, as shown by the bold vertical line in the figure, when 92%
of the repeats are identified, the 2-gram complexity measure produces 30% fewer false
positives than Shannon Entropy. We do not show the result of the primary complexity
measure in order to maintain the readability of the plots. The primary complexity measure
curve stays between that of Shannon Entropy and that of the 2-gram complexity measure,
very close to that of Shannon Entropy.

Evaluation of GBA: We compare the qualities of GBA, SE-GBA, Oj.py, CARD, and

SEG. The differences between the quality of SE-GBA and those of competing tools,

Oj.py, CARD, and SEG, show the improvement obtained by our graph-based repeat

detection method as they all use Shannon Entropy as the complexity measure. The quality

difference between GBA and SE-GBA indicates the improvement obtained due to our new

complexity formula on top of our repeat detection method.

Let TP (True Positive) be the number of letters that are correctly masked as LCRs
by the underlying LCR-identification algorithm. Similarly, let FP (False Positive) and
FN (False Negative) be the number of letters that are incorrectly computed as LCRs
and non-LCRs, respectively. We compute three measures, recall, precision, and Jaccard
coefficient, as follows:












Figure 3-3. Comparison between Shannon Entropy and our 2-gram complexity measure.
The x-axis represents ratios from repeats. The y-axis represents ratios from non-repeats.


$$\text{recall} = \frac{TP}{TP + FN}, \qquad \text{precision} = \frac{TP}{TP + FP}, \qquad \text{Jaccard coefficient} = \frac{TP}{TP + FP + FN}.$$
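
The three measures can be computed per sequence as in the following sketch. The function name and the example positions are hypothetical; masked and annotated letters are represented as sets of 1-based positions, and a measure with an empty denominator is reported as 0 by assumption.

```python
def lcr_metrics(predicted, annotated):
    """Recall, precision, and Jaccard coefficient over letter positions."""
    tp = len(predicted & annotated)   # letters correctly masked as LCR
    fp = len(predicted - annotated)   # letters masked but not annotated
    fn = len(annotated - predicted)   # annotated letters that were missed
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    jaccard = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return recall, precision, jaccard

predicted = set(range(7, 17))   # hypothetical tool output: positions 7 to 16
annotated = set(range(5, 13))   # hypothetical annotation: positions 5 to 12
print(lcr_metrics(predicted, annotated))
```

In this example 6 positions are shared, 4 are masked only by the tool, and 2 are annotated only, so the Jaccard coefficient penalizes both kinds of error at once.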


For presentation clarity, we do not report results from flybase.

Figure 3-4 compares the average recalls for the last five datasets: mgd, mim, sgd,

viruses, and mis. On average, GBA has the highest recall. SE-GBA has the second highest

one. The recall of SE-GBA is 8% higher than that of SEG, 18% higher than that of
Oj.py, and 22% higher than that of CARD. The recall of GBA is 2%, 10%, 20%, and 24%
higher than that of SE-GBA, SEG, Oj.py, and CARD, respectively. In other words, the
recall is improved by at least 8% by using the repeat detection method introduced
in Section 3.4, and by at least 2% by using the new complexity measures introduced in
Section 3.3 instead of Shannon Entropy. Small recall values indicate that the existing
methods cannot formulate inexact repeats well. This is justified by our discussion in

Section 3.3.










Figure 3-4. Average recalls of GBA, SE-GBA, Oj.py, CARD, and SEG on four datasets.
(x-axis: mgd, mim, sgd, viruses, mis., and the average; y-axis: recall.)


Figure 3-5 compares the average precisions for the last four datasets. GBA has

the second highest precision. SE-GBA has higher precision than SEG. This is because

different repeat detection methods are used in the two algorithms. The precision of

GBA is higher than that of SE-GBA. This is because different complexity measures

are used in the two algorithms. Oj.py has the highest precision. CARD has the second

highest precision on some of the datasets. For Oj.py, this is because only exact repeats are

identified. For CARD, this is because only LCRs delimited by a pair of identical repeats
can be identified. Although both patterns have a high chance of being true repeats, they

constitute a small percentage of possible repeats. This is because repeats are usually

inexact, which is justified by the low recalls of Oj.py and CARD. Thus, Oj.py and CARD

achieve high precisions at the expense of low recalls (Figure 3-4). Small precision values

indicate that many false positives are produced. This is mainly because loose cutoff values

are needed to obtain a reasonable recall. The precision and recall values for the mgd and

mim datasets are much better than that for the viruses dataset. This indicates that the















Figure 3-5. Average precisions of GBA, SE-GBA, Oj.py, CARD, and SEG on four
datasets. (x-axis: mgd, mim, sgd, viruses, mis., and the average; y-axis: precision.)


repeats in viruses show much more variation than mgd and mim. Indeed the mutation rate

in viruses is much higher [24, 34].

To understand the relationship between precision and recall of GBA, Oj.py, and

CARD, we plotted precision versus recall as follows (Figure 3-6). We first created a

(precision, recall) tuple for each sequence in the first five datasets by calculating the

precision and the recall of GBA for that sequence. We then divided all these 474 tuples

into 4 groups of the same size (except the last one which contains fewer tuples). Tuples

in the first group have the smallest precisions. Tuples in the second group have the next

smallest precisions and so on. Finally, we calculated the means of the precisions and the

recalls for each group and got one representative (precision, recall) tuple for each group.

We repeated the same process for Oj.py and CARD. Figure 3-6 shows that on the average,

GBA has a higher recall when the three tools have the same precisions.

Unlike precision and recall, Jaccard coefficient considers true positives, false positives

and false negatives. Figure 3-7 shows that GBA has the highest Jaccard coefficient for











Figure 3-6. Relationship between precision and recall of GBA, Oj.py, and CARD.
The x-axis represents precision.


all the datasets. The second best tool is different for different datasets. On the average

SE-GBA has the second highest Jaccard Coefficient. The difference between GBA and

SE-GBA shows the quality improvement achieved due to our new complexity measure

alone. The differences between SE-GBA and the competing methods CARD and SEG that

use the same complexity measure (i.e., Shannon Entropy) show the quality improvement

due to our graph-based strategy alone.

All tools have relatively low recalls and precisions. This implies the abundance of

inexact repeats in LCRs. Figure 1 of Appendix shows an example sequence from Swissprot

for which GBA identifies almost the entire LCR while Oj.py, CARD, SEG fail. Figure 2 of

Appendix shows the LCR of another example sequence from Swissprot for which all the

tools, GBA, Oj.py, CARD, SEG have low recalls and precisions.

3.5.2 Performance Comparison Results

We now analyze the time and space complexity of GBA. Suppose $L$ is the length of
a sequence $s$. Vertex construction takes $O(L t_1)$ time since we compare every letter in $s$
with the other letters in the same window. During edge construction, each vertex is compared










[Plot: average Jaccard coefficient of each tool on each dataset; legend: SE-GBA, GBA, 0j.py, CARD, SEG.]

Figure 3-7. Average Jaccard coefficients of GBA, SE-GBA, 0j.py, CARD, and SEG on
four datasets.


with a group of vertex sets. This group may have a maximum of $\omega$ sets and each set may

have a maximum of $\omega$ vertices. Hence the edge construction takes $O(L\omega^2)$ time. The time

complexity of finding the longest path is linear in the size of the graph, which

is $O(L\omega + L\omega^2)$. It takes $O(L)$ time to extend the longest-path intervals. The Smith-Waterman

algorithm in the post-processing step takes $O(L^2)$ time in the worst case. Hence, GBA

takes $O(L\omega^2 + L^2)$ time in the worst case. The worst case happens when a sequence consists

of letters of a single type.

The average times per sequence of the GBA, 0j.py, CARD, and SEG algorithms on

one of our datasets, mim, were 79.65 seconds, 0.5 milliseconds, 26 seconds, and 0.75

milliseconds respectively. Both CARD and GBA are slower than 0j.py and SEG, but

their running times are still acceptable. This is because manual verification and wet-lab

experimentation on the computational results usually take days. GBA is thus desirable

since it produces much higher quality results (Figure 3-4). The longest sequence (5412

letters), FUTSC_DROME, took GBA, 0j.py, CARD, and SEG 829 seconds, 96 milliseconds,

1155 seconds, and 16 milliseconds respectively. As the sequence length increases from 741

to 5412, the running times of GBA, 0j.py, CARD, and SEG increase by a factor of 10, 192,

44, and 21 respectively. This suggests that the running time of GBA grows more slowly with

sequence length than that of these competitors.

As for space complexity, similar to the analysis of time complexity, GBA requires

$O(L\omega^2 + L^2)$ memory in the worst case. $O(L\omega^2)$ is due to the graph size and $O(L^2)$ is due

to the Smith-Waterman algorithm in Section 3.4.4. One of our datasets, flybase, took GBA,

0j.py, CARD, and SEG 140 MB, 17 MB, 785 MB and 1000 kB of memory in the worst case

respectively.

3.6 Conclusion

We considered the problem of identifying low-complexity regions (LCRs) in a protein

sequence. We defined new complexity measures, which incorporate the concept of a

scoring matrix, the letter order, and the sequence length. This formulation overcomes

limitations of existing complexity measures such as Shannon entropy. We developed a

novel algorithm, named GBA, for identifying LCRs in a protein sequence. We used a

graph-based model. We incorporated the statistical distribution of amino acids in repeat

and non-repeat regions of some known protein sequences into our model. Unlike existing

algorithms, the graph model of GBA is very flexible. It can find repeating patterns even

when the patterns contain mismatches, insertions, and deletions. GBA does not have the

disadvantages of other sliding window-based algorithms: the graph construction guarantees

that neither short nor long repeats will be missed. Furthermore, the successive extending

and post-processing steps reduce the number of false negatives and false positives. In

our experiments on real data, GBA obtained significantly higher recall compared to

existing algorithms, including 0j.py, CARD, and SEG. The Jaccard coefficient of GBA

was also higher than that of 0j.py, CARD, and SEG. We believe that GBA will help

biologists identify true LCRs that cannot be identified with existing tools, leading to a

better understanding of the true nature of LCRs. This is essential in developing a better

formulation for LCRs in the future.









CHAPTER 4
QUALITY-BASED SIMILARITY SEARCH FOR BIOLOGICAL SEQUENCE
DATABASES

In this chapter, we consider the problem of finding similar sequences when the

locations of the LCRs are not known precisely. We develop a formulation to measure the

quality of each letter. These quality values are used in the similarity searches. We present

our experimental results.

4.1 Motivation and Problem Definition

Sequence similarity search algorithms play a very important role in bioinformatics.

They can be used to identify candidates of related sequences that form a family or to find

candidates of a related gene in an organism. These candidates then go through a costly

manual inspection by biologists. Therefore, it is essential that sequence search algorithms

return as few false positives as possible.

Low-Complexity Regions (LCRs) are repeating patterns of biased composition [45].

Figure 4-1 shows two sequences that contain LCRs indicated by the underlined letters.

Both of the LCRs in this figure contain a tandem repeat AAT repeated four times.

Statistical analyses have shown that approximately one-quarter of the amino acids are

in LCRs and more than one-half of proteins have at least one LCR [78].

Traditional sequence similarity search methods produce many false positives due to

LCRs in biological sequences. We use BLAST [4], one of the popular similarity search

algorithms, as an example to illustrate the problem. BLAST uses the maximal segment

pair score (MSP) to find the optimal alignment. The theory of MSP finds statistically

significant high-scoring alignments under the assumption that letters follow a random

distribution. However, biological sequences are very different from random sequences

since they contain many LCRs. Statistically significant high-scoring matches due to

LCRs usually do not indicate a genuine relationship between sequences, and hence are

false positives. For example, traditional algorithms produce a high alignment score for

the two sequences in Figure 4-1 even though they are not biologically related. This is









Position #: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
Sequence1:  G A A T A A T A A T  A  A  T  A  W  S  V  W  S  P  T  V  L  L  S

Position #: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Sequence2:  P P Q K M K A A T A  A  T  A  A  T  A  A  T  R  D

Figure 4-1. Two sequences that have the same LCR indicated by the underlined letters.
Here, the LCRs are composed of a repeating pattern of AAT.


because the two sequences contain the same LCR. BLAST returns over 1,000 statistically

significant sequences for Thermus thermophilus seryl tRNA synthetase, which has only

31 true positive homologs [75]. Such high false positive rates cause enormous amounts of

wasted resources and time spent on refuting them.

Existing methods such as BLAST follow one of the two extreme strategies; they either

treat LCRs and non-LCRs the same or simply remove all LCRs from the sequences (see footnote 1).

The former strategy produces many false positives. The latter strategy requires the

use of an LCR-identification method. A number of computational methods have been

developed to identify LCRs such as SEG [80]. Although filtering identified LCRs improves

the quality of searches, it is not desirable since no LCR-identification method is 100%

accurate. Our experiments on real data showed that the average precision and recall

of some well-known methods such as SEG and CARD [66] were as low as 0.2 and 0.3

respectively [45]. Hence, both strategies are problematic. Note that BLAST also has an

option where it masks a region specified by a user. This, however, requires the user to

have perfect knowledge of the location of such regions. Thus, it is not practical as these

regions are not known for many sequences.

Contributions: This chapter considers the problem of finding similar sequences when

the locations of the LCRs are not known precisely. The goal is to develop algorithms

that reduce the number of false positives significantly without losing true positives.




1 Depending on which sequence is masked, the latter one is provided as a "filter low
complexity" or "filter lookup table" option.










There are three main contributions of this chapter. 1) We describe a proper way of using

LCRs. We develop a formulation to measure the quality of each letter in a sequence.

The quality value of a letter is the probability for that letter to be in a non-LCR. We

show that the quality values can be used in two fundamental approaches to the sequence

search problem. The former finds the optimal alignment of two sequences using dynamic

programming. The latter computes a suboptimal alignment using a hash table. 2) We

develop a randomized memory-resident hash table that indexes k-grams (subsequences

of length k) probabilistically. As a result, the main memory usage and the CPU cost

are greatly reduced. 3) We show that this hash table can be used to reconstruct query

sequences with negligible information loss. This eliminates the need to store these

sequences. Our experiments on real data show that our algorithms reduce the number

of false positives significantly. In addition, their running times are better than existing

strategies. Note that we use local alignment to define similarity. Extending this chapter to

global alignment is trivial. The methods developed in this chapter are orthogonal to existing

search tools. They can be used together with the existing tools to improve their accuracy

and performance.

Chapter Organization: Section 4.2 discusses the related work. Section 4.3 illustrates

how to assign a quality value to each letter in a sequence. Section 4.4 introduces

our similarity search algorithms. Section 4.5 shows quality and performance results.

Section 4.6 concludes the chapter.

4.2 Background

Sequence similarity search: A number of sequence similarity search algorithms have

been developed [4, 21, 41, 47, 68, 69]. We consider these strategies under two categories.

The first one, called DPS, finds the optimal alignment using dynamic programming.

The second one, called HTS, finds a suboptimal alignment by employing a hash table.

Smith-Waterman algorithm [68] and BLAST are examples of these two strategies

respectively.










Dynamic programming solution (DPS): DPS computes a two-dimensional matrix M.

Each entry corresponds to the best local alignment score between two subsequences being

compared.


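$$M_{i,0} = g_o + i \cdot g_e, \quad M_{0,j} = g_o + j \cdot g_e, \quad \forall i, j,\ 0 < i \le n,\ 0 < j \le m,$$

and $\forall i, j > 0$:

$$M_{i,j} = \max \left\{ \begin{array}{l} \max_{x \ge 1} (M_{i-x,j} + w_x), \\ \max_{y \ge 1} (M_{i,j-y} + w_y), \\ M_{i-1,j-1} + S(a_i, b_j), \\ 0. \end{array} \right.$$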



DPS constructs the best alignment by tracing back from the maximum score in M until

reaching a zero.

Hash table solution (HTS): This strategy sacrifices sensitivity (i.e., recall) to improve space

and time. For any two k-grams generated from the letter alphabet, if their alignment

score is greater than a threshold, they are called neighbors of each other. Neighbors

of all k-grams are first generated. HTS uses matching k-grams as seeds to find longer

alignments. It runs in three steps:

(1) Building hash table. Each k-gram in each query sequence is inserted into the hash

table.

(2) Search phase. The database is scanned to find neighbors of the k-grams from the

query set with the help of the hash table. Each matching region is commonly referred to as a seed.

(3) Alignment phase. Seeds are extended to find local alignments that satisfy a

user-defined score cutoff.
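The first two steps above can be sketched in Python as follows. This is a simplified illustration that reports exact k-gram matches only; the generation of neighbors through a score threshold and the seed extension step are omitted. The function names are hypothetical and do not correspond to BLAST's implementation.

```python
from collections import defaultdict

def build_kgram_table(queries, k):
    """Step 1: map every k-gram of every query sequence to its
    (query id, offset) occurrences."""
    table = defaultdict(list)
    for qid, seq in enumerate(queries):
        for i in range(len(seq) - k + 1):
            table[seq[i:i + k]].append((qid, i))
    return table

def find_seeds(table, db_seq, k):
    """Step 2: scan a database sequence and report exact k-gram hits as
    (query id, query offset, database offset) triples."""
    seeds = []
    for j in range(len(db_seq) - k + 1):
        # .get avoids creating empty entries in the defaultdict.
        for qid, i in table.get(db_seq[j:j + k], ()):
            seeds.append((qid, i, j))
    return seeds
```

Each reported triple identifies a seed that the alignment phase would then try to extend.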

LCR-identification: Several algorithms have been developed to identify LCRs in

sequences, such as SEG [80], CARD [66], and GBA [45]. SEG first finds contigs with

Shannon Entropy complexity [65] less than a cutoff. It then detects the leftmost and

longest subsequences with minimal probability of occurrence as LCRs. BLAST uses SEG










as the underlying LCR detection algorithm for protein sequences. CARD identifies LCRs

based on the Shannon Entropy complexity analysis of subsequences delimited by a pair

of identical repeats. GBA develops a k-gram complexity that takes letter similarity, letter

order, and sequence length into account. It uses a graph-based method to find potential

seeds. It then extends these seeds into longer intervals based on a novel complexity

measure. Finally, these longer intervals are postprocessed to find LCRs.

Quality values: Some algorithms, such as CAP3 [33], have used quality values. The

meaning of CAP3's term "quality" differs from that of this chapter. In CAP3, each letter

is given a quality value, based on the probability that that letter is correctly sequenced.

Thus, it does not denote whether a letter is in LCR or not. CAP3's use of quality values,

however, is similar to traditional filters for LCRs. It clips the ends of reads (fragments

of sequences) that have quality lower than a cutoff. At the time of assembly, it uses two

types of scores, one for letters with quality greater than a given cutoff and another for

the rest of the letters. This indicates that the algorithms developed in this chapter can

be employed for sequence assemblers like CAP3, when the quality values are not 100%

correct. This chapter, however, focuses on sequence comparison in the presence of LCRs.

4.3 Quality Value Assignment (QVA)

A letter in a sequence is correctly masked by the underlying LCR-identification

algorithm if the letter is annotated in an LCR according to a gold standard (e.g.,

Swissprot (www.expasy.org/sprot)). Let TP (True Positive) be the number of letters

that are correctly masked as LCRs by the tool. Similarly, let FP (False Positive) and

FN (False Negative) be the number of letters that are incorrectly computed as LCRs

and non-LCRs respectively. Figure 4-2 shows a sequence, its LCR, and its masked version by an

LCR-identification algorithm. Masked letters are replaced by x. Here, TP = 8, FP = 1,

and FN = 4.

In this section, we introduce how to assign a quality value to a letter of a sequence

when the underlying LCR-identification method is inaccurate. During the assignment, we









Position #: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
Sequence:   G A A T A A P A A T  A  A  T  A  W  S  V  W  S  P  T  V  L  L  S
Masked:     G x x T x x P x x T  x  x  T  x  W  S  V  W  S  P  T  V  L  L  S

Figure 4-2. A sequence with an underlined LCR and a masked version of it. Masked
letters are replaced by x.


use one or more LCR-identification algorithms. Sections 4.3.1 and 4.3.2 discuss the cases

when single and multiple LCR-identification tools are employed respectively.

4.3.1 QVA Based on One LCR-identification Tool

For each sequence of length L, two measures, precision and noise, are computed as:

precision = TP / (TP + FP),

noise = FP / (L - TP).

Here, precision is the probability that a letter randomly chosen among the masked

letters is a true LCR. We assign (1-precision) to masked letters of the sequence as their

quality values. Noise is the error rate in unmasked letters. (1-noise) then tells the

probability for an unmasked letter to belong to a non-LCR. Hence, we assign (1-noise)

to all unmasked letters as their quality values.
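The assignment can be illustrated with a short Python sketch based on the masked sequence of Figure 4-2. The true LCR is assumed here to span positions 2 to 13, which reproduces the counts TP = 8, FP = 1, and FN = 4 stated earlier. The function names are hypothetical, and the noise value is passed in as a parameter rather than derived.

```python
def confusion_counts(masked, lcr_start, lcr_end):
    """Count TP, FP, and FN for a masked string against one true LCR
    interval given as 1-based inclusive positions."""
    tp = fp = fn = 0
    for pos, ch in enumerate(masked, start=1):
        in_lcr = lcr_start <= pos <= lcr_end
        is_masked = ch == "x"
        if is_masked and in_lcr:
            tp += 1
        elif is_masked:
            fp += 1
        elif in_lcr:
            fn += 1
    return tp, fp, fn

def assign_quality(masked, precision, noise):
    """Quality per letter: (1 - precision) for masked letters ('x'),
    (1 - noise) for unmasked letters."""
    return [1.0 - precision if ch == "x" else 1.0 - noise for ch in masked]

masked = "GxxTxxPxxTxxTxWSVWSPTVLLS"  # masked sequence from Figure 4-2
tp, fp, fn = confusion_counts(masked, 2, 13)  # assumed LCR span
precision = tp / (tp + fp)
```

For this example the precision is 8/9, so every masked letter receives a quality value of 1/9.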

We can compute TP, FP, and FN only if we know the true LCRs. This is also true

for precision and noise. However, we do not know the true LCRs of most sequences. We

propose to solve this problem by approximating precision and noise with the help of

sampling. We randomly sample n sequences that contain known repeats from Swissprot

(www.expasy.org/sprot). These manually annotated repeats

(http://www.expasy.org/sprot/sprot_details.html) do not have any known functions and are reliable for

our purpose of sampling. Running any LCR-identification algorithm on these sequences

will give each sequence in this set a precision value and a noise value as computed above.

For each masked sequence, we also calculate the k-gram complexity [45] for its masked

letters. Hence, we can represent each sequence by a (precision, complexity) tuple. Let

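$(p_1, c_1), (p_2, c_2), \ldots, (p_n, c_n)$ be the precision and complexity values for the sampled

sequences where $c_1 \le c_2 \le \cdots \le c_n$. We create an equi-depth histogram with m

bins as follows. We define the boundaries $H_0, H_1, \ldots, H_m$ of the bins of the histogram

as $H_0 = -\infty$, $H_m = +\infty$, and $H_i = c_{\lfloor n/m \rfloor \cdot i}$ for $0 < i < m$. The first bin contains

the first $\lfloor n/m \rfloor$ lowest-complexity tuples. The second bin contains the next $\lfloor n/m \rfloor$

lowest-complexity tuples and so on. For each bin represented by $[H_{i-1}, H_i]$, we define

its precision, denoted by $\pi_i$, as the mean of the precision of the tuples in that bin. Let

$(p_{(i-1)\lfloor n/m \rfloor + 1}, c_{(i-1)\lfloor n/m \rfloor + 1}), \ldots, (p_{i\lfloor n/m \rfloor}, c_{i\lfloor n/m \rfloor})$ be these tuples. Formally, we compute $\pi_i$ as:

$$\pi_i = \frac{1}{\lfloor n/m \rfloor} \sum_{j=(i-1)\lfloor n/m \rfloor + 1}^{i \lfloor n/m \rfloor} p_j.$$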






To assign each letter in a sequence a quality value, we first calculate the k-gram

complexity of its masked letters and find the histogram bin that this complexity belongs

to. Given a complexity value, denoted by $\mu$, we say that $\mu$ belongs to bin i if

$H_{i-1} < \mu \le H_i$. We then use the precision of that bin, $\pi_i$, as the precision of its masked letters. We

assign $(1 - \pi_i)$ to all masked letters of this sequence as their quality values.

Next, we describe how to assign quality values to unmasked letters. We calculate

the average of noises from all sample sequences. Since the difference between one and this

average tells the overall probability for an unmasked letter to belong to a non-LCR, we

assign this difference to unmasked letters as their quality values.
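The histogram-based lookup of this section can be sketched in Python as follows. The sketch assumes that there are at least as many sampled sequences as bins, that each bin holds floor(n/m) tuples, and that the last bin absorbs any remainder. The function names are hypothetical.

```python
import math
from bisect import bisect_left

def build_histogram(samples, m):
    """samples: (precision, complexity) tuples, with len(samples) >= m.
    Returns (upper_bounds, bin_precisions) of an equi-depth histogram."""
    ordered = sorted(samples, key=lambda t: t[1])  # sort by complexity
    depth = len(ordered) // m
    bounds, precisions = [], []
    for i in range(m):
        last = i == m - 1
        chunk = ordered[i * depth:] if last else ordered[i * depth:(i + 1) * depth]
        # Upper boundary: largest complexity in the bin, +infinity for the last bin.
        bounds.append(math.inf if last else chunk[-1][1])
        precisions.append(sum(p for p, _ in chunk) / len(chunk))
    return bounds, precisions

def masked_quality(complexity, bounds, precisions):
    """Quality of masked letters: 1 minus the precision of the bin whose
    upper boundary is the first one not smaller than the complexity."""
    return 1.0 - precisions[bisect_left(bounds, complexity)]

def unmasked_quality(noises):
    """Quality of unmasked letters: 1 minus the average sample noise."""
    return 1.0 - sum(noises) / len(noises)
```

The binary search makes each lookup logarithmic in the number of bins, although a linear scan would serve equally well for small m.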

4.3.2 QVA Based on Multiple LCR-identification Tools

No LCR-identification method is 100% accurate. Regions masked by different

methods can be different. This is because different LCR-detection methods may be

tuned for detecting different types of LCRs. For example, CARD is good at identifying

LCRs delimited by a pair of repeating subsequences. SEG depends on the window

length and Shannon Entropy complexity measure to identify LCRs. Thus, accuracies of

LCR-detection methods can be improved by integrating their results. Figure 4-3 shows a

sequence with an underlined LCR and its masked versions by two LCR-identification tools.

Masked letters are replaced by x. Each tool masks different regions as LCRs. If different









Position #: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
Sequence:   G A A T A A T A A T  A  A  T  A  W  S  V  W  S  P  T  V  L  L  S
Masked1:    G x x x x x x x x x  x  x  x  x  W  S  V  W  S  P  T  V  L  L  S
Masked2:    G x x x x x x x x x  x  x  x  A  x  x  V  x  x  P  T  V  L  L  S

Figure 4-3. A sequence with an underlined LCR and its masked versions by two different
LCR-identification algorithms. Masked letters by each algorithm are replaced
by x.


tools classify the same letter to belong to an LCR, then the chance for this letter to belong

to an LCR is higher compared to the case where these tools do not agree. Integrating

quality values from multiple tools is a nontrivial task. In principle, a quality value from

a high-accuracy LCR-identification algorithm should contribute more to the combined

quality value than that from a low-accuracy LCR-identification algorithm.

The average Jaccard coefficient reflects the accuracy of an LCR-identification

algorithm. For a sequence, Jaccard coefficient is computed as


TP/(TP + FP + FN).


The average Jaccard coefficient from all sequences reflects how well a classifier mirrors the

actual class labels [29]. It approaches one when the classifier gives all instances true

class labels. We propose to compute the combined quality value by taking a weighted

average of these quality values. Let $JC_1, JC_2, \ldots, JC_t$ be the Jaccard coefficients from

the t algorithms respectively. Let $q_1, q_2, \ldots, q_t$ be quality values assigned to a letter by t

algorithms respectively as discussed in Section 4.3.1. We compute the quality value of this

letter by using these t algorithms as:

$$q = \frac{\sum_{i=1}^{t} JC_i \cdot q_i}{\sum_{i=1}^{t} JC_i}.$$

The higher the Jaccard coefficient of an algorithm is, the more its quality value affects the

new quality value.
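A minimal Python sketch of this weighted combination is given below. The function names are hypothetical, and the numbers in the usage note are made up for illustration.

```python
def jaccard(tp, fp, fn):
    """Jaccard coefficient of one sequence: TP / (TP + FP + FN)."""
    return tp / (tp + fp + fn)

def combined_quality(qualities, jaccards):
    """Jaccard-weighted average of the quality values that t tools
    assign to the same letter."""
    total = sum(jaccards)
    return sum(jc * q for jc, q in zip(jaccards, qualities)) / total
```

For instance, if two tools with average Jaccard coefficients 0.3 and 0.1 assign a letter the quality values 0.2 and 0.8, the combined value is (0.3 * 0.2 + 0.1 * 0.8) / 0.4 = 0.35, which lies closer to the value from the more accurate tool.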










4.4 Quality-based Similarity Search

In Section 4.3, we discussed how letters are assigned quality values. In this section, we

introduce our quality-based similarity search algorithms: QDPS (Section 4.4.1) and QHTS

(Section 4.4.2).

4.4.1 Using Quality Values in DPS

Our first algorithm, QDPS (Quality-based Dynamic Programming Similarity search

algorithm), aligns two sequences. QDPS guarantees full sensitivity, i.e., it can find all

optimal local alignments. This algorithm is similar to DPS. The difference is that QDPS

uses quality values of letters.

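Let $a = a_1 a_2 \cdots a_n$ and $b = b_1 b_2 \cdots b_m$ be two sequences to be compared. Let $q_{a_i}$ and

$q_{b_j}$ be the quality values assigned to $a_i$ and $b_j$ respectively. Similar to DPS, QDPS fills a

dynamic programming matrix as follows.

$$M_{i,0} = g_o + i \cdot g_e, \quad M_{0,j} = g_o + j \cdot g_e, \quad \forall i, j,\ 0 < i \le n,\ 0 < j \le m,$$

and $\forall i, j > 0$:

$$M_{i,j} = \max \left\{ \begin{array}{l} \max_{x \ge 1} (M_{i-x,j} + w_x), \\ \max_{y \ge 1} (M_{i,j-y} + w_y), \\ M_{i-1,j-1} + S(a_i, b_j) \cdot q_{a_i} \cdot q_{b_j}, \\ 0. \end{array} \right.$$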



$w_x$ and $w_y$ are the sums of the products of the corresponding gap penalties and the quality

values of the letters in the other sequence respectively. Unlike DPS, QDPS weights the

score/penalty of each letter pair, including gaps, by multiplying it with the quality values

computed for the letter pair. The higher the quality value of a letter is, the more it

contributes to the alignment score. The existing strategy of removing LCRs entirely is a

special case of QDPS when the quality values of the LCRs are set to zero. QDPS shows

this behavior only when the underlying LCR-identification algorithm is 100% precise for

the corresponding complexity (i.e., no false positives). Thus, QDPS is superior to this









existing strategy since it adapts to the precision of the underlying LCR-identification

algorithm.

The space and time complexity of QDPS is O(mn). This is the same as those of DPS.

This is because they both compute dynamic programming matrices of the same size. Note

that there are optimized versions of DPS which have O(min {m, n}) space complexity at

the expense of increased running time. Such optimizations can be applied to QDPS too.
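The quality-weighted recurrence can be sketched in Python under simplifying assumptions: a linear gap penalty with no separate opening cost, zero boundary values, and a caller-supplied substitution function. Each gapped letter is penalized in proportion to its own quality value, which is one possible reading of the gap weighting described above. The names `qdps_score`, `score`, and `gap` are illustrative only.

```python
def qdps_score(a, b, qa, qb, score, gap=-1.0):
    """Best quality-weighted local alignment score between a and b.
    qa[i] and qb[j] are the quality values of a[i] and b[j]."""
    n, m = len(a), len(b)
    prev = [0.0] * (m + 1)  # only two rows are kept, so space is O(m)
    best = 0.0
    for i in range(1, n + 1):
        cur = [0.0] * (m + 1)
        for j in range(1, m + 1):
            # Substitution score weighted by the qualities of both letters.
            diag = prev[j - 1] + score(a[i - 1], b[j - 1]) * qa[i - 1] * qb[j - 1]
            up = prev[j] + gap * qa[i - 1]       # a[i-1] aligned to a gap
            left = cur[j - 1] + gap * qb[j - 1]  # b[j-1] aligned to a gap
            cur[j] = max(0.0, diag, up, left)
            best = max(best, cur[j])
        prev = cur
    return best
```

Setting every quality value to 1 gives an ordinary local alignment score with linear gaps, while setting the quality values of a region to 0 removes its contribution entirely, which corresponds to the special case of full masking discussed above.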

4.4.2 Using Quality Values in HTS

4.4.2.1 The algorithm

Many biological applications require all-to-all comparison of two large sequence sets,

say A and B [15, 65]. In these applications, A and B can both be too big to fit in main

memory. We will refer to the sequences in A as query sequences and the sequences in B

as database sequences in the rest of this section. A trivial solution to this problem is to

compare all pairs of sequences (a, b) where $a \in A$, $b \in B$, using QDPS. This is however

impractical as the number of sequence pairs is $|A| \cdot |B|$ and comparing a pair of sequences

is costly.

HTS methods, such as BLAST, alleviate this problem by employing a hash table

(Section 4.2). In this section, we show how to incorporate quality values to such hash

table-based searches. We develop a new method called QHTS (Quality and Hash

Table-based Similarity search algorithm). It imitates the well-known HTS methods.

QHTS exploits quality values not only to perform quality-based search, but also to

further improve the memory usage and the running time of traditional hash table-based

sequence similarity searches. Similar to HTS, QHTS has three steps: (1) probabilistic

hash table construction, (2) search phase, and (3) alignment phase. Note that some of the

existing HTS algorithms deviate slightly from the described three steps by finding multiple

seeds [70] or using spaced seeds [48]. It is trivial to extend QHTS to simulate them.

Probabilistic hash table construction: We slide a window of length k on the

sequences of one of the datasets, say A. Each window position produces a k-gram. For










each k-gram, we calculate a quality value $q_{avg}$ by taking the average of the quality values

of all k letters in this k-gram. Let $a_i \cdots a_{i+k-1}$ be a k-gram. Let $q_{a_i}, \ldots, q_{a_{i+k-1}}$ be the

quality values of the letters of this k-gram. Formally, the average quality of this k-gram is

computed as

$$q_{avg} = \frac{1}{k} \sum_{j=0}^{k-1} q_{a_{i+j}}.$$

The quality value $q_{avg}$ is a real number in the [0, 1] interval. We insert this k-gram into the

hash table with probability $q_{avg}$.

The bigger the quality value of a k-gram, the bigger the chance that it belongs to

a non-LCR and the bigger the chance that it is inserted into the hash table. Hence,

the probabilistic insertion tends to insert k-grams from non-LCRs into the hash table.

Letters with low quality values are possible true positives, i.e., LCRs. They tend not to be

inserted. Figure 4-4 illustrates a sequence and its k-grams (for k = 3). Only three k-grams

are stored in the hash table.

Our probabilistic hash table has two advantages over the traditional strategy where

all k-grams are kept:

1. Since k-grams from LCRs tend not to be inserted into the hash table, they will

not be identified as seeds. Thus, it is less likely to produce false positives due to seeds in

LCRs.

2. The hash table is smaller compared to the case where all k-grams are inserted.

This reduces the I/O cost (Section 4.5.2).

Our hash table is also superior to BLAST's hash table using "filter lookup table"

option. This is because the latter one removes all the k-grams in the regions masked by

an LCR-identification tool. This is undesirable as LCR-identification tools are highly

inaccurate. The former one, on the other hand, includes the k-grams in these regions with

probabilities determined by the qualities of the k-grams.
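The probabilistic insertion can be sketched in Python as follows. This is an illustration only: the function name, the use of a Python dictionary as the hash table, and the optional random generator argument are assumptions rather than details of the actual implementation.

```python
import random
from collections import defaultdict

def build_probabilistic_table(seq, quality, k, rng=None):
    """Insert each k-gram of seq with probability equal to the average
    quality of its k letters. quality[i] is the quality value of seq[i]."""
    if rng is None:
        rng = random.Random()
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        q_avg = sum(quality[i:i + k]) / k
        # rng.random() lies in [0, 1), so q_avg = 1 always inserts
        # and q_avg = 0 never inserts.
        if rng.random() < q_avg:
            table[seq[i:i + k]].append(i)
    return table
```

Passing a seeded generator such as `random.Random(0)` makes a run repeatable, which is convenient when comparing parameter settings.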

Search phase: In this phase, we discuss how we determine seeds. Let $a_i \cdots a_{i+k-1}$

and $b_j \cdots b_{j+k-1}$ denote k-grams in sequences $a \in A$ and $b \in B$, where the k-gram

in b is a neighbor of the k-gram in a (see Section 4.2 for the definition of neighbor). Let

$q_{a_i}, \ldots, q_{a_{i+k-1}}$ and $q_{b_j}, \ldots, q_{b_{j+k-1}}$ be quality values assigned to the letters in the two

k-grams respectively. We calculate their alignment score as

$$\sum_{t=0}^{k-1} S(a_{i+t}, b_{j+t}) \cdot q_{a_{i+t}} \cdot q_{b_{j+t}}.$$

If this alignment score is greater than the neighbor threshold, we call this region a seed.

During this phase, for each k-gram in B, we find all its neighbors in A with the help of the

hash table. We then recalculate their alignment scores to decide seeds.

Alignment phase: We extend each seed, i.e., $a_i \cdots a_{i+k-1}$ and $b_j \cdots b_{j+k-1}$, in both left

and right directions with no gap allowed along a and b respectively. Whenever we extend

each current subsequence by a new letter, we update the alignment score with quality

values included. Let $a_i \cdots a_{i+x}$ and $b_j \cdots b_{j+x}$, where $x > k$, be the two current subsequences

to be extended respectively. Let m denote the alignment score between them. Let $a_{i+x+1}$

and $b_{j+x+1}$ be the two new letters to be added. Let $q_{a_{i+x+1}}$ and $q_{b_{j+x+1}}$ denote the quality

values of the new letters respectively. The alignment score is updated as

$$m := m + S(a_{i+x+1}, b_{j+x+1}) \cdot q_{a_{i+x+1}} \cdot q_{b_{j+x+1}}.$$


We extend in each direction until the difference between the maximum alignment score

observed so far and the current alignment score is greater than an extension threshold.

The maximum alignment score among all seed extensions between the database and the

query sequence is taken as their alignment score.
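The extension step can be sketched in Python for the right direction; the left direction is symmetric. The function name, its parameters, and the way the extension threshold is applied are assumptions made for illustration.

```python
def extend_right(a, b, qa, qb, i, j, k, score, x_drop):
    """Extend the seed a[i:i+k] / b[j:j+k] to the right without gaps.
    Returns (best quality-weighted score, aligned length at that score)."""
    # Quality-weighted score of the seed itself.
    cur = sum(score(a[i + t], b[j + t]) * qa[i + t] * qb[j + t] for t in range(k))
    best, best_len = cur, k
    x = k
    while i + x < len(a) and j + x < len(b):
        cur += score(a[i + x], b[j + x]) * qa[i + x] * qb[j + x]
        x += 1
        if cur > best:
            best, best_len = cur, x
        elif best - cur > x_drop:
            break  # fell too far below the best score seen so far
    return best, best_len
```

Letters with low quality values change the running score only slightly, so a low-quality stretch neither raises the score much nor triggers the stopping rule quickly.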

The traditional approach ignores masked letters. This is the same as assigning masked

letters qualities 0 and unmasked letters qualities 1. Thus, masked letters do not contribute

to alignment scores and unmasked letters make full contributions. This strategy, however,

is problematic since no LCR-identification algorithm is 100% accurate.









GARAQAQAQKL


GARAQXXXQKL

Figure 4-4. An example for probabilistic hash table and reconstruction. The solid lines
show the three 3-grams of the sequence GARAQAQAQKL stored in the hash table.
The resulting sequence after reconstruction is at the bottom. Letter X denotes
an unknown letter.


4.4.2.2 I/O and CPU computations

During the extension we need to access query sequences. The trivial choice is to store

all of them in main memory at all times. This, however, allocates memory redundantly

since only the part of the query sequence that takes part in the alignment is needed. Such

redundant memory usage is undesirable especially when two very large sequence databases

are compared to each other. Another option is to read a subsequence to main memory

whenever needed. This, however, can incur many costly page reads if the sequences evicted

from memory to store the new subsequences are needed again in later steps. We propose

to reconstruct query sequences from the hash table whenever needed. Figure 4-4 shows

a sequence GARAQAQAQKL and its three (out of nine) k-grams (k = 3) stored in the hash

table. In this example, we can reconstruct 73% of the letters of this sequence using these

k-grams. The lost letters are given quality values of zero, and they are replaced with the

letters "N" and "X" for DNA and amino acid sequences respectively. Three important

notes can be made about the performance of QHTS.

1. The CPU time for the search and the alignment phases are reduced. This is

because not all k-grams are inserted into the hash table. Hence, fewer seeds are usually

found and extended (Section 4.5.2).

2. Reconstructing query sequences from the hash table for seed extension reduces

I/O cost. This is because query sequences are not read from disk. The hash table is in

the memory. Hence, reconstructing query sequences from it does not involve any I/O cost

(Section 4.4.2.4).









3. The sensitivity may drop due to reconstruction. This is because some letters

may not be recovered if none of the k-grams containing those letters is inserted into the

hash table. However, the drop in sensitivity is insignificant for two reasons. First, the

missing k-grams usually belong to the LCRs that need to be removed. Second, a letter

belongs to k different k-grams. A letter cannot be recovered during reconstruction only

if all the k-grams that contain it are missing from the hash table. Thus, as k increases,

the probability that a letter cannot be recovered decreases exponentially. Hence, the

sensitivity drop is small (Section 4.5.1).
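The reconstruction of a query sequence from its stored k-grams can be sketched in Python as follows. For the sequence GARAQAQAQKL, the three stored 3-grams are assumed here to be GAR, RAQ, and QKL at offsets 0, 2, and 8, a choice that recovers 8 of the 11 letters (about 73%). The function names and the independence assumption behind `loss_probability` are illustrative.

```python
def reconstruct(length, stored, unknown="X"):
    """Rebuild a sequence of the given length from (offset, k-gram) pairs
    kept in the hash table; unrecovered letters become `unknown`."""
    letters = [unknown] * length
    for offset, kgram in stored:
        letters[offset:offset + len(kgram)] = kgram
    return "".join(letters)

def loss_probability(q, k):
    """Chance that an interior letter is lost, assuming each of the k
    k-grams containing it is inserted independently with probability q."""
    return (1.0 - q) ** k

rebuilt = reconstruct(11, [(0, "GAR"), (2, "RAQ"), (8, "QKL")])
recovered = sum(ch != "X" for ch in rebuilt) / len(rebuilt)
```

Under the independence assumption, an insertion probability of 0.5 with k = 3 gives a loss probability of 0.125 per interior letter, and the value shrinks geometrically as k grows.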

4.4.2.3 Cost analysis of QHTS

Let $\alpha$ denote the letter alphabet size. For protein sequences, $\alpha = 20$. Let L be the

average sequence length. Let X and Y denote the number of sequences in the query set

and the database respectively. Let k be the k-gram size. Let q denote the average quality

of the k-grams in the query set.

Probabilistic hash table construction: QHTS spends time linear in the query set size

to build the hash table, i.e., O(XL). The query set contains O(XL) k-grams. However,

QHTS does not insert all these k-grams into the hash table. The number of entries in the

hash table is O(XLq).

Search phase: This phase takes O(Y) time since it involves a hash table lookup

for each k-gram of the database. Each k-gram in the database can have as many as

$O(XL/\alpha^k)$ exact matches in the query set. The hash table keeps only a fraction, q, of

the query k-grams. Thus the storage needed to store the seeds of a database sequence is

$O(XLq/\alpha^k)$.

Alignment phase: The time complexity of this phase depends on the proximity of the

query set and the database as well as qualities of the letters. Let p be the probability

that two k-grams fail the neighborhood threshold using QHTS, given that they satisfy

the threshold without using quality values. The number of seeds that QHTS extends









becomes $O(XYL^2 q(1-p)/\alpha^k)$. On the other hand, traditional methods need to extend

$O(XYL^2/\alpha^k)$ seeds. Thus, the time complexity of QHTS at this step is much less.

The above analysis shows that QHTS performs better than HTS with no masking at each

step except the first one where they have the same time complexity. The dominant factor

in the overall complexity comes from the search phase and the alignment phase. Hence, we

conclude that the speedup of QHTS over the HTS no-masking strategy is between $O(1/q)$ and

$O(1/(q(1-p)))$.
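As a rough numeric illustration of the alignment-phase analysis, the expected number of extended seeds can be estimated under the assumption of uniformly random letters: about X * Y * L^2 pairs of k-grams, each matching exactly with probability 1 / alphabet^k, of which QHTS keeps a fraction q, and of which a fraction (1 - p) passes the quality-weighted threshold. The function name and the example numbers are hypothetical.

```python
def expected_seeds(X, Y, L, k, alphabet=20, q=1.0, p=0.0):
    """Estimated number of seeds to extend. q = 1 and p = 0 correspond to
    the traditional strategy that keeps every k-gram."""
    return X * Y * L * L * q * (1.0 - p) / alphabet ** k

traditional = expected_seeds(10, 10, 100, 3)
qhts = expected_seeds(10, 10, 100, 3, q=0.5, p=0.2)
speedup = traditional / qhts  # equals 1 / (q * (1 - p))
```

With these made-up values the traditional strategy extends about 125 seeds and QHTS about 50, a ratio of 2.5.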
4.4.2.4 Memory allocation

QHTS stands out especially when the query set and the database sizes are much

larger than the available main memory. It is very important to have a good memory

allocation scheme to minimize I/O cost.

To get a block of data from disk, three kinds of time may be spent: seek time,

rotation time, and transfer time, denoted by st, rt, and tt respectively. Let Q and DB

be sizes of the query set and the database in pages respectively. Assume that a page is

big enough to hold the longest sequence. Let M be the total number of available main

memory pages. In QHTS, M is divided into four pieces, say $M_1$, $M_2$, $M_3$, and $M_4$. Here

$M_1$, $M_2$, $M_3$, and $M_4$ denote the space allocated to the hash table, database sequences, query

sequences, and quality values respectively. Let C denote the number of memory pages

used to index the k-grams of sequences from one disk page. Hence, $M_1$ pages can hold

the hash table for $M_1/C$ pages of sequences. Quality values of all sequences can be stored

using a small amount of fixed memory, say by assigning one page to $M_4$. This can be justified

as follows. For each sequence, only a quality value for masked letters, starting positions,

and ending positions of masked letter intervals need to be stored. Quality values for

unmasked letters are a fixed value (Section 4.3). Thus, the total space involved in quality

values is small enough to be stored in memory. As mentioned in Section 4.4.2.2, during

seed extension, we can either reconstruct query sequences from the hash table or read




































Figure 4-5. Memory adaption scheme. qs, db, and HT represent the query set, the
database, and the hash table respectively. For both the reconstruction case and the
reading case, M_1 and M_2 are the amounts of memory allocated to the hash
table and the database sequences respectively. M_1/C is the amount of the
query set whose k-grams can be held in a hash table of size M_1
probabilistically. For each M_1/C pages of the query set, a hash table is built
and the whole database is read into memory once to find seeds. M_3 only
applies to the reading case. It represents the amount of memory allocated to the
query set for seed extension. In the worst case, for each M_2 pages of the
database, the entire query set is read into memory once in chunks of M_3.


them from disk. The latter case involves additional I/O cost. Hence, we discuss the

memory allocation scheme based on these two cases.

(1) Reconstruction case: Since the query sequences are reconstructed from the hash
table, we only need to assign one page to them, i.e., M_3 = 1 (we assume that the longest
sequence fits in one page). This page is used to store a reconstructed sequence. Hence,

    M_2 = M - M_1 - M_3 - M_4 = M - M_1 - 2.    (4-1)









M_1/C pages of the query set are brought into memory to build a hash table. The database
is then read into memory in chunks of M_2 pages and scanned to find seeds for these
query sequences. Extending seeds does not require extra I/O since query sequences are
reconstructed in memory. This process is repeated Q/(M_1/C) times by reading a new subset of
the query set into memory. Figure 4-5 illustrates this process except that the information
about M_3 does not apply.

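The I/O cost of reading the query set and the database is:

    \frac{Q}{M_1/C}(st + rt) + Q \cdot tt + \frac{Q}{M_1/C}\left(\frac{DB}{M_2}(st + rt) + DB \cdot tt\right).    (4-2)

Plugging (4-1) into (4-2), we get:

    \frac{Q}{M_1/C}(st + rt) + Q \cdot tt + \frac{Q}{M_1/C}\left(\frac{DB}{M - M_1 - 2}(st + rt) + DB \cdot tt\right).    (4-3)

Therefore, the total I/O cost can be formulated as:

    \frac{QC}{M_1}(st + rt) + Q \cdot tt + \frac{QC}{M_1}\left(\frac{DB}{M - M_1 - 2}(st + rt) + DB \cdot tt\right).    (4-4)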

(2) Reading case: Query sequences are read from disk for seed extension. Thus, M_3 is
not fixed any more. Hence,

    M_2 = M - M_1 - M_3 - M_4 = M - M_1 - M_3 - 1.    (4-5)

The I/O cost to read the query set to build the hash table and to read the database is
the same as in the reconstruction case. The worst case I/O of reading the query set for seed
extension happens when the query set needs to be read once for each M_2 pages of the
database currently stored in memory. In this case, the query set needs to be read DB/M_2
times. Hence, the total I/O cost can be formulated as:

    \frac{Q}{M_1/C}(st + rt) + Q \cdot tt + \frac{Q}{M_1/C}\left(\frac{DB}{M_2}(st + rt) + DB \cdot tt\right) + \left(\frac{Q}{M_3}(st + rt) + Q \cdot tt\right)\frac{DB}{M_2}.    (4-6)









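Plugging (4-5) into (4-6), we get:

    \frac{Q}{M_1/C}(st + rt) + Q \cdot tt + \frac{Q}{M_1/C}\left(\frac{DB}{M - M_1 - M_3 - 1}(st + rt) + DB \cdot tt\right) + \left(\frac{Q}{M_3}(st + rt) + Q \cdot tt\right)\frac{DB}{M - M_1 - M_3 - 1}.    (4-7)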

Taking the derivatives of these equations gives us the optimal values of M_1, M_2, M_3,
and M_4, i.e., the values at which the total I/O cost is minimized for each case. As we show later in
Section 4.5, reconstruction significantly reduces the I/O, and thus, the total cost.

Comparison of reconstruction and reading cases: Equations (4-3) and (4-6)
also apply to the no-masking strategy. Hence, there are four cases in total:
no-masking-Reconstruct, no-masking-Read, QHTS-Reconstruct, and QHTS-Read. We have theoretically
calculated the minimum I/O cost of these four cases, as given by the two equations.
Assume that each pointer in the hash table pointing to a k-gram takes four bytes of
memory. Thus, we have C = 4 for the no-masking strategy. To get C for QHTS, we first
ran QHTS with GBA as the underlying LCR-identification algorithm. We then created
the probabilistic hash table on a database of 60,000 protein sequences from the Swissprot
database. This experiment provided the parameter C as 3.14. We used typical
parameters of currently available commercial disks: st = 8,
rt = 2, and tt = 0.09 (in milliseconds), with a page size of 4 kB. Figure 4-6 plots the worst case
minimum I/O cost of the four cases (in seconds) when the best memory allocation scheme
is used for each case. Here, we set both the database size DB and the query set size Q
to 100,000 pages. We vary the available memory from DB/16 to DB. Figure 4-6 shows
that the I/O cost of QHTS-Reconstruct is much less than that of QHTS-Read (2.06 to
2.25 times lower). The speedup is larger when the memory is smaller. These also apply
to the no-masking cases, except that the I/O cost of no-masking-Reconstruct is 1.97 to 2.08 times
less than that of no-masking-Read. QHTS-Reconstruct is 1.24 to 2.07 times faster than
no-masking-Reconstruct. QHTS-Read is 1.17 times faster than no-masking-Read. Therefore,
both the reconstruction strategy and the use of quality values save I/O cost.
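The cost model of this section can be explored numerically. The following Python sketch is not the implementation used in the experiments; it assumes that every chunk read costs one seek plus one rotation (st + rt), that every page costs one transfer (tt), and that M_4 = 1 page. The function names, the `Disk` class, and the grid search with its `step` parameter are illustrative assumptions (the text obtains the optimum by taking derivatives), so the printed values need not coincide with Figure 4-6.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Disk:
    st: float = 8.0   # seek time per chunk read, in ms
    rt: float = 2.0   # rotation time per chunk read, in ms
    tt: float = 0.09  # transfer time per page, in ms


def reconstruct_cost(Q, DB, M, M1, C, disk=Disk()):
    """I/O cost (ms) when query sequences are rebuilt from the hash table."""
    M2 = M - M1 - 2                      # M3 = M4 = 1 page each
    rounds = Q / (M1 / C)                # number of hash tables built
    access = disk.st + disk.rt
    query_io = rounds * access + Q * disk.tt
    db_io = rounds * ((DB / M2) * access + DB * disk.tt)
    return query_io + db_io


def read_cost(Q, DB, M, M1, M3, C, disk=Disk()):
    """Worst-case I/O cost (ms) when query sequences are re-read from disk."""
    M2 = M - M1 - M3 - 1                 # M4 = 1 page
    rounds = Q / (M1 / C)
    access = disk.st + disk.rt
    query_io = rounds * access + Q * disk.tt
    db_io = rounds * ((DB / M2) * access + DB * disk.tt)
    # The query set is re-read once per M2-page chunk of the database.
    extension_io = (DB / M2) * ((Q / M3) * access + Q * disk.tt)
    return query_io + db_io + extension_io


def best_reconstruct(Q, DB, M, C, step=250):
    """Grid search over M1 (hypothetical helper); returns (cost, M1)."""
    return min((reconstruct_cost(Q, DB, M, m1, C), m1)
               for m1 in range(step, M - 2, step))


def best_read(Q, DB, M, C, step=250):
    """Grid search over (M1, M3) (hypothetical helper); returns (cost, M1, M3)."""
    return min((read_cost(Q, DB, M, m1, m3, C), m1, m3)
               for m1 in range(step, M - 2, step)
               for m3 in range(1, M - m1 - 1, step))


if __name__ == "__main__":
    Q = DB = 100_000
    for M in (DB // 16, DB // 4, DB // 2):
        rec, _ = best_reconstruct(Q, DB, M, C=3.14)
        rd, _, _ = best_read(Q, DB, M, C=3.14)
        print(f"M={M}: reconstruct {rec / 1000:.0f} s, read {rd / 1000:.0f} s")
```

Because the reading case pays every term of the reconstruction case with a smaller or equal M_2, plus the extra cost of re-reading the query set, its minimum over the same grid is always larger.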











[Plot: worst case minimum I/O cost versus memory size (as a fraction of database size) for QHTS-Reconstruct, QHTS-Read, No masking-Reconstruct, and No masking-Read.]

Figure 4-6. The minimum I/O cost varies, depending on whether to use reconstruction or
read strategy, whether to use quality values, and the relationship between the
available amount of memory M and the size of the database DB.


4.5 Experimental Evaluation

This section evaluates QDPS and QHTS. We perform both quality and performance

experiments. In quality experiments, we compute the search accuracy and reconstruction

error. In performance experiments, we evaluate the space usage and the running time.

LCR-identification algorithms: We used three LCR-identification tools: SEG, CARD,

and GBA. We downloaded the first two, and implemented the last one. We used their

default parameters. The average Jaccard coefficients of SEG, CARD, and GBA were 0.20,

0.19, and 0.25 respectively.
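As an illustration of the Jaccard coefficient reported above, the following Python sketch assumes that it compares the set of letter positions a tool marks as LCR with the set of positions annotated as LCR. This section does not restate that definition, so the interval representation and the function names are assumptions made for the example.

```python
def positions(intervals):
    """Expand inclusive (start, end) intervals into a set of letter positions."""
    return {i for start, end in intervals for i in range(start, end + 1)}


def jaccard(predicted, annotated):
    """Jaccard coefficient |A & B| / |A | B| of two position sets."""
    predicted, annotated = set(predicted), set(annotated)
    union = predicted | annotated
    if not union:
        return 1.0  # convention chosen here: two empty sets agree fully
    return len(predicted & annotated) / len(union)


if __name__ == "__main__":
    tool = positions([(10, 19), (40, 44)])       # positions marked by a tool
    reference = positions([(15, 24)])            # annotated LCR positions
    print(round(jaccard(tool, reference), 2))
```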

Sequence comparison algorithms: We implemented three masking strategies: (1) No
masking: LCRs are not identified. (2) Boolean masking: Identified LCRs are completely
removed from sequences. (3) Probabilistic masking: Quality values derived from the
identified LCRs are used during the search. The former two are the existing
strategies whereas the last one is the proposed strategy. We implemented DPS and HTS
as described in Section 4.2. We name their no-masking and boolean-masking versions
NDPS, BDPS, NHTS, and BHTS respectively. We set the gap open and extension
penalties to -10 and -0.5 respectively. We set the neighbor threshold, the extension
threshold, and the k-gram length in all the cases to 11, seven, and three respectively, as
these are the defaults of BLAST. All programs are coded in Java. All experiments were run
on a Linux machine with 1 GB memory and a 2.53 GHz CPU.

We tested different versions of BDPS, BHTS, QDPS, and QHTS on the three
LCR-identification methods. For readability, we only show a small result set. In our plots,
we consistently chose one case with one, two, or three LCR-identification algorithms. For
one algorithm, we chose GBA, as it had the best Jaccard coefficient and precision. For
two algorithms we picked SEG and CARD, as all the combinations of two algorithms
were similar. We use the reconstruction strategy in our reported QHTS results unless
otherwise stated since reconstruction has better performance. For convenience, we
add the name of an LCR-identification method as a suffix with a dash to the masking
strategy. For example, QDPS-GBA denotes the QDPS masking strategy using GBA as the
LCR-identification algorithm. When multiple LCR-identification algorithms are used, we
will only use their initial letters. For example, SC means using SEG and CARD.

Dataset: For accuracy evaluation, we downloaded 60,000 protein sequences from
Swissprot as our database. We randomly picked 50 of them as our query set. According
to the InterPro database [5], all these sequences belong to one or more families. We
identify two sequences as biologically similar if they belong to the same family according
to InterPro. This is because biologists have integrated various sequence-cluster
databases into InterPro. Thus, InterPro provides a biologically reliable classification
as the contributing databases complement each other in grouping protein families [38].
Each sequence in our query set comes from one or two families out of 29 families. The
average number of sequences in our database with the same families as a query sequence
is 64. The minimum, the maximum, and the standard deviation are 2, 1000, and 142
respectively. For performance comparison, we created seven sets of 6000, 3000, 1500, 750,
375, 188, and 94 sequences from the 60,000 sequences.
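The seven dataset sizes are consistent with repeatedly halving the largest set and rounding up. The short Python sketch below reproduces the list under that assumption; the text does not state how the sizes were actually derived, and `halving_sizes` is a hypothetical helper.

```python
import math


def halving_sizes(largest, count):
    """Return `count` sizes, each half of the previous one, rounded up."""
    sizes = [largest]
    while len(sizes) < count:
        sizes.append(math.ceil(sizes[-1] / 2))
    return sizes


if __name__ == "__main__":
    print(halving_sizes(6000, 7))
```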

We also created a dataset of 20,714 protein sequences from PDB (www.rcsb.org/pdb/)
for accuracy evaluation. We randomly picked 50 of them as our query set.










According to SCOP database, each query sequence belongs to a different super family.

We identify two sequences as biologically similar if they belong to the same super family

according to SCOP. The average number of sequences in our database with the same

super family as a query sequence is 34. The minimum and the maximum are 10 and 100

respectively.

We randomly sampled 121 sequences from Swissprot and created 24 bins with 5
points in each bin (except the last one). None of the queries is included in the sample set.
We used the 2-gram complexities of the identified LCRs of each sample to predict quality
values (Section 4.3).

4.5.1 Evaluation of Accuracy

We say that a sequence satisfies a threshold for another sequence under a similarity
search algorithm if their alignment score returned by this algorithm is greater than the
threshold. To evaluate the accuracy of each search algorithm, we used 25, 50, 75, and 100
as the threshold. This is because the default parameters of BLAST return alignments
with a score of at least 25.

For each query sequence, q, let TP (True Positive) be the number of database

sequences that satisfy the threshold with q using the underlying similarity search

algorithm and belong to the same family as q. Similarly, let FP (False Positive) be

the number of database sequences that satisfy the threshold, but do not belong to the

same family as q. Also, let FN (False Negative) be the number of database sequences
that do not satisfy the threshold, but belong to the same family as q. We compute two
measures, precision and recall, as follows:

precision = TP / (TP + FP);

recall = TP / (TP + FN).
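The two measures can be sketched in Python as follows. The data layout (a dictionary of alignment scores and a set of same-family sequence identifiers) and the function name are assumptions made for illustration; the strict comparison follows the definition of satisfying a threshold given above.

```python
def precision_recall(scores, same_family, threshold):
    """Compute (precision, recall) for one query.

    scores: dict mapping database sequence id -> alignment score with the query
    same_family: set of database sequence ids in the same family as the query
    """
    returned = {seq for seq, score in scores.items() if score > threshold}
    tp = len(returned & same_family)   # returned and in the same family
    fp = len(returned - same_family)   # returned but in a different family
    fn = len(same_family - returned)   # same family but not returned
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


if __name__ == "__main__":
    scores = {"a": 30, "b": 60, "c": 20, "d": 80}
    family = {"a", "b", "c"}
    for cutoff in (25, 50, 75, 100):
        print(cutoff, precision_recall(scores, family, cutoff))
```

Averaging these per-query values over the query set gives figures of the kind reported in the tables of this section.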

Evaluation of the DP methods: We compare our proposed strategy to existing

strategies when all methods have full sensitivities (i.e., they find the best results according

to their underlying scoring schemes).










Table 4-1. Average total number of database sequences returned and precision (in
percentage) for NDPS, BDPS-SEG, QDPS-GBA, QDPS-SC, and QDPS-SCG
on the Swissprot dataset.

                              score cutoff
               Method         25      50      75     100
Result set     NDPS        55897    7775     708     171
size           BDPS-SEG    55205    7611     611     108
               QDPS-GBA    25059     194      50      34
               QDPS-SC      3147      63      36      30
               QDPS-SCG     8861     102      44      34
Precision      NDPS         0.06    0.43       4      17
(%)            BDPS-SEG     0.06    0.44       5      27
               QDPS-GBA     0.13      17      60      85
               QDPS-SC      1.08      47      79      89
               QDPS-SCG     0.39      30      68      82

Figure 4-7. Result set size versus recall (in percentage) for NDPS, BDPS and QDPS on
the Swissprot dataset.


Table 4-1 compares the average total number of database sequences returned and
precision for various thresholds on the Swissprot dataset. QDPS (the last three methods)
return significantly fewer results than DPS (the first two methods). The difference grows
as the threshold decreases. QDPS return 0.8-45% of the sequences returned by NDPS. As the
threshold increases, the precision of all the methods increases. QDPS-SC and QDPS-SCG
usually have better results in all the experiments than QDPS-GBA. At threshold 50 the
precision of QDPS-SC is 30 percentage points better than that of QDPS-GBA (47% versus 17%). The superior precision of QDPS










Table 4-2. Average total number of database sequences returned and precision (in
percentage) for NDPS, BDPS-SEG, QDPS-GBA, QDPS-SC, and QDPS-SCG
on the PDB dataset.

                              score cutoff
               Method         25      50      75     100
Result set     NDPS        16010    1397     113      22
size           BDPS-SEG    15808    1365     110      22
               QDPS-GBA     4974      20      13      11
               QDPS-SC       946      13      11      11
               QDPS-SCG      946      13      11      11
Precision      NDPS         0.16       1      12      57
(%)            BDPS-SEG     0.16       1      12      57
               QDPS-GBA     0.42      65      95      95
               QDPS-SC       1.8      92      95      96
               QDPS-SCG      1.8      92      95      96

Figure 4-8. Result set size versus recall (in percentage) for NDPS, BDPS and QDPS on
the PDB dataset.


indicates that our quality based alignment eliminates almost all the false positives while

the traditional methods cannot. Figure 4-7 shows the relationship between result set size

and recall for each method on the same dataset. We see that when all methods have the

same recall, QDPS have much smaller result sizes. Hence, QDPS reduces the number of

false negatives and false positives over DPS significantly. We obtained similar results on

the PDB dataset (Table 4-2 and Figure 4-8).









Table 4-3. Average total number of database sequences returned and precision (in
percentage) for NHTS, BHTS-SEG, QHTS-GBA-Read, QHTS-SC-Read, and
QHTS-SCG-Read on the Swissprot dataset.

                              score cutoff
               Method         25      50      75     100
Result set     NHTS        43218     198      47      33
size           BHTS-SEG    40909      81      31      29
               QHTS-GBA     6979      41      30      26
               QHTS-SC       162      26      23      20
               QHTS-SCG      773      34      27      22
Precision      NHTS         0.08      15      61      81
(%)            BHTS-SEG     0.08      37      90      94
               QHTS-GBA        4      93     100     100
               QHTS-SC        16      93      97     100
               QHTS-SCG        4      83      95      98


Evaluation of the HT-based methods: We compare our proposed strategy to existing
strategies on the Swissprot dataset when they do not have full sensitivities.

Table 4-3 compares the average total number of database sequences returned
and precision for various thresholds. QHTS (the last three methods) return fewer
results than HTS (the first two methods). The difference is more significant for small
thresholds. At threshold 25 QHTS return 0.4 to 16.1% of those of NHTS depending on
the LCR-identification strategy. QHTS have higher precision than HTS. As the threshold
increases to large values, all the methods perform roughly the same. Figure 4-9 shows
the relationship between result set size and recall for each method. We can see that when
all methods have the same recall, QHTS-GBA has a much smaller result size. Hence, it
reduces the number of false negatives and false positives over HTS significantly. Although
compared with HTS methods, QHTS-SC and QHTS-SCG have large result sizes at the
same recall level, the recall drops slightly at the same result size. Hence, we conclude that
they eliminate a large number of false positives at the expense of a small number of false
negatives.

Figure 4-10 demonstrates how the proposed strategy compares to BLAST by focusing
on a protein, APOA4_MOUSE, from our Swissprot query set. This protein contains 390 amino













Figure 4-9. Result set size versus recall (in percentage) for NHTS, BHTS and QHTS on
the Swissprot dataset.

Figure 4-10. Result set size versus recall (in percentage) for QHTS and BLAST with the
LCR-filter off (BLAST-w/oFilter) and on (BLAST-w/Filter) for querying the
protein sequence, APOA4_MOUSE, against our Swissprot database.


acids, where 269 of them are annotated as in LCRs. We ran BLAST to align this query to
the proteins in our Swissprot database. We used the default parameters of BLAST with
and without using the "filter low complexity" option. We also aligned this query using
QHTS methods. The results show that QHTS methods have higher recall at the same
result set size. Their differences are more pronounced than those in Figure 4-9.










Table 4-4. Average recall and precision (in percentage) of QHTS-GBA-Reconstruct,
QHTS-GBA-Read, QHTS-SC-Reconstruct, QHTS-SC-Read,
QHTS-SCG-Reconstruct, and QHTS-SCG-Read on the Swissprot dataset. To
save space, we omit "QHTS" and use "Rec." to represent "Reconstruct" in the
table.

                              score cutoff
               Method         25      50      75     100
Recall         GBA-Rec.       52      45      42      39
(%)            GBA-Read       53      45      42      39
               SC-Rec.        38      36      32      28
               SC-Read        41      38      35      31
               SCG-Rec.       45      42      38      33
               SCG-Read       47      43      40      34
Precision      GBA-Rec.      0.5      73      91      96
(%)            GBA-Read      3.5      92      99      99
               SC-Rec.        21      96      98      99
               SC-Read        16      93      98      99
               SCG-Rec.        5      84      96      98
               SCG-Read        4      83      95      98


Evaluation of reconstruction and reading: As discussed in Section 4.4.2.2,
reconstruction for seed extension may cause sensitivity to drop. Here, we compare
three pairs of reconstruction and reading versions of QHTS on the Swissprot
dataset: (1) QHTS-GBA-Reconstruct and QHTS-GBA-Read, (2) QHTS-SC-Reconstruct and
QHTS-SC-Read, (3) QHTS-SCG-Reconstruct and QHTS-SCG-Read.

Table 4-4 presents the average recall and precision. For each pair, the reading case
always has a slightly higher recall value than the reconstruction case, with a maximum
difference of 3%. This is expected since the reading case reads sequences from disk and
no letter is lost. The reading case sometimes has higher precision, depending on the
LCR-identification tools used. However, the differences are small, especially when the score cutoff
gets larger. Thus, the accuracy of the reconstruction case is comparable to that of the reading case.

Lost information by reconstruction: As mentioned in Section 4.4.2.2, we may lose
letters during reconstruction. Here, we evaluate the amount of lost information. We define the
Relative Information Loss measure for this purpose. Let a = a_1 a_2 ... a_n be a sequence.
Let q_1, q_2, ..., q_n be the quality values associated with each letter in a. Assume that letters










Table 4-5. Relative information loss regarding the length of k-grams (in percentage).

k    QHTS-GBA    QHTS-SC    QHTS-SCG
2       6.9         7.2        7.3
3       2.9         2.7        2.8
4       1.3         1.3        1.3

Table 4-6. CPU times spent in HTC, SP, and AP, and the number of k-grams stored in
the hash tables for (I) NHTS, (II) BHTS-SEG, (III) QHTS-GBA-Reconstruct,
(IV) QHTS-SC-Reconstruct, and (V) QHTS-SCG-Reconstruct.

                        I            II        III        IV         V
CPU time    HTC         6             5          5         5         6
[sec]       SP      11968          8615       8252      7293      8513
            AP       1269           895        343        13        46
            total   13240          9516       8601      7311      8565
# of k-grams       833792   [illegible]     650147    481070    545485


a_{x_1}, a_{x_2}, ..., a_{x_m} are lost ({x_1, x_2, ..., x_m} \subseteq {1, 2, ..., n}). Formally, we define Relative
Information Loss as:

    \frac{\sum_{i=1}^{m} q_{x_i}}{\sum_{i=1}^{n} q_i}

Table 4-5 shows how Relative Information Loss changes regarding the length of
k-grams on the 1,500-sequence set. As mentioned in Section 4.2.2 of the chapter, the
bigger the k, the smaller the possibility that a letter is lost. When k = 3, the Relative
Information Loss is about 3%. As we present in Section 5.2 of the chapter, this number is
very small compared to the percentage of k-grams evicted from the hash table. Thus, very
little information is lost during reconstruction.
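A short Python sketch of this measure follows. It assumes that Relative Information Loss is the sum of the quality values of the lost letters divided by the sum of the quality values of all letters in the sequence; positions are 0-based here, and the function name and sample values are illustrative only.

```python
def relative_information_loss(quality, lost_positions):
    """Quality-weighted fraction of a sequence lost during reconstruction.

    quality: list of quality values q_1..q_n, one per letter
    lost_positions: 0-based indices of letters that could not be reconstructed
    """
    total = sum(quality)
    if total == 0:
        return 0.0  # assumption: a sequence with no quality mass loses nothing
    lost = sum(quality[i] for i in set(lost_positions))
    return lost / total


if __name__ == "__main__":
    q = [1.0, 0.2, 0.3, 1.0, 0.5]
    # Two low-quality letters are lost: 40% of the letters, but a much
    # smaller share of the total quality.
    print(relative_information_loss(q, [1, 2]))
```

Under this reading, losing letters with low quality values costs little, which matches the observation that the loss stays small even when many k-grams are evicted.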

4.5.2 Performance Comparison

In this section, we evaluate the performance of our quality based search methods. All

tests are self-comparisons.

CPU time comparison: We compare CPU times in the three phases of hash table-based
searches: probabilistic hash table construction (HTC), search phase (SP), and alignment
phase (AP). We tested five strategies: NHTS, BHTS-SEG, QHTS-GBA-Reconstruct,
QHTS-SC-Reconstruct, and QHTS-SCG-Reconstruct. The CPU times of QHTS for the
reconstruction and reading cases are almost the same. Thus, we only present the results
for reconstruction here.

Table 4-6 shows the results on the 1,500-sequence set. HTC takes much less time than
SP and AP in all cases. This is because the running time of HTC is linear in the query
set size. Thus, introducing quality values adds negligible time to HTC. SP dominates in
all cases. Each k-gram in the database requires one or more hash table lookups. We only
extend seeds. Hence, AP takes much less time than SP. NHTS has the largest SP and AP
times since the other methods do not store all k-grams in the hash table. The total time for
NHTS is thus significantly larger. BHTS-SEG has the second largest SP and AP times, hence
the second largest total. This is because BHTS-SEG eliminates k-grams only from identified
LCRs whereas QHTS eliminates k-grams from any place with some probability.

CPU time versus I/O time: For small datasets, CPU time dominates the overall
running time of the search tools, including BLAST. Here, we demonstrate that I/O cost
increases much faster than CPU cost for growing datasets. Thus, it will be the dominating
term in the near future. We also show that I/O cost is significantly reduced by our
probabilistic hash table and reconstruction strategy.

We used BLAST with the LCR-filter on as a representative of existing BHTS strategies.
We ran BLAST on the seven datasets with n, 2n, ..., 64n sequences, obtained by doubling the
dataset size. The largest dataset contained 6,000 sequences. We performed a self
comparison of each dataset using BLAST and measured the CPU times. Figure 4-11
plots the CPU times in seconds as a function of the dataset size in log-log scale. We then
calculated the optimal I/O costs of the reading and reconstruction strategies, for increasing
dataset sizes, using the formulas in Section 4.4.2.4. For this computation, we assumed
a fixed memory of M = 100,000 pages, with a page size of 4 kB. We computed the I/O
times for self comparisons of seven datasets of size M, 2M, ..., 64M, obtained by doubling the
dataset size. We used st = 8, rt = 2, and tt = 0.09 (in milliseconds) as I/O parameters
since they are typical numbers for current architectures. We experimentally obtained the
parameter C as 3.14 from our datasets. In order to visualize the difference between the
I/O and the CPU trends better, we scaled all the I/O times down by a constant so that
the I/O time of the reconstruction strategy is the same as the CPU time for the
smallest dataset. The results are presented in Figure 4-11. Two important observations
follow from this figure. First, I/O time grows much faster than CPU time with growing
dataset size. Furthermore, Moore's law suggests that the gap between the increase in I/O
and CPU costs will be even larger than in our results in the future. We conclude that I/O
cost needs to be optimized for large datasets. Second, the I/O cost of the reconstruction
strategy is 42% of that of the reading strategy. Thus, from this figure and Table 4 of the
chapter, we conclude that the proposed reconstruction strategy reduces the overall running
time of existing NHTS strategies by 45% to 58% depending on the dataset and memory
sizes. The improvement is more significant for larger datasets. One important note is
that even for the cases where at least one dataset can fit in main memory, QHTS is still
cost-effective. The time saving moves one level up along the computer storage hierarchy.
Instead of saving the I/O cost, the communication between CPU and main memory is
reduced dramatically. The relationship between the CPU cost and this communication
cost follows the same pattern as shown in Figure 4-11.

Hash table size comparison: We evaluate the space usage of our quality based search

methods. We measure this usage as the number of k-grams in the hash table as this is the

largest data structure to be kept in main memory. A pointer is stored for each k-gram.

Thus, memory usage is proportional to the number of k-grams in the hash table.

Table 4-6 compares the number of k-grams stored in the hash table for NHTS,
BHTS-SEG, QHTS-GBA, QHTS-SC, and QHTS-SCG on the 1,500-sequence set. NHTS
has the largest number. This is because it stores all k-grams unlike other methods.
Hash table sizes of QHTS methods are 57-77% of that of NHTS and 73-98% of that
of BHTS. This is because BHTS eliminates k-grams only from masked LCRs whereas










Figure 4-11. BLAST CPU time and QHTS I/O times for datasets of increasing size
(x-axis: dataset size relative to the smallest dataset, log-log scale).


QHTS eliminates any k-grams with some probability. This probability depends on how the
LCR-identification tools perform. This explains why there is a difference between QHTS
methods. We conclude that QHTS can answer much larger query sets in main memory
than NHTS.

The larger the hash table, the more hash table lookups, and the longer SP takes.
Table 4-6 shows that this is true. QHTS-GBA evicts 23% of the k-grams from the
hash table, but its information loss is merely 6%. This justifies our reasoning that (1)
the chance that all k-grams containing the same letter are deleted is very small, and (2)
usually, letters with low quality values are lost during reconstruction.

4.6 Conclusion

We considered the problem of finding similar sequences when the locations of the
LCRs are not known precisely. We developed a formulation to measure the quality of each
letter. The quality value of a letter is the probability for that letter to be in a non-LCR.
We showed that the quality values can be used in two well known approaches to the
sequence search problem. The former finds the optimal alignment of two sequences using
dynamic programming. This applies to the case when the sequences are small. The latter
computes a suboptimal alignment using a hash table. This applies to the case when the










database or both the database and the query set are large. We developed a method that

indexes k-grams (sequences of length k) using a hash table probabilistically. As a result,

the main memory usage, the CPU cost, and the I/O cost are greatly reduced. We also

showed that this hash table can be used to reconstruct query sequences with negligible

information loss. This eliminates the need for further disk I/Os to read these sequences.

In our experiments on real data, our quality-based similarity search algorithms reduced

the number of false positives drastically. In addition, their running times were better

than existing strategies. In the future, we would like to explore formulation of P-value for

resulting alignments when quality values are involved.










CHAPTER 5
CONCLUSION

In this research, we discussed the abundance and the importance of repeats in
biological sequences, including protein sequences and genomic sequences. We presented
the problems caused by repeats. We illustrated the necessity of developing new
computational tools to identify and deal with repeats for sequence similarity search. For
DNA sequences, we proposed to identify repeats at the genome level. We are particularly
interested in transposons and attack the problem based on their unique characteristic
of nested insertions. For protein sequences, we are particularly interested in LCRs. We
proposed a novel graph-based algorithm to identify LCRs. Concerned with the false
positives caused by these repeats and the fact that the locations of these repeats are not known
precisely, we proposed novel similarity search algorithms based on quality values. These
quality values are derived from LCRs in the sequences. In summary, we introduced our
work on the following repeat-related problems:

Identifying LCRs in a protein sequence

Quality-based similarity search for biological sequence database

Identifying repeats in genomes

The resulting work on the first two problems is published in the Bioinformatics
Journal [45] and BIOCOMP 2007 [46]. For the second problem, an extended version has been
submitted to BMC Genomics. The paper for the third problem is under review at the
Bioinformatics Journal.









REFERENCES


[1] A., G.M.: Retroelements in higher plants. Trends in Genetics 8, 103-108 (1992)

[2] Albà, M., Laskowski, R., Hancock, J.: Detecting cryptically simple protein sequences
using the SIMPLE algorithm. Bioinformatics 18, 672-678 (2002)

[3] Allison, L., Stern, L., Edgoose, T., Dix, T.: Sequence complexity for biological
sequence analysis. Computers & Chemistry 24, 43-55 (2000)

[4] Altschul, S., Gish, W., Miller, W., Myers, E.W., Lipman, D.J.: Basic Local
Alignment Search Tool. Journal of Molecular Biology 215(3), 403-410 (1990)

[5] Apweiler, R., et al.: Interpro-an integrated documentation resource for protein
families, domains and functional sites. Bioinformatics 16(12), 1145-1150 (2000)

[6] Bairoch, A., Boeckmann, B., Ferro, S., Gasteiger, E.: Swiss-Prot: juggling between
evolution and stability. Briefings in Bioinformatics 1, 39-55 (2004)

[7] Bao, Z., Eddy, S.R.: Automated de novo identification of repeat sequence families in
sequenced genomes. Genome Research 12, 1269-1276 (2002)

[8] Bedell, J., Korf, I., Gish, W.: MaskerAid: a performance enhancement to
RepeatMasker. Bioinformatics 16(11), 1040-1041 (2000)

[9] Bennetzen, J.: The contributions of retroelements to plant genome organization,
function and evolution. Trends Micro 4, 347-353 (1996)

[10] Bennetzen, J., et al.: Consistent over-estimation of gene number in complex plant
genomes. Current Opinion in Plant Biology 7(6), 732-726 (2004)

[11] Benson, G.: Tandem repeats finder: a program to analyze DNA sequences. Nucleic
Acids Research 27(2), 573-580 (1999)

[12] Bergman, C.M., Pfeiffer, B.D., Rincón-Limas, D.E., Hoskins, R.A., Gnirke, A.,
Mungall, C.J., Wang, A.M., Kronmiller, B., Pacleb, J., Park, S., Stapleton, M., Wan,
K., George, R.A., de Jong, P.J., Botas, J., Rubin, G.M., Celniker, S.E.: Assessing
the impact of comparative genomic sequence data on the functional annotation of the
Drosophila genome. Genome Biology 3 (2002)

[13] Boffelli, D., MacAuliffe, J., Ovcharenko, D., Lewis, K., Ovcharenko, I., Pachter, L.,
Rubin, E.: Phylogenetic analysis of primate sequences reveals functional regions of
the human genome. Science 299(5611), 1391-1394 (2003)

[14] Bowen, N., Jordan, I.: Transposable elements and the evolution of eukaryotic
complexity. Current issues in molecular biology 4(3), 65-76 (2002)

[15] Brenner, S., Hubbard, T., Murzin, A., Chothia, C.: Gene duplications in H. influenzae.
Nature 378(9), 140 (1995)









[16] Buard, J., Jeffreys, A.: Big, had minisatellites. Nature Genetics 15, :327-328 (1997)

[17] Campagna, D., et al.: R AP: a new computer program for de novo identification of
repeated sequences in whole genomes. Bioinformatics 21(5), 582-588 (2005)

[18] Caspi, A., Pachter, L.: Identification of transposable elements using multiple
alignments of related genomes. Genome Research 16(2), 260-270 (2006)

[19] Claverie, J.M., States, D.: Information enhancement methods for large scale sequence
analysis. Computers & C!. Ins!-1ry 17, 191-201 (199:3)

[20] Consortium, I.H.G.: Initial sequencing and analysis of the human genome. Nature
409, 860-921 (2001)

[21] Delcher, A., K~asif, S., Fleischmann, R., Peterson, J., Whited, O., Salzherg, D.:
Alignment of Whole Genomes. Nucleic Acids Research 27(11), 2:369-2376 (1999)

[22] Dijkstra, E.: A note on two problems in connexion with graphs. Numerische
Mathematik 1, 269-271 (1959)

[2:3] Djian, P.: Evolution of simple repeats in dna and their relation to human disease.
Cell 94, 155-160 (1998)

[24] Drake, J.: The distribution of rates of spontaneous mutation over viruses,
prokaryotes, and eukaryotes. Annals of the New York Al I1. Iny: of Sciences 870,
100-107 (1999)

[25] Edgar, R.C., Myers, E.W.: PILER: identification and classification of genomic
repeats. Bioinformatics 21, 152-158 (2005)

[26] Flavell, A.J., Dunbar, E., Anderson, R., Pearce, S.R., Hartley, R., Kumar, A.:
Ty1-copia group retrotransposons are ubiquitous and heterogeneous in higher plants.
Nucleic Acids Research 20, 3639-3644 (1992)

[27] Gilbert, A.C., Kotidis, Y., Muthukrishnan, S., Strauss, M.: Surfing wavelets on
streams: One-pass summaries for approximate aggregate queries. International
Conference on Very Large Databases (VLDB) pp. 79-88 (2001)

[28] Gotoh, O.: An improved algorithm for matching biological sequences. Journal of
Molecular Biology 162(3), 705-708 (1982)

[29] Halkidi, M.V.M., Batistakis, Y., Vazirgiannis, M.: On clustering validation
techniques. Journal of Intelligent Information Systems 17, 107-145 (2001)

[30] Hancock, J., Armstrong, J.: SIMPLE34: an improved and enhanced implementation
for VAX and SUN computers of the SIMPLE algorithm for analysis of clustered
repetitive motifs in nucleotide sequences. Computational and Applied Biosciences
(CABIOS) 10, 67-70 (1994)









[31] Hancock, J., Simon, M.: Simple sequence repeats in proteins and their potential role
in network evolution. Gene 345(1), 113-118 (2005)

[32] Henikoff, S., Henikoff, J.: Amino acid substitution matrices from protein blocks.
Proceedings of the National Academy of Sciences of the United States of America
(PNAS) 89(22), 10,915-10,919 (1992)

[33] Huang, X., Madan, A.: CAP3: A DNA Sequence Assembly Program. Genome
Research 9(9), 868-877 (1999)

[34] Drake, J., et al.: Rates of spontaneous mutation. Genetics 148(4), 1667-86 (1998)

[35] Juretic, N., Bureau, T.E., Bruskiewich, R.M.: Transposable element annotation of the
rice genome. Bioinformatics 20, 155-160 (2004)

[36] Jurka, J., et al.: CENSOR - a program for identification and elimination of repetitive
elements from DNA sequences. Computers and Chemistry 20(1), 119-122 (1996)

[37] Kazemi-Esfarjani, P., Trifiro, M.A., Pinsky, L.: Evidence for a repressive function of
the long polyglutamine tract in the human androgen receptor: possible pathogenetic
relevance for the (CAG)n-expanded neuronopathies. Human Molecular Genetics 4,
523-527 (1995)

[38] Kriventseva, E.V., Biswas, M., Apweiler, R.: Clustering and analysis of protein
families. Current Opinion in Structural Biology 11(3), 334-339 (2001)

[39] Kurtz, S., et al.: Computation and visualization of degenerate repeats in complete
genomes. In: Intelligent Systems for Molecular Biology (ISMB), pp. 228-238. AAAI
Press (2000)

[40] Kurtz, S., Choudhuri, J.V., Ohlebusch, E., Schleiermacher, C., Stoye, J., Giegerich,
R.: REPuter: the manifold applications of repeat analysis on a genomic scale. Nucleic
Acids Research 29(22), 4633-4642 (2001). DOI 10.1093/nar/29.22.4633

[41] Kurtz, S., Schleiermacher, C.: REPuter: fast computation of maximal repeats in
complete genomes. Bioinformatics 15(5), 426-427 (1999)

[42] Lanz, R., Wieland, S., Hug, M., Rusconi, S.: A transcriptional repressor obtained by
alternative translation of a trinucleotide repeat. Nucleic Acids Research 23, 138-145
(1995)

[43] Le, Q.H., Wright, S., Yu, Z., Bureau, T.: Transposon diversity in Arabidopsis thaliana.
Proceedings of the National Academy of Sciences of the United States of America
(PNAS) 97, 7376-7381 (2000)

[44] Leadbetter, M.R., Lindgren, G., Rootzen, H.: Extremes and Related Properties of
Random Sequences and Processes, chap. 1. Springer (1983)









[45] Li, X., Kahveci, T.: A novel algorithm for identifying low-complexity regions in a
protein sequence. Bioinformatics 22(24), 2980-2987 (2006)

[46] Li, X., Kahveci, T.: Quality-based similarity search for biological sequence databases.
In: The International Conference on Bioinformatics and Computational Biology
(2007)

[47] Lipman, D.J., Pearson, W.R.: Rapid and Sensitive Protein Similarity Searches.
Science 227(4693), 1435-1441 (1985)

[48] Ma, B., Tromp, J., Li, M.: PatternHunter: Faster and More Sensitive Homology
Search. Bioinformatics 18(0), 1-6 (2002)

[49] Ma, J., Devos, K.M., Bennetzen, J.L.: Analyses of LTR-retrotransposon structures
reveal recent and rapid genomic DNA loss in rice. Genome Research 14, 860-869
(2004)

[50] Mao, L., Wood, T., Yu, Y., Budiman, M.A., Tomkins, J., Woo, S., Sasinowski,
M., Presting, G., Frisch, D., Goff, S., Dean, R., Wing, R.: Rice transposable elements:
a survey of 73 000 sequence-tagged-connectors. Genome Research 10, 982-990 (2000)

[51] McCarthy, E.M., Liu, J., Lizhi, G., McDonald, J.F.: Long terminal repeat
retrotransposons of Oryza sativa. Genome Biology 3 (2002)

[52] McCarthy, E.M., McDonald, J.F.: LTR_STRUC: a novel search and identification
program for LTR retrotransposons. Bioinformatics 19(3), 363-367 (2003)

[53] Morgulis, A., et al.: WindowMasker: window-based masker for sequenced genomes.
Bioinformatics 22, 134-141 (2006)

[54] Nandi, T., Dash, D., Ghai, R., B-Rao, C., Kannan, K., Brahmachari, S.K.,
Ramakrishnan, C., Ramachandran, S.: A novel complexity measure for comparative
analysis of protein sequences from complete genomes. Journal of Biomolecular
Structure and Dynamics 20(5), 657-68 (2003)

[55] Needleman, S.B., Wunsch, C.D.: A General Method Applicable to the Search for
Similarities in the Amino Acid Sequence of Two Proteins. Journal of Molecular
Biology 48, 443-53 (1970)

[56] Pevzner, P.A., Tang, H., Tesler, G.: De novo repeat classification and fragment
assembly. Genome Research 14, 1786-1796 (2004)

[57] Pevzner, P.A., Tang, H., Waterman, M.S.: An Eulerian path approach to DNA
fragment assembly. Proceedings of the National Academy of Sciences of the United
States of America (PNAS) 98(17), 9748-9753 (2001)

[58] Pinto, M., Lobe, C.G.: Products of the grg (groucho-related gene) family can
dimerize through the amino-terminal q domain. Journal of Biological Chemistry
271, 33,026-33,031 (1996)










[59] Price, A.L., Jones, N.C., Pevzner, P.A.: De novo identification of repeat families in
large genomes. Bioinformatics 21, 351-358 (2005)

[60] Promponas, V., Enright, A., Tsoka, S., Kreil, D., Leroy, C., Hamodrakas, S., Sander,
C., Ouzounis, C.: CAST: an iterative algorithm for the complexity analysis of
sequence tracts. Bioinformatics 16(10), 915-922 (2000)

[61] Salamon, P., Konopka, A.: A maximum entropy principle for distribution of local
complexity in naturally occurring nucleotide sequences. Computers & Chemistry 16,
117-124 (1992)

[62] SanMiguel, P., Gaut, B.S., Tikhonov, A., Nakajima, Y., Bennetzen, J.L.: The
paleontology of intergene retrotransposons of maize. Nature Genetics 20, 43-45
(1998)

[63] SanMiguel, P., Tikhonov, A., Jin, Y.K., Motchoulskaia, N., Zakharov, D.,
Melake-Berhan, A., Springer, P.S., Edwards, K.J., Lee, M., Avramova, Z., Bennetzen,
J.L.: Nested retrotransposons in the intergenic regions of the maize genome. Science
274(5288), 765-768 (1996)

[64] Schwechheimer, C., Smith, C., Bevan, M.: The activities of acidic and glutamine-rich
transcriptional activation domains in plant cells: design of modular transcription
factors for high-level expression. Plant Molecular Biology 36, 195-204 (1998)

[65] Shannon, C.: Fast Incremental Maintenance of Approximate Histograms. In: Bell
Syst. Tech. J., pp. 50-60 (1951)

[66] Shin, S.W., Kim, S.M.: A new algorithm for detecting low-complexity regions in
protein sequences. Bioinformatics 21(2), 160-170 (2005)

[67] Smit, A.F.: Interspersed repeats and other mementos of transposable elements in
mammalian genomes. Current Opinion in Genetics and Development 9(6), 657-663
(1999)

[68] Smith, T., Waterman, M.: Identification of common molecular subsequences. Journal
of Molecular Biology 147, 195-197 (1981)

[69] States, D., Agarwal, P.: Compact Encoding Strategies for DNA Sequence Similarity
Search. In: Intelligent Systems for Molecular Biology (ISMB) (1996)

[70] Tatusova, T., Madden, T.: BLAST 2 Sequences, A New Tool for Comparing Protein
and Nucleotide Sequences. FEMS Microbiology Letters 177, 247-250 (1999)

[71] Tautz, D., Trick, M., Dover, G.: Cryptic simplicity in DNA is a major source of genetic
variation. Nature 322, 652-656 (1986)

[72] Turcotte, K., Srinivasan, S., Bureau, T.: Survey of transposable elements from rice
genomic sequences. The Plant Journal 25, 169-179 (2001)









[73] Volfovsky, N., Haas, B., Salzberg, S.: A clustering method for repeat analysis in DNA
sequences. Genome Biology 2(8) (2001)

[74] Voytas, D.F., Cummings, M.P., Konieczny, A., Ausubel, F.M., Rodermel, S.R.:
Copia-like retrotransposons are ubiquitous among plants. Proceedings of the National
Academy of Sciences of the United States of America (PNAS) 89, 7124-7128 (1992)

[75] Wan, H., Li, L., Federhen, S., Wootton, J.: Discovering simple regions in biological
sequences associated with scoring schemes. Journal of Computational Biology 10,
171-185 (2003)

[76] Wicker, T., Stein, N., Albar, L., Feuillet, C., Schlagenhauf, E., Keller, B.: Analysis
of a contiguous 211 kb sequence in diploid wheat (Triticum monococcum L.) reveals
multiple mechanisms of genome evolution. The Plant Journal 26, 307-316 (2001)

[77] Wise, M.J.: 0j.py: a software tool for low complexity proteins and protein domains.
Bioinformatics 17, S288-S295 (2001)

[78] Wootton, J.: Sequences with 'unusual' amino acid compositions. Current Opinion in
Structural Biology 4, 413-421 (1994)

[79] Wootton, J., Federhen, S.: Statistics of local complexity in amino acid sequences and
sequence databases. Computers & Chemistry 17, 149-163 (1993)

[80] Wootton, J., Federhen, S.: Analysis of compositionally biased regions in sequence
databases. Methods in Enzymology 266, 554-571 (1996)









BIOGRAPHICAL SKETCH

Xuehui Li was born in China. She obtained her BS and MS in computer science

before joining the Computer and Information Science and Engineering Department at

the University of Florida. She is interested in the applications of computer science to

molecular biology, especially algorithm development. Her current research focuses on the

identification of repetitive biological sequences and the applications of such sequences in

molecular biology. She received her PhD in 2007.






ACKNOWLEDGMENTS

This dissertation would not have been possible without the guidance of my committee chair, Dr. Tamer Kahveci. I am thankful for his research guidance and partial financial support. I am very grateful to Dr. A. Mark Settles; I was extremely lucky to receive a tremendous amount of generous and selfless guidance from him. I am very grateful to everyone else on my committee for serving on it and for their invaluable professional suggestions: Dr. Alin Dobra, Dr. Alper Ungor, Dr. Henry Baker, and Dr. Joachim Hammer. I am very grateful to my family and my friends in the States who have supported me in one way or another. My Ph.D. life would have been harder without them. Finally, I would like to thank the University of Florida Alumni Association; I cannot imagine my Ph.D. life without the association's sponsorship of my graduate fellowship.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION

2 A NOVEL GENOME-SCALE REPEAT FINDER GEARED TOWARDS TRANSPOSONS
  2.1 Motivation and Problem Definition
  2.2 Related Work
  2.3 Description of the Algorithm
    2.3.1 Graph Construction
    2.3.2 Graph Traversal
    2.3.3 Further Improvement to Greedier
  2.4 Experimental Evaluation
    2.4.1 Evaluation of Accuracy
    2.4.2 Fragmented Repeats and Potential Nested Repeats
    2.4.3 Significance Analyses
    2.4.4 Evaluation of Possible Optimization Strategies
  2.5 Discussion

3 A NOVEL ALGORITHM FOR IDENTIFYING LOW-COMPLEXITY REGIONS IN A PROTEIN SEQUENCE
  3.1 Motivation and Problem Definition
  3.2 Related Work
  3.3 New Complexity Measures
  3.4 The Graph-based Algorithm (GBA)
    3.4.1 Constructing a Graph
    3.4.2 Finding the Longest Path
    3.4.3 Extending Longest-path Intervals
    3.4.4 Post-processing Extended Intervals
  3.5 Experimental Evaluation
    3.5.1 Quality Comparison Results
    3.5.2 Performance Comparison Results
  3.6 Conclusion

  4.1 Motivation and Problem Definition
  4.2 Background
  4.3 Quality Value Assignment (QVA)
    4.3.1 QVA Based on One LCR-identification Tool
    4.3.2 QVA Based on Multiple LCR-identification Tools
  4.4 Quality-based Similarity Search
    4.4.1 Using Quality Values in DPS
    4.4.2 Using Quality Values in HTS
      4.4.2.1 The algorithm
      4.4.2.2 I/O and CPU computations
      4.4.2.3 Cost analysis of QHTS
      4.4.2.4 Memory allocation
  4.5 Experimental Evaluation
    4.5.1 Evaluation of Accuracy
    4.5.2 Performance Comparison
  4.6 Conclusion

5 CONCLUSION

REFERENCES

BIOGRAPHICAL SKETCH

LIST OF TABLES

2-1 Comparison of Greedier, RepeatMasker, cross_match, and WindowMasker in terms of bases masked in different regions of the Arabidopsis genome, consisting of the five chromosomes: chromosomes, transposons (TP), and other exons (FP). The TP/FP rates of Greedier, RepeatMasker, cross_match, and WM are 2.85, 2.42, 0.37, and 0.10 respectively. Letter K represents 1000.

2-2 Comparison of Greedier, RepeatMasker, cross_match, and WindowMasker in terms of bases masked in different regions of the 10th rice chromosome: chromosome, G.M.s, transposons (TP), other G.M.s, and other exons (FP). The TP/FP rates of Greedier, RepeatMasker, cross_match, and WindowMasker are 15.54, 12.1, 1.28, and 0.77 respectively.

4-1 Average total number of database sequences returned and precision (in percentage) for NDPS, BDPS-SEG, QDPS-GBA, QDPS-SC, and QDPS-SCG on the Swissprot dataset.

4-2 Average total number of database sequences returned and precision (in percentage) for NDPS, BDPS-SEG, QDPS-GBA, QDPS-SC, and QDPS-SCG on the PDB dataset.

4-3 Average total number of database sequences returned and precision (in percentage) for NHTS, BHTS-SEG, QHTS-GBA-Read, QHTS-SC-Read, and QHTS-SCG-Read on the Swissprot dataset.

4-4 Average recall and precision (in percentage) of QHTS-GBA-Reconstruct, QHTS-GBA-Read, QHTS-SC-Reconstruct, QHTS-SC-Read, QHTS-SCG-Reconstruct, and QHTS-SCG-Read on the Swissprot dataset.

4-5 Relative information loss regarding the length of k-grams (in percentage).

4-6 CPU times spent in HTC, SP, and AP, and the number of k-grams stored in the hash tables for (I) NHTS, (II) BHTS-SEG, (III) QHTS-GBA-Reconstruct, (IV) QHTS-SC-Reconstruct, and (V) QHTS-SCG-Reconstruct.

LIST OF FIGURES

2-1 A sequence before and after a transposon insertion. The narrow bars represent the same repeat in both versions of the sequence.

2-2 The pseudocode of our algorithm, Greedier.

2-3 An example of two alignments, both with Plus/Plus orientation, that do not overlap. Two bars connected by a non-horizontal line represent fragments participating in an alignment. Two numbers in a pair of parentheses are the starting and ending coordinates of the corresponding fragment. Two numbers in a pair of square brackets are the number of identical letters in an alignment and the length of the alignment.

2-4 The pseudocode of traversing a graph.

2-5 Two extreme cases in terms of gap lengths. Each pair of bars connected by a non-horizontal line represents fragments participating in an alignment.

2-6 An example of nested transposons in the fourth Arabidopsis chromosome identified by Greedier at different iterations. Bars represent transposon fragments. Numbers outside and inside the bars are the corresponding chromosome coordinates and iteration numbers respectively.

2-7 The cumulative percentages of masked transposons and exons that do not belong to transposons (i.e., other exons) in the Arabidopsis genome by each iteration of Greedier (the two curves) and the percentages of the same kinds of masked regions by Cross_match (the two thick horizontal lines) and WindowMasker (the two thin horizontal lines).

3-1 Four steps of GBA on a sequence with 3 approximate repetitions of AYTV. Underlined letters indicate repeats. Rectangles denote regions identified as candidate LCRs by GBA at different steps.

3-2 Contribution of the repeat region TPSTT to R. ... denotes the forget rate.

3-3 Comparison between Shannon Entropy and our 2-gram complexity measure. The x-axis represents ratios from repeats. The y-axis represents ratios from non-repeats.

3-4 Average recalls of GBA, SE-GBA, 0j.py, CARD, and SEG on four datasets.

3-5 Average precisions of GBA, SE-GBA, 0j.py, CARD, and SEG on four datasets.

3-6 Relationship between precision and recall of GBA, 0j.py, and CARD.

3-7 Average Jaccard coefficients of GBA, SE-GBA, 0j.py, CARD, and SEG on four datasets.

4-2 A sequence with an underlined LCR and a masked version of it. Masked letters are replaced by x.

4-3 A sequence with an underlined LCR and its masked versions by two different LCR-identification algorithms. Masked letters by each algorithm are replaced by x.

4-4 An example for probabilistic hash table and reconstruction. The solid lines show the three 3-grams of the sequence GARAQAQAQKL stored in the hash table. The resulting sequence after reconstruction is at the bottom. Letter X denotes an unknown letter.

4-5 Memory adaption scheme. qs, db, and HT represent the query set, the database, and the hash table respectively. For both the reconstruction case and the reading case, M1 and M2 are the amounts of memory allocated to the hash table and the database sequences respectively. M1/C is the amount of the query set whose k-grams can be held in a hash table of size M1 probabilistically. For each M1/C pages of the query set, a hash table is built and the whole database is read into memory once to find seeds. M3 only applies to the reading case. It represents the amount of memory allocated to the query set for seed extension. In the worst case, for each M2 pages of the database, the entire query set is read into memory once in chunks of M3.

4-6 The minimum I/O cost varies, depending on whether to use the reconstruction or the read strategy, whether to use quality values, and the relationship between the available amount of memory M and the size of the database DB.

4-7 Result set size versus recall (in percentage) for NDPS, BDPS and QDPS on the Swissprot dataset.

4-8 Result set size versus recall (in percentage) for NDPS, BDPS and QDPS on the PDB dataset.

4-9 Result set size versus recall (in percentage) for NHTS, BHTS and QHTS on the Swissprot dataset.

4-10 Result set size versus recall (in percentage) for QHTS and BLAST with the LCR-filter off (BLAST-w/o Filter) and on (BLAST-w/ Filter) for querying the protein sequence, APOA4_MOUSE, against our Swissprot database.

4-11 BLAST CPU time and QHTS I/O times for datasets of increasing size.


... [49, 63]. This characteristic causes repeats to be nested within one another. In other words, many repeats within a genome sequence are split by transposon insertions, resulting in individual repeat units being fragmented.

Some repeats in protein sequences are popularly referred to as low complexity regions (LCRs). Statistical analyses of protein sequences have shown that more than one-half of the proteins have at least one LCR. There is no universal complexity function that would work for all LCRs. LCRs are important. They attract purifying selection, become deleterious, and therefore lead to human diseases when the number of copies of a repeat inside them exceeds a certain number. Despite their abundance and importance, their compositional and structural properties are poorly understood. So are their functions and evolution.

Repeat identification is normally the first step in studying repeats. It is a critical part of sequence analysis. First, repeats are believed to play important roles in the course of genome evolution and may have effects on protein functions. Second, repeats cause problems for many biological applications, such as genome sequence assembly and sequence similarity search. Genomic repeats and LCRs cause many false positives in sequence similarity search. For example, BLAST returns over 1,000 statistically significant ...

... [75]. These false positives confuse genome annotation and analyses. Therefore, it is critical to be able to identify these repeats accurately.

Although some computational tools have been developed to identify genomic repeats or LCRs, they are all geared towards specific situations and suffer from different problems. Existing sequence similarity search algorithms either ignore the existence of these LCRs or completely remove them. Ignoring LCRs results in false positives. Removing LCRs is not desirable, since no LCR-identification tool is 100% accurate.

This dissertation addresses the following three problems:

1. Identification of repeats in genomes.
2. Identification of LCRs in a protein sequence.
3. Searching biological sequence databases using LCRs.

For each of these problems, we develop new methods and compare them with existing methods experimentally.

The rest of the dissertation is organized as follows. Chapter 2 presents our method to identify repeats in genomes (problem 1). Chapter 3 introduces our novel method to identify LCRs in a protein sequence (problem 2). Chapter 4 presents our work on using LCRs in similarity search for biological sequence databases (problem 3). Chapter 5 concludes the dissertation.

... [63] and more than 50% of the human genome [20] consist of repetitive sequences. Repeats vary in size from less than a hundred bases to tens of kilobases. They are found either as tandem arrays or dispersed throughout the genome. Repeats can generate insertions, deletions, and unequal crossing-over within genomes. Hence, repeats play important roles in genome evolution [14] by causing mutations and rearrangements [49] that lead to altered gene functions [16]. Repeats also present difficulties in genome annotation and analyses. Local alignments between repeats produce false positives in comparisons of DNA sequences. These false positives can cause misassembly of genome sequences or misidentification of repeats as gene sequences [10]. For these reasons, it is critical to identify repetitive sequences accurately.

Transposons are mobile genetic elements and comprise the most common class of repeats. Transposable elements often make up a substantial fraction of the host genomes, and the majority of many plant genomes [1, 26, 62, 74]. For example, 45% of the human genome [67], more than 90% of the wheat genome [51], and 12.4% of the rice genome consist of transposable elements [50, 72] (... [43] and individual types of transposons can have hundreds or thousands of copies in complex genomes.

Transposons accumulate multiple copies within a genome based on a variety of replicative mechanisms. DNA elements utilize either the DNA replication or DNA repair machinery within the cell to increase in copy number, while retrotransposons utilize RNA

polymerase and reverse transcriptase to increase in copy number. In either case, the elements need to insert themselves into the genome, which requires either a transposase or integrase enzyme activity (... [49, 63, 76]. This characteristic causes repeats to be nested within one another. In other words, many repeats within a genome sequence are split by transposon insertions, resulting in individual repeat units being fragmented.

Figure 2-1. A sequence before and after a transposon insertion. The narrow bars represent the same repeat in both versions of the sequence. (a) The sequence before the insertion. (b) The new sequence after the insertion. The thick bar represents the inserted transposon.

Figure 2-1 shows a schematic of a nested transposon insertion. The inserted transposon splits an existing repeat into two fragments. There is strong evidence for the existence of nested and split repeats in the maize, rice, and wheat genomes [49, 63, 76]. Fragmented repeats are hard to identify using existing tools, such as Cross_match (...

... Cross_match, which is implemented in most repeat finders that use an annotated repeat library. In addition to masking repeats, Greedier also reports potential nested transposon structures.

The rest of the chapter is organized as follows. Section 2.2 discusses the related work. Section 2.3 describes our algorithm, Greedier. Section 2.4 presents the experimental results. Section 2.5 concludes the chapter.

... [45] and SEG [80]. Here we focus on DNA sequences.

Current repeat finding algorithms follow two strategies: de novo repeat finders and repeat finders that use existing repeat libraries. The former identify repeats without ...

... [11], LTR_STRUC [52], and PILER [25], identify specific structures within the DNA sequence to identify repeats. The main utility of these de novo repeat finders is to identify novel, recently evolved repeats. TRF identifies tandem repeats that have a variable unit length and/or are disrupted by insertions and deletions. It models tandem repeats by percent identity and frequency of indels between adjacent pattern copies. It uses statistically based recognition criteria. For example, it models the alignment of two tandem copies of a pattern by a sequence of independent Bernoulli trials. LTR_STRUC searches for long terminal repeat transposons. It seeks certain generic structural features of such elements. It first seeks the LTR pairs present at the ends of a putative element. It then searches for additional characteristic retrotransposon features to confirm the hit. For example, it checks the alignment of regions flanking the pairs of matches. However, LTR_STRUC is limited to finding sequences with well-conserved structural features of a retrotransposon. PILER exploits only characteristic patterns of local alignments in the sequence. Each alignment forms a pattern which is typical of a class of repeat. It identifies likely functional transposable elements, tandem arrays, dispersed families, pseudosatellites, and terminal repeats that show high levels of sequence identity.

Other de novo repeat finders identify fragmented or more divergent repeats based on pairwise or multiple similarities within a genome. REPuter [40, 41] first uses suffix trees to locate exact repeats. It then extends them to degenerate repeats by attacking two sub-problems. The first one is called the mismatches repeat problem. It uses the Hamming distance model to find mismatches between two maximal repeats. The second one is called the differences repeat problem. It uses the edit distance model to allow for insertions and deletions in two repeats. REPuter assesses the significance of these two ...

... [39]. However, it is unable to find long and dispersed repeats.

To overcome the computationally intensive nature of generating pairwise similarity scores, WindowMasker [53] and RAP [17] use word counting to identify short sequences that are over-represented within an input genome sequence. Word length is a critical parameter for word counting methods. Short words are found too frequently and are not indicative of a significant repeat. They result in low specificity and high sensitivity. On the other hand, long words are quite significant, but they are unsuitable for detecting degenerated repeats. They result in high specificity and low sensitivity. Word counting methods cannot improve one aspect without deteriorating the other. It is hard to find the best compromise between short words and long words. WindowMasker is a two-pass algorithm. In the first pass, it calculates the size N of the Nmer to consider, Nmer frequency counts for the genome, threshold scores for the algorithm, and a score function that is based on the frequencies and threshold scores. In the second pass, it calculates masked regions for the genome using the previously generated score function and thresholds. RAP indexes symmetrical gapped words. It calculates an index for each word starting at each position of the input genome sequence.

Both word counting and pairwise similarity algorithms have high false positive rates, i.e., they have the potential of masking desired gene families rather than biological repeats. Most of the similarity-based algorithms have not been benchmarked against genome annotations to determine if a significant fraction of the masked sequences are false positives.

Some de novo repeat finders identify fragmented or more divergent repeats based on multiple alignments between related genomes. [18] is a comparative approach that identifies transposable elements. Standard comparative genomic principles dictate that conserved regions in alignments highlight functional elements [12, 13]. Lack of conservation is equally useful: inserted sequences that have little or no alignment to ...

... [18] searches for disrupted conservation patterns in whole genome alignments. However, poorly characterized genome families are not suitable for such comparative approaches. These approaches rely on well-assembled genomes and the selection of appropriate evolutionary models.

In contrast to de novo repeat finders, repeat finders that use annotated repeat libraries usually produce fewer false positives. Like de novo algorithms, these repeat finders use pairwise similarity. However, the similarity test is against a known collection of repeats. Gene families that a biologist would consider to be non-repeats can be retained by excluding these sequences from the repeat library. Such repeat finders include Cross_match, RepeatMasker (... [8], and CENSOR [36]. Cross_match is an efficient implementation of the Gotoh algorithm [28] for comparing two sets of sequences. It uses banded searches to find matches with a score no less than a cutoff. RepeatMasker uses Cross_match as the default search engine for sequence comparisons. MaskerAid is an enhancement to RepeatMasker that increases the speed of masking. It uses WU-BLAST (... [68] for sequence comparisons. It then evaluates the local alignments and masks homologous sequences (called 'censoring').

As the number of sequenced genomes increases, there is increased interest in defining repeat sequences and generating repeat databases for multiple species. There are several repeat database resources available online. One of the largest collections of repeat databases is maintained by TIGR (...

... 2.1, they cannot identify fragmented and divergent repeats. Eukaryotic genomes contain large amounts of transposon relics: ancient, highly degenerated transposable elements. The use of pairwise sequence comparisons to the repeat library is likely to fail to detect these 'distant homologs' of known transposable element families [35]. Also, as shown in Section 2.4.1, this kind of repeat finder does not have satisfactory accuracy.

There are tools that deal with the identification of repeat families in genomes [7, 56, 59, 73]. These tools take a set of repeats in the genome as input. Their purpose is either to classify them into families or to extend the exact repeats to find families. This is different from the problem we consider in this chapter.

... Figure 2-2 shows the pseudocode for Greedier. Each iteration consists of two passes. In the first pass (Steps 2-4), it identifies the local similarities between the chromosome and the repeat sequences (Step 2). This can be done using any off-the-shelf sequence comparison algorithm. We used BLAST [70] for this purpose. It then processes these local similarities and builds graphs (Steps 3-4). A vertex in a graph denotes one similarity. An edge in a graph denotes two similar subsequences that can be attached to form a longer match. A path in a graph represents a longer match consisting of multiple edges, i.e., two or more sequence similarities that can be connected to form a longer match.
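As a rough illustration of the first pass, the sketch below turns sequence-comparison hits into one vertex list per repeat. It assumes the hits are available in BLAST's 12-column tabular format, with the chromosome as the query and the repeat library as the subject. That input format, the `Hit` field names, and the `vertices_by_repeat` helper are assumptions made for this example and are not taken from Greedier itself. The number of identical letters is only approximated from the reported percent identity.

```python
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    """One local similarity, i.e. one vertex (field names are illustrative)."""
    repeat_start: int
    repeat_end: int
    chrom_start: int
    chrom_end: int
    identical: int    # approximate number of identical letters
    length: int       # alignment length
    orientation: str  # "Plus/Plus" or "Plus/Minus"


def vertices_by_repeat(tabular_lines):
    """Group tabular hits by repeat identifier, one vertex list per repeat."""
    graphs = defaultdict(list)
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        f = line.rstrip("\n").split("\t")
        repeat_id = f[1]                       # subject = repeat
        pident, length = float(f[2]), int(f[3])
        q_start, q_end = int(f[6]), int(f[7])  # chromosome coordinates
        s_start, s_end = int(f[8]), int(f[9])  # repeat coordinates
        # Assumed convention: subject start > subject end marks the minus strand.
        orientation = "Plus/Plus" if s_start <= s_end else "Plus/Minus"
        graphs[repeat_id].append(Hit(
            min(s_start, s_end), max(s_start, s_end),
            q_start, q_end,
            round(pident * length / 100.0), length, orientation))
    return dict(graphs)
```

Each value of the returned dictionary holds the vertices of one graph; edges between them are added in a later step.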

  Step 2.       Run a sequence comparison algorithm with c as the query sequence and RDB as the database  /* first pass starts */
  Steps 3-4.    Build graphs from the comparison algorithm output
  Step 6.       Traverse the graphs to report repeats with fitness above the cutoff
  Steps 7-8.    Modify the graphs based on the reported repeats
  Steps 10-11.  Modify c based on the reported repeats
  Step 12.      Reduce the fitness cutoff
  Step 14.      Report potential nested transposon structures

Figure 2-2. The pseudocode of our algorithm, Greedier.

Each path is associated with a fitness value between zero and one. This value reflects the percent identity and the length of the match between the target subsequences on the path and the repeat unit. In the second pass (Steps 5-12), Greedier traverses the graphs greedily and forms a pool of repeat candidates. Each candidate in the pool corresponds to a path with a fitness value no less than a cutoff. Greedier reports the subsequences whose corresponding path has the biggest fitness value as repeats (Step 6). It then modifies all the graphs accordingly (Steps 7-8). It repeats this process until it cannot find more repeats (Step 9). Next, it removes these reported repeats from the chromosome and stitches the rest of the chromosome together (Steps 10-11). This allows repeats to be split and nested within one another. It then relaxes the fitness constraint for the next iteration (Step 12). Note that the cutoff decreases to allow divergent repeats to be identified after the most recent insertions have been removed from the genome. Greedier iterates until BLAST returns no hits. Finally, Greedier parses the masked regions from different iterations and reports the potential nested transposon structures (Step 14). These ...


Section 2.3.1 describes how to construct graphs from BLAST output (the first pass). Section 2.3.2 illustrates how to traverse a graph (the second pass). Section 2.3.3 presents one improvement to Greedier.

In Figure 2-3, the two vertices corresponding to the two alignments are (400, 889, 1200, 1700, 450, 500, Plus/Plus) and (922, 1545, 1750, 2370, 550, 650, Plus/Plus). Let $l_r$ denote the length of the remaining repeat subsequence (i.e., excluding the subsequence corresponding to the alignment). Let $x$ denote the average identity between two random sequences. All letters in the nucleotide alphabet appear at each position of a random sequence with the same probability of 0.25. Hence, $x = 0.25$. For each vertex, we calculate a fitness value as $a + x\,l_r$


Figure 2-3. An example of two alignments, both with Plus/Plus orientation, that do not overlap. Two bars connected by a non-horizontal line represent fragments participating in an alignment. The two numbers in a pair of parentheses are the starting and ending coordinates of the corresponding fragment. The two numbers in a pair of square brackets are the number of identical letters in an alignment and the length of the alignment.

shows the expected identity rate when the entire repeat sequence is aligned with the target sequence so that the local alignment for that vertex is preserved. We say that two vertices from the same repeat, $(s_{r1}, e_{r1}, s_{c1}, e_{c1}, a_1, b_1, o_1)$ and $(s_{r2}, e_{r2}, s_{c2}, e_{c2}, a_2, b_2, o_2)$, do not conflict if any of the following six sets of conditions is satisfied:

1. 2. 3. 4. 5. 6.

Figure 2-3 shows an example of two alignments from the first set of conditions. For such


2.3.3. Alignments in the other four sets overlap. The last condition in each of the four sets puts an upper bound on the length of the overlap. $\epsilon$ is the fitness value cutoff introduced at the beginning of Section 2.3. We construct an edge between two vertices if they do not conflict. Each edge is a directed edge that goes from a vertex with a smaller starting coordinate to another vertex with a bigger starting coordinate. It connects a pair of alignments of the same orientation. Each such pair of alignments forms a longer alignment between the repeat and the chromosome than either of the two alignments alone. In Figure 2-3, for example, the repeat subsequence with coordinates 400 and 1545 and the target subsequence with coordinates 1200 and 2370 form an alignment.

Let $r$ denote the number of repeats in the repeat library. Assume each repeat has $t$ alignments with the chromosome on average. Thus, the average number of vertices in each graph is $t$. Vertex construction for each graph takes $O(t)$ time. The number of edges in each graph is $O(t^2)$ in the worst case. This means that graph construction for each repeat takes $O(t + t^2)$ time. The number of graphs is the same as the number of repeats in the library. Hence, graph construction for all repeats at each iteration takes $O(rt^2)$ time in the worst case.
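The graph construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the six sets of conflict conditions are not reproduced here, so the hypothetical helper `can_join` covers just the non-overlapping Plus/Plus case of Figure 2-3, and vertices are ordered by their starting coordinate on the repeat. The names `Vertex`, `can_join`, `build_edges`, and `gaps` are illustrative and do not come from Greedier itself.

```python
from collections import namedtuple
from itertools import combinations

# (sr, er): repeat coordinates; (sc, ec): chromosome coordinates;
# a: identical letters; b: alignment length; o: orientation.
Vertex = namedtuple("Vertex", "sr er sc ec a b o")


def can_join(u, v):
    """Assumed non-overlapping case only: u ends before v starts on both
    the repeat and the chromosome, and both alignments are Plus/Plus."""
    return u.o == v.o == "Plus/Plus" and u.er < v.sr and u.ec < v.sc


def build_edges(vertices):
    """Return directed edges (u, v) with u starting before v on the repeat.

    All O(t^2) vertex pairs are examined, matching the worst-case bound.
    """
    ordered = sorted(vertices, key=lambda x: x.sr)
    return [(u, v) for u, v in combinations(ordered, 2) if can_join(u, v)]


def gaps(u, v):
    """Gap lengths on the repeat (g_r) and on the chromosome (g_c)."""
    return v.sr - u.er, v.sc - u.ec


# The two alignments of Figure 2-3.
v1 = Vertex(400, 889, 1200, 1700, 450, 500, "Plus/Plus")
v2 = Vertex(922, 1545, 1750, 2370, 550, 650, "Plus/Plus")
edges = build_edges([v2, v1])
joined_repeat_span = (edges[0][0].sr, edges[0][1].er)
joined_target_span = (edges[0][0].sc, edges[0][1].ec)
```

For the two example vertices, the single edge joins repeat coordinates 400 to 1545 with target coordinates 1200 to 2370, as stated in the text.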


    Find the vertex, v_max, with the biggest fitness value in G
    Find the set of the outgoing edges, N_o, of v_max
    Choose the vertex v in N_o to join p, such that p has the biggest fitness value
    Modify N_o to be the set of the outgoing edges of v
    Find the longest subpath, p_r, of p with fitness value no less than the cutoff
    /* extending p_r to the left is similar to extension to the right */

Figure 2-4. The pseudocode of traversing a graph.

respectively. $y_i$ denotes the identity between the two gaps. $l_r$ denotes the length of the remaining letters (excluding all $k$ alignments) in the repeat. In Figure 2-3, $a_1$, $a_2$, $b_1$ and $b_2$ are 450, 550, 500, and 650 respectively. $g_{r1}$ and $g_{c1}$ are 33 (922-889) and 50 (1750-1700) respectively.

There are two ways to calculate $y_i$. The first way is to assume $y_i$ to be the average identity of two random sequences. This is computationally cheap, but can be inaccurate. The second way is to calculate the actual identity between the two gaps using a dynamic programming alignment method, such as the Needleman-Wunsch algorithm [55]. This is an accurate, but computationally expensive choice. As will be discussed in Section 2.4, the results using each of the choices are very similar in practice. Therefore, we chose the former as it is computationally cheaper.

The fitness value of a path shows how identical the repeat unit and the target subsequences together are. A higher fitness value indicates a higher similarity. Thus, the goal is to find a path that maximizes the fitness value. Existing methods such as cross_match use fixed score cutoffs and are equivalent to identifying matches of fixed lengths. Such fixed-length window-based methods ignore the fact that repeat units are of variable lengths. Our fitness formula circumvents this problem. It enables Greedier to identify matches of variable lengths, which makes it equivalent to a variable-length window-based method.
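To make the second way of computing the gap identity concrete, the following is a minimal Needleman-Wunsch sketch that returns the fraction of identical columns in one optimal global alignment of two gap sequences. The scoring values (match 1, mismatch -1, linear gap penalty -1) and the function name `nw_identity` are assumptions chosen for illustration; the text does not state which parameters were used.

```python
def nw_identity(s, t, match=1, mismatch=-1, gap=-1):
    """Identity (identical columns / alignment columns) of one optimal
    global alignment of s and t under a simple linear gap penalty."""
    n, m = len(s), len(t)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        # Substitution score for aligning s[i-1] with t[j-1].
        return match if s[i - 1] == t[j - 1] else mismatch

    # Fill the dynamic programming table.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back, preferring diagonal moves, and count identical columns.
    i, j, identical, columns = n, m, 0, 0
    while i > 0 or j > 0:
        columns += 1
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            identical += s[i - 1] == t[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return identical / columns if columns else 1.0
```

The table fill takes time proportional to the product of the two gap lengths, which is why the constant random-sequence identity is the cheaper choice.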


Figure 2-5. Two extreme cases in terms of gap lengths. Each pair of bars connected by a non-horizontal line represents fragments participating in an alignment. (a) The extreme case when there is no gap on the repeat and the gap on the chromosome ($g_c$) is extremely long. (b) The extreme case when there is no gap on the chromosome and the gap on the repeat ($g_r$) is extremely long.

For each graph, we adopt a greedy traversal strategy. We start from the vertex with the maximum fitness value in the graph. We extend this vertex to a path in the left and right directions. When extending in one direction, we include a neighbor vertex of the most recently chosen vertex into the path each time. The neighbor vertex is the one which makes the currently extended path have the biggest fitness value. We keep extending the path in one direction until the path cannot be extended further. After extending the path in both directions, we find the longest subpath of the extended path with fitness value no less than the cutoff $\epsilon$. We report this subpath as the repeat candidate to be joined into the pool. Note that when extending the path in the right direction, we consider the outgoing edges. When extending the path in the left direction, we consider the incoming edges. Figure 2-4 shows the pseudocode of the traversal.

Suppose that Greedier iterates $i$ times before BLAST returns no hits. At each iteration, the worst-case time spent traversing the graphs is linear in the number of edges, $O(rt^2)$. Hence, the total time complexity of Greedier is $O((rt^2 + rt^2)\,i)$, i.e., $O(r\,i\,t^2)$.
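The traversal just described can be sketched as follows. This is a simplified illustration, not the Greedier implementation: the path fitness is passed in as a callable because its formula is not reproduced here, the final subpath search is a plain quadratic scan, and the toy fitness used in the example (the mean of per-vertex scores) is purely hypothetical.

```python
def greedy_traverse(out_edges, vertex_fitness, path_fitness, cutoff):
    """Grow a path greedily from the fittest vertex, then return the path
    and its longest contiguous subpath with fitness >= cutoff."""
    # Derive incoming-edge lists from the outgoing ones.
    in_edges = {}
    for u, targets in out_edges.items():
        for v in targets:
            in_edges.setdefault(v, []).append(u)

    path = [max(vertex_fitness, key=vertex_fitness.get)]

    # Extend to the right along outgoing edges until no neighbor is left.
    while True:
        options = [v for v in out_edges.get(path[-1], []) if v not in path]
        if not options:
            break
        path.append(max(options, key=lambda v: path_fitness(path + [v])))

    # Extend to the left along incoming edges in the same way.
    while True:
        options = [u for u in in_edges.get(path[0], []) if u not in path]
        if not options:
            break
        path.insert(0, max(options, key=lambda u: path_fitness([u] + path)))

    # Longest contiguous subpath whose fitness reaches the cutoff.
    best = []
    for i in range(len(path)):
        for j in range(len(path), i + len(best), -1):
            if path_fitness(path[i:j]) >= cutoff:
                best = path[i:j]
                break
    return path, best


# Toy example with a hypothetical fitness: the mean of per-vertex scores.
scores = {"a": 0.5, "b": 0.9, "c": 0.8, "d": 0.2}
graph = {"a": ["b"], "b": ["c", "d"]}


def mean_fitness(p):
    return sum(scores[v] for v in p) / len(p)


full_path, candidate = greedy_traverse(graph, scores, mean_fitness, cutoff=0.8)
```

In the toy graph the traversal starts at `b`, extends right to `c` and left to `a`, and then trims the path to the subpath `b`, `c` because the full path falls below the cutoff.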


2.3.2). There are two kinds of gaps involved in an edge: a gap on a repeat with length $g_r$ and a gap on the target sequence with length $g_c$. Let $L$ be the repeat length. We consider two extreme cases of an edge. These cases provide upper bounds for the length difference between the two kinds of gaps.

One extreme case is when there is no gap on the repeat (i.e., $g_r = 0$), but there is a gap on the target sequence. Figure 2-5(a) shows this case. The fitness value, $f$, of the edge satisfies

$L/(L + g_c) \ge f$.

The equality happens when the entire repeat is identical to the corresponding subsequences in the chromosome. According to Section 2.3.2, a path containing this edge will be considered during graph traversal only if $f \ge \epsilon$, where $\epsilon$ is the fitness cutoff. Thus, we have

$L/(L + g_c) \ge \epsilon \;\Rightarrow\; g_c \le (1 - \epsilon)L/\epsilon$.

This inequality gives the upper bound on the length of the gap on the chromosome when there is no gap on the repeat. A similar analysis shows $g_c - g_r \le (1 - \epsilon)L/\epsilon$ when $g_r > 0$.

The other extreme case is when there is no gap on the chromosome (i.e., $g_c = 0$), but there is a gap on the repeat. Figure 2-5(b) illustrates this case. The fitness value, $f$, of the edge satisfies

$(L - g_r)/L \ge f$.

The equality happens when the rest of the repeat is identical to the corresponding subsequences in the chromosome. According to Section 2.3.2, a path containing this edge


cross_match and WindowMasker. cross_match is the core algorithm of RepeatMasker, MaskerAid, and CENSOR. WindowMasker (


Tables 2-1 and 2-2 show the relative accuracies of Greedier, cross_match, and WindowMasker when run with default parameters. For Greedier, we set the average identity between two random DNA samples, $y = 0.25$. We also computed the actual identity using the Needleman-Wunsch algorithm and obtained similar results (data not shown). Greedier has a much higher rate of identifying annotated transposons and a much lower false positive rate than both cross_match and WindowMasker. The percentages of bases masked by the three programs correspond to relative levels of repeat sequences found on the chromosome arms. However, in all chromosomes tested, we found a significant increase in the number of transposon bases masked by Greedier. For the Arabidopsis genome (Table 2-1), Greedier masked 2.4 and 2.8 times as many transposon bases as cross_match and WindowMasker respectively. At the same time, Greedier masked 0.3 and 0.1 times the number of false positives respectively. Thus,


Table 2-1. Comparison of Greedier, RepeatMasker (RM), cross_match (CM), and WindowMasker (WM) in terms of bases masked in different regions of the Arabidopsis genome (gen), consisting of the five chromosomes (chrs): chromosomes, transposons (TP), and other exons (FP). The TP/FP rates of Greedier, RepeatMasker, cross_match, and WM are 2.85, 2.42, 0.37, and 0.10 respectively. Letter K represents 1000.

                          Bases masked                               Percent masked
Chr   Region  Total       Greedier   RM         CM         WM         Greedier  RM     CM     WM
Chr1  chr     30,433K     921K       2,025K     920K       5,991K     3.0       6.6    3.0    19.7
Chr1  TP      1,109K      214K       444K       55K        72K        19.3      40.0   5.0    6.5
Chr1  FP      11,110K     53K        174K       307K       847K       0.5       1.6    2.8    7.6
Chr2  chr     19,705K     518K       1,378K     506K       3,636K     2.6       7.0    2.5    18.5
Chr2  TP      1,198K      154K       402K       90K        68K        12.9      33.6   7.5    5.6
Chr2  FP      6,768K      62K        153K       124K       540K       0.9       2.3    1.8    8.0
Chr3  chr     23,471K     762K       1,685K     744K       4,209K     3.2       7.2    3.2    17.9
Chr3  TP      1,320K      173K       443K       47K        82,973     13.1      33.6   3.6    6.3
Chr3  FP      8,624K      52K        166K       214K       646K       0.6       1.9    2.5    7.5
Chr4  chr     18,585K     714K       1,548K     700K       3,275K     3.8       8.3    3.8    17.6
Chr4  TP      1,033K      208K       425K       133K       56K        20.1      41.3   12.8   5.5
Chr4  FP      6,584K      67K        163K       165K       476K       1.0       2.5    2.5    7.2
Chr5  chr     26,992K     915K       1,871K     855K       5,066K     3.3       6.9    3.2    18.8
Chr5  TP      1,246K      175K       372K       61K        65K        14.1      29.9   4.9    5.2
Chr5  FP      9,820K      90K        203K       219K       715K       0.9       2.1    2.2    7.3
gen   gen     119,186K    3,831K     8,507K     3,725K     22,177K    3.2       7.1    3.1    18.5
gen   TP      5,906K      924K       2,087K     386K       344K       16.0      35.7   6.8    5.8
gen   FP      42,900K     325K       860K       1,028K     3,230K     0.7       2.0    2.4    7.5

Table 2-2. Comparison of Greedier, RepeatMasker (RM), cross_match (CM), and WindowMasker (WM) in terms of bases masked in different regions of the 10th rice chromosome: chromosome, G.M.s, transposons (TP), other G.M.s, and other exons (FP). Refer to Table 2-1 for the entry meanings. The TP/FP rates of Greedier, RepeatMasker, cross_match, and WindowMasker are 15.54, 12.1, 1.28, and 0.77 respectively.

                  Bases masked                                        Percent masked
Region   Total        Greedier    RM          CM          WM          Greedier  RM     CM     WM
rice#10  22,876,596   3,973,477   6,839,111   4,594,861   4,315,506   17.4      29.9   20.0   18.9
TP       3,072,087    1,481,468   2,051,697   641,277     461,174     48.2      66.8   20.9   15.0
FP       3,297,203    101,616     181,082     535,830     293,923     3.1       5.5    16.3   8.9

cross_match and WindowMasker, respectively. RepeatMasker, which uses cross_match, masked 2.3 times more letters than cross_match and Greedier. The TP rate of RepeatMasker is greater


Figure 2-6. An example of nested transposons in the fourth Arabidopsis chromosome identified by Greedier at different iterations. Bars represent transposon fragments. Numbers outside and inside the bars are the corresponding chromosome coordinates and iteration numbers respectively.

than that of Greedier. This is mainly because it masks more letters. The TP/FP rate of Greedier, however, is better than that of RepeatMasker. To understand why Greedier incurred false negatives, we aligned the false negatives of Greedier with the repeat library using BLAST. We did not get any significant matches. This implies that either the repeat library is incomplete or BLAST is not accurate enough to find the false negatives.

Greedier also showed higher accuracy than cross_match and WindowMasker for rice chromosome 10 (Table 2-2). Greedier masked 1.5 and 2 times as many transposon bases as cross_match and WindowMasker respectively. At the same time, Greedier masked 0.5 and 0.9 times the number of false positives respectively. However, the differences between the number of true positives and false positives recognized by the three programs were not as pronounced as with Arabidopsis. Potentially, this is due to incorrect annotation of transposons as gene sequences [10], leading to higher false positive rates with all three algorithms.

We also measured the number of fragments that are masked by each of the methods. On the Arabidopsis genome, Greedier, RepeatMasker, cross_match, and WindowMasker masked 7364, 15421, 6750, and 760243 fragments respectively. On the rice genome, the same methods masked 7738, 20827, 7690, and 116614 fragments respectively. This demonstrates that RepeatMasker and especially WindowMasker mask a large number of small discontinuous regions, whereas Greedier and cross_match mask relatively longer and contiguous regions.


Figure 2-6 shows an example of a retrotransposon that has fragmented identity to the repeat library. This retrotransposon, At4g06656, corresponds to coordinates 3,841,924 to 3,850,389 of Arabidopsis chromosome 4. Greedier masked the entire annotated transposon region, while cross_match missed 43% of the letters. At iteration one, the subsequence with coordinates 3,846,848 and 3,847,232 is identified as a putative Ty3-gypsy-like retrotransposon sequence within the library. This sequence is masked due to a high fitness value, because the genome sequence matches 96% of the repeat over the entire unit length of the repeat. At iteration seven, the subsequence with coordinates 3,844,713 and 3,845,381 is identified. This is a 77% match to a conserved centromere sequence over the entire length of the repeat unit in the repeat library. At iteration eight, the remaining sequence is masked due to identity to two retrotransposons within the repeat library. The middle segment, coordinates 3,845,382 to 3,846,847, shows no similarity to the repeat library, which reduces the fitness value of the overall match. However, the end segments show high levels of identity with both retrotransposons in the repeat library and the entire segment is masked. This example illustrates how Greedier


cross_match and WindowMasker even using a single iteration (Table 2-1).

In addition to masking repeats, Greedier also reports potential nested transposon structures. On the Arabidopsis genome level, Greedier finds potential nested structures, with 92, 92, 106, 89, and 208 structures from chromosomes 1, 2, 3, 4, and 5, respectively. These structures can be used by biologists as the candidate set to identify nested transposon structures.

It would be interesting to see the percentage of the transposons that are nested insertions. In order to calculate this number we analyze the Greedier results and the TAIR annotations. Greedier reports which fragments are potential nested insertions. However, Greedier can generate false positives. TAIR annotates repeats; however, it does


2.4.2, we calculated the TP/FP (True Positive/False Positive) rate of all regions between the vertices. This rate is 23.52. As shown in Table 2-1, the total numbers of annotated transposon bases and bases of exons that do not belong to transposons are 5,905,785 and 42,900,000 respectively. These two numbers yield an expected TP/FP rate of 0.14, which is far less than 23.52. As a second analysis, we calculated the probability (or the p-value) of finding as many TPs as appear in these regions under the uniform background distribution of TPs given in Table 2-1. This p-value was zero when calculated using the incomplete beta function. In other


Tables 2-1 and 2-2 show the number of bases masked by Greedier after the final iteration of the program, once no more BLAST hits can be found. It is possible that the last few iterations contribute the bulk of the false positive masking and that the algorithm could be optimized by retaining a higher stringency for the pairwise matches. Figure 2-7 shows the cumulative percentages of masked bases with each iteration for the Arabidopsis genome. Greedier completes the bulk of the repeat masking in the last few iterations but retains a nearly constant rate of false positive masking. These data suggest that increasing the stringency of Greedier will not significantly reduce false positives but will significantly reduce the number of true positive bases masked. Figure 2-7 also shows that Greedier identifies more TPs than cross_match and WindowMasker after the 12th and 11th iterations respectively, while identifying fewer false positives at every iteration.

Greedier, cross_match, and WindowMasker all masked only a small fraction of the transposons that were annotated in the Arabidopsis and rice genomic sequences we tested. It is possible that significant BLAST hits were missed by Greedier, because they had insufficient fitness values to be masked. This hypothesis predicts that the masked genome should contain transposons that still have significant matches to the repeat library. We extracted the annotated transposons missed by Greedier from the Arabidopsis genome. We ran BLAST with these transposons as the query set and the repeat library as the database. BLAST did not report any matches. Thus, annotated transposons missed by Greedier are primarily due to the absence of these sequences in the repeat library.


Figure 2-7. The cumulative percentages of masked transposons and exons that do not belong to transposons (i.e., other exons) in the Arabidopsis genome by each iteration of Greedier (the two curves) and the percentages of the same kinds of masked regions by cross_match (the two thick horizontal lines) and WindowMasker (the two thin horizontal lines).

repeats using automated approaches. However, the biological characteristics of repeat sequences create challenges for automated repeat masking. Transposons share many sequence features with true genes, which confuses gene finding algorithms and leads to the annotation of transposons as genes [10]. Transposons and other repeat sequences also evolve rapidly, making it difficult to recognize repeats through pairwise comparisons. Similarly, transposons have the tendency to insert in existing repeats, creating repeat fragments that are more difficult to identify using similarity scores. Greedier addresses many of these challenges and shows a clear improvement over the standard repeat masking algorithm, cross_match (


53] circumvent the need for a repeat library. These algorithms mask short sequences that are over-represented in the target genome. In our experiments, we found WindowMasker to have a low level of accuracy even though it can mask more total bases than Greedier. Word counting methods have lower accuracies, because word counting does not take into account biological characteristics of both repeat and gene sequences. First, repeats do not always exist in high copies [9]. Repeats can diverge rapidly, leading to sequences that will not be over-represented relative to the whole genome. Second, genes can contain sequences that would be considered high-copy. Genes can evolve through amplification events and be found in closely-related families. Also, sequence motifs within genes are conserved and found in many genes. These conservations are likely to cause segments of genes to be detected as over-represented.

In contrast to word counting algorithms, structure-based de novo repeat finders can find low-copy repeats. However, these algorithms tend to be limited to finding repeats that have well-conserved structures and are unlikely to find divergent and fragmented repeats. Potentially, generating a library of highly-conserved repeat sequences using algorithms like TRF [11], LTR_STRUC [52], and PILER [25] would serve to generate a more complete representation of the highly-conserved transposons and other repeat units within a genome. Greedier could then be used to identify divergent and fragmented repeats based on the computationally-generated libraries.


2, 60, 66, 75, 77, 78, 80]. Let $\Sigma$ be the alphabet for amino acids. We say that two letters from $\Sigma$ are similar if their similarity score is above a cutoff according to a scoring matrix, say BLOSUM62 [32]. We say that two sequences are similar if their alignment score is greater than a cutoff. Let $\Gamma$ be an arbitrary sequence over $\Sigma$. Let $x = s_1 \Gamma_1 s_2 \Gamma_2 \cdots s_k$ be a subsequence of a protein sequence. We call the subsequences $s_1, s_2, \ldots, s_k$ repeats of one another if the following four conditions hold: 1) $s_1, s_2, \ldots, s_k$ are similar sequences, 2) each $s_i$ is longer than a cutoff, 3) each $\Gamma_i$ is shorter than a cutoff, and 4) there is no supersequence of $x$ that satisfies the previous three conditions.

Depending on $\Gamma_i$, repeats can be classified into two categories: (1) Tandem repeats. In this case, for $\forall \Gamma_i$, $\Gamma_i = \emptyset$, i.e., tandem repeats are an array of consecutive similar sequences such as KTPKTPKTPKTP. (2) Interspersed repeats. In this case, $\exists \Gamma_i$, $\Gamma_i \ne \emptyset$, i.e., at least two repeats, one of which follows the other as the closest repeat, are not adjacent. An example of interspersed repeats is KTPAKTPKTPKTP.

Cryptic repeats are a special case of repeats. In this kind of repeat, $s_1, s_2, \ldots, s_k$ are not only similar sequences, but the letters contained in them are also all similar to one another, such as KKKAKKK. We call repeats $s_1, s_2, \ldots, s_k$ inexact if $\exists i, j$ such that $s_i \ne s_j$. Repeats $s_1, s_2, \ldots, s_k$ are considered as an LCR if their complexity is less than a cutoff based on a complexity function. One commonly used complexity function is the Shannon Entropy [65]. Note that there is no


42]. Certain types of LCRs are usually found in proteins of particular functional classes, especially transcription factors and protein kinases [31]. All these mean that LCRs may indicate protein functions [37, 58, 64], contribute to the evolution of new proteins, and thus contribute to cellular signalling pathways. Some LCRs attract purifying selection, become deleterious and therefore lead to human diseases when the number of copies of a repeat inside them exceeds a certain number [23].

LCRs cause many false positives in local similarity searches against a sequence database. BLAST [4], a popular local alignment program, uses the maximal segment pair score (MSP) to find optimal alignments. The theory of MSP can assure that statistically significant high-scoring alignments are found. However, biological sequences are very different from random sequences. Statistically significant high-scoring matches due to LCRs are not biologically significant. Hence they are false positives.

Statistical analyses of protein sequences have shown that approximately one-quarter of the amino acids are in LCRs and more than one-half of proteins have at least one LCR [78]. Despite their importance and abundance, their compositional and structural properties are poorly understood. So are their functions and evolution. Identifying LCRs can be the first step in studying them, and can help detect the functions of a new protein.

Computing the complexities of all possible subsequence sets is impractical even for a single sequence since the number of such sets is exponential in the sequence length. Several heuristic algorithms have been developed to quickly identify LCRs in a protein sequence. However, they all suffer from different limitations. Details of these limitations are discussed in Section 3.2.

In this chapter, we consider the problem of identifying LCRs in a protein sequence. We propose new complexity measures that take the amino acid similarity and order, and the sequence length into account. We introduce a novel graph-based algorithm, called


Section 3.2 discusses the related work. Section 3.3 introduces new complexity measures. Section 3.4 presents our algorithm, GBA. Section 3.5 shows quality and performance results. Section 3.6 presents a brief discussion.

57], REPuter [41], and TRF [11]. Here, we focus on algorithms that identify LCRs in a protein sequence.

Most algorithms that identify LCRs in a protein sequence use a sliding window, including SEG [80], DSR [75], P-SIMPLE [2], and [54]. Some algorithms are alignment based, such as XNU [19] and CAST [60]. Some algorithms are encoding based, such as [77]. Some algorithms are complexity based, such as CARD [66]. It is possible that one algorithm is based on more than one dimension, such as SEG. We describe each algorithm in detail.

SEG is a two-pass algorithm. The first stage identifies approximate raw segments of low complexity determined by a sliding window length and two compositional complexity ([79] and [61]) cutoffs. Each segment consists of sliding windows with complexity less than a second cutoff. The second stage reduces each raw segment to a single optimal


71] and extended by Hancock and Armstrong [30]. It carries out three main types of analyses: (1) it estimates the amount of simple sequence content in a protein molecule; (2) it determines which short sequence motifs are significantly clustered; and (3) it finds the location of simple sequences with significantly clustered motifs in the sequence. It first calculates a simplicity score awarded to the central amino acid of each window in the sequence. It then calculates a relative simplicity factor (RSF) for the sequence. RSF (greater than 1 or not) determines the existence of LCRs. However, the maximum length of a detectable repeat is only four. Another drawback of P-SIMPLE is that it can only identify cryptic repeats.

Nandi et al. [54] use a linguistic complexity measure based on dimer counts. This complexity measure is the observed fraction of the distinct dimers possible for the sequence adjusted by the askew representing the compositional bias of the sequence. All subsequences of a fixed window size with complexity less than a cutoff are considered as LCRs. However, the parameters were tuned using only four sequences and the results from SEG. In addition, the complexity measure cannot identify inexact repeats since it ignores similarities between different letters.

Sliding window based methods such as SEG, DSR, P-SIMPLE, and [54] suffer from a limitation caused by the sliding window. A window size needs to be specified. It is difficult to specify a window size since repeats can be of any length. Repeats with size either less or greater than the window size may be missed.

XNU [19] identifies LCRs by self-comparison. It scores local alignments with a PAM matrix. It estimates the significance of these alignments according to the statistical


60] compares the sequence with an artificial database of twenty homopolymers with a distinct amino acid each. It finds local alignments by using a simplified version of the Smith-Waterman algorithm [68]. It reports significant hits with alignment score above a threshold as LCRs. However, only repeats of a single residue type can be identified.

0j.py [77] encodes protein sequences using regular expressions. This approach is similar to that used by Allison et al. [3]. Encoding of a sequence maximizes the compression score of the sequence. All patches that fulfill a certain score threshold are reported. 0j.py cannot identify inexact repeats since it ignores similarities between different letters.

CARD [66] targets only the regions of the sequence that are delimited by a pair of identical subsequences. If these subsequences are positioned in tandem or overlapped, the region containing the two identical subsequences is marked as an LCR. Otherwise, it iteratively computes the complexity of the repeats concatenated with each segment of the same length as that of the repeat. This iteration continues until it either reaches the right repeating subsequence and masks the subsequences as an LCR, or detects that the computed complexity is greater than that of the left repeating subsequence. However, LCRs are not necessarily indicated by a pair of identical repeats. Furthermore, the use of a suffix tree to find repeats requires extensive memory.

SEG, DSR, and CARD use a complexity measure either based on or analogous to Shannon Entropy. However, Shannon Entropy is not a good complexity measure for protein sequences.
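As a small illustration of why a purely compositional measure can fall short, the sketch below computes the Shannon entropy of a few short peptides. Entropy depends only on letter frequencies, so it cannot see amino acid similarity, letter order, or the number of copies of a pattern. The function name `shannon_entropy` and the base-2 logarithm are choices made for this example only.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Shannon entropy (in bits) of the letter composition of seq."""
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in Counter(seq).values())


# Similar letters (R, Q, K) versus dissimilar letters (R, G, I).
same_similarity = math.isclose(shannon_entropy("RQK"), shannon_entropy("RGI"))

# Same letters in a different order.
same_order = math.isclose(shannon_entropy("RGIRGI"), shannon_entropy("RGIIRG"))

# One copy of a pattern versus three copies.
same_copies = math.isclose(shannon_entropy("RGI"), shannon_entropy("RGIRGIRGI"))
```

All three comparisons evaluate to true, so the entropy value alone cannot separate the sequences in each pair.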


65], is defined as

$-\sum_{i=1}^{20} p_i \log p_i$.

Although this formulation is effective for many applications, it has several problems when applied to protein sequences:

(1) Shannon entropy does not consider the characteristics of amino acids. Therefore, it is unable to distinguish similar letters from dissimilar ones. For instance, the Shannon entropies of RQK and RGI are the same. However, letters R, Q, and K are all similar whereas R, G, and I are all dissimilar, according to BLOSUM62.

(2) Shannon entropy only considers the number of each different letter in a sequence. Thus, it is unable to distinguish two sequences composed of the same letters but with different permutations. For example, the Shannon entropies of RGIRGI and RGIIRG are the same.

(3) Shannon entropy is unable to distinguish a small number of copies of a pattern from a large number of copies of the same pattern. For instance, the Shannon entropies of RGI and RGIRGIRGI are the same.

We sampled 474 sequences that contain repeats from Swissprot. The repeats in 418 (i.e., 88%) of these sequences are inexact. In other words, for 88% of the sampled sequences, Shannon entropy will have at least one of the first two problems above. Next, we develop new complexity measures that overcome problems (1) to (3). As will be seen later in Section 3.5, the new complexity measures do overcome the problems and



Figure 3-1(a) shows the sequence GAYTSVAYTVPQAWTVW. For simplicity only the subgraph corresponding to the subsequence from the 7th to the 16th letters is drawn. In this figure, the vertex (7,13) is constructed because the letter A appears at positions 7 and 13. Vertex (8,14) is constructed because letters Y and W are similar.

An edge is inserted between two vertices $(i, j)$ and $(k, m)$ if $s(i)s(k)$ and $s(j)s(m)$ are repetitions of each other. This property is enforced by introducing distance thresholds $t_2$ and $t_3$. A directed edge from $(i, j)$ to $(k, m)$ is added if all the following three conditions are satisfied: (1) $|(j - i) - (m - k)| \le t_2$; (2) $i < k \le j < m$; (3) $|j - i| \le t_3$ and $|m - k| \le t_3$, if $j = k$. The first condition specifies the maximum number of insertions and deletions between similar repeats. The second one guarantees that the positions of $s(i)s(k)$ and $s(j)s(m)$ do not conflict with each other. The third condition specifies the maximum distance between letters in cryptic repeats. Typically, we choose $t_1 = 15$, $t_2 = 3$, and $t_3 = 5$ as they give the best recall (Section 4.5.1). For example, in Figure 3-1(a) the edge between vertices (7,13) and (8,14) shows that AY and AW are repetitions of each other. Note that AY and AW are inexact repeats, since Y and W have a high substitution score. A graph from a sequence is not necessarily connected, i.e., it can consist of more than one connected subgraph.

Our sliding window does not carry the disadvantages of the sliding windows in SEG, DSR, and P-SIMPLE. This is because (1) short repeats can be detected by traversing the graph inside the window and (2) long repeats can be found by following the edges of overlapping windows. Theoretically, the size of our sliding window can be as large as the sequence length. Our purpose in introducing such a window is to control the graph size, and hence the time and space complexity of GBA.
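The three edge conditions can be written as a small predicate. The sketch below is hedged: it assumes condition (2) reads i < k <= j < m, which is consistent with the example edge from (7,13) to (8,14) and with condition (3) applying when j = k, but the inequality signs are an interpretation. The function name `edge_allowed` and its default thresholds (taken from the typical values t2 = 3 and t3 = 5) are illustrative.

```python
def edge_allowed(u, v, t2=3, t3=5):
    """Decide whether a directed edge from vertex u=(i, j) to v=(k, m)
    may be added, under the assumed reading of the three conditions."""
    i, j = u
    k, m = v
    # (1) Limit the number of insertions/deletions between the repeats.
    if abs((j - i) - (m - k)) > t2:
        return False
    # (2) Positions must not conflict (assumed ordering i < k <= j < m).
    if not (i < k <= j < m):
        return False
    # (3) For cryptic repeats (j == k), limit the distance between letters.
    if j == k and (abs(j - i) > t3 or abs(m - k) > t3):
        return False
    return True


# The example edge of Figure 3-1(a): AY at 7-8 repeats as AW at 13-14.
example_edge = edge_allowed((7, 13), (8, 14))
```

With these defaults, a pair such as (7,13) and (8,20) is rejected by condition (1) because the two distances differ by six positions.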


Figure 3-1. Four steps of GBA on a sequence with 3 approximate repetitions of AYTV. Underlined letters indicate repeats. Rectangles denote regions identified as candidate LCRs by GBA at different steps. (a) Graph constructed on the sequence. For simplicity only the subgraph corresponding to the subsequence from the 7th to the 16th letters is drawn. Bold edges indicate the longest path in the subgraph. (b) Candidate intervals found by using the longest path (Section 3.4.2). (c) Intervals after extending candidate intervals (Section 3.4.3). (d) Final LCRs after post-processing (Section 3.4.4).

Figure 3-2. Contribution of the repeat region TPSTT to R, expressed in terms of the forget rate. (a) The contribution of each letter pair. (b) The overall contribution of TPSTT.

Next, we discuss the third condition of vertex construction. The values $R(i, j)$ and $NR(i, j)$ represent average probabilities that $i$ and $j$ appear together in a repeat region and a non-repeat region respectively. We compute these statistics from five real datasets, flybase, mgd, mim, sgd, and viruses, extracted from Swissprot. These five datasets correspond to species Drosophila melanogaster (Fruit fly), Mus musculus (Mouse), Homo


27]. We use it to measure the chance of two letters being a part of the same LCR. Thus, as letters get far away in the sequence, their contribution drops exponentially. Figure 3-2(a) shows the individual contribution of each letter pair in a repeat region TPSTT and Figure 3-2(b) shows the corresponding change of R. Note that both R and NR are symmetric. To ease the readability of Figure 3-2(b), we only show the top right part of R. Finally, when all sequences are processed, $R(i, j)$ is normalized as $R(i, j)$


In Figure 3-1(a), the edge between vertices (7,13) and (8,14) shows that AY and AW are potential repeats. We find the longest path in every connected subgraph to get the longest repeating patterns. Repeats of the sequence in Figure 3-1(a) are AYTSV, AYTV and AWTV. The path represented by bold edges is the longest path. It captures the repeat pattern AYTV and AWTV perfectly. Note that in Figure 3-1(a), for simplicity, only the subgraph corresponding to the subsequence from the 7th to the 16th letters is drawn. Figure 3-1(b) shows the potential LCRs for the whole sequence after finding the longest path.

There are many existing algorithms that find the shortest path in a graph. They can be easily modified to find the longest path in a graph. Our implementation of finding the longest path in a graph is based on Dijkstra's Algorithm [22]. The complexity of our implementation is linear in the size of the graph.

3.4.2 correspond to a set of intervals (i.e., subsequences) on the input sequence. We discard short intervals. In our implementation we set this length cutoff as three. The remaining ones are considered as repeat seeds. They are extended in the left and right directions to find longer intervals containing full repeats with low complexities. We stop extending an interval when one of the following two conditions is satisfied: (1) It overlaps with another extended interval, or the end of the sequence is reached. (2) The complexity starts increasing after extending it by $t_1$ letters ($t_1$ is the upper bound for the repeat length as discussed in Section 3.4.1).

Once an interval is extended, we find its largest subinterval for which the complexity is less than a given cutoff value. In order to find a reasonable cutoff value, we randomly sampled sequences from Swissprot that contain repeat regions. We increase our sample set size by one each time. Let $\mu_t$ and $\sigma_t$ denote the mean and the standard deviation of the


44]: t = p

Figure 3-1(c) shows the potential LCRs after the extension. We can see that the letter S at position five is included in the potential LCRs. This example illustrates how GBA can detect repeats with indels. Note that all complexities in this sub-section are calculated using the 2-gram complexity measure.


68]. If the similarity of the aligned parts of the two intervals is big enough, e.g., more than 4 letters, we keep the aligned part of the potential outlier interval as an LCR. Otherwise, we repeat the comparison on its right adjacent interval. If no satisfactory aligned part exists in either comparison, the interval is discarded. The implementation of the Smith-Waterman algorithm is borrowed from JAligner (

In Figure 3-1(d), the letter W at position 17 is removed. This is because the complexity of AWTVW is not low enough and the Smith-Waterman alignment does not include this letter. Note that all complexities in this sub-section are calculated by using the normalized 2-gram complexity measure.

6] and Uniprot as our test data. These annotated repeats do not have any known function. Therefore, we used them as true LCRs. The first five datasets were constructed by extracting sequences with repeats from flybase, mgd, mim, sgd, and viruses (


Section 4.5.1 evaluates the qualities of the proposed complexity measures and GBA. Section 4.5.2 compares the performances of GBA, 0j.py, CARD, and SEG, including time and space complexities.


Figure 3-3 shows the resulting plots for Shannon Entropy and the 2-gram complexity measure. When ratios from repeats are lower than 0.84, there is not much difference between Shannon Entropy and the 2-gram complexity measure, since the two curves representing the two complexity measures twist around each other. However, when ratios from repeats are no less than 0.84, there is a clear difference between the two complexity measures. The Shannon Entropy curve is always above the 2-gram complexity measure curve. This means that our complexity measure distinguishes repeats from non-repeats better. In particular, as shown by the bold vertical line in the figure, when 92% of the repeats are identified, the 2-gram complexity measure produces 30% fewer false positives than Shannon Entropy. We do not show the result of the primary complexity measure in order to maintain the readability of the plots. The primary complexity measure curve stays between that of Shannon Entropy and that of the 2-gram complexity measure, and very close to that of Shannon Entropy.

Evaluation of GBA: We compare the qualities of GBA, SE-GBA, 0j.py, CARD, and SEG. The differences between the quality of SE-GBA and those of the competing tools, 0j.py, CARD, and SEG, show the improvement obtained by our graph-based repeat detection method, as they all use Shannon Entropy as the complexity measure. The quality difference between GBA and SE-GBA indicates the improvement obtained due to our new complexity formula on top of our repeat detection method.

Let TP (True Positive) be the number of letters that are correctly masked as LCRs by the underlying LCR-identification algorithm. Similarly, let FP (False Positive) and FN (False Negative) be the number of letters that are incorrectly computed as LCRs and non-LCRs. We compute three measures: recall, precision, and Jaccard coefficient as follows:
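The three measures can be computed per sequence as in the sketch below. It assumes the standard definitions, recall = TP/(TP+FN), precision = TP/(TP+FP), and Jaccard coefficient = TP/(TP+FP+FN), and represents the predicted and annotated LCRs as sets of letter positions. The function name `lcr_metrics` and the convention of returning 0.0 for an empty denominator are choices made for this example.

```python
def lcr_metrics(predicted, truth):
    """Return (recall, precision, jaccard) for sets of masked positions.

    predicted: positions an algorithm masked as LCR.
    truth: positions annotated as LCR.
    """
    tp = len(predicted & truth)   # correctly masked letters
    fp = len(predicted - truth)   # letters wrongly masked as LCR
    fn = len(truth - predicted)   # LCR letters that were missed
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    jaccard = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return recall, precision, jaccard


# Hypothetical example: a predicted interval overlapping half of the
# annotated interval.
recall, precision, jaccard = lcr_metrics(set(range(5, 15)), set(range(10, 20)))
```

Because the Jaccard coefficient has both FP and FN in its denominator, it penalizes over-masking and under-masking at the same time.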

Figure 3-3. Comparison between Shannon Entropy and our 2-gram complexity measure. The x-axis represents ratios from repeats. The y-axis represents ratios from non-repeats.

Figure 3-4 compares the average recalls for the last five datasets: mgd, mim, sgd, viruses, and mis. On average, GBA has the highest recall. SE-GBA has the second highest one. The recall of SE-GBA is 8% higher than that of SEG, 18% higher than that of 0j.py, and 22% higher than that of CARD. The recall of GBA is 2%, 10%, 20% and 24% higher than that of SE-GBA, SEG, 0j.py, and CARD respectively. In other words, the recall is improved by at least 8% by using a different repeat detection method introduced in Section 3.4, and by at least 2% by using the new complexity measures introduced in Section 3.3, instead of Shannon Entropy. Small recall values indicate that the existing methods cannot formulate inexact repeats well. This is justified by our discussion in Section 3.3

Figure 3-4. Average recalls of GBA, SE-GBA, 0j.py, CARD, and SEG on four datasets.

Figure 3-5 compares the average precisions for the last four datasets. GBA has the second highest precision. SE-GBA has higher precision than SEG. This is because different repeat detection methods are used in the two algorithms. The precision of GBA is higher than that of SE-GBA. This is because different complexity measures are used in the two algorithms. 0j.py has the highest precision. CARD has the second highest precision on some of the datasets. For 0j.py, this is because only exact repeats are identified. For CARD, this is because only LCRs delimited by a pair of identical repeats can be identified. Although both patterns have a high chance of being true repeats, they constitute a small percentage of possible repeats. This is because repeats are usually inexact, which is justified by the low recalls of 0j.py and CARD. Thus, 0j.py and CARD achieve high precisions at the expense of low recalls (Figure 3-4). Small precision values indicate that many false positives are produced. This is mainly because loose cutoff values are needed to obtain a reasonable recall. The precision and recall values for the mgd and mim datasets are much better than those for the viruses dataset. This indicates that the

repeats in viruses show much more variation than those in mgd and mim. Indeed, the mutation rate in viruses is much higher [24, 34].

Figure 3-5. Average precisions of GBA, SE-GBA, 0j.py, CARD, and SEG on four datasets.

To understand the relationship between precision and recall of GBA, 0j.py, and CARD, we plotted precision versus recall as follows (Figure 3-6). We first created a (precision, recall) tuple for each sequence in the first five datasets by calculating the precision and the recall of GBA for that sequence. We then divided all these 474 tuples into 4 groups of the same size (except the last one, which contains fewer tuples). Tuples in the first group have the smallest precisions. Tuples in the second group have the next smallest precisions, and so on. Finally, we calculated the means of the precisions and the recalls for each group and got one representative (precision, recall) tuple for each group. We repeated the same process for 0j.py and CARD. Figure 3-6 shows that, on average, GBA has a higher recall when the three tools have the same precisions.

Unlike precision and recall, the Jaccard coefficient considers true positives, false positives and false negatives. Figure 3-7 shows that GBA has the highest Jaccard coefficient for
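The grouping procedure used to draw the precision-versus-recall plot can be sketched as follows. This is only an illustration of the steps described above: the function name `group_means` is hypothetical, and the group size is assumed to be the ceiling of n / 4 so that only the last group is smaller (for 474 tuples this gives groups of 119, 119, 119 and 117).

```python
import math


def group_means(tuples, n_groups=4):
    """Sort (precision, recall) tuples by precision, cut them into consecutive
    groups of equal size (the last one may be smaller), and return one
    representative (mean precision, mean recall) tuple per group."""
    ordered = sorted(tuples, key=lambda t: t[0])
    size = math.ceil(len(ordered) / n_groups)  # assumed group size
    representatives = []
    for start in range(0, len(ordered), size):
        group = ordered[start:start + size]
        mean_precision = sum(p for p, _ in group) / len(group)
        mean_recall = sum(r for _, r in group) / len(group)
        representatives.append((mean_precision, mean_recall))
    return representatives


if __name__ == "__main__":
    sample = [(0.1, 0.9), (0.3, 0.7), (0.2, 0.8), (0.4, 0.6), (0.5, 0.5)]
    print(group_means(sample, n_groups=2))
```

For very small inputs the ceiling rule can yield fewer than `n_groups` groups; that case does not arise for the 474 tuples mentioned in the text.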

all the datasets. The second best tool is different for different datasets. On average, SE-GBA has the second highest Jaccard coefficient. The difference between GBA and SE-GBA shows the quality improvement achieved due to our new complexity measure alone. The differences between SE-GBA and the competing methods CARD and SEG that use the same complexity measure (i.e., Shannon Entropy) show the quality improvement due to our graph-based strategy alone. All tools have relatively low recalls and precisions. This implies the abundance of inexact repeats in LCRs. Figure 1 of the Appendix shows an example sequence from Swissprot for which GBA identifies almost the entire LCR while 0j.py, CARD, and SEG fail. Figure 2 of the Appendix shows the LCR of another example sequence from Swissprot for which all the tools, GBA, 0j.py, CARD, and SEG, have low recalls and precisions.

Figure 3-6. Relationship between precision and recall of GBA, 0j.py, and CARD.

Figure 3-7. Average Jaccard coefficients of GBA, SE-GBA, 0j.py, CARD, and SEG on four datasets.

with a group of vertex sets. This group may have a maximum of $t_1$ sets and each set may have a maximum of $t_1$ vertices. Hence the edge construction takes $O(L t_1^3)$ time. The time complexity of finding the longest path is linear in the size of the graph, which is $O(L t_1 + L t_1^3)$. It takes $O(L)$ time to extend longest-path intervals. The Smith-Waterman algorithm in the post-processing step takes $O(L^2)$ time in the worst case. Hence, GBA takes $O(L t_1^3 + L^2)$ time in the worst case. The worst case happens when a sequence consists of letters of a single type.

The average times per sequence of the GBA, 0j.py, CARD, and SEG algorithms on one of our datasets, mim, were 79.65 seconds, 0.5 milliseconds, 26 seconds, and 0.75 milliseconds respectively. Both CARD and GBA are slower than 0j.py and SEG, but their running times are still acceptable. This is because manual verification and wet-lab experimentation on the computational results usually take days. GBA is thus desirable since it produces much higher quality results (Figure 3-4). The longest sequence (5412 letters), FUTSC_DROME, took GBA, 0j.py, CARD, and SEG 829 seconds, 96 milliseconds,

3.4.4 . One of our datasets, flybase, took GBA, 0j.py, CARD, and SEG 140 MB, 17 MB, 785 MB and 1000 kB of memory in the worst case respectively.


45 ]. Figure 4-1 shows two sequences that contain LCRs indicated by the underlined letters. Both of the LCRs in this figure contain a tandem repeat AAT repeated four times. Statistical analyses have shown that approximately one-quarter of the amino acids are in LCRs and more than one-half of proteins have at least one LCR [78].

Traditional sequence similarity search methods produce many false positives due to LCRs in biological sequences. We use BLAST [4], one of the popular similarity search algorithms, as an example to illustrate the problem. BLAST uses the maximal segment pair score (MSP) to find the optimal alignment. The theory of MSP finds statistically significant high-scoring alignments under the assumption that letters follow a random distribution. However, biological sequences are very different from random sequences since they contain many LCRs. Statistically significant high-scoring matches due to LCRs usually do not indicate a genuine relationship between sequences, and hence are false positives. For example, traditional algorithms produce a high alignment score for the two sequences in Figure 4-1 even though they are not biologically related. This is

because the two sequences contain the same LCR. BLAST returns over 1,000 statistically significant sequences for Thermus thermophilus seryl tRNA synthetase, which has only 31 true positive homologs [75]. Such high false positive rates cause enormous amounts of wasted resources and time spent on refuting them.

Figure 4-1. Two sequences that have the same LCR indicated by the underlined letters. Here, the LCRs are composed of a repeating pattern of AAT.

Existing methods such as BLAST follow one of two extreme strategies: they either treat LCRs and non-LCRs the same or simply remove all LCRs from the sequences 80 ]. Although filtering identified LCRs improves the quality of searches, it is not desirable since no LCR-identification method is 100% accurate. Our experiments on real data showed that the average precision and recall of some well-known methods such as SEG and CARD [66] were as low as 0.2 and 0.3 respectively [45]. Hence, both strategies are problematic. Note that BLAST also has an option where it masks a region specified by a user. This, however, requires the user to have perfect knowledge of the location of such regions. Thus, it is not practical, as these regions are not known for many sequences.

Contributions: This chapter considers the problem of finding similar sequences when the locations of the LCRs are not known precisely. The goal is to develop algorithms that reduce the number of false positives significantly without losing true positives.

Section 4.2 discusses the related work. Section 4.3 illustrates how to assign a quality value to each letter in a sequence. Section 4.4 introduces our similarity search algorithms. Section 4.5 shows quality and performance results. Section 4.6 concludes the chapter.

4, 21, 41, 47, 68, 69 ]. We consider these strategies under two categories. The first one, called DPS, finds the optimal alignment using dynamic programming. The second one, called HTS, finds a suboptimal alignment by employing a hash table. The Smith-Waterman algorithm [68] and BLAST are examples of these two strategies respectively.

80 ], CARD [66], and GBA [45]. SEG first finds contigs with Shannon Entropy complexity [65] less than a cutoff. It then detects the leftmost and longest subsequences with minimal probability of occurrence as LCRs. BLAST uses SEG

33 ], have used quality values. The meaning of CAP3's term "quality" differs from that of this chapter. In CAP3, each letter is given a quality value based on the probability that the letter is correctly sequenced. Thus, it does not denote whether a letter is in an LCR or not. CAP3's use of quality values, however, is similar to traditional filters for LCRs. It clips the ends of reads (fragments of sequences) that have quality lower than a cutoff. At the time of assembly, it uses two types of scores, one for letters with quality greater than a given cutoff and another for the rest of the letters. This indicates that the algorithms developed in this chapter can be employed for sequence assemblers like CAP3 when the quality values are not 100% correct. This chapter, however, focuses on sequence comparison in the presence of LCRs.

Figure 4-2 shows a sequence, its LCR, and its masked version by an LCR-identification algorithm. Masked letters are replaced by x. Here, TP = 8, FP = 1, and FN = 4. In this section, we introduce how to assign a quality value to a letter of a sequence when the underlying LCR-identification method is inaccurate. During the assignment, we

use one or more LCR-identification algorithms. Sections 4.3.1 and 4.3.2 discuss the cases when single and multiple LCR-identification tools are employed respectively.

Figure 4-2. A sequence with an underlined LCR and a masked version of it. Masked letters are replaced by x.

45 ] for its masked letters. Hence, we can represent each sequence by a (precision, complexity) tuple. Let $(p_1, c_1), (p_2, c_2), \ldots, (p_n, c_n)$ be the precision and complexity values for the sampled sequences, where $c_1 \le c_2 \le \cdots \le c_n$. We create an equi-depth histogram with m

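$(p_{(i-1)\lfloor n/m \rfloor + 1}, c_{(i-1)\lfloor n/m \rfloor + 1}), \ldots, (p_{i\lfloor n/m \rfloor}, c_{i\lfloor n/m \rfloor})$ be these tuples. Formally, we compute the value for bin $i$ as:

$$\sum_{j=(i-1)\lfloor n/m \rfloor + 1}^{i\lfloor n/m \rfloor} p_j$$

Figure 4-3 shows a sequence with an underlined LCR and its masked versions by two LCR-identification tools. Masked letters are replaced by x. Each tool masks different regions as LCRs. If different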

tools classify the same letter as belonging to an LCR, then the chance that this letter belongs to an LCR is higher compared to the case where these tools do not agree.

Figure 4-3. A sequence with an underlined LCR and its masked versions by two different LCR-identification algorithms. Letters masked by each algorithm are replaced by x.

Integrating quality values from multiple tools is a nontrivial task. In principle, a quality value from a high-accuracy LCR-identification algorithm should contribute more to the combined quality value than that from a low-accuracy LCR-identification algorithm. The average Jaccard coefficient reflects the accuracy of an LCR-identification algorithm. For a sequence, the Jaccard coefficient is computed as TP / (TP + FP + FN). The average Jaccard coefficient over all sequences reflects how well a classifier mirrors the actual class labels [29]. It approaches one when the classifier gives all instances their true class labels.

We propose to compute the combined quality value by taking a weighted average of these quality values. Let $JC_1, JC_2, \ldots, JC_t$ be the Jaccard coefficients of the $t$ algorithms respectively. Let $q_1, q_2, \ldots, q_t$ be the quality values assigned to a letter by the $t$ algorithms respectively, as discussed in Section 4.3.1. We compute the quality value of this letter from these $t$ algorithms as:

$$\frac{\sum_{i=1}^{t} q_i \cdot JC_i}{\sum_{i=1}^{t} JC_i}$$

The higher the Jaccard coefficient of an algorithm is, the more its quality value affects the new quality value.
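The Jaccard-weighted average of per-tool quality values can be written as a short Python function. This is a minimal sketch of the formula stated in the text; the function name `combined_quality` and the error handling are assumptions made for this illustration.

```python
def combined_quality(qualities, jaccards):
    """Weighted average of per-tool quality values for one letter.

    qualities[i] is the quality value assigned by tool i and jaccards[i] is
    that tool's average Jaccard coefficient, which serves as its weight.
    """
    if not qualities or len(qualities) != len(jaccards):
        raise ValueError("need one Jaccard coefficient per quality value")
    total_weight = sum(jaccards)
    if total_weight == 0:
        raise ValueError("at least one Jaccard coefficient must be positive")
    return sum(q * jc for q, jc in zip(qualities, jaccards)) / total_weight


if __name__ == "__main__":
    # Two hypothetical tools: the more accurate one dominates the result.
    print(combined_quality([1.0, 0.0], [0.3, 0.1]))
```

When all tools assign the same quality value, the combined value equals that value regardless of the weights.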

In Section 4.3, we discussed how letters are assigned quality values. In this section, we introduce our quality-based similarity search algorithms: QDPS (Section 4.4.1) and QHTS (Section 4.4.2).

4.4.2.1 The algorithm

Many biological applications require all-to-all comparison of two large sequence sets, say, A and B [15, 65]. In these applications, A and B can both be too big to fit in main memory. We will refer to the sequences in A as query sequences and the sequences in B as database sequences in the rest of this section. A trivial solution to this problem is to compare all pairs of sequences $(a, b)$, where $a \in A$ and $b \in B$, using QDPS. This is, however, impractical, as the number of sequence pairs is $|A| \cdot |B|$ and comparing a pair of sequences is costly. HTS methods, such as BLAST, alleviate this problem by employing a hash table (Section 4.2). In this section, we show how to incorporate quality values into such hash table-based searches.

We develop a new method called QHTS (Quality and Hash Table-based Similarity search algorithm). It imitates the well-known HTS methods. QHTS exploits quality values not only to perform quality-based search, but also to further improve the memory usage and the running time of traditional hash table-based sequence similarity searches. Similar to HTS, QHTS has three steps: (1) probabilistic hash table construction, (2) search phase, and (3) alignment phase. Note that some of the existing HTS algorithms deviate slightly from the described three steps by finding multiple seeds [70] or using spaced seeds [48]. It is trivial to extend QHTS to simulate them.

Probabilistic hash table construction: We slide a window of length k on the sequences of one of the datasets, say A. Each window position produces a k-gram. For

Figure 4-4 illustrates a sequence and its k-grams (for k = 3). Only three k-grams are stored in the hash table. Our probabilistic hash table has two advantages over the traditional strategy where all k-grams are kept:

1. Since k-grams from LCRs tend not to be inserted into the hash table, they will not be identified as seeds. Thus, it is less likely to produce false positives due to seeds in LCRs.
2. The hash table is smaller compared to the case where all k-grams are inserted. This reduces the I/O cost (Section 4.5.2).

Our hash table is also superior to BLAST's hash table using the "filter lookup table" option. This is because the latter removes all the k-grams in the regions masked by an LCR-identification tool. This is undesirable, as LCR-identification tools are highly inaccurate. The former, on the other hand, includes the k-grams in these regions with probabilities determined by the qualities of the k-grams.

Search phase: In this phase, we discuss how we determine seeds. Let $a_i \cdots a_{i+k-1}$ and $b_j \cdots b_{j+k-1}$ denote k-grams in sequences $a \in A$ and $b \in B$, where the k-gram

4.2 for the definition of neighbor). Let $q_{a_i}, \ldots, q_{a_{i+k-1}}$ and $q_{b_j}, \ldots, q_{b_{j+k-1}}$ be the quality values assigned to the letters in the two k-grams respectively. We calculate their alignment score as

$$\sum_{t=0}^{k-1} s(a_{i+t}, b_{j+t}) \cdot q_{a_{i+t}} \cdot q_{b_{j+t}}.$$

If this alignment score is greater than the neighbor threshold, we call this region a seed. During this phase, for each k-gram in B, we find all its neighbors in A with the help of the hash table. We then recalculate their alignment scores to decide seeds.

Alignment phase: We extend each seed, i.e., $a_i \cdots a_{i+k-1}$ and $b_j \cdots b_{j+k-1}$, in both left and right directions with no gaps allowed along a and b respectively. Whenever we extend each current subsequence by a new letter, we update the alignment score with quality values included. Let $a_i \cdots a_{i+x}$ and $b_j \cdots b_{j+x}$, where $x \ge k$, be the two current subsequences to be extended respectively. Let m denote the alignment score between them. Let $a_{i+x+1}$ and $b_{j+x+1}$ be the two new letters to be added. Let $q_{a_{i+x+1}}$ and $q_{b_{j+x+1}}$ denote the quality values of the new letters respectively. The alignment score is updated as

$$m := m + s(a_{i+x+1}, b_{j+x+1}) \cdot q_{a_{i+x+1}} \cdot q_{b_{j+x+1}}.$$

We extend in each direction until the difference between the maximum alignment score observed so far and the current alignment score is greater than an extension threshold. The maximum alignment score among all seed extensions between the database and the query sequence is taken as their alignment score.

The traditional approach ignores masked letters. This is the same as assigning masked letters a quality of 0 and unmasked letters a quality of 1. Thus, masked letters do not contribute to alignment scores and unmasked letters make full contributions. This strategy, however, is problematic since no LCR-identification algorithm is 100% accurate.
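The quality-weighted seed score and the ungapped extension can be sketched as follows. This is a simplified illustration, not the QHTS implementation: `toy_score` is a hypothetical stand-in for the substitution function s (a real search would use a scoring matrix), only the rightward extension is shown (the leftward one is symmetric), and all function names are assumptions.

```python
def toy_score(x, y):
    """Hypothetical substitution score s(x, y): +2 for a match, -1 otherwise."""
    return 2 if x == y else -1


def seed_score(a, qa, b, qb, i, j, k, s=toy_score):
    """Quality-weighted score of the k-grams a[i:i+k] and b[j:j+k]."""
    return sum(s(a[i + t], b[j + t]) * qa[i + t] * qb[j + t] for t in range(k))


def extend_right(a, qa, b, qb, i, j, k, drop, s=toy_score):
    """Extend a seed to the right without gaps.

    Stops when the best score seen so far exceeds the current score by more
    than `drop` (the extension threshold). Returns (best score, best length).
    """
    m = seed_score(a, qa, b, qb, i, j, k, s)
    best, best_len = m, k
    x = k  # number of aligned letters so far
    while i + x < len(a) and j + x < len(b):
        m += s(a[i + x], b[j + x]) * qa[i + x] * qb[j + x]
        x += 1
        if m > best:
            best, best_len = m, x
        elif best - m > drop:
            break
    return best, best_len


if __name__ == "__main__":
    ones = [1.0] * 5
    print(extend_right("GARAQ", ones, "GARAQ", ones, 0, 0, 3, drop=2))
```

Setting every quality value to 1.0 reduces this to an ordinary ungapped extension, and setting the quality of masked letters to 0.0 reproduces the boolean masking described in the text.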

Figure 4-4. An example for probabilistic hash table and reconstruction. The solid lines show the three 3-grams of the sequence GARAQAQAQKL stored in the hash table. The resulting sequence after reconstruction is at the bottom. Letter X denotes an unknown letter.

Figure 4-4 shows a sequence GARAQAQAQKL and its three (out of nine) k-grams (k = 3) stored in the hash table. In this example, we can reconstruct 73% of the letters of this sequence using these k-grams. The lost letters are given quality values of zero, and they are replaced with the letters "N" and "X" for DNA and amino acid sequences respectively.

Three important notes can be made about the performance of QHTS.

1. The CPU time for the search and the alignment phases is reduced. This is because not all k-grams are inserted into the hash table. Hence, fewer seeds are usually found and extended (Section 4.5.2).
2. Reconstructing query sequences from the hash table for seed extension reduces I/O cost. This is because query sequences are not read from disk. The hash table is in memory. Hence, reconstructing query sequences from it does not involve any I/O cost (Section 4.4.2.4).

4.5.1 ).
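The probabilistic hash table and the reconstruction step can be sketched as follows. Two assumptions are made loudly here because the exact rule is not given in the surviving text: a k-gram is kept with probability equal to the mean quality of its letters, and the table maps each k-gram to its start positions in a single sequence. The function names, and the positions of the three stored 3-grams in the usage example, are hypothetical.

```python
import random
from collections import defaultdict


def build_probabilistic_table(seq, quality, k, rng):
    """Insert each k-gram with a probability derived from its letter qualities.

    Assumption: keep probability = mean quality of the k letters.
    """
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        keep_probability = sum(quality[i:i + k]) / k
        if rng.random() < keep_probability:
            table[seq[i:i + k]].append(i)
    return table


def reconstruct(table, length, unknown="X"):
    """Rebuild a sequence from stored k-grams; uncovered letters become `unknown`."""
    letters = [unknown] * length
    for kgram, positions in table.items():
        for pos in positions:
            letters[pos:pos + len(kgram)] = kgram
    return "".join(letters)


if __name__ == "__main__":
    seq = "GARAQAQAQKL"
    # Hypothetical choice of three stored 3-grams (start positions 0, 4 and 6).
    stored = {"GAR": [0], "QAQ": [4, 6]}
    rebuilt = reconstruct(stored, len(seq))
    covered = sum(1 for c in rebuilt if c != "X")
    print(rebuilt, covered / len(seq))
    full = build_probabilistic_table(seq, [1.0] * len(seq), 3, random.Random(0))
    print(reconstruct(full, len(seq)))
```

With this particular choice of stored 3-grams, 8 of the 11 letters are recovered, which is consistent with the 73% quoted for the example; the figure may of course use different 3-grams.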

4.3 ). Thus, the total space involved in quality values is small enough to be stored in memory. As mentioned in Section 4.4.2.2, during seed extension, we can either reconstruct query sequences from the hash table or read

them from disk. The latter case involves additional I/O cost. Hence, we discuss the memory allocation scheme based on these two cases.

Figure 4-5. Memory adaption scheme. qs, db, and HT represent the query set, the database, and the hash table respectively. For both the reconstruction case and the reading case, $M_1$ and $M_2$ are the amounts of memory allocated to the hash table and the database sequences respectively. $M_1/C$ is the amount of the query set whose k-grams can be held in a hash table of size $M_1$ probabilistically. For each $M_1/C$ pages of the query set, a hash table is built and the whole database is read into memory once to find seeds. $M_3$ only applies to the reading case. It represents the amount of memory allocated to the query set for seed extension. In the worst case, for each $M_2$ pages of the database, the entire query set is read into memory once in chunks of $M_3$.

(1) Reconstruction case: Since the query sequences are reconstructed from the hash table, we only need to assign one page to them, i.e., $M_3 = 1$ (we assume that the longest sequence fits in one page). This page is used to store a reconstructed sequence. Hence,

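$M_1/C$ times by reading a new subset of the query set into memory. Figure 4-5 illustrates this process except that the information about $M_3$ does not apply.

$$\frac{Q}{M_1/C}(st + rt) + Q \cdot tt + \frac{Q}{M_1/C}\left(\frac{DB}{M_2}(st + rt) + DB \cdot tt\right) \qquad (4-2)$$

Plugging (4-1) into (4-2), we get:

$$\frac{Q}{M_1/C}(st + rt) + Q \cdot tt + \frac{Q}{M_1/C}\left(\frac{DB}{M - M_1 - 2}(st + rt) + DB \cdot tt\right) \qquad (4-3)$$

Therefore, the total I/O cost can be formulated as:

$$\frac{Q}{M_1/C}(st + rt) + Q \cdot tt + \frac{Q}{M_1/C}\left(\frac{DB}{M - M_1 - 2}(st + rt) + DB \cdot tt\right) \qquad (4-4)$$

(2) Reading case: Query sequences are read from disk for seed extension. Thus, $M_3$ is not fixed anymore. Hence,

$$\frac{Q}{M_1/C}(st + rt) + Q \cdot tt + \frac{Q}{M_1/C}\left(\frac{DB}{M_2}(st + rt) + DB \cdot tt\right) + \left(\frac{Q}{M_3}(st + rt) + Q \cdot tt\right)\frac{DB}{M_2} \qquad (4-6)$$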

4-5 ) into (4-6), we get:

$$\frac{Q}{M_1/C}(st + rt) + Q \cdot tt + \frac{Q}{M_1/C}\left(\frac{DB}{M_2}(st + rt) + DB \cdot tt\right) + \left(\frac{Q}{M - M_1 - M_2}(st + rt) + Q \cdot tt\right)\frac{DB}{M_2} \qquad (4-7)$$

Taking the derivatives of the equations gives us the optimal values of $M_1$, $M_2$, $M_3$, and $M_4$, i.e., the values at which the total I/O cost is minimized for each case. As we show later in Section 4.5, reconstruction significantly reduces the I/O cost, and thus the total cost.

Comparison of reconstruction and reading cases: Equations (4-3) and (4-6) also apply to the no-masking strategy. Hence, there are four cases in total: no-masking-Reconstruct, no-masking-Read, QHTS-Reconstruct, and QHTS-Read. We have theoretically calculated the minimum I/O cost of these four cases, as given by the two equations. Assume that each pointer in the hash table pointing to a k-gram takes four bytes of memory. Thus, we have C = 4 for the no-masking strategy. To get C for QHTS, we first ran QHTS with GBA as the underlying LCR-identification algorithm. We then created the probabilistic hash table on a database of 60,000 protein sequences from the Swissprot database. This experiment provided the parameter C as 3.14. We used the typical parameters from currently available commercial disks. The parameters were st = 8, rt = 2, tt = 0.09 (in milliseconds), and a page size of 4 kB.

Figure 4-6 plots the worst case minimum I/O cost of the four cases (in seconds) when the best memory allocation scheme is used for each case. Here, we set both the database size DB and the query set size Q to 100,000 pages. We vary the available memory from DB/16 to DB. Figure 4-6 shows that the I/O cost of QHTS-Reconstruct is much less than that of QHTS-Read (2.06 to 2.25 times lower). The speedup is larger when the memory is smaller. The same applies to the no-masking cases, except that the I/O cost of no-masking-Reconstruct is 1.97 to 2.08 times less than that of no-masking-Read. QHTS-Reconstruct is 1.24 to 2.07 times faster than no-masking-Reconstruct. QHTS-Read is 1.17 times faster than no-masking-Read. Therefore, both the reconstruction strategy and the use of quality values save I/O cost.
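The two I/O cost models can be evaluated numerically with a small Python sketch. It assumes the cost structure suggested by the memory scheme: one hash table is built for every M1/C pages of the query set, the database is scanned once per hash table in chunks of M2 pages, and in the reading case the query set is additionally re-read in chunks of M3 pages once per database chunk. Here st, rt and tt are the per-request seek, rotation and per-page transfer times; the function names are hypothetical, and no attempt is made to reproduce the figures reported in the text.

```python
def reconstruct_cost(Q, DB, M1, M2, C, st, rt, tt):
    """I/O cost when query sequences are rebuilt from the in-memory hash table."""
    tables = Q / (M1 / C)                            # number of hash tables built
    read_queries = tables * (st + rt) + Q * tt       # read the query set once
    scan_database = tables * ((DB / M2) * (st + rt) + DB * tt)
    return read_queries + scan_database


def read_cost(Q, DB, M1, M2, M3, C, st, rt, tt):
    """I/O cost when query sequences are re-read from disk for seed extension."""
    reread_queries = ((Q / M3) * (st + rt) + Q * tt) * (DB / M2)
    return reconstruct_cost(Q, DB, M1, M2, C, st, rt, tt) + reread_queries


if __name__ == "__main__":
    # Illustrative split of memory only; not the optimal allocation.
    params = dict(Q=100_000, DB=100_000, C=3.14, st=8, rt=2, tt=0.09)
    print(reconstruct_cost(M1=20_000, M2=20_000, **params))
    print(read_cost(M1=20_000, M2=20_000, M3=10_000, **params))
```

Finding the best split of the available memory between M1, M2 and M3 would require minimizing these functions under the memory constraint, which this sketch does not do.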

Figure 4-6. The minimum I/O cost varies, depending on whether to use the reconstruction or the read strategy, whether to use quality values, and the relationship between the available amount of memory M and the size of the database DB.

4.2 . We name their no masking and boolean masking versions NDPS, BDPS, NHTS, and BHTS respectively. We set the gap open and extension penalties to -10 and -0.5 respectively. We set the neighbor threshold, the extension threshold, and the k-gram length in all the cases to 11, seven, and three respectively as

5 ], all these sequences belong to one or more families. We identify two sequences as biologically similar if they belong to the same family according to InterPro. This is because biologists have integrated various sequence-cluster databases into InterPro. Thus, InterPro provides a biologically reliable classification, as the contributing databases complement each other in grouping protein families [38]. Each sequence in our query set comes from one or two families out of 29 families. The average number of sequences in our database with the same families as a query sequence is 64. The minimum, the maximum, and the standard deviation are 2, 1000, and 142 respectively. For performance comparison, we created seven sets of 6000, 3000, 1500, 750, 375, 188, and 94 sequences from the 60,000 sequences. We also created a dataset of 20,714 protein sequences from PDB (

4.3 ).

Table 4-1. Average total number of database sequences returned and precision (in percentage) for NDPS, BDPS-SEG, QDPS-GBA, QDPS-SC, and QDPS-SCG on the Swissprot dataset.

                             Score cutoff
                  Method       25      50     75    100
Result set size   NDPS        55897   7775   708   171
                  BDPS-SEG    55205   7611   611   108
                  QDPS-GBA    25059    194    50    34
                  QDPS-SC      3147     63    36    30
                  QDPS-SCG     8861    102    44    34
Precision (%)     NDPS         0.06   0.43     4    17
                  BDPS-SEG     0.06   0.44     5    27
                  QDPS-GBA     0.13     17    60    85
                  QDPS-SC      1.08     47    79    89
                  QDPS-SCG     0.39     30    68    82

Figure 4-7. Result set size versus recall (in percentage) for NDPS, BDPS and QDPS on the Swissprot dataset.

Table 4-1 compares the average total number of database sequences returned and the precision for various thresholds on the Swissprot dataset. QDPS (the last three methods) return significantly fewer results than DPS (the first two methods). The difference grows as the threshold decreases. QDPS return 0.8-45% of the sequences returned by NDPS. As the threshold increases, the precision of all the methods increases. QDPS-SC and QDPS-SCG usually have better results in all the experiments than QDPS-GBA. At threshold 50, the precision of QDPS-SC is 30% better than that of QDPS-GBA. The superior precision of QDPS

indicates that our quality based alignment eliminates almost all the false positives while the traditional methods cannot. Figure 4-7 shows the relationship between result set size and recall for each method on the same dataset. We see that when all methods have the same recall, QDPS have much smaller result sizes. Hence, QDPS reduces the number of false negatives and false positives over DPS significantly. We obtained similar results on the PDB dataset (Table 4-2 and Figure 4-8).

Table 4-2. Average total number of database sequences returned and precision (in percentage) for NDPS, BDPS-SEG, QDPS-GBA, QDPS-SC, and QDPS-SCG on the PDB dataset.

                             Score cutoff
                  Method       25      50     75    100
Result set size   NDPS        16010   1397   113    22
                  BDPS-SEG    15808   1365   110    22
                  QDPS-GBA     4974     20    13    11
                  QDPS-SC       946     13    11    11
                  QDPS-SCG      946     13    11    11
Precision (%)     NDPS         0.16      1    12    57
                  BDPS-SEG     0.16      1    12    57
                  QDPS-GBA     0.42     65    95    95
                  QDPS-SC       1.8     92    95    96
                  QDPS-SCG      1.8     92    95    96

Figure 4-8. Result set size versus recall (in percentage) for NDPS, BDPS and QDPS on the PDB dataset.

Table 4-3. Average total number of database sequences returned and precision (in percentage) for NHTS, BHTS-SEG, QHTS-GBA-Read, QHTS-SC-Read, and QHTS-SCG-Read on the Swissprot dataset.

                             Score cutoff
                  Method       25      50     75    100
Result set size   NHTS        43218    198    47    33
                  BHTS-SEG    40909     81    31    29
                  QHTS-GBA     6979     41    30    26
                  QHTS-SC       162     26    23    20
                  QHTS-SCG      773     34    27    22
Precision (%)     NHTS         0.08     15    61    81
                  BHTS-SEG     0.08     37    90    94
                  QHTS-GBA        4     93   100   100
                  QHTS-SC        16     93    97   100
                  QHTS-SCG        4     83    95    98

Table 4-3 compares the average total number of database sequences returned and the precision for various thresholds. QHTS (the last three methods) return fewer results than HTS (the first two methods). The difference is more significant for small thresholds. At threshold 25, QHTS return 0.4 to 16.1% of those of NHTS, depending on the LCR-identification strategy. QHTS have higher precision than HTS. As the threshold increases to large values, all the methods perform roughly the same.

Figure 4-9 shows the relationship between result set size and recall for each method. We can see that when all methods have the same recall, QHTS-GBA has a much smaller result size. Hence, it reduces the number of false negatives and false positives over DPS significantly. Although, compared with HTS methods, QHTS-SC and QHTS-SCG have large result set sizes at the same recall level, the recall drops slightly at the same result size. Hence, we conclude that they eliminate a large number of false positives at the expense of a small number of false negatives.

Figure 4-10 demonstrates how the proposed strategy compares to BLAST by focusing on a protein, APOA4_MOUSE, from our Swissprot query set. This protein contains 390 amino

acids, where 269 of them are annotated as in LCRs. We ran BLAST to align this query to the proteins in our Swissprot database. We used the default parameters of BLAST with and without using the "filter low complexity" option. We also aligned this query using QHTS methods. The results show that QHTS methods have higher recall at the same result set size. Their differences are more pronounced than those in Figure 4-9

Figure 4-9. Result set size versus recall (in percentage) for NHTS, BHTS and QHTS on the Swissprot dataset.

Figure 4-10. Result set size versus recall (in percentage) for QHTS and BLAST with the LCR-filter off (BLAST-w/o Filter) and on (BLAST-w/ Filter) for querying the protein sequence, APOA4_MOUSE, against our Swissprot database.

Table 4-4. Average recall and precision (in percentage) of QHTS-GBA-Reconstruct, QHTS-GBA-Read, QHTS-SC-Reconstruct, QHTS-SC-Read, QHTS-SCG-Reconstruct, and QHTS-SCG-Read on the Swissprot dataset. To save space, we omit "QHTS" and use "Rec." to represent "Reconstruct" in the table.

                             Score cutoff
                  Method       25    50    75   100
Recall (%)        GBA-Rec.     52    45    42    39
                  GBA-Read     53    45    42    39
                  SC-Rec.      38    36    32    28
                  SC-Read      41    38    35    31
                  SCG-Rec.     45    42    38    33
                  SCG-Read     47    43    40    34
Precision (%)     GBA-Rec.    0.5    73    91    96
                  GBA-Read    3.5    92    99    99
                  SC-Rec.      21    96    98    99
                  SC-Read      16    93    98    99
                  SCG-Rec.      5    84    96    98
                  SCG-Read      4    83    95    98

4.4.2.2 , reconstruction for seed extension may cause sensitivity to drop. Here, we compare three pairs of reconstruction versions and reading versions of QHTS on the Swissprot dataset: (1) QHTS-GBA-Reconstruct, QHTS-GBA-Read, (2) QHTS-SC-Reconstruct, QHTS-SC-Read, (3) QHTS-SCG-Reconstruct, QHTS-SCG-Read. Table 4-4 presents the average recall and precision. For each pair, the reading case always has a slightly higher recall value than the reconstruction case, with a maximum difference of 3%. This is expected since the reading case reads sequences from disk and no letter is lost. The reading case sometimes has higher precision, depending on the LCR-identification tools used. However, the differences are small, especially when the score cutoff gets bigger. Thus, the accuracy of the reconstruction case is comparable to that of the reading case.

Lost information by reconstruction: As mentioned in Section 4.4.2.2, we may lose letters during reconstruction. Here, we evaluate the amount of lost information. We define the Relative Information Loss measure for this purpose. Let $a = a_1 a_2 \cdots a_n$ be a sequence. Let $q_1, q_2, \ldots, q_n$ be the quality values associated with each letter in a. Assume that letters

Table 4-5. Relative information loss regarding the length of k-grams (in percentage).

k = 2    6.9    7.2    7.3
k = 3    2.9    2.7    2.8
k = 4    1.3    1.3    1.3

Table 4-6. CPU times spent in HTC, SP, and AP, and the number of k-grams stored in the hash tables for (I) NHTS, (II) BHTS-SEG, (III) QHTS-GBA-Reconstruct, (IV) QHTS-SC-Reconstruct, and (V) QHTS-SCG-Reconstruct.

                         I        II       III      IV       V
CPU time [sec]  HTC      6        5        5        5        6
                SP       11968    8615     8252     7293     8513
                AP       1269     895      343      13       46
                total    13240    9516     8601     7311     8565
# of k-grams             833792   657868   650147   481070   545485

Table 4-5 shows how the Relative Information Loss changes with the length of k-grams on the 1,500-sequence set. As mentioned in Section 4.2.2 of the chapter, the bigger the k, the smaller the possibility that a letter is lost. When k = 3, the Relative Information Loss is about 3%. As we present in Section 5.2 of the chapter, this number is very small compared to the percentage of k-grams evicted from the hash table. Thus, very little information is lost during reconstruction.

Table 4-6 shows the results on the 1,500-sequence set. HTC takes much less time than SP and AP in all cases. This is because the running time of HTC is linear in the query set size. Thus, introducing quality values adds negligible time to HTC. SP dominates in all cases. Each k-gram in the database requires one or more hash table lookups. We only extend seeds. Hence, AP takes much less time than SP. NHTS has the largest SP and AP times since the other methods do not store all k-grams in the hash table. The total time for NHTS is thus significantly larger. BHTS-SEG has the second largest SP and AP times, hence the second largest total. This is because BHTS-SEG eliminates k-grams only from identified LCRs whereas QHTS eliminate k-grams from any place with some probability.

CPU time versus I/O time: For small datasets, CPU time dominates the overall running time of the search tools, including BLAST. Here, we demonstrate that I/O cost increases much faster than CPU cost for growing datasets. Thus, it will be the dominating term in the near future. We also show that I/O cost is significantly reduced by our probabilistic hash table and reconstruction strategy.

We used BLAST with the LCR-filter on as a representative of existing BHTS strategies. We ran BLAST on the seven datasets with n, 2n, ..., 64n sequences by doubling the dataset size. The largest dataset contained 6,000 sequences. We performed a self comparison of each dataset using BLAST and measured the CPU times. Figure 4-11 plots the CPU times in seconds as a function of the dataset size in log-log scale.

We then calculated the optimal I/O costs of the reading and reconstructing strategies, for increasing dataset sizes, using the formulas in Section 4.4.2.4. For this computation, we assumed a fixed memory of M = 100,000 pages, with a page size of 4 kB. We computed the I/O times for self comparisons of the seven datasets of size M, 2M, ..., 64M, by doubling the dataset size. We used st = 8, rt = 2, and tt = 0.09 (in milliseconds) as I/O parameters

4-11 . Two important observations follow from this figure. First, I/O time grows much faster than CPU time with growing dataset size. Furthermore, Moore's law suggests that the gap between the increase in I/O and CPU costs will be even larger than in our results in the future. We conclude that I/O cost needs to be optimized for large datasets. Second, the I/O cost of the reconstruction strategy is 42% of that of the reading strategy. Thus, from this figure and Table 4 of the chapter, we conclude that the proposed reconstruction strategy reduces the overall running time of existing NHTS strategies by 45 to 58%, depending on the dataset and memory sizes. The improvement is more significant for larger datasets.

One important note is that even for the cases where at least one dataset can fit in main memory, QHTS is still cost-effective. The time saving moves one level up along the computer storage hierarchy. Instead of saving the I/O cost, the communication between CPU and main memory is reduced dramatically. The relationship between the CPU cost and this communication cost follows the same pattern as shown in Figure 4-11.

Hash table size comparison: We evaluate the space usage of our quality based search methods. We measure this usage as the number of k-grams in the hash table, as this is the largest data structure to be kept in main memory. A pointer is stored for each k-gram. Thus, memory usage is proportional to the number of k-grams in the hash table. Table 4-6 compares the number of k-grams stored in the hash table for NHTS, BHTS-SEG, QHTS-GBA, QHTS-SC, and QHTS-SCG on the 1,500-sequence set. NHTS has the largest number. This is because it stores all k-grams, unlike the other methods. Hash table sizes of QHTS methods are 57-77% of that of NHTS and 73-98% of that of BHTS. This is because BHTS eliminates k-grams only from masked LCRs whereas

QHTS eliminates any k-grams with some probability. This probability depends on how the LCR-identification tools perform. This explains why there is a difference between QHTS methods. We conclude that QHTS can answer much larger query sets in main memory than NHTS. The larger the hash table, the more hash table lookups are needed and the longer SP takes. Table 4-6 shows that this is true. QHTS-GBA evicts 23% of the k-grams from the hash table, but its information loss is merely 6%. This justifies our reasoning that (1) the chance that all k-grams containing the same letter are deleted is very small, and (2) usually, letters with low quality values are lost during reconstruction.

Figure 4-11. BLAST CPU time and QHTS I/O times for datasets of increasing size.


45 ] and BIOCOMP 2007 [46]. For the second problem, an extended version is submitted to BMC Genomics. The paper for the third problem is under review in the Bioinformatics Journal.

[1] Grandbastien, M.A.: Retroelements in higher plants. Trends in Genetics 8, 103-108 (1992)
[2] Albà, M., Laskowski, R., Hancock, J.: Detecting cryptically simple protein sequences using the SIMPLE algorithm. Bioinformatics 18, 672-678 (2002)
[3] Allison, L., Stern, L., Edgoose, T., Dix, T.: Sequence complexity for biological sequence analysis. Computers & Chemistry 24, 43-55 (2000)
[4] Altschul, S., Gish, W., Miller, W., Myers, E.W., Lipman, D.J.: Basic Local Alignment Search Tool. Journal of Molecular Biology 215(3), 403-410 (1990)
[5] Apweiler, R., et al.: InterPro - an integrated documentation resource for protein families, domains and functional sites. Bioinformatics 16(12), 1145-1150 (2000)
[6] Bairoch, A., Boeckmann, B., Ferro, S., Gasteiger, E.: Swiss-Prot: juggling between evolution and stability. Briefings in Bioinformatics 1, 39-55 (2004)
[7] Bao, Z., Eddy, S.R.: Automated de novo identification of repeat sequence families in sequenced genomes. Genome Research 12, 1269-1276 (2002)
[8] Bedell, J., Korf, I., Gish, W.: MaskerAid: a performance enhancement to RepeatMasker. Bioinformatics 16(11), 1040-1041 (2000)
[9] Bennetzen, J.: The contributions of retroelements to plant genome organization, function and evolution. Trends Micro 4, 347-353 (1996)
[10] Bennetzen, J., et al.: Consistent over-estimation of gene number in complex plant genomes. Current Opinion in Plant Biology 7(6), 732-736 (2004)
[11] Benson, G.: Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Research 27(2), 573-580 (1999)
[12] Bergman, C.M., Pfeiffer, B.D., Rincón-Limas, D.E., Hoskins, R.A., Gnirke, A., Mungall, C.J., Wang, A.M., Kronmiller, B., Pacleb, J., Park, S., Stapleton, M., Wan, K., George, R.A., de Jong, P.J., Botas, J., Rubin, G.M., Celniker, S.E.: Assessing the impact of comparative genomic sequence data on the functional annotation of the Drosophila genome. Genome Biology 3 (2002)
[13] Boffelli, D., McAuliffe, J., Ovcharenko, D., Lewis, K., Ovcharenko, I., Pachter, L., Rubin, E.: Phylogenetic analysis of primate sequences reveals functional regions of the human genome. Science 299(5611), 1391-1394 (2003)
[14] Bowen, N., Jordan, I.: Transposable elements and the evolution of eukaryotic complexity. Current Issues in Molecular Biology 4(3), 65-76 (2002)
[15] Brenner, S., Hubbard, T., Murzin, A., Chothia, C.: Gene duplications in H. influenzae. Nature 378(9), 140 (1995)
[16] Buard, J., Jeffreys, A.: Big, bad minisatellites. Nature Genetics 15, 327-328 (1997)
[17] Campagna, D., et al.: RAP: a new computer program for de novo identification of repeated sequences in whole genomes. Bioinformatics 21(5), 582-588 (2005)
[18] Caspi, A., Pachter, L.: Identification of transposable elements using multiple alignments of related genomes. Genome Research 16(2), 260-270 (2006)
[19] Claverie, J.M., States, D.: Information enhancement methods for large scale sequence analysis. Computers & Chemistry 17, 191-201 (1993)
[20] Consortium, I.H.G.: Initial sequencing and analysis of the human genome. Nature 409, 860-921 (2001)
[21] Delcher, A., Kasif, S., Fleischmann, R., Peterson, J., White, O., Salzberg, S.: Alignment of Whole Genomes. Nucleic Acids Research 27(11), 2369-2376 (1999)
[22] Dijkstra, E.: A note on two problems in connexion with graphs. Numerische Mathematik 1, 269-271 (1959)
[23] Djian, P.: Evolution of simple repeats in DNA and their relation to human disease. Cell 94, 155-160 (1998)
[24] Drake, J.: The distribution of rates of spontaneous mutation over viruses, prokaryotes, and eukaryotes. Annals of the New York Academy of Sciences 870, 100-107 (1999)
[25] Edgar, R.C., Myers, E.W.: PILER: identification and classification of genomic repeats. Bioinformatics 21, 152-158 (2005)
[26] Flavell, A.J., Dunbar, E., Anderson, R., Pearce, S.R., Hartley, R., Kumar, A.: Ty1-copia group retrotransposons are ubiquitous and heterogeneous in higher plants. Nucleic Acids Research 20, 3639-3644 (1992)
[27] Gilbert, A.C., Kotidis, Y., Muthukrishnan, S., Strauss, M.: Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. International Conference on Very Large Databases (VLDB) pp. 79-88 (2001)
[28] Gotoh, O.: An improved algorithm for matching biological sequences. Journal of Molecular Biology 162(3), 705-708 (1982)
[29] Halkidi, M., Batistakis, Y., Vazirgiannis, M.: On clustering validation techniques. Journal of Intelligent Information Systems 17, 107-145 (2001)
[30] Hancock, J., Armstrong, J.: SIMPLE34: an improved and enhanced implementation for VAX and SUN computers of the SIMPLE algorithm for analysis of clustered repetitive motifs in nucleotide sequences. Computational and Applied Biosciences (CABIOS) 10, 67-70 (1994)

PAGE 96

[31] Hancock,J.,Simon,M.:Simplesequencerepeatsinproteinsandtheirpotentialroleinnetworkevolution.Gene345(1),113{118(2005) [32] Heniko,S.,Heniko,J.:Aminoacidsubstitutionmatricesfromproteinblocks.ProceedingsoftheNationalAcademyofSciencesoftheUnitedStatesofAmerica(PNAS)89(22),10,915{10,919(1992) [33] Huang,X.,Madan,A.:CAP3:ADNASequenceAssemblyProgram.GenomeResearch9(9),868{877(1999) [34] J.Drake,etal.:Ratesofspontaneousmutation.Genetics148(4),1667{86(1998) [35] Juretic,N.,Bureau,T.E.,Bruskiewich,R.M.:Transposableelementannotationofthericegenome.Bioinformatics20,155{160(2004) [36] Jurka,J.,etal.:CENSOR-aprogramforidenticationandeliminationofrepetitiveelementsfromDNAsequences.ComputersandChemistry20(1),119{122(1996) [37] Kazemi-Esfarjani,P.,Triro,M.A.,Pinsky,L.:Evidenceforarepressivefunctionofthelongpolyglutaminetractinthehumanandrogenreceptor:possiblepathogeneticrelevanceforthe(CAG)n-expandedneuronopathies.HumanMolecularGenetics4,523{527(1995) [38] Kriventseva,E.V.,Biswas,M.,Apweiler,R.:Clusteringandanalysisofproteinfamilies.CurrentOpinioninStructuralBiology11(3),334{339(2001) [39] Kurtz,S.,etal.:Computationandvisualizationofdegeneraterepeatsincompletegenomes.In:IntelligentSystemsforMolecularBiology(ISMB),pp.228{238.AAAIPress(2000) [40] Kurtz,S.,Choudhuri,J.V.,Ohlebusch,E.,Schleiermacher,C.,Stoye,J.,Giegerich,R.:REPuter:themanifoldapplicationsofrepeatanalysisonagenomicscale.NucleicAcidsResearch29(22),4633{4642(2001).DOI10.1093/nar/29.22.4633 [41] Kurtz,S.,Schleiermacher,C.:REPuter:fastcomputationofmaximalrepeatsincompletegenomes.Bioinformatics15(5),426{427(1999) [42] Lanz,R.,Wieland,S.,Hug,M.,Rusconi,S.:Atranscriptionalrepressorobtainedbyalternativetranslationofatrinucleotiderepeat.NucleicAcidsResearch23,138{145(1995) [43] Le,Q.H.,Wright,S.,Yu,Z.,Bureau,T.:Transposondiversityinarabidopsisthaliana.ProceedingsoftheNationalAcademyofSciencesoftheUnitedStatesofAmerica(PNAS)97,73767381(2000) [44] 
Leadbetter,M.R.,Lindgren,G.,Rootzen,H.:ExtremeandRelatedPropertiesofRandomSequencesandProcesses,chap.1.Springer(1983)

PAGE 97

[45] Li,X.,Kahveci,T.:Anovelalgorithmforidentifyinglow-complexityregionsinaproteinsequenc.Bioinformatics22(24),2980{2987(2006) [46] Li,X.,Kahveci,T.:Quality-basedsimilaritysearchforbiologicalsequencedatabases.In:TheInternationalConferenceonBioinformaticsandComputationalBiology(2007) [47] Lipman,D.J.,Pearson,W.R.:RapidandSensitiveProteinSimilaritySearches.Science227(4693),1435{1441(1985) [48] Ma,M.,Tromp,J.,Li,M.:PatternHunter:FasterandMoreSensitiveHomologySearch.Bioinformatics18(0),1{6(2002) [49] Ma1,J.,Devos,K.M.,Bennetzen,J.L.:Analysesofltr-retrotransposonstructuresrevealrecentandrapidgenomicdnalossinrice.GenomeResearch14,860{869(2004) [50] Mao,L.,Wood,T.,Yu,Y.,BudimanM.A.andTomkins,J.,WooS.andSasinowski,M.,Presting,G.,Frisch,D.,Go,S.,Dean,R.,Wing,R.:Ricetransposableelements:asurveyof73000sequence-tagged-connectors.GenomeResearch10,982{990(2000) [51] McCarthy,E.M.,Liu,J.,Lizhi,G.,McDonald,J.F.:Longterminalrepeatretrotransposonsoforyzasativa.GenomeBiology3(2002) [52] McCarthy,E.M.,McDonald,J.F.:LTR STRUC:anovelsearchandidenticationprogramforLTRretrotransposons.Bioinformatics19(3),363{367(2003) [53] Morgulis,A.,etal.:WindowMasker:window-basedmaskerforsequencedgenomes.Bioinformatics22,134{141(2006) [54] Nandi,T.,Dash,D.,Ghai,R.,B-Rao,C.,Kannan1,K.,Brahmachari,S.K.,Ramakrishnan,C.,Ramachandran,S.:Anovelcomplexitymeasureforcomparativeanalysisofproteinsequencesfromcompletegenomes.JournalofBiomolecularStructureandDynamics20(5),657{68(2003) [55] Needleman,S.B.,Wunsch,C.D.:AGeneralMethodApplicabletotheSearchforSimilaritiesintheAminoAcidSequenceofTwoProteins.JournalofMolecularBiology48,443{53(1970) [56] Pevzner,P.A.,Tang,H.,Tesler,G.:Denovorepeatclassicationandfragmentassembly.GenomeResearch14,1786{1796(2004) [57] Pevzner,P.A.,Tang,H.,Waterman,M.S.:AnEulerianpathapproachtoDNAfragmentassembly.ProceedingsoftheNationalAcademyofSciencesoftheUnitedStatesofAmerica(PNAS)98(17),9748{9753(2001) [58] 
Pinto,M.,Lobe,C.G.:Productsofthegrg(groucho-relatedgene)familycandimerizethroughtheamino-terminalqdomain.JournalofBiologicalChemistry271,33,026{33,031(1996)

PAGE 98

[59] Price,A.L.,Jones,N.C.,Pevzner,P.A.:Denovoidenticationofrepeatfamiliesinlargegenomes.Bioinformatics21,351{358(2005) [60] Promponas,V.,Enright,A.,Tsoka,S.,Kreil,D.,Leroy,C.,Hamodrakas,S.,Sander,C.,Ouzounis,C.:CAST:aniterativealgorithmforthecomplexityanalysisofsequencetracts.Bioinformatics16(10),915{922(2000) [61] Salamon,P.,Konopka,A.:Amaximumentropyprinciplefordistributionoflocalcomplexityinnaturallyoccurringnucleotidesequences.Computers&Chemistry16,117{124(1992) [62] SanMiguel,P.,Gaut,B.S.,Tikhonov1,A.,Nakajima1,Y.,Bennetzen,J.L.:Thepaleontologyofintergeneretrotransposonsofmaize.NatureGenetics20,43{45(1998) [63] Sanmiguel,P.,Tikhonov,A.,Jin,Y.K.,Motchoulskaia,N.,Zakharov,D.,Melake-Berhan,A.,Springer,P.S.,Edwards,K.J.,Lee,M.,Avramova,Z.,Bennetzen,J.L.:Nestedretrotransposonsintheintergenicregionsofthemaizegenome.Science274(5288),765{768(1996) [64] Schwechheimer,C.,Smith,C.,Bevan,M.:Theactivitiesofacidicandglutamine-richtranscriptionalactivationdomainsinplantcells:designofmodulartranscriptionfactorsforhigh-levelexpression.PlantMolecularBiology36,195{204(1998) [65] Shannon,C.:FastIncrementalMaintenanceofApproximateHistograms.In:BellSyst.Tech.J.,pp.50{60(1951) [66] Shin,S.W.,Kim,S.M.:Anewalgorithmfordetectinglow-complexityregionsinproteinsequences.Bioinformatics21(2),160{170(2005) [67] Smit,A.F.:Interspersedrepeatsandothermementosoftransposableelementsinmammaliangenomes.CurrentOpinioninGeneticsandDevelopment9(6),657{663(1999) [68] Smith,T.,Waterman,M.:Identicationofcommonmolecularsubsequences.JournalofMolecularBiology147,195{197(1981) [69] States,D.,Agarwal,P.:CompactEncodingStrategiesforDNASequenceSimilaritySearch.In:IntelligentSystemsforMolecularBiology(ISMB)(1996) [70] Tatusova,T.,Madden,T.:BLAST2Sequences,ANewToolforComparingProteinandNucleotideSequences.FEMSMicrobiologyLetters177,247{250(1999) [71] Tautz,D.,Trick,M.,Dover,G.:Crypticsimplicityindnaisamajorsourceofgeteticvariation.Nature322,652{656(1986) [72] 
Turcotte,K.,Srinivasan,S.,Bureau,T.:Surveyoftransposableelementsfromricegenomicsequences.ThePlantJournal25,169{179(2001)

PAGE 99

[73] Volfovsky,N.,Haas,B.,Salzberg,S.:Aclusteringmethodforrepeatanalysisindnasequences.GenomeBiology2(8)(2001) [74] Voytas,D.F.,Cummings,M.P.,Konieczny,A.,Ausubel,F.M.,Rodermel,S.R.:Copia-likeretrotransposonsareubiquitousamongplants.ProceedingsoftheNationalAcademyofSciencesoftheUnitedStatesofAmerica(PNAS)89,7124{7128(1992) [75] Wan,H.,Li,L.,Federhen,S.,Wootton,J.:Discoveringsimpleregionsinbiologicalsequencesassociatedwithscoringschemes.JournalofComputationalBiology10,171{185(2003) [76] Wicker,T.,Stein,N.,Albar,L.,Feuillet,C.,Schlagenhauf,E.,Keller,B.:Analysisofacontiguous211kbsequenceindiploidwheat(triticummonococcuml.)revealsmultiplemechanismsofgenomeevolution.ThePlantJournal26,307{316(2001) [77] Wise,M.J.:0j.py:asoftwaretoolforlowcomplexityproteinsandproteindomains.Bioinformatics17,S288{95(2001) [78] Wootton,J.:Sequenceswith'unusual'aminoacidcompositions.CurrentOpinioninStructuralBiology4,413{421(1994) [79] Wootton,J.,Federhen,S.:Statisticsoflocalcomplexityinaminoacidsequencesandsequencedatabases.Computers&Chemistry17,149{163(1993) [80] Wootton,J.,Federhen,S.:Analysisofcompositionallybiasedregionsinsequencedatabases.MethodsinEnzymology266,554{571(1996)

PAGE 100

Xuehui Li was born in China. She obtained her BS and MS in computer science before joining the Computer and Information Science and Engineering Department at the University of Florida. She is interested in the applications of computer science to molecular biology, especially algorithm development. Her current research focuses on the identification of repetitive biological sequences and the applications of such sequences in molecular biology. She received her PhD in 2007.